feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie.

Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run.

New crawler modules:
- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync.

bin/crawler additions:
- Concurrent chapter content phase via `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers (chromiumoxide allows concurrent `new_page` on `&self`), and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; first failure aborts the phase with a cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch; that needs a daemon consumer of crawler_jobs, which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
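
The `rate_limit` module itself is not part of the diff shown here, so the following is only a minimal sketch of the per-host pacing idea described above, using the standard library alone. The type `HostPacer` and its methods `with_override` and `reserve` are hypothetical names, not the crate's API; the real `HostRateLimiters` is async and is called as `rate.wait_for(url).await`. The sketch keeps one "next allowed send time" per host, so the catalog and the CDN never consume each other's budget.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a per-host limiter: one pacing slot per
/// host, with optional per-host interval overrides.
pub struct HostPacer {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    next_allowed: HashMap<String, Instant>,
}

impl HostPacer {
    pub fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            next_allowed: HashMap::new(),
        }
    }

    /// Use a different interval for one host (for example a CDN).
    pub fn with_override(mut self, host: &str, interval: Duration) -> Self {
        self.overrides.insert(host.to_ascii_lowercase(), interval);
        self
    }

    /// Reserve the next send slot for `host` and return how long the
    /// caller should sleep before sending. Hosts are independent.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        let host = host.to_ascii_lowercase();
        let interval = self
            .overrides
            .get(&host)
            .copied()
            .unwrap_or(self.default_interval);
        // The slot is either "now" or the previously reserved time,
        // whichever is later.
        let slot = match self.next_allowed.get(&host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(host, slot + interval);
        slot.duration_since(now)
    }
}
```

Because every worker reserves its slot through the same structure, the total request rate per host stays bounded by the interval no matter how many chapter workers run concurrently. In an async crawler the returned duration would be passed to a sleep call, and the structure would sit behind a mutex or similar.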
244
backend/src/crawler/content.rs
Normal file
@@ -0,0 +1,244 @@
//! Chapter content sync: fetch a logged-in chapter page, extract its
//! image URLs in `pageN` order, download each to storage, and atomically
//! persist a `pages` row per image plus the chapter's `page_count`.
//!
//! Only chapters belonging to a manga someone has bookmarked are
//! candidates. The crawler queries bookmarked chapters with
//! `page_count = 0` at the start of the chapter-content phase.
//! Enqueueing at bookmark time through an API hook is deferred to a
//! follow-up branch.

use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;

use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::session;
use crate::storage::Storage;

/// Parse the chapter page DOM and return the page images in `pageN`
/// order. Filters out the loader `<img class="loading">` and any
/// `<img>` without a numeric `id="pageN"`.
pub fn parse_chapter_pages(html: &str) -> Vec<ChapterImage> {
    let doc = scraper::Html::parse_document(html);
    let sel = scraper::Selector::parse("a#pic_container img:not(.loading)").unwrap();
    let mut pages: Vec<ChapterImage> = doc
        .select(&sel)
        .filter_map(|img| {
            let id = img.value().id()?;
            let n: i32 = id.strip_prefix("page")?.parse().ok()?;
            let src = img.value().attr("src")?.trim().to_string();
            if src.is_empty() {
                return None;
            }
            Some(ChapterImage { page_number: n, url: src })
        })
        .collect();
    pages.sort_by_key(|p| p.page_number);
    pages
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterImage {
    pub page_number: i32,
    pub url: String,
}

/// Outcome of a single chapter sync, surfaced to callers for logging
/// and exit-code decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
    /// All images downloaded and stored, chapter row updated.
    Fetched { pages: usize },
    /// `page_count > 0` already: no-op unless force_refetch is set.
    Skipped,
    /// Session probe failed mid-sync (avatar selector missing on the
    /// chapter page). Caller should abort the whole crawler run.
    SessionExpired,
}

/// Fetch all images for one chapter and persist them atomically. On
/// any error after the first storage put, the DB transaction rolls
/// back so the chapter stays at `page_count = 0` and is retried on the
/// next run. Bytes already written to storage become orphans; a future
/// reaper sweeps them.
#[allow(clippy::too_many_arguments)]
pub async fn sync_chapter_content(
    browser: &chromiumoxide::Browser,
    db: &PgPool,
    storage: &dyn Storage,
    http: &reqwest::Client,
    rate: &HostRateLimiters,
    chapter_id: Uuid,
    manga_id: Uuid,
    source_url: &str,
    force_refetch: bool,
) -> anyhow::Result<SyncOutcome> {
    // Skip if already fetched, unless caller explicitly forces.
    if !force_refetch {
        let (page_count,): (i32,) =
            sqlx::query_as("SELECT page_count FROM chapters WHERE id = $1")
                .bind(chapter_id)
                .fetch_one(db)
                .await
                .context("read chapter page_count")?;
        if page_count > 0 {
            return Ok(SyncOutcome::Skipped);
        }
    }

    // Nav to chapter page (rate-limited per host).
    rate.wait_for(source_url).await?;
    let page = browser
        .new_page(source_url)
        .await
        .with_context(|| format!("open chapter page {source_url}"))?;
    page.wait_for_navigation().await.context("wait for chapter nav")?;

    // Session probe: avatar present == still logged in. Missing means
    // PHPSESSID expired; bail the entire crawler run.
    if page.find_element("#avatar_menu").await.is_err() {
        page.close().await.ok();
        return Ok(SyncOutcome::SessionExpired);
    }

    let html = page.content().await.context("read chapter html")?;
    page.close().await.ok();

    let images = parse_chapter_pages(&html);
    if images.is_empty() {
        anyhow::bail!("no page images parsed from {source_url}");
    }

    // Resolve image URLs against the chapter URL (they may be relative).
    let base = reqwest::Url::parse(source_url).context("parse chapter URL")?;

    // Fetch every image bytes-first into memory before writing
    // anything. Lets us bail the whole chapter cleanly if any image
    // fails: DB stays at page_count=0, no partial rows persisted.
    // Each entry is (page_number, bytes, file extension, MIME type).
    let mut fetched: Vec<(i32, Vec<u8>, &'static str, &'static str)> =
        Vec::with_capacity(images.len());
    for img in &images {
        let url = base.join(&img.url).with_context(|| {
            format!("join image URL {} onto {source_url}", img.url)
        })?;
        rate.wait_for(url.as_str()).await?;
        let resp = http
            .get(url.clone())
            // Source CDNs commonly check Referer. Set it to the
            // chapter page, matching what the browser would send.
            .header(reqwest::header::REFERER, source_url)
            .send()
            .await
            .with_context(|| format!("GET {url}"))?
            .error_for_status()
            .with_context(|| format!("non-2xx for {url}"))?;
        let bytes = resp.bytes().await.context("read image body")?.to_vec();
        // Take the MIME type from `infer` rather than building it from the
        // extension: "jpg" would otherwise yield the invalid "image/jpg",
        // and the "bin" fallback would yield "image/bin".
        let (ext, content_type) = infer::get(&bytes)
            .map(|k| (k.extension(), k.mime_type()))
            .unwrap_or(("bin", "application/octet-stream"));
        fetched.push((img.page_number, bytes, ext, content_type));
    }

    // DB writes (page row inserts + page_count update) go in one
    // transaction. Storage puts are not transactional: if anything
    // fails, the transaction rolls back, the chapter is retried next
    // run, and bytes already written become orphans for a reaper to
    // sweep later.
    let mut tx = db.begin().await.context("open chapter sync tx")?;
    for (page_number, bytes, ext, content_type) in &fetched {
        let key = format!(
            "mangas/{manga_id}/chapters/{chapter_id}/pages/{:04}.{ext}",
            page_number
        );
        storage
            .put(&key, bytes)
            .await
            .with_context(|| format!("put {key}"))?;
        // (chapter_id, page_number) is unique, so re-runs are idempotent.
        sqlx::query(
            "INSERT INTO pages (chapter_id, page_number, storage_key, content_type)
             VALUES ($1, $2, $3, $4)
             ON CONFLICT (chapter_id, page_number) DO UPDATE
             SET storage_key = EXCLUDED.storage_key,
                 content_type = EXCLUDED.content_type",
        )
        .bind(chapter_id)
        .bind(page_number)
        .bind(&key)
        .bind(*content_type)
        .execute(&mut *tx)
        .await
        .with_context(|| format!("insert page row {page_number}"))?;
    }
    sqlx::query("UPDATE chapters SET page_count = $1 WHERE id = $2")
        .bind(fetched.len() as i32)
        .bind(chapter_id)
        .execute(&mut *tx)
        .await
        .context("update page_count")?;
    tx.commit().await.context("commit chapter sync")?;

    Ok(SyncOutcome::Fetched { pages: fetched.len() })
}

// Suppress unused-import warning for `session` until the bin/crawler
// wiring lands in this branch and uses it through this module.
#[allow(dead_code)]
fn _keep_session_in_scope() {
    let _ = session::registrable_domain;
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_chapter_pages_skips_loader_and_sorts_by_id() {
        // Loader image, two real pages out of order, and one with no id.
        let html = r#"
            <html><body id="body"><a id="pic_container">
                <img class="loading" src="/images/ajax-loader2.gif">
                <img id="page2" class="page2" src="https://cdn/2.jpg">
                <img id="page1" class="page1" src="https://cdn/1.jpg">
                <img src="https://cdn/orphan.jpg">
                <img id="not-a-page" src="https://cdn/not-a-page.jpg">
            </a></body></html>
        "#;
        let pages = parse_chapter_pages(html);
        assert_eq!(pages.len(), 2);
        assert_eq!(pages[0].page_number, 1);
        assert_eq!(pages[0].url, "https://cdn/1.jpg");
        assert_eq!(pages[1].page_number, 2);
        assert_eq!(pages[1].url, "https://cdn/2.jpg");
    }

    #[test]
    fn parse_chapter_pages_drops_images_without_src() {
        let html = r#"
            <a id="pic_container">
                <img id="page1" src="">
                <img id="page2" src="https://cdn/2.jpg">
            </a>
        "#;
        let pages = parse_chapter_pages(html);
        assert_eq!(pages.len(), 1);
        assert_eq!(pages[0].page_number, 2);
    }

    #[test]
    fn parse_chapter_pages_handles_three_digit_page_ids() {
        let html = r#"
            <a id="pic_container">
                <img id="page126" src="https://cdn/126.jpg">
                <img id="page9" src="https://cdn/9.jpg">
                <img id="page50" src="https://cdn/50.jpg">
            </a>
        "#;
        let pages = parse_chapter_pages(html);
        assert_eq!(
            pages.iter().map(|p| p.page_number).collect::<Vec<_>>(),
            vec![9, 50, 126]
        );
    }
}
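
The file above references `session::registrable_domain`, but the `session` module is not included in this diff. As a rough illustration of the `.<registrable>.<tld>` derivation the commit message describes, here is a minimal sketch that assumes a naive "keep the last two labels" rule. The function name `cookie_domain`, its signature, and its `override_domain` parameter (standing in for `CRAWLER_COOKIE_DOMAIN`) are hypothetical and may differ from the real helper.

```rust
/// Hypothetical sketch: derive a cookie domain such as `.example.com`
/// from a host name. An explicit override wins when present, since the
/// last-two-labels rule is wrong for multi-part TLDs.
pub fn cookie_domain(host: &str, override_domain: Option<&str>) -> Option<String> {
    if let Some(d) = override_domain {
        let d = d.trim();
        if !d.is_empty() {
            // Normalise to a leading dot so subdomains match as well.
            return Some(if d.starts_with('.') {
                d.to_string()
            } else {
                format!(".{d}")
            });
        }
    }

    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();
    // Single-label hosts (for example `localhost`) have no registrable part.
    // Note: IP literals are not handled by this sketch.
    if labels.len() < 2 || labels.iter().any(|l| l.is_empty()) {
        return None;
    }
    Some(format!(".{}", labels[labels.len() - 2..].join(".")))
}
```

With this rule a host like `reader.example.co.uk` collapses to `.co.uk`, which is too broad to be a usable cookie domain. That is the case the override exists for; a production implementation would more likely consult the Public Suffix List.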