feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)

After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is all-or-nothing in the database: the page
row inserts and the page_count update commit in one transaction, or
roll back together on any error so the chapter stays at page_count=0
and is retried next run. Bytes already put to storage before a
rollback are left as orphans for a future reaper.

New crawler modules:
- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, and probes `#avatar_menu`
  at startup to fail fast on a bad or expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the atomic
  chapter sync.

bin/crawler additions:
- Concurrent chapter content phase via
  `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`,
  default 1). Browser is borrowed across workers (chromiumoxide allows
  concurrent `new_page` on `&self`), and the per-host rate limit gates
  total RPS regardless of worker count.
- reqwest gets the `cookies` feature and a `Jar` seeded with PHPSESSID
  for the catalog domain only (the CDN is intentionally not given the
  cookie); `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; the first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred to
a follow-up branch; that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:28:36 +02:00
parent 51346227dd
commit d24e68c78d
10 changed files with 846 additions and 35 deletions
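
Note on per-host pacing: the `rate_limit` module is not part of the diff shown here, so the following is only a minimal sketch of the idea the commit message describes (one budget per URL host, with optional per-host overrides). The names `HostPacer`, `set_override`, `reserve`, and `host_of` are hypothetical and are not the crate's API. The sketch uses only the standard library and returns the delay instead of sleeping, so it stays synchronous and deterministic; the real `wait_for` is async.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `HostRateLimiters`: one pacing slot per host.
pub struct HostPacer {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    /// Earliest instant at which the next request to each host may start.
    next_free: HashMap<String, Instant>,
}

impl HostPacer {
    pub fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            next_free: HashMap::new(),
        }
    }

    /// Give one host its own interval (e.g. a faster budget for a CDN).
    pub fn set_override(&mut self, host: &str, interval: Duration) {
        self.overrides.insert(host.to_ascii_lowercase(), interval);
    }

    /// Reserve the next slot for the host of `url` and return how long the
    /// caller should sleep before sending. `None` means no host was found.
    pub fn reserve(&mut self, url: &str, now: Instant) -> Option<Duration> {
        let host = host_of(url)?;
        let interval = self
            .overrides
            .get(&host)
            .copied()
            .unwrap_or(self.default_interval);
        // The slot is either "now" or the previously reserved instant,
        // whichever is later.
        let slot = match self.next_free.get(&host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_free.insert(host, slot + interval);
        Some(slot.duration_since(now))
    }
}

/// Naive host extraction for the sketch (no IPv6 literals); a real
/// implementation would use a proper URL parser.
fn host_of(url: &str) -> Option<String> {
    let (_scheme, rest) = url.split_once("://")?;
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host.to_ascii_lowercase())
    }
}
```

Because every worker reserves a slot from the same per-host map, adding workers raises concurrency without raising the request rate against any single host, which matches the commit message's claim that the rate limit gates total RPS regardless of worker count.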
--- a/backend/src/crawler/content.rs
+++ b/backend/src/crawler/content.rs
@@ -0,0 +1,244 @@
+//! Chapter content sync — fetch a logged-in chapter page, extract its
+//! image URLs in `pageN` order, download each to storage, and atomically
+//! persist a `pages` row per image plus the chapter's `page_count`.
+//!
+//! Only chapters belonging to a manga someone has bookmarked are
+//! candidates. Enqueueing is sync-on-crawl-tick only: bookmarked
+//! chapters with `page_count = 0` are queried at the start of the
+//! chapter-content phase. Enqueueing at bookmark-time via an API hook
+//! is deferred to a follow-up branch.
+
+// `parse_chapter_pages` is pure so it can be unit-tested without a
+// browser; `sync_chapter_content` is the entry point for the
+// chapter-content phase and needs live browser, DB, and storage.
+
+use anyhow::Context;
+use sqlx::PgPool;
+use uuid::Uuid;
+
+use crate::crawler::rate_limit::HostRateLimiters;
+use crate::crawler::session;
+use crate::storage::Storage;
+
+/// Parse the chapter page DOM and return the page images in `pageN`
+/// order. Filters out the loader `<img class="loading">` and any
+/// `<img>` without a numeric `id="pageN"`.
+pub fn parse_chapter_pages(html: &str) -> Vec<ChapterImage> {
+    let doc = scraper::Html::parse_document(html);
+    let sel = scraper::Selector::parse("a#pic_container img:not(.loading)").unwrap();
+    let mut pages: Vec<ChapterImage> = doc
+        .select(&sel)
+        .filter_map(|img| {
+            let id = img.value().id()?;
+            let n: i32 = id.strip_prefix("page")?.parse().ok()?;
+            let src = img.value().attr("src")?.trim().to_string();
+            if src.is_empty() {
+                return None;
+            }
+            Some(ChapterImage { page_number: n, url: src })
+        })
+        .collect();
+    pages.sort_by_key(|p| p.page_number);
+    pages
+}
+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct ChapterImage {
+    pub page_number: i32,
+    pub url: String,
+}
+
+/// Outcome of a single chapter sync — surfaced to callers for logging
+/// and exit-code decisions.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum SyncOutcome {
+    /// All images downloaded and stored, chapter row updated.
+    Fetched { pages: usize },
+    /// `page_count > 0` already — no-op unless force_refetch is set.
+    Skipped,
+    /// Session probe failed mid-sync (avatar selector missing on the
+    /// chapter page). Caller should abort the whole crawler run.
+    SessionExpired,
+}
+
+/// Fetch all images for one chapter and persist them atomically. On
+/// any error after the first storage put, the DB transaction rolls
+/// back so the chapter stays at `page_count = 0` and is retried on the
+/// next run. Bytes already written to storage become orphans; a future
+/// reaper sweeps them.
+#[allow(clippy::too_many_arguments)]
+pub async fn sync_chapter_content(
+    browser: &chromiumoxide::Browser,
+    db: &PgPool,
+    storage: &dyn Storage,
+    http: &reqwest::Client,
+    rate: &HostRateLimiters,
+    chapter_id: Uuid,
+    manga_id: Uuid,
+    source_url: &str,
+    force_refetch: bool,
+) -> anyhow::Result<SyncOutcome> {
+    // Skip if already fetched, unless caller explicitly forces.
+    if !force_refetch {
+        let (page_count,): (i32,) =
+            sqlx::query_as("SELECT page_count FROM chapters WHERE id = $1")
+                .bind(chapter_id)
+                .fetch_one(db)
+                .await
+                .context("read chapter page_count")?;
+        if page_count > 0 {
+            return Ok(SyncOutcome::Skipped);
+        }
+    }
+
+    // Nav to chapter page (rate-limited per host).
+    rate.wait_for(source_url).await?;
+    let page = browser
+        .new_page(source_url)
+        .await
+        .with_context(|| format!("open chapter page {source_url}"))?;
+    page.wait_for_navigation().await.context("wait for chapter nav")?;
+
+    // Session probe: avatar present == still logged in. Missing means
+    // PHPSESSID expired; bail the entire crawler run.
+    if page.find_element("#avatar_menu").await.is_err() {
+        page.close().await.ok();
+        return Ok(SyncOutcome::SessionExpired);
+    }
+
+    let html = page.content().await.context("read chapter html")?;
+    page.close().await.ok();
+
+    let images = parse_chapter_pages(&html);
+    if images.is_empty() {
+        anyhow::bail!("no page images parsed from {source_url}");
+    }
+
+    // Resolve image URLs against the chapter URL (they may be relative).
+    let base = reqwest::Url::parse(source_url).context("parse chapter URL")?;
+
+    // Fetch every image bytes-first into memory before writing
+    // anything. Lets us bail the whole chapter cleanly if any image
+    // fails — DB stays at page_count=0, no partial rows persisted.
+    let mut fetched: Vec<(i32, Vec<u8>, &'static str)> = Vec::with_capacity(images.len());
+    for img in &images {
+        let url = base.join(&img.url).with_context(|| {
+            format!("join image URL {} onto {source_url}", img.url)
+        })?;
+        rate.wait_for(url.as_str()).await?;
+        let resp = http
+            .get(url.clone())
+            // Source CDNs commonly check Referer. Set it to the
+            // chapter page — matches what the browser would send.
+            .header(reqwest::header::REFERER, source_url)
+            .send()
+            .await
+            .with_context(|| format!("GET {url}"))?
+            .error_for_status()
+            .with_context(|| format!("non-2xx for {url}"))?;
+        let bytes = resp.bytes().await.context("read image body")?.to_vec();
+        let ext = infer::get(&bytes).map(|k| k.extension()).unwrap_or("bin");
+        fetched.push((img.page_number, bytes, ext));
+    }
+
+    // Atomic write: storage puts + page row inserts + page_count
+    // update, all in one transaction. If anything fails, rollback +
+    // the chapter is retried next run. Storage orphans the bytes; a
+    // reaper sweeps them later.
+    let mut tx = db.begin().await.context("open chapter sync tx")?;
+    for (page_number, bytes, ext) in &fetched {
+        let key = format!(
+            "mangas/{manga_id}/chapters/{chapter_id}/pages/{:04}.{ext}",
+            page_number
+        );
+        storage
+            .put(&key, bytes)
+            .await
+            .with_context(|| format!("put {key}"))?;
+        // (chapter_id, page_number) is unique — re-runs idempotent.
+        sqlx::query(
+            "INSERT INTO pages (chapter_id, page_number, storage_key, content_type)
+             VALUES ($1, $2, $3, $4)
+             ON CONFLICT (chapter_id, page_number) DO UPDATE
+             SET storage_key = EXCLUDED.storage_key,
+                 content_type = EXCLUDED.content_type",
+        )
+        .bind(chapter_id)
+        .bind(page_number)
+        .bind(&key)
+        .bind(infer::get(bytes).map(|k| k.mime_type()).unwrap_or("application/octet-stream"))
+        .execute(&mut *tx)
+        .await
+        .with_context(|| format!("insert page row {page_number}"))?;
+    }
+    sqlx::query("UPDATE chapters SET page_count = $1 WHERE id = $2")
+        .bind(fetched.len() as i32)
+        .bind(chapter_id)
+        .execute(&mut *tx)
+        .await
+        .context("update page_count")?;
+    tx.commit().await.context("commit chapter sync")?;
+
+    Ok(SyncOutcome::Fetched { pages: fetched.len() })
+}
+
+// Suppress unused-import warning for `session` until the bin/crawler
+// wiring lands in this branch and uses it through this module.
+#[allow(dead_code)]
+fn _keep_session_in_scope() {
+    let _ = session::registrable_domain;
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn parse_chapter_pages_skips_loader_and_sorts_by_id() {
+        // Loader image, two real pages out of order, and one with no id.
+        let html = r#"
+          <html><body id="body"><a id="pic_container">
+            <img class="loading" src="/images/ajax-loader2.gif">
+            <img id="page2" class="page2" src="https://cdn/2.jpg">
+            <img id="page1" class="page1" src="https://cdn/1.jpg">
+            <img src="https://cdn/orphan.jpg">
+            <img id="not-a-page" src="https://cdn/not-a-page.jpg">
+          </a></body></html>
+        "#;
+        let pages = parse_chapter_pages(html);
+        assert_eq!(pages.len(), 2);
+        assert_eq!(pages[0].page_number, 1);
+        assert_eq!(pages[0].url, "https://cdn/1.jpg");
+        assert_eq!(pages[1].page_number, 2);
+        assert_eq!(pages[1].url, "https://cdn/2.jpg");
+    }
+
+    #[test]
+    fn parse_chapter_pages_drops_images_without_src() {
+        let html = r#"
+          <a id="pic_container">
+            <img id="page1" src="">
+            <img id="page2" src="https://cdn/2.jpg">
+          </a>
+        "#;
+        let pages = parse_chapter_pages(html);
+        assert_eq!(pages.len(), 1);
+        assert_eq!(pages[0].page_number, 2);
+    }
+
+    #[test]
+    fn parse_chapter_pages_handles_three_digit_page_ids() {
+        let html = r#"
+          <a id="pic_container">
+            <img id="page126" src="https://cdn/126.jpg">
+            <img id="page9" src="https://cdn/9.jpg">
+            <img id="page50" src="https://cdn/50.jpg">
+          </a>
+        "#;
+        let pages = parse_chapter_pages(html);
+        assert_eq!(
+            pages.iter().map(|p| p.page_number).collect::<Vec<_>>(),
+            vec![9, 50, 126]
+        );
+    }
+}
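
Note on the cookie domain: `content.rs` references `session::registrable_domain`, but the `session` module itself is not shown in this diff. The commit message says it derives `.<registrable>.<tld>` from the start URL and needs a `CRAWLER_COOKIE_DOMAIN` override for multi-part TLDs. The sketch below illustrates why, assuming the derivation simply keeps the last two host labels. The function name `cookie_domain` and its signature are hypothetical, and passing the override as an argument (rather than reading the environment) is a simplification for the example.

```rust
/// Hypothetical helper: pick the cookie domain for a catalog host.
/// An explicit override wins; otherwise keep the last two labels.
pub fn cookie_domain(host: &str, override_domain: Option<&str>) -> Option<String> {
    if let Some(d) = override_domain {
        let d = d.trim();
        if !d.is_empty() {
            return Some(d.to_string());
        }
    }

    let host = host.trim().trim_end_matches('.').to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();

    // Single-label hosts (e.g. `localhost`) and malformed hosts have no
    // registrable domain to derive.
    if labels.len() < 2 || labels.iter().any(|l| l.is_empty()) {
        return None;
    }
    // Dotted-decimal IPv4 addresses are not domain names.
    if labels.iter().all(|l| l.chars().all(|c| c.is_ascii_digit())) {
        return None;
    }

    let n = labels.len();
    Some(format!(".{}.{}", labels[n - 2], labels[n - 1]))
}
```

With this two-label rule, `www.example.com` yields `.example.com`, but `www.example.co.uk` yields `.co.uk`, which is a public suffix rather than a site. That is the case where an explicit override such as `.example.co.uk` is needed; a more robust derivation would consult the Public Suffix List.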