feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie. Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run.

New crawler modules:
- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync.

bin/crawler additions:
- Concurrent chapter content phase via `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers (chromiumoxide allows concurrent `new_page` on `&self`), and the per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; first failure aborts the phase with a cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch: that needs a daemon consumer of crawler_jobs, which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
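
The message says the `content` module sorts images by their `pageN` id because DOM order isn't trusted. The diff shown here does not include that module, so the following is only a minimal sketch of the idea. The helper names `page_number` and `sort_by_page_id` are hypothetical, and it assumes each id is the literal prefix `page` followed by a decimal number.

```rust
/// Extract the numeric suffix from an element id such as `page12`.
/// Returns `None` for ids that do not follow the assumed `page<N>` shape.
fn page_number(id: &str) -> Option<u32> {
    id.strip_prefix("page")?.parse().ok()
}

/// Order `(id, url)` pairs by page number, dropping entries whose id
/// does not parse. Hypothetical helper, not the crate's `content` API.
fn sort_by_page_id(images: Vec<(String, String)>) -> Vec<(u32, String)> {
    let mut numbered: Vec<(u32, String)> = images
        .into_iter()
        .filter_map(|(id, url)| page_number(&id).map(|n| (n, url)))
        .collect();
    // Numeric sort: `page10` must come after `page2`, which a plain
    // string sort on the id would get wrong.
    numbered.sort_by_key(|(n, _)| *n);
    numbered
}
```

Sorting on the parsed number rather than the raw id string is the point of the sketch: a lexicographic sort would place `page10` before `page2`.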
@@ -2,9 +2,10 @@
 //!
 //! Walks the source's manga listing (all pages), fetches each manga's
 //! metadata + chapter list, downloads the cover into `Storage`, and
-//! reconciles everything into the DB. Chapter *content* (page images)
-//! is out of scope for now — only chapter rows + their source links
-//! are written.
+//! reconciles everything into the DB. Then, for any chapter belonging
+//! to a bookmarked manga whose `page_count` is still 0, fetches the
+//! chapter page (logged in), pulls every image from the CDN, and writes
+//! the `pages` rows atomically per chapter.
 //!
 //! Configuration:
 //! - **Start URL** (required): first CLI positional arg, else
@@ -15,13 +16,34 @@
 //! - **Browser**: see `LaunchOptions::from_env` —
 //! `CRAWLER_BROWSER_MODE` (`headed`|`headless`) and
 //! `CRAWLER_BROWSER_ARGS`.
-//! - **Rate limit**: `CRAWLER_RATE_MS` (ms between requests, default
-//! `1000`).
+//! - **Rate limit**: `CRAWLER_RATE_MS` (ms between requests per host,
+//! default `1000`). Per-host: catalog and each CDN have their own
+//! bucket and don't share a budget.
+//! - **CDN rate override** (optional): `CRAWLER_CDN_HOST` plus
+//! `CRAWLER_CDN_RATE_MS` to give a specific host a different
+//! interval. Useful when the image CDN tolerates higher RPS than
+//! the catalog host.
 //! - **Cap**: `CRAWLER_LIMIT` (max manga detail fetches per run,
 //! default `0` = no cap).
 //! - **Skip chapters**: `CRAWLER_SKIP_CHAPTERS=1` — turn off the
 //! chapter selector in the parser AND skip the per-manga
 //! `sync_manga_chapters` write. Use this for "metadata only" runs.
+//! - **Skip chapter content**: `CRAWLER_SKIP_CHAPTER_CONTENT=1` —
+//! skip the page-image phase even if chapters need syncing.
+//! - **Chapter content workers**: `CRAWLER_CHAPTER_WORKERS` (default
+//! `1`). Multiple workers process distinct chapters concurrently;
+//! the per-host rate limiter still gates total RPS to each origin.
+//! - **Force re-fetch**: `CRAWLER_FORCE_REFETCH_CHAPTERS=1` — re-fetch
+//! chapter images even when `page_count > 0`. Rare; use after the
+//! source replaces a chapter's images.
+//! - **PHPSESSID**: `CRAWLER_PHPSESSID` — paste your browser's
+//! session cookie. Required for chapter content (logged-out reader
+//! is paginated per-image and not viable at scale).
+//! - **Cookie domain** (optional): `CRAWLER_COOKIE_DOMAIN` overrides
+//! the auto-derived `.<registrable>.<tld>`. Only needed for
+//! multi-part TLDs (`.co.uk`, etc.).
+//! - **User agent** (optional): `CRAWLER_USER_AGENT` — applies to
+//! reqwest image fetches. Default uses reqwest's built-in UA.
 //! - **Proxy**: `$CRAWLER_PROXY` — single URL applied to both
 //! Chromium (`--proxy-server`) and `reqwest::Proxy::all`. Supports
 //! `http://`, `https://`, and `socks5://` (with optional user:pass).
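
The doc comment above says the cookie domain is auto-derived as `.<registrable>.<tld>` and that `CRAWLER_COOKIE_DOMAIN` is only needed for multi-part TLDs. A minimal sketch of why, assuming the derivation simply keeps the last two host labels (the real `session::registrable_domain` is not shown in this diff and may differ). `naive_cookie_domain` and `cookie_domain` are hypothetical names, and they take a bare host rather than a full URL.

```rust
/// Naive `.<registrable>.<tld>` derivation: keep the last two labels of
/// the host. This is an assumption about the approach, not the crate's
/// actual `session::registrable_domain`.
fn naive_cookie_domain(host: &str) -> Option<String> {
    let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
    if labels.len() < 2 {
        return None; // e.g. `localhost` has no registrable part
    }
    let tail = &labels[labels.len() - 2..];
    Some(format!(".{}.{}", tail[0], tail[1]))
}

/// Prefer a non-empty operator override, else fall back to the naive
/// derivation (mirrors the `.filter(..).or_else(..)` chain in `main`).
fn cookie_domain(override_value: Option<&str>, host: &str) -> Option<String> {
    override_value
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .or_else(|| naive_cookie_domain(host))
}
```

With a two-label heuristic like this, `reader.example.co.uk` collapses to `.co.uk`, which is a public suffix rather than the site's own domain. That is the case where setting the override to `.example.co.uk` would be required.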
@@ -32,16 +54,18 @@ use std::sync::Arc;
 use std::time::Duration;
 
 use anyhow::{anyhow, Context};
+use futures_util::stream::{self, StreamExt};
 use mangalord::crawler::{
     browser::{self, LaunchOptions},
-    rate_limit::RateLimiter,
+    content::{self, SyncOutcome},
+    rate_limit::HostRateLimiters,
+    session,
     source::{target::TargetSource, DiscoverMode, FetchContext, Source},
 };
 use mangalord::repo;
 use mangalord::storage::{LocalStorage, Storage};
 use sqlx::postgres::PgPoolOptions;
 use sqlx::PgPool;
-use tokio::sync::Mutex;
 use tracing_subscriber::EnvFilter;
 use uuid::Uuid;
 
@@ -64,8 +88,25 @@ async fn main() -> anyhow::Result<()> {
         .unwrap_or_else(|_| "./data/storage".to_string())
         .into();
     let rate_ms = env_u64("CRAWLER_RATE_MS", 1000);
+    let cdn_host = std::env::var("CRAWLER_CDN_HOST")
+        .ok()
+        .filter(|s| !s.trim().is_empty());
+    let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms);
     let limit = env_u64("CRAWLER_LIMIT", 0) as usize;
     let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false);
+    let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false);
+    let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize;
+    let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false);
+    let phpsessid = std::env::var("CRAWLER_PHPSESSID")
+        .ok()
+        .filter(|s| !s.trim().is_empty());
+    let cookie_domain = std::env::var("CRAWLER_COOKIE_DOMAIN")
+        .ok()
+        .filter(|s| !s.trim().is_empty())
+        .or_else(|| session::registrable_domain(&start_url));
+    let user_agent = std::env::var("CRAWLER_USER_AGENT")
+        .ok()
+        .filter(|s| !s.trim().is_empty());
     let proxy_url = std::env::var("CRAWLER_PROXY")
         .ok()
         .filter(|s| !s.trim().is_empty());
@@ -79,13 +120,25 @@ async fn main() -> anyhow::Result<()> {
 
     let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(&storage_dir));
 
-    // `no_proxy()` disables reqwest's own env-based detection so the
-    // single `CRAWLER_PROXY` knob is the only thing that influences
-    // routing. Otherwise an unrelated `HTTPS_PROXY` in the shell would
-    // silently route cover downloads while the browser stayed direct.
+    // Build reqwest with: own cookie jar (seeded with PHPSESSID for
+    // the catalog domain only), optional UA override, optional single
+    // proxy. `no_proxy()` disables env-based detection so the
+    // CRAWLER_PROXY knob is the only routing input.
+    let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
+    if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
+        let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
+        let seed_url =
+            reqwest::Url::parse(&start_url).context("parse start URL for cookie seed")?;
+        cookie_jar.add_cookie_str(&cookie_str, &seed_url);
+        tracing::info!(domain, "seeded PHPSESSID into reqwest cookie jar");
+    }
     let mut http_builder = reqwest::Client::builder()
         .timeout(Duration::from_secs(30))
-        .no_proxy();
+        .no_proxy()
+        .cookie_provider(cookie_jar);
+    if let Some(ua) = &user_agent {
+        http_builder = http_builder.user_agent(ua);
+    }
     if let Some(proxy) = &proxy_url {
         http_builder = http_builder
             .proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy URL: {proxy}"))?);
@@ -100,40 +153,86 @@ async fn main() -> anyhow::Result<()> {
         ?options,
         %start_url,
         rate_ms,
+        cdn_host = ?cdn_host,
+        cdn_rate_ms,
         limit,
         skip_chapters,
+        skip_chapter_content,
+        chapter_workers,
+        force_refetch_chapters,
+        phpsessid_set = phpsessid.is_some(),
+        cookie_domain = ?cookie_domain,
+        user_agent = ?user_agent,
         proxy = ?proxy_url,
         storage_dir = %storage_dir.display(),
         "starting crawler"
     );
 
     let handle = browser::launch(options).await.context("launch browser")?;
+
+    // Cookie + session probe must happen *before* any browser
+    // navigation that depends on auth (i.e. chapter content). The
+    // discover/metadata phase doesn't strictly need auth, but
+    // probing now lets us fail fast: a bad cookie costs ~2s here
+    // instead of 30 min into a backfill.
+    let session_ready = if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
+        if let Err(e) = session::inject_phpsessid(handle.browser(), sid, domain).await {
+            handle.close().await.ok();
+            return Err(e);
+        }
+        match session::verify_session(handle.browser(), &start_url).await {
+            Ok(()) => true,
+            Err(e) => {
+                handle.close().await.ok();
+                return Err(e);
+            }
+        }
+    } else {
+        tracing::info!("no PHPSESSID supplied — chapter content phase will be skipped");
+        false
+    };
+
     let result = run(
         handle.browser(),
         &db,
-        storage.as_ref(),
+        Arc::clone(&storage),
         &http,
         &start_url,
         rate_ms,
+        cdn_host.as_deref(),
+        cdn_rate_ms,
         limit,
         skip_chapters,
+        skip_chapter_content || !session_ready,
+        chapter_workers,
+        force_refetch_chapters,
     )
     .await;
     handle.close().await.ok();
     result
 }
 
+#[allow(clippy::too_many_arguments)]
 async fn run(
     browser: &chromiumoxide::Browser,
     db: &PgPool,
-    storage: &dyn Storage,
+    storage: Arc<dyn Storage>,
     http: &reqwest::Client,
     start_url: &str,
     rate_ms: u64,
+    cdn_host: Option<&str>,
+    cdn_rate_ms: u64,
     limit: usize,
     skip_chapters: bool,
+    skip_chapter_content: bool,
+    chapter_workers: usize,
+    force_refetch_chapters: bool,
 ) -> anyhow::Result<()> {
-    let rate = Mutex::new(RateLimiter::new(Duration::from_millis(rate_ms)));
+    let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms));
+    if let Some(host) = cdn_host {
+        rate = rate.with_override(host, Duration::from_millis(cdn_rate_ms));
+    }
+    let rate = Arc::new(rate);
     let source = {
         let s = TargetSource::new(start_url.to_string());
         if skip_chapters {
@@ -144,7 +243,7 @@ async fn run(
     };
     let ctx = FetchContext {
         browser,
-        rate: &rate,
+        rate: rate.as_ref(),
     };
 
     let source_id = source.id();
@@ -208,9 +307,9 @@ async fn run(
         if let Some(cover_url) = manga.cover_url.as_deref() {
             if let Err(e) = download_and_store_cover(
                 db,
-                storage,
+                storage.as_ref(),
                 http,
-                &rate,
+                rate.as_ref(),
                 &r.url,
                 upsert.manga_id,
                 cover_url,
@@ -252,14 +351,149 @@ async fn run(
         tracing::info!(limit, "partial sync — skipping drop pass");
     }
 
+    if !skip_chapter_content {
+        sync_bookmarked_chapter_content(
+            browser,
+            db,
+            Arc::clone(&storage),
+            http,
+            Arc::clone(&rate),
+            source_id,
+            chapter_workers,
+            force_refetch_chapters,
+        )
+        .await?;
+    }
+
     Ok(())
 }
 
+/// Find every chapter whose manga is bookmarked by at least one user
+/// and that hasn't been content-synced yet, then fan them out across
+/// `workers` concurrent tasks. Each task is one full chapter sync; the
+/// per-host rate limiter caps total RPS to the source/CDN regardless
+/// of worker count.
+///
+/// A session-expired result from any task aborts the whole phase —
+/// continuing wastes time and risks the source flagging the pattern.
+#[allow(clippy::too_many_arguments)]
+async fn sync_bookmarked_chapter_content(
+    browser: &chromiumoxide::Browser,
+    db: &PgPool,
+    storage: Arc<dyn Storage>,
+    http: &reqwest::Client,
+    rate: Arc<HostRateLimiters>,
+    source_id: &str,
+    workers: usize,
+    force_refetch: bool,
+) -> anyhow::Result<()> {
+    let pending: Vec<(Uuid, Uuid, String)> = sqlx::query_as(
+        r#"
+        SELECT DISTINCT c.id, c.manga_id, cs.source_url
+        FROM chapters c
+        JOIN bookmarks b ON b.manga_id = c.manga_id
+        JOIN chapter_sources cs ON cs.chapter_id = c.id
+        WHERE cs.source_id = $1
+        AND cs.dropped_at IS NULL
+        AND (c.page_count = 0 OR $2)
+        ORDER BY c.manga_id, c.created_at ASC
+        "#,
+    )
+    .bind(source_id)
+    .bind(force_refetch)
+    .fetch_all(db)
+    .await
+    .context("query pending chapter content")?;
+
+    if pending.is_empty() {
+        tracing::info!("chapter content: nothing pending");
+        return Ok(());
+    }
+    tracing::info!(count = pending.len(), workers, "chapter content phase starting");
+
+    // `for_each_concurrent` polls up to `workers` futures at once on
+    // the *current* task, so each future borrows the browser, db, and
+    // http client from the outer scope rather than requiring 'static
+    // captures via spawn. chromiumoxide's `Browser::new_page(&self)`
+    // is safe for concurrent calls; the per-host rate limiter
+    // serializes the actual on-wire requests against each origin.
+    let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
+    let stats = std::sync::Mutex::new(WorkerStats::default());
+
+    stream::iter(pending.into_iter())
+        .for_each_concurrent(workers.max(1), |(chapter_id, manga_id, source_url)| {
+            let session_expired = Arc::clone(&session_expired);
+            let storage = Arc::clone(&storage);
+            let rate = Arc::clone(&rate);
+            let stats = &stats;
+            async move {
+                if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
+                    return;
+                }
+                let outcome = content::sync_chapter_content(
+                    browser,
+                    db,
+                    storage.as_ref(),
+                    http,
+                    rate.as_ref(),
+                    chapter_id,
+                    manga_id,
+                    &source_url,
+                    force_refetch,
+                )
+                .await;
+                let mut s = stats.lock().unwrap();
+                match outcome {
+                    Ok(SyncOutcome::Fetched { pages }) => {
+                        tracing::info!(%chapter_id, pages, "chapter content fetched");
+                        s.fetched += 1;
+                    }
+                    Ok(SyncOutcome::Skipped) => s.skipped += 1,
+                    Ok(SyncOutcome::SessionExpired) => {
+                        tracing::error!(
+                            %chapter_id,
+                            "session expired mid-run — refresh CRAWLER_PHPSESSID and re-run"
+                        );
+                        session_expired
+                            .store(true, std::sync::atomic::Ordering::Relaxed);
+                    }
+                    Err(e) => {
+                        tracing::warn!(
+                            %chapter_id, error = ?e, "chapter content sync failed"
+                        );
+                        s.failed += 1;
+                    }
+                }
+            }
+        })
+        .await;
+
+    let total = stats.into_inner().unwrap();
+    tracing::info!(
+        fetched = total.fetched,
+        skipped = total.skipped,
+        failed = total.failed,
+        "chapter content phase done"
+    );
+
+    if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
+        anyhow::bail!("session expired during chapter content phase");
+    }
+    Ok(())
+}
+
+#[derive(Default, Clone, Copy)]
+struct WorkerStats {
+    fetched: usize,
+    skipped: usize,
+    failed: usize,
+}
+
 async fn download_and_store_cover(
     db: &PgPool,
     storage: &dyn Storage,
     http: &reqwest::Client,
-    rate: &Mutex<RateLimiter>,
+    rate: &HostRateLimiters,
     manga_url: &str,
     manga_id: Uuid,
     cover_url: &str,
@@ -269,9 +503,13 @@ async fn download_and_store_cover(
         .join(cover_url)
         .context("join cover URL onto manga URL")?;
 
-    rate.lock().await.wait().await;
+    rate.wait_for(absolute.as_str()).await?;
     let resp = http
         .get(absolute.clone())
+        // Source CDNs commonly check Referer. Set it to the manga
+        // detail page that linked the cover — same UX as a real
+        // browser fetching the image.
+        .header(reqwest::header::REFERER, manga_url)
         .send()
         .await
         .with_context(|| format!("GET {absolute}"))?
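
The last hunk swaps the single shared limiter for `rate.wait_for(url)`, which paces each host independently. The `rate_limit` module itself is not part of the diff shown here, so the following is only a rough sketch of per-host pacing with an optional override. `HostPacer` and `reserve` are hypothetical names. Unlike the real `wait_for`, this version is synchronous, takes a bare host instead of a URL, and returns the delay instead of sleeping, which keeps it easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-host pacer: one "next free slot" per host, with a
/// default interval and optional per-host overrides.
struct HostPacer {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    next_free: HashMap<String, Instant>,
}

impl HostPacer {
    fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            next_free: HashMap::new(),
        }
    }

    /// Builder-style override, mirroring the shape of `with_override`
    /// used in the diff (the internals here are assumed).
    fn with_override(mut self, host: &str, interval: Duration) -> Self {
        self.overrides.insert(host.to_ascii_lowercase(), interval);
        self
    }

    /// Reserve the next slot for `host` and return how long the caller
    /// should sleep before sending. Hosts never affect each other.
    fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        let host = host.to_ascii_lowercase();
        let interval = self
            .overrides
            .get(&host)
            .copied()
            .unwrap_or(self.default_interval);
        // The slot is either the previously reserved time or `now`,
        // whichever is later.
        let slot = match self.next_free.get(&host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_free.insert(host, slot + interval);
        slot.duration_since(now)
    }
}
```

Because each host owns its own slot, a burst of image fetches against the CDN does not delay the next catalog request, which matches the "no shared budget" behaviour the commit message describes. With several workers, a structure like this would sit behind a lock so that slot reservations stay serialized.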