feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)

After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-05-23 00:28:36 +02:00
parent 51346227dd
commit d24e68c78d
10 changed files with 846 additions and 35 deletions

View File

@@ -1,11 +1,22 @@
//! Per-host request pacing.
//!
//! Single-token bucket: each `wait().await` either returns immediately
//! (if at least `interval` has elapsed since the last call) or sleeps
//! just enough to satisfy it. Uses `tokio::time::Instant` so tests can
//! run under `start_paused` virtual time without sleeping for real.
//! `RateLimiter` is a single-token bucket: each `wait().await` returns
//! immediately when at least `interval` has elapsed since the last call,
//! otherwise sleeps just enough to satisfy it. Uses
//! `tokio::time::Instant` so tests can run under `start_paused` virtual
//! time without sleeping for real.
//!
//! `HostRateLimiters` is the multi-host wrapper actually used by the
//! crawler — concurrent workers issuing requests to different origins
//! (catalog vs. CDN) don't contend on a shared budget; each host gets
//! its own bucket. `wait_for(url)` extracts the host, lazily creates a
//! limiter for it, and serializes only against other callers hitting
//! the same host.
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;
#[derive(Debug)]
@@ -33,6 +44,70 @@ impl RateLimiter {
}
}
/// Per-host rate limiter map. The outer `Mutex<HashMap>` is held only
/// during the entry-or-insert + Arc clone; the per-host `Mutex<RateLimiter>`
/// is held during the actual `wait().await`. So N workers calling
/// `wait_for(url)` on N different hosts contend nowhere except the brief
/// HashMap lookup; workers hitting the same host serialize on that
/// host's bucket.
#[derive(Debug)]
pub struct HostRateLimiters {
default_interval: Duration,
overrides: HashMap<String, Duration>,
map: Mutex<HashMap<String, Arc<Mutex<RateLimiter>>>>,
}
impl HostRateLimiters {
pub fn new(default_interval: Duration) -> Self {
Self {
default_interval,
overrides: HashMap::new(),
map: Mutex::new(HashMap::new()),
}
}
/// Set a per-host interval that overrides `default_interval`. Calls
/// after a host's limiter has been instantiated do *not* re-create
/// it — set all overrides before the first `wait_for` to that host.
pub fn with_override(mut self, host: impl Into<String>, interval: Duration) -> Self {
self.overrides.insert(host.into(), interval);
self
}
/// Block until the per-host budget allows the next request to
/// `url`'s host. Returns an error only when the URL has no host
/// (malformed input).
pub async fn wait_for(&self, url: &str) -> anyhow::Result<()> {
let host = host_of(url)
.ok_or_else(|| anyhow::anyhow!("no host in url: {url}"))?;
let limiter = {
let mut map = self.map.lock().await;
map.entry(host.clone())
.or_insert_with(|| {
let interval = self
.overrides
.get(&host)
.copied()
.unwrap_or(self.default_interval);
Arc::new(Mutex::new(RateLimiter::new(interval)))
})
.clone()
};
limiter.lock().await.wait().await;
Ok(())
}
}
/// Extract the host (no port) from a URL string. Returns `None` for
/// inputs without a `scheme://host` shape — those would never have
/// reached the network layer anyway.
fn host_of(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host = host_with_port.rsplit_once(':').map_or(host_with_port, |(h, _)| h);
(!host.is_empty()).then(|| host.to_ascii_lowercase())
}
#[cfg(test)]
mod tests {
use super::*;
@@ -66,4 +141,44 @@ mod tests {
// Already 250ms past — no further wait needed.
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[test]
fn host_of_parses_scheme_path_and_port() {
assert_eq!(host_of("https://Example.com/path").as_deref(), Some("example.com"));
assert_eq!(host_of("http://cdn.foo.bar/img.jpg").as_deref(), Some("cdn.foo.bar"));
assert_eq!(host_of("http://localhost:8080/x").as_deref(), Some("localhost"));
assert!(host_of("not a url").is_none());
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_pace_per_host() {
// Two hosts at 100ms each. Two consecutive calls to the SAME
// host wait 100ms total. Two consecutive calls to DIFFERENT
// hosts both fire immediately.
let rl = HostRateLimiters::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
rl.wait_for("https://b.example/y").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::ZERO, "different hosts don't contend");
let t1 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
assert_eq!(
Instant::now() - t1,
Duration::from_millis(100),
"second call to same host waits a full interval"
);
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_honor_overrides() {
let rl = HostRateLimiters::new(Duration::from_millis(1000))
.with_override("fast.example", Duration::from_millis(100));
rl.wait_for("https://fast.example/a").await.unwrap();
let t0 = Instant::now();
rl.wait_for("https://fast.example/b").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
}