After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.
New crawler modules:
- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
with optional per-host overrides. Replaces the single shared
`Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
(override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
id-based sorting (DOM order isn't trusted), then performs the
atomic chapter sync.
bin/crawler additions:
- Concurrent chapter content phase via `futures_util::for_each_concurrent`
(`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
workers — chromiumoxide allows concurrent `new_page` on `&self` —
and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
for the catalog domain only (CDN intentionally not given the
cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
`CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
`CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
`CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
every chapter page nav; first failure aborts the phase with a
cookie-refresh message.
Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>