Operators whose sources shard images across numbered CDN subdomains
can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST. The new
flag short-circuits the host check in DownloadAllowlist::contains
while leaving scheme, localhost, and private-IP defenses in
is_safe_url untouched — scraped URLs pointing at 10.x /
169.254.169.254 / file:// stay refused. Default is false; fail-closed
posture is preserved unless the operator opts in. Wired into both the
server (config::build_download_allowlist) and the bin/crawler.rs
one-shot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.
This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.
Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
(mark_seed_completed / seed_completed_at). The recovery flag
subsumes the seed-completion signal; drop detection was explicitly
opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
CrawlerModePref config type.
`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.
Net -501 lines, 261 backend tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes bundled into one release:
- preserve user-attached tags across crawler upserts
(repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned
attachments from deleted users are reaped as crawler-owned)
- gate manga PATCH and cover endpoints on uploaded_by (require_can_edit
in api::mangas; non-NULL uploaded_by must match the caller)
- equalise login response time across user-existence branches
(run argon2 against a OnceLock-cached dummy hash on the no-user
branch so timing doesn't leak username existence)
- crawler download defences (SSRF allowlist of host literals
including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap,
reject non-whitelisted image types, three-way chapter-probe
classifier replaces the binary #avatar_menu check)
- tighten validation and clean up dead unload path
(attach_tag + create_token enforce 64-char caps; LocalStorage
rejects NUL bytes explicitly; reader flushFinalProgress drops
the always-405 sendBeacon path)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.
`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.
Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
around browser::Handle, with an on_launch hook that re-injects
PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
/cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
(MetadataPass, ChapterDispatcher) so tests can inject stubs without
standing up Chromium or a live source.
Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
last_metadata_tick_at watermark.
Dep additions: chrono-tz (IANA TZ parsing).
CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PG rejects `SELECT DISTINCT c.id, c.manga_id, cs.source_url ... ORDER BY
c.manga_id, c.created_at` because the ORDER BY references a column not in
the DISTINCT projection. Wrap the DISTINCT in a subquery (which includes
created_at) and apply the ORDER BY in the outer SELECT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Debug aid: when set in headed mode, the crawler blocks on Ctrl+C at
every shutdown point (early auth bails + normal completion) instead
of closing the browser immediately. Operator can inspect DOM, cookies,
and network state in the visible Chromium window before exit.
Ignored in headless (no window to inspect) — logged as a warning if
set under headless so the operator doesn't sit waiting.
chromiumoxide's `Browser` is `kill_on_drop`, so the close-or-wait
helper must await Ctrl+C *before* the Handle is dropped — otherwise
the Chromium child gets killed out from under the operator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.
New crawler modules:
- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
with optional per-host overrides. Replaces the single shared
`Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
(override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
id-based sorting (DOM order isn't trusted), then performs the
atomic chapter sync.
bin/crawler additions:
- Concurrent chapter content phase via `futures_util::for_each_concurrent`
(`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
workers — chromiumoxide allows concurrent `new_page` on `&self` —
and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
for the catalog domain only (CDN intentionally not given the
cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
`CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
`CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
`CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
every chapter page nav; first failure aborts the phase with a
cookie-refresh message.
Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- TargetSource: first concrete impl of the Source trait, modeled on
the old Puppeteer crawler's selectors (+ status normalization,
tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
New/Updated/Unchanged + current cover_image_path for backfill
decisions), sync_manga_chapters, mark_dropped_mangas — all
transactional, with case-insensitive lookups and source-insertable
genres
- Cover image download via reqwest+infer; stored under
mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
$CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
$CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests