The wait_for_selector wait in 0.36.2 narrows the partial-render race
window but doesn't close it: a render that takes longer than
SELECTOR_TIMEOUT (10s) still hands an empty Vec to sync_manga_chapters,
and the soft-drop branch flips every existing chapter to dropped_at.
The next tick recovers but a manga's reader briefly stops working in
between.
Close it at the pipeline level. Between fetch_manga and the upsert/
sync, if the parsed chapter list is empty and the prior live count
for (source_id, source_manga_key) is > 0, treat the fetch as a
transient failure: log, bump mangas_failed, skip upsert + sync + the
seen.insert so a later batch / tick retries. Brand-new mangas with
genuinely zero chapters (prior == 0) pass through unchanged.
New repo helper repo::crawler::live_chapter_count_for_source_manga
joins chapters → chapter_sources → manga_sources with dropped_at IS
NULL — same lockstep as dispatch_target and the enqueue queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.
This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.
Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
(mark_seed_completed / seed_completed_at). The recovery flag
subsumes the seed-completion signal; drop detection was explicitly
opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
CrawlerModePref config type.
`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.
Net -501 lines, 261 backend tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The partial dedup index only blocks (pending|running) duplicates, so
once a SyncChapterContent job transitions to 'dead' (max_attempts
exhausted) the slot frees. Every subsequent cron tick re-enqueued the
chapter — page_count = 0 and dropped_at IS NULL stay true — burned
another max_attempts retries, and died again. Permanent-failure
chapters spun forever.
enqueue_bookmarked_pending and enqueue_pending_for_manga now skip
chapters whose latest sync_chapter_content job is dead within
CHAPTER_DEAD_QUARANTINE_DAYS (7). A failed chapter goes silent for a
week, then gets one more shot — long enough for a transient site
issue to resolve, short enough that permanent failures don't stay
permanent if conditions change.
Two integration tests pin both halves of the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The target site orders by update_date DESC, and any new or updated
manga pushes everyone down by one slot. The paginated walker was
blind to this drift:
* Backfill (page last -> 1): shifts push items into pages already
finished. The displaced manga was silently missed; with
mark_dropped_mangas running on a fully-completed walk, items even
got false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item
of an already-read page to reappear on the next page, leading to
a redundant fetch_manga and an inflated consecutive_unchanged
streak.
Fix is two-pronged:
1. Backfill boundary re-check. After fetching each page P, re-fetch
the previously-walked page P+1 and check where its old slot-0
key now sits. If it slid to slot K, the first K entries are
items that used to live on P and slid past us; they get appended
to the batch. If the anchor is gone entirely (multi-page shift
or it was bumped to page 1), the whole re-fetched page is
processed conservatively and the pipeline dedup absorbs the
noise. The re-check must be the *last* navigation of the
iteration to close the within-iteration race.
2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
source_manga_keys avoids double-processing. The set uses a
contains-then-insert pattern with insert firing *after* a
successful upsert, so a transient fetch/upsert failure leaves
the key retryable if it reappears later in the same pass (via
the boundary re-check or another batch).
Incremental mode does not run the re-check (shifts move in the
same direction as the walk); only the dedup helps it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes bundled into one release:
- preserve user-attached tags across crawler upserts
(repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned
attachments from deleted users are reaped as crawler-owned)
- gate manga PATCH and cover endpoints on uploaded_by (require_can_edit
in api::mangas; non-NULL uploaded_by must match the caller)
- equalise login response time across user-existence branches
(run argon2 against a OnceLock-cached dummy hash on the no-user
branch so timing doesn't leak username existence)
- crawler download defences (SSRF allowlist of host literals
including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap,
reject non-whitelisted image types, three-way chapter-probe
classifier replaces the binary #avatar_menu check)
- tighten validation and clean up dead unload path
(attach_tag + create_token enforce 64-char caps; LocalStorage
rejects NUL bytes explicitly; reader flushFinalProgress drops
the always-405 sendBeacon path)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three layering cleanups from REVIEW.md §5 / §3:
- Drop the three private `is_unique_violation` helpers in
repo::{user,chapter,bookmark} in favour of sqlx 0.8's
`DatabaseError::is_unique_violation()` method (already used by
repo::collection).
- Remove the unreachable 23505 branch in repo::chapter::create — the
(manga_id, number) UNIQUE was dropped in 0013, so the defensive arm
could no longer fire. A doc note records what to do if uniqueness
is re-added.
- Move three inline SQL queries out of handlers/daemon into repo
functions: bookmarks' chapter-belongs-to-manga guard
(`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup
(`repo::chapter::dispatch_target`), and the daemon's page_count
safety net (`repo::chapter::page_count`). Restores the
handlers→repo layering invariant in CLAUDE.md.
- New `crawler::url_utils` module consolidates host_of / origin_of /
registrable_domain — they used to live in three crawler submodules
with diverging edge-case behaviour. Tests moved with them.
- Doc cross-references on repo::author::set_for_manga and
repo::genre::set_for_manga pointing to the crawler's name-keyed
variants, so the intentional duplication is discoverable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.
`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.
Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
around browser::Handle, with an on_launch hook that re-injects
PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
/cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
(MetadataPass, ChapterDispatcher) so tests can inject stubs without
standing up Chromium or a live source.
Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
last_metadata_tick_at watermark.
Dep additions: chrono-tz (IANA TZ parsing).
CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>