Extends the live dashboard so an operator can see exactly what's being
fetched, in realtime:
- Chapters being crawled now are tracked in the status as `active_chapters`
(manga title · ch.N) with a live page counter that climbs per stored page
(set_chapter_pages, pushed via the existing watch→SSE). The dispatcher
registers each via an RAII ChapterGuard (sync Mutex) that removes the
entry on completion, panic, or timeout-drop — replacing the old per-worker
slot model.
- Covers: status now carries the cover being fetched now (`current_cover`,
set around download_and_store_cover in both the metadata pass and backfill)
and a `covers_queued` backlog count; CoverBackfill phase gains index/total.
- Two paginated backlog endpoints (fetched on demand, auto-refreshed when the
live counts change): GET /admin/crawler/active-jobs (which chapters of which
mangas are queued/running) and GET /admin/crawler/covers (mangas missing a
cover). repo: list_active_jobs, list_missing_cover_mangas, count_missing_covers.
- dispatch_target now also returns manga title + chapter number.
Frontend: the crawler page replaces the Workers table with an Active-chapters
table (live page bars), adds a current-cover line + covers-queued figure, and
two backlog sections (Queued chapters / Queued covers) with search + Pager,
auto-refetched via $effect on the live counts.
Tests: status guard/page + cover unit tests; repo list/count tests; endpoint
tests; frontend api tests. Version 0.53.1 -> 0.54.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the in-process observability + control infrastructure the admin
dashboard consumes:
- status.rs: CrawlerStatus/Phase/WorkerState + StatusHandle. The daemon
publishes its current phase (idle/walking/fetching-metadata/cover-backfill),
per-worker activity, and last-pass summary. Wired through the cron,
run_metadata_pass, and the worker loop.
- session_control.rs: SessionController refreshes PHPSESSID at runtime —
rewrites the shared reqwest cookie jar, updates the value on_launch reads,
persists to crawler_state (survives restart), and clears the expired flag.
on_launch now reads the live session instead of a startup snapshot.
- RealChapterDispatcher auto-triggers a coordinated browser restart after
CRAWLER_BROWSER_RESTART_THRESHOLD consecutive transient failures.
- repo::crawler: list_dead_jobs, requeue_dead_jobs (all/manga/job, bypassing
the quarantine, skipping live duplicates), job_state_counts.
- AppState gains CrawlerControl bundling browser_manager + session + status
+ metadata_pass for the admin endpoints.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a Healthy/Draining/Restarting lifecycle gate. `acquire()` parks while
a restart is in progress; `coordinated_restart(deadline)` blocks new
acquires, drains in-flight leases (bounded, then forces), closes +
relaunches Chromium (re-running on_launch → re-inject session + probe),
then resumes parked acquirers. Concurrent restart requests collapse into
one relaunch; the phase always returns to Healthy so a failed relaunch
never wedges acquisition.
The metadata pass cooperates via is_restart_pending() at its per-manga
checkpoint, yielding its long-lived lease (recovery sweep next tick)
instead of stalling the drain.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A1 Lease heartbeat: jobs::renew keeps a long-but-healthy job's lease fresh
so it is never stolen mid-flight nor inflated toward max_attempts.
A2 Stream chapter pages straight to storage (peak memory = one image) and
persist rows + page_count in one short transaction off the network path
(S3-ready); roll back stored blobs on failure via Storage::delete.
A3 ±20% jitter on exponential backoff to avoid a retry thundering herd.
A4 Outer per-dispatch timeout (CRAWLER_JOB_TIMEOUT_SECS, default 600) so a
hung job is acked-failed instead of wedging a worker.
A5 Metadata circuit-breaker (CRAWLER_METADATA_MAX_CONSECUTIVE_FAILURES,
default 10): abort a pass on a source outage without marking a clean exit,
so the next tick recovery-sweeps.
Adds CRAWLER_BROWSER_RESTART_THRESHOLD config (used by the upcoming
coordinated browser restart). Bumps version 0.52.0 -> 0.53.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both enqueue paths now order by chapters.number so the cron tick and the
bookmark hook insert jobs from chapter 1 upward instead of source-discovery
or random-UUID order. The lease query tiebreaks on created_at so jobs
sharing a batch's scheduled_at come off the queue in insertion order,
propagating the enqueue intent through to dequeue. Concurrent workers
and per-CDN latency can still drift actual completion order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-tick cover-backfill pass to the crawler daemon so mangas whose
cover download failed on first attempt get retried — the metadata pass's
early-stop optimisation otherwise prevents the walk from revisiting them.
Adds admin-only POST /admin/mangas/:id/resync and POST /admin/chapters/:id/resync
that refetch metadata + cover (or chapter content with force_refetch) from the
crawler source synchronously and return the refreshed row. Surfaced in the
UI as "Force resync" buttons on the manga detail and reader pages,
admin-only via session.user.is_admin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds CRAWLER_TOR_CONTROL_URL / _PASSWORD / _COOKIE_PATH /
_RECIRCUIT_MAX_ATTEMPTS to CrawlerConfig and to bin/crawler.rs's
env reads. Constructs an Option<Arc<TorController>> at daemon /
CLI startup and threads it through FetchContext,
pipeline::run_metadata_pass, and content::sync_chapter_content as
Option<&TorController>.
Pure scaffolding — the controller isn't used yet; behavior is
unchanged. Next commit wires the retry hooks and session-probe
recircuit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wait_for_selector wait in 0.36.2 narrows the partial-render race
window but doesn't close it: a render that takes longer than
SELECTOR_TIMEOUT (10s) still hands an empty Vec to sync_manga_chapters,
and the soft-drop branch flips every existing chapter to dropped_at.
The next tick recovers but a manga's reader briefly stops working in
between.
Close it at the pipeline level. Between fetch_manga and the upsert/
sync, if the parsed chapter list is empty and the prior live count
for (source_id, source_manga_key) is > 0, treat the fetch as a
transient failure: log, bump mangas_failed, skip upsert + sync + the
seen.insert so a later batch / tick retries. Brand-new mangas with
genuinely zero chapters (prior == 0) pass through unchanged.
New repo helper repo::crawler::live_chapter_count_for_source_manga
joins chapters → chapter_sources → manga_sources with dropped_at IS
NULL — same lockstep as dispatch_target and the enqueue queries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.
This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.
Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
(mark_seed_completed / seed_completed_at). The recovery flag
subsumes the seed-completion signal; drop detection was explicitly
opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
CrawlerModePref config type.
`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.
Net -501 lines, 261 backend tests passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The partial dedup index only blocks (pending|running) duplicates, so
once a SyncChapterContent job transitions to 'dead' (max_attempts
exhausted) the slot frees. Every subsequent cron tick re-enqueued the
chapter — page_count = 0 and dropped_at IS NULL stay true — burned
another max_attempts retries, and died again. Permanent-failure
chapters spun forever.
enqueue_bookmarked_pending and enqueue_pending_for_manga now skip
chapters whose latest sync_chapter_content job is dead within
CHAPTER_DEAD_QUARANTINE_DAYS (7). A failed chapter goes silent for a
week, then gets one more shot — long enough for a transient site
issue to resolve, short enough that permanent failures don't stay
permanent if conditions change.
Two integration tests pin both halves of the contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The target site orders by update_date DESC, and any new or updated
manga pushes everyone down by one slot. The paginated walker was
blind to this drift:
* Backfill (page last -> 1): shifts push items into pages already
finished. The displaced manga was silently missed; with
mark_dropped_mangas running on a fully-completed walk, items even
got false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item
of an already-read page to reappear on the next page, leading to
a redundant fetch_manga and an inflated consecutive_unchanged
streak.
Fix is two-pronged:
1. Backfill boundary re-check. After fetching each page P, re-fetch
the previously-walked page P+1 and check where its old slot-0
key now sits. If it slid to slot K, the first K entries are
items that used to live on P and slid past us; they get appended
to the batch. If the anchor is gone entirely (multi-page shift
or it was bumped to page 1), the whole re-fetched page is
processed conservatively and the pipeline dedup absorbs the
noise. The re-check must be the *last* navigation of the
iteration to close the within-iteration race.
2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
source_manga_keys avoids double-processing. The set uses a
contains-then-insert pattern with insert firing *after* a
successful upsert, so a transient fetch/upsert failure leaves
the key retryable if it reappears later in the same pass (via
the boundary re-check or another batch).
Incremental mode does not run the re-check (shifts move in the
same direction as the walk); only the dedup helps it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes bundled into one release:
- preserve user-attached tags across crawler upserts
(repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned
attachments from deleted users are reaped as crawler-owned)
- gate manga PATCH and cover endpoints on uploaded_by (require_can_edit
in api::mangas; non-NULL uploaded_by must match the caller)
- equalise login response time across user-existence branches
(run argon2 against a OnceLock-cached dummy hash on the no-user
branch so timing doesn't leak username existence)
- crawler download defences (SSRF allowlist of host literals
including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap,
reject non-whitelisted image types, three-way chapter-probe
classifier replaces the binary #avatar_menu check)
- tighten validation and clean up dead unload path
(attach_tag + create_token enforce 64-char caps; LocalStorage
rejects NUL bytes explicitly; reader flushFinalProgress drops
the always-405 sendBeacon path)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three layering cleanups from REVIEW.md §5 / §3:
- Drop the three private `is_unique_violation` helpers in
repo::{user,chapter,bookmark} in favour of sqlx 0.8's
`DatabaseError::is_unique_violation()` method (already used by
repo::collection).
- Remove the unreachable 23505 branch in repo::chapter::create — the
(manga_id, number) UNIQUE was dropped in 0013, so the defensive arm
could no longer fire. A doc note records what to do if uniqueness
is re-added.
- Move three inline SQL queries out of handlers/daemon into repo
functions: bookmarks' chapter-belongs-to-manga guard
(`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup
(`repo::chapter::dispatch_target`), and the daemon's page_count
safety net (`repo::chapter::page_count`). Restores the
handlers→repo layering invariant in CLAUDE.md.
- New `crawler::url_utils` module consolidates host_of / origin_of /
registrable_domain — they used to live in three crawler submodules
with diverging edge-case behaviour. Tests moved with them.
- Doc cross-references on repo::author::set_for_manga and
repo::genre::set_for_manga pointing to the crawler's name-keyed
variants, so the intentional duplication is discoverable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.
`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.
Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
around browser::Handle, with an on_launch hook that re-injects
PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
/cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
(MetadataPass, ChapterDispatcher) so tests can inject stubs without
standing up Chromium or a live source.
Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
last_metadata_tick_at watermark.
Dep additions: chrono-tz (IANA TZ parsing).
CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>