Mangalord

Author	SHA1	Message	Date
MechaCat02	e02d125f51	feat(crawler): live cover + chapter-content observability with realtime page counts Extends the live dashboard so an operator can see exactly what's being fetched, in realtime: - Chapters being crawled now are tracked in the status as `active_chapters` (manga title · ch.N) with a live page counter that climbs per stored page (set_chapter_pages, pushed via the existing watch→SSE). The dispatcher registers each via an RAII ChapterGuard (sync Mutex) that removes the entry on completion, panic, or timeout-drop — replacing the old per-worker slot model. - Covers: status now carries the cover being fetched now (`current_cover`, set around download_and_store_cover in both the metadata pass and backfill) and a `covers_queued` backlog count; CoverBackfill phase gains index/total. - Two paginated backlog endpoints (fetched on demand, auto-refreshed when the live counts change): GET /admin/crawler/active-jobs (which chapters of which mangas are queued/running) and GET /admin/crawler/covers (mangas missing a cover). repo: list_active_jobs, list_missing_cover_mangas, count_missing_covers. - dispatch_target now also returns manga title + chapter number. Frontend: the crawler page replaces the Workers table with an Active-chapters table (live page bars), adds a current-cover line + covers-queued figure, and two backlog sections (Queued chapters / Queued covers) with search + Pager, auto-refetched via $effect on the live counts. Tests: status guard/page + cover unit tests; repo list/count tests; endpoint tests; frontend api tests. Version 0.53.1 -> 0.54.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 20:41:51 +02:00
MechaCat02	cd0a1e13a9	feat(crawler): live status surface, runtime session, dead-job repo, auto-restart Adds the in-process observability + control infrastructure the admin dashboard consumes: - status.rs: CrawlerStatus/Phase/WorkerState + StatusHandle. The daemon publishes its current phase (idle/walking/fetching-metadata/cover-backfill), per-worker activity, and last-pass summary. Wired through the cron, run_metadata_pass, and the worker loop. - session_control.rs: SessionController refreshes PHPSESSID at runtime — rewrites the shared reqwest cookie jar, updates the value on_launch reads, persists to crawler_state (survives restart), and clears the expired flag. on_launch now reads the live session instead of a startup snapshot. - RealChapterDispatcher auto-triggers a coordinated browser restart after CRAWLER_BROWSER_RESTART_THRESHOLD consecutive transient failures. - repo::crawler: list_dead_jobs, requeue_dead_jobs (all/manga/job, bypassing the quarantine, skipping live duplicates), job_state_counts. - AppState gains CrawlerControl bundling browser_manager + session + status + metadata_pass for the admin endpoints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 20:38:54 +02:00
MechaCat02	3f91bea768	feat(crawler): coordinated browser restart gate in BrowserManager Adds a Healthy/Draining/Restarting lifecycle gate. `acquire()` parks while a restart is in progress; `coordinated_restart(deadline)` blocks new acquires, drains in-flight leases (bounded, then forces), closes + relaunches Chromium (re-running on_launch → re-inject session + probe), then resumes parked acquirers. Concurrent restart requests collapse into one relaunch; the phase always returns to Healthy so a failed relaunch never wedges acquisition. The metadata pass cooperates via is_restart_pending() at its per-manga checkpoint, yielding its long-lived lease (recovery sweep next tick) instead of stalling the drain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 20:18:09 +02:00
MechaCat02	7a6815661f	feat(crawler): reliability fixes — heartbeat, streaming, jitter, timeout, breaker A1 Lease heartbeat: jobs::renew keeps a long-but-healthy job's lease fresh so it is never stolen mid-flight nor inflated toward max_attempts. A2 Stream chapter pages straight to storage (peak memory = one image) and persist rows + page_count in one short transaction off the network path (S3-ready); roll back stored blobs on failure via Storage::delete. A3 ±20% jitter on exponential backoff to avoid a retry thundering herd. A4 Outer per-dispatch timeout (CRAWLER_JOB_TIMEOUT_SECS, default 600) so a hung job is acked-failed instead of wedging a worker. A5 Metadata circuit-breaker (CRAWLER_METADATA_MAX_CONSECUTIVE_FAILURES, default 10): abort a pass on a source outage without marking a clean exit, so the next tick recovery-sweeps. Adds CRAWLER_BROWSER_RESTART_THRESHOLD config (used by the upcoming coordinated browser restart). Bumps version 0.52.0 -> 0.53.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 20:13:17 +02:00
MechaCat02	e93eec89e5	fix(crawler): queue chapter content in ascending number order (0.51.1) Both enqueue paths now order by chapters.number so the cron tick and the bookmark hook insert jobs from chapter 1 upward instead of source-discovery or random-UUID order. The lease query tiebreaks on created_at so jobs sharing a batch's scheduled_at come off the queue in insertion order, propagating the enqueue intent through to dequeue. Concurrent workers and per-CDN latency can still drift actual completion order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 21:13:51 +02:00
MechaCat02	c134bdbbde	feat: cover retry backfill + admin force-resync for manga & chapter (0.50.0) Adds a per-tick cover-backfill pass to the crawler daemon so mangas whose cover download failed on first attempt get retried — the metadata pass's early-stop optimisation otherwise prevents the walk from revisiting them. Adds admin-only POST /admin/mangas/:id/resync and POST /admin/chapters/:id/resync that refetch metadata + cover (or chapter content with force_refetch) from the crawler source synchronously and return the refreshed row. Surfaced in the UI as "Force resync" buttons on the manga detail and reader pages, admin-only via session.user.is_admin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 22:00:09 +02:00
MechaCat02	8557e432a2	feat(crawler): plumb TorController through FetchContext and pipelines Adds CRAWLER_TOR_CONTROL_URL / _PASSWORD / _COOKIE_PATH / _RECIRCUIT_MAX_ATTEMPTS to CrawlerConfig and to bin/crawler.rs's env reads. Constructs an Option<Arc<TorController>> at daemon / CLI startup and threads it through FetchContext, pipeline::run_metadata_pass, and content::sync_chapter_content as Option<&TorController>. Pure scaffolding — the controller isn't used yet; behavior is unchanged. Next commit wires the retry hooks and session-probe recircuit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-31 19:59:47 +02:00
MechaCat02	eaa5afda50	fix(crawler): skip sync when empty chapters + prior > 0 (0.36.6) The wait_for_selector wait in 0.36.2 narrows the partial-render race window but doesn't close it: a render that takes longer than SELECTOR_TIMEOUT (10s) still hands an empty Vec to sync_manga_chapters, and the soft-drop branch flips every existing chapter to dropped_at. The next tick recovers but a manga's reader briefly stops working in between. Close it at the pipeline level. Between fetch_manga and the upsert/ sync, if the parsed chapter list is empty and the prior live count for (source_id, source_manga_key) is > 0, treat the fetch as a transient failure: log, bump mangas_failed, skip upsert + sync + the seen.insert so a later batch / tick retries. Brand-new mangas with genuinely zero chapters (prior == 0) pass through unchanged. New repo helper repo::crawler::live_chapter_count_for_source_manga joins chapters → chapter_sources → manga_sources with dropped_at IS NULL — same lockstep as dispatch_target and the enqueue queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-30 20:17:42 +02:00
MechaCat02	9f56f283d4	feat(crawler): single-mode walker gated by recovery flag (0.36.0) Collapses the crawler to a single newest-first walker and replaces the N-consecutive-unchanged streak with a per-manga rule: stop on the first manga where metadata is Unchanged AND chapter sync reports zero new chapters. The early stop is gated by a per-source recovery flag stored in `crawler_state` — set to `false` when a run starts, back to `true` only on a clean exit (end-of-walk or intentional stop). A crashed run leaves the flag `false` automatically (no shutdown code runs), so the next tick walks the full catalog instead of bailing at the first caught-up manga. This means a crashed mid-walk run self-heals on the next tick: the flag stays `false`, the next walk visits every page (recovering anything the crash missed past its crash point), and steady state resumes once the recovery sweep reaches end-of-walk. Removed: - DiscoverMode enum, Backfill mode, the boundary re-check + displaced-refs machinery in TargetSourceWalker. - Drop-pass (mark_dropped_mangas) and seed-completion plumbing (mark_seed_completed / seed_completed_at). The recovery flag subsumes the seed-completion signal; drop detection was explicitly opted out. - JobPayload::Discover (no production callers). - CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the CrawlerModePref config type. `should_mark_clean_exit(walked_to_completion, hit_stop_condition)` encodes the clean-exit truth table in its signature — `hit_limit` is deliberately absent so a future edit cannot accidentally count a caller-imposed cap as a clean exit. Net -501 lines, 261 backend tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 23:49:28 +02:00
MechaCat02	50763addcf	fix(crawler): quarantine recently-dead chapters from re-enqueue (0.35.4) The partial dedup index only blocks (pending|running) duplicates, so once a SyncChapterContent job transitions to 'dead' (max_attempts exhausted) the slot frees. Every subsequent cron tick re-enqueued the chapter — page_count = 0 and dropped_at IS NULL stay true — burned another max_attempts retries, and died again. Permanent-failure chapters spun forever. enqueue_bookmarked_pending and enqueue_pending_for_manga now skip chapters whose latest sync_chapter_content job is dead within CHAPTER_DEAD_QUARANTINE_DAYS (7). A failed chapter goes silent for a week, then gets one more shot — long enough for a transient site issue to resolve, short enough that permanent failures don't stay permanent if conditions change. Two integration tests pin both halves of the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 20:42:41 +02:00
MechaCat02	dea9b1aaa8	fix(crawler): close walker race against site reordering (0.35.1) The target site orders by update_date DESC, and any new or updated manga pushes everyone down by one slot. The paginated walker was blind to this drift: * Backfill (page last -> 1): shifts push items into pages already finished. The displaced manga was silently missed; with mark_dropped_mangas running on a fully-completed walk, items even got false-dropped because last_seen_at was stale. * Incremental (page 1 -> last): a shift causes the slot-last item of an already-read page to reappear on the next page, leading to a redundant fetch_manga and an inflated consecutive_unchanged streak. Fix is two-pronged: 1. Backfill boundary re-check. After fetching each page P, re-fetch the previously-walked page P+1 and check where its old slot-0 key now sits. If it slid to slot K, the first K entries are items that used to live on P and slid past us; they get appended to the batch. If the anchor is gone entirely (multi-page shift or it was bumped to page 1), the whole re-fetched page is processed conservatively and the pipeline dedup absorbs the noise. The re-check must be the last navigation of the iteration to close the within-iteration race. 2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of source_manga_keys avoids double-processing. The set uses a contains-then-insert pattern with insert firing after a successful upsert, so a transient fetch/upsert failure leaves the key retryable if it reappears later in the same pass (via the boundary re-check or another batch). Incremental mode does not run the re-check (shifts move in the same direction as the walk); only the dedup helps it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 20:14:01 +02:00
MechaCat02	8d34132883	bugfix: security & correctness bundle (0.34.1) Five fixes bundled into one release: - preserve user-attached tags across crawler upserts (repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned attachments from deleted users are reaped as crawler-owned) - gate manga PATCH and cover endpoints on uploaded_by (require_can_edit in api::mangas; non-NULL uploaded_by must match the caller) - equalise login response time across user-existence branches (run argon2 against a OnceLock-cached dummy hash on the no-user branch so timing doesn't leak username existence) - crawler download defences (SSRF allowlist of host literals including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap, reject non-whitelisted image types, three-way chapter-probe classifier replaces the binary #avatar_menu check) - tighten validation and clean up dead unload path (attach_tag + create_token enforce 64-char caps; LocalStorage rejects NUL bytes explicitly; reader flushFinalProgress drops the always-405 sendBeacon path) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:24:51 +02:00
MechaCat02	c320eda7cd	chore: dedupe is_unique_violation, lift SQL into repo, centralise URL parsing Three layering cleanups from REVIEW.md §5 / §3: - Drop the three private `is_unique_violation` helpers in repo::{user,chapter,bookmark} in favour of sqlx 0.8's `DatabaseError::is_unique_violation()` method (already used by repo::collection). - Remove the unreachable 23505 branch in repo::chapter::create — the (manga_id, number) UNIQUE was dropped in 0013, so the defensive arm could no longer fire. A doc note records what to do if uniqueness is re-added. - Move three inline SQL queries out of handlers/daemon into repo functions: bookmarks' chapter-belongs-to-manga guard (`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup (`repo::chapter::dispatch_target`), and the daemon's page_count safety net (`repo::chapter::page_count`). Restores the handlers→repo layering invariant in CLAUDE.md. - New `crawler::url_utils` module consolidates host_of / origin_of / registrable_domain — they used to live in three crawler submodules with diverging edge-case behaviour. Tests moved with them. - Doc cross-references on repo::author::set_for_manga and repo::genre::set_for_manga pointing to the crawler's name-keyed variants, so the intentional duplication is discoverable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 20:24:05 +02:00
MechaCat02	45ce0d8f12	feat: incremental crawl mode with seed-completion gate (0.33.0) Daemon now auto-detects mode per source: Backfill until the first full walk records `seed_completed:<source>` in `crawler_state`, then Incremental (newest-first, stops after N consecutive Unchanged upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects `auto` since it has no pre-run DB state. `Source::discover` returns a lazy `DiscoverWalk` so Incremental can break out mid-walk without prefetching pages. The drop pass and seed marker are now gated on a true full walk — fixes a latent soft-drop of the index tail under partial sweeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 06:41:26 +02:00
MechaCat02	9fe0f26d75	feat: in-process crawler daemon with cron and worker pool (0.28.0) The backend now boots an internal crawler daemon that runs a daily metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded for multi-replica safety) and drains SyncChapterContent jobs from crawler_jobs through a worker pool. Chromium launches lazily on first job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity. Modules: - crawler::browser_manager — lazy-launch / idle-teardown wrapper around browser::Handle, with an on_launch hook that re-injects PHPSESSID on every fresh Chromium spawn. - crawler::pipeline — run_metadata_pass (the shared discover/upsert /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper used by the cron tick. - crawler::daemon — cron task + worker pool, behind two trait seams (MetadataPass, ChapterDispatcher) so tests can inject stubs without standing up Chromium or a live source. Behavior: - CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests). - Catch-up tick fires on startup if the last persisted slot was missed. - A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers idle until operator restart with a refreshed PHPSESSID. - Worker dispatch wrapped in catch_unwind so a panicking handler marks the job failed instead of taking down the worker. - Migration 0015 adds a small crawler_state k-v table for the last_metadata_tick_at watermark. Dep additions: chrono-tz (IANA TZ parsing). CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds the browser via BrowserManager so the on_launch session injection flow stays in one place. Inline chapter-content sync semantics are unchanged — the queue is for the daemon, force-refetches and manual backfills still bypass it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:32:02 +02:00
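
Commit 9fe0f26d75 says worker dispatch is wrapped in catch_unwind so that a panicking handler marks the job failed instead of taking down the worker. The following is a minimal synchronous sketch of that idea using only the standard library. `JobOutcome` and `dispatch_guarded` are hypothetical names chosen for illustration; the daemon's real types and its async dispatch path are not shown in this log.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical outcome type; the daemon's actual job states are not in the log.
#[derive(Debug, PartialEq)]
enum JobOutcome {
    Done,
    Failed(String),
}

/// Run one job handler and convert a panic into a failed outcome,
/// so the calling worker loop keeps running.
fn dispatch_guarded<F>(handler: F) -> JobOutcome
where
    F: FnOnce() -> Result<(), String>,
{
    // AssertUnwindSafe: the sketch assumes the handler shares no state
    // that a panic could leave half-updated.
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(Ok(())) => JobOutcome::Done,
        Ok(Err(e)) => JobOutcome::Failed(e),
        Err(payload) => {
            // Panic payloads are usually &str or String; fall back otherwise.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            JobOutcome::Failed(format!("handler panicked: {msg}"))
        }
    }
}
```

A worker loop would call `dispatch_guarded` once per leased job and record the returned outcome. Note that catch_unwind only helps when the binary is built with unwinding panics, not `panic = "abort"`.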

15 Commits
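
Item A3 of commit 7a6815661f adds ±20% jitter to exponential backoff to avoid a retry thundering herd. A rough sketch of that arithmetic follows, assuming a doubling schedule with a cap. The function name, the `base` and `cap` parameters, and the choice to pass the random sample in as `unit` (rather than drawing it from an RNG) are all assumptions made so the example stays deterministic and dependency-free; the log does not show the real signature.

```rust
use std::time::Duration;

/// Exponential backoff with symmetric ±20% jitter (illustrative only).
///
/// `unit` is a caller-supplied random sample in [-1.0, 1.0];
/// -1.0 yields 80% of the nominal delay and 1.0 yields 120%.
fn backoff_with_jitter(attempt: u32, base: Duration, cap: Duration, unit: f64) -> Duration {
    // Nominal delay: base * 2^attempt, saturating on overflow, clamped to cap.
    let nominal = base
        .saturating_mul(2u32.saturating_pow(attempt))
        .min(cap);

    // Treat a NaN sample as "no jitter" and clamp everything else into range.
    let unit = if unit.is_nan() { 0.0 } else { unit.clamp(-1.0, 1.0) };

    // The cap is applied before jitter, so the result can exceed cap by up to 20%.
    nominal.mul_f64(1.0 + 0.2 * unit)
}
```

With `base` = 1 s and `cap` = 60 s, attempt 3 has a nominal delay of 8 s, so the jittered delay falls between 6.4 s and 9.6 s. Spreading retries across that window is what keeps jobs that failed together from all retrying at the same instant.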