Mangalord

Author	SHA1	Message	Date
MechaCat02	a2826d6467	feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0) Operators whose sources shard images across numbered CDN subdomains can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST. The new flag short-circuits the host check in DownloadAllowlist::contains while leaving scheme, localhost, and private-IP defenses in is_safe_url untouched; scraped URLs pointing at 10.x / 169.254.169.254 / file:// stay refused. Default is false; fail-closed posture is preserved unless the operator opts in. Wired into both the server (config::build_download_allowlist) and the bin/crawler.rs one-shot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-31 14:52:49 +02:00
MechaCat02	9f56f283d4	feat(crawler): single-mode walker gated by recovery flag (0.36.0) Collapses the crawler to a single newest-first walker and replaces the N-consecutive-unchanged streak with a per-manga rule: stop on the first manga where metadata is Unchanged AND chapter sync reports zero new chapters. The early stop is gated by a per-source recovery flag stored in `crawler_state` — set to `false` when a run starts, back to `true` only on a clean exit (end-of-walk or intentional stop). A crashed run leaves the flag `false` automatically (no shutdown code runs), so the next tick walks the full catalog instead of bailing at the first caught-up manga. This means a crashed mid-walk run self-heals on the next tick: the flag stays `false`, the next walk visits every page (recovering anything the crash missed past its crash point), and steady state resumes once the recovery sweep reaches end-of-walk. Removed: - DiscoverMode enum, Backfill mode, the boundary re-check + displaced-refs machinery in TargetSourceWalker. - Drop-pass (mark_dropped_mangas) and seed-completion plumbing (mark_seed_completed / seed_completed_at). The recovery flag subsumes the seed-completion signal; drop detection was explicitly opted out. - JobPayload::Discover (no production callers). - CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the CrawlerModePref config type. `should_mark_clean_exit(walked_to_completion, hit_stop_condition)` encodes the clean-exit truth table in its signature — `hit_limit` is deliberately absent so a future edit cannot accidentally count a caller-imposed cap as a clean exit. Net -501 lines, 261 backend tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-29 23:49:28 +02:00
MechaCat02	8d34132883	bugfix: security & correctness bundle (0.34.1) Five fixes bundled into one release: - preserve user-attached tags across crawler upserts (repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned attachments from deleted users are reaped as crawler-owned) - gate manga PATCH and cover endpoints on uploaded_by (require_can_edit in api::mangas; non-NULL uploaded_by must match the caller) - equalise login response time across user-existence branches (run argon2 against a OnceLock-cached dummy hash on the no-user branch so timing doesn't leak username existence) - crawler download defences (SSRF allowlist of host literals including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap, reject non-whitelisted image types, three-way chapter-probe classifier replaces the binary #avatar_menu check) - tighten validation and clean up dead unload path (attach_tag + create_token enforce 64-char caps; LocalStorage rejects NUL bytes explicitly; reader flushFinalProgress drops the always-405 sendBeacon path) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:24:51 +02:00
MechaCat02	45ce0d8f12	feat: incremental crawl mode with seed-completion gate (0.33.0) Daemon now auto-detects mode per source: Backfill until the first full walk records `seed_completed:<source>` in `crawler_state`, then Incremental (newest-first, stops after N consecutive Unchanged upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects `auto` since it has no pre-run DB state. `Source::discover` returns a lazy `DiscoverWalk` so Incremental can break out mid-walk without prefetching pages. The drop pass and seed marker are now gated on a true full walk — fixes a latent soft-drop of the index tail under partial sweeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 06:41:26 +02:00
MechaCat02	9fe0f26d75	feat: in-process crawler daemon with cron and worker pool (0.28.0) The backend now boots an internal crawler daemon that runs a daily metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded for multi-replica safety) and drains SyncChapterContent jobs from crawler_jobs through a worker pool. Chromium launches lazily on first job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity. Modules: - crawler::browser_manager — lazy-launch / idle-teardown wrapper around browser::Handle, with an on_launch hook that re-injects PHPSESSID on every fresh Chromium spawn. - crawler::pipeline — run_metadata_pass (the shared discover/upsert /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper used by the cron tick. - crawler::daemon — cron task + worker pool, behind two trait seams (MetadataPass, ChapterDispatcher) so tests can inject stubs without standing up Chromium or a live source. Behavior: - CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests). - Catch-up tick fires on startup if the last persisted slot was missed. - A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers idle until operator restart with a refreshed PHPSESSID. - Worker dispatch wrapped in catch_unwind so a panicking handler marks the job failed instead of taking down the worker. - Migration 0015 adds a small crawler_state k-v table for the last_metadata_tick_at watermark. Dep additions: chrono-tz (IANA TZ parsing). CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds the browser via BrowserManager so the on_launch session injection flow stays in one place. Inline chapter-content sync semantics are unchanged — the queue is for the daemon, force-refetches and manual backfills still bypass it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:32:02 +02:00
MechaCat02	89b84252a5	bugfix: subquery-wrap pending chapters query so DISTINCT + ORDER BY agree (0.26.1) PG rejects `SELECT DISTINCT c.id, c.manga_id, cs.source_url ... ORDER BY c.manga_id, c.created_at` because the ORDER BY references a column not in the DISTINCT projection. Wrap the DISTINCT in a subquery (which includes created_at) and apply the ORDER BY in the outer SELECT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 22:20:15 +02:00
MechaCat02	728d704a66	feat: CRAWLER_KEEP_BROWSER_OPEN waits for Ctrl+C in headed mode (0.26.0) Debug aid: when set in headed mode, the crawler blocks on Ctrl+C at every shutdown point (early auth bails + normal completion) instead of closing the browser immediately. Operator can inspect DOM, cookies, and network state in the visible Chromium window before exit. Ignored in headless (no window to inspect) — logged as a warning if set under headless so the operator doesn't sit waiting. chromiumoxide's `Browser` is `kill_on_drop`, so the close-or-wait helper must await Ctrl+C before the Handle is dropped — otherwise the Chromium child gets killed out from under the operator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 21:33:18 +02:00
MechaCat02	d24e68c78d	feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0) After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie. Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run. New crawler modules: - `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host. - `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at startup to fail fast on a bad/expired cookie. - `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync. bin/crawler additions: - Concurrent chapter content phase via `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers (chromiumoxide allows concurrent `new_page` on `&self`), and per-host rate limit gates total RPS regardless of worker count. - reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches. - New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`. - Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; first failure aborts the phase with a cookie-refresh message. Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch; that needs a daemon consumer of crawler_jobs, which doesn't exist yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:28:36 +02:00
MechaCat02	b1a3a4e9d3	feat: crawler manga-list & metadata sync with cover download (0.23.0) - TargetSource: first concrete impl of the Source trait, modeled on the old Puppeteer crawler's selectors (+ status normalization, tag-count stripping, chapter list) - DiscoverMode::Backfill walks pagination last->1, reverse within each page (oldest-first); Incremental walks forward - RateLimiter (tokio-time aware) plumbed through FetchContext so the pagination walk honors the same per-host budget as the outer loop - repo::crawler: ensure_source, upsert_manga_from_source (returns New/Updated/Unchanged + current cover_image_path for backfill decisions), sync_manga_chapters, mark_dropped_mangas — all transactional, with case-insensitive lookups and source-insertable genres - Cover image download via reqwest+infer; stored under mangas/{id}/cover.{ext} via the Storage trait - Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and reqwest::Proxy::all (HTTP/HTTPS/SOCKS5) - Crawler binary: positional start URL or $CRAWLER_START_URL, $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs), $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS - Silences chromiumoxide 0.7's known CDP deserialize log spam via default tracing filter + CdpError::Serde downgrade - 9 sqlx integration tests + 11 selector/rate-limit unit tests	2026-05-21 22:04:23 +02:00
MechaCat02	26eccd0abe	feat: crawler scaffold with chromium launcher (0.22.0) - crawler module (browser, source trait, jobs, diff) + binary - chromiumoxide launcher with fetcher feature (auto-downloads Chromium on first run, caches under ~/.cache/mangalord/chromium) - LaunchOptions struct with extra_args, parseable from CRAWLER_BROWSER_MODE and CRAWLER_BROWSER_ARGS - migration 0012 introduces sources, manga_sources, chapter_sources, crawler_jobs - integration tests for headed + headless launch, ipify load+parse, and extra-args propagation (all #[ignore], opt-in)	2026-05-20 22:07:56 +02:00
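
The recovery-flag rules described in commit 9f56f283d4 can be sketched roughly as below. Only the name `should_mark_clean_exit` and its two parameters come from the commit message; the function body, the helper `should_stop_early`, and its parameter names are assumptions made for illustration and are not taken from the repository.

```rust
/// Name and parameters as quoted in the commit message; the body is an
/// assumption. A run counts as clean when it reached end-of-walk or stopped
/// intentionally. There is no `hit_limit` parameter, so a caller-imposed cap
/// cannot be counted as a clean exit by this signature.
fn should_mark_clean_exit(walked_to_completion: bool, hit_stop_condition: bool) -> bool {
    walked_to_completion || hit_stop_condition
}

/// Hypothetical helper: decides whether the walker may stop at the current
/// manga. The early stop applies only when the previous run exited cleanly
/// (recovery flag is `true`), the metadata upsert was Unchanged, and the
/// chapter sync reported zero new chapters.
fn should_stop_early(last_run_clean: bool, metadata_unchanged: bool, new_chapters: usize) -> bool {
    last_run_clean && metadata_unchanged && new_chapters == 0
}
```

Under these assumptions, a run that crashed mid-walk never reaches the code that would set the flag back to `true`, so `should_stop_early` returns `false` for every manga on the next tick and the walker visits the full catalog.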

10 Commits
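
The interaction between the host allowlist and the allow-any-host flag described in commit a2826d6467 might look like the following minimal sketch. `DownloadAllowlist` and `contains` are named in the commit message; the fields, the constructor, `is_safe_target` (a simplified stand-in for the repository's `is_safe_url`, working on an already-parsed scheme and host), and `may_download` are hypothetical, and the sketch does not cover every case the real checks handle (for example IPv4-mapped IPv6 addresses).

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// Hypothetical shape of the allowlist; only the type name and `contains`
/// appear in the commit message.
struct DownloadAllowlist {
    hosts: HashSet<String>,
    allow_any_host: bool,
}

impl DownloadAllowlist {
    fn new(hosts: &[&str], allow_any_host: bool) -> Self {
        Self {
            hosts: hosts.iter().map(|h| h.to_ascii_lowercase()).collect(),
            allow_any_host,
        }
    }

    /// With the flag set, the host check is short-circuited.
    fn contains(&self, host: &str) -> bool {
        self.allow_any_host || self.hosts.contains(&host.to_ascii_lowercase())
    }
}

/// Simplified stand-in for the scheme / localhost / private-IP defenses.
/// These checks do not consult the allowlist, so the flag cannot bypass them.
fn is_safe_target(scheme: &str, host: &str) -> bool {
    if scheme != "http" && scheme != "https" {
        return false;
    }
    if host.eq_ignore_ascii_case("localhost") {
        return false;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => {
            !(ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified())
        }
        Ok(IpAddr::V6(ip)) => !(ip.is_loopback() || ip.is_unspecified()),
        // Not an IP literal: treated as a regular hostname in this sketch.
        Err(_) => true,
    }
}

/// Hypothetical combined gate: both checks must pass.
fn may_download(list: &DownloadAllowlist, scheme: &str, host: &str) -> bool {
    is_safe_target(scheme, host) && list.contains(host)
}
```

In this sketch, enabling the flag admits an unlisted CDN subdomain over `https`, while `10.0.0.5`, `169.254.169.254`, `localhost`, and a `file` scheme are still refused.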