Commit Graph

5 Commits

Author SHA1 Message Date
MechaCat02
1eebb90e25 fix(crawler): unhang shutdown on lingering Arc<Browser>, silence WS noise (0.43.1)
Some checks failed
deploy / test-backend (push) Failing after 6s
deploy / test-frontend (push) Failing after 40s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
- Handle::close aborts its chromiumoxide driver task when another
  Arc<Browser> outlives the call, so shutdown returns instead of
  hanging on a stream that never terminates. Generic close_or_abort
  helper with regression tests covering both Arc paths.
- daemon.shutdown() is wrapped in a 5s timeout in main as defense
  in depth.
- Default RUST_LOG silences chromiumoxide::conn / chromiumoxide::handler
  WS-deserialize ERROR spam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:47:36 +02:00
MechaCat02
51f42b03e9 feat: default crawler browser to headless (0.32.0)
LaunchOptions::from_env() and LaunchOptions::default() now return
BrowserMode::Headless. The in-process daemon (via CrawlerConfig::from_env)
and the standalone crawler binary both pick this up — no display
required for production runs, smaller resource footprint.

`Headed` stays as an explicit opt-in via CRAWLER_BROWSER_MODE=headed
for debugging or sites that fingerprint headless Chrome. New unit test
locks the default in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:27:05 +02:00
MechaCat02
9fe0f26d75 feat: in-process crawler daemon with cron and worker pool (0.28.0)
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.

Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
  around browser::Handle, with an on_launch hook that re-injects
  PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
  /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
  used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
  idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
  marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:32:02 +02:00
MechaCat02
b1a3a4e9d3 feat: crawler manga-list & metadata sync with cover download (0.23.0)
- TargetSource: first concrete impl of the Source trait, modeled on
  the old Puppeteer crawler's selectors (+ status normalization,
  tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
  page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
  pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
  New/Updated/Unchanged + current cover_image_path for backfill
  decisions), sync_manga_chapters, mark_dropped_mangas — all
  transactional, with case-insensitive lookups and source-insertable
  genres
- Cover image download via reqwest+infer; stored under
  mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
  reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
  $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
  $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
  default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests
2026-05-21 22:04:23 +02:00
MechaCat02
26eccd0abe feat: crawler scaffold with chromium launcher (0.22.0)
- crawler module (browser, source trait, jobs, diff) + binary
- chromiumoxide launcher with fetcher feature (auto-downloads
  Chromium on first run, caches under ~/.cache/mangalord/chromium)
- LaunchOptions struct with extra_args, parseable from
  CRAWLER_BROWSER_MODE and CRAWLER_BROWSER_ARGS
- migration 0012 introduces sources, manga_sources,
  chapter_sources, crawler_jobs
- integration tests for headed + headless launch, ipify load+parse,
  and extra-args propagation (all #[ignore], opt-in)
2026-05-20 22:07:56 +02:00