feat: in-process crawler daemon with cron and worker pool (0.28.0)

The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on the
first job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of
inactivity.
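
A minimal sketch of the lazy-launch / idle-teardown idea, not the actual
BrowserManager. The `LazyManager` type, its method names, and the
injected `now` argument are hypothetical, and the real manager is async
and wraps browser::Handle; a plain generic `T` stands in for the browser.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a manager that owns an expensive resource.
pub struct LazyManager<T> {
    resource: Option<T>,
    last_used: Option<Instant>,
    idle_timeout: Duration,
    launches: usize,
}

impl<T> LazyManager<T> {
    pub fn new(idle_timeout: Duration) -> Self {
        Self { resource: None, last_used: None, idle_timeout, launches: 0 }
    }

    /// Returns the resource, launching it only if none is running.
    /// `launch` is where a per-spawn hook (for example session
    /// injection) would run in a fuller implementation.
    pub fn acquire(&mut self, now: Instant, launch: impl FnOnce() -> T) -> &mut T {
        if self.resource.is_none() {
            self.launches += 1;
        }
        self.last_used = Some(now);
        self.resource.get_or_insert_with(launch)
    }

    /// Drops the resource if it has been idle for at least the timeout.
    /// Returns true when a teardown actually happened.
    pub fn reap_if_idle(&mut self, now: Instant) -> bool {
        let idle = match self.last_used {
            Some(t) => now.saturating_duration_since(t) >= self.idle_timeout,
            None => false,
        };
        if idle && self.resource.is_some() {
            self.resource = None;
            true
        } else {
            false
        }
    }

    pub fn is_running(&self) -> bool {
        self.resource.is_some()
    }

    pub fn launches(&self) -> usize {
        self.launches
    }
}
```

Passing `now` in explicitly keeps the sketch testable without sleeping;
a periodic timer task would call `reap_if_idle` in a real daemon.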

Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
  around browser::Handle, with an on_launch hook that re-injects
  PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
  /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
  used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.
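
The trait-seam arrangement can be sketched roughly as follows. Only the
trait names MetadataPass and ChapterDispatcher come from this commit;
the method signatures, the `Daemon` struct, and the stubs are
assumptions, and the sketch is synchronous where the real seams are
presumably async.

```rust
use std::cell::RefCell;

/// Assumed shape of the metadata seam (signature is hypothetical).
pub trait MetadataPass {
    fn run(&self) -> Result<usize, String>;
}

/// Assumed shape of the chapter dispatch seam (signature is hypothetical).
pub trait ChapterDispatcher {
    fn dispatch(&self, job_id: u64) -> Result<(), String>;
}

/// The daemon depends only on the traits, never on Chromium or a source.
pub struct Daemon<M: MetadataPass, D: ChapterDispatcher> {
    pub metadata: M,
    pub dispatcher: D,
}

impl<M: MetadataPass, D: ChapterDispatcher> Daemon<M, D> {
    pub fn new(metadata: M, dispatcher: D) -> Self {
        Self { metadata, dispatcher }
    }

    /// One tick: run the metadata pass, then dispatch every queued job.
    /// Returns the number of jobs that dispatched successfully.
    pub fn tick(&self, queued: &[u64]) -> Result<usize, String> {
        self.metadata.run()?;
        Ok(queued
            .iter()
            .filter(|id| self.dispatcher.dispatch(**id).is_ok())
            .count())
    }
}

/// Test stub: a metadata pass that does nothing and succeeds.
pub struct StubPass;

impl MetadataPass for StubPass {
    fn run(&self) -> Result<usize, String> {
        Ok(0)
    }
}

/// Test stub: records every job id and fails the odd ones.
pub struct RecordingDispatcher {
    pub seen: RefCell<Vec<u64>>,
}

impl ChapterDispatcher for RecordingDispatcher {
    fn dispatch(&self, job_id: u64) -> Result<(), String> {
        self.seen.borrow_mut().push(job_id);
        if job_id % 2 == 0 {
            Ok(())
        } else {
            Err(format!("job {job_id} failed"))
        }
    }
}
```

A test can then build `Daemon::new(StubPass, RecordingDispatcher { .. })`
and assert on `seen` without a browser or network access.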

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
  idle until an operator restarts the backend with a refreshed
  PHPSESSID.
- Worker dispatch is wrapped in catch_unwind so a panicking handler
  marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.
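
The panic isolation and the sticky session flag can be illustrated with
a small synchronous sketch. `run_job`, `JobStatus`, and the `Synced`
variant are hypothetical names; the real workers are async, and how the
daemon records the job that first hits SessionExpired is not shown in
this commit, so marking it failed here is an assumption.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Only `SessionExpired` is named in the commit; `Synced` is assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum SyncOutcome {
    Synced,
    SessionExpired,
}

/// Hypothetical job result as seen by the worker loop.
#[derive(Debug, PartialEq, Eq)]
pub enum JobStatus {
    Done,
    Failed,
    Skipped,
}

/// Runs one job handler under the two guards described above.
pub fn run_job<F>(session_expired: &AtomicBool, handler: F) -> JobStatus
where
    F: FnOnce() -> SyncOutcome,
{
    // Sticky flag: once set, nothing in this process clears it, so
    // every later job is skipped until the process is restarted.
    if session_expired.load(Ordering::SeqCst) {
        return JobStatus::Skipped;
    }
    // A panic inside the handler becomes an Err here instead of
    // unwinding through the worker and killing it.
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(SyncOutcome::Synced) => JobStatus::Done,
        Ok(SyncOutcome::SessionExpired) => {
            session_expired.store(true, Ordering::SeqCst);
            JobStatus::Failed // assumption: the triggering job is failed
        }
        Err(_) => JobStatus::Failed,
    }
}
```

For example, `run_job(&flag, || -> SyncOutcome { panic!("boom") })`
returns `JobStatus::Failed` and the caller keeps running.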

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged: the queue is for the daemon, while force-refetches and
manual backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-25 20:32:02 +02:00
parent 93c7fd63fc
commit 9fe0f26d75
14 changed files with 2162 additions and 309 deletions


@@ -15,6 +15,7 @@
 //! caller-provided.
 use std::path::PathBuf;
+use std::sync::Arc;
 use anyhow::Context;
 use chromiumoxide::browser::{Browser, BrowserConfig};
@@ -95,25 +96,49 @@ pub(crate) fn parse_args(s: &str) -> Vec<String> {
 /// Owned browser plus the spawned task that drives its CDP event loop.
 /// Dropping `Handle` without calling `close` leaks the Chromium process
 /// — always call `close().await` in production paths.
+///
+/// The browser is stored behind an `Arc` so it can be shared across
+/// worker tasks (via [`Handle::shared`]) without copying. `Browser::new_page`
+/// only needs `&self`, so multiple workers can drive the same browser
+/// concurrently as long as the manager keeps the `Arc` alive.
 pub struct Handle {
-    browser: Browser,
+    browser: Arc<Browser>,
     driver: JoinHandle<()>,
 }
 impl Handle {
+    /// Borrow the browser. Equivalent to `&*handle.shared()`.
     pub fn browser(&self) -> &Browser {
         &self.browser
     }
-    pub fn browser_mut(&mut self) -> &mut Browser {
-        &mut self.browser
+    /// Clone the shared handle. Workers hold these to call `new_page`
+    /// concurrently. The browser only exits when the last `Arc<Browser>`
+    /// is dropped (kill-on-drop), or when `close()` is called on the
+    /// originating `Handle` while it is the sole holder.
+    pub fn shared(&self) -> Arc<Browser> {
+        Arc::clone(&self.browser)
     }
-    /// Closes the browser and awaits the driver task. Safe to call
-    /// multiple times — subsequent calls are no-ops.
-    pub async fn close(mut self) -> anyhow::Result<()> {
-        let _ = self.browser.close().await;
-        let _ = self.browser.wait().await;
+    /// Closes the browser and awaits the driver task. If other Arcs to
+    /// the browser are still alive we fall back to drop-kills-Chromium
+    /// semantics and just join the driver — this is the rare case where
+    /// shutdown raced an outstanding worker; the OS-level kill is the
+    /// safety net.
+    pub async fn close(self) -> anyhow::Result<()> {
+        match Arc::try_unwrap(self.browser) {
+            Ok(mut owned) => {
+                let _ = owned.close().await;
+                let _ = owned.wait().await;
+            }
+            Err(shared) => {
+                tracing::warn!(
+                    strong_count = Arc::strong_count(&shared),
+                    "Handle::close while Arc<Browser> still shared — relying on kill-on-drop"
+                );
+                drop(shared);
+            }
+        }
         let _ = self.driver.await;
         Ok(())
     }
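
Review note: the `Arc::try_unwrap` fallback in `Handle::close` can be
reproduced in isolation with the standard library alone. `Resource` and
`CloseOutcome` below are hypothetical stand-ins, not types from this
change; the sketch only shows when each match arm is taken.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a value that wants an owned, clean shutdown.
pub struct Resource {
    pub name: String,
}

/// Which path the close took (illustrative only).
#[derive(Debug, PartialEq, Eq)]
pub enum CloseOutcome {
    ClosedCleanly,
    LeftToDrop { strong_count: usize },
}

pub fn close(shared: Arc<Resource>) -> CloseOutcome {
    match Arc::try_unwrap(shared) {
        // Sole holder: we get the value back by ownership and can run
        // a graceful shutdown on it.
        Ok(owned) => {
            drop(owned);
            CloseOutcome::ClosedCleanly
        }
        // Someone else still holds a clone: try_unwrap hands our Arc
        // back, so the count includes it until we drop it.
        Err(still_shared) => {
            let strong_count = Arc::strong_count(&still_shared);
            drop(still_shared);
            CloseOutcome::LeftToDrop { strong_count }
        }
    }
}
```

With one outstanding worker clone, `close` reports a strong count of 2,
and the resource is only destroyed once that last clone is dropped.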
@@ -184,7 +209,10 @@ pub async fn launch(options: LaunchOptions) -> anyhow::Result<Handle> {
         }
     });
-    Ok(Handle { browser, driver })
+    Ok(Handle {
+        browser: Arc::new(browser),
+        driver,
+    })
 }
 fn cache_dir() -> anyhow::Result<PathBuf> {