feat(crawler): single-mode walker gated by recovery flag (0.36.0)

Collapses the crawler to a single newest-first walker and replaces the N-consecutive-unchanged streak with a per-manga rule: stop on the first manga where metadata is Unchanged AND chapter sync reports zero new chapters.

The early stop is gated by a per-source recovery flag stored in `crawler_state`: set to `false` when a run starts, back to `true` only on a clean exit (end-of-walk or intentional stop). A crashed run leaves the flag `false` automatically (no shutdown code runs), so the next tick walks the full catalog instead of bailing at the first caught-up manga.

This means a crashed mid-walk run self-heals on the next tick: the flag stays `false`, the next walk visits every page (recovering anything the crash missed past its crash point), and steady state resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check + displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing (mark_seed_completed / seed_completed_at). The recovery flag subsumes the seed-completion signal; drop detection was explicitly opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)` encodes the clean-exit truth table in its signature; `hit_limit` is deliberately absent so a future edit cannot accidentally count a caller-imposed cap as a clean exit.

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
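
The stop rule and the clean-exit truth table described in the message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `UpsertOutcome`, `is_caught_up`, and `should_stop_early` are hypothetical names, the body of `should_mark_clean_exit` is inferred from the "end-of-walk or intentional stop" wording, and the sketch assumes the flag value is read before the run resets it to `false`.

```rust
/// Result of upserting one manga's metadata (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum UpsertOutcome {
    Inserted,
    Updated,
    Unchanged,
}

/// Per-manga rule from the commit message: metadata is Unchanged AND
/// chapter sync reports zero new chapters.
pub fn is_caught_up(outcome: UpsertOutcome, new_chapters: usize) -> bool {
    outcome == UpsertOutcome::Unchanged && new_chapters == 0
}

/// The early stop only applies when the previous run exited cleanly.
/// `previous_run_clean` is assumed to be the recovery flag as it was
/// read at the start of the run, before being reset to `false`.
pub fn should_stop_early(
    previous_run_clean: bool,
    outcome: UpsertOutcome,
    new_chapters: usize,
) -> bool {
    previous_run_clean && is_caught_up(outcome, new_chapters)
}

/// Clean exit means end-of-walk or an intentional stop. There is no
/// `hit_limit` parameter, so a caller-imposed cap cannot be counted as
/// a clean exit by mistake. The body is an assumption based on the
/// commit message, not a copy of the real function.
pub fn should_mark_clean_exit(walked_to_completion: bool, hit_stop_condition: bool) -> bool {
    walked_to_completion || hit_stop_condition
}
```

Under these assumptions, a run that ends because of a crash or a caller-imposed cap has both inputs `false`, so the flag stays `false` and the next tick performs a full walk.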
@@ -8,19 +8,6 @@ pub mod target;
 
 use async_trait::async_trait;
 use chromiumoxide::browser::Browser;
-use serde::{Deserialize, Serialize};
-
-/// How a `discover` job should walk the source's index.
-#[derive(Clone, Copy, Debug, Serialize, Deserialize)]
-pub enum DiscoverMode {
-    /// Walk every index page from last back to first. Used for the
-    /// initial seed of a source.
-    Backfill,
-    /// Walk index pages from page 1 forward, stopping after
-    /// `stop_after_unchanged` consecutive mangas whose `metadata_hash`
-    /// matches storage. Used for the recurring cron tick.
-    Incremental { stop_after_unchanged: usize },
-}
 
 /// Pointer at a manga in the source's index, before we've fetched the
 /// detail page. The `source_manga_key` is whatever stable id the source
@@ -83,14 +70,14 @@ pub struct FetchContext<'a> {
 }
 
 /// Lazy iterator over discovered manga refs. The caller drives the
-/// walk one batch at a time, so it can break out as soon as a
-/// downstream stop condition is met (e.g. N consecutive Unchanged
-/// upserts in Incremental mode) without paying for pages it won't use.
+/// walk one batch at a time, so it can break out as soon as the
+/// downstream stop condition is met (the first manga where metadata is
+/// `Unchanged` and chapter sync reports zero new chapters) without
+/// paying for pages it won't use.
 ///
 /// Batches are typically one source-index page each. Within a batch
-/// refs are already in the right per-page order for the active mode
-/// (Backfill reverses each page to oldest-first; Incremental leaves
-/// the source's natural newest-first ordering).
+/// refs are in the source's natural newest-first ordering — the same
+/// `update_date DESC` sort that makes the stop condition meaningful.
 #[async_trait]
 pub trait DiscoverWalk: Send {
     /// Return the next batch of refs, or `Ok(None)` when the source has
@@ -107,16 +94,14 @@ pub trait Source: Send + Sync {
     /// Stable identifier — also the row key in the `sources` table.
     fn id(&self) -> &'static str;
 
-    /// Begin discovery in `mode`. Returns a walker the caller drives
-    /// page-by-page via `next_batch`. The initial page-1 probe (used
-    /// to determine `last_page` and warm the cache for sites that
-    /// can't be paged without knowing the bound) happens inside this
-    /// call, so a fresh walker is ready to yield its first batch
-    /// without further setup.
+    /// Begin discovery. Returns a walker the caller drives page-by-page
+    /// via `next_batch`. The initial page-1 probe (used to determine
+    /// `last_page` and warm the cache for sites that can't be paged
+    /// without knowing the bound) happens inside this call, so a fresh
+    /// walker is ready to yield its first batch without further setup.
     async fn discover(
         &self,
         ctx: &FetchContext<'_>,
-        mode: DiscoverMode,
     ) -> anyhow::Result<Box<dyn DiscoverWalk + Send>>;
 
     async fn fetch_manga(
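
To illustrate how a caller might drive the mode-less walker batch by batch, here is a simplified sketch. It is not the crawler's actual driver: the `Walk` trait is a synchronous stand-in for the async `DiscoverWalk`, refs are plain `String` keys, errors are `String` instead of `anyhow::Error`, and `VecWalk`, `WalkReport`, and `drive` are hypothetical names introduced only for this example.

```rust
/// Synchronous stand-in for `DiscoverWalk` (the real trait is async).
pub trait Walk {
    /// Next batch of refs, or `Ok(None)` when the source is exhausted.
    fn next_batch(&mut self) -> Result<Option<Vec<String>>, String>;
}

/// In-memory walker that yields one pre-built page per call (hypothetical).
pub struct VecWalk {
    pages: std::vec::IntoIter<Vec<String>>,
}

impl VecWalk {
    pub fn new(pages: Vec<Vec<String>>) -> Self {
        Self { pages: pages.into_iter() }
    }
}

impl Walk for VecWalk {
    fn next_batch(&mut self) -> Result<Option<Vec<String>>, String> {
        Ok(self.pages.next())
    }
}

/// Summary of one walk; the two booleans mirror the inputs of
/// `should_mark_clean_exit` named in the commit message.
#[derive(Debug, PartialEq, Eq)]
pub struct WalkReport {
    pub visited: usize,
    pub walked_to_completion: bool,
    pub hit_stop_condition: bool,
}

/// Drive the walker one batch at a time. `process` handles one manga and
/// returns true when it is caught up (metadata Unchanged, zero new
/// chapters). Every visited manga is processed; the early stop is only
/// honoured when `early_stop_enabled` is true.
pub fn drive<W: Walk>(
    walker: &mut W,
    early_stop_enabled: bool,
    mut process: impl FnMut(&str) -> bool,
) -> Result<WalkReport, String> {
    let mut visited = 0;
    while let Some(batch) = walker.next_batch()? {
        for key in &batch {
            visited += 1;
            let caught_up = process(key);
            if early_stop_enabled && caught_up {
                // Intentional stop: later pages are never fetched.
                return Ok(WalkReport {
                    visited,
                    walked_to_completion: false,
                    hit_stop_condition: true,
                });
            }
        }
    }
    // The source returned `None`: the whole index was walked.
    Ok(WalkReport { visited, walked_to_completion: true, hit_stop_condition: false })
}
```

With the early stop enabled, the loop returns at the first caught-up manga and skips the remaining pages. With it disabled (the recovery case), the same walker is drained to the end.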