feat: incremental crawl mode with seed-completion gate (0.33.0)

Daemon now auto-detects mode per source: Backfill until the first full walk records `seed_completed:<source>` in `crawler_state`, then Incremental (newest-first, stops after N consecutive Unchanged upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects `auto` since it has no pre-run DB state.

`Source::discover` returns a lazy `DiscoverWalk` so Incremental can break out mid-walk without prefetching pages.

The drop pass and seed marker are now gated on a true full walk — fixes a latent soft-drop of the index tail under partial sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
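The per-source mode selection described in the message can be sketched roughly as below. This is a minimal illustration, not the daemon's actual code: the `HashSet` standing in for the `crawler_state` table, the `Option` standing in for a parsed `CRAWLER_MODE`, and the helper names `seed_key` and `resolve_mode` are all assumptions made for the example.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiscoverMode {
    Backfill,
    Incremental,
}

/// Hypothetical helper: builds the `seed_completed:<source>` marker key.
fn seed_key(source: &str) -> String {
    format!("seed_completed:{source}")
}

/// Hypothetical resolver. `override_mode` models a fixed `CRAWLER_MODE`;
/// `state` models the keys present in `crawler_state` (assumed to be a set).
fn resolve_mode(
    override_mode: Option<DiscoverMode>,
    state: &HashSet<String>,
    source: &str,
) -> DiscoverMode {
    // A fixed override wins over auto-detection.
    if let Some(mode) = override_mode {
        return mode;
    }
    // Auto: Incremental only once a full walk has recorded the seed marker.
    if state.contains(&seed_key(source)) {
        DiscoverMode::Incremental
    } else {
        DiscoverMode::Backfill
    }
}
```

With an empty state and no override, `resolve_mode` yields `Backfill`; after the marker for that source is inserted it yields `Incremental`. A caller with no database to consult, as the CLI is described above, has no `state` to pass and so would need an explicit mode.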
@@ -82,21 +82,42 @@ pub struct FetchContext<'a> {
     pub rate: &'a crate::crawler::rate_limit::HostRateLimiters,
 }
 
+/// Lazy iterator over discovered manga refs. The caller drives the
+/// walk one batch at a time, so it can break out as soon as a
+/// downstream stop condition is met (e.g. N consecutive Unchanged
+/// upserts in Incremental mode) without paying for pages it won't use.
+///
+/// Batches are typically one source-index page each. Within a batch
+/// refs are already in the right per-page order for the active mode
+/// (Backfill reverses each page to oldest-first; Incremental leaves
+/// the source's natural newest-first ordering).
+#[async_trait]
+pub trait DiscoverWalk: Send {
+    /// Return the next batch of refs, or `Ok(None)` when the source has
+    /// no more pages. The walker is single-use; calling `next_batch`
+    /// after `None` is allowed and continues to return `None`.
+    async fn next_batch(
+        &mut self,
+        ctx: &FetchContext<'_>,
+    ) -> anyhow::Result<Option<Vec<SourceMangaRef>>>;
+}
+
 #[async_trait]
 pub trait Source: Send + Sync {
     /// Stable identifier — also the row key in the `sources` table.
     fn id(&self) -> &'static str;
 
-    /// Returns up to `max_results` manga refs in source order. Pass
-    /// `None` for an uncapped walk (full backfill / incremental sweep).
-    /// Implementations should stop paginating as soon as the cap is
-    /// reached so partial runs don't pay for pages they won't use.
+    /// Begin discovery in `mode`. Returns a walker the caller drives
+    /// page-by-page via `next_batch`. The initial page-1 probe (used
+    /// to determine `last_page` and warm the cache for sites that
+    /// can't be paged without knowing the bound) happens inside this
+    /// call, so a fresh walker is ready to yield its first batch
+    /// without further setup.
     async fn discover(
         &self,
         ctx: &FetchContext<'_>,
         mode: DiscoverMode,
-        max_results: Option<usize>,
-    ) -> anyhow::Result<Vec<SourceMangaRef>>;
+    ) -> anyhow::Result<Box<dyn DiscoverWalk + Send>>;
 
     async fn fetch_manga(
         &self,
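How a caller might drive such a walker in Incremental mode can be sketched as follows. This is a simplified, synchronous stand-in written only for illustration: the real `DiscoverWalk` is async and takes a `FetchContext`, while the `Walk` trait, `PagedWalk`, `Upsert`, and `drive_incremental` below are hypothetical names, and plain `u32` ids replace `SourceMangaRef`. The boolean return value is one assumed way to signal "true full walk" so that a drop pass and seed marker could be gated on it.

```rust
/// Simplified synchronous stand-in for the async `DiscoverWalk` trait.
trait Walk {
    fn next_batch(&mut self) -> Result<Option<Vec<u32>>, String>;
}

/// Hypothetical in-memory walker: each inner `Vec` models one index page.
struct PagedWalk {
    pages: std::vec::IntoIter<Vec<u32>>,
    fetched: usize, // number of pages actually pulled by the caller
}

impl Walk for PagedWalk {
    fn next_batch(&mut self) -> Result<Option<Vec<u32>>, String> {
        match self.pages.next() {
            Some(page) => {
                self.fetched += 1;
                Ok(Some(page))
            }
            // `vec::IntoIter` is fused, so this keeps returning `None`.
            None => Ok(None),
        }
    }
}

#[derive(PartialEq)]
enum Upsert {
    Changed,
    Unchanged,
}

/// Pulls batches until the walker is exhausted or `stop_after`
/// consecutive upserts come back `Unchanged`. Returns `Ok(true)` only
/// when the walk ran to the end, i.e. it was a full walk.
fn drive_incremental<W: Walk>(
    walk: &mut W,
    stop_after: usize,
    mut upsert: impl FnMut(u32) -> Upsert,
) -> Result<bool, String> {
    let mut streak = 0;
    while let Some(batch) = walk.next_batch()? {
        for id in batch {
            if upsert(id) == Upsert::Unchanged {
                streak += 1;
                if streak >= stop_after {
                    return Ok(false); // partial sweep: stop before later pages
                }
            } else {
                streak = 0; // any change resets the streak
            }
        }
    }
    Ok(true)
}
```

Because pages are pulled only on demand, an early stop leaves the remaining pages unfetched. In this sketch a caller would run the drop pass and write the seed marker only when `drive_incremental` returns `Ok(true)`.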