Collapses the crawler to a single newest-first walker and replaces the N-consecutive-unchanged streak with a per-manga rule: stop on the first manga where metadata is Unchanged AND chapter sync reports zero new chapters.

The early stop is gated by a per-source recovery flag stored in `crawler_state`: set to `false` when a run starts, back to `true` only on a clean exit (end-of-walk or intentional stop). A crashed run leaves the flag `false` automatically (no shutdown code runs), so the next tick walks the full catalog instead of bailing at the first caught-up manga. This means a crashed mid-walk run self-heals on the next tick: the flag stays `false`, the next walk visits every page (recovering anything the crash missed past its crash point), and steady state resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check + displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing (mark_seed_completed / seed_completed_at). The recovery flag subsumes the seed-completion signal; drop detection was explicitly opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)` encodes the clean-exit truth table in its signature: `hit_limit` is deliberately absent so a future edit cannot accidentally count a caller-imposed cap as a clean exit.

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
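The stop rule and clean-exit truth table described in this message can be sketched as below. This is a minimal illustration, not the code from the commit: only the name and parameters of `should_mark_clean_exit` come from the message, its body is inferred from "end-of-walk or intentional stop", and `MetadataOutcome`, `is_caught_up`, and `should_stop_early` are hypothetical names introduced here.

```rust
/// Hypothetical result of a metadata sync for one manga.
/// (Only the `Unchanged` outcome is named in the commit message.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetadataOutcome {
    New,
    Updated,
    Unchanged,
}

/// Per-manga rule: a manga is "caught up" when its metadata is
/// unchanged AND chapter sync found zero new chapters.
pub fn is_caught_up(metadata: MetadataOutcome, new_chapters: usize) -> bool {
    metadata == MetadataOutcome::Unchanged && new_chapters == 0
}

/// The early stop is only honored when the previous run exited cleanly.
/// After a crash the flag reads `false`, so the walk continues past
/// caught-up mangas and sweeps the full catalog.
pub fn should_stop_early(
    previous_run_clean: bool,
    metadata: MetadataOutcome,
    new_chapters: usize,
) -> bool {
    previous_run_clean && is_caught_up(metadata, new_chapters)
}

/// Assumed body: a run is clean if it reached end-of-walk or stopped
/// intentionally on the stop condition. There is no `hit_limit`
/// parameter, so a caller-imposed cap cannot be counted as clean.
pub fn should_mark_clean_exit(walked_to_completion: bool, hit_stop_condition: bool) -> bool {
    walked_to_completion || hit_stop_condition
}
```

Under these assumptions, a run that ends only because of a caller-imposed cap passes `false` for both arguments, so the flag stays `false` and the next tick performs a recovery sweep.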
83 lines
3.0 KiB
Rust
//! Integration tests for the per-source recovery flag:
//! `mark_run_started` / `mark_run_completed` / `last_run_completed_cleanly`
//! round-trip via the `crawler_state` table.
//!
//! End-to-end pipeline behavior (a crashed run forcing a recovery sweep
//! on the next tick) requires a real `chromiumoxide::Browser` to drive
//! the walker, so that path is covered by `crawler_browser_smoke.rs`.
//! The pure stop-condition logic itself is unit-tested in
//! `crawler::pipeline::tests`.

use mangalord::repo::crawler;
use sqlx::PgPool;

#[sqlx::test(migrations = "./migrations")]
async fn defaults_to_clean_when_no_marker(pool: PgPool) {
    // First-ever run semantics: absence of the key must NOT trigger a
    // recovery walk on a virgin DB. Treat missing as "previous run
    // completed cleanly" so the first tick can take the early-stop path.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(clean, "absent marker must read as clean");
}

#[sqlx::test(migrations = "./migrations")]
async fn mark_run_started_flips_to_false(pool: PgPool) {
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(!clean, "after mark_run_started, flag must read false");
}

#[sqlx::test(migrations = "./migrations")]
async fn started_then_completed_round_trips_to_clean(pool: PgPool) {
    // Steady-state: a run starts (flag → false) and exits cleanly
    // (flag → true). The next tick should see "clean" and apply the
    // normal stop condition.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    crawler::mark_run_completed(&pool, "target").await.unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(
        clean,
        "after start → complete the flag must round-trip to clean"
    );
}

#[sqlx::test(migrations = "./migrations")]
async fn flag_is_per_source(pool: PgPool) {
    // Two sources, only one is mid-run. The other must still report
    // clean: the crawler_state key is namespaced by source_id.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::ensure_source(&pool, "other", "O", "https://y.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    assert!(
        !crawler::last_run_completed_cleanly(&pool, "target")
            .await
            .unwrap(),
        "target is mid-run"
    );
    assert!(
        crawler::last_run_completed_cleanly(&pool, "other")
            .await
            .unwrap(),
        "other source is untouched and reads clean"
    );
}
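
The flag semantics these tests exercise can be modeled without a database. The sketch below is an in-memory stand-in, assuming the flag behaves as a per-source boolean that defaults to clean when absent; `RecoveryFlags` and its `HashMap` storage are hypothetical and only mirror the names of the three repo functions, which in the real code persist to the `crawler_state` table.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory model of the per-source recovery flag.
/// Key: source id. Value: whether the last run completed cleanly.
#[derive(Debug, Default)]
pub struct RecoveryFlags {
    completed_cleanly: HashMap<String, bool>,
}

impl RecoveryFlags {
    /// A run is starting: record "not clean" until proven otherwise.
    pub fn mark_run_started(&mut self, source_id: &str) {
        self.completed_cleanly.insert(source_id.to_owned(), false);
    }

    /// A run exited cleanly (end-of-walk or intentional stop).
    pub fn mark_run_completed(&mut self, source_id: &str) {
        self.completed_cleanly.insert(source_id.to_owned(), true);
    }

    /// A missing entry reads as clean, so a first-ever run on an empty
    /// store can take the early-stop path instead of a recovery walk.
    pub fn last_run_completed_cleanly(&self, source_id: &str) -> bool {
        self.completed_cleanly
            .get(source_id)
            .copied()
            .unwrap_or(true)
    }
}
```

In this model a crash is simply a `mark_run_started` call that is never followed by `mark_run_completed`: nothing needs to run at shutdown for the next reader to see `false`.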