Mangalord/backend/tests/crawler_recovery_flag.rs
MechaCat02 9f56f283d4 feat(crawler): single-mode walker gated by recovery flag (0.36.0)
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.
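
The gated stop rule described above can be sketched as a pure predicate. This is a minimal illustration, not the code from this commit: `MetadataOutcome` and `should_stop_at` are hypothetical names, and the real walker may carry more state.

```rust
/// Result of refreshing one manga's metadata.
/// Hypothetical type for illustration; the real enum may have more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MetadataOutcome {
    Changed,
    Unchanged,
}

/// Returns true when the walker may stop at the current manga.
///
/// `last_run_clean` is the per-source recovery flag as read at the start
/// of the run. When it is false (the previous run crashed), the early stop
/// is disabled and the walk continues through the full catalog.
fn should_stop_at(
    last_run_clean: bool,
    metadata: MetadataOutcome,
    new_chapters: usize,
) -> bool {
    // Per-manga rule: metadata unchanged AND zero new chapters.
    let caught_up = metadata == MetadataOutcome::Unchanged && new_chapters == 0;
    last_run_clean && caught_up
}
```

Keeping the flag as an explicit argument means the same predicate covers both the steady-state walk and the recovery sweep.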

A crashed mid-walk run therefore self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crashed run never reached), and steady state resumes
once the recovery sweep reaches end-of-walk.
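
The self-healing sequence can be modeled in memory. The sketch below assumes a `HashMap` stand-in for the `crawler_state` table; `FlagStore` and `run_tick` are hypothetical names, and a non-crashing tick is assumed to end in a clean exit.

```rust
use std::collections::HashMap;

/// In-memory stand-in for the `crawler_state` table.
/// Assumption: the real store is a database table keyed per source.
#[derive(Default)]
struct FlagStore {
    clean: HashMap<String, bool>,
}

impl FlagStore {
    fn mark_run_started(&mut self, source: &str) {
        self.clean.insert(source.to_string(), false);
    }

    fn mark_run_completed(&mut self, source: &str) {
        self.clean.insert(source.to_string(), true);
    }

    /// A missing entry reads as clean, so a first-ever run may stop early.
    fn last_run_completed_cleanly(&self, source: &str) -> bool {
        self.clean.get(source).copied().unwrap_or(true)
    }
}

/// Simulates one crawler tick and returns whether the early stop was
/// allowed during that tick.
fn run_tick(store: &mut FlagStore, source: &str, crashes: bool) -> bool {
    let early_stop_allowed = store.last_run_completed_cleanly(source);
    store.mark_run_started(source);
    if crashes {
        // A crash runs no shutdown code, so the flag stays false.
        return early_stop_allowed;
    }
    store.mark_run_completed(source);
    early_stop_allowed
}
```

Running three ticks in a row (crash, recovery, steady state) yields early-stop permissions of true, false, true: only the tick after the crash is forced into a full sweep.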

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
  displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
  (mark_seed_completed / seed_completed_at). The recovery flag
  subsumes the seed-completion signal; drop detection was explicitly
  opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
  CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.
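
A plausible body for that function, inferred from the "end-of-walk or intentional stop" wording above, is a single disjunction. This is a sketch under that assumption; `WalkOutcome` and `is_clean` are hypothetical and only show how a caller would be unable to pass `hit_limit`.

```rust
/// Sketch: a run counts as clean if it reached end-of-walk or stopped
/// intentionally at a caught-up manga. Assumed body, inferred from the
/// commit message rather than copied from the source.
fn should_mark_clean_exit(walked_to_completion: bool, hit_stop_condition: bool) -> bool {
    walked_to_completion || hit_stop_condition
}

/// Hypothetical summary of how a walk ended.
#[allow(dead_code)]
struct WalkOutcome {
    walked_to_completion: bool,
    hit_stop_condition: bool,
    /// Caller-imposed cap reached; never consulted for the clean-exit decision.
    hit_limit: bool,
}

/// Hypothetical caller: there is no parameter through which `hit_limit`
/// could influence the result.
fn is_clean(outcome: &WalkOutcome) -> bool {
    should_mark_clean_exit(outcome.walked_to_completion, outcome.hit_stop_condition)
}
```

A run that ends only because of a cap reports `false` here, so the next tick would perform a recovery sweep.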

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:49:28 +02:00

83 lines
3.0 KiB
Rust

//! Integration tests for the per-source recovery flag:
//! `mark_run_started` / `mark_run_completed` / `last_run_completed_cleanly`
//! round-trip via the `crawler_state` table.
//!
//! End-to-end pipeline behavior (a crashed run forcing a recovery sweep
//! on the next tick) requires a real `chromiumoxide::Browser` to drive
//! the walker, so that path is covered by `crawler_browser_smoke.rs`.
//! The pure stop-condition logic itself is unit-tested in
//! `crawler::pipeline::tests`.

use mangalord::repo::crawler;
use sqlx::PgPool;

#[sqlx::test(migrations = "./migrations")]
async fn defaults_to_clean_when_no_marker(pool: PgPool) {
    // First-ever run semantics: absence of the key must NOT trigger a
    // recovery walk on a virgin DB. Treat missing as "previous run
    // completed cleanly" so the first tick can take the early-stop path.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(clean, "absent marker must read as clean");
}

#[sqlx::test(migrations = "./migrations")]
async fn mark_run_started_flips_to_false(pool: PgPool) {
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(!clean, "after mark_run_started, flag must read false");
}

#[sqlx::test(migrations = "./migrations")]
async fn started_then_completed_round_trips_to_clean(pool: PgPool) {
    // Steady-state: a run starts (flag → false) and exits cleanly
    // (flag → true). The next tick should see "clean" and apply the
    // normal stop condition.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    crawler::mark_run_completed(&pool, "target").await.unwrap();
    let clean = crawler::last_run_completed_cleanly(&pool, "target")
        .await
        .unwrap();
    assert!(
        clean,
        "after start → complete the flag must round-trip to clean"
    );
}

#[sqlx::test(migrations = "./migrations")]
async fn flag_is_per_source(pool: PgPool) {
    // Two sources, only one is mid-run. The other must still report
    // clean — the crawler_state key is namespaced by source_id.
    crawler::ensure_source(&pool, "target", "T", "https://x.example")
        .await
        .unwrap();
    crawler::ensure_source(&pool, "other", "O", "https://y.example")
        .await
        .unwrap();
    crawler::mark_run_started(&pool, "target").await.unwrap();
    assert!(
        !crawler::last_run_completed_cleanly(&pool, "target")
            .await
            .unwrap(),
        "target is mid-run"
    );
    assert!(
        crawler::last_run_completed_cleanly(&pool, "other")
            .await
            .unwrap(),
        "other source is untouched and reads clean"
    );
}