fix(crawler): close walker race against site reordering (0.35.1)

The target site orders by update_date DESC, and any new or updated manga
pushes everyone down by one slot. The paginated walker was blind to this
drift:

* Backfill (page last -> 1): shifts push items into pages already
  finished. The displaced manga was silently missed; with
  mark_dropped_mangas running on a fully-completed walk, items even got
  false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item of an
  already-read page to reappear on the next page, leading to a redundant
  fetch_manga and an inflated consecutive_unchanged streak.

Fix is two-pronged:

1. Backfill boundary re-check. After fetching each page P, re-fetch the
   previously-walked page P+1 and check where its old slot-0 key now
   sits. If it slid to slot K, the first K entries are items that used
   to live on P and slid past us; they get appended to the batch. If the
   anchor is gone entirely (multi-page shift or it was bumped to page 1),
   the whole re-fetched page is processed conservatively and the
   pipeline dedup absorbs the noise. The re-check must be the *last*
   navigation of the iteration to close the within-iteration race.
2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
   source_manga_keys avoids double-processing. The set uses a
   contains-then-insert pattern with insert firing *after* a successful
   upsert, so a transient fetch/upsert failure leaves the key retryable
   if it reappears later in the same pass (via the boundary re-check or
   another batch).

Incremental mode does not run the re-check (shifts move in the same
direction as the walk); only the dedup helps it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
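
The backfill boundary re-check described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the walker code from this commit (which is not part of the diff shown here): `displaced_refs` and `page` are hypothetical helpers, pages are modelled as plain `Vec<String>` of keys, and the page size of 3 is arbitrary.

```rust
/// Hypothetical helper: given page P+1 as re-fetched after walking page P,
/// and the key that sat at slot 0 of P+1 when it was first walked (the
/// anchor), return the refs that should be appended to the current batch.
fn displaced_refs<'a>(refetched: &'a [String], anchor: &str) -> &'a [String] {
    match refetched.iter().position(|k| k.as_str() == anchor) {
        // Anchor slid to slot K: the first K entries slid in from page P.
        Some(k) => &refetched[..k],
        // Anchor gone: process the whole page conservatively and let the
        // run-scoped dedup absorb any repeats.
        None => refetched,
    }
}

/// Hypothetical stand-in for fetching one 1-indexed page of the site listing.
fn page(site: &[&str], number: usize, size: usize) -> Vec<String> {
    site.chunks(size)
        .nth(number - 1)
        .map(|chunk| chunk.iter().map(|s| s.to_string()).collect())
        .unwrap_or_default()
}

fn main() {
    // Listing when the backfill walks page 3 (the last page) first.
    let before = ["a", "b", "c", "d", "e", "f", "g", "h", "i"];
    let anchor = page(&before, 3, 3)[0].clone(); // slot 0 of page 3

    // A new manga "n" lands on top before page 2 is fetched.
    let after = ["n", "a", "b", "c", "d", "e", "f", "g", "h", "i"];
    let page2 = page(&after, 2, 3);

    // "f" used to live on page 2 but has slid onto the finished page 3.
    assert!(!page2.contains(&"f".to_string()));

    // Re-check: re-fetch page 3 and look for the old anchor.
    let page3_again = page(&after, 3, 3);
    let displaced = displaced_refs(&page3_again, &anchor);
    assert_eq!(displaced.to_vec(), vec!["f".to_string()]);
    println!("displaced refs appended to batch: {:?}", displaced);
}
```

In this toy run the anchor slides from slot 0 to slot 1, so exactly one entry is recovered. When the anchor is missing, the sketch returns the whole re-fetched page, mirroring the conservative fallback described above.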
@@ -2,6 +2,8 @@
//! that fan out chapter-content work. Shared between the daemon (cron tick)
//! and the CLI (`bin/crawler.rs`) so behavior stays in lockstep.

use std::collections::HashSet;

use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
@@ -105,6 +107,14 @@ pub async fn run_metadata_pass(
        .context("discover failed")?;

    let mut stats = MetadataStats::default();
    // Run-scoped dedup of `source_manga_key`s already processed this pass.
    // Backfill: the walker may append displaced refs that also appear on
    // the page we're about to visit naturally; skipping the dup avoids
    // redundant fetch_manga + upsert. Incremental: a shift causes the
    // slot-last item of the page we just read to reappear at slot 0 of
    // the next page; skipping it preserves the consecutive_unchanged
    // streak math instead of inflating it with a re-confirm.
    let mut seen: HashSet<String> = HashSet::new();
    let mut consecutive_unchanged: usize = 0;
    let mut walked_to_completion = false;
    let mut hit_limit = false;
@@ -124,6 +134,23 @@ pub async fn run_metadata_pass(
                tracing::info!(cap = ?max_refs, "max_results reached; halting walk");
                break 'outer;
            }
            // Skip refs we've already *successfully* processed this pass.
            // Checking `contains` here (rather than `insert`) keeps the key
            // out of `seen` on failure paths below, so a transient fetch or
            // upsert error gets a second chance if the ref reappears via the
            // backfill boundary re-check or another batch. Done *before*
            // counting toward `stats.discovered` (the skipped ref did no
            // work) and *before* touching `consecutive_unchanged` (a
            // `continue` here preserves the streak rather than resetting or
            // inflating it). The matching `seen.insert(...)` lives just
            // after the successful upsert below.
            if seen.contains(&r.source_manga_key) {
                tracing::debug!(
                    key = %r.source_manga_key,
                    "skip already-seen key in this run"
                );
                continue;
            }
            stats.discovered += 1;
            tracing::info!(
                idx = stats.discovered,
@@ -161,6 +188,10 @@ pub async fn run_metadata_pass(
                }
            };
            stats.upserted += 1;
            // Record success in the dedup set. Cover and chapter-sync
            // failures below are non-fatal and don't roll this back —
            // metadata is the durable source of truth for the dedup.
            seen.insert(r.source_manga_key.clone());
            tracing::info!(
                key = %manga.source_manga_key,
                manga_id = %upsert.manga_id,
@@ -473,4 +504,31 @@ mod tests {
        };
        assert!(should_stop(mode, 0));
    }

    #[test]
    fn run_scoped_seen_set_skips_duplicate_source_manga_keys() {
        // Pins the per-ref loop contract: `contains` gates whether work
        // runs, and `insert` only fires on the success path (after upsert).
        // A failed ref that reappears later in the same pass must get a
        // second chance — that's why the loop uses contains-then-insert
        // instead of insert-and-skip-on-collision.
        let mut seen: HashSet<String> = HashSet::new();

        // First sighting of a key: not yet seen → loop proceeds.
        assert!(!seen.contains("manga-a"), "first sighting is unseen");
        // Simulate a failed fetch_manga: do NOT insert. Next sighting must
        // still be considered unseen so the loop retries it.
        assert!(!seen.contains("manga-a"), "failed key is still retryable");

        // Now simulate a successful upsert — insert is called.
        seen.insert("manga-a".to_string());
        // Subsequent sightings of the same key are skipped.
        assert!(seen.contains("manga-a"), "successful key is now seen");

        // Distinct keys never collide.
        assert!(!seen.contains("manga-b"), "different key independent");
        seen.insert("manga-b".to_string());
        assert!(seen.contains("manga-b"));
        assert!(seen.contains("manga-a"), "first key still recorded");
    }
}
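
For readers who want to see the contains-then-insert contract exercised through a loop rather than on a bare `HashSet`, here is a small self-contained sketch. It is not the code from `run_metadata_pass`: `run_pass`, `PassStats`, its `skipped` counter, and the `try_upsert` closure (standing in for fetch_manga plus upsert) are all hypothetical names introduced for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical counters; `skipped` is added here only to make the dedup
/// visible and is not claimed to exist in the real stats struct.
#[derive(Debug, Default, PartialEq)]
struct PassStats {
    discovered: usize,
    upserted: usize,
    skipped: usize,
}

/// Sketch of the per-ref loop: `contains` gates the work, and `insert`
/// fires only after `try_upsert` succeeds.
fn run_pass<F>(refs: &[&str], mut try_upsert: F) -> PassStats
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut stats = PassStats::default();
    for &key in refs {
        if seen.contains(key) {
            // Already processed successfully this pass: do no work.
            stats.skipped += 1;
            continue;
        }
        stats.discovered += 1;
        if try_upsert(key).is_err() {
            // Failure path: the key stays out of `seen`, so it is retried
            // if it reappears later in the same pass.
            continue;
        }
        stats.upserted += 1;
        seen.insert(key.to_string());
    }
    stats
}

fn main() {
    // "manga-a" fails once (transient) and then reappears, e.g. via the
    // boundary re-check; "manga-b" succeeds and then reappears as a dup.
    let refs = ["manga-a", "manga-b", "manga-a", "manga-b"];
    let mut a_attempts = 0;
    let stats = run_pass(&refs, |key| {
        if key == "manga-a" {
            a_attempts += 1;
            if a_attempts == 1 {
                return Err("transient fetch failure".to_string());
            }
        }
        Ok(())
    });
    assert_eq!(a_attempts, 2); // the failed key got its second chance
    assert_eq!(stats, PassStats { discovered: 3, upserted: 2, skipped: 1 });
    println!("{:?}", stats);
}
```

With an insert-and-skip-on-collision loop, the second sighting of `manga-a` would have been skipped despite the earlier failure, which is the behavior this pattern is meant to avoid.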