Commit Graph

4 Commits

Author SHA1 Message Date
MechaCat02
e93eec89e5 fix(crawler): queue chapter content in ascending number order (0.51.1)
Both enqueue paths now order by chapters.number so the cron tick and the
bookmark hook insert jobs from chapter 1 upward instead of source-discovery
or random-UUID order. The lease query tiebreaks on created_at so jobs
sharing a batch's scheduled_at come off the queue in insertion order,
propagating the enqueue intent through to dequeue. Concurrent workers
and per-CDN latency can still drift actual completion order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 21:13:51 +02:00
MechaCat02
9f56f283d4 feat(crawler): single-mode walker gated by recovery flag (0.36.0)
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.

This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
  displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
  (mark_seed_completed / seed_completed_at). The recovery flag
  subsumes the seed-completion signal; drop detection was explicitly
  opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
  CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:49:28 +02:00
MechaCat02
766c6eebac fix(crawler): guard ack_done/ack_failed/release on state='running' (0.35.3)
The three lease-ack functions matched their UPDATE on the job id
alone. If a lease expired and another worker re-leased the row, a
late ack from the original worker would clobber the new lease's
state, leased_until, and (for release) decrement its attempts.

Add `AND state = 'running'` to each UPDATE and log a warn when
rows_affected is zero, so a stolen lease shows up in telemetry without
blocking the new lease holder's progress.

Three new integration tests pin the contract:
- ack_done_no_ops_when_lease_was_stolen
- ack_failed_no_ops_when_state_is_not_running
- release_no_ops_when_state_is_not_running

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:42:18 +02:00
MechaCat02
93c7fd63fc feat: crawler job queue ops and dedup index (0.27.0)
Adds enqueue / lease / ack_done / ack_failed / release / reap_done on
crawler::jobs, backed by the existing crawler_jobs table. lease() uses
a single FOR UPDATE SKIP LOCKED CTE that also re-claims stale running
rows (crashed-worker recovery), and ack_failed applies an exponential
backoff capped at 1h before retrying.

Migration 0014 adds a partial unique index on
(payload->>'chapter_id') restricted to (pending|running)
sync_chapter_content jobs, so producers can just
INSERT ... ON CONFLICT DO NOTHING without racing each other. The slot
frees again the moment the job leaves the in-flight states, so a
future force-refetch can re-enqueue.

Library-only — no daemon, no API hook. Those land in the next two
phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:59:09 +02:00