Commit Graph

8 Commits

Author SHA1 Message Date
MechaCat02
9f56f283d4 feat(crawler): single-mode walker gated by recovery flag (0.36.0)
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.

This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
  displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
  (mark_seed_completed / seed_completed_at). The recovery flag
  subsumes the seed-completion signal; drop detection was explicitly
  opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
  CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:49:28 +02:00
MechaCat02
c686d6eb51 fix(crawler): sentinel-gate parse_chapter_list to stop false drops (0.35.2)
parse_chapter_list previously returned Vec::new() on any selector
miss. The empty list flowed into sync_manga_chapters, whose soft-drop
branch then flipped every existing chapter's dropped_at to NOW().
Bookmarks subsequently pointed at dropped sources, and
enqueue_bookmarked_pending (filters on cs.dropped_at IS NULL) silently
stopped re-fetching pages.

Same shape as the walker race fixed in 0.35.1: a transient parse miss
masquerading as "source removed everything" → false soft-drop.

Fix: require #chapter_table in the DOM. Present-but-empty is preserved
as Ok(vec![]) so a freshly added manga with no published chapters
still parses cleanly. Absent table is now Transient — the job system
reschedules with backoff instead of treating the partial render as
data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:41:47 +02:00
MechaCat02
dea9b1aaa8 fix(crawler): close walker race against site reordering (0.35.1)
The target site orders by update_date DESC, and any new or updated
manga pushes everyone down by one slot. The paginated walker was
blind to this drift:

* Backfill (page last -> 1): shifts push items into pages already
  finished. The displaced manga was silently missed; with
  mark_dropped_mangas running on a fully-completed walk, items even
  got false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item
  of an already-read page to reappear on the next page, leading to
  a redundant fetch_manga and an inflated consecutive_unchanged
  streak.

Fix is two-pronged:

1. Backfill boundary re-check. After fetching each page P, re-fetch
   the previously-walked page P+1 and check where its old slot-0
   key now sits. If it slid to slot K, the first K entries are
   items that used to live on P and slid past us; they get appended
   to the batch. If the anchor is gone entirely (multi-page shift
   or it was bumped to page 1), the whole re-fetched page is
   processed conservatively and the pipeline dedup absorbs the
   noise. The re-check must be the *last* navigation of the
   iteration to close the within-iteration race.

2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
   source_manga_keys avoids double-processing. The set uses a
   contains-then-insert pattern with insert firing *after* a
   successful upsert, so a transient fetch/upsert failure leaves
   the key retryable if it reappears later in the same pass (via
   the boundary re-check or another batch).

Incremental mode does not run the re-check (shifts move in the
same direction as the walk); only the dedup helps it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:14:01 +02:00
MechaCat02
45ce0d8f12 feat: incremental crawl mode with seed-completion gate (0.33.0)
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.

`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 06:41:26 +02:00
MechaCat02
9ff49166a5 feat: transient-page detection across the crawler (0.30.0)
Until now, when the target site returned its 403 "we're sorry, the
request file are not found" response on a page that actually exists,
selectors matched nothing and the crawler treated the page as
"legitimately empty". Pagination walks silently dropped whole pages
worth of mangas, fetch_manga skipped individual entries, and the
startup session probe blamed PHPSESSID for what was a site hiccup.

This branch adds a single detection layer that the whole pipeline
routes through:

- `crawler::detect`: PageError::Transient typed signal, plus two
  primitives (`is_broken_page_body` matches the universal 403 body;
  `has_logo_sentinel` asserts #logo, the site-wide header element)
  and a `retry_on_transient` helper that retries a closure on
  Transient with a small attempt budget.
- `navigate()` screens every fetched body for the broken-page
  signature before handing it to a selector.
- Parsers (`parse_manga_list_from`, `parse_manga_detail`,
  `parse_chapter_pages`) check their structural sentinels (#logo for
  full-layout pages; a#pic_container for the reader, which doesn't
  render #logo) and return Result<_, PageError>. Empty Vec is now
  reserved for genuinely empty pages.
- `discover()` retries each pagination page up to 3× (2s apart) before
  failing the whole Discover job — at which point the existing job
  system's retry/backoff takes over for longer outages.
- `verify_session` is three-state: broken-page → retry probe;
  #logo present but #avatar_menu absent → genuine logout (the only
  state that should blame PHPSESSID); both present → ok.

Test coverage added at the helper level: 13 unit tests for the
detection module (body signature, logo sentinel, PageError, retry
helper), parser-level tests for both transient and legitimately-empty
inputs, and 6 unit tests for the session probe classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:47:21 +02:00
MechaCat02
d24e68c78d feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:28:36 +02:00
MechaCat02
c51353ead3 bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1)
Listing links point at the reader's page 1
(`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took
the last URL segment and returned `"pg-1"` for every chapter, so all
parsed chapters collapsed onto a single `chapter_sources` row downstream
and the first-manga chapter was the only row that survived. New
`derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before
picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`).

Notices, hiatus rows, and duplicate-numbered chapters are preserved as
distinct parser entries. The (manga_id, number) UNIQUE collapse in the
chapters table is a separate, follow-up concern handled in
feat/chapter-id-routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:15:36 +02:00
MechaCat02
b1a3a4e9d3 feat: crawler manga-list & metadata sync with cover download (0.23.0)
- TargetSource: first concrete impl of the Source trait, modeled on
  the old Puppeteer crawler's selectors (+ status normalization,
  tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
  page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
  pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
  New/Updated/Unchanged + current cover_image_path for backfill
  decisions), sync_manga_chapters, mark_dropped_mangas — all
  transactional, with case-insensitive lookups and source-insertable
  genres
- Cover image download via reqwest+infer; stored under
  mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
  reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
  $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
  $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
  default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests
2026-05-21 22:04:23 +02:00