Mangalord

fabi/Mangalord

Fork 0

Commit Graph

Author	SHA1	Message	Date
MechaCat02	c320eda7cd	chore: dedupe is_unique_violation, lift SQL into repo, centralise URL parsing Three layering cleanups from REVIEW.md §5 / §3: - Drop the three private `is_unique_violation` helpers in repo::{user,chapter,bookmark} in favour of sqlx 0.8's `DatabaseError::is_unique_violation()` method (already used by repo::collection). - Remove the unreachable 23505 branch in repo::chapter::create — the (manga_id, number) UNIQUE was dropped in 0013, so the defensive arm could no longer fire. A doc note records what to do if uniqueness is re-added. - Move three inline SQL queries out of handlers/daemon into repo functions: bookmarks' chapter-belongs-to-manga guard (`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup (`repo::chapter::dispatch_target`), and the daemon's page_count safety net (`repo::chapter::page_count`). Restores the handlers→repo layering invariant in CLAUDE.md. - New `crawler::url_utils` module consolidates host_of / origin_of / registrable_domain — they used to live in three crawler submodules with diverging edge-case behaviour. Tests moved with them. - Doc cross-references on repo::author::set_for_manga and repo::genre::set_for_manga pointing to the crawler's name-keyed variants, so the intentional duplication is discoverable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 20:24:05 +02:00
MechaCat02	9ff49166a5	feat: transient-page detection across the crawler (0.30.0) Until now, when the target site returned its 403 "we're sorry, the request file are not found" response on a page that actually exists, selectors matched nothing and the crawler treated the page as "legitimately empty". Pagination walks silently dropped whole pages worth of mangas, fetch_manga skipped individual entries, and the startup session probe blamed PHPSESSID for what was a site hiccup. This branch adds a single detection layer that the whole pipeline routes through: - `crawler::detect`: PageError::Transient typed signal, plus two primitives (`is_broken_page_body` matches the universal 403 body; `has_logo_sentinel` asserts #logo, the site-wide header element) and a `retry_on_transient` helper that retries a closure on Transient with a small attempt budget. - `navigate()` screens every fetched body for the broken-page signature before handing it to a selector. - Parsers (`parse_manga_list_from`, `parse_manga_detail`, `parse_chapter_pages`) check their structural sentinels (#logo for full-layout pages; a#pic_container for the reader, which doesn't render #logo) and return Result<_, PageError>. Empty Vec is now reserved for genuinely empty pages. - `discover()` retries each pagination page up to 3× (2s apart) before failing the whole Discover job — at which point the existing job system's retry/backoff takes over for longer outages. - `verify_session` is three-state: broken-page → retry probe; #logo present but #avatar_menu absent → genuine logout (the only state that should blame PHPSESSID); both present → ok. Test coverage added at the helper level: 13 unit tests for the detection module (body signature, logo sentinel, PageError, retry helper), parser-level tests for both transient and legitimately-empty inputs, and 6 unit tests for the session probe classifier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 22:47:21 +02:00
MechaCat02	d24e68c78d	feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0) After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie. Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run. New crawler modules: - `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host. - `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at startup to fail fast on a bad/expired cookie. - `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync. bin/crawler additions: - Concurrent chapter content phase via `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers — chromiumoxide allows concurrent `new_page` on `&self` — and per-host rate limit gates total RPS regardless of worker count. - reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches. - New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`. - Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; first failure aborts the phase with a cookie-refresh message. Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch — that needs a daemon consumer of crawler_jobs, which doesn't exist yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:28:36 +02:00

Author

SHA1

Message

Date

MechaCat02

c320eda7cd

chore: dedupe is_unique_violation, lift SQL into repo, centralise URL parsing

Three layering cleanups from REVIEW.md §5 / §3:

- Drop the three private `is_unique_violation` helpers in
  repo::{user,chapter,bookmark} in favour of sqlx 0.8's
  `DatabaseError::is_unique_violation()` method (already used by
  repo::collection).
- Remove the unreachable 23505 branch in repo::chapter::create — the
  (manga_id, number) UNIQUE was dropped in 0013, so the defensive arm
  could no longer fire. A doc note records what to do if uniqueness
  is re-added.
- Move three inline SQL queries out of handlers/daemon into repo
  functions: bookmarks' chapter-belongs-to-manga guard
  (`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup
  (`repo::chapter::dispatch_target`), and the daemon's page_count
  safety net (`repo::chapter::page_count`). Restores the
  handlers→repo layering invariant in CLAUDE.md.
- New `crawler::url_utils` module consolidates host_of / origin_of /
  registrable_domain — they used to live in three crawler submodules
  with diverging edge-case behaviour. Tests moved with them.
- Doc cross-references on repo::author::set_for_manga and
  repo::genre::set_for_manga pointing to the crawler's name-keyed
  variants, so the intentional duplication is discoverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-28 20:24:05 +02:00

MechaCat02

9ff49166a5

feat: transient-page detection across the crawler (0.30.0)

Until now, when the target site returned its 403 "we're sorry, the
request file are not found" response on a page that actually exists,
selectors matched nothing and the crawler treated the page as
"legitimately empty". Pagination walks silently dropped whole pages
worth of mangas, fetch_manga skipped individual entries, and the
startup session probe blamed PHPSESSID for what was a site hiccup.

This branch adds a single detection layer that the whole pipeline
routes through:

- `crawler::detect`: PageError::Transient typed signal, plus two
  primitives (`is_broken_page_body` matches the universal 403 body;
  `has_logo_sentinel` asserts #logo, the site-wide header element)
  and a `retry_on_transient` helper that retries a closure on
  Transient with a small attempt budget.
- `navigate()` screens every fetched body for the broken-page
  signature before handing it to a selector.
- Parsers (`parse_manga_list_from`, `parse_manga_detail`,
  `parse_chapter_pages`) check their structural sentinels (#logo for
  full-layout pages; a#pic_container for the reader, which doesn't
  render #logo) and return Result<_, PageError>. Empty Vec is now
  reserved for genuinely empty pages.
- `discover()` retries each pagination page up to 3× (2s apart) before
  failing the whole Discover job — at which point the existing job
  system's retry/backoff takes over for longer outages.
- `verify_session` is three-state: broken-page → retry probe;
  #logo present but #avatar_menu absent → genuine logout (the only
  state that should blame PHPSESSID); both present → ok.

Test coverage added at the helper level: 13 unit tests for the
detection module (body signature, logo sentinel, PageError, retry
helper), parser-level tests for both transient and legitimately-empty
inputs, and 6 unit tests for the session probe classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-26 22:47:21 +02:00

MechaCat02

d24e68c78d

feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)

After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 00:28:36 +02:00

3 Commits