Mangalord

fabi/Mangalord

Fork 0

Commit Graph

Author	SHA1	Message	Date
MechaCat02	d24e68c78d	feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0) After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie. Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run. New crawler modules: - `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host. - `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at startup to fail fast on a bad/expired cookie. - `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync. bin/crawler additions: - Concurrent chapter content phase via `futures_util::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers — chromiumoxide allows concurrent `new_page` on `&self` — and per-host rate limit gates total RPS regardless of worker count. - reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches. - New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`. - Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; first failure aborts the phase with a cookie-refresh message. Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch — that needs a daemon consumer of crawler_jobs, which doesn't exist yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:28:36 +02:00
MechaCat02	c51353ead3	bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1) Listing links point at the reader's page 1 (`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took the last URL segment and returned `"pg-1"` for every chapter, so all parsed chapters collapsed onto a single `chapter_sources` row downstream and the first-manga chapter was the only row that survived. New `derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`). Notices, hiatus rows, and duplicate-numbered chapters are preserved as distinct parser entries. The (manga_id, number) UNIQUE collapse in the chapters table is a separate, follow-up concern handled in feat/chapter-id-routing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 23:15:36 +02:00
MechaCat02	b1a3a4e9d3	feat: crawler manga-list & metadata sync with cover download (0.23.0) - TargetSource: first concrete impl of the Source trait, modeled on the old Puppeteer crawler's selectors (+ status normalization, tag-count stripping, chapter list) - DiscoverMode::Backfill walks pagination last->1, reverse within each page (oldest-first); Incremental walks forward - RateLimiter (tokio-time aware) plumbed through FetchContext so the pagination walk honors the same per-host budget as the outer loop - repo::crawler: ensure_source, upsert_manga_from_source (returns New/Updated/Unchanged + current cover_image_path for backfill decisions), sync_manga_chapters, mark_dropped_mangas — all transactional, with case-insensitive lookups and source-insertable genres - Cover image download via reqwest+infer; stored under mangas/{id}/cover.{ext} via the Storage trait - Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and reqwest::Proxy::all (HTTP/HTTPS/SOCKS5) - Crawler binary: positional start URL or $CRAWLER_START_URL, $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs), $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS - Silences chromiumoxide 0.7's known CDP deserialize log spam via default tracing filter + CdpError::Serde downgrade - 9 sqlx integration tests + 11 selector/rate-limit unit tests	2026-05-21 22:04:23 +02:00

Author

SHA1

Message

Date

MechaCat02

d24e68c78d

feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)

After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 00:28:36 +02:00

MechaCat02

c51353ead3

bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1)

Listing links point at the reader's page 1
(`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took
the last URL segment and returned `"pg-1"` for every chapter, so all
parsed chapters collapsed onto a single `chapter_sources` row downstream
and the first-manga chapter was the only row that survived. New
`derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before
picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`).

Notices, hiatus rows, and duplicate-numbered chapters are preserved as
distinct parser entries. The (manga_id, number) UNIQUE collapse in the
chapters table is a separate, follow-up concern handled in
feat/chapter-id-routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-22 23:15:36 +02:00

MechaCat02

b1a3a4e9d3

feat: crawler manga-list & metadata sync with cover download (0.23.0)

- TargetSource: first concrete impl of the Source trait, modeled on
  the old Puppeteer crawler's selectors (+ status normalization,
  tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
  page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
  pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
  New/Updated/Unchanged + current cover_image_path for backfill
  decisions), sync_manga_chapters, mark_dropped_mangas — all
  transactional, with case-insensitive lookups and source-insertable
  genres
- Cover image download via reqwest+infer; stored under
  mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
  reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
  $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
  $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
  default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests

2026-05-21 22:04:23 +02:00

3 Commits