fix(crawler): walk list pages incrementally; stop on empty page (0.45.1) #4

Merged
fabi merged 1 commit from bugfix/crawler-incremental-pagination into main 2026-05-31 16:37:14 +00:00
Owner

Reviewed & approved. Incremental page walk (1,2,3... until empty) fixes chapter loss when the site's pagination links under-report the true last page.

  • Compiles cleanly: removed VecDeque import + parse_last_page/build_page_order + their 5 unit tests, no dangling refs; collapse_whitespace still used; no test asserts the old walk behavior.
  • Termination bounded in normal operation by max_refs + the incremental should_stop + the run-scoped seen dedup. Lockstep 0.45.1.
  • Non-blocking note: the walk has no hard upper bound — on a full/unclean run with limit=0 it relies on the site returning an empty page past the end (documented). A MAX_PAGES backstop would close that corner; offered separately rather than silently overriding the documented design.
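
For reference, the backstop mentioned in the last note could look roughly like the sketch below. It is not the crawler's code: `walk`, `Stop`, the `fetch_page` closure and the value of `MAX_PAGES` are hypothetical stand-ins, and treating a limit of `0` as "unlimited" is an assumption taken from the `limit=0` wording above. The real `should_stop` check is left out because this PR page does not describe its semantics.

```rust
use std::collections::HashSet;

/// Hypothetical hard cap on pages walked; the PR itself defines no such constant.
const MAX_PAGES: u32 = 500;

/// Why the walk ended.
#[derive(Debug, PartialEq)]
enum Stop {
    EmptyPage,
    MaxRefs,
    MaxPages,
}

/// Walks pages 1, 2, 3, ... and collects keys not yet seen in this run.
/// `fetch_page` stands in for "fetch and parse one list page".
/// `max_refs == 0` is assumed to mean "no limit".
fn walk<F>(mut fetch_page: F, max_refs: usize) -> (Vec<String>, Stop)
where
    F: FnMut(u32) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut refs: Vec<String> = Vec::new();

    for page in 1..=MAX_PAGES {
        let batch = fetch_page(page);
        if batch.is_empty() {
            // An empty page is the normal end-of-index signal.
            return (refs, Stop::EmptyPage);
        }
        for key in batch {
            // Run-scoped dedup: `insert` returns false for a repeated key.
            if seen.insert(key.clone()) {
                refs.push(key);
                if max_refs != 0 && refs.len() >= max_refs {
                    return (refs, Stop::MaxRefs);
                }
            }
        }
    }

    // Only reached if the site never served an empty page.
    (refs, Stop::MaxPages)
}
```

With a cap like this, a site that keeps serving non-empty pages past the real end would end the run with `Stop::MaxPages` instead of looping indefinitely. The cap would need to sit well above any plausible page count so it never truncates a legitimate full run.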

🤖 Generated with Claude Code

fabi added 1 commit 2026-05-31 16:37:13 +00:00
fix(crawler): walk list pages incrementally; stop on empty page (0.45.1)
Some checks failed
deploy / test-frontend (pull_request) Waiting to run
deploy / test-backend (pull_request) Failing after 9s
deploy / build-and-push (pull_request) Has been cancelled
deploy / deploy (pull_request) Has been cancelled
ee4594f679
The pre-built `1..=parse_last_page` queue silently broke whenever the
configured CRAWLER_START_URL lacked a `/N/` path segment: page_url
returned the input unchanged, every "next" page re-fetched page 1, and
the dedup set caught the duplicates as a flood of "skip already-seen
key in this run" debug lines. The walker now increments next_page on
each batch and terminates when parse_manga_list_from yields an empty
list (the `#logo` sentinel still converts unrendered pages into
transient errors, so an Ok(vec![]) is a real end-of-index signal).

parse_last_page and build_page_order are deleted along with their
unit tests; they have no callers under the new model. page_url and
the page-1 HTML cache from discover() are retained as-is.
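
The distinction the message draws between an unrendered page and a genuinely empty one can be illustrated with a minimal sketch. It assumes the sentinel check means "a rendered page always contains the `#logo` element"; `parse_list`, `ParseError` and the `data-key` attribute are hypothetical, and plain substring matching stands in for whatever HTML parsing parse_manga_list_from really does.

```rust
/// Hypothetical error type; the crawler's real error type is not shown in this PR.
#[derive(Debug, PartialEq)]
enum ParseError {
    /// The sentinel element is missing, so the page is treated as not rendered yet.
    Unrendered,
}

/// Returns the entry keys found on a list page.
/// Err(Unrendered) means "retry later"; Ok(vec![]) means "past the last page".
fn parse_list(html: &str) -> Result<Vec<String>, ParseError> {
    // Assumption: every fully rendered page carries an element with id="logo".
    if !html.contains("id=\"logo\"") {
        return Err(ParseError::Unrendered);
    }

    // Hypothetical entry marker: each entry exposes a data-key="..." attribute.
    let keys: Vec<String> = html
        .split("data-key=\"")
        .skip(1) // text before the first marker holds no key
        .filter_map(|rest| rest.split('"').next())
        .map(str::to_owned)
        .collect();

    Ok(keys)
}
```

Under these assumptions the walker can stop on `Ok` with an empty list without mistaking a half-loaded page for the end of the index, since that case surfaces as an error instead.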

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fabi merged commit 3b3d13a0f6 into main 2026-05-31 16:37:14 +00:00
Reference: fabi/Mangalord#4