Compare commits

73 Commits

Author SHA1 Message Date
MechaCat02
d3935dc82f docs: hand-off report for 2026-06-05 session
Snapshot of main, in-flight branches, session-specific changes (CRAWLER_LIMIT
in the daemon, browser-restart clears session_expired), and dev-stack
commands — for whoever picks the work up next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-05 07:11:39 +02:00
MechaCat02
679abae736 feat(chapter): preserve source-site order in chapter list (0.52.0)
Some checks failed
deploy / test-backend (push) Failing after 11m48s
deploy / test-frontend (push) Successful in 9m45s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
The user-facing chapter list was ordered by (number ASC, created_at ASC),
which broke the source site's order in two ways: non-numeric entries
("notice. : Officials") parsed to number=0 and clustered at the top,
even though the site placed them mid-list, and variants sharing a
number ("Ch.14 : PH" / "Ch.14 : Official") were torn apart by the
created_at tiebreak.

Capture each chapter's position in the source DOM as `source_index`
(0 = first = newest on this site) on every crawler sync, including the
UPDATE branch so a new chapter prepended on the source shifts every
existing row down by one on the next tick. The list query reverses
this with `ORDER BY source_index DESC NULLS LAST, number ASC,
created_at ASC` so the oldest chapter appears first, variants stay
adjacent in the order the site shows them, and non-numeric entries
land where the site placed them. User-uploaded chapters and pre-
migration rows keep their NULL source_index and fall through to the
prior number/created_at tiebreak via NULLS LAST.
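
As a rough illustration of that ordering, the `ORDER BY` clause can be mirrored by an in-memory comparator. This is only a sketch: `ChapterRow`, its field types, and `display_order` are hypothetical names, and treating `number` as an integer is an assumption.

```rust
use std::cmp::Ordering;

/// Hypothetical row shape carrying only the three sort keys named above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterRow {
    pub source_index: Option<i32>, // None for uploads and pre-migration rows
    pub number: i32,               // assumption: integral chapter number
    pub created_at: i64,           // assumption: unix timestamp
}

/// Mirrors `source_index DESC NULLS LAST, number ASC, created_at ASC`.
pub fn display_order(a: &ChapterRow, b: &ChapterRow) -> Ordering {
    let by_source = match (a.source_index, b.source_index) {
        (Some(x), Some(y)) => y.cmp(&x),       // DESC: highest index (oldest) first
        (Some(_), None) => Ordering::Less,     // NULLS LAST
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    };
    by_source
        .then(a.number.cmp(&b.number))
        .then(a.created_at.cmp(&b.created_at))
}
```

Sorting with `rows.sort_by(display_order)` keeps a non-numeric entry (number 0) wherever its `source_index` puts it, instead of clustering it at the top.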

The reader's client-side `[...chapters].sort((a,b) => a.number - b.number)`
is dropped; prev/next now walks the server-ordered array positionally
so it traverses variants and non-numeric entries in display order.

Existing data populates on the next cron tick or via admin force-resync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 07:25:09 +02:00
MechaCat02
b812c6d16c fix(reader): drop "Chapter N:" prefix from chapter title display (0.51.2)
The chapter list on the manga detail page, the reader's chapter-select
dropdown, the continuous-mode chapter bar, the browser tab title, and
the profile upload-history entries all prepended "Chapter {number}:"
in front of the crawled site title. Source titles already include
"Ch.N" themselves and the manga page renders chapters inside an <ol>,
so the prefix duplicated information the user could already see.

A small chapterLabel(c) helper in $lib/api/chapters returns the site
title as-is, falling back to "Chapter {number}" only when the
crawler captured an empty title, so the link/option text is never empty. The
five render sites now call it. The previous-/next-chapter nav
buttons still read "Previous chapter (Ch. N)" / "Next chapter (Ch. N)"
since those are wayfinding labels, not title display.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 07:22:17 +02:00
MechaCat02
e93eec89e5 fix(crawler): queue chapter content in ascending number order (0.51.1)
Both enqueue paths now order by chapters.number so the cron tick and the
bookmark hook insert jobs from chapter 1 upward instead of source-discovery
or random-UUID order. The lease query tiebreaks on created_at so jobs
sharing a batch's scheduled_at come off the queue in insertion order,
propagating the enqueue intent through to dequeue. Concurrent workers
and per-CDN latency can still drift actual completion order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 21:13:51 +02:00
MechaCat02
8818c890c5 feat(reader): chapter select dropdown for direct chapter jumps (0.51.0)
Adds a chapter `<select>` to the reader's top nav listing every chapter
of the current manga, defaulting to the open chapter; picking another
entry navigates straight to it without going back to the manga detail
page. Options use the "Ch. N — Title" form to match the existing
chapter tile and prev/next buttons in the reader bar.

Covered by a new Playwright spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 07:09:30 +02:00
MechaCat02
c134bdbbde feat: cover retry backfill + admin force-resync for manga & chapter (0.50.0)
Adds a per-tick cover-backfill pass to the crawler daemon so mangas whose
cover download failed on first attempt get retried — the metadata pass's
early-stop optimisation otherwise prevents the walk from revisiting them.

Adds admin-only POST /admin/mangas/:id/resync and POST /admin/chapters/:id/resync
that refetch metadata + cover (or chapter content with force_refetch) from the
crawler source synchronously and return the refreshed row. Surfaced in the
UI as "Force resync" buttons on the manga detail and reader pages,
admin-only via session.user.is_admin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 22:00:09 +02:00
MechaCat02
5c22dfdb41 feat: paginate list views, fix stale page titles, tidy admin filter bar
Bundle of small UI/UX fixes plus a build hygiene tweak.

* List pagination — Home (`/`) and `/authors/[id]` silently capped at
  the backend default of 50 with no UI to advance. New reusable
  `Pager.svelte` (Prev/Next + numbered with ellipsis), URL-synced
  `?page=N`, and filter/search/sort reset to page 1 so users aren't
  stranded on an out-of-range page. Count label now shows a range
  ("Showing 51–100 of 237").

* Stale page title — Pages without a `<svelte:head><title>` left the
  document title at whatever the last manga / author / collection page
  set it to. Move static-route titles into a route-id → title map in
  the root layout and invert every dynamic title to brand-first
  (`Mangalord | {X}`) for consistency.

* Admin filter bar — `/admin/mangas` search input had `flex: 1` and
  ballooned across the row, shoving the sync-state select + Search
  button to the far right. Cap at 24rem, vertical-align the row, and
  promote the previously aria-only "Sync state" label to visible text.

* Build hygiene — `backend/target` had grown to 68 GiB. Cleaned and
  added `[profile.dev] debug = "line-tables-only"` (and `[profile.test]`
  too) to cut future dev builds by ~50–70% while keeping line numbers
  in backtraces.
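
The count label from the list-pagination item comes down to a little arithmetic. The sketch below shows it in Rust for illustration only (the real component is `Pager.svelte`); `range_label` and its handling of out-of-range pages are assumptions.

```rust
/// Hypothetical helper: builds the "Showing A–B of N" label for a 1-based page.
pub fn range_label(page: usize, per_page: usize, total: usize) -> String {
    let page = page.max(1);
    let first = (page - 1).saturating_mul(per_page).saturating_add(1);
    // Assumption: an empty or out-of-range page reports zero shown items.
    if per_page == 0 || first > total {
        return format!("Showing 0 of {total}");
    }
    let last = page.saturating_mul(per_page).min(total);
    format!("Showing {first}–{last} of {total}")
}
```

With the backend default of 50 per page, `range_label(2, 50, 237)` yields "Showing 51–100 of 237".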

Also: configure vitest to resolve Svelte's browser entry so
`@testing-library/svelte` can mount components in jsdom — needed for
the new `Pager.svelte.test.ts`.

Bump 0.48.0 -> 0.49.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 21:18:53 +02:00
MechaCat02
e50fc093c3 feat: add PRIVATE_MODE site-wide auth gate (0.48.0)
When `PRIVATE_MODE=true`, every API path except a small allowlist
(`/health`, `/auth/{config,login,logout,register}`) requires a valid
session cookie or bearer token — anonymous reads are rejected with
401. Self-registration is force-disabled in private mode regardless
of `ALLOW_SELF_REGISTER`, so a locked-down instance flips with a
single switch (admins still mint accounts via `POST /admin/users`).
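
A minimal sketch of the gate decision described above, not the actual tower middleware. `Gate`, `gate`, and `self_register_enabled` are hypothetical names, and exact-match path comparison without any API prefix is an assumption.

```rust
/// Allowlisted paths taken from the commit message.
const PUBLIC_PATHS: [&str; 5] = [
    "/health",
    "/auth/config",
    "/auth/login",
    "/auth/logout",
    "/auth/register",
];

#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    Allow,
    Unauthorized, // maps to HTTP 401
}

/// `authenticated` stands for "valid session cookie or bearer token".
pub fn gate(private_mode: bool, path: &str, authenticated: bool) -> Gate {
    let allowlisted = PUBLIC_PATHS.iter().any(|p| *p == path);
    if !private_mode || authenticated || allowlisted {
        Gate::Allow
    } else {
        Gate::Unauthorized
    }
}

/// Private mode force-disables self-registration regardless of the toggle.
pub fn self_register_enabled(private_mode: bool, allow_self_register: bool) -> bool {
    allow_self_register && !private_mode
}
```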

The backend gate is a tower middleware that reuses the existing
`CurrentUser` extractor, so the cookie + bearer paths cannot drift
from per-handler auth. `/auth/config` now exposes the flag plus the
effective `self_register_enabled` value so the frontend can render
the navbar correctly on the first paint.

On the frontend, a new universal root `+layout.ts` fetches the
config and redirects anonymous visitors to `/login?next=<path>`
before page-specific loads fire. The redirect is UX only — the
backend middleware is the source of truth, so crafted requests
still 401.

Defaults stay public (`PRIVATE_MODE=false`); existing deployments
need no env change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 20:11:22 +02:00
MechaCat02
72756cfef2 feat(crawler): honour CRAWLER_LIMIT in the in-process daemon (0.47.0)
The CLI binary already capped runs at CRAWLER_LIMIT mangas, but the
daemon's RealMetadataPass passed a hardcoded `0` (no cap) to
`pipeline::run_metadata_pass`, so the env var was silently ignored once
the daemon took over the metadata pass.

Adds `manga_limit` to `CrawlerConfig`, reads it from `CRAWLER_LIMIT`
(default 0 = no cap), and threads it through `RealMetadataPass::run`
so a daemon-driven sweep stops at the same boundary as a CLI run.
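
For illustration, the "0 = no cap" convention could look like the sketch below. `parse_manga_limit` and `limit_reached` are hypothetical helpers, and falling back to 0 on an unparsable value is an assumption.

```rust
/// Reads the raw CRAWLER_LIMIT value; unset or unparsable means 0 (no cap).
pub fn parse_manga_limit(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok()).unwrap_or(0)
}

/// True once a capped sweep has processed `manga_limit` mangas.
pub fn limit_reached(manga_limit: usize, processed: usize) -> bool {
    manga_limit != 0 && processed >= manga_limit
}
```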

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 20:07:01 +02:00
MechaCat02
4e20350645 fix(crawler): translate socks5h:// → socks5:// for Chromium --proxy-server
All checks were successful
deploy / test-backend (push) Successful in 19m30s
deploy / test-frontend (push) Successful in 9m42s
deploy / build-and-push (push) Successful in 8m10s
deploy / deploy (push) Successful in 15s
Chromium doesn't know the socks5h scheme (curl/reqwest convention)
and bails navigations with ERR_NO_SUPPORTED_PROXIES. It does, however,
send destination hostnames over SOCKS5 by default, so stripping the
`h` is a pure scheme rename — remote-DNS behaviour is preserved.

reqwest keeps the user's original CRAWLER_PROXY string (`socks5h://...`
remains valid and meaningful for it).
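
The rename amounts to a prefix swap. A minimal sketch, with `chromium_proxy_arg` as a hypothetical name:

```rust
/// Rewrites `socks5h://` to `socks5://` for Chromium's --proxy-server;
/// any other proxy string is passed through unchanged.
pub fn chromium_proxy_arg(proxy: &str) -> String {
    match proxy.strip_prefix("socks5h://") {
        Some(rest) => format!("socks5://{rest}"),
        None => proxy.to_string(),
    }
}
```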

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:56:45 +02:00
MechaCat02
713ca139c4 feat(deploy): add optional tor service to dev compose for native-backend dev
Mirrors the prod tor service but with 127.0.0.1-only host port bindings
so a `cargo run` on the host can reach 127.0.0.1:9050 / 9051. Default
password baked in (overridable via TOR_CONTROL_PASSWORD env) since
host-loopback is the only exposure surface — same friction-free posture
as the postgres entry in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:56:45 +02:00
MechaCat02
e3cff9d874 fix(deploy): pivot tor service to password auth + wrapper entrypoint
Some checks failed
deploy / test-backend (push) Successful in 20m29s
deploy / test-frontend (push) Successful in 9m42s
deploy / deploy (push) Has been cancelled
deploy / build-and-push (push) Has been cancelled
Dockurr/tor's stock entrypoint binds the control port to localhost
(unreachable from a sibling container), refuses to run as a
non-default user (its setup chowns dirs and su-execs down to its
`tor` user, both requiring root), and skips its own
HashedControlPassword injection whenever the user's torrc declares
a ControlPort. The combination meant the original cookie-via-shared-
volume design couldn't work without fighting the image.

This commit:

- Adds tor/entrypoint.sh, a small wrapper that hashes $PASSWORD
  with `tor --hash-password`, appends the hash to a writable copy
  of /etc/tor/torrc, then execs tor. Container runs as root only
  for that bring-up; the torrc's `User tor` directive drops privs
  after port binding.
- Adds a healthcheck on the tor service that gates downstream
  containers on both 9050 + 9051 actually listening (was
  service_started, which fires before tor finishes bootstrap).
- Loosens MaxCircuitDirtiness 60 → 600. The 60s value would have
  rotated mid-chapter for any chapter with > ~50 images, which is
  exactly the kind of fingerprint we're trying to avoid.
- Wires TOR_CONTROL_PASSWORD as a REQUIRED .env var on both sides
  (PASSWORD on tor, CRAWLER_TOR_CONTROL_PASSWORD on backend).
  docker-compose.yml fails fast if unset.
- Removes the tor-data shared volume on backend (cookie auth is no
  longer the default; operators wanting cookie can mount it back).
- Documents the pivot + the cookie-vs-password tradeoff in
  .env.example.

End-to-end validated: `docker compose up -d tor`, then
`printf 'AUTHENTICATE "test"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' | nc tor 9051`
returns three `250 OK` lines.

Audit ref: #2, #3, #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:25:54 +02:00
MechaCat02
d47e832613 fix(crawler): redact TorAuth::Password in Debug, drop NEWNYM info→debug
The startup log line in app.rs and bin/crawler.rs `?t`-debug-formats
the TorController, which through the derived Debug on TorAuth would
expand TorAuth::Password(p) and leak the plaintext password to logs.
Implement Debug manually on TorAuth — None / Password(<redacted>) /
Cookie(<path>) — and lock the redaction with a regression test.
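
A sketch of what such a manual impl might look like. The variant payload types are assumptions; only the three output forms come from the description above.

```rust
use std::fmt;
use std::path::PathBuf;

/// Assumed shape of the auth enum; payload types are guesses.
pub enum TorAuth {
    None,
    Password(String),
    Cookie(PathBuf),
}

impl fmt::Debug for TorAuth {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TorAuth::None => f.write_str("None"),
            // Never print the secret, even under `{:?}` or tracing's `?field`.
            TorAuth::Password(_) => f.write_str("Password(<redacted>)"),
            TorAuth::Cookie(path) => write!(f, "Cookie({})", path.display()),
        }
    }
}
```

Because a struct's derived `Debug` delegates to its fields, a struct holding a `TorAuth` picks up the redaction automatically.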

Drop the per-NEWNYM success log from info to debug: a busy crawl
rotates circuits many times per minute. Failed NEWNYMs already log
at warn — those stay loud.

Tightens the closed_connection_mid_reply_is_an_error assertion which
was tautological (`closed connection` OR `AUTHENTICATE`) by driving
the mock to read the AUTH line then drop, exercising only the
EOF-mid-reply path.

Audit ref: #7, #9, nit on tautological test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:25:36 +02:00
MechaCat02
c30c7a546f fix(crawler): unify recircuit budget semantics — N = total attempts
The three retry-with-recircuit sites disagreed: detect.rs's
retry_on_transient_with_hook used "N = total attempts" (3 → 3
fetches), but session.rs's unauth branch and content.rs's chapter
loop used "N = recircuits" (3 → 4 fetches). With the same configured
"max=3", different sites hit the upstream a different number of times.

Unify on N = total attempts (matching the existing
retry_on_transient convention). The CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS
env var now means exactly what its name suggests. Disabling the
recircuit feature collapses to max_attempts=1 (single attempt, no
retry) — bit-for-bit pre-TOR behavior preserved.

Adds a debug_assert!(max >= 1) on both helpers and a new
content.rs test exercising the mixed Transient → Unauth → Ok
sequence to lock in the shared-counter invariant.

Audit ref: #5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:25:25 +02:00
MechaCat02
a0db7beb81 chore: bump to 0.46.0 for TOR proxy + recircuit feature
CRAWLER_TOR_CONTROL_URL, _PASSWORD, _COOKIE_PATH,
_RECIRCUIT_MAX_ATTEMPTS are new feature env vars; treat per CLAUDE.md
as a minor bump (feat:).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:01:57 +02:00
MechaCat02
ecbbebafc4 feat(deploy): dockurr/tor service + torrc; wire crawler to use it by default
Adds a `tor` service to the compose stack (dockurr/tor) with a torrc
tuned for the crawler — SOCKS5 on 9050 with IsolateDestAddr +
IsolateDestPort so NEWNYM picks up promptly, control port on 9051
with cookie auth, MaxCircuitDirtiness 60.

Backend defaults CRAWLER_PROXY → socks5h://tor:9050 and
CRAWLER_TOR_CONTROL_URL → tcp://tor:9051 so TOR + recircuit are on
out-of-the-box. Operators can override both to empty in .env to opt
out without removing the service.

The tor-data named volume is mounted ro on the backend so it can read
/var/lib/tor/control_auth_cookie; CookieAuthFileGroupReadable handles
the permissions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:01:04 +02:00
MechaCat02
8c6378b877 feat(crawler): recircuit TOR on transient pages and unauthenticated probes
- target.rs swaps retry_on_transient → retry_on_transient_with_hook,
  signaling NEWNYM via ctx.tor between attempts when configured.
- session.rs gains verify_session_with_recircuit; the bare
  verify_session is now a one-line wrapper passing tor=None,
  unauth_max_recircuit=0. The inner run_session_probe_loop is
  pure-over-IO and unit-tested with closure-based fakes.
- content.rs extracts fetch_chapter_html_once + the closure-driven
  fetch_chapter_html_with_recircuit, used by sync_chapter_content to
  retry on Transient or Unauthenticated up to a recircuit_budget.
  Budget = 0 (no TOR) preserves original behavior bit-for-bit.
- app.rs and bin/crawler.rs construct the controller before on_launch
  and pass it into verify_session_with_recircuit, so a transient
  hiccup at startup no longer requires PHPSESSID rotation.

Recircuit budget defaults to CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS (3).
Errors from NEWNYM are logged and swallowed — failing to recircuit
should not take down the crawl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 20:01:04 +02:00
MechaCat02
8557e432a2 feat(crawler): plumb TorController through FetchContext and pipelines
Adds CRAWLER_TOR_CONTROL_URL / _PASSWORD / _COOKIE_PATH /
_RECIRCUIT_MAX_ATTEMPTS to CrawlerConfig and to bin/crawler.rs's
env reads. Constructs an Option<Arc<TorController>> at daemon /
CLI startup and threads it through FetchContext,
pipeline::run_metadata_pass, and content::sync_chapter_content as
Option<&TorController>.

Pure scaffolding — the controller isn't used yet; behavior is
unchanged. Next commit wires the retry hooks and session-probe
recircuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 19:59:47 +02:00
MechaCat02
d6d84dedcb feat(crawler): retry_on_transient_with_hook for between-retry side effects
Adds a sibling fn that fires a caller-supplied async hook between a
transient failure and the next attempt. The existing
retry_on_transient becomes a thin wrapper over it (no-op hook), so
no call sites churn yet.

Hook contract: fires only between attempts (N-1 times for N
attempts), never after a non-transient error or after the final
attempt. Designed for TOR NEWNYM, but the signature is generic.
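
The hook contract can be sketched synchronously as follows. The real function is async; `retry_with_hook`, the `is_transient` predicate, and the generic signature here are assumptions made to keep the example dependency-free.

```rust
/// `max_attempts` counts total attempts. The hook fires only between a
/// transient failure and the next attempt, so at most `max_attempts - 1` times.
pub fn retry_with_hook<T, E>(
    max_attempts: u32,
    mut attempt: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
    mut hook: impl FnMut(),
) -> Result<T, E> {
    assert!(max_attempts >= 1, "max_attempts counts total attempts");
    let mut made = 0u32;
    loop {
        made += 1;
        match attempt() {
            Ok(value) => return Ok(value),
            // Retry only if budget remains and the error is transient.
            Err(e) if made < max_attempts && is_transient(&e) => hook(),
            // Non-transient error or final attempt: no hook, give up.
            Err(e) => return Err(e),
        }
    }
}
```

Passing a no-op closure as `hook` gives the plain retry behaviour, which is how a thin wrapper could be expressed.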

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 19:59:47 +02:00
MechaCat02
d37b94871e feat(crawler): TorController for control-port NEWNYM signaling
Minimal client over tokio::net::TcpStream — AUTHENTICATE then
SIGNAL NEWNYM, one-shot connection. Supports cookie-file and
password auth (cookie preferred when both provided); covers the
multi-line `250-...\r\n250 OK` reply form so future torrc tweaks
won't confuse the parser.
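
To illustrate the multi-line reply form: in the control protocol a `250-` line means more lines follow and a `250 ` line ends the reply. The parser below is a hedged sketch over an already-buffered string, not the TcpStream client; `parse_reply` is a hypothetical name and `250+` data lines are deliberately not handled.

```rust
/// Parses a buffered control-port reply. Returns the message texts on a
/// complete 250 reply, or an error on a non-250 code or a truncated reply.
pub fn parse_reply(raw: &str) -> Result<Vec<String>, String> {
    let mut messages = Vec::new();
    for line in raw.split("\r\n").filter(|l| !l.is_empty()) {
        let (Some(code), Some(sep), Some(text)) =
            (line.get(..3), line.get(3..4), line.get(4..))
        else {
            return Err(format!("malformed reply line: {line:?}"));
        };
        if code != "250" {
            return Err(format!("tor replied {code}: {text}"));
        }
        messages.push(text.to_string());
        match sep {
            " " => return Ok(messages), // final line of the reply
            "-" => continue,            // more lines follow
            other => return Err(format!("unsupported separator {other:?}")),
        }
    }
    // Ran out of input without seeing a final "250 " line.
    Err("connection closed mid-reply".to_string())
}
```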

Not yet wired into the crawler — that lands in the next commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 19:59:47 +02:00
8e39fadd21 ci: build via host docker socket (plain build); fix missing daemon socket (#5)
All checks were successful
deploy / test-backend (push) Successful in 19m12s
deploy / test-frontend (push) Successful in 9m43s
deploy / build-and-push (push) Successful in 8m11s
deploy / deploy (push) Successful in 11s
2026-05-31 17:40:14 +00:00
3b3d13a0f6 fix(crawler): walk list pages incrementally; stop on empty page (0.45.1) (#4)
Some checks failed
deploy / test-backend (push) Successful in 18m58s
deploy / test-frontend (push) Successful in 9m43s
deploy / build-and-push (push) Failing after 2m26s
deploy / deploy (push) Has been skipped
2026-05-31 16:37:14 +00:00
0f90af80cb ci(test-backend): ubuntu-latest + rustup (fix node-not-found) (#3)
Some checks failed
deploy / test-backend (push) Has been cancelled
deploy / test-frontend (push) Has been cancelled
deploy / build-and-push (push) Has been cancelled
deploy / deploy (push) Has been cancelled
2026-05-31 16:18:21 +00:00
6b49a47d0a feat(crawler): system Chromium via CRAWLER_CHROMIUM_BINARY (0.45.0) (#2)
Some checks failed
deploy / test-backend (push) Failing after 7s
deploy / test-frontend (push) Failing after 33s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
2026-05-31 15:47:47 +00:00
e851355f28 Merge pull request 'ci: no-SSH local deploy + Dockerfile build fixes' (#1) from fix/ci-deploy-pipeline into main
Some checks failed
deploy / test-backend (push) Failing after 7s
deploy / test-frontend (push) Failing after 30s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
2026-05-31 15:43:54 +00:00
2a0cc24c07 ci: deploy to the local stack over the runner socket, not SSH
Some checks failed
deploy / test-backend (pull_request) Failing after 1m6s
deploy / test-frontend (pull_request) Failing after 1m18s
deploy / build-and-push (pull_request) Has been skipped
deploy / deploy (pull_request) Has been skipped
The runner lives on the deploy host and shares its docker daemon, so the
deploy job runs `docker compose pull && up -d` against the central compose
via a bind-mounted compose dir (docker:cli + docker_host: "-") instead of
appleboy/ssh-action. Drops the SSH_* secrets and recreates only the two
mangalord services at the freshly built SHA. Requires /mnt/ssd/docker-data
in the runner's container.valid_volumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:26:58 +02:00
a615b0aee7 fix(docker): unblock image builds on this host
- backend dep-cache stage stubs only main.rs/lib.rs, but Cargo.toml
  declares a second [[bin]] crawler at src/bin/crawler.rs, so
  `cargo build --locked` aborts ("can't find bin crawler"). Stub it too.
- runtime was debian:bookworm-slim (glibc 2.36) while rust:1-slim now
  tracks trixie (glibc 2.41) -> "GLIBC_2.39 not found" at boot. Pin the
  runtime to debian:trixie-slim so it matches the builder's glibc.
- frontend healthcheck probed localhost (-> musl picks IPv6 ::1) but the
  Node server binds IPv4 0.0.0.0 only -> false "unhealthy". Probe 127.0.0.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:26:58 +02:00
MechaCat02
a2826d6467 feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0)
Some checks failed
deploy / test-backend (push) Failing after 11s
deploy / test-frontend (push) Failing after 36s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
Operators whose sources shard images across numbered CDN subdomains
can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST. The new
flag short-circuits the host check in DownloadAllowlist::contains
while leaving scheme, localhost, and private-IP defenses in
is_safe_url untouched — scraped URLs pointing at 10.x /
169.254.169.254 / file:// stay refused. Default is false; fail-closed
posture is preserved unless the operator opts in. Wired into both the
server (config::build_download_allowlist) and the bin/crawler.rs
one-shot.
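
A rough sketch of how the flag composes with the remaining defenses. `is_safe_target` and `may_download` are hypothetical stand-ins for `is_safe_url` and `DownloadAllowlist::contains`, taking a pre-split scheme and host; the exact set of refused address ranges is an assumption, and DNS names that resolve to private addresses are out of scope here.

```rust
use std::net::IpAddr;

/// Scheme, localhost, and private-IP checks that stay on regardless of the flag.
pub fn is_safe_target(scheme: &str, host: &str) -> bool {
    if scheme != "http" && scheme != "https" {
        return false; // e.g. file://
    }
    if host.eq_ignore_ascii_case("localhost") {
        return false;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => {
            !(ip.is_loopback() || ip.is_private() || ip.is_link_local() || ip.is_unspecified())
        }
        Ok(IpAddr::V6(ip)) => !(ip.is_loopback() || ip.is_unspecified()),
        Err(_) => true, // a hostname, not an IP literal
    }
}

/// `allow_any_host` skips only the allowlist membership test.
pub fn may_download(scheme: &str, host: &str, allowlist: &[&str], allow_any_host: bool) -> bool {
    is_safe_target(scheme, host)
        && (allow_any_host || allowlist.iter().any(|h| h.eq_ignore_ascii_case(host)))
}
```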

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:52:49 +02:00
MechaCat02
1eebb90e25 fix(crawler): unhang shutdown on lingering Arc<Browser>, silence WS noise (0.43.1)
Some checks failed
deploy / test-backend (push) Failing after 6s
deploy / test-frontend (push) Failing after 40s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
- Handle::close aborts its chromiumoxide driver task when another
  Arc<Browser> outlives the call, so shutdown returns instead of
  hanging on a stream that never terminates. Generic close_or_abort
  helper with regression tests covering both Arc paths.
- daemon.shutdown() is wrapped in a 5s timeout in main as defense
  in depth.
- Default RUST_LOG silences chromiumoxide::conn / chromiumoxide::handler
  WS-deserialize ERROR spam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:47:36 +02:00
MechaCat02
030b27754b feat(api): admin-initiated user creation via POST /admin/users (0.43.0)
Some checks failed
deploy / test-backend (push) Failing after 8s
deploy / test-frontend (push) Failing after 38s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
Pairs with the ALLOW_SELF_REGISTER toggle from 0.42.0: admins can mint
accounts regardless of the toggle state, so a closed-membership
deployment still has a working enrollment path. The endpoint accepts
{ username, password, is_admin? } so admins can mint co-admins in one
call (avoiding a separate promote + extra audit row for the common
"invite a co-admin" flow).

Implementation:
- POST /api/v1/admin/users guarded by RequireAdmin
- Reuses validate_username / validate_password from api::auth (made
  pub(crate)) so the admin path can never produce an account self-
  register would reject and vice versa
- repo::user::admin_create_user wraps INSERT + admin_audit insert in
  a single tx — same "audit reflects what committed" semantics as the
  existing admin_safe_* fns
- Audit row: action="create_user", payload={username, is_admin}

Frontend:
- createAdminUser() in lib/api/admin.ts
- /admin/users grows a collapsible "Create user" form above the table
  (username, password, "Make admin" checkbox). Errors surface inline;
  the list reloads on success.

Backend tests: 7 new, including the headline
`create_user_works_even_when_self_register_disabled`, which pins that the
admin-create path is NOT gated by the public toggle.
2026-05-31 14:00:31 +02:00
MechaCat02
2f47faa11c feat(auth): ALLOW_SELF_REGISTER toggle + public /auth/config endpoint (0.42.0)
Lets operators run a closed-membership deployment by setting
ALLOW_SELF_REGISTER=false (default true, so existing deploys are
unaffected). When off, POST /auth/register returns 403 forbidden. The
rate-limit token is consumed BEFORE the disabled check so the timing
doesn't distinguish enabled-but-rejected from disabled — closes the
toggle-state probe channel.

New public GET /auth/config returns { self_register_enabled: bool }
so the frontend can render its register affordances correctly
without conflating "disabled" with "rate-limited" (which a probe
attempt would).

Frontend: a lightweight reactive `authConfig` store loads the flag
once on root-layout mount (and again on /register direct navigation,
which bypasses the layout's onMount). Header hides the Register link
when the toggle is off; /register renders a "self-registration is
disabled — ask an administrator" notice instead of the form.

Admin-create endpoint that pairs with this toggle is intentionally
not in this PR — it lands as the next branch (feat/admin-user-create).
The toggle alone is independently useful for deployments that want
to lock down enrollment without yet wiring an admin UI.
2026-05-31 13:56:18 +02:00
MechaCat02
6dd21451a8 chore: sync Cargo.lock for 0.41.2
2026-05-30 22:26:24 +02:00
MechaCat02
f6728dc71a fix(admin): security-audit findings — paginate chapters, lock down unchecked helper (0.41.2)
Addresses the security-audit findings on top of the admin feature stack:

M1: /admin/mangas/:id/chapters now paginates (default limit 200, max 500).
A long-runner with thousands of chapters would otherwise produce a multi-MB
response, with the per-row scalar subqueries evaluated once for each of those
chapters. Admin-only, but a real
stall risk on one expand-click. Adds explicit pagination tests for the cap
and offset; frontend renders a "Showing first N of M" hint when the cap
clips the result.
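
The limit handling might look like the sketch below. Clamping an oversized `limit` (rather than rejecting it) is an assumption, and `effective_limit` and `clip_hint` are hypothetical names.

```rust
pub const DEFAULT_LIMIT: u32 = 200;
pub const MAX_LIMIT: u32 = 500;

/// Applies the default when no limit is given and caps it at the maximum.
pub fn effective_limit(requested: Option<u32>) -> u32 {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}

/// Text for the hint shown when the cap clipped the result, if any.
pub fn clip_hint(shown: usize, total: usize) -> Option<String> {
    (shown < total).then(|| format!("Showing first {shown} of {total}"))
}
```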

L1: repo::user::set_is_admin renamed to set_is_admin_unchecked with a
doc-comment pointing at admin_safe_set_is_admin for production use. The
short name was a footgun — a future contributor reaching for it would
silently bypass self-protection, the last-admin invariant, and the audit
log. Used only by integration-test setup; production code goes through
the admin_safe_* paths.

CSRF posture: build_session_cookie carries a comment that the
SameSite=Lax default is the project's CSRF defense for state-changing
mutations and breaks the instant anyone adds a side-effecting GET under
/admin/*. Spells out what to do then (Strict + explicit token check).

Test counts: 43 backend admin tests + 12 vitest admin tests all green;
svelte-check 0/0 across 446 files.
2026-05-30 22:23:55 +02:00
MechaCat02
aa2159ca06 fix(admin): three review findings — audit no-op, 404, chapter priority (0.41.1)
- admin_safe_set_is_admin: short-circuit when target.is_admin == value,
  before writing audit. PATCH {is_admin: true} on someone already admin
  previously wrote a misleading "promote_user" row even though the UPDATE
  was a no-op.

- list_chapters (/admin/mangas/:id/chapters): explicit exists() check on
  manga_id, returns 404 instead of 200 [] for a typo'd / deleted manga.

- ChapterSyncState priority: the Failed branch now requires page_count = 0,
  so a chapter with pages on disk AND a historical dead job (from a
  re-download attempt that crashed) stays Synced. The old order
  contradicted Synced's documented "downloaded at some point" contract.
  Doc comments updated alongside the SQL.
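
The revised per-chapter priority can be written out as a plain function for illustration. The real derivation lives in SQL; `ChapterSignals` and `derive_state` are hypothetical, and the caller is assumed to have computed the booleans.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ChapterSyncState {
    Downloading,
    Dropped,
    Failed,
    NotDownloaded,
    Synced,
}

/// Hypothetical inputs, one per crawler signal.
pub struct ChapterSignals {
    pub in_flight_job: bool,
    pub all_sources_dropped: bool,
    pub latest_terminal_job_dead: bool,
    pub page_count: u32,
}

pub fn derive_state(s: &ChapterSignals) -> ChapterSyncState {
    if s.in_flight_job {
        ChapterSyncState::Downloading
    } else if s.all_sources_dropped {
        ChapterSyncState::Dropped
    } else if s.latest_terminal_job_dead && s.page_count == 0 {
        // Failed now requires that no pages ever landed on disk.
        ChapterSyncState::Failed
    } else if s.page_count == 0 {
        ChapterSyncState::NotDownloaded
    } else {
        ChapterSyncState::Synced
    }
}
```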

Three new regression tests pin the behaviour.
2026-05-30 21:58:15 +02:00
MechaCat02
b434c9b68d feat(frontend): /admin dashboard with users/mangas/system views (0.41.0)
Adds the SvelteKit /admin route tree backed by the admin endpoints
landed in PR 1-4. Pages: Overview (alerts + summary cards), Users
(list / promote-demote / delete), Mangas (list with sync state +
expandable per-chapter state), System (live disk/mem/cpu bars,
refreshing every 5s).

Security model: the backend's RequireAdmin extractor is the actual
boundary. /admin/+layout.ts calls getSystemStats() at load and
translates the response — 401 → redirect to /login, 403 → throw
SvelteKit error(403) which renders the framework error page. The
header's "Admin" link is hidden unless `session.user?.is_admin`,
but that's UX only.

Carries `is_admin: boolean` through to the frontend User TS type so
the header check works and so admin tables can show role per row.

Vitest covers lib/api/admin.ts (10 tests: list/delete/PATCH for
users, sync-state filter for mangas, nested chapter route, system
disk-nullable case). Playwright is intentionally deferred until the
routes stabilise — admin UI is operator-only and changes shape often
in v0.
2026-05-30 21:49:39 +02:00
MechaCat02
cc4ec76d17 feat(api): admin system metrics endpoint with disk/mem/cpu alerts (0.40.0)
Adds GET /api/v1/admin/system returning disk (scoped to storage_dir
via statvfs), memory, CPU, and a server-side alerts array that fires
at >90% disk or memory.

Disk uses nix::sys::statvfs directly rather than sysinfo's Disks API
to avoid mountpoint-matching gymnastics for the storage_dir. A new
`Storage::local_root() -> Option<&Path>` trait method exposes the
root; the default returns None so a future S3Storage gets `disk:
null` in the response instead of fabricated numbers.

CPU is sampled inline (refresh → 250ms sleep → refresh → read) so the
endpoint adds 250ms of latency per call. No background-cache yet —
admin traffic is low-volume and the moving parts aren't worth it
until polling shows up.

Alerts are evaluated server-side so the frontend can render them
without re-implementing the thresholds.
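
A minimal sketch of the threshold evaluation, assuming usage arrives as (used, total) byte pairs. `alerts` and the string labels are hypothetical; the strict "greater than 90%" comparison follows the description above.

```rust
pub const ALERT_THRESHOLD_PERCENT: u128 = 90;

/// Integer comparison avoids float rounding at exactly 90%.
fn over_threshold(used: u64, total: u64) -> bool {
    total > 0 && (used as u128) * 100 > (total as u128) * ALERT_THRESHOLD_PERCENT
}

/// `disk` is None when the storage backend has no local root.
pub fn alerts(disk: Option<(u64, u64)>, memory: (u64, u64)) -> Vec<&'static str> {
    let mut out = Vec::new();
    if let Some((used, total)) = disk {
        if over_threshold(used, total) {
            out.push("disk");
        }
    }
    if over_threshold(memory.0, memory.1) {
        out.push("memory");
    }
    out
}
```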
2026-05-30 21:45:06 +02:00
MechaCat02
bf7c9b5c2a feat(api): admin manga/chapter overview with derived sync state (0.39.0)
Adds GET /api/v1/admin/mangas and /admin/mangas/:id/chapters guarded by
RequireAdmin. Sync state is computed at query time from the existing
crawler signals (manga_sources / chapter_sources / crawler_jobs) — no
new state column is persisted, so the crawler stays the single writer
of these signals.

Per-manga priority: InProgress (in-flight sync_manga or
sync_chapter_list job) > Dropped (all source rows soft-dropped) >
Synced (default; covers user-uploaded mangas with zero source rows).

Per-chapter priority: Downloading (in-flight sync_chapter_content) >
Dropped (all source rows soft-dropped) > Failed (most-recent terminal
job is dead) > NotDownloaded (page_count = 0) > Synced. The Failed
check sits ABOVE NotDownloaded so the more informative "we tried and
it died" state wins over "we never got around to it" — see the
priority comment in repo/admin_view.rs.

Migration 0020 adds a partial index on
crawler_jobs((payload->>'source_manga_key')) for the one job kind
(sync_manga) whose payload doesn't carry manga_id directly — without it
the in-flight detection for a manga falls back to a seqscan over the
job table.
2026-05-30 21:41:09 +02:00
MechaCat02
0b2018ceca feat(api): admin user management endpoints with audit log (0.38.0)
Adds /api/v1/admin/users list / DELETE / PATCH guarded by RequireAdmin,
plus the audit-log substrate every future destructive admin endpoint
will reuse.

Safety properties:
- Cannot self-delete or self-demote (409 conflict, message calls out
  "yourself" so the UI can render an explanation).
- Cannot remove the last admin via either DELETE or demote. The check
  takes pg_advisory_xact_lock(ADMIN_INVARIANT_LOCK_KEY) and re-counts
  admins inside the same tx, closing the parallel-demote race that a
  bare "if count > 1" check would let through. The HTTP-serial path to
  this guard is structurally unreachable (the actor would have to be
  the lone admin demoting themselves, which the self-guard fires on
  first); the parallel race test exercises it via repo calls.

Audit log (admin_audit table) records the action inside the same tx
as the action itself, so a rolled-back action never leaves an orphan
audit row. actor_user_id is ON DELETE SET NULL so the log outlives a
later-deleted admin. target_id is not a FK because future audit kinds
will target non-user rows.
2026-05-30 21:35:35 +02:00
MechaCat02
ab8b7acc34 feat(auth): admin role with cookie-only RequireAdmin extractor (0.37.0)
Adds an `is_admin` flag on users plus the substrate every later PR in the
admin feature builds on:

- migration 0018 adds the column with default false
- `repo::user::bootstrap_admin` creates or promotes the user named by
  `ADMIN_USERNAME` at startup, hashing `ADMIN_PASSWORD` only when the row
  is new — never overwriting an existing hash, so an operator can rotate
  the admin password via the UI without env-var conflict
- `CurrentSessionUser` extractor accepts only the session cookie;
  `RequireAdmin` composes over it and additionally requires
  `user.is_admin`. Bearer tokens are intentionally excluded so an
  admin's bot token never inherits admin authority (privilege-escalation
  surface that bites every "API keys reuse user perms" auth design)
- demotion is instant: `RequireAdmin` re-reads the user row each request

`/api/v1/auth/me` now exposes `is_admin`; no other response embeds
`User`, so no privacy fanout to audit.
2026-05-30 21:26:26 +02:00
MechaCat02
9925f54695 fix(crawler): narrow browser-dead heuristic to typed downcasts (0.36.7)
anyhow_looks_browser_dead substring-matched any chain message
containing channel / connection / websocket / transport / closed /
nav timeout. Real chromium failures hit those words, but so do
reqwest TCP-reset errors during CDN image downloads, sqlx pool-
timeout errors, and any number of non-browser failures — each of
which triggered a wasted chromium relaunch + session-probe re-run
against the catalog's rate-limit budget.

Drop the substring pass. Walk the chain looking only for typed
NavError (flagged via is_likely_browser_dead) or CdpError. Every
place we feed a chromium error into anyhow goes through one of
those types, so the typed downcasts cover the real cases without
the false-positive surface.

NavError::is_likely_browser_dead also drops its own substring
check on Cdp(e); any CdpError surfacing at the navigation layer
means the chromium-facing channel is the failing layer.
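
A std-only sketch of the typed chain walk, assuming the errors expose
their causes through `Error::source`. The `NavError` and `CdpError`
definitions here are simplified stand-ins (the `BadStatus` variant and
the `Context` wrapper are hypothetical), and the real helper works on
an anyhow chain rather than a bare `dyn Error`.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
pub struct CdpError(pub String);
impl fmt::Display for CdpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "cdp: {}", self.0)
    }
}
impl Error for CdpError {}

#[derive(Debug)]
pub enum NavError {
    Timeout,
    Cdp(CdpError),
    BadStatus(u16), // hypothetical non-fatal variant
}
impl fmt::Display for NavError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "navigation failed: {:?}", self)
    }
}
impl Error for NavError {}
impl NavError {
    pub fn is_likely_browser_dead(&self) -> bool {
        matches!(self, NavError::Timeout | NavError::Cdp(_))
    }
}

/// Hypothetical context wrapper, standing in for an anyhow context layer.
#[derive(Debug)]
pub struct Context {
    pub msg: String,
    pub source: Box<dyn Error + 'static>,
}
impl fmt::Display for Context {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)
    }
}
impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.source)
    }
}

/// Walks the cause chain and matches on types only, never on message text.
pub fn looks_browser_dead(err: &(dyn Error + 'static)) -> bool {
    let mut cur: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = cur {
        if let Some(nav) = e.downcast_ref::<NavError>() {
            if nav.is_likely_browser_dead() {
                return true;
            }
        }
        if e.downcast_ref::<CdpError>().is_some() {
            return true;
        }
        cur = e.source();
    }
    false
}
```

An I/O error whose message happens to say "connection closed" no longer
matches, because nothing in the walk reads the message.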

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:41:59 +02:00
MechaCat02
eaa5afda50 fix(crawler): skip sync when empty chapters + prior > 0 (0.36.6)
The wait_for_selector wait in 0.36.2 narrows the partial-render race
window but doesn't close it: a render that takes longer than
SELECTOR_TIMEOUT (10s) still hands an empty Vec to sync_manga_chapters,
and the soft-drop branch flips every existing chapter to dropped_at.
The next tick recovers but a manga's reader briefly stops working in
between.

Close it at the pipeline level. Between fetch_manga and the upsert/
sync, if the parsed chapter list is empty and the prior live count
for (source_id, source_manga_key) is > 0, treat the fetch as a
transient failure: log, bump mangas_failed, skip upsert + sync + the
seen.insert so a later batch / tick retries. Brand-new mangas with
genuinely zero chapters (prior == 0) pass through unchanged.

New repo helper repo::crawler::live_chapter_count_for_source_manga
joins chapters → chapter_sources → manga_sources with dropped_at IS
NULL — same lockstep as dispatch_target and the enqueue queries.
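
The guard reduces to a two-input decision. A minimal sketch, with the
`FetchDecision` enum and `classify_fetch` name being hypothetical:

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FetchDecision {
    /// Upsert the manga and sync its chapters as usual.
    Proceed,
    /// Log, bump the failure counter, and leave the key retryable.
    SkipAsTransient,
}

/// `parsed_chapters` is the length of the freshly parsed list;
/// `prior_live_chapters` is the count of not-dropped chapters already
/// stored for the same (source_id, source_manga_key).
pub fn classify_fetch(parsed_chapters: usize, prior_live_chapters: i64) -> FetchDecision {
    if parsed_chapters == 0 && prior_live_chapters > 0 {
        FetchDecision::SkipAsTransient
    } else {
        FetchDecision::Proceed
    }
}
```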

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:17:42 +02:00
MechaCat02
5c04b0532b fix(crawler): panic-isolate the cron tick body (0.36.5)
Worker dispatch was already wrapped in AssertUnwindSafe(...)
.catch_unwind(), so a panicking handler acks the job as failed and the
worker keeps going. The cron tick had no such guard: a panic in
metadata.run, enqueue_bookmarked_pending, reap_done, or
write_last_tick would kill the cron task. The JoinSet would drop it,
workers would keep running, and no future metadata pass would ever
fire until daemon restart.

Wrap the tick body (between advisory-lock acquire and unlock) in the
same AssertUnwindSafe(...).catch_unwind() pattern. The unlock and
connection drop run unconditionally so a panicked tick doesn't leave
the lock held for another replica.
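
A synchronous, std-only sketch of the same shape. The real tick is
async and uses the futures `.catch_unwind()` adapter plus a Postgres
advisory lock; `TickLock`, `TickOutcome`, and `run_tick` below are
hypothetical stand-ins that only illustrate "release runs whether or
not the body panicked".

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// In-memory stand-in for the advisory lock.
pub struct TickLock {
    held: AtomicBool,
}

impl TickLock {
    pub fn new() -> Self {
        TickLock { held: AtomicBool::new(false) }
    }
    /// Returns true if this caller took the lock.
    pub fn acquire(&self) -> bool {
        !self.held.swap(true, Ordering::SeqCst)
    }
    pub fn release(&self) {
        self.held.store(false, Ordering::SeqCst);
    }
    pub fn is_held(&self) -> bool {
        self.held.load(Ordering::SeqCst)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum TickOutcome {
    Completed,
    Panicked(String),
    LockBusy,
}

pub fn run_tick<F: FnOnce()>(lock: &TickLock, body: F) -> TickOutcome {
    if !lock.acquire() {
        return TickOutcome::LockBusy;
    }
    // Only the body sits inside the guard.
    let result = catch_unwind(AssertUnwindSafe(body));
    // Runs on both the Ok and the panicked path.
    lock.release();
    match result {
        Ok(()) => TickOutcome::Completed,
        Err(payload) => {
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            TickOutcome::Panicked(msg)
        }
    }
}
```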

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:08:11 +02:00
MechaCat02
655ea42731 fix(crawler): scope dispatch_target to live sources, newest first (0.36.4)
The chapter dispatcher's URL resolver had no dropped_at filter and no
ORDER BY — a chapter whose only chapter_sources row had been soft-
dropped was still dispatched against the stale URL, eating retry
budget on guaranteed transients. With multiple live sources the LIMIT
1 winner was nondeterministic.

Add `AND cs.dropped_at IS NULL` and `ORDER BY cs.last_seen_at DESC`
to dispatch_target, bringing it in lockstep with the enqueue queries
in pipeline.rs that already filter on dropped_at. Returns None when
all sources are dropped — callers in daemon.rs already treat None
as "ack the job, skip the work."

Tests in tests/repo_chapter.rs cover the three branches (freshest
live wins, dropped sources skipped, all-dropped returns None).
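
An in-memory sketch of the selection rule the query now encodes. The
`ChapterSource` struct is a hypothetical projection of a
chapter_sources row, with timestamps reduced to integers.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterSource {
    pub source_url: String,
    pub last_seen_at: u64,
    pub dropped_at: Option<u64>,
}

/// Mirrors `WHERE dropped_at IS NULL ORDER BY last_seen_at DESC LIMIT 1`:
/// skip soft-dropped rows, then take the most recently seen one.
pub fn dispatch_target(sources: &[ChapterSource]) -> Option<&ChapterSource> {
    sources
        .iter()
        .filter(|s| s.dropped_at.is_none())
        .max_by_key(|s| s.last_seen_at)
}
```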

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:03:45 +02:00
MechaCat02
70e8a7895c fix(crawler): relaunch chromium on CDP / nav-timeout errors (0.36.3)
BrowserManager only re-launched chromium when the cached handle was
None. A crash mid-pass left the handle Some pointing at a dead
process — every subsequent acquire returned the zombie Browser, and
every nav cascaded CDP errors until the idle reaper fired.

Add BrowserManager::invalidate(): take the inner mutex, drop the
handle (closing it if present), and signal the next acquire to
relaunch. Idempotent — invalidating an empty handle is a no-op.

Wire detection via NavError::is_likely_browser_dead and a
chain-walking anyhow_looks_browser_dead helper: substring-match
common channel/connection/transport/WebSocket markers and surface
NavError::Timeout as "presumed dead." Apply at both error
boundaries — RealChapterDispatcher::dispatch and
RealMetadataPass::run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:39:19 +02:00
MechaCat02
8e0b638e3f fix(crawler): wait for page marker instead of fixed 1s sleep (0.36.2)
A chromium snapshot taken between the wrapper-render and row-render
phases let parse_chapter_list return Ok(vec![]) for a manga that
actually has chapters — the soft-drop branch in sync_manga_chapters
then flipped every existing chapter to dropped_at.

Add wait_for_selector to crawler::nav. navigate() now takes a CSS
marker matching the most-specific element the downstream parser will
look for (one of LIST_PAGE_MARKER / DETAIL_PAGE_CHAPTERS_MARKER /
DETAIL_PAGE_LAYOUT_MARKER). The wait is best-effort and capped by
SELECTOR_TIMEOUT (10s); a legitimately empty page can still pass
through because the parser's #chapter_table sentinel and the
universal broken-page body check stay in force.

Same pattern wired at the reader nav (a#pic_container) and probe
nav (#logo), replacing the implicit assumption that the post-load
JS had finished within 1 second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:29:38 +02:00
MechaCat02
e2bd1462ba fix(crawler): wrap wait_for_navigation in 30s timeout (0.36.1)
A hung TLS handshake or a page that never fires load could wedge a
worker (or the cron metadata pass) indefinitely — chromiumoxide
imposes no navigation timeout of its own.

New crawler::nav::wait_for_nav caps each navigation at NAV_TIMEOUT
(30s) and returns a typed NavError so timeouts surface as transient
(retryable) errors. Wired at the three navigation sites:
- source::target::navigate (catalog/detail/pagination)
- content::sync_chapter_content (chapter reader)
- session::fetch_probe_html (session probe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:10:51 +02:00
MechaCat02
9f56f283d4 feat(crawler): single-mode walker gated by recovery flag (0.36.0)
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.

This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
  displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
  (mark_seed_completed / seed_completed_at). The recovery flag
  subsumes the seed-completion signal; drop detection was explicitly
  opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
  CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.
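
A sketch of the two predicates as described above. The body of
`should_mark_clean_exit` is inferred from the description, and
`may_stop_early` is a hypothetical name for the per-manga stop rule.

```rust
/// Clean exit means the walk reached the end of the catalog or stopped
/// on purpose at a caught-up manga. A caller-imposed limit is not an
/// input, so it can never be counted as clean.
pub fn should_mark_clean_exit(walked_to_completion: bool, hit_stop_condition: bool) -> bool {
    walked_to_completion || hit_stop_condition
}

/// The early stop is only allowed when the previous run exited cleanly
/// (recovery flag was `true` when this run started).
pub fn may_stop_early(previous_run_clean: bool, metadata_unchanged: bool, new_chapters: usize) -> bool {
    previous_run_clean && metadata_unchanged && new_chapters == 0
}
```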

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:49:28 +02:00
MechaCat02
33f7e19077 fix(crawler): serialize sync_manga_chapters per-manga (0.35.6)
Two concurrent calls of sync_manga_chapters for the same manga both
read seen_keys, both run the drop UPDATE filtered on `NOT (key = ANY
$3)`, and the later commit can soft-drop a chapter the earlier had
just inserted (lost-update under MVCC). Today the cron tick is the
only caller and the daemon-level advisory lock keeps it single-flight,
but that lock is held on one pool connection and doesn't actually
serialize the *function*: any future caller (bookmark hook,
admin-triggered re-sync, parallel worker) would race against the cron.

Add `pg_advisory_xact_lock(hashtextextended(manga_id::text, 0))` at
the start of the transaction. Auto-releases on commit/rollback so a
panic mid-call can't strand the lock. Lock keyed per-manga so calls
for different mangas still parallelize.

Test sync_chapters_serializes_concurrent_calls_for_same_manga spawns
two tokio tasks calling the function concurrently with overlapping
chapter lists and asserts every chapter survives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:45:01 +02:00
MechaCat02
c6bb9160e3 fix(crawler): scope chapter_sources lookup per-manga (0.35.5)
chapter_sources's PRIMARY KEY was (source_id, source_chapter_key) and
the lookup in sync_manga_chapters didn't constrain by manga_id, so a
source whose chapter slugs aren't globally unique (e.g. "chapter-1"
appearing under multiple mangas) silently attributed every collision
to the first manga that synced it. The INSERT path would have
conflicted on the second manga's sync.

Migration 0017 drops the old PK and rekeys on (source_id, chapter_id)
— the natural identity of a per-source chapter attachment — and adds
an index on (source_id, source_chapter_key) for the lookup path. The
repo lookup now joins chapters and filters by manga_id; the UPDATE
path keys on chapter_id directly (the row's natural identifier
post-migration).

Test sync_chapters_isolates_colliding_keys_across_mangas pins the
contract end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:43:08 +02:00
MechaCat02
50763addcf fix(crawler): quarantine recently-dead chapters from re-enqueue (0.35.4)
The partial dedup index only blocks (pending|running) duplicates, so
once a SyncChapterContent job transitions to 'dead' (max_attempts
exhausted) the slot frees. Every subsequent cron tick re-enqueued the
chapter — page_count = 0 and dropped_at IS NULL stay true — burned
another max_attempts retries, and died again. Permanent-failure
chapters spun forever.

enqueue_bookmarked_pending and enqueue_pending_for_manga now skip
chapters whose latest sync_chapter_content job is dead within
CHAPTER_DEAD_QUARANTINE_DAYS (7). A failed chapter goes silent for a
week, then gets one more shot — long enough for a transient site
issue to resolve, short enough that permanent failures don't stay
permanent if conditions change.

Two integration tests pin both halves of the contract.
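
The quarantine check, sketched in memory. The real filter lives in the
enqueue SQL; `LatestJob` and `is_quarantined` are hypothetical names,
and treating a timestamp in the future as still quarantined is an
assumption.

```rust
use std::time::{Duration, SystemTime};

pub const CHAPTER_DEAD_QUARANTINE_DAYS: u64 = 7;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Done,
    Dead,
}

/// The most recent sync_chapter_content job for a chapter.
pub struct LatestJob {
    pub state: JobState,
    pub updated_at: SystemTime,
}

/// True when the chapter should be skipped by the enqueue pass.
pub fn is_quarantined(latest: Option<&LatestJob>, now: SystemTime) -> bool {
    let window = Duration::from_secs(CHAPTER_DEAD_QUARANTINE_DAYS * 24 * 60 * 60);
    match latest {
        Some(job) if job.state == JobState::Dead => match now.duration_since(job.updated_at) {
            Ok(age) => age < window,
            // updated_at is ahead of `now` (clock skew): stay quarantined.
            Err(_) => true,
        },
        _ => false,
    }
}
```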

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:42:41 +02:00
MechaCat02
766c6eebac fix(crawler): guard ack_done/ack_failed/release on state='running' (0.35.3)
The three lease-ack functions matched their UPDATE on the job id
alone. If a lease expired and another worker re-leased the row, a
late ack from the original worker would clobber the new lease's
state, leased_until, and (for release) decrement its attempts.

Add `AND state = 'running'` to each UPDATE and log a warn when
rows_affected is zero, so a stolen lease shows up in telemetry without
blocking the new lease holder's progress.

Three new integration tests pin the contract:
- ack_done_no_ops_when_lease_was_stolen
- ack_failed_no_ops_when_state_is_not_running
- release_no_ops_when_state_is_not_running
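
An in-memory sketch of the guarded update, returning the equivalent of
rows_affected. The `Job` struct and slice-backed "table" are
hypothetical; the real functions issue an UPDATE through sqlx.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Done,
    Dead,
}

#[derive(Debug)]
pub struct Job {
    pub id: u64,
    pub state: JobState,
}

/// Mirrors `UPDATE ... SET state = 'done' WHERE id = $1 AND state = 'running'`.
pub fn ack_done(jobs: &mut [Job], id: u64) -> usize {
    let mut rows_affected = 0;
    for job in jobs
        .iter_mut()
        .filter(|j| j.id == id && j.state == JobState::Running)
    {
        job.state = JobState::Done;
        rows_affected += 1;
    }
    if rows_affected == 0 {
        // Stand-in for the warn-level log line.
        eprintln!("warn: ack_done matched no running row for job {id}");
    }
    rows_affected
}
```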

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:42:18 +02:00
MechaCat02
c686d6eb51 fix(crawler): sentinel-gate parse_chapter_list to stop false drops (0.35.2)
parse_chapter_list previously returned Vec::new() on any selector
miss. The empty list flowed into sync_manga_chapters, whose soft-drop
branch then flipped every existing chapter's dropped_at to NOW().
Bookmarks subsequently pointed at dropped sources, and
enqueue_bookmarked_pending (filters on cs.dropped_at IS NULL) silently
stopped re-fetching pages.

Same shape as the walker race fixed in 0.35.1: a transient parse miss
masquerading as "source removed everything" → false soft-drop.

Fix: require #chapter_table in the DOM. Present-but-empty is preserved
as Ok(vec![]) so a freshly added manga with no published chapters
still parses cleanly. Absent table is now Transient — the job system
reschedules with backoff instead of treating the partial render as
data.
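
A rough sketch of the sentinel gate. Plain substring scanning stands in
for the real HTML parser, and the markup shape (chapter links as
`href` attributes inside the table) is an assumption made only for
illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PageError {
    Transient(String),
}

/// Returns the chapter hrefs found inside `#chapter_table`.
/// Absent table: Transient. Present but empty: Ok(vec![]).
pub fn parse_chapter_list(html: &str) -> Result<Vec<String>, PageError> {
    let Some(start) = html.find("id=\"chapter_table\"") else {
        return Err(PageError::Transient("#chapter_table missing".to_string()));
    };
    let table = &html[start..];
    // Only look at the text up to the table's closing tag.
    let table = table.split("</table>").next().unwrap_or(table);
    let hrefs = table
        .split("href=\"")
        .skip(1)
        .filter_map(|rest| rest.split('"').next())
        .map(str::to_owned)
        .collect();
    Ok(hrefs)
}
```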

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:41:47 +02:00
MechaCat02
dea9b1aaa8 fix(crawler): close walker race against site reordering (0.35.1)
The target site orders by update_date DESC, and any new or updated
manga pushes everyone down by one slot. The paginated walker was
blind to this drift:

* Backfill (page last -> 1): shifts push items into pages already
  finished. The displaced manga was silently missed; with
  mark_dropped_mangas running on a fully-completed walk, items even
  got false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item
  of an already-read page to reappear on the next page, leading to
  a redundant fetch_manga and an inflated consecutive_unchanged
  streak.

Fix is two-pronged:

1. Backfill boundary re-check. After fetching each page P, re-fetch
   the previously-walked page P+1 and check where its old slot-0
   key now sits. If it slid to slot K, the first K entries are
   items that used to live on P and slid past us; they get appended
   to the batch. If the anchor is gone entirely (multi-page shift
   or it was bumped to page 1), the whole re-fetched page is
   processed conservatively and the pipeline dedup absorbs the
   noise. The re-check must be the *last* navigation of the
   iteration to close the within-iteration race.

2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
   source_manga_keys avoids double-processing. The set uses a
   contains-then-insert pattern with insert firing *after* a
   successful upsert, so a transient fetch/upsert failure leaves
   the key retryable if it reappears later in the same pass (via
   the boundary re-check or another batch).

Incremental mode does not run the re-check (shifts move in the
same direction as the walk); only the dedup helps it.
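
The contains-then-insert ordering from point 2, sketched with a
closure standing in for fetch + upsert. `run_pass` and its return
shape are hypothetical.

```rust
use std::collections::HashSet;

/// Processes keys in order and returns (succeeded, failed).
/// A key enters `seen` only after its upsert succeeds, so a failed key
/// is retried if it shows up again later in the same pass.
pub fn run_pass<F>(keys: &[&str], mut upsert: F) -> (usize, usize)
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let (mut ok, mut failed) = (0, 0);
    for &key in keys {
        if seen.contains(key) {
            continue; // already handled in this pass
        }
        match upsert(key) {
            Ok(()) => {
                seen.insert(key.to_owned());
                ok += 1;
            }
            Err(_) => failed += 1,
        }
    }
    (ok, failed)
}
```

Inserting before the upsert would be simpler, but a transient failure
would then mark the key as done for the rest of the pass.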

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:14:01 +02:00
MechaCat02
f57ca8e45c feat: harden auth, shutdown, and session bundle (0.35.0)
Three features bundled into one release:

- rate-limit /auth/login, /register, /me/password (token bucket,
  5 req/sec sustained with 10-request burst by default; 429 +
  Retry-After header on hit; tracing::warn! per hit so operators
  see attack patterns; AUTH_RATE_PER_SEC / AUTH_RATE_BURST env knobs)
- handle SIGTERM for graceful container stops (replaces bare
  ctrl_c() with a select over ctrl_c + SignalKind::terminate() so
  docker compose stop runs the daemon shutdown path instead of
  letting Chromium leak past SIGKILL)
- clear session.user on 401 from any API call (setOn401Hook in
  api/client.ts, registered from session.svelte.ts gated on
  $app/environment::browser so the SSR bundle never installs it;
  fixes "logged in but no bookmarks/collections" mid-session
  expiry state)
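
A minimal token-bucket sketch matching the defaults quoted in the first
bullet (5 tokens per second, burst of 10). The `TokenBucket` type and
the injected `now` parameter are hypothetical; the actual limiter and
its per-client keying are not shown here.

```rust
use std::time::{Duration, Instant};

pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last: Instant,
}

impl TokenBucket {
    /// Starts full, so a fresh client can spend the whole burst at once.
    pub fn new(rate_per_sec: f64, burst: u32, now: Instant) -> Self {
        TokenBucket {
            capacity: f64::from(burst),
            refill_per_sec: rate_per_sec,
            tokens: f64::from(burst),
            last: now,
        }
    }

    /// Ok(()) admits the request. Err carries the wait until the next
    /// token, which a handler could surface as Retry-After on a 429.
    pub fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            let wait = (1.0 - self.tokens) / self.refill_per_sec;
            Err(Duration::from_secs_f64(wait))
        }
    }
}
```

Passing `now` in, rather than calling `Instant::now()` inside, keeps
the refill arithmetic testable without sleeping.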

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:27:21 +02:00
MechaCat02
8d34132883 bugfix: security & correctness bundle (0.34.1)
Five fixes bundled into one release:

- preserve user-attached tags across crawler upserts
  (repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned
  attachments from deleted users are reaped as crawler-owned)
- gate manga PATCH and cover endpoints on uploaded_by (require_can_edit
  in api::mangas; non-NULL uploaded_by must match the caller)
- equalise login response time across user-existence branches
  (run argon2 against a OnceLock-cached dummy hash on the no-user
  branch so timing doesn't leak username existence)
- crawler download defences (SSRF allowlist of host literals
  including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap,
  reject non-whitelisted image types, three-way chapter-probe
  classifier replaces the binary #avatar_menu check)
- tighten validation and clean up dead unload path
  (attach_tag + create_token enforce 64-char caps; LocalStorage
  rejects NUL bytes explicitly; reader flushFinalProgress drops
  the always-405 sendBeacon path)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:24:51 +02:00
MechaCat02
c5c1179e9d chore: full hop-by-hop header strip and 60s timeout on /api/* proxy
The SvelteKit proxy was only stripping host + content-length; the rest
of RFC 7230 §6.1 (connection, keep-alive, proxy-authenticate,
proxy-authorization, te, trailer, transfer-encoding, upgrade) leaked
through to axum. Axum doesn't emit them so the impact is theoretical,
but the proxy should be RFC-conformant. Also adds an AbortController
with a configurable 60s timeout (BACKEND_PROXY_TIMEOUT_MS) so a
wedged backend can't hang the browser request indefinitely — failures
surface as the standard 502 upstream_unavailable envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
c320eda7cd chore: dedupe is_unique_violation, lift SQL into repo, centralise URL parsing
Five layering cleanups from REVIEW.md §5 / §3:

- Drop the three private `is_unique_violation` helpers in
  repo::{user,chapter,bookmark} in favour of sqlx 0.8's
  `DatabaseError::is_unique_violation()` method (already used by
  repo::collection).
- Remove the unreachable 23505 branch in repo::chapter::create — the
  (manga_id, number) UNIQUE was dropped in 0013, so the defensive arm
  could no longer fire. A doc note records what to do if uniqueness
  is re-added.
- Move three inline SQL queries out of handlers/daemon into repo
  functions: bookmarks' chapter-belongs-to-manga guard
  (`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup
  (`repo::chapter::dispatch_target`), and the daemon's page_count
  safety net (`repo::chapter::page_count`). Restores the
  handlers→repo layering invariant in CLAUDE.md.
- New `crawler::url_utils` module consolidates host_of / origin_of /
  registrable_domain — they used to live in three crawler submodules
  with diverging edge-case behaviour. Tests moved with them.
- Doc cross-references on repo::author::set_for_manga and
  repo::genre::set_for_manga pointing to the crawler's name-keyed
  variants, so the intentional duplication is discoverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
bd9a6bd257 chore: drop dead 'failed' branch from crawler_jobs partial index
0012_crawler.sql's partial index on `state IN ('pending','failed')`
indexes a state that no code path ever writes — ack_failed in
crawler/jobs.rs only ever moves jobs to 'dead' or 'pending'. The
'failed' branch costs a write on every state change without ever
matching a query. Drop it; the CHECK still allows 'failed' so a
future migration can re-introduce it cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
ebc1966103 chore: run containers as non-root, add HEALTHCHECK, npm ci
Backend: new `app` user (UID 10001), STORAGE_DIR pre-chowned so the
named volume inherits ownership, curl installed for the HEALTHCHECK
that pings /api/v1/health. The crawler's Chromium uses --no-sandbox
already so dropping privileges costs nothing operationally.

Frontend: switch `npm install` to `npm ci` (matches CI; deterministic
versions; refuses to silently rewrite package-lock.json mid-build).
Run as the built-in `node` user via --chown=node:node, add a busybox
wget HEALTHCHECK on port 3000.

Both images now expose container-level health so orchestrators can
take a wedged container out of rotation instead of letting it keep
serving timeouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
e4333631e1 chore: run CI on PRs, require POSTGRES_PASSWORD, document HTTPS need
- .gitea/workflows/deploy.yml: trigger on pull_request to main so PRs
  get test feedback; gate build-and-push + deploy on push events so
  PRs only run the test jobs (no registry push, no SSH deploy).
- docker-compose.yml: change `${POSTGRES_PASSWORD:-mangalord}` to
  `${POSTGRES_PASSWORD:?...}` so a deploy without an .env fails fast
  instead of booting Postgres with a known-default credential.
- .env.example: change the example value to a "change-me" sentinel,
  add a banner explaining that production needs HTTPS in front of
  the frontend container because COOKIE_SECURE=true makes browsers
  refuse cookies over plain HTTP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
e7662d18d6 feat: gitea actions for build, push, and ssh deploy (0.34.0)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:56:13 +02:00
MechaCat02
45ce0d8f12 feat: incremental crawl mode with seed-completion gate (0.33.0)
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.

`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 06:41:26 +02:00
MechaCat02
51f42b03e9 feat: default crawler browser to headless (0.32.0)
LaunchOptions::from_env() and LaunchOptions::default() now return
BrowserMode::Headless. The in-process daemon (via CrawlerConfig::from_env)
and the standalone crawler binary both pick this up — no display
required for production runs, smaller resource footprint.

`Headed` stays as an explicit opt-in via CRAWLER_BROWSER_MODE=headed
for debugging or sites that fingerprint headless Chrome. New unit test
locks the default in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:27:05 +02:00
MechaCat02
fa0a7da311 feat: edit existing manga metadata (0.31.0)
Adds PUT /mangas/:id/cover (multipart) and DELETE /mangas/:id/cover so
covers can be replaced or cleared after creation, and wires a dedicated
/manga/[id]/edit SvelteKit route that combines the existing PATCH with
the new cover endpoints. Cover PUT cleans up the old blob when the
extension changes, swallowing StorageError::NotFound so a manually deleted
file doesn't surface as a 404 to the client. Edit link on the manga
detail page is gated on session.user, matching the auth posture of the
underlying handlers.

Also pins the local-dev port story via loadEnv() in vite.config.ts so
VITE_PORT / BACKEND_URL from a (gitignored) .env keep the dev URL
stable across runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:26:23 +02:00
MechaCat02
9ff49166a5 feat: transient-page detection across the crawler (0.30.0)
Until now, when the target site returned its 403 "we're sorry, the
request file are not found" response on a page that actually exists,
selectors matched nothing and the crawler treated the page as
"legitimately empty". Pagination walks silently dropped whole pages
worth of mangas, fetch_manga skipped individual entries, and the
startup session probe blamed PHPSESSID for what was a site hiccup.

This branch adds a single detection layer that the whole pipeline
routes through:

- `crawler::detect`: PageError::Transient typed signal, plus two
  primitives (`is_broken_page_body` matches the universal 403 body;
  `has_logo_sentinel` asserts #logo, the site-wide header element)
  and a `retry_on_transient` helper that retries a closure on
  Transient with a small attempt budget.
- `navigate()` screens every fetched body for the broken-page
  signature before handing it to a selector.
- Parsers (`parse_manga_list_from`, `parse_manga_detail`,
  `parse_chapter_pages`) check their structural sentinels (#logo for
  full-layout pages; a#pic_container for the reader, which doesn't
  render #logo) and return Result<_, PageError>. Empty Vec is now
  reserved for genuinely empty pages.
- `discover()` retries each pagination page up to 3× (2s apart) before
  failing the whole Discover job — at which point the existing job
  system's retry/backoff takes over for longer outages.
- `verify_session` is three-state: broken-page → retry probe;
  #logo present but #avatar_menu absent → genuine logout (the only
  state that should blame PHPSESSID); both present → ok.
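
A blocking sketch of a `retry_on_transient`-style helper. The real one
is presumably async; the signature, the `Permanent` variant, and the
"always at least one attempt" behaviour are assumptions made for
illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum PageError {
    Transient(String),
    Permanent(String), // hypothetical non-retryable variant
}

/// Calls `op` until it returns something other than Transient or the
/// attempt budget runs out. The last result is returned either way.
pub fn retry_on_transient<T, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, PageError>
where
    F: FnMut() -> Result<T, PageError>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Err(PageError::Transient(_)) if attempt < max_attempts => {
                attempt += 1;
                sleep(delay);
            }
            other => return other,
        }
    }
}
```

With `max_attempts = 3` and a 2s delay this matches the pagination
budget described for `discover()`.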

Test coverage added at the helper level: 13 unit tests for the
detection module (body signature, logo sentinel, PageError, retry
helper), parser-level tests for both transient and legitimately-empty
inputs, and 6 unit tests for the session probe classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:47:21 +02:00
MechaCat02
b845d88766 feat: bookmark create enqueues SyncChapterContent jobs (0.29.0)
After a successful bookmark insert, the create handler spawns a
detached tokio task that calls pipeline::enqueue_pending_for_manga
for every chapter of the manga where page_count = 0 and the source
row is not dropped. Bookmark create returns 201 immediately; enqueue
work happens in the background and its failure is logged without
surfacing to the user (the daily cron sweeps anything missed).

The Phase A dedup index handles re-bookmarks idempotently — deleting
and recreating a bookmark does not duplicate in-flight jobs — and the
Phase B worker pool drains them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:59:14 +02:00
MechaCat02
9fe0f26d75 feat: in-process crawler daemon with cron and worker pool (0.28.0)
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.

Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
  around browser::Handle, with an on_launch hook that re-injects
  PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
  /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
  used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
  idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
  marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:32:02 +02:00
MechaCat02
93c7fd63fc feat: crawler job queue ops and dedup index (0.27.0)
Adds enqueue / lease / ack_done / ack_failed / release / reap_done on
crawler::jobs, backed by the existing crawler_jobs table. lease() uses
a single FOR UPDATE SKIP LOCKED CTE that also re-claims stale running
rows (crashed-worker recovery), and ack_failed applies an exponential
backoff capped at 1h before retrying.
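
A sketch of a capped exponential backoff. Only the 1h cap comes from
the text above; the 30s base and the doubling factor are assumptions,
as is the `retry_delay` name.

```rust
use std::time::Duration;

pub const BACKOFF_CAP: Duration = Duration::from_secs(60 * 60);
/// Assumed base delay; the actual value is not stated here.
pub const BACKOFF_BASE: Duration = Duration::from_secs(30);

/// Delay before the next try after `attempts` failed attempts:
/// base * 2^(attempts - 1), clamped to the cap.
pub fn retry_delay(attempts: u32) -> Duration {
    // Clamp the exponent so the shift cannot overflow.
    let exp = attempts.saturating_sub(1).min(31);
    let factor = 1u64 << exp;
    let secs = BACKOFF_BASE.as_secs().saturating_mul(factor);
    Duration::from_secs(secs).min(BACKOFF_CAP)
}
```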

Migration 0014 adds a partial unique index on
(payload->>'chapter_id') restricted to (pending|running)
sync_chapter_content jobs, so producers can just
INSERT ... ON CONFLICT DO NOTHING without racing each other. The slot
frees again the moment the job leaves the in-flight states, so a
future force-refetch can re-enqueue.

Library-only — no daemon, no API hook. Those land in the next two
phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:59:09 +02:00
MechaCat02
89b84252a5 bugfix: subquery-wrap pending chapters query so DISTINCT + ORDER BY agree (0.26.1)
PG rejects `SELECT DISTINCT c.id, c.manga_id, cs.source_url ... ORDER BY
c.manga_id, c.created_at` because the ORDER BY references a column not in
the DISTINCT projection. Wrap the DISTINCT in a subquery (which includes
created_at) and apply the ORDER BY in the outer SELECT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 22:20:15 +02:00
MechaCat02
728d704a66 feat: CRAWLER_KEEP_BROWSER_OPEN waits for Ctrl+C in headed mode (0.26.0)
Debug aid: when set in headed mode, the crawler blocks on Ctrl+C at
every shutdown point (early auth bails + normal completion) instead
of closing the browser immediately. Operator can inspect DOM, cookies,
and network state in the visible Chromium window before exit.

Ignored in headless (no window to inspect) — logged as a warning if
set under headless so the operator doesn't sit waiting.

chromiumoxide's `Browser` is `kill_on_drop`, so the close-or-wait
helper must await Ctrl+C *before* the Handle is dropped — otherwise
the Chromium child gets killed out from under the operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 21:33:18 +02:00
MechaCat02
d24e68c78d feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.
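
A minimal sketch of why the `session` module needs the `CRAWLER_COOKIE_DOMAIN` override: keeping only the last two host labels works for `example.com` but yields a bare public suffix for multi-part TLDs. Both function names are hypothetical, and the real derivation may differ in detail.

```rust
/// Hypothetical sketch: derive a cookie domain of the form
/// `.<registrable>.<tld>` by keeping the last two labels of the host.
/// The input is a bare host name, already extracted from the start URL.
fn naive_cookie_domain(host: &str) -> Option<String> {
    let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
    if labels.len() < 2 {
        return None; // e.g. "localhost": nothing registrable to derive
    }
    let tail = &labels[labels.len() - 2..];
    Some(format!(".{}", tail.join(".")))
}

/// An explicit override (the role `CRAWLER_COOKIE_DOMAIN` plays) wins over
/// the derived value; an empty override counts as unset.
fn cookie_domain(host: &str, override_domain: Option<&str>) -> Option<String> {
    match override_domain {
        Some(d) if !d.is_empty() => Some(d.to_string()),
        _ => naive_cookie_domain(host),
    }
}
```

For a host such as `manga.example.co.uk` the naive path returns `.co.uk`, which is not a usable cookie domain, so the operator supplies `.example.co.uk` explicitly.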

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:28:36 +02:00
MechaCat02
51346227dd feat: route reader by chapter id, allow duplicate-numbered chapters (0.24.0)
Real-world sources publish multiple chapters at the same number:
different scanlators ("Ch.52 from bloomingdale" + "Ch.52 from mina"),
translator notices and farewells, alt-translations. The (manga_id,
number) UNIQUE constraint from 0001 silently collapsed all of those
into a single row via the upsert path in repo::crawler. Migration 0013
drops the constraint; sync_manga_chapters now plain-INSERTs each
SourceChapterRef so every parsed chapter survives as its own row.

Identity moves from the (manga_id, number) tuple to the chapter UUID:

- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id` (replaces :number)
- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id/pages`
- `repo::chapter::find_by_id_in_manga` (replaces find_by_manga_and_number)
- Frontend reader route renamed to `/manga/[id]/chapter/[chapter_id]`
- Chapter links throughout (manga page list, continue-reading CTA,
  reader prev/next, history rows, bookmark cards) use chapter.id
- API clients getChapter/getChapterPages take a chapter id string

read_progress + bookmarks already FK chapter_id; they only enrich with
chapter_number for display, which is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:37:07 +02:00
MechaCat02
c51353ead3 bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1)
Listing links point at the reader's page 1
(`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took
the last URL segment and returned `"pg-1"` for every chapter, so all
parsed chapters collapsed onto a single `chapter_sources` row downstream
and the first-manga chapter was the only row that survived. New
`derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before
picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`).
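
The key derivation described above could be sketched like this, using only the standard library. The function name and the `example.com` URLs are hypothetical; unlike the `/pg-\d+/` pattern quoted above, this version also accepts a URL without the trailing slash, and it makes no attempt to handle query strings.

```rust
/// Hypothetical sketch: drop a trailing `pg-<digits>` segment, then
/// return the last remaining path segment as the chapter key.
fn chapter_key_from_url(url: &str) -> Option<&str> {
    let mut segments: Vec<&str> = url.split('/').filter(|s| !s.is_empty()).collect();
    // Copy the last segment out so the vector can be mutated afterwards.
    if let Some(last) = segments.last().copied() {
        if let Some(digits) = last.strip_prefix("pg-") {
            if !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()) {
                segments.pop(); // reader page suffix, not part of the identity
            }
        }
    }
    segments.last().copied()
}
```

With this shape, `.../uu/br_chapter-12/pg-1/` and `.../uu/br_chapter-12/` both map to `br_chapter-12`, so each chapter gets its own key instead of every listing link collapsing onto `pg-1`.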

Notices, hiatus rows, and duplicate-numbered chapters are preserved as
distinct parser entries. The (manga_id, number) UNIQUE collapse in the
chapters table is a separate, follow-up concern handled in
feat/chapter-id-routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:15:36 +02:00
145 changed files with 20434 additions and 780 deletions


@@ -1,20 +1,30 @@
# Copy to .env for `docker compose up --build`. Local-dev runs (cargo run
# / npm run dev) read backend/.env if present, or pick up the variables
# from your shell.
#
# Production note: COOKIE_SECURE=true (the default below) makes browsers
# refuse to send the session cookie over plain HTTP. Run with a TLS-
# terminating reverse proxy (Caddy, Traefik, nginx) in front — the
# compose file here doesn't ship one. Local/dev runs without HTTPS
# should set COOKIE_SECURE=false.
# ----- Postgres -----
# These are read by the Postgres container *and* by DATABASE_URL below;
# changing them after the first boot won't migrate existing data, so set
# them up front for any new deployment.
#
# POSTGRES_PASSWORD is REQUIRED — docker-compose.yml fails fast if it
# isn't set in this file, to prevent a deploy without an .env booting
# Postgres with a publicly-known credential.
POSTGRES_USER=mangalord
POSTGRES_PASSWORD=mangalord
POSTGRES_PASSWORD=change-me-to-a-strong-random-string
POSTGRES_DB=mangalord
# ----- Backend -----
DATABASE_URL=postgres://mangalord:mangalord@postgres:5432/mangalord
BIND_ADDRESS=0.0.0.0:8080
STORAGE_DIR=/var/lib/mangalord/storage
RUST_LOG=info,mangalord=debug
RUST_LOG=info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off
# ----- Auth / cookies -----
# COOKIE_SECURE controls whether the `Secure` flag is set on the session
@@ -29,6 +39,13 @@ COOKIE_DOMAIN=
# get reaped lazily.
SESSION_TTL_DAYS=30
# ----- Auth brute-force rate limits -----
# Token-bucket budget shared across /auth/login, /auth/register, and
# /auth/me/password. Set per_sec=0 to disable (e.g. behind a
# rate-limiting reverse proxy that already enforces a budget).
AUTH_RATE_PER_SEC=5
AUTH_RATE_BURST=10
# ----- CORS -----
# Comma-separated origins allowed to call the API with credentials.
# Default is empty: same-origin only. Set when frontend and backend live
@@ -44,6 +61,69 @@ MAX_REQUEST_BYTES=209715200
# Default 20 MiB.
MAX_FILE_BYTES=20971520
# ----- Crawler download safety -----
# Hosts the crawler is allowed to fetch images/covers from, in addition
# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
CRAWLER_DOWNLOAD_ALLOWLIST=
# Bypass the host allowlist entirely. Intended for sources that shard
# images across numbered CDN subdomains (cdn1/cdn2/…) where enumerating
# every host upfront is impractical. The private-IP / localhost / non-
# http(s) scheme defenses STAY ON — a scraped <img src="http://10.0.0.1/">
# is still refused with this flag set.
CRAWLER_ALLOW_ANY_HOST=false
# Hard cap on a single image body. Default 32 MiB.
CRAWLER_MAX_IMAGE_BYTES=33554432
# Max manga detail fetches per metadata pass (both the in-process daemon
# and the `bin/crawler` CLI). 0 means no cap — let the source walker run
# to completion. Useful for capped test runs against a new source.
CRAWLER_LIMIT=0
# Path to a system Chromium binary. When set, the crawler skips the
# bundled-fetcher download. Required on platforms without a usable
# upstream Chromium build (notably Linux_arm64 / Raspberry Pi). On
# Debian: /usr/bin/chromium-headless-shell or /usr/bin/chromium. On
# Ubuntu the package is chromium-browser (different path). Pair with
# `docker compose build --build-arg INSTALL_CHROMIUM=true backend` so
# the image actually contains the binary.
CRAWLER_CHROMIUM_BINARY=
# ----- Crawler TOR proxy + recircuit -----
# The compose stack ships a `tor` service (dockurr/tor) and defaults
# CRAWLER_PROXY to it, so by default all crawler traffic exits via the
# TOR network. To opt out, set CRAWLER_PROXY= (empty) AND
# CRAWLER_TOR_CONTROL_URL= (empty) below — the tor service can stay
# running, it just won't be used.
#
# Going through TOR adds latency to every fetch; image downloads in
# particular slow noticeably. The win is on sites that rate-limit or
# fingerprint by exit IP — NEWNYM recirculation makes a fresh exit
# cheap to reach for.
#
# CRAWLER_PROXY: SOCKS5(h) URL. Use `socks5h://` (not `socks5://`) so
# DNS resolution also goes through TOR, avoiding leaks via the host's
# resolver. Leave unset to talk to the upstream directly.
CRAWLER_PROXY=socks5h://tor:9050
# Control-port URL for SIGNAL NEWNYM ("get a fresh circuit"). Triggered
# automatically on bad pages (broken-page body, missing #logo) and on
# the Unauthenticated session probe outcome. Leave unset to disable
# the recircuit feature (the SOCKS proxy still works).
CRAWLER_TOR_CONTROL_URL=tcp://tor:9051
# Max NEWNYM-and-retry cycles per recircuit-eligible failure. Default 3.
CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS=3
# ----- TOR control-port password -----
# Shared between the bundled dockurr/tor service (which hashes it into
# its HashedControlPassword) and the backend's
# CRAWLER_TOR_CONTROL_PASSWORD. REQUIRED — docker-compose.yml fails
# fast if absent. Generate a strong random string; rotate by setting
# a new value and restarting both `tor` and `backend`.
#
# Operators running their own non-dockurr tor daemon with cookie-file
# auth can ignore this var and instead set
# CRAWLER_TOR_CONTROL_COOKIE_PATH on the backend — the TorController
# prefers cookie when both are present.
TOR_CONTROL_PASSWORD=change-me-to-a-strong-random-string
# ----- Frontend -----
# The frontend container runs SvelteKit's Node adapter on :3000 and
# proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the
@@ -51,3 +131,8 @@ MAX_FILE_BYTES=20971520
# internal docker network. Override only if you're running the
# frontend container against a backend somewhere else.
BACKEND_URL=http://backend:8080
# Per-request wall-clock cap for the /api/* reverse proxy (milliseconds).
# Default 300000 (5 min) covers a typical 200 MiB chapter upload over
# 25 Mbps; raise for users on slower upstream links or lower if a
# tighter front proxy already bounds the request lifetime.
BACKEND_PROXY_TIMEOUT_MS=300000

71
.gitea/README.md Normal file

@@ -0,0 +1,71 @@
# Gitea Actions
The [`deploy`](workflows/deploy.yml) workflow runs on every push to `main`
(and via manual `workflow_dispatch`). It tests, builds, pushes the images
to a private registry, and rolls the stack over by SSH on the target host.
## Required secrets
Set under *Repo Settings → Actions → Secrets*:
| Name | Example | Purpose |
| -------------------- | ------------------------ | ---------------------------------------------------------------- |
| `REGISTRY_URL` | `registry.example.com` | Registry host. No scheme, no trailing slash. |
| `REGISTRY_USERNAME` | `mangalord-ci` | `docker login` user. |
| `REGISTRY_PASSWORD` | `<token>` | `docker login` token/password. |
| `SSH_HOST` | `mangalord.example.com` | Deploy target hostname/IP. |
| `SSH_USER` | `deploy` | SSH user on the target (must be in the `docker` group). |
| `SSH_PRIVATE_KEY` | `-----BEGIN OPENSSH...` | Private key authorised in the target user's `authorized_keys`. |
| `SSH_PORT` | `22` | Optional. Defaults to `22` if unset. |
## Required variables
Set under *Repo Settings → Actions → Variables* (not secrets — they appear
in logs):
| Name | Example | Purpose |
| ------------- | ------------------------ | ---------------------------------------------------------------------- |
| `DEPLOY_PATH` | `/srv/mangalord` | Directory on target holding `docker-compose.yml`, `.env`, and the prod overlay. |
## One-time host setup
The workflow assumes the deploy target already has:
1. Docker + Docker Compose v2 installed and the `SSH_USER` in the `docker` group.
2. `$DEPLOY_PATH/docker-compose.yml` (copy of the repo's [docker-compose.yml](../docker-compose.yml)).
3. `$DEPLOY_PATH/docker-compose.prod.yml` (copy of the repo's [docker-compose.prod.yml](../docker-compose.prod.yml)).
4. `$DEPLOY_PATH/.env` populated from [.env.example](../.env.example) with production values (real `POSTGRES_PASSWORD`, `COOKIE_SECURE=true`, etc.).
Bootstrap once:
```bash
ssh deploy@mangalord.example.com
sudo mkdir -p /srv/mangalord && sudo chown deploy:deploy /srv/mangalord
cd /srv/mangalord
# place docker-compose.yml, docker-compose.prod.yml, and .env here
```
The first workflow run will pull the images, bring the stack up, and run
the embedded migrations on startup.
## Image tags
Every push produces three tags per image:
- `mangalord-{backend,frontend}:latest`
- `mangalord-{backend,frontend}:<git-sha>` — used by the deploy job; lets
you pin a deploy to a specific commit
- `mangalord-{backend,frontend}:<version>` — the version from
[backend/Cargo.toml](../backend/Cargo.toml) (verified in lockstep with
[frontend/package.json](../frontend/package.json))
## Rollback
SSH to the target, set `IMAGE_TAG` to a previous commit SHA, and re-up:
```bash
cd /srv/mangalord
export REGISTRY_URL=registry.example.com
export IMAGE_TAG=<previous-sha>
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

153
.gitea/workflows/deploy.yml Normal file

@@ -0,0 +1,153 @@
name: deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
workflow_dispatch:
jobs:
test-backend:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_USER: mangalord
POSTGRES_PASSWORD: mangalord
POSTGRES_DB: mangalord
options: >-
--health-cmd "pg_isready -U mangalord"
--health-interval 5s
--health-timeout 5s
--health-retries 10
env:
DATABASE_URL: postgres://mangalord:mangalord@postgres:5432/mangalord
steps:
- uses: actions/checkout@v4
# ubuntu-latest has node (so JS actions like checkout/cache run) but no
# Rust. We intentionally avoid `container: rust:1-slim` because act_runner
# runs JS actions with node *inside* the job container, and the slim Rust
# image ships no node (checkout would fail with exit 127).
- name: Install Rust + build deps
run: |
set -eu
SUDO=""; [ "$(id -u)" = "0" ] || SUDO="sudo"
$SUDO apt-get update
$SUDO apt-get install -y --no-install-recommends pkg-config libssl-dev ca-certificates curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --profile minimal --default-toolchain stable
echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
- name: Cache cargo registry and target
uses: actions/cache@v4
with:
path: |
~/.cargo/registry
~/.cargo/git
backend/target
key: cargo-${{ runner.os }}-${{ hashFiles('backend/Cargo.lock') }}
restore-keys: |
cargo-${{ runner.os }}-
- name: cargo test
working-directory: backend
run: cargo test --locked
test-frontend:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: npm
cache-dependency-path: frontend/package-lock.json
- name: npm ci
working-directory: frontend
run: npm ci
- name: vitest
working-directory: frontend
run: npm test
build-and-push:
runs-on: ubuntu-latest
needs: [test-backend, test-frontend]
# PRs only run the test jobs; build + deploy are reserved for
# post-merge pushes to main.
if: github.event_name != 'pull_request'
# Build on the host docker daemon directly (docker-outside-of-docker):
# the runner shares the deploy host's daemon, so a plain `docker build`
# reuses the host's layer cache and avoids buildx's docker-container
# driver + the gha cache exporter — neither works against this single-host
# act_runner, and there is no in-job daemon socket unless we mount it.
container:
image: docker.gitea.com/runner-images:ubuntu-latest
volumes:
- /var/run/docker.sock:/var/run/docker.sock
outputs:
image_tag: ${{ steps.meta.outputs.image_tag }}
version: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- name: Resolve image tags
id: meta
run: |
version="$(grep -m1 '^version' backend/Cargo.toml | cut -d'"' -f2)"
frontend_version="$(grep -m1 '"version"' frontend/package.json | cut -d'"' -f4)"
if [ "$version" != "$frontend_version" ]; then
echo "Version mismatch: backend=$version frontend=$frontend_version" >&2
exit 1
fi
echo "image_tag=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
echo "version=${version}" >> "$GITHUB_OUTPUT"
- name: Build & push backend + frontend
env:
REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
IMAGE_TAG: ${{ steps.meta.outputs.image_tag }}
VERSION: ${{ steps.meta.outputs.version }}
run: |
set -eu
echo "$REGISTRY_PASSWORD" | docker login "$REGISTRY_URL" -u "$REGISTRY_USERNAME" --password-stdin
for svc in backend frontend; do
img="$REGISTRY_URL/mangalord-$svc"
docker build -t "$img:$IMAGE_TAG" -t "$img:latest" -t "$img:$VERSION" "./$svc"
for tag in "$IMAGE_TAG" latest "$VERSION"; do docker push "$img:$tag"; done
done
docker logout "$REGISTRY_URL"
deploy:
runs-on: ubuntu-latest
needs: build-and-push
if: github.event_name != 'pull_request'
# Single-host deploy: the runner lives on the same box as the stack, so we
# drive the host docker daemon directly (the job mounts the host docker
# socket) instead of SSHing out. The compose dir is bind-mounted at its
# REAL host path so compose's relative bind-mounts (./mangalord/...,
# ./Caddyfile) resolve; both paths must be in the runner's
# container.valid_volumes. The central compose references the images as
# registry.mc02.dev/mangalord-*:${MANGALORD_TAG:-latest}, so we only pull
# and recreate the two mangalord services at the freshly built SHA.
container:
image: docker:cli
volumes:
- /mnt/ssd/docker-data:/mnt/ssd/docker-data
- /var/run/docker.sock:/var/run/docker.sock
steps:
- name: Deploy to the local stack
working-directory: /mnt/ssd/docker-data
env:
REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
IMAGE_TAG: ${{ needs.build-and-push.outputs.image_tag }}
run: |
set -eu
echo "$REGISTRY_PASSWORD" | docker login "$REGISTRY_URL" -u "$REGISTRY_USERNAME" --password-stdin
export MANGALORD_TAG="$IMAGE_TAG"
docker compose pull mangalord-backend mangalord-frontend
docker compose up -d mangalord-backend mangalord-frontend
docker image prune -f
docker logout "$REGISTRY_URL"

78
HANDOFF.md Normal file

@@ -0,0 +1,78 @@
# Hand-off — 2026-06-05
Snapshot of repo state for whoever picks up next. Pair with [CLAUDE.md](CLAUDE.md) for the architecture, dev rules, and command crib.
## Where main is
`main` is at **0.52.0** (commit `679abae`, `feat(chapter): preserve source-site order in chapter list`).
Recently landed (in order):
- `0.52.0` `feat(chapter)`: source-site order in chapter list; new `source_index` column + reverse-sort. Migration `0021_chapter_source_index.sql` runs on next backend start.
- `0.51.2` `fix(reader)`: drop `Chapter N:` prefix from title display; `chapterLabel()` helper in `$lib/api/chapters`.
- `0.47.0` `feat(crawler)`: honour `CRAWLER_LIMIT` in the in-process daemon (previously CLI-only).
All three were authored this session.
## In-flight branches (pushed, not yet on main)
The remote now has 18 local-only branches. The ones most relevant to current work:
| Branch | Tip ver | Notes |
| --- | --- | --- |
| `feat/crawler-observability-and-reliability` | 0.54.0 | Active dev branch — admin crawler dashboard, live status SSE, coordinated browser restart, dead-job requeue, runtime PHPSESSID refresh, reliability fixes. Now also carries the `restart_browser` clears-`session_expired` fix from this session (commit `fb4182f`, 0.53.1). |
| `feat/private-mode` | 0.48.0 | Site-wide auth gate via `PRIVATE_MODE`. Already merged into main during this session. Branch left around for reference. |
| `feat/cover-retry-and-force-resync` | 0.50.0 | Cover retry backfill + admin force-resync. Already merged into main. |
The older `bugfix/*`, `chore/*`, and other `feat/*` branches are pre-0.40 era WIP — they predate this session and may be stale; verify before reviving.
## Session-specific changes worth flagging
### 1. `CRAWLER_LIMIT` now caps the daemon
Before: `bin/crawler` honoured `CRAWLER_LIMIT`, but the in-process daemon called `pipeline::run_metadata_pass(..., 0, ...)` so it always swept the full catalog.
Fix: `CrawlerConfig.manga_limit` reads `CRAWLER_LIMIT` (default `0` = no cap), threaded through `RealMetadataPass::run`. Same env var as the CLI.
Tests: `backend/src/config.rs::tests::crawler_limit_env_populates_manga_limit` (and the unset default).
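A rough sketch of the "0 means no cap" convention, assuming the limit is modelled as an `Option<usize>`. The helper names are hypothetical, and treating an unparsable value as "no cap" is an assumption rather than the confirmed behaviour of `CrawlerConfig`.

```rust
/// Hypothetical sketch: turn the raw `CRAWLER_LIMIT` value into a cap.
/// `None` means "no cap"; unset, unparsable and `0` all map to it.
fn parse_manga_limit(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
}

/// Apply the cap to an iterator of work items (for example manga URLs).
fn capped<I: Iterator>(items: I, limit: Option<usize>) -> impl Iterator<Item = I::Item> {
    items.take(limit.unwrap_or(usize::MAX))
}
```

A caller would feed it the environment, for instance `parse_manga_limit(std::env::var("CRAWLER_LIMIT").ok().as_deref())`, and pass the result to both the daemon and the CLI so the two share one interpretation.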
### 2. Browser restart now drops the sticky `session_expired` flag
On branch `feat/crawler-observability-and-reliability` only (not on main yet).
Bug: hitting the admin "Restart browser" button ran `coordinated_restart` (which re-runs `on_launch` → re-injects PHPSESSID → re-probes), but `session_expired` stayed `true` regardless. UI continued to report Session Expired and chapter workers kept idling.
Fix: on `Ok(())` from `coordinated_restart`, the handler now calls `c.session.clear_expired()`. Error path still leaves it set. See [backend/src/api/admin/crawler.rs:230-262](backend/src/api/admin/crawler.rs#L230-L262).
Test: `backend/src/crawler/session_control.rs::tests::clear_expired_flips_sticky_flag_without_touching_session`.
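The success-only clearing can be illustrated with a small stand-in. `SessionFlag` and `after_restart` are hypothetical names, and the `AtomicBool` representation is an assumption; the real `session` type and handler live in the files referenced above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the session state: a sticky "expired" flag.
struct SessionFlag {
    expired: AtomicBool,
}

impl SessionFlag {
    fn new() -> Self {
        Self { expired: AtomicBool::new(false) }
    }
    fn mark_expired(&self) {
        self.expired.store(true, Ordering::SeqCst);
    }
    fn clear_expired(&self) {
        self.expired.store(false, Ordering::SeqCst);
    }
    fn is_expired(&self) -> bool {
        self.expired.load(Ordering::SeqCst)
    }
}

/// Mirrors the handler logic described above: the flag is cleared only
/// when the restart succeeded; a failed restart leaves it set.
fn after_restart(session: &SessionFlag, restart: Result<(), String>) -> Result<(), String> {
    if restart.is_ok() {
        session.clear_expired();
    }
    restart
}
```

Keeping the flag set on the error path means the UI keeps reporting Session Expired until a restart actually re-establishes the session.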
## Dev stack
Compose ships db + tor (frontend runs natively). See `docker-compose.dev.yml`.
```bash
docker compose -f docker-compose.dev.yml up -d
(cd frontend && npm run dev) # http://localhost:5175
```
Backend dev command template (fill in `<START_URL>`):
```bash
cd backend && \
CRAWLER_START_URL="<START_URL>" \
CRAWLER_LIMIT=96 \
CRAWLER_DAEMON=true \
CRAWLER_PROXY=socks5h://localhost:9050 \
CRAWLER_TOR_CONTROL_URL=tcp://localhost:9051 \
CRAWLER_TOR_CONTROL_PASSWORD=dev-tor-password \
CRAWLER_ALLOW_ANY_HOST=true \
ADMIN_USERNAME=admin ADMIN_PASSWORD=admin \
cargo run --release
```
`backend/.env` already pins `DATABASE_URL`, `BIND_ADDRESS=0.0.0.0:18080`, `STORAGE_DIR`, `COOKIE_SECURE=false`. It also extends `RUST_LOG` to silence the noisy `chromiumoxide::conn` / `chromiumoxide::handler` WS-deserialize lines.
## Open items / known nuance
- **Vite generates `vite.config.ts.timestamp-*.mjs`** as a transient artifact when the dev server is running. It's not in `.gitignore`; consider adding `frontend/vite.config.ts.timestamp-*.mjs` to the root `.gitignore` to stop it showing up in `git status`. Deleting the file is safe; Vite re-creates it as needed.
- **Pre-bump drift.** Manifest versions have been hand-bumped ahead of commits a couple of times during this session (e.g. `0.53.0` sitting uncommitted in the working tree). When you land work on the active branches, double-check that the `backend/Cargo.toml`, `backend/Cargo.lock`, `frontend/package.json` triplet is in lockstep before committing.
- **`feat/crawler-observability-and-reliability` is multi-feature.** It carries observability, reliability, runtime session control, dashboard, and now the session-expired-clear fix — landing it as one squash on main would be a sizeable diff. Consider whether to split it (e.g. split out the dashboard + dead-job requeue into its own slice) before merging.

222
backend/Cargo.lock generated

@@ -397,6 +397,28 @@ dependencies = [
"windows-link",
]
[[package]]
name = "chrono-tz"
version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "93698b29de5e97ad0ae26447b344c482a7284c737d9ddc5f9e52b74a336671bb"
dependencies = [
"chrono",
"chrono-tz-build",
"phf 0.11.3",
]
[[package]]
name = "chrono-tz-build"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c088aee841df9c3041febbb73934cfc39708749bf96dc827e3359cd39ef11b1"
dependencies = [
"parse-zoneinfo",
"phf 0.11.3",
"phf_codegen 0.11.3",
]
[[package]]
name = "concurrent-queue"
version = "2.5.0"
@@ -423,6 +445,24 @@ dependencies = [
"version_check",
]
[[package]]
name = "cookie_store"
version = "0.22.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "15b2c103cf610ec6cae3da84a766285b42fd16aad564758459e6ecf128c75206"
dependencies = [
"cookie",
"document-features",
"idna",
"log",
"publicsuffix",
"serde",
"serde_derive",
"serde_json",
"time",
"url",
]
[[package]]
name = "core-foundation-sys"
version = "0.8.7"
@@ -601,6 +641,15 @@ dependencies = [
"syn",
]
[[package]]
name = "document-features"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d4b8a88685455ed29a21542a33abd9cb6510b6b129abadabdcef0f4c55bc8f61"
dependencies = [
"litrs",
]
[[package]]
name = "dotenvy"
version = "0.15.7"
@@ -1153,7 +1202,7 @@ dependencies = [
"js-sys",
"log",
"wasm-bindgen",
"windows-core",
"windows-core 0.62.2",
]
[[package]]
@@ -1386,6 +1435,12 @@ version = "0.8.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "92daf443525c4cce67b150400bc2316076100ce0b3686209eb8cf3c31612e6f0"
[[package]]
name = "litrs"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "11d3d7f243d5c5a8b9bb5d6dd2b1602c0cb0b9db1621bafc7ed66e35ff9fe092"
[[package]]
name = "lock_api"
version = "0.4.14"
@@ -1415,7 +1470,7 @@ checksum = "c41e0c4fef86961ac6d6f8a82609f55f31b05e4fce149ac5710e439df7619ba4"
[[package]]
name = "mangalord"
version = "0.23.0"
version = "0.52.0"
dependencies = [
"anyhow",
"argon2",
@@ -1426,12 +1481,14 @@ dependencies = [
"bytes",
"chromiumoxide",
"chrono",
"chrono-tz",
"dotenvy",
"futures-core",
"futures-util",
"http-body-util",
"infer",
"mime",
"nix 0.29.0",
"rand 0.8.6",
"reqwest",
"scraper",
@@ -1440,6 +1497,7 @@ dependencies = [
"sha2",
"sqlx",
"subtle",
"sysinfo",
"tempfile",
"thiserror 1.0.69",
"time",
@@ -1547,6 +1605,18 @@ version = "1.0.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "650eef8c711430f1a879fdd01d4745a7deea475becfb90269c06775983bbf086"
[[package]]
name = "nix"
version = "0.29.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "71e2746dc3a24dd78b3cfcb7be93368c6de9963d30f43a6a73998a9cf4b17b46"
dependencies = [
"bitflags",
"cfg-if",
"cfg_aliases",
"libc",
]
[[package]]
name = "nix"
version = "0.31.3"
@@ -1559,6 +1629,15 @@ dependencies = [
"libc",
]
[[package]]
name = "ntapi"
version = "0.4.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c3b335231dfd352ffb0f8017f3b6027a4917f7df785ea2143d8af2adc66980ae"
dependencies = [
"winapi",
]
[[package]]
name = "nu-ansi-term"
version = "0.50.3"
@@ -1799,7 +1878,7 @@ checksum = "9cf20a545b305cf1da722b236b5155c9bb35f1d5ceb28c048bd96ca842f41b5b"
dependencies = [
"android_system_properties",
"log",
"nix",
"nix 0.31.3",
"objc2",
"objc2-foundation",
"objc2-ui-kit",
@@ -1835,6 +1914,15 @@ dependencies = [
"windows-link",
]
[[package]]
name = "parse-zoneinfo"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1f2a05b18d44e2957b88f96ba460715e295bc1d7510468a2f3d3b44535d26c24"
dependencies = [
"regex",
]
[[package]]
name = "password-hash"
version = "0.5.0"
@@ -2039,6 +2127,22 @@ dependencies = [
"unicode-ident",
]
[[package]]
name = "psl-types"
version = "2.0.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "33cb294fe86a74cbcf50d4445b37da762029549ebeea341421c7c70370f86cac"
[[package]]
name = "publicsuffix"
version = "2.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6f42ea446cab60335f76979ec15e12619a2165b5ae2c12166bef27d283a9fadf"
dependencies = [
"idna",
"psl-types",
]
[[package]]
name = "quinn"
version = "0.11.9"
@@ -2240,7 +2344,10 @@ checksum = "eddd3ca559203180a307f12d114c268abf583f59b03cb906fd0b3ff8646c1147"
dependencies = [
"base64",
"bytes",
"cookie",
"cookie_store",
"futures-core",
"futures-util",
"http",
"http-body",
"http-body-util",
@@ -2260,12 +2367,14 @@ dependencies = [
"sync_wrapper",
"tokio",
"tokio-rustls",
"tokio-util",
"tower",
"tower-http",
"tower-service",
"url",
"wasm-bindgen",
"wasm-bindgen-futures",
"wasm-streams",
"web-sys",
"webpki-roots",
]
@@ -2899,6 +3008,19 @@ dependencies = [
"syn",
]
[[package]]
name = "sysinfo"
version = "0.32.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4c33cd241af0f2e9e3b5c32163b873b29956890b5342e6745b917ce9d490f4af"
dependencies = [
"core-foundation-sys",
"libc",
"memchr",
"ntapi",
"windows",
]
[[package]]
name = "tempfile"
version = "3.27.0"
@@ -3444,6 +3566,19 @@ dependencies = [
"wasmparser",
]
[[package]]
name = "wasm-streams"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "15053d8d85c7eccdbefef60f06769760a563c7f0a9d6902a13d35c7800b0ad65"
dependencies = [
"futures-util",
"js-sys",
"wasm-bindgen",
"wasm-bindgen-futures",
"web-sys",
]
[[package]]
name = "wasmparser"
version = "0.244.0"
@@ -3507,19 +3642,74 @@ dependencies = [
"wasite",
]
[[package]]
name = "winapi"
version = "0.3.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419"
dependencies = [
"winapi-i686-pc-windows-gnu",
"winapi-x86_64-pc-windows-gnu",
]
[[package]]
name = "winapi-i686-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6"
[[package]]
name = "winapi-x86_64-pc-windows-gnu"
version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f"
[[package]]
name = "windows"
version = "0.57.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "12342cb4d8e3b046f3d80effd474a7a02447231330ef77d71daa6fbc40681143"
dependencies = [
"windows-core 0.57.0",
"windows-targets 0.52.6",
]
[[package]]
name = "windows-core"
version = "0.57.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2ed2439a290666cd67ecce2b0ffaad89c2a56b976b736e6ece670297897832d"
dependencies = [
"windows-implement 0.57.0",
"windows-interface 0.57.0",
"windows-result 0.1.2",
"windows-targets 0.52.6",
]
[[package]]
name = "windows-core"
version = "0.62.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b8e83a14d34d0623b51dce9581199302a221863196a1dde71a7663a4c2be9deb"
dependencies = [
"windows-implement",
"windows-interface",
"windows-implement 0.60.2",
"windows-interface 0.59.3",
"windows-link",
"windows-result",
"windows-result 0.4.1",
"windows-strings",
]
[[package]]
name = "windows-implement"
version = "0.57.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9107ddc059d5b6fbfbffdfa7a7fe3e22a226def0b2608f72e9d552763d3e1ad7"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "windows-implement"
version = "0.60.2"
@@ -3531,6 +3721,17 @@ dependencies = [
"syn",
]
[[package]]
name = "windows-interface"
version = "0.57.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "29bee4b38ea3cde66011baa44dba677c432a78593e202392d1e9070cf2a7fca7"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "windows-interface"
version = "0.59.3"
@@ -3548,6 +3749,15 @@ version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5"
[[package]]
name = "windows-result"
version = "0.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e383302e8ec8515204254685643de10811af0ed97ea37210dc26fb0032647f8"
dependencies = [
"windows-targets 0.52.6",
]
[[package]]
name = "windows-result"
version = "0.4.1"

@@ -1,6 +1,6 @@
[package]
name = "mangalord"
version = "0.23.0"
version = "0.52.0"
edition = "2021"
default-run = "mangalord"
@@ -23,6 +23,7 @@ serde = { version = "1", features = ["derive"] }
serde_json = "1"
uuid = { version = "1", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
chrono-tz = "0.9"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
tower = { version = "0.5", features = ["util"] }
@@ -44,8 +45,10 @@ futures-core = "0.3"
futures-util = "0.3"
bytes = "1"
chromiumoxide = { version = "0.7", features = ["tokio-runtime", "_fetcher-rusttls-tokio"], default-features = false }
sysinfo = { version = "0.32", default-features = false, features = ["system"] }
nix = { version = "0.29", features = ["fs"] }
scraper = "0.20"
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks"] }
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies", "stream"] }
[dev-dependencies]
tempfile = "3"
@@ -54,3 +57,13 @@ http-body-util = "0.1"
mime = "0.3"
futures-util = "0.3"
tokio = { version = "1", features = ["test-util"] }
# Trim debug builds: keep line numbers in panics / backtraces but drop the
# full DWARF info (variable-level inspection in gdb/lldb). With a sqlx +
# axum + tokio dep tree the default ("full") leaves backend/target on the
# order of tens of GiB; this typically cuts ~50-70% off that.
[profile.dev]
debug = "line-tables-only"
[profile.test]
debug = "line-tables-only"

@@ -10,7 +10,8 @@ RUN apt-get update \
# exact crate versions CI tested. Without Cargo.lock + the flag, cargo
# would silently resolve fresh on every image build.
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && echo "" > src/lib.rs \
RUN mkdir -p src/bin && echo "fn main() {}" > src/main.rs && echo "" > src/lib.rs \
&& echo "fn main() {}" > src/bin/crawler.rs \
&& cargo build --locked --release \
&& rm -rf src
@@ -18,13 +19,68 @@ COPY src ./src
COPY migrations ./migrations
RUN touch src/main.rs src/lib.rs && cargo build --locked --release
FROM debian:bookworm-slim
FROM debian:trixie-slim
# Runtime base must match the builder's Debian release: `rust:1-slim` tracks
# trixie (glibc 2.41), so a bookworm runtime (glibc 2.36) can't run the
# binary ("GLIBC_2.39 not found"). Keep these two in lockstep on bumps.
# `curl` is for the container HEALTHCHECK; `ca-certificates` is for
# outbound HTTPS (crawler covers/pages).
#
# INSTALL_CHROMIUM is an opt-in for deployments that can't use the
# chromiumoxide fetcher path (notably Linux_arm64 / Raspberry Pi, where
# the upstream snapshot bucket has no usable build). When `true`, adds
# Debian's apt-packaged headless chromium plus a baseline font set —
# pair with `CRAWLER_CHROMIUM_BINARY=/usr/bin/chromium-headless-shell`
# at runtime so the launcher uses it. Default `false` keeps cloud/x86
# images slim.
#
# Build the Pi image with:
# docker compose build --build-arg INSTALL_CHROMIUM=true backend
ARG INSTALL_CHROMIUM=false
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& if [ "$INSTALL_CHROMIUM" = "true" ]; then \
apt-get install -y --no-install-recommends chromium-headless-shell fonts-liberation; \
fi \
&& rm -rf /var/lib/apt/lists/*
# Non-root runtime user. The API binary doesn't need any root
# privilege; the crawler daemon's Chromium launcher uses --no-sandbox
# precisely because user-namespace sandboxing is fragile, so dropping
# privileges costs nothing operationally and shrinks the blast radius
# of any RCE.
ARG APP_UID=10001
ARG APP_GID=10001
RUN groupadd --system --gid ${APP_GID} app \
&& useradd --system --uid ${APP_UID} --gid app --home-dir /home/app --create-home --shell /usr/sbin/nologin app
WORKDIR /app
COPY --from=builder /app/target/release/mangalord /usr/local/bin/mangalord
COPY --from=builder /app/migrations /app/migrations
ENV STORAGE_DIR=/var/lib/mangalord/storage
# Pre-create the storage dir so the entrypoint doesn't need to
# mkdir-as-root and so the named volume mount inherits the right
# ownership.
#
# UPGRADE NOTE for operators: if you're moving from an older image
# that ran as root, the existing `storage-data` volume has files owned
# by UID 0 and the new UID-10001 user can't write them. Run once
# before the upgrade:
# docker compose run --rm --user 0 backend \
# chown -R 10001:10001 /var/lib/mangalord/storage
# (Postgres is unaffected — that image's `postgres` user UID hasn't
# changed.)
RUN mkdir -p ${STORAGE_DIR} \
&& chown -R app:app ${STORAGE_DIR} /app /home/app
USER app
EXPOSE 8080
# `--start-period` is generous because first boot runs sqlx::migrate
# against postgres which can take a few seconds; subsequent restarts
# are sub-second.
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
CMD curl -fsS http://localhost:8080/api/v1/health > /dev/null || exit 1
CMD ["mangalord"]

@@ -0,0 +1,18 @@
-- Real-world sources publish multiple chapters at the same number:
-- different uploaders, translator notices/farewells, paid-vs-free
-- re-uploads, and our own users can legitimately have two versions of
-- "Ch.52" with different scanlations. The (manga_id, number) UNIQUE
-- from 0001_init silently collapses all of those into a single row via
-- ON CONFLICT, dropping data. Drop the constraint and lean on the
-- chapter id (UUID) as the only chapter identity going forward.
ALTER TABLE chapters DROP CONSTRAINT chapters_manga_id_number_key;
-- The UNIQUE was also our only index on (manga_id, number) since
-- 0007 dropped the redundant explicit one. Chapter list pages
-- ORDER BY number ASC and the manga page is a hot read path, so put
-- the index back without the uniqueness. Secondary sort by created_at
-- so duplicate-numbered chapters have a stable order in lists and
-- prev/next navigation.
CREATE INDEX chapters_manga_id_number_idx
ON chapters (manga_id, number, created_at);

@@ -0,0 +1,15 @@
-- Dedup SyncChapterContent jobs in flight.
--
-- Without this, the daemon's bookmark/cron enqueue paths would have to do a
-- pre-check + insert race that's incorrect under concurrency. The partial
-- unique index lets both producers use plain `INSERT ... ON CONFLICT DO
-- NOTHING`: at most one (pending|running) job per chapter_id exists, and the
-- slot frees again as soon as the job transitions to done/failed/dead so a
-- re-enqueue is possible after the row is reaped or a force-refetch is wanted.
--
-- Scoped to sync_chapter_content payloads only so Discover / SyncManga /
-- SyncChapterList jobs (which don't carry a chapter_id) remain un-deduped.
CREATE UNIQUE INDEX crawler_jobs_chapter_content_dedup_idx
ON crawler_jobs ((payload->>'chapter_id'))
WHERE state IN ('pending', 'running')
AND payload->>'kind' = 'sync_chapter_content';
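
The dedup rule described in this migration can be modelled in memory. The sketch below is a rough illustration only: `ContentJobSlots`, `JobState`, and the `u64` chapter ids are hypothetical stand-ins, not types from this repository. It assumes the index behaves like a set of in-flight chapter ids, where an insert is a no-op on conflict and the slot frees once the job leaves `pending`/`running`.

```rust
use std::collections::HashSet;

/// Hypothetical mirror of the job states named in the migration comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    Pending,
    Running,
    Done,
    Failed,
    Dead,
}

impl JobState {
    /// Only these states are covered by the partial unique index.
    fn in_flight(self) -> bool {
        matches!(self, JobState::Pending | JobState::Running)
    }
}

/// In-memory stand-in for the partial unique index on `payload->>'chapter_id'`.
/// Chapter ids are `u64` here purely for brevity (the schema uses UUIDs).
#[derive(Default)]
pub struct ContentJobSlots {
    in_flight: HashSet<u64>,
}

impl ContentJobSlots {
    /// Models `INSERT ... ON CONFLICT DO NOTHING`: returns true when a job
    /// was actually enqueued, false when one is already in flight.
    pub fn enqueue(&mut self, chapter_id: u64) -> bool {
        self.in_flight.insert(chapter_id)
    }

    /// Models a state transition; leaving pending/running frees the slot.
    pub fn transition(&mut self, chapter_id: u64, to: JobState) {
        if !to.in_flight() {
            self.in_flight.remove(&chapter_id);
        }
    }
}
```

Both producers (bookmark and cron) would call `enqueue` without a pre-check; a second call for the same chapter is simply ignored until the first job finishes.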

@@ -0,0 +1,12 @@
-- Small key-value table for daemon state that needs to survive restarts.
--
-- Used so far only by the cron scheduler (`last_metadata_tick_at`) so it can
-- detect that the most recent slot was missed (e.g. the backend was down at
-- midnight) and fire immediately on startup before resuming the regular
-- schedule. JSONB on the value column lets future keys carry richer payloads
-- without another migration.
CREATE TABLE crawler_state (
key text PRIMARY KEY,
value jsonb NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
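
One way the scheduler could use a persisted `last_metadata_tick_at` to spot a missed slot is sketched below. This is an assumption-laden illustration, not the daemon's actual logic: timestamps are plain Unix seconds, the schedule is a fixed `period` with an `offset` into it, and a missing value is treated as "fire now". The function names are hypothetical.

```rust
/// Start of the most recent scheduled slot at or before `now`.
/// Assumes `period > 0`; all values are Unix seconds.
pub fn latest_slot(now: u64, period: u64, offset: u64) -> u64 {
    // Seconds elapsed since the last slot boundary.
    let elapsed = (now % period + period - offset % period) % period;
    now.saturating_sub(elapsed)
}

/// True when the last recorded tick predates the most recent slot, i.e.
/// the slot passed while the process was down. A missing record (`None`)
/// is treated as "never ran", which is an assumption of this sketch.
pub fn should_fire_on_startup(last_tick: Option<u64>, now: u64, period: u64, offset: u64) -> bool {
    match last_tick {
        None => true,
        Some(t) => t < latest_slot(now, period, offset),
    }
}
```

With a daily period and a midnight offset, a backend that restarts at 01:00 with a last tick from the previous midnight would fire immediately, then resume the regular schedule.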

@@ -0,0 +1,15 @@
-- The original 0012 partial index covers `state IN ('pending','failed')`,
-- but `ack_failed` in src/crawler/jobs.rs only writes `dead` or
-- `pending` — `failed` is never set. The index branch on `failed`
-- never matches any row, so it's dead weight on every write.
--
-- Drop and recreate the index without the dead branch. The CHECK
-- constraint on `state` still allows `'failed'` so a future migration
-- can adopt that terminal-but-retryable state without a second
-- schema change.
DROP INDEX IF EXISTS crawler_jobs_ready_idx;
CREATE INDEX crawler_jobs_ready_idx
ON crawler_jobs (scheduled_at)
WHERE state = 'pending';

@@ -0,0 +1,20 @@
-- chapter_sources: drop the global (source_id, source_chapter_key) PK
-- and rekey on (source_id, chapter_id).
--
-- The old PK assumed chapter slugs are unique per source. Sources whose
-- chapter naming is per-manga (chapter-1, chapter-2, ...) instead of per-
-- catalog (br_chapter-379272 with a global counter) would collide on the
-- second manga: the INSERT would conflict on (source_id, "chapter-1") and
-- the lookup would attribute the row to the first manga's chapter_id.
--
-- The new key is the natural identity of a source attachment: "this source
-- has this chapter". A (source_id, source_chapter_key) index preserves
-- the lookup path (find existing source row by source's identifier) but
-- no longer enforces uniqueness — the application combines it with the
-- chapters table's manga_id to scope the lookup per-manga.
ALTER TABLE chapter_sources DROP CONSTRAINT chapter_sources_pkey;
ALTER TABLE chapter_sources ADD PRIMARY KEY (source_id, chapter_id);
CREATE INDEX chapter_sources_source_key_idx
ON chapter_sources (source_id, source_chapter_key);
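
The collision this migration fixes can be reproduced with two in-memory maps. The sketch below is illustrative only: the struct names and the integer ids are hypothetical, and the per-manga scoping is modelled with a simple chapter-to-manga map rather than a join against `chapters`.

```rust
use std::collections::HashMap;

// Hypothetical integer ids standing in for the UUIDs used by the schema.
type SourceId = u32;
type MangaId = u32;
type ChapterId = u32;

/// Old shape: primary key (source_id, source_chapter_key).
#[derive(Default)]
pub struct OldChapterSources {
    rows: HashMap<(SourceId, String), ChapterId>,
}

impl OldChapterSources {
    /// Returns false when the insert conflicts on the primary key.
    pub fn attach(&mut self, source: SourceId, key: &str, chapter: ChapterId) -> bool {
        let pk = (source, key.to_string());
        if self.rows.contains_key(&pk) {
            return false;
        }
        self.rows.insert(pk, chapter);
        true
    }

    pub fn find(&self, source: SourceId, key: &str) -> Option<ChapterId> {
        self.rows.get(&(source, key.to_string())).copied()
    }
}

/// New shape: primary key (source_id, chapter_id); the source key is a
/// plain, non-unique column.
#[derive(Default)]
pub struct NewChapterSources {
    rows: HashMap<(SourceId, ChapterId), String>,
}

impl NewChapterSources {
    pub fn attach(&mut self, source: SourceId, chapter: ChapterId, key: &str) -> bool {
        if self.rows.contains_key(&(source, chapter)) {
            return false;
        }
        self.rows.insert((source, chapter), key.to_string());
        true
    }

    /// Lookup by source key, scoped to one manga via `manga_of`
    /// (chapter id -> manga id).
    pub fn find(
        &self,
        source: SourceId,
        key: &str,
        manga: MangaId,
        manga_of: &HashMap<ChapterId, MangaId>,
    ) -> Option<ChapterId> {
        self.rows
            .iter()
            .filter(|((s, _), k)| *s == source && k.as_str() == key)
            .map(|((_, c), _)| *c)
            .find(|c| manga_of.get(c) == Some(&manga))
    }
}
```

Under the old key, a second manga's `chapter-1` conflicts and the lookup resolves to the first manga's chapter; under the new key both rows coexist and the manga scope disambiguates them.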

@@ -0,0 +1,5 @@
-- Admin role flag on users. Booted from ADMIN_USERNAME / ADMIN_PASSWORD env at
-- startup (see app::build). Demotion is instant: the RequireAdmin extractor
-- re-reads the user row every request, so flipping this column takes effect on
-- the next call without a session purge.
ALTER TABLE users ADD COLUMN is_admin BOOLEAN NOT NULL DEFAULT false;

@@ -0,0 +1,20 @@
-- Admin audit log. Written from inside the same transaction as the action
-- it records, so a failed COMMIT also rolls back the audit row — the log
-- never claims an action happened that didn't.
--
-- `actor_user_id` is ON DELETE SET NULL so audit rows outlive a deleted
-- admin (the answer to "who promoted Bob to admin?" survives even after
-- Alice's account is removed). `target_id` is intentionally not a FK
-- because future audit kinds may target non-user rows (manga, source,
-- etc.) and a single typed FK can't express that.
CREATE TABLE admin_audit (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
actor_user_id uuid REFERENCES users(id) ON DELETE SET NULL,
action text NOT NULL,
target_kind text NOT NULL,
target_id uuid,
payload jsonb NOT NULL DEFAULT '{}'::jsonb,
at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX admin_audit_at_idx ON admin_audit (at DESC);

@@ -0,0 +1,14 @@
-- Per-manga sync-state derivation joins crawler_jobs to manga_sources via
-- (payload->>'source_id', payload->>'source_manga_key') for the
-- `sync_manga` job kind (whose payload doesn't carry a manga_id directly).
-- Without this index the join falls back to a seqscan of crawler_jobs on
-- every admin manga listing — a noticeable cost as the job table grows
-- with the daily metadata pass.
--
-- Partial on `state IN ('pending','running')` so it covers only in-flight
-- jobs (the bulk of the table is done/dead and irrelevant to "is this
-- manga being synced right now").
CREATE INDEX crawler_jobs_sync_manga_key_idx
ON crawler_jobs ((payload->>'source_manga_key'))
WHERE state IN ('pending', 'running')
AND payload->>'kind' = 'sync_manga';

@@ -0,0 +1,18 @@
-- Capture each chapter's position in the source site's chapter list so
-- the user-facing list can preserve site order: variants of the same
-- chapter number (e.g. "Ch.14 : PH" next to "Ch.14 : Official") stay
-- adjacent, and non-numeric entries like "notice. : Officials" land
-- where the site placed them rather than clustering at the top under
-- number = 0.
--
-- Lower source_index = closer to the top of the source DOM = newer
-- chapter on this site (it renders newest-first). The list query
-- reverses this with ORDER BY source_index DESC so the oldest chapter
-- appears first in our UI.
--
-- NULL is the sentinel for user-uploaded chapters (no source row) and
-- for crawled rows that pre-date this migration. The list query keeps
-- the existing (number, created_at) tiebreak via NULLS LAST so those
-- fall through to the prior behaviour until the next crawler tick
-- populates the column.
ALTER TABLE chapters ADD COLUMN source_index INTEGER;
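
The ordering the commit message gives for the list query (`ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC`) can be expressed as a comparator. The sketch below is a hedged illustration: `ChapterRow` is a hypothetical, trimmed-down row with `f64` chapter numbers and integer timestamps, not the repository's domain type.

```rust
use std::cmp::Ordering;

/// Hypothetical chapter row; only the sort keys are modelled.
#[derive(Debug, Clone, PartialEq)]
pub struct ChapterRow {
    pub title: &'static str,
    pub source_index: Option<i32>,
    pub number: f64,
    /// Seconds since epoch, standing in for `timestamptz`.
    pub created_at: u64,
}

/// Mirrors `ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC`.
pub fn site_order(a: &ChapterRow, b: &ChapterRow) -> Ordering {
    let by_index = match (a.source_index, b.source_index) {
        // DESC: a larger index (further down the source DOM, so older) sorts first.
        (Some(x), Some(y)) => y.cmp(&x),
        // NULLS LAST: rows without a source position go after indexed rows.
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    };
    by_index
        .then_with(|| a.number.total_cmp(&b.number))
        .then_with(|| a.created_at.cmp(&b.created_at))
}

/// Sorts rows into display order (oldest source entry first).
pub fn sort_chapters(rows: &mut [ChapterRow]) {
    rows.sort_by(site_order);
}
```

With this comparator, two variants of chapter 14 stay adjacent in the order the site lists them, a non-numeric notice lands between its neighbours, and user uploads with a NULL index fall back to the number and creation-time tiebreak at the end.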

@@ -0,0 +1,110 @@
//! Admin manga/chapter overview with derived sync state.
//!
//! Sync state comes from `repo::admin_view`, which joins the manga /
//! chapter tables with the crawler signals at query time — there is no
//! persisted sync_state column. See [`repo::admin_view`] for the
//! derivation priority order.
use axum::extract::{Path, Query, State};
use axum::routing::get;
use axum::{Json, Router};
use serde::Deserialize;
use uuid::Uuid;
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::domain::MangaSyncState;
use crate::error::{AppError, AppResult};
use crate::repo;
use crate::repo::admin_view::{AdminChapterRow, AdminMangaRow};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/admin/mangas", get(list_mangas))
.route("/admin/mangas/:id/chapters", get(list_chapters))
}
#[derive(Debug, Deserialize, Default)]
pub struct ListChaptersParams {
#[serde(default = "default_chapter_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_chapter_limit() -> i64 {
200
}
#[derive(Debug, Deserialize, Default)]
pub struct ListMangasParams {
#[serde(default)]
pub search: Option<String>,
/// `in_progress` | `dropped` | `synced`. Unrecognised values are a 400.
#[serde(default)]
pub sync_state: Option<String>,
#[serde(default = "default_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_limit() -> i64 {
50
}
async fn list_mangas(
State(state): State<AppState>,
_admin: RequireAdmin,
Query(params): Query<ListMangasParams>,
) -> AppResult<Json<PagedResponse<AdminMangaRow>>> {
let limit = params.limit.clamp(1, 200);
let offset = params.offset.max(0);
let sync_state = match params.sync_state.as_deref() {
None | Some("") => None,
Some("in_progress") => Some(MangaSyncState::InProgress),
Some("dropped") => Some(MangaSyncState::Dropped),
Some("synced") => Some(MangaSyncState::Synced),
Some(other) => {
return Err(AppError::InvalidInput(format!(
"sync_state must be one of in_progress|dropped|synced (got {other:?})"
)));
}
};
let q = repo::admin_view::ListAdminMangasQuery {
search: params.search.filter(|s| !s.trim().is_empty()),
sync_state,
limit,
offset,
};
let (items, total) = repo::admin_view::list_mangas_with_sync_state(&state.db, &q).await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}
async fn list_chapters(
State(state): State<AppState>,
_admin: RequireAdmin,
Path(manga_id): Path<Uuid>,
Query(params): Query<ListChaptersParams>,
) -> AppResult<Json<PagedResponse<AdminChapterRow>>> {
// Explicit existence check so a typo / deleted manga returns 404
// rather than a misleading "no chapters" 200.
if !repo::manga::exists(&state.db, manga_id).await? {
return Err(AppError::NotFound);
}
// Cap at 500 to bound the per-row scalar-subquery cost on
// long-runners with thousands of chapters; default 200 covers
// typical browsing without paging round-trips.
let limit = params.limit.clamp(1, 500);
let offset = params.offset.max(0);
let q = repo::admin_view::ListAdminChaptersQuery {
manga_id,
limit,
offset,
};
let (items, total) = repo::admin_view::list_chapters_with_sync_state(&state.db, &q).await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}
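
The query-parameter handling in this file boils down to two small pure functions, sketched here without axum so they can be exercised in isolation. The names `parse_sync_state` and `page_bounds` are hypothetical, the `MangaSyncState` enum is a local stand-in for the domain type, and the `String` error stands in for `AppError::InvalidInput`.

```rust
/// Local stand-in for `crate::domain::MangaSyncState`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MangaSyncState {
    InProgress,
    Dropped,
    Synced,
}

/// Absent or empty means "no filter"; unknown values are rejected
/// (the handler maps this to a 400).
pub fn parse_sync_state(raw: Option<&str>) -> Result<Option<MangaSyncState>, String> {
    match raw {
        None | Some("") => Ok(None),
        Some("in_progress") => Ok(Some(MangaSyncState::InProgress)),
        Some("dropped") => Ok(Some(MangaSyncState::Dropped)),
        Some("synced") => Ok(Some(MangaSyncState::Synced)),
        Some(other) => Err(format!(
            "sync_state must be one of in_progress|dropped|synced (got {other:?})"
        )),
    }
}

/// Clamp paging inputs: limit into 1..=max_limit, offset to be non-negative.
/// Assumes `max_limit >= 1` (`clamp` panics if min > max).
pub fn page_bounds(limit: i64, offset: i64, max_limit: i64) -> (i64, i64) {
    (limit.clamp(1, max_limit), offset.max(0))
}
```

The manga listing would call `page_bounds(limit, offset, 200)` and the chapter listing `page_bounds(limit, offset, 500)`, matching the caps used in the handlers above.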

@@ -0,0 +1,22 @@
//! Admin-only endpoints. Mounted under `/api/v1/admin/*` by
//! `crate::api::routes`. Every handler in this subtree is guarded by
//! `RequireAdmin`, which only accepts session-cookie authentication —
//! bot/API tokens cannot reach admin routes (see
//! `crate::auth::extractor::RequireAdmin`).
pub mod mangas;
pub mod resync;
pub mod system;
pub mod users;
use axum::Router;
use crate::app::AppState;
pub fn routes() -> Router<AppState> {
Router::new()
.merge(users::routes())
.merge(mangas::routes())
.merge(resync::routes())
.merge(system::routes())
}

@@ -0,0 +1,176 @@
//! Admin-triggered force resync of a single manga's metadata + cover,
//! or a single chapter's content.
//!
//! Both endpoints are admin-only (`RequireAdmin`, cookie-only) and run
//! synchronously with the request — the response carries the refreshed
//! resource so the UI can swap it in without a follow-up GET. The work
//! itself is delegated to [`ResyncService`] (set on AppState by
//! `app::build` when the crawler daemon is enabled); when the daemon
//! is disabled, both handlers return 503.
use axum::extract::{Path, State};
use axum::routing::post;
use axum::{Json, Router};
use serde::Serialize;
use serde_json::json;
use uuid::Uuid;
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::crawler::resync::{ChapterResyncOutcome, ResyncError};
use crate::domain::manga::MangaDetail;
use crate::domain::Chapter;
use crate::error::{AppError, AppResult};
use crate::repo;
use crate::repo::crawler::UpsertStatus;
pub fn routes() -> Router<AppState> {
Router::new()
.route("/admin/mangas/:id/resync", post(resync_manga))
.route("/admin/chapters/:id/resync", post(resync_chapter))
}
#[derive(Debug, Serialize)]
pub struct MangaResyncResponse {
pub manga: MangaDetail,
/// `"new" | "updated" | "unchanged"` — mirrors [`UpsertStatus`].
pub metadata_status: &'static str,
pub cover_fetched: bool,
}
#[derive(Debug, Serialize)]
pub struct ChapterResyncResponse {
pub chapter: Chapter,
/// `"fetched" | "skipped"` — whether new pages landed or the
/// service short-circuited (e.g. chapter already had pages and the
/// session was lost so force was downgraded).
pub outcome: &'static str,
/// Page count when `outcome == "fetched"`. `None` for `skipped`.
pub pages: Option<usize>,
}
async fn resync_manga(
State(state): State<AppState>,
admin: RequireAdmin,
Path(manga_id): Path<Uuid>,
) -> AppResult<Json<MangaResyncResponse>> {
if !repo::manga::exists(&state.db, manga_id).await? {
return Err(AppError::NotFound);
}
let resync = state
.resync
.as_ref()
.ok_or_else(|| AppError::ServiceUnavailable(
"crawler daemon is disabled; force resync unavailable".into(),
))?;
let outcome = resync.resync_manga(manga_id).await.map_err(map_resync_err)?;
// Audit the action with the actor + the resync outcome so an
// operator-of-operators can answer "who refetched this manga, and
// did the cover land?" from the log alone.
repo::admin_audit::insert(
&state.db,
admin.0.id,
"manga_resync",
"manga",
Some(manga_id),
json!({
"metadata_status": status_str(outcome.metadata_status),
"cover_fetched": outcome.cover_fetched,
}),
)
.await?;
let manga = repo::manga::get_detail(&state.db, manga_id).await?;
Ok(Json(MangaResyncResponse {
manga,
metadata_status: status_str(outcome.metadata_status),
cover_fetched: outcome.cover_fetched,
}))
}
async fn resync_chapter(
State(state): State<AppState>,
admin: RequireAdmin,
Path(chapter_id): Path<Uuid>,
) -> AppResult<Json<ChapterResyncResponse>> {
let resync = state
.resync
.as_ref()
.ok_or_else(|| AppError::ServiceUnavailable(
"crawler daemon is disabled; force resync unavailable".into(),
))?;
// Look up the manga the chapter belongs to so we can return the
// refreshed chapter row in the response and 404 for unknown ids.
let manga_id: Option<Uuid> =
sqlx::query_scalar("SELECT manga_id FROM chapters WHERE id = $1")
.bind(chapter_id)
.fetch_optional(&state.db)
.await?;
let Some(manga_id) = manga_id else {
return Err(AppError::NotFound);
};
let outcome = resync
.resync_chapter(chapter_id)
.await
.map_err(map_resync_err)?;
let (outcome_str, pages) = match &outcome {
ChapterResyncOutcome::Fetched { pages, .. } => ("fetched", Some(*pages)),
ChapterResyncOutcome::Skipped { .. } => ("skipped", None),
};
repo::admin_audit::insert(
&state.db,
admin.0.id,
"chapter_resync",
"chapter",
Some(chapter_id),
json!({
"outcome": outcome_str,
"pages": pages,
}),
)
.await?;
let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
.await?
.ok_or(AppError::NotFound)?;
Ok(Json(ChapterResyncResponse {
chapter,
outcome: outcome_str,
pages,
}))
}
fn status_str(s: UpsertStatus) -> &'static str {
match s {
UpsertStatus::New => "new",
UpsertStatus::Updated => "updated",
UpsertStatus::Unchanged => "unchanged",
}
}
/// Map [`ResyncError`] (and the anyhow envelopes wrapping it) onto the
/// right [`AppError`]. Anything else surfaces as a generic 500 via the
/// `Other` arm — the operator sees the underlying anyhow chain in
/// server logs, the client sees a clean envelope.
fn map_resync_err(err: anyhow::Error) -> AppError {
if let Some(rerr) = err.downcast_ref::<ResyncError>() {
match rerr {
ResyncError::NoMangaSource => AppError::ValidationFailed {
message: "manga has no live crawler source — cannot resync".into(),
details: json!({ "manga": "no_source" }),
},
ResyncError::NoChapterSource => AppError::ValidationFailed {
message: "chapter has no live crawler source — cannot resync".into(),
details: json!({ "chapter": "no_source" }),
},
}
} else {
AppError::Other(err)
}
}
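
The downcast-then-fallback shape of `map_resync_err` can be shown with the standard library alone. In the sketch below, `Box<dyn Error + Send + Sync>` stands in for `anyhow::Error`, and `ApiError` is a hypothetical simplification of `AppError`; only the mapping pattern is meant to carry over.

```rust
use std::error::Error;
use std::fmt;

/// Local stand-in for `crate::crawler::resync::ResyncError`.
#[derive(Debug)]
pub enum ResyncError {
    NoMangaSource,
    NoChapterSource,
}

impl fmt::Display for ResyncError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ResyncError::NoMangaSource => write!(f, "manga has no live crawler source"),
            ResyncError::NoChapterSource => write!(f, "chapter has no live crawler source"),
        }
    }
}

impl Error for ResyncError {}

/// Hypothetical, simplified API error.
#[derive(Debug, PartialEq, Eq)]
pub enum ApiError {
    /// Known domain failure, reported against a field.
    Validation { field: &'static str },
    /// Anything else; the real handler logs the chain and returns a 500.
    Internal(String),
}

/// Try to recover the typed error; fall back to a generic internal error.
pub fn map_err(err: Box<dyn Error + Send + Sync>) -> ApiError {
    match err.downcast_ref::<ResyncError>() {
        Some(ResyncError::NoMangaSource) => ApiError::Validation { field: "manga" },
        Some(ResyncError::NoChapterSource) => ApiError::Validation { field: "chapter" },
        None => ApiError::Internal(err.to_string()),
    }
}
```

Keeping the typed arms exhaustive means a new `ResyncError` variant fails to compile here until it is given an explicit client-facing mapping.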

@@ -0,0 +1,163 @@
//! System metrics for the admin dashboard.
//!
//! Disk is `statvfs(storage_dir)` so the number reflects the volume the
//! app actually writes to (not the root filesystem of the host). When the
//! storage backend doesn't expose a local path (e.g. a future S3 impl)
//! the disk fields are `null` rather than fabricated.
//!
//! Memory and CPU come from `sysinfo`. CPU requires two refreshes with
//! at least 200ms between them to compute a meaningful delta; the
//! handler eats the 250ms wall-clock cost on each request. Admin
//! traffic is low-volume so a background cache isn't worth the moving
//! parts yet — revisit if polling becomes frequent.
use std::path::Path;
use std::time::Duration;
use axum::extract::State;
use axum::routing::get;
use axum::{Json, Router};
use serde::Serialize;
use sysinfo::{CpuRefreshKind, MemoryRefreshKind, RefreshKind, System};
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::error::AppResult;
const ALERT_THRESHOLD_PERCENT: f64 = 90.0;
pub fn routes() -> Router<AppState> {
Router::new().route("/admin/system", get(system))
}
#[derive(Debug, Serialize)]
pub struct SystemStats {
pub disk: Option<DiskStats>,
pub memory: MemoryStats,
pub cpu: CpuStats,
pub alerts: Vec<Alert>,
}
#[derive(Debug, Serialize)]
pub struct DiskStats {
pub total_bytes: u64,
pub used_bytes: u64,
pub free_bytes: u64,
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct MemoryStats {
pub total_bytes: u64,
pub used_bytes: u64,
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct CpuStats {
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct Alert {
pub level: AlertLevel,
pub message: String,
}
#[derive(Debug, Serialize, Clone, Copy)]
#[serde(rename_all = "snake_case")]
pub enum AlertLevel {
Warning,
}
async fn system(
State(state): State<AppState>,
_admin: RequireAdmin,
) -> AppResult<Json<SystemStats>> {
let disk = state.storage.local_root().and_then(disk_stats_for);
let (memory, cpu) = memory_and_cpu().await;
let mut alerts = Vec::new();
if let Some(d) = &disk {
if d.percent_used >= ALERT_THRESHOLD_PERCENT {
alerts.push(Alert {
level: AlertLevel::Warning,
message: format!(
"disk near full ({:.0}% used)",
d.percent_used
),
});
}
}
if memory.percent_used >= ALERT_THRESHOLD_PERCENT {
alerts.push(Alert {
level: AlertLevel::Warning,
message: format!(
"memory near full ({:.0}% used)",
memory.percent_used
),
});
}
Ok(Json(SystemStats {
disk,
memory,
cpu,
alerts,
}))
}
fn disk_stats_for(root: &Path) -> Option<DiskStats> {
let s = nix::sys::statvfs::statvfs(root).ok()?;
// statvfs reports `f_frsize * f_blocks` for total bytes. `f_bavail`
// is "free to non-root callers" which is what an operator actually
// cares about — `f_bfree` includes blocks reserved for root.
let block = s.fragment_size();
let total = block * s.blocks();
let avail = block * s.blocks_available();
let used = total.saturating_sub(avail);
let percent_used = if total > 0 {
(used as f64) * 100.0 / (total as f64)
} else {
0.0
};
Some(DiskStats {
total_bytes: total,
used_bytes: used,
free_bytes: avail,
percent_used,
})
}
async fn memory_and_cpu() -> (MemoryStats, CpuStats) {
// sysinfo's CPU sampling needs two refreshes with a delay between
// them — the first seeds the delta counters, the second measures.
// We do this once per request; admin traffic is low enough that the
// 250ms cost is invisible.
let mut sys = System::new_with_specifics(
RefreshKind::new()
.with_cpu(CpuRefreshKind::everything())
.with_memory(MemoryRefreshKind::everything()),
);
sys.refresh_cpu_all();
// Yield the runtime instead of blocking it for the gap.
tokio::time::sleep(Duration::from_millis(250)).await;
sys.refresh_cpu_all();
sys.refresh_memory();
let total = sys.total_memory();
let used = sys.used_memory();
let mem_pct = if total > 0 {
(used as f64) * 100.0 / (total as f64)
} else {
0.0
};
let memory = MemoryStats {
total_bytes: total,
used_bytes: used,
percent_used: mem_pct,
};
let cpu = CpuStats {
percent_used: sys.global_cpu_usage() as f64,
};
(memory, cpu)
}
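
The disk arithmetic and the alert threshold above can be separated from `statvfs` and tested as pure functions. The sketch below is a minimal restatement under that assumption; `DiskUsage`, `disk_usage`, and `alert_for` are hypothetical names, and the inputs are the raw fragment size and block counts that `statvfs` would report.

```rust
/// Hypothetical plain-data mirror of `DiskStats`.
#[derive(Debug, PartialEq)]
pub struct DiskUsage {
    pub total_bytes: u64,
    pub used_bytes: u64,
    pub free_bytes: u64,
    pub percent_used: f64,
}

/// Compute usage from statvfs-style inputs. `blocks_available` is the
/// count usable by non-root callers, so reserved blocks count as used.
pub fn disk_usage(fragment_size: u64, blocks: u64, blocks_available: u64) -> DiskUsage {
    let total = fragment_size * blocks;
    let avail = fragment_size * blocks_available;
    // saturating_sub guards against a filesystem reporting avail > total.
    let used = total.saturating_sub(avail);
    let percent_used = if total > 0 {
        (used as f64) * 100.0 / (total as f64)
    } else {
        0.0
    };
    DiskUsage {
        total_bytes: total,
        used_bytes: used,
        free_bytes: avail,
        percent_used,
    }
}

/// Returns a warning message once usage reaches the threshold (inclusive).
pub fn alert_for(label: &str, percent_used: f64, threshold: f64) -> Option<String> {
    if percent_used >= threshold {
        Some(format!("{label} near full ({percent_used:.0}% used)"))
    } else {
        None
    }
}
```

Splitting the computation this way would let the threshold behaviour be unit-tested without touching a real filesystem.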

@@ -0,0 +1,128 @@
//! Admin user management: list, delete, promote/demote.
//!
//! All handlers are gated by `RequireAdmin` and rely on
//! `repo::user::admin_safe_*` for self-protection and the last-admin
//! invariant. Audit rows are written inside the same DB transaction as
//! the action they record.
use axum::extract::{Path, Query, State};
use axum::http::StatusCode;
use axum::routing::{delete, get};
use axum::{Json, Router};
use serde::Deserialize;
use uuid::Uuid;
use crate::api::auth::{validate_password, validate_username};
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::auth::password::hash_password;
use crate::domain::User;
use crate::error::{AppError, AppResult};
use crate::repo;
pub fn routes() -> Router<AppState> {
Router::new()
.route("/admin/users", get(list_users).post(create_user))
.route(
"/admin/users/:id",
delete(delete_user).patch(update_user),
)
}
#[derive(Debug, Deserialize, Default)]
pub struct ListUsersParams {
#[serde(default)]
pub search: Option<String>,
#[serde(default = "default_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_limit() -> i64 {
50
}
async fn list_users(
State(state): State<AppState>,
_admin: RequireAdmin,
Query(params): Query<ListUsersParams>,
) -> AppResult<Json<PagedResponse<User>>> {
let limit = params.limit.clamp(1, 200);
let offset = params.offset.max(0);
let (items, total) = repo::user::list_with_total(
&state.db,
&repo::user::ListUsersQuery {
search: params.search.filter(|s| !s.trim().is_empty()),
limit,
offset,
},
)
.await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}
#[derive(Debug, Deserialize)]
pub struct UpdateUserInput {
pub is_admin: Option<bool>,
}
async fn update_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Path(id): Path<Uuid>,
Json(input): Json<UpdateUserInput>,
) -> AppResult<Json<User>> {
let Some(is_admin) = input.is_admin else {
return Err(AppError::InvalidInput(
"no updatable fields supplied".into(),
));
};
let updated =
repo::user::admin_safe_set_is_admin(&state.db, actor.id, id, is_admin).await?;
Ok(Json(updated))
}
async fn delete_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Path(id): Path<Uuid>,
) -> AppResult<StatusCode> {
repo::user::admin_safe_delete(&state.db, actor.id, id).await?;
Ok(StatusCode::NO_CONTENT)
}
#[derive(Debug, Deserialize)]
pub struct CreateUserInput {
pub username: String,
pub password: String,
/// Defaults to false; admins may mint other admins in a single
/// call. Doing it as one POST avoids a second audit row for the
/// common "invite a co-admin" flow.
#[serde(default)]
pub is_admin: bool,
}
async fn create_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Json(input): Json<CreateUserInput>,
) -> AppResult<(StatusCode, Json<User>)> {
let username = input.username.trim();
// Reuse the canonical self-register validators so the admin-create
// path can never produce a username that self-register would
// reject (and vice versa).
validate_username(username)?;
validate_password(&input.password)?;
let pwhash = hash_password(&input.password)?;
let user = repo::user::admin_create_user(
&state.db,
actor.id,
username,
&pwhash,
input.is_admin,
)
.await?;
Ok((StatusCode::CREATED, Json(user)))
}

@@ -4,6 +4,8 @@
//! expire naturally rather than being explicitly invalidated, so other
//! devices keep their existing logins).
use std::sync::OnceLock;
use axum::extract::{Path, State};
use axum::http::StatusCode;
use axum::response::IntoResponse;
@@ -26,6 +28,7 @@ use crate::repo;
pub fn routes() -> Router<AppState> {
Router::new()
.route("/auth/config", get(auth_config))
.route("/auth/register", post(register))
.route("/auth/login", post(login))
.route("/auth/logout", post(logout))
@@ -39,6 +42,25 @@ pub fn routes() -> Router<AppState> {
.route("/auth/tokens/:id", delete(delete_token))
}
/// Public, unauthenticated. Exposes anonymous-relevant auth policy so
/// the frontend can render its login / register affordances correctly
/// without a probe request that would conflate "disabled" with
/// "rate-limited". `self_register_enabled` is the *effective* value
/// (`allow_self_register && !private_mode`), so a private-mode
/// instance reports `false` even if the raw flag is on.
#[derive(Debug, Serialize)]
pub struct AuthConfigResponse {
pub self_register_enabled: bool,
pub private_mode: bool,
}
async fn auth_config(State(state): State<AppState>) -> Json<AuthConfigResponse> {
Json(AuthConfigResponse {
self_register_enabled: state.auth.allow_self_register && !state.auth.private_mode,
private_mode: state.auth.private_mode,
})
}
#[derive(Debug, Deserialize)]
pub struct Credentials {
pub username: String,
@@ -80,6 +102,17 @@ async fn register(
jar: CookieJar,
Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> {
// Rate limit before the disabled check so an operator who flips
// the toggle can't be probed for the toggle state via timing —
// disabled and enabled paths both consume a token, and disabled
// returns 403 instead of running argon2.
check_auth_rate_limit(&state, "register")?;
// Private mode force-blocks self-registration regardless of
// ALLOW_SELF_REGISTER — operators of locked-down instances mint
// accounts via `POST /admin/users` instead.
if !state.auth.allow_self_register || state.auth.private_mode {
return Err(AppError::Forbidden);
}
let username = input.username.trim();
validate_username(username)?;
validate_password(&input.password)?;
@@ -95,6 +128,7 @@ async fn login(
jar: CookieJar,
Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "login")?;
let username = input.username.trim();
if username.is_empty() || input.password.is_empty() {
return Err(AppError::InvalidInput(
@@ -102,9 +136,15 @@ async fn login(
));
}
let user = repo::user::find_by_username(&state.db, username)
.await?
.ok_or(AppError::Unauthenticated)?;
let user = repo::user::find_by_username(&state.db, username).await?;
let Some(user) = user else {
// No such user. Run argon2 against a stable dummy hash so the
// response time matches the wrong-password branch — otherwise
// an attacker can enumerate usernames by timing the no-user
// 401 against the wrong-password 401.
let _ = verify_password(&input.password, dummy_password_hash());
return Err(AppError::Unauthenticated);
};
if !verify_password(&input.password, &user.password_hash) {
return Err(AppError::Unauthenticated);
}
@@ -113,6 +153,21 @@ async fn login(
Ok((StatusCode::OK, jar, Json(AuthResponse { user })))
}
/// Lazily-computed argon2 hash used to equalise login response time
/// across the "no such user" and "wrong password" branches. Computing
/// it once (on the first login of the process) is enough — the hash is
/// never compared against a real password, only used to force argon2
/// to do the same amount of work it would for a real verify.
fn dummy_password_hash() -> &'static str {
static DUMMY: OnceLock<String> = OnceLock::new();
DUMMY
.get_or_init(|| {
crate::auth::password::hash_password("login-timing-equaliser")
.expect("hash_password on a fixed input cannot fail")
})
.as_str()
}
async fn logout(
State(state): State<AppState>,
jar: CookieJar,
@@ -149,6 +204,7 @@ async fn change_password(
jar: CookieJar,
Json(input): Json<ChangePassword>,
) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "change_password")?;
if !verify_password(&input.current_password, &user.password_hash) {
return Err(AppError::Unauthenticated);
}
@@ -230,8 +286,24 @@ async fn create_token(
Json(input): Json<CreateTokenInput>,
) -> AppResult<impl IntoResponse> {
let name = input.name.trim();
// Both arms use `ValidationFailed` (422 with field details) to
// match the structured-error shape `attach_tag` returns for the
// same kind of free-form-identifier validation. The other
// /auth/* handlers in this file use `InvalidInput` (400); the
// divergence is pre-existing and would warrant a project-wide
// pass to flip them all if the client side wants uniform per-
// field error rendering.
if name.is_empty() {
return Err(AppError::InvalidInput("token name is required".into()));
return Err(AppError::ValidationFailed {
message: "token name is required".into(),
details: serde_json::json!({ "name": "required" }),
});
}
if name.chars().count() > 64 {
return Err(AppError::ValidationFailed {
message: "token name too long".into(),
details: serde_json::json!({ "name": "max 64 characters" }),
});
}
let (raw, hash) = generate_token();
let token = repo::api_token::create(&state.db, user.id, name, &hash).await?;
@@ -267,6 +339,18 @@ async fn start_session(
Ok(jar.add(build_session_cookie(raw, &state.auth)))
}
// CSRF posture: `SameSite=Lax` is the project's primary CSRF defense.
// Browsers refuse to attach this cookie to cross-site POST / PATCH /
// DELETE requests, which covers every state-changing endpoint (auth
// mutations, uploads, bookmarks, collections, admin user management,
// etc. — all JSON over POST/PATCH/DELETE). Lax DOES still attach the
// cookie on top-level cross-site GETs, so this defense breaks the
// instant anyone adds a state-changing GET. If you reach for one,
// switch to `SameSite=Strict` here AND add an explicit CSRF-token
// check on the new endpoint. The Bearer-token branch in the
// extractor is unaffected (bots authenticate with the token header,
// not the cookie) and admin routes reject Bearer entirely — see
// `auth::extractor::RequireAdmin`.
fn build_session_cookie(raw: String, cfg: &AuthConfig) -> Cookie<'static> {
let mut builder = Cookie::build((SESSION_COOKIE_NAME, raw))
.http_only(true)
@@ -293,7 +377,38 @@ fn build_expired_cookie(cfg: &AuthConfig) -> Cookie<'static> {
builder.build()
}
fn validate_username(u: &str) -> AppResult<()> {
/// Consume one token from the shared auth rate limiter. Called at the
/// start of `register`, `login`, and `change_password` so credential
/// stuffing / spraying / username-probe loops are throttled by the
/// configured budget (default 5/sec with a 10-request burst).
///
/// All three endpoints share one bucket — they all expose the same
/// argon2-verify-or-create work and the same enumeration channels, so
/// any one of them in a tight loop should trip the limit. `endpoint`
/// is included in the rate-limit-hit log line so operators can tell
/// which endpoint is being probed.
fn check_auth_rate_limit(state: &AppState, endpoint: &'static str) -> AppResult<()> {
use crate::auth::rate_limit::AcquireResult;
match state.auth_limiter.try_acquire() {
AcquireResult::Allowed => Ok(()),
AcquireResult::Denied { retry_after_secs } => {
tracing::warn!(
endpoint,
retry_after_secs,
"auth rate limit hit; returning 429"
);
Err(AppError::TooManyRequests {
retry_after_secs: Some(retry_after_secs),
})
}
}
}
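For orientation, a token bucket with the documented default (5/sec, burst of 10) might look roughly like the sketch below. The diff does not show the internals of `AuthRateLimiter`, so `TokenBucket`, its fields, and `try_acquire_at` are assumptions; only the `AcquireResult` variant names come from the handler above. The clock is passed in explicitly to keep the sketch deterministic, and the lock a shared limiter would need is omitted.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum AcquireResult {
    Allowed,
    Denied { retry_after_secs: u64 },
}

/// Hypothetical single-bucket limiter; not the project's implementation.
pub struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last: Instant,
}

impl TokenBucket {
    /// Starts full, so an initial burst of `burst` requests is allowed.
    pub fn new(refill_per_sec: f64, burst: u32, now: Instant) -> Self {
        Self {
            capacity: burst as f64,
            refill_per_sec,
            tokens: burst as f64,
            last: now,
        }
    }

    /// Refill for the time elapsed since the last call, then try to
    /// spend one token.
    pub fn try_acquire_at(&mut self, now: Instant) -> AcquireResult {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            AcquireResult::Allowed
        } else {
            // Time until one whole token is available, rounded up to
            // at least one second for a Retry-After style hint.
            let wait = (1.0 - self.tokens) / self.refill_per_sec;
            AcquireResult::Denied {
                retry_after_secs: wait.ceil().max(1.0) as u64,
            }
        }
    }
}
```

A shared instance would typically wrap the mutable state in a `Mutex` so that a `&self` method like the `try_acquire()` call above can be used from concurrent handlers.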
// Exposed pub(crate) so the admin user-create handler can apply the
// same rules as self-registration. Keeping the lone canonical
// implementation here avoids the two paths drifting on min length /
// allowed character set.
pub(crate) fn validate_username(u: &str) -> AppResult<()> {
if u.is_empty() {
return Err(AppError::InvalidInput("username is required".into()));
}
@@ -310,7 +425,7 @@ fn validate_username(u: &str) -> AppResult<()> {
Ok(())
}
fn validate_password(p: &str) -> AppResult<()> {
pub(crate) fn validate_password(p: &str) -> AppResult<()> {
if p.len() < 8 {
return Err(AppError::InvalidInput(
"password must be at least 8 characters".into(),


@@ -13,6 +13,7 @@ use uuid::Uuid;
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::CurrentUser;
use crate::crawler::pipeline;
use crate::domain::{Bookmark, BookmarkSummary};
use crate::error::{AppError, AppResult};
use crate::repo;
@@ -66,14 +67,7 @@ async fn create(
// the foreign-key violation collapse into a generic 500.
repo::manga::get(&state.db, input.manga_id).await?;
if let Some(chapter_id) = input.chapter_id {
let exists: Option<(Uuid,)> = sqlx::query_as(
"SELECT id FROM chapters WHERE id = $1 AND manga_id = $2",
)
.bind(chapter_id)
.bind(input.manga_id)
.fetch_optional(&state.db)
.await?;
if exists.is_none() {
if !repo::chapter::belongs_to_manga(&state.db, chapter_id, input.manga_id).await? {
return Err(AppError::NotFound);
}
}
@@ -86,6 +80,29 @@ async fn create(
input.page,
)
.await?;
// Fire-and-forget: kick off content syncs for any pending chapters of
// the newly-bookmarked manga. The dedup index makes this idempotent
// across repeated bookmarks of the same manga; failure here must not
// surface to the user (the daily cron sweeps anything missed).
let pool = state.db.clone();
let manga_id = input.manga_id;
tokio::spawn(async move {
match pipeline::enqueue_pending_for_manga(&pool, manga_id).await {
Ok(summary) => tracing::info!(
%manga_id,
inserted = summary.inserted,
skipped = summary.skipped,
failed = summary.failed,
"bookmark hook: enqueued pending chapters"
),
Err(e) => tracing::warn!(
%manga_id, error = ?e,
"bookmark hook: enqueue_pending_for_manga failed"
),
}
});
Ok((StatusCode::CREATED, Json(bookmark)))
}


@@ -26,9 +26,9 @@ use crate::upload::{parse_image, UploadedImage};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/mangas/:manga_id/chapters", get(list).post(create))
.route("/mangas/:manga_id/chapters/:number", get(get_one))
.route("/mangas/:manga_id/chapters/:chapter_id", get(get_one))
.route(
"/mangas/:manga_id/chapters/:number/pages",
"/mangas/:manga_id/chapters/:chapter_id/pages",
get(list_pages),
)
}
@@ -60,10 +60,10 @@ async fn list(
async fn get_one(
State(state): State<AppState>,
Path((manga_id, number)): Path<(Uuid, i32)>,
Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
) -> AppResult<Json<Chapter>> {
repo::manga::get(&state.db, manga_id).await?;
let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
.await?
.ok_or(AppError::NotFound)?;
Ok(Json(chapter))
@@ -164,10 +164,10 @@ struct PagesResponse {
async fn list_pages(
State(state): State<AppState>,
Path((manga_id, number)): Path<(Uuid, i32)>,
Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
) -> AppResult<Json<PagesResponse>> {
repo::manga::get(&state.db, manga_id).await?;
let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
.await?
.ok_or(AppError::NotFound)?;
let pages = repo::page::list_for_chapter(&state.db, chapter.id).await?;


@@ -1,6 +1,6 @@
use axum::extract::{Multipart, Path, Query, State};
use axum::http::StatusCode;
use axum::routing::{delete, get, post};
use axum::routing::{delete, get, post, put};
use axum::{Json, Router};
use serde::Deserialize;
use serde_json::json;
@@ -14,12 +14,14 @@ use crate::domain::patch::Patch;
use crate::domain::tag::TagRef;
use crate::error::{AppError, AppResult};
use crate::repo;
use crate::storage::StorageError;
use crate::upload::{parse_image, UploadedImage};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/mangas", get(list).post(create))
.route("/mangas/:id", get(get_one).patch(update))
.route("/mangas/:id/cover", put(put_cover).delete(delete_cover))
.route("/mangas/:id/tags", post(attach_tag))
.route("/mangas/:id/tags/:tag_id", delete(detach_tag))
}
@@ -194,16 +196,14 @@ async fn create(
async fn update(
State(state): State<AppState>,
CurrentUser(_user): CurrentUser,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
Json(patch): Json<MangaPatch>,
) -> AppResult<Json<MangaDetail>> {
// TODO(auth): until uploaders are tracked (Phase 5), any signed-in
// user can edit any manga. Restrict to uploader + admin once that
// column lands.
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
if let Some(ref status) = patch.status {
let trimmed = status.trim();
@@ -259,6 +259,80 @@ async fn update(
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
/// `PUT /api/v1/mangas/:id/cover` is multipart/form-data with a single
/// required `cover` part containing image bytes. MIME is sniffed by
/// magic bytes (jpeg/png/webp/gif/avif); filename and Content-Type from
/// the client are ignored. Replaces any existing cover, deleting the
/// previous blob if its extension differs. Returns the refreshed
/// `MangaDetail`.
async fn put_cover(
State(state): State<AppState>,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
mut multipart: Multipart,
) -> AppResult<Json<MangaDetail>> {
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
let mut cover: Option<UploadedImage> = None;
while let Some(field) = next_field(&mut multipart).await? {
if field.name() == Some("cover") {
let bytes = read_field_bytes(field).await?.to_vec();
cover = Some(parse_image(bytes, state.upload.max_file_bytes, "cover")?);
}
}
let img = cover.ok_or_else(|| AppError::ValidationFailed {
message: "cover part is required".into(),
details: json!({ "cover": "required" }),
})?;
// Read the old key BEFORE writing so we can clean up an orphan if
// the extension changed (e.g., .png → .jpg). Same-extension is a
// `put` overwrite — no delete needed.
let old_key = repo::manga::get(&state.db, id).await?.cover_image_path;
let new_key = format!("mangas/{}/cover.{}", id, img.ext);
state.storage.put(&new_key, &img.bytes).await?;
if let Some(prev) = old_key.as_deref() {
if prev != new_key {
// Swallow NotFound — AppError maps it to a client 404,
// which would be wrong here. The DB row can outlive a
// manually-deleted blob.
match state.storage.delete(prev).await {
Ok(()) | Err(StorageError::NotFound) => {}
Err(e) => return Err(e.into()),
}
}
}
repo::manga::set_cover_image_path(&state.db, id, &new_key).await?;
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
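The orphan-cleanup rule in `put_cover` reduces to a small pure decision, sketched here under simplifying assumptions: `orphan_after_put` and `cover_key` are hypothetical helpers, and the manga id is a plain string rather than a `Uuid`.

```rust
/// Builds the storage key in the same layout as the handler above.
fn cover_key(manga_id: &str, ext: &str) -> String {
    format!("mangas/{manga_id}/cover.{ext}")
}

/// Returns the previous key that should be deleted after writing
/// `new_key`, or `None` when there was no previous cover or the write
/// already overwrote it (same key, i.e. same extension).
fn orphan_after_put<'a>(old_key: Option<&'a str>, new_key: &str) -> Option<&'a str> {
    old_key.filter(|prev| *prev != new_key)
}
```

Only a changed extension yields a key to delete; an upload with the same extension overwrites in place.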
/// `DELETE /api/v1/mangas/:id/cover` clears `cover_image_path` and
/// removes the blob. Idempotent: removing a non-existent cover succeeds
/// with the unchanged detail.
async fn delete_cover(
State(state): State<AppState>,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
) -> AppResult<Json<MangaDetail>> {
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
if let Some(key) = repo::manga::get(&state.db, id).await?.cover_image_path {
match state.storage.delete(&key).await {
Ok(()) | Err(StorageError::NotFound) => {}
Err(e) => return Err(e.into()),
}
repo::manga::clear_cover_image_path(&state.db, id).await?;
}
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
#[derive(Debug, Deserialize)]
pub struct AttachTagBody {
pub name: String,
@@ -270,6 +344,7 @@ async fn attach_tag(
Path(id): Path<Uuid>,
Json(body): Json<AttachTagBody>,
) -> AppResult<(StatusCode, Json<TagRef>)> {
validate_tag_name(&body.name)?;
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
@@ -316,6 +391,27 @@ async fn detach_tag(
}
}
/// Request-side validation for `POST /mangas/:id/tags` body. Mirrors
/// the repo-level cap in `repo::tag::upsert_by_name` (max 64 chars
/// after trim) but surfaces the failure at the handler boundary with
/// the same envelope shape other validations use.
fn validate_tag_name(name: &str) -> AppResult<()> {
let trimmed = name.trim();
if trimmed.is_empty() {
return Err(AppError::ValidationFailed {
message: "tag name cannot be empty".into(),
details: json!({ "name": "required" }),
});
}
if trimmed.chars().count() > 64 {
return Err(AppError::ValidationFailed {
message: "tag name too long".into(),
details: json!({ "name": "max 64 characters" }),
});
}
Ok(())
}
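The 64-character cap counts Unicode scalar values, not bytes, which matters for non-ASCII names. A minimal sketch of the difference follows; `within_char_limit` is a hypothetical helper, not part of the diff.

```rust
/// True when the trimmed input has at most `max_chars` Unicode scalar
/// values, mirroring the `chars().count()` check used above.
fn within_char_limit(s: &str, max_chars: usize) -> bool {
    s.trim().chars().count() <= max_chars
}

/// Byte-based variant for comparison; `str::len` returns UTF-8 bytes.
fn within_byte_limit(s: &str, max_bytes: usize) -> bool {
    s.trim().len() <= max_bytes
}
```

A name of 64 two-byte characters passes the character check but would fail a byte-length check of the same size, so the two forms are not interchangeable.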
fn validate_new_manga(input: &NewManga) -> AppResult<()> {
if input.title.trim().is_empty() {
return Err(AppError::ValidationFailed {
@@ -335,6 +431,30 @@ fn validate_new_manga(input: &NewManga) -> AppResult<()> {
Ok(())
}
/// Authorisation gate for manga mutations. The manga is assumed to
/// exist (the caller runs [`repo::manga::exists`] first so a missing id
/// surfaces as `NotFound`, not `Forbidden`).
///
/// Rule: a non-NULL `uploaded_by` must match the current user. Legacy
/// rows with `uploaded_by IS NULL` (pre-migration-0011) are still
/// editable by any signed-in user — there's nobody to gate on yet, and
/// the historical-data note in 0011 acknowledges the gap. Once an
/// admin role lands the NULL case can flip to admin-only.
///
/// Returns `Forbidden` (not `NotFound`) on owner mismatch — mangas
/// are listable via `GET /mangas`, so existence isn't a secret and
/// the more accurate 403 is fine. This deliberately differs from
/// `repo::collection::require_owner`, which collapses both states to
/// `NotFound` because collections are private to a user and existence
/// itself is information worth hiding from non-owners.
async fn require_can_edit(state: &AppState, manga_id: Uuid, user_id: Uuid) -> AppResult<()> {
match repo::manga::uploaded_by(&state.db, manga_id).await? {
Some(owner) if owner != user_id => Err(AppError::Forbidden),
// Owner matches the current user, or `uploaded_by` is NULL (legacy row, no owner).
_ => Ok(()),
}
}
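Stripped of the database lookup, the ownership rule is a three-way decision. The sketch below assumes integer ids and a local `EditError` type for brevity; both are illustrative stand-ins for `Uuid` and `AppError`.

```rust
#[derive(Debug, PartialEq)]
enum EditError {
    Forbidden,
}

/// `uploaded_by` is `None` for legacy rows with no recorded owner.
fn can_edit(uploaded_by: Option<u64>, user_id: u64) -> Result<(), EditError> {
    match uploaded_by {
        // A recorded owner that is someone else: reject.
        Some(owner) if owner != user_id => Err(EditError::Forbidden),
        // Owner matches, or there is no owner to gate on.
        _ => Ok(()),
    }
}
```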
async fn validate_genre_ids(state: &AppState, ids: &[Uuid]) -> AppResult<()> {
if ids.is_empty() {
return Ok(());


@@ -1,3 +1,4 @@
pub mod admin;
pub mod auth;
pub mod authors;
pub mod bookmarks;
@@ -28,4 +29,5 @@ pub fn routes() -> Router<AppState> {
.merge(authors::routes())
.merge(collections::routes())
.merge(history::routes())
.merge(admin::routes())
}


@@ -1,14 +1,33 @@
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use axum::extract::DefaultBodyLimit;
use anyhow::Context;
use async_trait::async_trait;
use axum::extract::{DefaultBodyLimit, FromRequestParts, Request, State};
use axum::http::{HeaderName, HeaderValue, Method};
use axum::middleware::{self, Next};
use axum::response::Response;
use axum::Router;
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;
use tokio_util::sync::CancellationToken;
use tower_http::cors::{AllowOrigin, CorsLayer};
use tower_http::trace::TraceLayer;
use crate::config::{AuthConfig, Config, UploadConfig};
use crate::auth::extractor::CurrentUser;
use crate::auth::rate_limit::AuthRateLimiter;
use crate::error::AppError;
use crate::config::{AuthConfig, Config, CrawlerConfig, UploadConfig};
use crate::crawler::browser_manager::{self, BrowserManager};
use crate::crawler::content::{self, SyncOutcome};
use crate::crawler::daemon::{self, ChapterDispatcher, DaemonConfig, MetadataPass};
use crate::crawler::jobs::JobPayload;
use crate::crawler::pipeline::{self, MetadataStats};
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::resync::{RealResyncService, ResyncService};
use crate::crawler::safety::DownloadAllowlist;
use crate::crawler::session;
use crate::repo;
use crate::storage::{LocalStorage, Storage};
#[derive(Clone)]
@@ -17,24 +36,381 @@ pub struct AppState {
pub storage: Arc<dyn Storage>,
pub auth: AuthConfig,
pub upload: UploadConfig,
/// Shared rate limiter guarding the `/auth/*` mutation endpoints.
/// One instance per AppState so tests stay isolated across the
/// same process.
pub auth_limiter: Arc<AuthRateLimiter>,
/// Admin-triggered force resync. `None` when the crawler daemon
/// is disabled (`CRAWLER_DAEMON=false`); admin handlers gate on
/// `.is_some()` and return 503 otherwise. Set by [`build`] from the
/// same wiring that builds the daemon's chapter dispatcher, so a
/// force resync uses the daemon's BrowserManager + rate limiters.
pub resync: Option<Arc<dyn ResyncService>>,
}
pub async fn build(config: Config) -> anyhow::Result<Router> {
/// Bundle returned by [`build`]. The router is what `axum::serve` consumes;
/// the daemon (when enabled) outlives the HTTP server and is awaited via
/// [`AppHandle::shutdown`] after the listener has finished gracefully.
pub struct AppHandle {
pub router: Router,
pub daemon: Option<daemon::DaemonHandle>,
}
impl AppHandle {
pub async fn shutdown(self) {
if let Some(d) = self.daemon {
d.shutdown().await;
}
}
}
pub async fn build(config: Config) -> anyhow::Result<AppHandle> {
let db = PgPoolOptions::new()
.max_connections(10)
.connect(&config.database_url)
.await?;
sqlx::migrate!("./migrations").run(&db).await?;
if let Some((username, password)) = config.admin_bootstrap.as_ref() {
repo::user::bootstrap_admin(&db, username, password)
.await
.context("bootstrap_admin from ADMIN_USERNAME/ADMIN_PASSWORD env")?;
tracing::info!(admin_username = %username, "admin bootstrap ensured");
}
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(config.storage_dir.clone()));
let (daemon, resync) = if config.crawler.daemon_enabled {
let spawned = spawn_crawler_daemon(db.clone(), Arc::clone(&storage), &config.crawler).await?;
(Some(spawned.handle), Some(spawned.resync))
} else {
tracing::info!("crawler daemon disabled (CRAWLER_DAEMON=false)");
(None, None)
};
let auth_limiter = Arc::new(AuthRateLimiter::new(config.auth.rate_limit));
let state = AppState {
db,
storage,
auth: config.auth.clone(),
upload: config.upload.clone(),
auth_limiter,
resync,
};
Ok(router(state).layer(cors_layer(&config.cors_allowed_origins)))
let router = router(state).layer(cors_layer(&config.cors_allowed_origins));
Ok(AppHandle { router, daemon })
}
/// Bundle returned by [`spawn_crawler_daemon`]. The handle owns the
/// daemon's tasks; `resync` is the operator-trigger service shared with
/// `AppState` so admin endpoints can call into the same browser /
/// rate-limit machinery.
struct SpawnedDaemon {
handle: daemon::DaemonHandle,
resync: Arc<dyn ResyncService>,
}
async fn spawn_crawler_daemon(
db: PgPool,
storage: Arc<dyn Storage>,
cfg: &CrawlerConfig,
) -> anyhow::Result<SpawnedDaemon> {
// Reqwest client with cookie jar pre-seeded so CDN image fetches
// include PHPSESSID. Same shape as bin/crawler.rs main().
let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
if let (Some(sid), Some(domain), Some(start_url)) =
(&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url)
{
let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
let seed_url = reqwest::Url::parse(start_url)
.context("parse CRAWLER_START_URL for cookie seed")?;
cookie_jar.add_cookie_str(&cookie_str, &seed_url);
}
let mut http_builder = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(30))
.no_proxy()
.cookie_provider(cookie_jar);
if let Some(ua) = &cfg.user_agent {
http_builder = http_builder.user_agent(ua);
}
if let Some(proxy) = &cfg.proxy {
http_builder = http_builder
.proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy: {proxy}"))?);
}
let http = http_builder.build().context("build crawler reqwest")?;
let mut rate = HostRateLimiters::new(std::time::Duration::from_millis(cfg.rate_ms));
if let Some(host) = &cfg.cdn_host {
rate = rate.with_override(host, std::time::Duration::from_millis(cfg.cdn_rate_ms));
}
let rate = Arc::new(rate);
let tor = crate::crawler::tor::TorController::from_parts(
cfg.tor_control_url.as_deref(),
cfg.tor_control_password.as_deref(),
cfg.tor_control_cookie_path.as_deref(),
)
.context("build TorController from CRAWLER_TOR_CONTROL_* env")?
.map(Arc::new);
if let Some(t) = &tor {
tracing::info!(?t, "TOR control configured; transient pages will trigger NEWNYM");
}
let tor_recircuit_max = cfg.tor_recircuit_max_attempts;
// Browser manager. on_launch re-injects PHPSESSID on every fresh
// chromium spawn so an idle teardown followed by re-launch stays
// authenticated without operator action.
let mut launch_opts = cfg.browser.clone();
if let Some(proxy) = &cfg.proxy {
let chromium_proxy = crate::crawler::url_utils::chromium_proxy_arg(proxy);
launch_opts.extra_args.push(format!("--proxy-server={chromium_proxy}"));
}
let on_launch = match (&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url) {
(Some(sid), Some(domain), Some(start_url)) => {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url.clone();
let tor_for_launch = tor.as_ref().map(Arc::clone);
let on_launch: browser_manager::OnLaunch = Arc::new(move |browser| {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url.clone();
let tor_for_launch = tor_for_launch.as_ref().map(Arc::clone);
Box::pin(async move {
session::inject_phpsessid(&browser, &sid, &domain)
.await
.context("on_launch: inject_phpsessid")?;
session::verify_session_with_recircuit(
&browser,
&start_url,
tor_for_launch.as_deref(),
tor_recircuit_max,
)
.await
.context("on_launch: verify_session")?;
Ok(())
})
});
on_launch
}
_ => browser_manager::noop_on_launch(),
};
let browser_manager = BrowserManager::new(launch_opts, cfg.idle_timeout, on_launch);
let session_expired = Arc::new(AtomicBool::new(false));
let metadata_pass: Option<Arc<dyn MetadataPass>> = cfg.start_url.as_ref().map(|url| {
let m: Arc<dyn MetadataPass> = Arc::new(RealMetadataPass {
browser_manager: Arc::clone(&browser_manager),
db: db.clone(),
storage: Arc::clone(&storage),
http: http.clone(),
rate: Arc::clone(&rate),
start_url: url.clone(),
manga_limit: cfg.manga_limit,
download_allowlist: cfg.download_allowlist.clone(),
max_image_bytes: cfg.max_image_bytes,
tor: tor.as_ref().map(Arc::clone),
});
m
});
let dispatcher: Arc<dyn ChapterDispatcher> = Arc::new(RealChapterDispatcher {
browser_manager: Arc::clone(&browser_manager),
db: db.clone(),
storage: Arc::clone(&storage),
http: http.clone(),
rate: Arc::clone(&rate),
download_allowlist: cfg.download_allowlist.clone(),
max_image_bytes: cfg.max_image_bytes,
tor: tor.as_ref().map(Arc::clone),
});
let resync: Arc<dyn ResyncService> = Arc::new(RealResyncService {
browser_manager: Arc::clone(&browser_manager),
db: db.clone(),
storage: Arc::clone(&storage),
http,
rate: Arc::clone(&rate),
download_allowlist: cfg.download_allowlist.clone(),
max_image_bytes: cfg.max_image_bytes,
tor: tor.as_ref().map(Arc::clone),
});
// Shared cancellation: daemon shutdown cancels the BrowserManager's
// idle reaper too. Reaper itself is added to the daemon's extra_tasks
// so DaemonHandle::shutdown awaits its completion.
let cancel = CancellationToken::new();
let reaper_task = browser_manager::spawn_idle_reaper(
Arc::clone(&browser_manager),
cancel.clone(),
);
// Also close the browser explicitly on shutdown so we don't rely on
// kill-on-drop when other Arc<Browser> holders may still exist.
let shutdown_task = {
let cancel = cancel.clone();
let mgr = Arc::clone(&browser_manager);
tokio::spawn(async move {
cancel.cancelled().await;
mgr.shutdown().await;
})
};
let daemon_handle = daemon::spawn(
db,
cancel,
DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers: cfg.chapter_workers,
daily_at: cfg.daily_at,
tz: cfg.tz,
retention_days: cfg.retention_days,
session_expired,
extra_tasks: vec![reaper_task, shutdown_task],
},
);
Ok(SpawnedDaemon {
handle: daemon_handle,
resync,
})
}
// Real impls of the daemon traits, owning the browser manager + I/O. Kept
// in app.rs because they need the same builder-side env wiring that
// AppState gets — the daemon module itself stays free of reqwest / storage
// details so its tests don't pull them in.
struct RealMetadataPass {
browser_manager: Arc<BrowserManager>,
db: PgPool,
storage: Arc<dyn Storage>,
http: reqwest::Client,
rate: Arc<HostRateLimiters>,
start_url: String,
manga_limit: usize,
download_allowlist: DownloadAllowlist,
max_image_bytes: usize,
tor: Option<Arc<crate::crawler::tor::TorController>>,
}
#[async_trait]
impl MetadataPass for RealMetadataPass {
async fn run(&self) -> anyhow::Result<MetadataStats> {
let result = pipeline::run_metadata_pass(
&self.browser_manager,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
&self.start_url,
self.manga_limit,
false,
&self.download_allowlist,
self.max_image_bytes,
self.tor.as_deref(),
)
.await;
if let Err(e) = &result {
if crate::crawler::nav::anyhow_looks_browser_dead(e) {
self.browser_manager.invalidate().await;
}
}
// Cover backfill follows the metadata pass even when the pass
// errored — the early-stop walk can complete its work and bail
// late, and a transient browser failure shouldn't cancel the
// residual cover backlog. The backfill has its own per-call cap
// so a runaway error stream can't monopolise the tick.
match pipeline::backfill_missing_covers(
&self.browser_manager,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
pipeline::COVER_BACKFILL_DEFAULT_MAX,
&self.download_allowlist,
self.max_image_bytes,
self.tor.as_deref(),
)
.await
{
Ok(stats) => {
if stats.considered > 0 {
tracing::info!(?stats, "cover backfill complete");
}
}
Err(e) => {
tracing::warn!(error = ?e, "cover backfill failed");
if crate::crawler::nav::anyhow_looks_browser_dead(&e) {
self.browser_manager.invalidate().await;
}
}
}
result
}
}
struct RealChapterDispatcher {
browser_manager: Arc<BrowserManager>,
db: PgPool,
storage: Arc<dyn Storage>,
http: reqwest::Client,
rate: Arc<HostRateLimiters>,
download_allowlist: DownloadAllowlist,
max_image_bytes: usize,
tor: Option<Arc<crate::crawler::tor::TorController>>,
}
#[async_trait]
impl ChapterDispatcher for RealChapterDispatcher {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
match payload {
JobPayload::SyncChapterContent {
source_id: _,
chapter_id,
source_chapter_key: _,
} => {
let row = repo::chapter::dispatch_target(&self.db, chapter_id)
.await
.context("look up chapter for dispatch")?;
let Some((manga_id, source_url)) = row else {
// Chapter (or its source row) is gone — ack done.
return Ok(SyncOutcome::Skipped);
};
let lease = self.browser_manager.acquire().await?;
let result = content::sync_chapter_content(
&lease,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
chapter_id,
manga_id,
&source_url,
false,
&self.download_allowlist,
self.max_image_bytes,
self.tor.as_deref(),
)
.await;
drop(lease);
match result {
Ok(outcome) => Ok(outcome),
Err(e) => {
if crate::crawler::nav::anyhow_looks_browser_dead(&e) {
self.browser_manager.invalidate().await;
}
Err(e)
}
}
}
// Other payload kinds aren't dispatched by this daemon yet —
// SyncManga / SyncChapterList are handled inline by the cron's
// metadata pass.
_ => Ok(SyncOutcome::Skipped),
}
}
}
/// Build a router from a pre-assembled state. Used by integration tests
@@ -43,11 +419,62 @@ pub fn router(state: AppState) -> Router {
let max_request_bytes = state.upload.max_request_bytes;
Router::new()
.nest("/api/v1", crate::api::routes())
.layer(middleware::from_fn_with_state(
state.clone(),
private_mode_guard,
))
.layer(DefaultBodyLimit::max(max_request_bytes))
.with_state(state)
.layer(TraceLayer::new_for_http())
}
/// Paths reachable anonymously even when `PRIVATE_MODE=true`. Login and
/// logout are needed for the auth flow itself; `/health` is reserved
/// for load-balancer probes; `/auth/config` lets the frontend decide
/// whether to render the login form or its anonymous alternatives;
/// `/auth/register` is exempted from the gate so the handler can
/// return its informative `registration_disabled` 403 (the same code
/// public-mode deployments use when `ALLOW_SELF_REGISTER=false`);
/// the handler itself rejects the request in private mode, so no
/// account is ever created through it. Everything else demands a
/// valid session cookie or bearer token.
fn is_public_in_private_mode(path: &str) -> bool {
matches!(
path,
"/api/v1/health"
| "/api/v1/auth/config"
| "/api/v1/auth/login"
| "/api/v1/auth/logout"
| "/api/v1/auth/register"
)
}
/// Site-wide auth gate for `PRIVATE_MODE=true`. With the flag off this
/// is a no-op pass-through, so public deployments take no extra DB
/// hit. With it on, the guard reuses [`CurrentUser`] — the same
/// session-cookie-then-bearer-token logic the per-handler extractor
/// uses — so the two paths can never drift.
async fn private_mode_guard(
State(state): State<AppState>,
req: Request,
next: Next,
) -> Result<Response, AppError> {
if !state.auth.private_mode {
return Ok(next.run(req).await);
}
if is_public_in_private_mode(req.uri().path()) {
return Ok(next.run(req).await);
}
let (mut parts, body) = req.into_parts();
match CurrentUser::from_request_parts(&mut parts, &state).await {
Ok(_) => {
let req = Request::from_parts(parts, body);
Ok(next.run(req).await)
}
Err(_) => Err(AppError::Unauthenticated),
}
}
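The guard's control flow can be summarised as a pure function over three inputs. In this sketch, `Gate` and `gate` are hypothetical names, and `has_valid_credentials` stands in for a successful `CurrentUser` extraction; the path list is copied from `is_public_in_private_mode` above.

```rust
#[derive(Debug, PartialEq)]
enum Gate {
    PassThrough,
    Unauthenticated,
}

fn is_public_in_private_mode(path: &str) -> bool {
    matches!(
        path,
        "/api/v1/health"
            | "/api/v1/auth/config"
            | "/api/v1/auth/login"
            | "/api/v1/auth/logout"
            | "/api/v1/auth/register"
    )
}

/// Decision table for the private-mode gate. The real guard only
/// performs the credential lookup when the first two checks fail.
fn gate(private_mode: bool, path: &str, has_valid_credentials: bool) -> Gate {
    if !private_mode || is_public_in_private_mode(path) || has_valid_credentials {
        Gate::PassThrough
    } else {
        Gate::Unauthenticated
    }
}
```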
pub(crate) fn cors_layer(allowed_origins: &[String]) -> CorsLayer {
if allowed_origins.is_empty() {
// Same-origin only — no CORS headers emitted.


@@ -1,11 +1,19 @@
//! `CurrentUser` axum extractor.
//! Auth extractors.
//!
//! Resolves a request to a logged-in user by trying, in order:
//! 1. a `mangalord_session` cookie (session lookup by `sha256(value)`);
//! 2. an `Authorization: Bearer <token>` header (api_token lookup).
//! Three extractors are available, in increasing strictness:
//!
//! Both paths look up by hash, never by raw value. Failure to resolve
//! either way returns 401 via `AppError::Unauthenticated`.
//! - [`CurrentUser`] — accepts either a session cookie or an
//! `Authorization: Bearer <token>` header. Used by ordinary
//! authenticated endpoints where bot tokens are first-class clients.
//! - [`CurrentSessionUser`] — accepts only the session cookie. Used as
//! the substrate for admin extraction so bot tokens cannot authenticate
//! as the admin (see [`RequireAdmin`]).
//! - [`RequireAdmin`] — composes over [`CurrentSessionUser`] and
//! additionally requires `user.is_admin`. Returns 403 for
//! authenticated-but-not-admin, 401 otherwise.
//!
//! All lookups go by `sha256(raw_token)` — the raw value is never stored
//! in the database.
use axum::async_trait;
use axum::extract::FromRequestParts;
@@ -61,3 +69,54 @@ impl FromRequestParts<AppState> for CurrentUser {
Err(AppError::Unauthenticated)
}
}
/// Cookie-only authentication. Bot/API tokens are explicitly NOT accepted
/// here — this is the substrate for [`RequireAdmin`] and exists precisely
/// to keep admin authority out of bearer-token reach.
pub struct CurrentSessionUser(pub User);
#[async_trait]
impl FromRequestParts<AppState> for CurrentSessionUser {
type Rejection = AppError;
async fn from_request_parts(
parts: &mut Parts,
state: &AppState,
) -> Result<Self, Self::Rejection> {
let jar = CookieJar::from_headers(&parts.headers);
let cookie = jar
.get(SESSION_COOKIE_NAME)
.ok_or(AppError::Unauthenticated)?;
let hash = hash_token(cookie.value());
let session = repo::session::find_active(&state.db, &hash)
.await?
.ok_or(AppError::Unauthenticated)?;
let user = repo::user::find_by_id(&state.db, session.user_id)
.await?
.ok_or(AppError::Unauthenticated)?;
Ok(CurrentSessionUser(user))
}
}
/// Admin-only. Composes over [`CurrentSessionUser`] so bot tokens are
/// rejected at the auth step (401) rather than the role step (403).
/// The user row is re-read every request, so demotion takes effect on
/// the very next call without needing to purge sessions.
pub struct RequireAdmin(pub User);
#[async_trait]
impl FromRequestParts<AppState> for RequireAdmin {
type Rejection = AppError;
async fn from_request_parts(
parts: &mut Parts,
state: &AppState,
) -> Result<Self, Self::Rejection> {
let CurrentSessionUser(user) =
CurrentSessionUser::from_request_parts(parts, state).await?;
if !user.is_admin {
return Err(AppError::Forbidden);
}
Ok(RequireAdmin(user))
}
}
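Review note: the 401-versus-403 ordering described in the `RequireAdmin` doc comment can be checked without axum or a database. The following is a minimal sketch, assuming hypothetical stand-in types (`User`, `AuthError`, `Credential`) rather than the project's real `User` and `AppError`; it only models the decision order, not cookie parsing or session lookup.

```rust
// Hypothetical stand-ins for the real `User` and `AppError` types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct User {
    pub name: String,
    pub is_admin: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AuthError {
    Unauthenticated, // would map to HTTP 401
    Forbidden,       // would map to HTTP 403
}

/// How a request presented its credentials (illustrative only).
pub enum Credential {
    None,
    SessionCookie(User),
    BearerToken(User),
}

/// Session check first, admin flag second: a bearer token fails with
/// `Unauthenticated` before `is_admin` is ever consulted.
pub fn require_admin(cred: Credential) -> Result<User, AuthError> {
    let user = match cred {
        Credential::SessionCookie(user) => user,
        // No cookie, or only a bearer token: not a session user.
        Credential::None | Credential::BearerToken(_) => {
            return Err(AuthError::Unauthenticated)
        }
    };
    if !user.is_admin {
        return Err(AuthError::Forbidden);
    }
    Ok(user)
}
```

Under these assumptions, an admin's bearer token is rejected with the same error as an anonymous request, which is the property the doc comment calls out.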


@@ -7,4 +7,5 @@
pub mod extractor;
pub mod password;
pub mod rate_limit;
pub mod token;


@@ -0,0 +1,179 @@
//! Per-process token-bucket rate limiter for the auth endpoints.
//!
//! Protects `/auth/login`, `/auth/register`, and `/auth/me/password`
//! from credential stuffing / password spraying / username probing.
//!
//! The current deploy puts SvelteKit's hooks.server.ts proxy in front
//! of axum without forwarding the original client IP (no
//! `X-Forwarded-For`), so per-IP buckets would all collapse to the
//! proxy container's address. Until the proxy learns to forward the
//! peer address, a single global bucket gives equivalent protection
//! against mass-attack patterns and trades a small DoS surface
//! (legitimate users sharing the limit) for simplicity.
//!
//! Each `AppState` carries its own [`AuthRateLimiter`] instance, so
//! tests run in isolated buckets and won't bleed across `#[sqlx::test]`
//! cases that share a process.
use std::sync::Mutex;
use std::time::Instant;
/// Tunable limits. `per_sec == 0` disables the limiter — used by the
/// test harness and by anyone who wants to opt out via env config.
#[derive(Clone, Copy, Debug)]
pub struct RateLimitConfig {
pub per_sec: u32,
pub burst: u32,
}
impl Default for RateLimitConfig {
/// Disabled by default. The production `AuthConfig::from_env`
/// overrides to a real limit; the test harness keeps the default
/// so existing tests don't flake against shared buckets.
fn default() -> Self {
Self {
per_sec: 0,
burst: 0,
}
}
}
/// Production defaults: 5 requests/sec sustained, 10-request burst.
/// Tight enough to make brute force impractical, loose enough that a
/// real user mistyping their password three times in a row doesn't
/// hit it.
pub const PRODUCTION_PER_SEC: u32 = 5;
pub const PRODUCTION_BURST: u32 = 10;
struct Bucket {
tokens: f64,
last_refill: Instant,
}
/// Outcome of [`AuthRateLimiter::try_acquire`]. When `Denied`, the
/// caller can use `retry_after_secs` for a `Retry-After: N` header
/// (RFC 6585 §4) so well-behaved clients back off correctly rather
/// than retrying in a tight loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AcquireResult {
Allowed,
Denied { retry_after_secs: u64 },
}
/// Single-bucket token-bucket limiter. `try_acquire` is cheap (one
/// mutex acquire, no allocations) so the auth path doesn't pay a real
/// cost for the check.
pub struct AuthRateLimiter {
cfg: RateLimitConfig,
bucket: Mutex<Bucket>,
}
impl AuthRateLimiter {
pub fn new(cfg: RateLimitConfig) -> Self {
Self {
cfg,
bucket: Mutex::new(Bucket {
tokens: cfg.burst as f64,
last_refill: Instant::now(),
}),
}
}
/// Consume one token if available. Returns `Denied` with a
/// rounded-up seconds-until-refill so the caller can emit a
/// `Retry-After` header.
pub fn try_acquire(&self) -> AcquireResult {
if self.cfg.per_sec == 0 {
return AcquireResult::Allowed;
}
let now = Instant::now();
let mut bucket = self.bucket.lock().expect("rate limiter mutex poisoned");
let elapsed = now.duration_since(bucket.last_refill).as_secs_f64();
bucket.tokens =
(bucket.tokens + elapsed * f64::from(self.cfg.per_sec)).min(f64::from(self.cfg.burst));
bucket.last_refill = now;
if bucket.tokens >= 1.0 {
bucket.tokens -= 1.0;
AcquireResult::Allowed
} else {
// ceil((1 - tokens) / per_sec), minimum 1 — a `Retry-After: 0`
// would tell clients to retry immediately, which is what we're
// actively trying to discourage.
let deficit = 1.0 - bucket.tokens;
let wait_secs = (deficit / f64::from(self.cfg.per_sec)).ceil() as u64;
AcquireResult::Denied {
retry_after_secs: wait_secs.max(1),
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn disabled_limiter_always_allows() {
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 0,
burst: 0,
});
for _ in 0..1000 {
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
}
}
#[test]
fn burst_lets_through_initial_window_then_blocks() {
// 1/sec refill, burst 3 → first three pass back-to-back, fourth blocks.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 3,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
match rl.try_acquire() {
AcquireResult::Denied { retry_after_secs } => {
// Bucket is at ~0 tokens, refill rate 1/sec → ~1s wait.
assert!(
retry_after_secs >= 1,
"retry_after must be at least 1s, got {retry_after_secs}"
);
}
AcquireResult::Allowed => panic!("fourth request must be denied"),
}
}
#[test]
fn tokens_refill_over_time() {
// 10/sec → one token refills every 100ms, so the 150ms sleep below is enough.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 10,
burst: 1,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert!(matches!(rl.try_acquire(), AcquireResult::Denied { .. }));
std::thread::sleep(std::time::Duration::from_millis(150));
assert_eq!(
rl.try_acquire(),
AcquireResult::Allowed,
"token should have refilled"
);
}
#[test]
fn retry_after_scales_inversely_with_refill_rate() {
// 1/sec → wait ~1s after burst exhausted.
// 10/sec → wait <1s, which the ceil rounds up to 1s (not asserted here).
let slow = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 1,
});
slow.try_acquire();
match slow.try_acquire() {
AcquireResult::Denied { retry_after_secs } => assert_eq!(retry_after_secs, 1),
_ => panic!("expected Denied"),
}
}
}
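Review note: the refill and `Retry-After` arithmetic can also be exercised without `thread::sleep` by passing time in explicitly. This is a minimal sketch, not the module's API: `ManualBucket` and `Acquire` are hypothetical names, and the caller supplies "time since start" as a `Duration` instead of the limiter reading `Instant::now()`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Acquire {
    Allowed,
    Denied { retry_after_secs: u64 },
}

/// Token bucket driven by a caller-supplied timestamp (hypothetical type).
pub struct ManualBucket {
    per_sec: f64,
    burst: f64,
    tokens: f64,
    last: Duration,
}

impl ManualBucket {
    pub fn new(per_sec: u32, burst: u32) -> Self {
        Self {
            per_sec: f64::from(per_sec),
            burst: f64::from(burst),
            tokens: f64::from(burst), // start with a full bucket
            last: Duration::ZERO,
        }
    }

    pub fn try_acquire(&mut self, now: Duration) -> Acquire {
        // Same convention as the diff: a zero rate disables the limiter.
        if self.per_sec == 0.0 {
            return Acquire::Allowed;
        }
        // Refill for the elapsed time, capped at the burst size.
        let elapsed = now.saturating_sub(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.per_sec).min(self.burst);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Acquire::Allowed
        } else {
            // Seconds until one whole token is available, rounded up.
            let wait = ((1.0 - self.tokens) / self.per_sec).ceil() as u64;
            Acquire::Denied { retry_after_secs: wait.max(1) }
        }
    }
}
```

With an integer `per_sec` of at least 1 the deficit is at most one token, so this formula always yields 1 second; larger waits would only appear with a refill rate below one token per second.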


@@ -1,47 +1,39 @@
//! Crawler binary.
//!
//! Walks the source's manga listing (all pages), fetches each manga's
//! metadata + chapter list, downloads the cover into `Storage`, and
//! reconciles everything into the DB. Chapter *content* (page images)
//! is out of scope for now — only chapter rows + their source links
//! are written.
//! Now an ops escape hatch sitting alongside the in-process daemon: walks
//! the source's manga listing (all pages), fetches each manga's metadata +
//! chapter list, downloads covers, reconciles chapters — and then, for any
//! chapter belonging to a bookmarked manga whose `page_count` is still 0,
//! fetches the chapter pages inline. The daemon does the same work through
//! `crawler_jobs`; the CLI is kept around for force-refetches and manual
//! backfills.
//!
//! Configuration:
//! - **Start URL** (required): first CLI positional arg, else
//! `$CRAWLER_START_URL`. This is the manga *list* page (page 1).
//! - **Database** (required): `$DATABASE_URL`.
//! - **Storage dir**: `$STORAGE_DIR`, default `./data/storage` —
//! matches the API binary so both write to the same local tree.
//! - **Browser**: see `LaunchOptions::from_env` —
//! `CRAWLER_BROWSER_MODE` (`headed`|`headless`) and
//! `CRAWLER_BROWSER_ARGS`.
//! - **Rate limit**: `CRAWLER_RATE_MS` (ms between requests, default
//! `1000`).
//! - **Cap**: `CRAWLER_LIMIT` (max manga detail fetches per run,
//! default `0` = no cap).
//! - **Skip chapters**: `CRAWLER_SKIP_CHAPTERS=1` — turn off the
//! chapter selector in the parser AND skip the per-manga
//! `sync_manga_chapters` write. Use this for "metadata only" runs.
//! - **Proxy**: `$CRAWLER_PROXY` — single URL applied to both
//! Chromium (`--proxy-server`) and `reqwest::Proxy::all`. Supports
//! `http://`, `https://`, and `socks5://` (with optional user:pass).
//! Example: `socks5://user:pass@host:1080`. Unset → direct.
//! Configuration mirrors the daemon's `CRAWLER_*` env vars (see
//! `crate::config::CrawlerConfig`) plus the CLI-only:
//! - **Start URL**: first CLI positional arg, else `$CRAWLER_START_URL`.
//! - **Skip chapters / chapter content / force re-fetch / keep browser**:
//! `CRAWLER_SKIP_CHAPTERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`,
//! `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_KEEP_BROWSER_OPEN`.
//! - **Limit**: `CRAWLER_LIMIT` (max manga detail fetches per run).
//!
//! See `crawler::pipeline::run_metadata_pass` for the shared metadata
//! flow.
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Duration;
use anyhow::{anyhow, Context};
use mangalord::crawler::{
browser::{self, LaunchOptions},
rate_limit::RateLimiter,
source::{target::TargetSource, DiscoverMode, FetchContext, Source},
};
use mangalord::repo;
use futures_util::stream::{self, StreamExt};
use mangalord::crawler::browser::{BrowserMode, LaunchOptions};
use mangalord::crawler::browser_manager::{self, BrowserManager};
use mangalord::crawler::content::{self, SyncOutcome};
use mangalord::crawler::pipeline;
use mangalord::crawler::rate_limit::HostRateLimiters;
use mangalord::crawler::session;
use mangalord::storage::{LocalStorage, Storage};
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;
use tokio::sync::Mutex;
use tracing_subscriber::EnvFilter;
use uuid::Uuid;
@@ -64,11 +56,44 @@ async fn main() -> anyhow::Result<()> {
.unwrap_or_else(|_| "./data/storage".to_string())
.into();
let rate_ms = env_u64("CRAWLER_RATE_MS", 1000);
let cdn_host = std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty());
let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms);
let limit = env_u64("CRAWLER_LIMIT", 0) as usize;
let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false);
let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false);
let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize;
let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false);
let phpsessid = std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty());
let cookie_domain = std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty())
.or_else(|| session::registrable_domain(&start_url));
let user_agent = std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty());
let proxy_url = std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty());
let tor_control_url = std::env::var("CRAWLER_TOR_CONTROL_URL")
.ok()
.filter(|s| !s.trim().is_empty());
let tor_control_password = std::env::var("CRAWLER_TOR_CONTROL_PASSWORD")
.ok()
.filter(|s| !s.trim().is_empty());
let tor_control_cookie_path = std::env::var("CRAWLER_TOR_CONTROL_COOKIE_PATH")
.ok()
.filter(|s| !s.trim().is_empty())
.map(std::path::PathBuf::from);
let tor_recircuit_max_attempts: u32 = std::env::var("CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS")
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(3)
.max(1);
let keep_browser_open = env_bool("CRAWLER_KEEP_BROWSER_OPEN", false);
let db = PgPoolOptions::new()
.max_connections(5)
@@ -79,13 +104,21 @@ async fn main() -> anyhow::Result<()> {
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(&storage_dir));
// `no_proxy()` disables reqwest's own env-based detection so the
// single `CRAWLER_PROXY` knob is the only thing that influences
// routing. Otherwise an unrelated `HTTPS_PROXY` in the shell would
// silently route cover downloads while the browser stayed direct.
let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
let seed_url =
reqwest::Url::parse(&start_url).context("parse start URL for cookie seed")?;
cookie_jar.add_cookie_str(&cookie_str, &seed_url);
tracing::info!(domain, "seeded PHPSESSID into reqwest cookie jar");
}
let mut http_builder = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.no_proxy();
.no_proxy()
.cookie_provider(cookie_jar);
if let Some(ua) = &user_agent {
http_builder = http_builder.user_agent(ua);
}
if let Some(proxy) = &proxy_url {
http_builder = http_builder
.proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy URL: {proxy}"))?);
@@ -94,208 +127,341 @@ async fn main() -> anyhow::Result<()> {
let mut options = LaunchOptions::from_env();
if let Some(proxy) = &proxy_url {
options.extra_args.push(format!("--proxy-server={proxy}"));
let chromium_proxy = mangalord::crawler::url_utils::chromium_proxy_arg(proxy);
options.extra_args.push(format!("--proxy-server={chromium_proxy}"));
}
let keep_open = match (keep_browser_open, options.mode) {
(true, BrowserMode::Headed) => true,
(true, BrowserMode::Headless) => {
tracing::warn!(
"CRAWLER_KEEP_BROWSER_OPEN ignored in headless mode (no window to inspect)"
);
false
}
_ => false,
};
tracing::info!(
?options,
%start_url,
rate_ms,
cdn_host = ?cdn_host,
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content,
chapter_workers,
force_refetch_chapters,
phpsessid_set = phpsessid.is_some(),
cookie_domain = ?cookie_domain,
user_agent = ?user_agent,
proxy = ?proxy_url,
keep_open,
storage_dir = %storage_dir.display(),
"starting crawler"
);
let handle = browser::launch(options).await.context("launch browser")?;
let tor = mangalord::crawler::tor::TorController::from_parts(
tor_control_url.as_deref(),
tor_control_password.as_deref(),
tor_control_cookie_path.as_deref(),
)
.context("build TorController from CRAWLER_TOR_CONTROL_* env")?
.map(Arc::new);
if let Some(t) = &tor {
tracing::info!(?t, "TOR control configured");
}
// BrowserManager with idle_timeout = ZERO so the CLI keeps Chromium
// alive for the entire run — same lifecycle as the old direct
// `browser::launch()` flow. on_launch re-injects PHPSESSID + runs the
// session probe; bad cookies fail fast before any real work happens.
let on_launch: browser_manager::OnLaunch = match (&phpsessid, &cookie_domain) {
(Some(sid), Some(domain)) => {
let sid = sid.clone();
let domain = domain.clone();
let start_url_clone = start_url.clone();
let tor_for_launch = tor.as_ref().map(Arc::clone);
Arc::new(move |browser| {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url_clone.clone();
let tor_for_launch = tor_for_launch.as_ref().map(Arc::clone);
Box::pin(async move {
session::inject_phpsessid(&browser, &sid, &domain)
.await
.context("inject_phpsessid")?;
session::verify_session_with_recircuit(
&browser,
&start_url,
tor_for_launch.as_deref(),
tor_recircuit_max_attempts,
)
.await
.context("verify_session")?;
Ok(())
})
})
}
_ => browser_manager::noop_on_launch(),
};
let session_ready = phpsessid.is_some() && cookie_domain.is_some();
let manager = BrowserManager::new(options, Duration::ZERO, on_launch);
let result = run(
handle.browser(),
Arc::clone(&manager),
&db,
storage.as_ref(),
Arc::clone(&storage),
&http,
&start_url,
rate_ms,
cdn_host.as_deref(),
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content || !session_ready,
chapter_workers,
force_refetch_chapters,
tor.clone(),
)
.await;
handle.close().await.ok();
if keep_open {
tracing::info!(
"crawler finished; browser kept open. Press Ctrl+C to close and exit."
);
let _ = tokio::signal::ctrl_c().await;
tracing::info!("Ctrl+C received; closing browser");
}
manager.shutdown().await;
result
}
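Review note: `main` repeats two env-reading patterns, "unset or blank means `None`" and "parse, fall back to a default, floor at 1". A minimal sketch of both follows, assuming hypothetical helper names (`non_empty`, `attempts`) and an injected lookup closure so the behaviour can be checked without touching the process environment.

```rust
/// Returns the value for `name` unless it is missing or whitespace-only.
pub fn non_empty<F>(lookup: F, name: &str) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    lookup(name).filter(|s| !s.trim().is_empty())
}

/// Parses an attempt count. Unset or unparsable values fall back to
/// `default`, and the result is never less than 1.
pub fn attempts<F>(lookup: F, name: &str, default: u32) -> u32
where
    F: Fn(&str) -> Option<String>,
{
    lookup(name)
        .and_then(|s| s.parse().ok())
        .unwrap_or(default)
        .max(1)
}
```

Against the real environment the lookup would be `|n| std::env::var(n).ok()`. Note that a value with surrounding whitespace such as `" 5"` does not parse and therefore yields the default.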
#[allow(clippy::too_many_arguments)]
async fn run(
browser: &chromiumoxide::Browser,
manager: Arc<BrowserManager>,
db: &PgPool,
storage: &dyn Storage,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
start_url: &str,
rate_ms: u64,
cdn_host: Option<&str>,
cdn_rate_ms: u64,
limit: usize,
skip_chapters: bool,
skip_chapter_content: bool,
chapter_workers: usize,
force_refetch_chapters: bool,
tor: Option<Arc<mangalord::crawler::tor::TorController>>,
) -> anyhow::Result<()> {
let rate = Mutex::new(RateLimiter::new(Duration::from_millis(rate_ms)));
let source = {
let s = TargetSource::new(start_url.to_string());
if skip_chapters {
s.without_chapter_parsing()
} else {
s
let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms));
if let Some(host) = cdn_host {
rate = rate.with_override(host, Duration::from_millis(cdn_rate_ms));
}
let rate = Arc::new(rate);
// SSRF defence: only download from the catalog host + CDN host
// (plus optional CRAWLER_DOWNLOAD_ALLOWLIST extras), and cap
// single-image downloads at CRAWLER_MAX_IMAGE_BYTES bytes.
// CRAWLER_ALLOW_ANY_HOST=true short-circuits the host check for
// sharded-CDN sources; private-IP and scheme guards still apply.
let allowlist = if env_bool("CRAWLER_ALLOW_ANY_HOST", false) {
mangalord::crawler::safety::DownloadAllowlist::allow_any()
} else {
let mut allow = mangalord::crawler::safety::DownloadAllowlist::new();
if let Ok(parsed) = reqwest::Url::parse(start_url) {
if let Some(h) = parsed.host_str() {
allow = allow.allow(h);
}
}
};
let ctx = FetchContext {
browser,
rate: &rate,
};
let source_id = source.id();
repo::crawler::ensure_source(
db,
source_id,
"Target Site",
&origin_of(start_url).unwrap_or_else(|| start_url.to_string()),
)
.await
.context("ensure_source")?;
let run_started_at = chrono::Utc::now();
let max_refs = (limit > 0).then_some(limit);
tracing::info!(?max_refs, "discovering manga list");
let refs = source
.discover(&ctx, DiscoverMode::Backfill, max_refs)
.await
.context("discover failed")?;
tracing::info!(count = refs.len(), "discovered manga list");
let to_fetch = refs;
let total = to_fetch.len();
for (i, r) in to_fetch.iter().enumerate() {
tracing::info!(idx = i + 1, total, key = %r.source_manga_key, "fetching metadata");
let manga = match source.fetch_manga(&ctx, r).await {
Ok(m) => m,
Err(e) => {
tracing::warn!(key = %r.source_manga_key, url = %r.url, error = ?e, "fetch_manga failed");
continue;
}
};
let upsert = match repo::crawler::upsert_manga_from_source(db, source_id, &r.url, &manga)
.await
{
Ok(u) => u,
Err(e) => {
tracing::error!(key = %r.source_manga_key, error = ?e, "upsert_manga_from_source failed");
continue;
}
};
tracing::info!(
key = %manga.source_manga_key,
manga_id = %upsert.manga_id,
status = ?upsert.status,
title = %manga.title,
"manga upserted"
);
// Cover image: download when missing in storage (backfill for
// mangas synced before cover-download support, plus the New
// path) or when metadata changed (cover URL is part of
// metadata_hash, so an Updated status implies the URL may
// have moved). Failures are non-fatal.
let needs_cover = upsert.cover_image_path.is_none()
|| matches!(upsert.status, repo::crawler::UpsertStatus::Updated);
if needs_cover {
if let Some(cover_url) = manga.cover_url.as_deref() {
if let Err(e) = download_and_store_cover(
db,
storage,
http,
&rate,
&r.url,
upsert.manga_id,
cover_url,
)
.await
{
tracing::warn!(manga_id = %upsert.manga_id, error = ?e, "cover download failed");
if let Some(host) = cdn_host {
allow = allow.allow(host);
}
if let Ok(extras) = std::env::var("CRAWLER_DOWNLOAD_ALLOWLIST") {
for piece in extras.split(',') {
let trimmed = piece.trim();
if !trimmed.is_empty() {
allow = allow.allow(trimmed);
}
}
}
allow
};
let max_image_bytes: usize = std::env::var("CRAWLER_MAX_IMAGE_BYTES")
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(mangalord::crawler::safety::DEFAULT_MAX_IMAGE_BYTES);
let allowlist = Arc::new(allowlist);
if !skip_chapters {
match repo::crawler::sync_manga_chapters(
db,
source_id,
upsert.manga_id,
&manga.chapters,
)
.await
{
Ok(diff) => tracing::info!(
manga_id = %upsert.manga_id,
new = diff.new,
refreshed = diff.refreshed,
dropped = diff.dropped,
"chapters synced"
),
Err(e) => tracing::warn!(manga_id = %upsert.manga_id, error = ?e, "chapter sync failed"),
}
}
}
let stats = pipeline::run_metadata_pass(
manager.as_ref(),
db,
storage.as_ref(),
http,
rate.as_ref(),
start_url,
limit,
skip_chapters,
allowlist.as_ref(),
max_image_bytes,
tor.as_deref(),
)
.await?;
tracing::info!(?stats, "metadata pass complete");
if limit == 0 {
match repo::crawler::mark_dropped_mangas(db, source_id, run_started_at).await {
Ok(n) => tracing::info!(dropped = n, "marked unseen manga as dropped"),
Err(e) => tracing::warn!(error = ?e, "drop-pass failed"),
}
} else {
tracing::info!(limit, "partial sync — skipping drop pass");
if !skip_chapter_content {
sync_bookmarked_chapter_content(
Arc::clone(&manager),
db,
Arc::clone(&storage),
http,
Arc::clone(&rate),
"target",
chapter_workers,
force_refetch_chapters,
Arc::clone(&allowlist),
max_image_bytes,
tor.clone(),
)
.await?;
}
Ok(())
}
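Review note: the allowlist seeding in `run` (catalog host, CDN host, comma-separated extras, or allow-any) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the real `DownloadAllowlist`: `HostAllowlist`, `host_of`, and `build_allowlist` are hypothetical names, the URL parsing is deliberately naive (no IPv6 literals), and the private-IP and scheme guards mentioned in the comment are not modelled.

```rust
use std::collections::HashSet;

/// Naive host extraction: strips scheme, path, userinfo, and port.
pub fn host_of(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("://")?;
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    let host_port = authority.rsplit('@').next()?; // drop userinfo
    let host = host_port.split(':').next()?; // drop port
    (!host.is_empty()).then(|| host.to_ascii_lowercase())
}

#[derive(Debug, Default)]
pub struct HostAllowlist {
    any: bool,
    hosts: HashSet<String>,
}

impl HostAllowlist {
    pub fn allow_any() -> Self {
        Self { any: true, hosts: HashSet::new() }
    }

    /// Builder-style add, matching the `allow = allow.allow(h)` shape.
    pub fn allow(mut self, host: &str) -> Self {
        self.hosts.insert(host.trim().to_ascii_lowercase());
        self
    }

    pub fn permits(&self, host: &str) -> bool {
        self.any || self.hosts.contains(&host.to_ascii_lowercase())
    }
}

pub fn build_allowlist(
    start_url: &str,
    cdn_host: Option<&str>,
    extras: Option<&str>,
    allow_any: bool,
) -> HostAllowlist {
    if allow_any {
        return HostAllowlist::allow_any();
    }
    let mut allow = HostAllowlist::default();
    if let Some(h) = host_of(start_url) {
        allow = allow.allow(&h);
    }
    if let Some(h) = cdn_host {
        allow = allow.allow(h);
    }
    // Blank pieces (",," or trailing commas) are ignored.
    for piece in extras.unwrap_or("").split(',') {
        let trimmed = piece.trim();
        if !trimmed.is_empty() {
            allow = allow.allow(trimmed);
        }
    }
    allow
}
```

Host comparison is case-insensitive here, which is an assumption of the sketch rather than something the diff states.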
async fn download_and_store_cover(
/// Find every chapter whose manga is bookmarked by at least one user and
/// that hasn't been content-synced yet (or every such chapter when
/// `force_refetch` is set), then fan them out across `workers` concurrent
/// tasks. Each task takes its browser from a `BrowserManager` lease, so
/// it interleaves cleanly with the metadata pass.
///
/// A `SessionExpired` result stops further chapters from starting and
/// makes the function return an error once in-flight tasks finish.
#[allow(clippy::too_many_arguments)]
async fn sync_bookmarked_chapter_content(
manager: Arc<BrowserManager>,
db: &PgPool,
storage: &dyn Storage,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
rate: &Mutex<RateLimiter>,
manga_url: &str,
manga_id: Uuid,
cover_url: &str,
rate: Arc<HostRateLimiters>,
source_id: &str,
workers: usize,
force_refetch: bool,
allowlist: Arc<mangalord::crawler::safety::DownloadAllowlist>,
max_image_bytes: usize,
tor: Option<Arc<mangalord::crawler::tor::TorController>>,
) -> anyhow::Result<()> {
let absolute = reqwest::Url::parse(manga_url)
.context("parse manga URL")?
.join(cover_url)
.context("join cover URL onto manga URL")?;
let pending: Vec<(Uuid, Uuid, String)> = sqlx::query_as(
r#"
SELECT id, manga_id, source_url FROM (
SELECT DISTINCT c.id, c.manga_id, c.created_at, cs.source_url
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE cs.source_id = $1
AND cs.dropped_at IS NULL
AND (c.page_count = 0 OR $2)
) sub
ORDER BY manga_id, created_at ASC
"#,
)
.bind(source_id)
.bind(force_refetch)
.fetch_all(db)
.await
.context("query pending chapter content")?;
rate.lock().await.wait().await;
let resp = http
.get(absolute.clone())
.send()
.await
.with_context(|| format!("GET {absolute}"))?
.error_for_status()
.with_context(|| format!("non-2xx for {absolute}"))?;
let bytes = resp.bytes().await.context("read cover body")?;
if pending.is_empty() {
tracing::info!("chapter content: nothing pending");
return Ok(());
}
tracing::info!(count = pending.len(), workers, "chapter content phase starting");
// `infer` sniffs the magic bytes — same crate the upload handler
// uses, so we don't trust the URL's extension.
let kind = infer::get(&bytes);
let ext = kind.map(|k| k.extension()).unwrap_or("bin");
let key = format!("mangas/{manga_id}/cover.{ext}");
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let stats = std::sync::Mutex::new(WorkerStats::default());
storage
.put(&key, &bytes)
.await
.with_context(|| format!("store cover at {key}"))?;
repo::manga::set_cover_image_path(db, manga_id, &key)
.await
.with_context(|| format!("update cover_image_path for {manga_id}"))?;
tracing::info!(manga_id = %manga_id, key = %key, bytes = bytes.len(), %absolute, "cover stored");
stream::iter(pending.into_iter())
.for_each_concurrent(workers.max(1), |(chapter_id, manga_id, source_url)| {
let session_expired = Arc::clone(&session_expired);
let storage = Arc::clone(&storage);
let rate = Arc::clone(&rate);
let manager = Arc::clone(&manager);
let allowlist = Arc::clone(&allowlist);
let tor = tor.clone();
let stats = &stats;
async move {
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
return;
}
let lease = match manager.acquire().await {
Ok(l) => l,
Err(e) => {
tracing::error!(%chapter_id, error = ?e, "browser acquire failed");
let mut s = stats.lock().unwrap();
s.failed += 1;
return;
}
};
let outcome = content::sync_chapter_content(
&lease,
db,
storage.as_ref(),
http,
rate.as_ref(),
chapter_id,
manga_id,
&source_url,
force_refetch,
allowlist.as_ref(),
max_image_bytes,
tor.as_deref(),
)
.await;
drop(lease);
let mut s = stats.lock().unwrap();
match outcome {
Ok(SyncOutcome::Fetched { pages }) => {
tracing::info!(%chapter_id, pages, "chapter content fetched");
s.fetched += 1;
}
Ok(SyncOutcome::Skipped) => s.skipped += 1,
Ok(SyncOutcome::SessionExpired) => {
tracing::error!(
%chapter_id,
"session expired mid-run — refresh CRAWLER_PHPSESSID and re-run"
);
session_expired
.store(true, std::sync::atomic::Ordering::Relaxed);
}
Err(e) => {
tracing::warn!(
%chapter_id, error = ?e, "chapter content sync failed"
);
s.failed += 1;
}
}
}
})
.await;
let total = stats.into_inner().unwrap();
tracing::info!(
fetched = total.fetched,
skipped = total.skipped,
failed = total.failed,
"chapter content phase done"
);
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
anyhow::bail!("session expired during chapter content phase");
}
Ok(())
}
#[derive(Default, Clone, Copy)]
struct WorkerStats {
fetched: usize,
skipped: usize,
failed: usize,
}
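Review note: the abort-flag pattern in `sync_bookmarked_chapter_content` (shared `AtomicBool`, tallies behind a `Mutex`, no new work once the flag is set) can be reproduced with scoped threads. This is a minimal sketch under assumptions: `run_workers`, `Outcome`, and `Stats` are hypothetical names, and OS threads with a shared queue stand in for the `for_each_concurrent` stream used in the diff.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Fetched,
    Skipped,
    SessionExpired,
    Failed,
}

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Stats {
    pub fetched: usize,
    pub skipped: usize,
    pub failed: usize,
}

/// Runs `work` over `items` on `workers` threads. Returns the tallies and
/// whether any item reported `SessionExpired`.
pub fn run_workers<F>(items: Vec<u32>, workers: usize, work: F) -> (Stats, bool)
where
    F: Fn(u32) -> Outcome + Sync,
{
    let queue = Mutex::new(VecDeque::from(items));
    let expired = AtomicBool::new(false);
    let stats = Mutex::new(Stats::default());

    thread::scope(|scope| {
        for _ in 0..workers.max(1) {
            scope.spawn(|| loop {
                // Once any worker saw an expired session, take no new work.
                if expired.load(Ordering::Relaxed) {
                    break;
                }
                let next = queue.lock().unwrap().pop_front();
                let item = match next {
                    Some(item) => item,
                    None => break, // queue drained
                };
                let outcome = work(item);
                let mut s = stats.lock().unwrap();
                match outcome {
                    Outcome::Fetched => s.fetched += 1,
                    Outcome::Skipped => s.skipped += 1,
                    Outcome::Failed => s.failed += 1,
                    Outcome::SessionExpired => expired.store(true, Ordering::Relaxed),
                }
            });
        }
    });

    (stats.into_inner().unwrap(), expired.load(Ordering::Relaxed))
}
```

As in the diff, items that were never started after the flag flipped are not counted in any tally, so the three counters can sum to less than the number of pending items.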
fn resolve_start_url() -> anyhow::Result<String> {
if let Some(arg) = std::env::args().nth(1) {
return Ok(arg);
@@ -307,12 +473,6 @@ fn resolve_start_url() -> anyhow::Result<String> {
})
}
fn origin_of(url: &str) -> Option<String> {
let (scheme, rest) = url.split_once("://")?;
let host = rest.split('/').next()?;
Some(format!("{scheme}://{host}"))
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
@@ -327,3 +487,4 @@ fn env_bool(name: &str, default: bool) -> bool {
_ => default,
}
}


@@ -1,10 +1,32 @@
use std::path::PathBuf;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use crate::crawler::browser::LaunchOptions;
use crate::crawler::safety::{DownloadAllowlist, DEFAULT_MAX_IMAGE_BYTES};
#[derive(Clone, Debug)]
pub struct AuthConfig {
pub cookie_secure: bool,
pub cookie_domain: Option<String>,
pub session_ttl_days: i64,
pub rate_limit: crate::auth::rate_limit::RateLimitConfig,
/// When `false`, `POST /auth/register` returns 403
/// `registration_disabled` and the frontend hides its register
/// affordance. Admins can still mint accounts via
/// `POST /admin/users`. Defaults to `true` (open registration)
/// for backward compatibility.
pub allow_self_register: bool,
/// When `true`, every API path except a small allowlist
/// (`/health`, `/auth/config`, `/auth/login`, `/auth/logout`)
/// requires a valid session cookie or bearer token — anonymous
/// reads are rejected with 401. Self-registration is also
/// force-disabled regardless of [`Self::allow_self_register`]
/// so a private instance is locked down with a single switch.
/// Defaults to `false` (current public behaviour).
pub private_mode: bool,
}
impl Default for AuthConfig {
@@ -13,6 +35,13 @@ impl Default for AuthConfig {
cookie_secure: true,
cookie_domain: None,
session_ttl_days: 30,
// Disabled by default so the test harness inherits a
// non-throttling limiter. Production `from_env` overrides
// to the [`PRODUCTION_PER_SEC`]/[`PRODUCTION_BURST`]
// defaults.
rate_limit: crate::auth::rate_limit::RateLimitConfig::default(),
allow_self_register: true,
private_mode: false,
}
}
}
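Review note: the interaction between `private_mode` and `allow_self_register` described in the field docs reduces to two small predicates. A minimal sketch follows, assuming a hypothetical `AuthFlags` struct and exact-match path comparison; the real middleware may match paths differently (prefixes, an API mount point), which this does not attempt to capture.

```rust
#[derive(Debug, Clone, Copy)]
pub struct AuthFlags {
    pub allow_self_register: bool,
    pub private_mode: bool,
}

/// Paths the `private_mode` doc comment lists as reachable without
/// credentials.
const PUBLIC_PATHS: [&str; 4] = ["/health", "/auth/config", "/auth/login", "/auth/logout"];

/// Self-registration is open only when it is enabled AND the instance is
/// not private: private mode overrides the register flag.
pub fn registration_open(flags: AuthFlags) -> bool {
    flags.allow_self_register && !flags.private_mode
}

/// Whether the private-mode gate lets an anonymous request through.
/// Outside private mode the gate never blocks; individual handlers may
/// still demand authentication on their own.
pub fn gate_passes_anonymous(flags: AuthFlags, path: &str) -> bool {
    !flags.private_mode || PUBLIC_PATHS.contains(&path)
}
```

Note that `/auth/register` is deliberately absent from the allowlist, consistent with registration being force-disabled in private mode.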
@@ -45,6 +74,93 @@ pub struct Config {
pub auth: AuthConfig,
pub upload: UploadConfig,
pub cors_allowed_origins: Vec<String>,
pub crawler: CrawlerConfig,
/// `(username, password)` for the admin user provisioned at startup
/// when both `ADMIN_USERNAME` and `ADMIN_PASSWORD` are set. `None`
/// skips the bootstrap entirely. See `repo::user::bootstrap_admin`
/// for the create-vs-promote semantics — notably the password here
/// is used only when creating a new row, never to overwrite an
/// existing one.
pub admin_bootstrap: Option<(String, String)>,
}
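Review note: the "create-vs-promote" rule in the `admin_bootstrap` doc comment (the password is used only when creating a row, never to overwrite one) can be modelled in memory. This sketch is an assumption about the semantics based solely on that comment: `UserRow`, `Bootstrap`, and this `bootstrap_admin` are hypothetical, a `HashMap` stands in for the users table, and the `AlreadyAdmin` case is a guess at how a repeat run behaves.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UserRow {
    pub password_hash: String,
    pub is_admin: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Bootstrap {
    Created,
    Promoted,
    AlreadyAdmin,
}

/// Creates the admin if missing; otherwise only flips `is_admin`.
/// An existing row's password hash is never touched.
pub fn bootstrap_admin(
    users: &mut HashMap<String, UserRow>,
    username: &str,
    password_hash: &str,
) -> Bootstrap {
    match users.get_mut(username) {
        Some(row) if row.is_admin => Bootstrap::AlreadyAdmin,
        Some(row) => {
            row.is_admin = true; // promote, keep the stored hash
            Bootstrap::Promoted
        }
        None => {
            users.insert(
                username.to_string(),
                UserRow { password_hash: password_hash.to_string(), is_admin: true },
            );
            Bootstrap::Created
        }
    }
}
```

Under this reading, changing `ADMIN_PASSWORD` after the first start has no effect on the existing account, so a password rotation would have to go through the normal password-change path.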
/// All crawler-daemon knobs read from env. Mirrors the env vars the
/// `bin/crawler` binary already reads, plus the new daemon-only knobs
/// (daily_at, tz, idle_timeout, retention_days, daemon_enabled).
///
/// `daemon_enabled = false` skips the daemon spawn entirely — used by
/// integration tests and dev runs that don't want background activity.
#[derive(Clone, Debug)]
pub struct CrawlerConfig {
pub daemon_enabled: bool,
pub daily_at: NaiveTime,
pub tz: Tz,
pub idle_timeout: Duration,
pub chapter_workers: usize,
pub retention_days: u32,
pub start_url: Option<String>,
pub rate_ms: u64,
pub cdn_host: Option<String>,
pub cdn_rate_ms: u64,
pub phpsessid: Option<String>,
pub cookie_domain: Option<String>,
pub user_agent: Option<String>,
pub proxy: Option<String>,
/// `tcp://host:port`, `host:port`, or bare `host` (default port
/// 9051). When `None`, TOR-recircuit-on-transient is disabled and
/// the crawler behaves identically to pre-TOR releases.
pub tor_control_url: Option<String>,
/// HashedControlPassword auth. Used only when
/// `tor_control_cookie_path` is `None`.
pub tor_control_password: Option<String>,
/// Cookie-file auth path (e.g.
/// `/var/lib/tor/control_auth_cookie`). Takes precedence over
/// password when both are set.
pub tor_control_cookie_path: Option<PathBuf>,
/// Maximum NEWNYM-and-retry cycles per recircuit-eligible failure.
/// Defaults to 3.
pub tor_recircuit_max_attempts: u32,
pub browser: LaunchOptions,
/// Hosts the crawler is allowed to download images / covers from.
/// Always seeded with the host of `start_url` and (when set) the
/// configured `cdn_host`. Additional hosts can be added via
/// `CRAWLER_DOWNLOAD_ALLOWLIST` (comma-separated).
pub download_allowlist: DownloadAllowlist,
/// Hard upper bound on a single image download. Defaults to 32 MiB.
pub max_image_bytes: usize,
/// Max manga detail fetches per metadata pass. `0` means no cap
/// (full sweep up to the source's own bound). Sourced from
/// `CRAWLER_LIMIT`, mirroring the CLI binary.
pub manga_limit: usize,
}
impl Default for CrawlerConfig {
fn default() -> Self {
Self {
daemon_enabled: false,
daily_at: NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
tz: Tz::UTC,
idle_timeout: Duration::from_secs(600),
chapter_workers: 1,
retention_days: 7,
start_url: None,
rate_ms: 1000,
cdn_host: None,
cdn_rate_ms: 1000,
phpsessid: None,
cookie_domain: None,
user_agent: None,
proxy: None,
tor_control_url: None,
tor_control_password: None,
tor_control_cookie_path: None,
tor_recircuit_max_attempts: 3,
browser: LaunchOptions::headless(),
download_allowlist: DownloadAllowlist::new(),
max_image_bytes: DEFAULT_MAX_IMAGE_BYTES,
manga_limit: 0,
}
}
}
impl Config {
@@ -63,6 +179,18 @@ impl Config {
.ok()
.filter(|s| !s.is_empty()),
session_ttl_days: env_i64("SESSION_TTL_DAYS", 30),
rate_limit: crate::auth::rate_limit::RateLimitConfig {
per_sec: env_u64(
"AUTH_RATE_PER_SEC",
crate::auth::rate_limit::PRODUCTION_PER_SEC.into(),
) as u32,
burst: env_u64(
"AUTH_RATE_BURST",
crate::auth::rate_limit::PRODUCTION_BURST.into(),
) as u32,
},
allow_self_register: env_bool("ALLOW_SELF_REGISTER", true),
private_mode: env_bool("PRIVATE_MODE", false),
},
upload: UploadConfig {
max_request_bytes: env_usize("MAX_REQUEST_BYTES", 200 * 1024 * 1024),
@@ -77,10 +205,135 @@ impl Config {
.collect()
})
.unwrap_or_default(),
crawler: CrawlerConfig::from_env()?,
admin_bootstrap: admin_bootstrap_from_env(),
})
}
}
/// Returns `Some((username, password))` only when BOTH `ADMIN_USERNAME`
/// and `ADMIN_PASSWORD` are set and non-empty. Half-set configuration is
/// treated as "no bootstrap" rather than a hard error, so an operator
/// can comment out one env var without crashing the server.
fn admin_bootstrap_from_env() -> Option<(String, String)> {
let username = std::env::var("ADMIN_USERNAME").ok().filter(|s| !s.is_empty())?;
let password = std::env::var("ADMIN_PASSWORD").ok().filter(|s| !s.is_empty())?;
Some((username, password))
}
impl CrawlerConfig {
pub fn from_env() -> anyhow::Result<Self> {
// Parse CRAWLER_DAILY_AT (HH:MM, 24h). Invalid → fail fast.
let daily_at = match std::env::var("CRAWLER_DAILY_AT").ok().as_deref() {
None | Some("") => NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
Some(raw) => NaiveTime::parse_from_str(raw, "%H:%M").map_err(|e| {
anyhow::anyhow!("CRAWLER_DAILY_AT must be HH:MM (got {raw:?}): {e}")
})?,
};
let tz: Tz = match std::env::var("CRAWLER_TZ").ok().as_deref() {
None | Some("") => Tz::UTC,
Some(raw) => raw
.parse()
.map_err(|e| anyhow::anyhow!("CRAWLER_TZ must be a valid IANA TZ (got {raw:?}): {e}"))?,
};
let start_url = std::env::var("CRAWLER_START_URL")
.ok()
.filter(|s| !s.trim().is_empty());
let cdn_host = std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty());
let download_allowlist =
build_download_allowlist(start_url.as_deref(), cdn_host.as_deref());
Ok(Self {
daemon_enabled: env_bool("CRAWLER_DAEMON", true),
daily_at,
tz,
idle_timeout: Duration::from_secs(env_u64("CRAWLER_IDLE_TIMEOUT_S", 600)),
chapter_workers: env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize,
retention_days: env_u64("CRAWLER_JOB_RETENTION_DAYS", 7) as u32,
start_url,
rate_ms: env_u64("CRAWLER_RATE_MS", 1000),
cdn_host,
cdn_rate_ms: env_u64("CRAWLER_CDN_RATE_MS", env_u64("CRAWLER_RATE_MS", 1000)),
phpsessid: std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty()),
cookie_domain: std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty()),
user_agent: std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty()),
proxy: std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty()),
tor_control_url: std::env::var("CRAWLER_TOR_CONTROL_URL")
.ok()
.filter(|s| !s.trim().is_empty()),
tor_control_password: std::env::var("CRAWLER_TOR_CONTROL_PASSWORD")
.ok()
.filter(|s| !s.trim().is_empty()),
tor_control_cookie_path: std::env::var("CRAWLER_TOR_CONTROL_COOKIE_PATH")
.ok()
.filter(|s| !s.trim().is_empty())
.map(PathBuf::from),
tor_recircuit_max_attempts: env_u64("CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS", 3)
.max(1) as u32,
browser: LaunchOptions::from_env(),
download_allowlist,
max_image_bytes: env_usize("CRAWLER_MAX_IMAGE_BYTES", DEFAULT_MAX_IMAGE_BYTES),
manga_limit: env_usize("CRAWLER_LIMIT", 0),
})
}
}
/// Build the download allowlist from env. Always includes
/// `CRAWLER_START_URL`'s host (so the crawler can fetch covers from
/// the catalog itself) and `CRAWLER_CDN_HOST` when set. Additional
/// hosts can be supplied via `CRAWLER_DOWNLOAD_ALLOWLIST` (comma-
/// separated). Empty by default — meaning the crawler refuses to
/// download anything when no source is configured, which is the safe
/// fail-closed posture.
///
/// `CRAWLER_ALLOW_ANY_HOST=true` short-circuits the host enumeration
/// for operators whose sources shard across numbered CDN subdomains.
/// Scheme + private-IP defenses still apply.
fn build_download_allowlist(
start_url: Option<&str>,
cdn_host: Option<&str>,
) -> DownloadAllowlist {
if env_bool("CRAWLER_ALLOW_ANY_HOST", false) {
return DownloadAllowlist::allow_any();
}
let mut allow = DownloadAllowlist::new();
if let Some(url) = start_url {
if let Ok(parsed) = reqwest::Url::parse(url) {
if let Some(h) = parsed.host_str() {
allow = allow.allow(h);
}
}
}
if let Some(host) = cdn_host {
allow = allow.allow(host);
}
if let Ok(extras) = std::env::var("CRAWLER_DOWNLOAD_ALLOWLIST") {
for piece in extras.split(',') {
let trimmed = piece.trim();
if !trimmed.is_empty() {
allow = allow.allow(trimmed);
}
}
}
allow
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
fn env_bool(name: &str, default: bool) -> bool {
match std::env::var(name).ok().as_deref() {
Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,
@@ -102,3 +355,65 @@ fn env_usize(name: &str, default: usize) -> usize {
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::Mutex;
// Serialise env-touching tests so concurrent cargo-test threads don't
// race on the process-global env. Re-acquire on poison since a
// panicking test still leaves the env in a consistent state for us
// (we set/unset within each guard region).
static ENV_GUARD: Mutex<()> = Mutex::new(());
#[test]
fn crawler_limit_env_populates_manga_limit() {
let _g = ENV_GUARD.lock().unwrap_or_else(|p| p.into_inner());
std::env::set_var("CRAWLER_LIMIT", "96");
let cfg = CrawlerConfig::from_env().expect("from_env");
std::env::remove_var("CRAWLER_LIMIT");
assert_eq!(cfg.manga_limit, 96);
}
#[test]
fn crawler_limit_unset_defaults_to_zero() {
let _g = ENV_GUARD.lock().unwrap_or_else(|p| p.into_inner());
std::env::remove_var("CRAWLER_LIMIT");
let cfg = CrawlerConfig::from_env().expect("from_env");
assert_eq!(cfg.manga_limit, 0);
}
#[test]
fn private_mode_env_parses_true() {
let _g = ENV_GUARD.lock().unwrap_or_else(|p| p.into_inner());
std::env::set_var("PRIVATE_MODE", "true");
std::env::set_var("DATABASE_URL", "postgres://test");
let cfg = Config::from_env().expect("from_env");
std::env::remove_var("PRIVATE_MODE");
std::env::remove_var("DATABASE_URL");
assert!(cfg.auth.private_mode);
}
#[test]
fn private_mode_env_parses_false() {
let _g = ENV_GUARD.lock().unwrap_or_else(|p| p.into_inner());
std::env::set_var("PRIVATE_MODE", "false");
std::env::set_var("DATABASE_URL", "postgres://test");
let cfg = Config::from_env().expect("from_env");
std::env::remove_var("PRIVATE_MODE");
std::env::remove_var("DATABASE_URL");
assert!(!cfg.auth.private_mode);
}
#[test]
fn private_mode_defaults_to_false() {
let _g = ENV_GUARD.lock().unwrap_or_else(|p| p.into_inner());
std::env::remove_var("PRIVATE_MODE");
std::env::set_var("DATABASE_URL", "postgres://test");
let cfg = Config::from_env().expect("from_env");
std::env::remove_var("DATABASE_URL");
assert!(!cfg.auth.private_mode);
}
}
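
Note on the hunk above: the both-or-nothing rule in `admin_bootstrap_from_env` can be exercised without touching the process environment (and so without `ENV_GUARD`) if the decision is split into a pure helper, the same way `system_chromium_path_from_value` is split out in the browser module. The following is a minimal sketch of that idea; `admin_bootstrap_from_values` is a hypothetical name and is not part of this change.

```rust
/// Hypothetical pure helper (not in the diff): applies the same
/// both-or-nothing rule as `admin_bootstrap_from_env`, but takes the raw
/// values as arguments instead of reading the environment.
fn admin_bootstrap_from_values(
    username: Option<&str>,
    password: Option<&str>,
) -> Option<(String, String)> {
    // Empty strings count as unset, mirroring the
    // `.filter(|s| !s.is_empty())` checks in the env-reading version.
    let username = username.filter(|s| !s.is_empty())?;
    let password = password.filter(|s| !s.is_empty())?;
    Some((username.to_owned(), password.to_owned()))
}
```

With this split, a half-set configuration such as `admin_bootstrap_from_values(Some("root"), None)` yields `None` rather than an error, which is the behavior the doc comment describes.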


@@ -1,10 +1,17 @@
//! Chromium launcher and lifecycle.
//!
//! Uses `chromiumoxide`'s `fetcher` feature so we don't depend on a
//! system Chrome install — first call downloads a known-good revision
//! into a cache dir and reuses it forever after. `BrowserMode` toggles
//! headed vs headless; the headed path needs a display (real `$DISPLAY`
//! or `xvfb-run`).
//! By default uses `chromiumoxide`'s `fetcher` feature — first call
//! downloads a known-good revision into a cache dir and reuses it
//! forever after. Set `CRAWLER_CHROMIUM_BINARY` to skip the fetcher
//! and use a system-installed Chromium instead; required on platforms
//! where the upstream snapshot bucket has no usable build (notably
//! `Linux_arm64` / Raspberry Pi). Debian's package is at
//! `/usr/bin/chromium` or `/usr/bin/chromium-headless-shell`; Ubuntu
//! ships it as `chromium-browser` at a different path — don't paste
//! the wrong one.
//!
//! `BrowserMode` toggles headed vs headless; the headed path needs a
//! display (real `$DISPLAY` or `xvfb-run`).
//!
//! Extra Chromium command-line flags can be supplied through
//! [`LaunchOptions::extra_args`] in code, or via the
@@ -15,6 +22,7 @@
//! caller-provided.
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::Context;
use chromiumoxide::browser::{Browser, BrowserConfig};
@@ -26,12 +34,12 @@ use tokio::task::JoinHandle;
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BrowserMode {
/// Real window. Needs `$DISPLAY` (or `xvfb-run` wrapping the
/// binary). This is the default the old Puppeteer crawler used and
/// the assumed mode for the target site until we prove headless
/// works against it.
/// binary). Opt-in via `CRAWLER_BROWSER_MODE=headed` — useful for
/// debugging a flow visually or for sites that fingerprint
/// headless Chrome. Not used in production.
Headed,
/// No window. Faster, lower resource use, but more likely to trip
/// fingerprinting on hostile sites.
/// No window. Faster, lower resource use, runs without a display.
/// This is the default for both `from_env()` and `Default`.
Headless,
}
@@ -64,13 +72,13 @@ impl LaunchOptions {
}
/// Reads `CRAWLER_BROWSER_MODE` (`headless`|`headed`, default
/// `headed`) and `CRAWLER_BROWSER_ARGS` (whitespace-separated
/// `headless`) and `CRAWLER_BROWSER_ARGS` (whitespace-separated
/// Chromium flags). Flags containing whitespace aren't supported
/// through the env var — use the programmatic API for those.
pub fn from_env() -> Self {
let mode = match std::env::var("CRAWLER_BROWSER_MODE").as_deref() {
Ok("headless") => BrowserMode::Headless,
_ => BrowserMode::Headed,
Ok("headed") => BrowserMode::Headed,
_ => BrowserMode::Headless,
};
let extra_args = std::env::var("CRAWLER_BROWSER_ARGS")
.map(|s| parse_args(&s))
@@ -81,7 +89,7 @@ impl LaunchOptions {
impl Default for LaunchOptions {
fn default() -> Self {
Self::headed()
Self::headless()
}
}
@@ -95,55 +103,110 @@ pub(crate) fn parse_args(s: &str) -> Vec<String> {
/// Owned browser plus the spawned task that drives its CDP event loop.
/// Dropping `Handle` without calling `close` leaks the Chromium process
/// — always call `close().await` in production paths.
///
/// The browser is stored behind an `Arc` so it can be shared across
/// worker tasks (via [`Handle::shared`]) without copying. `Browser::new_page`
/// only needs `&self`, so multiple workers can drive the same browser
/// concurrently as long as the manager keeps the `Arc` alive.
pub struct Handle {
browser: Browser,
browser: Arc<Browser>,
driver: JoinHandle<()>,
}
impl Handle {
/// Borrow the browser. Equivalent to `&*handle.shared()`.
pub fn browser(&self) -> &Browser {
&self.browser
}
pub fn browser_mut(&mut self) -> &mut Browser {
&mut self.browser
/// Clone the shared handle. Workers hold these to call `new_page`
/// concurrently. The browser only exits when the last `Arc<Browser>`
/// is dropped (kill-on-drop), or when `close()` is called on the
/// originating `Handle` while it is the sole holder.
pub fn shared(&self) -> Arc<Browser> {
Arc::clone(&self.browser)
}
/// Closes the browser and awaits the driver task. Safe to call
/// multiple times — subsequent calls are no-ops.
pub async fn close(mut self) -> anyhow::Result<()> {
let _ = self.browser.close().await;
let _ = self.browser.wait().await;
let _ = self.driver.await;
/// Closes the browser and awaits the driver task. If other Arcs to
/// the browser are still alive we can't issue a clean CDP `close`,
/// so we abort the driver task instead — otherwise `handler.next()`
/// keeps polling forever and `Handle::close` hangs (chromiumoxide's
/// handler stream doesn't end on its own when the underlying WS
/// dies). Chromium itself is reaped by kill-on-drop once the last
/// `Arc<Browser>` is dropped.
pub async fn close(self) -> anyhow::Result<()> {
close_or_abort(self.browser, self.driver, |mut owned| async move {
let _ = owned.close().await;
let _ = owned.wait().await;
})
.await;
Ok(())
}
}
/// Launches Chromium. Downloads it on first run via the `fetcher`
/// feature; subsequent runs hit the cache. The cache dir is
/// Shutdown core for [`Handle::close`], extracted so it can be unit-
/// tested without launching real Chromium. When `arc` is uniquely owned,
/// `on_owned` runs against the owned value and the driver is awaited
/// normally. When other Arc holders exist, the driver is aborted before
/// awaiting it so shutdown returns promptly.
async fn close_or_abort<T, F, Fut>(arc: Arc<T>, driver: JoinHandle<()>, on_owned: F)
where
T: Send + 'static,
F: FnOnce(T) -> Fut + Send,
Fut: std::future::Future<Output = ()> + Send,
{
match Arc::try_unwrap(arc) {
Ok(owned) => {
on_owned(owned).await;
let _ = driver.await;
}
Err(shared) => {
tracing::warn!(
strong_count = Arc::strong_count(&shared),
"Handle::close while Arc still shared — aborting driver, relying on kill-on-drop"
);
drop(shared);
driver.abort();
let _ = driver.await;
}
}
}
/// Launches Chromium. If `CRAWLER_CHROMIUM_BINARY` is set, uses that
/// path directly. Otherwise downloads via the `fetcher` feature on
/// first run and hits the cache after that. The fetcher cache dir is
/// `$CRAWLER_CHROMIUM_DIR` if set, else `$HOME/.cache/mangalord/chromium`,
/// else `./.chromium-cache` as a last-resort repo-local fallback.
pub async fn launch(options: LaunchOptions) -> anyhow::Result<Handle> {
let cache = cache_dir()?;
tokio::fs::create_dir_all(&cache)
.await
.with_context(|| format!("create cache dir {}", cache.display()))?;
let executable = match system_chromium_path_from_env() {
Some(path) => {
tracing::info!(path = %path.display(), "using system chromium (CRAWLER_CHROMIUM_BINARY)");
path
}
None => {
let cache = cache_dir()?;
tokio::fs::create_dir_all(&cache)
.await
.with_context(|| format!("create cache dir {}", cache.display()))?;
let fetcher = BrowserFetcher::new(
BrowserFetcherOptions::builder()
.with_path(&cache)
.build()
.map_err(|e| anyhow::anyhow!("fetcher options: {e}"))?,
);
tracing::info!(path = %cache.display(), "ensuring chromium revision is present");
let info = fetcher
.fetch()
.await
.context("download chromium via fetcher")?;
tracing::info!(executable = %info.executable_path.display(), "chromium ready");
let fetcher = BrowserFetcher::new(
BrowserFetcherOptions::builder()
.with_path(&cache)
.build()
.map_err(|e| anyhow::anyhow!("fetcher options: {e}"))?,
);
tracing::info!(path = %cache.display(), "ensuring chromium revision is present");
let info = fetcher
.fetch()
.await
.context("download chromium via fetcher")?;
tracing::info!(executable = %info.executable_path.display(), "chromium ready");
info.executable_path
}
};
let mut builder = BrowserConfig::builder()
.chrome_executable(info.executable_path)
.chrome_executable(executable)
// Linux containers / CI commonly lack the user namespaces
// Chromium's sandbox wants. Disable it; the crawler runs in its
// own container anyway.
@@ -184,7 +247,10 @@ pub async fn launch(options: LaunchOptions) -> anyhow::Result<Handle> {
}
});
Ok(Handle { browser, driver })
Ok(Handle {
browser: Arc::new(browser),
driver,
})
}
fn cache_dir() -> anyhow::Result<PathBuf> {
@@ -197,6 +263,24 @@ fn cache_dir() -> anyhow::Result<PathBuf> {
Ok(PathBuf::from("./.chromium-cache"))
}
/// Reads `CRAWLER_CHROMIUM_BINARY` and delegates to the pure helper.
/// Thin wrapper kept separate so the decision logic can be unit-tested
/// without mutating the process environment.
fn system_chromium_path_from_env() -> Option<PathBuf> {
system_chromium_path_from_value(std::env::var_os("CRAWLER_CHROMIUM_BINARY").as_deref())
}
/// Returns `Some(path)` only when the value is set and non-empty. An
/// exported-but-blank var (common in compose `${VAR:-}` patterns when
/// the operator didn't fill it in) must behave like "unset" — otherwise
/// we'd hand chromiumoxide an empty path and fail launch in a confusing
/// way.
pub(crate) fn system_chromium_path_from_value(
raw: Option<&std::ffi::OsStr>,
) -> Option<PathBuf> {
raw.filter(|v| !v.is_empty()).map(PathBuf::from)
}
#[cfg(test)]
mod tests {
use super::*;
@@ -223,4 +307,91 @@ mod tests {
assert!(parse_args("").is_empty());
assert!(parse_args(" \t\n").is_empty());
}
#[test]
fn system_chromium_path_returns_some_when_value_set() {
let raw = std::ffi::OsString::from("/usr/bin/chromium-headless-shell");
assert_eq!(
system_chromium_path_from_value(Some(raw.as_os_str())),
Some(PathBuf::from("/usr/bin/chromium-headless-shell"))
);
}
#[test]
fn system_chromium_path_returns_none_when_unset() {
assert_eq!(system_chromium_path_from_value(None), None);
}
#[test]
fn system_chromium_path_treats_empty_as_unset() {
// Compose's `${VAR:-}` substitution produces an exported-but-empty
// env var when the operator left it blank. Treat it as unset so
// the launcher falls back to the fetcher path instead of handing
// chromiumoxide an empty path.
let raw = std::ffi::OsString::from("");
assert_eq!(
system_chromium_path_from_value(Some(raw.as_os_str())),
None
);
}
#[test]
fn default_launch_options_are_headless() {
// Headless is the production-safe default — no display required,
// smaller resource footprint. `Headed` stays available as an
// opt-in for debugging via CRAWLER_BROWSER_MODE=headed.
assert_eq!(LaunchOptions::default().mode, BrowserMode::Headless);
assert_eq!(LaunchOptions::headless().mode, BrowserMode::Headless);
assert_eq!(LaunchOptions::headed().mode, BrowserMode::Headed);
}
// Regression: if another Arc<Browser> outlives `Handle::close`, the
// old code awaited the driver task forever because the chromiumoxide
// handler stream doesn't return None on its own. Aborting the driver
// unblocks shutdown even when kill-on-drop can't fire yet.
#[tokio::test]
async fn close_or_abort_returns_when_arc_is_shared() {
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
let arc = Arc::new(());
let _keepalive = Arc::clone(&arc); // forces try_unwrap to fail
let driver = tokio::spawn(std::future::pending::<()>());
let on_owned_ran = Arc::new(AtomicBool::new(false));
let flag = Arc::clone(&on_owned_ran);
let fut = close_or_abort(arc, driver, move |_| {
let flag = Arc::clone(&flag);
async move { flag.store(true, Ordering::Release) }
});
tokio::time::timeout(Duration::from_secs(2), fut)
.await
.expect("close_or_abort must not hang when driver is pending and Arc is shared");
assert!(
!on_owned_ran.load(Ordering::Acquire),
"on_owned must not run when the Arc is still shared"
);
}
#[tokio::test]
async fn close_or_abort_runs_on_owned_when_arc_is_unique() {
use std::sync::atomic::{AtomicBool, Ordering};
let arc = Arc::new(());
let driver = tokio::spawn(async {}); // completes immediately
let on_owned_ran = Arc::new(AtomicBool::new(false));
let flag = Arc::clone(&on_owned_ran);
close_or_abort(arc, driver, move |_| {
let flag = Arc::clone(&flag);
async move { flag.store(true, Ordering::Release) }
})
.await;
assert!(
on_owned_ran.load(Ordering::Acquire),
"on_owned must run when the Arc is unique"
);
}
}
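
Note on the hunk above: the branch in `close_or_abort` hinges on `Arc::try_unwrap`, which succeeds only when the caller holds the last strong reference. The sketch below isolates that decision in synchronous, std-only code so the two paths are easy to see. `close_if_unique` and `Shutdown` are illustrative names; the real function also aborts and awaits the driver task, which is omitted here.

```rust
use std::sync::Arc;

/// Which shutdown path was taken (illustrative type, not in the diff).
#[derive(Debug, PartialEq, Eq)]
enum Shutdown {
    /// The Arc was uniquely owned, so the clean-close callback ran.
    Clean,
    /// Other holders exist; carries how many remained when checked.
    Shared(usize),
}

/// Simplified, synchronous analogue of `close_or_abort`: run `on_owned`
/// only when this is the last strong reference, otherwise report how many
/// other holders are still alive and let them drop the value later.
fn close_if_unique<T>(arc: Arc<T>, on_owned: impl FnOnce(T)) -> Shutdown {
    match Arc::try_unwrap(arc) {
        Ok(owned) => {
            on_owned(owned);
            Shutdown::Clean
        }
        Err(shared) => {
            // `strong_count` includes `shared` itself, so subtract one to
            // get the number of *other* holders.
            let others = Arc::strong_count(&shared) - 1;
            drop(shared);
            Shutdown::Shared(others)
        }
    }
}
```

In the shared case the value is not destroyed by this call; it lives until the remaining holders drop their clones, which corresponds to the kill-on-drop fallback the warning log mentions.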


@@ -0,0 +1,301 @@
//! Lazy-launch / idle-teardown Chromium manager for the daemon.
//!
//! The first worker that calls [`BrowserManager::acquire`] triggers a real
//! Chromium launch (and the `on_launch` hook — used to re-inject the
//! PHPSESSID cookie on every fresh process). Each acquire bumps an active
//! counter; the returned [`BrowserLease`] decrements it on drop.
//!
//! When the active counter hits zero, a background reaper task waits
//! `idle_timeout`. If still zero on wake, it closes Chromium and clears the
//! cached handle. The next acquire re-launches.
//!
//! `idle_timeout = Duration::ZERO` disables the reaper — Chromium stays alive
//! until [`BrowserManager::shutdown`].
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use chromiumoxide::browser::Browser;
use futures_util::future::BoxFuture;
use tokio::sync::{Mutex, Notify};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use crate::crawler::browser::{self, LaunchOptions};
/// Hook invoked on every fresh launch with the new browser. Typically used
/// to re-inject PHPSESSID + run the session probe. Errors abort the
/// `acquire` that triggered the launch — the next acquire will re-launch.
pub type OnLaunch =
Arc<dyn Fn(Arc<Browser>) -> BoxFuture<'static, anyhow::Result<()>> + Send + Sync>;
/// Returns an `OnLaunch` that does nothing — useful when no session is
/// configured (e.g. CLI metadata-only runs).
pub fn noop_on_launch() -> OnLaunch {
Arc::new(|_| Box::pin(async { Ok(()) }))
}
/// Decoupled active-lease tracker. Owns the atomic counter and the idle
/// notifier so the wiring is unit-testable without standing up a real
/// `BrowserManager` (which would require launching Chromium).
#[derive(Default)]
pub(crate) struct ActiveTracker {
counter: AtomicUsize,
idle_signal: Notify,
}
impl ActiveTracker {
pub(crate) fn new() -> Arc<Self> {
Arc::new(Self::default())
}
pub(crate) fn acquire(self: &Arc<Self>) {
self.counter.fetch_add(1, Ordering::AcqRel);
}
pub(crate) fn release(self: &Arc<Self>) {
if self.counter.fetch_sub(1, Ordering::AcqRel) == 1 {
self.idle_signal.notify_one();
}
}
pub(crate) fn current(&self) -> usize {
self.counter.load(Ordering::Acquire)
}
pub(crate) fn idle_signal(&self) -> &Notify {
&self.idle_signal
}
}
pub struct BrowserManager {
inner: Mutex<Inner>,
active: Arc<ActiveTracker>,
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
}
struct Inner {
handle: Option<browser::Handle>,
shared: Option<Arc<Browser>>,
}
impl BrowserManager {
pub fn new(
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
) -> Arc<Self> {
Arc::new(Self {
inner: Mutex::new(Inner {
handle: None,
shared: None,
}),
active: ActiveTracker::new(),
launch_opts,
idle_timeout,
on_launch,
})
}
/// Acquire a shared browser lease. The first acquire after a teardown
/// launches a fresh Chromium (and runs `on_launch`); subsequent acquires
/// while a process is alive just bump the counter and clone the `Arc`.
pub async fn acquire(&self) -> anyhow::Result<BrowserLease> {
let mut guard = self.inner.lock().await;
if guard.handle.is_none() {
let handle = browser::launch(self.launch_opts.clone())
.await
.context("BrowserManager: launch chromium")?;
let shared = handle.shared();
// Run the on-launch hook before publishing the handle so a session
// probe failure doesn't leave a half-initialized browser behind.
if let Err(e) = (self.on_launch)(Arc::clone(&shared)).await {
// Close the just-launched browser since we won't be using it.
let _ = handle.close().await;
return Err(e.context("BrowserManager: on_launch hook failed"));
}
guard.handle = Some(handle);
guard.shared = Some(shared);
}
let browser = guard
.shared
.as_ref()
.expect("shared set above")
.clone();
self.active.acquire();
Ok(BrowserLease {
browser,
active: Arc::clone(&self.active),
})
}
/// Forcefully close the cached browser regardless of active count.
/// Used on daemon shutdown. After this returns the next acquire will
/// re-launch from scratch.
pub async fn shutdown(&self) {
let mut guard = self.inner.lock().await;
guard.shared = None;
if let Some(handle) = guard.handle.take() {
let _ = handle.close().await;
}
}
/// Mark the cached browser handle as unhealthy. The next `acquire`
/// will re-launch Chromium from scratch.
///
/// Same semantics as `shutdown` — the difference is intent:
/// `shutdown` runs once at daemon teardown, while `invalidate` is a
/// recovery hook callers fire after a CDP / connection / navigation
/// failure that suggests the underlying process has died. Calling
/// this while other workers still hold leases is safe — their
/// outstanding CDP operations will return channel-closed errors
/// and those workers will then re-acquire (re-launching Chromium).
///
/// Idempotent: calling on an already-invalidated manager is a
/// no-op.
pub async fn invalidate(&self) {
let mut guard = self.inner.lock().await;
guard.shared = None;
if let Some(handle) = guard.handle.take() {
let _ = handle.close().await;
tracing::warn!("BrowserManager: handle invalidated — next acquire will relaunch");
}
}
fn idle_timeout(&self) -> Duration {
self.idle_timeout
}
fn active(&self) -> Arc<ActiveTracker> {
Arc::clone(&self.active)
}
}
/// Background reaper. Always spawns a task. When `idle_timeout == 0` the
/// task does no reaping and only waits for `cancel`. Otherwise it:
/// 1. Waits on `idle_signal` (woken when active hits zero).
/// 2. Sleeps `idle_timeout`.
/// 3. Re-checks the counter under the mutex — if still zero, takes the
/// handle and closes it.
///
/// Repeats forever until `cancel` fires.
pub fn spawn_idle_reaper(mgr: Arc<BrowserManager>, cancel: CancellationToken) -> JoinHandle<()> {
tokio::spawn(async move {
if mgr.idle_timeout().is_zero() {
// Block until cancellation, then exit.
cancel.cancelled().await;
return;
}
let active = mgr.active();
loop {
tokio::select! {
_ = cancel.cancelled() => return,
_ = active.idle_signal().notified() => {}
}
if active.current() > 0 {
continue;
}
tokio::select! {
_ = cancel.cancelled() => return,
_ = tokio::time::sleep(mgr.idle_timeout()) => {}
}
let mut guard = mgr.inner.lock().await;
if active.current() > 0 {
// A worker grabbed a lease during the sleep — abort teardown.
continue;
}
let handle = guard.handle.take();
guard.shared = None;
drop(guard);
if let Some(h) = handle {
let _ = h.close().await;
tracing::info!("BrowserManager: idle teardown — Chromium closed");
}
}
})
}
/// A worker-side handle that keeps the browser alive while in scope.
/// `Deref<Target = Browser>` so callers can pass `&*lease` to APIs that
/// expect `&Browser`.
pub struct BrowserLease {
browser: Arc<Browser>,
active: Arc<ActiveTracker>,
}
impl Deref for BrowserLease {
type Target = Browser;
fn deref(&self) -> &Browser {
&self.browser
}
}
impl Drop for BrowserLease {
fn drop(&mut self) {
self.active.release();
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::atomic::AtomicBool;
#[test]
fn noop_on_launch_is_send_sync() {
fn assert_send_sync<T: Send + Sync>(_: &T) {}
let h = noop_on_launch();
assert_send_sync(&h);
}
/// `invalidate` is safe to exercise in a unit test without launching
/// Chromium: it's a no-op when no handle has been cached, and that
/// path is exactly the one we want to verify is idempotent.
#[tokio::test]
async fn invalidate_is_a_noop_when_no_handle_cached() {
let mgr = BrowserManager::new(
crate::crawler::browser::LaunchOptions::default(),
Duration::ZERO,
noop_on_launch(),
);
// Two back-to-back invalidates must both complete; the second
// would hang or panic if the first had left torn state.
mgr.invalidate().await;
mgr.invalidate().await;
}
#[tokio::test]
async fn active_tracker_signals_idle_only_on_zero_transition() {
let tracker = ActiveTracker::new();
let signaled = Arc::new(AtomicBool::new(false));
{
let s = Arc::clone(&signaled);
let t = Arc::clone(&tracker);
tokio::spawn(async move {
t.idle_signal().notified().await;
s.store(true, Ordering::Release);
});
}
tracker.acquire();
tracker.acquire();
assert_eq!(tracker.current(), 2);
tracker.release();
assert_eq!(tracker.current(), 1);
tokio::time::sleep(Duration::from_millis(20)).await;
assert!(!signaled.load(Ordering::Acquire), "no idle signal at count 1");
tracker.release();
tokio::time::sleep(Duration::from_millis(20)).await;
assert_eq!(tracker.current(), 0);
assert!(
signaled.load(Ordering::Acquire),
"idle signal fires on 1 -> 0 transition"
);
}
}
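
Note on the file above: the lease accounting (`ActiveTracker` plus `BrowserLease`'s `Drop` impl) reduces to an atomic counter whose 1 -> 0 transition is the only event that wakes the reaper. A std-only sketch of that pattern follows. `Tracker`, `Lease`, and the `idle_transitions` counter are stand-ins chosen for illustration; the real code signals a `tokio::sync::Notify` instead of counting transitions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Illustrative stand-in for `ActiveTracker`: counts live leases and, in
/// place of a `Notify`, counts how often the lease count dropped to zero.
#[derive(Default)]
struct Tracker {
    active: AtomicUsize,
    idle_transitions: AtomicUsize,
}

/// Illustrative stand-in for `BrowserLease`: releases on drop.
struct Lease {
    tracker: Arc<Tracker>,
}

fn acquire(tracker: &Arc<Tracker>) -> Lease {
    tracker.active.fetch_add(1, Ordering::AcqRel);
    Lease { tracker: Arc::clone(tracker) }
}

impl Drop for Lease {
    fn drop(&mut self) {
        // `fetch_sub` returns the previous value, so `== 1` means this
        // drop took the count from one to zero.
        if self.tracker.active.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.tracker.idle_transitions.fetch_add(1, Ordering::AcqRel);
        }
    }
}
```

Because the check uses the value returned by `fetch_sub`, only the release that actually reaches zero signals idleness, even when several leases are dropped concurrently.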


@@ -0,0 +1,621 @@
//! Chapter content sync — fetch a logged-in chapter page, extract its
//! image URLs in `pageN` order, download each to storage, and atomically
//! persist a `pages` row per image plus the chapter's `page_count`.
//!
//! Only chapters belonging to a manga someone has bookmarked are
//! candidates. The crawler scans bookmarks at the start of each run and
//! enqueues unfetched chapters; the API also enqueues at bookmark-time
//! so users get instant feedback. Both feed into the same queue and
//! dedup by chapter id.
// Implementation lands in the next commits in this branch. Module is
// declared so other crates can `use crawler::content` without breaking
// builds while iteration is in progress.
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::detect::PageError;
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::{fetch_bytes_capped, looks_like_image, DownloadAllowlist};
use crate::crawler::session::{self, ChapterProbe};
use crate::storage::Storage;
/// Parse the chapter page DOM and return the page images in `pageN`
/// order. Filters out the loader `<img class="loading">` and any
/// `<img>` without a numeric `id="pageN"`.
///
/// Reader pages don't render the site's `#logo` element, so the
/// universal logo-sentinel can't apply here — instead we assert
/// `a#pic_container` is present. Its absence means the response is the
/// transient broken-page response (or a redirect to some other layout)
/// and the caller should retry.
pub fn parse_chapter_pages(html: &str) -> Result<Vec<ChapterImage>, PageError> {
let doc = scraper::Html::parse_document(html);
let container_sel = scraper::Selector::parse("a#pic_container").unwrap();
if doc.select(&container_sel).next().is_none() {
return Err(PageError::transient("reader: a#pic_container missing"));
}
let sel = scraper::Selector::parse("a#pic_container img:not(.loading)").unwrap();
let mut pages: Vec<ChapterImage> = doc
.select(&sel)
.filter_map(|img| {
let id = img.value().id()?;
let n: i32 = id.strip_prefix("page")?.parse().ok()?;
let src = img.value().attr("src")?.trim().to_string();
if src.is_empty() {
return None;
}
Some(ChapterImage { page_number: n, url: src })
})
.collect();
pages.sort_by_key(|p| p.page_number);
Ok(pages)
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterImage {
pub page_number: i32,
pub url: String,
}
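
Note on `parse_chapter_pages` above: once the DOM has been queried, the remaining work is to keep only `id="pageN"` images with a non-empty `src` and sort them by `N`. The sketch below shows just that filtering and ordering step on plain `(id, src)` pairs, with no HTML parsing. `page_number` and `order_pages` are hypothetical helper names used only for illustration.

```rust
/// Hypothetical helper: extracts `N` from an element id of the form
/// `pageN`. Returns `None` for ids without the prefix or without a
/// numeric suffix.
fn page_number(id: &str) -> Option<i32> {
    id.strip_prefix("page")?.parse().ok()
}

/// Hypothetical helper: takes `(id, src)` pairs as they might come out of
/// the DOM query and returns `(page_number, src)` sorted by page number,
/// dropping entries with a non-`pageN` id or a blank `src`.
fn order_pages<'a>(imgs: &[(&'a str, &'a str)]) -> Vec<(i32, &'a str)> {
    let mut pages: Vec<(i32, &'a str)> = imgs
        .iter()
        .filter_map(|&(id, src)| {
            let n = page_number(id)?;
            let src = src.trim();
            if src.is_empty() {
                return None;
            }
            Some((n, src))
        })
        .collect();
    pages.sort_by_key(|p| p.0);
    pages
}
```

Sorting after collection means the result does not depend on the order in which the images appear in the markup.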
/// Outcome of a single chapter sync — surfaced to callers for logging
/// and exit-code decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
/// All images downloaded and stored, chapter row updated.
Fetched { pages: usize },
/// `page_count > 0` already — no-op unless force_refetch is set.
Skipped,
/// Session probe failed mid-sync (avatar selector missing on the
/// chapter page). Caller should abort the whole crawler run.
SessionExpired,
}
/// Per-chapter max fetch attempts when TOR is configured. `N = 3` means
/// up to 3 total page fetches with 2 NEWNYM signals between them. When
/// TOR is not configured the effective budget collapses to 1 (single
/// attempt, no retry, no recircuit — bit-for-bit pre-TOR behavior).
const CHAPTER_RECIRCUIT_MAX_ATTEMPTS: u32 = 3;
/// Outcome of [`fetch_chapter_html_with_recircuit`]. `Ok` carries the
/// final reader HTML; the other two map to `sync_chapter_content`'s
/// existing failure modes.
#[derive(Debug)]
enum ChapterFetchOutcome {
Ok(String),
/// `ChapterProbe::Unauthenticated` after exhausting recircuit
/// budget (or with budget=0). Caller returns
/// `SyncOutcome::SessionExpired`.
SessionExpired,
/// `ChapterProbe::Transient` after exhausting recircuit budget
/// (or with budget=0). Caller bails so the dispatcher does
/// exponential backoff.
PersistentTransient,
}
/// Single rate-limited Chromium navigation to the chapter URL,
/// returning the page HTML. Extracted from `sync_chapter_content` so
/// the recircuit loop can call it once per attempt.
async fn fetch_chapter_html_once(
browser: &chromiumoxide::Browser,
rate: &HostRateLimiters,
source_url: &str,
) -> anyhow::Result<String> {
rate.wait_for(source_url).await?;
let page = browser
.new_page(source_url)
.await
.with_context(|| format!("open chapter page {source_url}"))?;
crate::crawler::nav::wait_for_nav(&page)
.await
.context("wait for chapter nav")?;
// Best-effort wait for the reader marker — same partial-render
// race that bit the chapter-list parser can hit here. Timeout is
// not an error; the chapter probe + parser sentinels still catch
// real failures.
let _ = crate::crawler::nav::wait_for_selector(
&page,
"a#pic_container",
crate::crawler::nav::SELECTOR_TIMEOUT,
)
.await;
let html = page.content().await.context("read chapter html")?;
page.close().await.ok();
Ok(html)
}
/// Pure-over-IO loop: fetch + classify, up to `max_attempts` total
/// fetches. Between attempts, `recircuit` is invoked (a no-op when
/// TOR isn't configured). `max_attempts = 1` collapses to the
/// original single-shot behavior — `Unauthenticated` →
/// `SessionExpired`, `Transient` → `PersistentTransient` on the first
/// hit, no recircuit.
///
/// Semantics match [`crate::crawler::detect::retry_on_transient`] and
/// [`run_session_probe_loop`]: `N` is **total attempts including the
/// first**, so `N = 3` means 3 fetches and up to 2 NEWNYM calls.
/// `Unauthenticated` and `Transient` share the budget — the loop
/// doesn't distinguish, so a sequence like Transient → Unauth → Ok
/// counts as 3 attempts.
async fn fetch_chapter_html_with_recircuit<F, Fut, R, RFut>(
mut fetch: F,
mut recircuit: R,
max_attempts: u32,
source_url_for_msg: &str,
) -> anyhow::Result<ChapterFetchOutcome>
where
F: FnMut() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<String>>,
R: FnMut() -> RFut,
RFut: std::future::Future<Output = ()>,
{
debug_assert!(max_attempts >= 1, "max_attempts must be at least 1");
let mut attempt = 0u32;
loop {
attempt += 1;
let html = fetch().await?;
match session::classify_chapter_probe(&html) {
ChapterProbe::Ok => return Ok(ChapterFetchOutcome::Ok(html)),
ChapterProbe::Unauthenticated => {
if attempt >= max_attempts {
return Ok(ChapterFetchOutcome::SessionExpired);
}
tracing::warn!(
attempt,
max = max_attempts,
url = source_url_for_msg,
"chapter probe Unauthenticated; signaling TOR NEWNYM and retrying"
);
recircuit().await;
}
ChapterProbe::Transient => {
if attempt >= max_attempts {
return Ok(ChapterFetchOutcome::PersistentTransient);
}
tracing::warn!(
attempt,
max = max_attempts,
url = source_url_for_msg,
"chapter probe Transient; signaling TOR NEWNYM and retrying"
);
recircuit().await;
}
}
}
}
/// Fetch all images for one chapter and persist them atomically. On
/// any error after the first storage put, the DB transaction rolls
/// back so the chapter stays at `page_count = 0` and is retried on the
/// next run. Bytes already written to storage become orphans; a future
/// reaper sweeps them.
#[allow(clippy::too_many_arguments)]
pub async fn sync_chapter_content(
browser: &chromiumoxide::Browser,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
chapter_id: Uuid,
manga_id: Uuid,
source_url: &str,
force_refetch: bool,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
tor: Option<&crate::crawler::tor::TorController>,
) -> anyhow::Result<SyncOutcome> {
// Skip if already fetched, unless caller explicitly forces.
if !force_refetch {
let (page_count,): (i32,) =
sqlx::query_as("SELECT page_count FROM chapters WHERE id = $1")
.bind(chapter_id)
.fetch_one(db)
.await
.context("read chapter page_count")?;
if page_count > 0 {
return Ok(SyncOutcome::Skipped);
}
}
// Fetch + classify. With TOR configured, allow up to
// CHAPTER_RECIRCUIT_MAX_ATTEMPTS total page fetches with NEWNYM
// between each. Without TOR, collapse to 1 attempt (no retry, no
// recircuit) — matches the pre-TOR single-shot behavior bit-for-bit.
let max_attempts = if tor.is_some() { CHAPTER_RECIRCUIT_MAX_ATTEMPTS } else { 1 };
let html = match fetch_chapter_html_with_recircuit(
|| fetch_chapter_html_once(browser, rate, source_url),
|| async {
if let Some(t) = tor {
if let Err(e) = t.new_identity().await {
tracing::warn!(error = %e, "TOR NEWNYM failed; continuing with same circuit");
}
}
},
max_attempts,
source_url,
)
.await?
{
ChapterFetchOutcome::Ok(html) => html,
ChapterFetchOutcome::SessionExpired => return Ok(SyncOutcome::SessionExpired),
ChapterFetchOutcome::PersistentTransient => {
// Surface as a typed Err so the dispatcher path runs
// ack_failed with exponential backoff (rather than the
// session-expired sticky flag).
anyhow::bail!(
"chapter page at {source_url} returned a transient response after \
{max_attempts} attempt(s); will retry"
);
}
};
let images = parse_chapter_pages(&html)
.with_context(|| format!("parse chapter pages at {source_url}"))?;
if images.is_empty() {
anyhow::bail!("no page images parsed from {source_url}");
}
// Resolve image URLs against the chapter URL (they may be relative).
let base = reqwest::Url::parse(source_url).context("parse chapter URL")?;
// Fetch every image bytes-first into memory before writing
// anything. Lets us bail the whole chapter cleanly if any image
// fails — DB stays at page_count=0, no partial rows persisted.
let mut fetched: Vec<(i32, Vec<u8>, &'static str)> = Vec::with_capacity(images.len());
for img in &images {
let url = base.join(&img.url).with_context(|| {
format!("join image URL {} onto {source_url}", img.url)
})?;
rate.wait_for(url.as_str()).await?;
let bytes = fetch_bytes_capped(
http,
url.as_str(),
Some(source_url),
allowlist,
max_image_bytes,
)
.await?
.to_vec();
// Reject any non-image response: an image URL must yield image
// bytes. `infer` also returns None on truncated bytes, and that
// case should fail loudly rather than be stored silently with a
// `.bin` extension.
if !looks_like_image(&bytes) {
anyhow::bail!(
"image URL {url} returned non-image bytes \
(first 16: {:?}); refusing to store as binary blob",
&bytes[..bytes.len().min(16)]
);
}
let ext = infer::get(&bytes)
.map(|k| k.extension())
.expect("looks_like_image asserted infer succeeded");
fetched.push((img.page_number, bytes, ext));
}
// Write phase: storage puts, page row inserts, and the page_count
// update, with the DB writes in one transaction. If anything
// fails, the transaction rolls back and the chapter is retried
// next run. Bytes already put to storage are orphaned; a reaper
// sweeps them later.
let mut tx = db.begin().await.context("open chapter sync tx")?;
for (page_number, bytes, ext) in &fetched {
let key = format!(
"mangas/{manga_id}/chapters/{chapter_id}/pages/{:04}.{ext}",
page_number
);
storage
.put(&key, bytes)
.await
.with_context(|| format!("put {key}"))?;
// (chapter_id, page_number) is unique — re-runs idempotent.
sqlx::query(
"INSERT INTO pages (chapter_id, page_number, storage_key, content_type)
VALUES ($1, $2, $3, $4)
ON CONFLICT (chapter_id, page_number) DO UPDATE
SET storage_key = EXCLUDED.storage_key,
content_type = EXCLUDED.content_type",
)
.bind(chapter_id)
.bind(page_number)
.bind(&key)
.bind(format!("image/{ext}"))
.execute(&mut *tx)
.await
.with_context(|| format!("insert page row {page_number}"))?;
}
sqlx::query("UPDATE chapters SET page_count = $1 WHERE id = $2")
.bind(fetched.len() as i32)
.bind(chapter_id)
.execute(&mut *tx)
.await
.context("update page_count")?;
tx.commit().await.context("commit chapter sync")?;
Ok(SyncOutcome::Fetched { pages: fetched.len() })
}
// Suppress unused-import warning for `session::registrable_domain`
// until the bin/crawler wiring lands in this branch and uses it
// through this module.
#[allow(dead_code)]
fn _keep_session_in_scope() {
let _ = session::registrable_domain;
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_chapter_pages_skips_loader_and_sorts_by_id() {
// Loader image, two real pages out of order, and one with no id.
let html = r#"
<html><body id="body"><a id="pic_container">
<img class="loading" src="/images/ajax-loader2.gif">
<img id="page2" class="page2" src="https://cdn/2.jpg">
<img id="page1" class="page1" src="https://cdn/1.jpg">
<img src="https://cdn/orphan.jpg">
<img id="not-a-page" src="https://cdn/not-a-page.jpg">
</a></body></html>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(pages.len(), 2);
assert_eq!(pages[0].page_number, 1);
assert_eq!(pages[0].url, "https://cdn/1.jpg");
assert_eq!(pages[1].page_number, 2);
assert_eq!(pages[1].url, "https://cdn/2.jpg");
}
#[test]
fn parse_chapter_pages_drops_images_without_src() {
let html = r#"
<a id="pic_container">
<img id="page1" src="">
<img id="page2" src="https://cdn/2.jpg">
</a>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(pages.len(), 1);
assert_eq!(pages[0].page_number, 2);
}
#[test]
fn parse_chapter_pages_handles_three_digit_page_ids() {
let html = r#"
<a id="pic_container">
<img id="page126" src="https://cdn/126.jpg">
<img id="page9" src="https://cdn/9.jpg">
<img id="page50" src="https://cdn/50.jpg">
</a>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(
pages.iter().map(|p| p.page_number).collect::<Vec<_>>(),
vec![9, 50, 126]
);
}
#[test]
fn parse_chapter_pages_returns_transient_when_container_missing() {
// Reader doesn't render #logo, so the universal logo sentinel
// can't be used here — a#pic_container is the reader-specific
// marker. Broken-page response trips this.
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
let err = parse_chapter_pages(html).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
// --- fetch_chapter_html_with_recircuit -------------------------------
const OK_HTML: &str = r#"<html><body><a id="pic_container"><img id="page1" src="x"/></a></body></html>"#;
const UNAUTH_HTML: &str = r#"<html><body><header><div id="logo">x</div></header><main>please log in</main></body></html>"#;
const TRANSIENT_HTML: &str = "<html><body><p>we're sorry, the request file are not found.</p></body></html>";
#[tokio::test]
async fn recircuit_loop_ok_first_attempt() {
let mut recircuits = 0u32;
let mut fetches = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetches += 1;
async { Ok(OK_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok");
assert!(matches!(outcome, ChapterFetchOutcome::Ok(_)));
assert_eq!(fetches, 1);
assert_eq!(recircuits, 0);
}
#[tokio::test]
async fn recircuit_loop_unauth_with_single_attempt_returns_session_expired() {
// max_attempts=1 = TOR disabled, fail-fast on first Unauthenticated.
let mut recircuits = 0u32;
let mut fetches = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetches += 1;
async { Ok(UNAUTH_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
1,
"https://example/c",
)
.await
.expect("ok-result");
assert!(matches!(outcome, ChapterFetchOutcome::SessionExpired));
assert_eq!(fetches, 1);
assert_eq!(recircuits, 0, "no recircuit when budget is 1 (TOR disabled)");
}
#[tokio::test]
async fn recircuit_loop_unauth_then_ok_within_budget() {
// max_attempts=3 = up to 3 fetches with 2 recircuits between.
let mut recircuits = 0u32;
let mut fetch_n = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
let n = fetch_n;
async move {
if n == 1 {
Ok(UNAUTH_HTML.to_string())
} else {
Ok(OK_HTML.to_string())
}
}
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok");
assert!(matches!(outcome, ChapterFetchOutcome::Ok(_)));
assert_eq!(fetch_n, 2);
assert_eq!(recircuits, 1);
}
#[tokio::test]
async fn recircuit_loop_unauth_exhausts_budget_returns_session_expired() {
let mut recircuits = 0u32;
let mut fetch_n = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
async { Ok(UNAUTH_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok-result");
assert!(matches!(outcome, ChapterFetchOutcome::SessionExpired));
assert_eq!(fetch_n, 3, "max_attempts=3 → 3 fetches total");
assert_eq!(recircuits, 2, "2 recircuits between 3 fetches");
}
#[tokio::test]
async fn recircuit_loop_transient_then_ok_within_budget() {
let mut recircuits = 0u32;
let mut fetch_n = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
let n = fetch_n;
async move {
if n < 3 {
Ok(TRANSIENT_HTML.to_string())
} else {
Ok(OK_HTML.to_string())
}
}
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok");
assert!(matches!(outcome, ChapterFetchOutcome::Ok(_)));
assert_eq!(fetch_n, 3);
assert_eq!(recircuits, 2);
}
#[tokio::test]
async fn recircuit_loop_transient_exhausts_budget_returns_persistent() {
let mut recircuits = 0u32;
let mut fetch_n = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
async { Ok(TRANSIENT_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok-result");
assert!(matches!(outcome, ChapterFetchOutcome::PersistentTransient));
assert_eq!(fetch_n, 3, "max_attempts=3 → 3 fetches total");
assert_eq!(recircuits, 2, "2 recircuits between 3 fetches");
}
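// Added sketch, not part of the original suite: the doc comment on
// `fetch_chapter_html_with_recircuit` says `max_attempts = 1` maps a
// first `Transient` straight to `PersistentTransient` with no
// recircuit. This assumes `TRANSIENT_HTML` classifies as
// `ChapterProbe::Transient`, as the tests above already rely on.
#[tokio::test]
async fn recircuit_loop_transient_with_single_attempt_returns_persistent() {
let mut recircuits = 0u32;
let mut fetches = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetches += 1;
async { Ok(TRANSIENT_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
1,
"https://example/c",
)
.await
.expect("ok-result");
assert!(matches!(outcome, ChapterFetchOutcome::PersistentTransient));
assert_eq!(fetches, 1);
assert_eq!(recircuits, 0, "no recircuit when budget is 1");
}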
#[tokio::test]
async fn recircuit_loop_mixed_transient_then_unauth_then_ok_shares_budget() {
// Audit-prompted regression: outcomes share the attempt counter.
// Sequence: Transient (attempt 1) → Unauth (attempt 2) → Ok (3).
let mut recircuits = 0u32;
let mut fetch_n = 0u32;
let outcome = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
let n = fetch_n;
async move {
match n {
1 => Ok(TRANSIENT_HTML.to_string()),
2 => Ok(UNAUTH_HTML.to_string()),
_ => Ok(OK_HTML.to_string()),
}
}
},
|| {
recircuits += 1;
async {}
},
3,
"https://example/c",
)
.await
.expect("ok");
assert!(matches!(outcome, ChapterFetchOutcome::Ok(_)));
assert_eq!(fetch_n, 3);
assert_eq!(recircuits, 2);
}
#[tokio::test]
async fn recircuit_loop_propagates_fetch_errors() {
let mut fetch_n = 0u32;
let err = fetch_chapter_html_with_recircuit(
|| {
fetch_n += 1;
async { Err(anyhow::anyhow!("nav timeout")) }
},
|| async {},
3,
"https://example/c",
)
.await
.expect_err("fetch error bubbles");
assert_eq!(fetch_n, 1);
assert!(format!("{err:#}").contains("nav timeout"));
}
}


@@ -0,0 +1,658 @@
//! In-process crawler daemon.
//!
//! Owns a cron task that fires a daily metadata pass and N worker tasks
//! that drain `SyncChapterContent` jobs from `crawler_jobs`. The dispatch
//! seams ([`MetadataPass`], [`ChapterDispatcher`]) are traits so tests can
//! inject stubs without standing up a real Chromium / `Source` impl.
//!
//! ## Cron
//!
//! Each tick:
//! 1. Acquire a Postgres advisory lock on a dedicated pool connection
//! (multi-replica safety). Skip the tick on contention.
//! 2. Call [`MetadataPass::run`] (typically `pipeline::run_metadata_pass`).
//! 3. Enqueue `SyncChapterContent` jobs for any bookmarked manga whose
//! chapters still have `page_count = 0`.
//! 4. Reap `done` jobs older than `retention_days`.
//! 5. Persist `last_metadata_tick_at` and release the lock.
//!
//! If the last persisted tick is older than the most recent scheduled slot
//! (e.g. backend was down at midnight), the daemon fires immediately on
//! startup before resuming the regular schedule.
//!
//! ## Workers
//!
//! Each worker leases one chapter-content job at a time, dispatches via the
//! [`ChapterDispatcher`], and acks `done` / `failed` / re-`pending` based on
//! the outcome. A `SessionExpired` outcome flips the sticky
//! `session_expired` flag — all workers idle while it's set (until operator
//! restart with a refreshed PHPSESSID).
//!
//! Worker dispatch is wrapped in `catch_unwind` so a panicking handler
//! marks the job failed instead of taking down the worker task.
use std::panic::AssertUnwindSafe;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use async_trait::async_trait;
use chrono::{DateTime, Datelike, NaiveTime, TimeZone, Timelike, Utc};
use chrono_tz::Tz;
use futures_util::FutureExt;
use serde_json::json;
use sqlx::PgPool;
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use crate::crawler::content::SyncOutcome;
use crate::crawler::jobs::{self, JobPayload, Lease, KIND_SYNC_CHAPTER_CONTENT};
use crate::crawler::pipeline;
/// Fixed `pg_try_advisory_lock` key. ASCII "MANGALRD" interpreted as a
/// big-endian i64. Hardcoded so every replica agrees on the lock identity
/// without consulting config.
pub const CRON_LOCK_KEY: i64 = 0x4D414E47414C5244;
const STATE_KEY_LAST_TICK: &str = "last_metadata_tick_at";
#[async_trait]
pub trait MetadataPass: Send + Sync {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats>;
}
#[async_trait]
pub trait ChapterDispatcher: Send + Sync {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome>;
}
/// Configuration for [`spawn`]. Use `None` for `metadata_pass` to disable
/// the cron entirely (worker-pool-only mode — useful when only the
/// bookmark-triggered enqueue path is wanted).
pub struct DaemonConfig {
pub metadata_pass: Option<Arc<dyn MetadataPass>>,
pub dispatcher: Arc<dyn ChapterDispatcher>,
pub chapter_workers: usize,
pub daily_at: NaiveTime,
pub tz: Tz,
pub retention_days: u32,
pub session_expired: Arc<AtomicBool>,
/// Tasks that should run alongside the cron + workers and be cancelled
/// on shutdown. Used to hand the daemon ownership of the browser
/// manager's idle reaper.
pub extra_tasks: Vec<tokio::task::JoinHandle<()>>,
}
pub struct DaemonHandle {
cancel: CancellationToken,
join: JoinSet<()>,
extra: Vec<tokio::task::JoinHandle<()>>,
}
impl DaemonHandle {
/// Trigger shutdown and await all worker / cron / extra tasks.
pub async fn shutdown(mut self) {
self.cancel.cancel();
while self.join.join_next().await.is_some() {}
for task in self.extra.drain(..) {
let _ = task.await;
}
}
/// Cancellation token that drives shutdown — exposed so callers
/// (`app::spawn_crawler_daemon`) can hand the same token to auxiliary
/// tasks (e.g. the BrowserManager idle reaper) and have them stop on
/// the daemon's signal.
pub fn cancel_token(&self) -> CancellationToken {
self.cancel.clone()
}
}
/// Spawn the daemon. Returns immediately; tasks run in the background.
/// Pass an external [`CancellationToken`] so auxiliary tasks (e.g. a
/// BrowserManager idle reaper) can share the same shutdown signal —
/// typically created in the caller, cloned into both spawns.
pub fn spawn(pool: PgPool, cancel: CancellationToken, cfg: DaemonConfig) -> DaemonHandle {
let mut join = JoinSet::new();
let DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers,
daily_at,
tz,
retention_days,
session_expired,
extra_tasks,
} = cfg;
if let Some(metadata) = metadata_pass {
let ctx = CronContext {
pool: pool.clone(),
cancel: cancel.clone(),
daily_at,
tz,
retention_days,
metadata,
};
join.spawn(async move { ctx.run().await });
} else {
tracing::info!("crawler daemon: no metadata_pass — cron disabled");
}
for worker_id in 0..chapter_workers.max(1) {
let ctx = WorkerContext {
pool: pool.clone(),
cancel: cancel.clone(),
dispatcher: Arc::clone(&dispatcher),
session_expired: Arc::clone(&session_expired),
id: worker_id,
};
join.spawn(async move { ctx.run().await });
}
DaemonHandle {
cancel,
join,
extra: extra_tasks,
}
}
// ---------------------------------------------------------------------------
// Cron
// ---------------------------------------------------------------------------
struct CronContext {
pool: PgPool,
cancel: CancellationToken,
daily_at: NaiveTime,
tz: Tz,
retention_days: u32,
metadata: Arc<dyn MetadataPass>,
}
impl CronContext {
async fn run(self) {
// On startup, fire immediately if the most recent slot has already
// passed and we never recorded a tick for it.
let now = Utc::now();
let mut catchup = match read_last_tick(&self.pool).await {
Ok(Some(last)) => previous_fire(now, self.daily_at, self.tz) > last,
Ok(None) => true,
Err(e) => {
tracing::warn!(?e, "cron: read_last_tick failed; assuming no catch-up");
false
}
};
loop {
if catchup {
tracing::info!("cron: catch-up tick (missed scheduled slot)");
self.run_tick().await;
catchup = false;
continue;
}
// Recompute next-fire from now() each iteration so clock jumps
// (NTP step, suspend/resume) don't strand us on a stale instant.
let next = next_fire(Utc::now(), self.daily_at, self.tz);
let wait = (next - Utc::now()).to_std().unwrap_or(Duration::ZERO);
tracing::info!(
next_fire_utc = %next.to_rfc3339(),
wait_seconds = wait.as_secs(),
"cron: sleeping until next slot"
);
tokio::select! {
_ = tokio::time::sleep(wait) => {}
_ = self.cancel.cancelled() => {
tracing::info!("cron: shutdown");
return;
}
}
self.run_tick().await;
}
}
async fn run_tick(&self) {
let mut conn = match self.pool.acquire().await {
Ok(c) => c,
Err(e) => {
tracing::error!(?e, "cron: acquire conn failed; skipping tick");
return;
}
};
// pg_try_advisory_lock is session-scoped — we must hold the same
// connection for the unlock or the call silently no-ops on a
// different connection from the pool.
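let acquired: bool = match sqlx::query_scalar::<_, bool>("SELECT pg_try_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.fetch_one(&mut *conn)
.await
{
Ok(v) => v,
Err(e) => {
// A query error is not lock contention; report it as such
// instead of folding it into the "lock held" path below.
tracing::error!(?e, "cron: advisory lock query failed; skipping tick");
return;
}
};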
if !acquired {
tracing::info!("cron: tick skipped — another replica holds the lock");
return;
}
// Panic-isolate the tick body the same way `process_lease` does
// for worker dispatch. Without this, a panic in metadata.run
// (or any of the follow-on steps) would kill the cron task and
// no future tick would ever run — workers would keep going but
// no new metadata work would be scheduled until daemon restart.
// The advisory unlock below runs unconditionally so a panicked
// tick doesn't leave the lock held for another replica.
let metadata = &self.metadata;
let pool = &self.pool;
let retention_days = self.retention_days;
let body = async move {
match metadata.run().await {
Ok(stats) => tracing::info!(?stats, "cron: metadata pass done"),
Err(e) => tracing::error!(?e, "cron: metadata pass failed"),
}
match pipeline::enqueue_bookmarked_pending(pool).await {
Ok(summary) => {
tracing::info!(?summary, "cron: enqueued bookmarked-pending");
}
Err(e) => tracing::error!(?e, "cron: enqueue_bookmarked_pending failed"),
}
match jobs::reap_done(pool, retention_days).await {
Ok(n) => tracing::info!(reaped = n, "cron: done-job reaper finished"),
Err(e) => tracing::error!(?e, "cron: done-job reaper failed"),
}
if let Err(e) = write_last_tick(pool, Utc::now()).await {
tracing::warn!(?e, "cron: persist last_metadata_tick_at failed");
}
};
if let Err(_panic) = AssertUnwindSafe(body).catch_unwind().await {
tracing::error!("cron: tick body panicked — continuing");
}
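if let Err(e) = sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *conn)
.await
{
// Session-level advisory locks persist until released or the
// session ends, so a failed unlock is worth surfacing.
tracing::warn!(?e, "cron: advisory unlock failed; lock stays held until this connection closes");
}
drop(conn);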
}
}
// ---------------------------------------------------------------------------
// Workers
// ---------------------------------------------------------------------------
struct WorkerContext {
pool: PgPool,
cancel: CancellationToken,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<AtomicBool>,
id: usize,
}
impl WorkerContext {
async fn run(self) {
loop {
if self.cancel.is_cancelled() {
tracing::info!(worker = self.id, "worker: shutdown");
return;
}
if self.session_expired.load(Ordering::Acquire) {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(30)) => continue,
_ = self.cancel.cancelled() => return,
}
}
let leases = match jobs::lease(
&self.pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
1,
Duration::from_secs(60),
)
.await
{
Ok(v) => v,
Err(e) => {
tracing::warn!(worker = self.id, ?e, "worker: lease failed");
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(5)) => continue,
_ = self.cancel.cancelled() => return,
}
}
};
let Some(lease) = leases.into_iter().next() else {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(1)) => continue,
_ = self.cancel.cancelled() => return,
}
};
self.process_lease(lease).await;
}
}
async fn process_lease(&self, lease: Lease) {
// Consumer-side dedup safety net: if the chapter already has pages
// (because of a force-refetch race, or because the job was
// re-enqueued after a previous one finished), ack done without
// re-fetching.
if let JobPayload::SyncChapterContent { chapter_id, .. } = &lease.payload {
let page_count = crate::repo::chapter::page_count(&self.pool, *chapter_id)
.await
.ok()
.flatten();
if matches!(page_count, Some(n) if n > 0) {
let _ = jobs::ack_done(&self.pool, lease.id).await;
return;
}
}
let outcome = AssertUnwindSafe(self.dispatcher.dispatch(lease.payload.clone()))
.catch_unwind()
.await;
match outcome {
Ok(Ok(SyncOutcome::Fetched { .. } | SyncOutcome::Skipped)) => {
let _ = jobs::ack_done(&self.pool, lease.id).await;
}
Ok(Ok(SyncOutcome::SessionExpired)) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"session expired — workers will idle until restart"
);
self.session_expired.store(true, Ordering::Release);
let _ = jobs::release(&self.pool, lease.id).await;
}
Ok(Err(e)) => {
tracing::warn!(
worker = self.id,
lease_id = %lease.id,
error = ?e,
"worker: dispatch error — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
&format!("{e:#}"),
lease.attempts,
lease.max_attempts,
)
.await;
}
Err(_panic) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"worker: dispatcher panicked — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
"worker panicked",
lease.attempts,
lease.max_attempts,
)
.await;
}
}
}
}
// ---------------------------------------------------------------------------
// Cron timing primitives
// ---------------------------------------------------------------------------
/// Compute the next UTC instant when `daily_at` (interpreted in `tz`) will
/// fire, strictly after `now`. Handles DST gaps (spring-forward) by
/// advancing past the gap; on DST overlap (fall-back) picks the later
/// instant so the job runs once, not twice.
pub fn next_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
// Start with today's slot in the local TZ.
let mut candidate = local_at(now_local.date_naive(), daily_at, tz);
// If today's slot is in the past (or now), roll forward day-by-day.
while candidate <= now {
let next_day = candidate
.with_timezone(&tz)
.date_naive()
.succ_opt()
.unwrap_or_else(|| {
// Defensive: succ_opt only fails at chrono's max date.
chrono::NaiveDate::from_ymd_opt(
candidate.year(),
candidate.month(),
candidate.day(),
)
.expect("valid date")
});
candidate = local_at(next_day, daily_at, tz);
}
candidate
}
/// The most recent fire instant at or before `now`. Used to detect missed
/// slots after a restart.
pub fn previous_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
let today = local_at(now_local.date_naive(), daily_at, tz);
if today <= now {
return today;
}
let yesterday = now_local
.date_naive()
.pred_opt()
.expect("a day before now");
local_at(yesterday, daily_at, tz)
}
/// Resolve a local date+time to a UTC instant in `tz`, navigating DST
/// edges deterministically:
/// - `LocalResult::Single` → that instant.
/// - `LocalResult::Ambiguous(_, latest)` → the later instant (fall-back
/// hour). Picking latest means a daily job fires once across the
/// repeated hour, not twice.
/// - `LocalResult::None` → spring-forward gap. Advance the local time
/// one minute at a time and retry, up to 119 steps, which clears
/// the usual one-hour gap with margin to spare.
fn local_at(date: chrono::NaiveDate, time: NaiveTime, tz: Tz) -> DateTime<Utc> {
use chrono::LocalResult;
for offset_minutes in 0..120 {
let mut t = time;
if offset_minutes > 0 {
let added = chrono::NaiveTime::from_num_seconds_from_midnight_opt(
((time.num_seconds_from_midnight() as i64 + offset_minutes * 60) % 86_400) as u32,
0,
)
.unwrap_or(time);
t = added;
}
let naive = date.and_time(t);
match tz.from_local_datetime(&naive) {
LocalResult::Single(dt) => return dt.with_timezone(&Utc),
LocalResult::Ambiguous(_, latest) => return latest.with_timezone(&Utc),
LocalResult::None => continue,
}
}
// Not reached for ordinary DST gaps (typically one hour), which the
// 119-minute search above clears. As a last resort, treat the local
// time as if it were UTC.
Utc.from_utc_datetime(&date.and_time(time))
}
// ---------------------------------------------------------------------------
// crawler_state I/O
// ---------------------------------------------------------------------------
async fn read_last_tick(pool: &PgPool) -> sqlx::Result<Option<DateTime<Utc>>> {
let row: Option<serde_json::Value> = sqlx::query_scalar(
"SELECT value FROM crawler_state WHERE key = $1",
)
.bind(STATE_KEY_LAST_TICK)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|v| {
v.get("at")
.and_then(|s| s.as_str())
.and_then(|s| DateTime::parse_from_rfc3339(s).ok())
.map(|dt| dt.with_timezone(&Utc))
}))
}
async fn write_last_tick(pool: &PgPool, at: DateTime<Utc>) -> sqlx::Result<()> {
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(STATE_KEY_LAST_TICK)
.bind(json!({ "at": at.to_rfc3339() }))
.execute(pool)
.await?;
Ok(())
}
// ---------------------------------------------------------------------------
// Test helpers (not gated on cfg(test) — integration tests in tests/ dir
// need them too).
// ---------------------------------------------------------------------------
pub mod test_support {
//! Lightweight stubs the daemon tests use. Public because integration
//! tests live outside this module.
use super::*;
use std::sync::atomic::AtomicUsize;
pub struct CountingMetadataPass {
pub count: AtomicUsize,
}
impl Default for CountingMetadataPass {
fn default() -> Self {
Self {
count: AtomicUsize::new(0),
}
}
}
#[async_trait]
impl MetadataPass for CountingMetadataPass {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats> {
self.count.fetch_add(1, Ordering::AcqRel);
Ok(pipeline::MetadataStats::default())
}
}
pub type DispatchFn = Arc<
dyn Fn(JobPayload) -> futures_util::future::BoxFuture<'static, anyhow::Result<SyncOutcome>>
+ Send
+ Sync,
>;
pub struct StubDispatcher {
pub handler: DispatchFn,
}
#[async_trait]
impl ChapterDispatcher for StubDispatcher {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
(self.handler)(payload).await
}
}
pub fn always_done() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { Ok(SyncOutcome::Fetched { pages: 1 }) })),
})
}
pub fn panicking_dispatcher() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { panic!("intentional dispatcher panic") })),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
use chrono::Duration as ChronoDuration;
fn dt_utc(y: i32, mo: u32, d: u32, h: u32, mi: u32) -> DateTime<Utc> {
Utc.with_ymd_and_hms(y, mo, d, h, mi, 0).unwrap()
}
#[test]
fn next_fire_in_utc_at_midnight_advances_one_day() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
// Next midnight is May 26 00:00 UTC.
assert_eq!(next, dt_utc(2026, 5, 26, 0, 0));
}
#[test]
fn next_fire_before_today_slot_returns_today() {
let now = dt_utc(2026, 5, 25, 23, 0); // 23:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
assert_eq!(next, dt_utc(2026, 5, 25, 23, 30));
}
#[test]
fn next_fire_skips_spring_forward_gap_in_europe_berlin() {
// 2024-03-31: clocks jump 02:00 -> 03:00 in Berlin (CET -> CEST).
// Asking for daily_at = 02:30 on the morning of the jump should
// land on the *next valid* local instant past the gap. We test
// by computing `next_fire` at 2024-03-31 00:30 UTC (= 01:30 CET,
// i.e. just before the gap). The next 02:30 local does not exist,
// so the helper advances past it.
let now = dt_utc(2024, 3, 31, 0, 30); // 01:30 local Berlin (CET = UTC+1)
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// Local Berlin time skips from 02:00 -> 03:00. After the +1 minute
// search, the first valid slot is 03:00 local on 2024-03-31, which
// is 01:00 UTC (CEST = UTC+2).
// We assert the result is strictly after `now` and less than 2h
// later, then pin the exact instant: 30 one-minute steps from
// 02:30 reach 03:00 local, which is 01:00 UTC.
assert!(next > now);
assert!(next < now + ChronoDuration::hours(2));
assert_eq!(next, dt_utc(2024, 3, 31, 1, 0));
}
#[test]
fn next_fire_on_fall_back_picks_later_instant() {
// 2024-10-27: clocks jump 03:00 -> 02:00 (CEST -> CET) in Berlin.
// 02:30 happens twice on that day. We pick the later one.
let now = dt_utc(2024, 10, 26, 12, 0); // day before, noon UTC
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// First 02:30 local is 00:30 UTC (CEST = UTC+2).
// Second 02:30 local is 01:30 UTC (CET = UTC+1).
// We expect the later instant: 01:30 UTC on 2024-10-27.
assert_eq!(next, dt_utc(2024, 10, 27, 1, 30));
}
#[test]
fn previous_fire_returns_today_when_now_is_after_slot() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 25, 0, 0));
}
#[test]
fn previous_fire_returns_yesterday_when_now_is_before_today_slot() {
let now = dt_utc(2026, 5, 25, 8, 0); // 08:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 24, 23, 30));
}
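// Added sketch of the boundary the doc comments describe: `next_fire`
// is strictly after `now`, while `previous_fire` is at or before it.
// With `now` exactly on the slot, the two should differ by one day.
#[test]
fn fire_helpers_agree_on_boundary_when_now_is_exactly_on_slot() {
let now = dt_utc(2026, 5, 25, 0, 0);
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
assert_eq!(previous_fire(now, at, Tz::UTC), now);
assert_eq!(next_fire(now, at, Tz::UTC), dt_utc(2026, 5, 26, 0, 0));
}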
/// Documents the panic-isolation pattern `run_tick` now relies on:
/// `AssertUnwindSafe(...).catch_unwind().await` must yield `Err(_)`
/// when the wrapped future panics, so the surrounding loop (or in
/// our case, the unconditional advisory-unlock that follows) keeps
/// running. The shape of this test mirrors the production callsite.
#[tokio::test]
async fn assert_unwind_safe_catches_a_panicking_future() {
let result = AssertUnwindSafe(async {
panic!("boom");
})
.catch_unwind()
.await;
assert!(result.is_err(), "panicking future must yield Err");
}
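// Added sketch: checks the claim in the `CRON_LOCK_KEY` doc comment
// that the key is ASCII "MANGALRD" read as a big-endian i64.
#[test]
fn cron_lock_key_is_ascii_mangalrd_big_endian() {
assert_eq!(CRON_LOCK_KEY, i64::from_be_bytes(*b"MANGALRD"));
}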
}


@@ -0,0 +1,362 @@
//! Transient-page detection.
//!
//! The target site occasionally responds with a 403 + tiny "we're sorry,
//! the request file are not found" body on pages that actually exist.
//! Selectors on that body match nothing, which is indistinguishable from
//! a genuinely empty page unless we look for the broken-page markers
//! explicitly. The same shape covers full-site outages: 5xx pages,
//! Cloudflare interstitials, and "site is down" placeholders all share
//! the trait that the normal layout (`#logo` in the header) is absent.
//!
//! Helpers here are split into two signals so callers can compose them:
//! - [`is_broken_page_body`]: pattern-match on the known broken-page
//! string. Works for *any* page on the site, including the reader,
//! which doesn't render `#logo`.
//! - [`has_logo_sentinel`]: assert `#logo` is in the parsed DOM. Site-
//! structural marker — present on the manga list, manga detail,
//! chapter-list, and login probe pages. **Not** present on the reader,
//! so callers in the reader path must rely on the body signature only.
//!
//! [`PageError::Transient`] is the typed signal returned by parser and
//! navigate wrappers. Job handlers map it to "reschedule with backoff"
//! rather than the per-page silent skip the parsers used to do.
use std::future::Future;
use std::time::Duration;
use thiserror::Error;
/// Universal substring of the broken-page body. The site renders the
/// exact string verbatim in a single `<p>`, so a case-insensitive
/// substring match is enough — we deliberately do *not* anchor to the
/// kaomoji because that part is more likely to change than the prose.
const BROKEN_PAGE_MARKER: &str = "we're sorry, the request file are not found";
/// Outcome of a page fetch or parse when the caller wants to
/// distinguish "site/page is transiently broken — retry later" from
/// other errors. `Transient` is the only retry-friendly variant; every
/// other failure mode stays as `anyhow::Error` and is treated as today.
#[derive(Debug, Error)]
pub enum PageError {
/// Page came back but the site signaled trouble — broken-page body
/// signature, structural sentinel missing, etc. Caller should
/// reschedule this fetch rather than treat it as data.
#[error("transient page error: {reason}")]
Transient { reason: String },
#[error(transparent)]
Other(#[from] anyhow::Error),
}
impl PageError {
pub fn transient(reason: impl Into<String>) -> Self {
Self::Transient { reason: reason.into() }
}
pub fn is_transient(&self) -> bool {
matches!(self, Self::Transient { .. })
}
}
/// Returns true when the response body contains the known broken-page
/// marker, compared case-insensitively. The broken page itself is tiny
/// (~150B), so the scan is trivially fast there, and the marker
/// sentence is specific enough that false positives on a real catalog
/// page are not a concern.
pub fn is_broken_page_body(html: &str) -> bool {
html.to_ascii_lowercase().contains(BROKEN_PAGE_MARKER)
}
/// Returns true when the parsed document contains `#logo` — the site's
/// header logo element, present on every full-layout page and absent on
/// the broken-page response and on the reader.
pub fn has_logo_sentinel(doc: &scraper::Html) -> bool {
let sel = scraper::Selector::parse("#logo").expect("#logo is a valid selector");
doc.select(&sel).next().is_some()
}
/// Run `op` up to `max_attempts` times in total (the first try plus at
/// most `max_attempts - 1` retries), retrying only when it returns
/// [`PageError::Transient`] and sleeping `delay` between attempts.
/// Non-transient errors short-circuit immediately. Used by discover-loop
/// callers so a single broken page doesn't drop the whole walk; once
/// the inline budget is exhausted the last transient error is returned
/// and the caller can fall back on the job system's retry/backoff.
pub async fn retry_on_transient<F, Fut, T>(
op: F,
max_attempts: u32,
delay: Duration,
) -> Result<T, PageError>
where
F: FnMut() -> Fut,
Fut: Future<Output = Result<T, PageError>>,
{
retry_on_transient_with_hook(op, max_attempts, delay, || async {}).await
}
/// Like [`retry_on_transient`] but invokes `on_retry` between a
/// transient failure and the subsequent sleep+retry. The hook does
/// **not** fire on the first attempt, after a non-transient error, or
/// after the final attempt (no retry follows). Hook failures are not
/// propagated — return `()` from the future and log inside if needed.
///
/// Wire the TOR controller's `new_identity` here to rotate circuits
/// between page-fetch retries; see [`crate::crawler::tor`].
pub async fn retry_on_transient_with_hook<F, Fut, T, H, HFut>(
mut op: F,
max_attempts: u32,
delay: Duration,
mut on_retry: H,
) -> Result<T, PageError>
where
F: FnMut() -> Fut,
Fut: Future<Output = Result<T, PageError>>,
H: FnMut() -> HFut,
HFut: Future<Output = ()>,
{
debug_assert!(max_attempts >= 1, "max_attempts must be at least 1");
let mut attempt = 0u32;
loop {
attempt += 1;
match op().await {
Ok(v) => return Ok(v),
Err(e) if !e.is_transient() => return Err(e),
Err(e) if attempt >= max_attempts => return Err(e),
Err(e) => {
tracing::warn!(
attempt,
max_attempts,
error = %e,
"transient error; running on-retry hook and sleeping before retry"
);
on_retry().await;
tokio::time::sleep(delay).await;
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn broken_page_body_matches_exact_template() {
let html = "<html><head></head><body>\
<p>we're sorry, the request file are not found. Σ(っ°Д °;)っ</p>\
</body></html>";
assert!(is_broken_page_body(html));
}
#[test]
fn broken_page_body_is_case_insensitive() {
let html = "<p>WE'RE SORRY, THE REQUEST FILE ARE NOT FOUND.</p>";
assert!(is_broken_page_body(html));
}
#[test]
fn broken_page_body_does_not_match_normal_listing() {
let html = "<html><body><div id='logo'></div>\
<ul><li>Manga A</li><li>Manga B</li></ul></body></html>";
assert!(!is_broken_page_body(html));
}
#[test]
fn broken_page_body_does_not_match_empty_string() {
assert!(!is_broken_page_body(""));
}
#[test]
fn logo_sentinel_present_on_normal_page() {
let doc = scraper::Html::parse_document(
"<html><body><div id='logo'>Site</div><main>...</main></body></html>",
);
assert!(has_logo_sentinel(&doc));
}
#[test]
fn logo_sentinel_absent_on_broken_page() {
let doc = scraper::Html::parse_document(
"<html><head></head><body>\
<p>we're sorry, the request file are not found.</p></body></html>",
);
assert!(!has_logo_sentinel(&doc));
}
#[test]
fn logo_sentinel_absent_on_empty_document() {
let doc = scraper::Html::parse_document("");
assert!(!has_logo_sentinel(&doc));
}
#[test]
fn page_error_transient_constructor_sets_reason() {
let e = PageError::transient("logo missing");
assert!(e.is_transient());
assert_eq!(e.to_string(), "transient page error: logo missing");
}
#[test]
fn page_error_other_is_not_transient() {
let e: PageError = anyhow::anyhow!("something else").into();
assert!(!e.is_transient());
}
#[tokio::test]
async fn retry_returns_ok_after_a_transient_streak() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
let n = attempt;
async move {
if n < 3 {
Err(PageError::transient("not yet"))
} else {
Ok(42)
}
}
},
5,
Duration::from_millis(0),
)
.await;
assert_eq!(result.unwrap(), 42);
assert_eq!(attempt, 3);
}
#[tokio::test]
async fn retry_gives_up_after_max_attempts_on_persistent_transient() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Err(PageError::transient("always")) }
},
3,
Duration::from_millis(0),
)
.await;
let err = result.expect_err("expected Transient");
assert!(err.is_transient());
assert_eq!(attempt, 3, "retried max_attempts times, no more");
}
#[tokio::test]
async fn retry_does_not_retry_non_transient_errors() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Err(PageError::Other(anyhow::anyhow!("permanent"))) }
},
5,
Duration::from_millis(0),
)
.await;
assert!(result.is_err());
assert!(!result.unwrap_err().is_transient());
assert_eq!(attempt, 1, "non-transient must fail immediately");
}
#[tokio::test]
async fn retry_returns_ok_on_first_attempt_without_sleeping() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Ok(7) }
},
5,
Duration::from_secs(60),
)
.await;
assert_eq!(result.unwrap(), 7);
assert_eq!(attempt, 1);
}
#[tokio::test]
async fn hook_fires_once_between_transient_and_success() {
let mut attempt = 0u32;
let mut hook_calls = 0u32;
let result: Result<i32, PageError> = retry_on_transient_with_hook(
|| {
attempt += 1;
let n = attempt;
async move {
if n < 2 {
Err(PageError::transient("once"))
} else {
Ok(99)
}
}
},
5,
Duration::from_millis(0),
|| {
hook_calls += 1;
async {}
},
)
.await;
assert_eq!(result.unwrap(), 99);
assert_eq!(attempt, 2);
assert_eq!(hook_calls, 1, "hook fires exactly once between attempts");
}
#[tokio::test]
async fn hook_does_not_fire_when_first_attempt_succeeds() {
let mut hook_calls = 0u32;
let result: Result<i32, PageError> = retry_on_transient_with_hook(
|| async { Ok(1) },
5,
Duration::from_millis(0),
|| {
hook_calls += 1;
async {}
},
)
.await;
assert!(result.is_ok());
assert_eq!(hook_calls, 0);
}
#[tokio::test]
async fn hook_does_not_fire_after_non_transient_error() {
let mut hook_calls = 0u32;
let result: Result<i32, PageError> = retry_on_transient_with_hook(
|| async { Err(PageError::Other(anyhow::anyhow!("permanent"))) },
5,
Duration::from_millis(0),
|| {
hook_calls += 1;
async {}
},
)
.await;
assert!(result.is_err());
assert_eq!(hook_calls, 0, "non-transient must short-circuit before hook");
}
#[tokio::test]
async fn hook_does_not_fire_after_final_failed_attempt() {
// With max_attempts=3 and three persistent transients, the hook
// should run twice (between 1→2 and 2→3) — never a third time,
// because no retry follows attempt 3.
let mut attempt = 0u32;
let mut hook_calls = 0u32;
let result: Result<i32, PageError> = retry_on_transient_with_hook(
|| {
attempt += 1;
async { Err(PageError::transient("always")) }
},
3,
Duration::from_millis(0),
|| {
hook_calls += 1;
async {}
},
)
.await;
assert!(result.is_err());
assert_eq!(attempt, 3);
assert_eq!(hook_calls, 2, "hook fires N-1 times for N attempts that all fail transient");
}
}
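
The retry contract tested above (non-transient errors short-circuit, the hook runs only between attempts, and N failing attempts produce N-1 hook calls) can be sketched without an async runtime. The following is a minimal blocking analogue, not the crate's implementation: `SketchError`, `retry_sync`, and `count_attempts_and_hooks` are hypothetical names introduced for illustration, and `std::thread::sleep` stands in for `tokio::time::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for `PageError`: only the transient /
/// non-transient distinction matters for the retry rule.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Transient(String),
    Other(String),
}

/// Blocking analogue of the hook-aware retry loop (illustrative only).
pub fn retry_sync<T>(
    mut op: impl FnMut() -> Result<T, SketchError>,
    max_attempts: u32,
    delay: Duration,
    mut on_retry: impl FnMut(),
) -> Result<T, SketchError> {
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        match op() {
            Ok(v) => return Ok(v),
            // Non-transient errors are never retried and never reach the hook.
            Err(e @ SketchError::Other(_)) => return Err(e),
            // Budget exhausted: return the last transient error, no hook call.
            Err(e) if attempt >= max_attempts => return Err(e),
            // A retry will follow, so run the hook and then wait.
            Err(_) => {
                on_retry();
                sleep(delay);
            }
        }
    }
}

/// Drives an operation that always fails transiently and reports
/// `(attempts, hook_calls)`.
pub fn count_attempts_and_hooks(max_attempts: u32) -> (u32, u32) {
    let mut attempts = 0u32;
    let mut hooks = 0u32;
    let result: Result<(), SketchError> = retry_sync(
        || {
            attempts += 1;
            Err(SketchError::Transient("always".into()))
        },
        max_attempts,
        Duration::ZERO,
        || hooks += 1,
    );
    debug_assert!(result.is_err());
    (attempts, hooks)
}
```

With `max_attempts = 3` the operation runs three times and the hook twice; with `max_attempts = 1` the hook never runs, which is the case where a circuit rotation between attempts would be pointless.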


@@ -1,27 +1,20 @@
//! Persistent job queue and the four job kinds.
//! Persistent job queue and its job kinds.
//!
//! Backed by Postgres (the `crawler_jobs` table). Workers lease rows
//! with `SELECT ... FOR UPDATE SKIP LOCKED`, heartbeat via
//! `leased_until`, and ack by transitioning to `done` (or backoff /
//! `dead`). Handlers are idempotent so a crash mid-run is recoverable
//! by replay.
//!
//! Scaffold only — the actual queue wrapper and handler dispatch land
//! once we have the first `Source` impl exercising the pipeline.
use std::time::Duration;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use uuid::Uuid;
use super::source::DiscoverMode;
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum JobPayload {
/// Walk the source index and enqueue `SyncManga` jobs.
Discover {
source_id: String,
mode: DiscoverMode,
},
/// Fetch one manga's detail page, upsert metadata, enqueue
/// `SyncChapterList`.
SyncManga {
@@ -53,3 +46,251 @@ pub enum JobState {
Failed,
Dead,
}
/// Kind discriminator stored in `payload->>'kind'`. Public so callers
/// (daemon worker, bookmark hook) can filter `lease()` to a single kind
/// without re-spelling the literal.
pub const KIND_SYNC_CHAPTER_CONTENT: &str = "sync_chapter_content";
#[derive(Debug)]
pub enum EnqueueResult {
Inserted(Uuid),
Skipped,
}
#[derive(Debug, Clone)]
pub struct Lease {
pub id: Uuid,
pub payload: JobPayload,
pub attempts: i32,
pub max_attempts: i32,
}
/// Exponential backoff for `ack_failed` retries. `attempts` is the
/// post-increment value reported by `lease()` (so the first failure has
/// `attempts == 1` and waits 60s, the second 120s, etc.). Capped at 1h to
/// avoid runaway long sleeps that would outlive the daemon process.
fn backoff_for(attempts: i32) -> Duration {
let shift = attempts.saturating_sub(1).clamp(0, 20) as u32;
let secs = 60u64.saturating_mul(1u64 << shift);
Duration::from_secs(secs.min(3600))
}
/// Insert a new pending job. For `SyncChapterContent` payloads the
/// partial unique index `crawler_jobs_chapter_content_dedup_idx` blocks
/// a second `(pending|running)` insert per chapter_id, returning
/// `Skipped`. The slot frees again once the previous job leaves the
/// in-flight states (done/failed/dead), so a re-enqueue after a force
/// refetch succeeds.
pub async fn enqueue(pool: &PgPool, payload: &JobPayload) -> sqlx::Result<EnqueueResult> {
let json = serde_json::to_value(payload).expect("JobPayload is always serializable");
let id: Option<Uuid> = sqlx::query_scalar(
"INSERT INTO crawler_jobs (payload) VALUES ($1) \
ON CONFLICT DO NOTHING RETURNING id",
)
.bind(json)
.fetch_optional(pool)
.await?;
Ok(match id {
Some(id) => EnqueueResult::Inserted(id),
None => EnqueueResult::Skipped,
})
}
/// Lease up to `max` rows whose `state` is `pending`, or `running` with
/// an expired `leased_until` (the crashed-worker recovery path). The
/// inner CTE uses `FOR UPDATE SKIP LOCKED` so concurrent leasers don't
/// block each other and each row is handed to exactly one worker.
///
/// `kind_filter` matches against `payload->>'kind'`; `None` means
/// any kind.
///
/// Ties on `scheduled_at` (the common case: a cron batch enqueues
/// everything with the same default `now()`) break by `created_at`, so
/// jobs come off the queue in insertion order. The enqueue paths insert
/// chapter-content jobs in ascending `chapters.number` order, so this
/// tiebreaker is what propagates that intent through to dequeue.
pub async fn lease(
pool: &PgPool,
kind_filter: Option<&str>,
max: i64,
lease_duration: Duration,
) -> sqlx::Result<Vec<Lease>> {
let lease_ms: i64 = lease_duration.as_millis().min(i64::MAX as u128) as i64;
let rows: Vec<(Uuid, serde_json::Value, i32, i32)> = sqlx::query_as(
r#"
WITH leased AS (
SELECT id FROM crawler_jobs
WHERE (state = 'pending' OR (state = 'running' AND leased_until < now()))
AND scheduled_at <= now()
AND ($1::text IS NULL OR payload->>'kind' = $1)
ORDER BY scheduled_at, created_at
LIMIT $2
FOR UPDATE SKIP LOCKED
)
UPDATE crawler_jobs j
SET state = 'running',
attempts = j.attempts + 1,
leased_until = now() + ($3::bigint || ' milliseconds')::interval,
updated_at = now()
FROM leased l
WHERE j.id = l.id
RETURNING j.id, j.payload, j.attempts, j.max_attempts
"#,
)
.bind(kind_filter)
.bind(max)
.bind(lease_ms)
.fetch_all(pool)
.await?;
let mut leases = Vec::with_capacity(rows.len());
for (id, payload_json, attempts, max_attempts) in rows {
let payload: JobPayload = serde_json::from_value(payload_json).map_err(|e| {
sqlx::Error::Decode(format!("invalid JobPayload JSON for job {id}: {e}").into())
})?;
leases.push(Lease {
id,
payload,
attempts,
max_attempts,
});
}
Ok(leases)
}
/// Mark a leased job as successfully completed. The `state = 'running'`
/// predicate guards against a late ack from a worker whose lease expired
/// and was already re-leased by another worker: without it, the late ack
/// would clobber the new lease's `state` and `leased_until`. `rows_affected
/// == 0` means we lost the lease — surfaced as a warn rather than an
/// error because the new lease holder is doing real work; the late ack
/// just has to step aside.
pub async fn ack_done(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
let res = sqlx::query(
"UPDATE crawler_jobs \
SET state = 'done', leased_until = NULL, updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.execute(pool)
.await?;
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"ack_done: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Mark a leased job as failed. If the current attempt count has reached
/// `max_attempts` the job is terminally dead and stops retrying;
/// otherwise it goes back to `pending` with `scheduled_at` pushed into
/// the future by the exponential backoff. See [`ack_done`] for the
/// `state = 'running'` guard rationale.
pub async fn ack_failed(
pool: &PgPool,
lease_id: Uuid,
error: &str,
attempts: i32,
max_attempts: i32,
) -> sqlx::Result<()> {
let res = if attempts >= max_attempts {
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'dead', last_error = $2, leased_until = NULL, updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.bind(error)
.execute(pool)
.await?
} else {
let backoff_ms: i64 = backoff_for(attempts).as_millis().min(i64::MAX as u128) as i64;
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', last_error = $2, leased_until = NULL, \
scheduled_at = now() + ($3::bigint || ' milliseconds')::interval, \
updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.bind(error)
.bind(backoff_ms)
.execute(pool)
.await?
};
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"ack_failed: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Return a leased job to `pending` without burning a retry attempt.
/// Used on graceful shutdown and on session-expired aborts where the
/// failure isn't the job's fault. See [`ack_done`] for the
/// `state = 'running'` guard rationale — important here because
/// `attempts - 1` would otherwise spuriously decrement the new lease's
/// attempt count.
pub async fn release(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
let res = sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', leased_until = NULL, \
attempts = GREATEST(0, attempts - 1), updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.execute(pool)
.await?;
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"release: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Delete `done` jobs whose `updated_at` is older than `retention_days`
/// days. `0` disables the reaper without touching the table. Returns the
/// number of rows removed.
pub async fn reap_done(pool: &PgPool, retention_days: u32) -> sqlx::Result<u64> {
if retention_days == 0 {
return Ok(0);
}
let result = sqlx::query(
"DELETE FROM crawler_jobs \
WHERE state = 'done' \
AND updated_at < now() - ($1::bigint || ' days')::interval",
)
.bind(retention_days as i64)
.execute(pool)
.await?;
Ok(result.rows_affected())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn backoff_grows_exponentially_and_caps_at_one_hour() {
// attempts == 1 → 60s, doubling each step.
assert_eq!(backoff_for(1), Duration::from_secs(60));
assert_eq!(backoff_for(2), Duration::from_secs(120));
assert_eq!(backoff_for(3), Duration::from_secs(240));
assert_eq!(backoff_for(4), Duration::from_secs(480));
assert_eq!(backoff_for(5), Duration::from_secs(960));
assert_eq!(backoff_for(6), Duration::from_secs(1920));
// 7th: 60 * 64 = 3840 → capped to 3600.
assert_eq!(backoff_for(7), Duration::from_secs(3600));
assert_eq!(backoff_for(20), Duration::from_secs(3600));
// Garbage / zero / negatives stay sane.
assert_eq!(backoff_for(0), Duration::from_secs(60));
assert_eq!(backoff_for(-5), Duration::from_secs(60));
}
}
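
As a worked example of the backoff arithmetic: `ack_failed` reschedules a job while `attempts < max_attempts` and marks it `dead` once `attempts >= max_attempts`, so a job accumulates the backoff of attempts `1..max_attempts` before its final try. The sketch below repeats the `backoff_for` formula from this file and adds `total_backoff_before_dead`, a hypothetical helper that is not part of the diff. It assumes each retry is leased the moment `scheduled_at` passes and ignores the time the job itself spends running.

```rust
use std::time::Duration;

/// Same formula as `backoff_for` in the diff: 60s doubled per prior
/// failure, capped at one hour.
pub fn backoff_for(attempts: i32) -> Duration {
    let shift = attempts.saturating_sub(1).clamp(0, 20) as u32;
    let secs = 60u64.saturating_mul(1u64 << shift);
    Duration::from_secs(secs.min(3600))
}

/// Hypothetical helper: total scheduled delay a job accumulates before
/// the attempt that marks it `dead`. Failures on attempts
/// `1..max_attempts` each add one backoff; the failure on attempt
/// `max_attempts` adds none because the job stops retrying.
pub fn total_backoff_before_dead(max_attempts: i32) -> Duration {
    (1..max_attempts).map(backoff_for).sum()
}
```

For `max_attempts = 5` this gives 60 + 120 + 240 + 480 = 900 seconds, so under these assumptions a persistently failing job is declared dead roughly 15 minutes after its first failure.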


@@ -14,7 +14,18 @@
//! - [`diff`]: change detection — new / updated / dropped semantics.
pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod detect;
pub mod diff;
pub mod jobs;
pub mod nav;
pub mod pipeline;
pub mod rate_limit;
pub mod resync;
pub mod safety;
pub mod session;
pub mod source;
pub mod tor;
pub mod url_utils;

241
backend/src/crawler/nav.rs Normal file

@@ -0,0 +1,241 @@
//! Page navigation helpers — wrap `chromiumoxide` `wait_for_navigation`
//! with a timeout so a hung TLS handshake or a page that never fires
//! `load` cannot wedge a worker (or the cron metadata pass) forever.
//!
//! [`NAV_TIMEOUT`] is the global budget. Callers in the crawler use
//! [`wait_for_nav`] to get back a typed error so transient timeouts can
//! be reported separately from underlying CDP errors.
use std::time::Duration;
use chromiumoxide::error::CdpError;
use chromiumoxide::Page;
use thiserror::Error;
/// Maximum wall-clock time we'll wait for a single page navigation. A
/// healthy Chromium reaches `load` in well under a second on the target
/// site; a 30-second cap is generous enough for slow TLS handshakes on
/// the first request after a fresh process while still catching real
/// hangs before they wedge the daemon.
pub const NAV_TIMEOUT: Duration = Duration::from_secs(30);
/// Outcome of a timed-out navigation. `Timeout` is the transient signal
/// callers translate into a retry-friendly error
/// ([`crate::crawler::detect::PageError::Transient`] in the source path,
/// a context'd anyhow elsewhere). `Cdp` carries the underlying
/// chromiumoxide error unchanged.
#[derive(Debug, Error)]
pub enum NavError {
#[error("navigation timed out after {0:?}")]
Timeout(Duration),
#[error(transparent)]
Cdp(#[from] CdpError),
}
/// Wait for the page's next navigation to complete, capped at
/// [`NAV_TIMEOUT`]. Replaces bare `page.wait_for_navigation().await`
/// throughout the crawler.
pub async fn wait_for_nav(page: &Page) -> Result<(), NavError> {
match tokio::time::timeout(NAV_TIMEOUT, page.wait_for_navigation()).await {
Err(_elapsed) => Err(NavError::Timeout(NAV_TIMEOUT)),
Ok(Err(e)) => Err(NavError::Cdp(e)),
Ok(Ok(_)) => Ok(()),
}
}
/// Poll interval for [`wait_for_selector`]. 100ms is fast enough that a
/// page rendering in 200ms isn't held back noticeably, and slow enough
/// not to spam CDP with `find_element` calls on a page that's actually
/// taking its time.
const SELECTOR_POLL_INTERVAL: Duration = Duration::from_millis(100);
/// Wait until `selector` matches at least one element on `page`, or
/// `timeout` elapses. Used after a navigation to confirm a page-type-
/// specific marker is in the DOM before parsing — replaces the fixed
/// post-nav sleep that previously masked partial-render races.
///
/// chromiumoxide 0.7.0 has no built-in `wait_for_selector`, so we poll
/// `find_element` at [`SELECTOR_POLL_INTERVAL`] until success or budget
/// exhaustion. A failed `find_element` is *not* an error here — it just
/// means "not yet" — we only surface an error once the overall
/// `timeout` is up.
pub async fn wait_for_selector(
page: &Page,
selector: &str,
timeout: Duration,
) -> Result<(), NavError> {
let deadline = tokio::time::Instant::now() + timeout;
loop {
if page.find_element(selector).await.is_ok() {
return Ok(());
}
if tokio::time::Instant::now() >= deadline {
return Err(NavError::Timeout(timeout));
}
let remaining = deadline.saturating_duration_since(tokio::time::Instant::now());
let sleep_for = SELECTOR_POLL_INTERVAL.min(remaining);
tokio::time::sleep(sleep_for).await;
}
}
/// Per-page-type budget for [`wait_for_selector`]. Shorter than
/// [`NAV_TIMEOUT`] because by the time we're waiting on a selector, the
/// page has already responded — we're only absorbing post-load JS
/// finishing its row injection, which on a healthy site takes well
/// under a second.
pub const SELECTOR_TIMEOUT: Duration = Duration::from_secs(10);
impl NavError {
/// Does this navigation error indicate the underlying Chromium
/// process has died or its CDP connection has dropped? Used by the
/// dispatcher to decide whether to invalidate the
/// [`crate::crawler::browser_manager::BrowserManager`] handle so
/// the next acquire re-launches.
///
/// Both variants count: a `Timeout` past [`NAV_TIMEOUT`] is in
/// practice always either a hung CDP transport or a wedged page
/// the browser can't recover from on its own, and a `Cdp` error
/// surfacing at the navigation layer means the chromium-facing
/// channel is the failing layer.
pub fn is_likely_browser_dead(&self) -> bool {
match self {
Self::Timeout(_) => true,
Self::Cdp(_) => true,
}
}
}
/// Walk an `anyhow::Error` chain looking for typed evidence that the
/// chromium-facing layer is the failing one. Two markers count:
///
/// 1. A wrapped [`NavError`] flagged by [`NavError::is_likely_browser_dead`].
/// 2. A wrapped [`CdpError`] (via `anyhow::Error::from(CdpError)` at a
/// `Browser::new_page` call site, or any other direct CDP boundary).
///
/// Earlier versions also substring-matched the chain for "connection",
/// "closed", "channel", etc. as a fallback. That was too broad —
/// reqwest TCP-reset errors during CDN image downloads, sqlx
/// connection-pool errors, and similar non-browser failures contain
/// those words and triggered spurious chromium relaunches. The typed
/// downcasts cover every place we hand a chromium error to anyhow,
/// so the fallback is unnecessary.
pub fn anyhow_looks_browser_dead(err: &anyhow::Error) -> bool {
for cause in err.chain() {
if let Some(nav) = cause.downcast_ref::<NavError>() {
if nav.is_likely_browser_dead() {
return true;
}
}
if cause.downcast_ref::<CdpError>().is_some() {
return true;
}
}
false
}
#[cfg(test)]
mod tests {
use super::*;
use std::future::pending;
/// Sanity-check the timeout pattern used by [`wait_for_nav`]: a
/// future that never resolves must yield `Elapsed` within the
/// configured budget. We can't easily stand up a real `Page` in a
/// unit test, so we assert the underlying primitive behaves the way
/// the helper depends on.
#[tokio::test(flavor = "current_thread", start_paused = true)]
async fn timeout_elapses_on_a_future_that_never_resolves() {
let result =
tokio::time::timeout(Duration::from_millis(50), pending::<()>()).await;
assert!(result.is_err(), "expected Elapsed on a hung future");
}
#[test]
fn nav_error_timeout_message_includes_duration() {
let e = NavError::Timeout(Duration::from_secs(30));
assert_eq!(e.to_string(), "navigation timed out after 30s");
}
#[test]
fn timeout_is_treated_as_likely_browser_dead() {
let e = NavError::Timeout(NAV_TIMEOUT);
assert!(e.is_likely_browser_dead());
}
#[test]
fn anyhow_with_nav_timeout_in_chain_is_flagged() {
let inner: Result<(), NavError> = Err(NavError::Timeout(NAV_TIMEOUT));
let outer = inner.unwrap_err();
let wrapped: anyhow::Error =
anyhow::Error::new(outer).context("wait for chapter nav");
assert!(anyhow_looks_browser_dead(&wrapped));
}
#[test]
fn anyhow_with_cdp_error_in_chain_is_flagged() {
// `Browser::new_page` errors get wrapped via
// `anyhow::Error::from(CdpError)` at the navigate / dispatch
// call sites. Walking the chain and downcasting to CdpError is
// what catches that path. Any CdpError variant counts; the
// Serde variant is the easiest to construct in a unit test.
let serde_err: serde_json::Error =
serde_json::from_str::<i32>("not a number").unwrap_err();
let cdp = CdpError::Serde(serde_err);
let wrapped: anyhow::Error =
anyhow::Error::from(cdp).context("open chapter page");
assert!(anyhow_looks_browser_dead(&wrapped));
}
#[test]
fn anyhow_with_innocuous_parse_error_is_not_flagged() {
let e: anyhow::Error =
anyhow::anyhow!("parse manga detail: chapter row regex did not match");
assert!(!anyhow_looks_browser_dead(&e));
}
#[test]
fn anyhow_with_reqwest_style_connection_message_is_not_flagged() {
// Regression: the earlier substring fallback flagged any error
// whose message contained "connection" or "closed" as browser-
// dead. A TCP reset from a CDN during image download, or a
// sqlx pool-connection error, would burn a chromium relaunch
// even though the browser is fine. Typed downcasts only —
// these untyped strings must pass through.
for msg in [
"error sending request: connection reset by peer",
"PoolTimedOut: timed out waiting for a connection",
"request to https://cdn/x.jpg: connection closed before message completed",
"transport error during image fetch",
] {
let e: anyhow::Error = anyhow::anyhow!("{msg}");
assert!(
!anyhow_looks_browser_dead(&e),
"must not flag non-browser error: {msg}"
);
}
}
/// Same sanity check as [`timeout_elapses_on_a_future_that_never_resolves`],
/// but for the [`wait_for_selector`] polling pattern: the loop must
/// surrender on `Elapsed` rather than spinning past the deadline.
#[tokio::test(flavor = "current_thread", start_paused = true)]
async fn selector_polling_pattern_surrenders_at_deadline() {
let timeout = Duration::from_millis(300);
let start = tokio::time::Instant::now();
let deadline = start + timeout;
// Simulate find_element forever returning "not found".
let mut polls = 0u32;
let result: Result<(), NavError> = loop {
polls += 1;
if tokio::time::Instant::now() >= deadline {
break Err(NavError::Timeout(timeout));
}
tokio::time::sleep(SELECTOR_POLL_INTERVAL).await;
};
assert!(matches!(result, Err(NavError::Timeout(_))));
// 300ms / 100ms poll interval ≈ 3 iterations plus the final check
// that breaks out. Allow some slack since the first poll happens
// before any sleep.
assert!(polls >= 3, "expected at least 3 poll iterations, got {polls}");
}
}
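
The typed-downcast approach in `anyhow_looks_browser_dead` can be illustrated with the standard library alone. The sketch below walks `std::error::Error::source()` instead of `anyhow::Error::chain()`; `BrowserGone` and `WithContext` are hypothetical stand-ins for `CdpError` and an anyhow context layer, introduced only for this example. It shows why message text is irrelevant to the decision: only the concrete type found in the chain counts.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for a typed browser-layer error such as `CdpError`.
#[derive(Debug)]
pub struct BrowserGone;

impl fmt::Display for BrowserGone {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "browser connection closed")
    }
}

impl Error for BrowserGone {}

/// Hypothetical stand-in for a context wrapper: a message plus an
/// optional underlying cause.
#[derive(Debug)]
pub struct WithContext {
    pub msg: String,
    pub cause: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for WithContext {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)
    }
}

impl Error for WithContext {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.cause.as_deref()
    }
}

/// Walk the cause chain and report whether any link is the typed
/// browser error. Message contents are never inspected.
pub fn looks_browser_dead(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if e.downcast_ref::<BrowserGone>().is_some() {
            return true;
        }
        current = e.source();
    }
    false
}

/// A chain that wraps the typed browser error under a context message.
pub fn typed_chain() -> WithContext {
    WithContext {
        msg: "open chapter page".into(),
        cause: Some(Box::new(BrowserGone)),
    }
}

/// An untyped error whose message merely mentions "connection closed".
pub fn untyped_chain() -> WithContext {
    WithContext {
        msg: "connection closed before message completed".into(),
        cause: None,
    }
}
```

Here `typed_chain()` is flagged and `untyped_chain()` is not, even though both mention a closed connection somewhere in their text. That is the regression the reqwest-style test above guards against.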


@@ -0,0 +1,813 @@
//! Crawler pipeline — the reusable metadata pass and the enqueue helpers
//! that fan out chapter-content work. Shared between the daemon (cron tick)
//! and the CLI (`bin/crawler.rs`) so behavior stays in lockstep.
use std::collections::HashSet;
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::browser_manager::BrowserManager;
use crate::crawler::jobs::{self, EnqueueResult, JobPayload};
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::{fetch_bytes_capped, looks_like_image, DownloadAllowlist};
use crate::crawler::source::target::TargetSource;
use crate::crawler::source::{FetchContext, Source, SourceMangaRef};
use crate::repo;
use crate::repo::crawler::UpsertStatus;
use crate::storage::Storage;
/// Coarse counters surfaced for logging at the end of a metadata pass.
#[derive(Debug, Default, Clone, Copy)]
pub struct MetadataStats {
pub discovered: usize,
pub upserted: usize,
pub covers_fetched: usize,
pub mangas_failed: usize,
}
/// Decide whether the per-ref loop should stop on the manga just
/// processed. The walk halts only when (a) the previous run exited
/// cleanly — so the index tail is known to be caught up and we're not
/// in a recovery sweep — AND (b) this manga's metadata hash matched
/// storage (`Unchanged`) AND (c) the chapter sync confirmed zero new
/// chapters. A `None` chapter count (skip_chapters, or a chapter-sync
/// error we logged-and-swallowed) refuses the stop because we can't
/// verify the tail is unchanged from a single piece of evidence.
///
/// Pure function so the rule is unit-testable without the walker, DB,
/// or browser.
pub(crate) fn should_stop(
was_clean: bool,
status: UpsertStatus,
chapters_new: Option<usize>,
) -> bool {
was_clean
&& matches!(status, UpsertStatus::Unchanged)
&& chapters_new == Some(0)
}
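
To make the stop rule concrete, the sketch below replays it over a small newest-first catalog. It is illustrative only: the `UpsertStatus` enum here is a local mirror whose `New` and `Updated` variants are assumptions (only `Unchanged` appears in the diff), and `walk_length` is a hypothetical helper that counts how many entries a walk would process before stopping.

```rust
/// Local mirror of the repo's status enum. `New` and `Updated` are
/// assumed variant names; only `Unchanged` is confirmed by the diff.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum UpsertStatus {
    New,
    Updated,
    Unchanged,
}

/// Same rule as `should_stop` in the diff.
pub fn should_stop(
    was_clean: bool,
    status: UpsertStatus,
    chapters_new: Option<usize>,
) -> bool {
    was_clean
        && matches!(status, UpsertStatus::Unchanged)
        && chapters_new == Some(0)
}

/// Hypothetical helper: how many catalog entries a newest-first walk
/// processes before the stop rule fires (or the catalog ends).
pub fn walk_length(
    was_clean: bool,
    catalog: &[(UpsertStatus, Option<usize>)],
) -> usize {
    let mut processed = 0;
    for &(status, chapters_new) in catalog {
        processed += 1;
        if should_stop(was_clean, status, chapters_new) {
            break;
        }
    }
    processed
}
```

For a catalog of `[(Updated, Some(2)), (Unchanged, Some(1)), (Unchanged, Some(0)), (Unchanged, Some(0))]`, a steady-state run (`was_clean = true`) stops after the third entry, while a recovery sweep (`was_clean = false`) processes all four. An entry with `chapters_new = None` never triggers the stop, matching the rule that an unverified chapter count cannot confirm the tail is caught up.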
/// Whether the just-finished walk should be recorded as a clean exit.
/// `true` writes the recovery flag back to `completed: true`; `false`
/// leaves it `false` so the next tick treats this run as crashed and
/// does a recovery sweep.
///
/// `hit_limit` (the caller-imposed `CRAWLER_LIMIT` cap) is *not* an
/// argument: a limit cap by definition does not reach the catalog tail,
/// so it can never count as a clean exit. Encoding that in the function
/// signature (rather than as an inline `&& !hit_limit` clause) prevents
/// a future edit from accidentally adding it back to the truth table.
pub(crate) fn should_mark_clean_exit(
walked_to_completion: bool,
hit_stop_condition: bool,
) -> bool {
walked_to_completion || hit_stop_condition
}
/// Runs the discover → fetch → upsert → cover → chapter-list-diff pipeline
/// for the target source. Pure metadata; chapter content is enqueued as
/// separate `SyncChapterContent` jobs by the caller after this returns.
///
/// `limit == 0` means no cap (full sweep up to the source's own bound).
/// `skip_chapters == true` is the "metadata-only" mode (parser doesn't
/// extract chapters, and `sync_manga_chapters` is skipped — otherwise an
/// empty chapter list would soft-drop existing rows). In this mode the
/// stop condition never fires because chapter freshness can't be
/// confirmed, so the walk always runs to end-of-source.
///
/// The walk is always newest-first. Steady-state runs stop on the first
/// manga where metadata is `Unchanged` AND chapter sync reports zero
/// new chapters — the source orders by `update_date DESC`, so anything
/// with a fresh chapter or fresh metadata is bumped to the top and will
/// be processed before we hit a fully-caught-up manga.
///
/// A per-source recovery flag stored in `crawler_state`
/// (`last_run_completed:<source_id>`) gates the early stop: it's set to
/// `false` right after `ensure_source` and back to `true` only when the
/// run exits via end-of-walk OR the intentional stop. A crash, panic,
/// or SIGKILL leaves the flag at `false`, so the next tick reads it,
/// recognizes the previous run did not exit cleanly, and walks the
/// full catalog (ignoring the stop condition) to re-cover anything the
/// crashed run missed past its crash point. Once that recovery sweep
/// reaches end-of-walk, steady-state resumes.
#[allow(clippy::too_many_arguments)]
pub async fn run_metadata_pass(
browser_manager: &BrowserManager,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
start_url: &str,
limit: usize,
skip_chapters: bool,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
tor: Option<&crate::crawler::tor::TorController>,
) -> anyhow::Result<MetadataStats> {
let lease = browser_manager
.acquire()
.await
.context("acquire browser lease for metadata pass")?;
let browser_ref: &chromiumoxide::Browser = &lease;
let source = {
let s = TargetSource::new(start_url.to_string());
if skip_chapters {
s.without_chapter_parsing()
} else {
s
}
};
let ctx = FetchContext {
browser: browser_ref,
rate,
tor,
};
let source_id = source.id();
repo::crawler::ensure_source(
db,
source_id,
"Target Site",
&origin_of(start_url).unwrap_or_else(|| start_url.to_string()),
)
.await
.context("ensure_source")?;
// Read BEFORE flipping to "in-flight" — a `false` here means the
// previous run didn't reach a clean exit, and this run must walk
// the full catalog (recovery sweep) instead of bailing on the
// first caught-up manga.
let was_clean = repo::crawler::last_run_completed_cleanly(db, source_id)
.await
.context("read last_run_completed_cleanly")?;
repo::crawler::mark_run_started(db, source_id)
.await
.context("mark_run_started")?;
let max_refs = (limit > 0).then_some(limit);
tracing::info!(was_clean, ?max_refs, "starting metadata pass");
let mut walker = source
.discover(&ctx)
.await
.context("discover failed")?;
let mut stats = MetadataStats::default();
// Run-scoped dedup of `source_manga_key`s already processed this pass.
// When the source index shifts mid-walk, the last item of the page we
// just read can reappear as the first item of the next page; skipping
// it here prevents a redundant fetch_manga + upsert and keeps a
// re-confirm of an already-counted entry from spuriously tripping the
// stop condition.
let mut seen: HashSet<String> = HashSet::new();
let mut walked_to_completion = false;
let mut hit_limit = false;
let mut hit_stop_condition = false;
'outer: loop {
let batch = match walker.next_batch(&ctx).await? {
Some(b) => b,
None => {
walked_to_completion = true;
break;
}
};
for r in batch {
if max_refs.map(|m| stats.discovered >= m).unwrap_or(false) {
hit_limit = true;
tracing::info!(cap = ?max_refs, "limit reached; halting walk");
break 'outer;
}
// Skip refs we've already *successfully* processed this pass.
// Checking `contains` here (rather than `insert`) keeps the key
// out of `seen` on failure paths below, so a transient fetch or
// upsert error gets a second chance if the ref reappears in
// another batch. Done *before* counting toward
// `stats.discovered` (the skipped ref did no work) and *before*
// touching the stop check (a `continue` here doesn't let a
// re-confirm trip the stop condition). The matching
// `seen.insert(...)` lives just after the successful upsert
// below.
if seen.contains(&r.source_manga_key) {
tracing::debug!(
key = %r.source_manga_key,
"skip already-seen key in this run"
);
continue;
}
stats.discovered += 1;
tracing::info!(
idx = stats.discovered,
key = %r.source_manga_key,
"fetching metadata"
);
let manga = match source.fetch_manga(&ctx, &r).await {
Ok(m) => m,
Err(e) => {
tracing::warn!(
key = %r.source_manga_key,
url = %r.url,
error = ?e,
"fetch_manga failed"
);
stats.mangas_failed += 1;
continue;
}
};
// Partial-render guard: an empty chapter list paired with a
// prior count > 0 is overwhelmingly a chromium snapshot
// taken between the #chapter_table wrapper render and its
// rows render. The wait_for_selector wait in `navigate`
// narrows this window but cannot close it for slow renders
// beyond the selector budget. Treat as a transient failure
// here — skip upsert, skip seen.insert — so the next batch
// (or the next tick) retries. Skipped in `skip_chapters`
// mode because the parser is configured to return an empty
// Vec by design there.
if !skip_chapters && manga.chapters.is_empty() {
match repo::crawler::live_chapter_count_for_source_manga(
db, source_id, &r.source_manga_key,
)
.await
{
Ok(prior) if prior > 0 => {
tracing::warn!(
key = %r.source_manga_key,
url = %r.url,
prior_chapter_count = prior,
"fetch_manga returned empty chapters but prior count > 0; treating as partial-render transient and skipping"
);
stats.mangas_failed += 1;
continue;
}
Ok(_) => {}
Err(e) => {
// DB lookup failed — fail safe: skip rather
// than risk a soft-drop on a manga whose prior
// count we couldn't confirm.
tracing::warn!(
key = %r.source_manga_key,
error = ?e,
"live_chapter_count_for_source_manga failed; skipping cautiously"
);
stats.mangas_failed += 1;
continue;
}
}
}
let upsert = match repo::crawler::upsert_manga_from_source(
db, source_id, &r.url, &manga,
)
.await
{
Ok(u) => u,
Err(e) => {
tracing::error!(
key = %r.source_manga_key,
error = ?e,
"upsert_manga_from_source failed"
);
stats.mangas_failed += 1;
continue;
}
};
stats.upserted += 1;
// Record success in the dedup set. Cover and chapter-sync
// failures below are non-fatal and don't roll this back —
// metadata is the durable source of truth for the dedup.
seen.insert(r.source_manga_key.clone());
tracing::info!(
key = %manga.source_manga_key,
manga_id = %upsert.manga_id,
status = ?upsert.status,
title = %manga.title,
"manga upserted"
);
// Cover image: download when missing in storage or when metadata
// signaled an update (cover URL is part of metadata_hash, so
// Updated implies the URL may have moved). Failures are non-fatal.
let needs_cover = upsert.cover_image_path.is_none()
|| matches!(upsert.status, repo::crawler::UpsertStatus::Updated);
if needs_cover {
if let Some(cover_url) = manga.cover_url.as_deref() {
match download_and_store_cover(
db,
storage,
http,
rate,
&r.url,
upsert.manga_id,
cover_url,
allowlist,
max_image_bytes,
)
.await
{
Ok(()) => stats.covers_fetched += 1,
Err(e) => tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"cover download failed"
),
}
}
}
// Chapter sync. `chapters_new` feeds the stop check below:
// `None` (skip_chapters mode, or a logged-and-swallowed sync
// error) refuses to stop on this manga because we can't
// confirm "no new chapters."
let chapters_new: Option<usize> = if skip_chapters {
None
} else {
match repo::crawler::sync_manga_chapters(
db,
source_id,
upsert.manga_id,
&manga.chapters,
)
.await
{
Ok(diff) => {
tracing::info!(
manga_id = %upsert.manga_id,
new = diff.new,
refreshed = diff.refreshed,
dropped = diff.dropped,
"chapters synced"
);
Some(diff.new)
}
Err(e) => {
tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"chapter sync failed"
);
None
}
}
};
if should_stop(was_clean, upsert.status, chapters_new) {
hit_stop_condition = true;
tracing::info!(
key = %manga.source_manga_key,
"stop condition met (Unchanged metadata + 0 new chapters); halting walk"
);
break 'outer;
}
}
}
// Recovery-flag write. Only on a clean exit (end-of-walk OR the
// intentional stop). `hit_limit` is a caller-imposed early break
// and does NOT count — the catalog tail wasn't reached, so a future
// tick still needs to walk past where we stopped. The truth table is
// pinned by `should_mark_clean_exit` so a future edit that adds
// `hit_limit` back into the disjunction trips its unit test. Flag-
// write errors are warned and swallowed: the run already did its
// work, and a stale `false` flag just buys a recovery sweep on the
// next tick.
let exited_cleanly = should_mark_clean_exit(walked_to_completion, hit_stop_condition);
if exited_cleanly {
if let Err(e) = repo::crawler::mark_run_completed(db, source_id).await {
tracing::warn!(error = ?e, "mark_run_completed failed");
}
}
tracing::info!(
was_clean,
discovered = stats.discovered,
upserted = stats.upserted,
covers_fetched = stats.covers_fetched,
mangas_failed = stats.mangas_failed,
walked_to_completion,
hit_limit,
hit_stop_condition,
exited_cleanly,
"metadata pass complete"
);
drop(lease);
Ok(stats)
}
/// Quarantine window for chapters whose latest `SyncChapterContent` job is
/// `dead`. The partial dedup index `crawler_jobs_chapter_content_dedup_idx`
/// only blocks `(pending|running)` duplicates, so without this gate a
/// permanently-failing chapter is re-enqueued every cron tick, burns
/// `max_attempts` retries, dies again, and spins forever. With the gate,
/// dead chapters get a week of silence before the next attempt: long
/// enough for a transient site issue to resolve, short enough that a
/// chapter is retried soon after the underlying problem is fixed.
const CHAPTER_DEAD_QUARANTINE_DAYS: i64 = 7;
/// Enqueue a `SyncChapterContent` job for every chapter of *any* bookmarked
/// manga that still has `page_count = 0` and a non-dropped source row.
/// Chapters with a `dead` job updated within the last
/// `CHAPTER_DEAD_QUARANTINE_DAYS` days are excluded to break the
/// dead-letter spin. Returns an [`EnqueueSummary`] with `inserted`,
/// `skipped`, and `failed` counts; the dedup index handles repeats.
pub async fn enqueue_bookmarked_pending(pool: &PgPool) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.page_count = 0
AND cs.dropped_at IS NULL
AND NOT EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.payload->>'kind' = 'sync_chapter_content'
AND cj.payload->>'chapter_id' = c.id::text
AND cj.state = 'dead'
AND cj.updated_at > now() - ($1::bigint || ' days')::interval
)
GROUP BY cs.source_id, c.id, cs.source_chapter_key, c.manga_id, c.number, c.created_at
ORDER BY c.manga_id, c.number ASC, c.created_at ASC
"#,
)
.bind(CHAPTER_DEAD_QUARANTINE_DAYS)
.fetch_all(pool)
.await
.context("query bookmarked-pending chapters")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
/// Enqueue chapter-content jobs for a *single* manga (the bookmark-create
/// hook). Same dedup semantics as [`enqueue_bookmarked_pending`], including
/// the dead-letter quarantine — a freshly bookmarked manga should not
/// burn retries on chapters that just died on the cron tick.
pub async fn enqueue_pending_for_manga(
pool: &PgPool,
manga_id: Uuid,
) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.manga_id = $1
AND c.page_count = 0
AND cs.dropped_at IS NULL
AND NOT EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.payload->>'kind' = 'sync_chapter_content'
AND cj.payload->>'chapter_id' = c.id::text
AND cj.state = 'dead'
AND cj.updated_at > now() - ($2::bigint || ' days')::interval
)
GROUP BY cs.source_id, c.id, cs.source_chapter_key, c.number, c.created_at
ORDER BY c.number ASC, c.created_at ASC, cs.source_id
"#,
)
.bind(manga_id)
.bind(CHAPTER_DEAD_QUARANTINE_DAYS)
.fetch_all(pool)
.await
.context("query pending chapters for manga")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
#[derive(Debug, Default, Clone, Copy)]
pub struct EnqueueSummary {
pub inserted: usize,
pub skipped: usize,
pub failed: usize,
}
#[derive(Debug, Default, Clone, Copy)]
pub struct CoverBackfillStats {
pub considered: usize,
pub fetched: usize,
pub failed: usize,
}
/// Default per-tick cap for [`backfill_missing_covers`]. The metadata pass
/// already retries covers when its walk reaches the affected manga; this
/// backfill exists to catch the residual case where the early-stop
/// optimisation prevents the walk from reaching mangas whose cover failed
/// on first attempt. A small cap is enough because the backlog only grows
/// from sporadic download failures, not from systematic misses.
pub const COVER_BACKFILL_DEFAULT_MAX: usize = 10;
/// Re-attempt cover downloads for mangas where `cover_image_path IS NULL`
/// but a live `manga_sources` row exists. Refetches the source detail
/// page (which is where the cover URL lives) and downloads the cover.
///
/// Bounded by `max_mangas` per call so a steady stream of failing covers
/// (e.g. a CDN host that persistently returns 502) can't monopolise a cron
/// tick. Orders by `manga_sources.last_seen_at DESC` so the freshest
/// missing-cover mangas are addressed first.
///
/// Failures are logged and counted, not raised: a single bad cover URL
/// must not stall every other backfill behind it.
#[allow(clippy::too_many_arguments)]
pub async fn backfill_missing_covers(
browser_manager: &BrowserManager,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
max_mangas: usize,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
tor: Option<&crate::crawler::tor::TorController>,
) -> anyhow::Result<CoverBackfillStats> {
let mut stats = CoverBackfillStats::default();
if max_mangas == 0 {
return Ok(stats);
}
let entries = repo::crawler::list_missing_covers(db, max_mangas as i64)
.await
.context("list_missing_covers")?;
if entries.is_empty() {
return Ok(stats);
}
let lease = browser_manager
.acquire()
.await
.context("acquire browser lease for cover backfill")?;
let browser_ref: &chromiumoxide::Browser = &lease;
let ctx = FetchContext { browser: browser_ref, rate, tor };
for entry in entries {
stats.considered += 1;
// Metadata-only TargetSource: skip chapter-list parsing so a
// missing-cover refetch doesn't soft-drop chapters on a partial
// render. Cover URL alone is what we need.
let source = TargetSource::new(entry.source_url.clone()).without_chapter_parsing();
let r = SourceMangaRef {
source_manga_key: entry.source_manga_key.clone(),
title: String::new(),
url: entry.source_url.clone(),
};
let cover_url = match source.fetch_manga(&ctx, &r).await {
Ok(manga) => manga.cover_url,
Err(e) => {
tracing::warn!(
manga_id = %entry.manga_id,
url = %entry.source_url,
error = ?e,
"cover backfill: fetch_manga failed"
);
stats.failed += 1;
continue;
}
};
let Some(cover_url) = cover_url else {
tracing::warn!(
manga_id = %entry.manga_id,
url = %entry.source_url,
"cover backfill: source returned no cover_url"
);
stats.failed += 1;
continue;
};
match download_and_store_cover(
db,
storage,
http,
rate,
&entry.source_url,
entry.manga_id,
&cover_url,
allowlist,
max_image_bytes,
)
.await
{
Ok(()) => stats.fetched += 1,
Err(e) => {
tracing::warn!(
manga_id = %entry.manga_id,
url = %entry.source_url,
error = ?e,
"cover backfill: download failed"
);
stats.failed += 1;
}
}
}
drop(lease);
Ok(stats)
}
/// Download a cover image and persist its storage path. Local to the
/// pipeline because the CLI still calls it from its inline chapter-content
/// loop; once the worker pool fully replaces that path we can fold this
/// into `pipeline` proper.
#[allow(clippy::too_many_arguments)]
pub(crate) async fn download_and_store_cover(
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
manga_url: &str,
manga_id: Uuid,
cover_url: &str,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
) -> anyhow::Result<()> {
let absolute = reqwest::Url::parse(manga_url)
.context("parse manga URL")?
.join(cover_url)
.context("join cover URL onto manga URL")?;
rate.wait_for(absolute.as_str()).await?;
let bytes = fetch_bytes_capped(
http,
absolute.as_str(),
Some(manga_url),
allowlist,
max_image_bytes,
)
.await?;
if !looks_like_image(&bytes) {
anyhow::bail!(
"cover URL {absolute} returned non-image bytes; refusing to store as binary blob"
);
}
let ext = infer::get(&bytes)
.map(|k| k.extension())
.context("infer could not determine the cover image type")?;
let key = format!("mangas/{manga_id}/cover.{ext}");
storage
.put(&key, &bytes)
.await
.with_context(|| format!("store cover at {key}"))?;
repo::manga::set_cover_image_path(db, manga_id, &key)
.await
.with_context(|| format!("update cover_image_path for {manga_id}"))?;
tracing::info!(
manga_id = %manga_id,
key = %key,
bytes = bytes.len(),
%absolute,
"cover stored"
);
Ok(())
}
use crate::crawler::url_utils::origin_of;
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn stop_condition_fires_on_unchanged_metadata_and_zero_new_chapters() {
// The whole point of the rule: in steady state, a manga whose
// metadata hash matches AND whose chapter list gained no new
// entries proves we've reached the caught-up tail of a
// newest-first index.
assert!(should_stop(true, UpsertStatus::Unchanged, Some(0)));
}
#[test]
fn stop_condition_refuses_when_chapters_added() {
// Unchanged metadata + N new chapters means the source bumped
// this manga because of the chapter add; the rest of the index
// is still ahead of us. Don't bail.
assert!(!should_stop(true, UpsertStatus::Unchanged, Some(1)));
assert!(!should_stop(true, UpsertStatus::Unchanged, Some(42)));
}
#[test]
fn stop_condition_refuses_when_metadata_changed() {
// Updated or New metadata always continues — even with zero new
// chapters — because the change-of-metadata bump itself is what
// the walk is following.
assert!(!should_stop(true, UpsertStatus::Updated, Some(0)));
assert!(!should_stop(true, UpsertStatus::New, Some(0)));
}
#[test]
fn stop_condition_refuses_when_chapter_count_unknown() {
// skip_chapters mode (CLI metadata-only sweep) or a
// logged-and-swallowed chapter sync error: we can't claim "no
// new chapters" from absence of evidence, so don't stop. The
// operator who runs metadata-only intentionally wants a full
// walk anyway.
assert!(!should_stop(true, UpsertStatus::Unchanged, None));
}
#[test]
fn stop_condition_disabled_in_recovery_mode() {
// was_clean = false means the previous run did not exit cleanly;
// the catalog past its crash point is potentially un-synced. Walk
// to end-of-source no matter what individual mangas report.
assert!(!should_stop(false, UpsertStatus::Unchanged, Some(0)));
assert!(!should_stop(false, UpsertStatus::Unchanged, Some(1)));
assert!(!should_stop(false, UpsertStatus::Updated, Some(0)));
assert!(!should_stop(false, UpsertStatus::New, None));
}
#[test]
fn clean_exit_when_walked_to_completion() {
// End-of-walk reached the catalog tail — the recovery flag may
// safely flip back to `true`.
assert!(should_mark_clean_exit(true, false));
}
#[test]
fn clean_exit_when_stop_condition_fired() {
// First Unchanged + 0-new-chapter manga is a complete steady-
// state exit: every manga newer than this point was synced, and
// by source-side `update_date DESC` ordering everything past
// this point is at least as caught-up.
assert!(should_mark_clean_exit(false, true));
}
#[test]
fn dirty_exit_when_neither_completion_nor_stop_fired() {
// The walk ended for some other reason — including the
// caller-imposed `hit_limit` cap, which is the regression case
// this test exists for. `should_mark_clean_exit` does not take
// `hit_limit` as a parameter, so a future edit that adds
// `|| hit_limit` to the inline expression in `run_metadata_pass`
// would need to also touch this helper, and would fail this
// assertion when it did.
assert!(!should_mark_clean_exit(false, false));
}
#[test]
fn run_scoped_seen_set_skips_duplicate_source_manga_keys() {
// Pins the per-ref loop contract: `contains` gates whether work
// runs, and `insert` only fires on the success path (after upsert).
// A failed ref that reappears later in the same pass must get a
// second chance — that's why the loop uses contains-then-insert
// instead of insert-and-skip-on-collision.
let mut seen: HashSet<String> = HashSet::new();
// First sighting of a key: not yet seen → loop proceeds.
assert!(!seen.contains("manga-a"), "first sighting is unseen");
// Simulate a failed fetch_manga: do NOT insert. Next sighting must
// still be considered unseen so the loop retries it.
assert!(!seen.contains("manga-a"), "failed key is still retryable");
// Now simulate a successful upsert — insert is called.
seen.insert("manga-a".to_string());
// Subsequent sightings of the same key are skipped.
assert!(seen.contains("manga-a"), "successful key is now seen");
// Distinct keys never collide.
assert!(!seen.contains("manga-b"), "different key independent");
seen.insert("manga-b".to_string());
assert!(seen.contains("manga-b"));
assert!(seen.contains("manga-a"), "first key still recorded");
}
}
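
The recovery-flag lifecycle described on `run_metadata_pass` can be sketched without the walker, DB, or browser. The following is a minimal, std-only simulation, not the crate's code: `Entry`, `Pass`, and `simulate_pass` are hypothetical names, the upsert status is collapsed to a `bool`, and the catalog is assumed to be a plain newest-first slice.

```rust
/// Hypothetical stand-in for what the walk learns about one catalog entry.
#[derive(Debug, Clone, Copy)]
struct Entry {
    metadata_unchanged: bool,
    new_chapters: usize,
}

/// Same shape as the stop rule: only a clean previous run may stop early,
/// and only on an entry that is fully caught up.
fn should_stop(was_clean: bool, e: Entry) -> bool {
    was_clean && e.metadata_unchanged && e.new_chapters == 0
}

/// Result of one simulated pass.
#[derive(Debug, PartialEq)]
struct Pass {
    /// Entries actually processed before the walk ended.
    processed: usize,
    /// Value of the recovery flag after the pass returns.
    flag_after: bool,
}

/// Walks `catalog` newest-first. `limit == 0` means no cap.
fn simulate_pass(catalog: &[Entry], was_clean: bool, limit: usize) -> Pass {
    // The flag is conceptually `false` (in flight) from here on.
    let mut processed = 0;
    let mut walked_to_completion = true;
    let mut hit_stop_condition = false;

    for e in catalog {
        if limit > 0 && processed >= limit {
            // Caller-imposed cap: an early break that is NOT a clean exit.
            walked_to_completion = false;
            break;
        }
        processed += 1;
        if should_stop(was_clean, *e) {
            hit_stop_condition = true;
            walked_to_completion = false;
            break;
        }
    }

    // Only end-of-walk or the intentional stop flips the flag back to true.
    let flag_after = walked_to_completion || hit_stop_condition;
    Pass { processed, flag_after }
}
```

With a four-entry catalog of one changed entry, one entry with new chapters, and two caught-up entries, a steady-state pass processes three entries and leaves the flag `true`. A recovery pass (`was_clean == false`) processes all four. A pass capped at `limit == 2` processes two and leaves the flag `false`, so the next pass runs as a recovery sweep.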


@@ -1,11 +1,22 @@
//! Per-host request pacing.
//!
//! `RateLimiter` is a single-token bucket: each `wait().await` returns
//! immediately when at least `interval` has elapsed since the last call,
//! otherwise sleeps just enough to satisfy it. Uses
//! `tokio::time::Instant` so tests can run under `start_paused` virtual
//! time without sleeping for real.
//!
//! `HostRateLimiters` is the multi-host wrapper actually used by the
//! crawler — concurrent workers issuing requests to different origins
//! (catalog vs. CDN) don't contend on a shared budget; each host gets
//! its own bucket. `wait_for(url)` extracts the host, lazily creates a
//! limiter for it, and serializes only against other callers hitting
//! the same host.
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;
#[derive(Debug)]
@@ -33,6 +44,64 @@ impl RateLimiter {
}
}
/// Per-host rate limiter map. The outer `Mutex<HashMap>` is held only
/// during the entry-or-insert + Arc clone; the per-host `Mutex<RateLimiter>`
/// is held during the actual `wait().await`. So N workers calling
/// `wait_for(url)` on N different hosts contend nowhere except the brief
/// HashMap lookup; workers hitting the same host serialize on that
/// host's bucket.
#[derive(Debug)]
pub struct HostRateLimiters {
default_interval: Duration,
overrides: HashMap<String, Duration>,
map: Mutex<HashMap<String, Arc<Mutex<RateLimiter>>>>,
}
impl HostRateLimiters {
pub fn new(default_interval: Duration) -> Self {
Self {
default_interval,
overrides: HashMap::new(),
map: Mutex::new(HashMap::new()),
}
}
/// Set a per-host interval that overrides `default_interval`. Calls
/// after a host's limiter has been instantiated do *not* re-create
/// it — set all overrides before the first `wait_for` to that host.
pub fn with_override(mut self, host: impl Into<String>, interval: Duration) -> Self {
self.overrides.insert(host.into(), interval);
self
}
/// Wait (asynchronously) until the per-host budget allows the next
/// request to `url`'s host. Returns an error only when the URL has no
/// host (malformed input).
pub async fn wait_for(&self, url: &str) -> anyhow::Result<()> {
let host = host_of(url)
.ok_or_else(|| anyhow::anyhow!("no host in url: {url}"))?;
let limiter = {
let mut map = self.map.lock().await;
map.entry(host.clone())
.or_insert_with(|| {
let interval = self
.overrides
.get(&host)
.copied()
.unwrap_or(self.default_interval);
Arc::new(Mutex::new(RateLimiter::new(interval)))
})
.clone()
};
limiter.lock().await.wait().await;
Ok(())
}
}
// `host_of` was duplicated across session/rate_limit/pipeline; the
// canonical version now lives in `crawler::url_utils`.
use crate::crawler::url_utils::host_of;
#[cfg(test)]
mod tests {
use super::*;
@@ -66,4 +135,44 @@ mod tests {
// Already 250ms past — no further wait needed.
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[test]
fn host_of_parses_scheme_path_and_port() {
assert_eq!(host_of("https://Example.com/path").as_deref(), Some("example.com"));
assert_eq!(host_of("http://cdn.foo.bar/img.jpg").as_deref(), Some("cdn.foo.bar"));
assert_eq!(host_of("http://localhost:8080/x").as_deref(), Some("localhost"));
assert!(host_of("not a url").is_none());
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_pace_per_host() {
// Two hosts at 100ms each. Two consecutive calls to the SAME
// host wait 100ms total. Two consecutive calls to DIFFERENT
// hosts both fire immediately.
let rl = HostRateLimiters::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
rl.wait_for("https://b.example/y").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::ZERO, "different hosts don't contend");
let t1 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
assert_eq!(
Instant::now() - t1,
Duration::from_millis(100),
"second call to same host waits a full interval"
);
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_honor_overrides() {
let rl = HostRateLimiters::new(Duration::from_millis(1000))
.with_override("fast.example", Duration::from_millis(100));
rl.wait_for("https://fast.example/a").await.unwrap();
let t0 = Instant::now();
rl.wait_for("https://fast.example/b").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
}
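
The per-host pacing rule can also be illustrated as a pure scheduling function. This is a std-only sketch under stated assumptions, not the crate's `RateLimiter`: it takes the current time as an argument instead of reading `tokio::time::Instant`, it returns the delay instead of sleeping, and it omits the mutexes. `Bucket`, `HostBuckets`, and `reserve` are hypothetical names.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Single-token bucket over an abstract clock (time elapsed since start).
struct Bucket {
    interval: Duration,
    /// Earliest time at which the next request may go out.
    next_free: Duration,
}

impl Bucket {
    fn new(interval: Duration) -> Self {
        Self { interval, next_free: Duration::ZERO }
    }

    /// Returns how long a caller arriving at `now` has to wait, and
    /// reserves the slot so the following caller is pushed one interval on.
    fn reserve(&mut self, now: Duration) -> Duration {
        let start = now.max(self.next_free);
        self.next_free = start + self.interval;
        start - now
    }
}

/// One bucket per host, created lazily on first use.
struct HostBuckets {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    buckets: HashMap<String, Bucket>,
}

impl HostBuckets {
    fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            buckets: HashMap::new(),
        }
    }

    /// Builder-style override; only affects hosts not yet instantiated.
    fn with_override(mut self, host: &str, interval: Duration) -> Self {
        self.overrides.insert(host.to_string(), interval);
        self
    }

    /// Delay for a request to `host` arriving at `now`.
    fn reserve(&mut self, host: &str, now: Duration) -> Duration {
        let interval = self
            .overrides
            .get(host)
            .copied()
            .unwrap_or(self.default_interval);
        self.buckets
            .entry(host.to_string())
            .or_insert_with(|| Bucket::new(interval))
            .reserve(now)
    }
}
```

Keeping the clock as a parameter makes the rule testable with plain assertions. The production code gets the same testability from `tokio::time::Instant` under `start_paused`.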


@@ -0,0 +1,277 @@
//! Admin-triggered resync of a single manga's metadata + cover, or a
//! single chapter's content.
//!
//! The cron tick already retries covers and chapter content on its own
//! schedule. This module exists for the operator-controlled path:
//! "this manga's metadata is stale / its cover never landed / this
//! chapter is broken — pull from source now, not at the next daily
//! tick." Wired into the admin API, never into the queue, so the work
//! happens synchronously with the HTTP request and the admin sees the
//! refreshed row in the response.
//!
//! Shares the daemon's [`BrowserManager`], rate limiter, HTTP client,
//! and TOR controller so a force resync respects the same per-host
//! pacing and recircuit budget the daily crawl uses — admin actions
//! must not let an operator accidentally hammer the source.
use std::sync::Arc;
use anyhow::Context;
use async_trait::async_trait;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::browser_manager::BrowserManager;
use crate::crawler::content::{self, SyncOutcome};
use crate::crawler::pipeline;
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::DownloadAllowlist;
use crate::crawler::source::target::TargetSource;
use crate::crawler::source::{FetchContext, Source, SourceMangaRef};
use crate::crawler::tor::TorController;
use crate::repo;
use crate::repo::crawler::UpsertStatus;
use crate::storage::Storage;
/// Outcome of [`ResyncService::resync_manga`]. Mirrors the bits the
/// admin UI cares about — was the row actually re-upserted, did the
/// cover land — so the response can show "metadata refreshed, cover
/// re-downloaded" or "metadata unchanged" without a second round-trip.
#[derive(Debug, Clone, Copy)]
pub struct MangaResyncOutcome {
pub manga_id: Uuid,
pub metadata_status: UpsertStatus,
pub cover_fetched: bool,
}
/// Outcome of [`ResyncService::resync_chapter`]. `Fetched { pages, .. }`
/// is the success case; `Skipped` carries a human-readable `reason`. A
/// chapter with no live source is reported as
/// [`ResyncError::NoChapterSource`] rather than `Skipped`.
#[derive(Debug, Clone)]
pub enum ChapterResyncOutcome {
Fetched { chapter_id: Uuid, pages: usize },
Skipped { chapter_id: Uuid, reason: String },
}
/// Service exposed by the daemon to the admin API. Optional on
/// [`AppState`] — `None` when the crawler daemon is disabled
/// (`CRAWLER_DAEMON=false`), in which case admin handlers return 503.
#[async_trait]
pub trait ResyncService: Send + Sync {
async fn resync_manga(&self, manga_id: Uuid) -> anyhow::Result<MangaResyncOutcome>;
async fn resync_chapter(&self, chapter_id: Uuid) -> anyhow::Result<ChapterResyncOutcome>;
}
/// Errors with a stable shape so the API layer can map them to the
/// right HTTP status (404 vs 422 vs 5xx). Anything else surfaces as a
/// generic 500.
#[derive(Debug, thiserror::Error)]
pub enum ResyncError {
#[error("manga has no source to resync from")]
NoMangaSource,
#[error("chapter has no source to resync from")]
NoChapterSource,
}
pub struct RealResyncService {
pub browser_manager: Arc<BrowserManager>,
pub db: PgPool,
pub storage: Arc<dyn Storage>,
pub http: reqwest::Client,
pub rate: Arc<HostRateLimiters>,
pub download_allowlist: DownloadAllowlist,
pub max_image_bytes: usize,
pub tor: Option<Arc<TorController>>,
}
#[async_trait]
impl ResyncService for RealResyncService {
async fn resync_manga(&self, manga_id: Uuid) -> anyhow::Result<MangaResyncOutcome> {
// Pick the freshest live source row. Multi-source mangas
// (theoretical — only one Source impl today) get the row whose
// `last_seen_at` is newest; soft-dropped rows are skipped.
let row: Option<(String, String, String)> = sqlx::query_as(
"SELECT source_id, source_manga_key, source_url \
FROM manga_sources \
WHERE manga_id = $1 AND dropped_at IS NULL \
ORDER BY last_seen_at DESC \
LIMIT 1",
)
.bind(manga_id)
.fetch_optional(&self.db)
.await
.context("look up manga_sources for resync")?;
let Some((_source_id, source_manga_key, source_url)) = row else {
return Err(ResyncError::NoMangaSource.into());
};
let lease = self
.browser_manager
.acquire()
.await
.context("acquire browser lease for manga resync")?;
let browser_ref: &chromiumoxide::Browser = &lease;
let ctx = FetchContext {
browser: browser_ref,
rate: &self.rate,
tor: self.tor.as_deref(),
};
// Parse chapters too: a force resync is "make this manga fully
// current," not just metadata. `run_metadata_pass` guards against
// partial renders; we replicate the same caution here by skipping
// the chapter sync when the parser returned an empty list but the
// manga previously had chapters.
let source = TargetSource::new(source_url.clone());
let r = SourceMangaRef {
source_manga_key: source_manga_key.clone(),
title: String::new(),
url: source_url.clone(),
};
let manga = source
.fetch_manga(&ctx, &r)
.await
.with_context(|| format!("fetch_manga during resync of {manga_id}"))?;
// Partial-render guard: same condition as run_metadata_pass. This
// block only logs; the chapter sync further down re-checks the
// prior count before it runs.
let source_id = source.id();
if manga.chapters.is_empty() {
let prior = repo::crawler::live_chapter_count_for_source_manga(
&self.db,
source_id,
&source_manga_key,
)
.await
.unwrap_or(0);
if prior > 0 {
tracing::warn!(
%manga_id,
source_url = %source_url,
"resync_manga: fetch returned empty chapters but prior count > 0; skipping chapter sync to avoid soft-drop"
);
}
}
let upsert = repo::crawler::upsert_manga_from_source(
&self.db,
source_id,
&source_url,
&manga,
)
.await
.with_context(|| format!("upsert_manga_from_source during resync of {manga_id}"))?;
// Cover refetch: force-download regardless of UpsertStatus.
// Admin clicked "resync" because they want the cover too.
let mut cover_fetched = false;
if let Some(cover_url) = manga.cover_url.as_deref() {
match pipeline::download_and_store_cover(
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
&source_url,
upsert.manga_id,
cover_url,
&self.download_allowlist,
self.max_image_bytes,
)
.await
{
Ok(()) => cover_fetched = true,
Err(e) => tracing::warn!(
%manga_id,
error = ?e,
"resync_manga: cover download failed"
),
}
}
// Chapter sync — only when the partial-render guard above
// didn't bail.
let prior_chapter_count = repo::crawler::live_chapter_count_for_source_manga(
&self.db,
source_id,
&source_manga_key,
)
.await
.unwrap_or(0);
if !manga.chapters.is_empty() || prior_chapter_count == 0 {
match repo::crawler::sync_manga_chapters(
&self.db,
source_id,
upsert.manga_id,
&manga.chapters,
)
.await
{
Ok(diff) => tracing::info!(
%manga_id,
new = diff.new,
refreshed = diff.refreshed,
dropped = diff.dropped,
"resync_manga: chapters synced"
),
Err(e) => tracing::warn!(
%manga_id,
error = ?e,
"resync_manga: chapter sync failed"
),
}
}
drop(lease);
Ok(MangaResyncOutcome {
manga_id: upsert.manga_id,
metadata_status: upsert.status,
cover_fetched,
})
}
async fn resync_chapter(&self, chapter_id: Uuid) -> anyhow::Result<ChapterResyncOutcome> {
let row = repo::chapter::dispatch_target(&self.db, chapter_id)
.await
.context("look up chapter_sources for resync")?;
let Some((manga_id, source_url)) = row else {
return Err(ResyncError::NoChapterSource.into());
};
let lease = self
.browser_manager
.acquire()
.await
.context("acquire browser lease for chapter resync")?;
let result = content::sync_chapter_content(
&lease,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
chapter_id,
manga_id,
&source_url,
true,
&self.download_allowlist,
self.max_image_bytes,
self.tor.as_deref(),
)
.await;
drop(lease);
match result? {
SyncOutcome::Fetched { pages } => {
Ok(ChapterResyncOutcome::Fetched { chapter_id, pages })
}
SyncOutcome::Skipped => Ok(ChapterResyncOutcome::Skipped {
chapter_id,
reason: "chapter already had pages on disk".to_string(),
}),
SyncOutcome::SessionExpired => {
anyhow::bail!("source session expired — operator must refresh PHPSESSID")
}
}
}
}
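
The partial-render guard in `resync_manga` reduces to a small predicate over two counts. The following is a minimal sketch, not code from this diff: the free function `chapter_sync_is_safe` is hypothetical, and the `i64` type for the prior live count is an assumption about what `live_chapter_count_for_source_manga` returns.

```rust
/// Hypothetical helper (not part of this diff): decide whether the
/// chapter sync may run after a fresh fetch.
///
/// `fetched` is the number of chapters the parser returned;
/// `prior_live` is the number of live chapters already stored for
/// this source manga (assumed to be an `i64` count).
fn chapter_sync_is_safe(fetched: usize, prior_live: i64) -> bool {
    // An empty fetch is only trusted when there was nothing to lose:
    // syncing an empty list over existing chapters would soft-drop them.
    fetched > 0 || prior_live == 0
}
```

Computing the predicate once and reusing it for both the warning and the sync decision would avoid querying the live chapter count twice.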


@@ -0,0 +1,558 @@
//! Defensive helpers for the image-download paths.
//!
//! Two threats this module addresses:
//!
//! - **SSRF**: a scraped chapter or manga page can embed an absolute
//! `<img src="http://10.0.0.1/...">`. The crawler runs inside the
//! backend container with intra-compose access to `postgres:5432`
//! and possibly other internal services; without a host check the
//! crawler would happily probe them. [`is_safe_url`] rejects
//! anything whose host isn't on the operator-configured allowlist,
//! plus any IP literal in RFC1918 / loopback / link-local /
//! unique-local space (including IPv4-mapped IPv6 like
//! `::ffff:127.0.0.1`). The IP-literal check is a second defence
//! for the case where such a literal is itself allowlisted or
//! `allow_any` is enabled; it does not see what a hostname
//! resolves to.
//!
//! **DNS rebinding is not covered.** A hostname like `cdn.allowed.com`
//! that *resolves* to `127.0.0.1` via hostile DNS bypasses the IP
//! check entirely — `is_safe_url` only inspects URL strings, not
//! resolved IPs. Mitigating that requires a custom reqwest resolver
//! that filters IPs after DNS, which would mean rebuilding reqwest's
//! connector. The allowlist + good operator DNS hygiene is the
//! realistic mitigation today.
//!
//! - **Unbounded download**: `Response::bytes().await` reads the full
//! body before returning. A malicious source serving a 10 GiB image
//! would fill memory and then disk. [`accumulate_capped`] streams
//! the body chunk-by-chunk into a [`bytes::BytesMut`] and bails as
//! soon as the running total exceeds the cap.
//!
//! Both helpers are pure-data: the SSRF check is keyed off a parsed
//! URL string, and the byte accumulator is keyed off a generic stream.
//! Easy to unit-test without a live network or browser.
use std::net::IpAddr;
use anyhow::{bail, Context};
use bytes::BytesMut;
use futures_util::StreamExt;
use reqwest::Url;
/// Default per-image download cap. A page image is generally <2 MiB;
/// 32 MiB leaves headroom for high-resolution covers while still
/// stopping a misbehaving CDN dead. Override via `CRAWLER_MAX_IMAGE_BYTES`.
pub const DEFAULT_MAX_IMAGE_BYTES: usize = 32 * 1024 * 1024;
/// Operator-configured set of hosts the crawler may download from.
/// Empty by default, which keeps the surface area minimal: the only
/// way a URL gets through is by matching an explicit catalog/CDN
/// entry.
///
/// `allow_any` flips the host check off entirely (private-IP and
/// scheme checks still apply). It exists for operators whose sources
/// shard images across numbered CDN subdomains (`cdn1`, `cdn2`, …)
/// where enumerating each host upfront is impractical. Off by default.
#[derive(Clone, Debug, Default)]
pub struct DownloadAllowlist {
hosts: Vec<String>,
allow_any: bool,
}
impl DownloadAllowlist {
pub fn new() -> Self {
Self {
hosts: Vec::new(),
allow_any: false,
}
}
/// Bypass the host allowlist. Scheme, localhost, and private-IP
/// checks in [`is_safe_url`] continue to apply — this only opens
/// up public hosts that weren't pre-enumerated.
pub fn allow_any() -> Self {
Self {
hosts: Vec::new(),
allow_any: true,
}
}
/// Add a host (case-insensitive match). Sub-domains are *not*
/// implied: pass `cdn.example.com` and `example.com` separately
/// if both should be reachable.
pub fn allow(mut self, host: impl Into<String>) -> Self {
let h = host.into().to_ascii_lowercase();
if !h.is_empty() && !self.hosts.iter().any(|existing| existing == &h) {
self.hosts.push(h);
}
self
}
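/// True when no explicit hosts are listed. An `allow_any` list also
/// reports empty, even though it admits every public host.
pub fn is_empty(&self) -> bool {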
self.hosts.is_empty()
}
pub fn contains(&self, host: &str) -> bool {
if self.allow_any {
return true;
}
let lower = host.to_ascii_lowercase();
self.hosts.iter().any(|h| h == &lower)
}
}
/// Verify a URL is safe for the crawler to fetch.
///
/// Rejects:
/// - non-http(s) schemes (file://, gopher://, …),
/// - any IP literal in private / loopback / link-local / unique-local
/// space (defense in depth: this holds even if the literal is on
/// the allowlist or `allow_any` is set; it does not cover a
/// hostname such as `cdn.evil` whose DNS points at `192.168.1.1`),
/// - the literal hostname `localhost`,
/// - hosts that aren't on the supplied allowlist.
///
/// An empty allowlist rejects everything (the conservative default —
/// callers must explicitly allow the catalog and CDN hosts).
pub fn is_safe_url(raw_url: &str, allow: &DownloadAllowlist) -> Result<(), UrlSafetyError> {
let url = Url::parse(raw_url).map_err(|_| UrlSafetyError::Unparseable)?;
let scheme = url.scheme();
if scheme != "http" && scheme != "https" {
return Err(UrlSafetyError::BadScheme(scheme.to_string()));
}
let host = url.host_str().ok_or(UrlSafetyError::NoHost)?;
let lower_host = host.to_ascii_lowercase();
// `localhost.` (absolute DNS form with a trailing dot) names the same host.
if lower_host.trim_end_matches('.') == "localhost" {
return Err(UrlSafetyError::Loopback);
}
// Reject IP literals in private/loopback ranges regardless of the
// allowlist — if someone puts an IP literal on the allowlist they
// almost certainly didn't mean a private range.
// reqwest::Url normalises IPv6 literals as `[::1]` (brackets
// included) in `host_str()`. Strip the brackets before parsing.
let ip_candidate = lower_host
.strip_prefix('[')
.and_then(|s| s.strip_suffix(']'))
.unwrap_or(&lower_host);
if let Ok(ip) = ip_candidate.parse::<IpAddr>() {
if is_private_ip(&ip) {
return Err(UrlSafetyError::PrivateIp(ip));
}
}
if !allow.contains(&lower_host) {
return Err(UrlSafetyError::HostNotAllowed(lower_host));
}
Ok(())
}
fn is_private_ip(ip: &IpAddr) -> bool {
match ip {
IpAddr::V4(v4) => {
v4.is_loopback()
|| v4.is_private()
|| v4.is_link_local()
|| v4.is_unspecified()
|| v4.is_broadcast()
// CGNAT 100.64.0.0/10
|| (v4.octets()[0] == 100 && (v4.octets()[1] & 0xC0) == 64)
// 0.0.0.0/8 ("this network") is special-use; is_unspecified() only matches 0.0.0.0 itself
|| v4.octets()[0] == 0
}
IpAddr::V6(v6) => {
// IPv4-mapped IPv6 (::ffff:0:0/96): unwrap to the embedded
// IPv4 and recurse so `::ffff:127.0.0.1` is caught by the
// IPv4 loopback check rather than passing through.
// `Ipv6Addr::is_loopback()` only matches `::1` exactly.
if let Some(v4) = v6.to_ipv4_mapped() {
return is_private_ip(&IpAddr::V4(v4));
}
v6.is_loopback()
|| v6.is_unspecified()
// fc00::/7 unique-local
|| (v6.segments()[0] & 0xfe00) == 0xfc00
// fe80::/10 link-local
|| (v6.segments()[0] & 0xffc0) == 0xfe80
}
}
}
#[derive(Debug, thiserror::Error, PartialEq, Eq)]
pub enum UrlSafetyError {
#[error("URL is not parseable")]
Unparseable,
#[error("scheme {0:?} is not http or https")]
BadScheme(String),
#[error("URL is missing a host")]
NoHost,
#[error("host points at the loopback interface")]
Loopback,
#[error("host is a private/internal IP: {0}")]
PrivateIp(IpAddr),
#[error("host {0:?} is not on the crawler download allowlist")]
HostNotAllowed(String),
}
/// Drain a byte stream into a single buffer, bailing out as soon as
/// the running total exceeds `max_bytes`. Generic over the stream so
/// it's testable without a live HTTP response.
pub async fn accumulate_capped<S, E>(stream: S, max_bytes: usize) -> anyhow::Result<bytes::Bytes>
where
S: futures_core::Stream<Item = Result<bytes::Bytes, E>>,
E: std::error::Error + Send + Sync + 'static,
{
let mut buf = BytesMut::new();
let mut stream = std::pin::pin!(stream);
while let Some(chunk) = stream.next().await {
let chunk = chunk.map_err(|e| anyhow::anyhow!("stream chunk: {e}"))?;
if buf.len().saturating_add(chunk.len()) > max_bytes {
bail!(
"response exceeds {max_bytes}-byte cap (buffered {} bytes, next chunk {} bytes)",
buf.len(),
chunk.len()
);
}
buf.extend_from_slice(&chunk);
}
Ok(buf.freeze())
}
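/// GET `url` and stream the response body into a length-limited
/// buffer. Combines the [`is_safe_url`] check with
/// [`accumulate_capped`] so each call-site is one line. Only the
/// initial URL is checked: if `http` is configured to follow
/// redirects, redirect targets are not re-validated here.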
pub async fn fetch_bytes_capped(
http: &reqwest::Client,
url: &str,
referer: Option<&str>,
allow: &DownloadAllowlist,
max_bytes: usize,
) -> anyhow::Result<bytes::Bytes> {
is_safe_url(url, allow).with_context(|| format!("reject unsafe URL {url}"))?;
let mut req = http.get(url);
if let Some(r) = referer {
req = req.header(reqwest::header::REFERER, r);
}
let resp = req
.send()
.await
.with_context(|| format!("GET {url}"))?
.error_for_status()
.with_context(|| format!("non-2xx for {url}"))?;
accumulate_capped(resp.bytes_stream(), max_bytes)
.await
.with_context(|| format!("download body for {url}"))
}
/// True when `bytes` sniffs as one of the *renderable* image formats
/// the `/files/*key` endpoint can serve with a correct Content-Type:
/// JPEG, PNG, WebP, GIF, AVIF. Matches the upload pipeline's
/// whitelist in `upload::parse_image`.
///
/// `infer::MatcherType::Image` is intentionally NOT used — it also
/// matches BMP, TIFF, HEIF, ICO, PSD, and JP2. Those would sniff as
/// "image" here but [`api::files::content_type_for`] would fall back
/// to `application/octet-stream`, prompting browsers to download
/// instead of render. Keep the two layers aligned.
pub fn looks_like_image(bytes: &[u8]) -> bool {
matches!(
infer::get(bytes).map(|k| k.mime_type()),
Some("image/jpeg" | "image/png" | "image/webp" | "image/gif" | "image/avif")
)
}
#[cfg(test)]
mod tests {
use super::*;
use futures_util::stream;
fn allow_just(host: &str) -> DownloadAllowlist {
DownloadAllowlist::new().allow(host)
}
#[test]
fn allow_any_admits_arbitrary_public_host() {
// Operators who can't pre-enumerate a numbered-CDN fleet
// (cdn1, cdn2, …) opt into allow_any. Any public host passes.
let allow = DownloadAllowlist::allow_any();
assert!(is_safe_url("https://cdn7.random.tld/x.jpg", &allow).is_ok());
assert!(is_safe_url("https://anything-goes.example/", &allow).is_ok());
}
#[test]
fn allow_any_still_blocks_private_ips() {
// The point of the bypass is the host-allowlist check, not the
// SSRF defense. Private/loopback IPs stay refused.
let allow = DownloadAllowlist::allow_any();
for url in [
"http://10.0.0.1/",
"http://192.168.1.1/",
"http://169.254.169.254/",
"http://127.0.0.1/",
"http://[::1]/",
"http://[::ffff:127.0.0.1]/",
] {
assert!(
matches!(
is_safe_url(url, &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
),
"allow_any must still reject {url}"
);
}
}
#[test]
fn allow_any_still_blocks_localhost() {
let allow = DownloadAllowlist::allow_any();
assert!(matches!(
is_safe_url("http://localhost:8080/", &allow).unwrap_err(),
UrlSafetyError::Loopback
));
}
#[test]
fn allow_any_still_blocks_non_http_schemes() {
let allow = DownloadAllowlist::allow_any();
assert!(matches!(
is_safe_url("file:///etc/passwd", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
}
#[test]
fn safe_url_allows_listed_host() {
let allow = allow_just("cdn.example.com");
assert!(is_safe_url("https://cdn.example.com/img.jpg", &allow).is_ok());
}
#[test]
fn safe_url_blocks_unlisted_host() {
let allow = allow_just("cdn.example.com");
let err = is_safe_url("https://evil.example.org/img.jpg", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::HostNotAllowed(h) if h == "evil.example.org"));
}
#[test]
fn safe_url_blocks_localhost_even_if_allowlisted() {
let allow = allow_just("localhost");
assert!(matches!(
is_safe_url("http://localhost:8080/", &allow).unwrap_err(),
UrlSafetyError::Loopback
));
}
#[test]
fn safe_url_blocks_loopback_ipv4() {
let allow = allow_just("127.0.0.1");
assert!(matches!(
is_safe_url("http://127.0.0.1/", &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
));
}
#[test]
fn safe_url_blocks_rfc1918() {
let allow = allow_just("10.0.0.1");
for url in [
"http://10.0.0.1/",
"http://192.168.1.1/",
"http://172.16.0.5/",
"http://172.31.255.255/",
] {
assert!(
matches!(
is_safe_url(url, &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
),
"should reject {url}"
);
}
}
#[test]
fn safe_url_blocks_link_local() {
let allow = allow_just("169.254.169.254");
// 169.254.169.254 is the AWS/GCP metadata service — the most
// dangerous SSRF target on a default cloud VM.
assert!(matches!(
is_safe_url("http://169.254.169.254/", &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
));
}
#[test]
fn safe_url_blocks_ipv6_loopback_and_ula() {
// reqwest::Url normalises IPv6 literals as `[::1]` with brackets
// in `host_str()`, which doesn't parse as `IpAddr` directly. The
// implementation strips them.
let allow = allow_just("[::1]");
let err = is_safe_url("http://[::1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
let allow = allow_just("[fd00::1]");
let err = is_safe_url("http://[fd00::1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
}
#[test]
fn safe_url_blocks_ipv4_mapped_ipv6_loopback() {
// `Ipv6Addr::is_loopback()` only matches `::1` exactly, so
// `::ffff:127.0.0.1` would slip through without the
// to_ipv4_mapped() unwrap in is_private_ip.
let allow = allow_just("[::ffff:127.0.0.1]");
let err = is_safe_url("http://[::ffff:127.0.0.1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
}
#[test]
fn safe_url_blocks_ipv4_mapped_ipv6_rfc1918() {
let allow = allow_just("[::ffff:10.0.0.1]");
let err = is_safe_url("http://[::ffff:10.0.0.1]/", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::PrivateIp(_)));
}
#[test]
fn safe_url_blocks_non_http_schemes() {
let allow = allow_just("anywhere");
assert!(matches!(
is_safe_url("file:///etc/passwd", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
assert!(matches!(
is_safe_url("gopher://anywhere:70/", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
}
#[test]
fn safe_url_rejects_unparseable() {
let allow = allow_just("anywhere");
assert!(matches!(
is_safe_url("not a url", &allow).unwrap_err(),
UrlSafetyError::Unparseable
));
}
#[test]
fn safe_url_empty_allowlist_rejects_everything() {
let allow = DownloadAllowlist::new();
let err = is_safe_url("https://cdn.example.com/img.jpg", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::HostNotAllowed(_)));
}
#[test]
fn allowlist_matches_case_insensitively() {
let allow = DownloadAllowlist::new().allow("CDN.Example.COM");
assert!(is_safe_url("https://cdn.example.com/x.jpg", &allow).is_ok());
assert!(is_safe_url("https://CDN.EXAMPLE.com/x.jpg", &allow).is_ok());
}
#[tokio::test]
async fn accumulate_capped_returns_full_body_under_cap() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from_static(b"hello ")),
Ok(bytes::Bytes::from_static(b"world")),
];
let s = stream::iter(chunks);
let out = accumulate_capped(s, 100).await.unwrap();
assert_eq!(out.as_ref(), b"hello world");
}
#[tokio::test]
async fn accumulate_capped_bails_past_cap() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from(vec![0u8; 50])),
Ok(bytes::Bytes::from(vec![0u8; 60])),
];
let s = stream::iter(chunks);
let err = accumulate_capped(s, 100).await.unwrap_err();
assert!(err.to_string().contains("100-byte cap"));
}
#[tokio::test]
async fn accumulate_capped_surfaces_stream_errors() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from_static(b"ok")),
Err(std::io::Error::other("network blip")),
];
let s = stream::iter(chunks);
let err = accumulate_capped(s, 100).await.unwrap_err();
assert!(err.to_string().contains("network blip"));
}
#[test]
fn looks_like_image_accepts_jpeg() {
// JPEG SOI + APP0 segment.
let jpeg = [0xff, 0xd8, 0xff, 0xe0, 0, 0x10, b'J', b'F', b'I', b'F'];
assert!(looks_like_image(&jpeg));
}
#[test]
fn looks_like_image_accepts_png() {
let png = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0, 0, 0, 0];
assert!(looks_like_image(&png));
}
#[test]
fn looks_like_image_rejects_html_disguised_as_image() {
let html = b"<html><body>not an image</body></html>";
assert!(!looks_like_image(html));
}
#[test]
fn looks_like_image_rejects_empty() {
assert!(!looks_like_image(&[]));
}
#[test]
fn looks_like_image_rejects_renderable_but_unsupported_formats() {
// BMP, TIFF, ICO, PSD are `infer::MatcherType::Image` but the
// /files/*key handler doesn't have Content-Type mappings for
// them, so they'd be served as application/octet-stream and
// download instead of render. Reject at the crawler so we
// never land them in storage.
// BMP magic: "BM" + 4-byte size.
let bmp = [b'B', b'M', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
assert!(!looks_like_image(&bmp), "BMP must be rejected (not renderable by /files)");
// TIFF little-endian magic: "II" + 42.
let tiff = [0x49, 0x49, 0x2a, 0x00, 0, 0, 0, 0];
assert!(!looks_like_image(&tiff), "TIFF must be rejected");
// ICO magic: 0x00,0x00,0x01,0x00.
let ico = [0x00, 0x00, 0x01, 0x00, 1, 0, 16, 16, 0, 0, 1, 0, 0x18, 0, 0x40, 0, 0, 0, 0x16, 0, 0, 0];
assert!(!looks_like_image(&ico), "ICO must be rejected");
}
#[test]
fn looks_like_image_accepts_webp_gif_avif() {
// Cover the three remaining whitelisted formats so a future
// tightening that drops one would fail noisily.
let webp = [
b'R', b'I', b'F', b'F',
0, 0, 0, 0,
b'W', b'E', b'B', b'P',
b'V', b'P', b'8', b' ',
];
assert!(looks_like_image(&webp));
let gif = [b'G', b'I', b'F', b'8', b'7', b'a', 0, 0, 0, 0];
assert!(looks_like_image(&gif));
let avif = [
0x00, 0x00, 0x00, 0x18,
b'f', b't', b'y', b'p',
b'a', b'v', b'i', b'f',
0x00, 0x00, 0x00, 0x00,
b'm', b'i', b'f', b'1',
b'a', b'v', b'i', b'f',
];
assert!(looks_like_image(&avif));
}
}
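
The module docs note that DNS rebinding is not covered because `is_safe_url` only inspects URL strings. The following is a minimal sketch of the check a post-DNS filter would apply to resolved addresses. It is not part of this diff: `check_resolved` and `is_internal` are hypothetical names, `is_internal` is a simplified stand-in for the module's private `is_private_ip`, and wiring the check into reqwest's resolver is left out entirely.

```rust
use std::net::IpAddr;

/// Simplified stand-in for the module's `is_private_ip` (assumption:
/// the real function also covers CGNAT, broadcast, and 0.0.0.0/8).
fn is_internal(ip: &IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            // Unwrap IPv4-mapped addresses so `::ffff:127.0.0.1` is
            // judged by the IPv4 rules.
            Some(v4) => is_internal(&IpAddr::V4(v4)),
            None => {
                v6.is_loopback()
                    || v6.is_unspecified()
                    || (v6.segments()[0] & 0xfe00) == 0xfc00 // fc00::/7
                    || (v6.segments()[0] & 0xffc0) == 0xfe80 // fe80::/10
            }
        },
    }
}

/// Hypothetical post-resolution check: refuse the whole lookup if it
/// is empty or if *any* resolved address is internal, because the
/// connector may pick any address from the set.
fn check_resolved(addrs: &[IpAddr]) -> Result<(), String> {
    if addrs.is_empty() {
        return Err("lookup returned no addresses".to_string());
    }
    match addrs.iter().find(|ip| is_internal(ip)) {
        Some(bad) => Err(format!("resolved to internal address {bad}")),
        None => Ok(()),
    }
}
```

Rejecting the whole lookup, rather than filtering out the internal addresses, is the more conservative choice: a mixed answer is itself a sign of hostile or misconfigured DNS.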


@@ -0,0 +1,635 @@
//! PHPSESSID injection + login probe.
//!
//! The catalog site we crawl renders chapter pages as a single multi-
//! page list only for logged-in users. We don't try to bypass the
//! login (CAPTCHA wall) — instead the operator pastes their browser's
//! `PHPSESSID` cookie into `CRAWLER_PHPSESSID` and the crawler injects
//! it into Chromium *and* reqwest before the first navigation.
//!
//! Two things the cookie alone doesn't give us:
//! 1. The cookie value is only meaningful to the *server* — we have
//! no way to predict from the value alone whether it's still valid.
//! `verify_session` does a navigation and inspects the probe page
//! for three outcomes: broken-page response (transient — retry the
//! probe), `#logo` present but `#avatar_menu` absent (genuine logout
//! — bail loudly), or both present (authenticated). The earlier
//! avatar-only check conflated "site is hiccuping" with "session is
//! dead" and refused to start the crawler when the site had a brief
//! 503.
//! 2. The reqwest client (used for cover and chapter-image downloads)
//! has its own cookie store; we seed it for the catalog host only.
//! CDN hosts are deliberately *not* given the cookie: they serve
//! image bytes via signed URLs and don't need it.
use std::time::Duration;
use anyhow::{anyhow, Context};
use chromiumoxide::browser::Browser;
use chromiumoxide::cdp::browser_protocol::network::CookieParam;
use crate::crawler::detect::{has_logo_sentinel, is_broken_page_body};
/// Outcome of inspecting a probe-page response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionProbe {
/// `#logo` present and `#avatar_menu` present — session valid.
Ok,
/// `#logo` present but `#avatar_menu` absent — site rendered the
/// normal layout for an unauthenticated visitor; refresh PHPSESSID.
Unauthenticated,
/// Broken-page body signature or `#logo` missing — site is hiccuping.
/// Caller retries the probe rather than blaming the session.
Transient,
}
/// Re-export so existing callers keep working after the helper moved
/// to `crawler::url_utils`. The body lives there.
pub use crate::crawler::url_utils::registrable_domain;
/// Inject the PHPSESSID cookie into the browser's cookie store for the
/// catalog domain. Must be called before any navigation that depends on
/// authentication; subsequent navigations include the cookie
/// automatically.
pub async fn inject_phpsessid(
browser: &Browser,
sid: &str,
cookie_domain: &str,
) -> anyhow::Result<()> {
let cookie = CookieParam {
name: "PHPSESSID".to_string(),
value: sid.to_string(),
url: None,
domain: Some(cookie_domain.to_string()),
path: Some("/".to_string()),
secure: None,
http_only: Some(true),
same_site: None,
expires: None,
priority: None,
same_party: None,
source_scheme: None,
source_port: None,
partition_key: None,
};
browser
.set_cookies(vec![cookie])
.await
.context("set PHPSESSID in chromium cookie store")?;
tracing::info!(domain = cookie_domain, "injected PHPSESSID into browser");
Ok(())
}
/// Three-way classification of a probe-page response. Pure over HTML so
/// it's unit-testable without a real browser. Order matters: a body
/// matching the broken-page template is `Transient` even if the page
/// happens to contain `#avatar_menu` HTML somewhere — trust the universal
/// site signal over a stray selector match.
pub fn classify_probe(html: &str) -> SessionProbe {
if is_broken_page_body(html) {
return SessionProbe::Transient;
}
let doc = scraper::Html::parse_document(html);
if !has_logo_sentinel(&doc) {
return SessionProbe::Transient;
}
let avatar_sel = scraper::Selector::parse("#avatar_menu").unwrap();
if doc.select(&avatar_sel).next().is_some() {
SessionProbe::Ok
} else {
SessionProbe::Unauthenticated
}
}
/// Three-way classification of a chapter page response.
///
/// Reader pages don't render `#logo`, so [`classify_probe`] can't be
/// reused as-is. The chapter-specific marker is `a#pic_container`
/// (asserted by the reader-page parser at `parse_chapter_pages`).
///
/// Order matters: broken-page body wins over selector matches, so a
/// transient site-wide error page, which has neither the reader nor
/// the avatar widget, is classified `Transient` rather than
/// `Unauthenticated`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChapterProbe {
/// `a#pic_container` present — reader rendered. Whether
/// `#avatar_menu` is also there is informational; if the reader
/// loaded the session is by definition still good.
Ok,
/// Site rendered a "logged out" or "please log in" page (no
/// reader, no broken-page body, and no avatar widget either).
/// Distinguishes the genuine expired-session case from a
/// transient site hiccup.
Unauthenticated,
/// Broken-page body, or reader didn't render but the user is
/// still logged in (avatar widget present). Caller should retry
/// rather than blame the session.
Transient,
}
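/// Classify a chapter (reader) page response into a [`ChapterProbe`].
/// Pure over HTML, like [`classify_probe`].
pub fn classify_chapter_probe(html: &str) -> ChapterProbe {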
if is_broken_page_body(html) {
return ChapterProbe::Transient;
}
let doc = scraper::Html::parse_document(html);
let container = scraper::Selector::parse("a#pic_container").unwrap();
if doc.select(&container).next().is_some() {
return ChapterProbe::Ok;
}
let avatar = scraper::Selector::parse("#avatar_menu").unwrap();
if doc.select(&avatar).next().is_some() {
// Logged-in user, but the reader didn't render — most likely
// the layout shifted or the site is serving an interstitial.
ChapterProbe::Transient
} else {
// No reader, no avatar, no broken-body marker — site rendered
// the "please log in" page, which is the genuine session-
// expired signal on this route.
ChapterProbe::Unauthenticated
}
}
/// In-startup retry budget for the session probe. Small but non-zero —
/// startup hitting a 5-second site hiccup shouldn't fail the operator
/// with "PHPSESSID expired" when the session is actually fine.
const PROBE_MAX_ATTEMPTS: u32 = 3;
const PROBE_RETRY_DELAY: Duration = Duration::from_secs(2);
/// Navigate to `probe_url` and classify the response. Retries the probe
/// on `Transient` outcomes (broken-page body, missing `#logo`); fails
/// fast on `Unauthenticated`; returns `Ok(())` on success.
///
/// This burns one navigation per attempt against the catalog's rate
/// limiter. The trade is worth it — failing here costs ~1s; failing 30
/// minutes into a backfill costs 30 minutes.
pub async fn verify_session(browser: &Browser, probe_url: &str) -> anyhow::Result<()> {
verify_session_with_recircuit(browser, probe_url, None, 0).await
}
/// Like [`verify_session`] but, when `tor` is `Some`, signals
/// `SIGNAL NEWNYM` between retries on transient pages AND treats
/// `Unauthenticated` as recoverable (up to `tor_max_attempts` total
/// probes, calling NEWNYM between each).
///
/// `verify_session` is `verify_session_with_recircuit(..., None, _)`,
/// which collapses the `Unauthenticated` budget to 1 attempt — i.e.
/// fail-fast, exactly the pre-TOR behavior.
pub async fn verify_session_with_recircuit(
browser: &Browser,
probe_url: &str,
tor: Option<&crate::crawler::tor::TorController>,
tor_max_attempts: u32,
) -> anyhow::Result<()> {
let unauth_max_attempts = if tor.is_some() { tor_max_attempts.max(1) } else { 1 };
run_session_probe_loop(
|| fetch_probe_html(browser, probe_url),
|| async {
if let Some(t) = tor {
if let Err(e) = t.new_identity().await {
tracing::warn!(error = %e, "TOR NEWNYM failed; continuing with same circuit");
}
}
},
PROBE_MAX_ATTEMPTS,
unauth_max_attempts,
PROBE_RETRY_DELAY,
probe_url,
)
.await
}
/// Pure-over-IO loop body for the session probe. Generic over the
/// fetch and recircuit closures so it can be unit-tested without a
/// real browser or TOR daemon.
///
/// Both budgets count **total attempts**, including the first — so
/// `transient_max_attempts = 3` allows 3 fetches and 2 recircuits
/// between them, and `unauth_max_attempts = 1` means "fail-fast, no
/// retry". This matches [`crate::crawler::detect::retry_on_transient`]
/// and the content-path recircuit loop.
///
/// Outcomes:
/// - `SessionProbe::Ok` → return `Ok(())`.
/// - `SessionProbe::Unauthenticated` → recircuit + retry while
/// under the unauth budget. After the cap, bail with the
/// "PHPSESSID expired" diagnostic, mentioning the attempt count so
/// a TOR-misconfig diagnosis is easier.
/// - `SessionProbe::Transient` → same shape against the transient
/// budget; bails with "site down or rate-limiting" after the cap.
async fn run_session_probe_loop<F, Fut, R, RFut>(
mut fetch_html: F,
mut recircuit: R,
transient_max_attempts: u32,
unauth_max_attempts: u32,
retry_delay: Duration,
probe_url_for_msg: &str,
) -> anyhow::Result<()>
where
F: FnMut() -> Fut,
Fut: std::future::Future<Output = anyhow::Result<String>>,
R: FnMut() -> RFut,
RFut: std::future::Future<Output = ()>,
{
debug_assert!(transient_max_attempts >= 1);
debug_assert!(unauth_max_attempts >= 1);
let mut transient_attempts = 0u32;
let mut unauth_attempts = 0u32;
loop {
let html = fetch_html().await?;
match classify_probe(&html) {
SessionProbe::Ok => {
tracing::info!(
transient_attempts,
unauth_attempts,
"session probe ok — #logo + #avatar_menu present"
);
return Ok(());
}
SessionProbe::Unauthenticated => {
unauth_attempts += 1;
if unauth_attempts >= unauth_max_attempts {
return Err(anyhow!(
"session probe failed — #avatar_menu not present at {probe_url_for_msg} \
after {unauth_attempts} attempt(s); PHPSESSID is missing, \
expired, or revoked. Refresh CRAWLER_PHPSESSID and re-run."
));
}
tracing::warn!(
attempt = unauth_attempts,
max_attempts = unauth_max_attempts,
"session probe Unauthenticated despite PHPSESSID; signaling TOR \
NEWNYM and retrying"
);
recircuit().await;
tokio::time::sleep(retry_delay).await;
}
SessionProbe::Transient => {
transient_attempts += 1;
if transient_attempts >= transient_max_attempts {
return Err(anyhow!(
"session probe failed — probe page at {probe_url_for_msg} returned \
a broken-page response after {transient_max_attempts} attempts. \
The site appears to be down or rate-limiting us; try again \
later before refreshing CRAWLER_PHPSESSID."
));
}
tracing::warn!(
attempt = transient_attempts,
max_attempts = transient_max_attempts,
"session probe got a transient page; recircuit + retry"
);
recircuit().await;
tokio::time::sleep(retry_delay).await;
}
}
}
}
async fn fetch_probe_html(browser: &Browser, probe_url: &str) -> anyhow::Result<String> {
let page = browser
.new_page(probe_url)
.await
.with_context(|| format!("open probe page {probe_url}"))?;
crate::crawler::nav::wait_for_nav(&page)
.await
.context("wait for nav on probe")?;
// Best-effort wait for the layout marker. Timeout is fine — the
// probe classifier handles a missing `#logo` as Transient anyway,
// and the verify loop retries on Transient.
let _ = crate::crawler::nav::wait_for_selector(
&page,
"#logo",
crate::crawler::nav::SELECTOR_TIMEOUT,
)
.await;
let html = page.content().await.context("read probe html")?;
page.close().await.ok();
Ok(html)
}
#[cfg(test)]
mod tests {
use super::*;
// registrable_domain tests live in crawler::url_utils now —
// it's the canonical home for that helper.
#[test]
fn classify_probe_ok_when_logo_and_avatar_present() {
let html = r#"<html><body>
<header><div id="logo">Target</div><div id="avatar_menu"></div></header>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Ok);
}
#[test]
fn classify_probe_unauth_when_logo_present_but_avatar_absent() {
// Real "logged out" response: site layout renders fine, just no
// avatar widget. This is the only state that should blame the
// session cookie.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<main>Please log in.</main>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Unauthenticated);
}
#[test]
fn classify_probe_transient_on_broken_page_body() {
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_when_logo_missing() {
// No broken-body marker, but no site layout either — treat as
// transient (could be a Cloudflare interstitial, a 5xx page,
// etc.) rather than blaming the session.
let html = "<html><body><h1>Service Unavailable</h1></body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_on_empty_response() {
assert_eq!(classify_probe(""), SessionProbe::Transient);
}
#[test]
fn classify_chapter_probe_ok_when_reader_rendered() {
let html = r#"
<html><body>
<a id="pic_container">
<img id="page1" src="https://cdn/1.jpg">
</a>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Ok);
}
#[test]
fn classify_chapter_probe_unauthenticated_when_no_reader_and_no_avatar() {
// What a logged-out hit on a chapter URL renders: a normal
// site layout (header etc.) with a "please log in" body, but
// no reader and no avatar widget.
let html = r#"
<html><body>
<header><div id="logo">Catalog</div></header>
<main>Please log in to read this chapter.</main>
</body></html>
"#;
assert_eq!(
classify_chapter_probe(html),
ChapterProbe::Unauthenticated
);
}
#[test]
fn classify_chapter_probe_transient_when_logged_in_but_reader_missing() {
// Avatar shows the session is still valid; reader didn't
// render — site is serving an interstitial or the layout
// momentarily shifted. Retry, don't blame the session.
let html = r#"
<html><body>
<header><div id="logo">Catalog</div><div id="avatar_menu"></div></header>
<main>Site maintenance — back in 5 minutes.</main>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Transient);
}
#[test]
fn classify_chapter_probe_transient_on_broken_page_body() {
let html =
"<html><body><p>we're sorry, the request file are not found.</p></body></html>";
assert_eq!(classify_chapter_probe(html), ChapterProbe::Transient);
}
#[test]
fn classify_chapter_probe_does_not_misfire_on_avatar_alone_without_reader() {
// Regression for the original bug: the binary
// find_element("#avatar_menu") check treated "no avatar" as
// session-expired even when a transient hiccup was the real
// cause. classify_chapter_probe must NOT trip on that pattern
// when pic_container *is* present.
let html = r#"
<html><body>
<a id="pic_container">
<img id="page1" src="https://cdn/1.jpg">
</a>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Ok);
}
// --- run_session_probe_loop -----------------------------------------
//
// These tests exercise the recircuit-aware loop without a real
// browser. The fetch and recircuit closures are mocked over Vecs of
// canned outcomes / counters.
const OK_HTML: &str = r#"<html><body><div id="logo"></div><div id="avatar_menu"></div></body></html>"#;
const UNAUTH_HTML: &str = r#"<html><body><div id="logo"></div></body></html>"#;
const TRANSIENT_HTML: &str = "<html><body><p>we're sorry, the request file are not found.</p></body></html>";
#[tokio::test]
async fn probe_loop_ok_on_first_attempt_does_not_recircuit() {
let mut recircuits = 0u32;
let mut fetched = 0u32;
run_session_probe_loop(
|| {
fetched += 1;
async { Ok(OK_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
3,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect("ok on first attempt");
assert_eq!(fetched, 1);
assert_eq!(recircuits, 0);
}
#[tokio::test]
async fn probe_loop_unauth_then_ok_when_attempt_budget_available() {
// Budget = 3 total attempts. Unauth on call 1, ok on call 2.
let mut recircuits = 0u32;
let mut call = 0u32;
run_session_probe_loop(
|| {
call += 1;
let n = call;
async move {
if n == 1 {
Ok(UNAUTH_HTML.to_string())
} else {
Ok(OK_HTML.to_string())
}
}
},
|| {
recircuits += 1;
async {}
},
3,
3,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect("recovers after one recircuit");
assert_eq!(call, 2);
assert_eq!(recircuits, 1);
}
#[tokio::test]
async fn probe_loop_unauth_with_single_attempt_budget_fails_fast() {
// Budget = 1 total attempt = no retry (matches no-TOR behavior).
let mut recircuits = 0u32;
let mut call = 0u32;
let err = run_session_probe_loop(
|| {
call += 1;
async { Ok(UNAUTH_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
1,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect_err("budget=1 → fail-fast");
assert_eq!(call, 1, "no retry when budget is 1");
assert_eq!(recircuits, 0);
let msg = format!("{err:#}");
assert!(msg.contains("Refresh CRAWLER_PHPSESSID"), "msg: {msg}");
assert!(msg.contains("after 1 attempt"), "expected attempt count in msg: {msg}");
}
#[tokio::test]
async fn probe_loop_unauth_after_exhausting_budget_emits_attempt_count() {
let mut recircuits = 0u32;
let mut call = 0u32;
let err = run_session_probe_loop(
|| {
call += 1;
async { Ok(UNAUTH_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
10, // transient budget irrelevant here
3, // 3 attempts total, 2 recircuits between
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect_err("exhausts unauth budget");
assert_eq!(call, 3);
assert_eq!(recircuits, 2);
let msg = format!("{err:#}");
assert!(msg.contains("after 3 attempt"), "expected attempt count in error, got: {msg}");
}
#[tokio::test]
async fn probe_loop_transient_repeats_until_max_then_errors() {
let mut recircuits = 0u32;
let mut call = 0u32;
let err = run_session_probe_loop(
|| {
call += 1;
async { Ok(TRANSIENT_HTML.to_string()) }
},
|| {
recircuits += 1;
async {}
},
3,
1,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect_err("transient until max → fail");
assert_eq!(call, 3);
// Recircuit fires between attempts: 3 attempts → 2 recircuits.
assert_eq!(recircuits, 2);
let msg = format!("{err:#}");
assert!(msg.contains("broken-page response after 3 attempts"), "msg: {msg}");
}
#[tokio::test]
async fn probe_loop_transient_then_ok_returns_ok_after_one_recircuit() {
let mut recircuits = 0u32;
let mut call = 0u32;
run_session_probe_loop(
|| {
call += 1;
let n = call;
async move {
if n == 1 {
Ok(TRANSIENT_HTML.to_string())
} else {
Ok(OK_HTML.to_string())
}
}
},
|| {
recircuits += 1;
async {}
},
3,
1,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect("ok on second try");
assert_eq!(call, 2);
assert_eq!(recircuits, 1);
}
#[tokio::test]
async fn probe_loop_propagates_fetch_errors_immediately() {
let mut call = 0u32;
let err = run_session_probe_loop(
|| {
call += 1;
async { Err(anyhow!("nav timeout")) }
},
|| async {},
5,
5,
Duration::from_millis(0),
"https://example/probe",
)
.await
.expect_err("fetch error bubbles");
assert_eq!(call, 1);
assert!(format!("{err:#}").contains("nav timeout"));
}
#[test]
fn classify_probe_trusts_broken_body_over_stray_avatar_match() {
// Defensive: if a broken-page body somehow contains an
// #avatar_menu element (e.g. an unrelated debug page on the
// same template), the body signature still wins.
let html = r#"<html><body>
<p>we're sorry, the request file are not found.</p>
<div id="logo"></div>
<div id="avatar_menu"></div>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
}
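
The `classify_probe` tests above pin down a three-way decision: the broken-page body always wins, a missing `#logo` is treated as transient, and only "logo present, avatar absent" blames the session. The following is a minimal sketch of a classifier consistent with those cases. It assumes plain substring checks in place of whatever DOM query the real `classify_probe` uses (its body is not part of this diff); `classify_probe_sketch` and `BROKEN_BODY_MARKER` are hypothetical names introduced for illustration.

```rust
/// Three-way outcome of a session probe, mirroring the variants the
/// tests assert on.
#[derive(Debug, PartialEq, Eq)]
enum SessionProbe {
    Ok,
    Unauthenticated,
    Transient,
}

// Assumption: the marker text and the `id="..."` attribute spelling are
// copied from the test fixtures. A substring match is a simplification;
// it would miss e.g. single-quoted attributes that a DOM query handles.
const BROKEN_BODY_MARKER: &str = "the request file are not found";

fn classify_probe_sketch(html: &str) -> SessionProbe {
    // The broken-page signature takes precedence over any layout match.
    if html.contains(BROKEN_BODY_MARKER) {
        return SessionProbe::Transient;
    }
    let has_logo = html.contains(r#"id="logo""#);
    let has_avatar = html.contains(r#"id="avatar_menu""#);
    match (has_logo, has_avatar) {
        // No site layout at all: interstitial, 5xx page, empty body.
        (false, _) => SessionProbe::Transient,
        // Layout and avatar widget both rendered: session is valid.
        (true, true) => SessionProbe::Ok,
        // Layout rendered but no avatar: the only case that blames the cookie.
        (true, false) => SessionProbe::Unauthenticated,
    }
}
```

Checking the broken-body marker before the layout markers is what makes the "stray avatar" case in the last test resolve to `Transient` rather than `Ok`.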


@@ -8,19 +8,6 @@ pub mod target;
use async_trait::async_trait;
use chromiumoxide::browser::Browser;
use serde::{Deserialize, Serialize};
/// How a `discover` job should walk the source's index.
#[derive(Clone, Copy, Debug, Serialize, Deserialize)]
pub enum DiscoverMode {
/// Walk every index page from last back to first. Used for the
/// initial seed of a source.
Backfill,
/// Walk index pages from page 1 forward, stopping after
/// `stop_after_unchanged` consecutive mangas whose `metadata_hash`
/// matches storage. Used for the recurring cron tick.
Incremental { stop_after_unchanged: usize },
}
/// Pointer at a manga in the source's index, before we've fetched the
/// detail page. The `source_manga_key` is whatever stable id the source
@@ -74,12 +61,36 @@ pub struct SourceChapter {
}
/// Context passed to every `Source` call. Carries the browser handle
/// plus a shared rate limiter so impls that issue multiple requests in
/// one call (e.g. pagination walks) honor the same per-host budget as
/// the outer job loop.
/// plus the per-host rate-limiter map so impls that issue multiple
/// requests in one call (pagination walks, multi-page chapter image
/// fetches) honor the right budget for each origin.
pub struct FetchContext<'a> {
pub browser: &'a Browser,
pub rate: &'a tokio::sync::Mutex<crate::crawler::rate_limit::RateLimiter>,
pub rate: &'a crate::crawler::rate_limit::HostRateLimiters,
/// Optional TOR control-port client. When `Some`, retry helpers
/// signal `NEWNYM` between transient-page attempts so the next try
/// draws a fresh exit. `None` keeps pre-TOR behavior.
pub tor: Option<&'a crate::crawler::tor::TorController>,
}
/// Lazy iterator over discovered manga refs. The caller drives the
/// walk one batch at a time, so it can break out as soon as the
/// downstream stop condition is met (the first manga where metadata is
/// `Unchanged` and chapter sync reports zero new chapters) without
/// paying for pages it won't use.
///
/// Batches are typically one source-index page each. Within a batch
/// refs are in the source's natural newest-first ordering — the same
/// `update_date DESC` sort that makes the stop condition meaningful.
#[async_trait]
pub trait DiscoverWalk: Send {
/// Return the next batch of refs, or `Ok(None)` when the source has
/// no more pages. The walker is single-use; calling `next_batch`
/// after `None` is allowed and continues to return `None`.
async fn next_batch(
&mut self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Option<Vec<SourceMangaRef>>>;
}
#[async_trait]
@@ -87,16 +98,15 @@ pub trait Source: Send + Sync {
/// Stable identifier — also the row key in the `sources` table.
fn id(&self) -> &'static str;
/// Returns up to `max_results` manga refs in source order. Pass
/// `None` for an uncapped walk (full backfill / incremental sweep).
/// Implementations should stop paginating as soon as the cap is
/// reached so partial runs don't pay for pages they won't use.
/// Begin discovery. Returns a walker the caller drives page-by-page
/// via `next_batch`. The initial page-1 probe (used to determine
/// `last_page` and warm the cache for sites that can't be paged
/// without knowing the bound) happens inside this call, so a fresh
/// walker is ready to yield its first batch without further setup.
async fn discover(
&self,
ctx: &FetchContext<'_>,
mode: DiscoverMode,
max_results: Option<usize>,
) -> anyhow::Result<Vec<SourceMangaRef>>;
) -> anyhow::Result<Box<dyn DiscoverWalk + Send>>;
async fn fetch_manga(
&self,
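
The `DiscoverWalk` doc comment in this file describes a caller that pulls one batch at a time and stops at the first unchanged manga, so later index pages are never fetched. A rough sketch of that driving loop follows. It is deliberately synchronous and std-only, whereas the real trait is async and takes a `FetchContext`; `WalkSketch`, `CannedWalker`, and `walk_until_unchanged` are hypothetical names, and the `is_unchanged` predicate stands in for the real "metadata unchanged and zero new chapters" check.

```rust
/// Synchronous stand-in for `DiscoverWalk` (assumption: the real trait
/// is async and yields `SourceMangaRef`s, not plain strings).
trait WalkSketch {
    fn next_batch(&mut self) -> Result<Option<Vec<String>>, String>;
}

/// Hypothetical in-memory walker over pre-canned index pages.
struct CannedWalker {
    pages: Vec<Vec<String>>,
    next_page: usize,
}

impl WalkSketch for CannedWalker {
    fn next_batch(&mut self) -> Result<Option<Vec<String>>, String> {
        match self.pages.get(self.next_page) {
            // A non-empty page is yielded and the cursor advances.
            Some(page) if !page.is_empty() => {
                self.next_page += 1;
                Ok(Some(page.clone()))
            }
            // An empty page, or running past the end, ends the walk.
            _ => Ok(None),
        }
    }
}

/// Pull batches until the predicate flags the first unchanged ref, then
/// stop without requesting any further pages.
fn walk_until_unchanged(
    walker: &mut dyn WalkSketch,
    is_unchanged: impl Fn(&str) -> bool,
) -> Result<Vec<String>, String> {
    let mut changed = Vec::new();
    while let Some(batch) = walker.next_batch()? {
        for r in batch {
            if is_unchanged(&r) {
                return Ok(changed);
            }
            changed.push(r);
        }
    }
    Ok(changed)
}
```

Because the caller owns the loop, the early return is the whole stop mechanism: the walker needs no knowledge of storage state or result caps.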


@@ -14,9 +14,26 @@ use async_trait::async_trait;
use sha2::{Digest, Sha256};
use super::{
DiscoverMode, FetchContext, Source, SourceChapter, SourceChapterRef, SourceManga,
DiscoverWalk, FetchContext, Source, SourceChapter, SourceChapterRef, SourceManga,
SourceMangaRef,
};
use crate::crawler::detect::{
has_logo_sentinel, is_broken_page_body, retry_on_transient_with_hook, PageError,
};
use crate::crawler::nav::{wait_for_nav, wait_for_selector, NavError, SELECTOR_TIMEOUT};
/// `sources.id` value for this Source impl. Exposed as a const so the
/// daemon can look up per-source state (e.g. the recovery flag) before
/// constructing the Source itself.
pub const SOURCE_ID: &str = "target";
/// In-loop retry budget for transient pages encountered during a single
/// `discover` walk. Bounded small because the next cron tick will pick up
/// where this run left off via the recovery flag — these inline retries
/// only need to absorb a brief site hiccup mid-walk, not a sustained
/// outage.
const PAGE_TRANSIENT_RETRY_ATTEMPTS: u32 = 3;
const PAGE_TRANSIENT_RETRY_DELAY: Duration = Duration::from_secs(2);
pub struct TargetSource {
base_url: String,
@@ -50,64 +67,33 @@ impl TargetSource {
#[async_trait]
impl Source for TargetSource {
fn id(&self) -> &'static str {
"target"
SOURCE_ID
}
async fn discover(
&self,
ctx: &FetchContext<'_>,
mode: DiscoverMode,
max_results: Option<usize>,
) -> anyhow::Result<Vec<SourceMangaRef>> {
// Always visit page 1 first because that's the only way to
// discover `last_page`. We cache the HTML so we don't have to
// re-navigate when the iteration reaches page 1 again.
let first_html = navigate(ctx, self.base_url.as_str()).await?;
let last_page = {
let doc = scraper::Html::parse_document(&first_html);
parse_last_page(&doc)
};
) -> anyhow::Result<Box<dyn DiscoverWalk + Send>> {
// Probe page 1 up front (with transient retry) for two reasons:
// a broken first page should abort cleanly rather than mid-walk,
// and the HTML is handed straight to the first `next_batch` call
// so the walker doesn't re-fetch it. Page count is discovered
// incrementally — see `TargetSourceWalker::next_batch`.
let first_html = retry_on_transient_with_hook(
|| async {
navigate(ctx, self.base_url.as_str(), LIST_PAGE_MARKER).await
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
|| async { recircuit_if_configured(ctx.tor).await },
)
.await?;
let backfill = matches!(mode, DiscoverMode::Backfill);
let order: Vec<i32> = match (last_page, backfill) {
(None, _) => vec![1],
// Backfill = oldest-first: walk pages last → 1, then
// reverse within each page (the listing is update_date
// DESC, so the bottom of the last page is the oldest
// entry the source still surfaces).
(Some(last), true) => (1..=last).rev().collect(),
(Some(last), false) => (1..=last).collect(),
};
tracing::info!(
?mode,
last_page = ?last_page,
page_count = order.len(),
"walking pagination"
);
let mut all = Vec::new();
for page_num in order {
let html = if page_num == 1 {
first_html.clone()
} else {
navigate(ctx, &page_url(&self.base_url, page_num)).await?
};
let mut page_refs = {
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
};
if backfill {
page_refs.reverse();
}
tracing::info!(page_num, count = page_refs.len(), "page walked");
all.extend(page_refs);
if cap_reached(&all, max_results) {
tracing::info!(cap = ?max_results, "max_results reached; halting pagination");
break;
}
}
Ok(truncate_to_cap(all, max_results))
Ok(Box::new(TargetSourceWalker {
base_url: self.base_url.clone(),
next_page: 1,
first_page_html: Some(first_html),
}))
}
async fn fetch_manga(
@@ -115,9 +101,23 @@ impl Source for TargetSource {
ctx: &FetchContext<'_>,
r: &SourceMangaRef,
) -> anyhow::Result<SourceManga> {
let html = navigate(ctx, r.url.as_str()).await?;
parse_manga_detail(&html, &r.source_manga_key, self.parse_chapters)
.with_context(|| format!("parse manga detail at {}", r.url))
// When we'll parse the chapter table, wait for at least one
// chapter row to appear — that's the marker most sensitive to
// the post-load JS partial-render race. When we won't, fall
// back to the layout-level `#logo` so we still wait for the
// page to settle.
let marker = if self.parse_chapters {
DETAIL_PAGE_CHAPTERS_MARKER
} else {
DETAIL_PAGE_LAYOUT_MARKER
};
let html = navigate(ctx, r.url.as_str(), marker).await?;
// Convert PageError → anyhow::Error via `?`. PageError stays
// downcastable from the wrapped anyhow::Error so the pipeline
// can still recognize Transient via `error.downcast_ref::<PageError>()`.
let manga = parse_manga_detail(&html, &r.source_manga_key, self.parse_chapters)
.with_context(|| format!("parse manga detail at {}", r.url))?;
Ok(manga)
}
async fn fetch_chapter_list(
@@ -137,46 +137,160 @@ impl Source for TargetSource {
}
}
fn cap_reached<T>(buf: &[T], max: Option<usize>) -> bool {
matches!(max, Some(m) if buf.len() >= m)
/// Walker returned by [`TargetSource::discover`]. Walks pages `1..` in
/// order, terminating as soon as a page renders cleanly with zero entries
/// — that's the "we ran off the end of the index" signal. Page 1's HTML
/// is cached at construction time (discover already had to fetch it for
/// the transient probe) so the first batch doesn't re-fetch.
///
/// A genuinely empty `Ok(vec![])` from `parse_manga_list_from` is what
/// stops us: the parser's `#logo` sentinel converts unrendered pages
/// into transient errors before they reach this loop, so an empty
/// parse result reliably means "no more entries."
struct TargetSourceWalker {
base_url: String,
next_page: i32,
first_page_html: Option<String>,
}
fn truncate_to_cap<T>(mut buf: Vec<T>, max: Option<usize>) -> Vec<T> {
if let Some(m) = max {
buf.truncate(m);
#[async_trait]
impl DiscoverWalk for TargetSourceWalker {
async fn next_batch(
&mut self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Option<Vec<SourceMangaRef>>> {
let page_num = self.next_page;
let page_refs = if page_num == 1 {
// Reuse the cached page-1 HTML from the initial probe. Take
// it (rather than clone) so a future re-entry that somehow
// revisits page 1 still falls back to a real fetch.
match self.first_page_html.take() {
Some(html) => {
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)?
}
None => {
retry_on_transient_with_hook(
|| async {
let html = navigate(
ctx,
self.base_url.as_str(),
LIST_PAGE_MARKER,
)
.await?;
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
|| async { recircuit_if_configured(ctx.tor).await },
)
.await?
}
}
} else {
retry_on_transient_with_hook(
|| async {
let url = page_url(&self.base_url, page_num);
let html = navigate(ctx, &url, LIST_PAGE_MARKER).await?;
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
|| async { recircuit_if_configured(ctx.tor).await },
)
.await?
};
tracing::info!(page_num, count = page_refs.len(), "page walked");
if page_refs.is_empty() {
return Ok(None);
}
self.next_page += 1;
Ok(Some(page_refs))
}
buf
}
/// Per-page-type markers used by `navigate`'s post-navigation wait.
/// Each is the most specific element the parser will later look for —
/// waiting on it closes the partial-render race (e.g. `#chapter_table`
/// wrapper present but rows still being injected by post-load JS) that
/// the old fixed 1s sleep masked. See [`navigate`].
const LIST_PAGE_MARKER: &str = "#left_side .pic_list .updatesli";
const DETAIL_PAGE_CHAPTERS_MARKER: &str = "#chapter_table td h4 a.chico";
const DETAIL_PAGE_LAYOUT_MARKER: &str = "#logo";
/// Single point of rate-limited navigation. Every Source request goes
/// through here, so the limiter is the only knob that controls
/// per-host RPS.
async fn navigate(ctx: &FetchContext<'_>, url: &str) -> anyhow::Result<String> {
ctx.rate.lock().await.wait().await;
let page = ctx.browser.new_page(url).await?;
page.wait_for_navigation().await?;
// Stopgap until we wait on a specific selector per page type —
// gives any post-load JS a beat to finish injecting content.
tokio::time::sleep(Duration::from_secs(1)).await;
let html = page.content().await?;
page.close().await?;
/// through here, so the per-host limiter map is the only knob that
/// controls per-origin RPS. Also the choke point for transient-page
/// detection — every fetched body is screened by
/// [`classify_navigate_html`] before being handed to a selector.
///
/// `marker` is a CSS selector the caller expects to find on the loaded
/// page. The wait is best-effort: a timeout is **not** an error
/// (legitimately-empty pages may never render the marker), it just
/// caps how long we'll hold for post-load JS to finish injecting
/// content. The parser's own sentinels and the universal broken-page
/// body check still catch real failures.
async fn navigate(
ctx: &FetchContext<'_>,
url: &str,
marker: &str,
) -> Result<String, PageError> {
ctx.rate.wait_for(url).await?;
let page = ctx
.browser
.new_page(url)
.await
.map_err(|e| PageError::Other(anyhow::Error::from(e)))?;
match wait_for_nav(&page).await {
Ok(()) => {}
Err(NavError::Timeout(_)) => {
page.close().await.ok();
return Err(PageError::transient("nav timeout"));
}
Err(NavError::Cdp(e)) => {
page.close().await.ok();
return Err(PageError::Other(anyhow::Error::from(e)));
}
}
// Best-effort wait for the page-type marker. We deliberately
// discard a timeout here — see fn-level doc.
let _ = wait_for_selector(&page, marker, SELECTOR_TIMEOUT).await;
let html = page
.content()
.await
.map_err(|e| PageError::Other(anyhow::Error::from(e)))?;
page.close().await.ok();
classify_navigate_html(html)
}
/// Classify a fetched body. The broken-page template is universal across
/// the site — every page type (list, detail, chapter list, reader) gets
/// the same `we're sorry, the request file are not found` body when the
/// server is hiccuping. Catching it here means individual parsers
/// downstream don't have to repeat the check.
fn classify_navigate_html(html: String) -> Result<String, PageError> {
if is_broken_page_body(&html) {
return Err(PageError::transient("broken-page body signature"));
}
Ok(html)
}
fn parse_last_page(doc: &scraper::Html) -> Option<i32> {
// Pagination links carry their page number as text. Take the
// numeric maximum so we don't depend on a specific layout (Prev,
// Next, ellipses, etc. all get filtered out by .parse).
let sel = scraper::Selector::parse("#left_side .pagination a").unwrap();
doc.select(&sel)
.filter_map(|a| {
collapse_whitespace(&a.text().collect::<String>())
.parse::<i32>()
.ok()
})
.max()
/// Hook for [`retry_on_transient_with_hook`]: when TOR is configured,
/// signal `NEWNYM` so the next navigation draws a fresh exit. Errors
/// from the controller are logged and swallowed — failing to recircuit
/// shouldn't take down the crawl, the next attempt just runs on the
/// same circuit as before.
async fn recircuit_if_configured(tor: Option<&crate::crawler::tor::TorController>) {
if let Some(t) = tor {
if let Err(e) = t.new_identity().await {
tracing::warn!(error = %e, "TOR NEWNYM failed; retrying on same circuit");
}
}
}
/// Substitutes the first `/N/` path segment with the target page
/// number. Source impls that paginate via a different URL shape can
/// override this — for the modeled site the segment is always present.
@@ -204,14 +318,23 @@ fn page_url(template_url: &str, page: i32) -> String {
}
#[cfg(test)]
fn parse_manga_list(html: &str) -> Vec<SourceMangaRef> {
fn parse_manga_list(html: &str) -> Result<Vec<SourceMangaRef>, PageError> {
let doc = scraper::Html::parse_document(html);
parse_manga_list_from(&doc)
}
fn parse_manga_list_from(doc: &scraper::Html) -> Vec<SourceMangaRef> {
/// Parse a manga listing page. `#logo` is present on every well-formed
/// listing page on the source; its absence means the response is a
/// broken-page placeholder (transient) rather than a genuinely empty
/// listing. Empty listings (last-page tail, search with no hits) remain
/// `Ok(vec![])`.
fn parse_manga_list_from(doc: &scraper::Html) -> Result<Vec<SourceMangaRef>, PageError> {
if !has_logo_sentinel(doc) {
return Err(PageError::transient("manga list: #logo sentinel missing"));
}
let sel = scraper::Selector::parse("#left_side .pic_list .updatesli span a").unwrap();
doc.select(&sel)
Ok(doc
.select(&sel)
.filter_map(|a| {
let url = a.value().attr("href")?.trim().to_string();
if url.is_empty() {
@@ -227,16 +350,22 @@ fn parse_manga_list_from(doc: &scraper::Html) -> Vec<SourceMangaRef> {
url,
})
})
.collect()
.collect())
}
fn parse_manga_detail(
html: &str,
key: &str,
include_chapters: bool,
) -> anyhow::Result<SourceManga> {
) -> Result<SourceManga, PageError> {
let doc = scraper::Html::parse_document(html);
// Sentinel first: a broken-page response will trip this before any
// anyhow context is added for missing required fields.
if !has_logo_sentinel(&doc) {
return Err(PageError::transient("manga detail: #logo sentinel missing"));
}
let title = first_text(&doc, ".w-title h1").context("missing .w-title h1")?;
let summary = first_text(&doc, ".manga_summary");
let cover_url = first_attr(&doc, ".cover > img:nth-child(1)", "src");
@@ -265,7 +394,7 @@ fn parse_manga_detail(
.collect();
let chapters = if include_chapters {
parse_chapter_list(&doc)
parse_chapter_list(&doc)?
} else {
Vec::new()
};
@@ -323,9 +452,22 @@ fn strip_tag_count(s: &str) -> String {
trimmed.to_string()
}
fn parse_chapter_list(doc: &scraper::Html) -> Vec<SourceChapterRef> {
/// Parse the chapter table on a manga detail page. Returns `Transient` if
/// `#chapter_table` isn't in the DOM at all — the table is required even
/// for mangas with no published chapters yet (the source renders an empty
/// `<table>`), so an absent table signals a partial render (post-load JS
/// not done, layout drift) rather than a legitimately empty list. Without
/// this sentinel, an empty `Vec` reaches `sync_manga_chapters` and the
/// soft-drop branch flips every existing chapter to `dropped_at`.
fn parse_chapter_list(doc: &scraper::Html) -> Result<Vec<SourceChapterRef>, PageError> {
if !has_chapter_table_sentinel(doc) {
return Err(PageError::transient(
"manga detail: #chapter_table sentinel missing",
));
}
let sel = scraper::Selector::parse("#chapter_table td h4 a.chico").unwrap();
doc.select(&sel)
Ok(doc
.select(&sel)
.filter_map(|a| {
let url = a.value().attr("href")?.trim().to_string();
if url.is_empty() {
@@ -334,13 +476,22 @@ fn parse_chapter_list(doc: &scraper::Html) -> Vec<SourceChapterRef> {
let title_text = collapse_whitespace(&a.text().collect::<String>());
let number = parse_chapter_number(&title_text).unwrap_or(0);
Some(SourceChapterRef {
source_chapter_key: derive_key_from_url(&url),
source_chapter_key: derive_chapter_key_from_url(&url),
number,
title: (!title_text.is_empty()).then_some(title_text),
url,
})
})
.collect()
.collect())
}
/// Returns true when the chapter-table container is present in the DOM.
/// Source-specific: the target site uses `#chapter_table` as the wrapper
/// element. Distinguishes "table is present but empty" (legit edge case
/// for new mangas) from "table is missing entirely" (partial render).
fn has_chapter_table_sentinel(doc: &scraper::Html) -> bool {
let sel = scraper::Selector::parse("#chapter_table").expect("valid selector");
doc.select(&sel).next().is_some()
}
fn parse_chapter_number(text: &str) -> Option<i32> {
@@ -366,6 +517,29 @@ fn derive_key_from_url(url: &str) -> String {
.to_string()
}
/// Chapter URLs on this source point at the reader's page 1, e.g.
/// `.../uu/br_chapter-379272/pg-1/`. The chapter identity is the
/// `br_chapter-N` (or `to_chapter-N`) segment — the `pg-\d+` segment
/// identifies a page *within* a chapter, so naively taking the last
/// path component returns `"pg-1"` for every chapter and collapses
/// them all under one source_chapter_key downstream.
fn derive_chapter_key_from_url(url: &str) -> String {
let trimmed = url.split('?').next().unwrap_or(url).trim_end_matches('/');
let without_reader_page = match trimmed.rsplit_once('/') {
Some((prefix, last)) if is_reader_page_segment(last) => prefix,
_ => trimmed,
};
without_reader_page
.rsplit('/')
.find(|s| !s.is_empty())
.unwrap_or(url)
.to_string()
}
fn is_reader_page_segment(s: &str) -> bool {
s.len() > 3 && s.starts_with("pg-") && s[3..].bytes().all(|b| b.is_ascii_digit())
}
fn first_text(doc: &scraper::Html, sel: &str) -> Option<String> {
let s = scraper::Selector::parse(sel).ok()?;
let el = doc.select(&s).next()?;
@@ -471,6 +645,7 @@ mod tests {
const LISTING_HTML: &str = r#"
<html><body>
<header><div id="logo">Target</div></header>
<div id="left_side">
<div class="pic_list">
<div class="updatesli">
@@ -489,6 +664,7 @@ mod tests {
const DETAIL_HTML: &str = r#"
<html><body>
<header><div id="logo">Target</div></header>
<div class="w-title"><h1>Test Manga Title</h1></div>
<div class="cover"><img src="/cover.jpg"><img src="/extra-not-cover.jpg"></div>
<div class="manga_summary">A summary of the manga.</div>
@@ -514,7 +690,7 @@ mod tests {
#[test]
fn parse_manga_list_extracts_title_url_and_derives_key() {
let refs = parse_manga_list(LISTING_HTML);
let refs = parse_manga_list(LISTING_HTML).expect("parse");
assert_eq!(refs.len(), 2, "third entry has empty href and is skipped");
assert_eq!(refs[0].title, "Foo Manga");
assert_eq!(refs[0].url, "https://target.example/manga/foo");
@@ -523,6 +699,30 @@ mod tests {
assert_eq!(refs[1].source_manga_key, "bar-baz");
}
#[test]
fn parse_manga_list_returns_transient_when_logo_missing() {
// Broken-page response: no #logo, no listing. Empty Vec would
// hide this as "page has no mangas"; Transient is the signal
// upstream code retries on.
let html = r#"<html><body>
<p>we're sorry, the request file are not found.</p>
</body></html>"#;
let err = parse_manga_list(html).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
#[test]
fn parse_manga_list_ok_empty_when_logo_present_but_no_items() {
// Last page of pagination, "no results" search, etc. Legitimately
// empty must stay distinguishable from "page is broken".
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<div id="left_side"><div class="pic_list"></div></div>
</body></html>"#;
let refs = parse_manga_list(html).expect("logo present == not transient");
assert!(refs.is_empty());
}
#[test]
fn parse_manga_detail_pulls_all_fields() {
let m = parse_manga_detail(DETAIL_HTML, "test-key", true).expect("parse");
@@ -577,6 +777,61 @@ mod tests {
assert_eq!(strip_tag_count("Tag (a) (12)"), "Tag (a)");
}
#[test]
fn parse_chapter_list_keeps_all_chapters_with_unique_keys() {
// Real listing fixture from the target site. 15 rows: chapters
// with various Ch.N markup, one hiatus row, three "notice." rows,
// and duplicates of Ch.1 and Ch.52 from different uploaders.
// Every row must survive parsing and every chapter must have a
// distinct source_chapter_key — chapter URLs all end in `/pg-1/`
// (the reader's page-1 entry point), and a naive
// last-segment-of-URL derivation returns "pg-1" for every row,
// collapsing the whole list into one downstream chapter row.
let html = include_str!(
"../../../tests/fixtures/target/chapter_list_uu.html"
);
let doc = scraper::Html::parse_document(html);
let chapters = parse_chapter_list(&doc).expect("fixture has the table");
assert_eq!(chapters.len(), 15, "every row kept (notices/hiatus included)");
let mut keys: Vec<&str> =
chapters.iter().map(|c| c.source_chapter_key.as_str()).collect();
keys.sort();
let dupe = keys.windows(2).find(|w| w[0] == w[1]).map(|w| w[0]);
assert!(dupe.is_none(), "duplicate chapter key: {dupe:?}");
for c in &chapters {
assert_ne!(
c.source_chapter_key, "pg-1",
"key must not be the reader-page segment: {:?}", c
);
}
// Latest chapter is first (source orders newest → oldest).
assert_eq!(chapters[0].number, 67);
assert_eq!(chapters[0].title.as_deref(), Some("Ch.67 : Official"));
assert_eq!(chapters[0].source_chapter_key, "br_chapter-379272");
// Duplicate-number chapters (different uploaders) survive as
// two rows. The (manga_id, number) UNIQUE collapse is a
// downstream schema concern handled separately.
assert_eq!(
chapters.iter().filter(|c| c.number == 52).count(),
2,
"two Ch.52 uploads must both survive parsing"
);
assert_eq!(
chapters.iter().filter(|c| c.number == 1).count(),
2,
"Ch.1 Official and Ch.1 Team Hazama are both kept"
);
// Notices / hiatus rows have no leading digit so they parse to
// number=0. They are not filtered out.
let zero = chapters.iter().filter(|c| c.number == 0).count();
assert!(zero >= 4, "hiatus + 3 notices kept; got {zero}");
}
#[test]
fn parse_chapter_number_grabs_first_integer_run() {
assert_eq!(parse_chapter_number("Ch.1"), Some(1));
@@ -587,29 +842,6 @@ mod tests {
assert_eq!(parse_chapter_number("Special"), None);
}
#[test]
fn parse_last_page_picks_highest_pagination_link() {
let html = r#"
<div id="left_side"><div class="pagination">
<a href="/list/1/">Prev</a>
<ol>
<li><a href="/list/1/">1</a></li>
<li><a href="/list/2/">2</a></li>
<li><a href="/list/47/">47</a></li>
<li><a href="/list/2/">Next</a></li>
</ol>
</div></div>
"#;
let doc = scraper::Html::parse_document(html);
assert_eq!(parse_last_page(&doc), Some(47));
}
#[test]
fn parse_last_page_none_when_no_pagination() {
let doc = scraper::Html::parse_document("<html></html>");
assert!(parse_last_page(&doc).is_none());
}
#[test]
fn page_url_substitutes_numeric_path_segment() {
assert_eq!(
@@ -630,6 +862,45 @@ mod tests {
assert_eq!(derive_key_from_url("/manga/bar"), "bar");
}
#[test]
fn derive_chapter_key_strips_trailing_reader_page_segment() {
// Listing links go to page 1 of the reader; strip /pg-\d+/.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-1/"),
"br_chapter-379272"
);
assert_eq!(
derive_chapter_key_from_url(".../uu/to_chapter-13/pg-1/"),
"to_chapter-13"
);
// Defensive: deep-link to a non-first page should still resolve
// to the same chapter identity.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-25/"),
"br_chapter-379272"
);
// No reader-page suffix → behaves like derive_key_from_url.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/"),
"br_chapter-379272"
);
// Query strings are stripped.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-1/?ref=x"),
"br_chapter-379272"
);
// `pg-foo` is not a valid reader-page segment; treated as identity.
assert_eq!(
derive_chapter_key_from_url(".../uu/something/pg-foo/"),
"pg-foo"
);
// Bare `pg-` (no digits) likewise not stripped.
assert_eq!(
derive_chapter_key_from_url(".../uu/something/pg-/"),
"pg-"
);
}
#[test]
fn metadata_hash_is_stable_and_field_sensitive() {
let base = parse_manga_detail(DETAIL_HTML, "k", true).unwrap();
@@ -644,7 +915,17 @@ mod tests {
#[test]
fn missing_optional_fields_parse_to_none() {
let html = r#"<html><body><div class="w-title"><h1>Minimal</h1></div></body></html>"#;
// Minimal but well-formed detail page: title is required, every
// other field is optional, but the chapter table is structural —
// its absence is treated as Transient (a freshly added manga
// renders the table empty, not absent). See
// `parse_chapter_list_returns_transient_when_table_missing` for
// the negative case.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<div class="w-title"><h1>Minimal</h1></div>
<table id="chapter_table"></table>
</body></html>"#;
let m = parse_manga_detail(html, "min", true).unwrap();
assert_eq!(m.title, "Minimal");
assert!(m.summary.is_none());
@@ -668,8 +949,104 @@ mod tests {
#[test]
fn parse_manga_detail_errors_on_missing_title() {
let html = "<html><body><p>nothing</p></body></html>";
// Logo present (page is alive) — failure here is a real parse
// miss (Other), not Transient.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<p>nothing</p></body></html>"#;
let err = parse_manga_detail(html, "x", true).unwrap_err();
assert!(!err.is_transient(), "expected Other, got Transient: {err}");
assert!(err.to_string().contains("missing .w-title h1"));
}
#[test]
fn classify_navigate_html_passes_normal_body_through() {
let body = "<html><body><header><div id='logo'>Target</div></header>\
<p>content</p></body></html>"
.to_string();
let out = classify_navigate_html(body.clone()).expect("ok");
assert_eq!(out, body);
}
#[test]
fn classify_navigate_html_returns_transient_for_broken_template() {
let body = "<html><head></head><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>"
.to_string();
let err = classify_navigate_html(body).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
#[test]
fn parse_manga_detail_returns_transient_when_logo_missing() {
// Broken-page response on a detail URL — must be reported as
// Transient so the job is retried rather than logging "missing
// .w-title h1" against a permanently-skipped manga.
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
let err = parse_manga_detail(html, "x", true).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
#[test]
fn parse_chapter_list_returns_transient_when_table_missing() {
// Partial render (post-load JS hadn't injected the table, layout
// drift, etc). Returning Vec::new() would silently soft-drop every
// existing chapter for the manga via sync_manga_chapters; Transient
// is the signal the job system retries on.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<div class="w-title"><h1>Test</h1></div>
</body></html>"#;
let doc = scraper::Html::parse_document(html);
let err = parse_chapter_list(&doc).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
#[test]
fn parse_chapter_list_ok_empty_when_table_present_but_no_rows() {
// A freshly-added manga with no chapters yet — the source renders
// the `<table id="chapter_table">` wrapper but no `<tr>` rows
// inside. Must stay distinguishable from a missing-table render.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<table id="chapter_table"></table>
</body></html>"#;
let doc = scraper::Html::parse_document(html);
let chapters = parse_chapter_list(&doc).expect("present table is not transient");
assert!(chapters.is_empty());
}
#[test]
fn parse_manga_detail_propagates_chapter_table_transient() {
// End-to-end: a detail page that survives the #logo sentinel but
// has the chapter table stripped must fail Transient at the parser
// boundary, not return a SourceManga with empty chapters.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<div class="w-title"><h1>Test Title</h1></div>
<div class="cover"><img src="/cover.jpg"></div>
<!-- intentionally no #chapter_table -->
</body></html>"#;
let err = parse_manga_detail(html, "key", true).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
#[test]
fn parse_manga_detail_skips_chapter_sentinel_when_include_chapters_false() {
// Metadata-only mode (`skip_chapters` upstream) must not require
// the chapter table — pipeline.rs avoids calling sync_manga_chapters
// for these mangas, so the absent table is not a correctness issue
// and shouldn't surface as Transient.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<div class="w-title"><h1>Test Title</h1></div>
<div class="cover"><img src="/cover.jpg"></div>
</body></html>"#;
let manga = parse_manga_detail(html, "key", false)
.expect("metadata-only parse must not require chapter table");
assert!(manga.chapters.is_empty());
}
}
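For reference, the behaviour pinned by the `derive_chapter_key_strips_trailing_reader_page_segment` test above could be met by something like the following. This is a minimal sketch under stated assumptions, not the implementation in this diff: `derive_chapter_key_sketch` and `is_reader_page_segment` are hypothetical names, and the real function may handle further edge cases that the tests do not show.

```rust
/// Hypothetical helper: true only for `pg-<digits>` with at least one digit.
fn is_reader_page_segment(segment: &str) -> bool {
    segment.strip_prefix("pg-").map_or(false, |digits| {
        !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit())
    })
}

/// Sketch only (assumed shape); mirrors the cases the test asserts.
fn derive_chapter_key_sketch(url: &str) -> String {
    // Drop the query string and fragment before looking at path segments.
    let path = url.split(['?', '#']).next().unwrap_or(url);
    // Walk the path segments from the right, ignoring any trailing slash.
    let mut segments = path.trim_end_matches('/').rsplit('/');
    let last = segments.next().unwrap_or("");
    if is_reader_page_segment(last) {
        // `.../<chapter-key>/pg-N/`: the chapter key is one segment up.
        if let Some(parent) = segments.next() {
            return parent.to_string();
        }
    }
    // No reader-page suffix (or an invalid one such as `pg-foo`): the last
    // segment is the identity.
    last.to_string()
}
```

Keeping the `pg-<digits>` check strict is what lets `pg-foo` and a bare `pg-` fall through as ordinary keys, which matches the last two assertions in the test.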

backend/src/crawler/tor.rs Normal file

@@ -0,0 +1,446 @@
//! TOR control-port client for `SIGNAL NEWNYM` ("recircuit").
//!
//! The crawler can be proxied through TOR (`CRAWLER_PROXY=socks5h://tor:9050`)
//! to randomize the exit IP seen by the target site. When the target
//! returns a "bad page" (its broken-template body, missing layout
//! sentinel, or unauthenticated probe despite a valid PHPSESSID), it
//! is often the current exit being rate-limited or fingerprinted rather
//! than a real failure. Asking the local TOR daemon for a new identity
//! over its control port (port 9051 by default) makes subsequent
//! connections draw a fresh circuit; combined with `IsolateDestAddr`
//! in torrc this is usually enough to clear the failure.
//!
//! Scope is deliberately tiny — `AUTHENTICATE` + `SIGNAL NEWNYM` over
//! a one-shot TCP connection. No `torut` dep, no hidden-service
//! plumbing, no event streaming.
//!
//! **Caveat for in-flight connections:** Chromium reuses sockets, so a
//! `NEWNYM` only affects *new* connections (in TOR terms, new circuits).
//! That's fine for our retry path — the next navigation opens a fresh
//! connection. We do not try to forcibly close existing streams.
use std::path::{Path, PathBuf};
use std::time::Duration;
use anyhow::{anyhow, bail, Context};
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::TcpStream;
use tokio::time::timeout;
/// Conventional control port. Tor leaves the control port disabled unless
/// `ControlPort` is set in torrc; 9051 is the customary value.
const DEFAULT_CONTROL_PORT: u16 = 9051;
/// Connect timeout — generous enough for a slow compose start, short
/// enough that a misconfigured controller doesn't stall a crawl.
const CONNECT_TIMEOUT: Duration = Duration::from_secs(5);
/// Per-command read timeout. `SIGNAL NEWNYM` returns instantly on the
/// happy path; bound it so a half-broken control port can't hang us.
const READ_TIMEOUT: Duration = Duration::from_secs(5);
/// How the controller authenticates to the control port.
///
/// `Cookie` is preferred for compose deploys where the auth cookie file
/// is shared between the `tor` and `backend` containers via a named
/// volume. `Password` is the fallback when the cookie file isn't
/// reachable (different gid, no shared volume, etc.). `None` matches a
/// torrc with no `CookieAuthentication 1` and no `HashedControlPassword`
/// — useful for local experimentation, not for production.
///
/// `Debug` is implemented manually to redact the password (and the
/// cookie path, which is non-sensitive but uninteresting in logs).
/// Don't add `#[derive(Debug)]` — the controller is `?`-logged at
/// startup and a derive would expand the password into the trace.
#[derive(Clone)]
pub enum TorAuth {
None,
Password(String),
Cookie(PathBuf),
}
impl std::fmt::Debug for TorAuth {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
TorAuth::None => f.write_str("None"),
TorAuth::Password(_) => f.write_str("Password(<redacted>)"),
TorAuth::Cookie(_) => f.write_str("Cookie(<path>)"),
}
}
}
#[derive(Debug, Clone)]
pub struct TorController {
/// `host:port` string. Kept as a string (not a `SocketAddr`) so
/// docker-compose hostnames like `tor:9051` resolve at connect time.
addr: String,
auth: TorAuth,
}
impl TorController {
pub fn new(addr: impl Into<String>, auth: TorAuth) -> Self {
Self { addr: addr.into(), auth }
}
/// Build a controller from the env-config shape:
/// `url` (e.g. `tcp://tor:9051`, `127.0.0.1:9051`, or `tor`),
/// optional password, optional cookie path. Returns `Ok(None)` when
/// `url` is absent — that's the "TOR feature disabled" signal.
/// Cookie wins over password when both are set (rotates with TOR;
/// no secret to manage).
pub fn from_parts(
url: Option<&str>,
password: Option<&str>,
cookie_path: Option<&Path>,
) -> anyhow::Result<Option<Self>> {
let Some(url) = url else { return Ok(None) };
let addr = parse_control_url(url)?;
let auth = match (cookie_path, password) {
(Some(p), _) => TorAuth::Cookie(p.to_path_buf()),
(None, Some(p)) => TorAuth::Password(p.to_string()),
(None, None) => TorAuth::None,
};
Ok(Some(Self { addr, auth }))
}
/// Open the control port, `AUTHENTICATE`, `SIGNAL NEWNYM`, `QUIT`.
/// Each invocation is a fresh connection; the controller is cheap
/// to clone and stateless across calls.
pub async fn new_identity(&self) -> anyhow::Result<()> {
let stream = timeout(CONNECT_TIMEOUT, TcpStream::connect(&self.addr))
.await
.with_context(|| {
format!("timed out connecting to TOR control port {}", self.addr)
})?
.with_context(|| format!("connect to TOR control port {}", self.addr))?;
let (read, mut write) = stream.into_split();
let mut read = BufReader::new(read);
let auth_line = self.build_auth_line().await?;
write_line(&mut write, &auth_line).await?;
timeout(READ_TIMEOUT, expect_250(&mut read))
.await
.map_err(|_| anyhow!("TOR control AUTHENTICATE timed out"))?
.context("AUTHENTICATE")?;
write_line(&mut write, "SIGNAL NEWNYM").await?;
timeout(READ_TIMEOUT, expect_250(&mut read))
.await
.map_err(|_| anyhow!("TOR control SIGNAL NEWNYM timed out"))?
.context("SIGNAL NEWNYM")?;
// QUIT is courtesy; ignore errors — the daemon may close the
// socket before our QUIT lands and that's perfectly fine.
let _ = write_line(&mut write, "QUIT").await;
// Debug-level: a busy crawl can rotate circuits many times per
// minute, INFO is too chatty. Failures still log at WARN.
tracing::debug!(addr = %self.addr, "TOR NEWNYM signaled");
Ok(())
}
async fn build_auth_line(&self) -> anyhow::Result<String> {
match &self.auth {
TorAuth::None => Ok("AUTHENTICATE".to_string()),
TorAuth::Password(p) => Ok(format!("AUTHENTICATE \"{}\"", escape_quoted(p))),
TorAuth::Cookie(path) => {
let bytes = tokio::fs::read(path)
.await
.with_context(|| format!("read TOR cookie file {}", path.display()))?;
Ok(format!("AUTHENTICATE {}", hex_encode(&bytes)))
}
}
}
}
/// Parse `tcp://host:port`, `host:port`, or bare `host` into a
/// connect-time string. Default port is [`DEFAULT_CONTROL_PORT`].
fn parse_control_url(url: &str) -> anyhow::Result<String> {
let stripped = url.strip_prefix("tcp://").unwrap_or(url);
if stripped.is_empty() {
bail!("TOR control url is empty");
}
if stripped.contains(':') {
Ok(stripped.to_string())
} else {
Ok(format!("{stripped}:{DEFAULT_CONTROL_PORT}"))
}
}
fn escape_quoted(s: &str) -> String {
s.replace('\\', r"\\").replace('"', r#"\""#)
}
fn hex_encode(bytes: &[u8]) -> String {
let mut s = String::with_capacity(bytes.len() * 2);
for b in bytes {
s.push_str(&format!("{b:02x}"));
}
s
}
async fn write_line<W: tokio::io::AsyncWrite + Unpin>(
w: &mut W,
line: &str,
) -> anyhow::Result<()> {
w.write_all(line.as_bytes()).await?;
w.write_all(b"\r\n").await?;
w.flush().await?;
Ok(())
}
/// Drain a TOR control reply, accepting only status `250`. Handles
/// the protocol's three line forms: `XYZ ...` (single/end), `XYZ-...`
/// (continuation), `XYZ+...` (data block ended by a lone `.`). Our
/// commands only ever produce single-line `250 OK`, but we honor the
/// continuation forms so a future torrc that adds events / banners
/// doesn't confuse the parser.
async fn expect_250<R: AsyncBufReadExt + Unpin>(r: &mut R) -> anyhow::Result<()> {
loop {
let mut line = String::new();
let n = r.read_line(&mut line).await?;
if n == 0 {
bail!("TOR control port closed connection mid-reply");
}
let trimmed = line.trim_end_matches(['\r', '\n']);
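// `split_at(3)` below panics if byte 3 is not a char boundary, so
// reject replies whose status prefix is not plain ASCII up front.
if trimmed.len() < 4 || !trimmed.is_char_boundary(3) {
bail!("malformed TOR control reply: {trimmed:?}");
}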
let (code, rest) = trimmed.split_at(3);
if code != "250" {
bail!("TOR control replied {trimmed:?}");
}
let sep = rest.as_bytes()[0];
match sep {
b' ' => return Ok(()),
b'-' => continue,
b'+' => {
// Data block — read until a line consisting of only ".".
loop {
let mut data = String::new();
let n = r.read_line(&mut data).await?;
if n == 0 {
bail!("TOR control port closed mid-data-block");
}
if data.trim_end_matches(['\r', '\n']) == "." {
break;
}
}
}
_ => bail!("malformed TOR control reply separator: {trimmed:?}"),
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::{Arc, Mutex};
use tokio::io::AsyncWriteExt;
use tokio::net::TcpListener;
/// Spawn a mock control port that responds to each \r\n-terminated
/// inbound line with the next entry from `replies`. Each reply has
/// its own `\r\n` appended. Records received lines into `recorder`.
/// After `replies.len()` exchanges the task drops the socket — this
/// matches the real TOR behavior for QUIT (close after acking).
async fn spawn_mock(
replies: Vec<&'static str>,
recorder: Arc<Mutex<Vec<String>>>,
) -> String {
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap().to_string();
tokio::spawn(async move {
let (sock, _) = listener.accept().await.unwrap();
let (r, mut w) = sock.into_split();
let mut r = BufReader::new(r);
for reply in replies {
let mut line = String::new();
let n = r.read_line(&mut line).await.unwrap_or(0);
if n == 0 {
return;
}
recorder
.lock()
.unwrap()
.push(line.trim_end_matches(['\r', '\n']).to_string());
w.write_all(reply.as_bytes()).await.unwrap();
w.write_all(b"\r\n").await.unwrap();
w.flush().await.unwrap();
}
});
addr
}
#[tokio::test]
async fn password_auth_then_newnym_writes_expected_sequence() {
let recorder = Arc::new(Mutex::new(Vec::new()));
// Two replies: AUTHENTICATE then SIGNAL NEWNYM. QUIT is
// fire-and-forget; the mock dropping the socket is the
// expected real-world behavior.
let addr =
spawn_mock(vec!["250 OK", "250 OK"], Arc::clone(&recorder)).await;
let controller = TorController::new(addr, TorAuth::Password("secret".into()));
controller.new_identity().await.expect("new_identity ok");
let recorded = recorder.lock().unwrap().clone();
assert_eq!(recorded.first().map(String::as_str), Some("AUTHENTICATE \"secret\""));
assert_eq!(recorded.get(1).map(String::as_str), Some("SIGNAL NEWNYM"));
}
#[tokio::test]
async fn cookie_auth_hex_encodes_file_bytes() {
let tmp = tempfile::NamedTempFile::new().unwrap();
let cookie: Vec<u8> = (0u8..32).collect();
std::fs::write(tmp.path(), &cookie).unwrap();
let recorder = Arc::new(Mutex::new(Vec::new()));
let addr =
spawn_mock(vec!["250 OK", "250 OK"], Arc::clone(&recorder)).await;
let controller =
TorController::new(addr, TorAuth::Cookie(tmp.path().to_path_buf()));
controller.new_identity().await.expect("new_identity ok");
let recorded = recorder.lock().unwrap().clone();
let expected_hex: String = cookie.iter().map(|b| format!("{b:02x}")).collect();
assert_eq!(
recorded.first().map(String::as_str),
Some(format!("AUTHENTICATE {expected_hex}").as_str())
);
}
#[tokio::test]
async fn no_auth_sends_bare_authenticate() {
let recorder = Arc::new(Mutex::new(Vec::new()));
let addr =
spawn_mock(vec!["250 OK", "250 OK"], Arc::clone(&recorder)).await;
let controller = TorController::new(addr, TorAuth::None);
controller.new_identity().await.expect("new_identity ok");
let recorded = recorder.lock().unwrap().clone();
assert_eq!(recorded.first().map(String::as_str), Some("AUTHENTICATE"));
}
#[tokio::test]
async fn non_250_reply_returns_err_with_reply_text() {
let recorder = Arc::new(Mutex::new(Vec::new()));
let addr = spawn_mock(
vec!["515 Bad authentication"],
Arc::clone(&recorder),
)
.await;
let controller =
TorController::new(addr, TorAuth::Password("wrong".into()));
let err = controller.new_identity().await.expect_err("should fail");
let msg = format!("{err:#}");
assert!(msg.contains("515"), "expected 515 in error, got: {msg}");
}
#[tokio::test]
async fn closed_connection_mid_reply_is_an_error() {
// Listener accepts the AUTH line then drops without replying —
// this exercises the EOF-mid-reply path in expect_250 (rather
// than tor's own error replies which are covered elsewhere).
let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
let addr = listener.local_addr().unwrap().to_string();
tokio::spawn(async move {
if let Ok((sock, _)) = listener.accept().await {
let (r, _w) = sock.into_split();
let mut r = BufReader::new(r);
let mut line = String::new();
let _ = r.read_line(&mut line).await; // read AUTH, ignore
// Drop _w (and the read half via scope exit) so the
// peer sees an immediate EOF on the next read.
}
});
let controller = TorController::new(addr, TorAuth::None);
let err = controller.new_identity().await.expect_err("should fail");
let msg = format!("{err:#}");
assert!(
msg.contains("closed connection"),
"expected EOF-mid-reply error, got: {msg}"
);
}
#[tokio::test]
async fn multi_line_250_continuation_is_accepted() {
let recorder = Arc::new(Mutex::new(Vec::new()));
// AUTHENTICATE reply uses the `250-...\r\n250 OK\r\n` form.
// Single reply string contains the whole multi-line response.
let addr = spawn_mock(
vec!["250-banner=foo\r\n250 OK", "250 OK"],
Arc::clone(&recorder),
)
.await;
let controller = TorController::new(addr, TorAuth::None);
controller.new_identity().await.expect("new_identity ok");
}
#[test]
fn from_parts_returns_none_when_url_unset() {
let c = TorController::from_parts(None, None, None).unwrap();
assert!(c.is_none());
}
#[test]
fn from_parts_prefers_cookie_over_password() {
let c = TorController::from_parts(
Some("tor:9051"),
Some("pw"),
Some(Path::new("/var/lib/tor/control_auth_cookie")),
)
.unwrap()
.expect("controller built");
assert!(matches!(c.auth, TorAuth::Cookie(_)));
}
#[test]
fn from_parts_falls_back_to_password_without_cookie() {
let c = TorController::from_parts(Some("tor:9051"), Some("pw"), None)
.unwrap()
.expect("controller built");
assert!(matches!(c.auth, TorAuth::Password(p) if p == "pw"));
}
#[test]
fn parse_control_url_accepts_tcp_scheme() {
assert_eq!(parse_control_url("tcp://127.0.0.1:9051").unwrap(), "127.0.0.1:9051");
}
#[test]
fn parse_control_url_defaults_port_when_omitted() {
assert_eq!(parse_control_url("tor").unwrap(), "tor:9051");
}
#[test]
fn parse_control_url_passes_through_host_port() {
assert_eq!(parse_control_url("tor:9999").unwrap(), "tor:9999");
}
#[test]
fn parse_control_url_rejects_empty() {
assert!(parse_control_url("").is_err());
assert!(parse_control_url("tcp://").is_err());
}
#[test]
fn escape_quoted_handles_quotes_and_backslashes() {
assert_eq!(escape_quoted(r#"a"b\c"#), r#"a\"b\\c"#);
}
#[test]
fn debug_format_redacts_password_and_cookie_path() {
// Regression: app.rs / bin/crawler.rs log the controller at
// startup via `tracing::info!(?t, ...)`. A derived Debug on
// TorAuth would expand TorAuth::Password(p) and leak the
// plaintext into logs.
let c = TorController::new("tor:9051", TorAuth::Password("super-secret".into()));
let dbg = format!("{c:?}");
assert!(!dbg.contains("super-secret"), "password leaked: {dbg}");
assert!(dbg.contains("<redacted>"), "expected <redacted>, got: {dbg}");
let c = TorController::new(
"tor:9051",
TorAuth::Cookie("/var/lib/tor/control_auth_cookie".into()),
);
let dbg = format!("{c:?}");
assert!(!dbg.contains("control_auth_cookie"), "cookie path leaked: {dbg}");
}
#[test]
fn hex_encode_zero_pads_low_bytes() {
assert_eq!(hex_encode(&[0x00, 0x0f, 0xff]), "000fff");
}
}
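The three reply line forms that `expect_250` handles (`250 ...` end line, `250-...` continuation, `250+...` data block closed by a lone `.`) can be exercised without a socket. The following is a hedged, synchronous sketch over `std::io::BufRead`; `expect_250_sync` and `protocol_error` are hypothetical names, and it uses `std::io::Error` instead of `anyhow` purely to stay dependency-free. It is an illustration of the parsing rule, not a replacement for the async function in this file.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: wrap a protocol-level failure as an `io::Error`.
fn protocol_error(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Sketch: drain one control-port reply, accepting only status 250.
fn expect_250_sync<R: BufRead>(r: &mut R) -> io::Result<()> {
    loop {
        let mut line = String::new();
        if r.read_line(&mut line)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "control port closed connection mid-reply",
            ));
        }
        let trimmed = line.trim_end_matches(['\r', '\n']);
        // Need three status bytes plus a separator, all on char boundaries.
        if trimmed.len() < 4 || !trimmed.is_char_boundary(3) {
            return Err(protocol_error(format!("malformed reply: {trimmed:?}")));
        }
        let (code, rest) = trimmed.split_at(3);
        if code != "250" {
            return Err(protocol_error(format!("control port replied {trimmed:?}")));
        }
        match rest.as_bytes()[0] {
            // `250 OK`: final line of the reply.
            b' ' => return Ok(()),
            // `250-key=value`: more lines follow.
            b'-' => continue,
            // `250+key=`: data block, terminated by a line holding only ".".
            b'+' => loop {
                let mut data = String::new();
                if r.read_line(&mut data)? == 0 {
                    return Err(io::Error::new(
                        io::ErrorKind::UnexpectedEof,
                        "control port closed mid-data-block",
                    ));
                }
                if data.trim_end_matches(['\r', '\n']) == "." {
                    break;
                }
            },
            _ => return Err(protocol_error(format!("bad separator: {trimmed:?}"))),
        }
    }
}
```

Feeding it a `std::io::Cursor` over canned bytes gives the same coverage as the mock-listener tests for the parsing paths, while the listener tests remain the ones that check the wire sequence.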


@@ -0,0 +1,244 @@
//! Centralised URL helpers for the crawler subsystem.
//!
//! Three near-identical hand-rolled URL parsers used to live in
//! `crawler::session`, `crawler::rate_limit`, and `crawler::pipeline`
//! respectively, each with subtly different edge-case behaviour
//! around port handling and IPv6 literals. They're consolidated here
//! so the divergence can't drift again.
//!
//! The hand-rolled implementations are kept intentionally — they
//! preserve the exact semantics every existing test pins. A future
//! refactor can switch to `reqwest::Url` if it can be done without
//! changing those semantics.
/// Lowercased host (no port). Returns `None` for inputs without a
/// `scheme://host` shape — those would never have reached the network
/// layer anyway. Used by the per-host rate limiter as its bucket key.
///
/// IPv6 literals are kept in their `[::1]` bracketed form so the
/// `rsplit_once(':')` port-stripping logic doesn't split inside the
/// address (e.g. `https://[::1]/foo` used to return `"[:"` because
/// the rightmost `:` is inside the literal). Buckets keyed by
/// `[::1]` vs `::1` are still uniquely-per-host; the brackets are
/// cosmetic.
pub fn host_of(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host = if host_with_port.starts_with('[') {
// IPv6 literal: keep through the closing bracket. There may
// be a trailing `:port` after `]`; strip only that.
match host_with_port.rfind(']') {
Some(end) => &host_with_port[..=end],
None => host_with_port,
}
} else {
// Hostnames and IPv4 literals: trailing `:port` (if any) is
// after the last `:`.
host_with_port
.rsplit_once(':')
.map_or(host_with_port, |(h, _)| h)
};
(!host.is_empty()).then(|| host.to_ascii_lowercase())
}
/// `scheme://host[:port]`: the path is dropped and the port, if any, is
/// kept as written. Used by the metadata pass to seed `sources.base_url`
/// from `CRAWLER_START_URL`.
pub fn origin_of(url: &str) -> Option<String> {
let (scheme, rest) = url.split_once("://")?;
let host = rest.split('/').next()?;
Some(format!("{scheme}://{host}"))
}
/// Approximate registrable-domain calculation: take the last two
/// dot-labels of the host, prefix with `.`. Used to set a parent-
/// domain cookie so the catalog's `www.` / `m.` redirects don't drop
/// the cookie mid-crawl.
///
/// Caveat: wrong for multi-part TLDs (`.co.uk`, `.com.br`). The
/// operator can override via `CRAWLER_COOKIE_DOMAIN`; pulling in the
/// Public Suffix List for one knob isn't worth it yet.
///
/// Bare hostnames (e.g. `localhost`) return the host as-is, with no
/// leading dot — setting `.localhost` as a cookie domain is invalid.
/// IPv6 literals (e.g. `[::1]`) are returned bracketed and unchanged;
/// the browser will reject them as a cookie `Domain` anyway, but the
/// representation stays sensible. Same `starts_with('[')` branch as
/// [`host_of`] for consistent IPv6 handling across the module.
pub fn registrable_domain(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host_str = if host_with_port.starts_with('[') {
// IPv6 literal: keep through the closing bracket; an optional
// `:port` follows `]`.
match host_with_port.rfind(']') {
Some(end) => &host_with_port[..=end],
None => host_with_port,
}
} else {
host_with_port
.rsplit_once(':')
.map_or(host_with_port, |(h, _)| h)
};
let host = host_str.to_ascii_lowercase();
if host.is_empty() {
return None;
}
let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
if labels.len() < 2 {
return Some(host);
}
let registrable = &labels[labels.len() - 2..];
Some(format!(".{}", registrable.join(".")))
}
/// Normalise a SOCKS proxy URL for Chromium's `--proxy-server=` flag.
///
/// reqwest accepts both `socks5://` (resolve locally) and
/// `socks5h://` (resolve via the SOCKS server — important when the
/// proxy is TOR and we don't want the host's resolver to see the
/// target hostname). Chromium does **not** know the `socks5h` scheme
/// and refuses navigations with `ERR_NO_SUPPORTED_PROXIES`. It
/// already sends destination hostnames over SOCKS5 by default
/// regardless, so stripping the `h` is a pure scheme rename — the
/// remote-DNS behaviour is preserved.
///
/// Non-SOCKS schemes pass through unchanged.
pub fn chromium_proxy_arg(proxy: &str) -> String {
if let Some(rest) = proxy.strip_prefix("socks5h://") {
format!("socks5://{rest}")
} else {
proxy.to_string()
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn host_of_strips_port_and_lowercases() {
assert_eq!(
host_of("https://CDN.Example.com:443/x").as_deref(),
Some("cdn.example.com")
);
assert_eq!(host_of("http://localhost/").as_deref(), Some("localhost"));
assert_eq!(host_of("not a url"), None);
}
#[test]
fn host_of_keeps_bracketed_ipv6_literal_intact() {
// Regression: the old impl rsplit_once(':')'d the IPv6 address,
// returning "[:" instead of "[::1]". A real IPv6 source would
// silently get a wrong rate-limit bucket key.
assert_eq!(host_of("https://[::1]/").as_deref(), Some("[::1]"));
assert_eq!(host_of("https://[::1]:8080/").as_deref(), Some("[::1]"));
assert_eq!(
host_of("https://[2001:db8::1]/foo").as_deref(),
Some("[2001:db8::1]")
);
assert_eq!(
host_of("https://[2001:db8::1]:443/foo").as_deref(),
Some("[2001:db8::1]")
);
}
#[test]
fn origin_of_returns_scheme_and_host() {
assert_eq!(
origin_of("https://example.com/some/path?q=1").as_deref(),
Some("https://example.com")
);
assert_eq!(origin_of("garbage"), None);
}
#[test]
fn registrable_domain_strips_subdomain() {
assert_eq!(
registrable_domain("https://www.target-site.com/manga/foo/").as_deref(),
Some(".target-site.com")
);
assert_eq!(
registrable_domain("https://m.example.org").as_deref(),
Some(".example.org")
);
}
#[test]
fn registrable_domain_keeps_two_label_host() {
assert_eq!(
registrable_domain("https://example.com/").as_deref(),
Some(".example.com")
);
}
#[test]
fn registrable_domain_handles_port() {
assert_eq!(
registrable_domain("http://www.foo.bar:8080/x").as_deref(),
Some(".foo.bar")
);
}
#[test]
fn registrable_domain_bare_hostname_no_leading_dot() {
assert_eq!(
registrable_domain("http://localhost:5173").as_deref(),
Some("localhost")
);
}
#[test]
fn registrable_domain_returns_none_for_garbage() {
assert!(registrable_domain("not a url").is_none());
}
#[test]
fn registrable_domain_keeps_bracketed_ipv6_literal_intact() {
// Symmetric with host_of's IPv6 fix. The cookie-domain code
// won't accept an IP as a `Domain` value, but the function
// should at least return a sensible representation rather
// than the truncated `"[:"` the old port-stripper produced.
assert_eq!(
registrable_domain("https://[::1]/").as_deref(),
Some("[::1]")
);
assert_eq!(
registrable_domain("https://[::1]:8080/").as_deref(),
Some("[::1]")
);
assert_eq!(
registrable_domain("https://[2001:db8::1]/foo").as_deref(),
Some("[2001:db8::1]")
);
}
#[test]
fn chromium_proxy_arg_strips_socks5h_to_socks5() {
// Regression: passing socks5h:// to Chromium yields
// ERR_NO_SUPPORTED_PROXIES at navigation time.
assert_eq!(
chromium_proxy_arg("socks5h://127.0.0.1:9050"),
"socks5://127.0.0.1:9050"
);
assert_eq!(
chromium_proxy_arg("socks5h://tor:9050"),
"socks5://tor:9050"
);
}
#[test]
fn chromium_proxy_arg_passes_socks5_unchanged() {
assert_eq!(
chromium_proxy_arg("socks5://127.0.0.1:9050"),
"socks5://127.0.0.1:9050"
);
}
#[test]
fn chromium_proxy_arg_passes_non_socks_unchanged() {
assert_eq!(
chromium_proxy_arg("http://proxy.example:8080"),
"http://proxy.example:8080"
);
}
}
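The multi-part-TLD caveat on `registrable_domain` is easiest to see with a concrete host. The sketch below is a simplified stand-in that takes a bare host rather than a URL, plus a hypothetical `cookie_domain` wrapper showing how an operator override could take precedence. How `CRAWLER_COOKIE_DOMAIN` is actually wired in is not shown in this diff, so the wrapper's shape is an assumption.

```rust
/// Simplified stand-in for the "last two dot-labels" rule (host only, no
/// scheme, port, or IPv6 handling). Hypothetical name.
fn last_two_labels(host: &str) -> String {
    let host = host.to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
    match labels.len() {
        // Bare hostnames such as `localhost` get no leading dot.
        0 | 1 => host.clone(),
        n => format!(".{}", labels[n - 2..].join(".")),
    }
}

/// Hypothetical wrapper: a non-empty override wins over the heuristic.
/// The override parameter models an operator-supplied cookie domain.
fn cookie_domain(host: &str, override_domain: Option<&str>) -> String {
    match override_domain {
        Some(domain) if !domain.is_empty() => domain.to_string(),
        _ => last_two_labels(host),
    }
}
```

For `www.example.co.uk` the heuristic yields `.co.uk`, which is a public suffix rather than a registrable domain, so that is the case where an explicit override would be needed.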


@@ -0,0 +1,15 @@
use chrono::{DateTime, Utc};
use serde::Serialize;
use sqlx::FromRow;
use uuid::Uuid;
#[derive(Debug, Clone, Serialize, FromRow)]
pub struct AdminAuditEntry {
pub id: Uuid,
pub actor_user_id: Option<Uuid>,
pub action: String,
pub target_kind: String,
pub target_id: Option<Uuid>,
pub payload: serde_json::Value,
pub at: DateTime<Utc>,
}


@@ -1,3 +1,4 @@
pub mod admin_audit;
pub mod api_token;
pub mod author;
pub mod bookmark;
@@ -9,11 +10,13 @@ pub mod page;
pub mod patch;
pub mod read_progress;
pub mod session;
pub mod sync_state;
pub mod tag;
pub mod upload_entry;
pub mod user;
pub mod user_preferences;
pub use admin_audit::AdminAuditEntry;
pub use api_token::ApiToken;
pub use author::{Author, AuthorRef, AuthorWithCount};
pub use bookmark::{Bookmark, BookmarkSummary};
@@ -25,6 +28,7 @@ pub use page::Page;
pub use patch::Patch;
pub use read_progress::{ReadProgress, ReadProgressForManga, ReadProgressSummary};
pub use session::Session;
pub use sync_state::{ChapterSyncState, MangaSyncState};
pub use tag::{Tag, TagRef};
pub use upload_entry::UploadEntry;
pub use user::User;


@@ -0,0 +1,48 @@
//! Sync-state enums derived per-manga / per-chapter from `manga_sources`,
//! `chapter_sources`, and `crawler_jobs` at query time. No state column
//! is persisted on `mangas` / `chapters` — see `repo::admin_view` for the
//! derivation rules and priority order.
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, sqlx::Type)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum MangaSyncState {
/// A `sync_manga` or `sync_chapter_list` job is currently
/// pending or running for this manga.
InProgress,
/// At least one `manga_sources` row exists for this manga and ALL of
/// them have `dropped_at IS NOT NULL` — every source we know about
/// has stopped surfacing it.
Dropped,
/// Default healthy state: at least one live source row OR the manga
/// was user-uploaded (no `manga_sources` rows at all).
Synced,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, sqlx::Type)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum ChapterSyncState {
/// A `sync_chapter_content` job is currently pending or running for
/// this chapter (the 0014 dedup index guarantees at most one).
Downloading,
/// At least one `chapter_sources` row exists AND all of them are
/// `dropped_at IS NOT NULL`.
Dropped,
/// `page_count = 0` AND a `dead` `sync_chapter_content` job exists
/// for this chapter. Checked BEFORE `NotDownloaded` so the more
/// informative "we tried and it died" state wins over "we never
/// got around to it". Does NOT fire when `page_count > 0`, because
/// pages on disk mean the chapter IS synced regardless of historical
/// job failures — see the priority comment in `repo::admin_view`.
Failed,
/// `page_count = 0` and no in-flight or failed job — the chapter
/// row exists but content has never been downloaded.
NotDownloaded,
/// `page_count > 0` — content has been downloaded at some point.
/// Reaped `done` jobs in `crawler_jobs` mean we can't read this from
/// the job table, so `page_count` is the durable truth.
Synced,
}
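The doc comments on `ChapterSyncState` imply a priority order when several conditions hold at once. The sketch below spells one such order out as a pure function. Only two points come from the comments above: `Failed` is checked before `NotDownloaded`, and it never fires when `page_count > 0`. The relative placement of `Downloading` and `Dropped` is an assumption made for illustration, and `ChapterFacts`, `ChapterState`, and `derive_chapter_state` are hypothetical names; the authoritative rules live in `repo::admin_view`, which is not part of this excerpt.

```rust
/// Local mirror of the enum so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChapterState {
    Downloading,
    Dropped,
    Failed,
    NotDownloaded,
    Synced,
}

/// Hypothetical bundle of the facts the query would gather per chapter.
struct ChapterFacts {
    job_in_flight: bool,   // pending or running `sync_chapter_content` job
    source_rows: u32,      // all `chapter_sources` rows
    live_source_rows: u32, // rows with `dropped_at IS NULL`
    page_count: u32,
    has_dead_job: bool, // a `dead` `sync_chapter_content` job exists
}

fn derive_chapter_state(f: &ChapterFacts) -> ChapterState {
    if f.job_in_flight {
        // Assumed to win over everything else.
        ChapterState::Downloading
    } else if f.source_rows > 0 && f.live_source_rows == 0 {
        // Assumed to rank above the page-count checks.
        ChapterState::Dropped
    } else if f.page_count > 0 {
        // Pages on disk mean synced, regardless of historical failures.
        ChapterState::Synced
    } else if f.has_dead_job {
        // Checked before NotDownloaded, per the enum docs.
        ChapterState::Failed
    } else {
        ChapterState::NotDownloaded
    }
}
```

Expressing the derivation as a function over plain facts keeps the priority order testable in isolation from the SQL that gathers those facts.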


@@ -10,4 +10,5 @@ pub struct User {
#[serde(skip)]
pub password_hash: String,
pub created_at: DateTime<Utc>,
pub is_admin: bool,
}


@@ -21,6 +21,16 @@ pub enum AppError {
PayloadTooLarge(String),
#[error("unsupported media type: {0}")]
UnsupportedMediaType(String),
/// 503: a feature is currently unavailable, as distinct from a 500
/// internal error. Used when admin actions require the crawler
/// daemon but it has been disabled (`CRAWLER_DAEMON=false`).
#[error("service unavailable: {0}")]
ServiceUnavailable(String),
/// 429 with an optional `Retry-After` header value (in seconds).
#[error("too many requests")]
TooManyRequests {
retry_after_secs: Option<u64>,
},
/// Semantic per-field validation failure. `details` is rendered into the
/// envelope so the client can highlight the bad field(s).
#[error("validation failed")]
@@ -51,6 +61,8 @@ impl AppError {
AppError::Conflict(_) => "conflict",
AppError::PayloadTooLarge(_) => "payload_too_large",
AppError::UnsupportedMediaType(_) => "unsupported_media_type",
AppError::ServiceUnavailable(_) => "service_unavailable",
AppError::TooManyRequests { .. } => "too_many_requests",
AppError::ValidationFailed { .. } => "validation_failed",
AppError::Database(sqlx::Error::RowNotFound) => "not_found",
AppError::Database(_) => "internal_error",
@@ -79,6 +91,34 @@ impl IntoResponse for AppError {
AppError::UnsupportedMediaType(msg) => {
(StatusCode::UNSUPPORTED_MEDIA_TYPE, msg.clone(), None)
}
AppError::ServiceUnavailable(msg) => {
(StatusCode::SERVICE_UNAVAILABLE, msg.clone(), None)
}
AppError::TooManyRequests { retry_after_secs } => {
// Emit `Retry-After: N` (RFC 6585 §4) so a well-behaved
// client can back off correctly. Done by building the
// response by hand below — the `(status, headers,
// body)` tuple shape doesn't fit the standard
// `(status, body)` IntoResponse path for the other
// variants.
let body = json!({
"error": {
"code": code,
"message": "too many requests; slow down",
}
});
let mut resp = (StatusCode::TOO_MANY_REQUESTS, Json(body)).into_response();
if let Some(secs) = retry_after_secs {
// `HeaderValue: From<u64>` skips both the
// intermediate `String` allocation and the
// fallible-by-shape `from_str` path.
resp.headers_mut().insert(
axum::http::header::RETRY_AFTER,
axum::http::HeaderValue::from(*secs),
);
}
return resp;
}
AppError::ValidationFailed { message, details } => (
StatusCode::UNPROCESSABLE_ENTITY,
message.clone(),
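The mapping from the two new variants to a status code and an optional `Retry-After` value can be sketched without axum. This is a std-only illustration: `ApiError` and `status_and_retry_after` are hypothetical stand-ins, not the project's `AppError` or its `IntoResponse` impl.

```rust
/// Illustrative subset of the error variants (not the real `AppError`).
pub enum ApiError {
    ServiceUnavailable(String),
    TooManyRequests { retry_after_secs: Option<u64> },
}

/// Returns the HTTP status code and, when present, the value that would
/// go into the `Retry-After` header (a whole number of seconds).
pub fn status_and_retry_after(err: &ApiError) -> (u16, Option<String>) {
    match err {
        // 503 carries no retry hint in this sketch.
        ApiError::ServiceUnavailable(_) => (503, None),
        // 429 only carries the header when the caller supplied a delay.
        ApiError::TooManyRequests { retry_after_secs } => {
            (429, retry_after_secs.map(|secs| secs.to_string()))
        }
    }
}
```

Keeping the delay optional lets a rate limiter that cannot compute a precise back-off still return 429 without sending a misleading header.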

@@ -1,21 +1,77 @@
use std::net::SocketAddr;
use std::time::Duration;
use tracing_subscriber::EnvFilter;
/// Upper bound on how long we're willing to wait for the crawler daemon
/// to drain before letting `main` return. Without it a wedged background
/// task (e.g. a chromiumoxide handler stuck on a dead WS) blocks the
/// process from exiting after Ctrl-C / SIGTERM.
const CRAWLER_SHUTDOWN_TIMEOUT: Duration = Duration::from_secs(5);
#[tokio::main]
async fn main() -> anyhow::Result<()> {
dotenvy::dotenv().ok();
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| "info,mangalord=debug".into()),
EnvFilter::try_from_default_env().unwrap_or_else(|_| {
"info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off".into()
}),
)
.init();
let config = mangalord::config::Config::from_env()?;
let addr: SocketAddr = config.bind_address.parse()?;
let app = mangalord::app::build(config).await?;
let mangalord::app::AppHandle { router, daemon } = mangalord::app::build(config).await?;
tracing::info!(%addr, "mangalord listening");
let listener = tokio::net::TcpListener::bind(addr).await?;
axum::serve(listener, app).await?;
axum::serve(listener, router)
.with_graceful_shutdown(shutdown_signal())
.await?;
// Drain background tasks (crawler daemon) before exiting so Chromium
// gets a clean shutdown rather than relying on kill-on-drop. Bounded
// by a timeout so a wedged shutdown path can't trap the process.
if let Some(d) = daemon {
if tokio::time::timeout(CRAWLER_SHUTDOWN_TIMEOUT, d.shutdown())
.await
.is_err()
{
tracing::warn!(
timeout_s = CRAWLER_SHUTDOWN_TIMEOUT.as_secs(),
"crawler daemon shutdown exceeded timeout; abandoning"
);
}
}
Ok(())
}
/// Wait for either Ctrl-C (interactive shell) or SIGTERM (Docker /
/// Kubernetes / Podman / systemd stop) and log which arrived. Without
/// the SIGTERM branch, `docker compose stop` runs out its grace period
/// and skips straight to SIGKILL — the daemon never gets the
/// `daemon.shutdown().await` path, leaking Chromium.
async fn shutdown_signal() {
use tokio::signal::unix::{signal, SignalKind};
let mut sigterm = match signal(SignalKind::terminate()) {
Ok(s) => s,
Err(e) => {
// SignalKind::terminate() is supported on every Unix the
// tokio runtime runs on; if registration fails we still
// honour Ctrl-C so the process can at least be stopped
// from an interactive shell.
tracing::warn!(error = %e, "could not install SIGTERM handler; falling back to ctrl_c only");
let _ = tokio::signal::ctrl_c().await;
tracing::info!("ctrl-c received; shutting down");
return;
}
};
tokio::select! {
_ = tokio::signal::ctrl_c() => {
tracing::info!("ctrl-c received; shutting down");
}
_ = sigterm.recv() => {
tracing::info!("SIGTERM received; shutting down");
}
}
}
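The bounded drain in `main` follows a general pattern: wait for background work, but give up after a deadline. Below is a std-only analogue using a thread and a channel in place of `tokio::time::timeout`. `drain_with_timeout` is a hypothetical helper, not part of the project.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a background thread and waits at most `timeout` for it
/// to finish. Returns `true` if it finished in time, `false` if the wait
/// was abandoned (the thread is left detached, as with a wedged task).
pub fn drain_with_timeout<F>(work: F, timeout: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        work();
        // The receiver may already be gone if the caller timed out.
        let _ = done_tx.send(());
    });
    // `Err` covers both the timeout and a worker that panicked before
    // sending (the sender is dropped, so the channel disconnects).
    done_rx.recv_timeout(timeout).is_ok()
}
```

As in the async version, a `false` result only means the caller stopped waiting; the work itself is not cancelled.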

@@ -0,0 +1,32 @@
//! Admin-action audit log writes.
//!
//! Insert is always called from inside the same transaction as the
//! action it audits — the executor parameter is `PgExecutor` so the
//! caller passes `&mut *tx` directly.
use sqlx::PgExecutor;
use uuid::Uuid;
use crate::error::AppResult;
pub async fn insert<'e, E: PgExecutor<'e>>(
executor: E,
actor_user_id: Uuid,
action: &str,
target_kind: &str,
target_id: Option<Uuid>,
payload: serde_json::Value,
) -> AppResult<()> {
sqlx::query(
"INSERT INTO admin_audit (actor_user_id, action, target_kind, target_id, payload) \
VALUES ($1, $2, $3, $4, $5)",
)
.bind(actor_user_id)
.bind(action)
.bind(target_kind)
.bind(target_id)
.bind(payload)
.execute(executor)
.await?;
Ok(())
}

@@ -0,0 +1,232 @@
//! Admin-facing read queries that join manga/chapter with the crawler
//! signals (`manga_sources`, `chapter_sources`, `crawler_jobs`) to
//! derive a sync state per row at query time.
//!
//! Priority order for `MangaSyncState`:
//! 1. `InProgress` — any pending/running `sync_manga` or
//! `sync_chapter_list` job matches this manga.
//! 2. `Dropped` — manga has source rows AND every one of them is
//! `dropped_at IS NOT NULL`.
//! 3. `Synced` — default (includes user-uploaded mangas with no
//! `manga_sources` rows at all).
//!
//! Priority order for `ChapterSyncState`:
//! 1. `Downloading` — pending/running `sync_chapter_content` for this id
//! 2. `Dropped` — chapter has source rows AND all are dropped
//! 3. `Failed` — `page_count = 0` AND a `dead` `sync_chapter_content`
//! row exists for this chapter. Constrained to `page_count = 0`
//! because once pages are on disk the chapter IS synced — a
//! historical dead job (likely from a re-download attempt that
//! crashed) is noise that gets reaped after retention. Surfacing
//! "Failed" when content is present would contradict
//! `ChapterSyncState::Synced`'s "downloaded at some point" contract.
//! 4. `NotDownloaded` — `page_count = 0`, no in-flight, no dead job
//! 5. `Synced` — `page_count > 0`
//!
//! Reminder: `done` jobs are reaped after `CRAWLER_JOB_RETENTION_DAYS`,
//! so `chapters.page_count > 0` is the durable "this is synced" signal,
//! not the job table.
use chrono::{DateTime, Utc};
use serde::Serialize;
use sqlx::{FromRow, PgPool};
use uuid::Uuid;
use crate::domain::{ChapterSyncState, MangaSyncState};
use crate::error::AppResult;
#[derive(Debug, Serialize, FromRow)]
pub struct AdminMangaRow {
pub id: Uuid,
pub title: String,
pub status: String,
pub cover_image_path: Option<String>,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub sync_state: MangaSyncState,
pub chapter_count: i64,
pub latest_seen_at: Option<DateTime<Utc>>,
}
#[derive(Debug, Default)]
pub struct ListAdminMangasQuery {
pub search: Option<String>,
pub sync_state: Option<MangaSyncState>,
pub limit: i64,
pub offset: i64,
}
const MANGA_SYNC_STATE_CASE: &str = r#"
CASE
WHEN EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state IN ('pending','running')
AND (
(cj.payload->>'kind' = 'sync_chapter_list'
AND (cj.payload->>'manga_id')::uuid = m.id)
OR (cj.payload->>'kind' = 'sync_manga'
AND EXISTS (
SELECT 1 FROM manga_sources ms
WHERE ms.manga_id = m.id
AND ms.source_id = cj.payload->>'source_id'
AND ms.source_manga_key = cj.payload->>'source_manga_key'
))
)
) THEN 'in_progress'
WHEN EXISTS (SELECT 1 FROM manga_sources ms WHERE ms.manga_id = m.id)
AND NOT EXISTS (
SELECT 1 FROM manga_sources ms
WHERE ms.manga_id = m.id AND ms.dropped_at IS NULL
)
THEN 'dropped'
ELSE 'synced'
END
"#;
/// Paginated admin manga list with derived sync state and total count.
/// Filters by `search` (substring on title, case-insensitive) and
/// `sync_state` (post-derivation). The CTE keeps the case expression
/// in one place — the same projection feeds both the page rows and the
/// totals count under the same filter.
pub async fn list_mangas_with_sync_state(
pool: &PgPool,
q: &ListAdminMangasQuery,
) -> AppResult<(Vec<AdminMangaRow>, i64)> {
let search_pat = q
.search
.as_ref()
.map(|s| format!("%{}%", s.trim()))
.filter(|p| p.len() > 2);
// sqlx::Type → text: bind the snake_case representation manually so
// the SQL can compare it as text without an explicit cast.
let sync_filter = q.sync_state.map(|s| match s {
MangaSyncState::InProgress => "in_progress",
MangaSyncState::Dropped => "dropped",
MangaSyncState::Synced => "synced",
});
let sql = format!(
r#"
WITH classified AS (
SELECT
m.id, m.title, m.status, m.cover_image_path,
m.created_at, m.updated_at,
{case} AS sync_state,
(SELECT COUNT(*) FROM chapters c WHERE c.manga_id = m.id) AS chapter_count,
(SELECT MAX(last_seen_at) FROM manga_sources ms
WHERE ms.manga_id = m.id AND ms.dropped_at IS NULL) AS latest_seen_at
FROM mangas m
WHERE ($1::text IS NULL OR m.title ILIKE $1)
)
SELECT * FROM classified
WHERE ($2::text IS NULL OR sync_state = $2)
ORDER BY updated_at DESC
LIMIT $3 OFFSET $4
"#,
case = MANGA_SYNC_STATE_CASE
);
let items: Vec<AdminMangaRow> = sqlx::query_as(&sql)
.bind(&search_pat)
.bind(sync_filter)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total_sql = format!(
r#"
WITH classified AS (
SELECT {case} AS sync_state
FROM mangas m
WHERE ($1::text IS NULL OR m.title ILIKE $1)
)
SELECT COUNT(*) FROM classified
WHERE ($2::text IS NULL OR sync_state = $2)
"#,
case = MANGA_SYNC_STATE_CASE
);
let total: i64 = sqlx::query_scalar(&total_sql)
.bind(&search_pat)
.bind(sync_filter)
.fetch_one(pool)
.await?;
Ok((items, total))
}
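The `search_pat` construction above trims the input, wraps it in `%`, and discards the result when nothing is left between the wildcards. A standalone sketch of the same idea follows. It additionally escapes `%`, `_` and `\` in the user input, which the original does not do; that step is an assumption added for illustration and relies on backslash being PostgreSQL's default `LIKE`/`ILIKE` escape character. `build_search_pattern` is a hypothetical name.

```rust
/// Builds an ILIKE substring pattern from optional user input.
/// Returns `None` for missing or whitespace-only input so the query can
/// skip the filter entirely.
pub fn build_search_pattern(search: Option<&str>) -> Option<String> {
    let trimmed = search?.trim();
    if trimmed.is_empty() {
        return None;
    }
    let mut pattern = String::with_capacity(trimmed.len() + 2);
    pattern.push('%');
    for ch in trimmed.chars() {
        // Assumption (not in the original): escape LIKE metacharacters
        // so they match literally instead of acting as wildcards.
        if matches!(ch, '\\' | '%' | '_') {
            pattern.push('\\');
        }
        pattern.push(ch);
    }
    pattern.push('%');
    Some(pattern)
}
```

Without the escaping step, a search for `100%` would behave like a search for `100`, since the trailing `%` is read as a wildcard.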
#[derive(Debug, Serialize, FromRow)]
pub struct AdminChapterRow {
pub id: Uuid,
pub manga_id: Uuid,
pub number: i32,
pub title: Option<String>,
pub page_count: i32,
pub created_at: DateTime<Utc>,
pub sync_state: ChapterSyncState,
pub latest_seen_at: Option<DateTime<Utc>>,
}
#[derive(Debug, Default)]
pub struct ListAdminChaptersQuery {
pub manga_id: Uuid,
pub limit: i64,
pub offset: i64,
}
/// Paginated chapter list with derived sync state. Pagination is non-
/// optional — long-runners can have thousands of chapters and the
/// per-row scalar subqueries make the unbounded variant a real
/// stall risk even behind an admin guard. Returns the page slice plus
/// the unfiltered total so the UI can render "showing N of M".
pub async fn list_chapters_with_sync_state(
pool: &PgPool,
q: &ListAdminChaptersQuery,
) -> AppResult<(Vec<AdminChapterRow>, i64)> {
let items: Vec<AdminChapterRow> = sqlx::query_as(
r#"
SELECT
c.id, c.manga_id, c.number, c.title, c.page_count, c.created_at,
CASE
WHEN EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state IN ('pending','running')
AND cj.payload->>'kind' = 'sync_chapter_content'
AND (cj.payload->>'chapter_id')::uuid = c.id
) THEN 'downloading'
WHEN EXISTS (SELECT 1 FROM chapter_sources cs WHERE cs.chapter_id = c.id)
AND NOT EXISTS (
SELECT 1 FROM chapter_sources cs
WHERE cs.chapter_id = c.id AND cs.dropped_at IS NULL
)
THEN 'dropped'
WHEN c.page_count = 0
AND EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state = 'dead'
AND cj.payload->>'kind' = 'sync_chapter_content'
AND (cj.payload->>'chapter_id')::uuid = c.id
) THEN 'failed'
WHEN c.page_count = 0 THEN 'not_downloaded'
ELSE 'synced'
END AS sync_state,
(SELECT MAX(last_seen_at) FROM chapter_sources cs
WHERE cs.chapter_id = c.id AND cs.dropped_at IS NULL) AS latest_seen_at
FROM chapters c
WHERE c.manga_id = $1
ORDER BY c.number ASC
LIMIT $2 OFFSET $3
"#,
)
.bind(q.manga_id)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM chapters WHERE manga_id = $1")
.bind(q.manga_id)
.fetch_one(pool)
.await?;
Ok((items, total))
}

@@ -99,6 +99,11 @@ pub async fn list(
/// Atomically replace the set of authors on a manga. Caller passes a
/// `&mut PgConnection` (`&mut *tx` works) so the delete+upserts run in
/// one transaction with whatever called us.
///
/// Note: `crawler::repo::sync_authors` does a similar replace with the
/// same semantics on names. The duplication is intentional — handler
/// callers want the `Vec<AuthorRef>` for the API response; the
/// crawler doesn't need it and stays inside its own transaction.
pub async fn set_for_manga(
conn: &mut PgConnection,
manga_id: Uuid,

@@ -29,9 +29,9 @@ pub async fn create(
match result {
Ok(b) => Ok(b),
Err(e) if is_unique_violation(&e) => Err(AppError::Conflict(
"bookmark already exists for this manga/chapter".into(),
)),
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => Err(
AppError::Conflict("bookmark already exists for this manga/chapter".into()),
),
Err(e) => Err(AppError::Database(e)),
}
}
@@ -97,10 +97,3 @@ pub async fn delete(pool: &PgPool, id: Uuid) -> AppResult<()> {
Ok(())
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
}

@@ -4,7 +4,7 @@ use sqlx::{PgExecutor, PgPool};
use uuid::Uuid;
use crate::domain::Chapter;
use crate::error::{AppError, AppResult};
use crate::error::AppResult;
pub async fn list_for_manga(
pool: &PgPool,
@@ -12,12 +12,20 @@ pub async fn list_for_manga(
limit: i64,
offset: i64,
) -> AppResult<Vec<Chapter>> {
// Display order = source-site order reversed. The crawler stamps
// `source_index` = position in the source DOM (0 = first = newest
// on this site, see migration 0021), so DESC puts the oldest
// chapter first and keeps the site's variant grouping and the
// placement of non-numeric entries (e.g. "notice. : Officials")
// intact. NULLS LAST sorts user-uploaded chapters (no source row)
// and rows that pre-date the migration after all crawled rows; the
// (number, created_at) tail then orders them deterministically.
let rows = sqlx::query_as::<_, Chapter>(
r#"
SELECT id, manga_id, number, title, page_count, created_at
FROM chapters
WHERE manga_id = $1
ORDER BY number ASC
ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC
LIMIT $2 OFFSET $3
"#,
)
@@ -29,33 +37,39 @@ pub async fn list_for_manga(
Ok(rows)
}
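The `ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC` clause can be mirrored as an in-memory comparator, which makes the effect on variants and non-numeric entries easier to see. This is a sketch only: `ChapterRow` is a hypothetical struct and `created_at` is simplified to an integer timestamp.

```rust
use std::cmp::Ordering;

/// Hypothetical in-memory stand-in for a `chapters` row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterRow {
    pub title: &'static str,
    pub number: i32,
    /// Simplified timestamp (assumption: any monotonically comparable value).
    pub created_at: i64,
    /// Position in the source DOM; `None` for uploads and pre-migration rows.
    pub source_index: Option<i32>,
}

/// Comparator equivalent to
/// `source_index DESC NULLS LAST, number ASC, created_at ASC`.
pub fn display_order(a: &ChapterRow, b: &ChapterRow) -> Ordering {
    let by_index = match (a.source_index, b.source_index) {
        (Some(x), Some(y)) => y.cmp(&x),          // DESC
        (Some(_), None) => Ordering::Less,        // NULLS LAST
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    };
    by_index
        .then(a.number.cmp(&b.number))
        .then(a.created_at.cmp(&b.created_at))
}

/// Sorts rows into the user-facing display order.
pub fn sort_for_display(rows: &mut [ChapterRow]) {
    rows.sort_by(display_order);
}
```

With a source list of "Ch.15", "Ch.14 : Official", "Ch.14 : PH", a notice and "Ch.13" (indices 0 to 4), the comparator yields the reverse of that list, followed by any rows without a `source_index`.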
pub async fn find_by_manga_and_number(
/// Look up a chapter by its UUID, scoped to its manga so a UUID guessed
/// from a different manga's URL doesn't accidentally resolve.
pub async fn find_by_id_in_manga(
pool: &PgPool,
manga_id: Uuid,
number: i32,
chapter_id: Uuid,
) -> AppResult<Option<Chapter>> {
let row = sqlx::query_as::<_, Chapter>(
r#"
SELECT id, manga_id, number, title, page_count, created_at
FROM chapters
WHERE manga_id = $1 AND number = $2
WHERE manga_id = $1 AND id = $2
"#,
)
.bind(manga_id)
.bind(number)
.bind(chapter_id)
.fetch_optional(pool)
.await?;
Ok(row)
}
/// Accepts any `PgExecutor` so the upload handler can run this inside a
/// transaction with the per-page inserts. Returns `AppError::Conflict`
/// on the (manga_id, number) unique violation so handlers can surface a
/// clean 409.
/// transaction with the per-page inserts.
///
/// `uploaded_by` records who uploaded the chapter and feeds the
/// per-user upload history. `None` means "historical / API token with
/// no associated user" — kept nullable to support that case.
///
/// Chapter identity is the row UUID; the same (manga_id, number)
/// combination can repeat (multiple translations, re-uploads). The
/// 0013 migration dropped the (manga_id, number) UNIQUE, so duplicate
/// inserts succeed by design. If a future migration re-adds any
/// uniqueness, surface a 409 by adding a unique-violation arm here.
pub async fn create<'e, E: PgExecutor<'e>>(
executor: E,
manga_id: Uuid,
@@ -63,7 +77,7 @@ pub async fn create<'e, E: PgExecutor<'e>>(
title: Option<&str>,
uploaded_by: Option<Uuid>,
) -> AppResult<Chapter> {
let result = sqlx::query_as::<_, Chapter>(
let row = sqlx::query_as::<_, Chapter>(
r#"
INSERT INTO chapters (manga_id, number, title, uploaded_by)
VALUES ($1, $2, $3, $4)
@@ -75,15 +89,71 @@ pub async fn create<'e, E: PgExecutor<'e>>(
.bind(title)
.bind(uploaded_by)
.fetch_one(executor)
.await;
.await?;
Ok(row)
}
match result {
Ok(c) => Ok(c),
Err(e) if is_unique_violation(&e) => Err(AppError::Conflict(format!(
"chapter {number} already exists for this manga"
))),
Err(e) => Err(AppError::Database(e)),
}
/// Cross-link guard for `POST /bookmarks`: the bookmarks FK accepts
/// any valid chapter id, but a chapter must belong to the bookmark's
/// manga or the bookmark would dangle on a foreign manga. Handlers
/// call this before the insert and surface `NotFound` when it
/// returns `false`.
pub async fn belongs_to_manga(
pool: &PgPool,
chapter_id: Uuid,
manga_id: Uuid,
) -> AppResult<bool> {
let (exists,): (bool,) = sqlx::query_as(
"SELECT EXISTS(SELECT 1 FROM chapters WHERE id = $1 AND manga_id = $2)",
)
.bind(chapter_id)
.bind(manga_id)
.fetch_one(pool)
.await?;
Ok(exists)
}
/// Read just the page_count for a chapter. Used by the crawler
/// daemon's consumer-side dedup safety net so it can ack-done a job
/// whose chapter has already been fetched by a racing worker.
pub async fn page_count(pool: &PgPool, id: Uuid) -> sqlx::Result<Option<i32>> {
sqlx::query_scalar("SELECT page_count FROM chapters WHERE id = $1")
.bind(id)
.fetch_optional(pool)
.await
}
/// Look up the manga_id + most recent live source_url for a chapter.
/// Used by the daemon's chapter dispatcher to resolve the URL it needs
/// to hand to `content::sync_chapter_content`.
///
/// Skips soft-dropped sources (`cs.dropped_at IS NOT NULL`) and breaks
/// ties between multiple live sources by `last_seen_at DESC`, so the
/// freshest still-attached URL wins. Returns `None` when the chapter
/// is gone or all its source rows are dropped — callers in the
/// dispatcher treat `None` as "ack the job, skip the work."
///
/// The enqueue queries (`pipeline::enqueue_bookmarked_pending` and
/// `enqueue_pending_for_manga`) apply the same `dropped_at IS NULL`
/// filter — this resolver stays in lockstep so a chapter that was
/// dropped between enqueue and lease isn't dispatched against a stale
/// URL.
pub async fn dispatch_target(
pool: &PgPool,
chapter_id: Uuid,
) -> sqlx::Result<Option<(Uuid, String)>> {
sqlx::query_as(
"SELECT c.manga_id, cs.source_url \
FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE c.id = $1 \
AND cs.dropped_at IS NULL \
ORDER BY cs.last_seen_at DESC \
LIMIT 1",
)
.bind(chapter_id)
.fetch_optional(pool)
.await
}
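The selection rule in `dispatch_target` (skip soft-dropped sources, then take the most recently seen one) can be expressed over an in-memory slice. A minimal sketch, assuming timestamps simplified to integers; `SourceRow` and `pick_dispatch_url` are hypothetical names.

```rust
/// Hypothetical in-memory stand-in for a `chapter_sources` row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceRow {
    pub source_url: String,
    /// Simplified timestamp (assumption: larger means more recent).
    pub last_seen_at: i64,
    /// `Some(_)` marks a soft-dropped source.
    pub dropped_at: Option<i64>,
}

/// Returns the URL of the freshest live source, or `None` when every
/// source is dropped or there are none at all.
pub fn pick_dispatch_url(sources: &[SourceRow]) -> Option<&str> {
    sources
        .iter()
        .filter(|s| s.dropped_at.is_none())
        .max_by_key(|s| s.last_seen_at)
        .map(|s| s.source_url.as_str())
}
```

A `None` result corresponds to the "ack the job, skip the work" branch described in the doc comment.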
pub async fn set_page_count<'e, E: PgExecutor<'e>>(
@@ -99,10 +169,3 @@ pub async fn set_page_count<'e, E: PgExecutor<'e>>(
Ok(())
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
}

@@ -8,14 +8,16 @@
//! updated (metadata_hash changed), or unchanged.
//! - [`sync_manga_chapters`]: per-manga chapter reconciliation. Adds
//! new ones, refreshes URLs on existing ones, soft-drops vanished.
//! - [`mark_dropped_mangas`]: end-of-run pass. Any manga from this
//! source whose `last_seen_at` is older than the run start is
//! soft-dropped.
//! - [`mark_run_started`] / [`mark_run_completed`] /
//! [`last_run_completed_cleanly`]: per-source recovery flag in
//! `crawler_state`. A `false` flag on tick start means the previous
//! run did not exit cleanly and the next walk should ignore the
//! early-stop condition.
//!
//! Each public function is a transaction boundary so a partial failure
//! mid-call leaves the DB in its pre-call state.
use chrono::{DateTime, Utc};
use chrono::Utc;
use sqlx::{PgPool, Postgres, Transaction};
use uuid::Uuid;
@@ -274,7 +276,20 @@ async fn sync_tags(
manga_id: Uuid,
tags: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_tags WHERE manga_id = $1")
// Only clear crawler-owned attachments (added_by IS NULL). User-
// attached tags are owned by the attaching user and must survive
// the recurring metadata pass — see manga_tags.added_by in
// migration 0009.
//
// Note on orphans: `manga_tags.added_by` is `ON DELETE SET NULL`,
// so an attachment whose user was deleted becomes
// indistinguishable from a crawler-owned row and is cleaned up
// here. That mirrors how `api::mangas::detach_tag` already treats
// orphans ("nobody owns it, refuse to let anyone but admin clear
// them") — the crawler now becomes the eventual reaper. Tracked
// by `sync_tags_garbage_collects_orphan_user_attachments` in
// backend/tests/crawler_sync.rs.
sqlx::query("DELETE FROM manga_tags WHERE manga_id = $1 AND added_by IS NULL")
.bind(manga_id)
.execute(&mut **tx)
.await?;
@@ -315,38 +330,74 @@ pub async fn sync_manga_chapters(
chapters: &[SourceChapterRef],
) -> sqlx::Result<ChapterDiff> {
let mut tx = pool.begin().await?;
// Per-manga advisory lock. Two concurrent calls for the same manga
// would otherwise both read `seen_keys`, both run the drop UPDATE
// filtered on `NOT (key = ANY $3)`, and the later commit could soft-
// drop a chapter the earlier commit had just inserted (lost-update
// shape under MVCC). `pg_advisory_xact_lock` is scoped to this
// transaction: it auto-releases on COMMIT/ROLLBACK so a Rust-side
// panic mid-call doesn't strand the lock. The single-arg int8 form
// keyed by `hashtextextended(manga_id::text, 0)` shares Postgres'
// global advisory-lock namespace with `CRON_LOCK_KEY`, but collision
// is 2^-64 per pair (a UUID-derived hash hitting the fixed cron key
// is effectively impossible).
sqlx::query("SELECT pg_advisory_xact_lock(hashtextextended($1::text, 0))")
.bind(manga_id)
.execute(&mut *tx)
.await?;
let mut diff = ChapterDiff::default();
let seen_keys: Vec<String> = chapters
.iter()
.map(|c| c.source_chapter_key.clone())
.collect();
for c in chapters {
for (idx, c) in chapters.iter().enumerate() {
// `source_index` captures the chapter's position in the source
// DOM (0 = first = newest on this site) so the list query can
// reverse it for the user-facing list — see migration 0021.
// Every sync overwrites the value on both branches, so a new
// chapter inserted at the top of the source shifts every other
// row down by one on the next tick.
let source_index = idx as i32;
// Lookup is constrained by manga_id (via the chapters join) so a
// source whose chapter slugs collide across mangas (e.g.
// "chapter-1" appearing under two different mangas) attributes
// each row to the correct manga. Migration 0017 dropped the
// (source_id, source_chapter_key) PK in favour of
// (source_id, chapter_id) for exactly this reason.
let existing: Option<(Uuid,)> = sqlx::query_as(
"SELECT chapter_id FROM chapter_sources WHERE source_id = $1 AND source_chapter_key = $2",
"SELECT cs.chapter_id \
FROM chapter_sources cs \
JOIN chapters ch ON ch.id = cs.chapter_id \
WHERE cs.source_id = $1 \
AND cs.source_chapter_key = $2 \
AND ch.manga_id = $3",
)
.bind(source_id)
.bind(&c.source_chapter_key)
.bind(manga_id)
.fetch_optional(&mut *tx)
.await?;
match existing {
None => {
// New chapter row. The (manga_id, number) unique
// constraint protects against re-inserts if the same
// number arrives via a different source_chapter_key.
// New chapter row. As of 0013 there's no (manga_id,
// number) UNIQUE, so duplicate-numbered chapters from
// the source (different uploaders, notices, alt
// translations) each get their own row — chapter
// identity is the UUID, not the number.
let (chapter_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO chapters (manga_id, number, title, page_count)
VALUES ($1, $2, $3, 0)
ON CONFLICT (manga_id, number) DO UPDATE
SET title = EXCLUDED.title
INSERT INTO chapters (manga_id, number, title, page_count, source_index)
VALUES ($1, $2, $3, 0, $4)
RETURNING id
"#,
)
.bind(manga_id)
.bind(c.number)
.bind(c.title.as_deref())
.bind(source_index)
.fetch_one(&mut *tx)
.await?;
sqlx::query(
@@ -365,21 +416,27 @@ pub async fn sync_manga_chapters(
diff.new += 1;
}
Some((chapter_id,)) => {
sqlx::query("UPDATE chapters SET title = $1 WHERE id = $2")
sqlx::query(
"UPDATE chapters SET title = $1, source_index = $2 WHERE id = $3",
)
.bind(c.title.as_deref())
.bind(source_index)
.bind(chapter_id)
.execute(&mut *tx)
.await?;
// chapter_id is now the natural per-(source, chapter)
// identifier — use it directly instead of re-keying on
// (source_id, source_chapter_key) which may not be unique.
sqlx::query(
r#"
UPDATE chapter_sources
SET source_url = $1, last_seen_at = NOW(), dropped_at = NULL
WHERE source_id = $2 AND source_chapter_key = $3
WHERE source_id = $2 AND chapter_id = $3
"#,
)
.bind(&c.url)
.bind(source_id)
.bind(&c.source_chapter_key)
.bind(chapter_id)
.execute(&mut *tx)
.await?;
diff.refreshed += 1;
@@ -412,23 +469,152 @@ pub async fn sync_manga_chapters(
Ok(diff)
}
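The reconciliation performed by `sync_manga_chapters` can be sketched in memory: every incoming chapter gets `source_index` = its position in the list, known keys are refreshed and revived, unknown keys are inserted, and keys missing from the incoming list are soft-dropped. This is a simplified model under stated assumptions: `StoredChapter`, `DiffSketch` and `reconcile` are hypothetical, the `dropped` counter is assumed (only `new` and `refreshed` are visible in the diff), and chapters are keyed by source key within a single manga.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a chapter plus its source row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredChapter {
    pub source_index: i32,
    pub dropped: bool,
}

/// Counters for one reconciliation pass (`dropped` is an assumption).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct DiffSketch {
    pub new: usize,
    pub refreshed: usize,
    pub dropped: usize,
}

/// `incoming` lists source chapter keys in source-DOM order (0 = newest).
pub fn reconcile(stored: &mut HashMap<String, StoredChapter>, incoming: &[&str]) -> DiffSketch {
    let mut diff = DiffSketch::default();
    for (idx, key) in incoming.iter().enumerate() {
        let source_index = idx as i32;
        if let Some(row) = stored.get_mut(*key) {
            // Existing chapter: overwrite the position and revive it.
            row.source_index = source_index;
            row.dropped = false;
            diff.refreshed += 1;
        } else {
            stored.insert((*key).to_string(), StoredChapter { source_index, dropped: false });
            diff.new += 1;
        }
    }
    // Soft-drop live chapters the source no longer lists.
    for (key, row) in stored.iter_mut() {
        if !row.dropped && !incoming.iter().any(|k| *k == key.as_str()) {
            row.dropped = true;
            diff.dropped += 1;
        }
    }
    diff
}
```

Because the index is rewritten on the refresh branch too, a chapter prepended on the source shifts every existing index by one on the next pass.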
pub async fn mark_dropped_mangas(
/// Count the live (`dropped_at IS NULL`) `chapter_sources` rows
/// attached to the manga identified by the
/// `(source_id, source_manga_key)` pair.
///
/// Used by the metadata pass's partial-render guard: if `fetch_manga`
/// returns an empty `chapters` Vec but the source previously surfaced
/// chapters here, that's most likely a chromium snapshot taken between
/// the `#chapter_table` wrapper render and its rows render — the
/// safest move is to skip `sync_manga_chapters` so the soft-drop
/// branch doesn't flip every existing chapter to `dropped_at`.
///
/// Returns `Ok(0)` when the manga is brand-new (no `manga_sources`
/// row yet), which is the legitimate "this manga has no chapters yet"
/// case and must NOT be flagged.
pub async fn live_chapter_count_for_source_manga(
pool: &PgPool,
source_id: &str,
run_started_at: DateTime<Utc>,
) -> sqlx::Result<u64> {
let res = sqlx::query(
r#"
UPDATE manga_sources
SET dropped_at = NOW()
WHERE source_id = $1
AND last_seen_at < $2
AND dropped_at IS NULL
"#,
source_manga_key: &str,
) -> sqlx::Result<i64> {
let row: Option<(i64,)> = sqlx::query_as(
"SELECT COUNT(*) \
FROM chapter_sources cs \
JOIN chapters c ON c.id = cs.chapter_id \
JOIN manga_sources ms \
ON ms.manga_id = c.manga_id \
AND ms.source_id = cs.source_id \
WHERE ms.source_id = $1 \
AND ms.source_manga_key = $2 \
AND cs.dropped_at IS NULL",
)
.bind(source_id)
.bind(run_started_at)
.bind(source_manga_key)
.fetch_optional(pool)
.await?;
Ok(row.map(|(n,)| n).unwrap_or(0))
}
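The partial-render guard described in the doc comment reduces to a single predicate. A minimal sketch, assuming the caller already has the fetched chapter count and the live count returned by the query; `should_skip_chapter_sync` is a hypothetical name.

```rust
/// Returns `true` when the fetch came back empty even though the source
/// previously listed live chapters, which suggests a partial render
/// rather than a genuinely empty manga.
pub fn should_skip_chapter_sync(fetched_chapters: usize, live_chapter_count: i64) -> bool {
    fetched_chapters == 0 && live_chapter_count > 0
}
```

A brand-new manga has a live count of 0, so an empty fetch for it is not flagged and the sync proceeds as usual.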
/// Mark a metadata pass as in-flight for `source_id`. Stamps
/// `last_run_completed:<source_id>` in `crawler_state` with
/// `{"completed": false, "at": now}`. A crash, panic, or SIGKILL after
/// this point leaves the flag at `false`, which the next tick reads as
/// "previous run did not exit cleanly — walk the full catalog this
/// time" (recovery sweep).
pub async fn mark_run_started(pool: &PgPool, source_id: &str) -> sqlx::Result<()> {
let key = format!("last_run_completed:{source_id}");
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(&key)
.bind(serde_json::json!({
"completed": false,
"at": Utc::now().to_rfc3339(),
}))
.execute(pool)
.await?;
Ok(res.rows_affected())
Ok(())
}
/// Mark a metadata pass as completed cleanly for `source_id`. Called
/// from the same place a run decides it reached end-of-walk or hit the
/// intentional stop. The next tick reads `true` and applies the normal
/// stop condition.
pub async fn mark_run_completed(pool: &PgPool, source_id: &str) -> sqlx::Result<()> {
let key = format!("last_run_completed:{source_id}");
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(&key)
.bind(serde_json::json!({
"completed": true,
"at": Utc::now().to_rfc3339(),
}))
.execute(pool)
.await?;
Ok(())
}
/// List mangas whose `cover_image_path` is NULL while a live
/// `manga_sources` row still attaches them to a source. The bounded
/// result feeds the cover-backfill pass in [`crate::crawler::pipeline`]:
/// each entry is one (manga, freshest source row) pair whose cover
/// should be re-downloaded.
///
/// Per-manga deduplication uses `DISTINCT ON (m.id)` keyed on the row
/// with the newest `last_seen_at`, so a manga that's surfaced by
/// multiple sources only produces one row (the freshest). Ordering by
/// `m.id` keeps the result deterministic for tests.
pub async fn list_missing_covers(
pool: &PgPool,
max: i64,
) -> sqlx::Result<Vec<MissingCoverEntry>> {
let rows: Vec<(Uuid, String, String)> = sqlx::query_as(
r#"
SELECT DISTINCT ON (m.id) m.id, ms.source_manga_key, ms.source_url
FROM mangas m
JOIN manga_sources ms ON ms.manga_id = m.id
WHERE m.cover_image_path IS NULL
AND ms.dropped_at IS NULL
ORDER BY m.id, ms.last_seen_at DESC
LIMIT $1
"#,
)
.bind(max)
.fetch_all(pool)
.await?;
Ok(rows
.into_iter()
.map(|(manga_id, source_manga_key, source_url)| MissingCoverEntry {
manga_id,
source_manga_key,
source_url,
})
.collect())
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MissingCoverEntry {
pub manga_id: Uuid,
pub source_manga_key: String,
pub source_url: String,
}
/// Read the recovery flag for `source_id`. A missing row OR an
/// unparseable value reads as `true` ("clean") — the former covers the
/// first-ever run on a virgin DB (no recovery needed), the latter
/// covers forward-compat against future schema changes; both fail-safe
/// toward not making an operator pay for an unnecessary full sweep.
pub async fn last_run_completed_cleanly(
pool: &PgPool,
source_id: &str,
) -> sqlx::Result<bool> {
let key = format!("last_run_completed:{source_id}");
let row: Option<serde_json::Value> =
sqlx::query_scalar("SELECT value FROM crawler_state WHERE key = $1")
.bind(&key)
.fetch_optional(pool)
.await?;
Ok(row
.and_then(|v| v.get("completed").and_then(|b| b.as_bool()))
.unwrap_or(true))
}
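The fail-safe reading of the recovery flag can be modelled without a database. In this sketch `StoredFlag` is a hypothetical enum standing in for the three possible outcomes of the `crawler_state` lookup, and `needs_full_sweep` is an illustrative helper that is not in the original.

```rust
/// Hypothetical outcome of reading `last_run_completed:<source_id>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StoredFlag {
    /// No row yet (first run on a fresh database).
    Missing,
    /// A row exists but `completed` is absent or not a boolean.
    Unparseable,
    /// The stored `completed` value.
    Completed(bool),
}

/// Missing and unparseable values both read as "clean".
pub fn last_run_completed_cleanly(flag: StoredFlag) -> bool {
    match flag {
        StoredFlag::Completed(done) => done,
        StoredFlag::Missing | StoredFlag::Unparseable => true,
    }
}

/// Only an explicit `completed: false` triggers the recovery sweep.
pub fn needs_full_sweep(flag: StoredFlag) -> bool {
    !last_run_completed_cleanly(flag)
}
```

Defaulting to "clean" means the expensive full walk only happens when a run is known to have started and not finished.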

@@ -61,6 +61,11 @@ pub async fn load_for_mangas(
/// FK constraint would reject them, so we filter upstream rather than
/// surface a 500 here. (The API layer validates the set against
/// `list_all` first.)
///
/// Note: `crawler::repo::sync_genres` does a similar replace, but by
/// *name* and with auto-create of unseen genres — the crawler can't
/// validate against the curated vocabulary on its own. Both paths are
/// intentional; don't merge them without preserving that semantic.
pub async fn set_for_manga(
conn: &mut PgConnection,
manga_id: Uuid,

@@ -262,6 +262,17 @@ pub async fn set_cover_image_path<'e, E: PgExecutor<'e>>(
Ok(())
}
pub async fn clear_cover_image_path<'e, E: PgExecutor<'e>>(
executor: E,
id: Uuid,
) -> AppResult<()> {
sqlx::query("UPDATE mangas SET cover_image_path = NULL, updated_at = now() WHERE id = $1")
.bind(id)
.execute(executor)
.await?;
Ok(())
}
pub async fn exists(pool: &PgPool, id: Uuid) -> AppResult<bool> {
let (exists,): (bool,) =
sqlx::query_as("SELECT EXISTS(SELECT 1 FROM mangas WHERE id = $1)")
@@ -270,3 +281,17 @@ pub async fn exists(pool: &PgPool, id: Uuid) -> AppResult<bool> {
.await?;
Ok(exists)
}
/// Returns the uploader's user id for a manga. `None` either when the
/// manga doesn't exist or when the row predates the `uploaded_by`
/// column (historical NULL — see migration 0011). Callers must
/// distinguish "manga missing" via [`exists`] before relying on this
/// to make an authz decision.
pub async fn uploaded_by(pool: &PgPool, id: Uuid) -> AppResult<Option<Uuid>> {
let row: Option<(Option<Uuid>,)> =
sqlx::query_as("SELECT uploaded_by FROM mangas WHERE id = $1")
.bind(id)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|(u,)| u))
}

@@ -1,3 +1,5 @@
pub mod admin_audit;
pub mod admin_view;
pub mod api_token;
pub mod author;
pub mod bookmark;

@@ -11,7 +11,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
r#"
INSERT INTO users (username, password_hash)
VALUES ($1, $2)
RETURNING id, username, password_hash, created_at
RETURNING id, username, password_hash, created_at, is_admin
"#,
)
.bind(username)
@@ -21,7 +21,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
match result {
Ok(user) => Ok(user),
Err(e) if is_unique_violation(&e) => {
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => {
Err(AppError::Conflict("username is already taken".into()))
}
Err(e) => Err(AppError::Database(e)),
@@ -35,7 +35,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
pub async fn find_by_username(pool: &PgPool, username: &str) -> AppResult<Option<User>> {
let row = sqlx::query_as::<_, User>(
r#"
SELECT id, username, password_hash, created_at
SELECT id, username, password_hash, created_at, is_admin
FROM users
WHERE lower(username) = lower($1)
"#,
@@ -48,7 +48,7 @@ pub async fn find_by_username(pool: &PgPool, username: &str) -> AppResult<Option
pub async fn find_by_id(pool: &PgPool, id: Uuid) -> AppResult<Option<User>> {
let row = sqlx::query_as::<_, User>(
r#"SELECT id, username, password_hash, created_at FROM users WHERE id = $1"#,
r#"SELECT id, username, password_hash, created_at, is_admin FROM users WHERE id = $1"#,
)
.bind(id)
.fetch_optional(pool)
@@ -56,10 +56,317 @@ pub async fn find_by_id(pool: &PgPool, id: Uuid) -> AppResult<Option<User>> {
Ok(row)
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
/// Postgres advisory-lock key guarding admin-count-changing operations
/// (demote, delete-admin). Without this lock two concurrent demotes of
/// different admins could each pass their "more than one admin remains"
/// check, then commit, leaving zero admins. The lock serialises any tx
/// that might change the admin count so the recount under the lock is
/// authoritative.
///
/// Value is the bytes of "admininv" interpreted as a big-endian i64.
/// Postgres' advisory-lock keyspace is global; collision risk with
/// `CRON_LOCK_KEY` and friends is ~2^-64.
pub const ADMIN_INVARIANT_LOCK_KEY: i64 = 0x61_64_6d_69_6e_69_6e_76;
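
The hex literal above can be cross-checked against the ASCII tag it is said to encode. The sketch below uses a hypothetical `lock_key_from_tag` helper (not part of this diff) built on the standard `i64::from_be_bytes`.

```rust
/// Hypothetical helper: interpret an 8-byte ASCII tag as a big-endian i64,
/// the same encoding the doc comment describes for the advisory-lock key.
pub const fn lock_key_from_tag(tag: [u8; 8]) -> i64 {
    i64::from_be_bytes(tag)
}

/// Derived from the tag rather than written out as a hex literal.
pub const DERIVED_KEY: i64 = lock_key_from_tag(*b"admininv");
```

Deriving the key from the tag is one way to keep the comment and the value from drifting apart; the literal form in the diff is equivalent.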
#[derive(Debug, Default)]
pub struct ListUsersQuery {
pub search: Option<String>,
pub limit: i64,
pub offset: i64,
}
/// Paginated user list with total count. `search` is a case-insensitive
/// substring match on `username`. Order is alphabetical by username so
/// pagination is stable across concurrent writes (a user's `is_admin`
/// changing doesn't reshuffle the page).
pub async fn list_with_total(
pool: &PgPool,
q: &ListUsersQuery,
) -> AppResult<(Vec<User>, i64)> {
let pat = q
.search
.as_ref()
.map(|s| format!("%{}%", s.trim()))
.filter(|p| p.len() > 2);
let items = sqlx::query_as::<_, User>(
r#"
SELECT id, username, password_hash, created_at, is_admin
FROM users
WHERE ($1::text IS NULL OR username ILIKE $1)
ORDER BY username
LIMIT $2 OFFSET $3
"#,
)
.bind(&pat)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM users WHERE ($1::text IS NULL OR username ILIKE $1)",
)
.bind(&pat)
.fetch_one(pool)
.await?;
Ok((items, total))
}
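
The pattern-building step in `list_with_total` can be looked at in isolation. The sketch below lifts it into a hypothetical free function, `search_pattern` (the name is an assumption; the body mirrors the `map`/`filter` chain above), to show which inputs end up as "no filter".

```rust
/// Hypothetical extraction of the pattern logic from `list_with_total`.
/// Returns `None` when there is nothing to search for, which the SQL
/// treats as "match every user" via `$1::text IS NULL`.
pub fn search_pattern(search: Option<&str>) -> Option<String> {
    search
        // Wrap the trimmed term in `%...%` for a substring ILIKE match.
        .map(|s| format!("%{}%", s.trim()))
        // A blank term yields exactly "%%" (length 2); drop it so the
        // query falls back to the unfiltered branch.
        .filter(|p| p.len() > 2)
}
```

As written in the diff, `%` and `_` inside the search term are passed through unescaped, so they behave as ILIKE wildcards.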
/// Raw `is_admin` update with no safety checks, no audit log, and no
/// advisory lock. Exists only as a test setup helper for the admin-
/// feature integration suite — production code MUST go through
/// [`admin_safe_set_is_admin`], which enforces self-protection, the
/// last-admin invariant, and the audit log atomically.
pub async fn set_is_admin_unchecked(pool: &PgPool, id: Uuid, value: bool) -> AppResult<()> {
sqlx::query("UPDATE users SET is_admin = $1 WHERE id = $2")
.bind(value)
.bind(id)
.execute(pool)
.await?;
Ok(())
}
// Docs for `bootstrap_admin` (defined further down in this file):
// Ensure the user `username` exists and is an admin. Called at startup
// from `app::build` when `ADMIN_USERNAME` / `ADMIN_PASSWORD` are set.
//
// Semantics (see cross-cutting decision #2 in the feature plan):
// - If no row exists: create with the env-supplied password hashed via
// argon2id and `is_admin = true`.
// - If a row already exists: flip `is_admin` to true if needed; **never**
// touch the existing `password_hash`. Lets the operator rotate the
// admin password through the UI without env-var conflict.
// Wrapped in a transaction so a concurrent `register` for the same
// username can't slip an INSERT between the SELECT and UPDATE/INSERT.
/// Set `is_admin` on a user with full safety checks: rejects self-demote,
/// rejects demoting the only remaining admin (under `ADMIN_INVARIANT_LOCK_KEY`
/// to close the parallel-demote race), and writes an `admin_audit` row
/// in the same tx so the log mirrors what actually committed.
///
/// Returns the freshly-written user row (so the handler can return it
/// without a second SELECT).
pub async fn admin_safe_set_is_admin(
pool: &PgPool,
actor_id: Uuid,
target_id: Uuid,
value: bool,
) -> AppResult<User> {
// Cheap pre-check before opening a tx — also covers the "demote me"
// case which would otherwise pass the recount when other admins exist.
if actor_id == target_id && !value {
return Err(AppError::Conflict(
"cannot demote yourself; ask another admin".into(),
));
}
let mut tx = pool.begin().await?;
sqlx::query("SELECT pg_advisory_xact_lock($1)")
.bind(ADMIN_INVARIANT_LOCK_KEY)
.execute(&mut *tx)
.await?;
let target: Option<User> = sqlx::query_as(
"SELECT id, username, password_hash, created_at, is_admin \
FROM users WHERE id = $1 FOR UPDATE",
)
.bind(target_id)
.fetch_optional(&mut *tx)
.await?;
let Some(target) = target else {
return Err(AppError::NotFound);
};
// No-op: caller asked to set `is_admin` to its current value. Return
// the row as-is without writing an audit entry — otherwise repeated
// PATCH calls (browser retry, double-click) pile misleading
// "promote_user" rows in `admin_audit` for actions that changed
// nothing.
if target.is_admin == value {
tx.commit().await?;
return Ok(target);
}
// Recount inside the lock — this is the authoritative read.
if target.is_admin && !value {
let admin_count: i64 =
sqlx::query_scalar("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&mut *tx)
.await?;
if admin_count <= 1 {
return Err(AppError::Conflict(
"cannot demote the last admin; promote another user first".into(),
));
}
}
let updated: User = sqlx::query_as(
"UPDATE users SET is_admin = $1 WHERE id = $2 \
RETURNING id, username, password_hash, created_at, is_admin",
)
.bind(value)
.bind(target_id)
.fetch_one(&mut *tx)
.await?;
let action = if value { "promote_user" } else { "demote_user" };
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
action,
"user",
Some(target_id),
serde_json::json!({ "username": target.username }),
)
.await?;
tx.commit().await?;
Ok(updated)
}
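
The order of checks in `admin_safe_set_is_admin` matters: self-demote is rejected before anything else, a no-op returns without an audit row, and the admin recount only applies to an actual demotion. The sketch below restates that decision order as a pure function. `SetAdminPlan` and `plan_set_is_admin` are hypothetical names, ids are plain `u128` values, and `admin_count` is passed in rather than queried under the advisory lock.

```rust
/// Hypothetical summary of what `admin_safe_set_is_admin` would do.
#[derive(Debug, PartialEq, Eq)]
pub enum SetAdminPlan {
    RejectSelfDemote,
    NotFound,
    /// Value already matches: return the row, write no audit entry.
    NoOp,
    RejectLastAdmin,
    /// Update the row and write the audit entry in the same transaction.
    Apply,
}

/// `target_is_admin` is `None` when no user row exists for `target`.
/// `admin_count` stands in for the recount done while holding the lock.
pub fn plan_set_is_admin(
    actor: u128,
    target: u128,
    target_is_admin: Option<bool>,
    admin_count: i64,
    value: bool,
) -> SetAdminPlan {
    // Cheap pre-check, done before any transaction is opened.
    if actor == target && !value {
        return SetAdminPlan::RejectSelfDemote;
    }
    let Some(current) = target_is_admin else {
        return SetAdminPlan::NotFound;
    };
    if current == value {
        return SetAdminPlan::NoOp;
    }
    // Only a real demotion can reduce the admin count.
    if current && !value && admin_count <= 1 {
        return SetAdminPlan::RejectLastAdmin;
    }
    SetAdminPlan::Apply
}
```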
/// Delete a user with full safety checks: rejects self-delete, rejects
/// deleting the only remaining admin (under `ADMIN_INVARIANT_LOCK_KEY`),
/// and writes an `admin_audit` row in the same tx. Captures the deleted
/// username + admin status in the audit payload so the action is
/// readable after the user row itself is gone.
pub async fn admin_safe_delete(
pool: &PgPool,
actor_id: Uuid,
target_id: Uuid,
) -> AppResult<()> {
if actor_id == target_id {
return Err(AppError::Conflict(
"cannot delete yourself; ask another admin".into(),
));
}
let mut tx = pool.begin().await?;
sqlx::query("SELECT pg_advisory_xact_lock($1)")
.bind(ADMIN_INVARIANT_LOCK_KEY)
.execute(&mut *tx)
.await?;
let target: Option<User> = sqlx::query_as(
"SELECT id, username, password_hash, created_at, is_admin \
FROM users WHERE id = $1 FOR UPDATE",
)
.bind(target_id)
.fetch_optional(&mut *tx)
.await?;
let Some(target) = target else {
return Err(AppError::NotFound);
};
if target.is_admin {
let admin_count: i64 =
sqlx::query_scalar("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&mut *tx)
.await?;
if admin_count <= 1 {
return Err(AppError::Conflict(
"cannot delete the last admin; promote another user first".into(),
));
}
}
sqlx::query("DELETE FROM users WHERE id = $1")
.bind(target_id)
.execute(&mut *tx)
.await?;
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
"delete_user",
"user",
Some(target_id),
serde_json::json!({
"username": target.username,
"was_admin": target.is_admin,
}),
)
.await?;
tx.commit().await?;
Ok(())
}
/// Admin-initiated user creation. Wraps the INSERT + audit row in a
/// single transaction so a rolled-back create never leaves an orphan
/// audit entry. Caller (HTTP handler) is responsible for validating
/// `username`/`password` and hashing — this fn assumes both are
/// already vetted by the same `validate_*` rules used by self-
/// registration.
pub async fn admin_create_user(
pool: &PgPool,
actor_id: Uuid,
username: &str,
password_hash: &str,
is_admin: bool,
) -> AppResult<User> {
let mut tx = pool.begin().await?;
let user: User = match sqlx::query_as::<_, User>(
"INSERT INTO users (username, password_hash, is_admin) VALUES ($1, $2, $3) \
RETURNING id, username, password_hash, created_at, is_admin",
)
.bind(username)
.bind(password_hash)
.bind(is_admin)
.fetch_one(&mut *tx)
.await
{
Ok(u) => u,
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => {
return Err(AppError::Conflict("username is already taken".into()));
}
Err(e) => return Err(AppError::Database(e)),
};
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
"create_user",
"user",
Some(user.id),
serde_json::json!({
"username": user.username,
"is_admin": user.is_admin,
}),
)
.await?;
tx.commit().await?;
Ok(user)
}
pub async fn bootstrap_admin(
pool: &PgPool,
username: &str,
password: &str,
) -> AppResult<()> {
let mut tx = pool.begin().await?;
let existing: Option<(Uuid,)> = sqlx::query_as(
"SELECT id FROM users WHERE lower(username) = lower($1) FOR UPDATE",
)
.bind(username)
.fetch_optional(&mut *tx)
.await?;
match existing {
Some((id,)) => {
sqlx::query("UPDATE users SET is_admin = true WHERE id = $1 AND is_admin = false")
.bind(id)
.execute(&mut *tx)
.await?;
}
None => {
let hash = crate::auth::password::hash_password(password)?;
sqlx::query("INSERT INTO users (username, password_hash, is_admin) VALUES ($1, $2, true)")
.bind(username)
.bind(&hash)
.execute(&mut *tx)
.await?;
}
}
tx.commit().await?;
Ok(())
}


@@ -16,6 +16,13 @@ impl LocalStorage {
}
fn resolve(&self, key: &str) -> Result<PathBuf, StorageError> {
// NUL bytes are rejected by the Linux syscall layer, but the
// error surfaces as an opaque IO failure rather than the
// explicit `BadKey` the rest of the contract uses. Catch it
// here so the error path is consistent.
if key.contains('\0') {
return Err(StorageError::BadKey);
}
let key = key.trim_start_matches('/');
if key.is_empty() {
return Err(StorageError::BadKey);
@@ -79,6 +86,10 @@ impl Storage for LocalStorage {
let path: &Path = &self.resolve(key)?;
Ok(fs::try_exists(path).await?)
}
fn local_root(&self) -> Option<&Path> {
Some(&self.root)
}
}
#[cfg(test)]
@@ -114,6 +125,9 @@ mod tests {
assert!(matches!(s.get(".").await, Err(StorageError::BadKey)));
// Empty segment via doubled slash.
assert!(matches!(s.get("a//b").await, Err(StorageError::BadKey)));
// NUL byte (rejected explicitly so callers see BadKey rather
// than an opaque IO error from the kernel).
assert!(matches!(s.put("a\0b", b"x").await, Err(StorageError::BadKey)));
}
#[tokio::test]
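
The key checks exercised by the test above (NUL byte, `.`, doubled slash) can be sketched as a standalone validator. `check_key` and the unit struct `BadKey` below are hypothetical stand-ins for `LocalStorage::resolve` and `StorageError::BadKey`; only the NUL check, the leading-slash trim, and the empty-key check are visible in this diff, so the per-segment rules (including rejecting `..`) are assumptions inferred from the test cases.

```rust
/// Hypothetical stand-in for `StorageError::BadKey`.
#[derive(Debug, PartialEq, Eq)]
pub struct BadKey;

/// Validate a storage key and return it without leading slashes.
pub fn check_key(key: &str) -> Result<&str, BadKey> {
    // Reject NUL up front so callers get an explicit error instead of
    // an opaque IO failure from the OS.
    if key.contains('\0') {
        return Err(BadKey);
    }
    let key = key.trim_start_matches('/');
    if key.is_empty() {
        return Err(BadKey);
    }
    // Assumed segment rules: no empty, `.` or `..` components.
    for segment in key.split('/') {
        if segment.is_empty() || segment == "." || segment == ".." {
            return Err(BadKey);
        }
    }
    Ok(key)
}
```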


@@ -9,6 +9,8 @@ mod local;
use std::io;
use std::pin::Pin;
use std::path::Path;
use async_trait::async_trait;
use bytes::Bytes;
use futures_core::Stream;
@@ -44,4 +46,13 @@ pub trait Storage: Send + Sync {
async fn get_stream(&self, key: &str) -> Result<StreamingFile, StorageError>;
async fn delete(&self, key: &str) -> Result<(), StorageError>;
async fn exists(&self, key: &str) -> Result<bool, StorageError>;
/// Filesystem path the backend is rooted at, when introspectable.
/// Returns `None` for backends that aren't a local filesystem (e.g.
/// a future `S3Storage`). The admin system endpoint uses this to
/// statvfs the data dir; backends that return `None` get a `disk:
/// null` payload instead of fabricated numbers.
fn local_root(&self) -> Option<&Path> {
None
}
}
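
The default-method pattern used for `local_root` can be shown without the async trait machinery. In the sketch below, `Backend`, `LocalBackend`, `RemoteBackend` and `disk_field` are hypothetical names; the real admin endpoint is described as calling statvfs on the root, which is replaced here by simply echoing the path.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, synchronous stand-in for the `Storage` trait.
pub trait Backend {
    /// Default: the backend has no local filesystem root.
    fn local_root(&self) -> Option<&Path> {
        None
    }
}

pub struct LocalBackend {
    pub root: PathBuf,
}

impl Backend for LocalBackend {
    fn local_root(&self) -> Option<&Path> {
        Some(self.root.as_path())
    }
}

/// A non-filesystem backend simply inherits the default.
pub struct RemoteBackend;

impl Backend for RemoteBackend {}

/// Report a disk field, or an explicit null when there is no local root,
/// rather than fabricating numbers.
pub fn disk_field(backend: &dyn Backend) -> String {
    match backend.local_root() {
        Some(root) => format!("disk: {}", root.display()),
        None => "disk: null".to_string(),
    }
}
```

With a default that returns `None`, adding a new backend requires no changes at the call site that reports disk usage.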


@@ -0,0 +1,548 @@
//! PR 3 (feat/admin-mangas-api) integration tests.
//!
//! Per-variant fixture tests for the derived sync-state SQL plus
//! happy-path E2E for the two admin endpoints. Auth-gate regression
//! (403/401) is covered by PR 1's `RequireAdmin` test matrix; the only
//! gate test here is one spot check per endpoint.
mod common;
use axum::http::StatusCode;
use axum::Router;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use mangalord::repo;
const SOURCE_ID: &str = "test-source";
async fn seed_admin(pool: &PgPool, app: &Router) -> (String, String) {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
(username, cookie)
}
async fn seed_source(pool: &PgPool) {
repo::crawler::ensure_source(pool, SOURCE_ID, "Test", "https://example.test")
.await
.unwrap();
}
async fn insert_manga(pool: &PgPool, title: &str) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO mangas (title, status, alt_titles) VALUES ($1, 'ongoing', ARRAY[]::text[]) RETURNING id",
)
.bind(title)
.fetch_one(pool)
.await
.unwrap();
id
}
async fn insert_manga_source(
pool: &PgPool,
manga_id: Uuid,
source_manga_key: &str,
dropped: bool,
) {
let dropped_at = if dropped { "now()" } else { "NULL" };
let sql = format!(
"INSERT INTO manga_sources (source_id, source_manga_key, manga_id, source_url, dropped_at) \
VALUES ($1, $2, $3, 'https://example.test/m', {dropped_at})"
);
sqlx::query(&sql)
.bind(SOURCE_ID)
.bind(source_manga_key)
.bind(manga_id)
.execute(pool)
.await
.unwrap();
}
async fn insert_chapter(pool: &PgPool, manga_id: Uuid, number: i32, page_count: i32) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO chapters (manga_id, number, title, page_count) VALUES ($1, $2, NULL, $3) RETURNING id",
)
.bind(manga_id)
.bind(number)
.bind(page_count)
.fetch_one(pool)
.await
.unwrap();
id
}
async fn insert_chapter_source(
pool: &PgPool,
chapter_id: Uuid,
source_chapter_key: &str,
dropped: bool,
) {
let dropped_at = if dropped { "now()" } else { "NULL" };
let sql = format!(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, 'https://example.test/c', {dropped_at})"
);
sqlx::query(&sql)
.bind(SOURCE_ID)
.bind(source_chapter_key)
.bind(chapter_id)
.execute(pool)
.await
.unwrap();
}
async fn insert_job(pool: &PgPool, payload: serde_json::Value, state: &str) {
sqlx::query("INSERT INTO crawler_jobs (payload, state) VALUES ($1, $2)")
.bind(payload)
.bind(state)
.execute(pool)
.await
.unwrap();
}
/// Per-variant tests don't care about pagination — fetch the whole
/// chapter set (up to the hard cap) and discard the total.
async fn fetch_chapter_rows(
pool: &PgPool,
manga_id: Uuid,
) -> Vec<mangalord::repo::admin_view::AdminChapterRow> {
let (rows, _) = repo::admin_view::list_chapters_with_sync_state(
pool,
&repo::admin_view::ListAdminChaptersQuery {
manga_id,
limit: 500,
offset: 0,
},
)
.await
.unwrap();
rows
}
// ---- manga sync state ------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_synced_for_fresh_source(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Synced Manga").await;
insert_manga_source(&pool, m, "smk-1", false).await;
let (rows, total) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(total, 1);
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_synced_for_user_upload_without_sources(pool: PgPool) {
let m = insert_manga(&pool, "User Upload").await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_dropped_when_all_sources_dropped(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Dropped Manga").await;
insert_manga_source(&pool, m, "smk-1", true).await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Dropped);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_in_progress_via_sync_chapter_list_job(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Syncing Manga").await;
insert_manga_source(&pool, m, "smk-1", false).await;
// sync_chapter_list payload carries manga_id directly.
insert_job(
&pool,
json!({
"kind": "sync_chapter_list",
"source_id": SOURCE_ID,
"manga_id": m.to_string(),
"source_manga_key": "smk-1",
}),
"pending",
)
.await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::InProgress);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_in_progress_via_sync_manga_job(pool: PgPool) {
// The trickier branch: sync_manga payload is keyed by
// source_manga_key, NOT manga_id — must join through manga_sources.
seed_source(&pool).await;
let m = insert_manga(&pool, "Metadata-Refreshing Manga").await;
insert_manga_source(&pool, m, "smk-key-42", false).await;
insert_job(
&pool,
json!({
"kind": "sync_manga",
"source_id": SOURCE_ID,
"source_manga_key": "smk-key-42",
}),
"running",
)
.await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::InProgress);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_list_filters_by_sync_state(pool: PgPool) {
seed_source(&pool).await;
let m_synced = insert_manga(&pool, "AAA Synced").await;
insert_manga_source(&pool, m_synced, "smk-a", false).await;
let m_dropped = insert_manga(&pool, "BBB Dropped").await;
insert_manga_source(&pool, m_dropped, "smk-b", true).await;
let (rows, total) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
sync_state: Some(mangalord::domain::MangaSyncState::Dropped),
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(total, 1);
assert_eq!(rows.len(), 1);
assert_eq!(rows[0].id, m_dropped);
}
// ---- chapter sync state ----------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_synced_when_pages_present(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
insert_manga_source(&pool, m, "smk", false).await;
let c = insert_chapter(&pool, m, 1, 12).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(rows.len(), 1);
assert_eq!(rows[0].id, c);
assert_eq!(rows[0].sync_state, mangalord::domain::ChapterSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_not_downloaded_when_page_count_zero(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::NotDownloaded
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_downloading_when_job_in_flight(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"running",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Downloading
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_dropped_when_all_sources_dropped(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", true).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Dropped
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_failed_when_most_recent_job_dead(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"dead",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Failed
);
}
// ---- HTTP-level happy-path + gate ------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_returns_paged_with_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "Hello").await;
insert_manga_source(&pool, m, "smk", false).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/mangas?limit=50",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 1);
assert_eq!(items[0]["id"], m.to_string());
assert_eq!(items[0]["sync_state"], "synced");
assert_eq!(items[0]["chapter_count"], 0);
assert_eq!(body["page"]["total"], 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_rejects_unknown_sync_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/mangas?sync_state=bogus",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_returns_per_chapter_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c1 = insert_chapter(&pool, m, 1, 12).await;
let c2 = insert_chapter(&pool, m, 2, 0).await;
insert_chapter_source(&pool, c1, "ck1", false).await;
insert_chapter_source(&pool, c2, "ck2", false).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 2);
assert_eq!(items[0]["id"], c1.to_string());
assert_eq!(items[0]["sync_state"], "synced");
assert_eq!(items[1]["id"], c2.to_string());
assert_eq!(items[1]["sync_state"], "not_downloaded");
assert_eq!(body["page"]["total"], 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_caps_limit_at_500(pool: PgPool) {
// The handler clamps limit to [1, 500] so that expanding a long-running
// series with thousands of chapters in the admin UI can't stall the
// request.
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
for n in 1..=3 {
let _c = insert_chapter(&pool, m, n, 0).await;
}
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters?limit=999"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["page"]["limit"], 500, "limit must clamp to 500");
assert_eq!(body["items"].as_array().unwrap().len(), 3);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_paginates(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
for n in 1..=5 {
let _c = insert_chapter(&pool, m, n, 0).await;
}
let resp = h
.app
.clone()
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters?limit=2&offset=2"),
&cookie,
))
.await
.unwrap();
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 2);
// Ordered by chapter number ascending; offset=2 skips chapters 1 & 2.
assert_eq!(items[0]["number"], 3);
assert_eq!(items[1]["number"], 4);
assert_eq!(body["page"]["total"], 5);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_returns_404_for_unknown_manga(pool: PgPool) {
// Regression: used to return 200 [] for a non-existent manga,
// which silently rendered "No chapters." for a typo'd / deleted id.
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{}/chapters", Uuid::new_v4()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_synced_when_pages_present_even_with_dead_job(pool: PgPool) {
// Regression: the old CASE prioritised the dead-job branch above
// the page_count check, so a chapter with pages on disk AND a
// historical dead job (e.g. from a re-download attempt that
// crashed) flipped to Failed — contradicting Synced's "downloaded
// at some point" contract.
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 12).await; // pages present
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"dead",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Synced,
"pages on disk override historical dead-job noise"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_u, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/mangas", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
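
The chapter-state tests in this file pin down individual outcomes of the derived sync-state SQL. The sketch below restates them as a pure function to make the precedence readable. `SketchState`, `ChapterFacts` and `derive_state` are hypothetical names. Only "pages present beats a dead job" is confirmed by the regression test; the relative order of the dropped and downloading checks is an assumption that happens to satisfy every test shown here.

```rust
/// Hypothetical mirror of the chapter sync states asserted in the tests.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchState {
    Dropped,
    Downloading,
    Synced,
    Failed,
    NotDownloaded,
}

/// Inputs the SQL would derive from chapters, chapter_sources and
/// crawler_jobs; flattened into booleans for the sketch.
pub struct ChapterFacts {
    pub has_sources: bool,
    pub all_sources_dropped: bool,
    pub job_in_flight: bool,
    pub page_count: i32,
    pub latest_job_dead: bool,
}

pub fn derive_state(f: &ChapterFacts) -> SketchState {
    if f.has_sources && f.all_sources_dropped {
        // Assumed to take priority over everything else.
        SketchState::Dropped
    } else if f.job_in_flight {
        // Assumed to rank above the page-count check.
        SketchState::Downloading
    } else if f.page_count > 0 {
        // Confirmed by the regression test: pages on disk win over a
        // historical dead job.
        SketchState::Synced
    } else if f.latest_job_dead {
        SketchState::Failed
    } else {
        SketchState::NotDownloaded
    }
}
```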


@@ -0,0 +1,350 @@
//! Integration tests for the admin force-resync endpoints.
//!
//! Real resync work requires Chromium, so these tests swap in a stub
//! [`ResyncService`] to assert the handler-level contract: routing,
//! admin gate, 503 when the daemon is disabled, 404 / 422 mapping for
//! missing-resource / no-source cases, and the audit-log side effect.
mod common;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use async_trait::async_trait;
use axum::http::StatusCode;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use mangalord::crawler::resync::{
ChapterResyncOutcome, MangaResyncOutcome, ResyncError, ResyncService,
};
use mangalord::repo;
use mangalord::repo::crawler::UpsertStatus;
/// Stub that records call counts and returns a canned outcome.
struct StubResync {
manga_calls: AtomicUsize,
chapter_calls: AtomicUsize,
/// When true, returns NoMangaSource / NoChapterSource.
no_source: bool,
}
impl StubResync {
fn new() -> Arc<Self> {
Arc::new(Self {
manga_calls: AtomicUsize::new(0),
chapter_calls: AtomicUsize::new(0),
no_source: false,
})
}
fn no_source() -> Arc<Self> {
Arc::new(Self {
manga_calls: AtomicUsize::new(0),
chapter_calls: AtomicUsize::new(0),
no_source: true,
})
}
}
#[async_trait]
impl ResyncService for StubResync {
async fn resync_manga(&self, manga_id: Uuid) -> anyhow::Result<MangaResyncOutcome> {
self.manga_calls.fetch_add(1, Ordering::SeqCst);
if self.no_source {
return Err(ResyncError::NoMangaSource.into());
}
Ok(MangaResyncOutcome {
manga_id,
metadata_status: UpsertStatus::Updated,
cover_fetched: true,
})
}
async fn resync_chapter(&self, chapter_id: Uuid) -> anyhow::Result<ChapterResyncOutcome> {
self.chapter_calls.fetch_add(1, Ordering::SeqCst);
if self.no_source {
return Err(ResyncError::NoChapterSource.into());
}
Ok(ChapterResyncOutcome::Fetched {
chapter_id,
pages: 7,
})
}
}
async fn promote_admin(pool: &PgPool, username: &str) {
let u = repo::user::find_by_username(pool, username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true)
.await
.unwrap();
}
async fn insert_manga(pool: &PgPool, title: &str) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO mangas (title, status, alt_titles) VALUES ($1, 'ongoing', ARRAY[]::text[]) RETURNING id",
)
.bind(title)
.fetch_one(pool)
.await
.unwrap();
id
}
async fn insert_chapter(pool: &PgPool, manga_id: Uuid, number: i32, pages: i32) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO chapters (manga_id, number, title, page_count) VALUES ($1, $2, NULL, $3) RETURNING id",
)
.bind(manga_id)
.bind(number)
.bind(pages)
.fetch_one(pool)
.await
.unwrap();
id
}
// ----- manga resync ---------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn manga_resync_calls_service_and_returns_refreshed_detail(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "Hello").await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/mangas/{manga_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
// Stub returned Updated + cover_fetched=true.
assert_eq!(body["metadata_status"], "updated");
assert_eq!(body["cover_fetched"], true);
// Response includes the refreshed manga detail.
assert_eq!(body["manga"]["id"], manga_id.to_string());
assert_eq!(body["manga"]["title"], "Hello");
assert_eq!(stub.manga_calls.load(Ordering::SeqCst), 1);
// Audit row written.
let (audit_count,): (i64,) =
sqlx::query_as("SELECT count(*) FROM admin_audit WHERE action = 'manga_resync' AND target_id = $1")
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(audit_count, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_resync_returns_404_for_unknown_id(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/mangas/{}/resync", Uuid::new_v4()),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
// Service must not have been called when the manga doesn't exist.
assert_eq!(stub.manga_calls.load(Ordering::SeqCst), 0);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_resync_maps_no_source_to_422(pool: PgPool) {
let stub = StubResync::no_source();
let h = common::harness_with_resync(pool.clone(), stub);
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "Manual upload, no crawler source").await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/mangas/{manga_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["details"]["manga"], "no_source");
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_resync_returns_503_when_daemon_disabled(pool: PgPool) {
let h = common::harness(pool.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "Z").await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/mangas/{manga_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "service_unavailable");
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_resync_requires_admin(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub);
// Non-admin user.
let (_u, cookie) = common::register_user(&h.app).await;
let manga_id = insert_manga(&pool, "M").await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/mangas/{manga_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
// ----- chapter resync -------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn chapter_resync_calls_service_and_returns_refreshed_chapter(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "M").await;
let chapter_id = insert_chapter(&pool, manga_id, 1, 0).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/chapters/{chapter_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["outcome"], "fetched");
assert_eq!(body["pages"], 7);
assert_eq!(body["chapter"]["id"], chapter_id.to_string());
assert_eq!(stub.chapter_calls.load(Ordering::SeqCst), 1);
let (audit_count,): (i64,) = sqlx::query_as(
"SELECT count(*) FROM admin_audit WHERE action = 'chapter_resync' AND target_id = $1",
)
.bind(chapter_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(audit_count, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_resync_returns_404_for_unknown_id(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/chapters/{}/resync", Uuid::new_v4()),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
assert_eq!(stub.chapter_calls.load(Ordering::SeqCst), 0);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_resync_maps_no_source_to_422(pool: PgPool) {
let stub = StubResync::no_source();
let h = common::harness_with_resync(pool.clone(), stub);
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "M").await;
let chapter_id = insert_chapter(&pool, manga_id, 1, 0).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/chapters/{chapter_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["details"]["chapter"], "no_source");
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_resync_returns_503_when_daemon_disabled(pool: PgPool) {
let h = common::harness(pool.clone());
let (username, cookie) = common::register_user(&h.app).await;
promote_admin(&pool, &username).await;
let manga_id = insert_manga(&pool, "M").await;
let chapter_id = insert_chapter(&pool, manga_id, 1, 0).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/chapters/{chapter_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_resync_requires_admin(pool: PgPool) {
let stub = StubResync::new();
let h = common::harness_with_resync(pool.clone(), stub);
let (_u, cookie) = common::register_user(&h.app).await;
let manga_id = insert_manga(&pool, "M").await;
let chapter_id = insert_chapter(&pool, manga_id, 1, 0).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/admin/chapters/{chapter_id}/resync"),
json!({}),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
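
Review note: the `stub.manga_calls.load(Ordering::SeqCst)` assertions above only hold if the clone handed to the harness and the clone kept by the test share one counter. The sketch below shows that pattern using only the standard library. `CountingStub` and `handle_resync` are hypothetical names for illustration, not the project's `StubResync` or its handler, and the "look up first, call the service only on a hit" order is inferred from the 404 test rather than taken from the handler source.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the suite's `StubResync` (the real type is not
/// shown in this diff). Cloning copies the `Arc`, so all clones share one
/// counter.
#[derive(Clone, Default)]
struct CountingStub {
    manga_calls: Arc<AtomicUsize>,
}

impl CountingStub {
    /// Record the call and return a canned outcome.
    fn resync_manga(&self) -> &'static str {
        self.manga_calls.fetch_add(1, Ordering::SeqCst);
        "updated"
    }
}

/// Hypothetical handler: checks existence first and only then calls the
/// service, so an unknown id never reaches the stub.
fn handle_resync(stub: &CountingStub, manga_exists: bool) -> u16 {
    if !manga_exists {
        return 404;
    }
    stub.resync_manga();
    200
}
```

Keeping one clone in the test and passing another to the harness is what lets a test assert "service not called" after a 404 without reaching into the app state.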


@@ -0,0 +1,258 @@
//! PR 1 (feat/admin-role) integration tests.
//!
//! Covers: `bootstrap_admin` semantics, `is_admin` exposed on /auth/me,
//! and the `RequireAdmin` extractor's 401/403/200 matrix — including the
//! load-bearing decision that Bearer-authed callers can NEVER reach an
//! admin-guarded route, even when the underlying user IS admin.
mod common;
use std::sync::Arc;
use axum::http::StatusCode;
use axum::routing::get;
use axum::{Json, Router};
use serde_json::json;
use sqlx::PgPool;
use tempfile::TempDir;
use tower::ServiceExt;
use mangalord::api;
use mangalord::app::AppState;
use mangalord::auth::extractor::RequireAdmin;
use mangalord::auth::rate_limit::AuthRateLimiter;
use mangalord::config::{AuthConfig, UploadConfig};
use mangalord::repo;
use mangalord::storage::{LocalStorage, Storage};
/// Test-only handler guarded by `RequireAdmin`. Lets the test suite assert
/// the extractor's behaviour end-to-end without depending on an admin
/// endpoint existing yet (those land in PR 2+).
async fn admin_only_handler(RequireAdmin(user): RequireAdmin) -> Json<serde_json::Value> {
Json(json!({ "username": user.username, "is_admin": user.is_admin }))
}
/// Build a router that exposes the production /api/v1/* AND a test-only
/// `/_test/admin_only` route guarded by `RequireAdmin`. Pool is consumed;
/// callers that want to inspect the DB after a request should clone it.
fn admin_test_router(pool: PgPool) -> (Router, TempDir) {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
..AuthConfig::default()
};
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState {
db: pool,
storage,
auth,
upload: UploadConfig::default(),
auth_limiter,
resync: None,
};
let app = Router::new()
.nest("/api/v1", api::routes())
.route("/_test/admin_only", get(admin_only_handler))
.with_state(state);
(app, storage_dir)
}
// ---- bootstrap_admin -------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_creates_admin_when_user_missing(pool: PgPool) {
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("bootstrap on empty DB");
let user = repo::user::find_by_username(&pool, "root")
.await
.unwrap()
.expect("root user exists after bootstrap");
assert!(user.is_admin, "bootstrap must set is_admin = true on creation");
// Password hash must verify the env-supplied password (and not be empty).
assert!(
mangalord::auth::password::verify_password("hunter2hunter2", &user.password_hash),
"bootstrap-created user must accept the env-supplied password"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_promotes_existing_user_without_touching_password(pool: PgPool) {
// Pre-existing user, not admin. Use the real register path so the
// hash format matches production exactly.
let (app, _td) = admin_test_router(pool.clone());
let resp = app
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "preexisting", "password": "originalpw1234" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let before = repo::user::find_by_username(&pool, "preexisting")
.await
.unwrap()
.unwrap();
assert!(!before.is_admin);
let original_hash = before.password_hash.clone();
// Bootstrap with a DIFFERENT password — must not overwrite the hash.
repo::user::bootstrap_admin(&pool, "preexisting", "envpw_should_be_ignored")
.await
.expect("bootstrap on existing user");
let after = repo::user::find_by_username(&pool, "preexisting")
.await
.unwrap()
.unwrap();
assert!(after.is_admin, "bootstrap must promote existing user");
assert_eq!(
after.password_hash, original_hash,
"bootstrap must NOT overwrite the existing password hash"
);
assert!(
mangalord::auth::password::verify_password("originalpw1234", &after.password_hash),
"original password must still verify after bootstrap"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_is_idempotent(pool: PgPool) {
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("first bootstrap");
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("second bootstrap is no-op");
// Exactly one row, still admin.
let (count,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM users WHERE username = $1")
.bind("root")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 1);
}
// ---- /api/v1/auth/me exposes is_admin --------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn auth_me_response_includes_is_admin(pool: PgPool) {
let (app, _td) = admin_test_router(pool.clone());
let (_username, cookie) = common::register_user(&app).await;
let resp = app
.oneshot(common::get_with_cookie("/api/v1/auth/me", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(
body["user"]["is_admin"], false,
"freshly-registered users default to is_admin=false"
);
}
// ---- RequireAdmin: 401 / 403 / 200 matrix ----------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_unauthenticated(pool: PgPool) {
let (app, _td) = admin_test_router(pool);
let resp = app
.oneshot(common::get("/_test/admin_only"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_non_admin_cookie(pool: PgPool) {
let (app, _td) = admin_test_router(pool);
let (_username, cookie) = common::register_user(&app).await;
let resp = app
.oneshot(common::get_with_cookie("/_test/admin_only", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "forbidden");
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_accepts_admin_cookie(pool: PgPool) {
let (app, _td) = admin_test_router(pool.clone());
let (username, cookie) = common::register_user(&app).await;
// Promote via the repo (the admin-users API doesn't exist yet).
let u = repo::user::find_by_username(&pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(&pool, u.id, true).await.unwrap();
let resp = app
.oneshot(common::get_with_cookie("/_test/admin_only", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["username"], username);
assert_eq!(body["is_admin"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_bearer_token_even_for_admin_user(pool: PgPool) {
// Key privilege-escalation test: an API token belonging to an admin user
// must NOT grant admin authority. Bot tokens are excluded from admin
// routes by design (the RequireAdmin extractor only accepts session
// cookies). See cross-cutting decision #1 in the PR plan.
let (app, _td) = admin_test_router(pool.clone());
let (username, cookie) = common::register_user(&app).await;
// Promote to admin and mint an API token (the existing /auth/tokens
// endpoint authenticates via the same cookie).
let u = repo::user::find_by_username(&pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(&pool, u.id, true).await.unwrap();
let resp = app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/auth/tokens",
json!({ "name": "test-bot" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
let token = body["bearer"]
.as_str()
.expect("raw bearer token in response")
.to_string();
// Sanity: the bearer DOES work on a non-admin endpoint (proves the
// token is valid, isolating the failure below to the admin guard).
let resp = app
.clone()
.oneshot(common::get_with_bearer("/api/v1/auth/me", &token))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
// Same token, same admin user, but on the admin-guarded route → 401
// (no session cookie present at all from the extractor's POV).
let resp = app
.oneshot(common::get_with_bearer("/_test/admin_only", &token))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::UNAUTHORIZED,
"Bearer-authed admin must NOT pass the RequireAdmin guard"
);
}
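
Review note: the 401/403/200 matrix exercised above can be summarised as a small decision table. The sketch below is a simplified model under stated assumptions: `Credential` and `admin_guard` are hypothetical names, and the real `RequireAdmin` extractor works on request headers and a database lookup rather than on an enum. It encodes only what the tests assert, namely that a Bearer token is treated as "no session" even when its owner is an admin.

```rust
/// Hypothetical model of how a request authenticated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Credential {
    None,
    SessionCookie { is_admin: bool },
    BearerToken { is_admin: bool },
}

/// Status code an admin-guarded route would answer with, per the tests above.
fn admin_guard(cred: Credential) -> u16 {
    match cred {
        // A Bearer token never counts as a session, whatever the user's role.
        Credential::None | Credential::BearerToken { .. } => 401,
        // Logged in through a cookie, but not an admin.
        Credential::SessionCookie { is_admin: false } => 403,
        // Cookie session belonging to an admin.
        Credential::SessionCookie { is_admin: true } => 200,
    }
}
```

Modelling the Bearer case as 401 rather than 403 matches the final test: from the guard's point of view there is no session at all, so the caller is unauthenticated, not under-privileged.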


@@ -0,0 +1,96 @@
//! PR 4 (feat/admin-system-api) integration tests.
//!
//! Shape-only assertions — we don't mock the system, just call the
//! endpoint and check the response envelope. Threshold-triggering of
//! alerts would require faking statvfs / sysinfo, which is more
//! plumbing than the test gives back.
mod common;
use axum::http::StatusCode;
use axum::Router;
use sqlx::PgPool;
use tower::ServiceExt;
use mangalord::repo;
async fn seed_admin(pool: &PgPool, app: &Router) -> String {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
cookie
}
#[sqlx::test(migrations = "./migrations")]
async fn requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_u, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/system", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn unauthenticated_request_is_rejected(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/admin/system"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn returns_disk_memory_cpu_alerts_shape(pool: PgPool) {
let h = common::harness(pool.clone());
let cookie = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/system", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
// Disk: harness uses LocalStorage on a tempdir, so disk SHOULD be
// populated. Validate the field shape and percent range.
let disk = body
.get("disk")
.expect("disk key present")
.as_object()
.expect("disk is an object (LocalStorage exposes a path)");
assert!(disk["total_bytes"].as_u64().unwrap() > 0);
let pct = disk["percent_used"].as_f64().unwrap();
assert!(
(0.0..=100.0).contains(&pct),
"percent_used outside [0,100]: {pct}"
);
let mem = body.get("memory").expect("memory key").as_object().unwrap();
assert!(mem["total_bytes"].as_u64().unwrap() > 0);
let mpct = mem["percent_used"].as_f64().unwrap();
assert!((0.0..=100.0).contains(&mpct));
let cpu = body.get("cpu").expect("cpu key").as_object().unwrap();
let cpu_pct = cpu["percent_used"].as_f64().unwrap();
assert!(
(0.0..=100.0).contains(&cpu_pct),
"cpu out of range: {cpu_pct}"
);
let alerts = body.get("alerts").expect("alerts key").as_array().unwrap();
// Don't assert on length — the box may genuinely be >90% on memory
// when the test runs. Just confirm shape of any present entry.
for alert in alerts {
assert!(alert["level"].is_string());
assert!(alert["message"].is_string());
}
}
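
Review note: the shape test above checks that every `percent_used` lies in [0, 100] and that each alert carries a string `level` and `message`. The sketch below shows one way such values could be produced. It is an assumption-heavy illustration: `percent_used`, `alerts_for`, the `"warning"` level string and the caller-supplied threshold are all hypothetical, and the 90% figure used in the usage note is borrowed from the test comment, not from the endpoint's source.

```rust
/// Hypothetical alert shape; only the `level` and `message` field names come
/// from the test above.
#[derive(Debug, PartialEq)]
struct Alert {
    level: String,
    message: String,
}

/// Usage as a percentage, clamped to [0, 100]. A zero total yields 0.0
/// instead of dividing by zero.
fn percent_used(used_bytes: u64, total_bytes: u64) -> f64 {
    if total_bytes == 0 {
        return 0.0;
    }
    let pct = used_bytes as f64 / total_bytes as f64 * 100.0;
    pct.clamp(0.0, 100.0)
}

/// Emit one alert when `pct` is strictly above `threshold`, otherwise none.
fn alerts_for(resource: &str, pct: f64, threshold: f64) -> Vec<Alert> {
    if pct > threshold {
        vec![Alert {
            level: "warning".to_string(),
            message: format!("{resource} usage at {pct:.1}%"),
        }]
    } else {
        Vec::new()
    }
}
```

With an assumed threshold of 90.0, `alerts_for("memory", 95.0, 90.0)` yields one entry and `alerts_for("disk", 10.0, 90.0)` yields none, which is why the test iterates over whatever alerts are present instead of asserting a fixed length.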


@@ -0,0 +1,605 @@
//! PR 2 (feat/admin-users-api) integration tests.
//!
//! Exercises list / create / delete / promote-demote on /api/v1/admin/users:
//! pagination + search, the RequireAdmin gate, self-protection,
//! last-admin invariant (including the parallel-demote race that
//! `pg_advisory_xact_lock` + recount-inside-tx guards against), and
//! that audit rows land in `admin_audit` only on successful commit.
//!
//! Note on the last-admin invariant: the *serial* path via HTTP is
//! structurally unreachable. The only configuration that would hit the
//! "would orphan admins" branch requires the actor to be the lone admin
//! demoting themselves, and the self-guard fires on that first. So the
//! last-admin checks below call the repo directly to exercise the
//! invariant; the concurrent-demote race is covered, also at the repo
//! layer, by `parallel_demotes_cannot_orphan_admins`.
mod common;
use axum::http::StatusCode;
use axum::Router;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use mangalord::error::AppError;
use mangalord::repo;
/// Register a user via the public API and immediately promote them via
/// the repo. Returns (username, session cookie, user_id) — the common
/// "I need a logged-in admin" prelude.
async fn seed_admin(pool: &PgPool, app: &Router) -> (String, String, Uuid) {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
(username, cookie, u.id)
}
// ---- RequireAdmin gate -----------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn list_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/users", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{}", Uuid::new_v4()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn patch_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", Uuid::new_v4()),
json!({ "is_admin": true }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
// ---- list with search and pagination ---------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn list_returns_paginated_users(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin_name, cookie, _) = seed_admin(&pool, &h.app).await;
let _u1 = common::register_user(&h.app).await;
let _u2 = common::register_user(&h.app).await;
let _u3 = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/users?limit=2&offset=0",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().expect("items array");
assert_eq!(items.len(), 2, "limit=2 should cap the page");
assert_eq!(body["page"]["limit"], 2);
assert_eq!(body["page"]["offset"], 0);
assert_eq!(body["page"]["total"], 4);
assert!(items[0].get("is_admin").is_some());
assert!(
items[0].get("password_hash").is_none(),
"password_hash must never leak even to other admins"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn list_filters_by_substring_search(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin_name, cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "zzzfindme01", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/users?search=zzzfindme",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 1, "search must narrow to the one match");
assert_eq!(items[0]["username"], "zzzfindme01");
assert_eq!(body["page"]["total"], 1);
}
// ---- self-protection -------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn cannot_self_delete(pool: PgPool) {
let h = common::harness(pool.clone());
let (_username, cookie, actor_id) = seed_admin(&pool, &h.app).await;
// Second admin so the last-admin guard isn't what triggers the conflict.
let (_other, _, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{actor_id}"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CONFLICT);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "conflict");
assert!(
body["error"]["message"]
.as_str()
.unwrap()
.contains("yourself"),
"message must call out the self-action; got {:?}",
body["error"]["message"]
);
}
#[sqlx::test(migrations = "./migrations")]
async fn cannot_self_demote(pool: PgPool) {
let h = common::harness(pool.clone());
let (_username, cookie, actor_id) = seed_admin(&pool, &h.app).await;
let (_other, _, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{actor_id}"),
json!({ "is_admin": false }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CONFLICT);
let body = common::body_json(resp).await;
assert!(body["error"]["message"]
.as_str()
.unwrap()
.contains("yourself"));
}
// ---- last-admin invariant (repo layer, see file header) --------------------
#[sqlx::test(migrations = "./migrations")]
async fn last_admin_demote_refused_at_repo(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
// admins = {A, B}. Demote A via B (count 2 → 1) — allowed.
let r = repo::user::admin_safe_set_is_admin(&pool, b_id, a_id, false)
.await
.expect("first demote succeeds");
assert!(!r.is_admin);
// admins = {B}. Try to demote B via A (actor doesn't matter to the
// repo — that's the HTTP gate's job). Last-admin guard kicks in.
let err = repo::user::admin_safe_set_is_admin(&pool, a_id, b_id, false)
.await
.expect_err("second demote must be refused");
match err {
AppError::Conflict(m) => assert!(
m.contains("last admin"),
"expected last-admin conflict; got {m:?}"
),
other => panic!("expected Conflict, got {other:?}"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn last_admin_delete_refused_at_repo(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
// admins = {A, B}. Delete A via B (count 2 → 1) — allowed.
repo::user::admin_safe_delete(&pool, b_id, a_id)
.await
.expect("first delete succeeds");
// admins = {B}. Try to delete B via a fresh non-admin actor. The
// last-admin guard kicks in.
let (c_name, _c_cookie) = common::register_user(&h.app).await;
let c_id = repo::user::find_by_username(&pool, &c_name)
.await
.unwrap()
.unwrap()
.id;
let err = repo::user::admin_safe_delete(&pool, c_id, b_id)
.await
.expect_err("second delete must be refused");
match err {
AppError::Conflict(m) => assert!(
m.contains("last admin"),
"expected last-admin conflict; got {m:?}"
),
other => panic!("expected Conflict, got {other:?}"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn parallel_demotes_cannot_orphan_admins(pool: PgPool) {
// The race the advisory lock + recount exists to close: two parallel
// demotes of two DIFFERENT admins, each reading `count = 2` and
// committing, would land at zero admins. With the lock the second
// demote sees count = 1 inside the tx and refuses.
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
let pool_x = pool.clone();
let pool_y = pool.clone();
let task_x = tokio::spawn(async move {
repo::user::admin_safe_set_is_admin(&pool_x, a_id, b_id, false).await
});
let task_y = tokio::spawn(async move {
repo::user::admin_safe_set_is_admin(&pool_y, b_id, a_id, false).await
});
let r_x = task_x.await.unwrap();
let r_y = task_y.await.unwrap();
let outcomes = (r_x.is_ok(), r_y.is_ok());
assert!(
outcomes == (true, false) || outcomes == (false, true),
"exactly one of the two parallel demotes must succeed; got {outcomes:?}"
);
let (count,): (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 1, "exactly one admin must remain");
}
// ---- audit log -------------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn promote_writes_audit_row(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie) = common::register_user(&h.app).await;
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
json!({ "is_admin": true }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let rows: Vec<(Option<Uuid>, String, String, Option<Uuid>)> = sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id FROM admin_audit",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(rows.len(), 1);
let (actor, action, kind, target) = rows.into_iter().next().unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "promote_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(b.id));
}
#[sqlx::test(migrations = "./migrations")]
async fn redundant_promote_does_not_write_audit_row(pool: PgPool) {
// Regression: PATCH {is_admin: true} on someone already admin used
// to UPDATE (no-op) and still INSERT a misleading "promote_user"
// audit row. Should short-circuit without touching admin_audit.
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie, _b_id) = seed_admin(&pool, &h.app).await; // already admin
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
json!({ "is_admin": true }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let (count,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM admin_audit")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 0, "no-op promote must not write audit row");
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_writes_audit_row(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie) = common::register_user(&h.app).await;
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
let rows: Vec<(Option<Uuid>, String, String, Option<Uuid>, serde_json::Value)> =
sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id, payload FROM admin_audit",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(rows.len(), 1);
let (actor, action, kind, target, payload) = rows.into_iter().next().unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "delete_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(b.id));
assert_eq!(payload["username"], b_name);
assert_eq!(payload["was_admin"], false);
}
// ---- POST /admin/users (admin-create) --------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn create_user_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "newbie", "password": "hunter2hunter2" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_unauthenticated_is_rejected(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::post_json(
"/api/v1/admin/users",
json!({ "username": "newbie", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_happy_path_creates_user_and_audit(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "invited01", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
assert_eq!(body["username"], "invited01");
assert_eq!(body["is_admin"], false);
assert!(body["id"].as_str().is_some());
assert!(
body.get("password_hash").is_none(),
"password_hash must never appear in admin-create response"
);
let target_id =
Uuid::parse_str(body["id"].as_str().unwrap()).unwrap();
let (actor, action, kind, target, payload): (
Option<Uuid>,
String,
String,
Option<Uuid>,
serde_json::Value,
) = sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id, payload \
FROM admin_audit ORDER BY at DESC LIMIT 1",
)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "create_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(target_id));
assert_eq!(payload["username"], "invited01");
assert_eq!(payload["is_admin"], false);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_can_mint_an_admin_in_one_call(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({
"username": "newadmin",
"password": "freshpass1234",
"is_admin": true
}),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
assert_eq!(body["is_admin"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_returns_409_on_duplicate(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
// Seed an existing user via the public register path.
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "taken", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "Taken", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CONFLICT,
"case-insensitive collision via the lower(username) index"
);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "conflict");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_rejects_weak_password(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "okayname", "password": "short" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "invalid_input");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_rejects_invalid_username(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "bad name!", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_works_even_when_self_register_disabled(pool: PgPool) {
// The admin-create path must NOT be gated by ALLOW_SELF_REGISTER —
// that's the entire point of having an admin-create endpoint.
let h = common::harness_with_self_register_disabled(pool.clone());
// Bootstrap an admin out-of-band since self-register would refuse.
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.unwrap();
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "root", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let cookie = common::extract_session_cookie(&resp).unwrap();
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "invited01", "password": "freshpass1234" }),
&cookie,
))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CREATED,
"admin must be able to mint users even with self-register off"
);
}


@@ -567,6 +567,166 @@ async fn user_a_cannot_delete_user_b_token(pool: PgPool) {
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
}
/// Username enumeration via login response time: an attacker probes
/// for valid usernames by measuring how long /auth/login takes. Before
/// the equalisation fix, the no-user branch returned 401 in <1 ms
/// while the wrong-password branch took ~50-100 ms (the argon2 verify
/// cost). This test asserts the no-user branch now spends at least
/// some meaningful fraction of the wrong-password branch's time.
///
/// Tolerance is intentionally loose so CI variance doesn't flap the
/// test. The unequalised gap is large enough (~50x) that even a noisy
/// CI run with a 5x slack still catches it.
#[sqlx::test(migrations = "./migrations")]
async fn login_no_user_branch_runs_argon2_for_timing_equalisation(pool: PgPool) {
use std::time::Instant;
let h = common::harness(pool);
// Register the victim user so the wrong-password branch has a real
// argon2 hash to verify against.
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "victim", "password": "hunter2hunter2" }),
))
.await
.unwrap();
// Warm-up: first login of the process initialises the dummy hash
// lazily. Skip that cost when measuring.
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "victim", "password": "wrong" }),
))
.await
.unwrap();
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "ghost", "password": "wrong" }),
))
.await
.unwrap();
// Min-of-N is more stable than a single sample.
async fn sample_min(
app: &axum::Router,
username: &str,
n: u32,
) -> std::time::Duration {
let mut samples = Vec::with_capacity(n as usize);
for _ in 0..n {
let req = common::post_json(
"/api/v1/auth/login",
json!({ "username": username, "password": "wrong-guess" }),
);
let t = Instant::now();
let resp = app.clone().oneshot(req).await.unwrap();
let d = t.elapsed();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
samples.push(d);
}
// Use the minimum: it's the floor that argon2 takes, robust
// against unrelated stalls (DB connection acquisition, etc.).
*samples.iter().min().unwrap()
}
let wrong_pwd = sample_min(&h.app, "victim", 3).await;
let no_user = sample_min(&h.app, "ghost", 3).await;
// 5x slack: argon2 dominates both branches, so they should be
// within an order of magnitude. Unequalised, no_user would be
// ~50-100x faster. Asserting "no_user >= wrong_pwd / 5" catches
// the bug without being flaky in CI.
assert!(
no_user * 5 >= wrong_pwd,
"login timing leaks user existence: no_user={no_user:?}, wrong_pwd={wrong_pwd:?}"
);
}
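The equalisation this test checks can be sketched as follows. This is a minimal illustration, not the project's implementation: `slow_hash` is a stand-in for argon2 (an iterated FNV-1a, not a secure hash), and `Auth`, `dummy_hash`, and `verify_login` are hypothetical names. The point is only that the no-user branch pays for the same single hash evaluation as the wrong-password branch.

```rust
use std::collections::HashMap;

// Stand-in for an expensive password hash such as argon2. This is an
// iterated FNV-1a, NOT a secure hash; it only makes the cost explicit.
const ROUNDS: u32 = 1_000;

fn slow_hash(password: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for _ in 0..ROUNDS {
        for b in password.bytes() {
            h ^= u64::from(b);
            h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }
    h
}

// Hypothetical in-memory credential store (assumption for this sketch;
// the tests above run against Postgres).
struct Auth {
    users: HashMap<String, u64>,
    dummy_hash: u64,
}

impl Auth {
    fn new(users: HashMap<String, u64>) -> Self {
        // Computed once, so every login path costs exactly one hash.
        Auth { users, dummy_hash: slow_hash("dummy-password") }
    }

    fn verify_login(&self, username: &str, password: &str) -> bool {
        let stored = self.users.get(username).copied();
        // Unknown user: still hash, but compare against the dummy value.
        let reference = stored.unwrap_or(self.dummy_hash);
        let candidate = slow_hash(password);
        // A match against the dummy hash must never authenticate anyone.
        candidate == reference && stored.is_some()
    }
}
```

Both the "victim" and "ghost" probes run `slow_hash` once, so the response time no longer separates the two cases by the cost of the hash.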
/// Brute-force / spray protection: at default production limits, a
/// tight loop of /auth/login attempts should burst through the bucket
/// and then 429 every subsequent request until the bucket refills.
#[sqlx::test(migrations = "./migrations")]
async fn login_rate_limited_under_burst_pressure(pool: PgPool) {
let h = common::harness_with_auth_rate_limit(pool, 1, 3);
// Register a victim so the wrong-password branch is real work.
let _ = h
.app
.clone()
.oneshot(common::post_json("/api/v1/auth/register", creds("victim")))
.await
.unwrap();
// Register consumed one token from the burst-3 bucket. Fire 30
// wrong-password logins back-to-back; with per_sec=1 the refill
// is too slow to keep up and at least one must come back 429.
let mut saw_429 = false;
for _ in 0..30 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "victim", "password": "wrong" }),
))
.await
.unwrap();
if resp.status() == StatusCode::TOO_MANY_REQUESTS {
// RFC 6585 §4: a 429 response MAY include a Retry-After header;
// this test requires one. The value is in seconds; with per_sec=1
// the bucket needs ~1s to refill, so the header should be 1 or 2.
let retry_after = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u32>().ok())
.expect("Retry-After header present and numeric");
assert!(
retry_after >= 1,
"Retry-After must be at least 1s, got {retry_after}"
);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "too_many_requests");
saw_429 = true;
break;
}
}
assert!(
saw_429,
"expected at least one 429 within 30 rapid login attempts"
);
}
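The burst-then-429 behaviour exercised above matches a token bucket. The sketch below is an assumption about the shape of such a limiter, not the one the harness wires in: `TokenBucket` and `try_acquire` are hypothetical names, and time is passed in explicitly (seconds as `f64`) so the behaviour is deterministic.

```rust
// Minimal token bucket: `burst` tokens of capacity, refilled at
// `per_sec` tokens per second. Hypothetical type for illustration.
#[derive(Debug)]
struct TokenBucket {
    per_sec: f64,
    burst: f64,
    tokens: f64,
    last: f64, // timestamp of the previous call, in seconds
}

impl TokenBucket {
    fn new(per_sec: u32, burst: u32) -> Self {
        TokenBucket {
            per_sec: f64::from(per_sec),
            burst: f64::from(burst),
            tokens: f64::from(burst), // start full
            last: 0.0,
        }
    }

    // Ok(()) when the request is admitted, Err(retry_after_secs) when
    // the caller should answer 429 with that Retry-After value.
    fn try_acquire(&mut self, now: f64) -> Result<(), u64> {
        let elapsed = (now - self.last).max(0.0);
        self.tokens = (self.tokens + elapsed * self.per_sec).min(self.burst);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            // Time until one whole token is available, rounded up and
            // never reported as less than one second.
            let wait = (1.0 - self.tokens) / self.per_sec;
            Err((wait.ceil() as u64).max(1))
        }
    }
}
```

With `per_sec = 1` and `burst = 3`, three back-to-back requests pass and the fourth is rejected with a one-second retry hint, which is consistent with the `retry_after >= 1` assertion in the test.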
/// Default (test-harness) limits are disabled, so existing tests that
/// fire multiple auth requests don't start failing.
#[sqlx::test(migrations = "./migrations")]
async fn default_test_harness_does_not_rate_limit(pool: PgPool) {
let h = common::harness(pool);
for i in 0..50 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": format!("nobody-{i}"), "password": "x" }),
))
.await
.unwrap();
// None of these should be 429 — only 401.
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED, "iter {i}");
}
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_unknown_token_is_404(pool: PgPool) {
let h = common::harness(pool);
@@ -581,3 +741,68 @@ async fn delete_unknown_token_is_404(pool: PgPool) {
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Bot token names are user-supplied free-form strings; a 10 MB name
/// was accepted before. Cap at 64 chars to match the other free-form
/// identifier caps (tags, collection names). The response uses
/// `ValidationFailed` (422 with per-field details) so clients can
/// render the same shape they already handle for `attach_tag`.
#[sqlx::test(migrations = "./migrations")]
async fn create_token_rejects_name_over_64_chars(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/auth/tokens",
json!({ "name": "x".repeat(65) }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
assert!(body["error"]["details"]["name"].is_string());
}
// ---- self-register toggle + /auth/config -----------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn auth_config_reports_self_register_enabled_by_default(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["self_register_enabled"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn auth_config_reflects_self_register_disabled(pool: PgPool) {
let h = common::harness_with_self_register_disabled(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["self_register_enabled"], false);
}
#[sqlx::test(migrations = "./migrations")]
async fn register_returns_403_when_self_register_disabled(pool: PgPool) {
let h = common::harness_with_self_register_disabled(pool);
let resp = h
.app
.oneshot(common::post_json("/api/v1/auth/register", creds("alice")))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "forbidden");
}


@@ -438,3 +438,196 @@ async fn list_me_returns_paged_envelope(pool: PgPool) {
// without paging through.
assert_eq!(body["page"]["total"], 0);
}
// -------------------------------------------------------------------------
// Bookmark create -> SyncChapterContent job enqueue (background task)
// -------------------------------------------------------------------------
async fn seed_chapter_with_source(
pool: &PgPool,
manga_id: Uuid,
number: i32,
source_id: &str,
source_chapter_key: &str,
source_url: &str,
dropped: bool,
) -> Uuid {
let chapter_id: Uuid =
mangalord::repo::chapter::create(pool, manga_id, number, None, None)
.await
.unwrap()
.id;
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind(source_id)
.bind(source_id)
.bind("https://example.com")
.execute(pool)
.await
.unwrap();
let dropped_at = if dropped { "now()" } else { "NULL" };
sqlx::query(&format!(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, {dropped_at})"
))
.bind(source_id)
.bind(source_chapter_key)
.bind(chapter_id)
.bind(source_url)
.execute(pool)
.await
.unwrap();
chapter_id
}
/// Poll `crawler_jobs` for the expected pending count, up to ~1.5s, so the
/// detached `tokio::spawn` from the bookmark create handler has time to
/// land regardless of CI scheduling jitter.
async fn wait_for_pending_count(pool: &PgPool, expected: i64) -> i64 {
for _ in 0..30 {
let count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap();
if count >= expected {
return count;
}
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}
sqlx::query_scalar::<_, i64>(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap()
}
#[sqlx::test(migrations = "./migrations")]
async fn create_enqueues_sync_chapter_content_jobs_for_pending_chapters(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
// Two zero-page chapters with non-dropped sources.
let c1 = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let c2 = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", false).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let count = wait_for_pending_count(&pool, 2).await;
assert_eq!(count, 2, "both pending chapters should be enqueued");
let chapter_ids: Vec<String> = sqlx::query_scalar(
"SELECT payload->>'chapter_id' FROM crawler_jobs \
WHERE payload->>'kind' = 'sync_chapter_content' \
ORDER BY payload->>'chapter_id'",
)
.fetch_all(&pool)
.await
.unwrap();
let mut expected = vec![c1.to_string(), c2.to_string()];
expected.sort();
assert_eq!(chapter_ids, expected);
}
#[sqlx::test(migrations = "./migrations")]
async fn re_bookmark_after_delete_does_not_re_enqueue_pending_jobs(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _ = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
// First bookmark — should enqueue 1.
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
let bookmark_id = common::body_json(resp).await["id"].as_str().unwrap().to_string();
assert_eq!(wait_for_pending_count(&pool, 1).await, 1);
// Delete the bookmark, then re-bookmark — the existing pending job
// is still there so the dedup index suppresses the second enqueue.
let resp = h
.app
.clone()
.oneshot(common::delete_with_cookie(
&format!("/api/v1/bookmarks/{bookmark_id}"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
// Give the background task time to attempt re-enqueue (it should be a no-op).
tokio::time::sleep(std::time::Duration::from_millis(300)).await;
let final_count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state IN ('pending', 'running') \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(final_count, 1, "dedup index keeps the queue at a single in-flight row");
}
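The dedup behaviour these two tests rely on can be pictured with an in-memory analogue. The actual index definition is not shown in this diff, so the sketch below is an assumption: `JobQueue`, `enqueue`, and `finish` are hypothetical names that mimic a uniqueness constraint over in-flight (pending or running) jobs per chapter.

```rust
use std::collections::HashSet;

// In-memory stand-in for a queue with a uniqueness rule on in-flight
// jobs. Hypothetical; the real queue is the `crawler_jobs` table.
#[derive(Default)]
struct JobQueue {
    in_flight: HashSet<String>, // chapter ids with a pending/running job
}

impl JobQueue {
    // Returns true when a new job row was created, false when an
    // in-flight job for the same chapter suppressed the enqueue.
    fn enqueue(&mut self, chapter_id: &str) -> bool {
        self.in_flight.insert(chapter_id.to_string())
    }

    // Marks the job as done, which frees the chapter for a later enqueue.
    fn finish(&mut self, chapter_id: &str) {
        self.in_flight.remove(chapter_id);
    }

    fn in_flight_count(&self) -> usize {
        self.in_flight.len()
    }
}
```

Deleting a bookmark does not touch the queue in this model, so a re-bookmark hits the still-pending entry and leaves the count at one, which is what the second test asserts.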
#[sqlx::test(migrations = "./migrations")]
async fn create_skips_chapters_with_dropped_sources(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _alive = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let _dropped = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", true).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
assert_eq!(
wait_for_pending_count(&pool, 1).await,
1,
"only the chapter with a non-dropped source row gets enqueued"
);
}


@@ -12,12 +12,18 @@ async fn seed_manga(h: &common::Harness, cookie: &str, title: &str) -> Uuid {
common::seed_manga_via_api(&h.app, cookie, title).await
}
async fn seed_chapter(pool: &PgPool, manga_id: Uuid, number: i32, title: Option<&str>) {
async fn seed_chapter(
pool: &PgPool,
manga_id: Uuid,
number: i32,
title: Option<&str>,
) -> Uuid {
// Historical seed — uploaded_by remains NULL, mirroring the
// pre-Phase-5 rows in the production DB.
mangalord::repo::chapter::create(pool, manga_id, number, title, None)
.await
.unwrap();
.unwrap()
.id
}
#[sqlx::test(migrations = "./migrations")]
@@ -81,16 +87,16 @@ async fn list_chapters_returns_404_for_unknown_manga(pool: PgPool) {
}
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_by_number(pool: PgPool) {
async fn get_chapter_by_id(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
let chapter_id = seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/1"
"/api/v1/mangas/{manga_id}/chapters/{chapter_id}"
)))
.await
.unwrap();
@@ -99,18 +105,20 @@ async fn get_chapter_by_number(pool: PgPool) {
assert_eq!(body["number"], 1);
assert_eq!(body["title"], "The Brand");
assert_eq!(body["page_count"], 0);
assert_eq!(body["id"], chapter_id.to_string());
}
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_unknown_number_is_404(pool: PgPool) {
async fn get_chapter_unknown_id_is_404(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/99"
"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}"
)))
.await
.unwrap();
@@ -122,10 +130,34 @@ async fn get_chapter_unknown_number_is_404(pool: PgPool) {
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_unknown_manga_is_404(pool: PgPool) {
let h = common::harness(pool);
let unknown = Uuid::nil();
let unknown_manga = Uuid::nil();
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!("/api/v1/mangas/{unknown}/chapters/1")))
.oneshot(common::get(&format!(
"/api/v1/mangas/{unknown_manga}/chapters/{unknown_chapter}"
)))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Cross-manga isolation: a chapter id belonging to manga A must not
/// resolve when accessed via manga B's URL. The (manga_id, id) scoping
/// in `find_by_id_in_manga` enforces this.
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_from_wrong_manga_is_404(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_a = seed_manga(&h, &cookie, "Berserk").await;
let manga_b = seed_manga(&h, &cookie, "Vagabond").await;
let chapter_id = seed_chapter(&pool, manga_a, 1, Some("Episode 1")).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_b}/chapters/{chapter_id}"
)))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
@@ -136,12 +168,12 @@ async fn list_pages_empty_for_chapter_without_upload(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
seed_chapter(&pool, manga_id, 1, None).await;
let chapter_id = seed_chapter(&pool, manga_id, 1, None).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/1/pages"
"/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
)))
.await
.unwrap();
@@ -155,11 +187,12 @@ async fn list_pages_returns_404_for_unknown_chapter(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/99/pages"
"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}/pages"
)))
.await
.unwrap();


@@ -0,0 +1,462 @@
mod common;
use axum::http::StatusCode;
use serde_json::{json, Value};
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use common::{
body_json, delete_with_cookie, fake_jpeg_bytes, fake_png_bytes, get, harness,
post_multipart_with_cookie, put_multipart, put_multipart_with_cookie, register_user,
MultipartBuilder,
};
async fn create_manga_with_cover(
app: &axum::Router,
cookie: &str,
title: &str,
cover: Option<(&str, &[u8])>,
) -> Value {
let mut form =
MultipartBuilder::new().add_json("metadata", json!({ "title": title }));
if let Some((ct, bytes)) = cover {
form = form.add_file("cover", "cover.bin", ct, bytes);
}
let resp = app
.clone()
.oneshot(post_multipart_with_cookie("/api/v1/mangas", form, cookie))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CREATED,
"seed create_manga failed: {:?}",
resp.status()
);
body_json(resp).await
}
fn id_of(body: &Value) -> Uuid {
Uuid::parse_str(body["id"].as_str().unwrap()).unwrap()
}
fn cover_form(bytes: &[u8]) -> MultipartBuilder {
MultipartBuilder::new().add_file("cover", "cover.bin", "application/octet-stream", bytes)
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_sets_path_when_none_existed(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Cover Me", None).await;
let id = id_of(&manga);
assert!(manga["cover_image_path"].is_null());
let bytes = fake_png_bytes();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&bytes),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
let expected_key = format!("mangas/{id}/cover.png");
assert_eq!(body["cover_image_path"], expected_key);
assert_eq!(body["title"], "Cover Me");
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{expected_key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_replaces_existing_same_extension(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let original = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Replace Me",
Some(("image/png", &original)),
)
.await;
let id = id_of(&manga);
let original_key = format!("mangas/{id}/cover.png");
assert_eq!(manga["cover_image_path"], original_key);
let mut replacement = fake_png_bytes();
replacement.extend_from_slice(b"-replacement-marker");
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&replacement),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert_eq!(body["cover_image_path"], original_key);
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{original_key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::OK);
let body_bytes = http_body_util::BodyExt::collect(file_resp.into_body())
.await
.unwrap()
.to_bytes();
assert_eq!(body_bytes.as_ref(), replacement.as_slice());
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_replaces_existing_different_extension_and_deletes_old_blob(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let png = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Switch Ext",
Some(("image/png", &png)),
)
.await;
let id = id_of(&manga);
let old_key = format!("mangas/{id}/cover.png");
assert_eq!(manga["cover_image_path"], old_key);
let jpeg = fake_jpeg_bytes();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&jpeg),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
let new_key = format!("mangas/{id}/cover.jpg");
assert_eq!(body["cover_image_path"], new_key);
let new_file = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{new_key}")))
.await
.unwrap();
assert_eq!(new_file.status(), StatusCode::OK);
let old_file = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{old_key}")))
.await
.unwrap();
assert_eq!(old_file.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_unauthenticated(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Public Read", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_404_on_unknown_id(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let id = Uuid::new_v4();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_non_image_with_unsupported_media_type(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Not Image", None).await;
let id = id_of(&manga);
let pdf = b"%PDF-1.4\n%\xc4\xe5".to_vec();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&pdf),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNSUPPORTED_MEDIA_TYPE);
let body = body_json(resp).await;
assert_eq!(body["error"]["code"], "unsupported_media_type");
}
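The cover tests upload everything as `application/octet-stream` yet expect `cover.png` or `cover.jpg` keys and a 415 for a PDF, which suggests the format is derived from the bytes rather than the declared content type. A minimal sketch of that idea follows. `sniff_cover_extension` and `cover_key` are hypothetical names, and only the two formats these tests exercise are covered; the real handler may accept more.

```rust
// Detects the image format from its leading magic bytes. Returns the
// file extension to use, or None when the payload is not recognised
// (the case a handler would map to 415 Unsupported Media Type).
fn sniff_cover_extension(bytes: &[u8]) -> Option<&'static str> {
    const PNG: &[u8] = &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG: &[u8] = &[0xFF, 0xD8, 0xFF];
    if bytes.starts_with(PNG) {
        Some("png")
    } else if bytes.starts_with(JPEG) {
        Some("jpg")
    } else {
        None
    }
}

// Builds the storage key in the shape the tests assert on:
// mangas/{id}/cover.{ext}
fn cover_key(manga_id: &str, ext: &str) -> String {
    format!("mangas/{manga_id}/cover.{ext}")
}
```

Because the key embeds the extension, replacing a PNG with a JPEG yields a different key, which is why the different-extension test also checks that the old blob is gone.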
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_oversized(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Too Big", None).await;
let id = id_of(&manga);
// Harness max_file_bytes is 256 KiB; 300 KiB trips the cap.
let mut bytes = fake_png_bytes();
bytes.resize(300 * 1024, 0);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&bytes),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::PAYLOAD_TOO_LARGE);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_missing_cover_part(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Empty Form", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
MultipartBuilder::new(),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_preserves_other_metadata(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Keep My Fields",
None,
)
.await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert_eq!(body["title"], "Keep My Fields");
assert_eq!(body["status"], "ongoing");
assert_eq!(body["authors"], json!([]));
assert_eq!(body["genres"], json!([]));
assert_eq!(body["tags"], json!([]));
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_clears_path_and_removes_blob(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let png = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Bye Cover",
Some(("image/png", &png)),
)
.await;
let id = id_of(&manga);
let key = format!("mangas/{id}/cover.png");
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert!(body["cover_image_path"].is_null());
assert_eq!(body["title"], "Bye Cover");
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_is_idempotent_when_no_cover_present(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Never Had One", None).await;
let id = id_of(&manga);
for _ in 0..2 {
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert!(body["cover_image_path"].is_null());
}
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_rejects_unauthenticated(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Locked", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(
axum::http::Request::builder()
.method("DELETE")
.uri(format!("/api/v1/mangas/{id}/cover"))
.body(axum::body::Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_404_on_unknown_id(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let id = Uuid::new_v4();
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Authz: PUT /mangas/:id/cover must be uploader-only.
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_forbidden_for_non_uploader(pool: PgPool) {
let h = harness(pool);
let (_, owner_cookie) = register_user(&h.app).await;
let (_, intruder_cookie) = register_user(&h.app).await;
let manga =
create_manga_with_cover(&h.app, &owner_cookie, "Mine", None).await;
let id = id_of(&manga);
let resp = h
.app
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
/// Authz: DELETE /mangas/:id/cover must be uploader-only.
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_forbidden_for_non_uploader(pool: PgPool) {
let h = harness(pool);
let (_, owner_cookie) = register_user(&h.app).await;
let (_, intruder_cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(
&h.app,
&owner_cookie,
"Mine",
Some(("image/jpeg", &fake_jpeg_bytes())),
)
.await;
let id = id_of(&manga);
let resp = h
.app
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}


@@ -566,3 +566,78 @@ async fn patch_requires_authentication(pool: PgPool) {
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
/// A signed-in user who didn't upload the manga must not be able to
/// PATCH it. Without the uploader-gate this returned 200 — see
/// REVIEW.md "manga PATCH / cover endpoints don't check ownership".
#[sqlx::test(migrations = "./migrations")]
async fn patch_forbidden_for_non_uploader(pool: PgPool) {
let h = common::harness(pool);
let (_, owner_cookie) = common::register_user(&h.app).await;
let (_, intruder_cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &owner_cookie, json!({ "title": "Mine" })).await;
let id = id_of(&created);
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
/// Owner can still edit their own manga (regression guard for the
/// authz fix).
#[sqlx::test(migrations = "./migrations")]
async fn patch_allowed_for_uploader(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &cookie, json!({ "title": "Owned" })).await;
let id = id_of(&created);
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
/// Legacy rows with `uploaded_by IS NULL` (created before migration
/// 0011) remain editable by any signed-in user. Without this carve-out
/// the historical-data note in 0011 would be broken.
#[sqlx::test(migrations = "./migrations")]
async fn patch_allowed_on_legacy_null_uploader(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &cookie, json!({ "title": "Legacy" })).await;
let id = id_of(&created);
// Simulate a row uploaded before the column existed: clear
// uploaded_by directly via SQL.
sqlx::query("UPDATE mangas SET uploaded_by = NULL WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let (_, other_cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&other_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
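
The three PATCH tests above pin one authorization rule: the uploader may edit, any other signed-in user gets 403, and legacy rows with `uploaded_by IS NULL` stay editable by anyone signed in. A minimal sketch of that rule as a pure function follows. The name `can_edit_manga` and the `u64` ids are hypothetical stand-ins chosen for illustration; the real handler, its types, and where the check lives are not shown in this diff.

```rust
// Illustrative sketch only. `can_edit_manga` and the `u64` ids are
// hypothetical; the diff does not show the real handler code.

/// Returns true when the signed-in user `user_id` may edit a manga
/// whose `uploaded_by` column holds `uploaded_by`.
pub fn can_edit_manga(uploaded_by: Option<u64>, user_id: u64) -> bool {
    match uploaded_by {
        // Legacy row created before the column existed: any signed-in
        // user may edit (the carve-out the third test pins).
        None => true,
        // Otherwise only the original uploader may edit.
        Some(owner) => owner == user_id,
    }
}
```

Under this sketch a handler would map `false` to `StatusCode::FORBIDDEN`, and the unauthenticated case would already have been rejected with 401 before the check runs.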


@@ -0,0 +1,189 @@
//! Site-wide auth gate (`PRIVATE_MODE=true`).
//!
//! With private mode on, every API path except a small allowlist
//! (`/health`, `/auth/config`, `/auth/login`, `/auth/logout`) requires
//! a valid session cookie or bearer token, and `/auth/register` is
//! force-blocked regardless of `ALLOW_SELF_REGISTER`. With private mode
//! off (the default), nothing changes; the `public_mode_*` tests pin
//! that behavior as a regression guard.
mod common;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use axum::http::StatusCode;
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_blocks_anonymous_manga_list(pool: PgPool) {
let h = common::harness_with_private_mode(pool);
let resp = h.app.oneshot(common::get("/api/v1/mangas")).await.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_blocks_anonymous_files(pool: PgPool) {
let h = common::harness_with_private_mode(pool);
// The path doesn't have to exist — the guard runs before routing,
// so the response is 401 (not 404). That's the property the test
// is pinning: nothing leaks via crafted URLs.
let resp = h
.app
.oneshot(common::get("/api/v1/files/anything.png"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_allows_session_cookie_read(pool: PgPool) {
// Register through a non-private harness sharing the same DB pool
// so the session row exists. Then exercise the gate using a fresh
// private-mode harness against the same DB.
let public = common::harness(pool.clone());
let (_, cookie) = common::register_user(&public.app).await;
let private = common::harness_with_private_mode(pool);
let resp = private
.app
.oneshot(common::get_with_cookie("/api/v1/mangas", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_allows_bearer_token_read(pool: PgPool) {
let public = common::harness(pool.clone());
let (_, cookie) = common::register_user(&public.app).await;
let resp = public
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/auth/tokens",
json!({ "name": "private-mode-bot" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
let bearer = body["bearer"].as_str().unwrap().to_string();
let private = common::harness_with_private_mode(pool);
let resp = private
.app
.oneshot(common::get_with_bearer("/api/v1/mangas", &bearer))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_allows_login_endpoint_anonymous(pool: PgPool) {
// Seed a user via the public harness so login has credentials to
// verify against.
let public = common::harness(pool.clone());
let _ = public
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "alice", "password": "hunter2hunter2" }),
))
.await
.unwrap();
let private = common::harness_with_private_mode(pool);
let resp = private
.app
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "alice", "password": "hunter2hunter2" }),
))
.await
.unwrap();
// Reaches the login handler and succeeds — *not* 401 from the
// gate. That's the property we're pinning.
assert_eq!(resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_allows_health_and_config_anonymous(pool: PgPool) {
let h = common::harness_with_private_mode(pool);
let r = h
.app
.clone()
.oneshot(common::get("/api/v1/health"))
.await
.unwrap();
assert_eq!(r.status(), StatusCode::OK);
let r = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(r.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn private_mode_blocks_register_even_when_self_register_enabled(pool: PgPool) {
// harness_with_private_mode keeps `allow_self_register=true` (the
// default) — private mode is supposed to force-block register
// regardless. That's what this test pins.
let h = common::harness_with_private_mode(pool);
let resp = h
.app
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "alice", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "forbidden");
}
#[sqlx::test(migrations = "./migrations")]
async fn auth_config_reports_private_mode_and_effective_self_register(pool: PgPool) {
let h = common::harness_with_private_mode(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["private_mode"], true);
// Effective value: `allow_self_register && !private_mode` is false
// here even though the raw `allow_self_register` is true.
assert_eq!(body["self_register_enabled"], false);
}
#[sqlx::test(migrations = "./migrations")]
async fn public_mode_does_not_gate_anonymous_reads(pool: PgPool) {
// Regression guard: with private_mode off (the default), the gate
// must be a no-op so existing public deployments stay public.
let h = common::harness(pool);
let resp = h.app.oneshot(common::get("/api/v1/mangas")).await.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn public_mode_reports_private_mode_false(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["private_mode"], false);
assert_eq!(body["self_register_enabled"], true);
}
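
Taken together, the tests in this file describe a small decision table for private mode. The sketch below restates that table as a pure function, assuming the `/api/v1` prefix seen in the test URLs. `GateDecision`, `gate`, and `effective_self_register` are hypothetical names, and the tests do not reveal whether the 403 for `/auth/register` comes from the gate itself or from the register handler, so folding it into one function is an assumption made for brevity.

```rust
// Illustrative sketch only. All names here are hypothetical; the real
// middleware and handler code are not part of this diff.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateDecision {
    Allow,
    Unauthorized, // maps to 401
    Forbidden,    // maps to 403
}

// Paths the module docs list as reachable without credentials.
const ALLOWLIST: [&str; 4] = [
    "/api/v1/health",
    "/api/v1/auth/config",
    "/api/v1/auth/login",
    "/api/v1/auth/logout",
];

const REGISTER_PATH: &str = "/api/v1/auth/register";

/// Decides what private mode does with a request. `authenticated`
/// means a valid session cookie or bearer token was presented.
pub fn gate(private_mode: bool, path: &str, authenticated: bool) -> GateDecision {
    if !private_mode {
        // Public mode: the gate is a no-op; handlers decide the rest.
        return GateDecision::Allow;
    }
    if path == REGISTER_PATH {
        // Force-blocked regardless of ALLOW_SELF_REGISTER.
        return GateDecision::Forbidden;
    }
    if authenticated || ALLOWLIST.iter().any(|allowed| *allowed == path) {
        GateDecision::Allow
    } else {
        GateDecision::Unauthorized
    }
}

/// The effective value reported by `/auth/config`, per the test comment:
/// `allow_self_register && !private_mode`.
pub fn effective_self_register(allow_self_register: bool, private_mode: bool) -> bool {
    allow_self_register && !private_mode
}
```

Because the decision depends only on the path string and not on whether a route exists, an unknown path such as `/api/v1/files/anything.png` yields 401 rather than 404, which is the property `private_mode_blocks_anonymous_files` pins.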


@@ -59,6 +59,31 @@ async fn reattach_same_tag_is_idempotent_and_returns_200(pool: PgPool) {
assert_eq!(second.status(), StatusCode::OK);
}
/// Tag names over 64 chars are rejected at the handler boundary. The
/// repo enforces the same cap, but doing it at the handler keeps the
/// envelope consistent with the other validation paths
/// (username, collection name, etc.).
#[sqlx::test(migrations = "./migrations")]
async fn attach_rejects_tag_name_over_64_chars(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let long_name: String = "x".repeat(65);
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/mangas/{manga_id}/tags"),
json!({ "name": long_name }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
}
#[sqlx::test(migrations = "./migrations")]
async fn tag_names_dedup_case_insensitively(pool: PgPool) {
let h = common::harness(pool);


@@ -139,13 +139,17 @@ async fn files_endpoint_streams_in_multiple_frames(pool: PgPool) {
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let chapter_id = common::body_json(resp).await["id"]
.as_str()
.unwrap()
.to_string();
// Fetch the page back via the streaming files endpoint.
let pages = h
.app
.clone()
.oneshot(common::get(&format!(
- "/api/v1/mangas/{manga_id}/chapters/1/pages"
+ "/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
)))
.await
.unwrap();
@@ -317,8 +321,12 @@ async fn create_chapter_rejects_renamed_non_image_page(pool: PgPool) {
assert_eq!(body["error"]["code"], "unsupported_media_type");
}
/// Multiple chapters can share the same number — different
/// scanlations, re-uploads, translator notes. As of migration 0013,
/// (manga_id, number) is not unique and each upload gets its own
/// chapter id.
#[sqlx::test(migrations = "./migrations")]
- async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
+ async fn create_chapter_allows_duplicate_numbers_as_separate_chapters(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
@@ -334,10 +342,27 @@ async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
};
let first = h.app.clone().oneshot(make()).await.unwrap();
assert_eq!(first.status(), StatusCode::CREATED);
- let second = h.app.oneshot(make()).await.unwrap();
- assert_eq!(second.status(), StatusCode::CONFLICT);
- let body = common::body_json(second).await;
- assert_eq!(body["error"]["code"], "conflict");
+ let first_id = common::body_json(first).await["id"].as_str().unwrap().to_string();
+ let second = h.app.clone().oneshot(make()).await.unwrap();
+ assert_eq!(second.status(), StatusCode::CREATED);
+ let second_id = common::body_json(second).await["id"].as_str().unwrap().to_string();
+ assert_ne!(first_id, second_id, "each upload gets a distinct chapter id");
+ // List endpoint surfaces both rows.
+ let resp = h
+ .app
+ .oneshot(common::get(&format!("/api/v1/mangas/{manga_id}/chapters")))
+ .await
+ .unwrap();
+ assert_eq!(resp.status(), StatusCode::OK);
+ let body = common::body_json(resp).await;
+ let items = body["items"].as_array().unwrap();
+ assert_eq!(items.len(), 2, "both Ch.1 uploads listed separately");
+ for item in items {
+ assert_eq!(item["number"], 1);
+ }
}
#[sqlx::test(migrations = "./migrations")]


@@ -15,6 +15,7 @@ use tempfile::TempDir;
use tower::ServiceExt;
use mangalord::app::{router, AppState};
use mangalord::auth::rate_limit::AuthRateLimiter;
use mangalord::config::{AuthConfig, UploadConfig};
use mangalord::storage::{LocalStorage, Storage, StorageError, StreamingFile};
@@ -49,20 +50,115 @@ fn harness_inner(
storage: Arc<dyn Storage>,
storage_dir: TempDir,
) -> Harness {
harness_with_auth_config(pool, storage, storage_dir, AuthConfig {
cookie_secure: false,
..AuthConfig::default()
})
}
fn harness_with_auth_config(
pool: PgPool,
storage: Arc<dyn Storage>,
storage_dir: TempDir,
auth: AuthConfig,
) -> Harness {
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState {
db: pool,
storage,
- auth: AuthConfig { cookie_secure: false, ..AuthConfig::default() },
+ auth,
upload: UploadConfig {
// Keep file caps small in tests so the size-cap path is cheap to
// exercise without generating tens of megabytes of test data.
max_request_bytes: 4 * 1024 * 1024,
max_file_bytes: 256 * 1024,
},
auth_limiter,
// Default harness has no crawler daemon wired up; admin resync
// handlers return 503 in this config. Tests that need a stub
// resync service swap it in via `harness_with_resync`.
resync: None,
};
Harness { app: router(state), _storage_dir: storage_dir }
}
/// Like [`harness`] but flips `ALLOW_SELF_REGISTER` off so the
/// register-disabled test exercises the 403 branch in
/// `api::auth::register`.
pub fn harness_with_self_register_disabled(pool: PgPool) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
allow_self_register: false,
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Like [`harness`] but flips `PRIVATE_MODE` on so the site-wide auth
/// gate is exercised. `allow_self_register` stays at its default `true`
/// to verify that private mode force-disables self-registration on top
/// of whatever `ALLOW_SELF_REGISTER` says.
pub fn harness_with_private_mode(pool: PgPool) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
private_mode: true,
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Like [`harness`] but configures a tight auth rate limit. Used by
/// the brute-force-rate-limiting test.
pub fn harness_with_auth_rate_limit(
pool: PgPool,
per_sec: u32,
burst: u32,
) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
rate_limit: mangalord::auth::rate_limit::RateLimitConfig { per_sec, burst },
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Like [`harness`] but slots a caller-supplied [`ResyncService`] stub
/// into `AppState.resync`. Used by the admin resync tests so the
/// endpoint path is exercised without standing up a real Chromium.
pub fn harness_with_resync(
pool: PgPool,
resync: Arc<dyn mangalord::crawler::resync::ResyncService>,
) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
..AuthConfig::default()
};
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState {
db: pool,
storage,
auth,
upload: UploadConfig {
max_request_bytes: 4 * 1024 * 1024,
max_file_bytes: 256 * 1024,
},
auth_limiter,
resync: Some(resync),
};
Harness {
app: router(state),
_storage_dir: storage_dir,
}
}
/// Wraps a real `Storage` and fails on the N-th `put` call so tests can
/// assert that handlers roll their DB writes back when storage errors
/// mid-upload. Reads and other operations delegate to `inner`.
@@ -336,6 +432,37 @@ pub fn post_multipart_with_cookie(
.unwrap()
}
pub fn put_multipart_with_cookie(
uri: &str,
builder: MultipartBuilder,
cookie: &str,
) -> Request<Body> {
let (boundary, body) = builder.finalize();
Request::builder()
.method("PUT")
.uri(uri)
.header(
header::CONTENT_TYPE,
format!("multipart/form-data; boundary={boundary}"),
)
.header(header::COOKIE, cookie)
.body(Body::from(body))
.unwrap()
}
pub fn put_multipart(uri: &str, builder: MultipartBuilder) -> Request<Body> {
let (boundary, body) = builder.finalize();
Request::builder()
.method("PUT")
.uri(uri)
.header(
header::CONTENT_TYPE,
format!("multipart/form-data; boundary={boundary}"),
)
.body(Body::from(body))
.unwrap()
}
/// Realistic PNG file header bytes — enough for `infer` to identify.
pub fn fake_png_bytes() -> Vec<u8> {
vec![0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0, 0, 0, 0]


@@ -10,6 +10,11 @@
//!
//! Override the cache location with `CRAWLER_CHROMIUM_DIR=/some/path` if
//! `$HOME/.cache/mangalord/chromium` isn't writable.
//!
//! Set `CRAWLER_CHROMIUM_BINARY=/usr/bin/chromium-headless-shell` (or
//! another system chromium path) to exercise the system-chromium
//! launch path instead of the fetcher download — this is the path the
//! Raspberry Pi deployment takes.
use mangalord::crawler::browser::{self, LaunchOptions};


@@ -0,0 +1,648 @@
//! Integration tests for the crawler daemon's cron + worker pool. The
//! daemon's full real path requires Chromium and a live source; here we
//! test the seam (MetadataPass / ChapterDispatcher traits) and the
//! cron/worker control-flow.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use mangalord::crawler::content::SyncOutcome;
use mangalord::crawler::daemon::{
self, test_support::CountingMetadataPass, ChapterDispatcher, DaemonConfig, MetadataPass,
CRON_LOCK_KEY,
};
use mangalord::crawler::jobs::{self, JobPayload};
use mangalord::crawler::pipeline;
use serde_json::json;
use sqlx::PgPool;
use tokio_util::sync::CancellationToken;
use uuid::Uuid;
fn far_future_daily_at() -> NaiveTime {
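// A fixed late-day time (23:59, evaluated in UTC by `make_cfg`) so the
// scheduler normally sleeps for the whole test. A run that starts within
// moments of 23:59 UTC could still fire the daily tick.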
NaiveTime::from_hms_opt(23, 59, 0).unwrap()
}
fn make_cfg(
metadata_pass: Option<Arc<dyn MetadataPass>>,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<std::sync::atomic::AtomicBool>,
workers: usize,
) -> DaemonConfig {
DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers: workers,
daily_at: far_future_daily_at(),
tz: Tz::UTC,
retention_days: 7,
session_expired,
extra_tasks: Vec::new(),
}
}
async fn enqueue_chapter_job(pool: &PgPool) -> Uuid {
let chapter_id = Uuid::new_v4();
let payload = JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
};
let res = jobs::enqueue(pool, &payload).await.unwrap();
match res {
jobs::EnqueueResult::Inserted(_) => chapter_id,
jobs::EnqueueResult::Skipped => unreachable!("fresh chapter_id"),
}
}
async fn count_state(pool: &PgPool, state: &str) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs WHERE state = $1")
.bind(state)
.fetch_one(pool)
.await
.unwrap()
}
struct AlwaysDoneDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for AlwaysDoneDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
Ok(SyncOutcome::Fetched { pages: 1 })
}
}
struct PanickingDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for PanickingDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
panic!("intentional dispatcher panic");
}
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_drain_jobs_through_dispatcher(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 2),
);
// Wait for the workers to drain all three jobs.
let dispatcher_seen = || dispatcher.seen.load(Ordering::Acquire);
for _ in 0..40 {
if dispatcher_seen() >= 3 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher_seen() >= 3,
"expected at least 3 dispatches, got {}",
dispatcher_seen()
);
handle.shutdown().await;
assert_eq!(count_state(&pool, "done").await, 3);
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_idle_while_session_expired(pool: PgPool) {
let id = enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(true));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), Arc::clone(&session_expired), 1),
);
// Wait long enough that a non-idled worker would have leased and ack'd.
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
dispatcher.seen.load(Ordering::Acquire),
0,
"dispatcher must not be invoked while session_expired flag is set"
);
assert_eq!(count_state(&pool, "pending").await, 1);
let _ = id;
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatcher_panic_is_contained_and_job_is_acked_failed(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(PanickingDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 1),
);
// Wait for the worker to handle both panicking jobs.
for _ in 0..40 {
if dispatcher.seen.load(Ordering::Acquire) >= 2 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher.seen.load(Ordering::Acquire) >= 2,
"worker must keep going after a panic — handled at least 2 jobs"
);
handle.shutdown().await;
// attempts=1 is below max=5, so the panicking jobs go back to pending
// with backoff and `last_error = "worker panicked"`.
let last_errors: Vec<String> = sqlx::query_scalar(
"SELECT last_error FROM crawler_jobs WHERE last_error IS NOT NULL",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(last_errors.len(), 2);
assert!(last_errors.iter().all(|e| e == "worker panicked"));
}
#[sqlx::test(migrations = "./migrations")]
async fn cron_skips_tick_when_advisory_lock_held(pool: PgPool) {
// With no last_metadata_tick_at row, the daemon does a catch-up tick
// immediately on spawn. We hold the advisory lock on a separate
// connection beforehand so the catch-up's pg_try_advisory_lock returns
// false and the tick must skip without invoking the metadata pass.
let mut lock_conn = pool.acquire().await.unwrap();
sqlx::query("SELECT pg_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
// daily_at far in the future so after the (skipped) catch-up the
// cron sleeps for the rest of the test rather than racing for the lock.
let cfg = make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
);
let handle = daemon::spawn(pool.clone(), cancel.clone(), cfg);
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
counter.count.load(Ordering::Acquire),
0,
"cron must skip the catch-up tick while the advisory lock is held"
);
sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
drop(lock_conn);
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn cron_catches_up_when_last_tick_is_stale(pool: PgPool) {
// Pre-seed last_metadata_tick_at well in the past so previous_fire(now)
// > last_tick is trivially true and the daemon catches up immediately.
sqlx::query(
"INSERT INTO crawler_state (key, value) VALUES ($1, $2)
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
)
.bind("last_metadata_tick_at")
.bind(json!({"at": "2020-01-01T00:00:00Z"}))
.execute(&pool)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
),
);
for _ in 0..40 {
if counter.count.load(Ordering::Acquire) >= 1 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
counter.count.load(Ordering::Acquire) >= 1,
"catch-up tick should have fired immediately"
);
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_skips_dropped_sources(pool: PgPool) {
// Setup: one manga with two chapters (page_count = 0). One has a
// non-dropped source; the other's source is dropped. A user bookmarks
// the manga. Expectation: only the non-dropped chapter is enqueued.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid = sqlx::query_scalar(
"INSERT INTO mangas (title) VALUES ($1) RETURNING id",
)
.bind("Berserk")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let c1: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
let c2: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 2, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
// c1: alive source. c2: dropped source.
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(c1)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, now())",
)
.bind("target")
.bind("ch2")
.bind(c2)
.bind("https://example.com/ch2")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 1, "only the non-dropped chapter enqueued");
assert_eq!(summary.skipped, 0);
let payloads: Vec<serde_json::Value> = sqlx::query_scalar(
"SELECT payload FROM crawler_jobs WHERE payload->>'kind' = 'sync_chapter_content'",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(payloads.len(), 1);
assert_eq!(
payloads[0]["chapter_id"].as_str().unwrap(),
c1.to_string()
);
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_skips_recently_dead_chapters(pool: PgPool) {
// Setup: a chapter whose last SyncChapterContent job died yesterday.
// The cron tick must not re-enqueue — without the quarantine, the
// chapter would spin: re-enqueue → max_attempts retries → dies again
// → re-enqueue next tick → forever.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid =
sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let chapter_id: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(chapter_id)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
// The dead job from the prior tick, updated 1 day ago (well inside the
// 7-day quarantine window).
sqlx::query(
"INSERT INTO crawler_jobs (payload, state, updated_at) \
VALUES ($1::jsonb, 'dead', now() - interval '1 day')",
)
.bind(serde_json::json!({
"kind": "sync_chapter_content",
"source_id": "target",
"chapter_id": chapter_id.to_string(),
"source_chapter_key": "ch1",
}))
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 0, "recently dead chapter is quarantined");
assert_eq!(summary.skipped, 0);
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_resumes_after_quarantine_expires(pool: PgPool) {
// Same setup as above but the dead job is 10 days old — past the
// 7-day quarantine. The chapter should be re-enqueued so a once-failed
// chapter eventually gets a second shot at success.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid =
sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let chapter_id: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(chapter_id)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO crawler_jobs (payload, state, updated_at) \
VALUES ($1::jsonb, 'dead', now() - interval '10 days')",
)
.bind(serde_json::json!({
"kind": "sync_chapter_content",
"source_id": "target",
"chapter_id": chapter_id.to_string(),
"source_chapter_key": "ch1",
}))
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(
summary.inserted, 1,
"dead chapter is re-enqueued after quarantine expires"
);
}
/// Helper: insert a chapter with the given `number` and a non-dropped
/// source row, returning the chapter id. Used by the ordering tests so
/// the setup boilerplate doesn't drown the assertion.
async fn insert_pending_chapter(
pool: &PgPool,
manga_id: Uuid,
number: i32,
source_chapter_key: &str,
) -> Uuid {
let chapter_id: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, $2, 0) RETURNING id",
)
.bind(manga_id)
.bind(number)
.fetch_one(pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind(source_chapter_key)
.bind(chapter_id)
.bind(format!("https://example.com/{source_chapter_key}"))
.execute(pool)
.await
.unwrap();
chapter_id
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_queues_chapters_in_ascending_number_order(pool: PgPool) {
// Insert chapters with `number` values 3, 1, 2 in that insertion
// order — so `created_at` order (the previous tiebreaker) does NOT
// match number order. After enqueue + lease, the worker should see
// chapters 1, 2, 3 in that sequence.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid = sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let c3 = insert_pending_chapter(&pool, manga_id, 3, "ch3").await;
let c1 = insert_pending_chapter(&pool, manga_id, 1, "ch1").await;
let c2 = insert_pending_chapter(&pool, manga_id, 2, "ch2").await;
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 3);
let leases = jobs::lease(&pool, None, 10, std::time::Duration::from_secs(60))
.await
.unwrap();
let leased_chapter_ids: Vec<Uuid> = leases
.iter()
.map(|l| match &l.payload {
JobPayload::SyncChapterContent { chapter_id, .. } => *chapter_id,
other => panic!("unexpected payload kind: {other:?}"),
})
.collect();
assert_eq!(
leased_chapter_ids,
vec![c1, c2, c3],
"chapters must be leased in ascending chapter-number order, not insertion order"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_pending_for_manga_queues_chapters_in_ascending_number_order(pool: PgPool) {
// Same scenario as above but exercising the bookmark-create hook path
// (`enqueue_pending_for_manga`) which has its own ORDER BY.
let manga_id: Uuid = sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let c3 = insert_pending_chapter(&pool, manga_id, 3, "ch3").await;
let c1 = insert_pending_chapter(&pool, manga_id, 1, "ch1").await;
let c2 = insert_pending_chapter(&pool, manga_id, 2, "ch2").await;
let summary = pipeline::enqueue_pending_for_manga(&pool, manga_id)
.await
.unwrap();
assert_eq!(summary.inserted, 3);
let leases = jobs::lease(&pool, None, 10, std::time::Duration::from_secs(60))
.await
.unwrap();
let leased_chapter_ids: Vec<Uuid> = leases
.iter()
.map(|l| match &l.payload {
JobPayload::SyncChapterContent { chapter_id, .. } => *chapter_id,
other => panic!("unexpected payload kind: {other:?}"),
})
.collect();
assert_eq!(leased_chapter_ids, vec![c1, c2, c3]);
}
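
The two ordering tests above pin down one property: chapters inserted as 3, 1, 2 must be leased as 1, 2, 3. The following is a minimal in-memory sketch of that property, not the project's implementation. `PendingChapter`, `inserted_seq`, and `enqueue_order` are hypothetical names used only for illustration; the real ordering lives in the `ORDER BY` of the enqueue queries, which this diff does not show. Using insertion order as the tiebreak for equal numbers is also an assumption.

```rust
/// Hypothetical stand-in for a pending chapter row (not a mangalord type).
#[derive(Debug, Clone, PartialEq)]
struct PendingChapter {
    /// Parsed chapter number.
    number: i32,
    /// Insertion sequence, standing in for `created_at` (assumption).
    inserted_seq: u32,
    /// Source chapter key, used here only to identify the row.
    key: &'static str,
}

/// Returns chapter keys in the order they would be enqueued:
/// ascending `number`, with insertion order as the assumed tiebreak.
fn enqueue_order(mut chapters: Vec<PendingChapter>) -> Vec<&'static str> {
    chapters.sort_by_key(|c| (c.number, c.inserted_seq));
    chapters.into_iter().map(|c| c.key).collect()
}
```

Feeding it rows inserted as 3, 1, 2 yields the keys for 1, 2, 3, which is the sequence both tests expect from `jobs::lease`.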

@@ -0,0 +1,635 @@
//! Integration tests for `crawler::jobs` queue operations.
//!
//! Uses `#[sqlx::test(migrations = "./migrations")]` which provisions a fresh
//! migrated DB per test. No browser, no axum router — these exercise the SQL
//! shape and dedup-index semantics directly against Postgres.
use std::time::Duration;
use mangalord::crawler::jobs::{
self, EnqueueResult, JobPayload, KIND_SYNC_CHAPTER_CONTENT,
};
use sqlx::PgPool;
use uuid::Uuid;
fn chapter_content_payload(chapter_id: Uuid) -> JobPayload {
JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
}
}
/// A non-`SyncChapterContent` payload, used to assert that only the
/// chapter-content kind is deduplicated by the partial index and that
/// `lease`'s kind filter correctly excludes other kinds.
fn sync_manga_payload(key: &str) -> JobPayload {
JobPayload::SyncManga {
source_id: "target".into(),
source_manga_key: key.into(),
}
}
async fn job_state(pool: &PgPool, id: Uuid) -> String {
sqlx::query_scalar::<_, String>("SELECT state FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_attempts(pool: &PgPool, id: Uuid) -> i32 {
sqlx::query_scalar::<_, i32>("SELECT attempts FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_count(pool: &PgPool) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs")
.fetch_one(pool)
.await
.unwrap()
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_inserts_pending_row_with_round_trip_payload(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let payload = chapter_content_payload(chapter_id);
let result = jobs::enqueue(&pool, &payload).await.unwrap();
let id = match result {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("expected Inserted on first enqueue"),
};
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let raw_payload: serde_json::Value =
sqlx::query_scalar("SELECT payload FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let decoded: JobPayload = serde_json::from_value(raw_payload).unwrap();
match decoded {
JobPayload::SyncChapterContent {
source_id,
chapter_id: c,
source_chapter_key,
} => {
assert_eq!(source_id, "target");
assert_eq!(c, chapter_id);
assert_eq!(source_chapter_key, format!("ch-{chapter_id}"));
}
_ => panic!("payload variant mismatch"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_chapter_content_while_pending_is_skipped(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(first, EnqueueResult::Inserted(_)));
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(second, EnqueueResult::Skipped));
assert_eq!(job_count(&pool).await, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_after_done_releases_dedup_slot(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first_id = match jobs::enqueue(&pool, &p).await.unwrap() {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("first enqueue should insert"),
};
// Move the first job out of (pending|running) so the partial index drops it.
sqlx::query("UPDATE crawler_jobs SET state = 'done' WHERE id = $1")
.bind(first_id)
.execute(&pool)
.await
.unwrap();
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(
matches!(second, EnqueueResult::Inserted(_)),
"after done the chapter_id slot is free again"
);
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn different_chapter_ids_can_coexist(pool: PgPool) {
let p1 = chapter_content_payload(Uuid::new_v4());
let p2 = chapter_content_payload(Uuid::new_v4());
assert!(matches!(
jobs::enqueue(&pool, &p1).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p2).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn non_chapter_content_payloads_are_never_deduped(pool: PgPool) {
let p = sync_manga_payload("foo");
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_marks_running_and_bumps_attempts_and_sets_leased_until(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => unreachable!(),
};
let leases = jobs::lease(&pool, None, 10, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases.len(), 1);
let lease = &leases[0];
assert_eq!(lease.id, id);
assert_eq!(lease.attempts, 1);
assert_eq!(job_state(&pool, id).await, "running");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let leased_until = leased_until.expect("leased_until set");
assert!(leased_until > chrono::Utc::now());
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_with_kind_filter_only_matches_that_kind(pool: PgPool) {
let manga_id = match jobs::enqueue(&pool, &sync_manga_payload("foo"))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let chapter_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(
&pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
10,
Duration::from_secs(60),
)
.await
.unwrap();
assert_eq!(leases.len(), 1, "only chapter content payload leases");
assert_eq!(leases[0].id, chapter_id);
// sync_manga is still pending
assert_eq!(job_state(&pool, manga_id).await, "pending");
}
#[sqlx::test(migrations = "./migrations")]
async fn concurrent_leases_under_skip_locked_return_disjoint_ids(pool: PgPool) {
// 4 pending jobs, two concurrent calls each asking for up to 2.
let mut ids = Vec::new();
for _ in 0..4 {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
ids.push(id);
}
let (a, b) = tokio::join!(
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
);
let a = a.unwrap();
let b = b.unwrap();
let mut seen: Vec<Uuid> = a.iter().chain(b.iter()).map(|l| l.id).collect();
seen.sort();
seen.dedup();
let count = a.len() + b.len();
assert_eq!(
seen.len(),
count,
"no id appears in both lease results (SKIP LOCKED)"
);
assert!(count >= 2, "at least one lease saw work");
assert!(count <= 4);
}
#[sqlx::test(migrations = "./migrations")]
async fn stale_running_lease_can_be_reclaimed(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let first = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(first.len(), 1);
// Pretend the worker crashed: rewind leased_until into the past.
sqlx::query("UPDATE crawler_jobs SET leased_until = now() - interval '1 minute' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let second = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(second.len(), 1, "stale running row was re-leased");
assert_eq!(second[0].id, id);
assert_eq!(second[0].attempts, 2, "attempts bumped again");
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_done_transitions_state_and_clears_lease(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_under_max_returns_to_pending_with_future_schedule(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "boom", lease.attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
let (scheduled_at, last_error): (chrono::DateTime<chrono::Utc>, Option<String>) =
sqlx::query_as("SELECT scheduled_at, last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(scheduled_at > chrono::Utc::now());
assert_eq!(last_error.as_deref(), Some("boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_at_max_marks_dead(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Force a single lease then mark "this was attempt N where N == max_attempts".
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "final boom", lease.max_attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "dead");
let last_error: Option<String> =
sqlx::query_scalar("SELECT last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(last_error.as_deref(), Some("final boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_done_no_ops_when_lease_was_stolen(pool: PgPool) {
// Worker A's lease expires, worker B re-leases the job (state stays
// 'running' but attempts++ and leased_until refreshed). A late
// ack_done from worker A must not clobber B's progress.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Worker A grabs the lease, but its lease expires immediately.
let _a_leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
sqlx::query("UPDATE crawler_jobs SET leased_until = now() - interval '1 minute' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
// Worker B re-leases the expired-but-still-running job.
let b_leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(b_leases.len(), 1);
assert_eq!(b_leases[0].attempts, 2, "re-lease bumps attempts");
// Worker A's late ack_done. A full guard would check `state = 'running'`
// plus a per-lease token, but ack_done only takes the job `id`, which A
// and B share, so today it cannot tell the two workers apart. This test
// therefore asserts nothing about the state between the two acks.
jobs::ack_done(&pool, id).await.unwrap();
// The strongest current guarantee is that an ack_done arriving once the
// state is already 'done' does not fire again, so the final state and
// attempt count asserted below hold whichever ack took effect.
// Finalize: worker B acks done.
jobs::ack_done(&pool, b_leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
assert_eq!(job_attempts(&pool, id).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_no_ops_when_state_is_not_running(pool: PgPool) {
// After a job transitions to 'done', a stale ack_failed (e.g. a
// worker that finished work and queued its ack but then handed off
// before the SQL ran) must not flip the state back to 'pending' or
// 'dead'. The `state = 'running'` predicate enforces this.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
// Late ack_failed arrives. Must be a no-op.
jobs::ack_failed(&pool, leases[0].id, "late", 1, 5)
.await
.unwrap();
assert_eq!(
job_state(&pool, id).await,
"done",
"late ack_failed must not resurrect a done job"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn release_no_ops_when_state_is_not_running(pool: PgPool) {
// Mirror of ack_failed_no_ops_when_state_is_not_running. release also
// decrements `attempts`, which would corrupt a re-leased job's
// attempt count if the guard were missing.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
let attempts_before = job_attempts(&pool, id).await;
// Late release arrives.
jobs::release(&pool, leases[0].id).await.unwrap();
assert_eq!(
job_state(&pool, id).await,
"done",
"late release must not flip a done job back to pending"
);
assert_eq!(
job_attempts(&pool, id).await,
attempts_before,
"late release must not decrement attempts of a non-running job"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn release_returns_to_pending_and_undoes_attempt_increment(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases[0].attempts, 1);
jobs::release(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_deletes_old_rows_keeps_fresh(pool: PgPool) {
// Two done rows: one old (updated_at 10 days ago), one fresh.
let old_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let fresh_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '10 days' WHERE id = $1")
.bind(old_id)
.execute(&pool)
.await
.unwrap();
sqlx::query("UPDATE crawler_jobs SET state='done' WHERE id = $1")
.bind(fresh_id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 7).await.unwrap();
assert_eq!(deleted, 1);
let remaining: Vec<Uuid> = sqlx::query_scalar("SELECT id FROM crawler_jobs ORDER BY id")
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(remaining, vec![fresh_id], "only fresh row remains");
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_ties_on_scheduled_at_break_by_created_at(pool: PgPool) {
// Locks in the tiebreaker that lets enqueue order survive the lease
// step: when many jobs share `scheduled_at` (the common cron-batch
// case), the worker must pick the earliest-inserted row, not whatever
// Postgres returns in heap order. The enqueue path inserts chapters
// in chapter-number order, so this tiebreaker is what makes "queue
// in rising order" observable at the dequeue side too.
let a = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let b = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let c = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Pin `scheduled_at` to a single literal instant (shared across all
// three rows — `now()` would yield a different microsecond per UPDATE
// and make scheduled_at the actual sort key). Reverse `created_at`
// against insertion order so heap order would give the wrong answer.
let shared_scheduled = chrono::Utc::now() - chrono::Duration::hours(1);
sqlx::query(
"UPDATE crawler_jobs \
SET scheduled_at = $2, \
created_at = $3 \
WHERE id = $1",
)
.bind(a)
.bind(shared_scheduled)
.bind(chrono::Utc::now() - chrono::Duration::seconds(10))
.execute(&pool)
.await
.unwrap();
sqlx::query(
"UPDATE crawler_jobs \
SET scheduled_at = $2, \
created_at = $3 \
WHERE id = $1",
)
.bind(b)
.bind(shared_scheduled)
.bind(chrono::Utc::now() - chrono::Duration::seconds(20))
.execute(&pool)
.await
.unwrap();
sqlx::query(
"UPDATE crawler_jobs \
SET scheduled_at = $2, \
created_at = $3 \
WHERE id = $1",
)
.bind(c)
.bind(shared_scheduled)
.bind(chrono::Utc::now() - chrono::Duration::seconds(30))
.execute(&pool)
.await
.unwrap();
let leases = jobs::lease(&pool, None, 10, Duration::from_secs(60))
.await
.unwrap();
let order: Vec<Uuid> = leases.iter().map(|l| l.id).collect();
assert_eq!(
order,
vec![c, b, a],
"lease must return jobs in created_at order when scheduled_at ties"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_zero_is_a_no_op(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '999 days' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 0).await.unwrap();
assert_eq!(deleted, 0);
assert_eq!(job_count(&pool).await, 1);
}
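
The ack and release tests above all rely on one rule: `ack_done`, `ack_failed`, and `release` only act on a job that is currently running. Below is a minimal in-memory sketch of that guard, under stated assumptions. `Queue`, `Job`, and `running_mut` are hypothetical names, not `mangalord` types. The real functions run SQL against `crawler_jobs`, and the real `ack_failed` receives `attempts` and `max_attempts` as arguments, whereas this model stores them on the job. Lease expiry, scheduling, and dedup are left out.

```rust
/// Job states, mirroring the strings the tests read from `crawler_jobs.state`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Pending,
    Running,
    Done,
    Dead,
}

/// Hypothetical in-memory job row (not the real `crawler_jobs` schema).
#[derive(Debug, Clone)]
struct Job {
    state: State,
    attempts: u32,
    max_attempts: u32,
}

/// Hypothetical in-memory queue; a job's id is its index in `jobs`.
#[derive(Default)]
struct Queue {
    jobs: Vec<Job>,
}

impl Queue {
    fn enqueue(&mut self, max_attempts: u32) -> usize {
        self.jobs.push(Job { state: State::Pending, attempts: 0, max_attempts });
        self.jobs.len() - 1
    }

    /// Leases the first pending job: marks it running and bumps `attempts`.
    fn lease(&mut self) -> Option<usize> {
        let id = self.jobs.iter().position(|j| j.state == State::Pending)?;
        let job = &mut self.jobs[id];
        job.state = State::Running;
        job.attempts += 1;
        Some(id)
    }

    /// Returns the job only while it is running. Every ack goes through this
    /// guard, so a late ack on a finished job changes nothing.
    fn running_mut(&mut self, id: usize) -> Option<&mut Job> {
        self.jobs.get_mut(id).filter(|j| j.state == State::Running)
    }

    /// Marks a running job done; returns false if the guard rejected the ack.
    fn ack_done(&mut self, id: usize) -> bool {
        self.running_mut(id).map(|j| j.state = State::Done).is_some()
    }

    /// Returns the job to pending while attempts remain, otherwise marks it dead.
    fn ack_failed(&mut self, id: usize) -> bool {
        self.running_mut(id)
            .map(|j| {
                j.state = if j.attempts >= j.max_attempts {
                    State::Dead
                } else {
                    State::Pending
                };
            })
            .is_some()
    }

    /// Hands the job back and undoes the attempt increment made by `lease`.
    fn release(&mut self, id: usize) -> bool {
        self.running_mut(id)
            .map(|j| {
                j.state = State::Pending;
                j.attempts -= 1;
            })
            .is_some()
    }
}
```

Each method returns `false` when the guard rejects the call, which is the in-memory analogue of an `UPDATE ... WHERE state = 'running'` that matches zero rows.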

@@ -0,0 +1,82 @@
//! Integration tests for the per-source recovery flag:
//! `mark_run_started` / `mark_run_completed` / `last_run_completed_cleanly`
//! round-trip via the `crawler_state` table.
//!
//! End-to-end pipeline behavior (a crashed run forcing a recovery sweep
//! on the next tick) requires a real `chromiumoxide::Browser` to drive
//! the walker, so that path is covered by `crawler_browser_smoke.rs`.
//! The pure stop-condition logic itself is unit-tested in
//! `crawler::pipeline::tests`.
use mangalord::repo::crawler;
use sqlx::PgPool;
#[sqlx::test(migrations = "./migrations")]
async fn defaults_to_clean_when_no_marker(pool: PgPool) {
// First-ever run semantics: absence of the key must NOT trigger a
// recovery walk on a virgin DB. Treat missing as "previous run
// completed cleanly" so the first tick can take the early-stop path.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(clean, "absent marker must read as clean");
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_run_started_flips_to_false(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(!clean, "after mark_run_started, flag must read false");
}
#[sqlx::test(migrations = "./migrations")]
async fn started_then_completed_round_trips_to_clean(pool: PgPool) {
// Steady-state: a run starts (flag → false) and exits cleanly
// (flag → true). The next tick should see "clean" and apply the
// normal stop condition.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
crawler::mark_run_completed(&pool, "target").await.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(
clean,
"after start → complete the flag must round-trip to clean"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn flag_is_per_source(pool: PgPool) {
// Two sources, only one is mid-run. The other must still report
// clean — the crawler_state key is namespaced by source_id.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::ensure_source(&pool, "other", "O", "https://y.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
assert!(
!crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap(),
"target is mid-run"
);
assert!(
crawler::last_run_completed_cleanly(&pool, "other")
.await
.unwrap(),
"other source is untouched and reads clean"
);
}
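
The four tests above describe a small per-source flag with one non-obvious default: a missing marker reads as clean. Here is a minimal in-memory sketch of those semantics. `CrawlerState` is a hypothetical stand-in for the `crawler_state` table, and the `HashMap` layout is an assumption. Only the three method names come from the tests; the real functions are async and take a `PgPool`.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `crawler_state` table.
#[derive(Default)]
struct CrawlerState {
    /// Maps source_id to its "last run completed cleanly" flag.
    clean: HashMap<String, bool>,
}

impl CrawlerState {
    /// A run is starting: until it completes, the source reads as not clean.
    fn mark_run_started(&mut self, source_id: &str) {
        self.clean.insert(source_id.to_string(), false);
    }

    /// The run exited normally: the source reads as clean again.
    fn mark_run_completed(&mut self, source_id: &str) {
        self.clean.insert(source_id.to_string(), true);
    }

    /// A missing marker reads as clean, so a first-ever run does not
    /// trigger a recovery sweep.
    fn last_run_completed_cleanly(&self, source_id: &str) -> bool {
        self.clean.get(source_id).copied().unwrap_or(true)
    }
}
```

Keying the map by `source_id` is what keeps one source's crashed run from forcing a recovery sweep on another.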

@@ -6,6 +6,7 @@
use mangalord::crawler::source::{SourceChapterRef, SourceManga};
use mangalord::repo::crawler::{self, ChapterDiff, UpsertStatus};
use mangalord::repo::chapter as chapter_repo;
use sqlx::PgPool;
use uuid::Uuid;
@@ -233,59 +234,360 @@ async fn sync_chapters_adds_new_refreshes_existing_and_drops_vanished(pool: PgPo
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_dropped_mangas_only_drops_unseen(pool: PgPool) {
async fn live_chapter_count_returns_zero_for_unknown_source_key(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
// Seed two mangas before "now" so a later run_started_at sees them as stale.
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/foo",
&sample_manga("foo", "Foo", "hf"),
)
.await
.unwrap();
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/bar",
&sample_manga("bar", "Bar", "hb"),
)
.await
.unwrap();
// Now mark a new "run" beginning. Re-upsert only `foo` — `bar`
// should be the one flagged dropped.
let run_started = chrono::Utc::now();
// Sleep briefly so the second upsert's NOW() > run_started_at.
tokio::time::sleep(std::time::Duration::from_millis(20)).await;
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/foo",
&sample_manga("foo", "Foo", "hf"),
)
.await
.unwrap();
let n = crawler::mark_dropped_mangas(&pool, "target", run_started)
// No manga_sources row yet → unknown key path. Must not error and
// must report zero so the partial-render guard accepts the
// "brand-new manga with no chapters" case as legitimate.
let n = crawler::live_chapter_count_for_source_manga(&pool, "target", "nobody")
.await
.unwrap();
assert_eq!(n, 1, "only bar should have been dropped");
assert_eq!(n, 0);
}
let foo_dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM manga_sources WHERE source_manga_key = 'foo'")
#[sqlx::test(migrations = "./migrations")]
async fn live_chapter_count_only_counts_live_sources(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let chapters = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "2".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/2".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
assert_eq!(
crawler::live_chapter_count_for_source_manga(&pool, "target", "foo")
.await
.unwrap(),
2
);
// Soft-drop one source row — count drops by one, the row stays.
sqlx::query(
"UPDATE chapter_sources SET dropped_at = NOW() WHERE source_chapter_key = '2'",
)
.execute(&pool)
.await
.unwrap();
assert_eq!(
crawler::live_chapter_count_for_source_manga(&pool, "target", "foo")
.await
.unwrap(),
1
);
}
/// Real-world sources publish multiple chapters at the same number
/// (different uploaders, translator notes, re-releases). After the
/// (manga_id, number) UNIQUE drop in 0013, each `SourceChapterRef`
/// becomes its own `chapters` row even when the parsed number matches
/// — chapter identity is now the chapter id, not the number.
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_keeps_duplicate_numbered_chapters_as_separate_rows(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Two distinct uploads of Ch.52 (different uploaders → different
// URLs/keys, same parsed number) plus a notice/hiatus row that
// parses to number=0 alongside a real chapter at number 1.
let chapters = vec![
SourceChapterRef {
source_chapter_key: "br_chapter-A".into(),
number: 52,
title: Some("Ch.52 : Official".into()),
url: "https://x.example/foo/A/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-B".into(),
number: 52,
title: Some("Ch.52 : Official (alt)".into()),
url: "https://x.example/foo/B/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-NOTICE".into(),
number: 0,
title: Some("hitaus.".into()),
url: "https://x.example/foo/notice/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-1".into(),
number: 1,
title: Some("Ch.1 : Official".into()),
url: "https://x.example/foo/1/pg-1/".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 4,
refreshed: 0,
dropped: 0
},
"every source ref yields a new chapter row"
);
let rows: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM chapters WHERE manga_id = $1")
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert!(foo_dropped.0.is_none(), "foo seen this run, must not be dropped");
let bar_dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM manga_sources WHERE source_manga_key = 'bar'")
.fetch_one(&pool)
.await
.unwrap();
assert!(bar_dropped.0.is_some());
assert_eq!(rows.0, 4, "4 distinct chapter rows even with duplicate numbers");
let ch52_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1 AND number = 52",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(ch52_count.0, 2, "both Ch.52 uploads survive as separate rows");
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_isolates_colliding_keys_across_mangas(pool: PgPool) {
// Two mangas, both with a chapter whose source_chapter_key is
// "chapter-1". Pre-migration-0017 the PK enforced (source_id,
// source_chapter_key) globally and the lookup didn't filter by
// manga_id, so the second manga's sync would adopt the first manga's
// chapter_id (silent attribution corruption). After 0017 each manga
// owns its own row.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m1 = sample_manga("foo", "Manga Foo", "hash-foo");
let m2 = sample_manga("bar", "Manga Bar", "hash-bar");
let up1 = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m1)
.await
.unwrap();
let up2 = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/bar", &m2)
.await
.unwrap();
assert_ne!(up1.manga_id, up2.manga_id);
let shared = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/chapter-1/".into(),
}];
let diff1 = crawler::sync_manga_chapters(&pool, "target", up1.manga_id, &shared)
.await
.unwrap();
assert_eq!(diff1.new, 1, "manga foo: chapter inserted fresh");
// Manga bar now syncs *the same key*. Under the old schema this would
// either fail on PK conflict or attribute the chapter to foo. Under
// the new schema bar gets its own chapter row.
let bar_chapters = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1 (bar)".into()),
url: "https://x.example/bar/chapter-1/".into(),
}];
let diff2 = crawler::sync_manga_chapters(&pool, "target", up2.manga_id, &bar_chapters)
.await
.unwrap();
assert_eq!(
diff2.new, 1,
"manga bar: same key resolved per-manga to a fresh row"
);
let foo_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1",
)
.bind(up1.manga_id)
.fetch_one(&pool)
.await
.unwrap();
let bar_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(foo_count.0, 1);
assert_eq!(bar_count.0, 1);
let bar_title: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
bar_title.0.as_deref(),
Some("Ch.1 (bar)"),
"bar's chapter has bar's title, not foo's"
);
// A subsequent re-sync of foo with the same key correctly refreshes
// foo's row, not bar's.
let foo_resync = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1 (foo updated)".into()),
url: "https://x.example/foo/chapter-1/".into(),
}];
let diff_refresh = crawler::sync_manga_chapters(&pool, "target", up1.manga_id, &foo_resync)
.await
.unwrap();
assert_eq!(diff_refresh.refreshed, 1);
assert_eq!(diff_refresh.new, 0);
let foo_title: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up1.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(foo_title.0.as_deref(), Some("Ch.1 (foo updated)"));
let bar_title_after: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
bar_title_after.0.as_deref(),
Some("Ch.1 (bar)"),
"bar's row is untouched by foo's refresh"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_serializes_concurrent_calls_for_same_manga(pool: PgPool) {
// Without the per-manga advisory lock, two concurrent calls would
// both read `seen_keys`, both run the drop UPDATE filtered on `NOT
// (key = ANY $3)`, and the later commit could soft-drop a chapter
// the earlier had just inserted. The lock makes the calls strictly
// sequential per-manga: whichever runs second sees the first one's
// committed chapters and treats their absence as a "dropped" signal
// only if the second list legitimately omits them.
//
// Concretely: pre-state [A]. Call X syncs [A, B]; call Y syncs
// [A, B, C]. Whatever the schedule, the final state must include
// *all three* chapters because neither call legitimately omits the
// other's contribution — both lists are supersets of each other's
// pre-existing rows.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let manga_id = up.manga_id;
// Pre-state: [A].
let pre = vec![SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
}];
crawler::sync_manga_chapters(&pool, "target", manga_id, &pre)
.await
.unwrap();
// Two concurrent calls. Call X adds B; call Y adds B + C. Both keep
// A. Their drop branches would otherwise race against each other.
let list_x = vec![
SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
},
SourceChapterRef {
source_chapter_key: "B".into(),
number: 2,
title: Some("Ch.B".into()),
url: "https://x.example/foo/B".into(),
},
];
let list_y = vec![
SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
},
SourceChapterRef {
source_chapter_key: "B".into(),
number: 2,
title: Some("Ch.B".into()),
url: "https://x.example/foo/B".into(),
},
SourceChapterRef {
source_chapter_key: "C".into(),
number: 3,
title: Some("Ch.C".into()),
url: "https://x.example/foo/C".into(),
},
];
let pool_x = pool.clone();
let pool_y = pool.clone();
let (rx, ry) = tokio::join!(
tokio::spawn(async move {
crawler::sync_manga_chapters(&pool_x, "target", manga_id, &list_x).await
}),
tokio::spawn(async move {
crawler::sync_manga_chapters(&pool_y, "target", manga_id, &list_y).await
}),
);
rx.unwrap().expect("call X");
ry.unwrap().expect("call Y");
// All three keys must survive with dropped_at NULL — the lock
// ensures the later call sees the earlier one's INSERTs and the
// drop UPDATE finds nothing to drop.
let alive: Vec<String> = sqlx::query_scalar(
"SELECT cs.source_chapter_key \
FROM chapter_sources cs \
JOIN chapters ch ON ch.id = cs.chapter_id \
WHERE ch.manga_id = $1 AND cs.dropped_at IS NULL \
ORDER BY cs.source_chapter_key",
)
.bind(manga_id)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(
alive,
vec!["A".to_string(), "B".to_string(), "C".to_string()],
"all chapters survive concurrent syncs that both contain them"
);
}
#[sqlx::test(migrations = "./migrations")]
@@ -364,6 +666,271 @@ async fn arbitrary_genres_from_source_get_inserted(pool: PgPool) {
assert_eq!(webtoons_count.0, 1, "case-insensitive lookup reuses the existing row");
}
/// User-attached tags (rows with non-NULL `added_by` in `manga_tags`)
/// must survive a crawler upsert. The crawler owns source-attached tags
/// (added_by IS NULL); user attachments are owned by the user who made
/// them and the recurring metadata pass must not delete them.
#[sqlx::test(migrations = "./migrations")]
async fn sync_tags_preserves_user_attached_tags(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// A real user attaches a personal tag.
let user = mangalord::repo::user::create(&pool, "alice", "phc-stub")
.await
.unwrap();
let outcome = mangalord::repo::tag::attach_to_manga(&pool, up.manga_id, "personal", user.id)
.await
.unwrap();
assert!(outcome.created_attachment);
// Second crawler pass. Use a different metadata_hash so the upsert
// takes the Updated branch, but the bug also fires on Unchanged
// ticks since sync_tags runs unconditionally.
let mut m2 = m.clone();
m2.metadata_hash = "hash-2".into();
m2.tags = vec!["popular".into(), "weekly".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m2)
.await
.unwrap();
// The user tag must still be attached.
let user_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by = $2",
)
.bind(up.manga_id)
.bind(user.id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
user_tag_rows.0, 1,
"user-attached tag must survive a crawler upsert"
);
// The source's tags should still attach as well, as crawler-owned.
let source_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 \
AND mt.added_by IS NULL \
AND lower(t.name) IN ('popular', 'weekly')",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(source_tag_rows.0, 2, "source tags re-attach on each pass");
// A subsequent pass where the source drops a previously-seen tag
// must clear that crawler-owned attachment (otherwise crawler-tags
// would only ever accumulate).
let mut m3 = m2.clone();
m3.metadata_hash = "hash-3".into();
m3.tags = vec!["popular".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m3)
.await
.unwrap();
let weekly_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'weekly'",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(weekly_rows.0, 0, "source-owned tag dropped by source goes away");
// And the user tag still survives that third pass.
let user_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by = $2",
)
.bind(up.manga_id)
.bind(user.id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(user_tag_rows.0, 1);
}
/// `manga_tags.added_by` is `ON DELETE SET NULL` on the user FK. When
/// the attaching user is deleted, their attachments become orphans
/// indistinguishable from crawler-owned rows — and the crawler should
/// reap them on the next pass. Pins the semantic so a future change
/// can't quietly leave orphan rows lying around.
#[sqlx::test(migrations = "./migrations")]
async fn sync_tags_garbage_collects_orphan_user_attachments(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// A user attaches "personal", then the user gets deleted. The
// attachment row stays (manga_tags.manga_id FK is CASCADE on
// mangas only; we never CASCADE-delete user attachments). The FK
// on added_by is `ON DELETE SET NULL`, so the row's owner column
// goes NULL — same shape as a crawler-owned row.
let user = mangalord::repo::user::create(&pool, "bob", "phc-stub")
.await
.unwrap();
let _ = mangalord::repo::tag::attach_to_manga(&pool, up.manga_id, "personal", user.id)
.await
.unwrap();
sqlx::query("DELETE FROM users WHERE id = $1")
.bind(user.id)
.execute(&pool)
.await
.unwrap();
// Sanity: the orphan still exists post-user-delete with added_by NULL.
let (orphan_rows,): (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by IS NULL",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(orphan_rows, 1);
// Next crawler pass — orphan should be reaped along with any
// other source-owned rows that aren't in the new tag list.
let mut m2 = m.clone();
m2.metadata_hash = "hash-2".into();
m2.tags = vec!["popular".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m2)
.await
.unwrap();
let (orphan_rows,): (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal'",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(orphan_rows, 0, "orphan user-attached tag should be reaped");
}
// ---- list_missing_covers ---------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn list_missing_covers_only_returns_rows_without_cover(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let with_cover = sample_manga("with", "With Cover", "h1");
let without_cover = sample_manga("without", "No Cover", "h2");
let wc = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/with", &with_cover)
.await
.unwrap();
let nc = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/without", &without_cover)
.await
.unwrap();
// Manually set a cover for `with` only.
sqlx::query("UPDATE mangas SET cover_image_path = 'mangas/x/cover.jpg' WHERE id = $1")
.bind(wc.manga_id)
.execute(&pool)
.await
.unwrap();
let entries = crawler::list_missing_covers(&pool, 50).await.unwrap();
assert_eq!(entries.len(), 1, "exactly the manga without a cover");
assert_eq!(entries[0].manga_id, nc.manga_id);
assert_eq!(entries[0].source_manga_key, "without");
assert_eq!(entries[0].source_url, "https://x.example/without");
}
#[sqlx::test(migrations = "./migrations")]
async fn list_missing_covers_skips_dropped_source_rows(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
sqlx::query("UPDATE manga_sources SET dropped_at = NOW() WHERE manga_id = $1")
.bind(up.manga_id)
.execute(&pool)
.await
.unwrap();
let entries = crawler::list_missing_covers(&pool, 50).await.unwrap();
assert!(
entries.is_empty(),
"dropped-source mangas must not be backfilled — no live source to fetch from"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn list_missing_covers_respects_limit(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
for i in 0..5 {
let key = format!("m{i}");
let url = format!("https://x.example/{key}");
let m = sample_manga(&key, &format!("M{i}"), &format!("h{i}"));
let _ = crawler::upsert_manga_from_source(&pool, "target", &url, &m)
.await
.unwrap();
}
let entries = crawler::list_missing_covers(&pool, 3).await.unwrap();
assert_eq!(entries.len(), 3, "limit caps the result set");
}
#[sqlx::test(migrations = "./migrations")]
async fn list_missing_covers_deduplicates_per_manga(pool: PgPool) {
// A manga surfaced by two sources should produce ONE backfill
// entry, not two — otherwise the per-tick cap could be eaten by
// duplicates and starve other mangas.
crawler::ensure_source(&pool, "src-a", "A", "https://a.example")
.await
.unwrap();
crawler::ensure_source(&pool, "src-b", "B", "https://b.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
let up = crawler::upsert_manga_from_source(&pool, "src-a", "https://a.example/foo", &m)
.await
.unwrap();
// Second source attaches to the SAME manga row.
sqlx::query(
"INSERT INTO manga_sources (source_id, source_manga_key, manga_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("src-b")
.bind("foo-on-b")
.bind(up.manga_id)
.bind("https://b.example/foo")
.execute(&pool)
.await
.unwrap();
let entries = crawler::list_missing_covers(&pool, 50).await.unwrap();
assert_eq!(entries.len(), 1, "DISTINCT ON (m.id) collapses duplicate source rows");
}
#[sqlx::test(migrations = "./migrations")]
async fn re_appearing_manga_clears_dropped_at(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
@@ -395,3 +962,261 @@ async fn re_appearing_manga_clears_dropped_at(pool: PgPool) {
assert!(dropped.0.is_none());
assert_eq!(dropped.1, up.manga_id);
}
// ---- source_index: site-order preservation ----
//
// The user-facing chapter list reverses the source-site order so that
// the oldest chapter appears first. The crawler records each row's DOM
// position in `chapters.source_index` (0 = first in source DOM = newest
// on this site) on every sync; the list query orders by source_index
// DESC NULLS LAST, falling through to number/created_at for rows with
// no source row (e.g. user uploads).
#[sqlx::test(migrations = "./migrations")]
async fn source_index_set_on_insert_matches_dom_order(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let chapters = vec![
SourceChapterRef {
source_chapter_key: "a".into(),
number: 30,
title: Some("Ch.30".into()),
url: "https://x.example/foo/a".into(),
},
SourceChapterRef {
source_chapter_key: "b".into(),
number: 29,
title: Some("Ch.29".into()),
url: "https://x.example/foo/b".into(),
},
SourceChapterRef {
source_chapter_key: "c".into(),
number: 28,
title: Some("Ch.28".into()),
url: "https://x.example/foo/c".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
let rows: Vec<(String, Option<i32>)> = sqlx::query_as(
"SELECT cs.source_chapter_key, c.source_index \
FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE c.manga_id = $1 \
ORDER BY cs.source_chapter_key",
)
.bind(up.manga_id)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(
rows,
vec![
("a".to_string(), Some(0)),
("b".to_string(), Some(1)),
("c".to_string(), Some(2)),
],
"source_index reflects enumerate() position in the input slice",
);
}
#[sqlx::test(migrations = "./migrations")]
async fn source_index_rewritten_on_resync_when_new_chapter_prepended(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let first = vec![
SourceChapterRef {
source_chapter_key: "a".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/a".into(),
},
SourceChapterRef {
source_chapter_key: "b".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/b".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &first)
.await
.unwrap();
// Second sync: a brand-new chapter appears at the top of the source
// (newest first on the site). All existing rows must shift their
// source_index down by one so the display order stays correct.
let second = vec![
SourceChapterRef {
source_chapter_key: "new".into(),
number: 3,
title: Some("Ch.3".into()),
url: "https://x.example/foo/new".into(),
},
SourceChapterRef {
source_chapter_key: "a".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/a".into(),
},
SourceChapterRef {
source_chapter_key: "b".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/b".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &second)
.await
.unwrap();
let rows: Vec<(String, Option<i32>)> = sqlx::query_as(
"SELECT cs.source_chapter_key, c.source_index \
FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE c.manga_id = $1 \
ORDER BY cs.source_chapter_key",
)
.bind(up.manga_id)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(
rows,
vec![
("a".to_string(), Some(1)),
("b".to_string(), Some(2)),
("new".to_string(), Some(0)),
],
"new chapter takes index 0, existing rows shift down on UPDATE",
);
}
#[sqlx::test(migrations = "./migrations")]
async fn list_for_manga_returns_source_order_reversed(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Site DOM order (top-down = newest-first):
// ch11 (number = 11)
// notice (number = 0, non-numeric label on the site)
// ch10 (number = 10)
// Numbers deliberately disagree with DOM order: a number-based sort
// would put notice first, but the site places it between ch10 and
// ch11. Reversed-DOM display should yield [ch10, notice, ch11].
let chapters = vec![
SourceChapterRef {
source_chapter_key: "ch11".into(),
number: 11,
title: Some("Ch.11 : Official".into()),
url: "https://x.example/foo/11".into(),
},
SourceChapterRef {
source_chapter_key: "notice".into(),
number: 0,
title: Some("notice. : Officials".into()),
url: "https://x.example/foo/notice".into(),
},
SourceChapterRef {
source_chapter_key: "ch10".into(),
number: 10,
title: Some("Ch.10 : Official".into()),
url: "https://x.example/foo/10".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
let listed = chapter_repo::list_for_manga(&pool, up.manga_id, 50, 0)
.await
.unwrap();
let titles: Vec<String> = listed
.iter()
.map(|c| c.title.clone().unwrap_or_default())
.collect();
assert_eq!(
titles,
vec![
"Ch.10 : Official".to_string(),
"notice. : Officials".to_string(),
"Ch.11 : Official".to_string(),
],
"list returns chapters in reversed source-DOM order, so the \
oldest appears first and non-numeric entries land where the \
site placed them",
);
}
#[sqlx::test(migrations = "./migrations")]
async fn list_for_manga_places_null_source_index_last(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Crawled chapters get source_index 0 and 1; the upload path leaves
// it NULL. NULLS LAST plus the (number, created_at) tail means the
// upload sits after both crawled rows even though its number is in
// the middle.
let crawled = vec![
SourceChapterRef {
source_chapter_key: "a".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/a".into(),
},
SourceChapterRef {
source_chapter_key: "b".into(),
number: 3,
title: Some("Ch.3".into()),
url: "https://x.example/foo/b".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &crawled)
.await
.unwrap();
chapter_repo::create(&pool, up.manga_id, 2, Some("User upload Ch.2"), None)
.await
.unwrap();
let listed = chapter_repo::list_for_manga(&pool, up.manga_id, 50, 0)
.await
.unwrap();
let titles: Vec<String> = listed
.iter()
.map(|c| c.title.clone().unwrap_or_default())
.collect();
assert_eq!(
titles,
vec![
"Ch.3".to_string(),
"Ch.1".to_string(),
"User upload Ch.2".to_string(),
],
"crawled rows ordered by reversed source_index; user upload \
(NULL source_index) falls through to the end",
);
}
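
The ordering these tests pin down can be sketched in memory without a database. The snippet below is only an illustration of the `ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC` rule described above; `ChapterRow`, `display_order`, and `sorted_titles` are hypothetical names invented for the sketch, not part of the repository.

```rust
use std::cmp::Ordering;

/// Hypothetical in-memory stand-in for a `chapters` row. The field
/// names follow the columns mentioned in the tests; the struct itself
/// is an assumption made for this sketch.
#[derive(Debug, Clone)]
struct ChapterRow {
    title: String,
    number: i32,
    created_at: i64, // any increasing stamp, e.g. Unix seconds
    source_index: Option<i32>,
}

/// Comparator equivalent to
/// `ORDER BY source_index DESC NULLS LAST, number ASC, created_at ASC`.
fn display_order(a: &ChapterRow, b: &ChapterRow) -> Ordering {
    let by_index = match (a.source_index, b.source_index) {
        (Some(x), Some(y)) => y.cmp(&x),      // DESC: larger index first
        (Some(_), None) => Ordering::Less,    // non-NULL sorts before NULL
        (None, Some(_)) => Ordering::Greater, // NULLS LAST
        (None, None) => Ordering::Equal,
    };
    by_index
        .then(a.number.cmp(&b.number))
        .then(a.created_at.cmp(&b.created_at))
}

/// Sort rows into display order and return their titles.
fn sorted_titles(mut rows: Vec<ChapterRow>) -> Vec<String> {
    rows.sort_by(display_order);
    rows.into_iter().map(|r| r.title).collect()
}
```

Feeding it the rows from `list_for_manga_places_null_source_index_last` (two crawled rows plus one upload with `source_index: None`) should reproduce the order that test asserts, with the upload at the end.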


@@ -0,0 +1,194 @@
<table class="listing" id="chapter_table">
<tbody>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-379272/pg-1/"><b>Ch.67</b>
: Official </a>
<b style="color:#FEFD7F;width;30px;display:inline-block;margin-left:5px">new</b>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">May 20, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-328248/pg-1/"><b>hitaus.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 15, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-326351/pg-1/"><b>Ch.66</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 10, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-295078/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Aug 28, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-294815/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../4300634/upload/">mina</a>
</td>
<td class="no">Aug 27, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249964/pg-1/"><b>Ch.10</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 5, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-13/pg-1/"><b>Ch.13</b>
: Thank you, we'll see you in the next one! </a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 30, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249095/pg-1/"><b>Ch.9</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 28, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-248930/pg-1/"><b>Ch.1</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-12/pg-1/"><b>Ch.12</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 1, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-244844/pg-1/"><b>notice.</b>
: Officials </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Nov 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-11/pg-1/"><b>Ch.11</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Nov 18, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-221180/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../3781074/upload/">Izanami</a>
</td>
<td class="no">Jun 21, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-234803/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Sep 13, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-220299/pg-1/"><b>Ch.1</b>
: Team Hazama </a>
</h4>
</td>
<td class="no">
<a href=".../1457681/upload/">purplepandabear</a>
</td>
<td class="no">Jun 16, 2024</td>
</tr>
</tbody>
</table>
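
A rough sketch of how rows like the ones in this fixture could be turned into `(number, source_index)` pairs. It assumes the `<b>` labels have already been extracted in table order, and it assumes a parsing rule in which `Ch.<digits>` yields that number and any other label (such as `notice.` or `hitaus.`) falls back to 0. `IndexedChapter`, `parse_number`, and `index_in_dom_order` are hypothetical names; the real parser is not shown in this diff.

```rust
/// Hypothetical parsed row: the label text of one table row plus the
/// position it had in the source table (0 = first row).
#[derive(Debug, PartialEq)]
struct IndexedChapter {
    label: String,
    number: i32,
    source_index: i32,
}

/// Assumed rule for this sketch: "Ch.<digits>" parses to that number,
/// every other label falls back to 0.
fn parse_number(label: &str) -> i32 {
    label
        .strip_prefix("Ch.")
        .and_then(|rest| rest.trim().parse::<i32>().ok())
        .unwrap_or(0)
}

/// Assign `source_index` from the position in the input slice, so the
/// table order is kept even when numbers repeat or are missing.
fn index_in_dom_order(labels: &[&str]) -> Vec<IndexedChapter> {
    labels
        .iter()
        .enumerate()
        .map(|(i, label)| IndexedChapter {
            label: label.to_string(),
            number: parse_number(label),
            source_index: i as i32,
        })
        .collect()
}
```

With the first three rows of the fixture (`Ch.67`, `hitaus.`, `Ch.66`), the middle entry gets number 0 but keeps `source_index` 1, which is what lets the list query place it between its neighbours instead of at the top.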


@@ -0,0 +1,162 @@
//! Integration tests for `repo::chapter` — focused on
//! `dispatch_target`, the resolver the daemon's chapter dispatcher
//! uses to look up the URL it needs to hand to
//! `content::sync_chapter_content`.
//!
//! The query must:
//! 1. Skip `chapter_sources` rows where `dropped_at IS NOT NULL`;
//! otherwise a soft-dropped source URL is dispatched as if live and
//! burns the chapter's retry budget on errors that look transient but
//! are guaranteed to recur.
//! 2. Order the remaining rows by `last_seen_at DESC` so the freshest
//! surviving source is the one we'll fetch from.
//!
//! The fix lives in `backend/src/repo/chapter.rs:dispatch_target`. The
//! enqueue queries at `pipeline.rs:381` and `:435` already filter on
//! `cs.dropped_at IS NULL`; this brings the resolver into line.
use mangalord::crawler::source::{SourceChapterRef, SourceManga};
use mangalord::repo::{
chapter::dispatch_target,
crawler::{ensure_source, sync_manga_chapters, upsert_manga_from_source},
};
use sqlx::PgPool;
use uuid::Uuid;
fn sample_manga(key: &str, title: &str, hash: &str) -> SourceManga {
SourceManga {
source_manga_key: key.to_string(),
title: title.to_string(),
alternative_titles: vec![],
authors: vec![],
genres: vec![],
tags: vec![],
status: None,
summary: None,
cover_url: None,
chapters: vec![],
metadata_hash: hash.to_string(),
}
}
/// Seed a manga with one chapter, plus a second `chapter_sources` row
/// pointing at the same chapter with a *newer* `last_seen_at` so the
/// `ORDER BY cs.last_seen_at DESC` branch of the fixed query can
/// distinguish "freshest live source" from "any live source."
async fn seed_chapter_with_two_live_sources(pool: &PgPool) -> (Uuid, String, String) {
// Two distinct sources both pointing at the same chapter is the
// realistic shape of the multi-source state — each source row is
// keyed (source_id, chapter_id) after migration 0017.
ensure_source(pool, "target", "T", "https://x.example")
.await
.unwrap();
ensure_source(pool, "mirror", "Mirror", "https://m.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = upsert_manga_from_source(pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let initial = vec![SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1/old".into(),
}];
sync_manga_chapters(pool, "target", up.manga_id, &initial)
.await
.unwrap();
let (chapter_id,): (Uuid,) = sqlx::query_as(
"SELECT c.id FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE cs.source_chapter_key = '1' AND cs.source_id = 'target'",
)
.fetch_one(pool)
.await
.unwrap();
let old_url = "https://x.example/foo/1/old".to_string();
let new_url = "https://m.example/foo/1/mirror".to_string();
// Backdate the existing (old/target) source row and add a fresher
// row from the mirror source. The fix uses `last_seen_at DESC` to
// break the tie deterministically.
sqlx::query(
"UPDATE chapter_sources \
SET last_seen_at = NOW() - INTERVAL '2 days' \
WHERE chapter_id = $1 AND source_id = 'target'",
)
.bind(chapter_id)
.execute(pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources \
(source_id, chapter_id, source_chapter_key, source_url, last_seen_at) \
VALUES ('mirror', $1, '1', $2, NOW())",
)
.bind(chapter_id)
.bind(&new_url)
.execute(pool)
.await
.unwrap();
(chapter_id, old_url, new_url)
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_prefers_most_recent_live_source(pool: PgPool) {
let (chapter_id, _old_url, new_url) =
seed_chapter_with_two_live_sources(&pool).await;
let row = dispatch_target(&pool, chapter_id).await.unwrap();
let (_manga_id, source_url) =
row.expect("two live sources should yield a dispatch target");
assert_eq!(
source_url, new_url,
"ORDER BY last_seen_at DESC LIMIT 1 must return the freshest source"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_skips_dropped_sources(pool: PgPool) {
let (chapter_id, _old_url, new_url) =
seed_chapter_with_two_live_sources(&pool).await;
// Soft-drop the fresher row. The dispatcher must now return the
// *older* still-live row instead of the dropped one.
sqlx::query(
"UPDATE chapter_sources SET dropped_at = NOW() WHERE source_url = $1",
)
.bind(&new_url)
.execute(&pool)
.await
.unwrap();
let row = dispatch_target(&pool, chapter_id).await.unwrap();
let (_manga_id, source_url) =
row.expect("a single live source should still yield a dispatch target");
assert_ne!(
source_url, new_url,
"dispatch_target must not return a dropped source"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_returns_none_when_only_dropped_sources_remain(
pool: PgPool,
) {
let (chapter_id, _old_url, _new_url) =
seed_chapter_with_two_live_sources(&pool).await;
sqlx::query("UPDATE chapter_sources SET dropped_at = NOW() WHERE chapter_id = $1")
.bind(chapter_id)
.execute(&pool)
.await
.unwrap();
let row = dispatch_target(&pool, chapter_id).await.unwrap();
assert!(
row.is_none(),
"every source is dropped — dispatch_target must return None"
);
}
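
The selection rule these three tests describe (skip dropped rows, then take the freshest `last_seen_at`) can be sketched in memory as follows. This is an illustration only: `SourceRow` and `pick_dispatch_url` are hypothetical names, and the real `dispatch_target` runs the equivalent logic as a SQL query.

```rust
/// Hypothetical in-memory view of a `chapter_sources` row.
#[derive(Debug, Clone)]
struct SourceRow {
    source_url: String,
    last_seen_at: i64,       // e.g. Unix seconds
    dropped_at: Option<i64>, // None means the source is still live
}

/// Pick the freshest live source, mirroring
/// `WHERE dropped_at IS NULL ORDER BY last_seen_at DESC LIMIT 1`.
fn pick_dispatch_url(rows: &[SourceRow]) -> Option<&str> {
    rows.iter()
        .filter(|r| r.dropped_at.is_none())
        .max_by_key(|r| r.last_seen_at)
        .map(|r| r.source_url.as_str())
}
```

Returning `Option` keeps the "every source is dropped" case explicit, which matches the `None` result the last test expects.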


@@ -17,5 +17,28 @@ services:
timeout: 5s
retries: 10
# Optional: TOR daemon for crawler dev. Ports bind to 127.0.0.1 only
# — never the LAN — so a native `cargo run` on the host can reach
# 127.0.0.1:9050 / 9051. Mirrors the prod tor service (see
# docker-compose.yml), just with host-loopback ports and a default
# password baked in for friction-free dev.
tor:
image: dockurr/tor:latest
entrypoint: ["/bin/sh", "/usr/local/bin/mangalord-entrypoint.sh"]
environment:
PASSWORD: ${TOR_CONTROL_PASSWORD:-dev-tor-password}
volumes:
- ./tor/torrc:/etc/tor/torrc:ro
- ./tor/entrypoint.sh:/usr/local/bin/mangalord-entrypoint.sh:ro
ports:
- "127.0.0.1:9050:9050"
- "127.0.0.1:9051:9051"
healthcheck:
test: ["CMD-SHELL", "nc -z 127.0.0.1 9050 && nc -z 127.0.0.1 9051"]
interval: 5s
timeout: 5s
retries: 20
start_period: 30s
volumes:
mangalord-postgres-dev:

docker-compose.prod.yml Normal file

@@ -0,0 +1,22 @@
# Production overlay: layer on top of docker-compose.yml on the deploy
# host so the backend and frontend run from pre-built registry images
# instead of building locally.
#
# docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
#
# REGISTRY_URL and IMAGE_TAG are injected by .gitea/workflows/deploy.yml
# at deploy time. IMAGE_TAG defaults to `latest` so a manual
# `docker compose ... up -d` on the host still works.
services:
backend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-backend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped
frontend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-frontend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped


@@ -1,9 +1,15 @@
# Production-like compose. Requires a populated `.env` next to this
# file: at minimum POSTGRES_PASSWORD and TOR_CONTROL_PASSWORD must be
# set (the `${VAR:?message}` form below fails fast when a variable is
# unset or empty). The frontend container expects HTTPS in front
# (Caddy/Traefik/nginx) because, with COOKIE_SECURE=true, browsers
# will refuse to send the session cookie over plain HTTP.
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: ${POSTGRES_USER:-mangalord}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-mangalord}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}
POSTGRES_DB: ${POSTGRES_DB:-mangalord}
volumes:
- postgres-data:/var/lib/postgresql/data
@@ -13,13 +19,50 @@ services:
timeout: 5s
retries: 10
tor:
# SOCKS5 proxy for the crawler, plus a control port so the backend
# can signal NEWNYM on bad pages. See tor/torrc for the daemon
# config; both ports are only `expose`d (compose-internal), never
# bound on the host.
#
# We bypass dockurr/tor's stock entrypoint because it binds the
# control port to localhost (unreachable from the backend
# container) and skips its own HashedControlPassword injection
# when the user's torrc declares a ControlPort. Our wrapper
# (tor/entrypoint.sh) generates the hash from $PASSWORD and execs
# tor with our torrc. Backend authenticates with the same plain
# string via CRAWLER_TOR_CONTROL_PASSWORD.
image: dockurr/tor:latest
entrypoint: ["/bin/sh", "/usr/local/bin/mangalord-entrypoint.sh"]
environment:
PASSWORD: ${TOR_CONTROL_PASSWORD:?TOR_CONTROL_PASSWORD must be set in .env}
volumes:
- ./tor/torrc:/etc/tor/torrc:ro
- ./tor/entrypoint.sh:/usr/local/bin/mangalord-entrypoint.sh:ro
expose:
- "9050"
- "9051"
# Wait for both control + SOCKS ports to listen before downstream
# services start. dockurr/tor's main process spawns before tor
# itself is bound, so `service_started` alone races the first
# NEWNYM call.
healthcheck:
test: ["CMD-SHELL", "nc -z 127.0.0.1 9050 && nc -z 127.0.0.1 9051"]
interval: 5s
timeout: 5s
retries: 20
start_period: 30s
restart: unless-stopped
backend:
build: ./backend
depends_on:
postgres:
condition: service_healthy
tor:
condition: service_healthy
environment:
DATABASE_URL: postgres://${POSTGRES_USER:-mangalord}:${POSTGRES_PASSWORD:-mangalord}@postgres:5432/${POSTGRES_DB:-mangalord}
DATABASE_URL: postgres://${POSTGRES_USER:-mangalord}:${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}@postgres:5432/${POSTGRES_DB:-mangalord}
BIND_ADDRESS: 0.0.0.0:8080
STORAGE_DIR: /var/lib/mangalord/storage
RUST_LOG: ${RUST_LOG:-info,mangalord=debug}
@@ -33,6 +76,21 @@ services:
# Upload limits.
MAX_REQUEST_BYTES: ${MAX_REQUEST_BYTES:-209715200}
MAX_FILE_BYTES: ${MAX_FILE_BYTES:-20971520}
# System-chromium override for the crawler. Leave blank to use the
# bundled fetcher; set to e.g. /usr/bin/chromium-headless-shell on
# arm64 deployments. Pair with `--build-arg INSTALL_CHROMIUM=true`
# so the image actually contains the binary.
CRAWLER_CHROMIUM_BINARY: ${CRAWLER_CHROMIUM_BINARY:-}
# Tor proxy + NEWNYM recircuit (see .env.example for details).
# Defaults assume the bundled `tor` service above; override
# CRAWLER_PROXY= and CRAWLER_TOR_CONTROL_URL= (both empty) in
# .env to disable. CRAWLER_TOR_CONTROL_PASSWORD MUST match the
# tor service's PASSWORD (both wired to the same TOR_CONTROL_PASSWORD
# .env var below).
CRAWLER_PROXY: ${CRAWLER_PROXY-socks5h://tor:9050}
CRAWLER_TOR_CONTROL_URL: ${CRAWLER_TOR_CONTROL_URL-tcp://tor:9051}
CRAWLER_TOR_CONTROL_PASSWORD: ${TOR_CONTROL_PASSWORD:?TOR_CONTROL_PASSWORD must be set in .env}
CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS: ${CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS:-3}
volumes:
- storage-data:/var/lib/mangalord/storage
# No host port mapping in the default setup — the frontend proxies
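
The tor service comments in the compose hunk above say the backend authenticates to the control port with the plain password and signals NEWNYM. The backend code is not part of this hunk, so the following is only a sketch of the two control-protocol lines such a client would send over `tcp://tor:9051`. The helper names (`quoteControlString`, `newnymCommands`, `isSuccessReply`) are hypothetical, and the socket handling is left out.

```typescript
// Hypothetical helpers: these only build and check control-protocol lines.
// Writing them to a TCP socket (tor:9051 in the compose file) is omitted.

// The control protocol accepts the password as a quoted string, so
// backslashes and double quotes have to be escaped.
function quoteControlString(raw: string): string {
  return `"${raw.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

// Lines to send, in order: authenticate first, then ask for new circuits.
function newnymCommands(password: string): string[] {
  return [
    `AUTHENTICATE ${quoteControlString(password)}\r\n`,
    "SIGNAL NEWNYM\r\n",
  ];
}

// Successful replies start with status 250 (for example "250 OK").
function isSuccessReply(line: string): boolean {
  return line.startsWith("250");
}
```

A caller would send each line, read one reply, and stop at the first reply for which `isSuccessReply` is false. Presumably `CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS` then bounds how often the crawler retries, but that logic is not visible here.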

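The backend environment block above relies on three Compose interpolation forms: `${VAR:-default}`, `${VAR-default}` and `${VAR:?message}`. The comment that an empty `CRAWLER_PROXY=` in `.env` disables the proxy only works because `${CRAWLER_PROXY-...}` omits the colon. A small sketch of those rules as documented for Compose follows; the `interpolate` helper and its `Form` type are made up for illustration and are not a Compose API.

```typescript
// Hypothetical model of Compose variable interpolation (not a real API).
type Env = Record<string, string | undefined>;

type Form =
  | { kind: "default-if-empty"; fallback: string } // ${VAR:-fallback}
  | { kind: "default-if-unset"; fallback: string } // ${VAR-fallback}
  | { kind: "required"; message: string };         // ${VAR:?message}

function interpolate(env: Env, name: string, form: Form): string {
  const value = env[name];
  if (form.kind === "required") {
    // ${VAR:?message}: unset or empty aborts with the message.
    if (value === undefined || value === "") {
      throw new Error(`${name}: ${form.message}`);
    }
    return value;
  }
  if (form.kind === "default-if-unset") {
    // ${VAR-fallback}: an explicitly empty value is kept as-is.
    return value === undefined ? form.fallback : value;
  }
  // ${VAR:-fallback}: unset and empty both fall back.
  return value === undefined || value === "" ? form.fallback : value;
}
```

With `CRAWLER_PROXY=` set to empty in `.env`, the `default-if-unset` form yields an empty string (proxy off), whereas the `default-if-empty` form would silently restore `socks5h://tor:9050`.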

@@ -1,7 +1,11 @@
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm install
# `npm ci` installs the locked versions exactly; `npm install` would
# silently rewrite package-lock.json mid-build. CI (.gitea/workflows)
# also uses `npm ci`, so this keeps the image build deterministic and
# matches what the test job validated.
RUN npm ci
COPY . .
RUN npm run build
@@ -10,8 +14,22 @@ WORKDIR /app
ENV NODE_ENV=production
ENV HOST=0.0.0.0
ENV PORT=3000
COPY --from=builder /app/build ./build
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
# node:22-alpine ships a `node` user (UID 1000); use it instead of
# running the SvelteKit server as root.
COPY --from=builder --chown=node:node /app/build ./build
COPY --from=builder --chown=node:node /app/node_modules ./node_modules
COPY --from=builder --chown=node:node /app/package.json ./
USER node
EXPOSE 3000
# Alpine's busybox `wget` is the canonical lightweight HTTP probe. Probe
# 127.0.0.1, not `localhost`: musl resolves `localhost` to IPv6 ::1 first,
# but the Node server binds IPv4 0.0.0.0 only, so a localhost probe gets
# "connection refused" and the container is wrongly marked unhealthy. Use a
# GET (`-O /dev/null`) since `node build` serves 200 on `/`.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget -q -O /dev/null http://127.0.0.1:3000/ || exit 1
CMD ["node", "build"]


@@ -0,0 +1,147 @@
import { test, expect, type Page } from '@playwright/test';
const userFixture = {
id: 'u1',
username: 'alice',
created_at: '2026-01-01T00:00:00Z'
};
const baseManga = {
id: 'm1',
title: 'Berserk',
status: 'ongoing',
alt_titles: ['Old Alt'],
description: 'Original description',
cover_image_path: null,
created_at: '2026-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
authors: [{ id: 'a1', name: 'Kentaro Miura' }],
genres: [],
tags: []
};
async function stubAuthenticatedAndGenres(page: Page) {
await page.route('**/api/v1/auth/me', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ user: userFixture })
})
);
await page.route('**/api/v1/genres', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify([
{ id: 'g-action', name: 'Action' },
{ id: 'g-fantasy', name: 'Fantasy' }
])
})
);
}
test('anonymous user sees sign-in prompt on /manga/[id]/edit', async ({ page }) => {
await page.route('**/api/v1/auth/me', (route) =>
route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({
error: { code: 'unauthenticated', message: 'unauthenticated' }
})
})
);
await page.route('**/api/v1/genres', (route) =>
route.fulfill({ status: 200, contentType: 'application/json', body: '[]' })
);
await page.route('**/api/v1/mangas/m1', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(baseManga)
})
);
await page.goto('/manga/m1/edit');
await expect(page.getByTestId('edit-signin')).toBeVisible();
});
test('/manga/[id]/edit PATCHes the changed metadata and lands on the manga page', async ({
page
}) => {
await stubAuthenticatedAndGenres(page);
let patchBody: Record<string, unknown> | null = null;
let mangaAfter = { ...baseManga };
await page.route('**/api/v1/mangas/m1', async (route) => {
const method = route.request().method();
if (method === 'GET') {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(mangaAfter)
});
} else if (method === 'PATCH') {
patchBody = JSON.parse(route.request().postData() ?? '{}');
mangaAfter = {
...mangaAfter,
title: (patchBody.title as string) ?? mangaAfter.title,
description:
'description' in (patchBody as Record<string, unknown>)
? ((patchBody.description as string | null) ?? null)
: mangaAfter.description
};
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(mangaAfter)
});
} else {
await route.fallback();
}
});
await page.route('**/api/v1/mangas/m1/chapters*', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items: [],
page: { limit: 50, offset: 0, total: 0 }
})
})
);
await page.route('**/api/v1/me/bookmarks*', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items: [],
page: { limit: 50, offset: 0, total: 0 }
})
})
);
await page.route('**/api/v1/me/read-progress/m1', (route) =>
route.fulfill({
status: 404,
contentType: 'application/json',
body: JSON.stringify({
error: { code: 'not_found', message: 'no progress' }
})
})
);
await page.goto('/manga/m1');
// Edit link is gated on session.user — it should be visible to the
// stubbed authenticated user.
await page.getByTestId('edit-manga-link').click();
await expect(page).toHaveURL(/\/manga\/m1\/edit$/);
const titleInput = page.getByTestId('manga-title');
await expect(titleInput).toHaveValue('Berserk');
await titleInput.fill('Berserk (Deluxe)');
await page.getByTestId('manga-edit-submit').click();
await expect(page).toHaveURL(/\/manga\/m1$/);
await expect(page.getByTestId('manga-title')).toHaveText('Berserk (Deluxe)');
expect(patchBody).not.toBeNull();
expect((patchBody as Record<string, unknown>).title).toBe('Berserk (Deluxe)');
});
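
The PATCH stub in the test above treats a missing `description` key differently from an explicit `null`: a missing key keeps the old value, `null` clears it. That merge rule can be isolated as a small pure function. This is only a sketch; `MangaMeta`, `MangaPatch` and `applyPatch` are illustrative names, and the real backend handler is not shown in this diff.

```typescript
// Illustrative types; the real manga model has more fields.
interface MangaMeta {
  title: string;
  description: string | null;
}

type MangaPatch = {
  title?: string;
  description?: string | null;
};

// Merge a partial update into the current record.
// - title: replaced only when the patch provides one.
// - description: key absent means "keep", key present with null means "clear".
function applyPatch(current: MangaMeta, patch: MangaPatch): MangaMeta {
  return {
    title: patch.title ?? current.title,
    description:
      "description" in patch
        ? (patch.description ?? null)
        : current.description,
  };
}
```

The `in` check is what makes clearing possible: `patch.description ?? current.description` alone could never set the field to `null`.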


@@ -10,6 +10,15 @@ import { test, expect, type Page } from '@playwright/test';
const emptyPage = { items: [], page: { limit: 50, offset: 0, total: null } };
async function mockAnonymous(page: Page) {
// Force public mode so the root +layout.ts doesn't bounce us to /login
// (a dev backend with PRIVATE_MODE=true must not leak into E2E runs).
await page.route('**/api/v1/auth/config', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ self_register_enabled: true, private_mode: false })
});
});
await page.route('**/api/v1/auth/me', async (route) => {
await route.fulfill({
status: 401,
@@ -69,3 +78,53 @@ test('search updates the manga list', async ({ page }) => {
await expect(page.getByTestId('manga-list')).toContainText('Berserk');
expect(lastSearch).toBe('berserk');
});
test('clicking Next paginates to page 2 and updates the URL', async ({ page }) => {
await mockAnonymous(page);
// Fake a catalogue of 75 mangas; page 1 is ids 1..50, page 2 is ids 51..75.
const TOTAL = 75;
function mangaAt(i: number) {
return {
id: `m${i}`,
title: `Manga ${i}`,
author: 'Test',
description: null,
cover_image_path: null,
created_at: '2026-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
authors: [],
genres: []
};
}
await page.route('**/api/v1/mangas*', async (route) => {
const url = new URL(route.request().url());
const limit = Number(url.searchParams.get('limit') ?? '50');
const offset = Number(url.searchParams.get('offset') ?? '0');
const items: ReturnType<typeof mangaAt>[] = [];
for (let i = offset + 1; i <= Math.min(offset + limit, TOTAL); i++) {
items.push(mangaAt(i));
}
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items,
page: { limit, offset, total: TOTAL }
})
});
});
await page.goto('/');
await expect(page.getByTestId('manga-total')).toContainText('Showing 1–50 of 75');
await expect(page.getByTestId('manga-list')).toContainText('Manga 1');
await expect(page.getByTestId('manga-list')).not.toContainText('Manga 75');
await page.getByTestId('manga-pager').getByRole('button', { name: /next/i }).click();
await expect(page).toHaveURL(/[?&]page=2(&|$)/);
await expect(page.getByTestId('manga-total')).toContainText('Showing 51–75 of 75');
await expect(page.getByTestId('manga-list')).toContainText('Manga 75');
await expect(page.getByTestId('manga-list')).not.toContainText('Manga 1');
});
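
The pagination test above depends on simple window arithmetic: with 75 items and a limit of 50, page 1 covers items 1 to 50 and page 2 covers 51 to 75. A minimal sketch of that arithmetic follows. `pageWindow` and `offsetForPage` are hypothetical helpers, and how the frontend actually derives `offset` from `?page=` is an assumption.

```typescript
// 1-based window of items shown on a page.
interface PageWindow {
  first: number;
  last: number;
  total: number;
}

// Assumed mapping from a 1-based ?page= value to an API offset.
function offsetForPage(page: number, limit: number): number {
  return (page - 1) * limit;
}

// Compute which items a limit/offset pair covers, clamped to the total.
function pageWindow(limit: number, offset: number, total: number): PageWindow {
  const first = total === 0 ? 0 : offset + 1;
  const last = Math.min(offset + limit, total);
  return { first, last, total };
}
```

The same `Math.min(offset + limit, TOTAL)` clamp appears in the route stub's loop, so the mock and the expected label stay consistent on the short last page.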


@@ -0,0 +1,67 @@
import { test, expect, type Page } from '@playwright/test';
// Guards the title-on-nav behavior: without this, a stale title from
// the last manga / author page lingers when the user navigates to a
// generic page like /upload.
async function mockAnonymous(page: Page) {
await page.route('**/api/v1/auth/config', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ self_register_enabled: true, private_mode: false })
});
});
await page.route('**/api/v1/auth/me', async (route) => {
await route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({ error: { code: 'unauthenticated', message: 'unauthenticated' } })
});
});
await page.route('**/api/v1/mangas*', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ items: [], page: { limit: 50, offset: 0, total: 0 } })
});
});
}
test('static route titles use the brand-first layout map', async ({ page }) => {
await mockAnonymous(page);
await page.goto('/');
await expect(page).toHaveTitle('Mangalord');
await page.goto('/upload');
await expect(page).toHaveTitle('Mangalord | Upload');
await page.goto('/login');
await expect(page).toHaveTitle('Mangalord | Login');
await page.goto('/bookmarks');
await expect(page).toHaveTitle('Mangalord | Bookmarks');
await page.goto('/collections');
await expect(page).toHaveTitle('Mangalord | Collections');
});
test('title updates when navigating away from a content page', async ({ page }) => {
await mockAnonymous(page);
// Pretend we just left a manga detail page — the document title
// would have been overridden to "Mangalord | Berserk". Use evaluate
// to set it synthetically so we can assert the regression cleanly
// even though the dynamic page itself isn't mocked here.
await page.goto('/');
await page.evaluate(() => {
document.title = 'Mangalord | Berserk';
});
expect(await page.title()).toBe('Mangalord | Berserk');
// Navigate to /upload. Note that page.goto performs a full document
// load, not a client-side navigation; the root layout must still set
// its mapped title or the stale "Berserk" would linger.
await page.goto('/upload');
await expect(page).toHaveTitle('Mangalord | Upload');
});
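
The title tests above assert a brand-first scheme: the bare brand on `/` and `Mangalord | <Section>` on static routes. The layout code itself is not in this part of the diff, so the following is only a guess at the shape of such a map. `STATIC_TITLES` and `titleFor` are hypothetical names.

```typescript
const BRAND = "Mangalord";

// Hypothetical route-to-section map covering the routes the test visits.
const STATIC_TITLES: Record<string, string> = {
  "/upload": "Upload",
  "/login": "Login",
  "/bookmarks": "Bookmarks",
  "/collections": "Collections",
};

// Brand alone for unmapped routes (including "/"), brand-first otherwise.
function titleFor(pathname: string): string {
  const section = STATIC_TITLES[pathname];
  return section ? `${BRAND} | ${section}` : BRAND;
}
```

Recomputing the title from the pathname on every navigation, rather than only when a page sets one, is what would keep a stale `Mangalord | Berserk` from surviving a move to `/upload`.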


@@ -0,0 +1,101 @@
import { test, expect, type Page } from '@playwright/test';
// Network-level mocks for the private-mode UX. The backend integration
// tests (api_private_mode.rs) cover the actual gate; here we only
// verify that the SvelteKit universal load redirects anonymous
// visitors to /login and then back to where they were going.
const userFixture = {
id: 'user-1',
username: 'alice',
created_at: '2026-01-01T00:00:00Z',
is_admin: false
};
const emptyPage = { items: [], page: { limit: 50, offset: 0, total: null } };
async function stubPrivateInstance(page: Page) {
let loggedIn = false;
// The flag that flips the gate on. Frontend reads it in
// `+layout.ts` to decide whether to redirect.
await page.route('**/api/v1/auth/config', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
self_register_enabled: false,
private_mode: true
})
});
});
await page.route('**/api/v1/auth/me', async (route) => {
if (loggedIn) {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ user: userFixture })
});
} else {
await route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({
error: { code: 'unauthenticated', message: 'unauthenticated' }
})
});
}
});
await page.route('**/api/v1/auth/login', async (route) => {
loggedIn = true;
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ user: userFixture })
});
});
// The real backend would 401 these too in private mode; we stub
// success so the post-login navigation can render the home page
// without an additional redirect cycle.
await page.route('**/api/v1/mangas*', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(emptyPage)
});
});
}
test('private mode: anonymous visit to / redirects to /login?next=%2F', async ({ page }) => {
await stubPrivateInstance(page);
await page.goto('/');
await expect(page).toHaveURL(/\/login\?next=%2F$/);
await expect(page.getByTestId('login-username')).toBeVisible();
});
test('private mode: register link is hidden', async ({ page }) => {
await stubPrivateInstance(page);
await page.goto('/login');
await expect(page.getByTestId('nav-login')).toBeVisible();
// self_register_enabled is the effective value (false in private
// mode regardless of ALLOW_SELF_REGISTER), so the navbar must
// never render the register affordance here.
await expect(page.getByTestId('nav-register')).toHaveCount(0);
});
test('private mode: after login the user lands back on the requested page', async ({ page }) => {
await stubPrivateInstance(page);
// Visit the root → bounced to /login with next= preserving the requested path.
await page.goto('/');
await expect(page).toHaveURL(/\/login\?next=%2F$/);
await page.getByTestId('login-username').fill('alice');
await page.getByTestId('login-password').fill('hunter2hunter2');
await page.getByTestId('login-submit').click();
// Authenticated → can now reach the home page without bouncing.
await expect(page.getByTestId('session-user')).toContainText('alice');
});
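
The private-mode tests above expect `/login?next=%2F` for a visit to `/`, which is what `encodeURIComponent` produces for the requested path. A small sketch of building that redirect target and reading `next` back after login follows. Both function names are hypothetical, and the same-origin check in `safeNext` is an added assumption, not something this diff shows.

```typescript
// Build the login URL that remembers where the visitor was heading.
function loginRedirectFor(pathname: string, search: string = ""): string {
  return `/login?next=${encodeURIComponent(pathname + search)}`;
}

// Assumed guard (not shown in the diff): only accept local absolute
// paths as a post-login target, otherwise fall back to the home page.
function safeNext(next: string | null): string {
  if (next !== null && next.startsWith("/") && !next.startsWith("//")) {
    return next;
  }
  return "/";
}
```

After a successful login the page would navigate to `safeNext(new URLSearchParams(location.search).get("next"))`, so a crafted `next=//other.example` cannot turn the login form into an open redirect.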


@@ -0,0 +1,167 @@
import { test, expect, type Page } from '@playwright/test';
const mangaId = '33333333-3333-3333-3333-333333333333';
const chapter1Id = 'c1111111-3333-3333-3333-333333333333';
const chapter2Id = 'c2222222-3333-3333-3333-333333333333';
const chapter3Id = 'c3333333-3333-3333-3333-333333333333';
const mangaFixture = {
id: mangaId,
title: 'Vinland Saga',
author: 'Makoto Yukimura',
description: null,
cover_image_path: null,
created_at: '2026-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z'
};
const chaptersFixture = [
{
id: chapter1Id,
manga_id: mangaId,
number: 1,
title: 'Somewhere, Not Here',
page_count: 1,
created_at: '2026-01-01T00:00:00Z'
},
{
id: chapter2Id,
manga_id: mangaId,
number: 2,
title: null,
page_count: 1,
created_at: '2026-01-02T00:00:00Z'
},
{
id: chapter3Id,
manga_id: mangaId,
number: 3,
title: 'Sword Dance',
page_count: 1,
created_at: '2026-01-03T00:00:00Z'
}
];
function pageFixture(chapterId: string) {
return [
{
id: `p1111111-${chapterId.slice(1, 8)}-3333-3333-333333333333`,
chapter_id: chapterId,
page_number: 1,
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png'
}
];
}
async function mockReaderApis(page: Page) {
// Force public mode so the layout doesn't bounce anonymous visitors
// to /login (the dev backend on this machine runs with
// PRIVATE_MODE=true, which the layout's universal load respects).
await page.route('**/api/v1/auth/config', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ self_register_enabled: true, private_mode: false })
})
);
await page.route('**/api/v1/auth/me', (route) =>
route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({ error: { code: 'unauthenticated', message: '' } })
})
);
await page.route('**/api/v1/auth/me/preferences', (route) =>
route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({ error: { code: 'unauthenticated', message: '' } })
})
);
await page.route('**/api/v1/me/bookmarks*', (route) =>
route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({ error: { code: 'unauthenticated', message: '' } })
})
);
await page.route(`**/api/v1/mangas/${mangaId}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(mangaFixture)
})
);
await page.route(new RegExp(`/api/v1/mangas/${mangaId}/chapters(\\?.*)?$`), (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items: chaptersFixture,
page: { limit: 200, offset: 0, total: chaptersFixture.length }
})
})
);
for (const c of chaptersFixture) {
await page.route(`**/api/v1/mangas/${mangaId}/chapters/${c.id}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(c)
})
);
await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${c.id}/pages`,
(route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pageFixture(c.id) })
})
);
}
const png = Buffer.from(
'89504e470d0a1a0a0000000d49484452000000010000000108060000001f15c4890000000d49444154789c63000100000005000158a3b62a0000000049454e44ae426082',
'hex'
);
await page.route('**/api/v1/files/**', (route) =>
route.fulfill({ status: 200, contentType: 'image/png', body: png })
);
}
test('reader chapter select lists every chapter with the manga-detail-style label', async ({
page
}) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/${chapter2Id}`);
const select = page.getByTestId('reader-chapter-select');
await expect(select).toBeVisible();
// The current chapter is preselected.
await expect(select).toHaveValue(chapter2Id);
// Each chapter rendered as "Ch. N — Title" (or "Ch. N" when title is null),
// in ascending number order — matching the prev/next sort.
const labels = await select.locator('option').allTextContents();
expect(labels.map((l) => l.trim())).toEqual([
'Ch. 1 — Somewhere, Not Here',
'Ch. 2',
'Ch. 3 — Sword Dance'
]);
});
test('choosing a chapter from the select navigates to that chapter', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/${chapter1Id}`);
await expect(page.getByTestId('reader-chapter-select')).toHaveValue(chapter1Id);
await page.getByTestId('reader-chapter-select').selectOption(chapter3Id);
await expect(page).toHaveURL(
new RegExp(`/manga/${mangaId}/chapter/${chapter3Id}$`)
);
await expect(page.getByTestId('reader-chapter-select')).toHaveValue(chapter3Id);
});


@@ -1,6 +1,7 @@
import { test, expect, type Page } from '@playwright/test';
const mangaId = '22222222-2222-2222-2222-222222222222';
const chapterId = 'c2222222-2222-2222-2222-222222222222';
const mangaFixture = {
id: mangaId,
title: 'Vagabond',
@@ -11,7 +12,7 @@ const mangaFixture = {
updated_at: '2026-01-01T00:00:00Z'
};
const chapterFixture = {
id: 'c1',
id: chapterId,
manga_id: mangaId,
number: 1,
title: null,
@@ -20,24 +21,24 @@ const chapterFixture = {
};
const pagesFixture = [
{
id: 'p1',
chapter_id: 'c1',
id: 'p1111111-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 1,
storage_key: 'mangas/m2/chapters/c1/pages/0001.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png'
},
{
id: 'p2',
chapter_id: 'c1',
id: 'p2222222-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 2,
storage_key: 'mangas/m2/chapters/c1/pages/0002.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
content_type: 'image/png'
},
{
id: 'p3',
chapter_id: 'c1',
id: 'p3333333-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 3,
storage_key: 'mangas/m2/chapters/c1/pages/0003.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
content_type: 'image/png'
}
];
@@ -92,19 +93,21 @@ async function mockReaderApis(page: Page) {
})
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) =>
await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(chapterFixture)
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
(route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
);
const png = Buffer.from(
'89504e470d0a1a0a0000000d49484452000000010000000108060000001f15c4890000000d49444154789c63000100000005000158a3b62a0000000049454e44ae426082',
@@ -131,7 +134,7 @@ test.beforeEach(async ({ context }) => {
test('switching to continuous mode stacks all pages and hides chevrons', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
// Default single-page mode is active.
await expect(page.getByTestId('reader-page')).toBeVisible();
@@ -149,7 +152,7 @@ test('switching to continuous mode stacks all pages and hides chevrons', async (
test('arrow keys do not paginate while in continuous mode', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await page.getByTestId('reader-mode-continuous').click();
await expect(page.getByTestId('reader-continuous')).toBeVisible();
@@ -164,7 +167,7 @@ test('arrow keys do not paginate while in continuous mode', async ({ page }) =>
test('gap select updates the inline gap on the continuous container', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await page.getByTestId('reader-mode-continuous').click();
const container = page.getByTestId('reader-continuous');
@@ -192,7 +195,7 @@ test('reader-mode preference set on one page is honored when the reader opens',
});
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await expect(page.getByTestId('reader-continuous')).toBeVisible();
await expect(page.getByTestId('page-indicator')).toHaveText('3 pages');
await expect(page.getByTestId('reader-continuous')).toHaveAttribute(


@@ -1,6 +1,7 @@
import { test, expect, type Page } from '@playwright/test';
const mangaId = '11111111-1111-1111-1111-111111111111';
const chapterId = 'c1111111-1111-1111-1111-111111111111';
const mangaFixture = {
id: mangaId,
title: 'Berserk',
@@ -12,7 +13,7 @@ const mangaFixture = {
};
const chaptersFixture = [
{
id: 'c1',
id: chapterId,
manga_id: mangaId,
number: 1,
title: 'The Brand',
@@ -22,24 +23,24 @@ const chaptersFixture = [
];
const pagesFixture = [
{
id: 'p1',
chapter_id: 'c1',
id: 'p1111111-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 1,
storage_key: 'mangas/m1/chapters/c1/pages/0001.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png'
},
{
id: 'p2',
chapter_id: 'c1',
id: 'p2222222-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 2,
storage_key: 'mangas/m1/chapters/c1/pages/0002.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
content_type: 'image/png'
},
{
id: 'p3',
chapter_id: 'c1',
id: 'p3333333-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 3,
storage_key: 'mangas/m1/chapters/c1/pages/0003.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
content_type: 'image/png'
}
];
@@ -86,19 +87,21 @@ async function mockReaderApis(page: Page) {
})
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) =>
await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(chaptersFixture[0])
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
(route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
);
// Stub image bytes so the <img> doesn't 404 (1x1 transparent PNG).
const png = Buffer.from(
@@ -117,13 +120,13 @@ test('manga overview shows title, cover, and a chapter list', async ({ page }) =
await expect(page.getByTestId('manga-title')).toHaveText('Berserk');
await expect(page.getByTestId('manga-author')).toContainText('Kentaro Miura');
await expect(page.getByTestId('manga-cover')).toBeVisible();
await expect(page.getByTestId('chapter-list')).toContainText('Chapter 1');
await expect(page.getByTestId('chapter-list')).toContainText('The Brand');
await expect(page.getByTestId('bookmark-signin')).toBeVisible();
});
test('reader paginates with arrow keys and j/k, and preloads the next page', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
// Page 1 shown, preload for page 2 in the DOM.
await expect(page.getByTestId('page-indicator')).toHaveText('Page 1 / 3');


@@ -1,6 +1,6 @@
{
"name": "mangalord-frontend",
"version": "0.23.0",
"version": "0.52.0",
"private": true,
"type": "module",
"scripts": {

Some files were not shown because too many files have changed in this diff