Compare commits

..

53 Commits

Author SHA1 Message Date
074ab25f8c ci(test-backend): run on ubuntu-latest + rustup instead of rust:1-slim
All checks were successful
deploy / test-backend (pull_request) Successful in 18m36s
deploy / test-frontend (pull_request) Successful in 9m42s
deploy / build-and-push (pull_request) Has been skipped
deploy / deploy (pull_request) Has been skipped
act_runner runs JS actions (checkout/cache) with node inside the job
container; rust:1-slim ships no node, so every JS action failed with
exit 127 ("node: not found"). Drop the container, run on the
node-equipped ubuntu-latest image, install Rust via rustup. The postgres
service is still reachable by name (act_runner containerises the job).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 18:18:19 +02:00
6b49a47d0a feat(crawler): system Chromium via CRAWLER_CHROMIUM_BINARY (0.45.0) (#2)
Some checks failed
deploy / test-backend (push) Failing after 7s
deploy / test-frontend (push) Failing after 33s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
2026-05-31 15:47:47 +00:00
e851355f28 Merge pull request 'ci: no-SSH local deploy + Dockerfile build fixes' (#1) from fix/ci-deploy-pipeline into main
Some checks failed
deploy / test-backend (push) Failing after 7s
deploy / test-frontend (push) Failing after 30s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
2026-05-31 15:43:54 +00:00
2a0cc24c07 ci: deploy to the local stack over the runner socket, not SSH
Some checks failed
deploy / test-backend (pull_request) Failing after 1m6s
deploy / test-frontend (pull_request) Failing after 1m18s
deploy / build-and-push (pull_request) Has been skipped
deploy / deploy (pull_request) Has been skipped
The runner lives on the deploy host and shares its docker daemon, so the
deploy job runs `docker compose pull && up -d` against the central compose
via a bind-mounted compose dir (docker:cli + docker_host: "-") instead of
appleboy/ssh-action. Drops the SSH_* secrets and recreates only the two
mangalord services at the freshly built SHA. Requires /mnt/ssd/docker-data
in the runner's container.valid_volumes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:26:58 +02:00
a615b0aee7 fix(docker): unblock image builds on this host
- backend dep-cache stage stubs only main.rs/lib.rs, but Cargo.toml
  declares a second [[bin]] crawler at src/bin/crawler.rs, so
  `cargo build --locked` aborts ("can't find bin crawler"). Stub it too.
- runtime was debian:bookworm-slim (glibc 2.36) while rust:1-slim now
  tracks trixie (glibc 2.41) -> "GLIBC_2.39 not found" at boot. Pin the
  runtime to debian:trixie-slim so it matches the builder's glibc.
- frontend healthcheck probed localhost (-> musl picks IPv6 ::1) but the
  Node server binds IPv4 0.0.0.0 only -> false "unhealthy". Probe 127.0.0.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 17:26:58 +02:00
MechaCat02
a2826d6467 feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0)
Some checks failed
deploy / test-backend (push) Failing after 11s
deploy / test-frontend (push) Failing after 36s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
Operators whose sources shard images across numbered CDN subdomains
can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST. The new
flag short-circuits the host check in DownloadAllowlist::contains
while leaving scheme, localhost, and private-IP defenses in
is_safe_url untouched — scraped URLs pointing at 10.x /
169.254.169.254 / file:// stay refused. Default is false; fail-closed
posture is preserved unless the operator opts in. Wired into both the
server (config::build_download_allowlist) and the bin/crawler.rs
one-shot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:52:49 +02:00
MechaCat02
1eebb90e25 fix(crawler): unhang shutdown on lingering Arc<Browser>, silence WS noise (0.43.1)
Some checks failed
deploy / test-backend (push) Failing after 6s
deploy / test-frontend (push) Failing after 40s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
- Handle::close aborts its chromiumoxide driver task when another
  Arc<Browser> outlives the call, so shutdown returns instead of
  hanging on a stream that never terminates. Generic close_or_abort
  helper with regression tests covering both Arc paths.
- daemon.shutdown() is wrapped in a 5s timeout in main as defense
  in depth.
- Default RUST_LOG silences chromiumoxide::conn / chromiumoxide::handler
  WS-deserialize ERROR spam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:47:36 +02:00
MechaCat02
030b27754b feat(api): admin-initiated user creation via POST /admin/users (0.43.0)
Some checks failed
deploy / test-backend (push) Failing after 8s
deploy / test-frontend (push) Failing after 38s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
Pairs with the ALLOW_SELF_REGISTER toggle from 0.42.0: admins can mint
accounts regardless of the toggle state, so a closed-membership
deployment still has a working enrollment path. The endpoint accepts
{ username, password, is_admin? } so admins can mint co-admins in one
call (avoiding a separate promote + extra audit row for the common
"invite a co-admin" flow).

Implementation:
- POST /api/v1/admin/users guarded by RequireAdmin
- Reuses validate_username / validate_password from api::auth (made
  pub(crate)) so the admin path can never produce an account self-
  register would reject and vice versa
- repo::user::admin_create_user wraps INSERT + admin_audit insert in
  a single tx — same "audit reflects what committed" semantics as the
  existing admin_safe_* fns
- Audit row: action="create_user", payload={username, is_admin}

Frontend:
- createAdminUser() in lib/api/admin.ts
- /admin/users grows a collapsible "Create user" form above the table
  (username, password, "Make admin" checkbox). Errors surface inline;
  the list reloads on success.

Backend tests: 7 new, including the headline
`create_user_works_even_when_self_register_disabled` that pins the
admin-create path is NOT gated by the public toggle.
2026-05-31 14:00:31 +02:00
MechaCat02
2f47faa11c feat(auth): ALLOW_SELF_REGISTER toggle + public /auth/config endpoint (0.42.0)
Lets operators run a closed-membership deployment by setting
ALLOW_SELF_REGISTER=false (default true, so existing deploys are
unaffected). When off, POST /auth/register returns 403 forbidden. The
rate-limit token is consumed BEFORE the disabled check so the timing
doesn't distinguish enabled-but-rejected from disabled — closes the
toggle-state probe channel.

New public GET /auth/config returns { self_register_enabled: bool }
so the frontend can render its register affordances correctly
without conflating "disabled" with "rate-limited" (which a probe
attempt would).

Frontend: a lightweight reactive `authConfig` store loads the flag
once on root-layout mount (and again on /register direct navigation,
which bypasses the layout's onMount). Header hides the Register link
when the toggle is off; /register renders a "self-registration is
disabled — ask an administrator" notice instead of the form.

Admin-create endpoint that pairs with this toggle is intentionally
not in this PR — it lands as the next branch (feat/admin-user-create).
The toggle alone is independently useful for deployments that want
to lock down enrollment without yet wiring an admin UI.
2026-05-31 13:56:18 +02:00
MechaCat02
6dd21451a8 chore: sync Cargo.lock for 0.41.2 2026-05-30 22:26:24 +02:00
MechaCat02
f6728dc71a fix(admin): security-audit findings — paginate chapters, lock down unchecked helper (0.41.2)
Addresses the security-audit findings on top of the admin feature stack:

M1: /admin/mangas/:id/chapters now paginates (default limit 200, max 500).
A long-runner with thousands of chapters would otherwise produce a multi-MB
response with that many scalar subqueries per row — admin-only but a real
stall risk on one expand-click. Adds explicit pagination tests for the cap
and offset; frontend renders a "Showing first N of M" hint when the cap
clips the result.

L1: repo::user::set_is_admin renamed to set_is_admin_unchecked with a
doc-comment pointing at admin_safe_set_is_admin for production use. The
short name was a footgun — a future contributor reaching for it would
silently bypass self-protection, the last-admin invariant, and the audit
log. Used only by integration-test setup; production code goes through
the admin_safe_* paths.

CSRF posture: build_session_cookie carries a comment that the
SameSite=Lax default is the project's CSRF defense for state-changing
mutations and breaks the instant anyone adds a side-effecting GET under
/admin/*. Spells out what to do then (Strict + explicit token check).

Test counts: 43 backend admin tests + 12 vitest admin tests all green;
svelte-check 0/0 across 446 files.
2026-05-30 22:23:55 +02:00
MechaCat02
aa2159ca06 fix(admin): three review findings — audit no-op, 404, chapter priority (0.41.1)
- admin_safe_set_is_admin: short-circuit when target.is_admin == value,
  before writing audit. PATCH {is_admin: true} on someone already admin
  previously wrote a misleading "promote_user" row even though the UPDATE
  was a no-op.

- list_chapters (/admin/mangas/:id/chapters): explicit exists() check on
  manga_id, returns 404 instead of 200 [] for a typo'd / deleted manga.

- ChapterSyncState priority: the Failed branch now requires page_count = 0,
  so a chapter with pages on disk AND a historical dead job (from a
  re-download attempt that crashed) stays Synced. The old order
  contradicted Synced's documented "downloaded at some point" contract.
  Doc comments updated alongside the SQL.

Three new regression tests pin the behaviour.
2026-05-30 21:58:15 +02:00
MechaCat02
b434c9b68d feat(frontend): /admin dashboard with users/mangas/system views (0.41.0)
Adds the SvelteKit /admin route tree backed by the admin endpoints
landed in PR 1-4. Pages: Overview (alerts + summary cards), Users
(list / promote-demote / delete), Mangas (list with sync state +
expandable per-chapter state), System (live disk/mem/cpu bars,
refreshing every 5s).

Security model: the backend's RequireAdmin extractor is the actual
boundary. /admin/+layout.ts calls getSystemStats() at load and
translates the response — 401 → redirect to /login, 403 → throw
SvelteKit error(403) which renders the framework error page. The
header's "Admin" link is hidden unless `session.user?.is_admin`,
but that's UX only.

Carries `is_admin: boolean` through to the frontend User TS type so
the header check works and so admin tables can show role per row.

Vitest covers lib/api/admin.ts (10 tests: list/delete/PATCH for
users, sync-state filter for mangas, nested chapter route, system
disk-nullable case). Playwright is intentionally deferred until the
routes stabilise — admin UI is operator-only and changes shape often
in v0.
2026-05-30 21:49:39 +02:00
MechaCat02
cc4ec76d17 feat(api): admin system metrics endpoint with disk/mem/cpu alerts (0.40.0)
Adds GET /api/v1/admin/system returning disk (scoped to storage_dir
via statvfs), memory, CPU, and a server-side alerts array that fires
at >90% disk or memory.

Disk uses nix::sys::statvfs directly rather than sysinfo's Disks API
to avoid mountpoint-matching gymnastics for the storage_dir. A new
`Storage::local_root() -> Option<&Path>` trait method exposes the
root; the default returns None so a future S3Storage gets `disk:
null` in the response instead of fabricated numbers.

CPU is sampled inline (refresh → 250ms sleep → refresh → read) so the
endpoint adds 250ms of latency per call. No background-cache yet —
admin traffic is low-volume and the moving parts aren't worth it
until polling shows up.

Alerts are evaluated server-side so the frontend can render them
without re-implementing the thresholds.
2026-05-30 21:45:06 +02:00
MechaCat02
bf7c9b5c2a feat(api): admin manga/chapter overview with derived sync state (0.39.0)
Adds GET /api/v1/admin/mangas and /admin/mangas/:id/chapters guarded by
RequireAdmin. Sync state is computed at query time from the existing
crawler signals (manga_sources / chapter_sources / crawler_jobs) — no
new state column is persisted, so the crawler stays the single writer
of these signals.

Per-manga priority: InProgress (in-flight sync_manga or
sync_chapter_list job) > Dropped (all source rows soft-dropped) >
Synced (default; covers user-uploaded mangas with zero source rows).

Per-chapter priority: Downloading (in-flight sync_chapter_content) >
Dropped (all source rows soft-dropped) > Failed (most-recent terminal
job is dead) > NotDownloaded (page_count = 0) > Synced. The Failed
check sits ABOVE NotDownloaded so the more informative "we tried and
it died" state wins over "we never got around to it" — see the
priority comment in repo/admin_view.rs.

Migration 0020 adds a partial index on
crawler_jobs((payload->>'source_manga_key')) for the one job kind
(sync_manga) whose payload doesn't carry manga_id directly — without it
the in-flight detection for a manga falls back to a seqscan over the
job table.
2026-05-30 21:41:09 +02:00
MechaCat02
0b2018ceca feat(api): admin user management endpoints with audit log (0.38.0)
Adds /api/v1/admin/users list / DELETE / PATCH guarded by RequireAdmin,
plus the audit-log substrate every future destructive admin endpoint
will reuse.

Safety properties:
- Cannot self-delete or self-demote (409 conflict, message calls out
  "yourself" so the UI can render an explanation).
- Cannot remove the last admin via either DELETE or demote. The check
  takes pg_advisory_xact_lock(ADMIN_INVARIANT_LOCK_KEY) and re-counts
  admins inside the same tx, closing the parallel-demote race that a
  bare "if count > 1" check would let through. The HTTP-serial path to
  this guard is structurally unreachable (the actor would have to be
  the lone admin demoting themselves, which the self-guard fires on
  first); the parallel race test exercises it via repo calls.

Audit log (admin_audit table) records the action inside the same tx
as the action itself, so a rolled-back action never leaves an orphan
audit row. actor_user_id is ON DELETE SET NULL so the log outlives a
later-deleted admin. target_id is not a FK because future audit kinds
will target non-user rows.
2026-05-30 21:35:35 +02:00
MechaCat02
ab8b7acc34 feat(auth): admin role with cookie-only RequireAdmin extractor (0.37.0)
Adds an `is_admin` flag on users plus the substrate every later PR in the
admin feature builds on:

- migration 0018 adds the column with default false
- `repo::user::bootstrap_admin` creates or promotes the user named by
  `ADMIN_USERNAME` at startup, hashing `ADMIN_PASSWORD` only when the row
  is new — never overwriting an existing hash, so an operator can rotate
  the admin password via the UI without env-var conflict
- `CurrentSessionUser` extractor accepts only the session cookie;
  `RequireAdmin` composes over it and additionally requires
  `user.is_admin`. Bearer tokens are intentionally excluded so an
  admin's bot token never inherits admin authority (privilege-escalation
  surface that bites every "API keys reuse user perms" auth design)
- demotion is instant: `RequireAdmin` re-reads the user row each request

`/api/v1/auth/me` now exposes `is_admin`; no other response embeds
`User`, so no privacy fanout to audit.
2026-05-30 21:26:26 +02:00
MechaCat02
9925f54695 fix(crawler): narrow browser-dead heuristic to typed downcasts (0.36.7)
anyhow_looks_browser_dead substring-matched any chain message
containing channel / connection / websocket / transport / closed /
nav timeout. Real chromium failures hit those words, but so do
reqwest TCP-reset errors during CDN image downloads, sqlx pool-
timeout errors, and any number of non-browser failures — each of
which triggered a wasted chromium relaunch + session-probe re-run
against the catalog's rate-limit budget.

Drop the substring pass. Walk the chain looking only for typed
NavError (flagged via is_likely_browser_dead) or CdpError. Every
place we feed a chromium error into anyhow goes through one of
those types, so the typed downcasts cover the real cases without
the false-positive surface.

NavError::is_likely_browser_dead also drops its own substring
check on Cdp(e); any CdpError surfacing at the navigation layer
means the chromium-facing channel is the failing layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:41:59 +02:00
MechaCat02
eaa5afda50 fix(crawler): skip sync when empty chapters + prior > 0 (0.36.6)
The wait_for_selector wait in 0.36.2 narrows the partial-render race
window but doesn't close it: a render that takes longer than
SELECTOR_TIMEOUT (10s) still hands an empty Vec to sync_manga_chapters,
and the soft-drop branch flips every existing chapter to dropped_at.
The next tick recovers but a manga's reader briefly stops working in
between.

Close it at the pipeline level. Between fetch_manga and the upsert/
sync, if the parsed chapter list is empty and the prior live count
for (source_id, source_manga_key) is > 0, treat the fetch as a
transient failure: log, bump mangas_failed, skip upsert + sync + the
seen.insert so a later batch / tick retries. Brand-new mangas with
genuinely zero chapters (prior == 0) pass through unchanged.

New repo helper repo::crawler::live_chapter_count_for_source_manga
joins chapters → chapter_sources → manga_sources with dropped_at IS
NULL — same lockstep as dispatch_target and the enqueue queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:17:42 +02:00
MechaCat02
5c04b0532b fix(crawler): panic-isolate the cron tick body (0.36.5)
Worker dispatch was already wrapped in AssertUnwindSafe(...)
.catch_unwind() — a panicking handler ack's the job failed and the
worker keeps going. The cron tick had no such guard: a panic in
metadata.run, enqueue_bookmarked_pending, reap_done, or
write_last_tick would kill the cron task. The JoinSet would drop it,
workers would keep running, and no future metadata pass would ever
fire until daemon restart.

Wrap the tick body (between advisory-lock acquire and unlock) in the
same AssertUnwindSafe(...).catch_unwind() pattern. The unlock and
connection drop run unconditionally so a panicked tick doesn't leave
the lock held for another replica.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:08:11 +02:00
MechaCat02
655ea42731 fix(crawler): scope dispatch_target to live sources, newest first (0.36.4)
The chapter dispatcher's URL resolver had no dropped_at filter and no
ORDER BY — a chapter whose only chapter_sources row had been soft-
dropped was still dispatched against the stale URL, eating retry
budget on guaranteed transients. With multiple live sources the LIMIT
1 winner was nondeterministic.

Add `AND cs.dropped_at IS NULL` and `ORDER BY cs.last_seen_at DESC`
to dispatch_target, bringing it in lockstep with the enqueue queries
in pipeline.rs that already filter on dropped_at. Returns None when
all sources are dropped — callers in daemon.rs already treat None
as "ack the job, skip the work."

Tests in tests/repo_chapter.rs cover the three branches (freshest
live wins, dropped sources skipped, all-dropped returns None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:03:45 +02:00
MechaCat02
70e8a7895c fix(crawler): relaunch chromium on CDP / nav-timeout errors (0.36.3)
BrowserManager only re-launched chromium when the cached handle was
None. A crash mid-pass left the handle Some pointing at a dead
process — every subsequent acquire returned the zombie Browser, and
every nav cascaded CDP errors until the idle reaper fired.

Add BrowserManager::invalidate(): take the inner mutex, drop the
handle (closing it if present), and signal the next acquire to
relaunch. Idempotent — invalidating an empty handle is a no-op.

Wire detection via NavError::is_likely_browser_dead and a
chain-walking anyhow_looks_browser_dead helper: substring-match
common channel/connection/transport/WebSocket markers and surface
NavError::Timeout as "presumed dead." Apply at both error
boundaries — RealChapterDispatcher::dispatch and
RealMetadataPass::run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:39:19 +02:00
MechaCat02
8e0b638e3f fix(crawler): wait for page marker instead of fixed 1s sleep (0.36.2)
A chromium snapshot taken between the wrapper-render and row-render
phases let parse_chapter_list return Ok(vec![]) for a manga that
actually has chapters — the soft-drop branch in sync_manga_chapters
then flipped every existing chapter to dropped_at.

Add wait_for_selector to crawler::nav. navigate() now takes a CSS
marker matching the most-specific element the downstream parser will
look for (one of LIST_PAGE_MARKER / DETAIL_PAGE_CHAPTERS_MARKER /
DETAIL_PAGE_LAYOUT_MARKER). The wait is best-effort and capped by
SELECTOR_TIMEOUT (10s); a legitimately empty page can still pass
through because the parser's #chapter_table sentinel and the
universal broken-page body check stay in force.

Same pattern wired at the reader nav (a#pic_container) and probe
nav (#logo), replacing the implicit assumption that the post-load
JS had finished within 1 second.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:29:38 +02:00
MechaCat02
e2bd1462ba fix(crawler): wrap wait_for_navigation in 30s timeout (0.36.1)
A hung TLS handshake or a page that never fires load could wedge a
worker (or the cron metadata pass) indefinitely — chromiumoxide
imposes no navigation timeout of its own.

New crawler::nav::wait_for_nav caps each navigation at NAV_TIMEOUT
(30s) and returns a typed NavError so timeouts surface as transient
(retryable) errors. Wired at the three navigation sites:
- source::target::navigate (catalog/detail/pagination)
- content::sync_chapter_content (chapter reader)
- session::fetch_probe_html (session probe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 18:10:51 +02:00
MechaCat02
9f56f283d4 feat(crawler): single-mode walker gated by recovery flag (0.36.0)
Collapses the crawler to a single newest-first walker and replaces the
N-consecutive-unchanged streak with a per-manga rule: stop on the first
manga where metadata is Unchanged AND chapter sync reports zero new
chapters. The early stop is gated by a per-source recovery flag stored
in `crawler_state` — set to `false` when a run starts, back to `true`
only on a clean exit (end-of-walk or intentional stop). A crashed run
leaves the flag `false` automatically (no shutdown code runs), so the
next tick walks the full catalog instead of bailing at the first
caught-up manga.

This means a crashed mid-walk run self-heals on the next tick: the
flag stays `false`, the next walk visits every page (recovering
anything the crash missed past its crash point), and steady state
resumes once the recovery sweep reaches end-of-walk.

Removed:
- DiscoverMode enum, Backfill mode, the boundary re-check +
  displaced-refs machinery in TargetSourceWalker.
- Drop-pass (mark_dropped_mangas) and seed-completion plumbing
  (mark_seed_completed / seed_completed_at). The recovery flag
  subsumes the seed-completion signal; drop detection was explicitly
  opted out.
- JobPayload::Discover (no production callers).
- CRAWLER_MODE / CRAWLER_INCREMENTAL_STOP_AFTER env vars and the
  CrawlerModePref config type.

`should_mark_clean_exit(walked_to_completion, hit_stop_condition)`
encodes the clean-exit truth table in its signature — `hit_limit` is
deliberately absent so a future edit cannot accidentally count a
caller-imposed cap as a clean exit.

Net -501 lines, 261 backend tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:49:28 +02:00
MechaCat02
33f7e19077 fix(crawler): serialize sync_manga_chapters per-manga (0.35.6)
Two concurrent calls of sync_manga_chapters for the same manga both
read seen_keys, both run the drop UPDATE filtered on `NOT (key = ANY
$3)`, and the later commit can soft-drop a chapter the earlier had
just inserted (lost-update under MVCC). Today the cron tick is the
only caller and the daemon-level advisory lock keeps it single-flight,
but that lock is held on one pool connection and doesn't actually
serialize the *function*: any future caller (bookmark hook,
admin-triggered re-sync, parallel worker) would race against the cron.

Add `pg_advisory_xact_lock(hashtextextended(manga_id::text, 0))` at
the start of the transaction. Auto-releases on commit/rollback so a
panic mid-call can't strand the lock. Lock keyed per-manga so calls
for different mangas still parallelize.

Test sync_chapters_serializes_concurrent_calls_for_same_manga spawns
two tokio tasks calling the function concurrently with overlapping
chapter lists and asserts every chapter survives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:45:01 +02:00
MechaCat02
c6bb9160e3 fix(crawler): scope chapter_sources lookup per-manga (0.35.5)
chapter_sources's PRIMARY KEY was (source_id, source_chapter_key) and
the lookup in sync_manga_chapters didn't constrain by manga_id, so a
source whose chapter slugs aren't globally unique (e.g. "chapter-1"
appearing under multiple mangas) silently attributed every collision
to the first manga that synced it. The INSERT path would have
conflicted on the second manga's sync.

Migration 0017 drops the old PK and rekeys on (source_id, chapter_id)
— the natural identity of a per-source chapter attachment — and adds
an index on (source_id, source_chapter_key) for the lookup path. The
repo lookup now joins chapters and filters by manga_id; the UPDATE
path keys on chapter_id directly (the row's natural identifier
post-migration).

Test sync_chapters_isolates_colliding_keys_across_mangas pins the
contract end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:43:08 +02:00
MechaCat02
50763addcf fix(crawler): quarantine recently-dead chapters from re-enqueue (0.35.4)
The partial dedup index only blocks (pending|running) duplicates, so
once a SyncChapterContent job transitions to 'dead' (max_attempts
exhausted) the slot frees. Every subsequent cron tick re-enqueued the
chapter — page_count = 0 and dropped_at IS NULL stay true — burned
another max_attempts retries, and died again. Permanent-failure
chapters spun forever.

enqueue_bookmarked_pending and enqueue_pending_for_manga now skip
chapters whose latest sync_chapter_content job is dead within
CHAPTER_DEAD_QUARANTINE_DAYS (7). A failed chapter goes silent for a
week, then gets one more shot — long enough for a transient site
issue to resolve, short enough that permanent failures don't stay
permanent if conditions change.

Two integration tests pin both halves of the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:42:41 +02:00
MechaCat02
766c6eebac fix(crawler): guard ack_done/ack_failed/release on state='running' (0.35.3)
The three lease-ack functions matched their UPDATE on the job id
alone. If a lease expired and another worker re-leased the row, a
late ack from the original worker would clobber the new lease's
state, leased_until, and (for release) decrement its attempts.

Add `AND state = 'running'` to each UPDATE and log a warn when
rows_affected is zero, so a stolen lease shows up in telemetry without
blocking the new lease holder's progress.

Three new integration tests pin the contract:
- ack_done_no_ops_when_lease_was_stolen
- ack_failed_no_ops_when_state_is_not_running
- release_no_ops_when_state_is_not_running

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:42:18 +02:00
MechaCat02
c686d6eb51 fix(crawler): sentinel-gate parse_chapter_list to stop false drops (0.35.2)
parse_chapter_list previously returned Vec::new() on any selector
miss. The empty list flowed into sync_manga_chapters, whose soft-drop
branch then flipped every existing chapter's dropped_at to NOW().
Bookmarks subsequently pointed at dropped sources, and
enqueue_bookmarked_pending (filters on cs.dropped_at IS NULL) silently
stopped re-fetching pages.

Same shape as the walker race fixed in 0.35.1: a transient parse miss
masquerading as "source removed everything" → false soft-drop.

Fix: require #chapter_table in the DOM. Present-but-empty is preserved
as Ok(vec![]) so a freshly added manga with no published chapters
still parses cleanly. Absent table is now Transient — the job system
reschedules with backoff instead of treating the partial render as
data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:41:47 +02:00
MechaCat02
dea9b1aaa8 fix(crawler): close walker race against site reordering (0.35.1)
The target site orders by update_date DESC, and any new or updated
manga pushes everyone down by one slot. The paginated walker was
blind to this drift:

* Backfill (page last -> 1): shifts push items into pages already
  finished. The displaced manga was silently missed; with
  mark_dropped_mangas running on a fully-completed walk, items even
  got false-dropped because last_seen_at was stale.
* Incremental (page 1 -> last): a shift causes the slot-last item
  of an already-read page to reappear on the next page, leading to
  a redundant fetch_manga and an inflated consecutive_unchanged
  streak.

Fix is two-pronged:

1. Backfill boundary re-check. After fetching each page P, re-fetch
   the previously-walked page P+1 and check where its old slot-0
   key now sits. If it slid to slot K, the first K entries are
   items that used to live on P and slid past us; they get appended
   to the batch. If the anchor is gone entirely (multi-page shift
   or it was bumped to page 1), the whole re-fetched page is
   processed conservatively and the pipeline dedup absorbs the
   noise. The re-check must be the *last* navigation of the
   iteration to close the within-iteration race.

2. Run-scoped dedup in run_metadata_pass. A HashSet<String> of
   source_manga_keys avoids double-processing. The set uses a
   contains-then-insert pattern with insert firing *after* a
   successful upsert, so a transient fetch/upsert failure leaves
   the key retryable if it reappears later in the same pass (via
   the boundary re-check or another batch).

Incremental mode does not run the re-check (shifts move in the
same direction as the walk); only the dedup helps it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 20:14:01 +02:00
MechaCat02
f57ca8e45c feat: harden auth, shutdown, and session bundle (0.35.0)
Some checks failed
deploy / test-backend (push) Failing after 1m37s
deploy / test-frontend (push) Failing after 16m31s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped
Three features bundled into one release:

- rate-limit /auth/login, /register, /me/password (token bucket,
  5 req/sec sustained with 10-request burst by default; 429 +
  Retry-After header on hit; tracing::warn! per hit so operators
  see attack patterns; AUTH_RATE_PER_SEC / AUTH_RATE_BURST env knobs)
- handle SIGTERM for graceful container stops (replaces bare
  ctrl_c() with a select over ctrl_c + SignalKind::terminate() so
  docker compose stop runs the daemon shutdown path instead of
  letting Chromium leak past SIGKILL)
- clear session.user on 401 from any API call (setOn401Hook in
  api/client.ts, registered from session.svelte.ts gated on
  $app/environment::browser so the SSR bundle never installs it;
  fixes "logged in but no bookmarks/collections" mid-session
  expiry state)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:27:21 +02:00
MechaCat02
8d34132883 bugfix: security & correctness bundle (0.34.1)
Five fixes bundled into one release:

- preserve user-attached tags across crawler upserts
  (repo::crawler::sync_tags now scopes to added_by IS NULL; orphaned
  attachments from deleted users are reaped as crawler-owned)
- gate manga PATCH and cover endpoints on uploaded_by (require_can_edit
  in api::mangas; non-NULL uploaded_by must match the caller)
- equalise login response time across user-existence branches
  (run argon2 against a OnceLock-cached dummy hash on the no-user
  branch so timing doesn't leak username existence)
- crawler download defences (SSRF allowlist of host literals
  including IPv4-mapped IPv6 ranges, 32 MiB streamed size cap,
  reject non-whitelisted image types, three-way chapter-probe
  classifier replaces the binary #avatar_menu check)
- tighten validation and clean up dead unload path
  (attach_tag + create_token enforce 64-char caps; LocalStorage
  rejects NUL bytes explicitly; reader flushFinalProgress drops
  the always-405 sendBeacon path)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 20:24:51 +02:00
MechaCat02
c5c1179e9d chore: full hop-by-hop header strip and 60s timeout on /api/* proxy
The SvelteKit proxy was only stripping host + content-length; the rest
of RFC 7230 §6.1 (connection, keep-alive, proxy-authenticate,
proxy-authorization, te, trailer, transfer-encoding, upgrade) leaked
through to axum. Axum doesn't emit them so the impact is theoretical,
but the proxy should be RFC-conformant. Also adds an AbortController
with a configurable 60s timeout (BACKEND_PROXY_TIMEOUT_MS) so a
wedged backend can't hang the browser request indefinitely — failures
surface as the standard 502 upstream_unavailable envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
c320eda7cd chore: dedupe is_unique_violation, lift SQL into repo, centralise URL parsing
Three layering cleanups from REVIEW.md §5 / §3:

- Drop the three private `is_unique_violation` helpers in
  repo::{user,chapter,bookmark} in favour of sqlx 0.8's
  `DatabaseError::is_unique_violation()` method (already used by
  repo::collection).
- Remove the unreachable 23505 branch in repo::chapter::create — the
  (manga_id, number) UNIQUE was dropped in 0013, so the defensive arm
  could no longer fire. A doc note records what to do if uniqueness
  is re-added.
- Move three inline SQL queries out of handlers/daemon into repo
  functions: bookmarks' chapter-belongs-to-manga guard
  (`repo::chapter::belongs_to_manga`), the daemon's dispatch lookup
  (`repo::chapter::dispatch_target`), and the daemon's page_count
  safety net (`repo::chapter::page_count`). Restores the
  handlers→repo layering invariant in CLAUDE.md.
- New `crawler::url_utils` module consolidates host_of / origin_of /
  registrable_domain — they used to live in three crawler submodules
  with diverging edge-case behaviour. Tests moved with them.
- Doc cross-references on repo::author::set_for_manga and
  repo::genre::set_for_manga pointing to the crawler's name-keyed
  variants, so the intentional duplication is discoverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
bd9a6bd257 chore: drop dead 'failed' branch from crawler_jobs partial index
0012_crawler.sql's partial index on `state IN ('pending','failed')`
indexes a state that no code path ever writes — ack_failed in
crawler/jobs.rs only ever moves jobs to 'dead' or 'pending'. The
'failed' branch costs a write on every state change without ever
matching a query. Drop it; the CHECK still allows 'failed' so a
future migration can re-introduce it cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
ebc1966103 chore: run containers as non-root, add HEALTHCHECK, npm ci
Backend: new `app` user (UID 10001), STORAGE_DIR pre-chowned so the
named volume inherits ownership, curl installed for the HEALTHCHECK
that pings /api/v1/health. The crawler's Chromium uses --no-sandbox
already so dropping privileges costs nothing operationally.

Frontend: switch `npm install` to `npm ci` (matches CI; deterministic
versions; refuses to silently rewrite package-lock.json mid-build).
Run as the built-in `node` user via --chown=node:node, add a busybox
wget HEALTHCHECK on port 3000.

Both images now expose container-level health so orchestrators can
take a wedged container out of rotation instead of letting it keep
serving timeouts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
e4333631e1 chore: run CI on PRs, require POSTGRES_PASSWORD, document HTTPS need
- .gitea/workflows/deploy.yml: trigger on pull_request to main so PRs
  get test feedback; gate build-and-push + deploy on push events so
  PRs only run the test jobs (no registry push, no SSH deploy).
- docker-compose.yml: change `${POSTGRES_PASSWORD:-mangalord}` to
  `${POSTGRES_PASSWORD:?...}` so a deploy without an .env fails fast
  instead of booting Postgres with a known-default credential.
- .env.example: change the example value to a "change-me" sentinel,
  add a banner explaining that production needs HTTPS in front of
  the frontend container because COOKIE_SECURE=true makes browsers
  refuse cookies over plain HTTP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:24:05 +02:00
MechaCat02
e7662d18d6 feat: gitea actions for build, push, and ssh deploy (0.34.0)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:56:13 +02:00
MechaCat02
45ce0d8f12 feat: incremental crawl mode with seed-completion gate (0.33.0)
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.

`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 06:41:26 +02:00
MechaCat02
51f42b03e9 feat: default crawler browser to headless (0.32.0)
LaunchOptions::from_env() and LaunchOptions::default() now return
BrowserMode::Headless. The in-process daemon (via CrawlerConfig::from_env)
and the standalone crawler binary both pick this up — no display
required for production runs, smaller resource footprint.

`Headed` stays as an explicit opt-in via CRAWLER_BROWSER_MODE=headed
for debugging or sites that fingerprint headless Chrome. New unit test
locks the default in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:27:05 +02:00
MechaCat02
fa0a7da311 feat: edit existing manga metadata (0.31.0)
Adds PUT /mangas/:id/cover (multipart) and DELETE /mangas/:id/cover so
covers can be replaced or cleared after creation, and wires a dedicated
/manga/[id]/edit SvelteKit route that combines the existing PATCH with
the new cover endpoints. Cover PUT cleans up the old blob when the
extension changes, swallowing StorageError::NotFound so a manually-gone
file doesn't surface as a 404 to the client. Edit link on the manga
detail page is gated on session.user, matching the auth posture of the
underlying handlers.

Also pins the local-dev port story via loadEnv() in vite.config.ts so
VITE_PORT / BACKEND_URL from a (gitignored) .env keep the dev URL
stable across runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 20:26:23 +02:00
MechaCat02
9ff49166a5 feat: transient-page detection across the crawler (0.30.0)
Until now, when the target site returned its 403 "we're sorry, the
request file are not found" response on a page that actually exists,
selectors matched nothing and the crawler treated the page as
"legitimately empty". Pagination walks silently dropped whole pages
worth of mangas, fetch_manga skipped individual entries, and the
startup session probe blamed PHPSESSID for what was a site hiccup.

This branch adds a single detection layer that the whole pipeline
routes through:

- `crawler::detect`: PageError::Transient typed signal, plus two
  primitives (`is_broken_page_body` matches the universal 403 body;
  `has_logo_sentinel` asserts #logo, the site-wide header element)
  and a `retry_on_transient` helper that retries a closure on
  Transient with a small attempt budget.
- `navigate()` screens every fetched body for the broken-page
  signature before handing it to a selector.
- Parsers (`parse_manga_list_from`, `parse_manga_detail`,
  `parse_chapter_pages`) check their structural sentinels (#logo for
  full-layout pages; a#pic_container for the reader, which doesn't
  render #logo) and return Result<_, PageError>. Empty Vec is now
  reserved for genuinely empty pages.
- `discover()` retries each pagination page up to 3× (2s apart) before
  failing the whole Discover job — at which point the existing job
  system's retry/backoff takes over for longer outages.
- `verify_session` is three-state: broken-page → retry probe;
  #logo present but #avatar_menu absent → genuine logout (the only
  state that should blame PHPSESSID); both present → ok.

Test coverage added at the helper level: 13 unit tests for the
detection module (body signature, logo sentinel, PageError, retry
helper), parser-level tests for both transient and legitimately-empty
inputs, and 6 unit tests for the session probe classifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 22:47:21 +02:00
MechaCat02
b845d88766 feat: bookmark create enqueues SyncChapterContent jobs (0.29.0)
After a successful bookmark insert, the create handler spawns a
detached tokio task that calls pipeline::enqueue_pending_for_manga
for every chapter of the manga where page_count = 0 and the source
row is not dropped. Bookmark create returns 201 immediately; enqueue
work happens in the background and its failure is logged without
surfacing to the user (the daily cron sweeps anything missed).

The Phase A dedup index handles re-bookmarks idempotently — deleting
and recreating a bookmark does not duplicate in-flight jobs — and the
Phase B worker pool drains them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:59:14 +02:00
MechaCat02
9fe0f26d75 feat: in-process crawler daemon with cron and worker pool (0.28.0)
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.

Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
  around browser::Handle, with an on_launch hook that re-injects
  PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
  /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
  used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
  idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
  marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged — the queue is for the daemon, force-refetches and manual
backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:32:02 +02:00
MechaCat02
93c7fd63fc feat: crawler job queue ops and dedup index (0.27.0)
Adds enqueue / lease / ack_done / ack_failed / release / reap_done on
crawler::jobs, backed by the existing crawler_jobs table. lease() uses
a single FOR UPDATE SKIP LOCKED CTE that also re-claims stale running
rows (crashed-worker recovery), and ack_failed applies an exponential
backoff capped at 1h before retrying.

Migration 0014 adds a partial unique index on
(payload->>'chapter_id') restricted to (pending|running)
sync_chapter_content jobs, so producers can just
INSERT ... ON CONFLICT DO NOTHING without racing each other. The slot
frees again the moment the job leaves the in-flight states, so a
future force-refetch can re-enqueue.

Library-only — no daemon, no API hook. Those land in the next two
phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:59:09 +02:00
MechaCat02
89b84252a5 bugfix: subquery-wrap pending chapters query so DISTINCT + ORDER BY agree (0.26.1)
PG rejects `SELECT DISTINCT c.id, c.manga_id, cs.source_url ... ORDER BY
c.manga_id, c.created_at` because the ORDER BY references a column not in
the DISTINCT projection. Wrap the DISTINCT in a subquery (which includes
created_at) and apply the ORDER BY in the outer SELECT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 22:20:15 +02:00
MechaCat02
728d704a66 feat: CRAWLER_KEEP_BROWSER_OPEN waits for Ctrl+C in headed mode (0.26.0)
Debug aid: when set in headed mode, the crawler blocks on Ctrl+C at
every shutdown point (early auth bails + normal completion) instead
of closing the browser immediately. Operator can inspect DOM, cookies,
and network state in the visible Chromium window before exit.

Ignored in headless (no window to inspect) — logged as a warning if
set under headless so the operator doesn't sit waiting.

chromiumoxide's `Browser` is `kill_on_drop`, so the close-or-wait
helper must await Ctrl+C *before* the Handle is dropped — otherwise
the Chromium child gets killed out from under the operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 21:33:18 +02:00
MechaCat02
d24e68c78d feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:28:36 +02:00
MechaCat02
51346227dd feat: route reader by chapter id, allow duplicate-numbered chapters (0.24.0)
Real-world sources publish multiple chapters at the same number:
different scanlators ("Ch.52 from bloomingdale" + "Ch.52 from mina"),
translator notices and farewells, alt-translations. The (manga_id,
number) UNIQUE constraint from 0001 silently collapsed all of those
into a single row via the upsert path in repo::crawler. Migration 0013
drops the constraint; sync_manga_chapters now plain-INSERTs each
SourceChapterRef so every parsed chapter survives as its own row.

Identity moves from the (manga_id, number) tuple to the chapter UUID:

- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id` (replaces :number)
- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id/pages`
- `repo::chapter::find_by_id_in_manga` (replaces find_by_manga_and_number)
- Frontend reader route renamed to `/manga/[id]/chapter/[chapter_id]`
- Chapter links throughout (manga page list, continue-reading CTA,
  reader prev/next, history rows, bookmark cards) use chapter.id
- API clients getChapter/getChapterPages take a chapter id string

read_progress + bookmarks already FK chapter_id; they only enrich with
chapter_number for display, which is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:37:07 +02:00
MechaCat02
c51353ead3 bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1)
Listing links point at the reader's page 1
(`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took
the last URL segment and returned `"pg-1"` for every chapter, so all
parsed chapters collapsed onto a single `chapter_sources` row downstream
and the first-manga chapter was the only row that survived. New
`derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before
picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`).

Notices, hiatus rows, and duplicate-numbered chapters are preserved as
distinct parser entries. The (manga_id, number) UNIQUE collapse in the
chapters table is a separate, follow-up concern handled in
feat/chapter-id-routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:15:36 +02:00
MechaCat02
b1a3a4e9d3 feat: crawler manga-list & metadata sync with cover download (0.23.0)
- TargetSource: first concrete impl of the Source trait, modeled on
  the old Puppeteer crawler's selectors (+ status normalization,
  tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
  page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
  pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
  New/Updated/Unchanged + current cover_image_path for backfill
  decisions), sync_manga_chapters, mark_dropped_mangas — all
  transactional, with case-insensitive lookups and source-insertable
  genres
- Cover image download via reqwest+infer; stored under
  mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
  reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
  $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
  $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
  default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests
2026-05-21 22:04:23 +02:00
MechaCat02
26eccd0abe feat: crawler scaffold with chromium launcher (0.22.0)
- crawler module (browser, source trait, jobs, diff) + binary
- chromiumoxide launcher with fetcher feature (auto-downloads
  Chromium on first run, caches under ~/.cache/mangalord/chromium)
- LaunchOptions struct with extra_args, parseable from
  CRAWLER_BROWSER_MODE and CRAWLER_BROWSER_ARGS
- migration 0012 introduces sources, manga_sources,
  chapter_sources, crawler_jobs
- integration tests for headed + headless launch, ipify load+parse,
  and extra-args propagation (all #[ignore], opt-in)
2026-05-20 22:07:56 +02:00
122 changed files with 18942 additions and 277 deletions

View File

@@ -1,20 +1,30 @@
# Copy to .env for `docker compose up --build`. Local-dev runs (cargo run
# / npm run dev) read backend/.env if present, or pick up the variables
# from your shell.
#
# Production note: COOKIE_SECURE=true (the default below) makes browsers
# refuse to send the session cookie over plain HTTP. Run with a TLS-
# terminating reverse proxy (Caddy, Traefik, nginx) in front — the
# compose file here doesn't ship one. Local/dev runs without HTTPS
# should set COOKIE_SECURE=false.
# ----- Postgres -----
# These are read by the Postgres container *and* by DATABASE_URL below;
# changing them after the first boot won't migrate existing data, so set
# them up front for any new deployment.
#
# POSTGRES_PASSWORD is REQUIRED — docker-compose.yml fails fast if it
# isn't set in this file, to prevent a deploy without an .env booting
# Postgres with a publicly-known credential.
POSTGRES_USER=mangalord
POSTGRES_PASSWORD=mangalord
POSTGRES_PASSWORD=change-me-to-a-strong-random-string
POSTGRES_DB=mangalord
# ----- Backend -----
DATABASE_URL=postgres://mangalord:mangalord@postgres:5432/mangalord
BIND_ADDRESS=0.0.0.0:8080
STORAGE_DIR=/var/lib/mangalord/storage
RUST_LOG=info,mangalord=debug
RUST_LOG=info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off
# ----- Auth / cookies -----
# COOKIE_SECURE controls whether the `Secure` flag is set on the session
@@ -29,6 +39,13 @@ COOKIE_DOMAIN=
# get reaped lazily.
SESSION_TTL_DAYS=30
# ----- Auth brute-force rate limits -----
# Token-bucket budget shared across /auth/login, /auth/register, and
# /auth/me/password. Set per_sec=0 to disable (e.g. behind a
# rate-limiting reverse proxy that already enforces a budget).
AUTH_RATE_PER_SEC=5
AUTH_RATE_BURST=10
# ----- CORS -----
# Comma-separated origins allowed to call the API with credentials.
# Default is empty: same-origin only. Set when frontend and backend live
@@ -44,6 +61,28 @@ MAX_REQUEST_BYTES=209715200
# Default 20 MiB.
MAX_FILE_BYTES=20971520
# ----- Crawler download safety -----
# Hosts the crawler is allowed to fetch images/covers from, in addition
# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
CRAWLER_DOWNLOAD_ALLOWLIST=
# Bypass the host allowlist entirely. Intended for sources that shard
# images across numbered CDN subdomains (cdn1/cdn2/…) where enumerating
# every host upfront is impractical. The private-IP / localhost / non-
# http(s) scheme defenses STAY ON — a scraped <img src="http://10.0.0.1/">
# is still refused with this flag set.
CRAWLER_ALLOW_ANY_HOST=false
# Hard cap on a single image body. Default 32 MiB.
CRAWLER_MAX_IMAGE_BYTES=33554432
# Path to a system Chromium binary. When set, the crawler skips the
# bundled-fetcher download. Required on platforms without a usable
# upstream Chromium build (notably Linux_arm64 / Raspberry Pi). On
# Debian: /usr/bin/chromium-headless-shell or /usr/bin/chromium. On
# Ubuntu the package is chromium-browser (different path). Pair with
# `docker compose build --build-arg INSTALL_CHROMIUM=true backend` so
# the image actually contains the binary.
CRAWLER_CHROMIUM_BINARY=
# ----- Frontend -----
# The frontend container runs SvelteKit's Node adapter on :3000 and
# proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the
@@ -51,3 +90,8 @@ MAX_FILE_BYTES=20971520
# internal docker network. Override only if you're running the
# frontend container against a backend somewhere else.
BACKEND_URL=http://backend:8080
# Per-request wall-clock cap for the /api/* reverse proxy (milliseconds).
# Default 300000 (5 min) covers a typical 200 MiB chapter upload over
# 25 Mbps; raise for users on slower upstream links or lower if a
# tighter front proxy already bounds the request lifetime.
BACKEND_PROXY_TIMEOUT_MS=300000

71
.gitea/README.md Normal file
View File

@@ -0,0 +1,71 @@
# Gitea Actions
The [`deploy`](workflows/deploy.yml) workflow runs on every push to `main`
(and via manual `workflow_dispatch`). It tests, builds, pushes the images
to a private registry, and rolls the stack over by SSH on the target host.
## Required secrets
Set under *Repo Settings → Actions → Secrets*:
| Name | Example | Purpose |
| -------------------- | ------------------------ | ---------------------------------------------------------------- |
| `REGISTRY_URL` | `registry.example.com` | Registry host. No scheme, no trailing slash. |
| `REGISTRY_USERNAME` | `mangalord-ci` | `docker login` user. |
| `REGISTRY_PASSWORD` | `<token>` | `docker login` token/password. |
| `SSH_HOST` | `mangalord.example.com` | Deploy target hostname/IP. |
| `SSH_USER` | `deploy` | SSH user on the target (must be in the `docker` group). |
| `SSH_PRIVATE_KEY` | `-----BEGIN OPENSSH...` | Private key authorised in the target user's `authorized_keys`. |
| `SSH_PORT` | `22` | Optional. Defaults to `22` if unset. |
## Required variables
Set under *Repo Settings → Actions → Variables* (not secrets — they appear
in logs):
| Name | Example | Purpose |
| ------------- | ------------------------ | ---------------------------------------------------------------------- |
| `DEPLOY_PATH` | `/srv/mangalord` | Directory on target holding `docker-compose.yml`, `.env`, and the prod overlay. |
## One-time host setup
The workflow assumes the deploy target already has:
1. Docker + Docker Compose v2 installed and the `SSH_USER` in the `docker` group.
2. `$DEPLOY_PATH/docker-compose.yml` (copy of the repo's [docker-compose.yml](../docker-compose.yml)).
3. `$DEPLOY_PATH/docker-compose.prod.yml` (copy of the repo's [docker-compose.prod.yml](../docker-compose.prod.yml)).
4. `$DEPLOY_PATH/.env` populated from [.env.example](../.env.example) with production values (real `POSTGRES_PASSWORD`, `COOKIE_SECURE=true`, etc.).
Bootstrap once:
```bash
ssh deploy@mangalord.example.com
sudo mkdir -p /srv/mangalord && sudo chown deploy:deploy /srv/mangalord
cd /srv/mangalord
# place docker-compose.yml, docker-compose.prod.yml, and .env here
```
The first workflow run will pull the images, bring the stack up, and run
the embedded migrations on startup.
## Image tags
Every push produces three tags per image:
- `mangalord-{backend,frontend}:latest`
- `mangalord-{backend,frontend}:<git-sha>` — used by the deploy job; lets
you pin a deploy to a specific commit
- `mangalord-{backend,frontend}:<version>` — the version from
[backend/Cargo.toml](../backend/Cargo.toml) (verified in lockstep with
[frontend/package.json](../frontend/package.json))
## Rollback
SSH to the target, set `IMAGE_TAG` to a previous commit SHA, and re-up:
```bash
cd /srv/mangalord
export REGISTRY_URL=registry.example.com
export IMAGE_TAG=<previous-sha>
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

160
.gitea/workflows/deploy.yml Normal file
View File

@@ -0,0 +1,160 @@
name: deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
workflow_dispatch:
jobs:
test-backend:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_USER: mangalord
POSTGRES_PASSWORD: mangalord
POSTGRES_DB: mangalord
options: >-
--health-cmd "pg_isready -U mangalord"
--health-interval 5s
--health-timeout 5s
--health-retries 10
env:
DATABASE_URL: postgres://mangalord:mangalord@postgres:5432/mangalord
steps:
- uses: actions/checkout@v4
# ubuntu-latest has node (so JS actions like checkout/cache run) but no
# Rust. We intentionally avoid `container: rust:1-slim` because act_runner
# runs JS actions with node *inside* the job container, and the slim Rust
# image ships no node (checkout would fail with exit 127).
- name: Install Rust + build deps
run: |
set -eu
SUDO=""; [ "$(id -u)" = "0" ] || SUDO="sudo"
$SUDO apt-get update
$SUDO apt-get install -y --no-install-recommends pkg-config libssl-dev ca-certificates curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --profile minimal --default-toolchain stable
echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
- name: Cache cargo registry and target
uses: actions/cache@v4
with:
path: |
~/.cargo/registry
~/.cargo/git
backend/target
key: cargo-${{ runner.os }}-${{ hashFiles('backend/Cargo.lock') }}
restore-keys: |
cargo-${{ runner.os }}-
- name: cargo test
working-directory: backend
run: cargo test --locked
test-frontend:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: npm
cache-dependency-path: frontend/package-lock.json
- name: npm ci
working-directory: frontend
run: npm ci
- name: vitest
working-directory: frontend
run: npm test
build-and-push:
runs-on: ubuntu-latest
needs: [test-backend, test-frontend]
# PRs only run the test jobs; build + deploy are reserved for
# post-merge pushes to main. Without this gate every PR would push
# a tagged image to the registry and SSH-deploy to prod.
if: github.event_name != 'pull_request'
outputs:
image_tag: ${{ steps.meta.outputs.image_tag }}
version: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- name: Resolve image tags
id: meta
run: |
version="$(grep -m1 '^version' backend/Cargo.toml | cut -d'"' -f2)"
frontend_version="$(grep -m1 '"version"' frontend/package.json | cut -d'"' -f4)"
if [ "$version" != "$frontend_version" ]; then
echo "Version mismatch: backend=$version frontend=$frontend_version" >&2
exit 1
fi
echo "image_tag=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
echo "version=${version}" >> "$GITHUB_OUTPUT"
- uses: docker/setup-buildx-action@v3
- name: docker login
uses: docker/login-action@v3
with:
registry: ${{ secrets.REGISTRY_URL }}
username: ${{ secrets.REGISTRY_USERNAME }}
password: ${{ secrets.REGISTRY_PASSWORD }}
- name: Build & push backend
uses: docker/build-push-action@v5
with:
context: ./backend
push: true
tags: |
${{ secrets.REGISTRY_URL }}/mangalord-backend:latest
${{ secrets.REGISTRY_URL }}/mangalord-backend:${{ steps.meta.outputs.image_tag }}
${{ secrets.REGISTRY_URL }}/mangalord-backend:${{ steps.meta.outputs.version }}
cache-from: type=gha,scope=backend
cache-to: type=gha,mode=max,scope=backend
- name: Build & push frontend
uses: docker/build-push-action@v5
with:
context: ./frontend
push: true
tags: |
${{ secrets.REGISTRY_URL }}/mangalord-frontend:latest
${{ secrets.REGISTRY_URL }}/mangalord-frontend:${{ steps.meta.outputs.image_tag }}
${{ secrets.REGISTRY_URL }}/mangalord-frontend:${{ steps.meta.outputs.version }}
cache-from: type=gha,scope=frontend
cache-to: type=gha,mode=max,scope=frontend
deploy:
runs-on: ubuntu-latest
needs: build-and-push
if: github.event_name != 'pull_request'
# Single-host deploy: the runner lives on the same box as the stack, so we
# drive the host docker daemon directly (act_runner shares its socket via
# `docker_host: "-"`) instead of SSHing out. The compose dir is bind-mounted
# at its REAL host path so compose's relative bind-mounts (./mangalord/...,
# ./Caddyfile) resolve; this requires `/mnt/ssd/docker-data` in the runner's
# container.valid_volumes. The central compose references the images as
# registry.mc02.dev/mangalord-*:${MANGALORD_TAG:-latest}, so we only pull
# and recreate the two mangalord services at the freshly built SHA.
container:
image: docker:cli
volumes:
- /mnt/ssd/docker-data:/mnt/ssd/docker-data
steps:
- name: Deploy to the local stack
working-directory: /mnt/ssd/docker-data
env:
REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
IMAGE_TAG: ${{ needs.build-and-push.outputs.image_tag }}
run: |
set -eu
echo "$REGISTRY_PASSWORD" | docker login "$REGISTRY_URL" -u "$REGISTRY_USERNAME" --password-stdin
export MANGALORD_TAG="$IMAGE_TAG"
docker compose pull mangalord-backend mangalord-frontend
docker compose up -d mangalord-backend mangalord-frontend
docker image prune -f
docker logout "$REGISTRY_URL"

1516
backend/Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,7 +1,8 @@
[package]
name = "mangalord"
version = "0.21.3"
version = "0.45.0"
edition = "2021"
default-run = "mangalord"
[lib]
path = "src/lib.rs"
@@ -10,6 +11,10 @@ path = "src/lib.rs"
name = "mangalord"
path = "src/main.rs"
[[bin]]
name = "crawler"
path = "src/bin/crawler.rs"
[dependencies]
axum = { version = "0.7", features = ["macros", "multipart"] }
tokio = { version = "1", features = ["full"] }
@@ -18,6 +23,7 @@ serde = { version = "1", features = ["derive"] }
serde_json = "1"
uuid = { version = "1", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
chrono-tz = "0.9"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
tower = { version = "0.5", features = ["util"] }
@@ -36,7 +42,13 @@ time = "0.3"
infer = "0.16"
tokio-util = { version = "0.7", features = ["io"] }
futures-core = "0.3"
futures-util = "0.3"
bytes = "1"
chromiumoxide = { version = "0.7", features = ["tokio-runtime", "_fetcher-rusttls-tokio"], default-features = false }
sysinfo = { version = "0.32", default-features = false, features = ["system"] }
nix = { version = "0.29", features = ["fs"] }
scraper = "0.20"
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies", "stream"] }
[dev-dependencies]
tempfile = "3"
@@ -44,3 +56,4 @@ tower = { version = "0.5", features = ["util"] }
http-body-util = "0.1"
mime = "0.3"
futures-util = "0.3"
tokio = { version = "1", features = ["test-util"] }

View File

@@ -10,7 +10,8 @@ RUN apt-get update \
# exact crate versions CI tested. Without Cargo.lock + the flag, cargo
# would silently resolve fresh on every image build.
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && echo "" > src/lib.rs \
RUN mkdir -p src/bin && echo "fn main() {}" > src/main.rs && echo "" > src/lib.rs \
&& echo "fn main() {}" > src/bin/crawler.rs \
&& cargo build --locked --release \
&& rm -rf src
@@ -18,13 +19,68 @@ COPY src ./src
COPY migrations ./migrations
RUN touch src/main.rs src/lib.rs && cargo build --locked --release
FROM debian:bookworm-slim
FROM debian:trixie-slim
# Runtime base must match the builder's Debian release: `rust:1-slim` tracks
# trixie (glibc 2.41), so a bookworm runtime (glibc 2.36) can't run the
# binary ("GLIBC_2.39 not found"). Keep these two in lockstep on bumps.
# `curl` is for the container HEALTHCHECK; `ca-certificates` is for
# outbound HTTPS (crawler covers/pages).
#
# INSTALL_CHROMIUM is an opt-in for deployments that can't use the
# chromiumoxide fetcher path (notably Linux_arm64 / Raspberry Pi, where
# the upstream snapshot bucket has no usable build). When `true`, adds
# Debian's apt-packaged headless chromium plus a baseline font set —
# pair with `CRAWLER_CHROMIUM_BINARY=/usr/bin/chromium-headless-shell`
# at runtime so the launcher uses it. Default `false` keeps cloud/x86
# images slim.
#
# Build the Pi image with:
# docker compose build --build-arg INSTALL_CHROMIUM=true backend
ARG INSTALL_CHROMIUM=false
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& if [ "$INSTALL_CHROMIUM" = "true" ]; then \
apt-get install -y --no-install-recommends chromium-headless-shell fonts-liberation; \
fi \
&& rm -rf /var/lib/apt/lists/*
# Non-root runtime user. The API binary doesn't need any root
# privilege; the crawler daemon's Chromium launcher uses --no-sandbox
# precisely because user-namespace sandboxing is fragile, so dropping
# privileges costs nothing operationally and shrinks the blast radius
# of any RCE.
ARG APP_UID=10001
ARG APP_GID=10001
RUN groupadd --system --gid ${APP_GID} app \
&& useradd --system --uid ${APP_UID} --gid app --home-dir /home/app --create-home --shell /usr/sbin/nologin app
WORKDIR /app
COPY --from=builder /app/target/release/mangalord /usr/local/bin/mangalord
COPY --from=builder /app/migrations /app/migrations
ENV STORAGE_DIR=/var/lib/mangalord/storage
# Pre-create the storage dir so the entrypoint doesn't need to
# mkdir-as-root and so the named volume mount inherits the right
# ownership.
#
# UPGRADE NOTE for operators: if you're moving from an older image
# that ran as root, the existing `storage-data` volume has files owned
# by UID 0 and the new UID-10001 user can't write them. Run once
# before the upgrade:
# docker compose run --rm --user 0 backend \
# chown -R 10001:10001 /var/lib/mangalord/storage
# (Postgres is unaffected — that image's `postgres` user UID hasn't
# changed.)
RUN mkdir -p ${STORAGE_DIR} \
&& chown -R app:app ${STORAGE_DIR} /app /home/app
USER app
EXPOSE 8080
# `--start-period` is generous because first boot runs sqlx::migrate
# against postgres which can take a few seconds; subsequent restarts
# are sub-second.
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
CMD curl -fsS http://localhost:8080/api/v1/health > /dev/null || exit 1
CMD ["mangalord"]

View File

@@ -0,0 +1,72 @@
-- Crawler tables.
--
-- Same philosophy as 0001_init.sql: new concepts go in new tables
-- joined to existing ones, not jammed onto `mangas`/`chapters`. A
-- crawled manga IS a manga; the only thing the source-link tables
-- carry is "where did this come from and when did we last see it".
-- That keeps the API and frontend source-agnostic.
-- 1. Source registry. One row per site the crawler knows about.
-- `config` carries per-site knobs (base URL, rate limits, custom
-- selectors) so adding a source is a row insert plus a `Source`
-- trait impl — no schema change.
CREATE TABLE sources (
id text PRIMARY KEY,
name text NOT NULL,
base_url text NOT NULL,
enabled boolean NOT NULL DEFAULT true,
config jsonb NOT NULL DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now()
);
-- 2. Link tables. `(source_id, source_*_key)` is the natural key the
-- source itself exposes; the FK to `mangas`/`chapters` is what
-- threads it back into our domain. `metadata_hash` is the signal
-- used by `crawler::diff` to detect updates without re-comparing
-- every field. `last_seen_at` + `dropped_at` is the soft-drop pair.
CREATE TABLE manga_sources (
source_id text NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
source_manga_key text NOT NULL,
manga_id uuid NOT NULL REFERENCES mangas(id) ON DELETE CASCADE,
source_url text NOT NULL,
metadata_hash text,
first_seen_at timestamptz NOT NULL DEFAULT now(),
last_seen_at timestamptz NOT NULL DEFAULT now(),
dropped_at timestamptz,
PRIMARY KEY (source_id, source_manga_key)
);
CREATE INDEX manga_sources_manga_idx ON manga_sources (manga_id);
CREATE INDEX manga_sources_last_seen_idx ON manga_sources (source_id, last_seen_at);
CREATE TABLE chapter_sources (
source_id text NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
source_chapter_key text NOT NULL,
chapter_id uuid NOT NULL REFERENCES chapters(id) ON DELETE CASCADE,
source_url text NOT NULL,
first_seen_at timestamptz NOT NULL DEFAULT now(),
last_seen_at timestamptz NOT NULL DEFAULT now(),
dropped_at timestamptz,
PRIMARY KEY (source_id, source_chapter_key)
);
CREATE INDEX chapter_sources_chapter_idx ON chapter_sources (chapter_id);
-- 3. Persistent job queue. Workers lease with
-- `FOR UPDATE SKIP LOCKED`, heartbeat via `leased_until`, and ack
-- by transitioning state. The partial index keeps the hot path
-- (pick the next ready job) off the bulk of done/dead rows.
CREATE TABLE crawler_jobs (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
payload jsonb NOT NULL,
state text NOT NULL DEFAULT 'pending'
CHECK (state IN ('pending','running','done','failed','dead')),
attempts integer NOT NULL DEFAULT 0,
max_attempts integer NOT NULL DEFAULT 5,
scheduled_at timestamptz NOT NULL DEFAULT now(),
leased_until timestamptz,
last_error text,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX crawler_jobs_ready_idx
ON crawler_jobs (scheduled_at)
WHERE state IN ('pending', 'failed');

View File

@@ -0,0 +1,18 @@
-- Real-world sources publish multiple chapters at the same number:
-- different uploaders, translator notices/farewells, paid-vs-free
-- re-uploads, and our own users can legitimately have two versions of
-- "Ch.52" with different scanlations. The (manga_id, number) UNIQUE
-- from 0001_init silently collapses all of those into a single row via
-- ON CONFLICT, dropping data. Drop the constraint and lean on the
-- chapter id (UUID) as the only chapter identity going forward.
ALTER TABLE chapters DROP CONSTRAINT chapters_manga_id_number_key;
-- The UNIQUE was also our only index on (manga_id, number) since
-- 0007 dropped the redundant explicit one. Chapter list pages
-- ORDER BY number ASC and the manga page is a hot read path, so put
-- the index back without the uniqueness. Secondary sort by created_at
-- so duplicate-numbered chapters have a stable order in lists and
-- prev/next navigation.
CREATE INDEX chapters_manga_id_number_idx
ON chapters (manga_id, number, created_at);

View File

@@ -0,0 +1,15 @@
-- Dedup SyncChapterContent jobs in flight.
--
-- Without this, the daemon's bookmark/cron enqueue paths would have to do a
-- pre-check + insert race that's incorrect under concurrency. The partial
-- unique index lets both producers use plain `INSERT ... ON CONFLICT DO
-- NOTHING`: at most one (pending|running) job per chapter_id exists, and the
-- slot frees again as soon as the job transitions to done/failed/dead so a
-- re-enqueue is possible after the row is reaped or a force-refetch is wanted.
--
-- Scoped to sync_chapter_content payloads only so Discover / SyncManga /
-- SyncChapterList jobs (which don't carry a chapter_id) remain un-deduped.
CREATE UNIQUE INDEX crawler_jobs_chapter_content_dedup_idx
ON crawler_jobs ((payload->>'chapter_id'))
WHERE state IN ('pending', 'running')
AND payload->>'kind' = 'sync_chapter_content';

View File

@@ -0,0 +1,12 @@
-- Small key-value table for daemon state that needs to survive restarts.
--
-- Used so far only by the cron scheduler (`last_metadata_tick_at`) so it can
-- detect that the most recent slot was missed (e.g. the backend was down at
-- midnight) and fire immediately on startup before resuming the regular
-- schedule. JSONB on the value column lets future keys carry richer payloads
-- without another migration.
CREATE TABLE crawler_state (
key text PRIMARY KEY,
value jsonb NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);

View File

@@ -0,0 +1,15 @@
-- The original 0012 partial index covers `state IN ('pending','failed')`,
-- but `ack_failed` in src/crawler/jobs.rs only writes `dead` or
-- `pending` — `failed` is never set. The index branch on `failed`
-- never matches any row, so it's dead weight on every write.
--
-- Drop and recreate the index without the dead branch. The CHECK
-- constraint on `state` still allows `'failed'` so a future migration
-- can adopt that terminal-but-retryable state without a second
-- schema change.
DROP INDEX IF EXISTS crawler_jobs_ready_idx;
CREATE INDEX crawler_jobs_ready_idx
ON crawler_jobs (scheduled_at)
WHERE state = 'pending';

View File

@@ -0,0 +1,20 @@
-- chapter_sources: drop the global (source_id, source_chapter_key) PK
-- and rekey on (source_id, chapter_id).
--
-- The old PK assumed chapter slugs are unique per source. Sources whose
-- chapter naming is per-manga (chapter-1, chapter-2, ...) instead of per-
-- catalog (br_chapter-379272 with a global counter) would collide on the
-- second manga: the INSERT would conflict on (source_id, "chapter-1") and
-- the lookup would attribute the row to the first manga's chapter_id.
--
-- The new key is the natural identity of a source attachment: "this source
-- has this chapter". An (source_id, source_chapter_key) index preserves
-- the lookup path (find existing source row by source's identifier) but
-- no longer enforces uniqueness — the application combines it with the
-- chapters table's manga_id to scope the lookup per-manga.
ALTER TABLE chapter_sources DROP CONSTRAINT chapter_sources_pkey;
ALTER TABLE chapter_sources ADD PRIMARY KEY (source_id, chapter_id);
CREATE INDEX chapter_sources_source_key_idx
ON chapter_sources (source_id, source_chapter_key);

View File

@@ -0,0 +1,5 @@
-- Admin role flag on users. Booted from ADMIN_USERNAME / ADMIN_PASSWORD env at
-- startup (see app::build). Demotion is instant: the RequireAdmin extractor
-- re-reads the user row every request, so flipping this column takes effect on
-- the next call without a session purge.
ALTER TABLE users ADD COLUMN is_admin BOOLEAN NOT NULL DEFAULT false;

View File

@@ -0,0 +1,20 @@
-- Admin audit log. Written from inside the same transaction as the action
-- it records, so a failed COMMIT also rolls back the audit row — the log
-- never claims an action happened that didn't.
--
-- `actor_user_id` is ON DELETE SET NULL so audit rows outlive a deleted
-- admin (the answer to "who promoted Bob to admin?" survives even after
-- Alice's account is removed). `target_id` is intentionally not a FK
-- because future audit kinds may target non-user rows (manga, source,
-- etc.) and a single typed FK can't express that.
CREATE TABLE admin_audit (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
actor_user_id uuid REFERENCES users(id) ON DELETE SET NULL,
action text NOT NULL,
target_kind text NOT NULL,
target_id uuid,
payload jsonb NOT NULL DEFAULT '{}'::jsonb,
at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX admin_audit_at_idx ON admin_audit (at DESC);

View File

@@ -0,0 +1,14 @@
-- Per-manga sync-state derivation joins crawler_jobs to manga_sources via
-- (payload->>'source_id', payload->>'source_manga_key') for the
-- `sync_manga` job kind (whose payload doesn't carry a manga_id directly).
-- Without this index the join falls back to a seqscan of crawler_jobs on
-- every admin manga listing — a noticeable cost as the job table grows
-- with the daily metadata pass.
--
-- Partial on `state IN ('pending','running')` so it covers only in-flight
-- jobs (the bulk of the table is done/dead and irrelevant to "is this
-- manga being synced right now").
CREATE INDEX crawler_jobs_sync_manga_key_idx
ON crawler_jobs ((payload->>'source_manga_key'))
WHERE state IN ('pending', 'running')
AND payload->>'kind' = 'sync_manga';

View File

@@ -0,0 +1,110 @@
//! Admin manga/chapter overview with derived sync state.
//!
//! Sync state comes from `repo::admin_view`, which joins the manga /
//! chapter tables with the crawler signals at query time — there is no
//! persisted sync_state column. See [`repo::admin_view`] for the
//! derivation priority order.
use axum::extract::{Path, Query, State};
use axum::routing::get;
use axum::{Json, Router};
use serde::Deserialize;
use uuid::Uuid;
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::domain::MangaSyncState;
use crate::error::{AppError, AppResult};
use crate::repo;
use crate::repo::admin_view::{AdminChapterRow, AdminMangaRow};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/admin/mangas", get(list_mangas))
.route("/admin/mangas/:id/chapters", get(list_chapters))
}
#[derive(Debug, Deserialize, Default)]
pub struct ListChaptersParams {
#[serde(default = "default_chapter_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_chapter_limit() -> i64 {
200
}
#[derive(Debug, Deserialize, Default)]
pub struct ListMangasParams {
#[serde(default)]
pub search: Option<String>,
/// `in_progress` | `dropped` | `synced`. Unrecognised values are a 400.
#[serde(default)]
pub sync_state: Option<String>,
#[serde(default = "default_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_limit() -> i64 {
50
}
async fn list_mangas(
State(state): State<AppState>,
_admin: RequireAdmin,
Query(params): Query<ListMangasParams>,
) -> AppResult<Json<PagedResponse<AdminMangaRow>>> {
let limit = params.limit.clamp(1, 200);
let offset = params.offset.max(0);
let sync_state = match params.sync_state.as_deref() {
None | Some("") => None,
Some("in_progress") => Some(MangaSyncState::InProgress),
Some("dropped") => Some(MangaSyncState::Dropped),
Some("synced") => Some(MangaSyncState::Synced),
Some(other) => {
return Err(AppError::InvalidInput(format!(
"sync_state must be one of in_progress|dropped|synced (got {other:?})"
)));
}
};
let q = repo::admin_view::ListAdminMangasQuery {
search: params.search.filter(|s| !s.trim().is_empty()),
sync_state,
limit,
offset,
};
let (items, total) = repo::admin_view::list_mangas_with_sync_state(&state.db, &q).await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}
async fn list_chapters(
State(state): State<AppState>,
_admin: RequireAdmin,
Path(manga_id): Path<Uuid>,
Query(params): Query<ListChaptersParams>,
) -> AppResult<Json<PagedResponse<AdminChapterRow>>> {
// Explicit existence check so a typo / deleted manga returns 404
// rather than a misleading "no chapters" 200.
if !repo::manga::exists(&state.db, manga_id).await? {
return Err(AppError::NotFound);
}
// Cap at 500 to bound the per-row scalar-subquery cost on
// long-runners with thousands of chapters; default 200 covers
// typical browsing without paging round-trips.
let limit = params.limit.clamp(1, 500);
let offset = params.offset.max(0);
let q = repo::admin_view::ListAdminChaptersQuery {
manga_id,
limit,
offset,
};
let (items, total) = repo::admin_view::list_chapters_with_sync_state(&state.db, &q).await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}

View File

@@ -0,0 +1,20 @@
//! Admin-only endpoints. Mounted under `/api/v1/admin/*` by
//! `crate::api::routes`. Every handler in this subtree is guarded by
//! `RequireAdmin`, which only accepts session-cookie authentication —
//! bot/API tokens cannot reach admin routes (see
//! `crate::auth::extractor::RequireAdmin`).
pub mod mangas;
pub mod system;
pub mod users;
use axum::Router;
use crate::app::AppState;
pub fn routes() -> Router<AppState> {
Router::new()
.merge(users::routes())
.merge(mangas::routes())
.merge(system::routes())
}

View File

@@ -0,0 +1,163 @@
//! System metrics for the admin dashboard.
//!
//! Disk is `statvfs(storage_dir)` so the number reflects the volume the
//! app actually writes to (not the root filesystem of the host). When the
//! storage backend doesn't expose a local path (e.g. a future S3 impl)
//! the disk fields are `null` rather than fabricated.
//!
//! Memory and CPU come from `sysinfo`. CPU requires two refreshes with
//! at least 200ms between them to compute a meaningful delta; the
//! handler eats the 250ms wall-clock cost on each request. Admin
//! traffic is low-volume so a background cache isn't worth the moving
//! parts yet — revisit if polling becomes frequent.
use std::path::Path;
use std::time::Duration;
use axum::extract::State;
use axum::routing::get;
use axum::{Json, Router};
use serde::Serialize;
use sysinfo::{CpuRefreshKind, MemoryRefreshKind, RefreshKind, System};
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::error::AppResult;
const ALERT_THRESHOLD_PERCENT: f64 = 90.0;
pub fn routes() -> Router<AppState> {
Router::new().route("/admin/system", get(system))
}
#[derive(Debug, Serialize)]
pub struct SystemStats {
pub disk: Option<DiskStats>,
pub memory: MemoryStats,
pub cpu: CpuStats,
pub alerts: Vec<Alert>,
}
#[derive(Debug, Serialize)]
pub struct DiskStats {
pub total_bytes: u64,
pub used_bytes: u64,
pub free_bytes: u64,
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct MemoryStats {
pub total_bytes: u64,
pub used_bytes: u64,
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct CpuStats {
pub percent_used: f64,
}
#[derive(Debug, Serialize)]
pub struct Alert {
pub level: AlertLevel,
pub message: String,
}
#[derive(Debug, Serialize, Clone, Copy)]
#[serde(rename_all = "snake_case")]
pub enum AlertLevel {
Warning,
}
async fn system(
State(state): State<AppState>,
_admin: RequireAdmin,
) -> AppResult<Json<SystemStats>> {
let disk = state.storage.local_root().and_then(disk_stats_for);
let (memory, cpu) = memory_and_cpu().await;
let mut alerts = Vec::new();
if let Some(d) = &disk {
if d.percent_used >= ALERT_THRESHOLD_PERCENT {
alerts.push(Alert {
level: AlertLevel::Warning,
message: format!(
"disk near full ({:.0}% used)",
d.percent_used
),
});
}
}
if memory.percent_used >= ALERT_THRESHOLD_PERCENT {
alerts.push(Alert {
level: AlertLevel::Warning,
message: format!(
"memory near full ({:.0}% used)",
memory.percent_used
),
});
}
Ok(Json(SystemStats {
disk,
memory,
cpu,
alerts,
}))
}
fn disk_stats_for(root: &Path) -> Option<DiskStats> {
let s = nix::sys::statvfs::statvfs(root).ok()?;
// statvfs reports `f_frsize * f_blocks` for total bytes. `f_bavail`
// is "free to non-root callers" which is what an operator actually
// cares about — `f_bfree` includes blocks reserved for root.
let block = s.fragment_size();
let total = block * s.blocks();
let avail = block * s.blocks_available();
let used = total.saturating_sub(avail);
let percent_used = if total > 0 {
(used as f64) * 100.0 / (total as f64)
} else {
0.0
};
Some(DiskStats {
total_bytes: total,
used_bytes: used,
free_bytes: avail,
percent_used,
})
}
async fn memory_and_cpu() -> (MemoryStats, CpuStats) {
// sysinfo's CPU sampling needs two refreshes with a delay between
// them — the first seeds the delta counters, the second measures.
// We do this once per request; admin traffic is low enough that the
// 250ms cost is invisible.
let mut sys = System::new_with_specifics(
RefreshKind::new()
.with_cpu(CpuRefreshKind::everything())
.with_memory(MemoryRefreshKind::everything()),
);
sys.refresh_cpu_all();
// Yield the runtime instead of blocking it for the gap.
tokio::time::sleep(Duration::from_millis(250)).await;
sys.refresh_cpu_all();
sys.refresh_memory();
let total = sys.total_memory();
let used = sys.used_memory();
let mem_pct = if total > 0 {
(used as f64) * 100.0 / (total as f64)
} else {
0.0
};
let memory = MemoryStats {
total_bytes: total,
used_bytes: used,
percent_used: mem_pct,
};
let cpu = CpuStats {
percent_used: sys.global_cpu_usage() as f64,
};
(memory, cpu)
}

View File

@@ -0,0 +1,128 @@
//! Admin user management: list, delete, promote/demote.
//!
//! All handlers are gated by `RequireAdmin` and rely on
//! `repo::user::admin_safe_*` for self-protection and the last-admin
//! invariant. Audit rows are written inside the same DB transaction as
//! the action they record.
use axum::extract::{Path, Query, State};
use axum::http::StatusCode;
use axum::routing::{delete, get};
use axum::{Json, Router};
use serde::Deserialize;
use uuid::Uuid;
use crate::api::auth::{validate_password, validate_username};
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::RequireAdmin;
use crate::auth::password::hash_password;
use crate::domain::User;
use crate::error::{AppError, AppResult};
use crate::repo;
pub fn routes() -> Router<AppState> {
Router::new()
.route("/admin/users", get(list_users).post(create_user))
.route(
"/admin/users/:id",
delete(delete_user).patch(update_user),
)
}
#[derive(Debug, Deserialize, Default)]
pub struct ListUsersParams {
#[serde(default)]
pub search: Option<String>,
#[serde(default = "default_limit")]
pub limit: i64,
#[serde(default)]
pub offset: i64,
}
fn default_limit() -> i64 {
50
}
async fn list_users(
State(state): State<AppState>,
_admin: RequireAdmin,
Query(params): Query<ListUsersParams>,
) -> AppResult<Json<PagedResponse<User>>> {
let limit = params.limit.clamp(1, 200);
let offset = params.offset.max(0);
let (items, total) = repo::user::list_with_total(
&state.db,
&repo::user::ListUsersQuery {
search: params.search.filter(|s| !s.trim().is_empty()),
limit,
offset,
},
)
.await?;
Ok(Json(PagedResponse::with_total(items, limit, offset, total)))
}
#[derive(Debug, Deserialize)]
pub struct UpdateUserInput {
pub is_admin: Option<bool>,
}
async fn update_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Path(id): Path<Uuid>,
Json(input): Json<UpdateUserInput>,
) -> AppResult<Json<User>> {
let Some(is_admin) = input.is_admin else {
return Err(AppError::InvalidInput(
"no updatable fields supplied".into(),
));
};
let updated =
repo::user::admin_safe_set_is_admin(&state.db, actor.id, id, is_admin).await?;
Ok(Json(updated))
}
async fn delete_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Path(id): Path<Uuid>,
) -> AppResult<StatusCode> {
repo::user::admin_safe_delete(&state.db, actor.id, id).await?;
Ok(StatusCode::NO_CONTENT)
}
#[derive(Debug, Deserialize)]
pub struct CreateUserInput {
pub username: String,
pub password: String,
/// Defaults to false; admins may mint other admins in a single
/// call. Doing it as one POST avoids a second audit row for the
/// common "invite a co-admin" flow.
#[serde(default)]
pub is_admin: bool,
}
async fn create_user(
State(state): State<AppState>,
RequireAdmin(actor): RequireAdmin,
Json(input): Json<CreateUserInput>,
) -> AppResult<(StatusCode, Json<User>)> {
let username = input.username.trim();
// Reuse the canonical self-register validators so the admin-create
// path can never produce a username that self-register would
// reject (and vice versa).
validate_username(username)?;
validate_password(&input.password)?;
let pwhash = hash_password(&input.password)?;
let user = repo::user::admin_create_user(
&state.db,
actor.id,
username,
&pwhash,
input.is_admin,
)
.await?;
Ok((StatusCode::CREATED, Json(user)))
}

View File

@@ -4,6 +4,8 @@
//! expire naturally rather than being explicitly invalidated, so other
//! devices keep their existing logins).
use std::sync::OnceLock;
use axum::extract::{Path, State};
use axum::http::StatusCode;
use axum::response::IntoResponse;
@@ -26,6 +28,7 @@ use crate::repo;
pub fn routes() -> Router<AppState> {
Router::new()
.route("/auth/config", get(auth_config))
.route("/auth/register", post(register))
.route("/auth/login", post(login))
.route("/auth/logout", post(logout))
@@ -39,6 +42,21 @@ pub fn routes() -> Router<AppState> {
.route("/auth/tokens/:id", delete(delete_token))
}
/// Public, unauthenticated. Exposes anonymous-relevant auth policy
/// (currently just whether self-registration is open) so the frontend
/// can render its login / register affordances correctly without a
/// probe request that would conflate "disabled" with "rate-limited".
#[derive(Debug, Serialize)]
pub struct AuthConfigResponse {
pub self_register_enabled: bool,
}
async fn auth_config(State(state): State<AppState>) -> Json<AuthConfigResponse> {
Json(AuthConfigResponse {
self_register_enabled: state.auth.allow_self_register,
})
}
#[derive(Debug, Deserialize)]
pub struct Credentials {
pub username: String,
@@ -80,6 +98,14 @@ async fn register(
jar: CookieJar,
Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> {
// Rate limit before the disabled check so an operator who flips
// the toggle can't be probed for the toggle state via timing —
// disabled and enabled paths both consume a token, and disabled
// returns 403 instead of running argon2.
check_auth_rate_limit(&state, "register")?;
if !state.auth.allow_self_register {
return Err(AppError::Forbidden);
}
let username = input.username.trim();
validate_username(username)?;
validate_password(&input.password)?;
@@ -95,6 +121,7 @@ async fn login(
jar: CookieJar,
Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "login")?;
let username = input.username.trim();
if username.is_empty() || input.password.is_empty() {
return Err(AppError::InvalidInput(
@@ -102,9 +129,15 @@ async fn login(
));
}
let user = repo::user::find_by_username(&state.db, username)
.await?
.ok_or(AppError::Unauthenticated)?;
let user = repo::user::find_by_username(&state.db, username).await?;
let Some(user) = user else {
// No such user. Run argon2 against a stable dummy hash so the
// response time matches the wrong-password branch — otherwise
// an attacker can enumerate usernames by timing the no-user
// 401 against the wrong-password 401.
let _ = verify_password(&input.password, dummy_password_hash());
return Err(AppError::Unauthenticated);
};
if !verify_password(&input.password, &user.password_hash) {
return Err(AppError::Unauthenticated);
}
@@ -113,6 +146,21 @@ async fn login(
Ok((StatusCode::OK, jar, Json(AuthResponse { user })))
}
/// Lazily-computed argon2 hash used to equalise login response time
/// across the "no such user" and "wrong password" branches. Computing
/// it once (on the first login of the process) is enough — the hash is
/// never compared against a real password, only used to force argon2
/// to do the same amount of work it would for a real verify.
fn dummy_password_hash() -> &'static str {
static DUMMY: OnceLock<String> = OnceLock::new();
DUMMY
.get_or_init(|| {
crate::auth::password::hash_password("login-timing-equaliser")
.expect("hash_password on a fixed input cannot fail")
})
.as_str()
}
async fn logout(
State(state): State<AppState>,
jar: CookieJar,
@@ -149,6 +197,7 @@ async fn change_password(
jar: CookieJar,
Json(input): Json<ChangePassword>,
) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "change_password")?;
if !verify_password(&input.current_password, &user.password_hash) {
return Err(AppError::Unauthenticated);
}
@@ -230,8 +279,24 @@ async fn create_token(
Json(input): Json<CreateTokenInput>,
) -> AppResult<impl IntoResponse> {
let name = input.name.trim();
// Both arms use `ValidationFailed` (422 with field details) to
// match the structured-error shape `attach_tag` returns for the
// same kind of free-form-identifier validation. The other
// /auth/* handlers in this file use `InvalidInput` (400); the
// divergence is pre-existing and would warrant a project-wide
// pass to flip them all if the client side wants uniform per-
// field error rendering.
if name.is_empty() {
return Err(AppError::InvalidInput("token name is required".into()));
return Err(AppError::ValidationFailed {
message: "token name is required".into(),
details: serde_json::json!({ "name": "required" }),
});
}
if name.chars().count() > 64 {
return Err(AppError::ValidationFailed {
message: "token name too long".into(),
details: serde_json::json!({ "name": "max 64 characters" }),
});
}
let (raw, hash) = generate_token();
let token = repo::api_token::create(&state.db, user.id, name, &hash).await?;
@@ -267,6 +332,18 @@ async fn start_session(
Ok(jar.add(build_session_cookie(raw, &state.auth)))
}
// CSRF posture: `SameSite=Lax` is the project's primary CSRF defense.
// Browsers refuse to attach this cookie to cross-site POST / PATCH /
// DELETE requests, which covers every state-changing endpoint (auth
// mutations, uploads, bookmarks, collections, admin user management,
// etc. — all JSON over POST/PATCH/DELETE). Lax DOES still attach the
// cookie on top-level cross-site GETs, so this defense breaks the
// instant anyone adds a state-changing GET. If you reach for one,
// switch to `SameSite=Strict` here AND add an explicit CSRF-token
// check on the new endpoint. The Bearer-token branch in the
// extractor is unaffected (bots authenticate with the token header,
// not the cookie) and admin routes reject Bearer entirely — see
// `auth::extractor::RequireAdmin`.
fn build_session_cookie(raw: String, cfg: &AuthConfig) -> Cookie<'static> {
let mut builder = Cookie::build((SESSION_COOKIE_NAME, raw))
.http_only(true)
@@ -293,7 +370,38 @@ fn build_expired_cookie(cfg: &AuthConfig) -> Cookie<'static> {
builder.build()
}
fn validate_username(u: &str) -> AppResult<()> {
/// Consume one token from the shared auth rate limiter. Called at the
/// start of `register`, `login`, and `change_password` so credential
/// stuffing / spraying / username-probe loops are throttled by the
/// configured budget (default 5/sec with a 10-request burst).
///
/// All three endpoints share one bucket — they all expose the same
/// argon2-verify-or-create work and the same enumeration channels, so
/// any one of them in a tight loop should trip the limit. `endpoint`
/// is included in the rate-limit-hit log line so operators can tell
/// which endpoint is being probed.
fn check_auth_rate_limit(state: &AppState, endpoint: &'static str) -> AppResult<()> {
use crate::auth::rate_limit::AcquireResult;
match state.auth_limiter.try_acquire() {
AcquireResult::Allowed => Ok(()),
AcquireResult::Denied { retry_after_secs } => {
tracing::warn!(
endpoint,
retry_after_secs,
"auth rate limit hit; returning 429"
);
Err(AppError::TooManyRequests {
retry_after_secs: Some(retry_after_secs),
})
}
}
}
// Exposed pub(crate) so the admin user-create handler can apply the
// same rules as self-registration. Keeping the lone canonical
// implementation here avoids the two paths drifting on min length /
// allowed character set.
pub(crate) fn validate_username(u: &str) -> AppResult<()> {
if u.is_empty() {
return Err(AppError::InvalidInput("username is required".into()));
}
@@ -310,7 +418,7 @@ fn validate_username(u: &str) -> AppResult<()> {
Ok(())
}
fn validate_password(p: &str) -> AppResult<()> {
pub(crate) fn validate_password(p: &str) -> AppResult<()> {
if p.len() < 8 {
return Err(AppError::InvalidInput(
"password must be at least 8 characters".into(),

View File

@@ -13,6 +13,7 @@ use uuid::Uuid;
use crate::api::pagination::PagedResponse;
use crate::app::AppState;
use crate::auth::extractor::CurrentUser;
use crate::crawler::pipeline;
use crate::domain::{Bookmark, BookmarkSummary};
use crate::error::{AppError, AppResult};
use crate::repo;
@@ -66,14 +67,7 @@ async fn create(
// the foreign-key violation collapse into a generic 500.
repo::manga::get(&state.db, input.manga_id).await?;
if let Some(chapter_id) = input.chapter_id {
let exists: Option<(Uuid,)> = sqlx::query_as(
"SELECT id FROM chapters WHERE id = $1 AND manga_id = $2",
)
.bind(chapter_id)
.bind(input.manga_id)
.fetch_optional(&state.db)
.await?;
if exists.is_none() {
if !repo::chapter::belongs_to_manga(&state.db, chapter_id, input.manga_id).await? {
return Err(AppError::NotFound);
}
}
@@ -86,6 +80,29 @@ async fn create(
input.page,
)
.await?;
// Fire-and-forget: kick off content syncs for any pending chapters of
// the newly-bookmarked manga. The dedup index makes this idempotent
// across repeated bookmarks of the same manga; failure here must not
// surface to the user (the daily cron sweeps anything missed).
let pool = state.db.clone();
let manga_id = input.manga_id;
tokio::spawn(async move {
match pipeline::enqueue_pending_for_manga(&pool, manga_id).await {
Ok(summary) => tracing::info!(
%manga_id,
inserted = summary.inserted,
skipped = summary.skipped,
failed = summary.failed,
"bookmark hook: enqueued pending chapters"
),
Err(e) => tracing::warn!(
%manga_id, error = ?e,
"bookmark hook: enqueue_pending_for_manga failed"
),
}
});
Ok((StatusCode::CREATED, Json(bookmark)))
}

View File

@@ -26,9 +26,9 @@ use crate::upload::{parse_image, UploadedImage};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/mangas/:manga_id/chapters", get(list).post(create))
.route("/mangas/:manga_id/chapters/:number", get(get_one))
.route("/mangas/:manga_id/chapters/:chapter_id", get(get_one))
.route(
"/mangas/:manga_id/chapters/:number/pages",
"/mangas/:manga_id/chapters/:chapter_id/pages",
get(list_pages),
)
}
@@ -60,10 +60,10 @@ async fn list(
async fn get_one(
State(state): State<AppState>,
Path((manga_id, number)): Path<(Uuid, i32)>,
Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
) -> AppResult<Json<Chapter>> {
repo::manga::get(&state.db, manga_id).await?;
let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
.await?
.ok_or(AppError::NotFound)?;
Ok(Json(chapter))
@@ -164,10 +164,10 @@ struct PagesResponse {
async fn list_pages(
State(state): State<AppState>,
Path((manga_id, number)): Path<(Uuid, i32)>,
Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
) -> AppResult<Json<PagesResponse>> {
repo::manga::get(&state.db, manga_id).await?;
let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
.await?
.ok_or(AppError::NotFound)?;
let pages = repo::page::list_for_chapter(&state.db, chapter.id).await?;

View File

@@ -1,6 +1,6 @@
use axum::extract::{Multipart, Path, Query, State};
use axum::http::StatusCode;
use axum::routing::{delete, get, post};
use axum::routing::{delete, get, post, put};
use axum::{Json, Router};
use serde::Deserialize;
use serde_json::json;
@@ -14,12 +14,14 @@ use crate::domain::patch::Patch;
use crate::domain::tag::TagRef;
use crate::error::{AppError, AppResult};
use crate::repo;
use crate::storage::StorageError;
use crate::upload::{parse_image, UploadedImage};
pub fn routes() -> Router<AppState> {
Router::new()
.route("/mangas", get(list).post(create))
.route("/mangas/:id", get(get_one).patch(update))
.route("/mangas/:id/cover", put(put_cover).delete(delete_cover))
.route("/mangas/:id/tags", post(attach_tag))
.route("/mangas/:id/tags/:tag_id", delete(detach_tag))
}
@@ -194,16 +196,14 @@ async fn create(
async fn update(
State(state): State<AppState>,
CurrentUser(_user): CurrentUser,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
Json(patch): Json<MangaPatch>,
) -> AppResult<Json<MangaDetail>> {
// TODO(auth): until uploaders are tracked (Phase 5), any signed-in
// user can edit any manga. Restrict to uploader + admin once that
// column lands.
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
if let Some(ref status) = patch.status {
let trimmed = status.trim();
@@ -259,6 +259,80 @@ async fn update(
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
/// `PUT /api/v1/mangas/:id/cover` is multipart/form-data with a single
/// required `cover` part containing image bytes. MIME is sniffed by
/// magic bytes (jpeg/png/webp/gif/avif); filename and Content-Type from
/// the client are ignored. Replaces any existing cover, deleting the
/// previous blob if its extension differs. Returns the refreshed
/// `MangaDetail`.
async fn put_cover(
State(state): State<AppState>,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
mut multipart: Multipart,
) -> AppResult<Json<MangaDetail>> {
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
let mut cover: Option<UploadedImage> = None;
while let Some(field) = next_field(&mut multipart).await? {
if field.name() == Some("cover") {
let bytes = read_field_bytes(field).await?.to_vec();
cover = Some(parse_image(bytes, state.upload.max_file_bytes, "cover")?);
}
}
let img = cover.ok_or_else(|| AppError::ValidationFailed {
message: "cover part is required".into(),
details: json!({ "cover": "required" }),
})?;
// Read the old key BEFORE writing so we can clean up an orphan if
// the extension changed (e.g., .png → .jpg). Same-extension is a
// `put` overwrite — no delete needed.
let old_key = repo::manga::get(&state.db, id).await?.cover_image_path;
let new_key = format!("mangas/{}/cover.{}", id, img.ext);
state.storage.put(&new_key, &img.bytes).await?;
if let Some(prev) = old_key.as_deref() {
if prev != new_key {
// Swallow NotFound — AppError maps it to a client 404,
// which would be wrong here. The DB row can outlive a
// manually-deleted blob.
match state.storage.delete(prev).await {
Ok(()) | Err(StorageError::NotFound) => {}
Err(e) => return Err(e.into()),
}
}
}
repo::manga::set_cover_image_path(&state.db, id, &new_key).await?;
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
/// `DELETE /api/v1/mangas/:id/cover` clears `cover_image_path` and
/// removes the blob. Idempotent: removing a non-existent cover succeeds
/// with the unchanged detail.
async fn delete_cover(
State(state): State<AppState>,
CurrentUser(user): CurrentUser,
Path(id): Path<Uuid>,
) -> AppResult<Json<MangaDetail>> {
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
require_can_edit(&state, id, user.id).await?;
if let Some(key) = repo::manga::get(&state.db, id).await?.cover_image_path {
match state.storage.delete(&key).await {
Ok(()) | Err(StorageError::NotFound) => {}
Err(e) => return Err(e.into()),
}
repo::manga::clear_cover_image_path(&state.db, id).await?;
}
Ok(Json(repo::manga::get_detail(&state.db, id).await?))
}
#[derive(Debug, Deserialize)]
pub struct AttachTagBody {
pub name: String,
@@ -270,6 +344,7 @@ async fn attach_tag(
Path(id): Path<Uuid>,
Json(body): Json<AttachTagBody>,
) -> AppResult<(StatusCode, Json<TagRef>)> {
validate_tag_name(&body.name)?;
if !repo::manga::exists(&state.db, id).await? {
return Err(AppError::NotFound);
}
@@ -316,6 +391,27 @@ async fn detach_tag(
}
}
/// Request-side validation for `POST /mangas/:id/tags` body. Mirrors
/// the repo-level cap in `repo::tag::upsert_by_name` (max 64 chars
/// after trim) but surfaces the failure at the handler boundary with
/// the same envelope shape other validations use.
fn validate_tag_name(name: &str) -> AppResult<()> {
let trimmed = name.trim();
if trimmed.is_empty() {
return Err(AppError::ValidationFailed {
message: "tag name cannot be empty".into(),
details: json!({ "name": "required" }),
});
}
if trimmed.chars().count() > 64 {
return Err(AppError::ValidationFailed {
message: "tag name too long".into(),
details: json!({ "name": "max 64 characters" }),
});
}
Ok(())
}
fn validate_new_manga(input: &NewManga) -> AppResult<()> {
if input.title.trim().is_empty() {
return Err(AppError::ValidationFailed {
@@ -335,6 +431,30 @@ fn validate_new_manga(input: &NewManga) -> AppResult<()> {
Ok(())
}
/// Authorisation gate for manga mutations. The manga is assumed to
/// exist (the caller runs [`repo::manga::exists`] first so a missing id
/// surfaces as `NotFound`, not `Forbidden`).
///
/// Rule: a non-NULL `uploaded_by` must match the current user. Legacy
/// rows with `uploaded_by IS NULL` (pre-migration-0011) are still
/// editable by any signed-in user — there's nobody to gate on yet, and
/// the historical-data note in 0011 acknowledges the gap. Once an
/// admin role lands the NULL case can flip to admin-only.
///
/// Returns `Forbidden` (not `NotFound`) on owner mismatch — mangas
/// are listable via `GET /mangas`, so existence isn't a secret and
/// the more accurate 403 is fine. This deliberately differs from
/// `repo::collection::require_owner`, which collapses both states to
/// `NotFound` because collections are private to a user and existence
/// itself is information worth hiding from non-owners.
async fn require_can_edit(state: &AppState, manga_id: Uuid, user_id: Uuid) -> AppResult<()> {
match repo::manga::uploaded_by(&state.db, manga_id).await? {
Some(owner) if owner != user_id => Err(AppError::Forbidden),
// Some(owner) == user_id (good) or None (legacy row, no owner).
_ => Ok(()),
}
}
async fn validate_genre_ids(state: &AppState, ids: &[Uuid]) -> AppResult<()> {
if ids.is_empty() {
return Ok(());

View File

@@ -1,3 +1,4 @@
pub mod admin;
pub mod auth;
pub mod authors;
pub mod bookmarks;
@@ -28,4 +29,5 @@ pub fn routes() -> Router<AppState> {
.merge(authors::routes())
.merge(collections::routes())
.merge(history::routes())
.merge(admin::routes())
}

View File

@@ -1,14 +1,28 @@
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use anyhow::Context;
use async_trait::async_trait;
use axum::extract::DefaultBodyLimit;
use axum::http::{HeaderName, HeaderValue, Method};
use axum::Router;
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;
use tokio_util::sync::CancellationToken;
use tower_http::cors::{AllowOrigin, CorsLayer};
use tower_http::trace::TraceLayer;
use crate::config::{AuthConfig, Config, UploadConfig};
use crate::auth::rate_limit::AuthRateLimiter;
use crate::config::{AuthConfig, Config, CrawlerConfig, UploadConfig};
use crate::crawler::browser_manager::{self, BrowserManager};
use crate::crawler::content::{self, SyncOutcome};
use crate::crawler::daemon::{self, ChapterDispatcher, DaemonConfig, MetadataPass};
use crate::crawler::jobs::JobPayload;
use crate::crawler::pipeline::{self, MetadataStats};
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::DownloadAllowlist;
use crate::crawler::session;
use crate::repo;
use crate::storage::{LocalStorage, Storage};
#[derive(Clone)]
@@ -17,24 +31,292 @@ pub struct AppState {
pub storage: Arc<dyn Storage>,
pub auth: AuthConfig,
pub upload: UploadConfig,
/// Shared rate limiter guarding the `/auth/*` mutation endpoints.
/// One instance per AppState so tests stay isolated across the
/// same process.
pub auth_limiter: Arc<AuthRateLimiter>,
}
pub async fn build(config: Config) -> anyhow::Result<Router> {
/// Bundle returned by [`build`]. The router is what `axum::serve` consumes;
/// the daemon (when enabled) outlives the HTTP server and is awaited via
/// [`AppHandle::shutdown`] after the listener has finished gracefully.
pub struct AppHandle {
pub router: Router,
pub daemon: Option<daemon::DaemonHandle>,
}
impl AppHandle {
pub async fn shutdown(self) {
if let Some(d) = self.daemon {
d.shutdown().await;
}
}
}
pub async fn build(config: Config) -> anyhow::Result<AppHandle> {
let db = PgPoolOptions::new()
.max_connections(10)
.connect(&config.database_url)
.await?;
sqlx::migrate!("./migrations").run(&db).await?;
if let Some((username, password)) = config.admin_bootstrap.as_ref() {
repo::user::bootstrap_admin(&db, username, password)
.await
.context("bootstrap_admin from ADMIN_USERNAME/ADMIN_PASSWORD env")?;
tracing::info!(admin_username = %username, "admin bootstrap ensured");
}
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(config.storage_dir.clone()));
let daemon = if config.crawler.daemon_enabled {
Some(spawn_crawler_daemon(db.clone(), Arc::clone(&storage), &config.crawler).await?)
} else {
tracing::info!("crawler daemon disabled (CRAWLER_DAEMON=false)");
None
};
let auth_limiter = Arc::new(AuthRateLimiter::new(config.auth.rate_limit));
let state = AppState {
db,
storage,
auth: config.auth.clone(),
upload: config.upload.clone(),
auth_limiter,
};
Ok(router(state).layer(cors_layer(&config.cors_allowed_origins)))
let router = router(state).layer(cors_layer(&config.cors_allowed_origins));
Ok(AppHandle { router, daemon })
}
async fn spawn_crawler_daemon(
db: PgPool,
storage: Arc<dyn Storage>,
cfg: &CrawlerConfig,
) -> anyhow::Result<daemon::DaemonHandle> {
// Reqwest client with cookie jar pre-seeded so CDN image fetches
// include PHPSESSID. Same shape as bin/crawler.rs main().
let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
if let (Some(sid), Some(domain), Some(start_url)) =
(&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url)
{
let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
let seed_url = reqwest::Url::parse(start_url)
.context("parse CRAWLER_START_URL for cookie seed")?;
cookie_jar.add_cookie_str(&cookie_str, &seed_url);
}
let mut http_builder = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(30))
.no_proxy()
.cookie_provider(cookie_jar);
if let Some(ua) = &cfg.user_agent {
http_builder = http_builder.user_agent(ua);
}
if let Some(proxy) = &cfg.proxy {
http_builder = http_builder
.proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy: {proxy}"))?);
}
let http = http_builder.build().context("build crawler reqwest")?;
let mut rate = HostRateLimiters::new(std::time::Duration::from_millis(cfg.rate_ms));
if let Some(host) = &cfg.cdn_host {
rate = rate.with_override(host, std::time::Duration::from_millis(cfg.cdn_rate_ms));
}
let rate = Arc::new(rate);
// Browser manager. on_launch re-injects PHPSESSID on every fresh
// chromium spawn so an idle teardown followed by re-launch stays
// authenticated without operator action.
let mut launch_opts = cfg.browser.clone();
if let Some(proxy) = &cfg.proxy {
launch_opts.extra_args.push(format!("--proxy-server={proxy}"));
}
let on_launch = match (&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url) {
(Some(sid), Some(domain), Some(start_url)) => {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url.clone();
let on_launch: browser_manager::OnLaunch = Arc::new(move |browser| {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url.clone();
Box::pin(async move {
session::inject_phpsessid(&browser, &sid, &domain)
.await
.context("on_launch: inject_phpsessid")?;
session::verify_session(&browser, &start_url)
.await
.context("on_launch: verify_session")?;
Ok(())
})
});
on_launch
}
_ => browser_manager::noop_on_launch(),
};
let browser_manager = BrowserManager::new(launch_opts, cfg.idle_timeout, on_launch);
let session_expired = Arc::new(AtomicBool::new(false));
let metadata_pass: Option<Arc<dyn MetadataPass>> = cfg.start_url.as_ref().map(|url| {
let m: Arc<dyn MetadataPass> = Arc::new(RealMetadataPass {
browser_manager: Arc::clone(&browser_manager),
db: db.clone(),
storage: Arc::clone(&storage),
http: http.clone(),
rate: Arc::clone(&rate),
start_url: url.clone(),
download_allowlist: cfg.download_allowlist.clone(),
max_image_bytes: cfg.max_image_bytes,
});
m
});
let dispatcher: Arc<dyn ChapterDispatcher> = Arc::new(RealChapterDispatcher {
browser_manager: Arc::clone(&browser_manager),
db: db.clone(),
storage: Arc::clone(&storage),
http,
rate: Arc::clone(&rate),
download_allowlist: cfg.download_allowlist.clone(),
max_image_bytes: cfg.max_image_bytes,
});
// Shared cancellation: daemon shutdown cancels the BrowserManager's
// idle reaper too. Reaper itself is added to the daemon's extra_tasks
// so DaemonHandle::shutdown awaits its completion.
let cancel = CancellationToken::new();
let reaper_task = browser_manager::spawn_idle_reaper(
Arc::clone(&browser_manager),
cancel.clone(),
);
// Also close the browser explicitly on shutdown so we don't rely on
// kill-on-drop when other Arc<Browser> holders may still exist.
let shutdown_task = {
let cancel = cancel.clone();
let mgr = Arc::clone(&browser_manager);
tokio::spawn(async move {
cancel.cancelled().await;
mgr.shutdown().await;
})
};
let daemon_handle = daemon::spawn(
db,
cancel,
DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers: cfg.chapter_workers,
daily_at: cfg.daily_at,
tz: cfg.tz,
retention_days: cfg.retention_days,
session_expired,
extra_tasks: vec![reaper_task, shutdown_task],
},
);
Ok(daemon_handle)
}
// Real impls of the daemon traits, owning the browser manager + I/O. Kept
// in app.rs because they need the same builder-side env wiring that
// AppState gets — the daemon module itself stays free of reqwest / storage
// details so its tests don't pull them in.
struct RealMetadataPass {
browser_manager: Arc<BrowserManager>,
db: PgPool,
storage: Arc<dyn Storage>,
http: reqwest::Client,
rate: Arc<HostRateLimiters>,
start_url: String,
download_allowlist: DownloadAllowlist,
max_image_bytes: usize,
}
#[async_trait]
impl MetadataPass for RealMetadataPass {
async fn run(&self) -> anyhow::Result<MetadataStats> {
let result = pipeline::run_metadata_pass(
&self.browser_manager,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
&self.start_url,
0,
false,
&self.download_allowlist,
self.max_image_bytes,
)
.await;
if let Err(e) = &result {
if crate::crawler::nav::anyhow_looks_browser_dead(e) {
self.browser_manager.invalidate().await;
}
}
result
}
}
struct RealChapterDispatcher {
browser_manager: Arc<BrowserManager>,
db: PgPool,
storage: Arc<dyn Storage>,
http: reqwest::Client,
rate: Arc<HostRateLimiters>,
download_allowlist: DownloadAllowlist,
max_image_bytes: usize,
}
#[async_trait]
impl ChapterDispatcher for RealChapterDispatcher {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
match payload {
JobPayload::SyncChapterContent {
source_id: _,
chapter_id,
source_chapter_key: _,
} => {
let row = repo::chapter::dispatch_target(&self.db, chapter_id)
.await
.context("look up chapter for dispatch")?;
let Some((manga_id, source_url)) = row else {
// Chapter (or its source row) is gone — ack done.
return Ok(SyncOutcome::Skipped);
};
let lease = self.browser_manager.acquire().await?;
let result = content::sync_chapter_content(
&lease,
&self.db,
self.storage.as_ref(),
&self.http,
&self.rate,
chapter_id,
manga_id,
&source_url,
false,
&self.download_allowlist,
self.max_image_bytes,
)
.await;
drop(lease);
match result {
Ok(outcome) => Ok(outcome),
Err(e) => {
if crate::crawler::nav::anyhow_looks_browser_dead(&e) {
self.browser_manager.invalidate().await;
}
Err(e)
}
}
}
// Other payload kinds aren't dispatched by this daemon yet —
// SyncManga / SyncChapterList are handled inline by the cron's
// metadata pass.
_ => Ok(SyncOutcome::Skipped),
}
}
}
/// Build a router from a pre-assembled state. Used by integration tests

View File

@@ -1,11 +1,19 @@
//! `CurrentUser` axum extractor.
//! Auth extractors.
//!
//! Resolves a request to a logged-in user by trying, in order:
//! 1. a `mangalord_session` cookie (session lookup by `sha256(value)`);
//! 2. an `Authorization: Bearer <token>` header (api_token lookup).
//! Three extractors are available, in increasing strictness:
//!
//! Both paths look up by hash, never by raw value. Failure to resolve
//! either way returns 401 via `AppError::Unauthenticated`.
//! - [`CurrentUser`] — accepts either a session cookie or an
//! `Authorization: Bearer <token>` header. Used by ordinary
//! authenticated endpoints where bot tokens are first-class clients.
//! - [`CurrentSessionUser`] — accepts only the session cookie. Used as
//! the substrate for admin extraction so bot tokens cannot authenticate
//! as the admin (see [`RequireAdmin`]).
//! - [`RequireAdmin`] — composes over [`CurrentSessionUser`] and
//! additionally requires `user.is_admin`. Returns 403 for
//! authenticated-but-not-admin, 401 otherwise.
//!
//! All lookups go by `sha256(raw_token)` — the raw value is never stored
//! in the database.
use axum::async_trait;
use axum::extract::FromRequestParts;
@@ -61,3 +69,54 @@ impl FromRequestParts<AppState> for CurrentUser {
Err(AppError::Unauthenticated)
}
}
/// Cookie-only authentication. Bot/API tokens are explicitly NOT accepted
/// here — this is the substrate for [`RequireAdmin`] and exists precisely
/// to keep admin authority out of bearer-token reach.
pub struct CurrentSessionUser(pub User);
#[async_trait]
impl FromRequestParts<AppState> for CurrentSessionUser {
type Rejection = AppError;
async fn from_request_parts(
parts: &mut Parts,
state: &AppState,
) -> Result<Self, Self::Rejection> {
let jar = CookieJar::from_headers(&parts.headers);
let cookie = jar
.get(SESSION_COOKIE_NAME)
.ok_or(AppError::Unauthenticated)?;
let hash = hash_token(cookie.value());
let session = repo::session::find_active(&state.db, &hash)
.await?
.ok_or(AppError::Unauthenticated)?;
let user = repo::user::find_by_id(&state.db, session.user_id)
.await?
.ok_or(AppError::Unauthenticated)?;
Ok(CurrentSessionUser(user))
}
}
/// Admin-only. Composes over [`CurrentSessionUser`] so bot tokens are
/// rejected at the auth step (401) rather than the role step (403).
/// The user row is re-read every request, so demotion takes effect on
/// the very next call without needing to purge sessions.
pub struct RequireAdmin(pub User);
#[async_trait]
impl FromRequestParts<AppState> for RequireAdmin {
type Rejection = AppError;
async fn from_request_parts(
parts: &mut Parts,
state: &AppState,
) -> Result<Self, Self::Rejection> {
let CurrentSessionUser(user) =
CurrentSessionUser::from_request_parts(parts, state).await?;
if !user.is_admin {
return Err(AppError::Forbidden);
}
Ok(RequireAdmin(user))
}
}

View File

@@ -7,4 +7,5 @@
pub mod extractor;
pub mod password;
pub mod rate_limit;
pub mod token;

View File

@@ -0,0 +1,179 @@
//! Per-process token-bucket rate limiter for the auth endpoints.
//!
//! Protects `/auth/login`, `/auth/register`, and `/auth/me/password`
//! from credential stuffing / password spraying / username probing.
//!
//! The current deploy puts SvelteKit's hooks.server.ts proxy in front
//! of axum without forwarding the original client IP (no
//! `X-Forwarded-For`), so per-IP buckets would all collapse to the
//! proxy container's address. Until the proxy learns to forward the
//! peer address, a single global bucket gives equivalent protection
//! against mass-attack patterns and trades a small DoS surface
//! (legitimate users sharing the limit) for simplicity.
//!
//! Each `AppState` carries its own [`AuthRateLimiter`] instance, so
//! tests run in isolated buckets and won't bleed across `#[sqlx::test]`
//! cases that share a process.
use std::sync::Mutex;
use std::time::Instant;
/// Tunable limits. `per_sec == 0` disables the limiter — used by the
/// test harness and by anyone who wants to opt out via env config.
#[derive(Clone, Copy, Debug)]
pub struct RateLimitConfig {
pub per_sec: u32,
pub burst: u32,
}
impl Default for RateLimitConfig {
/// Disabled by default. The production `AuthConfig::from_env`
/// overrides to a real limit; the test harness keeps the default
/// so existing tests don't flake against shared buckets.
fn default() -> Self {
Self {
per_sec: 0,
burst: 0,
}
}
}
/// Production defaults: 5 requests/sec sustained, 10-request burst.
/// Tight enough to make brute force impractical, loose enough that a
/// real user mistyping their password three times in a row doesn't
/// hit it.
pub const PRODUCTION_PER_SEC: u32 = 5;
pub const PRODUCTION_BURST: u32 = 10;
struct Bucket {
tokens: f64,
last_refill: Instant,
}
/// Outcome of [`AuthRateLimiter::try_acquire`]. When `Denied`, the
/// caller can use `retry_after_secs` for a `Retry-After: N` header
/// (RFC 6585 §4) so well-behaved clients back off correctly rather
/// than retrying in a tight loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AcquireResult {
Allowed,
Denied { retry_after_secs: u64 },
}
/// Single-bucket token-bucket limiter. `try_acquire` is cheap (one
/// mutex acquire, no allocations) so the auth path doesn't pay a real
/// cost for the check.
pub struct AuthRateLimiter {
cfg: RateLimitConfig,
bucket: Mutex<Bucket>,
}
impl AuthRateLimiter {
pub fn new(cfg: RateLimitConfig) -> Self {
Self {
cfg,
bucket: Mutex::new(Bucket {
tokens: cfg.burst as f64,
last_refill: Instant::now(),
}),
}
}
/// Consume one token if available. Returns `Denied` with a
/// rounded-up seconds-until-refill so the caller can emit a
/// `Retry-After` header.
pub fn try_acquire(&self) -> AcquireResult {
if self.cfg.per_sec == 0 {
return AcquireResult::Allowed;
}
let now = Instant::now();
let mut bucket = self.bucket.lock().expect("rate limiter mutex poisoned");
let elapsed = now.duration_since(bucket.last_refill).as_secs_f64();
bucket.tokens =
(bucket.tokens + elapsed * f64::from(self.cfg.per_sec)).min(f64::from(self.cfg.burst));
bucket.last_refill = now;
if bucket.tokens >= 1.0 {
bucket.tokens -= 1.0;
AcquireResult::Allowed
} else {
// ceil((1 - tokens) / per_sec), minimum 1 — a `Retry-After: 0`
// would tell clients to retry immediately, which is what we're
// actively trying to discourage.
let deficit = 1.0 - bucket.tokens;
let wait_secs = (deficit / f64::from(self.cfg.per_sec)).ceil() as u64;
AcquireResult::Denied {
retry_after_secs: wait_secs.max(1),
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn disabled_limiter_always_allows() {
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 0,
burst: 0,
});
for _ in 0..1000 {
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
}
}
#[test]
fn burst_lets_through_initial_window_then_blocks() {
// 0 refill, burst 3 → first three pass, fourth blocks.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 3,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
match rl.try_acquire() {
AcquireResult::Denied { retry_after_secs } => {
// Bucket is at ~0 tokens, refill rate 1/sec → ~1s wait.
assert!(
retry_after_secs >= 1,
"retry_after must be at least 1s, got {retry_after_secs}"
);
}
AcquireResult::Allowed => panic!("fourth request must be denied"),
}
}
#[test]
fn tokens_refill_over_time() {
// 10/sec → after ~120ms we should have at least one token back.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 10,
burst: 1,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert!(matches!(rl.try_acquire(), AcquireResult::Denied { .. }));
std::thread::sleep(std::time::Duration::from_millis(150));
assert_eq!(
rl.try_acquire(),
AcquireResult::Allowed,
"token should have refilled"
);
}
#[test]
fn retry_after_scales_inversely_with_refill_rate() {
// 1/sec → wait ~1s after burst exhausted.
// 10/sec → wait <1s, but we clamp to a minimum of 1s.
let slow = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 1,
});
slow.try_acquire();
match slow.try_acquire() {
AcquireResult::Denied { retry_after_secs } => assert_eq!(retry_after_secs, 1),
_ => panic!("expected Denied"),
}
}
}

449
backend/src/bin/crawler.rs Normal file
View File

@@ -0,0 +1,449 @@
//! Crawler binary.
//!
//! Now an ops escape hatch sitting alongside the in-process daemon: walks
//! the source's manga listing (all pages), fetches each manga's metadata +
//! chapter list, downloads covers, reconciles chapters — and then, for any
//! chapter belonging to a bookmarked manga whose `page_count` is still 0,
//! fetches the chapter pages inline. The daemon does the same work through
//! `crawler_jobs`; the CLI is kept around for force-refetches and manual
//! backfills.
//!
//! Configuration mirrors the daemon's `CRAWLER_*` env vars (see
//! `crate::config::CrawlerConfig`) plus the CLI-only:
//! - **Start URL**: first CLI positional arg, else `$CRAWLER_START_URL`.
//! - **Skip chapters / chapter content / force re-fetch / keep browser**:
//! `CRAWLER_SKIP_CHAPTERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`,
//! `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_KEEP_BROWSER_OPEN`.
//! - **Limit**: `CRAWLER_LIMIT` (max manga detail fetches per run).
//!
//! See `crawler::pipeline::run_metadata_pass` for the shared metadata
//! flow.
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Duration;
use anyhow::{anyhow, Context};
use futures_util::stream::{self, StreamExt};
use mangalord::crawler::browser::{BrowserMode, LaunchOptions};
use mangalord::crawler::browser_manager::{self, BrowserManager};
use mangalord::crawler::content::{self, SyncOutcome};
use mangalord::crawler::pipeline;
use mangalord::crawler::rate_limit::HostRateLimiters;
use mangalord::crawler::session;
use mangalord::storage::{LocalStorage, Storage};
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;
use tracing_subscriber::EnvFilter;
use uuid::Uuid;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
dotenvy::dotenv().ok();
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| {
"info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off"
.into()
}),
)
.init();
let start_url = resolve_start_url()?;
let database_url = std::env::var("DATABASE_URL")
.map_err(|_| anyhow!("DATABASE_URL must be set"))?;
let storage_dir: PathBuf = std::env::var("STORAGE_DIR")
.unwrap_or_else(|_| "./data/storage".to_string())
.into();
let rate_ms = env_u64("CRAWLER_RATE_MS", 1000);
let cdn_host = std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty());
let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms);
let limit = env_u64("CRAWLER_LIMIT", 0) as usize;
let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false);
let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false);
let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize;
let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false);
let phpsessid = std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty());
let cookie_domain = std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty())
.or_else(|| session::registrable_domain(&start_url));
let user_agent = std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty());
let proxy_url = std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty());
let keep_browser_open = env_bool("CRAWLER_KEEP_BROWSER_OPEN", false);
let db = PgPoolOptions::new()
.max_connections(5)
.connect(&database_url)
.await
.context("connect to database")?;
sqlx::migrate!("./migrations").run(&db).await?;
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(&storage_dir));
let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
let seed_url =
reqwest::Url::parse(&start_url).context("parse start URL for cookie seed")?;
cookie_jar.add_cookie_str(&cookie_str, &seed_url);
tracing::info!(domain, "seeded PHPSESSID into reqwest cookie jar");
}
let mut http_builder = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.no_proxy()
.cookie_provider(cookie_jar);
if let Some(ua) = &user_agent {
http_builder = http_builder.user_agent(ua);
}
if let Some(proxy) = &proxy_url {
http_builder = http_builder
.proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy URL: {proxy}"))?);
}
let http = http_builder.build().context("build http client")?;
let mut options = LaunchOptions::from_env();
if let Some(proxy) = &proxy_url {
options.extra_args.push(format!("--proxy-server={proxy}"));
}
let keep_open = match (keep_browser_open, options.mode) {
(true, BrowserMode::Headed) => true,
(true, BrowserMode::Headless) => {
tracing::warn!(
"CRAWLER_KEEP_BROWSER_OPEN ignored in headless mode (no window to inspect)"
);
false
}
_ => false,
};
tracing::info!(
?options,
%start_url,
rate_ms,
cdn_host = ?cdn_host,
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content,
chapter_workers,
force_refetch_chapters,
phpsessid_set = phpsessid.is_some(),
cookie_domain = ?cookie_domain,
user_agent = ?user_agent,
proxy = ?proxy_url,
keep_open,
storage_dir = %storage_dir.display(),
"starting crawler"
);
// BrowserManager with idle_timeout = ZERO so the CLI keeps Chromium
// alive for the entire run — same lifecycle as the old direct
// `browser::launch()` flow. on_launch re-injects PHPSESSID + runs the
// session probe; bad cookies fail fast before any real work happens.
let on_launch: browser_manager::OnLaunch = match (&phpsessid, &cookie_domain) {
(Some(sid), Some(domain)) => {
let sid = sid.clone();
let domain = domain.clone();
let start_url_clone = start_url.clone();
Arc::new(move |browser| {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url_clone.clone();
Box::pin(async move {
session::inject_phpsessid(&browser, &sid, &domain)
.await
.context("inject_phpsessid")?;
session::verify_session(&browser, &start_url)
.await
.context("verify_session")?;
Ok(())
})
})
}
_ => browser_manager::noop_on_launch(),
};
let session_ready = phpsessid.is_some() && cookie_domain.is_some();
let manager = BrowserManager::new(options, Duration::ZERO, on_launch);
let result = run(
Arc::clone(&manager),
&db,
Arc::clone(&storage),
&http,
&start_url,
rate_ms,
cdn_host.as_deref(),
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content || !session_ready,
chapter_workers,
force_refetch_chapters,
)
.await;
if keep_open {
tracing::info!(
"crawler finished; browser kept open. Press Ctrl+C to close and exit."
);
let _ = tokio::signal::ctrl_c().await;
tracing::info!("Ctrl+C received; closing browser");
}
manager.shutdown().await;
result
}
#[allow(clippy::too_many_arguments)]
async fn run(
manager: Arc<BrowserManager>,
db: &PgPool,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
start_url: &str,
rate_ms: u64,
cdn_host: Option<&str>,
cdn_rate_ms: u64,
limit: usize,
skip_chapters: bool,
skip_chapter_content: bool,
chapter_workers: usize,
force_refetch_chapters: bool,
) -> anyhow::Result<()> {
let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms));
if let Some(host) = cdn_host {
rate = rate.with_override(host, Duration::from_millis(cdn_rate_ms));
}
let rate = Arc::new(rate);
// SSRF defence: only download from the catalog host + CDN host
// (plus optional CRAWLER_DOWNLOAD_ALLOWLIST extras), and cap
// single-image downloads at CRAWLER_MAX_IMAGE_BYTES bytes.
// CRAWLER_ALLOW_ANY_HOST=true short-circuits the host check for
// sharded-CDN sources; private-IP and scheme guards still apply.
let allowlist = if env_bool("CRAWLER_ALLOW_ANY_HOST", false) {
mangalord::crawler::safety::DownloadAllowlist::allow_any()
} else {
let mut allow = mangalord::crawler::safety::DownloadAllowlist::new();
if let Ok(parsed) = reqwest::Url::parse(start_url) {
if let Some(h) = parsed.host_str() {
allow = allow.allow(h);
}
}
if let Some(host) = cdn_host {
allow = allow.allow(host);
}
if let Ok(extras) = std::env::var("CRAWLER_DOWNLOAD_ALLOWLIST") {
for piece in extras.split(',') {
let trimmed = piece.trim();
if !trimmed.is_empty() {
allow = allow.allow(trimmed);
}
}
}
allow
};
let max_image_bytes: usize = std::env::var("CRAWLER_MAX_IMAGE_BYTES")
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(mangalord::crawler::safety::DEFAULT_MAX_IMAGE_BYTES);
let allowlist = Arc::new(allowlist);
let stats = pipeline::run_metadata_pass(
manager.as_ref(),
db,
storage.as_ref(),
http,
rate.as_ref(),
start_url,
limit,
skip_chapters,
allowlist.as_ref(),
max_image_bytes,
)
.await?;
tracing::info!(?stats, "metadata pass complete");
if !skip_chapter_content {
sync_bookmarked_chapter_content(
Arc::clone(&manager),
db,
Arc::clone(&storage),
http,
Arc::clone(&rate),
"target",
chapter_workers,
force_refetch_chapters,
Arc::clone(&allowlist),
max_image_bytes,
)
.await?;
}
Ok(())
}
/// Find every chapter whose manga is bookmarked by at least one user and
/// that hasn't been content-synced yet, then fan them out across `workers`
/// concurrent tasks. Same as before except the browser comes from a
/// BrowserManager lease so it interleaves cleanly with the metadata pass.
///
/// A `SessionExpired` result aborts the phase.
#[allow(clippy::too_many_arguments)]
async fn sync_bookmarked_chapter_content(
manager: Arc<BrowserManager>,
db: &PgPool,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
rate: Arc<HostRateLimiters>,
source_id: &str,
workers: usize,
force_refetch: bool,
allowlist: Arc<mangalord::crawler::safety::DownloadAllowlist>,
max_image_bytes: usize,
) -> anyhow::Result<()> {
let pending: Vec<(Uuid, Uuid, String)> = sqlx::query_as(
r#"
SELECT id, manga_id, source_url FROM (
SELECT DISTINCT c.id, c.manga_id, c.created_at, cs.source_url
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE cs.source_id = $1
AND cs.dropped_at IS NULL
AND (c.page_count = 0 OR $2)
) sub
ORDER BY manga_id, created_at ASC
"#,
)
.bind(source_id)
.bind(force_refetch)
.fetch_all(db)
.await
.context("query pending chapter content")?;
if pending.is_empty() {
tracing::info!("chapter content: nothing pending");
return Ok(());
}
tracing::info!(count = pending.len(), workers, "chapter content phase starting");
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let stats = std::sync::Mutex::new(WorkerStats::default());
stream::iter(pending.into_iter())
.for_each_concurrent(workers.max(1), |(chapter_id, manga_id, source_url)| {
let session_expired = Arc::clone(&session_expired);
let storage = Arc::clone(&storage);
let rate = Arc::clone(&rate);
let manager = Arc::clone(&manager);
let allowlist = Arc::clone(&allowlist);
let stats = &stats;
async move {
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
return;
}
let lease = match manager.acquire().await {
Ok(l) => l,
Err(e) => {
tracing::error!(%chapter_id, error = ?e, "browser acquire failed");
let mut s = stats.lock().unwrap();
s.failed += 1;
return;
}
};
let outcome = content::sync_chapter_content(
&lease,
db,
storage.as_ref(),
http,
rate.as_ref(),
chapter_id,
manga_id,
&source_url,
force_refetch,
allowlist.as_ref(),
max_image_bytes,
)
.await;
drop(lease);
let mut s = stats.lock().unwrap();
match outcome {
Ok(SyncOutcome::Fetched { pages }) => {
tracing::info!(%chapter_id, pages, "chapter content fetched");
s.fetched += 1;
}
Ok(SyncOutcome::Skipped) => s.skipped += 1,
Ok(SyncOutcome::SessionExpired) => {
tracing::error!(
%chapter_id,
"session expired mid-run — refresh CRAWLER_PHPSESSID and re-run"
);
session_expired
.store(true, std::sync::atomic::Ordering::Relaxed);
}
Err(e) => {
tracing::warn!(
%chapter_id, error = ?e, "chapter content sync failed"
);
s.failed += 1;
}
}
}
})
.await;
let total = stats.into_inner().unwrap();
tracing::info!(
fetched = total.fetched,
skipped = total.skipped,
failed = total.failed,
"chapter content phase done"
);
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
anyhow::bail!("session expired during chapter content phase");
}
Ok(())
}
#[derive(Default, Clone, Copy)]
struct WorkerStats {
fetched: usize,
skipped: usize,
failed: usize,
}
fn resolve_start_url() -> anyhow::Result<String> {
if let Some(arg) = std::env::args().nth(1) {
return Ok(arg);
}
std::env::var("CRAWLER_START_URL").map_err(|_| {
anyhow!(
"start URL is required — pass as first CLI arg or set $CRAWLER_START_URL"
)
})
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
fn env_bool(name: &str, default: bool) -> bool {
match std::env::var(name).ok().as_deref() {
Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,
Some("0") | Some("false") | Some("FALSE") | Some("no") => false,
_ => default,
}
}

View File

@@ -1,10 +1,24 @@
use std::path::PathBuf;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use crate::crawler::browser::LaunchOptions;
use crate::crawler::safety::{DownloadAllowlist, DEFAULT_MAX_IMAGE_BYTES};
#[derive(Clone, Debug)]
pub struct AuthConfig {
pub cookie_secure: bool,
pub cookie_domain: Option<String>,
pub session_ttl_days: i64,
pub rate_limit: crate::auth::rate_limit::RateLimitConfig,
/// When `false`, `POST /auth/register` returns 403
/// `registration_disabled` and the frontend hides its register
/// affordance. Admins can still mint accounts via
/// `POST /admin/users`. Defaults to `true` (open registration)
/// for backward compatibility.
pub allow_self_register: bool,
}
impl Default for AuthConfig {
@@ -13,6 +27,12 @@ impl Default for AuthConfig {
cookie_secure: true,
cookie_domain: None,
session_ttl_days: 30,
// Disabled by default so the test harness inherits a
// non-throttling limiter. Production `from_env` overrides
// to the [`PRODUCTION_PER_SEC`]/[`PRODUCTION_BURST`]
// defaults.
rate_limit: crate::auth::rate_limit::RateLimitConfig::default(),
allow_self_register: true,
}
}
}
@@ -45,6 +65,70 @@ pub struct Config {
pub auth: AuthConfig,
pub upload: UploadConfig,
pub cors_allowed_origins: Vec<String>,
pub crawler: CrawlerConfig,
/// `(username, password)` for the admin user provisioned at startup
/// when both `ADMIN_USERNAME` and `ADMIN_PASSWORD` are set. `None`
/// skips the bootstrap entirely. See `repo::user::bootstrap_admin`
/// for the create-vs-promote semantics — notably the password here
/// is used only when creating a new row, never to overwrite an
/// existing one.
pub admin_bootstrap: Option<(String, String)>,
}
/// All crawler-daemon knobs read from env. Mirrors the env vars the
/// `bin/crawler` binary already reads, plus the new daemon-only knobs
/// (daily_at, tz, idle_timeout, retention_days, daemon_enabled).
///
/// `daemon_enabled = false` skips the daemon spawn entirely — used by
/// integration tests and dev runs that don't want background activity.
#[derive(Clone, Debug)]
pub struct CrawlerConfig {
pub daemon_enabled: bool,
pub daily_at: NaiveTime,
pub tz: Tz,
pub idle_timeout: Duration,
pub chapter_workers: usize,
pub retention_days: u32,
pub start_url: Option<String>,
pub rate_ms: u64,
pub cdn_host: Option<String>,
pub cdn_rate_ms: u64,
pub phpsessid: Option<String>,
pub cookie_domain: Option<String>,
pub user_agent: Option<String>,
pub proxy: Option<String>,
pub browser: LaunchOptions,
/// Hosts the crawler is allowed to download images / covers from.
/// Always seeded with the host of `start_url` and (when set) the
/// configured `cdn_host`. Additional hosts can be added via
/// `CRAWLER_DOWNLOAD_ALLOWLIST` (comma-separated).
pub download_allowlist: DownloadAllowlist,
/// Hard upper bound on a single image download. Defaults to 32 MiB.
pub max_image_bytes: usize,
}
impl Default for CrawlerConfig {
fn default() -> Self {
Self {
daemon_enabled: false,
daily_at: NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
tz: Tz::UTC,
idle_timeout: Duration::from_secs(600),
chapter_workers: 1,
retention_days: 7,
start_url: None,
rate_ms: 1000,
cdn_host: None,
cdn_rate_ms: 1000,
phpsessid: None,
cookie_domain: None,
user_agent: None,
proxy: None,
browser: LaunchOptions::headless(),
download_allowlist: DownloadAllowlist::new(),
max_image_bytes: DEFAULT_MAX_IMAGE_BYTES,
}
}
}
impl Config {
@@ -63,6 +147,17 @@ impl Config {
.ok()
.filter(|s| !s.is_empty()),
session_ttl_days: env_i64("SESSION_TTL_DAYS", 30),
rate_limit: crate::auth::rate_limit::RateLimitConfig {
per_sec: env_u64(
"AUTH_RATE_PER_SEC",
crate::auth::rate_limit::PRODUCTION_PER_SEC.into(),
) as u32,
burst: env_u64(
"AUTH_RATE_BURST",
crate::auth::rate_limit::PRODUCTION_BURST.into(),
) as u32,
},
allow_self_register: env_bool("ALLOW_SELF_REGISTER", true),
},
upload: UploadConfig {
max_request_bytes: env_usize("MAX_REQUEST_BYTES", 200 * 1024 * 1024),
@@ -77,10 +172,122 @@ impl Config {
.collect()
})
.unwrap_or_default(),
crawler: CrawlerConfig::from_env()?,
admin_bootstrap: admin_bootstrap_from_env(),
})
}
}
/// Returns `Some((username, password))` only when BOTH `ADMIN_USERNAME`
/// and `ADMIN_PASSWORD` are set and non-empty. Half-set configuration is
/// treated as "no bootstrap" rather than a hard error, so an operator
/// can comment out one env var without crashing the server.
fn admin_bootstrap_from_env() -> Option<(String, String)> {
let username = std::env::var("ADMIN_USERNAME").ok().filter(|s| !s.is_empty())?;
let password = std::env::var("ADMIN_PASSWORD").ok().filter(|s| !s.is_empty())?;
Some((username, password))
}
impl CrawlerConfig {
pub fn from_env() -> anyhow::Result<Self> {
// Parse CRAWLER_DAILY_AT (HH:MM, 24h). Invalid → fail fast.
let daily_at = match std::env::var("CRAWLER_DAILY_AT").ok().as_deref() {
None | Some("") => NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
Some(raw) => NaiveTime::parse_from_str(raw, "%H:%M").map_err(|e| {
anyhow::anyhow!("CRAWLER_DAILY_AT must be HH:MM (got {raw:?}): {e}")
})?,
};
let tz: Tz = match std::env::var("CRAWLER_TZ").ok().as_deref() {
None | Some("") => Tz::UTC,
Some(raw) => raw
.parse()
.map_err(|e| anyhow::anyhow!("CRAWLER_TZ must be a valid IANA TZ (got {raw:?}): {e}"))?,
};
let start_url = std::env::var("CRAWLER_START_URL")
.ok()
.filter(|s| !s.trim().is_empty());
let cdn_host = std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty());
let download_allowlist =
build_download_allowlist(start_url.as_deref(), cdn_host.as_deref());
Ok(Self {
daemon_enabled: env_bool("CRAWLER_DAEMON", true),
daily_at,
tz,
idle_timeout: Duration::from_secs(env_u64("CRAWLER_IDLE_TIMEOUT_S", 600)),
chapter_workers: env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize,
retention_days: env_u64("CRAWLER_JOB_RETENTION_DAYS", 7) as u32,
start_url,
rate_ms: env_u64("CRAWLER_RATE_MS", 1000),
cdn_host,
cdn_rate_ms: env_u64("CRAWLER_CDN_RATE_MS", env_u64("CRAWLER_RATE_MS", 1000)),
phpsessid: std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty()),
cookie_domain: std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty()),
user_agent: std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty()),
proxy: std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty()),
browser: LaunchOptions::from_env(),
download_allowlist,
max_image_bytes: env_usize("CRAWLER_MAX_IMAGE_BYTES", DEFAULT_MAX_IMAGE_BYTES),
})
}
}
/// Build the download allowlist from env. Always includes
/// `CRAWLER_START_URL`'s host (so the crawler can fetch covers from
/// the catalog itself) and `CRAWLER_CDN_HOST` when set. Additional
/// hosts can be supplied via `CRAWLER_DOWNLOAD_ALLOWLIST` (comma-
/// separated). Empty by default — meaning the crawler refuses to
/// download anything when no source is configured, which is the safe
/// fail-closed posture.
///
/// `CRAWLER_ALLOW_ANY_HOST=true` short-circuits the host enumeration
/// for operators whose sources shard across numbered CDN subdomains.
/// Scheme + private-IP defenses still apply.
fn build_download_allowlist(
start_url: Option<&str>,
cdn_host: Option<&str>,
) -> DownloadAllowlist {
if env_bool("CRAWLER_ALLOW_ANY_HOST", false) {
return DownloadAllowlist::allow_any();
}
let mut allow = DownloadAllowlist::new();
if let Some(url) = start_url {
if let Ok(parsed) = reqwest::Url::parse(url) {
if let Some(h) = parsed.host_str() {
allow = allow.allow(h);
}
}
}
if let Some(host) = cdn_host {
allow = allow.allow(host);
}
if let Ok(extras) = std::env::var("CRAWLER_DOWNLOAD_ALLOWLIST") {
for piece in extras.split(',') {
let trimmed = piece.trim();
if !trimmed.is_empty() {
allow = allow.allow(trimmed);
}
}
}
allow
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
fn env_bool(name: &str, default: bool) -> bool {
match std::env::var(name).ok().as_deref() {
Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,
@@ -102,3 +309,4 @@ fn env_usize(name: &str, default: usize) -> usize {
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}

View File

@@ -0,0 +1,397 @@
//! Chromium launcher and lifecycle.
//!
//! By default uses `chromiumoxide`'s `fetcher` feature — first call
//! downloads a known-good revision into a cache dir and reuses it
//! forever after. Set `CRAWLER_CHROMIUM_BINARY` to skip the fetcher
//! and use a system-installed Chromium instead; required on platforms
//! where the upstream snapshot bucket has no usable build (notably
//! `Linux_arm64` / Raspberry Pi). Debian's package is at
//! `/usr/bin/chromium` or `/usr/bin/chromium-headless-shell`; Ubuntu
//! ships it as `chromium-browser` at a different path — don't paste
//! the wrong one.
//!
//! `BrowserMode` toggles headed vs headless; the headed path needs a
//! display (real `$DISPLAY` or `xvfb-run`).
//!
//! Extra Chromium command-line flags can be supplied through
//! [`LaunchOptions::extra_args`] in code, or via the
//! `CRAWLER_BROWSER_ARGS` env var (whitespace-separated) when going
//! through [`LaunchOptions::from_env`]. The launcher always also
//! injects `--no-sandbox` and `--disable-dev-shm-usage` because they're
//! near-mandatory for containerized Chromium; everything else is
//! caller-provided.
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::Context;
use chromiumoxide::browser::{Browser, BrowserConfig};
use chromiumoxide::error::CdpError;
use chromiumoxide::fetcher::{BrowserFetcher, BrowserFetcherOptions};
use futures_util::StreamExt;
use tokio::task::JoinHandle;
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BrowserMode {
/// Real window. Needs `$DISPLAY` (or `xvfb-run` wrapping the
/// binary). Opt-in via `CRAWLER_BROWSER_MODE=headed` — useful for
/// debugging a flow visually or for sites that fingerprint
/// headless Chrome. Not used in production.
Headed,
/// No window. Faster, lower resource use, runs without a display.
/// This is the default for both `from_env()` and `Default`.
Headless,
}
/// Configuration for a single browser launch.
///
/// Public fields rather than a builder — there are only two of them
/// and callers benefit from struct literal syntax for clarity.
#[derive(Clone, Debug)]
pub struct LaunchOptions {
pub mode: BrowserMode,
/// Extra Chromium flags, appended after the launcher's own
/// defaults. Example: `vec!["--lang=de-DE".into(),
/// "--window-size=1280,800".into()]`.
pub extra_args: Vec<String>,
}
impl LaunchOptions {
pub fn headed() -> Self {
Self {
mode: BrowserMode::Headed,
extra_args: Vec::new(),
}
}
pub fn headless() -> Self {
Self {
mode: BrowserMode::Headless,
extra_args: Vec::new(),
}
}
/// Reads `CRAWLER_BROWSER_MODE` (`headless`|`headed`, default
/// `headless`) and `CRAWLER_BROWSER_ARGS` (whitespace-separated
/// Chromium flags). Flags containing whitespace aren't supported
/// through the env var — use the programmatic API for those.
pub fn from_env() -> Self {
let mode = match std::env::var("CRAWLER_BROWSER_MODE").as_deref() {
Ok("headed") => BrowserMode::Headed,
_ => BrowserMode::Headless,
};
let extra_args = std::env::var("CRAWLER_BROWSER_ARGS")
.map(|s| parse_args(&s))
.unwrap_or_default();
Self { mode, extra_args }
}
}
impl Default for LaunchOptions {
fn default() -> Self {
Self::headless()
}
}
/// Whitespace-split a CRAWLER_BROWSER_ARGS-style string. Exposed
/// separately from `from_env` so it can be unit-tested without
/// touching process environment.
pub(crate) fn parse_args(s: &str) -> Vec<String> {
s.split_whitespace().map(str::to_string).collect()
}
/// Owned browser plus the spawned task that drives its CDP event loop.
/// Dropping `Handle` without calling `close` leaks the Chromium process
/// — always call `close().await` in production paths.
///
/// The browser is stored behind an `Arc` so it can be shared across
/// worker tasks (via [`Handle::shared`]) without copying. `Browser::new_page`
/// only needs `&self`, so multiple workers can drive the same browser
/// concurrently as long as the manager keeps the `Arc` alive.
pub struct Handle {
browser: Arc<Browser>,
driver: JoinHandle<()>,
}
impl Handle {
/// Borrow the browser. Equivalent to `&*handle.shared()`.
pub fn browser(&self) -> &Browser {
&self.browser
}
/// Clone the shared handle. Workers hold these to call `new_page`
/// concurrently. The browser only exits when the last `Arc<Browser>`
/// is dropped (kill-on-drop), or when `close()` is called on the
/// originating `Handle` while it is the sole holder.
pub fn shared(&self) -> Arc<Browser> {
Arc::clone(&self.browser)
}
/// Closes the browser and awaits the driver task. If other Arcs to
/// the browser are still alive we can't issue a clean CDP `close`,
/// so we abort the driver task instead — otherwise `handler.next()`
/// keeps polling forever and `Handle::close` hangs (chromiumoxide's
/// handler stream doesn't end on its own when the underlying WS
/// dies). Chromium itself is reaped by kill-on-drop once the last
/// `Arc<Browser>` is dropped.
pub async fn close(self) -> anyhow::Result<()> {
close_or_abort(self.browser, self.driver, |mut owned| async move {
let _ = owned.close().await;
let _ = owned.wait().await;
})
.await;
Ok(())
}
}
/// Shutdown core for [`Handle::close`], extracted so it can be unit-
/// tested without launching real Chromium. When `arc` is uniquely owned,
/// `on_owned` runs against the owned value and the driver is awaited
/// normally. When other Arc holders exist, the driver is aborted before
/// awaiting it so shutdown returns promptly.
async fn close_or_abort<T, F, Fut>(arc: Arc<T>, driver: JoinHandle<()>, on_owned: F)
where
T: Send + 'static,
F: FnOnce(T) -> Fut + Send,
Fut: std::future::Future<Output = ()> + Send,
{
match Arc::try_unwrap(arc) {
Ok(owned) => {
on_owned(owned).await;
let _ = driver.await;
}
Err(shared) => {
tracing::warn!(
strong_count = Arc::strong_count(&shared),
"Handle::close while Arc still shared — aborting driver, relying on kill-on-drop"
);
drop(shared);
driver.abort();
let _ = driver.await;
}
}
}
/// Launches Chromium. If `CRAWLER_CHROMIUM_BINARY` is set, uses that
/// path directly. Otherwise downloads via the `fetcher` feature on
/// first run and hits the cache after that. The fetcher cache dir is
/// `$CRAWLER_CHROMIUM_DIR` if set, else `$HOME/.cache/mangalord/chromium`,
/// else `./.chromium-cache` as a last-resort repo-local fallback.
pub async fn launch(options: LaunchOptions) -> anyhow::Result<Handle> {
let executable = match system_chromium_path_from_env() {
Some(path) => {
tracing::info!(path = %path.display(), "using system chromium (CRAWLER_CHROMIUM_BINARY)");
path
}
None => {
let cache = cache_dir()?;
tokio::fs::create_dir_all(&cache)
.await
.with_context(|| format!("create cache dir {}", cache.display()))?;
let fetcher = BrowserFetcher::new(
BrowserFetcherOptions::builder()
.with_path(&cache)
.build()
.map_err(|e| anyhow::anyhow!("fetcher options: {e}"))?,
);
tracing::info!(path = %cache.display(), "ensuring chromium revision is present");
let info = fetcher
.fetch()
.await
.context("download chromium via fetcher")?;
tracing::info!(executable = %info.executable_path.display(), "chromium ready");
info.executable_path
}
};
let mut builder = BrowserConfig::builder()
.chrome_executable(executable)
// Linux containers / CI commonly lack the user namespaces
// Chromium's sandbox wants. Disable it; the crawler runs in its
// own container anyway.
.arg("--no-sandbox")
.arg("--disable-dev-shm-usage");
for arg in &options.extra_args {
builder = builder.arg(arg);
}
if matches!(options.mode, BrowserMode::Headed) {
builder = builder.with_head();
}
tracing::info!(
mode = ?options.mode,
extra_args = ?options.extra_args,
"building browser config"
);
let config = builder
.build()
.map_err(|e| anyhow::anyhow!("browser config: {e}"))?;
let (browser, mut handler) = Browser::launch(config)
.await
.context("launch chromium")?;
let driver = tokio::spawn(async move {
while let Some(event) = handler.next().await {
match event {
Ok(_) => {}
// chromiumoxide 0.7 ships fixed CDP type bindings, so any
// CDP event Chrome added later fails to deserialize. The
// connection is unaffected — these are noise. Suppress
// them so real failures stay visible.
Err(CdpError::Serde(_)) => {
tracing::trace!("chromium emitted an unrecognized CDP event");
}
Err(err) => tracing::warn!(?err, "chromium handler event error"),
}
}
});
Ok(Handle {
browser: Arc::new(browser),
driver,
})
}
fn cache_dir() -> anyhow::Result<PathBuf> {
if let Ok(dir) = std::env::var("CRAWLER_CHROMIUM_DIR") {
return Ok(PathBuf::from(dir));
}
if let Ok(home) = std::env::var("HOME") {
return Ok(PathBuf::from(home).join(".cache/mangalord/chromium"));
}
Ok(PathBuf::from("./.chromium-cache"))
}
/// Reads `CRAWLER_CHROMIUM_BINARY` and delegates to the pure helper.
/// Thin wrapper kept separate so the decision logic can be unit-tested
/// without mutating the process environment.
fn system_chromium_path_from_env() -> Option<PathBuf> {
system_chromium_path_from_value(std::env::var_os("CRAWLER_CHROMIUM_BINARY").as_deref())
}
/// Returns `Some(path)` only when the value is set and non-empty. An
/// exported-but-blank var (common in compose `${VAR:-}` patterns when
/// the operator didn't fill it in) must behave like "unset" — otherwise
/// we'd hand chromiumoxide an empty path and fail launch in a confusing
/// way.
pub(crate) fn system_chromium_path_from_value(
raw: Option<&std::ffi::OsStr>,
) -> Option<PathBuf> {
raw.filter(|v| !v.is_empty()).map(PathBuf::from)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_args_splits_on_whitespace() {
assert_eq!(
parse_args("--lang=de-DE --window-size=1280,800"),
vec!["--lang=de-DE", "--window-size=1280,800"]
);
}
#[test]
fn parse_args_tolerates_irregular_whitespace() {
// tabs, multiple spaces, leading/trailing — all collapsed.
assert_eq!(
parse_args(" --a\t--b --c=1\n"),
vec!["--a", "--b", "--c=1"]
);
}
#[test]
fn parse_args_empty_string_yields_empty_vec() {
assert!(parse_args("").is_empty());
assert!(parse_args(" \t\n").is_empty());
}
#[test]
fn system_chromium_path_returns_some_when_value_set() {
let raw = std::ffi::OsString::from("/usr/bin/chromium-headless-shell");
assert_eq!(
system_chromium_path_from_value(Some(raw.as_os_str())),
Some(PathBuf::from("/usr/bin/chromium-headless-shell"))
);
}
#[test]
fn system_chromium_path_returns_none_when_unset() {
assert_eq!(system_chromium_path_from_value(None), None);
}
#[test]
fn system_chromium_path_treats_empty_as_unset() {
// Compose's `${VAR:-}` substitution produces an exported-but-empty
// env var when the operator left it blank. Treat it as unset so
// the launcher falls back to the fetcher path instead of handing
// chromiumoxide an empty path.
let raw = std::ffi::OsString::from("");
assert_eq!(
system_chromium_path_from_value(Some(raw.as_os_str())),
None
);
}
#[test]
fn default_launch_options_are_headless() {
// Headless is the production-safe default — no display required,
// smaller resource footprint. `Headed` stays available as an
// opt-in for debugging via CRAWLER_BROWSER_MODE=headed.
assert_eq!(LaunchOptions::default().mode, BrowserMode::Headless);
assert_eq!(LaunchOptions::headless().mode, BrowserMode::Headless);
assert_eq!(LaunchOptions::headed().mode, BrowserMode::Headed);
}
// Regression: if another Arc<Browser> outlives `Handle::close`, the
// old code awaited the driver task forever because the chromiumoxide
// handler stream doesn't return None on its own. Aborting the driver
// unblocks shutdown even when kill-on-drop can't fire yet.
#[tokio::test]
async fn close_or_abort_returns_when_arc_is_shared() {
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;
let arc = Arc::new(());
let _keepalive = Arc::clone(&arc); // forces try_unwrap to fail
let driver = tokio::spawn(std::future::pending::<()>());
let on_owned_ran = Arc::new(AtomicBool::new(false));
let flag = Arc::clone(&on_owned_ran);
let fut = close_or_abort(arc, driver, move |_| {
let flag = Arc::clone(&flag);
async move { flag.store(true, Ordering::Release) }
});
tokio::time::timeout(Duration::from_secs(2), fut)
.await
.expect("close_or_abort must not hang when driver is pending and Arc is shared");
assert!(
!on_owned_ran.load(Ordering::Acquire),
"on_owned must not run when the Arc is still shared"
);
}
#[tokio::test]
async fn close_or_abort_runs_on_owned_when_arc_is_unique() {
use std::sync::atomic::{AtomicBool, Ordering};
let arc = Arc::new(());
let driver = tokio::spawn(async {}); // completes immediately
let on_owned_ran = Arc::new(AtomicBool::new(false));
let flag = Arc::clone(&on_owned_ran);
close_or_abort(arc, driver, move |_| {
let flag = Arc::clone(&flag);
async move { flag.store(true, Ordering::Release) }
})
.await;
assert!(
on_owned_ran.load(Ordering::Acquire),
"on_owned must run when the Arc is unique"
);
}
}

View File

@@ -0,0 +1,301 @@
//! Lazy-launch / idle-teardown Chromium manager for the daemon.
//!
//! The first worker that calls [`BrowserManager::acquire`] triggers a real
//! Chromium launch (and the `on_launch` hook — used to re-inject the
//! PHPSESSID cookie on every fresh process). Each acquire bumps an active
//! counter; the returned [`BrowserLease`] decrements it on drop.
//!
//! When the active counter hits zero, a background reaper task waits
//! `idle_timeout`. If still zero on wake, it closes Chromium and clears the
//! cached handle. The next acquire re-launches.
//!
//! `idle_timeout = Duration::ZERO` disables the reaper — Chromium stays alive
//! until [`BrowserManager::shutdown`].
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use chromiumoxide::browser::Browser;
use futures_util::future::BoxFuture;
use tokio::sync::{Mutex, Notify};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use crate::crawler::browser::{self, LaunchOptions};
/// Hook invoked on every fresh launch with the new browser. Typically used
/// to re-inject PHPSESSID + run the session probe. Errors abort the
/// `acquire` that triggered the launch — the next acquire will re-launch.
pub type OnLaunch =
Arc<dyn Fn(Arc<Browser>) -> BoxFuture<'static, anyhow::Result<()>> + Send + Sync>;
/// Returns an `OnLaunch` that does nothing — useful when no session is
/// configured (e.g. CLI metadata-only runs).
pub fn noop_on_launch() -> OnLaunch {
Arc::new(|_| Box::pin(async { Ok(()) }))
}
/// Decoupled active-lease tracker. Owns the atomic counter and the idle
/// notifier so the wiring is unit-testable without standing up a real
/// `BrowserManager` (which would require launching Chromium).
#[derive(Default)]
pub(crate) struct ActiveTracker {
counter: AtomicUsize,
idle_signal: Notify,
}
impl ActiveTracker {
pub(crate) fn new() -> Arc<Self> {
Arc::new(Self::default())
}
pub(crate) fn acquire(self: &Arc<Self>) {
self.counter.fetch_add(1, Ordering::AcqRel);
}
pub(crate) fn release(self: &Arc<Self>) {
if self.counter.fetch_sub(1, Ordering::AcqRel) == 1 {
self.idle_signal.notify_one();
}
}
pub(crate) fn current(&self) -> usize {
self.counter.load(Ordering::Acquire)
}
pub(crate) fn idle_signal(&self) -> &Notify {
&self.idle_signal
}
}
pub struct BrowserManager {
inner: Mutex<Inner>,
active: Arc<ActiveTracker>,
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
}
struct Inner {
handle: Option<browser::Handle>,
shared: Option<Arc<Browser>>,
}
impl BrowserManager {
pub fn new(
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
) -> Arc<Self> {
Arc::new(Self {
inner: Mutex::new(Inner {
handle: None,
shared: None,
}),
active: ActiveTracker::new(),
launch_opts,
idle_timeout,
on_launch,
})
}
/// Acquire a shared browser lease. The first acquire after a teardown
/// launches a fresh Chromium (and runs `on_launch`); subsequent acquires
/// while a process is alive just bump the counter and clone the `Arc`.
pub async fn acquire(&self) -> anyhow::Result<BrowserLease> {
let mut guard = self.inner.lock().await;
if guard.handle.is_none() {
let handle = browser::launch(self.launch_opts.clone())
.await
.context("BrowserManager: launch chromium")?;
let shared = handle.shared();
// Run the on-launch hook before publishing the handle so a session
// probe failure doesn't leave a half-initialized browser behind.
if let Err(e) = (self.on_launch)(Arc::clone(&shared)).await {
// Close the just-launched browser since we won't be using it.
let _ = handle.close().await;
return Err(e.context("BrowserManager: on_launch hook failed"));
}
guard.handle = Some(handle);
guard.shared = Some(shared);
}
let browser = guard
.shared
.as_ref()
.expect("shared set above")
.clone();
self.active.acquire();
Ok(BrowserLease {
browser,
active: Arc::clone(&self.active),
})
}
/// Forcefully close the cached browser regardless of active count.
/// Used on daemon shutdown. After this returns the next acquire will
/// re-launch from scratch.
pub async fn shutdown(&self) {
let mut guard = self.inner.lock().await;
guard.shared = None;
if let Some(handle) = guard.handle.take() {
let _ = handle.close().await;
}
}
/// Mark the cached browser handle as unhealthy. The next `acquire`
/// will re-launch Chromium from scratch.
///
/// Same semantics as `shutdown` — the difference is intent:
/// `shutdown` runs once at daemon teardown, while `invalidate` is a
/// recovery hook callers fire after a CDP / connection / navigation
/// failure that suggests the underlying process has died. Calling
/// this while other workers still hold leases is safe — their
/// outstanding CDP operations will return channel-closed errors
/// and those workers will then re-acquire (re-launching Chromium).
///
/// Idempotent: calling on an already-invalidated manager is a
/// no-op.
pub async fn invalidate(&self) {
let mut guard = self.inner.lock().await;
guard.shared = None;
if let Some(handle) = guard.handle.take() {
let _ = handle.close().await;
tracing::warn!("BrowserManager: handle invalidated — next acquire will relaunch");
}
}
fn idle_timeout(&self) -> Duration {
self.idle_timeout
}
fn active(&self) -> Arc<ActiveTracker> {
Arc::clone(&self.active)
}
}
/// Background reaper. Returns immediately when `idle_timeout == 0`.
/// Otherwise spawns a task that:
/// 1. Waits on `idle_signal` (woken when active hits zero).
/// 2. Sleeps `idle_timeout`.
/// 3. Re-checks the counter under the mutex — if still zero, takes the
/// handle and closes it.
///
/// Repeats forever until `cancel` fires.
pub fn spawn_idle_reaper(mgr: Arc<BrowserManager>, cancel: CancellationToken) -> JoinHandle<()> {
tokio::spawn(async move {
if mgr.idle_timeout().is_zero() {
// Block until cancellation, then exit.
cancel.cancelled().await;
return;
}
let active = mgr.active();
loop {
tokio::select! {
_ = cancel.cancelled() => return,
_ = active.idle_signal().notified() => {}
}
if active.current() > 0 {
continue;
}
tokio::select! {
_ = cancel.cancelled() => return,
_ = tokio::time::sleep(mgr.idle_timeout()) => {}
}
let mut guard = mgr.inner.lock().await;
if active.current() > 0 {
// A worker grabbed a lease during the sleep — abort teardown.
continue;
}
let handle = guard.handle.take();
guard.shared = None;
drop(guard);
if let Some(h) = handle {
let _ = h.close().await;
tracing::info!("BrowserManager: idle teardown — Chromium closed");
}
}
})
}
/// A worker-side handle that keeps the browser alive while in scope.
/// `Deref<Target = Browser>` so callers can pass `&*lease` to APIs that
/// expect `&Browser`.
pub struct BrowserLease {
browser: Arc<Browser>,
active: Arc<ActiveTracker>,
}
impl Deref for BrowserLease {
type Target = Browser;
fn deref(&self) -> &Browser {
&self.browser
}
}
impl Drop for BrowserLease {
fn drop(&mut self) {
self.active.release();
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::atomic::AtomicBool;
#[test]
fn noop_on_launch_is_send_sync() {
fn assert_send_sync<T: Send + Sync>(_: &T) {}
let h = noop_on_launch();
assert_send_sync(&h);
}
/// Invalidate is the only `BrowserManager` method that's safe to
/// exercise in a unit test without launching Chromium — it's a
/// no-op when no handle has been cached, and that path is exactly
/// the one we want to verify is idempotent.
#[tokio::test]
async fn invalidate_is_a_noop_when_no_handle_cached() {
let mgr = BrowserManager::new(
crate::crawler::browser::LaunchOptions::default(),
Duration::ZERO,
noop_on_launch(),
);
// Two back-to-back invalidates must both complete; the second
// would hang or panic if the first had left torn state.
mgr.invalidate().await;
mgr.invalidate().await;
}
#[tokio::test]
async fn active_tracker_signals_idle_only_on_zero_transition() {
let tracker = ActiveTracker::new();
let signaled = Arc::new(AtomicBool::new(false));
{
let s = Arc::clone(&signaled);
let t = Arc::clone(&tracker);
tokio::spawn(async move {
t.idle_signal().notified().await;
s.store(true, Ordering::Release);
});
}
tracker.acquire();
tracker.acquire();
assert_eq!(tracker.current(), 2);
tracker.release();
assert_eq!(tracker.current(), 1);
tokio::time::sleep(Duration::from_millis(20)).await;
assert!(!signaled.load(Ordering::Acquire), "no idle signal at count 1");
tracker.release();
tokio::time::sleep(Duration::from_millis(20)).await;
assert_eq!(tracker.current(), 0);
assert!(
signaled.load(Ordering::Acquire),
"idle signal fires on 1 -> 0 transition"
);
}
}

View File

@@ -0,0 +1,307 @@
//! Chapter content sync — fetch a logged-in chapter page, extract its
//! image URLs in `pageN` order, download each to storage, and atomically
//! persist a `pages` row per image plus the chapter's `page_count`.
//!
//! Only chapters belonging to a manga someone has bookmarked are
//! candidates. The crawler scans bookmarks at the start of each run and
//! enqueues unfetched chapters; the API also enqueues at bookmark-time
//! so users get instant feedback. Both feed into the same queue and
//! dedup by chapter id.
// Implementation lands in the next commits in this branch. Module is
// declared so other crates can `use crawler::content` without breaking
// builds while iteration is in progress.
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::detect::PageError;
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::{fetch_bytes_capped, looks_like_image, DownloadAllowlist};
use crate::crawler::session::{self, ChapterProbe};
use crate::storage::Storage;
/// Parse the chapter page DOM and return the page images in `pageN`
/// order. Filters out the loader `<img class="loading">` and any
/// `<img>` without a numeric `id="pageN"`.
///
/// Reader pages don't render the site's `#logo` element, so the
/// universal logo-sentinel can't apply here — instead we assert
/// `a#pic_container` is present. Its absence means the response is the
/// transient broken-page response (or a redirect to some other layout)
/// and the caller should retry.
pub fn parse_chapter_pages(html: &str) -> Result<Vec<ChapterImage>, PageError> {
let doc = scraper::Html::parse_document(html);
let container_sel = scraper::Selector::parse("a#pic_container").unwrap();
if doc.select(&container_sel).next().is_none() {
return Err(PageError::transient("reader: a#pic_container missing"));
}
let sel = scraper::Selector::parse("a#pic_container img:not(.loading)").unwrap();
let mut pages: Vec<ChapterImage> = doc
.select(&sel)
.filter_map(|img| {
let id = img.value().id()?;
let n: i32 = id.strip_prefix("page")?.parse().ok()?;
let src = img.value().attr("src")?.trim().to_string();
if src.is_empty() {
return None;
}
Some(ChapterImage { page_number: n, url: src })
})
.collect();
pages.sort_by_key(|p| p.page_number);
Ok(pages)
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterImage {
pub page_number: i32,
pub url: String,
}
/// Outcome of a single chapter sync — surfaced to callers for logging
/// and exit-code decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
/// All images downloaded and stored, chapter row updated.
Fetched { pages: usize },
/// `page_count > 0` already — no-op unless force_refetch is set.
Skipped,
/// Session probe failed mid-sync (avatar selector missing on the
/// chapter page). Caller should abort the whole crawler run.
SessionExpired,
}
/// Fetch all images for one chapter and persist them atomically. On
/// any error after the first storage put, the DB transaction rolls
/// back so the chapter stays at `page_count = 0` and is retried on the
/// next run. Bytes already written to storage become orphans; a future
/// reaper sweeps them.
#[allow(clippy::too_many_arguments)]
pub async fn sync_chapter_content(
browser: &chromiumoxide::Browser,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
chapter_id: Uuid,
manga_id: Uuid,
source_url: &str,
force_refetch: bool,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
) -> anyhow::Result<SyncOutcome> {
// Skip if already fetched, unless caller explicitly forces.
if !force_refetch {
let (page_count,): (i32,) =
sqlx::query_as("SELECT page_count FROM chapters WHERE id = $1")
.bind(chapter_id)
.fetch_one(db)
.await
.context("read chapter page_count")?;
if page_count > 0 {
return Ok(SyncOutcome::Skipped);
}
}
// Nav to chapter page (rate-limited per host).
rate.wait_for(source_url).await?;
let page = browser
.new_page(source_url)
.await
.with_context(|| format!("open chapter page {source_url}"))?;
crate::crawler::nav::wait_for_nav(&page)
.await
.context("wait for chapter nav")?;
// Best-effort wait for the reader marker — same partial-render
// race that bit the chapter-list parser can hit here. Timeout is
// not an error; the chapter probe + parser sentinels still catch
// real failures.
let _ = crate::crawler::nav::wait_for_selector(
&page,
"a#pic_container",
crate::crawler::nav::SELECTOR_TIMEOUT,
)
.await;
let html = page.content().await.context("read chapter html")?;
page.close().await.ok();
// Three-way session classification: distinguishes a transient
// hiccup (broken-page body or logged-in-but-no-reader) from a
// genuine PHPSESSID expiry (no reader and no avatar widget). The
// earlier binary `#avatar_menu` check conflated both and froze
// every worker on a layout shift.
match session::classify_chapter_probe(&html) {
ChapterProbe::Unauthenticated => return Ok(SyncOutcome::SessionExpired),
ChapterProbe::Transient => {
// Surface as a typed Err so the dispatcher path runs
// ack_failed with exponential backoff (rather than the
// session-expired sticky flag).
anyhow::bail!(
"chapter page at {source_url} returned a transient response \
(broken-page body or reader didn't render); will retry"
);
}
ChapterProbe::Ok => {}
}
let images = parse_chapter_pages(&html)
.with_context(|| format!("parse chapter pages at {source_url}"))?;
if images.is_empty() {
anyhow::bail!("no page images parsed from {source_url}");
}
// Resolve image URLs against the chapter URL (they may be relative).
let base = reqwest::Url::parse(source_url).context("parse chapter URL")?;
// Fetch every image bytes-first into memory before writing
// anything. Lets us bail the whole chapter cleanly if any image
// fails — DB stays at page_count=0, no partial rows persisted.
let mut fetched: Vec<(i32, Vec<u8>, &'static str)> = Vec::with_capacity(images.len());
for img in &images {
let url = base.join(&img.url).with_context(|| {
format!("join image URL {} onto {source_url}", img.url)
})?;
rate.wait_for(url.as_str()).await?;
let bytes = fetch_bytes_capped(
http,
url.as_str(),
Some(source_url),
allowlist,
max_image_bytes,
)
.await?
.to_vec();
// Reject any non-image response: the only valid output of an
// image URL is an image. `infer` returns None on truncated
// bytes too, which also wants to be a failure not a silent
// `.bin` extension.
if !looks_like_image(&bytes) {
anyhow::bail!(
"image URL {url} returned non-image bytes \
(first 16: {:?}); refusing to store as binary blob",
&bytes.get(..16.min(bytes.len()))
);
}
let ext = infer::get(&bytes)
.map(|k| k.extension())
.expect("looks_like_image asserted infer succeeded");
fetched.push((img.page_number, bytes, ext));
}
// Atomic write: storage puts + page row inserts + page_count
// update, all in one transaction. If anything fails, rollback +
// the chapter is retried next run. Storage orphans the bytes; a
// reaper sweeps them later.
let mut tx = db.begin().await.context("open chapter sync tx")?;
for (page_number, bytes, ext) in &fetched {
let key = format!(
"mangas/{manga_id}/chapters/{chapter_id}/pages/{:04}.{ext}",
page_number
);
storage
.put(&key, bytes)
.await
.with_context(|| format!("put {key}"))?;
// (chapter_id, page_number) is unique — re-runs idempotent.
sqlx::query(
"INSERT INTO pages (chapter_id, page_number, storage_key, content_type)
VALUES ($1, $2, $3, $4)
ON CONFLICT (chapter_id, page_number) DO UPDATE
SET storage_key = EXCLUDED.storage_key,
content_type = EXCLUDED.content_type",
)
.bind(chapter_id)
.bind(page_number)
.bind(&key)
.bind(format!("image/{ext}"))
.execute(&mut *tx)
.await
.with_context(|| format!("insert page row {page_number}"))?;
}
sqlx::query("UPDATE chapters SET page_count = $1 WHERE id = $2")
.bind(fetched.len() as i32)
.bind(chapter_id)
.execute(&mut *tx)
.await
.context("update page_count")?;
tx.commit().await.context("commit chapter sync")?;
Ok(SyncOutcome::Fetched { pages: fetched.len() })
}
// Suppress unused-import warning for `session::registrable_domain`
// until the bin/crawler wiring lands in this branch and uses it
// through this module.
#[allow(dead_code)]
fn _keep_session_in_scope() {
let _ = session::registrable_domain;
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_chapter_pages_skips_loader_and_sorts_by_id() {
// Loader image, two real pages out of order, and one with no id.
let html = r#"
<html><body id="body"><a id="pic_container">
<img class="loading" src="/images/ajax-loader2.gif">
<img id="page2" class="page2" src="https://cdn/2.jpg">
<img id="page1" class="page1" src="https://cdn/1.jpg">
<img src="https://cdn/orphan.jpg">
<img id="not-a-page" src="https://cdn/not-a-page.jpg">
</a></body></html>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(pages.len(), 2);
assert_eq!(pages[0].page_number, 1);
assert_eq!(pages[0].url, "https://cdn/1.jpg");
assert_eq!(pages[1].page_number, 2);
assert_eq!(pages[1].url, "https://cdn/2.jpg");
}
#[test]
fn parse_chapter_pages_drops_images_without_src() {
let html = r#"
<a id="pic_container">
<img id="page1" src="">
<img id="page2" src="https://cdn/2.jpg">
</a>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(pages.len(), 1);
assert_eq!(pages[0].page_number, 2);
}
#[test]
fn parse_chapter_pages_handles_three_digit_page_ids() {
let html = r#"
<a id="pic_container">
<img id="page126" src="https://cdn/126.jpg">
<img id="page9" src="https://cdn/9.jpg">
<img id="page50" src="https://cdn/50.jpg">
</a>
"#;
let pages = parse_chapter_pages(html).expect("parse");
assert_eq!(
pages.iter().map(|p| p.page_number).collect::<Vec<_>>(),
vec![9, 50, 126]
);
}
#[test]
fn parse_chapter_pages_returns_transient_when_container_missing() {
// Reader doesn't render #logo, so the universal logo sentinel
// can't be used here — a#pic_container is the reader-specific
// marker. Broken-page response trips this.
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
let err = parse_chapter_pages(html).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}");
}
}

View File

@@ -0,0 +1,658 @@
//! In-process crawler daemon.
//!
//! Owns a cron task that fires a daily metadata pass and N worker tasks
//! that drain `SyncChapterContent` jobs from `crawler_jobs`. The dispatch
//! seams ([`MetadataPass`], [`ChapterDispatcher`]) are traits so tests can
//! inject stubs without standing up a real Chromium / `Source` impl.
//!
//! ## Cron
//!
//! Each tick:
//! 1. Acquire a Postgres advisory lock on a dedicated pool connection
//! (multi-replica safety). Skip the tick on contention.
//! 2. Call [`MetadataPass::run`] (typically `pipeline::run_metadata_pass`).
//! 3. Enqueue `SyncChapterContent` jobs for any bookmarked manga whose
//! chapters still have `page_count = 0`.
//! 4. Reap `done` jobs older than `retention_days`.
//! 5. Persist `last_metadata_tick_at` and release the lock.
//!
//! If the last persisted tick is older than the most recent scheduled slot
//! (e.g. backend was down at midnight), the daemon fires immediately on
//! startup before resuming the regular schedule.
//!
//! ## Workers
//!
//! Each worker leases one chapter-content job at a time, dispatches via the
//! [`ChapterDispatcher`], and acks `done` / `failed` / re-`pending` based on
//! the outcome. A `SessionExpired` outcome flips the sticky
//! `session_expired` flag — all workers idle while it's set (until operator
//! restart with a refreshed PHPSESSID).
//!
//! Worker dispatch is wrapped in `catch_unwind` so a panicking handler
//! marks the job failed instead of taking down the worker task.
use std::panic::AssertUnwindSafe;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use async_trait::async_trait;
use chrono::{DateTime, Datelike, NaiveTime, TimeZone, Timelike, Utc};
use chrono_tz::Tz;
use futures_util::FutureExt;
use serde_json::json;
use sqlx::PgPool;
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use crate::crawler::content::SyncOutcome;
use crate::crawler::jobs::{self, JobPayload, Lease, KIND_SYNC_CHAPTER_CONTENT};
use crate::crawler::pipeline;
/// Fixed `pg_try_advisory_lock` key. ASCII "MANGALRD" interpreted as a
/// big-endian i64. Hardcoded so every replica agrees on the lock identity
/// without consulting config.
pub const CRON_LOCK_KEY: i64 = 0x4D414E47414C5244;
const STATE_KEY_LAST_TICK: &str = "last_metadata_tick_at";
#[async_trait]
pub trait MetadataPass: Send + Sync {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats>;
}
#[async_trait]
pub trait ChapterDispatcher: Send + Sync {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome>;
}
/// Configuration for [`spawn`]. Use `None` for `metadata_pass` to disable
/// the cron entirely (worker-pool-only mode — useful when only the
/// bookmark-triggered enqueue path is wanted).
pub struct DaemonConfig {
pub metadata_pass: Option<Arc<dyn MetadataPass>>,
pub dispatcher: Arc<dyn ChapterDispatcher>,
pub chapter_workers: usize,
pub daily_at: NaiveTime,
pub tz: Tz,
pub retention_days: u32,
pub session_expired: Arc<AtomicBool>,
/// Tasks that should run alongside the cron + workers and be cancelled
/// on shutdown. Used to hand the daemon ownership of the browser
/// manager's idle reaper.
pub extra_tasks: Vec<tokio::task::JoinHandle<()>>,
}
pub struct DaemonHandle {
cancel: CancellationToken,
join: JoinSet<()>,
extra: Vec<tokio::task::JoinHandle<()>>,
}
impl DaemonHandle {
/// Trigger shutdown and await all worker / cron / extra tasks.
pub async fn shutdown(mut self) {
self.cancel.cancel();
while self.join.join_next().await.is_some() {}
for task in self.extra.drain(..) {
let _ = task.await;
}
}
/// Cancellation token that drives shutdown — exposed so callers
/// (`app::spawn_crawler_daemon`) can hand the same token to auxiliary
/// tasks (e.g. the BrowserManager idle reaper) and have them stop on
/// the daemon's signal.
pub fn cancel_token(&self) -> CancellationToken {
self.cancel.clone()
}
}
/// Spawn the daemon. Returns immediately; tasks run in the background.
/// Pass an external [`CancellationToken`] so auxiliary tasks (e.g. a
/// BrowserManager idle reaper) can share the same shutdown signal —
/// typically created in the caller, cloned into both spawns.
pub fn spawn(pool: PgPool, cancel: CancellationToken, cfg: DaemonConfig) -> DaemonHandle {
let mut join = JoinSet::new();
let DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers,
daily_at,
tz,
retention_days,
session_expired,
extra_tasks,
} = cfg;
if let Some(metadata) = metadata_pass {
let ctx = CronContext {
pool: pool.clone(),
cancel: cancel.clone(),
daily_at,
tz,
retention_days,
metadata,
};
join.spawn(async move { ctx.run().await });
} else {
tracing::info!("crawler daemon: no metadata_pass — cron disabled");
}
for worker_id in 0..chapter_workers.max(1) {
let ctx = WorkerContext {
pool: pool.clone(),
cancel: cancel.clone(),
dispatcher: Arc::clone(&dispatcher),
session_expired: Arc::clone(&session_expired),
id: worker_id,
};
join.spawn(async move { ctx.run().await });
}
DaemonHandle {
cancel,
join,
extra: extra_tasks,
}
}
// ---------------------------------------------------------------------------
// Cron
// ---------------------------------------------------------------------------
struct CronContext {
pool: PgPool,
cancel: CancellationToken,
daily_at: NaiveTime,
tz: Tz,
retention_days: u32,
metadata: Arc<dyn MetadataPass>,
}
impl CronContext {
async fn run(self) {
// On startup, fire immediately if the most recent slot has already
// passed and we never recorded a tick for it.
let now = Utc::now();
let mut catchup = match read_last_tick(&self.pool).await {
Ok(Some(last)) => previous_fire(now, self.daily_at, self.tz) > last,
Ok(None) => true,
Err(e) => {
tracing::warn!(?e, "cron: read_last_tick failed; assuming no catch-up");
false
}
};
loop {
if catchup {
tracing::info!("cron: catch-up tick (missed scheduled slot)");
self.run_tick().await;
catchup = false;
continue;
}
// Recompute next-fire from now() each iteration so clock jumps
// (NTP step, suspend/resume) don't strand us on a stale instant.
let next = next_fire(Utc::now(), self.daily_at, self.tz);
let wait = (next - Utc::now()).to_std().unwrap_or(Duration::ZERO);
tracing::info!(
next_fire_utc = %next.to_rfc3339(),
wait_seconds = wait.as_secs(),
"cron: sleeping until next slot"
);
tokio::select! {
_ = tokio::time::sleep(wait) => {}
_ = self.cancel.cancelled() => {
tracing::info!("cron: shutdown");
return;
}
}
self.run_tick().await;
}
}
async fn run_tick(&self) {
let mut conn = match self.pool.acquire().await {
Ok(c) => c,
Err(e) => {
tracing::error!(?e, "cron: acquire conn failed; skipping tick");
return;
}
};
// pg_try_advisory_lock is session-scoped — we must hold the same
// connection for the unlock or the call silently no-ops on a
// different connection from the pool.
let acquired: bool = sqlx::query_scalar("SELECT pg_try_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.fetch_one(&mut *conn)
.await
.unwrap_or(false);
if !acquired {
tracing::info!("cron: tick skipped — another replica holds the lock");
return;
}
// Panic-isolate the tick body the same way `process_lease` does
// for worker dispatch. Without this, a panic in metadata.run
// (or any of the follow-on steps) would kill the cron task and
// no future tick would ever run — workers would keep going but
// no new metadata work would be scheduled until daemon restart.
// The advisory unlock below runs unconditionally so a panicked
// tick doesn't leave the lock held for another replica.
let metadata = &self.metadata;
let pool = &self.pool;
let retention_days = self.retention_days;
let body = async move {
match metadata.run().await {
Ok(stats) => tracing::info!(?stats, "cron: metadata pass done"),
Err(e) => tracing::error!(?e, "cron: metadata pass failed"),
}
match pipeline::enqueue_bookmarked_pending(pool).await {
Ok(summary) => {
tracing::info!(?summary, "cron: enqueued bookmarked-pending");
}
Err(e) => tracing::error!(?e, "cron: enqueue_bookmarked_pending failed"),
}
match jobs::reap_done(pool, retention_days).await {
Ok(n) => tracing::info!(reaped = n, "cron: done-job reaper finished"),
Err(e) => tracing::error!(?e, "cron: done-job reaper failed"),
}
if let Err(e) = write_last_tick(pool, Utc::now()).await {
tracing::warn!(?e, "cron: persist last_metadata_tick_at failed");
}
};
if let Err(_panic) = AssertUnwindSafe(body).catch_unwind().await {
tracing::error!("cron: tick body panicked — continuing");
}
let _ = sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *conn)
.await;
drop(conn);
}
}
// ---------------------------------------------------------------------------
// Workers
// ---------------------------------------------------------------------------
struct WorkerContext {
pool: PgPool,
cancel: CancellationToken,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<AtomicBool>,
id: usize,
}
impl WorkerContext {
async fn run(self) {
loop {
if self.cancel.is_cancelled() {
tracing::info!(worker = self.id, "worker: shutdown");
return;
}
if self.session_expired.load(Ordering::Acquire) {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(30)) => continue,
_ = self.cancel.cancelled() => return,
}
}
let leases = match jobs::lease(
&self.pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
1,
Duration::from_secs(60),
)
.await
{
Ok(v) => v,
Err(e) => {
tracing::warn!(worker = self.id, ?e, "worker: lease failed");
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(5)) => continue,
_ = self.cancel.cancelled() => return,
}
}
};
let Some(lease) = leases.into_iter().next() else {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(1)) => continue,
_ = self.cancel.cancelled() => return,
}
};
self.process_lease(lease).await;
}
}
async fn process_lease(&self, lease: Lease) {
// Consumer-side dedup safety net: if the chapter already has pages
// (because a force-refetch race or a job that was re-enqueued
// after a previous one finished), ack done without re-fetching.
if let JobPayload::SyncChapterContent { chapter_id, .. } = &lease.payload {
let page_count = crate::repo::chapter::page_count(&self.pool, *chapter_id)
.await
.ok()
.flatten();
if matches!(page_count, Some(n) if n > 0) {
let _ = jobs::ack_done(&self.pool, lease.id).await;
return;
}
}
let outcome = AssertUnwindSafe(self.dispatcher.dispatch(lease.payload.clone()))
.catch_unwind()
.await;
match outcome {
Ok(Ok(SyncOutcome::Fetched { .. } | SyncOutcome::Skipped)) => {
let _ = jobs::ack_done(&self.pool, lease.id).await;
}
Ok(Ok(SyncOutcome::SessionExpired)) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"session expired — workers will idle until restart"
);
self.session_expired.store(true, Ordering::Release);
let _ = jobs::release(&self.pool, lease.id).await;
}
Ok(Err(e)) => {
tracing::warn!(
worker = self.id,
lease_id = %lease.id,
error = ?e,
"worker: dispatch error — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
&format!("{e:#}"),
lease.attempts,
lease.max_attempts,
)
.await;
}
Err(_panic) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"worker: dispatcher panicked — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
"worker panicked",
lease.attempts,
lease.max_attempts,
)
.await;
}
}
}
}
// ---------------------------------------------------------------------------
// Cron timing primitives
// ---------------------------------------------------------------------------
/// Compute the next UTC instant when `daily_at` (interpreted in `tz`) will
/// fire, strictly after `now`. Handles DST gaps (spring-forward) by
/// advancing past the gap; on DST overlap (fall-back) picks the later
/// instant so the job runs once, not twice.
pub fn next_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
// Start with today's slot in the local TZ.
let mut candidate = local_at(now_local.date_naive(), daily_at, tz);
// If today's slot is in the past (or now), roll forward day-by-day.
while candidate <= now {
let next_day = candidate
.with_timezone(&tz)
.date_naive()
.succ_opt()
.unwrap_or_else(|| {
// Defensive: succ_opt only fails at chrono's max date.
chrono::NaiveDate::from_ymd_opt(
candidate.year(),
candidate.month(),
candidate.day(),
)
.expect("valid date")
});
candidate = local_at(next_day, daily_at, tz);
}
candidate
}
/// The most recent fire instant at or before `now`. Used to detect missed
/// slots after a restart.
pub fn previous_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
let today = local_at(now_local.date_naive(), daily_at, tz);
if today <= now {
return today;
}
let yesterday = now_local
.date_naive()
.pred_opt()
.expect("a day before now");
local_at(yesterday, daily_at, tz)
}
/// Resolve a local date+time to a UTC instant in `tz`, navigating DST
/// edges deterministically:
/// - `LocalResult::Single` → that instant.
/// - `LocalResult::Ambiguous(_, latest)` → the later instant (fall-back
/// hour). Picking latest means a daily job fires once across the
/// repeated hour, not twice.
/// - `LocalResult::None` → spring-forward gap. Advance the local time
/// by 1 minute and try again, repeating up to 120 times (so the worst
/// case is still well inside an hour-long gap).
fn local_at(date: chrono::NaiveDate, time: NaiveTime, tz: Tz) -> DateTime<Utc> {
use chrono::LocalResult;
for offset_minutes in 0..120 {
let mut t = time;
if offset_minutes > 0 {
let added = chrono::NaiveTime::from_num_seconds_from_midnight_opt(
((time.num_seconds_from_midnight() as i64 + offset_minutes * 60) % 86_400) as u32,
0,
)
.unwrap_or(time);
t = added;
}
let naive = date.and_time(t);
match tz.from_local_datetime(&naive) {
LocalResult::Single(dt) => return dt.with_timezone(&Utc),
LocalResult::Ambiguous(_, latest) => return latest.with_timezone(&Utc),
LocalResult::None => continue,
}
}
// Should be unreachable — DST gaps are always less than an hour.
Utc.from_utc_datetime(&date.and_time(time))
}
// ---------------------------------------------------------------------------
// crawler_state I/O
// ---------------------------------------------------------------------------
async fn read_last_tick(pool: &PgPool) -> sqlx::Result<Option<DateTime<Utc>>> {
let row: Option<serde_json::Value> = sqlx::query_scalar(
"SELECT value FROM crawler_state WHERE key = $1",
)
.bind(STATE_KEY_LAST_TICK)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|v| {
v.get("at")
.and_then(|s| s.as_str())
.and_then(|s| DateTime::parse_from_rfc3339(s).ok())
.map(|dt| dt.with_timezone(&Utc))
}))
}
async fn write_last_tick(pool: &PgPool, at: DateTime<Utc>) -> sqlx::Result<()> {
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(STATE_KEY_LAST_TICK)
.bind(json!({ "at": at.to_rfc3339() }))
.execute(pool)
.await?;
Ok(())
}
// ---------------------------------------------------------------------------
// Test helpers (not gated on cfg(test) — integration tests in tests/ dir
// need them too).
// ---------------------------------------------------------------------------
pub mod test_support {
//! Lightweight stubs the daemon tests use. Public because integration
//! tests live outside this module.
use super::*;
use std::sync::atomic::AtomicUsize;
pub struct CountingMetadataPass {
pub count: AtomicUsize,
}
impl Default for CountingMetadataPass {
fn default() -> Self {
Self {
count: AtomicUsize::new(0),
}
}
}
#[async_trait]
impl MetadataPass for CountingMetadataPass {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats> {
self.count.fetch_add(1, Ordering::AcqRel);
Ok(pipeline::MetadataStats::default())
}
}
pub type DispatchFn = Arc<
dyn Fn(JobPayload) -> futures_util::future::BoxFuture<'static, anyhow::Result<SyncOutcome>>
+ Send
+ Sync,
>;
pub struct StubDispatcher {
pub handler: DispatchFn,
}
#[async_trait]
impl ChapterDispatcher for StubDispatcher {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
(self.handler)(payload).await
}
}
pub fn always_done() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { Ok(SyncOutcome::Fetched { pages: 1 }) })),
})
}
pub fn panicking_dispatcher() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { panic!("intentional dispatcher panic") })),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
use chrono::Duration as ChronoDuration;
fn dt_utc(y: i32, mo: u32, d: u32, h: u32, mi: u32) -> DateTime<Utc> {
Utc.with_ymd_and_hms(y, mo, d, h, mi, 0).unwrap()
}
#[test]
fn next_fire_in_utc_at_midnight_advances_one_day() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
// Next midnight is May 26 00:00 UTC.
assert_eq!(next, dt_utc(2026, 5, 26, 0, 0));
}
#[test]
fn next_fire_before_today_slot_returns_today() {
let now = dt_utc(2026, 5, 25, 23, 0); // 23:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
assert_eq!(next, dt_utc(2026, 5, 25, 23, 30));
}
#[test]
fn next_fire_skips_spring_forward_gap_in_europe_berlin() {
// 2024-03-31: clocks jump 02:00 -> 03:00 in Berlin (CET -> CEST).
// Asking for daily_at = 02:30 on the morning of the jump should
// land on the *next valid* local instant past the gap. We test
// by computing `next_fire` at 2024-03-31 00:30 UTC (= 01:30 CET,
// i.e. just before the gap). The next 02:30 local does not exist,
// so the helper advances past it.
let now = dt_utc(2024, 3, 31, 0, 30); // 01:30 local Berlin (CET = UTC+1)
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// Local Berlin time skips from 02:00 -> 03:00. After the +1 minute
// search, the first valid slot is 03:00 local on 2024-03-31, which
// is 01:00 UTC (CEST = UTC+2).
// We assert the result is strictly between (now) and 1h later
// and is in UTC — the exact minute depends on how many +1m steps
// were required.
assert!(next > now);
assert!(next < now + ChronoDuration::hours(2));
}
#[test]
fn next_fire_on_fall_back_picks_later_instant() {
// 2024-10-27: clocks jump 03:00 -> 02:00 (CEST -> CET) in Berlin.
// 02:30 happens twice on that day. We pick the later one.
let now = dt_utc(2024, 10, 26, 12, 0); // day before, noon UTC
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// First 02:30 local is 00:30 UTC (CEST = UTC+2).
// Second 02:30 local is 01:30 UTC (CET = UTC+1).
// We expect the later instant: 01:30 UTC on 2024-10-27.
assert_eq!(next, dt_utc(2024, 10, 27, 1, 30));
}
#[test]
fn previous_fire_returns_today_when_now_is_after_slot() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 25, 0, 0));
}
#[test]
fn previous_fire_returns_yesterday_when_now_is_before_today_slot() {
let now = dt_utc(2026, 5, 25, 8, 0); // 08:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 24, 23, 30));
}
/// Documents the panic-isolation pattern `run_tick` now relies on:
/// `AssertUnwindSafe(...).catch_unwind().await` must yield `Err(_)`
/// when the wrapped future panics, so the surrounding loop (or in
/// our case, the unconditional advisory-unlock that follows) keeps
/// running. The shape of this test mirrors the production callsite.
#[tokio::test]
async fn assert_unwind_safe_catches_a_panicking_future() {
let result = AssertUnwindSafe(async {
panic!("boom");
})
.catch_unwind()
.await;
assert!(result.is_err(), "panicking future must yield Err");
}
}

View File

@@ -0,0 +1,250 @@
//! Transient-page detection.
//!
//! The target site occasionally responds with a 403 + tiny "we're sorry,
//! the request file are not found" body on pages that actually exist.
//! Selectors on that body match nothing, which is indistinguishable from
//! a genuinely empty page unless we look for the broken-page markers
//! explicitly. The same shape covers full-site outages: 5xx pages,
//! Cloudflare interstitials, and "site is down" placeholders all share
//! the trait that the normal layout (`#logo` in the header) is absent.
//!
//! Helpers here are split into two signals so callers can compose them:
//! - [`is_broken_page_body`]: pattern-match on the known broken-page
//! string. Works for *any* page on the site, including the reader,
//! which doesn't render `#logo`.
//! - [`has_logo_sentinel`]: assert `#logo` is in the parsed DOM. Site-
//! structural marker — present on the manga list, manga detail,
//! chapter-list, and login probe pages. **Not** present on the reader,
//! so callers in the reader path must rely on the body signature only.
//!
//! [`PageError::Transient`] is the typed signal returned by parser and
//! navigate wrappers. Job handlers map it to "reschedule with backoff"
//! rather than the per-page silent skip the parsers used to do.
use std::future::Future;
use std::time::Duration;
use thiserror::Error;
/// Universal substring of the broken-page body. The site renders the
/// exact string verbatim in a single `<p>`, so a case-insensitive
/// substring match is enough — we deliberately do *not* anchor to the
/// kaomoji because that part is more likely to change than the prose.
const BROKEN_PAGE_MARKER: &str = "we're sorry, the request file are not found";
/// Outcome of a page fetch or parse when the caller wants to
/// distinguish "site/page is transiently broken — retry later" from
/// other errors. `Transient` is the only retry-friendly variant; every
/// other failure mode stays as `anyhow::Error` and is treated as today.
#[derive(Debug, Error)]
pub enum PageError {
/// Page came back but the site signaled trouble — broken-page body
/// signature, structural sentinel missing, etc. Caller should
/// reschedule this fetch rather than treat it as data.
#[error("transient page error: {reason}")]
Transient { reason: String },
#[error(transparent)]
Other(#[from] anyhow::Error),
}
impl PageError {
pub fn transient(reason: impl Into<String>) -> Self {
Self::Transient { reason: reason.into() }
}
pub fn is_transient(&self) -> bool {
matches!(self, Self::Transient { .. })
}
}
/// Returns true when the response body matches the known broken-page
/// template. Case-insensitive substring match — small bodies (~150B)
/// make the scan trivially fast, and the broken page is always tiny so
/// false positives on a real catalog page are not a concern.
pub fn is_broken_page_body(html: &str) -> bool {
html.to_ascii_lowercase().contains(BROKEN_PAGE_MARKER)
}
/// Returns true when the parsed document contains `#logo` — the site's
/// header logo element, present on every full-layout page and absent on
/// the broken-page response and on the reader.
pub fn has_logo_sentinel(doc: &scraper::Html) -> bool {
let sel = scraper::Selector::parse("#logo").expect("#logo is a valid selector");
doc.select(&sel).next().is_some()
}
/// Retry `op` up to `max_attempts` times whenever it returns
/// [`PageError::Transient`], sleeping `delay` between attempts.
/// Non-transient errors short-circuit immediately. Used by discover-loop
/// callers so a single broken page doesn't drop the whole walk — the
/// caller can fall back on the job system's retry/backoff once the
/// inline budget is exhausted.
pub async fn retry_on_transient<F, Fut, T>(
mut op: F,
max_attempts: u32,
delay: Duration,
) -> Result<T, PageError>
where
F: FnMut() -> Fut,
Fut: Future<Output = Result<T, PageError>>,
{
debug_assert!(max_attempts >= 1, "max_attempts must be at least 1");
let mut attempt = 0u32;
loop {
attempt += 1;
match op().await {
Ok(v) => return Ok(v),
Err(e) if !e.is_transient() => return Err(e),
Err(e) if attempt >= max_attempts => return Err(e),
Err(e) => {
tracing::warn!(
attempt,
max_attempts,
error = %e,
"transient error; sleeping before retry"
);
tokio::time::sleep(delay).await;
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn broken_page_body_matches_exact_template() {
let html = "<html><head></head><body>\
<p>we're sorry, the request file are not found. Σ(っ°Д °;)っ</p>\
</body></html>";
assert!(is_broken_page_body(html));
}
#[test]
fn broken_page_body_is_case_insensitive() {
let html = "<p>WE'RE SORRY, THE REQUEST FILE ARE NOT FOUND.</p>";
assert!(is_broken_page_body(html));
}
#[test]
fn broken_page_body_does_not_match_normal_listing() {
let html = "<html><body><div id='logo'></div>\
<ul><li>Manga A</li><li>Manga B</li></ul></body></html>";
assert!(!is_broken_page_body(html));
}
#[test]
fn broken_page_body_does_not_match_empty_string() {
assert!(!is_broken_page_body(""));
}
#[test]
fn logo_sentinel_present_on_normal_page() {
let doc = scraper::Html::parse_document(
"<html><body><div id='logo'>Site</div><main>...</main></body></html>",
);
assert!(has_logo_sentinel(&doc));
}
#[test]
fn logo_sentinel_absent_on_broken_page() {
let doc = scraper::Html::parse_document(
"<html><head></head><body>\
<p>we're sorry, the request file are not found.</p></body></html>",
);
assert!(!has_logo_sentinel(&doc));
}
#[test]
fn logo_sentinel_absent_on_empty_document() {
let doc = scraper::Html::parse_document("");
assert!(!has_logo_sentinel(&doc));
}
#[test]
fn page_error_transient_constructor_sets_reason() {
let e = PageError::transient("logo missing");
assert!(e.is_transient());
assert_eq!(e.to_string(), "transient page error: logo missing");
}
#[test]
fn page_error_other_is_not_transient() {
let e: PageError = anyhow::anyhow!("something else").into();
assert!(!e.is_transient());
}
#[tokio::test]
async fn retry_returns_ok_after_a_transient_streak() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
let n = attempt;
async move {
if n < 3 {
Err(PageError::transient("not yet"))
} else {
Ok(42)
}
}
},
5,
Duration::from_millis(0),
)
.await;
assert_eq!(result.unwrap(), 42);
assert_eq!(attempt, 3);
}
#[tokio::test]
async fn retry_gives_up_after_max_attempts_on_persistent_transient() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Err(PageError::transient("always")) }
},
3,
Duration::from_millis(0),
)
.await;
let err = result.expect_err("expected Transient");
assert!(err.is_transient());
assert_eq!(attempt, 3, "retried max_attempts times, no more");
}
#[tokio::test]
async fn retry_does_not_retry_non_transient_errors() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Err(PageError::Other(anyhow::anyhow!("permanent"))) }
},
5,
Duration::from_millis(0),
)
.await;
assert!(result.is_err());
assert!(!result.unwrap_err().is_transient());
assert_eq!(attempt, 1, "non-transient must fail immediately");
}
#[tokio::test]
async fn retry_returns_ok_on_first_attempt_without_sleeping() {
let mut attempt = 0u32;
let result: Result<i32, PageError> = retry_on_transient(
|| {
attempt += 1;
async { Ok(7) }
},
5,
Duration::from_secs(60),
)
.await;
assert_eq!(result.unwrap(), 7);
assert_eq!(attempt, 1);
}
}

View File

@@ -0,0 +1,15 @@
//! Change-detection rules between the source and our DB.
//!
//! | Event | Signal |
//! |--------------------|----------------------------------------------------------------------------------------|
//! | New manga | `(source_id, source_manga_key)` not in `manga_sources` |
//! | Updated metadata | freshly computed `metadata_hash` differs from the stored one |
//! | Dropped manga | `last_seen_at < discover_run_started_at` for N consecutive successful discover runs |
//! | New chapter | `(source_id, source_chapter_key)` not in `chapter_sources` |
//! | Dropped chapter | present in DB but absent from the latest `fetch_chapter_list` for the same manga |
//!
//! Dropped is always a soft flag (`dropped_at`), never a row delete —
//! restoring is a matter of clearing the flag if the source brings the
//! item back.
//!
//! Scaffold only — implementations land once `repo::crawler` exists.

290
backend/src/crawler/jobs.rs Normal file
View File

@@ -0,0 +1,290 @@
//! Persistent job queue and its job kinds.
//!
//! Backed by Postgres (the `crawler_jobs` table). Workers lease rows
//! with `SELECT ... FOR UPDATE SKIP LOCKED`, heartbeat via
//! `leased_until`, and ack by transitioning to `done` (or backoff /
//! `dead`). Handlers are idempotent so a crash mid-run is recoverable
//! by replay.
use std::time::Duration;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use uuid::Uuid;
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum JobPayload {
/// Fetch one manga's detail page, upsert metadata, enqueue
/// `SyncChapterList`.
SyncManga {
source_id: String,
source_manga_key: String,
},
/// Diff the chapter list, enqueue `SyncChapterContent` for new
/// chapters, soft-drop vanished ones.
SyncChapterList {
source_id: String,
manga_id: Uuid,
source_manga_key: String,
},
/// Download a single chapter's page images into storage.
SyncChapterContent {
source_id: String,
chapter_id: Uuid,
source_chapter_key: String,
},
}
#[derive(Clone, Copy, Debug, sqlx::Type, Serialize, Deserialize)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum JobState {
Pending,
Running,
Done,
Failed,
Dead,
}
/// Kind discriminator stored in `payload->>'kind'`. Public so callers
/// (daemon worker, bookmark hook) can filter `lease()` to a single kind
/// without re-spelling the literal.
pub const KIND_SYNC_CHAPTER_CONTENT: &str = "sync_chapter_content";
#[derive(Debug)]
pub enum EnqueueResult {
Inserted(Uuid),
Skipped,
}
#[derive(Debug, Clone)]
pub struct Lease {
pub id: Uuid,
pub payload: JobPayload,
pub attempts: i32,
pub max_attempts: i32,
}
/// Exponential backoff for `ack_failed` retries. `attempts` is the
/// post-increment value reported by `lease()` (so the first failure has
/// `attempts == 1` and waits 60s, the second 120s, etc.). Capped at 1h to
/// avoid runaway long sleeps that would outlive the daemon process.
fn backoff_for(attempts: i32) -> Duration {
let shift = attempts.saturating_sub(1).clamp(0, 20) as u32;
let secs = 60u64.saturating_mul(1u64 << shift);
Duration::from_secs(secs.min(3600))
}
/// Insert a new pending job. For `SyncChapterContent` payloads the
/// partial unique index `crawler_jobs_chapter_content_dedup_idx` blocks
/// a second `(pending|running)` insert per chapter_id, returning
/// `Skipped`. The slot frees again once the previous job leaves the
/// in-flight states (done/failed/dead), so a re-enqueue after a force
/// refetch succeeds.
pub async fn enqueue(pool: &PgPool, payload: &JobPayload) -> sqlx::Result<EnqueueResult> {
let json = serde_json::to_value(payload).expect("JobPayload is always serializable");
let id: Option<Uuid> = sqlx::query_scalar(
"INSERT INTO crawler_jobs (payload) VALUES ($1) \
ON CONFLICT DO NOTHING RETURNING id",
)
.bind(json)
.fetch_optional(pool)
.await?;
Ok(match id {
Some(id) => EnqueueResult::Inserted(id),
None => EnqueueResult::Skipped,
})
}
/// Lease up to `max` rows whose `state` is `pending`, or `running` with
/// an expired `leased_until` (the crashed-worker recovery path). The
/// inner CTE uses `FOR UPDATE SKIP LOCKED` so concurrent leasers don't
/// block each other and each row is handed to exactly one worker.
///
/// `kind_filter` matches against `payload->>'kind'`; `None` means
/// any kind.
pub async fn lease(
pool: &PgPool,
kind_filter: Option<&str>,
max: i64,
lease_duration: Duration,
) -> sqlx::Result<Vec<Lease>> {
let lease_ms: i64 = lease_duration.as_millis().min(i64::MAX as u128) as i64;
let rows: Vec<(Uuid, serde_json::Value, i32, i32)> = sqlx::query_as(
r#"
WITH leased AS (
SELECT id FROM crawler_jobs
WHERE (state = 'pending' OR (state = 'running' AND leased_until < now()))
AND scheduled_at <= now()
AND ($1::text IS NULL OR payload->>'kind' = $1)
ORDER BY scheduled_at
LIMIT $2
FOR UPDATE SKIP LOCKED
)
UPDATE crawler_jobs j
SET state = 'running',
attempts = j.attempts + 1,
leased_until = now() + ($3::bigint || ' milliseconds')::interval,
updated_at = now()
FROM leased l
WHERE j.id = l.id
RETURNING j.id, j.payload, j.attempts, j.max_attempts
"#,
)
.bind(kind_filter)
.bind(max)
.bind(lease_ms)
.fetch_all(pool)
.await?;
let mut leases = Vec::with_capacity(rows.len());
for (id, payload_json, attempts, max_attempts) in rows {
let payload: JobPayload = serde_json::from_value(payload_json).map_err(|e| {
sqlx::Error::Decode(format!("invalid JobPayload JSON for job {id}: {e}").into())
})?;
leases.push(Lease {
id,
payload,
attempts,
max_attempts,
});
}
Ok(leases)
}
/// Mark a leased job as successfully completed. The `state = 'running'`
/// predicate guards against a late ack from a worker whose lease expired
/// and was already re-leased by another worker: without it, the late ack
/// would clobber the new lease's `state` and `leased_until`. `rows_affected
/// == 0` means we lost the lease — surfaced as a warn rather than an
/// error because the new lease holder is doing real work; the late ack
/// just has to step aside.
pub async fn ack_done(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
let res = sqlx::query(
"UPDATE crawler_jobs \
SET state = 'done', leased_until = NULL, updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.execute(pool)
.await?;
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"ack_done: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Mark a leased job as failed. If the current attempt count has reached
/// `max_attempts` the job is terminally dead and stops retrying;
/// otherwise it goes back to `pending` with `scheduled_at` pushed into
/// the future by the exponential backoff. See [`ack_done`] for the
/// `state = 'running'` guard rationale.
pub async fn ack_failed(
pool: &PgPool,
lease_id: Uuid,
error: &str,
attempts: i32,
max_attempts: i32,
) -> sqlx::Result<()> {
let res = if attempts >= max_attempts {
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'dead', last_error = $2, leased_until = NULL, updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.bind(error)
.execute(pool)
.await?
} else {
let backoff_ms: i64 = backoff_for(attempts).as_millis().min(i64::MAX as u128) as i64;
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', last_error = $2, leased_until = NULL, \
scheduled_at = now() + ($3::bigint || ' milliseconds')::interval, \
updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.bind(error)
.bind(backoff_ms)
.execute(pool)
.await?
};
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"ack_failed: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Return a leased job to `pending` without burning a retry attempt.
/// Used on graceful shutdown and on session-expired aborts where the
/// failure isn't the job's fault. See [`ack_done`] for the
/// `state = 'running'` guard rationale — important here because
/// `attempts - 1` would otherwise spuriously decrement the new lease's
/// attempt count.
pub async fn release(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
let res = sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', leased_until = NULL, \
attempts = GREATEST(0, attempts - 1), updated_at = now() \
WHERE id = $1 AND state = 'running'",
)
.bind(lease_id)
.execute(pool)
.await?;
if res.rows_affected() == 0 {
tracing::warn!(
%lease_id,
"release: lease no longer running — likely re-leased by another worker; skipping update"
);
}
Ok(())
}
/// Delete `done` jobs whose `updated_at` is older than `retention_days`
/// days. `0` disables the reaper without touching the table. Returns the
/// number of rows removed.
pub async fn reap_done(pool: &PgPool, retention_days: u32) -> sqlx::Result<u64> {
if retention_days == 0 {
return Ok(0);
}
let result = sqlx::query(
"DELETE FROM crawler_jobs \
WHERE state = 'done' \
AND updated_at < now() - ($1::bigint || ' days')::interval",
)
.bind(retention_days as i64)
.execute(pool)
.await?;
Ok(result.rows_affected())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn backoff_grows_exponentially_and_caps_at_one_hour() {
// attempts == 1 → 60s, doubling each step.
assert_eq!(backoff_for(1), Duration::from_secs(60));
assert_eq!(backoff_for(2), Duration::from_secs(120));
assert_eq!(backoff_for(3), Duration::from_secs(240));
assert_eq!(backoff_for(4), Duration::from_secs(480));
assert_eq!(backoff_for(5), Duration::from_secs(960));
assert_eq!(backoff_for(6), Duration::from_secs(1920));
// 7th: 60 * 64 = 3840 → capped to 3600.
assert_eq!(backoff_for(7), Duration::from_secs(3600));
assert_eq!(backoff_for(20), Duration::from_secs(3600));
// Garbage / zero / negatives stay sane.
assert_eq!(backoff_for(0), Duration::from_secs(60));
assert_eq!(backoff_for(-5), Duration::from_secs(60));
}
}

View File

@@ -0,0 +1,29 @@
//! Crawler subsystem.
//!
//! Runs as its own binary (`src/bin/crawler.rs`) and shares `domain`,
//! `repo`, and `storage` with the API binary. Layering mirrors the
//! `Storage` trait pattern: callers depend on the `source::Source`
//! trait, not on a concrete site; new sites plug in as additional
//! impls without touching the job runner.
//!
//! Submodules:
//! - [`browser`]: launches and pools Chromium via `chromiumoxide`.
//! First run downloads a known-good build via the `fetcher` feature.
//! - [`source`]: the `Source` trait. Per-site impls live alongside it.
//! - [`jobs`]: job kinds, queue wrapper, handler dispatch.
//! - [`diff`]: change detection — new / updated / dropped semantics.
pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod detect;
pub mod diff;
pub mod jobs;
pub mod nav;
pub mod pipeline;
pub mod rate_limit;
pub mod safety;
pub mod session;
pub mod source;
pub mod url_utils;

241
backend/src/crawler/nav.rs Normal file
View File

@@ -0,0 +1,241 @@
//! Page navigation helpers — wrap `chromiumoxide` `wait_for_navigation`
//! with a timeout so a hung TLS handshake or a page that never fires
//! `load` cannot wedge a worker (or the cron metadata pass) forever.
//!
//! [`NAV_TIMEOUT`] is the global budget. Callers in the crawler use
//! [`wait_for_nav`] to get back a typed error so transient timeouts can
//! be reported separately from underlying CDP errors.
use std::time::Duration;
use chromiumoxide::error::CdpError;
use chromiumoxide::Page;
use thiserror::Error;
/// Maximum wall-clock time we'll wait for a single page navigation. A
/// healthy Chromium reaches `load` in well under a second on the target
/// site; a 30-second cap is generous enough for slow TLS handshakes on
/// the first request after a fresh process while still catching real
/// hangs before they wedge the daemon.
pub const NAV_TIMEOUT: Duration = Duration::from_secs(30);
/// Outcome of a timed-out navigation. `Timeout` is the transient signal
/// callers translate into a retry-friendly error
/// ([`crate::crawler::detect::PageError::Transient`] in the source path,
/// a context'd anyhow elsewhere). `Cdp` carries the underlying
/// chromiumoxide error unchanged.
#[derive(Debug, Error)]
pub enum NavError {
#[error("navigation timed out after {0:?}")]
Timeout(Duration),
#[error(transparent)]
Cdp(#[from] CdpError),
}
/// Wait for the page's next navigation to complete, capped at
/// [`NAV_TIMEOUT`]. Replaces bare `page.wait_for_navigation().await`
/// throughout the crawler.
pub async fn wait_for_nav(page: &Page) -> Result<(), NavError> {
match tokio::time::timeout(NAV_TIMEOUT, page.wait_for_navigation()).await {
Err(_elapsed) => Err(NavError::Timeout(NAV_TIMEOUT)),
Ok(Err(e)) => Err(NavError::Cdp(e)),
Ok(Ok(_)) => Ok(()),
}
}
/// Poll interval for [`wait_for_selector`]. 100ms is fast enough that a
/// page rendering in 200ms isn't held back noticeably, and slow enough
/// not to spam CDP with `find_element` calls on a page that's actually
/// taking its time.
const SELECTOR_POLL_INTERVAL: Duration = Duration::from_millis(100);
/// Wait until `selector` matches at least one element on `page`, or
/// `timeout` elapses. Used after a navigation to confirm a page-type-
/// specific marker is in the DOM before parsing — replaces the fixed
/// post-nav sleep that previously masked partial-render races.
///
/// chromiumoxide 0.7.0 has no built-in `wait_for_selector`, so we poll
/// `find_element` at [`SELECTOR_POLL_INTERVAL`] until success or budget
/// exhaustion. A failed `find_element` is *not* an error here — it just
/// means "not yet" — we only surface an error once the overall
/// `timeout` is up.
pub async fn wait_for_selector(
page: &Page,
selector: &str,
timeout: Duration,
) -> Result<(), NavError> {
let deadline = tokio::time::Instant::now() + timeout;
loop {
if page.find_element(selector).await.is_ok() {
return Ok(());
}
if tokio::time::Instant::now() >= deadline {
return Err(NavError::Timeout(timeout));
}
let remaining = deadline.saturating_duration_since(tokio::time::Instant::now());
let sleep_for = SELECTOR_POLL_INTERVAL.min(remaining);
tokio::time::sleep(sleep_for).await;
}
}
/// Per-page-type budget for [`wait_for_selector`]. Shorter than
/// [`NAV_TIMEOUT`] because by the time we're waiting on a selector, the
/// page has already responded — we're only absorbing post-load JS
/// finishing its row injection, which on a healthy site takes well
/// under a second.
pub const SELECTOR_TIMEOUT: Duration = Duration::from_secs(10);
impl NavError {
/// Does this navigation error indicate the underlying Chromium
/// process has died or its CDP connection has dropped? Used by the
/// dispatcher to decide whether to invalidate the
/// [`crate::crawler::browser_manager::BrowserManager`] handle so
/// the next acquire re-launches.
///
/// Both variants count: a `Timeout` past [`NAV_TIMEOUT`] is in
/// practice always either a hung CDP transport or a wedged page
/// the browser can't recover from on its own, and a `Cdp` error
/// surfacing at the navigation layer means the chromium-facing
/// channel is the failing layer.
pub fn is_likely_browser_dead(&self) -> bool {
match self {
Self::Timeout(_) => true,
Self::Cdp(_) => true,
}
}
}
/// Walk an `anyhow::Error` chain looking for typed evidence that the
/// chromium-facing layer is the failing one. Two markers count:
///
/// 1. A wrapped [`NavError`] flagged by [`NavError::is_likely_browser_dead`].
/// 2. A wrapped [`CdpError`] (via `anyhow::Error::from(CdpError)` at a
/// `Browser::new_page` call site, or any other direct CDP boundary).
///
/// Earlier versions also substring-matched the chain for "connection",
/// "closed", "channel", etc. as a fallback. That was too broad —
/// reqwest TCP-reset errors during CDN image downloads, sqlx
/// connection-pool errors, and similar non-browser failures contain
/// those words and triggered spurious chromium relaunches. The typed
/// downcasts cover every place we hand a chromium error to anyhow,
/// so the fallback is unnecessary.
pub fn anyhow_looks_browser_dead(err: &anyhow::Error) -> bool {
for cause in err.chain() {
if let Some(nav) = cause.downcast_ref::<NavError>() {
if nav.is_likely_browser_dead() {
return true;
}
}
if cause.downcast_ref::<CdpError>().is_some() {
return true;
}
}
false
}
#[cfg(test)]
mod tests {
use super::*;
use std::future::pending;
/// Sanity-check the timeout pattern used by [`wait_for_nav`]: a
/// future that never resolves must yield `Elapsed` within the
/// configured budget. We can't easily stand up a real `Page` in a
/// unit test, so we assert the underlying primitive behaves the way
/// the helper depends on.
#[tokio::test(flavor = "current_thread", start_paused = true)]
async fn timeout_elapses_on_a_future_that_never_resolves() {
let result =
tokio::time::timeout(Duration::from_millis(50), pending::<()>()).await;
assert!(result.is_err(), "expected Elapsed on a hung future");
}
#[test]
fn nav_error_timeout_message_includes_duration() {
let e = NavError::Timeout(Duration::from_secs(30));
assert_eq!(e.to_string(), "navigation timed out after 30s");
}
#[test]
fn timeout_is_treated_as_likely_browser_dead() {
let e = NavError::Timeout(NAV_TIMEOUT);
assert!(e.is_likely_browser_dead());
}
#[test]
fn anyhow_with_nav_timeout_in_chain_is_flagged() {
let inner: Result<(), NavError> = Err(NavError::Timeout(NAV_TIMEOUT));
let outer = inner.unwrap_err();
let wrapped: anyhow::Error =
anyhow::Error::new(outer).context("wait for chapter nav");
assert!(anyhow_looks_browser_dead(&wrapped));
}
#[test]
fn anyhow_with_cdp_error_in_chain_is_flagged() {
// `Browser::new_page` errors get wrapped via
// `anyhow::Error::from(CdpError)` at the navigate / dispatch
// call sites. Walking the chain and downcasting to CdpError is
// what catches that path. Any CdpError variant counts; the
// Serde variant is the easiest to construct in a unit test.
let serde_err: serde_json::Error =
serde_json::from_str::<i32>("not a number").unwrap_err();
let cdp = CdpError::Serde(serde_err);
let wrapped: anyhow::Error =
anyhow::Error::from(cdp).context("open chapter page");
assert!(anyhow_looks_browser_dead(&wrapped));
}
#[test]
fn anyhow_with_innocuous_parse_error_is_not_flagged() {
let e: anyhow::Error =
anyhow::anyhow!("parse manga detail: chapter row regex did not match");
assert!(!anyhow_looks_browser_dead(&e));
}
#[test]
fn anyhow_with_reqwest_style_connection_message_is_not_flagged() {
// Regression: the earlier substring fallback flagged any error
// whose message contained "connection" or "closed" as browser-
// dead. A TCP reset from a CDN during image download, or a
// sqlx pool-connection error, would burn a chromium relaunch
// even though the browser is fine. Typed downcasts only —
// these untyped strings must pass through.
for msg in [
"error sending request: connection reset by peer",
"PoolTimedOut: timed out waiting for a connection",
"request to https://cdn/x.jpg: connection closed before message completed",
"transport error during image fetch",
] {
let e: anyhow::Error = anyhow::anyhow!("{msg}");
assert!(
!anyhow_looks_browser_dead(&e),
"must not flag non-browser error: {msg}"
);
}
}
/// Same sanity check as [`timeout_elapses_on_a_future_that_never_resolves`],
/// but for the [`wait_for_selector`] polling pattern: the loop must
/// surrender on `Elapsed` rather than spinning past the deadline.
#[tokio::test(flavor = "current_thread", start_paused = true)]
async fn selector_polling_pattern_surrenders_at_deadline() {
let timeout = Duration::from_millis(300);
let start = tokio::time::Instant::now();
let deadline = start + timeout;
// Simulate find_element forever returning "not found".
let mut polls = 0u32;
let result: Result<(), NavError> = loop {
polls += 1;
if tokio::time::Instant::now() >= deadline {
break Err(NavError::Timeout(timeout));
}
tokio::time::sleep(SELECTOR_POLL_INTERVAL).await;
};
assert!(matches!(result, Err(NavError::Timeout(_))));
// 300ms / 100ms poll interval ≈ 3 iterations plus the final check
// that breaks out. Allow some slack since the first poll happens
// before any sleep.
assert!(polls >= 3, "expected at least 3 poll iterations, got {polls}");
}
}

View File

@@ -0,0 +1,689 @@
//! Crawler pipeline — the reusable metadata pass and the enqueue helpers
//! that fan out chapter-content work. Shared between the daemon (cron tick)
//! and the CLI (`bin/crawler.rs`) so behavior stays in lockstep.
use std::collections::HashSet;
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::browser_manager::BrowserManager;
use crate::crawler::jobs::{self, EnqueueResult, JobPayload};
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::safety::{fetch_bytes_capped, looks_like_image, DownloadAllowlist};
use crate::crawler::source::target::TargetSource;
use crate::crawler::source::{FetchContext, Source};
use crate::repo;
use crate::repo::crawler::UpsertStatus;
use crate::storage::Storage;
/// Coarse counters surfaced for logging at the end of a metadata pass.
#[derive(Debug, Default, Clone, Copy)]
pub struct MetadataStats {
pub discovered: usize,
pub upserted: usize,
pub covers_fetched: usize,
pub mangas_failed: usize,
}
/// Decide whether the per-ref loop should stop on the manga just
/// processed. The walk halts only when (a) the previous run exited
/// cleanly — so the index tail is known to be caught up and we're not
/// in a recovery sweep — AND (b) this manga's metadata hash matched
/// storage (`Unchanged`) AND (c) the chapter sync confirmed zero new
/// chapters. A `None` chapter count (skip_chapters, or a chapter-sync
/// error we logged-and-swallowed) refuses the stop because we can't
/// verify the tail is unchanged from a single piece of evidence.
///
/// Pure function so the rule is unit-testable without the walker, DB,
/// or browser.
pub(crate) fn should_stop(
was_clean: bool,
status: UpsertStatus,
chapters_new: Option<usize>,
) -> bool {
was_clean
&& matches!(status, UpsertStatus::Unchanged)
&& chapters_new == Some(0)
}
/// Whether the just-finished walk should be recorded as a clean exit.
/// `true` writes the recovery flag back to `completed: true`; `false`
/// leaves it `false` so the next tick treats this run as crashed and
/// does a recovery sweep.
///
/// `hit_limit` (the caller-imposed `CRAWLER_LIMIT` cap) is *not* an
/// argument: a limit cap by definition does not reach the catalog tail,
/// so it can never count as a clean exit. Encoding that in the type
/// (rather than as an `&& !hit_limit` clause inline) prevents a future
/// edit from accidentally adding it back to the truth table.
pub(crate) fn should_mark_clean_exit(
walked_to_completion: bool,
hit_stop_condition: bool,
) -> bool {
walked_to_completion || hit_stop_condition
}
/// Runs the discover → fetch → upsert → cover → chapter-list-diff pipeline
/// for the target source. Pure metadata; chapter content is enqueued as
/// separate `SyncChapterContent` jobs by the caller after this returns.
///
/// `limit == 0` means no cap (full sweep up to the source's own bound).
/// `skip_chapters == true` is the "metadata-only" mode (parser doesn't
/// extract chapters, and `sync_manga_chapters` is skipped — otherwise an
/// empty chapter list would soft-drop existing rows). In this mode the
/// stop condition never fires because chapter freshness can't be
/// confirmed, so the walk always runs to end-of-source.
///
/// The walk is always newest-first. Steady-state runs stop on the first
/// manga where metadata is `Unchanged` AND chapter sync reports zero
/// new chapters — the source orders by `update_date DESC`, so anything
/// with a fresh chapter or fresh metadata is bumped to the top and will
/// be processed before we hit a fully-caught-up manga.
///
/// A per-source recovery flag stored in `crawler_state`
/// (`last_run_completed:<source_id>`) gates the early stop: it's set to
/// `false` right after `ensure_source` and back to `true` only when the
/// run exits via end-of-walk OR the intentional stop. A crash, panic,
/// or SIGKILL leaves the flag at `false`, so the next tick reads it,
/// recognizes the previous run did not exit cleanly, and walks the
/// full catalog (ignoring the stop condition) to re-cover anything the
/// crashed run missed past its crash point. Once that recovery sweep
/// reaches end-of-walk, steady-state resumes.
#[allow(clippy::too_many_arguments)]
pub async fn run_metadata_pass(
browser_manager: &BrowserManager,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
start_url: &str,
limit: usize,
skip_chapters: bool,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
) -> anyhow::Result<MetadataStats> {
let lease = browser_manager
.acquire()
.await
.context("acquire browser lease for metadata pass")?;
let browser_ref: &chromiumoxide::Browser = &lease;
let source = {
let s = TargetSource::new(start_url.to_string());
if skip_chapters {
s.without_chapter_parsing()
} else {
s
}
};
let ctx = FetchContext {
browser: browser_ref,
rate,
};
let source_id = source.id();
repo::crawler::ensure_source(
db,
source_id,
"Target Site",
&origin_of(start_url).unwrap_or_else(|| start_url.to_string()),
)
.await
.context("ensure_source")?;
// Read BEFORE flipping to "in-flight" — a `false` here means the
// previous run didn't reach a clean exit, and this run must walk
// the full catalog (recovery sweep) instead of bailing on the
// first caught-up manga.
let was_clean = repo::crawler::last_run_completed_cleanly(db, source_id)
.await
.context("read last_run_completed_cleanly")?;
repo::crawler::mark_run_started(db, source_id)
.await
.context("mark_run_started")?;
let max_refs = (limit > 0).then_some(limit);
tracing::info!(was_clean, ?max_refs, "starting metadata pass");
let mut walker = source
.discover(&ctx)
.await
.context("discover failed")?;
let mut stats = MetadataStats::default();
// Run-scoped dedup of `source_manga_key`s already processed this pass.
// A shift in the source index causes the slot-last item of the page
// we just read to reappear at slot 0 of the next page; skipping it
// here prevents redundant fetch_manga + upsert and avoids spuriously
// tripping the stop condition with a re-confirm of an entry we
// already counted.
let mut seen: HashSet<String> = HashSet::new();
let mut walked_to_completion = false;
let mut hit_limit = false;
let mut hit_stop_condition = false;
'outer: loop {
let batch = match walker.next_batch(&ctx).await? {
Some(b) => b,
None => {
walked_to_completion = true;
break;
}
};
for r in batch {
if max_refs.map(|m| stats.discovered >= m).unwrap_or(false) {
hit_limit = true;
tracing::info!(cap = ?max_refs, "max_results reached; halting walk");
break 'outer;
}
// Skip refs we've already *successfully* processed this pass.
// Checking `contains` here (rather than `insert`) keeps the key
// out of `seen` on failure paths below, so a transient fetch or
// upsert error gets a second chance if the ref reappears in
// another batch. Done *before* counting toward
// `stats.discovered` (the skipped ref did no work) and *before*
// touching the stop check (a `continue` here doesn't let a
// re-confirm trip the stop condition). The matching
// `seen.insert(...)` lives just after the successful upsert
// below.
if seen.contains(&r.source_manga_key) {
tracing::debug!(
key = %r.source_manga_key,
"skip already-seen key in this run"
);
continue;
}
stats.discovered += 1;
tracing::info!(
idx = stats.discovered,
key = %r.source_manga_key,
"fetching metadata"
);
let manga = match source.fetch_manga(&ctx, &r).await {
Ok(m) => m,
Err(e) => {
tracing::warn!(
key = %r.source_manga_key,
url = %r.url,
error = ?e,
"fetch_manga failed"
);
stats.mangas_failed += 1;
continue;
}
};
// Partial-render guard: an empty chapter list paired with a
// prior count > 0 is overwhelmingly a chromium snapshot
// taken between the #chapter_table wrapper render and its
// rows render. The wait_for_selector wait in `navigate`
// narrows this window but cannot close it for slow renders
// beyond the selector budget. Treat as a transient failure
// here — skip upsert, skip seen.insert — so the next batch
// (or the next tick) retries. Skipped in `skip_chapters`
// mode because the parser is configured to return an empty
// Vec by design there.
if !skip_chapters && manga.chapters.is_empty() {
match repo::crawler::live_chapter_count_for_source_manga(
db, source_id, &r.source_manga_key,
)
.await
{
Ok(prior) if prior > 0 => {
tracing::warn!(
key = %r.source_manga_key,
url = %r.url,
prior_chapter_count = prior,
"fetch_manga returned empty chapters but prior count > 0; treating as partial-render transient and skipping"
);
stats.mangas_failed += 1;
continue;
}
Ok(_) => {}
Err(e) => {
// DB lookup failed — fail safe: skip rather
// than risk a soft-drop on a manga whose prior
// count we couldn't confirm.
tracing::warn!(
key = %r.source_manga_key,
error = ?e,
"live_chapter_count_for_source_manga failed; skipping cautiously"
);
stats.mangas_failed += 1;
continue;
}
}
}
let upsert = match repo::crawler::upsert_manga_from_source(
db, source_id, &r.url, &manga,
)
.await
{
Ok(u) => u,
Err(e) => {
tracing::error!(
key = %r.source_manga_key,
error = ?e,
"upsert_manga_from_source failed"
);
stats.mangas_failed += 1;
continue;
}
};
stats.upserted += 1;
// Record success in the dedup set. Cover and chapter-sync
// failures below are non-fatal and don't roll this back —
// metadata is the durable source of truth for the dedup.
seen.insert(r.source_manga_key.clone());
tracing::info!(
key = %manga.source_manga_key,
manga_id = %upsert.manga_id,
status = ?upsert.status,
title = %manga.title,
"manga upserted"
);
// Cover image: download when missing in storage or when metadata
// signaled an update (cover URL is part of metadata_hash, so
// Updated implies the URL may have moved). Failures are non-fatal.
let needs_cover = upsert.cover_image_path.is_none()
|| matches!(upsert.status, repo::crawler::UpsertStatus::Updated);
if needs_cover {
if let Some(cover_url) = manga.cover_url.as_deref() {
match download_and_store_cover(
db,
storage,
http,
rate,
&r.url,
upsert.manga_id,
cover_url,
allowlist,
max_image_bytes,
)
.await
{
Ok(()) => stats.covers_fetched += 1,
Err(e) => tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"cover download failed"
),
}
}
}
// Chapter sync. `chapters_new` feeds the stop check below:
// `None` (skip_chapters mode, or a logged-and-swallowed sync
// error) refuses to stop on this manga because we can't
// confirm "no new chapters."
let chapters_new: Option<usize> = if skip_chapters {
None
} else {
match repo::crawler::sync_manga_chapters(
db,
source_id,
upsert.manga_id,
&manga.chapters,
)
.await
{
Ok(diff) => {
tracing::info!(
manga_id = %upsert.manga_id,
new = diff.new,
refreshed = diff.refreshed,
dropped = diff.dropped,
"chapters synced"
);
Some(diff.new)
}
Err(e) => {
tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"chapter sync failed"
);
None
}
}
};
if should_stop(was_clean, upsert.status, chapters_new) {
hit_stop_condition = true;
tracing::info!(
key = %manga.source_manga_key,
"stop condition met (Unchanged metadata + 0 new chapters); halting walk"
);
break 'outer;
}
}
}
// Recovery-flag write. Only on a clean exit (end-of-walk OR the
// intentional stop). `hit_limit` is a caller-imposed early break
// and does NOT count — the catalog tail wasn't reached, so a future
// tick still needs to walk past where we stopped. The truth table is
// pinned by `should_mark_clean_exit` so a future edit that adds
// `hit_limit` back into the disjunction trips its unit test. Flag-
// write errors are warned and swallowed: the run already did its
// work, and a stale `false` flag just buys a recovery sweep on the
// next tick.
let exited_cleanly = should_mark_clean_exit(walked_to_completion, hit_stop_condition);
if exited_cleanly {
if let Err(e) = repo::crawler::mark_run_completed(db, source_id).await {
tracing::warn!(error = ?e, "mark_run_completed failed");
}
}
tracing::info!(
was_clean,
discovered = stats.discovered,
upserted = stats.upserted,
covers_fetched = stats.covers_fetched,
mangas_failed = stats.mangas_failed,
walked_to_completion,
hit_limit,
hit_stop_condition,
exited_cleanly,
"metadata pass complete"
);
drop(lease);
Ok(stats)
}
/// Quarantine window for chapters whose latest `SyncChapterContent` job is
/// `dead`. The partial dedup index `crawler_jobs_chapter_content_dedup_idx`
/// only blocks `(pending|running)` duplicates, so without this gate a
/// permanently-failing chapter is re-enqueued every cron tick, burns
/// `max_attempts` retries, dies again, and spins forever. With the gate,
/// dead chapters get a week of silence before the next attempt — long
/// enough for a transient site issue to resolve, short enough that
/// permanent failures don't stay permanent if conditions change.
const CHAPTER_DEAD_QUARANTINE_DAYS: i64 = 7;
/// Enqueue a `SyncChapterContent` job for every chapter of *any* bookmarked
/// manga that still has `page_count = 0` and a non-dropped source row.
/// Chapters whose latest job is `dead` within `CHAPTER_DEAD_QUARANTINE_DAYS`
/// are excluded to break the dead-letter spin.
/// Returns `(inserted, skipped)` counts. Dedup index handles repeats.
pub async fn enqueue_bookmarked_pending(pool: &PgPool) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.page_count = 0
AND cs.dropped_at IS NULL
AND NOT EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.payload->>'kind' = 'sync_chapter_content'
AND cj.payload->>'chapter_id' = c.id::text
AND cj.state = 'dead'
AND cj.updated_at > now() - ($1::bigint || ' days')::interval
)
GROUP BY cs.source_id, c.id, cs.source_chapter_key, c.manga_id, c.created_at
ORDER BY c.manga_id, c.created_at ASC
"#,
)
.bind(CHAPTER_DEAD_QUARANTINE_DAYS)
.fetch_all(pool)
.await
.context("query bookmarked-pending chapters")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
/// Enqueue chapter-content jobs for a *single* manga (the bookmark-create
/// hook). Same dedup semantics as [`enqueue_bookmarked_pending`], including
/// the dead-letter quarantine — a freshly bookmarked manga should not
/// burn retries on chapters that just died on the cron tick.
pub async fn enqueue_pending_for_manga(
pool: &PgPool,
manga_id: Uuid,
) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT DISTINCT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.manga_id = $1
AND c.page_count = 0
AND cs.dropped_at IS NULL
AND NOT EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.payload->>'kind' = 'sync_chapter_content'
AND cj.payload->>'chapter_id' = c.id::text
AND cj.state = 'dead'
AND cj.updated_at > now() - ($2::bigint || ' days')::interval
)
ORDER BY cs.source_id, c.id
"#,
)
.bind(manga_id)
.bind(CHAPTER_DEAD_QUARANTINE_DAYS)
.fetch_all(pool)
.await
.context("query pending chapters for manga")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
#[derive(Debug, Default, Clone, Copy)]
pub struct EnqueueSummary {
pub inserted: usize,
pub skipped: usize,
pub failed: usize,
}
/// Download a cover image and persist its storage path. Local to the
/// pipeline because the CLI still calls it from its inline chapter-content
/// loop; once the worker pool fully replaces that path we can fold this
/// into `pipeline` proper.
#[allow(clippy::too_many_arguments)]
async fn download_and_store_cover(
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
manga_url: &str,
manga_id: Uuid,
cover_url: &str,
allowlist: &DownloadAllowlist,
max_image_bytes: usize,
) -> anyhow::Result<()> {
let absolute = reqwest::Url::parse(manga_url)
.context("parse manga URL")?
.join(cover_url)
.context("join cover URL onto manga URL")?;
rate.wait_for(absolute.as_str()).await?;
let bytes = fetch_bytes_capped(
http,
absolute.as_str(),
Some(manga_url),
allowlist,
max_image_bytes,
)
.await?;
if !looks_like_image(&bytes) {
anyhow::bail!(
"cover URL {absolute} returned non-image bytes; refusing to store as binary blob"
);
}
let ext = infer::get(&bytes)
.map(|k| k.extension())
.expect("looks_like_image asserted infer succeeded");
let key = format!("mangas/{manga_id}/cover.{ext}");
storage
.put(&key, &bytes)
.await
.with_context(|| format!("store cover at {key}"))?;
repo::manga::set_cover_image_path(db, manga_id, &key)
.await
.with_context(|| format!("update cover_image_path for {manga_id}"))?;
tracing::info!(
manga_id = %manga_id,
key = %key,
bytes = bytes.len(),
%absolute,
"cover stored"
);
Ok(())
}
use crate::crawler::url_utils::origin_of;
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn stop_condition_fires_on_unchanged_metadata_and_zero_new_chapters() {
// The whole point of the rule: in steady state, a manga whose
// metadata hash matches AND whose chapter list gained no new
// entries proves we've reached the caught-up tail of a
// newest-first index.
assert!(should_stop(true, UpsertStatus::Unchanged, Some(0)));
}
#[test]
fn stop_condition_refuses_when_chapters_added() {
// Unchanged metadata + N new chapters means the source bumped
// this manga because of the chapter add; the rest of the index
// is still ahead of us. Don't bail.
assert!(!should_stop(true, UpsertStatus::Unchanged, Some(1)));
assert!(!should_stop(true, UpsertStatus::Unchanged, Some(42)));
}
#[test]
fn stop_condition_refuses_when_metadata_changed() {
// Updated or New metadata always continues — even with zero new
// chapters — because the change-of-metadata bump itself is what
// the walk is following.
assert!(!should_stop(true, UpsertStatus::Updated, Some(0)));
assert!(!should_stop(true, UpsertStatus::New, Some(0)));
}
#[test]
fn stop_condition_refuses_when_chapter_count_unknown() {
// skip_chapters mode (CLI metadata-only sweep) or a
// logged-and-swallowed chapter sync error: we can't claim "no
// new chapters" from absence of evidence, so don't stop. The
// operator who runs metadata-only intentionally wants a full
// walk anyway.
assert!(!should_stop(true, UpsertStatus::Unchanged, None));
}
#[test]
fn stop_condition_disabled_in_recovery_mode() {
// was_clean = false means the previous run did not exit cleanly;
// the catalog past its crash point is potentially un-synced. Walk
// to end-of-source no matter what individual mangas report.
assert!(!should_stop(false, UpsertStatus::Unchanged, Some(0)));
assert!(!should_stop(false, UpsertStatus::Unchanged, Some(1)));
assert!(!should_stop(false, UpsertStatus::Updated, Some(0)));
assert!(!should_stop(false, UpsertStatus::New, None));
}
#[test]
fn clean_exit_when_walked_to_completion() {
// End-of-walk reached the catalog tail — the recovery flag may
// safely flip back to `true`.
assert!(should_mark_clean_exit(true, false));
}
#[test]
fn clean_exit_when_stop_condition_fired() {
// First Unchanged + 0-new-chapter manga is a complete steady-
// state exit: every manga newer than this point was synced, and
// by source-side `update_date DESC` ordering everything past
// this point is at least as caught-up.
assert!(should_mark_clean_exit(false, true));
}
#[test]
fn dirty_exit_when_neither_completion_nor_stop_fired() {
// The walk ended for some other reason — including the
// caller-imposed `hit_limit` cap, which is the regression case
// this test exists for. `should_mark_clean_exit` does not take
// `hit_limit` as a parameter, so a future edit that adds
// `|| hit_limit` to the inline expression in `run_metadata_pass`
// would need to also touch this helper, and would fail this
// assertion when it did.
assert!(!should_mark_clean_exit(false, false));
}
#[test]
fn run_scoped_seen_set_skips_duplicate_source_manga_keys() {
// Pins the per-ref loop contract: `contains` gates whether work
// runs, and `insert` only fires on the success path (after upsert).
// A failed ref that reappears later in the same pass must get a
// second chance — that's why the loop uses contains-then-insert
// instead of insert-and-skip-on-collision.
let mut seen: HashSet<String> = HashSet::new();
// First sighting of a key: not yet seen → loop proceeds.
assert!(!seen.contains("manga-a"), "first sighting is unseen");
// Simulate a failed fetch_manga: do NOT insert. Next sighting must
// still be considered unseen so the loop retries it.
assert!(!seen.contains("manga-a"), "failed key is still retryable");
// Now simulate a successful upsert — insert is called.
seen.insert("manga-a".to_string());
// Subsequent sightings of the same key are skipped.
assert!(seen.contains("manga-a"), "successful key is now seen");
// Distinct keys never collide.
assert!(!seen.contains("manga-b"), "different key independent");
seen.insert("manga-b".to_string());
assert!(seen.contains("manga-b"));
assert!(seen.contains("manga-a"), "first key still recorded");
}
}

View File

@@ -0,0 +1,178 @@
//! Per-host request pacing.
//!
//! `RateLimiter` is a single-token bucket: each `wait().await` returns
//! immediately when at least `interval` has elapsed since the last call,
//! otherwise sleeps just enough to satisfy it. Uses
//! `tokio::time::Instant` so tests can run under `start_paused` virtual
//! time without sleeping for real.
//!
//! `HostRateLimiters` is the multi-host wrapper actually used by the
//! crawler — concurrent workers issuing requests to different origins
//! (catalog vs. CDN) don't contend on a shared budget; each host gets
//! its own bucket. `wait_for(url)` extracts the host, lazily creates a
//! limiter for it, and serializes only against other callers hitting
//! the same host.
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;
#[derive(Debug)]
pub struct RateLimiter {
interval: Duration,
last: Option<Instant>,
}
impl RateLimiter {
pub fn new(interval: Duration) -> Self {
Self {
interval,
last: None,
}
}
pub async fn wait(&mut self) {
if let Some(last) = self.last {
let elapsed = last.elapsed();
if elapsed < self.interval {
tokio::time::sleep(self.interval - elapsed).await;
}
}
self.last = Some(Instant::now());
}
}
/// Per-host rate limiter map. The outer `Mutex<HashMap>` is held only
/// during the entry-or-insert + Arc clone; the per-host `Mutex<RateLimiter>`
/// is held during the actual `wait().await`. So N workers calling
/// `wait_for(url)` on N different hosts contend nowhere except the brief
/// HashMap lookup; workers hitting the same host serialize on that
/// host's bucket.
#[derive(Debug)]
pub struct HostRateLimiters {
default_interval: Duration,
overrides: HashMap<String, Duration>,
map: Mutex<HashMap<String, Arc<Mutex<RateLimiter>>>>,
}
impl HostRateLimiters {
pub fn new(default_interval: Duration) -> Self {
Self {
default_interval,
overrides: HashMap::new(),
map: Mutex::new(HashMap::new()),
}
}
/// Set a per-host interval that overrides `default_interval`. Calls
/// after a host's limiter has been instantiated do *not* re-create
/// it — set all overrides before the first `wait_for` to that host.
pub fn with_override(mut self, host: impl Into<String>, interval: Duration) -> Self {
self.overrides.insert(host.into(), interval);
self
}
/// Block until the per-host budget allows the next request to
/// `url`'s host. Returns an error only when the URL has no host
/// (malformed input).
pub async fn wait_for(&self, url: &str) -> anyhow::Result<()> {
let host = host_of(url)
.ok_or_else(|| anyhow::anyhow!("no host in url: {url}"))?;
let limiter = {
let mut map = self.map.lock().await;
map.entry(host.clone())
.or_insert_with(|| {
let interval = self
.overrides
.get(&host)
.copied()
.unwrap_or(self.default_interval);
Arc::new(Mutex::new(RateLimiter::new(interval)))
})
.clone()
};
limiter.lock().await.wait().await;
Ok(())
}
}
// `host_of` was duplicated across session/rate_limit/pipeline; the
// canonical version now lives in `crawler::url_utils`.
use crate::crawler::url_utils::host_of;
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test(start_paused = true)]
async fn first_call_does_not_sleep() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait().await;
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[tokio::test(start_paused = true)]
async fn second_call_sleeps_to_fill_interval() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait().await;
rl.wait().await;
// Second call had to wait the full 100ms after the (instant)
// first call.
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
#[tokio::test(start_paused = true)]
async fn no_sleep_if_interval_already_elapsed() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
rl.wait().await;
tokio::time::sleep(Duration::from_millis(250)).await;
let t0 = Instant::now();
rl.wait().await;
// Already 250ms past — no further wait needed.
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[test]
fn host_of_parses_scheme_path_and_port() {
assert_eq!(host_of("https://Example.com/path").as_deref(), Some("example.com"));
assert_eq!(host_of("http://cdn.foo.bar/img.jpg").as_deref(), Some("cdn.foo.bar"));
assert_eq!(host_of("http://localhost:8080/x").as_deref(), Some("localhost"));
assert!(host_of("not a url").is_none());
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_pace_per_host() {
// Two hosts at 100ms each. Two consecutive calls to the SAME
// host wait 100ms total. Two consecutive calls to DIFFERENT
// hosts both fire immediately.
let rl = HostRateLimiters::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
rl.wait_for("https://b.example/y").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::ZERO, "different hosts don't contend");
let t1 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
assert_eq!(
Instant::now() - t1,
Duration::from_millis(100),
"second call to same host waits a full interval"
);
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_honor_overrides() {
let rl = HostRateLimiters::new(Duration::from_millis(1000))
.with_override("fast.example", Duration::from_millis(100));
rl.wait_for("https://fast.example/a").await.unwrap();
let t0 = Instant::now();
rl.wait_for("https://fast.example/b").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
}

View File

@@ -0,0 +1,558 @@
//! Defensive helpers for the image-download paths.
//!
//! Two threats this module addresses:
//!
//! - **SSRF**: a scraped chapter or manga page can embed an absolute
//! `<img src="http://10.0.0.1/...">`. The crawler runs inside the
//! backend container with intra-compose access to `postgres:5432`
//! and possibly other internal services; without a host check the
//! crawler would happily probe them. [`is_safe_url`] rejects
//! anything whose host isn't on the operator-configured allowlist,
//! plus any IP literal in RFC1918 / loopback / link-local / unique-
//! local space (including IPv4-mapped IPv6 like `::ffff:127.0.0.1`)
//! as a second defence for the case where an allowlisted hostname's
//! DNS happens to resolve to a literal private address.
//!
//! **DNS rebinding is not covered.** A hostname like `cdn.allowed.com`
//! that *resolves* to `127.0.0.1` via hostile DNS bypasses the IP
//! check entirely — `is_safe_url` only inspects URL strings, not
//! resolved IPs. Mitigating that requires a custom reqwest resolver
//! that filters IPs after DNS, which would mean rebuilding reqwest's
//! connector. The allowlist + good operator DNS hygiene is the
//! realistic mitigation today.
//!
//! - **Unbounded download**: `Response::bytes().await` reads the full
//! body before returning. A malicious source serving a 10 GiB image
//! would fill memory and then disk. [`accumulate_capped`] streams
//! the body chunk-by-chunk into a [`bytes::BytesMut`] and bails as
//! soon as the running total exceeds the cap.
//!
//! Both helpers are pure-data: the SSRF check is keyed off a parsed
//! URL string, and the byte accumulator is keyed off a generic stream.
//! Easy to unit-test without a live network or browser.
use std::net::IpAddr;
use anyhow::{bail, Context};
use bytes::BytesMut;
use futures_util::StreamExt;
use reqwest::Url;
/// Default per-image download cap. A page image is generally <2 MiB;
/// 32 MiB leaves headroom for high-resolution covers while still
/// stopping a misbehaving CDN dead. Override via `CRAWLER_MAX_IMAGE_BYTES`.
pub const DEFAULT_MAX_IMAGE_BYTES: usize = 32 * 1024 * 1024;
/// Hosts that are always allowed in addition to the operator's
/// configured allowlist. None by default — keeping the surface area
/// minimal so the only way a URL gets through is if it matches an
/// explicit catalog/CDN entry.
///
/// `allow_any` flips the host check off entirely (private-IP and
/// scheme checks still apply). It exists for operators whose sources
/// shard images across numbered CDN subdomains (`cdn1`, `cdn2`, …)
/// where enumerating each host upfront is impractical. Off by default.
#[derive(Clone, Debug, Default)]
pub struct DownloadAllowlist {
hosts: Vec<String>,
allow_any: bool,
}
impl DownloadAllowlist {
pub fn new() -> Self {
Self {
hosts: Vec::new(),
allow_any: false,
}
}
/// Bypass the host allowlist. Scheme, localhost, and private-IP
/// checks in [`is_safe_url`] continue to apply — this only opens
/// up public hosts that weren't pre-enumerated.
pub fn allow_any() -> Self {
Self {
hosts: Vec::new(),
allow_any: true,
}
}
/// Add a host (case-insensitive match). Sub-domains are *not*
/// implied: pass `cdn.example.com` and `example.com` separately
/// if both should be reachable.
pub fn allow(mut self, host: impl Into<String>) -> Self {
let h = host.into().to_ascii_lowercase();
if !h.is_empty() && !self.hosts.iter().any(|existing| existing == &h) {
self.hosts.push(h);
}
self
}
pub fn is_empty(&self) -> bool {
self.hosts.is_empty()
}
pub fn contains(&self, host: &str) -> bool {
if self.allow_any {
return true;
}
let lower = host.to_ascii_lowercase();
self.hosts.iter().any(|h| h == &lower)
}
}
/// Verify a URL is safe for the crawler to fetch.
///
/// Rejects:
/// - non-http(s) schemes (file://, gopher://, …),
/// - any IP literal in private / loopback / link-local / unique-local
/// space (defense in depth — a DNS allowlist alone wouldn't cover an
/// attacker that places an entry like `cdn.evil` pointing at
/// `192.168.1.1`),
/// - the literal hostname `localhost`,
/// - hosts that aren't on the supplied allowlist.
///
/// An empty allowlist rejects everything (the conservative default —
/// callers must explicitly allow the catalog and CDN hosts).
pub fn is_safe_url(raw_url: &str, allow: &DownloadAllowlist) -> Result<(), UrlSafetyError> {
let url = Url::parse(raw_url).map_err(|_| UrlSafetyError::Unparseable)?;
let scheme = url.scheme();
if scheme != "http" && scheme != "https" {
return Err(UrlSafetyError::BadScheme(scheme.to_string()));
}
let host = url.host_str().ok_or(UrlSafetyError::NoHost)?;
let lower_host = host.to_ascii_lowercase();
if lower_host == "localhost" {
return Err(UrlSafetyError::Loopback);
}
// Reject IP literals in private/loopback ranges regardless of the
// allowlist — if someone puts an IP literal on the allowlist they
// almost certainly didn't mean a private range.
// reqwest::Url normalises IPv6 literals as `[::1]` (brackets
// included) in `host_str()`. Strip the brackets before parsing.
let ip_candidate = lower_host
.strip_prefix('[')
.and_then(|s| s.strip_suffix(']'))
.unwrap_or(&lower_host);
if let Ok(ip) = ip_candidate.parse::<IpAddr>() {
if is_private_ip(&ip) {
return Err(UrlSafetyError::PrivateIp(ip));
}
}
if !allow.contains(&lower_host) {
return Err(UrlSafetyError::HostNotAllowed(lower_host));
}
Ok(())
}
fn is_private_ip(ip: &IpAddr) -> bool {
match ip {
IpAddr::V4(v4) => {
v4.is_loopback()
|| v4.is_private()
|| v4.is_link_local()
|| v4.is_unspecified()
|| v4.is_broadcast()
// CGNAT 100.64.0.0/10
|| (v4.octets()[0] == 100 && (v4.octets()[1] & 0xC0) == 64)
// 169.254/16 link-local already covered, but 0.0.0.0/8 is special-use
|| v4.octets()[0] == 0
}
IpAddr::V6(v6) => {
// IPv4-mapped IPv6 (::ffff:0:0/96): unwrap to the embedded
// IPv4 and recurse so `::ffff:127.0.0.1` is caught by the
// IPv4 loopback check rather than passing through.
// `Ipv6Addr::is_loopback()` only matches `::1` exactly.
if let Some(v4) = v6.to_ipv4_mapped() {
return is_private_ip(&IpAddr::V4(v4));
}
v6.is_loopback()
|| v6.is_unspecified()
// fc00::/7 unique-local
|| (v6.segments()[0] & 0xfe00) == 0xfc00
// fe80::/10 link-local
|| (v6.segments()[0] & 0xffc0) == 0xfe80
}
}
}
#[derive(Debug, thiserror::Error, PartialEq, Eq)]
pub enum UrlSafetyError {
#[error("URL is not parseable")]
Unparseable,
#[error("scheme {0:?} is not http or https")]
BadScheme(String),
#[error("URL is missing a host")]
NoHost,
#[error("host points at the loopback interface")]
Loopback,
#[error("host is a private/internal IP: {0}")]
PrivateIp(IpAddr),
#[error("host {0:?} is not on the crawler download allowlist")]
HostNotAllowed(String),
}
/// Drain a byte stream into a single buffer, bailing out as soon as
/// the running total exceeds `max_bytes`. Generic over the stream so
/// it's testable without a live HTTP response.
pub async fn accumulate_capped<S, E>(stream: S, max_bytes: usize) -> anyhow::Result<bytes::Bytes>
where
S: futures_core::Stream<Item = Result<bytes::Bytes, E>>,
E: std::error::Error + Send + Sync + 'static,
{
let mut buf = BytesMut::new();
let mut stream = std::pin::pin!(stream);
while let Some(chunk) = stream.next().await {
let chunk = chunk.map_err(|e| anyhow::anyhow!("stream chunk: {e}"))?;
if buf.len().saturating_add(chunk.len()) > max_bytes {
bail!(
"response exceeds {max_bytes}-byte cap (received >{}+{})",
buf.len(),
chunk.len()
);
}
buf.extend_from_slice(&chunk);
}
Ok(buf.freeze())
}
/// Send `req` and stream the response into a length-limited buffer.
/// Combines [`is_safe_url`] check + [`accumulate_capped`] so each
/// call-site is one line.
pub async fn fetch_bytes_capped(
http: &reqwest::Client,
url: &str,
referer: Option<&str>,
allow: &DownloadAllowlist,
max_bytes: usize,
) -> anyhow::Result<bytes::Bytes> {
is_safe_url(url, allow).with_context(|| format!("reject unsafe URL {url}"))?;
let mut req = http.get(url);
if let Some(r) = referer {
req = req.header(reqwest::header::REFERER, r);
}
let resp = req
.send()
.await
.with_context(|| format!("GET {url}"))?
.error_for_status()
.with_context(|| format!("non-2xx for {url}"))?;
accumulate_capped(resp.bytes_stream(), max_bytes)
.await
.with_context(|| format!("download body for {url}"))
}
/// True when `bytes` sniffs as one of the *renderable* image formats
/// the `/files/*key` endpoint can serve with a correct Content-Type:
/// JPEG, PNG, WebP, GIF, AVIF. Matches the upload pipeline's
/// whitelist in `upload::parse_image`.
///
/// `infer::MatcherType::Image` is intentionally NOT used — it also
/// matches BMP, TIFF, HEIF, ICO, PSD, and JP2. Those would sniff as
/// "image" here but [`api::files::content_type_for`] would fall back
/// to `application/octet-stream`, prompting browsers to download
/// instead of render. Keep the two layers aligned.
pub fn looks_like_image(bytes: &[u8]) -> bool {
matches!(
infer::get(bytes).map(|k| k.mime_type()),
Some("image/jpeg" | "image/png" | "image/webp" | "image/gif" | "image/avif")
)
}
#[cfg(test)]
mod tests {
use super::*;
use futures_util::stream;
fn allow_just(host: &str) -> DownloadAllowlist {
DownloadAllowlist::new().allow(host)
}
#[test]
fn allow_any_admits_arbitrary_public_host() {
// Operators who can't pre-enumerate a numbered-CDN fleet
// (cdn1, cdn2, …) opt into allow_any. Any public host passes.
let allow = DownloadAllowlist::allow_any();
assert!(is_safe_url("https://cdn7.random.tld/x.jpg", &allow).is_ok());
assert!(is_safe_url("https://anything-goes.example/", &allow).is_ok());
}
#[test]
fn allow_any_still_blocks_private_ips() {
// The point of the bypass is the host-allowlist check, not the
// SSRF defense. Private/loopback IPs stay refused.
let allow = DownloadAllowlist::allow_any();
for url in [
"http://10.0.0.1/",
"http://192.168.1.1/",
"http://169.254.169.254/",
"http://127.0.0.1/",
"http://[::1]/",
"http://[::ffff:127.0.0.1]/",
] {
assert!(
matches!(
is_safe_url(url, &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
),
"allow_any must still reject {url}"
);
}
}
#[test]
fn allow_any_still_blocks_localhost() {
let allow = DownloadAllowlist::allow_any();
assert!(matches!(
is_safe_url("http://localhost:8080/", &allow).unwrap_err(),
UrlSafetyError::Loopback
));
}
#[test]
fn allow_any_still_blocks_non_http_schemes() {
let allow = DownloadAllowlist::allow_any();
assert!(matches!(
is_safe_url("file:///etc/passwd", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
}
#[test]
fn safe_url_allows_listed_host() {
let allow = allow_just("cdn.example.com");
assert!(is_safe_url("https://cdn.example.com/img.jpg", &allow).is_ok());
}
#[test]
fn safe_url_blocks_unlisted_host() {
let allow = allow_just("cdn.example.com");
let err = is_safe_url("https://evil.example.org/img.jpg", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::HostNotAllowed(h) if h == "evil.example.org"));
}
#[test]
fn safe_url_blocks_localhost_even_if_allowlisted() {
let allow = allow_just("localhost");
assert!(matches!(
is_safe_url("http://localhost:8080/", &allow).unwrap_err(),
UrlSafetyError::Loopback
));
}
#[test]
fn safe_url_blocks_loopback_ipv4() {
let allow = allow_just("127.0.0.1");
assert!(matches!(
is_safe_url("http://127.0.0.1/", &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
));
}
#[test]
fn safe_url_blocks_rfc1918() {
let allow = allow_just("10.0.0.1");
for url in [
"http://10.0.0.1/",
"http://192.168.1.1/",
"http://172.16.0.5/",
"http://172.31.255.255/",
] {
assert!(
matches!(
is_safe_url(url, &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
),
"should reject {url}"
);
}
}
#[test]
fn safe_url_blocks_link_local() {
let allow = allow_just("169.254.169.254");
// 169.254.169.254 is the AWS/GCP metadata service — the most
// dangerous SSRF target on a default cloud VM.
assert!(matches!(
is_safe_url("http://169.254.169.254/", &allow).unwrap_err(),
UrlSafetyError::PrivateIp(_)
));
}
#[test]
fn safe_url_blocks_ipv6_loopback_and_ula() {
// Debug what host_str returns first — reqwest::Url normalises
// IPv6 literals as `[::1]` with brackets, which doesn't parse
// as `IpAddr` directly. The implementation strips them.
let allow = allow_just("[::1]");
let err = is_safe_url("http://[::1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
let allow = allow_just("[fd00::1]");
let err = is_safe_url("http://[fd00::1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
}
#[test]
fn safe_url_blocks_ipv4_mapped_ipv6_loopback() {
// `Ipv6Addr::is_loopback()` only matches `::1` exactly, so
// `::ffff:127.0.0.1` would slip through without the
// to_ipv4_mapped() unwrap in is_private_ip.
let allow = allow_just("[::ffff:127.0.0.1]");
let err = is_safe_url("http://[::ffff:127.0.0.1]/", &allow).unwrap_err();
assert!(
matches!(err, UrlSafetyError::PrivateIp(_)),
"expected PrivateIp, got {err:?}"
);
}
#[test]
fn safe_url_blocks_ipv4_mapped_ipv6_rfc1918() {
let allow = allow_just("[::ffff:10.0.0.1]");
let err = is_safe_url("http://[::ffff:10.0.0.1]/", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::PrivateIp(_)));
}
#[test]
fn safe_url_blocks_non_http_schemes() {
let allow = allow_just("anywhere");
assert!(matches!(
is_safe_url("file:///etc/passwd", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
assert!(matches!(
is_safe_url("gopher://anywhere:70/", &allow).unwrap_err(),
UrlSafetyError::BadScheme(_)
));
}
#[test]
fn safe_url_rejects_unparseable() {
let allow = allow_just("anywhere");
assert!(matches!(
is_safe_url("not a url", &allow).unwrap_err(),
UrlSafetyError::Unparseable
));
}
#[test]
fn safe_url_empty_allowlist_rejects_everything() {
let allow = DownloadAllowlist::new();
let err = is_safe_url("https://cdn.example.com/img.jpg", &allow).unwrap_err();
assert!(matches!(err, UrlSafetyError::HostNotAllowed(_)));
}
#[test]
fn allowlist_matches_case_insensitively() {
let allow = DownloadAllowlist::new().allow("CDN.Example.COM");
assert!(is_safe_url("https://cdn.example.com/x.jpg", &allow).is_ok());
assert!(is_safe_url("https://CDN.EXAMPLE.com/x.jpg", &allow).is_ok());
}
#[tokio::test]
async fn accumulate_capped_returns_full_body_under_cap() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from_static(b"hello ")),
Ok(bytes::Bytes::from_static(b"world")),
];
let s = stream::iter(chunks);
let out = accumulate_capped(s, 100).await.unwrap();
assert_eq!(out.as_ref(), b"hello world");
}
#[tokio::test]
async fn accumulate_capped_bails_past_cap() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from(vec![0u8; 50])),
Ok(bytes::Bytes::from(vec![0u8; 60])),
];
let s = stream::iter(chunks);
let err = accumulate_capped(s, 100).await.unwrap_err();
assert!(err.to_string().contains("100-byte cap"));
}
#[tokio::test]
async fn accumulate_capped_surfaces_stream_errors() {
let chunks: Vec<Result<bytes::Bytes, std::io::Error>> = vec![
Ok(bytes::Bytes::from_static(b"ok")),
Err(std::io::Error::other("network blip")),
];
let s = stream::iter(chunks);
let err = accumulate_capped(s, 100).await.unwrap_err();
assert!(err.to_string().contains("network blip"));
}
#[test]
fn looks_like_image_accepts_jpeg() {
// JPEG SOI + APP0 segment.
let jpeg = [0xff, 0xd8, 0xff, 0xe0, 0, 0x10, b'J', b'F', b'I', b'F'];
assert!(looks_like_image(&jpeg));
}
#[test]
fn looks_like_image_accepts_png() {
let png = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0, 0, 0, 0];
assert!(looks_like_image(&png));
}
#[test]
fn looks_like_image_rejects_html_disguised_as_image() {
let html = b"<html><body>not an image</body></html>";
assert!(!looks_like_image(html));
}
#[test]
fn looks_like_image_rejects_empty() {
assert!(!looks_like_image(&[]));
}
#[test]
fn looks_like_image_rejects_renderable_but_unsupported_formats() {
// BMP, TIFF, ICO, PSD are `infer::MatcherType::Image` but the
// /files/*key handler doesn't have Content-Type mappings for
// them, so they'd be served as application/octet-stream and
// download instead of render. Reject at the crawler so we
// never land them in storage.
// BMP magic: "BM" + 4-byte size.
let bmp = [b'B', b'M', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
assert!(!looks_like_image(&bmp), "BMP must be rejected (not renderable by /files)");
// TIFF little-endian magic: "II" + 42.
let tiff = [0x49, 0x49, 0x2a, 0x00, 0, 0, 0, 0];
assert!(!looks_like_image(&tiff), "TIFF must be rejected");
// ICO magic: 0x00,0x00,0x01,0x00.
let ico = [0x00, 0x00, 0x01, 0x00, 1, 0, 16, 16, 0, 0, 1, 0, 0x18, 0, 0x40, 0, 0, 0, 0x16, 0, 0, 0];
assert!(!looks_like_image(&ico), "ICO must be rejected");
}
#[test]
fn looks_like_image_accepts_webp_gif_avif() {
// Cover the three remaining whitelisted formats so a future
// tightening that drops one would fail noisily.
let webp = [
b'R', b'I', b'F', b'F',
0, 0, 0, 0,
b'W', b'E', b'B', b'P',
b'V', b'P', b'8', b' ',
];
assert!(looks_like_image(&webp));
let gif = [b'G', b'I', b'F', b'8', b'7', b'a', 0, 0, 0, 0];
assert!(looks_like_image(&gif));
let avif = [
0x00, 0x00, 0x00, 0x18,
b'f', b't', b'y', b'p',
b'a', b'v', b'i', b'f',
0x00, 0x00, 0x00, 0x00,
b'm', b'i', b'f', b'1',
b'a', b'v', b'i', b'f',
];
assert!(looks_like_image(&avif));
}
}

View File

@@ -0,0 +1,351 @@
//! PHPSESSID injection + login probe.
//!
//! The catalog site we crawl renders chapter pages as a single multi-
//! page list only for logged-in users. We don't try to bypass the
//! login (CAPTCHA wall) — instead the operator pastes their browser's
//! `PHPSESSID` cookie into `CRAWLER_PHPSESSID` and the crawler injects
//! it into Chromium *and* reqwest before the first navigation.
//!
//! Two things the cookie alone doesn't give us:
//! 1. The cookie value is only meaningful to the *server* — we have
//! no way to predict from the value alone whether it's still valid.
//! `verify_session` does a navigation and inspects the probe page
//! for three outcomes: broken-page response (transient — retry the
//! probe), `#logo` present but `#avatar_menu` absent (genuine logout
//! — bail loudly), or both present (authenticated). The earlier
//! avatar-only check conflated "site is hiccuping" with "session is
//! dead" and refused to start the crawler when the site had a brief
//! 503.
//! 2. The reqwest client (used for cover and chapter-image downloads)
//! has its own cookie store; we seed it for the catalog host only.
//! CDN hosts are deliberately *not* given the cookie — they serve
//! image bytes by signed URLs and don't need it.
use std::time::Duration;
use anyhow::{anyhow, Context};
use chromiumoxide::browser::Browser;
use chromiumoxide::cdp::browser_protocol::network::CookieParam;
use crate::crawler::detect::{has_logo_sentinel, is_broken_page_body};
/// Outcome of inspecting a probe-page response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionProbe {
/// `#logo` present and `#avatar_menu` present — session valid.
Ok,
/// `#logo` present but `#avatar_menu` absent — site rendered the
/// normal layout for an unauthenticated visitor; refresh PHPSESSID.
Unauthenticated,
/// Broken-page body signature or `#logo` missing — site is hiccuping.
/// Caller retries the probe rather than blaming the session.
Transient,
}
/// Re-export so existing callers keep working after the helper moved
/// to `crawler::url_utils`. The body lives there.
pub use crate::crawler::url_utils::registrable_domain;
/// Inject the PHPSESSID cookie into the browser's cookie store for the
/// catalog domain. Must be called before any navigation that depends on
/// authentication; subsequent navigations include the cookie
/// automatically.
pub async fn inject_phpsessid(
browser: &Browser,
sid: &str,
cookie_domain: &str,
) -> anyhow::Result<()> {
let cookie = CookieParam {
name: "PHPSESSID".to_string(),
value: sid.to_string(),
url: None,
domain: Some(cookie_domain.to_string()),
path: Some("/".to_string()),
secure: None,
http_only: Some(true),
same_site: None,
expires: None,
priority: None,
same_party: None,
source_scheme: None,
source_port: None,
partition_key: None,
};
browser
.set_cookies(vec![cookie])
.await
.context("set PHPSESSID in chromium cookie store")?;
tracing::info!(domain = cookie_domain, "injected PHPSESSID into browser");
Ok(())
}
/// Three-way classification of a probe-page response. Pure over HTML so
/// it's unit-testable without a real browser. Order matters: a body
/// matching the broken-page template is `Transient` even if the page
/// happens to contain `#avatar_menu` HTML somewhere — trust the universal
/// site signal over a stray selector match.
pub fn classify_probe(html: &str) -> SessionProbe {
if is_broken_page_body(html) {
return SessionProbe::Transient;
}
let doc = scraper::Html::parse_document(html);
if !has_logo_sentinel(&doc) {
return SessionProbe::Transient;
}
let avatar_sel = scraper::Selector::parse("#avatar_menu").unwrap();
if doc.select(&avatar_sel).next().is_some() {
SessionProbe::Ok
} else {
SessionProbe::Unauthenticated
}
}
/// Three-way classification of a chapter page response.
///
/// Reader pages don't render `#logo`, so [`classify_probe`] can't be
/// reused as-is. The chapter-specific marker is `a#pic_container`
/// (asserted by the reader-page parser at `parse_chapter_pages`).
///
/// Order matters: broken-page body wins over selector matches, so a
/// transient site-wide 5xx that happens to render the avatar widget
/// elsewhere doesn't falsely reach `Ok`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChapterProbe {
/// `a#pic_container` present — reader rendered. Whether
/// `#avatar_menu` is also there is informational; if the reader
/// loaded the session is by definition still good.
Ok,
/// Site rendered a "logged out" or "please log in" page (no
/// reader, no broken-page body, and no avatar widget either).
/// Distinguishes the genuine expired-session case from a
/// transient site hiccup.
Unauthenticated,
/// Broken-page body, or reader didn't render but the user is
/// still logged in (avatar widget present). Caller should retry
/// rather than blame the session.
Transient,
}
pub fn classify_chapter_probe(html: &str) -> ChapterProbe {
if is_broken_page_body(html) {
return ChapterProbe::Transient;
}
let doc = scraper::Html::parse_document(html);
let container = scraper::Selector::parse("a#pic_container").unwrap();
if doc.select(&container).next().is_some() {
return ChapterProbe::Ok;
}
let avatar = scraper::Selector::parse("#avatar_menu").unwrap();
if doc.select(&avatar).next().is_some() {
// Logged-in user, but the reader didn't render — most likely
// the layout shifted or the site is serving an interstitial.
ChapterProbe::Transient
} else {
// No reader, no avatar, no broken-body marker — site rendered
// the "please log in" page, which is the genuine session-
// expired signal on this route.
ChapterProbe::Unauthenticated
}
}
/// In-startup retry budget for the session probe. Small but non-zero —
/// startup hitting a 5-second site hiccup shouldn't fail the operator
/// with "PHPSESSID expired" when the session is actually fine.
const PROBE_MAX_ATTEMPTS: u32 = 3;
const PROBE_RETRY_DELAY: Duration = Duration::from_secs(2);
/// Navigate to `probe_url` and classify the response. Retries the probe
/// on `Transient` outcomes (broken-page body, missing `#logo`); fails
/// fast on `Unauthenticated`; returns `Ok(())` on success.
///
/// This burns one navigation per attempt against the catalog's rate
/// limiter. The trade is worth it — failing here costs ~1s; failing 30
/// minutes into a backfill costs 30 minutes.
pub async fn verify_session(browser: &Browser, probe_url: &str) -> anyhow::Result<()> {
let mut attempt = 0u32;
loop {
attempt += 1;
let html = fetch_probe_html(browser, probe_url).await?;
match classify_probe(&html) {
SessionProbe::Ok => {
tracing::info!(attempt, "session probe ok — #logo + #avatar_menu present");
return Ok(());
}
SessionProbe::Unauthenticated => {
return Err(anyhow!(
"session probe failed — #avatar_menu not present at {probe_url} \
(page rendered the normal layout); PHPSESSID is missing, expired, \
or revoked. Refresh CRAWLER_PHPSESSID and re-run."
));
}
SessionProbe::Transient if attempt < PROBE_MAX_ATTEMPTS => {
tracing::warn!(
attempt,
max_attempts = PROBE_MAX_ATTEMPTS,
"session probe got a transient page; retrying"
);
tokio::time::sleep(PROBE_RETRY_DELAY).await;
}
SessionProbe::Transient => {
return Err(anyhow!(
"session probe failed — probe page at {probe_url} returned a \
broken-page response after {PROBE_MAX_ATTEMPTS} attempts. \
The site appears to be down or rate-limiting us; try again \
later before refreshing CRAWLER_PHPSESSID."
));
}
}
}
}
async fn fetch_probe_html(browser: &Browser, probe_url: &str) -> anyhow::Result<String> {
let page = browser
.new_page(probe_url)
.await
.with_context(|| format!("open probe page {probe_url}"))?;
crate::crawler::nav::wait_for_nav(&page)
.await
.context("wait for nav on probe")?;
// Best-effort wait for the layout marker. Timeout is fine — the
// probe classifier handles a missing `#logo` as Transient anyway,
// and the verify loop retries on Transient.
let _ = crate::crawler::nav::wait_for_selector(
&page,
"#logo",
crate::crawler::nav::SELECTOR_TIMEOUT,
)
.await;
let html = page.content().await.context("read probe html")?;
page.close().await.ok();
Ok(html)
}
#[cfg(test)]
mod tests {
use super::*;
// registrable_domain tests live in crawler::url_utils now —
// it's the canonical home for that helper.
#[test]
fn classify_probe_ok_when_logo_and_avatar_present() {
let html = r#"<html><body>
<header><div id="logo">Target</div><div id="avatar_menu"></div></header>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Ok);
}
#[test]
fn classify_probe_unauth_when_logo_present_but_avatar_absent() {
// Real "logged out" response: site layout renders fine, just no
// avatar widget. This is the only state that should blame the
// session cookie.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<main>Please log in.</main>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Unauthenticated);
}
#[test]
fn classify_probe_transient_on_broken_page_body() {
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_when_logo_missing() {
// No broken-body marker, but no site layout either — treat as
// transient (could be a Cloudflare interstitial, a 5xx page,
// etc.) rather than blaming the session.
let html = "<html><body><h1>Service Unavailable</h1></body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_on_empty_response() {
assert_eq!(classify_probe(""), SessionProbe::Transient);
}
#[test]
fn classify_chapter_probe_ok_when_reader_rendered() {
let html = r#"
<html><body>
<a id="pic_container">
<img id="page1" src="https://cdn/1.jpg">
</a>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Ok);
}
#[test]
fn classify_chapter_probe_unauthenticated_when_no_reader_and_no_avatar() {
// What a logged-out hit on a chapter URL renders: a normal
// site layout (header etc.) with a "please log in" body, but
// no reader and no avatar widget.
let html = r#"
<html><body>
<header><div id="logo">Catalog</div></header>
<main>Please log in to read this chapter.</main>
</body></html>
"#;
assert_eq!(
classify_chapter_probe(html),
ChapterProbe::Unauthenticated
);
}
#[test]
fn classify_chapter_probe_transient_when_logged_in_but_reader_missing() {
// Avatar shows the session is still valid; reader didn't
// render — site is serving an interstitial or the layout
// momentarily shifted. Retry, don't blame the session.
let html = r#"
<html><body>
<header><div id="logo">Catalog</div><div id="avatar_menu"></div></header>
<main>Site maintenance — back in 5 minutes.</main>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Transient);
}
#[test]
fn classify_chapter_probe_transient_on_broken_page_body() {
let html =
"<html><body><p>we're sorry, the request file are not found.</p></body></html>";
assert_eq!(classify_chapter_probe(html), ChapterProbe::Transient);
}
#[test]
fn classify_chapter_probe_does_not_misfire_on_avatar_alone_without_reader() {
// Regression for the original bug: the binary
// find_element("#avatar_menu") check treated "no avatar" as
// session-expired even when a transient hiccup was the real
// cause. classify_chapter_probe must NOT trip on that pattern
// when pic_container *is* present.
let html = r#"
<html><body>
<a id="pic_container">
<img id="page1" src="https://cdn/1.jpg">
</a>
</body></html>
"#;
assert_eq!(classify_chapter_probe(html), ChapterProbe::Ok);
}
#[test]
fn classify_probe_trusts_broken_body_over_stray_avatar_match() {
// Defensive: if a broken-page body somehow contains an
// #avatar_menu element (e.g. an unrelated debug page on the
// same template), the body signature still wins.
let html = r#"<html><body>
<p>we're sorry, the request file are not found.</p>
<div id="logo"></div>
<div id="avatar_menu"></div>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
}

View File

@@ -0,0 +1,124 @@
//! `Source` trait — the per-site abstraction.
//!
//! Job handlers depend on this trait, not on a concrete site. Adding a
//! new site is: implement `Source`, register it in a `sources` table
//! row, and the existing job pipeline picks it up unchanged.
pub mod target;
use async_trait::async_trait;
use chromiumoxide::browser::Browser;
/// Pointer at a manga in the source's index, before we've fetched the
/// detail page. The `source_manga_key` is whatever stable id the source
/// uses (slug, numeric id, etc).
#[derive(Clone, Debug)]
pub struct SourceMangaRef {
pub source_manga_key: String,
pub title: String,
pub url: String,
}
/// Full metadata returned by `fetch_manga`. The hash is computed by the
/// source impl over the metadata-only field set (title through
/// cover_url) — chapter changes are tracked separately via
/// `chapter_sources`, so they intentionally do not affect
/// `metadata_hash`.
#[derive(Clone, Debug)]
pub struct SourceManga {
pub source_manga_key: String,
pub title: String,
pub alternative_titles: Vec<String>,
pub authors: Vec<String>,
pub genres: Vec<String>,
pub tags: Vec<String>,
pub status: Option<String>,
pub summary: Option<String>,
pub cover_url: Option<String>,
/// Chapters surfaced on the same page as the metadata. Sources
/// where the chapter list lives elsewhere can leave this empty
/// and supply it via `fetch_chapter_list` instead.
pub chapters: Vec<SourceChapterRef>,
pub metadata_hash: String,
}
#[derive(Clone, Debug)]
pub struct SourceChapterRef {
pub source_chapter_key: String,
pub number: i32,
pub title: Option<String>,
pub url: String,
}
#[derive(Clone, Debug)]
pub struct SourceChapter {
pub source_chapter_key: String,
pub number: i32,
pub title: Option<String>,
/// Ordered list of page image URLs, ready to be fetched and put
/// into `Storage`.
pub page_urls: Vec<String>,
}
/// Context passed to every `Source` call. Carries the browser handle
/// plus the per-host rate-limiter map so impls that issue multiple
/// requests in one call (pagination walks, multi-page chapter image
/// fetches) honor the right budget for each origin.
pub struct FetchContext<'a> {
pub browser: &'a Browser,
pub rate: &'a crate::crawler::rate_limit::HostRateLimiters,
}
/// Lazy iterator over discovered manga refs. The caller drives the
/// walk one batch at a time, so it can break out as soon as the
/// downstream stop condition is met (the first manga where metadata is
/// `Unchanged` and chapter sync reports zero new chapters) without
/// paying for pages it won't use.
///
/// Batches are typically one source-index page each. Within a batch
/// refs are in the source's natural newest-first ordering — the same
/// `update_date DESC` sort that makes the stop condition meaningful.
#[async_trait]
pub trait DiscoverWalk: Send {
/// Return the next batch of refs, or `Ok(None)` when the source has
/// no more pages. The walker is single-use; calling `next_batch`
/// after `None` is allowed and continues to return `None`.
async fn next_batch(
&mut self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Option<Vec<SourceMangaRef>>>;
}
#[async_trait]
pub trait Source: Send + Sync {
/// Stable identifier — also the row key in the `sources` table.
fn id(&self) -> &'static str;
/// Begin discovery. Returns a walker the caller drives page-by-page
/// via `next_batch`. The initial page-1 probe (used to determine
/// `last_page` and warm the cache for sites that can't be paged
/// without knowing the bound) happens inside this call, so a fresh
/// walker is ready to yield its first batch without further setup.
async fn discover(
&self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Box<dyn DiscoverWalk + Send>>;
async fn fetch_manga(
&self,
ctx: &FetchContext<'_>,
r: &SourceMangaRef,
) -> anyhow::Result<SourceManga>;
async fn fetch_chapter_list(
&self,
ctx: &FetchContext<'_>,
manga: &SourceManga,
) -> anyhow::Result<Vec<SourceChapterRef>>;
async fn fetch_chapter(
&self,
ctx: &FetchContext<'_>,
r: &SourceChapterRef,
) -> anyhow::Result<SourceChapter>;
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,194 @@
//! Centralised URL helpers for the crawler subsystem.
//!
//! Three near-identical hand-rolled URL parsers used to live in
//! `crawler::session`, `crawler::rate_limit`, and `crawler::pipeline`
//! respectively, each with subtly different edge-case behaviour
//! around port handling and IPv6 literals. They're consolidated here
//! so the divergence can't drift again.
//!
//! The hand-rolled implementations are kept intentionally — they
//! preserve the exact semantics every existing test pins. A future
//! refactor can switch to `reqwest::Url` if it can be done without
//! changing those semantics.
/// Lowercased host (no port). Returns `None` for inputs without a
/// `scheme://host` shape — those would never have reached the network
/// layer anyway. Used by the per-host rate limiter as its bucket key.
///
/// IPv6 literals are kept in their `[::1]` bracketed form so the
/// `rsplit_once(':')` port-stripping logic doesn't split inside the
/// address (e.g. `https://[::1]/foo` used to return `"[:"` because
/// the rightmost `:` is inside the literal). Buckets keyed by
/// `[::1]` vs `::1` are still uniquely-per-host; the brackets are
/// cosmetic.
pub fn host_of(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host = if host_with_port.starts_with('[') {
// IPv6 literal: keep through the closing bracket. There may
// be a trailing `:port` after `]`; strip only that.
match host_with_port.rfind(']') {
Some(end) => &host_with_port[..=end],
None => host_with_port,
}
} else {
// Hostnames and IPv4 literals: trailing `:port` (if any) is
// after the last `:`.
host_with_port
.rsplit_once(':')
.map_or(host_with_port, |(h, _)| h)
};
(!host.is_empty()).then(|| host.to_ascii_lowercase())
}
/// `scheme://host` with no path or port stripping. Used by the metadata
/// pass to seed `sources.base_url` from `CRAWLER_START_URL`.
pub fn origin_of(url: &str) -> Option<String> {
let (scheme, rest) = url.split_once("://")?;
let host = rest.split('/').next()?;
Some(format!("{scheme}://{host}"))
}
/// Approximate registrable-domain calculation: take the last two
/// dot-labels of the host, prefix with `.`. Used to set a parent-
/// domain cookie so the catalog's `www.` / `m.` redirects don't drop
/// the cookie mid-crawl.
///
/// Caveat: wrong for multi-part TLDs (`.co.uk`, `.com.br`). The
/// operator can override via `CRAWLER_COOKIE_DOMAIN`; pulling in the
/// Public Suffix List for one knob isn't worth it yet.
///
/// Bare hostnames (e.g. `localhost`) return the host as-is, with no
/// leading dot — setting `.localhost` as a cookie domain is invalid.
/// IPv6 literals (e.g. `[::1]`) are returned bracketed and unchanged;
/// the browser will reject them as a cookie `Domain` anyway, but the
/// representation stays sensible. Same `starts_with('[')` branch as
/// [`host_of`] for consistent IPv6 handling across the module.
pub fn registrable_domain(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host_str = if host_with_port.starts_with('[') {
// IPv6 literal: keep through the closing bracket; an optional
// `:port` follows `]`.
match host_with_port.rfind(']') {
Some(end) => &host_with_port[..=end],
None => host_with_port,
}
} else {
host_with_port
.rsplit_once(':')
.map_or(host_with_port, |(h, _)| h)
};
let host = host_str.to_ascii_lowercase();
if host.is_empty() {
return None;
}
let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
if labels.len() < 2 {
return Some(host);
}
let registrable = &labels[labels.len() - 2..];
Some(format!(".{}", registrable.join(".")))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn host_of_strips_port_and_lowercases() {
assert_eq!(
host_of("https://CDN.Example.com:443/x").as_deref(),
Some("cdn.example.com")
);
assert_eq!(host_of("http://localhost/").as_deref(), Some("localhost"));
assert_eq!(host_of("not a url"), None);
}
#[test]
fn host_of_keeps_bracketed_ipv6_literal_intact() {
// Regression: the old impl rsplit_once(':')'d the IPv6 address,
// returning "[:" instead of "[::1]". A real IPv6 source would
// silently get a wrong rate-limit bucket key.
assert_eq!(host_of("https://[::1]/").as_deref(), Some("[::1]"));
assert_eq!(host_of("https://[::1]:8080/").as_deref(), Some("[::1]"));
assert_eq!(
host_of("https://[2001:db8::1]/foo").as_deref(),
Some("[2001:db8::1]")
);
assert_eq!(
host_of("https://[2001:db8::1]:443/foo").as_deref(),
Some("[2001:db8::1]")
);
}
#[test]
fn origin_of_returns_scheme_and_host() {
assert_eq!(
origin_of("https://example.com/some/path?q=1").as_deref(),
Some("https://example.com")
);
assert_eq!(origin_of("garbage"), None);
}
#[test]
fn registrable_domain_strips_subdomain() {
assert_eq!(
registrable_domain("https://www.target-site.com/manga/foo/").as_deref(),
Some(".target-site.com")
);
assert_eq!(
registrable_domain("https://m.example.org").as_deref(),
Some(".example.org")
);
}
#[test]
fn registrable_domain_keeps_two_label_host() {
assert_eq!(
registrable_domain("https://example.com/").as_deref(),
Some(".example.com")
);
}
#[test]
fn registrable_domain_handles_port() {
assert_eq!(
registrable_domain("http://www.foo.bar:8080/x").as_deref(),
Some(".foo.bar")
);
}
#[test]
fn registrable_domain_bare_hostname_no_leading_dot() {
assert_eq!(
registrable_domain("http://localhost:5173").as_deref(),
Some("localhost")
);
}
#[test]
fn registrable_domain_returns_none_for_garbage() {
assert!(registrable_domain("not a url").is_none());
}
#[test]
fn registrable_domain_keeps_bracketed_ipv6_literal_intact() {
// Symmetric with host_of's IPv6 fix. The cookie-domain code
// won't accept an IP as a `Domain` value, but the function
// should at least return a sensible representation rather
// than the truncated `"[:"` the old port-stripper produced.
assert_eq!(
registrable_domain("https://[::1]/").as_deref(),
Some("[::1]")
);
assert_eq!(
registrable_domain("https://[::1]:8080/").as_deref(),
Some("[::1]")
);
assert_eq!(
registrable_domain("https://[2001:db8::1]/foo").as_deref(),
Some("[2001:db8::1]")
);
}
}

View File

@@ -0,0 +1,15 @@
use chrono::{DateTime, Utc};
use serde::Serialize;
use sqlx::FromRow;
use uuid::Uuid;
#[derive(Debug, Clone, Serialize, FromRow)]
pub struct AdminAuditEntry {
pub id: Uuid,
pub actor_user_id: Option<Uuid>,
pub action: String,
pub target_kind: String,
pub target_id: Option<Uuid>,
pub payload: serde_json::Value,
pub at: DateTime<Utc>,
}

View File

@@ -1,3 +1,4 @@
pub mod admin_audit;
pub mod api_token;
pub mod author;
pub mod bookmark;
@@ -9,11 +10,13 @@ pub mod page;
pub mod patch;
pub mod read_progress;
pub mod session;
pub mod sync_state;
pub mod tag;
pub mod upload_entry;
pub mod user;
pub mod user_preferences;
pub use admin_audit::AdminAuditEntry;
pub use api_token::ApiToken;
pub use author::{Author, AuthorRef, AuthorWithCount};
pub use bookmark::{Bookmark, BookmarkSummary};
@@ -25,6 +28,7 @@ pub use page::Page;
pub use patch::Patch;
pub use read_progress::{ReadProgress, ReadProgressForManga, ReadProgressSummary};
pub use session::Session;
pub use sync_state::{ChapterSyncState, MangaSyncState};
pub use tag::{Tag, TagRef};
pub use upload_entry::UploadEntry;
pub use user::User;

View File

@@ -0,0 +1,48 @@
//! Sync-state enums derived per-manga / per-chapter from `manga_sources`,
//! `chapter_sources`, and `crawler_jobs` at query time. No state column
//! is persisted on `mangas` / `chapters` — see `repo::admin_view` for the
//! derivation rules and priority order.
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, sqlx::Type)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum MangaSyncState {
/// A `sync_manga` or `sync_chapter_list` job is currently
/// pending or running for this manga.
InProgress,
/// At least one `manga_sources` row exists for this manga and ALL of
/// them have `dropped_at IS NOT NULL` — every source we know about
/// has stopped surfacing it.
Dropped,
/// Default healthy state: at least one live source row OR the manga
/// was user-uploaded (no `manga_sources` rows at all).
Synced,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, sqlx::Type)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum ChapterSyncState {
/// A `sync_chapter_content` job is currently pending or running for
/// this chapter (the 0014 dedup index guarantees at most one).
Downloading,
/// At least one `chapter_sources` row exists AND all of them are
/// `dropped_at IS NOT NULL`.
Dropped,
/// `page_count = 0` AND a `dead` `sync_chapter_content` job exists
/// for this chapter. Checked BEFORE `NotDownloaded` so the more
/// informative "we tried and it died" state wins over "we never
/// got around to it". Does NOT fire when `page_count > 0`, because
/// pages on disk mean the chapter IS synced regardless of historical
/// job failures — see the priority comment in `repo::admin_view`.
Failed,
/// `page_count = 0` and no in-flight or failed job — the chapter
/// row exists but content has never been downloaded.
NotDownloaded,
/// `page_count > 0` — content has been downloaded at some point.
/// Reaped `done` jobs in `crawler_jobs` mean we can't read this from
/// the job table, so `page_count` is the durable truth.
Synced,
}

View File

@@ -10,4 +10,5 @@ pub struct User {
#[serde(skip)]
pub password_hash: String,
pub created_at: DateTime<Utc>,
pub is_admin: bool,
}

View File

@@ -21,6 +21,11 @@ pub enum AppError {
PayloadTooLarge(String),
#[error("unsupported media type: {0}")]
UnsupportedMediaType(String),
/// 429 with an optional `Retry-After` header value (in seconds).
#[error("too many requests")]
TooManyRequests {
retry_after_secs: Option<u64>,
},
/// Semantic per-field validation failure. `details` is rendered into the
/// envelope so the client can highlight the bad field(s).
#[error("validation failed")]
@@ -51,6 +56,7 @@ impl AppError {
AppError::Conflict(_) => "conflict",
AppError::PayloadTooLarge(_) => "payload_too_large",
AppError::UnsupportedMediaType(_) => "unsupported_media_type",
AppError::TooManyRequests { .. } => "too_many_requests",
AppError::ValidationFailed { .. } => "validation_failed",
AppError::Database(sqlx::Error::RowNotFound) => "not_found",
AppError::Database(_) => "internal_error",
@@ -79,6 +85,31 @@ impl IntoResponse for AppError {
AppError::UnsupportedMediaType(msg) => {
(StatusCode::UNSUPPORTED_MEDIA_TYPE, msg.clone(), None)
}
AppError::TooManyRequests { retry_after_secs } => {
// Emit `Retry-After: N` (RFC 6585 §4) so a well-behaved
// client can back off correctly. Done by building the
// response by hand below — the `(status, headers,
// body)` tuple shape doesn't fit the standard
// `(status, body)` IntoResponse path for the other
// variants.
let body = json!({
"error": {
"code": code,
"message": "too many requests; slow down",
}
});
let mut resp = (StatusCode::TOO_MANY_REQUESTS, Json(body)).into_response();
if let Some(secs) = retry_after_secs {
// `HeaderValue: From<u64>` skips both the
// intermediate `String` allocation and the
// fallible-by-shape `from_str` path.
resp.headers_mut().insert(
axum::http::header::RETRY_AFTER,
axum::http::HeaderValue::from(*secs),
);
}
return resp;
}
AppError::ValidationFailed { message, details } => (
StatusCode::UNPROCESSABLE_ENTITY,
message.clone(),

View File

@@ -2,6 +2,7 @@ pub mod api;
pub mod app;
pub mod auth;
pub mod config;
pub mod crawler;
pub mod domain;
pub mod error;
pub mod repo;

View File

@@ -1,21 +1,77 @@
use std::net::SocketAddr;
use std::time::Duration;
use tracing_subscriber::EnvFilter;
/// Upper bound on how long we're willing to wait for the crawler daemon
/// to drain before letting `main` return. Without it a wedged background
/// task (e.g. a chromiumoxide handler stuck on a dead WS) blocks the
/// process from exiting after Ctrl-C / SIGTERM.
const CRAWLER_SHUTDOWN_TIMEOUT: Duration = Duration::from_secs(5);
#[tokio::main]
async fn main() -> anyhow::Result<()> {
dotenvy::dotenv().ok();
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| "info,mangalord=debug".into()),
EnvFilter::try_from_default_env().unwrap_or_else(|_| {
"info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off".into()
}),
)
.init();
let config = mangalord::config::Config::from_env()?;
let addr: SocketAddr = config.bind_address.parse()?;
let app = mangalord::app::build(config).await?;
let mangalord::app::AppHandle { router, daemon } = mangalord::app::build(config).await?;
tracing::info!(%addr, "mangalord listening");
let listener = tokio::net::TcpListener::bind(addr).await?;
axum::serve(listener, app).await?;
axum::serve(listener, router)
.with_graceful_shutdown(shutdown_signal())
.await?;
// Drain background tasks (crawler daemon) before exiting so Chromium
// gets a clean shutdown rather than relying on kill-on-drop. Bounded
// by a timeout so a wedged shutdown path can't trap the process.
if let Some(d) = daemon {
if tokio::time::timeout(CRAWLER_SHUTDOWN_TIMEOUT, d.shutdown())
.await
.is_err()
{
tracing::warn!(
timeout_s = CRAWLER_SHUTDOWN_TIMEOUT.as_secs(),
"crawler daemon shutdown exceeded timeout; abandoning"
);
}
}
Ok(())
}
/// Wait for either Ctrl-C (interactive shell) or SIGTERM (Docker /
/// Kubernetes / Podman / systemd stop) and log which arrived. Without
/// the SIGTERM branch, `docker compose stop` runs out its grace period
/// and skips straight to SIGKILL — the daemon never gets the
/// `daemon.shutdown().await` path, leaking Chromium.
async fn shutdown_signal() {
use tokio::signal::unix::{signal, SignalKind};
let mut sigterm = match signal(SignalKind::terminate()) {
Ok(s) => s,
Err(e) => {
// SignalKind::terminate() is supported on every Unix the
// tokio runtime runs on; if registration fails we still
// honour Ctrl-C so the process is at least
// interactive-shutdownable.
tracing::warn!(error = %e, "could not install SIGTERM handler; falling back to ctrl_c only");
let _ = tokio::signal::ctrl_c().await;
tracing::info!("ctrl-c received; shutting down");
return;
}
};
tokio::select! {
_ = tokio::signal::ctrl_c() => {
tracing::info!("ctrl-c received; shutting down");
}
_ = sigterm.recv() => {
tracing::info!("SIGTERM received; shutting down");
}
}
}

View File

@@ -0,0 +1,32 @@
//! Admin-action audit log writes.
//!
//! Insert is always called from inside the same transaction as the
//! action it audits — the executor parameter is `PgExecutor` so the
//! caller passes `&mut *tx` directly.
use sqlx::PgExecutor;
use uuid::Uuid;
use crate::error::AppResult;
pub async fn insert<'e, E: PgExecutor<'e>>(
executor: E,
actor_user_id: Uuid,
action: &str,
target_kind: &str,
target_id: Option<Uuid>,
payload: serde_json::Value,
) -> AppResult<()> {
sqlx::query(
"INSERT INTO admin_audit (actor_user_id, action, target_kind, target_id, payload) \
VALUES ($1, $2, $3, $4, $5)",
)
.bind(actor_user_id)
.bind(action)
.bind(target_kind)
.bind(target_id)
.bind(payload)
.execute(executor)
.await?;
Ok(())
}

View File

@@ -0,0 +1,232 @@
//! Admin-facing read queries that join manga/chapter with the crawler
//! signals (`manga_sources`, `chapter_sources`, `crawler_jobs`) to
//! derive a sync state per row at query time.
//!
//! Priority order for `MangaSyncState`:
//! 1. `InProgress` — any pending/running `sync_manga` or
//! `sync_chapter_list` job matches this manga.
//! 2. `Dropped` — manga has source rows AND every one of them is
//! `dropped_at IS NOT NULL`.
//! 3. `Synced` — default (includes user-uploaded mangas with no
//! `manga_sources` rows at all).
//!
//! Priority order for `ChapterSyncState`:
//! 1. `Downloading` — pending/running `sync_chapter_content` for this id
//! 2. `Dropped` — chapter has source rows AND all are dropped
//! 3. `Failed` — `page_count = 0` AND a `dead` `sync_chapter_content`
//! row exists for this chapter. Constrained to `page_count = 0`
//! because once pages are on disk the chapter IS synced — a
//! historical dead job (likely from a re-download attempt that
//! crashed) is noise that gets reaped after retention. Surfacing
//! "Failed" when content is present would contradict
//! `ChapterSyncState::Synced`'s "downloaded at some point" contract.
//! 4. `NotDownloaded` — `page_count = 0`, no in-flight, no dead job
//! 5. `Synced` — `page_count > 0`
//!
//! Reminder: `done` jobs are reaped after `CRAWLER_JOB_RETENTION_DAYS`,
//! so `chapters.page_count > 0` is the durable "this is synced" signal,
//! not the job table.
use chrono::{DateTime, Utc};
use serde::Serialize;
use sqlx::{FromRow, PgPool};
use uuid::Uuid;
use crate::domain::{ChapterSyncState, MangaSyncState};
use crate::error::AppResult;
#[derive(Debug, Serialize, FromRow)]
pub struct AdminMangaRow {
pub id: Uuid,
pub title: String,
pub status: String,
pub cover_image_path: Option<String>,
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
pub sync_state: MangaSyncState,
pub chapter_count: i64,
pub latest_seen_at: Option<DateTime<Utc>>,
}
#[derive(Debug, Default)]
pub struct ListAdminMangasQuery {
pub search: Option<String>,
pub sync_state: Option<MangaSyncState>,
pub limit: i64,
pub offset: i64,
}
const MANGA_SYNC_STATE_CASE: &str = r#"
CASE
WHEN EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state IN ('pending','running')
AND (
(cj.payload->>'kind' = 'sync_chapter_list'
AND (cj.payload->>'manga_id')::uuid = m.id)
OR (cj.payload->>'kind' = 'sync_manga'
AND EXISTS (
SELECT 1 FROM manga_sources ms
WHERE ms.manga_id = m.id
AND ms.source_id = cj.payload->>'source_id'
AND ms.source_manga_key = cj.payload->>'source_manga_key'
))
)
) THEN 'in_progress'
WHEN EXISTS (SELECT 1 FROM manga_sources ms WHERE ms.manga_id = m.id)
AND NOT EXISTS (
SELECT 1 FROM manga_sources ms
WHERE ms.manga_id = m.id AND ms.dropped_at IS NULL
)
THEN 'dropped'
ELSE 'synced'
END
"#;
/// Paginated admin manga list with derived sync state and total count.
/// Filters by `search` (substring on title, case-insensitive) and
/// `sync_state` (post-derivation). The CTE keeps the case expression
/// in one place — the same projection feeds both the page rows and the
/// totals count under the same filter.
pub async fn list_mangas_with_sync_state(
pool: &PgPool,
q: &ListAdminMangasQuery,
) -> AppResult<(Vec<AdminMangaRow>, i64)> {
let search_pat = q
.search
.as_ref()
.map(|s| format!("%{}%", s.trim()))
.filter(|p| p.len() > 2);
// sqlx::Type → text: bind the snake_case representation manually so
// the SQL can compare it as text without an explicit cast.
let sync_filter = q.sync_state.map(|s| match s {
MangaSyncState::InProgress => "in_progress",
MangaSyncState::Dropped => "dropped",
MangaSyncState::Synced => "synced",
});
let sql = format!(
r#"
WITH classified AS (
SELECT
m.id, m.title, m.status, m.cover_image_path,
m.created_at, m.updated_at,
{case} AS sync_state,
(SELECT COUNT(*) FROM chapters c WHERE c.manga_id = m.id) AS chapter_count,
(SELECT MAX(last_seen_at) FROM manga_sources ms
WHERE ms.manga_id = m.id AND ms.dropped_at IS NULL) AS latest_seen_at
FROM mangas m
WHERE ($1::text IS NULL OR m.title ILIKE $1)
)
SELECT * FROM classified
WHERE ($2::text IS NULL OR sync_state = $2)
ORDER BY updated_at DESC
LIMIT $3 OFFSET $4
"#,
case = MANGA_SYNC_STATE_CASE
);
let items: Vec<AdminMangaRow> = sqlx::query_as(&sql)
.bind(&search_pat)
.bind(sync_filter)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total_sql = format!(
r#"
WITH classified AS (
SELECT {case} AS sync_state
FROM mangas m
WHERE ($1::text IS NULL OR m.title ILIKE $1)
)
SELECT COUNT(*) FROM classified
WHERE ($2::text IS NULL OR sync_state = $2)
"#,
case = MANGA_SYNC_STATE_CASE
);
let total: i64 = sqlx::query_scalar(&total_sql)
.bind(&search_pat)
.bind(sync_filter)
.fetch_one(pool)
.await?;
Ok((items, total))
}
#[derive(Debug, Serialize, FromRow)]
pub struct AdminChapterRow {
pub id: Uuid,
pub manga_id: Uuid,
pub number: i32,
pub title: Option<String>,
pub page_count: i32,
pub created_at: DateTime<Utc>,
pub sync_state: ChapterSyncState,
pub latest_seen_at: Option<DateTime<Utc>>,
}
#[derive(Debug, Default)]
pub struct ListAdminChaptersQuery {
pub manga_id: Uuid,
pub limit: i64,
pub offset: i64,
}
/// Paginated chapter list with derived sync state. Pagination is non-
/// optional — long-runners can have thousands of chapters and the
/// per-row scalar subqueries make the unbounded variant a real
/// stall risk even behind an admin guard. Returns the page slice plus
/// the unfiltered total so the UI can render "showing N of M".
pub async fn list_chapters_with_sync_state(
pool: &PgPool,
q: &ListAdminChaptersQuery,
) -> AppResult<(Vec<AdminChapterRow>, i64)> {
let items: Vec<AdminChapterRow> = sqlx::query_as(
r#"
SELECT
c.id, c.manga_id, c.number, c.title, c.page_count, c.created_at,
CASE
WHEN EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state IN ('pending','running')
AND cj.payload->>'kind' = 'sync_chapter_content'
AND (cj.payload->>'chapter_id')::uuid = c.id
) THEN 'downloading'
WHEN EXISTS (SELECT 1 FROM chapter_sources cs WHERE cs.chapter_id = c.id)
AND NOT EXISTS (
SELECT 1 FROM chapter_sources cs
WHERE cs.chapter_id = c.id AND cs.dropped_at IS NULL
)
THEN 'dropped'
WHEN c.page_count = 0
AND EXISTS (
SELECT 1 FROM crawler_jobs cj
WHERE cj.state = 'dead'
AND cj.payload->>'kind' = 'sync_chapter_content'
AND (cj.payload->>'chapter_id')::uuid = c.id
) THEN 'failed'
WHEN c.page_count = 0 THEN 'not_downloaded'
ELSE 'synced'
END AS sync_state,
(SELECT MAX(last_seen_at) FROM chapter_sources cs
WHERE cs.chapter_id = c.id AND cs.dropped_at IS NULL) AS latest_seen_at
FROM chapters c
WHERE c.manga_id = $1
ORDER BY c.number ASC
LIMIT $2 OFFSET $3
"#,
)
.bind(q.manga_id)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM chapters WHERE manga_id = $1")
.bind(q.manga_id)
.fetch_one(pool)
.await?;
Ok((items, total))
}

View File

@@ -99,6 +99,11 @@ pub async fn list(
/// Atomically replace the set of authors on a manga. Caller passes a
/// `&mut PgConnection` (`&mut *tx` works) so the delete+upserts run in
/// one transaction with whatever called us.
///
/// Note: `crawler::repo::sync_authors` does a similar replace with the
/// same semantics on names. The duplication is intentional — handler
/// callers want the `Vec<AuthorRef>` for the API response; the
/// crawler doesn't need it and stays inside its own transaction.
pub async fn set_for_manga(
conn: &mut PgConnection,
manga_id: Uuid,

View File

@@ -29,9 +29,9 @@ pub async fn create(
match result {
Ok(b) => Ok(b),
Err(e) if is_unique_violation(&e) => Err(AppError::Conflict(
"bookmark already exists for this manga/chapter".into(),
)),
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => Err(
AppError::Conflict("bookmark already exists for this manga/chapter".into()),
),
Err(e) => Err(AppError::Database(e)),
}
}
@@ -97,10 +97,3 @@ pub async fn delete(pool: &PgPool, id: Uuid) -> AppResult<()> {
Ok(())
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
}

View File

@@ -4,7 +4,7 @@ use sqlx::{PgExecutor, PgPool};
use uuid::Uuid;
use crate::domain::Chapter;
use crate::error::{AppError, AppResult};
use crate::error::AppResult;
pub async fn list_for_manga(
pool: &PgPool,
@@ -12,12 +12,15 @@ pub async fn list_for_manga(
limit: i64,
offset: i64,
) -> AppResult<Vec<Chapter>> {
// Secondary sort by created_at gives duplicate-numbered chapters
// (multiple uploaders/translations of the same number) a stable
// order in lists and prev/next reader navigation.
let rows = sqlx::query_as::<_, Chapter>(
r#"
SELECT id, manga_id, number, title, page_count, created_at
FROM chapters
WHERE manga_id = $1
ORDER BY number ASC
ORDER BY number ASC, created_at ASC
LIMIT $2 OFFSET $3
"#,
)
@@ -29,33 +32,39 @@ pub async fn list_for_manga(
Ok(rows)
}
pub async fn find_by_manga_and_number(
/// Look up a chapter by its UUID, scoped to its manga so a UUID guessed
/// from a different manga's URL doesn't accidentally resolve.
pub async fn find_by_id_in_manga(
pool: &PgPool,
manga_id: Uuid,
number: i32,
chapter_id: Uuid,
) -> AppResult<Option<Chapter>> {
let row = sqlx::query_as::<_, Chapter>(
r#"
SELECT id, manga_id, number, title, page_count, created_at
FROM chapters
WHERE manga_id = $1 AND number = $2
WHERE manga_id = $1 AND id = $2
"#,
)
.bind(manga_id)
.bind(number)
.bind(chapter_id)
.fetch_optional(pool)
.await?;
Ok(row)
}
/// Accepts any `PgExecutor` so the upload handler can run this inside a
/// transaction with the per-page inserts. Returns `AppError::Conflict`
/// on the (manga_id, number) unique violation so handlers can surface a
/// clean 409.
/// transaction with the per-page inserts.
///
/// `uploaded_by` records who uploaded the chapter and feeds the
/// per-user upload history. `None` means "historical / API token with
/// no associated user" — kept nullable to support that case.
///
/// Chapter identity is the row UUID; the same (manga_id, number)
/// combination can repeat (multiple translations, re-uploads). The
/// 0013 migration dropped the (manga_id, number) UNIQUE, so duplicate
/// inserts succeed by design. If a future migration re-adds any
/// uniqueness, surface a 409 by adding a unique-violation arm here.
pub async fn create<'e, E: PgExecutor<'e>>(
executor: E,
manga_id: Uuid,
@@ -63,7 +72,7 @@ pub async fn create<'e, E: PgExecutor<'e>>(
title: Option<&str>,
uploaded_by: Option<Uuid>,
) -> AppResult<Chapter> {
let result = sqlx::query_as::<_, Chapter>(
let row = sqlx::query_as::<_, Chapter>(
r#"
INSERT INTO chapters (manga_id, number, title, uploaded_by)
VALUES ($1, $2, $3, $4)
@@ -75,15 +84,71 @@ pub async fn create<'e, E: PgExecutor<'e>>(
.bind(title)
.bind(uploaded_by)
.fetch_one(executor)
.await;
.await?;
Ok(row)
}
match result {
Ok(c) => Ok(c),
Err(e) if is_unique_violation(&e) => Err(AppError::Conflict(format!(
"chapter {number} already exists for this manga"
))),
Err(e) => Err(AppError::Database(e)),
}
/// Cross-link guard for `POST /bookmarks`: the bookmarks FK accepts
/// any valid chapter id, but a chapter must belong to the bookmark's
/// manga or the bookmark would dangle on a foreign manga. Handlers
/// call this before the insert and surface `NotFound` when it
/// returns `false`.
pub async fn belongs_to_manga(
pool: &PgPool,
chapter_id: Uuid,
manga_id: Uuid,
) -> AppResult<bool> {
let (exists,): (bool,) = sqlx::query_as(
"SELECT EXISTS(SELECT 1 FROM chapters WHERE id = $1 AND manga_id = $2)",
)
.bind(chapter_id)
.bind(manga_id)
.fetch_one(pool)
.await?;
Ok(exists)
}
/// Read just the page_count for a chapter. Used by the crawler
/// daemon's consumer-side dedup safety net so it can ack-done a job
/// whose chapter has already been fetched by a racing worker.
pub async fn page_count(pool: &PgPool, id: Uuid) -> sqlx::Result<Option<i32>> {
sqlx::query_scalar("SELECT page_count FROM chapters WHERE id = $1")
.bind(id)
.fetch_optional(pool)
.await
}
/// Look up the manga_id + most recent live source_url for a chapter.
/// Used by the daemon's chapter dispatcher to resolve the URL it needs
/// to hand to `content::sync_chapter_content`.
///
/// Skips soft-dropped sources (`cs.dropped_at IS NOT NULL`) and breaks
/// ties between multiple live sources by `last_seen_at DESC`, so the
/// freshest still-attached URL wins. Returns `None` when the chapter
/// is gone or all its source rows are dropped — callers in the
/// dispatcher treat `None` as "ack the job, skip the work."
///
/// The enqueue queries (`pipeline::enqueue_bookmarked_pending` and
/// `enqueue_pending_for_manga`) apply the same `dropped_at IS NULL`
/// filter — this resolver stays in lockstep so a chapter that was
/// dropped between enqueue and lease isn't dispatched against a stale
/// URL.
pub async fn dispatch_target(
pool: &PgPool,
chapter_id: Uuid,
) -> sqlx::Result<Option<(Uuid, String)>> {
sqlx::query_as(
"SELECT c.manga_id, cs.source_url \
FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE c.id = $1 \
AND cs.dropped_at IS NULL \
ORDER BY cs.last_seen_at DESC \
LIMIT 1",
)
.bind(chapter_id)
.fetch_optional(pool)
.await
}
pub async fn set_page_count<'e, E: PgExecutor<'e>>(
@@ -99,10 +164,3 @@ pub async fn set_page_count<'e, E: PgExecutor<'e>>(
Ok(())
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
}

564
backend/src/repo/crawler.rs Normal file
View File

@@ -0,0 +1,564 @@
//! Persistence for crawled mangas.
//!
//! High-level operations:
//! - [`ensure_source`]: idempotent registration of a source row.
//! - [`upsert_manga_from_source`]: end-to-end "I saw this manga" —
//! creates or updates the `mangas` row, threads `manga_sources`, and
//! refreshes authors/genres/tags. Returns whether the manga is new,
//! updated (metadata_hash changed), or unchanged.
//! - [`sync_manga_chapters`]: per-manga chapter reconciliation. Adds
//! new ones, refreshes URLs on existing ones, soft-drops vanished.
//! - [`mark_run_started`] / [`mark_run_completed`] /
//! [`last_run_completed_cleanly`]: per-source recovery flag in
//! `crawler_state`. A `false` flag on tick start means the previous
//! run did not exit cleanly and the next walk should ignore the
//! early-stop condition.
//!
//! Each public function is a transaction boundary so a partial failure
//! mid-call leaves the DB in its pre-call state.
use chrono::Utc;
use sqlx::{PgPool, Postgres, Transaction};
use uuid::Uuid;
use crate::crawler::source::{SourceChapterRef, SourceManga};
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpsertStatus {
New,
Updated,
Unchanged,
}
#[derive(Debug, Clone)]
pub struct UpsertedManga {
pub manga_id: Uuid,
pub status: UpsertStatus,
/// Current value of `mangas.cover_image_path` after the upsert.
/// `None` means the cover hasn't been downloaded yet — the caller
/// uses this to backfill covers for mangas that were synced before
/// cover-download support existed.
pub cover_image_path: Option<String>,
}
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ChapterDiff {
pub new: usize,
pub refreshed: usize,
pub dropped: usize,
}
pub async fn ensure_source(
pool: &PgPool,
id: &str,
name: &str,
base_url: &str,
) -> sqlx::Result<()> {
sqlx::query(
r#"
INSERT INTO sources (id, name, base_url, enabled)
VALUES ($1, $2, $3, true)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
base_url = EXCLUDED.base_url
"#,
)
.bind(id)
.bind(name)
.bind(base_url)
.execute(pool)
.await?;
Ok(())
}
pub async fn upsert_manga_from_source(
pool: &PgPool,
source_id: &str,
source_url: &str,
sm: &SourceManga,
) -> sqlx::Result<UpsertedManga> {
let mut tx = pool.begin().await?;
let existing: Option<(Uuid, Option<String>)> = sqlx::query_as(
r#"
SELECT manga_id, metadata_hash
FROM manga_sources
WHERE source_id = $1 AND source_manga_key = $2
"#,
)
.bind(source_id)
.bind(&sm.source_manga_key)
.fetch_optional(&mut *tx)
.await?;
let status_db = sm.status.as_deref().unwrap_or("ongoing");
// Note: `cover_image_path` is intentionally not written here.
// The repo layer doesn't know about the storage backend, so the
// caller (crawler binary) downloads the cover via the `Storage`
// trait and sets the path with `repo::manga::set_cover_image_path`
// once the bytes have landed.
let (manga_id, status) = match existing {
None => {
let (id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO mangas (title, description, status, alt_titles)
VALUES ($1, $2, $3, $4)
RETURNING id
"#,
)
.bind(&sm.title)
.bind(sm.summary.as_deref())
.bind(status_db)
.bind(&sm.alternative_titles)
.fetch_one(&mut *tx)
.await?;
(id, UpsertStatus::New)
}
Some((id, prev_hash)) if prev_hash.as_deref() == Some(&sm.metadata_hash) => {
(id, UpsertStatus::Unchanged)
}
Some((id, _)) => {
sqlx::query(
r#"
UPDATE mangas
SET title = $1,
description = $2,
status = $3,
alt_titles = $4,
updated_at = NOW()
WHERE id = $5
"#,
)
.bind(&sm.title)
.bind(sm.summary.as_deref())
.bind(status_db)
.bind(&sm.alternative_titles)
.bind(id)
.execute(&mut *tx)
.await?;
(id, UpsertStatus::Updated)
}
};
sqlx::query(
r#"
INSERT INTO manga_sources
(source_id, source_manga_key, manga_id, source_url, metadata_hash, last_seen_at, dropped_at)
VALUES ($1, $2, $3, $4, $5, NOW(), NULL)
ON CONFLICT (source_id, source_manga_key) DO UPDATE
SET source_url = EXCLUDED.source_url,
metadata_hash = EXCLUDED.metadata_hash,
last_seen_at = NOW(),
dropped_at = NULL
"#,
)
.bind(source_id)
.bind(&sm.source_manga_key)
.bind(manga_id)
.bind(source_url)
.bind(&sm.metadata_hash)
.execute(&mut *tx)
.await?;
sync_authors(&mut tx, manga_id, &sm.authors).await?;
sync_genres(&mut tx, manga_id, &sm.genres).await?;
sync_tags(&mut tx, manga_id, &sm.tags).await?;
let cover_image_path: Option<String> =
sqlx::query_scalar("SELECT cover_image_path FROM mangas WHERE id = $1")
.bind(manga_id)
.fetch_one(&mut *tx)
.await?;
tx.commit().await?;
Ok(UpsertedManga {
manga_id,
status,
cover_image_path,
})
}
async fn sync_authors(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
authors: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_authors WHERE manga_id = $1")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for (i, name) in authors.iter().enumerate() {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
// Self-update on conflict so the row id is always returned —
// can't use DO NOTHING because that suppresses RETURNING.
let (author_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO authors (name) VALUES ($1)
ON CONFLICT (lower(name)) DO UPDATE SET name = authors.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
sqlx::query(
r#"
INSERT INTO manga_authors (manga_id, author_id, position)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
"#,
)
.bind(manga_id)
.bind(author_id)
.bind(i as i32)
.execute(&mut **tx)
.await?;
}
Ok(())
}
async fn sync_genres(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
genres: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_genres WHERE manga_id = $1")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for name in genres {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
// Case-insensitive lookup so a source-supplied "action"
// attaches to the seeded "Action" rather than creating a
// second row.
let existing: Option<(Uuid,)> =
sqlx::query_as("SELECT id FROM genres WHERE lower(name) = lower($1)")
.bind(trimmed)
.fetch_optional(&mut **tx)
.await?;
let genre_id = match existing {
Some((id,)) => id,
None => {
let (id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO genres (name) VALUES ($1)
ON CONFLICT (name) DO UPDATE SET name = genres.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
tracing::info!(genre = trimmed, "added new genre from source");
id
}
};
sqlx::query(
"INSERT INTO manga_genres (manga_id, genre_id) VALUES ($1, $2) ON CONFLICT DO NOTHING",
)
.bind(manga_id)
.bind(genre_id)
.execute(&mut **tx)
.await?;
}
Ok(())
}
async fn sync_tags(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
tags: &[String],
) -> sqlx::Result<()> {
// Only clear crawler-owned attachments (added_by IS NULL). User-
// attached tags are owned by the attaching user and must survive
// the recurring metadata pass — see manga_tags.added_by in
// migration 0009.
//
// Note on orphans: `manga_tags.added_by` is `ON DELETE SET NULL`,
// so an attachment whose user was deleted becomes
// indistinguishable from a crawler-owned row and is cleaned up
// here. That mirrors how `api::mangas::detach_tag` already treats
// orphans ("nobody owns it, refuse to let anyone but admin clear
// them") — the crawler now becomes the eventual reaper. Tracked
// by `sync_tags_garbage_collects_orphan_user_attachments` in
// backend/tests/crawler_sync.rs.
sqlx::query("DELETE FROM manga_tags WHERE manga_id = $1 AND added_by IS NULL")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for name in tags {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
let (tag_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO tags (name) VALUES ($1)
ON CONFLICT (lower(name)) DO UPDATE SET name = tags.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
sqlx::query(
r#"
INSERT INTO manga_tags (manga_id, tag_id, added_by)
VALUES ($1, $2, NULL)
ON CONFLICT DO NOTHING
"#,
)
.bind(manga_id)
.bind(tag_id)
.execute(&mut **tx)
.await?;
}
Ok(())
}
pub async fn sync_manga_chapters(
pool: &PgPool,
source_id: &str,
manga_id: Uuid,
chapters: &[SourceChapterRef],
) -> sqlx::Result<ChapterDiff> {
let mut tx = pool.begin().await?;
// Per-manga advisory lock. Two concurrent calls for the same manga
// would otherwise both read `seen_keys`, both run the drop UPDATE
// filtered on `NOT (key = ANY $3)`, and the later commit could soft-
// drop a chapter the earlier commit had just inserted (lost-update
// shape under MVCC). `pg_advisory_xact_lock` is scoped to this
// transaction: it auto-releases on COMMIT/ROLLBACK so a Rust-side
// panic mid-call doesn't strand the lock. The single-arg int8 form
// keyed by `hashtextextended(manga_id::text, 0)` shares Postgres'
// global advisory-lock namespace with `CRON_LOCK_KEY`, but collision
// is 2^-64 per pair (a UUID-derived hash hitting the fixed cron key
// is effectively impossible).
sqlx::query("SELECT pg_advisory_xact_lock(hashtextextended($1::text, 0))")
.bind(manga_id)
.execute(&mut *tx)
.await?;
let mut diff = ChapterDiff::default();
let seen_keys: Vec<String> = chapters
.iter()
.map(|c| c.source_chapter_key.clone())
.collect();
for c in chapters {
// Lookup is constrained by manga_id (via the chapters join) so a
// source whose chapter slugs collide across mangas (e.g.
// "chapter-1" appearing under two different mangas) attributes
// each row to the correct manga. Migration 0017 dropped the
// (source_id, source_chapter_key) PK in favour of
// (source_id, chapter_id) for exactly this reason.
let existing: Option<(Uuid,)> = sqlx::query_as(
"SELECT cs.chapter_id \
FROM chapter_sources cs \
JOIN chapters ch ON ch.id = cs.chapter_id \
WHERE cs.source_id = $1 \
AND cs.source_chapter_key = $2 \
AND ch.manga_id = $3",
)
.bind(source_id)
.bind(&c.source_chapter_key)
.bind(manga_id)
.fetch_optional(&mut *tx)
.await?;
match existing {
None => {
// New chapter row. As of 0013 there's no (manga_id,
// number) UNIQUE, so duplicate-numbered chapters from
// the source (different uploaders, notices, alt
// translations) each get their own row — chapter
// identity is the UUID, not the number.
let (chapter_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO chapters (manga_id, number, title, page_count)
VALUES ($1, $2, $3, 0)
RETURNING id
"#,
)
.bind(manga_id)
.bind(c.number)
.bind(c.title.as_deref())
.fetch_one(&mut *tx)
.await?;
sqlx::query(
r#"
INSERT INTO chapter_sources
(source_id, source_chapter_key, chapter_id, source_url, last_seen_at, dropped_at)
VALUES ($1, $2, $3, $4, NOW(), NULL)
"#,
)
.bind(source_id)
.bind(&c.source_chapter_key)
.bind(chapter_id)
.bind(&c.url)
.execute(&mut *tx)
.await?;
diff.new += 1;
}
Some((chapter_id,)) => {
sqlx::query("UPDATE chapters SET title = $1 WHERE id = $2")
.bind(c.title.as_deref())
.bind(chapter_id)
.execute(&mut *tx)
.await?;
// chapter_id is now the natural per-(source, chapter)
// identifier — use it directly instead of re-keying on
// (source_id, source_chapter_key) which may not be unique.
sqlx::query(
r#"
UPDATE chapter_sources
SET source_url = $1, last_seen_at = NOW(), dropped_at = NULL
WHERE source_id = $2 AND chapter_id = $3
"#,
)
.bind(&c.url)
.bind(source_id)
.bind(chapter_id)
.execute(&mut *tx)
.await?;
diff.refreshed += 1;
}
}
}
// Soft-drop any chapter previously seen from this source for this
// manga that's not in the current list.
let result = sqlx::query(
r#"
UPDATE chapter_sources cs
SET dropped_at = NOW()
FROM chapters ch
WHERE cs.chapter_id = ch.id
AND ch.manga_id = $1
AND cs.source_id = $2
AND cs.dropped_at IS NULL
AND NOT (cs.source_chapter_key = ANY($3))
"#,
)
.bind(manga_id)
.bind(source_id)
.bind(&seen_keys)
.execute(&mut *tx)
.await?;
diff.dropped = result.rows_affected() as usize;
tx.commit().await?;
Ok(diff)
}
/// Count the chapters that the source `(source_id, source_manga_key)`
/// is currently known to attach to — i.e. the number of `chapter_sources`
/// rows for the manga identified by the (source_id, source_manga_key)
/// pair, restricted to live (`dropped_at IS NULL`) rows.
///
/// Used by the metadata pass's partial-render guard: if `fetch_manga`
/// returns an empty `chapters` Vec but the source previously surfaced
/// chapters here, that's most likely a chromium snapshot taken between
/// the `#chapter_table` wrapper render and its rows render — the
/// safest move is to skip `sync_manga_chapters` so the soft-drop
/// branch doesn't flip every existing chapter to `dropped_at`.
///
/// Returns `Ok(0)` when the manga is brand-new (no `manga_sources`
/// row yet), which is the legitimate "this manga has no chapters yet"
/// case and must NOT be flagged.
pub async fn live_chapter_count_for_source_manga(
pool: &PgPool,
source_id: &str,
source_manga_key: &str,
) -> sqlx::Result<i64> {
let row: Option<(i64,)> = sqlx::query_as(
"SELECT COUNT(*) \
FROM chapter_sources cs \
JOIN chapters c ON c.id = cs.chapter_id \
JOIN manga_sources ms \
ON ms.manga_id = c.manga_id \
AND ms.source_id = cs.source_id \
WHERE ms.source_id = $1 \
AND ms.source_manga_key = $2 \
AND cs.dropped_at IS NULL",
)
.bind(source_id)
.bind(source_manga_key)
.fetch_optional(pool)
.await?;
Ok(row.map(|(n,)| n).unwrap_or(0))
}
/// Mark a metadata pass as in-flight for `source_id`. Stamps
/// `last_run_completed:<source_id>` in `crawler_state` with
/// `{"completed": false, "at": now}`. A crash, panic, or SIGKILL after
/// this point leaves the flag at `false`, which the next tick reads as
/// "previous run did not exit cleanly — walk the full catalog this
/// time" (recovery sweep).
pub async fn mark_run_started(pool: &PgPool, source_id: &str) -> sqlx::Result<()> {
let key = format!("last_run_completed:{source_id}");
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(&key)
.bind(serde_json::json!({
"completed": false,
"at": Utc::now().to_rfc3339(),
}))
.execute(pool)
.await?;
Ok(())
}
/// Mark a metadata pass as completed cleanly for `source_id`. Called
/// from the same place a run decides it reached end-of-walk or hit the
/// intentional stop. The next tick reads `true` and applies the normal
/// stop condition.
pub async fn mark_run_completed(pool: &PgPool, source_id: &str) -> sqlx::Result<()> {
let key = format!("last_run_completed:{source_id}");
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(&key)
.bind(serde_json::json!({
"completed": true,
"at": Utc::now().to_rfc3339(),
}))
.execute(pool)
.await?;
Ok(())
}
/// Read the recovery flag for `source_id`. A missing row OR an
/// unparseable value reads as `true` ("clean") — the former covers the
/// first-ever run on a virgin DB (no recovery needed), the latter
/// covers forward-compat against future schema changes; both fail-safe
/// toward not making an operator pay for an unnecessary full sweep.
pub async fn last_run_completed_cleanly(
pool: &PgPool,
source_id: &str,
) -> sqlx::Result<bool> {
let key = format!("last_run_completed:{source_id}");
let row: Option<serde_json::Value> =
sqlx::query_scalar("SELECT value FROM crawler_state WHERE key = $1")
.bind(&key)
.fetch_optional(pool)
.await?;
Ok(row
.and_then(|v| v.get("completed").and_then(|b| b.as_bool()))
.unwrap_or(true))
}

View File

@@ -61,6 +61,11 @@ pub async fn load_for_mangas(
/// FK constraint would reject them, so we filter upstream rather than
/// surface a 500 here. (The API layer validates the set against
/// `list_all` first.)
///
/// Note: `crawler::repo::sync_genres` does a similar replace, but by
/// *name* and with auto-create of unseen genres — the crawler can't
/// validate against the curated vocabulary on its own. Both paths are
/// intentional; don't merge them without preserving that semantic.
pub async fn set_for_manga(
conn: &mut PgConnection,
manga_id: Uuid,

View File

@@ -262,6 +262,17 @@ pub async fn set_cover_image_path<'e, E: PgExecutor<'e>>(
Ok(())
}
pub async fn clear_cover_image_path<'e, E: PgExecutor<'e>>(
executor: E,
id: Uuid,
) -> AppResult<()> {
sqlx::query("UPDATE mangas SET cover_image_path = NULL, updated_at = now() WHERE id = $1")
.bind(id)
.execute(executor)
.await?;
Ok(())
}
pub async fn exists(pool: &PgPool, id: Uuid) -> AppResult<bool> {
let (exists,): (bool,) =
sqlx::query_as("SELECT EXISTS(SELECT 1 FROM mangas WHERE id = $1)")
@@ -270,3 +281,17 @@ pub async fn exists(pool: &PgPool, id: Uuid) -> AppResult<bool> {
.await?;
Ok(exists)
}
/// Returns the uploader's user id for a manga. `None` either when the
/// manga doesn't exist or when the row predates the `uploaded_by`
/// column (historical NULL — see migration 0011). Callers must
/// distinguish "manga missing" via [`exists`] before relying on this
/// to make an authz decision.
pub async fn uploaded_by(pool: &PgPool, id: Uuid) -> AppResult<Option<Uuid>> {
let row: Option<(Option<Uuid>,)> =
sqlx::query_as("SELECT uploaded_by FROM mangas WHERE id = $1")
.bind(id)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|(u,)| u))
}

View File

@@ -1,8 +1,11 @@
pub mod admin_audit;
pub mod admin_view;
pub mod api_token;
pub mod author;
pub mod bookmark;
pub mod chapter;
pub mod collection;
pub mod crawler;
pub mod genre;
pub mod manga;
pub mod page;

View File

@@ -11,7 +11,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
r#"
INSERT INTO users (username, password_hash)
VALUES ($1, $2)
RETURNING id, username, password_hash, created_at
RETURNING id, username, password_hash, created_at, is_admin
"#,
)
.bind(username)
@@ -21,7 +21,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
match result {
Ok(user) => Ok(user),
Err(e) if is_unique_violation(&e) => {
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => {
Err(AppError::Conflict("username is already taken".into()))
}
Err(e) => Err(AppError::Database(e)),
@@ -35,7 +35,7 @@ pub async fn create(pool: &PgPool, username: &str, password_hash: &str) -> AppRe
pub async fn find_by_username(pool: &PgPool, username: &str) -> AppResult<Option<User>> {
let row = sqlx::query_as::<_, User>(
r#"
SELECT id, username, password_hash, created_at
SELECT id, username, password_hash, created_at, is_admin
FROM users
WHERE lower(username) = lower($1)
"#,
@@ -48,7 +48,7 @@ pub async fn find_by_username(pool: &PgPool, username: &str) -> AppResult<Option
pub async fn find_by_id(pool: &PgPool, id: Uuid) -> AppResult<Option<User>> {
let row = sqlx::query_as::<_, User>(
r#"SELECT id, username, password_hash, created_at FROM users WHERE id = $1"#,
r#"SELECT id, username, password_hash, created_at, is_admin FROM users WHERE id = $1"#,
)
.bind(id)
.fetch_optional(pool)
@@ -56,10 +56,317 @@ pub async fn find_by_id(pool: &PgPool, id: Uuid) -> AppResult<Option<User>> {
Ok(row)
}
fn is_unique_violation(err: &sqlx::Error) -> bool {
if let sqlx::Error::Database(db_err) = err {
db_err.code().as_deref() == Some("23505")
} else {
false
}
/// Postgres advisory-lock key guarding admin-count-changing operations
/// (demote, delete-admin). Without this lock two concurrent demotes of
/// different admins could each pass their "more than one admin remains"
/// check, then commit, leaving zero admins. The lock serialises any tx
/// that might change the admin count so the recount under the lock is
/// authoritative.
///
/// Value is the bytes of "admininv" interpreted as a big-endian i64.
/// Postgres' advisory-lock keyspace is global; collision risk with
/// `CRON_LOCK_KEY` and friends is ~2^-64.
pub const ADMIN_INVARIANT_LOCK_KEY: i64 = 0x61_64_6d_69_6e_69_6e_76;
#[derive(Debug, Default)]
pub struct ListUsersQuery {
pub search: Option<String>,
pub limit: i64,
pub offset: i64,
}
/// Paginated user list with total count. `search` is a case-insensitive
/// substring match on `username`. Order is alphabetical by username so
/// pagination is stable across concurrent writes (mangas changing
/// is_admin doesn't reshuffle the page).
pub async fn list_with_total(
pool: &PgPool,
q: &ListUsersQuery,
) -> AppResult<(Vec<User>, i64)> {
let pat = q
.search
.as_ref()
.map(|s| format!("%{}%", s.trim()))
.filter(|p| p.len() > 2);
let items = sqlx::query_as::<_, User>(
r#"
SELECT id, username, password_hash, created_at, is_admin
FROM users
WHERE ($1::text IS NULL OR username ILIKE $1)
ORDER BY username
LIMIT $2 OFFSET $3
"#,
)
.bind(&pat)
.bind(q.limit)
.bind(q.offset)
.fetch_all(pool)
.await?;
let total: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM users WHERE ($1::text IS NULL OR username ILIKE $1)",
)
.bind(&pat)
.fetch_one(pool)
.await?;
Ok((items, total))
}
/// Raw `is_admin` update with no safety checks, no audit log, and no
/// advisory lock. Exists only as a test setup helper for the admin-
/// feature integration suite — production code MUST go through
/// [`admin_safe_set_is_admin`], which enforces self-protection, the
/// last-admin invariant, and the audit log atomically.
pub async fn set_is_admin_unchecked(pool: &PgPool, id: Uuid, value: bool) -> AppResult<()> {
sqlx::query("UPDATE users SET is_admin = $1 WHERE id = $2")
.bind(value)
.bind(id)
.execute(pool)
.await?;
Ok(())
}
/// Ensure the user `username` exists and is an admin. Called at startup
/// from `app::build` when `ADMIN_USERNAME` / `ADMIN_PASSWORD` are set.
///
/// Semantics — see cross-cutting decision #2 in the feature plan:
/// - If no row exists: create with the env-supplied password hashed via
/// argon2id and `is_admin = true`.
/// - If a row already exists: flip `is_admin` to true if needed; **never**
/// touch the existing `password_hash`. Lets the operator rotate the
/// admin password through the UI without env-var conflict.
/// Wrapped in a transaction so a concurrent `register` for the same
/// username can't slip an INSERT between the SELECT and UPDATE/INSERT.
/// Set `is_admin` on a user with full safety checks: rejects self-demote,
/// rejects demoting the only remaining admin (under `ADMIN_INVARIANT_LOCK_KEY`
/// to close the parallel-demote race), and writes an `admin_audit` row
/// in the same tx so the log mirrors what actually committed.
///
/// Returns the freshly-written user row (so the handler can return it
/// without a second SELECT).
pub async fn admin_safe_set_is_admin(
pool: &PgPool,
actor_id: Uuid,
target_id: Uuid,
value: bool,
) -> AppResult<User> {
// Cheap pre-check before opening a tx — also covers the "demote me"
// case which would otherwise pass the recount when other admins exist.
if actor_id == target_id && !value {
return Err(AppError::Conflict(
"cannot demote yourself; ask another admin".into(),
));
}
let mut tx = pool.begin().await?;
sqlx::query("SELECT pg_advisory_xact_lock($1)")
.bind(ADMIN_INVARIANT_LOCK_KEY)
.execute(&mut *tx)
.await?;
let target: Option<User> = sqlx::query_as(
"SELECT id, username, password_hash, created_at, is_admin \
FROM users WHERE id = $1 FOR UPDATE",
)
.bind(target_id)
.fetch_optional(&mut *tx)
.await?;
let Some(target) = target else {
return Err(AppError::NotFound);
};
// No-op: caller asked to set `is_admin` to its current value. Return
// the row as-is without writing an audit entry — otherwise repeated
// PATCH calls (browser retry, double-click) pile misleading
// "promote_user" rows in `admin_audit` for actions that changed
// nothing.
if target.is_admin == value {
tx.commit().await?;
return Ok(target);
}
// Recount inside the lock — this is the authoritative read.
if target.is_admin && !value {
let admin_count: i64 =
sqlx::query_scalar("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&mut *tx)
.await?;
if admin_count <= 1 {
return Err(AppError::Conflict(
"cannot demote the last admin; promote another user first".into(),
));
}
}
let updated: User = sqlx::query_as(
"UPDATE users SET is_admin = $1 WHERE id = $2 \
RETURNING id, username, password_hash, created_at, is_admin",
)
.bind(value)
.bind(target_id)
.fetch_one(&mut *tx)
.await?;
let action = if value { "promote_user" } else { "demote_user" };
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
action,
"user",
Some(target_id),
serde_json::json!({ "username": target.username }),
)
.await?;
tx.commit().await?;
Ok(updated)
}
/// Delete a user with full safety checks: rejects self-delete, rejects
/// deleting the only remaining admin (under `ADMIN_INVARIANT_LOCK_KEY`),
/// and writes an `admin_audit` row in the same tx. Captures the deleted
/// username + admin status in the audit payload so the action is
/// readable after the user row itself is gone.
pub async fn admin_safe_delete(
pool: &PgPool,
actor_id: Uuid,
target_id: Uuid,
) -> AppResult<()> {
if actor_id == target_id {
return Err(AppError::Conflict(
"cannot delete yourself; ask another admin".into(),
));
}
let mut tx = pool.begin().await?;
sqlx::query("SELECT pg_advisory_xact_lock($1)")
.bind(ADMIN_INVARIANT_LOCK_KEY)
.execute(&mut *tx)
.await?;
let target: Option<User> = sqlx::query_as(
"SELECT id, username, password_hash, created_at, is_admin \
FROM users WHERE id = $1 FOR UPDATE",
)
.bind(target_id)
.fetch_optional(&mut *tx)
.await?;
let Some(target) = target else {
return Err(AppError::NotFound);
};
if target.is_admin {
let admin_count: i64 =
sqlx::query_scalar("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&mut *tx)
.await?;
if admin_count <= 1 {
return Err(AppError::Conflict(
"cannot delete the last admin; promote another user first".into(),
));
}
}
sqlx::query("DELETE FROM users WHERE id = $1")
.bind(target_id)
.execute(&mut *tx)
.await?;
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
"delete_user",
"user",
Some(target_id),
serde_json::json!({
"username": target.username,
"was_admin": target.is_admin,
}),
)
.await?;
tx.commit().await?;
Ok(())
}
/// Admin-initiated user creation. Wraps the INSERT + audit row in a
/// single transaction so a rolled-back create never leaves an orphan
/// audit entry. Caller (HTTP handler) is responsible for validating
/// `username`/`password` and hashing — this fn assumes both are
/// already vetted by the same `validate_*` rules used by self-
/// registration.
pub async fn admin_create_user(
pool: &PgPool,
actor_id: Uuid,
username: &str,
password_hash: &str,
is_admin: bool,
) -> AppResult<User> {
let mut tx = pool.begin().await?;
let user: User = match sqlx::query_as::<_, User>(
"INSERT INTO users (username, password_hash, is_admin) VALUES ($1, $2, $3) \
RETURNING id, username, password_hash, created_at, is_admin",
)
.bind(username)
.bind(password_hash)
.bind(is_admin)
.fetch_one(&mut *tx)
.await
{
Ok(u) => u,
Err(sqlx::Error::Database(ref db_err)) if db_err.is_unique_violation() => {
return Err(AppError::Conflict("username is already taken".into()));
}
Err(e) => return Err(AppError::Database(e)),
};
crate::repo::admin_audit::insert(
&mut *tx,
actor_id,
"create_user",
"user",
Some(user.id),
serde_json::json!({
"username": user.username,
"is_admin": user.is_admin,
}),
)
.await?;
tx.commit().await?;
Ok(user)
}
pub async fn bootstrap_admin(
pool: &PgPool,
username: &str,
password: &str,
) -> AppResult<()> {
let mut tx = pool.begin().await?;
let existing: Option<(Uuid,)> = sqlx::query_as(
"SELECT id FROM users WHERE lower(username) = lower($1) FOR UPDATE",
)
.bind(username)
.fetch_optional(&mut *tx)
.await?;
match existing {
Some((id,)) => {
sqlx::query("UPDATE users SET is_admin = true WHERE id = $1 AND is_admin = false")
.bind(id)
.execute(&mut *tx)
.await?;
}
None => {
let hash = crate::auth::password::hash_password(password)?;
sqlx::query("INSERT INTO users (username, password_hash, is_admin) VALUES ($1, $2, true)")
.bind(username)
.bind(&hash)
.execute(&mut *tx)
.await?;
}
}
tx.commit().await?;
Ok(())
}

View File

@@ -16,6 +16,13 @@ impl LocalStorage {
}
fn resolve(&self, key: &str) -> Result<PathBuf, StorageError> {
// NUL bytes are rejected by the Linux syscall layer, but the
// error surfaces as an opaque IO failure rather than the
// explicit `BadKey` the rest of the contract uses. Catch it
// here so the error path is consistent.
if key.contains('\0') {
return Err(StorageError::BadKey);
}
let key = key.trim_start_matches('/');
if key.is_empty() {
return Err(StorageError::BadKey);
@@ -79,6 +86,10 @@ impl Storage for LocalStorage {
let path: &Path = &self.resolve(key)?;
Ok(fs::try_exists(path).await?)
}
fn local_root(&self) -> Option<&Path> {
Some(&self.root)
}
}
#[cfg(test)]
@@ -114,6 +125,9 @@ mod tests {
assert!(matches!(s.get(".").await, Err(StorageError::BadKey)));
// Empty segment via doubled slash.
assert!(matches!(s.get("a//b").await, Err(StorageError::BadKey)));
// NUL byte (rejected explicitly so callers see BadKey rather
// than an opaque IO error from the kernel).
assert!(matches!(s.put("a\0b", b"x").await, Err(StorageError::BadKey)));
}
#[tokio::test]

View File

@@ -9,6 +9,8 @@ mod local;
use std::io;
use std::pin::Pin;
use std::path::Path;
use async_trait::async_trait;
use bytes::Bytes;
use futures_core::Stream;
@@ -44,4 +46,13 @@ pub trait Storage: Send + Sync {
async fn get_stream(&self, key: &str) -> Result<StreamingFile, StorageError>;
async fn delete(&self, key: &str) -> Result<(), StorageError>;
async fn exists(&self, key: &str) -> Result<bool, StorageError>;
/// Filesystem path the backend is rooted at, when introspectable.
/// Returns `None` for backends that aren't a local filesystem (e.g.
/// a future `S3Storage`). The admin system endpoint uses this to
/// statvfs the data dir; backends that return `None` get a `disk:
/// null` payload instead of fabricated numbers.
fn local_root(&self) -> Option<&Path> {
None
}
}

View File

@@ -0,0 +1,548 @@
//! PR 3 (feat/admin-mangas-api) integration tests.
//!
//! Per-variant fixture tests for the derived sync-state SQL plus
//! happy-path E2E for the two admin endpoints. Auth-gate regression
//! (403/401) is covered by PR 1's `RequireAdmin` test matrix; the only
//! gate test here is one spot check per endpoint.
mod common;
use axum::http::StatusCode;
use axum::Router;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use mangalord::repo;
const SOURCE_ID: &str = "test-source";
async fn seed_admin(pool: &PgPool, app: &Router) -> (String, String) {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
(username, cookie)
}
async fn seed_source(pool: &PgPool) {
repo::crawler::ensure_source(pool, SOURCE_ID, "Test", "https://example.test")
.await
.unwrap();
}
async fn insert_manga(pool: &PgPool, title: &str) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO mangas (title, status, alt_titles) VALUES ($1, 'ongoing', ARRAY[]::text[]) RETURNING id",
)
.bind(title)
.fetch_one(pool)
.await
.unwrap();
id
}
async fn insert_manga_source(
pool: &PgPool,
manga_id: Uuid,
source_manga_key: &str,
dropped: bool,
) {
let dropped_at = if dropped { "now()" } else { "NULL" };
let sql = format!(
"INSERT INTO manga_sources (source_id, source_manga_key, manga_id, source_url, dropped_at) \
VALUES ($1, $2, $3, 'https://example.test/m', {dropped_at})"
);
sqlx::query(&sql)
.bind(SOURCE_ID)
.bind(source_manga_key)
.bind(manga_id)
.execute(pool)
.await
.unwrap();
}
async fn insert_chapter(pool: &PgPool, manga_id: Uuid, number: i32, page_count: i32) -> Uuid {
let (id,): (Uuid,) = sqlx::query_as(
"INSERT INTO chapters (manga_id, number, title, page_count) VALUES ($1, $2, NULL, $3) RETURNING id",
)
.bind(manga_id)
.bind(number)
.bind(page_count)
.fetch_one(pool)
.await
.unwrap();
id
}
async fn insert_chapter_source(
pool: &PgPool,
chapter_id: Uuid,
source_chapter_key: &str,
dropped: bool,
) {
let dropped_at = if dropped { "now()" } else { "NULL" };
let sql = format!(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, 'https://example.test/c', {dropped_at})"
);
sqlx::query(&sql)
.bind(SOURCE_ID)
.bind(source_chapter_key)
.bind(chapter_id)
.execute(pool)
.await
.unwrap();
}
async fn insert_job(pool: &PgPool, payload: serde_json::Value, state: &str) {
sqlx::query("INSERT INTO crawler_jobs (payload, state) VALUES ($1, $2)")
.bind(payload)
.bind(state)
.execute(pool)
.await
.unwrap();
}
/// Per-variant tests don't care about pagination — fetch the whole
/// chapter set (up to the hard cap) and discard the total.
async fn fetch_chapter_rows(
pool: &PgPool,
manga_id: Uuid,
) -> Vec<mangalord::repo::admin_view::AdminChapterRow> {
let (rows, _) = repo::admin_view::list_chapters_with_sync_state(
pool,
&repo::admin_view::ListAdminChaptersQuery {
manga_id,
limit: 500,
offset: 0,
},
)
.await
.unwrap();
rows
}
// ---- manga sync state ------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_synced_for_fresh_source(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Synced Manga").await;
insert_manga_source(&pool, m, "smk-1", false).await;
let (rows, total) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(total, 1);
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_synced_for_user_upload_without_sources(pool: PgPool) {
let m = insert_manga(&pool, "User Upload").await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_dropped_when_all_sources_dropped(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Dropped Manga").await;
insert_manga_source(&pool, m, "smk-1", true).await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].id, m);
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::Dropped);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_in_progress_via_sync_chapter_list_job(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "Syncing Manga").await;
insert_manga_source(&pool, m, "smk-1", false).await;
// sync_chapter_list payload carries manga_id directly.
insert_job(
&pool,
json!({
"kind": "sync_chapter_list",
"source_id": SOURCE_ID,
"manga_id": m.to_string(),
"source_manga_key": "smk-1",
}),
"pending",
)
.await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::InProgress);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_state_in_progress_via_sync_manga_job(pool: PgPool) {
// The trickier branch: sync_manga payload is keyed by
// source_manga_key, NOT manga_id — must join through manga_sources.
seed_source(&pool).await;
let m = insert_manga(&pool, "Metadata-Refreshing Manga").await;
insert_manga_source(&pool, m, "smk-key-42", false).await;
insert_job(
&pool,
json!({
"kind": "sync_manga",
"source_id": SOURCE_ID,
"source_manga_key": "smk-key-42",
}),
"running",
)
.await;
let (rows, _) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(rows[0].sync_state, mangalord::domain::MangaSyncState::InProgress);
}
#[sqlx::test(migrations = "./migrations")]
async fn manga_list_filters_by_sync_state(pool: PgPool) {
seed_source(&pool).await;
let m_synced = insert_manga(&pool, "AAA Synced").await;
insert_manga_source(&pool, m_synced, "smk-a", false).await;
let m_dropped = insert_manga(&pool, "BBB Dropped").await;
insert_manga_source(&pool, m_dropped, "smk-b", true).await;
let (rows, total) = repo::admin_view::list_mangas_with_sync_state(
&pool,
&repo::admin_view::ListAdminMangasQuery {
sync_state: Some(mangalord::domain::MangaSyncState::Dropped),
limit: 50,
..Default::default()
},
)
.await
.unwrap();
assert_eq!(total, 1);
assert_eq!(rows.len(), 1);
assert_eq!(rows[0].id, m_dropped);
}
// ---- chapter sync state ----------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_synced_when_pages_present(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
insert_manga_source(&pool, m, "smk", false).await;
let c = insert_chapter(&pool, m, 1, 12).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(rows.len(), 1);
assert_eq!(rows[0].id, c);
assert_eq!(rows[0].sync_state, mangalord::domain::ChapterSyncState::Synced);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_not_downloaded_when_page_count_zero(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::NotDownloaded
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_downloading_when_job_in_flight(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"running",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Downloading
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_dropped_when_all_sources_dropped(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", true).await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Dropped
);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_failed_when_most_recent_job_dead(pool: PgPool) {
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 0).await;
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"dead",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Failed
);
}
// ---- HTTP-level happy-path + gate ------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_returns_paged_with_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "Hello").await;
insert_manga_source(&pool, m, "smk", false).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/mangas?limit=50",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 1);
assert_eq!(items[0]["id"], m.to_string());
assert_eq!(items[0]["sync_state"], "synced");
assert_eq!(items[0]["chapter_count"], 0);
assert_eq!(body["page"]["total"], 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_rejects_unknown_sync_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/mangas?sync_state=bogus",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_returns_per_chapter_state(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c1 = insert_chapter(&pool, m, 1, 12).await;
let c2 = insert_chapter(&pool, m, 2, 0).await;
insert_chapter_source(&pool, c1, "ck1", false).await;
insert_chapter_source(&pool, c2, "ck2", false).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 2);
assert_eq!(items[0]["id"], c1.to_string());
assert_eq!(items[0]["sync_state"], "synced");
assert_eq!(items[1]["id"], c2.to_string());
assert_eq!(items[1]["sync_state"], "not_downloaded");
assert_eq!(body["page"]["total"], 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_caps_limit_at_500(pool: PgPool) {
// The handler clamps limit to [1, 500] so a long-runner with
// thousands of chapters can't be turned into a request-stall by an
// admin (or by a curious admin tab) just clicking expand.
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
for n in 1..=3 {
let _c = insert_chapter(&pool, m, n, 0).await;
}
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters?limit=999"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["page"]["limit"], 500, "limit must clamp to 500");
assert_eq!(body["items"].as_array().unwrap().len(), 3);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_paginates(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
for n in 1..=5 {
let _c = insert_chapter(&pool, m, n, 0).await;
}
let resp = h
.app
.clone()
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{m}/chapters?limit=2&offset=2"),
&cookie,
))
.await
.unwrap();
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 2);
// Ordered by chapter number ascending; offset=2 skips chapters 1 & 2.
assert_eq!(items[0]["number"], 3);
assert_eq!(items[1]["number"], 4);
assert_eq!(body["page"]["total"], 5);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_chapters_returns_404_for_unknown_manga(pool: PgPool) {
// Regression: used to return 200 [] for a non-existent manga,
// which silently rendered "No chapters." for a typo'd / deleted id.
let h = common::harness(pool.clone());
let (_admin, cookie) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
&format!("/api/v1/admin/mangas/{}/chapters", Uuid::new_v4()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn chapter_state_synced_when_pages_present_even_with_dead_job(pool: PgPool) {
// Regression: the old CASE prioritised the dead-job branch above
// the page_count check, so a chapter with pages on disk AND a
// historical dead job (e.g. from a re-download attempt that
// crashed) flipped to Failed — contradicting Synced's "downloaded
// at some point" contract.
seed_source(&pool).await;
let m = insert_manga(&pool, "M").await;
let c = insert_chapter(&pool, m, 1, 12).await; // pages present
insert_chapter_source(&pool, c, "ckey-1", false).await;
insert_job(
&pool,
json!({
"kind": "sync_chapter_content",
"source_id": SOURCE_ID,
"chapter_id": c.to_string(),
"source_chapter_key": "ckey-1",
}),
"dead",
)
.await;
let rows = fetch_chapter_rows(&pool, m).await;
assert_eq!(
rows[0].sync_state,
mangalord::domain::ChapterSyncState::Synced,
"pages on disk override historical dead-job noise"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn http_list_mangas_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_u, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/mangas", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}

View File

@@ -0,0 +1,257 @@
//! PR 1 (feat/admin-role) integration tests.
//!
//! Covers: `bootstrap_admin` semantics, `is_admin` exposed on /auth/me,
//! and the `RequireAdmin` extractor's 401/403/200 matrix — including the
//! load-bearing decision that Bearer-authed callers can NEVER reach an
//! admin-guarded route, even when the underlying user IS admin.
mod common;
use std::sync::Arc;
use axum::http::StatusCode;
use axum::routing::get;
use axum::{Json, Router};
use serde_json::json;
use sqlx::PgPool;
use tempfile::TempDir;
use tower::ServiceExt;
use mangalord::api;
use mangalord::app::AppState;
use mangalord::auth::extractor::RequireAdmin;
use mangalord::auth::rate_limit::AuthRateLimiter;
use mangalord::config::{AuthConfig, UploadConfig};
use mangalord::repo;
use mangalord::storage::{LocalStorage, Storage};
/// Test-only handler guarded by `RequireAdmin`. Lets the test suite assert
/// the extractor's behaviour end-to-end without depending on an admin
/// endpoint existing yet (those land in PR 2+).
async fn admin_only_handler(RequireAdmin(user): RequireAdmin) -> Json<serde_json::Value> {
Json(json!({ "username": user.username, "is_admin": user.is_admin }))
}
/// Build a router that exposes the production /api/v1/* AND a test-only
/// `/_test/admin_only` route guarded by `RequireAdmin`. Pool is consumed;
/// callers that want to inspect the DB after a request should clone it.
fn admin_test_router(pool: PgPool) -> (Router, TempDir) {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
..AuthConfig::default()
};
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState {
db: pool,
storage,
auth,
upload: UploadConfig::default(),
auth_limiter,
};
let app = Router::new()
.nest("/api/v1", api::routes())
.route("/_test/admin_only", get(admin_only_handler))
.with_state(state);
(app, storage_dir)
}
// ---- bootstrap_admin -------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_creates_admin_when_user_missing(pool: PgPool) {
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("bootstrap on empty DB");
let user = repo::user::find_by_username(&pool, "root")
.await
.unwrap()
.expect("root user exists after bootstrap");
assert!(user.is_admin, "bootstrap must set is_admin = true on creation");
// Password hash must verify the env-supplied password (and not be empty).
assert!(
mangalord::auth::password::verify_password("hunter2hunter2", &user.password_hash),
"bootstrap-created user must accept the env-supplied password"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_promotes_existing_user_without_touching_password(pool: PgPool) {
// Pre-existing user, not admin. Use the real register path so the
// hash format matches production exactly.
let (app, _td) = admin_test_router(pool.clone());
let resp = app
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "preexisting", "password": "originalpw1234" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let before = repo::user::find_by_username(&pool, "preexisting")
.await
.unwrap()
.unwrap();
assert!(!before.is_admin);
let original_hash = before.password_hash.clone();
// Bootstrap with a DIFFERENT password — must not overwrite the hash.
repo::user::bootstrap_admin(&pool, "preexisting", "envpw_should_be_ignored")
.await
.expect("bootstrap on existing user");
let after = repo::user::find_by_username(&pool, "preexisting")
.await
.unwrap()
.unwrap();
assert!(after.is_admin, "bootstrap must promote existing user");
assert_eq!(
after.password_hash, original_hash,
"bootstrap must NOT overwrite the existing password hash"
);
assert!(
mangalord::auth::password::verify_password("originalpw1234", &after.password_hash),
"original password must still verify after bootstrap"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn bootstrap_is_idempotent(pool: PgPool) {
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("first bootstrap");
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.expect("second bootstrap is no-op");
// Exactly one row, still admin.
let (count,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM users WHERE username = $1")
.bind("root")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 1);
}
// ---- /api/v1/auth/me exposes is_admin --------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn auth_me_response_includes_is_admin(pool: PgPool) {
let (app, _td) = admin_test_router(pool.clone());
let (_username, cookie) = common::register_user(&app).await;
let resp = app
.oneshot(common::get_with_cookie("/api/v1/auth/me", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(
body["user"]["is_admin"], false,
"freshly-registered users default to is_admin=false"
);
}
// ---- RequireAdmin: 401 / 403 / 200 matrix ----------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_unauthenticated(pool: PgPool) {
let (app, _td) = admin_test_router(pool);
let resp = app
.oneshot(common::get("/_test/admin_only"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_non_admin_cookie(pool: PgPool) {
let (app, _td) = admin_test_router(pool);
let (_username, cookie) = common::register_user(&app).await;
let resp = app
.oneshot(common::get_with_cookie("/_test/admin_only", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "forbidden");
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_accepts_admin_cookie(pool: PgPool) {
let (app, _td) = admin_test_router(pool.clone());
let (username, cookie) = common::register_user(&app).await;
// Promote via the repo (the admin-users API doesn't exist yet).
let u = repo::user::find_by_username(&pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(&pool, u.id, true).await.unwrap();
let resp = app
.oneshot(common::get_with_cookie("/_test/admin_only", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["username"], username);
assert_eq!(body["is_admin"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn require_admin_rejects_bearer_token_even_for_admin_user(pool: PgPool) {
// Key privilege-escalation test: an API token belonging to an admin user
// must NOT grant admin authority. Bot tokens are excluded from admin
// routes by design (the RequireAdmin extractor only accepts session
// cookies). See cross-cutting decision #1 in the PR plan.
let (app, _td) = admin_test_router(pool.clone());
let (username, cookie) = common::register_user(&app).await;
// Promote to admin and mint an API token (the existing /auth/tokens
// endpoint authenticates via the same cookie).
let u = repo::user::find_by_username(&pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(&pool, u.id, true).await.unwrap();
let resp = app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/auth/tokens",
json!({ "name": "test-bot" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
let token = body["bearer"]
.as_str()
.expect("raw bearer token in response")
.to_string();
// Sanity: the bearer DOES work on a non-admin endpoint (proves the
// token is valid, isolating the failure below to the admin guard).
let resp = app
.clone()
.oneshot(common::get_with_bearer("/api/v1/auth/me", &token))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
// Same token, same admin user, but on the admin-guarded route → 401
// (no session cookie present at all from the extractor's POV).
let resp = app
.oneshot(common::get_with_bearer("/_test/admin_only", &token))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::UNAUTHORIZED,
"Bearer-authed admin must NOT pass the RequireAdmin guard"
);
}

View File

@@ -0,0 +1,96 @@
//! PR 4 (feat/admin-system-api) integration tests.
//!
//! Shape-only assertions — we don't mock the system, just call the
//! endpoint and check the response envelope. Threshold-triggering of
//! alerts would require faking statvfs / sysinfo, which is more
//! plumbing than the test gives back.
mod common;
use axum::http::StatusCode;
use axum::Router;
use sqlx::PgPool;
use tower::ServiceExt;
use mangalord::repo;
async fn seed_admin(pool: &PgPool, app: &Router) -> String {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
cookie
}
#[sqlx::test(migrations = "./migrations")]
async fn requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_u, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/system", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn unauthenticated_request_is_rejected(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/admin/system"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn returns_disk_memory_cpu_alerts_shape(pool: PgPool) {
let h = common::harness(pool.clone());
let cookie = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/system", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
// Disk: harness uses LocalStorage on a tempdir, so disk SHOULD be
// populated. Validate the field shape and percent range.
let disk = body
.get("disk")
.expect("disk key present")
.as_object()
.expect("disk is an object (LocalStorage exposes a path)");
assert!(disk["total_bytes"].as_u64().unwrap() > 0);
let pct = disk["percent_used"].as_f64().unwrap();
assert!(
(0.0..=100.0).contains(&pct),
"percent_used outside [0,100]: {pct}"
);
let mem = body.get("memory").expect("memory key").as_object().unwrap();
assert!(mem["total_bytes"].as_u64().unwrap() > 0);
let mpct = mem["percent_used"].as_f64().unwrap();
assert!((0.0..=100.0).contains(&mpct));
let cpu = body.get("cpu").expect("cpu key").as_object().unwrap();
let cpu_pct = cpu["percent_used"].as_f64().unwrap();
assert!(
(0.0..=100.0).contains(&cpu_pct),
"cpu out of range: {cpu_pct}"
);
let alerts = body.get("alerts").expect("alerts key").as_array().unwrap();
// Don't assert on length — the box may genuinely be >90% on memory
// when the test runs. Just confirm shape of any present entry.
for alert in alerts {
assert!(alert["level"].is_string());
assert!(alert["message"].is_string());
}
}

View File

@@ -0,0 +1,605 @@
//! PR 2 (feat/admin-users-api) integration tests.
//!
//! Exercises list / delete / promote-demote on /api/v1/admin/users:
//! pagination + search, the RequireAdmin gate, self-protection,
//! last-admin invariant (including the parallel-demote race that
//! `pg_advisory_xact_lock` + recount-inside-tx guards against), and
//! that audit rows land in `admin_audit` only on successful commit.
//!
//! Note on the last-admin invariant: the *serial* path via HTTP is
//! structurally unreachable — the only configuration that would hit the
//! "would orphan admins" branch requires the actor to be the lone admin
//! demoting themselves, which the self-guard fires on first. So the
//! last-admin checks below call the repo directly to exercise the
//! invariant; the HTTP race scenario is covered by
//! `parallel_demotes_cannot_orphan_admins`.
mod common;
use axum::http::StatusCode;
use axum::Router;
use serde_json::json;
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use mangalord::error::AppError;
use mangalord::repo;
/// Register a user via the public API and immediately promote them via
/// the repo. Returns (username, session cookie, user_id) — the common
/// "I need a logged-in admin" prelude.
async fn seed_admin(pool: &PgPool, app: &Router) -> (String, String, Uuid) {
let (username, cookie) = common::register_user(app).await;
let u = repo::user::find_by_username(pool, &username)
.await
.unwrap()
.unwrap();
repo::user::set_is_admin_unchecked(pool, u.id, true).await.unwrap();
(username, cookie, u.id)
}
// ---- RequireAdmin gate -----------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn list_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie("/api/v1/admin/users", &cookie))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{}", Uuid::new_v4()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn patch_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", Uuid::new_v4()),
json!({ "is_admin": true }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
// ---- list with search and pagination ---------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn list_returns_paginated_users(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin_name, cookie, _) = seed_admin(&pool, &h.app).await;
let _u1 = common::register_user(&h.app).await;
let _u2 = common::register_user(&h.app).await;
let _u3 = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/users?limit=2&offset=0",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().expect("items array");
assert_eq!(items.len(), 2, "limit=2 should cap the page");
assert_eq!(body["page"]["limit"], 2);
assert_eq!(body["page"]["offset"], 0);
assert_eq!(body["page"]["total"], 4);
assert!(items[0].get("is_admin").is_some());
assert!(
items[0].get("password_hash").is_none(),
"password_hash must never leak even to other admins"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn list_filters_by_substring_search(pool: PgPool) {
let h = common::harness(pool.clone());
let (_admin_name, cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "zzzfindme01", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let resp = h
.app
.oneshot(common::get_with_cookie(
"/api/v1/admin/users?search=zzzfindme",
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 1, "search must narrow to the one match");
assert_eq!(items[0]["username"], "zzzfindme01");
assert_eq!(body["page"]["total"], 1);
}
// ---- self-protection -------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn cannot_self_delete(pool: PgPool) {
let h = common::harness(pool.clone());
let (_username, cookie, actor_id) = seed_admin(&pool, &h.app).await;
// Second admin so the last-admin guard isn't what triggers the conflict.
let (_other, _, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{actor_id}"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CONFLICT);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "conflict");
assert!(
body["error"]["message"]
.as_str()
.unwrap()
.contains("yourself"),
"message must call out the self-action; got {:?}",
body["error"]["message"]
);
}
#[sqlx::test(migrations = "./migrations")]
async fn cannot_self_demote(pool: PgPool) {
let h = common::harness(pool.clone());
let (_username, cookie, actor_id) = seed_admin(&pool, &h.app).await;
let (_other, _, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{actor_id}"),
json!({ "is_admin": false }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CONFLICT);
let body = common::body_json(resp).await;
assert!(body["error"]["message"]
.as_str()
.unwrap()
.contains("yourself"));
}
// ---- last-admin invariant (repo layer, see file header) --------------------
#[sqlx::test(migrations = "./migrations")]
async fn last_admin_demote_refused_at_repo(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
// admins = {A, B}. Demote A via B (count 2 → 1) — allowed.
let r = repo::user::admin_safe_set_is_admin(&pool, b_id, a_id, false)
.await
.expect("first demote succeeds");
assert!(!r.is_admin);
// admins = {B}. Try to demote B via A (actor doesn't matter to the
// repo — that's the HTTP gate's job). Last-admin guard kicks in.
let err = repo::user::admin_safe_set_is_admin(&pool, a_id, b_id, false)
.await
.expect_err("second demote must be refused");
match err {
AppError::Conflict(m) => assert!(
m.contains("last admin"),
"expected last-admin conflict; got {m:?}"
),
other => panic!("expected Conflict, got {other:?}"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn last_admin_delete_refused_at_repo(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
// admins = {A, B}. Delete A via B (count 2 → 1) — allowed.
repo::user::admin_safe_delete(&pool, b_id, a_id)
.await
.expect("first delete succeeds");
// admins = {B}. Try to delete B via a fresh non-admin actor. Last-
// admin guard kicks in.
let (_c, _, c_id) = {
let (cn, _ck) = common::register_user(&h.app).await;
let c = repo::user::find_by_username(&pool, &cn).await.unwrap().unwrap();
(cn, _ck, c.id)
};
let err = repo::user::admin_safe_delete(&pool, c_id, b_id)
.await
.expect_err("second delete must be refused");
match err {
AppError::Conflict(m) => assert!(
m.contains("last admin"),
"expected last-admin conflict; got {m:?}"
),
other => panic!("expected Conflict, got {other:?}"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn parallel_demotes_cannot_orphan_admins(pool: PgPool) {
// The race the advisory lock + recount exists to close: two parallel
// demotes of two DIFFERENT admins, each reading `count = 2` and
// committing, would land at zero admins. With the lock the second
// demote sees count = 1 inside the tx and refuses.
let h = common::harness(pool.clone());
let (_a, _, a_id) = seed_admin(&pool, &h.app).await;
let (_b, _, b_id) = seed_admin(&pool, &h.app).await;
let pool_x = pool.clone();
let pool_y = pool.clone();
let task_x = tokio::spawn(async move {
repo::user::admin_safe_set_is_admin(&pool_x, a_id, b_id, false).await
});
let task_y = tokio::spawn(async move {
repo::user::admin_safe_set_is_admin(&pool_y, b_id, a_id, false).await
});
let r_x = task_x.await.unwrap();
let r_y = task_y.await.unwrap();
let outcomes = (r_x.is_ok(), r_y.is_ok());
assert!(
outcomes == (true, false) || outcomes == (false, true),
"exactly one of the two parallel demotes must succeed; got {outcomes:?}"
);
let (count,): (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM users WHERE is_admin = true")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 1, "at least one admin must remain");
}
// ---- audit log -------------------------------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn promote_writes_audit_row(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie) = common::register_user(&h.app).await;
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
json!({ "is_admin": true }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let rows: Vec<(Option<Uuid>, String, String, Option<Uuid>)> = sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id FROM admin_audit",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(rows.len(), 1);
let (actor, action, kind, target) = rows.into_iter().next().unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "promote_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(b.id));
}
#[sqlx::test(migrations = "./migrations")]
async fn redundant_promote_does_not_write_audit_row(pool: PgPool) {
// Regression: PATCH {is_admin: true} on someone already admin used
// to UPDATE (no-op) and still INSERT a misleading "promote_user"
// audit row. Should short-circuit without touching admin_audit.
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie, _b_id) = seed_admin(&pool, &h.app).await; // already admin
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
json!({ "is_admin": true }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let (count,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM admin_audit")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count, 0, "no-op promote must not write audit row");
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_writes_audit_row(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let (b_name, _b_cookie) = common::register_user(&h.app).await;
let b = repo::user::find_by_username(&pool, &b_name)
.await
.unwrap()
.unwrap();
let resp = h
.app
.oneshot(common::delete_with_cookie(
&format!("/api/v1/admin/users/{}", b.id),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
let rows: Vec<(Option<Uuid>, String, String, Option<Uuid>, serde_json::Value)> =
sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id, payload FROM admin_audit",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(rows.len(), 1);
let (actor, action, kind, target, payload) = rows.into_iter().next().unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "delete_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(b.id));
assert_eq!(payload["username"], b_name);
assert_eq!(payload["was_admin"], false);
}
// ---- POST /admin/users (admin-create) --------------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn create_user_requires_admin(pool: PgPool) {
let h = common::harness(pool);
let (_username, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "newbie", "password": "hunter2hunter2" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_unauthenticated_is_rejected(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::post_json(
"/api/v1/admin/users",
json!({ "username": "newbie", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_happy_path_creates_user_and_audit(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, a_id) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "invited01", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
assert_eq!(body["username"], "invited01");
assert_eq!(body["is_admin"], false);
assert!(body["id"].as_str().is_some());
assert!(
body.get("password_hash").is_none(),
"password_hash must never appear in admin-create response"
);
let target_id =
Uuid::parse_str(body["id"].as_str().unwrap()).unwrap();
let (actor, action, kind, target, payload): (
Option<Uuid>,
String,
String,
Option<Uuid>,
serde_json::Value,
) = sqlx::query_as(
"SELECT actor_user_id, action, target_kind, target_id, payload \
FROM admin_audit ORDER BY at DESC LIMIT 1",
)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(actor, Some(a_id));
assert_eq!(action, "create_user");
assert_eq!(kind, "user");
assert_eq!(target, Some(target_id));
assert_eq!(payload["username"], "invited01");
assert_eq!(payload["is_admin"], false);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_can_mint_an_admin_in_one_call(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({
"username": "newadmin",
"password": "freshpass1234",
"is_admin": true
}),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let body = common::body_json(resp).await;
assert_eq!(body["is_admin"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_returns_409_on_duplicate(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
// Seed an existing user via the public register path.
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "taken", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "Taken", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CONFLICT,
"case-insensitive collision via the lower(username) index"
);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "conflict");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_rejects_weak_password(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "okayname", "password": "short" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "invalid_input");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_rejects_invalid_username(pool: PgPool) {
let h = common::harness(pool.clone());
let (_a_name, a_cookie, _) = seed_admin(&pool, &h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "bad name!", "password": "freshpass1234" }),
&a_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::BAD_REQUEST);
}
#[sqlx::test(migrations = "./migrations")]
async fn create_user_works_even_when_self_register_disabled(pool: PgPool) {
// The admin-create path must NOT be gated by ALLOW_SELF_REGISTER —
// that's the entire point of having an admin-create endpoint.
let h = common::harness_with_self_register_disabled(pool.clone());
// Bootstrap an admin out-of-band since self-register would refuse.
repo::user::bootstrap_admin(&pool, "root", "hunter2hunter2")
.await
.unwrap();
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "root", "password": "hunter2hunter2" }),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let cookie = common::extract_session_cookie(&resp).unwrap();
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/admin/users",
json!({ "username": "invited01", "password": "freshpass1234" }),
&cookie,
))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CREATED,
"admin must be able to mint users even with self-register off"
);
}

View File

@@ -567,6 +567,166 @@ async fn user_a_cannot_delete_user_b_token(pool: PgPool) {
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
}
/// Username enumeration via login response time: an attacker probes
/// for valid usernames by measuring how long /auth/login takes. Before
/// the equalisation fix, the no-user branch returned 401 in <1 ms
/// while the wrong-password branch took ~50-100 ms (the argon2 verify
/// cost). This test asserts the no-user branch now spends at least
/// some meaningful fraction of the wrong-password branch's time.
///
/// Tolerance is intentionally loose so CI variance doesn't flap the
/// test. The unequalised gap is large enough (~50x) that even a noisy
/// CI run with a 5x slack still catches it.
#[sqlx::test(migrations = "./migrations")]
async fn login_no_user_branch_runs_argon2_for_timing_equalisation(pool: PgPool) {
use std::time::Instant;
let h = common::harness(pool);
// Register the victim user so the wrong-password branch has a real
// argon2 hash to verify against.
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/register",
json!({ "username": "victim", "password": "hunter2hunter2" }),
))
.await
.unwrap();
// Warm-up: first login of the process initialises the dummy hash
// lazily. Skip that cost when measuring.
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "victim", "password": "wrong" }),
))
.await
.unwrap();
let _ = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "ghost", "password": "wrong" }),
))
.await
.unwrap();
// Median-of-N is more stable than a single sample.
async fn sample_min(
app: &axum::Router,
username: &str,
n: u32,
) -> std::time::Duration {
let mut samples = Vec::with_capacity(n as usize);
for _ in 0..n {
let req = common::post_json(
"/api/v1/auth/login",
json!({ "username": username, "password": "wrong-guess" }),
);
let t = Instant::now();
let resp = app.clone().oneshot(req).await.unwrap();
let d = t.elapsed();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
samples.push(d);
}
// Use the minimum: it's the floor that argon2 takes, robust
// against unrelated stalls (DB connection acquisition, etc.).
*samples.iter().min().unwrap()
}
let wrong_pwd = sample_min(&h.app, "victim", 3).await;
let no_user = sample_min(&h.app, "ghost", 3).await;
// 5x slack: argon2 dominates both branches, so they should be
// within an order of magnitude. Unequalised, no_user would be
// ~50-100x faster. Asserting "no_user >= wrong_pwd / 5" catches
// the bug without being flaky in CI.
assert!(
no_user * 5 >= wrong_pwd,
"login timing leaks user existence: no_user={no_user:?}, wrong_pwd={wrong_pwd:?}"
);
}
/// Brute-force / spray protection: at default production limits, a
/// tight loop of /auth/login attempts should burst through the bucket
/// and then 429 every subsequent request until the bucket refills.
#[sqlx::test(migrations = "./migrations")]
async fn login_rate_limited_under_burst_pressure(pool: PgPool) {
let h = common::harness_with_auth_rate_limit(pool, 1, 3);
// Register a victim so the wrong-password branch is real work.
let _ = h
.app
.clone()
.oneshot(common::post_json("/api/v1/auth/register", creds("victim")))
.await
.unwrap();
// Register consumed one token from the burst-3 bucket. Fire 30
// wrong-password logins back-to-back; with per_sec=1 the refill
// is too slow to keep up and at least one must come back 429.
let mut saw_429 = false;
for _ in 0..30 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "victim", "password": "wrong" }),
))
.await
.unwrap();
if resp.status() == StatusCode::TOO_MANY_REQUESTS {
// RFC 6585 §4: 429 SHOULD include a Retry-After header. The
// value is in seconds; with per_sec=1 the bucket needs ~1s
// to refill, so the header should be 1 or 2.
let retry_after = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u32>().ok())
.expect("Retry-After header present and numeric");
assert!(
retry_after >= 1,
"Retry-After must be at least 1s, got {retry_after}"
);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "too_many_requests");
saw_429 = true;
break;
}
}
assert!(
saw_429,
"expected at least one 429 within 30 rapid login attempts"
);
}
/// Default (test-harness) limits are disabled, so existing tests that
/// fire multiple auth requests don't start failing.
#[sqlx::test(migrations = "./migrations")]
async fn default_test_harness_does_not_rate_limit(pool: PgPool) {
let h = common::harness(pool);
for i in 0..50 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": format!("nobody-{i}"), "password": "x" }),
))
.await
.unwrap();
// None of these should be 429 — only 401.
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED, "iter {i}");
}
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_unknown_token_is_404(pool: PgPool) {
let h = common::harness(pool);
@@ -581,3 +741,68 @@ async fn delete_unknown_token_is_404(pool: PgPool) {
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Bot token names are user-supplied free-form strings; a 10 MB name
/// was accepted before. Cap at 64 chars to match the other free-form
/// identifier caps (tags, collection names). The response uses
/// `ValidationFailed` (422 with per-field details) so clients can
/// render the same shape they already handle for `attach_tag`.
#[sqlx::test(migrations = "./migrations")]
async fn create_token_rejects_name_over_64_chars(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::post_json_with_cookie(
"/api/v1/auth/tokens",
json!({ "name": "x".repeat(65) }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
assert!(body["error"]["details"]["name"].is_string());
}
// ---- self-register toggle + /auth/config -----------------------------------
#[sqlx::test(migrations = "./migrations")]
async fn auth_config_reports_self_register_enabled_by_default(pool: PgPool) {
let h = common::harness(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["self_register_enabled"], true);
}
#[sqlx::test(migrations = "./migrations")]
async fn auth_config_reflects_self_register_disabled(pool: PgPool) {
let h = common::harness_with_self_register_disabled(pool);
let resp = h
.app
.oneshot(common::get("/api/v1/auth/config"))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
assert_eq!(body["self_register_enabled"], false);
}
#[sqlx::test(migrations = "./migrations")]
async fn register_returns_403_when_self_register_disabled(pool: PgPool) {
let h = common::harness_with_self_register_disabled(pool);
let resp = h
.app
.oneshot(common::post_json("/api/v1/auth/register", creds("alice")))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "forbidden");
}

View File

@@ -438,3 +438,196 @@ async fn list_me_returns_paged_envelope(pool: PgPool) {
// without paging through.
assert_eq!(body["page"]["total"], 0);
}
// -------------------------------------------------------------------------
// Bookmark create -> SyncChapterContent job enqueue (background task)
// -------------------------------------------------------------------------
async fn seed_chapter_with_source(
pool: &PgPool,
manga_id: Uuid,
number: i32,
source_id: &str,
source_chapter_key: &str,
source_url: &str,
dropped: bool,
) -> Uuid {
let chapter_id: Uuid =
mangalord::repo::chapter::create(pool, manga_id, number, None, None)
.await
.unwrap()
.id;
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind(source_id)
.bind(source_id)
.bind("https://example.com")
.execute(pool)
.await
.unwrap();
let dropped_at = if dropped { "now()" } else { "NULL" };
sqlx::query(&format!(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, {dropped_at})"
))
.bind(source_id)
.bind(source_chapter_key)
.bind(chapter_id)
.bind(source_url)
.execute(pool)
.await
.unwrap();
chapter_id
}
/// Poll `crawler_jobs` for the expected pending count, up to ~1.5s, so the
/// detached `tokio::spawn` from the bookmark create handler has time to
/// land regardless of CI scheduling jitter.
async fn wait_for_pending_count(pool: &PgPool, expected: i64) -> i64 {
for _ in 0..30 {
let count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap();
if count >= expected {
return count;
}
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}
sqlx::query_scalar::<_, i64>(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap()
}
#[sqlx::test(migrations = "./migrations")]
async fn create_enqueues_sync_chapter_content_jobs_for_pending_chapters(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
// Two zero-page chapters with non-dropped sources.
let c1 = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let c2 = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", false).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let count = wait_for_pending_count(&pool, 2).await;
assert_eq!(count, 2, "both pending chapters should be enqueued");
let chapter_ids: Vec<String> = sqlx::query_scalar(
"SELECT payload->>'chapter_id' FROM crawler_jobs \
WHERE payload->>'kind' = 'sync_chapter_content' \
ORDER BY payload->>'chapter_id'",
)
.fetch_all(&pool)
.await
.unwrap();
let mut expected = vec![c1.to_string(), c2.to_string()];
expected.sort();
assert_eq!(chapter_ids, expected);
}
#[sqlx::test(migrations = "./migrations")]
async fn re_bookmark_after_delete_does_not_re_enqueue_pending_jobs(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _ = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
// First bookmark — should enqueue 1.
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
let bookmark_id = common::body_json(resp).await["id"].as_str().unwrap().to_string();
assert_eq!(wait_for_pending_count(&pool, 1).await, 1);
// Delete the bookmark, then re-bookmark — the existing pending job
// is still there so the dedup index suppresses the second enqueue.
let resp = h
.app
.clone()
.oneshot(common::delete_with_cookie(
&format!("/api/v1/bookmarks/{bookmark_id}"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
// Give the background task time to attempt re-enqueue (it should be a no-op).
tokio::time::sleep(std::time::Duration::from_millis(300)).await;
let final_count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state IN ('pending', 'running') \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(final_count, 1, "dedup index keeps the queue at a single in-flight row");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_skips_chapters_with_dropped_sources(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _alive = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let _dropped = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", true).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
assert_eq!(
wait_for_pending_count(&pool, 1).await,
1,
"only the chapter with a non-dropped source row gets enqueued"
);
}

View File

@@ -12,12 +12,18 @@ async fn seed_manga(h: &common::Harness, cookie: &str, title: &str) -> Uuid {
common::seed_manga_via_api(&h.app, cookie, title).await
}
async fn seed_chapter(pool: &PgPool, manga_id: Uuid, number: i32, title: Option<&str>) {
async fn seed_chapter(
pool: &PgPool,
manga_id: Uuid,
number: i32,
title: Option<&str>,
) -> Uuid {
// Historical seed — uploaded_by remains NULL, mirroring the
// pre-Phase-5 rows in the production DB.
mangalord::repo::chapter::create(pool, manga_id, number, title, None)
.await
.unwrap();
.unwrap()
.id
}
#[sqlx::test(migrations = "./migrations")]
@@ -81,16 +87,16 @@ async fn list_chapters_returns_404_for_unknown_manga(pool: PgPool) {
}
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_by_number(pool: PgPool) {
async fn get_chapter_by_id(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
let chapter_id = seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/1"
"/api/v1/mangas/{manga_id}/chapters/{chapter_id}"
)))
.await
.unwrap();
@@ -99,18 +105,20 @@ async fn get_chapter_by_number(pool: PgPool) {
assert_eq!(body["number"], 1);
assert_eq!(body["title"], "The Brand");
assert_eq!(body["page_count"], 0);
assert_eq!(body["id"], chapter_id.to_string());
}
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_unknown_number_is_404(pool: PgPool) {
async fn get_chapter_unknown_id_is_404(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/99"
"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}"
)))
.await
.unwrap();
@@ -122,10 +130,34 @@ async fn get_chapter_unknown_number_is_404(pool: PgPool) {
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_unknown_manga_is_404(pool: PgPool) {
let h = common::harness(pool);
let unknown = Uuid::nil();
let unknown_manga = Uuid::nil();
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!("/api/v1/mangas/{unknown}/chapters/1")))
.oneshot(common::get(&format!(
"/api/v1/mangas/{unknown_manga}/chapters/{unknown_chapter}"
)))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Cross-manga isolation: a chapter id belonging to manga A must not
/// resolve when accessed via manga B's URL. The (manga_id, id) scoping
/// in `find_by_id_in_manga` enforces this.
#[sqlx::test(migrations = "./migrations")]
async fn get_chapter_from_wrong_manga_is_404(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_a = seed_manga(&h, &cookie, "Berserk").await;
let manga_b = seed_manga(&h, &cookie, "Vagabond").await;
let chapter_id = seed_chapter(&pool, manga_a, 1, Some("Episode 1")).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_b}/chapters/{chapter_id}"
)))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
@@ -136,12 +168,12 @@ async fn list_pages_empty_for_chapter_without_upload(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
seed_chapter(&pool, manga_id, 1, None).await;
let chapter_id = seed_chapter(&pool, manga_id, 1, None).await;
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/1/pages"
"/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
)))
.await
.unwrap();
@@ -155,11 +187,12 @@ async fn list_pages_returns_404_for_unknown_chapter(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = seed_manga(&h, &cookie, "Berserk").await;
let unknown_chapter = Uuid::new_v4();
let resp = h
.app
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/99/pages"
"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}/pages"
)))
.await
.unwrap();

View File

@@ -0,0 +1,462 @@
mod common;
use axum::http::StatusCode;
use serde_json::{json, Value};
use sqlx::PgPool;
use tower::ServiceExt;
use uuid::Uuid;
use common::{
body_json, delete_with_cookie, fake_jpeg_bytes, fake_png_bytes, get, harness,
post_multipart_with_cookie, put_multipart, put_multipart_with_cookie, register_user,
MultipartBuilder,
};
async fn create_manga_with_cover(
app: &axum::Router,
cookie: &str,
title: &str,
cover: Option<(&str, &[u8])>,
) -> Value {
let mut form =
MultipartBuilder::new().add_json("metadata", json!({ "title": title }));
if let Some((ct, bytes)) = cover {
form = form.add_file("cover", "cover.bin", ct, bytes);
}
let resp = app
.clone()
.oneshot(post_multipart_with_cookie("/api/v1/mangas", form, cookie))
.await
.unwrap();
assert_eq!(
resp.status(),
StatusCode::CREATED,
"seed create_manga failed: {:?}",
resp.status()
);
body_json(resp).await
}
fn id_of(body: &Value) -> Uuid {
Uuid::parse_str(body["id"].as_str().unwrap()).unwrap()
}
fn cover_form(bytes: &[u8]) -> MultipartBuilder {
MultipartBuilder::new().add_file("cover", "cover.bin", "application/octet-stream", bytes)
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_sets_path_when_none_existed(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Cover Me", None).await;
let id = id_of(&manga);
assert!(manga["cover_image_path"].is_null());
let bytes = fake_png_bytes();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&bytes),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
let expected_key = format!("mangas/{id}/cover.png");
assert_eq!(body["cover_image_path"], expected_key);
assert_eq!(body["title"], "Cover Me");
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{expected_key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::OK);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_replaces_existing_same_extension(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let original = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Replace Me",
Some(("image/png", &original)),
)
.await;
let id = id_of(&manga);
let original_key = format!("mangas/{id}/cover.png");
assert_eq!(manga["cover_image_path"], original_key);
let mut replacement = fake_png_bytes();
replacement.extend_from_slice(b"-replacement-marker");
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&replacement),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert_eq!(body["cover_image_path"], original_key);
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{original_key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::OK);
let body_bytes = http_body_util::BodyExt::collect(file_resp.into_body())
.await
.unwrap()
.to_bytes();
assert_eq!(body_bytes.as_ref(), replacement.as_slice());
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_replaces_existing_different_extension_and_deletes_old_blob(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let png = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Switch Ext",
Some(("image/png", &png)),
)
.await;
let id = id_of(&manga);
let old_key = format!("mangas/{id}/cover.png");
assert_eq!(manga["cover_image_path"], old_key);
let jpeg = fake_jpeg_bytes();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&jpeg),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
let new_key = format!("mangas/{id}/cover.jpg");
assert_eq!(body["cover_image_path"], new_key);
let new_file = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{new_key}")))
.await
.unwrap();
assert_eq!(new_file.status(), StatusCode::OK);
let old_file = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{old_key}")))
.await
.unwrap();
assert_eq!(old_file.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_unauthenticated(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Public Read", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_404_on_unknown_id(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let id = Uuid::new_v4();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_non_image_with_unsupported_media_type(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Not Image", None).await;
let id = id_of(&manga);
let pdf = b"%PDF-1.4\n%\xc4\xe5".to_vec();
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&pdf),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNSUPPORTED_MEDIA_TYPE);
let body = body_json(resp).await;
assert_eq!(body["error"]["code"], "unsupported_media_type");
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_oversized(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Too Big", None).await;
let id = id_of(&manga);
// Harness max_file_bytes is 256 KiB; 300 KiB trips the cap.
let mut bytes = fake_png_bytes();
bytes.resize(300 * 1024, 0);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&bytes),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::PAYLOAD_TOO_LARGE);
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_rejects_missing_cover_part(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Empty Form", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
MultipartBuilder::new(),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
}
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_preserves_other_metadata(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Keep My Fields",
None,
)
.await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert_eq!(body["title"], "Keep My Fields");
assert_eq!(body["status"], "ongoing");
assert_eq!(body["authors"], json!([]));
assert_eq!(body["genres"], json!([]));
assert_eq!(body["tags"], json!([]));
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_clears_path_and_removes_blob(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let png = fake_png_bytes();
let manga = create_manga_with_cover(
&h.app,
&cookie,
"Bye Cover",
Some(("image/png", &png)),
)
.await;
let id = id_of(&manga);
let key = format!("mangas/{id}/cover.png");
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert!(body["cover_image_path"].is_null());
assert_eq!(body["title"], "Bye Cover");
let file_resp = h
.app
.clone()
.oneshot(get(&format!("/api/v1/files/{key}")))
.await
.unwrap();
assert_eq!(file_resp.status(), StatusCode::NOT_FOUND);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_is_idempotent_when_no_cover_present(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Never Had One", None).await;
let id = id_of(&manga);
for _ in 0..2 {
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = body_json(resp).await;
assert!(body["cover_image_path"].is_null());
}
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_rejects_unauthenticated(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(&h.app, &cookie, "Locked", None).await;
let id = id_of(&manga);
let resp = h
.app
.clone()
.oneshot(
axum::http::Request::builder()
.method("DELETE")
.uri(format!("/api/v1/mangas/{id}/cover"))
.body(axum::body::Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_404_on_unknown_id(pool: PgPool) {
let h = harness(pool);
let (_, cookie) = register_user(&h.app).await;
let id = Uuid::new_v4();
let resp = h
.app
.clone()
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
}
/// Authz: PUT /mangas/:id/cover must be uploader-only.
#[sqlx::test(migrations = "./migrations")]
async fn put_cover_forbidden_for_non_uploader(pool: PgPool) {
let h = harness(pool);
let (_, owner_cookie) = register_user(&h.app).await;
let (_, intruder_cookie) = register_user(&h.app).await;
let manga =
create_manga_with_cover(&h.app, &owner_cookie, "Mine", None).await;
let id = id_of(&manga);
let resp = h
.app
.oneshot(put_multipart_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
cover_form(&fake_png_bytes()),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
/// Authz: DELETE /mangas/:id/cover must be uploader-only.
#[sqlx::test(migrations = "./migrations")]
async fn delete_cover_forbidden_for_non_uploader(pool: PgPool) {
let h = harness(pool);
let (_, owner_cookie) = register_user(&h.app).await;
let (_, intruder_cookie) = register_user(&h.app).await;
let manga = create_manga_with_cover(
&h.app,
&owner_cookie,
"Mine",
Some(("image/jpeg", &fake_jpeg_bytes())),
)
.await;
let id = id_of(&manga);
let resp = h
.app
.oneshot(delete_with_cookie(
&format!("/api/v1/mangas/{id}/cover"),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}

View File

@@ -566,3 +566,78 @@ async fn patch_requires_authentication(pool: PgPool) {
.unwrap();
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED);
}
/// A signed-in user who didn't upload the manga must not be able to
/// PATCH it. Without the uploader-gate this returned 200 — see
/// REVIEW.md "manga PATCH / cover endpoints don't check ownership".
#[sqlx::test(migrations = "./migrations")]
async fn patch_forbidden_for_non_uploader(pool: PgPool) {
let h = common::harness(pool);
let (_, owner_cookie) = common::register_user(&h.app).await;
let (_, intruder_cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &owner_cookie, json!({ "title": "Mine" })).await;
let id = id_of(&created);
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&intruder_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
}
/// Owner can still edit their own manga (regression guard for the
/// authz fix).
#[sqlx::test(migrations = "./migrations")]
async fn patch_allowed_for_uploader(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &cookie, json!({ "title": "Owned" })).await;
let id = id_of(&created);
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}
/// Legacy rows with `uploaded_by IS NULL` (created before migration
/// 0011) remain editable by any signed-in user. Without this carve-out
/// the historical-data note in 0011 would be broken.
#[sqlx::test(migrations = "./migrations")]
async fn patch_allowed_on_legacy_null_uploader(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let created = create_manga(&h.app, &cookie, json!({ "title": "Legacy" })).await;
let id = id_of(&created);
// Simulate a row uploaded before the column existed: clear
// uploaded_by directly via SQL.
sqlx::query("UPDATE mangas SET uploaded_by = NULL WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let (_, other_cookie) = common::register_user(&h.app).await;
let resp = h
.app
.oneshot(common::patch_json_with_cookie(
&format!("/api/v1/mangas/{id}"),
json!({ "status": "completed" }),
&other_cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
}

View File

@@ -59,6 +59,31 @@ async fn reattach_same_tag_is_idempotent_and_returns_200(pool: PgPool) {
assert_eq!(second.status(), StatusCode::OK);
}
/// Tag names over 64 chars are rejected at the handler boundary. The
/// repo enforces the same cap, but doing it at the handler keeps the
/// envelope consistent with the other validation paths
/// (username, collection name, etc.).
#[sqlx::test(migrations = "./migrations")]
async fn attach_rejects_tag_name_over_64_chars(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let long_name: String = "x".repeat(65);
let resp = h
.app
.oneshot(common::post_json_with_cookie(
&format!("/api/v1/mangas/{manga_id}/tags"),
json!({ "name": long_name }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::UNPROCESSABLE_ENTITY);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "validation_failed");
}
#[sqlx::test(migrations = "./migrations")]
async fn tag_names_dedup_case_insensitively(pool: PgPool) {
let h = common::harness(pool);

View File

@@ -139,13 +139,17 @@ async fn files_endpoint_streams_in_multiple_frames(pool: PgPool) {
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let chapter_id = common::body_json(resp).await["id"]
.as_str()
.unwrap()
.to_string();
// Fetch the page back via the streaming files endpoint.
let pages = h
.app
.clone()
.oneshot(common::get(&format!(
"/api/v1/mangas/{manga_id}/chapters/1/pages"
"/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
)))
.await
.unwrap();
@@ -317,8 +321,12 @@ async fn create_chapter_rejects_renamed_non_image_page(pool: PgPool) {
assert_eq!(body["error"]["code"], "unsupported_media_type");
}
/// Multiple chapters can share the same number — different
/// scanlations, re-uploads, translator notes. As of migration 0013,
/// (manga_id, number) is not unique and each upload gets its own
/// chapter id.
#[sqlx::test(migrations = "./migrations")]
async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
async fn create_chapter_allows_duplicate_numbers_as_separate_chapters(pool: PgPool) {
let h = common::harness(pool);
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
@@ -334,10 +342,27 @@ async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
};
let first = h.app.clone().oneshot(make()).await.unwrap();
assert_eq!(first.status(), StatusCode::CREATED);
let second = h.app.oneshot(make()).await.unwrap();
assert_eq!(second.status(), StatusCode::CONFLICT);
let body = common::body_json(second).await;
assert_eq!(body["error"]["code"], "conflict");
let first_id = common::body_json(first).await["id"].as_str().unwrap().to_string();
let second = h.app.clone().oneshot(make()).await.unwrap();
assert_eq!(second.status(), StatusCode::CREATED);
let second_id = common::body_json(second).await["id"].as_str().unwrap().to_string();
assert_ne!(first_id, second_id, "each upload gets a distinct chapter id");
// List endpoint surfaces both rows.
let resp = h
.app
.oneshot(common::get(&format!("/api/v1/mangas/{manga_id}/chapters")))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::OK);
let body = common::body_json(resp).await;
let items = body["items"].as_array().unwrap();
assert_eq!(items.len(), 2, "both Ch.1 uploads listed separately");
for item in items {
assert_eq!(item["number"], 1);
}
}
#[sqlx::test(migrations = "./migrations")]

View File

@@ -15,6 +15,7 @@ use tempfile::TempDir;
use tower::ServiceExt;
use mangalord::app::{router, AppState};
use mangalord::auth::rate_limit::AuthRateLimiter;
use mangalord::config::{AuthConfig, UploadConfig};
use mangalord::storage::{LocalStorage, Storage, StorageError, StreamingFile};
@@ -49,20 +50,65 @@ fn harness_inner(
storage: Arc<dyn Storage>,
storage_dir: TempDir,
) -> Harness {
harness_with_auth_config(pool, storage, storage_dir, AuthConfig {
cookie_secure: false,
..AuthConfig::default()
})
}
fn harness_with_auth_config(
pool: PgPool,
storage: Arc<dyn Storage>,
storage_dir: TempDir,
auth: AuthConfig,
) -> Harness {
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState {
db: pool,
storage,
auth: AuthConfig { cookie_secure: false, ..AuthConfig::default() },
auth,
upload: UploadConfig {
// Keep file caps small in tests so the size-cap path is cheap to
// exercise without producing tens of MBs of bytes.
max_request_bytes: 4 * 1024 * 1024,
max_file_bytes: 256 * 1024,
},
auth_limiter,
};
Harness { app: router(state), _storage_dir: storage_dir }
}
/// Like [`harness`] but flips `ALLOW_SELF_REGISTER` off so the
/// register-disabled test exercises the 403 branch in
/// `api::auth::register`.
pub fn harness_with_self_register_disabled(pool: PgPool) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
allow_self_register: false,
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Like [`harness`] but configures a tight auth rate limit. Used by
/// the brute-force-rate-limiting test.
pub fn harness_with_auth_rate_limit(
pool: PgPool,
per_sec: u32,
burst: u32,
) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
rate_limit: mangalord::auth::rate_limit::RateLimitConfig { per_sec, burst },
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Wraps a real `Storage` and fails on the N-th `put` call so tests can
/// assert that handlers roll their DB writes back when storage errors
/// mid-upload. Reads and other operations delegate to `inner`.
@@ -336,6 +382,37 @@ pub fn post_multipart_with_cookie(
.unwrap()
}
pub fn put_multipart_with_cookie(
uri: &str,
builder: MultipartBuilder,
cookie: &str,
) -> Request<Body> {
let (boundary, body) = builder.finalize();
Request::builder()
.method("PUT")
.uri(uri)
.header(
header::CONTENT_TYPE,
format!("multipart/form-data; boundary={boundary}"),
)
.header(header::COOKIE, cookie)
.body(Body::from(body))
.unwrap()
}
pub fn put_multipart(uri: &str, builder: MultipartBuilder) -> Request<Body> {
let (boundary, body) = builder.finalize();
Request::builder()
.method("PUT")
.uri(uri)
.header(
header::CONTENT_TYPE,
format!("multipart/form-data; boundary={boundary}"),
)
.body(Body::from(body))
.unwrap()
}
/// Realistic PNG file header bytes — enough for `infer` to identify.
pub fn fake_png_bytes() -> Vec<u8> {
vec![0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0, 0, 0, 0]

View File

@@ -0,0 +1,162 @@
//! Smoke test for the Chromium launcher.
//!
//! Marked `#[ignore]` because it (a) downloads ~150 MB of Chromium on
//! first run via the `fetcher` feature and (b) requires a real `$DISPLAY`
//! for the headed path. Run it explicitly:
//!
//! ```sh
//! cargo test --test crawler_browser_smoke -- --ignored --nocapture
//! ```
//!
//! Override the cache location with `CRAWLER_CHROMIUM_DIR=/some/path` if
//! `$HOME/.cache/mangalord/chromium` isn't writable.
//!
//! Set `CRAWLER_CHROMIUM_BINARY=/usr/bin/chromium-headless-shell` (or
//! another system chromium path) to exercise the system-chromium
//! launch path instead of the fetcher download — this is the path the
//! Raspberry Pi deployment takes.
use mangalord::crawler::browser::{self, LaunchOptions};
#[tokio::test]
#[ignore = "downloads Chromium and needs a display; run with --ignored"]
async fn headed_browser_can_navigate_and_read_title() {
// A data URL avoids any network dependency — we're testing the
// browser launcher, not connectivity.
const PAGE: &str = "data:text/html,<html><head><title>Mangalord%20Smoke</title></head><body>OK</body></html>";
let handle = browser::launch(LaunchOptions::headed())
.await
.expect("launch headed chromium");
let page = handle
.browser()
.new_page(PAGE)
.await
.expect("open new page");
page.wait_for_navigation()
.await
.expect("wait for navigation");
let title = page.get_title().await.expect("get title");
assert_eq!(title.as_deref(), Some("Mangalord Smoke"));
handle.close().await.expect("close cleanly");
}
#[tokio::test]
#[ignore = "downloads Chromium; run with --ignored"]
async fn headless_browser_can_navigate_and_read_title() {
const PAGE: &str = "data:text/html,<html><head><title>Headless%20OK</title></head><body></body></html>";
let handle = browser::launch(LaunchOptions::headless())
.await
.expect("launch headless chromium");
let page = handle.browser().new_page(PAGE).await.expect("open new page");
page.wait_for_navigation().await.expect("wait for navigation");
let title = page.get_title().await.expect("get title");
assert_eq!(title.as_deref(), Some("Headless OK"));
handle.close().await.expect("close cleanly");
}
/// Live end-to-end: navigate to a real page, get the rendered HTML, and
/// parse it with `scraper`. ipify.org renders the visitor's public IP
/// into the page DOM, so a successful run proves browser → render →
/// `Html::parse_document` → selector → text extraction all work
/// against a real site. This is the same path each future `Source`
/// impl will take.
#[tokio::test]
#[ignore = "needs network; run with --ignored"]
async fn fetches_public_ip_from_ipify() {
use std::time::Duration;
let handle = browser::launch(LaunchOptions::headless())
.await
.expect("launch headless chromium");
let page = handle
.browser()
.new_page("https://www.ipify.org")
.await
.expect("open ipify");
page.wait_for_navigation().await.expect("wait for navigation");
// ipify injects the IP via JS after load, so the navigation event
// alone isn't enough — give the script a beat to run.
tokio::time::sleep(Duration::from_secs(2)).await;
let html = page.content().await.expect("get rendered html");
let doc = scraper::Html::parse_document(&html);
let body_sel = scraper::Selector::parse("body").unwrap();
let body_text: String = doc
.select(&body_sel)
.next()
.map(|n| n.text().collect::<Vec<_>>().join(" "))
.unwrap_or_default();
let ip = extract_ipv4(&body_text)
.unwrap_or_else(|| panic!("no IPv4 found in ipify body: {body_text}"));
eprintln!("ipify says our public IP is: {ip}");
handle.close().await.expect("close cleanly");
}
/// Proves that `LaunchOptions::extra_args` actually reach Chromium and
/// influence its runtime. `--user-agent=...` overrides `navigator.userAgent`,
/// observable from JS — read it back via `page.evaluate`.
#[tokio::test]
#[ignore = "downloads Chromium; run with --ignored"]
async fn extra_args_reach_chromium() {
const UA: &str = "MangalordCrawlerTest/1.0";
let options = LaunchOptions {
mode: browser::BrowserMode::Headless,
extra_args: vec![format!("--user-agent={UA}")],
};
let handle = browser::launch(options).await.expect("launch with extra args");
let page = handle
.browser()
.new_page("about:blank")
.await
.expect("open page");
page.wait_for_navigation().await.expect("wait");
let ua: String = page
.evaluate("navigator.userAgent")
.await
.expect("evaluate navigator.userAgent")
.into_value()
.expect("string value");
assert_eq!(
ua, UA,
"extra --user-agent flag should override navigator.userAgent"
);
handle.close().await.expect("close cleanly");
}
/// Tiny dotted-quad finder — avoids pulling `regex` in just for one
/// test. Scans the first valid IPv4 substring (four 0..=255 octets
/// separated by dots).
fn extract_ipv4(s: &str) -> Option<String> {
let bytes = s.as_bytes();
let mut i = 0;
while i < bytes.len() {
if !bytes[i].is_ascii_digit() {
i += 1;
continue;
}
let start = i;
while i < bytes.len() && (bytes[i].is_ascii_digit() || bytes[i] == b'.') {
i += 1;
}
let candidate = &s[start..i];
let parts: Vec<&str> = candidate.split('.').collect();
if parts.len() == 4 && parts.iter().all(|p| p.parse::<u8>().is_ok()) {
return Some(candidate.to_string());
}
}
None
}

View File

@@ -0,0 +1,519 @@
//! Integration tests for the crawler daemon's cron + worker pool. The
//! daemon's full real path requires Chromium and a live source; here we
//! test the seam (MetadataPass / ChapterDispatcher traits) and the
//! cron/worker control-flow.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use mangalord::crawler::content::SyncOutcome;
use mangalord::crawler::daemon::{
self, test_support::CountingMetadataPass, ChapterDispatcher, DaemonConfig, MetadataPass,
CRON_LOCK_KEY,
};
use mangalord::crawler::jobs::{self, JobPayload};
use mangalord::crawler::pipeline;
use serde_json::json;
use sqlx::PgPool;
use tokio_util::sync::CancellationToken;
use uuid::Uuid;
fn far_future_daily_at() -> NaiveTime {
// Some time hours from "now" so the scheduler sleeps for the whole test.
NaiveTime::from_hms_opt(23, 59, 0).unwrap()
}
fn make_cfg(
metadata_pass: Option<Arc<dyn MetadataPass>>,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<std::sync::atomic::AtomicBool>,
workers: usize,
) -> DaemonConfig {
DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers: workers,
daily_at: far_future_daily_at(),
tz: Tz::UTC,
retention_days: 7,
session_expired,
extra_tasks: Vec::new(),
}
}
async fn enqueue_chapter_job(pool: &PgPool) -> Uuid {
let chapter_id = Uuid::new_v4();
let payload = JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
};
let res = jobs::enqueue(pool, &payload).await.unwrap();
match res {
jobs::EnqueueResult::Inserted(_) => chapter_id,
jobs::EnqueueResult::Skipped => unreachable!("fresh chapter_id"),
}
}
async fn count_state(pool: &PgPool, state: &str) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs WHERE state = $1")
.bind(state)
.fetch_one(pool)
.await
.unwrap()
}
struct AlwaysDoneDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for AlwaysDoneDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
Ok(SyncOutcome::Fetched { pages: 1 })
}
}
struct PanickingDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for PanickingDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
panic!("intentional dispatcher panic");
}
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_drain_jobs_through_dispatcher(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 2),
);
// Wait for the workers to drain all three jobs.
let dispatcher_seen = || dispatcher.seen.load(Ordering::Acquire);
for _ in 0..40 {
if dispatcher_seen() >= 3 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher_seen() >= 3,
"expected at least 3 dispatches, got {}",
dispatcher_seen()
);
handle.shutdown().await;
assert_eq!(count_state(&pool, "done").await, 3);
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_idle_while_session_expired(pool: PgPool) {
let id = enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(true));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), Arc::clone(&session_expired), 1),
);
// Wait long enough that a non-idled worker would have leased and ack'd.
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
dispatcher.seen.load(Ordering::Acquire),
0,
"dispatcher must not be invoked while session_expired flag is set"
);
assert_eq!(count_state(&pool, "pending").await, 1);
let _ = id;
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatcher_panic_is_contained_and_job_is_acked_failed(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(PanickingDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 1),
);
// Wait for the worker to handle both panicking jobs.
for _ in 0..40 {
if dispatcher.seen.load(Ordering::Acquire) >= 2 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher.seen.load(Ordering::Acquire) >= 2,
"worker must keep going after a panic — handled at least 2 jobs"
);
handle.shutdown().await;
// attempts=1 below max=5, so the panicking jobs go back to pending with
// backoff and `last_error = "worker panicked"`.
let last_errors: Vec<String> = sqlx::query_scalar(
"SELECT last_error FROM crawler_jobs WHERE last_error IS NOT NULL",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(last_errors.len(), 2);
assert!(last_errors.iter().all(|e| e == "worker panicked"));
}
#[sqlx::test(migrations = "./migrations")]
async fn cron_skips_tick_when_advisory_lock_held(pool: PgPool) {
// With no last_metadata_tick_at row, the daemon does a catch-up tick
// immediately on spawn. We hold the advisory lock on a separate
// connection beforehand so the catch-up's pg_try_advisory_lock returns
// false and the tick must skip without invoking the metadata pass.
let mut lock_conn = pool.acquire().await.unwrap();
sqlx::query("SELECT pg_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
// daily_at far in the future so after the (skipped) catch-up the
// cron sleeps for the rest of the test rather than racing for the lock.
let cfg = make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
);
let handle = daemon::spawn(pool.clone(), cancel.clone(), cfg);
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
counter.count.load(Ordering::Acquire),
0,
"cron must skip the catch-up tick while the advisory lock is held"
);
sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
drop(lock_conn);
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn cron_catches_up_when_last_tick_is_stale(pool: PgPool) {
// Pre-seed last_metadata_tick_at well in the past so previous_fire(now)
// > last_tick is trivially true and the daemon catches up immediately.
sqlx::query(
"INSERT INTO crawler_state (key, value) VALUES ($1, $2)
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
)
.bind("last_metadata_tick_at")
.bind(json!({"at": "2020-01-01T00:00:00Z"}))
.execute(&pool)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
),
);
for _ in 0..40 {
if counter.count.load(Ordering::Acquire) >= 1 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
counter.count.load(Ordering::Acquire) >= 1,
"catch-up tick should have fired immediately"
);
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_skips_dropped_sources(pool: PgPool) {
// Setup: one manga with two chapters (page_count = 0). One has a
// non-dropped source; the other's source is dropped. A user bookmarks
// the manga. Expectation: only the non-dropped chapter is enqueued.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid = sqlx::query_scalar(
"INSERT INTO mangas (title) VALUES ($1) RETURNING id",
)
.bind("Berserk")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let c1: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
let c2: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 2, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
// c1: alive source. c2: dropped source.
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(c1)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, now())",
)
.bind("target")
.bind("ch2")
.bind(c2)
.bind("https://example.com/ch2")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 1, "only the non-dropped chapter enqueued");
assert_eq!(summary.skipped, 0);
let payloads: Vec<serde_json::Value> = sqlx::query_scalar(
"SELECT payload FROM crawler_jobs WHERE payload->>'kind' = 'sync_chapter_content'",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(payloads.len(), 1);
assert_eq!(
payloads[0]["chapter_id"].as_str().unwrap(),
c1.to_string()
);
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_skips_recently_dead_chapters(pool: PgPool) {
// Setup: a chapter whose last SyncChapterContent job died yesterday.
// The cron tick must not re-enqueue — without the quarantine, the
// chapter would spin: re-enqueue → max_attempts retries → dies again
// → re-enqueue next tick → forever.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid =
sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let chapter_id: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(chapter_id)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
// The dead job from the prior tick, updated 1 day ago (well inside the
// 7-day quarantine window).
sqlx::query(
"INSERT INTO crawler_jobs (payload, state, updated_at) \
VALUES ($1::jsonb, 'dead', now() - interval '1 day')",
)
.bind(serde_json::json!({
"kind": "sync_chapter_content",
"source_id": "target",
"chapter_id": chapter_id.to_string(),
"source_chapter_key": "ch1",
}))
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 0, "recently dead chapter is quarantined");
assert_eq!(summary.skipped, 0);
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_resumes_after_quarantine_expires(pool: PgPool) {
// Same setup as above but the dead job is 10 days old — past the
// 7-day quarantine. The chapter should be re-enqueued so a once-failed
// chapter eventually gets a second shot at success.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid =
sqlx::query_scalar("INSERT INTO mangas (title) VALUES ($1) RETURNING id")
.bind("Test")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING",
)
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let chapter_id: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(chapter_id)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO crawler_jobs (payload, state, updated_at) \
VALUES ($1::jsonb, 'dead', now() - interval '10 days')",
)
.bind(serde_json::json!({
"kind": "sync_chapter_content",
"source_id": "target",
"chapter_id": chapter_id.to_string(),
"source_chapter_key": "ch1",
}))
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(
summary.inserted, 1,
"dead chapter is re-enqueued after quarantine expires"
);
}

View File

@@ -0,0 +1,552 @@
//! Integration tests for `crawler::jobs` queue operations.
//!
//! Uses `#[sqlx::test(migrations = "./migrations")]` which provisions a fresh
//! migrated DB per test. No browser, no axum router — these exercise the SQL
//! shape and dedup-index semantics directly against Postgres.
use std::time::Duration;
use mangalord::crawler::jobs::{
self, EnqueueResult, JobPayload, KIND_SYNC_CHAPTER_CONTENT,
};
use sqlx::PgPool;
use uuid::Uuid;
fn chapter_content_payload(chapter_id: Uuid) -> JobPayload {
JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
}
}
/// A non-`SyncChapterContent` payload, used to assert that only the
/// chapter-content kind is deduplicated by the partial index and that
/// `lease`'s kind filter correctly excludes other kinds.
fn sync_manga_payload(key: &str) -> JobPayload {
JobPayload::SyncManga {
source_id: "target".into(),
source_manga_key: key.into(),
}
}
async fn job_state(pool: &PgPool, id: Uuid) -> String {
sqlx::query_scalar::<_, String>("SELECT state FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_attempts(pool: &PgPool, id: Uuid) -> i32 {
sqlx::query_scalar::<_, i32>("SELECT attempts FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_count(pool: &PgPool) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs")
.fetch_one(pool)
.await
.unwrap()
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_inserts_pending_row_with_round_trip_payload(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let payload = chapter_content_payload(chapter_id);
let result = jobs::enqueue(&pool, &payload).await.unwrap();
let id = match result {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("expected Inserted on first enqueue"),
};
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let raw_payload: serde_json::Value =
sqlx::query_scalar("SELECT payload FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let decoded: JobPayload = serde_json::from_value(raw_payload).unwrap();
match decoded {
JobPayload::SyncChapterContent {
source_id,
chapter_id: c,
source_chapter_key,
} => {
assert_eq!(source_id, "target");
assert_eq!(c, chapter_id);
assert_eq!(source_chapter_key, format!("ch-{chapter_id}"));
}
_ => panic!("payload variant mismatch"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_chapter_content_while_pending_is_skipped(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(first, EnqueueResult::Inserted(_)));
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(second, EnqueueResult::Skipped));
assert_eq!(job_count(&pool).await, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_after_done_releases_dedup_slot(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first_id = match jobs::enqueue(&pool, &p).await.unwrap() {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("first enqueue should insert"),
};
// Move the first job out of (pending|running) so the partial index drops it.
sqlx::query("UPDATE crawler_jobs SET state = 'done' WHERE id = $1")
.bind(first_id)
.execute(&pool)
.await
.unwrap();
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(
matches!(second, EnqueueResult::Inserted(_)),
"after done the chapter_id slot is free again"
);
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn different_chapter_ids_can_coexist(pool: PgPool) {
let p1 = chapter_content_payload(Uuid::new_v4());
let p2 = chapter_content_payload(Uuid::new_v4());
assert!(matches!(
jobs::enqueue(&pool, &p1).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p2).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn non_chapter_content_payloads_are_never_deduped(pool: PgPool) {
let p = sync_manga_payload("foo");
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_marks_running_and_bumps_attempts_and_sets_leased_until(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => unreachable!(),
};
let leases = jobs::lease(&pool, None, 10, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases.len(), 1);
let lease = &leases[0];
assert_eq!(lease.id, id);
assert_eq!(lease.attempts, 1);
assert_eq!(job_state(&pool, id).await, "running");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let leased_until = leased_until.expect("leased_until set");
assert!(leased_until > chrono::Utc::now());
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_with_kind_filter_only_matches_that_kind(pool: PgPool) {
let manga_id = match jobs::enqueue(&pool, &sync_manga_payload("foo"))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let chapter_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(
&pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
10,
Duration::from_secs(60),
)
.await
.unwrap();
assert_eq!(leases.len(), 1, "only chapter content payload leases");
assert_eq!(leases[0].id, chapter_id);
// sync_manga is still pending
assert_eq!(job_state(&pool, manga_id).await, "pending");
}
#[sqlx::test(migrations = "./migrations")]
async fn concurrent_leases_under_skip_locked_return_disjoint_ids(pool: PgPool) {
// 4 pending jobs, two concurrent calls each asking for up to 2.
let mut ids = Vec::new();
for _ in 0..4 {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
ids.push(id);
}
let (a, b) = tokio::join!(
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
);
let a = a.unwrap();
let b = b.unwrap();
let mut seen: Vec<Uuid> = a.iter().chain(b.iter()).map(|l| l.id).collect();
seen.sort();
seen.dedup();
let count = a.len() + b.len();
assert_eq!(
seen.len(),
count,
"no id appears in both lease results (SKIP LOCKED)"
);
assert!(count >= 2, "at least one lease saw work");
assert!(count <= 4);
}
#[sqlx::test(migrations = "./migrations")]
async fn stale_running_lease_can_be_reclaimed(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let first = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(first.len(), 1);
// Pretend the worker crashed: rewind leased_until into the past.
sqlx::query("UPDATE crawler_jobs SET leased_until = now() - interval '1 minute' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let second = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(second.len(), 1, "stale running row was re-leased");
assert_eq!(second[0].id, id);
assert_eq!(second[0].attempts, 2, "attempts bumped again");
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_done_transitions_state_and_clears_lease(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_under_max_returns_to_pending_with_future_schedule(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "boom", lease.attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
let (scheduled_at, last_error): (chrono::DateTime<chrono::Utc>, Option<String>) =
sqlx::query_as("SELECT scheduled_at, last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(scheduled_at > chrono::Utc::now());
assert_eq!(last_error.as_deref(), Some("boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_at_max_marks_dead(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Force a single lease then mark "this was attempt N where N == max_attempts".
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "final boom", lease.max_attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "dead");
let last_error: Option<String> =
sqlx::query_scalar("SELECT last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(last_error.as_deref(), Some("final boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_done_no_ops_when_lease_was_stolen(pool: PgPool) {
// Worker A's lease expires, worker B re-leases the job (state stays
// 'running' but attempts++ and leased_until refreshed). A late
// ack_done from worker A must not clobber B's progress.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Worker A grabs the lease, but its lease expires immediately.
let _a_leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
sqlx::query("UPDATE crawler_jobs SET leased_until = now() - interval '1 minute' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
// Worker B re-leases the expired-but-still-running job.
let b_leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(b_leases.len(), 1);
assert_eq!(b_leases[0].attempts, 2, "re-lease bumps attempts");
// Worker A's late ack_done — guarded by `state = 'running'` + lease_id
// but in the simplest implementation the guard is state-only. Either
// way, the job stays 'running' with worker B's progress intact.
jobs::ack_done(&pool, id).await.unwrap();
// Worker B is still working; until B acks, the job remains 'running'
// with its leased_until in the future and attempts == 2.
// (We can't make ack_done's lease_id distinguish A from B today —
// both share the same `id` — so the strongest current guarantee is
// that a late ack_done doesn't fire when state is already 'done',
// exercised below.)
// Finalize: worker B acks done.
jobs::ack_done(&pool, b_leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
assert_eq!(job_attempts(&pool, id).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_no_ops_when_state_is_not_running(pool: PgPool) {
// After a job transitions to 'done', a stale ack_failed (e.g. a
// worker that finished work and queued its ack but then handed off
// before the SQL ran) must not flip the state back to 'pending' or
// 'dead'. The `state = 'running'` predicate enforces this.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
// Late ack_failed arrives. Must be a no-op.
jobs::ack_failed(&pool, leases[0].id, "late", 1, 5)
.await
.unwrap();
assert_eq!(
job_state(&pool, id).await,
"done",
"late ack_failed must not resurrect a done job"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn release_no_ops_when_state_is_not_running(pool: PgPool) {
// Mirror of ack_failed_no_ops_when_state_is_not_running. release also
// decrements `attempts`, which would corrupt a re-leased job's
// attempt count if the guard were missing.
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
let attempts_before = job_attempts(&pool, id).await;
// Late release arrives.
jobs::release(&pool, leases[0].id).await.unwrap();
assert_eq!(
job_state(&pool, id).await,
"done",
"late release must not flip a done job back to pending"
);
assert_eq!(
job_attempts(&pool, id).await,
attempts_before,
"late release must not decrement attempts of a non-running job"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn release_returns_to_pending_and_undoes_attempt_increment(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases[0].attempts, 1);
jobs::release(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_deletes_old_rows_keeps_fresh(pool: PgPool) {
// Two done rows: one old (updated_at 10 days ago), one fresh.
let old_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let fresh_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '10 days' WHERE id = $1")
.bind(old_id)
.execute(&pool)
.await
.unwrap();
sqlx::query("UPDATE crawler_jobs SET state='done' WHERE id = $1")
.bind(fresh_id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 7).await.unwrap();
assert_eq!(deleted, 1);
let remaining: Vec<Uuid> = sqlx::query_scalar("SELECT id FROM crawler_jobs ORDER BY id")
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(remaining, vec![fresh_id], "only fresh row remains");
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_zero_is_a_no_op(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '999 days' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 0).await.unwrap();
assert_eq!(deleted, 0);
assert_eq!(job_count(&pool).await, 1);
}

View File

@@ -0,0 +1,82 @@
//! Integration tests for the per-source recovery flag:
//! `mark_run_started` / `mark_run_completed` / `last_run_completed_cleanly`
//! round-trip via the `crawler_state` table.
//!
//! End-to-end pipeline behavior (a crashed run forcing a recovery sweep
//! on the next tick) requires a real `chromiumoxide::Browser` to drive
//! the walker, so that path is covered by `crawler_browser_smoke.rs`.
//! The pure stop-condition logic itself is unit-tested in
//! `crawler::pipeline::tests`.
use mangalord::repo::crawler;
use sqlx::PgPool;
#[sqlx::test(migrations = "./migrations")]
async fn defaults_to_clean_when_no_marker(pool: PgPool) {
// First-ever run semantics: absence of the key must NOT trigger a
// recovery walk on a virgin DB. Treat missing as "previous run
// completed cleanly" so the first tick can take the early-stop path.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(clean, "absent marker must read as clean");
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_run_started_flips_to_false(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(!clean, "after mark_run_started, flag must read false");
}
#[sqlx::test(migrations = "./migrations")]
async fn started_then_completed_round_trips_to_clean(pool: PgPool) {
// Steady-state: a run starts (flag → false) and exits cleanly
// (flag → true). The next tick should see "clean" and apply the
// normal stop condition.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
crawler::mark_run_completed(&pool, "target").await.unwrap();
let clean = crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap();
assert!(
clean,
"after start → complete the flag must round-trip to clean"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn flag_is_per_source(pool: PgPool) {
// Two sources, only one is mid-run. The other must still report
// clean — the crawler_state key is namespaced by source_id.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::ensure_source(&pool, "other", "O", "https://y.example")
.await
.unwrap();
crawler::mark_run_started(&pool, "target").await.unwrap();
assert!(
!crawler::last_run_completed_cleanly(&pool, "target")
.await
.unwrap(),
"target is mid-run"
);
assert!(
crawler::last_run_completed_cleanly(&pool, "other")
.await
.unwrap(),
"other source is untouched and reads clean"
);
}

View File

@@ -0,0 +1,862 @@
//! Integration tests for `repo::crawler`.
//!
//! Each test runs against a fresh, migrated DB via `#[sqlx::test]`.
//! `DATABASE_URL` must point to a Postgres where the test user can
//! `CREATEDB`.
use mangalord::crawler::source::{SourceChapterRef, SourceManga};
use mangalord::repo::crawler::{self, ChapterDiff, UpsertStatus};
use sqlx::PgPool;
use uuid::Uuid;
/// Helper to spin up a `SourceManga` fixture with a stable shape so
/// each test can tweak just the fields it cares about.
fn sample_manga(key: &str, title: &str, hash: &str) -> SourceManga {
SourceManga {
source_manga_key: key.to_string(),
title: title.to_string(),
alternative_titles: vec!["Alt 1".into()],
authors: vec!["Author One".into()],
// Action is in the seeded `genres` table; Fantasy is too.
genres: vec!["Action".into(), "Fantasy".into()],
tags: vec!["popular".into()],
status: Some("ongoing".into()),
summary: Some("Sample summary.".into()),
cover_url: Some("/cover.jpg".into()),
chapters: vec![],
metadata_hash: hash.to_string(),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn ensure_source_is_idempotent(pool: PgPool) {
crawler::ensure_source(&pool, "target", "Target Site", "https://x.example")
.await
.unwrap();
crawler::ensure_source(&pool, "target", "Target Site v2", "https://x.example")
.await
.unwrap();
let count: (i64,) = sqlx::query_as("SELECT COUNT(*) FROM sources WHERE id = 'target'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count.0, 1);
let name: (String,) = sqlx::query_as("SELECT name FROM sources WHERE id = 'target'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(name.0, "Target Site v2", "name updates on re-call");
}
#[sqlx::test(migrations = "./migrations")]
async fn first_upsert_inserts_manga_and_links_metadata(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let res = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(res.status, UpsertStatus::New);
// mangas row created
let row: (String, String, Vec<String>) =
sqlx::query_as("SELECT title, status, alt_titles FROM mangas WHERE id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(row.0, "Foo Manga");
assert_eq!(row.1, "ongoing");
assert_eq!(row.2, vec!["Alt 1"]);
// manga_sources row links the two
let link: (String, Uuid, Option<String>) = sqlx::query_as(
"SELECT source_id, manga_id, metadata_hash FROM manga_sources WHERE source_manga_key = $1",
)
.bind("foo")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(link.0, "target");
assert_eq!(link.1, res.manga_id);
assert_eq!(link.2.as_deref(), Some("hash-1"));
// Authors, genres, tags M2M populated
let n_authors: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_authors WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_authors.0, 1);
let n_genres: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_genres WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_genres.0, 2, "Action + Fantasy");
let n_tags: (i64,) = sqlx::query_as("SELECT COUNT(*) FROM manga_tags WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_tags.0, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn second_upsert_with_same_hash_reports_unchanged(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Unchanged);
assert_eq!(second.manga_id, first.manga_id);
}
#[sqlx::test(migrations = "./migrations")]
async fn upsert_with_changed_hash_updates_fields(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let mut m = sample_manga("foo", "Foo Manga", "hash-1");
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
m.title = "Foo Manga (Revised)".into();
m.status = Some("completed".into());
m.metadata_hash = "hash-2".into();
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Updated);
assert_eq!(second.manga_id, first.manga_id);
let row: (String, String) =
sqlx::query_as("SELECT title, status FROM mangas WHERE id = $1")
.bind(first.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(row.0, "Foo Manga (Revised)");
assert_eq!(row.1, "completed");
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_adds_new_refreshes_existing_and_drops_vanished(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let initial = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "2".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/2".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &initial)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 2,
refreshed: 0,
dropped: 0
}
);
// Second run: keep ch1, replace ch2 with ch3 — ch2 should be dropped.
let second = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1 (renamed)".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "3".into(),
number: 3,
title: Some("Ch.3".into()),
url: "https://x.example/foo/3".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &second)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 1,
refreshed: 1,
dropped: 1
}
);
// Renamed title propagated to chapters.title
let title: (Option<String>,) =
sqlx::query_as("SELECT c.title FROM chapters c JOIN chapter_sources cs ON cs.chapter_id = c.id WHERE cs.source_chapter_key = '1'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(title.0.as_deref(), Some("Ch.1 (renamed)"));
// Vanished chapter is soft-dropped (row still exists, dropped_at set).
let dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM chapter_sources WHERE source_chapter_key = '2'")
.fetch_one(&pool)
.await
.unwrap();
assert!(dropped.0.is_some(), "ch2 should be soft-dropped");
}
#[sqlx::test(migrations = "./migrations")]
async fn live_chapter_count_returns_zero_for_unknown_source_key(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
// No manga_sources row yet → unknown key path. Must not error and
// must report zero so the partial-render guard accepts the
// "brand-new manga with no chapters" case as legitimate.
let n = crawler::live_chapter_count_for_source_manga(&pool, "target", "nobody")
.await
.unwrap();
assert_eq!(n, 0);
}
#[sqlx::test(migrations = "./migrations")]
async fn live_chapter_count_only_counts_live_sources(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let chapters = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "2".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/2".into(),
},
];
crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
assert_eq!(
crawler::live_chapter_count_for_source_manga(&pool, "target", "foo")
.await
.unwrap(),
2
);
// Soft-drop one source row — count drops by one, the row stays.
sqlx::query(
"UPDATE chapter_sources SET dropped_at = NOW() WHERE source_chapter_key = '2'",
)
.execute(&pool)
.await
.unwrap();
assert_eq!(
crawler::live_chapter_count_for_source_manga(&pool, "target", "foo")
.await
.unwrap(),
1
);
}
/// Real-world sources publish multiple chapters at the same number
/// (different uploaders, translator notes, re-releases). After the
/// (manga_id, number) UNIQUE drop in 0013, each `SourceChapterRef`
/// becomes its own `chapters` row even when the parsed number matches
/// — chapter identity is now the chapter id, not the number.
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_keeps_duplicate_numbered_chapters_as_separate_rows(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Two distinct uploads of Ch.52 (different uploaders → different
// URLs/keys, same parsed number) plus a notice/hiatus row that
// parses to number=0 alongside a real chapter at number 1.
let chapters = vec![
SourceChapterRef {
source_chapter_key: "br_chapter-A".into(),
number: 52,
title: Some("Ch.52 : Official".into()),
url: "https://x.example/foo/A/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-B".into(),
number: 52,
title: Some("Ch.52 : Official (alt)".into()),
url: "https://x.example/foo/B/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-NOTICE".into(),
number: 0,
title: Some("hitaus.".into()),
url: "https://x.example/foo/notice/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-1".into(),
number: 1,
title: Some("Ch.1 : Official".into()),
url: "https://x.example/foo/1/pg-1/".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 4,
refreshed: 0,
dropped: 0
},
"every source ref yields a new chapter row"
);
let rows: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM chapters WHERE manga_id = $1")
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(rows.0, 4, "4 distinct chapter rows even with duplicate numbers");
let ch52_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1 AND number = 52",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(ch52_count.0, 2, "both Ch.52 uploads survive as separate rows");
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_isolates_colliding_keys_across_mangas(pool: PgPool) {
// Two mangas, both with a chapter whose source_chapter_key is
// "chapter-1". Pre-migration-0017 the PK enforced (source_id,
// source_chapter_key) globally and the lookup didn't filter by
// manga_id, so the second manga's sync would adopt the first manga's
// chapter_id (silent attribution corruption). After 0017 each manga
// owns its own row.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m1 = sample_manga("foo", "Manga Foo", "hash-foo");
let m2 = sample_manga("bar", "Manga Bar", "hash-bar");
let up1 = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m1)
.await
.unwrap();
let up2 = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/bar", &m2)
.await
.unwrap();
assert_ne!(up1.manga_id, up2.manga_id);
let shared = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/chapter-1/".into(),
}];
let diff1 = crawler::sync_manga_chapters(&pool, "target", up1.manga_id, &shared)
.await
.unwrap();
assert_eq!(diff1.new, 1, "manga foo: chapter inserted fresh");
// Manga bar now syncs *the same key*. Under the old schema this would
// either fail on PK conflict or attribute the chapter to foo. Under
// the new schema bar gets its own chapter row.
let bar_chapters = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1 (bar)".into()),
url: "https://x.example/bar/chapter-1/".into(),
}];
let diff2 = crawler::sync_manga_chapters(&pool, "target", up2.manga_id, &bar_chapters)
.await
.unwrap();
assert_eq!(
diff2.new, 1,
"manga bar: same key resolved per-manga to a fresh row"
);
let foo_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1",
)
.bind(up1.manga_id)
.fetch_one(&pool)
.await
.unwrap();
let bar_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(foo_count.0, 1);
assert_eq!(bar_count.0, 1);
let bar_title: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
bar_title.0.as_deref(),
Some("Ch.1 (bar)"),
"bar's chapter has bar's title, not foo's"
);
// A subsequent re-sync of foo with the same key correctly refreshes
// foo's row, not bar's.
let foo_resync = vec![SourceChapterRef {
source_chapter_key: "chapter-1".into(),
number: 1,
title: Some("Ch.1 (foo updated)".into()),
url: "https://x.example/foo/chapter-1/".into(),
}];
let diff_refresh = crawler::sync_manga_chapters(&pool, "target", up1.manga_id, &foo_resync)
.await
.unwrap();
assert_eq!(diff_refresh.refreshed, 1);
assert_eq!(diff_refresh.new, 0);
let foo_title: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up1.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(foo_title.0.as_deref(), Some("Ch.1 (foo updated)"));
let bar_title_after: (Option<String>,) = sqlx::query_as(
"SELECT title FROM chapters WHERE manga_id = $1 AND number = 1",
)
.bind(up2.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
bar_title_after.0.as_deref(),
Some("Ch.1 (bar)"),
"bar's row is untouched by foo's refresh"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_serializes_concurrent_calls_for_same_manga(pool: PgPool) {
// Without the per-manga advisory lock, two concurrent calls would
// both read `seen_keys`, both run the drop UPDATE filtered on `NOT
// (key = ANY $3)`, and the later commit could soft-drop a chapter
// the earlier had just inserted. The lock makes the calls strictly
// sequential per-manga: whichever runs second sees the first one's
// committed chapters and treats their absence as a "dropped" signal
// only if the second list legitimately omits them.
//
// Concretely: pre-state [A]. Call X syncs [A, B]; call Y syncs
// [A, B, C]. Whatever the schedule, the final state must include
// *all three* chapters because neither call legitimately omits the
// other's contribution — both lists are supersets of each other's
// pre-existing rows.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let manga_id = up.manga_id;
// Pre-state: [A].
let pre = vec![SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
}];
crawler::sync_manga_chapters(&pool, "target", manga_id, &pre)
.await
.unwrap();
// Two concurrent calls. Call X adds B; call Y adds B + C. Both keep
// A. Their drop branches would otherwise race against each other.
let list_x = vec![
SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
},
SourceChapterRef {
source_chapter_key: "B".into(),
number: 2,
title: Some("Ch.B".into()),
url: "https://x.example/foo/B".into(),
},
];
let list_y = vec![
SourceChapterRef {
source_chapter_key: "A".into(),
number: 1,
title: Some("Ch.A".into()),
url: "https://x.example/foo/A".into(),
},
SourceChapterRef {
source_chapter_key: "B".into(),
number: 2,
title: Some("Ch.B".into()),
url: "https://x.example/foo/B".into(),
},
SourceChapterRef {
source_chapter_key: "C".into(),
number: 3,
title: Some("Ch.C".into()),
url: "https://x.example/foo/C".into(),
},
];
let pool_x = pool.clone();
let pool_y = pool.clone();
let (rx, ry) = tokio::join!(
tokio::spawn(async move {
crawler::sync_manga_chapters(&pool_x, "target", manga_id, &list_x).await
}),
tokio::spawn(async move {
crawler::sync_manga_chapters(&pool_y, "target", manga_id, &list_y).await
}),
);
rx.unwrap().expect("call X");
ry.unwrap().expect("call Y");
// All three keys must survive with dropped_at NULL — the lock
// ensures the later call sees the earlier one's INSERTs and the
// drop UPDATE finds nothing to drop.
let alive: Vec<String> = sqlx::query_scalar(
"SELECT cs.source_chapter_key \
FROM chapter_sources cs \
JOIN chapters ch ON ch.id = cs.chapter_id \
WHERE ch.manga_id = $1 AND cs.dropped_at IS NULL \
ORDER BY cs.source_chapter_key",
)
.bind(manga_id)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(
alive,
vec!["A".to_string(), "B".to_string(), "C".to_string()],
"all chapters survive concurrent syncs that both contain them"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn upsert_surfaces_cover_image_path_for_backfill_decisions(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
// First upsert: row is brand new, no cover stored yet.
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert!(first.cover_image_path.is_none(), "new manga has no cover yet");
// Simulate cover landing in storage post-upsert.
sqlx::query("UPDATE mangas SET cover_image_path = $1 WHERE id = $2")
.bind("mangas/foo/cover.jpg")
.bind(first.manga_id)
.execute(&pool)
.await
.unwrap();
// Second upsert with same hash → Unchanged, but cover path is now
// surfaced so the caller knows the backfill is done.
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Unchanged);
assert_eq!(
second.cover_image_path.as_deref(),
Some("mangas/foo/cover.jpg")
);
}
#[sqlx::test(migrations = "./migrations")]
async fn arbitrary_genres_from_source_get_inserted(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let mut m = sample_manga("foo", "Foo", "h");
// "Action" is seeded by migration 0009. "Webtoons" is not.
m.genres = vec!["Action".into(), "Webtoons".into()];
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let n_genre_links: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_genres WHERE manga_id = $1")
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_genre_links.0, 2, "both seeded and source-added genres attach");
let webtoons: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM genres WHERE name = 'Webtoons'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(webtoons.0, 1, "non-seeded genre was inserted");
// Case-insensitive de-dup: a second sync with the genre re-cased
// attaches the existing row, not a new one.
let mut m2 = sample_manga("bar", "Bar", "h2");
m2.genres = vec!["webtoons".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/bar", &m2)
.await
.unwrap();
let webtoons_count: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM genres WHERE lower(name) = 'webtoons'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(webtoons_count.0, 1, "case-insensitive lookup reuses the existing row");
}
/// User-attached tags (rows with non-NULL `added_by` in `manga_tags`)
/// must survive a crawler upsert. The crawler owns source-attached tags
/// (added_by IS NULL); user attachments are owned by the user who made
/// them and the recurring metadata pass must not delete them.
#[sqlx::test(migrations = "./migrations")]
async fn sync_tags_preserves_user_attached_tags(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// A real user attaches a personal tag.
let user = mangalord::repo::user::create(&pool, "alice", "phc-stub")
.await
.unwrap();
let outcome = mangalord::repo::tag::attach_to_manga(&pool, up.manga_id, "personal", user.id)
.await
.unwrap();
assert!(outcome.created_attachment);
// Second crawler pass. Use a different metadata_hash so the upsert
// takes the Updated branch, but the bug also fires on Unchanged
// ticks since sync_tags runs unconditionally.
let mut m2 = m.clone();
m2.metadata_hash = "hash-2".into();
m2.tags = vec!["popular".into(), "weekly".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m2)
.await
.unwrap();
// The user tag must still be attached.
let user_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by = $2",
)
.bind(up.manga_id)
.bind(user.id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(
user_tag_rows.0, 1,
"user-attached tag must survive a crawler upsert"
);
// The source's tags should still attach as well, as crawler-owned.
let source_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 \
AND mt.added_by IS NULL \
AND lower(t.name) IN ('popular', 'weekly')",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(source_tag_rows.0, 2, "source tags re-attach on each pass");
// A subsequent pass where the source drops a previously-seen tag
// must clear that crawler-owned attachment (otherwise crawler-tags
// would only ever accumulate).
let mut m3 = m2.clone();
m3.metadata_hash = "hash-3".into();
m3.tags = vec!["popular".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m3)
.await
.unwrap();
let weekly_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'weekly'",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(weekly_rows.0, 0, "source-owned tag dropped by source goes away");
// And the user tag still survives that third pass.
let user_tag_rows: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by = $2",
)
.bind(up.manga_id)
.bind(user.id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(user_tag_rows.0, 1);
}
/// `manga_tags.added_by` is `ON DELETE SET NULL` on the user FK. When
/// the attaching user is deleted, their attachments become orphans
/// indistinguishable from crawler-owned rows — and the crawler should
/// reap them on the next pass. Pins the semantic so a future change
/// can't quietly leave orphan rows lying around.
#[sqlx::test(migrations = "./migrations")]
async fn sync_tags_garbage_collects_orphan_user_attachments(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// A user attaches "personal", then the user gets deleted. The
// attachment row stays (manga_tags.manga_id FK is CASCADE on
// mangas only; we never CASCADE-delete user attachments). The FK
// on added_by is `ON DELETE SET NULL`, so the row's owner column
// goes NULL — same shape as a crawler-owned row.
let user = mangalord::repo::user::create(&pool, "bob", "phc-stub")
.await
.unwrap();
let _ = mangalord::repo::tag::attach_to_manga(&pool, up.manga_id, "personal", user.id)
.await
.unwrap();
sqlx::query("DELETE FROM users WHERE id = $1")
.bind(user.id)
.execute(&pool)
.await
.unwrap();
// Sanity: the orphan still exists post-user-delete with added_by NULL.
let (orphan_rows,): (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal' \
AND mt.added_by IS NULL",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(orphan_rows, 1);
// Next crawler pass — orphan should be reaped along with any
// other source-owned rows that aren't in the new tag list.
let mut m2 = m.clone();
m2.metadata_hash = "hash-2".into();
m2.tags = vec!["popular".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m2)
.await
.unwrap();
let (orphan_rows,): (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM manga_tags mt \
JOIN tags t ON t.id = mt.tag_id \
WHERE mt.manga_id = $1 AND lower(t.name) = 'personal'",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(orphan_rows, 0, "orphan user-attached tag should be reaped");
}
#[sqlx::test(migrations = "./migrations")]
async fn re_appearing_manga_clears_dropped_at(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Drop it manually.
sqlx::query(
"UPDATE manga_sources SET dropped_at = NOW() WHERE source_manga_key = 'foo'",
)
.execute(&pool)
.await
.unwrap();
// Re-upsert: the link should un-drop.
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let dropped: (Option<chrono::DateTime<chrono::Utc>>, Uuid) = sqlx::query_as(
"SELECT dropped_at, manga_id FROM manga_sources WHERE source_manga_key = 'foo'",
)
.fetch_one(&pool)
.await
.unwrap();
assert!(dropped.0.is_none());
assert_eq!(dropped.1, up.manga_id);
}

View File

@@ -0,0 +1,194 @@
<table class="listing" id="chapter_table">
<tbody>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-379272/pg-1/"><b>Ch.67</b>
: Official </a>
<b style="color:#FEFD7F;width;30px;display:inline-block;margin-left:5px">new</b>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">May 20, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-328248/pg-1/"><b>hitaus.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 15, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-326351/pg-1/"><b>Ch.66</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 10, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-295078/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Aug 28, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-294815/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../4300634/upload/">mina</a>
</td>
<td class="no">Aug 27, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249964/pg-1/"><b>Ch.10</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 5, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-13/pg-1/"><b>Ch.13</b>
: Thank you, we'll see you in the next one! </a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 30, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249095/pg-1/"><b>Ch.9</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 28, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-248930/pg-1/"><b>Ch.1</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-12/pg-1/"><b>Ch.12</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 1, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-244844/pg-1/"><b>notice.</b>
: Officials </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Nov 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-11/pg-1/"><b>Ch.11</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Nov 18, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-221180/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../3781074/upload/">Izanami</a>
</td>
<td class="no">Jun 21, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-234803/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Sep 13, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-220299/pg-1/"><b>Ch.1</b>
: Team Hazama </a>
</h4>
</td>
<td class="no">
<a href=".../1457681/upload/">purplepandabear</a>
</td>
<td class="no">Jun 16, 2024</td>
</tr>
</tbody>
</table>

View File

@@ -0,0 +1,162 @@
//! Integration tests for `repo::chapter` — focused on
//! `dispatch_target`, the resolver the daemon's chapter dispatcher
//! uses to look up the URL it needs to hand to
//! `content::sync_chapter_content`.
//!
//! The query must:
//! 1. Skip `chapter_sources` rows where `dropped_at IS NOT NULL` —
//! otherwise a soft-dropped source URL is dispatched as if live and
//! burns the chapter's retry budget against guaranteed transients.
//! 2. Order the remaining rows by `last_seen_at DESC` so the freshest
//! surviving source is the one we'll fetch from.
//!
//! The fix lives in `backend/src/repo/chapter.rs:dispatch_target`. The
//! enqueue queries at `pipeline.rs:381` and `:435` already filter on
//! `cs.dropped_at IS NULL`; this brings the resolver into line.
use mangalord::crawler::source::{SourceChapterRef, SourceManga};
use mangalord::repo::{
chapter::dispatch_target,
crawler::{ensure_source, sync_manga_chapters, upsert_manga_from_source},
};
use sqlx::PgPool;
use uuid::Uuid;
fn sample_manga(key: &str, title: &str, hash: &str) -> SourceManga {
SourceManga {
source_manga_key: key.to_string(),
title: title.to_string(),
alternative_titles: vec![],
authors: vec![],
genres: vec![],
tags: vec![],
status: None,
summary: None,
cover_url: None,
chapters: vec![],
metadata_hash: hash.to_string(),
}
}
/// Seed a manga with one chapter, plus a second `chapter_sources` row
/// pointing at the same chapter with a *newer* `last_seen_at` so the
/// `ORDER BY cs.last_seen_at DESC` branch of the fixed query can
/// distinguish "freshest live source" from "any live source."
async fn seed_chapter_with_two_live_sources(pool: &PgPool) -> (Uuid, String, String) {
// Two distinct sources both pointing at the same chapter is the
// realistic shape of the multi-source state — each source row is
// keyed (source_id, chapter_id) after migration 0017.
ensure_source(pool, "target", "T", "https://x.example")
.await
.unwrap();
ensure_source(pool, "mirror", "Mirror", "https://m.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = upsert_manga_from_source(pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let initial = vec![SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1/old".into(),
}];
sync_manga_chapters(pool, "target", up.manga_id, &initial)
.await
.unwrap();
let (chapter_id,): (Uuid,) = sqlx::query_as(
"SELECT c.id FROM chapters c \
JOIN chapter_sources cs ON cs.chapter_id = c.id \
WHERE cs.source_chapter_key = '1' AND cs.source_id = 'target'",
)
.fetch_one(pool)
.await
.unwrap();
let old_url = "https://x.example/foo/1/old".to_string();
let new_url = "https://m.example/foo/1/mirror".to_string();
// Backdate the existing (old/target) source row and add a fresher
// row from the mirror source. The fix uses `last_seen_at DESC` to
// break the tie deterministically.
sqlx::query(
"UPDATE chapter_sources \
SET last_seen_at = NOW() - INTERVAL '2 days' \
WHERE chapter_id = $1 AND source_id = 'target'",
)
.bind(chapter_id)
.execute(pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources \
(source_id, chapter_id, source_chapter_key, source_url, last_seen_at) \
VALUES ('mirror', $1, '1', $2, NOW())",
)
.bind(chapter_id)
.bind(&new_url)
.execute(pool)
.await
.unwrap();
(chapter_id, old_url, new_url)
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_prefers_most_recent_live_source(pool: PgPool) {
let (chapter_id, _old_url, new_url) =
seed_chapter_with_two_live_sources(&pool).await;
let row = dispatch_target(&pool, chapter_id).await.unwrap();
let (_manga_id, source_url) =
row.expect("two live sources should yield a dispatch target");
assert_eq!(
source_url, new_url,
"ORDER BY last_seen_at DESC LIMIT 1 must return the freshest source"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_skips_dropped_sources(pool: PgPool) {
let (chapter_id, _old_url, new_url) =
seed_chapter_with_two_live_sources(&pool).await;
// Soft-drop the fresher row. The dispatcher must now return the
// *older* still-live row instead of the dropped one.
sqlx::query(
"UPDATE chapter_sources SET dropped_at = NOW() WHERE source_url = $1",
)
.bind(&new_url)
.execute(&pool)
.await
.unwrap();
let row = dispatch_target(&pool, chapter_id).await.unwrap();
let (_manga_id, source_url) =
row.expect("a single live source should still yield a dispatch target");
assert!(
source_url != new_url,
"dispatch_target must not return a dropped source"
);
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatch_target_returns_none_when_only_dropped_sources_remain(
pool: PgPool,
) {
let (chapter_id, _old_url, _new_url) =
seed_chapter_with_two_live_sources(&pool).await;
sqlx::query("UPDATE chapter_sources SET dropped_at = NOW() WHERE chapter_id = $1")
.bind(chapter_id)
.execute(&pool)
.await
.unwrap();
let row = dispatch_target(&pool, chapter_id).await.unwrap();
assert!(
row.is_none(),
"every source is dropped — dispatch_target must return None"
);
}

22
docker-compose.prod.yml Normal file
View File

@@ -0,0 +1,22 @@
# Production overlay: layer on top of docker-compose.yml on the deploy
# host so the backend and frontend run from pre-built registry images
# instead of building locally.
#
# docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
#
# REGISTRY_URL and IMAGE_TAG are injected by .gitea/workflows/deploy.yml
# at deploy time. IMAGE_TAG defaults to `latest` so a manual
# `docker compose ... up -d` on the host still works.
services:
backend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-backend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped
frontend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-frontend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped

View File

@@ -1,9 +1,15 @@
# Production-like compose. Requires a populated `.env` next to this
# file: at minimum POSTGRES_PASSWORD must be set to a non-default
# value (the `?required` form below fails fast otherwise). The
# frontend container expects HTTPS in front (Caddy/Traefik/nginx)
# because COOKIE_SECURE=true browsers will refuse to send the session
# cookie over plain HTTP.
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: ${POSTGRES_USER:-mangalord}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-mangalord}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}
POSTGRES_DB: ${POSTGRES_DB:-mangalord}
volumes:
- postgres-data:/var/lib/postgresql/data
@@ -19,7 +25,7 @@ services:
postgres:
condition: service_healthy
environment:
DATABASE_URL: postgres://${POSTGRES_USER:-mangalord}:${POSTGRES_PASSWORD:-mangalord}@postgres:5432/${POSTGRES_DB:-mangalord}
DATABASE_URL: postgres://${POSTGRES_USER:-mangalord}:${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}@postgres:5432/${POSTGRES_DB:-mangalord}
BIND_ADDRESS: 0.0.0.0:8080
STORAGE_DIR: /var/lib/mangalord/storage
RUST_LOG: ${RUST_LOG:-info,mangalord=debug}
@@ -33,6 +39,11 @@ services:
# Upload limits.
MAX_REQUEST_BYTES: ${MAX_REQUEST_BYTES:-209715200}
MAX_FILE_BYTES: ${MAX_FILE_BYTES:-20971520}
# System-chromium override for the crawler. Leave blank to use the
# bundled fetcher; set to e.g. /usr/bin/chromium-headless-shell on
# arm64 deployments. Pair with `--build-arg INSTALL_CHROMIUM=true`
# so the image actually contains the binary.
CRAWLER_CHROMIUM_BINARY: ${CRAWLER_CHROMIUM_BINARY:-}
volumes:
- storage-data:/var/lib/mangalord/storage
# No host port mapping in the default setup — the frontend proxies

View File

@@ -1,7 +1,11 @@
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm install
# `npm ci` installs the locked versions exactly; `npm install` would
# silently rewrite package-lock.json mid-build. CI (.gitea/workflows)
# also uses `npm ci`, so this keeps the image build deterministic and
# matches what the test job validated.
RUN npm ci
COPY . .
RUN npm run build
@@ -10,8 +14,22 @@ WORKDIR /app
ENV NODE_ENV=production
ENV HOST=0.0.0.0
ENV PORT=3000
COPY --from=builder /app/build ./build
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
# node:22-alpine ships a `node` user (UID 1000); use it instead of
# running the SvelteKit server as root.
COPY --from=builder --chown=node:node /app/build ./build
COPY --from=builder --chown=node:node /app/node_modules ./node_modules
COPY --from=builder --chown=node:node /app/package.json ./
USER node
EXPOSE 3000
# Alpine's busybox `wget` is the canonical lightweight HTTP probe. Probe
# 127.0.0.1, not `localhost`: musl resolves `localhost` to IPv6 ::1 first,
# but the Node server binds IPv4 0.0.0.0 only, so a localhost probe gets
# "connection refused" and the container is wrongly marked unhealthy. Use a
# GET (`-O /dev/null`) since `node build` serves 200 on `/`.
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget -q -O /dev/null http://127.0.0.1:3000/ || exit 1
CMD ["node", "build"]

View File

@@ -0,0 +1,147 @@
import { test, expect, type Page } from '@playwright/test';
const userFixture = {
id: 'u1',
username: 'alice',
created_at: '2026-01-01T00:00:00Z'
};
const baseManga = {
id: 'm1',
title: 'Berserk',
status: 'ongoing',
alt_titles: ['Old Alt'],
description: 'Original description',
cover_image_path: null,
created_at: '2026-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
authors: [{ id: 'a1', name: 'Kentaro Miura' }],
genres: [],
tags: []
};
async function stubAuthenticatedAndGenres(page: Page) {
await page.route('**/api/v1/auth/me', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ user: userFixture })
})
);
await page.route('**/api/v1/genres', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify([
{ id: 'g-action', name: 'Action' },
{ id: 'g-fantasy', name: 'Fantasy' }
])
})
);
}
test('anonymous user sees sign-in prompt on /manga/[id]/edit', async ({ page }) => {
await page.route('**/api/v1/auth/me', (route) =>
route.fulfill({
status: 401,
contentType: 'application/json',
body: JSON.stringify({
error: { code: 'unauthenticated', message: 'unauthenticated' }
})
})
);
await page.route('**/api/v1/genres', (route) =>
route.fulfill({ status: 200, contentType: 'application/json', body: '[]' })
);
await page.route('**/api/v1/mangas/m1', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(baseManga)
})
);
await page.goto('/manga/m1/edit');
await expect(page.getByTestId('edit-signin')).toBeVisible();
});
test('/manga/[id]/edit PATCHes the changed metadata and lands on the manga page', async ({
page
}) => {
await stubAuthenticatedAndGenres(page);
let patchBody: Record<string, unknown> | null = null;
let mangaAfter = { ...baseManga };
await page.route('**/api/v1/mangas/m1', async (route) => {
const method = route.request().method();
if (method === 'GET') {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(mangaAfter)
});
} else if (method === 'PATCH') {
patchBody = JSON.parse(route.request().postData() ?? '{}');
mangaAfter = {
...mangaAfter,
title: (patchBody.title as string) ?? mangaAfter.title,
description:
'description' in (patchBody as Record<string, unknown>)
? ((patchBody.description as string | null) ?? null)
: mangaAfter.description
};
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(mangaAfter)
});
} else {
await route.fallback();
}
});
await page.route('**/api/v1/mangas/m1/chapters*', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items: [],
page: { limit: 50, offset: 0, total: 0 }
})
})
);
await page.route('**/api/v1/me/bookmarks*', (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({
items: [],
page: { limit: 50, offset: 0, total: 0 }
})
})
);
await page.route('**/api/v1/me/read-progress/m1', (route) =>
route.fulfill({
status: 404,
contentType: 'application/json',
body: JSON.stringify({
error: { code: 'not_found', message: 'no progress' }
})
})
);
await page.goto('/manga/m1');
// Edit link is gated on session.user — it should be visible to the
// stubbed authenticated user.
await page.getByTestId('edit-manga-link').click();
await expect(page).toHaveURL(/\/manga\/m1\/edit$/);
const titleInput = page.getByTestId('manga-title');
await expect(titleInput).toHaveValue('Berserk');
await titleInput.fill('Berserk (Deluxe)');
await page.getByTestId('manga-edit-submit').click();
await expect(page).toHaveURL(/\/manga\/m1$/);
await expect(page.getByTestId('manga-title')).toHaveText('Berserk (Deluxe)');
expect(patchBody).not.toBeNull();
expect((patchBody as Record<string, unknown>).title).toBe('Berserk (Deluxe)');
});

View File

@@ -1,6 +1,7 @@
import { test, expect, type Page } from '@playwright/test';
const mangaId = '22222222-2222-2222-2222-222222222222';
const chapterId = 'c2222222-2222-2222-2222-222222222222';
const mangaFixture = {
id: mangaId,
title: 'Vagabond',
@@ -11,7 +12,7 @@ const mangaFixture = {
updated_at: '2026-01-01T00:00:00Z'
};
const chapterFixture = {
id: 'c1',
id: chapterId,
manga_id: mangaId,
number: 1,
title: null,
@@ -20,24 +21,24 @@ const chapterFixture = {
};
const pagesFixture = [
{
id: 'p1',
chapter_id: 'c1',
id: 'p1111111-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 1,
storage_key: 'mangas/m2/chapters/c1/pages/0001.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png'
},
{
id: 'p2',
chapter_id: 'c1',
id: 'p2222222-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 2,
storage_key: 'mangas/m2/chapters/c1/pages/0002.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
content_type: 'image/png'
},
{
id: 'p3',
chapter_id: 'c1',
id: 'p3333333-2222-2222-2222-222222222222',
chapter_id: chapterId,
page_number: 3,
storage_key: 'mangas/m2/chapters/c1/pages/0003.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
content_type: 'image/png'
}
];
@@ -92,19 +93,21 @@ async function mockReaderApis(page: Page) {
})
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) =>
await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(chapterFixture)
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
(route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
);
const png = Buffer.from(
'89504e470d0a1a0a0000000d49484452000000010000000108060000001f15c4890000000d49444154789c63000100000005000158a3b62a0000000049454e44ae426082',
@@ -131,7 +134,7 @@ test.beforeEach(async ({ context }) => {
test('switching to continuous mode stacks all pages and hides chevrons', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
// Default single-page mode is active.
await expect(page.getByTestId('reader-page')).toBeVisible();
@@ -149,7 +152,7 @@ test('switching to continuous mode stacks all pages and hides chevrons', async (
test('arrow keys do not paginate while in continuous mode', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await page.getByTestId('reader-mode-continuous').click();
await expect(page.getByTestId('reader-continuous')).toBeVisible();
@@ -164,7 +167,7 @@ test('arrow keys do not paginate while in continuous mode', async ({ page }) =>
test('gap select updates the inline gap on the continuous container', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await page.getByTestId('reader-mode-continuous').click();
const container = page.getByTestId('reader-continuous');
@@ -192,7 +195,7 @@ test('reader-mode preference set on one page is honored when the reader opens',
});
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
await expect(page.getByTestId('reader-continuous')).toBeVisible();
await expect(page.getByTestId('page-indicator')).toHaveText('3 pages');
await expect(page.getByTestId('reader-continuous')).toHaveAttribute(

View File

@@ -1,6 +1,7 @@
import { test, expect, type Page } from '@playwright/test';
const mangaId = '11111111-1111-1111-1111-111111111111';
const chapterId = 'c1111111-1111-1111-1111-111111111111';
const mangaFixture = {
id: mangaId,
title: 'Berserk',
@@ -12,7 +13,7 @@ const mangaFixture = {
};
const chaptersFixture = [
{
id: 'c1',
id: chapterId,
manga_id: mangaId,
number: 1,
title: 'The Brand',
@@ -22,24 +23,24 @@ const chaptersFixture = [
];
const pagesFixture = [
{
id: 'p1',
chapter_id: 'c1',
id: 'p1111111-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 1,
storage_key: 'mangas/m1/chapters/c1/pages/0001.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png'
},
{
id: 'p2',
chapter_id: 'c1',
id: 'p2222222-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 2,
storage_key: 'mangas/m1/chapters/c1/pages/0002.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
content_type: 'image/png'
},
{
id: 'p3',
chapter_id: 'c1',
id: 'p3333333-1111-1111-1111-111111111111',
chapter_id: chapterId,
page_number: 3,
storage_key: 'mangas/m1/chapters/c1/pages/0003.png',
storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
content_type: 'image/png'
}
];
@@ -86,19 +87,21 @@ async function mockReaderApis(page: Page) {
})
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) =>
await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify(chaptersFixture[0])
})
);
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
(route) =>
route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ pages: pagesFixture })
})
);
// Stub image bytes so the <img> doesn't 404 (1x1 transparent PNG).
const png = Buffer.from(
@@ -123,7 +126,7 @@ test('manga overview shows title, cover, and a chapter list', async ({ page }) =
test('reader paginates with arrow keys and j/k, and preloads the next page', async ({ page }) => {
await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`);
await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
// Page 1 shown, preload for page 2 in the DOM.
await expect(page.getByTestId('page-indicator')).toHaveText('Page 1 / 3');

View File

@@ -1,12 +1,12 @@
{
"name": "mangalord-frontend",
"version": "0.12.0",
"version": "0.23.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "mangalord-frontend",
"version": "0.12.0",
"version": "0.23.0",
"devDependencies": {
"@lucide/svelte": "^1.16.0",
"@playwright/test": "^1.48.0",
@@ -169,7 +169,6 @@
}
],
"license": "MIT",
"peer": true,
"engines": {
"node": ">=18"
},
@@ -193,7 +192,6 @@
}
],
"license": "MIT",
"peer": true,
"engines": {
"node": ">=18"
}
@@ -1157,7 +1155,6 @@
"integrity": "sha512-mQjlkNo+rJvpln7V2IGY2j99BqhcFbS4UN0AQNKNYfhBAFZTuCDAdW3a1sgf330mvtNvsBXn3HpAhcmvdJTcIQ==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@standard-schema/spec": "^1.0.0",
"@sveltejs/acorn-typescript": "^1.0.5",
@@ -1200,7 +1197,6 @@
"integrity": "sha512-0ba1RQ/PHen5FGpdSrW7Y3fAMQjrXantECALeOiOdBdzR5+5vPP6HVZRLmZaQL+W8m++o+haIAKq5qT+MiZ7VA==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@sveltejs/vite-plugin-svelte-inspector": "^3.0.0-next.0||^3.0.0",
"debug": "^4.3.7",
@@ -1359,7 +1355,6 @@
"integrity": "sha512-dyh/xO2Fh5bYrfWaaqGrRQQGkNdmYw6AmaAUvYeUMNTWQtvb796ikLdmTchRmOlOiIJ1TDXfWgVx1QkUlQ6Hew==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"undici-types": "~6.21.0"
}
@@ -1507,7 +1502,6 @@
"integrity": "sha512-UVJyE9MttOsBQIDKw1skb9nAwQuR5wuGD3+82K6JgJlm/Y+KI92oNsMNGZCYdDsVtRHSak0pcV5Dno5+4jh9sw==",
"dev": true,
"license": "MIT",
"peer": true,
"bin": {
"acorn": "bin/acorn"
},
@@ -2249,7 +2243,6 @@
"integrity": "sha512-8i7LzZj7BF8uplX+ZyOlIz86V6TAsSs+np6m1kpW9u0JWi4z/1t+FzcK1aek+ybTnAC4KhBL4uXCNT0wcUIeCw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"cssstyle": "^4.1.0",
"data-urls": "^5.0.0",
@@ -2638,7 +2631,6 @@
"integrity": "sha512-WHeFSbZYsPu3+bLoNRUuAO+wavNlocOPf3wSHTP7hcFKVnJeWsYlCDbr3mTS14FCizf9ccIxXA8sGL8zKeQN3g==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@types/estree": "1.0.8"
},
@@ -2810,7 +2802,6 @@
"integrity": "sha512-ymI5ykLPwIHW839E053FQbI1G+jnRFJEw3Kv5Y4njixVWywQBx+NUFpkkKyk5LIb36Fg9DVXSYpqiGekLD0hyw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@jridgewell/remapping": "^2.3.4",
"@jridgewell/sourcemap-codec": "^1.5.0",
@@ -2997,7 +2988,6 @@
"integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==",
"dev": true,
"license": "Apache-2.0",
"peer": true,
"bin": {
"tsc": "bin/tsc",
"tsserver": "bin/tsserver"
@@ -3019,7 +3009,6 @@
"integrity": "sha512-o5a9xKjbtuhY6Bi5S3+HvbRERmouabWbyUcpXXUA1u+GNUKoROi9byOJ8M0nHbHYHkYICiMlqxkg1KkYmm25Sw==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"esbuild": "^0.21.3",
"postcss": "^8.4.43",
@@ -3138,7 +3127,6 @@
"integrity": "sha512-MSmPM9REYqDGBI8439mA4mWhV5sKmDlBKWIYbA3lRb2PTHACE0mgKwA8yQ2xq9vxDTuk4iPrECBAEW2aoFXY0Q==",
"dev": true,
"license": "MIT",
"peer": true,
"dependencies": {
"@vitest/expect": "2.1.9",
"@vitest/mocker": "2.1.9",

View File

@@ -1,6 +1,6 @@
{
"name": "mangalord-frontend",
"version": "0.21.3",
"version": "0.45.0",
"private": true,
"type": "module",
"scripts": {

View File

@@ -118,4 +118,77 @@ describe('hooks.server proxy', () => {
expect(body.error.code).toBe('upstream_unavailable');
expect(errSpy).toHaveBeenCalled();
});
it('strips every hop-by-hop header listed in RFC 7230 §6.1', async () => {
// Defence in depth: axum doesn't emit these, but a future
// middleware that did would otherwise leak per-connection
// state across the proxy boundary.
fetchSpy.mockResolvedValueOnce(new Response('[]', { status: 200 }));
const resolve = vi.fn();
await handle({
event: makeEvent('/api/v1/health', {
headers: {
host: 'app.example.com',
'content-length': '0',
connection: 'keep-alive',
'keep-alive': 'timeout=5',
'proxy-authenticate': 'Basic realm=x',
'proxy-authorization': 'Basic xyz',
te: 'trailers',
trailer: 'Expires',
'transfer-encoding': 'chunked',
upgrade: 'websocket',
// A non-hop-by-hop header to ensure non-targets
// aren't accidentally stripped.
'x-custom': 'pass-through'
}
}),
resolve
});
const init = fetchSpy.mock.calls[0][1] as RequestInit;
const headers = init.headers as Headers;
for (const h of [
'host',
'content-length',
'connection',
'keep-alive',
'proxy-authenticate',
'proxy-authorization',
'te',
'trailer',
'transfer-encoding',
'upgrade'
]) {
expect(headers.get(h), `${h} should be stripped`).toBeNull();
}
expect(headers.get('x-custom')).toBe('pass-through');
});
it('aborts and returns 502 when the upstream stalls past the timeout', async () => {
const errSpy = vi.spyOn(console, 'error').mockImplementation(() => {});
// Simulate an aborted fetch (AbortController.abort() raises a
// DOMException with name 'AbortError' on Node's fetch). The
// handler should treat it as the same upstream_unavailable
// 502 it uses for any other network failure.
const abortErr = new DOMException('aborted', 'AbortError');
fetchSpy.mockRejectedValueOnce(abortErr);
const resolve = vi.fn();
const resp = await handle({ event: makeEvent('/api/v1/slow'), resolve });
expect(resp.status).toBe(502);
const body = await resp.json();
expect(body.error.code).toBe('upstream_unavailable');
expect(errSpy).toHaveBeenCalled();
});
it('attaches an AbortSignal to the upstream fetch so it can time out', async () => {
fetchSpy.mockResolvedValueOnce(new Response('[]', { status: 200 }));
const resolve = vi.fn();
await handle({ event: makeEvent('/api/v1/health'), resolve });
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(init.signal).toBeInstanceOf(AbortSignal);
// The signal hasn't fired (handler returned in time), but its
// presence is the contract this test is pinning.
expect(init.signal?.aborted).toBe(false);
});
});

View File

@@ -12,20 +12,66 @@ import type { Handle } from '@sveltejs/kit';
const BACKEND_URL = process.env.BACKEND_URL ?? 'http://localhost:8080';
/**
* Hop-by-hop headers per RFC 7230 §6.1. These are scoped to a single
* transport-level connection and must not be forwarded by a proxy.
* Plus `host` and `content-length`: `host` would mislead the backend
* about its origin, and `content-length` is recomputed by the upstream
* fetch from the body stream.
*/
const HOP_BY_HOP_HEADERS = [
'host',
'content-length',
'connection',
'keep-alive',
'proxy-authenticate',
'proxy-authorization',
'te',
'trailer',
'transfer-encoding',
'upgrade'
];
/**
* Cap each proxied request at 5 minutes. The bound exists to surface
* a wedged backend (stuck on a slow DB query, deadlocked, etc.) as a
* 502 rather than letting the browser request hang indefinitely.
*
* The default leans toward the slow-upload end of the spectrum: at a
* 1 Mbps upstream, a 200 MiB chapter upload (the default
* `MAX_REQUEST_BYTES` cap) needs ~27 minutes; 300 s covers the more
* realistic 25 Mbps urban-broadband case (~64 s for the same upload)
* with comfortable headroom. Operators serving very slow clients
* should raise `BACKEND_PROXY_TIMEOUT_MS`; operators behind a
* tighter upstream proxy may want to lower it. A future improvement
* is an idle-based timeout (reset per chunk) instead of this
* wall-clock budget — that's a fair bit more code, deferred.
*/
const PROXY_TIMEOUT_MS = (() => {
const raw = process.env.BACKEND_PROXY_TIMEOUT_MS;
const n = raw ? Number(raw) : 300_000;
return Number.isFinite(n) && n > 0 ? n : 300_000;
})();
export const handle: Handle = async ({ event, resolve }) => {
if (event.url.pathname.startsWith('/api/')) {
const target = `${BACKEND_URL}${event.url.pathname}${event.url.search}`;
// Strip hop-by-hop headers — `host` would mislead the backend
// about the origin, and `content-length` will be recomputed.
const headers = new Headers(event.request.headers);
headers.delete('host');
headers.delete('content-length');
for (const h of HOP_BY_HOP_HEADERS) headers.delete(h);
// AbortController times the upstream fetch out so a backend
// wedged on a slow DB query doesn't keep the browser request
// hanging forever. The `signal` is also wired into the
// RequestInit so the body stream is cancelled cleanly.
const ctrl = new AbortController();
const timeoutHandle = setTimeout(() => ctrl.abort(), PROXY_TIMEOUT_MS);
const init: RequestInit & { duplex?: 'half' } = {
method: event.request.method,
headers,
redirect: 'manual'
redirect: 'manual',
signal: ctrl.signal
};
if (event.request.method !== 'GET' && event.request.method !== 'HEAD') {
init.body = event.request.body;
@@ -39,11 +85,13 @@ export const handle: Handle = async ({ event, resolve }) => {
upstream = await fetch(target, init);
} catch (e) {
// Network-layer failure (DNS / connection refused / TLS
// handshake) — most commonly "backend container restarting".
// SvelteKit's default 500 would be an HTML page that
// client.ts can't .json(), which masks the real cause. Emit
// the standard envelope with a dedicated code instead.
// handshake / abort by timeout) — most commonly "backend
// container restarting". SvelteKit's default 500 would be
// an HTML page that client.ts can't .json(), which masks
// the real cause. Emit the standard envelope with a
// dedicated code instead.
console.error('Proxy to backend failed:', e);
clearTimeout(timeoutHandle);
return new Response(
JSON.stringify({
error: {
@@ -58,6 +106,7 @@ export const handle: Handle = async ({ event, resolve }) => {
);
}
clearTimeout(timeoutHandle);
return new Response(upstream.body, {
status: upstream.status,
statusText: upstream.statusText,

View File

@@ -0,0 +1,245 @@
import {
describe,
it,
expect,
vi,
beforeEach,
afterEach,
type MockInstance
} from 'vitest';
import {
listAdminUsers,
deleteAdminUser,
setUserAdmin,
createAdminUser,
listAdminMangas,
listAdminChapters,
getSystemStats
} from './admin';
function ok(body: unknown, status = 200): Response {
return new Response(JSON.stringify(body), {
status,
headers: { 'content-type': 'application/json' }
});
}
function noContent(): Response {
return new Response(null, { status: 204 });
}
function envelope(status: number, code: string, message: string): Response {
return new Response(JSON.stringify({ error: { code, message } }), {
status,
headers: { 'content-type': 'application/json' }
});
}
const userFixture = {
id: 'u-1',
username: 'alice',
created_at: '2026-01-01T00:00:00Z',
is_admin: false
};
const mangaFixture = {
id: 'm-1',
title: 'Test',
status: 'ongoing',
cover_image_path: null,
created_at: '2026-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
sync_state: 'synced' as const,
chapter_count: 3,
latest_seen_at: '2026-01-02T00:00:00Z'
};
const systemFixture = {
disk: {
total_bytes: 1_000_000,
used_bytes: 500_000,
free_bytes: 500_000,
percent_used: 50.0
},
memory: { total_bytes: 8_000_000, used_bytes: 4_000_000, percent_used: 50.0 },
cpu: { percent_used: 12.3 },
alerts: []
};
describe('admin api client', () => {
let fetchSpy: MockInstance<typeof globalThis.fetch>;
beforeEach(() => {
fetchSpy = vi.spyOn(globalThis, 'fetch');
});
afterEach(() => {
vi.restoreAllMocks();
});
// ---- users ----
it('listAdminUsers GETs /v1/admin/users and parses the paged envelope', async () => {
fetchSpy.mockResolvedValueOnce(
ok({ items: [userFixture], page: { limit: 50, offset: 0, total: 1 } })
);
const page = await listAdminUsers({ limit: 50 });
expect(page.items).toHaveLength(1);
expect(page.items[0]).toEqual(userFixture);
expect(page.page.total).toBe(1);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/admin\/users\?limit=50$/);
});
it('listAdminUsers forwards search + offset query params', async () => {
fetchSpy.mockResolvedValueOnce(
ok({ items: [], page: { limit: 50, offset: 10, total: 0 } })
);
await listAdminUsers({ search: 'al', offset: 10 });
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toContain('search=al');
expect(url).toContain('offset=10');
});
it('listAdminUsers surfaces 403 forbidden via ApiError.code', async () => {
fetchSpy.mockResolvedValueOnce(envelope(403, 'forbidden', 'forbidden'));
await expect(listAdminUsers()).rejects.toMatchObject({
status: 403,
code: 'forbidden'
});
});
it('deleteAdminUser DELETEs to /v1/admin/users/{id} and handles 204', async () => {
fetchSpy.mockResolvedValueOnce(noContent());
await expect(deleteAdminUser('u-1')).resolves.toBeUndefined();
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/admin\/users\/u-1$/);
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(init.method).toBe('DELETE');
});
it('deleteAdminUser surfaces 409 conflict (self-delete / last-admin)', async () => {
fetchSpy.mockResolvedValueOnce(
envelope(409, 'conflict', 'cannot delete yourself; ask another admin')
);
await expect(deleteAdminUser('u-1')).rejects.toMatchObject({
status: 409,
code: 'conflict'
});
});
it('createAdminUser POSTs to /v1/admin/users with body and returns the created user', async () => {
const created = { ...userFixture, username: 'invited01' };
fetchSpy.mockResolvedValueOnce(ok(created, 201));
const got = await createAdminUser({
username: 'invited01',
password: 'freshpass1234'
});
expect(got).toEqual(created);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/admin\/users$/);
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(init.method).toBe('POST');
expect(JSON.parse(init.body as string)).toEqual({
username: 'invited01',
password: 'freshpass1234'
});
});
it('createAdminUser forwards is_admin when provided', async () => {
const created = { ...userFixture, username: 'coadmin', is_admin: true };
fetchSpy.mockResolvedValueOnce(ok(created, 201));
await createAdminUser({
username: 'coadmin',
password: 'freshpass1234',
is_admin: true
});
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(JSON.parse(init.body as string)).toEqual({
username: 'coadmin',
password: 'freshpass1234',
is_admin: true
});
});
it('createAdminUser surfaces 409 conflict on duplicate username', async () => {
fetchSpy.mockResolvedValueOnce(
envelope(409, 'conflict', 'username is already taken')
);
await expect(
createAdminUser({ username: 'taken', password: 'freshpass1234' })
).rejects.toMatchObject({ status: 409, code: 'conflict' });
});
it('setUserAdmin PATCHes is_admin and returns the updated user', async () => {
const updated = { ...userFixture, is_admin: true };
fetchSpy.mockResolvedValueOnce(ok(updated));
const got = await setUserAdmin('u-1', true);
expect(got).toEqual(updated);
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(init.method).toBe('PATCH');
expect(JSON.parse(init.body as string)).toEqual({ is_admin: true });
});
// ---- mangas + chapters ----
it('listAdminMangas GETs /v1/admin/mangas and forwards sync_state filter', async () => {
fetchSpy.mockResolvedValueOnce(
ok({ items: [mangaFixture], page: { limit: 100, offset: 0, total: 1 } })
);
const page = await listAdminMangas({ syncState: 'in_progress', limit: 100 });
expect(page.items[0].sync_state).toBe('synced');
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toContain('sync_state=in_progress');
expect(url).toContain('limit=100');
});
it('listAdminChapters GETs the nested chapter route and parses the paged envelope', async () => {
const chapter = {
id: 'c-1',
manga_id: 'm-1',
number: 1,
title: null,
page_count: 12,
created_at: '2026-01-01T00:00:00Z',
sync_state: 'synced' as const,
latest_seen_at: null
};
fetchSpy.mockResolvedValueOnce(
ok({ items: [chapter], page: { limit: 200, offset: 0, total: 1 } })
);
const resp = await listAdminChapters('m-1');
expect(resp.items).toEqual([chapter]);
expect(resp.page.total).toBe(1);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/admin\/mangas\/m-1\/chapters$/);
});
it('listAdminChapters forwards limit + offset query params', async () => {
fetchSpy.mockResolvedValueOnce(
ok({ items: [], page: { limit: 50, offset: 100, total: 0 } })
);
await listAdminChapters('m-1', { limit: 50, offset: 100 });
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toContain('limit=50');
expect(url).toContain('offset=100');
});
// ---- system ----
it('getSystemStats GETs /v1/admin/system and parses the four-key envelope', async () => {
fetchSpy.mockResolvedValueOnce(ok(systemFixture));
const s = await getSystemStats();
expect(s.disk?.percent_used).toBe(50);
expect(s.memory.percent_used).toBe(50);
expect(s.cpu.percent_used).toBe(12.3);
expect(s.alerts).toEqual([]);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/admin\/system$/);
});
it('getSystemStats keeps disk null when backend reports a non-local store', async () => {
fetchSpy.mockResolvedValueOnce(ok({ ...systemFixture, disk: null }));
const s = await getSystemStats();
expect(s.disk).toBeNull();
});
});

View File

@@ -0,0 +1,178 @@
// Admin-only API client. Every endpoint here is guarded by
// RequireAdmin on the backend (session cookie only — bearer tokens
// won't reach these routes). 403s thrown here propagate up to the
// /admin layout, which renders the framework error page.
import { request, type Page } from './client';
import type { User } from './auth';
// ---- users -----------------------------------------------------------------
export type AdminUsersPage = {
items: User[];
page: Page;
};
export type ListAdminUsersOptions = {
search?: string;
limit?: number;
offset?: number;
};
export async function listAdminUsers(
opts: ListAdminUsersOptions = {}
): Promise<AdminUsersPage> {
const params = new URLSearchParams();
if (opts.search) params.set('search', opts.search);
if (opts.limit != null) params.set('limit', String(opts.limit));
if (opts.offset != null) params.set('offset', String(opts.offset));
const qs = params.toString();
return request<AdminUsersPage>(`/v1/admin/users${qs ? `?${qs}` : ''}`);
}
export async function deleteAdminUser(id: string): Promise<void> {
await request<void>(`/v1/admin/users/${encodeURIComponent(id)}`, {
method: 'DELETE'
});
}
export async function setUserAdmin(id: string, isAdmin: boolean): Promise<User> {
return request<User>(`/v1/admin/users/${encodeURIComponent(id)}`, {
method: 'PATCH',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ is_admin: isAdmin })
});
}
export type CreateAdminUserInput = {
username: string;
password: string;
is_admin?: boolean;
};
/** POST /v1/admin/users — admin-initiated account creation. Works
* regardless of the ALLOW_SELF_REGISTER toggle, since the entire
* point is for an admin to enroll someone when self-register is off. */
export async function createAdminUser(input: CreateAdminUserInput): Promise<User> {
return request<User>('/v1/admin/users', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify(input)
});
}
// ---- mangas / chapters with sync state -------------------------------------
export type MangaSyncState = 'in_progress' | 'dropped' | 'synced';
export type AdminMangaRow = {
id: string;
title: string;
status: string;
cover_image_path: string | null;
created_at: string;
updated_at: string;
sync_state: MangaSyncState;
chapter_count: number;
latest_seen_at: string | null;
};
export type AdminMangasPage = {
items: AdminMangaRow[];
page: Page;
};
export type ListAdminMangasOptions = {
search?: string;
syncState?: MangaSyncState;
limit?: number;
offset?: number;
};
export async function listAdminMangas(
opts: ListAdminMangasOptions = {}
): Promise<AdminMangasPage> {
const params = new URLSearchParams();
if (opts.search) params.set('search', opts.search);
if (opts.syncState) params.set('sync_state', opts.syncState);
if (opts.limit != null) params.set('limit', String(opts.limit));
if (opts.offset != null) params.set('offset', String(opts.offset));
const qs = params.toString();
return request<AdminMangasPage>(`/v1/admin/mangas${qs ? `?${qs}` : ''}`);
}
export type ChapterSyncState =
| 'downloading'
| 'dropped'
| 'failed'
| 'not_downloaded'
| 'synced';
export type AdminChapterRow = {
id: string;
manga_id: string;
number: number;
title: string | null;
page_count: number;
created_at: string;
sync_state: ChapterSyncState;
latest_seen_at: string | null;
};
export type AdminChaptersPage = {
items: AdminChapterRow[];
page: Page;
};
export type ListAdminChaptersOptions = {
limit?: number;
offset?: number;
};
export async function listAdminChapters(
mangaId: string,
opts: ListAdminChaptersOptions = {}
): Promise<AdminChaptersPage> {
const params = new URLSearchParams();
if (opts.limit != null) params.set('limit', String(opts.limit));
if (opts.offset != null) params.set('offset', String(opts.offset));
const qs = params.toString();
return request<AdminChaptersPage>(
`/v1/admin/mangas/${encodeURIComponent(mangaId)}/chapters${qs ? `?${qs}` : ''}`
);
}
// ---- system ----------------------------------------------------------------
export type DiskStats = {
total_bytes: number;
used_bytes: number;
free_bytes: number;
percent_used: number;
};
export type MemoryStats = {
total_bytes: number;
used_bytes: number;
percent_used: number;
};
export type CpuStats = {
percent_used: number;
};
export type Alert = {
level: 'warning';
message: string;
};
export type SystemStats = {
disk: DiskStats | null;
memory: MemoryStats;
cpu: CpuStats;
alerts: Alert[];
};
export async function getSystemStats(): Promise<SystemStats> {
return request<SystemStats>('/v1/admin/system');
}

View File

@@ -14,7 +14,8 @@ import {
me,
changePassword,
createToken,
deleteToken
deleteToken,
getAuthConfig
} from './auth';
function ok(body: unknown, status = 200): Response {
@@ -94,6 +95,11 @@ describe('auth api client', () => {
expect(url).toMatch(/\/v1\/auth\/logout$/);
const init = fetchSpy.mock.calls[0][1] as RequestInit;
expect(init.method).toBe('POST');
// Consistent content-type for all mutation requests, matching
// the rest of the module — axum doesn't require it but the
// header keeps the request style uniform.
const headers = new Headers(init.headers);
expect(headers.get('content-type')).toBe('application/json');
});
it('me returns the user on 200', async () => {
@@ -164,6 +170,17 @@ describe('auth api client', () => {
expect(url).toMatch(/\/v1\/auth\/tokens$/);
});
it('getAuthConfig GETs /v1/auth/config and parses the flag', async () => {
fetchSpy.mockResolvedValueOnce(ok({ self_register_enabled: false }));
const cfg = await getAuthConfig();
expect(cfg.self_register_enabled).toBe(false);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/auth\/config$/);
const init = fetchSpy.mock.calls[0][1] as RequestInit;
// Public endpoint; no method override means default GET.
expect(init?.method ?? 'GET').toBe('GET');
});
it('deleteToken DELETEs to /v1/auth/tokens/{id} and handles 204', async () => {
fetchSpy.mockResolvedValueOnce(noContent());
await expect(deleteToken('t1')).resolves.toBeUndefined();

View File

@@ -4,6 +4,7 @@ export type User = {
id: string;
username: string;
created_at: string;
is_admin: boolean;
};
export type Credentials = {
@@ -32,7 +33,14 @@ export async function login(creds: Credentials): Promise<User> {
}
export async function logout(): Promise<void> {
await request<void>('/v1/auth/logout', { method: 'POST' });
await request<void>('/v1/auth/logout', {
method: 'POST',
// Consistent with the other POST/PATCH helpers in this module.
// axum doesn't require it (no body), but keeping the header
// on every mutation request avoids the false-flag in logs and
// matches the project's style.
headers: { 'content-type': 'application/json' }
});
}
export type ChangePassword = {
@@ -92,3 +100,15 @@ export async function createToken(name: string): Promise<CreatedToken> {
export async function deleteToken(id: string): Promise<void> {
await request<void>(`/v1/auth/tokens/${encodeURIComponent(id)}`, { method: 'DELETE' });
}
export type AuthConfig = {
/** When false, /v1/auth/register returns 403 and the UI should
* hide its register affordance. Admins can still mint accounts
* via POST /v1/admin/users. */
self_register_enabled: boolean;
};
/** Public — no auth, no cookie required. */
export async function getAuthConfig(): Promise<AuthConfig> {
return request<AuthConfig>('/v1/auth/config');
}

View File

@@ -76,17 +76,17 @@ describe('chapters api client', () => {
expect(result.page.total).toBeNull();
});
it('getChapter hits /v1/mangas/{id}/chapters/{n}', async () => {
it('getChapter hits /v1/mangas/{id}/chapters/{chapter_id}', async () => {
fetchSpy.mockResolvedValueOnce(ok(chapterFixture));
const c = await getChapter('m1', 1);
const c = await getChapter('m1', 'ch-uuid-1');
expect(c).toEqual(chapterFixture);
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/1$/);
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/ch-uuid-1$/);
});
it('getChapter surfaces 404 via ApiError.code', async () => {
fetchSpy.mockResolvedValueOnce(envelope(404, 'not_found', 'not found'));
await expect(getChapter('m1', 99)).rejects.toMatchObject({
await expect(getChapter('m1', 'unknown-uuid')).rejects.toMatchObject({
status: 404,
code: 'not_found'
});
@@ -143,10 +143,10 @@ describe('chapters api client', () => {
]
})
);
const pages = await getChapterPages('m1', 1);
const pages = await getChapterPages('m1', 'ch-uuid-1');
expect(pages).toHaveLength(1);
expect(pages[0].storage_key).toContain('0001.png');
const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/1\/pages$/);
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/ch-uuid-1\/pages$/);
});
});

View File

@@ -32,9 +32,9 @@ export async function listChapters(
);
}
export async function getChapter(mangaId: string, number: number): Promise<Chapter> {
export async function getChapter(mangaId: string, chapterId: string): Promise<Chapter> {
return request<Chapter>(
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${number}`
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${encodeURIComponent(chapterId)}`
);
}
@@ -48,10 +48,10 @@ export type ChapterPage = {
export async function getChapterPages(
mangaId: string,
number: number
chapterId: string
): Promise<ChapterPage[]> {
const r = await request<{ pages: ChapterPage[] }>(
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${number}/pages`
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${encodeURIComponent(chapterId)}/pages`
);
return r.pages;
}

Some files were not shown because too many files have changed in this diff Show More