Compare commits

10 Commits

Author SHA1 Message Date
MechaCat02
b845d88766 feat: bookmark create enqueues SyncChapterContent jobs (0.29.0)
After a successful bookmark insert, the create handler spawns a
detached tokio task that calls pipeline::enqueue_pending_for_manga
for every chapter of the manga where page_count = 0 and the source
row is not dropped. Bookmark create returns 201 immediately; enqueue
work happens in the background and its failure is logged without
surfacing to the user (the daily cron sweeps anything missed).

The Phase A dedup index handles re-bookmarks idempotently — deleting
and recreating a bookmark does not duplicate in-flight jobs — and the
Phase B worker pool drains them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:59:14 +02:00
MechaCat02
9fe0f26d75 feat: in-process crawler daemon with cron and worker pool (0.28.0)
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of inactivity.

Modules:
- crawler::browser_manager — lazy-launch / idle-teardown wrapper
  around browser::Handle, with an on_launch hook that re-injects
  PHPSESSID on every fresh Chromium spawn.
- crawler::pipeline — run_metadata_pass (the shared discover/upsert
  /cover/sync-chapters loop) and the enqueue_bookmarked_pending helper
  used by the cron tick.
- crawler::daemon — cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.
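
The lazy-launch and idle-teardown behaviour described for `crawler::browser_manager` can be sketched with the standard library alone. This is only an illustration under stated assumptions: `LazyResource`, `acquire`, and `reap_if_idle` are hypothetical names, a `String` stands in for the Chromium handle, and the clock is passed in explicitly so the logic can be checked without waiting. The real manager is async and wraps `browser::Handle`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a lazily launched browser (not the repository's type).
pub struct LazyResource {
    handle: Option<String>, // `String` stands in for the real browser handle
    last_used: Instant,
    idle_timeout: Duration,
    pub launches: u32, // number of launches so far (each one would run on_launch)
}

impl LazyResource {
    pub fn new(idle_timeout: Duration, now: Instant) -> Self {
        Self { handle: None, last_used: now, idle_timeout, launches: 0 }
    }

    /// Launch on first use, then reuse the running instance.
    pub fn acquire(&mut self, now: Instant) -> &str {
        self.last_used = now;
        if self.handle.is_none() {
            self.launches += 1;
            self.handle = Some(format!("browser-{}", self.launches));
        }
        self.handle.as_deref().unwrap()
    }

    /// Tear down once idle for at least `idle_timeout`; returns true on teardown.
    pub fn reap_if_idle(&mut self, now: Instant) -> bool {
        let idle = now.duration_since(self.last_used);
        if self.handle.is_some() && idle >= self.idle_timeout {
            self.handle = None;
            return true;
        }
        false
    }

    pub fn is_running(&self) -> bool {
        self.handle.is_some()
    }
}
```

Because every relaunch goes through the same `acquire` path, a per-launch hook (such as the PHPSESSID re-injection mentioned above) has a single place to run.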

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers
  idle until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler
  marks the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.
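
The `catch_unwind` guard in the Behavior list can be illustrated with a synchronous, std-only sketch. `JobResult` and `run_guarded` are hypothetical names; the real worker dispatch is async, so it presumably relies on an async-aware equivalent rather than this exact call.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical job outcome used only for this sketch.
#[derive(Debug, PartialEq)]
pub enum JobResult {
    Done,
    Failed(String),
}

/// Run `handler`; a panic becomes `JobResult::Failed` instead of unwinding
/// through (and killing) the worker loop that called us.
pub fn run_guarded<F>(handler: F) -> JobResult
where
    F: FnOnce() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(handler)) {
        Ok(Ok(())) => JobResult::Done,
        Ok(Err(e)) => JobResult::Failed(e),
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            JobResult::Failed(format!("panic: {msg}"))
        }
    }
}
```

The worker can then treat a panic like any other failure and acknowledge the job as failed, which keeps the retry and backoff path uniform.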

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds
the browser via BrowserManager so the on_launch session injection
flow stays in one place. Inline chapter-content sync semantics are
unchanged: the queue is for the daemon, while force-refetches and manual
backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 20:32:02 +02:00
MechaCat02
93c7fd63fc feat: crawler job queue ops and dedup index (0.27.0)
Adds enqueue / lease / ack_done / ack_failed / release / reap_done on
crawler::jobs, backed by the existing crawler_jobs table. lease() uses
a single FOR UPDATE SKIP LOCKED CTE that also re-claims stale running
rows (crashed-worker recovery), and ack_failed applies an exponential
backoff capped at 1h before retrying.
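
A minimal sketch of an exponential backoff capped at one hour follows. Only the 1h cap comes from the commit message; the 30-second base delay, the doubling factor, and the name `backoff_delay` are assumptions made for illustration.

```rust
use std::time::Duration;

/// Assumed base delay; the commit message only states the 1h cap.
const BASE_DELAY: Duration = Duration::from_secs(30);
/// Cap taken from the commit message.
const MAX_DELAY: Duration = Duration::from_secs(3600);

/// Delay before the retry that follows failure number `attempts` (1-based):
/// BASE_DELAY * 2^(attempts - 1), never more than MAX_DELAY.
pub fn backoff_delay(attempts: u32) -> Duration {
    // Clamp the exponent so the shift cannot overflow a u64.
    let exp = attempts.saturating_sub(1).min(31);
    let factor = 1u64 << exp;
    let secs = BASE_DELAY.as_secs().saturating_mul(factor);
    Duration::from_secs(secs).min(MAX_DELAY)
}
```

With these assumed constants the delays run 30s, 60s, 120s and so on, reaching the cap from the eighth failure onwards.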

Migration 0014 adds a partial unique index on
(payload->>'chapter_id') restricted to (pending|running)
sync_chapter_content jobs, so producers can just
INSERT ... ON CONFLICT DO NOTHING without racing each other. The slot
frees again the moment the job leaves the in-flight states, so a
future force-refetch can re-enqueue.

Library-only — no daemon, no API hook. Those land in the next two
phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 19:59:09 +02:00
MechaCat02
89b84252a5 bugfix: subquery-wrap pending chapters query so DISTINCT + ORDER BY agree (0.26.1)
PG rejects `SELECT DISTINCT c.id, c.manga_id, cs.source_url ... ORDER BY
c.manga_id, c.created_at` because the ORDER BY references a column not in
the DISTINCT projection. Wrap the DISTINCT in a subquery (which includes
created_at) and apply the ORDER BY in the outer SELECT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 22:20:15 +02:00
MechaCat02
728d704a66 feat: CRAWLER_KEEP_BROWSER_OPEN waits for Ctrl+C in headed mode (0.26.0)
Debug aid: when set in headed mode, the crawler blocks on Ctrl+C at
every shutdown point (early auth bails + normal completion) instead
of closing the browser immediately. Operator can inspect DOM, cookies,
and network state in the visible Chromium window before exit.

Ignored in headless (no window to inspect) — logged as a warning if
set under headless so the operator doesn't sit waiting.

chromiumoxide's `Browser` is `kill_on_drop`, so the close-or-wait
helper must await Ctrl+C *before* the Handle is dropped — otherwise
the Chromium child gets killed out from under the operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 21:33:18 +02:00
MechaCat02
d24e68c78d feat: chapter content sync via PHPSESSID + per-host pacing (0.25.0)
After the metadata pass, the crawler now fetches per-chapter image
content for chapters belonging to bookmarked mangas. Logged-in chapter
pages render every page image at once (no per-page navigation), so the
crawler reuses the operator's browser session via a pasted PHPSESSID
cookie. Each chapter sync is a single transaction: storage puts + page
row inserts + page_count update commit together, or roll back together
on any image error so the chapter stays at page_count=0 and is retried
next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host,
  with optional per-host overrides. Replaces the single shared
  `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget;
  default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL
  (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects
  PHPSESSID into the Chromium cookie store, probes `#avatar_menu` at
  startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN`
  id-based sorting (DOM order isn't trusted), then performs the
  atomic chapter sync.
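
The id-based page ordering mentioned for `content` can be sketched as below. The exact id format (`page` followed by digits), the decision to drop non-matching ids, and the names `page_number` and `sort_pages` are assumptions; the selector parsing itself is not shown.

```rust
/// Parse the numeric suffix of an element id such as `page12` (assumed format).
fn page_number(id: &str) -> Option<u32> {
    id.strip_prefix("page")?.parse().ok()
}

/// Sort `(element_id, image_url)` pairs by page number rather than DOM order.
/// Ids that do not look like `pageN` are dropped in this sketch.
pub fn sort_pages(images: Vec<(String, String)>) -> Vec<(u32, String)> {
    let mut pages: Vec<(u32, String)> = images
        .into_iter()
        .filter_map(|(id, url)| page_number(&id).map(|n| (n, url)))
        .collect();
    pages.sort_by_key(|(n, _)| *n);
    pages
}
```

Sorting numerically matters because a plain string sort would place `page10` before `page2`.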

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::for_each_concurrent`
  (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across
  workers — chromiumoxide allows concurrent `new_page` on `&self` —
  and per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID
  for the catalog domain only (CDN intentionally not given the
  cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`,
  `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`,
  `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`,
  `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on
  every chapter page nav; first failure aborts the phase with a
  cookie-refresh message.
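
Why a per-host limiter bounds total request rate regardless of worker count can be seen in a small slot-reservation sketch. `HostPacer` and `reserve` are hypothetical names, the host names are placeholders, and the clock is passed in explicitly; the repository's `HostRateLimiters` is async and may be implemented differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-host pacer: one "next free slot" per URL host.
pub struct HostPacer {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    next_free: HashMap<String, Instant>,
}

impl HostPacer {
    pub fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            next_free: HashMap::new(),
        }
    }

    /// Give one host its own interval (for example a CDN).
    pub fn with_override(mut self, host: &str, interval: Duration) -> Self {
        self.overrides.insert(host.to_string(), interval);
        self
    }

    /// Reserve the next slot for `host` and return how long the caller must wait.
    /// Every worker reserves through the same pacer, so adding workers only
    /// lengthens the queue of slots; it does not raise the request rate.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        let interval = *self.overrides.get(host).unwrap_or(&self.default_interval);
        let slot = match self.next_free.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_free.insert(host.to_string(), slot + interval);
        slot - now
    }
}
```

Hosts are independent here, so a slow catalog budget does not delay CDN fetches and vice versa.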

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked
chapters with `page_count = 0` are queried at the start of the
chapter-content phase. Sync-on-bookmark via an API hook is deferred
to a follow-up branch — that needs a daemon consumer of crawler_jobs,
which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:28:36 +02:00
MechaCat02
51346227dd feat: route reader by chapter id, allow duplicate-numbered chapters (0.24.0)
Real-world sources publish multiple chapters at the same number:
different scanlators ("Ch.52 from bloomingdale" + "Ch.52 from mina"),
translator notices and farewells, alt-translations. The (manga_id,
number) UNIQUE constraint from 0001 silently collapsed all of those
into a single row via the upsert path in repo::crawler. Migration 0013
drops the constraint; sync_manga_chapters now plain-INSERTs each
SourceChapterRef so every parsed chapter survives as its own row.

Identity moves from the (manga_id, number) tuple to the chapter UUID:

- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id` (replaces :number)
- `GET /api/v1/mangas/:manga_id/chapters/:chapter_id/pages`
- `repo::chapter::find_by_id_in_manga` (replaces find_by_manga_and_number)
- Frontend reader route renamed to `/manga/[id]/chapter/[chapter_id]`
- Chapter links throughout (manga page list, continue-reading CTA,
  reader prev/next, history rows, bookmark cards) use chapter.id
- API clients getChapter/getChapterPages take a chapter id string

read_progress + bookmarks already FK chapter_id; they only enrich with
chapter_number for display, which is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:37:07 +02:00
MechaCat02
c51353ead3 bugfix: chapter source key uses chapter id, not /pg-1/ (0.23.1)
Listing links point at the reader's page 1
(`.../uu/br_chapter-N/pg-1/`). The generic `derive_key_from_url` took
the last URL segment and returned `"pg-1"` for every chapter, so all
parsed chapters collapsed onto a single `chapter_sources` row downstream
and only the first manga's chapter survived. New
`derive_chapter_key_from_url` strips a trailing `/pg-\d+/` before
picking the chapter-identifying segment (`br_chapter-N` / `to_chapter-N`).
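
The key derivation described above might look roughly like the following. This is not the repository's `derive_chapter_key_from_url`; `chapter_key_from_url` and `is_page_segment` are hypothetical names, the example URLs are placeholders, and the sketch assumes the URL has a non-empty path and does no real URL parsing.

```rust
/// True for path segments of the form `pg-<digits>`.
fn is_page_segment(seg: &str) -> bool {
    seg.strip_prefix("pg-")
        .map_or(false, |n| !n.is_empty() && n.bytes().all(|b| b.is_ascii_digit()))
}

/// Return the chapter-identifying segment, skipping a trailing `/pg-N/`.
pub fn chapter_key_from_url(url: &str) -> Option<&str> {
    // Ignore any query string or fragment.
    let path = url.split(|c: char| c == '?' || c == '#').next()?;
    // Walk non-empty path segments from the end.
    let mut segments = path.split('/').filter(|s| !s.is_empty()).rev();
    let last = segments.next()?;
    if is_page_segment(last) {
        segments.next()
    } else {
        Some(last)
    }
}
```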

Notices, hiatus rows, and duplicate-numbered chapters are preserved as
distinct parser entries. The (manga_id, number) UNIQUE collapse in the
chapters table is a separate, follow-up concern handled in
feat/chapter-id-routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 23:15:36 +02:00
MechaCat02
b1a3a4e9d3 feat: crawler manga-list & metadata sync with cover download (0.23.0)
- TargetSource: first concrete impl of the Source trait, modeled on
  the old Puppeteer crawler's selectors (+ status normalization,
  tag-count stripping, chapter list)
- DiscoverMode::Backfill walks pagination last->1, reverse within each
  page (oldest-first); Incremental walks forward
- RateLimiter (tokio-time aware) plumbed through FetchContext so the
  pagination walk honors the same per-host budget as the outer loop
- repo::crawler: ensure_source, upsert_manga_from_source (returns
  New/Updated/Unchanged + current cover_image_path for backfill
  decisions), sync_manga_chapters, mark_dropped_mangas — all
  transactional, with case-insensitive lookups and source-insertable
  genres
- Cover image download via reqwest+infer; stored under
  mangas/{id}/cover.{ext} via the Storage trait
- Single CRAWLER_PROXY env wires both Chromium (--proxy-server) and
  reqwest::Proxy::all (HTTP/HTTPS/SOCKS5)
- Crawler binary: positional start URL or $CRAWLER_START_URL,
  $CRAWLER_LIMIT (cap fetches + skip drop pass on partial runs),
  $CRAWLER_SKIP_CHAPTERS (disable selector AND sync), $CRAWLER_RATE_MS
- Silences chromiumoxide 0.7's known CDP deserialize log spam via
  default tracing filter + CdpError::Serde downgrade
- 9 sqlx integration tests + 11 selector/rate-limit unit tests
2026-05-21 22:04:23 +02:00
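
The two discovery orders named in commit b1a3a4e9d3 can be sketched over already-fetched listing pages. The sketch assumes that `pages[0]` is listing page 1 and that each page lists newest entries first; `backfill_order` and `incremental_order` are hypothetical helper names, not the crate's API.

```rust
/// Backfill: walk pages last -> 1 and reverse within each page, which yields
/// entries oldest-first under the newest-first listing assumption.
pub fn backfill_order<T>(pages: Vec<Vec<T>>) -> Vec<T> {
    pages
        .into_iter()
        .rev()
        .flat_map(|page| page.into_iter().rev())
        .collect()
}

/// Incremental: walk pages forward in listing order.
pub fn incremental_order<T>(pages: Vec<Vec<T>>) -> Vec<T> {
    pages.into_iter().flatten().collect()
}
```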
MechaCat02
26eccd0abe feat: crawler scaffold with chromium launcher (0.22.0)
- crawler module (browser, source trait, jobs, diff) + binary
- chromiumoxide launcher with fetcher feature (auto-downloads
  Chromium on first run, caches under ~/.cache/mangalord/chromium)
- LaunchOptions struct with extra_args, parseable from
  CRAWLER_BROWSER_MODE and CRAWLER_BROWSER_ARGS
- migration 0012 introduces sources, manga_sources,
  chapter_sources, crawler_jobs
- integration tests for headed + headless launch, ipify load+parse,
  and extra-args propagation (all #[ignore], opt-in)
2026-05-20 22:07:56 +02:00
47 changed files with 8035 additions and 142 deletions

1381
backend/Cargo.lock generated

File diff suppressed because it is too large

@@ -1,7 +1,8 @@
 [package]
 name = "mangalord"
-version = "0.21.3"
+version = "0.29.0"
 edition = "2021"
+default-run = "mangalord"
 [lib]
 path = "src/lib.rs"
@@ -10,6 +11,10 @@ path = "src/lib.rs"
 name = "mangalord"
 path = "src/main.rs"
+[[bin]]
+name = "crawler"
+path = "src/bin/crawler.rs"
 [dependencies]
 axum = { version = "0.7", features = ["macros", "multipart"] }
 tokio = { version = "1", features = ["full"] }
@@ -18,6 +23,7 @@ serde = { version = "1", features = ["derive"] }
 serde_json = "1"
 uuid = { version = "1", features = ["v4", "serde"] }
 chrono = { version = "0.4", features = ["serde"] }
+chrono-tz = "0.9"
 tracing = "0.1"
 tracing-subscriber = { version = "0.3", features = ["env-filter"] }
 tower = { version = "0.5", features = ["util"] }
@@ -36,7 +42,11 @@ time = "0.3"
 infer = "0.16"
 tokio-util = { version = "0.7", features = ["io"] }
 futures-core = "0.3"
+futures-util = "0.3"
 bytes = "1"
+chromiumoxide = { version = "0.7", features = ["tokio-runtime", "_fetcher-rusttls-tokio"], default-features = false }
+scraper = "0.20"
+reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies"] }
 [dev-dependencies]
 tempfile = "3"
@@ -44,3 +54,4 @@ tower = { version = "0.5", features = ["util"] }
 http-body-util = "0.1"
 mime = "0.3"
 futures-util = "0.3"
+tokio = { version = "1", features = ["test-util"] }

@@ -0,0 +1,72 @@
-- Crawler tables.
--
-- Same philosophy as 0001_init.sql: new concepts go in new tables
-- joined to existing ones, not jammed onto `mangas`/`chapters`. A
-- crawled manga IS a manga; the only thing the source-link tables
-- carry is "where did this come from and when did we last see it".
-- That keeps the API and frontend source-agnostic.
-- 1. Source registry. One row per site the crawler knows about.
-- `config` carries per-site knobs (base URL, rate limits, custom
-- selectors) so adding a source is a row insert plus a `Source`
-- trait impl — no schema change.
CREATE TABLE sources (
id text PRIMARY KEY,
name text NOT NULL,
base_url text NOT NULL,
enabled boolean NOT NULL DEFAULT true,
config jsonb NOT NULL DEFAULT '{}'::jsonb,
created_at timestamptz NOT NULL DEFAULT now()
);
-- 2. Link tables. `(source_id, source_*_key)` is the natural key the
-- source itself exposes; the FK to `mangas`/`chapters` is what
-- threads it back into our domain. `metadata_hash` is the signal
-- used by `crawler::diff` to detect updates without re-comparing
-- every field. `last_seen_at` + `dropped_at` is the soft-drop pair.
CREATE TABLE manga_sources (
source_id text NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
source_manga_key text NOT NULL,
manga_id uuid NOT NULL REFERENCES mangas(id) ON DELETE CASCADE,
source_url text NOT NULL,
metadata_hash text,
first_seen_at timestamptz NOT NULL DEFAULT now(),
last_seen_at timestamptz NOT NULL DEFAULT now(),
dropped_at timestamptz,
PRIMARY KEY (source_id, source_manga_key)
);
CREATE INDEX manga_sources_manga_idx ON manga_sources (manga_id);
CREATE INDEX manga_sources_last_seen_idx ON manga_sources (source_id, last_seen_at);
CREATE TABLE chapter_sources (
source_id text NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
source_chapter_key text NOT NULL,
chapter_id uuid NOT NULL REFERENCES chapters(id) ON DELETE CASCADE,
source_url text NOT NULL,
first_seen_at timestamptz NOT NULL DEFAULT now(),
last_seen_at timestamptz NOT NULL DEFAULT now(),
dropped_at timestamptz,
PRIMARY KEY (source_id, source_chapter_key)
);
CREATE INDEX chapter_sources_chapter_idx ON chapter_sources (chapter_id);
-- 3. Persistent job queue. Workers lease with
-- `FOR UPDATE SKIP LOCKED`, heartbeat via `leased_until`, and ack
-- by transitioning state. The partial index keeps the hot path
-- (pick the next ready job) off the bulk of done/dead rows.
CREATE TABLE crawler_jobs (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
payload jsonb NOT NULL,
state text NOT NULL DEFAULT 'pending'
CHECK (state IN ('pending','running','done','failed','dead')),
attempts integer NOT NULL DEFAULT 0,
max_attempts integer NOT NULL DEFAULT 5,
scheduled_at timestamptz NOT NULL DEFAULT now(),
leased_until timestamptz,
last_error text,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX crawler_jobs_ready_idx
ON crawler_jobs (scheduled_at)
WHERE state IN ('pending', 'failed');

@@ -0,0 +1,18 @@
-- Real-world sources publish multiple chapters at the same number:
-- different uploaders, translator notices/farewells, paid-vs-free
-- re-uploads, and our own users can legitimately have two versions of
-- "Ch.52" with different scanlations. The (manga_id, number) UNIQUE
-- from 0001_init silently collapses all of those into a single row via
-- ON CONFLICT, dropping data. Drop the constraint and lean on the
-- chapter id (UUID) as the only chapter identity going forward.
ALTER TABLE chapters DROP CONSTRAINT chapters_manga_id_number_key;
-- The UNIQUE was also our only index on (manga_id, number) since
-- 0007 dropped the redundant explicit one. Chapter list pages
-- ORDER BY number ASC and the manga page is a hot read path, so put
-- the index back without the uniqueness. Secondary sort by created_at
-- so duplicate-numbered chapters have a stable order in lists and
-- prev/next navigation.
CREATE INDEX chapters_manga_id_number_idx
ON chapters (manga_id, number, created_at);
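
To illustrate why the secondary sort on `created_at` gives duplicate-numbered chapters a stable order, here is a small in-memory sketch of list ordering and prev/next lookup by chapter id. The `Chapter` struct, `reading_order`, and `neighbours` are hypothetical; integers stand in for the UUID and the timestamp.

```rust
/// Hypothetical in-memory chapter row.
#[derive(Debug, Clone, PartialEq)]
pub struct Chapter {
    pub id: u32,         // stand-in for the chapter UUID
    pub number: i32,     // may repeat within one manga
    pub created_at: u64, // stand-in for timestamptz
}

/// Order by (number, created_at), mirroring the index columns after manga_id.
pub fn reading_order(mut chapters: Vec<Chapter>) -> Vec<Chapter> {
    chapters.sort_by_key(|c| (c.number, c.created_at));
    chapters
}

/// Return (previous, next) chapter ids around `current`, or None if it is absent.
pub fn neighbours(ordered: &[Chapter], current: u32) -> Option<(Option<u32>, Option<u32>)> {
    let i = ordered.iter().position(|c| c.id == current)?;
    let prev = i.checked_sub(1).map(|p| ordered[p].id);
    let next = ordered.get(i + 1).map(|c| c.id);
    Some((prev, next))
}
```

Looking up the current position by id rather than by number is what keeps two "Ch.52" rows from shadowing each other in navigation.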

@@ -0,0 +1,15 @@
-- Dedup SyncChapterContent jobs in flight.
--
-- Without this, the daemon's bookmark/cron enqueue paths would have to do a
-- pre-check + insert race that's incorrect under concurrency. The partial
-- unique index lets both producers use plain `INSERT ... ON CONFLICT DO
-- NOTHING`: at most one (pending|running) job per chapter_id exists, and the
-- slot frees again as soon as the job transitions to done/failed/dead so a
-- re-enqueue is possible after the row is reaped or a force-refetch is wanted.
--
-- Scoped to sync_chapter_content payloads only so Discover / SyncManga /
-- SyncChapterList jobs (which don't carry a chapter_id) remain un-deduped.
CREATE UNIQUE INDEX crawler_jobs_chapter_content_dedup_idx
ON crawler_jobs ((payload->>'chapter_id'))
WHERE state IN ('pending', 'running')
AND payload->>'kind' = 'sync_chapter_content';
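
The effect of the partial unique index can be modelled in memory: at most one pending or running job per chapter, with the slot freed as soon as the job leaves those states. This is a simplified model rather than the queue implementation; `Queue`, `enqueue`, and `transition` are hypothetical names and chapter ids are plain strings.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Pending,
    Running,
    Done,
    Failed,
    Dead,
}

/// In-memory model of the index: `in_flight` plays the role of the unique
/// index over chapter ids whose job is pending or running.
#[derive(Default)]
pub struct Queue {
    in_flight: HashSet<String>,
    pub rows: Vec<(String, State)>, // every job row, including finished ones
}

impl Queue {
    /// Model of `INSERT ... ON CONFLICT DO NOTHING`: returns false on conflict.
    pub fn enqueue(&mut self, chapter_id: &str) -> bool {
        if !self.in_flight.insert(chapter_id.to_string()) {
            return false;
        }
        self.rows.push((chapter_id.to_string(), State::Pending));
        true
    }

    /// Move the in-flight job for `chapter_id` to `to`. Leaving the
    /// pending/running states frees the dedup slot.
    pub fn transition(&mut self, chapter_id: &str, to: State) {
        for (id, state) in self.rows.iter_mut() {
            if id.as_str() == chapter_id && matches!(*state, State::Pending | State::Running) {
                *state = to;
            }
        }
        if !matches!(to, State::Pending | State::Running) {
            self.in_flight.remove(chapter_id);
        }
    }
}
```

Producers never need a pre-check in this model: a duplicate enqueue is simply reported as skipped.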

@@ -0,0 +1,12 @@
-- Small key-value table for daemon state that needs to survive restarts.
--
-- Used so far only by the cron scheduler (`last_metadata_tick_at`) so it can
-- detect that the most recent slot was missed (e.g. the backend was down at
-- midnight) and fire immediately on startup before resuming the regular
-- schedule. JSONB on the value column lets future keys carry richer payloads
-- without another migration.
CREATE TABLE crawler_state (
key text PRIMARY KEY,
value jsonb NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
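
The missed-slot check that the `last_metadata_tick_at` watermark enables can be sketched with plain Unix seconds. Several things here are assumptions: the sketch works in UTC only (the daemon evaluates `CRAWLER_DAILY_AT` in `CRAWLER_TZ`), it treats a missing watermark as "missed", and `most_recent_slot` and `needs_catch_up` are hypothetical names.

```rust
const DAY: i64 = 86_400;

/// Most recent daily slot at or before `now`.
/// `now` is Unix seconds; `daily_at` is seconds after midnight (UTC only here).
pub fn most_recent_slot(now: i64, daily_at: i64) -> i64 {
    let midnight = now - now.rem_euclid(DAY);
    let today = midnight + daily_at;
    if today <= now {
        today
    } else {
        today - DAY
    }
}

/// True when the persisted watermark is older than the most recent slot,
/// meaning that slot was missed and a tick should fire right away.
pub fn needs_catch_up(last_tick: Option<i64>, now: i64, daily_at: i64) -> bool {
    match last_tick {
        // Assumption: with no watermark yet, run immediately.
        None => true,
        Some(t) => t < most_recent_slot(now, daily_at),
    }
}
```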

@@ -13,6 +13,7 @@ use uuid::Uuid;
 use crate::api::pagination::PagedResponse;
 use crate::app::AppState;
 use crate::auth::extractor::CurrentUser;
+use crate::crawler::pipeline;
 use crate::domain::{Bookmark, BookmarkSummary};
 use crate::error::{AppError, AppResult};
 use crate::repo;
@@ -86,6 +87,29 @@ async fn create(
         input.page,
     )
     .await?;
+    // Fire-and-forget: kick off content syncs for any pending chapters of
+    // the newly-bookmarked manga. The dedup index makes this idempotent
+    // across repeated bookmarks of the same manga; failure here must not
+    // surface to the user (the daily cron sweeps anything missed).
+    let pool = state.db.clone();
+    let manga_id = input.manga_id;
+    tokio::spawn(async move {
+        match pipeline::enqueue_pending_for_manga(&pool, manga_id).await {
+            Ok(summary) => tracing::info!(
+                %manga_id,
+                inserted = summary.inserted,
+                skipped = summary.skipped,
+                failed = summary.failed,
+                "bookmark hook: enqueued pending chapters"
+            ),
+            Err(e) => tracing::warn!(
+                %manga_id, error = ?e,
+                "bookmark hook: enqueue_pending_for_manga failed"
+            ),
+        }
+    });
     Ok((StatusCode::CREATED, Json(bookmark)))
 }

@@ -26,9 +26,9 @@ use crate::upload::{parse_image, UploadedImage};
 pub fn routes() -> Router<AppState> {
     Router::new()
         .route("/mangas/:manga_id/chapters", get(list).post(create))
-        .route("/mangas/:manga_id/chapters/:number", get(get_one))
+        .route("/mangas/:manga_id/chapters/:chapter_id", get(get_one))
         .route(
-            "/mangas/:manga_id/chapters/:number/pages",
+            "/mangas/:manga_id/chapters/:chapter_id/pages",
             get(list_pages),
         )
 }
@@ -60,10 +60,10 @@ async fn list(
 async fn get_one(
     State(state): State<AppState>,
-    Path((manga_id, number)): Path<(Uuid, i32)>,
+    Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
 ) -> AppResult<Json<Chapter>> {
     repo::manga::get(&state.db, manga_id).await?;
-    let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
+    let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
         .await?
         .ok_or(AppError::NotFound)?;
     Ok(Json(chapter))
@@ -164,10 +164,10 @@ struct PagesResponse {
 async fn list_pages(
     State(state): State<AppState>,
-    Path((manga_id, number)): Path<(Uuid, i32)>,
+    Path((manga_id, chapter_id)): Path<(Uuid, Uuid)>,
 ) -> AppResult<Json<PagesResponse>> {
     repo::manga::get(&state.db, manga_id).await?;
-    let chapter = repo::chapter::find_by_manga_and_number(&state.db, manga_id, number)
+    let chapter = repo::chapter::find_by_id_in_manga(&state.db, manga_id, chapter_id)
         .await?
         .ok_or(AppError::NotFound)?;
     let pages = repo::page::list_for_chapter(&state.db, chapter.id).await?;

@@ -1,14 +1,25 @@
 use std::sync::Arc;
+use std::sync::atomic::AtomicBool;
+use anyhow::Context;
+use async_trait::async_trait;
 use axum::extract::DefaultBodyLimit;
 use axum::http::{HeaderName, HeaderValue, Method};
 use axum::Router;
 use sqlx::postgres::PgPoolOptions;
 use sqlx::PgPool;
+use tokio_util::sync::CancellationToken;
 use tower_http::cors::{AllowOrigin, CorsLayer};
 use tower_http::trace::TraceLayer;
-use crate::config::{AuthConfig, Config, UploadConfig};
+use crate::config::{AuthConfig, Config, CrawlerConfig, UploadConfig};
+use crate::crawler::browser_manager::{self, BrowserManager};
+use crate::crawler::content::{self, SyncOutcome};
+use crate::crawler::daemon::{self, ChapterDispatcher, DaemonConfig, MetadataPass};
+use crate::crawler::jobs::JobPayload;
+use crate::crawler::pipeline::{self, MetadataStats};
+use crate::crawler::rate_limit::HostRateLimiters;
+use crate::crawler::session;
 use crate::storage::{LocalStorage, Storage};
 #[derive(Clone)]
@@ -19,7 +30,23 @@ pub struct AppState {
     pub upload: UploadConfig,
 }
-pub async fn build(config: Config) -> anyhow::Result<Router> {
+/// Bundle returned by [`build`]. The router is what `axum::serve` consumes;
+/// the daemon (when enabled) outlives the HTTP server and is awaited via
+/// [`AppHandle::shutdown`] after the listener has finished gracefully.
+pub struct AppHandle {
+    pub router: Router,
+    pub daemon: Option<daemon::DaemonHandle>,
+}
+impl AppHandle {
+    pub async fn shutdown(self) {
+        if let Some(d) = self.daemon {
+            d.shutdown().await;
+        }
+    }
+}
+pub async fn build(config: Config) -> anyhow::Result<AppHandle> {
     let db = PgPoolOptions::new()
         .max_connections(10)
         .connect(&config.database_url)
@@ -28,13 +55,235 @@ pub async fn build(config: Config) -> anyhow::Result<Router> {
     let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(config.storage_dir.clone()));
+    let daemon = if config.crawler.daemon_enabled {
+        Some(spawn_crawler_daemon(db.clone(), Arc::clone(&storage), &config.crawler).await?)
+    } else {
+        tracing::info!("crawler daemon disabled (CRAWLER_DAEMON=false)");
+        None
+    };
     let state = AppState {
         db,
         storage,
         auth: config.auth.clone(),
         upload: config.upload.clone(),
     };
-    Ok(router(state).layer(cors_layer(&config.cors_allowed_origins)))
+    let router = router(state).layer(cors_layer(&config.cors_allowed_origins));
+    Ok(AppHandle { router, daemon })
+}
+async fn spawn_crawler_daemon(
+    db: PgPool,
+    storage: Arc<dyn Storage>,
+    cfg: &CrawlerConfig,
+) -> anyhow::Result<daemon::DaemonHandle> {
+    // Reqwest client with cookie jar pre-seeded so CDN image fetches
+    // include PHPSESSID. Same shape as bin/crawler.rs main().
+    let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
+    if let (Some(sid), Some(domain), Some(start_url)) =
+        (&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url)
+    {
+        let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
+        let seed_url = reqwest::Url::parse(start_url)
+            .context("parse CRAWLER_START_URL for cookie seed")?;
+        cookie_jar.add_cookie_str(&cookie_str, &seed_url);
+    }
+    let mut http_builder = reqwest::Client::builder()
+        .timeout(std::time::Duration::from_secs(30))
+        .no_proxy()
+        .cookie_provider(cookie_jar);
+    if let Some(ua) = &cfg.user_agent {
+        http_builder = http_builder.user_agent(ua);
+    }
+    if let Some(proxy) = &cfg.proxy {
+        http_builder = http_builder
+            .proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy: {proxy}"))?);
+    }
+    let http = http_builder.build().context("build crawler reqwest")?;
+    let mut rate = HostRateLimiters::new(std::time::Duration::from_millis(cfg.rate_ms));
+    if let Some(host) = &cfg.cdn_host {
+        rate = rate.with_override(host, std::time::Duration::from_millis(cfg.cdn_rate_ms));
+    }
+    let rate = Arc::new(rate);
+    // Browser manager. on_launch re-injects PHPSESSID on every fresh
+    // chromium spawn so an idle teardown followed by re-launch stays
+    // authenticated without operator action.
+    let mut launch_opts = cfg.browser.clone();
+    if let Some(proxy) = &cfg.proxy {
+        launch_opts.extra_args.push(format!("--proxy-server={proxy}"));
+    }
+    let on_launch = match (&cfg.phpsessid, &cfg.cookie_domain, &cfg.start_url) {
+        (Some(sid), Some(domain), Some(start_url)) => {
+            let sid = sid.clone();
+            let domain = domain.clone();
+            let start_url = start_url.clone();
+            let on_launch: browser_manager::OnLaunch = Arc::new(move |browser| {
+                let sid = sid.clone();
+                let domain = domain.clone();
+                let start_url = start_url.clone();
+                Box::pin(async move {
+                    session::inject_phpsessid(&browser, &sid, &domain)
+                        .await
+                        .context("on_launch: inject_phpsessid")?;
+                    session::verify_session(&browser, &start_url)
+                        .await
+                        .context("on_launch: verify_session")?;
+                    Ok(())
+                })
+            });
+            on_launch
+        }
+        _ => browser_manager::noop_on_launch(),
+    };
+    let browser_manager = BrowserManager::new(launch_opts, cfg.idle_timeout, on_launch);
+    let session_expired = Arc::new(AtomicBool::new(false));
+    let metadata_pass: Option<Arc<dyn MetadataPass>> = cfg.start_url.as_ref().map(|url| {
+        let m: Arc<dyn MetadataPass> = Arc::new(RealMetadataPass {
+            browser_manager: Arc::clone(&browser_manager),
+            db: db.clone(),
+            storage: Arc::clone(&storage),
+            http: http.clone(),
+            rate: Arc::clone(&rate),
+            start_url: url.clone(),
+        });
+        m
+    });
+    let dispatcher: Arc<dyn ChapterDispatcher> = Arc::new(RealChapterDispatcher {
+        browser_manager: Arc::clone(&browser_manager),
+        db: db.clone(),
+        storage: Arc::clone(&storage),
+        http,
+        rate: Arc::clone(&rate),
+    });
+    // Shared cancellation: daemon shutdown cancels the BrowserManager's
+    // idle reaper too. Reaper itself is added to the daemon's extra_tasks
+    // so DaemonHandle::shutdown awaits its completion.
+    let cancel = CancellationToken::new();
+    let reaper_task = browser_manager::spawn_idle_reaper(
+        Arc::clone(&browser_manager),
+        cancel.clone(),
+    );
+    // Also close the browser explicitly on shutdown so we don't rely on
+    // kill-on-drop when other Arc<Browser> holders may still exist.
+    let shutdown_task = {
+        let cancel = cancel.clone();
+        let mgr = Arc::clone(&browser_manager);
+        tokio::spawn(async move {
+            cancel.cancelled().await;
+            mgr.shutdown().await;
+        })
+    };
+    let daemon_handle = daemon::spawn(
+        db,
+        cancel,
+        DaemonConfig {
+            metadata_pass,
+            dispatcher,
+            chapter_workers: cfg.chapter_workers,
+            daily_at: cfg.daily_at,
+            tz: cfg.tz,
+            retention_days: cfg.retention_days,
+            session_expired,
+            extra_tasks: vec![reaper_task, shutdown_task],
+        },
+    );
+    Ok(daemon_handle)
+}
+// Real impls of the daemon traits, owning the browser manager + I/O. Kept
+// in app.rs because they need the same builder-side env wiring that
+// AppState gets — the daemon module itself stays free of reqwest / storage
+// details so its tests don't pull them in.
+struct RealMetadataPass {
+    browser_manager: Arc<BrowserManager>,
+    db: PgPool,
+    storage: Arc<dyn Storage>,
+    http: reqwest::Client,
+    rate: Arc<HostRateLimiters>,
+    start_url: String,
+}
+#[async_trait]
+impl MetadataPass for RealMetadataPass {
+    async fn run(&self) -> anyhow::Result<MetadataStats> {
+        pipeline::run_metadata_pass(
+            &self.browser_manager,
+            &self.db,
+            self.storage.as_ref(),
+            &self.http,
+            &self.rate,
+            &self.start_url,
+            0,
+            false,
+        )
+        .await
+    }
+}
+struct RealChapterDispatcher {
+    browser_manager: Arc<BrowserManager>,
+    db: PgPool,
+    storage: Arc<dyn Storage>,
+    http: reqwest::Client,
+    rate: Arc<HostRateLimiters>,
+}
+#[async_trait]
+impl ChapterDispatcher for RealChapterDispatcher {
+    async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
+        match payload {
+            JobPayload::SyncChapterContent {
+                source_id: _,
+                chapter_id,
+                source_chapter_key: _,
+            } => {
+                // Look up manga_id + source_url for this chapter.
+                let row: Option<(uuid::Uuid, String)> = sqlx::query_as(
+                    "SELECT c.manga_id, cs.source_url \
+                     FROM chapters c \
+                     JOIN chapter_sources cs ON cs.chapter_id = c.id \
+                     WHERE c.id = $1 \
+                     LIMIT 1",
+                )
+                .bind(chapter_id)
+                .fetch_optional(&self.db)
+                .await
+                .context("look up chapter for dispatch")?;
+                let Some((manga_id, source_url)) = row else {
+                    // Chapter (or its source row) is gone — ack done.
+                    return Ok(SyncOutcome::Skipped);
+                };
+                let lease = self.browser_manager.acquire().await?;
+                let outcome = content::sync_chapter_content(
+                    &lease,
+                    &self.db,
+                    self.storage.as_ref(),
+                    &self.http,
+                    &self.rate,
+                    chapter_id,
+                    manga_id,
+                    &source_url,
+                    false,
+                )
+                .await?;
+                drop(lease);
+                Ok(outcome)
+            }
+            // Other payload kinds aren't dispatched by this daemon yet —
+            // metadata-driven jobs (Discover/SyncManga/SyncChapterList)
+            // are handled inline by the cron's metadata pass.
+            _ => Ok(SyncOutcome::Skipped),
+        }
+    }
 }
 /// Build a router from a pre-assembled state. Used by integration tests

407
backend/src/bin/crawler.rs Normal file
@@ -0,0 +1,407 @@
//! Crawler binary.
//!
//! Now an ops escape hatch sitting alongside the in-process daemon: walks
//! the source's manga listing (all pages), fetches each manga's metadata +
//! chapter list, downloads covers, reconciles chapters — and then, for any
//! chapter belonging to a bookmarked manga whose `page_count` is still 0,
//! fetches the chapter pages inline. The daemon does the same work through
//! `crawler_jobs`; the CLI is kept around for force-refetches and manual
//! backfills.
//!
//! Configuration mirrors the daemon's `CRAWLER_*` env vars (see
//! `crate::config::CrawlerConfig`) plus the CLI-only:
//! - **Start URL**: first CLI positional arg, else `$CRAWLER_START_URL`.
//! - **Skip chapters / chapter content / force re-fetch / keep browser**:
//! `CRAWLER_SKIP_CHAPTERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`,
//! `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_KEEP_BROWSER_OPEN`.
//! - **Limit**: `CRAWLER_LIMIT` (max manga detail fetches per run).
//!
//! See `crawler::pipeline::run_metadata_pass` for the shared metadata
//! flow.
use std::path::PathBuf;
use std::sync::Arc;
use std::time::Duration;
use anyhow::{anyhow, Context};
use futures_util::stream::{self, StreamExt};
use mangalord::crawler::browser::{BrowserMode, LaunchOptions};
use mangalord::crawler::browser_manager::{self, BrowserManager};
use mangalord::crawler::content::{self, SyncOutcome};
use mangalord::crawler::pipeline;
use mangalord::crawler::rate_limit::HostRateLimiters;
use mangalord::crawler::session;
use mangalord::storage::{LocalStorage, Storage};
use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool;
use tracing_subscriber::EnvFilter;
use uuid::Uuid;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
dotenvy::dotenv().ok();
tracing_subscriber::fmt()
.with_env_filter(
EnvFilter::try_from_default_env().unwrap_or_else(|_| {
"info,mangalord=debug,chromiumoxide::conn=off,chromiumoxide::handler=off"
.into()
}),
)
.init();
let start_url = resolve_start_url()?;
let database_url = std::env::var("DATABASE_URL")
.map_err(|_| anyhow!("DATABASE_URL must be set"))?;
let storage_dir: PathBuf = std::env::var("STORAGE_DIR")
.unwrap_or_else(|_| "./data/storage".to_string())
.into();
let rate_ms = env_u64("CRAWLER_RATE_MS", 1000);
let cdn_host = std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty());
let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms);
let limit = env_u64("CRAWLER_LIMIT", 0) as usize;
let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false);
let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false);
let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize;
let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false);
let phpsessid = std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty());
let cookie_domain = std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty())
.or_else(|| session::registrable_domain(&start_url));
let user_agent = std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty());
let proxy_url = std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty());
let keep_browser_open = env_bool("CRAWLER_KEEP_BROWSER_OPEN", false);
let db = PgPoolOptions::new()
.max_connections(5)
.connect(&database_url)
.await
.context("connect to database")?;
sqlx::migrate!("./migrations").run(&db).await?;
let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(&storage_dir));
let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
let seed_url =
reqwest::Url::parse(&start_url).context("parse start URL for cookie seed")?;
cookie_jar.add_cookie_str(&cookie_str, &seed_url);
tracing::info!(domain, "seeded PHPSESSID into reqwest cookie jar");
}
let mut http_builder = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.no_proxy()
.cookie_provider(cookie_jar);
if let Some(ua) = &user_agent {
http_builder = http_builder.user_agent(ua);
}
if let Some(proxy) = &proxy_url {
http_builder = http_builder
.proxy(reqwest::Proxy::all(proxy).with_context(|| format!("parse proxy URL: {proxy}"))?);
}
let http = http_builder.build().context("build http client")?;
let mut options = LaunchOptions::from_env();
if let Some(proxy) = &proxy_url {
options.extra_args.push(format!("--proxy-server={proxy}"));
}
let keep_open = match (keep_browser_open, options.mode) {
(true, BrowserMode::Headed) => true,
(true, BrowserMode::Headless) => {
tracing::warn!(
"CRAWLER_KEEP_BROWSER_OPEN ignored in headless mode (no window to inspect)"
);
false
}
_ => false,
};
tracing::info!(
?options,
%start_url,
rate_ms,
cdn_host = ?cdn_host,
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content,
chapter_workers,
force_refetch_chapters,
phpsessid_set = phpsessid.is_some(),
cookie_domain = ?cookie_domain,
user_agent = ?user_agent,
proxy = ?proxy_url,
keep_open,
storage_dir = %storage_dir.display(),
"starting crawler"
);
// BrowserManager with idle_timeout = ZERO so the CLI keeps Chromium
// alive for the entire run — same lifecycle as the old direct
// `browser::launch()` flow. on_launch re-injects PHPSESSID + runs the
// session probe; bad cookies fail fast before any real work happens.
let on_launch: browser_manager::OnLaunch = match (&phpsessid, &cookie_domain) {
(Some(sid), Some(domain)) => {
let sid = sid.clone();
let domain = domain.clone();
let start_url_clone = start_url.clone();
Arc::new(move |browser| {
let sid = sid.clone();
let domain = domain.clone();
let start_url = start_url_clone.clone();
Box::pin(async move {
session::inject_phpsessid(&browser, &sid, &domain)
.await
.context("inject_phpsessid")?;
session::verify_session(&browser, &start_url)
.await
.context("verify_session")?;
Ok(())
})
})
}
_ => browser_manager::noop_on_launch(),
};
let session_ready = phpsessid.is_some() && cookie_domain.is_some();
let manager = BrowserManager::new(options, Duration::ZERO, on_launch);
let result = run(
Arc::clone(&manager),
&db,
Arc::clone(&storage),
&http,
&start_url,
rate_ms,
cdn_host.as_deref(),
cdn_rate_ms,
limit,
skip_chapters,
skip_chapter_content || !session_ready,
chapter_workers,
force_refetch_chapters,
)
.await;
if keep_open {
tracing::info!(
"crawler finished; browser kept open. Press Ctrl+C to close and exit."
);
let _ = tokio::signal::ctrl_c().await;
tracing::info!("Ctrl+C received; closing browser");
}
manager.shutdown().await;
result
}
#[allow(clippy::too_many_arguments)]
async fn run(
manager: Arc<BrowserManager>,
db: &PgPool,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
start_url: &str,
rate_ms: u64,
cdn_host: Option<&str>,
cdn_rate_ms: u64,
limit: usize,
skip_chapters: bool,
skip_chapter_content: bool,
chapter_workers: usize,
force_refetch_chapters: bool,
) -> anyhow::Result<()> {
let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms));
if let Some(host) = cdn_host {
rate = rate.with_override(host, Duration::from_millis(cdn_rate_ms));
}
let rate = Arc::new(rate);
let stats = pipeline::run_metadata_pass(
manager.as_ref(),
db,
storage.as_ref(),
http,
rate.as_ref(),
start_url,
limit,
skip_chapters,
)
.await?;
tracing::info!(?stats, "metadata pass complete");
if !skip_chapter_content {
sync_bookmarked_chapter_content(
Arc::clone(&manager),
db,
Arc::clone(&storage),
http,
Arc::clone(&rate),
"target",
chapter_workers,
force_refetch_chapters,
)
.await?;
}
Ok(())
}
/// Find every chapter whose manga is bookmarked by at least one user and
/// that hasn't been content-synced yet, then fan them out across `workers`
/// concurrent tasks. Same as before except the browser comes from a
/// BrowserManager lease so it interleaves cleanly with the metadata pass.
///
/// A `SessionExpired` result aborts the phase.
#[allow(clippy::too_many_arguments)]
async fn sync_bookmarked_chapter_content(
manager: Arc<BrowserManager>,
db: &PgPool,
storage: Arc<dyn Storage>,
http: &reqwest::Client,
rate: Arc<HostRateLimiters>,
source_id: &str,
workers: usize,
force_refetch: bool,
) -> anyhow::Result<()> {
let pending: Vec<(Uuid, Uuid, String)> = sqlx::query_as(
r#"
SELECT id, manga_id, source_url FROM (
SELECT DISTINCT c.id, c.manga_id, c.created_at, cs.source_url
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE cs.source_id = $1
AND cs.dropped_at IS NULL
AND (c.page_count = 0 OR $2)
) sub
ORDER BY manga_id, created_at ASC
"#,
)
.bind(source_id)
.bind(force_refetch)
.fetch_all(db)
.await
.context("query pending chapter content")?;
if pending.is_empty() {
tracing::info!("chapter content: nothing pending");
return Ok(());
}
tracing::info!(count = pending.len(), workers, "chapter content phase starting");
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let stats = std::sync::Mutex::new(WorkerStats::default());
stream::iter(pending.into_iter())
.for_each_concurrent(workers.max(1), |(chapter_id, manga_id, source_url)| {
let session_expired = Arc::clone(&session_expired);
let storage = Arc::clone(&storage);
let rate = Arc::clone(&rate);
let manager = Arc::clone(&manager);
let stats = &stats;
async move {
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
return;
}
let lease = match manager.acquire().await {
Ok(l) => l,
Err(e) => {
tracing::error!(%chapter_id, error = ?e, "browser acquire failed");
let mut s = stats.lock().unwrap();
s.failed += 1;
return;
}
};
let outcome = content::sync_chapter_content(
&lease,
db,
storage.as_ref(),
http,
rate.as_ref(),
chapter_id,
manga_id,
&source_url,
force_refetch,
)
.await;
drop(lease);
let mut s = stats.lock().unwrap();
match outcome {
Ok(SyncOutcome::Fetched { pages }) => {
tracing::info!(%chapter_id, pages, "chapter content fetched");
s.fetched += 1;
}
Ok(SyncOutcome::Skipped) => s.skipped += 1,
Ok(SyncOutcome::SessionExpired) => {
tracing::error!(
%chapter_id,
"session expired mid-run — refresh CRAWLER_PHPSESSID and re-run"
);
session_expired
.store(true, std::sync::atomic::Ordering::Relaxed);
}
Err(e) => {
tracing::warn!(
%chapter_id, error = ?e, "chapter content sync failed"
);
s.failed += 1;
}
}
}
})
.await;
let total = stats.into_inner().unwrap();
tracing::info!(
fetched = total.fetched,
skipped = total.skipped,
failed = total.failed,
"chapter content phase done"
);
if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
anyhow::bail!("session expired during chapter content phase");
}
Ok(())
}
#[derive(Default, Clone, Copy)]
struct WorkerStats {
fetched: usize,
skipped: usize,
failed: usize,
}
fn resolve_start_url() -> anyhow::Result<String> {
if let Some(arg) = std::env::args().nth(1) {
return Ok(arg);
}
std::env::var("CRAWLER_START_URL").map_err(|_| {
anyhow!(
"start URL is required — pass as first CLI arg or set $CRAWLER_START_URL"
)
})
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
fn env_bool(name: &str, default: bool) -> bool {
match std::env::var(name).ok().as_deref() {
Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,
Some("0") | Some("false") | Some("FALSE") | Some("no") => false,
_ => default,
}
}
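
The `env_bool` helper above accepts only a fixed set of spellings; anything else, including unset variables and values such as `True` or `on`, falls back to the default. A minimal sketch of the same match arms as a pure function makes that behavior easy to check without touching the process environment. The name `parse_bool` is hypothetical and does not appear in this diff.

```rust
/// Hypothetical pure variant of `env_bool` (not part of the diff): the raw
/// value is passed in instead of being read from the environment.
fn parse_bool(raw: Option<&str>, default: bool) -> bool {
    match raw {
        Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,
        Some("0") | Some("false") | Some("FALSE") | Some("no") => false,
        // Unset, empty, or unrecognized spellings keep the caller's default.
        _ => default,
    }
}
```

Under this sketch, `parse_bool(Some("True"), false)` stays `false` because mixed-case spellings are not in the accepted set, so a typo in a `CRAWLER_*` flag silently yields the default rather than an error.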


@@ -1,4 +1,10 @@
use std::path::PathBuf;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use crate::crawler::browser::LaunchOptions;
#[derive(Clone, Debug)]
pub struct AuthConfig {
@@ -45,6 +51,54 @@ pub struct Config {
pub auth: AuthConfig,
pub upload: UploadConfig,
pub cors_allowed_origins: Vec<String>,
pub crawler: CrawlerConfig,
}
/// All crawler-daemon knobs read from env. Mirrors the env vars the
/// `bin/crawler` binary already reads, plus the new daemon-only knobs
/// (daily_at, tz, idle_timeout, retention_days, daemon_enabled).
///
/// `daemon_enabled = false` skips the daemon spawn entirely — used by
/// integration tests and dev runs that don't want background activity.
#[derive(Clone, Debug)]
pub struct CrawlerConfig {
pub daemon_enabled: bool,
pub daily_at: NaiveTime,
pub tz: Tz,
pub idle_timeout: Duration,
pub chapter_workers: usize,
pub retention_days: u32,
pub start_url: Option<String>,
pub rate_ms: u64,
pub cdn_host: Option<String>,
pub cdn_rate_ms: u64,
pub phpsessid: Option<String>,
pub cookie_domain: Option<String>,
pub user_agent: Option<String>,
pub proxy: Option<String>,
pub browser: LaunchOptions,
}
impl Default for CrawlerConfig {
fn default() -> Self {
Self {
daemon_enabled: false,
daily_at: NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
tz: Tz::UTC,
idle_timeout: Duration::from_secs(600),
chapter_workers: 1,
retention_days: 7,
start_url: None,
rate_ms: 1000,
cdn_host: None,
cdn_rate_ms: 1000,
phpsessid: None,
cookie_domain: None,
user_agent: None,
proxy: None,
browser: LaunchOptions::headless(),
}
}
}
impl Config {
@@ -77,10 +131,65 @@ impl Config {
.collect()
})
.unwrap_or_default(),
crawler: CrawlerConfig::from_env()?,
})
}
}
impl CrawlerConfig {
pub fn from_env() -> anyhow::Result<Self> {
// Parse CRAWLER_DAILY_AT (HH:MM, 24h). Invalid → fail fast.
let daily_at = match std::env::var("CRAWLER_DAILY_AT").ok().as_deref() {
None | Some("") => NaiveTime::from_hms_opt(0, 0, 0).unwrap(),
Some(raw) => NaiveTime::parse_from_str(raw, "%H:%M").map_err(|e| {
anyhow::anyhow!("CRAWLER_DAILY_AT must be HH:MM (got {raw:?}): {e}")
})?,
};
let tz: Tz = match std::env::var("CRAWLER_TZ").ok().as_deref() {
None | Some("") => Tz::UTC,
Some(raw) => raw
.parse()
.map_err(|e| anyhow::anyhow!("CRAWLER_TZ must be a valid IANA TZ (got {raw:?}): {e}"))?,
};
Ok(Self {
daemon_enabled: env_bool("CRAWLER_DAEMON", true),
daily_at,
tz,
idle_timeout: Duration::from_secs(env_u64("CRAWLER_IDLE_TIMEOUT_S", 600)),
chapter_workers: env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize,
retention_days: env_u64("CRAWLER_JOB_RETENTION_DAYS", 7) as u32,
start_url: std::env::var("CRAWLER_START_URL")
.ok()
.filter(|s| !s.trim().is_empty()),
rate_ms: env_u64("CRAWLER_RATE_MS", 1000),
cdn_host: std::env::var("CRAWLER_CDN_HOST")
.ok()
.filter(|s| !s.trim().is_empty()),
cdn_rate_ms: env_u64("CRAWLER_CDN_RATE_MS", env_u64("CRAWLER_RATE_MS", 1000)),
phpsessid: std::env::var("CRAWLER_PHPSESSID")
.ok()
.filter(|s| !s.trim().is_empty()),
cookie_domain: std::env::var("CRAWLER_COOKIE_DOMAIN")
.ok()
.filter(|s| !s.trim().is_empty()),
user_agent: std::env::var("CRAWLER_USER_AGENT")
.ok()
.filter(|s| !s.trim().is_empty()),
proxy: std::env::var("CRAWLER_PROXY")
.ok()
.filter(|s| !s.trim().is_empty()),
browser: LaunchOptions::from_env(),
})
}
}
fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name)
.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(default)
}
fn env_bool(name: &str, default: bool) -> bool {
match std::env::var(name).ok().as_deref() {
Some("1") | Some("true") | Some("TRUE") | Some("yes") => true,


@@ -0,0 +1,254 @@
//! Chromium launcher and lifecycle.
//!
//! Uses `chromiumoxide`'s `fetcher` feature so we don't depend on a
//! system Chrome install — first call downloads a known-good revision
//! into a cache dir and reuses it forever after. `BrowserMode` toggles
//! headed vs headless; the headed path needs a display (real `$DISPLAY`
//! or `xvfb-run`).
//!
//! Extra Chromium command-line flags can be supplied through
//! [`LaunchOptions::extra_args`] in code, or via the
//! `CRAWLER_BROWSER_ARGS` env var (whitespace-separated) when going
//! through [`LaunchOptions::from_env`]. The launcher always also
//! injects `--no-sandbox` and `--disable-dev-shm-usage` because they're
//! near-mandatory for containerized Chromium; everything else is
//! caller-provided.
use std::path::PathBuf;
use std::sync::Arc;
use anyhow::Context;
use chromiumoxide::browser::{Browser, BrowserConfig};
use chromiumoxide::error::CdpError;
use chromiumoxide::fetcher::{BrowserFetcher, BrowserFetcherOptions};
use futures_util::StreamExt;
use tokio::task::JoinHandle;
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BrowserMode {
/// Real window. Needs `$DISPLAY` (or `xvfb-run` wrapping the
/// binary). This is the default the old Puppeteer crawler used and
/// the assumed mode for the target site until we prove headless
/// works against it.
Headed,
/// No window. Faster, lower resource use, but more likely to trip
/// fingerprinting on hostile sites.
Headless,
}
/// Configuration for a single browser launch.
///
/// Public fields rather than a builder — there are only two of them
/// and callers benefit from struct literal syntax for clarity.
#[derive(Clone, Debug)]
pub struct LaunchOptions {
pub mode: BrowserMode,
/// Extra Chromium flags, appended after the launcher's own
/// defaults. Example: `vec!["--lang=de-DE".into(),
/// "--window-size=1280,800".into()]`.
pub extra_args: Vec<String>,
}
impl LaunchOptions {
pub fn headed() -> Self {
Self {
mode: BrowserMode::Headed,
extra_args: Vec::new(),
}
}
pub fn headless() -> Self {
Self {
mode: BrowserMode::Headless,
extra_args: Vec::new(),
}
}
/// Reads `CRAWLER_BROWSER_MODE` (`headless`|`headed`, default
/// `headed`) and `CRAWLER_BROWSER_ARGS` (whitespace-separated
/// Chromium flags). Flags containing whitespace aren't supported
/// through the env var — use the programmatic API for those.
pub fn from_env() -> Self {
let mode = match std::env::var("CRAWLER_BROWSER_MODE").as_deref() {
Ok("headless") => BrowserMode::Headless,
_ => BrowserMode::Headed,
};
let extra_args = std::env::var("CRAWLER_BROWSER_ARGS")
.map(|s| parse_args(&s))
.unwrap_or_default();
Self { mode, extra_args }
}
}
impl Default for LaunchOptions {
fn default() -> Self {
Self::headed()
}
}
/// Whitespace-split a CRAWLER_BROWSER_ARGS-style string. Exposed
/// separately from `from_env` so it can be unit-tested without
/// touching process environment.
pub(crate) fn parse_args(s: &str) -> Vec<String> {
s.split_whitespace().map(str::to_string).collect()
}
/// Owned browser plus the spawned task that drives its CDP event loop.
/// Dropping `Handle` without calling `close` leaks the Chromium process
/// — always call `close().await` in production paths.
///
/// The browser is stored behind an `Arc` so it can be shared across
/// worker tasks (via [`Handle::shared`]) without copying. `Browser::new_page`
/// only needs `&self`, so multiple workers can drive the same browser
/// concurrently as long as the manager keeps the `Arc` alive.
pub struct Handle {
browser: Arc<Browser>,
driver: JoinHandle<()>,
}
impl Handle {
/// Borrow the browser. Equivalent to `&*handle.shared()`.
pub fn browser(&self) -> &Browser {
&self.browser
}
/// Clone the shared handle. Workers hold these to call `new_page`
/// concurrently. The browser only exits when the last `Arc<Browser>`
/// is dropped (kill-on-drop), or when `close()` is called on the
/// originating `Handle` while it is the sole holder.
pub fn shared(&self) -> Arc<Browser> {
Arc::clone(&self.browser)
}
/// Closes the browser and awaits the driver task. If other Arcs to
/// the browser are still alive we fall back to drop-kills-Chromium
/// semantics and just join the driver — this is the rare case where
/// shutdown raced an outstanding worker; the OS-level kill is the
/// safety net.
pub async fn close(self) -> anyhow::Result<()> {
match Arc::try_unwrap(self.browser) {
Ok(mut owned) => {
let _ = owned.close().await;
let _ = owned.wait().await;
}
Err(shared) => {
tracing::warn!(
strong_count = Arc::strong_count(&shared),
"Handle::close while Arc<Browser> still shared — relying on kill-on-drop"
);
drop(shared);
}
}
let _ = self.driver.await;
Ok(())
}
}
/// Launches Chromium. Downloads it on first run via the `fetcher`
/// feature; subsequent runs hit the cache. The cache dir is
/// `$CRAWLER_CHROMIUM_DIR` if set, else `$HOME/.cache/mangalord/chromium`,
/// else `./.chromium-cache` as a last-resort repo-local fallback.
pub async fn launch(options: LaunchOptions) -> anyhow::Result<Handle> {
let cache = cache_dir()?;
tokio::fs::create_dir_all(&cache)
.await
.with_context(|| format!("create cache dir {}", cache.display()))?;
let fetcher = BrowserFetcher::new(
BrowserFetcherOptions::builder()
.with_path(&cache)
.build()
.map_err(|e| anyhow::anyhow!("fetcher options: {e}"))?,
);
tracing::info!(path = %cache.display(), "ensuring chromium revision is present");
let info = fetcher
.fetch()
.await
.context("download chromium via fetcher")?;
tracing::info!(executable = %info.executable_path.display(), "chromium ready");
let mut builder = BrowserConfig::builder()
.chrome_executable(info.executable_path)
// Linux containers / CI commonly lack the user namespaces
// Chromium's sandbox wants. Disable it; the crawler runs in its
// own container anyway.
.arg("--no-sandbox")
.arg("--disable-dev-shm-usage");
for arg in &options.extra_args {
builder = builder.arg(arg);
}
if matches!(options.mode, BrowserMode::Headed) {
builder = builder.with_head();
}
tracing::info!(
mode = ?options.mode,
extra_args = ?options.extra_args,
"building browser config"
);
let config = builder
.build()
.map_err(|e| anyhow::anyhow!("browser config: {e}"))?;
let (browser, mut handler) = Browser::launch(config)
.await
.context("launch chromium")?;
let driver = tokio::spawn(async move {
while let Some(event) = handler.next().await {
match event {
Ok(_) => {}
// chromiumoxide 0.7 ships fixed CDP type bindings, so any
// CDP event Chrome added later fails to deserialize. The
// connection is unaffected — these are noise. Suppress
// them so real failures stay visible.
Err(CdpError::Serde(_)) => {
tracing::trace!("chromium emitted an unrecognized CDP event");
}
Err(err) => tracing::warn!(?err, "chromium handler event error"),
}
}
});
Ok(Handle {
browser: Arc::new(browser),
driver,
})
}
fn cache_dir() -> anyhow::Result<PathBuf> {
if let Ok(dir) = std::env::var("CRAWLER_CHROMIUM_DIR") {
return Ok(PathBuf::from(dir));
}
if let Ok(home) = std::env::var("HOME") {
return Ok(PathBuf::from(home).join(".cache/mangalord/chromium"));
}
Ok(PathBuf::from("./.chromium-cache"))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_args_splits_on_whitespace() {
assert_eq!(
parse_args("--lang=de-DE --window-size=1280,800"),
vec!["--lang=de-DE", "--window-size=1280,800"]
);
}
#[test]
fn parse_args_tolerates_irregular_whitespace() {
// tabs, multiple spaces, leading/trailing — all collapsed.
assert_eq!(
parse_args(" --a\t--b --c=1\n"),
vec!["--a", "--b", "--c=1"]
);
}
#[test]
fn parse_args_empty_string_yields_empty_vec() {
assert!(parse_args("").is_empty());
assert!(parse_args(" \t\n").is_empty());
}
}
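
The doc comment on `launch` describes a three-step fallback for the Chromium cache directory. The sketch below restates that order as a pure function so it can be exercised without mutating environment variables. The name `resolve_cache_dir` and its two parameters are assumptions made for illustration; the diff itself reads `CRAWLER_CHROMIUM_DIR` and `HOME` directly inside `cache_dir`.

```rust
use std::path::PathBuf;

/// Hypothetical pure variant of `cache_dir` (not part of the diff).
/// `chromium_dir` stands in for `$CRAWLER_CHROMIUM_DIR`, `home` for `$HOME`.
fn resolve_cache_dir(chromium_dir: Option<&str>, home: Option<&str>) -> PathBuf {
    // 1. An explicit override wins.
    if let Some(dir) = chromium_dir {
        return PathBuf::from(dir);
    }
    // 2. Otherwise use a per-user cache under the home directory.
    if let Some(home) = home {
        return PathBuf::from(home).join(".cache/mangalord/chromium");
    }
    // 3. Last resort: a directory relative to the working directory.
    PathBuf::from("./.chromium-cache")
}
```

Note that, as in the original, an override set to an empty string would still be taken as-is; this sketch does not add validation.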


@@ -0,0 +1,262 @@
//! Lazy-launch / idle-teardown Chromium manager for the daemon.
//!
//! The first worker that calls [`BrowserManager::acquire`] triggers a real
//! Chromium launch (and the `on_launch` hook — used to re-inject the
//! PHPSESSID cookie on every fresh process). Each acquire bumps an active
//! counter; the returned [`BrowserLease`] decrements it on drop.
//!
//! When the active counter hits zero, a background reaper task waits
//! `idle_timeout`. If still zero on wake, it closes Chromium and clears the
//! cached handle. The next acquire re-launches.
//!
//! `idle_timeout = Duration::ZERO` disables the reaper — Chromium stays alive
//! until [`BrowserManager::shutdown`].
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use anyhow::Context;
use chromiumoxide::browser::Browser;
use futures_util::future::BoxFuture;
use tokio::sync::{Mutex, Notify};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use crate::crawler::browser::{self, LaunchOptions};
/// Hook invoked on every fresh launch with the new browser. Typically used
/// to re-inject PHPSESSID + run the session probe. Errors abort the
/// `acquire` that triggered the launch — the next acquire will re-launch.
pub type OnLaunch =
Arc<dyn Fn(Arc<Browser>) -> BoxFuture<'static, anyhow::Result<()>> + Send + Sync>;
/// Returns an `OnLaunch` that does nothing — useful when no session is
/// configured (e.g. CLI metadata-only runs).
pub fn noop_on_launch() -> OnLaunch {
Arc::new(|_| Box::pin(async { Ok(()) }))
}
/// Decoupled active-lease tracker. Owns the atomic counter and the idle
/// notifier so the wiring is unit-testable without standing up a real
/// `BrowserManager` (which would require launching Chromium).
#[derive(Default)]
pub(crate) struct ActiveTracker {
counter: AtomicUsize,
idle_signal: Notify,
}
impl ActiveTracker {
pub(crate) fn new() -> Arc<Self> {
Arc::new(Self::default())
}
pub(crate) fn acquire(self: &Arc<Self>) {
self.counter.fetch_add(1, Ordering::AcqRel);
}
pub(crate) fn release(self: &Arc<Self>) {
if self.counter.fetch_sub(1, Ordering::AcqRel) == 1 {
self.idle_signal.notify_one();
}
}
pub(crate) fn current(&self) -> usize {
self.counter.load(Ordering::Acquire)
}
pub(crate) fn idle_signal(&self) -> &Notify {
&self.idle_signal
}
}
pub struct BrowserManager {
inner: Mutex<Inner>,
active: Arc<ActiveTracker>,
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
}
struct Inner {
handle: Option<browser::Handle>,
shared: Option<Arc<Browser>>,
}
impl BrowserManager {
pub fn new(
launch_opts: LaunchOptions,
idle_timeout: Duration,
on_launch: OnLaunch,
) -> Arc<Self> {
Arc::new(Self {
inner: Mutex::new(Inner {
handle: None,
shared: None,
}),
active: ActiveTracker::new(),
launch_opts,
idle_timeout,
on_launch,
})
}
/// Acquire a shared browser lease. The first acquire after a teardown
/// launches a fresh Chromium (and runs `on_launch`); subsequent acquires
/// while a process is alive just bump the counter and clone the `Arc`.
pub async fn acquire(&self) -> anyhow::Result<BrowserLease> {
let mut guard = self.inner.lock().await;
if guard.handle.is_none() {
let handle = browser::launch(self.launch_opts.clone())
.await
.context("BrowserManager: launch chromium")?;
let shared = handle.shared();
// Run the on-launch hook before publishing the handle so a session
// probe failure doesn't leave a half-initialized browser behind.
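if let Err(e) = (self.on_launch)(Arc::clone(&shared)).await {
// Close the just-launched browser since we won't be using it.
// Drop our extra Arc first: while it is alive, `Handle::close`
// cannot unwrap the browser and would wait on the driver task
// with Chromium still running, all while we hold the mutex.
drop(shared);
let _ = handle.close().await;
return Err(e.context("BrowserManager: on_launch hook failed"));
}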
guard.handle = Some(handle);
guard.shared = Some(shared);
}
let browser = guard
.shared
.as_ref()
.expect("shared set above")
.clone();
self.active.acquire();
Ok(BrowserLease {
browser,
active: Arc::clone(&self.active),
})
}
/// Forcefully close the cached browser regardless of active count.
/// Used on daemon shutdown. After this returns the next acquire will
/// re-launch from scratch.
pub async fn shutdown(&self) {
let mut guard = self.inner.lock().await;
guard.shared = None;
if let Some(handle) = guard.handle.take() {
let _ = handle.close().await;
}
}
fn idle_timeout(&self) -> Duration {
self.idle_timeout
}
fn active(&self) -> Arc<ActiveTracker> {
Arc::clone(&self.active)
}
}
/// Background reaper. Always spawns a task. With `idle_timeout == 0` the
/// task only waits for `cancel` and never closes Chromium. Otherwise it:
/// 1. Waits on `idle_signal` (woken when active hits zero).
/// 2. Sleeps `idle_timeout`.
/// 3. Re-checks the counter under the mutex — if still zero, takes the
/// handle and closes it.
///
/// Repeats forever until `cancel` fires.
pub fn spawn_idle_reaper(mgr: Arc<BrowserManager>, cancel: CancellationToken) -> JoinHandle<()> {
tokio::spawn(async move {
if mgr.idle_timeout().is_zero() {
// Block until cancellation, then exit.
cancel.cancelled().await;
return;
}
let active = mgr.active();
loop {
tokio::select! {
_ = cancel.cancelled() => return,
_ = active.idle_signal().notified() => {}
}
if active.current() > 0 {
continue;
}
tokio::select! {
_ = cancel.cancelled() => return,
_ = tokio::time::sleep(mgr.idle_timeout()) => {}
}
let mut guard = mgr.inner.lock().await;
if active.current() > 0 {
// A worker grabbed a lease during the sleep — abort teardown.
continue;
}
let handle = guard.handle.take();
guard.shared = None;
drop(guard);
if let Some(h) = handle {
let _ = h.close().await;
tracing::info!("BrowserManager: idle teardown — Chromium closed");
}
}
})
}
/// A worker-side handle that keeps the browser alive while in scope.
/// `Deref<Target = Browser>` so callers can pass `&*lease` to APIs that
/// expect `&Browser`.
pub struct BrowserLease {
browser: Arc<Browser>,
active: Arc<ActiveTracker>,
}
impl Deref for BrowserLease {
type Target = Browser;
fn deref(&self) -> &Browser {
&self.browser
}
}
impl Drop for BrowserLease {
fn drop(&mut self) {
self.active.release();
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::sync::atomic::AtomicBool;
#[test]
fn noop_on_launch_is_send_sync() {
fn assert_send_sync<T: Send + Sync>(_: &T) {}
let h = noop_on_launch();
assert_send_sync(&h);
}
#[tokio::test]
async fn active_tracker_signals_idle_only_on_zero_transition() {
let tracker = ActiveTracker::new();
let signaled = Arc::new(AtomicBool::new(false));
{
let s = Arc::clone(&signaled);
let t = Arc::clone(&tracker);
tokio::spawn(async move {
t.idle_signal().notified().await;
s.store(true, Ordering::Release);
});
}
tracker.acquire();
tracker.acquire();
assert_eq!(tracker.current(), 2);
tracker.release();
assert_eq!(tracker.current(), 1);
tokio::time::sleep(Duration::from_millis(20)).await;
assert!(!signaled.load(Ordering::Acquire), "no idle signal at count 1");
tracker.release();
tokio::time::sleep(Duration::from_millis(20)).await;
assert_eq!(tracker.current(), 0);
assert!(
signaled.load(Ordering::Acquire),
"idle signal fires on 1 -> 0 transition"
);
}
}
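
The lease accounting described in the module docs (each acquire bumps a counter, each dropped lease decrements it, and only the 1 -> 0 transition signals idleness) can be sketched with the standard library alone. The types `Counter` and `Lease` below are hypothetical stand-ins for `ActiveTracker` and `BrowserLease`. In place of the real `tokio::sync::Notify`, the sketch counts idle transitions in a second atomic so the behavior is observable without an async runtime.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for `ActiveTracker`; `idle_transitions` replaces
/// the `Notify` used in the diff so the sketch needs no async runtime.
#[derive(Default)]
struct Counter {
    active: AtomicUsize,
    idle_transitions: AtomicUsize,
}

/// Hypothetical stand-in for `BrowserLease`: releases its slot on drop.
struct Lease {
    counter: Arc<Counter>,
}

impl Lease {
    fn acquire(counter: &Arc<Counter>) -> Self {
        counter.active.fetch_add(1, Ordering::AcqRel);
        Lease { counter: Arc::clone(counter) }
    }
}

impl Drop for Lease {
    fn drop(&mut self) {
        // `fetch_sub` returns the previous value, so `== 1` identifies the
        // 1 -> 0 transition, which is where the real tracker notifies.
        if self.counter.active.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.counter.idle_transitions.fetch_add(1, Ordering::AcqRel);
        }
    }
}
```

Tying the decrement to `Drop` means a worker that returns early or panics still releases its slot, which is presumably why the diff uses a lease type rather than paired acquire/release calls.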


@@ -0,0 +1,244 @@
//! Chapter content sync — fetch a logged-in chapter page, extract its
//! image URLs in `pageN` order, download each to storage, and atomically
//! persist a `pages` row per image plus the chapter's `page_count`.
//!
//! Only chapters belonging to a manga someone has bookmarked are
//! candidates. The crawler scans bookmarks at the start of each run and
//! enqueues unfetched chapters; the API also enqueues at bookmark-time
//! so users get instant feedback. Both feed into the same queue and
//! dedup by chapter id.
// Implementation lands in the next commits in this branch. Module is
// declared so other crates can `use crawler::content` without breaking
// builds while iteration is in progress.
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::session;
use crate::storage::Storage;
/// Parse the chapter page DOM and return the page images in `pageN`
/// order. Filters out the loader `<img class="loading">` and any
/// `<img>` without a numeric `id="pageN"`.
pub fn parse_chapter_pages(html: &str) -> Vec<ChapterImage> {
let doc = scraper::Html::parse_document(html);
let sel = scraper::Selector::parse("a#pic_container img:not(.loading)").unwrap();
let mut pages: Vec<ChapterImage> = doc
.select(&sel)
.filter_map(|img| {
let id = img.value().id()?;
let n: i32 = id.strip_prefix("page")?.parse().ok()?;
let src = img.value().attr("src")?.trim().to_string();
if src.is_empty() {
return None;
}
Some(ChapterImage { page_number: n, url: src })
})
.collect();
pages.sort_by_key(|p| p.page_number);
pages
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChapterImage {
pub page_number: i32,
pub url: String,
}
/// Outcome of a single chapter sync — surfaced to callers for logging
/// and exit-code decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncOutcome {
/// All images downloaded and stored, chapter row updated.
Fetched { pages: usize },
/// `page_count > 0` already — no-op unless force_refetch is set.
Skipped,
/// Session probe failed mid-sync (avatar selector missing on the
/// chapter page). Caller should abort the whole crawler run.
SessionExpired,
}
/// Fetch all images for one chapter and persist them atomically. On
/// any error after the first storage put, the DB transaction rolls
/// back so the chapter stays at `page_count = 0` and is retried on the
/// next run. Bytes already written to storage become orphans; a future
/// reaper sweeps them.
#[allow(clippy::too_many_arguments)]
pub async fn sync_chapter_content(
browser: &chromiumoxide::Browser,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
chapter_id: Uuid,
manga_id: Uuid,
source_url: &str,
force_refetch: bool,
) -> anyhow::Result<SyncOutcome> {
// Skip if already fetched, unless caller explicitly forces.
if !force_refetch {
let (page_count,): (i32,) =
sqlx::query_as("SELECT page_count FROM chapters WHERE id = $1")
.bind(chapter_id)
.fetch_one(db)
.await
.context("read chapter page_count")?;
if page_count > 0 {
return Ok(SyncOutcome::Skipped);
}
}
// Nav to chapter page (rate-limited per host).
rate.wait_for(source_url).await?;
let page = browser
.new_page(source_url)
.await
.with_context(|| format!("open chapter page {source_url}"))?;
page.wait_for_navigation().await.context("wait for chapter nav")?;
// Session probe: avatar present == still logged in. Missing means
// PHPSESSID expired; bail the entire crawler run.
if page.find_element("#avatar_menu").await.is_err() {
page.close().await.ok();
return Ok(SyncOutcome::SessionExpired);
}
let html = page.content().await.context("read chapter html")?;
page.close().await.ok();
let images = parse_chapter_pages(&html);
if images.is_empty() {
anyhow::bail!("no page images parsed from {source_url}");
}
// Resolve image URLs against the chapter URL (they may be relative).
let base = reqwest::Url::parse(source_url).context("parse chapter URL")?;
// Fetch every image bytes-first into memory before writing
// anything. Lets us bail the whole chapter cleanly if any image
// fails — DB stays at page_count=0, no partial rows persisted.
let mut fetched: Vec<(i32, Vec<u8>, &'static str)> = Vec::with_capacity(images.len());
for img in &images {
let url = base.join(&img.url).with_context(|| {
format!("join image URL {} onto {source_url}", img.url)
})?;
rate.wait_for(url.as_str()).await?;
let resp = http
.get(url.clone())
// Source CDNs commonly check Referer. Set it to the
// chapter page — matches what the browser would send.
.header(reqwest::header::REFERER, source_url)
.send()
.await
.with_context(|| format!("GET {url}"))?
.error_for_status()
.with_context(|| format!("non-2xx for {url}"))?;
let bytes = resp.bytes().await.context("read image body")?.to_vec();
let ext = infer::get(&bytes).map(|k| k.extension()).unwrap_or("bin");
fetched.push((img.page_number, bytes, ext));
}
// Atomic DB write: the page row inserts and the page_count update
// share one transaction. Storage puts run inside the same loop but
// are not transactional. If anything fails, the transaction rolls
// back and the chapter is retried next run; bytes already written to
// storage become orphans that a reaper sweeps later.
let mut tx = db.begin().await.context("open chapter sync tx")?;
for (page_number, bytes, ext) in &fetched {
let key = format!(
"mangas/{manga_id}/chapters/{chapter_id}/pages/{:04}.{ext}",
page_number
);
storage
.put(&key, bytes)
.await
.with_context(|| format!("put {key}"))?;
// (chapter_id, page_number) is unique — re-runs idempotent.
sqlx::query(
"INSERT INTO pages (chapter_id, page_number, storage_key, content_type)
VALUES ($1, $2, $3, $4)
ON CONFLICT (chapter_id, page_number) DO UPDATE
SET storage_key = EXCLUDED.storage_key,
content_type = EXCLUDED.content_type",
)
.bind(chapter_id)
.bind(page_number)
.bind(&key)
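// `infer` reports JPEG as "jpg" and unknown bytes fall back to
// "bin"; map both to registered MIME types.
.bind(match *ext {
"jpg" => "image/jpeg".to_string(),
"bin" => "application/octet-stream".to_string(),
other => format!("image/{other}"),
})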
.execute(&mut *tx)
.await
.with_context(|| format!("insert page row {page_number}"))?;
}
sqlx::query("UPDATE chapters SET page_count = $1 WHERE id = $2")
.bind(fetched.len() as i32)
.bind(chapter_id)
.execute(&mut *tx)
.await
.context("update page_count")?;
tx.commit().await.context("commit chapter sync")?;
Ok(SyncOutcome::Fetched { pages: fetched.len() })
}
// Suppress unused-import warning for `session` until the bin/crawler
// wiring lands in this branch and uses it through this module.
#[allow(dead_code)]
fn _keep_session_in_scope() {
let _ = session::registrable_domain;
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_chapter_pages_skips_loader_and_sorts_by_id() {
// Loader image, two real pages out of order, one image with no id,
// and one with a non-page id.
let html = r#"
<html><body id="body"><a id="pic_container">
<img class="loading" src="/images/ajax-loader2.gif">
<img id="page2" class="page2" src="https://cdn/2.jpg">
<img id="page1" class="page1" src="https://cdn/1.jpg">
<img src="https://cdn/orphan.jpg">
<img id="not-a-page" src="https://cdn/not-a-page.jpg">
</a></body></html>
"#;
let pages = parse_chapter_pages(html);
assert_eq!(pages.len(), 2);
assert_eq!(pages[0].page_number, 1);
assert_eq!(pages[0].url, "https://cdn/1.jpg");
assert_eq!(pages[1].page_number, 2);
assert_eq!(pages[1].url, "https://cdn/2.jpg");
}
#[test]
fn parse_chapter_pages_drops_images_without_src() {
let html = r#"
<a id="pic_container">
<img id="page1" src="">
<img id="page2" src="https://cdn/2.jpg">
</a>
"#;
let pages = parse_chapter_pages(html);
assert_eq!(pages.len(), 1);
assert_eq!(pages[0].page_number, 2);
}
#[test]
fn parse_chapter_pages_handles_three_digit_page_ids() {
let html = r#"
<a id="pic_container">
<img id="page126" src="https://cdn/126.jpg">
<img id="page9" src="https://cdn/9.jpg">
<img id="page50" src="https://cdn/50.jpg">
</a>
"#;
let pages = parse_chapter_pages(html);
assert_eq!(
pages.iter().map(|p| p.page_number).collect::<Vec<_>>(),
vec![9, 50, 126]
);
}
}
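
The id filter in `parse_chapter_pages` and the storage key layout in `sync_chapter_content` can be sketched in isolation. This is a minimal sketch using only the standard library; `page_number_from_id` and `page_storage_key` are hypothetical helper names, and the ids are plain strings here rather than the `Uuid` values the real code uses.

```rust
/// Extract the page number from an element id such as "page12".
/// Returns None when the "page" prefix is missing or the suffix does
/// not parse as an i32 (hypothetical helper, mirrors the filter_map).
fn page_number_from_id(id: &str) -> Option<i32> {
    id.strip_prefix("page")?.parse().ok()
}

/// Build the storage key for one page. The page number is zero-padded
/// to four digits, so keys for chapters under 10,000 pages sort
/// lexicographically in page order (hypothetical helper).
fn page_storage_key(manga_id: &str, chapter_id: &str, page_number: i32, ext: &str) -> String {
    format!("mangas/{manga_id}/chapters/{chapter_id}/pages/{page_number:04}.{ext}")
}
```

Keeping the key derivation pure makes the layout easy to unit-test without a `Storage` implementation.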


@@ -0,0 +1,633 @@
//! In-process crawler daemon.
//!
//! Owns a cron task that fires a daily metadata pass and N worker tasks
//! that drain `SyncChapterContent` jobs from `crawler_jobs`. The dispatch
//! seams ([`MetadataPass`], [`ChapterDispatcher`]) are traits so tests can
//! inject stubs without standing up a real Chromium / `Source` impl.
//!
//! ## Cron
//!
//! Each tick:
//! 1. Acquire a Postgres advisory lock on a dedicated pool connection
//! (multi-replica safety). Skip the tick on contention.
//! 2. Call [`MetadataPass::run`] (typically `pipeline::run_metadata_pass`).
//! 3. Enqueue `SyncChapterContent` jobs for any bookmarked manga whose
//! chapters still have `page_count = 0`.
//! 4. Reap `done` jobs older than `retention_days`.
//! 5. Persist `last_metadata_tick_at` and release the lock.
//!
//! If the last persisted tick is older than the most recent scheduled slot
//! (e.g. backend was down at midnight), the daemon fires immediately on
//! startup before resuming the regular schedule.
//!
//! ## Workers
//!
//! Each worker leases one chapter-content job at a time, dispatches via the
//! [`ChapterDispatcher`], and acks `done` / `failed` / re-`pending` based on
//! the outcome. A `SessionExpired` outcome flips the sticky
//! `session_expired` flag — all workers idle while it's set (until operator
//! restart with a refreshed PHPSESSID).
//!
//! Worker dispatch is wrapped in `catch_unwind` so a panicking handler
//! marks the job failed instead of taking down the worker task.
use std::panic::AssertUnwindSafe;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use async_trait::async_trait;
use chrono::{DateTime, Datelike, NaiveTime, TimeZone, Timelike, Utc};
use chrono_tz::Tz;
use futures_util::FutureExt;
use serde_json::json;
use sqlx::PgPool;
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;
use crate::crawler::content::SyncOutcome;
use crate::crawler::jobs::{self, JobPayload, Lease, KIND_SYNC_CHAPTER_CONTENT};
use crate::crawler::pipeline;
/// Fixed `pg_try_advisory_lock` key. ASCII "MANGALRD" interpreted as a
/// big-endian i64. Hardcoded so every replica agrees on the lock identity
/// without consulting config.
pub const CRON_LOCK_KEY: i64 = 0x4D414E47414C5244;
const STATE_KEY_LAST_TICK: &str = "last_metadata_tick_at";
#[async_trait]
pub trait MetadataPass: Send + Sync {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats>;
}
#[async_trait]
pub trait ChapterDispatcher: Send + Sync {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome>;
}
/// Configuration for [`spawn`]. Use `None` for `metadata_pass` to disable
/// the cron entirely (worker-pool-only mode — useful when only the
/// bookmark-triggered enqueue path is wanted).
pub struct DaemonConfig {
pub metadata_pass: Option<Arc<dyn MetadataPass>>,
pub dispatcher: Arc<dyn ChapterDispatcher>,
pub chapter_workers: usize,
pub daily_at: NaiveTime,
pub tz: Tz,
pub retention_days: u32,
pub session_expired: Arc<AtomicBool>,
/// Tasks that should run alongside the cron + workers and be cancelled
/// on shutdown. Used to hand the daemon ownership of the browser
/// manager's idle reaper.
pub extra_tasks: Vec<tokio::task::JoinHandle<()>>,
}
pub struct DaemonHandle {
cancel: CancellationToken,
join: JoinSet<()>,
extra: Vec<tokio::task::JoinHandle<()>>,
}
impl DaemonHandle {
/// Trigger shutdown and await all worker / cron / extra tasks.
pub async fn shutdown(mut self) {
self.cancel.cancel();
while self.join.join_next().await.is_some() {}
for task in self.extra.drain(..) {
let _ = task.await;
}
}
/// Cancellation token that drives shutdown — exposed so callers
/// (`app::spawn_crawler_daemon`) can hand the same token to auxiliary
/// tasks (e.g. the BrowserManager idle reaper) and have them stop on
/// the daemon's signal.
pub fn cancel_token(&self) -> CancellationToken {
self.cancel.clone()
}
}
/// Spawn the daemon. Returns immediately; tasks run in the background.
/// Pass an external [`CancellationToken`] so auxiliary tasks (e.g. a
/// BrowserManager idle reaper) can share the same shutdown signal —
/// typically created in the caller, cloned into both spawns.
pub fn spawn(pool: PgPool, cancel: CancellationToken, cfg: DaemonConfig) -> DaemonHandle {
let mut join = JoinSet::new();
let DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers,
daily_at,
tz,
retention_days,
session_expired,
extra_tasks,
} = cfg;
if let Some(metadata) = metadata_pass {
let ctx = CronContext {
pool: pool.clone(),
cancel: cancel.clone(),
daily_at,
tz,
retention_days,
metadata,
};
join.spawn(async move { ctx.run().await });
} else {
tracing::info!("crawler daemon: no metadata_pass — cron disabled");
}
for worker_id in 0..chapter_workers.max(1) {
let ctx = WorkerContext {
pool: pool.clone(),
cancel: cancel.clone(),
dispatcher: Arc::clone(&dispatcher),
session_expired: Arc::clone(&session_expired),
id: worker_id,
};
join.spawn(async move { ctx.run().await });
}
DaemonHandle {
cancel,
join,
extra: extra_tasks,
}
}
// ---------------------------------------------------------------------------
// Cron
// ---------------------------------------------------------------------------
struct CronContext {
pool: PgPool,
cancel: CancellationToken,
daily_at: NaiveTime,
tz: Tz,
retention_days: u32,
metadata: Arc<dyn MetadataPass>,
}
impl CronContext {
async fn run(self) {
// On startup, fire immediately if the most recent slot has already
// passed and we never recorded a tick for it.
let now = Utc::now();
let mut catchup = match read_last_tick(&self.pool).await {
Ok(Some(last)) => previous_fire(now, self.daily_at, self.tz) > last,
Ok(None) => true,
Err(e) => {
tracing::warn!(?e, "cron: read_last_tick failed; assuming no catch-up");
false
}
};
loop {
if catchup {
tracing::info!("cron: catch-up tick (missed scheduled slot)");
self.run_tick().await;
catchup = false;
continue;
}
// Recompute next-fire from now() each iteration so clock jumps
// (NTP step, suspend/resume) don't strand us on a stale instant.
let next = next_fire(Utc::now(), self.daily_at, self.tz);
let wait = (next - Utc::now()).to_std().unwrap_or(Duration::ZERO);
tracing::info!(
next_fire_utc = %next.to_rfc3339(),
wait_seconds = wait.as_secs(),
"cron: sleeping until next slot"
);
tokio::select! {
_ = tokio::time::sleep(wait) => {}
_ = self.cancel.cancelled() => {
tracing::info!("cron: shutdown");
return;
}
}
self.run_tick().await;
}
}
async fn run_tick(&self) {
let mut conn = match self.pool.acquire().await {
Ok(c) => c,
Err(e) => {
tracing::error!(?e, "cron: acquire conn failed; skipping tick");
return;
}
};
// pg_try_advisory_lock is session-scoped: the unlock must run on the
// same connection that took the lock. Issued on a different pooled
// connection, the unlock would not release it.
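// Distinguish a failed lock query from real contention so the log
// line below is not misleading.
let acquired: bool = match sqlx::query_scalar("SELECT pg_try_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.fetch_one(&mut *conn)
.await
{
Ok(v) => v,
Err(e) => {
tracing::error!(?e, "cron: advisory lock query failed; skipping tick");
return;
}
};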
if !acquired {
tracing::info!("cron: tick skipped — another replica holds the lock");
return;
}
match self.metadata.run().await {
Ok(stats) => tracing::info!(?stats, "cron: metadata pass done"),
Err(e) => tracing::error!(?e, "cron: metadata pass failed"),
}
match pipeline::enqueue_bookmarked_pending(&self.pool).await {
Ok(summary) => tracing::info!(?summary, "cron: enqueued bookmarked-pending"),
Err(e) => tracing::error!(?e, "cron: enqueue_bookmarked_pending failed"),
}
match jobs::reap_done(&self.pool, self.retention_days).await {
Ok(n) => tracing::info!(reaped = n, "cron: done-job reaper finished"),
Err(e) => tracing::error!(?e, "cron: done-job reaper failed"),
}
if let Err(e) = write_last_tick(&self.pool, Utc::now()).await {
tracing::warn!(?e, "cron: persist last_metadata_tick_at failed");
}
let _ = sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *conn)
.await;
drop(conn);
}
}
// ---------------------------------------------------------------------------
// Workers
// ---------------------------------------------------------------------------
struct WorkerContext {
pool: PgPool,
cancel: CancellationToken,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<AtomicBool>,
id: usize,
}
impl WorkerContext {
async fn run(self) {
loop {
if self.cancel.is_cancelled() {
tracing::info!(worker = self.id, "worker: shutdown");
return;
}
if self.session_expired.load(Ordering::Acquire) {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(30)) => continue,
_ = self.cancel.cancelled() => return,
}
}
let leases = match jobs::lease(
&self.pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
1,
Duration::from_secs(60),
)
.await
{
Ok(v) => v,
Err(e) => {
tracing::warn!(worker = self.id, ?e, "worker: lease failed");
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(5)) => continue,
_ = self.cancel.cancelled() => return,
}
}
};
let Some(lease) = leases.into_iter().next() else {
tokio::select! {
_ = tokio::time::sleep(Duration::from_secs(1)) => continue,
_ = self.cancel.cancelled() => return,
}
};
self.process_lease(lease).await;
}
}
async fn process_lease(&self, lease: Lease) {
// Consumer-side dedup safety net: if the chapter already has pages
// (because of a force-refetch race, or a job that was re-enqueued
// after a previous one finished), ack done without re-fetching.
if let JobPayload::SyncChapterContent { chapter_id, .. } = &lease.payload {
let page_count: Option<i32> = sqlx::query_scalar(
"SELECT page_count FROM chapters WHERE id = $1",
)
.bind(chapter_id)
.fetch_optional(&self.pool)
.await
.ok()
.flatten();
if matches!(page_count, Some(n) if n > 0) {
let _ = jobs::ack_done(&self.pool, lease.id).await;
return;
}
}
let outcome = AssertUnwindSafe(self.dispatcher.dispatch(lease.payload.clone()))
.catch_unwind()
.await;
match outcome {
Ok(Ok(SyncOutcome::Fetched { .. } | SyncOutcome::Skipped)) => {
let _ = jobs::ack_done(&self.pool, lease.id).await;
}
Ok(Ok(SyncOutcome::SessionExpired)) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"session expired — workers will idle until restart"
);
self.session_expired.store(true, Ordering::Release);
let _ = jobs::release(&self.pool, lease.id).await;
}
Ok(Err(e)) => {
tracing::warn!(
worker = self.id,
lease_id = %lease.id,
error = ?e,
"worker: dispatch error — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
&format!("{e:#}"),
lease.attempts,
lease.max_attempts,
)
.await;
}
Err(_panic) => {
tracing::error!(
worker = self.id,
lease_id = %lease.id,
"worker: dispatcher panicked — ack failed"
);
let _ = jobs::ack_failed(
&self.pool,
lease.id,
"worker panicked",
lease.attempts,
lease.max_attempts,
)
.await;
}
}
}
}
// ---------------------------------------------------------------------------
// Cron timing primitives
// ---------------------------------------------------------------------------
/// Compute the next UTC instant when `daily_at` (interpreted in `tz`) will
/// fire, strictly after `now`. Handles DST gaps (spring-forward) by
/// advancing past the gap; on DST overlap (fall-back) picks the later
/// instant so the job runs once, not twice.
pub fn next_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
// Start with today's slot in the local TZ.
let mut candidate = local_at(now_local.date_naive(), daily_at, tz);
// If today's slot is in the past (or now), roll forward day-by-day.
while candidate <= now {
let next_day = candidate
.with_timezone(&tz)
.date_naive()
.succ_opt()
.unwrap_or_else(|| {
// Defensive: succ_opt only fails at chrono's max date.
chrono::NaiveDate::from_ymd_opt(
candidate.year(),
candidate.month(),
candidate.day(),
)
.expect("valid date")
});
candidate = local_at(next_day, daily_at, tz);
}
candidate
}
/// The most recent fire instant at or before `now`. Used to detect missed
/// slots after a restart.
pub fn previous_fire(now: DateTime<Utc>, daily_at: NaiveTime, tz: Tz) -> DateTime<Utc> {
let now_local = now.with_timezone(&tz);
let today = local_at(now_local.date_naive(), daily_at, tz);
if today <= now {
return today;
}
let yesterday = now_local
.date_naive()
.pred_opt()
.expect("a day before now");
local_at(yesterday, daily_at, tz)
}
/// Resolve a local date+time to a UTC instant in `tz`, navigating DST
/// edges deterministically:
/// - `LocalResult::Single` → that instant.
/// - `LocalResult::Ambiguous(_, latest)` → the later instant (fall-back
/// hour). Picking latest means a daily job fires once across the
/// repeated hour, not twice.
/// - `LocalResult::None` → spring-forward gap. Advance the local time
/// by 1 minute and try again, for up to 119 retries, which clears any
/// gap shorter than two hours.
fn local_at(date: chrono::NaiveDate, time: NaiveTime, tz: Tz) -> DateTime<Utc> {
use chrono::LocalResult;
for offset_minutes in 0..120 {
let mut t = time;
if offset_minutes > 0 {
let added = chrono::NaiveTime::from_num_seconds_from_midnight_opt(
((time.num_seconds_from_midnight() as i64 + offset_minutes * 60) % 86_400) as u32,
0,
)
.unwrap_or(time);
t = added;
}
let naive = date.and_time(t);
match tz.from_local_datetime(&naive) {
LocalResult::Single(dt) => return dt.with_timezone(&Utc),
LocalResult::Ambiguous(_, latest) => return latest.with_timezone(&Utc),
LocalResult::None => continue,
}
}
// Should be unreachable: DST gaps are typically one hour, well inside
// the 120-minute search window. Fall back to reading the local time
// as UTC.
Utc.from_utc_datetime(&date.and_time(time))
}
// ---------------------------------------------------------------------------
// crawler_state I/O
// ---------------------------------------------------------------------------
async fn read_last_tick(pool: &PgPool) -> sqlx::Result<Option<DateTime<Utc>>> {
let row: Option<serde_json::Value> = sqlx::query_scalar(
"SELECT value FROM crawler_state WHERE key = $1",
)
.bind(STATE_KEY_LAST_TICK)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|v| {
v.get("at")
.and_then(|s| s.as_str())
.and_then(|s| DateTime::parse_from_rfc3339(s).ok())
.map(|dt| dt.with_timezone(&Utc))
}))
}
async fn write_last_tick(pool: &PgPool, at: DateTime<Utc>) -> sqlx::Result<()> {
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(STATE_KEY_LAST_TICK)
.bind(json!({ "at": at.to_rfc3339() }))
.execute(pool)
.await?;
Ok(())
}
// ---------------------------------------------------------------------------
// Test helpers (not gated on cfg(test) — integration tests in tests/ dir
// need them too).
// ---------------------------------------------------------------------------
pub mod test_support {
//! Lightweight stubs the daemon tests use. Public because integration
//! tests live outside this module.
use super::*;
use std::sync::atomic::AtomicUsize;
pub struct CountingMetadataPass {
pub count: AtomicUsize,
}
impl Default for CountingMetadataPass {
fn default() -> Self {
Self {
count: AtomicUsize::new(0),
}
}
}
#[async_trait]
impl MetadataPass for CountingMetadataPass {
async fn run(&self) -> anyhow::Result<pipeline::MetadataStats> {
self.count.fetch_add(1, Ordering::AcqRel);
Ok(pipeline::MetadataStats::default())
}
}
pub type DispatchFn = Arc<
dyn Fn(JobPayload) -> futures_util::future::BoxFuture<'static, anyhow::Result<SyncOutcome>>
+ Send
+ Sync,
>;
pub struct StubDispatcher {
pub handler: DispatchFn,
}
#[async_trait]
impl ChapterDispatcher for StubDispatcher {
async fn dispatch(&self, payload: JobPayload) -> anyhow::Result<SyncOutcome> {
(self.handler)(payload).await
}
}
pub fn always_done() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { Ok(SyncOutcome::Fetched { pages: 1 }) })),
})
}
pub fn panicking_dispatcher() -> Arc<StubDispatcher> {
Arc::new(StubDispatcher {
handler: Arc::new(|_| Box::pin(async { panic!("intentional dispatcher panic") })),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
use chrono::Duration as ChronoDuration;
fn dt_utc(y: i32, mo: u32, d: u32, h: u32, mi: u32) -> DateTime<Utc> {
Utc.with_ymd_and_hms(y, mo, d, h, mi, 0).unwrap()
}
#[test]
fn next_fire_in_utc_at_midnight_advances_one_day() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
// Next midnight is May 26 00:00 UTC.
assert_eq!(next, dt_utc(2026, 5, 26, 0, 0));
}
#[test]
fn next_fire_before_today_slot_returns_today() {
let now = dt_utc(2026, 5, 25, 23, 0); // 23:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let next = next_fire(now, at, Tz::UTC);
assert_eq!(next, dt_utc(2026, 5, 25, 23, 30));
}
#[test]
fn next_fire_skips_spring_forward_gap_in_europe_berlin() {
// 2024-03-31: clocks jump 02:00 -> 03:00 in Berlin (CET -> CEST).
// Asking for daily_at = 02:30 on the morning of the jump should
// land on the *next valid* local instant past the gap. We test
// by computing `next_fire` at 2024-03-31 00:30 UTC (= 01:30 CET,
// i.e. just before the gap). The next 02:30 local does not exist,
// so the helper advances past it.
let now = dt_utc(2024, 3, 31, 0, 30); // 01:30 local Berlin (CET = UTC+1)
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// Local Berlin time skips from 02:00 -> 03:00. After the +1 minute
// search, the first valid slot is 03:00 local on 2024-03-31, which
// is 01:00 UTC (CEST = UTC+2).
// We assert the result is strictly after `now` and less than 2h
// later; the loose bound avoids depending on exactly how many +1m
// steps were required.
assert!(next > now);
assert!(next < now + ChronoDuration::hours(2));
}
#[test]
fn next_fire_on_fall_back_picks_later_instant() {
// 2024-10-27: clocks jump 03:00 -> 02:00 (CEST -> CET) in Berlin.
// 02:30 happens twice on that day. We pick the later one.
let now = dt_utc(2024, 10, 26, 12, 0); // day before, noon UTC
let at = NaiveTime::from_hms_opt(2, 30, 0).unwrap();
let next = next_fire(now, at, Tz::Europe__Berlin);
// First 02:30 local is 00:30 UTC (CEST = UTC+2).
// Second 02:30 local is 01:30 UTC (CET = UTC+1).
// We expect the later instant: 01:30 UTC on 2024-10-27.
assert_eq!(next, dt_utc(2024, 10, 27, 1, 30));
}
#[test]
fn previous_fire_returns_today_when_now_is_after_slot() {
let now = dt_utc(2026, 5, 25, 12, 0); // noon UTC
let at = NaiveTime::from_hms_opt(0, 0, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 25, 0, 0));
}
#[test]
fn previous_fire_returns_yesterday_when_now_is_before_today_slot() {
let now = dt_utc(2026, 5, 25, 8, 0); // 08:00 UTC
let at = NaiveTime::from_hms_opt(23, 30, 0).unwrap();
let prev = previous_fire(now, at, Tz::UTC);
assert_eq!(prev, dt_utc(2026, 5, 24, 23, 30));
}
}
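
Two small rules from the daemon can be checked without Postgres or a clock: how `CRON_LOCK_KEY` is derived from its ASCII tag, and when the startup catch-up tick fires. The sketch below assumes plain `i64` timestamps in place of `DateTime<Utc>`; `lock_key_from_tag` and `needs_catchup` are hypothetical names, and the error branch of `read_last_tick` (which the daemon treats as "no catch-up") is left out.

```rust
/// Interpret eight ASCII bytes as a big-endian i64, the way the doc
/// comment on CRON_LOCK_KEY describes (hypothetical helper).
fn lock_key_from_tag(tag: &[u8; 8]) -> i64 {
    i64::from_be_bytes(*tag)
}

/// Catch-up rule from CronContext::run: fire immediately when no tick
/// was ever recorded, or when the most recent scheduled slot is newer
/// than the last recorded tick. Timestamps are seconds here, which is
/// an assumption made to keep the sketch std-only.
fn needs_catchup(last_tick: Option<i64>, previous_slot: i64) -> bool {
    match last_tick {
        Some(last) => previous_slot > last,
        None => true,
    }
}
```

Deriving the key from a readable tag is only a convenience; what matters is that every replica uses the same constant.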


@@ -0,0 +1,15 @@
//! Change-detection rules between the source and our DB.
//!
//! | Event | Signal |
//! |--------------------|----------------------------------------------------------------------------------------|
//! | New manga | `(source_id, source_manga_key)` not in `manga_sources` |
//! | Updated metadata | freshly computed `metadata_hash` differs from the stored one |
//! | Dropped manga | `last_seen_at < discover_run_started_at` for N consecutive successful discover runs |
//! | New chapter | `(source_id, source_chapter_key)` not in `chapter_sources` |
//! | Dropped chapter | present in DB but absent from the latest `fetch_chapter_list` for the same manga |
//!
//! Dropped is always a soft flag (`dropped_at`), never a row delete —
//! restoring is a matter of clearing the flag if the source brings the
//! item back.
//!
//! Scaffold only — implementations land once `repo::crawler` exists.
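
The "New chapter" and "Dropped chapter" rows of the table reduce to two set differences. Since this module is still a scaffold, the following is only a sketch of one possible shape, not the planned implementation; `ChapterDiff` and `diff_chapter_keys` are hypothetical names, and chapter keys are modelled as plain strings.

```rust
use std::collections::BTreeSet;

/// Outcome of comparing stored chapter keys with a fresh chapter list
/// (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
struct ChapterDiff {
    /// Keys in the fresh listing that the DB does not know yet.
    new_keys: Vec<String>,
    /// Keys in the DB that are absent from the fresh listing; these
    /// would get a soft `dropped_at` flag, never a row delete.
    dropped_keys: Vec<String>,
}

/// Classify chapter keys as new or dropped with two set differences.
/// BTreeSet keeps both result lists in sorted order.
fn diff_chapter_keys(stored: &BTreeSet<String>, fetched: &BTreeSet<String>) -> ChapterDiff {
    ChapterDiff {
        new_keys: fetched.difference(stored).cloned().collect(),
        dropped_keys: stored.difference(fetched).cloned().collect(),
    }
}
```

Keys present on both sides fall into neither list, which matches the table: unchanged chapters produce no event.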

backend/src/crawler/jobs.rs Normal file

@@ -0,0 +1,269 @@
//! Persistent job queue and the four job kinds.
//!
//! Backed by Postgres (the `crawler_jobs` table). Workers lease rows
//! with `SELECT ... FOR UPDATE SKIP LOCKED`, heartbeat via
//! `leased_until`, and ack by transitioning to `done` (or backoff /
//! `dead`). Handlers are idempotent so a crash mid-run is recoverable
//! by replay.
use std::time::Duration;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use uuid::Uuid;
use super::source::DiscoverMode;
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
pub enum JobPayload {
/// Walk the source index and enqueue `SyncManga` jobs.
Discover {
source_id: String,
mode: DiscoverMode,
},
/// Fetch one manga's detail page, upsert metadata, enqueue
/// `SyncChapterList`.
SyncManga {
source_id: String,
source_manga_key: String,
},
/// Diff the chapter list, enqueue `SyncChapterContent` for new
/// chapters, soft-drop vanished ones.
SyncChapterList {
source_id: String,
manga_id: Uuid,
source_manga_key: String,
},
/// Download a single chapter's page images into storage.
SyncChapterContent {
source_id: String,
chapter_id: Uuid,
source_chapter_key: String,
},
}
#[derive(Clone, Copy, Debug, sqlx::Type, Serialize, Deserialize)]
#[sqlx(type_name = "text", rename_all = "snake_case")]
#[serde(rename_all = "snake_case")]
pub enum JobState {
Pending,
Running,
Done,
Failed,
Dead,
}
/// Kind discriminator stored in `payload->>'kind'`. Public so callers
/// (daemon worker, bookmark hook) can filter `lease()` to a single kind
/// without re-spelling the literal.
pub const KIND_SYNC_CHAPTER_CONTENT: &str = "sync_chapter_content";
#[derive(Debug)]
pub enum EnqueueResult {
Inserted(Uuid),
Skipped,
}
#[derive(Debug, Clone)]
pub struct Lease {
pub id: Uuid,
pub payload: JobPayload,
pub attempts: i32,
pub max_attempts: i32,
}
/// Exponential backoff for `ack_failed` retries. `attempts` is the
/// post-increment value reported by `lease()` (so the first failure has
/// `attempts == 1` and waits 60s, the second 120s, etc.). Capped at 1h to
/// avoid runaway long sleeps that would outlive the daemon process.
fn backoff_for(attempts: i32) -> Duration {
let shift = attempts.saturating_sub(1).clamp(0, 20) as u32;
let secs = 60u64.saturating_mul(1u64 << shift);
Duration::from_secs(secs.min(3600))
}
/// Insert a new pending job. For `SyncChapterContent` payloads the
/// partial unique index `crawler_jobs_chapter_content_dedup_idx` blocks
/// a second `(pending|running)` insert per chapter_id, returning
/// `Skipped`. The slot frees again once the previous job leaves the
/// in-flight states (done/failed/dead), so a re-enqueue after a force
/// refetch succeeds.
pub async fn enqueue(pool: &PgPool, payload: &JobPayload) -> sqlx::Result<EnqueueResult> {
let json = serde_json::to_value(payload).expect("JobPayload is always serializable");
let id: Option<Uuid> = sqlx::query_scalar(
"INSERT INTO crawler_jobs (payload) VALUES ($1) \
ON CONFLICT DO NOTHING RETURNING id",
)
.bind(json)
.fetch_optional(pool)
.await?;
Ok(match id {
Some(id) => EnqueueResult::Inserted(id),
None => EnqueueResult::Skipped,
})
}
/// Lease up to `max` rows whose `state` is `pending`, or `running` with
/// an expired `leased_until` (the crashed-worker recovery path). The
/// inner CTE uses `FOR UPDATE SKIP LOCKED` so concurrent leasers don't
/// block each other and each row is handed to exactly one worker.
///
/// `kind_filter` matches against `payload->>'kind'`; `None` means
/// any kind.
pub async fn lease(
pool: &PgPool,
kind_filter: Option<&str>,
max: i64,
lease_duration: Duration,
) -> sqlx::Result<Vec<Lease>> {
let lease_ms: i64 = lease_duration.as_millis().min(i64::MAX as u128) as i64;
let rows: Vec<(Uuid, serde_json::Value, i32, i32)> = sqlx::query_as(
r#"
WITH leased AS (
SELECT id FROM crawler_jobs
WHERE (state = 'pending' OR (state = 'running' AND leased_until < now()))
AND scheduled_at <= now()
AND ($1::text IS NULL OR payload->>'kind' = $1)
ORDER BY scheduled_at
LIMIT $2
FOR UPDATE SKIP LOCKED
)
UPDATE crawler_jobs j
SET state = 'running',
attempts = j.attempts + 1,
leased_until = now() + ($3::bigint || ' milliseconds')::interval,
updated_at = now()
FROM leased l
WHERE j.id = l.id
RETURNING j.id, j.payload, j.attempts, j.max_attempts
"#,
)
.bind(kind_filter)
.bind(max)
.bind(lease_ms)
.fetch_all(pool)
.await?;
let mut leases = Vec::with_capacity(rows.len());
for (id, payload_json, attempts, max_attempts) in rows {
let payload: JobPayload = serde_json::from_value(payload_json).map_err(|e| {
sqlx::Error::Decode(format!("invalid JobPayload JSON for job {id}: {e}").into())
})?;
leases.push(Lease {
id,
payload,
attempts,
max_attempts,
});
}
Ok(leases)
}
/// Mark a leased job as successfully completed.
pub async fn ack_done(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'done', leased_until = NULL, updated_at = now() \
WHERE id = $1",
)
.bind(lease_id)
.execute(pool)
.await?;
Ok(())
}
/// Mark a leased job as failed. If the current attempt count has reached
/// `max_attempts` the job is terminally dead and stops retrying;
/// otherwise it goes back to `pending` with `scheduled_at` pushed into
/// the future by the exponential backoff.
pub async fn ack_failed(
pool: &PgPool,
lease_id: Uuid,
error: &str,
attempts: i32,
max_attempts: i32,
) -> sqlx::Result<()> {
if attempts >= max_attempts {
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'dead', last_error = $2, leased_until = NULL, updated_at = now() \
WHERE id = $1",
)
.bind(lease_id)
.bind(error)
.execute(pool)
.await?;
} else {
let backoff_ms: i64 = backoff_for(attempts).as_millis().min(i64::MAX as u128) as i64;
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', last_error = $2, leased_until = NULL, \
scheduled_at = now() + ($3::bigint || ' milliseconds')::interval, \
updated_at = now() \
WHERE id = $1",
)
.bind(lease_id)
.bind(error)
.bind(backoff_ms)
.execute(pool)
.await?;
}
Ok(())
}
/// Return a leased job to `pending` without burning a retry attempt.
/// Used on graceful shutdown and on session-expired aborts where the
/// failure isn't the job's fault.
pub async fn release(pool: &PgPool, lease_id: Uuid) -> sqlx::Result<()> {
sqlx::query(
"UPDATE crawler_jobs \
SET state = 'pending', leased_until = NULL, \
attempts = GREATEST(0, attempts - 1), updated_at = now() \
WHERE id = $1",
)
.bind(lease_id)
.execute(pool)
.await?;
Ok(())
}
/// Delete `done` jobs whose `updated_at` is older than `retention_days`
/// days. `0` disables the reaper without touching the table. Returns the
/// number of rows removed.
pub async fn reap_done(pool: &PgPool, retention_days: u32) -> sqlx::Result<u64> {
if retention_days == 0 {
return Ok(0);
}
let result = sqlx::query(
"DELETE FROM crawler_jobs \
WHERE state = 'done' \
AND updated_at < now() - ($1::bigint || ' days')::interval",
)
.bind(retention_days as i64)
.execute(pool)
.await?;
Ok(result.rows_affected())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn backoff_grows_exponentially_and_caps_at_one_hour() {
// attempts == 1 → 60s, doubling each step.
assert_eq!(backoff_for(1), Duration::from_secs(60));
assert_eq!(backoff_for(2), Duration::from_secs(120));
assert_eq!(backoff_for(3), Duration::from_secs(240));
assert_eq!(backoff_for(4), Duration::from_secs(480));
assert_eq!(backoff_for(5), Duration::from_secs(960));
assert_eq!(backoff_for(6), Duration::from_secs(1920));
// 7th: 60 * 64 = 3840 → capped to 3600.
assert_eq!(backoff_for(7), Duration::from_secs(3600));
assert_eq!(backoff_for(20), Duration::from_secs(3600));
// Garbage / zero / negatives stay sane.
assert_eq!(backoff_for(0), Duration::from_secs(60));
assert_eq!(backoff_for(-5), Duration::from_secs(60));
}
}
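
Review note: `backoff_for` itself is defined outside this hunk, so the following is only a minimal sketch of a schedule that would satisfy the test above. The name `backoff_sketch` and the two constants are assumptions for illustration, not the actual implementation.

```rust
use std::time::Duration;

// Assumed constants, inferred from the test expectations (60s base, 1h cap).
const BASE_SECS: u64 = 60;
const CAP_SECS: u64 = 3600;

/// Hypothetical stand-in for `backoff_for`: 60s * 2^(attempts - 1),
/// capped at one hour. Zero and negative counts behave like attempt 1.
fn backoff_sketch(attempts: i32) -> Duration {
    // Clamp to at least 1 so the exponent never goes negative.
    let step = attempts.max(1) as u32 - 1;
    // checked_shl yields None for shifts >= 64 and checked_mul yields None
    // on overflow; both cases fall back to the cap.
    let secs = 1u64
        .checked_shl(step)
        .and_then(|factor| BASE_SECS.checked_mul(factor))
        .unwrap_or(CAP_SECS)
        .min(CAP_SECS);
    Duration::from_secs(secs)
}
```

Doing the arithmetic with checked operations keeps very large attempt counts from overflowing before the cap is applied.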


@@ -0,0 +1,25 @@
//! Crawler subsystem.
//!
//! Runs as its own binary (`src/bin/crawler.rs`) and as an in-process
//! daemon booted by the API backend; both share `domain`, `repo`, and
//! `storage`. Layering mirrors the `Storage` trait pattern: callers
//! depend on the `source::Source` trait, not on a concrete site; new
//! sites plug in as additional impls without touching the job runner.
//!
//! Submodules:
//! - [`browser`]: launches and pools Chromium via `chromiumoxide`.
//! First run downloads a known-good build via the `fetcher` feature.
//! - [`browser_manager`]: lazy-launch / idle-teardown wrapper around the
//! browser handle.
//! - [`daemon`]: cron task and worker pool.
//! - [`diff`]: change detection (new / updated / dropped semantics).
//! - [`jobs`]: job kinds, queue wrapper, handler dispatch.
//! - [`pipeline`]: the shared metadata pass and the enqueue helpers.
//! - [`rate_limit`]: per-host request pacing.
//! - [`session`]: PHPSESSID injection and login probe.
//! - [`source`]: the `Source` trait. Per-site impls live alongside it.
pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod diff;
pub mod jobs;
pub mod pipeline;
pub mod rate_limit;
pub mod session;
pub mod source;


@@ -0,0 +1,347 @@
//! Crawler pipeline — the reusable metadata pass and the enqueue helpers
//! that fan out chapter-content work. Shared between the daemon (cron tick)
//! and the CLI (`bin/crawler.rs`) so behavior stays in lockstep.
use anyhow::Context;
use sqlx::PgPool;
use uuid::Uuid;
use crate::crawler::browser_manager::BrowserManager;
use crate::crawler::jobs::{self, EnqueueResult, JobPayload};
use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::source::target::TargetSource;
use crate::crawler::source::{DiscoverMode, FetchContext, Source};
use crate::repo;
use crate::storage::Storage;
/// Coarse counters surfaced for logging at the end of a metadata pass.
#[derive(Debug, Default, Clone, Copy)]
pub struct MetadataStats {
pub discovered: usize,
pub upserted: usize,
pub covers_fetched: usize,
pub mangas_failed: usize,
}
/// Runs the discover → fetch → upsert → cover → chapter-list-diff pipeline
/// for the target source. Pure metadata; chapter content is enqueued as
/// separate `SyncChapterContent` jobs by the caller after this returns.
///
/// `limit == 0` means no cap (full backfill). `skip_chapters == true` is
/// the "metadata-only" mode (parser doesn't extract chapters, and
/// `sync_manga_chapters` is skipped — otherwise an empty chapter list
/// would soft-drop existing rows).
#[allow(clippy::too_many_arguments)]
pub async fn run_metadata_pass(
browser_manager: &BrowserManager,
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
start_url: &str,
limit: usize,
skip_chapters: bool,
) -> anyhow::Result<MetadataStats> {
let lease = browser_manager
.acquire()
.await
.context("acquire browser lease for metadata pass")?;
let browser_ref: &chromiumoxide::Browser = &lease;
let source = {
let s = TargetSource::new(start_url.to_string());
if skip_chapters {
s.without_chapter_parsing()
} else {
s
}
};
let ctx = FetchContext {
browser: browser_ref,
rate,
};
let source_id = source.id();
repo::crawler::ensure_source(
db,
source_id,
"Target Site",
&origin_of(start_url).unwrap_or_else(|| start_url.to_string()),
)
.await
.context("ensure_source")?;
let run_started_at = chrono::Utc::now();
let max_refs = (limit > 0).then_some(limit);
tracing::info!(?max_refs, "discovering manga list");
let refs = source
.discover(&ctx, DiscoverMode::Backfill, max_refs)
.await
.context("discover failed")?;
tracing::info!(count = refs.len(), "discovered manga list");
let mut stats = MetadataStats {
discovered: refs.len(),
..MetadataStats::default()
};
for (i, r) in refs.iter().enumerate() {
tracing::info!(
idx = i + 1,
total = stats.discovered,
key = %r.source_manga_key,
"fetching metadata"
);
let manga = match source.fetch_manga(&ctx, r).await {
Ok(m) => m,
Err(e) => {
tracing::warn!(
key = %r.source_manga_key,
url = %r.url,
error = ?e,
"fetch_manga failed"
);
stats.mangas_failed += 1;
continue;
}
};
let upsert = match repo::crawler::upsert_manga_from_source(db, source_id, &r.url, &manga)
.await
{
Ok(u) => u,
Err(e) => {
tracing::error!(
key = %r.source_manga_key,
error = ?e,
"upsert_manga_from_source failed"
);
stats.mangas_failed += 1;
continue;
}
};
stats.upserted += 1;
tracing::info!(
key = %manga.source_manga_key,
manga_id = %upsert.manga_id,
status = ?upsert.status,
title = %manga.title,
"manga upserted"
);
// Cover image: download when missing in storage or when metadata
// signaled an update (cover URL is part of metadata_hash, so
// Updated implies the URL may have moved). Failures are non-fatal.
let needs_cover = upsert.cover_image_path.is_none()
|| matches!(upsert.status, repo::crawler::UpsertStatus::Updated);
if needs_cover {
if let Some(cover_url) = manga.cover_url.as_deref() {
match download_and_store_cover(
db,
storage,
http,
rate,
&r.url,
upsert.manga_id,
cover_url,
)
.await
{
Ok(()) => stats.covers_fetched += 1,
Err(e) => tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"cover download failed"
),
}
}
}
if !skip_chapters {
match repo::crawler::sync_manga_chapters(
db,
source_id,
upsert.manga_id,
&manga.chapters,
)
.await
{
Ok(diff) => tracing::info!(
manga_id = %upsert.manga_id,
new = diff.new,
refreshed = diff.refreshed,
dropped = diff.dropped,
"chapters synced"
),
Err(e) => tracing::warn!(
manga_id = %upsert.manga_id,
error = ?e,
"chapter sync failed"
),
}
}
}
if limit == 0 {
match repo::crawler::mark_dropped_mangas(db, source_id, run_started_at).await {
Ok(n) => tracing::info!(dropped = n, "marked unseen manga as dropped"),
Err(e) => tracing::warn!(error = ?e, "drop-pass failed"),
}
} else {
tracing::info!(limit, "partial sync — skipping drop pass");
}
drop(lease);
Ok(stats)
}
/// Enqueue a `SyncChapterContent` job for every chapter of *any* bookmarked
/// manga that still has `page_count = 0` and a non-dropped source row.
/// Returns an [`EnqueueSummary`] with `inserted`, `skipped`, and `failed`
/// counts. The dedup index makes repeated calls idempotent.
pub async fn enqueue_bookmarked_pending(pool: &PgPool) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN bookmarks b ON b.manga_id = c.manga_id
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.page_count = 0
AND cs.dropped_at IS NULL
GROUP BY cs.source_id, c.id, cs.source_chapter_key, c.manga_id, c.created_at
ORDER BY c.manga_id, c.created_at ASC
"#,
)
.fetch_all(pool)
.await
.context("query bookmarked-pending chapters")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
/// Enqueue chapter-content jobs for a *single* manga (the bookmark-create
/// hook). Same dedup semantics as [`enqueue_bookmarked_pending`].
pub async fn enqueue_pending_for_manga(
pool: &PgPool,
manga_id: Uuid,
) -> anyhow::Result<EnqueueSummary> {
let rows: Vec<(String, Uuid, String)> = sqlx::query_as(
r#"
SELECT DISTINCT cs.source_id, c.id AS chapter_id, cs.source_chapter_key
FROM chapters c
JOIN chapter_sources cs ON cs.chapter_id = c.id
WHERE c.manga_id = $1
AND c.page_count = 0
AND cs.dropped_at IS NULL
ORDER BY cs.source_id, c.id
"#,
)
.bind(manga_id)
.fetch_all(pool)
.await
.context("query pending chapters for manga")?;
let mut summary = EnqueueSummary::default();
for (source_id, chapter_id, source_chapter_key) in rows {
let payload = JobPayload::SyncChapterContent {
source_id,
chapter_id,
source_chapter_key,
};
match jobs::enqueue(pool, &payload).await {
Ok(EnqueueResult::Inserted(_)) => summary.inserted += 1,
Ok(EnqueueResult::Skipped) => summary.skipped += 1,
Err(e) => {
tracing::warn!(
%chapter_id,
error = ?e,
"enqueue chapter content failed"
);
summary.failed += 1;
}
}
}
Ok(summary)
}
#[derive(Debug, Default, Clone, Copy)]
pub struct EnqueueSummary {
pub inserted: usize,
pub skipped: usize,
pub failed: usize,
}
/// Download a cover image and persist its storage path. Local to the
/// pipeline because the CLI still calls it from its inline chapter-content
/// loop; once the worker pool fully replaces that path we can fold this
/// into `pipeline` proper.
async fn download_and_store_cover(
db: &PgPool,
storage: &dyn Storage,
http: &reqwest::Client,
rate: &HostRateLimiters,
manga_url: &str,
manga_id: Uuid,
cover_url: &str,
) -> anyhow::Result<()> {
let absolute = reqwest::Url::parse(manga_url)
.context("parse manga URL")?
.join(cover_url)
.context("join cover URL onto manga URL")?;
rate.wait_for(absolute.as_str()).await?;
let resp = http
.get(absolute.clone())
.header(reqwest::header::REFERER, manga_url)
.send()
.await
.with_context(|| format!("GET {absolute}"))?
.error_for_status()
.with_context(|| format!("non-2xx for {absolute}"))?;
let bytes = resp.bytes().await.context("read cover body")?;
let kind = infer::get(&bytes);
let ext = kind.map(|k| k.extension()).unwrap_or("bin");
let key = format!("mangas/{manga_id}/cover.{ext}");
storage
.put(&key, &bytes)
.await
.with_context(|| format!("store cover at {key}"))?;
repo::manga::set_cover_image_path(db, manga_id, &key)
.await
.with_context(|| format!("update cover_image_path for {manga_id}"))?;
tracing::info!(
manga_id = %manga_id,
key = %key,
bytes = bytes.len(),
%absolute,
"cover stored"
);
Ok(())
}
fn origin_of(url: &str) -> Option<String> {
let (scheme, rest) = url.split_once("://")?;
let host = rest.split('/').next()?;
Some(format!("{scheme}://{host}"))
}
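
Review note: the `limit` and `skip_chapters` rules documented on `run_metadata_pass` can be summarized as a small decision table. The sketch below restates them in isolation; `PassPlan` and `plan_pass` are hypothetical names that do not exist in the crate.

```rust
/// Hypothetical summary of what a metadata pass will do for given inputs.
#[derive(Debug, PartialEq, Eq)]
struct PassPlan {
    /// `None` means an uncapped walk (full backfill).
    max_refs: Option<usize>,
    /// Whether `sync_manga_chapters` runs for each manga.
    sync_chapters: bool,
    /// Whether unseen mangas are marked dropped after the loop.
    run_drop_pass: bool,
}

fn plan_pass(limit: usize, skip_chapters: bool) -> PassPlan {
    PassPlan {
        // `limit == 0` means no cap.
        max_refs: (limit > 0).then_some(limit),
        // Metadata-only mode must not sync an empty chapter list,
        // because that would soft-drop existing rows.
        sync_chapters: !skip_chapters,
        // A partial run has not seen every manga, so dropping the
        // unseen ones is only safe on a full walk.
        run_drop_pass: limit == 0,
    }
}
```

For example, `plan_pass(25, true)` describes a capped, metadata-only run that skips both the chapter sync and the drop pass.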


@@ -0,0 +1,184 @@
//! Per-host request pacing.
//!
//! `RateLimiter` is a single-token bucket: each `wait().await` returns
//! immediately when at least `interval` has elapsed since the last call,
//! otherwise sleeps just enough to satisfy it. Uses
//! `tokio::time::Instant` so tests can run under `start_paused` virtual
//! time without sleeping for real.
//!
//! `HostRateLimiters` is the multi-host wrapper actually used by the
//! crawler — concurrent workers issuing requests to different origins
//! (catalog vs. CDN) don't contend on a shared budget; each host gets
//! its own bucket. `wait_for(url)` extracts the host, lazily creates a
//! limiter for it, and serializes only against other callers hitting
//! the same host.
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;
#[derive(Debug)]
pub struct RateLimiter {
interval: Duration,
last: Option<Instant>,
}
impl RateLimiter {
pub fn new(interval: Duration) -> Self {
Self {
interval,
last: None,
}
}
pub async fn wait(&mut self) {
if let Some(last) = self.last {
let elapsed = last.elapsed();
if elapsed < self.interval {
tokio::time::sleep(self.interval - elapsed).await;
}
}
self.last = Some(Instant::now());
}
}
/// Per-host rate limiter map. The outer `Mutex<HashMap>` is held only
/// during the entry-or-insert + Arc clone; the per-host `Mutex<RateLimiter>`
/// is held during the actual `wait().await`. So N workers calling
/// `wait_for(url)` on N different hosts contend nowhere except the brief
/// HashMap lookup; workers hitting the same host serialize on that
/// host's bucket.
#[derive(Debug)]
pub struct HostRateLimiters {
default_interval: Duration,
overrides: HashMap<String, Duration>,
map: Mutex<HashMap<String, Arc<Mutex<RateLimiter>>>>,
}
impl HostRateLimiters {
pub fn new(default_interval: Duration) -> Self {
Self {
default_interval,
overrides: HashMap::new(),
map: Mutex::new(HashMap::new()),
}
}
/// Set a per-host interval that overrides `default_interval`. Calls
/// after a host's limiter has been instantiated do *not* re-create
/// it — set all overrides before the first `wait_for` to that host.
pub fn with_override(mut self, host: impl Into<String>, interval: Duration) -> Self {
self.overrides.insert(host.into(), interval);
self
}
/// Block until the per-host budget allows the next request to
/// `url`'s host. Returns an error only when the URL has no host
/// (malformed input).
pub async fn wait_for(&self, url: &str) -> anyhow::Result<()> {
let host = host_of(url)
.ok_or_else(|| anyhow::anyhow!("no host in url: {url}"))?;
let limiter = {
let mut map = self.map.lock().await;
map.entry(host.clone())
.or_insert_with(|| {
let interval = self
.overrides
.get(&host)
.copied()
.unwrap_or(self.default_interval);
Arc::new(Mutex::new(RateLimiter::new(interval)))
})
.clone()
};
limiter.lock().await.wait().await;
Ok(())
}
}
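/// Extract the host (no port) from a URL string. Returns `None` for
/// inputs without a `scheme://host` shape — those would never have
/// reached the network layer anyway. This is a plain string split, not
/// a full URL parser: userinfo (`user:pw@host`), unbracketed-port-free
/// IPv6 literals (`[::1]`), and a query string directly after the host
/// are not handled.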
fn host_of(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host = host_with_port.rsplit_once(':').map_or(host_with_port, |(h, _)| h);
(!host.is_empty()).then(|| host.to_ascii_lowercase())
}
#[cfg(test)]
mod tests {
use super::*;
#[tokio::test(start_paused = true)]
async fn first_call_does_not_sleep() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait().await;
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[tokio::test(start_paused = true)]
async fn second_call_sleeps_to_fill_interval() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait().await;
rl.wait().await;
// Second call had to wait the full 100ms after the (instant)
// first call.
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
#[tokio::test(start_paused = true)]
async fn no_sleep_if_interval_already_elapsed() {
let mut rl = RateLimiter::new(Duration::from_millis(100));
rl.wait().await;
tokio::time::sleep(Duration::from_millis(250)).await;
let t0 = Instant::now();
rl.wait().await;
// Already 250ms past — no further wait needed.
assert_eq!(Instant::now() - t0, Duration::ZERO);
}
#[test]
fn host_of_parses_scheme_path_and_port() {
assert_eq!(host_of("https://Example.com/path").as_deref(), Some("example.com"));
assert_eq!(host_of("http://cdn.foo.bar/img.jpg").as_deref(), Some("cdn.foo.bar"));
assert_eq!(host_of("http://localhost:8080/x").as_deref(), Some("localhost"));
assert!(host_of("not a url").is_none());
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_pace_per_host() {
// Two hosts at 100ms each. Two consecutive calls to the SAME
// host wait 100ms total. Two consecutive calls to DIFFERENT
// hosts both fire immediately.
let rl = HostRateLimiters::new(Duration::from_millis(100));
let t0 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
rl.wait_for("https://b.example/y").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::ZERO, "different hosts don't contend");
let t1 = Instant::now();
rl.wait_for("https://a.example/x").await.unwrap();
assert_eq!(
Instant::now() - t1,
Duration::from_millis(100),
"second call to same host waits a full interval"
);
}
#[tokio::test(start_paused = true)]
async fn host_rate_limiters_honor_overrides() {
let rl = HostRateLimiters::new(Duration::from_millis(1000))
.with_override("fast.example", Duration::from_millis(100));
rl.wait_for("https://fast.example/a").await.unwrap();
let t0 = Instant::now();
rl.wait_for("https://fast.example/b").await.unwrap();
assert_eq!(Instant::now() - t0, Duration::from_millis(100));
}
}
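
Review note: the pacing rule inside `RateLimiter::wait` reduces to one pure computation, which may be easier to reason about without the async runtime. This is a minimal sketch; `delay_needed` is a hypothetical helper, not part of the module.

```rust
use std::time::Duration;

/// How long a caller would have to sleep before the next request.
/// `since_last` is `None` when no request has been made yet.
fn delay_needed(interval: Duration, since_last: Option<Duration>) -> Duration {
    match since_last {
        // First call: nothing to pace against.
        None => Duration::ZERO,
        // Sleep only for the part of the interval that has not elapsed;
        // saturating_sub clamps to zero once the interval has passed.
        Some(elapsed) => interval.saturating_sub(elapsed),
    }
}
```

The three `RateLimiter` tests above correspond to the three cases: no previous call, a call inside the interval, and a call after the interval has already elapsed.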


@@ -0,0 +1,161 @@
//! PHPSESSID injection + login probe.
//!
//! The catalog site we crawl renders chapter pages as a single multi-
//! page list only for logged-in users. We don't try to bypass the
//! login (CAPTCHA wall) — instead the operator pastes their browser's
//! `PHPSESSID` cookie into `CRAWLER_PHPSESSID` and the crawler injects
//! it into Chromium *and* reqwest before the first navigation.
//!
//! Two things the cookie alone doesn't give us:
//! 1. The cookie value is only meaningful to the *server* — we have
//! no way to predict from the value alone whether it's still valid.
//! `verify_session` does a navigation and checks for `#avatar_menu`,
//! which only renders for authenticated visitors. Bail out cleanly at
//! startup if it's missing rather than discovering it 30 minutes
//! into a backfill.
//! 2. The reqwest client (used for cover and chapter-image downloads)
//! has its own cookie store; we seed it for the catalog host only.
//! CDN hosts are deliberately *not* given the cookie — they serve
//! image bytes by signed URLs and don't need it.
use anyhow::{anyhow, Context};
use chromiumoxide::browser::Browser;
use chromiumoxide::cdp::browser_protocol::network::CookieParam;
/// Compute the cookie domain (e.g. `.example.com`) from a start URL.
/// The leading dot makes the cookie cover every subdomain — the source
/// often redirects between `www.` and other prefixes mid-crawl, and a
/// host-only cookie would silently drop on the cross-subdomain hop.
///
/// Caveat: this takes the last two dot-labels, which is wrong for
/// multi-part TLDs such as `.co.uk` or `.com.br`: `www.example.co.uk`
/// would resolve to `.co.uk`, and the cookie would attach to every site
/// under `.co.uk`. For those, the operator should override via
/// `CRAWLER_COOKIE_DOMAIN` rather than relying on this function;
/// pulling in the Public Suffix List for one knob isn't worth it yet.
pub fn registrable_domain(url: &str) -> Option<String> {
let after_scheme = url.split_once("://")?.1;
let host_with_port = after_scheme.split('/').next()?;
let host = host_with_port
.rsplit_once(':')
.map_or(host_with_port, |(h, _)| h)
.to_ascii_lowercase();
if host.is_empty() {
return None;
}
let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
if labels.len() < 2 {
// Bare hostname (e.g. `localhost`) — return as-is, no leading
// dot. Setting `.localhost` as cookie domain is invalid.
return Some(host);
}
let registrable = &labels[labels.len() - 2..];
Some(format!(".{}", registrable.join(".")))
}
/// Inject the PHPSESSID cookie into the browser's cookie store for the
/// catalog domain. Must be called before any navigation that depends on
/// authentication; subsequent navigations include the cookie
/// automatically.
pub async fn inject_phpsessid(
browser: &Browser,
sid: &str,
cookie_domain: &str,
) -> anyhow::Result<()> {
let cookie = CookieParam {
name: "PHPSESSID".to_string(),
value: sid.to_string(),
url: None,
domain: Some(cookie_domain.to_string()),
path: Some("/".to_string()),
secure: None,
http_only: Some(true),
same_site: None,
expires: None,
priority: None,
same_party: None,
source_scheme: None,
source_port: None,
partition_key: None,
};
browser
.set_cookies(vec![cookie])
.await
.context("set PHPSESSID in chromium cookie store")?;
tracing::info!(domain = cookie_domain, "injected PHPSESSID into browser");
Ok(())
}
/// Navigate to `probe_url` and confirm the logged-in `#avatar_menu`
/// element is present. The selector only renders for authenticated
/// visitors, so its absence is the unambiguous signal that PHPSESSID
/// is missing, expired, or revoked.
///
/// This burns one navigation against the catalog's rate limiter. The
/// trade is worth it — failing here costs ~1s; failing 30 minutes into
/// a backfill costs 30 minutes.
pub async fn verify_session(browser: &Browser, probe_url: &str) -> anyhow::Result<()> {
let page = browser
.new_page(probe_url)
.await
.with_context(|| format!("open probe page {probe_url}"))?;
page.wait_for_navigation().await.context("wait for nav on probe")?;
// The avatar menu is rendered server-side as part of the header
// when a valid session cookie is present; absent JS is fine.
let found = page.find_element("#avatar_menu").await.is_ok();
page.close().await.ok();
if found {
tracing::info!("session probe ok — #avatar_menu present");
Ok(())
} else {
Err(anyhow!(
"session probe failed — #avatar_menu not present at {probe_url}; \
PHPSESSID is missing, expired, or revoked. Refresh CRAWLER_PHPSESSID \
and re-run."
))
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn registrable_domain_strips_subdomain() {
assert_eq!(
registrable_domain("https://www.target-site.com/manga/foo/").as_deref(),
Some(".target-site.com")
);
assert_eq!(
registrable_domain("https://m.example.org").as_deref(),
Some(".example.org")
);
}
#[test]
fn registrable_domain_keeps_two_label_host() {
assert_eq!(
registrable_domain("https://example.com/").as_deref(),
Some(".example.com")
);
}
#[test]
fn registrable_domain_handles_port() {
assert_eq!(
registrable_domain("http://www.foo.bar:8080/x").as_deref(),
Some(".foo.bar")
);
}
#[test]
fn registrable_domain_bare_hostname_no_leading_dot() {
// .localhost would be invalid as a cookie Domain.
assert_eq!(registrable_domain("http://localhost:5173").as_deref(), Some("localhost"));
}
#[test]
fn registrable_domain_returns_none_for_garbage() {
assert!(registrable_domain("not a url").is_none());
}
}
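
Review note: the multi-part TLD caveat on `registrable_domain` and the `CRAWLER_COOKIE_DOMAIN` escape hatch can be illustrated together. The sketch below works on a bare host rather than a URL, and both function names are hypothetical. How the real code reads and applies the override is not shown in this diff, so the precedence here is an assumption.

```rust
/// Two-label heuristic on a bare host, mirroring the documented caveat.
fn last_two_labels(host: &str) -> Option<String> {
    let labels: Vec<&str> = host.split('.').filter(|l| !l.is_empty()).collect();
    match labels.len() {
        0 => None,
        // Bare hostname such as `localhost`: no leading dot.
        1 => Some(labels[0].to_ascii_lowercase()),
        n => Some(format!(".{}", labels[n - 2..].join(".").to_ascii_lowercase())),
    }
}

/// Assumed precedence: a non-empty operator override wins over the heuristic.
fn effective_cookie_domain(override_domain: Option<&str>, host: &str) -> Option<String> {
    match override_domain.map(str::trim).filter(|s| !s.is_empty()) {
        Some(explicit) => Some(explicit.to_string()),
        None => last_two_labels(host),
    }
}
```

With no override, `www.example.co.uk` yields `.co.uk`, which is the failure mode the doc comment warns about; passing `.example.co.uk` as the override avoids it.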


@@ -0,0 +1,118 @@
//! `Source` trait — the per-site abstraction.
//!
//! Job handlers depend on this trait, not on a concrete site. Adding a
//! new site is: implement `Source`, register it in a `sources` table
//! row, and the existing job pipeline picks it up unchanged.
pub mod target;
use async_trait::async_trait;
use chromiumoxide::browser::Browser;
use serde::{Deserialize, Serialize};
/// How a `discover` job should walk the source's index.
#[derive(Clone, Copy, Debug, Serialize, Deserialize)]
pub enum DiscoverMode {
/// Walk every index page from last back to first. Used for the
/// initial seed of a source.
Backfill,
/// Walk index pages from page 1 forward, stopping after
/// `stop_after_unchanged` consecutive mangas whose `metadata_hash`
/// matches storage. Used for the recurring cron tick.
Incremental { stop_after_unchanged: usize },
}
/// Pointer at a manga in the source's index, before we've fetched the
/// detail page. The `source_manga_key` is whatever stable id the source
/// uses (slug, numeric id, etc).
#[derive(Clone, Debug)]
pub struct SourceMangaRef {
pub source_manga_key: String,
pub title: String,
pub url: String,
}
/// Full metadata returned by `fetch_manga`. The hash is computed by the
/// source impl over the metadata-only field set (title through
/// cover_url) — chapter changes are tracked separately via
/// `chapter_sources`, so they intentionally do not affect
/// `metadata_hash`.
#[derive(Clone, Debug)]
pub struct SourceManga {
pub source_manga_key: String,
pub title: String,
pub alternative_titles: Vec<String>,
pub authors: Vec<String>,
pub genres: Vec<String>,
pub tags: Vec<String>,
pub status: Option<String>,
pub summary: Option<String>,
pub cover_url: Option<String>,
/// Chapters surfaced on the same page as the metadata. Sources
/// where the chapter list lives elsewhere can leave this empty
/// and supply it via `fetch_chapter_list` instead.
pub chapters: Vec<SourceChapterRef>,
pub metadata_hash: String,
}
#[derive(Clone, Debug)]
pub struct SourceChapterRef {
pub source_chapter_key: String,
pub number: i32,
pub title: Option<String>,
pub url: String,
}
#[derive(Clone, Debug)]
pub struct SourceChapter {
pub source_chapter_key: String,
pub number: i32,
pub title: Option<String>,
/// Ordered list of page image URLs, ready to be fetched and put
/// into `Storage`.
pub page_urls: Vec<String>,
}
/// Context passed to every `Source` call. Carries the browser handle
/// plus the per-host rate-limiter map so impls that issue multiple
/// requests in one call (pagination walks, multi-page chapter image
/// fetches) honor the right budget for each origin.
pub struct FetchContext<'a> {
pub browser: &'a Browser,
pub rate: &'a crate::crawler::rate_limit::HostRateLimiters,
}
#[async_trait]
pub trait Source: Send + Sync {
/// Stable identifier — also the row key in the `sources` table.
fn id(&self) -> &'static str;
/// Returns up to `max_results` manga refs in source order. Pass
/// `None` for an uncapped walk (full backfill / incremental sweep).
/// Implementations should stop paginating as soon as the cap is
/// reached so partial runs don't pay for pages they won't use.
async fn discover(
&self,
ctx: &FetchContext<'_>,
mode: DiscoverMode,
max_results: Option<usize>,
) -> anyhow::Result<Vec<SourceMangaRef>>;
async fn fetch_manga(
&self,
ctx: &FetchContext<'_>,
r: &SourceMangaRef,
) -> anyhow::Result<SourceManga>;
async fn fetch_chapter_list(
&self,
ctx: &FetchContext<'_>,
manga: &SourceManga,
) -> anyhow::Result<Vec<SourceChapterRef>>;
async fn fetch_chapter(
&self,
ctx: &FetchContext<'_>,
r: &SourceChapterRef,
) -> anyhow::Result<SourceChapter>;
}
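
Review note: the two `DiscoverMode` variants describe a page order and, for `Incremental`, an early-stop rule. The sketch below models both as plain synchronous functions. `page_order` and `visited_before_stop` are hypothetical names, and treating `stop_after_unchanged == 0` as "never stop early" is an assumption, not something the trait docs specify.

```rust
/// Page numbers in the order a walk would visit them.
/// `last_page == None` means pagination could not be detected.
fn page_order(last_page: Option<i32>, backfill: bool) -> Vec<i32> {
    match (last_page, backfill) {
        (None, _) => vec![1],
        // Backfill walks from the last page back to page 1.
        (Some(last), true) => (1..=last).rev().collect(),
        // Incremental walks forward from page 1.
        (Some(last), false) => (1..=last).collect(),
    }
}

/// Number of mangas visited before an incremental walk would stop.
/// `unchanged[i]` is true when manga `i` matches its stored hash.
fn visited_before_stop(unchanged: &[bool], stop_after_unchanged: usize) -> usize {
    let mut run = 0usize; // length of the current unchanged streak
    let mut visited = 0usize;
    for &same in unchanged {
        visited += 1;
        run = if same { run + 1 } else { 0 };
        // Assumption: a threshold of 0 disables the early stop.
        if stop_after_unchanged > 0 && run >= stop_after_unchanged {
            break;
        }
    }
    visited
}
```

A changed manga resets the streak, so only consecutive unchanged entries count toward the threshold.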


@@ -0,0 +1,792 @@
//! First concrete [`Source`] impl, modeled on the selectors of the
//! old Puppeteer crawler. The name "target" is a placeholder — rename
//! once the site is officially identified.
//!
//! `scraper`'s selector parser does not support `:has()` or
//! `:contains()`, so the labelled-`td` lookups from the old script
//! (`td:has(label:contains("Author:"))`) are implemented by walking
//! the parsed tree.
use std::time::Duration;
use anyhow::Context;
use async_trait::async_trait;
use sha2::{Digest, Sha256};
use super::{
DiscoverMode, FetchContext, Source, SourceChapter, SourceChapterRef, SourceManga,
SourceMangaRef,
};
pub struct TargetSource {
base_url: String,
parse_chapters: bool,
}
impl TargetSource {
pub fn new(base_url: impl Into<String>) -> Self {
Self {
base_url: base_url.into(),
parse_chapters: true,
}
}
pub fn base_url(&self) -> &str {
&self.base_url
}
/// Skip the chapter-list selector when parsing detail pages.
/// The returned `SourceManga.chapters` will be empty even when the
/// page has a chapter table. Caller must also avoid calling
/// `repo::crawler::sync_manga_chapters` for these mangas — an
/// empty list would otherwise soft-drop the manga's existing
/// chapter rows.
pub fn without_chapter_parsing(mut self) -> Self {
self.parse_chapters = false;
self
}
}
#[async_trait]
impl Source for TargetSource {
fn id(&self) -> &'static str {
"target"
}
async fn discover(
&self,
ctx: &FetchContext<'_>,
mode: DiscoverMode,
max_results: Option<usize>,
) -> anyhow::Result<Vec<SourceMangaRef>> {
// Always visit page 1 first because that's the only way to
// discover `last_page`. We cache the HTML so we don't have to
// re-navigate when the iteration reaches page 1 again.
let first_html = navigate(ctx, self.base_url.as_str()).await?;
let last_page = {
let doc = scraper::Html::parse_document(&first_html);
parse_last_page(&doc)
};
let backfill = matches!(mode, DiscoverMode::Backfill);
let order: Vec<i32> = match (last_page, backfill) {
(None, _) => vec![1],
// Backfill = oldest-first: walk pages last → 1, then
// reverse within each page (the listing is update_date
// DESC, so the bottom of the last page is the oldest
// entry the source still surfaces).
(Some(last), true) => (1..=last).rev().collect(),
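// Incremental: walk forward from page 1. `stop_after_unchanged`
// is not consulted in this function, so this arm still visits
// every page unless `max_results` halts the walk.
(Some(last), false) => (1..=last).collect(),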
};
tracing::info!(
?mode,
last_page = ?last_page,
page_count = order.len(),
"walking pagination"
);
let mut all = Vec::new();
for page_num in order {
let html = if page_num == 1 {
first_html.clone()
} else {
navigate(ctx, &page_url(&self.base_url, page_num)).await?
};
let mut page_refs = {
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
};
if backfill {
page_refs.reverse();
}
tracing::info!(page_num, count = page_refs.len(), "page walked");
all.extend(page_refs);
if cap_reached(&all, max_results) {
tracing::info!(cap = ?max_results, "max_results reached; halting pagination");
break;
}
}
Ok(truncate_to_cap(all, max_results))
}
async fn fetch_manga(
&self,
ctx: &FetchContext<'_>,
r: &SourceMangaRef,
) -> anyhow::Result<SourceManga> {
let html = navigate(ctx, r.url.as_str()).await?;
parse_manga_detail(&html, &r.source_manga_key, self.parse_chapters)
.with_context(|| format!("parse manga detail at {}", r.url))
}
async fn fetch_chapter_list(
&self,
_ctx: &FetchContext<'_>,
_manga: &SourceManga,
) -> anyhow::Result<Vec<SourceChapterRef>> {
anyhow::bail!("fetch_chapter_list not implemented yet")
}
async fn fetch_chapter(
&self,
_ctx: &FetchContext<'_>,
_r: &SourceChapterRef,
) -> anyhow::Result<SourceChapter> {
anyhow::bail!("fetch_chapter not implemented yet")
}
}
fn cap_reached<T>(buf: &[T], max: Option<usize>) -> bool {
matches!(max, Some(m) if buf.len() >= m)
}
fn truncate_to_cap<T>(mut buf: Vec<T>, max: Option<usize>) -> Vec<T> {
if let Some(m) = max {
buf.truncate(m);
}
buf
}
/// Single point of rate-limited navigation. Every Source request goes
/// through here, so the per-host limiter map is the only knob that
/// controls per-origin RPS.
async fn navigate(ctx: &FetchContext<'_>, url: &str) -> anyhow::Result<String> {
ctx.rate.wait_for(url).await?;
let page = ctx.browser.new_page(url).await?;
page.wait_for_navigation().await?;
// Stopgap until we wait on a specific selector per page type —
// gives any post-load JS a beat to finish injecting content.
tokio::time::sleep(Duration::from_secs(1)).await;
let html = page.content().await?;
page.close().await?;
Ok(html)
}
fn parse_last_page(doc: &scraper::Html) -> Option<i32> {
// Pagination links carry their page number as text. Take the
// numeric maximum so we don't depend on a specific layout (Prev,
// Next, ellipses, etc. all get filtered out by .parse).
let sel = scraper::Selector::parse("#left_side .pagination a").unwrap();
doc.select(&sel)
.filter_map(|a| {
collapse_whitespace(&a.text().collect::<String>())
.parse::<i32>()
.ok()
})
.max()
}
/// Substitutes the first `/N/` path segment with the target page
/// number and returns the URL unchanged when no such segment exists.
/// Source impls that paginate via a different URL shape need their own
/// helper; for the modeled site the segment is always present.
fn page_url(template_url: &str, page: i32) -> String {
let bytes = template_url.as_bytes();
let mut i = 0;
while i + 1 < bytes.len() {
if bytes[i] == b'/' && bytes[i + 1].is_ascii_digit() {
let start = i;
let mut j = i + 1;
while j < bytes.len() && bytes[j].is_ascii_digit() {
j += 1;
}
if j < bytes.len() && bytes[j] == b'/' {
let mut out = String::with_capacity(template_url.len() + 4);
out.push_str(&template_url[..start]);
out.push_str(&format!("/{page}/"));
out.push_str(&template_url[j + 1..]);
return out;
}
}
i += 1;
}
template_url.to_string()
}
#[cfg(test)]
fn parse_manga_list(html: &str) -> Vec<SourceMangaRef> {
let doc = scraper::Html::parse_document(html);
parse_manga_list_from(&doc)
}
fn parse_manga_list_from(doc: &scraper::Html) -> Vec<SourceMangaRef> {
let sel = scraper::Selector::parse("#left_side .pic_list .updatesli span a").unwrap();
doc.select(&sel)
.filter_map(|a| {
let url = a.value().attr("href")?.trim().to_string();
if url.is_empty() {
return None;
}
let title = collapse_whitespace(&a.text().collect::<String>());
if title.is_empty() {
return None;
}
Some(SourceMangaRef {
source_manga_key: derive_key_from_url(&url),
title,
url,
})
})
.collect()
}
fn parse_manga_detail(
html: &str,
key: &str,
include_chapters: bool,
) -> anyhow::Result<SourceManga> {
let doc = scraper::Html::parse_document(html);
let title = first_text(&doc, ".w-title h1").context("missing .w-title h1")?;
let summary = first_text(&doc, ".manga_summary");
let cover_url = first_attr(&doc, ".cover > img:nth-child(1)", "src");
let authors = links_in_labelled_td(&doc, "Author");
let genres = links_in_labelled_td(&doc, "Genre");
let raw_status = labelled_td_child_text(&doc, "Status", "span");
let status = normalize_status(raw_status.as_deref(), key);
let alternative_titles = labelled_td_value_after_label(&doc, "Alternative")
.map(|s| {
s.split([';', ',', '|'])
.map(str::trim)
.filter(|p| !p.is_empty())
.map(String::from)
.collect()
})
.unwrap_or_default();
let tag_sel = scraper::Selector::parse(".aside-body a.tag").unwrap();
let tags: Vec<String> = doc
.select(&tag_sel)
.map(|a| collapse_whitespace(&a.text().collect::<String>()))
.map(|s| strip_tag_count(&s))
.filter(|s| !s.is_empty())
.collect();
let chapters = if include_chapters {
parse_chapter_list(&doc)
} else {
Vec::new()
};
let mut manga = SourceManga {
source_manga_key: key.to_string(),
title,
alternative_titles,
authors,
genres,
tags,
status,
summary,
cover_url,
chapters,
metadata_hash: String::new(),
};
manga.metadata_hash = compute_metadata_hash(&manga);
Ok(manga)
}
/// Source advertises status as "Ongoing" or "Completed"; we normalize
/// to the lowercase form the `mangas.status` CHECK constraint accepts.
/// Anything else is a parse miss (selector drift, new value, etc.) and
/// returns `None` after logging — the manga sync continues regardless.
fn normalize_status(raw: Option<&str>, key: &str) -> Option<String> {
let trimmed = raw.map(str::trim).filter(|s| !s.is_empty())?;
if trimmed.eq_ignore_ascii_case("ongoing") {
Some("ongoing".to_string())
} else if trimmed.eq_ignore_ascii_case("completed") {
Some("completed".to_string())
} else {
tracing::error!(
key,
raw_status = trimmed,
"unknown manga status (expected 'Ongoing' or 'Completed'); continuing with status=None"
);
None
}
}
/// Strips a trailing digit-only `(NN)` suffix from a tag name, the form
/// the source uses to display tag counts. Non-numeric parentheses are
/// preserved.
fn strip_tag_count(s: &str) -> String {
let trimmed = s.trim();
if trimmed.ends_with(')') {
if let Some(open) = trimmed.rfind('(') {
let inside = &trimmed[open + 1..trimmed.len() - 1];
if !inside.is_empty() && inside.chars().all(|c| c.is_ascii_digit()) {
return trimmed[..open].trim().to_string();
}
}
}
trimmed.to_string()
}
fn parse_chapter_list(doc: &scraper::Html) -> Vec<SourceChapterRef> {
let sel = scraper::Selector::parse("#chapter_table td h4 a.chico").unwrap();
doc.select(&sel)
.filter_map(|a| {
let url = a.value().attr("href")?.trim().to_string();
if url.is_empty() {
return None;
}
let title_text = collapse_whitespace(&a.text().collect::<String>());
let number = parse_chapter_number(&title_text).unwrap_or(0);
Some(SourceChapterRef {
source_chapter_key: derive_chapter_key_from_url(&url),
number,
title: (!title_text.is_empty()).then_some(title_text),
url,
})
})
.collect()
}
fn parse_chapter_number(text: &str) -> Option<i32> {
let mut buf = String::new();
for c in text.chars() {
if c.is_ascii_digit() {
buf.push(c);
} else if !buf.is_empty() {
break;
}
}
buf.parse().ok()
}
fn derive_key_from_url(url: &str) -> String {
url.split('?')
.next()
.unwrap_or(url)
.trim_end_matches('/')
.rsplit('/')
.find(|s| !s.is_empty())
.unwrap_or(url)
.to_string()
}
/// Chapter URLs on this source point at the reader's page 1, e.g.
/// `.../uu/br_chapter-379272/pg-1/`. The chapter identity is the
/// `br_chapter-N` (or `to_chapter-N`) segment — the `pg-\d+` segment
/// identifies a page *within* a chapter, so naively taking the last
/// path component returns `"pg-1"` for every chapter and collapses
/// them all under one source_chapter_key downstream.
fn derive_chapter_key_from_url(url: &str) -> String {
let trimmed = url.split('?').next().unwrap_or(url).trim_end_matches('/');
let without_reader_page = match trimmed.rsplit_once('/') {
Some((prefix, last)) if is_reader_page_segment(last) => prefix,
_ => trimmed,
};
without_reader_page
.rsplit('/')
.find(|s| !s.is_empty())
.unwrap_or(url)
.to_string()
}
fn is_reader_page_segment(s: &str) -> bool {
s.len() > 3 && s.starts_with("pg-") && s[3..].bytes().all(|b| b.is_ascii_digit())
}
fn first_text(doc: &scraper::Html, sel: &str) -> Option<String> {
let s = scraper::Selector::parse(sel).ok()?;
let el = doc.select(&s).next()?;
let text = collapse_whitespace(&el.text().collect::<String>());
(!text.is_empty()).then_some(text)
}
fn first_attr(doc: &scraper::Html, sel: &str, attr: &str) -> Option<String> {
let s = scraper::Selector::parse(sel).ok()?;
let el = doc.select(&s).next()?;
el.value().attr(attr).map(str::to_string)
}
/// `td` whose contained `label` text begins with `label_prefix` — the
/// `scraper`-friendly equivalent of `td:has(label:contains("Foo"))`.
fn td_with_label<'a>(
doc: &'a scraper::Html,
label_prefix: &str,
) -> Option<scraper::ElementRef<'a>> {
let td_sel = scraper::Selector::parse("td").unwrap();
let label_sel = scraper::Selector::parse("label").unwrap();
for td in doc.select(&td_sel) {
for label in td.select(&label_sel) {
let text: String = label.text().collect();
if text.trim().starts_with(label_prefix) {
return Some(td);
}
}
}
None
}
fn links_in_labelled_td(doc: &scraper::Html, label_prefix: &str) -> Vec<String> {
let Some(td) = td_with_label(doc, label_prefix) else {
return Vec::new();
};
let a_sel = scraper::Selector::parse("a").unwrap();
td.select(&a_sel)
.map(|a| collapse_whitespace(&a.text().collect::<String>()))
.filter(|s| !s.is_empty())
.collect()
}
fn labelled_td_child_text(
doc: &scraper::Html,
label_prefix: &str,
child_sel: &str,
) -> Option<String> {
let td = td_with_label(doc, label_prefix)?;
let child = scraper::Selector::parse(child_sel).ok()?;
let el = td.select(&child).next()?;
let text = collapse_whitespace(&el.text().collect::<String>());
(!text.is_empty()).then_some(text)
}
/// Returns the text content of the labelled `td` with the leading
/// "Label:" portion stripped — used for "Alternative:" which puts the
/// value directly in the cell rather than in a child element.
fn labelled_td_value_after_label(
doc: &scraper::Html,
label_prefix: &str,
) -> Option<String> {
let td = td_with_label(doc, label_prefix)?;
let full: String = td.text().collect();
let after = full.split_once(':').map(|(_, r)| r).unwrap_or(&full);
let trimmed = collapse_whitespace(after);
(!trimmed.is_empty()).then_some(trimmed)
}
fn collapse_whitespace(s: &str) -> String {
s.split_whitespace().collect::<Vec<_>>().join(" ")
}
fn compute_metadata_hash(m: &SourceManga) -> String {
// Field separators are ASCII unit/record separators so a field
// containing a delimiter character can't be mistaken for two
// smaller fields.
let mut h = Sha256::new();
fn feed(h: &mut Sha256, s: &str) {
h.update(s.as_bytes());
h.update(b"\x1F");
}
fn feed_list(h: &mut Sha256, xs: &[String]) {
for s in xs {
feed(h, s);
}
h.update(b"\x1E");
}
feed(&mut h, &m.title);
feed_list(&mut h, &m.alternative_titles);
feed_list(&mut h, &m.authors);
feed_list(&mut h, &m.genres);
feed_list(&mut h, &m.tags);
feed(&mut h, m.status.as_deref().unwrap_or(""));
feed(&mut h, m.summary.as_deref().unwrap_or(""));
feed(&mut h, m.cover_url.as_deref().unwrap_or(""));
format!("{:x}", h.finalize())
}
#[cfg(test)]
mod tests {
use super::*;
const LISTING_HTML: &str = r#"
<html><body>
<div id="left_side">
<div class="pic_list">
<div class="updatesli">
<span><a href="https://target.example/manga/foo">Foo Manga</a></span>
</div>
<div class="updatesli">
<span><a href="https://target.example/manga/bar-baz"> Bar Baz </a></span>
</div>
<div class="updatesli">
<span><a href="">Empty href ignored</a></span>
</div>
</div>
</div>
</body></html>
"#;
const DETAIL_HTML: &str = r#"
<html><body>
<div class="w-title"><h1>Test Manga Title</h1></div>
<div class="cover"><img src="/cover.jpg"><img src="/extra-not-cover.jpg"></div>
<div class="manga_summary">A summary of the manga.</div>
<table>
<tr><td><label>Author:</label><a href="/a/1">Author One</a><a href="/a/2">Author Two</a></td></tr>
<tr><td><label>Genre(s):</label><a href="/g/1">Action</a><a href="/g/2">Drama</a></td></tr>
<tr><td><label>Status:</label><span>Ongoing</span></td></tr>
<tr><td><label>Alternative:</label> Alt Title 1; Alt Title 2 </td></tr>
</table>
<aside><div class="aside-body">
<a class="tag">Fantasy (21)</a>
<a class="tag">Romance</a>
<a class="tag"> Action (5)</a>
<a class="not-a-tag">should-be-ignored</a>
</div></aside>
<table id="chapter_table">
<tr><td><h4><a class="chico" href="/manga/foo/chapter/1">Ch.1</a></h4></td></tr>
<tr><td><h4><a class="chico" href="/manga/foo/chapter/2">Ch.2 - The Beginning</a></h4></td></tr>
<tr><td><h4><a class="chico" href="/manga/foo/chapter/3">Chapter 3: Onward</a></h4></td></tr>
</table>
</body></html>
"#;
#[test]
fn parse_manga_list_extracts_title_url_and_derives_key() {
let refs = parse_manga_list(LISTING_HTML);
assert_eq!(refs.len(), 2, "third entry has empty href and is skipped");
assert_eq!(refs[0].title, "Foo Manga");
assert_eq!(refs[0].url, "https://target.example/manga/foo");
assert_eq!(refs[0].source_manga_key, "foo");
assert_eq!(refs[1].title, "Bar Baz");
assert_eq!(refs[1].source_manga_key, "bar-baz");
}
#[test]
fn parse_manga_detail_pulls_all_fields() {
let m = parse_manga_detail(DETAIL_HTML, "test-key", true).expect("parse");
assert_eq!(m.source_manga_key, "test-key");
assert_eq!(m.title, "Test Manga Title");
assert_eq!(m.summary.as_deref(), Some("A summary of the manga."));
assert_eq!(m.authors, vec!["Author One", "Author Two"]);
assert_eq!(m.genres, vec!["Action", "Drama"]);
assert_eq!(m.status.as_deref(), Some("ongoing"));
assert_eq!(m.alternative_titles, vec!["Alt Title 1", "Alt Title 2"]);
// Counts in parentheses are stripped — "Fantasy (21)" → "Fantasy".
assert_eq!(m.tags, vec!["Fantasy", "Romance", "Action"]);
assert_eq!(m.cover_url.as_deref(), Some("/cover.jpg"));
assert!(!m.metadata_hash.is_empty());
assert_eq!(m.chapters.len(), 3);
assert_eq!(m.chapters[0].number, 1);
assert_eq!(m.chapters[0].title.as_deref(), Some("Ch.1"));
assert_eq!(m.chapters[0].url, "/manga/foo/chapter/1");
assert_eq!(m.chapters[0].source_chapter_key, "1");
assert_eq!(m.chapters[1].number, 2);
assert_eq!(m.chapters[1].title.as_deref(), Some("Ch.2 - The Beginning"));
assert_eq!(m.chapters[2].number, 3);
assert_eq!(m.chapters[2].title.as_deref(), Some("Chapter 3: Onward"));
}
#[test]
fn status_normalized_case_insensitively() {
assert_eq!(normalize_status(Some("Ongoing"), "k").as_deref(), Some("ongoing"));
assert_eq!(normalize_status(Some("ONGOING"), "k").as_deref(), Some("ongoing"));
assert_eq!(normalize_status(Some(" completed "), "k").as_deref(), Some("completed"));
}
#[test]
fn unknown_status_logs_and_returns_none() {
// Logging is observable in test output via tracing-test, but
// here we just assert the contract: unknown becomes None
// (and the manga is therefore still synced by the caller).
assert!(normalize_status(Some("Hiatus"), "k").is_none());
assert!(normalize_status(Some(""), "k").is_none());
assert!(normalize_status(None, "k").is_none());
}
#[test]
fn strip_tag_count_drops_trailing_digit_parens_only() {
assert_eq!(strip_tag_count("Fantasy (21)"), "Fantasy");
assert_eq!(strip_tag_count(" Action (5) "), "Action");
assert_eq!(strip_tag_count("Romance"), "Romance");
// Non-numeric parens stay put.
assert_eq!(strip_tag_count("Slice of Life (sub)"), "Slice of Life (sub)");
// Only the trailing paren is considered.
assert_eq!(strip_tag_count("Tag (a) (12)"), "Tag (a)");
}
#[test]
fn parse_chapter_list_keeps_all_chapters_with_unique_keys() {
// Real listing fixture from the target site. 15 rows: chapters
// with various Ch.N markup, one hiatus row, three "notice." rows,
// and duplicates of Ch.1 and Ch.52 from different uploaders.
// Every row must survive parsing and every chapter must have a
// distinct source_chapter_key — chapter URLs all end in `/pg-1/`
// (the reader's page-1 entry point), and a naive
// last-segment-of-URL derivation returns "pg-1" for every row,
// collapsing the whole list into one downstream chapter row.
let html = include_str!(
"../../../tests/fixtures/target/chapter_list_uu.html"
);
let doc = scraper::Html::parse_document(html);
let chapters = parse_chapter_list(&doc);
assert_eq!(chapters.len(), 15, "every row kept (notices/hiatus included)");
let mut keys: Vec<&str> =
chapters.iter().map(|c| c.source_chapter_key.as_str()).collect();
keys.sort();
let dupe = keys.windows(2).find(|w| w[0] == w[1]).map(|w| w[0]);
assert!(dupe.is_none(), "duplicate chapter key: {dupe:?}");
for c in &chapters {
assert_ne!(
c.source_chapter_key, "pg-1",
"key must not be the reader-page segment: {:?}", c
);
}
// Latest chapter is first (source orders newest → oldest).
assert_eq!(chapters[0].number, 67);
assert_eq!(chapters[0].title.as_deref(), Some("Ch.67 : Official"));
assert_eq!(chapters[0].source_chapter_key, "br_chapter-379272");
// Duplicate-number chapters (different uploaders) survive as
// two rows. The (manga_id, number) UNIQUE collapse is a
// downstream schema concern handled separately.
assert_eq!(
chapters.iter().filter(|c| c.number == 52).count(),
2,
"two Ch.52 uploads must both survive parsing"
);
assert_eq!(
chapters.iter().filter(|c| c.number == 1).count(),
2,
"Ch.1 Official and Ch.1 Team Hazama are both kept"
);
// Notice and hiatus rows contain no digits, so parse_chapter_number
// returns None and parse_chapter_list falls back to number=0. They
// are not filtered out.
let zero = chapters.iter().filter(|c| c.number == 0).count();
assert!(zero >= 4, "hiatus + 3 notices kept; got {zero}");
}
#[test]
fn parse_chapter_number_grabs_first_integer_run() {
assert_eq!(parse_chapter_number("Ch.1"), Some(1));
assert_eq!(parse_chapter_number("Chapter 12"), Some(12));
assert_eq!(parse_chapter_number("Ch.2 - The Beginning"), Some(2));
// Decimal chapters keep the integer part (i32 storage).
assert_eq!(parse_chapter_number("Ch.12.5"), Some(12));
assert_eq!(parse_chapter_number("Special"), None);
}
#[test]
fn parse_last_page_picks_highest_pagination_link() {
let html = r#"
<div id="left_side"><div class="pagination">
<a href="/list/1/">Prev</a>
<ol>
<li><a href="/list/1/">1</a></li>
<li><a href="/list/2/">2</a></li>
<li><a href="/list/47/">47</a></li>
<li><a href="/list/2/">Next</a></li>
</ol>
</div></div>
"#;
let doc = scraper::Html::parse_document(html);
assert_eq!(parse_last_page(&doc), Some(47));
}
#[test]
fn parse_last_page_none_when_no_pagination() {
let doc = scraper::Html::parse_document("<html></html>");
assert!(parse_last_page(&doc).is_none());
}
#[test]
fn page_url_substitutes_numeric_path_segment() {
assert_eq!(
page_url("https://site.example/list/1/?f=1&o=1&sortby=update_date&e=", 5),
"https://site.example/list/5/?f=1&o=1&sortby=update_date&e="
);
// No numeric segment → URL returned unchanged.
assert_eq!(
page_url("https://site.example/list/?f=1", 5),
"https://site.example/list/?f=1"
);
}
#[test]
fn derive_key_strips_trailing_slash_and_query() {
assert_eq!(derive_key_from_url("https://x.example/manga/foo/"), "foo");
assert_eq!(derive_key_from_url("https://x.example/manga/foo?p=1"), "foo");
assert_eq!(derive_key_from_url("/manga/bar"), "bar");
}
#[test]
fn derive_chapter_key_strips_trailing_reader_page_segment() {
// Listing links go to page 1 of the reader; strip /pg-\d+/.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-1/"),
"br_chapter-379272"
);
assert_eq!(
derive_chapter_key_from_url(".../uu/to_chapter-13/pg-1/"),
"to_chapter-13"
);
// Defensive: deep-link to a non-first page should still resolve
// to the same chapter identity.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-25/"),
"br_chapter-379272"
);
// No reader-page suffix → behaves like derive_key_from_url.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/"),
"br_chapter-379272"
);
// Query strings are stripped.
assert_eq!(
derive_chapter_key_from_url(".../uu/br_chapter-379272/pg-1/?ref=x"),
"br_chapter-379272"
);
// `pg-foo` is not a valid reader-page segment; treated as identity.
assert_eq!(
derive_chapter_key_from_url(".../uu/something/pg-foo/"),
"pg-foo"
);
// Bare `pg-` (no digits) likewise not stripped.
assert_eq!(
derive_chapter_key_from_url(".../uu/something/pg-/"),
"pg-"
);
}
#[test]
fn metadata_hash_is_stable_and_field_sensitive() {
let base = parse_manga_detail(DETAIL_HTML, "k", true).unwrap();
let again = parse_manga_detail(DETAIL_HTML, "k", true).unwrap();
assert_eq!(base.metadata_hash, again.metadata_hash);
// Same fields except status flipped — hash must change.
let altered_html = DETAIL_HTML.replace("Ongoing", "Completed");
let altered = parse_manga_detail(&altered_html, "k", true).unwrap();
assert_ne!(base.metadata_hash, altered.metadata_hash);
}
#[test]
fn missing_optional_fields_parse_to_none() {
let html = r#"<html><body><div class="w-title"><h1>Minimal</h1></div></body></html>"#;
let m = parse_manga_detail(html, "min", true).unwrap();
assert_eq!(m.title, "Minimal");
assert!(m.summary.is_none());
assert!(m.status.is_none());
assert!(m.authors.is_empty());
assert!(m.genres.is_empty());
assert!(m.tags.is_empty());
assert!(m.alternative_titles.is_empty());
assert!(m.chapters.is_empty());
}
#[test]
fn parse_manga_detail_skips_chapters_when_disabled() {
// Same fixture that yields 3 chapters above; with include_chapters=false
// the chapter table is ignored and the rest of the metadata still parses.
let m = parse_manga_detail(DETAIL_HTML, "k", false).unwrap();
assert!(m.chapters.is_empty(), "chapters should be empty when disabled");
assert_eq!(m.title, "Test Manga Title", "other fields still parse");
assert_eq!(m.authors, vec!["Author One", "Author Two"]);
}
#[test]
fn parse_manga_detail_errors_on_missing_title() {
let html = "<html><body><p>nothing</p></body></html>";
let err = parse_manga_detail(html, "x", true).unwrap_err();
assert!(err.to_string().contains("missing .w-title h1"));
}
}
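
Reviewer sketch for `compute_metadata_hash`: the separator bytes matter because plain concatenation lets different field splits produce the same hash input. The std-only illustration below builds the byte stream in a `Vec<u8>` instead of feeding `Sha256`, and the helpers `framed` and `unframed` are hypothetical names introduced only for this comparison. It assumes field text never contains the 0x1F or 0x1E control bytes itself.

```rust
/// Appends one scalar field followed by a unit separator (0x1F).
fn feed(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(s.as_bytes());
    buf.push(0x1F);
}

/// Appends every item of a list field, then a record separator (0x1E),
/// so the boundary between two adjacent lists is part of the stream.
fn feed_list(buf: &mut Vec<u8>, xs: &[&str]) {
    for s in xs {
        feed(buf, s);
    }
    buf.push(0x1E);
}

/// Hypothetical helper: frames two adjacent list fields the same way
/// the hash input is framed (authors first, then genres).
fn framed(authors: &[&str], genres: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    feed_list(&mut buf, authors);
    feed_list(&mut buf, genres);
    buf
}

/// Hypothetical naive alternative: plain concatenation, no separators.
fn unframed(authors: &[&str], genres: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for s in authors.iter().chain(genres.iter()) {
        buf.extend_from_slice(s.as_bytes());
    }
    buf
}
```

With these helpers, `unframed(&["ab", "c"], &[])` and `unframed(&["a", "bc"], &[])` yield identical bytes, while the `framed` variants differ. The same holds when one item moves from the first list to the second.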


@@ -2,6 +2,7 @@ pub mod api;
 pub mod app;
 pub mod auth;
 pub mod config;
+pub mod crawler;
 pub mod domain;
 pub mod error;
 pub mod repo;


@@ -12,10 +12,21 @@ async fn main() -> anyhow::Result<()> {
     let config = mangalord::config::Config::from_env()?;
     let addr: SocketAddr = config.bind_address.parse()?;
-    let app = mangalord::app::build(config).await?;
+    let mangalord::app::AppHandle { router, daemon } = mangalord::app::build(config).await?;
     tracing::info!(%addr, "mangalord listening");
     let listener = tokio::net::TcpListener::bind(addr).await?;
-    axum::serve(listener, app).await?;
+    axum::serve(listener, router)
+        .with_graceful_shutdown(async {
+            let _ = tokio::signal::ctrl_c().await;
+            tracing::info!("ctrl-c received; shutting down");
+        })
+        .await?;
+    // Drain background tasks (crawler daemon) before exiting so Chromium
+    // gets a clean shutdown rather than relying on kill-on-drop.
+    if let Some(d) = daemon {
+        d.shutdown().await;
+    }
     Ok(())
 }
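
Reviewer sketch of the "signal, then wait for the background worker" ordering that the `daemon.shutdown().await` call relies on. This is a std-only analogy using a thread and a channel, not the tokio-based daemon from this change. `DaemonHandle` here is a hypothetical stand-in and its fields are assumptions.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the handle returned alongside the router.
struct DaemonHandle {
    stop: Sender<()>,
    worker: JoinHandle<Vec<&'static str>>,
}

impl DaemonHandle {
    /// Starts a background worker that runs until it is told to stop.
    fn spawn() -> Self {
        let (stop, stop_rx) = mpsc::channel::<()>();
        let worker = thread::spawn(move || {
            let mut log = vec!["started"];
            // Block until a stop message arrives (or the sender is dropped).
            let _ = stop_rx.recv();
            // Cleanup that should finish before the process exits.
            log.push("cleaned up");
            log
        });
        Self { stop, worker }
    }

    /// Signals the worker, then waits for its cleanup to complete.
    fn shutdown(self) -> Vec<&'static str> {
        let _ = self.stop.send(());
        self.worker.join().expect("worker panicked")
    }
}
```

Calling `shutdown` only after the server loop has returned mirrors the order in `main`: stop accepting requests first, then let background work finish its teardown.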


@@ -12,12 +12,15 @@ pub async fn list_for_manga(
     limit: i64,
     offset: i64,
 ) -> AppResult<Vec<Chapter>> {
+    // Secondary sort by created_at gives duplicate-numbered chapters
+    // (multiple uploaders/translations of the same number) a stable
+    // order in lists and prev/next reader navigation.
     let rows = sqlx::query_as::<_, Chapter>(
         r#"
         SELECT id, manga_id, number, title, page_count, created_at
         FROM chapters
         WHERE manga_id = $1
-        ORDER BY number ASC
+        ORDER BY number ASC, created_at ASC
         LIMIT $2 OFFSET $3
         "#,
     )
@@ -29,33 +32,40 @@ pub async fn list_for_manga(
     Ok(rows)
 }

-pub async fn find_by_manga_and_number(
+/// Look up a chapter by its UUID, scoped to its manga so a UUID guessed
+/// from a different manga's URL doesn't accidentally resolve.
+pub async fn find_by_id_in_manga(
     pool: &PgPool,
     manga_id: Uuid,
-    number: i32,
+    chapter_id: Uuid,
 ) -> AppResult<Option<Chapter>> {
     let row = sqlx::query_as::<_, Chapter>(
         r#"
         SELECT id, manga_id, number, title, page_count, created_at
         FROM chapters
-        WHERE manga_id = $1 AND number = $2
+        WHERE manga_id = $1 AND id = $2
         "#,
     )
     .bind(manga_id)
-    .bind(number)
+    .bind(chapter_id)
     .fetch_optional(pool)
     .await?;
     Ok(row)
 }

 /// Accepts any `PgExecutor` so the upload handler can run this inside a
-/// transaction with the per-page inserts. Returns `AppError::Conflict`
-/// on the (manga_id, number) unique violation so handlers can surface a
-/// clean 409.
+/// transaction with the per-page inserts.
 ///
 /// `uploaded_by` records who uploaded the chapter and feeds the
 /// per-user upload history. `None` means "historical / API token with
 /// no associated user" — kept nullable to support that case.
+///
+/// Chapter identity is the row UUID; the same (manga_id, number)
+/// combination can repeat (multiple translations, re-uploads). The
+/// `is_unique_violation` branch below is a defensive holdover from
+/// 0001's (manga_id, number) UNIQUE — it can no longer fire under
+/// normal operation, but we surface a clean 409 if a future migration
+/// re-adds any chapter uniqueness.
 pub async fn create<'e, E: PgExecutor<'e>>(
     executor: E,
     manga_id: Uuid,
@@ -80,7 +90,7 @@ pub async fn create<'e, E: PgExecutor<'e>>(
     match result {
         Ok(c) => Ok(c),
         Err(e) if is_unique_violation(&e) => Err(AppError::Conflict(format!(
-            "chapter {number} already exists for this manga"
+            "chapter {number} conflicts with an existing chapter for this manga"
         ))),
         Err(e) => Err(AppError::Database(e)),
     }
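
Reviewer sketch of what the added `created_at` tiebreaker buys. Once two chapters can share a number, ordering by number alone leaves their relative order unspecified. The in-memory version below mirrors `ORDER BY number ASC, created_at ASC` with a tuple sort key. `ChapterRow` and `sort_for_listing` are hypothetical, and `created_at` is a plain integer here to keep the example std-only.

```rust
/// Hypothetical in-memory stand-in for a `chapters` row.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ChapterRow {
    title: &'static str,
    number: i32,
    created_at: u64,
}

/// Orders rows by number first, then by creation time, so that
/// duplicate-numbered chapters always come out in the same order.
fn sort_for_listing(rows: &mut [ChapterRow]) {
    rows.sort_by_key(|c| (c.number, c.created_at));
}
```

For two uploads of chapter 52 created at different times, the earlier one sorts first regardless of input order, which keeps prev/next navigation deterministic.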

backend/src/repo/crawler.rs Normal file

@@ -0,0 +1,434 @@
//! Persistence for crawled mangas.
//!
//! High-level operations:
//! - [`ensure_source`]: idempotent registration of a source row.
//! - [`upsert_manga_from_source`]: end-to-end "I saw this manga" —
//! creates or updates the `mangas` row, threads `manga_sources`, and
//! refreshes authors/genres/tags. Returns whether the manga is new,
//! updated (metadata_hash changed), or unchanged.
//! - [`sync_manga_chapters`]: per-manga chapter reconciliation. Adds
//! new ones, refreshes URLs on existing ones, soft-drops vanished.
//! - [`mark_dropped_mangas`]: end-of-run pass. Any manga from this
//! source whose `last_seen_at` is older than the run start is
//! soft-dropped.
//!
//! Each public function is a transaction boundary so a partial failure
//! mid-call leaves the DB in its pre-call state.
use chrono::{DateTime, Utc};
use sqlx::{PgPool, Postgres, Transaction};
use uuid::Uuid;
use crate::crawler::source::{SourceChapterRef, SourceManga};
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpsertStatus {
New,
Updated,
Unchanged,
}
#[derive(Debug, Clone)]
pub struct UpsertedManga {
pub manga_id: Uuid,
pub status: UpsertStatus,
/// Current value of `mangas.cover_image_path` after the upsert.
/// `None` means the cover hasn't been downloaded yet — the caller
/// uses this to backfill covers for mangas that were synced before
/// cover-download support existed.
pub cover_image_path: Option<String>,
}
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct ChapterDiff {
pub new: usize,
pub refreshed: usize,
pub dropped: usize,
}
pub async fn ensure_source(
pool: &PgPool,
id: &str,
name: &str,
base_url: &str,
) -> sqlx::Result<()> {
sqlx::query(
r#"
INSERT INTO sources (id, name, base_url, enabled)
VALUES ($1, $2, $3, true)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
base_url = EXCLUDED.base_url
"#,
)
.bind(id)
.bind(name)
.bind(base_url)
.execute(pool)
.await?;
Ok(())
}
pub async fn upsert_manga_from_source(
pool: &PgPool,
source_id: &str,
source_url: &str,
sm: &SourceManga,
) -> sqlx::Result<UpsertedManga> {
let mut tx = pool.begin().await?;
let existing: Option<(Uuid, Option<String>)> = sqlx::query_as(
r#"
SELECT manga_id, metadata_hash
FROM manga_sources
WHERE source_id = $1 AND source_manga_key = $2
"#,
)
.bind(source_id)
.bind(&sm.source_manga_key)
.fetch_optional(&mut *tx)
.await?;
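// Fall back to "ongoing" when the source status was missing or
// unrecognized (e.g. `normalize_status` returned `None`).
let status_db = sm.status.as_deref().unwrap_or("ongoing");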
// Note: `cover_image_path` is intentionally not written here.
// The repo layer doesn't know about the storage backend, so the
// caller (crawler binary) downloads the cover via the `Storage`
// trait and sets the path with `repo::manga::set_cover_image_path`
// once the bytes have landed.
let (manga_id, status) = match existing {
None => {
let (id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO mangas (title, description, status, alt_titles)
VALUES ($1, $2, $3, $4)
RETURNING id
"#,
)
.bind(&sm.title)
.bind(sm.summary.as_deref())
.bind(status_db)
.bind(&sm.alternative_titles)
.fetch_one(&mut *tx)
.await?;
(id, UpsertStatus::New)
}
Some((id, prev_hash)) if prev_hash.as_deref() == Some(&sm.metadata_hash) => {
(id, UpsertStatus::Unchanged)
}
Some((id, _)) => {
sqlx::query(
r#"
UPDATE mangas
SET title = $1,
description = $2,
status = $3,
alt_titles = $4,
updated_at = NOW()
WHERE id = $5
"#,
)
.bind(&sm.title)
.bind(sm.summary.as_deref())
.bind(status_db)
.bind(&sm.alternative_titles)
.bind(id)
.execute(&mut *tx)
.await?;
(id, UpsertStatus::Updated)
}
};
sqlx::query(
r#"
INSERT INTO manga_sources
(source_id, source_manga_key, manga_id, source_url, metadata_hash, last_seen_at, dropped_at)
VALUES ($1, $2, $3, $4, $5, NOW(), NULL)
ON CONFLICT (source_id, source_manga_key) DO UPDATE
SET source_url = EXCLUDED.source_url,
metadata_hash = EXCLUDED.metadata_hash,
last_seen_at = NOW(),
dropped_at = NULL
"#,
)
.bind(source_id)
.bind(&sm.source_manga_key)
.bind(manga_id)
.bind(source_url)
.bind(&sm.metadata_hash)
.execute(&mut *tx)
.await?;
sync_authors(&mut tx, manga_id, &sm.authors).await?;
sync_genres(&mut tx, manga_id, &sm.genres).await?;
sync_tags(&mut tx, manga_id, &sm.tags).await?;
let cover_image_path: Option<String> =
sqlx::query_scalar("SELECT cover_image_path FROM mangas WHERE id = $1")
.bind(manga_id)
.fetch_one(&mut *tx)
.await?;
tx.commit().await?;
Ok(UpsertedManga {
manga_id,
status,
cover_image_path,
})
}
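
Reviewer sketch of the three-way decision inside `upsert_manga_from_source`, pulled out as a pure function so the cases are easy to see. `classify` is a hypothetical helper, not part of this change. The outer `Option` models "no `manga_sources` row yet" and the inner one models the nullable `metadata_hash` column.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UpsertStatus {
    New,
    Updated,
    Unchanged,
}

/// Hypothetical pure version of the match on `existing`:
/// - no row for this (source, key) pair: the manga is new;
/// - a row whose stored hash equals the incoming hash: unchanged;
/// - anything else (different hash, or no stored hash): updated.
fn classify(existing: Option<Option<&str>>, incoming_hash: &str) -> UpsertStatus {
    match existing {
        None => UpsertStatus::New,
        Some(prev) if prev == Some(incoming_hash) => UpsertStatus::Unchanged,
        Some(_) => UpsertStatus::Updated,
    }
}
```

Note that in the function above, the `manga_sources` upsert and the author/genre/tag resync run for every status, so `Unchanged` still refreshes `last_seen_at` and clears `dropped_at`.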
async fn sync_authors(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
authors: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_authors WHERE manga_id = $1")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for (i, name) in authors.iter().enumerate() {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
// Self-update on conflict so the row id is always returned —
// can't use DO NOTHING because that suppresses RETURNING.
let (author_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO authors (name) VALUES ($1)
ON CONFLICT (lower(name)) DO UPDATE SET name = authors.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
sqlx::query(
r#"
INSERT INTO manga_authors (manga_id, author_id, position)
VALUES ($1, $2, $3)
ON CONFLICT DO NOTHING
"#,
)
.bind(manga_id)
.bind(author_id)
.bind(i as i32)
.execute(&mut **tx)
.await?;
}
Ok(())
}
async fn sync_genres(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
genres: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_genres WHERE manga_id = $1")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for name in genres {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
// Case-insensitive lookup so a source-supplied "action"
// attaches to the seeded "Action" rather than creating a
// second row.
let existing: Option<(Uuid,)> =
sqlx::query_as("SELECT id FROM genres WHERE lower(name) = lower($1)")
.bind(trimmed)
.fetch_optional(&mut **tx)
.await?;
let genre_id = match existing {
Some((id,)) => id,
None => {
let (id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO genres (name) VALUES ($1)
ON CONFLICT (name) DO UPDATE SET name = genres.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
tracing::info!(genre = trimmed, "added new genre from source");
id
}
};
sqlx::query(
"INSERT INTO manga_genres (manga_id, genre_id) VALUES ($1, $2) ON CONFLICT DO NOTHING",
)
.bind(manga_id)
.bind(genre_id)
.execute(&mut **tx)
.await?;
}
Ok(())
}
async fn sync_tags(
tx: &mut Transaction<'_, Postgres>,
manga_id: Uuid,
tags: &[String],
) -> sqlx::Result<()> {
sqlx::query("DELETE FROM manga_tags WHERE manga_id = $1")
.bind(manga_id)
.execute(&mut **tx)
.await?;
for name in tags {
let trimmed = name.trim();
if trimmed.is_empty() {
continue;
}
let (tag_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO tags (name) VALUES ($1)
ON CONFLICT (lower(name)) DO UPDATE SET name = tags.name
RETURNING id
"#,
)
.bind(trimmed)
.fetch_one(&mut **tx)
.await?;
sqlx::query(
r#"
INSERT INTO manga_tags (manga_id, tag_id, added_by)
VALUES ($1, $2, NULL)
ON CONFLICT DO NOTHING
"#,
)
.bind(manga_id)
.bind(tag_id)
.execute(&mut **tx)
.await?;
}
Ok(())
}
pub async fn sync_manga_chapters(
pool: &PgPool,
source_id: &str,
manga_id: Uuid,
chapters: &[SourceChapterRef],
) -> sqlx::Result<ChapterDiff> {
let mut tx = pool.begin().await?;
let mut diff = ChapterDiff::default();
let seen_keys: Vec<String> = chapters
.iter()
.map(|c| c.source_chapter_key.clone())
.collect();
for c in chapters {
let existing: Option<(Uuid,)> = sqlx::query_as(
"SELECT chapter_id FROM chapter_sources WHERE source_id = $1 AND source_chapter_key = $2",
)
.bind(source_id)
.bind(&c.source_chapter_key)
.fetch_optional(&mut *tx)
.await?;
match existing {
None => {
// New chapter row. As of 0013 there's no (manga_id,
// number) UNIQUE, so duplicate-numbered chapters from
// the source (different uploaders, notices, alt
// translations) each get their own row — chapter
// identity is the UUID, not the number.
let (chapter_id,): (Uuid,) = sqlx::query_as(
r#"
INSERT INTO chapters (manga_id, number, title, page_count)
VALUES ($1, $2, $3, 0)
RETURNING id
"#,
)
.bind(manga_id)
.bind(c.number)
.bind(c.title.as_deref())
.fetch_one(&mut *tx)
.await?;
sqlx::query(
r#"
INSERT INTO chapter_sources
(source_id, source_chapter_key, chapter_id, source_url, last_seen_at, dropped_at)
VALUES ($1, $2, $3, $4, NOW(), NULL)
"#,
)
.bind(source_id)
.bind(&c.source_chapter_key)
.bind(chapter_id)
.bind(&c.url)
.execute(&mut *tx)
.await?;
diff.new += 1;
}
Some((chapter_id,)) => {
sqlx::query("UPDATE chapters SET title = $1 WHERE id = $2")
.bind(c.title.as_deref())
.bind(chapter_id)
.execute(&mut *tx)
.await?;
sqlx::query(
r#"
UPDATE chapter_sources
SET source_url = $1, last_seen_at = NOW(), dropped_at = NULL
WHERE source_id = $2 AND source_chapter_key = $3
"#,
)
.bind(&c.url)
.bind(source_id)
.bind(&c.source_chapter_key)
.execute(&mut *tx)
.await?;
diff.refreshed += 1;
}
}
}
// Soft-drop any chapter previously seen from this source for this
// manga that's not in the current list.
let result = sqlx::query(
r#"
UPDATE chapter_sources cs
SET dropped_at = NOW()
FROM chapters ch
WHERE cs.chapter_id = ch.id
AND ch.manga_id = $1
AND cs.source_id = $2
AND cs.dropped_at IS NULL
AND NOT (cs.source_chapter_key = ANY($3))
"#,
)
.bind(manga_id)
.bind(source_id)
.bind(&seen_keys)
.execute(&mut *tx)
.await?;
diff.dropped = result.rows_affected() as usize;
tx.commit().await?;
Ok(diff)
}
pub async fn mark_dropped_mangas(
pool: &PgPool,
source_id: &str,
run_started_at: DateTime<Utc>,
) -> sqlx::Result<u64> {
let res = sqlx::query(
r#"
UPDATE manga_sources
SET dropped_at = NOW()
WHERE source_id = $1
AND last_seen_at < $2
AND dropped_at IS NULL
"#,
)
.bind(source_id)
.bind(run_started_at)
.execute(pool)
.await?;
Ok(res.rows_affected())
}
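
Reviewer sketch of the reconciliation that `sync_manga_chapters` performs, reduced to an in-memory map so the new / refreshed / dropped counts can be checked without a database. `Known` and `reconcile` are hypothetical. The map stands in for `chapter_sources` rows of one manga and one source, with the boolean playing the role of `dropped_at IS NOT NULL`.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Default, PartialEq, Eq)]
struct ChapterDiff {
    new: usize,
    refreshed: usize,
    dropped: usize,
}

/// Hypothetical stand-in for `chapter_sources`: key -> "is dropped".
type Known = HashMap<String, bool>;

/// Applies one crawl result (`seen`) to the known chapter keys.
fn reconcile(known: &mut Known, seen: &[&str]) -> ChapterDiff {
    let mut diff = ChapterDiff::default();
    for key in seen {
        // Inserting `false` both creates new rows and clears an earlier
        // soft-drop, like `dropped_at = NULL` in the UPDATE branch.
        match known.insert(key.to_string(), false) {
            None => diff.new += 1,
            Some(_) => diff.refreshed += 1,
        }
    }
    // Soft-drop every live key that the current listing no longer has.
    let seen_set: HashSet<&str> = seen.iter().copied().collect();
    for (key, dropped) in known.iter_mut() {
        if !*dropped && !seen_set.contains(key.as_str()) {
            *dropped = true;
            diff.dropped += 1;
        }
    }
    diff
}
```

A chapter that disappears from one crawl and reappears in a later one is counted as refreshed on its return, not as new, because its key is still known.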


@@ -3,6 +3,7 @@ pub mod author;
 pub mod bookmark;
 pub mod chapter;
 pub mod collection;
+pub mod crawler;
 pub mod genre;
 pub mod manga;
 pub mod page;


@@ -438,3 +438,196 @@ async fn list_me_returns_paged_envelope(pool: PgPool) {
     // without paging through.
     assert_eq!(body["page"]["total"], 0);
 }
// -------------------------------------------------------------------------
// Bookmark create -> SyncChapterContent job enqueue (background task)
// -------------------------------------------------------------------------
async fn seed_chapter_with_source(
pool: &PgPool,
manga_id: Uuid,
number: i32,
source_id: &str,
source_chapter_key: &str,
source_url: &str,
dropped: bool,
) -> Uuid {
let chapter_id: Uuid =
mangalord::repo::chapter::create(pool, manga_id, number, None, None)
.await
.unwrap()
.id;
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind(source_id)
.bind(source_id)
.bind("https://example.com")
.execute(pool)
.await
.unwrap();
let dropped_at = if dropped { "now()" } else { "NULL" };
sqlx::query(&format!(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, {dropped_at})"
))
.bind(source_id)
.bind(source_chapter_key)
.bind(chapter_id)
.bind(source_url)
.execute(pool)
.await
.unwrap();
chapter_id
}
/// Poll `crawler_jobs` for the expected pending count, up to ~1.5s, so the
/// detached `tokio::spawn` from the bookmark create handler has time to
/// land regardless of CI scheduling jitter.
async fn wait_for_pending_count(pool: &PgPool, expected: i64) -> i64 {
for _ in 0..30 {
let count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap();
if count >= expected {
return count;
}
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}
sqlx::query_scalar::<_, i64>(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state = 'pending' \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(pool)
.await
.unwrap()
}
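The polling pattern in `wait_for_pending_count` generalises to any probe. Below is a blocking, std-only sketch of the same idea; `poll_until` is a hypothetical helper name, and it uses `std::thread::sleep` where the async test helper above uses `tokio::time::sleep`.

```rust
use std::time::{Duration, Instant};

/// Calls `probe` until `done` accepts its value or `timeout` elapses,
/// sleeping `interval` between attempts. Returns the last observed
/// value either way so the caller can assert on it.
pub fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> T,
    done: impl Fn(&T) -> bool,
) -> T {
    let deadline = Instant::now() + timeout;
    loop {
        let value = probe();
        if done(&value) || Instant::now() >= deadline {
            return value;
        }
        std::thread::sleep(interval);
    }
}
```

Returning the last value instead of a `bool` keeps the failure message informative: the caller's `assert_eq!` reports the count that was actually seen when the deadline passed.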
#[sqlx::test(migrations = "./migrations")]
async fn create_enqueues_sync_chapter_content_jobs_for_pending_chapters(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
// Two zero-page chapters with non-dropped sources.
let c1 = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let c2 = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", false).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
let count = wait_for_pending_count(&pool, 2).await;
assert_eq!(count, 2, "both pending chapters should be enqueued");
let chapter_ids: Vec<String> = sqlx::query_scalar(
"SELECT payload->>'chapter_id' FROM crawler_jobs \
WHERE payload->>'kind' = 'sync_chapter_content' \
ORDER BY payload->>'chapter_id'",
)
.fetch_all(&pool)
.await
.unwrap();
let mut expected = vec![c1.to_string(), c2.to_string()];
expected.sort();
assert_eq!(chapter_ids, expected);
}
#[sqlx::test(migrations = "./migrations")]
async fn re_bookmark_after_delete_does_not_re_enqueue_pending_jobs(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _ = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
// First bookmark — should enqueue 1.
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
let bookmark_id = common::body_json(resp).await["id"].as_str().unwrap().to_string();
assert_eq!(wait_for_pending_count(&pool, 1).await, 1);
// Delete the bookmark, then re-bookmark — the existing pending job
// is still there so the dedup index suppresses the second enqueue.
let resp = h
.app
.clone()
.oneshot(common::delete_with_cookie(
&format!("/api/v1/bookmarks/{bookmark_id}"),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::NO_CONTENT);
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
// Give the background task time to attempt re-enqueue (it should be a no-op).
tokio::time::sleep(std::time::Duration::from_millis(300)).await;
let final_count: i64 = sqlx::query_scalar(
"SELECT COUNT(*) FROM crawler_jobs \
WHERE state IN ('pending', 'running') \
AND payload->>'kind' = 'sync_chapter_content'",
)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(final_count, 1, "dedup index keeps the queue at a single in-flight row");
}
#[sqlx::test(migrations = "./migrations")]
async fn create_skips_chapters_with_dropped_sources(pool: PgPool) {
let h = common::harness(pool.clone());
let (_, cookie) = common::register_user(&h.app).await;
let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
let _alive = seed_chapter_with_source(&pool, manga_id, 1, "target", "ch1", "https://example.com/c1", false).await;
let _dropped = seed_chapter_with_source(&pool, manga_id, 2, "target", "ch2", "https://example.com/c2", true).await;
let resp = h
.app
.clone()
.oneshot(common::post_json_with_cookie(
"/api/v1/bookmarks",
json!({ "manga_id": manga_id.to_string() }),
&cookie,
))
.await
.unwrap();
assert_eq!(resp.status(), StatusCode::CREATED);
assert_eq!(
wait_for_pending_count(&pool, 1).await,
1,
"only the chapter with a non-dropped source row gets enqueued"
);
}


@@ -12,12 +12,18 @@ async fn seed_manga(h: &common::Harness, cookie: &str, title: &str) -> Uuid {
 common::seed_manga_via_api(&h.app, cookie, title).await
 }
-async fn seed_chapter(pool: &PgPool, manga_id: Uuid, number: i32, title: Option<&str>) {
+async fn seed_chapter(
+pool: &PgPool,
+manga_id: Uuid,
+number: i32,
+title: Option<&str>,
+) -> Uuid {
 // Historical seed: uploaded_by remains NULL, mirroring the
 // pre-Phase-5 rows in the production DB.
 mangalord::repo::chapter::create(pool, manga_id, number, title, None)
 .await
-.unwrap();
+.unwrap()
+.id
 }
 #[sqlx::test(migrations = "./migrations")]
@@ -81,16 +87,16 @@ async fn list_chapters_returns_404_for_unknown_manga(pool: PgPool) {
 }
 #[sqlx::test(migrations = "./migrations")]
-async fn get_chapter_by_number(pool: PgPool) {
+async fn get_chapter_by_id(pool: PgPool) {
 let h = common::harness(pool.clone());
 let (_, cookie) = common::register_user(&h.app).await;
 let manga_id = seed_manga(&h, &cookie, "Berserk").await;
-seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
+let chapter_id = seed_chapter(&pool, manga_id, 1, Some("The Brand")).await;
 let resp = h
 .app
 .oneshot(common::get(&format!(
-"/api/v1/mangas/{manga_id}/chapters/1"
+"/api/v1/mangas/{manga_id}/chapters/{chapter_id}"
 )))
 .await
 .unwrap();
@@ -99,18 +105,20 @@ async fn get_chapter_by_number(pool: PgPool) {
 assert_eq!(body["number"], 1);
 assert_eq!(body["title"], "The Brand");
 assert_eq!(body["page_count"], 0);
+assert_eq!(body["id"], chapter_id.to_string());
 }
 #[sqlx::test(migrations = "./migrations")]
-async fn get_chapter_unknown_number_is_404(pool: PgPool) {
+async fn get_chapter_unknown_id_is_404(pool: PgPool) {
 let h = common::harness(pool);
 let (_, cookie) = common::register_user(&h.app).await;
 let manga_id = seed_manga(&h, &cookie, "Berserk").await;
+let unknown_chapter = Uuid::new_v4();
 let resp = h
 .app
 .oneshot(common::get(&format!(
-"/api/v1/mangas/{manga_id}/chapters/99"
+"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}"
 )))
 .await
 .unwrap();
@@ -122,10 +130,34 @@ async fn get_chapter_unknown_number_is_404(pool: PgPool) {
 #[sqlx::test(migrations = "./migrations")]
 async fn get_chapter_unknown_manga_is_404(pool: PgPool) {
 let h = common::harness(pool);
-let unknown = Uuid::nil();
+let unknown_manga = Uuid::nil();
+let unknown_chapter = Uuid::new_v4();
 let resp = h
 .app
-.oneshot(common::get(&format!("/api/v1/mangas/{unknown}/chapters/1")))
+.oneshot(common::get(&format!(
+"/api/v1/mangas/{unknown_manga}/chapters/{unknown_chapter}"
+)))
+.await
+.unwrap();
+assert_eq!(resp.status(), StatusCode::NOT_FOUND);
+}
+/// Cross-manga isolation: a chapter id belonging to manga A must not
+/// resolve when accessed via manga B's URL. The (manga_id, id) scoping
+/// in `find_by_id_in_manga` enforces this.
+#[sqlx::test(migrations = "./migrations")]
+async fn get_chapter_from_wrong_manga_is_404(pool: PgPool) {
+let h = common::harness(pool.clone());
+let (_, cookie) = common::register_user(&h.app).await;
+let manga_a = seed_manga(&h, &cookie, "Berserk").await;
+let manga_b = seed_manga(&h, &cookie, "Vagabond").await;
+let chapter_id = seed_chapter(&pool, manga_a, 1, Some("Episode 1")).await;
+let resp = h
+.app
+.oneshot(common::get(&format!(
+"/api/v1/mangas/{manga_b}/chapters/{chapter_id}"
+)))
 .await
 .unwrap();
 assert_eq!(resp.status(), StatusCode::NOT_FOUND);
@@ -136,12 +168,12 @@ async fn list_pages_empty_for_chapter_without_upload(pool: PgPool) {
 let h = common::harness(pool.clone());
 let (_, cookie) = common::register_user(&h.app).await;
 let manga_id = seed_manga(&h, &cookie, "Berserk").await;
-seed_chapter(&pool, manga_id, 1, None).await;
+let chapter_id = seed_chapter(&pool, manga_id, 1, None).await;
 let resp = h
 .app
 .oneshot(common::get(&format!(
-"/api/v1/mangas/{manga_id}/chapters/1/pages"
+"/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
 )))
 .await
 .unwrap();
@@ -155,11 +187,12 @@ async fn list_pages_returns_404_for_unknown_chapter(pool: PgPool) {
 let h = common::harness(pool);
 let (_, cookie) = common::register_user(&h.app).await;
 let manga_id = seed_manga(&h, &cookie, "Berserk").await;
+let unknown_chapter = Uuid::new_v4();
 let resp = h
 .app
 .oneshot(common::get(&format!(
-"/api/v1/mangas/{manga_id}/chapters/99/pages"
+"/api/v1/mangas/{manga_id}/chapters/{unknown_chapter}/pages"
 )))
 .await
 .unwrap();


@@ -139,13 +139,17 @@ async fn files_endpoint_streams_in_multiple_frames(pool: PgPool) {
 .await
 .unwrap();
 assert_eq!(resp.status(), StatusCode::CREATED);
+let chapter_id = common::body_json(resp).await["id"]
+.as_str()
+.unwrap()
+.to_string();
 // Fetch the page back via the streaming files endpoint.
 let pages = h
 .app
 .clone()
 .oneshot(common::get(&format!(
-"/api/v1/mangas/{manga_id}/chapters/1/pages"
+"/api/v1/mangas/{manga_id}/chapters/{chapter_id}/pages"
 )))
 .await
 .unwrap();
@@ -317,8 +321,12 @@ async fn create_chapter_rejects_renamed_non_image_page(pool: PgPool) {
 assert_eq!(body["error"]["code"], "unsupported_media_type");
 }
+/// Multiple chapters can share the same number: different
+/// scanlations, re-uploads, translator notes. As of migration 0013,
+/// (manga_id, number) is not unique and each upload gets its own
+/// chapter id.
 #[sqlx::test(migrations = "./migrations")]
-async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
+async fn create_chapter_allows_duplicate_numbers_as_separate_chapters(pool: PgPool) {
 let h = common::harness(pool);
 let (_, cookie) = common::register_user(&h.app).await;
 let manga_id = common::seed_manga_via_api(&h.app, &cookie, "Berserk").await;
@@ -334,10 +342,27 @@ async fn create_chapter_returns_409_on_duplicate_number(pool: PgPool) {
 };
 let first = h.app.clone().oneshot(make()).await.unwrap();
 assert_eq!(first.status(), StatusCode::CREATED);
-let second = h.app.oneshot(make()).await.unwrap();
-assert_eq!(second.status(), StatusCode::CONFLICT);
-let body = common::body_json(second).await;
-assert_eq!(body["error"]["code"], "conflict");
+let first_id = common::body_json(first).await["id"].as_str().unwrap().to_string();
+let second = h.app.clone().oneshot(make()).await.unwrap();
+assert_eq!(second.status(), StatusCode::CREATED);
+let second_id = common::body_json(second).await["id"].as_str().unwrap().to_string();
+assert_ne!(first_id, second_id, "each upload gets a distinct chapter id");
+// List endpoint surfaces both rows.
+let resp = h
+.app
+.oneshot(common::get(&format!("/api/v1/mangas/{manga_id}/chapters")))
+.await
+.unwrap();
+assert_eq!(resp.status(), StatusCode::OK);
+let body = common::body_json(resp).await;
+let items = body["items"].as_array().unwrap();
+assert_eq!(items.len(), 2, "both Ch.1 uploads listed separately");
+for item in items {
+assert_eq!(item["number"], 1);
+}
 }
 #[sqlx::test(migrations = "./migrations")]


@@ -0,0 +1,157 @@
//! Smoke test for the Chromium launcher.
//!
//! Marked `#[ignore]` because it (a) downloads ~150 MB of Chromium on
//! first run via the `fetcher` feature and (b) requires a real `$DISPLAY`
//! for the headed path. Run it explicitly:
//!
//! ```sh
//! cargo test --test crawler_browser_smoke -- --ignored --nocapture
//! ```
//!
//! Override the cache location with `CRAWLER_CHROMIUM_DIR=/some/path` if
//! `$HOME/.cache/mangalord/chromium` isn't writable.
use mangalord::crawler::browser::{self, LaunchOptions};
#[tokio::test]
#[ignore = "downloads Chromium and needs a display; run with --ignored"]
async fn headed_browser_can_navigate_and_read_title() {
// A data URL avoids any network dependency — we're testing the
// browser launcher, not connectivity.
const PAGE: &str = "data:text/html,<html><head><title>Mangalord%20Smoke</title></head><body>OK</body></html>";
let handle = browser::launch(LaunchOptions::headed())
.await
.expect("launch headed chromium");
let page = handle
.browser()
.new_page(PAGE)
.await
.expect("open new page");
page.wait_for_navigation()
.await
.expect("wait for navigation");
let title = page.get_title().await.expect("get title");
assert_eq!(title.as_deref(), Some("Mangalord Smoke"));
handle.close().await.expect("close cleanly");
}
#[tokio::test]
#[ignore = "downloads Chromium; run with --ignored"]
async fn headless_browser_can_navigate_and_read_title() {
const PAGE: &str = "data:text/html,<html><head><title>Headless%20OK</title></head><body></body></html>";
let handle = browser::launch(LaunchOptions::headless())
.await
.expect("launch headless chromium");
let page = handle.browser().new_page(PAGE).await.expect("open new page");
page.wait_for_navigation().await.expect("wait for navigation");
let title = page.get_title().await.expect("get title");
assert_eq!(title.as_deref(), Some("Headless OK"));
handle.close().await.expect("close cleanly");
}
/// Live end-to-end: navigate to a real page, get the rendered HTML, and
/// parse it with `scraper`. ipify.org renders the visitor's public IP
/// into the page DOM, so a successful run proves browser → render →
/// `Html::parse_document` → selector → text extraction all work
/// against a real site. This is the same path each future `Source`
/// impl will take.
#[tokio::test]
#[ignore = "needs network; run with --ignored"]
async fn fetches_public_ip_from_ipify() {
use std::time::Duration;
let handle = browser::launch(LaunchOptions::headless())
.await
.expect("launch headless chromium");
let page = handle
.browser()
.new_page("https://www.ipify.org")
.await
.expect("open ipify");
page.wait_for_navigation().await.expect("wait for navigation");
// ipify injects the IP via JS after load, so the navigation event
// alone isn't enough — give the script a beat to run.
tokio::time::sleep(Duration::from_secs(2)).await;
let html = page.content().await.expect("get rendered html");
let doc = scraper::Html::parse_document(&html);
let body_sel = scraper::Selector::parse("body").unwrap();
let body_text: String = doc
.select(&body_sel)
.next()
.map(|n| n.text().collect::<Vec<_>>().join(" "))
.unwrap_or_default();
let ip = extract_ipv4(&body_text)
.unwrap_or_else(|| panic!("no IPv4 found in ipify body: {body_text}"));
eprintln!("ipify says our public IP is: {ip}");
handle.close().await.expect("close cleanly");
}
/// Proves that `LaunchOptions::extra_args` actually reach Chromium and
/// influence its runtime. `--user-agent=...` overrides `navigator.userAgent`,
/// observable from JS — read it back via `page.evaluate`.
#[tokio::test]
#[ignore = "downloads Chromium; run with --ignored"]
async fn extra_args_reach_chromium() {
const UA: &str = "MangalordCrawlerTest/1.0";
let options = LaunchOptions {
mode: browser::BrowserMode::Headless,
extra_args: vec![format!("--user-agent={UA}")],
};
let handle = browser::launch(options).await.expect("launch with extra args");
let page = handle
.browser()
.new_page("about:blank")
.await
.expect("open page");
page.wait_for_navigation().await.expect("wait");
let ua: String = page
.evaluate("navigator.userAgent")
.await
.expect("evaluate navigator.userAgent")
.into_value()
.expect("string value");
assert_eq!(
ua, UA,
"extra --user-agent flag should override navigator.userAgent"
);
handle.close().await.expect("close cleanly");
}
/// Tiny dotted-quad finder that avoids pulling in `regex` for a single
/// test. Returns the first valid IPv4 substring (four 0..=255 octets
/// separated by dots), or `None` if there is none.
fn extract_ipv4(s: &str) -> Option<String> {
let bytes = s.as_bytes();
let mut i = 0;
while i < bytes.len() {
if !bytes[i].is_ascii_digit() {
i += 1;
continue;
}
let start = i;
while i < bytes.len() && (bytes[i].is_ascii_digit() || bytes[i] == b'.') {
i += 1;
}
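// Trim trailing dots so an address at the end of a sentence
// ("... is 203.0.113.7.") still splits into exactly four parts.
let candidate = s[start..i].trim_end_matches('.');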
let parts: Vec<&str> = candidate.split('.').collect();
if parts.len() == 4 && parts.iter().all(|p| p.parse::<u8>().is_ok()) {
return Some(candidate.to_string());
}
}
None
}
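An alternative to the hand-rolled scanner is to lean on the standard library's address parser. This is a sketch under the assumption that `std::net::Ipv4Addr` parsing is strict enough for the smoke test; `first_ipv4` is a hypothetical name and is not part of the test file above.

```rust
use std::net::Ipv4Addr;

/// Returns the first token in `s` that parses as a dotted-quad IPv4
/// address. Tokens are maximal runs of ASCII digits and dots, with
/// leading and trailing dots trimmed before parsing.
pub fn first_ipv4(s: &str) -> Option<Ipv4Addr> {
    s.split(|c: char| !(c.is_ascii_digit() || c == '.'))
        .map(|token| token.trim_matches('.'))
        .find_map(|token| token.parse::<Ipv4Addr>().ok())
}
```

Returning `Ipv4Addr` instead of `String` means the caller gets a validated value, and version strings such as `1.2.3` are rejected by the parser without a manual octet count.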


@@ -0,0 +1,372 @@
//! Integration tests for the crawler daemon's cron + worker pool. The
//! daemon's full real path requires Chromium and a live source; here we
//! test the seam (MetadataPass / ChapterDispatcher traits) and the
//! cron/worker control-flow.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use chrono::NaiveTime;
use chrono_tz::Tz;
use mangalord::crawler::content::SyncOutcome;
use mangalord::crawler::daemon::{
self, test_support::CountingMetadataPass, ChapterDispatcher, DaemonConfig, MetadataPass,
CRON_LOCK_KEY,
};
use mangalord::crawler::jobs::{self, JobPayload};
use mangalord::crawler::pipeline;
use serde_json::json;
use sqlx::PgPool;
use tokio_util::sync::CancellationToken;
use uuid::Uuid;
fn far_future_daily_at() -> NaiveTime {
// 23:59 UTC is a late slot, so the scheduler normally sleeps for the
// whole test. A run that straddles 23:59 UTC could still fire the tick.
NaiveTime::from_hms_opt(23, 59, 0).unwrap()
}
fn make_cfg(
metadata_pass: Option<Arc<dyn MetadataPass>>,
dispatcher: Arc<dyn ChapterDispatcher>,
session_expired: Arc<std::sync::atomic::AtomicBool>,
workers: usize,
) -> DaemonConfig {
DaemonConfig {
metadata_pass,
dispatcher,
chapter_workers: workers,
daily_at: far_future_daily_at(),
tz: Tz::UTC,
retention_days: 7,
session_expired,
extra_tasks: Vec::new(),
}
}
async fn enqueue_chapter_job(pool: &PgPool) -> Uuid {
let chapter_id = Uuid::new_v4();
let payload = JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
};
let res = jobs::enqueue(pool, &payload).await.unwrap();
match res {
jobs::EnqueueResult::Inserted(_) => chapter_id,
jobs::EnqueueResult::Skipped => unreachable!("fresh chapter_id"),
}
}
async fn count_state(pool: &PgPool, state: &str) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs WHERE state = $1")
.bind(state)
.fetch_one(pool)
.await
.unwrap()
}
struct AlwaysDoneDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for AlwaysDoneDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
Ok(SyncOutcome::Fetched { pages: 1 })
}
}
struct PanickingDispatcher {
seen: AtomicUsize,
}
#[async_trait::async_trait]
impl ChapterDispatcher for PanickingDispatcher {
async fn dispatch(&self, _payload: JobPayload) -> anyhow::Result<SyncOutcome> {
self.seen.fetch_add(1, Ordering::AcqRel);
panic!("intentional dispatcher panic");
}
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_drain_jobs_through_dispatcher(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 2),
);
// Wait for the workers to drain all three jobs.
let dispatcher_seen = || dispatcher.seen.load(Ordering::Acquire);
for _ in 0..40 {
if dispatcher_seen() >= 3 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher_seen() >= 3,
"expected at least 3 dispatches, got {}",
dispatcher_seen()
);
handle.shutdown().await;
assert_eq!(count_state(&pool, "done").await, 3);
}
#[sqlx::test(migrations = "./migrations")]
async fn workers_idle_while_session_expired(pool: PgPool) {
let id = enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(true));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), Arc::clone(&session_expired), 1),
);
// Wait long enough that a non-idled worker would have leased and ack'd.
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
dispatcher.seen.load(Ordering::Acquire),
0,
"dispatcher must not be invoked while session_expired flag is set"
);
assert_eq!(count_state(&pool, "pending").await, 1);
let _ = id;
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn dispatcher_panic_is_contained_and_job_is_acked_failed(pool: PgPool) {
enqueue_chapter_job(&pool).await;
enqueue_chapter_job(&pool).await;
let dispatcher = Arc::new(PanickingDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(None, dispatcher.clone(), session_expired, 1),
);
// Wait for the worker to handle both panicking jobs.
for _ in 0..40 {
if dispatcher.seen.load(Ordering::Acquire) >= 2 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
dispatcher.seen.load(Ordering::Acquire) >= 2,
"worker must keep going after a panic — handled at least 2 jobs"
);
handle.shutdown().await;
// attempts=1 below max=5, so the panicking jobs go back to pending with
// backoff and `last_error = "worker panicked"`.
let last_errors: Vec<String> = sqlx::query_scalar(
"SELECT last_error FROM crawler_jobs WHERE last_error IS NOT NULL",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(last_errors.len(), 2);
assert!(last_errors.iter().all(|e| e == "worker panicked"));
}
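The containment behaviour this test checks can be illustrated with `std::panic::catch_unwind`. This is a synchronous sketch only: the diff does not show how the daemon catches panics in its async workers, and `JobResult`, `run_jobs`, and the `u32` job ids are hypothetical stand-ins for leased queue rows and `ChapterDispatcher::dispatch`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Outcome recorded for one job in this sketch.
#[derive(Debug, PartialEq)]
pub enum JobResult {
    Done,
    Failed(String),
}

/// Runs every job through `dispatch`, converting a panic into a failed
/// result so the loop keeps going with the next job.
pub fn run_jobs(jobs: &[u32], dispatch: impl Fn(u32) -> Result<(), String>) -> Vec<JobResult> {
    jobs.iter()
        .map(|&job| {
            match panic::catch_unwind(AssertUnwindSafe(|| dispatch(job))) {
                Ok(Ok(())) => JobResult::Done,
                Ok(Err(e)) => JobResult::Failed(e),
                // The panic payload is discarded; the test above expects
                // this fixed message in `last_error`.
                Err(_) => JobResult::Failed("worker panicked".to_string()),
            }
        })
        .collect()
}
```

The key property is that a panic in one job yields a recorded failure and does not end the loop, which is what the `seen >= 2` assertion above relies on.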
#[sqlx::test(migrations = "./migrations")]
async fn cron_skips_tick_when_advisory_lock_held(pool: PgPool) {
// With no last_metadata_tick_at row, the daemon does a catch-up tick
// immediately on spawn. We hold the advisory lock on a separate
// connection beforehand so the catch-up's pg_try_advisory_lock returns
// false and the tick must skip without invoking the metadata pass.
let mut lock_conn = pool.acquire().await.unwrap();
sqlx::query("SELECT pg_advisory_lock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
// daily_at far in the future so after the (skipped) catch-up the
// cron sleeps for the rest of the test rather than racing for the lock.
let cfg = make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
);
let handle = daemon::spawn(pool.clone(), cancel.clone(), cfg);
tokio::time::sleep(Duration::from_millis(800)).await;
assert_eq!(
counter.count.load(Ordering::Acquire),
0,
"cron must skip the catch-up tick while the advisory lock is held"
);
sqlx::query("SELECT pg_advisory_unlock($1)")
.bind(CRON_LOCK_KEY)
.execute(&mut *lock_conn)
.await
.unwrap();
drop(lock_conn);
handle.shutdown().await;
}
#[sqlx::test(migrations = "./migrations")]
async fn cron_catches_up_when_last_tick_is_stale(pool: PgPool) {
// Pre-seed last_metadata_tick_at well in the past so previous_fire(now)
// > last_tick is trivially true and the daemon catches up immediately.
sqlx::query(
"INSERT INTO crawler_state (key, value) VALUES ($1, $2)
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value",
)
.bind("last_metadata_tick_at")
.bind(json!({"at": "2020-01-01T00:00:00Z"}))
.execute(&pool)
.await
.unwrap();
let counter = Arc::new(CountingMetadataPass::default());
let dispatcher = Arc::new(AlwaysDoneDispatcher {
seen: AtomicUsize::new(0),
});
let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
let cancel = CancellationToken::new();
let handle = daemon::spawn(
pool.clone(),
cancel.clone(),
make_cfg(
Some(counter.clone() as Arc<dyn MetadataPass>),
dispatcher,
session_expired,
1,
),
);
for _ in 0..40 {
if counter.count.load(Ordering::Acquire) >= 1 {
break;
}
tokio::time::sleep(Duration::from_millis(50)).await;
}
assert!(
counter.count.load(Ordering::Acquire) >= 1,
"catch-up tick should have fired immediately"
);
handle.shutdown().await;
}
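The catch-up rule referenced in the comments (`previous_fire(now) > last_tick`, plus an immediate tick when no row exists) can be sketched with plain integer arithmetic. This is an assumption-laden simplification: it works in UTC seconds since the Unix epoch and ignores `CRAWLER_TZ`, and the signatures here are hypothetical and not the daemon's own.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Most recent daily slot at or before `now`, both in seconds since the
/// Unix epoch (UTC). `daily_at` is the slot's offset from midnight.
pub fn previous_fire(now: i64, daily_at: i64) -> i64 {
    let midnight = now - now.rem_euclid(SECS_PER_DAY);
    let today = midnight + daily_at;
    if today <= now {
        today
    } else {
        today - SECS_PER_DAY
    }
}

/// A catch-up tick is due when no tick was recorded, or the recorded
/// one predates the most recent slot.
pub fn needs_catch_up(last_tick: Option<i64>, now: i64, daily_at: i64) -> bool {
    match last_tick {
        None => true,
        Some(t) => previous_fire(now, daily_at) > t,
    }
}
```

With a slot of 23:59 and a clock at noon, `previous_fire` falls on the previous day, so a tick timestamp from 2020 is always stale and the catch-up fires on spawn.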
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_bookmarked_pending_skips_dropped_sources(pool: PgPool) {
// Setup: one manga with two chapters (page_count = 0). One has a
// non-dropped source; the other's source is dropped. A user bookmarks
// the manga. Expectation: only the non-dropped chapter is enqueued.
let user_id: Uuid = sqlx::query_scalar(
"INSERT INTO users (username, password_hash) VALUES ($1, $2) RETURNING id",
)
.bind("alice")
.bind("not-a-real-hash")
.fetch_one(&pool)
.await
.unwrap();
let manga_id: Uuid = sqlx::query_scalar(
"INSERT INTO mangas (title) VALUES ($1) RETURNING id",
)
.bind("Berserk")
.fetch_one(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO sources (id, name, base_url) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING")
.bind("target")
.bind("Target")
.bind("https://example.com")
.execute(&pool)
.await
.unwrap();
let c1: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 1, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
let c2: Uuid = sqlx::query_scalar(
"INSERT INTO chapters (manga_id, number, page_count) VALUES ($1, 2, 0) RETURNING id",
)
.bind(manga_id)
.fetch_one(&pool)
.await
.unwrap();
// c1: alive source. c2: dropped source.
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url) \
VALUES ($1, $2, $3, $4)",
)
.bind("target")
.bind("ch1")
.bind(c1)
.bind("https://example.com/ch1")
.execute(&pool)
.await
.unwrap();
sqlx::query(
"INSERT INTO chapter_sources (source_id, source_chapter_key, chapter_id, source_url, dropped_at) \
VALUES ($1, $2, $3, $4, now())",
)
.bind("target")
.bind("ch2")
.bind(c2)
.bind("https://example.com/ch2")
.execute(&pool)
.await
.unwrap();
sqlx::query("INSERT INTO bookmarks (user_id, manga_id) VALUES ($1, $2)")
.bind(user_id)
.bind(manga_id)
.execute(&pool)
.await
.unwrap();
let summary = pipeline::enqueue_bookmarked_pending(&pool).await.unwrap();
assert_eq!(summary.inserted, 1, "only the non-dropped chapter enqueued");
assert_eq!(summary.skipped, 0);
let payloads: Vec<serde_json::Value> = sqlx::query_scalar(
"SELECT payload FROM crawler_jobs WHERE payload->>'kind' = 'sync_chapter_content'",
)
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(payloads.len(), 1);
assert_eq!(
payloads[0]["chapter_id"].as_str().unwrap(),
c1.to_string()
);
}


@@ -0,0 +1,441 @@
//! Integration tests for `crawler::jobs` queue operations.
//!
//! Uses `#[sqlx::test(migrations = "./migrations")]` which provisions a fresh
//! migrated DB per test. No browser, no axum router — these exercise the SQL
//! shape and dedup-index semantics directly against Postgres.
use std::time::Duration;
use mangalord::crawler::jobs::{
self, EnqueueResult, JobPayload, KIND_SYNC_CHAPTER_CONTENT,
};
use mangalord::crawler::source::DiscoverMode;
use sqlx::PgPool;
use uuid::Uuid;
fn chapter_content_payload(chapter_id: Uuid) -> JobPayload {
JobPayload::SyncChapterContent {
source_id: "target".into(),
chapter_id,
source_chapter_key: format!("ch-{chapter_id}"),
}
}
fn discover_payload() -> JobPayload {
JobPayload::Discover {
source_id: "target".into(),
mode: DiscoverMode::Backfill,
}
}
async fn job_state(pool: &PgPool, id: Uuid) -> String {
sqlx::query_scalar::<_, String>("SELECT state FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_attempts(pool: &PgPool, id: Uuid) -> i32 {
sqlx::query_scalar::<_, i32>("SELECT attempts FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(pool)
.await
.unwrap()
}
async fn job_count(pool: &PgPool) -> i64 {
sqlx::query_scalar::<_, i64>("SELECT COUNT(*) FROM crawler_jobs")
.fetch_one(pool)
.await
.unwrap()
}
#[sqlx::test(migrations = "./migrations")]
async fn enqueue_inserts_pending_row_with_round_trip_payload(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let payload = chapter_content_payload(chapter_id);
let result = jobs::enqueue(&pool, &payload).await.unwrap();
let id = match result {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("expected Inserted on first enqueue"),
};
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let raw_payload: serde_json::Value =
sqlx::query_scalar("SELECT payload FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let decoded: JobPayload = serde_json::from_value(raw_payload).unwrap();
match decoded {
JobPayload::SyncChapterContent {
source_id,
chapter_id: c,
source_chapter_key,
} => {
assert_eq!(source_id, "target");
assert_eq!(c, chapter_id);
assert_eq!(source_chapter_key, format!("ch-{chapter_id}"));
}
_ => panic!("payload variant mismatch"),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_chapter_content_while_pending_is_skipped(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(first, EnqueueResult::Inserted(_)));
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(matches!(second, EnqueueResult::Skipped));
assert_eq!(job_count(&pool).await, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn duplicate_after_done_releases_dedup_slot(pool: PgPool) {
let chapter_id = Uuid::new_v4();
let p = chapter_content_payload(chapter_id);
let first_id = match jobs::enqueue(&pool, &p).await.unwrap() {
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => panic!("first enqueue should insert"),
};
// Move the first job out of (pending|running) so the partial dedup index no longer covers it.
sqlx::query("UPDATE crawler_jobs SET state = 'done' WHERE id = $1")
.bind(first_id)
.execute(&pool)
.await
.unwrap();
let second = jobs::enqueue(&pool, &p).await.unwrap();
assert!(
matches!(second, EnqueueResult::Inserted(_)),
"after done the chapter_id slot is free again"
);
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn different_chapter_ids_can_coexist(pool: PgPool) {
let p1 = chapter_content_payload(Uuid::new_v4());
let p2 = chapter_content_payload(Uuid::new_v4());
assert!(matches!(
jobs::enqueue(&pool, &p1).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p2).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn non_chapter_content_payloads_are_never_deduped(pool: PgPool) {
let p = discover_payload();
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert!(matches!(
jobs::enqueue(&pool, &p).await.unwrap(),
EnqueueResult::Inserted(_)
));
assert_eq!(job_count(&pool).await, 2);
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_marks_running_and_bumps_attempts_and_sets_leased_until(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
EnqueueResult::Skipped => unreachable!(),
};
let leases = jobs::lease(&pool, None, 10, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases.len(), 1);
let lease = &leases[0];
assert_eq!(lease.id, id);
assert_eq!(lease.attempts, 1);
assert_eq!(job_state(&pool, id).await, "running");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
let leased_until = leased_until.expect("leased_until set");
assert!(leased_until > chrono::Utc::now());
}
#[sqlx::test(migrations = "./migrations")]
async fn lease_with_kind_filter_only_matches_that_kind(pool: PgPool) {
let discover_id = match jobs::enqueue(&pool, &discover_payload()).await.unwrap() {
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let chapter_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(
&pool,
Some(KIND_SYNC_CHAPTER_CONTENT),
10,
Duration::from_secs(60),
)
.await
.unwrap();
assert_eq!(leases.len(), 1, "only chapter content payload leases");
assert_eq!(leases[0].id, chapter_id);
// discover is still pending
assert_eq!(job_state(&pool, discover_id).await, "pending");
}
#[sqlx::test(migrations = "./migrations")]
async fn concurrent_leases_under_skip_locked_return_disjoint_ids(pool: PgPool) {
// 4 pending jobs, two concurrent calls each asking for up to 2.
let mut ids = Vec::new();
for _ in 0..4 {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
ids.push(id);
}
let (a, b) = tokio::join!(
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
jobs::lease(&pool, None, 2, Duration::from_secs(60)),
);
let a = a.unwrap();
let b = b.unwrap();
let mut seen: Vec<Uuid> = a.iter().chain(b.iter()).map(|l| l.id).collect();
seen.sort();
seen.dedup();
let count = a.len() + b.len();
assert_eq!(
seen.len(),
count,
"no id appears in both lease results (SKIP LOCKED)"
);
assert!(count >= 2, "at least one lease saw work");
assert!(count <= 4);
}
#[sqlx::test(migrations = "./migrations")]
async fn stale_running_lease_can_be_reclaimed(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let first = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(first.len(), 1);
// Pretend the worker crashed: rewind leased_until into the past.
sqlx::query("UPDATE crawler_jobs SET leased_until = now() - interval '1 minute' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let second = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(second.len(), 1, "stale running row was re-leased");
assert_eq!(second[0].id, id);
assert_eq!(second[0].attempts, 2, "attempts bumped again");
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_done_transitions_state_and_clears_lease(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
jobs::ack_done(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "done");
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_under_max_returns_to_pending_with_future_schedule(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "boom", lease.attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
let (scheduled_at, last_error): (chrono::DateTime<chrono::Utc>, Option<String>) =
sqlx::query_as("SELECT scheduled_at, last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(scheduled_at > chrono::Utc::now());
assert_eq!(last_error.as_deref(), Some("boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn ack_failed_at_max_marks_dead(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
// Lease once, then report the failure as if it were the final attempt (attempts == max_attempts).
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
let lease = &leases[0];
jobs::ack_failed(&pool, lease.id, "final boom", lease.max_attempts, lease.max_attempts)
.await
.unwrap();
assert_eq!(job_state(&pool, id).await, "dead");
let last_error: Option<String> =
sqlx::query_scalar("SELECT last_error FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(last_error.as_deref(), Some("final boom"));
}
#[sqlx::test(migrations = "./migrations")]
async fn release_returns_to_pending_and_undoes_attempt_increment(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let leases = jobs::lease(&pool, None, 1, Duration::from_secs(60))
.await
.unwrap();
assert_eq!(leases[0].attempts, 1);
jobs::release(&pool, leases[0].id).await.unwrap();
assert_eq!(job_state(&pool, id).await, "pending");
assert_eq!(job_attempts(&pool, id).await, 0);
let leased_until: Option<chrono::DateTime<chrono::Utc>> =
sqlx::query_scalar("SELECT leased_until FROM crawler_jobs WHERE id = $1")
.bind(id)
.fetch_one(&pool)
.await
.unwrap();
assert!(leased_until.is_none());
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_deletes_old_rows_keeps_fresh(pool: PgPool) {
// Two done rows: one old (updated_at 10 days ago), one fresh.
let old_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
let fresh_id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '10 days' WHERE id = $1")
.bind(old_id)
.execute(&pool)
.await
.unwrap();
sqlx::query("UPDATE crawler_jobs SET state='done' WHERE id = $1")
.bind(fresh_id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 7).await.unwrap();
assert_eq!(deleted, 1);
let remaining: Vec<Uuid> = sqlx::query_scalar("SELECT id FROM crawler_jobs ORDER BY id")
.fetch_all(&pool)
.await
.unwrap();
assert_eq!(remaining, vec![fresh_id], "only fresh row remains");
}
#[sqlx::test(migrations = "./migrations")]
async fn reap_done_zero_is_a_no_op(pool: PgPool) {
let id = match jobs::enqueue(&pool, &chapter_content_payload(Uuid::new_v4()))
.await
.unwrap()
{
EnqueueResult::Inserted(id) => id,
_ => unreachable!(),
};
sqlx::query("UPDATE crawler_jobs SET state='done', updated_at = now() - interval '999 days' WHERE id = $1")
.bind(id)
.execute(&pool)
.await
.unwrap();
let deleted = jobs::reap_done(&pool, 0).await.unwrap();
assert_eq!(deleted, 0);
assert_eq!(job_count(&pool).await, 1);
}
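
The dedup tests above (skipped while pending, slot released after `done`, `Discover` payloads never deduped) can be modelled without Postgres. The sketch below is an in-memory approximation only, not the code in `crawler::jobs`: `Queue`, `Enqueue`, `State`, and the string dedup key are hypothetical names, and the real queue relies on a partial index in Postgres rather than a linear scan.

```rust
use std::collections::HashMap;

/// Subset of the job states asserted in the tests ("pending", "running", "done").
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Pending,
    Running,
    Done,
}

/// Hypothetical counterpart of `EnqueueResult`, using a plain counter as the id.
#[derive(Debug, PartialEq, Eq)]
enum Enqueue {
    Inserted(u64),
    Skipped,
}

/// Hypothetical in-memory stand-in for the `crawler_jobs` table.
#[derive(Default)]
struct Queue {
    next_id: u64,
    // job id -> (dedup key, state); the key is None for kinds that are never deduped.
    jobs: HashMap<u64, (Option<String>, State)>,
}

impl Queue {
    /// Insert a job unless a pending or running job already holds the same key.
    fn enqueue(&mut self, dedup_key: Option<&str>) -> Enqueue {
        if let Some(key) = dedup_key {
            let in_flight = self.jobs.values().any(|(k, s)| {
                k.as_deref() == Some(key) && matches!(s, State::Pending | State::Running)
            });
            if in_flight {
                return Enqueue::Skipped;
            }
        }
        let id = self.next_id;
        self.next_id += 1;
        self.jobs
            .insert(id, (dedup_key.map(str::to_owned), State::Pending));
        Enqueue::Inserted(id)
    }

    /// Move a job to another state; unknown ids are ignored.
    fn set_state(&mut self, id: u64, state: State) {
        if let Some(job) = self.jobs.get_mut(&id) {
            job.1 = state;
        }
    }
}
```

Under these assumptions, enqueueing the same key twice yields `Inserted` then `Skipped`; once the first job is set to `State::Done`, the same key inserts again, which mirrors `duplicate_after_done_releases_dedup_slot`.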


@@ -0,0 +1,473 @@
//! Integration tests for `repo::crawler`.
//!
//! Each test runs against a fresh, migrated DB via `#[sqlx::test]`.
//! `DATABASE_URL` must point to a Postgres where the test user can
//! `CREATEDB`.
use mangalord::crawler::source::{SourceChapterRef, SourceManga};
use mangalord::repo::crawler::{self, ChapterDiff, UpsertStatus};
use sqlx::PgPool;
use uuid::Uuid;
/// Helper to spin up a `SourceManga` fixture with a stable shape so
/// each test can tweak just the fields it cares about.
fn sample_manga(key: &str, title: &str, hash: &str) -> SourceManga {
SourceManga {
source_manga_key: key.to_string(),
title: title.to_string(),
alternative_titles: vec!["Alt 1".into()],
authors: vec!["Author One".into()],
// Action is in the seeded `genres` table; Fantasy is too.
genres: vec!["Action".into(), "Fantasy".into()],
tags: vec!["popular".into()],
status: Some("ongoing".into()),
summary: Some("Sample summary.".into()),
cover_url: Some("/cover.jpg".into()),
chapters: vec![],
metadata_hash: hash.to_string(),
}
}
#[sqlx::test(migrations = "./migrations")]
async fn ensure_source_is_idempotent(pool: PgPool) {
crawler::ensure_source(&pool, "target", "Target Site", "https://x.example")
.await
.unwrap();
crawler::ensure_source(&pool, "target", "Target Site v2", "https://x.example")
.await
.unwrap();
let count: (i64,) = sqlx::query_as("SELECT COUNT(*) FROM sources WHERE id = 'target'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(count.0, 1);
let name: (String,) = sqlx::query_as("SELECT name FROM sources WHERE id = 'target'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(name.0, "Target Site v2", "name updates on re-call");
}
#[sqlx::test(migrations = "./migrations")]
async fn first_upsert_inserts_manga_and_links_metadata(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let res = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(res.status, UpsertStatus::New);
// mangas row created
let row: (String, String, Vec<String>) =
sqlx::query_as("SELECT title, status, alt_titles FROM mangas WHERE id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(row.0, "Foo Manga");
assert_eq!(row.1, "ongoing");
assert_eq!(row.2, vec!["Alt 1"]);
// manga_sources row links the two
let link: (String, Uuid, Option<String>) = sqlx::query_as(
"SELECT source_id, manga_id, metadata_hash FROM manga_sources WHERE source_manga_key = $1",
)
.bind("foo")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(link.0, "target");
assert_eq!(link.1, res.manga_id);
assert_eq!(link.2.as_deref(), Some("hash-1"));
// Authors, genres, tags M2M populated
let n_authors: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_authors WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_authors.0, 1);
let n_genres: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_genres WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_genres.0, 2, "Action + Fantasy");
let n_tags: (i64,) = sqlx::query_as("SELECT COUNT(*) FROM manga_tags WHERE manga_id = $1")
.bind(res.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_tags.0, 1);
}
#[sqlx::test(migrations = "./migrations")]
async fn second_upsert_with_same_hash_reports_unchanged(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Unchanged);
assert_eq!(second.manga_id, first.manga_id);
}
#[sqlx::test(migrations = "./migrations")]
async fn upsert_with_changed_hash_updates_fields(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let mut m = sample_manga("foo", "Foo Manga", "hash-1");
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
m.title = "Foo Manga (Revised)".into();
m.status = Some("completed".into());
m.metadata_hash = "hash-2".into();
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Updated);
assert_eq!(second.manga_id, first.manga_id);
let row: (String, String) =
sqlx::query_as("SELECT title, status FROM mangas WHERE id = $1")
.bind(first.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(row.0, "Foo Manga (Revised)");
assert_eq!(row.1, "completed");
}
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_adds_new_refreshes_existing_and_drops_vanished(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let initial = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "2".into(),
number: 2,
title: Some("Ch.2".into()),
url: "https://x.example/foo/2".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &initial)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 2,
refreshed: 0,
dropped: 0
}
);
// Second run: keep ch1, replace ch2 with ch3 — ch2 should be dropped.
let second = vec![
SourceChapterRef {
source_chapter_key: "1".into(),
number: 1,
title: Some("Ch.1 (renamed)".into()),
url: "https://x.example/foo/1".into(),
},
SourceChapterRef {
source_chapter_key: "3".into(),
number: 3,
title: Some("Ch.3".into()),
url: "https://x.example/foo/3".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &second)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 1,
refreshed: 1,
dropped: 1
}
);
// Renamed title propagated to chapters.title
let title: (Option<String>,) =
sqlx::query_as("SELECT c.title FROM chapters c JOIN chapter_sources cs ON cs.chapter_id = c.id WHERE cs.source_chapter_key = '1'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(title.0.as_deref(), Some("Ch.1 (renamed)"));
// Vanished chapter is soft-dropped (row still exists, dropped_at set).
let dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM chapter_sources WHERE source_chapter_key = '2'")
.fetch_one(&pool)
.await
.unwrap();
assert!(dropped.0.is_some(), "ch2 should be soft-dropped");
}
/// Real-world sources publish multiple chapters at the same number
/// (different uploaders, translator notes, re-releases). After the
/// (manga_id, number) UNIQUE drop in 0013, each `SourceChapterRef`
/// becomes its own `chapters` row even when the parsed number matches
/// — chapter identity is now the chapter id, not the number.
#[sqlx::test(migrations = "./migrations")]
async fn sync_chapters_keeps_duplicate_numbered_chapters_as_separate_rows(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo Manga", "hash-1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Two distinct uploads of Ch.52 (different uploaders → different
// URLs/keys, same parsed number) plus a notice/hiatus row that
// parses to number=0 alongside a real chapter at number 1.
let chapters = vec![
SourceChapterRef {
source_chapter_key: "br_chapter-A".into(),
number: 52,
title: Some("Ch.52 : Official".into()),
url: "https://x.example/foo/A/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-B".into(),
number: 52,
title: Some("Ch.52 : Official (alt)".into()),
url: "https://x.example/foo/B/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-NOTICE".into(),
number: 0,
title: Some("hitaus.".into()),
url: "https://x.example/foo/notice/pg-1/".into(),
},
SourceChapterRef {
source_chapter_key: "br_chapter-1".into(),
number: 1,
title: Some("Ch.1 : Official".into()),
url: "https://x.example/foo/1/pg-1/".into(),
},
];
let diff = crawler::sync_manga_chapters(&pool, "target", up.manga_id, &chapters)
.await
.unwrap();
assert_eq!(
diff,
ChapterDiff {
new: 4,
refreshed: 0,
dropped: 0
},
"every source ref yields a new chapter row"
);
let rows: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM chapters WHERE manga_id = $1")
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(rows.0, 4, "4 distinct chapter rows even with duplicate numbers");
let ch52_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM chapters WHERE manga_id = $1 AND number = 52",
)
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(ch52_count.0, 2, "both Ch.52 uploads survive as separate rows");
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_dropped_mangas_only_drops_unseen(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
// Seed two mangas before "now" so a later run_started_at sees them as stale.
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/foo",
&sample_manga("foo", "Foo", "hf"),
)
.await
.unwrap();
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/bar",
&sample_manga("bar", "Bar", "hb"),
)
.await
.unwrap();
// Now mark a new "run" beginning. Re-upsert only `foo` — `bar`
// should be the one flagged dropped.
let run_started = chrono::Utc::now();
// Sleep briefly so the second upsert's NOW() > run_started_at.
tokio::time::sleep(std::time::Duration::from_millis(20)).await;
let _ = crawler::upsert_manga_from_source(
&pool,
"target",
"https://x.example/foo",
&sample_manga("foo", "Foo", "hf"),
)
.await
.unwrap();
let n = crawler::mark_dropped_mangas(&pool, "target", run_started)
.await
.unwrap();
assert_eq!(n, 1, "only bar should have been dropped");
let foo_dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM manga_sources WHERE source_manga_key = 'foo'")
.fetch_one(&pool)
.await
.unwrap();
assert!(foo_dropped.0.is_none(), "foo seen this run, must not be dropped");
let bar_dropped: (Option<chrono::DateTime<chrono::Utc>>,) =
sqlx::query_as("SELECT dropped_at FROM manga_sources WHERE source_manga_key = 'bar'")
.fetch_one(&pool)
.await
.unwrap();
assert!(bar_dropped.0.is_some());
}
#[sqlx::test(migrations = "./migrations")]
async fn upsert_surfaces_cover_image_path_for_backfill_decisions(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
// First upsert: row is brand new, no cover stored yet.
let first = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert!(first.cover_image_path.is_none(), "new manga has no cover yet");
// Simulate cover landing in storage post-upsert.
sqlx::query("UPDATE mangas SET cover_image_path = $1 WHERE id = $2")
.bind("mangas/foo/cover.jpg")
.bind(first.manga_id)
.execute(&pool)
.await
.unwrap();
// Second upsert with same hash → Unchanged, but cover path is now
// surfaced so the caller knows the backfill is done.
let second = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
assert_eq!(second.status, UpsertStatus::Unchanged);
assert_eq!(
second.cover_image_path.as_deref(),
Some("mangas/foo/cover.jpg")
);
}
#[sqlx::test(migrations = "./migrations")]
async fn arbitrary_genres_from_source_get_inserted(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let mut m = sample_manga("foo", "Foo", "h");
// "Action" is seeded by migration 0009. "Webtoons" is not.
m.genres = vec!["Action".into(), "Webtoons".into()];
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let n_genre_links: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM manga_genres WHERE manga_id = $1")
.bind(up.manga_id)
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(n_genre_links.0, 2, "both seeded and source-added genres attach");
let webtoons: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM genres WHERE name = 'Webtoons'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(webtoons.0, 1, "non-seeded genre was inserted");
// Case-insensitive de-dup: a second sync with the genre re-cased
// attaches the existing row, not a new one.
let mut m2 = sample_manga("bar", "Bar", "h2");
m2.genres = vec!["webtoons".into()];
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/bar", &m2)
.await
.unwrap();
let webtoons_count: (i64,) =
sqlx::query_as("SELECT COUNT(*) FROM genres WHERE lower(name) = 'webtoons'")
.fetch_one(&pool)
.await
.unwrap();
assert_eq!(webtoons_count.0, 1, "case-insensitive lookup reuses the existing row");
}
#[sqlx::test(migrations = "./migrations")]
async fn re_appearing_manga_clears_dropped_at(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let m = sample_manga("foo", "Foo", "h1");
let up = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
// Drop it manually.
sqlx::query(
"UPDATE manga_sources SET dropped_at = NOW() WHERE source_manga_key = 'foo'",
)
.execute(&pool)
.await
.unwrap();
// Re-upsert: the link should un-drop.
let _ = crawler::upsert_manga_from_source(&pool, "target", "https://x.example/foo", &m)
.await
.unwrap();
let dropped: (Option<chrono::DateTime<chrono::Utc>>, Uuid) = sqlx::query_as(
"SELECT dropped_at, manga_id FROM manga_sources WHERE source_manga_key = 'foo'",
)
.fetch_one(&pool)
.await
.unwrap();
assert!(dropped.0.is_none());
assert_eq!(dropped.1, up.manga_id);
}
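
The `ChapterDiff` counts asserted in `sync_chapters_adds_new_refreshes_existing_and_drops_vanished` follow from a set comparison on `source_chapter_key`. The sketch below illustrates that comparison in isolation. It is an assumption about the semantics, not the body of `sync_manga_chapters`: `diff_chapter_keys` is a hypothetical helper, the `usize` field types are guesses, and the real function also writes rows and sets `dropped_at`.

```rust
use std::collections::HashSet;

/// Local stand-in for `repo::crawler::ChapterDiff`; field types are assumed.
#[derive(Debug, Default, PartialEq, Eq)]
struct ChapterDiff {
    new: usize,
    refreshed: usize,
    dropped: usize,
}

/// Compare the keys already linked to a manga (`known`) with the keys
/// returned by the latest scrape (`incoming`).
fn diff_chapter_keys(known: &HashSet<&str>, incoming: &[&str]) -> ChapterDiff {
    // Collapse the scrape into a set so a repeated key is counted once.
    let incoming_set: HashSet<&str> = incoming.iter().copied().collect();
    let mut diff = ChapterDiff::default();
    for key in &incoming_set {
        if known.contains(key) {
            diff.refreshed += 1; // key seen before: metadata is refreshed
        } else {
            diff.new += 1; // key not seen before: a new chapter row
        }
    }
    // Keys that were known but are absent from the scrape are soft-dropped.
    diff.dropped = known.difference(&incoming_set).count();
    diff
}
```

With `known = {"1", "2"}` and `incoming = ["1", "3"]`, this yields one new, one refreshed, and one dropped key, matching the second assertion in that test.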


@@ -0,0 +1,194 @@
<table class="listing" id="chapter_table">
<tbody>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-379272/pg-1/"><b>Ch.67</b>
: Official </a>
<b style="color:#FEFD7F;width;30px;display:inline-block;margin-left:5px">new</b>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">May 20, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-328248/pg-1/"><b>hitaus.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 15, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-326351/pg-1/"><b>Ch.66</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 10, 2026</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-295078/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Aug 28, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-294815/pg-1/"><b>Ch.52</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../4300634/upload/">mina</a>
</td>
<td class="no">Aug 27, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249964/pg-1/"><b>Ch.10</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Jan 5, 2025</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-13/pg-1/"><b>Ch.13</b>
: Thank you, we'll see you in the next one! </a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 30, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-249095/pg-1/"><b>Ch.9</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 28, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-248930/pg-1/"><b>Ch.1</b>
: Official </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Dec 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-12/pg-1/"><b>Ch.12</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Dec 1, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-244844/pg-1/"><b>notice.</b>
: Officials </a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Nov 26, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/to_chapter-11/pg-1/"><b>Ch.11</b>
</a>
</h4>
</td>
<td class="no"></td>
<td class="no">Nov 18, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-221180/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../3781074/upload/">Izanami</a>
</td>
<td class="no">Jun 21, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-234803/pg-1/"><b>notice.</b>
</a>
</h4>
</td>
<td class="no">
<a href=".../2843005/upload/">bloomingdale</a>
</td>
<td class="no">Sep 13, 2024</td>
</tr>
<tr>
<td>
<h4>
<a class="chico"
href=".../uu/br_chapter-220299/pg-1/"><b>Ch.1</b>
: Team Hazama </a>
</h4>
</td>
<td class="no">
<a href=".../1457681/upload/">purplepandabear</a>
</td>
<td class="no">Jun 16, 2024</td>
</tr>
</tbody>
</table>
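
The fixture above mixes numbered labels (`Ch.67`, `Ch.52`) with non-numeric notices (`hitaus.`, `notice.`), and the repo tests describe a notice row as one that "parses to number=0". A minimal sketch of such a label parser follows. `parse_chapter_number` is a hypothetical name and the `i32` return type is an assumption; the crawler's actual parser is not shown in this diff and may treat other label shapes differently.

```rust
/// Extract the integer that follows a "Ch." prefix in a chapter label.
/// Labels without that prefix, or without digits after it, fall back to 0.
fn parse_chapter_number(label: &str) -> i32 {
    let rest = match label.trim().strip_prefix("Ch.") {
        Some(rest) => rest,
        None => return 0, // e.g. "notice." or "hitaus."
    };
    // Take the leading run of ASCII digits and ignore anything after it.
    let digits: String = rest.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().unwrap_or(0)
}
```

Because two rows in the fixture share the label `Ch.52`, a parser like this cannot serve as a chapter identity on its own, which is consistent with the tests keying chapters on `source_chapter_key`.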


@@ -1,6 +1,7 @@
 import { test, expect, type Page } from '@playwright/test';
 const mangaId = '22222222-2222-2222-2222-222222222222';
+const chapterId = 'c2222222-2222-2222-2222-222222222222';
 const mangaFixture = {
 id: mangaId,
 title: 'Vagabond',
@@ -11,7 +12,7 @@ const mangaFixture = {
 updated_at: '2026-01-01T00:00:00Z'
 };
 const chapterFixture = {
-id: 'c1',
+id: chapterId,
 manga_id: mangaId,
 number: 1,
 title: null,
@@ -20,24 +21,24 @@ const chapterFixture = {
 };
 const pagesFixture = [
 {
-id: 'p1',
-chapter_id: 'c1',
+id: 'p1111111-2222-2222-2222-222222222222',
+chapter_id: chapterId,
 page_number: 1,
-storage_key: 'mangas/m2/chapters/c1/pages/0001.png',
+storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
 content_type: 'image/png'
 },
 {
-id: 'p2',
-chapter_id: 'c1',
+id: 'p2222222-2222-2222-2222-222222222222',
+chapter_id: chapterId,
 page_number: 2,
-storage_key: 'mangas/m2/chapters/c1/pages/0002.png',
+storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
 content_type: 'image/png'
 },
 {
-id: 'p3',
-chapter_id: 'c1',
+id: 'p3333333-2222-2222-2222-222222222222',
+chapter_id: chapterId,
 page_number: 3,
-storage_key: 'mangas/m2/chapters/c1/pages/0003.png',
+storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
 content_type: 'image/png'
 }
 ];
@@ -92,14 +93,16 @@ async function mockReaderApis(page: Page) {
 })
 })
 );
-await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) =>
+await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
 route.fulfill({
 status: 200,
 contentType: 'application/json',
 body: JSON.stringify(chapterFixture)
 })
 );
-await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) =>
+await page.route(
+`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
+(route) =>
 route.fulfill({
 status: 200,
 contentType: 'application/json',
@@ -131,7 +134,7 @@ test.beforeEach(async ({ context }) => {
 test('switching to continuous mode stacks all pages and hides chevrons', async ({ page }) => {
 await mockReaderApis(page);
-await page.goto(`/manga/${mangaId}/chapter/1`);
+await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
 // Default single-page mode is active.
 await expect(page.getByTestId('reader-page')).toBeVisible();
@@ -149,7 +152,7 @@ test('switching to continuous mode stacks all pages and hides chevrons', async (
 test('arrow keys do not paginate while in continuous mode', async ({ page }) => {
 await mockReaderApis(page);
-await page.goto(`/manga/${mangaId}/chapter/1`);
+await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
 await page.getByTestId('reader-mode-continuous').click();
 await expect(page.getByTestId('reader-continuous')).toBeVisible();
@@ -164,7 +167,7 @@ test('arrow keys do not paginate while in continuous mode', async ({ page }) =>
 test('gap select updates the inline gap on the continuous container', async ({ page }) => {
 await mockReaderApis(page);
-await page.goto(`/manga/${mangaId}/chapter/1`);
+await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
 await page.getByTestId('reader-mode-continuous').click();
 const container = page.getByTestId('reader-continuous');
@@ -192,7 +195,7 @@ test('reader-mode preference set on one page is honored when the reader opens',
 });
 await mockReaderApis(page);
-await page.goto(`/manga/${mangaId}/chapter/1`);
+await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
 await expect(page.getByTestId('reader-continuous')).toBeVisible();
 await expect(page.getByTestId('page-indicator')).toHaveText('3 pages');
 await expect(page.getByTestId('reader-continuous')).toHaveAttribute(

@@ -1,6 +1,7 @@
import { test, expect, type Page } from '@playwright/test'; import { test, expect, type Page } from '@playwright/test';
const mangaId = '11111111-1111-1111-1111-111111111111'; const mangaId = '11111111-1111-1111-1111-111111111111';
const chapterId = 'c1111111-1111-1111-1111-111111111111';
const mangaFixture = { const mangaFixture = {
id: mangaId, id: mangaId,
title: 'Berserk', title: 'Berserk',
@@ -12,7 +13,7 @@ const mangaFixture = {
}; };
const chaptersFixture = [ const chaptersFixture = [
{ {
id: 'c1', id: chapterId,
manga_id: mangaId, manga_id: mangaId,
number: 1, number: 1,
title: 'The Brand', title: 'The Brand',
@@ -22,24 +23,24 @@ const chaptersFixture = [
]; ];
const pagesFixture = [ const pagesFixture = [
{ {
id: 'p1', id: 'p1111111-1111-1111-1111-111111111111',
chapter_id: 'c1', chapter_id: chapterId,
page_number: 1, page_number: 1,
storage_key: 'mangas/m1/chapters/c1/pages/0001.png', storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0001.png`,
content_type: 'image/png' content_type: 'image/png'
}, },
{ {
id: 'p2', id: 'p2222222-1111-1111-1111-111111111111',
chapter_id: 'c1', chapter_id: chapterId,
page_number: 2, page_number: 2,
storage_key: 'mangas/m1/chapters/c1/pages/0002.png', storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0002.png`,
content_type: 'image/png' content_type: 'image/png'
}, },
{ {
id: 'p3', id: 'p3333333-1111-1111-1111-111111111111',
chapter_id: 'c1', chapter_id: chapterId,
page_number: 3, page_number: 3,
storage_key: 'mangas/m1/chapters/c1/pages/0003.png', storage_key: `mangas/${mangaId}/chapters/${chapterId}/pages/0003.png`,
content_type: 'image/png' content_type: 'image/png'
} }
]; ];
@@ -86,14 +87,16 @@ async function mockReaderApis(page: Page) {
}) })
}) })
); );
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1`, (route) => await page.route(`**/api/v1/mangas/${mangaId}/chapters/${chapterId}`, (route) =>
route.fulfill({ route.fulfill({
status: 200, status: 200,
contentType: 'application/json', contentType: 'application/json',
body: JSON.stringify(chaptersFixture[0]) body: JSON.stringify(chaptersFixture[0])
}) })
); );
await page.route(`**/api/v1/mangas/${mangaId}/chapters/1/pages`, (route) => await page.route(
`**/api/v1/mangas/${mangaId}/chapters/${chapterId}/pages`,
(route) =>
route.fulfill({ route.fulfill({
status: 200, status: 200,
contentType: 'application/json', contentType: 'application/json',
@@ -123,7 +126,7 @@ test('manga overview shows title, cover, and a chapter list', async ({ page }) =
test('reader paginates with arrow keys and j/k, and preloads the next page', async ({ page }) => { test('reader paginates with arrow keys and j/k, and preloads the next page', async ({ page }) => {
await mockReaderApis(page); await mockReaderApis(page);
await page.goto(`/manga/${mangaId}/chapter/1`); await page.goto(`/manga/${mangaId}/chapter/${chapterId}`);
// Page 1 shown, preload for page 2 in the DOM. // Page 1 shown, preload for page 2 in the DOM.
await expect(page.getByTestId('page-indicator')).toHaveText('Page 1 / 3'); await expect(page.getByTestId('page-indicator')).toHaveText('Page 1 / 3');

@@ -1,12 +1,12 @@
{ {
"name": "mangalord-frontend", "name": "mangalord-frontend",
"version": "0.12.0", "version": "0.23.0",
"lockfileVersion": 3, "lockfileVersion": 3,
"requires": true, "requires": true,
"packages": { "packages": {
"": { "": {
"name": "mangalord-frontend", "name": "mangalord-frontend",
"version": "0.12.0", "version": "0.23.0",
"devDependencies": { "devDependencies": {
"@lucide/svelte": "^1.16.0", "@lucide/svelte": "^1.16.0",
"@playwright/test": "^1.48.0", "@playwright/test": "^1.48.0",
@@ -169,7 +169,6 @@
} }
], ],
"license": "MIT", "license": "MIT",
"peer": true,
"engines": { "engines": {
"node": ">=18" "node": ">=18"
}, },
@@ -193,7 +192,6 @@
} }
], ],
"license": "MIT", "license": "MIT",
"peer": true,
"engines": { "engines": {
"node": ">=18" "node": ">=18"
} }
@@ -1157,7 +1155,6 @@
"integrity": "sha512-mQjlkNo+rJvpln7V2IGY2j99BqhcFbS4UN0AQNKNYfhBAFZTuCDAdW3a1sgf330mvtNvsBXn3HpAhcmvdJTcIQ==", "integrity": "sha512-mQjlkNo+rJvpln7V2IGY2j99BqhcFbS4UN0AQNKNYfhBAFZTuCDAdW3a1sgf330mvtNvsBXn3HpAhcmvdJTcIQ==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"@standard-schema/spec": "^1.0.0", "@standard-schema/spec": "^1.0.0",
"@sveltejs/acorn-typescript": "^1.0.5", "@sveltejs/acorn-typescript": "^1.0.5",
@@ -1200,7 +1197,6 @@
"integrity": "sha512-0ba1RQ/PHen5FGpdSrW7Y3fAMQjrXantECALeOiOdBdzR5+5vPP6HVZRLmZaQL+W8m++o+haIAKq5qT+MiZ7VA==", "integrity": "sha512-0ba1RQ/PHen5FGpdSrW7Y3fAMQjrXantECALeOiOdBdzR5+5vPP6HVZRLmZaQL+W8m++o+haIAKq5qT+MiZ7VA==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"@sveltejs/vite-plugin-svelte-inspector": "^3.0.0-next.0||^3.0.0", "@sveltejs/vite-plugin-svelte-inspector": "^3.0.0-next.0||^3.0.0",
"debug": "^4.3.7", "debug": "^4.3.7",
@@ -1359,7 +1355,6 @@
"integrity": "sha512-dyh/xO2Fh5bYrfWaaqGrRQQGkNdmYw6AmaAUvYeUMNTWQtvb796ikLdmTchRmOlOiIJ1TDXfWgVx1QkUlQ6Hew==", "integrity": "sha512-dyh/xO2Fh5bYrfWaaqGrRQQGkNdmYw6AmaAUvYeUMNTWQtvb796ikLdmTchRmOlOiIJ1TDXfWgVx1QkUlQ6Hew==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"undici-types": "~6.21.0" "undici-types": "~6.21.0"
} }
@@ -1507,7 +1502,6 @@
"integrity": "sha512-UVJyE9MttOsBQIDKw1skb9nAwQuR5wuGD3+82K6JgJlm/Y+KI92oNsMNGZCYdDsVtRHSak0pcV5Dno5+4jh9sw==", "integrity": "sha512-UVJyE9MttOsBQIDKw1skb9nAwQuR5wuGD3+82K6JgJlm/Y+KI92oNsMNGZCYdDsVtRHSak0pcV5Dno5+4jh9sw==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"bin": { "bin": {
"acorn": "bin/acorn" "acorn": "bin/acorn"
}, },
@@ -2249,7 +2243,6 @@
"integrity": "sha512-8i7LzZj7BF8uplX+ZyOlIz86V6TAsSs+np6m1kpW9u0JWi4z/1t+FzcK1aek+ybTnAC4KhBL4uXCNT0wcUIeCw==", "integrity": "sha512-8i7LzZj7BF8uplX+ZyOlIz86V6TAsSs+np6m1kpW9u0JWi4z/1t+FzcK1aek+ybTnAC4KhBL4uXCNT0wcUIeCw==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"cssstyle": "^4.1.0", "cssstyle": "^4.1.0",
"data-urls": "^5.0.0", "data-urls": "^5.0.0",
@@ -2638,7 +2631,6 @@
"integrity": "sha512-WHeFSbZYsPu3+bLoNRUuAO+wavNlocOPf3wSHTP7hcFKVnJeWsYlCDbr3mTS14FCizf9ccIxXA8sGL8zKeQN3g==", "integrity": "sha512-WHeFSbZYsPu3+bLoNRUuAO+wavNlocOPf3wSHTP7hcFKVnJeWsYlCDbr3mTS14FCizf9ccIxXA8sGL8zKeQN3g==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"@types/estree": "1.0.8" "@types/estree": "1.0.8"
}, },
@@ -2810,7 +2802,6 @@
"integrity": "sha512-ymI5ykLPwIHW839E053FQbI1G+jnRFJEw3Kv5Y4njixVWywQBx+NUFpkkKyk5LIb36Fg9DVXSYpqiGekLD0hyw==", "integrity": "sha512-ymI5ykLPwIHW839E053FQbI1G+jnRFJEw3Kv5Y4njixVWywQBx+NUFpkkKyk5LIb36Fg9DVXSYpqiGekLD0hyw==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"@jridgewell/remapping": "^2.3.4", "@jridgewell/remapping": "^2.3.4",
"@jridgewell/sourcemap-codec": "^1.5.0", "@jridgewell/sourcemap-codec": "^1.5.0",
@@ -2997,7 +2988,6 @@
"integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==", "integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==",
"dev": true, "dev": true,
"license": "Apache-2.0", "license": "Apache-2.0",
"peer": true,
"bin": { "bin": {
"tsc": "bin/tsc", "tsc": "bin/tsc",
"tsserver": "bin/tsserver" "tsserver": "bin/tsserver"
@@ -3019,7 +3009,6 @@
"integrity": "sha512-o5a9xKjbtuhY6Bi5S3+HvbRERmouabWbyUcpXXUA1u+GNUKoROi9byOJ8M0nHbHYHkYICiMlqxkg1KkYmm25Sw==", "integrity": "sha512-o5a9xKjbtuhY6Bi5S3+HvbRERmouabWbyUcpXXUA1u+GNUKoROi9byOJ8M0nHbHYHkYICiMlqxkg1KkYmm25Sw==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"esbuild": "^0.21.3", "esbuild": "^0.21.3",
"postcss": "^8.4.43", "postcss": "^8.4.43",
@@ -3138,7 +3127,6 @@
"integrity": "sha512-MSmPM9REYqDGBI8439mA4mWhV5sKmDlBKWIYbA3lRb2PTHACE0mgKwA8yQ2xq9vxDTuk4iPrECBAEW2aoFXY0Q==", "integrity": "sha512-MSmPM9REYqDGBI8439mA4mWhV5sKmDlBKWIYbA3lRb2PTHACE0mgKwA8yQ2xq9vxDTuk4iPrECBAEW2aoFXY0Q==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"peer": true,
"dependencies": { "dependencies": {
"@vitest/expect": "2.1.9", "@vitest/expect": "2.1.9",
"@vitest/mocker": "2.1.9", "@vitest/mocker": "2.1.9",

@@ -1,6 +1,6 @@
{ {
"name": "mangalord-frontend", "name": "mangalord-frontend",
"version": "0.21.3", "version": "0.29.0",
"private": true, "private": true,
"type": "module", "type": "module",
"scripts": { "scripts": {

@@ -76,17 +76,17 @@ describe('chapters api client', () => {
expect(result.page.total).toBeNull(); expect(result.page.total).toBeNull();
}); });
it('getChapter hits /v1/mangas/{id}/chapters/{n}', async () => { it('getChapter hits /v1/mangas/{id}/chapters/{chapter_id}', async () => {
fetchSpy.mockResolvedValueOnce(ok(chapterFixture)); fetchSpy.mockResolvedValueOnce(ok(chapterFixture));
const c = await getChapter('m1', 1); const c = await getChapter('m1', 'ch-uuid-1');
expect(c).toEqual(chapterFixture); expect(c).toEqual(chapterFixture);
const url = fetchSpy.mock.calls[0][0] as string; const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/1$/); expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/ch-uuid-1$/);
}); });
it('getChapter surfaces 404 via ApiError.code', async () => { it('getChapter surfaces 404 via ApiError.code', async () => {
fetchSpy.mockResolvedValueOnce(envelope(404, 'not_found', 'not found')); fetchSpy.mockResolvedValueOnce(envelope(404, 'not_found', 'not found'));
await expect(getChapter('m1', 99)).rejects.toMatchObject({ await expect(getChapter('m1', 'unknown-uuid')).rejects.toMatchObject({
status: 404, status: 404,
code: 'not_found' code: 'not_found'
}); });
@@ -143,10 +143,10 @@ describe('chapters api client', () => {
] ]
}) })
); );
const pages = await getChapterPages('m1', 1); const pages = await getChapterPages('m1', 'ch-uuid-1');
expect(pages).toHaveLength(1); expect(pages).toHaveLength(1);
expect(pages[0].storage_key).toContain('0001.png'); expect(pages[0].storage_key).toContain('0001.png');
const url = fetchSpy.mock.calls[0][0] as string; const url = fetchSpy.mock.calls[0][0] as string;
expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/1\/pages$/); expect(url).toMatch(/\/v1\/mangas\/m1\/chapters\/ch-uuid-1\/pages$/);
}); });
}); });

@@ -32,9 +32,9 @@ export async function listChapters(
); );
} }
export async function getChapter(mangaId: string, number: number): Promise<Chapter> { export async function getChapter(mangaId: string, chapterId: string): Promise<Chapter> {
return request<Chapter>( return request<Chapter>(
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${number}` `/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${encodeURIComponent(chapterId)}`
); );
} }
@@ -48,10 +48,10 @@ export type ChapterPage = {
export async function getChapterPages( export async function getChapterPages(
mangaId: string, mangaId: string,
number: number chapterId: string
): Promise<ChapterPage[]> { ): Promise<ChapterPage[]> {
const r = await request<{ pages: ChapterPage[] }>( const r = await request<{ pages: ChapterPage[] }>(
`/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${number}/pages` `/v1/mangas/${encodeURIComponent(mangaId)}/chapters/${encodeURIComponent(chapterId)}/pages`
); );
return r.pages; return r.pages;
} }

@@ -39,7 +39,7 @@
</a> </a>
{#if b.chapter_id && b.chapter_number != null} {#if b.chapter_id && b.chapter_number != null}
<a <a
href="/manga/{b.manga_id}/chapter/{b.chapter_number}" href="/manga/{b.manga_id}/chapter/{b.chapter_id}"
class="target" class="target"
> >
Chapter {b.chapter_number}{#if b.page != null && b.page > 0} — page {b.page}{/if} Chapter {b.chapter_number}{#if b.page != null && b.page > 0} — page {b.page}{/if}

@@ -29,6 +29,9 @@
? chapters.find((c) => c.id === readProgress.chapter_id) ?? null ? chapters.find((c) => c.id === readProgress.chapter_id) ?? null
: null : null
); );
/** Reader link target — always the chapter id when we have one,
* even for chapters past the loaded `chapters` list page. */
const continueChapterId = $derived(readProgress?.chapter_id ?? null);
const continueChapterNumber = $derived( const continueChapterNumber = $derived(
continueChapter?.number ?? readProgress?.chapter_number ?? null continueChapter?.number ?? readProgress?.chapter_number ?? null
); );
@@ -351,10 +354,10 @@
<section aria-label="chapters"> <section aria-label="chapters">
<h2>Chapters</h2> <h2>Chapters</h2>
{#if continueChapterNumber != null} {#if continueChapterId != null && continueChapterNumber != null}
<a <a
class="continue" class="continue"
href="/manga/{manga.id}/chapter/{continueChapterNumber}" href="/manga/{manga.id}/chapter/{continueChapterId}"
data-testid="continue-reading" data-testid="continue-reading"
> >
<span class="continue-label">Continue reading</span> <span class="continue-label">Continue reading</span>
@@ -372,7 +375,7 @@
<ol class="chapter-list" data-testid="chapter-list"> <ol class="chapter-list" data-testid="chapter-list">
{#each chapters as c (c.id)} {#each chapters as c (c.id)}
<li> <li>
<a href="/manga/{manga.id}/chapter/{c.number}"> <a href="/manga/{manga.id}/chapter/{c.id}">
Chapter {c.number}{#if c.title}: {c.title}{/if} Chapter {c.number}{#if c.title}: {c.title}{/if}
</a> </a>
<span class="pages">({c.page_count} pages)</span> <span class="pages">({c.page_count} pages)</span>

@@ -135,11 +135,11 @@
// navigation feels continuous in single mode. Harmless in // navigation feels continuous in single mode. Harmless in
// continuous mode (the reader just shows everything). // continuous mode (the reader just shows everything).
const target = mode === 'single' ? `?page=last` : ''; const target = mode === 'single' ? `?page=last` : '';
void goto(`/manga/${manga.id}/chapter/${prevChapter.number}${target}`); void goto(`/manga/${manga.id}/chapter/${prevChapter.id}${target}`);
} }
function jumpToNextChapter() { function jumpToNextChapter() {
if (!nextChapter) return; if (!nextChapter) return;
void goto(`/manga/${manga.id}/chapter/${nextChapter.number}`); void goto(`/manga/${manga.id}/chapter/${nextChapter.id}`);
} }
function next() { function next() {

@@ -6,11 +6,10 @@ import type { PageLoad } from './$types';
export const ssr = false; export const ssr = false;
export const load: PageLoad = async ({ params, url }) => { export const load: PageLoad = async ({ params, url }) => {
const number = Number(params.n);
const [manga, chapter, pages, readProgress, chapterList] = await Promise.all([ const [manga, chapter, pages, readProgress, chapterList] = await Promise.all([
getManga(params.id), getManga(params.id),
getChapter(params.id, number), getChapter(params.id, params.chapter_id),
getChapterPages(params.id, number), getChapterPages(params.id, params.chapter_id),
// `null` for guests or first-time openers — the reader uses // `null` for guests or first-time openers — the reader uses
// this to seed its session-local high-water mark. // this to seed its session-local high-water mark.
getMyReadProgressForManga(params.id), getMyReadProgressForManga(params.id),

@@ -60,8 +60,8 @@
{#each progress as p (p.manga_id)} {#each progress as p (p.manga_id)}
<li class="entry"> <li class="entry">
<a <a
href={p.chapter_number != null href={p.chapter_id != null
? `/manga/${p.manga_id}/chapter/${p.chapter_number}` ? `/manga/${p.manga_id}/chapter/${p.chapter_id}`
: `/manga/${p.manga_id}`} : `/manga/${p.manga_id}`}
class="cover-link" class="cover-link"
tabindex="-1" tabindex="-1"
@@ -89,9 +89,9 @@
{p.manga_title} {p.manga_title}
</a> </a>
<span class="target"> <span class="target">
{#if p.chapter_number != null} {#if p.chapter_id != null && p.chapter_number != null}
<a <a
href="/manga/{p.manga_id}/chapter/{p.chapter_number}" href="/manga/{p.manga_id}/chapter/{p.chapter_id}"
> >
Continue Ch. {p.chapter_number}{#if p.page > 1} — page {p.page}{/if} Continue Ch. {p.chapter_number}{#if p.page > 1} — page {p.page}{/if}
</a> </a>
@@ -185,7 +185,7 @@
<div class="meta"> <div class="meta">
<a href="/manga/{u.manga_id}" class="title">{u.manga_title}</a> <a href="/manga/{u.manga_id}" class="title">{u.manga_title}</a>
<span class="target"> <span class="target">
<a href="/manga/{u.manga_id}/chapter/{u.chapter.number}"> <a href="/manga/{u.manga_id}/chapter/{u.chapter.id}">
Chapter {u.chapter.number}{#if u.chapter.title}: {u.chapter.title}{/if} Chapter {u.chapter.number}{#if u.chapter.title}: {u.chapter.title}{/if}
</a> </a>
<span class="muted">({u.chapter.page_count} pages)</span> <span class="muted">({u.chapter.page_count} pages)</span>