Compare commits

...

3 Commits

Author SHA1 Message Date
MechaCat02
699c1d0d69 feat: rate-limit /auth/login, /register, /me/password (0.35.0)
A hand-rolled token-bucket limiter (5 req/sec, 10-request burst by
default; AUTH_RATE_PER_SEC/AUTH_RATE_BURST env knobs) gates the three
auth-mutation endpoints. One bucket per AppState so tests stay
isolated. Tower-governor wasn't wired in because the reverse proxy
doesn't yet forward client IPs — a global bucket gives equivalent
brute-force protection until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:17:57 +02:00
MechaCat02
e7662d18d6 feat: gitea actions for build, push, and ssh deploy (0.34.0)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 06:56:13 +02:00
MechaCat02
45ce0d8f12 feat: incremental crawl mode with seed-completion gate (0.33.0)
Daemon now auto-detects mode per source: Backfill until the first
full walk records `seed_completed:<source>` in `crawler_state`, then
Incremental (newest-first, stops after N consecutive Unchanged
upserts). `CRAWLER_MODE` overrides to a fixed mode; CLI rejects
`auto` since it has no pre-run DB state.

`Source::discover` returns a lazy `DiscoverWalk` so Incremental can
break out mid-walk without prefetching pages. The drop pass and seed
marker are now gated on a true full walk — fixes a latent soft-drop
of the index tail under partial sweeps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 06:41:26 +02:00
21 changed files with 1377 additions and 163 deletions

View File

@@ -29,6 +29,13 @@ COOKIE_DOMAIN=
# get reaped lazily. # get reaped lazily.
SESSION_TTL_DAYS=30 SESSION_TTL_DAYS=30
# ----- Auth brute-force rate limits -----
# Token-bucket budget shared across /auth/login, /auth/register, and
# /auth/me/password. Set per_sec=0 to disable (e.g. behind a
# rate-limiting reverse proxy that already enforces a budget).
AUTH_RATE_PER_SEC=5
AUTH_RATE_BURST=10
# ----- CORS ----- # ----- CORS -----
# Comma-separated origins allowed to call the API with credentials. # Comma-separated origins allowed to call the API with credentials.
# Default is empty: same-origin only. Set when frontend and backend live # Default is empty: same-origin only. Set when frontend and backend live

71
.gitea/README.md Normal file
View File

@@ -0,0 +1,71 @@
# Gitea Actions
The [`deploy`](workflows/deploy.yml) workflow runs on every push to `main`
(and via manual `workflow_dispatch`). It tests, builds, pushes the images
to a private registry, and rolls the stack over by SSH on the target host.
## Required secrets
Set under *Repo Settings → Actions → Secrets*:
| Name | Example | Purpose |
| -------------------- | ------------------------ | ---------------------------------------------------------------- |
| `REGISTRY_URL` | `registry.example.com` | Registry host. No scheme, no trailing slash. |
| `REGISTRY_USERNAME` | `mangalord-ci` | `docker login` user. |
| `REGISTRY_PASSWORD` | `<token>` | `docker login` token/password. |
| `SSH_HOST` | `mangalord.example.com` | Deploy target hostname/IP. |
| `SSH_USER` | `deploy` | SSH user on the target (must be in the `docker` group). |
| `SSH_PRIVATE_KEY` | `-----BEGIN OPENSSH...` | Private key authorised in the target user's `authorized_keys`. |
| `SSH_PORT` | `22` | Optional. Defaults to `22` if unset. |
## Required variables
Set under *Repo Settings → Actions → Variables* (not secrets — they appear
in logs):
| Name | Example | Purpose |
| ------------- | ------------------------ | ---------------------------------------------------------------------- |
| `DEPLOY_PATH` | `/srv/mangalord` | Directory on target holding `docker-compose.yml`, `.env`, and the prod overlay. |
## One-time host setup
The workflow assumes the deploy target already has:
1. Docker + Docker Compose v2 installed and the `SSH_USER` in the `docker` group.
2. `$DEPLOY_PATH/docker-compose.yml` (copy of the repo's [docker-compose.yml](../docker-compose.yml)).
3. `$DEPLOY_PATH/docker-compose.prod.yml` (copy of the repo's [docker-compose.prod.yml](../docker-compose.prod.yml)).
4. `$DEPLOY_PATH/.env` populated from [.env.example](../.env.example) with production values (real `POSTGRES_PASSWORD`, `COOKIE_SECURE=true`, etc.).
Bootstrap once:
```bash
ssh deploy@mangalord.example.com
sudo mkdir -p /srv/mangalord && sudo chown deploy:deploy /srv/mangalord
cd /srv/mangalord
# place docker-compose.yml, docker-compose.prod.yml, and .env here
```
The first workflow run will pull the images, bring the stack up, and run
the embedded migrations on startup.
## Image tags
Every push produces three tags per image:
- `mangalord-{backend,frontend}:latest`
- `mangalord-{backend,frontend}:<git-sha>` — used by the deploy job; lets
you pin a deploy to a specific commit
- `mangalord-{backend,frontend}:<version>` — the version from
[backend/Cargo.toml](../backend/Cargo.toml) (verified in lockstep with
[frontend/package.json](../frontend/package.json))
## Rollback
SSH to the target, set `IMAGE_TAG` to a previous commit SHA, and re-up:
```bash
cd /srv/mangalord
export REGISTRY_URL=registry.example.com
export IMAGE_TAG=<previous-sha>
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
```

144
.gitea/workflows/deploy.yml Normal file
View File

@@ -0,0 +1,144 @@
name: deploy
on:
push:
branches: [main]
workflow_dispatch:
jobs:
test-backend:
runs-on: ubuntu-latest
container:
image: rust:1-slim
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_USER: mangalord
POSTGRES_PASSWORD: mangalord
POSTGRES_DB: mangalord
options: >-
--health-cmd "pg_isready -U mangalord"
--health-interval 5s
--health-timeout 5s
--health-retries 10
env:
DATABASE_URL: postgres://mangalord:mangalord@postgres:5432/mangalord
steps:
- uses: actions/checkout@v4
- name: Install build deps
run: |
apt-get update
apt-get install -y --no-install-recommends pkg-config libssl-dev ca-certificates
- name: Cache cargo registry and target
uses: actions/cache@v4
with:
path: |
~/.cargo/registry
~/.cargo/git
backend/target
key: cargo-${{ runner.os }}-${{ hashFiles('backend/Cargo.lock') }}
restore-keys: |
cargo-${{ runner.os }}-
- name: cargo test
working-directory: backend
run: cargo test --locked
test-frontend:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: npm
cache-dependency-path: frontend/package-lock.json
- name: npm ci
working-directory: frontend
run: npm ci
- name: vitest
working-directory: frontend
run: npm test
build-and-push:
runs-on: ubuntu-latest
needs: [test-backend, test-frontend]
outputs:
image_tag: ${{ steps.meta.outputs.image_tag }}
version: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- name: Resolve image tags
id: meta
run: |
version="$(grep -m1 '^version' backend/Cargo.toml | cut -d'"' -f2)"
frontend_version="$(grep -m1 '"version"' frontend/package.json | cut -d'"' -f4)"
if [ "$version" != "$frontend_version" ]; then
echo "Version mismatch: backend=$version frontend=$frontend_version" >&2
exit 1
fi
echo "image_tag=${GITHUB_SHA}" >> "$GITHUB_OUTPUT"
echo "version=${version}" >> "$GITHUB_OUTPUT"
- uses: docker/setup-buildx-action@v3
- name: docker login
uses: docker/login-action@v3
with:
registry: ${{ secrets.REGISTRY_URL }}
username: ${{ secrets.REGISTRY_USERNAME }}
password: ${{ secrets.REGISTRY_PASSWORD }}
- name: Build & push backend
uses: docker/build-push-action@v5
with:
context: ./backend
push: true
tags: |
${{ secrets.REGISTRY_URL }}/mangalord-backend:latest
${{ secrets.REGISTRY_URL }}/mangalord-backend:${{ steps.meta.outputs.image_tag }}
${{ secrets.REGISTRY_URL }}/mangalord-backend:${{ steps.meta.outputs.version }}
cache-from: type=gha,scope=backend
cache-to: type=gha,mode=max,scope=backend
- name: Build & push frontend
uses: docker/build-push-action@v5
with:
context: ./frontend
push: true
tags: |
${{ secrets.REGISTRY_URL }}/mangalord-frontend:latest
${{ secrets.REGISTRY_URL }}/mangalord-frontend:${{ steps.meta.outputs.image_tag }}
${{ secrets.REGISTRY_URL }}/mangalord-frontend:${{ steps.meta.outputs.version }}
cache-from: type=gha,scope=frontend
cache-to: type=gha,mode=max,scope=frontend
deploy:
runs-on: ubuntu-latest
needs: build-and-push
steps:
- name: SSH deploy
uses: appleboy/ssh-action@v1.0.3
with:
host: ${{ secrets.SSH_HOST }}
username: ${{ secrets.SSH_USER }}
key: ${{ secrets.SSH_PRIVATE_KEY }}
port: ${{ secrets.SSH_PORT || 22 }}
envs: REGISTRY_URL,REGISTRY_USERNAME,REGISTRY_PASSWORD,IMAGE_TAG,DEPLOY_PATH
script_stop: true
script: |
set -euo pipefail
cd "$DEPLOY_PATH"
echo "$REGISTRY_PASSWORD" | docker login "$REGISTRY_URL" -u "$REGISTRY_USERNAME" --password-stdin
export REGISTRY_URL IMAGE_TAG
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
docker image prune -f
docker logout "$REGISTRY_URL"
env:
REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
IMAGE_TAG: ${{ needs.build-and-push.outputs.image_tag }}
DEPLOY_PATH: ${{ vars.DEPLOY_PATH }}

2
backend/Cargo.lock generated
View File

@@ -1470,7 +1470,7 @@ checksum = "c41e0c4fef86961ac6d6f8a82609f55f31b05e4fce149ac5710e439df7619ba4"
[[package]] [[package]]
name = "mangalord" name = "mangalord"
version = "0.32.0" version = "0.35.0"
dependencies = [ dependencies = [
"anyhow", "anyhow",
"argon2", "argon2",

View File

@@ -1,6 +1,6 @@
[package] [package]
name = "mangalord" name = "mangalord"
version = "0.32.0" version = "0.35.0"
edition = "2021" edition = "2021"
default-run = "mangalord" default-run = "mangalord"

View File

@@ -80,6 +80,7 @@ async fn register(
jar: CookieJar, jar: CookieJar,
Json(input): Json<Credentials>, Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> { ) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "register")?;
let username = input.username.trim(); let username = input.username.trim();
validate_username(username)?; validate_username(username)?;
validate_password(&input.password)?; validate_password(&input.password)?;
@@ -95,6 +96,7 @@ async fn login(
jar: CookieJar, jar: CookieJar,
Json(input): Json<Credentials>, Json(input): Json<Credentials>,
) -> AppResult<impl IntoResponse> { ) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "login")?;
let username = input.username.trim(); let username = input.username.trim();
if username.is_empty() || input.password.is_empty() { if username.is_empty() || input.password.is_empty() {
return Err(AppError::InvalidInput( return Err(AppError::InvalidInput(
@@ -149,6 +151,7 @@ async fn change_password(
jar: CookieJar, jar: CookieJar,
Json(input): Json<ChangePassword>, Json(input): Json<ChangePassword>,
) -> AppResult<impl IntoResponse> { ) -> AppResult<impl IntoResponse> {
check_auth_rate_limit(&state, "change_password")?;
if !verify_password(&input.current_password, &user.password_hash) { if !verify_password(&input.current_password, &user.password_hash) {
return Err(AppError::Unauthenticated); return Err(AppError::Unauthenticated);
} }
@@ -293,6 +296,33 @@ fn build_expired_cookie(cfg: &AuthConfig) -> Cookie<'static> {
builder.build() builder.build()
} }
/// Consume one token from the shared auth rate limiter. Called at the
/// start of `register`, `login`, and `change_password` so credential
/// stuffing / spraying / username-probe loops are throttled by the
/// configured budget (default 5/sec with a 10-request burst).
///
/// All three endpoints share one bucket — they all expose the same
/// argon2-verify-or-create work and the same enumeration channels, so
/// any one of them in a tight loop should trip the limit. `endpoint`
/// is included in the rate-limit-hit log line so operators can tell
/// which endpoint is being probed.
fn check_auth_rate_limit(state: &AppState, endpoint: &'static str) -> AppResult<()> {
use crate::auth::rate_limit::AcquireResult;
match state.auth_limiter.try_acquire() {
AcquireResult::Allowed => Ok(()),
AcquireResult::Denied { retry_after_secs } => {
tracing::warn!(
endpoint,
retry_after_secs,
"auth rate limit hit; returning 429"
);
Err(AppError::TooManyRequests {
retry_after_secs: Some(retry_after_secs),
})
}
}
}
fn validate_username(u: &str) -> AppResult<()> { fn validate_username(u: &str) -> AppResult<()> {
if u.is_empty() { if u.is_empty() {
return Err(AppError::InvalidInput("username is required".into())); return Err(AppError::InvalidInput("username is required".into()));

View File

@@ -12,7 +12,8 @@ use tokio_util::sync::CancellationToken;
use tower_http::cors::{AllowOrigin, CorsLayer}; use tower_http::cors::{AllowOrigin, CorsLayer};
use tower_http::trace::TraceLayer; use tower_http::trace::TraceLayer;
use crate::config::{AuthConfig, Config, CrawlerConfig, UploadConfig}; use crate::auth::rate_limit::AuthRateLimiter;
use crate::config::{AuthConfig, Config, CrawlerConfig, CrawlerModePref, UploadConfig};
use crate::crawler::browser_manager::{self, BrowserManager}; use crate::crawler::browser_manager::{self, BrowserManager};
use crate::crawler::content::{self, SyncOutcome}; use crate::crawler::content::{self, SyncOutcome};
use crate::crawler::daemon::{self, ChapterDispatcher, DaemonConfig, MetadataPass}; use crate::crawler::daemon::{self, ChapterDispatcher, DaemonConfig, MetadataPass};
@@ -20,6 +21,8 @@ use crate::crawler::jobs::JobPayload;
use crate::crawler::pipeline::{self, MetadataStats}; use crate::crawler::pipeline::{self, MetadataStats};
use crate::crawler::rate_limit::HostRateLimiters; use crate::crawler::rate_limit::HostRateLimiters;
use crate::crawler::session; use crate::crawler::session;
use crate::crawler::source::{target as target_source, DiscoverMode};
use crate::repo;
use crate::storage::{LocalStorage, Storage}; use crate::storage::{LocalStorage, Storage};
#[derive(Clone)] #[derive(Clone)]
@@ -28,6 +31,10 @@ pub struct AppState {
pub storage: Arc<dyn Storage>, pub storage: Arc<dyn Storage>,
pub auth: AuthConfig, pub auth: AuthConfig,
pub upload: UploadConfig, pub upload: UploadConfig,
/// Shared rate limiter guarding the `/auth/*` mutation endpoints.
/// One instance per AppState so tests stay isolated across the
/// same process.
pub auth_limiter: Arc<AuthRateLimiter>,
} }
/// Bundle returned by [`build`]. The router is what `axum::serve` consumes; /// Bundle returned by [`build`]. The router is what `axum::serve` consumes;
@@ -62,11 +69,13 @@ pub async fn build(config: Config) -> anyhow::Result<AppHandle> {
None None
}; };
let auth_limiter = Arc::new(AuthRateLimiter::new(config.auth.rate_limit));
let state = AppState { let state = AppState {
db, db,
storage, storage,
auth: config.auth.clone(), auth: config.auth.clone(),
upload: config.upload.clone(), upload: config.upload.clone(),
auth_limiter,
}; };
let router = router(state).layer(cors_layer(&config.cors_allowed_origins)); let router = router(state).layer(cors_layer(&config.cors_allowed_origins));
Ok(AppHandle { router, daemon }) Ok(AppHandle { router, daemon })
@@ -149,6 +158,8 @@ async fn spawn_crawler_daemon(
http: http.clone(), http: http.clone(),
rate: Arc::clone(&rate), rate: Arc::clone(&rate),
start_url: url.clone(), start_url: url.clone(),
mode_pref: cfg.mode,
incremental_stop_after: cfg.incremental_stop_after,
}); });
m m
}); });
@@ -210,11 +221,20 @@ struct RealMetadataPass {
http: reqwest::Client, http: reqwest::Client,
rate: Arc<HostRateLimiters>, rate: Arc<HostRateLimiters>,
start_url: String, start_url: String,
mode_pref: CrawlerModePref,
incremental_stop_after: usize,
} }
#[async_trait] #[async_trait]
impl MetadataPass for RealMetadataPass { impl MetadataPass for RealMetadataPass {
async fn run(&self) -> anyhow::Result<MetadataStats> { async fn run(&self) -> anyhow::Result<MetadataStats> {
let mode = resolve_mode(
&self.db,
target_source::SOURCE_ID,
self.mode_pref,
self.incremental_stop_after,
)
.await?;
pipeline::run_metadata_pass( pipeline::run_metadata_pass(
&self.browser_manager, &self.browser_manager,
&self.db, &self.db,
@@ -224,11 +244,56 @@ impl MetadataPass for RealMetadataPass {
&self.start_url, &self.start_url,
0, 0,
false, false,
mode,
) )
.await .await
} }
} }
/// Pick the active mode for this tick. `Explicit` short-circuits the
/// DB lookup. `Auto` reads `seed_completed_at`: missing → Backfill
/// (initial seed for this source), present → Incremental with the
/// configured threshold.
///
/// A DB error during the Auto lookup propagates as `Err` rather than
/// silently degrading to Backfill — the daemon's `run_tick` catches
/// the error, logs, and skips the tick. That's safer than running a
/// full re-backfill (including a drop pass against stale-looking rows)
/// when the DB is flaky.
async fn resolve_mode(
db: &PgPool,
source_id: &str,
pref: CrawlerModePref,
incremental_stop_after: usize,
) -> anyhow::Result<DiscoverMode> {
match pref {
CrawlerModePref::Explicit(m) => {
tracing::info!(?m, "crawler mode: explicit (CRAWLER_MODE override)");
Ok(m)
}
CrawlerModePref::Auto => {
let seeded = repo::crawler::seed_completed_at(db, source_id)
.await
.context("seed_completed_at lookup for mode auto-detection")?;
match seeded {
Some(at) => {
tracing::info!(
seed_completed_at = %at.to_rfc3339(),
"crawler mode: auto → incremental (seed previously completed)"
);
Ok(DiscoverMode::Incremental {
stop_after_unchanged: incremental_stop_after,
})
}
None => {
tracing::info!("crawler mode: auto → backfill (no seed marker for source)");
Ok(DiscoverMode::Backfill)
}
}
}
}
}
struct RealChapterDispatcher { struct RealChapterDispatcher {
browser_manager: Arc<BrowserManager>, browser_manager: Arc<BrowserManager>,
db: PgPool, db: PgPool,

View File

@@ -7,4 +7,5 @@
pub mod extractor; pub mod extractor;
pub mod password; pub mod password;
pub mod rate_limit;
pub mod token; pub mod token;

View File

@@ -0,0 +1,179 @@
//! Per-process token-bucket rate limiter for the auth endpoints.
//!
//! Protects `/auth/login`, `/auth/register`, and `/auth/me/password`
//! from credential stuffing / password spraying / username probing.
//!
//! The current deploy puts SvelteKit's hooks.server.ts proxy in front
//! of axum without forwarding the original client IP (no
//! `X-Forwarded-For`), so per-IP buckets would all collapse to the
//! proxy container's address. Until the proxy learns to forward the
//! peer address, a single global bucket gives equivalent protection
//! against mass-attack patterns and trades a small DoS surface
//! (legitimate users sharing the limit) for simplicity.
//!
//! Each `AppState` carries its own [`AuthRateLimiter`] instance, so
//! tests run in isolated buckets and won't bleed across `#[sqlx::test]`
//! cases that share a process.
use std::sync::Mutex;
use std::time::Instant;
/// Tunable limits. `per_sec == 0` disables the limiter — used by the
/// test harness and by anyone who wants to opt out via env config.
#[derive(Clone, Copy, Debug)]
pub struct RateLimitConfig {
pub per_sec: u32,
pub burst: u32,
}
impl Default for RateLimitConfig {
/// Disabled by default. The production `AuthConfig::from_env`
/// overrides to a real limit; the test harness keeps the default
/// so existing tests don't flake against shared buckets.
fn default() -> Self {
Self {
per_sec: 0,
burst: 0,
}
}
}
/// Production defaults: 5 requests/sec sustained, 10-request burst.
/// Tight enough to make brute force impractical, loose enough that a
/// real user mistyping their password three times in a row doesn't
/// hit it.
pub const PRODUCTION_PER_SEC: u32 = 5;
pub const PRODUCTION_BURST: u32 = 10;
struct Bucket {
tokens: f64,
last_refill: Instant,
}
/// Outcome of [`AuthRateLimiter::try_acquire`]. When `Denied`, the
/// caller can use `retry_after_secs` for a `Retry-After: N` header
/// (RFC 6585 §4) so well-behaved clients back off correctly rather
/// than retrying in a tight loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AcquireResult {
Allowed,
Denied { retry_after_secs: u64 },
}
/// Single-bucket token-bucket limiter. `try_acquire` is cheap (one
/// mutex acquire, no allocations) so the auth path doesn't pay a real
/// cost for the check.
pub struct AuthRateLimiter {
cfg: RateLimitConfig,
bucket: Mutex<Bucket>,
}
impl AuthRateLimiter {
pub fn new(cfg: RateLimitConfig) -> Self {
Self {
cfg,
bucket: Mutex::new(Bucket {
tokens: cfg.burst as f64,
last_refill: Instant::now(),
}),
}
}
/// Consume one token if available. Returns `Denied` with a
/// rounded-up seconds-until-refill so the caller can emit a
/// `Retry-After` header.
pub fn try_acquire(&self) -> AcquireResult {
if self.cfg.per_sec == 0 {
return AcquireResult::Allowed;
}
let now = Instant::now();
let mut bucket = self.bucket.lock().expect("rate limiter mutex poisoned");
let elapsed = now.duration_since(bucket.last_refill).as_secs_f64();
bucket.tokens =
(bucket.tokens + elapsed * f64::from(self.cfg.per_sec)).min(f64::from(self.cfg.burst));
bucket.last_refill = now;
if bucket.tokens >= 1.0 {
bucket.tokens -= 1.0;
AcquireResult::Allowed
} else {
// ceil((1 - tokens) / per_sec), minimum 1 — a `Retry-After: 0`
// would tell clients to retry immediately, which is what we're
// actively trying to discourage.
let deficit = 1.0 - bucket.tokens;
let wait_secs = (deficit / f64::from(self.cfg.per_sec)).ceil() as u64;
AcquireResult::Denied {
retry_after_secs: wait_secs.max(1),
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn disabled_limiter_always_allows() {
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 0,
burst: 0,
});
for _ in 0..1000 {
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
}
}
#[test]
fn burst_lets_through_initial_window_then_blocks() {
// 0 refill, burst 3 → first three pass, fourth blocks.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 3,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
match rl.try_acquire() {
AcquireResult::Denied { retry_after_secs } => {
// Bucket is at ~0 tokens, refill rate 1/sec → ~1s wait.
assert!(
retry_after_secs >= 1,
"retry_after must be at least 1s, got {retry_after_secs}"
);
}
AcquireResult::Allowed => panic!("fourth request must be denied"),
}
}
#[test]
fn tokens_refill_over_time() {
// 10/sec → after ~120ms we should have at least one token back.
let rl = AuthRateLimiter::new(RateLimitConfig {
per_sec: 10,
burst: 1,
});
assert_eq!(rl.try_acquire(), AcquireResult::Allowed);
assert!(matches!(rl.try_acquire(), AcquireResult::Denied { .. }));
std::thread::sleep(std::time::Duration::from_millis(150));
assert_eq!(
rl.try_acquire(),
AcquireResult::Allowed,
"token should have refilled"
);
}
#[test]
fn retry_after_scales_inversely_with_refill_rate() {
// 1/sec → wait ~1s after burst exhausted.
// 10/sec → wait <1s, but we clamp to a minimum of 1s.
let slow = AuthRateLimiter::new(RateLimitConfig {
per_sec: 1,
burst: 1,
});
slow.try_acquire();
match slow.try_acquire() {
AcquireResult::Denied { retry_after_secs } => assert_eq!(retry_after_secs, 1),
_ => panic!("expected Denied"),
}
}
}

View File

@@ -31,6 +31,7 @@ use mangalord::crawler::content::{self, SyncOutcome};
use mangalord::crawler::pipeline; use mangalord::crawler::pipeline;
use mangalord::crawler::rate_limit::HostRateLimiters; use mangalord::crawler::rate_limit::HostRateLimiters;
use mangalord::crawler::session; use mangalord::crawler::session;
use mangalord::crawler::source::DiscoverMode;
use mangalord::storage::{LocalStorage, Storage}; use mangalord::storage::{LocalStorage, Storage};
use sqlx::postgres::PgPoolOptions; use sqlx::postgres::PgPoolOptions;
use sqlx::PgPool; use sqlx::PgPool;
@@ -62,6 +63,8 @@ async fn main() -> anyhow::Result<()> {
let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms); let cdn_rate_ms = env_u64("CRAWLER_CDN_RATE_MS", rate_ms);
let limit = env_u64("CRAWLER_LIMIT", 0) as usize; let limit = env_u64("CRAWLER_LIMIT", 0) as usize;
let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false); let skip_chapters = env_bool("CRAWLER_SKIP_CHAPTERS", false);
let incremental_stop_after = env_u64("CRAWLER_INCREMENTAL_STOP_AFTER", 20).max(1) as usize;
let mode = parse_crawler_mode(incremental_stop_after)?;
let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false); let skip_chapter_content = env_bool("CRAWLER_SKIP_CHAPTER_CONTENT", false);
let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize; let chapter_workers = env_u64("CRAWLER_CHAPTER_WORKERS", 1).max(1) as usize;
let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false); let force_refetch_chapters = env_bool("CRAWLER_FORCE_REFETCH_CHAPTERS", false);
@@ -140,6 +143,7 @@ async fn main() -> anyhow::Result<()> {
user_agent = ?user_agent, user_agent = ?user_agent,
proxy = ?proxy_url, proxy = ?proxy_url,
keep_open, keep_open,
?mode,
storage_dir = %storage_dir.display(), storage_dir = %storage_dir.display(),
"starting crawler" "starting crawler"
); );
@@ -187,6 +191,7 @@ async fn main() -> anyhow::Result<()> {
skip_chapter_content || !session_ready, skip_chapter_content || !session_ready,
chapter_workers, chapter_workers,
force_refetch_chapters, force_refetch_chapters,
mode,
) )
.await; .await;
@@ -216,6 +221,7 @@ async fn run(
skip_chapter_content: bool, skip_chapter_content: bool,
chapter_workers: usize, chapter_workers: usize,
force_refetch_chapters: bool, force_refetch_chapters: bool,
mode: DiscoverMode,
) -> anyhow::Result<()> { ) -> anyhow::Result<()> {
let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms)); let mut rate = HostRateLimiters::new(Duration::from_millis(rate_ms));
if let Some(host) = cdn_host { if let Some(host) = cdn_host {
@@ -232,6 +238,7 @@ async fn run(
start_url, start_url,
limit, limit,
skip_chapters, skip_chapters,
mode,
) )
.await?; .await?;
tracing::info!(?stats, "metadata pass complete"); tracing::info!(?stats, "metadata pass complete");
@@ -390,6 +397,38 @@ fn resolve_start_url() -> anyhow::Result<String> {
}) })
} }
/// Parse the CLI's `CRAWLER_MODE`. Defaults to `backfill` because the
/// binary is operator-driven (manual reseeds, force-refetches) — the
/// auto-detect logic lives in the daemon. `auto` is rejected because
/// the CLI has no DB state to consult before the run.
fn parse_crawler_mode(incremental_stop_after: usize) -> anyhow::Result<DiscoverMode> {
parse_crawler_mode_str(
std::env::var("CRAWLER_MODE").ok().as_deref(),
incremental_stop_after,
)
}
/// Pure variant of [`parse_crawler_mode`] — testable without env-var
/// mutation.
fn parse_crawler_mode_str(
raw: Option<&str>,
incremental_stop_after: usize,
) -> anyhow::Result<DiscoverMode> {
match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
None | Some("") | Some("backfill") => Ok(DiscoverMode::Backfill),
Some("incremental") => Ok(DiscoverMode::Incremental {
stop_after_unchanged: incremental_stop_after,
}),
Some("auto") => Err(anyhow!(
"CRAWLER_MODE=auto isn't supported by the CLI (use backfill or incremental); \
the daemon does auto-detection"
)),
Some(other) => Err(anyhow!(
"CRAWLER_MODE must be one of: backfill, incremental (got {other:?})"
)),
}
}
fn env_u64(name: &str, default: u64) -> u64 { fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name) std::env::var(name)
.ok() .ok()
@@ -405,3 +444,55 @@ fn env_bool(name: &str, default: bool) -> bool {
} }
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn cli_mode_defaults_to_backfill_when_unset_or_blank() {
let none = parse_crawler_mode_str(None, 20).unwrap();
assert!(matches!(none, DiscoverMode::Backfill));
let blank = parse_crawler_mode_str(Some(""), 20).unwrap();
assert!(matches!(blank, DiscoverMode::Backfill));
}
#[test]
fn cli_mode_recognizes_backfill_and_incremental() {
let backfill = parse_crawler_mode_str(Some("backfill"), 20).unwrap();
assert!(matches!(backfill, DiscoverMode::Backfill));
let incremental = parse_crawler_mode_str(Some("incremental"), 9).unwrap();
assert!(matches!(
incremental,
DiscoverMode::Incremental { stop_after_unchanged: 9 }
));
}
#[test]
fn cli_mode_rejects_auto_explicitly() {
let err = parse_crawler_mode_str(Some("auto"), 20).unwrap_err();
let msg = format!("{err}");
assert!(
msg.contains("daemon"),
"rejection should point operator at the daemon: {msg}"
);
}
#[test]
fn cli_mode_rejects_unknown_value() {
let err = parse_crawler_mode_str(Some("garbage"), 20).unwrap_err();
let msg = format!("{err}");
assert!(msg.contains("backfill"));
assert!(msg.contains("incremental"));
}
#[test]
fn cli_mode_is_case_insensitive_and_trims() {
let mixed = parse_crawler_mode_str(Some(" Incremental "), 4).unwrap();
assert!(matches!(
mixed,
DiscoverMode::Incremental { stop_after_unchanged: 4 }
));
}
}

View File

@@ -5,12 +5,23 @@ use chrono::NaiveTime;
use chrono_tz::Tz; use chrono_tz::Tz;
use crate::crawler::browser::LaunchOptions; use crate::crawler::browser::LaunchOptions;
use crate::crawler::source::DiscoverMode;
/// What `CRAWLER_MODE` was set to. `Auto` is the daemon's default —
/// pick Backfill until `seed_completed_at` is written, then flip to
/// Incremental. `Explicit` forces a single mode regardless.
#[derive(Clone, Copy, Debug)]
pub enum CrawlerModePref {
Auto,
Explicit(DiscoverMode),
}
#[derive(Clone, Debug)] #[derive(Clone, Debug)]
pub struct AuthConfig { pub struct AuthConfig {
pub cookie_secure: bool, pub cookie_secure: bool,
pub cookie_domain: Option<String>, pub cookie_domain: Option<String>,
pub session_ttl_days: i64, pub session_ttl_days: i64,
pub rate_limit: crate::auth::rate_limit::RateLimitConfig,
} }
impl Default for AuthConfig { impl Default for AuthConfig {
@@ -19,6 +30,11 @@ impl Default for AuthConfig {
cookie_secure: true, cookie_secure: true,
cookie_domain: None, cookie_domain: None,
session_ttl_days: 30, session_ttl_days: 30,
// Disabled by default so the test harness inherits a
// non-throttling limiter. Production `from_env` overrides
// to the [`PRODUCTION_PER_SEC`]/[`PRODUCTION_BURST`]
// defaults.
rate_limit: crate::auth::rate_limit::RateLimitConfig::default(),
} }
} }
} }
@@ -77,6 +93,12 @@ pub struct CrawlerConfig {
pub user_agent: Option<String>, pub user_agent: Option<String>,
pub proxy: Option<String>, pub proxy: Option<String>,
pub browser: LaunchOptions, pub browser: LaunchOptions,
/// Mode preference for the metadata pass. Daemon default is `Auto`
/// (Backfill until `seed_completed_at` is written, then Incremental).
pub mode: CrawlerModePref,
/// `stop_after_unchanged` threshold supplied to Incremental in both
/// `Auto` (post-seed) and `Explicit(Incremental)` modes.
pub incremental_stop_after: usize,
} }
impl Default for CrawlerConfig { impl Default for CrawlerConfig {
@@ -97,6 +119,8 @@ impl Default for CrawlerConfig {
user_agent: None, user_agent: None,
proxy: None, proxy: None,
browser: LaunchOptions::headless(), browser: LaunchOptions::headless(),
mode: CrawlerModePref::Auto,
incremental_stop_after: 20,
} }
} }
} }
@@ -117,6 +141,16 @@ impl Config {
.ok() .ok()
.filter(|s| !s.is_empty()), .filter(|s| !s.is_empty()),
session_ttl_days: env_i64("SESSION_TTL_DAYS", 30), session_ttl_days: env_i64("SESSION_TTL_DAYS", 30),
rate_limit: crate::auth::rate_limit::RateLimitConfig {
per_sec: env_u64(
"AUTH_RATE_PER_SEC",
crate::auth::rate_limit::PRODUCTION_PER_SEC.into(),
) as u32,
burst: env_u64(
"AUTH_RATE_BURST",
crate::auth::rate_limit::PRODUCTION_BURST.into(),
) as u32,
},
}, },
upload: UploadConfig { upload: UploadConfig {
max_request_bytes: env_usize("MAX_REQUEST_BYTES", 200 * 1024 * 1024), max_request_bytes: env_usize("MAX_REQUEST_BYTES", 200 * 1024 * 1024),
@@ -151,6 +185,9 @@ impl CrawlerConfig {
.parse() .parse()
.map_err(|e| anyhow::anyhow!("CRAWLER_TZ must be a valid IANA TZ (got {raw:?}): {e}"))?, .map_err(|e| anyhow::anyhow!("CRAWLER_TZ must be a valid IANA TZ (got {raw:?}): {e}"))?,
}; };
let incremental_stop_after =
env_u64("CRAWLER_INCREMENTAL_STOP_AFTER", 20).max(1) as usize;
let mode = parse_mode_env(incremental_stop_after)?;
Ok(Self { Ok(Self {
daemon_enabled: env_bool("CRAWLER_DAEMON", true), daemon_enabled: env_bool("CRAWLER_DAEMON", true),
daily_at, daily_at,
@@ -179,10 +216,38 @@ impl CrawlerConfig {
.ok() .ok()
.filter(|s| !s.trim().is_empty()), .filter(|s| !s.trim().is_empty()),
browser: LaunchOptions::from_env(), browser: LaunchOptions::from_env(),
mode,
incremental_stop_after,
}) })
} }
} }
/// Parse `CRAWLER_MODE`. Empty/unset → `Auto`. Recognized values are
/// `auto`, `backfill`, and `incremental` (case-insensitive). Anything
/// else is a hard error so a typo can't silently fall through to the
/// default and mask itself.
fn parse_mode_env(incremental_stop_after: usize) -> anyhow::Result<CrawlerModePref> {
parse_mode_str(std::env::var("CRAWLER_MODE").ok().as_deref(), incremental_stop_after)
}
/// Pure variant of [`parse_mode_env`] — testable without env-var
/// mutation. Takes the raw value (or `None` if unset).
pub(crate) fn parse_mode_str(
raw: Option<&str>,
incremental_stop_after: usize,
) -> anyhow::Result<CrawlerModePref> {
match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
None | Some("") | Some("auto") => Ok(CrawlerModePref::Auto),
Some("backfill") => Ok(CrawlerModePref::Explicit(DiscoverMode::Backfill)),
Some("incremental") => Ok(CrawlerModePref::Explicit(DiscoverMode::Incremental {
stop_after_unchanged: incremental_stop_after,
})),
Some(other) => Err(anyhow::anyhow!(
"CRAWLER_MODE must be one of: auto, backfill, incremental (got {other:?})"
)),
}
}
fn env_u64(name: &str, default: u64) -> u64 { fn env_u64(name: &str, default: u64) -> u64 {
std::env::var(name) std::env::var(name)
.ok() .ok()
@@ -211,3 +276,63 @@ fn env_usize(name: &str, default: usize) -> usize {
.and_then(|s| s.parse().ok()) .and_then(|s| s.parse().ok())
.unwrap_or(default) .unwrap_or(default)
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn parse_mode_str_defaults_to_auto_when_unset_or_blank() {
let none = parse_mode_str(None, 20).unwrap();
assert!(matches!(none, CrawlerModePref::Auto));
let blank = parse_mode_str(Some(""), 20).unwrap();
assert!(matches!(blank, CrawlerModePref::Auto));
let whitespace = parse_mode_str(Some(" "), 20).unwrap();
assert!(matches!(whitespace, CrawlerModePref::Auto));
}
#[test]
fn parse_mode_str_recognizes_each_keyword() {
let auto = parse_mode_str(Some("auto"), 20).unwrap();
assert!(matches!(auto, CrawlerModePref::Auto));
let backfill = parse_mode_str(Some("backfill"), 20).unwrap();
assert!(matches!(
backfill,
CrawlerModePref::Explicit(DiscoverMode::Backfill)
));
let incremental = parse_mode_str(Some("incremental"), 7).unwrap();
assert!(matches!(
incremental,
CrawlerModePref::Explicit(DiscoverMode::Incremental {
stop_after_unchanged: 7
})
));
}
#[test]
fn parse_mode_str_is_case_insensitive_and_trims_whitespace() {
let mixed = parse_mode_str(Some(" Incremental "), 5).unwrap();
assert!(matches!(
mixed,
CrawlerModePref::Explicit(DiscoverMode::Incremental {
stop_after_unchanged: 5
})
));
let upper = parse_mode_str(Some("BACKFILL"), 5).unwrap();
assert!(matches!(
upper,
CrawlerModePref::Explicit(DiscoverMode::Backfill)
));
}
#[test]
fn parse_mode_str_hard_errors_on_unknown_value() {
let err = parse_mode_str(Some("backfil"), 20).unwrap_err();
let msg = format!("{err}");
assert!(msg.contains("backfill"), "error should list valid values: {msg}");
assert!(msg.contains("auto"));
assert!(msg.contains("incremental"));
}
}

View File

@@ -23,14 +23,34 @@ pub struct MetadataStats {
pub mangas_failed: usize, pub mangas_failed: usize,
} }
/// Decide whether the per-ref loop should stop based on the Incremental
/// streak counter. Pulled out as a pure function so the rule is unit-
/// testable without standing up the walker or DB.
pub(crate) fn should_stop(mode: DiscoverMode, consecutive_unchanged: usize) -> bool {
match mode {
DiscoverMode::Backfill => false,
DiscoverMode::Incremental { stop_after_unchanged } => {
consecutive_unchanged >= stop_after_unchanged
}
}
}
/// Runs the discover → fetch → upsert → cover → chapter-list-diff pipeline /// Runs the discover → fetch → upsert → cover → chapter-list-diff pipeline
/// for the target source. Pure metadata; chapter content is enqueued as /// for the target source. Pure metadata; chapter content is enqueued as
/// separate `SyncChapterContent` jobs by the caller after this returns. /// separate `SyncChapterContent` jobs by the caller after this returns.
/// ///
/// `limit == 0` means no cap (full backfill). `skip_chapters == true` is /// `limit == 0` means no cap (full sweep up to the source's own bound).
/// the "metadata-only" mode (parser doesn't extract chapters, and /// `skip_chapters == true` is the "metadata-only" mode (parser doesn't
/// `sync_manga_chapters` is skipped — otherwise an empty chapter list /// extract chapters, and `sync_manga_chapters` is skipped — otherwise an
/// would soft-drop existing rows). /// empty chapter list would soft-drop existing rows).
///
/// `mode` controls the walk:
/// - `Backfill` — oldest-first, no early exit. The only mode that runs
/// the end-of-walk drop pass + writes `seed_completed_at`.
/// - `Incremental { stop_after_unchanged }` — newest-first, breaks out
/// after N consecutive Unchanged upserts. Drop pass is skipped (the
/// tail of the index is never visited, so its `last_seen_at` is
/// stale and using it to soft-drop would be unsafe).
#[allow(clippy::too_many_arguments)] #[allow(clippy::too_many_arguments)]
pub async fn run_metadata_pass( pub async fn run_metadata_pass(
browser_manager: &BrowserManager, browser_manager: &BrowserManager,
@@ -41,6 +61,7 @@ pub async fn run_metadata_pass(
start_url: &str, start_url: &str,
limit: usize, limit: usize,
skip_chapters: bool, skip_chapters: bool,
mode: DiscoverMode,
) -> anyhow::Result<MetadataStats> { ) -> anyhow::Result<MetadataStats> {
let lease = browser_manager let lease = browser_manager
.acquire() .acquire()
@@ -74,26 +95,39 @@ pub async fn run_metadata_pass(
let run_started_at = chrono::Utc::now(); let run_started_at = chrono::Utc::now();
let max_refs = (limit > 0).then_some(limit); let max_refs = (limit > 0).then_some(limit);
tracing::info!(?max_refs, "discovering manga list"); tracing::info!(?mode, ?max_refs, "starting metadata pass");
let refs = source let mut walker = source
.discover(&ctx, DiscoverMode::Backfill, max_refs) .discover(&ctx, mode)
.await .await
.context("discover failed")?; .context("discover failed")?;
tracing::info!(count = refs.len(), "discovered manga list");
let mut stats = MetadataStats { let mut stats = MetadataStats::default();
discovered: refs.len(), let mut consecutive_unchanged: usize = 0;
..MetadataStats::default() let mut walked_to_completion = false;
let mut hit_limit = false;
let mut hit_incremental_stop = false;
'outer: loop {
let batch = match walker.next_batch(&ctx).await? {
Some(b) => b,
None => {
walked_to_completion = true;
break;
}
}; };
for r in batch {
for (i, r) in refs.iter().enumerate() { if max_refs.map(|m| stats.discovered >= m).unwrap_or(false) {
hit_limit = true;
tracing::info!(cap = ?max_refs, "max_results reached; halting walk");
break 'outer;
}
stats.discovered += 1;
tracing::info!( tracing::info!(
idx = i + 1, idx = stats.discovered,
total = stats.discovered,
key = %r.source_manga_key, key = %r.source_manga_key,
"fetching metadata" "fetching metadata"
); );
let manga = match source.fetch_manga(&ctx, r).await { let manga = match source.fetch_manga(&ctx, &r).await {
Ok(m) => m, Ok(m) => m,
Err(e) => { Err(e) => {
tracing::warn!( tracing::warn!(
@@ -107,7 +141,9 @@ pub async fn run_metadata_pass(
} }
}; };
let upsert = match repo::crawler::upsert_manga_from_source(db, source_id, &r.url, &manga) let upsert = match repo::crawler::upsert_manga_from_source(
db, source_id, &r.url, &manga,
)
.await .await
{ {
Ok(u) => u, Ok(u) => u,
@@ -181,16 +217,67 @@ pub async fn run_metadata_pass(
), ),
} }
} }
// Incremental stop: count consecutive Unchanged upserts and
// bail once the threshold is reached. New/Updated resets the
// streak so a fresh entry mid-batch doesn't accidentally trip
// the cutoff.
match upsert.status {
repo::crawler::UpsertStatus::Unchanged => {
consecutive_unchanged += 1;
}
repo::crawler::UpsertStatus::New | repo::crawler::UpsertStatus::Updated => {
consecutive_unchanged = 0;
}
}
if should_stop(mode, consecutive_unchanged) {
hit_incremental_stop = true;
tracing::info!(
consecutive_unchanged,
"incremental stop threshold reached; halting walk"
);
break 'outer;
}
}
} }
if limit == 0 { // Drop pass: only when the walk truly covered everything the source
// surfaces. `last_seen_at` on un-visited rows is stale, so running
// the drop on a partial walk would soft-drop the tail of the index.
let full_walk = walked_to_completion && !hit_limit && !hit_incremental_stop;
let backfill_complete = full_walk && matches!(mode, DiscoverMode::Backfill);
if full_walk {
match repo::crawler::mark_dropped_mangas(db, source_id, run_started_at).await { match repo::crawler::mark_dropped_mangas(db, source_id, run_started_at).await {
Ok(n) => tracing::info!(dropped = n, "marked unseen manga as dropped"), Ok(n) => tracing::info!(dropped = n, "marked unseen manga as dropped"),
Err(e) => tracing::warn!(error = ?e, "drop-pass failed"), Err(e) => tracing::warn!(error = ?e, "drop-pass failed"),
} }
} else { } else {
tracing::info!(limit, "partial sync — skipping drop pass"); tracing::info!(
?mode,
hit_limit,
hit_incremental_stop,
"partial sync — skipping drop pass"
);
} }
if backfill_complete {
if let Err(e) = repo::crawler::mark_seed_completed(db, source_id, run_started_at).await {
tracing::warn!(error = ?e, "mark_seed_completed failed");
} else {
tracing::info!(source_id, "seed marked complete");
}
}
tracing::info!(
?mode,
discovered = stats.discovered,
upserted = stats.upserted,
covers_fetched = stats.covers_fetched,
mangas_failed = stats.mangas_failed,
walked_to_completion,
hit_limit,
hit_incremental_stop,
"metadata pass complete"
);
drop(lease); drop(lease);
Ok(stats) Ok(stats)
@@ -345,3 +432,36 @@ fn origin_of(url: &str) -> Option<String> {
let host = rest.split('/').next()?; let host = rest.split('/').next()?;
Some(format!("{scheme}://{host}")) Some(format!("{scheme}://{host}"))
} }
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn backfill_never_stops_regardless_of_streak() {
assert!(!should_stop(DiscoverMode::Backfill, 0));
assert!(!should_stop(DiscoverMode::Backfill, 100));
assert!(!should_stop(DiscoverMode::Backfill, usize::MAX));
}
#[test]
fn incremental_stops_when_streak_meets_threshold() {
let mode = DiscoverMode::Incremental {
stop_after_unchanged: 3,
};
assert!(!should_stop(mode, 0));
assert!(!should_stop(mode, 2));
assert!(should_stop(mode, 3), "stops at exactly the threshold");
assert!(should_stop(mode, 100), "stops at anything past threshold");
}
#[test]
fn incremental_with_zero_threshold_stops_immediately() {
// A nonsensical config (no Unchanged needed to stop) shouldn't
// panic — it just means the very first ref triggers the bail.
let mode = DiscoverMode::Incremental {
stop_after_unchanged: 0,
};
assert!(should_stop(mode, 0));
}
}

View File

@@ -82,21 +82,42 @@ pub struct FetchContext<'a> {
pub rate: &'a crate::crawler::rate_limit::HostRateLimiters, pub rate: &'a crate::crawler::rate_limit::HostRateLimiters,
} }
/// Lazy iterator over discovered manga refs. The caller drives the
/// walk one batch at a time, so it can break out as soon as a
/// downstream stop condition is met (e.g. N consecutive Unchanged
/// upserts in Incremental mode) without paying for pages it won't use.
///
/// Batches are typically one source-index page each. Within a batch
/// refs are already in the right per-page order for the active mode
/// (Backfill reverses each page to oldest-first; Incremental leaves
/// the source's natural newest-first ordering).
#[async_trait]
pub trait DiscoverWalk: Send {
/// Return the next batch of refs, or `Ok(None)` when the source has
/// no more pages. The walker is single-use; calling `next_batch`
/// after `None` is allowed and continues to return `None`.
async fn next_batch(
&mut self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Option<Vec<SourceMangaRef>>>;
}
#[async_trait] #[async_trait]
pub trait Source: Send + Sync { pub trait Source: Send + Sync {
/// Stable identifier — also the row key in the `sources` table. /// Stable identifier — also the row key in the `sources` table.
fn id(&self) -> &'static str; fn id(&self) -> &'static str;
/// Returns up to `max_results` manga refs in source order. Pass /// Begin discovery in `mode`. Returns a walker the caller drives
/// `None` for an uncapped walk (full backfill / incremental sweep). /// page-by-page via `next_batch`. The initial page-1 probe (used
/// Implementations should stop paginating as soon as the cap is /// to determine `last_page` and warm the cache for sites that
/// reached so partial runs don't pay for pages they won't use. /// can't be paged without knowing the bound) happens inside this
/// call, so a fresh walker is ready to yield its first batch
/// without further setup.
async fn discover( async fn discover(
&self, &self,
ctx: &FetchContext<'_>, ctx: &FetchContext<'_>,
mode: DiscoverMode, mode: DiscoverMode,
max_results: Option<usize>, ) -> anyhow::Result<Box<dyn DiscoverWalk + Send>>;
) -> anyhow::Result<Vec<SourceMangaRef>>;
async fn fetch_manga( async fn fetch_manga(
&self, &self,

View File

@@ -7,6 +7,7 @@
//! (`td:has(label:contains("Author:"))`) are implemented by walking //! (`td:has(label:contains("Author:"))`) are implemented by walking
//! the parsed tree. //! the parsed tree.
use std::collections::VecDeque;
use std::time::Duration; use std::time::Duration;
use anyhow::Context; use anyhow::Context;
@@ -14,13 +15,18 @@ use async_trait::async_trait;
use sha2::{Digest, Sha256}; use sha2::{Digest, Sha256};
use super::{ use super::{
DiscoverMode, FetchContext, Source, SourceChapter, SourceChapterRef, SourceManga, DiscoverMode, DiscoverWalk, FetchContext, Source, SourceChapter, SourceChapterRef,
SourceMangaRef, SourceManga, SourceMangaRef,
}; };
use crate::crawler::detect::{ use crate::crawler::detect::{
has_logo_sentinel, is_broken_page_body, retry_on_transient, PageError, has_logo_sentinel, is_broken_page_body, retry_on_transient, PageError,
}; };
/// `sources.id` value for this Source impl. Exposed as a const so the
/// daemon can look up per-source state (e.g. `seed_completed_at`)
/// before constructing the Source itself.
pub const SOURCE_ID: &str = "target";
/// In-loop retry budget for transient pages encountered during a single /// In-loop retry budget for transient pages encountered during a single
/// `discover` walk. Bounded small because the job system itself retries /// `discover` walk. Bounded small because the job system itself retries
/// the whole `Discover` job on failure — these inline retries only need /// the whole `Discover` job on failure — these inline retries only need
@@ -60,15 +66,14 @@ impl TargetSource {
#[async_trait] #[async_trait]
impl Source for TargetSource { impl Source for TargetSource {
fn id(&self) -> &'static str { fn id(&self) -> &'static str {
"target" SOURCE_ID
} }
async fn discover( async fn discover(
&self, &self,
ctx: &FetchContext<'_>, ctx: &FetchContext<'_>,
mode: DiscoverMode, mode: DiscoverMode,
max_results: Option<usize>, ) -> anyhow::Result<Box<dyn DiscoverWalk + Send>> {
) -> anyhow::Result<Vec<SourceMangaRef>> {
// Always visit page 1 first because that's the only way to // Always visit page 1 first because that's the only way to
// discover `last_page`. Retry it on transient — a broken first // discover `last_page`. Retry it on transient — a broken first
// page would otherwise abort the whole walk before we've even // page would otherwise abort the whole walk before we've even
@@ -85,15 +90,7 @@ impl Source for TargetSource {
}; };
let backfill = matches!(mode, DiscoverMode::Backfill); let backfill = matches!(mode, DiscoverMode::Backfill);
let order: Vec<i32> = match (last_page, backfill) { let order = build_page_order(last_page, backfill);
(None, _) => vec![1],
// Backfill = oldest-first: walk pages last → 1, then
// reverse within each page (the listing is update_date
// DESC, so the bottom of the last page is the oldest
// entry the source still surfaces).
(Some(last), true) => (1..=last).rev().collect(),
(Some(last), false) => (1..=last).collect(),
};
tracing::info!( tracing::info!(
?mode, ?mode,
last_page = ?last_page, last_page = ?last_page,
@@ -101,40 +98,12 @@ impl Source for TargetSource {
"walking pagination" "walking pagination"
); );
let mut all = Vec::new(); Ok(Box::new(TargetSourceWalker {
for page_num in order { base_url: self.base_url.clone(),
// Page 1 is already cached from the last_page probe — reuse backfill,
// it rather than navigating twice. Every other page goes pages_remaining: order,
// through the retry helper so a single broken page mid-walk first_page_html: Some(first_html),
// doesn't silently drop its mangas from the result. }))
let mut page_refs = if page_num == 1 {
let doc = scraper::Html::parse_document(&first_html);
parse_manga_list_from(&doc)?
} else {
retry_on_transient(
|| async {
let url = page_url(&self.base_url, page_num);
let html = navigate(ctx, &url).await?;
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
)
.await?
};
if backfill {
page_refs.reverse();
}
tracing::info!(page_num, count = page_refs.len(), "page walked");
all.extend(page_refs);
if cap_reached(&all, max_results) {
tracing::info!(cap = ?max_results, "max_results reached; halting pagination");
break;
}
}
Ok(truncate_to_cap(all, max_results))
} }
async fn fetch_manga( async fn fetch_manga(
@@ -168,15 +137,81 @@ impl Source for TargetSource {
} }
} }
fn cap_reached<T>(buf: &[T], max: Option<usize>) -> bool { /// Build the queue of page numbers `TargetSource::discover` will walk.
matches!(max, Some(m) if buf.len() >= m) /// Backfill is oldest-first: pages `last..=1` (within each page the
/// walker reverses entries, since the source orders by update_date
/// DESC). Incremental is newest-first: pages `1..=last` in natural
/// order. If `last_page` is unknown (source surfaces no pagination)
/// only page 1 is visited.
fn build_page_order(last_page: Option<i32>, backfill: bool) -> VecDeque<i32> {
match (last_page, backfill) {
(None, _) => VecDeque::from([1]),
(Some(last), true) => (1..=last).rev().collect(),
(Some(last), false) => (1..=last).collect(),
}
} }
fn truncate_to_cap<T>(mut buf: Vec<T>, max: Option<usize>) -> Vec<T> { /// Walker returned by [`TargetSource::discover`]. Pops one source-index
if let Some(m) = max { /// page per `next_batch` call. Page 1's HTML is cached at construction
buf.truncate(m); /// time (the discover call needed it to read `last_page` anyway) so the
/// batch covering page 1 doesn't re-fetch.
struct TargetSourceWalker {
base_url: String,
backfill: bool,
pages_remaining: VecDeque<i32>,
first_page_html: Option<String>,
}
#[async_trait]
impl DiscoverWalk for TargetSourceWalker {
async fn next_batch(
&mut self,
ctx: &FetchContext<'_>,
) -> anyhow::Result<Option<Vec<SourceMangaRef>>> {
let Some(page_num) = self.pages_remaining.pop_front() else {
return Ok(None);
};
let mut page_refs = if page_num == 1 {
// Reuse the cached page-1 HTML from the initial probe. Take
// it (rather than clone) so a malformed page-order queue
// that re-visits page 1 still falls back to a real fetch.
match self.first_page_html.take() {
Some(html) => {
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)?
}
None => {
retry_on_transient(
|| async {
let html = navigate(ctx, self.base_url.as_str()).await?;
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
)
.await?
}
}
} else {
retry_on_transient(
|| async {
let url = page_url(&self.base_url, page_num);
let html = navigate(ctx, &url).await?;
let doc = scraper::Html::parse_document(&html);
parse_manga_list_from(&doc)
},
PAGE_TRANSIENT_RETRY_ATTEMPTS,
PAGE_TRANSIENT_RETRY_DELAY,
)
.await?
};
if self.backfill {
page_refs.reverse();
}
tracing::info!(page_num, count = page_refs.len(), "page walked");
Ok(Some(page_refs))
} }
buf
} }
/// Single point of rate-limited navigation. Every Source request goes /// Single point of rate-limited navigation. Every Source request goes
@@ -922,4 +957,37 @@ mod tests {
let err = parse_manga_detail(html, "x", true).expect_err("expected Transient"); let err = parse_manga_detail(html, "x", true).expect_err("expected Transient");
assert!(err.is_transient(), "got non-transient: {err}"); assert!(err.is_transient(), "got non-transient: {err}");
} }
#[test]
fn build_page_order_backfill_is_last_to_one() {
// Backfill walks pages oldest-first: queue is [last, last-1, ..., 1]
// so popping from the front yields the last page first.
let order = build_page_order(Some(3), true);
assert_eq!(Vec::from(order), vec![3, 2, 1]);
}
#[test]
fn build_page_order_incremental_is_one_to_last() {
// Incremental walks newest-first in natural source order.
let order = build_page_order(Some(3), false);
assert_eq!(Vec::from(order), vec![1, 2, 3]);
}
#[test]
fn build_page_order_falls_back_to_page_one_only_without_pagination() {
let backfill = build_page_order(None, true);
assert_eq!(Vec::from(backfill), vec![1]);
let incremental = build_page_order(None, false);
assert_eq!(Vec::from(incremental), vec![1]);
}
#[test]
fn build_page_order_single_page_index_yields_one_entry() {
// Sources with exactly one page should not yield duplicates
// regardless of mode.
let backfill = build_page_order(Some(1), true);
assert_eq!(Vec::from(backfill), vec![1]);
let incremental = build_page_order(Some(1), false);
assert_eq!(Vec::from(incremental), vec![1]);
}
} }

View File

@@ -21,6 +21,11 @@ pub enum AppError {
PayloadTooLarge(String), PayloadTooLarge(String),
#[error("unsupported media type: {0}")] #[error("unsupported media type: {0}")]
UnsupportedMediaType(String), UnsupportedMediaType(String),
/// 429 with an optional `Retry-After` header value (in seconds).
#[error("too many requests")]
TooManyRequests {
retry_after_secs: Option<u64>,
},
/// Semantic per-field validation failure. `details` is rendered into the /// Semantic per-field validation failure. `details` is rendered into the
/// envelope so the client can highlight the bad field(s). /// envelope so the client can highlight the bad field(s).
#[error("validation failed")] #[error("validation failed")]
@@ -51,6 +56,7 @@ impl AppError {
AppError::Conflict(_) => "conflict", AppError::Conflict(_) => "conflict",
AppError::PayloadTooLarge(_) => "payload_too_large", AppError::PayloadTooLarge(_) => "payload_too_large",
AppError::UnsupportedMediaType(_) => "unsupported_media_type", AppError::UnsupportedMediaType(_) => "unsupported_media_type",
AppError::TooManyRequests { .. } => "too_many_requests",
AppError::ValidationFailed { .. } => "validation_failed", AppError::ValidationFailed { .. } => "validation_failed",
AppError::Database(sqlx::Error::RowNotFound) => "not_found", AppError::Database(sqlx::Error::RowNotFound) => "not_found",
AppError::Database(_) => "internal_error", AppError::Database(_) => "internal_error",
@@ -79,6 +85,31 @@ impl IntoResponse for AppError {
AppError::UnsupportedMediaType(msg) => { AppError::UnsupportedMediaType(msg) => {
(StatusCode::UNSUPPORTED_MEDIA_TYPE, msg.clone(), None) (StatusCode::UNSUPPORTED_MEDIA_TYPE, msg.clone(), None)
} }
AppError::TooManyRequests { retry_after_secs } => {
// Emit `Retry-After: N` (RFC 6585 §4) so a well-behaved
// client can back off correctly. Done by building the
// response by hand below — the `(status, headers,
// body)` tuple shape doesn't fit the standard
// `(status, body)` IntoResponse path for the other
// variants.
let body = json!({
"error": {
"code": code,
"message": "too many requests; slow down",
}
});
let mut resp = (StatusCode::TOO_MANY_REQUESTS, Json(body)).into_response();
if let Some(secs) = retry_after_secs {
// `HeaderValue: From<u64>` skips both the
// intermediate `String` allocation and the
// fallible-by-shape `from_str` path.
resp.headers_mut().insert(
axum::http::header::RETRY_AFTER,
axum::http::HeaderValue::from(*secs),
);
}
return resp;
}
AppError::ValidationFailed { message, details } => ( AppError::ValidationFailed { message, details } => (
StatusCode::UNPROCESSABLE_ENTITY, StatusCode::UNPROCESSABLE_ENTITY,
message.clone(), message.clone(),

View File

@@ -412,6 +412,53 @@ pub async fn sync_manga_chapters(
Ok(diff) Ok(diff)
} }
/// Record that a complete Backfill walk has finished for `source_id`.
/// The presence of this row is what the daemon's mode auto-detection
/// uses to flip from Backfill to Incremental on subsequent ticks.
///
/// Keyed `seed_completed:<source_id>` in `crawler_state`. JSON payload
/// stores the timestamp so we can surface "last fully reseeded at" in
/// future ops tooling without another migration.
pub async fn mark_seed_completed(
pool: &PgPool,
source_id: &str,
at: DateTime<Utc>,
) -> sqlx::Result<()> {
let key = format!("seed_completed:{source_id}");
sqlx::query(
"INSERT INTO crawler_state (key, value, updated_at) \
VALUES ($1, $2, now()) \
ON CONFLICT (key) DO UPDATE \
SET value = EXCLUDED.value, updated_at = now()",
)
.bind(&key)
.bind(serde_json::json!({ "at": at.to_rfc3339() }))
.execute(pool)
.await?;
Ok(())
}
/// Read the timestamp written by [`mark_seed_completed`], if any.
/// `None` means no complete Backfill has ever finished for this
/// source — the daemon should run Backfill on the next tick.
pub async fn seed_completed_at(
pool: &PgPool,
source_id: &str,
) -> sqlx::Result<Option<DateTime<Utc>>> {
let key = format!("seed_completed:{source_id}");
let row: Option<serde_json::Value> =
sqlx::query_scalar("SELECT value FROM crawler_state WHERE key = $1")
.bind(&key)
.fetch_optional(pool)
.await?;
Ok(row.and_then(|v| {
v.get("at")
.and_then(|s| s.as_str())
.and_then(|s| DateTime::parse_from_rfc3339(s).ok())
.map(|dt| dt.with_timezone(&Utc))
}))
}
pub async fn mark_dropped_mangas( pub async fn mark_dropped_mangas(
pool: &PgPool, pool: &PgPool,
source_id: &str, source_id: &str,

View File

@@ -567,6 +567,81 @@ async fn user_a_cannot_delete_user_b_token(pool: PgPool) {
assert_eq!(resp.status(), StatusCode::NO_CONTENT); assert_eq!(resp.status(), StatusCode::NO_CONTENT);
} }
/// Brute-force / spray protection: at default production limits, a
/// tight loop of /auth/login attempts should burst through the bucket
/// and then 429 every subsequent request until the bucket refills.
#[sqlx::test(migrations = "./migrations")]
async fn login_rate_limited_under_burst_pressure(pool: PgPool) {
let h = common::harness_with_auth_rate_limit(pool, 1, 3);
// Register a victim so the wrong-password branch is real work.
let _ = h
.app
.clone()
.oneshot(common::post_json("/api/v1/auth/register", creds("victim")))
.await
.unwrap();
// Register consumed one token from the burst-3 bucket. Fire 30
// wrong-password logins back-to-back; with per_sec=1 the refill
// is too slow to keep up and at least one must come back 429.
let mut saw_429 = false;
for _ in 0..30 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": "victim", "password": "wrong" }),
))
.await
.unwrap();
if resp.status() == StatusCode::TOO_MANY_REQUESTS {
// RFC 6585 §4: 429 SHOULD include a Retry-After header. The
// value is in seconds; with per_sec=1 the bucket needs ~1s
// to refill, so the header should be 1 or 2.
let retry_after = resp
.headers()
.get(axum::http::header::RETRY_AFTER)
.and_then(|v| v.to_str().ok())
.and_then(|s| s.parse::<u32>().ok())
.expect("Retry-After header present and numeric");
assert!(
retry_after >= 1,
"Retry-After must be at least 1s, got {retry_after}"
);
let body = common::body_json(resp).await;
assert_eq!(body["error"]["code"], "too_many_requests");
saw_429 = true;
break;
}
}
assert!(
saw_429,
"expected at least one 429 within 30 rapid login attempts"
);
}
/// Default (test-harness) limits are disabled, so existing tests that
/// fire multiple auth requests don't start failing.
#[sqlx::test(migrations = "./migrations")]
async fn default_test_harness_does_not_rate_limit(pool: PgPool) {
let h = common::harness(pool);
for i in 0..50 {
let resp = h
.app
.clone()
.oneshot(common::post_json(
"/api/v1/auth/login",
json!({ "username": format!("nobody-{i}"), "password": "x" }),
))
.await
.unwrap();
// None of these should be 429 — only 401.
assert_eq!(resp.status(), StatusCode::UNAUTHORIZED, "iter {i}");
}
}
#[sqlx::test(migrations = "./migrations")] #[sqlx::test(migrations = "./migrations")]
async fn delete_unknown_token_is_404(pool: PgPool) { async fn delete_unknown_token_is_404(pool: PgPool) {
let h = common::harness(pool); let h = common::harness(pool);

View File

@@ -15,6 +15,7 @@ use tempfile::TempDir;
use tower::ServiceExt; use tower::ServiceExt;
use mangalord::app::{router, AppState}; use mangalord::app::{router, AppState};
use mangalord::auth::rate_limit::AuthRateLimiter;
use mangalord::config::{AuthConfig, UploadConfig}; use mangalord::config::{AuthConfig, UploadConfig};
use mangalord::storage::{LocalStorage, Storage, StorageError, StreamingFile}; use mangalord::storage::{LocalStorage, Storage, StorageError, StreamingFile};
@@ -49,20 +50,51 @@ fn harness_inner(
storage: Arc<dyn Storage>, storage: Arc<dyn Storage>,
storage_dir: TempDir, storage_dir: TempDir,
) -> Harness { ) -> Harness {
harness_with_auth_config(pool, storage, storage_dir, AuthConfig {
cookie_secure: false,
..AuthConfig::default()
})
}
fn harness_with_auth_config(
pool: PgPool,
storage: Arc<dyn Storage>,
storage_dir: TempDir,
auth: AuthConfig,
) -> Harness {
let auth_limiter = Arc::new(AuthRateLimiter::new(auth.rate_limit));
let state = AppState { let state = AppState {
db: pool, db: pool,
storage, storage,
auth: AuthConfig { cookie_secure: false, ..AuthConfig::default() }, auth,
upload: UploadConfig { upload: UploadConfig {
// Keep file caps small in tests so the size-cap path is cheap to // Keep file caps small in tests so the size-cap path is cheap to
// exercise without producing tens of MBs of bytes. // exercise without producing tens of MBs of bytes.
max_request_bytes: 4 * 1024 * 1024, max_request_bytes: 4 * 1024 * 1024,
max_file_bytes: 256 * 1024, max_file_bytes: 256 * 1024,
}, },
auth_limiter,
}; };
Harness { app: router(state), _storage_dir: storage_dir } Harness { app: router(state), _storage_dir: storage_dir }
} }
/// Like [`harness`] but configures a tight auth rate limit. Used by
/// the brute-force-rate-limiting test.
pub fn harness_with_auth_rate_limit(
pool: PgPool,
per_sec: u32,
burst: u32,
) -> Harness {
let storage_dir = tempfile::tempdir().expect("tempdir");
let storage = Arc::new(LocalStorage::new(storage_dir.path()));
let auth = AuthConfig {
cookie_secure: false,
rate_limit: mangalord::auth::rate_limit::RateLimitConfig { per_sec, burst },
..AuthConfig::default()
};
harness_with_auth_config(pool, storage, storage_dir, auth)
}
/// Wraps a real `Storage` and fails on the N-th `put` call so tests can /// Wraps a real `Storage` and fails on the N-th `put` call so tests can
/// assert that handlers roll their DB writes back when storage errors /// assert that handlers roll their DB writes back when storage errors
/// mid-upload. Reads and other operations delegate to `inner`. /// mid-upload. Reads and other operations delegate to `inner`.

View File

@@ -0,0 +1,85 @@
//! Integration tests for the incremental-mode coordination state:
//! `mark_seed_completed` / `seed_completed_at` round-trip via the
//! `crawler_state` table.
//!
//! End-to-end pipeline behavior (walker + stop-on-Unchanged) requires
//! a real `chromiumoxide::Browser` to construct a `FetchContext`, so
//! the live integration of that path is covered by
//! `crawler_browser_smoke.rs` instead. The pure stop logic itself is
//! unit-tested in `crawler::pipeline::tests`.
use chrono::Utc;
use mangalord::repo::crawler;
use sqlx::PgPool;
#[sqlx::test(migrations = "./migrations")]
async fn seed_completed_at_none_before_any_run(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let res = crawler::seed_completed_at(&pool, "target").await.unwrap();
assert!(res.is_none(), "fresh source has no seed marker");
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_seed_completed_then_read_round_trips_timestamp(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let at = Utc::now();
crawler::mark_seed_completed(&pool, "target", at)
.await
.unwrap();
let read = crawler::seed_completed_at(&pool, "target")
.await
.unwrap()
.expect("marker present after mark");
// RFC3339 round-trip is millisecond-precise on chrono::Utc; allow a
// 1ms tolerance to absorb postgres jsonb whitespace canonicalization.
let drift = (read - at).num_milliseconds().abs();
assert!(drift <= 1, "round-trip drift: {drift}ms (at={at}, read={read})");
}
#[sqlx::test(migrations = "./migrations")]
async fn mark_seed_completed_overwrites_previous_value(pool: PgPool) {
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
let first = Utc::now() - chrono::Duration::hours(1);
let second = Utc::now();
crawler::mark_seed_completed(&pool, "target", first)
.await
.unwrap();
crawler::mark_seed_completed(&pool, "target", second)
.await
.unwrap();
let read = crawler::seed_completed_at(&pool, "target")
.await
.unwrap()
.expect("marker present");
let drift = (read - second).num_milliseconds().abs();
assert!(drift <= 1, "should reflect the latest mark, not the first");
}
#[sqlx::test(migrations = "./migrations")]
async fn seed_completed_is_per_source(pool: PgPool) {
// Two sources, only one is marked complete. The other must still
// report None — the key is namespaced by source_id.
crawler::ensure_source(&pool, "target", "T", "https://x.example")
.await
.unwrap();
crawler::ensure_source(&pool, "other", "O", "https://y.example")
.await
.unwrap();
crawler::mark_seed_completed(&pool, "target", Utc::now())
.await
.unwrap();
assert!(crawler::seed_completed_at(&pool, "target")
.await
.unwrap()
.is_some());
assert!(crawler::seed_completed_at(&pool, "other")
.await
.unwrap()
.is_none());
}

22
docker-compose.prod.yml Normal file
View File

@@ -0,0 +1,22 @@
# Production overlay: layer on top of docker-compose.yml on the deploy
# host so the backend and frontend run from pre-built registry images
# instead of building locally.
#
# docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d
#
# REGISTRY_URL and IMAGE_TAG are injected by .gitea/workflows/deploy.yml
# at deploy time. IMAGE_TAG defaults to `latest` so a manual
# `docker compose ... up -d` on the host still works.
services:
backend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-backend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped
frontend:
build: !reset null
image: ${REGISTRY_URL}/mangalord-frontend:${IMAGE_TAG:-latest}
pull_policy: always
restart: unless-stopped

View File

@@ -1,6 +1,6 @@
{ {
"name": "mangalord-frontend", "name": "mangalord-frontend",
"version": "0.32.0", "version": "0.35.0",
"private": true, "private": true,
"type": "module", "type": "module",
"scripts": { "scripts": {