After the metadata pass, the crawler now fetches per-chapter image content for chapters belonging to bookmarked mangas. Logged-in chapter pages render every page image at once (no per-page navigation), so the crawler reuses the operator's browser session via a pasted PHPSESSID cookie.

Each chapter sync is a single transaction: storage puts + page row inserts + page_count update commit together, or roll back together on any image error so the chapter stays at page_count=0 and is retried next run.

New crawler modules:

- `rate_limit::HostRateLimiters`: per-host buckets keyed by URL host, with optional per-host overrides. Replaces the single shared `Mutex<RateLimiter>`. Catalog and CDN no longer share a budget; default 1 req/s per host.
- `session`: derives `.<registrable>.<tld>` from the start URL (override via `CRAWLER_COOKIE_DOMAIN` for multi-part TLDs), injects PHPSESSID into the Chromium cookie store, and probes `#avatar_menu` at startup to fail fast on a bad/expired cookie.
- `content`: parses `a#pic_container img:not(.loading)` with `pageN` id-based sorting (DOM order isn't trusted), then performs the atomic chapter sync.

bin/crawler additions:

- Concurrent chapter content phase via `futures_util::StreamExt::for_each_concurrent` (`CRAWLER_CHAPTER_WORKERS`, default 1). Browser is borrowed across workers (chromiumoxide allows concurrent `new_page` on `&self`), and the per-host rate limit gates total RPS regardless of worker count.
- reqwest gets the `cookies` feature, a `Jar` seeded with PHPSESSID for the catalog domain only (CDN intentionally not given the cookie), and `Referer` is set on cover + chapter image fetches.
- New env knobs: `CRAWLER_PHPSESSID`, `CRAWLER_COOKIE_DOMAIN`, `CRAWLER_USER_AGENT`, `CRAWLER_CHAPTER_WORKERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`, `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_CDN_HOST` + `CRAWLER_CDN_RATE_MS`.
- Mid-run session-expired detection: `#avatar_menu` is re-checked on every chapter page nav; the first failure aborts the phase with a cookie-refresh message.

Bookmark-driven enqueueing is sync-on-crawl-tick only: the bookmarked chapters with `page_count = 0` are queried at the start of the chapter-content phase. Sync-on-bookmark via an API hook is deferred to a follow-up branch; that needs a daemon consumer of crawler_jobs, which doesn't exist yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
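The per-host budgeting described for `rate_limit::HostRateLimiters` could look roughly like the sketch below. This is not the module from this commit: the field names, the `with_override` builder, the `reserve` method, and the idea of passing `now` in explicitly are all assumptions made for illustration, and it uses only the standard library. Each host keeps its own "next allowed" timestamp, so a slow CDN budget never delays catalog requests and vice versa.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-host limiter (illustrative only, not the commit's code).
/// Every host has its own minimum interval between requests.
pub struct HostRateLimiters {
    default_interval: Duration,
    overrides: HashMap<String, Duration>,
    next_allowed: HashMap<String, Instant>,
}

impl HostRateLimiters {
    /// `default_interval` applies to any host without an override,
    /// e.g. 1 second for the "1 req/s per host" default.
    pub fn new(default_interval: Duration) -> Self {
        Self {
            default_interval,
            overrides: HashMap::new(),
            next_allowed: HashMap::new(),
        }
    }

    /// Registers a per-host interval (assumed equivalent of a CDN override).
    pub fn with_override(mut self, host: &str, interval: Duration) -> Self {
        self.overrides.insert(host.to_ascii_lowercase(), interval);
        self
    }

    /// Reserves the next request slot for `host` and returns how long the
    /// caller should wait before sending. Slots are handed out in order, so
    /// concurrent workers sharing one limiter still respect the host budget.
    pub fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        let host = host.to_ascii_lowercase();
        let interval = self
            .overrides
            .get(&host)
            .copied()
            .unwrap_or(self.default_interval);
        // The slot is either "right now" or the previously promised time.
        let slot = match self.next_allowed.get(&host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed.insert(host, slot + interval);
        slot.duration_since(now)
    }
}
```

In an async crawler the caller would typically hold this behind a mutex, call `reserve` with `Instant::now()`, release the lock, and then sleep for the returned duration before issuing the request, which keeps total requests per host bounded regardless of worker count.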
57 lines
1.6 KiB
TOML
[package]
name = "mangalord"
version = "0.25.0"
edition = "2021"
default-run = "mangalord"

[lib]
path = "src/lib.rs"

[[bin]]
name = "mangalord"
path = "src/main.rs"

[[bin]]
name = "crawler"
path = "src/bin/crawler.rs"

[dependencies]
axum = { version = "0.7", features = ["macros", "multipart"] }
tokio = { version = "1", features = ["full"] }
sqlx = { version = "0.8", features = ["runtime-tokio", "postgres", "uuid", "chrono", "macros", "migrate"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
uuid = { version = "1", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
tower = { version = "0.5", features = ["util"] }
tower-http = { version = "0.6", features = ["trace", "cors"] }
thiserror = "1"
anyhow = "1"
async-trait = "0.1"
dotenvy = "0.15"
argon2 = "0.5"
rand = "0.8"
sha2 = "0.10"
subtle = "2"
base64 = "0.22"
axum-extra = { version = "0.9", features = ["cookie", "typed-header"] }
time = "0.3"
infer = "0.16"
tokio-util = { version = "0.7", features = ["io"] }
futures-core = "0.3"
futures-util = "0.3"
bytes = "1"
chromiumoxide = { version = "0.7", features = ["tokio-runtime", "_fetcher-rusttls-tokio"], default-features = false }
scraper = "0.20"
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies"] }

[dev-dependencies]
tempfile = "3"
tower = { version = "0.5", features = ["util"] }
http-body-util = "0.1"
mime = "0.3"
futures-util = "0.3"
tokio = { version = "1", features = ["test-util"] }