bugfix: SSRF allowlist, image size cap, robust session detect (0.34.1)

Four crawler defences in one PR (all four threats the review flagged
in §3 of REVIEW.md):

- New crawler::safety module with is_safe_url + accumulate_capped +
  fetch_bytes_capped. Rejects non-http(s) schemes, RFC1918 / loopback
  / link-local / CGNAT / ULA / IPv6-link-local hosts, and any host
  not on the operator's allowlist (defaults to CRAWLER_START_URL host
  + CRAWLER_CDN_HOST + CRAWLER_DOWNLOAD_ALLOWLIST extras).
- Streaming size cap (CRAWLER_MAX_IMAGE_BYTES, default 32 MiB) so a
  10 GiB "image" can't exhaust memory before it ever reaches disk.
- looks_like_image() reject path: non-image bytes fail the chapter or
  cover instead of being stored as .bin and served as
  application/octet-stream.
- session::classify_chapter_probe: three-way classifier replaces the
  binary #avatar_menu check at content.rs:115. A transient hiccup
  (broken-page body, or logged-in-but-no-reader) now retries with
  backoff instead of falsely freezing every worker on
  session_expired.
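
For illustration only, the host checks in the first bullet can be sketched with the standard library alone. This is not the code in crawler::safety: the names is_blocked_ip and is_safe_target are hypothetical, the scheme and host are assumed to be already parsed out of the URL (IPv6 hosts without brackets), and the allowlist is assumed to be a plain list of host names. The sketch only inspects IP-literal hosts; it does not resolve names.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: true when `ip` is in a range the crawler must not fetch from.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // Unwrap IPv4-mapped addresses (::ffff:a.b.c.d) so they get the IPv4 checks.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    ip.is_private()                              // RFC1918: 10/8, 172.16/12, 192.168/16
        || ip.is_loopback()                      // 127.0.0.0/8
        || ip.is_link_local()                    // 169.254.0.0/16
        || ip.is_unspecified()                   // 0.0.0.0
        || (o[0] == 100 && (o[1] & 0xC0) == 64)  // CGNAT: 100.64.0.0/10
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()                             // ::1
        || ip.is_unspecified()                   // ::
        || (first & 0xFE00) == 0xFC00            // ULA: fc00::/7
        || (first & 0xFFC0) == 0xFE80            // link-local: fe80::/10
}

/// Hypothetical combined check: scheme, then IP-literal ranges, then allowlist.
fn is_safe_target(scheme: &str, host: &str, allowlist: &[&str]) -> bool {
    if !matches!(scheme, "http" | "https") {
        return false;
    }
    // An IP-literal host in a blocked range is rejected even if allowlisted.
    if let Ok(ip) = host.parse::<IpAddr>() {
        if is_blocked_ip(ip) {
            return false;
        }
    }
    allowlist.iter().any(|allowed| allowed.eq_ignore_ascii_case(host))
}
```

With these definitions, is_safe_target("http", "127.0.0.1", &["127.0.0.1"]) is false: the range check runs before the allowlist, so a misconfigured allowlist cannot re-enable loopback.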

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-28 08:07:59 +02:00
parent e7662d18d6
commit c33f30972e
12 changed files with 807 additions and 43 deletions

@@ -1,6 +1,6 @@
[package]
name = "mangalord"
-version = "0.34.0"
+version = "0.34.1"
edition = "2021"
default-run = "mangalord"
@@ -46,7 +46,7 @@ futures-util = "0.3"
bytes = "1"
chromiumoxide = { version = "0.7", features = ["tokio-runtime", "_fetcher-rusttls-tokio"], default-features = false }
scraper = "0.20"
-reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies"] }
+reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "socks", "cookies", "stream"] }
[dev-dependencies]
tempfile = "3"
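
The reqwest `stream` feature enables `Response::bytes_stream`, which lets a response body be read chunk by chunk. A minimal sketch of the capped accumulation idea behind CRAWLER_MAX_IMAGE_BYTES follows. It is an assumption-laden illustration, not the actual accumulate_capped from crawler::safety: it takes any in-memory iterator of chunks instead of a network stream, and the TooLarge error type is hypothetical.

```rust
/// Hypothetical error: the body would have exceeded `limit` bytes.
#[derive(Debug, PartialEq)]
struct TooLarge {
    limit: usize,
}

/// Sketch only: collect chunks into one buffer, refusing to grow past `limit`.
fn accumulate_capped<I, C>(chunks: I, limit: usize) -> Result<Vec<u8>, TooLarge>
where
    I: IntoIterator<Item = C>,
    C: AsRef<[u8]>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        let chunk = chunk.as_ref();
        // Check before copying, so the buffer never holds more than `limit` bytes.
        // `buf.len() <= limit` always holds here, so the subtraction cannot underflow.
        if chunk.len() > limit - buf.len() {
            return Err(TooLarge { limit });
        }
        buf.extend_from_slice(chunk);
    }
    Ok(buf)
}
```

Because the check happens per chunk, an oversized body is rejected as soon as the running total would cross the limit, without waiting for the rest of the download.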