bugfix: SSRF allowlist, image size cap, robust session detect (0.34.1)

Four crawler defences in one PR (all four threats the review flagged
in §3 of REVIEW.md):

- New crawler::safety module with is_safe_url + accumulate_capped +
  fetch_bytes_capped. Rejects non-http(s) schemes, RFC1918 / loopback
  / link-local / CGNAT / ULA / IPv6-link-local hosts, and any host
  not on the operator's allowlist (defaults to CRAWLER_START_URL host
  + CRAWLER_CDN_HOST + CRAWLER_DOWNLOAD_ALLOWLIST extras).
- Streaming size cap (CRAWLER_MAX_IMAGE_BYTES, default 32 MiB) so a
  10 GiB "image" can't fill memory before disk.
- looks_like_image() reject path: non-image bytes fail the chapter or
  cover instead of being stored as .bin and served as
  application/octet-stream.
- session::classify_chapter_probe: three-way classifier replaces the
  binary #avatar_menu check at content.rs:115. A transient hiccup
  (broken-page body, or logged-in-but-no-reader) now retries with
  backoff instead of falsely freezing every worker on
  session_expired.
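
Illustrative sketch of the first three checks, std-only. This is not
the code in crawler::safety: the helper names is_blocked_ip,
read_capped and sniff_image are hypothetical, the accepted image
formats (PNG, JPEG, GIF, WebP) are an assumption, and the allowlist
lookup and DNS resolution are left out.

```rust
use std::io::Read;
use std::net::IpAddr;

/// Hypothetical helper: true when `ip` is in one of the ranges the
/// commit message lists as rejected.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            let o = v4.octets();
            v4.is_loopback()          // 127.0.0.0/8
                || v4.is_private()    // RFC 1918: 10/8, 172.16/12, 192.168/16
                || v4.is_link_local() // 169.254.0.0/16
                || (o[0] == 100 && (o[1] & 0xC0) == 64) // CGNAT 100.64.0.0/10
        }
        IpAddr::V6(v6) => {
            // Treat ::ffff:a.b.c.d like the IPv4 address it wraps.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked_ip(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()                  // ::1
                || (first & 0xfe00) == 0xfc00 // ULA fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local fe80::/10
        }
    }
}

/// Hypothetical helper: read `src` in chunks and give up as soon as the
/// total would exceed `max_bytes`. Returns Ok(None) when the cap is hit,
/// so at most `max_bytes` are ever buffered.
fn read_capped<R: Read>(mut src: R, max_bytes: usize) -> std::io::Result<Option<Vec<u8>>> {
    let mut buf = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = src.read(&mut chunk)?;
        if n == 0 {
            return Ok(Some(buf)); // end of stream, still under the cap
        }
        if buf.len() + n > max_bytes {
            return Ok(None); // over the cap: stop reading
        }
        buf.extend_from_slice(&chunk[..n]);
    }
}

/// Hypothetical helper: magic-byte check for a few common image formats.
fn sniff_image(bytes: &[u8]) -> bool {
    bytes.starts_with(b"\x89PNG\r\n\x1a\n")
        || bytes.starts_with(b"\xFF\xD8\xFF") // JPEG
        || bytes.starts_with(b"GIF87a")
        || bytes.starts_with(b"GIF89a")
        || (bytes.len() >= 12
            && bytes.starts_with(b"RIFF")
            && bytes[8..].starts_with(b"WEBP"))
}
```

An IP check like this only helps if it is applied to the address the
connection is actually made to, not just to the hostname as written in
the scraped URL.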

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-28 08:07:59 +02:00
parent e7662d18d6
commit c33f30972e
12 changed files with 807 additions and 43 deletions


@@ -44,6 +44,14 @@ MAX_REQUEST_BYTES=209715200
# Default 20 MiB.
MAX_FILE_BYTES=20971520
# ----- Crawler download safety -----
# Hosts the crawler is allowed to fetch images/covers from, in addition
# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
CRAWLER_DOWNLOAD_ALLOWLIST=
# Hard cap on a single image body. Default 32 MiB.
CRAWLER_MAX_IMAGE_BYTES=33554432
# ----- Frontend -----
# The frontend container runs SvelteKit's Node adapter on :3000 and
# proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the