bugfix: SSRF allowlist, image size cap, robust session detect (0.34.1)

Four crawler defences in one PR (all four threats the review flagged in §3 of REVIEW.md):

- New crawler::safety module with is_safe_url + accumulate_capped + fetch_bytes_capped. Rejects non-http(s) schemes, RFC1918 / loopback / link-local / CGNAT / ULA / IPv6-link-local hosts, and any host not on the operator's allowlist (defaults to CRAWLER_START_URL host + CRAWLER_CDN_HOST + CRAWLER_DOWNLOAD_ALLOWLIST extras).
- Streaming size cap (CRAWLER_MAX_IMAGE_BYTES, default 32 MiB) so a 10 GiB "image" can't fill memory before disk.
- looks_like_image() reject path: non-image bytes fail the chapter or cover instead of being stored as .bin and served as application/octet-stream.
- session::classify_chapter_probe: three-way classifier replaces the binary #avatar_menu check at content.rs:115. A transient hiccup (broken-page body, or logged-in-but-no-reader) now retries with backoff instead of falsely freezing every worker on session_expired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
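
For illustration only, the address-range rejection and the streaming size cap described above might look roughly like the sketch below. It is not the code from this commit: `is_blocked_ip` is a hypothetical helper name, and the signature given to `accumulate_capped` is an assumption (only the function name appears in the message). The sketch uses the standard library alone and does not cover scheme checks, the host allowlist, DNS resolution, or IPv4-mapped IPv6 addresses.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: true when `ip` lies in one of the ranges the
/// commit message lists as rejected.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    ip.is_private()                             // RFC1918: 10/8, 172.16/12, 192.168/16
        || ip.is_loopback()                     // 127.0.0.0/8
        || ip.is_link_local()                   // 169.254.0.0/16
        || (o[0] == 100 && (o[1] & 0xC0) == 64) // CGNAT: 100.64.0.0/10
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let s = ip.segments();
    ip.is_loopback()                     // ::1
        || (s[0] & 0xfe00) == 0xfc00     // ULA: fc00::/7
        || (s[0] & 0xffc0) == 0xfe80     // link-local: fe80::/10
}

/// Assumed signature: append `chunk` to `buf`, failing as soon as the
/// total would exceed `cap`, so an oversized body is rejected while it
/// is still streaming rather than after it has been buffered.
fn accumulate_capped(buf: &mut Vec<u8>, chunk: &[u8], cap: usize) -> Result<(), String> {
    if buf.len().saturating_add(chunk.len()) > cap {
        return Err(format!("body exceeds cap of {cap} bytes"));
    }
    buf.extend_from_slice(chunk);
    Ok(())
}
```

In this sketch a caller would resolve the URL's host, reject it if any resolved address satisfies `is_blocked_ip`, and then feed each response chunk through `accumulate_capped` with the value of CRAWLER_MAX_IMAGE_BYTES as `cap`.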
2026-05-28 08:07:59 +02:00
parent e7662d18d6
commit c33f30972e
12 changed files with 807 additions and 43 deletions
--- a/.env.example
+++ b/.env.example
@@ -44,6 +44,14 @@ MAX_REQUEST_BYTES=209715200
 # Default 20 MiB.
 MAX_FILE_BYTES=20971520

+# ----- Crawler download safety -----
+# Hosts the crawler is allowed to fetch images/covers from, in addition
+# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
+# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
+CRAWLER_DOWNLOAD_ALLOWLIST=
+# Hard cap on a single image body. Default 32 MiB.
+CRAWLER_MAX_IMAGE_BYTES=33554432
+
 # ----- Frontend -----
 # The frontend container runs SvelteKit's Node adapter on :3000 and
 # proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the