feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0)

Operators whose sources shard images across numbered CDN subdomains can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST.

The new flag short-circuits the host check in DownloadAllowlist::contains while leaving the scheme, localhost, and private-IP defenses in is_safe_url untouched: scraped URLs pointing at 10.x / 169.254.169.254 / file:// stay refused.

Default is false; the fail-closed posture is preserved unless the operator opts in. Wired into both the server (config::build_download_allowlist) and the bin/crawler.rs one-shot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 14:52:49 +02:00
parent 1eebb90e25
commit a2826d6467
7 changed files with 111 additions and 20 deletions
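The layering described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: only the names `DownloadAllowlist::contains` and `is_safe_url` come from the message, while the struct fields, the `split_url` and `should_download` helpers, and the hand-rolled URL parsing are assumptions made to keep the example self-contained with the standard library only.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;

/// Hypothetical stand-in for the crate's `DownloadAllowlist`; field names are assumed.
struct DownloadAllowlist {
    hosts: HashSet<String>,
    allow_any_host: bool, // assumed to mirror CRAWLER_ALLOW_ANY_HOST
}

impl DownloadAllowlist {
    /// With the flag set, the host lookup is skipped entirely.
    fn contains(&self, host: &str) -> bool {
        self.allow_any_host || self.hosts.contains(&host.to_ascii_lowercase())
    }
}

/// Simplified splitter for "scheme://host[:port]/path" (illustrative helper).
/// Fails closed on userinfo ('@') and bracketed IPv6 literals, which it does not parse.
fn split_url(url: &str) -> Option<(&str, &str)> {
    let (scheme, rest) = url.split_once("://")?;
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if authority.contains('@') || authority.contains('[') {
        return None;
    }
    let host = authority.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some((scheme, host))
    }
}

/// Scheme, localhost, and private-IP checks; these never consult the flag.
fn is_safe_url(url: &str) -> bool {
    let Some((scheme, host)) = split_url(url) else {
        return false;
    };
    if !scheme.eq_ignore_ascii_case("http") && !scheme.eq_ignore_ascii_case("https") {
        return false;
    }
    if host.eq_ignore_ascii_case("localhost") {
        return false;
    }
    if let Ok(ip) = host.parse::<Ipv4Addr>() {
        if ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified() {
            return false;
        }
    }
    true
}

/// Hypothetical call site: the safety check runs first, then the (bypassable) host check.
fn should_download(url: &str, allowlist: &DownloadAllowlist) -> bool {
    if !is_safe_url(url) {
        return false;
    }
    match split_url(url) {
        Some((_, host)) => allowlist.contains(host),
        None => false,
    }
}
```

In this sketch, an allowlist built with `allow_any_host: true` accepts an unlisted host such as `cdn7.example.com`, yet still rejects `http://10.0.0.1/`, `http://169.254.169.254/`, and `file://` URLs, because `is_safe_url` runs before the host check and never reads the flag.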
--- a/.env.example
+++ b/.env.example
@@ -66,6 +66,12 @@ MAX_FILE_BYTES=20971520
 # to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
 # Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
 CRAWLER_DOWNLOAD_ALLOWLIST=
+# Bypass the host allowlist entirely. Intended for sources that shard
+# images across numbered CDN subdomains (cdn1/cdn2/…) where enumerating
+# every host upfront is impractical. The private-IP / localhost / non-
+# http(s) scheme defenses STAY ON — a scraped <img src="http://10.0.0.1/">
+# is still refused with this flag set.
+CRAWLER_ALLOW_ANY_HOST=false
 # Hard cap on a single image body. Default 32 MiB.
 CRAWLER_MAX_IMAGE_BYTES=33554432