feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0)
Some checks failed
deploy / test-backend (push) Failing after 11s
deploy / test-frontend (push) Failing after 36s
deploy / build-and-push (push) Has been skipped
deploy / deploy (push) Has been skipped

Operators whose sources shard images across numbered CDN subdomains
can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST. The new
flag short-circuits the host check in DownloadAllowlist::contains
while leaving scheme, localhost, and private-IP defenses in
is_safe_url untouched — scraped URLs pointing at 10.x /
169.254.169.254 / file:// stay refused. Default is false; fail-closed
posture is preserved unless the operator opts in. Wired into both the
server (config::build_download_allowlist) and the bin/crawler.rs
one-shot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-05-31 14:52:49 +02:00
parent 1eebb90e25
commit a2826d6467
7 changed files with 111 additions and 20 deletions

View File

@@ -66,6 +66,12 @@ MAX_FILE_BYTES=20971520
# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
CRAWLER_DOWNLOAD_ALLOWLIST=
# Bypass the host allowlist entirely. Intended for sources that shard
# images across numbered CDN subdomains (cdn1/cdn2/…) where enumerating
# every host upfront is impractical. The private-IP / localhost / non-
# http(s) scheme defenses STAY ON — a scraped <img src="http://10.0.0.1/">
# is still refused with this flag set.
CRAWLER_ALLOW_ANY_HOST=false
# Hard cap on a single image body. Default 32 MiB.
CRAWLER_MAX_IMAGE_BYTES=33554432