feat(crawler): CRAWLER_ALLOW_ANY_HOST bypasses the host allowlist (0.44.0)
Operators whose sources shard images across numbered CDN subdomains can't pre-enumerate every host in CRAWLER_DOWNLOAD_ALLOWLIST.

The new flag short-circuits the host check in DownloadAllowlist::contains while leaving the scheme, localhost, and private-IP defenses in is_safe_url untouched: scraped URLs pointing at 10.x / 169.254.169.254 / file:// stay refused. Default is false; the fail-closed posture is preserved unless the operator opts in.

Wired into both the server (config::build_download_allowlist) and the bin/crawler.rs one-shot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
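For readers without the source at hand, the layering described in the message can be sketched roughly as follows. This is an illustration only, not the code from this commit: the struct fields, the `new` constructor, the `scheme_and_host` helper, and `may_download` are all hypothetical names, and the URL splitting is deliberately simplistic (the real crate presumably uses a proper URL parser). Only `DownloadAllowlist::contains` and `is_safe_url` are names taken from the message, and their bodies here are assumptions.

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// Hypothetical stand-in for the crate's `DownloadAllowlist`.
pub struct DownloadAllowlist {
    hosts: HashSet<String>,
    allow_any_host: bool,
}

impl DownloadAllowlist {
    /// Hypothetical constructor; the real one may look different.
    pub fn new<I, S>(hosts: I, allow_any_host: bool) -> Self
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>,
    {
        let hosts = hosts
            .into_iter()
            .map(|h| h.as_ref().trim().to_ascii_lowercase())
            .filter(|h| !h.is_empty())
            .collect();
        Self { hosts, allow_any_host }
    }

    /// Host check only. The flag short-circuits it; nothing else is relaxed.
    pub fn contains(&self, host: &str) -> bool {
        self.allow_any_host || self.hosts.contains(&host.to_ascii_lowercase())
    }
}

/// Simplistic scheme/host splitter for this sketch (assumption: real code
/// would rely on a full URL parser instead).
fn scheme_and_host(url: &str) -> Option<(String, String)> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    // Drop any "user:pass@" prefix.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    let host = if let Some(stripped) = host_port.strip_prefix('[') {
        // Bracketed IPv6 literal such as "[::1]:8080".
        stripped.split_once(']')?.0
    } else {
        // Strip an optional ":port".
        host_port.rsplit_once(':').map_or(host_port, |(h, _)| h)
    };
    if host.is_empty() {
        return None;
    }
    Some((scheme.to_ascii_lowercase(), host.to_ascii_lowercase()))
}

/// Scheme, localhost, and private-IP defenses. Independent of the allowlist.
/// The sketch does not cover DNS names that resolve to private addresses.
pub fn is_safe_url(url: &str) -> bool {
    let (scheme, host) = match scheme_and_host(url) {
        Some(parts) => parts,
        None => return false,
    };
    if scheme != "http" && scheme != "https" {
        return false;
    }
    if host == "localhost" || host.ends_with(".localhost") {
        return false;
    }
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => {
            !(ip.is_private() || ip.is_loopback() || ip.is_link_local() || ip.is_unspecified())
        }
        Ok(IpAddr::V6(ip)) => !(ip.is_loopback() || ip.is_unspecified()),
        Err(_) => true, // not an IP literal: an ordinary host name
    }
}

/// Hypothetical combined gate: the safety check runs first, then the host check.
pub fn may_download(url: &str, allowlist: &DownloadAllowlist) -> bool {
    if !is_safe_url(url) {
        return false;
    }
    match scheme_and_host(url) {
        Some((_, host)) => allowlist.contains(&host),
        None => false,
    }
}
```

In this sketch, an allowlist built with the flag set accepts `https://cdn2.example.com/a.jpg` even when its host list is empty, while `http://10.0.0.1/a.jpg`, `http://169.254.169.254/...`, and `file:///etc/passwd` are rejected before the host check is ever consulted.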
@@ -66,6 +66,12 @@ MAX_FILE_BYTES=20971520
# to CRAWLER_START_URL's host and CRAWLER_CDN_HOST. Comma-separated.
# Defends against SSRF via scraped <img src="http://10.0.0.1/...">.
CRAWLER_DOWNLOAD_ALLOWLIST=
# Bypass the host allowlist entirely. Intended for sources that shard
# images across numbered CDN subdomains (cdn1/cdn2/…) where enumerating
# every host upfront is impractical. The private-IP / localhost / non-
# http(s) scheme defenses STAY ON — a scraped <img src="http://10.0.0.1/">
# is still refused with this flag set.
CRAWLER_ALLOW_ANY_HOST=false
# Hard cap on a single image body. Default 32 MiB.
CRAWLER_MAX_IMAGE_BYTES=33554432
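
As a rough illustration of the fail-closed default for CRAWLER_ALLOW_ANY_HOST, the flag could be read along these lines. This is a sketch under assumptions, not the code wired into config::build_download_allowlist or bin/crawler.rs: both function names are hypothetical, and the set of accepted truthy spellings ("true" and "1") is a guess.

```rust
use std::env;

/// Hypothetical parser: only an explicit opt-in value enables the bypass.
/// Unset, empty, or unrecognized values all fall back to `false`.
pub fn parse_allow_any_host(raw: Option<&str>) -> bool {
    match raw {
        // Assumption: "true" and "1" are the accepted spellings.
        Some(v) => matches!(v.trim().to_ascii_lowercase().as_str(), "true" | "1"),
        None => false,
    }
}

/// Reads the flag from the process environment.
pub fn allow_any_host_from_env() -> bool {
    parse_allow_any_host(env::var("CRAWLER_ALLOW_ANY_HOST").ok().as_deref())
}
```

Treating anything other than an explicit opt-in as `false` means a typo in the value leaves the host allowlist enforced rather than silently disabling it.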