Four crawler defences in one PR (all four threats the review flagged in §3 of REVIEW.md):

- New crawler::safety module with is_safe_url + accumulate_capped + fetch_bytes_capped. Rejects non-http(s) schemes, RFC1918 / loopback / link-local / CGNAT / ULA / IPv6-link-local hosts, and any host not on the operator's allowlist (defaults to CRAWLER_START_URL host + CRAWLER_CDN_HOST + CRAWLER_DOWNLOAD_ALLOWLIST extras).
- Streaming size cap (CRAWLER_MAX_IMAGE_BYTES, default 32 MiB) so a 10 GiB "image" can't fill memory before disk.
- looks_like_image() reject path: non-image bytes fail the chapter or cover instead of being stored as .bin and served as application/octet-stream.
- session::classify_chapter_probe: three-way classifier replaces the binary #avatar_menu check at content.rs:115. A transient hiccup (broken-page body, or logged-in-but-no-reader) now retries with backoff instead of falsely freezing every worker on session_expired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
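To make the first two defences concrete, here is a minimal sketch of an address-range check and a capped accumulator using only the standard library. It is not the code from `crawler::safety`: the names `is_blocked_ip`, `accumulate_capped_sketch`, and `TooLarge` are hypothetical, and the real `is_safe_url` also checks the scheme and the operator allowlist, which this sketch leaves out.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: true when `ip` sits in a range a crawler
/// should refuse to fetch from (private, loopback, link-local, CGNAT, ULA).
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // Treat IPv4-mapped IPv6 (::ffff:a.b.c.d) like the embedded IPv4 address.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    ip.is_private()                                 // RFC 1918
        || ip.is_loopback()                         // 127.0.0.0/8
        || ip.is_link_local()                       // 169.254.0.0/16
        || ip.is_unspecified()                      // 0.0.0.0
        || (o[0] == 100 && (o[1] & 0xC0) == 64)     // CGNAT 100.64.0.0/10
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()                                // ::1
        || ip.is_unspecified()                      // ::
        || (first & 0xfe00) == 0xfc00               // ULA fc00::/7
        || (first & 0xffc0) == 0xfe80               // link-local fe80::/10
}

/// Hypothetical error type: the body exceeded `limit` bytes.
#[derive(Debug, PartialEq, Eq)]
pub struct TooLarge {
    pub limit: usize,
}

/// Hypothetical capped accumulator: appends chunks to a buffer and stops
/// with an error as soon as the total would exceed `limit`, so an oversized
/// body is never fully buffered.
pub fn accumulate_capped_sketch<I, C>(chunks: I, limit: usize) -> Result<Vec<u8>, TooLarge>
where
    I: IntoIterator<Item = C>,
    C: AsRef<[u8]>,
{
    let mut buf = Vec::new();
    for chunk in chunks {
        let chunk = chunk.as_ref();
        // `buf.len() <= limit` always holds here, so the subtraction cannot underflow.
        if chunk.len() > limit - buf.len() {
            return Err(TooLarge { limit });
        }
        buf.extend_from_slice(chunk);
    }
    Ok(buf)
}
```

In a streaming fetch the chunks would come from the HTTP response body as it arrives. Checking the cap before each append means the buffer never grows past the limit, no matter how large the response claims to be.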
28 lines
915 B
Rust
//! Crawler subsystem.
//!
//! Runs as its own binary (`src/bin/crawler.rs`) and shares `domain`,
//! `repo`, and `storage` with the API binary. Layering mirrors the
//! `Storage` trait pattern: callers depend on the `source::Source`
//! trait, not on a concrete site; new sites plug in as additional
//! impls without touching the job runner.
//!
//! Submodules:
//! - [`browser`]: launches and pools Chromium via `chromiumoxide`.
//!   First run downloads a known-good build via the `fetcher` feature.
//! - [`source`]: the `Source` trait. Per-site impls live alongside it.
//! - [`jobs`]: job kinds, queue wrapper, handler dispatch.
//! - [`diff`]: change detection — new / updated / dropped semantics.

pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod detect;
pub mod diff;
pub mod jobs;
pub mod pipeline;
pub mod rate_limit;
pub mod safety;
pub mod session;
pub mod source;
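The layering described in the module docs (callers depend on the `source::Source` trait, not on a concrete site) can be sketched roughly as follows. The trait methods, the two site types, and `run_listing_job` are all hypothetical. The real `Source` trait is not shown here and is probably async, so this synchronous version only illustrates the dependency direction.

```rust
/// Hypothetical, simplified stand-in for `source::Source`.
pub trait Source {
    /// Short identifier for the site this impl talks to.
    fn name(&self) -> &str;
    /// Titles of the series the site currently lists.
    fn list_series(&self) -> Vec<String>;
}

/// Two hypothetical per-site impls living alongside the trait.
pub struct SiteA;
pub struct SiteB;

impl Source for SiteA {
    fn name(&self) -> &str {
        "site-a"
    }
    fn list_series(&self) -> Vec<String> {
        vec!["alpha".to_string(), "beta".to_string()]
    }
}

impl Source for SiteB {
    fn name(&self) -> &str {
        "site-b"
    }
    fn list_series(&self) -> Vec<String> {
        vec!["gamma".to_string()]
    }
}

/// Hypothetical job handler: it only sees `&dyn Source`, so adding a
/// third site means adding another impl, with no change to this function.
pub fn run_listing_job(source: &dyn Source) -> String {
    format!("{}: {} series", source.name(), source.list_series().len())
}
```

Calling `run_listing_job(&SiteA)` and `run_listing_job(&SiteB)` goes through the same code path. The runner never names a concrete site type.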