Files
Mangalord/backend/src/crawler/mod.rs
MechaCat02 c33f30972e bugfix: SSRF allowlist, image size cap, robust session detect (0.34.1)
Four crawler defences in one PR (all four threats the review flagged
in §3 of REVIEW.md):

- New crawler::safety module with is_safe_url + accumulate_capped +
  fetch_bytes_capped. Rejects non-http(s) schemes, RFC1918 / loopback
  / link-local / CGNAT / ULA / IPv6-link-local hosts, and any host
  not on the operator's allowlist (defaults to CRAWLER_START_URL host
  + CRAWLER_CDN_HOST + CRAWLER_DOWNLOAD_ALLOWLIST extras).
- Streaming size cap (CRAWLER_MAX_IMAGE_BYTES, default 32 MiB) so a
  10 GiB \"image\" can't fill memory before disk.
- looks_like_image() reject path: non-image bytes fail the chapter or
  cover instead of being stored as .bin and served as
  application/octet-stream.
- session::classify_chapter_probe: three-way classifier replaces the
  binary #avatar_menu check at content.rs:115. A transient hiccup
  (broken-page body, or logged-in-but-no-reader) now retries with
  backoff instead of falsely freezing every worker on
  session_expired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 20:00:22 +02:00

28 lines
915 B
Rust

//! Crawler subsystem.
//!
//! Runs as its own binary (`src/bin/crawler.rs`) and shares `domain`,
//! `repo`, and `storage` with the API binary. Layering mirrors the
//! `Storage` trait pattern: callers depend on the `source::Source`
//! trait, not on a concrete site; new sites plug in as additional
//! impls without touching the job runner.
//!
//! Submodules:
//! - [`browser`]: launches and pools Chromium via `chromiumoxide`.
//! First run downloads a known-good build via the `fetcher` feature.
//! - [`source`]: the `Source` trait. Per-site impls live alongside it.
//! - [`jobs`]: job kinds, queue wrapper, handler dispatch.
//! - [`diff`]: change detection — new / updated / dropped semantics.
pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod detect;
pub mod diff;
pub mod jobs;
pub mod pipeline;
pub mod rate_limit;
pub mod safety;
pub mod session;
pub mod source;