A hung TLS handshake or a page that never fires load could wedge a worker (or the cron metadata pass) indefinitely, because chromiumoxide imposes no navigation timeout of its own. New crawler::nav::wait_for_nav caps each navigation at NAV_TIMEOUT (30s) and returns a typed NavError so timeouts surface as transient (retryable) errors.

Wired at the three navigation sites:
- source::target::navigate (catalog/detail/pagination)
- content::sync_chapter_content (chapter reader)
- session::fetch_probe_html (session probe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
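
The timeout-plus-typed-error idea described in the commit message can be sketched with the standard library alone. This is a blocking analogue under stated assumptions, not the crawler's actual code: the real `wait_for_nav` is presumably async and is not shown here. The function name `wait_for_nav_blocking`, the `Timeout` and `Failed` variants, and the `is_transient` method are all hypothetical. Only `NAV_TIMEOUT` (30s) and the `NavError` type name come from the commit message.

```rust
use std::fmt;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Cap stated in the commit message (30s).
pub const NAV_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical shape of the typed error; the real `NavError` may differ.
#[derive(Debug, PartialEq)]
pub enum NavError {
    /// The navigation did not finish within the cap.
    Timeout(Duration),
    /// The navigation itself failed (assumed variant).
    Failed(String),
}

impl NavError {
    /// Timeouts are classified as transient, so a caller may retry them.
    pub fn is_transient(&self) -> bool {
        matches!(self, NavError::Timeout(_))
    }
}

impl fmt::Display for NavError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            NavError::Timeout(cap) => write!(f, "navigation timed out after {:?}", cap),
            NavError::Failed(msg) => write!(f, "navigation failed: {}", msg),
        }
    }
}

impl std::error::Error for NavError {}

/// Runs `nav` on a helper thread and waits at most `cap` for its result.
/// On timeout the helper thread is left detached; it is not cancelled.
pub fn wait_for_nav_blocking<T, F>(cap: Duration, nav: F) -> Result<T, NavError>
where
    T: Send + 'static,
    F: FnOnce() -> Result<T, String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be dropped after a timeout; ignore that.
        let _ = tx.send(nav());
    });
    match rx.recv_timeout(cap) {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(msg)) => Err(NavError::Failed(msg)),
        Err(RecvTimeoutError::Timeout) => Err(NavError::Timeout(cap)),
        // The helper thread ended (for example, by panicking) before sending.
        Err(RecvTimeoutError::Disconnected) => {
            Err(NavError::Failed("navigation ended without a result".to_string()))
        }
    }
}
```

A caller would pass `NAV_TIMEOUT` as the cap and branch on `is_transient()` to decide whether to requeue the job. Keeping the classification on the error type means each of the three call sites can share one retry policy instead of inspecting error strings.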
30 lines
947 B
Rust
//! Crawler subsystem.
//!
//! Runs as its own binary (`src/bin/crawler.rs`) and shares `domain`,
//! `repo`, and `storage` with the API binary. Layering mirrors the
//! `Storage` trait pattern: callers depend on the `source::Source`
//! trait, not on a concrete site; new sites plug in as additional
//! impls without touching the job runner.
//!
//! Submodules:
//! - [`browser`]: launches and pools Chromium via `chromiumoxide`.
//!   First run downloads a known-good build via the `fetcher` feature.
//! - [`source`]: the `Source` trait. Per-site impls live alongside it.
//! - [`jobs`]: job kinds, queue wrapper, handler dispatch.
//! - [`diff`]: change detection: new / updated / dropped semantics.

pub mod browser;
pub mod browser_manager;
pub mod content;
pub mod daemon;
pub mod detect;
pub mod diff;
pub mod jobs;
pub mod nav;
pub mod pipeline;
pub mod rate_limit;
pub mod safety;
pub mod session;
pub mod source;
pub mod url_utils;