feat: transient-page detection across the crawler (0.30.0)

Until now, when the target site returned its 403 "we're sorry, the
request file are not found" response for a page that actually exists,
selectors matched nothing and the crawler treated the page as
"legitimately empty". Pagination walks silently dropped whole pages'
worth of manga, fetch_manga skipped individual entries, and the
startup session probe blamed PHPSESSID for what was a site hiccup.

This branch adds a single detection layer that the whole pipeline
routes through:

- `crawler::detect`: PageError::Transient typed signal, plus two
  primitives (`is_broken_page_body` matches the universal 403 body;
  `has_logo_sentinel` asserts #logo, the site-wide header element)
  and a `retry_on_transient` helper that retries a closure on
  Transient with a small attempt budget.
- `navigate()` screens every fetched body for the broken-page
  signature before handing it to a selector.
- Parsers (`parse_manga_list_from`, `parse_manga_detail`,
  `parse_chapter_pages`) check their structural sentinels (#logo for
  full-layout pages; a#pic_container for the reader, which doesn't
  render #logo) and return Result<_, PageError>. Empty Vec is now
  reserved for genuinely empty pages.
- `discover()` retries each pagination page up to 3× (2s apart) before
  failing the whole Discover job — at which point the existing job
  system's retry/backoff takes over for longer outages.
- `verify_session` is three-state: broken-page → retry probe;
  #logo present but #avatar_menu absent → genuine logout (the only
  state that should blame PHPSESSID); both present → ok.
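
For illustration, the detection layer described in the list above could
look roughly like the sketch below. It is a synchronous, std-only
approximation: the real `retry_on_transient` presumably wraps async
closures, and the `PageError` payload, the substring used for the body
signature, and both function signatures are assumptions rather than the
code in this commit.

```rust
use std::thread;
use std::time::Duration;

/// Assumed shape of the typed signal; the real enum may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PageError {
    /// The site served something other than the real page.
    Transient(String),
}

/// Assumed body check: case-insensitive match on the broken-page text
/// quoted in the commit message.
pub fn is_broken_page_body(html: &str) -> bool {
    html.to_lowercase().contains("the request file are not found")
}

/// Runs `op` until it stops returning `Transient` or the attempt budget
/// is spent. `op` receives the 1-based attempt number and is always
/// called at least once.
pub fn retry_on_transient<T, F>(
    max_attempts: u32,
    delay: Duration,
    mut op: F,
) -> Result<T, PageError>
where
    F: FnMut(u32) -> Result<T, PageError>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // Budget left: wait, then try again.
            Err(PageError::Transient(_)) if attempt < max_attempts => {
                thread::sleep(delay);
                attempt += 1;
            }
            // Success, or the final transient error: hand it back.
            other => return other,
        }
    }
}
```

With a budget of 3 and a 2s delay, this matches the per-page retry that
`discover()` is described as doing before the job-level retry/backoff
takes over.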

Test coverage added at the helper level: 13 unit tests for the
detection module (body signature, logo sentinel, PageError, retry
helper), parser-level tests for both transient and legitimately-empty
inputs, and 6 unit tests for the session probe classifier.
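
The distinction the parser-level tests exercise (transient versus
legitimately empty) can be sketched as follows. This is not one of the
parsers from this commit: `parse_titles` and the `<li class="manga">`
markup are hypothetical, and plain substring checks stand in for the
`scraper` selectors so the example needs only std.

```rust
/// Assumed shape of the typed signal; the real enum may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum PageError {
    Transient(&'static str),
}

/// Stand-in for `has_logo_sentinel`: a substring check instead of a
/// CSS selector match on `#logo`.
fn has_logo_sentinel(html: &str) -> bool {
    html.contains("id=\"logo\"")
}

/// Hypothetical list parser. A missing sentinel means the layout never
/// rendered, so the page is reported as transient; an empty Vec is only
/// returned when the layout is present and simply lists nothing.
pub fn parse_titles(html: &str) -> Result<Vec<String>, PageError> {
    if !has_logo_sentinel(html) {
        return Err(PageError::Transient("#logo missing"));
    }
    const OPEN: &str = "<li class=\"manga\">";
    const CLOSE: &str = "</li>";
    let mut titles = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(OPEN) {
        let after = &rest[start + OPEN.len()..];
        match after.find(CLOSE) {
            Some(end) => {
                titles.push(after[..end].trim().to_string());
                rest = &after[end + CLOSE.len()..];
            }
            // Unterminated item: stop scanning, keep what was found.
            None => break,
        }
    }
    Ok(titles)
}
```

A caller can then treat `Ok(vec![])` as "nothing on this page" and
`Err(PageError::Transient(_))` as "ask again", which is the split the
empty-Vec rule above is meant to protect.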

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-26 22:47:21 +02:00
parent b845d88766
commit 9ff49166a5
8 changed files with 594 additions and 59 deletions


@@ -9,19 +9,39 @@
//! Two things the cookie alone doesn't give us:
//! 1. The cookie value is only meaningful to the *server* — we have
//! no way to predict from the value alone whether it's still valid.
-//! `verify_session` does a navigation and checks for `#avatar_menu`,
-//! which only renders for authenticated visitors. Bail clean at
-//! startup if it's missing rather than discovering it 30 minutes
-//! into a backfill.
//! `verify_session` does a navigation and inspects the probe page
//! for three outcomes: broken-page response (transient — retry the
//! probe), `#logo` present but `#avatar_menu` absent (genuine logout
//! — bail loudly), or both present (authenticated). The earlier
//! avatar-only check conflated "site is hiccuping" with "session is
//! dead" and refused to start the crawler when the site had a brief
//! 503.
//! 2. The reqwest client (used for cover and chapter-image downloads)
//! has its own cookie store; we seed it for the catalog host only.
//! CDN hosts are deliberately *not* given the cookie — they serve
//! image bytes by signed URLs and don't need it.
use std::time::Duration;
use anyhow::{anyhow, Context};
use chromiumoxide::browser::Browser;
use chromiumoxide::cdp::browser_protocol::network::CookieParam;
use crate::crawler::detect::{has_logo_sentinel, is_broken_page_body};
/// Outcome of inspecting a probe-page response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionProbe {
/// `#logo` present and `#avatar_menu` present — session valid.
Ok,
/// `#logo` present but `#avatar_menu` absent — site rendered the
/// normal layout for an unauthenticated visitor; refresh PHPSESSID.
Unauthenticated,
/// Broken-page body signature or `#logo` missing — site is hiccuping.
/// Caller retries the probe rather than blaming the session.
Transient,
}
/// Compute the cookie domain (e.g. `.example.com`) from a start URL.
/// The leading dot makes the cookie cover every subdomain — the source
/// often redirects between `www.` and other prefixes mid-crawl, and a
@@ -86,34 +106,86 @@ pub async fn inject_phpsessid(
Ok(())
}
-/// Navigate to `probe_url` and confirm the logged-in `#avatar_menu`
-/// element is present. The selector only renders for authenticated
-/// visitors, so its absence is the unambiguous signal that PHPSESSID
-/// is missing, expired, or revoked.
/// Three-way classification of a probe-page response. Pure over HTML so
/// it's unit-testable without a real browser. Order matters: a body
/// matching the broken-page template is `Transient` even if the page
/// happens to contain `#avatar_menu` HTML somewhere — trust the universal
/// site signal over a stray selector match.
pub fn classify_probe(html: &str) -> SessionProbe {
if is_broken_page_body(html) {
return SessionProbe::Transient;
}
let doc = scraper::Html::parse_document(html);
if !has_logo_sentinel(&doc) {
return SessionProbe::Transient;
}
let avatar_sel = scraper::Selector::parse("#avatar_menu").unwrap();
if doc.select(&avatar_sel).next().is_some() {
SessionProbe::Ok
} else {
SessionProbe::Unauthenticated
}
}
/// In-startup retry budget for the session probe. Small but non-zero —
/// startup hitting a 5-second site hiccup shouldn't fail the operator
/// with "PHPSESSID expired" when the session is actually fine.
const PROBE_MAX_ATTEMPTS: u32 = 3;
const PROBE_RETRY_DELAY: Duration = Duration::from_secs(2);
/// Navigate to `probe_url` and classify the response. Retries the probe
/// on `Transient` outcomes (broken-page body, missing `#logo`); fails
/// fast on `Unauthenticated`; returns `Ok(())` on success.
///
-/// This burns one navigation against the catalog's rate limiter. The
-/// trade is worth it — failing here costs ~1s; failing 30 minutes into
-/// a backfill costs 30 minutes.
/// This burns one navigation per attempt against the catalog's rate
/// limiter. The trade is worth it — failing here costs ~1s; failing 30
/// minutes into a backfill costs 30 minutes.
pub async fn verify_session(browser: &Browser, probe_url: &str) -> anyhow::Result<()> {
let mut attempt = 0u32;
loop {
attempt += 1;
let html = fetch_probe_html(browser, probe_url).await?;
match classify_probe(&html) {
SessionProbe::Ok => {
tracing::info!(attempt, "session probe ok — #logo + #avatar_menu present");
return Ok(());
}
SessionProbe::Unauthenticated => {
return Err(anyhow!(
"session probe failed — #avatar_menu not present at {probe_url} \
(page rendered the normal layout); PHPSESSID is missing, expired, \
or revoked. Refresh CRAWLER_PHPSESSID and re-run."
));
}
SessionProbe::Transient if attempt < PROBE_MAX_ATTEMPTS => {
tracing::warn!(
attempt,
max_attempts = PROBE_MAX_ATTEMPTS,
"session probe got a transient page; retrying"
);
tokio::time::sleep(PROBE_RETRY_DELAY).await;
}
SessionProbe::Transient => {
return Err(anyhow!(
"session probe failed — probe page at {probe_url} returned a \
broken-page response after {PROBE_MAX_ATTEMPTS} attempts. \
The site appears to be down or rate-limiting us; try again \
later before refreshing CRAWLER_PHPSESSID."
));
}
}
}
}
async fn fetch_probe_html(browser: &Browser, probe_url: &str) -> anyhow::Result<String> {
let page = browser
.new_page(probe_url)
.await
.with_context(|| format!("open probe page {probe_url}"))?;
page.wait_for_navigation().await.context("wait for nav on probe")?;
-// The avatar menu is rendered server-side as part of the header
-// when a valid session cookie is present; absent JS is fine.
-let found = page.find_element("#avatar_menu").await.is_ok();
let html = page.content().await.context("read probe html")?;
page.close().await.ok();
-if found {
-tracing::info!("session probe ok — #avatar_menu present");
-Ok(())
-} else {
-Err(anyhow!(
-"session probe failed — #avatar_menu not present at {probe_url}; \
-PHPSESSID is missing, expired, or revoked. Refresh CRAWLER_PHPSESSID \
-and re-run."
-))
-}
Ok(html)
}
#[cfg(test)]
@@ -158,4 +230,59 @@ mod tests {
fn registrable_domain_returns_none_for_garbage() {
assert!(registrable_domain("not a url").is_none());
}
#[test]
fn classify_probe_ok_when_logo_and_avatar_present() {
let html = r#"<html><body>
<header><div id="logo">Target</div><div id="avatar_menu"></div></header>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Ok);
}
#[test]
fn classify_probe_unauth_when_logo_present_but_avatar_absent() {
// Real "logged out" response: site layout renders fine, just no
// avatar widget. This is the only state that should blame the
// session cookie.
let html = r#"<html><body>
<header><div id="logo">Target</div></header>
<main>Please log in.</main>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Unauthenticated);
}
#[test]
fn classify_probe_transient_on_broken_page_body() {
let html = "<html><body>\
<p>we're sorry, the request file are not found.</p>\
</body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_when_logo_missing() {
// No broken-body marker, but no site layout either — treat as
// transient (could be a Cloudflare interstitial, a 5xx page,
// etc.) rather than blaming the session.
let html = "<html><body><h1>Service Unavailable</h1></body></html>";
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
#[test]
fn classify_probe_transient_on_empty_response() {
assert_eq!(classify_probe(""), SessionProbe::Transient);
}
#[test]
fn classify_probe_trusts_broken_body_over_stray_avatar_match() {
// Defensive: if a broken-page body somehow contains an
// #avatar_menu element (e.g. an unrelated debug page on the
// same template), the body signature still wins.
let html = r#"<html><body>
<p>we're sorry, the request file are not found.</p>
<div id="logo"></div>
<div id="avatar_menu"></div>
</body></html>"#;
assert_eq!(classify_probe(html), SessionProbe::Transient);
}
}
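
The control flow of the new `verify_session` can also be exercised
without a browser by injecting the fetch step. The sketch below is a
simplified, synchronous stand-in under stated assumptions: `verify_with`
is a hypothetical name, the fetch closure replaces `fetch_probe_html`,
substring checks replace the `scraper` selectors, and errors are plain
strings instead of `anyhow::Error`.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SessionProbe {
    Ok,
    Unauthenticated,
    Transient,
}

/// Simplified classifier: the broken-body check runs first, then the
/// layout sentinel, and only then the login marker.
pub fn classify_probe(html: &str) -> SessionProbe {
    if html.contains("the request file are not found") || !html.contains("id=\"logo\"") {
        return SessionProbe::Transient;
    }
    if html.contains("id=\"avatar_menu\"") {
        SessionProbe::Ok
    } else {
        SessionProbe::Unauthenticated
    }
}

/// Hypothetical probe loop. Returns the attempt number that succeeded,
/// fails fast on a logged-out page, and only gives up on transient
/// pages once the attempt budget is spent.
pub fn verify_with<F>(max_attempts: u32, delay: Duration, mut fetch: F) -> Result<u32, String>
where
    F: FnMut() -> String,
{
    for attempt in 1..=max_attempts {
        match classify_probe(&fetch()) {
            SessionProbe::Ok => return Ok(attempt),
            SessionProbe::Unauthenticated => {
                return Err("unauthenticated: refresh the session cookie".to_string());
            }
            SessionProbe::Transient if attempt < max_attempts => thread::sleep(delay),
            SessionProbe::Transient => break,
        }
    }
    Err(format!("transient after {max_attempts} attempts: retry later"))
}
```

Keeping the two error paths separate is what lets the operator message
distinguish "refresh the cookie" from "the site is down, try later".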