feat: in-process crawler daemon with cron and worker pool (0.28.0)
The backend now boots an internal crawler daemon that runs a daily
metadata pass (CRAWLER_DAILY_AT in CRAWLER_TZ, advisory-lock guarded
for multi-replica safety) and drains SyncChapterContent jobs from
crawler_jobs through a worker pool. Chromium launches lazily on first
job and is torn down after CRAWLER_IDLE_TIMEOUT_S seconds of
inactivity.

Modules:
- crawler::browser_manager: lazy-launch / idle-teardown wrapper around
  browser::Handle, with an on_launch hook that re-injects PHPSESSID on
  every fresh Chromium spawn.
- crawler::pipeline: run_metadata_pass (the shared
  discover/upsert/cover/sync-chapters loop) and the
  enqueue_bookmarked_pending helper used by the cron tick.
- crawler::daemon: cron task + worker pool, behind two trait seams
  (MetadataPass, ChapterDispatcher) so tests can inject stubs without
  standing up Chromium or a live source.

Behavior:
- CRAWLER_DAEMON=false skips daemon spawn entirely (default for tests).
- Catch-up tick fires on startup if the last persisted slot was missed.
- A SyncOutcome::SessionExpired sets a sticky AtomicBool; workers idle
  until operator restart with a refreshed PHPSESSID.
- Worker dispatch wrapped in catch_unwind so a panicking handler marks
  the job failed instead of taking down the worker.
- Migration 0015 adds a small crawler_state k-v table for the
  last_metadata_tick_at watermark.

Dep additions: chrono-tz (IANA TZ parsing).

CLI (bin/crawler) reuses pipeline::run_metadata_pass and now holds the
browser via BrowserManager so the on_launch session injection flow
stays in one place. Inline chapter-content sync semantics are
unchanged: the queue is for the daemon; force-refetches and manual
backfills still bypass it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
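The two worker-side behaviors described in the message (the sticky session-expired flag and the catch_unwind wrapper) can be illustrated with a minimal, synchronous sketch. It is not the daemon's code: `Outcome`, `JobStatus`, and `dispatch` are hypothetical names, the real workers are async, and treating an expired session as "skip and leave the job pending" is an assumption made only for this example.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical outcome type, loosely modeled on the `SyncOutcome` named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Done,
    SessionExpired,
}

/// Hypothetical per-job status recorded by a worker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Succeeded,
    Failed,
    Skipped,
}

/// Run one job through `handler`.
/// - If the sticky flag is already set, the job is skipped without running.
/// - A panic inside `handler` is caught and reported as `Failed`.
/// - `SessionExpired` sets the flag (it is never cleared here) and skips the job.
fn dispatch<F>(session_expired: &AtomicBool, handler: F) -> JobStatus
where
    F: FnOnce() -> Outcome,
{
    if session_expired.load(Ordering::Relaxed) {
        return JobStatus::Skipped;
    }
    match panic::catch_unwind(AssertUnwindSafe(handler)) {
        Ok(Outcome::Done) => JobStatus::Succeeded,
        Ok(Outcome::SessionExpired) => {
            session_expired.store(true, Ordering::Relaxed);
            JobStatus::Skipped
        }
        // The handler panicked; the worker itself keeps running.
        Err(_) => JobStatus::Failed,
    }
}

fn main() {
    let flag = AtomicBool::new(false);
    let a = dispatch(&flag, || Outcome::Done);
    let b = dispatch(&flag, || panic!("handler bug"));
    let c = dispatch(&flag, || Outcome::SessionExpired);
    let d = dispatch(&flag, || Outcome::Done);
    println!("{a:?} {b:?} {c:?} {d:?}");
}
```

In this sketch the flag lives only in process memory, so it resets when the process restarts, which is consistent with "workers idle until operator restart" in the message above.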
@@ -1,58 +1,23 @@
 //! Crawler binary.
 //!
-//! Walks the source's manga listing (all pages), fetches each manga's
-//! metadata + chapter list, downloads the cover into `Storage`, and
-//! reconciles everything into the DB. Then, for any chapter belonging
-//! to a bookmarked manga whose `page_count` is still 0, fetches the
-//! chapter page (logged in), pulls every image from the CDN, and writes
-//! the `pages` rows atomically per chapter.
+//! Now an ops escape hatch sitting alongside the in-process daemon: walks
+//! the source's manga listing (all pages), fetches each manga's metadata +
+//! chapter list, downloads covers, reconciles chapters — and then, for any
+//! chapter belonging to a bookmarked manga whose `page_count` is still 0,
+//! fetches the chapter pages inline. The daemon does the same work through
+//! `crawler_jobs`; the CLI is kept around for force-refetches and manual
+//! backfills.
 //!
-//! Configuration:
-//! - **Start URL** (required): first CLI positional arg, else
-//! `$CRAWLER_START_URL`. This is the manga *list* page (page 1).
-//! - **Database** (required): `$DATABASE_URL`.
-//! - **Storage dir**: `$STORAGE_DIR`, default `./data/storage` —
-//! matches the API binary so both write to the same local tree.
-//! - **Browser**: see `LaunchOptions::from_env` —
-//! `CRAWLER_BROWSER_MODE` (`headed`|`headless`) and
-//! `CRAWLER_BROWSER_ARGS`.
-//! - **Rate limit**: `CRAWLER_RATE_MS` (ms between requests per host,
-//! default `1000`). Per-host: catalog and each CDN have their own
-//! bucket and don't share a budget.
-//! - **CDN rate override** (optional): `CRAWLER_CDN_HOST` plus
-//! `CRAWLER_CDN_RATE_MS` to give a specific host a different
-//! interval. Useful when the image CDN tolerates higher RPS than
-//! the catalog host.
-//! - **Cap**: `CRAWLER_LIMIT` (max manga detail fetches per run,
-//! default `0` = no cap).
-//! - **Skip chapters**: `CRAWLER_SKIP_CHAPTERS=1` — turn off the
-//! chapter selector in the parser AND skip the per-manga
-//! `sync_manga_chapters` write. Use this for "metadata only" runs.
-//! - **Skip chapter content**: `CRAWLER_SKIP_CHAPTER_CONTENT=1` —
-//! skip the page-image phase even if chapters need syncing.
-//! - **Chapter content workers**: `CRAWLER_CHAPTER_WORKERS` (default
-//! `1`). Multiple workers process distinct chapters concurrently;
-//! the per-host rate limiter still gates total RPS to each origin.
-//! - **Force re-fetch**: `CRAWLER_FORCE_REFETCH_CHAPTERS=1` — re-fetch
-//! chapter images even when `page_count > 0`. Rare; use after the
-//! source replaces a chapter's images.
-//! - **PHPSESSID**: `CRAWLER_PHPSESSID` — paste your browser's
-//! session cookie. Required for chapter content (logged-out reader
-//! is paginated per-image and not viable at scale).
-//! - **Cookie domain** (optional): `CRAWLER_COOKIE_DOMAIN` overrides
-//! the auto-derived `.<registrable>.<tld>`. Only needed for
-//! multi-part TLDs (`.co.uk`, etc.).
-//! - **User agent** (optional): `CRAWLER_USER_AGENT` — applies to
-//! reqwest image fetches. Default uses reqwest's built-in UA.
-//! - **Proxy**: `$CRAWLER_PROXY` — single URL applied to both
-//! Chromium (`--proxy-server`) and `reqwest::Proxy::all`. Supports
-//! `http://`, `https://`, and `socks5://` (with optional user:pass).
-//! Example: `socks5://user:pass@host:1080`. Unset → direct.
-//! - **Keep browser open**: `CRAWLER_KEEP_BROWSER_OPEN=1` — when
-//! running headed, block on Ctrl+C at every shutdown point so the
-//! operator can inspect DOM state, cookies, or network calls in the
-//! visible Chromium window before exit. Ignored in headless mode
-//! (no window to inspect).
+//! Configuration mirrors the daemon's `CRAWLER_*` env vars (see
+//! `crate::config::CrawlerConfig`) plus the CLI-only:
+//! - **Start URL**: first CLI positional arg, else `$CRAWLER_START_URL`.
+//! - **Skip chapters / chapter content / force re-fetch / keep browser**:
+//! `CRAWLER_SKIP_CHAPTERS`, `CRAWLER_SKIP_CHAPTER_CONTENT`,
+//! `CRAWLER_FORCE_REFETCH_CHAPTERS`, `CRAWLER_KEEP_BROWSER_OPEN`.
+//! - **Limit**: `CRAWLER_LIMIT` (max manga detail fetches per run).
+//!
+//! See `crawler::pipeline::run_metadata_pass` for the shared metadata
+//! flow.
 
 use std::path::PathBuf;
 use std::sync::Arc;
@@ -60,14 +25,12 @@ use std::time::Duration;
 
 use anyhow::{anyhow, Context};
 use futures_util::stream::{self, StreamExt};
-use mangalord::crawler::{
-    browser::{self, LaunchOptions},
-    content::{self, SyncOutcome},
-    rate_limit::HostRateLimiters,
-    session,
-    source::{target::TargetSource, DiscoverMode, FetchContext, Source},
-};
-use mangalord::repo;
+use mangalord::crawler::browser::{BrowserMode, LaunchOptions};
+use mangalord::crawler::browser_manager::{self, BrowserManager};
+use mangalord::crawler::content::{self, SyncOutcome};
+use mangalord::crawler::pipeline;
+use mangalord::crawler::rate_limit::HostRateLimiters;
+use mangalord::crawler::session;
 use mangalord::storage::{LocalStorage, Storage};
 use sqlx::postgres::PgPoolOptions;
 use sqlx::PgPool;
@@ -126,10 +89,6 @@ async fn main() -> anyhow::Result<()> {
 
     let storage: Arc<dyn Storage> = Arc::new(LocalStorage::new(&storage_dir));
 
-    // Build reqwest with: own cookie jar (seeded with PHPSESSID for
-    // the catalog domain only), optional UA override, optional single
-    // proxy. `no_proxy()` disables env-based detection so the
-    // CRAWLER_PROXY knob is the only routing input.
     let cookie_jar = Arc::new(reqwest::cookie::Jar::default());
     if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
         let cookie_str = format!("PHPSESSID={sid}; Domain={domain}; Path=/");
@@ -155,12 +114,9 @@ async fn main() -> anyhow::Result<()> {
     if let Some(proxy) = &proxy_url {
         options.extra_args.push(format!("--proxy-server={proxy}"));
     }
-    // Keep-open is a debug aid; only meaningful when there's a window
-    // to inspect. Warn loudly if the operator set it under headless so
-    // they don't sit waiting for a Ctrl+C that won't show anything.
     let keep_open = match (keep_browser_open, options.mode) {
-        (true, browser::BrowserMode::Headed) => true,
-        (true, browser::BrowserMode::Headless) => {
+        (true, BrowserMode::Headed) => true,
+        (true, BrowserMode::Headless) => {
             tracing::warn!(
                 "CRAWLER_KEEP_BROWSER_OPEN ignored in headless mode (no window to inspect)"
             );
@@ -188,32 +144,37 @@ async fn main() -> anyhow::Result<()> {
         "starting crawler"
     );
 
-    let handle = browser::launch(options).await.context("launch browser")?;
-
-    // Cookie + session probe must happen *before* any browser
-    // navigation that depends on auth (i.e. chapter content). The
-    // discover/metadata phase doesn't strictly need auth, but
-    // probing now lets us fail fast: a bad cookie costs ~2s here
-    // instead of 30 min into a backfill.
-    let session_ready = if let (Some(sid), Some(domain)) = (&phpsessid, &cookie_domain) {
-        if let Err(e) = session::inject_phpsessid(handle.browser(), sid, domain).await {
-            close_or_wait(handle, keep_open).await;
-            return Err(e);
+    // BrowserManager with idle_timeout = ZERO so the CLI keeps Chromium
+    // alive for the entire run — same lifecycle as the old direct
+    // `browser::launch()` flow. on_launch re-injects PHPSESSID + runs the
+    // session probe; bad cookies fail fast before any real work happens.
+    let on_launch: browser_manager::OnLaunch = match (&phpsessid, &cookie_domain) {
+        (Some(sid), Some(domain)) => {
+            let sid = sid.clone();
+            let domain = domain.clone();
+            let start_url_clone = start_url.clone();
+            Arc::new(move |browser| {
+                let sid = sid.clone();
+                let domain = domain.clone();
+                let start_url = start_url_clone.clone();
+                Box::pin(async move {
+                    session::inject_phpsessid(&browser, &sid, &domain)
+                        .await
+                        .context("inject_phpsessid")?;
+                    session::verify_session(&browser, &start_url)
+                        .await
+                        .context("verify_session")?;
+                    Ok(())
+                })
+            })
         }
-        match session::verify_session(handle.browser(), &start_url).await {
-            Ok(()) => true,
-            Err(e) => {
-                close_or_wait(handle, keep_open).await;
-                return Err(e);
-            }
-        }
-    } else {
-        tracing::info!("no PHPSESSID supplied — chapter content phase will be skipped");
-        false
+        _ => browser_manager::noop_on_launch(),
     };
+    let session_ready = phpsessid.is_some() && cookie_domain.is_some();
+    let manager = BrowserManager::new(options, Duration::ZERO, on_launch);
 
     let result = run(
-        handle.browser(),
+        Arc::clone(&manager),
         &db,
         Arc::clone(&storage),
         &http,
@@ -228,17 +189,7 @@ async fn main() -> anyhow::Result<()> {
         force_refetch_chapters,
     )
     .await;
-    close_or_wait(handle, keep_open).await;
-    result
-}
 
-/// Either close the browser immediately or wait for Ctrl+C first.
-/// `keep_open=true` is only ever passed when the browser is headed, so
-/// the operator has a real window to poke at. Browser is dropped at
-/// the end of this fn in both cases — chromiumoxide's `Browser` is
-/// `kill_on_drop`, so we must wait for the Ctrl+C *before* the drop
-/// or the Chromium child gets killed out from under the operator.
-async fn close_or_wait(handle: browser::Handle, keep_open: bool) {
     if keep_open {
         tracing::info!(
             "crawler finished; browser kept open. Press Ctrl+C to close and exit."
@@ -246,12 +197,13 @@ async fn close_or_wait(handle: browser::Handle, keep_open: bool) {
         let _ = tokio::signal::ctrl_c().await;
         tracing::info!("Ctrl+C received; closing browser");
     }
-    let _ = handle.close().await;
+    manager.shutdown().await;
+    result
 }
 
 #[allow(clippy::too_many_arguments)]
 async fn run(
-    browser: &chromiumoxide::Browser,
+    manager: Arc<BrowserManager>,
     db: &PgPool,
     storage: Arc<dyn Storage>,
     http: &reqwest::Client,
@@ -270,132 +222,28 @@ async fn run(
         rate = rate.with_override(host, Duration::from_millis(cdn_rate_ms));
     }
     let rate = Arc::new(rate);
-    let source = {
-        let s = TargetSource::new(start_url.to_string());
-        if skip_chapters {
-            s.without_chapter_parsing()
-        } else {
-            s
-        }
-    };
-    let ctx = FetchContext {
-        browser,
-        rate: rate.as_ref(),
-    };
 
-    let source_id = source.id();
-    repo::crawler::ensure_source(
+    let stats = pipeline::run_metadata_pass(
+        manager.as_ref(),
         db,
-        source_id,
-        "Target Site",
-        &origin_of(start_url).unwrap_or_else(|| start_url.to_string()),
+        storage.as_ref(),
+        http,
+        rate.as_ref(),
+        start_url,
+        limit,
+        skip_chapters,
     )
-    .await
-    .context("ensure_source")?;
-
-    let run_started_at = chrono::Utc::now();
-
-    let max_refs = (limit > 0).then_some(limit);
-    tracing::info!(?max_refs, "discovering manga list");
-    let refs = source
-        .discover(&ctx, DiscoverMode::Backfill, max_refs)
-        .await
-        .context("discover failed")?;
-    tracing::info!(count = refs.len(), "discovered manga list");
-
-    let to_fetch = refs;
-    let total = to_fetch.len();
-
-    for (i, r) in to_fetch.iter().enumerate() {
-        tracing::info!(idx = i + 1, total, key = %r.source_manga_key, "fetching metadata");
-        let manga = match source.fetch_manga(&ctx, r).await {
-            Ok(m) => m,
-            Err(e) => {
-                tracing::warn!(key = %r.source_manga_key, url = %r.url, error = ?e, "fetch_manga failed");
-                continue;
-            }
-        };
-
-        let upsert = match repo::crawler::upsert_manga_from_source(db, source_id, &r.url, &manga)
-            .await
-        {
-            Ok(u) => u,
-            Err(e) => {
-                tracing::error!(key = %r.source_manga_key, error = ?e, "upsert_manga_from_source failed");
-                continue;
-            }
-        };
-        tracing::info!(
-            key = %manga.source_manga_key,
-            manga_id = %upsert.manga_id,
-            status = ?upsert.status,
-            title = %manga.title,
-            "manga upserted"
-        );
-
-        // Cover image: download when missing in storage (backfill for
-        // mangas synced before cover-download support, plus the New
-        // path) or when metadata changed (cover URL is part of
-        // metadata_hash, so an Updated status implies the URL may
-        // have moved). Failures are non-fatal.
-        let needs_cover = upsert.cover_image_path.is_none()
-            || matches!(upsert.status, repo::crawler::UpsertStatus::Updated);
-        if needs_cover {
-            if let Some(cover_url) = manga.cover_url.as_deref() {
-                if let Err(e) = download_and_store_cover(
-                    db,
-                    storage.as_ref(),
-                    http,
-                    rate.as_ref(),
-                    &r.url,
-                    upsert.manga_id,
-                    cover_url,
-                )
-                .await
-                {
-                    tracing::warn!(manga_id = %upsert.manga_id, error = ?e, "cover download failed");
-                }
-            }
-        }
-
-        if !skip_chapters {
-            match repo::crawler::sync_manga_chapters(
-                db,
-                source_id,
-                upsert.manga_id,
-                &manga.chapters,
-            )
-            .await
-            {
-                Ok(diff) => tracing::info!(
-                    manga_id = %upsert.manga_id,
-                    new = diff.new,
-                    refreshed = diff.refreshed,
-                    dropped = diff.dropped,
-                    "chapters synced"
-                ),
-                Err(e) => tracing::warn!(manga_id = %upsert.manga_id, error = ?e, "chapter sync failed"),
-            }
-        }
-    }
-
-    if limit == 0 {
-        match repo::crawler::mark_dropped_mangas(db, source_id, run_started_at).await {
-            Ok(n) => tracing::info!(dropped = n, "marked unseen manga as dropped"),
-            Err(e) => tracing::warn!(error = ?e, "drop-pass failed"),
-        }
-    } else {
-        tracing::info!(limit, "partial sync — skipping drop pass");
-    }
+    .await?;
+    tracing::info!(?stats, "metadata pass complete");
 
     if !skip_chapter_content {
         sync_bookmarked_chapter_content(
-            browser,
+            Arc::clone(&manager),
             db,
             Arc::clone(&storage),
             http,
             Arc::clone(&rate),
-            source_id,
+            "target",
             chapter_workers,
             force_refetch_chapters,
         )
@@ -405,17 +253,15 @@ async fn run(
     Ok(())
 }
 
-/// Find every chapter whose manga is bookmarked by at least one user
-/// and that hasn't been content-synced yet, then fan them out across
-/// `workers` concurrent tasks. Each task is one full chapter sync; the
-/// per-host rate limiter caps total RPS to the source/CDN regardless
-/// of worker count.
+/// Find every chapter whose manga is bookmarked by at least one user and
+/// that hasn't been content-synced yet, then fan them out across `workers`
+/// concurrent tasks. Same as before except the browser comes from a
+/// BrowserManager lease so it interleaves cleanly with the metadata pass.
 ///
-/// A session-expired result from any task aborts the whole phase —
-/// continuing wastes time and risks the source flagging the pattern.
+/// A `SessionExpired` result aborts the phase.
 #[allow(clippy::too_many_arguments)]
 async fn sync_bookmarked_chapter_content(
-    browser: &chromiumoxide::Browser,
+    manager: Arc<BrowserManager>,
     db: &PgPool,
     storage: Arc<dyn Storage>,
     http: &reqwest::Client,
@@ -424,13 +270,6 @@ async fn sync_bookmarked_chapter_content(
     workers: usize,
     force_refetch: bool,
 ) -> anyhow::Result<()> {
-    // Subquery first so DISTINCT collapses multi-user bookmark rows
-    // without forcing every ORDER BY column into the SELECT list (PG
-    // rejects `ORDER BY c.created_at` against `SELECT DISTINCT c.id,
-    // c.manga_id, cs.source_url` with "ORDER BY expressions must
-    // appear in select list"). Outer ORDER BY then groups chapters by
-    // their manga, oldest first, so backfills proceed in reading
-    // order per manga.
     let pending: Vec<(Uuid, Uuid, String)> = sqlx::query_as(
         r#"
         SELECT id, manga_id, source_url FROM (
@@ -457,12 +296,6 @@ async fn sync_bookmarked_chapter_content(
     }
     tracing::info!(count = pending.len(), workers, "chapter content phase starting");
 
-    // `for_each_concurrent` polls up to `workers` futures at once on
-    // the *current* task, so each future borrows the browser, db, and
-    // http client from the outer scope rather than requiring 'static
-    // captures via spawn. chromiumoxide's `Browser::new_page(&self)`
-    // is safe for concurrent calls; the per-host rate limiter
-    // serializes the actual on-wire requests against each origin.
     let session_expired = Arc::new(std::sync::atomic::AtomicBool::new(false));
     let stats = std::sync::Mutex::new(WorkerStats::default());
 
@@ -471,13 +304,23 @@ async fn sync_bookmarked_chapter_content(
             let session_expired = Arc::clone(&session_expired);
             let storage = Arc::clone(&storage);
             let rate = Arc::clone(&rate);
+            let manager = Arc::clone(&manager);
             let stats = &stats;
             async move {
                 if session_expired.load(std::sync::atomic::Ordering::Relaxed) {
                     return;
                 }
+                let lease = match manager.acquire().await {
+                    Ok(l) => l,
+                    Err(e) => {
+                        tracing::error!(%chapter_id, error = ?e, "browser acquire failed");
+                        let mut s = stats.lock().unwrap();
+                        s.failed += 1;
+                        return;
+                    }
+                };
                 let outcome = content::sync_chapter_content(
-                    browser,
+                    &lease,
                     db,
                     storage.as_ref(),
                     http,
@@ -488,6 +331,7 @@ async fn sync_bookmarked_chapter_content(
                     force_refetch,
                 )
                 .await;
+                drop(lease);
                 let mut s = stats.lock().unwrap();
                 match outcome {
                     Ok(SyncOutcome::Fetched { pages }) => {
@@ -535,51 +379,6 @@ struct WorkerStats {
     failed: usize,
 }
 
-async fn download_and_store_cover(
-    db: &PgPool,
-    storage: &dyn Storage,
-    http: &reqwest::Client,
-    rate: &HostRateLimiters,
-    manga_url: &str,
-    manga_id: Uuid,
-    cover_url: &str,
-) -> anyhow::Result<()> {
-    let absolute = reqwest::Url::parse(manga_url)
-        .context("parse manga URL")?
-        .join(cover_url)
-        .context("join cover URL onto manga URL")?;
-
-    rate.wait_for(absolute.as_str()).await?;
-    let resp = http
-        .get(absolute.clone())
-        // Source CDNs commonly check Referer. Set it to the manga
-        // detail page that linked the cover — same UX as a real
-        // browser fetching the image.
-        .header(reqwest::header::REFERER, manga_url)
-        .send()
-        .await
-        .with_context(|| format!("GET {absolute}"))?
-        .error_for_status()
-        .with_context(|| format!("non-2xx for {absolute}"))?;
-    let bytes = resp.bytes().await.context("read cover body")?;
-
-    // `infer` sniffs the magic bytes — same crate the upload handler
-    // uses, so we don't trust the URL's extension.
-    let kind = infer::get(&bytes);
-    let ext = kind.map(|k| k.extension()).unwrap_or("bin");
-    let key = format!("mangas/{manga_id}/cover.{ext}");
-
-    storage
-        .put(&key, &bytes)
-        .await
-        .with_context(|| format!("store cover at {key}"))?;
-    repo::manga::set_cover_image_path(db, manga_id, &key)
-        .await
-        .with_context(|| format!("update cover_image_path for {manga_id}"))?;
-    tracing::info!(manga_id = %manga_id, key = %key, bytes = bytes.len(), %absolute, "cover stored");
-    Ok(())
-}
-
 fn resolve_start_url() -> anyhow::Result<String> {
     if let Some(arg) = std::env::args().nth(1) {
         return Ok(arg);
@@ -591,12 +390,6 @@ fn resolve_start_url() -> anyhow::Result<String> {
     })
 }
 
-fn origin_of(url: &str) -> Option<String> {
-    let (scheme, rest) = url.split_once("://")?;
-    let host = rest.split('/').next()?;
-    Some(format!("{scheme}://{host}"))
-}
-
 fn env_u64(name: &str, default: u64) -> u64 {
     std::env::var(name)
         .ok()
@@ -611,3 +404,4 @@ fn env_bool(name: &str, default: bool) -> bool {
         _ => default,
     }
 }
+