feat(deploy): dockurr/tor service + torrc; wire crawler to use it by default

Adds a `tor` service to the compose stack (dockurr/tor) with a torrc
tuned for the crawler — SOCKS5 on 9050 with IsolateDestAddr +
IsolateDestPort so NEWNYM picks up promptly, control port on 9051
with cookie auth, MaxCircuitDirtiness 60.

Backend defaults CRAWLER_PROXY → socks5h://tor:9050 and
CRAWLER_TOR_CONTROL_URL → tcp://tor:9051 so TOR + recircuit are on
out-of-the-box. Operators can override both to empty in .env to opt
out without removing the service.
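
As an illustration of the override semantics described above, here is a minimal
Python sketch: unset means "use the bundled default", empty means "opt out".
It assumes the backend applies the defaults itself, which this commit does not
confirm. The helper names `resolve_setting` and `resolve_tor_settings` are
hypothetical.

```python
import os

# Defaults named in this commit message.
DEFAULT_PROXY = "socks5h://tor:9050"
DEFAULT_CONTROL_URL = "tcp://tor:9051"


def resolve_setting(env, name, default):
    """Return the default when unset, None when set to empty, else the value."""
    raw = env.get(name)
    if raw is None:
        # Variable not present at all: fall back to the bundled default.
        return default
    raw = raw.strip()
    # An explicitly empty value is the operator opting out.
    return raw or None


def resolve_tor_settings(env=os.environ):
    """Hypothetical helper: collect both Tor-related settings in one place."""
    return {
        "proxy": resolve_setting(env, "CRAWLER_PROXY", DEFAULT_PROXY),
        "control_url": resolve_setting(
            env, "CRAWLER_TOR_CONTROL_URL", DEFAULT_CONTROL_URL
        ),
    }
```

With no variables set, both values resolve to the `tor` service. With
`CRAWLER_PROXY=` and `CRAWLER_TOR_CONTROL_URL=` in .env, both resolve to
`None` and the crawler would talk to the upstream directly.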

The tor-data named volume is mounted read-only on the backend so it can
read /var/lib/tor/control_auth_cookie; CookieAuthFileGroupReadable makes
the cookie file group-readable.
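
For reference, a minimal Python sketch of how a client could use that cookie to
request a fresh circuit over the control port. It is not the backend's actual
code. The function names are hypothetical, and the sketch assumes Tor answers
each command with a single `250 OK` line.

```python
import socket
from urllib.parse import urlparse


def parse_control_url(url):
    """Split a tcp://host:port control URL into (host, port)."""
    parsed = urlparse(url)
    if parsed.scheme != "tcp" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {url!r}")
    return parsed.hostname, parsed.port


def auth_command(cookie):
    """Build the AUTHENTICATE line; the cookie bytes are sent hex-encoded."""
    return b"AUTHENTICATE " + cookie.hex().encode("ascii") + b"\r\n"


def request_new_circuit(control_url, cookie_path, timeout=10.0):
    """Authenticate with the cookie file, then send SIGNAL NEWNYM."""
    host, port = parse_control_url(control_url)
    with open(cookie_path, "rb") as fh:
        cookie = fh.read()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        for command in (auth_command(cookie), b"SIGNAL NEWNYM\r\n"):
            sock.sendall(command)
            reply = reader.readline()
            # Success replies on the control port start with status 250.
            if not reply.startswith(b"250"):
                raise RuntimeError(f"tor control port replied: {reply!r}")
```

Inside the compose network this would be called as
`request_new_circuit("tcp://tor:9051", "/var/lib/tor/control_auth_cookie")`.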

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-31 18:54:40 +02:00
parent 8c6378b877
commit ecbbebafc4
3 changed files with 84 additions and 0 deletions

@@ -83,6 +83,30 @@ CRAWLER_MAX_IMAGE_BYTES=33554432
# the image actually contains the binary.
CRAWLER_CHROMIUM_BINARY=
# ----- Crawler TOR proxy + recircuit -----
# The compose stack ships a `tor` service (dockurr/tor) and defaults
# CRAWLER_PROXY to it, so by default all crawler traffic exits via the
# TOR network. To opt out, set CRAWLER_PROXY= (empty) AND
# CRAWLER_TOR_CONTROL_URL= (empty) below — the tor service can stay
# running, it just won't be used.
#
# CRAWLER_PROXY: SOCKS5(h) URL. Use `socks5h://` (not `socks5://`) so
# DNS resolution also goes through TOR, avoiding leaks via the host's
# resolver. Leave unset to talk to the upstream directly.
CRAWLER_PROXY=socks5h://tor:9050
# Control-port URL for SIGNAL NEWNYM ("get a fresh circuit"). Triggered
# automatically on bad pages (broken-page body, missing #logo) and on
# the Unauthenticated session probe outcome. Leave unset to disable the
# recircuit feature (the SOCKS proxy still works).
CRAWLER_TOR_CONTROL_URL=tcp://tor:9051
# Auth — cookie file (preferred) or password (HashedControlPassword).
# Cookie wins when both are set. The bundled torrc enables cookie auth
# and shares /var/lib/tor between containers via a named volume.
CRAWLER_TOR_CONTROL_COOKIE_PATH=/var/lib/tor/control_auth_cookie
# CRAWLER_TOR_CONTROL_PASSWORD=
# Max NEWNYM-and-retry cycles per recircuit-eligible failure. Default 3.
CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS=3
# ----- Frontend -----
# The frontend container runs SvelteKit's Node adapter on :3000 and
# proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the