fix(deploy): pivot tor service to password auth + wrapper entrypoint

Dockurr/tor's stock entrypoint binds the control port to localhost
(unreachable from a sibling container), refuses to run as a non-default
user (its setup chowns dirs and su-execs down to its `tor` user, both
requiring root), and skips its own HashedControlPassword injection
whenever the user's torrc declares a ControlPort. The combination meant
the original cookie-via-shared-volume design couldn't work without
fighting the image.

This commit:

- Adds tor/entrypoint.sh, a small wrapper that hashes $PASSWORD with
  `tor --hash-password`, appends the hash to a writable copy of
  /etc/tor/torrc, then execs tor. Container runs as root only for that
  bring-up; the torrc's `User tor` directive drops privs after port
  binding.
- Adds a healthcheck on the tor service that gates downstream
  containers on both 9050 + 9051 actually listening (was
  service_started, which fires before tor finishes bootstrap).
- Loosens MaxCircuitDirtiness 60 → 600. The 60s value would have
  rotated mid-chapter for any chapter with > ~50 images, which is
  exactly the kind of fingerprint we're trying to avoid.
- Wires TOR_CONTROL_PASSWORD as a REQUIRED .env var on both sides
  (PASSWORD on tor, CRAWLER_TOR_CONTROL_PASSWORD on backend).
  docker-compose.yml fails fast if unset.
- Removes the tor-data shared volume on backend (cookie auth is no
  longer the default; operators wanting cookie can mount it back).
- Documents the pivot + the cookie-vs-password tradeoff in
  .env.example.

End-to-end validated: `docker compose up -d tor`, then
`printf 'AUTHENTICATE "test"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' | nc tor 9051`
returns three `250 OK` lines.

Audit ref: #2, #3, #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
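
The validation step in the message above can be scripted for passwords other than `test`. What follows is a minimal sketch and not part of this commit: `tor_ctl_script` is a hypothetical helper name, and the escaping assumes the control protocol's quoted-string convention (a backslash before `\` and `"`) is enough for passwords that contain no newlines.

```shell
# Hypothetical helper (not part of this commit): print the control-port
# commands from the validation step for a given plain-text password.
tor_ctl_script() {
    # Escape backslashes first, then double quotes, so the password can
    # sit inside the quoted AUTHENTICATE argument. Assumes no newlines.
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    # Control-port commands are CRLF-terminated lines.
    printf 'AUTHENTICATE "%s"\r\nSIGNAL NEWNYM\r\nQUIT\r\n' "$escaped"
}
```

From a container on the compose network, `tor_ctl_script "$TOR_CONTROL_PASSWORD" | nc tor 9051` should then print one reply line starting with `250` per command, as in the validation run.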
27
.env.example
@@ -90,23 +90,36 @@ CRAWLER_CHROMIUM_BINARY=
 # CRAWLER_TOR_CONTROL_URL= (empty) below — the tor service can stay
 # running, it just won't be used.
 #
+# Going through TOR adds latency to every fetch; image downloads in
+# particular slow noticeably. The win is on sites that rate-limit or
+# fingerprint by exit IP — NEWNYM recirculation makes a fresh exit
+# cheap to reach for.
+#
 # CRAWLER_PROXY: SOCKS5(h) URL. Use `socks5h://` (not `socks5://`) so
 # DNS resolution also goes through TOR, avoiding leaks via the host's
 # resolver. Leave unset to talk to the upstream directly.
 CRAWLER_PROXY=socks5h://tor:9050
 # Control-port URL for SIGNAL NEWNYM ("get a fresh circuit"). Triggered
 # automatically on bad pages (broken-page body, missing #logo) and on
-# the Unauthenticated session probe outcome. Leave unset to disable the
-# recircuit feature (the SOCKS proxy still works).
+# the Unauthenticated session probe outcome. Leave unset to disable
+# the recircuit feature (the SOCKS proxy still works).
 CRAWLER_TOR_CONTROL_URL=tcp://tor:9051
-# Auth — cookie file (preferred) or password (HashedControlPassword).
-# Cookie wins when both are set. The bundled torrc enables cookie auth
-# and shares /var/lib/tor between containers via a named volume.
-CRAWLER_TOR_CONTROL_COOKIE_PATH=/var/lib/tor/control_auth_cookie
-# CRAWLER_TOR_CONTROL_PASSWORD=
 # Max NEWNYM-and-retry cycles per recircuit-eligible failure. Default 3.
 CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS=3
 
+# ----- TOR control-port password -----
+# Shared between the bundled dockurr/tor service (which hashes it into
+# its HashedControlPassword) and the backend's
+# CRAWLER_TOR_CONTROL_PASSWORD. REQUIRED — docker-compose.yml fails
+# fast if absent. Generate a strong random string; rotate by setting
+# a new value and restarting both `tor` and `backend`.
+#
+# Operators running their own non-dockurr tor daemon with cookie-file
+# auth can ignore this var and instead set
+# CRAWLER_TOR_CONTROL_COOKIE_PATH on the backend — the TorController
+# prefers cookie when both are present.
+TOR_CONTROL_PASSWORD=change-me-to-a-strong-random-string
+
 # ----- Frontend -----
 # The frontend container runs SvelteKit's Node adapter on :3000 and
 # proxies /api/* to BACKEND_URL via src/hooks.server.ts. In compose the
docker-compose.yml
@@ -24,13 +24,34 @@ services:
     # can signal NEWNYM on bad pages. See tor/torrc for the daemon
     # config; both ports are only `expose`d (compose-internal), never
     # bound on the host.
+    #
+    # We bypass dockurr/tor's stock entrypoint because it binds the
+    # control port to localhost (unreachable from the backend
+    # container) and skips its own HashedControlPassword injection
+    # when the user's torrc declares a ControlPort. Our wrapper
+    # (tor/entrypoint.sh) generates the hash from $PASSWORD and execs
+    # tor with our torrc. Backend authenticates with the same plain
+    # string via CRAWLER_TOR_CONTROL_PASSWORD.
     image: dockurr/tor:latest
+    entrypoint: ["/bin/sh", "/usr/local/bin/mangalord-entrypoint.sh"]
+    environment:
+      PASSWORD: ${TOR_CONTROL_PASSWORD:?TOR_CONTROL_PASSWORD must be set in .env}
     volumes:
       - ./tor/torrc:/etc/tor/torrc:ro
-      - tor-data:/var/lib/tor
+      - ./tor/entrypoint.sh:/usr/local/bin/mangalord-entrypoint.sh:ro
     expose:
       - "9050"
       - "9051"
+    # Wait for both control + SOCKS ports to listen before downstream
+    # services start. dockurr/tor's main process spawns before tor
+    # itself is bound, so `service_started` alone races the first
+    # NEWNYM call.
+    healthcheck:
+      test: ["CMD-SHELL", "nc -z 127.0.0.1 9050 && nc -z 127.0.0.1 9051"]
+      interval: 5s
+      timeout: 5s
+      retries: 20
+      start_period: 30s
     restart: unless-stopped
 
   backend:
@@ -39,7 +60,7 @@ services:
       postgres:
         condition: service_healthy
       tor:
-        condition: service_started
+        condition: service_healthy
     environment:
       DATABASE_URL: postgres://${POSTGRES_USER:-mangalord}:${POSTGRES_PASSWORD:?POSTGRES_PASSWORD must be set in .env}@postgres:5432/${POSTGRES_DB:-mangalord}
       BIND_ADDRESS: 0.0.0.0:8080
@@ -61,18 +82,17 @@ services:
       # so the image actually contains the binary.
       CRAWLER_CHROMIUM_BINARY: ${CRAWLER_CHROMIUM_BINARY:-}
       # TOR proxy + NEWNYM recircuit (see .env.example for details).
-      # Defaults assume the bundled `tor` service above; override to
-      # empty strings to disable.
+      # Defaults assume the bundled `tor` service above; override
+      # CRAWLER_PROXY= and CRAWLER_TOR_CONTROL_URL= (both empty) in
+      # .env to disable. CRAWLER_TOR_CONTROL_PASSWORD MUST match the
+      # tor service's PASSWORD (both wired to the same TOR_CONTROL_PASSWORD
+      # .env var below).
       CRAWLER_PROXY: ${CRAWLER_PROXY-socks5h://tor:9050}
       CRAWLER_TOR_CONTROL_URL: ${CRAWLER_TOR_CONTROL_URL-tcp://tor:9051}
-      CRAWLER_TOR_CONTROL_COOKIE_PATH: ${CRAWLER_TOR_CONTROL_COOKIE_PATH-/var/lib/tor/control_auth_cookie}
-      CRAWLER_TOR_CONTROL_PASSWORD: ${CRAWLER_TOR_CONTROL_PASSWORD:-}
+      CRAWLER_TOR_CONTROL_PASSWORD: ${TOR_CONTROL_PASSWORD:?TOR_CONTROL_PASSWORD must be set in .env}
       CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS: ${CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS:-3}
     volumes:
       - storage-data:/var/lib/mangalord/storage
-      # Read the TOR control-auth cookie from the shared named volume.
-      # Read-only on the backend side; the tor service is the writer.
-      - tor-data:/var/lib/tor:ro
     # No host port mapping in the default setup — the frontend proxies
     # /api/* through its hooks.server.ts. Expose :8080 only if you want
     # to hit the API directly from the host (e.g., bot scripts during
@@ -94,4 +114,3 @@ services:
 volumes:
   postgres-data:
   storage-data:
-  tor-data:
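
The compose file above mixes three interpolation forms, and the difference is what makes "set it to empty to disable" work for the proxy while the password stays mandatory. Compose borrows these forms from POSIX shell parameter expansion, so a shell sketch can show the distinction. The function names are illustrative only and not part of this commit.

```shell
# ${VAR-default}: the default applies only when VAR is unset, so an
# explicitly empty CRAWLER_PROXY= is kept and disables the proxy.
resolve_proxy() {
    printf '%s' "${CRAWLER_PROXY-socks5h://tor:9050}"
}

# ${VAR:-default}: the default applies when VAR is unset OR empty.
resolve_max_attempts() {
    printf '%s' "${CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS:-3}"
}

# ${VAR:?message}: fail with the message when VAR is unset or empty.
# The subshell keeps the failure from terminating the calling shell.
require_control_password() {
    ( : "${TOR_CONTROL_PASSWORD:?TOR_CONTROL_PASSWORD must be set in .env}" )
}
```

With `CRAWLER_PROXY=` present but empty in `.env`, `resolve_proxy` prints nothing, whereas an empty `CRAWLER_TOR_RECIRCUIT_MAX_ATTEMPTS=` still resolves to `3`.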
40
tor/entrypoint.sh
Executable file
@@ -0,0 +1,40 @@
+#!/bin/sh
+# Mangalord wrapper around dockurr/tor's tor binary.
+#
+# We bypass the image's stock entrypoint for two reasons:
+#   1. It generates a `ControlPort 9051` line that binds to localhost
+#      only (tor's default), but our backend lives in a separate
+#      container and needs to reach 0.0.0.0:9051.
+#   2. It then *skips* writing HashedControlPassword whenever the
+#      user's torrc declares a ControlPort, so we can't both bind to
+#      0.0.0.0 and benefit from its auto-hashing — it's one or the
+#      other. Doing the hashing ourselves is simpler than threading
+#      around its logic.
+#
+# This wrapper hashes $PASSWORD with `tor --hash-password`, appends a
+# `HashedControlPassword` line to a writable copy of /etc/tor/torrc,
+# then execs tor. Container runs as root (image default); tor binds
+# 9050/9051 which don't require root and is fine inside a single-
+# purpose container.
+
+set -eu
+
+if [ -z "${PASSWORD:-}" ]; then
+    echo "ERROR: PASSWORD env must be set (the plain string the backend will" >&2
+    echo " send as CRAWLER_TOR_CONTROL_PASSWORD)" >&2
+    exit 1
+fi
+
+# `tor --hash-password` prints the hash on the last line of stdout
+# (preceded by initialization noise).
+HASH=$(tor --hash-password "$PASSWORD" 2>/dev/null | tail -n1)
+if [ -z "$HASH" ]; then
+    echo "ERROR: 'tor --hash-password' produced no output" >&2
+    exit 1
+fi
+
+# /etc/tor/torrc is bind-mounted read-only, so copy + append.
+cp /etc/tor/torrc /tmp/torrc
+printf '\n# Injected by mangalord-entrypoint.sh from $PASSWORD env.\nHashedControlPassword %s\n' "$HASH" >> /tmp/torrc
+
+exec tor -f /tmp/torrc
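
One possible hardening of the hash extraction above, offered as a sketch rather than something this commit does: because `tail -n1` would also pick up a trailing warning line, the wrapper could check that the value looks like a hash before appending it to the torrc. The check assumes the usual output shape of `16:` followed by 58 hex digits (8-byte salt, 1-byte iteration indicator, 20-byte SHA-1 digest). `is_tor_password_hash` is a hypothetical name.

```shell
# Hypothetical check (not part of this commit): succeed only when $1 has
# the assumed `16:` + 58 hex digit shape of a hashed control password.
is_tor_password_hash() {
    case "$1" in
        16:*) ;;
        *) return 1 ;;
    esac
    hex=${1#16:}
    # 8-byte salt + 1-byte indicator + 20-byte digest = 29 bytes = 58 hex digits.
    [ "${#hex}" -eq 58 ] || return 1
    case "$hex" in
        *[!0-9A-Fa-f]*) return 1 ;;
    esac
    return 0
}
```

In the wrapper this could sit right after the empty-output check, for example `is_tor_password_hash "$HASH" || exit 1`, so a stray log line never ends up in `/tmp/torrc`.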
30
tor/torrc
@@ -12,20 +12,26 @@
 # to age out.
 SOCKSPort 0.0.0.0:9050 IsolateDestAddr IsolateDestPort
 
-# Control port for SIGNAL NEWNYM. Cookie auth means no secret to manage
-# in .env — the cookie file is created by the daemon at startup and
-# shared with the backend container via the named `tor-data` volume.
-# CookieAuthFileGroupReadable lets the backend's gid read it without
-# having to run as root.
+# Control port for SIGNAL NEWNYM. We rely on the dockurr/tor
+# entrypoint to inject `HashedControlPassword <hash>` from its
+# PASSWORD env var (see docker-compose.yml `tor.environment.PASSWORD`)
+# via a higher-priority --defaults-torrc. We just need to declare the
+# port itself here.
 ControlPort 0.0.0.0:9051
-CookieAuthentication 1
-CookieAuthFile /var/lib/tor/control_auth_cookie
-CookieAuthFileGroupReadable 1
 
-# Keep circuits short-lived so NEWNYM actually changes our visible
-# exit soon. Default is 600s (10 min); 60s is short enough that retries
-# after a brief site rate-limit window almost always see a new IP.
-MaxCircuitDirtiness 60
+# Keep circuits dirty for a while so a single chapter (which serial-
+# fetches all its images through the same SOCKS endpoint) finishes on
+# one circuit rather than mid-circuit-rotating in a way that looks like
+# anti-bot evasion to the target. NEWNYM still forces a fresh circuit
+# immediately when we want one — this is just the idle-rotation knob.
+MaxCircuitDirtiness 600
+
+# Drop privileges to the image's `tor` user after binding ports.
+# Required because /var/lib/tor (the image's DataDirectory volume)
+# is owned by tor:tor and tor refuses to use a data dir it doesn't
+# own. Our entrypoint runs as root only so it can call
+# `tor --hash-password` and write /tmp/torrc.
+User tor
 
 # Data + logs.
 DataDirectory /var/lib/tor