Files
PiCloud/HANDBACK.md
MechaCat02 6080fc67f6 docs(v1.1.4): handback report
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 20:26:44 +02:00

253 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v1.1.4 Handback — Outbound HTTP SDK & Cron Triggers
**Branch:** `feat/v1.1.4-http-cron` (off `main`)
**Commits:** 1 implementation commit (`feat(v1.1.4): outbound HTTP SDK + cron triggers`) + this HANDBACK commit.
> **Note on commit granularity:** the brief suggested split
> `feat(v1.1.4-http)` / `feat(v1.1.4-cron)` commits. The two features are
> interleaved across shared files (`Cargo.toml`, `crates/picloud/src/lib.rs`,
> `crates/manager-core/src/lib.rs`, `version.rs`, `services.rs`), so
> cleanly-*compiling* per-theme commits aren't separable without interactive
> hunk staging (unavailable in this environment). I chose one coherent,
> green commit over shipping broken intermediates. Squash/relabel as you see fit.
---
## Scope coverage
| # | Item | Status |
|---|------|--------|
| 1 | `http::*` SDK surface (get/post/put/patch/delete/head/post_form/request) | **Done** |
| 2 | SSRF deny-list (resolved-IP, DNS-rebinding defense, scheme/port, body caps, UA, timeouts, `PICLOUD_HTTP_ALLOW_PRIVATE`) | **Done** |
| 3 | http authz (`Capability::AppHttpRequest``script:write`, script-as-gate, no new Scope) | **Done** |
| 4 | `HttpService` trait + `HttpServiceImpl` + Services wiring | **Done** |
| 5 | Cron migration `0017` (Layout-E extension) | **Done** |
| 6 | Cron scheduler tokio task (catch-up = fire-once) | **Done** |
| 7 | `ctx.event.cron` shape + `TriggerEvent::Cron` | **Done** |
| 8 | Dispatcher routing extension (`… \| Cron`) | **Done** |
| 9 | Dashboard cron trigger UI (minimal) | **Done** |
| 10a | Redact `ModuleSourceError::Backend` at resolver boundary | **Done** |
| 10b | Pin `rhai = "=1.24"` | **Done** |
| 10c | CHANGELOG retroactive v1.1.3 cross-app-trigger security note | **Done** |
| 11 | Version bumps (workspace 1.1.4, SDK 1.5, dashboard 0.10.0) | **Done** |
| 12 | Tests (~50-70) | **Done** — 70 new |
---
## SSRF policy implementation notes
- **reqwest hook.** `crates/manager-core/src/ssrf.rs` defines `SsrfResolver`
implementing `reqwest::dns::Resolve`, plugged in via
`ClientBuilder::dns_resolver`. It delegates to the system resolver
(injectable for tests — see DNS-rebinding test), then filters each `IpAddr`
through `SsrfPolicy::check`. Because reqwest re-resolves at every connection
(including each redirect hop), the policy applies post-redirect too.
- **`dns_resolver` is generic over a concrete `R: Resolve`** (stores `Arc<R>`),
so the resolver is passed as `Arc<SsrfResolver>`, not `Arc<dyn Resolve>`.
- **Literal-IP gap closed.** reqwest only routes *hostnames* through the custom
resolver — a URL with a literal-IP host (`http://127.0.0.1/`) bypasses it
entirely. The impl therefore *also* runs `SsrfPolicy::check` on literal-IP
hosts at URL-parse time (`validate_url`), on every hop. Both paths are tested.
- **IPv4-mapped IPv6 re-check.** `check_v6` calls `Ipv6Addr::to_ipv4_mapped()`;
if `Some`, it re-runs the v4 deny-list against the embedded address
(`::ffff:127.0.0.1` → denied as "loopback").
- **Applied before AND after redirects.** Redirects are followed *manually*
(client built with `redirect(Policy::none())`) so per-request
`follow_redirects`/`max_redirects` are honored; each hop re-validates
scheme/port + literal-IP and re-resolves hostnames through the SSRF resolver.
- **Script-visible error format.** `"http: blocked by SSRF policy: <reason>"`
where `<reason>` is a CIDR category (`loopback`, `private`, `link-local`,
`carrier-grade-nat`, `multicast`, `reserved`, `unique-local`, `unspecified`).
**The resolved IP is never included.** The all-addresses-denied case surfaces
as `Ssrf` (not a generic DNS error) via a marker error the resolver emits and
the impl detects by walking the reqwest error source chain.
## Cron scheduler implementation notes
- **Catch-up = fire-once.** Matches the brief; no deviation. `next_due` returns a
single canonical scheduled-at (first slot after `last_fired_at`, or
`created_at` if never fired); after firing, `last_fired_at = now`, so the next
tick sees only future slots. **Verified live** against Postgres: an
every-second (`* * * * * *`) trigger with a 2s tick advanced `last_fired_at`
~once per 2s, not once per second.
- **No ExecutionGate contention.** The scheduler only enqueues to the outbox
(one row per due trigger per tick, in a `FOR UPDATE OF d SKIP LOCKED`
transaction that also bumps `last_fired_at`). The existing dispatcher acquires
the gate and delivers it identically to kv/docs/dead_letter — verified live
(the cron outbox row was consumed, the script executed, the row deleted).
- **Timezone handling.** `chrono-tz`. Invalid IANA names are rejected at the
admin endpoint with a 422 (`TriggersApiError::Invalid`, message contains
"timezone"); the repo re-validates defensively before insert.
- **Schema beyond the brief:** none. Followed the brief exactly — `schedule`,
`timezone DEFAULT 'UTC'`, `last_fired_at`, `idx_cron_triggers_due`. **No**
stored `next_scheduled_at` column (an exploration agent suggested one; the
brief computes next-fire in-process, which I followed).
---
## Tests added (70 new)
- **SSRF policy + resolver (`ssrf.rs`, 20):** one per deny CIDR (127/8, 0/8,
10/8, 172.16/12, 192.168/16, 169.254/16 incl. metadata, 100.64/10, 224/4,
240/4, ::1, ::, fe80::/10, fc00::/7, ff00::/8); 172.x outside-range allowed;
public v4/v6 allowed; IPv4-mapped re-check; `allow_private` disables all;
resolver returns only allowed addrs; all-denied → SSRF marker; **DNS rebinding**
(mock resolver: public then private — second denied); empty resolution ≠ SSRF.
- **HTTP client (`http_service.rs`, 16):** GET/POST round-trips vs a hand-rolled
`TcpListener`; body dispatch + default UA; custom UA override; empty body;
non-2xx no-error; response cap via Content-Length; response cap mid-stream
(no Content-Length); request body cap pre-send; redirect-to-max-then-throw;
scheme rejection (file/ftp/gopher); port rejection (22/25/465/587); SSRF
literal-loopback; SSRF hostname-resolves-to-loopback; timeout; authz
(anon skips / member forbidden / member-with-role allowed).
- **Bridge integration (`sdk_http.rs`, 15):** real Rhai engine under
`spawn_blocking` vs a recording fake — status+JSON body, non-JSON string,
empty→`()`, Map→JSON, String→text, `()`→no body, headers+timeout forwarded,
unknown opt key throws, timeout>max throws, non-2xx no-throw, network error
throws `http:`, `post_form` url-encoding, `request` arbitrary method,
default-UA carries `script_id`, `cx.app_id` forwarded for attribution.
- **Cron scheduler (`cron_scheduler.rs`, 11):** 6-field schedule accept / 5-field
+ malformed reject; IANA tz accept / reject; due/not-due; never-fired uses
created_at; **catch-up fires exactly once after 5 missed windows**; timezone
affects fire time; bad schedule/tz → None.
- **Cron admin (`triggers_api.rs`, 6):** create succeeds; invalid schedule;
unknown timezone; **module target rejected** (v1.1.3 regression); **cross-app
target rejected** (v1.1.3 regression); member-without-role forbidden.
- **Module redaction (2):** `modules.rs` — backend error redacted from the
script-visible error (no leak); `module_redaction_logging.rs` — original error
**is** logged at ERROR level (captured via a global tracing subscriber).
---
## Decisions beyond the brief (every prompt-default deviation, flagged)
1. **Three-arg split `verb(url, body, opts)`** (user-approved during planning).
Diverges from the brief's documented two-arg `(url, opts)` shape and
generalizes the escape hatch to `request(method, url, body, opts)`. Resolves
the brief's internal contradiction (its Slack example `http::post(url, #{text:...})`
passed a bare body map, which would be an "unknown opt key" under the
two-arg rule). The `opts` vocabulary is now exactly
`{headers, timeout_ms, follow_redirects, max_redirects}`**`body_raw` was
dropped** (raw strings go through the positional body as a String). The
Slack example works unchanged (`#{text:...}` is the body).
2. **Cron crate = `cron` (0.12), not `croner`.** The brief allowed either; `cron`
handles the 6-field-with-seconds format and named weekdays (`MON-FRI`) used in
the brief's example, and integrates with chrono `Schedule::after`.
3. **Catch-up = fire-once** — matches the brief; called out explicitly as
requested. No deviation.
4. **`SdkCallCx` gained a `script_id` field.** The brief's default User-Agent is
`picloud/<v> (script:<script_id>)`, but `SdkCallCx` didn't carry the script
id. Adding it (sourced from `ExecRequest.script_id` in the engine) is the
clean home and doubles as the audit-attribution key the brief emphasizes. All
19 construction sites updated. The dead-letters admin cx uses a fresh sentinel
id (no script executes there).
5. **SSRF also blocks IPv6 unspecified `::` and IPv4 `0.0.0.0`** with reason
"unspecified". `0.0.0.0/8` is in the brief's list; `::` is not explicitly but
is an obvious sibling hole, so I blocked it too (defensible superset).
6. **No reqwest feature additions needed**`dns_resolver` and `Response::chunk()`
compile under the existing `default-features = false, features = ["json","rustls-tls"]`.
No cookie jar (cookies feature is off, so there's no jar to disable). Added
`url` as an executor-core dep (for `form_urlencoded` in `post_form`).
---
## How to verify locally (§8 attestation — run on this exact HEAD)
All four gates were run on the handback HEAD (the `feat(v1.1.4)` commit, before
this markdown commit):
```
cargo fmt --all -- --check → exit 0
cargo clippy --all-targets --all-features -- -D warnings → exit 0
cargo test --workspace → 427 passed, 0 failed
(cd dashboard && npm run check) → 0 errors, 0 warnings (369 files)
```
This HANDBACK commit is pure markdown (no gate-relevant files), so the numbers
above hold for the final HEAD.
**Migrations — verified against a real Postgres (dev stack, port 15432):**
- Fresh-DB replay: the `#[sqlx::test]` schema-snapshot test applies all
migrations on a fresh ephemeral DB and matches the (re-blessed) golden — passes.
- On-top-of-prior-state: booting `picloud` against a dev DB pinned at migration
`0006` applied `0007…0017` cleanly (`"migrations applied"`); `_sqlx_migrations`
max is now `17`; `cron_trigger_details` + widened CHECKs present.
**Live smoke performed:**
- Boot logged the `PICLOUD_HTTP_ALLOW_PRIVATE` warning and started the cron
scheduler + HTTP service without panic.
- Seeded an every-second cron trigger → scheduler set `last_fired_at`, dispatcher
consumed the outbox row and ran the script (row deleted on success), and
`last_fired_at` advanced at the tick cadence (fire-once confirmed). Smoke data
cleaned up afterward.
- HTTP GET / SSRF-block / body-dispatch behaviors are covered by the automated
integration tests (real `TcpListener` round-trips + loopback/hostname SSRF
blocks) rather than a manual curl flow, since a live SSRF-block smoke
conflicts with the `PICLOUD_HTTP_ALLOW_PRIVATE` a local-server smoke requires.
To re-run the schema snapshot:
```
docker compose up -d postgres
DATABASE_URL=postgres://picloud:picloud@127.0.0.1:15432/picloud \
cargo test -p picloud-manager-core --test schema_snapshot -- --include-ignored
```
---
## ⚠️ Latent issue found: stale schema-snapshot golden
`crates/manager-core/tests/expected_schema.txt` was **significantly stale** — the
committed golden was missing many tables from prior releases
(`abandoned_executions`, `dead_letters`, `dead_letter_trigger_details`,
`docs_*`, etc.). The `schema_snapshot` test is `#[ignore]` (needs a DB), so it
was apparently never re-blessed across v1.1.1v1.1.3 and silently drifted.
I re-blessed it, so the diff is large (+217 lines) but **only `cron_trigger_details`
+ the two widened CHECK constraints are v1.1.4-new** — the rest is pre-existing
drift correction. The blessed golden now matches a clean replay (verified).
Recommend the reviewer skim the diff to confirm, and consider whether the
`#[ignore]` should be lifted in CI (with a DB service) so the golden can't drift
again.
---
## Latent security findings
None new beyond the (already-known, already-closed-in-v1.1.3) cross-app trigger
gap, which §10c now documents in the CHANGELOG. The SSRF surface is the main
security mechanism in this release; see the SSRF notes above for the
defense-in-depth layering (resolver hook + literal-IP check + per-hop
re-validation + IP-never-leaked errors).
One thing for the reviewer to weigh: the SSRF policy is a hardcoded deny-list
with no per-app allow-list (deferred to v1.2 per the brief). An operator who
needs a script to reach a private service has only the all-or-nothing
`PICLOUD_HTTP_ALLOW_PRIVATE` global escape hatch today.
## Open questions for the reviewer
1. **Three-arg HTTP shape** (decision #1) — confirm you're happy with
`verb(url, body, opts)` + dropping `body_raw`, vs the brief's documented
two-arg form. This is the one user-facing API-shape divergence.
2. **Stale schema golden** — OK to land the full re-bless in this PR, or would
you prefer the drift correction split out?
## Deferred items (per the brief's OUT list — not built)
WebSocket/SSE, streaming responses, HTTP/3, per-app outbound allow/deny lists,
per-app rate limits, mTLS, request signing, cookie jar, cron backfill replay,
cron next-fire preview, cron schedule history, drift compensation,
module-import-over-HTTP, files/pubsub/secrets/email/users/queue.
## Known limitations / rough edges
- The dashboard Triggers tab lists all trigger kinds but only *creates* cron
triggers (kv/docs creation remains API-only, unchanged from before). No
next-fire-at preview (deferred to v1.2).
- `post_form` / body field order follows Rhai map iteration order
(`BTreeMap`-backed, so sorted/deterministic; not insertion order).
- The cron scheduler tick is floored at 1s; sub-second schedules effectively
fire at the tick cadence (by design — see the fire-once policy).
- The stale REVIEW.md at repo root is the v1.1.3 reviewer's artifact; the
v1.1.4 reviewer should overwrite it.