PiCloud/HANDBACK.md

# v1.1.4 Handback — Outbound HTTP SDK & Cron Triggers

**Branch:** `feat/v1.1.4-http-cron` (off `main`)
**Commits:** 1 implementation commit (`feat(v1.1.4): outbound HTTP SDK + cron triggers`) + this HANDBACK commit.

> **Note on commit granularity:** the brief suggested split
> `feat(v1.1.4-http)` / `feat(v1.1.4-cron)` commits. The two features are
> interleaved across shared files (`Cargo.toml`, `crates/picloud/src/lib.rs`,
> `crates/manager-core/src/lib.rs`, `version.rs`, `services.rs`), so
> cleanly-*compiling* per-theme commits aren't separable without interactive
> hunk staging (unavailable in this environment). I chose one coherent,
> green commit over shipping broken intermediates. Squash/relabel as you see fit.

---

## Scope coverage

| # | Item | Status |
|---|------|--------|
| 1 | `http::*` SDK surface (get/post/put/patch/delete/head/post_form/request) | **Done** |
| 2 | SSRF deny-list (resolved-IP, DNS-rebinding defense, scheme/port, body caps, UA, timeouts, `PICLOUD_HTTP_ALLOW_PRIVATE`) | **Done** |
| 3 | http authz (`Capability::AppHttpRequest` → `script:write`, script-as-gate, no new Scope) | **Done** |
| 4 | `HttpService` trait + `HttpServiceImpl` + Services wiring | **Done** |
| 5 | Cron migration `0017` (Layout-E extension) | **Done** |
| 6 | Cron scheduler tokio task (catch-up = fire-once) | **Done** |
| 7 | `ctx.event.cron` shape + `TriggerEvent::Cron` | **Done** |
| 8 | Dispatcher routing extension (`… \| Cron`) | **Done** |
| 9 | Dashboard cron trigger UI (minimal) | **Done** |
| 10a | Redact `ModuleSourceError::Backend` at resolver boundary | **Done** |
| 10b | Pin `rhai = "=1.24"` | **Done** |
| 10c | CHANGELOG retroactive v1.1.3 cross-app-trigger security note | **Done** |
| 11 | Version bumps (workspace 1.1.4, SDK 1.5, dashboard 0.10.0) | **Done** |
| 12 | Tests (~50-70) | **Done** — 70 new |

---

## SSRF policy implementation notes

- **reqwest hook.** `crates/manager-core/src/ssrf.rs` defines `SsrfResolver`
  implementing `reqwest::dns::Resolve`, plugged in via
  `ClientBuilder::dns_resolver`. It delegates to the system resolver
  (injectable for tests — see DNS-rebinding test), then filters each `IpAddr`
  through `SsrfPolicy::check`. Because reqwest re-resolves at every connection
  (including each redirect hop), the policy applies post-redirect too.
- **`dns_resolver` is generic over a concrete `R: Resolve`** (stores `Arc<R>`),
  so the resolver is passed as `Arc<SsrfResolver>`, not `Arc<dyn Resolve>`.
- **Literal-IP gap closed.** reqwest only routes *hostnames* through the custom
  resolver — a URL with a literal-IP host (`http://127.0.0.1/`) bypasses it
  entirely. The impl therefore *also* runs `SsrfPolicy::check` on literal-IP
  hosts at URL-parse time (`validate_url`), on every hop. Both paths are tested.
- **IPv4-mapped IPv6 re-check.** `check_v6` calls `Ipv6Addr::to_ipv4_mapped()`;
  if `Some`, it re-runs the v4 deny-list against the embedded address
  (`::ffff:127.0.0.1` → denied as "loopback").
- **Applied before AND after redirects.** Redirects are followed *manually*
  (client built with `redirect(Policy::none())`) so per-request
  `follow_redirects`/`max_redirects` are honored; each hop re-validates
  scheme/port + literal-IP and re-resolves hostnames through the SSRF resolver.
- **Script-visible error format.** `"http: blocked by SSRF policy: <reason>"`
  where `<reason>` is a CIDR category (`loopback`, `private`, `link-local`,
  `carrier-grade-nat`, `multicast`, `reserved`, `unique-local`, `unspecified`).
  **The resolved IP is never included.** The all-addresses-denied case surfaces
  as `Ssrf` (not a generic DNS error) via a marker error the resolver emits and
  the impl detects by walking the reqwest error source chain.

## Cron scheduler implementation notes

- **Catch-up = fire-once.** Matches the brief; no deviation. `next_due` returns a
  single canonical scheduled-at (first slot after `last_fired_at`, or
  `created_at` if never fired); after firing, `last_fired_at = now`, so the next
  tick sees only future slots. **Verified live** against Postgres: an
  every-second (`* * * * * *`) trigger with a 2s tick advanced `last_fired_at`
  ~once per 2s, not once per second.
- **No ExecutionGate contention.** The scheduler only enqueues to the outbox
  (one row per due trigger per tick, in a `FOR UPDATE OF d SKIP LOCKED`
  transaction that also bumps `last_fired_at`). The existing dispatcher acquires
  the gate and delivers it identically to kv/docs/dead_letter — verified live
  (the cron outbox row was consumed, the script executed, the row deleted).
- **Timezone handling.** `chrono-tz`. Invalid IANA names are rejected at the
  admin endpoint with a 422 (`TriggersApiError::Invalid`, message contains
  "timezone"); the repo re-validates defensively before insert.
- **Schema beyond the brief:** none. Followed the brief exactly — `schedule`,
  `timezone DEFAULT 'UTC'`, `last_fired_at`, `idx_cron_triggers_due`. **No**
  stored `next_scheduled_at` column (an exploration agent suggested one; the
  brief computes next-fire in-process, which I followed).

---

## Tests added (70 new)

- **SSRF policy + resolver (`ssrf.rs`, 20):** one per deny CIDR (127/8, 0/8,
  10/8, 172.16/12, 192.168/16, 169.254/16 incl. metadata, 100.64/10, 224/4,
  240/4, ::1, ::, fe80::/10, fc00::/7, ff00::/8); 172.x outside-range allowed;
  public v4/v6 allowed; IPv4-mapped re-check; `allow_private` disables all;
  resolver returns only allowed addrs; all-denied → SSRF marker; **DNS rebinding**
  (mock resolver: public then private — second denied); empty resolution ≠ SSRF.
- **HTTP client (`http_service.rs`, 16):** GET/POST round-trips vs a hand-rolled
  `TcpListener`; body dispatch + default UA; custom UA override; empty body;
  non-2xx no-error; response cap via Content-Length; response cap mid-stream
  (no Content-Length); request body cap pre-send; redirect-to-max-then-throw;
  scheme rejection (file/ftp/gopher); port rejection (22/25/465/587); SSRF
  literal-loopback; SSRF hostname-resolves-to-loopback; timeout; authz
  (anon skips / member forbidden / member-with-role allowed).
- **Bridge integration (`sdk_http.rs`, 15):** real Rhai engine under
  `spawn_blocking` vs a recording fake — status+JSON body, non-JSON string,
  empty→`()`, Map→JSON, String→text, `()`→no body, headers+timeout forwarded,
  unknown opt key throws, timeout>max throws, non-2xx no-throw, network error
  throws `http:`, `post_form` url-encoding, `request` arbitrary method,
  default-UA carries `script_id`, `cx.app_id` forwarded for attribution.
- **Cron scheduler (`cron_scheduler.rs`, 11):** 6-field schedule accept / 5-field
  + malformed reject; IANA tz accept / reject; due/not-due; never-fired uses
  created_at; **catch-up fires exactly once after 5 missed windows**; timezone
  affects fire time; bad schedule/tz → None.
- **Cron admin (`triggers_api.rs`, 6):** create succeeds; invalid schedule;
  unknown timezone; **module target rejected** (v1.1.3 regression); **cross-app
  target rejected** (v1.1.3 regression); member-without-role forbidden.
- **Module redaction (2):** `modules.rs` — backend error redacted from the
  script-visible error (no leak); `module_redaction_logging.rs` — original error
  **is** logged at ERROR level (captured via a global tracing subscriber).

---

## Decisions beyond the brief (every prompt-default deviation, flagged)

1. **Three-arg split `verb(url, body, opts)`** (user-approved during planning).
   Diverges from the brief's documented two-arg `(url, opts)` shape and
   generalizes the escape hatch to `request(method, url, body, opts)`. Resolves
   the brief's internal contradiction (its Slack example `http::post(url, #{text:...})`
   passed a bare body map, which would be an "unknown opt key" under the
   two-arg rule). The `opts` vocabulary is now exactly
   `{headers, timeout_ms, follow_redirects, max_redirects}` — **`body_raw` was
   dropped** (raw strings go through the positional body as a String). The
   Slack example works unchanged (`#{text:...}` is the body).
2. **Cron crate = `cron` (0.12), not `croner`.** The brief allowed either; `cron`
   handles the 6-field-with-seconds format and named weekdays (`MON-FRI`) used in
   the brief's example, and integrates with chrono `Schedule::after`.
3. **Catch-up = fire-once** — matches the brief; called out explicitly as
   requested. No deviation.
4. **`SdkCallCx` gained a `script_id` field.** The brief's default User-Agent is
   `picloud/<v> (script:<script_id>)`, but `SdkCallCx` didn't carry the script
   id. Adding it (sourced from `ExecRequest.script_id` in the engine) is the
   clean home and doubles as the audit-attribution key the brief emphasizes. All
   19 construction sites updated. The dead-letters admin cx uses a fresh sentinel
   id (no script executes there).
5. **SSRF also blocks IPv6 unspecified `::` and IPv4 `0.0.0.0`** with reason
   "unspecified". `0.0.0.0/8` is in the brief's list; `::` is not explicitly but
   is an obvious sibling hole, so I blocked it too (defensible superset).
6. **No reqwest feature additions needed** — `dns_resolver` and `Response::chunk()`
   compile under the existing `default-features = false, features = ["json","rustls-tls"]`.
   No cookie jar (cookies feature is off, so there's no jar to disable). Added
   `url` as an executor-core dep (for `form_urlencoded` in `post_form`).

---

## How to verify locally (§8 attestation — run on this exact HEAD)

All four gates were run on the handback HEAD (the `feat(v1.1.4)` commit, before
this markdown commit):

```
cargo fmt --all -- --check                               → exit 0
cargo clippy --all-targets --all-features -- -D warnings → exit 0
cargo test --workspace                                   → 427 passed, 0 failed
(cd dashboard && npm run check)                          → 0 errors, 0 warnings (369 files)
```

This HANDBACK commit is pure markdown (no gate-relevant files), so the numbers
above hold for the final HEAD.

**Migrations — verified against a real Postgres (dev stack, port 15432):**
- Fresh-DB replay: the `#[sqlx::test]` schema-snapshot test applies all
  migrations on a fresh ephemeral DB and matches the (re-blessed) golden — passes.
- On-top-of-prior-state: booting `picloud` against a dev DB pinned at migration
  `0006` applied `0007…0017` cleanly (`"migrations applied"`); `_sqlx_migrations`
  max is now `17`; `cron_trigger_details` + widened CHECKs present.

**Live smoke performed:**
- Boot logged the `PICLOUD_HTTP_ALLOW_PRIVATE` warning and started the cron
  scheduler + HTTP service without panic.
- Seeded an every-second cron trigger → scheduler set `last_fired_at`, dispatcher
  consumed the outbox row and ran the script (row deleted on success), and
  `last_fired_at` advanced at the tick cadence (fire-once confirmed). Smoke data
  cleaned up afterward.
- HTTP GET / SSRF-block / body-dispatch behaviors are covered by the automated
  integration tests (real `TcpListener` round-trips + loopback/hostname SSRF
  blocks) rather than a manual curl flow, since a live SSRF-block smoke
  conflicts with the `PICLOUD_HTTP_ALLOW_PRIVATE` a local-server smoke requires.

To re-run the schema snapshot:
```
docker compose up -d postgres
DATABASE_URL=postgres://picloud:picloud@127.0.0.1:15432/picloud \
  cargo test -p picloud-manager-core --test schema_snapshot -- --include-ignored
```

---

## ⚠️ Latent issue found: stale schema-snapshot golden

`crates/manager-core/tests/expected_schema.txt` was **significantly stale** — the
committed golden was missing many tables from prior releases
(`abandoned_executions`, `dead_letters`, `dead_letter_trigger_details`,
`docs_*`, etc.). The `schema_snapshot` test is `#[ignore]` (needs a DB), so it
was apparently never re-blessed across v1.1.1–v1.1.3 and silently drifted.

I re-blessed it, so the diff is large (+217 lines) but **only `cron_trigger_details`
+ the two widened CHECK constraints are v1.1.4-new** — the rest is pre-existing
drift correction. The blessed golden now matches a clean replay (verified).
Recommend the reviewer skim the diff to confirm, and consider whether the
`#[ignore]` should be lifted in CI (with a DB service) so the golden can't drift
again.

---

## Latent security findings

None new beyond the (already-known, already-closed-in-v1.1.3) cross-app trigger
gap, which §10c now documents in the CHANGELOG. The SSRF surface is the main
security mechanism in this release; see the SSRF notes above for the
defense-in-depth layering (resolver hook + literal-IP check + per-hop
re-validation + IP-never-leaked errors).

One thing for the reviewer to weigh: the SSRF policy is a hardcoded deny-list
with no per-app allow-list (deferred to v1.2 per the brief). An operator who
needs a script to reach a private service has only the all-or-nothing
`PICLOUD_HTTP_ALLOW_PRIVATE` global escape hatch today.

## Open questions for the reviewer

1. **Three-arg HTTP shape** (decision #1) — confirm you're happy with
   `verb(url, body, opts)` + dropping `body_raw`, vs the brief's documented
   two-arg form. This is the one user-facing API-shape divergence.
2. **Stale schema golden** — OK to land the full re-bless in this PR, or would
   you prefer the drift correction split out?

## Deferred items (per the brief's OUT list — not built)

WebSocket/SSE, streaming responses, HTTP/3, per-app outbound allow/deny lists,
per-app rate limits, mTLS, request signing, cookie jar, cron backfill replay,
cron next-fire preview, cron schedule history, drift compensation,
module-import-over-HTTP, files/pubsub/secrets/email/users/queue.

## Known limitations / rough edges

- The dashboard Triggers tab lists all trigger kinds but only *creates* cron
  triggers (kv/docs creation remains API-only, unchanged from before). No
  next-fire-at preview (deferred to v1.2).
- `post_form` / body field order follows Rhai map iteration order
  (`BTreeMap`-backed, so sorted/deterministic; not insertion order).
- The cron scheduler tick is floored at 1s; sub-second schedules effectively
  fire at the tick cadence (by design — see the fire-once policy).
- The stale REVIEW.md at repo root is the v1.1.3 reviewer's artifact; the
  v1.1.4 reviewer should overwrite it.