Files
PiCloud/HANDBACK.md
MechaCat02 3cfb795206 docs(v1.1.7): handback report
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 22:46:33 +02:00

331 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v1.1.7 — Configuration & Email — HANDBACK
**Branch:** `feat/v1.1.7-secrets-email` (9 commits off `main`, not pushed)
**Status:** ready for review. NOT merged, NOT pushed, no PR opened.
```
a7d3dad chore(v1.1.7): re-bless schema snapshot for secrets + email migrations
2ea47eb chore(v1.1.7): fix clippy --all-targets warnings
b355851 chore(v1.1.7): version bumps + CHANGELOG
fffcdf6 feat(v1.1.7-realtime-migration): encrypt signing keys at rest
02335a8 fix(v1.1.7-dead-letter): wire dispatcher → list_matching_dead_letter
1f78937 feat(v1.1.7-email-inbound): webhook receiver + email:receive trigger
8f2d2bc feat(v1.1.7-email-outbound): SMTP send/send_html
2d11090 feat(v1.1.7-secrets): secrets SDK + table + admin API + dashboard
dc2e4fa feat(v1.1.7-crypto): master-key infra + encryption helpers
```
---
## 1. Scope coverage
| Item | Status |
|---|---|
| Encryption infrastructure (master key + AES-256-GCM envelope) | **Done** |
| `secrets::*` SDK + `0023_secrets.sql` + admin API + dashboard tab | **Done** |
| Outbound email `email::send` / `email::send_html` (lettre SMTP) | **Done** |
| Inbound email webhook receiver + `email:receive` trigger + `0024` | **Done** (full scope, per user decision) |
| Dispatcher routing for email | **Done** |
| dead_letter handler wiring fix | **Done** |
| Realtime signing-key encryption (two-phase) + `0025` | **Done** |
| Dashboard (Secrets tab, email trigger form, `npm run check`) | **Done** |
| Version bumps (1.1.7 / SDK 1.8 / dashboard 0.13.0) + CHANGELOG | **Done** |
| Tests (match v1.1.5/v1.1.6 density) | **Done** |
Nothing deferred from scope-in. Inbound email (the deferrable-if-scope-
blew-up piece) was implemented in full.
---
## 2. Encryption infrastructure notes
- **Module:** `crates/shared/src/crypto.rs` (`picloud_shared::crypto`).
- **Master-key sourcing** (`MasterKey::from_env``resolve`):
- `PICLOUD_SECRET_KEY` = base64 of exactly 32 bytes. Missing →
`MasterKeyError::Missing` (fatal); non-base64 → `Malformed`; wrong
length → `WrongLength`. **Sourced in `main.rs::run_server` before any
DB work** — `build_app` takes the `MasterKey` as a parameter (so
tests pass a fixed key and don't mutate process env).
- Dev fallback: deterministic key (`SHA-256("picloud-dev-master-key-v1.1.7")`)
used ONLY when `PICLOUD_SECRET_KEY` is unset **AND**
`PICLOUD_DEV_MODE=true`, with a prominent `warn!`. No quiet
unencrypted mode.
- **aes-gcm version:** `0.10` (features `aes`, `alloc`). `Aes256Gcm`.
- **Nonce generation:** 12 bytes from `rand::thread_rng().fill_bytes`
(OS-CSPRNG-seeded), per-encryption.
- **Storage layout:** ciphertext **with the 16-byte GCM auth tag
appended** (RustCrypto `Aead`-trait layout — `encrypt` returns
`ciphertext || tag`, `decrypt` consumes the same). The 12-byte nonce is
stored in a separate column. `MasterKey`'s `Debug` is redacted.
- **Plaintext cap (secrets):** 64 KB default, enforced in
`secrets_service::seal` (the SDK boundary) → `SecretsError::TooLarge`
with limit + actual size. Override: `PICLOUD_SECRET_MAX_VALUE_BYTES`.
- **Key rotation:** out of scope. Documented in CHANGELOG + the module
docs that changing `PICLOUD_SECRET_KEY` orphans all ciphertext.
---
## 3. Secrets notes
- `SecretsService` (trait, `picloud-shared`) → `SecretsServiceImpl` +
`PostgresSecretsRepo` (`manager-core`) → Rhai bridge
(`executor-core/src/sdk/secrets.rs`). Collection-less; `app_id` from
`cx.app_id`.
- **JSON round-trip:** `set` serializes the value to JSON bytes, caps,
encrypts; `get` decrypts + deserializes — a String returns a String
(not a JSON-quoted `"\"…\""`). Verified by unit + bridge tests.
- **No ServiceEvent emission** (secret writes don't fire triggers).
- Admin API: `GET/POST/DELETE /api/v1/admin/apps/{id}/secrets`; list
returns names + `updated_at` only.
- Authz: `Capability::AppSecretsRead/Write``script:read`/`script:write`.
No new Scope variants (seven-scope commitment held).
---
## 4. Email implementation notes
- **SMTP transport:** `lettre 0.11` (`smtp-transport`,
`tokio1-rustls-tls`, `builder`, `hostname`). **Connection model:** one
connection per call (lettre default); pooling deferred to v1.2. The
transport sits behind an internal `EmailTransport` trait so the service
is unit-tested with a recording fake (no live SMTP).
- **Disabled mode:** if HOST/USER/PASSWORD aren't all set,
`EmailServiceImpl::from_env` builds no transport and every `send`
returns `NotConfigured` (warned at startup). A malformed relay
descriptor is also logged and yields disabled mode (email is
non-critical; never blocks startup).
- **Address validation:** hand-rolled RFC 5322-ish pre-check (single `@`,
non-empty local part, domain contains a dot, ≤320 bytes) followed by a
`lettre::Mailbox` parse (the authoritative validator). No deliverability
check.
- **Size cap:** 25 MB on `message.formatted()`,
`PICLOUD_EMAIL_MAX_MESSAGE_BYTES`.
- `email::send` forces text-only (ignores any `html`); `email::send_html`
requires `html` and builds `MultiPart::alternative_plain_html`.
`reply_to` defaults to `from`. `to`/`cc`/`bcc` accept a String or an
Array of Strings.
- **Inbound normalization:** only the generic provider-agnostic JSON
shape `{from,to[],cc[],subject,text,html,message_id}` is accepted in
v1.1.7 — `from` required, rest default. Provider-specific unmarshallers
→ v1.2. The expected shape is documented on the dashboard email-trigger
form.
---
## 5. Dead-letter handler fix notes
- **Call site:** `dispatcher::handle_failure`, the retry-exhaustion
branch. After `DeadLetterRepo::insert` (which returns the new
`DeadLetterId`), a new helper `fan_out_dead_letter` runs.
- **What it does:** calls `TriggerRepo::list_matching_dead_letter(app_id,
source, row.trigger_id, Some(resolved.script_id))` (the method that had
no production caller) and inserts one outbox row per match
(`source_kind = DeadLetter`, the DL trigger's id + handler script id,
`trigger_depth + 1`, `origin_principal = the DL trigger's registered
principal`).
- **Payload — built from the REAL `TriggerEvent::DeadLetter` variant**,
not the brief's §6 field list (see §7 deviations): `{ dead_letter_id,
original: Box::new(decoded row payload), attempts, last_error,
trigger_id, script_id, first_attempt_at, last_attempt_at }`. If the
outbox payload can't be decoded back into a `TriggerEvent` (so the
nested `original` can't be built), the fan-out is skipped — the
dead-letter row is still durably written.
- **Recursion-stop:** unchanged. The `is_dead_letter_handler`
short-circuit at the top of `handle_failure` returns before the
exhaustion branch, so a DL handler's own failure is never re-dead-
lettered. No new guard needed.
- **Tests verify the handler actually fires**
(`crates/picloud/tests/dispatcher_e2e.rs`, DB-gated):
`dispatcher_delivers_dead_letter_to_handler` now asserts BOTH row-create
AND handler-fire (inline doc updated);
`dispatcher_delivers_dead_letter_to_handler_actually_fires` asserts the
nested `original` KV event + `last_error`;
`dead_letter_source_filter_excludes_nonmatching` exercises the source
filter dimension; `dead_letter_handler_failure_does_not_recurse` proves
the recursion-stop (count stays at 1).
---
## 6. Realtime signing-key migration notes
- **Two-phase**, as recommended. `0025_encrypt_realtime_keys.sql` adds
NULL-able `realtime_signing_key_encrypted` + `realtime_signing_key_nonce`
and `DROP NOT NULL` on the plaintext column (so new keys can be stored
encrypted-only).
- **Repo:** `PostgresAppSecretsRepo` now holds the `MasterKey`. New keys
are written encrypted-only; the read path (`signing_key` /
`get_or_create_signing_key`) prefers the encrypted columns and falls
back to plaintext during the compat window (pure `decode_signing_key`
helper, unit-tested for all four precedence states).
- **Startup task:** `migrate_plaintext_keys()` runs once in `build_app`
(after the master key is loaded), encrypting any rows that still have
plaintext but no encrypted value. Plaintext is **left in place** for
rollback safety. Idempotent.
- **Plaintext column drop:** deferred to **v1.1.8** (documented in
CHANGELOG + the migration). Operators must upgrade through v1.1.7
(which performs the encryption) before v1.1.8.
- SSE keeps working: `RealtimeAuthorityImpl` is unchanged (it calls
`signing_key`). Verified by the pubsub e2e + unit tests; the dev DB
applied 0025 + the startup encryption cleanly during the test run.
---
## 7. Decisions beyond the brief / deviations flagged
1. **`inbound_secret` stored ENCRYPTED (user-approved deviation).** The
brief defaulted to a plaintext `inbound_secret` column on
`email_trigger_details`; the user chose to encrypt it via the master
key. Implemented: `0024` stores `inbound_secret_encrypted` +
`inbound_secret_nonce`; the admin endpoint seals the secret (as a JSON
string, via the secrets `seal` helper); the receiver `open`s it per
inbound POST to verify the HMAC. **Trade-off:** one AES-GCM decrypt per
inbound request on the hot path — negligible vs. the HMAC + DB
round-trip already there. The decrypted secret is never logged.
2. **Brief-internal contradiction flagged, not reinterpreted — §6
`TriggerEvent::DeadLetter` field names.** The brief's §6 sketches the
payload as `{source, op, original_event_id, original_payload,
attempt_count, last_error, …}`. The actual variant
(`crates/shared/src/trigger_event.rs`) is `{dead_letter_id, original:
Box<TriggerEvent>, attempts, last_error, trigger_id, script_id,
first_attempt_at, last_attempt_at}`. I built the payload from the
**real** variant (which the brief itself instructs to "verify
serializes correctly"). No type change needed.
3. **`build_app` signature gained a `MasterKey` parameter.** Rather than
sourcing the key inside `build_app` (which would force every e2e test
to set process env), `main.rs` sources it and passes it in. The 3
existing `build_app` test callers pass a fixed test key.
4. **Pre-existing clippy warnings fixed (see §10).** Four warnings predate
this work; I fixed them in a dedicated commit so the `-D warnings`
gate is green, and flag them as a latent finding.
5. **Email-trigger retry settings** use the standard async defaults
(3 attempts, exponential, 1000 ms) — the brief didn't specify; matches
the cron/kv default shape.
No other deviations from prompt-specified defaults.
---
## 8. How to verify locally — §8 attestation (sourced from cargo's literal output)
All gates run on the handed-back HEAD (`a7d3dad`):
```sh
cargo fmt --all -- --check # clean
cargo clippy --all-targets --all-features -- -D warnings # clean (exit 0)
cd dashboard && npm run check # 0 ERRORS 0 WARNINGS (371 files)
```
Full test run **with `DATABASE_URL` set** so the DB-gated suites
(schema_snapshot, dispatcher_e2e ×9, email_inbound ×8) execute:
```sh
DATABASE_URL='postgres://picloud:picloud@127.0.0.1:15432/picloud' \
cargo test --workspace -- --test-threads=2
```
**Pass count, summed from cargo's literal output (NOT hand-counted):**
```sh
DATABASE_URL=... cargo test --workspace -- --test-threads=2 2>&1 | \
awk '/test result: ok\./ { gsub(";", ""); sum += $4 } END { print sum }'
# => 617
```
**617 passed, 0 failed** across the workspace (34 `test result:` lines,
0 `FAILED`). Largest binaries: 290 (manager-core lib), 74, 43, 32, 30;
plus `dispatcher_e2e` (9) and `email_inbound` (8).
**Bounded-parallelism note (`--test-threads=2`):** the picloud e2e
binaries each call `build_app`, which opens its own Postgres pool. Under
full default parallelism against the *shared dev* Postgres, ~9 concurrent
`build_app`s exhaust connections and a couple of e2e tests flake on
timeout (observed: `dispatcher_delivers_pubsub_to_handler`,
`dead_letter_handler_failure_does_not_recurse`). They pass reliably at
`--test-threads=2` and in isolation. CI's dedicated fresh `postgres:15`
(not a shared dev DB) does not hit this. Environmental, not a correctness
issue — flagged so the reviewer runs the DB-gated suite with bounded
parallelism (or on CI).
**Migrations:** apply cleanly on the v1.1.6 dev DB (0023→0025 applied
during the test run) and the schema-snapshot guardrail passes after
re-bless. The `BLESS` diff was exactly the new tables/columns/constraints
(secrets, email_trigger_details, app_secrets encrypted columns +
NULL-able plaintext, widened kind/source CHECKs, migrations 00230025) —
no unrelated drift.
**Manual smoke:** the e2e suite covers secrets set/get/delete/list,
inbound signed POST → handler fires with `ctx.event.email`, dead-letter
handler fires, realtime-key encryption + SSE. Outbound email to a live
relay (mailtrap) was NOT exercised (no SMTP configured in this
environment) — asserted instead via recording-transport unit tests
(To/From/Subject/body, multipart parts, cc/bcc, reply_to).
---
## 9. Open questions for the reviewer
1. **§8 bounded-parallelism caveat** — acceptable, or should the e2e
harness share a single `build_app`/pool across tests in a binary?
(Out of v1.1.7 scope; the existing v1.1.6 e2e tests have the same
shape.)
2. **`email::send` ignoring a stray `html` key** (forcing text-only) vs.
throwing — I chose forgiving text-only; happy to make it strict.
3. **Inbound `received_at`** is stamped by the receiver (`Utc::now()`),
not read from a provider header — confirm that's the intended
semantics.
---
## 10. Latent security / correctness findings
1. **`clippy --all-targets --all-features -- -D warnings` did NOT pass at
v1.1.6 HEAD** (verified by stashing this branch and re-running clippy
on the committed slice-1 tree). Four pre-existing warnings:
`double_must_use` on `realtime_router`, `map_unwrap_or` in
`pubsub_service`, `redundant_closure` in `topic_repo`,
`needless_raw_string_hashes` in a subscriber-token test. Fixed all four
(commit `2ea47eb`) so the gate is now green — flagging because it means
prior "clippy green" claims were likely run without `--all-targets`
(which compiles the test binaries).
2. **Inbound HMAC fails closed on decrypt error.** If a stored
`inbound_secret` can't be decrypted (e.g. `PICLOUD_SECRET_KEY`
rotated), the receiver returns 401 — it refuses the POST rather than
silently skipping verification. Intentional.
3. **No rate limiting on the public inbound-email endpoint.** Like every
public data-plane route, `/api/v1/email-inbound/...` is
unauthenticated by design (URL + HMAC are the gate). An unsigned
trigger (no `inbound_secret`) accepts any POST to its URL and enqueues
outbox rows — URL secrecy is the only guard, as documented. Mitigation
is operator-level (Caddy) rate limiting, the same answer as for other
public routes; no new gap introduced, but noted.
---
## 11. Deferred items (unchanged from brief)
Master-key rotation / per-app master key (v1.2); native SMTP listener
(v1.3+); provider-specific inbound unmarshallers, inbound attachments,
outbound SMTP connection pooling, per-app `from` validation / SPF / DKIM
(v1.2 / operator); dashboard inbound payload viewer (v1.2, PII); drop the
plaintext `realtime_signing_key` column (v1.1.8); secrets
versioning/history + secrets-change triggers (never); `users::*` (v1.1.8);
`queue::*` / `invoke()` (v1.1.9).
---
## 12. Known limitations
- Production `EmailTransport` is a per-call connection; high outbound
volume is connection-churn-bound until pooling (v1.2).
- Outbound `email::send` was not smoke-tested against a live relay in
this environment (no SMTP configured); the SMTP message contents are
asserted via recording-transport unit tests.
- The §8 DB-gated run requires bounded parallelism on a shared Postgres
(see §8); CI's dedicated Postgres does not.