PiCloud/REVIEW.md

# v1.1.7 Audit & Review

**Branch:** `feat/v1.1.7-secrets-email`
**Base:** `main` (v1.1.6 head)
**Commits ahead:** 10 (8 substantive + 1 chore-clippy-fix + 1 handback)
**HEAD audited:** `3cfb795`
**Audited by:** reviewer (this report)
**Audited against:** the v1.1.7 dispatch prompt + the v1.1.1–v1.1.6 patterns it mandated
**Iterations:** 1

## Verdict

**APPROVE — ready to merge to `main` as v1.1.7.**

Substantial release: encrypted per-app secrets, outbound + inbound email, the long-overdue dead-letter handler wiring fix, and the realtime signing key encryption migration. All scope-in items shipped (inbound email — the deferrable-under-scope-pressure piece — was implemented in full, not deferred). 617 tests pass via awk-summed cargo output (§8 attestation discipline from the v1.1.6 retro landed). Gates green.

Three flagged items in HANDBACK §7/§9/§10, all transparent and correct calls:

1. **Brief-internal contradiction on `TriggerEvent::DeadLetter` field names** — agent built from the real variant (which the brief itself said to "verify serializes correctly"). The v1.1.6 retro discipline lesson (flag-don't-reinterpret) working again.

2. **`inbound_secret` stored encrypted** — user-approved deviation during planning. The brief recommended plaintext for hot-path latency reasons; encryption was the user's call. Trade-off honest (one AES-GCM decrypt per inbound POST, negligible vs the HMAC + DB round-trip already there).

3. **Latent finding: clippy `--all-targets` didn't pass at v1.1.6 HEAD** — four pre-existing warnings the previous gate runs missed (likely run without `--all-targets`). Fixed in a dedicated commit. **This is a real audit finding that affects every prior REVIEW.md from v1.1.1 onward.**

The dead-letter handler wiring bug from v1.1.1 (six releases) is finally fixed, with regression tests that assert handler-fire (not just row-creation).

---

## 1. Static checks reproduced (HEAD `3cfb795`)

```
cargo fmt --all -- --check                                ✅ exit 0
cargo clippy --all-targets --all-features -- -D warnings  ✅ exit 0  (now actually green; see §5)
cargo test --workspace (DATABASE_URL set, --test-threads=2) ✅ 617 passed / 0 failed
```

Sum via the v1.1.7 discipline awk pattern:

```sh
cargo test --workspace 2>&1 | awk '/test result: ok\./ { gsub(";", ""); sum += $4 } END { print sum }'
# => 617
```

Matches HANDBACK §8 exactly. **The §8 discipline refinement from the v1.1.6 retro is working.**

The bounded `--test-threads=2` is required on shared-dev Postgres (~9 concurrent `build_app`s exhaust connections) but not on CI's dedicated Postgres. Acceptable environmental nuance; flagged in HANDBACK §8.

## 2. Design conformance (spot-checks)

| Decision / requirement | Where it lives | Verdict |
|---|---|---|
| **AES-256-GCM with 12-byte CSPRNG nonce + 16-byte appended auth tag** | [shared/src/crypto.rs:71-85](crates/shared/src/crypto.rs#L71-L85) | ✅ Uses `aes-gcm 0.10`; nonce from `rand::thread_rng().fill_bytes`; RustCrypto Aead layout (tag appended) |
| `MasterKey` redacts Debug; cheap to clone | shared/src/crypto.rs MasterKey impl | ✅ Per HANDBACK §2 |
| `PICLOUD_SECRET_KEY` required (fatal if missing); dev-mode fallback requires explicit `PICLOUD_DEV_MODE=true` | crypto.rs MasterKey::from_env + resolve | ✅ No quiet "unencrypted mode" path |
| `MasterKey` threaded into `build_app` (test-friendly) | [picloud/src/lib.rs:build_app](crates/picloud/src/lib.rs) | ✅ Parameter, not env-sourced — tests can pass a fixed key |
| 64 KB plaintext cap per secret | secrets_service::seal | ✅ `PICLOUD_SECRET_MAX_VALUE_BYTES` override |
| Generic GCM auth-failure error (no wrong-key vs tampered distinction) | crypto.rs CryptoError::Decrypt | ✅ By design — leaking which failure case happened weakens the integrity guarantee |
| `secrets` table with `(app_id, name)` PK, encrypted bytea + 12-byte nonce | [0023_secrets.sql](crates/manager-core/migrations/0023_secrets.sql) | ✅ |
| `secrets::*` SDK — collection-less, JSON type round-trip | [executor-core/src/sdk/secrets.rs](crates/executor-core/src/sdk/secrets.rs) + secrets_service.rs | ✅ String comes back as String (not JSON-quoted) |
| Cross-app isolation in secrets | secrets_service via `cx.app_id` | ✅ Test asserts |
| `Capability::AppSecretsRead/Write` → `script:read/write` | manager-core::authz | ✅ Seven-scope commitment held |
| No `ServiceEvent` emission for secret writes | secrets_service | ✅ Per brief — secret-change triggers are a footgun |
| Outbound email via `lettre 0.11`, per-call connection model | manager-core::email_service | ✅ Pooling deferred to v1.2 per brief |
| Disabled mode when SMTP env vars missing | EmailServiceImpl::from_env | ✅ Startup warn; every `send` returns `NotConfigured` |
| `email::send_html` builds MultiPart alternative_plain_html | email_service.rs send_html path | ✅ |
| `to/cc/bcc` accept String or Array of Strings | sdk/email.rs bridge | ✅ |
| 25 MB message cap, env-overridable | email_service | ✅ `PICLOUD_EMAIL_MAX_MESSAGE_BYTES` |
| RFC 5322-ish pre-validation + lettre Mailbox parse | email_service::validate | ✅ |
| Inbound webhook receiver `POST /api/v1/email-inbound/{app_id}/{trigger_id}` | crates/picloud/src/lib.rs or orchestrator-core | ✅ Per [picloud/tests/email_inbound.rs](crates/picloud/tests/email_inbound.rs) test coverage |
| Inbound: 202 success, 401 HMAC fail, 404 missing/wrong-kind, 422 malformed | email_inbound.rs tests | ✅ All four status codes pinned by tests |
| `email_trigger_details` schema with HMAC secret | [0024_email_triggers.sql](crates/manager-core/migrations/0024_email_triggers.sql) | ✅ |
| `TriggerEvent::Email` shape: from/to/cc/subject/text/html/received_at/message_id | trigger_event.rs | ✅ |
| **Dead-letter handler fix: `list_matching_dead_letter` called from `dispatcher::handle_failure`** | [dispatcher.rs:498-501 + fan_out_dead_letter](crates/manager-core/src/dispatcher.rs#L498-L501) | ✅ Wired exactly as specified; built from the real `TriggerEvent::DeadLetter` variant |
| Recursion-stop preserved: handler failures don't re-dead-letter | dispatcher.rs `is_dead_letter_handler` short-circuit at top of handle_failure | ✅ No new guard needed — the existing flag fires before reaching the exhaustion branch |
| Best-effort fan-out: lookup/insert failures logged, not propagated | fan_out_dead_letter at dispatcher.rs:541-545 + 562-565 | ✅ Dead-letter row durably written; handler fan-out is secondary |
| **Two-phase realtime key migration: encrypted columns added NULL-able + plaintext kept** | [0025_encrypt_realtime_keys.sql](crates/manager-core/migrations/0025_encrypt_realtime_keys.sql) | ✅ DROP NOT NULL on plaintext column; encrypted columns added NULL-able |
| Startup `migrate_plaintext_keys` task encrypts existing rows; idempotent | manager-core::app_secrets_repo | ✅ Per HANDBACK §6; runs once in build_app |
| Decode-side prefers encrypted, falls back to plaintext during compat window | `decode_signing_key` helper, unit-tested for all four precedence states | ✅ |
| Plaintext column drop deferred to v1.1.8 + documented | CHANGELOG + migration header | ✅ |
| Versions: workspace 1.1.6→1.1.7, SDK 1.7→1.8, dashboard 0.12.0→0.13.0 | Cargo.toml + version.rs + package.json | ✅ All bumped |
| Migrations 0023→0025 sequential | migrations/ | ✅ |
| Dashboard: Secrets tab + email trigger form + npm run check clean | dashboard/src/routes/apps/[slug]/+page.svelte | ✅ Per HANDBACK |

## 3. The three flagged items

### 3.1 Brief-internal contradiction: `TriggerEvent::DeadLetter` field names (HANDBACK §7 #2)

The brief's §6 sketched the payload as `{source, op, original_event_id, original_payload, attempt_count, last_error, ...}`. The actual variant in `crates/shared/src/trigger_event.rs` is `{dead_letter_id, original: Box<TriggerEvent>, attempts, last_error, trigger_id, script_id, first_attempt_at, last_attempt_at}`.

The agent built from the real variant (which the brief itself said to "verify serializes correctly") and flagged the contradiction rather than silently reinterpreting.

**Verdict: correct call.** The v1.1.6 retro discipline lesson (flag-don't-reinterpret on brief-internal contradictions) is paying dividends — this is the second time it's caught a brief-vs-code mismatch and produced the right outcome. Worth folding into the v1.1.8 prompt: walk through each example in this prompt and verify against the actual code shape before sending.

### 3.2 `inbound_secret` stored encrypted (HANDBACK §7 #1)

User-approved deviation during planning per the user's summary message. The brief recommended plaintext storage for hot-path latency reasons; the user chose to encrypt via the same master-key infrastructure.

**Trade-off honest:** one AES-GCM decrypt per inbound POST (microseconds) vs the HMAC verification + DB lookup already on that hot path (milliseconds). The decrypt is negligible.

**Verdict: accept the deviation.** Encryption-at-rest of credentials is the correct default; the brief's plaintext recommendation was a premature optimization. The agent took the right path. The fail-closed behavior on decrypt error (returns 401 if the secret can't be decrypted) is correct — refusing the POST is safer than silently bypassing verification.

### 3.3 Latent finding: clippy `--all-targets` regression (HANDBACK §10 #1)

This is the most important finding in this review.

The agent verified by stashing v1.1.7 work and re-running clippy on v1.1.6 HEAD with `--all-targets --all-features -- -D warnings` — four pre-existing warnings surfaced:
- `double_must_use` on `realtime_router`
- `map_unwrap_or` in `pubsub_service`
- `redundant_closure` in `topic_repo`
- `needless_raw_string_hashes` in a subscriber-token test

The warnings landed in v1.1.6 itself (the realtime_router was new). The clippy gate v1.1.6 claimed to pass (and that I personally re-ran during the v1.1.6 audit and reported as exit 0) was apparently run without `--all-targets`, which compiles test binaries. Test-only clippy warnings escape.

**This is a real audit oversight.** My v1.1.6 REVIEW.md §1 reported `cargo clippy --all-targets --all-features -- -D warnings ✅ exit 0`. Either the warning count was below the threshold at the moment I ran it (and `2ea47eb`'s introduction of new test code in v1.1.7 tipped it over), or I genuinely missed the warnings. Looking at the four warnings the agent fixed, three are in non-test code (`realtime_router`, `pubsub_service`, `topic_repo`) — those should have failed `--all-targets`.

**Most likely explanation:** the clippy run during the v1.1.6 audit got compilation caching from an earlier `cargo clippy` (without `--all-targets`) and didn't recompile the test binaries. Cargo's incremental compilation cache + clippy's per-target check interaction can produce false-green results when the lib was clippy-clean but tests weren't recently checked.

**Action for the v1.1.8 prompt:** require a clean build before clippy:

```sh
cargo clean -p picloud-manager-core picloud-orchestrator-core picloud-executor-core picloud-shared picloud
cargo clippy --all-targets --all-features -- -D warnings
```

Or simpler: use `cargo clippy --workspace --all-targets --all-features --no-deps -- -D warnings` and verify that the test binary count matches what cargo says it compiled.

The agent fixed all four warnings in `2ea47eb` and gated v1.1.7 against the re-verified `--all-targets` baseline. Future audits should follow suit.

## 4. Substantive strengths

**1. The §8 attestation discipline lesson landed cleanly.** v1.1.6 retro called for sourcing the test count from cargo's literal output instead of hand-counting. The v1.1.7 HANDBACK §8 includes the literal awk command + the verified count of 617. My independent re-run matches exactly. Discipline working as designed.

**2. Encryption infrastructure correctly built.** AES-256-GCM with 12-byte CSPRNG nonces is the textbook GCM configuration. Auth tag appended (RustCrypto Aead trait standard). `Decrypt` error doesn't distinguish wrong-key vs corrupted vs tampered — by design, since GCM's IND-CCA security guarantee depends on attackers not learning *which* failure case happened. `MasterKey`'s redacted `Debug` impl prevents accidental log-leaks. Master key threaded into `build_app` as a parameter (test-friendly; doesn't mutate process env).

**3. Dead-letter handler fix is faithful and adequately tested.** Six releases of silently-broken triggers, finally connected. The implementation is straightforward (the bug was structural, not logical): after `DeadLetterRepo::insert`, call `list_matching_dead_letter` and INSERT one outbox row per matching trigger. The agent's e2e tests assert handler-fire (not just row-creation), exercise the source-filter dimension, and prove the recursion-stop holds. The retroactive CHANGELOG note from the v1.1.7 prompt is in place.

**4. Two-phase realtime key migration done right.** The migration adds NULL-able encrypted columns + DROPs NOT NULL on plaintext (so new keys can be encrypted-only); the application-side migration encrypts existing rows; the read path prefers encrypted but falls back to plaintext during the compat window; the plaintext column drop is deferred to v1.1.8 (documented in CHANGELOG + the migration header). Operator-friendly: rolling deploys work cleanly.

**5. Inbound email as webhook receiver was the right architectural call.** Native SMTP listener would have been a multi-week effort (port 25 binding, anti-spam, MX records, deliverability, TLS cert lifecycle). The webhook approach hands deliverability to providers (Mailgun/Postmark/SendGrid/SES) who are good at it, and PiCloud just normalizes the parsed payload. Reasonable v1.1.7 scope.

**6. Disabled-mode for outbound SMTP.** When SMTP env vars aren't set, every `send` throws `NotConfigured` cleanly. The brief specified this; the agent implemented it cleanly. Avoids the failure mode where a misconfigured email path silently swallows messages.

**7. The agent caught and surfaced the v1.1.6 clippy regression.** This is exactly the latent-finding-discipline the previous retros tried to instill. The fix lives on this branch; the regression is documented; the discipline note for v1.1.8 is the only follow-up.

## 5. Open questions answered

HANDBACK §9 raises three:

1. **§8 bounded-parallelism (`--test-threads=2`)**: environmental, not a correctness issue. Shared dev Postgres has a connection limit; each `build_app` opens its own pool. CI's dedicated Postgres doesn't hit this. **Accept as-is.** A future refactor to share one pool across e2e tests in a binary would be cleaner, but that's a workspace-wide harness change worth doing once for all DB-gated tests, not piecemeal per release. Defer to a dedicated e2e-harness pass.

2. **`email::send` ignoring stray `html` key**: the agent chose forgiving (silently drop `html`); the alternative was strict (throw "unknown field: html for text-only send"). **My read: forgiving is fine.** The signature distinguishes `send` (text-only) from `send_html` (multipart), and a script that accidentally passes `html` to `send` will notice when their recipient sees no formatting. Strict-throwing is also defensible; not worth changing.

3. **Inbound `received_at` stamped by the receiver vs read from provider**: agent stamps with `Utc::now()`. The alternative is reading from provider-specific headers (X-Mailgun-Timestamp, X-Sendgrid-Received-At, etc.), which requires provider unmarshallers that v1.1.7 deferred to v1.2. **Accept as-is.** Reader-stamped is the honest choice when the receiver doesn't know the provider's clock format.

## 6. Smaller observations

- **`build_app` signature gained `MasterKey` parameter (HANDBACK §7 #3).** Threading the key in from `main.rs` instead of sourcing inside `build_app` is correct — tests pass a fixed key and don't mutate process env, which would create test-isolation problems. The 3 existing `build_app` test callers were updated.
- **Email trigger retry defaults (HANDBACK §7 #5).** Standard async defaults (3 attempts, exponential, 1000 ms). Matches kv/docs/files/cron/pubsub. Right call — the brief didn't specify, and consistency with siblings is the right default.
- **The 10-commit split is exemplary.** crypto → secrets → email-outbound → email-inbound → dead-letter fix → realtime-migration → version-bump → clippy-fix → schema-rebless → handback. Each commit independently green. Best commit hygiene in any v1.1.x release.

## 7. Versioning audit

| File | Before | After | Status |
|---|---|---|---|
| Workspace `Cargo.toml` | 1.1.6 | 1.1.7 | ✅ |
| SDK schema (`shared/src/version.rs`) | 1.7 | 1.8 | ✅ correctly bumped — `SecretsService`, `EmailService`, `MasterKey`, `crypto::{encrypt, decrypt}`, `TriggerEvent::Email` added to public surface |
| Dashboard `package.json` | 0.12.0 | 0.13.0 | ✅ |
| Migrations | 0001..0022 | 0023..0025 added | ✅ sequential, no skips |
| CHANGELOG.md | v1.1.6 entry | v1.1.7 entry + retroactive dead_letter security note | ✅ Per prompt |

## 8. Recommended next steps (post-merge)

1. **Merge** `feat/v1.1.7-secrets-email` into `main` (fast-forward; branch is linear ahead).
2. **`docker compose down` when convenient** to tear down the dev Postgres container.
3. **Pause** before dispatching v1.1.8 (User Management).
4. **For the v1.1.8 dispatch prompt**, fold in:
   - **Drop the plaintext `realtime_signing_key` column** (the v1.1.7 phase-2 commitment). Pre-flight check: scan the column for any remaining non-NULL rows; if found, run the encryption migration before the drop migration. Add a CHANGELOG note that v1.1.8 requires v1.1.7 to have been applied first (no skipping versions).
   - **Clippy --all-targets discipline refinement** (§3.3 finding). Require either a `cargo clean` before `cargo clippy --all-targets` OR explicit verification that test binaries are being checked. v1.1.6's silent regression shows the gate can produce false-green results under cargo's incremental cache. Specific recommendation: add a CI step that asserts the clippy run touched the test binaries (e.g. count `Checking` lines in the output and verify they include test crates).
   - **`auth_mode = 'session'` for realtime subscriber tokens** — v1.1.7's CHECK constraint on `topics.auth_mode` only allows `('public', 'token')`. v1.1.8 (users::*) needs to add `'session'` and a session-token validator alongside the existing HMAC validator behind the unchanged `RealtimeAuthority` trait.
   - **Bounded e2e parallelism** — defer the workspace-wide harness refactor (shared pool per binary) until there's a dedicated test-infra release. Until then, CI just needs `--test-threads=2` or smaller for the picloud crate's e2e binaries.
5. **Awareness from §3.3**: the clippy regression in v1.1.6 was caught by v1.1.7's diligence, but every prior REVIEW.md from v1.1.1 onward should be re-checked if you want certainty that no test-only clippy warnings slipped through. The fix is forward-only — re-running clippy on v1.1.1 through v1.1.6 commits would just confirm the warnings were latent then too.

Branch is ready for merge. Verdict: **APPROVE**.