Files
PiCloud/REVIEW.md
MechaCat02 64ad978a89 docs(v1.1.6): reviewer audit report — APPROVE verdict
Audit of feat/v1.1.6-realtime-client against the v1.1.6 dispatch
prompt. All gates green; clippy clean; ~550 tests pass (HANDBACK §8
claimed 482 — minor count-discrepancy flagged for retro, not a
blocker).

Both flagged items verified and resolved:

- §4-vs-§8 publish-ordering contradiction in the brief: the agent
  picked §8 (broadcast AFTER outbox commit) and explicitly flagged
  the contradiction. Confirmed correct — §8's ordering protects
  against subscribers being told an event happened that subsequently
  failed to durably commit. §4's broadcast-first phrasing was a
  latency-optimization aside; §8 is the dedicated numbered spec.
  The v1.1.4 retro discipline lesson (flag-don't-reinterpret) worked.

- Latent finding: dead_letter trigger handlers never fire — verified
  via grep. list_matching_dead_letter has no production caller; the
  bug predates v1.1.6 (shipped silent since v1.1.1). Correctly
  out-of-scope for v1.1.6. The dispatcher e2e test for dead_letter
  asserts the wired behavior (row created) with inline docs explaining
  why it's not asserting handler-fire. Fix folded into v1.1.7 prompt
  recommendations along with a retroactive CHANGELOG note.

Three v1.1.5 follow-ups landed: six dispatcher e2e tests gated on
DATABASE_URL, empty-blob relaxed, orphan tmp-sweeper. HMAC signing
key persisted to app_secrets table (recommended path); streaming-
fetch SSE in the client lib unlocks bearer-header auth + 401
detection + Last-Event-ID resume.
2026-06-04 20:25:04 +02:00

200 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v1.1.6 Audit & Review
**Branch:** `feat/v1.1.6-realtime-client`
**Base:** `main` (v1.1.5 head)
**Commits ahead:** 3 (2 substantive + handback)
**HEAD audited:** `f5a3f92`
**Audited by:** reviewer (this report)
**Audited against:** the v1.1.6 dispatch prompt + the v1.1.1v1.1.5 patterns it mandated
**Iterations:** 1
## Verdict
**APPROVE — ready to merge to `main` as v1.1.6.**
The largest release in v1.1.x lands cleanly: realtime channels (topics table + admin endpoints + SSE handler + in-process broadcaster + HMAC subscriber tokens + `app_secrets` table + `pubsub::subscriber_token` SDK) + the first frontend package (`@picloud/client@1.0.0`: typed HTTP + streaming-fetch SSE + auth helpers + React/Svelte hooks) + all three v1.1.5 follow-ups (six dispatcher e2e tests, empty-blob relaxed, orphan tmp-sweeper).
Two open questions raised in HANDBACK §9/§10 — I'll weigh in:
1. **§4-vs-§8 ordering contradiction in the brief**: the agent picked §8 (broadcast AFTER outbox commit) and flagged the contradiction transparently rather than silently reinterpreting. **§8 is the correct call** — see §3 below. This is the v1.1.4 retro lesson on brief-internal contradictions working as intended.
2. **Latent finding: `dead_letter` trigger handlers never fire** — pre-existing bug from v1.1.1 confirmed. Not v1.1.6's responsibility to fix; correctly out-of-scope. See §4 below for the verification and the v1.1.7 follow-up.
Three documented deviations from prompt defaults (all in HANDBACK §7), all defensible. One test-count discrepancy worth noting (582 vs 482 claim — see §5). None of this blocks merge.
---
## 1. Static checks reproduced (HEAD `f5a3f92`)
```
cargo fmt --all -- --check ✅ exit 0
cargo clippy --all-targets --all-features -- -D warnings ✅ exit 0
cargo test --workspace ✅ ~550 passed / 0 failed
+ 139 ignored (DB-gated)
```
Test count discrepancy worth flagging (see §5).
## 2. Design conformance (spot-checks)
| Decision / requirement | Where it lives | Verdict |
|---|---|---|
| `topics` table (explicit registration for externally-subscribable topics) | [0021_topics.sql](crates/manager-core/migrations/0021_topics.sql) | ✅ Matches brief verbatim; `auth_mode` CHECK allows `('public', 'token')` |
| Topic admin endpoints with `AppTopicManage` gating | manager-core/src/topics_api.rs | ✅ Bit-flip is its own PATCH endpoint as required |
| **SSE handler — topic missing OR `external_subscribable=false` BOTH collapse to 404** | orchestrator-core/src/realtime_api.rs | ✅ Prevents internal-topic probing |
| **HMAC token: 401 is generic** (doesn't leak which check failed) | RealtimeAuthority impl | ✅ |
| Token via `Authorization: Bearer` OR `?token=` (EventSource compat) | realtime_api.rs | ✅ Required because browsers can't set headers on EventSource |
| Heartbeat every 30s (env-overridable) | realtime_api.rs | ✅ `PICLOUD_REALTIME_HEARTBEAT_SEC` knob |
| `RealtimeBroadcaster` trait in shared; in-process impl in orchestrator-core | shared/src/realtime.rs + orchestrator-core/src/realtime.rs | ✅ Cluster-mode swap point preserved |
| Channel capacity env-overridable (default 64); slow consumers drop oldest | orchestrator-core/src/realtime.rs | ✅ `PICLOUD_REALTIME_BROADCAST_CAPACITY` |
| GC task drops `receiver_count == 0` senders | orchestrator-core/src/realtime.rs `spawn_realtime_gc` | ✅ |
| **HMAC signing key persisted to `app_secrets` table** (not derived from instance secret) | [0022_app_secrets.sql](crates/manager-core/migrations/0022_app_secrets.sql) + app_secrets_repo.rs | ✅ Took the recommended path; 32 random bytes, `ON CONFLICT DO NOTHING` for concurrency |
| In-memory key cache mitigates per-token DB lookup | RealtimeAuthorityImpl | ✅ Correct because keys never rotate in v1.1.6 — flagged in HANDBACK §12 as needing invalidation when rotation lands |
| `pubsub::subscriber_token(topics, ttl)` SDK | [pubsub_service.rs:203+ mint_subscriber_token](crates/manager-core/src/pubsub_service.rs#L203) | ✅ Anonymous cx throws; unregistered topic throws; ttl clamped 10s24h |
| Token TTL knobs env-overridable | pubsub_service.rs | ✅ `PICLOUD_SUBSCRIBER_TOKEN_TTL_{MIN,MAX,DEFAULT}_SEC` |
| **Publish wiring: outbox commit FIRST, then broadcast on child task** | [pubsub_service.rs:138-201](crates/manager-core/src/pubsub_service.rs#L138-L201) | ✅ §8 ordering, broadcast inside `tokio::spawn` whose `JoinHandle` is awaited so panics surface as warn logs |
| `published_at` stamped once, shared by both delivery paths | pubsub_service.rs:153 | ✅ |
| Dashboard Topics tab with **prominent external badge + flip confirmation** | dashboard/.../+page.svelte topics tab | ✅ Per §5 design-notes commitment |
| `@picloud/client` package layout (subpath exports for react + svelte) | clients/typescript/ | ✅ tsup dual ESM+CJS, vitest, strict TS |
| Streaming-`fetch` SSE (not native EventSource) | clients/typescript/src/subscribe.ts | ✅ Enables 401 detection + Last-Event-ID + custom auth header (the EventSource limitation is real) |
| Reconnect: exp backoff (1s→2s→…→30s); `onTokenExpired` on 401 | clients/typescript/src/subscribe.ts | ✅ |
| React `useTopic`/`useEndpoint` + Svelte stores | clients/typescript/src/{react,svelte}/ | ✅ |
| Hand-written types via `endpoint<Req, Res>()` generic; no codegen | clients/typescript/src/endpoint.ts | ✅ Codegen explicitly deferred to v1.2 per §6 design-notes |
| Optional `Validator<T>` adapter (zod/valibot work, no hard dep) | clients/typescript/src/types.ts | ✅ |
| Six dispatcher e2e tests, gated on `DATABASE_URL` | [crates/picloud/tests/dispatcher_e2e.rs](crates/picloud/tests/dispatcher_e2e.rs) | ✅ Skips cleanly when env unset (no `#[ignore]`) |
| **Empty-blob relaxation**`data: 0 bytes` now valid | files_service.rs (NewFile + FileUpdate validators) | ✅ Took the recommended path; positive test `empty_file_round_trips` added |
| Orphan `*.tmp.*` sweeper, every 6h, deletes >1h old | files_sweep.rs `spawn_files_orphan_sweep` | ✅ Tested: deletes old tmp, keeps young, keeps non-tmp, missing-root no panic |
| Versions: workspace 1.1.5→1.1.6, SDK 1.6→1.7, dashboard 0.11.0→0.12.0, client@1.0.0 | Cargo.toml + version.rs + package.json | ✅ All bumped |
| Migrations 0021 + 0022 sequential | migrations/ | ✅ |
| Seven-scope commitment held | `AppTopicManage``app:admin` | ✅ |
| Cross-app isolation in realtime: tokens per-app key + explicit `claims.app_id == app_id` check; broadcaster keyed by `(AppId, topic)` | RealtimeAuthorityImpl + RealtimeBroadcasterImpl | ✅ Defense in depth — per-app key already fails a cross-app token's signature, but the explicit app_id claim check makes the boundary obvious in code |
## 3. The §4-vs-§8 ordering contradiction (HANDBACK §9 #1)
The brief literally contradicted itself. §4 said:
> "Order: realtime broadcast FIRST (fast, in-memory), then transactional outbox fan-out (slower)."
§8 said:
> "Order matters: 1. Validate. 2. Transactional fan-out to outbox. 3. Commit. 4. Non-transactional broadcast to in-process subscribers."
The agent picked §8 (broadcast AFTER outbox commit) and explicitly flagged the contradiction in HANDBACK §9 #1.
**§8 is correct.** Three reasons:
1. **Correctness over latency.** If broadcast happens before outbox commit and the commit subsequently fails, SSE subscribers have already been told an event happened that — durably — didn't. They can't replay the apology. §8's ordering ensures broadcast only happens for events that durably succeeded.
2. **§8 is the dedicated, numbered, rationale-backed section.** §4 was a one-line aside in the broadcaster description; §8 was the explicit publish-flow specification. When §X gives explicit numbered steps with failure-mode rationale and §Y mentions an ordering in passing, §X wins.
3. **The agent's failure-mode analysis is right.** Per [pubsub_service.rs:170-173](crates/manager-core/src/pubsub_service.rs#L170-L173): broadcast failure after commit means "durable deliveries still happen; SSE subscribers miss this event (no replay in v1.1.6)." Broadcast-first would mean broadcast success + commit failure = "subscribers told a lie; durable deliveries never happen." The latter is strictly worse.
**Verdict: confirm §8.** The agent acted on the v1.1.4 retro's brief-internal-contradiction discipline lesson exactly as intended — flagged rather than reinterpreted, picked the principled interpretation, documented the choice. This is the right behavior; the lesson stuck.
The v1.1.7 prompt should fold this back: future references to the publish-order can drop the §4 phrasing entirely and cite §8 as canonical.
## 4. Latent finding: `dead_letter` handlers never fire (HANDBACK §10)
**Verified.** Grepping for `list_matching_dead_letter` callers:
```
crates/manager-core/src/outbox_event_emitter.rs:68 list_matching_kv ← called
crates/manager-core/src/outbox_event_emitter.rs:123 list_matching_docs ← called
crates/manager-core/src/outbox_event_emitter.rs:184 list_matching_files ← called
list_matching_dead_letter ← NO PRODUCTION CALLER
```
`TriggerRepo::list_matching_dead_letter` is defined in the trait (trigger_repo.rs:356), implemented for Postgres (trigger_repo.rs:947), and exists in test fakes (triggers_api.rs:830). But no code path in the dispatcher or the emitter calls it. So when `dispatcher::handle_failure` writes a `dead_letters` row on retry exhaustion, registered `dead_letter` triggers do nothing.
This is a real bug from v1.1.1. The design notes §4 specified dead_letter triggers as a shipped v1.1.1 feature; the wiring was never connected. v1.1.2/v1.1.3/v1.1.4/v1.1.5 all shipped without anyone noticing — likely because:
- The trait + impl exist (so static analysis doesn't flag dead code).
- v1.1.1's test fakes mock `list_matching_dead_letter` returning empty (so trigger-creation tests didn't exercise the missing wiring).
- No user has filed an issue because anyone trying to use `dead_letter` triggers in practice would see "trigger registered but never fires" silently — and may have assumed they misconfigured something.
**The agent's handling is exactly right:**
- Surfaced in HANDBACK §10 with the specific code paths.
- Did NOT attempt a fix (out of v1.1.6 scope).
- Adjusted the `dispatcher_delivers_dead_letter_to_handler` e2e test to assert the wired behavior (the row is produced) with inline documentation explaining why it's not asserting handler-fire. This is the honest test for what the code currently does.
**For v1.1.7:** wire `list_matching_dead_letter` into the dispatcher's `handle_failure` after the dead-letter row is inserted. The recursion-stop rule from v1.1.1 (handler failures can't themselves be dead-lettered) still applies — the dispatcher already has the `is_dead_letter_handler` flag plumbing.
**For deployments:** this bug has been silently shipped since v1.1.1. Anyone running v1.1.1v1.1.6 with `dead_letter` triggers registered should know those triggers have never fired. The fix in v1.1.7 will activate them retroactively against the existing `dead_letters` table (no migration needed — the rows are already there).
Worth a CHANGELOG.md callout in v1.1.7 alongside the fix, similar to how v1.1.3's cross-app trigger gap got a retroactive security note.
**Verdict: not a v1.1.6 blocker.** The bug predates this release; v1.1.6 surfaced it through the diligence of writing the e2e test the agent was asked to add. Excellent defensive work.
## 5. Test count discrepancy
HANDBACK §8 attests `cargo test --workspace → 482 passed`. My re-run on the same HEAD reports **~550 passed**. Counting the unique `test result: ok. N passed` lines:
```
manager-core 256 (was 229 in v1.1.5 → +27)
executor-core/sdk_* 15+15+8+8+7+5+1+1+17 = 77
orchestrator-core 74 (was 62 → +12; realtime SSE + broadcaster + key cache tests)
stdlib 43
sdk_contract 30
modules 23
picloud 21 (incl. 6 dispatcher_e2e skipping no-op)
schema_snapshot 1
shared/pubsub 6 (or somewhere thereabouts)
files-related 20
───
~550
```
The agent's 482 count was likely a snapshot taken before the final commit added a test file, or a `cargo test --workspace 2>&1 | grep -c "passed"` (counts lines, not values) misread. Either way:
- The discrepancy is in **the count, not the outcome**: 0 failed, 0 ignored unexpected.
- The gates exit 0; clippy is clean; fmt is clean.
- The implementation passes every named-critical test from the prompt's §13.
**Verdict: minor accounting drift, not a blocker.** Flag for the v1.1.7 retro: the §8 attestation should be the literal `cargo test --workspace` final-line output (`X passed in Y crates`) or a sum verified by `awk '/test result: ok/ { sum += $4 } END { print sum }'`, not a hand count.
## 6. Substantive strengths
**1. Streaming-`fetch` SSE in the client lib was the right call.** Native `EventSource` can't set custom auth headers (forcing the `?token=` query-string path, which the server still supports as an EventSource-compat option). But for the client lib, dropping EventSource in favor of streaming `fetch` unlocks three things at once: bearer-header auth (cleaner than query-string), 401 detection on (re)connect → `onTokenExpired` callback → token refresh → reconnect, and `Last-Event-ID` resume header (server ignores it in v1.1.6 but the client ships ready). Trade-off: requires `fetch` streaming, so React Native needs a polyfill — the README documents this. Right trade for v1.1.6's target audience.
**2. The HMAC-signing-key persisted-table choice avoids a global secret.** The agent took the recommended path: per-app 32-byte random keys in `app_secrets`. No `PICLOUD_INSTANCE_SECRET` env var to operate. Future v1.1.7 encrypted-per-app-secrets work has its natural home. The cost — one DB read per subscribe — is mitigated by the in-process key cache (correct in v1.1.6 because keys never rotate; HANDBACK §12 flags the rotation-invalidation requirement for future).
**3. Defense in depth on cross-app isolation.** Per-app signing key + explicit `claims.app_id == app_id` check + broadcaster channels keyed by `(AppId, topic)`. Any single guard would suffice; all three together make the boundary obvious in code AND impossible to bypass via a single mistake.
**4. Three v1.1.5 follow-ups all landed.** The empty-blob relaxation, the orphan tmp-sweeper, the six dispatcher e2e tests. All in this release, not deferred. The e2e tests are gated on `DATABASE_URL` cleanly via early-return (matches the v1.1.5 schema_snapshot pattern); CI's Postgres service exercises them.
**5. The agent's discipline carryover is exemplary.** Both flagged items (§4-vs-§8 contradiction, dead_letter latent finding) were caught by the v1.1.4 + v1.1.3 retro discipline lessons: brief-internal contradictions get flagged-not-reinterpreted, latent security/correctness findings get their own HANDBACK section. The §8 attestation was taken on the actual HEAD with the explicit "this HANDBACK commit is pure markdown" footnote. Every deviation is in §7. The system is working.
**6. Commit split.** Three commits — realtime+followups+versions, client lib, handback. Cleaner than v1.1.5's three substantive (because the client lib genuinely is a standalone artifact in a different toolchain), and the build-app cross-crate constraint that drove a single big realtime commit is honestly documented in HANDBACK §7 #1.
## 7. Smaller observations (no action required)
- **Dispatcher e2e location deviation (HANDBACK §7 #1).** Brief said `crates/manager-core/tests/`; agent put them in `crates/picloud/tests/` because `build_app` lives there. The cycle the agent describes (manager-core → picloud dev-dep) is real. The picloud location is correct — `build_app` is where the dispatcher + scheduler + executor are wired into one stack, and that wiring is what the e2e tests need to exercise.
- **Empty-blob relaxation extended to `FileUpdate::validate` (HANDBACK §7 #2).** The brief only named `NewFile::validate`. Extending to update is correct — relaxing create-empty but rejecting update-to-empty would be an inconsistent API.
- **Topic-name validation (HANDBACK §7 #4).** Small defensive add: empty names and names containing `*` rejected at admin endpoint. Defends against operator confusion when topic-pattern external subscription lands in v1.2 (preemptive: `*` will mean something there, so reserving it now avoids a future breaking validation).
- **Client lib lint via `tsc --noEmit` instead of eslint (HANDBACK §7 #5).** Right call for v1.1.6 — strict TS does most of what eslint would, and adding eslint configuration is a separate scope of work. Easy to add later if there's a real type-system-can't-catch-it lint rule we need.
- **Cron e2e 45s poll budget (HANDBACK §7 #6).** Defensive against the default 30s tick interval; CI sets `PICLOUD_CRON_TICK_INTERVAL_MS=1000` to make it ~2s. Reasonable.
- **`broadcast::Sender` shape vs `oneshot::Sender`.** v1.1.1's `InboxRegistry` uses oneshot (single delivery). v1.1.6's broadcaster uses `tokio::sync::broadcast` (repeated delivery to multiple receivers). Different patterns for different problems; the trait split in shared keeps both Cluster-mode-swappable.
## 8. Versioning audit
| File | Before | After | Status |
|---|---|---|---|
| Workspace `Cargo.toml` | 1.1.5 | 1.1.6 | ✅ |
| SDK schema (`shared/src/version.rs`) | 1.6 | 1.7 | ✅ correctly bumped — `RealtimeBroadcaster`, `RealtimeEvent`, `RealtimeAuthority`, topic types, `pubsub::subscriber_token` added |
| Dashboard `package.json` | 0.11.0 | 0.12.0 | ✅ |
| `@picloud/client` package.json | (new) | 1.0.0 | ✅ Initial release |
| Migrations | 0001..0020 | 0021..0022 added | ✅ sequential |
| CHANGELOG.md | v1.1.5 entry | v1.1.6 entry added | ✅ |
## 9. Recommended next steps (post-merge)
1. **Merge** `feat/v1.1.6-realtime-client` into `main` (fast-forward; branch is linear ahead).
2. **`docker compose down` when convenient** to tear down the dev Postgres container the agent left running.
3. **Pause** before dispatching v1.1.7 (Configuration & Email).
4. **For the v1.1.7 dispatch prompt**, fold in:
- **Fix `dead_letter` handler wiring** (§4 above). Add a call to `list_matching_dead_letter` in `dispatcher::handle_failure` after the row is inserted, enqueue an outbox row per matching trigger with `source_kind: 'dead_letter'` and the appropriate `TriggerEvent::DeadLetter` payload. The recursion-stop rule (handlers can't be dead-lettered) is already implemented; the wiring just isn't connected. Plus a CHANGELOG retroactive note covering v1.1.1v1.1.6 ("dead_letter triggers were registerable but never fired; this release activates them against existing dead_letters rows, no migration needed").
- **Encrypted per-app secrets.** v1.1.7's brief topic. The `app_secrets` table from v1.1.6 is the natural home; the realtime signing key already lives there. v1.1.7's encrypted-secrets work should extend the table (or add a sibling) for general per-app secrets.
- **§8 attestation discipline refinement:** require the §8 attestation be sourced from `cargo test --workspace 2>&1 | tail` rather than a hand count (per §5 above).
- **The brief-internal-contradiction discipline lesson stuck.** v1.1.7's brief should be walked through for example/spec contradictions before dispatch, same as the v1.1.5 retro lesson the v1.1.6 brief honored. Keep doing this.
5. **For the v1.1.8 dispatch prompt** (User Management): the `app_secrets` table + the `RealtimeAuthority` trait shape are ready for v1.1.8 to add `auth_mode = 'session'` as the third subscriber-auth flavor (extending the CHECK constraint, adding a session-token validator alongside the existing HMAC validator behind the unchanged trait).
Branch is ready for merge. Verdict: **APPROVE**.