Scope coverage, realtime + client-lib notes, §8 attestation (482 workspace tests; e2e 6/6 + schema snapshot verified on a real DB), every prompt-default deviation, and the latent dead_letter-trigger fan-out finding + §4-vs-§8 ordering question for the reviewer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
221 lines
13 KiB
Markdown
221 lines
13 KiB
Markdown
# HANDBACK — v1.1.6 Realtime Channels & Client Library
|
||
|
||
Branch: `feat/v1.1.6-realtime-client` (from `main`). Not pushed, no PR.
|
||
|
||
## 1. Scope coverage (§1–§13)
|
||
|
||
| § | Item | Status |
|
||
|---|------|--------|
|
||
| 1 | `topics` table (`0021_topics.sql`) | ✅ Done |
|
||
| 2 | Topic admin endpoints (POST/GET/PATCH/DELETE), `AppTopicManage` | ✅ Done |
|
||
| 3 | SSE endpoint `GET /realtime/topics/{topic}` | ✅ Done |
|
||
| 4 | In-process `RealtimeBroadcaster` + GC + publish wiring | ✅ Done |
|
||
| 5 | HMAC subscriber tokens + `app_secrets` (`0022`) + `pubsub::subscriber_token` | ✅ Done |
|
||
| 6 | Dashboard Topics tab | ✅ Done |
|
||
| 7 | `@picloud/client` TypeScript package | ✅ Done |
|
||
| 8 | Topic-aware publish → realtime wiring | ✅ Done |
|
||
| 9 | Dispatcher e2e tests (six) | ✅ Done (location deviation — see §7) |
|
||
| 10 | Empty-blob decision | ✅ Done — **relaxed** (accept empty) |
|
||
| 11 | Orphan `*.tmp.*` sweeper | ✅ Done |
|
||
| 12 | Version bumps (1.1.6 / SDK 1.7 / dash 0.12.0 / client 1.0.0) | ✅ Done |
|
||
| 13 | Tests (all named-critical cases) | ✅ Done |
|
||
|
||
## 2. Realtime implementation notes
|
||
|
||
### Topic resolution / SSE handshake sequence
|
||
1. Extract `Host` → `app_domains.resolve_app(host)` (existing two-phase
|
||
dispatch). No app → **404**.
|
||
2. Token from `Authorization: Bearer <t>` **or** `?token=<t>` (EventSource
|
||
can't set headers).
|
||
3. `RealtimeAuthority::authorize_subscribe(app_id, topic, token)`:
|
||
- topic missing OR `external_subscribable = false` → `NotFound` → **404**
|
||
(both collapse to 404 so the endpoint can't probe internal topics);
|
||
- `auth_mode='public'` → allow;
|
||
- `auth_mode='token'` → verify HMAC (present, signed by this app's key,
|
||
unexpired, scoped to this topic) → allow, else `Unauthorized` → **401**
|
||
(generic; never says which check failed).
|
||
4. `broadcaster.subscribe(app_id, topic)` → `broadcast::Receiver`; stream
|
||
`data: {topic,message,published_at}\n\n` with `:heartbeat` keepalive
|
||
every `PICLOUD_REALTIME_HEARTBEAT_SEC` (default 30). Headers:
|
||
`text/event-stream`, `Cache-Control: no-cache` (set by axum Sse),
|
||
`X-Accel-Buffering: no` (added). Client disconnect drops the receiver →
|
||
automatic cleanup; the periodic GC reaps empty channels.
|
||
|
||
The SSE handler lives in **orchestrator-core** (`realtime_api.rs`) and
|
||
depends only on the three picloud-shared traits — all DB + signing-key
|
||
access stays in the manager-core `RealtimeAuthority` impl, so the
|
||
data-plane crate never touches the key. Cluster mode (v1.3+) swaps the
|
||
broadcaster + authority impls behind the same traits.
|
||
|
||
### HMAC signing-key storage — chose **table** (`app_secrets`)
|
||
Per the asked decision. 32 random bytes per app, lazily created on the
|
||
first `pubsub::subscriber_token` call (`ON CONFLICT DO NOTHING` for
|
||
concurrency). No global `PICLOUD_INSTANCE_SECRET`; the table is the
|
||
natural home for v1.1.7's encrypted per-app secrets.
|
||
**Tension flagged:** §5 aspired to "validate the signature without a DB
|
||
lookup". The table approach needs a key read on the token-subscribe path.
|
||
Mitigated by an in-memory key cache in `RealtimeAuthorityImpl` (keys never
|
||
rotate in v1.1.6, so the cache needs no invalidation); the subscribe path
|
||
already reads the `topics` row, so this adds no new round-trip *category*.
|
||
|
||
### In-process broadcaster
|
||
`Mutex<HashMap<(AppId, String), broadcast::Sender<RealtimeEvent>>>`.
|
||
Capacity per channel `PICLOUD_REALTIME_BROADCAST_CAPACITY` (default 64);
|
||
slow consumers lose oldest events (`broadcast` lag semantics — best-effort,
|
||
no replay). `subscribe` creates channels lazily; `publish` is a silent
|
||
no-op when no channel exists; `drop_topic` removes the sender (existing
|
||
receivers observe a closed channel and disconnect). A `spawn_realtime_gc`
|
||
task (~60s) drops senders with `receiver_count() == 0`.
|
||
|
||
### Publish wiring (order + failure modes) — §8 ordering chosen
|
||
`PubsubServiceImpl::publish_durable`: validate/authz → **transactional
|
||
outbox fan-out + commit** (existing) → **then** best-effort
|
||
`RealtimeBroadcaster::publish`. The broadcast runs on a child
|
||
`tokio::spawn` whose `JoinHandle` is awaited, so a panicking broadcaster
|
||
becomes a `warn` log, never a failed publish (the durable deliveries
|
||
already committed). One `published_at` instant is shared by both paths.
|
||
**Brief-internal contradiction flagged** — see §9.
|
||
|
||
## 3. Client lib implementation notes
|
||
|
||
- **Build:** **tsup** (dual ESM+CJS + `.d.ts`), tests **vitest**, lint =
|
||
`tsc --noEmit` (strict; `noUncheckedIndexedAccess`, no `any` in exports).
|
||
- **Layout:** `src/{index,client,endpoint,subscribe,auth,types}.ts` +
|
||
`src/react/index.ts` + `src/svelte/index.ts`; subpath exports
|
||
`@picloud/client/react` and `@picloud/client/svelte` via package `exports`.
|
||
- **Reconnect:** SSE is implemented over **streaming `fetch`** (not native
|
||
`EventSource`) so the lib can (a) detect a **401** on (re)connect and call
|
||
`onTokenExpired` to refresh, (b) send **`Last-Event-ID`** on resume
|
||
(server ignores it in v1.1.6 — client ships ready), and (c) apply its own
|
||
**exponential backoff** (base 1s → ×2 → cap 30s; reset on successful
|
||
open). Token rides in `?token=` (EventSource-parity). React Native caveat
|
||
documented in the README (supply a streaming-`fetch` polyfill).
|
||
- **zod/valibot adapter:** the `Validator<T> = { parse(input): T }` shape.
|
||
A Zod schema satisfies it directly; Valibot wraps in one line. Used by
|
||
`endpoint(...).get({ validate })` and `subscribe(..., { validate })`. No
|
||
hard dep.
|
||
|
||
## 4. v1.1.5 follow-ups
|
||
|
||
- **Dispatcher e2e (six):** `dispatcher_delivers_{kv,docs,cron,files,pubsub,
|
||
dead_letter}_to_handler` in `crates/picloud/tests/dispatcher_e2e.rs`.
|
||
Verified green against a real Postgres (all 6).
|
||
- **Empty-blob — RELAXED** (per the asked decision). Dropped the
|
||
`data.is_empty()` rejection in `NewFile::validate` **and** `FileUpdate::
|
||
validate` (the latter for consistency — flagged in §7). Flipped the
|
||
pinning test in `files_service.rs` and added `empty_file_round_trips`.
|
||
- **Orphan sweep:** `spawn_files_orphan_sweep` (`files_sweep.rs`), every 6h
|
||
(`PICLOUD_FILES_ORPHAN_SWEEP_INTERVAL_SEC`), unlinks `*.tmp.*` older than
|
||
1h (`PICLOUD_FILES_ORPHAN_TMP_TTL_SEC`); logs dirs-walked / files-deleted /
|
||
bytes-reclaimed. No DB cross-check (v1.3+). Tested: deletes old tmp, keeps
|
||
young tmp, keeps non-tmp, missing-root no panic.
|
||
|
||
## 5. Schema decisions beyond the brief
|
||
None — `0021_topics.sql` and `0022_app_secrets.sql` match the brief's DDL
|
||
verbatim. Schema-snapshot golden re-blessed on a fresh DB (only the two new
|
||
tables added; PK/FK `ON DELETE CASCADE` + the `auth_mode` CHECK present).
|
||
|
||
## 6. How to verify locally
|
||
See §8 below.
|
||
|
||
## 7. Decisions beyond the brief — every prompt-default deviation
|
||
1. **Dispatcher e2e location:** the brief says
|
||
`crates/manager-core/tests/dispatcher_e2e.rs`; I put it in
|
||
**`crates/picloud/tests/`**. `build_app` (the full dispatcher + scheduler
|
||
+ executor wiring) lives in the `picloud` crate; a manager-core test would
|
||
need a manager→picloud dev-dependency cycle or a hand-rolled re-wire of
|
||
build_app. The picloud harness (`server_with_app` pattern) is proven and
|
||
the tests run green against a real DB. Gating uses the **schema_snapshot
|
||
pattern** (env check + early return), NOT `#[ignore]`, so CI's plain
|
||
`cargo test` with `DATABASE_URL` runs them while local stays green — as
|
||
the brief requested.
|
||
2. **Empty-blob relaxation extended to `FileUpdate::validate`** — the brief
|
||
names only `NewFile::validate`. Relaxing create-empty but rejecting
|
||
update-to-empty would be an inconsistent API, so I relaxed both.
|
||
3. **Publish order: §8 (broadcast AFTER outbox commit)**, not §4 (broadcast
|
||
FIRST). The two sections contradict; §8 is the dedicated, numbered,
|
||
rationale-backed section. See §9.
|
||
4. **Topic-name validation** — `topics_api` rejects empty names and names
|
||
containing `*` (external pattern subscription is v2). The brief didn't
|
||
specify name validation; this is a small guard.
|
||
5. **Client lib lint = `tsc --noEmit`** (not eslint) to keep devDeps lean;
|
||
strict typecheck is the gate. "No `any` in exports" is enforced by review
|
||
+ strict TS, not an eslint rule.
|
||
6. **Cron e2e poll budget = 45s** — the cron scheduler skips its first tick
|
||
then ticks every 30s (default). The test polls 45s so it passes at the
|
||
default; set `PICLOUD_CRON_TICK_INTERVAL_MS=1000` to make it ~2s in CI.
|
||
|
||
## 8. Attestation (gate runs on this HEAD)
|
||
|
||
```
|
||
cargo fmt --all -- --check → clean
|
||
cargo clippy --all-targets --all-features -- -D warnings → clean (exit 0)
|
||
cargo test --workspace → 482 passed, 0 failed
|
||
(DB-gated dispatcher_e2e
|
||
auto-skip as no-op when
|
||
DATABASE_URL unset)
|
||
# DB-gated (real Postgres @ 127.0.0.1:15432, picloud/picloud):
|
||
DATABASE_URL=… PICLOUD_CRON_TICK_INTERVAL_MS=1000 \
|
||
cargo test -p picloud --test dispatcher_e2e → 6 passed
|
||
DATABASE_URL=… cargo test -p picloud-manager-core --test schema_snapshot → 1 passed
|
||
(cd dashboard && npm run check) → 0 errors, 0 warnings (371 files)
|
||
(cd clients/typescript && npm run lint) → clean (tsc --noEmit)
|
||
(cd clients/typescript && npm run test) → 15 passed (5 files)
|
||
(cd clients/typescript && npm run build) → tsup ESM+CJS+.d.ts OK
|
||
```
|
||
Migrations verified applying cleanly on a fresh DB **and** on top of the
|
||
existing v1.1.5 dev DB (0020 → 0021 → 0022). Schema-snapshot golden diff is
|
||
exactly the two new tables.
|
||
|
||
## 9. Open questions for the reviewer
|
||
1. **§4 vs §8 ordering contradiction (load-bearing, flagged not
|
||
reinterpreted):** §4 says "realtime broadcast FIRST, then transactional
|
||
outbox fan-out"; §8 says broadcast AFTER the fan-out commits (numbered
|
||
steps 1–4). I implemented **§8** (dedicated section + failure-mode
|
||
rationale). If §4's ordering was intended, the broadcast should move
|
||
before `fan_out_publish` — but then a publish whose outbox write fails
|
||
would still have notified SSE subscribers of an event that never durably
|
||
happened, which seems wrong. Please confirm §8 is correct.
|
||
2. **Dead-letter → handler fan-out appears unwired (see §10).** Confirm
|
||
whether the `dead_letter` trigger source is intended to fire in v1.1.x.
|
||
|
||
## 10. Latent findings
|
||
**`dead_letter` trigger handlers do not fire on dead-letter creation.**
|
||
`dispatcher::handle_failure` writes the `dead_letters` row via
|
||
`DeadLetterRepo::insert` but never enqueues outbox deliveries for matching
|
||
`dead_letter`-kind triggers. `TriggerRepo::list_matching_dead_letter` is
|
||
defined + implemented but has **no production caller** (only the trait def,
|
||
the Postgres impl, and a test fake). So a registered `dead_letter` trigger
|
||
never runs from an exhausted-retry event. This predates v1.1.6 (the design
|
||
notes §4 describe it shipping in v1.1.1). I did **not** fix it (out of
|
||
v1.1.6 scope) — flagging for the reviewer. Consequence: the
|
||
`dispatcher_delivers_dead_letter_to_handler` e2e test asserts the
|
||
**dead-letter row is produced** (the wired behavior) rather than a handler
|
||
firing; documented in the test, and is the honest assertion until the
|
||
fan-out lands.
|
||
|
||
No new security gaps introduced. Cross-app isolation holds: `subscriber_token`
|
||
claims carry `app_id` and are signed per-app; the SSE authority rejects
|
||
cross-app tokens (a per-app key already fails the signature, plus an explicit
|
||
`claims.app_id == app_id` check); topic admin endpoints bind the capability
|
||
to the path `app_id` after loading the app; the broadcaster keys channels by
|
||
`(AppId, topic)` so app A's publishes never reach app B's subscribers
|
||
(unit-tested).
|
||
|
||
## 11. Deferred items (beyond this prompt's OUT list)
|
||
None added. Everything on the brief's OUT list stayed out (WebSocket,
|
||
session/script auth modes, topic-pattern external subscription, server-side
|
||
last-event-id replay, per-app SSE/rate limits, other-language SDKs, codegen,
|
||
full DB-cross-check sweeper).
|
||
|
||
## 12. Known limitations / rough edges
|
||
- SSE delivery is best-effort at-most-once; a slow consumer past the
|
||
broadcast buffer loses events with no server-side replay (by design).
|
||
- The client lib's streaming-`fetch` SSE needs a polyfill on runtimes
|
||
without `fetch` body streaming (React Native) — documented.
|
||
- Cron e2e takes ~31s at the default 30s tick interval (45s poll budget);
|
||
set `PICLOUD_CRON_TICK_INTERVAL_MS=1000` to speed it up.
|
||
- The realtime key cache in `RealtimeAuthorityImpl` is per-process and never
|
||
invalidated — correct only because v1.1.6 has no key rotation. A future
|
||
rotation API must clear it.
|