From 1efb350b540edcf0a782d50c5a32049f3ba4889c Mon Sep 17 00:00:00 2001
From: MechaCat02
Date: Mon, 1 Jun 2026 21:22:25 +0200
Subject: [PATCH] docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Annotates the v1.1.x design notes with the resolutions for the 20 open calls — pub/sub split, universal outbox, NATS-style sync HTTP, status code strategy, retry policy, dead-letter recursion-stop, realtime auth model, frontend client library scope. Captured ahead of the v1.1.1 implementation so the schema + API decisions in this branch have a single load-bearing source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/v1.1.x-design-notes.md | 341 ++++++++++++++++++++++++------------
 1 file changed, 228 insertions(+), 113 deletions(-)

diff --git a/docs/v1.1.x-design-notes.md b/docs/v1.1.x-design-notes.md
index 5bfa302..8bec6fa 100644
--- a/docs/v1.1.x-design-notes.md
+++ b/docs/v1.1.x-design-notes.md
@@ -20,23 +20,27 @@ PiCloud will expose three distinct messaging concepts. 
The right way to slice th
 | | Recipients | Durability | Delivery | Retry on script failure | Mental model |
 |---|---|---|---|---|---|
 | **`invoke(script_id, args)`** | One **named** script | None (or fire-and-forget durable) | At-most-once sync, or at-least-once async | Caller-controlled via `retry::*` | Function call |
-| **`pubsub::publish(topic, msg)`** | **All** scripts subscribed via trigger | Through outbox | **At-least-once per subscriber** | Per-subscriber retry up to N, then dead-letter | Fan-out broadcast |
+| **`pubsub::publish_durable(topic, msg)`** | **All** scripts subscribed via trigger | Through outbox | **At-least-once per subscriber** | Per-subscriber retry up to N, then dead-letter | Fan-out broadcast (persisted) |
+| **`pubsub::publish_ephemeral(topic, msg)`** *(future)* | **All** scripts subscribed via trigger | None (in-memory NOTIFY) | **At-most-once per subscriber** | None | Fan-out broadcast (best-effort) |
 | **`queue::enqueue(name, msg)`** | **Exactly one** consumer wins | Durable table | **At-least-once total** | Visibility timeout + nack-on-throw | Work distribution |
 
 **Critical distinction:** pub/sub and queue both end up at-least-once, but the **subscriber model** differs. Queue: 1 message → 1 delivery record → consumers compete. Pub/sub: 1 message → N delivery records (one per subscriber) → no competition.
 
-### Pub/sub reframe — drop ephemeral, use the outbox
+### Pub/sub reframe — durable through the outbox, ephemeral as named escape hatch
 
-The original blueprint plan was pub/sub via Postgres `LISTEN/NOTIFY` (ephemeral, sub-millisecond fan-out). Reframe to **reuse the triggers framework's outbox infrastructure**:
+The original blueprint plan was pub/sub via Postgres `LISTEN/NOTIFY` (ephemeral, sub-millisecond fan-out). 
Reframe to **reuse the triggers framework's outbox infrastructure for the durable path, and keep ephemeral as a separately-named future API**:
 
-- `pubsub::publish(topic, msg)` writes to the outbox
+- `pubsub::publish_durable(topic, msg)` writes to the outbox (v1.1.5)
 - Dispatcher fans out one delivery record per subscribed script trigger
 - Each delivery retried on failure with the same machinery as KV / doc / file triggers
 - After N retries → dead-letter (see §4)
+- `pubsub::publish_ephemeral(topic, msg)` is committed as a future addition for the in-memory `LISTEN/NOTIFY` path — not shipped in v1.1.5, but the API split is decided now so users learn "durable by default, opt into ephemeral" from the start (rather than the reverse, which would be a breaking rename later).
 
-**Wins:** one delivery model in the whole system, durable pub/sub for free, shared observability/retry/dead-letter tooling across every event-firing surface.
+**Wins:** one delivery model in the whole system for the durable path, durable pub/sub for free, shared observability/retry/dead-letter tooling across every event-firing surface.
 
-**Cost:** ~1ms Postgres write per publish (vs in-memory NOTIFY). For solo-dev / consumer hardware, the right tradeoff. If sub-ms ever matters, `pubsub::publish_ephemeral` is a future addition that bypasses the outbox.
+**Cost:** ~1ms Postgres write per `publish_durable` (vs in-memory NOTIFY). For solo-dev / consumer hardware, the right tradeoff. The ephemeral escape hatch exists for sub-ms / high-frequency workloads if/when they emerge.
+
+**Note on durability semantics.** "Durable" here means the outbox row persists, not that fan-out is transactional with the publisher's own data writes. A script doing `kv.set(...)` then `pubsub::publish_durable(...)` performs two separate writes; a crash between them can drop the publish. This matches the standard transactional-outbox pattern and is consistent with how KV / doc / file triggers already work.
 
 ### Queue stays separate
 
@@ -53,14 +57,15 @@ The queue table IS the outbox for queue semantics — no double-buffering.
 
 ### Status
 
-- **Pub/sub via trigger outbox**: leaning yes; needs final ack.
-- **Queue stays separate from pub/sub**: leaning yes.
-- **Drop `LISTEN/NOTIFY` plan**: leaning yes.
+- **Durable pub/sub via trigger outbox**: ✅ Decided 2026-06-01 — ship as `pubsub::publish_durable` in v1.1.5.
+- **Ephemeral pub/sub**: ✅ Committed 2026-06-01 as a future addition named `pubsub::publish_ephemeral`. Not in v1.1.5; the explicit-naming split lands now so the durable default doesn't need a breaking rename later.
+- **Drop `LISTEN/NOTIFY` for v1.1.5**: ✅ Decided 2026-06-01.
+- **Queue stays separate from pub/sub**: ✅ Decided 2026-06-01 — two distinct top-level namespaces (`queue::*` and `pubsub::*`); no unifying `messaging::*` abstraction. Rationale: the two have genuinely different mental models (work distribution vs fan-out), the implementations share almost no code (queue needs `FOR UPDATE SKIP LOCKED` + visibility timeout + nack-on-throw; pub/sub needs per-subscriber fan-out + independent retry/dead-letter), and a unified API would force users to choose a mode they already know from the use case. A future Kafka-shaped consumer-group unification was considered and rejected — PiCloud is outbox-based, not log-based, so going Kafka-shaped would mean rebuilding storage.
 
 ### Open calls
 
-1. Pub/sub durability via trigger outbox (durable, at-least-once) — confirmed?
-2. Queue and pub/sub stay separate concepts (rather than unifying under a "messaging" abstraction with a subscription-mode flag) — confirmed?
+1. ~~Pub/sub durability via trigger outbox~~ — ✅ Decided 2026-06-01: yes, both `publish_durable` (v1.1.5) and `publish_ephemeral` (future) committed with explicit names.
+2. ~~Queue and pub/sub stay separate concepts~~ — ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction.
 
 ---
 
@@ -96,16 +101,18 @@ The triggers framework's outbox should be the universal substrate for **async di
 
 ### Status
 
-- **Universal outbox for async dispatch**: leaning yes.
-- **Routes-as-trigger conceptually**: leaning yes.
-- **Routes-as-trigger schema-wise**: leaning no (keep separate tables).
-- **Per-route `dispatch_mode: sync|async`**: leaning yes for v1.1.1 since the dispatcher is already being built.
+- **Universal outbox for async dispatch**: ✅ Decided 2026-06-01 — yes; all async ingress (KV/cron/pubsub/queue/email/dead-letter) writes to one outbox; one dispatcher reads it.
+- **Sync HTTP via outbox (NATS-style inbox)**: ✅ Decided 2026-06-01 — in-process oneshot in v1.1.1; cluster-mode keeps the door open for `LISTEN/NOTIFY` keyed on `inbox_id` in v1.3+ (see §3 implementation table).
+- **Routes-as-trigger conceptually**: ✅ yes — the dispatch layer treats routes and triggers uniformly.
+- **Trigger storage shape: Layout E (parent + per-kind detail tables)**: ✅ Decided 2026-06-01. One shared `triggers` parent with common columns (`id`, `app_id`, `script_id`, `kind`, `enabled`, `dispatch_mode`, retry config, timestamps); one `_trigger_details` table per service (`kv_trigger_details`, `cron_trigger_details`, `pubsub_trigger_details`, `queue_trigger_details`, `email_trigger_details`, `dead_letter_trigger_details`). Outbox FKs to `triggers.id`; dead-letters FK same. Exact column set (notably `outbox.app_id` denormalization, whether `script_id` also lives on outbox, ON DELETE behavior on the parent vs detail tables) will be refined when v1.1.1 implementation lands.
+- **`routes` table stays separate from the `triggers` parent for now**: ✅ Decided 2026-06-01. `routes` is Phase-3 production schema with its own trie-index columns; folding into the parent is a v1.2 cleanup, not a v1.1.1 requirement. Outbox discriminates HTTP rows via `source_kind = 'http'` and `trigger_id` referencing `routes.id` for HTTP, `triggers.id` for everything else.
+- **Per-route `dispatch_mode: sync|async`**: ✅ Decided 2026-06-01 — ships in v1.1.1. Async returns `202 Accepted` with a JSON body `{ "accepted_at": "...", "execution_id": "..." }`. `dispatch_mode` is a route property fixed at route creation; scripts cannot switch modes mid-call.
 
 ### Open calls
 
-1. Sync HTTP via outbox + per-request inbox pattern (NATS-style; see §3) — confirmed, or keep direct dispatch for sync?
-2. Ship `dispatch_mode: async` for HTTP routes in v1.1.1, or defer to a later release?
-3. Keep `routes` and `triggers` as separate tables (unified at the dispatcher only), or merge schemas?
+1. ~~Sync HTTP via outbox + per-request inbox~~ — ✅ Decided 2026-06-01: yes via outbox; in-process oneshot now, `LISTEN/NOTIFY` explicitly preserved for cluster mode (v1.3+).
+2. ~~Ship `dispatch_mode: async` in v1.1.1~~ — ✅ Decided 2026-06-01: yes; `202 Accepted` + JSON body with `execution_id`; route-level config only.
+3. ~~Trigger storage shape~~ — ✅ Decided 2026-06-01: Layout E (parent + per-kind detail tables); `routes` stays its own table for v1.1.x. Exact column set deferred to implementation PR.
 
 ---
 
@@ -143,6 +150,25 @@ The HTTP caller's experience is unchanged (synchronous request/response). Under
 
 Per sync HTTP request, NATS-style adds: ~1-2ms Postgres write (outbox) + sub-ms dispatcher wake (in-process channel) + ~1ms response resolve = **~2-5ms overhead**. For most scripts (10-100ms execution), this is noise. PiCloud isn't optimizing for sub-ms; the architectural unification is worth a few ms.
 
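The in-process oneshot inbox behind the NATS-style sync path could look roughly like the sketch below. It is illustrative only: `InboxRegistry`, `Reply`, `WaitError` and `wait_for_reply` are hypothetical names that do not come from the notes, `std::sync::mpsc` stands in for a real oneshot channel, and the outbox write itself is left out.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Hypothetical reply type; the notes do not define the real payload shape.
#[derive(Debug, PartialEq)]
pub enum Reply {
    Ok(String),
    ScriptError(String),
}

/// Why the orchestrator stopped waiting (names are assumptions).
#[derive(Debug, PartialEq)]
pub enum WaitError {
    /// The outer timeout fired before any reply arrived.
    TimedOut,
    /// The sending half was dropped without a reply being sent.
    DispatcherGone,
}

/// Maps inbox_id to the sending half of a single-use reply channel.
#[derive(Clone, Default)]
pub struct InboxRegistry {
    inner: Arc<Mutex<HashMap<u64, Sender<Reply>>>>,
}

impl InboxRegistry {
    /// Orchestrator side: open an inbox before the outbox row is written.
    pub fn register(&self, inbox_id: u64) -> Receiver<Reply> {
        let (tx, rx) = mpsc::channel();
        self.inner.lock().unwrap().insert(inbox_id, tx);
        rx
    }

    /// Dispatcher side: resolve an inbox at most once.
    /// Returns false when nobody is waiting any more (unknown id, or the
    /// receiver was dropped after the caller timed out).
    pub fn resolve(&self, inbox_id: u64, reply: Reply) -> bool {
        let sender = self.inner.lock().unwrap().remove(&inbox_id);
        match sender {
            Some(tx) => tx.send(reply).is_ok(),
            None => false,
        }
    }
}

/// Orchestrator side: block until the reply arrives or the timeout fires.
pub fn wait_for_reply(rx: &Receiver<Reply>, timeout: Duration) -> Result<Reply, WaitError> {
    rx.recv_timeout(timeout).map_err(|e| match e {
        RecvTimeoutError::Timeout => WaitError::TimedOut,
        RecvTimeoutError::Disconnected => WaitError::DispatcherGone,
    })
}
```

A `false` return from `resolve` is the point where a dispatcher would notice that the caller has already gone away, which is the situation the cancel-on-timeout decision in these notes is about.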
+### Default retry policy — decided
+
+✅ Decided 2026-06-01:
+
+| Knob | Default | Env override | Per-trigger column |
+|---|---|---|---|
+| Max attempts | 3 | `PICLOUD_TRIGGER_RETRY_MAX_ATTEMPTS` | `retry_max_attempts` |
+| Backoff shape | exponential | `PICLOUD_TRIGGER_RETRY_BACKOFF` (`exponential` \| `linear` \| `constant`) | `retry_backoff` |
+| Base delay | 1000ms | `PICLOUD_TRIGGER_RETRY_BASE_MS` | `retry_base_ms` |
+| Jitter | ±20% | `PICLOUD_TRIGGER_RETRY_JITTER_PCT` | (not per-trigger; dispatcher-side) |
+
+With the defaults, schedule after each failed attempt is **~1s / ~2s / ~4s** (each ±20%), total time-to-dead-letter ~7s.
+
+**What triggers a retry:** any of Rhai runtime error, wall-clock timeout, operation-budget-exceeded, or platform-side failure (Postgres unavailable, executor crashed). Distinguishing them in the dispatcher is fiddly and the retry cost is bounded by `max_attempts`; if op-budget retries become dead-letter spam in practice, revisit.
+
+**Per-trigger override:** the three retry columns on the `triggers` parent table (Layout E) take precedence over the env-configured defaults. Trigger CRUD endpoints accept these on create/update; if omitted, the env defaults are applied at write time (not lazily at dispatch — keeps the policy auditable from the row itself).
+
+**Sync HTTP exception:** unchanged. `reply_to.is_some()` rows are never retried regardless of policy (see below).
+
 ### Retry policy — `reply_to` IS the signal
 
 | Outbox row | Retry behavior |
@@ -179,25 +205,35 @@ With NATS-style indirection, there are new ways for a sync HTTP request to vanis
 
 Every path resolves the channel with a result. The orchestrator's outer timeout is the backstop for "dispatcher just died completely".
 
-### Status code strategy — open question
+### Status code strategy — decided
 
-Today's orchestrator distinguishes 422 / 502 / 503 / 504 / 507 / 500. User raised "everything should be 500" framing. 
Two options:
+✅ Decided 2026-06-01: keep the granular status codes (Option A), with one refinement — `500` is reserved for **platform** problems (dispatcher vanished, outbox write failed, inbox channel timed out unexpectedly), not used as a generic catch-all.
 
-- **Option A (recommended):** keep existing distinctions. Script crashes → 502, timeouts → 504, overloaded → 503, parse errors → 422, dispatcher vanished → 500. Clients get actionable info.
-- **Option B:** flatten to 500 for everything that's "platform couldn't return a useful response". Simpler surface; loses actionable distinctions.
+| Code | Cause | Who's at fault |
+|---|---|---|
+| 422 | Request validation failed | Client |
+| 502 | Script threw / Rhai runtime error | User script |
+| 503 | Gate refused (overloaded); `Retry-After: 1` | Platform (capacity) |
+| 504 | Wall-clock timeout | Either (slow script or platform overload) |
+| 507 | Operation budget exceeded | User script |
+| 500 | Dispatcher vanished / outbox write failed / inbox channel timed out unexpectedly | Platform (bug or infra) |
+
+Rationale: each code is actionable for the caller (back off, redesign as async, fix the script, file a bug). Flattening to `500` would collapse "script crashed" vs "overloaded" vs "your timeout is too tight" vs "platform broke" into one undifferentiated signal — losing both client-facing UX and our own observability/alerting axis.
 
 ### Status
 
-- **NATS-style for sync HTTP**: leaning yes; resolves the outbox vs direct-dispatch tension.
-- **`reply_to` presence as the "don't retry" signal**: leaning yes.
-- **Default retry policy** (3 attempts, exp backoff 1s/2s/4s): proposed.
+- **NATS-style for sync HTTP**: ✅ Decided 2026-06-01 (see §2 #3).
+- **`reply_to` presence as the "don't retry" signal**: ✅ Decided 2026-06-01 (folded with the NATS-style decision).
+- **Status code strategy**: ✅ Decided 2026-06-01 — keep granular distinctions; `500` reserved for platform problems only.
+- **Default retry policy**: ✅ Decided 2026-06-01 — 3 attempts / exponential / 1000ms base / ±20% jitter; all four env-overridable via `PICLOUD_TRIGGER_RETRY_*`; per-trigger columns on the parent table take precedence.
+- **Cancel-on-timeout semantics**: ✅ Decided 2026-06-01 — option (b). Late results are discarded from the caller's POV (they already got a 504) but the dispatcher writes an `abandoned_executions` row whenever it tries to resolve a oneshot that's already closed/dropped. 7-day default retention via `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`; weekly GC sweep. A counter (`picloud_abandoned_executions_total{app_id}`) bumps on insert — that's the primary observability signal; the rows themselves are for forensics when the counter spikes. Only the dispatcher-after-orchestrator-timeout edge case writes a row; ordinary "script timed out, caller got 504" stays uneventful.
 
 ### Open calls
 
-1. NATS-style request/reply for sync HTTP — confirmed?
-2. Status code strategy: keep existing distinctions (A, recommended) or flatten to 500 (B)?
-3. Default retry policy on triggers: 3 attempts with exp backoff (1s/2s/4s), or different defaults?
-4. Cancel-on-timeout semantics: if orchestrator's wait times out but executor finishes successfully later, do we (a) discard the late result, (b) write it to an "abandoned executions" table for debugging, or (c) attempt to ack the caller late? Leaning (b) — log + discard but keep the row for forensics.
+1. ~~NATS-style request/reply for sync HTTP~~ — ✅ Decided 2026-06-01 (see §2 #3).
+2. ~~Status code strategy~~ — ✅ Decided 2026-06-01: Option A (keep distinctions); 500 reserved for platform problems.
+3. ~~Default retry policy on triggers~~ — ✅ Decided 2026-06-01: 3/exp/1000ms base + ±20% jitter; env-overridable via `PICLOUD_TRIGGER_RETRY_*`; per-trigger row columns override the env defaults.
+4. 
~~Cancel-on-timeout semantics~~ — ✅ Decided 2026-06-01: option (b) — `abandoned_executions` table, dispatcher-written, 7-day retention, metric counter on insert.
 
 ---
 
@@ -271,24 +307,36 @@ ctx.event.dead_letter = #{
 
 The handler can `log::error`, send `email::send` to admins, write to `docs::collection("incidents").create(...)`, post to external alerting via `http::post`, or call `dead_letters::replay(id)` if it decides retry is favorable.
 
-### Recursion stop rule
+### Recursion stop rule — decided
 
-**Dead-letter handlers execute once, no retry, and CANNOT themselves be dead-lettered.**
+✅ Decided 2026-06-01: **dead-letter handlers execute once, no retry, and CANNOT themselves be dead-lettered.**
 
-When the dispatcher invokes a dead-letter trigger, the resulting execution is marked `is_dead_letter_handler = true`. If it fails:
+- The flag lives on the **execution/outbox row** (set by the dispatcher when it picks a row whose trigger has `kind = 'dead_letter'`), not on the trigger config. Same handler script could in principle be reused for non-DL work without inheriting the no-retry treatment.
+- On handler failure:
+  - Full payload + error logged to structured logs
+  - Counter `picloud_dead_letter_handler_failures{app_id}` bumped
+  - Original dead-letter row annotated with `resolution = 'handler_failed'`
+  - **No retry, no second dead-letter row, no further fire.**
+- **Missing handler script** (trigger references `script_id` that's been deleted): treated as a handler failure — same metric bump, same `resolution = 'handler_failed'`, same no-retry. Auto-disabling the trigger is deferred to v1.2; for v1.1.1 the user sees the metric spike and investigates.
+- **Indirect loops** (DL handler writes to KV → fires a KV trigger → that handler fails → dead-letters → fires the same DL handler) are not blocked by this rule directly; they're bounded by the existing trigger-depth limit (`cx.trigger_depth`). 
The recursion-stop rule only prevents the *direct* infinite regress where a DL handler's failure would itself produce a DL row.
 
-- Failure is logged to the structured log (full payload + error)
-- A metric is bumped (`picloud_dead_letter_handler_failures`)
-- Original dead-letter row annotated with `resolution = "handler_failed"`
-- **Nothing else is fired.** Chain stops definitively.
+Rationale: if your alerting script is broken, the platform shouldn't try to alert about that with the same broken script. The chain has to terminate, period.
 
-This is the only safe stop rule. If your alerting script is broken, the platform shouldn't try to alert about that with the same broken script.
+### Defaults — decided
 
-### Defaults
+✅ Decided 2026-06-01: **no automatic handler.** Dead letters land in the table; users opt into handling by registering a `dead_letter` trigger.
 
-**No automatic handler.** Dead letters silently land in the table. Users opt into handling by registering a trigger. The dashboard surfaces an unresolved-count badge so users notice.
+**Load-bearing commitment:** the v1.1.1 dashboard surfaces this state. Without dashboard surface, "no default handler" is irresponsible — users wouldn't know dead-letters exist until they queried Postgres directly. So shipping the table without the UI is not an option.
 
-This avoids over-engineering — most apps will run for months without a dead-letter trigger; the table is the durable record either way.
+Required in v1.1.1 alongside the table:
+
+- An **unresolved-count badge** per app, visible in the dashboard's app list and on the app detail page. Source query: `SELECT count(*) FROM dead_letters WHERE app_id = $1 AND resolved_at IS NULL`.
+- A **per-app dead-letters list view** reachable from the badge. Columns: `created_at`, `source`, `op`, `script_id`, `last_error`, `attempt_count`, `first_attempt_at`, `last_attempt_at`. 
Per-row actions: **Replay** (re-inserts the original event into the outbox; dispatcher tries again from scratch) and **Mark resolved** (sets `resolution = 'ignored'`, no further action).
+- A row detail panel showing the full payload + complete error history.
+
+Rationale: most apps will run for months without ever needing a DL handler; the table is the durable record either way. The dashboard surface gives users the lightest-touch signal that something is wrong without committing v1.1.1 to building a notifications channel.
+
+A heavier built-in default ("log to admin notifications channel") was considered and rejected — it would smuggle a notifications-surface design into v1.1.1 under the guise of a default, with real product-design questions (channel shape, configuration, opt-out, rate-limiting) that aren't worth answering yet. If the dashboard badge proves insufficient in practice, a structured-log fallback (writing to `execution_logs` with a known `dead_letter` shape) is an additive future change, not a breaking one.
 
 ### Sync HTTP failures don't dead-letter
 
@@ -298,34 +346,53 @@ Sync HTTP requests (`reply_to.is_some()`) failures don't land in `dead_letters`.
 
 One `pubsub::publish` → N subscribers → each retries independently → each can independently dead-letter. So one publish can produce N dead-letter rows (one per subscriber that exhausted retries). Subscribers are independent failure domains.
 
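The independent-failure-domain behaviour described above can be modelled in a few lines. The sketch below is a toy: `fan_out`, `Outcome` and `dead_letter_count` are hypothetical names, the handler closure stands in for a subscriber script, and backoff, persistence and the real dispatcher are deliberately omitted.

```rust
/// Final state of one delivery record (one record per subscriber).
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Delivered { attempts: u32 },
    DeadLettered { attempts: u32, last_error: String },
}

/// Fan one message out to every subscriber, retrying each delivery
/// independently up to `max_attempts` times. One subscriber failing never
/// affects another subscriber's attempts.
pub fn fan_out<F>(
    subscribers: &[&str],
    msg: &str,
    max_attempts: u32,
    mut handler: F,
) -> Vec<(String, Outcome)>
where
    F: FnMut(&str, &str, u32) -> Result<(), String>,
{
    let mut results = Vec::with_capacity(subscribers.len());
    for &sub in subscribers {
        // Placeholder state; it only survives if max_attempts is 0.
        let mut outcome = Outcome::DeadLettered {
            attempts: 0,
            last_error: String::from("never attempted"),
        };
        for attempt in 1..=max_attempts {
            match handler(sub, msg, attempt) {
                Ok(()) => {
                    outcome = Outcome::Delivered { attempts: attempt };
                    break;
                }
                Err(e) => {
                    outcome = Outcome::DeadLettered { attempts: attempt, last_error: e };
                }
            }
        }
        results.push((sub.to_string(), outcome));
    }
    results
}

/// Number of dead-letter rows a single publish produced.
pub fn dead_letter_count(results: &[(String, Outcome)]) -> usize {
    results
        .iter()
        .filter(|(_, o)| matches!(o, Outcome::DeadLettered { .. }))
        .count()
}
```

With three subscribers where one succeeds immediately, one succeeds on its second attempt and one always fails, a single publish yields exactly one dead-letter record and two delivered records.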
-### Manual replay
+### Manual replay — Rhai SDK scope decided
 
-| Surface | Use case |
-|---|---|
-| `POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/replay` | Admin clicks "replay" in dashboard |
-| `dead_letters::replay(id)` Rhai SDK | A handler script decides to retry programmatically |
-| `dead_letters::resolve(id, reason)` Rhai SDK | A handler decides "this is fine, don't bother me" |
+✅ Decided 2026-06-01: ship `dead_letters::replay(id)` and `dead_letters::resolve(id, reason)` in v1.1.1; **defer `dead_letters::list(filter)` to v1.2** to align with `docs::find()` query semantics.
+
+| Surface | Use case | Shipping in |
+|---|---|---|
+| `POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/replay` | Admin clicks "replay" in dashboard | v1.1.1 |
+| `POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/resolve` | Admin marks resolved via dashboard | v1.1.1 |
+| `GET /api/v1/admin/apps/{id}/dead_letters` | Dashboard list view | v1.1.1 |
+| `dead_letters::replay(id)` Rhai SDK | A handler script decides to retry programmatically | v1.1.1 |
+| `dead_letters::resolve(id, reason)` Rhai SDK | A handler decides "this is fine, don't bother me" | v1.1.1 |
+| `dead_letters::list(filter)` Rhai SDK | Bulk replay / cleanup scripts | **v1.2** (aligns with `docs::find()` query DSL) |
 
 Replay re-inserts the original event into the outbox; dispatcher tries again from scratch.
 
-### Retention
+**Authz:** both replay and resolve are gated by a new `Capability::AppDeadLetterManage(AppId)` checked inside the service methods. The capability is granted to app admins by default (existing Phase 3.5 role hierarchy). A public HTTP script running with `principal: None` would fail this check, which is correct.
 
-Time-based: delete dead letters older than 30 days by default. Configurable per-app via app settings, or globally via env var (`PICLOUD_DEAD_LETTER_RETENTION_DAYS`). A weekly GC job in the manager handles the deletion using `FOR UPDATE SKIP LOCKED`.
+**Trigger-execution principal (related decision):** ✅ a trigger execution runs as the principal that **registered the trigger**, captured on the trigger row at registration time. This gives a clean "the trigger fires as you" model and matches how cron jobs are typically conceptualized. The original event's principal (e.g. the anonymous caller of a public HTTP route) is recorded for forensics on the outbox row but does not become the execution principal. This is a wider trigger-framework decision surfaced here because dead-letter authz is the first concrete consumer; it applies to **every** trigger kind, not just dead-letter.
+
+### Retention — decided
+
+✅ Decided 2026-06-01: **30 days, GC by `created_at`, env-overridable only (no per-app override in v1.1.1).**
+
+- Default: 30 days
+- Override: `PICLOUD_DEAD_LETTER_RETENTION_DAYS` (whole-deployment, not per-app)
+- GC condition: `created_at < NOW() - retention` — applies to both resolved and unresolved rows uniformly. (Activity-age GC — keeping recently-resolved rows 30 days post-resolution — was considered and deferred; can switch if user feedback shows it's needed without breaking anything.)
+- GC job: weekly sweep in `manager-core`, claiming via `FOR UPDATE SKIP LOCKED` to match the dispatcher's claim pattern.
+
+Per-app retention overrides are deferred to a later release. The env var covers single-deployer needs; per-app settings would need a dashboard surface + permissions story that isn't worth smuggling into v1.1.1.
 
 ### Status
 
 - **Separate `dead_letters` table**: leaning yes.
 - **`dead_letter` as trigger kind**: leaning yes.
-- **Recursion stop rule** (handlers can't be dead-lettered): leaning yes.
-- **No default handler** (rows just sit in table): leaning yes.
+- **Recursion stop rule** (handlers can't be dead-lettered): ✅ Decided 2026-06-01 (above); flag lives on the execution; missing-handler case treated as handler failure.
+- **No default handler** (rows sit in table; dashboard surfaces them): ✅ Decided 2026-06-01 — unresolved-count badge + per-app list view ship in v1.1.1 alongside the table.
 - **Sync HTTP failures don't dead-letter**: leaning yes.
+- **Retention**: ✅ Decided 2026-06-01 — 30 days, GC by `created_at`, env-only override (`PICLOUD_DEAD_LETTER_RETENTION_DAYS`); weekly `FOR UPDATE SKIP LOCKED` sweep in `manager-core`.
+- **Rhai SDK scope**: ✅ Decided 2026-06-01 — `replay` + `resolve` ship in v1.1.1; `list` deferred to v1.2 to align with `docs::find()` query DSL. New `Capability::AppDeadLetterManage(AppId)`.
+- **Trigger-execution principal**: ✅ Decided 2026-06-01 — trigger fires as the principal that registered it (captured on the trigger row at registration). Original event's principal is recorded on the outbox row for forensics but does not become the execution principal. Applies to all trigger kinds.
 
 ### Open calls
 
-1. Dead-letter handlers unretryable + can't be dead-lettered themselves — confirmed?
-2. No default dead-letter handler (rows just sit in the table); user opts in — confirmed, or do you want a built-in "log to admin notifications channel" default?
-3. 30-day default retention sensible, or longer/shorter?
-4. Include Rhai SDK (`dead_letters::list/replay/resolve`) in v1.1.1 alongside admin endpoints, or defer the script-side surface to a later release?
+1. ~~Dead-letter handlers unretryable + can't be dead-lettered themselves~~ — ✅ Decided 2026-06-01: confirmed; flag on execution; missing-handler = `resolution = 'handler_failed'`; indirect loops bounded by `cx.trigger_depth`.
+2. ~~No default dead-letter handler~~ — ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view (with Replay + Mark-resolved actions) ship in v1.1.1 alongside the table.
+3. ~~30-day default retention~~ — ✅ Decided 2026-06-01: 30 days, GC by `created_at`, env-only override; per-app retention deferred.
+4. 
~~Rhai SDK for dead-letters in v1.1.1~~ — ✅ Decided 2026-06-01: `replay` + `resolve` ship; `list` deferred to v1.2 to align with `docs::find()`; new `Capability::AppDeadLetterManage(AppId)`. Related: trigger executions run as the trigger-registering principal.
 
 ---
 
@@ -333,48 +400,79 @@ Time-based: delete dead letters older than 30 days by default. Configurable per-
 
 Apps built on PiCloud need a way for browser/mobile clients to receive live updates (chat messages, dashboard data, multiplayer state, notifications). Today's pub/sub is internal-only (script ↔ script via triggers).
 
-### The chosen approach
+### The chosen approach — decided
 
-**Option C (from prior debate): topics with opt-in external subscription.**
+✅ Decided 2026-06-01: **Option C (one publish API, topics opt-in to external visibility) with the registration split below.**
 
-- One `pubsub::publish(topic, msg)` API for scripts — produces a single event
-- Topics are **internal-only by default** — script triggers can subscribe; external clients cannot
-- Apps explicitly mark topics as externally-subscribable (per-topic config in dashboard / API)
-- External clients connect to `GET /realtime/topics/{topic}` via SSE and receive only messages published to topics they're permitted to subscribe to
+- One `pubsub::publish_durable(topic, msg)` API for scripts — produces a single event regardless of who subscribes.
+- Topics are **internal-only by default**: script triggers can subscribe; external clients cannot.
+- **Externally-subscribable topics must be registered explicitly** (admin API + dashboard surface). Internal-only topics remain implicit — anyone can `publish_durable("any.topic", msg)` and triggers can subscribe without registration. To externalize: create a `topics` row with `external_subscribable = true` first.
+- External clients connect to `GET /realtime/topics/{topic}` via SSE; they only receive messages from registered, externally-subscribable topics they're permitted to access.
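The default-deny registration rule in the list above can be sketched as a small lookup. This is only an illustration under stated assumptions: `TopicRegistry` is a hypothetical in-memory stand-in for the `topics` table, its method names are invented here, and it models the registration gate alone, not the per-topic auth check that decides whether a given client is permitted.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `topics` table.
#[derive(Default)]
pub struct TopicRegistry {
    /// topic name -> value of the `external_subscribable` column
    external_subscribable: HashMap<String, bool>,
}

impl TopicRegistry {
    /// Explicit registration (an admin action in the notes).
    pub fn register(&mut self, topic: &str, external_subscribable: bool) {
        self.external_subscribable
            .insert(topic.to_string(), external_subscribable);
    }

    /// Internal script triggers may subscribe to any topic, registered or not.
    pub fn internal_may_subscribe(&self, _topic: &str) -> bool {
        true
    }

    /// External clients are refused unless a row exists with the flag set.
    /// Unregistered topics fall through to `false` (internal-only by default).
    pub fn external_may_subscribe(&self, topic: &str) -> bool {
        self.external_subscribable
            .get(topic)
            .copied()
            .unwrap_or(false)
    }
}
```

The `unwrap_or(false)` is the whole point of the design: a topic that nobody registered can never be reached from outside, no matter what its name looks like.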
 
-**Wins:** one publish API for scripts (DRY), topics don't leak by default (security), external visibility is an explicit opt-in per topic.
+**UI/security commitments** (the difference between C working and C being default-public in disguise):
 
-### Transport: SSE first
+1. The externally-subscribable opt-in is prominent UI, not a buried checkbox.
+2. The topic list view shows "external: yes/no" as a first-class column.
+3. Marking a topic externally-subscribable requires app admin role (capability-gated via `Capability::AppTopicManage(AppId)`).
+4. The bit-flip is its own API endpoint (not a side-effect of generic topic update) so it carries an independent audit trail.
 
-SSE (Server-Sent Events) for v1.x:
+**Wins:** one publish API for scripts (DRY), topics are private by default (security), external visibility requires deliberate explicit registration (not just a config flag flipped during quick edits).
+
+**Why not A (every topic externally-visible by default):** topic names tend to describe the event, not the audience; internal topics frequently carry PII or sensitive payloads; the Firebase-style "remember to lock it down" anti-pattern this whole design rejects.
+
+**Why not B (separate `channels::` service):** doubles the publish API for almost-identical use cases; scripts wanting both internal triggers AND client push would publish twice; users wrap it in a helper and we're back at C with extra steps and no central policy enforcement.
+
+### Transport: SSE first — decided
+
+✅ Decided 2026-06-01: **SSE-only for v1.1.6. 
WebSocket added in a later release if real bidirectional demand emerges.**
 
 - Simpler than WebSocket; works through any HTTP proxy without protocol upgrade
-- Browsers auto-reconnect on disconnect
-- Sufficient for "server-pushed events to the browser" (the dominant use case)
+- Browsers auto-reconnect on disconnect (native `EventSource`)
+- Covers the dominant use cases (chat-message-list updates, dashboard streams, notifications, IoT telemetry, build-status streams) cleanly
+- Production-quality SSE requires HTTP/2 between Caddy and clients to dodge the per-origin connection cap on HTTP/1.1 — Caddy speaks HTTP/2 by default, so this is just a config note for the deploy docs
 
-WebSocket is added later if bidirectional comms (chat-style) warrant it.
+**Why not ship WS in v1.1.6:** WS is the right tool for sub-100ms bidirectional state (multiplayer games, CRDT collaborative editing, typing-indicator-level presence). On consumer hardware with Postgres-backed event distribution, that latency budget is dominated by the server stack anyway — WS would be paying implementation cost (frame management, ping/pong, close codes, backpressure protocol) without unlocking the latency it's designed for. SSE-only also frees v1.1.6 to invest in `@picloud/client` library quality instead of transport edge cases.
 
-### Auth model for external subscribers
+**Future addition path:** WebSocket coexists with SSE on a different endpoint (e.g. `/realtime/ws/{topic}`) backed by the same subscriber registry. Purely additive — no SSE clients break, no architecture decision in v1.1.6 closes the door.
 
-Three flavors, ordered by complexity:
+### Auth model for external subscribers — decided
 
-- **Public topics** — anyone with the URL connects. For marketing-style broadcasts, public stat boards.
-- **Token-gated topics** — client presents a token issued by a script. Pusher / Ably-style. Token can be a PiCloud API key (v1.1.6) or a users-SDK session token (v1.1.8+). 
-- **Script-mediated** — a script handles each subscribe request and decides yes/no. Most flexible, defer to v1.2. +✅ Decided 2026-06-01: ship **public** + **HMAC-signed subscriber-token** auth in v1.1.6; **users-SDK session-based** auth follows in v1.1.8 (additive); **script-mediated per-subscribe** auth deferred to v1.2. -Ship public + token-gated in v1.1.6; defer script-mediated. +**Topic config columns:** +- `external_subscribable: bool` — can external clients ever subscribe? +- `auth_mode: 'public' | 'token'` — if external, what's the gate? (ignored when `external_subscribable = false`) +- v1.1.8 adds `auth_mode = 'session'` for users-SDK-based sessions; v1.2 adds `auth_mode = 'script'` for script-mediated. + +**v1.1.6 trust flow (token-gated topics):** + +| Hop | Auth mechanism | +|---|---| +| Script → its own token-mint endpoint | Existing API-key + app authz | +| Script → SDK helper to mint token | New `pubsub::subscriber_token(topics, ttl)` | +| Frontend → script's token endpoint | App's own auth (cookie/session/whatever the app defines) | +| Frontend → PiCloud SSE | Short-lived HMAC-signed subscriber token (bearer header) | +| SSE handler → token validation | HMAC verify, scope-check requested topic against token's allowed list | + +The frontend **never** touches the app's API key. The script signs scoped, short-lived bearers (HMAC over `{topic_list, exp, app_id}`) with a secret derived from the app's API-key material. The SSE endpoint validates the signature without a DB lookup. + +**Token TTL:** clamped 10s ≤ ttl ≤ 24h. Default 1h. Both bounds and default env-overridable (`PICLOUD_SUBSCRIBER_TOKEN_TTL_MIN_SEC`, `PICLOUD_SUBSCRIBER_TOKEN_TTL_MAX_SEC`, `PICLOUD_SUBSCRIBER_TOKEN_TTL_DEFAULT_SEC`). + +**Token revocation:** none in v1.1.6 by design. HMAC bearers can't be revoked individually; rotation of the signing key invalidates all bearers wholesale. Short TTL is the safety mechanism. 
Per-token revocation arrives implicitly with v1.1.8's session-based auth (sessions CAN be invalidated). + +**Public topics:** no auth at all. `GET /realtime/topics/{topic}` works for anyone if the topic has `external_subscribable = true AND auth_mode = 'public'`. Used for marketing-style broadcasts and public stat boards. ### Status -- **Approach C (opt-in external subscription)**: leaning yes. -- **SSE first, WebSocket later**: leaning yes. -- **Public + token-gated auth in v1.1.6**: leaning yes. +- **Approach C (opt-in external subscription)**: ✅ Decided 2026-06-01 — internal-only by default; externally-subscribable topics require explicit registration + admin-role capability; UI surface treats the bit-flip as a deliberate, audited action. +- **SSE first, WebSocket later**: ✅ Decided 2026-06-01 — SSE-only in v1.1.6; WS deferred until concrete demand emerges; future addition is purely additive on a separate endpoint. +- **Public + token-gated auth in v1.1.6**: ✅ Decided 2026-06-01 — HMAC-signed subscriber-token flow (not raw API-key passing); `users::*` session-based and script-mediated auth deferred per the table above. ### Open calls -1. Approach C confirmed (vs A: pubsub IS realtime, B: separate `channels::` service)? -2. SSE first, WebSocket deferred — confirmed, or ship both in v1.1.6? -3. Auth: public + API-key gating in v1.1.6, defer users-SDK-based tokens to v1.1.8 follow-up — confirmed? +1. ~~Approach C confirmed~~ — ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics (internal-only stays implicit); new `Capability::AppTopicManage(AppId)`. +2. ~~SSE first, WebSocket deferred~~ — ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred to a later release; future addition is purely additive. +3. 
~~Auth model~~ — ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; `users::*` session auth in v1.1.8; script-mediated auth in v1.2; token TTL clamped 10s–24h (default 1h), env-overridable; no per-token revocation in v1.1.6 (rely on TTL). --- @@ -391,15 +489,26 @@ Strategic positioning question: how much should PiCloud expose to frontend devel PiCloud today sits at the minimalist end (services exist for scripts to use, not for frontends). Crossing to maximalist would be a real product pivot, not a feature add. -### The chosen approach: hybrid +### The chosen approach: hybrid — decided -**Ship a client library that talks to scripts, not to services.** Specifically, three things: +✅ Decided 2026-06-01: **Hybrid model. No direct service access from the frontend; client library standardizes script-mediated ceremony.** + +Four pieces ship in `@picloud/client` for v1.1.6: 1. **Typed HTTP client to dev-defined endpoints** — `picloud.endpoint('/api/users').post({ name: 'alice' })`. Fetch wrapper with auth header injection, retry logic, structured error handling. 2. **SSE subscription** — `picloud.subscribe('chat-room-123', msg => …)`. Auto-reconnect, token refresh, backpressure. 3. **Auth flow helpers** — `picloud.auth.login(email, password)`, `picloud.auth.logout()`, `picloud.auth.token`. These call **dev-defined** endpoints under the hood (`/api/auth/login` etc.); the lib just standardizes the dance + token storage. +4. **Realtime-aware framework hooks** — `useTopic(topic)` for React, store-shape `subscribe(topic)` for Svelte. Thin polish over the SSE primitive; what frontend devs actually write. -Crucially: **no `picloud.kv.get()` or `picloud.docs.find()` from the frontend.** Those stay server-side, behind dev-written Rhai scripts. 
+Hard rule, load-bearing: **no `picloud.kv.get()` / `picloud.docs.find()` / `picloud.users.list()` from the frontend.** Keeping direct service access out of the browser is a strategic and security commitment, not a v1.1.6 limitation. A frontend dev who wants `kv.get()` from the browser writes a 6-line Rhai script binding it to a route; that friction is intentional and makes the dev decide deliberately that the read is okay to expose. + +**Why not Firebase-mode** (full direct service access): +- Different product, different competition (Supabase / Amplify / Appwrite have a 5-year head start and full-time teams). +- Requires security-rule language + per-row authorization evaluator + tooling that PiCloud's solo-dev audience cannot operate safely. Firebase's #1 cause of data exposure is misconfigured rules — well-documented, recurring. +- Script-as-gate is dramatically more defensible: the rules are just code, in the same language as the rest of the app, debuggable like any other code. + +**Why not pure-minimalist** (no client lib, just docs): +- Every PiCloud frontend dev hand-rolls the same fetch wrapper, SSE reconnect, token refresh, login/logout dance. Shipping `@picloud/client` removes that boilerplate without expanding the security surface. ### Why hybrid, not maximalist @@ -409,22 +518,28 @@ Firebase trades security for DX; the security-rule misconfiguration footgun is t A frontend dev shouldn't have to hand-roll fetch wrappers, SSE reconnect logic, and token-refresh dances. That stuff is identical across every app. Shipping it as `@picloud/client` is genuinely valuable — it doesn't expand the security surface (scripts still gate everything), it just removes boilerplate. -### TypeScript first +### TypeScript first — decided -Ship TypeScript first. Cross-language story (Python, Swift, Kotlin, Rust, …) deferred until demand emerges. TS covers the dominant "web app + mobile via React Native" segment. +✅ Decided 2026-06-01: **TypeScript only for v1.1.6.
Other-language SDKs deferred, demand-driven, no preemptive ranking.** + +- TS covers ~85% of the realistic v1.x audience (web + React Native mobile + Capacitor + Electron). +- Native iOS / Android / Python / Rust / Go users can hit the REST + SSE endpoints directly without an SDK; they lose the typed wrapper but aren't blocked from shipping. +- The REST + SSE surface is documented as the **public protocol contract** so future PiCloud or the community can build other-language SDKs against a stable spec. PiCloud doesn't promise specific languages or timelines preemptively; a real user with a concrete use case is what triggers a new SDK. +- **Known caveat:** React Native doesn't ship a native `EventSource`. The TS client should runtime-detect and either fall back gracefully or require an explicit polyfill (`react-native-sse` / `react-native-event-source`) with clear docs. Not a blocker; worth surfacing in the v1.1.6 README. ### Status -- **Hybrid model (frontend through scripts only)**: leaning yes. -- **TypeScript first, other languages deferred**: leaning yes. -- **Co-ship with realtime as v1.1.6**: leaning yes (SSE wrapper is the killer feature of the lib). +- **Hybrid model (frontend through scripts only)**: ✅ Decided 2026-06-01 — confirmed; no direct service access from the browser; client lib standardizes script-mediated ceremony only. +- **TypeScript first, other languages deferred**: ✅ Decided 2026-06-01 — TS-only in v1.1.6; REST + SSE documented as public protocol contract; other languages demand-driven with no preemptive ranking; React Native SSE polyfill noted as known caveat. +- **Co-ship with realtime as v1.1.6**: ✅ Decided 2026-06-01 — server-side realtime AND `@picloud/client@1.0.0` ship together in v1.1.6. Built in parallel against a frozen REST + SSE spec. If v1.1.6 scope blows up under pressure, the lib is the deferrable piece (slips to v1.1.6.1); the realtime server itself doesn't slip. 
+- **Type safety / codegen**: ✅ Decided 2026-06-01 — defer codegen to v1.2+; v1.1.6 ships hand-written types with `endpoint()` generic + optional client-side runtime validation via user-provided schemas (zod/valibot adapter; ~50 lines). No schema-declaration syntax in v1.1.6 — committing to that before v1.2's coherent codegen design would lock us into a shape we'd regret. Doc schemas (already arriving in v1.1.2) are the natural foundation for v1.2 codegen; script-endpoint schemas get designed alongside the generator, not before. ### Open calls -1. Hybrid model — confirmed, or do you want to seriously evaluate Firebase-mode? -2. TypeScript first, multi-language deferred — confirmed? -3. Co-ship realtime + client lib as v1.1.6, or split (server in v1.1.6, client lib later)? -4. Type safety: hand-written types only, or aim for codegen from script-declared schemas? Codegen is big — defer to v1.2+ if at all? +1. ~~Hybrid model~~ — ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; `@picloud/client` ships typed HTTP + SSE + auth-flow + framework hooks. +2. ~~TypeScript first, multi-language deferred~~ — ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol; other-language SDKs are demand-driven; React Native SSE polyfill caveat documented. +3. ~~Co-ship realtime + client lib~~ — ✅ Decided 2026-06-01: co-ship in v1.1.6, built in parallel against a frozen REST + SSE spec. Lib is the deferrable piece under scope pressure (slips to v1.1.6.1); server doesn't slip. +4. ~~Type safety / codegen~~ — ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types with `endpoint()` generic + optional zod/valibot runtime validation; no schema declarations in v1.1.6. --- @@ -445,11 +560,11 @@ Net changes vs the [blueprint §12](../serverless_cloud_blueprint.md) roadmap: | **v1.1.3** | **Modules** — `scripts.kind`, per-app resolver replaces `DummyModuleResolver`, AST cache + dep-graph invalidation. 
| | **v1.1.4** | **Outbound HTTP & Scheduled Tasks** — `http::*` with SSRF deny-list; cron triggers (small now that the framework exists). | | **v1.1.5** | **Files & Pub/Sub** — filesystem-backed blobs (`files///`) with `files:*` triggers; pub/sub via the universal outbox with `pubsub:*` triggers. | -| **v1.1.6** | **Realtime Channels & Client Library** *(new)* — SSE-based external subscription to per-app pub/sub topics (public + API-key auth modes); `@picloud/client` TypeScript package (typed HTTP, SSE subscription, auth helpers). | +| **v1.1.6** | **Realtime Channels & Client Library** *(new)* — SSE-based external subscription to per-app pub/sub topics (public + HMAC-signed subscriber-token auth, minted via `pubsub::subscriber_token`); `@picloud/client` TypeScript package (typed HTTP via `endpoint()`, SSE subscription, auth helpers, framework hooks). | | **v1.1.7** | **Configuration & Email** *(was v1.1.6)* — encrypted per-app secrets; outbound `email::send/send_html` + inbound `email:receive` trigger. | | **v1.1.8** | **User Management** *(was v1.1.7)* — `users::*` for in-script CRUD, auth, roles, invites, password reset. | | **v1.1.9** | **Durable Queues & Function Composition** *(was v1.1.8)* — `queue::*` with `queue:receive` trigger; `invoke()` + `retry::*` (closures-as-args, re-entrant Rhai). | -| **v1.2** | **Workflows & Hierarchies** (per blueprint §Phase 5) — DAG execution, advanced docs query, interceptors, read triggers, audit log, script-mediated realtime auth. | +| **v1.2** | **Workflows & Hierarchies** (per blueprint §Phase 5) — DAG execution, advanced docs query, interceptors, read triggers, audit log, script-mediated realtime auth, `dead_letters::list` (aligned with `docs::find()` query DSL), client-lib type codegen from script-declared schemas. 
| | **v1.3+** | **Scale & Ops** (per blueprint §Phase 6) — cluster mode (NATS-style request/reply swaps to `LISTEN/NOTIFY`), cross-app data sharing, script versioning + rollback, rate limiting, richer auth, metrics, distributed tracing, webhooks, S3, monitoring/alerting on HTTP endpoint failures. | The v1.1.9 release marks the end of the v1.1.x expansion cadence. v1.2 is the next minor product bump (phase milestone per [versioning policy](versioning.md)). @@ -458,39 +573,39 @@ The v1.1.9 release marks the end of the v1.1.x expansion cadence. v1.2 is the ne ## Consolidated open calls -Numbered for easy reference in conversation. All currently un-answered. +All 20 open calls were resolved on 2026-06-01. This section is retained as a quick decision index — each item links the original question to the decision recorded in its section above. Sections will be pruned individually as their decisions ship into code and the [serverless_cloud_blueprint.md](../serverless_cloud_blueprint.md). ### §1 — Messaging primitives -1. Pub/sub durability via trigger outbox (durable, at-least-once) — confirmed? -2. Queue and pub/sub stay separate concepts (rather than unifying under a "messaging" abstraction with a subscription-mode flag) — confirmed? +1. ~~Pub/sub durability via trigger outbox~~ — ✅ Decided 2026-06-01: `publish_durable` ships in v1.1.5; `publish_ephemeral` committed as a future API. +2. ~~Queue and pub/sub stay separate~~ — ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction. ### §2 — Universal trigger outbox -3. Sync HTTP via outbox + per-request inbox pattern (NATS-style; see §3) — confirmed, or keep direct dispatch for sync? -4. Ship `dispatch_mode: async` for HTTP routes in v1.1.1, or defer to a later release? -5. Keep `routes` and `triggers` as separate tables (unified at the dispatcher only), or merge schemas? +3. 
~~Sync HTTP via outbox + per-request inbox~~ — ✅ Decided 2026-06-01: yes via outbox; in-process oneshot for v1.1.1, `LISTEN/NOTIFY` preserved as the cluster-mode (v1.3+) cross-process variant. +4. ~~Ship `dispatch_mode: async` for HTTP routes in v1.1.1~~ — ✅ Decided 2026-06-01: yes; `202 Accepted` + JSON body with `execution_id`; route-level config only. +5. ~~Trigger storage shape~~ — ✅ Decided 2026-06-01: Layout E (parent `triggers` + per-kind `_trigger_details`); `routes` stays its own table for v1.1.x; column-set refinements deferred to implementation PR. ### §3 — NATS-style sync HTTP -6. NATS-style request/reply for sync HTTP — confirmed? -7. Status code strategy: keep existing distinctions (recommended) or flatten to 500? -8. Default retry policy on triggers: 3 attempts with exp backoff (1s/2s/4s), or different defaults? -9. Cancel-on-timeout semantics: discard late results (a), write to "abandoned executions" table for debugging (b — recommended), or attempt to ack the caller late (c)? +6. ~~NATS-style request/reply for sync HTTP~~ — ✅ Decided 2026-06-01 (see §2 #3). +7. ~~Status code strategy~~ — ✅ Decided 2026-06-01: keep distinctions; `500` reserved for platform problems. +8. ~~Default retry policy on triggers~~ — ✅ Decided 2026-06-01: 3/exp/1000ms + ±20% jitter; env-overridable via `PICLOUD_TRIGGER_RETRY_*`; per-trigger columns override. +9. ~~Cancel-on-timeout semantics~~ — ✅ Decided 2026-06-01: (b) — `abandoned_executions` table; dispatcher-written; 7-day retention via `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`; metric counter on insert. ### §4 — Dead letters -10. Dead-letter handlers unretryable + can't be dead-lettered themselves — confirmed? -11. No default dead-letter handler (rows just sit in the table); user opts in — confirmed, or built-in "log to admin notifications channel" default? -12. 30-day default retention sensible, or longer/shorter? -13. 
Include Rhai SDK (`dead_letters::list/replay/resolve`) in v1.1.1 alongside admin endpoints, or defer the script-side surface to a later release? +10. ~~Dead-letter handlers unretryable + can't be dead-lettered themselves~~ — ✅ Decided 2026-06-01: confirmed; flag lives on the execution; missing handler = `resolution = 'handler_failed'`; indirect loops bounded by `cx.trigger_depth`. +11. ~~No default dead-letter handler~~ — ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view ship in v1.1.1. +12. ~~30-day default retention~~ — ✅ Decided 2026-06-01: 30 days, GC by `created_at`, env-only override (`PICLOUD_DEAD_LETTER_RETENTION_DAYS`). +13. ~~Rhai SDK for dead-letters in v1.1.1~~ — ✅ Decided 2026-06-01: `replay` + `resolve` in v1.1.1; `list` deferred to v1.2; new `Capability::AppDeadLetterManage(AppId)`. Related: trigger executions inherit the registrant's principal. ### §5 — Realtime -14. Approach C confirmed (opt-in external subscription on pub/sub topics) vs A: pubsub IS realtime, B: separate `channels::` service? -15. SSE first, WebSocket deferred — confirmed, or ship both in v1.1.6? -16. Auth: public + API-key gating in v1.1.6, defer users-SDK-based tokens to v1.1.8 follow-up — confirmed? +14. ~~Approach C confirmed~~ — ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics; new `Capability::AppTopicManage(AppId)`. +15. ~~SSE first, WebSocket deferred~~ — ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred. +16. ~~Auth model~~ — ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; `users::*` session auth in v1.1.8; script-mediated in v1.2; TTL 10s–24h (default 1h), env-overridable. ### §6 — Frontend client library -17. Hybrid model (frontend through scripts only) — confirmed, or seriously evaluate Firebase-mode? -18. TypeScript first, multi-language deferred — confirmed? -19. 
Co-ship realtime + client lib as v1.1.6, or split (server in v1.1.6, client lib later)? -20. Type safety: hand-written types only, or aim for codegen from script-declared schemas? Defer codegen to v1.2+ if at all? +17. ~~Hybrid model~~ — ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; client lib standardizes script-mediated ceremony only. +18. ~~TypeScript first, multi-language deferred~~ — ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol contract. +19. ~~Co-ship realtime + client lib~~ — ✅ Decided 2026-06-01: co-ship in v1.1.6, parallel-built against a frozen spec; lib is the deferrable piece under scope pressure. +20. ~~Type safety / codegen~~ — ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types via `endpoint()` + optional zod/valibot runtime validation. ---
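Reviewer sketch for the subscriber-token decision in §5 (HMAC over `{topic_list, exp, app_id}`, TTL clamped to 10s–24h with a 1h default, validation without a DB lookup). This is a minimal illustration under stated assumptions, not the planned implementation: `mintToken`, `verifyToken`, and `clampTtl` are hypothetical names, the `payload.signature` wire format and SHA-256 are assumptions, and the secret is taken as an input because the notes do not specify how it is derived from the app's API-key material.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Claims named in the design notes: topic list, expiry, app id.
interface SubscriberClaims {
  topics: string[];
  exp: number; // unix seconds
  app_id: string;
}

// Bounds and default from the notes (env-overridable in the real service).
const TTL_MIN_SEC = 10;
const TTL_MAX_SEC = 24 * 60 * 60;
const TTL_DEFAULT_SEC = 60 * 60;

function clampTtl(ttlSec: number = TTL_DEFAULT_SEC): number {
  return Math.min(TTL_MAX_SEC, Math.max(TTL_MIN_SEC, ttlSec));
}

function sign(secret: string, body: string): Buffer {
  return createHmac("sha256", secret).update(body).digest();
}

// Assumed wire format: base64url(JSON claims) + "." + base64url(HMAC).
function mintToken(
  secret: string,
  appId: string,
  topics: string[],
  ttlSec?: number,
  nowSec: number = Math.floor(Date.now() / 1000),
): string {
  const claims: SubscriberClaims = {
    topics,
    exp: nowSec + clampTtl(ttlSec),
    app_id: appId,
  };
  const body = Buffer.from(JSON.stringify(claims)).toString("base64url");
  return `${body}.${sign(secret, body).toString("base64url")}`;
}

// Stateless check: signature first, then expiry, then topic scope.
function verifyToken(
  secret: string,
  token: string,
  topic: string,
  nowSec: number = Math.floor(Date.now() / 1000),
): boolean {
  const [body, sig, ...rest] = token.split(".");
  if (!body || !sig || rest.length > 0) return false;
  const expected = sign(secret, body);
  const given = Buffer.from(sig, "base64url");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return false;
  }
  const claims = JSON.parse(
    Buffer.from(body, "base64url").toString("utf8"),
  ) as SubscriberClaims;
  return claims.exp > nowSec && claims.topics.includes(topic);
}
```

Because verification only needs the signing secret, rotating that secret invalidates every outstanding token at once, which matches the "no per-token revocation, rely on short TTL" decision recorded above.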