Annotates the v1.1.x design notes with the resolutions for the 20 open calls — pub/sub split, universal outbox, NATS-style sync HTTP, status code strategy, retry policy, dead-letter recursion-stop, realtime auth model, frontend client library scope. Captured ahead of the v1.1.1 implementation so the schema + API decisions in this branch have a single load-bearing source of truth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1.1.x design notes — in-flight decisions + revised roadmap
Planning document for the v1.1.x release series. Companion to:
- serverless_cloud_blueprint.md — authoritative design
- docs/sdk-shape.md — SDK conventions (settled in v1.1.0)
- docs/stdlib-reference.md — stdlib API (settled in v1.1.0)
- docs/versioning.md — versioning policy (post-1.0 carve-out settled with v1.1.0)
Items in this doc are either tentatively decided but not yet shipped or open calls awaiting the maintainer's decision. Once an item ships, its content moves into the blueprint and the corresponding section here gets pruned.
This document was created at the v1.1.0 → v1.1.1 boundary, capturing the architectural conversations that followed v1.1.0 but haven't yet landed in code or in the blueprint.
1. The three messaging primitives
PiCloud will expose three distinct messaging concepts. The right way to slice them is along recipient model and delivery semantics:
| Primitive | Recipients | Durability | Delivery | Retry on script failure | Mental model |
|---|---|---|---|---|---|
| invoke(script_id, args) | One named script | None (or fire-and-forget durable) | At-most-once sync, or at-least-once async | Caller-controlled via retry::* | Function call |
| pubsub::publish_durable(topic, msg) | All scripts subscribed via trigger | Through outbox | At-least-once per subscriber | Per-subscriber retry up to N, then dead-letter | Fan-out broadcast (persisted) |
| pubsub::publish_ephemeral(topic, msg) (future) | All scripts subscribed via trigger | None (in-memory NOTIFY) | At-most-once per subscriber | None | Fan-out broadcast (best-effort) |
| queue::enqueue(name, msg) | Exactly one consumer wins | Durable table | At-least-once total | Visibility timeout + nack-on-throw | Work distribution |
Critical distinction: pub/sub and queue both end up at-least-once, but the subscriber model differs. Queue: 1 message → 1 delivery record → consumers compete. Pub/sub: 1 message → N delivery records (one per subscriber) → no competition.
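The difference in delivery records can be made concrete with a small sketch. It is only an illustration of the "1 message → N records" versus "1 message → 1 record" rule; the type `DeliveryRecord` and the functions `pubsub_records` and `queue_records` are hypothetical names, not part of the PiCloud schema or SDK.

```rust
/// One pending delivery of a message. Illustrative shape only.
#[derive(Debug, Clone, PartialEq)]
pub struct DeliveryRecord {
    pub message_id: u64,
    /// `Some(script)` for a pub/sub delivery addressed to one subscriber;
    /// `None` for a queue row that any consumer may claim.
    pub subscriber: Option<String>,
}

/// Pub/sub: one published message yields one record per subscriber.
pub fn pubsub_records(message_id: u64, subscribers: &[&str]) -> Vec<DeliveryRecord> {
    subscribers
        .iter()
        .map(|s| DeliveryRecord { message_id, subscriber: Some(s.to_string()) })
        .collect()
}

/// Queue: one enqueued message yields exactly one record; consumers compete for it.
pub fn queue_records(message_id: u64) -> Vec<DeliveryRecord> {
    vec![DeliveryRecord { message_id, subscriber: None }]
}
```

With three subscribers, `pubsub_records` returns three independently retryable records, while `queue_records` always returns one.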
Pub/sub reframe — durable through the outbox, ephemeral as named escape hatch
The original blueprint plan was pub/sub via Postgres LISTEN/NOTIFY (ephemeral, sub-millisecond fan-out). Reframe to reuse the triggers framework's outbox infrastructure for the durable path, and keep ephemeral as a separately-named future API:
- pubsub::publish_durable(topic, msg) writes to the outbox (v1.1.5)
  - Dispatcher fans out one delivery record per subscribed script trigger
  - Each delivery retried on failure with the same machinery as KV / doc / file triggers
  - After N retries → dead-letter (see §4)
- pubsub::publish_ephemeral(topic, msg) is committed as a future addition for the in-memory LISTEN/NOTIFY path — not shipped in v1.1.5, but the API split is decided now so users learn "durable by default, opt into ephemeral" from the start (rather than the reverse, which would be a breaking rename later).
Wins: one delivery model in the whole system for the durable path, durable pub/sub for free, shared observability/retry/dead-letter tooling across every event-firing surface.
Cost: ~1ms Postgres write per publish_durable (vs in-memory NOTIFY). For solo-dev / consumer hardware, the right tradeoff. The ephemeral escape hatch exists for sub-ms / high-frequency workloads if/when they emerge.
Note on durability semantics. "Durable" here means the outbox row persists, not that fan-out is transactional with the publisher's own data writes. A script doing kv.set(...) then pubsub::publish_durable(...) performs two separate writes; a crash between them can drop the publish. This is weaker than the classic transactional-outbox pattern, which commits the outbox row in the same transaction as the data write, but it is consistent with how KV / doc / file triggers already work.
Queue stays separate
Pub/sub-through-outbox cannot model "work distribution with backpressure" cleanly. Queue keeps its own table:
- Producer: queue::enqueue(name, msg) → queue table
- Consumer: queue:receive trigger fires when message available; runtime claims with FOR UPDATE SKIP LOCKED + visibility timeout
- Script returns successfully → auto-ack (delete row)
- Script throws → auto-nack (clear claim; message becomes visible again)
- Visibility timeout exceeded → reclaim allowed (handles crashed consumers)
- Max delivery attempts → dead-letter
The queue table IS the outbox for queue semantics — no double-buffering.
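The claim / ack / nack lifecycle above can be sketched as an in-memory model. This is a toy under stated assumptions, not the runtime: `Queue`, `Row`, and the logical `now` clock are hypothetical, a `Vec` stands in for the Postgres table and its FOR UPDATE SKIP LOCKED claim, and the sketch moves exhausted rows to the dead-letter list lazily at the next claim (the real dispatcher may do this eagerly).

```rust
#[derive(Debug)]
pub struct Row {
    pub id: u64,
    pub attempts: u32,
    /// Logical timestamp until which the current claim hides the row.
    pub claimed_until: Option<u64>,
}

#[derive(Debug, Default)]
pub struct Queue {
    pub rows: Vec<Row>,
    pub dead_letters: Vec<u64>,
    pub visibility_timeout: u64,
    pub max_attempts: u32,
    next_id: u64,
}

/// A row is visible when it is unclaimed or its claim has expired.
fn is_visible(row: &Row, now: u64) -> bool {
    row.claimed_until.map_or(true, |until| until <= now)
}

impl Queue {
    pub fn new(visibility_timeout: u64, max_attempts: u32) -> Self {
        Queue { visibility_timeout, max_attempts, ..Default::default() }
    }

    pub fn enqueue(&mut self) -> u64 {
        self.next_id += 1;
        self.rows.push(Row { id: self.next_id, attempts: 0, claimed_until: None });
        self.next_id
    }

    /// Claim the first visible row, hiding it for `visibility_timeout` ticks.
    pub fn claim(&mut self, now: u64) -> Option<u64> {
        let (timeout, max) = (self.visibility_timeout, self.max_attempts);
        // Visible rows that already used every attempt go to the dead-letter list.
        let (dead, live): (Vec<Row>, Vec<Row>) = std::mem::take(&mut self.rows)
            .into_iter()
            .partition(|r| is_visible(r, now) && r.attempts >= max);
        self.dead_letters.extend(dead.iter().map(|r| r.id));
        self.rows = live;
        let row = self.rows.iter_mut().find(|r| is_visible(r, now))?;
        row.attempts += 1;
        row.claimed_until = Some(now + timeout);
        Some(row.id)
    }

    /// Script returned successfully: delete the row.
    pub fn ack(&mut self, id: u64) {
        self.rows.retain(|r| r.id != id);
    }

    /// Script threw: clear the claim so the row becomes visible again.
    pub fn nack(&mut self, id: u64) {
        if let Some(row) = self.rows.iter_mut().find(|r| r.id == id) {
            row.claimed_until = None;
        }
    }
}
```

A crashed consumer never calls `ack` or `nack`; its row simply becomes claimable again once `now` reaches `claimed_until`, which is the reclaim path described above.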
Status
- Durable pub/sub via trigger outbox: ✅ Decided 2026-06-01 — ship as pubsub::publish_durable in v1.1.5.
- Ephemeral pub/sub: ✅ Committed 2026-06-01 as a future addition named pubsub::publish_ephemeral. Not in v1.1.5; the explicit-naming split lands now so the durable default doesn't need a breaking rename later.
- Drop LISTEN/NOTIFY for v1.1.5: ✅ Decided 2026-06-01.
- Queue stays separate from pub/sub: ✅ Decided 2026-06-01 — two distinct top-level namespaces (queue::* and pubsub::*); no unifying messaging::* abstraction. Rationale: the two have genuinely different mental models (work distribution vs fan-out), the implementations share almost no code (queue needs FOR UPDATE SKIP LOCKED + visibility timeout + nack-on-throw; pub/sub needs per-subscriber fan-out + independent retry/dead-letter), and a unified API would force users to choose a mode they already know from the use case. A future Kafka-shaped consumer-group unification was considered and rejected — PiCloud is outbox-based, not log-based, so going Kafka-shaped would mean rebuilding storage.
Open calls
1. Pub/sub durability via trigger outbox — ✅ Decided 2026-06-01: yes, both publish_durable (v1.1.5) and publish_ephemeral (future) committed with explicit names.
2. Queue and pub/sub stay separate concepts — ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction.
2. Universal trigger outbox
The triggers framework's outbox should be the universal substrate for async dispatch. Every event source that fires scripts asynchronously writes to the same outbox table; one dispatcher reads from it and routes to the executor with shared load control, retry, dead-letter, and trigger-depth tracking.
What runs through the outbox
| Ingress | Path | Reason |
|---|---|---|
| HTTP request (sync) | Direct: orchestrator → executor → response (with NATS-style indirection — see §3) | Caller is waiting; the inbox pattern makes this work via the outbox |
| HTTP request (async, opt-in) | Orchestrator writes outbox → returns 202 → dispatcher → executor | Webhooks, fire-and-forget endpoints; explicit opt-in via route config |
| Cron tick | Scheduler writes outbox → dispatcher → executor | No caller; naturally async |
| KV / doc / file change | Service writes outbox → dispatcher → executor | No caller; the originating script already returned |
| Pub/sub publish | Service writes outbox → dispatcher → executor (per subscriber) | Fan-out semantics |
| Queue message | Queue table IS the outbox; dispatcher claims via FOR UPDATE SKIP LOCKED | Avoids double-buffering |
| Inbound email | SMTP receiver writes outbox → dispatcher → executor | No caller |
What this gives
- One dispatcher = one place for load control (the existing ExecutionGate), retry, dead-letter, trigger-depth tracking, fan-out. New event source = "write to outbox in this shape", nothing else.
- Routes become a trigger kind, conceptually. A route is (source=http, filter=method+path, script_id, dispatch_mode=sync|async). Schema-wise the routes table likely stays separate from the new triggers table (polymorphic JSON columns get ugly), but the mental model collapses to "everything that fires a script is a trigger".
- dispatch_mode = async is a per-route opt-in. Webhook handlers can return 202 immediately and process in the background — dispatcher handles retries, caller gets a snappy ack.
- Replay and debugging. Every async invocation has an outbox row; admin can re-fire a trigger by re-dispatching the row.
- Decoupled lifecycle. Dispatcher can be paused for maintenance without affecting HTTP ingress (it just queues); HTTP can degrade (overflow 503s) without affecting async work already in the outbox.
What this doesn't change
- Sync HTTP still hits the ExecutionGate the same way (now via the dispatcher).
- Async outbox dispatch also hits the gate when the dispatcher picks a row. Sync and async share the cap on actual blocking-thread-in-use.
- Trigger CRUD likely stays in per-kind tables for schema sanity; the unification is conceptual + dispatch-layer, not schema-layer.
Status
- Universal outbox for async dispatch: ✅ Decided 2026-06-01 — yes; all async ingress (KV/cron/pubsub/queue/email/dead-letter) writes to one outbox; one dispatcher reads it.
- Sync HTTP via outbox (NATS-style inbox): ✅ Decided 2026-06-01 — in-process oneshot in v1.1.1; cluster-mode keeps the door open for LISTEN/NOTIFY keyed on inbox_id in v1.3+ (see §3 implementation table).
- Routes-as-trigger conceptually: ✅ yes — the dispatch layer treats routes and triggers uniformly.
- Trigger storage shape: Layout E (parent + per-kind detail tables): ✅ Decided 2026-06-01. One shared triggers parent with common columns (id, app_id, script_id, kind, enabled, dispatch_mode, retry config, timestamps); one <kind>_trigger_details table per service (kv_trigger_details, cron_trigger_details, pubsub_trigger_details, queue_trigger_details, email_trigger_details, dead_letter_trigger_details). Outbox FKs to triggers.id; dead-letters FK same. Exact column set (notably outbox.app_id denormalization, whether script_id also lives on outbox, ON DELETE behavior on the parent vs detail tables) will be refined when v1.1.1 implementation lands.
- routes table stays separate from the triggers parent for now: ✅ Decided 2026-06-01. routes is Phase-3 production schema with its own trie-index columns; folding into the parent is a v1.2 cleanup, not a v1.1.1 requirement. Outbox discriminates HTTP rows via source_kind = 'http' and trigger_id referencing routes.id for HTTP, triggers.id for everything else.
- Per-route dispatch_mode: sync|async: ✅ Decided 2026-06-01 — ships in v1.1.1. Async returns 202 Accepted with a JSON body { "accepted_at": "...", "execution_id": "..." }. dispatch_mode is a route property fixed at route creation; scripts cannot switch modes mid-call.
Open calls
3. Sync HTTP via outbox + per-request inbox — ✅ Decided 2026-06-01: yes via outbox; in-process oneshot now, LISTEN/NOTIFY explicitly preserved for cluster mode (v1.3+).
4. Ship dispatch_mode: async in v1.1.1 — ✅ Decided 2026-06-01: yes; 202 Accepted + JSON body with execution_id; route-level config only.
5. Trigger storage shape — ✅ Decided 2026-06-01: Layout E (parent + per-kind detail tables); routes stays its own table for v1.1.x. Exact column set deferred to implementation PR.
3. NATS-style request/reply for sync HTTP
The constraint that makes "universal outbox" tricky: HTTP has a caller waiting. We can't write to outbox, return 202, and walk away — the user's browser expects 200 OK with body. NATS's request/reply pattern resolves this elegantly.
Pattern
HTTP request → orchestrator generates inbox_id, registers a oneshot channel
→ writes outbox row { source: http, payload, reply_to: inbox_id }
→ awaits on the channel (with timeout = script's wall-clock + buffer)
Dispatcher → picks outbox row
→ dispatches to executor (gate + spawn_blocking + Rhai)
→ if reply_to.is_some(): resolves the channel with the result
→ if reply_to.is_none(): records completion + retries on failure per trigger config
Orchestrator → channel resolves → returns response to HTTP caller
→ on timeout: returns 504 or 500 → see status-code calls below
The HTTP caller's experience is unchanged (synchronous request/response). Under the hood, dispatch is identical for every invocation source.
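The orchestrator/dispatcher handshake in the pattern above can be sketched with the standard library alone. This is a minimal model, not the v1.1.1 code: it uses `std::sync::mpsc` as a stand-in for an async oneshot channel, and `InboxRegistry`, `ScriptResult`, `register`, `resolve`, and `await_reply` are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::time::Duration;

pub type InboxId = u64;
/// Stand-in for whatever the executor returns (assumed shape).
pub type ScriptResult = Result<String, String>;

#[derive(Default)]
struct State {
    next_id: InboxId,
    waiting: HashMap<InboxId, Sender<ScriptResult>>,
}

/// Shared between the orchestrator (registers) and the dispatcher (resolves).
#[derive(Clone, Default)]
pub struct InboxRegistry {
    state: Arc<Mutex<State>>,
}

impl InboxRegistry {
    /// Orchestrator side: allocate an inbox_id and park the sending half.
    pub fn register(&self) -> (InboxId, Receiver<ScriptResult>) {
        let (tx, rx) = channel();
        let mut state = self.state.lock().unwrap();
        state.next_id += 1;
        let id = state.next_id;
        state.waiting.insert(id, tx);
        (id, rx)
    }

    /// Dispatcher side: deliver the result for `reply_to`.
    /// Returns false when nobody is waiting on that inbox any more.
    pub fn resolve(&self, id: InboxId, result: ScriptResult) -> bool {
        let sender = self.state.lock().unwrap().waiting.remove(&id);
        match sender {
            Some(tx) => tx.send(result).is_ok(),
            None => false,
        }
    }
}

/// Orchestrator side: block until the reply arrives or the timeout elapses.
pub fn await_reply(
    rx: &Receiver<ScriptResult>,
    timeout: Duration,
) -> Result<ScriptResult, RecvTimeoutError> {
    rx.recv_timeout(timeout)
}
```

In this model the orchestrator would write the outbox row with `reply_to` set to the returned id between `register` and `await_reply`; a timeout from `await_reply` corresponds to the "on timeout" branch of the pattern.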
Implementation by deployment mode
| Mode | Mechanism | Trade-off |
|---|---|---|
| In-process (v1.1.1, MVP) | Per-orchestrator HashMap<InboxId, oneshot::Sender<Result>>; dispatcher resolves the oneshot | Sub-ms wake-up; fails across process boundaries |
| Cross-process (cluster mode v1.3+) | Postgres LISTEN/NOTIFY keyed on inbox_id, with a responses row as durable backup | Sub-10ms wake-up; survives across nodes; needs careful long-listener management |
| Polling fallback | Orchestrator polls responses table for inbox_id every ~10ms | Simple; ~10ms minimum latency; only as fallback |
Latency cost (honest numbers)
Per sync HTTP request, NATS-style adds: ~1-2ms Postgres write (outbox) + sub-ms dispatcher wake (in-process channel) + ~1ms response resolve = ~2-5ms overhead. For most scripts (10-100ms execution), this is noise. PiCloud isn't optimizing for sub-ms; the architectural unification is worth a few ms.
Default retry policy — decided
✅ Decided 2026-06-01:
| Knob | Default | Env override | Per-trigger column |
|---|---|---|---|
| Max attempts | 3 | PICLOUD_TRIGGER_RETRY_MAX_ATTEMPTS | retry_max_attempts |
| Backoff shape | exponential | PICLOUD_TRIGGER_RETRY_BACKOFF (exponential / linear / constant) | retry_backoff |
| Base delay | 1000ms | PICLOUD_TRIGGER_RETRY_BASE_MS | retry_base_ms |
| Jitter | ±20% | PICLOUD_TRIGGER_RETRY_JITTER_PCT | (not per-trigger; dispatcher-side) |
With the defaults, schedule after each failed attempt is ~1s / ~2s / ~4s (each ±20%), total time-to-dead-letter ~7s.
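The schedule arithmetic can be checked with a short sketch. The function names are hypothetical, and two details are assumptions rather than decisions recorded here: exponential backoff is taken to double per failed attempt, and linear backoff is taken to mean base × attempt number. Jitter is expressed as bounds so the example stays deterministic.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backoff {
    Exponential,
    Linear,
    Constant,
}

/// Nominal delay (before jitter) after the `failed_attempt`-th failure, 1-based.
/// Assumes doubling for exponential and base * n for linear.
pub fn nominal_delay_ms(shape: Backoff, base_ms: u64, failed_attempt: u32) -> u64 {
    match shape {
        Backoff::Exponential => base_ms * 2u64.pow(failed_attempt.saturating_sub(1)),
        Backoff::Linear => base_ms * u64::from(failed_attempt),
        Backoff::Constant => base_ms,
    }
}

/// Inclusive lower and upper bound of the delay once ±`jitter_pct` percent is applied.
pub fn jitter_bounds_ms(delay_ms: u64, jitter_pct: u64) -> (u64, u64) {
    let spread = delay_ms * jitter_pct / 100;
    (delay_ms.saturating_sub(spread), delay_ms + spread)
}

/// Sum of the nominal delays after failures 1..=failures.
pub fn total_delay_ms(shape: Backoff, base_ms: u64, failures: u32) -> u64 {
    (1..=failures).map(|n| nominal_delay_ms(shape, base_ms, n)).sum()
}
```

With the defaults (exponential, 1000ms base), the nominal delays after failures 1, 2, and 3 are 1000, 2000, and 4000ms, summing to the ~7s quoted above; the first delay lands between 800 and 1200ms after ±20% jitter.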
What triggers a retry: any of Rhai runtime error, wall-clock timeout, operation-budget-exceeded, or platform-side failure (Postgres unavailable, executor crashed). Distinguishing them in the dispatcher is fiddly and the retry cost is bounded by max_attempts; if op-budget retries become dead-letter spam in practice, revisit.
Per-trigger override: the three retry columns on the triggers parent table (Layout E) take precedence over the env-configured defaults. Trigger CRUD endpoints accept these on create/update; if omitted, the env defaults are applied at write time (not lazily at dispatch — keeps the policy auditable from the row itself).
Sync HTTP exception: unchanged. reply_to.is_some() rows are never retried regardless of policy (see below).
Retry policy — reply_to IS the signal
| Outbox row | Retry behavior |
|---|---|
| reply_to.is_some() | Never retry. Caller is waiting; retrying means the script might run twice and the caller gets one of two outcomes. Always: one attempt, surface result (success or failure) to inbox. |
| reply_to.is_none() | Retry per trigger's configured policy. Default: 3 attempts, exponential backoff (1s, 2s, 4s), dead-letter after. |
Per-trigger config lives on the trigger row:
trigger { source: cron, schedule: "0 */5 * * * *",
retry: { max_attempts: 5, backoff: exponential, base_ms: 1000 } }
trigger { source: pubsub, topic: "user.created",
retry: { max_attempts: 3, backoff: linear, base_ms: 500 } }
trigger { source: http, method: POST, path: "/api/foo",
dispatch_mode: sync } // retry absent — sync HTTP is always 1-attempt
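The "reply_to is the signal" rule reduces to a small decision function. The sketch below is illustrative only: `OutboxRow`, `NextStep`, and `after_failure` are hypothetical names, and the row is trimmed to the three fields the decision needs.

```rust
/// Trimmed-down outbox row; the real row carries more columns.
pub struct OutboxRow {
    /// Present only for sync HTTP: the inbox a caller is waiting on.
    pub reply_to: Option<u64>,
    pub attempts_made: u32,
    pub max_attempts: u32,
}

#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// Sync HTTP: one attempt only, hand the failure to the waiting caller.
    SurfaceToInbox,
    /// Async: schedule another attempt per the trigger's policy.
    Retry,
    /// Async: attempts exhausted.
    DeadLetter,
}

/// What the dispatcher does after a failed execution of `row`.
pub fn after_failure(row: &OutboxRow) -> NextStep {
    if row.reply_to.is_some() {
        NextStep::SurfaceToInbox
    } else if row.attempts_made < row.max_attempts {
        NextStep::Retry
    } else {
        NextStep::DeadLetter
    }
}
```

Note that the sync branch ignores `max_attempts` entirely, which is why a sync route needs no retry block in its config.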
Failure / crash handling
With NATS-style indirection, there are new ways for a sync HTTP request to vanish. Every failure path must resolve the orchestrator's oneshot channel with something:
| Failure mode | Detection | Caller sees |
|---|---|---|
| Script throws / runtime error | Executor returns ExecError::Runtime → written to inbox | 502 (or 500 — see status-code discussion) |
| Script exceeds wall-clock | tokio::time::timeout fires inside dispatcher → written to inbox | 504 (or 500) |
| Operation budget exceeded | Executor returns ExecError::OperationBudgetExceeded → inbox | 507 (or 500) |
| Executor process crashes mid-execution | JoinError → ExecError::Runtime → inbox | 500 |
| Dispatcher process dies between claim and reply | Orchestrator's wait times out | 500 |
| Outbox write fails (Postgres unavailable) | Orchestrator never publishes; immediate error | 500 |
| Orchestrator's own wait times out unexpectedly | Channel timeout fires before inbox resolves | 504 (or 500) |
Every path resolves the channel with a result. The orchestrator's outer timeout is the backstop for "dispatcher just died completely".
Status code strategy — decided
✅ Decided 2026-06-01: keep the granular status codes (Option A), with one refinement — 500 is reserved for platform problems (dispatcher vanished, outbox write failed, inbox channel timed out unexpectedly), not used as a generic catch-all.
| Code | Cause | Who's at fault |
|---|---|---|
| 422 | Request validation failed | Client |
| 502 | Script threw / Rhai runtime error | User script |
| 503 | Gate refused (overloaded); Retry-After: 1 | Platform (capacity) |
| 504 | Wall-clock timeout | Either (slow script or platform overload) |
| 507 | Operation budget exceeded | User script |
| 500 | Dispatcher vanished / outbox write failed / inbox channel timed out unexpectedly | Platform (bug or infra) |
Rationale: each code is actionable for the caller (back off, redesign as async, fix the script, file a bug). Flattening to 500 would collapse "script crashed" vs "overloaded" vs "your timeout is too tight" vs "platform broke" into one undifferentiated signal — losing both client-facing UX and our own observability/alerting axis.
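The status table maps directly onto a match expression. This is a sketch of the mapping only; the `Outcome` enum and both function names are hypothetical, and the real code presumably maps from ExecError and gate results rather than from this simplified type.

```rust
/// Simplified failure classification, one variant per row of the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    ValidationFailed,
    ScriptError,
    GateRefused,
    WallClockTimeout,
    OperationBudgetExceeded,
    /// Dispatcher vanished, outbox write failed, or inbox timed out unexpectedly.
    PlatformFailure,
}

/// HTTP status code the caller sees for each outcome.
pub fn status_code(outcome: Outcome) -> u16 {
    match outcome {
        Outcome::ValidationFailed => 422,
        Outcome::ScriptError => 502,
        Outcome::GateRefused => 503,
        Outcome::WallClockTimeout => 504,
        Outcome::OperationBudgetExceeded => 507,
        Outcome::PlatformFailure => 500,
    }
}

/// Value of the Retry-After header in seconds, if one is sent.
pub fn retry_after_secs(outcome: Outcome) -> Option<u32> {
    match outcome {
        Outcome::GateRefused => Some(1),
        _ => None,
    }
}
```

Keeping the match exhaustive (no wildcard arm in `status_code`) means a new failure kind cannot silently fall through to 500.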
Status
- NATS-style for sync HTTP: ✅ Decided 2026-06-01 (see §2 #3).
- reply_to presence as the "don't retry" signal: ✅ Decided 2026-06-01 (folded with the NATS-style decision).
- Status code strategy: ✅ Decided 2026-06-01 — keep granular distinctions; 500 reserved for platform problems only.
- Default retry policy: ✅ Decided 2026-06-01 — 3 attempts / exponential / 1000ms base / ±20% jitter; all four env-overridable via PICLOUD_TRIGGER_RETRY_*; per-trigger columns on the parent table take precedence.
- Cancel-on-timeout semantics: ✅ Decided 2026-06-01 — option (b). Late results are discarded from the caller's POV (they already got a 504) but the dispatcher writes an abandoned_executions row whenever it tries to resolve a oneshot that's already closed/dropped. 7-day default retention via PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS; weekly GC sweep. A counter (picloud_abandoned_executions_total{app_id}) bumps on insert — that's the primary observability signal; the rows themselves are for forensics when the counter spikes. Only the dispatcher-after-orchestrator-timeout edge case writes a row; ordinary "script timed out, caller got 504" stays uneventful.
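The cancel-on-timeout rule hinges on detecting that the receiving half of the reply channel is gone. A minimal sketch, assuming a `std::sync::mpsc` channel in place of the real oneshot and an in-memory `AbandonedLog` (hypothetical) in place of the abandoned_executions table and its counter:

```rust
use std::sync::mpsc::{SendError, Sender};

/// Stand-in for one abandoned_executions row (assumed columns).
#[derive(Debug, PartialEq)]
pub struct AbandonedExecution {
    pub execution_id: u64,
    pub late_result: String,
}

/// Stand-in for the table plus the metrics counter.
#[derive(Debug, Default)]
pub struct AbandonedLog {
    pub rows: Vec<AbandonedExecution>,
    pub counter: u64,
}

impl AbandonedLog {
    /// Try to hand `result` to the waiting orchestrator. If the receiver was
    /// already dropped (the caller timed out), record the late result instead.
    /// Returns true when the result reached the caller's channel.
    pub fn resolve(&mut self, reply: &Sender<String>, execution_id: u64, result: String) -> bool {
        match reply.send(result) {
            Ok(()) => true,
            Err(SendError(late_result)) => {
                self.counter += 1;
                self.rows.push(AbandonedExecution { execution_id, late_result });
                false
            }
        }
    }
}
```

The failed send returns the undelivered value, so the late result can be stored for forensics without cloning it up front.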
Open calls
6. NATS-style request/reply for sync HTTP — ✅ Decided 2026-06-01 (see §2 #3).
7. Status code strategy — ✅ Decided 2026-06-01: Option A (keep distinctions); 500 reserved for platform problems.
8. Default retry policy on triggers — ✅ Decided 2026-06-01: 3/exp/1000ms base + ±20% jitter; env-overridable via PICLOUD_TRIGGER_RETRY_*; per-trigger row columns override the env defaults.
9. Cancel-on-timeout semantics — ✅ Decided 2026-06-01: option (b) — abandoned_executions table, dispatcher-written, 7-day retention, metric counter on insert.
4. Dead-letter handling
Events that exhaust their retry policy land in a separate dead_letters table (not a flag on the outbox — outbox should stay a queue with fast inserts and scans). Users handle dead letters by registering a script for the new dead_letter trigger kind.
Schema sketch
CREATE TABLE dead_letters (
id UUID PRIMARY KEY,
app_id UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
original_event_id UUID NOT NULL, -- the outbox row id
source TEXT NOT NULL, -- "kv", "cron", "pubsub", "queue", "email"
op TEXT NOT NULL,
trigger_id UUID, -- which trigger config fired (null for direct dispatches)
script_id UUID, -- which script failed
payload JSONB NOT NULL, -- the event payload, verbatim
attempt_count INT NOT NULL,
first_attempt_at TIMESTAMPTZ NOT NULL,
last_attempt_at TIMESTAMPTZ NOT NULL,
last_error TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
resolved_at TIMESTAMPTZ, -- null = unresolved
resolution TEXT -- "replayed" | "ignored" | "handled_by_script" | "handler_failed"
);
CREATE INDEX idx_dead_letters_app_unresolved
ON dead_letters(app_id) WHERE resolved_at IS NULL;
Dead letter as trigger source
trigger {
source: dead_letter,
filter: { source: "kv" }, -- optional; defaults to "any source"
script_id: <your handler>,
dispatch_mode: async,
retry: { max_attempts: 1 } -- forced — see recursion stop rule below
}
Filterable on:
- source: only dead letters from a particular event source (kv, cron, pubsub, …)
- trigger_id: only dead letters from a particular trigger config
- script_id: only dead letters from a particular script
- No filter: every dead letter fires this handler
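Filter matching for a dead_letter trigger can be sketched as follows. The struct and method names are hypothetical, and one behavior is an assumption not stated above: when several filter fields are set, the sketch requires all of them to match (logical AND).

```rust
/// Optional filter fields on a dead_letter trigger; `None` means "any".
#[derive(Debug, Default)]
pub struct DeadLetterFilter {
    pub source: Option<String>,
    pub trigger_id: Option<String>,
    pub script_id: Option<String>,
}

/// The fields of a dead-letter row that filters can look at.
#[derive(Debug)]
pub struct DeadLetter {
    pub source: String,
    pub trigger_id: Option<String>,
    pub script_id: Option<String>,
}

impl DeadLetterFilter {
    /// True when every filter field that is set matches the dead letter.
    /// An empty filter matches everything.
    pub fn matches(&self, dl: &DeadLetter) -> bool {
        let source_ok = self.source.as_ref().map_or(true, |s| *s == dl.source);
        let trigger_ok = self
            .trigger_id
            .as_ref()
            .map_or(true, |t| dl.trigger_id.as_ref() == Some(t));
        let script_ok = self
            .script_id
            .as_ref()
            .map_or(true, |s| dl.script_id.as_ref() == Some(s));
        source_ok && trigger_ok && script_ok
    }
}
```

A filter on trigger_id or script_id never matches a row where that column is null, which seems the least surprising reading for direct dispatches.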
ctx.event for a dead-letter handler:
ctx.event.source // "dead_letter"
ctx.event.dead_letter = #{
original: #{
source: "kv",
op: "insert",
collection: "widgets",
key: "k1",
payload: #{ ... }
},
attempts: 3,
last_error: "script timeout after 30s",
trigger_id: "...",
script_id: "...",
first_attempt_at: "2026-05-30T12:00:00.000Z",
last_attempt_at: "2026-05-30T12:00:14.000Z"
}
The handler can call log::error, notify admins via email::send, write to docs::collection("incidents").create(...), post to external alerting via http::post, or call dead_letters::replay(id) if it decides a retry is worthwhile.
Recursion stop rule — decided
✅ Decided 2026-06-01: dead-letter handlers execute once, no retry, and CANNOT themselves be dead-lettered.
- The flag lives on the execution/outbox row (set by the dispatcher when it picks a row whose trigger has kind = 'dead_letter'), not on the trigger config. Same handler script could in principle be reused for non-DL work without inheriting the no-retry treatment.
- On handler failure:
  - Full payload + error logged to structured logs
  - Counter picloud_dead_letter_handler_failures{app_id} bumped
  - Original dead-letter row annotated with resolution = 'handler_failed'
  - No retry, no second dead-letter row, no further fire.
- Missing handler script (trigger references script_id that's been deleted): treated as a handler failure — same metric bump, same resolution = 'handler_failed', same no-retry. Auto-disabling the trigger is deferred to v1.2; for v1.1.1 the user sees the metric spike and investigates.
- Indirect loops (DL handler writes to KV → fires a KV trigger → that handler fails → dead-letters → fires the same DL handler) are not blocked by this rule directly; they're bounded by the existing trigger-depth limit (cx.trigger_depth). The recursion-stop rule only prevents the direct infinite regress where a DL handler's failure would itself produce a DL row.
Rationale: if your alerting script is broken, the platform shouldn't try to alert about that with the same broken script. The chain has to terminate, period.
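The termination property can be shown with a toy model. `DeadLetterStore` and its methods are hypothetical; a `Vec` stands in for the dead_letters table and a plain integer for the picloud_dead_letter_handler_failures counter. Structured logging is omitted.

```rust
#[derive(Debug)]
pub struct DeadLetterRow {
    pub id: u64,
    pub resolution: Option<&'static str>,
}

#[derive(Debug, Default)]
pub struct DeadLetterStore {
    pub rows: Vec<DeadLetterRow>,
    pub handler_failures: u64,
    next_id: u64,
}

impl DeadLetterStore {
    /// An ordinary trigger execution exhausted its retries: record a dead letter.
    pub fn dead_letter(&mut self) -> u64 {
        self.next_id += 1;
        self.rows.push(DeadLetterRow { id: self.next_id, resolution: None });
        self.next_id
    }

    /// A dead_letter handler failed while handling row `dl_id`.
    /// Bump the counter and annotate the original row. Deliberately does not
    /// retry and does not create another dead-letter row.
    pub fn handler_failed(&mut self, dl_id: u64) {
        self.handler_failures += 1;
        if let Some(row) = self.rows.iter_mut().find(|r| r.id == dl_id) {
            row.resolution = Some("handler_failed");
        }
    }
}
```

Because `handler_failed` never calls `dead_letter`, the row count cannot grow as a result of a failing handler, no matter how often the handler itself breaks.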
Defaults — decided
✅ Decided 2026-06-01: no automatic handler. Dead letters land in the table; users opt into handling by registering a dead_letter trigger.
Load-bearing commitment: the v1.1.1 dashboard surfaces this state. Without dashboard surface, "no default handler" is irresponsible — users wouldn't know dead-letters exist until they queried Postgres directly. So shipping the table without the UI is not an option.
Required in v1.1.1 alongside the table:
- An unresolved-count badge per app, visible in the dashboard's app list and on the app detail page. Source query: SELECT count(*) FROM dead_letters WHERE app_id = $1 AND resolved_at IS NULL.
- A per-app dead-letters list view reachable from the badge. Columns: created_at, source, op, script_id, last_error, attempt_count, first_attempt_at, last_attempt_at. Per-row actions: Replay (re-inserts the original event into the outbox; dispatcher tries again from scratch) and Mark resolved (sets resolution = 'ignored', no further action).
- A row detail panel showing the full payload + complete error history.
Rationale: most apps will run for months without ever needing a DL handler; the table is the durable record either way. The dashboard surface gives users the lightest-touch signal that something is wrong without committing v1.1.1 to building a notifications channel.
A heavier built-in default ("log to admin notifications channel") was considered and rejected — it would smuggle a notifications-surface design into v1.1.1 under the guise of a default, with real product-design questions (channel shape, configuration, opt-out, rate-limiting) that aren't worth answering yet. If the dashboard badge proves insufficient in practice, a structured-log fallback (writing to execution_logs with a known dead_letter shape) is an additive future change, not a breaking one.
Sync HTTP failures don't dead-letter
Failures of sync HTTP requests (reply_to.is_some()) don't land in dead_letters, for three reasons: the caller already got an error response; dead-lettering every failed HTTP request would flood the table; and execution_logs already captures sync request failures. If a user wants alerts on HTTP endpoint failures, that's monitoring (v1.3+ territory), not dead-lettering.
Pub/sub fan-out dead-letters independently
One pubsub::publish_durable → N subscribers → each retries independently → each can independently dead-letter. So one publish can produce N dead-letter rows (one per subscriber that exhausted retries). Subscribers are independent failure domains.
Manual replay — Rhai SDK scope decided
✅ Decided 2026-06-01: ship dead_letters::replay(id) and dead_letters::resolve(id, reason) in v1.1.1; defer dead_letters::list(filter) to v1.2 to align with docs::find() query semantics.
| Surface | Use case | Shipping in |
|---|---|---|
| POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/replay | Admin clicks "replay" in dashboard | v1.1.1 |
| POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/resolve | Admin marks resolved via dashboard | v1.1.1 |
| GET /api/v1/admin/apps/{id}/dead_letters | Dashboard list view | v1.1.1 |
| dead_letters::replay(id) Rhai SDK | A handler script decides to retry programmatically | v1.1.1 |
| dead_letters::resolve(id, reason) Rhai SDK | A handler decides "this is fine, don't bother me" | v1.1.1 |
| dead_letters::list(filter) Rhai SDK | Bulk replay / cleanup scripts | v1.2 (aligns with docs::find() query DSL) |
Replay re-inserts the original event into the outbox; dispatcher tries again from scratch.
Authz: both replay and resolve are gated by a new Capability::AppDeadLetterManage(AppId) checked inside the service methods. The capability is granted to app admins by default (existing Phase 3.5 role hierarchy). A public HTTP script running with principal: None would fail this check, which is correct.
Trigger-execution principal (related decision): ✅ a trigger execution runs as the principal that registered the trigger, captured on the trigger row at registration time. This gives a clean "the trigger fires as you" model and matches how cron jobs are typically conceptualized. The original event's principal (e.g. the anonymous caller of a public HTTP route) is recorded for forensics on the outbox row but does not become the execution principal. This is a wider trigger-framework decision surfaced here because dead-letter authz is the first concrete consumer; it applies to every trigger kind, not just dead-letter.
Retention — decided
✅ Decided 2026-06-01: 30 days, GC by created_at, env-overridable only (no per-app override in v1.1.1).
- Default: 30 days
- Override: PICLOUD_DEAD_LETTER_RETENTION_DAYS (whole-deployment, not per-app)
- GC condition: created_at < NOW() - retention — applies to both resolved and unresolved rows uniformly. (Activity-age GC — keeping recently-resolved rows 30 days post-resolution — was considered and deferred; can switch if user feedback shows it's needed without breaking anything.)
- GC job: weekly sweep in manager-core, claiming via FOR UPDATE SKIP LOCKED to match the dispatcher's claim pattern.
Per-app retention overrides are deferred to a later release. The env var covers single-deployer needs; per-app settings would need a dashboard surface + permissions story that isn't worth smuggling into v1.1.1.
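The GC condition and the env override can be sketched in a few lines. Both function names are hypothetical, and the fallback behavior for a missing or unparseable PICLOUD_DEAD_LETTER_RETENTION_DAYS value (use the 30-day default) is an assumption; the document does not say how invalid values are handled.

```rust
use std::time::{Duration, SystemTime};

/// Parse the raw env value, falling back to 30 days when it is absent or
/// not a valid non-negative integer (assumed behavior).
pub fn retention_days_from(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok()).unwrap_or(30)
}

/// Equivalent of `created_at < NOW() - retention`, applied uniformly to
/// resolved and unresolved rows.
pub fn is_expired(created_at: SystemTime, now: SystemTime, retention_days: u64) -> bool {
    let retention = Duration::from_secs(retention_days * 24 * 60 * 60);
    match now.duration_since(created_at) {
        Ok(age) => age > retention,
        // created_at lies in the future (clock skew): keep the row.
        Err(_) => false,
    }
}
```

A caller would pass `std::env::var("PICLOUD_DEAD_LETTER_RETENTION_DAYS").ok().as_deref()` to `retention_days_from` once at startup and reuse the result for each weekly sweep.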
Status
- Separate dead_letters table: leaning yes.
- dead_letter as trigger kind: leaning yes.
- Recursion stop rule (handlers can't be dead-lettered): ✅ Decided 2026-06-01 (above); flag lives on the execution; missing-handler case treated as handler failure.
- No default handler (rows sit in table; dashboard surfaces them): ✅ Decided 2026-06-01 — unresolved-count badge + per-app list view ship in v1.1.1 alongside the table.
- Sync HTTP failures don't dead-letter: leaning yes.
- Retention: ✅ Decided 2026-06-01 — 30 days, GC by created_at, env-only override (PICLOUD_DEAD_LETTER_RETENTION_DAYS); weekly FOR UPDATE SKIP LOCKED sweep in manager-core.
- Rhai SDK scope: ✅ Decided 2026-06-01 — replay + resolve ship in v1.1.1; list deferred to v1.2 to align with docs::find() query DSL. New Capability::AppDeadLetterManage(AppId).
- Trigger-execution principal: ✅ Decided 2026-06-01 — trigger fires as the principal that registered it (captured on the trigger row at registration). Original event's principal is recorded on the outbox row for forensics but does not become the execution principal. Applies to all trigger kinds.
Open calls
10. Dead-letter handlers unretryable + can't be dead-lettered themselves — ✅ Decided 2026-06-01: confirmed; flag on execution; missing-handler = resolution = 'handler_failed'; indirect loops bounded by cx.trigger_depth.
11. No default dead-letter handler — ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view (with Replay + Mark-resolved actions) ship in v1.1.1 alongside the table.
12. 30-day default retention — ✅ Decided 2026-06-01: 30 days, GC by created_at, env-only override; per-app retention deferred.
13. Rhai SDK for dead-letters in v1.1.1 — ✅ Decided 2026-06-01: replay + resolve ship; list deferred to v1.2 to align with docs::find(); new Capability::AppDeadLetterManage(AppId). Related: trigger executions run as the trigger-registering principal.
5. Realtime updates for external clients
Apps built on PiCloud need a way for browser/mobile clients to receive live updates (chat messages, dashboard data, multiplayer state, notifications). Today's pub/sub is internal-only (script ↔ script via triggers).
The chosen approach — decided
✅ Decided 2026-06-01: Option C (one publish API, topics opt-in to external visibility) with the registration split below.
- One pubsub::publish_durable(topic, msg) API for scripts — produces a single event regardless of who subscribes.
- Topics are internal-only by default: script triggers can subscribe; external clients cannot.
- Externally-subscribable topics must be registered explicitly (admin API + dashboard surface). Internal-only topics remain implicit — anyone can publish_durable("any.topic", msg) and triggers can subscribe without registration. To externalize: create a topics row with external_subscribable = true first.
- External clients connect to GET /realtime/topics/{topic} via SSE; they only receive messages from registered, externally-subscribable topics they're permitted to access.
UI/security commitments (the difference between C working and C being default-public in disguise):
- The externally-subscribable opt-in is prominent UI, not a buried checkbox.
- The topic list view shows "external: yes/no" as a first-class column.
- Marking a topic externally-subscribable requires app admin role (capability-gated via Capability::AppTopicManage(AppId)).
- The bit-flip is its own API endpoint (not a side-effect of generic topic update) so it carries an independent audit trail.
Wins: one publish API for scripts (DRY), topics are private by default (security), external visibility requires deliberate explicit registration (not just a config flag flipped during quick edits).
Why not A (every topic externally-visible by default): topic names tend to describe the event, not the audience; internal topics frequently carry PII or sensitive payloads; and default-public is the Firebase-style "remember to lock it down" anti-pattern that this whole design rejects.
Why not B (separate channels:: service): doubles the publish API for almost-identical use cases; scripts wanting both internal triggers AND client push would publish twice; users wrap it in a helper and we're back at C with extra steps and no central policy enforcement.
Transport: SSE first — decided
✅ Decided 2026-06-01: SSE-only for v1.1.6. WebSocket added in a later release if real bidirectional demand emerges.
- Simpler than WebSocket; works through any HTTP proxy without protocol upgrade
- Browsers auto-reconnect on disconnect (native `EventSource`)
- Covers the dominant use cases (chat-message-list updates, dashboard streams, notifications, IoT telemetry, build-status streams) cleanly
- Production-quality SSE requires HTTP/2 between Caddy and clients to dodge the per-origin connection cap on HTTP/1.1 — Caddy speaks HTTP/2 by default, so this is just a config note for the deploy docs
Why not ship WS in v1.1.6: WS is the right tool for sub-100ms bidirectional state (multiplayer games, CRDT collaborative editing, typing-indicator-level presence). On consumer hardware with Postgres-backed event distribution, that latency budget is dominated by the server stack anyway — WS would be paying implementation cost (frame management, ping/pong, close codes, backpressure protocol) without unlocking the latency it's designed for. SSE-only also frees v1.1.6 to invest in @picloud/client library quality instead of transport edge cases.
Future addition path: WebSocket coexists with SSE on a different endpoint (e.g. /realtime/ws/{topic}) backed by the same subscriber registry. Purely additive — no SSE clients break, no architecture decision in v1.1.6 closes the door.
Auth model for external subscribers — decided
✅ Decided 2026-06-01: ship public + HMAC-signed subscriber-token auth in v1.1.6; users-SDK session-based auth follows in v1.1.8 (additive); script-mediated per-subscribe auth deferred to v1.2.
Topic config columns:
- `external_subscribable: bool`, whether external clients can ever subscribe.
- `auth_mode: 'public' | 'token'`, the gate applied when the topic is external (ignored when `external_subscribable = false`).
- v1.1.8 adds `auth_mode = 'session'` for users-SDK-based sessions; v1.2 adds `auth_mode = 'script'` for script-mediated auth.
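The two v1.1.6 columns imply a small decision table for an incoming external subscribe request. A minimal TypeScript sketch, assuming a hypothetical `TopicConfig` row shape and `externalGate` helper (neither name comes from the design; the real check would live in the SSE handler):

```typescript
// Hypothetical shape of a `topics` row; field names follow the columns above.
type AuthMode = "public" | "token";

interface TopicConfig {
  external_subscribable: boolean;
  auth_mode: AuthMode; // ignored when external_subscribable is false
}

type Gate = "deny" | "allow" | "require_token";

// Unregistered topics have no row, so they stay internal-only.
function externalGate(row: TopicConfig | undefined): Gate {
  if (!row || !row.external_subscribable) return "deny";
  return row.auth_mode === "public" ? "allow" : "require_token";
}
```

Because a missing row maps to `"deny"`, the private-by-default rule falls out of the lookup itself rather than from a separate check.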
v1.1.6 trust flow (token-gated topics):
| Hop | Auth mechanism |
|---|---|
| Script → its own token-mint endpoint | Existing API-key + app authz |
| Script → SDK helper to mint token | New pubsub::subscriber_token(topics, ttl) |
| Frontend → script's token endpoint | App's own auth (cookie/session/whatever the app defines) |
| Frontend → PiCloud SSE | Short-lived HMAC-signed subscriber token (bearer header) |
| SSE handler → token validation | HMAC verify, scope-check requested topic against token's allowed list |
The frontend never touches the app's API key. The script signs scoped, short-lived bearers (HMAC over {topic_list, exp, app_id}) with a secret derived from the app's API-key material. The SSE endpoint validates the signature without a DB lookup.
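The signing scheme can be illustrated with a short TypeScript sketch built on Node's `crypto` module. The wire format (base64url JSON payload, a dot, base64url MAC), the use of HMAC-SHA256, and the names `mintToken` / `verifyToken` are all assumptions for illustration; the design only fixes the signed fields `{topic_list, exp, app_id}` and the requirement that verification needs no DB lookup. How the secret is derived from API-key material is left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SubscriberClaims {
  topic_list: string[];
  exp: number; // expiry as Unix seconds
  app_id: string;
}

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("base64url");
}

// Assumed format: base64url(JSON claims) + "." + base64url(HMAC-SHA256 of the payload).
function mintToken(claims: SubscriberClaims, secret: string): string {
  const payload = Buffer.from(JSON.stringify(claims)).toString("base64url");
  return `${payload}.${sign(payload, secret)}`;
}

// Verifies the signature first, then expiry, then topic scope. No DB access.
function verifyToken(token: string, topic: string, secret: string, nowSec: number): boolean {
  const parts = token.split(".");
  if (parts.length !== 2) return false;
  const [payload, mac] = parts as [string, string];
  const expected = Buffer.from(sign(payload, secret));
  const given = Buffer.from(mac);
  if (expected.length !== given.length || !timingSafeEqual(expected, given)) return false;
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8")) as SubscriberClaims;
  return claims.exp > nowSec && claims.topic_list.includes(topic);
}
```

The claims are parsed only after the MAC matches, so untrusted input never reaches `JSON.parse`, and the constant-time comparison avoids leaking how many MAC bytes were correct.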
Token TTL: clamped 10s ≤ ttl ≤ 24h. Default 1h. Both bounds and default env-overridable (PICLOUD_SUBSCRIBER_TOKEN_TTL_MIN_SEC, PICLOUD_SUBSCRIBER_TOKEN_TTL_MAX_SEC, PICLOUD_SUBSCRIBER_TOKEN_TTL_DEFAULT_SEC).
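The clamp can be sketched in a few lines of TypeScript. The helper names `envSeconds` and `resolveTtlSec` are hypothetical, and treating an unset, empty, or non-positive variable as "use the built-in value" is an assumption; only the three variable names, the bounds, and the default come from the text above.

```typescript
type Env = Record<string, string | undefined>;

// Reads a positive number of seconds from the environment, else the fallback.
function envSeconds(env: Env, name: string, fallback: number): number {
  const parsed = Number(env[name]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Clamps a requested TTL into [min, max]; a missing request gets the default.
function resolveTtlSec(requested: number | undefined, env: Env = {}): number {
  const min = envSeconds(env, "PICLOUD_SUBSCRIBER_TOKEN_TTL_MIN_SEC", 10);
  const max = envSeconds(env, "PICLOUD_SUBSCRIBER_TOKEN_TTL_MAX_SEC", 24 * 60 * 60);
  const def = envSeconds(env, "PICLOUD_SUBSCRIBER_TOKEN_TTL_DEFAULT_SEC", 60 * 60);
  return Math.min(max, Math.max(min, requested ?? def));
}
```

Under these assumptions an out-of-range request is silently clamped rather than rejected, which matches the "clamped" wording.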
Token revocation: none in v1.1.6 by design. HMAC bearers can't be revoked individually; rotation of the signing key invalidates all bearers wholesale. Short TTL is the safety mechanism. Per-token revocation arrives implicitly with v1.1.8's session-based auth (sessions CAN be invalidated).
Public topics: no auth at all. GET /realtime/topics/{topic} works for anyone if the topic has external_subscribable = true AND auth_mode = 'public'. Used for marketing-style broadcasts and public stat boards.
Status
- Approach C (opt-in external subscription): ✅ Decided 2026-06-01 — internal-only by default; externally-subscribable topics require explicit registration + admin-role capability; UI surface treats the bit-flip as a deliberate, audited action.
- SSE first, WebSocket later: ✅ Decided 2026-06-01 — SSE-only in v1.1.6; WS deferred until concrete demand emerges; future addition is purely additive on a separate endpoint.
- Public + token-gated auth in v1.1.6: ✅ Decided 2026-06-01: HMAC-signed subscriber-token flow (not raw API-key passing); `users::*` session-based auth deferred to v1.1.8 and script-mediated auth deferred to v1.2.
Open calls
- Approach C confirmed. ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics (internal-only stays implicit); new `Capability::AppTopicManage(AppId)`.
- SSE first, WebSocket deferred. ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred to a later release; future addition is purely additive.
- Auth model. ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; `users::*` session auth in v1.1.8; script-mediated auth in v1.2; token TTL clamped 10s–24h (default 1h), env-overridable; no per-token revocation in v1.1.6 (rely on TTL).
6. Frontend client library
Strategic positioning question: how much should PiCloud expose to frontend developers building apps on top of it?
The two ends of the spectrum
| End | Frontend gets | Examples |
|---|---|---|
| Minimalist | HTTP to dev-defined script endpoints + SSE on dev-marked-public topics. Nothing else. | AWS Lambda + API Gateway, Cloudflare Workers, Deno Deploy |
| Maximalist | Direct client-side access to KV/docs/users/files. Frontend writes kv.get(), docs.find(), no Rhai script for trivial reads. | Firebase, Supabase, AWS Amplify |
PiCloud today sits at the minimalist end (services exist for scripts to use, not for frontends). Crossing to maximalist would be a real product pivot, not a feature add.
The chosen approach: hybrid — decided
✅ Decided 2026-06-01: Hybrid model. No direct service access from the frontend; client library standardizes script-mediated ceremony.
Four pieces ship in @picloud/client for v1.1.6:
- Typed HTTP client to dev-defined endpoints: `picloud.endpoint('/api/users').post({ name: 'alice' })`. Fetch wrapper with auth header injection, retry logic, structured error handling.
- SSE subscription: `picloud.subscribe('chat-room-123', msg => …)`. Auto-reconnect, token refresh, backpressure.
- Auth flow helpers: `picloud.auth.login(email, password)`, `picloud.auth.logout()`, `picloud.auth.token`. These call dev-defined endpoints under the hood (`/api/auth/login` etc.); the lib just standardizes the dance + token storage.
- Realtime-aware framework hooks: `useTopic(topic)` for React, store-shape `subscribe(topic)` for Svelte. Thin polish over the SSE primitive; what frontend devs actually write.
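To make "store-shape" concrete, here is a minimal sketch of a topic store that follows the Svelte store contract (`subscribe(run)` calls `run` synchronously with the current value and returns an unsubscribe function). The `topicStore` name and the `Source` abstraction standing in for the SSE primitive are assumptions; the real hook would wrap `picloud.subscribe`.

```typescript
type Unsubscribe = () => void;
// Stand-in for the SSE primitive: starts delivery, returns a stop function.
type Source<T> = (onMessage: (msg: T) => void) => Unsubscribe;

function topicStore<T>(source: Source<T>, initial: T[] = []) {
  let messages = initial;
  const listeners = new Set<(value: T[]) => void>();
  let stop: Unsubscribe | undefined;

  return {
    subscribe(run: (value: T[]) => void): Unsubscribe {
      listeners.add(run);
      run(messages); // store contract: emit the current value immediately
      if (listeners.size === 1) {
        // First listener opens the underlying connection.
        stop = source((msg) => {
          messages = [...messages, msg];
          listeners.forEach((listener) => listener(messages));
        });
      }
      return () => {
        listeners.delete(run);
        if (listeners.size === 0) {
          // Last listener closes it again.
          stop?.();
          stop = undefined;
        }
      };
    },
  };
}
```

Opening the connection on the first subscriber and closing it on the last keeps one SSE stream per topic regardless of how many components read it.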
Hard rule, load-bearing: no picloud.kv.get() / picloud.docs.find() / picloud.users.list() from the frontend. Keeping direct service access out of the browser is a strategic and security commitment, not a v1.1.6 limitation. A frontend dev who wants kv.get() from the browser writes a 6-line Rhai script binding it to a route; that friction is intentional, because it makes the dev decide deliberately that the read is okay to expose.
Why not Firebase-mode (full direct service access):
- Different product, different competition (Supabase / Amplify / Appwrite have a 5-year head start and full-time teams).
- Requires security-rule language + per-row authorization evaluator + tooling that PiCloud's solo-dev audience cannot operate safely. Firebase's #1 cause of data exposure is misconfigured rules — well-documented, recurring.
- Script-as-gate is dramatically more defensible: the rules are just code, in the same language as the rest of the app, debuggable like any other code.
Why not pure-minimalist (no client lib, just docs):
- Every PiCloud frontend dev hand-rolls the same fetch wrapper, SSE reconnect, token refresh, login/logout dance. Shipping `@picloud/client` removes that boilerplate without expanding the security surface.
Why hybrid, not maximalist
Firebase trades security for DX; the security-rule misconfiguration footgun is the #1 cause of accidental data exposure in serverless apps. PiCloud's "solo dev / consumer hardware" audience does not have the operational capacity to defend a Firebase-style attack surface against misconfiguration. The script layer is also where PiCloud differentiates — if frontends bypass scripts to talk directly to services, we're competing with Supabase head-to-head (unwinnable, they're better-resourced and have a 5-year head start).
Why hybrid, not pure minimalist
A frontend dev shouldn't have to hand-roll fetch wrappers, SSE reconnect logic, and token-refresh dances. That stuff is identical across every app. Shipping it as @picloud/client is genuinely valuable — it doesn't expand the security surface (scripts still gate everything), it just removes boilerplate.
TypeScript first — decided
✅ Decided 2026-06-01: TypeScript only for v1.1.6. Other-language SDKs deferred, demand-driven, no preemptive ranking.
- TS covers ~85% of the realistic v1.x audience (web + React Native mobile + Capacitor + Electron).
- Native iOS / Android / Python / Rust / Go users can hit the REST + SSE endpoints directly without an SDK; they lose the typed wrapper but aren't blocked from shipping.
- The REST + SSE surface is documented as the public protocol contract so future PiCloud or the community can build other-language SDKs against a stable spec. PiCloud doesn't promise specific languages or timelines preemptively; a real user with a concrete use case is what triggers a new SDK.
- Known caveat: React Native doesn't ship a native `EventSource`. The TS client should runtime-detect and either fall back gracefully or require an explicit polyfill (`react-native-sse` / `react-native-event-source`) with clear docs. Not a blocker; worth surfacing in the v1.1.6 README.
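The runtime detection mentioned in the caveat could look roughly like this. `resolveEventSource` and the `EventSourceCtor` type are hypothetical names, and preferring the native implementation over a supplied polyfill is one possible policy, not a decided one.

```typescript
// Loosely typed constructor so native and polyfill implementations both fit.
type EventSourceCtor = new (url: string) => unknown;

function resolveEventSource(polyfill?: EventSourceCtor): EventSourceCtor {
  // Browsers expose EventSource globally; React Native does not.
  const native = (globalThis as unknown as { EventSource?: EventSourceCtor }).EventSource;
  if (typeof native === "function") return native;
  if (polyfill) return polyfill;
  throw new Error("No EventSource available: pass a polyfill such as react-native-sse");
}
```

Failing loudly at construction time, with the polyfill named in the message, surfaces the caveat to React Native users before the first subscription silently does nothing.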
Status
- Hybrid model (frontend through scripts only): ✅ Decided 2026-06-01 — confirmed; no direct service access from the browser; client lib standardizes script-mediated ceremony only.
- TypeScript first, other languages deferred: ✅ Decided 2026-06-01 — TS-only in v1.1.6; REST + SSE documented as public protocol contract; other languages demand-driven with no preemptive ranking; React Native SSE polyfill noted as known caveat.
- Co-ship with realtime as v1.1.6: ✅ Decided 2026-06-01: server-side realtime AND `@picloud/client@1.0.0` ship together in v1.1.6. Built in parallel against a frozen REST + SSE spec. If v1.1.6 scope blows up under pressure, the lib is the deferrable piece (slips to v1.1.6.1); the realtime server itself doesn't slip.
- Type safety / codegen: ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types with `endpoint<Req, Res>()` generic + optional client-side runtime validation via user-provided schemas (zod/valibot adapter; ~50 lines). No schema-declaration syntax in v1.1.6: committing to that before v1.2's coherent codegen design would lock us into a shape we'd regret. Doc schemas (already arriving in v1.1.2) are the natural foundation for v1.2 codegen; script-endpoint schemas get designed alongside the generator, not before.
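A minimal sketch of what a hand-typed `endpoint<Req, Res>()` with optional runtime validation might look like. The options object, the injectable `FetchLike`, and the structural `Schema<T>` interface are assumptions made to keep the example self-contained; the shipped API is `picloud.endpoint(path)` and may differ. A zod schema satisfies `Schema<T>` directly through its `parse` method, while valibot's functional `parse(schema, input)` would need a one-line adapter.

```typescript
// Structural validator type; hypothetical, chosen so adapters stay tiny.
interface Schema<T> {
  parse(input: unknown): T;
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface EndpointOptions<Res> {
  fetch: FetchLike;
  token?: string; // injected as a bearer header when present
  schema?: Schema<Res>; // optional runtime check of the response
}

function endpoint<Req, Res>(path: string, opts: EndpointOptions<Res>) {
  return {
    async post(body: Req): Promise<Res> {
      const headers: Record<string, string> = { "content-type": "application/json" };
      if (opts.token) headers["authorization"] = `Bearer ${opts.token}`;
      const res = await opts.fetch(path, { method: "POST", headers, body: JSON.stringify(body) });
      if (!res.ok) throw new Error(`HTTP ${res.status} from ${path}`);
      const data = await res.json();
      // Without a schema the type is a compile-time promise only.
      return opts.schema ? opts.schema.parse(data) : (data as Res);
    },
  };
}
```

The generic parameters cost nothing at runtime, so validation stays strictly opt-in and no schema-declaration syntax is needed on the script side.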
Open calls
- Hybrid model. ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; `@picloud/client` ships typed HTTP + SSE + auth-flow + framework hooks.
- TypeScript first, multi-language deferred. ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol; other-language SDKs are demand-driven; React Native SSE polyfill caveat documented.
- Co-ship realtime + client lib. ✅ Decided 2026-06-01: co-ship in v1.1.6, built in parallel against a frozen REST + SSE spec. Lib is the deferrable piece under scope pressure (slips to v1.1.6.1); server doesn't slip.
- Type safety / codegen. ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types with `endpoint<Req, Res>()` generic + optional zod/valibot runtime validation; no schema declarations in v1.1.6.
7. Revised v1.1.x roadmap
Net changes vs the blueprint §12 roadmap:
- v1.1.5 pub/sub: now via trigger outbox (drops `LISTEN/NOTIFY` plan), tightening implementation scope
- NEW v1.1.6 Realtime Channels & Client Library: realtime SSE + `@picloud/client` TS package; co-shipped
- v1.1.7+ items shifted by one (was v1.1.6/7/8 → now v1.1.7/8/9)
- Dead letters and the unified outbox/dispatcher are absorbed into v1.1.1's existing scope (triggers framework)
| Version | Capability |
|---|---|
| v1.1.0 | Foundation & Standard Library — SDK shape, Services bundle, SdkCallCx, ExecutionGate, ServiceEventEmitter trait shape; stdlib utilities (regex, random, time, json, base64, hex, url). ✓ Shipped. |
| v1.1.1 | Storage & Events — KV store keyed (app_id, collection, key); triggers framework (universal outbox + dispatcher + NATS-style sync HTTP via inbox + per-trigger retry config + dead-letter table & dead_letter trigger source + trigger CRUD + ctx.event + depth limit); KV trigger kinds. |
| v1.1.2 | Documents — docs::collection(name).create/find/update/delete/list with docs:* triggers. |
| v1.1.3 | Modules — scripts.kind, per-app resolver replaces DummyModuleResolver, AST cache + dep-graph invalidation. |
| v1.1.4 | Outbound HTTP & Scheduled Tasks — http::* with SSRF deny-list; cron triggers (small now that the framework exists). |
| v1.1.5 | Files & Pub/Sub — filesystem-backed blobs (files/<app_id>/<id[0:2]>/<id>) with files:* triggers; pub/sub via the universal outbox with pubsub:* triggers. |
| v1.1.6 | Realtime Channels & Client Library (new) — SSE-based external subscription to per-app pub/sub topics (public + HMAC-signed subscriber-token auth, minted via pubsub::subscriber_token); @picloud/client TypeScript package (typed HTTP via endpoint<Req,Res>(), SSE subscription, auth helpers, framework hooks). |
| v1.1.7 | Configuration & Email (was v1.1.6) — encrypted per-app secrets; outbound email::send/send_html + inbound email:receive trigger. |
| v1.1.8 | User Management (was v1.1.7) — users::* for in-script CRUD, auth, roles, invites, password reset. |
| v1.1.9 | Durable Queues & Function Composition (was v1.1.8) — queue::* with queue:receive trigger; invoke() + retry::* (closures-as-args, re-entrant Rhai). |
| v1.2 | Workflows & Hierarchies (per blueprint §Phase 5) — DAG execution, advanced docs query, interceptors, read triggers, audit log, script-mediated realtime auth, dead_letters::list (aligned with docs::find() query DSL), client-lib type codegen from script-declared schemas. |
| v1.3+ | Scale & Ops (per blueprint §Phase 6) — cluster mode (NATS-style request/reply swaps to LISTEN/NOTIFY), cross-app data sharing, script versioning + rollback, rate limiting, richer auth, metrics, distributed tracing, webhooks, S3, monitoring/alerting on HTTP endpoint failures. |
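The v1.1.5 row gives the blob layout as `files/<app_id>/<id[0:2]>/<id>`, which shards each app's blobs into subdirectories keyed by the first two characters of the id. A tiny sketch of that derivation, with a hypothetical `blobPath` helper and the assumption that ids are server-generated (so no path-traversal sanitizing is shown):

```typescript
// Derives the on-disk path for a blob; the two-character prefix is the shard directory.
function blobPath(appId: string, id: string): string {
  if (id.length < 2) throw new Error("id must have at least two characters");
  return `files/${appId}/${id.slice(0, 2)}/${id}`;
}
```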
The v1.1.9 release marks the end of the v1.1.x expansion cadence. v1.2 is the next minor product bump (phase milestone per versioning policy).
Consolidated open calls
All 20 open calls were resolved on 2026-06-01. This section is retained as a quick decision index: each item links the original question to the decision recorded in its section above. Sections will be pruned individually as their decisions ship into code and into serverless_cloud_blueprint.md.
§1 — Messaging primitives
1. Pub/sub durability via trigger outbox. ✅ Decided 2026-06-01: `publish_durable` ships in v1.1.5; `publish_ephemeral` committed as a future API.
2. Queue and pub/sub stay separate. ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction.
§2 — Universal trigger outbox
3. Sync HTTP via outbox + per-request inbox. ✅ Decided 2026-06-01: yes via outbox; in-process oneshot for v1.1.1, `LISTEN/NOTIFY` preserved as the cluster-mode (v1.3+) cross-process variant.
4. Ship `dispatch_mode: async` for HTTP routes in v1.1.1. ✅ Decided 2026-06-01: yes; `202 Accepted` + JSON body with `execution_id`; route-level config only.
5. Trigger storage shape. ✅ Decided 2026-06-01: Layout E (parent `triggers` + per-kind `<kind>_trigger_details`); `routes` stays its own table for v1.1.x; column-set refinements deferred to implementation PR.
§3 — NATS-style sync HTTP
6. NATS-style request/reply for sync HTTP. ✅ Decided 2026-06-01 (see §2 #3).
7. Status code strategy. ✅ Decided 2026-06-01: keep distinctions; `500` reserved for platform problems.
8. Default retry policy on triggers. ✅ Decided 2026-06-01: 3/exp/1000ms + ±20% jitter; env-overridable via `PICLOUD_TRIGGER_RETRY_*`; per-trigger columns override.
9. Cancel-on-timeout semantics. ✅ Decided 2026-06-01: (b): `abandoned_executions` table; dispatcher-written; 7-day retention via `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`; metric counter on insert.
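The default trigger retry policy listed above (3/exp/1000ms with ±20% jitter) can be read as the following sketch. Two points are assumptions rather than decisions recorded here: that "3" counts total attempts (not retries after the first), and that "exp" means doubling per retry. The names `RetryPolicy` and `nextDelayMs` are hypothetical.

```typescript
interface RetryPolicy {
  maxAttempts: number; // assumed to count the first attempt too
  baseMs: number;
  jitter: number; // 0.2 means plus or minus 20%
}

const DEFAULT_POLICY: RetryPolicy = { maxAttempts: 3, baseMs: 1000, jitter: 0.2 };

// Returns the delay before the next attempt, or null once attempts are
// exhausted (at which point the event would be dead-lettered).
function nextDelayMs(
  failedAttempts: number,
  policy: RetryPolicy = DEFAULT_POLICY,
  rand: () => number = Math.random,
): number | null {
  if (failedAttempts >= policy.maxAttempts) return null;
  const nominal = policy.baseMs * 2 ** (failedAttempts - 1); // 1000, 2000, ...
  const factor = 1 + policy.jitter * (2 * rand() - 1); // uniform in [0.8, 1.2)
  return Math.round(nominal * factor);
}
```

Injecting `rand` keeps the jitter testable; in production the default `Math.random` spreads retries so subscribers that failed together do not retry in lockstep.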
§4 — Dead letters
10. Dead-letter handlers unretryable + can't be dead-lettered themselves. ✅ Decided 2026-06-01: confirmed; flag lives on the execution; missing handler = `resolution = 'handler_failed'`; indirect loops bounded by `cx.trigger_depth`.
11. No default dead-letter handler. ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view ship in v1.1.1.
12. 30-day default retention. ✅ Decided 2026-06-01: 30 days, GC by `created_at`, env-only override (`PICLOUD_DEAD_LETTER_RETENTION_DAYS`).
13. Rhai SDK for dead-letters in v1.1.1. ✅ Decided 2026-06-01: `replay` + `resolve` in v1.1.1; `list` deferred to v1.2; new `Capability::AppDeadLetterManage(AppId)`. Related: trigger executions inherit the registrant's principal.
§5 — Realtime
14. Approach C confirmed. ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics; new `Capability::AppTopicManage(AppId)`.
15. SSE first, WebSocket deferred. ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred.
16. Auth model. ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; `users::*` session auth in v1.1.8; script-mediated in v1.2; TTL 10s–24h (default 1h), env-overridable.
§6 — Frontend client library
17. Hybrid model. ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; client lib standardizes script-mediated ceremony only.
18. TypeScript first, multi-language deferred. ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol contract.
19. Co-ship realtime + client lib. ✅ Decided 2026-06-01: co-ship in v1.1.6, parallel-built against a frozen spec; lib is the deferrable piece under scope pressure.
20. Type safety / codegen. ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types via `endpoint<Req, Res>()` + optional zod/valibot runtime validation.
Lifecycle of this document
- Created at the v1.1.0 → v1.1.1 boundary (after the foundation PR series shipped).
- Each section gets pruned once its decisions ship and land in the blueprint.
- Open calls are answered in conversation, then folded into the corresponding section as "Decided: X" with the date.
- Document deleted when v1.1.9 ships — everything by then is either in the blueprint, in code, or explicitly deferred to v1.2+.