MechaCat02 1efb350b54 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01

Annotates the v1.1.x design notes with the resolutions for the 20 open
calls — pub/sub split, universal outbox, NATS-style sync HTTP, status
code strategy, retry policy, dead-letter recursion-stop, realtime
auth model, frontend client library scope. Captured ahead of the
v1.1.1 implementation so the schema + API decisions in this branch
have a single load-bearing source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-01 21:22:25 +02:00

v1.1.x design notes — in-flight decisions + revised roadmap

Planning document for the v1.1.x release series. Companion to:

serverless_cloud_blueprint.md — authoritative design
docs/sdk-shape.md — SDK conventions (settled in v1.1.0)
docs/stdlib-reference.md — stdlib API (settled in v1.1.0)
docs/versioning.md — versioning policy (post-1.0 carve-out settled with v1.1.0)

Items in this doc are either tentatively decided but not yet shipped or open calls awaiting the maintainer's decision. Once an item ships, its content moves into the blueprint and the corresponding section here gets pruned.

This document was created at the v1.1.0 → v1.1.1 boundary, capturing the architectural conversations that followed v1.1.0 but haven't yet landed in code or in the blueprint.

1. The three messaging primitives

PiCloud will expose three distinct messaging concepts. The right way to slice them is along recipient model and delivery semantics:

Primitive	Recipients	Durability	Delivery	Retry on script failure	Mental model
`invoke(script_id, args)`	One named script	None (or fire-and-forget durable)	At-most-once sync, or at-least-once async	Caller-controlled via `retry::*`	Function call
`pubsub::publish_durable(topic, msg)`	All scripts subscribed via trigger	Through outbox	At-least-once per subscriber	Per-subscriber retry up to N, then dead-letter	Fan-out broadcast (persisted)
`pubsub::publish_ephemeral(topic, msg)` (future)	All scripts subscribed via trigger	None (in-memory NOTIFY)	At-most-once per subscriber	None	Fan-out broadcast (best-effort)
`queue::enqueue(name, msg)`	Exactly one consumer wins	Durable table	At-least-once total	Visibility timeout + nack-on-throw	Work distribution

Critical distinction: durable pub/sub and queue both end up at-least-once, but the subscriber model differs. Queue: 1 message → 1 delivery record → consumers compete. Pub/sub: 1 message → N delivery records (one per subscriber) → no competition.

Pub/sub reframe — durable through the outbox, ephemeral as named escape hatch

The original blueprint plan was pub/sub via Postgres LISTEN/NOTIFY (ephemeral, sub-millisecond fan-out). The reframe: reuse the triggers framework's outbox infrastructure for the durable path, and keep ephemeral as a separately-named future API:

pubsub::publish_durable(topic, msg) writes to the outbox (v1.1.5)
Dispatcher fans out one delivery record per subscribed script trigger
Each delivery retried on failure with the same machinery as KV / doc / file triggers
After N retries → dead-letter (see §4)
pubsub::publish_ephemeral(topic, msg) is committed as a future addition for the in-memory LISTEN/NOTIFY path — not shipped in v1.1.5, but the API split is decided now so users learn "durable by default, opt into ephemeral" from the start (rather than the reverse, which would be a breaking rename later).
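The fan-out step above can be sketched in a few lines. This is a minimal illustration, assuming integer ids in place of UUIDs; `Delivery` and `fan_out` are hypothetical names, not part of the blueprint or the real outbox schema.

```rust
/// One delivery record per (message, subscriber) pair.
/// Hypothetical shape; the real outbox columns are not settled here.
#[derive(Debug, Clone, PartialEq)]
pub struct Delivery {
    pub message_id: u64,
    pub trigger_id: u64,
    pub attempt: u32,
}

/// Durable pub/sub: 1 message -> N delivery records, one per subscribed
/// trigger. Each record can then be retried and dead-lettered on its own.
pub fn fan_out(message_id: u64, subscribed_triggers: &[u64]) -> Vec<Delivery> {
    subscribed_triggers
        .iter()
        .map(|&trigger_id| Delivery { message_id, trigger_id, attempt: 0 })
        .collect()
}
```

A topic with no subscribers simply yields no delivery records, so a publish to it is a no-op for the dispatcher in this sketch.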

Wins: one delivery model in the whole system for the durable path, durable pub/sub for free, shared observability/retry/dead-letter tooling across every event-firing surface.

Cost: ~1ms Postgres write per publish_durable (vs in-memory NOTIFY). For solo-dev / consumer hardware, the right tradeoff. The ephemeral escape hatch exists for sub-ms / high-frequency workloads if/when they emerge.

Note on durability semantics. "Durable" here means the outbox row persists, not that fan-out is transactional with the publisher's own data writes. A script doing kv.set(...) then pubsub::publish_durable(...) performs two separate writes; a crash between them can drop the publish. This is weaker than the classic transactional-outbox pattern, where the outbox row commits in the same transaction as the data write, but it is consistent with how KV / doc / file triggers already work.

Queue stays separate

Pub/sub-through-outbox cannot model "work distribution with backpressure" cleanly. Queue keeps its own table:

Producer: queue::enqueue(name, msg) → queue table
Consumer: queue:receive trigger fires when message available; runtime claims with FOR UPDATE SKIP LOCKED + visibility timeout
Script returns successfully → auto-ack (delete row)
Script throws → auto-nack (clear claim; message becomes visible again)
Visibility timeout exceeded → reclaim allowed (handles crashed consumers)
Max delivery attempts → dead-letter
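The consumer lifecycle above (claim, auto-ack, auto-nack, reclaim after the visibility timeout, dead-letter after max attempts) can be modelled in memory. This is only a sketch under assumptions: the real implementation claims rows in Postgres with FOR UPDATE SKIP LOCKED, and `Queue`, `Row`, `Claim` and their fields are hypothetical names; time is passed in as plain milliseconds to keep the example deterministic.

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
struct Row {
    payload: String,
    claimed_until: Option<u64>, // visibility deadline in ms; None = visible
    deliveries: u32,
}

#[derive(Debug, PartialEq)]
pub enum Claim {
    Message { id: u64, payload: String },
    Empty,
}

pub struct Queue {
    rows: BTreeMap<u64, Row>,
    next_id: u64,
    visibility_ms: u64,
    max_deliveries: u32,
    pub dead_letters: Vec<u64>,
}

fn is_visible(row: &Row, now_ms: u64) -> bool {
    row.claimed_until.map_or(true, |deadline| deadline <= now_ms)
}

impl Queue {
    pub fn new(visibility_ms: u64, max_deliveries: u32) -> Self {
        Queue { rows: BTreeMap::new(), next_id: 1, visibility_ms, max_deliveries, dead_letters: Vec::new() }
    }

    pub fn enqueue(&mut self, payload: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.insert(id, Row { payload: payload.to_string(), claimed_until: None, deliveries: 0 });
        id
    }

    /// Claim the oldest visible row. A row whose claim has expired counts
    /// as visible again, which is how crashed consumers are handled.
    pub fn claim(&mut self, now_ms: u64) -> Claim {
        // Visible rows that already used all their attempts are dead-lettered.
        let exhausted: Vec<u64> = self.rows.iter()
            .filter(|(_, r)| is_visible(r, now_ms) && r.deliveries >= self.max_deliveries)
            .map(|(&id, _)| id)
            .collect();
        for id in exhausted {
            self.rows.remove(&id);
            self.dead_letters.push(id);
        }
        for (&id, row) in self.rows.iter_mut() {
            if is_visible(row, now_ms) {
                row.claimed_until = Some(now_ms + self.visibility_ms);
                row.deliveries += 1;
                return Claim::Message { id, payload: row.payload.clone() };
            }
        }
        Claim::Empty
    }

    /// Script returned successfully: delete the row.
    pub fn ack(&mut self, id: u64) {
        self.rows.remove(&id);
    }

    /// Script threw: clear the claim so the message is visible again.
    pub fn nack(&mut self, id: u64) {
        if let Some(row) = self.rows.get_mut(&id) {
            row.claimed_until = None;
        }
    }
}
```

Dead-lettering lazily at claim time is a choice made for brevity here; a real dispatcher could equally move the row when the final nack arrives.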

The queue table IS the outbox for queue semantics — no double-buffering.

Status

Durable pub/sub via trigger outbox: ✅ Decided 2026-06-01 — ship as pubsub::publish_durable in v1.1.5.
Ephemeral pub/sub: ✅ Committed 2026-06-01 as a future addition named pubsub::publish_ephemeral. Not in v1.1.5; the explicit-naming split lands now so the durable default doesn't need a breaking rename later.
Drop LISTEN/NOTIFY for v1.1.5: ✅ Decided 2026-06-01.
Queue stays separate from pub/sub: ✅ Decided 2026-06-01 — two distinct top-level namespaces (queue::* and pubsub::*); no unifying messaging::* abstraction. Rationale: the two have genuinely different mental models (work distribution vs fan-out), the implementations share almost no code (queue needs FOR UPDATE SKIP LOCKED + visibility timeout + nack-on-throw; pub/sub needs per-subscriber fan-out + independent retry/dead-letter), and a unified API would force users to choose a mode they already know from the use case. A future Kafka-shaped consumer-group unification was considered and rejected — PiCloud is outbox-based, not log-based, so going Kafka-shaped would mean rebuilding storage.

Open calls

~~Pub/sub durability via trigger outbox~~ — ✅ Decided 2026-06-01: yes, both publish_durable (v1.1.5) and publish_ephemeral (future) committed with explicit names.
~~Queue and pub/sub stay separate concepts~~ — ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction.

2. Universal trigger outbox

The triggers framework's outbox should be the universal substrate for async dispatch. Every event source that fires scripts asynchronously writes to the same outbox table; one dispatcher reads from it and routes to the executor with shared load control, retry, dead-letter, and trigger-depth tracking.

What runs through the outbox

Ingress	Path	Reason
HTTP request (sync)	Direct: orchestrator → executor → response (with NATS-style indirection — see §3)	Caller is waiting; the inbox pattern makes this work via the outbox
HTTP request (async, opt-in)	Orchestrator writes outbox → returns 202 → dispatcher → executor	Webhooks, fire-and-forget endpoints; explicit opt-in via route config
Cron tick	Scheduler writes outbox → dispatcher → executor	No caller; naturally async
KV / doc / file change	Service writes outbox → dispatcher → executor	No caller; the originating script already returned
Pub/sub publish	Service writes outbox → dispatcher → executor (per subscriber)	Fan-out semantics
Queue message	Queue table IS the outbox; dispatcher claims via `FOR UPDATE SKIP LOCKED`	Avoids double-buffering
Inbound email	SMTP receiver writes outbox → dispatcher → executor	No caller

What this gives

One dispatcher = one place for load control (the existing ExecutionGate), retry, dead-letter, trigger-depth tracking, fan-out. New event source = "write to outbox in this shape", nothing else.
Routes become a trigger kind, conceptually. A route is (source=http, filter=method+path, script_id, dispatch_mode=sync|async). Schema-wise the routes table likely stays separate from the new triggers table (polymorphic JSON columns get ugly), but the mental model collapses to "everything that fires a script is a trigger".
dispatch_mode = async is a per-route opt-in. Webhook handlers can return 202 immediately and process in the background — dispatcher handles retries, caller gets a snappy ack.
Replay and debugging. Every async invocation has an outbox row; admin can re-fire a trigger by re-dispatching the row.
Decoupled lifecycle. Dispatcher can be paused for maintenance without affecting HTTP ingress (it just queues); HTTP can degrade (overflow 503s) without affecting async work already in the outbox.

What this doesn't change

Sync HTTP still hits the ExecutionGate the same way (now via the dispatcher).
Async outbox dispatch also hits the gate when the dispatcher picks a row. Sync and async share the cap on actual blocking-thread-in-use.
Trigger CRUD likely stays in per-kind tables for schema sanity; the unification is conceptual + dispatch-layer, not schema-layer.

Status

Universal outbox for async dispatch: ✅ Decided 2026-06-01 — yes; all async ingress (KV/cron/pubsub/queue/email/dead-letter) writes to one outbox; one dispatcher reads it.
Sync HTTP via outbox (NATS-style inbox): ✅ Decided 2026-06-01 — in-process oneshot in v1.1.1; cluster-mode keeps the door open for LISTEN/NOTIFY keyed on inbox_id in v1.3+ (see §3 implementation table).
Routes-as-trigger conceptually: ✅ yes — the dispatch layer treats routes and triggers uniformly.
Trigger storage shape: Layout E (parent + per-kind detail tables): ✅ Decided 2026-06-01. One shared triggers parent with common columns (id, app_id, script_id, kind, enabled, dispatch_mode, retry config, timestamps); one <kind>_trigger_details table per service (kv_trigger_details, cron_trigger_details, pubsub_trigger_details, queue_trigger_details, email_trigger_details, dead_letter_trigger_details). Outbox FKs to triggers.id; dead-letters FK same. Exact column set (notably outbox.app_id denormalization, whether script_id also lives on outbox, ON DELETE behavior on the parent vs detail tables) will be refined when v1.1.1 implementation lands.
routes table stays separate from the triggers parent for now: ✅ Decided 2026-06-01. routes is Phase-3 production schema with its own trie-index columns; folding into the parent is a v1.2 cleanup, not a v1.1.1 requirement. Outbox discriminates HTTP rows via source_kind = 'http' and trigger_id referencing routes.id for HTTP, triggers.id for everything else.
Per-route dispatch_mode: sync|async: ✅ Decided 2026-06-01 — ships in v1.1.1. Async returns 202 Accepted with a JSON body { "accepted_at": "...", "execution_id": "..." }. dispatch_mode is a route property fixed at route creation; scripts cannot switch modes mid-call.
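Two of the decisions above are easy to picture as code: how an outbox row's trigger_id is interpreted, and what the orchestrator does per dispatch_mode. The sketch below is an assumption-laden illustration, with integer ids instead of UUIDs and hypothetical type and function names (`TriggerRef`, `IngressAction`, `resolve_trigger_ref`, `ingress_action`).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DispatchMode {
    Sync,
    Async,
}

/// Which table an outbox row's `trigger_id` points into.
#[derive(Debug, PartialEq)]
pub enum TriggerRef {
    Route(u64),   // routes.id, for rows with source_kind = 'http'
    Trigger(u64), // triggers.id, for every other source kind
}

pub fn resolve_trigger_ref(source_kind: &str, trigger_id: u64) -> TriggerRef {
    if source_kind == "http" {
        TriggerRef::Route(trigger_id)
    } else {
        TriggerRef::Trigger(trigger_id)
    }
}

/// What the orchestrator does once the outbox row has been written.
#[derive(Debug, PartialEq)]
pub enum IngressAction {
    RespondAccepted202, // async route: acknowledge now, run in the background
    AwaitInbox,         // sync route: wait for the dispatcher's reply
}

pub fn ingress_action(mode: DispatchMode) -> IngressAction {
    match mode {
        DispatchMode::Async => IngressAction::RespondAccepted202,
        DispatchMode::Sync => IngressAction::AwaitInbox,
    }
}
```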

Open calls

~~Sync HTTP via outbox + per-request inbox~~ — ✅ Decided 2026-06-01: yes via outbox; in-process oneshot now, LISTEN/NOTIFY explicitly preserved for cluster mode (v1.3+).
~~Ship dispatch_mode: async in v1.1.1~~ — ✅ Decided 2026-06-01: yes; 202 Accepted + JSON body with execution_id; route-level config only.
~~Trigger storage shape~~ — ✅ Decided 2026-06-01: Layout E (parent + per-kind detail tables); routes stays its own table for v1.1.x. Exact column set deferred to implementation PR.

3. NATS-style request/reply for sync HTTP

The constraint that makes "universal outbox" tricky: HTTP has a caller waiting. We can't write to outbox, return 202, and walk away — the user's browser expects 200 OK with body. NATS's request/reply pattern resolves this elegantly.

Pattern

HTTP request  →  orchestrator generates inbox_id, registers a oneshot channel
              →  writes outbox row { source: http, payload, reply_to: inbox_id }
              →  awaits on the channel (with timeout = script's wall-clock + buffer)

Dispatcher    →  picks outbox row
              →  dispatches to executor (gate + spawn_blocking + Rhai)
              →  if reply_to.is_some(): resolves the channel with the result
              →  if reply_to.is_none(): records completion + retries on failure per trigger config

Orchestrator  →  channel resolves → returns response to HTTP caller
              →  on timeout: returns 500 (platform fault; see status code strategy below)

The HTTP caller's experience is unchanged (synchronous request/response). Under the hood, dispatch is identical for every invocation source.
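A rough in-process sketch of the inbox registry follows. It assumes `std::sync::mpsc` as a stand-in for the tokio oneshot channel named in the implementation table, an integer `InboxId`, and a hypothetical `ExecResult` alias; none of these names come from the actual codebase.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::time::Duration;

pub type InboxId = u64;
/// Hypothetical result shape: response body on success, error text on failure.
pub type ExecResult = Result<String, String>;

#[derive(Default)]
pub struct InboxRegistry {
    next_id: InboxId,
    pending: HashMap<InboxId, Sender<ExecResult>>,
}

impl InboxRegistry {
    /// Orchestrator side: allocate an inbox_id and keep the sending half.
    pub fn register(&mut self) -> (InboxId, Receiver<ExecResult>) {
        self.next_id += 1;
        let (tx, rx) = channel();
        self.pending.insert(self.next_id, tx);
        (self.next_id, rx)
    }

    /// Dispatcher side: deliver the result. Returns false when nobody is
    /// waiting any more (unknown id, or the receiver was already dropped).
    pub fn resolve(&mut self, inbox_id: InboxId, result: ExecResult) -> bool {
        match self.pending.remove(&inbox_id) {
            Some(tx) => tx.send(result).is_ok(),
            None => false,
        }
    }
}

/// Orchestrator side: wait for the reply, or give up after `timeout`.
pub fn await_reply(rx: &Receiver<ExecResult>, timeout: Duration) -> Option<ExecResult> {
    rx.recv_timeout(timeout).ok()
}
```

A `false` return from `resolve` corresponds to the case where the caller has already gone away, which is the point at which a dispatcher could record the late result for forensics.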

Implementation by deployment mode

Mode	Mechanism	Trade-off
In-process (v1.1.1, MVP)	Per-orchestrator `HashMap<InboxId, oneshot::Sender<Result>>`; dispatcher resolves the oneshot	Sub-ms wake-up; fails across process boundaries
Cross-process (cluster mode v1.3+)	Postgres `LISTEN/NOTIFY` keyed on `inbox_id`, with a `responses` row as durable backup	Sub-10ms wake-up; survives across nodes; needs careful long-listener management
Polling fallback	Orchestrator polls `responses` table for `inbox_id` every ~10ms	Simple; ~10ms minimum latency; only as fallback

Latency cost (honest numbers)

Per sync HTTP request, NATS-style adds: ~1-2ms Postgres write (outbox) + sub-ms dispatcher wake (in-process channel) + ~1ms response resolve = ~2-5ms overhead. For most scripts (10-100ms execution), this is noise. PiCloud isn't optimizing for sub-ms; the architectural unification is worth a few ms.

Default retry policy — decided

✅ Decided 2026-06-01:

Knob	Default	Env override	Per-trigger column
Max attempts	3	`PICLOUD_TRIGGER_RETRY_MAX_ATTEMPTS`	`retry_max_attempts`
Backoff shape	exponential	`PICLOUD_TRIGGER_RETRY_BACKOFF` (`exponential` \| `linear` \| `constant`)	`retry_backoff`
Base delay	1000ms	`PICLOUD_TRIGGER_RETRY_BASE_MS`	`retry_base_ms`
Jitter	±20%	`PICLOUD_TRIGGER_RETRY_JITTER_PCT`	(not per-trigger; dispatcher-side)

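With the defaults, the backoff delay after each failed attempt is ~1s / ~2s / ~4s (each ±20%), i.e. base × 2^(n−1) after the n-th failure; the delays sum to ~7s of backoff before dead-lettering, excluding script execution time.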
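The schedule arithmetic can be checked with a small sketch. The exponential shape follows the 1s / 2s / 4s figures in the text; the linear shape (base × n) and the function names are assumptions made for illustration, and jitter is expressed as a range rather than drawn randomly so the example stays deterministic.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backoff {
    Exponential,
    Linear,
    Constant,
}

/// Nominal delay in ms after failed attempt number `failed_attempt`
/// (1-based), before jitter is applied.
pub fn nominal_delay_ms(shape: Backoff, base_ms: u64, failed_attempt: u32) -> u64 {
    let n = failed_attempt.max(1);
    match shape {
        // base * 2^(n-1); the shift is capped to avoid overflow.
        Backoff::Exponential => base_ms.saturating_mul(1u64 << (n - 1).min(32)),
        // Assumed interpretation of "linear": base * n.
        Backoff::Linear => base_ms.saturating_mul(n as u64),
        Backoff::Constant => base_ms,
    }
}

/// Inclusive (min, max) range a delay can fall in with +/- `jitter_pct` jitter.
pub fn jitter_bounds_ms(nominal_ms: u64, jitter_pct: u64) -> (u64, u64) {
    let spread = nominal_ms.saturating_mul(jitter_pct) / 100;
    (nominal_ms.saturating_sub(spread), nominal_ms.saturating_add(spread))
}
```

With the defaults (exponential, 1000ms base, 20% jitter) the first three nominal delays are 1000, 2000 and 4000 ms, and the first delay lands somewhere between 800 and 1200 ms.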

What triggers a retry: any of Rhai runtime error, wall-clock timeout, operation-budget-exceeded, or platform-side failure (Postgres unavailable, executor crashed). Distinguishing them in the dispatcher is fiddly and the retry cost is bounded by max_attempts; if op-budget retries become dead-letter spam in practice, revisit.

Per-trigger override: the three retry columns on the triggers parent table (Layout E) take precedence over the env-configured defaults. Trigger CRUD endpoints accept these on create/update; if omitted, the env defaults are applied at write time (not lazily at dispatch — keeps the policy auditable from the row itself).

Sync HTTP exception: unchanged. reply_to.is_some() rows are never retried regardless of policy (see below).

Retry policy — `reply_to` IS the signal

Outbox row	Retry behavior
`reply_to.is_some()`	Never retry. Caller is waiting; retrying means the script might run twice and the caller gets one of two outcomes. Always: one attempt, surface result (success or failure) to inbox.
`reply_to.is_none()`	Retry per trigger's configured policy. Default: 3 attempts, exponential backoff (1s, 2s, 4s), dead-letter after.

Per-trigger config lives on the trigger row:

trigger { source: cron,   schedule: "0 */5 * * * *",
          retry: { max_attempts: 5, backoff: exponential, base_ms: 1000 } }

trigger { source: pubsub, topic: "user.created",
          retry: { max_attempts: 3, backoff: linear,      base_ms: 500  } }

trigger { source: http,   method: POST, path: "/api/foo",
          dispatch_mode: sync }   // retry absent — sync HTTP is always 1-attempt
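The retry rule reduces to one small decision function. The sketch below assumes that max_attempts counts total attempts (the first run included), which the text does not state explicitly; `NextStep` and `after_failure` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum NextStep {
    SurfaceToInbox, // sync HTTP: report the failure to the waiting caller
    Retry,
    DeadLetter,
}

/// Decide what happens after a failed attempt.
/// `attempts_made` includes the attempt that just failed.
pub fn after_failure(has_reply_to: bool, attempts_made: u32, max_attempts: u32) -> NextStep {
    if has_reply_to {
        // A caller is waiting: exactly one attempt, never retried.
        NextStep::SurfaceToInbox
    } else if attempts_made < max_attempts {
        NextStep::Retry
    } else {
        NextStep::DeadLetter
    }
}
```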

Failure / crash handling

With NATS-style indirection, there are new ways for a sync HTTP request to vanish. Every failure path must resolve the orchestrator's oneshot channel with something:

Failure mode	Detection	Caller sees
Script throws / runtime error	Executor returns `ExecError::Runtime` → written to inbox	502
Script exceeds wall-clock	`tokio::time::timeout` fires inside dispatcher → written to inbox	504
Operation budget exceeded	Executor returns `ExecError::OperationBudgetExceeded` → inbox	507
Executor process crashes mid-execution	`JoinError` → `ExecError::Runtime` → inbox	500
Dispatcher process dies between claim and reply	Orchestrator's wait times out	500
Outbox write fails (Postgres unavailable)	Orchestrator never publishes; immediate error	500
Orchestrator's own wait times out unexpectedly	Channel timeout fires before inbox resolves	500

Every path resolves the channel with a result. The orchestrator's outer timeout is the backstop for "dispatcher just died completely".

Status code strategy — decided

✅ Decided 2026-06-01: keep the granular status codes (Option A), with one refinement — 500 is reserved for platform problems (dispatcher vanished, outbox write failed, inbox channel timed out unexpectedly), not used as a generic catch-all.

Code	Cause	Who's at fault
422	Request validation failed	Client
502	Script threw / Rhai runtime error	User script
503	Gate refused (overloaded); `Retry-After: 1`	Platform (capacity)
504	Wall-clock timeout	Either (slow script or platform overload)
507	Operation budget exceeded	User script
500	Dispatcher vanished / outbox write failed / inbox channel timed out unexpectedly	Platform (bug or infra)

Rationale: each code is actionable for the caller (back off, redesign as async, fix the script, file a bug). Flattening to 500 would collapse "script crashed" vs "overloaded" vs "your timeout is too tight" vs "platform broke" into one undifferentiated signal — losing both client-facing UX and our own observability/alerting axis.
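The decided mapping fits in a single match. The `Outcome` enum below is a hypothetical classification introduced for this sketch; it is not the executor's actual error type.

```rust
/// Hypothetical classification of how a sync request ended badly.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    ValidationFailed,
    ScriptError,
    GateRefused,
    WallClockTimeout,
    OperationBudgetExceeded,
    PlatformFailure, // dispatcher vanished, outbox write failed, inbox timed out
}

pub fn status_code(outcome: Outcome) -> u16 {
    match outcome {
        Outcome::ValidationFailed => 422,
        Outcome::ScriptError => 502,
        Outcome::GateRefused => 503,
        Outcome::WallClockTimeout => 504,
        Outcome::OperationBudgetExceeded => 507,
        Outcome::PlatformFailure => 500,
    }
}

/// Only the overloaded-gate case carries a Retry-After header (in seconds).
pub fn retry_after_secs(outcome: Outcome) -> Option<u32> {
    match outcome {
        Outcome::GateRefused => Some(1),
        _ => None,
    }
}
```

Keeping the match exhaustive, with no wildcard arm in `status_code`, means a new outcome cannot silently fall through to 500.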

Status

NATS-style for sync HTTP: ✅ Decided 2026-06-01 (see §2 #3).
reply_to presence as the "don't retry" signal: ✅ Decided 2026-06-01 (folded with the NATS-style decision).
Status code strategy: ✅ Decided 2026-06-01 — keep granular distinctions; 500 reserved for platform problems only.
Default retry policy: ✅ Decided 2026-06-01 — 3 attempts / exponential / 1000ms base / ±20% jitter; all four env-overridable via PICLOUD_TRIGGER_RETRY_*; per-trigger columns on the parent table take precedence.
Cancel-on-timeout semantics: ✅ Decided 2026-06-01 — option (b). Late results are discarded from the caller's POV (they already got their timeout response) but the dispatcher writes an abandoned_executions row whenever it tries to resolve a oneshot that's already closed/dropped. 7-day default retention via PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS; weekly GC sweep. A counter (picloud_abandoned_executions_total{app_id}) bumps on insert — that's the primary observability signal; the rows themselves are for forensics when the counter spikes. Only the dispatcher-after-orchestrator-timeout edge case writes a row; ordinary "script timed out, caller got 504" stays uneventful.

Open calls

~~NATS-style request/reply for sync HTTP~~ — ✅ Decided 2026-06-01 (see §2 #3).
~~Status code strategy~~ — ✅ Decided 2026-06-01: Option A (keep distinctions); 500 reserved for platform problems.
~~Default retry policy on triggers~~ — ✅ Decided 2026-06-01: 3/exp/1000ms base + ±20% jitter; env-overridable via PICLOUD_TRIGGER_RETRY_*; per-trigger row columns override the env defaults.
~~Cancel-on-timeout semantics~~ — ✅ Decided 2026-06-01: option (b) — abandoned_executions table, dispatcher-written, 7-day retention, metric counter on insert.

4. Dead-letter handling

Events that exhaust their retry policy land in a separate dead_letters table (not a flag on the outbox — outbox should stay a queue with fast inserts and scans). Users handle dead letters by registering a script for the new dead_letter trigger kind.

Schema sketch

CREATE TABLE dead_letters (
  id                UUID PRIMARY KEY,
  app_id            UUID NOT NULL REFERENCES apps(id) ON DELETE CASCADE,
  original_event_id UUID NOT NULL,         -- the outbox row id
  source            TEXT NOT NULL,         -- "kv", "cron", "pubsub", "queue", "email"
  op                TEXT NOT NULL,
  trigger_id        UUID,                  -- which trigger config fired (null for direct dispatches)
  script_id         UUID,                  -- which script failed
  payload           JSONB NOT NULL,        -- the event payload, verbatim
  attempt_count     INT  NOT NULL,
  first_attempt_at  TIMESTAMPTZ NOT NULL,
  last_attempt_at   TIMESTAMPTZ NOT NULL,
  last_error        TEXT NOT NULL,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  resolved_at       TIMESTAMPTZ,           -- null = unresolved
  resolution        TEXT                   -- "replayed" | "ignored" | "handled_by_script" | "handler_failed"
);

CREATE INDEX idx_dead_letters_app_unresolved
  ON dead_letters(app_id) WHERE resolved_at IS NULL;

Dead letter as trigger source

trigger {
  source: dead_letter,
  filter: { source: "kv" },        // optional; defaults to "any source"
  script_id: <your handler>,
  dispatch_mode: async,
  retry: { max_attempts: 1 }       // forced; see recursion stop rule below
}

Filterable on:

source: only dead letters from a particular event source (kv, cron, pubsub, …)
trigger_id: only dead letters from a particular trigger config
script_id: only dead letters from a particular script
No filter: every dead letter fires this handler
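Filter matching could look roughly like this. The sketch assumes that when several filter fields are set they combine with AND (the text lists the fields but not how they combine), and it uses integer ids and hypothetical struct names.

```rust
#[derive(Debug, Default)]
pub struct DeadLetterFilter {
    pub source: Option<String>,
    pub trigger_id: Option<u64>,
    pub script_id: Option<u64>,
}

#[derive(Debug)]
pub struct DeadLetter {
    pub source: String,
    pub trigger_id: Option<u64>,
    pub script_id: Option<u64>,
}

impl DeadLetterFilter {
    /// An unset field matches anything; an empty filter matches every row.
    pub fn matches(&self, dl: &DeadLetter) -> bool {
        self.source.as_deref().map_or(true, |s| s == dl.source)
            && self.trigger_id.map_or(true, |t| dl.trigger_id == Some(t))
            && self.script_id.map_or(true, |s| dl.script_id == Some(s))
    }
}
```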

ctx.event for a dead-letter handler:

ctx.event.source              // "dead_letter"
ctx.event.dead_letter = #{
    original: #{
        source:     "kv",
        op:         "insert",
        collection: "widgets",
        key:        "k1",
        payload:    #{ ... }
    },
    attempts:        3,
    last_error:      "script timeout after 30s",
    trigger_id:      "...",
    script_id:       "...",
    first_attempt_at: "2026-05-30T12:00:00.000Z",
    last_attempt_at:  "2026-05-30T12:00:14.000Z"
}

The handler can log::error, send email::send to admins, write to docs::collection("incidents").create(...), post to external alerting via http::post, or call dead_letters::replay(id) if it decides a retry is worthwhile.

Recursion stop rule — decided

✅ Decided 2026-06-01: dead-letter handlers execute once, no retry, and CANNOT themselves be dead-lettered.

The flag lives on the execution/outbox row (set by the dispatcher when it picks a row whose trigger has kind = 'dead_letter'), not on the trigger config. Same handler script could in principle be reused for non-DL work without inheriting the no-retry treatment.
On handler failure:
- Full payload + error logged to structured logs
- Counter picloud_dead_letter_handler_failures{app_id} bumped
- Original dead-letter row annotated with resolution = 'handler_failed'
- No retry, no second dead-letter row, no further fire.
Missing handler script (trigger references script_id that's been deleted): treated as a handler failure — same metric bump, same resolution = 'handler_failed', same no-retry. Auto-disabling the trigger is deferred to v1.2; for v1.1.1 the user sees the metric spike and investigates.
Indirect loops (DL handler writes to KV → fires a KV trigger → that handler fails → dead-letters → fires the same DL handler) are not blocked by this rule directly; they're bounded by the existing trigger-depth limit (cx.trigger_depth). The recursion-stop rule only prevents the direct infinite regress where a DL handler's failure would itself produce a DL row.
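The two bounds described above, the direct recursion stop and the trigger-depth limit for indirect loops, can be sketched as follows. The names and the depth limit parameter are hypothetical; the text does not give the configured limit.

```rust
#[derive(Debug, PartialEq)]
pub enum FinalFailure {
    /// Ordinary execution: write a dead_letters row.
    WriteDeadLetter,
    /// Dead-letter handler execution: annotate the original row with
    /// resolution = 'handler_failed'; no retry, no second row.
    MarkHandlerFailed,
}

/// Direct recursion stop: the flag lives on the execution, not the trigger.
pub fn on_final_failure(execution_is_dl_handler: bool) -> FinalFailure {
    if execution_is_dl_handler {
        FinalFailure::MarkHandlerFailed
    } else {
        FinalFailure::WriteDeadLetter
    }
}

/// Indirect loops: a trigger may only fire while the chain is below the limit.
pub fn may_fire(trigger_depth: u32, max_trigger_depth: u32) -> bool {
    trigger_depth < max_trigger_depth
}
```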

Rationale: if your alerting script is broken, the platform shouldn't try to alert about that with the same broken script. The chain has to terminate, period.

Defaults — decided

✅ Decided 2026-06-01: no automatic handler. Dead letters land in the table; users opt into handling by registering a dead_letter trigger.

Load-bearing commitment: the v1.1.1 dashboard surfaces this state. Without dashboard surface, "no default handler" is irresponsible — users wouldn't know dead-letters exist until they queried Postgres directly. So shipping the table without the UI is not an option.

Required in v1.1.1 alongside the table:

An unresolved-count badge per app, visible in the dashboard's app list and on the app detail page. Source query: SELECT count(*) FROM dead_letters WHERE app_id = $1 AND resolved_at IS NULL.
A per-app dead-letters list view reachable from the badge. Columns: created_at, source, op, script_id, last_error, attempt_count, first_attempt_at, last_attempt_at. Per-row actions: Replay (re-inserts the original event into the outbox; dispatcher tries again from scratch) and Mark resolved (sets resolution = 'ignored', no further action).
A row detail panel showing the full payload + complete error history.

Rationale: most apps will run for months without ever needing a DL handler; the table is the durable record either way. The dashboard surface gives users the lightest-touch signal that something is wrong without committing v1.1.1 to building a notifications channel.

A heavier built-in default ("log to admin notifications channel") was considered and rejected — it would smuggle a notifications-surface design into v1.1.1 under the guise of a default, with real product-design questions (channel shape, configuration, opt-out, rate-limiting) that aren't worth answering yet. If the dashboard badge proves insufficient in practice, a structured-log fallback (writing to execution_logs with a known dead_letter shape) is an additive future change, not a breaking one.

Sync HTTP failures don't dead-letter

Failures of sync HTTP requests (reply_to.is_some()) don't land in dead_letters, for three reasons: the caller already got an error response; every failed HTTP request landing in dead_letters would flood the table; and execution_logs already captures sync request failures. If a user wants alerts on HTTP endpoint failures, that's monitoring (v1.3+ territory), not dead-lettering.

Pub/sub fan-out dead-letters independently

One pubsub::publish_durable → N subscribers → each retries independently → each can independently dead-letter. So one publish can produce N dead-letter rows (one per subscriber that exhausted retries). Subscribers are independent failure domains.

Manual replay — Rhai SDK scope decided

✅ Decided 2026-06-01: ship dead_letters::replay(id) and dead_letters::resolve(id, reason) in v1.1.1; defer dead_letters::list(filter) to v1.2 to align with docs::find() query semantics.

Surface	Use case	Shipping in
`POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/replay`	Admin clicks "replay" in dashboard	v1.1.1
`POST /api/v1/admin/apps/{id}/dead_letters/{dl_id}/resolve`	Admin marks resolved via dashboard	v1.1.1
`GET /api/v1/admin/apps/{id}/dead_letters`	Dashboard list view	v1.1.1
`dead_letters::replay(id)` Rhai SDK	A handler script decides to retry programmatically	v1.1.1
`dead_letters::resolve(id, reason)` Rhai SDK	A handler decides "this is fine, don't bother me"	v1.1.1
`dead_letters::list(filter)` Rhai SDK	Bulk replay / cleanup scripts	v1.2 (aligns with `docs::find()` query DSL)

Replay re-inserts the original event into the outbox; dispatcher tries again from scratch.

Authz: both replay and resolve are gated by a new Capability::AppDeadLetterManage(AppId) checked inside the service methods. The capability is granted to app admins by default (existing Phase 3.5 role hierarchy). A public HTTP script running with principal: None would fail this check, which is correct.

Trigger-execution principal (related decision): ✅ a trigger execution runs as the principal that registered the trigger, captured on the trigger row at registration time. This gives a clean "the trigger fires as you" model and matches how cron jobs are typically conceptualized. The original event's principal (e.g. the anonymous caller of a public HTTP route) is recorded for forensics on the outbox row but does not become the execution principal. This is a wider trigger-framework decision surfaced here because dead-letter authz is the first concrete consumer; it applies to every trigger kind, not just dead-letter.

Retention — decided

✅ Decided 2026-06-01: 30 days, GC by created_at, env-overridable only (no per-app override in v1.1.1).

Default: 30 days
Override: PICLOUD_DEAD_LETTER_RETENTION_DAYS (whole-deployment, not per-app)
GC condition: created_at < NOW() - retention — applies to both resolved and unresolved rows uniformly. (Activity-age GC — keeping recently-resolved rows 30 days post-resolution — was considered and deferred; switching later is non-breaking if user feedback shows it's needed.)
GC job: weekly sweep in manager-core, claiming via FOR UPDATE SKIP LOCKED to match the dispatcher's claim pattern.
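The retention check itself is simple arithmetic. In this sketch, falling back to the 30-day default when the env value is missing or unparsable is an assumption, timestamps are plain Unix seconds, and the function names are hypothetical.

```rust
pub const DEFAULT_RETENTION_DAYS: u64 = 30;
const SECS_PER_DAY: u64 = 86_400;

/// Parse the PICLOUD_DEAD_LETTER_RETENTION_DAYS value.
/// Assumed behaviour: missing or invalid input falls back to the default.
pub fn retention_days(env_value: Option<&str>) -> u64 {
    env_value
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_RETENTION_DAYS)
}

/// GC condition: created_at < now - retention, regardless of resolution state.
pub fn is_expired(created_at_secs: u64, now_secs: u64, retention_days: u64) -> bool {
    let cutoff = now_secs.saturating_sub(retention_days.saturating_mul(SECS_PER_DAY));
    created_at_secs < cutoff
}
```

At startup the value would typically come from `std::env::var("PICLOUD_DEAD_LETTER_RETENTION_DAYS").ok().as_deref()`.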

Per-app retention overrides are deferred to a later release. The env var covers single-deployer needs; per-app settings would need a dashboard surface + permissions story that isn't worth smuggling into v1.1.1.

Status

Separate dead_letters table: leaning yes.
dead_letter as trigger kind: leaning yes.
Recursion stop rule (handlers can't be dead-lettered): ✅ Decided 2026-06-01 (above); flag lives on the execution; missing-handler case treated as handler failure.
No default handler (rows sit in table; dashboard surfaces them): ✅ Decided 2026-06-01 — unresolved-count badge + per-app list view ship in v1.1.1 alongside the table.
Sync HTTP failures don't dead-letter: leaning yes.
Retention: ✅ Decided 2026-06-01 — 30 days, GC by created_at, env-only override (PICLOUD_DEAD_LETTER_RETENTION_DAYS); weekly FOR UPDATE SKIP LOCKED sweep in manager-core.
Rhai SDK scope: ✅ Decided 2026-06-01 — replay + resolve ship in v1.1.1; list deferred to v1.2 to align with docs::find() query DSL. New Capability::AppDeadLetterManage(AppId).
Trigger-execution principal: ✅ Decided 2026-06-01 — trigger fires as the principal that registered it (captured on the trigger row at registration). Original event's principal is recorded on the outbox row for forensics but does not become the execution principal. Applies to all trigger kinds.

Open calls

~~Dead-letter handlers unretryable + can't be dead-lettered themselves~~ — ✅ Decided 2026-06-01: confirmed; flag on execution; missing-handler = resolution = 'handler_failed'; indirect loops bounded by cx.trigger_depth.
~~No default dead-letter handler~~ — ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view (with Replay + Mark-resolved actions) ship in v1.1.1 alongside the table.
~~30-day default retention~~ — ✅ Decided 2026-06-01: 30 days, GC by created_at, env-only override; per-app retention deferred.
~~Rhai SDK for dead-letters in v1.1.1~~ — ✅ Decided 2026-06-01: replay + resolve ship; list deferred to v1.2 to align with docs::find(); new Capability::AppDeadLetterManage(AppId). Related: trigger executions run as the trigger-registering principal.

5. Realtime updates for external clients

Apps built on PiCloud need a way for browser/mobile clients to receive live updates (chat messages, dashboard data, multiplayer state, notifications). Today's pub/sub is internal-only (script ↔ script via triggers).

The chosen approach — decided

✅ Decided 2026-06-01: Option C (one publish API, topics opt-in to external visibility) with the registration split below.

One pubsub::publish_durable(topic, msg) API for scripts — produces a single event regardless of who subscribes.
Topics are internal-only by default: script triggers can subscribe; external clients cannot.
Externally-subscribable topics must be registered explicitly (admin API + dashboard surface). Internal-only topics remain implicit — anyone can publish_durable("any.topic", msg) and triggers can subscribe without registration. To externalize: create a topics row with external_subscribable = true first.
External clients connect to GET /realtime/topics/{topic} via SSE; they only receive messages from registered, externally-subscribable topics they're permitted to access.
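The visibility rule can be sketched with a tiny in-memory registry. `TopicRegistry` and its methods are hypothetical stand-ins for the topics table, and per-client permission checks are left out of the sketch.

```rust
use std::collections::HashMap;

/// Stand-in for the topics table: topic name -> external_subscribable flag.
#[derive(Default)]
pub struct TopicRegistry {
    rows: HashMap<String, bool>,
}

impl TopicRegistry {
    /// Explicit registration (an admin action in the design).
    pub fn register(&mut self, topic: &str, external_subscribable: bool) {
        self.rows.insert(topic.to_string(), external_subscribable);
    }

    /// Internal use needs no registration: any topic can be published to
    /// and subscribed to by script triggers.
    pub fn internal_allowed(&self, _topic: &str) -> bool {
        true
    }

    /// External clients only see topics with a row AND the flag set.
    pub fn externally_subscribable(&self, topic: &str) -> bool {
        self.rows.get(topic).copied().unwrap_or(false)
    }
}
```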

UI/security commitments (the difference between C working and C being default-public in disguise):

The externally-subscribable opt-in is prominent UI, not a buried checkbox.
The topic list view shows "external: yes/no" as a first-class column.
Marking a topic externally-subscribable requires app admin role (capability-gated via Capability::AppTopicManage(AppId)).
The bit-flip is its own API endpoint (not a side-effect of generic topic update) so it carries an independent audit trail.

Wins: one publish API for scripts (DRY), topics are private by default (security), external visibility requires deliberate explicit registration (not just a config flag flipped during quick edits).

Why not A (every topic externally-visible by default): topic names tend to describe the event, not the audience; internal topics frequently carry PII or sensitive payloads; and default-public is exactly the Firebase-style "remember to lock it down" anti-pattern this whole design rejects.

Why not B (separate channels:: service): doubles the publish API for almost-identical use cases; scripts wanting both internal triggers AND client push would publish twice; users wrap it in a helper and we're back at C with extra steps and no central policy enforcement.

Transport: SSE first — decided

✅ Decided 2026-06-01: SSE-only for v1.1.6. WebSocket added in a later release if real bidirectional demand emerges.

Simpler than WebSocket; works through any HTTP proxy without protocol upgrade
Browsers auto-reconnect on disconnect (native EventSource)
Covers the dominant use cases (chat-message-list updates, dashboard streams, notifications, IoT telemetry, build-status streams) cleanly
Production-quality SSE requires HTTP/2 between Caddy and clients to dodge the per-origin connection cap on HTTP/1.1 — Caddy speaks HTTP/2 by default, so this is just a config note for the deploy docs
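For reference, the wire format behind this choice is small. The following is a minimal sketch of serializing one message as an SSE frame; the `TopicMessage` shape and the `formatSseEvent` helper are hypothetical names used for illustration, not settled PiCloud API.

```typescript
// Hypothetical message shape; the field names are assumptions for illustration.
interface TopicMessage {
  id: string;
  payload: unknown;
}

// Serialize one message as a Server-Sent Events frame.
// Each field is a "name: value" line, and a blank line terminates the event.
function formatSseEvent(msg: TopicMessage): string {
  // JSON.stringify without an indent argument emits no raw newlines,
  // so a single data: line is enough. Undefined payloads are sent as null.
  const data = JSON.stringify(msg.payload ?? null);
  return `id: ${msg.id}\ndata: ${data}\n\n`;
}
```

Sending an `id:` field lets a reconnecting EventSource report the last event it saw through the `Last-Event-ID` request header, which is standard SSE behavior; whether PiCloud replays from that ID is not decided in this document.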

Why not ship WS in v1.1.6: WS is the right tool for sub-100ms bidirectional state (multiplayer games, CRDT collaborative editing, typing-indicator-level presence). On consumer hardware with Postgres-backed event distribution, that latency budget is dominated by the server stack anyway — WS would be paying implementation cost (frame management, ping/pong, close codes, backpressure protocol) without unlocking the latency it's designed for. SSE-only also frees v1.1.6 to invest in @picloud/client library quality instead of transport edge cases.

Future addition path: WebSocket coexists with SSE on a different endpoint (e.g. /realtime/ws/{topic}) backed by the same subscriber registry. Purely additive — no SSE clients break, no architecture decision in v1.1.6 closes the door.

Auth model for external subscribers — decided

✅ Decided 2026-06-01: ship public + HMAC-signed subscriber-token auth in v1.1.6; users-SDK session-based auth follows in v1.1.8 (additive); script-mediated per-subscribe auth deferred to v1.2.

Topic config columns:

external_subscribable: bool — can external clients ever subscribe?
auth_mode: 'public' | 'token' — if external, what's the gate? (ignored when external_subscribable = false)
v1.1.8 adds auth_mode = 'session' for users-SDK-based sessions; v1.2 adds auth_mode = 'script' for script-mediated.
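The two v1.1.6 columns combine into a small decision table. The sketch below shows one way the SSE handler could evaluate them; `gateExternalSubscribe` and the result labels are hypothetical names, and a missing topics row is assumed to mean an implicit internal-only topic.

```typescript
type AuthMode = "public" | "token";

interface TopicConfig {
  external_subscribable: boolean;
  auth_mode: AuthMode; // ignored when external_subscribable is false
}

type GateResult = "allow" | "deny_internal_only" | "deny_token_required";

// `topic` is undefined when no topics row exists, i.e. the topic was only
// ever used implicitly and is therefore internal-only.
function gateExternalSubscribe(
  topic: TopicConfig | undefined,
  hasValidToken: boolean,
): GateResult {
  if (!topic || !topic.external_subscribable) return "deny_internal_only";
  if (topic.auth_mode === "public") return "allow";
  return hasValidToken ? "allow" : "deny_token_required";
}
```

Checking `external_subscribable` first keeps `auth_mode` from ever widening access on an internal topic, which matches the "ignored when external_subscribable = false" rule above.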

v1.1.6 trust flow (token-gated topics):

Hop	Auth mechanism
Script → its own token-mint endpoint	Existing API-key + app authz
Script → SDK helper to mint token	New `pubsub::subscriber_token(topics, ttl)`
Frontend → script's token endpoint	App's own auth (cookie/session/whatever the app defines)
Frontend → PiCloud SSE	Short-lived HMAC-signed subscriber token (bearer header)
SSE handler → token validation	HMAC verify, scope-check requested topic against token's allowed list

The frontend never touches the app's API key. The script signs scoped, short-lived bearers (HMAC over {topic_list, exp, app_id}) with a secret derived from the app's API-key material. The SSE endpoint validates the signature without a DB lookup.
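A minimal sketch of that mint-and-verify cycle, using Node's built-in crypto module. The token layout (`base64url(claims).signature`), the SHA-256 choice, and the `mintToken` / `verifyToken` names are assumptions for illustration; the document only fixes that the HMAC covers {topic_list, exp, app_id} and that validation needs no DB lookup.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SubscriberClaims {
  topics: string[]; // topics the bearer may subscribe to
  exp: number;      // expiry, Unix seconds
  app_id: string;
}

function sign(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("base64url");
}

// Assumed layout: base64url(JSON claims) + "." + base64url(HMAC of that body).
function mintToken(claims: SubscriberClaims, secret: string): string {
  const body = Buffer.from(JSON.stringify(claims)).toString("base64url");
  return `${body}.${sign(body, secret)}`;
}

// Stateless check: signature first, then expiry, app, and topic scope.
function verifyToken(
  token: string,
  want: { topic: string; appId: string; nowSec: number },
  secret: string,
): boolean {
  const parts = token.split(".");
  const body = parts[0];
  const sig = parts[1];
  if (parts.length !== 2 || !body || !sig) return false;

  const expected = Buffer.from(sign(body, secret));
  const given = Buffer.from(sig);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== given.length || !timingSafeEqual(expected, given)) return false;

  // Only parse the claims after the signature has been verified.
  const claims = JSON.parse(Buffer.from(body, "base64url").toString("utf8")) as SubscriberClaims;
  return claims.exp > want.nowSec && claims.app_id === want.appId && claims.topics.includes(want.topic);
}
```

The secret is passed in as an opaque string here; how it is derived from the app's API-key material is left to the implementation.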

Token TTL: clamped 10s ≤ ttl ≤ 24h. Default 1h. Both bounds and default env-overridable (PICLOUD_SUBSCRIBER_TOKEN_TTL_MIN_SEC, PICLOUD_SUBSCRIBER_TOKEN_TTL_MAX_SEC, PICLOUD_SUBSCRIBER_TOKEN_TTL_DEFAULT_SEC).
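The clamping rule can be sketched as follows. The environment variable names and the 10s / 24h / 1h values come from the paragraph above; `ttlBoundsFromEnv` and `clampTtl` are hypothetical helper names, and falling back to the built-in value on a missing or non-positive override is an assumption.

```typescript
interface TtlBounds {
  minSec: number;
  maxSec: number;
  defaultSec: number;
}

// Read bounds from an env-like map, falling back to the documented values.
function ttlBoundsFromEnv(env: Record<string, string | undefined>): TtlBounds {
  const read = (name: string, fallback: number): number => {
    const parsed = Number(env[name]);
    return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  };
  return {
    minSec: read("PICLOUD_SUBSCRIBER_TOKEN_TTL_MIN_SEC", 10),
    maxSec: read("PICLOUD_SUBSCRIBER_TOKEN_TTL_MAX_SEC", 24 * 60 * 60),
    defaultSec: read("PICLOUD_SUBSCRIBER_TOKEN_TTL_DEFAULT_SEC", 60 * 60),
  };
}

// Out-of-range requests are clamped into [minSec, maxSec], not rejected.
function clampTtl(requestedSec: number | undefined, bounds: TtlBounds): number {
  const ttl = requestedSec ?? bounds.defaultSec;
  return Math.min(bounds.maxSec, Math.max(bounds.minSec, ttl));
}
```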

Token revocation: none in v1.1.6 by design. HMAC bearers can't be revoked individually; rotation of the signing key invalidates all bearers wholesale. Short TTL is the safety mechanism. Per-token revocation arrives implicitly with v1.1.8's session-based auth (sessions CAN be invalidated).

Public topics: no auth at all. GET /realtime/topics/{topic} works for anyone if the topic has external_subscribable = true AND auth_mode = 'public'. Used for marketing-style broadcasts and public stat boards.

Status

Approach C (opt-in external subscription): ✅ Decided 2026-06-01 — internal-only by default; externally-subscribable topics require explicit registration + admin-role capability; UI surface treats the bit-flip as a deliberate, audited action.
SSE first, WebSocket later: ✅ Decided 2026-06-01 — SSE-only in v1.1.6; WS deferred until concrete demand emerges; future addition is purely additive on a separate endpoint.
Public + token-gated auth in v1.1.6: ✅ Decided 2026-06-01 — HMAC-signed subscriber-token flow (not raw API-key passing); users::* session-based and script-mediated auth deferred per the table above.

Open calls

~~Approach C confirmed~~ — ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics (internal-only stays implicit); new Capability::AppTopicManage(AppId).
~~SSE first, WebSocket deferred~~ — ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred to a later release; future addition is purely additive.
~~Auth model~~ — ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; users::* session auth in v1.1.8; script-mediated auth in v1.2; token TTL clamped 10s–24h (default 1h), env-overridable; no per-token revocation in v1.1.6 (rely on TTL).

6. Frontend client library

Strategic positioning question: how much should PiCloud expose to frontend developers building apps on top of it?

The two ends of the spectrum

End	Frontend gets	Examples
Minimalist	HTTP to dev-defined script endpoints + SSE on dev-marked-public topics. Nothing else.	AWS Lambda + API Gateway, Cloudflare Workers, Deno Deploy
Maximalist	Direct client-side access to KV/docs/users/files. Frontend writes `kv.get()`, `docs.find()`, no Rhai script for trivial reads.	Firebase, Supabase, AWS Amplify

PiCloud today sits at the minimalist end (services exist for scripts to use, not for frontends). Crossing to maximalist would be a real product pivot, not a feature add.

The chosen approach: hybrid — decided

✅ Decided 2026-06-01: Hybrid model. No direct service access from the frontend; client library standardizes script-mediated ceremony.

Four pieces ship in @picloud/client for v1.1.6:

Typed HTTP client to dev-defined endpoints — picloud.endpoint('/api/users').post({ name: 'alice' }). Fetch wrapper with auth header injection, retry logic, structured error handling.
SSE subscription — picloud.subscribe('chat-room-123', msg => …). Auto-reconnect, token refresh, backpressure.
Auth flow helpers — picloud.auth.login(email, password), picloud.auth.logout(), picloud.auth.token. These call dev-defined endpoints under the hood (/api/auth/login etc.); the lib just standardizes the dance + token storage.
Realtime-aware framework hooks — useTopic(topic) for React, store-shape subscribe(topic) for Svelte. Thin polish over the SSE primitive; what frontend devs actually write.
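To make the first piece concrete, here is a minimal sketch of a typed endpoint wrapper with auth header injection. Everything in it (`createClient`, `buildPost`, `Transport`, `EndpointError`) is a hypothetical shape for illustration, not the shipped `@picloud/client` API; the transport is injected so the sketch does not depend on a particular fetch implementation, and retry logic is omitted.

```typescript
interface RequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Pure request construction: JSON body plus an optional bearer token.
function buildPost<Req>(baseUrl: string, path: string, body: Req, token?: string): RequestSpec {
  const headers: Record<string, string> = { "content-type": "application/json" };
  if (token) headers["authorization"] = `Bearer ${token}`;
  return { url: baseUrl + path, method: "POST", headers, body: JSON.stringify(body) };
}

// Injected transport; a real client would wrap fetch here.
type Transport = (req: RequestSpec) => Promise<{ status: number; body: unknown }>;

class EndpointError extends Error {
  readonly status: number;
  constructor(status: number) {
    super(`request failed with status ${status}`);
    this.status = status;
  }
}

function createClient(baseUrl: string, getToken: () => string | undefined, send: Transport) {
  return {
    endpoint<Req, Res>(path: string) {
      return {
        async post(body: Req): Promise<Res> {
          const res = await send(buildPost(baseUrl, path, body, getToken()));
          if (res.status < 200 || res.status >= 300) throw new EndpointError(res.status);
          // Hand-written types only: the response is trusted to match Res.
          return res.body as Res;
        },
      };
    },
  };
}
```

With this shape, a call such as `client.endpoint<{ name: string }, { id: string }>("/api/users").post({ name: "alice" })` mirrors the `picloud.endpoint('/api/users').post(...)` example in the list above.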

Hard rule, load-bearing: no picloud.kv.get() / picloud.docs.find() / picloud.users.list() from the frontend. Keeping direct service access out of the browser is a strategic and security commitment, not a v1.1.6 limitation. A frontend dev who wants kv.get() from the browser writes a 6-line Rhai script binding it to a route. That friction is intentional: it makes the dev decide deliberately that the read is okay to expose.

Why not Firebase-mode (full direct service access):

Different product, different competition (Supabase / Amplify / Appwrite have a 5-year head start and full-time teams).
Requires security-rule language + per-row authorization evaluator + tooling that PiCloud's solo-dev audience cannot operate safely. Firebase's #1 cause of data exposure is misconfigured rules — well-documented, recurring.
Script-as-gate is dramatically more defensible: the rules are just code, in the same language as the rest of the app, debuggable like any other code.

Why not pure-minimalist (no client lib, just docs):

Every PiCloud frontend dev hand-rolls the same fetch wrapper, SSE reconnect, token refresh, login/logout dance. Shipping @picloud/client removes that boilerplate without expanding the security surface.

Why hybrid, not maximalist

Firebase trades security for DX; the security-rule misconfiguration footgun is the #1 cause of accidental data exposure in serverless apps. PiCloud's "solo dev / consumer hardware" audience does not have the operational capacity to defend a Firebase-style attack surface against misconfiguration. The script layer is also where PiCloud differentiates — if frontends bypass scripts to talk directly to services, we're competing with Supabase head-to-head (unwinnable, they're better-resourced and have a 5-year head start).

Why hybrid, not pure minimalist

A frontend dev shouldn't have to hand-roll fetch wrappers, SSE reconnect logic, and token-refresh dances. That stuff is identical across every app. Shipping it as @picloud/client is genuinely valuable — it doesn't expand the security surface (scripts still gate everything), it just removes boilerplate.

TypeScript first — decided

✅ Decided 2026-06-01: TypeScript only for v1.1.6. Other-language SDKs deferred, demand-driven, no preemptive ranking.

TS covers ~85% of the realistic v1.x audience (web + React Native mobile + Capacitor + Electron).
Native iOS / Android / Python / Rust / Go users can hit the REST + SSE endpoints directly without an SDK; they lose the typed wrapper but aren't blocked from shipping.
The REST + SSE surface is documented as the public protocol contract so that PiCloud or the community can later build other-language SDKs against a stable spec. PiCloud doesn't promise specific languages or timelines preemptively; a real user with a concrete use case is what triggers a new SDK.
Known caveat: React Native doesn't ship a native EventSource. The TS client should runtime-detect and either fall back gracefully or require an explicit polyfill (react-native-sse / react-native-event-source) with clear docs. Not a blocker; worth surfacing in the v1.1.6 README.
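The runtime detection described in the caveat could look roughly like this. `resolveEventSource` is a hypothetical helper, and the sketch takes the "require an explicit polyfill" branch rather than a silent fallback; the scope object is passed in so the logic can be exercised outside a browser.

```typescript
type EventSourceCtor = new (url: string) => unknown;

// Prefer the platform EventSource; otherwise use a caller-supplied polyfill;
// otherwise fail loudly with an actionable message.
function resolveEventSource(
  scope: { EventSource?: unknown },
  polyfill?: EventSourceCtor,
): EventSourceCtor {
  if (typeof scope.EventSource === "function") {
    return scope.EventSource as EventSourceCtor;
  }
  if (polyfill) return polyfill;
  throw new Error(
    "No EventSource implementation found; pass a polyfill such as react-native-sse.",
  );
}
```

In a browser the caller would pass the global scope; in React Native the app would pass the constructor exported by whichever polyfill it installed.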

Status

Hybrid model (frontend through scripts only): ✅ Decided 2026-06-01 — confirmed; no direct service access from the browser; client lib standardizes script-mediated ceremony only.
TypeScript first, other languages deferred: ✅ Decided 2026-06-01 — TS-only in v1.1.6; REST + SSE documented as public protocol contract; other languages demand-driven with no preemptive ranking; React Native SSE polyfill noted as known caveat.
Co-ship with realtime as v1.1.6: ✅ Decided 2026-06-01 — server-side realtime AND @picloud/client@1.0.0 ship together in v1.1.6. Built in parallel against a frozen REST + SSE spec. If v1.1.6 scope blows up under pressure, the lib is the deferrable piece (slips to v1.1.6.1); the realtime server itself doesn't slip.
Type safety / codegen: ✅ Decided 2026-06-01 — defer codegen to v1.2+; v1.1.6 ships hand-written types with endpoint<Req, Res>() generic + optional client-side runtime validation via user-provided schemas (zod/valibot adapter; ~50 lines). No schema-declaration syntax in v1.1.6 — committing to that before v1.2's coherent codegen design would lock us into a shape we'd regret. Doc schemas (already arriving in v1.1.2) are the natural foundation for v1.2 codegen; script-endpoint schemas get designed alongside the generator, not before.
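The optional runtime validation can stay very small. A minimal sketch, assuming a structural `Schema<T>` contract with a single `parse` method: zod schemas already expose `parse(input)` and so fit it directly, while valibot needs a one-line wrapper around its `parse(schema, input)` function. The `Schema` and `validated` names are hypothetical.

```typescript
// Structural contract for user-provided schemas.
interface Schema<T> {
  parse(input: unknown): T; // expected to throw on invalid input
}

// With a schema, the response is checked at runtime; without one, the
// hand-written Res type is trusted as-is.
function validated<Res>(raw: unknown, schema?: Schema<Res>): Res {
  return schema ? schema.parse(raw) : (raw as Res);
}

// Example of a hand-rolled schema satisfying the same contract.
const userSchema: Schema<{ id: string }> = {
  parse(input: unknown): { id: string } {
    if (typeof input === "object" && input !== null && "id" in input) {
      const id = (input as { id: unknown }).id;
      if (typeof id === "string") return { id };
    }
    throw new Error("invalid user response");
  },
};
```

Keeping the contract structural means the client library takes no dependency on any particular validation package.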

Open calls

~~Hybrid model~~ — ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; @picloud/client ships typed HTTP + SSE + auth-flow + framework hooks.
~~TypeScript first, multi-language deferred~~ — ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol; other-language SDKs are demand-driven; React Native SSE polyfill caveat documented.
~~Co-ship realtime + client lib~~ — ✅ Decided 2026-06-01: co-ship in v1.1.6, built in parallel against a frozen REST + SSE spec. Lib is the deferrable piece under scope pressure (slips to v1.1.6.1); server doesn't slip.
~~Type safety / codegen~~ — ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types with endpoint<Req, Res>() generic + optional zod/valibot runtime validation; no schema declarations in v1.1.6.

7. Revised v1.1.x roadmap

Net changes vs the blueprint §12 roadmap:

v1.1.5 pub/sub: now via trigger outbox (drops LISTEN/NOTIFY plan), tightening implementation scope
NEW v1.1.6 Realtime Channels & Client Library: realtime SSE + @picloud/client TS package; co-shipped
v1.1.7+ items shifted by one (was v1.1.6/7/8 → now v1.1.7/8/9)
Dead letters and the unified outbox/dispatcher are absorbed into v1.1.1's existing scope (triggers framework)

Version	Capability
v1.1.0	Foundation & Standard Library — SDK shape, `Services` bundle, `SdkCallCx`, `ExecutionGate`, `ServiceEventEmitter` trait shape; stdlib utilities (regex, random, time, json, base64, hex, url). ✓ Shipped.
v1.1.1	Storage & Events — KV store keyed `(app_id, collection, key)`; triggers framework (universal outbox + dispatcher + NATS-style sync HTTP via inbox + per-trigger retry config + dead-letter table & `dead_letter` trigger source + trigger CRUD + `ctx.event` + depth limit); KV trigger kinds.
v1.1.2	Documents — `docs::collection(name).create/find/update/delete/list` with `docs:*` triggers.
v1.1.3	Modules — `scripts.kind`, per-app resolver replaces `DummyModuleResolver`, AST cache + dep-graph invalidation.
v1.1.4	Outbound HTTP & Scheduled Tasks — `http::*` with SSRF deny-list; cron triggers (small now that the framework exists).
v1.1.5	Files & Pub/Sub — filesystem-backed blobs (`files/<app_id>/<id[0:2]>/<id>`) with `files:` triggers; pub/sub via the universal outbox with `pubsub:` triggers.
v1.1.6	Realtime Channels & Client Library (new) — SSE-based external subscription to per-app pub/sub topics (public + HMAC-signed subscriber-token auth, minted via `pubsub::subscriber_token`); `@picloud/client` TypeScript package (typed HTTP via `endpoint<Req,Res>()`, SSE subscription, auth helpers, framework hooks).
v1.1.7	Configuration & Email (was v1.1.6) — encrypted per-app secrets; outbound `email::send/send_html` + inbound `email:receive` trigger.
v1.1.8	User Management (was v1.1.7) — `users::*` for in-script CRUD, auth, roles, invites, password reset.
v1.1.9	Durable Queues & Function Composition (was v1.1.8) — `queue::` with `queue:receive` trigger; `invoke()` + `retry::` (closures-as-args, re-entrant Rhai).
v1.2	Workflows & Hierarchies (per blueprint §Phase 5) — DAG execution, advanced docs query, interceptors, read triggers, audit log, script-mediated realtime auth, `dead_letters::list` (aligned with `docs::find()` query DSL), client-lib type codegen from script-declared schemas.
v1.3+	Scale & Ops (per blueprint §Phase 6) — cluster mode (NATS-style request/reply swaps to `LISTEN/NOTIFY`), cross-app data sharing, script versioning + rollback, rate limiting, richer auth, metrics, distributed tracing, webhooks, S3, monitoring/alerting on HTTP endpoint failures.

The v1.1.9 release marks the end of the v1.1.x expansion cadence. v1.2 is the next minor product bump (phase milestone per versioning policy).

Consolidated open calls

All 20 open calls were resolved on 2026-06-01. This section is retained as a quick decision index: each item links the original question to the decision recorded in its section above. Sections will be pruned individually as their decisions ship into code and into serverless_cloud_blueprint.md.

§1 — Messaging primitives

~~Pub/sub durability via trigger outbox~~ — ✅ Decided 2026-06-01: publish_durable ships in v1.1.5; publish_ephemeral committed as a future API.
~~Queue and pub/sub stay separate~~ — ✅ Decided 2026-06-01: separate top-level namespaces; no unifying messaging abstraction.

§2 — Universal trigger outbox

~~Sync HTTP via outbox + per-request inbox~~ — ✅ Decided 2026-06-01: yes via outbox; in-process oneshot for v1.1.1, LISTEN/NOTIFY preserved as the cluster-mode (v1.3+) cross-process variant.
~~Ship dispatch_mode: async for HTTP routes in v1.1.1~~ — ✅ Decided 2026-06-01: yes; 202 Accepted + JSON body with execution_id; route-level config only.
~~Trigger storage shape~~ — ✅ Decided 2026-06-01: Layout E (parent triggers + per-kind <kind>_trigger_details); routes stays its own table for v1.1.x; column-set refinements deferred to implementation PR.

§3 — NATS-style sync HTTP

~~NATS-style request/reply for sync HTTP~~ — ✅ Decided 2026-06-01 (see §2 #3).
~~Status code strategy~~ — ✅ Decided 2026-06-01: keep distinctions; 500 reserved for platform problems.
~~Default retry policy on triggers~~ — ✅ Decided 2026-06-01: 3/exp/1000ms + ±20% jitter; env-overridable via PICLOUD_TRIGGER_RETRY_*; per-trigger columns override.
~~Cancel-on-timeout semantics~~ — ✅ Decided 2026-06-01: (b) — abandoned_executions table; dispatcher-written; 7-day retention via PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS; metric counter on insert.
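As a worked illustration of the default retry policy listed above, the sketch below reads "3/exp/1000ms + ±20% jitter" as three retries, a delay that doubles from a 1000 ms base, and a uniform jitter factor in [0.8, 1.2]. That reading, the doubling factor, and the `retryDelayMs` name are assumptions; the authoritative definition lives in §3.

```typescript
interface RetryPolicy {
  maxAttempts: number; // number of retries before dead-lettering
  baseDelayMs: number;
  jitter: number;      // 0.2 means plus or minus 20%
}

const DEFAULT_POLICY: RetryPolicy = { maxAttempts: 3, baseDelayMs: 1000, jitter: 0.2 };

// Delay before retry number `retry` (1-based). Returns undefined once the
// policy is exhausted, at which point the event would be dead-lettered.
function retryDelayMs(
  retry: number,
  policy: RetryPolicy = DEFAULT_POLICY,
  random: () => number = Math.random,
): number | undefined {
  if (retry < 1 || retry > policy.maxAttempts) return undefined;
  const base = policy.baseDelayMs * 2 ** (retry - 1);
  // Map random() in [0, 1) onto a factor in [1 - jitter, 1 + jitter).
  const factor = 1 + policy.jitter * (2 * random() - 1);
  return Math.round(base * factor);
}
```

With the jitter source pinned to its midpoint, the nominal delays are 1000 ms, 2000 ms, and 4000 ms. Injecting `random` keeps the function deterministic under test.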

§4 — Dead letters

~~Dead-letter handlers unretryable + can't be dead-lettered themselves~~ — ✅ Decided 2026-06-01: confirmed; flag lives on the execution; missing handler = resolution = 'handler_failed'; indirect loops bounded by cx.trigger_depth.
~~No default dead-letter handler~~ — ✅ Decided 2026-06-01: confirmed; rows sit in the table by default. Dashboard unresolved-count badge + per-app DL list view ship in v1.1.1.
~~30-day default retention~~ — ✅ Decided 2026-06-01: 30 days, GC by created_at, env-only override (PICLOUD_DEAD_LETTER_RETENTION_DAYS).
~~Rhai SDK for dead-letters in v1.1.1~~ — ✅ Decided 2026-06-01: replay + resolve in v1.1.1; list deferred to v1.2; new Capability::AppDeadLetterManage(AppId). Related: trigger executions inherit the registrant's principal.

§5 — Realtime

~~Approach C confirmed~~ — ✅ Decided 2026-06-01: yes, with explicit registration required for externally-subscribable topics; new Capability::AppTopicManage(AppId).
~~SSE first, WebSocket deferred~~ — ✅ Decided 2026-06-01: SSE-only in v1.1.6; WS deferred.
~~Auth model~~ — ✅ Decided 2026-06-01: public + HMAC-signed subscriber tokens in v1.1.6; users::* session auth in v1.1.8; script-mediated in v1.2; TTL 10s–24h (default 1h), env-overridable.

§6 — Frontend client library

~~Hybrid model~~ — ✅ Decided 2026-06-01: confirmed; no direct service access from the frontend; client lib standardizes script-mediated ceremony only.
~~TypeScript first, multi-language deferred~~ — ✅ Decided 2026-06-01: TS-only in v1.1.6; REST + SSE is the public protocol contract.
~~Co-ship realtime + client lib~~ — ✅ Decided 2026-06-01: co-ship in v1.1.6, parallel-built against a frozen spec; lib is the deferrable piece under scope pressure.
~~Type safety / codegen~~ — ✅ Decided 2026-06-01: defer codegen to v1.2+; v1.1.6 ships hand-written types via endpoint<Req, Res>() + optional zod/valibot runtime validation.

Lifecycle of this document

Created at the v1.1.0 → v1.1.1 boundary (after the foundation PR series shipped).
Each section gets pruned once its decisions ship and land in the blueprint.
Open calls are answered in conversation, then folded into the corresponding section as "Decided: X" with the date.
Document deleted when v1.1.9 ships — everything by then is either in the blueprint, in code, or explicitly deferred to v1.2+.
