# v1.1.1 Implementation HANDBACK ## 1. Branch + commit count - Branch: `feat/v1.1.1-storage-and-events` - Base: `main` - 11 commits ahead of `main`. Branch is **not pushed**, **not merged**. ``` 66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG 6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers 1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view 20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints 77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async 6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter 2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints 545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos 6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring 434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl 1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01 ``` The first commit (`1efb350`) absorbed working-tree edits to `docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open calls into "Decided 2026-06-01" entries. Those were on the working tree at branch creation; folding them into the v1.1.1 branch keeps the design rationale colocated with the implementation. ## 2. Scope coverage (Done / Partial / Skipped) | Scope item | Status | Notes | |---|---|---| | **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. | | **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). | | **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. | | **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). | | **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. | | **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. | | **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). | | **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. | | **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). | | **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). | | **11. Workspace version bump** | Done | Workspace `1.1.0` → `1.1.1`, SDK `1.1` → `1.2`, dashboard `0.6.0` → `0.7.0`. CHANGELOG.md created at repo root. | ## 3. Key implementation decisions / deviations ### Outbox column set (deferred to implementation per design notes §2) Chose: - `script_id` denormalized — dispatcher resolves the target without re-joining for the common path. - `trigger_id` polymorphic (no DB FK) — references `triggers.id` for `source_kind IN {kv, dead_letter}`, `routes.id` for `source_kind = 'http'`. Discrimination in Rust at dispatch time. - `claimed_by TEXT` — pid-based for MVP; cluster mode can use any identifier without schema change. - `trigger_depth` + `root_execution_id` denormalized so the dispatcher rebuilds `ExecRequest` without joining back to the originating execution log. - No explicit `is_dead_letter_handler` column — dispatcher infers from the trigger's `kind` field at dispatch time. ### KV pagination - **Cursor-style**, opaque base64-encoded last-key. - Page-size cap of 1000 with default 100 (enforced in repo). - Documented in `crates/shared/src/kv.rs` and the SDK function comment. ### KV TTL - Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes don't surface TTL through the SDK (`set(k,v)` has no TTL argument) so the column is **omitted from migration 0007**. Adding it later is a non-breaking forward migration. Recorded in CHANGELOG as a deferred item. ### Authz scope mapping (seven-scope commitment) The four new capabilities map onto existing scopes — **no new scope variants** to honour the `Scope` enum's "exactly seven values" contract (`crates/shared/src/auth.rs:103`): | Capability | Scope | |---|---| | `AppKvRead` | `script:read` | | `AppKvWrite` | `script:write` | | `AppManageTriggers` | `app:admin` | | `AppDeadLetterManage` | `app:admin` | `role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite` at Editor, and both trigger / DL caps at AppAdmin. ### Script-as-gate authz for SDK calls - `KvServiceImpl` runs `authz::require` only when `cx.principal.is_some()`. Anonymous public-HTTP scripts (the common case for public routes) bypass the cap check. - Cross-app isolation is **independent** of this — enforced by `cx.app_id` being the only source of `app_id` on every query. - `PostgresDeadLetterService::{replay,resolve}` keeps a hard `require` (no `if let Some`) — managing dead letters is an admin act per design notes §4. Public scripts with `principal: None` fail the check, which is correct. ### Trait split: `OutboxRepo` vs `OutboxWriter` orchestrator-core can't depend on manager-core (would invert the dependency arrow). Defined a small `OutboxWriter` trait in `picloud-shared` with a single `enqueue_http` method. `PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher surface) and `OutboxWriter` (orchestrator surface); the picloud binary clones one concrete Arc into both trait views — mirrors the existing `members_concrete` / `AuthzRepo` pattern. ### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core Same split rationale — the dispatcher (manager-core) only depends on the trait, while the in-process impl lives next to its consumer. Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the unchanged trait. ### manager-core now depends on executor-core Previously manager-core only depended on orchestrator-core. The dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/ `InvocationType` from `executor-core` to build invocation descriptors. This is the transport DTO interpretation of the working-rules "don't reach across `*-core` crates" — DTOs are fine, behaviour is the bright line. ### Sync HTTP via outbox is the default for the user-routes path The orchestrator's user-route handler is fully on the NATS-style path now — every sync HTTP request writes to the outbox and awaits inbox delivery. Adds ~2-5ms per request per design notes §3 latency budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls the executor directly since it doesn't need the unified observability — kept for simplicity and admin tooling speed. ### Trigger-depth check is on the outbox row, not in the executor Dispatcher rejects depth-exceeded rows **before** trying to execute. The `cx.trigger_depth` field is informational on the executor side. Rejection writes a log + (reserved) metric and deletes the row — no DL, per design notes §4. ## 4. Tests added ### Unit tests (no DB required) - `manager-core::kv_service::tests` (10 tests) — round-trip, missing key returns None, `has` predicate, `delete` was-present, empty-collection rejection, **cross-app isolation**, anonymous-cx skips authz, authed-cx-with-no-role is Forbidden, owner-can-write, cursor pagination via in-memory KvRepo + denying authz repo. - `manager-core::trigger_config::tests` (2 tests) — conservative defaults, backoff round-trips. - `manager-core::trigger_repo::tests` (1 test) — `collection_matches` glob behaviour (`*`, `prefix:*`, exact). - `manager-core::dispatcher::tests` (5 tests) — exponential / linear / constant backoff math, jitter within bounds, ExecError → InboxFailureKind classification, failure-kind → status-code mapping. - `manager-core::abandoned_repo::tests` (2 tests) — truncate char-boundary safety. - `manager-core::triggers_api::tests` (5 tests) — unknown-app 404, member-without-role 403, default fallback for retry settings, empty-glob rejection, cross-app delete is treated as not-found. - `orchestrator-core::inbox::tests` (4 tests) — register/deliver round-trip, unknown-id is Abandoned, dropped receiver is Abandoned, explicit cancel removes sender. - `executor-core::engine::tests` (3 new) — `ctx.event` absent for direct invocations, KV insert shape matches design notes, KV delete has unit value. - `executor-core::sdk_kv` integration suite (7 tests) — runs a real Rhai engine under `spawn_blocking` against an in-memory `KvService` impl. Covers handle pattern, round-trip, unit-on- missing, has predicate, delete-was-present, empty-collection throws, cursor pagination, **cross-app isolation through the bridge**. **Total: 47 new tests across the workspace.** Workspace test counts after v1.1.1: 63 manager-core / 56 orchestrator-core / 17 executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib / 21 picloud / 6 shared. ### Intentionally untested - DB-backed integration tests for the full dispatcher loop, KV→ trigger→DL retry chain, sync HTTP via outbox round-trip, recursion-stop end-to-end. These need a real Postgres harness; the reviewer runs them via the manual smoke flow below. - Postgres-specific repo behaviour (sqlx query correctness). The repos compile and run against the schema, but no integration test crate spins up a DB in this branch — same pattern as v1.1.0 (see existing `ignored, needs DATABASE_URL` test markers). ## 5. Open questions for the reviewer 1. **Outbox `claimed_at` clearing on success.** The dispatcher `delete`s the outbox row after success / DL. For failures it reschedules (which sets `claimed_at = NULL`). Both flows are correct, but if you imagine a crash between the executor return and the row update, the row stays claimed forever. Cluster mode should add a periodic "unstick stale claims" sweep. Not in v1.1.1 scope but worth surfacing. 2. **Sync HTTP overhead.** Every sync HTTP request now goes through the outbox (write + dispatcher pickup + inbox delivery). Measured overhead expected ~2-5ms per design notes §3. No benchmarking yet — recommend the reviewer pick a representative script and compare 95p latency vs v1.1.0 if performance matters. 3. **HTTP outbox rows don't run as a principal.** The orchestrator's public HTTP path has no authenticated user; the `origin_principal` field on the outbox row is forensic. The resulting `ExecRequest.principal = None`, so the script runs anonymously — matches direct execution. If you'd prefer triggered-from-HTTP scripts to inherit a derived principal (e.g. the route's app's owner), that's an additive change. 4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.** Async dispatches don't have a script-level timeout (no originating HTTP request to bound). Picked the same platform cap as `LocalExecutorClient`. If async needs a different cap, easy to thread through `TriggerConfig`. 5. **Dispatcher tick cadence is 100ms.** Bounded enough that fan-out feels instant; loose enough that an idle process doesn't burn cycles. If the reviewer wants tighter latency, bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work). 6. **CHANGELOG.md is new.** Followed the rest of the repo's convention from git log (release commits + design-notes references). If a different format is preferred, easy to swap. ## 6. Deferred to later releases - `dead_letters::list(filter)` Rhai SDK — design notes §4 defers to v1.2 to align with `docs::find()` query DSL. - KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1 SDK doesn't surface it. Forward-compat (no schema cost). - Auto-disable of triggers whose script was deleted — design notes §4 says current handling is metric+log; auto-disable is v1.2. - Per-app dead-letter retention — design notes §4 says env-only in v1.1.1. - Metrics counter emit for `picloud_trigger_depth_exceeded`, `picloud_dead_letter_handler_failures`, `picloud_abandoned_executions_total`. Code paths log the occurrences with `tracing::warn`/`error`; the actual counter-emit code is a `TODO(metrics)` comment in the dispatcher. Metrics surface is v1.1.7+ per the roadmap. - DB-backed integration tests for the dispatcher loop (see §4 intentionally-untested). - Sync HTTP performance benchmarks comparing v1.1.0 direct path vs v1.1.1 outbox path. ## 7. How to verify locally ### Static checks (all green on this branch) ```sh cargo fmt --all -- --check cargo clippy --all-targets --all-features -- -D warnings cargo test --workspace cd dashboard && npm run check && npm run build ``` ### Migration integrity ```sh docker compose down -v && docker compose up -d postgres cargo run -p picloud # applies 0001..0012 from empty ``` Then start from `main` (v1.1.0 schema state) and switch to this branch; restart `picloud` to apply 0007..0012 on top. ### Manual end-to-end smoke (reviewer should run) ```sh docker compose up -d # 1. Bootstrap an owner user via the existing flow + create app A. # 2. Create a script in A whose body is: throw "boom" # 3. POST /api/v1/admin/apps/{A}/triggers/kv with # {"script_id": "", "collection_glob": "*", "ops": ["insert"]} # 4. From another script (or a public HTTP route): # kv::collection("widgets").set("k1", #{n:1}) # 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter). # 6. Open the dashboard at /admin. # 7. Apps list shows a red "1" badge next to app A. # 8. Click into app A → "Dead letters" tab link → row visible. # 9. Click row → full payload + error history. # 10. Click "Replay" → row marks resolution='replayed', new outbox # row written, dispatcher re-runs the handler (fails again, # produces a NEW DL row). # 11. Click "Mark resolved" on the original DL → resolution='ignored'. ``` ### Async route smoke ```sh # Create a route via POST /api/v1/admin/scripts/{id}/routes with # {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"} curl -X POST -d '{"work":"thing"}' http://localhost:8080/work # Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."} # Then tail execution_logs — the script ran later (not synchronously). ``` ### Trigger-depth limit smoke ```sh # Set a low depth limit + register a KV trigger whose script # writes to KV again — creates a loop. PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud # kv.set(...) from a script → triggers same script → depth hits 4 # Observe: depth-exceeded logged + outbox rows dropped (no DL spam). ``` ## 8. Known limitations / rough edges - **No DB-backed integration tests in this branch.** Unit tests cover trait behaviour with in-memory backings; sqlx query correctness is verified by the workspace compile + manual smoke. - **Dispatcher concurrency is in-process serial-per-tick.** Up to 8 rows claimed per tick, processed one at a time. Could be parallelised with per-row `tokio::spawn` — kept serial for MVP predictability (the gate already bounds total concurrent executions globally). - **Metric emission is TODO** at the three spots noted in Open Questions §5. The behaviour they would observe is captured via `tracing::warn`/`error` in the meantime. - **`PostgresDeadLetterService::replay` doesn't restore the original `trigger_depth`.** Replays start at depth 0. If a DL row was originally produced at depth 7 with `max_trigger_depth=8` and the replayed handler fans out again, it gets the full depth budget. Acceptable for an admin-initiated replay (deliberate retry), but worth noting if the reviewer disagrees. - **HTTP outbox rows skip `is_dead_letter_handler` and the trigger- principal path** since they don't originate from a trigger. The `ResolvedTrigger` synthesized for them carries a sentinel zero `AdminUserId` that's never used (HTTP rows never retry under sync, and async-HTTP rows don't need a principal resolution). - **DataPlaneState's executor field is still generic** (`Arc` where `E: ExecutorClient`). The dispatcher uses `Arc` directly. The picloud binary clones the same `Arc` into both — works because the concrete type implements both the trait object and the generic bound. - **dispatcher always sets `principal: None` for HTTP rows.** As noted in Open Question §3, HTTP outbox rows don't resolve a principal. Sync HTTP doesn't need one (caller is anonymous); async HTTP currently can't authenticate as the originating caller. If that's not the intent, additive change. - **Cluster-mode crash recovery for claimed rows.** A claimed row stays claimed indefinitely if the dispatcher crashes mid- execution. v1.1.1 has one dispatcher per process so this is rare; cluster mode (v1.3+) needs a stale-claim sweeper. --- Branch ready for review. Reviewer reads this report + audits the diff. Do not merge to main until the audit clears.