From 5a95ff2d07dbbaed7a9f6362fe1d689c24eecb50 Mon Sep 17 00:00:00 2001 From: MechaCat02 Date: Mon, 1 Jun 2026 22:27:18 +0200 Subject: [PATCH] docs(v1.1.1): handback report for reviewer Summary of the 11-commit v1.1.1 branch: - branch + commit count, scope coverage table, decisions made mid-implementation, deviations from the design notes - tests added (47 new) + intentionally-untested gaps - open questions for the reviewer - deferred items - verification commands + manual smoke flow - known limitations / rough edges Co-Authored-By: Claude Opus 4.7 (1M context) --- HANDBACK.md | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 340 insertions(+) create mode 100644 HANDBACK.md diff --git a/HANDBACK.md b/HANDBACK.md new file mode 100644 index 0000000..f5ad449 --- /dev/null +++ b/HANDBACK.md @@ -0,0 +1,340 @@ +# v1.1.1 Implementation HANDBACK + +## 1. Branch + commit count + +- Branch: `feat/v1.1.1-storage-and-events` +- Base: `main` +- 11 commits ahead of `main`. Branch is **not pushed**, **not merged**. + +``` +66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG +6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers +1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view +20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints +77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async +6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter +2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints +545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos +6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring +434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl +1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01 +``` + +The first commit (`1efb350`) absorbed working-tree edits to +`docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open +calls into "Decided 2026-06-01" entries. Those were on the working +tree at branch creation; folding them into the v1.1.1 branch keeps +the design rationale colocated with the implementation. + +## 2. Scope coverage (Done / Partial / Skipped) + +| Scope item | Status | Notes | +|---|---|---| +| **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. | +| **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). | +| **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. | +| **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). | +| **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. | +| **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. | +| **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). | +| **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. | +| **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). | +| **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). | +| **11. Workspace version bump** | Done | Workspace `1.1.0` → `1.1.1`, SDK `1.1` → `1.2`, dashboard `0.6.0` → `0.7.0`. CHANGELOG.md created at repo root. | + +## 3. Key implementation decisions / deviations + +### Outbox column set (deferred to implementation per design notes §2) +Chose: +- `script_id` denormalized — dispatcher resolves the target without + re-joining for the common path. +- `trigger_id` polymorphic (no DB FK) — references `triggers.id` for + `source_kind IN {kv, dead_letter}`, `routes.id` for + `source_kind = 'http'`. Discrimination in Rust at dispatch time. +- `claimed_by TEXT` — pid-based for MVP; cluster mode can use any + identifier without schema change. +- `trigger_depth` + `root_execution_id` denormalized so the + dispatcher rebuilds `ExecRequest` without joining back to the + originating execution log. +- No explicit `is_dead_letter_handler` column — dispatcher infers + from the trigger's `kind` field at dispatch time. + +### KV pagination +- **Cursor-style**, opaque base64-encoded last-key. +- Page-size cap of 1000 with default 100 (enforced in repo). +- Documented in `crates/shared/src/kv.rs` and the SDK function + comment. + +### KV TTL +- Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes + don't surface TTL through the SDK (`set(k,v)` has no TTL argument) + so the column is **omitted from migration 0007**. Adding it later + is a non-breaking forward migration. Recorded in CHANGELOG as a + deferred item. + +### Authz scope mapping (seven-scope commitment) +The four new capabilities map onto existing scopes — **no new scope +variants** to honour the `Scope` enum's "exactly seven values" +contract (`crates/shared/src/auth.rs:103`): + +| Capability | Scope | +|---|---| +| `AppKvRead` | `script:read` | +| `AppKvWrite` | `script:write` | +| `AppManageTriggers` | `app:admin` | +| `AppDeadLetterManage` | `app:admin` | + +`role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite` +at Editor, and both trigger / DL caps at AppAdmin. + +### Script-as-gate authz for SDK calls +- `KvServiceImpl` runs `authz::require` only when + `cx.principal.is_some()`. Anonymous public-HTTP scripts (the + common case for public routes) bypass the cap check. +- Cross-app isolation is **independent** of this — enforced by + `cx.app_id` being the only source of `app_id` on every query. +- `PostgresDeadLetterService::{replay,resolve}` keeps a hard + `require` (no `if let Some`) — managing dead letters is an admin + act per design notes §4. Public scripts with `principal: None` + fail the check, which is correct. + +### Trait split: `OutboxRepo` vs `OutboxWriter` +orchestrator-core can't depend on manager-core (would invert the +dependency arrow). Defined a small `OutboxWriter` trait in +`picloud-shared` with a single `enqueue_http` method. +`PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher +surface) and `OutboxWriter` (orchestrator surface); the picloud +binary clones one concrete Arc into both trait views — mirrors the +existing `members_concrete` / `AuthzRepo` pattern. + +### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core +Same split rationale — the dispatcher (manager-core) only depends on +the trait, while the in-process impl lives next to its consumer. +Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the +unchanged trait. + +### manager-core now depends on executor-core +Previously manager-core only depended on orchestrator-core. The +dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/ +`InvocationType` from `executor-core` to build invocation +descriptors. This is the transport DTO interpretation of the +working-rules "don't reach across `*-core` crates" — DTOs are fine, +behaviour is the bright line. + +### Sync HTTP via outbox is the default for the user-routes path +The orchestrator's user-route handler is fully on the NATS-style +path now — every sync HTTP request writes to the outbox and awaits +inbox delivery. Adds ~2-5ms per request per design notes §3 latency +budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls +the executor directly since it doesn't need the unified +observability — kept for simplicity and admin tooling speed. + +### Trigger-depth check is on the outbox row, not in the executor +Dispatcher rejects depth-exceeded rows **before** trying to +execute. The `cx.trigger_depth` field is informational on the +executor side. Rejection writes a log + (reserved) metric and +deletes the row — no DL, per design notes §4. + +## 4. Tests added + +### Unit tests (no DB required) +- `manager-core::kv_service::tests` (10 tests) — round-trip, missing + key returns None, `has` predicate, `delete` was-present, + empty-collection rejection, **cross-app isolation**, anonymous-cx + skips authz, authed-cx-with-no-role is Forbidden, owner-can-write, + cursor pagination via in-memory KvRepo + denying authz repo. +- `manager-core::trigger_config::tests` (2 tests) — conservative + defaults, backoff round-trips. +- `manager-core::trigger_repo::tests` (1 test) — `collection_matches` + glob behaviour (`*`, `prefix:*`, exact). +- `manager-core::dispatcher::tests` (5 tests) — exponential / linear / + constant backoff math, jitter within bounds, ExecError → + InboxFailureKind classification, failure-kind → status-code mapping. +- `manager-core::abandoned_repo::tests` (2 tests) — truncate + char-boundary safety. +- `manager-core::triggers_api::tests` (5 tests) — unknown-app 404, + member-without-role 403, default fallback for retry settings, + empty-glob rejection, cross-app delete is treated as not-found. +- `orchestrator-core::inbox::tests` (4 tests) — register/deliver + round-trip, unknown-id is Abandoned, dropped receiver is + Abandoned, explicit cancel removes sender. +- `executor-core::engine::tests` (3 new) — `ctx.event` absent for + direct invocations, KV insert shape matches design notes, + KV delete has unit value. +- `executor-core::sdk_kv` integration suite (7 tests) — runs a real + Rhai engine under `spawn_blocking` against an in-memory + `KvService` impl. Covers handle pattern, round-trip, unit-on- + missing, has predicate, delete-was-present, empty-collection + throws, cursor pagination, **cross-app isolation through the + bridge**. + +**Total: 47 new tests across the workspace.** Workspace test counts +after v1.1.1: 63 manager-core / 56 orchestrator-core / 17 +executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib / +21 picloud / 6 shared. + +### Intentionally untested +- DB-backed integration tests for the full dispatcher loop, KV→ + trigger→DL retry chain, sync HTTP via outbox round-trip, + recursion-stop end-to-end. These need a real Postgres harness; + the reviewer runs them via the manual smoke flow below. +- Postgres-specific repo behaviour (sqlx query correctness). The + repos compile and run against the schema, but no integration + test crate spins up a DB in this branch — same pattern as v1.1.0 + (see existing `ignored, needs DATABASE_URL` test markers). + +## 5. Open questions for the reviewer + +1. **Outbox `claimed_at` clearing on success.** The dispatcher + `delete`s the outbox row after success / DL. For failures it + reschedules (which sets `claimed_at = NULL`). Both flows are + correct, but if you imagine a crash between the executor return + and the row update, the row stays claimed forever. Cluster mode + should add a periodic "unstick stale claims" sweep. Not in + v1.1.1 scope but worth surfacing. +2. **Sync HTTP overhead.** Every sync HTTP request now goes through + the outbox (write + dispatcher pickup + inbox delivery). + Measured overhead expected ~2-5ms per design notes §3. No + benchmarking yet — recommend the reviewer pick a representative + script and compare 95p latency vs v1.1.0 if performance matters. +3. **HTTP outbox rows don't run as a principal.** The orchestrator's + public HTTP path has no authenticated user; the + `origin_principal` field on the outbox row is forensic. The + resulting `ExecRequest.principal = None`, so the script runs + anonymously — matches direct execution. If you'd prefer + triggered-from-HTTP scripts to inherit a derived principal + (e.g. the route's app's owner), that's an additive change. +4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.** + Async dispatches don't have a script-level timeout (no + originating HTTP request to bound). Picked the same platform + cap as `LocalExecutorClient`. If async needs a different cap, + easy to thread through `TriggerConfig`. +5. **Dispatcher tick cadence is 100ms.** Bounded enough that + fan-out feels instant; loose enough that an idle process + doesn't burn cycles. If the reviewer wants tighter latency, + bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work). +6. **CHANGELOG.md is new.** Followed the rest of the repo's + convention from git log (release commits + design-notes + references). If a different format is preferred, easy to swap. + +## 6. Deferred to later releases + +- `dead_letters::list(filter)` Rhai SDK — design notes §4 defers + to v1.2 to align with `docs::find()` query DSL. +- KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1 + SDK doesn't surface it. Forward-compat (no schema cost). +- Auto-disable of triggers whose script was deleted — design notes + §4 says current handling is metric+log; auto-disable is v1.2. +- Per-app dead-letter retention — design notes §4 says env-only in + v1.1.1. +- Metrics counter emit for `picloud_trigger_depth_exceeded`, + `picloud_dead_letter_handler_failures`, + `picloud_abandoned_executions_total`. Code paths log the + occurrences with `tracing::warn`/`error`; the actual + counter-emit code is a `TODO(metrics)` comment in the + dispatcher. Metrics surface is v1.1.7+ per the roadmap. +- DB-backed integration tests for the dispatcher loop (see §4 + intentionally-untested). +- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs + v1.1.1 outbox path. + +## 7. How to verify locally + +### Static checks (all green on this branch) +```sh +cargo fmt --all -- --check +cargo clippy --all-targets --all-features -- -D warnings +cargo test --workspace +cd dashboard && npm run check && npm run build +``` + +### Migration integrity +```sh +docker compose down -v && docker compose up -d postgres +cargo run -p picloud # applies 0001..0012 from empty +``` +Then start from `main` (v1.1.0 schema state) and switch to this +branch; restart `picloud` to apply 0007..0012 on top. + +### Manual end-to-end smoke (reviewer should run) +```sh +docker compose up -d +# 1. Bootstrap an owner user via the existing flow + create app A. +# 2. Create a script in A whose body is: throw "boom" +# 3. POST /api/v1/admin/apps/{A}/triggers/kv with +# {"script_id": "", "collection_glob": "*", "ops": ["insert"]} +# 4. From another script (or a public HTTP route): +# kv::collection("widgets").set("k1", #{n:1}) +# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter). +# 6. Open the dashboard at /admin. +# 7. Apps list shows a red "1" badge next to app A. +# 8. Click into app A → "Dead letters" tab link → row visible. +# 9. Click row → full payload + error history. +# 10. Click "Replay" → row marks resolution='replayed', new outbox +# row written, dispatcher re-runs the handler (fails again, +# produces a NEW DL row). +# 11. Click "Mark resolved" on the original DL → resolution='ignored'. +``` + +### Async route smoke +```sh +# Create a route via POST /api/v1/admin/scripts/{id}/routes with +# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"} +curl -X POST -d '{"work":"thing"}' http://localhost:8080/work +# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."} +# Then tail execution_logs — the script ran later (not synchronously). +``` + +### Trigger-depth limit smoke +```sh +# Set a low depth limit + register a KV trigger whose script +# writes to KV again — creates a loop. +PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud +# kv.set(...) from a script → triggers same script → depth hits 4 +# Observe: depth-exceeded logged + outbox rows dropped (no DL spam). +``` + +## 8. Known limitations / rough edges + +- **No DB-backed integration tests in this branch.** Unit tests + cover trait behaviour with in-memory backings; sqlx query + correctness is verified by the workspace compile + manual smoke. +- **Dispatcher concurrency is in-process serial-per-tick.** Up to + 8 rows claimed per tick, processed one at a time. Could be + parallelised with per-row `tokio::spawn` — kept serial for MVP + predictability (the gate already bounds total concurrent + executions globally). +- **Metric emission is TODO** at the three spots noted in + Open Questions §5. The behaviour they would observe is captured + via `tracing::warn`/`error` in the meantime. +- **`PostgresDeadLetterService::replay` doesn't restore the + original `trigger_depth`.** Replays start at depth 0. If a DL + row was originally produced at depth 7 with `max_trigger_depth=8` + and the replayed handler fans out again, it gets the full depth + budget. Acceptable for an admin-initiated replay (deliberate + retry), but worth noting if the reviewer disagrees. +- **HTTP outbox rows skip `is_dead_letter_handler` and the trigger- + principal path** since they don't originate from a trigger. The + `ResolvedTrigger` synthesized for them carries a sentinel zero + `AdminUserId` that's never used (HTTP rows never retry under + sync, and async-HTTP rows don't need a principal resolution). +- **DataPlaneState's executor field is still generic** (`Arc` + where `E: ExecutorClient`). The dispatcher uses `Arc` directly. The picloud binary clones the same + `Arc` into both — works because the + concrete type implements both the trait object and the generic + bound. +- **dispatcher always sets `principal: None` for HTTP rows.** As + noted in Open Question §3, HTTP outbox rows don't resolve a + principal. Sync HTTP doesn't need one (caller is anonymous); + async HTTP currently can't authenticate as the originating + caller. If that's not the intent, additive change. +- **Cluster-mode crash recovery for claimed rows.** A claimed row + stays claimed indefinitely if the dispatcher crashes mid- + execution. v1.1.1 has one dispatcher per process so this is + rare; cluster mode (v1.3+) needs a stale-claim sweeper. + +--- + +Branch ready for review. Reviewer reads this report + audits the +diff. Do not merge to main until the audit clears.