Summary of the 11-commit v1.1.1 branch: - branch + commit count, scope coverage table, decisions made mid-implementation, deviations from the design notes - tests added (47 new) + intentionally-untested gaps - open questions for the reviewer - deferred items - verification commands + manual smoke flow - known limitations / rough edges Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 KiB
v1.1.1 Implementation HANDBACK
1. Branch + commit count
- Branch:
feat/v1.1.1-storage-and-events - Base:
main - 11 commits ahead of
main. Branch is not pushed, not merged.
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
The first commit (1efb350) absorbed working-tree edits to
docs/v1.1.x-design-notes.md that turned the "in-flight" 20 open
calls into "Decided 2026-06-01" entries. Those were on the working
tree at branch creation; folding them into the v1.1.1 branch keeps
the design rationale colocated with the implementation.
2. Scope coverage (Done / Partial / Skipped)
| Scope item | Status | Notes |
|---|---|---|
| 1. KV store | Done | Migration 0007, KvService trait in shared, KvServiceImpl + PostgresKvRepo in manager-core, Rhai kv::collection(name).{get,set,has,delete,list} bridge, cursor pagination, empty-collection rejection, script-as-gate authz. |
| 2. Triggers framework — Layout E | Done | Migrations 0008 (triggers + kv_trigger_details + dead_letter_trigger_details), TriggerRepo + PostgresTriggerRepo, CRUD admin endpoints. registered_by_principal column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). |
| 3. Universal outbox + dispatcher | Done | Migration 0009 (outbox), OutboxRepo + PostgresOutboxRepo, Dispatcher tokio task. Polls every 100ms, claims 8 rows/tick via FOR UPDATE SKIP LOCKED, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → abandoned_executions. |
| 4. NATS-style sync HTTP | Done | InboxRegistry in orchestrator-core (in-process Mutex<HashMap<Uuid, oneshot::Sender>>), InboxResolver trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with reply_to, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). |
5. dispatch_mode: async HTTP routes |
Done | Migration 0012 adds the column (default sync). DispatchMode enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in Matched. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. |
| 6. Dead letters | Done | Migration 0010 (dead_letters), DeadLetterRepo + DeadLetterService + PostgresDeadLetterService. Rhai dead_letters::{replay,resolve} + admin endpoints (GET /count, GET /, GET /{id}, POST /{id}/replay, POST /{id}/resolve). Capability::AppDeadLetterManage(AppId) enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as handler_failed) implemented in the dispatcher. |
| 7. Abandoned executions | Done | Migration 0011, AbandonedRepo + PostgresAbandonedRepo, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (TODO(metrics) markers in dispatcher.rs). |
| 8. Retry policy defaults | Done | TriggerConfig::from_env (new module). Env vars: PICLOUD_MAX_TRIGGER_DEPTH, PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}, PICLOUD_DEAD_LETTER_RETENTION_DAYS, PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS. Per-trigger overrides applied at trigger-creation time. |
9. ctx.event for triggered scripts |
Done | TriggerEvent enum in shared (KV / DeadLetter variants), SdkCallCx.event: Option<TriggerEvent> + is_dead_letter_handler: bool. engine.rs::build_ctx_map flattens the event into ctx.event for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). |
| 10. Dashboard surface | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New apps/[slug]/dead-letters/+page.svelte list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). |
| 11. Workspace version bump | Done | Workspace 1.1.0 → 1.1.1, SDK 1.1 → 1.2, dashboard 0.6.0 → 0.7.0. CHANGELOG.md created at repo root. |
3. Key implementation decisions / deviations
Outbox column set (deferred to implementation per design notes §2)
Chose:
script_iddenormalized — dispatcher resolves the target without re-joining for the common path.trigger_idpolymorphic (no DB FK) — referencestriggers.idforsource_kind IN {kv, dead_letter},routes.idforsource_kind = 'http'. Discrimination in Rust at dispatch time.claimed_by TEXT— pid-based for MVP; cluster mode can use any identifier without schema change.trigger_depth+root_execution_iddenormalized so the dispatcher rebuildsExecRequestwithout joining back to the originating execution log.- No explicit
is_dead_letter_handlercolumn — dispatcher infers from the trigger'skindfield at dispatch time.
KV pagination
- Cursor-style, opaque base64-encoded last-key.
- Page-size cap of 1000 with default 100 (enforced in repo).
- Documented in
crates/shared/src/kv.rsand the SDK function comment.
KV TTL
- Blueprint §8.1 reserved an
expires_atcolumn. v1.1.1 design notes don't surface TTL through the SDK (set(k,v)has no TTL argument) so the column is omitted from migration 0007. Adding it later is a non-breaking forward migration. Recorded in CHANGELOG as a deferred item.
Authz scope mapping (seven-scope commitment)
The four new capabilities map onto existing scopes — no new scope
variants to honour the Scope enum's "exactly seven values"
contract (crates/shared/src/auth.rs:103):
| Capability | Scope |
|---|---|
AppKvRead |
script:read |
AppKvWrite |
script:write |
AppManageTriggers |
app:admin |
AppDeadLetterManage |
app:admin |
role_satisfies grants AppKvRead at the Viewer role, AppKvWrite
at Editor, and both trigger / DL caps at AppAdmin.
Script-as-gate authz for SDK calls
KvServiceImplrunsauthz::requireonly whencx.principal.is_some(). Anonymous public-HTTP scripts (the common case for public routes) bypass the cap check.- Cross-app isolation is independent of this — enforced by
cx.app_idbeing the only source ofapp_idon every query. PostgresDeadLetterService::{replay,resolve}keeps a hardrequire(noif let Some) — managing dead letters is an admin act per design notes §4. Public scripts withprincipal: Nonefail the check, which is correct.
Trait split: OutboxRepo vs OutboxWriter
orchestrator-core can't depend on manager-core (would invert the
dependency arrow). Defined a small OutboxWriter trait in
picloud-shared with a single enqueue_http method.
PostgresOutboxRepo implements both OutboxRepo (dispatcher
surface) and OutboxWriter (orchestrator surface); the picloud
binary clones one concrete Arc into both trait views — mirrors the
existing members_concrete / AuthzRepo pattern.
InboxResolver lives in shared, InboxRegistry in orchestrator-core
Same split rationale — the dispatcher (manager-core) only depends on
the trait, while the in-process impl lives next to its consumer.
Cluster mode (v1.3+) swaps the impl for LISTEN/NOTIFY behind the
unchanged trait.
manager-core now depends on executor-core
Previously manager-core only depended on orchestrator-core. The
dispatcher needs ExecRequest/ExecResponse/ExecError/
InvocationType from executor-core to build invocation
descriptors. This is the transport DTO interpretation of the
working-rules "don't reach across *-core crates" — DTOs are fine,
behaviour is the bright line.
Sync HTTP via outbox is the default for the user-routes path
The orchestrator's user-route handler is fully on the NATS-style
path now — every sync HTTP request writes to the outbox and awaits
inbox delivery. Adds ~2-5ms per request per design notes §3 latency
budget. /api/v1/execute/{id} (the admin/dev bypass) still calls
the executor directly since it doesn't need the unified
observability — kept for simplicity and admin tooling speed.
Trigger-depth check is on the outbox row, not in the executor
Dispatcher rejects depth-exceeded rows before trying to
execute. The cx.trigger_depth field is informational on the
executor side. Rejection writes a log + (reserved) metric and
deletes the row — no DL, per design notes §4.
4. Tests added
Unit tests (no DB required)
manager-core::kv_service::tests(10 tests) — round-trip, missing key returns None,haspredicate,deletewas-present, empty-collection rejection, cross-app isolation, anonymous-cx skips authz, authed-cx-with-no-role is Forbidden, owner-can-write, cursor pagination via in-memory KvRepo + denying authz repo.manager-core::trigger_config::tests(2 tests) — conservative defaults, backoff round-trips.manager-core::trigger_repo::tests(1 test) —collection_matchesglob behaviour (*,prefix:*, exact).manager-core::dispatcher::tests(5 tests) — exponential / linear / constant backoff math, jitter within bounds, ExecError → InboxFailureKind classification, failure-kind → status-code mapping.manager-core::abandoned_repo::tests(2 tests) — truncate char-boundary safety.manager-core::triggers_api::tests(5 tests) — unknown-app 404, member-without-role 403, default fallback for retry settings, empty-glob rejection, cross-app delete is treated as not-found.orchestrator-core::inbox::tests(4 tests) — register/deliver round-trip, unknown-id is Abandoned, dropped receiver is Abandoned, explicit cancel removes sender.executor-core::engine::tests(3 new) —ctx.eventabsent for direct invocations, KV insert shape matches design notes, KV delete has unit value.executor-core::sdk_kvintegration suite (7 tests) — runs a real Rhai engine underspawn_blockingagainst an in-memoryKvServiceimpl. Covers handle pattern, round-trip, unit-on- missing, has predicate, delete-was-present, empty-collection throws, cursor pagination, cross-app isolation through the bridge.
Total: 47 new tests across the workspace. Workspace test counts after v1.1.1: 63 manager-core / 56 orchestrator-core / 17 executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib / 21 picloud / 6 shared.
Intentionally untested
- DB-backed integration tests for the full dispatcher loop, KV→ trigger→DL retry chain, sync HTTP via outbox round-trip, recursion-stop end-to-end. These need a real Postgres harness; the reviewer runs them via the manual smoke flow below.
- Postgres-specific repo behaviour (sqlx query correctness). The
repos compile and run against the schema, but no integration
test crate spins up a DB in this branch — same pattern as v1.1.0
(see existing
ignored, needs DATABASE_URLtest markers).
5. Open questions for the reviewer
- Outbox
claimed_atclearing on success. The dispatcherdeletes the outbox row after success / DL. For failures it reschedules (which setsclaimed_at = NULL). Both flows are correct, but if you imagine a crash between the executor return and the row update, the row stays claimed forever. Cluster mode should add a periodic "unstick stale claims" sweep. Not in v1.1.1 scope but worth surfacing. - Sync HTTP overhead. Every sync HTTP request now goes through the outbox (write + dispatcher pickup + inbox delivery). Measured overhead expected ~2-5ms per design notes §3. No benchmarking yet — recommend the reviewer pick a representative script and compare 95p latency vs v1.1.0 if performance matters.
- HTTP outbox rows don't run as a principal. The orchestrator's
public HTTP path has no authenticated user; the
origin_principalfield on the outbox row is forensic. The resultingExecRequest.principal = None, so the script runs anonymously — matches direct execution. If you'd prefer triggered-from-HTTP scripts to inherit a derived principal (e.g. the route's app's owner), that's an additive change. - Dispatcher uses
ASYNC_EXEC_TIMEOUT = 300sfor async rows. Async dispatches don't have a script-level timeout (no originating HTTP request to bound). Picked the same platform cap asLocalExecutorClient. If async needs a different cap, easy to thread throughTriggerConfig. - Dispatcher tick cadence is 100ms. Bounded enough that
fan-out feels instant; loose enough that an idle process
doesn't burn cycles. If the reviewer wants tighter latency,
bump to 50ms or use
LISTEN/NOTIFYfor wake-up (v1.3+ work). - CHANGELOG.md is new. Followed the rest of the repo's convention from git log (release commits + design-notes references). If a different format is preferred, easy to swap.
6. Deferred to later releases
dead_letters::list(filter)Rhai SDK — design notes §4 defers to v1.2 to align withdocs::find()query DSL.- KV TTL (
set(k, v, ttl_secs)) — blueprint reserved it; v1.1.1 SDK doesn't surface it. Forward-compat (no schema cost). - Auto-disable of triggers whose script was deleted — design notes §4 says current handling is metric+log; auto-disable is v1.2.
- Per-app dead-letter retention — design notes §4 says env-only in v1.1.1.
- Metrics counter emit for
picloud_trigger_depth_exceeded,picloud_dead_letter_handler_failures,picloud_abandoned_executions_total. Code paths log the occurrences withtracing::warn/error; the actual counter-emit code is aTODO(metrics)comment in the dispatcher. Metrics surface is v1.1.7+ per the roadmap. - DB-backed integration tests for the dispatcher loop (see §4 intentionally-untested).
- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs v1.1.1 outbox path.
7. How to verify locally
Static checks (all green on this branch)
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --workspace
cd dashboard && npm run check && npm run build
Migration integrity
docker compose down -v && docker compose up -d postgres
cargo run -p picloud # applies 0001..0012 from empty
Then start from main (v1.1.0 schema state) and switch to this
branch; restart picloud to apply 0007..0012 on top.
Manual end-to-end smoke (reviewer should run)
docker compose up -d
# 1. Bootstrap an owner user via the existing flow + create app A.
# 2. Create a script in A whose body is: throw "boom"
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
# {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
# 4. From another script (or a public HTTP route):
# kv::collection("widgets").set("k1", #{n:1})
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
# 6. Open the dashboard at /admin.
# 7. Apps list shows a red "1" badge next to app A.
# 8. Click into app A → "Dead letters" tab link → row visible.
# 9. Click row → full payload + error history.
# 10. Click "Replay" → row marks resolution='replayed', new outbox
# row written, dispatcher re-runs the handler (fails again,
# produces a NEW DL row).
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.
Async route smoke
# Create a route via POST /api/v1/admin/scripts/{id}/routes with
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
# Then tail execution_logs — the script ran later (not synchronously).
Trigger-depth limit smoke
# Set a low depth limit + register a KV trigger whose script
# writes to KV again — creates a loop.
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
# kv.set(...) from a script → triggers same script → depth hits 4
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).
8. Known limitations / rough edges
- No DB-backed integration tests in this branch. Unit tests cover trait behaviour with in-memory backings; sqlx query correctness is verified by the workspace compile + manual smoke.
- Dispatcher concurrency is in-process serial-per-tick. Up to
8 rows claimed per tick, processed one at a time. Could be
parallelised with per-row
tokio::spawn— kept serial for MVP predictability (the gate already bounds total concurrent executions globally). - Metric emission is TODO at the three spots noted in
Open Questions §5. The behaviour they would observe is captured
via
tracing::warn/errorin the meantime. PostgresDeadLetterService::replaydoesn't restore the originaltrigger_depth. Replays start at depth 0. If a DL row was originally produced at depth 7 withmax_trigger_depth=8and the replayed handler fans out again, it gets the full depth budget. Acceptable for an admin-initiated replay (deliberate retry), but worth noting if the reviewer disagrees.- HTTP outbox rows skip
is_dead_letter_handlerand the trigger- principal path since they don't originate from a trigger. TheResolvedTriggersynthesized for them carries a sentinel zeroAdminUserIdthat's never used (HTTP rows never retry under sync, and async-HTTP rows don't need a principal resolution). - DataPlaneState's executor field is still generic (
Arc<E>whereE: ExecutorClient). The dispatcher usesArc<dyn ExecutorClient>directly. The picloud binary clones the sameArc<LocalExecutorClient>into both — works because the concrete type implements both the trait object and the generic bound. - dispatcher always sets
principal: Nonefor HTTP rows. As noted in Open Question §3, HTTP outbox rows don't resolve a principal. Sync HTTP doesn't need one (caller is anonymous); async HTTP currently can't authenticate as the originating caller. If that's not the intent, additive change. - Cluster-mode crash recovery for claimed rows. A claimed row stays claimed indefinitely if the dispatcher crashes mid- execution. v1.1.1 has one dispatcher per process so this is rare; cluster mode (v1.3+) needs a stale-claim sweeper.
Branch ready for review. Reviewer reads this report + audits the diff. Do not merge to main until the audit clears.