Summary of the 11-commit v1.1.1 branch: - branch + commit count, scope coverage table, decisions made mid-implementation, deviations from the design notes - tests added (47 new) + intentionally-untested gaps - open questions for the reviewer - deferred items - verification commands + manual smoke flow - known limitations / rough edges Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
341 lines
19 KiB
Markdown
341 lines
19 KiB
Markdown
# v1.1.1 Implementation HANDBACK
|
||
|
||
## 1. Branch + commit count
|
||
|
||
- Branch: `feat/v1.1.1-storage-and-events`
|
||
- Base: `main`
|
||
- 11 commits ahead of `main`. Branch is **not pushed**, **not merged**.
|
||
|
||
```
|
||
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
|
||
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
|
||
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
|
||
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
|
||
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
|
||
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
|
||
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
|
||
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
|
||
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
|
||
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
|
||
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
|
||
```
|
||
|
||
The first commit (`1efb350`) absorbed working-tree edits to
|
||
`docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open
|
||
calls into "Decided 2026-06-01" entries. Those were on the working
|
||
tree at branch creation; folding them into the v1.1.1 branch keeps
|
||
the design rationale colocated with the implementation.
|
||
|
||
## 2. Scope coverage (Done / Partial / Skipped)
|
||
|
||
| Scope item | Status | Notes |
|
||
|---|---|---|
|
||
| **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. |
|
||
| **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). |
|
||
| **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. |
|
||
| **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex<HashMap<Uuid, oneshot::Sender>>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). |
|
||
| **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. |
|
||
| **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. |
|
||
| **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). |
|
||
| **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. |
|
||
| **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option<TriggerEvent>` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). |
|
||
| **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). |
|
||
| **11. Workspace version bump** | Done | Workspace `1.1.0` → `1.1.1`, SDK `1.1` → `1.2`, dashboard `0.6.0` → `0.7.0`. CHANGELOG.md created at repo root. |
|
||
|
||
## 3. Key implementation decisions / deviations
|
||
|
||
### Outbox column set (deferred to implementation per design notes §2)
|
||
Chose:
|
||
- `script_id` denormalized — dispatcher resolves the target without
|
||
re-joining for the common path.
|
||
- `trigger_id` polymorphic (no DB FK) — references `triggers.id` for
|
||
`source_kind IN {kv, dead_letter}`, `routes.id` for
|
||
`source_kind = 'http'`. Discrimination in Rust at dispatch time.
|
||
- `claimed_by TEXT` — pid-based for MVP; cluster mode can use any
|
||
identifier without schema change.
|
||
- `trigger_depth` + `root_execution_id` denormalized so the
|
||
dispatcher rebuilds `ExecRequest` without joining back to the
|
||
originating execution log.
|
||
- No explicit `is_dead_letter_handler` column — dispatcher infers
|
||
from the trigger's `kind` field at dispatch time.
|
||
|
||
### KV pagination
|
||
- **Cursor-style**, opaque base64-encoded last-key.
|
||
- Page-size cap of 1000 with default 100 (enforced in repo).
|
||
- Documented in `crates/shared/src/kv.rs` and the SDK function
|
||
comment.
|
||
|
||
### KV TTL
|
||
- Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes
|
||
don't surface TTL through the SDK (`set(k,v)` has no TTL argument)
|
||
so the column is **omitted from migration 0007**. Adding it later
|
||
is a non-breaking forward migration. Recorded in CHANGELOG as a
|
||
deferred item.
|
||
|
||
### Authz scope mapping (seven-scope commitment)
|
||
The four new capabilities map onto existing scopes — **no new scope
|
||
variants** to honour the `Scope` enum's "exactly seven values"
|
||
contract (`crates/shared/src/auth.rs:103`):
|
||
|
||
| Capability | Scope |
|
||
|---|---|
|
||
| `AppKvRead` | `script:read` |
|
||
| `AppKvWrite` | `script:write` |
|
||
| `AppManageTriggers` | `app:admin` |
|
||
| `AppDeadLetterManage` | `app:admin` |
|
||
|
||
`role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite`
|
||
at Editor, and both trigger / DL caps at AppAdmin.
|
||
|
||
### Script-as-gate authz for SDK calls
|
||
- `KvServiceImpl` runs `authz::require` only when
|
||
`cx.principal.is_some()`. Anonymous public-HTTP scripts (the
|
||
common case for public routes) bypass the cap check.
|
||
- Cross-app isolation is **independent** of this — enforced by
|
||
`cx.app_id` being the only source of `app_id` on every query.
|
||
- `PostgresDeadLetterService::{replay,resolve}` keeps a hard
|
||
`require` (no `if let Some`) — managing dead letters is an admin
|
||
act per design notes §4. Public scripts with `principal: None`
|
||
fail the check, which is correct.
|
||
|
||
### Trait split: `OutboxRepo` vs `OutboxWriter`
|
||
orchestrator-core can't depend on manager-core (would invert the
|
||
dependency arrow). Defined a small `OutboxWriter` trait in
|
||
`picloud-shared` with a single `enqueue_http` method.
|
||
`PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher
|
||
surface) and `OutboxWriter` (orchestrator surface); the picloud
|
||
binary clones one concrete Arc into both trait views — mirrors the
|
||
existing `members_concrete` / `AuthzRepo` pattern.
|
||
|
||
### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core
|
||
Same split rationale — the dispatcher (manager-core) only depends on
|
||
the trait, while the in-process impl lives next to its consumer.
|
||
Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the
|
||
unchanged trait.
|
||
|
||
### manager-core now depends on executor-core
|
||
Previously manager-core only depended on orchestrator-core. The
|
||
dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/
|
||
`InvocationType` from `executor-core` to build invocation
|
||
descriptors. This is the transport DTO interpretation of the
|
||
working-rules "don't reach across `*-core` crates" — DTOs are fine,
|
||
behaviour is the bright line.
|
||
|
||
### Sync HTTP via outbox is the default for the user-routes path
|
||
The orchestrator's user-route handler is fully on the NATS-style
|
||
path now — every sync HTTP request writes to the outbox and awaits
|
||
inbox delivery. Adds ~2-5ms per request per design notes §3 latency
|
||
budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls
|
||
the executor directly since it doesn't need the unified
|
||
observability — kept for simplicity and admin tooling speed.
|
||
|
||
### Trigger-depth check is on the outbox row, not in the executor
|
||
Dispatcher rejects depth-exceeded rows **before** trying to
|
||
execute. The `cx.trigger_depth` field is informational on the
|
||
executor side. Rejection writes a log + (reserved) metric and
|
||
deletes the row — no DL, per design notes §4.
|
||
|
||
## 4. Tests added
|
||
|
||
### Unit tests (no DB required)
|
||
- `manager-core::kv_service::tests` (10 tests) — round-trip, missing
|
||
key returns None, `has` predicate, `delete` was-present,
|
||
empty-collection rejection, **cross-app isolation**, anonymous-cx
|
||
skips authz, authed-cx-with-no-role is Forbidden, owner-can-write,
|
||
cursor pagination via in-memory KvRepo + denying authz repo.
|
||
- `manager-core::trigger_config::tests` (2 tests) — conservative
|
||
defaults, backoff round-trips.
|
||
- `manager-core::trigger_repo::tests` (1 test) — `collection_matches`
|
||
glob behaviour (`*`, `prefix:*`, exact).
|
||
- `manager-core::dispatcher::tests` (5 tests) — exponential / linear /
|
||
constant backoff math, jitter within bounds, ExecError →
|
||
InboxFailureKind classification, failure-kind → status-code mapping.
|
||
- `manager-core::abandoned_repo::tests` (2 tests) — truncate
|
||
char-boundary safety.
|
||
- `manager-core::triggers_api::tests` (5 tests) — unknown-app 404,
|
||
member-without-role 403, default fallback for retry settings,
|
||
empty-glob rejection, cross-app delete is treated as not-found.
|
||
- `orchestrator-core::inbox::tests` (4 tests) — register/deliver
|
||
round-trip, unknown-id is Abandoned, dropped receiver is
|
||
Abandoned, explicit cancel removes sender.
|
||
- `executor-core::engine::tests` (3 new) — `ctx.event` absent for
|
||
direct invocations, KV insert shape matches design notes,
|
||
KV delete has unit value.
|
||
- `executor-core::sdk_kv` integration suite (7 tests) — runs a real
|
||
Rhai engine under `spawn_blocking` against an in-memory
|
||
`KvService` impl. Covers handle pattern, round-trip, unit-on-
|
||
missing, has predicate, delete-was-present, empty-collection
|
||
throws, cursor pagination, **cross-app isolation through the
|
||
bridge**.
|
||
|
||
**Total: 47 new tests across the workspace.** Workspace test counts
|
||
after v1.1.1: 63 manager-core / 56 orchestrator-core / 17
|
||
executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib /
|
||
21 picloud / 6 shared.
|
||
|
||
### Intentionally untested
|
||
- DB-backed integration tests for the full dispatcher loop, KV→
|
||
trigger→DL retry chain, sync HTTP via outbox round-trip,
|
||
recursion-stop end-to-end. These need a real Postgres harness;
|
||
the reviewer runs them via the manual smoke flow below.
|
||
- Postgres-specific repo behaviour (sqlx query correctness). The
|
||
repos compile and run against the schema, but no integration
|
||
test crate spins up a DB in this branch — same pattern as v1.1.0
|
||
(see existing `ignored, needs DATABASE_URL` test markers).
|
||
|
||
## 5. Open questions for the reviewer
|
||
|
||
1. **Outbox `claimed_at` clearing on success.** The dispatcher
|
||
`delete`s the outbox row after success / DL. For failures it
|
||
reschedules (which sets `claimed_at = NULL`). Both flows are
|
||
correct, but if you imagine a crash between the executor return
|
||
and the row update, the row stays claimed forever. Cluster mode
|
||
should add a periodic "unstick stale claims" sweep. Not in
|
||
v1.1.1 scope but worth surfacing.
|
||
2. **Sync HTTP overhead.** Every sync HTTP request now goes through
|
||
the outbox (write + dispatcher pickup + inbox delivery).
|
||
Measured overhead expected ~2-5ms per design notes §3. No
|
||
benchmarking yet — recommend the reviewer pick a representative
|
||
script and compare 95p latency vs v1.1.0 if performance matters.
|
||
3. **HTTP outbox rows don't run as a principal.** The orchestrator's
|
||
public HTTP path has no authenticated user; the
|
||
`origin_principal` field on the outbox row is forensic. The
|
||
resulting `ExecRequest.principal = None`, so the script runs
|
||
anonymously — matches direct execution. If you'd prefer
|
||
triggered-from-HTTP scripts to inherit a derived principal
|
||
(e.g. the route's app's owner), that's an additive change.
|
||
4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.**
|
||
Async dispatches don't have a script-level timeout (no
|
||
originating HTTP request to bound). Picked the same platform
|
||
cap as `LocalExecutorClient`. If async needs a different cap,
|
||
easy to thread through `TriggerConfig`.
|
||
5. **Dispatcher tick cadence is 100ms.** Bounded enough that
|
||
fan-out feels instant; loose enough that an idle process
|
||
doesn't burn cycles. If the reviewer wants tighter latency,
|
||
bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work).
|
||
6. **CHANGELOG.md is new.** Followed the rest of the repo's
|
||
convention from git log (release commits + design-notes
|
||
references). If a different format is preferred, easy to swap.
|
||
|
||
## 6. Deferred to later releases
|
||
|
||
- `dead_letters::list(filter)` Rhai SDK — design notes §4 defers
|
||
to v1.2 to align with `docs::find()` query DSL.
|
||
- KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1
|
||
SDK doesn't surface it. Forward-compat (no schema cost).
|
||
- Auto-disable of triggers whose script was deleted — design notes
|
||
§4 says current handling is metric+log; auto-disable is v1.2.
|
||
- Per-app dead-letter retention — design notes §4 says env-only in
|
||
v1.1.1.
|
||
- Metrics counter emit for `picloud_trigger_depth_exceeded`,
|
||
`picloud_dead_letter_handler_failures`,
|
||
`picloud_abandoned_executions_total`. Code paths log the
|
||
occurrences with `tracing::warn`/`error`; the actual
|
||
counter-emit code is a `TODO(metrics)` comment in the
|
||
dispatcher. Metrics surface is v1.1.7+ per the roadmap.
|
||
- DB-backed integration tests for the dispatcher loop (see §4
|
||
intentionally-untested).
|
||
- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs
|
||
v1.1.1 outbox path.
|
||
|
||
## 7. How to verify locally
|
||
|
||
### Static checks (all green on this branch)
|
||
```sh
|
||
cargo fmt --all -- --check
|
||
cargo clippy --all-targets --all-features -- -D warnings
|
||
cargo test --workspace
|
||
cd dashboard && npm run check && npm run build
|
||
```
|
||
|
||
### Migration integrity
|
||
```sh
|
||
docker compose down -v && docker compose up -d postgres
|
||
cargo run -p picloud # applies 0001..0012 from empty
|
||
```
|
||
Then start from `main` (v1.1.0 schema state) and switch to this
|
||
branch; restart `picloud` to apply 0007..0012 on top.
|
||
|
||
### Manual end-to-end smoke (reviewer should run)
|
||
```sh
|
||
docker compose up -d
|
||
# 1. Bootstrap an owner user via the existing flow + create app A.
|
||
# 2. Create a script in A whose body is: throw "boom"
|
||
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
|
||
# {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
|
||
# 4. From another script (or a public HTTP route):
|
||
# kv::collection("widgets").set("k1", #{n:1})
|
||
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
|
||
# 6. Open the dashboard at /admin.
|
||
# 7. Apps list shows a red "1" badge next to app A.
|
||
# 8. Click into app A → "Dead letters" tab link → row visible.
|
||
# 9. Click row → full payload + error history.
|
||
# 10. Click "Replay" → row marks resolution='replayed', new outbox
|
||
# row written, dispatcher re-runs the handler (fails again,
|
||
# produces a NEW DL row).
|
||
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.
|
||
```
|
||
|
||
### Async route smoke
|
||
```sh
|
||
# Create a route via POST /api/v1/admin/scripts/{id}/routes with
|
||
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
|
||
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
|
||
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
|
||
# Then tail execution_logs — the script ran later (not synchronously).
|
||
```
|
||
|
||
### Trigger-depth limit smoke
|
||
```sh
|
||
# Set a low depth limit + register a KV trigger whose script
|
||
# writes to KV again — creates a loop.
|
||
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
|
||
# kv.set(...) from a script → triggers same script → depth hits 4
|
||
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).
|
||
```
|
||
|
||
## 8. Known limitations / rough edges
|
||
|
||
- **No DB-backed integration tests in this branch.** Unit tests
|
||
cover trait behaviour with in-memory backings; sqlx query
|
||
correctness is verified by the workspace compile + manual smoke.
|
||
- **Dispatcher concurrency is in-process serial-per-tick.** Up to
|
||
8 rows claimed per tick, processed one at a time. Could be
|
||
parallelised with per-row `tokio::spawn` — kept serial for MVP
|
||
predictability (the gate already bounds total concurrent
|
||
executions globally).
|
||
- **Metric emission is TODO** at the three spots noted in
|
||
Open Questions §5. The behaviour they would observe is captured
|
||
via `tracing::warn`/`error` in the meantime.
|
||
- **`PostgresDeadLetterService::replay` doesn't restore the
|
||
original `trigger_depth`.** Replays start at depth 0. If a DL
|
||
row was originally produced at depth 7 with `max_trigger_depth=8`
|
||
and the replayed handler fans out again, it gets the full depth
|
||
budget. Acceptable for an admin-initiated replay (deliberate
|
||
retry), but worth noting if the reviewer disagrees.
|
||
- **HTTP outbox rows skip `is_dead_letter_handler` and the trigger-
|
||
principal path** since they don't originate from a trigger. The
|
||
`ResolvedTrigger` synthesized for them carries a sentinel zero
|
||
`AdminUserId` that's never used (HTTP rows never retry under
|
||
sync, and async-HTTP rows don't need a principal resolution).
|
||
- **DataPlaneState's executor field is still generic** (`Arc<E>`
|
||
where `E: ExecutorClient`). The dispatcher uses `Arc<dyn
|
||
ExecutorClient>` directly. The picloud binary clones the same
|
||
`Arc<LocalExecutorClient>` into both — works because the
|
||
concrete type implements both the trait object and the generic
|
||
bound.
|
||
- **dispatcher always sets `principal: None` for HTTP rows.** As
|
||
noted in Open Question §3, HTTP outbox rows don't resolve a
|
||
principal. Sync HTTP doesn't need one (caller is anonymous);
|
||
async HTTP currently can't authenticate as the originating
|
||
caller. If that's not the intent, additive change.
|
||
- **Cluster-mode crash recovery for claimed rows.** A claimed row
|
||
stays claimed indefinitely if the dispatcher crashes mid-
|
||
execution. v1.1.1 has one dispatcher per process so this is
|
||
rare; cluster mode (v1.3+) needs a stale-claim sweeper.
|
||
|
||
---
|
||
|
||
Branch ready for review. Reviewer reads this report + audits the
|
||
diff. Do not merge to main until the audit clears.
|