docs(v1.1.1): handback report for reviewer

Summary of the 11-commit v1.1.1 branch:
- branch + commit count, scope coverage table, decisions made
  mid-implementation, deviations from the design notes
- tests added (47 new) + intentionally-untested gaps
- open questions for the reviewer
- deferred items
- verification commands + manual smoke flow
- known limitations / rough edges

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
MechaCat02
2026-06-01 22:27:18 +02:00
parent 66b661f64c
commit 5a95ff2d07

340
HANDBACK.md Normal file
View File

@@ -0,0 +1,340 @@
# v1.1.1 Implementation HANDBACK
## 1. Branch + commit count
- Branch: `feat/v1.1.1-storage-and-events`
- Base: `main`
- 11 commits ahead of `main`. Branch is **not pushed**, **not merged**.
```
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
```
The first commit (`1efb350`) absorbed working-tree edits to
`docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open
calls into "Decided 2026-06-01" entries. Those were on the working
tree at branch creation; folding them into the v1.1.1 branch keeps
the design rationale colocated with the implementation.
## 2. Scope coverage (Done / Partial / Skipped)
| Scope item | Status | Notes |
|---|---|---|
| **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. |
| **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). |
| **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. |
| **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex<HashMap<Uuid, oneshot::Sender>>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). |
| **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. |
| **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. |
| **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). |
| **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. |
| **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option<TriggerEvent>` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). |
| **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). |
| **11. Workspace version bump** | Done | Workspace `1.1.0``1.1.1`, SDK `1.1``1.2`, dashboard `0.6.0``0.7.0`. CHANGELOG.md created at repo root. |
## 3. Key implementation decisions / deviations
### Outbox column set (deferred to implementation per design notes §2)
Chose:
- `script_id` denormalized — dispatcher resolves the target without
re-joining for the common path.
- `trigger_id` polymorphic (no DB FK) — references `triggers.id` for
`source_kind IN {kv, dead_letter}`, `routes.id` for
`source_kind = 'http'`. Discrimination in Rust at dispatch time.
- `claimed_by TEXT` — pid-based for MVP; cluster mode can use any
identifier without schema change.
- `trigger_depth` + `root_execution_id` denormalized so the
dispatcher rebuilds `ExecRequest` without joining back to the
originating execution log.
- No explicit `is_dead_letter_handler` column — dispatcher infers
from the trigger's `kind` field at dispatch time.
### KV pagination
- **Cursor-style**, opaque base64-encoded last-key.
- Page-size cap of 1000 with default 100 (enforced in repo).
- Documented in `crates/shared/src/kv.rs` and the SDK function
comment.
### KV TTL
- Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes
don't surface TTL through the SDK (`set(k,v)` has no TTL argument)
so the column is **omitted from migration 0007**. Adding it later
is a non-breaking forward migration. Recorded in CHANGELOG as a
deferred item.
### Authz scope mapping (seven-scope commitment)
The four new capabilities map onto existing scopes — **no new scope
variants** to honour the `Scope` enum's "exactly seven values"
contract (`crates/shared/src/auth.rs:103`):
| Capability | Scope |
|---|---|
| `AppKvRead` | `script:read` |
| `AppKvWrite` | `script:write` |
| `AppManageTriggers` | `app:admin` |
| `AppDeadLetterManage` | `app:admin` |
`role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite`
at Editor, and both trigger / DL caps at AppAdmin.
### Script-as-gate authz for SDK calls
- `KvServiceImpl` runs `authz::require` only when
`cx.principal.is_some()`. Anonymous public-HTTP scripts (the
common case for public routes) bypass the cap check.
- Cross-app isolation is **independent** of this — enforced by
`cx.app_id` being the only source of `app_id` on every query.
- `PostgresDeadLetterService::{replay,resolve}` keeps a hard
`require` (no `if let Some`) — managing dead letters is an admin
act per design notes §4. Public scripts with `principal: None`
fail the check, which is correct.
### Trait split: `OutboxRepo` vs `OutboxWriter`
orchestrator-core can't depend on manager-core (would invert the
dependency arrow). Defined a small `OutboxWriter` trait in
`picloud-shared` with a single `enqueue_http` method.
`PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher
surface) and `OutboxWriter` (orchestrator surface); the picloud
binary clones one concrete Arc into both trait views — mirrors the
existing `members_concrete` / `AuthzRepo` pattern.
### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core
Same split rationale — the dispatcher (manager-core) only depends on
the trait, while the in-process impl lives next to its consumer.
Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the
unchanged trait.
### manager-core now depends on executor-core
Previously manager-core only depended on orchestrator-core. The
dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/
`InvocationType` from `executor-core` to build invocation
descriptors. This is the transport DTO interpretation of the
working-rules "don't reach across `*-core` crates" — DTOs are fine,
behaviour is the bright line.
### Sync HTTP via outbox is the default for the user-routes path
The orchestrator's user-route handler is fully on the NATS-style
path now — every sync HTTP request writes to the outbox and awaits
inbox delivery. Adds ~2-5ms per request per design notes §3 latency
budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls
the executor directly since it doesn't need the unified
observability — kept for simplicity and admin tooling speed.
### Trigger-depth check is on the outbox row, not in the executor
Dispatcher rejects depth-exceeded rows **before** trying to
execute. The `cx.trigger_depth` field is informational on the
executor side. Rejection writes a log + (reserved) metric and
deletes the row — no DL, per design notes §4.
## 4. Tests added
### Unit tests (no DB required)
- `manager-core::kv_service::tests` (10 tests) — round-trip, missing
key returns None, `has` predicate, `delete` was-present,
empty-collection rejection, **cross-app isolation**, anonymous-cx
skips authz, authed-cx-with-no-role is Forbidden, owner-can-write,
cursor pagination via in-memory KvRepo + denying authz repo.
- `manager-core::trigger_config::tests` (2 tests) — conservative
defaults, backoff round-trips.
- `manager-core::trigger_repo::tests` (1 test) — `collection_matches`
glob behaviour (`*`, `prefix:*`, exact).
- `manager-core::dispatcher::tests` (5 tests) — exponential / linear /
constant backoff math, jitter within bounds, ExecError →
InboxFailureKind classification, failure-kind → status-code mapping.
- `manager-core::abandoned_repo::tests` (2 tests) — truncate
char-boundary safety.
- `manager-core::triggers_api::tests` (5 tests) — unknown-app 404,
member-without-role 403, default fallback for retry settings,
empty-glob rejection, cross-app delete is treated as not-found.
- `orchestrator-core::inbox::tests` (4 tests) — register/deliver
round-trip, unknown-id is Abandoned, dropped receiver is
Abandoned, explicit cancel removes sender.
- `executor-core::engine::tests` (3 new) — `ctx.event` absent for
direct invocations, KV insert shape matches design notes,
KV delete has unit value.
- `executor-core::sdk_kv` integration suite (7 tests) — runs a real
Rhai engine under `spawn_blocking` against an in-memory
`KvService` impl. Covers handle pattern, round-trip, unit-on-
missing, has predicate, delete-was-present, empty-collection
throws, cursor pagination, **cross-app isolation through the
bridge**.
**Total: 47 new tests across the workspace.** Workspace test counts
after v1.1.1: 63 manager-core / 56 orchestrator-core / 17
executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib /
21 picloud / 6 shared.
### Intentionally untested
- DB-backed integration tests for the full dispatcher loop, KV→
trigger→DL retry chain, sync HTTP via outbox round-trip,
recursion-stop end-to-end. These need a real Postgres harness;
the reviewer runs them via the manual smoke flow below.
- Postgres-specific repo behaviour (sqlx query correctness). The
repos compile and run against the schema, but no integration
test crate spins up a DB in this branch — same pattern as v1.1.0
(see existing `ignored, needs DATABASE_URL` test markers).
## 5. Open questions for the reviewer
1. **Outbox `claimed_at` clearing on success.** The dispatcher
`delete`s the outbox row after success / DL. For failures it
reschedules (which sets `claimed_at = NULL`). Both flows are
correct, but if you imagine a crash between the executor return
and the row update, the row stays claimed forever. Cluster mode
should add a periodic "unstick stale claims" sweep. Not in
v1.1.1 scope but worth surfacing.
2. **Sync HTTP overhead.** Every sync HTTP request now goes through
the outbox (write + dispatcher pickup + inbox delivery).
Measured overhead expected ~2-5ms per design notes §3. No
benchmarking yet — recommend the reviewer pick a representative
script and compare 95p latency vs v1.1.0 if performance matters.
3. **HTTP outbox rows don't run as a principal.** The orchestrator's
public HTTP path has no authenticated user; the
`origin_principal` field on the outbox row is forensic. The
resulting `ExecRequest.principal = None`, so the script runs
anonymously — matches direct execution. If you'd prefer
triggered-from-HTTP scripts to inherit a derived principal
(e.g. the route's app's owner), that's an additive change.
4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.**
Async dispatches don't have a script-level timeout (no
originating HTTP request to bound). Picked the same platform
cap as `LocalExecutorClient`. If async needs a different cap,
easy to thread through `TriggerConfig`.
5. **Dispatcher tick cadence is 100ms.** Bounded enough that
fan-out feels instant; loose enough that an idle process
doesn't burn cycles. If the reviewer wants tighter latency,
bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work).
6. **CHANGELOG.md is new.** Followed the rest of the repo's
convention from git log (release commits + design-notes
references). If a different format is preferred, easy to swap.
## 6. Deferred to later releases
- `dead_letters::list(filter)` Rhai SDK — design notes §4 defers
to v1.2 to align with `docs::find()` query DSL.
- KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1
SDK doesn't surface it. Forward-compat (no schema cost).
- Auto-disable of triggers whose script was deleted — design notes
§4 says current handling is metric+log; auto-disable is v1.2.
- Per-app dead-letter retention — design notes §4 says env-only in
v1.1.1.
- Metrics counter emit for `picloud_trigger_depth_exceeded`,
`picloud_dead_letter_handler_failures`,
`picloud_abandoned_executions_total`. Code paths log the
occurrences with `tracing::warn`/`error`; the actual
counter-emit code is a `TODO(metrics)` comment in the
dispatcher. Metrics surface is v1.1.7+ per the roadmap.
- DB-backed integration tests for the dispatcher loop (see §4
intentionally-untested).
- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs
v1.1.1 outbox path.
## 7. How to verify locally
### Static checks (all green on this branch)
```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --workspace
cd dashboard && npm run check && npm run build
```
### Migration integrity
```sh
docker compose down -v && docker compose up -d postgres
cargo run -p picloud # applies 0001..0012 from empty
```
Then start from `main` (v1.1.0 schema state) and switch to this
branch; restart `picloud` to apply 0007..0012 on top.
### Manual end-to-end smoke (reviewer should run)
```sh
docker compose up -d
# 1. Bootstrap an owner user via the existing flow + create app A.
# 2. Create a script in A whose body is: throw "boom"
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
# {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
# 4. From another script (or a public HTTP route):
# kv::collection("widgets").set("k1", #{n:1})
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
# 6. Open the dashboard at /admin.
# 7. Apps list shows a red "1" badge next to app A.
# 8. Click into app A → "Dead letters" tab link → row visible.
# 9. Click row → full payload + error history.
# 10. Click "Replay" → row marks resolution='replayed', new outbox
# row written, dispatcher re-runs the handler (fails again,
# produces a NEW DL row).
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.
```
### Async route smoke
```sh
# Create a route via POST /api/v1/admin/scripts/{id}/routes with
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
# Then tail execution_logs — the script ran later (not synchronously).
```
### Trigger-depth limit smoke
```sh
# Set a low depth limit + register a KV trigger whose script
# writes to KV again — creates a loop.
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
# kv.set(...) from a script → triggers same script → depth hits 4
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).
```
## 8. Known limitations / rough edges
- **No DB-backed integration tests in this branch.** Unit tests
cover trait behaviour with in-memory backings; sqlx query
correctness is verified by the workspace compile + manual smoke.
- **Dispatcher concurrency is in-process serial-per-tick.** Up to
8 rows claimed per tick, processed one at a time. Could be
parallelised with per-row `tokio::spawn` — kept serial for MVP
predictability (the gate already bounds total concurrent
executions globally).
- **Metric emission is TODO** at the three spots noted in
Open Questions §5. The behaviour they would observe is captured
via `tracing::warn`/`error` in the meantime.
- **`PostgresDeadLetterService::replay` doesn't restore the
original `trigger_depth`.** Replays start at depth 0. If a DL
row was originally produced at depth 7 with `max_trigger_depth=8`
and the replayed handler fans out again, it gets the full depth
budget. Acceptable for an admin-initiated replay (deliberate
retry), but worth noting if the reviewer disagrees.
- **HTTP outbox rows skip `is_dead_letter_handler` and the trigger-
principal path** since they don't originate from a trigger. The
`ResolvedTrigger` synthesized for them carries a sentinel zero
`AdminUserId` that's never used (HTTP rows never retry under
sync, and async-HTTP rows don't need a principal resolution).
- **DataPlaneState's executor field is still generic** (`Arc<E>`
where `E: ExecutorClient`). The dispatcher uses `Arc<dyn
ExecutorClient>` directly. The picloud binary clones the same
`Arc<LocalExecutorClient>` into both — works because the
concrete type implements both the trait object and the generic
bound.
- **dispatcher always sets `principal: None` for HTTP rows.** As
noted in Open Question §3, HTTP outbox rows don't resolve a
principal. Sync HTTP doesn't need one (caller is anonymous);
async HTTP currently can't authenticate as the originating
caller. If that's not the intent, additive change.
- **Cluster-mode crash recovery for claimed rows.** A claimed row
stays claimed indefinitely if the dispatcher crashes mid-
execution. v1.1.1 has one dispatcher per process so this is
rare; cluster mode (v1.3+) needs a stale-claim sweeper.
---
Branch ready for review. Reviewer reads this report + audits the
diff. Do not merge to main until the audit clears.