Files
PiCloud/HANDBACK.md
MechaCat02 5a95ff2d07 docs(v1.1.1): handback report for reviewer
Summary of the 11-commit v1.1.1 branch:
- branch + commit count, scope coverage table, decisions made
  mid-implementation, deviations from the design notes
- tests added (47 new) + intentionally-untested gaps
- open questions for the reviewer
- deferred items
- verification commands + manual smoke flow
- known limitations / rough edges

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 22:27:18 +02:00

341 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# v1.1.1 Implementation HANDBACK
## 1. Branch + commit count
- Branch: `feat/v1.1.1-storage-and-events`
- Base: `main`
- 11 commits ahead of `main`. Branch is **not pushed**, **not merged**.
```
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
```
The first commit (`1efb350`) absorbed working-tree edits to
`docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open
calls into "Decided 2026-06-01" entries. Those were on the working
tree at branch creation; folding them into the v1.1.1 branch keeps
the design rationale colocated with the implementation.
## 2. Scope coverage (Done / Partial / Skipped)
| Scope item | Status | Notes |
|---|---|---|
| **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. |
| **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). |
| **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. |
| **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex<HashMap<Uuid, oneshot::Sender>>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). |
| **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. |
| **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. |
| **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). |
| **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. |
| **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option<TriggerEvent>` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). |
| **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). |
| **11. Workspace version bump** | Done | Workspace `1.1.0``1.1.1`, SDK `1.1``1.2`, dashboard `0.6.0``0.7.0`. CHANGELOG.md created at repo root. |
## 3. Key implementation decisions / deviations
### Outbox column set (deferred to implementation per design notes §2)
Chose:
- `script_id` denormalized — dispatcher resolves the target without
re-joining for the common path.
- `trigger_id` polymorphic (no DB FK) — references `triggers.id` for
`source_kind IN {kv, dead_letter}`, `routes.id` for
`source_kind = 'http'`. Discrimination in Rust at dispatch time.
- `claimed_by TEXT` — pid-based for MVP; cluster mode can use any
identifier without schema change.
- `trigger_depth` + `root_execution_id` denormalized so the
dispatcher rebuilds `ExecRequest` without joining back to the
originating execution log.
- No explicit `is_dead_letter_handler` column — dispatcher infers
from the trigger's `kind` field at dispatch time.
### KV pagination
- **Cursor-style**, opaque base64-encoded last-key.
- Page-size cap of 1000 with default 100 (enforced in repo).
- Documented in `crates/shared/src/kv.rs` and the SDK function
comment.
### KV TTL
- Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes
don't surface TTL through the SDK (`set(k,v)` has no TTL argument)
so the column is **omitted from migration 0007**. Adding it later
is a non-breaking forward migration. Recorded in CHANGELOG as a
deferred item.
### Authz scope mapping (seven-scope commitment)
The four new capabilities map onto existing scopes — **no new scope
variants** to honour the `Scope` enum's "exactly seven values"
contract (`crates/shared/src/auth.rs:103`):
| Capability | Scope |
|---|---|
| `AppKvRead` | `script:read` |
| `AppKvWrite` | `script:write` |
| `AppManageTriggers` | `app:admin` |
| `AppDeadLetterManage` | `app:admin` |
`role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite`
at Editor, and both trigger / DL caps at AppAdmin.
### Script-as-gate authz for SDK calls
- `KvServiceImpl` runs `authz::require` only when
`cx.principal.is_some()`. Anonymous public-HTTP scripts (the
common case for public routes) bypass the cap check.
- Cross-app isolation is **independent** of this — enforced by
`cx.app_id` being the only source of `app_id` on every query.
- `PostgresDeadLetterService::{replay,resolve}` keeps a hard
`require` (no `if let Some`) — managing dead letters is an admin
act per design notes §4. Public scripts with `principal: None`
fail the check, which is correct.
### Trait split: `OutboxRepo` vs `OutboxWriter`
orchestrator-core can't depend on manager-core (would invert the
dependency arrow). Defined a small `OutboxWriter` trait in
`picloud-shared` with a single `enqueue_http` method.
`PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher
surface) and `OutboxWriter` (orchestrator surface); the picloud
binary clones one concrete Arc into both trait views — mirrors the
existing `members_concrete` / `AuthzRepo` pattern.
### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core
Same split rationale — the dispatcher (manager-core) only depends on
the trait, while the in-process impl lives next to its consumer.
Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the
unchanged trait.
### manager-core now depends on executor-core
Previously manager-core only depended on orchestrator-core. The
dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/
`InvocationType` from `executor-core` to build invocation
descriptors. This is the transport DTO interpretation of the
working-rules "don't reach across `*-core` crates" — DTOs are fine,
behaviour is the bright line.
### Sync HTTP via outbox is the default for the user-routes path
The orchestrator's user-route handler is fully on the NATS-style
path now — every sync HTTP request writes to the outbox and awaits
inbox delivery. Adds ~2-5ms per request per design notes §3 latency
budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls
the executor directly since it doesn't need the unified
observability — kept for simplicity and admin tooling speed.
### Trigger-depth check is on the outbox row, not in the executor
Dispatcher rejects depth-exceeded rows **before** trying to
execute. The `cx.trigger_depth` field is informational on the
executor side. Rejection writes a log + (reserved) metric and
deletes the row — no DL, per design notes §4.
## 4. Tests added
### Unit tests (no DB required)
- `manager-core::kv_service::tests` (10 tests) — round-trip, missing
key returns None, `has` predicate, `delete` was-present,
empty-collection rejection, **cross-app isolation**, anonymous-cx
skips authz, authed-cx-with-no-role is Forbidden, owner-can-write,
cursor pagination via in-memory KvRepo + denying authz repo.
- `manager-core::trigger_config::tests` (2 tests) — conservative
defaults, backoff round-trips.
- `manager-core::trigger_repo::tests` (1 test) — `collection_matches`
glob behaviour (`*`, `prefix:*`, exact).
- `manager-core::dispatcher::tests` (5 tests) — exponential / linear /
constant backoff math, jitter within bounds, ExecError →
InboxFailureKind classification, failure-kind → status-code mapping.
- `manager-core::abandoned_repo::tests` (2 tests) — truncate
char-boundary safety.
- `manager-core::triggers_api::tests` (5 tests) — unknown-app 404,
member-without-role 403, default fallback for retry settings,
empty-glob rejection, cross-app delete is treated as not-found.
- `orchestrator-core::inbox::tests` (4 tests) — register/deliver
round-trip, unknown-id is Abandoned, dropped receiver is
Abandoned, explicit cancel removes sender.
- `executor-core::engine::tests` (3 new) — `ctx.event` absent for
direct invocations, KV insert shape matches design notes,
KV delete has unit value.
- `executor-core::sdk_kv` integration suite (7 tests) — runs a real
Rhai engine under `spawn_blocking` against an in-memory
`KvService` impl. Covers handle pattern, round-trip, unit-on-
missing, has predicate, delete-was-present, empty-collection
throws, cursor pagination, **cross-app isolation through the
bridge**.
**Total: 47 new tests across the workspace.** Workspace test counts
after v1.1.1: 63 manager-core / 56 orchestrator-core / 17
executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib /
21 picloud / 6 shared.
### Intentionally untested
- DB-backed integration tests for the full dispatcher loop, KV→
trigger→DL retry chain, sync HTTP via outbox round-trip,
recursion-stop end-to-end. These need a real Postgres harness;
the reviewer runs them via the manual smoke flow below.
- Postgres-specific repo behaviour (sqlx query correctness). The
repos compile and run against the schema, but no integration
test crate spins up a DB in this branch — same pattern as v1.1.0
(see existing `ignored, needs DATABASE_URL` test markers).
## 5. Open questions for the reviewer
1. **Outbox `claimed_at` clearing on success.** The dispatcher
`delete`s the outbox row after success / DL. For failures it
reschedules (which sets `claimed_at = NULL`). Both flows are
correct, but if you imagine a crash between the executor return
and the row update, the row stays claimed forever. Cluster mode
should add a periodic "unstick stale claims" sweep. Not in
v1.1.1 scope but worth surfacing.
2. **Sync HTTP overhead.** Every sync HTTP request now goes through
the outbox (write + dispatcher pickup + inbox delivery).
Measured overhead expected ~2-5ms per design notes §3. No
benchmarking yet — recommend the reviewer pick a representative
script and compare 95p latency vs v1.1.0 if performance matters.
3. **HTTP outbox rows don't run as a principal.** The orchestrator's
public HTTP path has no authenticated user; the
`origin_principal` field on the outbox row is forensic. The
resulting `ExecRequest.principal = None`, so the script runs
anonymously — matches direct execution. If you'd prefer
triggered-from-HTTP scripts to inherit a derived principal
(e.g. the route's app's owner), that's an additive change.
4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.**
Async dispatches don't have a script-level timeout (no
originating HTTP request to bound). Picked the same platform
cap as `LocalExecutorClient`. If async needs a different cap,
easy to thread through `TriggerConfig`.
5. **Dispatcher tick cadence is 100ms.** Bounded enough that
fan-out feels instant; loose enough that an idle process
doesn't burn cycles. If the reviewer wants tighter latency,
bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work).
6. **CHANGELOG.md is new.** Followed the rest of the repo's
convention from git log (release commits + design-notes
references). If a different format is preferred, easy to swap.
## 6. Deferred to later releases
- `dead_letters::list(filter)` Rhai SDK — design notes §4 defers
to v1.2 to align with `docs::find()` query DSL.
- KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1
SDK doesn't surface it. Forward-compat (no schema cost).
- Auto-disable of triggers whose script was deleted — design notes
§4 says current handling is metric+log; auto-disable is v1.2.
- Per-app dead-letter retention — design notes §4 says env-only in
v1.1.1.
- Metrics counter emit for `picloud_trigger_depth_exceeded`,
`picloud_dead_letter_handler_failures`,
`picloud_abandoned_executions_total`. Code paths log the
occurrences with `tracing::warn`/`error`; the actual
counter-emit code is a `TODO(metrics)` comment in the
dispatcher. Metrics surface is v1.1.7+ per the roadmap.
- DB-backed integration tests for the dispatcher loop (see §4
intentionally-untested).
- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs
v1.1.1 outbox path.
## 7. How to verify locally
### Static checks (all green on this branch)
```sh
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --workspace
cd dashboard && npm run check && npm run build
```
### Migration integrity
```sh
docker compose down -v && docker compose up -d postgres
cargo run -p picloud # applies 0001..0012 from empty
```
Then start from `main` (v1.1.0 schema state) and switch to this
branch; restart `picloud` to apply 0007..0012 on top.
### Manual end-to-end smoke (reviewer should run)
```sh
docker compose up -d
# 1. Bootstrap an owner user via the existing flow + create app A.
# 2. Create a script in A whose body is: throw "boom"
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
# {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
# 4. From another script (or a public HTTP route):
# kv::collection("widgets").set("k1", #{n:1})
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
# 6. Open the dashboard at /admin.
# 7. Apps list shows a red "1" badge next to app A.
# 8. Click into app A → "Dead letters" tab link → row visible.
# 9. Click row → full payload + error history.
# 10. Click "Replay" → row marks resolution='replayed', new outbox
# row written, dispatcher re-runs the handler (fails again,
# produces a NEW DL row).
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.
```
### Async route smoke
```sh
# Create a route via POST /api/v1/admin/scripts/{id}/routes with
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
# Then tail execution_logs — the script ran later (not synchronously).
```
### Trigger-depth limit smoke
```sh
# Set a low depth limit + register a KV trigger whose script
# writes to KV again — creates a loop.
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
# kv.set(...) from a script → triggers same script → depth hits 4
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).
```
## 8. Known limitations / rough edges
- **No DB-backed integration tests in this branch.** Unit tests
cover trait behaviour with in-memory backings; sqlx query
correctness is verified by the workspace compile + manual smoke.
- **Dispatcher concurrency is in-process serial-per-tick.** Up to
8 rows claimed per tick, processed one at a time. Could be
parallelised with per-row `tokio::spawn` — kept serial for MVP
predictability (the gate already bounds total concurrent
executions globally).
- **Metric emission is TODO** at the three spots noted in
Open Questions §5. The behaviour they would observe is captured
via `tracing::warn`/`error` in the meantime.
- **`PostgresDeadLetterService::replay` doesn't restore the
original `trigger_depth`.** Replays start at depth 0. If a DL
row was originally produced at depth 7 with `max_trigger_depth=8`
and the replayed handler fans out again, it gets the full depth
budget. Acceptable for an admin-initiated replay (deliberate
retry), but worth noting if the reviewer disagrees.
- **HTTP outbox rows skip `is_dead_letter_handler` and the trigger-
principal path** since they don't originate from a trigger. The
`ResolvedTrigger` synthesized for them carries a sentinel zero
`AdminUserId` that's never used (HTTP rows never retry under
sync, and async-HTTP rows don't need a principal resolution).
- **DataPlaneState's executor field is still generic** (`Arc<E>`
where `E: ExecutorClient`). The dispatcher uses `Arc<dyn
ExecutorClient>` directly. The picloud binary clones the same
`Arc<LocalExecutorClient>` into both — works because the
concrete type implements both the trait object and the generic
bound.
- **dispatcher always sets `principal: None` for HTTP rows.** As
noted in Open Question §3, HTTP outbox rows don't resolve a
principal. Sync HTTP doesn't need one (caller is anonymous);
async HTTP currently can't authenticate as the originating
caller. If that's not the intent, additive change.
- **Cluster-mode crash recovery for claimed rows.** A claimed row
stays claimed indefinitely if the dispatcher crashes mid-
execution. v1.1.1 has one dispatcher per process so this is
rare; cluster mode (v1.3+) needs a stale-claim sweeper.
---
Branch ready for review. Reviewer reads this report + audits the
diff. Do not merge to main until the audit clears.