docs(v1.1.1): handback report for reviewer
Summary of the 11-commit v1.1.1 branch: - branch + commit count, scope coverage table, decisions made mid-implementation, deviations from the design notes - tests added (47 new) + intentionally-untested gaps - open questions for the reviewer - deferred items - verification commands + manual smoke flow - known limitations / rough edges Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
340
HANDBACK.md
Normal file
340
HANDBACK.md
Normal file
@@ -0,0 +1,340 @@
|
||||
# v1.1.1 Implementation HANDBACK
|
||||
|
||||
## 1. Branch + commit count
|
||||
|
||||
- Branch: `feat/v1.1.1-storage-and-events`
|
||||
- Base: `main`
|
||||
- 11 commits ahead of `main`. Branch is **not pushed**, **not merged**.
|
||||
|
||||
```
|
||||
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
|
||||
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
|
||||
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
|
||||
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
|
||||
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
|
||||
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
|
||||
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
|
||||
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
|
||||
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
|
||||
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
|
||||
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01
|
||||
```
|
||||
|
||||
The first commit (`1efb350`) absorbed working-tree edits to
|
||||
`docs/v1.1.x-design-notes.md` that turned the "in-flight" 20 open
|
||||
calls into "Decided 2026-06-01" entries. Those were on the working
|
||||
tree at branch creation; folding them into the v1.1.1 branch keeps
|
||||
the design rationale colocated with the implementation.
|
||||
|
||||
## 2. Scope coverage (Done / Partial / Skipped)
|
||||
|
||||
| Scope item | Status | Notes |
|
||||
|---|---|---|
|
||||
| **1. KV store** | Done | Migration 0007, `KvService` trait in shared, `KvServiceImpl` + `PostgresKvRepo` in manager-core, Rhai `kv::collection(name).{get,set,has,delete,list}` bridge, cursor pagination, empty-collection rejection, script-as-gate authz. |
|
||||
| **2. Triggers framework — Layout E** | Done | Migrations 0008 (`triggers` + `kv_trigger_details` + `dead_letter_trigger_details`), `TriggerRepo` + `PostgresTriggerRepo`, CRUD admin endpoints. `registered_by_principal` column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8). |
|
||||
| **3. Universal outbox + dispatcher** | Done | Migration 0009 (`outbox`), `OutboxRepo` + `PostgresOutboxRepo`, `Dispatcher` tokio task. Polls every 100ms, claims 8 rows/tick via `FOR UPDATE SKIP LOCKED`, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → `abandoned_executions`. |
|
||||
| **4. NATS-style sync HTTP** | Done | `InboxRegistry` in orchestrator-core (in-process `Mutex<HashMap<Uuid, oneshot::Sender>>`), `InboxResolver` trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with `reply_to`, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500). |
|
||||
| **5. `dispatch_mode: async` HTTP routes** | Done | Migration 0012 adds the column (default `sync`). `DispatchMode` enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in `Matched`. Orchestrator branches: async → outbox + 202; sync → outbox + inbox. |
|
||||
| **6. Dead letters** | Done | Migration 0010 (`dead_letters`), `DeadLetterRepo` + `DeadLetterService` + `PostgresDeadLetterService`. Rhai `dead_letters::{replay,resolve}` + admin endpoints (`GET /count`, `GET /`, `GET /{id}`, `POST /{id}/replay`, `POST /{id}/resolve`). `Capability::AppDeadLetterManage(AppId)` enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as `handler_failed`) implemented in the dispatcher. |
|
||||
| **7. Abandoned executions** | Done | Migration 0011, `AbandonedRepo` + `PostgresAbandonedRepo`, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (`TODO(metrics)` markers in dispatcher.rs). |
|
||||
| **8. Retry policy defaults** | Done | `TriggerConfig::from_env` (new module). Env vars: `PICLOUD_MAX_TRIGGER_DEPTH`, `PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}`, `PICLOUD_DEAD_LETTER_RETENTION_DAYS`, `PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS`. Per-trigger overrides applied at trigger-creation time. |
|
||||
| **9. `ctx.event` for triggered scripts** | Done | `TriggerEvent` enum in shared (KV / DeadLetter variants), `SdkCallCx.event: Option<TriggerEvent>` + `is_dead_letter_handler: bool`. `engine.rs::build_ctx_map` flattens the event into `ctx.event` for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps). |
|
||||
| **10. Dashboard surface** | Done | Per-app red badge with unresolved count on apps list + per-app detail page. New `apps/[slug]/dead-letters/+page.svelte` list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings). |
|
||||
| **11. Workspace version bump** | Done | Workspace `1.1.0` → `1.1.1`, SDK `1.1` → `1.2`, dashboard `0.6.0` → `0.7.0`. CHANGELOG.md created at repo root. |
|
||||
|
||||
## 3. Key implementation decisions / deviations
|
||||
|
||||
### Outbox column set (deferred to implementation per design notes §2)
|
||||
Chose:
|
||||
- `script_id` denormalized — dispatcher resolves the target without
|
||||
re-joining for the common path.
|
||||
- `trigger_id` polymorphic (no DB FK) — references `triggers.id` for
|
||||
`source_kind IN {kv, dead_letter}`, `routes.id` for
|
||||
`source_kind = 'http'`. Discrimination in Rust at dispatch time.
|
||||
- `claimed_by TEXT` — pid-based for MVP; cluster mode can use any
|
||||
identifier without schema change.
|
||||
- `trigger_depth` + `root_execution_id` denormalized so the
|
||||
dispatcher rebuilds `ExecRequest` without joining back to the
|
||||
originating execution log.
|
||||
- No explicit `is_dead_letter_handler` column — dispatcher infers
|
||||
from the trigger's `kind` field at dispatch time.
|
||||
|
||||
### KV pagination
|
||||
- **Cursor-style**, opaque base64-encoded last-key.
|
||||
- Page-size cap of 1000 with default 100 (enforced in repo).
|
||||
- Documented in `crates/shared/src/kv.rs` and the SDK function
|
||||
comment.
|
||||
|
||||
### KV TTL
|
||||
- Blueprint §8.1 reserved an `expires_at` column. v1.1.1 design notes
|
||||
don't surface TTL through the SDK (`set(k,v)` has no TTL argument)
|
||||
so the column is **omitted from migration 0007**. Adding it later
|
||||
is a non-breaking forward migration. Recorded in CHANGELOG as a
|
||||
deferred item.
|
||||
|
||||
### Authz scope mapping (seven-scope commitment)
|
||||
The four new capabilities map onto existing scopes — **no new scope
|
||||
variants** to honour the `Scope` enum's "exactly seven values"
|
||||
contract (`crates/shared/src/auth.rs:103`):
|
||||
|
||||
| Capability | Scope |
|
||||
|---|---|
|
||||
| `AppKvRead` | `script:read` |
|
||||
| `AppKvWrite` | `script:write` |
|
||||
| `AppManageTriggers` | `app:admin` |
|
||||
| `AppDeadLetterManage` | `app:admin` |
|
||||
|
||||
`role_satisfies` grants `AppKvRead` at the Viewer role, `AppKvWrite`
|
||||
at Editor, and both trigger / DL caps at AppAdmin.
|
||||
|
||||
### Script-as-gate authz for SDK calls
|
||||
- `KvServiceImpl` runs `authz::require` only when
|
||||
`cx.principal.is_some()`. Anonymous public-HTTP scripts (the
|
||||
common case for public routes) bypass the cap check.
|
||||
- Cross-app isolation is **independent** of this — enforced by
|
||||
`cx.app_id` being the only source of `app_id` on every query.
|
||||
- `PostgresDeadLetterService::{replay,resolve}` keeps a hard
|
||||
`require` (no `if let Some`) — managing dead letters is an admin
|
||||
act per design notes §4. Public scripts with `principal: None`
|
||||
fail the check, which is correct.
|
||||
|
||||
### Trait split: `OutboxRepo` vs `OutboxWriter`
|
||||
orchestrator-core can't depend on manager-core (would invert the
|
||||
dependency arrow). Defined a small `OutboxWriter` trait in
|
||||
`picloud-shared` with a single `enqueue_http` method.
|
||||
`PostgresOutboxRepo` implements both `OutboxRepo` (dispatcher
|
||||
surface) and `OutboxWriter` (orchestrator surface); the picloud
|
||||
binary clones one concrete Arc into both trait views — mirrors the
|
||||
existing `members_concrete` / `AuthzRepo` pattern.
|
||||
|
||||
### `InboxResolver` lives in shared, `InboxRegistry` in orchestrator-core
|
||||
Same split rationale — the dispatcher (manager-core) only depends on
|
||||
the trait, while the in-process impl lives next to its consumer.
|
||||
Cluster mode (v1.3+) swaps the impl for `LISTEN/NOTIFY` behind the
|
||||
unchanged trait.
|
||||
|
||||
### manager-core now depends on executor-core
|
||||
Previously manager-core only depended on orchestrator-core. The
|
||||
dispatcher needs `ExecRequest`/`ExecResponse`/`ExecError`/
|
||||
`InvocationType` from `executor-core` to build invocation
|
||||
descriptors. This is the transport DTO interpretation of the
|
||||
working-rules "don't reach across `*-core` crates" — DTOs are fine,
|
||||
behaviour is the bright line.
|
||||
|
||||
### Sync HTTP via outbox is the default for the user-routes path
|
||||
The orchestrator's user-route handler is fully on the NATS-style
|
||||
path now — every sync HTTP request writes to the outbox and awaits
|
||||
inbox delivery. Adds ~2-5ms per request per design notes §3 latency
|
||||
budget. `/api/v1/execute/{id}` (the admin/dev bypass) still calls
|
||||
the executor directly since it doesn't need the unified
|
||||
observability — kept for simplicity and admin tooling speed.
|
||||
|
||||
### Trigger-depth check is on the outbox row, not in the executor
|
||||
Dispatcher rejects depth-exceeded rows **before** trying to
|
||||
execute. The `cx.trigger_depth` field is informational on the
|
||||
executor side. Rejection writes a log + (reserved) metric and
|
||||
deletes the row — no DL, per design notes §4.
|
||||
|
||||
## 4. Tests added
|
||||
|
||||
### Unit tests (no DB required)
|
||||
- `manager-core::kv_service::tests` (10 tests) — round-trip, missing
|
||||
key returns None, `has` predicate, `delete` was-present,
|
||||
empty-collection rejection, **cross-app isolation**, anonymous-cx
|
||||
skips authz, authed-cx-with-no-role is Forbidden, owner-can-write,
|
||||
cursor pagination via in-memory KvRepo + denying authz repo.
|
||||
- `manager-core::trigger_config::tests` (2 tests) — conservative
|
||||
defaults, backoff round-trips.
|
||||
- `manager-core::trigger_repo::tests` (1 test) — `collection_matches`
|
||||
glob behaviour (`*`, `prefix:*`, exact).
|
||||
- `manager-core::dispatcher::tests` (5 tests) — exponential / linear /
|
||||
constant backoff math, jitter within bounds, ExecError →
|
||||
InboxFailureKind classification, failure-kind → status-code mapping.
|
||||
- `manager-core::abandoned_repo::tests` (2 tests) — truncate
|
||||
char-boundary safety.
|
||||
- `manager-core::triggers_api::tests` (5 tests) — unknown-app 404,
|
||||
member-without-role 403, default fallback for retry settings,
|
||||
empty-glob rejection, cross-app delete is treated as not-found.
|
||||
- `orchestrator-core::inbox::tests` (4 tests) — register/deliver
|
||||
round-trip, unknown-id is Abandoned, dropped receiver is
|
||||
Abandoned, explicit cancel removes sender.
|
||||
- `executor-core::engine::tests` (3 new) — `ctx.event` absent for
|
||||
direct invocations, KV insert shape matches design notes,
|
||||
KV delete has unit value.
|
||||
- `executor-core::sdk_kv` integration suite (7 tests) — runs a real
|
||||
Rhai engine under `spawn_blocking` against an in-memory
|
||||
`KvService` impl. Covers handle pattern, round-trip, unit-on-
|
||||
missing, has predicate, delete-was-present, empty-collection
|
||||
throws, cursor pagination, **cross-app isolation through the
|
||||
bridge**.
|
||||
|
||||
**Total: 47 new tests across the workspace.** Workspace test counts
|
||||
after v1.1.1: 63 manager-core / 56 orchestrator-core / 17
|
||||
executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib /
|
||||
21 picloud / 6 shared.
|
||||
|
||||
### Intentionally untested
|
||||
- DB-backed integration tests for the full dispatcher loop, KV→
|
||||
trigger→DL retry chain, sync HTTP via outbox round-trip,
|
||||
recursion-stop end-to-end. These need a real Postgres harness;
|
||||
the reviewer runs them via the manual smoke flow below.
|
||||
- Postgres-specific repo behaviour (sqlx query correctness). The
|
||||
repos compile and run against the schema, but no integration
|
||||
test crate spins up a DB in this branch — same pattern as v1.1.0
|
||||
(see existing `ignored, needs DATABASE_URL` test markers).
|
||||
|
||||
## 5. Open questions for the reviewer
|
||||
|
||||
1. **Outbox `claimed_at` clearing on success.** The dispatcher
|
||||
`delete`s the outbox row after success / DL. For failures it
|
||||
reschedules (which sets `claimed_at = NULL`). Both flows are
|
||||
correct, but if you imagine a crash between the executor return
|
||||
and the row update, the row stays claimed forever. Cluster mode
|
||||
should add a periodic "unstick stale claims" sweep. Not in
|
||||
v1.1.1 scope but worth surfacing.
|
||||
2. **Sync HTTP overhead.** Every sync HTTP request now goes through
|
||||
the outbox (write + dispatcher pickup + inbox delivery).
|
||||
Measured overhead expected ~2-5ms per design notes §3. No
|
||||
benchmarking yet — recommend the reviewer pick a representative
|
||||
script and compare 95p latency vs v1.1.0 if performance matters.
|
||||
3. **HTTP outbox rows don't run as a principal.** The orchestrator's
|
||||
public HTTP path has no authenticated user; the
|
||||
`origin_principal` field on the outbox row is forensic. The
|
||||
resulting `ExecRequest.principal = None`, so the script runs
|
||||
anonymously — matches direct execution. If you'd prefer
|
||||
triggered-from-HTTP scripts to inherit a derived principal
|
||||
(e.g. the route's app's owner), that's an additive change.
|
||||
4. **Dispatcher uses `ASYNC_EXEC_TIMEOUT = 300s` for async rows.**
|
||||
Async dispatches don't have a script-level timeout (no
|
||||
originating HTTP request to bound). Picked the same platform
|
||||
cap as `LocalExecutorClient`. If async needs a different cap,
|
||||
easy to thread through `TriggerConfig`.
|
||||
5. **Dispatcher tick cadence is 100ms.** Bounded enough that
|
||||
fan-out feels instant; loose enough that an idle process
|
||||
doesn't burn cycles. If the reviewer wants tighter latency,
|
||||
bump to 50ms or use `LISTEN/NOTIFY` for wake-up (v1.3+ work).
|
||||
6. **CHANGELOG.md is new.** Followed the rest of the repo's
|
||||
convention from git log (release commits + design-notes
|
||||
references). If a different format is preferred, easy to swap.
|
||||
|
||||
## 6. Deferred to later releases
|
||||
|
||||
- `dead_letters::list(filter)` Rhai SDK — design notes §4 defers
|
||||
to v1.2 to align with `docs::find()` query DSL.
|
||||
- KV TTL (`set(k, v, ttl_secs)`) — blueprint reserved it; v1.1.1
|
||||
SDK doesn't surface it. Forward-compat (no schema cost).
|
||||
- Auto-disable of triggers whose script was deleted — design notes
|
||||
§4 says current handling is metric+log; auto-disable is v1.2.
|
||||
- Per-app dead-letter retention — design notes §4 says env-only in
|
||||
v1.1.1.
|
||||
- Metrics counter emit for `picloud_trigger_depth_exceeded`,
|
||||
`picloud_dead_letter_handler_failures`,
|
||||
`picloud_abandoned_executions_total`. Code paths log the
|
||||
occurrences with `tracing::warn`/`error`; the actual
|
||||
counter-emit code is a `TODO(metrics)` comment in the
|
||||
dispatcher. Metrics surface is v1.1.7+ per the roadmap.
|
||||
- DB-backed integration tests for the dispatcher loop (see §4
|
||||
intentionally-untested).
|
||||
- Sync HTTP performance benchmarks comparing v1.1.0 direct path vs
|
||||
v1.1.1 outbox path.
|
||||
|
||||
## 7. How to verify locally
|
||||
|
||||
### Static checks (all green on this branch)
|
||||
```sh
|
||||
cargo fmt --all -- --check
|
||||
cargo clippy --all-targets --all-features -- -D warnings
|
||||
cargo test --workspace
|
||||
cd dashboard && npm run check && npm run build
|
||||
```
|
||||
|
||||
### Migration integrity
|
||||
```sh
|
||||
docker compose down -v && docker compose up -d postgres
|
||||
cargo run -p picloud # applies 0001..0012 from empty
|
||||
```
|
||||
Then start from `main` (v1.1.0 schema state) and switch to this
|
||||
branch; restart `picloud` to apply 0007..0012 on top.
|
||||
|
||||
### Manual end-to-end smoke (reviewer should run)
|
||||
```sh
|
||||
docker compose up -d
|
||||
# 1. Bootstrap an owner user via the existing flow + create app A.
|
||||
# 2. Create a script in A whose body is: throw "boom"
|
||||
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
|
||||
# {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
|
||||
# 4. From another script (or a public HTTP route):
|
||||
# kv::collection("widgets").set("k1", #{n:1})
|
||||
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
|
||||
# 6. Open the dashboard at /admin.
|
||||
# 7. Apps list shows a red "1" badge next to app A.
|
||||
# 8. Click into app A → "Dead letters" tab link → row visible.
|
||||
# 9. Click row → full payload + error history.
|
||||
# 10. Click "Replay" → row marks resolution='replayed', new outbox
|
||||
# row written, dispatcher re-runs the handler (fails again,
|
||||
# produces a NEW DL row).
|
||||
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.
|
||||
```
|
||||
|
||||
### Async route smoke
|
||||
```sh
|
||||
# Create a route via POST /api/v1/admin/scripts/{id}/routes with
|
||||
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
|
||||
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
|
||||
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
|
||||
# Then tail execution_logs — the script ran later (not synchronously).
|
||||
```
|
||||
|
||||
### Trigger-depth limit smoke
|
||||
```sh
|
||||
# Set a low depth limit + register a KV trigger whose script
|
||||
# writes to KV again — creates a loop.
|
||||
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
|
||||
# kv.set(...) from a script → triggers same script → depth hits 4
|
||||
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).
|
||||
```
|
||||
|
||||
## 8. Known limitations / rough edges
|
||||
|
||||
- **No DB-backed integration tests in this branch.** Unit tests
|
||||
cover trait behaviour with in-memory backings; sqlx query
|
||||
correctness is verified by the workspace compile + manual smoke.
|
||||
- **Dispatcher concurrency is in-process serial-per-tick.** Up to
|
||||
8 rows claimed per tick, processed one at a time. Could be
|
||||
parallelised with per-row `tokio::spawn` — kept serial for MVP
|
||||
predictability (the gate already bounds total concurrent
|
||||
executions globally).
|
||||
- **Metric emission is TODO** at the three spots noted in
|
||||
Open Questions §5. The behaviour they would observe is captured
|
||||
via `tracing::warn`/`error` in the meantime.
|
||||
- **`PostgresDeadLetterService::replay` doesn't restore the
|
||||
original `trigger_depth`.** Replays start at depth 0. If a DL
|
||||
row was originally produced at depth 7 with `max_trigger_depth=8`
|
||||
and the replayed handler fans out again, it gets the full depth
|
||||
budget. Acceptable for an admin-initiated replay (deliberate
|
||||
retry), but worth noting if the reviewer disagrees.
|
||||
- **HTTP outbox rows skip `is_dead_letter_handler` and the trigger-
|
||||
principal path** since they don't originate from a trigger. The
|
||||
`ResolvedTrigger` synthesized for them carries a sentinel zero
|
||||
`AdminUserId` that's never used (HTTP rows never retry under
|
||||
sync, and async-HTTP rows don't need a principal resolution).
|
||||
- **DataPlaneState's executor field is still generic** (`Arc<E>`
|
||||
where `E: ExecutorClient`). The dispatcher uses `Arc<dyn
|
||||
ExecutorClient>` directly. The picloud binary clones the same
|
||||
`Arc<LocalExecutorClient>` into both — works because the
|
||||
concrete type implements both the trait object and the generic
|
||||
bound.
|
||||
- **dispatcher always sets `principal: None` for HTTP rows.** As
|
||||
noted in Open Question §3, HTTP outbox rows don't resolve a
|
||||
principal. Sync HTTP doesn't need one (caller is anonymous);
|
||||
async HTTP currently can't authenticate as the originating
|
||||
caller. If that's not the intent, additive change.
|
||||
- **Cluster-mode crash recovery for claimed rows.** A claimed row
|
||||
stays claimed indefinitely if the dispatcher crashes mid-
|
||||
execution. v1.1.1 has one dispatcher per process so this is
|
||||
rare; cluster mode (v1.3+) needs a stale-claim sweeper.
|
||||
|
||||
---
|
||||
|
||||
Branch ready for review. Reviewer reads this report + audits the
|
||||
diff. Do not merge to main until the audit clears.
|
||||
Reference in New Issue
Block a user