Files
PiCloud/HANDBACK.md
MechaCat02 5a95ff2d07 docs(v1.1.1): handback report for reviewer
Summary of the 11-commit v1.1.1 branch:
- branch + commit count, scope coverage table, decisions made
  mid-implementation, deviations from the design notes
- tests added (47 new) + intentionally-untested gaps
- open questions for the reviewer
- deferred items
- verification commands + manual smoke flow
- known limitations / rough edges

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-01 22:27:18 +02:00

19 KiB
Raw Blame History

v1.1.1 Implementation HANDBACK

1. Branch + commit count

  • Branch: feat/v1.1.1-storage-and-events
  • Base: main
  • 11 commits ahead of main. Branch is not pushed, not merged.
66b661f chore(release): bump workspace to v1.1.1 + CHANGELOG
6b7ff78 feat(v1.1.1-gc): dead-letter + abandoned-executions retention sweepers
1795dfc feat(v1.1.1-dead-letters): dashboard badge + list view
20f1b5e feat(v1.1.1-dead-letters): service + Rhai SDK + admin endpoints
77b2cb5 feat(v1.1.1-routes): outbox-routed sync HTTP + dispatch_mode=async
6a2971a feat(v1.1.1-dispatcher): dispatcher loop + retry + depth limit + outbox emitter
2e92691 feat(v1.1.1-triggers): trigger CRUD admin endpoints
545d863 feat(v1.1.1-triggers): triggers + outbox schema + repos
6b99f74 feat(v1.1.1-kv): Rhai kv:: SDK module + ctx.event wiring
434fb63 feat(v1.1.1-kv): migrations + KvService trait + Postgres impl
1efb350 docs(v1.1.x): resolve in-flight decisions as Decided 2026-06-01

The first commit (1efb350) absorbed working-tree edits to docs/v1.1.x-design-notes.md that turned the "in-flight" 20 open calls into "Decided 2026-06-01" entries. Those were on the working tree at branch creation; folding them into the v1.1.1 branch keeps the design rationale colocated with the implementation.

2. Scope coverage (Done / Partial / Skipped)

Scope item Status Notes
1. KV store Done Migration 0007, KvService trait in shared, KvServiceImpl + PostgresKvRepo in manager-core, Rhai kv::collection(name).{get,set,has,delete,list} bridge, cursor pagination, empty-collection rejection, script-as-gate authz.
2. Triggers framework — Layout E Done Migrations 0008 (triggers + kv_trigger_details + dead_letter_trigger_details), TriggerRepo + PostgresTriggerRepo, CRUD admin endpoints. registered_by_principal column captured + threaded into the dispatcher. Depth-limit enforced in the dispatcher (default 8).
3. Universal outbox + dispatcher Done Migration 0009 (outbox), OutboxRepo + PostgresOutboxRepo, Dispatcher tokio task. Polls every 100ms, claims 8 rows/tick via FOR UPDATE SKIP LOCKED, gate-bounds dispatch, retries with backoff+jitter, dead-letters on exhaustion, late-completion → abandoned_executions.
4. NATS-style sync HTTP Done InboxRegistry in orchestrator-core (in-process Mutex<HashMap<Uuid, oneshot::Sender>>), InboxResolver trait in shared. Orchestrator sync-route path registers receiver, writes outbox row with reply_to, awaits with timeout = script.timeout + 2s buffer. Status mapping per design notes §3 (422/502/503/504/507/500).
5. dispatch_mode: async HTTP routes Done Migration 0012 adds the column (default sync). DispatchMode enum in shared. Route admin payload + RouteRepository serialize it. Compiled routes carry it; the matcher returns it in Matched. Orchestrator branches: async → outbox + 202; sync → outbox + inbox.
6. Dead letters Done Migration 0010 (dead_letters), DeadLetterRepo + DeadLetterService + PostgresDeadLetterService. Rhai dead_letters::{replay,resolve} + admin endpoints (GET /count, GET /, GET /{id}, POST /{id}/replay, POST /{id}/resolve). Capability::AppDeadLetterManage(AppId) enforced. List intentionally NOT shipped (deferred to v1.2). Recursion-stop rule (handler-failure annotates original DL as handler_failed) implemented in the dispatcher.
7. Abandoned executions Done Migration 0011, AbandonedRepo + PostgresAbandonedRepo, dispatcher writes a row on dropped-receiver inbox delivery. Metric path reserved (TODO(metrics) markers in dispatcher.rs).
8. Retry policy defaults Done TriggerConfig::from_env (new module). Env vars: PICLOUD_MAX_TRIGGER_DEPTH, PICLOUD_TRIGGER_RETRY_{MAX_ATTEMPTS,BACKOFF,BASE_MS,JITTER_PCT}, PICLOUD_DEAD_LETTER_RETENTION_DAYS, PICLOUD_ABANDONED_EXECUTIONS_RETENTION_DAYS. Per-trigger overrides applied at trigger-creation time.
9. ctx.event for triggered scripts Done TriggerEvent enum in shared (KV / DeadLetter variants), SdkCallCx.event: Option<TriggerEvent> + is_dead_letter_handler: bool. engine.rs::build_ctx_map flattens the event into ctx.event for triggered handlers; direct invocations leave the key absent. Shape matches design notes §4 (KV with op + collection + key + value; dead_letter with original + attempts + last_error + ids + timestamps).
10. Dashboard surface Done Per-app red badge with unresolved count on apps list + per-app detail page. New apps/[slug]/dead-letters/+page.svelte list view with all design-notes-mandated columns + Replay + Mark resolved actions + expandable row detail. svelte-check passes (369 files, 0 errors, 0 warnings).
11. Workspace version bump Done Workspace 1.1.01.1.1, SDK 1.11.2, dashboard 0.6.00.7.0. CHANGELOG.md created at repo root.

3. Key implementation decisions / deviations

Outbox column set (deferred to implementation per design notes §2)

Chose:

  • script_id denormalized — dispatcher resolves the target without re-joining for the common path.
  • trigger_id polymorphic (no DB FK) — references triggers.id for source_kind IN {kv, dead_letter}, routes.id for source_kind = 'http'. Discrimination in Rust at dispatch time.
  • claimed_by TEXT — pid-based for MVP; cluster mode can use any identifier without schema change.
  • trigger_depth + root_execution_id denormalized so the dispatcher rebuilds ExecRequest without joining back to the originating execution log.
  • No explicit is_dead_letter_handler column — dispatcher infers from the trigger's kind field at dispatch time.

KV pagination

  • Cursor-style, opaque base64-encoded last-key.
  • Page-size cap of 1000 with default 100 (enforced in repo).
  • Documented in crates/shared/src/kv.rs and the SDK function comment.

KV TTL

  • Blueprint §8.1 reserved an expires_at column. v1.1.1 design notes don't surface TTL through the SDK (set(k,v) has no TTL argument) so the column is omitted from migration 0007. Adding it later is a non-breaking forward migration. Recorded in CHANGELOG as a deferred item.

Authz scope mapping (seven-scope commitment)

The four new capabilities map onto existing scopes — no new scope variants to honour the Scope enum's "exactly seven values" contract (crates/shared/src/auth.rs:103):

Capability Scope
AppKvRead script:read
AppKvWrite script:write
AppManageTriggers app:admin
AppDeadLetterManage app:admin

role_satisfies grants AppKvRead at the Viewer role, AppKvWrite at Editor, and both trigger / DL caps at AppAdmin.

Script-as-gate authz for SDK calls

  • KvServiceImpl runs authz::require only when cx.principal.is_some(). Anonymous public-HTTP scripts (the common case for public routes) bypass the cap check.
  • Cross-app isolation is independent of this — enforced by cx.app_id being the only source of app_id on every query.
  • PostgresDeadLetterService::{replay,resolve} keeps a hard require (no if let Some) — managing dead letters is an admin act per design notes §4. Public scripts with principal: None fail the check, which is correct.

Trait split: OutboxRepo vs OutboxWriter

orchestrator-core can't depend on manager-core (would invert the dependency arrow). Defined a small OutboxWriter trait in picloud-shared with a single enqueue_http method. PostgresOutboxRepo implements both OutboxRepo (dispatcher surface) and OutboxWriter (orchestrator surface); the picloud binary clones one concrete Arc into both trait views — mirrors the existing members_concrete / AuthzRepo pattern.

InboxResolver lives in shared, InboxRegistry in orchestrator-core

Same split rationale — the dispatcher (manager-core) only depends on the trait, while the in-process impl lives next to its consumer. Cluster mode (v1.3+) swaps the impl for LISTEN/NOTIFY behind the unchanged trait.

manager-core now depends on executor-core

Previously manager-core only depended on orchestrator-core. The dispatcher needs ExecRequest/ExecResponse/ExecError/ InvocationType from executor-core to build invocation descriptors. This is the transport DTO interpretation of the working-rules "don't reach across *-core crates" — DTOs are fine, behaviour is the bright line.

Sync HTTP via outbox is the default for the user-routes path

The orchestrator's user-route handler is fully on the NATS-style path now — every sync HTTP request writes to the outbox and awaits inbox delivery. Adds ~2-5ms per request per design notes §3 latency budget. /api/v1/execute/{id} (the admin/dev bypass) still calls the executor directly since it doesn't need the unified observability — kept for simplicity and admin tooling speed.

Trigger-depth check is on the outbox row, not in the executor

Dispatcher rejects depth-exceeded rows before trying to execute. The cx.trigger_depth field is informational on the executor side. Rejection writes a log + (reserved) metric and deletes the row — no DL, per design notes §4.

4. Tests added

Unit tests (no DB required)

  • manager-core::kv_service::tests (10 tests) — round-trip, missing key returns None, has predicate, delete was-present, empty-collection rejection, cross-app isolation, anonymous-cx skips authz, authed-cx-with-no-role is Forbidden, owner-can-write, cursor pagination via in-memory KvRepo + denying authz repo.
  • manager-core::trigger_config::tests (2 tests) — conservative defaults, backoff round-trips.
  • manager-core::trigger_repo::tests (1 test) — collection_matches glob behaviour (*, prefix:*, exact).
  • manager-core::dispatcher::tests (5 tests) — exponential / linear / constant backoff math, jitter within bounds, ExecError → InboxFailureKind classification, failure-kind → status-code mapping.
  • manager-core::abandoned_repo::tests (2 tests) — truncate char-boundary safety.
  • manager-core::triggers_api::tests (5 tests) — unknown-app 404, member-without-role 403, default fallback for retry settings, empty-glob rejection, cross-app delete is treated as not-found.
  • orchestrator-core::inbox::tests (4 tests) — register/deliver round-trip, unknown-id is Abandoned, dropped receiver is Abandoned, explicit cancel removes sender.
  • executor-core::engine::tests (3 new) — ctx.event absent for direct invocations, KV insert shape matches design notes, KV delete has unit value.
  • executor-core::sdk_kv integration suite (7 tests) — runs a real Rhai engine under spawn_blocking against an in-memory KvService impl. Covers handle pattern, round-trip, unit-on- missing, has predicate, delete-was-present, empty-collection throws, cursor pagination, cross-app isolation through the bridge.

Total: 47 new tests across the workspace. Workspace test counts after v1.1.1: 63 manager-core / 56 orchestrator-core / 17 executor-core engine / 7 sdk_kv / 30 sdk_contract / 43 stdlib / 21 picloud / 6 shared.

Intentionally untested

  • DB-backed integration tests for the full dispatcher loop, KV→ trigger→DL retry chain, sync HTTP via outbox round-trip, recursion-stop end-to-end. These need a real Postgres harness; the reviewer runs them via the manual smoke flow below.
  • Postgres-specific repo behaviour (sqlx query correctness). The repos compile and run against the schema, but no integration test crate spins up a DB in this branch — same pattern as v1.1.0 (see existing ignored, needs DATABASE_URL test markers).

5. Open questions for the reviewer

  1. Outbox claimed_at clearing on success. The dispatcher deletes the outbox row after success / DL. For failures it reschedules (which sets claimed_at = NULL). Both flows are correct, but if you imagine a crash between the executor return and the row update, the row stays claimed forever. Cluster mode should add a periodic "unstick stale claims" sweep. Not in v1.1.1 scope but worth surfacing.
  2. Sync HTTP overhead. Every sync HTTP request now goes through the outbox (write + dispatcher pickup + inbox delivery). Measured overhead expected ~2-5ms per design notes §3. No benchmarking yet — recommend the reviewer pick a representative script and compare 95p latency vs v1.1.0 if performance matters.
  3. HTTP outbox rows don't run as a principal. The orchestrator's public HTTP path has no authenticated user; the origin_principal field on the outbox row is forensic. The resulting ExecRequest.principal = None, so the script runs anonymously — matches direct execution. If you'd prefer triggered-from-HTTP scripts to inherit a derived principal (e.g. the route's app's owner), that's an additive change.
  4. Dispatcher uses ASYNC_EXEC_TIMEOUT = 300s for async rows. Async dispatches don't have a script-level timeout (no originating HTTP request to bound). Picked the same platform cap as LocalExecutorClient. If async needs a different cap, easy to thread through TriggerConfig.
  5. Dispatcher tick cadence is 100ms. Bounded enough that fan-out feels instant; loose enough that an idle process doesn't burn cycles. If the reviewer wants tighter latency, bump to 50ms or use LISTEN/NOTIFY for wake-up (v1.3+ work).
  6. CHANGELOG.md is new. Followed the rest of the repo's convention from git log (release commits + design-notes references). If a different format is preferred, easy to swap.

6. Deferred to later releases

  • dead_letters::list(filter) Rhai SDK — design notes §4 defers to v1.2 to align with docs::find() query DSL.
  • KV TTL (set(k, v, ttl_secs)) — blueprint reserved it; v1.1.1 SDK doesn't surface it. Forward-compat (no schema cost).
  • Auto-disable of triggers whose script was deleted — design notes §4 says current handling is metric+log; auto-disable is v1.2.
  • Per-app dead-letter retention — design notes §4 says env-only in v1.1.1.
  • Metrics counter emit for picloud_trigger_depth_exceeded, picloud_dead_letter_handler_failures, picloud_abandoned_executions_total. Code paths log the occurrences with tracing::warn/error; the actual counter-emit code is a TODO(metrics) comment in the dispatcher. Metrics surface is v1.1.7+ per the roadmap.
  • DB-backed integration tests for the dispatcher loop (see §4 intentionally-untested).
  • Sync HTTP performance benchmarks comparing v1.1.0 direct path vs v1.1.1 outbox path.

7. How to verify locally

Static checks (all green on this branch)

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --workspace
cd dashboard && npm run check && npm run build

Migration integrity

docker compose down -v && docker compose up -d postgres
cargo run -p picloud      # applies 0001..0012 from empty

Then start from main (v1.1.0 schema state) and switch to this branch; restart picloud to apply 0007..0012 on top.

Manual end-to-end smoke (reviewer should run)

docker compose up -d
# 1. Bootstrap an owner user via the existing flow + create app A.
# 2. Create a script in A whose body is:  throw "boom"
# 3. POST /api/v1/admin/apps/{A}/triggers/kv with
#    {"script_id": "<broken>", "collection_glob": "*", "ops": ["insert"]}
# 4. From another script (or a public HTTP route):
#    kv::collection("widgets").set("k1", #{n:1})
# 5. Wait ~7 seconds (3 attempts × ~1/2/4s backoff with ±20% jitter).
# 6. Open the dashboard at /admin.
# 7. Apps list shows a red "1" badge next to app A.
# 8. Click into app A → "Dead letters" tab link → row visible.
# 9. Click row → full payload + error history.
# 10. Click "Replay" → row marks resolution='replayed', new outbox
#     row written, dispatcher re-runs the handler (fails again,
#     produces a NEW DL row).
# 11. Click "Mark resolved" on the original DL → resolution='ignored'.

Async route smoke

# Create a route via POST /api/v1/admin/scripts/{id}/routes with
# {"host_kind":"any","path_kind":"exact","path":"/work","dispatch_mode":"async"}
curl -X POST -d '{"work":"thing"}' http://localhost:8080/work
# Expect: HTTP 202 + {"accepted_at":"...","execution_id":"..."}
# Then tail execution_logs — the script ran later (not synchronously).

Trigger-depth limit smoke

# Set a low depth limit + register a KV trigger whose script
# writes to KV again — creates a loop.
PICLOUD_MAX_TRIGGER_DEPTH=3 cargo run -p picloud
# kv.set(...) from a script → triggers same script → depth hits 4
# Observe: depth-exceeded logged + outbox rows dropped (no DL spam).

8. Known limitations / rough edges

  • No DB-backed integration tests in this branch. Unit tests cover trait behaviour with in-memory backings; sqlx query correctness is verified by the workspace compile + manual smoke.
  • Dispatcher concurrency is in-process serial-per-tick. Up to 8 rows claimed per tick, processed one at a time. Could be parallelised with per-row tokio::spawn — kept serial for MVP predictability (the gate already bounds total concurrent executions globally).
  • Metric emission is TODO at the three spots noted in Open Questions §5. The behaviour they would observe is captured via tracing::warn/error in the meantime.
  • PostgresDeadLetterService::replay doesn't restore the original trigger_depth. Replays start at depth 0. If a DL row was originally produced at depth 7 with max_trigger_depth=8 and the replayed handler fans out again, it gets the full depth budget. Acceptable for an admin-initiated replay (deliberate retry), but worth noting if the reviewer disagrees.
  • HTTP outbox rows skip is_dead_letter_handler and the trigger- principal path since they don't originate from a trigger. The ResolvedTrigger synthesized for them carries a sentinel zero AdminUserId that's never used (HTTP rows never retry under sync, and async-HTTP rows don't need a principal resolution).
  • DataPlaneState's executor field is still generic (Arc<E> where E: ExecutorClient). The dispatcher uses Arc<dyn ExecutorClient> directly. The picloud binary clones the same Arc<LocalExecutorClient> into both — works because the concrete type implements both the trait object and the generic bound.
  • dispatcher always sets principal: None for HTTP rows. As noted in Open Question §3, HTTP outbox rows don't resolve a principal. Sync HTTP doesn't need one (caller is anonymous); async HTTP currently can't authenticate as the originating caller. If that's not the intent, additive change.
  • Cluster-mode crash recovery for claimed rows. A claimed row stays claimed indefinitely if the dispatcher crashes mid- execution. v1.1.1 has one dispatcher per process so this is rare; cluster mode (v1.3+) needs a stale-claim sweeper.

Branch ready for review. Reviewer reads this report + audits the diff. Do not merge to main until the audit clears.