xenia-rs/audit-runs/review-a-boot-state/plan.md

# Review A — boot-state review and shortest-path roadmap

**Session type**: PLAN-only.  No engine LOC changes; no canary
instrumentation changes.  Read-only investigation across the
existing audit chain artifacts.
**Date**: 2026-05-21
**Companion documents** (in this directory):
- `canary-boot-trajectory.md` — canary's call chain from entry_point
  to first gameplay draw, with wallclock timestamps.
- `ours-wedge-localization.md` — precise where-ours-stops, in graph
  terms.
- `shortest-path-roadmap.md` — 3-5 step roadmap with expected
  progression delta per step.
- `methodology-assessment.md` — alternative metric proposal.

This `plan.md` summarizes the five framing questions with answers
backed by file:line citations.

---

## Q1 — What is "first draw" in canary's Sylpheed boot?

**Two distinct "draws" must be disambiguated.**

### Q1.a: First boot-init `VdSwap` (the swap=1 event)

Canary's tid=6 (guest main) emits **one** `VdSwap` at ~9.5 s
wallclock, immediately after the GPU subsystem init sequence
`VdInitializeEngines → VdInitializeRingBuffer →
VdEnableRingBufferRPtrWriteBack → VdSetGraphicsInterruptCallback →
VdSetSystemCommandBufferGpuIdentifierAddress → VdGetSystemCommandBuffer`.
This swap publishes the boot framebuffer and contains no draw packets.

**Ours also reaches this swap** — visible in
`phase-w-wedge-reattack/ours-postfix.jsonl` at idx 105283 (host_ns
496,276,229).  This is what produces ours's `swaps=1` metric.

Both engines reach this point.  **It is NOT the gate.**

### Q1.b: First gameplay `VdSwap` (the swap≥2 / draws≥1 event)

Canary's renderer tid=13 (entry `0x822F1EE0`, spawned suspended at
1.671 s) wakes after the `sub_825070F0` worker fan-out at host_ns
≈ 10.383 s and begins emitting `VdGetSystemCommandBuffer` /
`VdSwap` pairs at ~150 fps.  Canary's tid=13 emits **12,092
VdSwap calls in the 90-s window** (per
`phase-nonmatch-investigation/canary-tid-profiles.md:21`).

The first of these is the **first gameplay draw**, fired at ~10.7 s
wallclock — about 1.2 s after the `sub_825070F0` fan-out triggers
the worker cluster.
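As a sanity check on the ~150 fps figure, the rate can be recomputed from the numbers quoted above. This is a minimal sketch, assuming the 90-s window starts at wallclock 0 and that all 12,092 swaps fall after the first gameplay draw at ~10.7 s; both are assumptions, not measurements from this review.

```python
# Figures quoted above; the window-start assumption is illustrative only.
SWAPS_TID13 = 12_092      # VdSwap calls on canary tid=13
WINDOW_S = 90.0           # capture window length, seconds
FIRST_DRAW_S = 10.7       # approximate wallclock of first gameplay VdSwap


def swap_rate(swaps: int, start_s: float, end_s: float) -> float:
    """Average swaps per second over the interval [start_s, end_s]."""
    return swaps / (end_s - start_s)


rate_active = swap_rate(SWAPS_TID13, FIRST_DRAW_S, WINDOW_S)
rate_whole = swap_rate(SWAPS_TID13, 0.0, WINDOW_S)
print(f"active-phase rate: {rate_active:.1f} swaps/s")
print(f"whole-window rate: {rate_whole:.1f} swaps/s")
```

Under these assumptions the active-phase average comes out near 152 swaps/s, consistent with the ~150 fps estimate.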

**Pre-conditions canary establishes before this point** (per
`canary-boot-trajectory.md`):

1. Vtable `0x8200A1E8` of `ANON_Class_713383D7` installed at host_ns
   ≈ 9.4-9.6 s via POD-copy at GUEST PC `sub_824FD240+0x24`
   (per `project_audit_068_session4_2026_05_20`).
2. Activation chain `sub_822F1AA8 → sub_82173990 → sub_821746B0 →
   sub_82172BA0 → sub_821B55D8 → sub_824F8398 → sub_824F7CD0 →
   sub_824F7800 → bctrl vtable[1] = sub_825070F0` fires on tid=6.
3. `sub_825070F0` spawns 4 worker threads with entries
   `0x82506528/58/88/B8` and shared ctx `0xBCE251C0`.
4. Workers (canary tids 27/28/29 plus one further worker) emit
   signals that unwedge the `sub_821CB030` Event waits across the
   cache-file IO completion chain.
5. Renderer tid=13's body (entered earlier but blocked on a
   tid=14/15 XAudio-coordinated event) unblocks; per-frame
   `VdGetSystemCommandBuffer` / `VdSwap` loop begins.

---

## Q2 — What is ours's actual progress, and what's the wedge root cause?

**Ours stops at the first wait in the activation chain.** Specifically:

- **tid=1 (main)** wedged at `sub_82173990+0x2D4` (PC `0x824ac578` =
  `do_wait_single`) on handle `0x12c8` = `Thread(id=13)` — waiting
  for the renderer's thread handle to signal (which happens only when
  tid=13 calls `ExTerminateThread`).
- **tid=13 (renderer / cache-IO worker)** wedged at
  `sub_821CB030+0x1B0` on handle `0x12d0` = `Event/Auto`, created by
  itself via `NtCreateEvent` at `sub_821CB030+0x128`.  `signals=0,
  wakes=0` — `<NO_SIGNALS_DESPITE_WAITS>`.
- **`sub_825070F0` fires 0×** at any horizon probed.

Citation: `phase-w-wedge-reattack/halt-on-deadlock-dump.txt` +
`phase-w-wedge-reattack/current-state.md`.

### Root cause (one structural level deeper than the wedge symptom)

**Per AUDIT-069 Session 5 (the most recent measurement):**

- Canary fires 414 `NtReleaseSemaphore` calls on the work-queue
  semaphore in the 90-s window.
- Ours fires 99 (24%).
- Breakdown: Worker (382 vs 90), Main (7 vs 8), **Other producers
  (25 vs 1)**.
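The breakdown above can be checked and ranked with a few lines of arithmetic. A minimal sketch using only the counts quoted from AUDIT-069 Session 5; the class labels `worker`, `main`, and `other` are shorthand chosen here.

```python
# NtReleaseSemaphore counts per producer class, as quoted above.
canary = {"worker": 382, "main": 7, "other": 25}
ours = {"worker": 90, "main": 8, "other": 1}


def ratio_to_reference(ref: dict, test: dict) -> dict:
    """Per-class ratio test/ref; 1.0 means parity with the reference."""
    return {k: test[k] / ref[k] for k in ref}


# The per-class counts should sum to the quoted totals.
assert sum(canary.values()) == 414
assert sum(ours.values()) == 99

ratios = ratio_to_reference(canary, ours)
overall = sum(ours.values()) / sum(canary.values())

# Print classes from largest relative shortfall to smallest.
for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {r:.0%} of canary")
print(f"overall: {overall:.0%} of canary")
```

The `other` class sits at 4% of canary's count, the largest relative shortfall of the three, while `main` is at or above parity.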

The "**other producers (25 vs 1)**" gap is the load-bearing
discrepancy.  Canary's threads outside the worker and main classes
issue **24 more releases** of the work semaphore during bootstrap
than ours's do.  The candidate sources are:

1. The 4 `sub_825070F0` workers (canary tids 27/28/29 + 1) — absent
   in ours.
2. XAudio render threads (canary tids 14/15, spawned suspended in
   both engines, **resumed only in canary**).
3. The secondary spawn burst at 1.94-2.15 s (canary tids 18-25) —
   8 helpers including file-IO and NtWaitForMultipleObjectsEx workers
   — absent in ours.

### The ONE structural issue

> **Ours never reaches `sub_825070F0` because the activation chain
> that calls it is downstream of tid=13's wedge; and tid=13's wedge
> is downstream of the worker cluster activation; and the worker
> cluster activation is `sub_825070F0`.  This is a self-referential
> lock.**

Canary breaks the lock because some part of the bootstrap
*pre-activates* the producers (probably via XAudio thread resume at
1.726 s, which then runs ahead, populates the work queue, signals
events, etc.).  Ours never resumes the XAudio threads — they're
spawned suspended and stay that way.
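The circular dependency described above can be sketched as a wait-for graph. The node names below are shorthand chosen for this sketch, not identifiers from either engine, and the edges are a simplification of the chain as stated in this section.

```python
# Each key is blocked until every item in its list has happened.
WAITS_ON = {
    "main (in sub_82173990)": ["tid13 terminates"],
    "tid13 terminates": ["event 0x12d0 signalled"],
    "event 0x12d0 signalled": ["worker cluster running"],
    "worker cluster running": ["sub_825070F0 called"],
    "sub_825070F0 called": ["main (in sub_82173990)"],
}


def find_cycle(graph: dict, start: str):
    """Return the first dependency cycle reachable from start, or None."""
    path = []
    on_path = set()

    def dfs(node):
        if node in on_path:
            # Close the loop: slice from the first visit of this node.
            return path[path.index(node):] + [node]
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            found = dfs(nxt)
            if found:
                return found
        path.pop()
        on_path.discard(node)
        return None

    return dfs(start)


cycle = find_cycle(WAITS_ON, "main (in sub_82173990)")
print(" -> ".join(cycle))

# Removing one edge (workers started by other means) breaks the cycle.
forced = dict(WAITS_ON)
forced["worker cluster running"] = []
print(find_cycle(forced, "main (in sub_82173990)"))
```

Dropping the `worker cluster running` dependency is the graph-level equivalent of the Step 1 crowbar in Q3: once the workers no longer depend on `sub_825070F0`, no cycle remains.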

**The single highest-leverage gap is the XAudio thread resume,**
because (a) it happens early (1.726 s in canary vs. ours's wedge,
which sets in around 1.4 s; i.e. the resume should happen before the
wedge), (b) it activates the dominant event producers, and (c) AUDIT-069
S5's "other producers 25 vs 1" finding implicates exactly this class
of thread.

---

## Q3 — Shortest-path-to-first-draw roadmap

Five steps (full detail in `shortest-path-roadmap.md`):

- **Step 1 (~80-150 LOC, ours-side)**: add `--force-spawn-workers`
  cvar that crowbars `sub_825070F0` activation by directly spawning
  the 4 worker threads with the right ctx after `VdInitializeRingBuffer`
  returns.  Tests "are the workers functionally correct if activated"
  and "does activating them unwedge sub_821CB030."
- **Step 2 (~0 LOC)**: with Step 1 active, mine the canary jsonl for
  the kernel-call sequence on tid=6 in the wallclock window [9.4 s,
  9.6 s] (the install epoch).  Identify what guest call triggers
  `sub_824FD240+0x24`'s POD-copy of the vtable in canary.
- **Step 3 (~10-500 LOC, depending on what Step 2 finds)**: mirror
  that trigger in ours — likely a missing kernel-import return value
  or a missing post-condition that the trigger inspects.
- **Step 4 (~0 LOC; remove crowbar)**: re-test ours without
  `--force-spawn-workers`.  Verify natural bootstrap reaches
  `sub_825070F0` activation.
- **Step 5 (~0-50 LOC)**: measure renderer-thread VdSwap rate over 90 s
  wallclock; target ±30% of canary's 12,092 calls.
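The Step 2 mining pass could look roughly like the sketch below. The record schema is an assumption: it presumes each jsonl line is an object with `tid`, `host_ns` (nanoseconds from capture start), and `name` fields, which this review does not confirm, and the sample records are synthetic.

```python
import json
from typing import Iterable, List


def calls_in_window(lines: Iterable[str], tid: int,
                    start_s: float, end_s: float) -> List[str]:
    """Ordered call names for one thread within [start_s, end_s] seconds."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        t = ev["host_ns"] / 1e9  # assumed field: nanoseconds since start
        if ev["tid"] == tid and start_s <= t <= end_s:
            out.append(ev["name"])
    return out


# Synthetic records for illustration only; not taken from any capture.
sample = [
    '{"tid": 6, "host_ns": 9390000000, "name": "NtCreateEvent"}',
    '{"tid": 6, "host_ns": 9450000000, "name": "VdSwap"}',
    '{"tid": 13, "host_ns": 9460000000, "name": "NtWaitForSingleObjectEx"}',
    '{"tid": 6, "host_ns": 9700000000, "name": "NtClose"}',
]
print(calls_in_window(sample, tid=6, start_s=9.4, end_s=9.6))
```

Because the function takes any iterable of lines, an open file handle on the canary jsonl can be passed in place of `sample` once the real field names are confirmed.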

Expected delta:

| After step | `swaps` | `draws` | `unique_render_targets` |
|---|---:|---:|---:|
| Pre | 1 | 0 | 0 |
| Step 1 (crowbar) | 2+ | 1+ | 1+ |
| Step 4 (decrowbar) | 2+ | 1+ | 1+ |
| Step 5 (parity) | 100+ | 100+ | 1-5 |
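For Step 5, the ±30% target translates into a concrete count range. A minimal sketch, assuming the band is taken symmetrically around canary's quoted 12,092 calls and rounded inward to whole calls; the helper names are hypothetical.

```python
import math

CANARY_SWAPS = 12_092   # canary tid=13 VdSwap calls in 90 s, quoted above


def parity_band(ref: int, tol: float = 0.30) -> tuple:
    """Inclusive integer range of counts within +/- tol of ref."""
    return math.ceil(ref * (1 - tol)), math.floor(ref * (1 + tol))


def within_parity(count: int, ref: int = CANARY_SWAPS,
                  tol: float = 0.30) -> bool:
    """True if count lies inside the parity band around ref."""
    lo, hi = parity_band(ref, tol)
    return lo <= count <= hi


print(parity_band(CANARY_SWAPS))
print(within_parity(100), within_parity(12_000))
```

Note that the table's `100+` entry is only a floor for Step 5; a run with 100 swaps would clear the table row but fall far short of the parity band.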

---

## Q4 — What's NOT on the shortest path

Explicitly deferred (full rationale in `shortest-path-roadmap.md`):

- **Audio (host-audio-* / XAudio implementation)** — even though
  XAudio thread resume MAY be the trigger from Q2, ours's existing
  XAudio shim is sufficient for the workers to bootstrap if they
  receive the right kernel-call sequence.  Full XAudio
  implementation is beyond first-draw scope.
- **HID** — Sylpheed's intro/title screens are auto-advance; no
  input needed.
- **XAM content / save games** — not on first-draw path.
- **Scheduler determinism work** (Phase D Stages 0-4 and beyond) —
  null result; the wedge is upstream of contention scheduling.
  Close or indefinitely defer.
- **Diff-tool canonicalization** (Phase C+N for N > 25) — saturated
  on matched-prefix without progression; halt this work class until
  Step 4 lands and the workload re-baselines.
- **AUDIT-068 host-side install probes** — superseded by AUDIT-068
  Session 4 finding (writer is GUEST PC, not host).  The followup
  question is what *triggers* the guest code path, which Step 2
  addresses through cheaper means.

---

## Q5 — Methodology assessment

**Current methodology relied on matched-prefix as a progression
proxy.  This assumption is now empirically falsified**: +2,960
events of matched-prefix advancement produced 0 units of progression
(`swaps=1, draws=0` across 25+ iterates).

### Proposed alternative metric

**Option 6 (composite `progression_score`)**:

```
progression_score = 1 * swaps + 10 * draws + 100 * unique_render_targets
                  + 0.001 * matched_prefix
```

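The score gives a continuous gradient and makes the priority
explicit: wedge-solving progress (swaps, draws, render targets)
dominates, while canonicalization (matched-prefix) contributes only
marginally.  Emitting it in `digest.json` requires ~10 LOC.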
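A minimal sketch of the composite score, using the weights from the formula above. The `Digest` container and the sample values (including the 100,000-event matched-prefix baseline) are hypothetical and chosen only to show the relative weighting.

```python
from dataclasses import dataclass


@dataclass
class Digest:
    """Hypothetical container for the four inputs to the score."""
    swaps: int
    draws: int
    unique_render_targets: int
    matched_prefix: int


def progression_score(d: Digest) -> float:
    """Weighted sum: 1*swaps + 10*draws + 100*targets + 0.001*prefix."""
    return (1 * d.swaps
            + 10 * d.draws
            + 100 * d.unique_render_targets
            + 0.001 * d.matched_prefix)


# Illustrative inputs only; the matched-prefix baseline is made up.
pre = Digest(swaps=1, draws=0, unique_render_targets=0, matched_prefix=100_000)
canonicalized = Digest(1, 0, 0, 100_000 + 2_960)
first_draw = Digest(2, 1, 1, 100_000)

for label, d in [("pre", pre), ("canonicalized", canonicalized),
                 ("first draw", first_draw)]:
    print(f"{label:>14}: {progression_score(d):.2f}")
```

With these inputs, +2,960 matched-prefix events move the score by about 3 points, while a first draw into a first render target moves it by more than 100.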

Discipline: tag every iterate as either
"**canonicalization only — no progression**" or
"**progression**".  Cap at 5 consecutive canonicalization-only
iterates before mandatory pivot to wedge-attack work.
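The cap can be enforced mechanically from the iterate log. A minimal sketch, assuming each iterate is recorded as one of two tag strings; the tag spellings and the function name are hypothetical.

```python
CANON_ONLY = "canonicalization-only"
PROGRESSION = "progression"


def must_pivot(tags: list, cap: int = 5) -> bool:
    """True once the most recent `cap` iterates are all canonicalization-only."""
    run = 0
    for tag in reversed(tags):
        if tag != CANON_ONLY:
            break
        run += 1
    return run >= cap


history = [PROGRESSION] + [CANON_ONLY] * 4
print(must_pivot(history))                  # four in a row: below the cap
print(must_pivot(history + [CANON_ONLY]))   # five in a row: pivot required
```

Counting from the most recent iterate backwards means a single progression iterate resets the run, which matches the "consecutive" wording of the rule.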

### New reading-error #39

> **#39 (matched-prefix as progression proxy)**: matched-prefix
> measures engine-to-engine divergence point, NOT game-to-game
> functional gap.  When the wedge is on a different thread than the
> matched-prefix anchor thread, advancing matched-prefix is
> orthogonal to unwedging.  Future audits MUST distinguish "ours's
> tid-X diverges from canary's tid-Y" from "ours's tid-X is *blocked
> because tid-Z is wedged*", and target the wedge directly when
> present.

---

## Counterintuitive findings (anti-anchoring)

Per Tripstones in the task brief:

### 1. Both engines reach `swaps=1`; ours is NOT behind on the boot swap.

The shared boot-init `VdSwap` fires in both.  Ours's `swaps=1` metric
is genuinely achieved, at the same point where canary also swaps.  The
divergence is NOT "ours can't do the first swap"; it's "ours can't do
the SECOND through Nth swap (the gameplay loop)".

### 2. Tripstone 4 verified: canary does reach gameplay draws, ours does not.

`canary-jitter-1.jsonl` shows 12,092 VdSwap calls on canary tid=13 in
90 s wallclock, which places it definitively in the gameplay rendering
loop, not pre-first-draw.  Ours's counterpart to canary tid=13 emits
~80 events total before wedging, definitively before gameplay starts.
The "both engines pre-first-draw" hypothesis is FALSE.

### 3. The matched-prefix metric is on the WRONG thread.

Matched-prefix tracks tid=6 (canary) vs tid=1 (ours), the main
threads.  But ours's wedge is on **tid=13**, the renderer thread
(tid=13 in both engines).  Tid=1's matched-prefix can advance 105,128
events without ever touching the wedge.

### 4. The "boot-state-machine" framing is misleading.

There's no monolithic boot state machine.  There are ~28 threads in
canary, each running its own lifecycle and communicating via shared
kernel objects.  The bottleneck isn't a state transition; it's a
THREAD ACTIVATION GAP.

### 5. AUDIT-069 Session 5's "other producers 25 vs 1" is the key forensic discovery, more than AUDIT-068's vtable install epoch.

The vtable install IS interesting but it's downstream of the producer
gap.  The dependency order is: producers populate the work queue; the
workers process it and signal the wedged wait; the activation chain
then continues and calls `sub_824FD240+0x24`, which writes the
vtable.  Fixing the vtable install in isolation (e.g., via a
host-side mem-write hack) doesn't help if no producer is feeding work
to the workers.

---

## Cascade prediction confidence

- A — canary boot trajectory characterized: **DONE, HIGH** (canary-jitter-1.jsonl provides direct evidence).
- B — ours's wedge root-cause localized deeper than "sub_821CB030 waits": **DONE, MEDIUM-HIGH** (AUDIT-069 S5 "other producers 25 vs 1" finding).
- C — shortest-path roadmap with ≤5 steps: **DONE, MEDIUM** (5 steps; Step 1 confidence ~60%).
- D — alternative metric proposed: **DONE, HIGH** (Option 6 composite, plus reading-error #39).

---

## Open questions / known unknowns

1. **What is the bootstrap trigger for canary's `sub_824FD240+0x24`?**
   Roadmap Step 2 addresses.  Could be answered in <1 session of
   canary jsonl analysis.
2. **Does Step 1's crowbar produce a clean wedge-unblock, or does it
   reveal additional unmodelled state in the ctx object?** Empirical;
   testable in one session.
3. **Are canary's XAudio threads (tids 14/15) the actual missing
   producer, or are they downstream of the same trigger?** Worth a
   targeted probe before Step 1; ~50 LOC ours-side to log
   NtResumeThread on the XAudio entry PCs.
4. **Will the AUDIT-067 "vtable install is host-side" finding
   resurface?** No — AUDIT-068 S4 falsified this; the writer is
   GUEST PC `sub_824FD240+0x24`.  The "host-side" framing was a
   mis-read of the POD-copy semantics (reading-error #36).

---

## Recommended next action

**Dispatch a "progression iterate" implementing Step 1 of the
roadmap** (`--force-spawn-workers` crowbar, ~80-150 LOC ours-side).
This is a high-variance, high-reward iterate; expected outcome is
either `swaps ≥ 2, draws ≥ 1` (success — wedge structurally
isolated to thread activation) or an informative failure mode (e.g.,
worker faults at first vtable bctrl indicating additional state
needed in ctx object).  Time-box: 1 session, max 2h.

If Step 1 succeeds in ANY way (even if draws stays 0), the next
iterate is Step 2 (kernel-call sequence mining in canary-jitter-1.jsonl).
This step has minimal risk and uses existing tooling.

If Step 1 fails completely (panic / segfault unrecoverable), revert
the crowbar and reframe: the wedge may be in ours's kernel-handler
implementations themselves, not just bootstrap activation.  At that
point a deeper Path β engine investigation is unavoidable.

---

## Memory hygiene note

This review is read-only.  xenia-rs HEAD unchanged.  canary HEAD
unchanged.  sylpheed.db unchanged.  No new artifacts beyond this
directory.

After dispatching Step 1, future memory entries should adopt the
new `progression_score` + tagging discipline outlined in
`methodology-assessment.md`.