xenia-rs/audit-runs/review-a-boot-state/shortest-path-roadmap.md

# Shortest-path-to-first-gameplay-draw roadmap

**Date**: 2026-05-21
**Read-only investigation; no LOC changes proposed.**
**Premise**: 25+ iterations have advanced the matched prefix from
102,168 to 105,128 events (+2,960), but `draws=0, swaps=1,
render_targets=0` have not moved.  This roadmap proposes a
non-canonicalization path forward.

## Definitions

- **First gameplay draw** = the first `VdSwap` call by ours's
  renderer (the thread spawned at entry `0x822F1EE0`, ours's tid
  analog of canary tid=13) that emits at least one `PM4_TYPE3
  DRAW_INDX` packet into the ringbuffer.
- **Observable success criterion**: `draws ≥ 1, swaps ≥ 2,
  unique_render_targets ≥ 1` in `xenia-rs check --stable-digest`
  output, with at least one frame coming from the **renderer thread**
  (not the boot-init swap that ours already emits).

## Why current iteration has stalled

The wedge has been mapped and remapped 20+ times.  Every audit
correctly identifies symptoms; every fix correctly canonicalizes a
diff-tool divergence.  But the wedge is **structurally cyclic**: the
worker cluster that signals the wait is downstream of the wait
completing.  The standard approach ("find the divergent kernel call,
mirror canary's semantics") has saturated.
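
The cyclic shape can be stated as a wait-for graph. The sketch below is a minimal illustration, not a model of the real guest state: the node names are invented labels for the three stages described above, and an edge reads "X cannot proceed until Y". It shows that the dependency loop exists as described and that cutting the activation edge (which is what Strategy A below amounts to) leaves the graph acyclic.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in depends_on.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current DFS path: close the loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Illustrative labels only (assumption): these are not guest symbols.
WEDGE = {
    "tid13_wait_completes": ["workers_signal"],
    "workers_signal": ["workers_activated"],
    "workers_activated": ["tid13_wait_completes"],
}
# Forcing activation removes the last edge.
CROWBAR = dict(WEDGE, workers_activated=[])

print(find_cycle(WEDGE))    # the three nodes, then the first one again
print(find_cycle(CROWBAR))  # None
```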

Two strategies remain that have NOT been tried at full scope:

1. **(A) Decouple the cycle by faking the worker activation**:
   directly call `sub_825070F0` from a host shim, or directly spawn
   the 4 worker threads with the right ctx, sidestepping the
   activation chain.  This is a *crowbar*: it doesn't fix the
   underlying bootstrap bug, but it tests "are the workers
   functionally correct IF activated."  If they signal the wedge and
   ours then reaches first draw, we know the bug is *exclusively* in
   the activation gate, and we can attack just that.

2. **(B) Find what triggers `sub_824FD240+0x24`'s POD-copy in canary**.
   AUDIT-068 Session 4 pinned the install epoch of vtable
   `0x8200A1E8` to this writer site.  But the *caller* of
   `sub_824FD240` — what guest call leads to it firing — is
   unidentified.  In ours, `sub_824FD240` fires 0× because the call
   chain `sub_824F8398 → sub_824F7CD0 → sub_824F7800 → sub_824FD240`
   is downstream of the tid=13 wedge.  So we have circular reasoning
   again — UNLESS Strategy A is applied first.

The roadmap below uses Strategy A as a wedge-crowbar and Strategy B
as the principled fix that follows.

## Roadmap

### Step 1 — Crowbar: force-spawn the `sub_825070F0` workers (~80–150 LOC)

**Action**: in `xenia-rs` add a debug-only cvar
`--force-spawn-workers` that, when set, after some bootstrap
checkpoint (e.g., first `VdInitializeRingBuffer` return), manually
spawns 4 ExCreateThread-equivalent guest threads with:

- entries `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
- ctx_ptr = run-determined; allocate a fresh
  `ANON_Class_713383D7`-shaped object on the unified heap and write
  vtable `0x8200A1E8` to slot 0 (mirror the POD-copy at
  `sub_824FD240+0x24`)
- stack_size 65536, suspended=True initially, then NtResumeThread

**Expected effect**:

- If the workers run correctly and signal the wedge: ours's tid=13
  unblocks, tid=1's join completes, normal game-loop begins.
  `draws ≥ 1, swaps ≥ 2`.
- If the workers fail (e.g., faulting because the ctx object's other
  fields aren't initialized): we learn what *else* needs to be
  installed alongside the vtable.

**Failure modes to expect**:

- The worker entries dispatch via vtable slots 35/36/37/38 of the
  ANON_Class, so those slots also need to be populated.  AUDIT-067
  static analysis shows the vtable has 7 entries; the worker entries
  use offsets 140/144/148/152 (= slots 35/36/37/38 of a wider vtable)
  per `sub_825070F0.md` lines 32-37.  So we'll need a parent class /
  derived class layout.
- The ctx object also has refcount/header fields that must be
  initialized — see AUDIT-068 Session 3 finding of 12-byte struct
  copy `{vptr, self, self}` followed by refcount=1.
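
The slot arithmetic and the header shape above can be checked with a short sketch. Assumptions, all flagged in the code: guest pointers are 4 bytes, the refcount is a 32-bit word placed immediately after the 12-byte copy, "self" means the object's own guest address, and the ctx address used here is a placeholder rather than a value from any trace. Words are packed big-endian because the guest CPU is big-endian PowerPC.

```python
import struct

VTABLE_ADDR = 0x8200A1E8   # vtable address quoted in this roadmap
PTR_SIZE = 4               # assumption: 32-bit guest pointers


def slot_index(byte_offset):
    """Convert a vtable byte offset into a slot index."""
    if byte_offset % PTR_SIZE:
        raise ValueError("offset is not pointer-aligned")
    return byte_offset // PTR_SIZE


def ctx_header(self_addr, vptr=VTABLE_ADDR, refcount=1):
    """Pack {vptr, self, self} followed by refcount as big-endian words.

    Assumption: refcount is a 32-bit word directly after the 12-byte copy.
    """
    return struct.pack(">4I", vptr, self_addr, self_addr, refcount)


offsets = (140, 144, 148, 152)
print([slot_index(o) for o in offsets])          # [35, 36, 37, 38]

# A 7-entry vtable spans only 28 bytes, so offset 140 cannot land inside it.
print(7 * PTR_SIZE, min(offsets) >= 7 * PTR_SIZE)  # 28 True

hypothetical_ctx = 0x40001000   # placeholder address (assumption)
print(ctx_header(hypothetical_ctx).hex())
```

The second print is the arithmetic behind the "wider vtable" remark: the dispatched slots sit well past the 7 entries seen statically.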

**LOC budget**: 80-150 LOC ours-side; 0 LOC canary.
**Read-only fallback**: if force-spawn fails immediately, we've still
captured the failure mode, which is informative.
**Risk**: high — this is structurally a hack.  Acceptable as a
diagnostic.

### Step 2 — Identify what triggers `sub_824FD240+0x24` in canary (~0 LOC)

**Action**: with Step 1's crowbar enabled, ours reaches the
post-wedge code path.  Compare ours and canary on what `import.call`
(kernel API) sequence the **caller** of `sub_824FD240` makes
immediately before the POD-copy install.

The caller chain (per AUDIT-064/068) is:

```
sub_824F8398 → sub_824F7CD0 → sub_824F7800 → [bl at +0x38 = sub_824FD240] / [bctrl at +0x320 = sub_825070F0]
```

So `sub_824F7800` calls `sub_824FD240` at offset `+0x38`, BEFORE it
calls `sub_825070F0` at offset `+0x320`.

Question: what does `sub_824F8398`'s caller (one level up,
`sub_821B55D8`) pass as arguments, and what kernel APIs run in
between?  We need to trace tid=6's events in canary in the wallclock
window [9.4 s, 9.6 s] — the install epoch.

**LOC budget**: 0.  Pure event-stream analysis on captured canary
jsonl (we already have `canary-jitter-1.jsonl`, 18.7M events).
**Output**: an ordered list of kernel calls just before
`sub_824FD240+0x24` fires.  If any are missing in ours, that's a
candidate gap.
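
A minimal sketch of that analysis, assuming the jsonl carries one JSON object per line. The field names `kind`, `tid`, `t_s`, and `name` are hypothetical and would need to be mapped to the real schema; the import names below are placeholders, not data from `canary-jitter-1.jsonl`. The first function extracts the ordered kernel calls for one thread in a time window; the second lists calls canary makes that have no counterpart in ours.

```python
import json
from difflib import SequenceMatcher


def kernel_calls_in_window(lines, tid, t_start, t_end):
    """Return import.call names made by one thread inside [t_start, t_end]."""
    names = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        ev = json.loads(raw)
        if ev.get("kind") != "import.call" or ev.get("tid") != tid:
            continue
        t = ev.get("t_s")
        if t is not None and t_start <= t <= t_end:
            names.append(ev["name"])
    return names


def missing_in_ours(canary_calls, ours_calls):
    """Return (canary_index, name) for canary calls with no match in ours."""
    sm = SequenceMatcher(a=canary_calls, b=ours_calls, autojunk=False)
    gaps = []
    for tag, i1, i2, _j1, _j2 in sm.get_opcodes():
        if tag in ("delete", "replace"):
            gaps.extend((i, canary_calls[i]) for i in range(i1, i2))
    return gaps


# Synthetic events with placeholder names (assumption: schema and values).
canary_lines = [json.dumps(e) for e in (
    {"kind": "import.call", "tid": 6, "t_s": 9.35, "name": "ImportA"},
    {"kind": "import.call", "tid": 6, "t_s": 9.41, "name": "ImportB"},
    {"kind": "import.call", "tid": 7, "t_s": 9.45, "name": "ImportX"},
    {"kind": "import.call", "tid": 6, "t_s": 9.50, "name": "ImportC"},
    {"kind": "import.call", "tid": 6, "t_s": 9.58, "name": "ImportD"},
)]
canary_calls = kernel_calls_in_window(canary_lines, tid=6, t_start=9.4, t_end=9.6)
ours_calls = ["ImportB", "ImportD"]

print(canary_calls)                               # ['ImportB', 'ImportC', 'ImportD']
print(missing_in_ours(canary_calls, ours_calls))  # [(1, 'ImportC')]
```

Because the extractor consumes lines one at a time, an open file handle can be passed in place of the list, which keeps memory flat on an 18.7M-event capture. The ours-side list would need its own tid and window, since neither is guaranteed to line up with canary's.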

### Step 3 — Mirror the trigger in ours (variable LOC)

Once Step 2 names the missing kernel call(s), implement them in ours
following Phase C cadence (verify per-call return values match canary;
add diff-tool tests; document in memory).

**LOC budget**: depends on what's missing.  Could be 10–500 LOC.
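
The per-call return-value check can be sketched as a first-divergence scan over paired call records. This assumes each record can be reduced to a `(name, return_value)` tuple; the names and values below are placeholders, not captured data.

```python
from itertools import zip_longest


def first_divergence(canary, ours):
    """Return (index, canary_item, ours_item) at the first mismatch, else None.

    A shorter stream shows up as None on the side that ran out.
    """
    for i, (c, o) in enumerate(zip_longest(canary, ours)):
        if c != o:
            return i, c, o
    return None


# Placeholder records (assumption): (import name, return value).
canary = [("ImportB", 0), ("ImportC", 0), ("ImportD", 1)]
ours_ok = [("ImportB", 0), ("ImportC", 0), ("ImportD", 1)]
ours_bad = [("ImportB", 0), ("ImportC", 0xC0000001), ("ImportD", 1)]

print(first_divergence(canary, ours_ok))   # None
print(first_divergence(canary, ours_bad))  # (1, ('ImportC', 0), ('ImportC', 3221225473))
```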

### Step 4 — Remove the crowbar; verify natural bootstrap (~0 LOC)

With Step 3's fix in place, remove `--force-spawn-workers`.  Re-run
ours.  If the natural bootstrap chain runs and `draws ≥ 1, swaps ≥ 2`,
we've fixed the bug.

If progression still fails without the crowbar, there's another gap;
re-enter at Step 2 with a refined trigger search.

### Step 5 — Validate gameplay frame parity (~0–50 LOC)

Capture renderer-thread VdSwap counts at 90 s wallclock in both
engines.  Target: ours's renderer emits within ±30% of canary's
12,092 VdSwap/90s.  If yes: first-draw is reached and sustained.

If ours's renderer emits but at a much lower rate, that's a follow-up
performance issue, not a correctness one.  Defer.
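
For reference, the ±30% band around 12,092 runs from about 8,464.4 to 15,719.6, so whole-number counts from 8,465 to 15,719 pass. A minimal sketch of the check; the function names are illustrative:

```python
CANARY_SWAPS_PER_90S = 12_092   # figure quoted in Step 5


def parity_band(reference, tolerance=0.30):
    """Return the (low, high) bounds of the acceptance band."""
    return reference * (1 - tolerance), reference * (1 + tolerance)


def within_parity(ours_swaps, reference=CANARY_SWAPS_PER_90S, tolerance=0.30):
    """True if ours's swap count falls inside the tolerance band."""
    low, high = parity_band(reference, tolerance)
    return low <= ours_swaps <= high


low, high = parity_band(CANARY_SWAPS_PER_90S)
print(round(low, 1), round(high, 1))   # 8464.4 15719.6
print(within_parity(9_000))            # True
print(within_parity(1))                # False: renderer alive but far below band
```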

## Expected progression per step

| Step | Expected `swaps` | Expected `draws` | Expected `unique_render_targets` | LOC delta |
|---|---:|---:|---:|---:|
| Pre-roadmap | 1 | 0 | 0 | — |
| Step 1 (crowbar) | 2-N | 1-N | 1+ | ~150 |
| Step 2 (trigger ID) | (unchanged) | (unchanged) | (unchanged) | 0 |
| Step 3 (mirror) | 2-N | 1-N | 1+ | 10-500 |
| Step 4 (decrowbar) | 2-N | 1-N | 1+ | -150 (remove) |
| Step 5 (parity) | 100+ | 100+ | 1-5 | 0-50 |

## What's NOT on this path (explicitly deferred)

1. **Host-audio bridge / XAudio resume**: the XAudio thread tids 14/15
   spawning suspended-and-never-resumed in ours is real but parallel
   to the worker-cluster wedge.  In canary, both threads run; in ours,
   neither runs.  Pursuing XAudio fixes does not address the
   graphics-blocking wedge.  Defer to a separate
   "post-first-draw" audit cluster.
2. **HID / controller**: Sylpheed's intro movie / title screen play
   without user input.  HID is irrelevant for first-draw.
3. **XAM content / save games**: irrelevant for first-draw; the
   intro/title screens don't require save-game enumeration.
4. **Scheduler determinism** (per `scheduler_determinism_plan` /
   Phase D Stages 0-4): null result, off-path.  The wedge is upstream
   of any contention.  Defer indefinitely or close.
5. **Diff-tool canonicalization** (Phase C-style fixes): saturated on
   moving matched-prefix without moving progression.  **Halt** further
   work in this class until Step 4 lands and re-baselines the diff
   workload.
6. **AUDIT-068 host-side install probes**: superseded by AUDIT-068
   Session 4 (writer identified at GUEST PC `sub_824FD240+0x24`).
   The remaining question is *what triggers* `sub_824FD240`, which
   Step 2 addresses.

## Alternative path (rejected)

**Skip the crowbar; do the trigger investigation cold.**  Read canary
source for `sub_824FD240` callers, walk upward, identify the trigger.
Why rejected: `sub_824FD240` is GAME code, not canary engine code —
the file we'd "read" is the disassembly of the XEX.  We'd need to
disassemble Sylpheed's RE'd PE and trace the call graph by hand.  Per
sylpheed.db, `sub_824FD240`'s static caller is `sub_824F7800+0x38`
(in line with AUDIT-064).  But what guest *call* causes `sub_824F7800`
to be invoked is itself a multi-fn upstream investigation that
returns to the same wedge cycle.  The crowbar bypasses this paradox.

## Risk assessment

- **Step 1 catastrophic failure**: ours's emulator panics or
  segfaults when the force-spawned workers run.  Mitigation: gate
  behind the debug-only `--force-spawn-workers` cvar; ensure ours's
  CPU executes the worker entries in normal sandboxed PPC JIT; if they
  fault on missing guest state, log and exit cleanly.
- **Step 1 "succeeds but draws=0 anyway"**: the workers run but
  ours's tid=13 still doesn't unblock — there's an unmodelled state
  beyond just the missing thread spawns.  Mitigation: log every event
  the new workers emit; compare with canary's tid=27/28/29 streams in
  `canary-jitter-1.jsonl`.
- **Step 3 LOC explosion**: the trigger turns out to be a large
  subsystem (XAM content, XCONFIG, etc.).  Mitigation: scope-cut to
  a stub that returns "canary-equivalent" values without full
  implementation.

## Confidence levels

- Step 1 unblocks the wedge if executed correctly: **MEDIUM** (60%).
  Honest assessment: 25 prior audits have not unblocked it through
  natural fixes, so the crowbar approach is novel and the failure
  mode may not match expectations.
- Step 2 identifies a trigger in ≤1 session: **HIGH** (85%) — the
  canary jsonl already has the data; analysis is mechanical.
- Step 3 LOC budget ≤500: **MEDIUM** (50%) — depends entirely on Step
  2's answer.
- Step 4 natural bootstrap works post-Step-3: **MEDIUM** (50%) —
  there may be additional gaps the crowbar masked.

## Memory hygiene

After Step 1 lands (crowbar binary in place), check that the release
binary `xenia-rs/target/release/xenia-rs` builds cleanly with the new cvar.
Verify Phase B `image_canonical_sha256` is updated (the crowbar
changes engine LOC); document the new baseline.  Confirm 3× cold
runs produce identical digests with the crowbar enabled.
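
The 3× cold-run check can be sketched as below, assuming the digest output of each run has been captured as raw bytes. How the runs are launched and captured is left out on purpose; only the comparison is shown, and the sample payloads are placeholders.

```python
import hashlib


def digests_agree(outputs, min_runs=3):
    """Return (ok, digests): ok is True when at least min_runs outputs
    were supplied and all of them hash to the same SHA-256."""
    digests = [hashlib.sha256(o).hexdigest() for o in outputs]
    ok = len(digests) >= min_runs and len(set(digests)) == 1
    return ok, digests


# Placeholder payloads (assumption): stand-ins for captured run output.
stable = [b'{"draws": 1}', b'{"draws": 1}', b'{"draws": 1}']
flaky = [b'{"draws": 1}', b'{"draws": 2}', b'{"draws": 1}']

print(digests_agree(stable)[0])      # True
print(digests_agree(flaky)[0])       # False
print(digests_agree(stable[:2])[0])  # False: fewer than three runs
```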

## What "winning" looks like

`xenia-rs check --stable-digest -n 50000000` (or higher cap, e.g.
`-n 500000000` to reach 30 s wallclock) outputs:

```text
{
  "instructions": 50000007,
  "imports": 40390+,
  "draws": >= 1,
  "swaps": >= 2,
  "unique_render_targets": >= 1,
  "shader_blobs_live": >= 1,
  "texture_cache_entries": >= 1
}
```

…and the value is reproducible across 3 cold runs.  A non-zero
`draws` value means at least one PM4_TYPE3 DRAW_INDX packet was
emitted by the renderer thread.
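
A minimal sketch of a pass/fail check against these thresholds, assuming the digest has already been parsed into a dict keyed by the counter names shown above. The pre-roadmap values come from the premise of this document; the passing digest is hypothetical.

```python
# Counters named in the Definitions section.
CORE_THRESHOLDS = {"draws": 1, "swaps": 2, "unique_render_targets": 1}

# Full set from the "winning" output above.
WIN_THRESHOLDS = dict(
    CORE_THRESHOLDS, shader_blobs_live=1, texture_cache_entries=1
)


def unmet_criteria(digest, thresholds):
    """Return {counter: (actual, required)} for every counter below its
    threshold.  A counter absent from the digest is treated as 0."""
    return {
        key: (digest.get(key, 0), need)
        for key, need in thresholds.items()
        if digest.get(key, 0) < need
    }


pre_roadmap = {"draws": 0, "swaps": 1, "unique_render_targets": 0}
print(unmet_criteria(pre_roadmap, CORE_THRESHOLDS))
# {'draws': (0, 1), 'swaps': (1, 2), 'unique_render_targets': (0, 1)}

# Hypothetical passing digest (assumption): values chosen for illustration.
passing = {
    "draws": 3, "swaps": 2, "unique_render_targets": 1,
    "shader_blobs_live": 2, "texture_cache_entries": 1,
}
print(unmet_criteria(passing, WIN_THRESHOLDS))   # {}
```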