xenia-rs/audit-runs/review-a-boot-state/shortest-path-roadmap.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Shortest-path-to-first-gameplay-draw roadmap
**Date**: 2026-05-21

**Read-only investigation; no LOC changes proposed.**

**Premise**: 25+ iterates have advanced matched-prefix 102,168 →
105,128 (+2,960 events) but `draws=0, swaps=1, render_targets=0`
have not moved. This roadmap proposes a non-canonicalization path
forward.
## Definitions
- **First gameplay draw** = the first `VdSwap` call by ours's
renderer thread (spawned at entry `0x822F1EE0`; ours's analog of
canary tid=13) whose frame has put at least one `PM4_TYPE3
DRAW_INDX` packet into the ringbuffer.
- **Observable success criterion**: `draws ≥ 1, swaps ≥ 2,
unique_render_targets ≥ 1` in `xenia-rs check --stable-digest`
output. At least one frame must come from the **renderer thread**
(not the boot-init swap that ours already emits).
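
The counter part of the success criterion can be checked mechanically. The following is a minimal sketch, assuming the digest is available as a parsed dictionary that uses the key names quoted in this document; the function name `unmet_criteria` is hypothetical and not part of `xenia-rs`. It does not verify which thread emitted the frame, which still has to be confirmed from the event stream.

```python
# Thresholds taken from the success criterion above; key names follow the
# digest fields quoted elsewhere in this document.
THRESHOLDS = {"draws": 1, "swaps": 2, "unique_render_targets": 1}


def unmet_criteria(digest: dict) -> list:
    """Return the criteria the digest fails; an empty list means success."""
    failures = []
    for key, minimum in THRESHOLDS.items():
        value = digest.get(key, 0)  # a missing counter is treated as 0
        if value < minimum:
            failures.append(f"{key}={value} (need >= {minimum})")
    return failures


if __name__ == "__main__":
    # Pre-roadmap state from the premise: no draws, one swap, no render targets.
    current = {"draws": 0, "swaps": 1, "unique_render_targets": 0}
    for line in unmet_criteria(current):
        print("FAIL", line)
```

Returning the list of failed criteria rather than a single boolean keeps the report useful when only some counters move, which is the likely intermediate state during Step 1.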
## Why current iteration has stalled
The wedge has been mapped and remapped 20+ times. Every audit
correctly identifies symptoms; every fix correctly canonicalizes a
diff-tool divergence. But the wedge is **structurally cyclic**: the
worker cluster that signals the wait is downstream of the wait
completing. Standard "find the divergent kernel call, mirror canary's
semantics" has saturated.
Two strategies remain that have NOT been tried at full scope:
1. **(A) Decouple the cycle by faking the worker activation**:
directly call `sub_825070F0` from a host shim, or directly spawn
the 4 worker threads with the right ctx, sidestepping the
activation chain. This is a *crowbar*: it doesn't fix the
underlying bootstrap bug, but it tests "are the workers
functionally correct IF activated." If they signal the wedge and
ours then reaches first draw, we know the bug is *exclusively* in
the activation gate, and we can attack just that.
2. **(B) Find what triggers `sub_824FD240+0x24`'s POD-copy in canary**.
AUDIT-068 Session 4 pinned the install epoch of vtable
`0x8200A1E8` to this writer site. But the *caller* of
`sub_824FD240` — what guest call leads to it firing — is
unidentified. In ours, `sub_824FD240` fires 0× because the call
chain `sub_824F8398 → sub_824F7CD0 → sub_824F7800 → sub_824FD240`
is downstream of the tid=13 wedge. So we have circular reasoning
again — UNLESS Strategy A is applied first.
The roadmap below uses Strategy A as a wedge-crowbar and Strategy B
as the principled fix that follows.
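
The cyclic structure, and why Strategy A breaks it, can be illustrated with a small dependency graph. This is a sketch only: the node names are hypothetical labels for the stages described above, and the edges are a simplified reading of the call chain, not output from any tool.

```python
# Hypothetical dependency edges: "X": [things X needs first]. The node names
# are illustrative labels, not identifiers from the codebase.
NATURAL = {
    "tid13_wait_completes": ["workers_signal"],
    "workers_signal": ["workers_spawned"],
    "workers_spawned": ["sub_825070F0_called"],
    "sub_825070F0_called": ["sub_824F7800_runs"],
    "sub_824F7800_runs": ["tid13_wait_completes"],  # downstream of the wedge
}


def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Close the loop: slice from the first occurrence of this node.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for needed in deps.get(node, []):
            cycle = visit(needed)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in deps:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Strategy A: the workers are spawned by a host shim, so they no longer
# depend on the guest activation chain.
CROWBAR = dict(NATURAL, workers_spawned=[])

if __name__ == "__main__":
    print("natural bootstrap cycle:", find_cycle(NATURAL))
    print("with crowbar:", find_cycle(CROWBAR))
```

With the `workers_spawned` dependency removed, the graph is acyclic, which is the property the crowbar is meant to test in practice.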
## Roadmap
### Step 1 — Crowbar: force-spawn the `sub_825070F0` workers (~80-150 LOC)
**Action**: in `xenia-rs` add a debug-only cvar
`--force-spawn-workers` that, when set, after some bootstrap
checkpoint (e.g., first `VdInitializeRingBuffer` return), manually
spawns 4 ExCreateThread-equivalent guest threads with:
- entries `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
- ctx_ptr = run-determined; allocate a fresh
`ANON_Class_713383D7`-shaped object on the unified heap and write
vtable `0x8200A1E8` to slot 0 (mirror the POD-copy at
`sub_824FD240+0x24`)
- stack_size 65536, created suspended, then resumed via NtResumeThread
**Expected effect**:
- If the workers run correctly and signal the wedge: ours's tid=13
unblocks, tid=1's join completes, normal game-loop begins.
`draws ≥ 1, swaps ≥ 2`.
- If the workers fail (e.g., faulting because the ctx object's other
fields aren't initialized): we learn what *else* needs to be
installed alongside the vtable.
**Failure modes to expect**:
- The worker entries dispatch via vtable slots 35/36/37/38 of the
ANON_Class, so those slots also need to be populated. AUDIT-067
static analysis shows the vtable has 7 entries; the worker entries
use offsets 140/144/148/152 (= slots 35/36/37/38 of a wider vtable)
per `sub_825070F0.md` lines 32-37. So we'll need a parent class /
derived class layout.
- The ctx object also has refcount/header fields that must be
initialized — see AUDIT-068 Session 3 finding of 12-byte struct
copy `{vptr, self, self}` followed by refcount=1.
**LOC budget**: 80-150 LOC ours-side; 0 LOC canary.

**Read-only fallback**: if force-spawn fails immediately, we've still
captured the failure mode, which is informative.

**Risk**: high; this is structurally a hack. Acceptable as a
diagnostic.
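
The slot arithmetic and the ctx header from the failure modes above can be worked through as follows. This is a sketch under stated assumptions: one vtable slot is a 4-byte guest pointer, guest memory is big-endian, the refcount is a 32-bit word placed directly after the 12-byte copy, and `self` is read as the object's own address. The address `0x40001000` is a placeholder, not a value from any trace, and both helper names are hypothetical.

```python
import struct

POINTER_SIZE = 4  # assumption: 32-bit guest pointers, so one vtable slot = 4 bytes
VTABLE_ENTRIES = 7  # entry count reported by the static analysis cited above
WORKER_DISPATCH_OFFSETS = (140, 144, 148, 152)


def slot_for_offset(offset: int) -> int:
    """Convert a byte offset into a vtable slot index."""
    if offset % POINTER_SIZE:
        raise ValueError(f"offset {offset} is not slot-aligned")
    return offset // POINTER_SIZE


def ctx_header(vptr: int, self_addr: int, refcount: int = 1) -> bytes:
    """Pack {vptr, self, self} followed by a refcount as big-endian u32 words.

    Assumption: the refcount is a 32-bit word directly after the 12-byte copy.
    """
    return struct.pack(">IIII", vptr, self_addr, self_addr, refcount)


if __name__ == "__main__":
    for offset in WORKER_DISPATCH_OFFSETS:
        slot = slot_for_offset(offset)
        status = "in range" if slot < VTABLE_ENTRIES else "beyond the 7-entry vtable"
        print(f"offset {offset} -> slot {slot} ({status})")
    # 0x40001000 is a placeholder address, not a value from any trace.
    print(ctx_header(0x8200A1E8, 0x40001000).hex())
```

All four dispatch offsets land past slot 6, which restates the first failure mode: writing only the 7-entry vtable at `0x8200A1E8` cannot satisfy the worker entries by itself.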
### Step 2 — Identify what triggers `sub_824FD240+0x24` in canary (~0 LOC)
**Action**: with Step 1's crowbar enabled, ours reaches the
post-wedge code path. Compare ours and canary on what `import.call`
(kernel API) sequence the **caller** of `sub_824FD240` makes
immediately before the POD-copy install.
The caller chain (per AUDIT-064/068) is:
```
sub_824F8398 → sub_824F7CD0 → sub_824F7800 → [bl at +0x38 = sub_824FD240] / [bctrl at +0x320 = sub_825070F0]
```
So `sub_824F7800` calls `sub_824FD240` at offset `+0x38`, BEFORE it
calls `sub_825070F0` at offset `+0x320`.
Question: what does `sub_824F8398`'s caller (one level up,
`sub_821B55D8`) pass as arguments, and what kernel APIs run in
between? We need to trace tid=6's events in canary in the wallclock
window [9.4 s, 9.6 s] — the install epoch.
**LOC budget**: 0. Pure event-stream analysis on captured canary
jsonl (we already have `canary-jitter-1.jsonl`, 18.7M events).
**Output**: an ordered list of kernel calls just before
`sub_824FD240+0x24` fires. If any are missing in ours, that's a
candidate gap.
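
The event-stream analysis could be sketched as below. The trace schema is an assumption: the field names `tid`, `t` (seconds of wallclock), `kind` and `name` are hypothetical and need to be replaced with whatever the captured jsonl actually uses. Only the `import.call` event kind, the tid, and the time window come from the text above.

```python
import json

# All field names below are assumptions about the trace schema.
TID_FIELD, TIME_FIELD, KIND_FIELD, NAME_FIELD = "tid", "t", "kind", "name"


def kernel_calls_in_window(lines, tid, start_s, end_s):
    """Return ordered import.call names for one thread inside [start_s, end_s]."""
    calls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        event = json.loads(line)
        if event.get(TID_FIELD) != tid or event.get(KIND_FIELD) != "import.call":
            continue
        if start_s <= event.get(TIME_FIELD, -1.0) <= end_s:
            calls.append(event[NAME_FIELD])
    return calls


def missing_from(reference_calls, candidate_calls):
    """Names seen in the reference stream but never in the candidate, in first-seen order."""
    seen = set(candidate_calls)
    missing = []
    for name in reference_calls:
        if name not in seen and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    with open("canary-jitter-1.jsonl", encoding="utf-8") as handle:
        canary_calls = kernel_calls_in_window(handle, tid=6, start_s=9.4, end_s=9.6)
    print(canary_calls)
```

Because the function iterates over the file handle line by line, it does not need to hold the 18.7M-event dump in memory. The ours-side stream would need its own tid and time window, since neither is expected to line up with canary's.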
### Step 3 — Mirror the trigger in ours (variable LOC)
Once Step 2 names the missing kernel call(s), implement them in ours
following Phase C cadence (verify per-call return values match canary;
add diff-tool tests; document in memory).
**LOC budget**: depends on what's missing. Could be 10-500 LOC.
### Step 4 — Remove the crowbar; verify natural bootstrap (~0 LOC)
With Step 3's fix in place, remove `--force-spawn-workers`. Re-run
ours. If the natural bootstrap chain runs and `draws ≥ 1, swaps ≥ 2`,
we've fixed the bug.
If progression still fails without the crowbar, there's another gap;
re-enter at Step 2 with a refined trigger search.
### Step 5 — Validate gameplay frame parity (~0-50 LOC)
Capture renderer-thread VdSwap counts at 90 s wallclock in both
engines. Target: ours's renderer emits within ±30% of canary's
12,092 VdSwap/90s. If yes: first-draw is reached and sustained.
If ours's renderer emits but at a much lower rate, that's a follow-up
performance issue, not a correctness one. Defer.
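
For reference, ±30% of 12,092 gives an accepted range of 8,464.4 to 15,719.6 VdSwap/90s. A minimal sketch of the check, with a hypothetical function name:

```python
CANARY_VDSWAPS_PER_90S = 12_092  # canary figure quoted in Step 5
TOLERANCE = 0.30                 # the +/-30% target


def swap_rate_ok(ours_swaps: int, reference: int = CANARY_VDSWAPS_PER_90S,
                 tolerance: float = TOLERANCE) -> bool:
    """True when ours's count lies within reference * (1 +/- tolerance)."""
    return abs(ours_swaps - reference) <= tolerance * reference


if __name__ == "__main__":
    low = CANARY_VDSWAPS_PER_90S * (1 - TOLERANCE)
    high = CANARY_VDSWAPS_PER_90S * (1 + TOLERANCE)
    print(f"accepted range: {low:.1f} .. {high:.1f} VdSwap/90s")
```

A count above the upper bound also fails this check. That is intentional in the sketch, since a renderer swapping far faster than canary would suggest missing VSync pacing rather than success.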
## Expected progression per step
| Step | Expected `swaps` | Expected `draws` | Expected `unique_render_targets` | LOC delta |
|---|---:|---:|---:|---:|
| Pre-roadmap | 1 | 0 | 0 | — |
| Step 1 (crowbar) | 2-N | 1-N | 1+ | ~150 |
| Step 2 (trigger ID) | (unchanged) | (unchanged) | (unchanged) | 0 |
| Step 3 (mirror) | 2-N | 1-N | 1+ | 10-500 |
| Step 4 (decrowbar) | 2-N | 1-N | 1+ | -150 (remove) |
| Step 5 (parity) | 100+ | 100+ | 1-5 | 0-50 |
## What's NOT on this path (explicitly deferred)
1. **Host-audio bridge / XAudio resume**: the XAudio thread tids 14/15
spawning suspended-and-never-resumed in ours is real but parallel
to the worker-cluster wedge. In canary, both threads run; in ours,
neither runs. Pursuing XAudio fixes does not address the
graphics-blocking wedge. Defer to a separate
"post-first-draw" audit cluster.
2. **HID / controller**: Sylpheed's intro movie / title screen play
without user input. HID is irrelevant for first-draw.
3. **XAM content / save games**: irrelevant for first-draw; the
intro/title screens don't require save-game enumeration.
4. **Scheduler determinism** (per `scheduler_determinism_plan` /
Phase D Stages 0-4): null result, off-path. The wedge is upstream
of any contention. Defer indefinitely or close.
5. **Diff-tool canonicalization** (Phase C-style fixes): saturated on
moving matched-prefix without moving progression. **Halt** further
work in this class until Step 4 lands and re-baselines the diff
workload.
6. **AUDIT-068 host-side install probes**: superseded by AUDIT-068
Session 4 (writer identified at GUEST PC `sub_824FD240+0x24`).
The remaining question is *what triggers* `sub_824FD240`, which
Step 2 addresses.
## Alternative path (rejected)
**Skip the crowbar; do the trigger investigation cold.** Read canary
source for `sub_824FD240` callers, walk upward, identify the trigger.
Why rejected: `sub_824FD240` is GAME code, not canary engine code —
the file we'd "read" is the disassembly of the XEX. We'd need to
disassemble Sylpheed's RE'd PE and trace the call graph by hand. Per
sylpheed.db, `sub_824FD240`'s static caller is `sub_824F7800+0x38`
(in line with AUDIT-064). But what guest *call* causes `sub_824F7800`
to be invoked is itself a multi-fn upstream investigation that
returns to the same wedge cycle. The crowbar bypasses this paradox.
## Risk assessment
- **Step 1 catastrophic failure**: ours's emulator panics or
segfaults when the force-spawned workers run. Mitigation: gate
behind the debug-only `--force-spawn-workers` cvar; ensure ours's
CPU executes the worker entries in the normal sandboxed PPC JIT; if
they fault on missing guest state, log and exit cleanly.
- **Step 1 "succeeds but draws=0 anyway"**: the workers run but
ours's tid=13 still doesn't unblock — there's an unmodelled state
beyond just the missing thread spawns. Mitigation: log every event
the new workers emit; compare with canary's tid=27/28/29 streams in
`canary-jitter-1.jsonl`.
- **Step 3 LOC explosion**: the trigger turns out to be a large
subsystem (XAM content, XCONFIG, etc.). Mitigation: scope-cut to
a stub that returns "canary-equivalent" values without full
implementation.
## Confidence levels
- Step 1 unblocks the wedge if executed correctly: **MEDIUM** (60%).
Honest assessment: 25 prior audits have not unblocked it through
natural fixes, so the crowbar approach is novel and the failure
mode may not match expectations.
- Step 2 identifies a trigger in ≤1 session: **HIGH** (85%) — the
canary jsonl already has the data; analysis is mechanical.
- Step 3 LOC budget ≤500: **MEDIUM** (50%) — depends entirely on Step
2's answer.
- Step 4 natural bootstrap works post-Step-3: **MEDIUM** (50%) —
there may be additional gaps the crowbar masked.
## Memory hygiene
After Step 1 lands (crowbar binary in place), check that
`xenia-rs/target/release/xenia-rs` builds cleanly with the new cvar.
Verify Phase B `image_canonical_sha256` is updated (the crowbar
changes engine LOC); document the new baseline. Confirm 3× cold
runs produce identical digests with the crowbar enabled.
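
The "3× cold runs produce identical digests" check can be sketched as follows. This only illustrates comparing parsed digest outputs; it is not the Phase B `image_canonical_sha256` mechanism, and both function names are hypothetical.

```python
import hashlib
import json


def digest_fingerprint(digest: dict) -> str:
    """Hash a digest in a key-order-independent way."""
    canonical = json.dumps(digest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def runs_identical(digests) -> bool:
    """True when every run produced the same fingerprint (and there is at least one)."""
    fingerprints = {digest_fingerprint(d) for d in digests}
    return len(fingerprints) == 1
```

Serializing with sorted keys before hashing means two runs that print the same counters in a different order still compare equal, so a mismatch points at a real value difference.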
## What "winning" looks like
`xenia-rs check --stable-digest -n 50000000` (or higher cap, e.g.
`-n 500000000` to reach 30 s wallclock) outputs:
```json
{
"instructions": 50000007,
"imports": 40390+,
"draws": >= 1,
"swaps": >= 2,
"unique_render_targets": >= 1,
"shader_blobs_live": >= 1,
"texture_cache_entries": >= 1
}
```
…and the values are reproducible across 3 cold runs. A non-zero
`draws` value means at least one PM4_TYPE3 DRAW_INDX packet was
emitted by the renderer thread.