handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
253
audit-runs/review-a-boot-state/shortest-path-roadmap.md
Normal file
@@ -0,0 +1,253 @@
# Shortest-path-to-first-gameplay-draw roadmap

**Date**: 2026-05-21
**Read-only investigation; no LOC changes proposed.**
**Premise**: 25+ iterates have advanced matched-prefix 102,168 →
105,128 (+2,960 events) but `draws=0, swaps=1, unique_render_targets=0`
have not moved. This roadmap proposes a non-canonicalization path
forward.

## Definitions

- **First gameplay draw** = the first `VdSwap` call by ours's
  renderer (the thread spawned at entry `0x822F1EE0`, ours's tid
  analog of canary tid=13) that emits at least one `PM4_TYPE3
  DRAW_INDX` packet into the ringbuffer.
- **Observable success criterion**: `draws ≥ 1, swaps ≥ 2,
  unique_render_targets ≥ 1` in `xenia-rs check --stable-digest`
  output. At least one frame must come from the **renderer thread**
  (not the boot-init swap that ours already emits).

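
The success criterion above can be written as a small predicate. This is a minimal sketch: the `Digest` struct and its field names are assumptions taken from the counter names in this note, not the actual `xenia-rs` types.

```rust
/// Counters as they might appear in a stable digest.
/// Hypothetical struct: field names mirror the criterion in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Digest {
    pub draws: u64,
    pub swaps: u64,
    pub unique_render_targets: u64,
}

/// True when the digest meets the observable success criterion:
/// draws >= 1, swaps >= 2, unique_render_targets >= 1.
pub fn reached_first_gameplay_draw(d: &Digest) -> bool {
    d.draws >= 1 && d.swaps >= 2 && d.unique_render_targets >= 1
}
```

With the pre-roadmap counters (`draws=0, swaps=1, unique_render_targets=0`) the predicate is false. The boot-init swap alone never satisfies it, because `swaps` must reach 2.
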
## Why current iteration has stalled

The wedge has been mapped and remapped 20+ times. Every audit
correctly identifies symptoms; every fix correctly canonicalizes a
diff-tool divergence. But the wedge is **structurally cyclic**: the
worker cluster that signals the wait is downstream of the wait
completing. The standard approach ("find the divergent kernel call,
mirror canary's semantics") has saturated.

Two strategies remain that have NOT been tried at full scope:

1. **(A) Decouple the cycle by faking the worker activation**:
   directly call `sub_825070F0` from a host shim, or directly spawn
   the 4 worker threads with the right ctx, sidestepping the
   activation chain. This is a *crowbar*: it doesn't fix the
   underlying bootstrap bug, but it tests "are the workers
   functionally correct IF activated." If they signal the wedge and
   ours then reaches first draw, we know the bug is *exclusively* in
   the activation gate, and we can attack just that.

2. **(B) Find what triggers `sub_824FD240+0x24`'s POD-copy in canary**.
   AUDIT-068 Session 4 pinned the install epoch of vtable
   `0x8200A1E8` to this writer site. But the dynamic *trigger* of
   `sub_824FD240` (what guest call leads to it firing) is
   unidentified; only its static call chain is known. In ours,
   `sub_824FD240` fires 0× because the call chain
   `sub_824F8398 → sub_824F7CD0 → sub_824F7800 → sub_824FD240`
   is downstream of the tid=13 wedge. So we have circular reasoning
   again, UNLESS Strategy A is applied first.

The roadmap below uses Strategy A as a wedge-crowbar and Strategy B
as the principled fix that follows.

## Roadmap
### Step 1 — Crowbar: force-spawn the `sub_825070F0` workers (~80–150 LOC)

**Action**: in `xenia-rs`, add a debug-only cvar
`--force-spawn-workers` that, when set, waits for a bootstrap
checkpoint (e.g., first `VdInitializeRingBuffer` return) and then
manually spawns 4 ExCreateThread-equivalent guest threads with:

- entries `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
- ctx_ptr = run-determined; allocate a fresh
  `ANON_Class_713383D7`-shaped object on the unified heap and write
  vtable `0x8200A1E8` to slot 0 (mirror the POD-copy at
  `sub_824FD240+0x24`)
- stack_size 65536, suspended=True initially, then NtResumeThread

**Expected effect**:

- If the workers run correctly and signal the wedge: ours's tid=13
  unblocks, tid=1's join completes, normal game-loop begins.
  `draws ≥ 1, swaps ≥ 2`.
- If the workers fail (e.g., faulting because the ctx object's other
  fields aren't initialized): we learn what *else* needs to be
  installed alongside the vtable.

**Failure modes to expect**:

- The worker entries dispatch via vtable slots 35/36/37/38 of the
  ANON_Class, so those slots also need to be populated. AUDIT-067
  static analysis shows the vtable has 7 entries; the worker entries
  use offsets 140/144/148/152 (= slots 35/36/37/38 of a wider vtable)
  per `sub_825070F0.md` lines 32-37. So we'll need a parent class /
  derived class layout.
- The ctx object also has refcount/header fields that must be
  initialized; see AUDIT-068 Session 3 finding of 12-byte struct
  copy `{vptr, self, self}` followed by refcount=1.

**LOC budget**: 80-150 LOC ours-side; 0 LOC canary.
**Read-only fallback**: if force-spawn fails immediately, we've still
captured the failure mode, which is informative.
**Risk**: high. This is structurally a hack, acceptable as a
diagnostic.

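
The offset-to-slot arithmetic and the ctx header layout from the failure modes above can be sketched as follows. This is an illustration, not the crowbar itself: `vtable_slot` and `ctx_header` are hypothetical helper names, and placing the refcount as a 32-bit word at offset 12 is an assumption (the note only says it follows the 12-byte copy).

```rust
/// Guest pointers are 32-bit here, so one vtable slot is 4 bytes
/// (consistent with offset 140 mapping to slot 35).
const SLOT_BYTES: u32 = 4;

/// Converts a byte offset into a vtable to its slot index.
/// Returns None if the offset is not slot-aligned.
pub fn vtable_slot(byte_offset: u32) -> Option<u32> {
    if byte_offset % SLOT_BYTES == 0 {
        Some(byte_offset / SLOT_BYTES)
    } else {
        None
    }
}

/// Builds the first 16 bytes of a ctx object located at guest address
/// `self_addr`: the 12-byte `{vptr, self, self}` copy, then refcount = 1.
/// Assumption: the refcount is a 32-bit word at offset 12.
/// Words are written big-endian, matching the guest PowerPC byte order.
pub fn ctx_header(vptr: u32, self_addr: u32) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (i, word) in [vptr, self_addr, self_addr, 1u32].iter().enumerate() {
        out[i * 4..i * 4 + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}
```

For example, `vtable_slot(140)` yields slot 35, which lies well past the 7 entries that static analysis found. That gap is why a wider parent/derived layout is needed before the worker entries can dispatch.
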
### Step 2 — Identify what triggers `sub_824FD240+0x24` in canary (~0 LOC)

**Action**: with Step 1's crowbar enabled, ours reaches the
post-wedge code path. Compare ours and canary on what `import.call`
(kernel API) sequence the **caller** of `sub_824FD240` makes
immediately before the POD-copy install.

The caller chain (per AUDIT-064/068) is:

```
sub_824F8398 → sub_824F7CD0 → sub_824F7800 → [bl at +0x38 = sub_824FD240] / [bctrl at +0x320 = sub_825070F0]
```

So `sub_824F7800` calls `sub_824FD240` at offset `+0x38`, BEFORE it
calls `sub_825070F0` at offset `+0x320`.

Question: what does `sub_824F8398`'s caller (one level up,
`sub_821B55D8`) pass as arguments, and what kernel APIs run in
between? We need to trace tid=6's events in canary in the wallclock
window [9.4 s, 9.6 s], the install epoch.

**LOC budget**: 0. Pure event-stream analysis on captured canary
jsonl (we already have `canary-jitter-1.jsonl`, 18.7M events).
**Output**: an ordered list of kernel calls just before
`sub_824FD240+0x24` fires. If any are missing in ours, that's a
candidate gap.

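
The event-stream analysis for this step could look roughly like the sketch below. It works on already-decoded events; the `Event` struct, its field names, and the marker label are hypothetical, since the jsonl schema is not shown in this note.

```rust
/// One decoded trace event. Hypothetical shape: the real jsonl schema
/// is not reproduced here.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub tid: u32,
    pub t_sec: f64,   // wallclock seconds since start
    pub kind: String, // e.g. "import.call"
    pub name: String, // kernel export name or guest-site label
}

/// Ordered names of `import.call` events on `tid` inside [start, end],
/// stopping at the first in-window event whose name equals `marker`.
pub fn calls_before_marker<'a>(
    events: &'a [Event],
    tid: u32,
    start: f64,
    end: f64,
    marker: &str,
) -> Vec<&'a str> {
    events
        .iter()
        .filter(|e| e.tid == tid && e.t_sec >= start && e.t_sec <= end)
        .take_while(|e| e.name != marker)
        .filter(|e| e.kind == "import.call")
        .map(|e| e.name.as_str())
        .collect()
}

/// Names present in `canary` but absent from `ours`, in canary order.
pub fn missing_in_ours<'a>(canary: &[&'a str], ours: &[&str]) -> Vec<&'a str> {
    canary
        .iter()
        .copied()
        .filter(|name| !ours.iter().any(|o| *o == *name))
        .collect()
}
```

Calling `calls_before_marker(&canary_events, 6, 9.4, 9.6, marker)` would give the ordered kernel calls on tid=6 in the install epoch, and `missing_in_ours` would reduce that list to the candidate gaps. A set difference ignores call order and repeat counts, so it is only a first pass.
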
### Step 3 — Mirror the trigger in ours (variable LOC)

Once Step 2 names the missing kernel call(s), implement them in ours
following Phase C cadence (verify per-call return values match canary;
add diff-tool tests; document in memory).

**LOC budget**: depends on what's missing. Could be 10–500 LOC.

### Step 4 — Remove the crowbar; verify natural bootstrap (~0 LOC)

With Step 3's fix in place, remove `--force-spawn-workers`. Re-run
ours. If the natural bootstrap chain runs and `draws ≥ 1, swaps ≥ 2`,
we've fixed the bug.

If progression still fails without the crowbar, there's another gap;
re-enter at Step 2 with a refined trigger search.

### Step 5 — Validate gameplay frame parity (~0–50 LOC)

Capture renderer-thread VdSwap counts at 90 s wallclock in both
engines. Target: ours's renderer emits within ±30% of canary's
12,092 VdSwap/90s. If yes: first-draw is reached and sustained.

If ours's renderer emits but at a much lower rate, that's a follow-up
performance issue, not a correctness one. Defer.

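
The ±30% target can be checked mechanically. A minimal sketch, with `within_tolerance` as a hypothetical helper name; the reference count is the canary figure quoted above.

```rust
/// Canary's renderer-thread VdSwap count over 90 s, as quoted in this note.
pub const CANARY_SWAPS_PER_90S: u64 = 12_092;

/// True when `ours` lies within `tolerance` (a fraction, e.g. 0.30)
/// of `reference`, inclusive on both ends.
pub fn within_tolerance(ours: u64, reference: u64, tolerance: f64) -> bool {
    let reference = reference as f64;
    let delta = (ours as f64 - reference).abs();
    delta <= reference * tolerance
}
```

With a 0.30 tolerance the allowed deviation is 12,092 × 0.30 = 3,627.6 swaps, so integer counts from 8,465 through 15,719 pass.
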
## Expected progression per step

| Step | Expected `swaps` | Expected `draws` | Expected `unique_render_targets` | LOC delta |
|---|---:|---:|---:|---:|
| Pre-roadmap | 1 | 0 | 0 | n/a |
| Step 1 (crowbar) | 2-N | 1-N | 1+ | ~150 |
| Step 2 (trigger ID) | (unchanged) | (unchanged) | (unchanged) | 0 |
| Step 3 (mirror) | 2-N | 1-N | 1+ | 10-500 |
| Step 4 (decrowbar) | 2-N | 1-N | 1+ | -150 (remove) |
| Step 5 (parity) | 100+ | 100+ | 1-5 | 0-50 |

## What's NOT on this path (explicitly deferred)

1. **Host-audio bridge / XAudio resume**: the XAudio thread tids 14/15
   spawning suspended-and-never-resumed in ours is real but parallel
   to the worker-cluster wedge. In canary, both threads run; in ours,
   neither runs. Pursuing XAudio fixes does not address the
   graphics-blocking wedge. Defer to a separate
   "post-first-draw" audit cluster.
2. **HID / controller**: Sylpheed's intro movie / title screen play
   without user input. HID is irrelevant for first-draw.
3. **XAM content / save games**: irrelevant for first-draw; the
   intro/title screens don't require save-game enumeration.
4. **Scheduler determinism** (per `scheduler_determinism_plan` /
   Phase D Stages 0-4): null result, off-path. The wedge is upstream
   of any contention. Defer indefinitely or close.
5. **Diff-tool canonicalization** (Phase C-style fixes): saturated on
   moving matched-prefix without moving progression. **Halt** further
   work in this class until Step 4 lands and re-baselines the diff
   workload.
6. **AUDIT-068 host-side install probes**: superseded by AUDIT-068
   Session 4 (writer identified at GUEST PC `sub_824FD240+0x24`).
   The remaining question is *what triggers* `sub_824FD240`, which
   Step 2 addresses.

## Alternative path (rejected)

**Skip the crowbar; do the trigger investigation cold.** Read canary
source for `sub_824FD240` callers, walk upward, identify the trigger.
Why rejected: `sub_824FD240` is GAME code, not canary engine code;
the file we'd "read" is the disassembly of the XEX. We'd need to
disassemble Sylpheed's RE'd PE and trace the call graph by hand. Per
sylpheed.db, `sub_824FD240`'s static caller is `sub_824F7800+0x38`
(in line with AUDIT-064). But what guest *call* causes `sub_824F7800`
to be invoked is itself a multi-fn upstream investigation that
returns to the same wedge cycle. The crowbar bypasses this paradox.

## Risk assessment

- **Step 1 catastrophic failure**: ours's emulator panics or
  segfaults when the force-spawned workers run. Mitigation: gate
  behind a debug-only cvar (`--force-spawn-workers`); ensure ours's
  CPU executes the worker entries in normal sandboxed PPC JIT; if
  they fault on missing guest state, log and exit cleanly.
- **Step 1 "succeeds but draws=0 anyway"**: the workers run but
  ours's tid=13 still doesn't unblock, so there's an unmodelled state
  beyond just the missing thread spawns. Mitigation: log every event
  the new workers emit; compare with canary's tid=27/28/29 streams in
  `canary-jitter-1.jsonl`.
- **Step 3 LOC explosion**: the trigger turns out to be a large
  subsystem (XAM content, XCONFIG, etc.). Mitigation: scope-cut to
  a stub that returns "canary-equivalent" values without full
  implementation.

## Confidence levels

- Step 1 unblocks the wedge if executed correctly: **MEDIUM** (60%).
  Honest assessment: 25 prior audits have not unblocked it through
  natural fixes, so the crowbar approach is novel and the failure
  mode may not match expectations.
- Step 2 identifies a trigger in ≤1 session: **HIGH** (85%); the
  canary jsonl already has the data and analysis is mechanical.
- Step 3 LOC budget ≤500: **MEDIUM** (50%); depends entirely on Step
  2's answer.
- Step 4 natural bootstrap works post-Step-3: **MEDIUM** (50%);
  there may be additional gaps the crowbar masked.

## Memory hygiene

After Step 1 lands (crowbar binary in place), check that
`xenia-rs/target/release/xenia-rs` builds cleanly with the new cvar.
Verify Phase B `image_canonical_sha256` is updated (the crowbar
changes engine LOC); document the new baseline. Confirm 3× cold
runs produce identical digests with the crowbar enabled.

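
The "3× cold runs produce identical digests" check reduces to a pairwise comparison. A minimal sketch, assuming each run's digest has already been captured as a string; `digests_reproducible` is a hypothetical helper name.

```rust
/// True when at least `min_runs` digests were supplied and every run
/// produced the same digest string.
pub fn digests_reproducible(digests: &[&str], min_runs: usize) -> bool {
    digests.len() >= min_runs && digests.windows(2).all(|pair| pair[0] == pair[1])
}
```

Passing `min_runs = 3` enforces the three-run requirement, so two matching runs are not accepted as evidence of determinism.
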
## What "winning" looks like

`xenia-rs check --stable-digest -n 50000000` (or higher cap, e.g.
`-n 500000000` to reach 30 s wallclock) outputs:

```json
{
  "instructions": 50000007,
  "imports": 40390+,
  "draws": >= 1,
  "swaps": >= 2,
  "unique_render_targets": >= 1,
  "shader_blobs_live": >= 1,
  "texture_cache_entries": >= 1
}
```

…and the value is reproducible across 3 cold runs. A non-zero
`draws` value means at least one PM4_TYPE3 DRAW_INDX packet was
emitted by the renderer thread.