xenia-rs/audit-runs/review-a-boot-state/ours-wedge-localization.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Ours wedge localization
**Source data**: `phase-w-wedge-reattack/ours-postfix.jsonl` (50M-instr
cold run, ~3 s wallclock, 121,569 events, 13 tids).
`phase-w-wedge-reattack/halt-on-deadlock-dump.txt` (per-tid state @
deadlock).
## TL;DR
Ours's wedge is **structurally identical** to the one first documented
in AUDIT-049 (2026-05-10). Across 25+ subsequent iterates (Phase C+1 …
Phase C+25, Phase D, AUDIT-049 … AUDIT-069), the wedge has **never moved**:
- **tid=1 (main)** wedges at `sub_82173990+0x2D4` (PC `0x824ac578`,
`do_wait_single`) on **handle `0x12c8`** = `Thread(id=13)` — the
renderer thread's join handle.
- **tid=13 (renderer / cache-IO worker)** wedges at
`sub_821CB030+0x1B0` (PC `0x824ac578`, `do_wait_single`) on
**handle `0x12d0`** = `Event/Auto`, created by tid=13 itself at
`sub_821CB030+0x128` via `NtCreateEvent`. The handle is tagged
`<NO_SIGNALS_DESPITE_WAITS>`: it is waited on but never signaled.
- **`sub_825070F0` fires 0×** at any horizon probed (50M, 500M, ∞
wallclock). The 4 workers (entries `0x82506528/58/88/B8`) never
spawn in ours.
This is what audits 049/058/059/060/062/063/064/065/066/067/068/069
collectively call "the wedge."
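
The `<NO_SIGNALS_DESPITE_WAITS>` condition can be checked mechanically from a wait/signal trace. What follows is a minimal sketch, not the project's tooling: `SyncEvent` and `waits_without_signals` are hypothetical names, and the record shape is an assumption because the actual `.jsonl` schema is not shown here. It reports every handle that some tid waited on and that nobody signaled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, simplified trace record (assumed shape, not the real schema).
#[derive(Debug, Clone, Copy)]
pub enum SyncEvent {
    Wait { tid: u32, handle: u32 },
    Signal { tid: u32, handle: u32 },
}

/// For every handle that was waited on but never signaled,
/// returns the set of tids that waited on it.
pub fn waits_without_signals(events: &[SyncEvent]) -> BTreeMap<u32, BTreeSet<u32>> {
    let mut waiters: BTreeMap<u32, BTreeSet<u32>> = BTreeMap::new();
    let mut signaled: BTreeSet<u32> = BTreeSet::new();
    for ev in events {
        match *ev {
            SyncEvent::Wait { tid, handle } => {
                waiters.entry(handle).or_default().insert(tid);
            }
            SyncEvent::Signal { handle, .. } => {
                signaled.insert(handle);
            }
        }
    }
    // Keep only handles that never saw a signal anywhere in the trace.
    waiters.retain(|handle, _| !signaled.contains(handle));
    waiters
}
```

Given the two waits listed above (tid=1 on `0x12c8`, tid=13 on `0x12d0`) and no matching signals, this would return both handles with their waiting tids.
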
## Graph view: ours's actual reachable subgraph vs canary's
### What runs in BOTH engines (matched-prefix 105,128)
```
entry_point
└─ early CRT init ✓ ours ✓ canary
└─ subsystem init ✓
├─ VdInitializeEngines (×2, then VdShutdownEngines, then again)
├─ VdInitializeRingBuffer
├─ VdEnableRingBufferRPtrWriteBack
├─ VdSetGraphicsInterruptCallback
└─ VdSetSystemCommandBufferGpuIdentifierAddress
└─ 10× ExCreateThread (the matched first spawn burst)
├─ 0x82181830 / 0x8245A5D0 / 0x82450A28 ✓ ✓
├─ 0x82457EF0 (spawned by tid=10 → tid=11) ✓ ✓
├─ 0x824CD458 (KeWait worker, susp=F) ✓ ✓
├─ 0x822F1EE0 (renderer, susp=T) ✓ ✓
├─ 0x824D2878 / 0x824D2940 (XAudio, susp=T) ✓ ✓
├─ 0x82178950 (XMA, susp=F) ✓ ✓
└─ 0x821748F0 (file IO spawner, susp=T) ✓ ✓
└─ 1× boot-init VdSwap ✓ swaps=1
└─ tid=1 enters sub_8216EA68 → sub_822F1AA8
└─ bctrl vtable[0] of *(0x828E1F08)
└─ sub_82175330 → tail → sub_82173990
└─ sub_821746B0 → spawn worker (= ours tid=13, susp=F)
└─ KeWaitForSingleObject INFINITE on tid=13.handle ← WEDGE
```
### What runs ONLY in canary (the missing subgraph)
```
After tid=6's tid=17 worker (= ours's tid=13) terminates:
sub_82173990 returns to sub_822F1AA8's outer loop
└─ iterates sub_821741C8 → sub_82172BA0 → vtable[6] = sub_821B55D8
→ sub_824F8398 → sub_824F7CD0 → sub_824F7800 → vtable[1] = sub_825070F0
└─ 4× ExCreateThread(entry=0x82506528/58/88/B8, susp=T)
├─ Worker 0 → tid=28 (file IO, 3.26M events)
├─ Worker 1 → tid=27 (36k events)
├─ Worker 2 → tid=29 (91k events)
└─ Worker 3 (0x825065B8 — never resumed in jitter-1 run)
After workers come online:
Canary's secondary spawn burst (1.94-2.15 s): 8 helpers (tids 18-25)
Canary's tid=14/15 XAudio resumes (~ms after tid=6 spawns them in
susp=T; ours also spawns them susp=T but never resumes them)
Renderer tid=13 unblocks, starts emitting VdSwap at ~150 fps
Per-frame game loop: tid=6 emits `0x822F1BCC` 4040× / 60 s
```
## The wedge dependency graph (cyclic)
```
[tid=1 (main) wedge]
wait on handle 0x12c8 (= tid=13.thread_handle)
only signaled when tid=13 calls ExTerminateThread
tid=13 needs to complete sub_821CB030 body
sub_821CB030 waits on event 0x12d0
only signaled by sub_825070F0 worker cluster
sub_825070F0 never fires in ours
sub_825070F0 is reached via:
sub_82172BA0 → ... → sub_824F7800 → bctrl vtable[1]
↑↑↑ which is downstream of sub_822F1AA8's outer loop
which is downstream of sub_82173990 returning
which is downstream of tid=1's wait completing
← BACK TO TOP
```
This is the **AUDIT-063 self-referential lock**: the activation chain
that produces the signal that unwedges the wait is itself downstream
of the wait completing. In canary, the lock resolves because the
tid=17 worker (= ours tid=13's analog) calls `ExTerminateThread`
**by completing** its `sub_821CB030` body — and that completion is
fed by some OTHER signal source that ours doesn't replicate.
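
The cyclic dependency above can be expressed as a small wait-for graph and checked for a cycle. This is an illustrative sketch only: the node labels paraphrase the chain in this section, `find_cycle` and `wedge_edges` are hypothetical helpers, and each node is assumed to have exactly one blocker.

```rust
use std::collections::HashMap;

/// Follows "X is blocked until Y" edges from `start` and returns the cycle
/// if one is reached. Assumes at most one blocker per node.
pub fn find_cycle<'a>(
    blocked_on: &HashMap<&'a str, &'a str>,
    start: &'a str,
) -> Option<Vec<&'a str>> {
    let mut path: Vec<&'a str> = Vec::new();
    let mut cur = start;
    loop {
        // Revisiting a node already on the path closes a cycle.
        if let Some(pos) = path.iter().position(|&n| n == cur) {
            return Some(path[pos..].to_vec());
        }
        path.push(cur);
        match blocked_on.get(cur) {
            Some(&next) => cur = next,
            None => return None, // chain ends: nothing blocks this node
        }
    }
}

/// The dependency chain from this section, as (blocked, blocker) pairs.
pub fn wedge_edges() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("tid=1 wait on 0x12c8", "tid=13 ExTerminateThread"),
        ("tid=13 ExTerminateThread", "sub_821CB030 completes"),
        ("sub_821CB030 completes", "event 0x12d0 signaled"),
        ("event 0x12d0 signaled", "sub_825070F0 fires"),
        ("sub_825070F0 fires", "sub_822F1AA8 outer loop runs"),
        ("sub_822F1AA8 outer loop runs", "sub_82173990 returns"),
        ("sub_82173990 returns", "tid=1 wait on 0x12c8"),
    ])
}
```

Starting from `"tid=1 wait on 0x12c8"`, `find_cycle` walks all seven nodes and returns to the start. Removing the `"event 0x12d0 signaled"` edge, which models an independent signal source like the one canary appears to have, makes the chain terminate instead.
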
## Where the "other signal source" lives (the actual root cause)
From AUDIT-069 Session 5 (work-semaphore release-rate diff):

> Canary 414 release events vs ours 99 (24% rate). Worker (tid=10/5):
> 382 vs 90. Main (tid=6/1): 7 vs 8. **Other producers: 25 vs 1**.

The discrepancy in the "other producers" bucket (25 release events in
canary vs 1 in ours) is the key.
**Canary has multiple non-worker threads that release the work
semaphore during bootstrap, and releasing this semaphore is what feeds
the worker-side wait that eventually causes sub_821CB030's event to
be signaled.** Ours has only one such release, from tid=13 itself
before it wedges.
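
As a quick consistency check on the quoted Session 5 figures, the per-bucket release counts can be summed and compared. The struct and function names below are hypothetical. The numbers are taken from the quote above.

```rust
/// Work-semaphore release events per bucket, as quoted from AUDIT-069 S5.
pub struct ReleaseCounts {
    pub worker: u32,
    pub main: u32,
    pub other: u32,
}

impl ReleaseCounts {
    pub fn total(&self) -> u32 {
        self.worker + self.main + self.other
    }
}

pub const CANARY: ReleaseCounts = ReleaseCounts { worker: 382, main: 7, other: 25 };
pub const OURS: ReleaseCounts = ReleaseCounts { worker: 90, main: 8, other: 1 };

/// Ours as a percentage of canary for one bucket.
pub fn ratio_percent(ours: u32, canary: u32) -> f64 {
    100.0 * ours as f64 / canary as f64
}
```

The buckets sum to the quoted totals (414 and 99), and the overall ratio is about 23.9%, which matches the rounded 24%. The "other" bucket stands out at 4% (1 of 25), while the main thread is at roughly parity.
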
From AUDIT-069 Session 4 (`sub_82450A68` dispatch loop):

> Ours r3=0x1 (semaphore acquired) 91/91 captures (100%); canary
> r3=0x102 (TIMEOUT) 3/4 (75%).

**Ours's work semaphore has count > 0 every time tid=5 checks; canary's
times out 75% of the time.** At face value this is a paradox: how
can ours have MORE semaphore signals available yet process LESS work?
The S5 reframe resolves it. Ours's worker self-releases the work
semaphore from `sub_82450B68+0xCDC/+0xD28` more often than it consumes,
because the consume path exits early when the dispatch table has no
entry to process, and the table has no entries because the producers
behind canary's 25 "other" releases aren't running in ours.
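
The paradox is easier to see in a toy model. The sketch below is not the guest's logic. It assumes a tick-based scheduler, a single worker, a producer that enqueues one dispatch entry and releases the semaphore once per period, and a worker that re-releases the semaphore whenever it acquires it but finds the table empty. All names and parameters are hypothetical.

```rust
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Stats {
    pub acquired: u32,  // waits that returned "acquired"
    pub timeouts: u32,  // waits that timed out
    pub processed: u32, // dispatch entries actually handled
}

/// `seed_releases`: releases that arrive with no dispatch entry attached.
/// `producer_period`: if set, every `period` ticks a producer enqueues one
/// entry and releases the semaphore once.
pub fn simulate(ticks: u32, seed_releases: u32, producer_period: Option<u32>) -> Stats {
    let mut count = seed_releases; // semaphore count
    let mut queue = 0u32;          // pending dispatch entries
    let mut stats = Stats::default();
    for t in 0..ticks {
        if let Some(p) = producer_period {
            if p > 0 && t % p == 0 {
                queue += 1;
                count += 1;
            }
        }
        if count == 0 {
            stats.timeouts += 1;
            continue;
        }
        count -= 1;
        stats.acquired += 1;
        if queue > 0 {
            queue -= 1;
            stats.processed += 1;
        } else {
            // Early exit with nothing to do: the worker hands the token back.
            count += 1;
        }
    }
    stats
}
```

With one stray release and no producer, `simulate(8, 1, None)` acquires on all 8 ticks and processes nothing, which is the "100% acquired, no work" shape seen in ours. With a producer every 4 ticks, `simulate(8, 0, Some(4))` times out on 6 of 8 ticks yet processes 2 entries, which is the shape seen in canary.
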
## Bootstrap divergence (when does ours first diverge from canary?)
Per the AUDIT-069 H3 framing: somewhere in the *bootstrap* of the
worker-cluster, a producer thread that should be alive in canary
isn't alive in ours. Candidates:
1. **XAudio render thread (canary tid=14/15)**: spawned suspended in
ours, **never resumed**. Canary resumes within ~1 ms of spawn at
1.726 s. Canary's tid=14 calls `XAudioGetVoiceCategoryVolumeChangeMask`
26,126× and is one of the top event producers. This thread runs
the host-audio bridge feed loop — *if it isn't running, downstream
producers expecting audio cues block.*
2. **XMA decoder (tid=16, entry `0x82178950`)**: spawned non-suspended
in both; ours emits 0 events from this thread because it presumably
waits on a kernel object that's never signaled.
3. **NtWaitForMultipleObjectsEx worker (canary tid=21, entry
`0x824563E0`)**: 1M events in canary; absent in ours (canary's
second spawn burst doesn't happen).
4. **The "tid=10 helper" (canary tid=10, entry `0x82450A28`)**: ours
has this thread (ours tid=5), but it's running the dispatch loop
`sub_82450A68` in a degenerate fast-path mode (S4 finding).
The most defensible single-root claim:

> **Ours never resumes the XAudio threads (tid=14/15), because the
> guest API call that triggers their resume in canary doesn't fire in
> ours, and as a knock-on the worker cluster never gets the bootstrap
> producer it expects.**

But this claim is not yet proven; AUDIT-068/069 stopped short of
identifying the resume trigger.
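
Candidate 1 is straightforward to test against a thread-lifecycle trace: list every thread created suspended that never receives a resume. The sketch below assumes a hypothetical `ThreadEvent` record with create and resume events. The real trace schema and the resume export name are not specified here.

```rust
use std::collections::BTreeSet;

/// Hypothetical thread-lifecycle record (assumed shape).
#[derive(Debug, Clone, Copy)]
pub enum ThreadEvent {
    Create { tid: u32, entry: u32, suspended: bool },
    Resume { tid: u32 },
}

/// Returns the tids that were created suspended and never resumed.
pub fn never_resumed(events: &[ThreadEvent]) -> BTreeSet<u32> {
    let mut pending = BTreeSet::new();
    for ev in events {
        match *ev {
            ThreadEvent::Create { tid, suspended: true, .. } => {
                pending.insert(tid);
            }
            ThreadEvent::Create { .. } => {} // started running immediately
            ThreadEvent::Resume { tid } => {
                pending.remove(&tid);
            }
        }
    }
    pending
}
```

Running this over both engines' traces and diffing the two sets would show whether the XAudio entries (`0x824D2878` / `0x824D2940`) are the only threads left suspended in ours, or whether other suspended threads share the same fate.
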
## Verified-but-doesn't-help LOC budget across recent audits
(For methodology context — every recent audit landed correctness or
diagnostic LOC but moved progression 0%.)
| Audit / Phase | LOC added | Component | Effect on progression |
|---|---:|---|---|
| AUDIT-067 vptr-mem-watch | +422 (canary) | Mem-watch diagnostic | 0 |
| AUDIT-068 S1-S4 | +520 cumul (canary) | Host-side write hooks | 0 (writer identified at guest PC) |
| AUDIT-069 S1-S5 | +60 (canary), 0 (ours) | Wait/release watch | 0 (counts diverge, no fix) |
| Phase D Stages 0-4 | +450-500 (ours+tools) | Contention manifest | 0 (104,607 cap unbroken) |
| Phase D D-extension | +95 (tool) | Nested-CS absorber | +439 matched-prefix only |
| Phase C+1 .. C+25 | varies | Allocator/event/thread shims | 0 (matched-prefix only) |
| Phase W | +20 (ours) | VdInitializeEngines r3=1 | +66 matched-prefix only |
| **Total LOC that broke the wedge: 0, of any kind** | | | |

This is the single most striking pattern from the audit chain: **every
honest correctness fix advances at most matched-prefix; none move
`draws / swaps / unique_render_targets`.**
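
For reference, a matched-prefix figure can be computed as the length of the longest common prefix of the two engines' event streams. The sketch below assumes that definition, since the audits' exact comparison key is not stated in this note, and uses a hypothetical helper name.

```rust
/// Number of leading events on which both traces agree.
pub fn matched_prefix_len<T: PartialEq>(ours: &[T], canary: &[T]) -> usize {
    ours.iter()
        .zip(canary.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Under this definition a fix can lengthen the prefix by making a few more early events agree without changing anything after the first divergence, which is consistent with matched-prefix gains that leave the render counters untouched.
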
## Falsification budget for the wedge framing
The wedge framing IS robust (no audit has falsified it since AUDIT-049).
But it has limited explanatory power: it tells us *what is blocked*,
not *what should unblock it*. Reading-error #38 (cross-spawn producer
paths missed by static reachability) and #36 (POD struct copy bypass)
both proved that the install / wake mechanism in canary involves paths
guest static analysis cannot see. This is a methodology constraint,
not an unsolvable problem.