xenia-rs/audit-runs/review-a-boot-state/ours-wedge-localization.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Ours wedge localization

Source data: phase-w-wedge-reattack/ours-postfix.jsonl (50M-instr cold run, ~3 s wallclock, 121,569 events, 13 tids) and phase-w-wedge-reattack/halt-on-deadlock-dump.txt (per-tid state at deadlock).

TL;DR

Ours's wedge is structurally identical to the AUDIT-049 wedge (first found 2026-05-10). Across 25+ subsequent iterates (Phase C+1 .. Phase C+25, Phase D, AUDIT-049 .. AUDIT-069), the wedge has never moved:

  • tid=1 (main) wedges at sub_82173990+0x2D4 (PC 0x824ac578, do_wait_single) on handle 0x12c8 = Thread(id=13) — the renderer thread's join handle.
  • tid=13 (renderer / cache-IO worker) wedges at sub_821CB030+0x1B0 (PC 0x824ac578, do_wait_single) on handle 0x12d0 = Event/Auto, created by tid=13 itself at sub_821CB030+0x128 via NtCreateEvent. <NO_SIGNALS_DESPITE_WAITS>.
  • sub_825070F0 fires 0× at any horizon probed (50M, 500M, ∞ wallclock). The 4 workers (entries 0x82506528/58/88/B8) never spawn in ours.

This is what audits 049/058/059/060/062/063/064/065/066/067/068/069 collectively call "the wedge."
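
The `<NO_SIGNALS_DESPITE_WAITS>` condition can be checked mechanically. The following is a minimal sketch, assuming a simplified event record of the form `(tid, kind, handle)`. That tuple layout, the `kind` strings, and the handle `0x1000` are hypothetical and do not reflect the actual ours-postfix.jsonl schema.

```python
from collections import defaultdict

def unsignaled_waits(events):
    """Return {handle: sorted tids} for handles that were waited on but never signaled.

    `events` is an iterable of (tid, kind, handle) tuples where kind is
    "wait" or "signal". This record shape is an assumption for illustration.
    """
    waiters = defaultdict(set)
    signaled = set()
    for tid, kind, handle in events:
        if kind == "wait":
            waiters[handle].add(tid)
        elif kind == "signal":
            signaled.add(handle)
    return {h: sorted(t) for h, t in waiters.items() if h not in signaled}

# Toy trace mirroring the two wedged waits described above.
toy = [
    (13, "wait", 0x12D0),    # tid=13 waits on its own auto-reset event
    (1, "wait", 0x12C8),     # tid=1 waits on tid=13's thread handle
    (5, "wait", 0x1000),     # hypothetical handle that does get signaled
    (5, "signal", 0x1000),
]
stuck = unsignaled_waits(toy)
```

On the toy trace, `stuck` contains only the two handles from the TL;DR, each mapped to the single tid blocked on it.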

Graph view: ours's actual reachable subgraph vs canary's

What runs in BOTH engines (matched-prefix 105,128)

entry_point
 └─ early CRT init        ✓ ours ✓ canary
 └─ subsystem init        ✓
    ├─ VdInitializeEngines (×2, then VdShutdownEngines, then again)
    ├─ VdInitializeRingBuffer
    ├─ VdEnableRingBufferRPtrWriteBack
    ├─ VdSetGraphicsInterruptCallback
    └─ VdSetSystemCommandBufferGpuIdentifierAddress
 └─ 10× ExCreateThread (the matched first spawn burst)
    ├─ 0x82181830 / 0x8245A5D0 / 0x82450A28      ✓ ✓
    ├─ 0x82457EF0 (spawned by tid=10 → tid=11)   ✓ ✓
    ├─ 0x824CD458 (KeWait worker, susp=F)        ✓ ✓
    ├─ 0x822F1EE0 (renderer, susp=T)             ✓ ✓
    ├─ 0x824D2878 / 0x824D2940 (XAudio, susp=T)  ✓ ✓
    ├─ 0x82178950 (XMA, susp=F)                  ✓ ✓
    └─ 0x821748F0 (file IO spawner, susp=T)      ✓ ✓
 └─ 1× boot-init VdSwap                          ✓ swaps=1
 └─ tid=1 enters sub_8216EA68 → sub_822F1AA8
    └─ bctrl vtable[0] of *(0x828E1F08)
       └─ sub_82175330 → tail → sub_82173990
          └─ sub_821746B0 → spawn worker (= ours tid=13, susp=F)
          └─ KeWaitForSingleObject INFINITE on tid=13.handle    ← WEDGE

What runs ONLY in canary (the missing subgraph)

After tid=6's tid=17 worker (= ours's tid=13) terminates:
  sub_82173990 returns to sub_822F1AA8's outer loop
   └─ iterates sub_821741C8 → sub_82172BA0 → vtable[6] = sub_821B55D8
      → sub_824F8398 → sub_824F7CD0 → sub_824F7800 → vtable[1] = sub_825070F0
         └─ 4× ExCreateThread(entry=0x82506528/58/88/B8, susp=T)
            ├─ Worker 0 → tid=28 (file IO, 3.26M events)
            ├─ Worker 1 → tid=27 (36k events)
            ├─ Worker 2 → tid=29 (91k events)
            └─ Worker 3 (0x825065B8 — never resumed in jitter-1 run)

After workers come online:
  Canary's secondary spawn burst (1.94-2.15 s): 8 helpers (tids 18-25)
  Canary's tid=14/15 XAudio resumes (~ms after tid=6 spawns them in
    susp=T; ours also spawns them susp=T but never resumes them)
  Renderer tid=13 unblocks, starts emitting VdSwap at ~150 fps
  Per-frame game loop: tid=6 emits `0x822F1BCC` 4040× / 60 s

The wedge dependency graph (cyclic)

            [tid=1 (main) wedge]
                       │
                       ▼
     wait on handle 0x12c8 (= tid=13.thread_handle)
                       │
                       ▼
       only signaled when tid=13 calls ExTerminateThread
                       │
                       ▼
         tid=13 needs to complete sub_821CB030 body
                       │
                       ▼
         sub_821CB030 waits on event 0x12d0
                       │
                       ▼
   only signaled by sub_825070F0 worker cluster
                       │
                       ▼
        sub_825070F0 never fires in ours
                       │
                       ▼
    sub_825070F0 is reached via:
       sub_82172BA0 → ... → sub_824F7800 → bctrl vtable[1]
    ↑↑↑ which is downstream of sub_822F1AA8's outer loop
        which is downstream of sub_82173990 returning
        which is downstream of tid=1's wait completing
        ← BACK TO TOP

This is the AUDIT-063 self-referential lock: the activation chain that produces the signal that unwedges the wait is itself downstream of the wait completing. In canary, the lock resolves because the tid=17 worker (= ours tid=13's analog) calls ExTerminateThread by completing its sub_821CB030 body — and that completion is fed by some OTHER signal source that ours doesn't replicate.
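
The self-referential structure can be made explicit by encoding each "needs" arrow from the graph as an edge and walking it. This is only a sketch: the node labels are paraphrases of the graph above, and `find_cycle` is a hypothetical helper, not part of any existing tool in the repo.

```python
def find_cycle(graph, start):
    """Follow single-successor 'needs' edges from start.

    Returns the cycle as a list of nodes, or None if the chain ends
    at a node with no further prerequisite.
    """
    seen = {}
    path = []
    node = start
    while node is not None:
        if node in seen:
            return path[seen[node]:]
        seen[node] = len(path)
        path.append(node)
        node = graph.get(node)
    return None

START = "tid=1 wait on 0x12c8 completes"
needs = {
    START: "tid=13 calls ExTerminateThread",
    "tid=13 calls ExTerminateThread": "sub_821CB030 body completes",
    "sub_821CB030 body completes": "event 0x12d0 signaled",
    "event 0x12d0 signaled": "sub_825070F0 fires",
    "sub_825070F0 fires": "sub_822F1AA8 outer loop iterates",
    "sub_822F1AA8 outer loop iterates": "sub_82173990 returns",
    "sub_82173990 returns": START,
}
cycle = find_cycle(needs, START)

# Modelling an external signal source for event 0x12d0 (as canary appears
# to have) removes that node's dependency on sub_825070F0.
with_external_signal = dict(needs)
del with_external_signal["event 0x12d0 signaled"]
resolved = find_cycle(with_external_signal, START)
```

With the edges as drawn, the walk returns to its starting node after seven steps. Dropping the single edge out of "event 0x12d0 signaled" is enough to make the chain terminate, which is the shape of the resolution the canary run exhibits.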

Where the "other signal source" lives (the actual root cause)

From AUDIT-069 Session 5 (work-semaphore release-rate diff):

Canary 414 release events vs ours 99 (24% rate). Worker (tid=10/5): 382 vs 90. Main (tid=6/1): 7 vs 8. Other producers: 25 vs 1.

The discrepancy in "other producers" (25 release events vs 1) is the key. In canary, multiple non-worker threads release the work semaphore during bootstrap; these releases feed the worker-side wait that eventually causes sub_821CB030's event to be signaled. Ours has only one such producer (tid=13 itself, before it wedges).
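
The Session 5 release counts can be cross-checked with a few lines of arithmetic. The dictionary keys below are shorthand labels chosen for this sketch; the numbers are the ones quoted above.

```python
# Work-semaphore release events per source, from AUDIT-069 Session 5.
canary = {"worker": 382, "main": 7, "other": 25}
ours = {"worker": 90, "main": 8, "other": 1}

canary_total = sum(canary.values())
ours_total = sum(ours.values())
overall_ratio = ours_total / canary_total

# Per-source ratio of ours to canary; the smallest ratio marks the
# source where ours falls furthest behind.
per_source = {k: ours[k] / canary[k] for k in canary}
weakest = min(per_source, key=per_source.get)
```

The per-source totals sum to the quoted 414 and 99, and the overall ratio rounds to 24%. Main-thread releases are roughly at parity, while the "other" bucket is where ours falls furthest behind.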

From AUDIT-069 Session 4 (sub_82450A68 dispatch loop):

Ours r3=0x1 (semaphore acquired) 91/91 captures (100%); canary r3=0x102 (TIMEOUT) 3/4 (75%).

Ours's work-semaphore has count > 0 every time tid=5 checks; canary's times out 75% of the time. This is a paradox at face value: how can ours have MORE semaphore signals available but still process LESS work? The S5 reframe resolves it: ours's worker self-releases the work semaphore from sub_82450B68+0xCDC/+0xD28 MORE OFTEN than it consumes, because the consume path early-exits when the dispatch table has no entry to process. The dispatch table has no entries because the producers (the threads behind canary's 25 "other" releases) aren't running.
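
A toy model shows how a wait that always succeeds can coexist with zero work done. Everything in this sketch is an illustrative assumption: the single seed release, the self-release on early exit, and the step counts are chosen only to echo the qualitative S4 pattern, not derived from the guest code.

```python
def run_worker(steps, pending_entries):
    """Toy dispatch loop. Returns (acquired, timeouts, processed).

    The semaphore starts with one seed release. When the worker acquires it
    and finds a pending dispatch entry, it consumes both. When the table is
    empty, it early-exits and releases the semaphore again (self-release).
    """
    count = 1
    acquired = timeouts = processed = 0
    for _ in range(steps):
        if count == 0:
            timeouts += 1          # wait would return TIMEOUT
            continue
        count -= 1                 # wait succeeds
        acquired += 1
        if pending_entries > 0:
            pending_entries -= 1   # real work consumes the signal
            processed += 1
        else:
            count += 1             # early exit: self-release, no work done
    return acquired, timeouts, processed

ours_like = run_worker(91, pending_entries=0)
canary_like = run_worker(4, pending_entries=1)
```

In the empty-table case the worker acquires on every iteration and processes nothing. In the case with one pending entry, the signal is consumed by real work and later waits time out. A high acquire rate is therefore not evidence of progress in this model.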

Bootstrap divergence (when does ours first diverge from canary?)

Per the AUDIT-069 H3 framing: somewhere in the bootstrap of the worker-cluster, a producer thread that should be alive in canary isn't alive in ours. Candidates:

  1. XAudio render thread (canary tid=14/15): spawned suspended in ours, never resumed. Canary resumes within ~1 ms of spawn at 1.726 s. Canary's tid=14 calls XAudioGetVoiceCategoryVolumeChangeMask 26,126× and is one of the top event producers. This thread runs the host-audio bridge feed loop — if it isn't running, downstream producers expecting audio cues block.
  2. XMA decoder (tid=16, entry 0x82178950): spawned non-suspended in both; ours emits 0 events from this thread, presumably because it waits on a kernel object that is never signaled.
  3. NtWaitForMultipleObjectsEx worker (canary tid=21, entry 0x824563E0): 1M events in canary; absent in ours (canary's second spawn burst doesn't happen).
  4. The "tid=10 helper" (canary tid=10, entry 0x82450A28): ours has this thread (ours tid=5), but it's running the dispatch loop sub_82450A68 in a degenerate fast-path mode (S4 finding).

The most defensible single-root claim:

Ours never resumes the XAudio threads (tid=14/15), because the guest API call that triggers their resume in canary doesn't fire in ours, and as a knock-on the worker cluster never gets the bootstrap producer it expects.

But this claim is not yet proven; AUDIT-068/069 stopped short of identifying the resume trigger.
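
Candidate 1 is cheap to test from a trace before hunting for the trigger itself. The sketch below assumes a simplified record of `(kind, tid)` with `kind` values `"create_suspended"`, `"create_running"`, and `"resume"`. These names and the tids used are hypothetical stand-ins, not the real trace schema.

```python
def never_resumed(events):
    """Return sorted tids that were created suspended and never resumed."""
    suspended = set()
    for kind, tid in events:
        if kind == "create_suspended":
            suspended.add(tid)
        elif kind == "resume":
            suspended.discard(tid)
    return sorted(suspended)

# Toy traces: the same two suspended spawns, with and without resumes.
spawns = [
    ("create_suspended", 14),
    ("create_suspended", 15),
    ("create_running", 16),
]
ours_like = never_resumed(spawns)
canary_like = never_resumed(spawns + [("resume", 14), ("resume", 15)])
```

Diffing the two result lists narrows the search to the resume calls present in one trace and absent in the other. It does not by itself identify which guest call issues them.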

Verified-but-doesn't-help LOC budget across recent audits

(For methodology context — every recent audit landed correctness or diagnostic LOC but moved progression 0%.)

| Audit / Phase | LOC added | Component | Effect on progression |
| --- | --- | --- | --- |
| AUDIT-067 vptr-mem-watch | +422 (canary) | Mem-watch diagnostic | 0 |
| AUDIT-068 S1-S4 | +520 cumul (canary) | Host-side write hooks | 0 (writer identified at guest PC) |
| AUDIT-069 S1-S5 | +60 (canary), 0 (ours) | Wait/release watch | 0 (counts diverge, no fix) |
| Phase D Stages 0-4 | +450-500 (ours+tools) | Contention manifest | 0 (104,607 cap unbroken) |
| Phase D D-extension | +95 (tool) | Nested-CS absorber | +439 matched-prefix only |
| Phase C+1 .. C+25 | varies | Allocator/event/thread shims | 0 (matched-prefix only) |
| Phase W | +20 (ours) | VdInitializeEngines r3=1 | +66 matched-prefix only |
| Total to break wedge: | 0 LOC of any kind | | |

This is the single most striking pattern from the audit chain: every honest correctness fix advances matched-prefix; none move draws / swaps / unique_render_targets.

Falsification budget for the wedge framing

The wedge framing IS robust (no audit has falsified it since AUDIT-049). But it has limited explanatory power: it tells us what is blocked, not what should unblock it. Reading-error #38 (cross-spawn producer paths missed by static reachability) and #36 (POD struct copy bypass) both proved that the install / wake mechanism in canary involves paths guest static analysis cannot see. This is a methodology constraint, not an unsolvable problem.