MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase XAudio-Resume — ESCALATION (case IV)

Date: 2026-05-19
Outcome: The resume mechanism is correctly implemented. The 60% missing event volume is gated on a DOWNSTREAM application-level spin-poll, not on the resume itself. No engine change landed.

Canary's resume mechanism (Step 1+2)

For each suspended XAudio worker (entry_pc=0x824D2878 aff=16 → tid=14; entry_pc=0x824D2940 aff=32 → tid=15), canary tid=6 (main) emits an identical 17-event sequence of five kernel calls, starting with ExCreateThread:

canary tid=6 idx=106750..106766  (host_ns 1726.0..1726.2 ms)
  106750 import.call ExCreateThread
  106751 kernel.call ExCreateThread
  106752 handle.create        (raw_handle 0x???????? — tid=14 handle)
  106753 thread.create        (entry_pc=0x824d2878, suspended=true)
  106754 kernel.return ExCreateThread
  106755 import.call ObReferenceObjectByHandle
  106756 kernel.call ObReferenceObjectByHandle
  106757 kernel.return ObReferenceObjectByHandle
  106758 import.call KeSetBasePriorityThread
  106759 kernel.call KeSetBasePriorityThread
  106760 kernel.return KeSetBasePriorityThread
  106761 import.call KeResumeThread            ← RESUME (xboxkrnl ord 146)
  106762 kernel.call KeResumeThread
  106763 kernel.return KeResumeThread
  106764 import.call ObDereferenceObject
  106765 kernel.call ObDereferenceObject
  106766 kernel.return ObDereferenceObject

Block repeats verbatim at idx 106767-106783 for entry_pc=0x824D2940. Containing function is XAudioRegisterRenderDriverClient (visible at idx 106817).
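Because the block is verbatim per worker, it can be located mechanically. The following is a minimal Rust sketch, not the audit's actual tooling (extract_window.py is not reproduced here). `TraceEvent`, `RESUME_BLOCK` and `find_resume_blocks` are hypothetical names; the sketch assumes each event carries an `idx`, a `kind` and an import `name`, keeps only `import.call` events, and reports where the five import calls appear consecutively in the order canary shows.

```rust
/// Hypothetical trace record carrying only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceEvent {
    pub idx: u64,
    pub kind: String, // "import.call", "kernel.call", "thread.create", ...
    pub name: String, // import name; empty for events without one
}

/// Import calls of one spawn-suspended-then-resume block, in canary's order.
pub const RESUME_BLOCK: [&str; 5] = [
    "ExCreateThread",
    "ObReferenceObjectByHandle",
    "KeSetBasePriorityThread",
    "KeResumeThread",
    "ObDereferenceObject",
];

/// Returns the idx of each `import.call ExCreateThread` that begins a run of
/// consecutive import calls matching RESUME_BLOCK. Non-import events
/// (kernel.call, handle.create, thread.create, ...) are skipped.
pub fn find_resume_blocks(events: &[TraceEvent]) -> Vec<u64> {
    let imports: Vec<&TraceEvent> = events
        .iter()
        .filter(|e| e.kind == "import.call")
        .collect();
    imports
        .windows(RESUME_BLOCK.len())
        .filter(|w| {
            w.iter()
                .zip(RESUME_BLOCK.iter())
                .all(|(e, want)| e.name == *want)
        })
        .map(|w| w[0].idx)
        .collect()
}
```

Run over the canary tid=6 window, a check of this shape would be expected to report 106750 and 106767; a block missing its KeResumeThread would simply not be reported.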

Ours's behavior at the matched site (Step 3)

Cold ours (-n 500M, --halt-on-deadlock, fresh cache wipe), checked against /tmp/ours-xaudio.jsonl (121,569 events captured before halt):

ours tid=1 idx=106756..106786   (host_ns 1626 ms — boot is ~100 ms ahead of canary)
  106756 import.call ExCreateThread
  106757 kernel.call ExCreateThread
  106758 handle.create
  106759 thread.create     (entry_pc=0x824d2878, suspended=true) ← matches canary
  106760 kernel.return ExCreateThread
  ...
  106767 import.call KeResumeThread             ← RESUME fires
  106768 kernel.call KeResumeThread
  106769 kernel.return KeResumeThread
  ...
  106776 thread.create     (entry_pc=0x824d2940, suspended=true) ← matches canary
  ...
  106784 import.call KeResumeThread             ← second RESUME fires
  106785 kernel.call KeResumeThread
  106786 kernel.return KeResumeThread

Ours's per-tid first events (cold run) for the spawned children:

tid=9  (=canary tid=14, entry 0x824d2878): 77 events, idx 0..76 identical to canary tid=14
tid=10 (=canary tid=15, entry 0x824d2940): 17 events, idx 0..16 identical to canary tid=15
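"Identical" here amounts to a common prefix of the two per-tid streams. A minimal sketch of that comparison follows; `matched_prefix` is a hypothetical helper, and the caller-supplied `key` is an assumption that lets host-only fields (host_ns, raw handle values) be left out. It is not necessarily how the per-chain matched-prefix numbers elsewhere in this note were computed.

```rust
/// Length of the longest common prefix of two per-tid event streams.
/// `key` projects each event onto the fields that should be compared, so
/// host-only fields (host_ns, raw handle values) can be left out.
pub fn matched_prefix<T, K, F>(canary: &[T], ours: &[T], key: F) -> usize
where
    K: PartialEq,
    F: Fn(&T) -> K,
{
    canary
        .iter()
        .zip(ours.iter())
        // Stop at the first position where the projected events differ;
        // zip also stops at the end of the shorter stream.
        .take_while(|&(a, b)| key(a) == key(b))
        .count()
}
```

With events keyed on (kind, name), tid=9 against canary tid=14 would report 77 under the claim above.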

Ours's tid=9 / tid=10 EXECUTE the canary-matching XAudio init sequence: KeWaitForSingleObject (with immediate signal) → spinlock/IRQL cycle → XAudioGetVoiceCategoryVolumeChangeMask → KeReleaseSemaphore → more IRQL cycles. Then the run halts.

The halt-on-deadlock diagnostic shows tids 9 and 10 in state Ready at pc=0x824d1404, lr=0x824d22b4. They are NOT blocked on a missing kernel API; they are inside a guest-side spin-poll loop:

0x824d1400:  beqlr   cr6                          ; return if poll succeeded
0x824d1404:  cmpd    cr6, r10, r11                ; r10 vs r11
0x824d1408:  beq     cr6, 0x824D1420              ; ok-branch
0x824d140c:  mr      r31, r31                     ; nop (yield hint)
0x824d1410:  ld      r11, 0(r4)                   ; reload [r4]
0x824d1414:  cmpdi   cr6, r11, 0
0x824d1418:  bne     cr6, 0x824D1404              ; if nonzero, loop
0x824d141c:  blr

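r4 = r31+356 (the caller sets it up with addi r4, r31, 356 at 0x824d22a8). The threads are spin-polling the doubleword at [r31+356], waiting for it to reach 0 (the loop also exits through the ok-branch at 0x824D1420 if the loaded value equals r10).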
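The exit conditions can be read straight off the listing. The model below is a hypothetical host-side sketch (`PollExit` and `model_spin_poll` are illustrative names): it treats each `ld r11, 0(r4)` as the next value in a slice of 64-bit loads and starts at 0x824d1404, the pc where the diagnostic found the threads. What r10 holds is not established in this note.

```rust
/// How the modeled loop at 0x824d1404 terminates.
#[derive(Debug, PartialEq)]
pub enum PollExit {
    /// `beq cr6, 0x824D1420`: r11 equalled r10 (the ok-branch).
    Matched,
    /// `cmpdi cr6, r11, 0` fell through to `blr`: the reloaded [r4] was zero.
    Cleared,
    /// The supplied loads ran out with neither condition met (the wedge in ours).
    StillSpinning,
}

/// `r11_on_entry` is r11 when control reaches 0x824d1404; `loads` are the
/// successive values returned by `ld r11, 0(r4)` with r4 = r31 + 356.
pub fn model_spin_poll(r10: u64, r11_on_entry: u64, loads: &[u64]) -> PollExit {
    let mut r11 = r11_on_entry;
    let mut loads = loads.iter();
    loop {
        // 0x824d1404/08: cmpd cr6, r10, r11 + beq (signed compare; only equality matters)
        if r11 == r10 {
            return PollExit::Matched;
        }
        // 0x824d140c: mr r31, r31 is a yield hint with no architectural effect.
        // 0x824d1410: ld r11, 0(r4)
        match loads.next() {
            Some(&value) => r11 = value,
            None => return PollExit::StillSpinning,
        }
        // 0x824d1414..1c: cmpdi cr6, r11, 0; bne loops back; otherwise blr
        if r11 == 0 {
            return PollExit::Cleared;
        }
    }
}
```

Under this model, a [r31+356] that never changes from a non-zero value different from r10 keeps the thread spinning indefinitely, which is consistent with the Ready-but-stuck state the diagnostic reports.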

Classification: case IV (not I / II / III)

The plan's original classification anticipated:

  • (I) ours doesn't reach the spawn LR ← refuted: spawn fires at idx 106756/106773
  • (II) ours reaches spawn but no resume ← refuted: KeResumeThread fires (kernel.call at idx 106768/106785)
  • (III) ours's NtResumeThread is misimplemented ← refuted: resume_ref() correctly transitions Blocked(BlockReason::Suspended) → Ready; the halt diagnostic confirms the post-resume Ready state, and the first 77/17 events are identical to canary
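For reference, the transition that (III) checks can be sketched as follows. Only `resume_ref`, `Blocked`, `BlockReason::Suspended` and `Ready` come from this note. `GuestThread`, its `suspend_count` field and the trimmed-down enums are assumptions made to keep the example self-contained. The sketch follows the NT convention that a resume decrements a suspend count, wakes the thread only when the count reaches zero, and reports the previous count.

```rust
/// Why a thread is blocked; only the variant named in this note is modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockReason {
    Suspended,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadState {
    Ready,
    Blocked(BlockReason),
}

/// Hypothetical minimal thread record for this sketch.
pub struct GuestThread {
    pub state: ThreadState,
    pub suspend_count: u32,
}

impl GuestThread {
    /// ExCreateThread with suspended=true: start blocked with one suspension.
    pub fn new_suspended() -> Self {
        GuestThread {
            state: ThreadState::Blocked(BlockReason::Suspended),
            suspend_count: 1,
        }
    }

    /// Decrements the suspend count; when it reaches zero, a thread in
    /// Blocked(Suspended) becomes Ready. Returns the previous count.
    pub fn resume_ref(&mut self) -> u32 {
        let prev = self.suspend_count;
        if prev > 0 {
            self.suspend_count -= 1;
            if self.suspend_count == 0
                && self.state == ThreadState::Blocked(BlockReason::Suspended)
            {
                self.state = ThreadState::Ready;
            }
        }
        prev
    }
}
```

The nested-suspend case (count 2 → 1 stays Blocked) is what distinguishes a count-based resume from one that unconditionally marks the thread Ready; for the single ExCreateThread-suspended case traced here, both behave the same.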

Actual classification (IV): Resume succeeds. The XAudio threads start running, execute their init sequence verbatim in step with canary, and then enter a guest-side application spin-poll on [r31+356] that never resolves in ours. The producer of the 0-write to that location is part of canary's audio/GPU host-bridge chain, which AUDIT-048 only partially restored (cascades A/B/D landed; cascade C, XAudioSubmitRenderDriverFrame, remained at 0 per that audit's own assessment).

Why the 60% volume claim doesn't follow from a resume-only fix

Phase NonMatch's attribution of the "60% missing event volume" to XAudio assumed the threads simply weren't running. They ARE running: they emit identical first events, get scheduled, and reach the spin loop. The volume bottleneck is the post-init steady-state pump. Canary's 6.15 M tid=14 events come from 26,126 repeated iterations of the XAudioGetVoiceCategoryVolumeChangeMask / KeReleaseSemaphore / IRQL-cycle loop, each iteration gated on the host bridge clearing the [r31+356] flag. With the flag stuck non-zero in ours, the loop never re-enters; only the first iteration (idx 0-76) ever executes. No amount of resume-side change will unstick this.
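One way to quantify the steady-state pump is to count a per-iteration marker call per tid. The sketch below assumes a hypothetical `(tid, kind, name)` tuple per trace event. It also assumes that each pump iteration issues XAudioGetVoiceCategoryVolumeChangeMask exactly once; this note does not say how the 26,126 figure was obtained.

```rust
use std::collections::HashMap;

/// Per-tid count of `import.call <marker>` events, used as a proxy for
/// pump iterations under the one-marker-per-iteration assumption.
pub fn pump_iterations<'a, I>(events: I, marker: &str) -> HashMap<u32, usize>
where
    I: IntoIterator<Item = (u32, &'a str, &'a str)>, // (tid, kind, name)
{
    let mut counts = HashMap::new();
    for (tid, kind, name) in events {
        // Count only the import side so kernel.call/kernel.return for the
        // same call are not double-counted.
        if kind == "import.call" && name == marker {
            *counts.entry(tid).or_insert(0) += 1;
        }
    }
    counts
}
```

Under the picture above, this would report a count in the tens of thousands for canary tid=14 and a count of 1 for ours's tid=9.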

Out-of-scope for this session

Under this session's authorization, fixing the host-bridge memory write that clears [r31+356] would require touching xenia-apu/xenia-gpu host code, which is explicitly forbidden ("the host bridge is separate"). Therefore no engine change lands in this session.

Progression metric (re-validation gate, baseline-only)

Not re-measured, since there was no change to measure. The pre-existing baseline remains the C+23+absorber state (23cf4c4cbf61a577caa4118ab2308ba6 / ba5b5e07… depending on Phase D stage). The swaps and draws counts are unchanged. Per-chain matched prefixes from MEMORY.md remain:

  • main tid=6→1: 105,046 (with Phase D D-extension absorber)
  • sister chains 11/32/4/41/16: preserved

The remaining XAudio gate is AUDIT-048 cascade C: the producer of the [r31+356]=0 write. This is the part of the audio host-bridge chain that did NOT land in AUDIT-048. It likely involves:

  1. An XAudioSubmitRenderDriverFrame host-side callback that fires the buffer-complete event and, as a side effect, decrements or clears a counter at offset 356 of the XAudio client struct.
  2. KeReleaseSemaphore on a paired semaphore that produces the host-side buffer-complete notification.
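Mechanism 1 would amount to a host-side write into the guest client struct. The sketch below is hypothetical throughout: `GuestMemory`, `on_buffer_complete` and `PENDING_FIELD_OFFSET` are illustrative names, not xenia-rs or xenia-canary APIs. It assumes the field is the 64-bit doubleword the spin-poll reads with `ld`, stored big-endian as on the Xbox 360, and it models the "decrement" variant, saturating at 0; the "clear" variant would write 0 directly.

```rust
/// Hypothetical flat view of guest memory starting at `base`.
pub struct GuestMemory {
    base: u32,
    bytes: Vec<u8>,
}

impl GuestMemory {
    pub fn new(base: u32, size: usize) -> Self {
        GuestMemory { base, bytes: vec![0; size] }
    }

    fn offset(&self, addr: u32) -> usize {
        addr.checked_sub(self.base).expect("address below mapped base") as usize
    }

    /// Guest loads are big-endian, matching what `ld` sees.
    pub fn read_u64_be(&self, addr: u32) -> u64 {
        let o = self.offset(addr);
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&self.bytes[o..o + 8]);
        u64::from_be_bytes(buf)
    }

    pub fn write_u64_be(&mut self, addr: u32, value: u64) {
        let o = self.offset(addr);
        self.bytes[o..o + 8].copy_from_slice(&value.to_be_bytes());
    }
}

/// Offset polled via `addi r4, r31, 356` + `ld r11, 0(r4)`.
pub const PENDING_FIELD_OFFSET: u32 = 356;

/// Hypothetical buffer-complete side effect: decrement the counter at
/// client_ptr + 356 (saturating at 0) and return the new value. A real
/// version would also have to respect whatever synchronization the guest
/// uses, since tids 9/10 poll the field concurrently; this sketch ignores that.
pub fn on_buffer_complete(mem: &mut GuestMemory, client_ptr: u32) -> u64 {
    let addr = client_ptr + PENDING_FIELD_OFFSET;
    let remaining = mem.read_u64_be(addr).saturating_sub(1);
    mem.write_u64_be(addr, remaining);
    remaining
}
```

Whether the real producer decrements, clears, or writes something else entirely is exactly what the canary source read below is meant to settle.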

A targeted re-attack would:

  1. Read xenia-canary's apu/audio_system.cc + apu/xma_decoder.cc to find the host-side write that clears [r31+356] (likely an XAUDIO_CLIENT_STATE struct field).
  2. Mirror it in xenia-rs's xaudio.rs / audio worker context.
  3. Re-validate the cold cycle. The swaps count may move 1→2 if the audio pump reaches the renderer fence; draws will likely remain 0 (audio ≠ renderer per AUDIT-048).

That work is the AUDIT-048-cascade-C completion task, NOT the resume gate. It's the natural sister of the deferred sub_825070F0 main-gate Path P.

Per-chain delta (no change this session)

chain              pre        post       delta
tid=6→1 main       105,046    105,046    0
tid=11→11          preserved  preserved  0
tid=14→9 XAudio    41         41         0
tid=15→10 XAudio   16         16         0
tid=4→4            preserved  preserved  0
tid=16→16          preserved  preserved  0

Artifacts

  • tid6_window.json — canary tid=6 events idx 106700..108200 around the XAudio spawn burst
  • tid14_first.json / tid15_first.json — canary tid=14/15 first 120 events
  • extract_window.py — extraction script
  • escalation.md — this file