xenia-rs/audit-runs/phase-nonmatch-investigation/result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase Non-match Investigation — Results

Date: 2026-05-19
Source: xenia-canary/build-cross/bin/Windows/Debug/canary-jitter-1.jsonl (4.4 GB, 18.7M events, 28 tids)
Companion ours data: audit-runs/phase-w-wedge-reattack/ours-postfix.jsonl (121,569 events, 13 tids)
Outcome: (A), AUDIT-058/063/067 framing CONFIRMED end-to-end using new Phase A thread.create events.

TL;DR

Per Phase A thread.create events (wired in C+15-α), canary spawns 23 threads. The final 4 fire at host_ns ≈ 10.38 s with entry PCs 0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8, a shared context pointer 0xBCE251C0, and stack size 65,536; these are exactly the 4 worker entries documented in the sub_825070F0 dossier. The historical AUDIT-058/063 framing is correct: sub_825070F0 is the one-shot 4-worker fan-out that ours never reaches.

Three of those four canary workers go on to run in this trace, led by tid=28 (3.26M events, sub_82506528) and followed by tid=29 (91k events, sub_82506588) and tid=27 (36k events, sub_82506558). The fourth (0x825065B8) was never resumed in this 90 s window.

Ours emits 10 thread.create events vs canary's 23: it stops after spawn #10 (0x821748F0 at 1.727 s) and never produces another thread.create for the rest of the run. The 13 subsequent canary spawns, including the critical sub_825070F0 batch, are entirely missing.
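The spawn gap above can be rechecked by diffing the entry PCs seen in thread.create events across the two traces. The following is a minimal sketch, not the code in build_profiles.py. It assumes each jsonl line is an object whose event type sits under a `kind` key and whose entry PC sits under `payload.entry_pc`; both key names are assumptions about the schema and should be adjusted to the real one.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed event per non-empty jsonl line (streaming)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def spawn_entries(events: Iterable[dict]) -> list:
    """Entry PCs of all thread.create events, in stream order.

    ASSUMPTION: event type under "kind", entry PC under payload["entry_pc"].
    """
    return [
        ev["payload"]["entry_pc"]
        for ev in events
        if ev.get("kind") == "thread.create"
    ]


def missing_in_ours(canary: Iterable[dict], ours: Iterable[dict]) -> list:
    """Canary entry PCs that never appear in ours, in canary spawn order."""
    seen = set(spawn_entries(ours))
    return [pc for pc in spawn_entries(canary) if pc not in seen]
```

Passing open file handles through `iter_events` keeps memory flat, which matters for a 4.4 GB trace. Entry PC alone is used as the cross-engine key here because the context pointer is a runtime allocation and is not expected to match between engines.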

What canary's heavy workers DO

| tid | events | role | entry_pc |
|-----|--------|------|----------|
| 14 | 6.15 M | XAudio voice-mask poll (26,126× XAudioGetVoiceCategoryVolumeChangeMask) | 0x824D2878 (aff=16) |
| 15 | 4.78 M | XAudio sister (KeWaitForSingleObject + heavy IRQL spinlock cycle) | 0x824D2940 (aff=32) |
| 28 | 3.26 M | sub_825070F0 worker 0 (1.07 M × RtlEnterCS, 530× NtReadFile) | 0x82506528 (ctx 0xBCE251C0) |
| 16 | 1.80 M | XMA decoder (XMACreateContext, RtlEnterCS heavy) | 0x82178950 |
| 21 | 1.00 M | NtWaitForMultipleObjectsEx worker | 0x824563E0 |
| 13 | 594 k | Renderer (12,092× VdSwap, VdGetSystemCommandBuffer; 1,805× Ke/NtSetEvent; 475× wait.begin) | 0x822F1EE0 |

The biggest workers (tid=14, tid=15) are NOT sub_825070F0 workers — they are spawned much earlier (1.726/1.727s) via sub_824D2878 / sub_824D2940 and run forever as XAudio render/voice threads. Ours spawns these two suspended (1.626s) but they never receive the resume call that would activate them — ours produces 0 XAudio* events on these tids (verifiable from ours's tid event counts: ours has only 13 tids total, none with the 6M-event signature).
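The per-tid totals behind this comparison come down to a single streaming pass over the trace. A minimal sketch follows, relying only on the `tid` field that every event carries. The `dormant_tids` helper and its threshold are illustrative names introduced here, not part of the existing tooling.

```python
from collections import Counter
from typing import Iterable


def count_events_per_tid(events: Iterable[dict]) -> Counter:
    """Total events per tid; keeps one integer per tid in memory."""
    counts = Counter()
    for ev in events:
        counts[ev["tid"]] += 1
    return counts


def dormant_tids(counts: Counter, threshold: int = 100) -> list:
    """Tids that exist in the trace but emitted fewer than `threshold` events.

    The default threshold is an illustrative choice, not a measured cutoff.
    """
    return sorted(tid for tid, n in counts.items() if n < threshold)
```

Run on ours-postfix.jsonl, a thread that was spawned suspended and never resumed should show up in `dormant_tids`, while the same entry in canary carries millions of events.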

Spawn-chain summary (full table in canary-tid-profiles.md)

Three distinct fan-out clusters in canary, all from tid=6 (guest main):

  1. 1.42–1.94 s (main init burst): 10 spawns (tids 8–17). Ours matches this 1:1 in spawn count and entries.
  2. 1.94–2.15 s (secondary burst; XAM/XCONFIG helpers, tids 18–25): 8 additional spawns. Ours emits 0.
  3. 10.08–10.38 s (XAudio worker fan-out): 5 spawns (tids 26, 27, 28, 29, +1 unresumed). The last 4 are the sub_825070F0 workers. Ours emits 0.
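The three clusters can be reproduced by bucketing thread.create timestamps into fixed windows. This is a small sketch under stated assumptions: timestamps are host_ns already converted to seconds, the window bounds are the ones listed above, and a timestamp that falls on a shared boundary is assigned to the earlier window. The function name and the "unassigned" bucket are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (name, start_s, end_s); bounds copied from the cluster list above.
WINDOWS: List[Tuple[str, float, float]] = [
    ("main init burst", 1.42, 1.94),
    ("secondary burst", 1.94, 2.15),
    ("XAudio worker fan-out", 10.08, 10.38),
]


def bucket_spawns(
    times_s: Iterable[float],
    windows: List[Tuple[str, float, float]] = WINDOWS,
) -> Dict[str, int]:
    """Count spawns per window; the first matching window wins."""
    buckets = {name: 0 for name, _, _ in windows}
    buckets["unassigned"] = 0
    for t in times_s:
        for name, start, end in windows:
            if start <= t <= end:
                buckets[name] += 1
                break
        else:
            # No window matched this timestamp.
            buckets["unassigned"] += 1
    return buckets
```

Because ours and canary do not share a wallclock, the windows are only meaningful for the canary trace; for ours, the per-bucket counts are better read as "how far down the spawn sequence did the run get".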

sub_825070F0 spawn-chain confirmation (static + runtime)

  • sylpheed.db confirms sub_825070F0 lives in vtable 0x8200A208 slot 1 and 0x8200A928 slot 1 (anonymous class ANON_Class_713383D7, 7 slots each).
  • Zero vptr_writes / zero xrefs / zero indirect_dispatch_candidates reach either vtable. AUDIT-067's host-side install hypothesis is confirmed by static-analysis exhaustion.
  • Function body contains the 4 sequential addi rN, r0, 0x825065XX + bl sub_824AA388 (= ExCreateThread wrapper) blocks at PCs 0x825071F8 / 0x82507244 / 0x82507290 / 0x825072DC.
  • The 4 worker entry thunks (0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8) are uniform vtable-slot callers: each loads the function pointer at byte offset [140|144|148|152] of r3->vtable (slot indices 35/36/37/38) and dispatches via CTR.
  • Runtime ctx 0xBCE251C0 is referenced 4× in canary jsonl (the 4 spawn events) and 0× in ours-postfix.jsonl. Ours never allocates the dispatcher object that holds the 0x8200A208 vtable.
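The "uniform thunk" reading can be sanity-checked with plain arithmetic on the numbers above: with 4-byte guest pointers, byte offsets 140 to 152 map to slots 35 to 38, and the four entry PCs are evenly spaced. A short sketch, using only values quoted in this section (the helper names are illustrative):

```python
PTR_SIZE = 4  # guest pointers are 32-bit, so one vtable slot is 4 bytes

WORKER_ENTRIES = (0x82506528, 0x82506558, 0x82506588, 0x825065B8)
VTABLE_BYTE_OFFSETS = (140, 144, 148, 152)


def slot_index(byte_offset: int) -> int:
    """Convert a vtable byte offset into a slot index."""
    if byte_offset % PTR_SIZE:
        raise ValueError(f"offset {byte_offset} is not pointer-aligned")
    return byte_offset // PTR_SIZE


def strides(addresses: tuple) -> list:
    """Distance in bytes between consecutive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


print([slot_index(o) for o in VTABLE_BYTE_OFFSETS])  # [35, 36, 37, 38]
print([hex(s) for s in strides(WORKER_ENTRIES)])     # ['0x30', '0x30', '0x30']
```

A constant 0x30 stride is 12 four-byte PowerPC instructions per thunk, which is consistent with four copies of the same stub differing only in the slot they load.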

Wake/signal chain to wedge (partial)

  • Phase W: ours's wedge handle 0x12d0 (Event/Auto waited at sub_821CB030+0x1B0 on tid=13 the renderer); main tid=1 join-waits on Thread(id=13) at sub_82173990+0x2D4.
  • Canary tid=13 (renderer) creates 10 handles, calls Ke/NtSetEvent 1,805×, calls wait.begin 475× — it is alive and signaling. Earliest tid=13 handle.create at 2.396 s; explosion at 10.7 s once the sub_825070F0 workers come online.
  • Canary tid=13's signals correlate with the sub_825070F0 worker batch coming up at 10.7 s (tid=27/28/29 first-events are all 10.705 s). Without those workers, ours's renderer has no producer to wake the event it waits on, and main joins-on-renderer → full deadlock.
  • Full SID-level mapping of "which canary worker fires the NtSetEvent that wakes the renderer's wait" was not attempted (handle IDs and SIDs don't cross-correlate run-to-run; would require source-level read of sub_821CB030). The class of producer (sub_825070F0 workers) is identified.

Reading-error / methodology notes

  • #16 EH-handler caution: the sub_824AA388 spawn helper is reached via bl (direct call, not via EH unwind) — no risk of misanchoring on a catch handler.
  • #28 framing: Phase A thread.create.payload.parent_tid redundantly equals the event's tid field (per event_log.cc:312-326: emitted ON the parent thread's stream, child tid is NOT in payload). Child-tid is recovered by FIFO matching to first_event[tid] chronologically.
  • #30 cross-engine SIDs: ours's wedge handle SID d5e23609d3948568 does not appear in canary because these are worker-local Event handles, not process-global dispatchers; only the shared-global recipe is scheduling-invariant.
  • Cold-run jitter was not a factor here — only one canary jsonl was processed; the spawn-chain identification is robust because the SID-independent entry_pc + ctx_ptr + stack_size triplet is effectively a content-addressed fingerprint that survives reruns.
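The FIFO child-tid recovery described under #28 can be sketched as below. This is one plausible reading of the rule, not a copy of the logic in build_profiles.py: spawns are walked in host_ns order, and each one takes the earliest not-yet-assigned tid whose first event is at or after the spawn time. A spawn with no such tid (for example a thread that was never resumed) maps to None. The function and argument names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def fifo_match(
    creates: List[Tuple[int, str]],
    first_event_ns: Dict[int, int],
) -> List[Tuple[str, Optional[int]]]:
    """Pair each spawn with an inferred child tid.

    creates:        (host_ns, entry_pc) per thread.create event.
    first_event_ns: tid -> host_ns of that tid's first event.
    """
    pending = sorted(first_event_ns.items(), key=lambda kv: kv[1])
    result = []
    i = 0
    for host_ns, entry_pc in sorted(creates, key=lambda c: c[0]):
        # Skip tids that were already emitting events before this spawn.
        while i < len(pending) and pending[i][1] < host_ns:
            i += 1
        if i < len(pending):
            result.append((entry_pc, pending[i][0]))
            i += 1
        else:
            result.append((entry_pc, None))
    return result
```

The inference is only as good as the assumption that threads start emitting in spawn order. Threads spawned suspended and resumed out of order would be mislabelled, which is why the entry_pc + ctx_ptr + stack_size triplet remains the stronger identifier.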

Outcome: (A) — historical framing confirmed

The Phase A thread.create data directly corroborates AUDIT-058/063/067:

  1. sub_825070F0 IS the function that spawns the 4 sub_82506528-family workers (confirmed in canary trace, never fires in ours).
  2. The dispatcher class ANON_Class_713383D7 whose vtable 0x8200A208 slot 1 points at sub_825070F0 has its vtable installed via a path invisible to static guest analysis (AUDIT-067 unresolved).
  3. The HEAVY workers (tid=14/15 → XAudio; tid=16 → XMA; tid=21 → NtWait worker) are spawned earlier via different entries (sub_824D2878, sub_824D2940, sub_82178950, sub_824563E0) but are all suspended; their resume gate is also missing in ours (those threads exist in ours-postfix but emit < 100 events each, all from the spawn-time bookkeeping).

Re-attempt the deferred AUDIT-067 / AUDIT-068 host-side vptr install probe with current tooling. Specific subtasks:

  1. Identify the allocator that produces the ANON_Class_713383D7 instance with vtable 0x8200A208.

    • Static search: which fn loads 0x8200A208 as a constant? (database says nothing — confirm with a fresh ghidra script that includes split-pair detection.)
    • Runtime probe: instrument both engines to log every stw vptr, 0(obj) where vptr ∈ {0x8200A208, 0x8200A928}. In canary, this MUST fire ≥ 1× before the 10.38 s spawn burst; in ours, it presumably never fires. Identify the PC.
  2. If host-side: trace through the kernel exports table. The most likely path is one of XAudio2*Create, XMACreateContext, XMPCreate*, or an undocumented XAudio API. Per the tid=14 call profile, XAudioGetVoiceCategoryVolumeChangeMask is the only XAudio API actively touched — look at its dossier (or canary's xboxkrnl_audio.cc / xam_audio.cc) for object-construction side-effects.

  3. Alternative: identify which Sylpheed API call triggers the sub_825070F0 firing at 10.38 s. Canary main (tid=6) does the lead-up work at host_ns ≈ 10.30–10.38 s; in the ~300 ms before that, tid=6 has activity that ours doesn't reach. Diff canary's tid=6 event stream in the window [10 s, 10.4 s] against ours's tid=1 in whatever its wallclock-equivalent window is. Ours doesn't reach 10 s wallclock either, so the divergence is upstream of that window.

  4. Secondary attack: the XAudio tid=14/15 resume gate. Those threads are spawned suspended in BOTH engines (canary at 1.726/1.727 s, ours at 1.626 s); canary resumes them within ~1 ms and they emit 11 M events combined. What guest call resumes them in canary? Most likely a cross-thread NtResumeThread on the tid=14 handle, which Sylpheed presumably reaches via an XAudio2 API. If we can identify the resume call site in canary and figure out why ours doesn't reach it, we unblock 60% of the missing event volume (XAudio) independent of sub_825070F0.
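For the split-pair detection mentioned in subtask 1, the static search needs the 16-bit halves that a PowerPC `lis` + `addi` or `lis` + `ori` sequence would carry for each vtable address. Because `addi` sign-extends its immediate, the high half is bumped by one whenever the low half has bit 15 set, so searching for the naive high half 0x8200 alone would miss the `addi` form. A minimal sketch of that arithmetic (function names are illustrative):

```python
def split_pair(value: int) -> dict:
    """Immediate halves for the two common ways to build a 32-bit constant.

    lis+addi: addi sign-extends the low half, so the high half compensates.
    lis+ori:  ori zero-extends the low half, so the halves are the plain split.
    """
    lo = value & 0xFFFF
    lo_signed = lo - 0x10000 if lo & 0x8000 else lo
    hi_addi = ((value - lo_signed) >> 16) & 0xFFFF
    hi_ori = (value >> 16) & 0xFFFF
    return {"lis+addi": (hi_addi, lo_signed), "lis+ori": (hi_ori, lo)}


def rebuild(hi: int, lo_signed: int) -> int:
    """Recombine a lis/addi pair into the 32-bit constant it produces."""
    return ((hi << 16) + lo_signed) & 0xFFFFFFFF


for vtable in (0x8200A208, 0x8200A928):
    pair = split_pair(vtable)
    hi, lo = pair["lis+addi"]
    print(hex(vtable), "lis", hex(hi), "addi", lo, "| ori form", pair["lis+ori"])
```

For both 0x8200A208 and 0x8200A928 the `addi` form uses a high half of 0x8201 with a negative low half. The same sign-extended displacement applies when the low half is folded into a load or store displacement instead of an `addi`, so a script should match on the high half first and then check the displacement of the following instructions.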

Artifacts

All artifacts in xenia-rs/audit-runs/phase-nonmatch-investigation/:

  • build_profiles.py — streaming jsonl profile builder (~200 LOC)
  • tid-event-counts.csv — per-tid totals (28 rows)
  • tid-top-calls.txt — per-tid top-20 kernel.call names
  • tid-ntset-handles.txt — per-tid Ke/NtSetEvent handle distribution (EMPTY — canary's kernel.call payloads have args:{} for NtSetEvent; handle is in resolved-arg JSON not exposed in current args_resolved. Not needed for Outcome (A) determination. Future Phase: extend Phase A kernel.call to also surface ALL register args in args for diff-tool consumption.)
  • tid-wait-handles.txt — per-tid wait.begin handle distribution (EMPTY for same reason: the wait.begin events I sampled have raw_handle_id=None because the payload uses a handle_semantic_ids array, not a single raw_handle_id. The handle.create map is populated correctly — see handle-create.json.)
  • thread-creates.json — canary thread.create payloads keyed by child_tid (note: child_tid is FIFO-inferred, see profiles doc)
  • thread-exits.json — canary thread.exit events (3 in this trace: tid=17/18/26)
  • excreate-events.json — all ExCreateThread import.call events with idx/host_ns
  • create-thread-events.json — full thread.create event payloads
  • handle-create.json — all handle.create with raw_handle, sid, object_type
  • spawn-chain.json — auto-correlated spawn → ExCreateThread linkage
  • canary-tid-profiles.md — human-readable per-tid catalogue + spawn-chain tables
  • result.md — this file