xenia-rs/audit-runs/phase-nonmatch-investigation/result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase Non-match Investigation — Results
**Date**: 2026-05-19
**Source**: `xenia-canary/build-cross/bin/Windows/Debug/canary-jitter-1.jsonl` (4.4 GB, 18.7M events, 28 tids)
**Companion ours data**: `audit-runs/phase-w-wedge-reattack/ours-postfix.jsonl` (121,569 events, 13 tids)
**Outcome**: **(A) — AUDIT-058/063/067 framing CONFIRMED** end-to-end using new Phase A thread.create events.
## TL;DR
Per Phase A `thread.create` events (wired in C+15-α), canary spawns **23 threads**; the final 4
fire at `host_ns ≈ 10.38 s` and have entry PCs `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
with shared context `0xBCE251C0` and stack 65,536 — these are **exactly** the 4 worker entries
documented in the `sub_825070F0` dossier. The historical AUDIT-058/063 framing is correct:
`sub_825070F0` is the one-shot 4-worker fan-out that ours never reaches.
Three of those four canary workers go on to dominate the trace:
**tid=28 (3.26M events, sub_82506528), tid=27 (36k events, sub_82506558), tid=29 (91k events, sub_82506588)**
— the fourth (`0x825065B8`) was never resumed in this 90s window.
Ours emits **10 thread.create** events vs canary's 23, stops after spawn #10 (`0x821748F0` at 1.727s),
and **never produces another thread.create** for the rest of the run. The 13 subsequent canary
spawns including the critical sub_825070F0 batch are entirely missing.
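
The spawn comparison above can be sketched as a multiset difference over the `entry_pc` + `ctx_ptr` + `stack_size` triplet that the methodology notes use as a run-stable fingerprint. This is a minimal sketch, not the logic in `build_profiles.py`: the payload key names and the `missing_spawns` helper are assumptions made for illustration, and the sample input contains only the four worker spawns quoted in this section.

```python
from collections import Counter

def fingerprint(payload):
    # Assumed payload keys; the real thread.create schema may name them differently.
    return (payload["entry_pc"], payload["ctx_ptr"], payload["stack_size"])

def missing_spawns(canary_creates, ours_creates):
    """Return canary spawn fingerprints that have no counterpart in ours.

    Counter subtraction keeps multiplicity, so two spawns with the same
    fingerprint in canary need two matches in ours to cancel out.
    """
    remaining = Counter(map(fingerprint, canary_creates))
    remaining -= Counter(map(fingerprint, ours_creates))
    return sorted(remaining.elements())

# The four sub_825070F0 worker spawns as described in this report.
workers = [
    {"entry_pc": pc, "ctx_ptr": 0xBCE251C0, "stack_size": 65536}
    for pc in (0x82506528, 0x82506558, 0x82506588, 0x825065B8)
]

# Ours never emits these spawns, so all four are reported as missing.
missing = missing_spawns(workers, [])
for entry_pc, ctx_ptr, stack_size in missing:
    print(f"missing spawn entry=0x{entry_pc:08X} ctx=0x{ctx_ptr:08X} stack={stack_size}")
```

Matching on the triplet rather than on tids or handle IDs avoids depending on values that differ from run to run.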
## What canary's heavy workers DO
| tid | events | role | entry_pc |
|----:|-------:|------|----------|
| 14 | **6.15 M** | **XAudio voice-mask poll** (26,126× XAudioGetVoiceCategoryVolumeChangeMask) | `0x824D2878` (aff=16) |
| 15 | **4.78 M** | XAudio sister (KeWaitForSingleObject + heavy IRQL spinlock cycle) | `0x824D2940` (aff=32) |
| 28 | **3.26 M** | **sub_825070F0 worker 0** (1.07 M × RtlEnterCS, 530× NtReadFile) | `0x82506528` (ctx `0xBCE251C0`) |
| 16 | 1.80 M | XMA decoder (`XMACreateContext`, RtlEnterCS heavy) | `0x82178950` |
| 21 | 1.00 M | NtWaitForMultipleObjectsEx worker | `0x824563E0` |
| 13 | 594 k | **Renderer** (12,092× VdSwap, VdGetSystemCommandBuffer; 1,805× Ke/NtSetEvent; 475× wait.begin) | `0x822F1EE0` |
The **biggest workers (tid=14, tid=15)** are NOT sub_825070F0 workers — they are spawned much earlier (1.726/1.727s)
via `sub_824D2878 / sub_824D2940` and run forever as XAudio render/voice threads. **Ours spawns these two
suspended (1.626s) but they never receive the resume call that would activate them** — ours produces 0
XAudio* events on these tids (verifiable from ours's tid event counts: ours has only 13 tids total, none
with the 6M-event signature).
## Spawn-chain summary (full table in `canary-tid-profiles.md`)
Three distinct fan-out clusters in canary, all from tid=6 (guest main):
1. **1.42–1.94 s, main init burst**: 10 spawns (tids 8–17). Ours matches this 1:1 in spawn count and entries.
2. **1.94–2.15 s, secondary burst** (XAM/XCONFIG helpers, tids 18–25): 8 additional spawns. **Ours emits 0**.
3. **10.08–10.38 s, XAudio worker fan-out**: 5 spawns (tids 26, 27, 28, 29, +1 unresumed). The last 4
are the `sub_825070F0` workers. **Ours emits 0**.
## sub_825070F0 spawn-chain confirmation (static + runtime)
- `sylpheed.db` confirms `sub_825070F0` lives in `vtable 0x8200A208 slot 1` and `0x8200A928 slot 1`
(anonymous class `ANON_Class_713383D7`, 7 slots each).
- **Zero `vptr_writes` / zero `xrefs` / zero `indirect_dispatch_candidates`** reach either vtable.
AUDIT-067's host-side install hypothesis is confirmed by static-analysis exhaustion.
- Function body contains the 4 sequential `addi rN, r0, 0x825065XX` + `bl sub_824AA388` (= ExCreateThread
wrapper) blocks at PCs `0x825071F8 / 0x82507244 / 0x82507290 / 0x825072DC`.
- The 4 worker entry thunks (`0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`) are uniform vtable-slot
callers: each loads `r3->vtable->[140|144|148|152]` (byte offsets, i.e. slot indices 35/36/37/38) and
dispatches via CTR.
- Runtime ctx `0xBCE251C0` is referenced **4× in canary jsonl** (the 4 spawn events) and **0× in
ours-postfix.jsonl**. Ours never allocates the dispatcher object that holds the `0x8200A208` vtable.
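
The address arithmetic in the bullets above can be checked mechanically. The sketch below assumes only that guest pointers are 4 bytes wide (32-bit PowerPC guest); the helper name `slot_index` is hypothetical. It converts the vtable byte offsets to slot indices and confirms that the worker thunks and the spawn call sites are evenly spaced.

```python
POINTER_SIZE = 4  # 32-bit guest: each vtable slot is one 4-byte function pointer

def slot_index(byte_offset, pointer_size=POINTER_SIZE):
    """Convert a vtable byte offset into a slot index."""
    if byte_offset % pointer_size:
        raise ValueError(f"offset {byte_offset} is not pointer-aligned")
    return byte_offset // pointer_size

def strides(addresses):
    """Set of distances between consecutive addresses."""
    return {b - a for a, b in zip(addresses, addresses[1:])}

vtable_offsets = [140, 144, 148, 152]
worker_entries = [0x82506528, 0x82506558, 0x82506588, 0x825065B8]
spawn_call_pcs = [0x825071F8, 0x82507244, 0x82507290, 0x825072DC]

slots = [slot_index(o) for o in vtable_offsets]
entry_strides = strides(worker_entries)
call_strides = strides(spawn_call_pcs)

print(slots)                             # slot indices for the four thunks
print([hex(s) for s in entry_strides])   # spacing between worker thunks
print([hex(s) for s in call_strides])    # spacing between spawn blocks
```

A single stride in each set is consistent with four identically shaped thunks and four identically shaped spawn blocks.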
## Wake/signal chain to wedge (partial)
- Phase W: ours's wedge handle is `0x12d0`, an `Event/Auto` that tid=13 (the renderer) waits on at
`sub_821CB030+0x1B0`; main tid=1 join-waits on `Thread(id=13)` at `sub_82173990+0x2D4`.
- Canary tid=13 (renderer) creates **10 handles**, calls Ke/NtSetEvent **1,805×**, and calls wait.begin
**475×**, so it is alive and signaling. Its earliest handle.create is at 2.396 s; activity explodes at
10.7 s **once the sub_825070F0 workers come online**.
- Canary tid=13's signals correlate with the sub_825070F0 worker batch coming up at 10.7 s (tid=27/28/29
first-events are all 10.705 s). Without those workers, ours's renderer has no producer to wake the
event it waits on, and main joins-on-renderer → full deadlock.
- Full SID-level mapping of "which canary worker fires the NtSetEvent that wakes the renderer's wait"
was not attempted (handle IDs and SIDs don't cross-correlate run-to-run; would require source-level
read of `sub_821CB030`). The class of producer (`sub_825070F0` workers) is identified.
## Reading-error / methodology notes
- **#16 EH-handler caution**: the `sub_824AA388` spawn helper is reached via `bl` (direct call, not via
EH unwind) — no risk of misanchoring on a catch handler.
- **#28 framing**: Phase A `thread.create.payload.parent_tid` redundantly equals the event's `tid` field
(per `event_log.cc:312-326`: emitted ON the parent thread's stream, child tid is NOT in payload).
Child-tid is recovered by FIFO matching to `first_event[tid]` chronologically.
- **#30 cross-engine SIDs**: ours's wedge handle SID `d5e23609d3948568` does not appear in canary because
these are worker-local Event handles, not process-global dispatchers; only the shared-global recipe
is scheduling-invariant.
- **Cold-run jitter** was not a factor here — only one canary jsonl was processed; the spawn-chain
identification is robust because the SID-independent entry_pc + ctx_ptr + stack_size triplet is
effectively a content-addressed fingerprint that survives reruns.
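
The FIFO child-tid recovery described in the #28 note can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `build_profiles.py`: the `host_ns` key on each create record, the `first_event_ns` mapping, and the function name are hypothetical, and the timestamps in the usage lines are synthetic.

```python
from collections import deque

def infer_child_tids(creates, first_event_ns):
    """Pair each thread.create with the next tid that starts emitting events.

    creates:        list of dicts, each with a 'host_ns' timestamp (assumed key).
    first_event_ns: {tid: host_ns of that tid's first event}.
    Returns a list of (create, child_tid) in chronological order; child_tid is
    None when no remaining tid first appears at or after the create.
    """
    pending = deque(sorted(first_event_ns.items(), key=lambda kv: kv[1]))
    pairs = []
    for create in sorted(creates, key=lambda c: c["host_ns"]):
        # A tid whose first event predates this create cannot be its child.
        while pending and pending[0][1] < create["host_ns"]:
            pending.popleft()
        child = pending.popleft()[0] if pending else None
        pairs.append((create, child))
    return pairs

# Synthetic timestamps, for illustration only.
creates = [{"host_ns": 100}, {"host_ns": 200}, {"host_ns": 300}]
first_seen = {6: 10, 8: 150, 9: 250}
pairs = infer_child_tids(creates, first_seen)
print([(c["host_ns"], tid) for c, tid in pairs])
```

FIFO pairing is a heuristic: a thread that is created suspended and resumed after a later sibling has already started would be paired with the wrong create, which is why the inferred child tids should be cross-checked against the entry_pc fingerprint where possible.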
## Outcome: (A) — historical framing confirmed
The Phase A `thread.create` data directly corroborates AUDIT-058/063/067:
1. `sub_825070F0` IS the function that spawns the 4 sub_82506528-family workers (confirmed in canary
trace, never fires in ours).
2. The dispatcher class `ANON_Class_713383D7` whose vtable `0x8200A208` slot 1 points at `sub_825070F0`
has its vtable installed via a path invisible to static guest analysis (AUDIT-067 unresolved).
3. The HEAVY workers (tid=14/15 → XAudio; tid=16 → XMA; tid=21 → NtWait worker) are spawned **earlier**
via different entries (`sub_824D2878`, `sub_824D2940`, `sub_82178950`, `sub_824563E0`) but are all
suspended; their resume gate is also missing in ours (those threads exist in ours-postfix but emit
< 100 events each, all from the spawn-time bookkeeping).
## Recommended next attack target
**Re-attempt the deferred AUDIT-067 / AUDIT-068 host-side vptr install probe** with current tooling.
Specific subtasks:
1. **Identify the allocator that produces the `ANON_Class_713383D7` instance** with vtable `0x8200A208`.
- Static search: which fn loads `0x8200A208` as a constant? (database says nothing — confirm with a
fresh ghidra script that includes split-pair detection.)
- Runtime probe: instrument both engines to log every `stw vptr, 0(obj)` where `vptr ∈
{0x8200A208, 0x8200A928}`. In canary, this MUST fire ≥ 1× before the 10.38 s spawn burst;
in ours, it presumably never fires. Identify the PC.
2. **If host-side**: trace through the kernel exports table. The most likely path is one of
`XAudio2*Create`, `XMACreateContext`, `XMPCreate*`, or an undocumented `XAudio` API. Per the tid=14
call profile, `XAudioGetVoiceCategoryVolumeChangeMask` is the only XAudio API actively touched —
look at its dossier (or canary's `xboxkrnl_audio.cc` / `xam_audio.cc`) for object-construction
side-effects.
3. **Alternative**: identify which Sylpheed API call is the **trigger** for the 10.38 s `sub_825070F0`
firing. Canary main (tid=6) at host_ns ≈ 10.30–10.38 s does the work that leads up to this; ~300 ms
before, tid=6 has activity that ours doesn't reach. Diff tid=6's event stream in canary against ours's
tid=1 over the window [10 s, 10.4 s] (canary) and its wallclock equivalent in ours. Note that ours
doesn't reach 10 s wallclock either, so the divergence is upstream.
4. **Secondary attack**: the XAudio tid=14/15 resume gate. Those threads are spawned suspended in
BOTH engines (canary at 1.726/1.727 s, ours at 1.626 s); canary resumes them within ~1 ms and they
emit 11 M events combined. **What guest call resumes them in canary?** The expected mechanism is a
cross-thread NtResumeThread on the tid=14 handle, which Sylpheed presumably reaches via an XAudio2 API.
If we can identify the resume call site in canary and figure out why ours doesn't reach it, we unblock
roughly 60% of the missing event volume (XAudio) independent of `sub_825070F0`.
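
The "split-pair detection" mentioned in subtask 1 matters because a 32-bit constant such as `0x8200A208` is normally materialized by two PowerPC instructions, and the high half differs depending on the second instruction. `ori` zero-extends its 16-bit immediate, while `addi` sign-extends it, so when the low half is `>= 0x8000` the preceding `lis` must load the high half plus one. The sketch below computes both immediate pairs a static search would need to look for; the function name and return shape are illustrative, not part of any existing script.

```python
def split_pairs(addr):
    """Return the (lis_imm, second_imm) pairs that materialize a 32-bit constant.

    'lis_ori':  lis rD, hi      ; ori  rD, rD, lo   (lo is zero-extended)
    'lis_addi': lis rD, hi_adj  ; addi rD, rD, lo_s (lo_s is sign-extended)
    """
    hi = (addr >> 16) & 0xFFFF
    lo = addr & 0xFFFF
    negative_lo = bool(lo & 0x8000)
    lo_signed = lo - 0x10000 if negative_lo else lo
    hi_adjusted = (hi + 1) & 0xFFFF if negative_lo else hi
    return {"lis_ori": (hi, lo), "lis_addi": (hi_adjusted, lo_signed)}

def rebuild_from_addi(hi_adjusted, lo_signed):
    """Recompute the constant from a lis/addi pair, as the CPU would."""
    return ((hi_adjusted << 16) + lo_signed) & 0xFFFFFFFF

for vtable in (0x8200A208, 0x8200A928):
    pairs = split_pairs(vtable)
    hi_adj, lo_s = pairs["lis_addi"]
    print(f"0x{vtable:08X}: lis 0x{pairs['lis_ori'][0]:04X} + ori 0x{pairs['lis_ori'][1]:04X}"
          f"  |  lis 0x{hi_adj:04X} + addi {lo_s}")
```

Both target vtables have a low half above `0x8000`, so a search that only looks for `lis 0x8200` would miss the `lis`/`addi` form entirely.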
## Artifacts
All artifacts in `xenia-rs/audit-runs/phase-nonmatch-investigation/`:
- `build_profiles.py` — streaming jsonl profile builder (~200 LOC)
- `tid-event-counts.csv` — per-tid totals (28 rows)
- `tid-top-calls.txt` — per-tid top-20 kernel.call names
- `tid-ntset-handles.txt` — per-tid Ke/NtSetEvent handle distribution **(EMPTY — canary's
kernel.call payloads have `args:{}` for NtSetEvent; handle is in resolved-arg JSON not exposed
in current `args_resolved`. Not needed for Outcome (A) determination. Future Phase: extend
Phase A `kernel.call` to also surface ALL register args in `args` for diff-tool consumption.)**
- `tid-wait-handles.txt` — per-tid wait.begin handle distribution **(EMPTY for same reason: the
`wait.begin` events I sampled have `raw_handle_id=None` because the payload uses a
`handle_semantic_ids` array, not a single `raw_handle_id`. The handle.create map is populated
correctly — see `handle-create.json`.)**
- `thread-creates.json` — canary thread.create payloads keyed by child_tid (note: child_tid is FIFO-inferred, see profiles doc)
- `thread-exits.json` — canary thread.exit events (3 in this trace: tid=17/18/26)
- `excreate-events.json` — all ExCreateThread import.call events with idx/host_ns
- `create-thread-events.json` — full thread.create event payloads
- `handle-create.json` — all handle.create with raw_handle, sid, object_type
- `spawn-chain.json` — auto-correlated spawn → ExCreateThread linkage
- `canary-tid-profiles.md` — human-readable per-tid catalogue + spawn-chain tables
- `result.md` — this file
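
For readers without access to `build_profiles.py`, the per-tid totals in `tid-event-counts.csv` can be approximated with a streaming pass like the one below. This is a hedged sketch rather than the actual script: it assumes one JSON object per line with a top-level `tid` field, and the function name is hypothetical. Reading line by line keeps memory flat even for a multi-gigabyte trace.

```python
import json
from collections import Counter

def count_events_per_tid(lines):
    """Count events per tid from an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        counts[json.loads(line)["tid"]] += 1
    return counts

# Tiny synthetic sample; a real run would pass an open file handle instead,
# e.g. `with open(path) as f: count_events_per_tid(f)`.
sample = [
    '{"tid": 14, "kind": "kernel.call"}',
    '{"tid": 14, "kind": "kernel.call"}',
    '',
    '{"tid": 13, "kind": "wait.begin"}',
]
counts = count_events_per_tid(sample)
for tid, n in counts.most_common():
    print(tid, n)
```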