handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,156 @@
# Phase Non-match Investigation — Results
**Date**: 2026-05-19
**Source**: `xenia-canary/build-cross/bin/Windows/Debug/canary-jitter-1.jsonl` (4.4 GB, 18.7M events, 28 tids)
**Companion ours data**: `audit-runs/phase-w-wedge-reattack/ours-postfix.jsonl` (121,569 events, 13 tids)
**Outcome**: **(A) — AUDIT-058/063/067 framing CONFIRMED** end-to-end using new Phase A thread.create events.
## TL;DR
Per Phase A `thread.create` events (wired in C+15-α), canary spawns **23 threads**; the final 4
fire at `host_ns ≈ 10.38 s` and have entry PCs `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
with shared context `0xBCE251C0` and stack 65,536 — these are **exactly** the 4 worker entries
documented in the `sub_825070F0` dossier. The historical AUDIT-058/063 framing is correct:
`sub_825070F0` is the one-shot 4-worker fan-out that ours never reaches.
Three of those four canary workers go on to dominate the trace:
**tid=28 (3.26M events, sub_82506528), tid=27 (36k events, sub_82506558), tid=29 (91k events, sub_82506588)**
— the fourth (`0x825065B8`) was never resumed in this 90s window.
Ours emits **10 thread.create** events vs canary's 23, stops after spawn #10 (`0x821748F0` at 1.727s),
and **never produces another thread.create** for the rest of the run. The 13 subsequent canary
spawns including the critical sub_825070F0 batch are entirely missing.
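
The spawn comparison above can be reproduced with a short streaming pass over both traces. The following is a minimal sketch, not the project's actual tooling: it assumes each jsonl line is one event object with a hypothetical `kind` field and a `payload` holding `entry_pc`, `ctx_ptr` and `stack_size`; adjust the names to the real schema.

```python
import json
from typing import Iterable, List, Tuple

Fingerprint = Tuple[object, object, object]


def spawn_fingerprints(lines: Iterable[str]) -> List[Fingerprint]:
    """Collect (entry_pc, ctx_ptr, stack_size) for every thread.create event.

    Field names are assumptions about the trace schema, not confirmed ones.
    """
    fingerprints = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "thread.create":
            continue
        payload = event.get("payload", {})
        fingerprints.append(
            (payload.get("entry_pc"), payload.get("ctx_ptr"), payload.get("stack_size"))
        )
    return fingerprints


def missing_spawns(reference: List[Fingerprint], candidate: List[Fingerprint]) -> List[Fingerprint]:
    """Return reference spawns whose fingerprint never shows up in candidate."""
    seen = set(candidate)
    return [fp for fp in reference if fp not in seen]
```

Called as `missing_spawns(spawn_fingerprints(open(canary_path)), spawn_fingerprints(open(ours_path)))`, this lists the canary spawns that ours never performs, in canary's spawn order.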
## What canary's heavy workers DO
| tid | events | role | entry_pc |
|----:|-------:|------|----------|
| 14 | **6.15 M** | **XAudio voice-mask poll** (26,126× XAudioGetVoiceCategoryVolumeChangeMask) | `0x824D2878` (aff=16) |
| 15 | **4.78 M** | XAudio sister (KeWaitForSingleObject + heavy IRQL spinlock cycle) | `0x824D2940` (aff=32) |
| 28 | **3.26 M** | **sub_825070F0 worker 0** (1.07 M × RtlEnterCS, 530× NtReadFile) | `0x82506528` (ctx `0xBCE251C0`) |
| 16 | 1.80 M | XMA decoder (`XMACreateContext`, RtlEnterCS heavy) | `0x82178950` |
| 21 | 1.00 M | NtWaitForMultipleObjectsEx worker | `0x824563E0` |
| 13 | 594 k | **Renderer** (12,092× VdSwap, VdGetSystemCommandBuffer; 1,805× Ke/NtSetEvent; 475× wait.begin) | `0x822F1EE0` |
The **biggest workers (tid=14, tid=15)** are NOT sub_825070F0 workers — they are spawned much earlier (1.726/1.727s)
via `sub_824D2878 / sub_824D2940` and run forever as XAudio render/voice threads. **Ours spawns these two
suspended (1.626s) but they never receive the resume call that would activate them** — ours produces 0
XAudio* events on these tids (verifiable from ours's tid event counts: ours has only 13 tids total, none
with the 6M-event signature).
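
The "spawned but never active" check can be expressed as a per-tid event count followed by a threshold filter. This is an illustrative sketch rather than the contents of `build_profiles.py`; it relies only on each event carrying a `tid` field, and the threshold of 100 is taken from the "< 100 events each" observation in the Outcome section.

```python
import json
from collections import Counter
from typing import Iterable, List


def tid_event_counts(lines: Iterable[str]) -> Counter:
    """Count events per tid while streaming, so multi-GB traces never sit in memory."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["tid"]] += 1
    return counts


def near_silent_tids(counts: Counter, threshold: int = 100) -> List[int]:
    """Tids that exist in the trace but emit fewer than `threshold` events."""
    return sorted(tid for tid, n in counts.items() if n < threshold)
```

Running `near_silent_tids(tid_event_counts(open(path)))` on the ours trace would surface threads that were created but never did meaningful work.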
## Spawn-chain summary (full table in `canary-tid-profiles.md`)
Three distinct fan-out clusters in canary, all from tid=6 (guest main):
1. **1.42–1.94 s, main init burst**: 10 spawns (tids 8–17). Ours matches this 1:1 in spawn count and entries.
2. **1.94–2.15 s, secondary burst** (XAM/XCONFIG helpers, tids 18–25): 8 additional spawns. **Ours emits 0**.
3. **10.08–10.38 s, XAudio worker fan-out**: 5 spawns (tids 26, 27, 28, 29, +1 unresumed). The last 4
are the `sub_825070F0` workers. **Ours emits 0**.
## sub_825070F0 spawn-chain confirmation (static + runtime)
- `sylpheed.db` confirms `sub_825070F0` lives in `vtable 0x8200A208 slot 1` and `0x8200A928 slot 1`
(anonymous class `ANON_Class_713383D7`, 7 slots each).
- **Zero `vptr_writes` / zero `xrefs` / zero `indirect_dispatch_candidates`** reach either vtable.
AUDIT-067's host-side install hypothesis is confirmed by static-analysis exhaustion.
- Function body contains the 4 sequential `addi rN, r0, 0x8250652X` + `bl sub_824AA388` (= ExCreateThread
wrapper) blocks at PCs `0x825071F8 / 0x82507244 / 0x82507290 / 0x825072DC`.
- The 4 worker entry thunks (`0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`) are uniform vtable-slot
callers: each loads `r3->vtable->[140|144|148|152]` and dispatches via CTR (offsets 35/36/37/38).
- Runtime ctx `0xBCE251C0` is referenced **4× in canary jsonl** (the 4 spawn events) and **0× in
ours-postfix.jsonl**. Ours never allocates the dispatcher object that holds the `0x8200A208` vtable.
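
The "uniform" claim for the thunks and spawn blocks can be sanity-checked with plain address arithmetic. The sketch below uses only the addresses and byte offsets quoted above; the 4-byte pointer size is an assumption about the 32-bit guest, and it is consistent with the quoted slot indices 35 to 38.

```python
WORKER_ENTRIES = [0x82506528, 0x82506558, 0x82506588, 0x825065B8]
SPAWN_CALL_PCS = [0x825071F8, 0x82507244, 0x82507290, 0x825072DC]
VTABLE_BYTE_OFFSETS = [140, 144, 148, 152]


def strides(addrs):
    """Differences between consecutive addresses."""
    return [b - a for a, b in zip(addrs, addrs[1:])]


def slot_indices(byte_offsets, pointer_size=4):
    """Convert vtable byte offsets to slot numbers (assumes 4-byte guest pointers)."""
    return [off // pointer_size for off in byte_offsets]


print([hex(s) for s in strides(WORKER_ENTRIES)])  # ['0x30', '0x30', '0x30']
print([hex(s) for s in strides(SPAWN_CALL_PCS)])  # ['0x4c', '0x4c', '0x4c']
print(slot_indices(VTABLE_BYTE_OFFSETS))          # [35, 36, 37, 38]
```

A constant stride in both lists is what one would expect from four copies of the same thunk and four copies of the same spawn block, each differing only in its target.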
## Wake/signal chain to wedge (partial)
- Phase W: ours's wedge handle is `0x12d0` (`Event/Auto`, waited on at `sub_821CB030+0x1B0` by tid=13, the renderer);
main tid=1 join-waits on `Thread(id=13)` at `sub_82173990+0x2D4`.
- Canary tid=13 (renderer) creates **10 handles**, calls Ke/NtSetEvent **1,805×**, and calls wait.begin **475×**;
it is alive and signaling. Earliest tid=13 handle.create at 2.396 s; explosion at 10.7 s **once the
sub_825070F0 workers come online**.
- Canary tid=13's signals correlate with the sub_825070F0 worker batch coming up at 10.7 s (tid=27/28/29
first-events are all 10.705 s). Without those workers, ours's renderer has no producer to wake the
event it waits on, and main joins-on-renderer → full deadlock.
- Full SID-level mapping of "which canary worker fires the NtSetEvent that wakes the renderer's wait"
was not attempted (handle IDs and SIDs don't cross-correlate run-to-run; would require source-level
read of `sub_821CB030`). The class of producer (`sub_825070F0` workers) is identified.
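
The deadlock argument in this section is a wait-for-graph argument: a thread is stuck if every thread that could signal the object it waits on is itself stuck, or if no such thread exists. A toy model of that reasoning follows; the thread and object labels are illustrative strings, and the producer mapping is the hypothesis under test, not trace data.

```python
def is_stuck(thread, waits_on, signalers, _visiting=None):
    """Return True if `thread` can never be woken under the given model.

    waits_on:  thread -> object it is blocked on (absent means runnable)
    signalers: object -> threads that could signal it (thread objects are
               "signaled" by the exit of the thread they represent)
    """
    visiting = _visiting if _visiting is not None else set()
    obj = waits_on.get(thread)
    if obj is None:
        return False  # not blocked, so it can make progress
    if thread in visiting:
        return True   # wait cycle
    visiting.add(thread)
    # all() over an empty list is True: an object with no producer wedges its waiter
    stuck = all(is_stuck(s, waits_on, signalers, visiting) for s in signalers.get(obj, []))
    visiting.discard(thread)
    return stuck


waits_on = {"tid=1": "Thread(id=13)", "tid=13": "Event 0x12d0"}
ours = {"Thread(id=13)": ["tid=13"], "Event 0x12d0": []}
with_worker = {"Thread(id=13)": ["tid=13"], "Event 0x12d0": ["worker"]}

print(is_stuck("tid=1", waits_on, ours))         # True
print(is_stuck("tid=1", waits_on, with_worker))  # False
```

The second call shows why identifying the producer class matters: one live signaler for the event is enough to unwedge both the renderer and the joining main thread in this model.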
## Reading-error / methodology notes
- **#16 EH-handler caution**: the `sub_824AA388` spawn helper is reached via `bl` (direct call, not via
EH unwind) — no risk of misanchoring on a catch handler.
- **#28 framing**: Phase A `thread.create.payload.parent_tid` redundantly equals the event's `tid` field
(per `event_log.cc:312-326`: emitted ON the parent thread's stream, child tid is NOT in payload).
Child-tid is recovered by FIFO matching to `first_event[tid]` chronologically.
- **#30 cross-engine SIDs**: ours's wedge handle SID `d5e23609d3948568` does not appear in canary because
these are worker-local Event handles, not process-global dispatchers; only the shared-global recipe
is scheduling-invariant.
- **Cold-run jitter** was not a factor here — only one canary jsonl was processed; the spawn-chain
identification is robust because the SID-independent entry_pc + ctx_ptr + stack_size triplet is
effectively a content-addressed fingerprint that survives reruns.
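
The FIFO child-tid recovery described in note #28 can be sketched as follows. This is an assumed reconstruction of the idea, not the code in `build_profiles.py`: creates are walked in `host_ns` order and each one claims the earliest still-unclaimed tid whose first event is not older than the create.

```python
def fifo_match_children(creates, first_event_ns, known_tids=()):
    """Pair thread.create events with inferred child tids.

    creates:        list of dicts, each with at least a "host_ns" key
    first_event_ns: tid -> host_ns of that tid's first event in the trace
    known_tids:     tids that are not children of any create (e.g. guest main)
    Returns a list of (create, child_tid or None) in chronological order.
    """
    candidates = sorted(
        (ns, tid) for tid, ns in first_event_ns.items() if tid not in known_tids
    )
    matches = []
    i = 0
    for create in sorted(creates, key=lambda c: c["host_ns"]):
        # A child cannot emit events before it was created; skip such tids.
        while i < len(candidates) and candidates[i][0] < create["host_ns"]:
            i += 1
        if i < len(candidates):
            matches.append((create, candidates[i][1]))
            i += 1
        else:
            matches.append((create, None))  # child never ran in this window
    return matches
```

One limitation of this scheme: a thread that is created suspended and never resumed emits no first event, so a later child can be attributed to its create. Cross-checking against the entry PC, where available, would catch such shifts.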
## Outcome: (A) — historical framing confirmed
The Phase A `thread.create` data directly corroborates AUDIT-058/063/067:
1. `sub_825070F0` IS the function that spawns the 4 sub_82506528-family workers (confirmed in canary
trace, never fires in ours).
2. The dispatcher class `ANON_Class_713383D7` whose vtable `0x8200A208` slot 1 points at `sub_825070F0`
has its vtable installed via a path invisible to static guest analysis (AUDIT-067 unresolved).
3. The HEAVY workers (tid=14/15 → XAudio; tid=16 → XMA; tid=21 → NtWait worker) are spawned **earlier**
via different entries (`sub_824D2878`, `sub_824D2940`, `sub_82178950`, `sub_824563E0`) and are all
created suspended; ours never reaches their resume gate either (those threads exist in ours-postfix but emit
< 100 events each, all from the spawn-time bookkeeping).
## Recommended next attack target
**Re-attempt the deferred AUDIT-067 / AUDIT-068 host-side vptr install probe** with current tooling.
Specific subtasks:
1. **Identify the allocator that produces the `ANON_Class_713383D7` instance** with vtable `0x8200A208`.
- Static search: which fn loads `0x8200A208` as a constant? (database says nothing — confirm with a
fresh ghidra script that includes split-pair detection.)
- Runtime probe: instrument both engines to log every `stw vptr, 0(obj)` where `vptr ∈
{0x8200A208, 0x8200A928}`. In canary, this MUST fire ≥ 1× before the 10.38 s spawn burst;
in ours, it presumably never fires. Identify the PC.
2. **If host-side**: trace through the kernel exports table. The most likely path is one of
`XAudio2*Create`, `XMACreateContext`, `XMPCreate*`, or an undocumented `XAudio` API. Per the tid=14
call profile, `XAudioGetVoiceCategoryVolumeChangeMask` is the only XAudio API actively touched —
look at its dossier (or canary's `xboxkrnl_audio.cc` / `xam_audio.cc`) for object-construction
side-effects.
3. **Alternative**: identify which Sylpheed API call is the **trigger** for the 10.38 s `sub_825070F0`
firing. Canary main (tid=6) does the work that leads up to it at host_ns ≈ 10.30–10.38 s; roughly 300 ms
before the firing, tid=6 already has activity that ours doesn't reach. Diff canary's tid=6 event stream in
the window [10 s, 10.4 s] against ours's tid=1 in the wallclock-equivalent window. Caveat: ours doesn't
reach 10 s wallclock at all, so the divergence must be upstream of that window.
4. **Secondary attack**: the XAudio tid=14/15 resume gate. Those threads are spawned suspended in
BOTH engines (canary at 1.726/1.727 s, ours at 1.626 s); canary resumes them within ~1 ms and they
emit 11 M events combined. **What guest call resumes them in canary?** Cross-thread NtResumeThread
on the tid=14 handle. Sylpheed presumably resumes them via an XAudio2 API. If we can identify the
resume call site in canary and figure out why ours doesn't reach it, we unblock 60% of the missing
event volume (XAudio) independent of `sub_825070F0`.
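
For the split-pair detection mentioned in subtask 1, it helps to know which 16-bit halves to search for. On PowerPC a 32-bit constant is typically built with `lis` plus either `ori` (zero-extended low half) or `addi` / a load-store displacement (sign-extended low half). Because the low halves of both vtable addresses have bit 15 set, the sign-extended form needs a high half that is one larger. The helper below is a small sketch of that arithmetic; the function name is made up for illustration.

```python
def split_pairs(addr: int) -> dict:
    """Return the (high, low) immediates that can materialise `addr`.

    "lis_ori":  lis rN, high ; ori  rN, rN, low   (low is zero-extended)
    "lis_addi": lis rN, high ; addi rN, rN, low   (low is sign-extended,
                also applies to lwz/stw displacements)
    """
    low = addr & 0xFFFF
    high = (addr >> 16) & 0xFFFF
    low_signed = low - 0x10000 if low & 0x8000 else low
    high_adjusted = ((addr - low_signed) >> 16) & 0xFFFF
    return {"lis_ori": (high, low), "lis_addi": (high_adjusted, low_signed)}


for vtable in (0x8200A208, 0x8200A928):
    pairs = split_pairs(vtable)
    print(hex(vtable), {k: (hex(h), l) for k, (h, l) in pairs.items()})
```

For `0x8200A208` this yields `lis 0x8200` + `ori 0xA208`, or `lis 0x8201` + `addi -0x5DF8`. A search that only looks for `0x8200` in the high half would miss the second form.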
## Artifacts
All artifacts in `xenia-rs/audit-runs/phase-nonmatch-investigation/`:
- `build_profiles.py` — streaming jsonl profile builder (~200 LOC)
- `tid-event-counts.csv` — per-tid totals (28 rows)
- `tid-top-calls.txt` — per-tid top-20 kernel.call names
- `tid-ntset-handles.txt` — per-tid Ke/NtSetEvent handle distribution **(EMPTY — canary's
kernel.call payloads have `args:{}` for NtSetEvent; handle is in resolved-arg JSON not exposed
in current `args_resolved`. Not needed for Outcome (A) determination. Future Phase: extend
Phase A `kernel.call` to also surface ALL register args in `args` for diff-tool consumption.)**
- `tid-wait-handles.txt` — per-tid wait.begin handle distribution **(EMPTY for same reason: the
`wait.begin` events I sampled have `raw_handle_id=None` because the payload uses a
`handle_semantic_ids` array, not a single `raw_handle_id`. The handle.create map is populated
correctly — see `handle-create.json`.)**
- `thread-creates.json` — canary thread.create payloads keyed by child_tid (note: child_tid is FIFO-inferred, see profiles doc)
- `thread-exits.json` — canary thread.exit events (3 in this trace: tid=17/18/26)
- `excreate-events.json` — all ExCreateThread import.call events with idx/host_ns
- `create-thread-events.json` — full thread.create event payloads
- `handle-create.json` — all handle.create with raw_handle, sid, object_type
- `spawn-chain.json` — auto-correlated spawn → ExCreateThread linkage
- `canary-tid-profiles.md` — human-readable per-tid catalogue + spawn-chain tables
- `result.md` — this file
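
The empty `tid-wait-handles.txt` could likely be repopulated by reading the handle array instead of a single id. The sketch below assumes each `wait.begin` event has a `payload` object that carries either a `handle_semantic_ids` list or a scalar `raw_handle_id`; the hypothetical `kind` field and the payload container name should be checked against the real event schema.

```python
from collections import Counter, defaultdict


def wait_handles(event: dict) -> list:
    """Return every handle id referenced by one wait.begin event.

    Prefers the `handle_semantic_ids` array and falls back to a scalar
    `raw_handle_id` when the array is absent or empty.
    """
    payload = event.get("payload", {})
    ids = payload.get("handle_semantic_ids")
    if ids:
        return list(ids)
    single = payload.get("raw_handle_id")
    return [] if single is None else [single]


def wait_handle_distribution(events) -> dict:
    """Per-tid Counter of waited-on handle ids."""
    per_tid = defaultdict(Counter)
    for event in events:
        if event.get("kind") == "wait.begin":  # `kind` is an assumed field name
            per_tid[event["tid"]].update(wait_handles(event))
    return dict(per_tid)
```

The same fallback pattern would apply to the NtSetEvent digest once the handle argument is surfaced in `args`.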