Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
157 lines
10 KiB
Markdown
# Phase Non-match Investigation — Results

**Date**: 2026-05-19

**Source**: `xenia-canary/build-cross/bin/Windows/Debug/canary-jitter-1.jsonl` (4.4 GB, 18.7M events, 28 tids)

**Companion ours data**: `audit-runs/phase-w-wedge-reattack/ours-postfix.jsonl` (121,569 events, 13 tids)

**Outcome**: **(A): AUDIT-058/063/067 framing CONFIRMED** end-to-end using the new Phase A `thread.create` events.

## TL;DR

Per the Phase A `thread.create` events (wired in C+15-α), canary spawns **23 threads**. The final 4 fire at `host_ns ≈ 10.38 s`, have entry PCs `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`, and share context `0xBCE251C0` and stack size 65,536. These are **exactly** the 4 worker entries documented in the `sub_825070F0` dossier. The historical AUDIT-058/063 framing is correct: `sub_825070F0` is the one-shot 4-worker fan-out that ours never reaches.

Three of those four canary workers go on to run: **tid=28 (3.26M events, `sub_82506528`), tid=27 (36k events, `sub_82506558`), and tid=29 (91k events, `sub_82506588`)**; tid=28 alone is one of the heaviest threads in the trace. The fourth (`0x825065B8`) was never resumed in this 90 s window.

Ours emits **10 `thread.create`** events versus canary's 23: it stops after spawn #10 (`0x821748F0` at 1.727 s) and **never produces another `thread.create`** for the rest of the run. The 13 subsequent canary spawns, including the critical `sub_825070F0` batch, are entirely missing.

## What canary's heavy workers DO

| tid | events | role | entry_pc |
|----:|-------:|------|----------|
| 14 | **6.15 M** | **XAudio voice-mask poll** (26,126× XAudioGetVoiceCategoryVolumeChangeMask) | `0x824D2878` (aff=16) |
| 15 | **4.78 M** | XAudio sister (KeWaitForSingleObject + heavy IRQL spinlock cycle) | `0x824D2940` (aff=32) |
| 28 | **3.26 M** | **sub_825070F0 worker 0** (1.07 M × RtlEnterCS, 530× NtReadFile) | `0x82506528` (ctx `0xBCE251C0`) |
| 16 | 1.80 M | XMA decoder (`XMACreateContext`, RtlEnterCS heavy) | `0x82178950` |
| 21 | 1.00 M | NtWaitForMultipleObjectsEx worker | `0x824563E0` |
| 13 | 594 k | **Renderer** (12,092× VdSwap, VdGetSystemCommandBuffer; 1,805× Ke/NtSetEvent; 475× wait.begin) | `0x822F1EE0` |

The **biggest workers (tid=14, tid=15)** are NOT `sub_825070F0` workers. They are spawned much earlier (1.726/1.727 s) via `sub_824D2878` / `sub_824D2940` and run for the rest of the trace as XAudio render/voice threads. **Ours spawns these two suspended (1.626 s), but they never receive the resume call that would activate them**: ours produces 0 XAudio* events on these tids. This is verifiable from ours's per-tid event counts: ours has only 13 tids in total, none with the 6M-event signature.

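The "no 6M-event signature" check above is a single streaming pass that counts events per tid. The sketch below is a hypothetical minimal version of that pass, not the real `build_profiles.py` (whose internals are not shown in this report). It assumes each jsonl line is a flat JSON object with an integer `tid` key; the real Phase A schema may differ.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, List, TextIO


def open_trace(path: str) -> TextIO:
    """Open a .jsonl or .jsonl.gz trace as text (one JSON event per line)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_events_per_tid(lines: Iterable[str]) -> Counter:
    """Stream the trace once, keeping only a per-tid counter in memory.

    ASSUMPTION: every event is a flat JSON object with an integer "tid" key.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        counts[json.loads(line)["tid"]] += 1
    return counts


def heavy_tids(counts: Counter, threshold: int) -> List[int]:
    """tids with at least `threshold` events, busiest first."""
    return [tid for tid, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Toy input standing in for open_trace("ours-postfix.jsonl").
    sample = ['{"tid": 14}', '{"tid": 14}', '{"tid": 13}', ""]
    counts = count_events_per_tid(sample)
    print(len(counts), heavy_tids(counts, 2))  # 2 [14]
```

On a real trace, `with open_trace(path) as f: counts = count_events_per_tid(f)` followed by `heavy_tids(counts, 1_000_000)` would express the check; an empty result for ours matches the claim that no ours tid carries the XAudio-scale volume. Memory stays proportional to the number of tids, not to the 4.4 GB trace.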
## Spawn-chain summary (full table in `canary-tid-profiles.md`)

Canary has three distinct fan-out clusters, all spawned from tid=6 (guest main):

1. **Main init burst (1.42–1.94 s)**: 10 spawns (tids 8–17). Ours matches this 1:1 in spawn count and entry PCs.
2. **Secondary burst (1.94–2.15 s)** (XAM/XCONFIG helpers, tids 18–25): 8 additional spawns. **Ours emits 0**.
3. **XAudio worker fan-out (10.08–10.38 s)**: 5 spawns (tids 26, 27, 28, 29, plus 1 unresumed). The last 4 are the `sub_825070F0` workers. **Ours emits 0**.

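Because ours matches cluster 1 exactly and then stops, the divergence can be located by aligning the two chronological lists of `thread.create` entry PCs. A minimal sketch, assuming both lists are already extracted in spawn order; the input below is a toy subset of PCs named in this report, NOT the real 23-spawn order.

```python
from typing import List, Tuple


def align_spawns(canary: List[int], ours: List[int]) -> Tuple[int, List[int], List[int]]:
    """Compare two chronological lists of thread.create entry PCs.

    Returns (matched_prefix_len, canary_only_tail, ours_only_tail). The first
    mismatch ends the match; everything after it is reported per engine.
    """
    n = 0
    for c_pc, o_pc in zip(canary, ours):
        if c_pc != o_pc:
            break
        n += 1
    return n, canary[n:], ours[n:]


if __name__ == "__main__":
    # Toy subset of PCs from this report, NOT the real spawn order.
    canary = [0x824D2878, 0x824D2940, 0x82506528, 0x82506558, 0x82506588, 0x825065B8]
    ours = [0x824D2878, 0x824D2940]
    n, missing, extra = align_spawns(canary, ours)
    print(n, [hex(pc) for pc in missing], extra)
```

Prefix alignment fits the observed "1:1, then nothing" shape. If scheduling ever reordered spawns within a burst, an order-insensitive multiset comparison on a per-spawn fingerprint would be the safer choice.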
## sub_825070F0 spawn-chain confirmation (static + runtime)

- `sylpheed.db` places `sub_825070F0` in slot 1 of both vtable `0x8200A208` and vtable `0x8200A928` (anonymous class `ANON_Class_713383D7`, 7 slots each).
- **Zero `vptr_writes`, zero `xrefs`, and zero `indirect_dispatch_candidates`** reach either vtable. This exhaustion of static analysis supports AUDIT-067's host-side install hypothesis, though it does not prove it.
- The function body contains 4 sequential blocks, each an `addi` that materializes one worker entry PC (`0x825065x8`) into rN followed by `bl sub_824AA388` (the ExCreateThread wrapper), at PCs `0x825071F8 / 0x82507244 / 0x82507290 / 0x825072DC`.
- The 4 worker entry thunks (`0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`) are uniform vtable-slot callers: each loads `r3->vtable` at byte offset 140/144/148/152 (slots 35/36/37/38, at 4 bytes per entry) and dispatches via CTR.
- Runtime ctx `0xBCE251C0` appears **4× in the canary jsonl** (the 4 spawn events) and **0× in ours-postfix.jsonl**, consistent with ours never allocating the dispatcher object that holds the `0x8200A208` vtable.

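The zero-xref result is only conclusive if the scanner reassembles 32-bit constants that PPC code builds from two 16-bit halves (the "split-pair detection" requested in the next-steps section). The sketch below shows the arithmetic for the two common pairings, `lis` + `addi` (sign-extended low half, so the high half is "high-adjusted", `@ha`) and `lis` + `ori` (zero-extended low half, plain `@h`). Constant-pool loads are not covered, and whether the existing `sylpheed.db` tooling already handles both forms is not established here.

```python
from typing import Tuple


def split_addi(value: int) -> Tuple[int, int]:
    """(lis_imm, addi_imm) for `lis rX, hi; addi rY, rX, lo`; addi_imm is signed."""
    lo = value & 0xFFFF
    signed_lo = lo - 0x10000 if lo & 0x8000 else lo
    ha = ((value + 0x8000) >> 16) & 0xFFFF  # compensates for the negative lo
    return ha, signed_lo


def split_ori(value: int) -> Tuple[int, int]:
    """(lis_imm, ori_imm) for `lis rX, hi; ori rY, rX, lo`; ori zero-extends."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def rebuild_addi(lis_imm: int, addi_imm: int) -> int:
    # lis places its immediate in the upper half; addi adds the signed low half.
    return ((lis_imm << 16) + addi_imm) & 0xFFFFFFFF


def rebuild_ori(lis_imm: int, ori_imm: int) -> int:
    return ((lis_imm << 16) | ori_imm) & 0xFFFFFFFF


if __name__ == "__main__":
    for vtable in (0x8200A208, 0x8200A928):
        ha, lo = split_addi(vtable)
        hi, olo = split_ori(vtable)
        assert rebuild_addi(ha, lo) == vtable and rebuild_ori(hi, olo) == vtable
        print(f"{vtable:#010x}: lis {ha:#06x} + addi {lo} | lis {hi:#06x} + ori {olo:#06x}")
```

Both vtable addresses have a low half ≥ 0x8000, so in the `addi` form the `lis` immediate is `0x8201`, not `0x8200`. A search keyed on the plain upper half would therefore miss an `addi`-built reference to either vtable, while the worker entry PCs (low halves below 0x8000) would be found either way.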
## Wake/signal chain to wedge (partial)

- Phase W: ours's wedge handle `0x12d0` (`Event/Auto`) is waited on at `sub_821CB030+0x1B0` by tid=13 (the renderer); main tid=1 join-waits on `Thread(id=13)` at `sub_82173990+0x2D4`.
- Canary tid=13 (renderer) creates **10 handles**, calls Ke/NtSetEvent **1,805×**, and emits **475** `wait.begin` events: it is alive and signaling. Its earliest `handle.create` is at 2.396 s; activity explodes at 10.7 s, **once the `sub_825070F0` workers come online**.
- Canary tid=13's signals correlate with the `sub_825070F0` worker batch coming up at 10.7 s (the first events of tids 27/28/29 are all at 10.705 s). Without those workers, ours's renderer has no producer to wake the event it waits on, and main join-waits on the renderer, producing a full deadlock.
- A full SID-level mapping of which canary worker fires the NtSetEvent that wakes the renderer's wait was not attempted: handle IDs and SIDs don't cross-correlate run-to-run, so it would require a source-level read of `sub_821CB030`. The class of producer (`sub_825070F0` workers) is identified.

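The before/after-10.7 s correlation for tid=13 (and the [10 s, 10.4 s] tid=6 window diff proposed under next steps) both reduce to per-window histograms of one tid's `kernel.call` names. A minimal sketch, assuming a hypothetical flat event schema with `tid`, `host_ns`, `kind` and `name` keys; the real Phase A events nest call details in a payload, and the toy events below are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Tuple

NS_PER_S = 1_000_000_000


def s_to_ns(seconds: float) -> int:
    # round() guards against float error when converting seconds (e.g. 10.7) to ns.
    return round(seconds * NS_PER_S)


def window_histograms(events: Iterable[dict], tid: int,
                      windows: List[Tuple[int, int]]) -> List[Counter]:
    """One Counter of kernel.call names per half-open [start_ns, end_ns) window.

    ASSUMPTION: hypothetical flat schema with "tid", "host_ns", "kind", "name".
    """
    hists = [Counter() for _ in windows]
    for ev in events:
        if ev.get("tid") != tid or ev.get("kind") != "kernel.call":
            continue
        for hist, (start_ns, end_ns) in zip(hists, windows):
            if start_ns <= ev["host_ns"] < end_ns:
                hist[ev["name"]] += 1
    return hists


if __name__ == "__main__":
    toy = [
        {"tid": 13, "host_ns": 5 * NS_PER_S, "kind": "kernel.call", "name": "NtSetEvent"},
        {"tid": 13, "host_ns": 11 * NS_PER_S, "kind": "kernel.call", "name": "KeSetEvent"},
        {"tid": 28, "host_ns": 11 * NS_PER_S, "kind": "kernel.call", "name": "NtReadFile"},
    ]
    cut = s_to_ns(10.7)
    before, after = window_histograms(toy, 13, [(0, cut), (cut, s_to_ns(90))])
    print(dict(before), dict(after))
```

A histogram diff shows *when* the renderer's signaling changes, not *which* worker wakes it; that still needs the source-level read of `sub_821CB030` noted above.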
## Reading-error / methodology notes

- **#16 EH-handler caution**: the `sub_824AA388` spawn helper is reached via `bl` (a direct call, not via EH unwind), so there is no risk of misanchoring on a catch handler.
- **#28 framing**: Phase A `thread.create.payload.parent_tid` redundantly equals the event's own `tid` field (per `event_log.cc:312-326`, the event is emitted ON the parent thread's stream, and the child tid is NOT in the payload). The child tid is recovered by chronological FIFO matching against `first_event[tid]`.
- **#30 cross-engine SIDs**: ours's wedge handle SID `d5e23609d3948568` does not appear in canary because these are worker-local Event handles, not process-global dispatchers; only the shared-global recipe is scheduling-invariant.
- **Cold-run jitter** was not measured here, since only one canary jsonl was processed. The spawn-chain identification should still be robust, because the SID-independent (entry_pc, ctx_ptr, stack_size) triplet is effectively a content-addressed fingerprint that survives reruns.

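The fingerprint argument in the last note can be made concrete: key each spawn by (entry_pc, ctx_ptr, stack_size) and compare runs as multisets, ignoring tids and SIDs. A minimal sketch; the payload key names are hypothetical, and `ctx_ptr` is a guest heap address, so this only holds if guest allocation is deterministic across runs, as the note assumes.

```python
from collections import Counter
from typing import Iterable, Tuple

Fingerprint = Tuple[int, int, int]


def fingerprint(payload: dict) -> Fingerprint:
    # HYPOTHETICAL key names for the thread.create payload fields.
    return (payload["entry_pc"], payload["ctx_ptr"], payload["stack_size"])


def match_spawns(run_a: Iterable[dict], run_b: Iterable[dict]):
    """Multiset-compare spawns from two runs by fingerprint, ignoring tids/SIDs.

    Returns (in_both, only_a, only_b) as Counters; Counter arithmetic keeps
    duplicates, so two identical spawns must appear twice to fully match.
    """
    a = Counter(fingerprint(p) for p in run_a)
    b = Counter(fingerprint(p) for p in run_b)
    return a & b, a - b, b - a


if __name__ == "__main__":
    worker_pcs = [0x82506528, 0x82506558, 0x82506588, 0x825065B8]
    canary = [{"entry_pc": pc, "ctx_ptr": 0xBCE251C0, "stack_size": 65536, "tid": 100 + i}
              for i, pc in enumerate(worker_pcs)]  # toy tids
    rerun = [dict(p, tid=p["tid"] + 7) for p in canary]  # different tids, same work
    both, only_canary, _ = match_spawns(canary, rerun)
    print(sum(both.values()), sum(only_canary.values()))
    both, only_canary, _ = match_spawns(canary, [])  # ours: batch never fires
    print(sum(both.values()), sum(only_canary.values()))
```

A rerun with shuffled tids still matches all 4 workers, while an engine that never reaches `sub_825070F0` leaves all 4 in the canary-only bucket.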
## Outcome (A): historical framing confirmed

The Phase A `thread.create` data directly corroborates AUDIT-058/063/067:

1. `sub_825070F0` IS the function that spawns the 4 `sub_82506528`-family workers (confirmed in the canary trace; it never fires in ours).
2. The dispatcher class `ANON_Class_713383D7`, whose vtable `0x8200A208` slot 1 points at `sub_825070F0`, has its vtable installed via a path invisible to static guest analysis (AUDIT-067 unresolved).
3. The HEAVY workers (tid=14/15 → XAudio; tid=16 → XMA; tid=21 → NtWait worker) are spawned **earlier** via different entries (`sub_824D2878`, `sub_824D2940`, `sub_82178950`, `sub_824563E0`) and start suspended; their resume gate is also missing in ours (those threads exist in ours-postfix but emit < 100 events each, all from spawn-time bookkeeping).

## Recommended next attack target

**Re-attempt the deferred AUDIT-067 / AUDIT-068 host-side vptr install probe** with current tooling.

Specific subtasks:

1. **Identify the allocator that produces the `ANON_Class_713383D7` instance** with vtable `0x8200A208`.
   - Static search: which function loads `0x8200A208` as a constant? (The database says nothing; confirm with a fresh Ghidra script that includes split-pair detection.)
   - Runtime probe: instrument both engines to log every `stw vptr, 0(obj)` where `vptr ∈ {0x8200A208, 0x8200A928}`. In canary, this MUST fire ≥ 1× before the 10.38 s spawn burst; in ours, it presumably never fires. Identify the PC.

2. **If host-side**: trace through the kernel exports table. The most likely path is one of `XAudio2*Create`, `XMACreateContext`, `XMPCreate*`, or an undocumented `XAudio` API. Per the tid=14 call profile, `XAudioGetVoiceCategoryVolumeChangeMask` is the only XAudio API actively touched; look at its dossier (or canary's `xboxkrnl_audio.cc` / `xam_audio.cc`) for object-construction side-effects.

3. **Alternative**: identify which Sylpheed API call **triggers** the 10.38 s `sub_825070F0` firing. Canary main (tid=6) does the lead-up work at host_ns ≈ 10.30–10.38 s, and ~300 ms before that, tid=6 has activity that ours doesn't reach. The direct approach is to diff canary tid=6's event stream in [10 s, 10.4 s] against ours's tid=1 in the wallclock-equivalent window, but ours never reaches 10 s wallclock either, so the divergence is upstream.

4. **Secondary attack**: the XAudio tid=14/15 resume gate. Those threads are spawned suspended in BOTH engines (canary at 1.726/1.727 s, ours at 1.626 s); canary resumes them within ~1 ms, after which they emit ~11 M events combined (6.15 M + 4.78 M). **What guest call resumes them in canary?** Mechanically, a cross-thread NtResumeThread on the tid=14 handle, which Sylpheed presumably issues via an XAudio2 API. If we can identify the resume call site in canary and figure out why ours doesn't reach it, we unblock roughly 60% of the missing event volume (≈10.9 M of canary's 18.7 M events, the XAudio share) independent of `sub_825070F0`.

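The runtime probe in subtask 1 amounts to filtering a stream of store records for 4-byte writes of either watched vtable address and taking the earliest one before the 10.38 s burst. A minimal sketch, assuming a hypothetical per-store hook that emits the `GuestStore` records defined below (neither engine is known to expose such a hook today); all record values in the demo are toy data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# The two vtables that hold sub_825070F0 in slot 1 (per sylpheed.db).
WATCHED_VTABLES = frozenset({0x8200A208, 0x8200A928})


@dataclass(frozen=True)
class GuestStore:
    """HYPOTHETICAL record from a per-store instrumentation hook."""
    host_ns: int
    tid: int
    pc: int     # guest PC of the store; meaningless for host-side writes
    ea: int     # effective address written (obj for `stw vptr, 0(obj)`)
    value: int  # 32-bit value written
    size: int   # bytes written (4 for stw)


def vptr_installs(stores: Iterable[GuestStore],
                  watched: frozenset = WATCHED_VTABLES) -> List[GuestStore]:
    """Every 4-byte store whose value is one of the watched vtable addresses."""
    return [s for s in stores if s.size == 4 and s.value in watched]


def first_install_before(stores: Iterable[GuestStore], deadline_ns: int,
                         watched: frozenset = WATCHED_VTABLES) -> Optional[GuestStore]:
    """Earliest watched-vtable store strictly before deadline_ns, or None."""
    hits = [s for s in vptr_installs(stores, watched) if s.host_ns < deadline_ns]
    return min(hits, key=lambda s: s.host_ns) if hits else None


if __name__ == "__main__":
    toy = [
        GuestStore(9_000_000_000, 6, 0x82000000, 0x40001000, 0x8200A208, 4),
        GuestStore(9_100_000_000, 6, 0x82000004, 0x40001004, 0x12345678, 4),
    ]
    hit = first_install_before(toy, deadline_ns=10_380_000_000)
    print(hex(hit.pc) if hit else "no install before the spawn burst")
```

Matching on the stored value alone can yield false positives (any data copy of the constant), so hits would still need confirming against a later dispatch through the object. If AUDIT-067's host-side hypothesis holds, the install may come from host code writing guest memory rather than from a guest `stw`, in which case a JIT-level store hook would see nothing and host-side writes to guest memory need the same filter.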
## Artifacts

All artifacts are in `xenia-rs/audit-runs/phase-nonmatch-investigation/`:

- `build_profiles.py`: streaming jsonl profile builder (~200 LOC)
- `tid-event-counts.csv`: per-tid totals (28 rows)
- `tid-top-calls.txt`: per-tid top-20 kernel.call names
- `tid-ntset-handles.txt`: per-tid Ke/NtSetEvent handle distribution **(EMPTY: canary's kernel.call payloads have `args:{}` for NtSetEvent, and the handle sits in resolved-arg JSON that the current `args_resolved` does not expose. Not needed for the Outcome (A) determination. Future phase: extend Phase A `kernel.call` to also surface ALL register args in `args` for diff-tool consumption.)**
- `tid-wait-handles.txt`: per-tid wait.begin handle distribution **(EMPTY, also because of payload shape: the `wait.begin` events I sampled have `raw_handle_id=None` because the payload uses a `handle_semantic_ids` array, not a single `raw_handle_id`. The handle.create map is populated correctly; see `handle-create.json`.)**
- `thread-creates.json`: canary thread.create payloads keyed by child_tid (note: child_tid is FIFO-inferred; see the profiles doc)
- `thread-exits.json`: canary thread.exit events (3 in this trace: tid=17/18/26)
- `excreate-events.json`: all ExCreateThread import.call events with idx/host_ns
- `create-thread-events.json`: full thread.create event payloads
- `handle-create.json`: all handle.create events with raw_handle, sid, object_type
- `spawn-chain.json`: auto-correlated spawn → ExCreateThread linkage
- `canary-tid-profiles.md`: human-readable per-tid catalogue + spawn-chain tables
- `result.md`: this file

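`thread-creates.json` keys payloads by a FIFO-inferred child_tid (methodology note #28): the create event sits on the parent's stream, so the child has to be matched to the next tid that appears. A minimal sketch of that heuristic, assuming create events and per-tid first-event times are already extracted; the real logic in `build_profiles.py` may differ, and the demo trace is toy data.

```python
from collections import deque
from typing import Dict, List, Tuple


def infer_child_tids(creates: List[Tuple[int, int]],
                     first_event: Dict[int, int]) -> Tuple[Dict[int, int], List[int]]:
    """FIFO-match thread.create events to newly appearing tids.

    creates:     (host_ns, create_id) per thread.create, emitted on the PARENT's stream.
    first_event: tid -> host_ns of that tid's first event in the trace.
    Returns (create_id -> inferred child tid, create_ids left unmatched).

    Heuristic: the oldest pending create is assigned to the next tid that first
    appears at or after it. A tid that appears with no eligible pending create
    (e.g. the parent itself) is treated as pre-existing. A spawn that is never
    resumed only stays correctly unmatched if no later tid appears; otherwise
    FIFO hands it the next tid and shifts every later assignment.
    """
    create_ns = {cid: ns for ns, cid in creates}
    pending = deque(cid for _, cid in sorted(creates))
    matched: Dict[int, int] = {}
    for tid, ns in sorted(first_event.items(), key=lambda kv: kv[1]):
        if pending and create_ns[pending[0]] <= ns:
            matched[pending.popleft()] = tid
    return matched, list(pending)


if __name__ == "__main__":
    # Toy trace: tid 6 predates all creates; create 2 never runs.
    creates = [(100, 0), (200, 1), (300, 2)]
    first_event = {6: 10, 8: 150, 9: 250}
    print(infer_child_tids(creates, first_event))  # ({0: 8, 1: 9}, [2])
```

This is why the never-resumed fourth `sub_825070F0` worker (`0x825065B8`) is harmless in this trace only as long as it is the last pending create; an unresumed spawn earlier in the sequence would silently mislabel every later child.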