handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
156
audit-runs/phase-nonmatch-investigation/result.md
Normal file
@@ -0,0 +1,156 @@
# Phase Non-match Investigation — Results

**Date**: 2026-05-19
**Source**: `xenia-canary/build-cross/bin/Windows/Debug/canary-jitter-1.jsonl` (4.4 GB, 18.7M events, 28 tids)
**Companion ours data**: `audit-runs/phase-w-wedge-reattack/ours-postfix.jsonl` (121,569 events, 13 tids)
**Outcome**: **(A) — AUDIT-058/063/067 framing CONFIRMED** end-to-end using new Phase A thread.create events.

## TL;DR

Per Phase A `thread.create` events (wired in C+15-α), canary spawns **23 threads**; the final 4
fire at `host_ns ≈ 10.38 s` and have entry PCs `0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`
with shared context `0xBCE251C0` and stack 65,536. These are **exactly** the 4 worker entries
documented in the `sub_825070F0` dossier. The historical AUDIT-058/063 framing is correct:
`sub_825070F0` is the one-shot 4-worker fan-out that ours never reaches.

Three of those four canary workers go on to dominate the trace:
**tid=28 (3.26M events, sub_82506528), tid=27 (36k events, sub_82506558), tid=29 (91k events, sub_82506588)**.
The fourth (`0x825065B8`) was never resumed in this 90 s window.

Ours emits **10 thread.create** events vs canary's 23, stops after spawn #10 (`0x821748F0` at 1.727 s),
and **never produces another thread.create** for the rest of the run. The 13 subsequent canary
spawns, including the critical sub_825070F0 batch, are entirely missing.

## What canary's heavy workers DO

| tid | events | role | entry_pc |
|----:|-------:|------|----------|
| 14 | **6.15 M** | **XAudio voice-mask poll** (26,126× XAudioGetVoiceCategoryVolumeChangeMask) | `0x824D2878` (aff=16) |
| 15 | **4.78 M** | XAudio sister (KeWaitForSingleObject + heavy IRQL spinlock cycle) | `0x824D2940` (aff=32) |
| 28 | **3.26 M** | **sub_825070F0 worker 0** (1.07 M × RtlEnterCS, 530× NtReadFile) | `0x82506528` (ctx `0xBCE251C0`) |
| 16 | 1.80 M | XMA decoder (`XMACreateContext`, RtlEnterCS heavy) | `0x82178950` |
| 21 | 1.00 M | NtWaitForMultipleObjectsEx worker | `0x824563E0` |
| 13 | 594 k | **Renderer** (12,092× VdSwap, VdGetSystemCommandBuffer; 1,805× Ke/NtSetEvent; 475× wait.begin) | `0x822F1EE0` |

The **biggest workers (tid=14, tid=15)** are NOT sub_825070F0 workers. They are spawned much earlier (1.726/1.727 s)
with entries `sub_824D2878 / sub_824D2940` and run for the rest of the trace as XAudio render/voice threads. **Ours spawns
these two suspended (1.626 s) but they never receive the resume call that would activate them**: ours produces 0
XAudio* events on these tids (verifiable from ours's tid event counts: ours has only 13 tids total, none
with the 6M-event signature).

## Spawn-chain summary (full table in `canary-tid-profiles.md`)

Three distinct fan-out clusters in canary, all from tid=6 (guest main):

1. **1.42–1.94 s — main init burst**: 10 spawns (tids 8–17). Ours matches this 1:1 in spawn count and entries.
2. **1.94–2.15 s — secondary burst** (XAM/XCONFIG helpers, tids 18–25): 8 additional spawns. **Ours emits 0**.
3. **10.08–10.38 s — XAudio worker fan-out**: 5 spawns (tids 26, 27, 28, 29, +1 unresumed). The last 4
   are the `sub_825070F0` workers. **Ours emits 0**.

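The cluster comparison above amounts to a multiset difference over spawn entry points. A minimal sketch of that diff follows, assuming each `thread.create` event has already been parsed into a dict with a `host_ns` field and a `payload.entry_pc` field. That layout and the helper name `missing_spawns` are illustrative assumptions, not the actual schema or the code in `build_profiles.py`.

```python
from collections import Counter


def missing_spawns(canary_creates, ours_creates):
    """Return canary thread.create events with no counterpart in ours.

    Matching is by entry PC only (assumed field: payload.entry_pc), treated
    as a multiset so that repeated spawns of one entry are counted.
    """
    remaining = Counter(e["payload"]["entry_pc"] for e in ours_creates)
    missing = []
    for event in sorted(canary_creates, key=lambda e: e["host_ns"]):
        pc = event["payload"]["entry_pc"]
        if remaining[pc] > 0:
            remaining[pc] -= 1  # consumed by a matching spawn in ours
        else:
            missing.append(event)
    return missing
```

Matching on entry PC alone is a deliberate simplification: context pointers are runtime allocations and need not agree across engines.
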
## sub_825070F0 spawn-chain confirmation (static + runtime)

- `sylpheed.db` confirms `sub_825070F0` lives in `vtable 0x8200A208 slot 1` and `0x8200A928 slot 1`
  (anonymous class `ANON_Class_713383D7`, 7 slots each).
- **Zero `vptr_writes` / zero `xrefs` / zero `indirect_dispatch_candidates`** reach either vtable.
  AUDIT-067's host-side install hypothesis is confirmed by static-analysis exhaustion.
- Function body contains the 4 sequential `addi rN, r0, 0x825065XX` + `bl sub_824AA388` (= ExCreateThread
  wrapper) blocks at PCs `0x825071F8 / 0x82507244 / 0x82507290 / 0x825072DC`.
- The 4 worker entry thunks (`0x82506528 / 0x82506558 / 0x82506588 / 0x825065B8`) are uniform vtable-slot
  callers: each loads `r3->vtable->[140|144|148|152]` and dispatches via CTR (slot indices 35/36/37/38).
- Runtime ctx `0xBCE251C0` is referenced **4× in canary jsonl** (the 4 spawn events) and **0× in
  ours-postfix.jsonl**. Ours never allocates the dispatcher object that holds the `0x8200A208` vtable.

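The slot numbers quoted for the worker thunks follow from dividing each byte displacement by the pointer size. A small check of that arithmetic, assuming 4-byte guest pointers; the helper name `vtable_slot` is hypothetical:

```python
POINTER_SIZE = 4  # assumption: 32-bit guest pointers


def vtable_slot(byte_offset: int) -> int:
    """Convert a byte displacement into a vtable slot index."""
    if byte_offset % POINTER_SIZE != 0:
        raise ValueError(f"offset {byte_offset} is not pointer-aligned")
    return byte_offset // POINTER_SIZE


# Displacements loaded by the four worker entry thunks.
slots = [vtable_slot(off) for off in (140, 144, 148, 152)]
```
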
## Wake/signal chain to wedge (partial)

- Phase W: ours's wedge handle is `0x12d0`, an `Event/Auto` waited on at `sub_821CB030+0x1B0` by tid=13 (the renderer);
  main (tid=1) join-waits on `Thread(id=13)` at `sub_82173990+0x2D4`.
- Canary tid=13 (renderer) creates **10 handles**, calls Ke/NtSetEvent **1,805×**, and calls wait.begin **475×**:
  it is alive and signaling. Earliest tid=13 handle.create at 2.396 s; explosion at 10.7 s **once the
  sub_825070F0 workers come online**.
- Canary tid=13's signals correlate with the sub_825070F0 worker batch coming up at 10.7 s (tid=27/28/29
  first-events are all 10.705 s). Without those workers, ours's renderer has no producer to wake the
  event it waits on, and main joins-on-renderer → full deadlock.
- Full SID-level mapping of "which canary worker fires the NtSetEvent that wakes the renderer's wait"
  was not attempted (handle IDs and SIDs don't cross-correlate run-to-run; would require source-level
  read of `sub_821CB030`). The class of producer (`sub_825070F0` workers) is identified.

## Reading-error / methodology notes

- **#16 EH-handler caution**: the `sub_824AA388` spawn helper is reached via `bl` (direct call, not via
  EH unwind), so there is no risk of misanchoring on a catch handler.
- **#28 framing**: Phase A `thread.create.payload.parent_tid` redundantly equals the event's `tid` field
  (per `event_log.cc:312-326`: emitted ON the parent thread's stream, child tid is NOT in payload).
  Child-tid is recovered by FIFO matching to `first_event[tid]` chronologically.
- **#30 cross-engine SIDs**: ours's wedge handle SID `d5e23609d3948568` does not appear in canary because
  these are worker-local Event handles, not process-global dispatchers; only the shared-global recipe
  is scheduling-invariant.
- **Cold-run jitter** was not a factor here, since only one canary jsonl was processed; the spawn-chain
  identification is robust because the SID-independent entry_pc + ctx_ptr + stack_size triplet is
  effectively a content-addressed fingerprint that survives reruns.

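The FIFO child-tid recovery described in note #28 can be sketched as a greedy chronological match. This is an illustration under stated assumptions, not the code in `build_profiles.py`: the function name `infer_child_tids`, the event dict layout (`tid` for the parent, `host_ns` for the timestamp), and the precomputed `first_event_ns` map are all hypothetical.

```python
def infer_child_tids(creates, first_event_ns):
    """Pair each thread.create with the earliest unclaimed tid whose first
    event is at or after the create timestamp.

    creates:        list of dicts with 'tid' (parent) and 'host_ns'
    first_event_ns: dict mapping tid -> host_ns of that tid's first event
    Returns a list of (create_host_ns, child_tid or None) in spawn order.
    """
    by_first_event = sorted(first_event_ns.items(), key=lambda kv: kv[1])
    claimed = set()
    pairs = []
    for create in sorted(creates, key=lambda e: e["host_ns"]):
        child = None
        for tid, first_ns in by_first_event:
            if tid in claimed or tid == create["tid"]:
                continue  # already assigned, or the parent itself
            if first_ns >= create["host_ns"]:
                child = tid
                break
        if child is not None:
            claimed.add(child)
        pairs.append((create["host_ns"], child))
    return pairs
```

A thread that is never resumed emits no events, so its spawn maps to `None`. A thread created suspended and resumed only after a later sibling has started would be mis-paired by this greedy rule, which is why the inferred child tids should be treated as a best guess.
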
## Outcome: (A) — historical framing confirmed

The Phase A `thread.create` data directly corroborates AUDIT-058/063/067:

1. `sub_825070F0` IS the function that spawns the 4 sub_82506528-family workers (confirmed in canary
   trace, never fires in ours).
2. The dispatcher class `ANON_Class_713383D7` whose vtable `0x8200A208` slot 1 points at `sub_825070F0`
   has its vtable installed via a path invisible to static guest analysis (AUDIT-067 unresolved).
3. The HEAVY workers (tid=14/15 → XAudio; tid=16 → XMA; tid=21 → NtWait worker) are spawned **earlier**
   via different entries (`sub_824D2878`, `sub_824D2940`, `sub_82178950`, `sub_824563E0`) but are all
   suspended; their resume gate is also missing in ours (those threads exist in ours-postfix but emit
   < 100 events each, all from the spawn-time bookkeeping).

## Recommended next attack target

**Re-attempt the deferred AUDIT-067 / AUDIT-068 host-side vptr install probe** with current tooling.
Specific subtasks:

1. **Identify the allocator that produces the `ANON_Class_713383D7` instance** with vtable `0x8200A208`.
   - Static search: which fn loads `0x8200A208` as a constant? (The database says nothing; confirm with a
     fresh Ghidra script that includes split-pair detection.)
   - Runtime probe: instrument both engines to log every `stw vptr, 0(obj)` where `vptr ∈
     {0x8200A208, 0x8200A928}`. In canary, this MUST fire ≥ 1× before the 10.38 s spawn burst;
     in ours, it presumably never fires. Identify the PC.

2. **If host-side**: trace through the kernel exports table. The most likely path is one of
   `XAudio2*Create`, `XMACreateContext`, `XMPCreate*`, or an undocumented `XAudio` API. Per the tid=14
   call profile, `XAudioGetVoiceCategoryVolumeChangeMask` is the only XAudio API actively touched, so
   look at its dossier (or canary's `xboxkrnl_audio.cc` / `xam_audio.cc`) for object-construction
   side-effects.

3. **Alternative**: identify which Sylpheed API call is the **trigger** for the 10.38 s `sub_825070F0`
   firing. Canary main (tid=6) at host_ns ≈ 10.30–10.38 s does the work that leads up to this; ~300 ms
   before, tid=6 has activity that ours doesn't reach. Diff canary's tid=6 event stream in the window
   [10 s, 10.4 s] against ours's tid=1 at the equivalent point. Ours never reaches 10 s wallclock,
   though, so the divergence lies upstream of this window.

4. **Secondary attack**: the XAudio tid=14/15 resume gate. Those threads are spawned suspended in
   BOTH engines (canary at 1.726/1.727 s, ours at 1.626 s); canary resumes them within ~1 ms and they
   emit 11 M events combined. **What guest call resumes them in canary?** Presumably a cross-thread
   NtResumeThread on the tid=14 handle, which Sylpheed likely issues via an XAudio2 API. If we can identify the
   resume call site in canary and figure out why ours doesn't reach it, we unblock ~60% of the missing
   event volume (XAudio) independent of `sub_825070F0`.

## Artifacts

All artifacts in `xenia-rs/audit-runs/phase-nonmatch-investigation/`:

- `build_profiles.py` — streaming jsonl profile builder (~200 LOC)
- `tid-event-counts.csv` — per-tid totals (28 rows)
- `tid-top-calls.txt` — per-tid top-20 kernel.call names
- `tid-ntset-handles.txt` — per-tid Ke/NtSetEvent handle distribution **(EMPTY — canary's
  kernel.call payloads have `args:{}` for NtSetEvent; handle is in resolved-arg JSON not exposed
  in current `args_resolved`. Not needed for Outcome (A) determination. Future Phase: extend
  Phase A `kernel.call` to also surface ALL register args in `args` for diff-tool consumption.)**
- `tid-wait-handles.txt` — per-tid wait.begin handle distribution **(EMPTY for same reason: the
  `wait.begin` events I sampled have `raw_handle_id=None` because the payload uses a
  `handle_semantic_ids` array, not a single `raw_handle_id`. The handle.create map is populated
  correctly — see `handle-create.json`.)**
- `thread-creates.json` — canary thread.create payloads keyed by child_tid (note: child_tid is FIFO-inferred, see profiles doc)
- `thread-exits.json` — canary thread.exit events (3 in this trace: tid=17/18/26)
- `excreate-events.json` — all ExCreateThread import.call events with idx/host_ns
- `create-thread-events.json` — full thread.create event payloads
- `handle-create.json` — all handle.create with raw_handle, sid, object_type
- `spawn-chain.json` — auto-correlated spawn → ExCreateThread linkage
- `canary-tid-profiles.md` — human-readable per-tid catalogue + spawn-chain tables
- `result.md` — this file

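For readers without the script at hand, the per-tid totals in `tid-event-counts.csv` can be reproduced in spirit with a single streaming pass. The sketch below is not `build_profiles.py` itself; it assumes one JSON object per line with a top-level `tid` field, and the function name is illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def count_events_per_tid(lines: Iterable[str]) -> Counter:
    """Count events per tid without loading the whole trace into memory."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        event = json.loads(line)
        counts[event["tid"]] += 1
    return counts
```

Passing an open file handle as `lines` keeps memory use flat, which matters for a 4.4 GB trace; `counts.most_common()` then yields the ranking used in the heavy-worker table.
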