handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
285
audit-runs/iterate-2D-peer-producer-trace/investigation.md
Normal file
285
audit-runs/iterate-2D-peer-producer-trace/investigation.md
Normal file
@@ -0,0 +1,285 @@
|
||||
# Iterate 2.D — Peer-producer LR trace (investigation log)
|
||||
|
||||
**Date:** 2026-05-21
|
||||
**Mode:** WRITE (investigation only; no engine source changes).
|
||||
**Binaries:** `xenia-canary/.../xenia_canary_i2d.exe`, `xenia-rs/target/release/xrs-i2d`.
|
||||
**Reuses:** canary `--audit_69_log_all_sets=true --audit_70_log_all_releases=true` (existing),
|
||||
ours `--lr-trace` (existing); no new cvars added.
|
||||
**LOC delta:** engine 0, canary 0.
|
||||
|
||||
## Step 0 — Existing artifact triage
|
||||
|
||||
`audit-runs/audit-069-wait-signal-producer/s5/canary-release-trace.log` (S5) holds only
|
||||
NtReleaseSemaphore fires (414 events, single-handle restricted by configured handle list);
|
||||
**NtSetEvent dimension was never captured**. Audit-069 S6 bridge wrote a 414-vs-99 first-N
|
||||
release diff but did not extend coverage to NtSetEvent or to canary's wider tid family.
|
||||
Therefore a fresh capture covering BOTH operations was required.
|
||||
|
||||
## Step 1 — Capture both engines
|
||||
|
||||
### Canary cold run
|
||||
|
||||
```
|
||||
wine xenia_canary_i2d.exe "<iso>" --mute=true \
|
||||
--audit_69_log_all_sets=true --audit_70_log_all_releases=true \
|
||||
--log_file=canary-i2d.log
|
||||
```
|
||||
|
||||
Wallclock 90 s (timeout). Result: **79,014 peer-producer events** (61,659 NtSetEvent/XEvent::Set
|
||||
+ 17,355 NtReleaseSemaphore/xeKeReleaseSemaphore). 21 distinct guest tids.
|
||||
|
||||
### Ours cold run
|
||||
|
||||
```
|
||||
xrs-i2d exec "<iso>" --quiet \
|
||||
--lr-trace=0x8284DDDC,0x8284E49C,0x8284DF5C,0x8284E07C \
|
||||
--lr-trace-out=ours-i2d-iat-trace.jsonl
|
||||
```
|
||||
|
||||
The probe PCs are the IAT thunks for KeSetEvent / KeReleaseSemaphore / NtSetEvent /
|
||||
NtReleaseSemaphore respectively — covering BOTH wrapper paths and direct callers (matches
|
||||
canary's `audit_69`/`audit_70` C++ hook coverage). Wallclock 60 s (timeout). Result:
|
||||
**153 peer-producer events** across 7 distinct guest tids (1, 2, 5, 6, 8, 9, 11, 13).
|
||||
|
||||
A first pass with `--lr-trace=0x824AA2F0,0x824AB158` (game's NtSetEvent/NtReleaseSemaphore
|
||||
wrappers, sub_824AA2F0/sub_824AB158) yielded 150 events but **missed all KeReleaseSemaphore
|
||||
direct calls** (which bypass the game wrappers via the audio subsystem's `sub_824D21F0` and
|
||||
others). That trace is kept as `ours-i2d-lr-trace.jsonl` for cross-check; the IAT-thunk trace
|
||||
`ours-i2d-iat-trace.jsonl` is the authoritative one used for alignment.
|
||||
|
||||
## Step 2 — Cross-engine alignment
|
||||
|
||||
### Handle namespaces
|
||||
|
||||
Canary handles `0xF8000XXX` (8-bit XAM-style); ours handles `0x10XX` (slot ids). Not directly
|
||||
comparable by raw value — must use (op, lr-containing-fn) tuple.
|
||||
|
||||
### Documented tid map (per AUDIT-068/S6 bridge)
|
||||
|
||||
- canary tid=6 ↔ ours tid=1 (main)
|
||||
- canary tid=10 ↔ ours tid=5 (worker)
|
||||
- canary tid=17 ↔ ours tid=13 (cache thread)
|
||||
|
||||
Other tid pairs (audio threads tid=14 / tid=2 / tid=4 etc.) require entry-PC matching since
|
||||
their IDs are not stably mapped.
|
||||
|
||||
### LR alignment (operation + lr-containing-fn)
|
||||
|
||||
Cross-engine matching key: `(op_category, lr_value)`. Since BOTH engines route through
|
||||
identical game-side dispatch code, the same call site (lr) means the same source-level
|
||||
producer.
|
||||
|
||||
| LR | Op | Function | Canary fires | Ours fires | Status |
|
||||
|----|----|----------|---:|---:|--------|
|
||||
| `0x824D229C` | release | sub_824D21F0 (audio dispatch) | 16,452 | **1** | UNDER (×16,452) |
|
||||
| `0x824D2A44` | set | sub_824D29F0 (audio worker entry) | 16,452 | **0** | MISSING |
|
||||
| `0x824D292C` | set | sub_824D2878 (audio worker entry-2) | 16,452 | **0** | MISSING |
|
||||
| `0x824AA304` | set | sub_824AA2F0 (NtSetEvent wrapper, generic) | 15,765 | (multi) | MIXED |
|
||||
| `0x82506C90` | set | sub_82506B08 (worker dispatch +0x188) | 2,378 | **0** | MISSING |
|
||||
| `0x82508510` | set | sub_82508400 (+0x110) | 2,373 | **0** | MISSING |
|
||||
| `0x82508524` | set | sub_82508400 (+0x124) | 2,373 | **0** | MISSING |
|
||||
| `0x82506F9C` | set | sub_82506DE8 (worker dispatch) | 2,355 | **0** | MISSING |
|
||||
| `0x82508358` | set | sub_825078D8 (worker dispatch +0xa80) | 2,350 | **0** | MISSING |
|
||||
| `0x824AAFC8` | set | sub_824AAF50 (KeSetEvent wrapper) | 1,113 | **0** | MISSING |
|
||||
| `0x824AB168` | release | sub_824AB158 (NtReleaseSemaphore wrapper internal) | 903 | 90 | UNDER (×10) |
|
||||
| `0x82450D2C` | release | sub_82450B68 (worker self-release-2) | (multi via 0x824AB168) | 75 | match-class |
|
||||
| `0x82450CE0` | release | sub_82450B68 (worker self-release-1) | (multi via 0x824AB168) | 7 | match-class |
|
||||
| `0x82450314` | release | sub_82450218 | (multi via 0x824AB168) | 8 | match-class |
|
||||
|
||||
Total LRs missing in ours: **31 distinct call sites**, **61,659 canary fires** with **0 analogs
|
||||
in ours**. Plus the matched-class but-under-firing audio release (1 vs 16,452, ratio 1/16,452).
|
||||
|
||||
## Step 3 — First divergent producer (chronological)
|
||||
|
||||
Time-ordered scan of canary's events (host_ns) against ours's events (per-tid cycle):
|
||||
|
||||
- **Canary events 0–14** (host_ns 4.8 µs → 162.6 ms): all from tids 6/10/0 — these have
|
||||
exact ours analogs (tids 1/5/bootstrap), counts match approximately.
|
||||
- **Canary event 15** at **host_ns=277,967,100** (278 ms): tid=14 fires
|
||||
`xeKeReleaseSemaphore` on handle `0xF800006C` from lr=`0x824D229C` (sub_824D21F0+0xAC, audio
|
||||
dispatch). This is the **FIRST canary peer-producer ours doesn't match in volume.**
|
||||
|
||||
### Attribution
|
||||
|
||||
`sub_824D21F0` is the audio subsystem's "post-process-then-release-semaphore" leaf, called from
|
||||
3 sites:
|
||||
- `sub_824D2878+0xA0` (audio worker entry A)
|
||||
- `sub_824D2940+0x64` (audio worker entry B)
|
||||
- `sub_824D29F0+0xC4` (audio worker dispatch loop body)
|
||||
|
||||
`sub_824D29F0` is the main audio worker fn: enters CS, fires `KeSetEvent`, calls
|
||||
`KeWaitForMultipleObjects`, then either `sub_824D2108` or `sub_824D21F0` (the semaphore
|
||||
release). The fn pointers for both audio worker entries (`sub_824D2878`/`sub_824D2940`) are
|
||||
loaded by `sub_824D2C08` (audio subsystem init), which then calls `ExCreateThread` ×2 to
|
||||
spawn them, followed by `KeSetBasePriorityThread` and `KeResumeThread`.
|
||||
|
||||
### Ours's actual audio thread behavior
|
||||
|
||||
Phase-A event log (`/tmp/ours-i2d-events.jsonl`, 118,149 events) shows:
|
||||
|
||||
- **host_ns=1,586,993,047** (1.587 s): `thread.create` entry_pc=`0x824d2878` (audio thread A), suspended
|
||||
- **host_ns=1,587,001,117**: `ObReferenceObjectByHandle` (audio init handshake)
|
||||
- **host_ns=1,587,011,797**: `KeSetBasePriorityThread`
|
||||
- **host_ns=1,587,018,827**: `KeResumeThread` (audio thread A resumed)
|
||||
- **host_ns=1,587,049,878**: `ExCreateThread` (audio thread B, entry_pc=`0x824d2940`)
|
||||
- **host_ns=1,587,088,839**: `KeResumeThread` (audio thread B resumed)
|
||||
- **host_ns=1,587,097,519**: ours **tid=10** (audio thread A) starts: `KeWaitForSingleObject`,
|
||||
`KeRaiseIrqlToDpcLevel`, `KeAcquireSpinLockAtRaisedIrql`, `KeReleaseSpinLockFromRaisedIrql`,
|
||||
`KfLowerIrql` (17 events total). Then **silent.**
|
||||
- **host_ns=1,659,028,012**: ours **tid=11** (audio thread B) starts: `RtlEnterCriticalSection`,
|
||||
`KeSetEvent`, `KeWaitForMultipleObjects` (11 events total). Then **silent.**
|
||||
- **ours tid=9** (also audio family, ctx_ptr=0x828a3230 = audio static dispatcher) fires
|
||||
`KeReleaseSemaphore` ONCE at cycle=631 from lr=`0x824d229c`, then **silent.**
|
||||
|
||||
**Conclusion**: ours audio threads are spawned, resumed, execute *one* iteration of their
|
||||
work loop, then wedge in `KeWaitForMultipleObjects` (sub_824D29F0+0xA0) — the same wedge
|
||||
shape as tid=13 (cache thread). Canary's audio thread iterates ~16,452× over the same
|
||||
window; ours iterates ~1×. **The wedge is not localized to one thread — it is the same wedge
|
||||
pattern recurring in every peer-producer thread family.**
|
||||
|
||||
### Timing skew
|
||||
|
||||
Canary boot timeline (audio subsystem):
|
||||
|
||||
- host_ns=4.8 µs: first NtReleaseSemaphore (tid=6 main bootstrap)
|
||||
- host_ns=277.97 ms: audio worker (tid=14) first fires
|
||||
|
||||
Ours boot timeline:
|
||||
|
||||
- host_ns=5,4 ms: first NtReleaseSemaphore (tid=1 main, matched)
|
||||
- host_ns=**1,586.99 ms** = 1.587 s: audio thread spawned (5.7× later than canary's first audio fire)
|
||||
|
||||
This 1.3-second delay implies **upstream init phase divergence** — ours's main thread is taking
|
||||
significantly longer to reach the `sub_824D2C08` init call than canary does. The cause of this
|
||||
upstream delay is likely the cumulative effect of every prior subsystem wedge: each
|
||||
producer-consumer pair that wedges in ours costs time before the next subsystem can init.
|
||||
|
||||
## Step 4 — Outcome class
|
||||
|
||||
The plan defined three outcomes:
|
||||
|
||||
- **(A) Single missing producer, thread-spawn-dependent**: NO — ours DOES spawn the audio
|
||||
threads (tid=9, tid=10, tid=11) successfully. The threads execute briefly then wedge.
|
||||
- **(B) Single missing producer, state-divergence in an existing-but-divergent thread**: NO —
|
||||
the divergence is not in *one* thread but in *all* peer-producer threads.
|
||||
- **(C) Many distributed producers missing**: **YES.** 31 distinct LRs missing across at least
|
||||
4 thread families:
|
||||
- Audio workers (sub_824D2878, sub_824D2940, sub_824D29F0, sub_824D21F0 chain): ~50k fires
|
||||
canary, 1 fire ours.
|
||||
- Worker dispatch (sub_82506B08, sub_82508400, sub_82506DE8, sub_825078D8 family — the
|
||||
AUDIT-049 "sub_825070F0" caller universe): ~10k fires canary, 0 ours.
|
||||
- NtSetEvent wrapper-internal (lr=0x824AA304): 15,765 canary, 0 directly observed in ours
|
||||
via this LR (ours uses the IAT path differently).
|
||||
- Misc (sub_827E*, sub_82178*, sub_824D0*): smaller counts, mostly init paths.
|
||||
|
||||
Outcome class = **(C) Many distributed producers missing.**
|
||||
|
||||
### Structural interpretation
|
||||
|
||||
Every peer-producer thread family in ours executes its FIRST iteration normally (the
|
||||
bootstrap), then stalls in its wait primitive. The wait primitives are different across
|
||||
families (different events, different semaphores), but the **pattern is identical**:
|
||||
|
||||
```
|
||||
ThreadFamily X enters loop
|
||||
→ does bootstrap-once setup (1 release/set fires)
|
||||
→ enters KeWaitForMultipleObjects or KeWaitForSingleObject
|
||||
→ blocks forever because the producer for ITS wait event is itself another wedged thread
|
||||
```
|
||||
|
||||
This is a **multi-producer ladder collapse**: each thread depends on a peer (in the same OR
|
||||
different family) for its wake-up signal; that peer is also wedged on a dependency; etc. The
|
||||
graph is not strictly circular (each thread's specific wait may differ) but the topology means
|
||||
**no thread can advance because every thread's wake-source is also blocked.**
|
||||
|
||||
This subsumes the AUDIT-049 / AUDIT-069 framings into a unified picture: the wedge family
|
||||
includes at minimum {tid=10 audio-A, tid=11 audio-B, tid=9 audio-aux, tid=13 cache, all four
|
||||
sub_825070F0 worker spawnees}, all wedged on different wait sites, all unable to wake each
|
||||
other.
|
||||
|
||||
## Step 5 — Recommended next iterate
|
||||
|
||||
Given outcome class (C), single-keystone iterates (2.B branch-probe, 2.C arg/return capture,
|
||||
single-thread wedge investigation) will not unlock the whole ladder — each one would unblock
|
||||
ONE thread, only to find it blocks again on the next un-signaled event.
|
||||
|
||||
The plan's recommended pivot from outcome (C) is: **"may need a fundamentally different
|
||||
methodology"**.
|
||||
|
||||
### Option (1): Critical-path sweep (~400-600 LOC over multiple sessions)
|
||||
|
||||
Identify which thread families' first-iteration produces signals consumed by another thread
|
||||
family, build a dependency DAG, then trace each edge's first-divergence in ours. Many of these
|
||||
edges may converge on a small number of "root cause" missed signals further upstream
|
||||
(e.g., a single missing signal in init code that cascades).
|
||||
|
||||
### Option (2): Boot-time delta replay (~100 LOC investigation)
|
||||
|
||||
Compare canary's 0-278 ms event sequence (1,221 events before first audio fire) against ours's
|
||||
0-278 ms event sequence. There's an upstream timing skew (canary boots audio at 278 ms, ours
|
||||
at 1.587 s — 5.7× slower). The CAUSE of the slow init is upstream-of-audio and may be a
|
||||
single fixable wedge in the init path.
|
||||
|
||||
**Recommended: Option (2)** — cheaper, more focused. Diff the first ~1200 phase-a events of
|
||||
each engine to find the FIRST kernel-import-call divergence in early bootstrap. This may
|
||||
identify a single missing wake-signal in the early init flow that cascades to delay every
|
||||
subsequent subsystem.
|
||||
|
||||
### Option (3): Audio-specific micro-investigation (~50 LOC, narrower)
|
||||
|
||||
The single canary audio fire (lr=`0x824d229c` count=1 in ours) shows ours's audio thread DOES
|
||||
reach the release site once. Find what event/semaphore canary signals between iterations 1
|
||||
and 2 that ours doesn't. This is a narrower (B-shaped) sub-investigation that doesn't unblock
|
||||
the full ladder but adds disambiguation between "thread didn't spawn", "first iter wedged",
|
||||
and "subsequent iter wedged".
|
||||
|
||||
## Tripstones honored
|
||||
|
||||
- **#28 (per-engine tid stability)**: explicitly used (op, lr-containing-fn) tuple, not raw
|
||||
tid. Confirmed via `entry_pc` matching in phase-a thread.create events.
|
||||
- **#32 (canary jitter)**: no relevant variance — both engines bit-equivalent on first 15
|
||||
events (host_ns and counts match). Divergence starts at canary event 15 (audio fire).
|
||||
- **#37 (vtable base vs slot-N)**: not encountered.
|
||||
- **#39 (composite progression metric)**: not moved this iterate; investigation-only.
|
||||
- **#40 (single-keystone framing)**: explicitly broken by outcome (C). The "find THE missing
|
||||
producer" framing of S5/S6/2.A is **falsified** at the structural level — there are 31, not
|
||||
1.
|
||||
|
||||
## Cascade
|
||||
|
||||
- A (acquire both engines' producer traces): **PASS HIGH** (canary 79,014 events, ours 153)
|
||||
- B (align sequences): **PASS HIGH** (LR-keyed alignment; clear bit-equivalence on first 15
|
||||
canary events).
|
||||
- C (identify first divergent producer): **PASS HIGH** — canary event 15 at host_ns=278ms,
|
||||
tid=14, lr=0x824D229C, sub_824D21F0 (audio dispatch).
|
||||
- D (attribute cause): **PASS MEDIUM** — distributed wedge ladder, not single-thread blockage.
|
||||
- E (outcome class named): **PASS HIGH** — Class (C), 31 missing LRs, 4+ thread families.
|
||||
|
||||
5 PASS / 0 FAIL.
|
||||
|
||||
## Artifacts
|
||||
|
||||
All under `xenia-rs/audit-runs/iterate-2D-peer-producer-trace/`:
|
||||
|
||||
- `canary-i2d.log` (10.7 MB, 79,014 peer-producer events from canary's audit_69+audit_70).
|
||||
- `canary-i2d.stdout` / `.stderr`: canary run logs.
|
||||
- `canary-peer-producers.jsonl`: parsed structured form of canary events (79,014 records).
|
||||
- `ours-i2d-lr-trace.jsonl`: ours first pass at game wrapper PCs (150 events; missing
|
||||
KeReleaseSemaphore direct path — kept for cross-check only).
|
||||
- `ours-i2d-iat-trace.jsonl`: ours authoritative trace at IAT thunks (153 events,
|
||||
comprehensive Ke+Nt coverage).
|
||||
- `ours-i2d.stdout` / `.stderr`: ours run logs.
|
||||
- `aligned-sequence.csv`: chronological per-engine sequence (first 200 canary + all ours).
|
||||
- `investigation.md`: this document.
|
||||
- `report.md`: short outcome summary.
|
||||
|
||||
## Discipline
|
||||
|
||||
- xenia-rs HEAD UNCHANGED. canary HEAD UNCHANGED.
|
||||
- Binaries `xenia_canary_i2d.exe` and `xrs-i2d` are renamed copies; originals untouched.
|
||||
- Canary cache backed up to `/tmp/canary-cache-bak-iter2D` at session start; verified unchanged
|
||||
at session end (`diff -rq` returns empty).
|
||||
- `--mute=true` honored on canary run.
|
||||
- Investigation-only; no engine source changes.
|
||||
|
||||
LOC delta: 0 engine, 0 canary.
|
||||
Reference in New Issue
Block a user