handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
205
audit-runs/review-a-boot-state/ours-wedge-localization.md
Normal file
@@ -0,0 +1,205 @@
# Ours wedge localization

**Source data**: `phase-w-wedge-reattack/ours-postfix.jsonl` (50M-instr
cold run, ~3 s wallclock, 121,569 events, 13 tids).
`phase-w-wedge-reattack/halt-on-deadlock-dump.txt` (per-tid state at
deadlock).

## TL;DR

Ours's wedge is **structurally identical** to AUDIT-049 (first found
2026-05-10). Across 25+ subsequent iterates (Phase C+1 … Phase C+25,
Phase D, AUDIT-049 … AUDIT-069), the wedge has **never moved**:

- **tid=1 (main)** wedges at `sub_82173990+0x2D4` (PC `0x824ac578`,
  `do_wait_single`) on **handle `0x12c8`** = `Thread(id=13)`, the
  renderer thread's join handle.
- **tid=13 (renderer / cache-IO worker)** wedges at
  `sub_821CB030+0x1B0` (PC `0x824ac578`, `do_wait_single`) on
  **handle `0x12d0`** = `Event/Auto`, created by tid=13 itself at
  `sub_821CB030+0x128` via `NtCreateEvent`. `<NO_SIGNALS_DESPITE_WAITS>`.
- **`sub_825070F0` fires 0×** at any horizon probed (50M, 500M, ∞
  wallclock). The 4 workers (entries `0x82506528/58/88/B8`) never
  spawn in ours.

This is what audits 049/058/059/060/062/063/064/065/066/067/068/069
collectively call "the wedge."
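
The `<NO_SIGNALS_DESPITE_WAITS>` condition can be checked mechanically. The following is a minimal sketch, assuming a hypothetical JSONL schema with `tid`, `kind` (`"wait"` or `"signal"`) and `handle` fields; the real layout of `ours-postfix.jsonl` may differ, and the sample lines are synthetic, not taken from the trace.

```python
import json
from collections import Counter


def never_signaled(lines):
    """Return {handle: wait_count} for handles that were waited on but never signaled.

    Assumes a hypothetical schema: {"tid": int, "kind": "wait"|"signal", "handle": int}.
    """
    waits = Counter()
    signaled = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        ev = json.loads(line)
        if ev["kind"] == "wait":
            waits[ev["handle"]] += 1
        elif ev["kind"] == "signal":
            signaled.add(ev["handle"])
    return {h: n for h, n in waits.items() if h not in signaled}


# Synthetic sample: the two wedge handles plus one healthy wait/signal pair.
sample = [
    '{"tid": 1, "kind": "wait", "handle": 4808}',   # 0x12c8
    '{"tid": 13, "kind": "wait", "handle": 4816}',  # 0x12d0
    '{"tid": 5, "kind": "wait", "handle": 4096}',
    '{"tid": 5, "kind": "signal", "handle": 4096}',
]
stuck = never_signaled(sample)
print({hex(h): n for h, n in stuck.items()})
```

On a real dump, `never_signaled(open(path))` would stream the file line by line, so it should stay cheap even for the larger canary traces.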

## Graph view: ours's actual reachable subgraph vs canary's

### What runs in BOTH engines (matched-prefix 105,128)

```
entry_point
  └─ early CRT init                                          ✓ ours  ✓ canary
       └─ subsystem init                                     ✓
            ├─ VdInitializeEngines (×2, then VdShutdownEngines, then again)
            ├─ VdInitializeRingBuffer
            ├─ VdEnableRingBufferRPtrWriteBack
            ├─ VdSetGraphicsInterruptCallback
            └─ VdSetSystemCommandBufferGpuIdentifierAddress
       └─ 10× ExCreateThread (the matched first spawn burst)
            ├─ 0x82181830 / 0x8245A5D0 / 0x82450A28          ✓  ✓
            ├─ 0x82457EF0 (spawned by tid=10 → tid=11)       ✓  ✓
            ├─ 0x824CD458 (KeWait worker, susp=F)            ✓  ✓
            ├─ 0x822F1EE0 (renderer, susp=T)                 ✓  ✓
            ├─ 0x824D2878 / 0x824D2940 (XAudio, susp=T)      ✓  ✓
            ├─ 0x82178950 (XMA, susp=F)                      ✓  ✓
            └─ 0x821748F0 (file IO spawner, susp=T)          ✓  ✓
       └─ 1× boot-init VdSwap                                ✓ swaps=1
       └─ tid=1 enters sub_8216EA68 → sub_822F1AA8
            └─ bctrl vtable[0] of *(0x828E1F08)
                 └─ sub_82175330 → tail → sub_82173990
                      └─ sub_821746B0 → spawn worker (= ours tid=13, susp=F)
                      └─ KeWaitForSingleObject INFINITE on tid=13.handle ← WEDGE
```

### What runs ONLY in canary (the missing subgraph)

```
After tid=6's tid=17 worker (= ours's tid=13) terminates:
  sub_82173990 returns to sub_822F1AA8's outer loop
  └─ iterates sub_821741C8 → sub_82172BA0 → vtable[6] = sub_821B55D8
       → sub_824F8398 → sub_824F7CD0 → sub_824F7800 → vtable[1] = sub_825070F0
       └─ 4× ExCreateThread(entry=0x82506528/58/88/B8, susp=T)
            ├─ Worker 0 → tid=28 (file IO, 3.26M events)
            ├─ Worker 1 → tid=27 (36k events)
            ├─ Worker 2 → tid=29 (91k events)
            └─ Worker 3 (0x825065B8, never resumed in jitter-1 run)

After workers come online:
  Canary's secondary spawn burst (1.94–2.15 s): 8 helpers (tids 18–25)
  Canary's tid=14/15 XAudio resumes (~ms after tid=6 spawns them in
    susp=T; ours also spawns them susp=T but never resumes them)
  Renderer tid=13 unblocks, starts emitting VdSwap at ~150 fps
  Per-frame game loop: tid=6 emits `0x822F1BCC` 4040× / 60 s
```

## The wedge dependency graph (cyclic)

```
[tid=1 (main) wedge]
        │
        ▼
wait on handle 0x12c8 (= tid=13.thread_handle)
        │
        ▼
only signaled when tid=13 calls ExTerminateThread
        │
        ▼
tid=13 needs to complete sub_821CB030 body
        │
        ▼
sub_821CB030 waits on event 0x12d0
        │
        ▼
only signaled by sub_825070F0 worker cluster
        │
        ▼
sub_825070F0 never fires in ours
        │
        ▼
sub_825070F0 is reached via:
  sub_82172BA0 → ... → sub_824F7800 → bctrl vtable[1]
  ↑↑↑ which is downstream of sub_822F1AA8's outer loop
      which is downstream of sub_82173990 returning
      which is downstream of tid=1's wait completing
      ← BACK TO TOP
```

This is the **AUDIT-063 self-referential lock**: the activation chain
that produces the signal that unwedges the wait is itself downstream
of the wait completing. In canary, the lock resolves because the
tid=17 worker (= ours tid=13's analog) calls `ExTerminateThread`
**by completing** its `sub_821CB030` body, and that completion is
fed by some OTHER signal source that ours doesn't replicate.
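
The cyclic structure can be stated as a wait-for graph and checked with a depth-first search. This is a sketch only: the node names are shorthand invented here for the steps in the diagram, and `other_signal_source` is a placeholder for the still-unidentified canary producer, not a known function.

```python
# Edge "A -> B" means "A cannot happen until B happens".
# Node names are shorthand for the diagram steps (assumed labels, not symbols).
WAITS_ON = {
    "tid1_wait_0x12c8": ["tid13_ExTerminateThread"],
    "tid13_ExTerminateThread": ["sub_821CB030_body_done"],
    "sub_821CB030_body_done": ["event_0x12d0_signaled"],
    "event_0x12d0_signaled": ["sub_825070F0_fires"],
    "sub_825070F0_fires": ["sub_822F1AA8_outer_loop"],
    "sub_822F1AA8_outer_loop": ["sub_82173990_returns"],
    "sub_82173990_returns": ["tid1_wait_0x12c8"],
}


def find_cycle(graph, start):
    """Return the first dependency cycle reachable from start, or None."""
    path, on_path, done = [], set(), set()

    def visit(node):
        if node in on_path:
            # Found a back edge: slice the current path from the repeated node.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        on_path.discard(node)
        path.pop()
        done.add(node)
        return None

    return visit(start)


ours_cycle = find_cycle(WAITS_ON, "tid1_wait_0x12c8")
print(" -> ".join(ours_cycle))

# Canary, as described above: the body completes via some other source
# (placeholder node), so the chain no longer loops back to tid=1's wait.
canary = dict(WAITS_ON, sub_821CB030_body_done=["other_signal_source"])
print(find_cycle(canary, "tid1_wait_0x12c8"))
```

Replacing the single edge out of `sub_821CB030_body_done` is enough to make the graph acyclic, which is the graph-level restatement of the claim that one missing external producer accounts for the wedge.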

## Where the "other signal source" lives (the actual root cause)

From AUDIT-069 Session 5 (work-semaphore release-rate diff):

> Canary 414 release events vs ours 99 (24% rate). Worker (tid=10/5):
> 382 vs 90. Main (tid=6/1): 7 vs 8. **Other producers: 25 vs 1**.

The discrepancy in the "other producers" row (25 release events vs 1)
is the key. **Canary has multiple non-worker threads that release the
work semaphore during bootstrap; releasing this semaphore is what feeds
the worker-side wait that eventually causes `sub_821CB030`'s event to
be signaled.** Ours has only one (tid=13 itself, before it wedges).
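
As a quick arithmetic check on the Session 5 numbers, the per-source counts can be summed and compared. The figures below are copied from the quote; the dictionary keys are labels chosen for this sketch.

```python
# (canary, ours) work-semaphore release events, from the AUDIT-069 S5 quote.
RELEASES = {
    "worker (tid=10/5)": (382, 90),
    "main (tid=6/1)": (7, 8),
    "other producers": (25, 1),
}

canary_total = sum(c for c, _ in RELEASES.values())
ours_total = sum(o for _, o in RELEASES.values())
print(f"total: canary={canary_total} ours={ours_total} "
      f"ours/canary={ours_total / canary_total:.1%}")

# Per-source ratio shows where the shortfall is concentrated.
for source, (canary_n, ours_n) in RELEASES.items():
    print(f"{source}: ours is {ours_n / canary_n:.1%} of canary")
```

The three rows sum to the quoted totals (414 and 99). The worker row tracks the overall ratio of about 24%, the main row is at rough parity, and the "other producers" row sits at 4%, which is why it stands out.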

From AUDIT-069 Session 4 (`sub_82450A68` dispatch loop):

> Ours r3=0x1 (semaphore acquired) 91/91 captures (100%); canary
> r3=0x102 (TIMEOUT) 3/4 (75%).

**Ours's work semaphore has count > 0 every time tid=5 checks; canary's
times out 75% of the time.** This is a *paradox at face value*: how
can ours have MORE semaphore signals available but still process
LESS work? The S5 reframe resolves it: ours's worker self-releases
the work semaphore from `sub_82450B68+0xCDC/+0xD28` MORE OFTEN than
it consumes, because the consume path early-exits when the dispatch
table doesn't have an entry to process. The dispatch table has no
entries because the producers (the threads behind canary's 25 "other
producer" releases) aren't running.

## Bootstrap divergence (when does ours first diverge from canary?)

Per the AUDIT-069 H3 framing: somewhere in the *bootstrap* of the
worker cluster, a producer thread that should be alive in canary
isn't alive in ours. Candidates:

1. **XAudio render thread (canary tid=14/15)**: spawned suspended in
   ours, **never resumed**. Canary resumes within ~1 ms of spawn at
   1.726 s. Canary's tid=14 calls `XAudioGetVoiceCategoryVolumeChangeMask`
   26,126× and is one of the top event producers. This thread runs
   the host-audio bridge feed loop: *if it isn't running, downstream
   producers expecting audio cues block.*
2. **XMA decoder (tid=16, entry `0x82178950`)**: spawned non-suspended
   in both; ours emits 0 events from this thread, presumably because it
   waits on a kernel object that's never signaled.
3. **NtWaitForMultipleObjectsEx worker (canary tid=21, entry
   `0x824563E0`)**: 1M events in canary; absent in ours (canary's
   second spawn burst doesn't happen in ours).
4. **The "tid=10 helper" (canary tid=10, entry `0x82450A28`)**: ours
   has this thread (ours tid=5), but it's running the dispatch loop
   `sub_82450A68` in a degenerate fast-path mode (S4 finding).

The most defensible single-root claim:

> **Ours never resumes the XAudio threads (tid=14/15), because the
> guest API call that triggers their resume in canary doesn't fire in
> ours, and as a knock-on the worker cluster never gets the bootstrap
> producer it expects.**

But this claim is not yet proven; AUDIT-068/069 stopped short of
identifying the resume trigger.
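
One way to narrow the search for the resume trigger is to diff "spawned suspended, never resumed" threads between the two engines. The sketch below assumes a hypothetical `Spawn` record and a set of resumed entry points per engine. Only the entry addresses and `susp` flags come from the spawn-burst listing; the resumed sets are illustrative stand-ins for what a trace query would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spawn:
    """Hypothetical spawn record: thread entry point and create-suspended flag."""
    entry: int
    suspended: bool


def never_resumed(spawns, resumed_entries):
    """Entries created suspended for which no resume was observed."""
    return [s.entry for s in spawns if s.suspended and s.entry not in resumed_entries]


# Entry points and susp flags as listed in the first spawn burst.
spawns = [
    Spawn(0x824D2878, True),   # XAudio
    Spawn(0x824D2940, True),   # XAudio
    Spawn(0x82178950, False),  # XMA, spawned running
]

# Illustrative resumed sets: canary resumes both XAudio threads, ours neither.
ours_stuck = never_resumed(spawns, resumed_entries=set())
canary_stuck = never_resumed(spawns, resumed_entries={0x824D2878, 0x824D2940})

only_in_ours = [hex(e) for e in ours_stuck if e not in canary_stuck]
print(only_in_ours)
```

Any entry that appears only in the ours list is a candidate whose resume call site in canary is worth locating.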

## Verified-but-doesn't-help LOC budget across recent audits

(For methodology context: every recent audit landed correctness or
diagnostic LOC but moved progression 0%.)

| Audit / Phase | LOC added | Component | Effect on progression |
|---|---:|---|---|
| AUDIT-067 vptr-mem-watch | +422 (canary) | Mem-watch diagnostic | 0 |
| AUDIT-068 S1-S4 | +520 cumul (canary) | Host-side write hooks | 0 (writer identified at guest PC) |
| AUDIT-069 S1-S5 | +60 (canary), 0 (ours) | Wait/release watch | 0 (counts diverge, no fix) |
| Phase D Stages 0-4 | +450-500 (ours+tools) | Contention manifest | 0 (104,607 cap unbroken) |
| Phase D D-extension | +95 (tool) | Nested-CS absorber | +439 matched-prefix only |
| Phase C+1 .. C+25 | varies | Allocator/event/thread shims | 0 (matched-prefix only) |
| Phase W | +20 (ours) | VdInitializeEngines r3=1 | +66 matched-prefix only |
| **Total to break wedge: 0 LOC of any kind** | | | |

This is the single most striking pattern from the audit chain: **every
honest correctness fix advances matched-prefix; none move
`draws / swaps / unique_render_targets`.**

## Falsification budget for the wedge framing

The wedge framing IS robust (no audit has falsified it since AUDIT-049).
But it has limited explanatory power: it tells us *what is blocked*,
not *what should unblock it*. Reading-error #38 (cross-spawn producer
paths missed by static reachability) and #36 (POD struct copy bypass)
both proved that the install / wake mechanism in canary involves paths
guest static analysis cannot see. This is a methodology constraint,
not an unsolvable problem.