The silph title state machine (tid13) blocked on event 0x10a0, never signaled. Root: the event's producer chain runs on the silph worker (entry 0x821C4AD0, our tid14), which was starved. tid14 shares a HW slot with a guest spinlock/ barrier participant (sub_824D1328, entry 0x824D2940) that busy-spins on the db16cyc hint `or r31,r31,r31` (encoding 0x7FFFFB78) at 0x824D140C. Under our round-robin lockstep the spinner consumed its whole block every round and starved the co-located tid14 (only 9 progress hits over 200M instr) — so the producer never reached the event-create/duplicate/signal dance the canary oracle performs (handle F80000E8 set by the submitter F8000044 via a duplicated handle). Fix (canary-faithful): recognize the db16cyc spin hint exactly as canary's InstrEmit_orx does (code 0x7FFFFB78 -> DelayExecution) and surface it as a new StepResult::Yield. The scheduler's yield_current() promotes every Ready peer on the slot past STARVE_LIMIT so begin_slot_visit picks one next round, then they reset and the spinner reclaims the slot — fair alternation, no priority inversion, pure function of slot state (deterministic). Result (lockstep, cache-persist, -n 200M): tid14 progresses past its old stall into a real wait; tid13 advances off 0x10a0 to a new event; hub/submitter re-enter their wait loops. imports 280k->592k, packets 124M->164M, swaps 1->2. draws still 0 (the splash's first draw is a further-upstream gate). Determinism preserved (two cold n50m runs byte-identical). n50m golden re-baselined (imports 90296->339766, swaps 1->2; draws unchanged 0). n2m golden unchanged (db16cyc not reached in first 2M). Tests 670/670. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
132 lines
9.5 KiB
Markdown
132 lines
9.5 KiB
Markdown
# AUDIT-059 — handle disambiguation (iterate 2.BD)
|
||
|
||
**Date:** 2026-06-06. **Engines:** ours `target/release/xenia-rs -n 50M` (3.9 s wall, 50M instr, 40k import calls), canary Wine `xenia_canary.exe --mute=true --audit_handle_lifecycle=true` (~35 s wall, 34k log lines, 0 fatals).
|
||
|
||
## Verdict — HANDOFF's wedge handles are stale
|
||
|
||
HANDOFF said: *"opt_callback signals 0x108c, tid=1 wedges on 0x10e8."* Both IDs are now `<UNCREATED>` in ours, along with `0x1090 / 0x10dc / 0x10fc / 0x1104` (also in HANDOFF's adjacent list). The allocation order shifted since that snapshot.
|
||
|
||
## Real wedges, current code state
|
||
|
||
| Handle | Kind | Engine state | Waiter | Notes |
|
||
|---|---|---|---|---|
|
||
| **0x12a4** | `<UNCREATED>` | `<AUDIT_BLIND>`, waiters=1 | **tid=1 main**, pc=0x824ac578 | Wait went via `do_wait_single` but creation never hit `NtCreateEvent` — `KeInitializeEvent` path. **This is the iterate-2.BC wedge** (recorded as "0x10e8" in HANDOFF — same site, different ID). |
|
||
| **0x12ac** | Event/Auto | `<NO_SIGNALS_DESPITE_WAITS>`, waiters=1 | **tid=13** silph UI cluster, pc=0x824ac578 lr=0x821cb1e0 | Frame trail: `0x821cb1e0 → 0x821cbae0 → 0x821cc454 → 0x821c4f18 → 0x82174a80`. Frames 3-5 carry `silph::UImpl@GamePart_Title` / `silph::VGamePart_Title` vtables — **audit-049's cluster, unchanged**. |
|
||
| 0x12b8 | Event/Auto | NO_SIGNALS, waiters=1 | (tid TBD) | Sibling, 0xC bytes from 0x12ac. |
|
||
| 0x1020 | Event/Manual | NO_SIGNALS, waiters=1 | — | γ-class. |
|
||
| 0x1040 | Event/Auto | NO_SIGNALS, waits=32 (hot poll) | — | Heavy wait, no signal. |
|
||
| 0x10a8 | Event/Auto | NO_SIGNALS, waits=7 | — | γ-class. |
|
||
| 0x10e4 | Event/Manual | NO_SIGNALS, waiters=1, waits=2 | — | γ-class. |
|
||
|
||
**Working handles** (sanity baseline): 0x1028 (Sema, 8 waits / 7 signals / 7 wakes), 0x10d0 (Sema, 2 waits / 1 signal / 1 wake), 0x10f0 (Event/Auto, 1/1/1 ✓ marked `<SUSPECT>` but actually fine), 0x10e0 (Event/Manual, 32 primary signals from somewhere).
|
||
|
||
## GPU interrupt delivery — the iterate-2.BC delta confirmed
|
||
|
||
| Engine | gpu.interrupt.delivered (vsync) | EmulateCPInterruptDPC / vblank pump |
|
||
|---|---:|---:|
|
||
| **ours** | 54 (source=0) + 1 (source=1) | — |
|
||
| **canary** | — | **4712** in 30 s ≈ 157 Hz |
|
||
|
||
**~87× ratio.** Confirms HANDOFF's diagnosis: ours' victim-thread injector dies once guest threads all park; canary's host frame-limiter thread keeps firing regardless.
|
||
|
||
## Canary signaler attribution
|
||
|
||
Top KeSetEvent guest_ptrs in canary (30 s window):
|
||
|
||
| guest_ptr | KeSetEvent fires | Inferred role |
|
||
|---|---:|---|
|
||
| `0x828A3254` | 5729 | Audio host-pump worker (per AUDIT-032: `r3=0x828A3230` region) |
|
||
| `0x828A3244` | 5728 | Audio host-pump sibling |
|
||
| `0x828A3244` + 16-byte stride | — | Static XEX-image audio event struct |
|
||
| `0xBCE25234` | 1301 | **silph UI cluster PKEVENT** (heap-allocated, 0x10 stride). Likely ours' 0x12ac analog. |
|
||
| `0xBCE25214 / 0xBCE25244 / 0xBCE25224` | 648 / 603 / 603 | Sibling silph UI PKEVENTs (0x10 stride struct). Likely ours' 0x12a4 / 0x12b8 / 0x1040 analogs. |
|
||
|
||
Ours signals every one of those equivalents **0 times**.
|
||
|
||
## Round 2 — LR-extended probes name the producer
|
||
|
||
Extended the canary probes with guest-LR capture (5 sites in `xboxkrnl_threading.cc`, 10 LOC). Re-ran the harness. Now each `KeSetEvent` line carries the guest function that signaled the event. Result for the silph UI cluster:
|
||
|
||
| PKEVENT | KeSetEvent count | Producer LR(s) |
|
||
|---|---:|---|
|
||
| `0xBCE25214` | 574 | `0x82508510` (single producer) |
|
||
| `0xBCE25224` | 565 | `0x82508358` (single producer) |
|
||
| `0xBCE25234` | 1153 | `0x82506C90` (579) + `0x82508524` (574) |
|
||
| `0xBCE25244` | 570 | `0x82506F9C` (single producer) |
|
||
| `0xBCE25284` | 1 | `0x82507ABC` (one-shot 5th-worker init?) |
|
||
|
||
All 6 producer LRs sit in `0x82506000–0x82509000`. **This is exactly the `sub_825070F0` worker thread cluster** that audit-057/058 already named:
|
||
|
||
> *audit-057: "sub_825070F0 (4 missing, initializes 4 workers w/ shared ctx 0xBCE25340, entries 0x82506528/58/88/B8)"*
|
||
|
||
The 4 worker entries (`0x82506528/58/88/B8`) are inside `sub_82506xxx` — exactly where the producer LRs `0x82506C90`/`0x82506F9C` live. The other producer LRs `0x825083xx` / `0x825085xx` are in downstream callees (workers call deeper code which itself calls KeSetEvent).
|
||
|
||
For comparison the audio host-pump pair gets a single sharp producer too:
|
||
- `0x828A3254` × 5271 ← `lr=0x824D2A44`
|
||
- `0x828A3244` × 5271 ← `lr=0x824D292C`
|
||
|
||
(These match AUDIT-032's PC `0x824D229C / r3=0x828A3230` region — already-understood audio host-pump.)
|
||
|
||
## Verdict — 2.BE is INSUFFICIENT for the silph UI wedge
|
||
|
||
The silph UI PKEVENTs are signaled exclusively by threads spawned by `sub_825070F0`. Per audit-057/058, **`sub_825070F0` fires 0× in ours** — those 4 worker threads never spawn. Therefore the PKEVENTs are never signaled. Therefore tid=13 (`0x12ac` in ours) wedges forever.
|
||
|
||
**`sub_825070F0`'s call chain is gated by the audit-009 "unreachability island"** — a CRT-driven fnptr-array bootstrap that ours fails to enumerate. VSync delivery is irrelevant to that bootstrap; the host frame-limiter thread does not drive CRT initializers.
|
||
|
||
Therefore:
|
||
- **2.BE alone CANNOT unwedge tid=13.** It will close the 54-vs-4712 VSync delivery gap and may unblock things downstream of vsync, but the silph UI wedge has an independent missing-signaler root cause.
|
||
- **2.BE may still unwedge tid=1 main on `0x12a4`** — that wait went via `KeInitializeEvent` (handle never hit `NtCreateEvent` in ours, hence `<AUDIT_BLIND>`). Whether `0x12a4`'s signaler depends on VSync is unknown without further probing.
|
||
|
||
## Implications for next moves
|
||
|
||
A single fix won't take us to draws > 0. We need at least two:
|
||
|
||
1. **2.BE (VSync delivery)** — still worth landing for the architectural correctness it brings, AND because it's the only fix that can unwedge tid=1 main's `0x12a4` if that's vsync-derived. ~60–80 LOC per Agent C's plan.
|
||
2. **2.BF (sub_825070F0 activation)** — this is the audit-058 unfinished business. Options:
|
||
- (a) **Static work:** trace canary's CRT-driven fnptr-array path that activates the silph UI bootstrap; backport the missing init into ours. High info, slow. Requires more probing.
|
||
- (b) **Direct synthetic spawn:** ours injects host-side `ExCreateThread` calls for the 4 worker entries at boot completion, mirroring AUDIT-048's audio-host-pump precedent. Pragmatic; ~40 LOC; risks getting context (`0xBCE25340`) wrong.
|
||
|
||
A possible third move:
|
||
|
||
3. **Re-probe with LR on Wait paths** (we already added it but didn't grep for it) — to tell us whether tid=1's wait on `0x12a4` is the same LR as `sub_825070F0`-chain or a totally different signaler. If different, it's a 3rd missing producer.
|
||
|
||
## Round 4 — wait-side guest LR via one-frame back-chain walk
|
||
|
||
After fixing the PPC stack-walk offset (Xbox 360 stores saved LR at `[prev_sp - 8]`, not the `+4` AIX convention), wait-side LR comes through cleanly.
|
||
|
||
**Canary's top wait sites:**
|
||
|
||
| canary handle | wait count | guest_lr | LR region | mapping |
|
||
|---|---:|---|---|---|
|
||
| `F800005C` | 1635 | `0x8216EE14` | kernel early-boot infra | unrelated |
|
||
| `F800000C` | 1597 | `0x824AFFC4` | xboxkrnl wrapper (scheduler / work-queue?) | unrelated |
|
||
| **`F80000DC`** | **476** | **`0x821C7D3C`** | **silph::UImpl/GamePart** | **= ours' 0x12ac silph UI wedge** |
|
||
| `F80000B0` | 6 across | `0x821CBAE0` + `0x821CC19C` + `0x822DFE2x/D0` | **exact match with audit-049's frame trail** | sibling silph UI wait |
|
||
|
||
Identity proof: ours' audit-049 frame trail for the silph UI wedge was `0x821cb1e0 / 0x821cbae0 / 0x821cc454 / 0x821c4f18 / 0x82174a80`. Round 4 captures `0x821CBAE0` and `0x821CC19C` (adjacent PCs) as wait LRs in canary — same cluster, same code.
|
||
|
||
**Refined verdict.** ours' `0x12a4` (tid=1 main, AUDIT_BLIND) and `0x12ac` (tid=13 silph UI) are 8 bytes apart — likely sibling KEVENT fields in the same silph UI struct. canary's analogs are in the `F80000xx` namespace, similarly clustered. The single fix that addresses both:
|
||
|
||
> **2.BF (b)** — synthetic host-side spawn of `sub_825070F0`'s 4 workers at the audit-058-identified context (`0xBCE25340`), entries `0x82506528/58/88/B8`. Once those workers run, they signal the silph UI PKEVENT cluster, unwedging BOTH tid=1 main and tid=13 silph UI in one shot.
|
||
|
||
2.BE (host-driven VSync ISR delivery) becomes follow-on work after the UI bootstrap completes and frame pacing actually matters.
|
||
|
||
## Open questions for iterate 2.BD′ / 2.BE planning
|
||
|
||
1. **Does 2.BE alone unwedge tid=13?** Cheapest verification path: land 2.BE and re-run audit-059, see whether `0x12ac` signal count goes 0 → non-zero.
|
||
2. **What is the LR-pattern of canary's `KeSetEvent guest_ptr=0xBCE25234` callers?** The current probe doesn't capture LR — extending the cvar to do so on a filtered subset would let us name the producer function in canary's namespace.
|
||
3. **Does the GPU frame-limiter's CP interrupt actually walk into the silph UI cluster?** I.e., does `EmulateCPInterruptDPC` → `interrupt_callback` → guest code ever hit `sub_821CB030` or its callees? An LR probe inside `EmulateCPInterruptDPC` would answer this.
|
||
|
||
## Artifacts
|
||
|
||
- `canary.log` 2.2 MB / 34,095 lines / 32,977 AUDIT-HLC lines
|
||
- `canary.stdout` 2.2 MB (duplicate of canary.log due to log_file fallback)
|
||
- `canary.stderr` 8.4 KB (Wine diagnostics)
|
||
- `ours.log` 479 lines (focus ledger + thread diagnostics + final state)
|
||
- `ours.stderr` 317 lines (kernel-call counters)
|
||
- `vkd3d-proton.cache.write` 15 KB (build artifact, ignored)
|
||
|
||
Commits in play (xenia-canary, fork-local only):
|
||
- `03362b59f` cross-build-wine (cross-compile toolchain)
|
||
- `d031d7c51` audit-handle-lifecycle-probes (this audit's probes)
|