xenia-rs/audit-runs/phase-w-wedge-reattack/escalation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase W escalation — wedge unbroken by accumulated tooling
**Outcome category: (C) — escalating cleanly.**
The Phase W mini-fix landed (`VdInitializeEngines` now returns 1 instead of
the old 0, matching canary `xboxkrnl_video.cc:271-279`). This is a real
correctness fix that advances Phase A matched-prefix
**105,046 → 105,112 (+66 events)**. But on the brief's actual gate —
`swaps > 1` / `draws > 0` / `texture_cache_entries > 0` — the
`check --stable-digest -n 500000000` run is **byte-identical to
baseline**: `draws=0, swaps=1, render_targets=0`. The fix does not
unblock progression.
## What we verified afresh
1. The wedge is structurally identical to AUDIT-049/058/059/062/065:
   * tid=1 join-waits tid=13 at `sub_82173990+0x2D0` (handle `0x12c8`).
   * tid=13 wedges at `sub_821CB030+0x1B0` on Event `0x12d0`
     (`<NO_SIGNALS_DESPITE_WAITS>`).
   * `sub_825070F0` (vtable[1] worker-spawner) fires 0×.
   * 4 of 5 canary worker tids (canary's tid=14/15/4/+ several more)
     emit hundreds of thousands of events; ours's equivalents emit
     ≤80. AUDIT-057 thread-gap PERSISTS.
2. New tooling (handle.create/destroy, thread.create/exit,
wait.begin, shared-global SID absorbers) was applied. It surfaces
normal cold-vs-cold divergences past 105K but does NOT illuminate
a new signal-flow gap on the wedge handle itself.
3. The wedge handle's SID `d5e23609d3948568` has zero matches in any
canary cold trace. The per-tid-PC SID recipe yields different SIDs
for what is *logically* the same Event across engines, because
create-site PC + tid + tid_event_idx all participate in the hash.
This is by design (it's NOT a process-global dispatcher), but it
means the new wait.begin events cannot directly identify "which
canary NtSetEvent call should signal this".
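
A minimal sketch of why the same logical Event gets different SIDs on each engine. The real hash is not given in this note, so the FNV-1a function, the field widths, and the name `sid` are assumptions for illustration. The only property taken from the text above is that create-site PC, tid, and tid_event_idx all feed the hash.

```rust
/// Hypothetical stand-in for the per-tid-PC SID recipe: FNV-1a over
/// (create-site PC, tid, per-tid event index). The actual hash used by
/// the tooling is not specified here; only the three inputs are.
fn sid(create_pc: u32, tid: u32, tid_event_idx: u32) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut h = FNV_OFFSET;
    for word in [create_pc, tid, tid_event_idx] {
        for byte in word.to_le_bytes() {
            h ^= u64::from(byte);
            h = h.wrapping_mul(FNV_PRIME);
        }
    }
    h
}
```

With an identical create-site PC and event index (placeholder values), changing only the tid from 13 (ours) to 9 (canary's logical equivalent) already yields a different SID, so a direct SID lookup across engines cannot pair the two objects.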
## Why this is hard — the structural impasse
The matched-prefix metric and the progression metric measure
different things. Matched-prefix tracks the **tid=1-only** event
sequence in lockstep up to the first kind-mismatch. The wedge is on
**tid=13** waiting for a signal that would come from a
**worker-cluster thread that never spawns**. The two threads barely
overlap in the matched-prefix view (tid=1 is fine for 105K events
*because* it hasn't reached the join-wait yet from Phase A's
perspective — `sub_82173990+0x2D0` is past idx 105,112 in canary's
tid=6 stream).
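
The metric described above can be sketched as follows, assuming a simplified event record. The `Event` struct, its field names, and the explicit tid pairing passed as arguments are illustrative and not the tool's real schema.

```rust
/// One trace event, reduced to the fields this sketch needs.
/// Field names are illustrative, not the tool's real schema.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    tid: u32,
    kind: String,
}

/// Length of the common prefix of two single-thread streams, compared
/// by event kind only and stopping at the first kind mismatch.
/// Events from every other thread are ignored entirely.
fn matched_prefix(ours: &[Event], canary: &[Event], ours_tid: u32, canary_tid: u32) -> usize {
    let a = ours.iter().filter(|e| e.tid == ours_tid);
    let b = canary.iter().filter(|e| e.tid == canary_tid);
    a.zip(b).take_while(|(x, y)| x.kind == y.kind).count()
}
```

Because the filter discards every thread except the paired one, nothing that happens (or fails to happen) on tid=13 or on the worker tids can change the returned count.
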
Every Phase C fix has correctly advanced matched-prefix while
leaving the wedge untouched, because the wedge needs the worker
cluster to bootstrap, and the worker cluster's activation chain
(`sub_822F1AA8 → sub_82173990 → sub_821746B0 → sub_821748F0 →
sub_821C4EB0 → sub_821CC3F8 → sub_821CBA08 → sub_821CB030` and
in parallel `→ sub_82172BA0 → sub_821B55D8 → sub_824F8398 →
sub_824F7CD0 → sub_824F7800 → sub_825070F0 → 4 worker spawns`) is
gated on the tid=13 wait completing, which is gated on a worker
signal, which is gated on the worker cluster bootstrapping. This
is the **same self-referential lock** AUDIT-063 documented.
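
The gating chain can be modelled as a tiny "gated on" graph. This is a sketch under the assumption that each step has exactly one gate; the node names are labels invented here, not identifiers from the trace.

```rust
use std::collections::HashMap;

/// Follow "X is gated on Y" edges from `start`. Returns the cycle (in
/// visiting order) if the walk reaches a node it has already seen, or
/// None if the chain ends at a node with no gate.
fn find_gate_cycle<'a>(
    gated_on: &HashMap<&'a str, &'a str>,
    start: &'a str,
) -> Option<Vec<&'a str>> {
    let mut path: Vec<&'a str> = vec![start];
    let mut node = start;
    while let Some(&next) = gated_on.get(node) {
        if let Some(pos) = path.iter().position(|&n| n == next) {
            return Some(path[pos..].to_vec());
        }
        path.push(next);
        node = next;
    }
    None
}
```

Feeding it the three edges from the paragraph above (tid=13 wait gated on worker signal, worker signal gated on worker bootstrap, worker bootstrap gated on tid=13 wait) returns all three nodes as one cycle. No fix applied to a node outside that cycle can break it.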
## What new information Phase W produced
1. **VdInitializeEngines stub fix** (the landing). Trivially
correct, advances matched-prefix +66, does not move progression.
Worth keeping in canon for cold-vs-cold parity. New stable digest
`73e99d60029128b4d5c3dd98e540457d82a52b8a962e7495132be2be31411aca`
× 3 byte-identical.
2. **Confirmed via the new wait.begin events**: canary's tid=9
(= ours's tid=13 logical role) calls `wait.begin` on shared-global
dispatcher Event `0xf800004c` (SID `c9f426cc34f55865`) at idx 321
*immediately* after `RtlEnterCriticalSection` is issued. This shows
that CS contention on canary's side wakes via the shared-global
path, while ours's per-tid Event takes the explicit
`NtCreateEvent+NtWaitForSingleObjectEx` path. **These are two
different objects, not one object waiting on the same signal.** The
tooling correctly says so.
3. **The brief's hypothesis is correct**: matched-prefix is no
longer the right metric. Progression has not moved across 25
phases.
## Recommended next steps (ranked)
### Path 1 (recommended) — accept C+25 fallback and continue normal iteration
Dispatch C+25 = `MmAllocatePhysicalMemoryEx` / `MmGetPhysicalAddress`
deterministic allocator (the new first divergence at idx 105,112 is
in this family). Normal Phase C cadence; advances matched-prefix
without claiming wedge unblocking. **Be honest in memory notes that
matched-prefix is the only metric moving.**
### Path 2 — re-examine the absorbers
The C+18/C+21/D-extension absorbers all explicitly fold "scheduling
jitter" classes. Per the brief's Path B suggestion: is any absorber
HIDING a signal that would resolve the wedge? Specifically:
* C+18 shared-global SID absorber folds canary's
`aafae4c71fd42890` work-queue semaphore creation into ours's
emission window even when ours never creates the equivalent. If
ours's worker fails to *enqueue* something canary's worker awaits,
we'd never see the gap because the matched-prefix isn't on the
worker tid in the first place.
* The D-extension absorber folds nested-CS cleanup blocks. If
canary's `Enter/Leave` block contains the NtSetEvent that signals
the wedge handle (via descendant `xeKeSetEvent`), the absorber
hides that.
Concretely: un-absorb, re-diff, and look for the first folded canary block
that contains an `NtSetEvent` whose SID resolves to the wedge handle.
Estimated cost: ~3-5 hours of analysis, no LOC change.
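
The scan over folded blocks could look like the sketch below. `TraceEvent`, `FoldedBlock`, and their fields are hypothetical types, not the diff tool's real data model. Since the wedge handle's SID has no match in canary traces, `target_sid` here must be the canary-side SID of the logically equivalent Event, and finding that mapping is outside this sketch.

```rust
/// A trace event as seen after un-absorbing. Hypothetical schema.
struct TraceEvent {
    kind: String,
    sid: Option<u64>,
}

/// A run of canary events that an absorber had folded away.
struct FoldedBlock {
    start_idx: usize,
    events: Vec<TraceEvent>,
}

/// Start index of the first folded block that contains an NtSetEvent
/// on `target_sid`, or None if no folded block signals it.
fn first_block_signalling(blocks: &[FoldedBlock], target_sid: u64) -> Option<usize> {
    blocks
        .iter()
        .find(|b| {
            b.events
                .iter()
                .any(|e| e.kind == "NtSetEvent" && e.sid == Some(target_sid))
        })
        .map(|b| b.start_idx)
}
```

A `None` result across all absorbers would rule out this path. A hit gives the exact canary index at which to resume the signal-flow investigation.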
### Path 3 — install host-side mem-watch + diff on wedge handle's guest memory
AUDIT-067 established that vtable installs go through host-side
writes invisible to guest-PC traces. By the same logic, the wedge
handle's kernel object header may be mutated by host code (the
canary scheduler / dispatcher) in ways ours doesn't replicate. Hook
`Memory::write*` in canary on the wedge handle's address; compare
against ours.
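
On either engine, the watch itself reduces to an overlap test plus a log that can be diffed afterwards. The sketch below is in Rust for the xenia-rs side; `WatchRange`, `WriteLog`, and `on_write` are hypothetical names, and the actual hook point in canary's `Memory::write*` is not shown.

```rust
/// Guest address range to watch, e.g. the wedge handle's object header.
/// Base and length are supplied by the caller; none are known here.
struct WatchRange {
    base: u32,
    len: u32,
}

impl WatchRange {
    /// True if a non-empty write of `write_len` bytes at `addr`
    /// touches the range. Uses u64 so end addresses cannot overflow.
    fn overlaps(&self, addr: u32, write_len: u32) -> bool {
        let (ws, we) = (u64::from(addr), u64::from(addr) + u64::from(write_len));
        let (rs, re) = (u64::from(self.base), u64::from(self.base) + u64::from(self.len));
        write_len > 0 && ws < re && rs < we
    }
}

/// Ordered record of writes that hit the watched range.
#[derive(Default)]
struct WriteLog {
    hits: Vec<(u32, Vec<u8>)>,
}

impl WriteLog {
    /// Call from the host-side write path; records only overlapping writes.
    fn on_write(&mut self, watch: &WatchRange, addr: u32, data: &[u8]) {
        if watch.overlaps(addr, data.len() as u32) {
            self.hits.push((addr, data.to_vec()));
        }
    }
}
```

Comparing the two `hits` sequences entry by entry would show the first host-side mutation that canary performs and ours does not.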
### Path 4 — scheduler determinism investment
This is the unfunded `scheduler_determinism_plan` artifact (per memory).
Stage 0 was a null result; the contention manifest stages landed but didn't
move the cap. The PLAN doc explicitly notes the wedge is upstream of
contention, so this is unlikely to help without additional work.
## Honesty note
19 prior audits attacked this same wedge and failed. Phase W is the
20th. We landed a correct mini-fix, but the wedge itself is
unchanged. The user's instinct to call this an honest fallback is the
correct posture.