xenia-rs/audit-runs/phase-w-wedge-reattack/escalation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase W escalation — wedge unbroken by accumulated tooling

Outcome category: (C) — escalating cleanly.

The Phase W mini-fix landed: VdInitializeEngines now returns 1 instead of 0, matching canary xboxkrnl_video.cc:271-279. This is a real correctness fix that advances Phase A matched-prefix from 105,046 to 105,112 (+66 events). But on the brief's actual gate (swaps > 1 / draws > 0 / texture_cache_entries > 0), the check --stable-digest -n 500000000 run is byte-identical to baseline: draws=0, swaps=1, render_targets=0. The fix does not unblock progression.

What we verified afresh

  1. The wedge is structurally identical to AUDIT-049/058/059/062/065:

    • tid=1 join-waits tid=13 at sub_82173990+0x2D0 (handle 0x12c8).
    • tid=13 wedges at sub_821CB030+0x1B0 on Event 0x12d0 (<NO_SIGNALS_DESPITE_WAITS>).
    • sub_825070F0 (vtable[1] worker-spawner) fires 0×.
    • 4 of 5 canary worker tids (canary's tid=14, 15, 4, and several more) emit hundreds of thousands of events; ours's equivalents emit ≤80. The AUDIT-057 thread gap persists.
  2. New tooling (handle.create/destroy, thread.create/exit, wait.begin, shared-global SID absorbers) was applied. It surfaces normal cold-vs-cold divergences past 105K but does NOT illuminate a new signal-flow gap on the wedge handle itself.

  3. The wedge handle's SID d5e23609d3948568 has zero matches in any canary cold trace. The per-tid-PC SID recipe yields different SIDs for what is logically the same Event across engines, because create-site PC + tid + tid_event_idx all participate in the hash. This is by design (it's NOT a process-global dispatcher), but it means the new wait.begin events cannot directly identify "which canary NtSetEvent call should signal this".
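To make the SID mismatch concrete, here is a minimal sketch of a per-tid-PC recipe. The hash function, the field encoding, and the `CREATE_PC` value are assumptions chosen for illustration; the real recipe is not reproduced here. Only the three participating inputs (create-site PC, tid, tid_event_idx) and the tid=13 / tid=9 pairing come from the notes above.

```python
import hashlib

# Hypothetical create-site PC; the real create site is not stated in these notes.
CREATE_PC = 0x82000000

def sid(create_pc: int, tid: int, tid_event_idx: int) -> str:
    """Illustrative per-tid-PC SID: a 64-bit hex digest over the three inputs."""
    payload = f"{create_pc:08x}:{tid}:{tid_event_idx}".encode()
    return hashlib.blake2b(payload, digest_size=8).hexdigest()

# Same logical Event and same create site, but the creating thread is
# numbered differently in each engine (ours tid=13, canary tid=9).
ours_sid = sid(CREATE_PC, tid=13, tid_event_idx=0)
canary_sid = sid(CREATE_PC, tid=9, tid_event_idx=0)

print(ours_sid, canary_sid, ours_sid == canary_sid)
```

Because tid is part of the hashed payload, any recipe of this shape gives the two engines different SIDs for the same logical Event unless the thread numbering happens to line up.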

Why this is hard — the structural impasse

The matched-prefix metric and the progression metric measure different things. Matched-prefix tracks the tid=1-only event sequence in lockstep up to the first kind-mismatch. The wedge is on tid=13 waiting for a signal that would come from a worker-cluster thread that never spawns. The two threads barely overlap in the matched-prefix view (tid=1 is fine for 105K events because it hasn't reached the join-wait yet from Phase A's perspective — sub_82173990+0x2D0 is past idx 105,112 in canary's tid=6 stream).
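The gap between the two metrics can be illustrated with a small sketch of a matched-prefix computation. The event shape `(tid, kind)` and the sample traces are hypothetical; only the idea of walking a single thread's stream in lockstep until the first kind mismatch is taken from the description above.

```python
def matched_prefix(ours, canary, ours_tid, canary_tid):
    """Count leading events of one thread whose kinds agree in both traces."""
    a = [kind for tid, kind in ours if tid == ours_tid]
    b = [kind for tid, kind in canary if tid == canary_tid]
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break  # first kind-mismatch ends the prefix
        n += 1
    return n

# Hypothetical traces: events on other tids are ignored entirely.
ours = [(1, "handle.create"), (13, "wait.begin"),
        (1, "thread.create"), (1, "wait.begin")]
canary = [(6, "handle.create"), (6, "thread.create"), (9, "wait.begin"),
          (14, "handle.create"), (6, "handle.destroy")]

prefix = matched_prefix(ours, canary, ours_tid=1, canary_tid=6)
print(prefix)
```

Nothing that happens on tid=13 or the worker tids can change this number, which is why the metric can keep advancing while the wedge stays in place.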

Every Phase C fix has correctly advanced matched-prefix while leaving the wedge untouched, because the wedge needs the worker cluster to bootstrap. The worker cluster's activation chain is:

    • sub_822F1AA8 → sub_82173990 → sub_821746B0 → sub_821748F0 → sub_821C4EB0 → sub_821CC3F8 → sub_821CBA08 → sub_821CB030
    • and in parallel → sub_82172BA0 → sub_821B55D8 → sub_824F8398 → sub_824F7CD0 → sub_824F7800 → sub_825070F0 → 4 worker spawns

This chain is gated on the tid=13 wait completing, which is gated on a worker signal, which is gated on the worker cluster bootstrapping. This is the same self-referential lock AUDIT-063 documented.
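The self-referential lock can be written down as a tiny "is gated on" graph. This is a sketch only: the node labels paraphrase the three gates named above, and the single-edge-per-node shape is a simplifying assumption.

```python
def find_cycle(deps, start):
    """Follow 'is gated on' edges from start; return the loop if one closes."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = deps.get(node)
    if node is None:
        return None  # chain ended without looping back
    return path[path.index(node):]

# Each key is gated on its value.
deps = {
    "worker cluster bootstrap": "tid=13 wait completes",
    "tid=13 wait completes": "worker signal",
    "worker signal": "worker cluster bootstrap",
}

cycle = find_cycle(deps, "tid=13 wait completes")
print(cycle)
```

Every node in the returned list depends, transitively, on itself, so no fix confined to one node's inputs can start the chain.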

What new information Phase W produced

  1. VdInitializeEngines stub fix (the landing). Trivially correct, advances matched-prefix +66, does not move progression. Worth keeping in canon for cold-vs-cold parity. New stable digest 73e99d60029128b4d5c3dd98e540457d82a52b8a962e7495132be2be31411aca × 3 byte-identical.
  2. Confirmed via the new wait.begin events: canary's tid=9 (= ours's tid=13 logical role) calls wait.begin on shared-global dispatcher Event 0xf800004c (SID c9f426cc34f55865) at idx 321 immediately after RtlEnterCriticalSection issues — proving that CS contention on canary's side awakens via the shared-global path while ours's per-tid Event takes the explicit NtCreateEvent+NtWaitForSingleObjectEx path. These are two different objects, not one waiting for the same signal. The tooling correctly says so.
  3. The brief's hypothesis is correct: matched-prefix is no longer the right metric. Progression has not moved across 25 phases.

Dispatch C+25 = MmAllocatePhysicalMemoryEx / MmGetPhysicalAddress deterministic allocator (the new first divergence at idx 105,112 is in this family). Normal Phase C cadence; advances matched-prefix without claiming wedge unblocking. Be honest in memory notes that matched-prefix is the only metric moving.

Path 2 — re-examine the absorbers

The C+18/C+21/D-extension absorbers all explicitly fold "scheduling jitter" classes. Per the brief's Path B suggestion: is any absorber HIDING a signal that would resolve the wedge? Specifically:

  • C+18 shared-global SID absorber folds canary's aafae4c71fd42890 work-queue semaphore creation into ours's emission window even when ours never creates the equivalent. If ours's worker fails to enqueue something canary's worker awaits, we'd never see the gap because the matched-prefix isn't on the worker tid in the first place.
  • The D-extension absorber folds nested-CS cleanup blocks. If canary's Enter/Leave block contains the NtSetEvent that signals the wedge handle (via descendant xeKeSetEvent), the absorber hides that.

Concrete: un-absorb, re-diff, look for the first FOLDED canary block that contains an NtSetEvent whose SID resolves to the wedge handle. ~3-5 hours of analysis, no LOC change.
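The scan described here could look roughly like the sketch below. The block and event record shapes (`folded`, `events`, `kind`, `sid`) are hypothetical, as are the placeholder SIDs. The set of SIDs that "resolve to the wedge handle" is passed in, since the notes above say the wedge handle's own SID has no direct match in canary traces.

```python
def first_hidden_signal(blocks, wedge_sids):
    """Index of the first folded canary block holding a matching NtSetEvent."""
    for i, block in enumerate(blocks):
        if not block["folded"]:
            continue  # only absorbed blocks can be hiding a signal
        for ev in block["events"]:
            if ev["kind"] == "NtSetEvent" and ev["sid"] in wedge_sids:
                return i
    return None

# Hypothetical un-absorbed canary diff, with placeholder SIDs.
blocks = [
    {"folded": False, "events": [{"kind": "NtSetEvent", "sid": "sid-a"}]},
    {"folded": True, "events": [{"kind": "RtlEnterCriticalSection", "sid": "sid-b"}]},
    {"folded": True, "events": [{"kind": "RtlEnterCriticalSection", "sid": "sid-b"},
                                {"kind": "NtSetEvent", "sid": "sid-a"}]},
]

hit = first_hidden_signal(blocks, wedge_sids={"sid-a"})
print(hit)
```

A `None` result would suggest the absorbers are not hiding the signal, which is itself useful for ruling this path out.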

Path 3 — install host-side mem-watch + diff on wedge handle's guest memory

AUDIT-067 established that vtable installs go through host-side writes invisible to guest-PC traces. By the same logic, the wedge handle's kernel object header may be mutated by host code (the canary scheduler / dispatcher) in ways ours doesn't replicate. Hook Memory::write* in canary on the wedge handle's address; compare against ours.
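Once both engines log writes to the watched range, the comparison step might be sketched as follows. The `(offset, data)` log format, the 8-byte range, and the sample bytes are all hypothetical; this does not reflect the real kernel object header layout or any existing hook output.

```python
def apply_writes(size, writes):
    """Replay (offset, data) writes onto a zeroed buffer of the watched range."""
    buf = bytearray(size)
    for offset, data in writes:
        buf[offset:offset + len(data)] = data
    return bytes(buf)

def diff_offsets(a, b):
    """Byte offsets at which the two final images disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Hypothetical logs: canary's host code performs one extra write.
canary_writes = [(0, b"\x01\x00"), (4, b"\x00\x00\x00\x01")]
ours_writes = [(0, b"\x01\x00")]

mismatch = diff_offsets(apply_writes(8, canary_writes),
                        apply_writes(8, ours_writes))
print(mismatch)
```

Comparing final images rather than raw logs sidesteps differences in write ordering or granularity between the two engines, at the cost of missing transient states.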

Path 4 — scheduler determinism investment

This is the unfunded scheduler_determinism_plan artifact (per memory). Stage 0 was a null result; the contention manifest stages landed but didn't move the cap. The PLAN doc explicitly notes the wedge is upstream of contention, so this path is unlikely to help without additional work.

Honesty note

19 prior audits attacked this same wedge and failed. Phase W is the 20th. We landed a correct mini-fix, but the wedge itself is unchanged. The user's instinct to call this honest fallback is the correct posture.