handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions
--- a/audit-runs/phase-c25-mm-allocator-family/investigation.md
+++ b/audit-runs/phase-c25-mm-allocator-family/investigation.md
@@ -0,0 +1,117 @@
+# Phase C+25 — MmGetPhysicalAddress canonicalization
+
+## Step 1 — Framing verification (per reading-error #28)
+
+From `phase-w-wedge-reattack/diff-postfix.md` at `canary tid=6 → ours tid=1`, idx 105119 (canary) / 105112 (ours):
+
+```
+canary: [105119] kernel.return MmGetPhysicalAddress return_value=353042432  status=0x150b0000
+ours:   [105112] kernel.return MmGetPhysicalAddress return_value=182251520  status=0x0adcf000
+```
+
+Decoded:
+- canary 353042432 = `0x150B0000`. Per `xenia-canary/src/xenia/memory.cc:2317-2325`
+  (`PhysicalHeap::GetPhysicalAddress`): `address -= heap_base_; if (heap_base_ >=
+  0xE0000000) address += 0x1000;`. To produce `0x150B0000` from `vE0000000` (heap_base
+  `0xE0000000`): input VA `0xF50AF000` → `0xF50AF000 - 0xE0000000 + 0x1000 = 0x150B0000`. ✓
+- ours `0x0ADCF000`. Per `exports.rs:985-988` (`mm_get_physical_address`):
+  `ctx.gpr[3] &= 0x1FFF_FFFF`. To produce `0x0ADCF000` from the unified heap region
+  `0x40000000+`: input VA `0x4ADCF000` → `0x4ADCF000 & 0x1FFF_FFFF = 0x0ADCF000`. ✓
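
The decoding above can be re-checked with a few lines of Python. This is only a sketch: `canary_physical_address` and `ours_physical_address` are hypothetical helper names that mirror the two formulas quoted in the bullets, not either engine's actual code, and the input VAs are the ones inferred above rather than values read from a trace.

```python
def canary_physical_address(va: int, heap_base: int) -> int:
    """Mirror of the quoted canary formula (hypothetical helper name)."""
    pa = va - heap_base
    if heap_base >= 0xE0000000:
        pa += 0x1000
    return pa


def ours_physical_address(va: int) -> int:
    """Mirror of the quoted mask used by ours (hypothetical helper name)."""
    return va & 0x1FFF_FFFF


canary_pa = canary_physical_address(0xF50AF000, heap_base=0xE0000000)
ours_pa = ours_physical_address(0x4ADCF000)

# Both results agree with the decimal return_value fields in the excerpt.
assert canary_pa == 353042432 == 0x150B0000
assert ours_pa == 182251520 == 0x0ADCF000
print(f"canary=0x{canary_pa:08X} ours=0x{ours_pa:08X}")
```

The two assertions tie the decimal `return_value` fields to the hex `status` fields and to the inferred input VAs in one step.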
+
+Pre-context: identical sequence of `MmAllocatePhysicalMemoryEx` (canonicalized to
+shared sentinel) → `MmGetPhysicalAddress`. Next event after divergence:
+`VdInitializeRingBuffer` — the GPU consumes the PA opaquely.
+
+Both engines' translations are SELF-CONSISTENT: within each engine, the same input
+VA always maps to the same PA, and any subsequent GPU command pointing at that PA
+gets read back from the same host backing store. The divergence at the diff layer
+is a host-allocator-region symptom, not a semantic bug.
+
+## Step 2 — Classification
+
+Four candidates:
+
+- **(A)** Per-call value bug. NO — both formulas are correct for their respective
+  heap layouts. Canary's `PhysicalHeap::GetPhysicalAddress` is the authoritative
+  implementation for the three-heap memory model; the `& 0x1FFF_FFFF` mask in ours is
+  the documented equivalent for the unified heap (KRNBUG-Mm-04 noted at
+  `exports.rs:3771`).
+- **(B)** Allocator-region routing bug. YES, but this is the C+2 Path β deferral —
+  ours has a single `KernelState::heap_alloc` cursor at `0x40000000`; canary has
+  three physical heaps at `vA0/vC0/vE0` routed by page size via
+  `LookupHeapByType`. A fix is estimated at >100 LOC and would change the boot
+  trajectory unpredictably. **OUT OF SCOPE per Phase C+2 scope discipline.**
+- **(C)** Canonicalization gap. YES — `MmGetPhysicalAddress` is a VA→PA translator
+  whose return is consumed opaquely by GPU/audio subsystems. The same per-(tid,name)
+  ordinal sentinel scheme that covers `MmAllocatePhysicalMemoryEx` (C+2) applies
+  here. Fix: extend `ALLOCATOR_RETURN_FNS`.
+- **(D)** Upstream. NO — the predecessor `kernel.call MmGetPhysicalAddress`
+  matched cleanly on both engines.
+
+**Selected: (C) — diff-tool canonicalization.**
+
+## Step 3 — Fix
+
+Extended `ALLOCATOR_RETURN_FNS` in `xenia-rs/tools/diff-events/diff_events.py`
+with `"MmGetPhysicalAddress"` and a 20-line comment block explaining the
+deferred-Path-β rationale. Zero engine LOC.
+
+Per-(tid,name) ordinal sentinels (`<ALLOC_MmGetPhysicalAddress_N>`) reuse the
+existing `canonicalize_allocator_returns` machinery. As long as both engines
+call the translator the same number of times in the same per-tid order, the
+ordinals line up. A translation-count mismatch correctly surfaces as a
+divergence (ordinal drift → distinct sentinels at that position).
+
+The `payload.status` field is auto-mirrored (existing behavior of the
+canonicalizer, since the trampoline does not distinguish NTSTATUS from
+pointer-typed returns).
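
The ordinal scheme and the status mirroring described in Step 3 can be sketched roughly as follows. This is not the real `canonicalize_allocator_returns`: the event shape (`tid`, `kind`, `name`, `payload`), the 0-based ordinal, and the `ret` helper are assumptions made for illustration. The canary-to-ours tid mapping is left out, so the example compares payloads only.

```python
from collections import defaultdict
from copy import deepcopy

# Assumed subset of the registry; the real set lives in diff_events.py.
ALLOCATOR_RETURN_FNS = {"MmAllocatePhysicalMemoryEx", "MmGetPhysicalAddress"}


def canonicalize(events):
    """Swap allocator-like returns for per-(tid, name) ordinal sentinels."""
    counters = defaultdict(int)
    out = []
    for ev in events:
        ev = deepcopy(ev)  # leave the caller's events untouched
        if ev["kind"] == "kernel.return" and ev["name"] in ALLOCATOR_RETURN_FNS:
            key = (ev["tid"], ev["name"])
            sentinel = f"<ALLOC_{ev['name']}_{counters[key]}>"
            counters[key] += 1
            ev["payload"]["return_value"] = sentinel
            ev["payload"]["status"] = sentinel  # status mirrors the return
        out.append(ev)
    return out


def ret(tid, value):
    """Hypothetical builder for a MmGetPhysicalAddress return event."""
    return {"tid": tid, "kind": "kernel.return", "name": "MmGetPhysicalAddress",
            "payload": {"return_value": value, "status": hex(value)}}


def payloads(events):
    return [ev["payload"] for ev in canonicalize(events)]


# The C+25 pair collapses to the same sentinel on both sides.
assert payloads([ret(6, 0x150B0000)]) == payloads([ret(1, 0x0ADCF000)])

# One extra call on one side: the canonicalized streams differ.
assert payloads([ret(6, 0x150B0000)]) != payloads(
    [ret(1, 0x0ADC0000), ret(1, 0x0ADCF000)])
```

Keying the counter on `(tid, name)` rather than on the raw value is what keeps a translation-count mismatch visible while hiding the address difference.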
+
+## Step 4 — Tests added
+
+`test_diff_events.py` gains 4 unit tests (lines added at top of `main()`):
+
+1. `test_mm_get_physical_address_in_allocator_set` — registry guard.
+2. `test_mm_get_physical_address_canonicalization` — two-call per-tid ordinal.
+3. `test_mm_get_physical_address_cross_engine_alignment` — end-to-end: the
+   exact C+25 divergence (`0x150B0000` vs `0x0ADCF000`) canonicalizes to the
+   same sentinel on both sides.
+4. `test_mm_get_physical_address_count_mismatch_still_diverges` — ordinal-drift
+   negative test.
+
+39 baseline tests + 4 new = 43 total, all PASS.
+
+## Why no engine fix
+
+Per `project_phase_c2_MmAllocatePhysicalMemoryEx_2026_05_13.md`'s "Future work:
+β-class engine fix (deferred)" section:
+
+> If a future Phase C+N session surfaces a divergence whose causal chain goes
+> through region-arithmetic on a `MmAllocatePhysicalMemoryEx` return value
+> (e.g. `MmGetPhysicalAddress` yielding bus-incompatible addresses for GPU
+> command buffers), escalate to engine-side: add 3 physical heaps in
+> `xenia-memory` / `KernelState`, route `MmAllocatePhysicalMemoryEx` through
+> page-size lookup. Estimated 100-200 LOC + GPU/audio bridge re-validation;
+> out of scope for single-session work.
+
+This C+25 divergence IS the predicted scenario. The GPU is in-process here —
+both engines independently consume the PA they themselves emitted, so the
+opaque-pass-through invariant holds. The PA values diverge between engines
+but neither is wrong in its own coordinate space.
+
+Engine fix is deferred to a dedicated Path β session (estimated 100-200 LOC +
+multi-subsystem re-validation across GPU command buffer mappings, XMA audio
+context mapping via `MmMapIoSpace`, and any guest code paths doing PA
+arithmetic). Tripstone #3 explicitly forbids in-session escalation here.
+
+## Why the progression metric is not expected to move
+
+Phase W documented the wedge: tid=1 (main) joins on tid=13, tid=13 waits on
+worker event `0x12d0` that never gets signaled. The wedge is upstream of any
+GPU activity. Advancing matched-prefix past `MmGetPhysicalAddress` does NOT
+exercise any new game-logic branch — it just allows the diff harness to
+continue measuring beyond a previously-occluded translator-return divergence.
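
As a rough illustration of why the matched prefix can advance without any change in game progression, the sketch below counts leading events that compare equal. The `(name, return)` tuples are toy data built from the values discussed in this note, and `matched_prefix_len` is a hypothetical helper, not the harness's own metric code.

```python
from itertools import takewhile


def matched_prefix_len(canary_events, our_events):
    """Count leading events that compare equal on both sides."""
    pairs = zip(canary_events, our_events)
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], pairs))


# Toy (name, return) tuples, not real trace data.
ALLOC = ("MmAllocatePhysicalMemoryEx", "<ALLOC_MmAllocatePhysicalMemoryEx_0>")
RING = ("VdInitializeRingBuffer", None)

# Before C+25: raw PAs differ, so the prefix stops at the translator return.
canary_before = [ALLOC, ("MmGetPhysicalAddress", 0x150B0000), RING]
ours_before = [ALLOC, ("MmGetPhysicalAddress", 0x0ADCF000), RING]

# After C+25: both sides carry the same ordinal sentinel.
canonical_pa = ("MmGetPhysicalAddress", "<ALLOC_MmGetPhysicalAddress_0>")
canary_after = [ALLOC, canonical_pa, RING]
ours_after = [ALLOC, canonical_pa, RING]

assert matched_prefix_len(canary_before, ours_before) == 1
assert matched_prefix_len(canary_after, ours_after) == 3
```

The event streams themselves are unchanged by the fix; only the comparison sees further, which is consistent with the primary metric staying pinned.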
+
+Per task spec: "If only the secondary metric moves and the primary remains
+pinned (`swaps=1, draws=0`), document candidly: 'matched-prefix advanced but
+no game progression — wedge persists per Phase W finding'." That's exactly
+what happens here.