[iterate-2V] VdSwap: stop bumping primary CP_RB_WPTR out-of-band (canary-faithful)

Our `vd_swap` wrote its 64-dword XE_SWAP block at the guest's reserved `buffer_ptr` slot AND then bumped the primary ring `CP_RB_WPTR` out-of-band via `state.gpu.extend_write_ptr_by(64)`. That bump was a bug: `buffer_ptr` (~0x4add6efc) is NOT inside the primary ring (base ~0x4adcd000, 8192 dwords); it lives ~10k dwords past it, in the renderer indirect-buffer region. The bogus WPTR bump pushed the GPU read-pointer PAST the guest's real write-pointer; the drain treated the overshoot as a circular wrap and re-executed the splash's draw indirect-buffers ~2×, inflating draws to 78 (the real splash geometry is ~28 draws; 12 INDIRECT_BUFFERs vs the real 6).

Canary's `VdSwap_entry` (xenia-canary xboxkrnl_video.cc:518-548) writes the fetch-constant patch + PM4_XE_SWAP + NOP pad into the reserved slot and returns; it NEVER touches CP_RB_WPTR. The guest advances the primary ring write-pointer itself via its own doorbell once it has populated the slot. Swap-complete CP interrupts come only from the game's in-stream PM4_INTERRUPT packets, never from VdSwap.

This fix removes only the out-of-band `extend_write_ptr_by(64)` call, keeping the buffer_ptr block write intact and byte-faithful to canary.

Effect at `--gpu-inline -n 50M`: draws 78→28, INDIRECT_BUFFER 12→6 (re-execution artifact gone), swaps 4→2. The run now halts at ~19.27M instructions (worker threads exit) instead of spinning to 50M, because removing the corruption unmasks the real per-present-interrupt deadlock: the title loop needs a per-present PM4_INTERRUPT that the stalled game never submits. That deadlock is a SEPARATE, known gate that is tracked and addressed elsewhere; it is intentionally NOT papered over here.

Re-baselined golden crates/xenia-app/tests/golden/sylpheed_n50m.json to the new honest values (regenerated twice, byte-identical). sylpheed_n2m.json is unaffected (draws=0 at 2M).

cargo test --workspace: 675 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
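
Illustrative note on the overshoot: the toy model below is a sketch of why an out-of-band WPTR bump is dangerous, not the actual xenia-gpu drain. The helper names `in_ring` and `pending`, and the pointer values 100, 164 and 120, are hypothetical; only the ring size and the two approximate addresses come from the message above. It assumes the drain computes pending work as a modular distance between read- and write-pointer.

```rust
// Toy model of a circular command ring. All names here are hypothetical.
const RING_DWORDS: u32 = 8192; // primary ring size quoted in the message

/// True if `addr` falls inside the ring that starts at `base`.
fn in_ring(base: u32, addr: u32) -> bool {
    addr >= base && (addr - base) / 4 < RING_DWORDS
}

/// Dwords a wrap-aware drain believes are pending between rptr and wptr.
fn pending(rptr: u32, wptr: u32) -> u32 {
    (wptr + RING_DWORDS - rptr) % RING_DWORDS
}

fn main() {
    // Approximate addresses from the message above.
    let base = 0x4adc_d000u32;
    let buffer_ptr = 0x4add_6efcu32;
    println!("buffer_ptr is {} dwords past the ring base", (buffer_ptr - base) / 4);
    println!("inside the ring: {}", in_ring(base, buffer_ptr));

    // The guest has really submitted up to dword 100 (made-up value).
    let guest_wptr = 100;
    // An out-of-band bump makes the drain run 64 dwords too far.
    let bogus_wptr = (guest_wptr + 64) % RING_DWORDS;
    let rptr = bogus_wptr; // read-pointer after draining to the bogus value

    // The guest then rings its doorbell with its own, smaller write-pointer.
    let next_guest_wptr = 120;
    // rptr is now ahead of wptr, so the modular distance looks like a wrap.
    println!("drain thinks {} dwords are pending", pending(rptr, next_guest_wptr));
}
```

In this model the read-pointer ends up ahead of the guest's write-pointer, so the next drain walks almost the entire ring again instead of the 20 dwords the guest actually added. That is the same shape as the re-executed indirect buffers described above.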
[iterate-2U] VdGlobalDevice: allocate a real device cell so the swap counter (clock B) can advance
2026-06-14 19:58:05 +02:00 · 2026-06-14 16:20:08 +02:00 · 2026-06-14 15:20:02 +02:00
4 changed files with 96 additions and 48 deletions
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -1540,8 +1540,19 @@ fn cmd_exec_inner(
                    mem.write_u32(addr, block);
                }
                ("xboxkrnl.exe", 0x01BE) => {
-                    // VdGlobalDevice — passed through to Vd* shims. Write 0.
-                    mem.write_u32(addr, 0);
+                    // VdGlobalDevice — a *pointer to* a global D3D-device cell.
+                    // Mirror xenia-canary RegisterVideoExports (xboxkrnl_video.cc:
+                    // 557-564): allocate a 4-byte cell, point the import slot at
+                    // it, and zero the cell. The guest's graphics init then stores
+                    // its device object INTO the cell (e.g. sub_824C6DC0 @
+                    // 0x824C6F18 `stw r31, 0([0x82000750])`), and the swap-complete
+                    // callback sub_824CE2B8 reads it back via the two-level
+                    // `[[VdGlobalDevice]+0]+15160` to bump the swap counter (clock
+                    // B). Writing 0 directly here (the old behaviour) made that
+                    // store land at address 0 and the swap counter never advance —
+                    // freezing the title-loop's per-frame manager update.
+                    let cell = alloc_zero(0x4, &mut mem, &mut kernel);
+                    mem.write_u32(addr, cell);
                }
                ("xboxkrnl.exe", 0x01C0) => {
                    // VdGpuClockInMHz
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,9 +1,9 @@
 {
-  "instructions": 50000013,
-  "imports": 451497,
+  "instructions": 19274336,
+  "imports": 72513,
  "unimpl": 0,
-  "draws": 78,
-  "swaps": 3,
+  "draws": 28,
+  "swaps": 2,
  "unique_render_targets": 2,
  "shader_blobs_live": 3,
  "texture_cache_entries": 0
--- a/crates/xenia-gpu/src/gpu_system.rs
+++ b/crates/xenia-gpu/src/gpu_system.rs
@@ -726,10 +726,13 @@ impl GpuSystem {
            width,
            height,
        });
-        self.pending_interrupts.push(PendingInterrupt {
-            source: InterruptSource::Swap,
-            cpu_mask: 0x1,
-        });
+        // iterate-2T: do NOT raise a CP swap-complete interrupt here. Canary's
+        // `VdSwap`/PM4_XE_SWAP path raises no interrupt; swap-complete CP
+        // interrupts come ONLY from in-stream `PM4_INTERRUPT` packets, which
+        // are naturally ordered after D3D has armed the swap-callback slot.
+        // Synthesizing one out of band (as we did pre-2T) delivered a CP
+        // interrupt while the slot still held the `0xBADF00D` placeholder,
+        // tripping the graphics ISR's "Unanticipated CPU_INTERRUPT" assert.
        tracing::info!(
            frame = self.swap_counter,
            fb = format_args!("{frontbuffer_phys:#010x}"),
--- a/crates/xenia-kernel/src/exports.rs
+++ b/crates/xenia-kernel/src/exports.rs
@@ -2999,52 +2999,86 @@ fn vd_swap(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelState) {
    // xboxkrnl_video.cc:479. Currently skipped (see below).
    let _ = fetch_dwords; // silence unused — will be live again under the deferred path

-    // The original M2b path zero-filled buffer_ptr (in the system command
-    // buffer) and bumped WPTR by 64 to expose the game's own ring writes.
-    // Keep that untouched — the game still expects buffer_ptr to be a
-    // skippable scratch area, and the bump still exposes any game-batched
-    // PM4 packets for the drain.
+    // iterate-2V: mirror xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548)
+    // FAITHFULLY. The game reserves a 64-dword (256-byte) slot for VdSwap
+    // at `buffer_ptr`; canary writes a `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
+    // fetch-constant patch followed by `PM4_TYPE3(PM4_XE_SWAP)`, then pads with
+    // NOPs — and **NEVER touches `CP_RB_WPTR`**. The game advances the primary
+    // ring write-pointer itself via its own doorbell once it has finished
+    // populating the reserved slot, so VdSwap only fills the bytes.
+    //
+    // iterate-2V FIX (the bug this removes): a prior revision bumped the
+    // primary ring `CP_RB_WPTR` out-of-band here (`extend_write_ptr_by(64)`).
+    // But `buffer_ptr` (~0x4add6efc) is NOT inside the primary ring (base
+    // ~0x4adcd000, 8192 dwords) — it lives ~10k dwords past it, in the
+    // renderer indirect-buffer region. The bogus WPTR bump pushed the GPU
+    // read-pointer PAST the guest's real write-pointer, the drain treated the
+    // overshoot as a circular wrap, and **re-executed the splash's draw
+    // indirect-buffers ~2×** — inflating draws to 78 (real splash ≈ 28; 12
+    // INDIRECT_BUFFERs vs the real 6). Canary's `VdSwap_entry` writes the
+    // block and returns; the swap-complete CP interrupt comes only from the
+    // game's own in-stream `PM4_INTERRUPT` packets, never from VdSwap.
    if buffer_ptr != 0 {
-        for i in 0..64u32 {
-            mem.write_u32(buffer_ptr + i * 4, xenia_gpu::pm4::make_packet_type2());
+        let mut off = 0u32;
+        let mut put = |i: &mut u32, v: u32| {
+            mem.write_u32(buffer_ptr + *i * 4, v);
+            *i += 1;
+        };
+        // PM4_TYPE0 fetch-constant slot-0 patch (6 dwords payload). The
+        // base_address field is patched to the physical frontbuffer so the
+        // bloom/blur "sample frame N for frame N+1" path reads the right page.
+        let mut patched = fetch_dwords;
+        patched[1] = (patched[1] & 0x0000_0FFF) | ((frontbuffer_addr >> 12) << 12);
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type0(
+                xenia_gpu::gpu_system::CONST_BASE_FETCH as u16,
+                6,
+            ),
+        );
+        for d in patched {
+            put(&mut off, d);
+        }
+        // PM4_TYPE3(PM4_XE_SWAP, 4 dwords): signature, frontbuffer_phys, w, h.
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type3(xenia_gpu::pm4::PM4_XE_SWAP, 4),
+        );
+        put(&mut off, xenia_gpu::pm4::SWAP_SIGNATURE);
+        put(&mut off, frontbuffer_addr);
+        put(&mut off, width);
+        put(&mut off, height);
+        // Pad the remainder with NOP (Type-2) packets.
+        while off < 64 {
+            put(&mut off, xenia_gpu::pm4::make_packet_type2());
        }
    }
-    state.gpu.extend_write_ptr_by(64);
+    // NOTE: We deliberately do NOT bump `CP_RB_WPTR` here (see the iterate-2V
+    // comment above). The drain below consumes only the packets the game has
+    // legitimately advanced the write-pointer over.

-    // GPUBUG-DRAIN-001: notify the swap directly.
-    //
-    // Per xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:438-521), the
-    // textbook approach is to inject `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
-    // (fetch-constant slot-0 patch for the Sylpheed bloom/blur "frame N+1"
-    // sample) followed by `PM4_TYPE3(PM4_XE_SWAP)` directly into the
-    // primary ring at WPTR, then let the natural drain consume them.
-    //
-    // That works in **pure lockstep** (drain runs at every kernel callback
-    // boundary, ring has at most a few hundred packets pending). It
-    // **does not** work under `--parallel` (CPU + GPU ring contention) —
-    // observed empirically: vd_swap's `drain_to_current_wptr` consumes
-    // 8-10 million game-batched IB packets in the 900 ms inline-deadline
-    // window without reaching our tail-injected PM4_XE_SWAP. Under
-    // threaded backend the worker has the same deadline. Either:
-    //   (a) the safety-net direct notify (below) fires and gets the swap
-    //       counted — but if the worker *eventually* drains past our
-    //       injected packet later it would double-count,
-    //   (b) we extend the deadline so far that vd_swap blocks for many
-    //       seconds — unreasonable for a kernel callback.
-    //
-    // Skip the ring injection unconditionally and post `notify_xe_swap`
-    // directly. The drain still runs (game packets execute as normal).
-    // **Trade-off**: the slot-0 fetch-constant patch is deferred —
-    // tracked as GPUBUG-FETCH-PATCH-001. Sylpheed currently has draws=0,
-    // so a stale slot 0 has no observable effect.
+    // Drain the ring up to whatever the game has actually submitted; any
+    // in-stream `PM4_INTERRUPT` / draw packets execute in order. The
+    // reserved-slot PM4_XE_SWAP is consumed by the GPU only once the game
+    // advances its own doorbell over it. The swap-counter safety net below
+    // keeps host swap bookkeeping live in the meantime.
    let drained = state.gpu.drain_to_current_wptr(mem);
    tracing::debug!(drained, "VdSwap: drained PM4 packets");

-    // Direct swap notification. Inline mode bumps `swaps_seen`
-    // synchronously; threaded mode posts a `GpuCommand::NotifyXeSwap`
-    // and the worker bumps it asynchronously.
+    // Safety net: if the drain did NOT reach our PM4_XE_SWAP this call (e.g.
+    // an undersized inline deadline left game-batched packets pending), still
+    // bump the host swap counter so the UI present + swap stats stay live.
+    // Skip when the in-stream PM4_XE_SWAP already recorded this frontbuffer
+    // (avoids double-counting). This path does NOT raise a CP interrupt.
    if frontbuffer_addr != 0 && width > 0 && height > 0 {
-        state.gpu.notify_xe_swap(frontbuffer_addr, width, height);
+        let already_swapped = state
+            .gpu
+            .as_inline_mut()
+            .map(|g| g.last_swap.map(|s| s.frontbuffer_phys) == Some(frontbuffer_addr))
+            .unwrap_or(false);
+        if !already_swapped {
+            state.gpu.notify_xe_swap(frontbuffer_addr, width, height);
+        }
    }

    // The remaining vd_swap work (UI publish: shader blobs, constants,