[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt

Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band.

Root cause found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC), 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0).

Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven: swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export.

This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/interrupt bug.

swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
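The ordering argument in the message can be illustrated with a toy model. This is only a sketch under stated assumptions: `Packet`, `isr`, and `drain` are hypothetical names invented for the illustration and do not correspond to code in the emulator; the only value taken from the message is the `0xBADF00D` placeholder.

```rust
// Toy model: every name here is hypothetical, not taken from the emulator.
const PLACEHOLDER: u32 = 0x0BAD_F00D; // unarmed swap-callback slot value

#[derive(Clone, Copy)]
enum Packet {
    /// Stands in for the Type-0 write that arms the swap-callback slot.
    ArmCallback(u32),
    /// Stands in for an in-stream PM4_INTERRUPT packet.
    Interrupt,
}

/// Stand-in for the graphics ISR: it errors if the slot is still unarmed.
fn isr(slot: u32) -> Result<u32, &'static str> {
    if slot == PLACEHOLDER {
        Err("Unanticipated CPU_INTERRUPT")
    } else {
        Ok(slot)
    }
}

/// Drain the ring in order, delivering each interrupt when it is reached.
fn drain(ring: &[Packet], slot: &mut u32) -> Vec<Result<u32, &'static str>> {
    let mut delivered = Vec::new();
    for packet in ring {
        match *packet {
            Packet::ArmCallback(value) => *slot = value,
            Packet::Interrupt => delivered.push(isr(*slot)),
        }
    }
    delivered
}

fn main() {
    let ring = [Packet::ArmCallback(0x8200_0000), Packet::Interrupt];
    let mut slot = PLACEHOLDER;

    // Out-of-band: the interrupt fires before the ring has been drained.
    assert!(isr(slot).is_err());

    // In-stream: the interrupt is only reached after the arming write.
    assert_eq!(drain(&ring, &mut slot), vec![Ok(0x8200_0000)]);
    println!("in-stream ordering keeps the ISR on its normal path");
}
```

The model deliberately ignores timing and threading; it only shows that an interrupt carried in the stream cannot overtake a write queued ahead of it.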
2026-06-14 15:20:02 +02:00
3 changed files with 77 additions and 44 deletions
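As a rough illustration of the 64-dword slot this commit writes, the layout can be sketched in isolation. The helper names below mirror the ones used in the diff but are re-implemented here, and the numeric values of `PM4_XE_SWAP`, `SWAP_SIGNATURE`, and `CONST_BASE_FETCH` are assumptions made for the sketch, not values confirmed by this commit.

```rust
// Standalone sketch; constants are ASSUMED values, not read from the crate.
const PM4_XE_SWAP: u32 = 0x64; // assumed Type-3 opcode
const SWAP_SIGNATURE: u32 = 0x5357_4150; // assumed: ASCII "SWAP"
const CONST_BASE_FETCH: u32 = 0x4800; // assumed register index of fetch slot 0
const SLOT_DWORDS: usize = 64;

/// Type-0 header: register write of `count` dwords starting at `base_reg`.
fn make_packet_type0(base_reg: u32, count: u32) -> u32 {
    ((count - 1) << 16) | (base_reg & 0x7FFF)
}

/// Type-2 header: a single-dword NOP.
fn make_packet_type2() -> u32 {
    2 << 30
}

/// Type-3 header: `opcode` followed by `count` payload dwords.
fn make_packet_type3(opcode: u32, count: u32) -> u32 {
    (3 << 30) | ((count - 1) << 16) | ((opcode & 0x7F) << 8)
}

/// Build the slot: fetch-constant patch, then the swap packet, then NOPs.
fn build_swap_slot(fetch: [u32; 6], frontbuffer: u32, width: u32, height: u32) -> [u32; SLOT_DWORDS] {
    // Prefill with NOPs so the tail is already padded.
    let mut slot = [make_packet_type2(); SLOT_DWORDS];

    // Patch the base-address field (bits 12..31 of dword 1), as in the diff.
    let mut patched = fetch;
    patched[1] = (patched[1] & 0x0000_0FFF) | ((frontbuffer >> 12) << 12);

    let mut words = Vec::with_capacity(12);
    words.push(make_packet_type0(CONST_BASE_FETCH, 6));
    words.extend_from_slice(&patched);
    words.push(make_packet_type3(PM4_XE_SWAP, 4));
    words.extend_from_slice(&[SWAP_SIGNATURE, frontbuffer, width, height]);

    slot[..words.len()].copy_from_slice(&words);
    slot
}

fn main() {
    let slot = build_swap_slot([0; 6], 0x1F00_0000, 1280, 720);
    assert_eq!(slot[0], 0x0005_4800); // Type-0 header, 6 dwords
    assert_eq!(slot[7], 0xC003_6400); // Type-3 header, 4 dwords
    assert!(slot[12..].iter().all(|&w| w == 0x8000_0000)); // NOP padding
    println!("slot uses 12 dwords, {} NOPs", SLOT_DWORDS - 12);
}
```

With these assumptions the two packets occupy 12 dwords (1 + 6 + 1 + 4), leaving 52 NOPs, so the 64-dword WPTR bump always lands on a packet boundary.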
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,9 +1,9 @@
 {
-  "instructions": 50000013,
+  "instructions": 50000001,
-  "imports": 451497,
+  "imports": 451500,
  "unimpl": 0,
  "draws": 78,
-  "swaps": 3,
+  "swaps": 4,
  "unique_render_targets": 2,
  "shader_blobs_live": 3,
  "texture_cache_entries": 0
--- a/crates/xenia-gpu/src/gpu_system.rs
+++ b/crates/xenia-gpu/src/gpu_system.rs
@@ -726,10 +726,13 @@ impl GpuSystem {
            width,
            height,
        });
-        self.pending_interrupts.push(PendingInterrupt {
-            source: InterruptSource::Swap,
-            cpu_mask: 0x1,
-        });
+        // iterate-2T: do NOT raise a CP swap-complete interrupt here. Canary's
+        // `VdSwap`/PM4_XE_SWAP path raises no interrupt; swap-complete CP
+        // interrupts come ONLY from in-stream `PM4_INTERRUPT` packets, which
+        // are naturally ordered after D3D has armed the swap-callback slot.
+        // Synthesizing one out of band (as we did pre-2T) delivered a CP
+        // interrupt while the slot still held the `0xBADF00D` placeholder,
+        // tripping the graphics ISR's "Unanticipated CPU_INTERRUPT" assert.
        tracing::info!(
            frame = self.swap_counter,
            fb = format_args!("{frontbuffer_phys:#010x}"),
--- a/crates/xenia-kernel/src/exports.rs
+++ b/crates/xenia-kernel/src/exports.rs
@@ -2999,53 +2999,83 @@ fn vd_swap(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelState) {
    // xboxkrnl_video.cc:479. Currently skipped (see below).
    let _ = fetch_dwords; // silence unused — will be live again under the deferred path
-    // The original M2b path zero-filled buffer_ptr (in the system command
-    // buffer) and bumped WPTR by 64 to expose the game's own ring writes.
-    // Keep that untouched — the game still expects buffer_ptr to be a
-    // skippable scratch area, and the bump still exposes any game-batched
-    // PM4 packets for the drain.
+    // iterate-2T: mirror xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548)
+    // FAITHFULLY. The game reserves 64 dwords (256 bytes) in the primary ring
+    // at `buffer_ptr`; canary writes a `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
+    // fetch-constant patch followed by `PM4_TYPE3(PM4_XE_SWAP)`, then pads with
+    // NOPs. We do the same, then bump WPTR by 64 so the drain consumes the
+    // PM4_XE_SWAP **in command-stream order** — i.e. AFTER any in-stream
+    // callback-arming Type-0 writes the game already queued.
+    //
+    // Why this matters (the iterate-2T root): the previous M2b short-circuit
+    // called `notify_xe_swap` directly from the HLE, which synthesized a CP
+    // swap-complete interrupt OUT OF BAND. When that interrupt reached the
+    // graphics ISR (`sub_824BE9A0`) before D3D had armed its swap-callback
+    // slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR hit
+    // its "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command
+    // buffer?" assert (`twi` at 0x824BE9DC). Routing the swap through the ring
+    // packet keeps the interrupt naturally ordered after arming, matching
+    // canary (whose VdSwap raises NO interrupt itself; swap-complete CP
+    // interrupts come only from in-stream `PM4_INTERRUPT` packets).
    if buffer_ptr != 0 {
-        for i in 0..64u32 {
-            mem.write_u32(buffer_ptr + i * 4, xenia_gpu::pm4::make_packet_type2());
+        let mut off = 0u32;
+        let mut put = |i: &mut u32, v: u32| {
+            mem.write_u32(buffer_ptr + *i * 4, v);
+            *i += 1;
+        };
+        // PM4_TYPE0 fetch-constant slot-0 patch (6 dwords payload). The
+        // base_address field is patched to the physical frontbuffer so the
+        // bloom/blur "sample frame N for frame N+1" path reads the right page.
+        let mut patched = fetch_dwords;
+        patched[1] = (patched[1] & 0x0000_0FFF) | ((frontbuffer_addr >> 12) << 12);
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type0(
+                xenia_gpu::gpu_system::CONST_BASE_FETCH as u16,
+                6,
+            ),
+        );
+        for d in patched {
+            put(&mut off, d);
+        }
+        // PM4_TYPE3(PM4_XE_SWAP, 4 dwords): signature, frontbuffer_phys, w, h.
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type3(xenia_gpu::pm4::PM4_XE_SWAP, 4),
+        );
+        put(&mut off, xenia_gpu::pm4::SWAP_SIGNATURE);
+        put(&mut off, frontbuffer_addr);
+        put(&mut off, width);
+        put(&mut off, height);
+        // Pad the remainder with NOP (Type-2) packets.
+        while off < 64 {
+            put(&mut off, xenia_gpu::pm4::make_packet_type2());
        }
    }
    state.gpu.extend_write_ptr_by(64);
-    // GPUBUG-DRAIN-001: notify the swap directly.
-    //
-    // Per xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:438-521), the
-    // textbook approach is to inject `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
-    // (fetch-constant slot-0 patch for the Sylpheed bloom/blur "frame N+1"
-    // sample) followed by `PM4_TYPE3(PM4_XE_SWAP)` directly into the
-    // primary ring at WPTR, then let the natural drain consume them.
-    //
-    // That works in **pure lockstep** (drain runs at every kernel callback
-    // boundary, ring has at most a few hundred packets pending). It
-    // **does not** work under `--parallel` (CPU + GPU ring contention) —
-    // observed empirically: vd_swap's `drain_to_current_wptr` consumes
-    // 8-10 million game-batched IB packets in the 900 ms inline-deadline
-    // window without reaching our tail-injected PM4_XE_SWAP. Under
-    // threaded backend the worker has the same deadline. Either:
-    //   (a) the safety-net direct notify (below) fires and gets the swap
-    //       counted — but if the worker *eventually* drains past our
-    //       injected packet later it would double-count,
-    //   (b) we extend the deadline so far that vd_swap blocks for many
-    //       seconds — unreasonable for a kernel callback.
-    //
-    // Skip the ring injection unconditionally and post `notify_xe_swap`
-    // directly. The drain still runs (game packets execute as normal).
-    // **Trade-off**: the slot-0 fetch-constant patch is deferred —
-    // tracked as GPUBUG-FETCH-PATCH-001. Sylpheed currently has draws=0,
-    // so a stale slot 0 has no observable effect.
+    // Drain the ring; the PM4_XE_SWAP we just queued (and any in-stream
+    // PM4_INTERRUPT) executes in order. The PM4_XE_SWAP handler calls
+    // `notify_xe_swap` for host swap bookkeeping; no synthetic interrupt is
+    // raised (see `notify_xe_swap`).
    let drained = state.gpu.drain_to_current_wptr(mem);
    tracing::debug!(drained, "VdSwap: drained PM4 packets");
-    // Direct swap notification. Inline mode bumps `swaps_seen`
-    // synchronously; threaded mode posts a `GpuCommand::NotifyXeSwap`
-    // and the worker bumps it asynchronously.
+    // Safety net: if the drain did NOT reach our PM4_XE_SWAP this call (e.g.
+    // an undersized inline deadline left game-batched packets pending), still
+    // bump the host swap counter so the UI present + swap stats stay live.
+    // Skip when the in-stream PM4_XE_SWAP already recorded this frontbuffer
+    // (avoids double-counting). This path does NOT raise a CP interrupt.
    if frontbuffer_addr != 0 && width > 0 && height > 0 {
        let already_swapped = state
            .gpu
            .as_inline_mut()
            .map(|g| g.last_swap.map(|s| s.frontbuffer_phys) == Some(frontbuffer_addr))
            .unwrap_or(false);
        if !already_swapped {
            state.gpu.notify_xe_swap(frontbuffer_addr, width, height);
        }
    }
    // The remaining vd_swap work (UI publish: shader blobs, constants,
    // texture cache, frontbuffer detile, ui.notify_swap) reads