[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation

The title's per-frame loop (sub_822F1AA8) is clock-B-paced and only re-fires when the swap count [controller+88] changes, which advances only on source=1 CP swap-complete interrupts. Each present batch the guest submits (via the sub_824CE348 -> sub_824BF4D0 builder) ends with a WAIT_REG_MEM on a per-CPU swap-acknowledge fence [GCTX+0] (GCTX = [device+10772]); the GPU parks there until the graphics ISR (sub_824BE9A0) clears that CPU's bit.

Two coupled gaps kept ours emitting only ONE source=1 and then deadlocking (draws plateaued at 28, run halted ~19.27M):

1. GPU MMIO register 0x1961 (AVIVO_D1MODE_VIEWPORT_SIZE) read as 0. The swap callback sub_824CE2B8 divides by its low 12 bits (display height) as a refresh-pacing term, so a 0 read tripped its `twi` divide-by-zero guard and aborted the ISR before it reached the fence-clear. Mirror canary GraphicsSystem::ReadRegister (graphics_system.cc:311): return 0x050002D0 (1280x720).

2. The ISR ran on an arbitrary borrowed thread, so [r13+268] (the PCR processor number) did not match the interrupt's target CPU. The ISR clears `1 << current_cpu` from the fence; running on the wrong CPU cleared the wrong bit and the fence (bit 2, from cpu_mask 0x4) never reached 0. Carry the target CPU through the interrupt queue (bit index of the PM4_INTERRUPT cpu_mask for CP, 2 for vsync per canary DispatchInterruptCallback(0, 2)) and impersonate it on the borrowed thread's PCR around the ISR, mirroring canary EmulateCPInterruptDPC -> XThread::SetActiveCpu.

With both fixes the fence clears, the GPU drains each present batch, source=1 sustains per-present, clock B advances, and the loop runs continuously. Draws climb linearly with the budget (no re-stall): 50M 28->718, 200M ->3411, 1B ->18734; swaps 2->147/950/6060. No "Unanticipated CPU_INTERRUPT" trap.

Inline-deterministic (--stable-digest byte-identical x2); n50m golden re-baselined. 675 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
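The fence behaviour described in gap 2 can be modelled in a few lines. This is a minimal sketch, not the emulator's code: `fence_after_isr` and `target_cpu_from_mask` are hypothetical helper names, and only the `1 << current_cpu` clear and the lowest-set-bit selection are taken from the commit message.

```rust
/// Hypothetical model of the per-CPU swap-acknowledge fence word.
/// The GPU's trailing WAIT_REG_MEM releases only once the fence reads 0.
fn fence_after_isr(fence: u32, current_cpu: u8) -> u32 {
    // Assumption: CPU indices are small (the commit clamps them to 0..=5).
    debug_assert!(current_cpu < 32);
    // The guest ISR clears `1 << current_cpu` from the fence.
    fence & !(1u32 << current_cpu)
}

/// Target CPU for a CP interrupt: the index of the lowest set bit of
/// `cpu_mask`. Returns `None` for an empty mask so the caller can pick
/// its own fallback.
fn target_cpu_from_mask(cpu_mask: u32) -> Option<u8> {
    if cpu_mask == 0 {
        None
    } else {
        Some(cpu_mask.trailing_zeros() as u8)
    }
}
```

With `cpu_mask = 0x4`, running the ISR as CPU 0 leaves the fence at `0x4` (the GPU stays parked), while running it as CPU 2 clears it to 0.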
6 changed files with 189 additions and 68 deletions
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -1540,8 +1540,19 @@ fn cmd_exec_inner(
                    mem.write_u32(addr, block);
                }
                ("xboxkrnl.exe", 0x01BE) => {
-                    // VdGlobalDevice — passed through to Vd* shims. Write 0.
-                    mem.write_u32(addr, 0);
+                    // VdGlobalDevice — a *pointer to* a global D3D-device cell.
+                    // Mirror xenia-canary RegisterVideoExports (xboxkrnl_video.cc:
+                    // 557-564): allocate a 4-byte cell, point the import slot at
+                    // it, and zero the cell. The guest's graphics init then stores
+                    // its device object INTO the cell (e.g. sub_824C6DC0 @
+                    // 0x824C6F18 `stw r31, 0([0x82000750])`), and the swap-complete
+                    // callback sub_824CE2B8 reads it back via the two-level
+                    // `[[VdGlobalDevice]+0]+15160` to bump the swap counter (clock
+                    // B). Writing 0 directly here (the old behaviour) made that
+                    // store land at address 0 and the swap counter never advance —
+                    // freezing the title-loop's per-frame manager update.
+                    let cell = alloc_zero(0x4, &mut mem, &mut kernel);
+                    mem.write_u32(addr, cell);
                }
                ("xboxkrnl.exe", 0x01C0) => {
                    // VdGpuClockInMHz
@@ -2327,10 +2338,22 @@ fn coord_post_round(
    }
    if kernel.gpu.has_pending_interrupts() {
-        for _pi in kernel.gpu.take_pending_interrupts() {
+        for pi in kernel.gpu.take_pending_interrupts() {
+            // Canary `ExecutePacketType3_INTERRUPT` dispatches the callback
+            // once per set bit of `cpu_mask` with that bit's index as the
+            // target CPU (`DispatchInterruptCallback(1, n)`). The guest's
+            // swap-acknowledge fence stores `cpu_mask`, and the ISR clears
+            // `1 << current_cpu` from it — so the ISR must run impersonating
+            // the masked CPU or the fence never reaches 0. Sylpheed uses a
+            // single-bit mask (`0x4` → CPU 2); take the lowest set bit.
+            let cpu = if pi.cpu_mask == 0 {
+                xenia_kernel::interrupts::VSYNC_TARGET_CPU
+            } else {
+                pi.cpu_mask.trailing_zeros().min(5) as u8
+            };
            kernel
                .interrupts
-                .queue_interrupt(xenia_kernel::INTERRUPT_SOURCE_CP);
+                .queue_interrupt(xenia_kernel::INTERRUPT_SOURCE_CP, cpu);
        }
    }
@@ -3534,7 +3557,17 @@ fn dispatch_graphics_interrupts(
        None
    };
+    /// X_KPCR offset of `prcb_data.current_cpu` (canary `xthread.cc`
+    /// `SetActiveCpu` → `pcr.prcb_data.current_cpu`). The guest graphics
+    /// ISR reads it via `lbz r10, 268(r13)` to decide which per-CPU bit of
+    /// the swap-acknowledge fence to clear.
+    const PCR_CURRENT_CPU_OFF: u32 = 268;
    while let Some(source) = kernel.interrupts.peek_next() {
+        let target_cpu = kernel
+            .interrupts
+            .peek_next_cpu()
+            .unwrap_or(xenia_kernel::interrupts::VSYNC_TARGET_CPU);
        // Victim selection: Ready first, then Blocked (canary's
        // `XThread::GetCurrentThread()` analog — any live thread will
        // do for borrowing context). Skip Idle/Exited/ServicingIrq.
@@ -3604,6 +3637,19 @@ fn dispatch_graphics_interrupts(
            saved
        };
+        // Impersonate the interrupt's target CPU on the borrowed thread's
+        // PCR, mirroring canary `EmulateCPInterruptDPC` →
+        // `XThread::SetActiveCpu(cpu)`. The guest swap-complete ISR clears
+        // `1 << [pcr.current_cpu]` from the per-present swap-acknowledge
+        // fence; if it runs on the wrong CPU it clears the wrong bit and
+        // the GPU's trailing `WAIT_REG_MEM` on that fence never releases —
+        // stranding the present/title loop. Save/restore so borrowing a
+        // thread doesn't permanently rewrite its processor number.
+        let pcr_addr = (kernel.scheduler.ctx_mut_ref(target_ref).gpr[13] as u32)
+            .wrapping_add(PCR_CURRENT_CPU_OFF);
+        let saved_cpu = mem.read_u8(pcr_addr);
+        mem.write_u8(pcr_addr, target_cpu);
        // Stash the previous `scheduler.current` (call_export reaches
        // it; imports the ISR calls must dispatch on the borrowed
        // thread). Restore on the way out.
@@ -3696,6 +3742,7 @@ fn dispatch_graphics_interrupts(
        // Restore the borrowed context.
        saved.restore(kernel.scheduler.ctx_mut_ref(target_ref));
+        mem.write_u8(pcr_addr, saved_cpu);
        kernel.scheduler.current = prev_current;
        kernel.interrupts.delivered += 1;
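The save/impersonate/restore pattern in the hunks above can be shown in isolation. The sketch below is illustrative only: `Mem` is a hypothetical flat byte array standing in for guest memory, and `with_impersonated_cpu` is an invented helper name; only the offset 268 and the save, write, restore order come from the diff.

```rust
/// Hypothetical stand-in for guest memory (assumption: a flat byte array;
/// the emulator's real memory type is not shown here).
struct Mem {
    bytes: Vec<u8>,
}

impl Mem {
    fn read_u8(&self, addr: u32) -> u8 {
        self.bytes[addr as usize]
    }
    fn write_u8(&mut self, addr: u32, v: u8) {
        self.bytes[addr as usize] = v;
    }
}

/// Offset of the current-CPU byte in the PCR, as stated in the diff.
const PCR_CURRENT_CPU_OFF: u32 = 268;

/// Run `isr` with the PCR's current-CPU byte temporarily set to
/// `target_cpu`, then put the borrowed thread's own value back.
fn with_impersonated_cpu<R>(
    mem: &mut Mem,
    pcr_base: u32,
    target_cpu: u8,
    isr: impl FnOnce(&mut Mem) -> R,
) -> R {
    let addr = pcr_base.wrapping_add(PCR_CURRENT_CPU_OFF);
    let saved = mem.read_u8(addr); // remember the thread's real CPU
    mem.write_u8(addr, target_cpu); // impersonate the interrupt target
    let out = isr(mem);
    mem.write_u8(addr, saved); // restore so the borrow leaves no trace
    out
}
```

Wrapping the ISR in a closure keeps the restore on the single exit path; the diff gets the same effect by pairing the write with the existing context restore.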
--- a/crates/xenia-app/tests/golden/sylpheed_n50m.json
+++ b/crates/xenia-app/tests/golden/sylpheed_n50m.json
@@ -1,10 +1,10 @@
 {
-  "instructions": 50000013,
+  "instructions": 50000014,
-  "imports": 451497,
+  "imports": 352251,
  "unimpl": 0,
-  "draws": 78,
+  "draws": 718,
-  "swaps": 3,
+  "swaps": 147,
  "unique_render_targets": 2,
-  "shader_blobs_live": 3,
+  "shader_blobs_live": 6,
  "texture_cache_entries": 0
 }
--- a/crates/xenia-gpu/src/gpu_system.rs
+++ b/crates/xenia-gpu/src/gpu_system.rs
@@ -726,10 +726,13 @@ impl GpuSystem {
            width,
            height,
        });
-        self.pending_interrupts.push(PendingInterrupt {
-            source: InterruptSource::Swap,
-            cpu_mask: 0x1,
-        });
+        // iterate-2T: do NOT raise a CP swap-complete interrupt here. Canary's
+        // `VdSwap`/PM4_XE_SWAP path raises no interrupt; swap-complete CP
+        // interrupts come ONLY from in-stream `PM4_INTERRUPT` packets, which
+        // are naturally ordered after D3D has armed the swap-callback slot.
+        // Synthesizing one out of band (as we did pre-2T) delivered a CP
+        // interrupt while the slot still held the `0xBADF00D` placeholder,
+        // tripping the graphics ISR's "Unanticipated CPU_INTERRUPT" assert.
        tracing::info!(
            frame = self.swap_counter,
            fb = format_args!("{frontbuffer_phys:#010x}"),
@@ -1541,6 +1544,15 @@ pub mod reg {
    /// `XE_GPU_REG_D1MODE_VBLANK_VLINE_STATUS` (Canary register_table.inc:1126).
    /// Bit 0 = VBLANK_INT_OCCURRED.
    pub const D1MODE_VBLANK_VLINE_STATUS: u32 = 0x1951;
+    /// `XE_GPU_REG_D1MODE_VIEWPORT_SIZE` / `AVIVO_D1MODE_VIEWPORT_SIZE`
+    /// (Canary `register_table.inc:1134`). Packs the active display resolution
+    /// as `(width << 16) | height` with 12-bit fields. The guest's
+    /// swap-complete interrupt callback (`sub_824CE2B8`) divides by the low
+    /// 12 bits (`height`) as a refresh-pacing term, so a 0 read makes its
+    /// `twi` divide-by-zero guard trap and abort the ISR before it clears the
+    /// swap-acknowledge fence. Canary returns the constant below from
+    /// `GraphicsSystem::ReadRegister` (graphics_system.cc:311).
+    pub const D1MODE_VIEWPORT_SIZE: u32 = 0x1961;
    /// `XE_GPU_REG_VGT_EVENT_INITIATOR` — set by EVENT_WRITE.
    pub const VGT_EVENT_INITIATOR: u32 = 0x21F9;
    /// `XE_GPU_REG_COHER_STATUS_HOST` — coherency bits
--- a/crates/xenia-gpu/src/mmio_region.rs
+++ b/crates/xenia-gpu/src/mmio_region.rs
@@ -58,6 +58,15 @@ pub fn build_region(mmio: &GpuMmio) -> MmioRegion {
                reg::D1MODE_VBLANK_VLINE_STATUS => {
                    read_vblank_status.load(Ordering::Relaxed)
                }
+                // AVIVO_D1MODE_VIEWPORT_SIZE: the active display resolution
+                // (1280x720) packed as `(width << 16) | height`. Canary
+                // serves this constant from `GraphicsSystem::ReadRegister`
+                // (graphics_system.cc:311). The guest swap-complete interrupt
+                // callback divides by the low 12 bits (`height = 0x2D0`); a 0
+                // read trips its `twi` divide-guard and aborts the ISR before
+                // it acknowledges the per-present swap fence — which strands
+                // the present/title loop. Mirror canary exactly.
+                reg::D1MODE_VIEWPORT_SIZE => 0x0500_02D0,
                _ => {
                    tracing::trace!(
                        reg = format_args!("{reg_index:#x}"),
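As a quick check of the constant served above, the register layout `(width << 16) | height` with 12-bit fields can be packed and unpacked as follows. The function names are hypothetical; only the layout and the value `0x0500_02D0` come from the diff.

```rust
/// Pack a display resolution as `(width << 16) | height`, 12 bits per field.
fn pack_viewport_size(width: u32, height: u32) -> u32 {
    ((width & 0xFFF) << 16) | (height & 0xFFF)
}

/// Split the register value back into `(width, height)`.
fn unpack_viewport_size(reg: u32) -> (u32, u32) {
    ((reg >> 16) & 0xFFF, reg & 0xFFF)
}
```

`unpack_viewport_size(0x0500_02D0)` yields `(1280, 720)`, and a register that reads 0 yields a height of 0, which is the divisor that tripped the guest's divide guard.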
--- a/crates/xenia-kernel/src/exports.rs
+++ b/crates/xenia-kernel/src/exports.rs
@@ -2999,52 +2999,86 @@ fn vd_swap(ctx: &mut PpcContext, mem: &GuestMemory, state: &mut KernelState) {
    // xboxkrnl_video.cc:479. Currently skipped (see below).
    let _ = fetch_dwords; // silence unused — will be live again under the deferred path
-    // The original M2b path zero-filled buffer_ptr (in the system command
-    // buffer) and bumped WPTR by 64 to expose the game's own ring writes.
-    // Keep that untouched — the game still expects buffer_ptr to be a
-    // skippable scratch area, and the bump still exposes any game-batched
-    // PM4 packets for the drain.
+    // iterate-2V: mirror xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548)
+    // FAITHFULLY. The game reserves 64 dwords (256 bytes) in the primary ring
+    // at `buffer_ptr`; canary writes a `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
+    // fetch-constant patch followed by `PM4_TYPE3(PM4_XE_SWAP)`, then pads with
+    // NOPs — and **NEVER touches `CP_RB_WPTR`**. The game advances the primary
+    // ring write-pointer itself via its own doorbell once it has finished
+    // populating the reserved slot, so VdSwap only fills the bytes.
+    //
+    // iterate-2V FIX (the bug this removes): a prior revision bumped the
+    // primary ring `CP_RB_WPTR` out-of-band here (`extend_write_ptr_by(64)`).
+    // But `buffer_ptr` (~0x4add6efc) is NOT inside the primary ring (base
+    // ~0x4adcd000, 8192 dwords) — it lives ~10k dwords past it, in the
+    // renderer indirect-buffer region. The bogus WPTR bump pushed the GPU
+    // read-pointer PAST the guest's real write-pointer, the drain treated the
+    // overshoot as a circular wrap, and **re-executed the splash's draw
+    // indirect-buffers ~2×** — inflating draws to 78 (real splash ≈ 28; 12
+    // INDIRECT_BUFFERs vs the real 6). Canary's `VdSwap_entry` writes the
+    // block and returns; the swap-complete CP interrupt comes only from the
+    // game's own in-stream `PM4_INTERRUPT` packets, never from VdSwap.
    if buffer_ptr != 0 {
-        for i in 0..64u32 {
-            mem.write_u32(buffer_ptr + i * 4, xenia_gpu::pm4::make_packet_type2());
+        let mut off = 0u32;
+        let mut put = |i: &mut u32, v: u32| {
+            mem.write_u32(buffer_ptr + *i * 4, v);
+            *i += 1;
+        };
+        // PM4_TYPE0 fetch-constant slot-0 patch (6 dwords payload). The
+        // base_address field is patched to the physical frontbuffer so the
+        // bloom/blur "sample frame N for frame N+1" path reads the right page.
+        let mut patched = fetch_dwords;
+        patched[1] = (patched[1] & 0x0000_0FFF) | ((frontbuffer_addr >> 12) << 12);
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type0(
+                xenia_gpu::gpu_system::CONST_BASE_FETCH as u16,
+                6,
+            ),
+        );
+        for d in patched {
+            put(&mut off, d);
+        }
+        // PM4_TYPE3(PM4_XE_SWAP, 4 dwords): signature, frontbuffer_phys, w, h.
+        put(
+            &mut off,
+            xenia_gpu::pm4::make_packet_type3(xenia_gpu::pm4::PM4_XE_SWAP, 4),
+        );
+        put(&mut off, xenia_gpu::pm4::SWAP_SIGNATURE);
+        put(&mut off, frontbuffer_addr);
+        put(&mut off, width);
+        put(&mut off, height);
+        // Pad the remainder with NOP (Type-2) packets.
+        while off < 64 {
+            put(&mut off, xenia_gpu::pm4::make_packet_type2());
        }
    }
-    state.gpu.extend_write_ptr_by(64);
+    // NOTE: We deliberately do NOT bump `CP_RB_WPTR` here (see the iterate-2V
+    // comment above). The drain below consumes only the packets the game has
+    // legitimately advanced the write-pointer over.
-    // GPUBUG-DRAIN-001: notify the swap directly.
-    //
-    // Per xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:438-521), the
-    // textbook approach is to inject `PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0)`
-    // (fetch-constant slot-0 patch for the Sylpheed bloom/blur "frame N+1"
-    // sample) followed by `PM4_TYPE3(PM4_XE_SWAP)` directly into the
-    // primary ring at WPTR, then let the natural drain consume them.
-    //
-    // That works in **pure lockstep** (drain runs at every kernel callback
-    // boundary, ring has at most a few hundred packets pending). It
-    // **does not** work under `--parallel` (CPU + GPU ring contention) —
-    // observed empirically: vd_swap's `drain_to_current_wptr` consumes
-    // 8-10 million game-batched IB packets in the 900 ms inline-deadline
-    // window without reaching our tail-injected PM4_XE_SWAP. Under
-    // threaded backend the worker has the same deadline. Either:
-    //   (a) the safety-net direct notify (below) fires and gets the swap
-    //       counted — but if the worker *eventually* drains past our
-    //       injected packet later it would double-count,
-    //   (b) we extend the deadline so far that vd_swap blocks for many
-    //       seconds — unreasonable for a kernel callback.
-    //
-    // Skip the ring injection unconditionally and post `notify_xe_swap`
-    // directly. The drain still runs (game packets execute as normal).
-    // **Trade-off**: the slot-0 fetch-constant patch is deferred —
-    // tracked as GPUBUG-FETCH-PATCH-001. Sylpheed currently has draws=0,
-    // so a stale slot 0 has no observable effect.
+    // Drain the ring up to whatever the game has actually submitted; any
+    // in-stream `PM4_INTERRUPT` / draw packets execute in order. The
+    // reserved-slot PM4_XE_SWAP is consumed by the GPU only once the game
+    // advances its own doorbell over it. The swap-counter safety net below
+    // keeps host swap bookkeeping live in the meantime.
    let drained = state.gpu.drain_to_current_wptr(mem);
    tracing::debug!(drained, "VdSwap: drained PM4 packets");
-    // Direct swap notification. Inline mode bumps `swaps_seen`
-    // synchronously; threaded mode posts a `GpuCommand::NotifyXeSwap`
-    // and the worker bumps it asynchronously.
+    // Safety net: if the drain did NOT reach our PM4_XE_SWAP this call (e.g.
+    // an undersized inline deadline left game-batched packets pending), still
+    // bump the host swap counter so the UI present + swap stats stay live.
+    // Skip when the in-stream PM4_XE_SWAP already recorded this frontbuffer
+    // (avoids double-counting). This path does NOT raise a CP interrupt.
    if frontbuffer_addr != 0 && width > 0 && height > 0 {
-        state.gpu.notify_xe_swap(frontbuffer_addr, width, height);
+        let already_swapped = state
+            .gpu
+            .as_inline_mut()
+            .map(|g| g.last_swap.map(|s| s.frontbuffer_phys) == Some(frontbuffer_addr))
+            .unwrap_or(false);
+        if !already_swapped {
+            state.gpu.notify_xe_swap(frontbuffer_addr, width, height);
+        }
    }
    // The remaining vd_swap work (UI publish: shader blobs, constants,
--- a/crates/xenia-kernel/src/interrupts.rs
+++ b/crates/xenia-kernel/src/interrupts.rs
@@ -30,6 +30,12 @@ use xenia_cpu::ThreadRef;
 pub const INTERRUPT_SOURCE_VSYNC: u32 = 0;
 pub const INTERRUPT_SOURCE_CP: u32 = 1;
+/// The processor the graphics ISR impersonates for a v-sync interrupt.
+/// Canary hard-codes this: `MarkVblank` → `DispatchInterruptCallback(0, 2)`
+/// (graphics_system.cc:478). CP interrupts instead use the bit index of the
+/// `PM4_INTERRUPT` `cpu_mask`.
+pub const VSYNC_TARGET_CPU: u8 = 2;
 /// Guest-registered V-sync / graphics-interrupt callback (from
 /// `VdSetGraphicsInterruptCallback`).
 #[derive(Debug, Clone, Copy)]
@@ -145,9 +151,16 @@ pub type PendingLocalIrq = [std::sync::atomic::AtomicU8;
 pub struct InterruptState {
    /// Registered callback (set by `VdSetGraphicsInterruptCallback`).
    pub callback: Option<GraphicsInterruptCallback>,
-    /// Bounded FIFO of pending interrupt sources awaiting injection.
-    /// Push-back on queue, pop-front on inject. Over-cap pushes drop.
-    pub pending: VecDeque<u32>,
+    /// Bounded FIFO of pending interrupts awaiting injection, as
+    /// `(source, target_cpu)`. Push-back on queue, pop-front on inject.
+    /// Over-cap pushes drop. `target_cpu` is the processor the graphics
+    /// ISR must impersonate (canary `XThread::SetActiveCpu` / the
+    /// `DispatchInterruptCallback(source, cpu)` argument): the bit index
+    /// of the CP `PM4_INTERRUPT` `cpu_mask` for source=1, and a fixed `2`
+    /// for vsync (canary `DispatchInterruptCallback(0, 2)`). The ISR reads
+    /// it from the PCR (`[r13+268]`) to clear the matching per-CPU bit of
+    /// the swap-acknowledge fence.
+    pub pending: VecDeque<(u32, u8)>,
    /// When `Some`, some HW thread is currently running a callback; on
    /// return-to-sentinel we restore this and clear the flag.
    pub saved: Option<SavedCallbackCtx>,
@@ -211,8 +224,9 @@ impl InterruptState {
        });
    }
-    /// Queue an interrupt for the next safe injection point.
-    pub fn queue_interrupt(&mut self, source: u32) {
+    /// Queue an interrupt for the next safe injection point. `cpu` is the
+    /// processor the ISR must impersonate (see `pending`).
+    pub fn queue_interrupt(&mut self, source: u32, cpu: u8) {
        if self.callback.is_none() {
            self.dropped += 1;
            return;
@@ -221,18 +235,23 @@ impl InterruptState {
            self.dropped += 1;
            return;
        }
-        self.pending.push_back(source);
+        self.pending.push_back((source, cpu));
    }
    /// Peek at the next pending source without removing it.
    pub fn peek_next(&self) -> Option<u32> {
-        self.pending.front().copied()
+        self.pending.front().map(|&(source, _)| source)
    }
+    /// Peek at the target CPU of the next pending interrupt.
+    pub fn peek_next_cpu(&self) -> Option<u8> {
+        self.pending.front().map(|&(_, cpu)| cpu)
+    }
    /// Pop the next pending source (called by the injector after it has
    /// committed to dispatching it).
    pub fn take_next(&mut self) -> Option<u32> {
-        self.pending.pop_front()
+        self.pending.pop_front().map(|(source, _)| source)
    }
    /// **Legacy** — instruction-count v-sync ticker. Kept for unit tests
@@ -249,7 +268,7 @@ impl InterruptState {
        let periods = self.vsync_accumulator / VSYNC_INSTR_PERIOD;
        self.vsync_accumulator %= VSYNC_INSTR_PERIOD;
        for _ in 0..periods {
-            self.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+            self.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        }
        true
    }
@@ -288,7 +307,7 @@ impl InterruptState {
        self.last_vsync_instant = Some(anchor + advance);
        let to_queue = (periods as usize).min(INTERRUPT_QUEUE_CAP);
        for _ in 0..to_queue {
-            self.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+            self.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        }
        true
    }
@@ -306,7 +325,7 @@ mod tests {
    #[test]
    fn queue_interrupt_drops_without_callback() {
        let mut s = InterruptState::default();
-        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        assert_eq!(s.dropped, 1);
        assert!(s.pending.is_empty());
    }
@@ -315,9 +334,9 @@ mod tests {
    fn queue_interrupt_fifo_preserves_order() {
        let mut s = InterruptState::default();
        s.set_callback(0x1000, 0xAB);
-        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
-        s.queue_interrupt(INTERRUPT_SOURCE_CP);
+        s.queue_interrupt(INTERRUPT_SOURCE_CP, 2);
-        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        assert_eq!(s.dropped, 0);
        // FIFO: take_next hands them out in push order.
        assert_eq!(s.take_next(), Some(INTERRUPT_SOURCE_VSYNC));
@@ -331,11 +350,11 @@ mod tests {
        let mut s = InterruptState::default();
        s.set_callback(0x1000, 0xAB);
        for _ in 0..INTERRUPT_QUEUE_CAP {
-            s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+            s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        }
        // Over-cap: drops rather than evicting the oldest.
-        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
-        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC);
+        s.queue_interrupt(INTERRUPT_SOURCE_VSYNC, VSYNC_TARGET_CPU);
        assert_eq!(s.dropped, 2);
        assert_eq!(s.pending.len(), INTERRUPT_QUEUE_CAP);
    }
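The queue change above, carrying a target CPU next to each source, can be exercised on its own. This is a reduced sketch rather than the crate's `InterruptState`: `MiniQueue` and `CAP` are hypothetical, and the callback check is omitted; the `(source, cpu)` tuple, the drop-on-overflow policy, and the peek/take split follow the diff.

```rust
use std::collections::VecDeque;

/// Hypothetical capacity; the real `INTERRUPT_QUEUE_CAP` value is not shown.
const CAP: usize = 4;

/// Reduced model of the pending-interrupt FIFO of `(source, target_cpu)`.
#[derive(Default)]
struct MiniQueue {
    pending: VecDeque<(u32, u8)>,
    dropped: u64,
}

impl MiniQueue {
    /// Push to the back; over-cap pushes are dropped, never evicting the oldest.
    fn queue(&mut self, source: u32, cpu: u8) {
        if self.pending.len() >= CAP {
            self.dropped += 1;
            return;
        }
        self.pending.push_back((source, cpu));
    }

    /// Target CPU of the next interrupt, without removing it.
    fn peek_next_cpu(&self) -> Option<u8> {
        self.pending.front().map(|&(_, cpu)| cpu)
    }

    /// Pop the next interrupt and return its source.
    fn take_next(&mut self) -> Option<u32> {
        self.pending.pop_front().map(|(source, _)| source)
    }
}
```

Because `take_next` discards the CPU, a dispatcher has to call `peek_next_cpu` before popping, which matches the order used in `dispatch_graphics_interrupts`.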
Author	SHA1	Message	Date
MechaCat02	a91f4c550b	[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation	2026-06-14 20:49:32 +02:00
MechaCat02	66bd805726	[iterate-2V] VdSwap: stop bumping primary CP_RB_WPTR out-of-band (canary-faithful) Ours' `vd_swap` wrote its 64-dword XE_SWAP block at the guest's reserved `buffer_ptr` slot AND then bumped the primary ring `CP_RB_WPTR` out-of-band via `state.gpu.extend_write_ptr_by(64)`. That bump was a bug: `buffer_ptr` (~0x4add6efc) is NOT inside the primary ring (base ~0x4adcd000, 8192 dwords) — it lives ~10k dwords past it, in the renderer indirect-buffer region. The bogus WPTR bump pushed the GPU read-pointer PAST the guest's real write-pointer; the drain treated the overshoot as a circular wrap and re-executed the splash's draw indirect-buffers ~2×, inflating draws to 78 (the real splash geometry is ~28 draws; 12 INDIRECT_BUFFERs vs the real 6). Canary's `VdSwap_entry` (xenia-canary xboxkrnl_video.cc:518-548) writes the fetch-constant patch + PM4_XE_SWAP + NOP pad into the reserved slot and returns — it NEVER touches CP_RB_WPTR. The guest advances the primary ring write-pointer itself via its own doorbell once it has populated the slot; swap-complete CP interrupts come only from the game's in-stream PM4_INTERRUPT packets, never from VdSwap. This fix removes only the out-of-band `extend_write_ptr_by(64)` call, keeping the buffer_ptr block write intact and byte-faithful to canary. Effect at `--gpu-inline -n 50M`: draws 78→28, INDIRECT_BUFFER 12→6 (re-execution artifact gone), swaps 4→2. The run now halts at ~19.27M instructions (worker threads exit) instead of spinning to 50M, because removing the corruption unmasks the real per-present-interrupt deadlock — the title loop needs a per-present PM4_INTERRUPT that the stalled game never submits. That deadlock is a SEPARATE, known gate tracked/addressed elsewhere; it is intentionally NOT papered over here. Re-baselined golden crates/xenia-app/tests/golden/sylpheed_n50m.json to the new honest values (regenerated twice, byte-identical). sylpheed_n2m.json is unaffected (draws=0 at 2M). 
cargo test --workspace: 675 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 19:58:05 +02:00
MechaCat02	ad9c8e4cb8	[iterate-2U] VdGlobalDevice: allocate a real device cell so the swap counter (clock B) can advance Sylpheed's title loop re-runs its per-frame manager update sub_821741C8 only when "clock B" ([controller+88], the swap count) changes. Clock B's sole source is the CP swap-complete callback sub_824CE2B8, which bumps [gfx+15160] via the TWO-LEVEL deref [[VdGlobalDevice]+0]+15160, where VdGlobalDevice is the kernel variable export 0x01BE at guest .data 0x82000750. Ours patched that import slot with literal 0 (the old "passed through to Vd* shims, write 0" behaviour). Consequences, both confirmed at runtime: * the guest's graphics init stores its D3D device object via `stw r31, 0([0x82000750])` (sub_824C6DC0 @0x824C6F18) — with the slot 0, that store lands at address 0; * the swap callback reads [[0x82000750]] = [0] = 0 and increments [0+15160] (the null page) instead of the real device's swap counter. So [gfx+15160] never moved, clock B stayed frozen at 0, sub_821741C8 fired exactly once, and the game submitted one render batch (the 78-draw splash) then stalled. Fix mirrors xenia-canary RegisterVideoExports (xboxkrnl_video.cc:557-564) exactly: allocate a 4-byte cell, point the import slot at it, zero the cell. The guest then stores its device into the cell, and the callback's two-level deref resolves correctly. Verified: [0x82000750] now holds a real cell whose [+0] is the device (gfx state), the swap callback bumps [gfx+15160] 0->1, clock B advances, and the per-frame chain steps forward (sub_821741C8 fires 1->2x, GamePart update sub_821C7CB8 0->1x). Determinism: --gpu-inline digest re-baselined and byte-identical across runs. The fix shifts the early execution trajectory (clock B unfreezing), so the n50m golden moves imports 451500->178937 and instructions 50000001->50000014; draws/swaps/RTs/shaders unchanged (78/4/2/3). n2m golden unchanged (early boot, pre-fix-effect). 675 workspace tests green; sylpheed_n50m oracle green. 
Note: this breaks the FIRST hard blocker (clock B could never advance at all). Full per-frame sustain (draws past 78) needs a further step: each GamePart update must submit a per-frame command buffer (with PM4_INTERRUPT) during the asset-streaming phase to keep generating CP interrupts; ours currently produces only the single seed interrupt from the initial batch, so the chain advances once and re-stalls. Tracked for the next iterate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 16:20:08 +02:00
MechaCat02	873c197ff1	[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). 
The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 15:20:02 +02:00
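The two-level dereference behind the iterate-2U fix can be traced with a toy memory model. Everything here is a sketch under stated assumptions: `Words` is a hypothetical word map in which unwritten addresses read as 0, and `swap_counter_addr` is an invented name; the offset 15160, the slot address 0x82000750, and the `[[VdGlobalDevice]+0]+15160` chain come from the commit message, while the cell and device addresses in the note are made up.

```rust
use std::collections::HashMap;

/// Offset of the swap counter inside the device object (from the commit).
const SWAP_COUNTER_OFF: u32 = 15160;

/// Hypothetical guest memory: a sparse word map where unwritten
/// addresses read as 0 (an assumption made for illustration).
#[derive(Default)]
struct Words(HashMap<u32, u32>);

impl Words {
    fn read(&self, addr: u32) -> u32 {
        *self.0.get(&addr).unwrap_or(&0)
    }
    fn write(&mut self, addr: u32, value: u32) {
        self.0.insert(addr, value);
    }
}

/// Address the swap callback increments: `[[import_slot] + 0] + 15160`.
fn swap_counter_addr(mem: &Words, import_slot: u32) -> u32 {
    let cell = mem.read(import_slot); // level 1: the slot holds a pointer to the cell
    let device = mem.read(cell); // level 2: the cell holds the device pointer
    device.wrapping_add(SWAP_COUNTER_OFF)
}
```

With the slot left at 0 the chain resolves to address 15160, near the null page; once the slot points at a cell (say 0x1000) that holds a device pointer (say 0x40000000), it resolves to 0x40000000 + 15160.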