[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash)

Replace the one-basic-block-per-slot-per-round lockstep dispatch with a SUPERBLOCK runner: each slot-visit chains straight-line blocks through their terminating branches up to a deterministic instruction budget, amortizing the per-round (timebase/coord/round_schedule) and per-slot (worker_prologue) dispatch tax over ~128 instructions instead of ~6.

Yield-points (end the chain, return to the round) are pure functions of guest state, preserving the lockstep cross-thread interleaving correctness:
- non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted); db16cyc Yield is the spin-wait producer hand-off.
- sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build).
- MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled per block, keeps GPU/register ordering at one-block granularity.
- next PC leaves ordinary guest code (import thunk / halt sentinel / unmapped) -> hand to the full worker_prologue next round.
- instruction budget reached.

Instruction-count/clock accounting stays exact: per-block cycle_count deltas are summed and handed to worker_epilogue once (instruction_count + decrement_quantum advance by the precise retired count).

XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule byte-for-byte.

Budget tuned to 128 (env-overridable): boot progression stays healthy up to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves); 128 is 3x below the cliff.

Also scale the inline-GPU per-round fairness cap with the budget (flat 64 throttled GPU command processing 17x under superblocks and collapsed the present loop).

PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 -> 41.4 MIPS (1.59x).
Callgrind n=5M: host instructions 2.178B -> 1.507B (-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit / round_schedule_into / coord_post_round / update_timestamp_bundle each ~-90%; interpreter execute byte-identical (real work unchanged).

GATES:
- C1 boot progression: 150M draws 7391/swaps 2164 (baseline 7415/2172), 1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact.
- C2 determinism: n50m stable digest byte-identical across fresh runs; golden re-baselined intentionally (pacing-only deltas: imports 333453->243387, draws 1274->1279).
- C3 milestone-1 render: texture_decodes/draws/swaps/present cadence track baseline (3AJ fade-in pacing preserved).
- C4: 690 tests green (+2 sync_sensitive).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:31:54 +02:00
parent dc1320cd4b
commit acb29db444
6 changed files with 289 additions and 29 deletions
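
The chain-until-yield-point loop described in the commit message can be pictured with a small, self-contained Rust model. This is a sketch under stated assumptions, not the runner in this commit: `Block`, `Step`, `ChainEnd`, `Memory` and `run_superblock` are hypothetical names, block execution is simulated by plain fields, and the order in which the yield conditions are checked is a guess. Only the five yield conditions themselves and the before/after sampling of `mmio_access_count()` come from the message and diff.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the interpreter's per-block step result.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step { Continue, Yield }

/// Why a chain ended (names are illustrative, not from the commit).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChainEnd { NonContinue, SyncSensitive, MmioTouch, LeftGuestCode, Budget }

/// Toy decoded block: execution is simulated by these fields.
struct Block {
    len: u64,               // guest instructions retired by this block
    sync_sensitive: bool,   // e.g. contains lwarx/stwcx./sync
    touches_mmio: bool,     // simulated MMIO load/store inside the block
    result: Step,           // simulated step result
    next_pc: Option<usize>, // None models "next PC leaves ordinary guest code"
}

/// Toy memory that only carries the monotonic MMIO access counter.
struct Memory { mmio_access_count: AtomicU64 }

impl Memory {
    fn new() -> Self { Memory { mmio_access_count: AtomicU64::new(0) } }
    fn mmio_access_count(&self) -> u64 { self.mmio_access_count.load(Ordering::Relaxed) }
    fn bump_mmio_access(&self) { self.mmio_access_count.fetch_add(1, Ordering::Relaxed); }
}

/// Chain blocks starting at `pc` until a yield-point fires.
/// Returns the exact retired instruction count and the reason for stopping.
fn run_superblock(blocks: &[Block], mem: &Memory, mut pc: usize, budget: u64) -> (u64, ChainEnd) {
    let mut retired = 0u64;
    loop {
        let block = &blocks[pc];
        let before = mem.mmio_access_count(); // watermark sampled per block
        if block.touches_mmio { mem.bump_mmio_access(); } // "execute" the block
        retired += block.len; // summed once, handed to the epilogue by the caller

        if block.result != Step::Continue { return (retired, ChainEnd::NonContinue); }
        if block.sync_sensitive { return (retired, ChainEnd::SyncSensitive); }
        if mem.mmio_access_count() != before { return (retired, ChainEnd::MmioTouch); }
        match block.next_pc {
            Some(next) if next < blocks.len() => pc = next,
            _ => return (retired, ChainEnd::LeftGuestCode),
        }
        if retired >= budget { return (retired, ChainEnd::Budget); }
    }
}
```

In this model a budget of 1 ends every chain after a single block, which mirrors the one-block schedule that `XENIA_SUPERBLOCK_BUDGET=1` is stated to reproduce. Because every condition depends only on the simulated guest state and the counter, repeated runs from the same state stop at the same point.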
--- a/crates/xenia-memory/src/heap.rs
+++ b/crates/xenia-memory/src/heap.rs
@@ -89,6 +89,14 @@ pub struct GuestMemory {
    mem_watch_addrs: Vec<u32>,
    /// Count of fires observed (for tests / hand-off telemetry).
    mem_watch_count: AtomicU64,
+    /// Monotonic count of MMIO accesses (every scalar load/store that
+    /// resolves to a registered MMIO region bumps this by 1). A pure,
+    /// deterministic function of guest execution — the superblock runner
+    /// samples it before/after each block to detect an MMIO touch and
+    /// end the run there (so MMIO ordering vs other HW threads stays at
+    /// the same fine lockstep granularity as before). Relaxed because the
+    /// lockstep path is single-threaded and only needs monotonicity.
+    mmio_access_count: AtomicU64,
 }

 /// Greatest common bit-mask such that `(a & m) == (b & m)` for every bit
@@ -133,9 +141,26 @@ impl GuestMemory {
            writes_total: AtomicU64::new(0),
            mem_watch_addrs: Vec::new(),
            mem_watch_count: AtomicU64::new(0),
+            mmio_access_count: AtomicU64::new(0),
        })
    }

+    /// Monotonic count of MMIO accesses since boot. Used by the superblock
+    /// runner to detect that a just-executed block touched MMIO (so it can
+    /// end the superblock there and keep MMIO ordering at lockstep
+    /// granularity). Deterministic function of guest execution.
+    #[inline]
+    pub fn mmio_access_count(&self) -> u64 {
+        self.mmio_access_count
+            .load(std::sync::atomic::Ordering::Relaxed)
+    }
+
+    #[inline]
+    fn bump_mmio_access(&self) {
+        self.mmio_access_count
+            .fetch_add(1, std::sync::atomic::Ordering::Relaxed);
+    }
+
    /// Current version watermark for the page containing `addr`. Bumped by
    /// any write through `write_u8/16/32/64`. Not affected by MMIO writes
    /// (those don't touch the backing texture memory).
@@ -488,6 +513,7 @@ impl MemoryAccess for GuestMemory {
        // MMIO dispatch must come first — a byte read at an MMIO-mapped
        // address should invoke the callback, not the backing memory.
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            return (mmio.read_callback)(addr) as u8;
        }
        if !self.is_mapped(addr) { return 0; }
@@ -498,6 +524,7 @@ impl MemoryAccess for GuestMemory {
    #[inline]
    fn read_u16(&self, addr: u32) -> u16 {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.read_callback)(addr) as u16
        } else if !self.is_mapped(addr) {
            0
@@ -510,6 +537,7 @@ impl MemoryAccess for GuestMemory {
    #[inline]
    fn read_u32(&self, addr: u32) -> u32 {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.read_callback)(addr)
        } else if !self.is_mapped(addr) {
            0
@@ -522,6 +550,7 @@ impl MemoryAccess for GuestMemory {
    #[inline]
    fn read_u64(&self, addr: u32) -> u64 {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            let hi = (mmio.read_callback)(addr) as u64;
            let lo = (mmio.read_callback)(addr.wrapping_add(4)) as u64;
            (hi << 32) | lo
@@ -537,6 +566,7 @@ impl MemoryAccess for GuestMemory {
        // MMIO dispatch first — a byte write at an MMIO-mapped address
        // must invoke the callback, not the backing memory.
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.write_callback)(addr, val as u32);
            return;
        }
@@ -555,6 +585,7 @@ impl MemoryAccess for GuestMemory {

    fn write_u16(&self, addr: u32, val: u16) {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.write_callback)(addr, val as u32);
        } else if !self.is_mapped(addr) {
        } else {
@@ -577,6 +608,7 @@ impl MemoryAccess for GuestMemory {

    fn write_u32(&self, addr: u32, val: u32) {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.write_callback)(addr, val);
        } else if !self.is_mapped(addr) {
        } else {
@@ -596,6 +628,7 @@ impl MemoryAccess for GuestMemory {

    fn write_u64(&self, addr: u32, val: u64) {
        if let Some(mmio) = self.find_mmio(addr) {
+            self.bump_mmio_access();
            (mmio.write_callback)(addr, (val >> 32) as u32);
            (mmio.write_callback)(addr.wrapping_add(4), val as u32);
        } else if !self.is_mapped(addr) {