[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)

Profile-driven, low-risk optimizations attacking the ~48% per-block / per-round host-bookkeeping tax found by the callgrind profile. Measured on the bounded headless workload `check -n 100000000 --gpu-inline`: baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%.

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):

1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch behind one has_mem_watch() predicted branch in write_u8/16/32/64 + write_bulk so the common (no-watch) store does no out-of-line call. check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which fills a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round loop no longer calls __rust_alloc/__rust_dealloc for a Vec<u8> per round. Identical ordering/RNG-advance. __rust_alloc/dealloc gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to worker_prologue so the four fire_*_if_match calls don't happen at all when no probe is configured (was 4x call overhead/visit). All four gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk address band (KernelState::pc_in_thunk_band, two int compares) before the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.

Tier B (#5, time-granularity change; LANDED, no re-baseline needed):

5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write the KeTimeStampBundle when the deterministic clock advanced >= 2500 units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the 1 ms granularity any guest deadline math needs (tick_count stays fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per 3AH, not this bundle).

VERIFIED: n50m stable digest BYTE-IDENTICAL to the existing golden (so no re-baseline), 150M boot reaches the splash (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2; all match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
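
The Tier B throttle in item 5 can be illustrated with a minimal standalone sketch. This is not the code from this commit: `BundleThrottle`, `should_write`, and `QUANTUM_UNITS` are hypothetical names, and the 100 ns clock unit is only inferred from the message's "0.25 ms" and "2500 units" figures.

```rust
/// Quantum in deterministic-clock units. 2500 units == 0.25 ms holds only
/// under the assumption that one unit is 100 ns.
const QUANTUM_UNITS: u64 = 2_500;

/// Hypothetical stand-in for the state that decides whether the timestamp
/// bundle needs to be re-written.
struct BundleThrottle {
    /// Clock value at the last write; `None` until the first write.
    last_written: Option<u64>,
}

impl BundleThrottle {
    fn new() -> Self {
        Self { last_written: None }
    }

    /// Returns `true` when the bundle should be re-written at `now`:
    /// on the first call, and afterwards only once the clock has advanced
    /// by at least `QUANTUM_UNITS` since the last write.
    fn should_write(&mut self, now: u64) -> bool {
        match self.last_written {
            Some(prev) if now.wrapping_sub(prev) < QUANTUM_UNITS => false,
            _ => {
                self.last_written = Some(now);
                true
            }
        }
    }
}

fn main() {
    let mut throttle = BundleThrottle::new();
    let mut writes = 0u32;
    // Visit the update site every 100 units over a 10_000-unit window.
    for now in (0..10_000u64).step_by(100) {
        if throttle.should_write(now) {
            writes += 1;
        }
    }
    println!("bundle writes: {writes} of 100 visits");
}
```

In this sketch the expensive write runs on a small fraction of visits while the cheap comparison runs on all of them, which is the shape of saving the message reports for update_timestamp_bundle.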

Remaining headroom: interpreter::execute (13%), decrement_quantum (8%), step_block (7%) are now the top self-costs; the structural superblock/JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:05:53 +02:00
parent 9d24dd0eaa
commit dc1320cd4b
4 changed files with 156 additions and 26 deletions
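
As a reading aid for the scheduler.rs hunk, here is a hedged sketch of how a caller might use the allocation-free variant. `MiniScheduler` is a hypothetical, cut-down stand-in (sequential ordering only, no seeded shuffle), and `HW_THREAD_COUNT = 6` is an assumed value; the real call sites are in files not shown in this section.

```rust
/// Assumed value for this sketch; the real constant lives in the crate.
const HW_THREAD_COUNT: usize = 6;

/// Hypothetical cut-down scheduler holding only the two fields the
/// rotation logic needs.
struct MiniScheduler {
    /// Bit i set means slot i has runnable work.
    non_empty_runnable: u8,
    /// Slot at which the next round starts scanning.
    rotation_cursor: u8,
}

impl MiniScheduler {
    /// Fills `buf` with runnable slot ids in rotation order and returns
    /// the count `n`; only `buf[..n]` is meaningful afterwards.
    fn round_schedule_into(&mut self, buf: &mut [u8; HW_THREAD_COUNT]) -> usize {
        if self.non_empty_runnable == 0 {
            return 0;
        }
        let start = self.rotation_cursor as usize;
        let mut n = 0usize;
        for off in 0..HW_THREAD_COUNT {
            let i = (start + off) % HW_THREAD_COUNT;
            if self.non_empty_runnable & (1 << i) != 0 {
                buf[n] = i as u8;
                n += 1;
            }
        }
        self.rotation_cursor = ((start + 1) % HW_THREAD_COUNT) as u8;
        n
    }
}

fn main() {
    let mut sched = MiniScheduler {
        non_empty_runnable: 0b10_0101, // slots 0, 2 and 5 runnable
        rotation_cursor: 0,
    };
    // The buffer is declared once, outside the round loop, so no heap
    // allocation happens per round.
    let mut buf = [0u8; HW_THREAD_COUNT];
    for round in 0..3 {
        let n = sched.round_schedule_into(&mut buf);
        println!("round {round}: {:?}", &buf[..n]);
    }
}
```

Entries past `n` are stale from earlier rounds, so a caller should always slice with `buf[..n]` rather than iterate the whole array.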
--- a/crates/xenia-cpu/src/scheduler.rs
+++ b/crates/xenia-cpu/src/scheduler.rs
@@ -795,31 +795,46 @@ impl Scheduler {
    /// the fast path — zero bits mean no slot has work and the caller
    /// falls through to `advance_to_next_wake`.
    pub fn round_schedule(&mut self) -> Vec<u8> {
+        let mut buf = [0u8; HW_THREAD_COUNT];
+        let n = self.round_schedule_into(&mut buf);
+        buf[..n].to_vec()
+    }
+
+    /// Allocation-free variant of [`Self::round_schedule`] (Tier-A perf #2).
+    /// Fills `buf` with the runnable slot ids and returns the count `n`; the
+    /// valid range is `buf[..n]`. The hot scheduler loop (lockstep +
+    /// parallel) calls this with a reusable stack array so it does not
+    /// `__rust_alloc`/`__rust_dealloc` a fresh `Vec` every round (~7 instr
+    /// apart at boot-to-splash → millions of churned allocations). Identical
+    /// ordering / RNG-advance semantics to `round_schedule`, so the schedule
+    /// — and thus the lockstep digest — is byte-for-byte unchanged.
+    pub fn round_schedule_into(&mut self, buf: &mut [u8; HW_THREAD_COUNT]) -> usize {
        if self.non_empty_runnable == 0 {
-            return Vec::new();
+            return 0;
        }
        let start = self.rotation_cursor as usize;
-        let mut out: Vec<u8> = Vec::with_capacity(HW_THREAD_COUNT);
+        let mut n = 0usize;
        for off in 0..HW_THREAD_COUNT {
            let i = (start + off) % HW_THREAD_COUNT;
            if self.non_empty_runnable & (1 << i) != 0 {
-                out.push(i as u8);
+                buf[n] = i as u8;
+                n += 1;
            }
        }
        // Seeded mode layers a deterministic shuffle on top of the
        // already-filtered list. Same spawn/wake sequence + same seed ⇒
        // same schedule (invariant preserved from pre-Axis-1).
        if let OrderMode::Seeded { .. } = self.order {
-            for i in (1..out.len()).rev() {
+            for i in (1..n).rev() {
                self.rng_state ^= self.rng_state << 13;
                self.rng_state ^= self.rng_state >> 7;
                self.rng_state ^= self.rng_state << 17;
                let j = (self.rng_state as usize) % (i + 1);
-                out.swap(i, j);
+                buf.swap(i, j);
            }
        }
        self.rotation_cursor = ((start + 1) % HW_THREAD_COUNT) as u8;
-        out
+        n
    }

    pub fn begin_round(&mut self) {