[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS)

Profile-driven, low-risk optimizations attacking the ~48% per-block / per-round host-bookkeeping tax found by the callgrind profile. Measured on the bounded headless workload `check -n 100000000 --gpu-inline`: baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21% MIPS.

Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0):

1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch behind one has_mem_watch() predicted branch in write_u8/16/32/64 + write_bulk so the common (no-watch) store makes no out-of-line call. check_mem_watch (4.8%) gone from the profile.
2. round-schedule alloc churn: add Scheduler::round_schedule_into, which fills a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round loop no longer calls __rust_alloc/__rust_dealloc for a Vec<u8> every round. Identical ordering/RNG-advance. __rust_alloc/dealloc gone from the profile.
3. probe-firing: hoist a single KernelState::any_probe_active() guard to worker_prologue so the four fire_*_if_match calls don't happen at all when no probe is configured (was 4x call overhead per visit). All four gone from the profile.
4. thunk-map hash: range-reject pc against the registered import-thunk address band (KernelState::pc_in_thunk_band, two int compares) before the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone.

Tier B (#5, time-granularity change; LANDED, no re-baseline needed):

5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write the KeTimeStampBundle when the deterministic clock has advanced >= 2500 units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the 1 ms granularity any guest deadline math needs (tick_count stays fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per 3AH, not this bundle).

VERIFIED: n50m stable digest BYTE-IDENTICAL to the existing golden (so no re-baseline), 150M boot reaches the splash (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2; all match the post-3AJ baseline), 688 tests green, release n50m oracle ok.
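
A minimal sketch of the Tier B throttle (#5), for illustration only. `TimeBundle`, `ThrottledPublisher` and `QUANTUM_UNITS` are hypothetical names, not the types in this repo, and the 100 ns clock unit is an assumption inferred from "2500 units = 0.25 ms":

```rust
/// Hypothetical stand-in for the guest-visible time bundle.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct TimeBundle {
    interrupt_time: u64,
}

/// Assumed quantum: 2500 clock units (0.25 ms if one unit is 100 ns).
const QUANTUM_UNITS: u64 = 2_500;

struct ThrottledPublisher {
    /// Clock value at the last bundle write; `None` until the first write.
    last_published: Option<u64>,
    bundle: TimeBundle,
    /// Number of bundle writes actually performed.
    writes: u64,
}

impl ThrottledPublisher {
    fn new() -> Self {
        Self { last_published: None, bundle: TimeBundle::default(), writes: 0 }
    }

    /// Re-writes the bundle only when the (monotonic) clock has advanced by
    /// at least `QUANTUM_UNITS` since the last write. Returns `true` if a
    /// write happened.
    fn update(&mut self, now: u64) -> bool {
        let due = match self.last_published {
            None => true,
            Some(last) => now.wrapping_sub(last) >= QUANTUM_UNITS,
        };
        if due {
            self.bundle.interrupt_time = now;
            self.last_published = Some(now);
            self.writes += 1;
        }
        due
    }
}

fn main() {
    let mut publisher = ThrottledPublisher::new();
    // Simulate 10_000 rounds, each advancing the clock by 100 units.
    for round in 0..10_000u64 {
        publisher.update(round * 100);
    }
    println!(
        "rounds = 10000, writes = {}, last bundle time = {}",
        publisher.writes, publisher.bundle.interrupt_time
    );
}
```

The point of the sketch is that the write count scales with elapsed clock time rather than with the number of rounds, which is where the inclusive-cost drop comes from.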
Remaining headroom: interpreter::execute (13%), decrement_quantum (8%), step_block (7%) are now the top self-costs. The structural superblock/JIT lever is the next step for the larger gain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:05:53 +02:00
parent 9d24dd0eaa
commit dc1320cd4b
4 changed files with 156 additions and 26 deletions
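
The range-reject from Tier-A #4 can be sketched in isolation as follows. This is a hedged illustration, not the repo's code: `ThunkTable`, its fields, and the example addresses and names are all hypothetical, and the real `KernelState::pc_in_thunk_band` may track its band differently.

```rust
use std::collections::HashMap;

/// Hypothetical registry: thunk PC -> (ordinal, name), plus the inclusive
/// address band that covers every registered key.
struct ThunkTable {
    map: HashMap<u32, (u16, &'static str)>,
    /// (lowest, highest) registered PC; `None` while the table is empty.
    band: Option<(u32, u32)>,
}

impl ThunkTable {
    fn new() -> Self {
        Self { map: HashMap::new(), band: None }
    }

    /// Inserts a thunk and widens the band so it always covers all keys.
    fn register(&mut self, pc: u32, ordinal: u16, name: &'static str) {
        self.map.insert(pc, (ordinal, name));
        self.band = Some(match self.band {
            None => (pc, pc),
            Some((lo, hi)) => (lo.min(pc), hi.max(pc)),
        });
    }

    /// Two integer compares; false for every PC when nothing is registered.
    fn pc_in_band(&self, pc: u32) -> bool {
        matches!(self.band, Some((lo, hi)) if lo <= pc && pc <= hi)
    }

    /// Cheap range-reject first; only in-band PCs pay for the hash lookup.
    fn lookup(&self, pc: u32) -> Option<&(u16, &'static str)> {
        if !self.pc_in_band(pc) {
            return None;
        }
        self.map.get(&pc)
    }
}

fn main() {
    let mut table = ThunkTable::new();
    table.register(0x8200_1000, 1, "ExampleImportA");
    table.register(0x8200_1010, 2, "ExampleImportB");

    assert!(table.lookup(0x8200_1000).is_some()); // registered thunk
    assert!(table.lookup(0x8200_1008).is_none()); // in band, not registered
    assert!(table.lookup(0x8210_0000).is_none()); // out of band, no hash paid
}
```

Because the band is widened on every registration, an out-of-band PC can never be a key in the map, so the early return cannot change the lookup result.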
--- a/crates/xenia-app/src/main.rs
+++ b/crates/xenia-app/src/main.rs
@@ -2459,10 +2459,19 @@ fn worker_prologue(
    // and println one record. Read-only; lockstep digest unaffected.
    // Empty set is the common case → single `is_empty()` test inside
    // the helper, no overhead on the hot path.
-    kernel.fire_ctor_probe_if_match(hw_id, mem);
-    kernel.fire_branch_probe_if_match(hw_id);
-    kernel.fire_audit_pc_probe_if_match(hw_id, mem);
-    kernel.fire_lr_trace_if_match(hw_id);
+    // Perf (Tier-A #3): all four `fire_*_if_match` helpers early-return
+    // on an empty registry, but paying 4× call overhead per slot-visit
+    // (~3.2M visits boot-to-splash) is itself measurable. Gate the whole
+    // group behind a single `any_probe_active()` predicted branch so the
+    // common (no-probe) headless path never even makes the calls. When a
+    // probe IS configured each helper still re-checks its own set, so
+    // behaviour is identical either way.
+    if kernel.any_probe_active() {
+        kernel.fire_ctor_probe_if_match(hw_id, mem);
+        kernel.fire_branch_probe_if_match(hw_id);
+        kernel.fire_audit_pc_probe_if_match(hw_id, mem);
+        kernel.fire_lr_trace_if_match(hw_id);
+    }

    if mem.has_mem_watch() {
        let ctx = kernel.scheduler.ctx(hw_id);
@@ -2528,8 +2537,15 @@ fn worker_prologue(
        return PrologueOutcome::Continue;
    }

-    // 2) Import thunk intercept.
-    if let Some((module, ordinal, name)) = thunk_map.get(&pc) {
+    // 2) Import thunk intercept. Perf (Tier-A #4): import thunks occupy a
+    // small contiguous address band; the overwhelming majority of executing
+    // PCs are ordinary guest code outside it. Range-reject against the band
+    // (two integer compares) before paying the `thunk_map` hash. Faithful
+    // no-op — any in-band PC still goes through the exact map lookup, and an
+    // out-of-band PC can never be a registered thunk.
+    if kernel.pc_in_thunk_band(pc)
+        && let Some((module, ordinal, name)) = thunk_map.get(&pc)
+    {
        let module = *module;
        let ordinal_u32 = *ordinal as u32;
        let thunk_pc = pc;
@@ -2854,6 +2870,10 @@ fn run_execution(
    // re-decoding the same handful of pages 60×/s.
    let mut isr_decode_cache = xenia_cpu::decoder::DecodeCache::new();

+    // Tier-A perf #2: reusable buffer for `round_schedule_into` so the round
+    // loop doesn't heap-allocate a `Vec<u8>` every iteration.
+    let mut order_buf = [0u8; xenia_cpu::scheduler::HW_THREAD_COUNT];
+
    'outer: loop {
        // Per-round prologue: budget / shutdown / heartbeat / vsync /
        // timers / audio-interrupt injection. Carved into
@@ -2908,10 +2928,12 @@ fn run_execution(
            thunk_map,
        );

-        // Snapshot round schedule. `round_schedule` also advances rng state
-        // when seeded; mutation is intentional.
+        // Snapshot round schedule. `round_schedule_into` also advances rng
+        // state when seeded; mutation is intentional. Perf (Tier-A #2): fill
+        // a reusable stack array instead of allocating a fresh Vec per round.
        kernel.scheduler.begin_round();
-        let order = kernel.scheduler.round_schedule();
+        let order_n = kernel.scheduler.round_schedule_into(&mut order_buf);
+        let order = &order_buf[..order_n];

        if order.is_empty() {
            // No Ready threads — advance time to the earliest pending
@@ -2933,7 +2955,7 @@ fn run_execution(
        // GPU when block dispatch engages.
        let instrs_at_round_start = stats.instruction_count;

-        for hw_id in order {
+        for &hw_id in order {
            let wc = &mut workers[hw_id as usize];
            match worker_prologue(
                wc,