---
name: xenia-rs Tier-4 perf landed (2026-04-25)
description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
type: project
originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3
---

## What landed

Three perf changes on top of the prior Tier-1–3 work:

1. **MMIO fast-reject**: in `crates/xenia-memory/src/heap.rs`, `find_mmio` now does a single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare before falling through to the linear `iter().find` over registered regions. The aperture pair is recomputed in `add_mmio_region` via a `fold_aperture` helper (greatest common bit-mask agreement). The fast path is a *necessary* condition only: `contains()` still runs for matching candidates, so MMIO semantics are unchanged.
2. **Basic-block cache**: new `crates/xenia-cpu/src/block_cache.rs`. 64 K direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. The block walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap / Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page boundary (so single-`page_version` invalidation suffices). New `xenia-cpu::interpreter::step_block` dispatches each instruction in the block via the existing match-based `execute`.
3. **Hot-loop wiring + GPU pacer**: `crates/xenia-app/src/main.rs::run_execution` now branches on `debugger.wants_hooks() || db_writer.is_some() || force_per_instr`. When any of those is true, only the per-instruction path runs; otherwise the block path is used. A new env var `XENIA_FORCE_PER_INSTR=1` forces the slow path for A/B testing. Post-round GPU dispatch was changed from "1 `execute_one` per round" to `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))` so block mode (which executes ~6× more instructions per outer round) doesn't starve the GPU.
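
The aperture fold from item 1 can be sketched as follows. This is a minimal illustration rather than the code in `heap.rs`: the `Aperture`, `MmioRegion`, and `MmioTable` types, their fields, and all signatures are assumptions made for the example. Only the names `find_mmio`, `add_mmio_region`, `fold_aperture`, and the mask/value compare come from the description above.

```rust
/// Bits every registered MMIO address agrees on (`mask`) and their shared
/// value (`value`). Hypothetical type for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Aperture {
    pub mask: u32,
    pub value: u32,
}

impl Aperture {
    /// Rejects every address: `addr & 0` is 0, which never equals 1.
    pub const EMPTY: Aperture = Aperture { mask: 0, value: 1 };
}

/// Hypothetical region: `size` is assumed non-zero and the range is assumed
/// not to wrap past `u32::MAX`.
#[derive(Clone, Copy, Debug)]
pub struct MmioRegion {
    pub base: u32,
    pub size: u32,
}

impl MmioRegion {
    pub fn contains(&self, addr: u32) -> bool {
        addr.wrapping_sub(self.base) < self.size
    }

    /// Aperture of one region: the common bit prefix of its first and last address.
    pub fn aperture(&self) -> Aperture {
        let last = self.base + (self.size - 1);
        let diff = self.base ^ last;
        // Every bit at or below the highest differing bit can vary inside the range.
        let varying = if diff == 0 { 0 } else { u32::MAX >> diff.leading_zeros() };
        Aperture { mask: !varying, value: self.base & !varying }
    }
}

/// Keep only the bits on which both apertures are fixed and agree.
pub fn fold_aperture(acc: Aperture, region: &MmioRegion) -> Aperture {
    let next = region.aperture();
    if acc == Aperture::EMPTY {
        return next;
    }
    let mask = acc.mask & next.mask & !(acc.value ^ next.value);
    Aperture { mask, value: acc.value & mask }
}

pub struct MmioTable {
    pub regions: Vec<MmioRegion>,
    pub aperture: Aperture,
}

impl MmioTable {
    pub fn new() -> Self {
        Self { regions: Vec::new(), aperture: Aperture::EMPTY }
    }

    pub fn add_mmio_region(&mut self, region: MmioRegion) {
        self.aperture = fold_aperture(self.aperture, &region);
        self.regions.push(region);
    }

    /// Index of the region containing `addr`, if any.
    pub fn find_mmio(&self, addr: u32) -> Option<usize> {
        if (addr & self.aperture.mask) != self.aperture.value {
            return None; // fast reject: cannot be inside any region
        }
        // A matching aperture is only a necessary condition, so still scan.
        self.regions.iter().position(|r| r.contains(addr))
    }
}
```

With two illustrative regions at `0x7FC8_0000` (64 KiB) and `0x7FEA_0000` (4 KiB), the folded aperture keeps the bits `0xFFDD_0000`. An address such as `0x7FEA_1000` passes the aperture test but is still rejected by `contains()`, which is why the fast path cannot change MMIO semantics.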
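
The slot keying and the three block-walk stop conditions from item 2 can be illustrated with a small sketch. It assumes that the terminating instruction is included in the block and that `end_pc` is exclusive; neither is stated above. The `terminates` closure is a hypothetical stand-in for decoding the word at `pc` and calling `PpcOpcode::terminates_block()`.

```rust
pub const MAX_BLOCK_INSTRS: usize = 32;
pub const SLOT_COUNT: usize = 1 << 16; // 64 K direct-mapped slots
pub const PAGE_SIZE: u32 = 4096; // 4 KiB guest page

/// Direct-mapped slot for a block starting at `start_pc`.
/// Instructions are 4 bytes, so the low two bits carry no information.
pub fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & 0xFFFF) as usize
}

/// Walks forward from `start_pc` and returns the exclusive end of the block.
pub fn block_end(start_pc: u32, terminates: impl Fn(u32) -> bool) -> u32 {
    let mut pc = start_pc;
    let mut count = 0;
    loop {
        count += 1;
        let is_terminator = terminates(pc);
        pc = pc.wrapping_add(4);
        // Stop after a terminator, at the length cap, or when the next
        // instruction would start a new guest page.
        if is_terminator || count == MAX_BLOCK_INSTRS || pc % PAGE_SIZE == 0 {
            return pc;
        }
    }
}
```

Because the key uses 16 bits of the word address, two start addresses 256 KiB apart land in the same slot, so a lookup presumably also has to compare the stored `start_pc` before treating the slot as a hit.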

## Why u32-narrowing and threaded-code dispatch were skipped

- **u32-narrowing**: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32 in their bodies. The remaining "obvious" target, addi/addis, runs natively at u64 because Xenon GPRs are 64-bit. No measurable win is available without rewriting the ISA semantics.
- **Threaded-code dispatch**: extracting ~200 match arms into per-opcode free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a poor risk/reward. The basic-block cache benefit doesn't depend on threaded dispatch (each instruction inside a block still goes through the existing match), so this was the right phase to skip.

Both decisions match the plan's bench-gated rule: "Phase 4 must not be merged on principle alone; it merges only if numbers go up."

## Numbers

Baseline (pre-perf-track) → final (`xenia-rs check sylpheed.iso -n 2_000_000`):

| metric                 | baseline  | final      | delta       |
|------------------------|-----------|------------|-------------|
| wall-time              | 318 ms    | 136 ms     | 2.3× faster |
| `tight_alu_loop` bench | 96.9 MIPS | 114.8 MIPS | +18.5%      |
| `loadstore_loop` bench | 78.3 MIPS | 91.8 MIPS  | +17.2%      |
| `mmio_storm` bench     | 59.7 MIPS | 67.8 MIPS  | +13.6%      |
| workspace tests        | 352       | 370        | +18         |

Bench is `cargo bench -p xenia-cpu` against the new `crates/xenia-cpu/benches/interpreter.rs` harness. No criterion dep: custom `harness = false` `main()`.

## Verification

- **Golden digest at -n 2M** (`crates/xenia-app/tests/golden/sylpheed_n2m.json`): byte-identical between block and per-instruction modes.
- **VdSwap fidelity**: frame=1 fires before -n 18M; frame=2 fires between -n 18M and 22M. Prior memory said "~28 M cycles", but that figure predates the GPU pacer. With current scheduling the exact figure shifts by mode: block mode has faster wall-time but identical instruction-count behavior up to the point of first thread divergence.
- **Deadlock counters**: 0 halts / 0 recoveries on every Sylpheed run.
- **All 370 workspace tests pass**, including new tests:
  - `xenia-memory::heap`: 5 (`mmio_fast_path_*`, `fold_aperture_*`).
  - `xenia-cpu::opcode`: 5 (`terminates_block_*`).
  - `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid terminator, invalidation, hit-returns-cached).
  - `xenia-cpu::interpreter`: 2 parity tests (`block_dispatch_matches_per_instruction_alu_loop` + `loadstore_loop`), checking bit-identical CPU state between paths on a single-thread workload.

## Important caveat: thread-interleaving divergence at large -n

At -n 30M+, the `--expect` digest **differs** between block and per-instruction modes:

- imports diverge by ~10% (block lower)
- packets diverge by ~3.7× (block lower)

This is **fundamental to any block-batching dispatcher** in a multi-threaded scheduler. Per-instruction mode round-robins instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …); block mode lets HW0 burst up to `MAX_BLOCK_INSTRS` before yielding. Different valid interleavings of the same multi-threaded program reach different relative-progress states at any given total instruction count.

Both produce correct Sylpheed boots: VdSwap=1 and =2 fire, no deadlocks. Bit-identical comparison between modes is only meaningful at -n 2M (before workers spawn) and that remains the regression rail.

## Files touched in 2026-04-25 perf-track session

- `crates/xenia-cpu/Cargo.toml`: `[[bench]] name = "interpreter" harness = false`.
- `crates/xenia-cpu/benches/interpreter.rs`: new (3 benches).
- `crates/xenia-cpu/src/lib.rs`: `pub mod block_cache;`.
- `crates/xenia-cpu/src/block_cache.rs`: new file.
- `crates/xenia-cpu/src/interpreter.rs`: `step_block`, parity tests.
- `crates/xenia-cpu/src/opcode.rs`: `terminates_block` + tests.
- `crates/xenia-memory/src/heap.rs`: MMIO fast-reject + tests.
- `crates/xenia-app/src/main.rs`: block-cache wiring, GPU pacer, `XENIA_FORCE_PER_INSTR` escape hatch.
- `crates/xenia-app/tests/golden/sylpheed_n2m.json`: golden digest.

## How to A/B test in future sessions

```bash
# block-cache mode (default)
./target/release/xenia-rs check -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json

# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check -n 2_000_000 --expect ...

# bench
cargo bench -p xenia-cpu
# or select the target explicitly (cargo run has no --bench flag):
cargo bench -p xenia-cpu --bench interpreter
```

## What's next on the perf track if needed

If Sylpheed boot is still too slow after this lands:

1. Profile with `--profile out.svg` to see where time goes now.
2. Threaded-code dispatch is still on the table, but only with a bench showing >1.5× win on `tight_alu_loop` from a small-prototype spike branch.
3. The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower values (16, 8) reduce thread-interleaving divergence at the cost of dispatch wins.
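
The trade-off in item 3 can be illustrated with a toy scheduler model. This is a sketch under simplifying assumptions, not the scheduler in `main.rs`: every thread always has work, each turn runs exactly `burst` instructions (real blocks vary in length up to the cap), and the thread count of 6 used in the example is an assumed value for `HW_THREAD_COUNT`. The function names are hypothetical.

```rust
/// Per-thread instruction counts after `budget` total instructions when each
/// thread runs up to `burst` instructions per round-robin turn.
pub fn progress(threads: usize, budget: u64, burst: u64) -> Vec<u64> {
    assert!(threads > 0 && burst > 0, "need at least one thread and a non-zero burst");
    let mut done = vec![0u64; threads];
    let mut remaining = budget;
    let mut t = 0;
    while remaining > 0 {
        let n = burst.min(remaining);
        done[t] += n;
        remaining -= n;
        t = (t + 1) % threads;
    }
    done
}

/// Gap between the most and least advanced thread.
pub fn spread(done: &[u64]) -> u64 {
    let max = done.iter().copied().max().unwrap_or(0);
    let min = done.iter().copied().min().unwrap_or(0);
    max - min
}
```

In this model, stopping 6 threads at a budget of 100 instructions gives `[17, 17, 17, 17, 16, 16]` with a burst of 1 and `[32, 32, 32, 4, 0, 0]` with a burst of 32. The spread is bounded by the burst size, which matches the intuition that a lower cap narrows the gap between modes at any fixed `-n`.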