---
name: xenia-rs Tier-4 perf landed (2026-04-25)
description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
type: project
originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3
---
## What landed

Three perf changes on top of the prior Tier-1–3 work:

1. **MMIO fast-reject** — `xenia-memory/src/heap.rs` `find_mmio` now does a
   single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare
   before falling through to the linear `iter().find` over registered
   regions. The aperture pair is recomputed in `add_mmio_region` via a
   `fold_aperture` helper (the largest mask of address bits on which
   all registered regions agree). The fast path is a *necessary*
   condition only: `contains()` still runs for matching candidates, so
   MMIO semantics are unchanged.

2. **Basic-block cache** — new `xenia-cpu/src/block_cache.rs`. 64 K
   direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding
   a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. Block
   walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap /
   Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page
   boundary (so single-`page_version` invalidation suffices). New
   `xenia-cpu::interpreter::step_block` dispatches each instruction in
   the block via the existing match-based `execute`.

3. **Hot-loop wiring + GPU pacer** —
   `xenia-app/src/main.rs::run_execution` now branches on
   `debugger.wants_hooks() || db_writer.is_some() ||
   force_per_instr` — only the per-instruction path runs when any of
   those is true. A new env var `XENIA_FORCE_PER_INSTR=1` forces the
   slow path for A/B testing. Post-round GPU dispatch was changed from
   "1 `execute_one` per round" to
   `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))`
   so block mode (which executes ~6× more instructions per outer
   round) doesn't starve the GPU.
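
The fast-reject check from item 1 can be sketched as follows. This is a minimal illustration, not the code in `heap.rs`: the `MmioRegion` layout, the inclusive-range convention, the example addresses, and the way `fold_aperture` derives the mask (a shared high-bit prefix) are all assumptions made for the example.

```rust
// Illustrative sketch; types and addresses are assumptions, not the xenia-rs source.

/// Hypothetical MMIO region covering the inclusive guest range [start, end].
#[derive(Clone, Copy, Debug, PartialEq)]
struct MmioRegion {
    start: u32,
    end: u32,
}

impl MmioRegion {
    fn contains(&self, addr: u32) -> bool {
        self.start <= addr && addr <= self.end
    }
}

/// Returns `(mask, value)` such that every address inside any region
/// satisfies `(addr & mask) == value`.
fn fold_aperture(regions: &[MmioRegion]) -> (u32, u32) {
    let first = match regions.first() {
        Some(r) => r,
        // No regions: `(addr & 0) == 1` is never true, so every lookup is rejected.
        None => return (0, 1),
    };
    // Accumulate every bit where some region bound differs from the first start.
    let diff = regions
        .iter()
        .fold(0u32, |d, r| d | (r.start ^ first.start) | (r.end ^ first.start));
    // Keep only the bits above the highest differing bit: the shared prefix.
    let mask = if diff == 0 {
        u32::MAX
    } else {
        !(u32::MAX >> diff.leading_zeros())
    };
    (mask, first.start & mask)
}

/// Fast reject first, then the unchanged linear scan with `contains()`.
fn find_mmio(regions: &[MmioRegion], aperture: (u32, u32), addr: u32) -> Option<&MmioRegion> {
    let (mask, value) = aperture;
    if (addr & mask) != value {
        return None; // cannot be MMIO: skip the scan entirely
    }
    regions.iter().find(|r| r.contains(addr))
}

fn main() {
    // Hypothetical register windows, chosen only for the example.
    let regions = [
        MmioRegion { start: 0x7FC8_0000, end: 0x7FC8_FFFF },
        MmioRegion { start: 0x7FEA_0000, end: 0x7FEA_FFFF },
    ];
    let aperture = fold_aperture(&regions);
    println!("mask={:#010X} value={:#010X}", aperture.0, aperture.1);
    // Ordinary RAM address: rejected by the single compare.
    println!("{:?}", find_mmio(&regions, aperture, 0x1000_0000));
    // Inside the aperture but in no region: passes the compare, fails `contains()`.
    println!("{:?}", find_mmio(&regions, aperture, 0x7FD0_0000));
    // Real hit.
    println!("{:?}", find_mmio(&regions, aperture, 0x7FEA_0010));
}
```

The second lookup is the case that makes the fast path a necessary condition only: the address matches the aperture but no region contains it, so the scan still decides.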

## Why u32-narrowing and threaded-code dispatch were skipped

- **u32-narrowing**: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32
  in their bodies. The remaining "obvious" target — addi/addis — runs
  natively at u64 because Xenon GPRs are 64-bit. No measurable win
  available without rewriting the ISA semantics.
- **Threaded-code dispatch**: extracting ~200 match arms into per-opcode
  free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a
  poor risk/reward. The basic-block cache benefit doesn't depend on
  threaded dispatch (each instruction inside a block still goes through
  the existing match), so this was the right phase to skip.

Both decisions match the plan's bench-gated rule: "Phase 4 must not be
merged on principle alone — it merges only if numbers go up."

## Numbers

Baseline (pre-perf-track) → final (`xenia-rs check sylpheed.iso -n 2_000_000`):

| metric                 | baseline  | final      | delta       |
|------------------------|-----------|------------|-------------|
| wall-time              | 318 ms    | 136 ms     | 2.3× faster |
| `tight_alu_loop` bench | 96.9 MIPS | 114.8 MIPS | +18.5%      |
| `loadstore_loop` bench | 78.3 MIPS | 91.8 MIPS  | +17.2%      |
| `mmio_storm` bench     | 59.7 MIPS | 67.8 MIPS  | +13.6%      |
| workspace tests        | 352       | 370        | +18         |

Bench is `cargo bench -p xenia-cpu` against the new
`crates/xenia-cpu/benches/interpreter.rs` harness. No criterion dep —
custom `harness = false` `main()`.

## Verification

- **Golden digest at -n 2M** (`crates/xenia-app/tests/golden/sylpheed_n2m.json`):
  byte-identical between block and per-instruction modes.
- **VdSwap fidelity**: frame=1 fires before -n 18M; frame=2 fires
  between -n 18M and 22M. A prior memory note said "~28 M cycles", but
  that figure predates the GPU pacer. With current scheduling the exact
  figure shifts by mode: block mode is faster in wall-time, and its
  instruction-count behavior matches per-instruction mode only up to
  the point of first thread divergence.
- **Deadlock counters**: 0 halts / 0 recoveries on every Sylpheed run.
- **All 370 workspace tests pass**, including new tests:
  - `xenia-memory::heap`: 5 (mmio_fast_path_*, fold_aperture_*).
  - `xenia-cpu::opcode`: 5 (terminates_block_*).
  - `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid
    terminator, invalidation, hit-returns-cached).
  - `xenia-cpu::interpreter`: 2 parity tests
    (block_dispatch_matches_per_instruction_alu_loop +
     loadstore_loop) — bit-identical CPU state between paths on a
    single-thread workload.
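
The block-boundary rules that the `block_cache` tests exercise (page boundary, max length, terminator) and the slot key can be sketched as below. This is a hedged illustration: `block_end`, `slot_index`, the closure standing in for `PpcOpcode::terminates_block()`, and the conventions that the terminator is included in the block and the returned end PC is exclusive are assumptions, not the actual `block_cache.rs` API.

```rust
// Illustrative sketch; helper names and conventions are assumptions.

const SLOT_COUNT: usize = 1 << 16; // 64 K direct-mapped slots
const MAX_BLOCK_INSTRS: usize = 32;
const PAGE_SIZE: u32 = 4096; // 4 KiB guest page

/// Slot key from the note: `(start_pc >> 2) & 0xFFFF`.
fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & 0xFFFF) as usize
}

/// Walks 4-byte instructions from `start_pc` and returns the exclusive end PC.
/// `terminates(pc)` stands in for decoding the instruction at `pc` and
/// calling `terminates_block()` on it.
fn block_end(start_pc: u32, terminates: impl Fn(u32) -> bool) -> u32 {
    let page_end = (start_pc & !(PAGE_SIZE - 1)).wrapping_add(PAGE_SIZE);
    let mut pc = start_pc;
    for _ in 0..MAX_BLOCK_INSTRS {
        let is_terminator = terminates(pc);
        pc = pc.wrapping_add(4); // the terminator itself belongs to the block
        if is_terminator || pc == page_end {
            break;
        }
    }
    pc
}

fn main() {
    // Straight-line code: capped at MAX_BLOCK_INSTRS (32 * 4 = 0x80 bytes).
    println!("{:#010X}", block_end(0x8200_0000, |_| false));
    // Two instructions before the page ends: the block stops at the boundary.
    println!("{:#010X}", block_end(0x8200_0FF8, |_| false));
    // A branch at +8 ends the block right after it.
    println!("{:#010X}", block_end(0x8200_0000, |pc| pc == 0x8200_0008));
    // Start PCs 256 KiB apart map to the same slot, so a lookup presumably
    // has to compare the stored `start_pc` before treating it as a hit.
    println!("{} {}", slot_index(0x8200_0010), slot_index(0x8204_0010));
    assert!(slot_index(0xFFFF_FFFC) < SLOT_COUNT);
}
```

Because a block never crosses a 4 KiB page in this sketch, a single `page_version` per block is enough to detect stale entries, which matches the invalidation rationale given under "What landed".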

## Important caveat: thread-interleaving divergence at large -n

At -n 30M+, the `--expect` digest **differs** between block and
per-instruction modes:

- imports diverge by ~10% (block lower)
- packets diverge by ~3.7× (block lower)

This is **fundamental to any block-batching dispatcher** in a
multi-threaded scheduler. Per-instruction mode round-robins
instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …);
block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding.
Different valid interleavings of the same multi-threaded program
reach different relative-progress states at any given total
instruction count. Both produce correct Sylpheed boots: VdSwap
frame=1 and frame=2 fire, with no deadlocks. Bit-identical comparison
between modes is only meaningful at -n 2M (before workers spawn), and
that remains the regression rail.
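
A toy model makes the divergence concrete. It is a sketch under strong assumptions: two hypothetical threads with fixed block lengths, no blocking, no kernel calls, and a plain round-robin outer loop, none of which is taken from the real scheduler. It only shows that the same total instruction budget leaves the threads at different relative progress depending on dispatch mode.

```rust
// Toy model; thread shapes and the scheduler loop are assumptions.

const MAX_BLOCK_INSTRS: u64 = 32;

/// Per-thread instruction counts after exactly `budget` total instructions.
/// `block_len[i]` is a made-up, fixed basic-block length for thread `i`.
fn run(block_len: &[u64], budget: u64, block_mode: bool) -> Vec<u64> {
    assert!(!block_len.is_empty() && block_len.iter().all(|&l| l > 0));
    let mut progress = vec![0u64; block_len.len()];
    let mut executed = 0u64;
    'rounds: loop {
        for (i, &len) in block_len.iter().enumerate() {
            // Block mode bursts a whole block; per-instruction mode yields after one.
            let burst = if block_mode { len.min(MAX_BLOCK_INSTRS) } else { 1 };
            let step = burst.min(budget - executed);
            progress[i] += step;
            executed += step;
            if executed == budget {
                break 'rounds;
            }
        }
    }
    progress
}

fn main() {
    // Thread 0 runs long straight-line blocks; thread 1 is branchy.
    let threads = [32, 4];
    println!("per-instruction: {:?}", run(&threads, 1000, false));
    println!("block mode:      {:?}", run(&threads, 1000, true));
}
```

Under these assumptions per-instruction mode splits the budget evenly, while block mode gives the long-block thread most of it. Both are valid interleavings, which is why digests that depend on per-thread progress (imports, packets) can differ at the same `-n`.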

## Files touched in 2026-04-25 perf-track session

- `crates/xenia-cpu/Cargo.toml` — `[[bench]] name = "interpreter" harness = false`.
- `crates/xenia-cpu/benches/interpreter.rs` — new (3 benches).
- `crates/xenia-cpu/src/lib.rs` — `pub mod block_cache;`.
- `crates/xenia-cpu/src/block_cache.rs` — new file.
- `crates/xenia-cpu/src/interpreter.rs` — `step_block`, parity tests.
- `crates/xenia-cpu/src/opcode.rs` — `terminates_block` + tests.
- `crates/xenia-memory/src/heap.rs` — MMIO fast-reject + tests.
- `crates/xenia-app/src/main.rs` — block-cache wiring, GPU pacer,
  `XENIA_FORCE_PER_INSTR` escape hatch.
- `crates/xenia-app/tests/golden/sylpheed_n2m.json` — golden digest.

## How to A/B test in future sessions

```bash
# block-cache mode (default)
./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json

# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...

# bench
cargo bench -p xenia-cpu
# or, to run only this bench target: cargo bench -p xenia-cpu --bench interpreter
```

## What's next on the perf track if needed

If Sylpheed boot is still too slow after this lands:

1. Profile with `--profile out.svg` to see where time goes now.
2. Threaded-code dispatch is still on the table — but only with a
   bench showing >1.5× win on `tight_alu_loop` from a small-prototype
   spike branch.
3. The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower values
   (16, 8) reduce thread-interleaving divergence at the cost of
   dispatch wins.
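
When tuning the cap, it may help to see how it feeds the GPU pacer formula from "What landed". The sketch below assumes `HW_THREAD_COUNT = 6` (Xenon's six hardware threads; the constant's actual value is not stated in this note) and assumes a worst-case round in which every hardware thread runs one full block.

```rust
// Illustrative sketch; HW_THREAD_COUNT = 6 and the worst-case round are assumptions.

const HW_THREAD_COUNT: u32 = 6;

/// `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))`
fn gpu_runs(executed_this_round: u32) -> u32 {
    (executed_this_round / HW_THREAD_COUNT).clamp(1, 64)
}

fn main() {
    // Per-instruction mode: one instruction per HW thread per round.
    println!("per-instr: gpu_runs = {}", gpu_runs(HW_THREAD_COUNT));
    // Block mode, worst case: every HW thread executes a full block.
    for cap in [8u32, 16, 32] {
        let executed = cap * HW_THREAD_COUNT;
        println!("cap {:2}: up to {:3} instrs/round, gpu_runs = {}", cap, executed, gpu_runs(executed));
    }
}
```

In this model the GPU step count per round tracks the cap, so lowering `MAX_BLOCK_INSTRS` also changes GPU pacing granularity and should be re-benchmarked rather than assumed neutral.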