Bundles state that lives OUTSIDE the xenia-rs repo so a fresh clone on
another machine can be brought up to identical configuration via
migration/setup.sh:
- claude-memory/ -> ~/.claude/projects/-home-fabi-RE-Project-Sylpheed/memory/
  (103 files, 1.1 MB - MEMORY.md + every project_xenia_rs_*.md from audits
  addis_signext through audit-058)
- project-root/dot-claude/ -> <project-root>/.claude/settings.json
  (Stop hook + permissions)
- project-root/ppc-manual/ -> <project-root>/ppc-manual/
  (PowerPC reference docs, 397 files, 3.7 MB)
- project-root/run-canary.sh -> <project-root>/run-canary.sh
- README.md: Human-readable setup checklist
- setup.sh: Idempotent installer (also reclones xenia-canary at pinned
  HEAD 6de80dffe)
- MANIFEST.md: Per-file mapping + per-file-not-bundled restoration recipe
Excluded from bundle (not shippable via git):
- Sylpheed ISO (7.8 GB; copyright; manual copy required)
- sylpheed.db (395 MB; regenerable from XEX via analysis tooling)
- target/ build artifacts (rebuild on target)
- audit-runs probe firehoses (.log/.stdout/.stderr ~11 GB; rerun if needed)
- audit-runs memory dumps (.bin ~4.5 GB; rerun audit-026/027/029 if needed)
- xenia-canary checkout (setup.sh reclones from
  git.mc02.dev/fabi/Xenia-Canary.git at HEAD 6de80dffe)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| name | description | type | originSessionId |
|---|---|---|---|
| xenia-rs Tier-4 perf landed (2026-04-25) | MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected | project | b082ddb2-530b-45e9-a454-5dfa856fecf3 |
What landed
Three perf changes on top of the prior Tier-1–3 work:
- MMIO fast-reject: `xenia-memory/src/heap.rs` `find_mmio` now does a single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare before falling through to the linear `iter().find` over registered regions. The aperture pair is recomputed in `add_mmio_region` via a `fold_aperture` helper (greatest common bit-mask agreement). The fast path is a necessary condition only: `contains()` still runs for matching candidates, so MMIO semantics are unchanged.
- Basic-block cache: new `xenia-cpu/src/block_cache.rs`. 64 K direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. Block walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap / Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page boundary (so a single `page_version` invalidation suffices). New `xenia-cpu::interpreter::step_block` dispatches each instruction in the block via the existing match-based `execute`.
- Hot-loop wiring + GPU pacer: `xenia-app/src/main.rs::run_execution` now branches on `debugger.wants_hooks() || db_writer.is_some() || force_per_instr`. The per-instruction path runs whenever any of those is true; otherwise the block path runs. A new env var `XENIA_FORCE_PER_INSTR=1` forces the slow path for A/B testing. Post-round GPU dispatch was changed from "1 `execute_one` per round" to `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))` so block mode (which executes ~6× more instructions per outer round) doesn't starve the GPU.
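The MMIO fast-reject can be illustrated with a minimal sketch. The struct layouts, the `Aperture` type, and the region addresses used in the usage note are assumptions made for illustration; only the names `find_mmio`, `add_mmio_region`, `fold_aperture`, and `contains` come from the note above, and the real `heap.rs` may compute the aperture differently.

```rust
/// Bits that every registered MMIO address agrees on (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Aperture {
    mask: u32,
    value: u32,
}

#[derive(Debug)]
struct MmioRegion {
    base: u32,
    size: u32, // assumed >= 1 and not wrapping past u32::MAX
}

impl MmioRegion {
    fn contains(&self, addr: u32) -> bool {
        addr.wrapping_sub(self.base) < self.size
    }

    /// Address bits shared by every address inside this one region.
    fn aperture(&self) -> Aperture {
        let last = self.base + (self.size - 1);
        let diff = self.base ^ last;
        // Every bit at or below the highest differing bit can vary.
        let varying = if diff == 0 { 0 } else { u32::MAX >> diff.leading_zeros() };
        let mask = !varying;
        Aperture { mask, value: self.base & mask }
    }
}

/// Keep only the bits on which the accumulated aperture and the new one agree.
fn fold_aperture(acc: Option<Aperture>, next: Aperture) -> Aperture {
    match acc {
        None => next,
        Some(a) => {
            let mask = a.mask & next.mask & !(a.value ^ next.value);
            Aperture { mask, value: a.value & mask }
        }
    }
}

#[derive(Default)]
struct Heap {
    regions: Vec<MmioRegion>,
    aperture: Option<Aperture>,
}

impl Heap {
    fn add_mmio_region(&mut self, base: u32, size: u32) {
        let region = MmioRegion { base, size };
        self.aperture = Some(fold_aperture(self.aperture, region.aperture()));
        self.regions.push(region);
    }

    fn find_mmio(&self, addr: u32) -> Option<&MmioRegion> {
        let ap = self.aperture?; // no regions registered
        if (addr & ap.mask) != ap.value {
            return None; // fast reject: cannot be inside any region
        }
        // Necessary condition only: the exact range check still decides.
        self.regions.iter().find(|r| r.contains(addr))
    }
}
```

With two hypothetical 64 KiB regions at `0x7FC8_0000` and `0x7FEA_0000`, the folded pair is mask `0xFFDD_0000`, value `0x7FC8_0000`. An address such as `0x8000_0000` is rejected by the single compare, while `0x7FCA_0000` passes the compare but is still rejected by `contains()`, which is why semantics stay unchanged.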
Why u32-narrowing and threaded-code dispatch were skipped
- u32-narrowing: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32 in their bodies. The remaining "obvious" target — addi/addis — runs natively at u64 because Xenon GPRs are 64-bit. No measurable win available without rewriting the ISA semantics.
- Threaded-code dispatch: extracting ~200 match arms into per-opcode free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a poor risk/reward. The basic-block cache benefit doesn't depend on threaded dispatch (each instruction inside a block still goes through the existing match), so this was the right phase to skip.
Both decisions match the plan's bench-gated rule: "Phase 4 must not be merged on principle alone — it merges only if numbers go up."
Numbers
Baseline (pre-perf-track) → final (xenia-rs check sylpheed.iso -n 2_000_000):
| metric | baseline | final | delta |
|---|---|---|---|
| wall-time | 318 ms | 136 ms | 2.3× |
| `tight_alu_loop` bench | 96.9 MIPS | 114.8 MIPS | +18.5% |
| `loadstore_loop` bench | 78.3 MIPS | 91.8 MIPS | +17.2% |
| `mmio_storm` bench | 59.7 MIPS | 67.8 MIPS | +13.6% |
| workspace tests | 352 | 370 | +18 |
The benches run via `cargo bench -p xenia-cpu` against the new
`crates/xenia-cpu/benches/interpreter.rs` harness. There is no criterion
dependency: the bench target sets `harness = false` and provides its own `main()`.
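As a rough illustration of what a criterion-free harness needs, the sketch below shows timing and MIPS helpers plus the arithmetic behind the delta column. All function names here are hypothetical and are not taken from `interpreter.rs`; a `harness = false` bench target would call helpers like these from its own `main()`.

```rust
use std::time::{Duration, Instant};

/// Run a workload once and return (instructions executed, wall time).
/// The closure is assumed to return how many guest instructions it ran.
fn time_workload<F: FnMut() -> u64>(mut workload: F) -> (u64, Duration) {
    let start = Instant::now();
    let executed = workload();
    (executed, start.elapsed())
}

/// Millions of instructions per second (infinite if `elapsed` is zero).
fn mips(instructions: u64, elapsed: Duration) -> f64 {
    instructions as f64 / elapsed.as_secs_f64() / 1.0e6
}

/// Relative change in percent, as used for the bench rows of the table.
fn percent_delta(baseline: f64, after: f64) -> f64 {
    (after - baseline) / baseline * 100.0
}

/// Speedup factor, as used for the wall-time row of the table.
fn speedup(baseline_ms: f64, after_ms: f64) -> f64 {
    baseline_ms / after_ms
}
```

Plugging in the table values, `percent_delta(96.9, 114.8)` is about 18.5 and `speedup(318.0, 136.0)` is about 2.3, matching the reported deltas.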
Verification
- Golden digest at -n 2M (`crates/xenia-app/tests/golden/sylpheed_n2m.json`): byte-identical between block and per-instruction modes.
- VdSwap fidelity: frame=1 fires before -n 18M; frame=2 fires between -n 18M–22M. Prior memory said "~28 M cycles" but that predates the GPU pacer; the actual figure with current scheduling shifts by mode (block mode is faster wall-time but identical instruction-count behavior up to the point of first thread divergence).
- Deadlock counters: 0 halts / 0 recoveries on every Sylpheed run.
- All 370 workspace tests pass, including new tests:
  - `xenia-memory::heap`: 5 (`mmio_fast_path_*`, `fold_aperture_*`).
  - `xenia-cpu::opcode`: 5 (`terminates_block_*`).
  - `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid terminator, invalidation, hit-returns-cached).
  - `xenia-cpu::interpreter`: 2 parity tests (`block_dispatch_matches_per_instruction_alu_loop` + `loadstore_loop`), bit-identical CPU state between paths on a single-thread workload.
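The block-cache behaviours covered by those tests (terminator, max length, page boundary, invalidation) can be sketched as follows. This is a simplified model, not the contents of `block_cache.rs`: the `Op` enum stands in for `PpcOpcode`, the `u64` type for `page_version`, the exclusive `end_pc`, and including the terminator in the block are all assumptions.

```rust
const MAX_BLOCK_INSTRS: usize = 32;
const PAGE_SIZE: u32 = 4096; // 4 KiB guest page
const SLOT_MASK: u32 = 0xFFFF; // 64 K direct-mapped slots

/// Hypothetical stand-in for the real opcode enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Op {
    Alu,
    Branch,
    Invalid,
}

impl Op {
    fn terminates_block(self) -> bool {
        matches!(self, Op::Branch | Op::Invalid)
    }
}

struct DecodedBlock {
    start_pc: u32,
    end_pc: u32, // assumed exclusive: address after the last instruction
    page_version: u64,
    instrs: Vec<Op>,
}

fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & SLOT_MASK) as usize
}

/// Walk forward from `start_pc` until a terminator, the length cap,
/// or the end of the 4 KiB page, whichever comes first.
fn build_block(start_pc: u32, page_version: u64, decode: impl Fn(u32) -> Op) -> DecodedBlock {
    let mut pc = start_pc;
    let mut instrs = Vec::new();
    loop {
        let op = decode(pc);
        instrs.push(op);
        pc = pc.wrapping_add(4);
        if op.terminates_block() || instrs.len() == MAX_BLOCK_INSTRS || pc % PAGE_SIZE == 0 {
            break;
        }
    }
    DecodedBlock { start_pc, end_pc: pc, page_version, instrs }
}

struct BlockCache {
    slots: Vec<Option<DecodedBlock>>,
}

impl BlockCache {
    fn new() -> Self {
        Self { slots: (0..=SLOT_MASK).map(|_| None).collect() }
    }

    fn insert(&mut self, block: DecodedBlock) {
        let index = slot_index(block.start_pc);
        self.slots[index] = Some(block); // direct-mapped: evicts any alias
    }

    /// A hit needs the same start address and an unchanged page version.
    fn lookup(&self, pc: u32, current_page_version: u64) -> Option<&DecodedBlock> {
        self.slots[slot_index(pc)]
            .as_ref()
            .filter(|b| b.start_pc == pc && b.page_version == current_page_version)
    }
}
```

Because a block never crosses a page, comparing one stored `page_version` against the current version of that page is enough to detect stale code. Addresses 256 KiB apart (for example `0x8200_0010` and `0x8204_0010`) map to the same slot, so the `start_pc` check is what keeps aliases from producing false hits.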
Important caveat: thread-interleaving divergence at large -n
At -n 30M+, the --expect digest differs between block and
per-instruction modes:
- imports diverge by ~10% (block lower)
- packets diverge by ~3.7× (block lower)
This is fundamental to any block-batching dispatcher in a multi-threaded scheduler. Per-instruction mode round-robins instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …); block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding. Different valid interleavings of the same multi-threaded program reach different relative-progress states at any given total instruction count. Both produce correct Sylpheed boots — VdSwap=1 and =2 fire, no deadlocks. Bit-identical comparison between modes is only meaningful at -n 2M (before workers spawn) and that remains the regression rail.
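The effect can be reproduced with a toy model. The sketch below is an assumption-laden simplification: it treats every block as a full burst of fixed length and ignores blocking, GPU work, and real block boundaries. The function name is hypothetical.

```rust
/// Per-thread instruction counts after spending `total_budget` instructions,
/// handing each HW thread up to `burst` instructions per turn.
/// `burst = 1` models per-instruction round-robin; `burst = 32` models
/// block mode with every block at the MAX_BLOCK_INSTRS cap.
/// Assumes `hw_threads >= 1` and `burst >= 1`.
fn progress_after(total_budget: u64, hw_threads: usize, burst: u64) -> Vec<u64> {
    let mut progress = vec![0u64; hw_threads];
    let mut remaining = total_budget;
    let mut thread = 0;
    while remaining > 0 {
        let step = burst.min(remaining);
        progress[thread] += step;
        remaining -= step;
        thread = (thread + 1) % hw_threads;
    }
    progress
}
```

With a budget of 40 instructions on two threads, `progress_after(40, 2, 1)` gives `[20, 20]` while `progress_after(40, 2, 32)` gives `[32, 8]`: the same total count, but different relative progress, so any digest sampled at a fixed -n can legitimately differ once more than one thread is running.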
Files touched in 2026-04-25 perf-track session
- `crates/xenia-cpu/Cargo.toml`: `[[bench]] name = "interpreter" harness = false`.
- `crates/xenia-cpu/benches/interpreter.rs`: new (3 benches).
- `crates/xenia-cpu/src/lib.rs`: `pub mod block_cache;`.
- `crates/xenia-cpu/src/block_cache.rs`: new file.
- `crates/xenia-cpu/src/interpreter.rs`: `step_block`, parity tests.
- `crates/xenia-cpu/src/opcode.rs`: `terminates_block` + tests.
- `crates/xenia-memory/src/heap.rs`: MMIO fast-reject + tests.
- `crates/xenia-app/src/main.rs`: block-cache wiring, GPU pacer, `XENIA_FORCE_PER_INSTR` escape hatch.
- `crates/xenia-app/tests/golden/sylpheed_n2m.json`: golden digest.
How to A/B test in future sessions
# block-cache mode (default)
./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json
# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...
# bench
cargo bench -p xenia-cpu
# or: cargo run --release --bench interpreter
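For orientation when reading `run_execution`, the path selection behind `XENIA_FORCE_PER_INSTR` and the GPU pacer formula can be sketched as below. The function names are hypothetical, `HW_THREAD_COUNT = 6` is an assumption (the note does not state the value), and treating only the exact value `1` as "set" is also an assumption about how the env var is parsed.

```rust
use std::env;

/// Assumed value; the real constant lives in the emulator sources.
const HW_THREAD_COUNT: u64 = 6;

/// True when the escape hatch is set to "1" in the environment.
fn force_per_instr_from_env() -> bool {
    env::var("XENIA_FORCE_PER_INSTR")
        .map(|v| v == "1")
        .unwrap_or(false)
}

/// Any consumer that needs per-instruction visibility disables block mode.
fn use_per_instruction_path(wants_hooks: bool, has_db_writer: bool, force_per_instr: bool) -> bool {
    wants_hooks || has_db_writer || force_per_instr
}

/// gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))
fn gpu_runs(executed_this_round: u64) -> u64 {
    (executed_this_round / HW_THREAD_COUNT).clamp(1, 64)
}
```

Under these assumptions, a round that executed 6 instructions yields 1 GPU run, a round of 192 yields 32, and anything very large is capped at 64, so GPU work scales with block mode's larger rounds without becoming unbounded.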
What's next on the perf track if needed
If Sylpheed boot is still too slow after this lands:
- Profile with `--profile out.svg` to see where time goes now.
- Threaded-code dispatch is still on the table, but only with a bench showing >1.5× win on `tight_alu_loop` from a small-prototype spike branch.
- The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower values (16, 8) reduce thread-interleaving divergence at the cost of dispatch wins.