xenia-rs/migration/claude-memory/project_xenia_rs_perf_tier4.md
MechaCat02 e6d43a23ac chore: add migration/ bundle for cross-machine setup
Bundles state that lives OUTSIDE the xenia-rs repo so a fresh clone on
another machine can be brought up to identical configuration via
migration/setup.sh:

  - claude-memory/             ~/.claude/projects/-home-fabi-RE-Project-Sylpheed/memory/
                               (103 files, 1.1 MB - MEMORY.md + every
                                project_xenia_rs_*.md from audits
                                addis_signext through audit-058)
  - project-root/dot-claude/   <project-root>/.claude/settings.json
                               (Stop hook + permissions)
  - project-root/ppc-manual/   <project-root>/ppc-manual/
                               (PowerPC reference docs, 397 files, 3.7 MB)
  - project-root/run-canary.sh <project-root>/run-canary.sh
  - README.md                  Human-readable setup checklist
  - setup.sh                   Idempotent installer (also reclones
                               xenia-canary at pinned HEAD 6de80dffe)
  - MANIFEST.md                Per-file mapping + per-file-not-bundled
                               restoration recipe

Excluded from bundle (not shippable via git):
  - Sylpheed ISO (7.8 GB; copyright; manual copy required)
  - sylpheed.db (395 MB; regenerable from XEX via analysis tooling)
  - target/ build artifacts (rebuild on target)
  - audit-runs probe firehoses (.log/.stdout/.stderr ~11 GB; rerun if needed)
  - audit-runs memory dumps (.bin ~4.5 GB; rerun audit-026/027/029 if needed)
  - xenia-canary checkout (setup.sh reclones from
    git.mc02.dev/fabi/Xenia-Canary.git at HEAD 6de80dffe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:38:38 +02:00

name: xenia-rs Tier-4 perf landed (2026-04-25)
description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
type: project
originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3

What landed

Three perf changes on top of the prior Tier-1 to Tier-3 work:

  1. MMIO fast-reject: in xenia-memory/src/heap.rs, find_mmio now does a single (addr & mmio_aperture_mask) != mmio_aperture_value compare before falling through to the linear iter().find over registered regions. The aperture pair is recomputed in add_mmio_region via a fold_aperture helper (greatest common bit-mask agreement). The fast path is a necessary condition only: contains() still runs for matching candidates, so MMIO semantics are unchanged.

  2. Basic-block cache: new xenia-cpu/src/block_cache.rs. 64 K direct-mapped slots keyed by (start_pc >> 2) & 0xFFFF, each holding a DecodedBlock { start_pc, end_pc, page_version, instrs }. The block walk stops on PpcOpcode::terminates_block() (branch / sc / trap / Invalid), at MAX_BLOCK_INSTRS = 32, or at a 4 KiB guest-page boundary (so a single page_version check suffices for invalidation). New xenia-cpu::interpreter::step_block dispatches each instruction in the block via the existing match-based execute.

  3. Hot-loop wiring + GPU pacer: xenia-app/src/main.rs::run_execution now branches on debugger.wants_hooks() || db_writer.is_some() || force_per_instr. When any of those is true, only the per-instruction path runs; otherwise the block path runs. A new env var XENIA_FORCE_PER_INSTR=1 forces the slow path for A/B testing. Post-round GPU dispatch was changed from "1 execute_one per round" to gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT)) so block mode (which executes ~6× more instructions per outer round) doesn't starve the GPU.
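
The path selection and pacer formula from item 3 can be sketched as two small pure functions. This is a minimal illustration, not the code in main.rs: the function names and the MAX_GPU_RUNS_PER_ROUND constant are hypothetical, and HW_THREAD_COUNT = 6 is an assumed value chosen to match the Xenon's six hardware threads.

```rust
/// Assumed value for illustration (Xenon: 3 cores x 2 hardware threads).
const HW_THREAD_COUNT: u64 = 6;
/// Hypothetical name for the upper bound of 64 in the pacer formula.
const MAX_GPU_RUNS_PER_ROUND: u64 = 64;

/// True when the slow per-instruction path must be used.
/// Mirrors `wants_hooks() || db_writer.is_some() || force_per_instr`.
fn use_per_instruction_path(wants_hooks: bool, has_db_writer: bool, force_per_instr: bool) -> bool {
    wants_hooks || has_db_writer || force_per_instr
}

/// GPU dispatch steps to run after one scheduler round:
/// max(1, min(64, executed_this_round / HW_THREAD_COUNT)).
fn gpu_runs_for_round(executed_this_round: u64) -> u64 {
    (executed_this_round / HW_THREAD_COUNT).clamp(1, MAX_GPU_RUNS_PER_ROUND)
}
```

With these assumptions, a per-instruction round that retires 6 instructions still yields one GPU step, while a block-mode round that retires 192 instructions yields 32, and anything from 384 instructions upward is capped at 64.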

Why u32-narrowing and threaded-code dispatch were skipped

  • u32-narrowing: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32 in their bodies. The remaining "obvious" target — addi/addis — runs natively at u64 because Xenon GPRs are 64-bit. No measurable win available without rewriting the ISA semantics.
  • Threaded-code dispatch: extracting ~200 match arms into per-opcode free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a poor risk/reward. The basic-block cache benefit doesn't depend on threaded dispatch (each instruction inside a block still goes through the existing match), so this was the right phase to skip.

Both decisions match the plan's bench-gated rule: "Phase 4 must not be merged on principle alone — it merges only if numbers go up."

Numbers

Baseline (pre-perf-track) → final (xenia-rs check sylpheed.iso -n 2_000_000):

| metric | baseline | final | delta |
|---|---|---|---|
| wall-time | 318 ms | 136 ms | 2.3× |
| tight_alu_loop bench | 96.9 MIPS | 114.8 MIPS | +18.5% |
| loadstore_loop bench | 78.3 MIPS | 91.8 MIPS | +17.2% |
| mmio_storm bench | 59.7 MIPS | 67.8 MIPS | +13.6% |
| workspace tests | 352 | 370 | +18 |

Bench is cargo bench -p xenia-cpu against the new crates/xenia-cpu/benches/interpreter.rs harness. There is no criterion dependency: the bench target sets harness = false and supplies its own main().
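
For orientation, a harness = false bench target is just a binary whose main() times a workload and prints a rate. The sketch below shows the kind of helpers such a main() could call. It is not the contents of benches/interpreter.rs: the names mips and run_bench are hypothetical, and the workload closure stands in for an interpreter loop that returns how many guest instructions it retired.

```rust
use std::time::{Duration, Instant};

/// Millions of instructions per second for a measured run.
fn mips(instructions: u64, elapsed: Duration) -> f64 {
    instructions as f64 / elapsed.as_secs_f64() / 1e6
}

/// Times one workload and prints its rate.
/// `workload` returns the number of guest instructions it executed.
fn run_bench<F: FnMut() -> u64>(name: &str, mut workload: F) -> f64 {
    let start = Instant::now();
    let executed = workload();
    let rate = mips(executed, start.elapsed());
    println!("{name}: {rate:.1} MIPS");
    rate
}
```

A bench main() would then call run_bench once per workload, for example once each for tight_alu_loop, loadstore_loop and mmio_storm.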

Verification

  • Golden digest at -n 2M (crates/xenia-app/tests/golden/sylpheed_n2m.json): byte-identical between block and per-instruction modes.
  • VdSwap fidelity: frame=1 fires before -n 18M; frame=2 fires between -n 18M and -n 22M. Prior memory said "~28 M cycles", but that predates the GPU pacer. With current scheduling the actual figure shifts by mode: block mode is faster in wall-time, but its instruction-count behavior is identical up to the point of first thread divergence.
  • Deadlock counters: 0 halts / 0 recoveries on every Sylpheed run.
  • All 370 workspace tests pass, including new tests:
    • xenia-memory::heap: 5 (mmio_fast_path_*, fold_aperture_*).
    • xenia-cpu::opcode: 5 (terminates_block_*).
    • xenia-cpu::block_cache: 6 (build, page boundary, max-len, invalid terminator, invalidation, hit-returns-cached).
    • xenia-cpu::interpreter: 2 parity tests (block_dispatch_matches_per_instruction_alu_loop + loadstore_loop) — bit-identical CPU state between paths on a single-thread workload.
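
The property the mmio_fast_path_* and fold_aperture_* tests guard can be illustrated with a small model of the aperture fold. This is a sketch under assumptions, not the code in heap.rs: the Region and MmioTable types, the common_prefix_mask helper, and the reading of "greatest common bit-mask agreement" as "the high address bits shared by every byte of every region" are all hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Region { base: u32, size: u32 } // size must be > 0

impl Region {
    fn last(&self) -> u32 { self.base + (self.size - 1) }
    fn contains(&self, addr: u32) -> bool { addr >= self.base && addr <= self.last() }
}

/// High bits on which `a`, `b`, and every address between them agree.
fn common_prefix_mask(a: u32, b: u32) -> u32 {
    let diff = a ^ b;
    // checked_shr returns None for a shift of 32 (diff == 0): all bits agree.
    !u32::MAX.checked_shr(diff.leading_zeros()).unwrap_or(0)
}

/// Folds all regions into one (mask, value) pair such that every MMIO
/// address satisfies (addr & mask) == value. None when there are no regions.
fn fold_aperture(regions: &[Region]) -> Option<(u32, u32)> {
    let first = regions.first()?;
    let mut mask = common_prefix_mask(first.base, first.last());
    for r in &regions[1..] {
        mask &= common_prefix_mask(r.base, r.last());
        mask &= common_prefix_mask(first.base, r.base);
    }
    Some((mask, first.base & mask))
}

#[derive(Default)]
struct MmioTable { regions: Vec<Region>, aperture: Option<(u32, u32)> }

impl MmioTable {
    fn add_mmio_region(&mut self, region: Region) {
        self.regions.push(region);
        self.aperture = fold_aperture(&self.regions);
    }

    fn find_mmio(&self, addr: u32) -> Option<&Region> {
        let (mask, value) = self.aperture?;
        if (addr & mask) != value {
            return None; // fast reject: cannot be inside any region
        }
        // Passing the aperture test is necessary, not sufficient.
        self.regions.iter().find(|r| r.contains(addr))
    }
}
```

With two illustrative 64 KiB regions at 0x7FC8_0000 and 0x7FEA_0000, the folded pair is (0xFFC0_0000, 0x7FC0_0000). An ordinary guest address such as 0x8200_0000 is rejected by the single compare, while 0x7FD0_0000 passes the compare and is then rejected by contains(), which is why the slow check has to stay.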

Important caveat: thread-interleaving divergence at large -n

At -n 30M+, the --expect digest differs between block and per-instruction modes:

  • imports diverge by ~10% (block lower)
  • packets diverge by ~3.7× (block lower)

This is fundamental to any block-batching dispatcher in a multi-threaded scheduler. Per-instruction mode round-robins instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …); block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding. Different valid interleavings of the same multi-threaded program reach different relative-progress states at any given total instruction count. Both produce correct Sylpheed boots — VdSwap=1 and =2 fire, no deadlocks. Bit-identical comparison between modes is only meaningful at -n 2M (before workers spawn) and that remains the regression rail.
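
A toy model makes the effect concrete. The function below is not the emulator's scheduler: it is a hypothetical round-robin loop in which thread i retires burst[i] instructions per turn, with burst lengths chosen purely for illustration.

```rust
/// Per-thread retired-instruction counts after `budget` total instructions,
/// when thread `i` runs `burst[i]` instructions per turn before yielding.
fn progress(burst: &[u64], budget: u64) -> Vec<u64> {
    assert!(!burst.is_empty() && burst.iter().all(|&b| b > 0));
    let mut retired = vec![0u64; burst.len()];
    let mut remaining = budget;
    while remaining > 0 {
        for (i, &b) in burst.iter().enumerate() {
            let n = b.min(remaining);
            retired[i] += n;
            remaining -= n;
            if remaining == 0 {
                break;
            }
        }
    }
    retired
}
```

With a budget of 72 instructions, per-instruction round-robin (burst = [1, 1]) leaves both threads at 36. If instead thread 0 runs straight-line 32-instruction blocks while thread 1 runs branchy 4-instruction blocks (burst = [32, 4]), the same budget leaves them at 64 and 8. Both schedules are valid, but a digest taken at that total count sees different relative progress.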

Files touched in 2026-04-25 perf-track session

  • crates/xenia-cpu/Cargo.toml: [[bench]] with name = "interpreter", harness = false.
  • crates/xenia-cpu/benches/interpreter.rs: new (3 benches).
  • crates/xenia-cpu/src/lib.rs: pub mod block_cache;.
  • crates/xenia-cpu/src/block_cache.rs: new file.
  • crates/xenia-cpu/src/interpreter.rs: step_block, parity tests.
  • crates/xenia-cpu/src/opcode.rs: terminates_block + tests.
  • crates/xenia-memory/src/heap.rs: MMIO fast-reject + tests.
  • crates/xenia-app/src/main.rs: block-cache wiring, GPU pacer, XENIA_FORCE_PER_INSTR escape hatch.
  • crates/xenia-app/tests/golden/sylpheed_n2m.json: golden digest.

How to A/B test in future sessions

# block-cache mode (default)
./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json

# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...

# bench
cargo bench -p xenia-cpu
# or, to run only this bench target: cargo bench -p xenia-cpu --bench interpreter

What's next on the perf track if needed

If Sylpheed boot is still too slow after this lands:

  1. Profile with --profile out.svg to see where time goes now.
  2. Threaded-code dispatch is still on the table — but only with a bench showing >1.5× win on tight_alu_loop from a small-prototype spike branch.
  3. The MAX_BLOCK_INSTRS = 32 cap could be tuned. Lower (16, 8) reduces thread-interleaving divergence at the cost of dispatch wins.
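
To reason about tuning the cap, it helps to see the three stop conditions and the slot index side by side. The sketch below is a simplified model, not the code in block_cache.rs: block_len and slot_index are hypothetical names, the terminates closure stands in for decoding an instruction and calling PpcOpcode::terminates_block(), and including the terminating instruction in the block is an assumption.

```rust
const PAGE_SIZE: u64 = 4096; // 4 KiB guest page
const SLOT_COUNT: u32 = 1 << 16; // 64 K direct-mapped slots

/// Direct-mapped slot for a block: (start_pc >> 2) & 0xFFFF.
fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & (SLOT_COUNT - 1)) as usize
}

/// Number of instructions in the block starting at `start_pc`.
/// The walk stops after a terminating instruction, at the cap, or at
/// the end of the guest page. A block always holds at least one instruction.
fn block_len(start_pc: u32, max_block_instrs: u32, terminates: impl Fn(u32) -> bool) -> u32 {
    let page_end = (u64::from(start_pc) & !(PAGE_SIZE - 1)) + PAGE_SIZE;
    let mut pc = u64::from(start_pc);
    let mut len = 0;
    loop {
        len += 1;
        let next = pc + 4; // PowerPC instructions are 4 bytes wide
        if terminates(pc as u32) || len >= max_block_instrs || next >= page_end {
            return len;
        }
        pc = next;
    }
}
```

In this model a straight-line run starting at 0x8200_0000 yields 32 instructions under the current cap and 8 under a cap of 8, while a block starting 8 bytes before a page boundary is cut to 2 regardless of the cap. Start addresses 0x8200_0000 and 0x8204_0000 map to the same slot, which is the collision case a direct-mapped cache has to handle by comparing start_pc on lookup.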