chore: add migration/ bundle for cross-machine setup

Bundles state that lives OUTSIDE the xenia-rs repo so a fresh clone on
another machine can be brought up to identical configuration via
migration/setup.sh:

- claude-memory/              ~/.claude/projects/-home-fabi-RE-Project-Sylpheed/memory/
                              (103 files, 1.1 MB - MEMORY.md + every
                              project_xenia_rs_*.md from audits addis_signext
                              through audit-058)
- project-root/dot-claude/    <project-root>/.claude/settings.json
                              (Stop hook + permissions)
- project-root/ppc-manual/    <project-root>/ppc-manual/
                              (PowerPC reference docs, 397 files, 3.7 MB)
- project-root/run-canary.sh  <project-root>/run-canary.sh
- README.md                   Human-readable setup checklist
- setup.sh                    Idempotent installer (also reclones xenia-canary
                              at pinned HEAD 6de80dffe)
- MANIFEST.md                 Per-file mapping + per-file-not-bundled
                              restoration recipe

Excluded from bundle (not shippable via git):

- Sylpheed ISO (7.8 GB; copyright; manual copy required)
- sylpheed.db (395 MB; regenerable from XEX via analysis tooling)
- target/ build artifacts (rebuild on target)
- audit-runs probe firehoses (.log/.stdout/.stderr ~11 GB; rerun if needed)
- audit-runs memory dumps (.bin ~4.5 GB; rerun audit-026/027/029 if needed)
- xenia-canary checkout (setup.sh reclones from
  git.mc02.dev/fabi/Xenia-Canary.git at HEAD 6de80dffe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 21:38:38 +02:00
parent 8e709b0a24
commit e6d43a23ac
505 changed files with 86028 additions and 0 deletions
--- a/migration/claude-memory/project_xenia_rs_perf_tier4.md
+++ b/migration/claude-memory/project_xenia_rs_perf_tier4.md
@@ -0,0 +1,146 @@
+---
+name: xenia-rs Tier-4 perf landed (2026-04-25)
+description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
+type: project
+originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3
+---
+## What landed
+
+Three perf changes on top of the prior Tier-1–3 work:
+
+1. **MMIO fast-reject** — `xenia-memory/src/heap.rs` `find_mmio` now does a
+   single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare
+   before falling through to the linear `iter().find` over registered
+   regions. The aperture pair is recomputed in `add_mmio_region` via a
+   `fold_aperture` helper (greatest common bit-mask agreement). Passing
+   the fast path is only a *necessary* condition: `contains()` still
+   runs for matching candidates, so MMIO semantics are unchanged.
+
+2. **Basic-block cache** — new `xenia-cpu/src/block_cache.rs`. 64 K
+   direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding
+   a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. Block
+   walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap /
+   Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page
+   boundary (so single-`page_version` invalidation suffices). New
+   `xenia-cpu::interpreter::step_block` dispatches each instruction in
+   the block via the existing match-based `execute`.
+
+3. **Hot-loop wiring + GPU pacer** —
+   `xenia-app/src/main.rs::run_execution` now branches on
+   `debugger.wants_hooks() || db_writer.is_some() ||
+   force_per_instr`: when any of those is true it takes the
+   per-instruction path, otherwise it dispatches whole blocks. A new
+   env var `XENIA_FORCE_PER_INSTR=1` forces the slow path for A/B
+   testing. Post-round GPU dispatch was changed from
+   "1 `execute_one` per round" to
+   `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))`
+   so block mode (which executes ~6× more instructions per outer
+   round) doesn't starve the GPU.
+
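The MMIO fast-reject in item 1 can be sketched as follows. This is a minimal illustration, not the code in `heap.rs`: the `MmioRegion` and `MmioTable` types, their fields, and the prefix-based way `fold_aperture` derives the mask are assumptions made for the example. Only the `(addr & mask) != value` compare and the fall-through to a linear scan come from the note above.

```rust
/// One registered MMIO window (hypothetical layout, for illustration only).
/// Assumes `size >= 1` and that `base + size - 1` does not overflow u32.
#[derive(Clone, Copy, Debug)]
pub struct MmioRegion {
    pub base: u32,
    pub size: u32,
}

impl MmioRegion {
    fn last(&self) -> u32 {
        self.base + (self.size - 1)
    }

    fn contains(&self, addr: u32) -> bool {
        addr >= self.base && addr <= self.last()
    }
}

/// Returns `(mask, value)` such that every address inside any region
/// satisfies `addr & mask == value`. This sketch keeps only the leading
/// bits shared by the lowest and highest covered address (an assumed
/// reading of "greatest common bit-mask agreement").
pub fn fold_aperture(regions: &[MmioRegion]) -> (u32, u32) {
    let lo = match regions.iter().map(|r| r.base).min() {
        Some(lo) => lo,
        // No regions: never fast-reject; the scan below finds nothing anyway.
        None => return (0, 0),
    };
    let hi = regions.iter().map(|r| r.last()).max().unwrap_or(lo);
    let diff = lo ^ hi;
    let mask = if diff == 0 {
        u32::MAX
    } else {
        // Clear the highest differing bit and everything below it.
        !(u32::MAX >> diff.leading_zeros())
    };
    (mask, lo & mask)
}

#[derive(Default)]
pub struct MmioTable {
    regions: Vec<MmioRegion>,
    pub mask: u32,
    pub value: u32,
}

impl MmioTable {
    pub fn add_mmio_region(&mut self, base: u32, size: u32) {
        self.regions.push(MmioRegion { base, size });
        let (mask, value) = fold_aperture(&self.regions);
        self.mask = mask;
        self.value = value;
    }

    /// Index of the region containing `addr`, if any.
    pub fn find_mmio(&self, addr: u32) -> Option<usize> {
        if (addr & self.mask) != self.value {
            return None; // fast reject: cannot lie in any region
        }
        // The mask test is necessary, not sufficient, so still scan.
        self.regions.iter().position(|r| r.contains(addr))
    }
}
```

With two made-up 64 KiB windows at `0x7FC8_0000` and `0x7FEA_0000`, an ordinary guest address such as `0x8200_0000` is rejected by the single compare, while `0x7FD0_0000` passes the mask test and is then rejected by the scan, which is why `contains()` must still run.
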
+## Why u32-narrowing and threaded-code dispatch were skipped
+
+- **u32-narrowing**: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32
+  in their bodies. The remaining "obvious" target — addi/addis — runs
+  natively at u64 because Xenon GPRs are 64-bit. No measurable win
+  available without rewriting the ISA semantics.
+- **Threaded-code dispatch**: extracting ~200 match arms into per-opcode
+  free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a
+  poor risk/reward. The basic-block cache benefit doesn't depend on
+  threaded dispatch (each instruction inside a block still goes through
+  the existing match), so this was the right phase to skip.
+
+Both decisions match the plan's bench-gated rule: "Phase 4 must not be
+merged on principle alone — it merges only if numbers go up."
+
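The u32-narrowing point can be illustrated with a small sketch. The function names and signatures below are hypothetical and are not taken from the xenia-rs interpreter. They follow standard 64-bit PowerPC semantics, with `ra` assumed to be the already-resolved operand value (the `rA = 0` special case is ignored).

```rust
use std::cmp::Ordering;

/// addi: 64-bit add of the sign-extended 16-bit immediate.
pub fn addi(ra: u64, simm: i16) -> u64 {
    ra.wrapping_add(simm as i64 as u64)
}

/// addis: the immediate is shifted left by 16, then sign-extended to 64 bits.
pub fn addis(ra: u64, simm: i16) -> u64 {
    ra.wrapping_add(((simm as i64) << 16) as u64)
}

/// cmpwi (cmpi with L = 0): only the low 32 bits take part, compared as signed.
pub fn cmpwi(ra: u64, simm: i16) -> Ordering {
    (ra as u32 as i32).cmp(&(simm as i32))
}
```

Here `addi(0xFFFF_FFFF, 1)` carries into bit 32, so computing it at u32 would change the architectural result, whereas `cmpwi` already discards the upper half by definition. That is the sense in which the compare arms are "already narrow" and the add arms cannot be narrowed.
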
+## Numbers
+
+Baseline (pre-perf-track) → final (`xenia-rs check sylpheed.iso -n 2_000_000`):
+
+| metric                 | baseline  | final      | delta  |
+|------------------------|-----------|------------|--------|
+| wall-time              | 318 ms    | 136 ms     | 2.3×   |
+| `tight_alu_loop` bench | 96.9 MIPS | 114.8 MIPS | +18.5% |
+| `loadstore_loop` bench | 78.3 MIPS | 91.8 MIPS  | +17.2% |
+| `mmio_storm` bench     | 59.7 MIPS | 67.8 MIPS  | +13.6% |
+| workspace tests        | 352       | 370        | +18    |
+
+The bench is `cargo bench -p xenia-cpu`, run against the new
+`crates/xenia-cpu/benches/interpreter.rs` harness. It has no criterion
+dependency and uses a custom `harness = false` `main()`.
+
+## Verification
+
+- **Golden digest at -n 2M** (`crates/xenia-app/tests/golden/sylpheed_n2m.json`):
+  byte-identical between block and per-instruction modes.
+- **VdSwap fidelity**: frame=1 fires before -n 18M; frame=2 fires
+  between -n 18M and 22M. Prior memory said "~28 M cycles", but that
+  predates the GPU pacer. With current scheduling the actual figure
+  shifts by mode: block mode is faster in wall-time, and its
+  instruction-count behavior matches per-instruction mode only up to
+  the first thread divergence.
+- **Deadlock counters**: 0 halts / 0 recoveries on every Sylpheed run.
+- **All 370 workspace tests pass**, including new tests:
+  - `xenia-memory::heap`: 5 (mmio_fast_path_*, fold_aperture_*).
+  - `xenia-cpu::opcode`: 5 (terminates_block_*).
+  - `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid
+    terminator, invalidation, hit-returns-cached).
+  - `xenia-cpu::interpreter`: 2 parity tests
+    (block_dispatch_matches_per_instruction_alu_loop +
+     loadstore_loop) — bit-identical CPU state between paths on a
+    single-thread workload.
+
+## Important caveat: thread-interleaving divergence at large -n
+
+At -n 30M+, the `--expect` digest **differs** between block and
+per-instruction modes:
+
+- imports diverge by ~10% (block lower)
+- packets diverge by ~3.7× (block lower)
+
+This is **fundamental to any block-batching dispatcher** in a
+multi-threaded scheduler. Per-instruction mode round-robins
+instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …);
+block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding.
+Different valid interleavings of the same multi-threaded program
+reach different relative-progress states at any given total
+instruction count. Both produce correct Sylpheed boots: the frame=1
+and frame=2 VdSwaps fire with no deadlocks. Bit-identical comparison
+between modes is only meaningful at -n 2M (before workers spawn), and
+that remains the regression rail.
+
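The divergence described in the caveat section can be reproduced with a toy scheduler model. This is a sketch under strong assumptions: every thread is always runnable, each turn runs a full quantum, and the function name `progress` is made up for the example. Real blocks are usually shorter than `MAX_BLOCK_INSTRS`, so the real skew is smaller than in this worst case.

```rust
/// Instructions retired per hardware thread after `budget` total
/// instructions, when each thread gets up to `quantum` instructions per turn.
/// `quantum = 1` models per-instruction round-robin; a larger quantum models
/// block bursts.
pub fn progress(threads: usize, quantum: u64, budget: u64) -> Vec<u64> {
    assert!(quantum > 0, "quantum must be positive");
    let mut retired = vec![0u64; threads];
    let mut left = budget;
    let mut t = 0;
    while left > 0 && threads > 0 {
        let n = quantum.min(left);
        retired[t] += n;
        left -= n;
        t = (t + 1) % threads;
    }
    retired
}
```

For six threads and a budget of 100 instructions, a quantum of 1 gives `[17, 17, 17, 17, 16, 16]` while a quantum of 32 gives `[32, 32, 32, 4, 0, 0]`: the same total, but very different relative progress. With a single thread both quanta give `[100]`, which matches the note that bit-identical comparison only holds before workers spawn.
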
+## Files touched in 2026-04-25 perf-track session
+
+- `crates/xenia-cpu/Cargo.toml` — `[[bench]] name = "interpreter" harness = false`.
+- `crates/xenia-cpu/benches/interpreter.rs` — new (3 benches).
+- `crates/xenia-cpu/src/lib.rs` — `pub mod block_cache;`.
+- `crates/xenia-cpu/src/block_cache.rs` — new file.
+- `crates/xenia-cpu/src/interpreter.rs` — `step_block`, parity tests.
+- `crates/xenia-cpu/src/opcode.rs` — `terminates_block` + tests.
+- `crates/xenia-memory/src/heap.rs` — MMIO fast-reject + tests.
+- `crates/xenia-app/src/main.rs` — block-cache wiring, GPU pacer,
+  `XENIA_FORCE_PER_INSTR` escape hatch.
+- `crates/xenia-app/tests/golden/sylpheed_n2m.json` — golden digest.
+
+## How to A/B test in future sessions
+
+```bash
+# block-cache mode (default)
+./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json
+
+# force per-instruction (debugging)
+XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...
+
+# bench
+cargo bench -p xenia-cpu
+# or: cargo bench -p xenia-cpu --bench interpreter
+```
+
+## What's next on the perf track if needed
+
+If Sylpheed boot is still too slow after this lands:
+
+1. Profile with `--profile out.svg` to see where time goes now.
+2. Threaded-code dispatch is still on the table — but only with a
+   bench showing >1.5× win on `tight_alu_loop` from a small-prototype
+   spike branch.
+3. The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower caps (16, 8)
+   reduce thread-interleaving divergence at the cost of dispatch wins.
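
To make the effect of tuning the cap concrete, the block-walk rules from "What landed" can be sketched with the cap as a parameter. This is an illustration only, not the code in `block_cache.rs`: `block_len`, `slot_index`, and the `terminates` closure (a stand-in for decoding the instruction at `pc` and calling `terminates_block()`) are hypothetical names, and the sketch assumes the terminating instruction is counted as part of its block.

```rust
pub const PAGE_SIZE: u32 = 4096; // 4 KiB guest page
pub const INSTR_BYTES: u32 = 4; // fixed-width PowerPC instructions
pub const SLOT_COUNT: u32 = 1 << 16; // 64 K direct-mapped slots

/// Direct-mapped slot for a block starting at `start_pc`.
pub fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & (SLOT_COUNT - 1)) as usize
}

/// Number of instructions in the block starting at `start_pc`.
/// The walk stops after a terminating instruction, at `max_instrs`,
/// or at the end of the 4 KiB page, whichever comes first.
pub fn block_len(start_pc: u32, max_instrs: u32, terminates: impl Fn(u32) -> bool) -> u32 {
    // Use u64 so a page at the very top of the address space cannot overflow.
    let page_end = (u64::from(start_pc) | u64::from(PAGE_SIZE - 1)) + 1;
    let mut pc = u64::from(start_pc);
    let mut len = 0;
    while len < max_instrs && pc < page_end {
        len += 1;
        if terminates(pc as u32) {
            break;
        }
        pc += u64::from(INSTR_BYTES);
    }
    len
}
```

In this model a straight-line run yields 32 instructions with the current cap and 8 with a cap of 8, so a lower cap shortens the longest burst a thread can take before yielding. Note also that `0x8200_0000` and `0x8204_0000` map to the same slot, which is why each cached block needs its `start_pc` stored for a tag check.
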