chore: add migration/ bundle for cross-machine setup
Bundles state that lives OUTSIDE the xenia-rs repo so a fresh clone on
another machine can be brought up to identical configuration via
migration/setup.sh:
- claude-memory/ -> ~/.claude/projects/-home-fabi-RE-Project-Sylpheed/memory/
  (103 files, 1.1 MB: MEMORY.md + every project_xenia_rs_*.md from audits
  addis_signext through audit-058)
- project-root/dot-claude/ -> <project-root>/.claude/settings.json
  (Stop hook + permissions)
- project-root/ppc-manual/ -> <project-root>/ppc-manual/
  (PowerPC reference docs, 397 files, 3.7 MB)
- project-root/run-canary.sh -> <project-root>/run-canary.sh
- README.md: human-readable setup checklist
- setup.sh: idempotent installer (also reclones xenia-canary at pinned
  HEAD 6de80dffe)
- MANIFEST.md: per-file mapping + per-file-not-bundled restoration recipe
Excluded from bundle (not shippable via git):
- Sylpheed ISO (7.8 GB; copyright; manual copy required)
- sylpheed.db (395 MB; regenerable from XEX via analysis tooling)
- target/ build artifacts (rebuild on target)
- audit-runs probe firehoses (.log/.stdout/.stderr ~11 GB; rerun if needed)
- audit-runs memory dumps (.bin ~4.5 GB; rerun audit-026/027/029 if needed)
- xenia-canary checkout (setup.sh reclones from
git.mc02.dev/fabi/Xenia-Canary.git at HEAD 6de80dffe)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
146
migration/claude-memory/project_xenia_rs_perf_tier4.md
Normal file
@@ -0,0 +1,146 @@
---
name: xenia-rs Tier-4 perf landed (2026-04-25)
description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
type: project
originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3
---

## What landed

Three perf changes on top of the prior Tier-1–3 work:

1. **MMIO fast-reject**: `xenia-memory/src/heap.rs` `find_mmio` now does a
   single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare
   before falling through to the linear `iter().find` over registered
   regions. The aperture pair is recomputed in `add_mmio_region` via a
   `fold_aperture` helper (greatest common bit-mask agreement). The fast
   path is a *necessary* condition only: `contains()` still runs for
   matching candidates, so MMIO semantics are unchanged.

2. **Basic-block cache**: new `xenia-cpu/src/block_cache.rs`. 64 K
   direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding
   a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. The block
   walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap /
   Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page
   boundary (so single-`page_version` invalidation suffices). New
   `xenia-cpu::interpreter::step_block` dispatches each instruction in
   the block via the existing match-based `execute`.

3. **Hot-loop wiring + GPU pacer**:
   `xenia-app/src/main.rs::run_execution` now branches on
   `debugger.wants_hooks() || db_writer.is_some() || force_per_instr`.
   The per-instruction path runs whenever any of those is true; the
   block path runs otherwise. A new env var `XENIA_FORCE_PER_INSTR=1`
   forces the slow path for A/B testing. Post-round GPU dispatch was
   changed from "1 `execute_one` per round" to
   `gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))`
   so block mode (which executes ~6× more instructions per outer
   round) doesn't starve the GPU.
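
The MMIO fast-reject in item 1 can be sketched as follows. This is a minimal illustration under assumptions, not the code in `heap.rs`: the `MmioRegion` and `MmioTable` types, their field layout, the per-region `aperture` method, and the exact folding rule are hypothetical; only the names `find_mmio`, `add_mmio_region`, `fold_aperture`, `contains` and the mask/value compare come from the note above.

```rust
/// Hypothetical MMIO region covering guest addresses [base, base + size).
/// Assumes size is non-zero and the range does not wrap past u32::MAX.
#[derive(Clone, Copy, Debug)]
pub struct MmioRegion {
    pub base: u32,
    pub size: u32,
}

impl MmioRegion {
    pub fn contains(&self, addr: u32) -> bool {
        addr.wrapping_sub(self.base) < self.size
    }

    /// (mask, value) of the high bits shared by every address in the region.
    pub fn aperture(&self) -> (u32, u32) {
        let last = self.base + (self.size - 1);
        let diff = self.base ^ last;
        // Ones from bit 0 up to the highest bit where base and last differ.
        let low = u32::MAX.checked_shr(diff.leading_zeros()).unwrap_or(0);
        let mask = !low;
        (mask, self.base & mask)
    }
}

/// Keep only the bits on which both apertures are defined and agree.
pub fn fold_aperture(a: (u32, u32), b: (u32, u32)) -> (u32, u32) {
    let mask = a.0 & b.0 & !(a.1 ^ b.1);
    (mask, a.1 & mask)
}

/// Hypothetical table; an empty table has mask 0, so nothing is fast-rejected
/// and the (empty) linear scan returns None.
#[derive(Default)]
pub struct MmioTable {
    pub regions: Vec<MmioRegion>,
    pub aperture_mask: u32,
    pub aperture_value: u32,
}

impl MmioTable {
    pub fn add_mmio_region(&mut self, region: MmioRegion) {
        self.regions.push(region);
        let (mask, value) = self
            .regions
            .iter()
            .map(MmioRegion::aperture)
            .reduce(fold_aperture)
            .unwrap_or((0, 0));
        self.aperture_mask = mask;
        self.aperture_value = value;
    }

    pub fn find_mmio(&self, addr: u32) -> Option<&MmioRegion> {
        // Necessary condition only: a mismatch proves addr is in no region.
        if (addr & self.aperture_mask) != self.aperture_value {
            return None;
        }
        // A match proves nothing, so the exact check still runs.
        self.regions.iter().find(|r| r.contains(addr))
    }
}
```

With two illustrative 64 KiB regions at `0x7FC8_0000` and `0x7FEA_0000`, the folded mask is `0xFFDD_0000`. An address such as `0x7FCA_0000` passes the mask compare but is still rejected by `contains()`, which is why the fast path cannot replace the linear scan.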
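
The basic-block cache in item 2 might look roughly like this. It is a sketch under stated assumptions: raw `u32` instruction words stand in for decoded instructions, the `fetch` and `terminates` closures stand in for guest memory reads and `PpcOpcode::terminates_block()`, `end_pc` is assumed to point just past the last instruction, and `BlockCache`, `slot_index`, `build_block`, `lookup` and `insert` are hypothetical names. The slot key, the 32-instruction cap and the 4 KiB page stop are taken from the note.

```rust
pub const SLOT_COUNT: usize = 1 << 16; // 64 K direct-mapped slots
pub const MAX_BLOCK_INSTRS: usize = 32;
pub const PAGE_SIZE: u32 = 4096; // 4 KiB guest page

#[derive(Clone, Debug, PartialEq)]
pub struct DecodedBlock {
    pub start_pc: u32,
    pub end_pc: u32, // assumed: address just past the last instruction
    pub page_version: u32,
    pub instrs: Vec<u32>, // raw words stand in for decoded instructions
}

/// Slot key from the note: (start_pc >> 2) & 0xFFFF.
pub fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & 0xFFFF) as usize
}

/// Walk forward from a 4-byte-aligned start_pc until a terminator, the
/// instruction cap, or the end of the guest page, whichever comes first.
pub fn build_block(
    start_pc: u32,
    page_version: u32,
    fetch: impl Fn(u32) -> u32,
    terminates: impl Fn(u32) -> bool,
) -> DecodedBlock {
    let page_end = (start_pc & !(PAGE_SIZE - 1)).wrapping_add(PAGE_SIZE);
    let mut pc = start_pc;
    let mut instrs = Vec::new();
    while instrs.len() < MAX_BLOCK_INSTRS && pc != page_end {
        let word = fetch(pc);
        instrs.push(word);
        pc = pc.wrapping_add(4);
        if terminates(word) {
            break; // the terminator itself is the last instruction
        }
    }
    DecodedBlock { start_pc, end_pc: pc, page_version, instrs }
}

pub struct BlockCache {
    slots: Vec<Option<DecodedBlock>>,
}

impl BlockCache {
    pub fn new() -> Self {
        Self { slots: vec![None; SLOT_COUNT] }
    }

    /// Hit only if the slot holds this exact start_pc and the page has not
    /// been modified since the block was built.
    pub fn lookup(&self, pc: u32, current_page_version: u32) -> Option<&DecodedBlock> {
        self.slots[slot_index(pc)]
            .as_ref()
            .filter(|b| b.start_pc == pc && b.page_version == current_page_version)
    }

    /// Direct-mapped: a new block simply evicts whatever shared its slot.
    pub fn insert(&mut self, block: DecodedBlock) {
        let idx = slot_index(block.start_pc);
        self.slots[idx] = Some(block);
    }
}
```

Because a block never crosses a page boundary, comparing one `page_version` per lookup is enough to detect stale code; that is the design point the note calls single-`page_version` invalidation.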
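
The GPU pacer formula in item 3 is a clamp. The helper below is a hypothetical free function, not the code in `main.rs`; the thread count is passed as a parameter because the note does not state the value of `HW_THREAD_COUNT`.

```rust
/// gpu_runs = max(1, min(64, executed_this_round / hw_thread_count)).
/// Assumes hw_thread_count is non-zero (integer division would panic otherwise).
pub fn gpu_runs(executed_this_round: u64, hw_thread_count: u64) -> u64 {
    (executed_this_round / hw_thread_count).clamp(1, 64)
}
```

Assuming six hardware threads purely for illustration, a round that executed 6 instructions still gets one GPU step, a round of 192 gets 32, and anything from 384 upward is capped at 64.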
## Why u32-narrowing and threaded-code dispatch were skipped

- **u32-narrowing**: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32
  in their bodies. The remaining "obvious" target (addi/addis) runs
  natively at u64 because Xenon GPRs are 64-bit. No measurable win
  available without rewriting the ISA semantics.
- **Threaded-code dispatch**: extracting ~200 match arms into per-opcode
  free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a
  poor risk/reward. The basic-block cache benefit doesn't depend on
  threaded dispatch (each instruction inside a block still goes through
  the existing match), so this was the right phase to skip.

Both decisions match the plan's bench-gated rule: "Phase 4 must not be
merged on principle alone — it merges only if numbers go up."

## Numbers

Baseline (pre-perf-track) → final (`xenia-rs check sylpheed.iso -n 2_000_000`):

| metric                 | baseline  | final      | delta  |
|------------------------|-----------|------------|--------|
| wall-time              | 318 ms    | 136 ms     | 2.3×   |
| `tight_alu_loop` bench | 96.9 MIPS | 114.8 MIPS | +18.5% |
| `loadstore_loop` bench | 78.3 MIPS | 91.8 MIPS  | +17.2% |
| `mmio_storm` bench     | 59.7 MIPS | 67.8 MIPS  | +13.6% |
| workspace tests        | 352       | 370        | +18    |

The bench is `cargo bench -p xenia-cpu` against the new
`crates/xenia-cpu/benches/interpreter.rs` harness. There is no criterion
dependency; the bench target uses a custom `harness = false` `main()`.

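
A dependency-free bench of the kind described above can be built from `std::time::Instant` alone. The helpers below are a hypothetical sketch, not the contents of `benches/interpreter.rs`: `mips` and `run_bench` are invented names, and the workload closure is assumed to return the number of guest instructions it executed.

```rust
use std::time::{Duration, Instant};

/// Millions of instructions per second for a measured run.
pub fn mips(instructions: u64, elapsed: Duration) -> f64 {
    instructions as f64 / elapsed.as_secs_f64() / 1e6
}

/// Time one workload and print its rate. The closure returns how many guest
/// instructions it executed (an assumed convention for this sketch).
pub fn run_bench<F: FnMut() -> u64>(name: &str, mut workload: F) -> f64 {
    let start = Instant::now();
    let executed = workload();
    let rate = mips(executed, start.elapsed());
    println!("{name}: {rate:.1} MIPS");
    rate
}
```

With `harness = false`, the bench file supplies its own `main()`, which in this sketch would call `run_bench` once per workload such as `tight_alu_loop`.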
## Verification

- **Golden digest at -n 2M** (`crates/xenia-app/tests/golden/sylpheed_n2m.json`):
  byte-identical between block and per-instruction modes.
- **VdSwap fidelity**: frame=1 fires before -n 18M; frame=2 fires
  between -n 18M–22M. Prior memory said "~28 M cycles" but that
  predates the GPU pacer; the actual figure with current scheduling
  shifts by mode (block mode is faster in wall-time but shows identical
  instruction-count behavior up to the point of first thread
  divergence).
- **Deadlock counters**: 0 halts / 0 recoveries on every Sylpheed run.
- **All 370 workspace tests pass**, including new tests:
  - `xenia-memory::heap`: 5 (mmio_fast_path_*, fold_aperture_*).
  - `xenia-cpu::opcode`: 5 (terminates_block_*).
  - `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid
    terminator, invalidation, hit-returns-cached).
  - `xenia-cpu::interpreter`: 2 parity tests
    (block_dispatch_matches_per_instruction_alu_loop +
    loadstore_loop), checking bit-identical CPU state between paths on a
    single-thread workload.

## Important caveat: thread-interleaving divergence at large -n

At -n 30M+, the `--expect` digest **differs** between block and
per-instruction modes:

- imports diverge by ~10% (block lower)
- packets diverge by ~3.7× (block lower)

This is **fundamental to any block-batching dispatcher** in a
multi-threaded scheduler. Per-instruction mode round-robins
instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …);
block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding.
Different valid interleavings of the same multi-threaded program
reach different relative-progress states at any given total
instruction count. Both produce correct Sylpheed boots: VdSwap=1
and =2 fire, no deadlocks. Bit-identical comparison between modes
is only meaningful at -n 2M (before workers spawn) and that
remains the regression rail.

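
The effect can be reproduced with a toy model. The two functions below are hypothetical and deliberately simplified: threads are plain counters, every thread is always runnable, and each thread has one fixed block length, which is an assumption made only to keep the sketch small.

```rust
/// Per-instruction mode: one instruction per thread per turn, round-robin.
/// Assumes threads > 0.
pub fn per_instruction_progress(threads: usize, budget: u64) -> Vec<u64> {
    let mut progress = vec![0u64; threads];
    for i in 0..budget {
        progress[(i % threads as u64) as usize] += 1;
    }
    progress
}

/// Block mode: each thread runs one whole block per turn. block_lens[t] is
/// the (assumed fixed) block length of thread t; entries must be non-zero.
pub fn block_progress(block_lens: &[u64], budget: u64) -> Vec<u64> {
    let mut progress = vec![0; block_lens.len()];
    let mut remaining = budget;
    let mut t = 0;
    while remaining > 0 {
        let burst = block_lens[t].min(remaining);
        progress[t] += burst;
        remaining -= burst;
        t = (t + 1) % block_lens.len();
    }
    progress
}
```

For a budget of 100 instructions and two threads, round-robin leaves both threads at 50, while block lengths of 32 and 4 leave them at 92 and 8. The total is identical but the relative progress is not, so any digest that samples per-thread state at a fixed total count can legitimately differ.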
## Files touched in 2026-04-25 perf-track session

- `crates/xenia-cpu/Cargo.toml`: `[[bench]] name = "interpreter" harness = false`.
- `crates/xenia-cpu/benches/interpreter.rs`: new (3 benches).
- `crates/xenia-cpu/src/lib.rs`: `pub mod block_cache;`.
- `crates/xenia-cpu/src/block_cache.rs`: new file.
- `crates/xenia-cpu/src/interpreter.rs`: `step_block`, parity tests.
- `crates/xenia-cpu/src/opcode.rs`: `terminates_block` + tests.
- `crates/xenia-memory/src/heap.rs`: MMIO fast-reject + tests.
- `crates/xenia-app/src/main.rs`: block-cache wiring, GPU pacer,
  `XENIA_FORCE_PER_INSTR` escape hatch.
- `crates/xenia-app/tests/golden/sylpheed_n2m.json`: golden digest.

## How to A/B test in future sessions

```bash
# block-cache mode (default)
./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json

# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...

# bench
cargo bench -p xenia-cpu
# or, to run only this bench target (cargo run has no --bench flag):
cargo bench -p xenia-cpu --bench interpreter
```

## What's next on the perf track if needed

If Sylpheed boot is still too slow after this lands:

1. Profile with `--profile out.svg` to see where time goes now.
2. Threaded-code dispatch is still on the table, but only with a
   bench showing >1.5× win on `tight_alu_loop` from a small-prototype
   spike branch.
3. The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower values (16, 8)
   reduce thread-interleaving divergence at the cost of dispatch wins.