chore: add migration/ bundle for cross-machine setup

Bundles state that lives OUTSIDE the xenia-rs repo so a fresh clone on
another machine can be brought up to identical configuration via
migration/setup.sh:

  - claude-memory/             ~/.claude/projects/-home-fabi-RE-Project-Sylpheed/memory/
                               (103 files, 1.1 MB - MEMORY.md + every
                                project_xenia_rs_*.md from audits
                                addis_signext through audit-058)
  - project-root/dot-claude/   <project-root>/.claude/settings.json
                               (Stop hook + permissions)
  - project-root/ppc-manual/   <project-root>/ppc-manual/
                               (PowerPC reference docs, 397 files, 3.7 MB)
  - project-root/run-canary.sh <project-root>/run-canary.sh
  - README.md                  Human-readable setup checklist
  - setup.sh                   Idempotent installer (also reclones
                               xenia-canary at pinned HEAD 6de80dffe)
  - MANIFEST.md                Per-file mapping + per-file-not-bundled
                               restoration recipe

Excluded from bundle (not shippable via git):
  - Sylpheed ISO (7.8 GB; copyright; manual copy required)
  - sylpheed.db (395 MB; regenerable from XEX via analysis tooling)
  - target/ build artifacts (rebuild on target)
  - audit-runs probe firehoses (.log/.stdout/.stderr ~11 GB; rerun if needed)
  - audit-runs memory dumps (.bin ~4.5 GB; rerun audit-026/027/029 if needed)
  - xenia-canary checkout (setup.sh reclones from
    git.mc02.dev/fabi/Xenia-Canary.git at HEAD 6de80dffe)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MechaCat02
2026-05-10 21:38:38 +02:00
parent 8e709b0a24
commit e6d43a23ac
505 changed files with 86028 additions and 0 deletions

@@ -0,0 +1,146 @@
---
name: xenia-rs Tier-4 perf landed (2026-04-25)
description: MMIO fast-reject + basic-block cache + GPU pacer; Sylpheed boot 318→136ms (2.3×); 370 tests pass; thread-interleaving divergence at large -n is expected
type: project
originSessionId: b082ddb2-530b-45e9-a454-5dfa856fecf3
---
## What landed
Three perf changes on top of the prior Tier-1–3 work:
1. **MMIO fast-reject**: `xenia-memory/src/heap.rs` `find_mmio` now does a
single `(addr & mmio_aperture_mask) != mmio_aperture_value` compare
before falling through to the linear `iter().find` over registered
regions. Aperture pair recomputed in `add_mmio_region` via a
`fold_aperture` helper (greatest common bit-mask agreement). Fast
path is a *necessary* condition only — `contains()` still runs for
matching candidates, so MMIO semantics are unchanged.
2. **Basic-block cache** — new `xenia-cpu/src/block_cache.rs`. 64 K
direct-mapped slots keyed by `(start_pc >> 2) & 0xFFFF`, each holding
a `DecodedBlock { start_pc, end_pc, page_version, instrs }`. Block
walk stops on `PpcOpcode::terminates_block()` (branch / sc / trap /
Invalid), at `MAX_BLOCK_INSTRS = 32`, or at a 4 KiB guest-page
boundary (so single-`page_version` invalidation suffices). New
`xenia-cpu::interpreter::step_block` dispatches each instruction in
the block via the existing match-based `execute`.
3. **Hot-loop wiring + GPU pacer**:
`xenia-app/src/main.rs::run_execution` now branches on
`debugger.wants_hooks() || db_writer.is_some() ||
force_per_instr`. The per-instruction path runs only when one of
those is true; otherwise block dispatch is used. A new env var
`XENIA_FORCE_PER_INSTR=1` forces the
slow path for A/B testing. Post-round GPU dispatch was changed from
"1 `execute_one` per round" to
`gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))`
so block mode (which executes ~6× more instructions per outer
round) doesn't starve the GPU.
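A minimal sketch of the fast-reject idea from item 1 and the pacer formula from item 3, not the actual `heap.rs` or `main.rs` code. The `MmioRegion` type, the function signatures, the example addresses, and `HW_THREAD_COUNT = 6` are assumptions made for illustration; only the compare `(addr & mask) != value` and the `max(1, min(64, …))` formula come from the notes above.

```rust
/// Hypothetical MMIO region covering [base, base + size).
/// Assumes size > 0 and no wrap past 0xFFFF_FFFF.
#[derive(Clone, Copy)]
struct MmioRegion {
    base: u32,
    size: u32,
}

impl MmioRegion {
    fn contains(&self, addr: u32) -> bool {
        addr.wrapping_sub(self.base) < self.size
    }
}

/// Fold all regions into one (mask, value) pair. `mask` keeps only the
/// address bits that are constant across every address of every region;
/// `value` is what those bits must equal.
fn fold_aperture(regions: &[MmioRegion]) -> (u32, u32) {
    let mut acc: Option<(u32, u32)> = None;
    for r in regions {
        let last = r.base + (r.size - 1);
        // Bits above the highest bit that differs between the first and
        // last address are constant inside this region.
        let diff = r.base ^ last;
        let region_mask = u32::MAX
            .checked_shl(32 - diff.leading_zeros())
            .unwrap_or(0);
        acc = Some(match acc {
            None => (region_mask, r.base & region_mask),
            Some((mask, value)) => {
                // Keep bits that are fixed in both and equal in both.
                let m = mask & region_mask & !(value ^ r.base);
                (m, value & m)
            }
        });
    }
    // No regions registered: (addr & 0) is never 1, so every address is rejected.
    acc.unwrap_or((0, 1))
}

fn find_mmio(regions: &[MmioRegion], aperture: (u32, u32), addr: u32) -> Option<usize> {
    let (mask, value) = aperture;
    if (addr & mask) != value {
        return None; // fast reject: cannot lie inside any region
    }
    // The aperture test is a necessary condition only, so the exact check still runs.
    regions.iter().position(|r| r.contains(addr))
}

/// Assumed value; the notes only name the constant.
const HW_THREAD_COUNT: u64 = 6;

/// gpu_runs = max(1, min(64, executed_this_round / HW_THREAD_COUNT))
fn gpu_runs(executed_this_round: u64) -> u64 {
    (executed_this_round / HW_THREAD_COUNT).clamp(1, 64)
}
```

In this sketch an address such as `0x7FE8_0000` can pass the aperture compare for regions at `0x7FC8_0000` and `0x7FEA_0000` and still be rejected by `contains()`, which is why the fast path cannot change MMIO semantics.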
## Why u32-narrowing and threaded-code dispatch were skipped
- **u32-narrowing**: cmpi/cmpli/cmp/rlwinm arms already cast to u32/i32
in their bodies. The remaining "obvious" target — addi/addis — runs
natively at u64 because Xenon GPRs are 64-bit. No measurable win
available without rewriting the ISA semantics.
- **Threaded-code dispatch**: extracting ~200 match arms into per-opcode
free functions for an uncertain LLVM-jump-table-vs-fn-ptr win was a
poor risk/reward. The basic-block cache benefit doesn't depend on
threaded dispatch (each instruction inside a block still goes through
the existing match), so this was the right phase to skip.
Both decisions match the plan's bench-gated rule: "Phase 4 must not be
merged on principle alone — it merges only if numbers go up."
## Numbers
Baseline (pre-perf-track) → final (`xenia-rs check sylpheed.iso -n 2_000_000`):
| metric | baseline | final | delta |
|------------------------|-----------|--------|--------|
| wall-time | 318 ms | 136 ms | 2.3× |
| `tight_alu_loop` bench | 96.9 MIPS | 114.8 | +18.5% |
| `loadstore_loop` bench | 78.3 MIPS | 91.8 | +17.2% |
| `mmio_storm` bench | 59.7 MIPS | 67.8 | +13.6% |
| workspace tests | 352 | 370 | +18 |
Bench is `cargo bench -p xenia-cpu` against the new
`crates/xenia-cpu/benches/interpreter.rs` harness. No criterion dep —
custom `harness = false` `main()`.
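For orientation, a criterion-free bench helper can be as small as the sketch below. This is not the contents of `benches/interpreter.rs`; `mips` and `run_bench` are hypothetical names, and the workload closure is assumed to return the number of guest instructions it executed.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Millions of instructions per second for a measured run.
fn mips(instructions: u64, elapsed: Duration) -> f64 {
    instructions as f64 / elapsed.as_secs_f64() / 1.0e6
}

/// Time one workload and print its rate. The closure returns how many
/// instructions it executed (an assumption of this sketch).
fn run_bench<F: FnMut() -> u64>(name: &str, mut workload: F) -> f64 {
    let start = Instant::now();
    // black_box keeps the optimizer from discarding the workload's result.
    let executed = black_box(workload());
    let rate = mips(executed, start.elapsed());
    println!("{name}: {rate:.1} MIPS");
    rate
}
```

With `harness = false`, a `main()` in the bench file would call `run_bench` once per workload.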
## Verification
- **Golden digest at -n 2M** (`crates/xenia-app/tests/golden/sylpheed_n2m.json`):
byte-identical between block and per-instruction modes.
- **VdSwap fidelity**: frame=1 fires before -n 18M; frame=2 fires
between -n 18M and 22M. Prior memory said "~28 M cycles", but that
figure predates the GPU pacer. With current scheduling the exact
point shifts by mode: block mode is faster in wall-time, and its
instruction-count behavior matches per-instruction mode only up to
the point of first thread divergence.
- **Deadlock counters**: 0 halts / 0 recoveries on every Sylpheed run.
- **All 370 workspace tests pass**, including new tests:
- `xenia-memory::heap`: 5 (mmio_fast_path_*, fold_aperture_*).
- `xenia-cpu::opcode`: 5 (terminates_block_*).
- `xenia-cpu::block_cache`: 6 (build, page boundary, max-len, invalid
terminator, invalidation, hit-returns-cached).
- `xenia-cpu::interpreter`: 2 parity tests
(block_dispatch_matches_per_instruction_alu_loop +
loadstore_loop) — bit-identical CPU state between paths on a
single-thread workload.
## Important caveat: thread-interleaving divergence at large -n
At -n 30M+, the `--expect` digest **differs** between block and
per-instruction modes:
- imports diverge by ~10% (block lower)
- packets diverge by ~3.7× (block lower)
This is **fundamental to any block-batching dispatcher** in a
multi-threaded scheduler. Per-instruction mode round-robins
instructions across HW threads (HW0 ← 1 instr, HW1 ← 1 instr, …);
block mode lets HW0 burst up to MAX_BLOCK_INSTRS before yielding.
Different valid interleavings of the same multi-threaded program
reach different relative-progress states at any given total
instruction count. Both produce correct Sylpheed boots — VdSwap=1
and =2 fire, no deadlocks. Bit-identical comparison between modes
is only meaningful at -n 2M (before workers spawn) and that
remains the regression rail.
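The effect can be reproduced with a toy model, sketched here under simplifying assumptions: every burst is exactly `burst` instructions long (real blocks end earlier at branches), threads never block, and `per_thread_progress` is a hypothetical helper, not scheduler code.

```rust
/// Distribute `budget` instructions round-robin across `hw_threads`,
/// giving each thread up to `burst` instructions per turn, and return
/// how far each thread got.
fn per_thread_progress(budget: u64, hw_threads: usize, burst: u64) -> Vec<u64> {
    assert!(hw_threads > 0 && burst > 0);
    let mut done = vec![0u64; hw_threads];
    let mut remaining = budget;
    let mut t = 0;
    while remaining > 0 {
        let n = burst.min(remaining);
        done[t] += n;
        remaining -= n;
        t = (t + 1) % hw_threads;
    }
    done
}
```

With a budget of 100 and 3 threads, `burst = 1` yields `[34, 33, 33]` while `burst = 32` yields `[36, 32, 32]`: the same total instruction count, but different relative progress per thread, which is enough to change which imports and packets have been issued by the time `-n` is reached.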
## Files touched in 2026-04-25 perf-track session
- `crates/xenia-cpu/Cargo.toml`: `[[bench]] name = "interpreter" harness = false`.
- `crates/xenia-cpu/benches/interpreter.rs`: new (3 benches).
- `crates/xenia-cpu/src/lib.rs`: `pub mod block_cache;`.
- `crates/xenia-cpu/src/block_cache.rs`: new file.
- `crates/xenia-cpu/src/interpreter.rs`: `step_block`, parity tests.
- `crates/xenia-cpu/src/opcode.rs`: `terminates_block` + tests.
- `crates/xenia-memory/src/heap.rs`: MMIO fast-reject + tests.
- `crates/xenia-app/src/main.rs`: block-cache wiring, GPU pacer,
`XENIA_FORCE_PER_INSTR` escape hatch.
- `crates/xenia-app/tests/golden/sylpheed_n2m.json`: golden digest.
## How to A/B test in future sessions
```bash
# block-cache mode (default)
./target/release/xenia-rs check <iso> -n 2_000_000 --expect crates/xenia-app/tests/golden/sylpheed_n2m.json
# force per-instruction (debugging)
XENIA_FORCE_PER_INSTR=1 ./target/release/xenia-rs check <iso> -n 2_000_000 --expect ...
# bench
cargo bench -p xenia-cpu
# or: cargo bench -p xenia-cpu --bench interpreter
```
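How the escape hatch feeds the dispatch decision can be sketched as follows. The function names are hypothetical, and treating only the literal value `1` as "on" is an assumption; the notes only show `XENIA_FORCE_PER_INSTR=1`.

```rust
/// Assumption: only the literal value "1" enables the override.
fn force_per_instr(raw: Option<&str>) -> bool {
    raw == Some("1")
}

/// Reads the real process environment.
fn force_per_instr_from_env() -> bool {
    force_per_instr(std::env::var("XENIA_FORCE_PER_INSTR").ok().as_deref())
}

/// Mirrors the branch condition described for `run_execution`:
/// any one of the three flags selects the per-instruction path.
fn use_per_instr_path(wants_hooks: bool, has_db_writer: bool, force: bool) -> bool {
    wants_hooks || has_db_writer || force
}
```

Keeping the parsing in a pure function makes the override testable without mutating the process environment.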
## What's next on the perf track if needed
If Sylpheed boot is still too slow after this lands:
1. Profile with `--profile out.svg` to see where time goes now.
2. Threaded-code dispatch is still on the table — but only with a
bench showing >1.5× win on `tight_alu_loop` from a small-prototype
spike branch.
3. The `MAX_BLOCK_INSTRS = 32` cap could be tuned. Lower (16, 8)
reduces thread-interleaving divergence at the cost of dispatch wins.
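For a tuning spike on item 3, the block-walk stop conditions can be modelled with the cap as a parameter. This is a sketch, not `block_cache.rs`: `is_terminator` is a hypothetical stand-in for decoding plus `PpcOpcode::terminates_block()`, and counting the terminating instruction as part of the block is an assumption.

```rust
const PAGE_SIZE: u64 = 4096; // 4 KiB guest page
const INSTR_BYTES: u64 = 4; // fixed-width PowerPC instructions

/// Direct-mapped slot for a block, as described: (start_pc >> 2) & 0xFFFF.
fn slot_index(start_pc: u32) -> usize {
    ((start_pc >> 2) & 0xFFFF) as usize
}

/// Number of instructions in the block starting at `start_pc`.
/// Stops at a terminator (included in the block), at `max_instrs`,
/// or at the end of the 4 KiB page containing `start_pc`.
fn block_len(start_pc: u32, max_instrs: u32, is_terminator: impl Fn(u32) -> bool) -> u32 {
    // Work in u64 so the page-end computation cannot overflow.
    let mut pc = u64::from(start_pc);
    let page_end = (pc | (PAGE_SIZE - 1)) + 1;
    let mut len = 0;
    while len < max_instrs && pc < page_end {
        len += 1;
        if is_terminator(pc as u32) {
            break;
        }
        pc += INSTR_BYTES;
    }
    len
}
```

Because the slot key uses 16 bits of `start_pc >> 2`, two blocks whose start addresses differ by a multiple of 256 KiB land in the same slot in this model, so a smaller cap changes block length but not slot aliasing.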