xenia-rs

Author	SHA1	Message	Date
MechaCat02	acb29db444	[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash) Replace the one-basic-block-per-slot-per-round lockstep dispatch with a SUPERBLOCK runner: each slot-visit chains straight-line blocks through their terminating branches up to a deterministic instruction budget, amortizing the per-round (timebase/coord/round_schedule) and per-slot (worker_prologue) dispatch tax over ~128 instructions instead of ~6. Yield-points (end the chain, return to the round) are pure functions of guest state, preserving the lockstep cross-thread interleaving correctness: - non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted); db16cyc Yield is the spin-wait producer hand-off. - sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build). - MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled per block, keeps GPU/register ordering at one-block granularity. - next PC leaves ordinary guest code (import thunk / halt sentinel / unmapped) -> hand to the full worker_prologue next round. - instruction budget reached. Instruction-count/clock accounting stays exact: per-block cycle_count deltas are summed and handed to worker_epilogue once (instruction_count + decrement_quantum advance by the precise retired count). XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule byte-for-byte. Budget tuned to 128 (env-overridable): boot progression stays healthy up to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves); 128 is 3x below the cliff. Also scale the inline-GPU per-round fairness cap with the budget (flat 64 throttled GPU command processing 17x under superblocks and collapsed the present loop). PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 -> 41.4 MIPS (1.59x). 
Callgrind n=5M: host instructions 2.178B -> 1.507B (-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit / round_schedule_into / coord_post_round / update_timestamp_bundle each ~-90%; interpreter execute byte-identical (real work unchanged). GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172), 1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact. C2 determinism: n50m stable digest byte-identical across fresh runs; golden re-baselined intentionally (pacing-only deltas: imports 333453->243387, draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/present cadence track baseline (3AJ fade-in pacing preserved). C4: 690 tests green (+2 sync_sensitive). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 22:31:54 +02:00
MechaCat02	dc1320cd4b	[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS) Profile-driven low-risk optimizations attacking the ~48% per-block / per-round host-bookkeeping tax found by the callgrind profile. Measured on the bounded headless workload `check -n 100000000 --gpu-inline`: baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%. Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0): 1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch behind one has_mem_watch() predicted branch in write_u8/16/32/64 + write_bulk so the common (no-watch) store does no out-of-line call. check_mem_watch (4.8%) gone from the profile. 2. round-schedule alloc churn: add Scheduler::round_schedule_into filling a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round loop no longer __rust_alloc/__rust_dealloc a Vec<u8> per round. Identical ordering/RNG-advance. __rust_alloc/dealloc gone from the profile. 3. probe-firing: hoist a single KernelState::any_probe_active() guard to worker_prologue so the four fire_*_if_match calls don't happen at all when no probe is configured (was 4x call overhead/visit). All four gone from the profile. 4. thunk-map hash: range-reject pc against the registered import-thunk address band (KernelState::pc_in_thunk_band, two int compares) before the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone. Tier B (#5, time-granularity change — LANDED, no re-baseline needed): 5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write the KeTimeStampBundle when the deterministic clock advanced >= 2500 units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the 1 ms granularity any guest deadline math needs (tick_count stays fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per 3AH, not this bundle). 
VERIFIED: n50m stable digest BYTE-IDENTICAL to the existing golden (so no re-baseline), 150M boot reaches the splash (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all match the post-3AJ baseline), 688 tests green, release n50m oracle ok. Remaining headroom: interpreter::execute (13%), decrement_quantum (8%), step_block (7%) are now the top self-costs — the structural superblock/JIT lever is the next step for the larger gain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 22:05:53 +02:00
MechaCat02	978a6950d1	feat(memory): --mem-watch=ADDR per-store writer trace Adds an opt-in diagnostic that emits one tracing line per guest store overlapping any armed byte address, naming the writer (tid, pc, lr) plus old/new u32 lanes. Mirrors the --pc-probe / --branch-probe shape; pc/lr are stamped from worker_prologue via a thread-local Cell, so default runs (empty watch set) take a single is_empty() check on each write. Lockstep digest preserved (instructions=100000003 across reruns, sylpheed_n50m.json golden byte-identical). Diagnostic infra only; no functional change. Used to identify producers of dispatch-state writes for the audit-017 / audit-019 hunt.	2026-05-06 21:00:20 +02:00
MechaCat02	780e854c2f	fix(memory): XMODBUG-002 — write_bulk bumps page_versions for touched pages `GuestMemory::write_bulk` did the bulk copy via raw `copy_nonoverlapping` without bumping page_versions for any of the pages it touched. The per-byte `write_u8/u16/u32` methods all bump page_versions after their store; downstream caches (texture cache, shader cache) Acquire-load the slot to invalidate stale entries on guest writes. Without the bulk bump, a caller like `NtReadFile` writing a texture/shader resource into guest memory would leave any cache that had already keyed on the prior version handing back stale decoded bytes. After the copy, walk every page the write touched and bump it. Cheap: the typical bulk write spans a few pages (NtReadFile uses 64-128 KB chunks → 16-32 pages). Reservation-table invalidation for `lwarx`/`stwcx.` (XMODBUG-001's sibling) is NOT addressed here — the reservation table lives on KernelState, not GuestMemory, and plumbing it through requires a wider change. Callers that bulk-write code-bearing or atomic-bearing memory should call `kernel.reservations.invalidate_for_write(addr)` themselves; XEX-loader and NtReadFile are doing data-bearing writes that don't intersect lwarx targets, so this is acceptable for now. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 texture_cache_entries: 0 → 0 (Sylpheed hasn't issued IM_LOAD yet — the bump is silent until a cache keys on a touched page, which won't happen until Phase F2/F3 unblocks the resource-loader workers) packets: ~59M (within noise) Tests: 16 memory pass. Closes XMODBUG-002 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:30:22 +02:00
MechaCat02	e9b2b57a44	xenia-memory: interior-mutable writes, page versioning, fenced ops Re-shape MemoryAccess so write methods take &self and rely on interior mutability (atomics in GuestMemory, Cell in test mocks). This unblocks the &Arc<KernelState>-only execution model the CPU/HLE crates moved to. GuestMemory grows: per-4 KiB-page write-version counter (page_version) that the CPU's decode cache and the texture cache observe via Acquire, fenced 32-bit/64-bit read/write helpers (Release on writer / Acquire on reader) that PM4_EVENT_WRITE_SHD and the matching CPU consumers use to synchronize fence publication, and broader page-table / heap accounting needed by the new HLE allocators. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:27:13 +02:00
MechaCat02	c694bb3f43	Initial commit: xenia-rs workspace for Xbox 360 RE Rust reimplementation of the xenia Xbox 360 emulator targeting reverse-engineering and preservation, initially scoped to Project Sylpheed. Includes: - XEX2 loader (LZX decompression, AES decryption, PE parsing) - XISO / XGD2 disc image VFS - PPC interpreter with 200+ opcodes and VMX128 decoding - Static analyzer: functions, cross-references, labels, asm + SQLite output - HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init - Debugger with in-memory and SQLite-backed execution tracing - `xenia-rs` CLI with extract/dis/exec commands that produce cumulative, superset SQLite databases and opt-in instruction/import/branch traces Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-16 23:14:56 +02:00
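
The page-version bump described in 780e854c2f (XMODBUG-002) can be illustrated with a minimal sketch. This is not the project's code: `PageVersions`, `bump_range`, and `version` are hypothetical names, the table is a plain `Vec<AtomicU32>`, and the `Release` ordering on the writer side is an assumption (the commit only states that caches Acquire-load the slot). It shows the one step the fix adds: after a bulk copy, walk every 4 KiB page the write touched and bump its counter.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// 4 KiB guest pages, as described in the e9b2b57a44 commit message.
const PAGE_SHIFT: u32 = 12;

/// Hypothetical per-page write-version table (illustrative only).
struct PageVersions {
    slots: Vec<AtomicU32>,
}

impl PageVersions {
    fn new(pages: usize) -> Self {
        Self {
            slots: (0..pages).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Bump the version of every page overlapped by [addr, addr + len).
    /// Intended to be called once, after the bulk copy has completed.
    fn bump_range(&self, addr: u32, len: u32) {
        if len == 0 {
            return;
        }
        let first = (addr >> PAGE_SHIFT) as usize;
        // Widen to u64 so addr + len cannot overflow near the top of memory.
        let last = ((addr as u64 + len as u64 - 1) >> PAGE_SHIFT) as usize;
        for page in first..=last {
            // Pages outside the table are ignored in this sketch.
            if let Some(slot) = self.slots.get(page) {
                // Assumption: Release here pairs with the caches' Acquire load.
                slot.fetch_add(1, Ordering::Release);
            }
        }
    }

    /// What a cache would read to decide whether its entry is stale.
    fn version(&self, addr: u32) -> u32 {
        self.slots[(addr >> PAGE_SHIFT) as usize].load(Ordering::Acquire)
    }
}
```

An aligned 64 KB write bumps 16 pages, matching the estimate in the commit message; an unaligned write of the same length touches one extra page, which is why the range is computed from the first and last byte rather than from `len / 4096`.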

6 Commits
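
The superblock dispatch described in acb29db444 can be sketched as a toy model. Everything here is an assumption made for illustration: `Block`, `ChainEnd`, and `run_superblock` are hypothetical names, blocks are addressed by index rather than guest PC, and the yield conditions are precomputed flags instead of being derived from real guest state. The order in which the conditions are checked is also a guess. The sketch only shows the chaining idea: keep running straight-line blocks in one slot-visit until a yield-point or the instruction budget ends the chain, and report the exact retired count once.

```rust
/// Why a chain ended (illustrative mirror of the yield-points in the commit message).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ChainEnd {
    StepResult,    // block ended with something other than Continue
    SyncSensitive, // block contains lwarx/stwcx./sync-style instructions
    MmioTouched,   // the MMIO access watermark moved during the block
    LeftGuestCode, // next PC is a thunk, halt sentinel, or unmapped
    BudgetReached, // instruction budget for this slot-visit is used up
}

/// Toy stand-in for a decoded basic block.
#[derive(Clone, Copy)]
struct Block {
    instr_count: u32,
    sync_sensitive: bool,
    touches_mmio: bool,
    continues: bool,
    /// Index of the successor block; None means the PC leaves ordinary guest code.
    next: Option<usize>,
}

/// Chain blocks starting at `start` and return (retired instructions, reason).
/// The retired count is summed per block and reported once at the end.
fn run_superblock(blocks: &[Block], start: usize, budget: u32) -> (u32, ChainEnd) {
    let mut pc = start;
    let mut retired = 0u32;
    loop {
        let b = blocks[pc];
        retired += b.instr_count;
        if !b.continues {
            return (retired, ChainEnd::StepResult);
        }
        if b.sync_sensitive {
            return (retired, ChainEnd::SyncSensitive);
        }
        if b.touches_mmio {
            return (retired, ChainEnd::MmioTouched);
        }
        match b.next {
            Some(n) if n < blocks.len() => pc = n,
            _ => return (retired, ChainEnd::LeftGuestCode),
        }
        if retired >= budget {
            return (retired, ChainEnd::BudgetReached);
        }
    }
}
```

In this model a budget of 1 ends every chain after a single block, which is the analogue of the `XENIA_SUPERBLOCK_BUDGET=1` setting the commit describes as reproducing the old one-block schedule. Because every end condition depends only on the blocks and the budget, two runs over the same input end their chains at the same places.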