xenia-rs

Author	SHA1	Message	Date
MechaCat02	2bdb93e51e	[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 \| (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 13:39:57 +02:00
MechaCat02	7a1b6b3306	fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel The Phase-C VdSwap PM4 ring path (commit `82f3d61`) emits two "PM4_XE_SWAP not consumed by drain" warnings when running: exec sylpheed.iso --ui --quiet --halt-on-deadlock \ --parallel --reservations-table Lockstep -n 100M never trips it. Two distinct race windows: (a) Inline backend (--ui forces it): drain(mem, 4096) hit its fixed packet cap before reaching the PM4_XE_SWAP we'd just injected at the WPTR tail. With 6 CPU threads, the ring accumulates >4096 packets between vd_swap callbacks. (b) Threaded backend (--parallel without --ui): the worker's DrainFence handler has a 900 ms deadline and game-batched IBs (8-10 M packets observed) keep it from reaching the tail in any reasonable budget. If the worker eventually drained past the injected packet later, the safety-net direct notify would double-count. Three changes: * gpu_system.rs: new `drain_until_wptr(target, time_budget)` draining by the canary `WorkerThreadMain` predicate (read_offset != target) instead of a fixed packet count. 900 ms deadline mirrors the threaded DrainFence handler. * handle.rs: inline `drain_to_current_wptr` switches to `drain_until_wptr`. DrainFence handler publishes the digest mirror BEFORE replying so the CPU's post-drain `digest_snapshot` sees fresh stats. * exports.rs (vd_swap): skip the PM4 ring injection unconditionally and route swap notification through `notify_xe_swap` directly. Tail-injection is unreliable under --parallel for both backends. The slot-0 fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001); draws=0 today so a stale slot 0 has no observable effect. Verification: * cargo test --workspace --release: 556 passing (unchanged). * Lockstep -n 100M --stable-digest: bit-identical to pre-fix master HEAD `aa3f1d3`. {instructions:100000002, imports:987685, unimpl:0, draws:0, swaps:2, ...} * check --parallel --reservations-table -n 30M: 0 warnings (was 2). swaps=2. * exec --gpu-inline --parallel --reservations-table -n 30M: 0 warnings (was 2 with drained=8M-10M observed). swaps=2. Audit IDs: GPUBUG-DRAIN-001 (closed), GPUBUG-FETCH-PATCH-001 (filed, deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:12:15 +02:00
MechaCat02	8fc1b1dfed	fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO write callback) uses `Ordering::Release` so any ring-memory writes the guest performed before bumping WPTR are visible to a paired `Acquire`-load on the consumer. The consumer here at `sync_with_mmio` was using `Ordering::Relaxed` for both the WPTR load and the RPTR mirror store — leaving the Release/Acquire pairing broken. Under `--parallel`, this broken pairing means the GPU worker can observe a fresh WPTR value while still reading stale ring-memory contents at the corresponding offsets — garbage PM4 packets. The audit's M11 grid run confirmed --parallel is non-deterministic beyond the documented `packets` ±5% noise; this fix is one strand of that. Symmetric fix on the RPTR mirror store: Release pairs with any guest-side Acquire-load of CP_RB_RPTR for ring-writeback bookkeeping. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (unchanged) packets: ~60M (within noise) Tests: 149 (no count change; this is a memory-ordering correctness fix, not a behavioral change visible at the digest level in lockstep). Closes GPUBUG-006 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:26:09 +02:00
MechaCat02	79eb52c378	xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve) First real GPU implementation. Ring/PM4 frontend (ring_view, ring_drain, pm4) drains the command processor; gpu_system owns the threaded backend (DrainFence RPC + parker/fence helpers from M1) and the MMIO-mapped register block (mmio_region). Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode the Xbox 360 microcode, translator.rs lowers it onto the WGSL xenos_interp interpreter shader (shaders/xenos_interp.wgsl). shader_metrics.rs counts decode/translate work. Render state: draw_state, primitive, render_target_cache, texture_cache, tiled_address (Xenos's swizzled tiled-memory layout), xenos_constants (register field constants), edram (the 10 MiB EDRAM model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs owns the typed GPU-resource handles the kernel hands out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:29:38 +02:00

4 Commits