xenia-rs

Author	SHA1	Message	Date
MechaCat02	873c197ff1	[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 15:20:02 +02:00
MechaCat02	1ae472bd2b	[iterate-2S] GPU: implement CP SCRATCH_REG memory writeback — arms Sylpheed's swap-callback slot Sylpheed renders the splash (draws=78, iterate-2O) then plateaus: the title's per-frame manager (sub_821741C8) only re-fires when "clock B" ([gfx+15160], swap count) changes, which only the CP swap-complete callback sub_824CE2B8 increments. The graphics ISR sub_824BE9A0 indirect-calls that callback via [[gfx+10772]+16] on CP (source=1) interrupts, but the slot stayed NULL so the callback never ran. Root (runtime-verified, ours-side GPU): the guest arms the slot through the Xenos CP scratch-register writeback path, which ours never implemented. The arming IB (drained by ours at 0x4adf5180) contains a Type-0 register write of the callback PC 0x824ce2b8 into SCRATCH_REG4 (0x057C). On hardware/canary, writing a SCRATCH_REG{n} mirrors the value to SCRATCH_ADDR + n4 in memory when the matching SCRATCH_UMSK bit is set. Runtime values: SCRATCH_ADDR=0x0b1d5000 (the [gfx+10772] descriptor), SCRATCH_UMSK=0x20033 (bit 4 set), so SCRATCH_REG4 -> 0x0b1d5010 = descriptor+16 = the callback slot (0x4b1d5010). Ours decoded the Type-0 write into the register file but performed no writeback (case a: drained-but-mishandled), so the slot stayed NULL. Fix mirrors canary's CommandProcessor::HandleSpecialRegisterWrite (command_processor.cc:545-552): a scratch_register_writeback() helper called from handle_type0/handle_type1 after every register write; for SCRATCH_REG0..7 with the UMSK bit set, it writes the value (big-endian, as mem.write_u32 already stores) to SCRATCH_ADDR + n4 (projected via physical_to_backing). Deterministic given identical register state; proven by unit test. Cascade (verified by runtime probe): slot 0x4b1d5010 now armed with 0x824ce2b8; on the 2-3 CP interrupts that fire, the ISR reads the slot and bcctrl's into sub_824CE2B8 (runs 2x; 0x cascade on master); sub_824CE2B8 increments clock B ([gfx+15160]). The cascade does NOT yet reach draws>78: there are only ~3 CP interrupts (from the initial 9825- packet batch), and the title render loop stalls upstream (the iterate-2Q title-respawn gate) before it submits more PM4_INTERRUPT work, so the callback can't bootstrap a self-sustaining loop. This is the remaining update-17/18 arming gap closed; the upstream stall is the next gate. The default threaded GPU backend drains the ring on a separate host thread, so with the callback now doing work the exact CP-interrupt delivery instruction varies run to run (pre-existing GPU-thread race). Pin the n50m oracle test to --gpu-inline (instruction-count deterministic) and re-baseline its golden; bit-exact across repeated runs. New unit test scratch_reg_write_mirrors_to_memory_when_umsk_enabled. Tests: 675 pass (was 674). Golden re-baselined + determinism verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 14:21:30 +02:00
MechaCat02	034ec8b47f	[iterate-2O] GPU: drain indirect buffers correctly — Sylpheed renders splash (draws 0→78) Ours' GPU never drained the D3D driver's system command buffer past the first 11-dword indirect buffer, so DRAW_INDX / reg-0x57C-arm packets never executed and draws stayed 0 (the long-hunted render gate; see UPDATE-18). Runtime tracing (temporary, removed) showed the guest submits 6 INDIRECT_BUFFER packets at boot (CP_RB_WPTR 22→37) but ours executed exactly ONE IB and then spun 15.7M packets inside it. Three coupled command-processor bugs, all corrected to match canary: 1. `sync_with_mmio` applied the primary CP_RB_WPTR to whichever ring was active, including an executing indirect buffer — `37 % 11 = 3` clobbered the IB's write pointer so its read pointer looped 0→2→5→0 forever and never popped back to the primary ring. CP_RB_WPTR governs ONLY the primary ring; while an IB executes, the primary is the bottom of the IB stack. Canary executes each IB through a separate `RingBuffer reader_` (command_processor.cc), so the primary write pointer is structurally inapplicable to an IB. 2. Indirect buffers were treated as circular rings: read wrapped at `size_dwords` (`11 % 11 = 0`) and never reached the fixed write pointer, so even without the clobber the IB could not terminate. An IB is a fixed linear sub-stream; add `RingBufferView.indirect` and drain `[0, ib_size)` monotonically, then pop. 3. `is_ready` only checked the active ring, so an IB that now correctly exhausts would never get `execute_one` called again to pop back to the primary ring (whose WPTR may have advanced). Check the whole IB stack. Also: the ring was sized `1 << size_log2` bytes (1024 dwords) vs canary's `1 << (size_log2 + 3)` (8192 dwords) — an 8× undersize that desynced WPTR-wrap math from the guest. Fixed in `GpuSystem::initialize_ring_buffer` (and the dead bookkeeping copy in `vd_initialize_ring_buffer`). Cascade (deterministic; threaded-default backend, byte-identical across runs): reg 0x57C now written, IB jumps 1→12, packets 15.7M→9,825, and the splash renders — draws 0→78, shaders 0→3, render_targets 0→2, swaps 2→3 — stable at 50M / 200M / 1B. Boot then reaches a new downstream gate (draws plateau at 78, interrupts keep climbing → engine alive, not deadlocked). golden `sylpheed_n50m.json` re-baselined (draws 78). `cargo test --workspace` green (674; +2 ring_view regression tests). vd_swap's synthetic-swap short-circuit is now redundant but left untouched (cascade works without changing it); cleaning it up is a separate follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 22:06:16 +02:00
MechaCat02	2bdb93e51e	[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 \| (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 13:39:57 +02:00
MechaCat02	7a1b6b3306	fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel The Phase-C VdSwap PM4 ring path (commit `82f3d61`) emits two "PM4_XE_SWAP not consumed by drain" warnings when running: exec sylpheed.iso --ui --quiet --halt-on-deadlock \ --parallel --reservations-table Lockstep -n 100M never trips it. Two distinct race windows: (a) Inline backend (--ui forces it): drain(mem, 4096) hit its fixed packet cap before reaching the PM4_XE_SWAP we'd just injected at the WPTR tail. With 6 CPU threads, the ring accumulates >4096 packets between vd_swap callbacks. (b) Threaded backend (--parallel without --ui): the worker's DrainFence handler has a 900 ms deadline and game-batched IBs (8-10 M packets observed) keep it from reaching the tail in any reasonable budget. If the worker eventually drained past the injected packet later, the safety-net direct notify would double-count. Three changes: * gpu_system.rs: new `drain_until_wptr(target, time_budget)` draining by the canary `WorkerThreadMain` predicate (read_offset != target) instead of a fixed packet count. 900 ms deadline mirrors the threaded DrainFence handler. * handle.rs: inline `drain_to_current_wptr` switches to `drain_until_wptr`. DrainFence handler publishes the digest mirror BEFORE replying so the CPU's post-drain `digest_snapshot` sees fresh stats. * exports.rs (vd_swap): skip the PM4 ring injection unconditionally and route swap notification through `notify_xe_swap` directly. Tail-injection is unreliable under --parallel for both backends. The slot-0 fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001); draws=0 today so a stale slot 0 has no observable effect. Verification: * cargo test --workspace --release: 556 passing (unchanged). * Lockstep -n 100M --stable-digest: bit-identical to pre-fix master HEAD `aa3f1d3`. {instructions:100000002, imports:987685, unimpl:0, draws:0, swaps:2, ...} * check --parallel --reservations-table -n 30M: 0 warnings (was 2). swaps=2. * exec --gpu-inline --parallel --reservations-table -n 30M: 0 warnings (was 2 with drained=8M-10M observed). swaps=2. Audit IDs: GPUBUG-DRAIN-001 (closed), GPUBUG-FETCH-PATCH-001 (filed, deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:12:15 +02:00
MechaCat02	8fc1b1dfed	fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO write callback) uses `Ordering::Release` so any ring-memory writes the guest performed before bumping WPTR are visible to a paired `Acquire`-load on the consumer. The consumer here at `sync_with_mmio` was using `Ordering::Relaxed` for both the WPTR load and the RPTR mirror store — leaving the Release/Acquire pairing broken. Under `--parallel`, this broken pairing means the GPU worker can observe a fresh WPTR value while still reading stale ring-memory contents at the corresponding offsets — garbage PM4 packets. The audit's M11 grid run confirmed --parallel is non-deterministic beyond the documented `packets` ±5% noise; this fix is one strand of that. Symmetric fix on the RPTR mirror store: Release pairs with any guest-side Acquire-load of CP_RB_RPTR for ring-writeback bookkeeping. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (unchanged) packets: ~60M (within noise) Tests: 149 (no count change; this is a memory-ordering correctness fix, not a behavioral change visible at the digest level in lockstep). Closes GPUBUG-006 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:26:09 +02:00
MechaCat02	8723d6826b	fix(gpu): GPUBUG-103/104/105 — fix 8 draw-state register addresses + index_size bit Eight of the register-index constants in draw_state.rs::reg pointed at completely unrelated registers because the canonical canary table (register_table.inc) was misread when the module was first authored. Re-validated each value against canary's lines 1232-1336. \| Register \| Pre-fix \| Canary \| Was-actually \| \| ------------------------- \| ------- \| ------ \| ------------- \| \| VGT_DRAW_INITIATOR \| 0x2281 \| 0x21FC \| (junk) \| \| VGT_DMA_BASE \| 0x2282 \| 0x21FA \| (junk) \| \| VGT_DMA_SIZE \| 0x2283 \| 0x21FB \| (junk) \| \| PA_SC_WINDOW_SCISSOR_TL \| 0x200E \| 0x2081 \| SCREEN_SCIS_TL\| \| PA_SC_WINDOW_SCISSOR_BR \| 0x200F \| 0x2082 \| SCREEN_SCIS_BR\| \| RB_COLOR_INFO_1 \| 0x2010 \| 0x2003 \| COHER_DEST_BASE_10\| \| RB_COLOR_INFO_2 \| 0x2011 \| 0x2004 \| COHER_DEST_BASE_11\| \| RB_COLOR_INFO_3 \| 0x2012 \| 0x2005 \| COHER_DEST_BASE_12\| \| PA_SU_VTX_CNTL \| 0x2083 \| 0x2302 \| PA_SC_CLIPRECT_RULE\| Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`). The block comment in `extract()` was also wrong about the intermediate field layout and has been refreshed. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (still gated — see below) packets: ~61M (within noise) Tests: 149 (no count change; existing draw_state tests cover the new constants implicitly via behavioral round-trip). The audit predicted Phases C+D+E together would unlock `draws > 0`, but the runtime plateau is multi-causal per the audit's own analysis (`project_xenia_rs_audit_2026_05_02.md`). The likely remaining blockers in -n 100M: * 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4, 0x42450b5c) — Phase F's XAM/spinlock fixes target this. * shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD yet because workers haven't loaded shader resources. The register fixes here are still load-bearing for any draw that DOES happen (every register read at 0x2281 was junk before this commit) — landing them now is correct even if draws=0 persists until Phase F unparks the resource-loader threads. Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:22:04 +02:00
MechaCat02	ec2d955dbd	fix(gpu): GPUBUG-102 — apply per-format endian byte-swap to vertex fetch The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`, xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1) selecting kNone/k8in16/k8in32/k16in32 swap patterns per `GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored big-endian; the host is little-endian. Pre-fix every dword was bitcast as-is — vertex positions decoded as byte-reversed garbage, producing clipped or NaN positions in any draw that survived to the host. Mechanical changes: - crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads fetch_const dword 1 (endian) and wraps each lane's load in `gpu_swap(value, endian)`. New `gpu_swap` helper added to the emitted module header. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching `gpu_swap` helper added to the runtime interpreter shader. `interpret_vertex_fetch` reads fc1, computes the endian, and wraps every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT paths). Mirrors the AOT translator's emission. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~60M (within noise) Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call). Closes GPUBUG-102 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:18:46 +02:00
MechaCat02	c5c6713419	fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources Word-1 of every ALU triple holds three 8-bit component-relative swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7 per canary ucode.h:2064-2066) and three per-operand negate flags (bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT translator discarded word-1 entirely with `_ = w1;` — every ALU result was missing its swizzle (broadcast/permute patterns like `.zyxw`, `.xxxx`) and any negated operand was used positive instead. Component-relative semantics (canary's `AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output component i, the source component is `((swizzle >> (2*i)) + i) & 3`. Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in the interpreter shader treated it as absolute, also incorrect. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks them from word 1. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses component-relative semantics. interpret_alu decodes the modifiers and applies via apply_swizzle + apply_modifiers (with abs=false). - crates/xenia-gpu/src/translator.rs: src_operand emits the precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`, then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a bare base expression so it round-trips with the trivial-shader fixture. Abs is omitted in this commit — the abs flag is dual-meaning (for temps it lives at bit 7 of the src byte; for constants at word-2 bit 7 `abs_constants`). Wiring it up correctly requires more careful case-split logic; deferred to Phase G. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~58M (within noise) Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise because identity swizzle test merged into D1's parameterised test). WGSL still validates via naga (combined_module_parses_as_wgsl). Closes GPUBUG-100 (P0). Abs deferred to Phase G. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:15:07 +02:00
MechaCat02	78ea81c12a	fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h: 2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags selecting temp register (1) vs ALU constant (0); the corresponding 8-bit src byte indexes either: - a temp register (bits 5:0 = index, bits 6/7 reserved for relative-addressing / abs flags consumed by Phase D2), or - an ALU constant (full 8-bit index). Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F` on the src byte and emitted `r[low7]` regardless of the operand class. Every shader's WVP matrix / light constant / per-frame uniform read came back as r[low7] — typically zero — yielding invisible rendering. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp / src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our src_a (low byte of w0) is canary's third operand, hence its selector is bit 29 (canary src3_sel), not bit 31. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes the is_temp flag; constants index xenos_consts.alu directly. - crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when constant. The trivial-shader synthetic test was updated to set the temp flags so its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the flags set, all sources would now resolve as constants. Bank-selection (cf-level relative addressing for higher banks of the 512 ALU constants) remains a Phase G+ extension — covers c0..c127 in bank 0, which most Sylpheed shaders use directly. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged — gated by D2/D3/E for draws) draws: 0 → 0 packets: ~61M (within noise) Tests: 552 → 554 (+2 translator tests for the temp/constant decode). Closes GPUBUG-101 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:10:11 +02:00
MechaCat02	82f3d611e2	fix(gpu,kernel): KRNBUG-Vd-04 / GPUBUG-001 / XMODBUG-013 — VdSwap PM4 ring path The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/ blur "sample frame N for frame N+1" path samples fetch-constant slot 0 expecting the frontbuffer descriptor; without the patch, slot 0 stayed stale and any shader sampling it read garbage. This commit writes the canary VdSwap PM4 sequence directly into the primary ring at the current write pointer (read via the shared MMIO atomic), then advances WPTR over the injection. The natural CP drain consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant slot 0 — without going through any direct kernel→GPU bypass. Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521): 1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with base_address re-patched from virtual to physical >> 12). 2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys + width + height. Mechanism notes: - buffer_ptr in xenia-rs is in the system command buffer, NOT the primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to buffer_ptr because its ring layout maps the reserved slot inside the ring; xenia-rs's doesn't, so we have to write at the actual ring WPTR address (cached on KernelState.ring_base from VdInitializeRingBuffer). - The original "buffer_ptr zero-fill + bump WPTR by 64" path is preserved before the injection — it exposes any game-batched PM4 packets and keeps the buffer_ptr region skippable per existing game compat behavior. - A safety-net fallback at the end calls `notify_xe_swap` directly if swaps_seen didn't advance during the drain (e.g. a ring-arithmetic edge case). Idempotent — only fires when the PM4 path didn't. - KRNBUG-Mm-04 deferred: virt→phys uses the masked stub `virt & 0x1FFF_FFFF`, sufficient for the standard heap. Mechanical changes: - crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3 helpers + round-trip unit test (mirrors canary xenos.h:1682-1709). - crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor (Acquire-load) so the kernel can compute ring offsets. - crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords on KernelState (set by VdInitializeRingBuffer). - crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit block; patch fetch_dwords[1] base_address virt→phys before injection. Verification at -n 100M lockstep: swaps: 2 → 2 (game fires VdSwap exactly twice) draws: 0 → 0 (gated by Phases D+E) fallback warning: 0 occurrences (PM4 path consumed both swaps) instructions: ~100M Tests: 552 passing (553 with new pm4 round-trip test). Lockstep stable-fields determinism: byte-identical across two 100M runs. The "swaps > 2" prediction in the audit's plan assumed the game would fire VdSwap more often once the path worked; empirically Sylpheed only calls VdSwap twice within 100M instructions (this is the renderer plateau the audit identified). The success criterion for Phase C is that the PM4 path is now operational, which Phases D+E require for visible draws. Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:00:23 +02:00
MechaCat02	79eb52c378	xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve) First real GPU implementation. Ring/PM4 frontend (ring_view, ring_drain, pm4) drains the command processor; gpu_system owns the threaded backend (DrainFence RPC + parker/fence helpers from M1) and the MMIO-mapped register block (mmio_region). Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode the Xbox 360 microcode, translator.rs lowers it onto the WGSL xenos_interp interpreter shader (shaders/xenos_interp.wgsl). shader_metrics.rs counts decode/translate work. Render state: draw_state, primitive, render_target_cache, texture_cache, tiled_address (Xenos's swizzled tiled-memory layout), xenos_constants (register field constants), edram (the 10 MiB EDRAM model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs owns the typed GPU-resource handles the kernel hands out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:29:38 +02:00
MechaCat02	c694bb3f43	Initial commit: xenia-rs workspace for Xbox 360 RE Rust reimplementation of the xenia Xbox 360 emulator targeting reverse- engineering and preservation, initially scoped to Project Sylpheed. Includes: - XEX2 loader (LZX decompression, AES decryption, PE parsing) - XISO / XGD2 disc image VFS - PPC interpreter with 200+ opcodes and VMX128 decoding - Static analyzer: functions, cross-references, labels, asm + SQLite output - HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init - Debugger with in-memory and SQLite-backed execution tracing - `xenia-rs` CLI with extract/dis/exec commands that produce cumulative, superset SQLite databases and opt-in instruction/import/branch traces Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-16 23:14:56 +02:00

13 Commits