xenia-rs

Author	SHA1	Message	Date
MechaCat02	39723dfe37	[iterate-3W] draw_capture: walk CF exec sequence to find the real vertex fetch Fix the UI-side vertex-window resolver (`resolve_vertex_window`) so it identifies vertex fetches via the control-flow `Exec` clause `sequence` bitmap instead of blindly decoding every 3-dword triple. Root cause (GPUBUG-109): the Xenos instruction block packs ALU and fetch instructions identically (96 bits each); only the owning `Exec` clause's `sequence` bitmap (2 bits per instruction, bit[2i] = fetch/ALU) tells them apart. The old resolver scanned every triple and trusted the first that happened to decode as a vertex fetch, gated by a `dword0 & 3 == 3` "type" guard. On real shaders this mis-decoded ALU triples as fetches and either picked a garbage fetch-constant slot or rejected the clause before reaching the true vertex fetch. Now walk the CF exec clauses exactly as the translator does (`translator.rs::emit_exec`) and take the first sequence-flagged vertex* fetch. Measured (env-gated probes, removed before commit): the resolver now reaches the real fetch on every splash VS. The RectangleList draws (vs 0x36660986 / 0xd4c14f46) keep resolving real geometry (valid fetch const 0). The publisher-logo QuadList (vs 0x03b7b020) is correctly seen to fetch from a fetch constant whose dword0 = 0x1 (no vertex buffer) — i.e. its geometry is NOT sourced from a memory vertex buffer, so it still (correctly) falls to the procedural path. That remaining gap (the logo's auto-generated/index-derived geometry) is the next milestone-1 step; this commit removes the decoder defect that masked it. Determinism: UI-only. `resolve_vertex_window` runs only when `frame_captures` is `Some` (i.e. `--ui`); the headless `--gpu-inline` core never calls it. `check -n50000000 --gpu-inline --stable-digest` exit 0 and byte-identical run-to-run. cargo test --workspace: 681 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:50:13 +02:00
MechaCat02	da7c29b6d2	[iterate-3V] Fix logo texture: map texture-fetch physical base onto backing window The publisher-logo texture (K8888 1280x768 linear, `E59B2B3D`'s tfetch surface) rendered flat/transparent because the GPU texture decode read the wrong host bytes — NOT because the asset was never decompressed. First-divergence (vs canary, measured both engines): - Ours DOES read game:\hidden\Resource3D\.xpr in full, builds a byte-identical cache, decompresses the logo, and CPU-writes the real artwork (~839K nonzero bytes) into the texture buffer — at the guest physical-aperture VA 0x4dbee000 (writer sub_823C3E70 @ 0x823c3f8c). This REFUTES the iterate-3U verdict that the texture was never filled. - BUT the GPU decode used the raw fetch-constant base 0x0dbee000 as a virtual address. In ours' flat 4GB memory, virtual 0x0dbee000 and the physical alias 0x4dbee000 are DIFFERENT host bytes (no aliasing in the read/write path), so the decode read all-zeros. The Xenos texture fetch constant carries a guest physical* base; the CPU writes texels through its cached-physical aperture, which ours backs at the committed 0x4000_0000 window. Map the base via the existing `physical_to_backing` helper before reading — exactly as the vertex fetch path (draw_capture.rs, iterate-3Q) and as canary reads textures through its GPU shared memory (= physical). Measured after fix (env-gated probe, removed): the logo decode reads base=0x4dbee000 and produces 839068/3932160 nonzero bytes (21.3%) — a centered logo on a transparent field, matching canary's ~21% exactly. Determinism: GPU-side pure read; no CPU/guest-memory state changes. The n50m --gpu-inline --stable-digest golden is byte-identical (verified 2x, texture_cache_entries unchanged). cargo test --workspace green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:07:00 +02:00
MechaCat02	1b9918450f	[iterate-3T] Real UV interpolation + per-draw textures: shader/UV/bind chain complete Build the full texture-sampling chain for the publisher splash so the textured logo CAN sample real artwork at the guest's real UVs. Measured with an env-gated frontbuffer readback (since removed): the chain is correct end-to-end, but the sampled K8888 1280x768 texture is ALL-ZERO in the UI window's reachable boot range — the artwork is produced by an EDRAM resolve (RT->texture copy) that ours does not yet perform (resolves=0). So this lands the correct shader/UV/bind work and isolates the remaining blocker to the resolve gap, not the shader path. Translator (xenia-gpu/src/translator.rs), all UI-translator-only: - Real Xenos export-index model (replaces the AllocKind heuristic that collapsed every VS export to one color slot and DROPPED the texcoord). When export_data is set the 6-bit vector_dest IS the export index: VS 62=oPos, 0..15=interps; PS 0=RT0. The logo VS exports oPos(62), interp0(color), interp1(UV) distinctly. - Real interpolator passthrough: VsOut carries 8 interpolator locations; the PS seeds r[i] = in.interp[i] (Xenos PS-input-GPR mapping) so tfetch samples at the real interpolated texcoord (r1) instead of (0,0). - vfetch format 6 (k_16_16) packed-16 unpack + per-attribute dword offset, so the 3 vfetches sharing one fetch-constant (pos/UV/color in a 6-dword vertex) read the right attribute. Previously rejected the whole logo VS to the interpreter. - QuadList/RectangleList host->guest vertex-index remap in the VS (replay is non-indexed): QuadList 6 host verts -> guest [0,1,2,0,2,3] (full quad). fetch.rs: decode vfetch `offset` (dword2[8:15], dwords), `is_signed`, `is_normalized`. Per-draw textures: DrawCapture carries the decoded texture(s) (keyed off the active PS's tfetch slots, attached in gpu_system after decode); render.rs::dispatch_xenos_captures uploads + binds each capture's texture via the host texture cache before its draw, instead of one last-draw primary_texture. Determinism: all changes feed only the UI translator/capture path; frame_captures is None headless. `check -n50m --gpu-inline --stable-digest --expect` byte- identical (exit 0). 681 tests pass (+2 regression: logo VS now translates with interpolators; PS seeds interps into registers). Temp readback/dump probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 17:12:16 +02:00
MechaCat02	80fbff8bd1	[iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform The 3O→3R real-render slice ran the guest's real translated VS/PS on real captured vertices at full boot speed, but the --ui window stayed blank. Bifurcated with an env-gated frontbuffer readback + per-vertex NDC dump (both removed): the captured splash quads (RectangleList, k_32_32_FLOAT, 3 verts) were non-zero and sane, so this was a transform/decode chain of bugs, not missing geometry. Four coupled root causes: - GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but canary's AluInstruction lays dest/write-mask/export/scalar-opcode in w0, the vector opcode + source regs in w2, swizzle/negate/pred in w1. The misread made every export ALU decode with vector_write_mask=0 → no oPos/oColor export emitted → the translated VS collapsed every vertex to the clip origin. Rewrote the field map to match ucode.h:2036-2086. - GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's X into .w → negative W → the rectangle clipped behind the camera. Decode the real VertexFormat + dword stride and emit the matching component read (1/2/3/4 float formats; others reject to the interpreter). - GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed the buffer base from xenos_consts.fetch[], but that uniform carries the last-published per-frame fetch constant, not this draw's (stale 0x8a000002 vs the real base). The captured window already begins at the fetch base, so index from 0 (vertex i at i*stride) when a real window is present; only the synthetic fallback consults the uniform. - iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL): the guest VS emits screen-space pixel coords (clip disabled, VTE viewport scale/offset off). Added compute_ndc_xy (mirrors canary GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in both the translated and interpreter VS. Result (env-gated readback, since removed): the real splash geometry now fills ~50% of the frontbuffer in a clean triangular coverage pattern, real positions from real guest vertices through the real translated shaders (textures are the next stage — sampled color is still the magenta/white texture stub, tex-cache=0). Headless-inert: draw_capture is only built when frame_captures is Some (--ui); the changed decoders feed only the UI translator/metrics. Golden byte-identical (check -n50m --gpu-inline --stable-digest exit 0); 679 workspace tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 16:35:01 +02:00
MechaCat02	a3aa3cc7d6	[iterate-3Q] draw_capture: read vertex window via physical alias The UI geometry-capture read the vertex-fetch base at its bare low VA (~0x0adf_xxxx), which is unmapped in ours, so it copied all-zeros. The fetch constant's address:30 field is a guest physical dword address (canary reads it via Memory::TranslatePhysical, draw_util.cc:961). Ours only maps the cached-physical window at 0x4000_0000 (physical_to_backing). Rebase a low physical base onto that mapped alias when the raw VA is unmapped; window_base_dwords still carries the original base so the shader's rebase indexes the uploaded window. Decode itself was verified correct against canary (xe_gpu_vertex_fetch_t + GetVertexFetch + ucode.h vfetch const_index*3+const_index_sel): for the splash draws const_index_sel==0, so ours' stride-6 register offset lands on the exact same constant as canary's stride-2 offset; raw dwords match byte-for-byte. UI-only path (frame_captures is None in headless), so the deterministic --gpu-inline golden is byte-identical (verified) and 679 tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:28:49 +02:00
MechaCat02	6ff184694d	[iterate-3P] Real splash geometry in --ui: fix CF predication decode + translator op coverage Stage 1 of the iterate-3O resume plan: make the P7 translator actually compile the splash's real VS/PS so real per-vertex POSITIONS render via the host wgpu pipeline, instead of every draw falling to the interpreter (which emits a placeholder triangle). Two coupled fixes, both faithful (Route A): 1. ucode/control_flow.rs (GPUBUG-103): clause-level predication was decoded from payload bits 28/29, which fall inside the exec clause's `sequence_`/ `vc_hi_` fields, NOT the predicate flag. That stamped `predicated=true` on plain `kExec` clauses, so the translator rejected EVERY splash VS as `cf_cond`. Per canary ucode.h, clause predication is determined by the opcode (only kCondExecPred* = 5/6/13/14 are predicate-register-gated; their `condition_` is at word1 bit 9 = payload bit 41). kExec/kExecEnd (1/2) run unconditionally; kCondExec (3/4) is bool-constant-gated (not modeled). Diagnosed live in --ui: reject reason cf_cond on all 7 splash shader pairs → after fix, predicated=false and CF passes. 2. translator.rs: with CF passing, the next reject was `scl_op_unsupported` for scalar opcodes 4 (kMulsPrev2 / LIT emul) and 8 (kSgts), plus thin vector coverage. Expanded vector_expr + scalar_expr to mirror the runtime interpreter's op set (which mirrors canary AluVectorOpcode/AluScalarOpcode): CND_EQ/GE/GT, TRUNC, MAX4, DST for vectors; the full SEQS/SGTS/SGES/SNES, MULS_PREV2 (with the -FLT_MAX / non-finite / b<=0 guard), SUBS(_PREV), EXP/LOG/RCP/RSQ/SQRT/SIN/COS, FRCS/TRUNCS/FLOORS for scalars. Side-effecting ops (setp/kills/maxas*) still reject → interpreter fallback (honest). Result (--ui, measured): xlated-pipelines 0→6, all draws served by the translator (served_interp=0) — real VS/PS now run on the host GPU. The splash is still not visibly correct because the captured guest vertex windows read all-zero: the vertex-buffer base VA (~0x0adf_xxxx) is UNMAPPED in guest memory (mem.translate()==None). That is a CPU/kernel memory-mapping gap, not a GPU-render gap — the next stage. Determinism: both files are in xenia-gpu core but the CF `predicated` field only feeds the UI translator + a metric tag, never deterministic state. Verified: `check -n50000000 --gpu-inline --stable-digest` matches the golden byte-for-byte (exit 0); 679 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:07:06 +02:00
MechaCat02	504592ac13	[iterate-3O] Real-render slice: replay guest geometry in --ui (Route A) Replace the synthetic placeholder triangle in the --ui window with the splash's REAL guest geometry, proving the faithful-render pipe end to end. Architecture: Route A (UI-side replay). A per-draw capture channel carries each PM4_DRAW_INDX's real state to the UI, which replays it through the existing wgpu Xenos pipeline. The deterministic headless core is untouched: capture is gated on an Option<Vec<DrawCapture>> that is None in headless mode and only enabled on the --ui path, so the --gpu-inline n50m golden is byte-identical (verified 2x). The hard part was sourcing real vertices. The WGSL VS already does format-aware vertex fetch from the b4 storage buffer at the address from the fetch constant -- but b4 was never populated and the fetch address is an absolute guest dword address. The slice: xenia-gpu/draw_capture.rs: parse the active VS, find its first vertex fetch, read that fetch constant, copy a bounded window of guest memory at the fetch base. Best-effort: has_real_vertices=false falls back to procedural geometry (never fabricated pixels). * gpu_system.rs: accumulate one DrawCapture per draw into frame_captures. * exports.rs (vd_swap): drain + publish the frame's captures to the UI. * ui_bridge/bridge.rs: new publish_geometry channel + UiHandles.geometry. * WGSL (interp + translator): rebase the absolute fetch address by a new DrawConstants.vertex_base_dwords so it indexes the uploaded window. * render.rs: dispatch_xenos_captures uploads each draw's real vertex window + matching shader, issues real DrawRequests (real prim type, host vertex count, vs/ps keys). * app.rs: prefer the real-capture replay; HUD adds real-geo=N counter. Verified in --ui on Sylpheed: "first Xenos capture batch replayed (real geometry) captures=24 real_vertex_draws=24" -- all draws resolved a real guest vertex window; WGSL compiles; no validation errors over 1616 swaps. Still synthetic-free but not yet pixel-perfect: textures/UVs, DMA index buffers (auto-index only for now), and kCopy resolve routing are staged for follow-ups. Faithful: real vertex data, prim types, shaders, constants. cargo test --workspace green; n50m golden unchanged (2x byte-identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 22:38:46 +02:00
MechaCat02	6bb4355e3d	[iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds The publisher splash (title idx0) rendered FLAT in ours while canary samples a texture: ours never decoded the logo's textured pixel shader (E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same microcode canary does (verified byte-identical against the Wine oracle). The shader was misparsed as flat. Three coupled bugs in the ucode decoder, all off vs canary `gpu/ucode.h`: 1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc, 15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal `Exit`, truncating the CF block and dropping every instruction (incl. the `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on an END-bit exec clause. 2. exec/loop `address` is an absolute instruction-triple index from shader dword 0, but indexed our post-CF `instructions` slice directly (`ucode/mod.rs`). Rebase addresses by the CF triple count so `address*3` lands on the right instruction. 3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index` read from bit 5 (actually `src_reg`) instead of bit 20, and texture `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0` was read as `tf1`, whose empty fetch-constant failed to decode → no texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`). Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and `gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the only golden field that changed — all draw/swap counts unchanged). The same fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path. Added a regression test that parses the real E59B2B3D microcode and asserts a tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 21:53:35 +02:00
MechaCat02	2f55d1fd7d	[iterate-2X] Texture pipeline: un-stub RectangleList + draw-time texture decode Two faithful, deterministic GPU-backend changes that make the texture path correct for whatever textured draw the splash eventually dispatches. Both are currently inert on Sylpheed (the textured logo draw is still gated downstream — see below), but neither shifts the stable-digest golden, so they land safely. 1. Un-stub RectangleList primitive expansion (primitive.rs). The splash submits 2819 RectangleList draws at 200M, all of which were REJECTED by the P3 stub (`gpu.primitive.rejected{rectangle_list}`) → only ~592 flat point/quad draws rasterized. Mirror canary's intent (primitive_processor.cc:389-456 kRectangleListAsTriangleStrip) within our CPU index-rewrite idiom: emit each rect's 3 real vertices as one TriangleList triangle (v0,v1,v2), rejected=false, faithful host_vertex_count. The full quad (synthesized 4th corner v3=v0+v2-v1) needs real vertex fetch in vs_main — left as a documented TODO. Rejection warnings drop 2819→0. 2. Draw-time texture decode keyed off the active PS's real tfetch slots (gpu_system.rs + exports.rs vd_swap). Previously vd_swap decoded a hardcoded fetch-constant slot 0 at swap time. Now the DRAW handler parses the bound pixel shader (ucode::parse_shader), collects its tfetch fetch_const slots via new shader_metrics::tfetch_slots, reads each 6-dword fetch constant, and decode+caches it into GpuSystem::last_draw_textures. vd_swap publishes the first of these (UI binds one texture today), falling back to the legacy slot-0 probe on flat-only frames. New span_max_version helper walks page_version over the trait (draw-time &dyn MemoryAccess lacks the heap's inherent max_page_version). Pure function of guest writes — deterministic. Status: texture_decodes stays 0 on Sylpheed because all 6 live shaders are flat (no tfetch); canary's textured logo shaders E59B2B3D/F7B1457 are not yet dispatched by ours (a downstream title-state gate, the next frontier). The full P5 decode→publish→upload→sample path is already wired; this makes the decode side key off the real shader instead of a guess. Validation: stable-digest golden sylpheed_n50m unchanged (draws=718 swaps=147 tex=0), regenerated twice byte-identical; 200M run shows 0 RectangleList rejections. cargo test --workspace green (677, +2: rectangle_list_expansion, tfetch_slots_extracts_texture_fetch_constants). No temp hooks. Branch only; not pushed/merged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 21:34:43 +02:00
MechaCat02	a91f4c550b	[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation The title's per-frame loop (sub_822F1AA8) is clock-B-paced and only re-fires when the swap count [controller+88] changes, which advances only on source=1 CP swap-complete interrupts. Each present batch the guest submits (via the sub_824CE348 -> sub_824BF4D0 builder) ends with a WAIT_REG_MEM on a per-CPU swap-acknowledge fence [GCTX+0] (GCTX = [device+10772]); the GPU parks there until the graphics ISR (sub_824BE9A0) clears that CPU's bit. Two coupled gaps kept ours emitting only ONE source=1 then dead-locking (draws plateaued at 28, run halted ~19.27M): 1. GPU MMIO register 0x1961 (AVIVO_D1MODE_VIEWPORT_SIZE) read as 0. The swap callback sub_824CE2B8 divides by its low 12 bits (display height) as a refresh-pacing term, so a 0 read tripped its `twi` divide-by-zero guard and aborted the ISR before it reached the fence-clear. Mirror canary GraphicsSystem::ReadRegister (graphics_system.cc:311): return 0x050002D0 (1280x720). 2. The ISR ran on an arbitrary borrowed thread, so [r13+268] (the PCR processor number) did not match the interrupt's target CPU. The ISR clears `1 << current_cpu` from the fence; running on the wrong CPU cleared the wrong bit and the fence (bit 2, from cpu_mask 0x4) never reached 0. Carry the target CPU through the interrupt queue (bit index of the PM4_INTERRUPT cpu_mask for CP, 2 for vsync per canary DispatchInterruptCallback(0, 2)) and impersonate it on the borrowed thread's PCR around the ISR, mirroring canary EmulateCPInterruptDPC -> XThread::SetActiveCpu. With both fixes the fence clears, the GPU drains each present batch, source=1 sustains per-present, clock B advances, and the loop runs continuously. Draws climb linearly with the budget (no re-stall): 50M 28->718, 200M ->3411, 1B ->18734; swaps 2->147/950/6060. No "Unanticipated CPU_INTERRUPT" trap. Inline-deterministic (--stable-digest byte-identical x2); n50m golden re-baselined. 675 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 20:49:32 +02:00
MechaCat02	873c197ff1	[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 15:20:02 +02:00
MechaCat02	1ae472bd2b	[iterate-2S] GPU: implement CP SCRATCH_REG memory writeback — arms Sylpheed's swap-callback slot Sylpheed renders the splash (draws=78, iterate-2O) then plateaus: the title's per-frame manager (sub_821741C8) only re-fires when "clock B" ([gfx+15160], swap count) changes, which only the CP swap-complete callback sub_824CE2B8 increments. The graphics ISR sub_824BE9A0 indirect-calls that callback via [[gfx+10772]+16] on CP (source=1) interrupts, but the slot stayed NULL so the callback never ran. Root (runtime-verified, ours-side GPU): the guest arms the slot through the Xenos CP scratch-register writeback path, which ours never implemented. The arming IB (drained by ours at 0x4adf5180) contains a Type-0 register write of the callback PC 0x824ce2b8 into SCRATCH_REG4 (0x057C). On hardware/canary, writing a SCRATCH_REG{n} mirrors the value to SCRATCH_ADDR + n4 in memory when the matching SCRATCH_UMSK bit is set. Runtime values: SCRATCH_ADDR=0x0b1d5000 (the [gfx+10772] descriptor), SCRATCH_UMSK=0x20033 (bit 4 set), so SCRATCH_REG4 -> 0x0b1d5010 = descriptor+16 = the callback slot (0x4b1d5010). Ours decoded the Type-0 write into the register file but performed no writeback (case a: drained-but-mishandled), so the slot stayed NULL. Fix mirrors canary's CommandProcessor::HandleSpecialRegisterWrite (command_processor.cc:545-552): a scratch_register_writeback() helper called from handle_type0/handle_type1 after every register write; for SCRATCH_REG0..7 with the UMSK bit set, it writes the value (big-endian, as mem.write_u32 already stores) to SCRATCH_ADDR + n4 (projected via physical_to_backing). Deterministic given identical register state; proven by unit test. Cascade (verified by runtime probe): slot 0x4b1d5010 now armed with 0x824ce2b8; on the 2-3 CP interrupts that fire, the ISR reads the slot and bcctrl's into sub_824CE2B8 (runs 2x; 0x cascade on master); sub_824CE2B8 increments clock B ([gfx+15160]). The cascade does NOT yet reach draws>78: there are only ~3 CP interrupts (from the initial 9825- packet batch), and the title render loop stalls upstream (the iterate-2Q title-respawn gate) before it submits more PM4_INTERRUPT work, so the callback can't bootstrap a self-sustaining loop. This is the remaining update-17/18 arming gap closed; the upstream stall is the next gate. The default threaded GPU backend drains the ring on a separate host thread, so with the callback now doing work the exact CP-interrupt delivery instruction varies run to run (pre-existing GPU-thread race). Pin the n50m oracle test to --gpu-inline (instruction-count deterministic) and re-baseline its golden; bit-exact across repeated runs. New unit test scratch_reg_write_mirrors_to_memory_when_umsk_enabled. Tests: 675 pass (was 674). Golden re-baselined + determinism verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 14:21:30 +02:00
MechaCat02	034ec8b47f	[iterate-2O] GPU: drain indirect buffers correctly — Sylpheed renders splash (draws 0→78) Ours' GPU never drained the D3D driver's system command buffer past the first 11-dword indirect buffer, so DRAW_INDX / reg-0x57C-arm packets never executed and draws stayed 0 (the long-hunted render gate; see UPDATE-18). Runtime tracing (temporary, removed) showed the guest submits 6 INDIRECT_BUFFER packets at boot (CP_RB_WPTR 22→37) but ours executed exactly ONE IB and then spun 15.7M packets inside it. Three coupled command-processor bugs, all corrected to match canary: 1. `sync_with_mmio` applied the primary CP_RB_WPTR to whichever ring was active, including an executing indirect buffer — `37 % 11 = 3` clobbered the IB's write pointer so its read pointer looped 0→2→5→0 forever and never popped back to the primary ring. CP_RB_WPTR governs ONLY the primary ring; while an IB executes, the primary is the bottom of the IB stack. Canary executes each IB through a separate `RingBuffer reader_` (command_processor.cc), so the primary write pointer is structurally inapplicable to an IB. 2. Indirect buffers were treated as circular rings: read wrapped at `size_dwords` (`11 % 11 = 0`) and never reached the fixed write pointer, so even without the clobber the IB could not terminate. An IB is a fixed linear sub-stream; add `RingBufferView.indirect` and drain `[0, ib_size)` monotonically, then pop. 3. `is_ready` only checked the active ring, so an IB that now correctly exhausts would never get `execute_one` called again to pop back to the primary ring (whose WPTR may have advanced). Check the whole IB stack. Also: the ring was sized `1 << size_log2` bytes (1024 dwords) vs canary's `1 << (size_log2 + 3)` (8192 dwords) — an 8× undersize that desynced WPTR-wrap math from the guest. Fixed in `GpuSystem::initialize_ring_buffer` (and the dead bookkeeping copy in `vd_initialize_ring_buffer`). Cascade (deterministic; threaded-default backend, byte-identical across runs): reg 0x57C now written, IB jumps 1→12, packets 15.7M→9,825, and the splash renders — draws 0→78, shaders 0→3, render_targets 0→2, swaps 2→3 — stable at 50M / 200M / 1B. Boot then reaches a new downstream gate (draws plateau at 78, interrupts keep climbing → engine alive, not deadlocked). golden `sylpheed_n50m.json` re-baselined (draws 78). `cargo test --workspace` green (674; +2 ring_view regression tests). vd_swap's synthetic-swap short-circuit is now redundant but left untouched (cascade works without changing it); cleaning it up is a separate follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 22:06:16 +02:00
MechaCat02	2bdb93e51e	[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 \| (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 13:39:57 +02:00
MechaCat02	7a1b6b3306	fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel The Phase-C VdSwap PM4 ring path (commit `82f3d61`) emits two "PM4_XE_SWAP not consumed by drain" warnings when running: exec sylpheed.iso --ui --quiet --halt-on-deadlock \ --parallel --reservations-table Lockstep -n 100M never trips it. Two distinct race windows: (a) Inline backend (--ui forces it): drain(mem, 4096) hit its fixed packet cap before reaching the PM4_XE_SWAP we'd just injected at the WPTR tail. With 6 CPU threads, the ring accumulates >4096 packets between vd_swap callbacks. (b) Threaded backend (--parallel without --ui): the worker's DrainFence handler has a 900 ms deadline and game-batched IBs (8-10 M packets observed) keep it from reaching the tail in any reasonable budget. If the worker eventually drained past the injected packet later, the safety-net direct notify would double-count. Three changes: * gpu_system.rs: new `drain_until_wptr(target, time_budget)` draining by the canary `WorkerThreadMain` predicate (read_offset != target) instead of a fixed packet count. 900 ms deadline mirrors the threaded DrainFence handler. * handle.rs: inline `drain_to_current_wptr` switches to `drain_until_wptr`. DrainFence handler publishes the digest mirror BEFORE replying so the CPU's post-drain `digest_snapshot` sees fresh stats. * exports.rs (vd_swap): skip the PM4 ring injection unconditionally and route swap notification through `notify_xe_swap` directly. Tail-injection is unreliable under --parallel for both backends. The slot-0 fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001); draws=0 today so a stale slot 0 has no observable effect. Verification: * cargo test --workspace --release: 556 passing (unchanged). * Lockstep -n 100M --stable-digest: bit-identical to pre-fix master HEAD `aa3f1d3`. {instructions:100000002, imports:987685, unimpl:0, draws:0, swaps:2, ...} * check --parallel --reservations-table -n 30M: 0 warnings (was 2). swaps=2. * exec --gpu-inline --parallel --reservations-table -n 30M: 0 warnings (was 2 with drained=8M-10M observed). swaps=2. Audit IDs: GPUBUG-DRAIN-001 (closed), GPUBUG-FETCH-PATCH-001 (filed, deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:12:15 +02:00
MechaCat02	8fc1b1dfed	fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO write callback) uses `Ordering::Release` so any ring-memory writes the guest performed before bumping WPTR are visible to a paired `Acquire`-load on the consumer. The consumer here at `sync_with_mmio` was using `Ordering::Relaxed` for both the WPTR load and the RPTR mirror store — leaving the Release/Acquire pairing broken. Under `--parallel`, this broken pairing means the GPU worker can observe a fresh WPTR value while still reading stale ring-memory contents at the corresponding offsets — garbage PM4 packets. The audit's M11 grid run confirmed --parallel is non-deterministic beyond the documented `packets` ±5% noise; this fix is one strand of that. Symmetric fix on the RPTR mirror store: Release pairs with any guest-side Acquire-load of CP_RB_RPTR for ring-writeback bookkeeping. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (unchanged) packets: ~60M (within noise) Tests: 149 (no count change; this is a memory-ordering correctness fix, not a behavioral change visible at the digest level in lockstep). Closes GPUBUG-006 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:26:09 +02:00
MechaCat02	8723d6826b	fix(gpu): GPUBUG-103/104/105 — fix 8 draw-state register addresses + index_size bit Eight of the register-index constants in draw_state.rs::reg pointed at completely unrelated registers because the canonical canary table (register_table.inc) was misread when the module was first authored. Re-validated each value against canary's lines 1232-1336. \| Register \| Pre-fix \| Canary \| Was-actually \| \| ------------------------- \| ------- \| ------ \| ------------- \| \| VGT_DRAW_INITIATOR \| 0x2281 \| 0x21FC \| (junk) \| \| VGT_DMA_BASE \| 0x2282 \| 0x21FA \| (junk) \| \| VGT_DMA_SIZE \| 0x2283 \| 0x21FB \| (junk) \| \| PA_SC_WINDOW_SCISSOR_TL \| 0x200E \| 0x2081 \| SCREEN_SCIS_TL\| \| PA_SC_WINDOW_SCISSOR_BR \| 0x200F \| 0x2082 \| SCREEN_SCIS_BR\| \| RB_COLOR_INFO_1 \| 0x2010 \| 0x2003 \| COHER_DEST_BASE_10\| \| RB_COLOR_INFO_2 \| 0x2011 \| 0x2004 \| COHER_DEST_BASE_11\| \| RB_COLOR_INFO_3 \| 0x2012 \| 0x2005 \| COHER_DEST_BASE_12\| \| PA_SU_VTX_CNTL \| 0x2083 \| 0x2302 \| PA_SC_CLIPRECT_RULE\| Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`). The block comment in `extract()` was also wrong about the intermediate field layout and has been refreshed. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (still gated — see below) packets: ~61M (within noise) Tests: 149 (no count change; existing draw_state tests cover the new constants implicitly via behavioral round-trip). The audit predicted Phases C+D+E together would unlock `draws > 0`, but the runtime plateau is multi-causal per the audit's own analysis (`project_xenia_rs_audit_2026_05_02.md`). The likely remaining blockers in -n 100M: * 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4, 0x42450b5c) — Phase F's XAM/spinlock fixes target this. * shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD yet because workers haven't loaded shader resources. The register fixes here are still load-bearing for any draw that DOES happen (every register read at 0x2281 was junk before this commit) — landing them now is correct even if draws=0 persists until Phase F unparks the resource-loader threads. Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:22:04 +02:00
MechaCat02	ec2d955dbd	fix(gpu): GPUBUG-102 — apply per-format endian byte-swap to vertex fetch The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`, xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1) selecting kNone/k8in16/k8in32/k16in32 swap patterns per `GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored big-endian; the host is little-endian. Pre-fix every dword was bitcast as-is — vertex positions decoded as byte-reversed garbage, producing clipped or NaN positions in any draw that survived to the host. Mechanical changes: - crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads fetch_const dword 1 (endian) and wraps each lane's load in `gpu_swap(value, endian)`. New `gpu_swap` helper added to the emitted module header. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching `gpu_swap` helper added to the runtime interpreter shader. `interpret_vertex_fetch` reads fc1, computes the endian, and wraps every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT paths). Mirrors the AOT translator's emission. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~60M (within noise) Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call). Closes GPUBUG-102 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:18:46 +02:00
MechaCat02	c5c6713419	fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources Word-1 of every ALU triple holds three 8-bit component-relative swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7 per canary ucode.h:2064-2066) and three per-operand negate flags (bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT translator discarded word-1 entirely with `_ = w1;` — every ALU result was missing its swizzle (broadcast/permute patterns like `.zyxw`, `.xxxx`) and any negated operand was used positive instead. Component-relative semantics (canary's `AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output component i, the source component is `((swizzle >> (2*i)) + i) & 3`. Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in the interpreter shader treated it as absolute, also incorrect. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks them from word 1. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses component-relative semantics. interpret_alu decodes the modifiers and applies via apply_swizzle + apply_modifiers (with abs=false). - crates/xenia-gpu/src/translator.rs: src_operand emits the precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`, then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a bare base expression so it round-trips with the trivial-shader fixture. Abs is omitted in this commit — the abs flag is dual-meaning (for temps it lives at bit 7 of the src byte; for constants at word-2 bit 7 `abs_constants`). Wiring it up correctly requires more careful case-split logic; deferred to Phase G. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~58M (within noise) Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise because identity swizzle test merged into D1's parameterised test). WGSL still validates via naga (combined_module_parses_as_wgsl). Closes GPUBUG-100 (P0). Abs deferred to Phase G. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:15:07 +02:00
MechaCat02	78ea81c12a	fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h: 2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags selecting temp register (1) vs ALU constant (0); the corresponding 8-bit src byte indexes either: - a temp register (bits 5:0 = index, bits 6/7 reserved for relative-addressing / abs flags consumed by Phase D2), or - an ALU constant (full 8-bit index). Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F` on the src byte and emitted `r[low7]` regardless of the operand class. Every shader's WVP matrix / light constant / per-frame uniform read came back as r[low7] — typically zero — yielding invisible rendering. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp / src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our src_a (low byte of w0) is canary's third operand, hence its selector is bit 29 (canary src3_sel), not bit 31. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes the is_temp flag; constants index xenos_consts.alu directly. - crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when constant. The trivial-shader synthetic test was updated to set the temp flags so its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the flags set, all sources would now resolve as constants. Bank-selection (cf-level relative addressing for higher banks of the 512 ALU constants) remains a Phase G+ extension — covers c0..c127 in bank 0, which most Sylpheed shaders use directly. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged — gated by D2/D3/E for draws) draws: 0 → 0 packets: ~61M (within noise) Tests: 552 → 554 (+2 translator tests for the temp/constant decode). Closes GPUBUG-101 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:10:11 +02:00
MechaCat02	82f3d611e2	fix(gpu,kernel): KRNBUG-Vd-04 / GPUBUG-001 / XMODBUG-013 — VdSwap PM4 ring path The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/ blur "sample frame N for frame N+1" path samples fetch-constant slot 0 expecting the frontbuffer descriptor; without the patch, slot 0 stayed stale and any shader sampling it read garbage. This commit writes the canary VdSwap PM4 sequence directly into the primary ring at the current write pointer (read via the shared MMIO atomic), then advances WPTR over the injection. The natural CP drain consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant slot 0 — without going through any direct kernel→GPU bypass. Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521): 1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with base_address re-patched from virtual to physical >> 12). 2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys + width + height. Mechanism notes: - buffer_ptr in xenia-rs is in the system command buffer, NOT the primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to buffer_ptr because its ring layout maps the reserved slot inside the ring; xenia-rs's doesn't, so we have to write at the actual ring WPTR address (cached on KernelState.ring_base from VdInitializeRingBuffer). - The original "buffer_ptr zero-fill + bump WPTR by 64" path is preserved before the injection — it exposes any game-batched PM4 packets and keeps the buffer_ptr region skippable per existing game compat behavior. - A safety-net fallback at the end calls `notify_xe_swap` directly if swaps_seen didn't advance during the drain (e.g. a ring-arithmetic edge case). Idempotent — only fires when the PM4 path didn't. - KRNBUG-Mm-04 deferred: virt→phys uses the masked stub `virt & 0x1FFF_FFFF`, sufficient for the standard heap. Mechanical changes: - crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3 helpers + round-trip unit test (mirrors canary xenos.h:1682-1709). - crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor (Acquire-load) so the kernel can compute ring offsets. - crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords on KernelState (set by VdInitializeRingBuffer). - crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit block; patch fetch_dwords[1] base_address virt→phys before injection. Verification at -n 100M lockstep: swaps: 2 → 2 (game fires VdSwap exactly twice) draws: 0 → 0 (gated by Phases D+E) fallback warning: 0 occurrences (PM4 path consumed both swaps) instructions: ~100M Tests: 552 passing (553 with new pm4 round-trip test). Lockstep stable-fields determinism: byte-identical across two 100M runs. The "swaps > 2" prediction in the audit's plan assumed the game would fire VdSwap more often once the path worked; empirically Sylpheed only calls VdSwap twice within 100M instructions (this is the renderer plateau the audit identified). The success criterion for Phase C is that the PM4 path is now operational, which Phases D+E require for visible draws. Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:00:23 +02:00
MechaCat02	79eb52c378	xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve) First real GPU implementation. Ring/PM4 frontend (ring_view, ring_drain, pm4) drains the command processor; gpu_system owns the threaded backend (DrainFence RPC + parker/fence helpers from M1) and the MMIO-mapped register block (mmio_region). Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode the Xbox 360 microcode, translator.rs lowers it onto the WGSL xenos_interp interpreter shader (shaders/xenos_interp.wgsl). shader_metrics.rs counts decode/translate work. Render state: draw_state, primitive, render_target_cache, texture_cache, tiled_address (Xenos's swizzled tiled-memory layout), xenos_constants (register field constants), edram (the 10 MiB EDRAM model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs owns the typed GPU-resource handles the kernel hands out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:29:38 +02:00
MechaCat02	c694bb3f43	Initial commit: xenia-rs workspace for Xbox 360 RE Rust reimplementation of the xenia Xbox 360 emulator targeting reverse- engineering and preservation, initially scoped to Project Sylpheed. Includes: - XEX2 loader (LZX decompression, AES decryption, PE parsing) - XISO / XGD2 disc image VFS - PPC interpreter with 200+ opcodes and VMX128 decoding - Static analyzer: functions, cross-references, labels, asm + SQLite output - HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init - Debugger with in-memory and SQLite-backed execution tracing - `xenia-rs` CLI with extract/dis/exec commands that produce cumulative, superset SQLite databases and opt-in instruction/import/branch traces Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-16 23:14:56 +02:00

23 Commits