xenia-rs

Author	SHA1	Message	Date
MechaCat02	9d24dd0eaa	[iterate-3AJ] Present-anchor vsync so the splash logo fade-in renders The publisher/dev splash logo's intro fade-IN was skipped: the logo popped in at full brightness instead of ramping dim->bright like the canary oracle. Root (measured, iterate-3AF/3AI): ours' guest vsync counter is fed by a fixed-instruction-quantum proxy (one vsync per 150k retired instructions). During the ~1.1s splash asset-load the title's frame pump runs ~10M instructions inside a single guest frame, so the proxy fired ~66 vsyncs in that one frame. The pump's per-frame delta (counter_now - counter_last) was therefore ~66 on the first tick, which the anim tick (sub_823CDBF8) divides into the fade counter [item+72] @ 0x40c0add0 -> the counter JUMPED 0->0x42(66) in one step, landing past the fade-in region. Canary's wall-clock 60Hz vblank advances ~1 per heavy load frame, so its counter ramps smoothly 0->66 and the fade-in renders. Fix: anchor the lockstep vsync ticker to the guest's real present rate (VdSwap count), mirroring real hardware where the title double-buffers at vblank, so one heavy guest frame advances the vsync counter by ~1 instead of ~66. - interrupts.rs: tick_vsync_instr now takes the live present count. Two regimes: (1) bootstrap, before the guest's first present, keeps the original fixed instruction quantum unchanged -- the iterate-2W present-loop bootstrap needs vsyncs delivered BEFORE it can present (measured: callback registered ~6M instr, first delivered vsync and first present coincide; pure present-driven vsync would deadlock). (2) present-anchored, after the first present: one vblank per present, plus a small DRY_FALLBACK_CAP=4 instruction-quantum fallback per dry window so a non-presenting frame still ticks a few vsyncs (a small ramp like canary's 0/5/10/2/1...) without re-spiking to 66. - handle.rs: cheap GpuBackend::swaps_seen() accessor. - main.rs: pass the live present count into the lockstep ticker. 
Not masking: the fade dt/counter is never clamped or synthesized; the guest naturally computes a smooth dt once vblank tracks presents. Verified: - V1: fade counter 0x40c0add0 now ramps 0,6,8,10,12,13,+1... (was a 0->0x42 jump; direct baseline-vs-fix mem-watch). - V2 (--ui readback via per-frame logo vertex-alpha): logo alpha ramps 102,136,204,221,238,254 (dim->bright fade-IN) vs baseline all 255 (pop-in). Real artwork (has_real_vertices) still renders; milestone-1 intact. - V3: 150M boot progression intact -- texture_decodes=2, RTs=2, tex_cache=1 unchanged; draws/swaps higher (tighter present loop), 1B sanity linear, no stall/collapse. - V4: 50M --gpu-inline --stable-digest byte-identical 2x; golden re-baselined intentionally (pacing-only delta: draws 718->1274, swaps 147->259; structural fields unchanged). 688 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 21:01:33 +02:00
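A minimal sketch of the two-regime ticker described in iterate-3AJ, assuming a simplified signature. `VsyncTicker`, its fields and `tick` are hypothetical names, not the real `tick_vsync_instr`; only the 150k instruction quantum and `DRY_FALLBACK_CAP = 4` come from the commit message.

```rust
/// One vsync per 150k retired instructions (figure taken from the commit message).
const INSTR_QUANTUM: u64 = 150_000;
/// Max quantum-driven vsyncs allowed per window without a present.
const DRY_FALLBACK_CAP: u32 = 4;

/// Hypothetical ticker state; the real interrupts.rs layout is not shown in the log.
#[derive(Default)]
struct VsyncTicker {
    instr_accum: u64,
    last_presents: u64,
    dry_ticks: u32,
}

impl VsyncTicker {
    /// Returns how many vsyncs to deliver for this step.
    /// `retired` = instructions retired since the last call,
    /// `presents` = live present (VdSwap) count.
    fn tick(&mut self, retired: u64, presents: u64) -> u32 {
        self.instr_accum += retired;
        let quanta = (self.instr_accum / INSTR_QUANTUM) as u32;
        self.instr_accum %= INSTR_QUANTUM;

        // Regime 1 (bootstrap): before the first present, keep the fixed quantum
        // so the guest's present loop can start at all.
        if presents == 0 {
            return quanta;
        }

        // Regime 2 (present-anchored): one vblank per new present.
        let new_presents = presents.saturating_sub(self.last_presents) as u32;
        self.last_presents = presents;
        if new_presents > 0 {
            self.dry_ticks = 0;
            return new_presents;
        }

        // Dry window: allow a small capped fallback so a non-presenting
        // frame still ticks a few vsyncs without re-spiking.
        let allowed = DRY_FALLBACK_CAP - self.dry_ticks;
        let fallback = quanta.min(allowed);
        self.dry_ticks += fallback;
        fallback
    }
}
```

With this shape a 10M-instruction frame yields 66 vsyncs only during bootstrap; after the first present the same frame yields 1, or at most 4 if nothing is presented.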
MechaCat02	c62a355418	[iterate-3AE] Fix spurious WHITE TRIANGLE flashing before each splash logo The publisher and developer splash logos rendered correctly, but a fullscreen OPAQUE WHITE diagonal half-triangle flashed at boot, before each logo, and persisted across the dev-logo transition — canary shows a black background there. Readback-isolated it (env-gated frontbuffer grid + per-draw inventory, both removed) to the background-fill draws. ROOT (measured, refutes the prior "saturate/interpreter/depth" guesses): the position-only VS `0xd4c14f46` (one vfetch → oPos; exports NO color) paired with PS `0xed732b5a` (`ocolor0 = interp0`). The iterate-3T translator seeded `ointerp[0] = (1,1,1,1)` "so a VS that only exports position still yields a visible non-zero color" — a debug FAKE: it injects white that no guest value backs. So that fill's interp0 stayed white → opaque-white fullscreen triangle. Vertex windows of a WHITE frame and a steady BLACK frame were byte-identical; served_translated=true for all of them and depth is disabled in the replay, so the white came purely from the injected seed, not saturate/interp/depth. FIX (UI-translator only, golden byte-identical): - translator.rs: default un-exported interpolators to (0,0,0,0) instead of seeding interp0 white. A position-only VS now contributes nothing visible under its real blend (RGB=0 → black; A=0 → premult transparent), matching canary; every VS that really exports interp0 (the logo `0x03b7b020`, the color fill `0x36660986`) overwrites the seed → logos unaffected. - app.rs: clear the splash frontbuffer to BLACK, not the iterate-3S navy placeholder `[0.04,0.04,0.06]` (never matched to the guest). The fill is a fullscreen Xbox-360 RectangleList drawn as a single triangle in the replay (4th implied corner not yet synthesized), so its uncovered half exposed the clear; black makes the transition uniformly black like the oracle. (Full RectangleList→rectangle expansion is a separate follow-up.) 
READBACK (env-gated, removed): white-heavy frames 200+ → 0; navy frames 240 → 0; transition frames uniformly black; the publisher logo (white text + red dots) and the developer logos (colored, on black) still render. Determinism: changes feed only the UI translator/clear; n50m --gpu-inline --stable-digest byte-identical 2× and matches the committed golden (--expect exit 0). cargo test --workspace 686 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 14:09:54 +02:00
MechaCat02	3f8d3b6f1c	[iterate-3AD] Fix 2nd splash logo rendering black: re-upload evolving atlas The publisher (SQUARE ENIX) and the 2nd developer/studio splash logo share one K8888 atlas at physical base 0x4dbee000, sampled at different UVs. The publisher's white text occupies the top V-bands; the developer logo's (bluish/gold) artwork is CPU-written into the SAME surface AFTER the publisher frame, so the atlas evolves across frames. The UI host texture cache (`texture_cache_host::upload`) only re-uploads a `TextureKey` when `version_when_uploaded` increases. But the per-draw bind in `render.rs` hardcoded `version_when_uploaded = 1` for every draw, so once the atlas was first uploaded (during the publisher frame, with only the top bands filled) the cache pinned that partial upload. The 2nd logo, sampling a V-band that was still zero at first-upload time, read transparent-black -> rendered nothing (the "white-triangle / black stub" the user saw after SQUARE ENIX). Verdict: (G) a legitimate 2nd LOGO item whose real artwork lives in the same evolving atlas — NOT a spurious 3rd item, and NOT a geometry/shader/blend gap. Measured via readback: the 2nd-logo geometry rasterizes correctly (3 on-screen quads), interp1 (UV) and interp0 (color) reach the PS with real values, the texture content at the sampled bands exists — only the bound wgpu texture was the stale partial upload. Fix (UI-only, deterministic core untouched): - `gpu_system`: thread the real content `version` (from `span_max_version`) into `last_draw_textures` (now `(key, version, bytes)`). - `draw_capture::DrawCapture.textures`: same 3-tuple. - `render.rs`: use the real `version` (not a hardcoded 1) so the host cache re-uploads when the guest fills more of the atlas. - `exports.rs` `vd_swap`: the legacy single-texture `publish_texture` bridge drops the version (`(key, _v, bytes) -> (key, bytes)`). 
Readback (env-gated probe, removed before commit): after the fix the 2nd logo renders real varied artwork (blue + gold texels in a centered strip) instead of black. Determinism: `check -n50m --gpu-inline --stable-digest` byte-identical to the `c0c6088` baseline (captured both via git-stash). 686 tests green. No faking: real decoded texels through the real guest draw. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 13:38:35 +02:00
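A minimal sketch of the version-gated upload rule from iterate-3AD, assuming a plain `HashMap` cache. `HostTextureCache` and the `TextureKey` fields here are hypothetical stand-ins; the real `texture_cache_host::upload` signature is not shown in the log.

```rust
use std::collections::HashMap;

/// Hypothetical key; the real `TextureKey` fields are not listed in the commit.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct TextureKey {
    base: u32,
    width: u32,
    height: u32,
}

/// Hypothetical host cache: remembers the content version seen at upload time.
#[derive(Default)]
struct HostTextureCache {
    version_when_uploaded: HashMap<TextureKey, u64>,
    uploads: usize,
}

impl HostTextureCache {
    /// Re-uploads only when `version` is newer than the cached one.
    /// Returns true if an upload happened.
    fn upload(&mut self, key: TextureKey, version: u64, _bytes: &[u8]) -> bool {
        match self.version_when_uploaded.get(&key) {
            // Same or older content: keep the existing host texture.
            Some(&seen) if version <= seen => false,
            _ => {
                self.version_when_uploaded.insert(key, version);
                self.uploads += 1;
                true
            }
        }
    }
}
```

Passing a constant version of 1 on every draw, as the old bind did, hits the first arm forever after the initial upload, which is the pinned partial atlas described above.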
MechaCat02	c0c6088e4d	[iterate-3AA] Fix logo upside-down: no Y-flip on the clip-enabled NDC path DEFECT 1 (logo upside down) ROOT + FIX. The publisher "SQUARE ENIX" logo rendered vertically mirrored vs the canary oracle (white upright on black). Measured (env-gated readback + texture-row + per-vertex dumps, all removed): - The K8888 logo texture decodes UPRIGHT (text in the top rows 1..161; the red dots sit at ~43% from the texture top). NOT a decoder row-order bug. - The logo geometry is a centered QuadList whose vertices are emitted in clip space (Y-UP, e.g. pos.y +0.085 top / -0.104 bottom), with the texture V mapped top->bottom (UV v 0.001 at the top vertex, 0.090 at the bottom). On both the Xbox 360 (D3D9) and wgpu, clip +Y maps to the framebuffer top — so a clip-space position is portable with NO Y-flip. - `compute_ndc_xy` unconditionally negated Y (the flip the screen-space pixel path legitimately needs). For the clip-enabled logo this swapped top<->bottom vertices while leaving the texture V unchanged, so the sampled sub-rect read bottom-up: red dots rendered at 58% from the top (a clean vertical mirror) instead of 43%. FIX: keep the Y-flip only on the clip_disable (screen-space pixel) branch where the framebuffer Y-down->wgpu Y-up flip is real; the clip-enabled branch now passes clip-Y-up through identity. Readback after the fix: red dots at 42% from the top (= texture's 43%) -> logo UPRIGHT, still centered. DEFECT 2 (background) was already correct + faithful; 3Z's contradiction is REFUTED by direct readback: the bg fill (vs 0x36660986 / ps 0xed732b5a, fullscreen RectangleList) reads its real vertex color (raw 0x818000c7 = -32896.5 as float) into r0, the PS exports it, and the GPUBUG-115 RB-UNORM saturate (canary spirv_shader_translator.cc:3607) clamps it to 0 -> BLACK, matching canary. The seed r0=(gvidx,...) does NOT show through (it's overwritten by the color vfetch). No code change needed. 
Readback of the full frame now matches canary: WHITE upright "SQUARE ENIX" + red dots on a BLACK field. UI-capture-path only (`compute_ndc_xy` runs solely when frame_captures is Some, i.e. --ui; None headless) -> deterministic core untouched, n50m --gpu-inline --stable-digest exit 0 (DRAW_INDX 275 / K8888 decode 137, identical across runs). cargo test --workspace green. Temp probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 21:18:30 +02:00
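A minimal sketch of the branch split from iterate-3AA, assuming positions are transformed directly. `to_host_clip` is a hypothetical simplification; the real `compute_ndc_xy` is plumbed into per-draw constants and its exact signature is not shown in the log.

```rust
/// Maps a guest VS output position to host clip space.
/// `clip_disable = true`: position is in render-target pixels (Y-down), so it is
/// rescaled to [-1, 1] and Y is flipped.
/// `clip_disable = false`: position is already clip space (Y-up) and passes through.
fn to_host_clip(pos: [f32; 2], clip_disable: bool, rt_size: [f32; 2]) -> [f32; 2] {
    if clip_disable {
        [
            pos[0] / rt_size[0] * 2.0 - 1.0,
            1.0 - pos[1] / rt_size[1] * 2.0, // the only place the Y-flip belongs
        ]
    } else {
        pos // identity: clip +Y already maps to the framebuffer top
    }
}
```

For the logo's top vertex at clip y = +0.085 the clip-enabled branch keeps it at +0.085, so the top vertex stays paired with the top texture row.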
MechaCat02	f6f3aac673	[iterate-3Z] Fix logo color (yellow->white): k_8_8_8_8 vfetch + vfetch field/stride/saturate Defect 2 of the three render-fidelity defects vs the canary oracle (the publisher "SQUARE ENIX" logo rendered YELLOW instead of WHITE). Root, measured by readback (env-gated probes, removed): the logo PS multiplies the sampled texture by the interpolated vertex COLOR; the K8888 texture itself decodes correctly (67,667 white texels + 2,087 red — the red dots — zero yellow), so the yellow came from the vertex-color attribute decode. Four coupled, canary-faithful fixes (all UI-translator/capture only — the deterministic headless core is untouched; n50m --gpu-inline --stable-digest golden byte-identical, exit 0): - GPUBUG-112 (translator vfetch): VertexFormat 6 = k_8_8_8_8 (4x u8 normalized, 1 dword), NOT k_16_16 (which is 25) per canary xenos.h:643. The logo color stream is k_8_8_8_8; decoding it as k_16_16 read only 2 of 4 channels and forced BLUE = 0 -> white texture x (R,G,0) = yellow. Now unpacks all four 8-bit channels (canary spirv_shader_translator_fetch.cc k_8_8_8_8 packed_offsets 0/8/16/24); added k_16_16 (format 25) too. - GPUBUG-113 (ucode/fetch): vfetch is_signed / is_normalized / is_mini_fetch bit positions were wrong (read bits 24/25, which sit inside exp_adjust). Per canary ucode.h:757-758,764: signed=fomat_comp_all (w1 bit12), normalized=(num_format_all==0) (w1 bit13), mini_fetch (w1 bit30). - GPUBUG-114 (translator vfetch): a vfetch_mini reuses the address AND STRIDE of the preceding full vfetch of the same stream (canary ucode.h:733); its own stride field is 0. Track the last full stride per fetch-const and inherit it so a mini color/UV attribute indexes by the real vertex stride, not its tight dword count. - GPUBUG-115 (translator PS export): saturate the color export to [0,1] before the UNORM render-target write, mirroring canary spirv_shader_translator.cc:3607 ("Saturate, flushing NaN to 0"). 
Without it an out-of-range guest color writes garbage to the sRGB target. Verified by env-gated frontbuffer readback (copy_texture_to_buffer, removed before commit): the logo now renders WHITE text + RED dots (bbox centered ~y322-389), zero yellow anywhere. Workspace tests green (added 4: k_8_8_8_8 4-channel unpack, mini-fetch stride inheritance, vfetch bit decode, PS saturate). Determinism: golden byte-identical. Remaining (defects 1 & 3, see memory iterate-3Z): logo orientation and the ed732b5a fullscreen background fill (renders ~white, canary shows black) — both localized but not yet cleanly resolved; plan in the memory file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 20:58:21 +02:00
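A minimal sketch of the GPUBUG-112 unpack and the GPUBUG-115 saturate, assuming the dword is already endian-swapped and no swizzle applies. Both function names are hypothetical; only the bit offsets 0/8/16/24 and the "flush NaN to 0" behavior come from the commit message.

```rust
/// Unpacks one k_8_8_8_8 dword into four normalized channels
/// (bit offsets 0/8/16/24, 8 bits each).
fn unpack_8_8_8_8_unorm(dword: u32) -> [f32; 4] {
    let ch = |shift: u32| ((dword >> shift) & 0xFF) as f32 / 255.0;
    [ch(0), ch(8), ch(16), ch(24)]
}

/// Clamps a color component to [0, 1], flushing NaN to 0.
/// `f32::clamp` alone would propagate NaN, hence the explicit check.
fn saturate(v: f32) -> f32 {
    if v.is_nan() {
        0.0
    } else {
        v.clamp(0.0, 1.0)
    }
}
```

Reading only the low two channels of a white vertex color leaves the upper channels at 0, which multiplied with a white texture gives the yellow tint described above.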
MechaCat02	2a992db47b	[iterate-3Y] Replay per-draw blend + write-mask so the logo composites visible The publisher logo rendered its real artwork in isolation (3X) but was overpainted in the full composite: every replayed draw used ONE fixed SrcAlpha/OneMinusSrcAlpha pipeline + an opaque-magenta texture stub, so the textured RectangleList draws whose sampler slot is shadowed by a vertex-fetch constant (no resolvable texture) wrote opaque magenta over the logo. Per-draw render-state inventory at the splash (env-gated probe, removed): - logo QuadList vs=0x03b7b020 ps=0x03b79001: bc0=0x07010701 (One,OneMinusSrcAlpha — premultiplied alpha), cmask=0xF, ntex=1 (real K8888) - RectangleList vs=0xd4c14f46 ps=0x03b79001: SAME premult blend, ntex=0 (slot 0 holds a type=3 vertex constant → texture decode rejects) → magenta - opaque fill vs=0x36660986 ps=0xed732b5a: bc0=0x00010001 (One,Zero) — green Draw order: the logo is drawn LAST per group, so order was not the problem; the fixed pipeline state was. Change (UI-side capture/replay only): - draw_capture: capture RB_BLENDCONTROL0 + RB_COLOR_MASK (+ colorcontrol / depthcontrol for follow-ups) per draw. - xenos_pipeline: new RenderState{blend_control,color_mask}; map Xenos blend factors/ops -> wgpu mirroring canary kBlendFactorMap/kBlendFactorAlphaMap; One,Zero,Add => blend:None (opaque); zero-channel mask => ColorWrites; cache translator AND interpreter pipelines keyed on (vs,ps,RenderState) / RenderState so each draw composites with its real state. - render: pass each capture's RenderState through both replay paths. - dummy texture magenta(255,0,255,255) -> transparent(0,0,0,0): an unresolvable texture now contributes nothing under its real premult blend instead of fabricating opaque magenta (removes a fake, adds none). Readback (env-gated, removed): full 1280x720 composite now shows the logo's real artwork (maxR=255, 50-102 distinct colors/cell) in a centered strip; no magenta anywhere. 
Background is uniform green (the 0xed732b5a opaque fill) — a separate vertex-color/shader fidelity issue, NOT compositing (next iterate). Determinism: UI-only; draw_capture additions only run when frame_captures=Some. check -n50m --gpu-inline --stable-digest --expect = "matches golden" (2x). cargo test --workspace = 682 passed. Temp probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 19:53:25 +02:00
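A minimal sketch of decoding the color half of RB_BLENDCONTROL0 for the two values measured in iterate-3Y. The bit layout (src in bits 0..4, op in bits 5..7, dst in bits 8..12) and op 0 = Add are assumptions that happen to agree with the commit's readings of `0x07010701` and `0x00010001`; all type and function names are hypothetical.

```rust
/// Only the factors the commit message names are spelled out; the rest stay raw.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Factor {
    Zero,
    One,
    OneMinusSrcAlpha,
    Other(u32),
}

fn factor(raw: u32) -> Factor {
    match raw {
        0 => Factor::Zero,
        1 => Factor::One,
        7 => Factor::OneMinusSrcAlpha,
        n => Factor::Other(n),
    }
}

#[derive(Debug, PartialEq)]
struct ColorBlend {
    src: Factor,
    op: u32, // assumed: 0 = Add
    dst: Factor,
}

/// Decodes the color blend fields (assumed layout, see lead-in).
fn decode_color_blend(blend_control: u32) -> ColorBlend {
    ColorBlend {
        src: factor(blend_control & 0x1F),
        op: (blend_control >> 5) & 0x7,
        dst: factor((blend_control >> 8) & 0x1F),
    }
}

/// One, Zero, Add means the draw overwrites the target, so no host blend is needed.
fn is_opaque(b: &ColorBlend) -> bool {
    b.src == Factor::One && b.dst == Factor::Zero && b.op == 0
}
```

Keying the pipeline cache on the decoded state is what lets the premultiplied logo draw and the opaque fill coexist in one frame.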
MechaCat02	89b5c39d8a	[iterate-3X] Real splash logo geometry renders: fix vertex-fetch const_index_sel + per-draw submit Two readback-proven root-cause fixes make the publisher-logo QuadList draw land its REAL captured vertex buffer (the texture was already correct from 3V). REFUTES iterate-3W's "logo geometry is auto-generated from vertex_id": the logo IS sourced from a 4-vertex QuadList buffer at guest physical 0x0adf60f0 (measured), it was just resolved at the wrong fetch-constant register. GPUBUG-110 (vertex fetch const_index_sel dropped). The Xenos vertex-fetch instruction encodes const_index (w0[20:24]) AND const_index_sel (w0[25:26]); the full constant index is const_index*3 + const_index_sel (canary ucode.h:700), packed 3 two-dword constants per 6-dword register group. ucode/fetch.rs decoded only const_index and read sub-slot 0 (fc*6). The logo vfetch is const_index=31, sel=2 -> the real base lives at reg 0x48BE, but ours read 0x48BA which held an unused 0x00000001 (base=0,size=0) slot. So resolve_vertex_window returned None -> has_real_vertices=false -> the logo fell to the procedural fullscreen magenta fallback. Fix: decode const_index_sel, add VertexFetch::const_reg_offset() = const_index*6 + sel*2, and use it in both draw_capture.rs (capture) and translator.rs (the WGSL endian term + no-window fallback base; the old expression there read the src_reg bits, not the const index). Measured: logo now resolves a 24-dword (4 verts x stride 6) window, base 0x0adf60f0. GPUBUG-111 (single batch encoder = last-draw-wins vertex data). In wgpu every queue.write_buffer staged before a single queue.submit is applied before ANY command in that submit runs. dispatch_xenos_captures recorded the whole batch into one encoder + one submit, so every draw read only the LAST draw's vertex buffer / per-draw uniforms. The logo quad therefore sampled the trailing fullscreen background quad's vertices and rasterized nothing where the logo was. 
Fix: submit one encoder per draw (frontbuffer LoadOp::Load composites identically). Measured (env-gated readback, removed): with this fix the logo draw in isolation renders real varied texels (e.g. (225,17,22)/(255,255,0)) in a centered strip (~20k px), vs 100% navy before. Determinism: all changes are UI-side (xenia-ui replay) or the UI translator / capture path (frame_captures None in headless); the fetch.rs field addition is purely additive and does not change any existing decoded value. Verified the deterministic core unchanged: check -n50M --gpu-inline --stable-digest exit 0 and all 136 metric counters byte-identical across two runs. All temp probes removed. cargo test --workspace green; new regression test vertex_fetch_const_index_sel_and_reg_offset. Known remaining (next iterate): a fullscreen flat QuadList (ps 0x03b79081, vertex color green, no texture) and other textureless draws overpaint the logo in the full composite (their per-draw blend/alpha render state is not yet replayed, and draw order alternates bg/logo). The logo artwork renders correctly in isolation; the composite is not yet clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 19:25:50 +02:00
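A minimal sketch of the GPUBUG-110 register arithmetic, assuming the fetch-constant block starts at register 0x4800 (inferred from 0x48BE minus the computed offset, not stated in the log). The struct is a hypothetical two-field stand-in for the real `VertexFetch`.

```rust
/// Assumed first fetch-constant register (inferred, see lead-in).
const FETCH_CONST_BASE: u32 = 0x4800;

/// Hypothetical subset of the decoded vertex-fetch fields.
struct VertexFetch {
    const_index: u32,     // w0[20:24]
    const_index_sel: u32, // w0[25:26]
}

impl VertexFetch {
    fn decode(w0: u32) -> Self {
        Self {
            const_index: (w0 >> 20) & 0x1F,
            const_index_sel: (w0 >> 25) & 0x3,
        }
    }

    /// Dword offset of the two-dword constant inside the fetch-constant block:
    /// 6 dwords per register group, 2 dwords per sub-slot.
    fn const_reg_offset(&self) -> u32 {
        self.const_index * 6 + self.const_index_sel * 2
    }
}
```

For the logo vfetch (const_index = 31, sel = 2) the offset is 190 = 0xBE, giving register 0x48BE; dropping `sel` gives 186 = 0xBA, the empty slot the old decoder read.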
MechaCat02	39723dfe37	[iterate-3W] draw_capture: walk CF exec sequence to find the real vertex fetch Fix the UI-side vertex-window resolver (`resolve_vertex_window`) so it identifies vertex fetches via the control-flow `Exec` clause `sequence` bitmap instead of blindly decoding every 3-dword triple. Root cause (GPUBUG-109): the Xenos instruction block packs ALU and fetch instructions identically (96 bits each); only the owning `Exec` clause's `sequence` bitmap (2 bits per instruction, bit[2*i] = fetch/ALU) tells them apart. The old resolver scanned every triple and trusted the first that happened to decode as a vertex fetch, gated by a `dword0 & 3 == 3` "type" guard. On real shaders this mis-decoded ALU triples as fetches and either picked a garbage fetch-constant slot or rejected the clause before reaching the true vertex fetch. Now walk the CF exec clauses exactly as the translator does (`translator.rs::emit_exec`) and take the first sequence-flagged vertex fetch. Measured (env-gated probes, removed before commit): the resolver now reaches the real fetch on every splash VS. The RectangleList draws (vs 0x36660986 / 0xd4c14f46) keep resolving real geometry (valid fetch const 0). The publisher-logo QuadList (vs 0x03b7b020) is correctly seen to fetch from a fetch constant whose dword0 = 0x1 (no vertex buffer) -- i.e. its geometry is NOT sourced from a memory vertex buffer, so it still (correctly) falls to the procedural path. That remaining gap (the logo's auto-generated/index-derived geometry) is the next milestone-1 step; this commit removes the decoder defect that masked it. Determinism: UI-only. `resolve_vertex_window` runs only when `frame_captures` is `Some` (i.e. `--ui`); the headless `--gpu-inline` core never calls it. `check -n50000000 --gpu-inline --stable-digest` exit 0 and byte-identical run-to-run. cargo test --workspace: 681 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:50:13 +02:00
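A minimal sketch of the `sequence` bitmap test from GPUBUG-109, assuming a set low bit of each 2-bit pair marks a fetch and a clear bit marks ALU. `fetch_slots` is a hypothetical helper, not the resolver itself.

```rust
/// Returns the instruction slots of an exec clause that are flagged as fetches.
/// `sequence` holds 2 bits per instruction; bit[2*i] is the fetch/ALU flag
/// (assumed: 1 = fetch), bit[2*i + 1] is ignored here.
fn fetch_slots(sequence: u32, count: u32) -> Vec<u32> {
    (0..count.min(16)) // 16 pairs fit in 32 bits; guards the shift below
        .filter(|&i| (sequence >> (2 * i)) & 1 == 1)
        .collect()
}
```

Only slots returned here would then be decoded as fetch instructions; every other triple is treated as ALU regardless of how its bits happen to look.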
MechaCat02	da7c29b6d2	[iterate-3V] Fix logo texture: map texture-fetch physical base onto backing window The publisher-logo texture (K8888 1280x768 linear, `E59B2B3D`'s tfetch surface) rendered flat/transparent because the GPU texture decode read the wrong host bytes -- NOT because the asset was never decompressed. First-divergence (vs canary, measured both engines): - Ours DOES read game:\hidden\Resource3D\.xpr in full, builds a byte-identical cache, decompresses the logo, and CPU-writes the real artwork (~839K nonzero bytes) into the texture buffer -- at the guest physical-aperture VA 0x4dbee000 (writer sub_823C3E70 @ 0x823c3f8c). This REFUTES the iterate-3U verdict that the texture was never filled. - BUT the GPU decode used the raw fetch-constant base 0x0dbee000 as a virtual address. In ours' flat 4GB memory, virtual 0x0dbee000 and the physical alias 0x4dbee000 are DIFFERENT host bytes (no aliasing in the read/write path), so the decode read all-zeros. The Xenos texture fetch constant carries a guest physical base; the CPU writes texels through its cached-physical aperture, which ours backs at the committed 0x4000_0000 window. Map the base via the existing `physical_to_backing` helper before reading -- exactly as the vertex fetch path (draw_capture.rs, iterate-3Q) and as canary reads textures through its GPU shared memory (= physical). Measured after fix (env-gated probe, removed): the logo decode reads base=0x4dbee000 and produces 839068/3932160 nonzero bytes (21.3%) -- a centered logo on a transparent field, matching canary's ~21% exactly. Determinism: GPU-side pure read; no CPU/guest-memory state changes. The n50m --gpu-inline --stable-digest golden is byte-identical (verified 2x, texture_cache_entries unchanged). cargo test --workspace green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:07:00 +02:00
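A minimal sketch of the physical-to-backing rebase used in iterate-3V, assuming a 512 MiB physical range masked with `0x1FFF_FFFF`. The function below is a hypothetical stand-in named after, but not copied from, the `physical_to_backing` helper; only the `0x4000_0000` window comes from the commit message.

```rust
/// Base of the committed cached-physical window (from the commit message).
const PHYS_WINDOW_BASE: u32 = 0x4000_0000;
/// Assumed physical address mask (512 MiB of guest physical memory).
const PHYS_ADDR_MASK: u32 = 0x1FFF_FFFF;

/// Rebases a guest physical address from a fetch constant onto the
/// address range that is actually backed by host memory.
fn physical_to_backing_sketch(phys: u32) -> u32 {
    PHYS_WINDOW_BASE + (phys & PHYS_ADDR_MASK)
}
```

Applied to the logo's fetch-constant base 0x0dbee000 this gives 0x4dbee000, the address the CPU writer actually filled.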
MechaCat02	1b9918450f	[iterate-3T] Real UV interpolation + per-draw textures: shader/UV/bind chain complete Build the full texture-sampling chain for the publisher splash so the textured logo CAN sample real artwork at the guest's real UVs. Measured with an env-gated frontbuffer readback (since removed): the chain is correct end-to-end, but the sampled K8888 1280x768 texture is ALL-ZERO in the UI window's reachable boot range — the artwork is produced by an EDRAM resolve (RT->texture copy) that ours does not yet perform (resolves=0). So this lands the correct shader/UV/bind work and isolates the remaining blocker to the resolve gap, not the shader path. Translator (xenia-gpu/src/translator.rs), all UI-translator-only: - Real Xenos export-index model (replaces the AllocKind heuristic that collapsed every VS export to one color slot and DROPPED the texcoord). When export_data is set the 6-bit vector_dest IS the export index: VS 62=oPos, 0..15=interps; PS 0=RT0. The logo VS exports oPos(62), interp0(color), interp1(UV) distinctly. - Real interpolator passthrough: VsOut carries 8 interpolator locations; the PS seeds r[i] = in.interp[i] (Xenos PS-input-GPR mapping) so tfetch samples at the real interpolated texcoord (r1) instead of (0,0). - vfetch format 6 (k_16_16) packed-16 unpack + per-attribute dword offset, so the 3 vfetches sharing one fetch-constant (pos/UV/color in a 6-dword vertex) read the right attribute. Previously rejected the whole logo VS to the interpreter. - QuadList/RectangleList host->guest vertex-index remap in the VS (replay is non-indexed): QuadList 6 host verts -> guest [0,1,2,0,2,3] (full quad). fetch.rs: decode vfetch `offset` (dword2[8:15], dwords), `is_signed`, `is_normalized`. 
Per-draw textures: DrawCapture carries the decoded texture(s) (keyed off the active PS's tfetch slots, attached in gpu_system after decode); render.rs::dispatch_xenos_captures uploads + binds each capture's texture via the host texture cache before its draw, instead of one last-draw primary_texture. Determinism: all changes feed only the UI translator/capture path; frame_captures is None headless. `check -n50m --gpu-inline --stable-digest --expect` byte-identical (exit 0). 681 tests pass (+2 regression: logo VS now translates with interpolators; PS seeds interps into registers). Temp readback/dump probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 17:12:16 +02:00
MechaCat02	80fbff8bd1	[iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform The 3O→3R real-render slice ran the guest's real translated VS/PS on real captured vertices at full boot speed, but the --ui window stayed blank. Bifurcated with an env-gated frontbuffer readback + per-vertex NDC dump (both removed): the captured splash quads (RectangleList, k_32_32_FLOAT, 3 verts) were non-zero and sane, so this was a transform/decode chain of bugs, not missing geometry. Four coupled root causes: - GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but canary's AluInstruction lays dest/write-mask/export/scalar-opcode in w0, the vector opcode + source regs in w2, swizzle/negate/pred in w1. The misread made every export ALU decode with vector_write_mask=0 → no oPos/oColor export emitted → the translated VS collapsed every vertex to the clip origin. Rewrote the field map to match ucode.h:2036-2086. - GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's X into .w → negative W → the rectangle clipped behind the camera. Decode the real VertexFormat + dword stride and emit the matching component read (1/2/3/4 float formats; others reject to the interpreter). - GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed the buffer base from xenos_consts.fetch[], but that uniform carries the last-published per-frame fetch constant, not this draw's (stale 0x8a000002 vs the real base). The captured window already begins at the fetch base, so index from 0 (vertex i at i*stride) when a real window is present; only the synthetic fallback consults the uniform. - iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL): the guest VS emits screen-space pixel coords (clip disabled, VTE viewport scale/offset off). 
Added compute_ndc_xy (mirrors canary GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in both the translated and interpreter VS. Result (env-gated readback, since removed): the real splash geometry now fills ~50% of the frontbuffer in a clean triangular coverage pattern, real positions from real guest vertices through the real translated shaders (textures are the next stage — sampled color is still the magenta/white texture stub, tex-cache=0). Headless-inert: draw_capture is only built when frame_captures is Some (--ui); the changed decoders feed only the UI translator/metrics. Golden byte-identical (check -n50m --gpu-inline --stable-digest exit 0); 679 workspace tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 16:35:01 +02:00
MechaCat02	a3aa3cc7d6	[iterate-3Q] draw_capture: read vertex window via physical alias The UI geometry-capture read the vertex-fetch base at its bare low VA (~0x0adf_xxxx), which is unmapped in ours, so it copied all-zeros. The fetch constant's address:30 field is a guest physical dword address (canary reads it via Memory::TranslatePhysical, draw_util.cc:961). Ours only maps the cached-physical window at 0x4000_0000 (physical_to_backing). Rebase a low physical base onto that mapped alias when the raw VA is unmapped; window_base_dwords still carries the original base so the shader's rebase indexes the uploaded window. Decode itself was verified correct against canary (xe_gpu_vertex_fetch_t + GetVertexFetch + ucode.h vfetch const_index*3+const_index_sel): for the splash draws const_index_sel==0, so ours' stride-6 register offset lands on the exact same constant as canary's stride-2 offset; raw dwords match byte-for-byte. UI-only path (frame_captures is None in headless), so the deterministic --gpu-inline golden is byte-identical (verified) and 679 tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:28:49 +02:00
MechaCat02	6ff184694d	[iterate-3P] Real splash geometry in --ui: fix CF predication decode + translator op coverage Stage 1 of the iterate-3O resume plan: make the P7 translator actually compile the splash's real VS/PS so real per-vertex POSITIONS render via the host wgpu pipeline, instead of every draw falling to the interpreter (which emits a placeholder triangle). Two coupled fixes, both faithful (Route A): 1. ucode/control_flow.rs (GPUBUG-103): clause-level predication was decoded from payload bits 28/29, which fall inside the exec clause's `sequence_`/ `vc_hi_` fields, NOT the predicate flag. That stamped `predicated=true` on plain `kExec` clauses, so the translator rejected EVERY splash VS as `cf_cond`. Per canary ucode.h, clause predication is determined by the opcode (only kCondExecPred* = 5/6/13/14 are predicate-register-gated; their `condition_` is at word1 bit 9 = payload bit 41). kExec/kExecEnd (1/2) run unconditionally; kCondExec (3/4) is bool-constant-gated (not modeled). Diagnosed live in --ui: reject reason cf_cond on all 7 splash shader pairs → after fix, predicated=false and CF passes. 2. translator.rs: with CF passing, the next reject was `scl_op_unsupported` for scalar opcodes 4 (kMulsPrev2 / LIT emul) and 8 (kSgts), plus thin vector coverage. Expanded vector_expr + scalar_expr to mirror the runtime interpreter's op set (which mirrors canary AluVectorOpcode/AluScalarOpcode): CND_EQ/GE/GT, TRUNC, MAX4, DST for vectors; the full SEQS/SGTS/SGES/SNES, MULS_PREV2 (with the -FLT_MAX / non-finite / b<=0 guard), SUBS(_PREV), EXP/LOG/RCP/RSQ/SQRT/SIN/COS, FRCS/TRUNCS/FLOORS for scalars. Side-effecting ops (setp/kills/maxas*) still reject → interpreter fallback (honest). Result (--ui, measured): xlated-pipelines 0→6, all draws served by the translator (served_interp=0) — real VS/PS now run on the host GPU. 
The splash is still not visibly correct because the captured guest vertex windows read all-zero: the vertex-buffer base VA (~0x0adf_xxxx) is UNMAPPED in guest memory (mem.translate()==None). That is a CPU/kernel memory-mapping gap, not a GPU-render gap — the next stage. Determinism: both files are in xenia-gpu core but the CF `predicated` field only feeds the UI translator + a metric tag, never deterministic state. Verified: `check -n50000000 --gpu-inline --stable-digest` matches the golden byte-for-byte (exit 0); 679 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:07:06 +02:00
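A minimal sketch of the GPUBUG-103 rule that clause gating follows the CF opcode rather than payload bits 28/29. `ClauseGate`, `clause_gate` and `predicate_condition` are hypothetical names; the opcode numbers and the bit-41 condition position are taken from the commit message.

```rust
/// How an exec-family CF clause is gated.
#[derive(Debug, PartialEq)]
enum ClauseGate {
    Unconditional,     // kExec / kExecEnd
    BoolConstant,      // kCondExec variants (not modeled by the translator)
    PredicateRegister, // kCondExecPred* variants
    NotExec,           // any other CF opcode
}

/// Classifies a CF opcode; predication is a property of the opcode alone.
fn clause_gate(opcode: u8) -> ClauseGate {
    match opcode {
        1 | 2 => ClauseGate::Unconditional,
        3 | 4 => ClauseGate::BoolConstant,
        5 | 6 | 13 | 14 => ClauseGate::PredicateRegister,
        _ => ClauseGate::NotExec,
    }
}

/// Reads the predicate condition from a 48-bit CF payload
/// (word1 bit 9 = payload bit 41). Only meaningful for PredicateRegister clauses.
fn predicate_condition(payload: u64) -> bool {
    (payload >> 41) & 1 == 1
}
```

A plain `kExec` with payload bits 28 or 29 set now classifies as `Unconditional`, so it no longer trips the `cf_cond` reject.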
MechaCat02	504592ac13	[iterate-3O] Real-render slice: replay guest geometry in --ui (Route A) Replace the synthetic placeholder triangle in the --ui window with the splash's REAL guest geometry, proving the faithful-render pipe end to end. Architecture: Route A (UI-side replay). A per-draw capture channel carries each PM4_DRAW_INDX's real state to the UI, which replays it through the existing wgpu Xenos pipeline. The deterministic headless core is untouched: capture is gated on an Option<Vec<DrawCapture>> that is None in headless mode and only enabled on the --ui path, so the --gpu-inline n50m golden is byte-identical (verified 2x). The hard part was sourcing real vertices. The WGSL VS already does format-aware vertex fetch from the b4 storage buffer at the address from the fetch constant -- but b4 was never populated and the fetch address is an absolute guest dword address. The slice: * xenia-gpu/draw_capture.rs: parse the active VS, find its first vertex fetch, read that fetch constant, copy a bounded window of guest memory at the fetch base. Best-effort: has_real_vertices=false falls back to procedural geometry (never fabricated pixels). * gpu_system.rs: accumulate one DrawCapture per draw into frame_captures. * exports.rs (vd_swap): drain + publish the frame's captures to the UI. * ui_bridge/bridge.rs: new publish_geometry channel + UiHandles.geometry. * WGSL (interp + translator): rebase the absolute fetch address by a new DrawConstants.vertex_base_dwords so it indexes the uploaded window. * render.rs: dispatch_xenos_captures uploads each draw's real vertex window + matching shader, issues real DrawRequests (real prim type, host vertex count, vs/ps keys). * app.rs: prefer the real-capture replay; HUD adds real-geo=N counter. Verified in --ui on Sylpheed: "first Xenos capture batch replayed (real geometry) captures=24 real_vertex_draws=24" -- all draws resolved a real guest vertex window; WGSL compiles; no validation errors over 1616 swaps. 
Still synthetic-free but not yet pixel-perfect: textures/UVs, DMA index buffers (auto-index only for now), and kCopy resolve routing are staged for follow-ups. Faithful: real vertex data, prim types, shaders, constants. cargo test --workspace green; n50m golden unchanged (2x byte-identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 22:38:46 +02:00
MechaCat02	6bb4355e3d	[iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds The publisher splash (title idx0) rendered FLAT in ours while canary samples a texture: ours never decoded the logo's textured pixel shader (E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same microcode canary does (verified byte-identical against the Wine oracle). The shader was misparsed as flat. Three coupled bugs in the ucode decoder, all off vs canary `gpu/ucode.h`: 1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc, 15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal `Exit`, truncating the CF block and dropping every instruction (incl. the `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on an END-bit exec clause. 2. exec/loop `address` is an absolute instruction-triple index from shader dword 0, but indexed our post-CF `instructions` slice directly (`ucode/mod.rs`). Rebase addresses by the CF triple count so `address*3` lands on the right instruction. 3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index` read from bit 5 (actually `src_reg`) instead of bit 20, and texture `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0` was read as `tf1`, whose empty fetch-constant failed to decode → no texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`). Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and `gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the only golden field that changed — all draw/swap counts unchanged). The same fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path. 
Added a regression test that parses the real E59B2B3D microcode and asserts a tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 21:53:35 +02:00
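The corrected CF opcode numbering and the exec-address rebase from iterate-3M can be sketched as follows. This is a minimal illustration, not the code in `control_flow.rs`: the `CfClass` enum, both function names, and the assumption that the post-CF `instructions` slice is addressed in dwords are hypothetical. Only the opcode numbers come from the commit message.

```rust
/// Coarse classes of Xenos control-flow opcodes, using the numbering listed
/// in the iterate-3M message. The enum itself is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CfClass {
    Nop,
    Exec,
    ExecEnd,
    CondExec,
    Loop,
    CallOrReturn,
    CondJmp,
    Alloc,
    MarkVsFetchDone,
}

/// Map a 4-bit CF opcode to its class: 0=Nop, 1=Exec, 2=ExecEnd,
/// 3..6 and 13..14 cond-exec variants, 7/8 loop, 9/10 call/return,
/// 11 condjmp, 12 alloc, 15 mark-vs-fetch-done.
pub fn classify_cf_opcode(opcode: u8) -> CfClass {
    match opcode & 0xF {
        0 => CfClass::Nop,
        1 => CfClass::Exec,
        2 => CfClass::ExecEnd,
        3..=6 | 13 | 14 => CfClass::CondExec,
        7 | 8 => CfClass::Loop,
        9 | 10 => CfClass::CallOrReturn,
        11 => CfClass::CondJmp,
        12 => CfClass::Alloc,
        _ => CfClass::MarkVsFetchDone, // only 15 remains after the mask
    }
}

/// Rebase an absolute instruction-triple `address` (counted from shader
/// dword 0) into a dword offset within the post-CF instruction slice.
/// Returns None if the address points back into the CF block itself.
/// Assumption: the post-CF slice is indexed in dwords, three per triple.
pub fn rebase_exec_address(address: u32, cf_triples: u32) -> Option<usize> {
    address.checked_sub(cf_triples).map(|t| t as usize * 3)
}
```

With the old off-by-one table, opcode 1 would have been read as a terminal exit; here it classifies as `Exec`, so parsing continues past it.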
MechaCat02	2f55d1fd7d	[iterate-2X] Texture pipeline: un-stub RectangleList + draw-time texture decode Two faithful, deterministic GPU-backend changes that make the texture path correct for whatever textured draw the splash eventually dispatches. Both are currently inert on Sylpheed (the textured logo draw is still gated downstream — see below), but neither shifts the stable-digest golden, so they land safely. 1. Un-stub RectangleList primitive expansion (primitive.rs). The splash submits 2819 RectangleList draws at 200M, all of which were REJECTED by the P3 stub (`gpu.primitive.rejected{rectangle_list}`) → only ~592 flat point/quad draws rasterized. Mirror canary's intent (primitive_processor.cc:389-456 kRectangleListAsTriangleStrip) within our CPU index-rewrite idiom: emit each rect's 3 real vertices as one TriangleList triangle (v0,v1,v2), rejected=false, faithful host_vertex_count. The full quad (synthesized 4th corner v3=v0+v2-v1) needs real vertex fetch in vs_main — left as a documented TODO. Rejection warnings drop 2819→0. 2. Draw-time texture decode keyed off the active PS's real tfetch slots (gpu_system.rs + exports.rs vd_swap). Previously vd_swap decoded a hardcoded fetch-constant slot 0 at swap time. Now the DRAW handler parses the bound pixel shader (ucode::parse_shader), collects its tfetch fetch_const slots via new shader_metrics::tfetch_slots, reads each 6-dword fetch constant, and decode+caches it into GpuSystem::last_draw_textures. vd_swap publishes the first of these (UI binds one texture today), falling back to the legacy slot-0 probe on flat-only frames. New span_max_version helper walks page_version over the trait (draw-time &dyn MemoryAccess lacks the heap's inherent max_page_version). Pure function of guest writes — deterministic. 
Status: texture_decodes stays 0 on Sylpheed because all 6 live shaders are flat (no tfetch); canary's textured logo shaders E59B2B3D/F7B1457 are not yet dispatched by ours (a downstream title-state gate, the next frontier). The full P5 decode→publish→upload→sample path is already wired; this makes the decode side key off the real shader instead of a guess. Validation: stable-digest golden sylpheed_n50m unchanged (draws=718 swaps=147 tex=0), regenerated twice byte-identical; 200M run shows 0 RectangleList rejections. cargo test --workspace green (677, +2: rectangle_list_expansion, tfetch_slots_extracts_texture_fetch_constants). No temp hooks. Branch only; not pushed/merged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 21:34:43 +02:00
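The RectangleList handling in iterate-2X can be sketched as below. The auto-indexed expansion emits one triangle (v0,v1,v2) per rectangle, and the fourth-corner formula v3=v0+v2-v1 is the one the commit leaves as a TODO. The function names and the 2D `Vec2` type are assumptions made for illustration; the real `primitive.rs` works on guest indices and is not reproduced here.

```rust
/// 2D position used only for this sketch.
pub type Vec2 = [f32; 2];

/// Synthesize the missing fourth corner of a rectangle from its three
/// submitted vertices: v3 = v0 + v2 - v1.
pub fn fourth_corner(v0: Vec2, v1: Vec2, v2: Vec2) -> Vec2 {
    [v0[0] + v2[0] - v1[0], v0[1] + v2[1] - v1[1]]
}

/// Expand an auto-indexed RectangleList into TriangleList indices, one
/// triangle (v0, v1, v2) per rectangle. Trailing vertices that do not
/// form a complete rectangle are dropped.
pub fn expand_rect_list(vertex_count: u32) -> Vec<u32> {
    let rects = vertex_count / 3;
    (0..rects * 3).collect()
}
```

Emitting only the first triangle keeps `host_vertex_count` faithful to the real vertices; drawing the full quad would additionally need `fourth_corner` evaluated on fetched vertex data.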
MechaCat02	a91f4c550b	[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation The title's per-frame loop (sub_822F1AA8) is clock-B-paced and only re-fires when the swap count [controller+88] changes, which advances only on source=1 CP swap-complete interrupts. Each present batch the guest submits (via the sub_824CE348 -> sub_824BF4D0 builder) ends with a WAIT_REG_MEM on a per-CPU swap-acknowledge fence [GCTX+0] (GCTX = [device+10772]); the GPU parks there until the graphics ISR (sub_824BE9A0) clears that CPU's bit. Two coupled gaps kept ours emitting only ONE source=1 then dead-locking (draws plateaued at 28, run halted ~19.27M): 1. GPU MMIO register 0x1961 (AVIVO_D1MODE_VIEWPORT_SIZE) read as 0. The swap callback sub_824CE2B8 divides by its low 12 bits (display height) as a refresh-pacing term, so a 0 read tripped its `twi` divide-by-zero guard and aborted the ISR before it reached the fence-clear. Mirror canary GraphicsSystem::ReadRegister (graphics_system.cc:311): return 0x050002D0 (1280x720). 2. The ISR ran on an arbitrary borrowed thread, so [r13+268] (the PCR processor number) did not match the interrupt's target CPU. The ISR clears `1 << current_cpu` from the fence; running on the wrong CPU cleared the wrong bit and the fence (bit 2, from cpu_mask 0x4) never reached 0. Carry the target CPU through the interrupt queue (bit index of the PM4_INTERRUPT cpu_mask for CP, 2 for vsync per canary DispatchInterruptCallback(0, 2)) and impersonate it on the borrowed thread's PCR around the ISR, mirroring canary EmulateCPInterruptDPC -> XThread::SetActiveCpu. With both fixes the fence clears, the GPU drains each present batch, source=1 sustains per-present, clock B advances, and the loop runs continuously. Draws climb linearly with the budget (no re-stall): 50M 28->718, 200M ->3411, 1B ->18734; swaps 2->147/950/6060. No "Unanticipated CPU_INTERRUPT" trap. 
Inline-deterministic (--stable-digest byte-identical x2); n50m golden re-baselined. 675 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 20:49:32 +02:00
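The two iterate-2W fixes reduce to small bit manipulations, sketched here. The function names are hypothetical. Placing the width in the upper half of the register is an assumption inferred from 0x050002D0 decoding to 1280x720; the commit only states that the low 12 bits hold the display height. Using the lowest set bit for a multi-bit `cpu_mask` is likewise an assumption.

```rust
/// Decode the value returned for register 0x1961 into (width, height).
/// Assumption: width sits in bits 16..27; the commit only documents that
/// the low 12 bits are the display height.
pub fn decode_viewport_size(value: u32) -> (u32, u32) {
    let width = (value >> 16) & 0xFFF;
    let height = value & 0xFFF;
    (width, height)
}

/// Target CPU for a CP interrupt: the bit index of the PM4_INTERRUPT
/// cpu_mask. Returns None for an empty mask.
pub fn target_cpu_from_mask(cpu_mask: u32) -> Option<u32> {
    if cpu_mask == 0 {
        None
    } else {
        Some(cpu_mask.trailing_zeros())
    }
}

/// What the guest ISR does to the swap-acknowledge fence: clear the bit
/// of the CPU it believes it is running on.
pub fn isr_clear_fence(fence: u32, current_cpu: u32) -> u32 {
    fence & !(1u32 << current_cpu)
}
```

With `cpu_mask` 0x4 the target CPU is 2, so an ISR impersonating CPU 2 clears the fence to 0, while one running as CPU 0 leaves it at 0x4 and the GPU stays parked.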
MechaCat02	873c197ff1	[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). 
The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 15:20:02 +02:00
MechaCat02	1ae472bd2b	[iterate-2S] GPU: implement CP SCRATCH_REG memory writeback — arms Sylpheed's swap-callback slot Sylpheed renders the splash (draws=78, iterate-2O) then plateaus: the title's per-frame manager (sub_821741C8) only re-fires when "clock B" ([gfx+15160], swap count) changes, which only the CP swap-complete callback sub_824CE2B8 increments. The graphics ISR sub_824BE9A0 indirect-calls that callback via [[gfx+10772]+16] on CP (source=1) interrupts, but the slot stayed NULL so the callback never ran. Root (runtime-verified, ours-side GPU): the guest arms the slot through the Xenos CP scratch-register writeback path, which ours never implemented. The arming IB (drained by ours at 0x4adf5180) contains a Type-0 register write of the callback PC 0x824ce2b8 into SCRATCH_REG4 (0x057C). On hardware/canary, writing a SCRATCH_REG{n} mirrors the value to SCRATCH_ADDR + n*4 in memory when the matching SCRATCH_UMSK bit is set. Runtime values: SCRATCH_ADDR=0x0b1d5000 (the [gfx+10772] descriptor), SCRATCH_UMSK=0x20033 (bit 4 set), so SCRATCH_REG4 -> 0x0b1d5010 = descriptor+16 = the callback slot (0x4b1d5010). Ours decoded the Type-0 write into the register file but performed no writeback (case a: drained-but-mishandled), so the slot stayed NULL. Fix mirrors canary's CommandProcessor::HandleSpecialRegisterWrite (command_processor.cc:545-552): a scratch_register_writeback() helper called from handle_type0/handle_type1 after every register write; for SCRATCH_REG0..7 with the UMSK bit set, it writes the value (big-endian, as mem.write_u32 already stores) to SCRATCH_ADDR + n*4 (projected via physical_to_backing). Deterministic given identical register state; proven by unit test. Cascade (verified by runtime probe): slot 0x4b1d5010 now armed with 0x824ce2b8; on the 2-3 CP interrupts that fire, the ISR reads the slot and bcctrl's into sub_824CE2B8 (runs 2x; 0x cascade on master); sub_824CE2B8 increments clock B ([gfx+15160]). 
The cascade does NOT yet reach draws>78: there are only ~3 CP interrupts (from the initial 9825- packet batch), and the title render loop stalls upstream (the iterate-2Q title-respawn gate) before it submits more PM4_INTERRUPT work, so the callback can't bootstrap a self-sustaining loop. This is the remaining update-17/18 arming gap closed; the upstream stall is the next gate. The default threaded GPU backend drains the ring on a separate host thread, so with the callback now doing work the exact CP-interrupt delivery instruction varies run to run (pre-existing GPU-thread race). Pin the n50m oracle test to --gpu-inline (instruction-count deterministic) and re-baseline its golden; bit-exact across repeated runs. New unit test scratch_reg_write_mirrors_to_memory_when_umsk_enabled. Tests: 675 pass (was 674). Golden re-baselined + determinism verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 14:21:30 +02:00
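The address arithmetic behind the iterate-2S scratch-register writeback can be sketched as follows. This is a simplified stand-in for the `scratch_register_writeback()` helper, not its actual body. The function name and signature are hypothetical, and `SCRATCH_REG0 = 0x0578` is derived from the commit's statement that SCRATCH_REG4 is 0x057C.

```rust
/// Register index of SCRATCH_REG0, derived from SCRATCH_REG4 = 0x057C.
pub const SCRATCH_REG0: u32 = 0x0578;

/// For a write to register `reg_index`, return the guest address the value
/// should be mirrored to, or None if the register is not SCRATCH_REG0..7
/// or its SCRATCH_UMSK bit is clear.
pub fn scratch_writeback_address(
    reg_index: u32,
    scratch_addr: u32,
    scratch_umsk: u32,
) -> Option<u32> {
    if !(SCRATCH_REG0..SCRATCH_REG0 + 8).contains(&reg_index) {
        return None;
    }
    let n = reg_index - SCRATCH_REG0;
    if scratch_umsk & (1 << n) == 0 {
        return None;
    }
    // Each scratch register mirrors to its own dword: SCRATCH_ADDR + n*4.
    Some(scratch_addr.wrapping_add(n * 4))
}
```

Plugging in the runtime values from the commit (SCRATCH_ADDR=0x0b1d5000, SCRATCH_UMSK=0x20033) gives 0x0b1d5010 for SCRATCH_REG4, which is descriptor+16. The real helper additionally projects that address through `physical_to_backing` before writing.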
MechaCat02	034ec8b47f	[iterate-2O] GPU: drain indirect buffers correctly — Sylpheed renders splash (draws 0→78) Ours' GPU never drained the D3D driver's system command buffer past the first 11-dword indirect buffer, so DRAW_INDX / reg-0x57C-arm packets never executed and draws stayed 0 (the long-hunted render gate; see UPDATE-18). Runtime tracing (temporary, removed) showed the guest submits 6 INDIRECT_BUFFER packets at boot (CP_RB_WPTR 22→37) but ours executed exactly ONE IB and then spun 15.7M packets inside it. Three coupled command-processor bugs, all corrected to match canary: 1. `sync_with_mmio` applied the primary CP_RB_WPTR to whichever ring was active, including an executing indirect buffer — `37 % 11 = 3` clobbered the IB's write pointer so its read pointer looped 0→2→5→0 forever and never popped back to the primary ring. CP_RB_WPTR governs ONLY the primary ring; while an IB executes, the primary is the bottom of the IB stack. Canary executes each IB through a separate `RingBuffer reader_` (command_processor.cc), so the primary write pointer is structurally inapplicable to an IB. 2. Indirect buffers were treated as circular rings: read wrapped at `size_dwords` (`11 % 11 = 0`) and never reached the fixed write pointer, so even without the clobber the IB could not terminate. An IB is a fixed linear sub-stream; add `RingBufferView.indirect` and drain `[0, ib_size)` monotonically, then pop. 3. `is_ready` only checked the active ring, so an IB that now correctly exhausts would never get `execute_one` called again to pop back to the primary ring (whose WPTR may have advanced). Check the whole IB stack. Also: the ring was sized `1 << size_log2` bytes (1024 dwords) vs canary's `1 << (size_log2 + 3)` (8192 dwords) — an 8× undersize that desynced WPTR-wrap math from the guest. Fixed in `GpuSystem::initialize_ring_buffer` (and the dead bookkeeping copy in `vd_initialize_ring_buffer`). 
Cascade (deterministic; threaded-default backend, byte-identical across runs): reg 0x57C now written, IB jumps 1→12, packets 15.7M→9,825, and the splash renders — draws 0→78, shaders 0→3, render_targets 0→2, swaps 2→3 — stable at 50M / 200M / 1B. Boot then reaches a new downstream gate (draws plateau at 78, interrupts keep climbing → engine alive, not deadlocked). golden `sylpheed_n50m.json` re-baselined (draws 78). `cargo test --workspace` green (674; +2 ring_view regression tests). vd_swap's synthetic-swap short-circuit is now redundant but left untouched (cascade works without changing it); cleaning it up is a separate follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 22:06:16 +02:00
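The linear-versus-circular distinction and the ring sizing from iterate-2O can be sketched as below. This `RingView` is a stripped-down stand-in with hypothetical fields and methods; only the `indirect` flag, the `[0, ib_size)` drain rule, and the `1 << (size_log2 + 3)` byte size come from the commit.

```rust
/// Minimal read-side view of a command stream (illustrative only).
pub struct RingView {
    pub read: u32,
    pub write: u32,
    pub size_dwords: u32,
    /// True for an indirect buffer: a fixed linear sub-stream, not a ring.
    pub indirect: bool,
}

impl RingView {
    /// Advance the read pointer past a packet of `dwords` dwords.
    pub fn advance(&mut self, dwords: u32) {
        if self.indirect {
            // Drain [0, size) monotonically; never wrap.
            self.read = (self.read + dwords).min(self.size_dwords);
        } else {
            self.read = (self.read + dwords) % self.size_dwords;
        }
    }

    /// An IB is done when the read pointer reaches its end; the primary
    /// ring is drained when read catches up with write.
    pub fn exhausted(&self) -> bool {
        if self.indirect {
            self.read >= self.size_dwords
        } else {
            self.read == self.write
        }
    }
}

/// Primary ring size in dwords: 1 << (size_log2 + 3) bytes.
pub fn primary_ring_dwords(size_log2: u32) -> u32 {
    (1u32 << (size_log2 + 3)) / 4
}
```

For the 11-dword IB in the commit, three packets of 2, 3 and 6 dwords leave a linear view at read=11 (exhausted), whereas a circular view wraps to 0 and never terminates.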
MechaCat02	2bdb93e51e	[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 \| (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). 
Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 13:39:57 +02:00
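The `physical_to_backing` projection from iterate-2K can be sketched as below. The formula `0x4000_0000 | (addr & 0x1FFF_FFFF)` is from the commit. The exact set of windows that get projected is an assumption: this sketch handles the bare-physical range (0x0000_0000..0x1FFF_FFFF) and the write-combine window (0x4000_0000..0x5FFF_FFFF) and passes everything else through unchanged, which may differ from the real helper.

```rust
/// Project a guest physical address onto the committed backing in the
/// 0x4xxxxxxx window. Assumption: only bare-physical and write-combine
/// addresses are projected; all other addresses stay flat.
pub fn physical_to_backing(addr: u32) -> u32 {
    match addr >> 28 {
        // 512 MB of physical DRAM spans top nibbles 0x0..0x1 (bare) and
        // 0x4..0x5 (write-combine mirror).
        0x0 | 0x1 | 0x4 | 0x5 => 0x4000_0000 | (addr & 0x1FFF_FFFF),
        _ => addr,
    }
}
```

The ring base from the commit, 0x0adcc000, maps to 0x4adcc000, and applying the function twice gives the same result, matching the stated idempotence for the WC window.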
MechaCat02	7a1b6b3306	fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel The Phase-C VdSwap PM4 ring path (commit `82f3d61`) emits two "PM4_XE_SWAP not consumed by drain" warnings when running: exec sylpheed.iso --ui --quiet --halt-on-deadlock \ --parallel --reservations-table Lockstep -n 100M never trips it. Two distinct race windows: (a) Inline backend (--ui forces it): drain(mem, 4096) hit its fixed packet cap before reaching the PM4_XE_SWAP we'd just injected at the WPTR tail. With 6 CPU threads, the ring accumulates >4096 packets between vd_swap callbacks. (b) Threaded backend (--parallel without --ui): the worker's DrainFence handler has a 900 ms deadline and game-batched IBs (8-10 M packets observed) keep it from reaching the tail in any reasonable budget. If the worker eventually drained past the injected packet later, the safety-net direct notify would double-count. Three changes: * gpu_system.rs: new `drain_until_wptr(target, time_budget)` draining by the canary `WorkerThreadMain` predicate (read_offset != target) instead of a fixed packet count. 900 ms deadline mirrors the threaded DrainFence handler. * handle.rs: inline `drain_to_current_wptr` switches to `drain_until_wptr`. DrainFence handler publishes the digest mirror BEFORE replying so the CPU's post-drain `digest_snapshot` sees fresh stats. * exports.rs (vd_swap): skip the PM4 ring injection unconditionally and route swap notification through `notify_xe_swap` directly. Tail-injection is unreliable under --parallel for both backends. The slot-0 fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001); draws=0 today so a stale slot 0 has no observable effect. Verification: * cargo test --workspace --release: 556 passing (unchanged). * Lockstep -n 100M --stable-digest: bit-identical to pre-fix master HEAD `aa3f1d3`. {instructions:100000002, imports:987685, unimpl:0, draws:0, swaps:2, ...} * check --parallel --reservations-table -n 30M: 0 warnings (was 2). swaps=2. 
* exec --gpu-inline --parallel --reservations-table -n 30M: 0 warnings (was 2 with drained=8M-10M observed). swaps=2. Audit IDs: GPUBUG-DRAIN-001 (closed), GPUBUG-FETCH-PATCH-001 (filed, deferred). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:12:15 +02:00
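The drain predicate introduced for GPUBUG-DRAIN-001 can be sketched as follows: loop while `read_offset != target`, bounded by a wall-clock budget rather than a fixed packet count. The `Ring` struct, the closure parameter, and the `(packets, reached_target)` return value are assumptions made for illustration; the real `drain_until_wptr` lives on the GPU system and executes PM4 packets.

```rust
use std::time::{Duration, Instant};

/// Minimal ring state for this sketch.
pub struct Ring {
    pub read_offset: u32,
    pub size_dwords: u32,
}

/// Drain until the read offset reaches `target` or the time budget runs
/// out. Returns the number of packets executed and whether the target
/// was reached. `execute_one` stands in for the packet executor.
pub fn drain_until_wptr(
    ring: &mut Ring,
    target: u32,
    time_budget: Duration,
    mut execute_one: impl FnMut(&mut Ring),
) -> (u64, bool) {
    let deadline = Instant::now() + time_budget;
    let mut drained = 0u64;
    while ring.read_offset != target {
        if Instant::now() >= deadline {
            return (drained, false);
        }
        execute_one(&mut *ring);
        drained += 1;
    }
    (drained, true)
}
```

Unlike the old `drain(mem, 4096)` cap, the loop terminates on reaching the write pointer however many packets are queued, and the budget (900 ms in the commit) bounds the worst case.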
MechaCat02	8fc1b1dfed	fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO write callback) uses `Ordering::Release` so any ring-memory writes the guest performed before bumping WPTR are visible to a paired `Acquire`-load on the consumer. The consumer here at `sync_with_mmio` was using `Ordering::Relaxed` for both the WPTR load and the RPTR mirror store — leaving the Release/Acquire pairing broken. Under `--parallel`, this broken pairing means the GPU worker can observe a fresh WPTR value while still reading stale ring-memory contents at the corresponding offsets — garbage PM4 packets. The audit's M11 grid run confirmed --parallel is non-deterministic beyond the documented `packets` ±5% noise; this fix is one strand of that. Symmetric fix on the RPTR mirror store: Release pairs with any guest-side Acquire-load of CP_RB_RPTR for ring-writeback bookkeeping. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (unchanged) packets: ~60M (within noise) Tests: 149 (no count change; this is a memory-ordering correctness fix, not a behavioral change visible at the digest level in lockstep). Closes GPUBUG-006 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:26:09 +02:00
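The Release/Acquire pairing that GPUBUG-006 restores can be sketched with two atomics: one standing in for a ring-memory slot, one for CP_RB_WPTR. The function and variable names are hypothetical, and the real ring is ordinary guest memory rather than an atomic; the atomic slot is used here only to keep the example free of `unsafe`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Producer writes a packet and then publishes WPTR with Release; the
/// consumer loads WPTR with Acquire before reading the packet, so it is
/// guaranteed to see the packet contents and not stale memory.
pub fn publish_and_consume() -> u32 {
    let ring_slot = Arc::new(AtomicU32::new(0));
    let wptr = Arc::new(AtomicU32::new(0));

    let producer = {
        let (ring_slot, wptr) = (Arc::clone(&ring_slot), Arc::clone(&wptr));
        thread::spawn(move || {
            // Ring-memory write performed before bumping WPTR.
            ring_slot.store(0xC0DE_0001, Ordering::Relaxed);
            // Release: everything above becomes visible to an Acquire load.
            wptr.store(1, Ordering::Release);
        })
    };

    // Consumer side: Acquire-load WPTR until the producer has published.
    while wptr.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
    let packet = ring_slot.load(Ordering::Relaxed);
    producer.join().unwrap();
    packet
}
```

With `Ordering::Relaxed` on the consumer's WPTR load, as before the fix, nothing would order the slot read after the WPTR read, which is the stale-packet hazard the commit describes.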
MechaCat02	8723d6826b	fix(gpu): GPUBUG-103/104/105 — fix 8 draw-state register addresses + index_size bit Eight of the register-index constants in draw_state.rs::reg pointed at completely unrelated registers because the canonical canary table (register_table.inc) was misread when the module was first authored. Re-validated each value against canary's lines 1232-1336.

| Register                | Pre-fix | Canary | Was-actually        |
| ----------------------- | ------- | ------ | ------------------- |
| VGT_DRAW_INITIATOR      | 0x2281  | 0x21FC | (junk)              |
| VGT_DMA_BASE            | 0x2282  | 0x21FA | (junk)              |
| VGT_DMA_SIZE            | 0x2283  | 0x21FB | (junk)              |
| PA_SC_WINDOW_SCISSOR_TL | 0x200E  | 0x2081 | SCREEN_SCIS_TL      |
| PA_SC_WINDOW_SCISSOR_BR | 0x200F  | 0x2082 | SCREEN_SCIS_BR      |
| RB_COLOR_INFO_1         | 0x2010  | 0x2003 | COHER_DEST_BASE_10  |
| RB_COLOR_INFO_2         | 0x2011  | 0x2004 | COHER_DEST_BASE_11  |
| RB_COLOR_INFO_3         | 0x2012  | 0x2005 | COHER_DEST_BASE_12  |
| PA_SU_VTX_CNTL          | 0x2083  | 0x2302 | PA_SC_CLIPRECT_RULE |

Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`). The block comment in `extract()` was also wrong about the intermediate field layout and has been refreshed.

Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged)
draws: 0 → 0 (still gated — see below)
packets: ~61M (within noise)
Tests: 149 (no count change; existing draw_state tests cover the new constants implicitly via behavioral round-trip).

The audit predicted Phases C+D+E together would unlock `draws > 0`, but the runtime plateau is multi-causal per the audit's own analysis (`project_xenia_rs_audit_2026_05_02.md`). The likely remaining blockers in -n 100M:
* 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4, 0x42450b5c) — Phase F's XAM/spinlock fixes target this. 
* shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD yet because workers haven't loaded shader resources. The register fixes here are still load-bearing for any draw that DOES happen (every register read at 0x2281 was junk before this commit) — landing them now is correct even if draws=0 persists until Phase F unparks the resource-loader threads. Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:22:04 +02:00
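The corrected constants and the `index_size` extraction from GPUBUG-103/104/105 can be sketched as below. The register values are the "Canary" column of the commit's table. The `IndexSize` enum and the mapping 0 = 16-bit, 1 = 32-bit follow canary's `IndexFormat` as I understand it and should be treated as an assumption; the real `draw_state.rs` is not reproduced.

```rust
/// Draw-state register indices after the fix (values from the commit).
pub mod reg {
    pub const VGT_DRAW_INITIATOR: u32 = 0x21FC;
    pub const VGT_DMA_BASE: u32 = 0x21FA;
    pub const VGT_DMA_SIZE: u32 = 0x21FB;
    pub const PA_SC_WINDOW_SCISSOR_TL: u32 = 0x2081;
    pub const PA_SC_WINDOW_SCISSOR_BR: u32 = 0x2082;
    pub const RB_COLOR_INFO_1: u32 = 0x2003;
    pub const RB_COLOR_INFO_2: u32 = 0x2004;
    pub const RB_COLOR_INFO_3: u32 = 0x2005;
    pub const PA_SU_VTX_CNTL: u32 = 0x2302;
}

/// Index width selected by VGT_DRAW_INITIATOR (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexSize {
    Bits16,
    Bits32,
}

/// Extract `index_size` from bit 11 of VGT_DRAW_INITIATOR.
/// Bit 8, which the pre-fix code read, belongs to `major_mode`.
pub fn index_size(draw_initiator: u32) -> IndexSize {
    if (draw_initiator >> 11) & 1 == 0 {
        IndexSize::Bits16
    } else {
        IndexSize::Bits32
    }
}
```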
MechaCat02	ec2d955dbd	fix(gpu): GPUBUG-102 — apply per-format endian byte-swap to vertex fetch The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`, xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1) selecting kNone/k8in16/k8in32/k16in32 swap patterns per `GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored big-endian; the host is little-endian. Pre-fix every dword was bitcast as-is — vertex positions decoded as byte-reversed garbage, producing clipped or NaN positions in any draw that survived to the host. Mechanical changes: - crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads fetch_const dword 1 (endian) and wraps each lane's load in `gpu_swap(value, endian)`. New `gpu_swap` helper added to the emitted module header. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching `gpu_swap` helper added to the runtime interpreter shader. `interpret_vertex_fetch` reads fc1, computes the endian, and wraps every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT paths). Mirrors the AOT translator's emission. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~60M (within noise) Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call). Closes GPUBUG-102 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:18:46 +02:00
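The four swap patterns named in GPUBUG-102 can be sketched on the host side as below. The actual helpers are emitted as WGSL, so this Rust version is only an illustration of the same bit operations. The numeric encoding 0=kNone, 1=k8in16, 2=k8in32, 3=k16in32 follows the order in which the commit lists them and is assumed to match the fetch constant's 2-bit `endian` field.

```rust
/// Apply the Xenos endian swap selected by the 2-bit `endian` field.
pub fn gpu_swap(value: u32, endian: u32) -> u32 {
    match endian & 0x3 {
        // kNone: leave the dword untouched.
        0 => value,
        // k8in16: swap the two bytes inside each 16-bit half.
        1 => ((value << 8) & 0xFF00_FF00) | ((value >> 8) & 0x00FF_00FF),
        // k8in32: full byte reversal of the dword.
        2 => value.swap_bytes(),
        // k16in32: swap the two 16-bit halves.
        _ => value.rotate_left(16),
    }
}
```

For big-endian float vertex data, mode 2 (k8in32) is the swap that turns the stored bytes into a valid little-endian `f32` bit pattern.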
MechaCat02	c5c6713419	fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources Word-1 of every ALU triple holds three 8-bit component-relative swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7 per canary ucode.h:2064-2066) and three per-operand negate flags (bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT translator discarded word-1 entirely with `_ = w1;` — every ALU result was missing its swizzle (broadcast/permute patterns like `.zyxw`, `.xxxx`) and any negated operand was used positive instead. Component-relative semantics (canary's `AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output component i, the source component is `((swizzle >> (2*i)) + i) & 3`. Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in the interpreter shader treated it as absolute, also incorrect. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks them from word 1. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses component-relative semantics. interpret_alu decodes the modifiers and applies via apply_swizzle + apply_modifiers (with abs=false). - crates/xenia-gpu/src/translator.rs: src_operand emits the precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`, then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a bare base expression so it round-trips with the trivial-shader fixture. Abs is omitted in this commit — the abs flag is dual-meaning (for temps it lives at bit 7 of the src byte; for constants at word-2 bit 7 `abs_constants`). Wiring it up correctly requires more careful case-split logic; deferred to Phase G. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~58M (within noise) Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise because identity swizzle test merged into D1's parameterised test). 
WGSL still validates via naga (combined_module_parses_as_wgsl). Closes GPUBUG-100 (P0). Abs deferred to Phase G. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:15:07 +02:00
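The component-relative swizzle rule from GPUBUG-100 can be sketched in Rust as below. The formula `((swizzle >> (2*i)) + i) & 3` is quoted from the commit. The function names mirror the WGSL helpers but are host-side stand-ins, and folding negate into the same function is a simplification for this example.

```rust
/// Source component index for output component `i` (0..=3) under a
/// component-relative swizzle byte. Identity is 0x00, not 0xE4.
pub fn swizzled_component(swizzle: u8, i: u32) -> u32 {
    ((swizzle as u32 >> (2 * i)) + i) & 3
}

/// Apply a component-relative swizzle and an optional negate to a vec4.
pub fn apply_swizzle(src: [f32; 4], swizzle: u8, negate: bool) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        let v = src[swizzled_component(swizzle, i as u32) as usize];
        out[i] = if negate { -v } else { v };
    }
    out
}
```

Because each 2-bit field is an offset from the component's own position, a broadcast such as `.xxxx` encodes as 0x6C (offsets 0, 3, 2, 1) rather than 0x00.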
MechaCat02	78ea81c12a	fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h: 2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags selecting temp register (1) vs ALU constant (0); the corresponding 8-bit src byte indexes either: - a temp register (bits 5:0 = index, bits 6/7 reserved for relative-addressing / abs flags consumed by Phase D2), or - an ALU constant (full 8-bit index). Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F` on the src byte and emitted `r[low7]` regardless of the operand class. Every shader's WVP matrix / light constant / per-frame uniform read came back as r[low7] — typically zero — yielding invisible rendering. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp / src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our src_a (low byte of w0) is canary's third operand, hence its selector is bit 29 (canary src3_sel), not bit 31. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes the is_temp flag; constants index xenos_consts.alu directly. - crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when constant. The trivial-shader synthetic test was updated to set the temp flags so its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the flags set, all sources would now resolve as constants. Bank-selection (cf-level relative addressing for higher banks of the 512 ALU constants) remains a Phase G+ extension — covers c0..c127 in bank 0, which most Sylpheed shaders use directly. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged — gated by D2/D3/E for draws) draws: 0 → 0 packets: ~61M (within noise) Tests: 552 → 554 (+2 translator tests for the temp/constant decode). Closes GPUBUG-101 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:10:11 +02:00
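The temp-versus-constant decode from GPUBUG-101 can be sketched for a single operand as follows. The `Operand` enum and `decode_src_a` are hypothetical names; the bit positions (src_a in the low byte of w0, its selector at bit 29, temp index in bits 5:0) are taken from the commit. The other two operands are omitted because their byte positions are not stated there.

```rust
/// A decoded ALU source operand (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operand {
    /// Temp register r[idx], index from bits 5:0 of the src byte.
    Temp(u8),
    /// ALU constant c[idx], using the full 8-bit index.
    Constant(u8),
}

/// Decode src_a from ALU word 0. Selector bit 29 set means temp register;
/// clear means ALU constant.
pub fn decode_src_a(w0: u32) -> Operand {
    let src = (w0 & 0xFF) as u8;
    let is_temp = (w0 >> 29) & 1 == 1;
    if is_temp {
        // Bits 6/7 are reserved for relative-addressing / abs flags.
        Operand::Temp(src & 0x3F)
    } else {
        Operand::Constant(src)
    }
}
```

The pre-fix behaviour corresponds to always taking the `Temp` arm with a 7-bit mask, which is why constant reads such as a WVP matrix came back from unrelated temp registers.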
MechaCat02	82f3d611e2	fix(gpu,kernel): KRNBUG-Vd-04 / GPUBUG-001 / XMODBUG-013 — VdSwap PM4 ring path The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/blur "sample frame N for frame N+1" path samples fetch-constant slot 0 expecting the frontbuffer descriptor; without the patch, slot 0 stayed stale and any shader sampling it read garbage. This commit writes the canary VdSwap PM4 sequence directly into the primary ring at the current write pointer (read via the shared MMIO atomic), then advances WPTR over the injection. The natural CP drain consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant slot 0 — without going through any direct kernel→GPU bypass. Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521): 1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with base_address re-patched from virtual to physical >> 12). 2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys + width + height. Mechanism notes: - buffer_ptr in xenia-rs is in the system command buffer, NOT the primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to buffer_ptr because its ring layout maps the reserved slot inside the ring; xenia-rs's doesn't, so we have to write at the actual ring WPTR address (cached on KernelState.ring_base from VdInitializeRingBuffer). - The original "buffer_ptr zero-fill + bump WPTR by 64" path is preserved before the injection — it exposes any game-batched PM4 packets and keeps the buffer_ptr region skippable per existing game compat behavior. - A safety-net fallback at the end calls `notify_xe_swap` directly if swaps_seen didn't advance during the drain (e.g. a ring-arithmetic edge case). Idempotent — only fires when the PM4 path didn't. - KRNBUG-Mm-04 deferred: virt→phys uses the masked stub `virt & 0x1FFF_FFFF`, sufficient for the standard heap. Mechanical changes: - crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3 helpers + round-trip unit test (mirrors canary xenos.h:1682-1709). - crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor (Acquire-load) so the kernel can compute ring offsets. - crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords on KernelState (set by VdInitializeRingBuffer). - crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit block; patch fetch_dwords[1] base_address virt→phys before injection. Verification at -n 100M lockstep: swaps: 2 → 2 (game fires VdSwap exactly twice) draws: 0 → 0 (gated by Phases D+E) fallback warning: 0 occurrences (PM4 path consumed both swaps) instructions: ~100M Tests: 552 passing (553 with new pm4 round-trip test). Lockstep stable-fields determinism: byte-identical across two 100M runs. The "swaps > 2" prediction in the audit's plan assumed the game would fire VdSwap more often once the path worked; empirically Sylpheed only calls VdSwap twice within 100M instructions (this is the renderer plateau the audit identified). The success criterion for Phase C is that the PM4 path is now operational, which Phases D+E require for visible draws. Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 14:00:23 +02:00
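The two-packet swap sequence and the write at the ring's write pointer might look roughly like the sketch below. It is a minimal illustration, not the code in `pm4.rs` or `exports.rs`: the header bit layouts are the commonly documented PM4 encodings and are assumed here rather than taken from the commit, and `build_swap_sequence`, `inject_at_wptr`, and all parameter names are hypothetical. The swap opcode and signature are passed in rather than hard-coded because the commit does not state their values.

```rust
/// Type-0 header: write `count` consecutive registers starting at `base_index`.
/// ASSUMED layout: type in bits 31:30, (count - 1) in bits 29:16, index in 14:0.
fn make_packet_type0(base_index: u16, count: u16) -> u32 {
    assert!(base_index <= 0x7FFF);
    assert!((1..=0x4000).contains(&count));
    ((count as u32 - 1) << 16) | base_index as u32
}

/// Type-3 header: opcode packet followed by `count` payload dwords.
/// ASSUMED layout: type = 3 in bits 31:30, (count - 1) in 29:16, opcode in 14:8.
fn make_packet_type3(opcode: u8, count: u16) -> u32 {
    assert!(opcode <= 0x7F);
    assert!((1..=0x4000).contains(&count));
    (3u32 << 30) | ((count as u32 - 1) << 16) | ((opcode as u32) << 8)
}

/// Build the 12-dword sequence from the commit message:
/// 1) TYPE0(0x4800, 6) + 6 fetch-header dwords (already patched virt -> phys),
/// 2) TYPE3(swap opcode, 4) + signature + frontbuffer_phys + width + height.
fn build_swap_sequence(
    fetch_dwords: [u32; 6],
    swap_opcode: u8, // hypothetical parameter standing in for PM4_XE_SWAP
    signature: u32,  // hypothetical parameter; value not given in the commit
    frontbuffer_phys: u32,
    width: u32,
    height: u32,
) -> Vec<u32> {
    let mut seq = Vec::with_capacity(12);
    seq.push(make_packet_type0(0x4800, 6));
    seq.extend_from_slice(&fetch_dwords);
    seq.push(make_packet_type3(swap_opcode, 4));
    seq.extend_from_slice(&[signature, frontbuffer_phys, width, height]);
    seq
}

/// Copy `packet` into the ring starting at dword offset `wptr`, wrapping at the
/// end of the ring, and return the advanced write pointer.
fn inject_at_wptr(ring: &mut [u32], wptr: usize, packet: &[u32]) -> usize {
    let len = ring.len();
    for (i, dword) in packet.iter().enumerate() {
        ring[(wptr + i) % len] = *dword;
    }
    (wptr + packet.len()) % len
}
```

A caller would build the sequence once per VdSwap, inject it at the current write pointer, and publish the returned pointer so the normal command-processor drain picks up the swap packet.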
MechaCat02	79eb52c378	xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve) First real GPU implementation. Ring/PM4 frontend (ring_view, ring_drain, pm4) drains the command processor; gpu_system owns the threaded backend (DrainFence RPC + parker/fence helpers from M1) and the MMIO-mapped register block (mmio_region). Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode the Xbox 360 microcode, translator.rs lowers it onto the WGSL xenos_interp interpreter shader (shaders/xenos_interp.wgsl). shader_metrics.rs counts decode/translate work. Render state: draw_state, primitive, render_target_cache, texture_cache, tiled_address (Xenos's swizzled tiled-memory layout), xenos_constants (register field constants), edram (the 10 MiB EDRAM model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs owns the typed GPU-resource handles the kernel hands out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:29:38 +02:00
MechaCat02	c694bb3f43	Initial commit: xenia-rs workspace for Xbox 360 RE Rust reimplementation of the xenia Xbox 360 emulator targeting reverse-engineering and preservation, initially scoped to Project Sylpheed. Includes: - XEX2 loader (LZX decompression, AES decryption, PE parsing) - XISO / XGD2 disc image VFS - PPC interpreter with 200+ opcodes and VMX128 decoding - Static analyzer: functions, cross-references, labels, asm + SQLite output - HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init - Debugger with in-memory and SQLite-backed execution tracing - `xenia-rs` CLI with extract/dis/exec commands that produce cumulative, superset SQLite databases and opt-in instruction/import/branch traces Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-16 23:14:56 +02:00

30 Commits