23189b95aff2b1b3d18f64a14714ee61c1eb528d
30 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9d24dd0eaa |
[iterate-3AJ] Present-anchor vsync so the splash logo fade-in renders
The publisher/dev splash logo's intro fade-IN was skipped: the logo popped in at full brightness instead of ramping dim->bright like the canary oracle. Root (measured, iterate-3AF/3AI): ours' guest vsync counter is fed by a fixed-instruction-quantum proxy (one vsync per 150k retired instructions). During the ~1.1s splash asset-load the title's frame pump runs ~10M instructions inside a single guest frame, so the proxy fired ~66 vsyncs in that one frame. The pump's per-frame delta (counter_now - counter_last) was therefore ~66 on the first tick, which the anim tick (sub_823CDBF8) divides into the fade counter [item+72] @ 0x40c0add0 -> the counter JUMPED 0->0x42(66) in one step, landing past the fade-in region. Canary's wall-clock 60Hz vblank advances ~1 per heavy load frame, so its counter ramps smoothly 0->66 and the fade-in renders. Fix: anchor the lockstep vsync ticker to the guest's real present rate (VdSwap count), mirroring real hardware where the title double-buffers at vblank, so one heavy guest frame advances the vsync counter by ~1 instead of ~66. - interrupts.rs: tick_vsync_instr now takes the live present count. Two regimes: (1) bootstrap, before the guest's first present, keeps the original fixed instruction quantum unchanged -- the iterate-2W present-loop bootstrap needs vsyncs delivered BEFORE it can present (measured: callback registered ~6M instr, first delivered vsync and first present coincide; pure present-driven vsync would deadlock). (2) present-anchored, after the first present: one vblank per present, plus a small DRY_FALLBACK_CAP=4 instruction-quantum fallback per dry window so a non-presenting frame still ticks a few vsyncs (a small ramp like canary's 0/5/10/2/1...) without re-spiking to 66. - handle.rs: cheap GpuBackend::swaps_seen() accessor. - main.rs: pass the live present count into the lockstep ticker. Not masking: the fade dt/counter is never clamped or synthesized; the guest naturally computes a smooth dt once vblank tracks presents. Verified: - V1: fade counter 0x40c0add0 now ramps 0,6,8,10,12,13,+1... (was a 0->0x42 jump; direct baseline-vs-fix mem-watch). - V2 (--ui readback via per-frame logo vertex-alpha): logo alpha ramps 102,136,204,221,238,254 (dim->bright fade-IN) vs baseline all 255 (pop-in). Real artwork (has_real_vertices) still renders; milestone-1 intact. - V3: 150M boot progression intact -- texture_decodes=2, RTs=2, tex_cache=1 unchanged; draws/swaps higher (tighter present loop), 1B sanity linear, no stall/collapse. - V4: 50M --gpu-inline --stable-digest byte-identical 2x; golden re-baselined intentionally (pacing-only delta: draws 718->1274, swaps 147->259; structural fields unchanged). 688 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
c62a355418 |
[iterate-3AE] Fix spurious WHITE TRIANGLE flashing before each splash logo
The publisher and developer splash logos rendered correctly, but a fullscreen OPAQUE WHITE diagonal half-triangle flashed at boot, before each logo, and persisted across the dev-logo transition — canary shows a black background there. Readback-isolated it (env-gated frontbuffer grid + per-draw inventory, both removed) to the background-fill draws. ROOT (measured, refutes the prior "saturate/interpreter/depth" guesses): the position-only VS `0xd4c14f46` (one vfetch → oPos; exports NO color) paired with PS `0xed732b5a` (`ocolor0 = interp0`). The iterate-3T translator seeded `ointerp[0] = (1,1,1,1)` "so a VS that only exports position still yields a visible non-zero color" — a debug FAKE: it injects white that no guest value backs. So that fill's interp0 stayed white → opaque-white fullscreen triangle. Vertex windows of a WHITE frame and a steady BLACK frame were byte-identical; served_translated=true for all of them and depth is disabled in the replay, so the white came purely from the injected seed, not saturate/interp/depth. FIX (UI-translator only, golden byte-identical): - translator.rs: default un-exported interpolators to (0,0,0,0) instead of seeding interp0 white. A position-only VS now contributes nothing visible under its real blend (RGB=0 → black; A=0 → premult transparent), matching canary; every VS that really exports interp0 (the logo `0x03b7b020`, the color fill `0x36660986`) overwrites the seed → logos unaffected. - app.rs: clear the splash frontbuffer to BLACK, not the iterate-3S navy placeholder `[0.04,0.04,0.06]` (never matched to the guest). The fill is a fullscreen Xbox-360 RectangleList drawn as a single triangle in the replay (4th implied corner not yet synthesized), so its uncovered half exposed the clear; black makes the transition uniformly black like the oracle. (Full RectangleList→rectangle expansion is a separate follow-up.) READBACK (env-gated, removed): white-heavy frames 200+ → 0; navy frames 240 → 0; transition frames uniformly black; the publisher logo (white text + red dots) and the developer logos (colored, on black) still render. Determinism: changes feed only the UI translator/clear; n50m --gpu-inline --stable-digest byte-identical 2× and matches the committed golden (--expect exit 0). cargo test --workspace 686 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
3f8d3b6f1c |
[iterate-3AD] Fix 2nd splash logo rendering black: re-upload evolving atlas
The publisher (SQUARE ENIX) and the 2nd developer/studio splash logo share
one K8888 atlas at physical base 0x4dbee000, sampled at different UVs. The
publisher's white text occupies the top V-bands; the developer logo's
(bluish/gold) artwork is CPU-written into the SAME surface AFTER the publisher
frame, so the atlas evolves across frames.
The UI host texture cache (`texture_cache_host::upload`) only re-uploads a
`TextureKey` when `version_when_uploaded` increases. But the per-draw bind in
`render.rs` hardcoded `version_when_uploaded = 1` for every draw, so once the
atlas was first uploaded (during the publisher frame, with only the top bands
filled) the cache pinned that partial upload. The 2nd logo, sampling a V-band
that was still zero at first-upload time, read transparent-black -> rendered
nothing (the "white-triangle / black stub" the user saw after SQUARE ENIX).
Verdict: (G) a legitimate 2nd LOGO item whose real artwork lives in the same
evolving atlas — NOT a spurious 3rd item, and NOT a geometry/shader/blend gap.
Measured via readback: the 2nd-logo geometry rasterizes correctly (3 on-screen
quads), interp1 (UV) and interp0 (color) reach the PS with real values, the
texture content at the sampled bands exists — only the bound wgpu texture was
the stale partial upload.
Fix (UI-only, deterministic core untouched):
- `gpu_system`: thread the real content `version` (from `span_max_version`)
into `last_draw_textures` (now `(key, version, bytes)`).
- `draw_capture::DrawCapture.textures`: same 3-tuple.
- `render.rs`: use the real `version` (not a hardcoded 1) so the host cache
re-uploads when the guest fills more of the atlas.
- `exports.rs` `vd_swap`: the legacy single-texture `publish_texture` bridge
drops the version (`(key, _v, bytes) -> (key, bytes)`).
Readback (env-gated probe, removed before commit): after the fix the 2nd logo
renders real varied artwork (blue + gold texels in a centered strip) instead
of black. Determinism: `check -n50m --gpu-inline --stable-digest` byte-
identical to the
|
||
|
|
c0c6088e4d |
[iterate-3AA] Fix logo upside-down: no Y-flip on the clip-enabled NDC path
DEFECT 1 (logo upside down) ROOT + FIX. The publisher "SQUARE ENIX" logo rendered vertically mirrored vs the canary oracle (white upright on black). Measured (env-gated readback + texture-row + per-vertex dumps, all removed): - The K8888 logo texture decodes UPRIGHT (text in the top rows 1..161; the red dots sit at ~43% from the texture top). NOT a decoder row-order bug. - The logo geometry is a centered QuadList whose vertices are emitted in *clip space* (Y-UP, e.g. pos.y +0.085 top / -0.104 bottom), with the texture V mapped top->bottom (UV v 0.001 at the top vertex, 0.090 at the bottom). On both the Xbox 360 (D3D9) and wgpu, clip +Y maps to the framebuffer top — so a clip-space position is portable with NO Y-flip. - `compute_ndc_xy` unconditionally negated Y (the flip the *screen-space* pixel path legitimately needs). For the clip-enabled logo this swapped top<->bottom vertices while leaving the texture V unchanged, so the sampled sub-rect read bottom-up: red dots rendered at 58% from the top (a clean vertical mirror) instead of 43%. FIX: keep the Y-flip only on the clip_disable (screen-space pixel) branch where the framebuffer Y-down->wgpu Y-up flip is real; the clip-enabled branch now passes clip-Y-up through identity. Readback after the fix: red dots at 42% from the top (= texture's 43%) -> logo UPRIGHT, still centered. DEFECT 2 (background) was already correct + faithful; 3Z's contradiction is REFUTED by direct readback: the bg fill (vs 0x36660986 / ps 0xed732b5a, fullscreen RectangleList) reads its real vertex color (raw 0x818000c7 = -32896.5 as float) into r0, the PS exports it, and the GPUBUG-115 RB-UNORM saturate (canary spirv_shader_translator.cc:3607) clamps it to 0 -> BLACK, matching canary. The seed r0=(gvidx,...) does NOT show through (it's overwritten by the color vfetch). No code change needed. Readback of the full frame now matches canary: WHITE upright "SQUARE ENIX" + red dots on a BLACK field. UI-capture-path only (`compute_ndc_xy` runs solely when frame_captures is Some, i.e. --ui; None headless) -> deterministic core untouched, n50m --gpu-inline --stable-digest exit 0 (DRAW_INDX 275 / K8888 decode 137, identical across runs). cargo test --workspace green. Temp probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
f6f3aac673 |
[iterate-3Z] Fix logo color (yellow->white): k_8_8_8_8 vfetch + vfetch field/stride/saturate
Defect 2 of the three render-fidelity defects vs the canary oracle (the
publisher "SQUARE ENIX" logo rendered YELLOW instead of WHITE). Root,
measured by readback (env-gated probes, removed): the logo PS multiplies
the sampled texture by the interpolated vertex COLOR; the K8888 texture
itself decodes correctly (67,667 white texels + 2,087 red — the red dots —
zero yellow), so the yellow came from the vertex-color attribute decode.
Four coupled, canary-faithful fixes (all UI-translator/capture only — the
deterministic headless core is untouched; n50m --gpu-inline --stable-digest
golden byte-identical, exit 0):
- GPUBUG-112 (translator vfetch): VertexFormat 6 = k_8_8_8_8 (4x u8
normalized, 1 dword), NOT k_16_16 (which is 25) per canary xenos.h:643.
The logo color stream is k_8_8_8_8; decoding it as k_16_16 read only 2 of
4 channels and forced BLUE = 0 -> white texture x (R,G,0) = yellow. Now
unpacks all four 8-bit channels (canary spirv_shader_translator_fetch.cc
k_8_8_8_8 packed_offsets 0/8/16/24); added k_16_16 (format 25) too.
- GPUBUG-113 (ucode/fetch): vfetch is_signed / is_normalized / is_mini_fetch
bit positions were wrong (read bits 24/25, which sit inside exp_adjust).
Per canary ucode.h:757-758,764: signed=fomat_comp_all (w1 bit12),
normalized=(num_format_all==0) (w1 bit13), mini_fetch (w1 bit30).
- GPUBUG-114 (translator vfetch): a vfetch_mini reuses the address AND
STRIDE of the preceding full vfetch of the same stream (canary
ucode.h:733); its own stride field is 0. Track the last full stride per
fetch-const and inherit it so a mini color/UV attribute indexes by the
real vertex stride, not its tight dword count.
- GPUBUG-115 (translator PS export): saturate the color export to [0,1]
before the UNORM render-target write, mirroring canary
spirv_shader_translator.cc:3607 ("Saturate, flushing NaN to 0"). Without
it an out-of-range guest color writes garbage to the sRGB target.
Verified by env-gated frontbuffer readback (copy_texture_to_buffer, removed
before commit): the logo now renders WHITE text + RED dots (bbox centered
~y322-389), zero yellow anywhere. Workspace tests green (added 4: k_8_8_8_8
4-channel unpack, mini-fetch stride inheritance, vfetch bit decode, PS
saturate). Determinism: golden byte-identical.
Remaining (defects 1 & 3, see memory iterate-3Z): logo orientation and the
ed732b5a fullscreen background fill (renders ~white, canary shows black) —
both localized but not yet cleanly resolved; plan in the memory file.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
2a992db47b |
[iterate-3Y] Replay per-draw blend + write-mask so the logo composites visible
The publisher logo rendered its real artwork in isolation (3X) but was
overpainted in the full composite: every replayed draw used ONE fixed
SrcAlpha/OneMinusSrcAlpha pipeline + an opaque-magenta texture stub, so the
textured RectangleList draws whose sampler slot is shadowed by a vertex-fetch
constant (no resolvable texture) wrote opaque magenta over the logo.
Per-draw render-state inventory at the splash (env-gated probe, removed):
- logo QuadList vs=0x03b7b020 ps=0x03b79001: bc0=0x07010701
(One,OneMinusSrcAlpha — premultiplied alpha), cmask=0xF, ntex=1 (real K8888)
- RectangleList vs=0xd4c14f46 ps=0x03b79001: SAME premult blend, ntex=0
(slot 0 holds a type=3 vertex constant → texture decode rejects) → magenta
- opaque fill vs=0x36660986 ps=0xed732b5a: bc0=0x00010001 (One,Zero) — green
Draw order: the logo is drawn LAST per group, so order was not the problem;
the fixed pipeline state was.
Change (UI-side capture/replay only):
- draw_capture: capture RB_BLENDCONTROL0 + RB_COLOR_MASK (+ colorcontrol /
depthcontrol for follow-ups) per draw.
- xenos_pipeline: new RenderState{blend_control,color_mask}; map Xenos blend
factors/ops -> wgpu mirroring canary kBlendFactorMap/kBlendFactorAlphaMap;
One,Zero,Add => blend:None (opaque); zero-channel mask => ColorWrites; cache
translator AND interpreter pipelines keyed on (vs,ps,RenderState) /
RenderState so each draw composites with its real state.
- render: pass each capture's RenderState through both replay paths.
- dummy texture magenta(255,0,255,255) -> transparent(0,0,0,0): an
unresolvable texture now contributes nothing under its real premult blend
instead of fabricating opaque magenta (removes a fake, adds none).
Readback (env-gated, removed): full 1280x720 composite now shows the logo's
real artwork (maxR=255, 50-102 distinct colors/cell) in a centered strip; no
magenta anywhere. Background is uniform green (the 0xed732b5a opaque fill) — a
separate vertex-color/shader fidelity issue, NOT compositing (next iterate).
Determinism: UI-only; draw_capture additions only run when frame_captures=Some.
check -n50m --gpu-inline --stable-digest --expect = "matches golden" (2x).
cargo test --workspace = 682 passed. Temp probes removed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
89b5c39d8a |
[iterate-3X] Real splash logo geometry renders: fix vertex-fetch const_index_sel + per-draw submit
Two readback-proven root-cause fixes make the publisher-logo QuadList draw land its REAL captured vertex buffer (the texture was already correct from 3V). REFUTES iterate-3W's "logo geometry is auto-generated from vertex_id": the logo IS sourced from a 4-vertex QuadList buffer at guest physical 0x0adf60f0 (measured), it was just resolved at the wrong fetch-constant register. GPUBUG-110 (vertex fetch const_index_sel dropped). The Xenos vertex-fetch instruction encodes const_index (w0[20:24]) AND const_index_sel (w0[25:26]); the full constant index is const_index*3 + const_index_sel (canary ucode.h:700), packed 3 two-dword constants per 6-dword register group. ucode/fetch.rs decoded only const_index and read sub-slot 0 (fc*6). The logo vfetch is const_index=31, sel=2 -> the real base lives at reg 0x48BE, but ours read 0x48BA which held an unused 0x00000001 (base=0,size=0) slot. So resolve_vertex_window returned None -> has_real_vertices=false -> the logo fell to the procedural fullscreen magenta fallback. Fix: decode const_index_sel, add VertexFetch::const_reg_offset() = const_index*6 + sel*2, and use it in both draw_capture.rs (capture) and translator.rs (the WGSL endian term + no-window fallback base; the old expression there read the src_reg bits, not the const index). Measured: logo now resolves a 24-dword (4 verts x stride 6) window, base 0x0adf60f0. GPUBUG-111 (single batch encoder = last-draw-wins vertex data). In wgpu every queue.write_buffer staged before a single queue.submit is applied before ANY command in that submit runs. dispatch_xenos_captures recorded the whole batch into one encoder + one submit, so every draw read only the LAST draw's vertex buffer / per-draw uniforms. The logo quad therefore sampled the trailing fullscreen background quad's vertices and rasterized nothing where the logo was. Fix: submit one encoder per draw (frontbuffer LoadOp::Load composites identically). Measured (env-gated readback, removed): with this fix the logo draw in isolation renders real varied texels (e.g. (225,17,22)/(255,255,0)) in a centered strip (~20k px), vs 100% navy before. Determinism: all changes are UI-side (xenia-ui replay) or the UI translator / capture path (frame_captures None in headless); the fetch.rs field addition is purely additive and does not change any existing decoded value. Verified the deterministic core unchanged: check -n50M --gpu-inline --stable-digest exit 0 and all 136 metric counters byte-identical across two runs. All temp probes removed. cargo test --workspace green; new regression test vertex_fetch_const_index_sel_and_reg_offset. Known remaining (next iterate): a fullscreen flat QuadList (ps 0x03b79081, vertex color green, no texture) and other textureless draws overpaint the logo in the full composite (their per-draw blend/alpha render state is not yet replayed, and draw order alternates bg/logo). The logo artwork renders correctly in isolation; the composite is not yet clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
39723dfe37 |
[iterate-3W] draw_capture: walk CF exec sequence to find the real vertex fetch
Fix the UI-side vertex-window resolver (`resolve_vertex_window`) so it identifies vertex fetches via the control-flow `Exec` clause `sequence` bitmap instead of blindly decoding every 3-dword triple. Root cause (GPUBUG-109): the Xenos instruction block packs ALU and fetch instructions identically (96 bits each); only the owning `Exec` clause's `sequence` bitmap (2 bits per instruction, bit[2*i] = fetch/ALU) tells them apart. The old resolver scanned every triple and trusted the first that happened to decode as a vertex fetch, gated by a `dword0 & 3 == 3` "type" guard. On real shaders this mis-decoded ALU triples as fetches and either picked a garbage fetch-constant slot or rejected the clause before reaching the true vertex fetch. Now walk the CF exec clauses exactly as the translator does (`translator.rs::emit_exec`) and take the first sequence-flagged *vertex* fetch. Measured (env-gated probes, removed before commit): the resolver now reaches the real fetch on every splash VS. The RectangleList draws (vs 0x36660986 / 0xd4c14f46) keep resolving real geometry (valid fetch const 0). The publisher-logo QuadList (vs 0x03b7b020) is correctly seen to fetch from a fetch constant whose dword0 = 0x1 (no vertex buffer) — i.e. its geometry is NOT sourced from a memory vertex buffer, so it still (correctly) falls to the procedural path. That remaining gap (the logo's auto-generated/index-derived geometry) is the next milestone-1 step; this commit removes the decoder defect that masked it. Determinism: UI-only. `resolve_vertex_window` runs only when `frame_captures` is `Some` (i.e. `--ui`); the headless `--gpu-inline` core never calls it. `check -n50000000 --gpu-inline --stable-digest` exit 0 and byte-identical run-to-run. cargo test --workspace: 681 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
da7c29b6d2 |
[iterate-3V] Fix logo texture: map texture-fetch physical base onto backing window
The publisher-logo texture (K8888 1280x768 linear, `E59B2B3D`'s tfetch surface) rendered flat/transparent because the GPU texture decode read the wrong host bytes — NOT because the asset was never decompressed. First-divergence (vs canary, measured both engines): - Ours DOES read game:\hidden\Resource3D\*.xpr in full, builds a byte-identical cache, decompresses the logo, and CPU-writes the real artwork (~839K nonzero bytes) into the texture buffer — at the guest physical-aperture VA 0x4dbee000 (writer sub_823C3E70 @ 0x823c3f8c). This REFUTES the iterate-3U verdict that the texture was never filled. - BUT the GPU decode used the raw fetch-constant base 0x0dbee000 as a virtual address. In ours' flat 4GB memory, virtual 0x0dbee000 and the physical alias 0x4dbee000 are DIFFERENT host bytes (no aliasing in the read/write path), so the decode read all-zeros. The Xenos texture fetch constant carries a guest *physical* base; the CPU writes texels through its cached-physical aperture, which ours backs at the committed 0x4000_0000 window. Map the base via the existing `physical_to_backing` helper before reading — exactly as the vertex fetch path (draw_capture.rs, iterate-3Q) and as canary reads textures through its GPU shared memory (= physical). Measured after fix (env-gated probe, removed): the logo decode reads base=0x4dbee000 and produces 839068/3932160 nonzero bytes (21.3%) — a centered logo on a transparent field, matching canary's ~21% exactly. Determinism: GPU-side pure read; no CPU/guest-memory state changes. The n50m --gpu-inline --stable-digest golden is byte-identical (verified 2x, texture_cache_entries unchanged). cargo test --workspace green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
1b9918450f |
[iterate-3T] Real UV interpolation + per-draw textures: shader/UV/bind chain complete
Build the full texture-sampling chain for the publisher splash so the textured logo CAN sample real artwork at the guest's real UVs. Measured with an env-gated frontbuffer readback (since removed): the chain is correct end-to-end, but the sampled K8888 1280x768 texture is ALL-ZERO in the UI window's reachable boot range — the artwork is produced by an EDRAM resolve (RT->texture copy) that ours does not yet perform (resolves=0). So this lands the correct shader/UV/bind work and isolates the remaining blocker to the resolve gap, not the shader path. Translator (xenia-gpu/src/translator.rs), all UI-translator-only: - Real Xenos export-index model (replaces the AllocKind heuristic that collapsed every VS export to one color slot and DROPPED the texcoord). When export_data is set the 6-bit vector_dest IS the export index: VS 62=oPos, 0..15=interps; PS 0=RT0. The logo VS exports oPos(62), interp0(color), interp1(UV) distinctly. - Real interpolator passthrough: VsOut carries 8 interpolator locations; the PS seeds r[i] = in.interp[i] (Xenos PS-input-GPR mapping) so tfetch samples at the real interpolated texcoord (r1) instead of (0,0). - vfetch format 6 (k_16_16) packed-16 unpack + per-attribute dword offset, so the 3 vfetches sharing one fetch-constant (pos/UV/color in a 6-dword vertex) read the right attribute. Previously rejected the whole logo VS to the interpreter. - QuadList/RectangleList host->guest vertex-index remap in the VS (replay is non-indexed): QuadList 6 host verts -> guest [0,1,2,0,2,3] (full quad). fetch.rs: decode vfetch `offset` (dword2[8:15], dwords), `is_signed`, `is_normalized`. Per-draw textures: DrawCapture carries the decoded texture(s) (keyed off the active PS's tfetch slots, attached in gpu_system after decode); render.rs::dispatch_xenos_captures uploads + binds each capture's texture via the host texture cache before its draw, instead of one last-draw primary_texture. Determinism: all changes feed only the UI translator/capture path; frame_captures is None headless. `check -n50m --gpu-inline --stable-digest --expect` byte- identical (exit 0). 681 tests pass (+2 regression: logo VS now translates with interpolators; PS seeds interps into registers). Temp readback/dump probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
80fbff8bd1 |
[iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform
The 3O→3R real-render slice ran the guest's real translated VS/PS on real captured vertices at full boot speed, but the --ui window stayed blank. Bifurcated with an env-gated frontbuffer readback + per-vertex NDC dump (both removed): the captured splash quads (RectangleList, k_32_32_FLOAT, 3 verts) were non-zero and sane, so this was a transform/decode chain of bugs, not missing geometry. Four coupled root causes: - GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but canary's AluInstruction lays dest/write-mask/export/scalar-opcode in w0, the vector opcode + source regs in w2, swizzle/negate/pred in w1. The misread made every *export* ALU decode with vector_write_mask=0 → no oPos/oColor export emitted → the translated VS collapsed every vertex to the clip origin. Rewrote the field map to match ucode.h:2036-2086. - GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's X into .w → negative W → the rectangle clipped behind the camera. Decode the real VertexFormat + dword stride and emit the matching component read (1/2/3/4 float formats; others reject to the interpreter). - GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed the buffer base from xenos_consts.fetch[], but that uniform carries the last-published per-frame fetch constant, not this draw's (stale 0x8a000002 vs the real base). The captured window already begins at the fetch base, so index from 0 (vertex i at i*stride) when a real window is present; only the synthetic fallback consults the uniform. - iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL): the guest VS emits screen-space pixel coords (clip disabled, VTE viewport scale/offset off). Added compute_ndc_xy (mirrors canary GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in both the translated and interpreter VS. Result (env-gated readback, since removed): the real splash geometry now fills ~50% of the frontbuffer in a clean triangular coverage pattern, real positions from real guest vertices through the real translated shaders (textures are the next stage — sampled color is still the magenta/white texture stub, tex-cache=0). Headless-inert: draw_capture is only built when frame_captures is Some (--ui); the changed decoders feed only the UI translator/metrics. Golden byte-identical (check -n50m --gpu-inline --stable-digest exit 0); 679 workspace tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
a3aa3cc7d6 |
[iterate-3Q] draw_capture: read vertex window via physical alias
The UI geometry-capture read the vertex-fetch base at its bare low VA (~0x0adf_xxxx), which is unmapped in ours, so it copied all-zeros. The fetch constant's address:30 field is a guest *physical* dword address (canary reads it via Memory::TranslatePhysical, draw_util.cc:961). Ours only maps the cached-physical window at 0x4000_0000 (physical_to_backing). Rebase a low physical base onto that mapped alias when the raw VA is unmapped; window_base_dwords still carries the original base so the shader's rebase indexes the uploaded window. Decode itself was verified correct against canary (xe_gpu_vertex_fetch_t + GetVertexFetch + ucode.h vfetch const_index*3+const_index_sel): for the splash draws const_index_sel==0, so ours' stride-6 register offset lands on the exact same constant as canary's stride-2 offset; raw dwords match byte-for-byte. UI-only path (frame_captures is None in headless), so the deterministic --gpu-inline golden is byte-identical (verified) and 679 tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
6ff184694d |
[iterate-3P] Real splash geometry in --ui: fix CF predication decode + translator op coverage
Stage 1 of the iterate-3O resume plan: make the P7 translator actually compile the splash's real VS/PS so real per-vertex POSITIONS render via the host wgpu pipeline, instead of every draw falling to the interpreter (which emits a placeholder triangle). Two coupled fixes, both faithful (Route A): 1. ucode/control_flow.rs (GPUBUG-103): clause-level predication was decoded from payload bits 28/29, which fall inside the exec clause's `sequence_`/ `vc_hi_` fields, NOT the predicate flag. That stamped `predicated=true` on plain `kExec` clauses, so the translator rejected EVERY splash VS as `cf_cond`. Per canary ucode.h, clause predication is determined by the *opcode* (only kCondExecPred* = 5/6/13/14 are predicate-register-gated; their `condition_` is at word1 bit 9 = payload bit 41). kExec/kExecEnd (1/2) run unconditionally; kCondExec (3/4) is bool-constant-gated (not modeled). Diagnosed live in --ui: reject reason cf_cond on all 7 splash shader pairs → after fix, predicated=false and CF passes. 2. translator.rs: with CF passing, the next reject was `scl_op_unsupported` for scalar opcodes 4 (kMulsPrev2 / LIT emul) and 8 (kSgts), plus thin vector coverage. Expanded vector_expr + scalar_expr to mirror the runtime interpreter's op set (which mirrors canary AluVectorOpcode/AluScalarOpcode): CND_EQ/GE/GT, TRUNC, MAX4, DST for vectors; the full SEQS/SGTS/SGES/SNES, MULS_PREV2 (with the -FLT_MAX / non-finite / b<=0 guard), SUBS(_PREV), EXP/LOG/RCP/RSQ/SQRT/SIN/COS, FRCS/TRUNCS/FLOORS for scalars. Side-effecting ops (setp*/kills*/maxas*) still reject → interpreter fallback (honest). Result (--ui, measured): xlated-pipelines 0→6, all draws served by the translator (served_interp=0) — real VS/PS now run on the host GPU. The splash is still not visibly correct because the captured guest vertex windows read all-zero: the vertex-buffer base VA (~0x0adf_xxxx) is UNMAPPED in guest memory (mem.translate()==None). That is a CPU/kernel memory-mapping gap, not a GPU-render gap — the next stage. Determinism: both files are in xenia-gpu core but the CF `predicated` field only feeds the UI translator + a metric tag, never deterministic state. Verified: `check -n50000000 --gpu-inline --stable-digest` matches the golden byte-for-byte (exit 0); 679 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
504592ac13 |
[iterate-3O] Real-render slice: replay guest geometry in --ui (Route A)
Replace the synthetic placeholder triangle in the --ui window with the
splash's REAL guest geometry, proving the faithful-render pipe end to end.
Architecture: Route A (UI-side replay). A per-draw capture channel carries
each PM4_DRAW_INDX*'s real state to the UI, which replays it through the
existing wgpu Xenos pipeline. The deterministic headless core is untouched:
capture is gated on an Option<Vec<DrawCapture>> that is None in headless
mode and only enabled on the --ui path, so the --gpu-inline n50m golden is
byte-identical (verified 2x).
The hard part was sourcing real vertices. The WGSL VS already does
format-aware vertex fetch from the b4 storage buffer at the address from the
fetch constant -- but b4 was never populated and the fetch address is an
absolute guest dword address. The slice:
* xenia-gpu/draw_capture.rs: parse the active VS, find its first vertex
fetch, read that fetch constant, copy a bounded window of guest memory
at the fetch base. Best-effort: has_real_vertices=false falls back to
procedural geometry (never fabricated pixels).
* gpu_system.rs: accumulate one DrawCapture per draw into frame_captures.
* exports.rs (vd_swap): drain + publish the frame's captures to the UI.
* ui_bridge/bridge.rs: new publish_geometry channel + UiHandles.geometry.
* WGSL (interp + translator): rebase the absolute fetch address by a new
DrawConstants.vertex_base_dwords so it indexes the uploaded window.
* render.rs: dispatch_xenos_captures uploads each draw's real vertex
window + matching shader, issues real DrawRequests (real prim type,
host vertex count, vs/ps keys).
* app.rs: prefer the real-capture replay; HUD adds real-geo=N counter.
Verified in --ui on Sylpheed: "first Xenos capture batch replayed (real
geometry) captures=24 real_vertex_draws=24" -- all draws resolved a real
guest vertex window; WGSL compiles; no validation errors over 1616 swaps.
Still synthetic-free but not yet pixel-perfect: textures/UVs, DMA index
buffers (auto-index only for now), and kCopy resolve routing are staged
for follow-ups. Faithful: real vertex data, prim types, shaders, constants.
cargo test --workspace green; n50m golden unchanged (2x byte-identical).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
6bb4355e3d |
[iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds
The publisher splash (title idx0) rendered FLAT in ours while canary samples a texture: ours never decoded the logo's textured pixel shader (E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same microcode canary does (verified byte-identical against the Wine oracle). The shader was misparsed as flat. Three coupled bugs in the ucode decoder, all off vs canary `gpu/ucode.h`: 1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc, 15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal `Exit`, truncating the CF block and dropping every instruction (incl. the `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on an END-bit exec clause. 2. exec/loop `address` is an absolute instruction-triple index from shader dword 0, but indexed our post-CF `instructions` slice directly (`ucode/mod.rs`). Rebase addresses by the CF triple count so `address*3` lands on the right instruction. 3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index` read from bit 5 (actually `src_reg`) instead of bit 20, and texture `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0` was read as `tf1`, whose empty fetch-constant failed to decode → no texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`). Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and `gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the only golden field that changed — all draw/swap counts unchanged). The same fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path. Added a regression test that parses the real E59B2B3D microcode and asserts a tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
2f55d1fd7d |
[iterate-2X] Texture pipeline: un-stub RectangleList + draw-time texture decode
Two faithful, deterministic GPU-backend changes that make the texture path
correct for whatever textured draw the splash eventually dispatches. Both are
currently inert on Sylpheed (the textured logo draw is still gated downstream
— see below), but neither shifts the stable-digest golden, so they land safely.
1. Un-stub RectangleList primitive expansion (primitive.rs). The splash submits
2819 RectangleList draws at 200M, all of which were REJECTED by the P3 stub
(`gpu.primitive.rejected{rectangle_list}`) → only ~592 flat point/quad draws
rasterized. Mirror canary's intent (primitive_processor.cc:389-456
kRectangleListAsTriangleStrip) within our CPU index-rewrite idiom: emit each
rect's 3 real vertices as one TriangleList triangle (v0,v1,v2), rejected=false,
faithful host_vertex_count. The full quad (synthesized 4th corner v3=v0+v2-v1)
needs real vertex fetch in vs_main — left as a documented TODO. Rejection
warnings drop 2819→0.
2. Draw-time texture decode keyed off the active PS's real tfetch slots
(gpu_system.rs + exports.rs vd_swap). Previously vd_swap decoded a hardcoded
fetch-constant slot 0 at swap time. Now the DRAW handler parses the bound
pixel shader (ucode::parse_shader), collects its tfetch fetch_const slots via
new shader_metrics::tfetch_slots, reads each 6-dword fetch constant, and
decode+caches it into GpuSystem::last_draw_textures. vd_swap publishes the
first of these (UI binds one texture today), falling back to the legacy slot-0
probe on flat-only frames. New span_max_version helper walks page_version over
the trait (draw-time &dyn MemoryAccess lacks the heap's inherent
max_page_version). Pure function of guest writes — deterministic.
Status: texture_decodes stays 0 on Sylpheed because all 6 live shaders are flat
(no tfetch); canary's textured logo shaders E59B2B3D/F7B1457 are not yet
dispatched by ours (a downstream title-state gate, the next frontier). The full
P5 decode→publish→upload→sample path is already wired; this makes the decode
side key off the real shader instead of a guess.
Validation: stable-digest golden sylpheed_n50m unchanged (draws=718 swaps=147
tex=0), regenerated twice byte-identical; 200M run shows 0 RectangleList
rejections. cargo test --workspace green (677, +2: rectangle_list_expansion,
tfetch_slots_extracts_texture_fetch_constants). No temp hooks. Branch only;
not pushed/merged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
a91f4c550b |
[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation
The title's per-frame loop (sub_822F1AA8) is clock-B-paced and only re-fires when the swap count [controller+88] changes, which advances only on source=1 CP swap-complete interrupts. Each present batch the guest submits (via the sub_824CE348 -> sub_824BF4D0 builder) ends with a WAIT_REG_MEM on a per-CPU swap-acknowledge fence [GCTX+0] (GCTX = [device+10772]); the GPU parks there until the graphics ISR (sub_824BE9A0) clears that CPU's bit. Two coupled gaps kept ours emitting only ONE source=1 then dead-locking (draws plateaued at 28, run halted ~19.27M): 1. GPU MMIO register 0x1961 (AVIVO_D1MODE_VIEWPORT_SIZE) read as 0. The swap callback sub_824CE2B8 divides by its low 12 bits (display height) as a refresh-pacing term, so a 0 read tripped its `twi` divide-by-zero guard and aborted the ISR before it reached the fence-clear. Mirror canary GraphicsSystem::ReadRegister (graphics_system.cc:311): return 0x050002D0 (1280x720). 2. The ISR ran on an arbitrary borrowed thread, so [r13+268] (the PCR processor number) did not match the interrupt's target CPU. The ISR clears `1 << current_cpu` from the fence; running on the wrong CPU cleared the wrong bit and the fence (bit 2, from cpu_mask 0x4) never reached 0. Carry the target CPU through the interrupt queue (bit index of the PM4_INTERRUPT cpu_mask for CP, 2 for vsync per canary DispatchInterruptCallback(0, 2)) and impersonate it on the borrowed thread's PCR around the ISR, mirroring canary EmulateCPInterruptDPC -> XThread::SetActiveCpu. With both fixes the fence clears, the GPU drains each present batch, source=1 sustains per-present, clock B advances, and the loop runs continuously. Draws climb linearly with the budget (no re-stall): 50M 28->718, 200M ->3411, 1B ->18734; swaps 2->147/950/6060. No "Unanticipated CPU_INTERRUPT" trap. Inline-deterministic (--stable-digest byte-identical x2); n50m golden re-baselined. 675 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
873c197ff1 |
[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt
Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
1ae472bd2b |
[iterate-2S] GPU: implement CP SCRATCH_REG memory writeback — arms Sylpheed's swap-callback slot
Sylpheed renders the splash (draws=78, iterate-2O) then plateaus: the
title's per-frame manager (sub_821741C8) only re-fires when "clock B"
([gfx+15160], swap count) changes, which only the CP swap-complete
callback sub_824CE2B8 increments. The graphics ISR sub_824BE9A0
indirect-calls that callback via [[gfx+10772]+16] on CP (source=1)
interrupts, but the slot stayed NULL so the callback never ran.
Root (runtime-verified, ours-side GPU): the guest arms the slot through
the Xenos CP scratch-register writeback path, which ours never
implemented. The arming IB (drained by ours at 0x4adf5180) contains a
Type-0 register write of the callback PC 0x824ce2b8 into SCRATCH_REG4
(0x057C). On hardware/canary, writing a SCRATCH_REG{n} mirrors the value
to SCRATCH_ADDR + n*4 in memory when the matching SCRATCH_UMSK bit is
set. Runtime values: SCRATCH_ADDR=0x0b1d5000 (the [gfx+10772]
descriptor), SCRATCH_UMSK=0x20033 (bit 4 set), so SCRATCH_REG4 ->
0x0b1d5010 = descriptor+16 = the callback slot (0x4b1d5010). Ours
decoded the Type-0 write into the register file but performed no
writeback (case a: drained-but-mishandled), so the slot stayed NULL.
Fix mirrors canary's CommandProcessor::HandleSpecialRegisterWrite
(command_processor.cc:545-552): a scratch_register_writeback() helper
called from handle_type0/handle_type1 after every register write; for
SCRATCH_REG0..7 with the UMSK bit set, it writes the value (big-endian,
as mem.write_u32 already stores) to SCRATCH_ADDR + n*4 (projected via
physical_to_backing). Deterministic given identical register state;
proven by unit test.
Cascade (verified by runtime probe): slot 0x4b1d5010 now armed with
0x824ce2b8; on the 2-3 CP interrupts that fire, the ISR reads the slot
and bcctrl's into sub_824CE2B8 (runs 2x; 0x cascade on master);
sub_824CE2B8 increments clock B ([gfx+15160]). The cascade does NOT yet
reach draws>78: there are only ~3 CP interrupts (from the initial 9825-
packet batch), and the title render loop stalls upstream (the iterate-2Q
title-respawn gate) before it submits more PM4_INTERRUPT work, so the
callback can't bootstrap a self-sustaining loop. This is the remaining
update-17/18 arming gap closed; the upstream stall is the next gate.
The default threaded GPU backend drains the ring on a separate host
thread, so with the callback now doing work the exact CP-interrupt
delivery instruction varies run to run (pre-existing GPU-thread race).
Pin the n50m oracle test to --gpu-inline (instruction-count
deterministic) and re-baseline its golden; bit-exact across repeated
runs. New unit test scratch_reg_write_mirrors_to_memory_when_umsk_enabled.
Tests: 675 pass (was 674). Golden re-baselined + determinism verified.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
||
|
|
034ec8b47f |
[iterate-2O] GPU: drain indirect buffers correctly — Sylpheed renders splash (draws 0→78)
Ours' GPU never drained the D3D driver's system command buffer past the first 11-dword indirect buffer, so DRAW_INDX / reg-0x57C-arm packets never executed and draws stayed 0 (the long-hunted render gate; see UPDATE-18). Runtime tracing (temporary, removed) showed the guest submits 6 INDIRECT_BUFFER packets at boot (CP_RB_WPTR 22→37) but ours executed exactly ONE IB and then spun 15.7M packets inside it. Three coupled command-processor bugs, all corrected to match canary: 1. `sync_with_mmio` applied the primary CP_RB_WPTR to whichever ring was active, including an executing indirect buffer — `37 % 11 = 3` clobbered the IB's write pointer so its read pointer looped 0→2→5→0 forever and never popped back to the primary ring. CP_RB_WPTR governs ONLY the primary ring; while an IB executes, the primary is the bottom of the IB stack. Canary executes each IB through a separate `RingBuffer reader_` (command_processor.cc), so the primary write pointer is structurally inapplicable to an IB. 2. Indirect buffers were treated as circular rings: read wrapped at `size_dwords` (`11 % 11 = 0`) and never reached the fixed write pointer, so even without the clobber the IB could not terminate. An IB is a fixed *linear* sub-stream; add `RingBufferView.indirect` and drain `[0, ib_size)` monotonically, then pop. 3. `is_ready` only checked the active ring, so an IB that now correctly exhausts would never get `execute_one` called again to pop back to the primary ring (whose WPTR may have advanced). Check the whole IB stack. Also: the ring was sized `1 << size_log2` bytes (1024 dwords) vs canary's `1 << (size_log2 + 3)` (8192 dwords) — an 8× undersize that desynced WPTR-wrap math from the guest. Fixed in `GpuSystem::initialize_ring_buffer` (and the dead bookkeeping copy in `vd_initialize_ring_buffer`). Cascade (deterministic; threaded-default backend, byte-identical across runs): reg 0x57C now written, IB jumps 1→12, packets 15.7M→9,825, and the splash renders — draws 0→78, shaders 0→3, render_targets 0→2, swaps 2→3 — stable at 50M / 200M / 1B. Boot then reaches a new downstream gate (draws plateau at 78, interrupts keep climbing → engine alive, not deadlocked). golden `sylpheed_n50m.json` re-baselined (draws 78). `cargo test --workspace` green (674; +2 ring_view regression tests). vd_swap's synthetic-swap short-circuit is now redundant but left untouched (cascade works without changing it); cleaning it up is a separate follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
2bdb93e51e |
[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region
Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 | (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
||
|
|
7a1b6b3306 |
fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel
The Phase-C VdSwap PM4 ring path (commit |
||
|
|
8fc1b1dfed |
fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer
The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO write callback) uses `Ordering::Release` so any ring-memory writes the guest performed before bumping WPTR are visible to a paired `Acquire`-load on the consumer. The consumer here at `sync_with_mmio` was using `Ordering::Relaxed` for both the WPTR load and the RPTR mirror store — leaving the Release/Acquire pairing broken. Under `--parallel`, this broken pairing means the GPU worker can observe a fresh WPTR value while still reading stale ring-memory contents at the corresponding offsets — garbage PM4 packets. The audit's M11 grid run confirmed --parallel is non-deterministic beyond the documented `packets` ±5% noise; this fix is one strand of that. Symmetric fix on the RPTR mirror store: Release pairs with any guest-side Acquire-load of CP_RB_RPTR for ring-writeback bookkeeping. Verification at -n 100M lockstep: swaps: 2 → 2 (unchanged) draws: 0 → 0 (unchanged) packets: ~60M (within noise) Tests: 149 (no count change; this is a memory-ordering correctness fix, not a behavioral change visible at the digest level in lockstep). Closes GPUBUG-006 (P1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
8723d6826b |
fix(gpu): GPUBUG-103/104/105 — fix 8 draw-state register addresses + index_size bit
Eight of the register-index constants in draw_state.rs::reg pointed at
completely unrelated registers because the canonical canary table
(register_table.inc) was misread when the module was first authored.
Re-validated each value against canary's lines 1232-1336.
| Register | Pre-fix | Canary | Was-actually |
| ------------------------- | ------- | ------ | ------------- |
| VGT_DRAW_INITIATOR | 0x2281 | 0x21FC | (junk) |
| VGT_DMA_BASE | 0x2282 | 0x21FA | (junk) |
| VGT_DMA_SIZE | 0x2283 | 0x21FB | (junk) |
| PA_SC_WINDOW_SCISSOR_TL | 0x200E | 0x2081 | SCREEN_SCIS_TL|
| PA_SC_WINDOW_SCISSOR_BR | 0x200F | 0x2082 | SCREEN_SCIS_BR|
| RB_COLOR_INFO_1 | 0x2010 | 0x2003 | COHER_DEST_BASE_10|
| RB_COLOR_INFO_2 | 0x2011 | 0x2004 | COHER_DEST_BASE_11|
| RB_COLOR_INFO_3 | 0x2012 | 0x2005 | COHER_DEST_BASE_12|
| PA_SU_VTX_CNTL | 0x2083 | 0x2302 | PA_SC_CLIPRECT_RULE|
Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR
extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per
canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`).
The block comment in `extract()` was also wrong about the
intermediate field layout and has been refreshed.
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged)
draws: 0 → 0 (still gated — see below)
packets: ~61M (within noise)
Tests: 149 (no count change; existing draw_state tests cover the
new constants implicitly via behavioral round-trip).
The audit predicted Phases C+D+E together would unlock `draws > 0`,
but the runtime plateau is multi-causal per the audit's own analysis
(`project_xenia_rs_audit_2026_05_02.md`). The likely remaining
blockers in -n 100M:
* 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4,
0x42450b5c) — Phase F's XAM/spinlock fixes target this.
* shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD
yet because workers haven't loaded shader resources.
The register fixes here are still load-bearing for any draw that
DOES happen (every register read at 0x2281 was junk before this
commit) — landing them now is correct even if draws=0 persists until
Phase F unparks the resource-loader threads.
Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ec2d955dbd |
fix(gpu): GPUBUG-102 — apply per-format endian byte-swap to vertex fetch
The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`, xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1) selecting kNone/k8in16/k8in32/k16in32 swap patterns per `GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored big-endian; the host is little-endian. Pre-fix every dword was bitcast as-is — vertex positions decoded as byte-reversed garbage, producing clipped or NaN positions in any draw that survived to the host. Mechanical changes: - crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads fetch_const dword 1 (endian) and wraps each lane's load in `gpu_swap(value, endian)`. New `gpu_swap` helper added to the emitted module header. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching `gpu_swap` helper added to the runtime interpreter shader. `interpret_vertex_fetch` reads fc1, computes the endian, and wraps every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT paths). Mirrors the AOT translator's emission. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~60M (within noise) Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call). Closes GPUBUG-102 (P0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c5c6713419 |
fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources
Word-1 of every ALU triple holds three 8-bit component-relative swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7 per canary ucode.h:2064-2066) and three per-operand negate flags (bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT translator discarded word-1 entirely with `_ = w1;` — every ALU result was missing its swizzle (broadcast/permute patterns like `.zyxw`, `.xxxx`) and any negated operand was used positive instead. Component-relative semantics (canary's `AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output component i, the source component is `((swizzle >> (2*i)) + i) & 3`. Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in the interpreter shader treated it as absolute, also incorrect. Mechanical changes: - crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks them from word 1. - crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses component-relative semantics. interpret_alu decodes the modifiers and applies via apply_swizzle + apply_modifiers (with abs=false). - crates/xenia-gpu/src/translator.rs: src_operand emits the precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`, then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a bare base expression so it round-trips with the trivial-shader fixture. Abs is omitted in this commit — the abs flag is dual-meaning (for temps it lives at bit 7 of the src byte; for constants at word-2 bit 7 `abs_constants`). Wiring it up correctly requires more careful case-split logic; deferred to Phase G. Verification at -n 100M lockstep: swaps: 2 → 2 (gated by Phase E for draws) draws: 0 → 0 packets: ~58M (within noise) Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise because identity swizzle test merged into D1's parameterised test). WGSL still validates via naga (combined_module_parses_as_wgsl). Closes GPUBUG-100 (P0). Abs deferred to Phase G. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
78ea81c12a |
fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector
Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h:
2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags
selecting temp register (1) vs ALU constant (0); the corresponding
8-bit src byte indexes either:
- a temp register (bits 5:0 = index, bits 6/7 reserved for
relative-addressing / abs flags consumed by Phase D2), or
- an ALU constant (full 8-bit index).
Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F`
on the src byte and emitted `r[low7]` regardless of the operand class.
Every shader's WVP matrix / light constant / per-frame uniform read
came back as r[low7] — typically zero — yielding invisible rendering.
Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp /
src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our
src_a (low byte of w0) is canary's third operand, hence its selector
is bit 29 (canary src3_sel), not bit 31.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes
the is_temp flag; constants index xenos_consts.alu directly.
- crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the
interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when
constant.
The trivial-shader synthetic test was updated to set the temp flags so
its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the
flags set, all sources would now resolve as constants.
Bank-selection (cf-level relative addressing for higher banks of the
512 ALU constants) remains a Phase G+ extension — covers c0..c127
in bank 0, which most Sylpheed shaders use directly.
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged — gated by D2/D3/E for draws)
draws: 0 → 0
packets: ~61M (within noise)
Tests: 552 → 554 (+2 translator tests for the temp/constant decode).
Closes GPUBUG-101 (P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
82f3d611e2 |
fix(gpu,kernel): KRNBUG-Vd-04 / GPUBUG-001 / XMODBUG-013 — VdSwap PM4 ring path
The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and
called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving
the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping
the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/
blur "sample frame N for frame N+1" path samples fetch-constant slot 0
expecting the frontbuffer descriptor; without the patch, slot 0 stayed
stale and any shader sampling it read garbage.
This commit writes the canary VdSwap PM4 sequence directly into the
primary ring at the current write pointer (read via the shared MMIO
atomic), then advances WPTR over the injection. The natural CP drain
consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant
slot 0 — without going through any direct kernel→GPU bypass.
Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521):
1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with
base_address re-patched from virtual to physical >> 12).
2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys
+ width + height.
Mechanism notes:
- buffer_ptr in xenia-rs is in the system command buffer, NOT the
primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs
ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to
buffer_ptr because its ring layout maps the reserved slot inside
the ring; xenia-rs's doesn't, so we have to write at the actual
ring WPTR address (cached on KernelState.ring_base from
VdInitializeRingBuffer).
- The original "buffer_ptr zero-fill + bump WPTR by 64" path is
preserved before the injection — it exposes any game-batched PM4
packets and keeps the buffer_ptr region skippable per existing
game compat behavior.
- A safety-net fallback at the end calls `notify_xe_swap` directly if
swaps_seen didn't advance during the drain (e.g. a ring-arithmetic
edge case). Idempotent — only fires when the PM4 path didn't.
- KRNBUG-Mm-04 deferred: virt→phys uses the masked stub
`virt & 0x1FFF_FFFF`, sufficient for the standard heap.
Mechanical changes:
- crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3
helpers + round-trip unit test (mirrors canary xenos.h:1682-1709).
- crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor
(Acquire-load) so the kernel can compute ring offsets.
- crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords
on KernelState (set by VdInitializeRingBuffer).
- crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit
block; patch fetch_dwords[1] base_address virt→phys before injection.
Verification at -n 100M lockstep:
swaps: 2 → 2 (game fires VdSwap exactly twice)
draws: 0 → 0 (gated by Phases D+E)
fallback warning: 0 occurrences (PM4 path consumed both swaps)
instructions: ~100M
Tests: 552 passing (553 with new pm4 round-trip test). Lockstep
stable-fields determinism: byte-identical across two 100M runs.
The "swaps > 2" prediction in the audit's plan assumed the game would
fire VdSwap more often once the path worked; empirically Sylpheed only
calls VdSwap twice within 100M instructions (this is the renderer
plateau the audit identified). The success criterion for Phase C is
that the PM4 path is now operational, which Phases D+E require for
visible draws.
Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
79eb52c378 |
xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve)
First real GPU implementation. Ring/PM4 frontend (ring_view,
ring_drain, pm4) drains the command processor; gpu_system owns the
threaded backend (DrainFence RPC + parker/fence helpers from M1) and
the MMIO-mapped register block (mmio_region).
Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode
the Xbox 360 microcode, translator.rs lowers it onto the WGSL
xenos_interp interpreter shader (shaders/xenos_interp.wgsl).
shader_metrics.rs counts decode/translate work.
Render state: draw_state, primitive, render_target_cache,
texture_cache, tiled_address (Xenos's swizzled tiled-memory layout),
xenos_constants (register field constants), edram (the 10 MiB EDRAM
model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve
plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs
owns the typed GPU-resource handles the kernel hands out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c694bb3f43 |
Initial commit: xenia-rs workspace for Xbox 360 RE
Rust reimplementation of the xenia Xbox 360 emulator targeting reverse- engineering and preservation, initially scoped to Project Sylpheed. Includes: - XEX2 loader (LZX decompression, AES decryption, PE parsing) - XISO / XGD2 disc image VFS - PPC interpreter with 200+ opcodes and VMX128 decoding - Static analyzer: functions, cross-references, labels, asm + SQLite output - HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init - Debugger with in-memory and SQLite-backed execution tracing - `xenia-rs` CLI with extract/dis/exec commands that produce cumulative, superset SQLite databases and opt-in instruction/import/branch traces Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |