Commit Graph

8 Commits

Author SHA1 Message Date
MechaCat02
89b5c39d8a [iterate-3X] Real splash logo geometry renders: fix vertex-fetch const_index_sel + per-draw submit
Two readback-proven root-cause fixes make the publisher-logo QuadList draw
land its REAL captured vertex buffer (the texture was already correct from
3V). REFUTES iterate-3W's "logo geometry is auto-generated from vertex_id":
the logo IS sourced from a 4-vertex QuadList buffer at guest physical
0x0adf60f0 (measured), it was just resolved at the wrong fetch-constant
register.

GPUBUG-110 (vertex fetch const_index_sel dropped). The Xenos vertex-fetch
instruction encodes const_index (w0[20:24]) AND const_index_sel (w0[25:26]);
the full constant index is const_index*3 + const_index_sel (canary
ucode.h:700), packed 3 two-dword constants per 6-dword register group.
ucode/fetch.rs decoded only const_index and read sub-slot 0 (fc*6). The logo
vfetch is const_index=31, sel=2 -> the real base lives at reg 0x48BE, but
ours read 0x48BA which held an unused 0x00000001 (base=0,size=0) slot. So
resolve_vertex_window returned None -> has_real_vertices=false -> the logo
fell to the procedural fullscreen magenta fallback. Fix: decode
const_index_sel, add VertexFetch::const_reg_offset() = const_index*6 + sel*2,
and use it in both draw_capture.rs (capture) and translator.rs (the WGSL
endian term + no-window fallback base; the old expression there read the
src_reg bits, not the const index). Measured: logo now resolves a 24-dword
(4 verts x stride 6) window, base 0x0adf60f0.
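
A minimal sketch of the offset arithmetic described above, for illustration only. The struct name `VertexFetchConst` and the constant `FETCH_CONST_BASE` are hypothetical; the base value 0x4800 is an assumption inferred from the registers quoted in this message (0x48BA = 0x4800 + 31*6). The bit ranges and the `const_index*6 + sel*2` formula are taken from the text.

```rust
/// Fetch-constant selector fields of a vertex-fetch instruction's first
/// dword (w0). Hypothetical type for illustration, not the real decoder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VertexFetchConst {
    pub const_index: u32,     // w0[20:24], 5 bits
    pub const_index_sel: u32, // w0[25:26], 2 bits
}

/// Assumed register index of the first fetch-constant dword
/// (inferred from 0x48BA = 0x4800 + 31 * 6 in the commit text).
pub const FETCH_CONST_BASE: u32 = 0x4800;

impl VertexFetchConst {
    pub fn decode(w0: u32) -> Self {
        Self {
            const_index: (w0 >> 20) & 0x1F,
            const_index_sel: (w0 >> 25) & 0x3,
        }
    }

    /// Dword offset into the fetch-constant block: each 6-dword register
    /// group packs three 2-dword vertex-fetch constants.
    pub fn const_reg_offset(&self) -> u32 {
        self.const_index * 6 + self.const_index_sel * 2
    }

    /// Offset the pre-fix decoder effectively used (sub-slot 0 only).
    pub fn old_reg_offset(&self) -> u32 {
        self.const_index * 6
    }
}
```

For the logo vfetch (const_index=31, sel=2) this gives offset 190, i.e. register 0x48BE, while the old expression gives 186, i.e. 0x48BA.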

GPUBUG-111 (single batch encoder = last-draw-wins vertex data). In wgpu every
queue.write_buffer staged before a single queue.submit is applied before ANY
command in that submit runs. dispatch_xenos_captures recorded the whole batch
into one encoder + one submit, so every draw read only the LAST draw's vertex
buffer / per-draw uniforms. The logo quad therefore sampled the trailing
fullscreen background quad's vertices and rasterized nothing where the logo
was. Fix: submit one encoder per draw (frontbuffer LoadOp::Load composites
identically). Measured (env-gated readback, removed): with this fix the logo
draw in isolation renders real varied texels (e.g. (225,17,22)/(255,255,0))
in a centered strip (~20k px), vs 100% navy before.
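
The last-draw-wins effect can be reproduced with a toy model. This is not the wgpu API: `ToyQueue`, `batched`, and `per_draw` are hypothetical names, and the model only encodes the one ordering rule stated above (staged buffer writes are applied when a submit starts, before any recorded command runs).

```rust
/// Toy queue (NOT wgpu): writes are staged, then all applied at the start
/// of `submit`, before any draw recorded for that submit executes.
#[derive(Default)]
pub struct ToyQueue {
    vertex_buffer: Vec<u32>,
    staged_writes: Vec<Vec<u32>>,
}

impl ToyQueue {
    /// Stage a full overwrite of the shared vertex buffer.
    pub fn write_buffer(&mut self, data: &[u32]) {
        self.staged_writes.push(data.to_vec());
    }

    /// Apply staged writes in order, then run `draw_count` draws.
    /// Returns the vertex data each draw observed.
    pub fn submit(&mut self, draw_count: usize) -> Vec<Vec<u32>> {
        for write in self.staged_writes.drain(..) {
            self.vertex_buffer = write;
        }
        (0..draw_count).map(|_| self.vertex_buffer.clone()).collect()
    }
}

/// One encoder + one submit for the whole batch (the pre-fix shape).
pub fn batched(draws: &[Vec<u32>]) -> Vec<Vec<u32>> {
    let mut queue = ToyQueue::default();
    for data in draws {
        queue.write_buffer(data);
    }
    queue.submit(draws.len())
}

/// One submit per draw (the post-fix shape).
pub fn per_draw(draws: &[Vec<u32>]) -> Vec<Vec<u32>> {
    let mut queue = ToyQueue::default();
    let mut observed = Vec::new();
    for data in draws {
        queue.write_buffer(data);
        observed.extend(queue.submit(1));
    }
    observed
}
```

In this model `batched` makes every draw observe the final write, while `per_draw` lets each draw observe its own data.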

Determinism: all changes are UI-side (xenia-ui replay) or the UI translator /
capture path (frame_captures None in headless); the fetch.rs field addition
is purely additive and does not change any existing decoded value. Verified
the deterministic core unchanged: check -n50M --gpu-inline --stable-digest
exit 0 and all 136 metric counters byte-identical across two runs. All temp
probes removed. cargo test --workspace green; new regression test
vertex_fetch_const_index_sel_and_reg_offset.

Known remaining (next iterate): a fullscreen flat QuadList (ps 0x03b79081,
vertex color green, no texture) and other textureless draws overpaint the
logo in the full composite (their per-draw blend/alpha render state is not
yet replayed, and draw order alternates bg/logo). The logo artwork renders
correctly in isolation; the composite is not yet clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 19:25:50 +02:00
MechaCat02
1b9918450f [iterate-3T] Real UV interpolation + per-draw textures: shader/UV/bind chain complete
Build the full texture-sampling chain for the publisher splash so the textured
logo CAN sample real artwork at the guest's real UVs. Measured with an env-gated
frontbuffer readback (since removed): the chain is correct end-to-end, but the
sampled K8888 1280x768 texture is ALL-ZERO in the UI window's reachable boot
range — the artwork is produced by an EDRAM resolve (RT->texture copy) that ours
does not yet perform (resolves=0). So this lands the correct shader/UV/bind work
and isolates the remaining blocker to the resolve gap, not the shader path.

Translator (xenia-gpu/src/translator.rs), all UI-translator-only:
- Real Xenos export-index model (replaces the AllocKind heuristic that collapsed
  every VS export to one color slot and DROPPED the texcoord). When export_data
  is set the 6-bit vector_dest IS the export index: VS 62=oPos, 0..15=interps;
  PS 0=RT0. The logo VS exports oPos(62), interp0(color), interp1(UV) distinctly.
- Real interpolator passthrough: VsOut carries 8 interpolator locations; the PS
  seeds r[i] = in.interp[i] (Xenos PS-input-GPR mapping) so tfetch samples at the
  real interpolated texcoord (r1) instead of (0,0).
- vfetch format 6 (k_16_16) packed-16 unpack + per-attribute dword offset, so the
  3 vfetches sharing one fetch-constant (pos/UV/color in a 6-dword vertex) each
  read the right attribute. Previously the translator rejected the whole logo VS
  to the interpreter.
- QuadList/RectangleList host->guest vertex-index remap in the VS (replay is
  non-indexed): QuadList 6 host verts -> guest [0,1,2,0,2,3] (full quad).
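
A small sketch of two of the mappings listed above: the VS export-index classification and the QuadList index remap. The names `VsExport`, `classify_vs_export`, and `quad_list_guest_index` are hypothetical. Extending the remap to several quads per draw (4 guest vertices per 6 host vertices) is an assumption; the message only states the single-quad table `[0,1,2,0,2,3]`.

```rust
/// Meaning of a VS export's 6-bit vector_dest when export_data is set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VsExport {
    Position,         // index 62 (oPos)
    Interpolator(u8), // indices 0..=15
    Other(u8),        // anything not covered by the commit text
}

pub fn classify_vs_export(vector_dest: u8) -> VsExport {
    match vector_dest & 0x3F {
        62 => VsExport::Position,
        i @ 0..=15 => VsExport::Interpolator(i),
        other => VsExport::Other(other),
    }
}

/// Two host triangles covering one guest quad.
pub const QUAD_TO_TRIANGLES: [u32; 6] = [0, 1, 2, 0, 2, 3];

/// Map a non-indexed host vertex index to the guest QuadList vertex index.
/// Assumes quads are stored back to back, 4 guest vertices each.
pub fn quad_list_guest_index(host_vertex_index: u32) -> u32 {
    let quad = host_vertex_index / 6;
    let corner = QUAD_TO_TRIANGLES[(host_vertex_index % 6) as usize];
    quad * 4 + corner
}
```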

fetch.rs: decode vfetch `offset` (dword2[8:15], dwords), `is_signed`,
`is_normalized`.

Per-draw textures: DrawCapture carries the decoded texture(s) (keyed off the
active PS's tfetch slots, attached in gpu_system after decode);
render.rs::dispatch_xenos_captures uploads + binds each capture's texture via the
host texture cache before its draw, instead of one last-draw primary_texture.

Determinism: all changes feed only the UI translator/capture path; frame_captures
is None headless. `check -n50m --gpu-inline --stable-digest --expect` byte-
identical (exit 0). 681 tests pass (+2 regression: logo VS now translates with
interpolators; PS seeds interps into registers). Temp readback/dump probes removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 17:12:16 +02:00
MechaCat02
80fbff8bd1 [iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform
The 3O→3R real-render slice ran the guest's real translated VS/PS on real
captured vertices at full boot speed, but the --ui window stayed blank.
Bifurcated with an env-gated frontbuffer readback + per-vertex NDC dump
(both removed): the captured splash quads (RectangleList, k_32_32_FLOAT,
3 verts) were non-zero and sane, so this was a transform/decode chain of
bugs, not missing geometry. Four coupled root causes:

- GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but
  canary's AluInstruction lays dest/write-mask/export/scalar-opcode in w0,
  the vector opcode + source regs in w2, swizzle/negate/pred in w1. The
  misread made every *export* ALU decode with vector_write_mask=0 → no
  oPos/oColor export emitted → the translated VS collapsed every vertex to
  the clip origin. Rewrote the field map to match ucode.h:2036-2086.

- GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded
  R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are
  k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's
  X into .w → negative W → the rectangle clipped behind the camera. Decode
  the real VertexFormat + dword stride and emit the matching component
  read (1/2/3/4 float formats; others reject to the interpreter).

- GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed
  the buffer base from xenos_consts.fetch[], but that uniform carries the
  last-published per-frame fetch constant, not this draw's (stale
  0x8a000002 vs the real base). The captured window already begins at the
  fetch base, so index from 0 (vertex i at i*stride) when a real window is
  present; only the synthetic fallback consults the uniform.

- iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL):
  the guest VS emits screen-space pixel coords (clip disabled, VTE viewport
  scale/offset off). Added compute_ndc_xy (mirrors canary
  GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with
  the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in
  both the translated and interpreter VS.
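
The pixel-to-clip rescale in the last item can be sketched as follows. This is a simplified illustration, not the actual compute_ndc_xy: the function name `pixel_to_ndc` is hypothetical, and it assumes a render target whose origin is the top-left pixel with no additional viewport offset or scale.

```rust
/// Map a screen-space pixel coordinate to wgpu clip-space XY in [-1, 1].
/// Guest pixel Y grows downward; clip-space Y grows upward, hence the flip.
pub fn pixel_to_ndc(px: f32, py: f32, rt_width: f32, rt_height: f32) -> (f32, f32) {
    let ndc_x = 2.0 * px / rt_width - 1.0;
    let ndc_y = 1.0 - 2.0 * py / rt_height;
    (ndc_x, ndc_y)
}
```

With a 1280x720 target, the top-left pixel corner maps to (-1, 1), the center to (0, 0), and the bottom-right corner to (1, -1).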

Result (env-gated readback, since removed): the real splash geometry now
fills ~50% of the frontbuffer in a clean triangular coverage pattern, real
positions from real guest vertices through the real translated shaders
(textures are the next stage — sampled color is still the magenta/white
texture stub, tex-cache=0). Headless-inert: draw_capture is only built
when frame_captures is Some (--ui); the changed decoders feed only the UI
translator/metrics. Golden byte-identical (check -n50m --gpu-inline
--stable-digest exit 0); 679 workspace tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:01 +02:00
MechaCat02
6ff184694d [iterate-3P] Real splash geometry in --ui: fix CF predication decode + translator op coverage
Stage 1 of the iterate-3O resume plan: make the P7 translator actually
compile the splash's real VS/PS so real per-vertex POSITIONS render via the
host wgpu pipeline, instead of every draw falling to the interpreter (which
emits a placeholder triangle). Two coupled fixes, both faithful (Route A):

1. ucode/control_flow.rs (GPUBUG-103): clause-level predication was decoded
   from payload bits 28/29, which fall inside the exec clause's `sequence_`/
   `vc_hi_` fields, NOT the predicate flag. That stamped `predicated=true`
   on plain `kExec` clauses, so the translator rejected EVERY splash VS as
   `cf_cond`. Per canary ucode.h, clause predication is determined by the
   *opcode* (only kCondExecPred* = 5/6/13/14 are predicate-register-gated;
   their `condition_` is at word1 bit 9 = payload bit 41). kExec/kExecEnd
   (1/2) run unconditionally; kCondExec (3/4) is bool-constant-gated (not
   modeled). Diagnosed live in --ui: reject reason cf_cond on all 7 splash
   shader pairs → after fix, predicated=false and CF passes.

2. translator.rs: with CF passing, the next reject was `scl_op_unsupported`
   for scalar opcodes 4 (kMulsPrev2 / LIT emul) and 8 (kSgts), plus thin
   vector coverage. Expanded vector_expr + scalar_expr to mirror the runtime
   interpreter's op set (which mirrors canary AluVectorOpcode/AluScalarOpcode):
   CND_EQ/GE/GT, TRUNC, MAX4, DST for vectors; the full SEQS/SGTS/SGES/SNES,
   MULS_PREV2 (with the -FLT_MAX / non-finite / b<=0 guard), SUBS(_PREV),
   EXP/LOG/RCP/RSQ/SQRT/SIN/COS, FRCS/TRUNCS/FLOORS for scalars. Side-effecting
   ops (setp*/kills*/maxas*) still reject → interpreter fallback (honest).
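
The opcode-driven predication rule from fix 1 can be sketched like this. The function names are hypothetical; the opcode numbers and the payload bit position (41) are the ones quoted above, and the 48-bit clause payload is assumed to be held in the low bits of a `u64`.

```rust
/// Only the kCondExecPred* opcodes are gated by the predicate register.
pub fn clause_is_predicate_gated(opcode: u8) -> bool {
    matches!(opcode, 5 | 6 | 13 | 14)
}

/// Returns the required predicate value for predicate-gated clauses,
/// or None for clauses that do not consult the predicate register.
pub fn clause_predicate_condition(opcode: u8, payload: u64) -> Option<bool> {
    if clause_is_predicate_gated(opcode) {
        // `condition_` sits at word1 bit 9, i.e. payload bit 41.
        Some((payload >> 41) & 1 == 1)
    } else {
        None
    }
}
```

A plain kExec clause (opcode 1) returns `None` even when payload bits 28/29 happen to be set, which is the case the old decode got wrong.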

Result (--ui, measured): xlated-pipelines 0→6, all draws served by the
translator (served_interp=0) — real VS/PS now run on the host GPU. The
splash is still not visibly correct because the captured guest vertex
windows read all-zero: the vertex-buffer base VA (~0x0adf_xxxx) is UNMAPPED
in guest memory (mem.translate()==None). That is a CPU/kernel memory-mapping
gap, not a GPU-render gap — the next stage.

Determinism: both files are in xenia-gpu core but the CF `predicated` field
only feeds the UI translator + a metric tag, never deterministic state.
Verified: `check -n50000000 --gpu-inline --stable-digest` matches the golden
byte-for-byte (exit 0); 679 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 15:07:06 +02:00
MechaCat02
6bb4355e3d [iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds
The publisher splash (title idx0) rendered FLAT in ours while canary samples
a texture: ours never decoded the logo's textured pixel shader
(E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same
microcode canary does (verified byte-identical against the Wine oracle). The
shader was misparsed as flat. Three coupled bugs in the ucode decoder, all
off vs canary `gpu/ucode.h`:

1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec
   and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the
   cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc,
   15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal
   `Exit`, truncating the CF block and dropping every instruction (incl. the
   `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on
   an END-bit exec clause.

2. exec/loop `address` is an absolute instruction-triple index counted from
   shader dword 0, but it was used to index our post-CF `instructions` slice
   directly (`ucode/mod.rs`). Rebase addresses by the CF triple count so
   `address*3` lands on the right instruction.

3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index`
   read from bit 5 (actually `src_reg`) instead of bit 20, and texture
   `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0`
   was read as `tf1`, whose empty fetch-constant failed to decode → no
   texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not
   bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`).
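
A compact sketch of the corrected opcode table from fix 1 and the address rebase from fix 2. The enum `CfKind` and both function names are hypothetical, and the coarse grouping (one variant per family) is chosen for brevity; the rebase is one reading of "rebase addresses by the CF triple count", assuming the instruction slice starts right after the CF block.

```rust
/// Coarse classification of the 4-bit Xenos control-flow opcode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CfKind {
    Nop,
    Exec,
    ExecEnd,
    CondExec, // 3..=6 and 13..=14
    Loop,     // 7, 8
    CallReturn, // 9, 10
    CondJmp,
    Alloc,
    MarkVsFetchDone,
}

pub fn cf_kind(opcode: u8) -> CfKind {
    match opcode & 0xF {
        0 => CfKind::Nop,
        1 => CfKind::Exec,
        2 => CfKind::ExecEnd,
        3..=6 | 13 | 14 => CfKind::CondExec,
        7 | 8 => CfKind::Loop,
        9 | 10 => CfKind::CallReturn,
        11 => CfKind::CondJmp,
        12 => CfKind::Alloc,
        _ => CfKind::MarkVsFetchDone, // 15, the only remaining 4-bit value
    }
}

/// Convert an absolute instruction-triple address into a dword offset
/// within a slice that begins after `cf_triple_count` CF triples.
/// Returns None if the address points into the CF block itself.
pub fn instruction_dword_offset(address: u32, cf_triple_count: u32) -> Option<u32> {
    address.checked_sub(cf_triple_count).map(|triple| triple * 3)
}
```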

Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now
resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and
`gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the
only golden field that changed — all draw/swap counts unchanged). The same
fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path.

Added a regression test that parses the real E59B2B3D microcode and asserts a
tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 21:53:35 +02:00
MechaCat02
c5c6713419 fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources
Word-1 of every ALU triple holds three 8-bit component-relative
swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7
per canary ucode.h:2064-2066) and three per-operand negate flags
(bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT
translator discarded word-1 entirely with `_ = w1;` — every ALU
result was missing its swizzle (broadcast/permute patterns like
`.zyxw`, `.xxxx`) and any negated operand was used positive instead.

Component-relative semantics (canary's
`AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output
component i, the source component is `((swizzle >> (2*i)) + i) & 3`.
Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in
the interpreter shader treated it as absolute, also incorrect.
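
The component-relative rule can be sketched in a few lines. The function names are hypothetical stand-ins, not the interpreter's or translator's actual code; the index formula is the one quoted above.

```rust
/// Source component for output component `i` under a component-relative
/// 8-bit swizzle: ((swizzle >> (2*i)) + i) & 3.
pub fn swizzled_component_index(swizzle: u8, i: u32) -> usize {
    (((swizzle as u32 >> (2 * i)) + i) & 3) as usize
}

/// Apply a component-relative swizzle to a 4-component value.
pub fn apply_swizzle(v: [f32; 4], swizzle: u8) -> [f32; 4] {
    [0u32, 1, 2, 3].map(|i| v[swizzled_component_index(swizzle, i)])
}

/// Swizzle, then negate every component if the operand's negate flag is set.
pub fn read_operand(v: [f32; 4], swizzle: u8, negate: bool) -> [f32; 4] {
    let swizzled = apply_swizzle(v, swizzle);
    if negate {
        swizzled.map(|c| -c)
    } else {
        swizzled
    }
}
```

Under this rule 0x00 is the identity, 0x6C broadcasts `.xxxx`, and the absolute-style identity 0xE4 selects `.xzxz`, which is why treating the byte as absolute was wrong.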

Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with
  src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks
  them from word 1.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses
  component-relative semantics. interpret_alu decodes the modifiers
  and applies via apply_swizzle + apply_modifiers (with abs=false).
- crates/xenia-gpu/src/translator.rs: src_operand emits the
  precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`,
  then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a
  bare base expression so it round-trips with the trivial-shader
  fixture.

Abs is omitted in this commit — the abs flag is dual-meaning (for
temps it lives at bit 7 of the src byte; for constants at word-2 bit
7 `abs_constants`). Wiring it up correctly requires more careful
case-split logic; deferred to Phase G.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (gated by Phase E for draws)
  draws:                0 → 0
  packets:              ~58M (within noise)
Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise
because identity swizzle test merged into D1's parameterised test).
WGSL still validates via naga (combined_module_parses_as_wgsl).

Closes GPUBUG-100 (P0). Abs deferred to Phase G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:15:07 +02:00
MechaCat02
78ea81c12a fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector
Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h:
2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags
selecting temp register (1) vs ALU constant (0); the corresponding
8-bit src byte indexes either:
  - a temp register (bits 5:0 = index, bits 6/7 reserved for
    relative-addressing / abs flags consumed by Phase D2), or
  - an ALU constant (full 8-bit index).

Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F`
on the src byte and emitted `r[low7]` regardless of the operand class.
Every shader's WVP matrix / light constant / per-frame uniform read
came back as r[low7] — typically zero — yielding invisible rendering.
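
A minimal sketch of the operand-class decode described above. `AluSource`, `decode_source`, and `source_selectors` are hypothetical names; the bit positions (selectors at word-0 bits 29-31, temp index in bits 5:0 of the src byte, full 8-bit constant index) come from the text.

```rust
/// Where an ALU source operand reads from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AluSource {
    Temp(u8),     // r[index], index in bits 5:0 of the src byte
    Constant(u8), // ALU constant, full 8-bit index
}

/// Interpret one 8-bit src byte given its selector flag (1 = temp).
pub fn decode_source(src_byte: u8, is_temp: bool) -> AluSource {
    if is_temp {
        // Bits 6/7 are reserved for relative-addressing / abs flags.
        AluSource::Temp(src_byte & 0x3F)
    } else {
        AluSource::Constant(src_byte)
    }
}

/// Extract the three selector flags as [bit 29, bit 30, bit 31] of word 0.
pub fn source_selectors(w0: u32) -> [bool; 3] {
    [
        (w0 >> 29) & 1 == 1,
        (w0 >> 30) & 1 == 1,
        (w0 >> 31) & 1 == 1,
    ]
}
```

The same src byte 0x85 decodes to constant 133 when the selector is 0 and to temp r5 when it is 1; the pre-fix `& 0x7F` mask produced `r[5]` in both cases.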

Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp /
  src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our
  src_a (low byte of w0) is canary's third operand, hence its selector
  is bit 29 (canary src3_sel), not bit 31.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes
  the is_temp flag; constants index xenos_consts.alu directly.
- crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the
  interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when
  constant.

The trivial-shader synthetic test was updated to set the temp flags so
its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the
flags set, all sources would now resolve as constants.

Bank-selection (cf-level relative addressing for higher banks of the
512 ALU constants) remains a Phase G+ extension; this commit covers
c0..c127 in bank 0, which most Sylpheed shaders use directly.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged — gated by D2/D3/E for draws)
  draws:                0 → 0
  packets:              ~61M (within noise)
Tests: 552 → 554 (+2 translator tests for the temp/constant decode).

Closes GPUBUG-101 (P0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:10:11 +02:00
MechaCat02
79eb52c378 xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve)
First real GPU implementation. Ring/PM4 frontend (ring_view,
ring_drain, pm4) drains the command processor; gpu_system owns the
threaded backend (DrainFence RPC + parker/fence helpers from M1) and
the MMIO-mapped register block (mmio_region).

Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode
the Xbox 360 microcode, translator.rs lowers it onto the WGSL
xenos_interp interpreter shader (shaders/xenos_interp.wgsl).
shader_metrics.rs counts decode/translate work.

Render state: draw_state, primitive, render_target_cache,
texture_cache, tiled_address (Xenos's swizzled tiled-memory layout),
xenos_constants (register field constants), edram (the 10 MiB EDRAM
model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve
plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs
owns the typed GPU-resource handles the kernel hands out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:29:38 +02:00