xenia-rs/crates/xenia-gpu/src/ucode/mod.rs
MechaCat02 6bb4355e3d [iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds
The publisher splash (title idx0) rendered FLAT in ours while canary samples
a texture: ours never decoded the logo's textured pixel shader
(E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same
microcode canary does (verified byte-identical against the Wine oracle). The
shader was misparsed as flat. Three coupled bugs in the ucode decoder, all
off vs canary `gpu/ucode.h`:

1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec
   and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the
   cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc,
   15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal
   `Exit`, truncating the CF block and dropping every instruction (incl. the
   `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on
   an END-bit exec clause.

2. exec/loop `address` is an absolute instruction-triple index from shader
   dword 0, but we used it to index our post-CF `instructions` slice directly
   (`ucode/mod.rs`). Rebase addresses by the CF triple count so `address*3`
   lands on the right instruction.

3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index`
   read from bit 5 (actually `src_reg`) instead of bit 20, and texture
   `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0`
   was read as `tf1`, whose empty fetch-constant failed to decode → no
   texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not
   bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`).
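
Sketch of fixes 1 and 2, for illustration only. `CfClass`,
`classify_cf_opcode` and `instruction_dword_offset` are made-up names, not the
crate's API, and the cond-exec/loop/call variants are lumped into coarse
classes rather than decoded individually:

```rust
/// Coarse CF clause classes, following the opcode numbering in item 1.
/// Hypothetical type: the real decoder keeps per-variant fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CfClass {
    Nop,
    Exec,
    ExecEnd,
    CondExec,
    Loop,
    CallOrReturn,
    CondJmp,
    Alloc,
    MarkVsFetchDone,
}

/// Classify a 4-bit CF opcode (0 = kNop, 1 = kExec, 2 = kExecEnd, ...).
fn classify_cf_opcode(opcode: u8) -> CfClass {
    match opcode & 0xF {
        0 => CfClass::Nop,
        1 => CfClass::Exec,
        2 => CfClass::ExecEnd,
        3..=6 | 13 | 14 => CfClass::CondExec,
        7 | 8 => CfClass::Loop,
        9 | 10 => CfClass::CallOrReturn,
        11 => CfClass::CondJmp,
        12 => CfClass::Alloc,
        // Only 15 remains after the 4-bit mask.
        _ => CfClass::MarkVsFetchDone,
    }
}

/// Turn an absolute instruction-triple address (counted from shader dword 0)
/// into a dword offset inside the post-CF instruction slice.
fn instruction_dword_offset(address: u32, cf_triples: u32) -> usize {
    (address.saturating_sub(cf_triples) as usize) * 3
}
```

With a two-row CF block, `instruction_dword_offset(3, 2)` gives dword 3 of the
instruction slice; indexing with the raw `3 * 3` would land two triples too far.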

Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now
resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and
`gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the
only golden field that changed — all draw/swap counts unchanged). The same
fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path.

Added a regression test that parses the real E59B2B3D microcode and asserts a
tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 21:53:35 +02:00


//! Xenos (ATI R500-family) shader microcode decoder.
//!
//! Ground truth: `xenia-canary/src/xenia/gpu/ucode.h`. We parse only what a
//! shader *interpreter* (P3 uber-shader) needs: control-flow clauses, ALU
//! instructions (vector + scalar pipes), and fetch instructions (vertex +
//! texture). The uber-shader consumes this IR directly; when a WGSL-emitting
//! translator comes online in P7, it reuses the same parser.
//!
//! ## Binary layout
//!
//! A compiled shader has two sections back-to-back:
//!
//! 1. **Control-flow block**: `cf_count` 48-bit clauses, stored in pairs.
//!    Canary packs two clauses into three 32-bit words:
//!    ```text
//!    word0    word1    word2
//!    [-CF_A (48)-][-CF_B (48)-]
//!    ```
//!    Word 0 is the low 32 bits of CF_A; word 1's low 16 bits finish CF_A and
//!    its high 16 bits start CF_B; word 2 holds CF_B's remaining 32 bits.
//!
//! 2. **Instruction block**: variable-size array of 96-bit ALU / fetch
//!    instructions. Each control-flow clause of kind `Exec*` references a
//!    contiguous range of these by `(address, count)`, both counted in
//!    instruction triples (multiply by 3 for a dword offset).
//!
//! Guest memory stores the dwords big-endian, but the `raw_dwords` `&[u32]`
//! slice handed to `parse_shader` is already host-endian-corrected by the PM4
//! executor that cached the shader blob. See `ucode.h:218-256` for the exec
//! clause bit layout and `:700-877` for the fetch/ALU mix.

pub mod alu;
pub mod control_flow;
pub mod fetch;

use self::alu::AluInstruction;
use self::control_flow::{AllocKind, ControlFlowInstruction, decode_cf_pair};
use self::fetch::FetchInstruction;

/// CF-clause kind codes encoded into the WGSL-facing packed shader. Kept
/// in sync with `shaders/xenos_interp.wgsl`'s `CF_KIND_*` constants.
pub mod cf_kind {
    pub const EXEC: u32 = 0;
    pub const EXEC_END: u32 = 1;
    pub const ALLOC: u32 = 2;
    pub const EXIT: u32 = 3;
    pub const LOOP_START: u32 = 4;
    pub const LOOP_END: u32 = 5;
    pub const COND_JMP: u32 = 6;
    pub const COND_CALL: u32 = 7;
    pub const RETURN: u32 = 8;
    /// Non-executing CF clause: `kNop` padding or `kMarkVsFetchDone` hint.
    /// The WGSL CF walker treats this as a no-op (advance, do not reject).
    pub const NOP: u32 = 9;
    pub const UNKNOWN: u32 = 15;
}

/// Alloc-kind codes, packed into the aux dword of an `Alloc` clause.
pub mod cf_alloc_kind {
    pub const POSITION: u32 = 0;
    pub const INTERPOLATORS: u32 = 1;
    pub const COLORS: u32 = 2;
    pub const MEMEXPORT: u32 = 3;
    pub const OTHER: u32 = 4;
}

/// Pack a [`ParsedShader`] into the dense dword layout the WGSL runtime
/// interpreter expects:
///
/// ```text
/// [0]                     cf_count
/// [1 .. 1 + cf_count*3]   CF table: (kind, primary, aux) triples per clause
/// [1 + cf_count*3 ..]     raw 3-dword instruction stream (ALU/fetch)
/// ```
///
/// The CF table lets WGSL walk clauses without reconstructing bit-packed
/// layouts on the GPU. The low byte of `kind` holds the `cf_kind` code; for
/// `EXEC`/`EXEC_END`, bits 8..=9 additionally carry the predicate flags
/// (`predicated`, `predicate_condition`). Semantics per `kind`:
///
/// | kind          | primary                        | aux                    |
/// |---------------|--------------------------------|------------------------|
/// | EXEC/EXEC_END | address (in triples)           | (sequence<<8) \| count |
/// | ALLOC         | alloc_kind (see cf_alloc_kind) | size                   |
/// | EXIT          | 0                              | 0                      |
/// | LOOP_START    | address                        | loop_id                |
/// | LOOP_END      | address                        | loop_id                |
/// | COND_JMP      | target                         | predicate flags        |
/// | COND_CALL     | target                         | 0                      |
/// | RETURN        | 0                              | 0                      |
/// | NOP           | 0                              | 0                      |
/// | UNKNOWN       | opcode                         | 0                      |
pub fn pack_for_wgsl(parsed: &ParsedShader) -> Vec<u32> {
    let cf_count = parsed.cf.len() as u32;
    let mut out = Vec::with_capacity(1 + (cf_count as usize) * 3 + parsed.instructions.len());
    out.push(cf_count);
    for clause in &parsed.cf {
        let (kind, primary, aux) = encode_cf(*clause);
        out.push(kind);
        out.push(primary);
        out.push(aux);
    }
    out.extend_from_slice(&parsed.instructions);
    out
}

fn encode_cf(c: ControlFlowInstruction) -> (u32, u32, u32) {
    use ControlFlowInstruction::*;
    match c {
        Exec {
            address,
            count,
            sequence,
            is_end,
            predicated,
            predicate_condition,
        } => {
            // Predicate flags ride in bits 8..=9 of the kind dword.
            let pred_bits = (predicated as u32) | ((predicate_condition as u32) << 1);
            let base = if is_end { cf_kind::EXEC_END } else { cf_kind::EXEC };
            let kind = base | (pred_bits << 8);
            (kind, address, (sequence << 8) | count)
        }
        Alloc { size, kind } => {
            let akind = match kind {
                AllocKind::Position => cf_alloc_kind::POSITION,
                AllocKind::Interpolators => cf_alloc_kind::INTERPOLATORS,
                AllocKind::Colors => cf_alloc_kind::COLORS,
                AllocKind::Memexport => cf_alloc_kind::MEMEXPORT,
                AllocKind::Other => cf_alloc_kind::OTHER,
            };
            (cf_kind::ALLOC, akind, size)
        }
        Exit => (cf_kind::EXIT, 0, 0),
        LoopStart { address, loop_id } => (cf_kind::LOOP_START, address, loop_id),
        LoopEnd { address, loop_id } => (cf_kind::LOOP_END, address, loop_id),
        CondJmp {
            target,
            predicated,
            predicate_condition,
        } => {
            let pred_bits = (predicated as u32) | ((predicate_condition as u32) << 1);
            (cf_kind::COND_JMP, target, pred_bits)
        }
        CondCall { target } => (cf_kind::COND_CALL, target, 0),
        Return => (cf_kind::RETURN, 0, 0),
        Nop | MarkVsFetchDone => (cf_kind::NOP, 0, 0),
        Unknown { opcode } => (cf_kind::UNKNOWN, opcode as u32, 0),
    }
}

/// One instruction word set from the instruction-block section. Xenos packs
/// ALU and fetch instructions identically (96 bits each); the owning exec
/// clause's "sequence" bitmap decides which is which.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodedInstruction {
    /// ALU pipe (vector ALU + optional co-issued scalar ALU).
    Alu(AluInstruction),
    /// Vertex or texture fetch.
    Fetch(FetchInstruction),
}

/// Parsed shader: the control-flow clause list + the raw 32-bit instruction
/// words. The uber-shader / translator is expected to index into
/// `instructions` based on `(clause.address * 3, clause.count * 3)`.
#[derive(Debug, Clone, Default)]
pub struct ParsedShader {
    pub cf: Vec<ControlFlowInstruction>,
    /// Raw instruction dwords. Each 3-dword triple is one ALU or fetch
    /// instruction; the owning `Exec` clause's `sequence` bitmap picks the
    /// kind.
    pub instructions: Vec<u32>,
}

/// Decode a shader blob. `raw_dwords` is a host-endian slice of the entire
/// microcode buffer (control flow + instructions). The CF block is implicitly
/// bounded: we walk clause-pair rows until one terminates the shader (an
/// `Exec`/`CondExec` clause with the END bit set, per Xenos). Everything after
/// that row is the instruction block; exec/loop addresses are then rebased to
/// be relative to it.
pub fn parse_shader(raw_dwords: &[u32]) -> ParsedShader {
    // The CF block ends after the clause that terminates the shader: an
    // `Exec` with the END bit set (Xenos `kExecEnd`/`kCondExec*End`), a
    // synthetic `Exit`, or an `Unknown` opcode (decode ran off the CF block
    // into instruction data, so stop defensively). `Nop` padding does NOT
    // terminate. (Previously this stopped on the first `Exit`; with the
    // corrected opcode table, opcode 1 is `kExec`, not exit, so real exec
    // clauses now keep the parse going as intended.)
    let terminates = |clause: &ControlFlowInstruction| {
        matches!(
            clause,
            ControlFlowInstruction::Exec { is_end: true, .. }
                | ControlFlowInstruction::Exit
                | ControlFlowInstruction::Unknown { .. }
        )
    };

    let mut cf = Vec::new();
    // CF clauses are 48 bits each: clause A is word0 plus the low 16 bits of
    // word1, clause B is the high 16 bits of word1 plus word2. Walk one
    // 3-dword row per pair of clauses.
    let mut i = 0usize;
    while i + 2 < raw_dwords.len() {
        let (first, second) =
            decode_cf_pair(raw_dwords[i], raw_dwords[i + 1], raw_dwords[i + 2]);
        let seen_end = terminates(&first) || terminates(&second);
        cf.push(first);
        cf.push(second);
        i += 3;
        if seen_end {
            break;
        }
    }

    // Everything from dword `i` onward is the instruction block.
    let instructions = raw_dwords[i..].to_vec();

    // Xenos exec/loop `address` fields are absolute instruction-triple indices
    // counted from shader dword 0, but `instructions` here begins *after* the
    // CF block. Rebase those addresses to be relative to the instruction block
    // (subtract the CF triple count) so `address * 3` indexes `instructions`
    // directly. Without this, every exec read 3 dwords too far per CF triple,
    // so the publisher-logo `tfetch` triple was skipped and the splash
    // rendered flat.
    let cf_triples = (i / 3) as u32;
    for clause in cf.iter_mut() {
        match clause {
            ControlFlowInstruction::Exec { address, .. }
            | ControlFlowInstruction::LoopStart { address, .. }
            | ControlFlowInstruction::LoopEnd { address, .. } => {
                *address = address.saturating_sub(cf_triples);
            }
            _ => {}
        }
    }

    ParsedShader { cf, instructions }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn empty_blob_parses_empty() {
        let p = parse_shader(&[]);
        assert!(p.cf.is_empty());
        assert!(p.instructions.is_empty());
    }

    #[test]
    fn pack_for_wgsl_layout_is_correct() {
        // Build a tiny ParsedShader by hand and verify the packed form.
        let parsed = ParsedShader {
            cf: vec![
                ControlFlowInstruction::Exec {
                    address: 0x10,
                    count: 3,
                    sequence: 0b1010,
                    is_end: false,
                    predicated: false,
                    predicate_condition: false,
                },
                ControlFlowInstruction::Exit,
            ],
            instructions: vec![0x1111, 0x2222, 0x3333],
        };
        let packed = pack_for_wgsl(&parsed);
        assert_eq!(packed[0], 2, "cf_count");
        // First clause: EXEC, address=0x10, aux = (sequence<<8)|count = 0x0A03
        assert_eq!(packed[1] & 0xFF, cf_kind::EXEC);
        assert_eq!(packed[2], 0x10);
        assert_eq!(packed[3], (0b1010 << 8) | 3);
        // Second clause: EXIT
        assert_eq!(packed[4] & 0xFF, cf_kind::EXIT);
        // Instruction block starts at 1 + 2*3 = 7
        assert_eq!(packed[7..], [0x1111, 0x2222, 0x3333]);
    }

    #[test]
    fn exec_end_clause_stops_parsing() {
        // Row: clause B = kExecEnd (opcode 2) terminates the CF block; clause
        // A is all zeros (kNop) and does not.
        // B's 48-bit payload is hi16(word1) (payload bits 0..=15) followed by
        // word2 (payload bits 16..=47). The opcode lives in payload bits
        // 44..=47, i.e. word2 bits 28..=31. Opcode 2 is `2 << 44`, which sets
        // payload bit 45 (word2 bit 29).
        let b_payload: u64 = 2u64 << 44; // kExecEnd
        // Split the payload back into its word1/word2 halves.
        let word1 = ((b_payload & 0xFFFF) as u32) << 16; // B's low 16 bits → hi16(word1)
        let word2 = ((b_payload >> 16) & 0xFFFF_FFFF) as u32;
        let p = parse_shader(&[0, word1, word2, 0xDEAD_BEEF]);
        assert!(!p.cf.is_empty());
        // ExecEnd detected in the first row → remaining dword is instruction data.
        assert_eq!(p.instructions, vec![0xDEAD_BEEF]);
    }
}
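
For readers following the clause-pair layout described in the module docs, the packing can be sketched as a standalone helper. This is only an illustration under the stated layout: `split_cf_row` and `cf_opcode` are hypothetical names, not functions from `control_flow.rs`, and the opcode position (payload bits 44..=47) is taken from the `exec_end_clause_stops_parsing` test rather than from canary directly.

```rust
/// Split one 3-dword CF row into its two 48-bit clauses `(A, B)`.
/// A = word0 plus the low 16 bits of word1;
/// B = the high 16 bits of word1 plus word2.
fn split_cf_row(word0: u32, word1: u32, word2: u32) -> (u64, u64) {
    let a = u64::from(word0) | (u64::from(word1 & 0xFFFF) << 32);
    let b = u64::from(word1 >> 16) | (u64::from(word2) << 16);
    (a, b)
}

/// Extract the 4-bit opcode from the top of a 48-bit clause payload
/// (assumed to sit in bits 44..=47, as the test above constructs it).
fn cf_opcode(clause: u64) -> u8 {
    ((clause >> 44) & 0xF) as u8
}
```

Feeding it the row from the test, `split_cf_row(0, 0, 2 << 28)`, yields a clause A with opcode 0 and a clause B with opcode 2.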