[iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform
The 3O→3R real-render slice ran the guest's real translated VS/PS on real captured vertices at full boot speed, but the --ui window stayed blank. Bisected with an env-gated frontbuffer readback + per-vertex NDC dump (both removed): the captured splash quads (RectangleList, k_32_32_FLOAT, 3 verts) were non-zero and sane, so this was a transform/decode chain of bugs, not missing geometry.

Four coupled root causes:

- GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but canary's AluInstruction lays out dest/write-mask/export/scalar-opcode in w0, the vector opcode + source regs in w2, and swizzle/negate/pred in w1. The misread made every *export* ALU decode with vector_write_mask=0 → no oPos/oColor export emitted → the translated VS collapsed every vertex to the clip origin. Rewrote the field map to match ucode.h:2036-2086.
- GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's X into .w → negative W → the rectangle clipped behind the camera. Decode the real VertexFormat + dword stride and emit the matching component read (1/2/3/4 float formats; others reject to the interpreter).
- GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed the buffer base from xenos_consts.fetch[], but that uniform carries the last-published per-frame fetch constant, not this draw's (stale 0x8a000002 vs the real base). The captured window already begins at the fetch base, so index from 0 (vertex i at i*stride) when a real window is present; only the synthetic fallback consults the uniform.
- iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL): the guest VS emits screen-space pixel coords (clip disabled, VTE viewport scale/offset off). Added compute_ndc_xy (mirrors canary GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in both the translated and interpreter VS.

Result (env-gated readback, since removed): the real splash geometry now fills ~50% of the frontbuffer in a clean triangular coverage pattern, real positions from real guest vertices through the real translated shaders (textures are the next stage; sampled color is still the magenta/white texture stub, tex-cache=0).

Headless-inert: draw_capture is only built when frame_captures is Some (--ui); the changed decoders feed only the UI translator/metrics.

Golden byte-identical (check -n50m --gpu-inline --stable-digest exit 0); 679 workspace tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
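
To make the GPUBUG-107 over-stride failure concrete, here is a minimal standalone sketch. It is not the translator's code: `fetch_vec4` and the vertex values are hypothetical, and the exact lane that picks up the stray value depends on the real buffer contents. It only illustrates how a 4-float read over 2-float vertices pulls a neighbouring vertex's coordinates into `.z`/`.w`.

```rust
/// Hypothetical helper: read vertex `index` as a vec4 from a flat float buffer.
/// `comps` is how many floats are read, `stride` the distance between vertices.
/// Lanes that are not read default to (0, 0, 0, 1).
fn fetch_vec4(buf: &[f32], index: usize, comps: usize, stride: usize) -> Option<[f32; 4]> {
    let addr = index * stride;
    if comps == 0 || comps > 4 || addr + comps > buf.len() {
        return None; // out of window: the caller keeps its seed value
    }
    let mut out = [0.0f32, 0.0, 0.0, 1.0];
    out[..comps].copy_from_slice(&buf[addr..addr + comps]);
    Some(out)
}

fn main() {
    // Three XY-only vertices (made-up pixel coordinates), tightly packed.
    let quad = [-0.5f32, -0.5, 1279.5, -0.5, -0.5, 719.5];

    // Hardcoded 4 floats / stride 4: vertex 0 swallows vertex 1's X and Y.
    println!("old v0 = {:?}", fetch_vec4(&quad, 0, 4, 4));
    // Vertex 1 would start at dword 4 and run past the 6-dword window.
    println!("old v1 = {:?}", fetch_vec4(&quad, 1, 4, 4));

    // Decoded 2 floats / stride 2: each vertex reads only its own XY.
    for i in 0..3 {
        println!("new v{i} = {:?}", fetch_vec4(&quad, i, 2, 2));
    }
}
```

With this data the old read of vertex 0 ends up with a negative `.w`, which is the kind of value that gets a primitive clipped; the 2-float read leaves `.w` at its default of 1.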
@@ -59,6 +59,102 @@ pub struct DrawCapture {
     /// the UI falls back to its procedural geometry for this draw (honest:
     /// nothing faked, just "couldn't source real vertices").
     pub has_real_vertices: bool,
+    /// iterate-3S: per-draw NDC transform derived from the guest viewport /
+    /// clip / VTE registers (mirrors canary `GetHostViewportInfo`). The host VS
+    /// converts the guest-VS position to wgpu clip space via
+    /// `clip.xy = pos.xy * ndc_scale + ndc_offset * pos.w`. The Y component
+    /// already carries the render-target → wgpu Y-flip (negated).
+    pub ndc_scale: [f32; 2],
+    pub ndc_offset: [f32; 2],
 }

+/// iterate-3S: compute the guest→host NDC XY transform for a draw, mirroring
+/// canary's `draw_util.cc::GetHostViewportInfo` (the XY half). The Xbox 360 VS
+/// emits a clip-space position which the HW then scales/offsets by the viewport
+/// (`PA_CL_VPORT_*`, gated by `PA_CL_VTE_CNTL`) into render-target pixels, OR,
+/// when clipping is disabled (`PA_CL_CLIP_CNTL.clip_disable`), the VS emits
+/// render-target-pixel coordinates directly (the screen-space UI / clear case —
+/// this is what Sylpheed's splash quads do). Either way we must rescale into the
+/// host's [-1,1] clip space and flip Y (render-target Y-down → wgpu Y-up).
+///
+/// Returns `(ndc_scale[2], ndc_offset[2])` such that
+/// `host_clip.xy = guest_pos.xy * ndc_scale + ndc_offset * guest_pos.w`.
+/// The Y entries are pre-negated to flip into wgpu's Y-up clip space.
+pub fn compute_ndc_xy(rf: &RegisterFile) -> ([f32; 2], [f32; 2]) {
+    const PA_CL_CLIP_CNTL: u32 = 0x2204;
+    const PA_SU_SC_MODE_CNTL: u32 = 0x2205;
+    const PA_CL_VTE_CNTL: u32 = 0x2206;
+    const PA_SU_VTX_CNTL: u32 = 0x2302;
+    const PA_CL_VPORT_XSCALE: u32 = 0x210F;
+    const PA_CL_VPORT_XOFFSET: u32 = 0x2110;
+    const PA_CL_VPORT_YSCALE: u32 = 0x2111;
+    const PA_CL_VPORT_YOFFSET: u32 = 0x2112;
+    const PA_SC_WINDOW_OFFSET: u32 = 0x2080;
+    const PA_SC_WINDOW_SCISSOR_BR: u32 = 0x2082;
+    const RB_SURFACE_INFO: u32 = 0x2000;
+
+    let clip_cntl = rf.read(PA_CL_CLIP_CNTL);
+    let vte = rf.read(PA_CL_VTE_CNTL);
+    let su_sc_mode = rf.read(PA_SU_SC_MODE_CNTL);
+    let su_vtx = rf.read(PA_SU_VTX_CNTL);
+    let fbits = |r: u32| f32::from_bits(rf.read(r));
+
+    // VTE enable bits (xenos.h PA_CL_VTE_CNTL): bit0 vport_x_scale_ena,
+    // bit1 vport_x_offset_ena, bit2 vport_y_scale_ena, bit3 vport_y_offset_ena.
+    let scale_x = if vte & (1 << 0) != 0 { fbits(PA_CL_VPORT_XSCALE) } else { 1.0 };
+    let off_x = if vte & (1 << 1) != 0 { fbits(PA_CL_VPORT_XOFFSET) } else { 0.0 };
+    let scale_y = if vte & (1 << 2) != 0 { fbits(PA_CL_VPORT_YSCALE) } else { 1.0 };
+    let off_y = if vte & (1 << 3) != 0 { fbits(PA_CL_VPORT_YOFFSET) } else { 0.0 };
+
+    // Render-target extent in guest pixels: clamp to the texture max (2048),
+    // sourced from the window scissor BR (matches canary `x_max`/`y_max`).
+    let br = rf.read(PA_SC_WINDOW_SCISSOR_BR);
+    let x_max = ((br & 0x7FFF).max(1)).min(2048) as f32;
+    let y_max = (((br >> 16) & 0x7FFF).max(1)).min(2048) as f32;
+    let _ = RB_SURFACE_INFO;
+
+    // Half-pixel + window offsets added in render-target pixels.
+    let mut add_x = 0.0f32;
+    let mut add_y = 0.0f32;
+    if su_sc_mode & (1 << 16) != 0 {
+        let wo = rf.read(PA_SC_WINDOW_OFFSET);
+        // 15-bit signed each (x: [14:0], y: [30:16]).
+        let sx = (((wo & 0x7FFF) << 1) as i32) >> 1;
+        let sy = ((((wo >> 16) & 0x7FFF) << 1) as i32) >> 1;
+        add_x += sx as f32;
+        add_y += sy as f32;
+    }
+    if su_vtx & 1 == 0 {
+        // pix_center == kD3DZero → +0.5 half-pixel offset.
+        add_x += 0.5;
+        add_y += 0.5;
+    }
+
+    let (s, o);
+    if clip_cntl & (1 << 16) != 0 {
+        // clip_disable: VS outputs render-target-pixel coords. Rescale the
+        // whole RT extent to [-1,1] (canary's huge-host-viewport path).
+        let px2ndc_x = 2.0 / x_max;
+        let px2ndc_y = 2.0 / y_max;
+        let sx = scale_x * px2ndc_x;
+        let ox = (off_x - x_max * 0.5 + add_x) * px2ndc_x;
+        let sy = scale_y * px2ndc_y;
+        let oy = (off_y - y_max * 0.5 + add_y) * px2ndc_y;
+        s = [sx, sy];
+        o = [ox, oy];
+    } else {
+        // Clipping enabled: the VS already emits clip space; the viewport
+        // scale/offset map clip→pixels. Convert to the host clip directly:
+        // host_ndc = guest_ndc (scale ~ 1) but still apply the abs-scale based
+        // remap canary uses. For the common enabled case the guest already
+        // outputs [-1,1] so scale=1/offset=0 except sign of Y. We approximate
+        // with identity XY + Y-flip (sufficient for non-screen-space draws;
+        // refined alongside depth in a follow-up).
+        s = [1.0, 1.0];
+        o = [0.0, 0.0];
+    }
+    // Flip Y for wgpu (render-target Y-down → clip Y-up).
+    ([s[0], -s[1]], [o[0], -o[1]])
+}
+
 /// Encode a [`PrimitiveType`] as the raw Xenos code used across the bridge.
@@ -179,6 +275,7 @@ pub fn build(
         Some((d, base)) => (d, base, true),
         None => (Vec::new(), 0, false),
     };
+    let (ndc_scale, ndc_offset) = compute_ndc_xy(rf);
     DrawCapture {
         draw_index,
         prim_code: prim_code(primitive),
@@ -188,5 +285,7 @@ pub fn build(
         vertex_dwords,
         window_base_dwords,
         has_real_vertices: has_real,
+        ndc_scale,
+        ndc_offset,
     }
 }

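As a worked instance of the clip-disabled branch of `compute_ndc_xy`, the sketch below applies the same arithmetic to a few pixel positions. It is a simplified stand-in, not the function from this commit: `pixel_to_clip_xy` and `apply` are hypothetical names, the 1280x720 extent is an assumed example, and the VTE scale/offset and window offset are assumed disabled (scale 1, offset 0).

```rust
/// Hypothetical simplification of the clip-disabled path: returns
/// (ndc_scale, ndc_offset) with the Y entries already negated.
/// Assumes VTE scale = 1, VTE offset = 0 and no window offset.
fn pixel_to_clip_xy(width: f32, height: f32, half_pixel: bool) -> ([f32; 2], [f32; 2]) {
    let add = if half_pixel { 0.5 } else { 0.0 };
    let px2ndc_x = 2.0 / width;
    let px2ndc_y = 2.0 / height;
    let ox = (add - width * 0.5) * px2ndc_x;
    let oy = (add - height * 0.5) * px2ndc_y;
    ([px2ndc_x, -px2ndc_y], [ox, -oy])
}

/// clip.xy = pos.xy * ndc_scale + ndc_offset * pos.w
fn apply(pos: [f32; 4], scale: [f32; 2], offset: [f32; 2]) -> [f32; 2] {
    [
        pos[0] * scale[0] + offset[0] * pos[3],
        pos[1] * scale[1] + offset[1] * pos[3],
    ]
}

fn main() {
    let (scale, offset) = pixel_to_clip_xy(1280.0, 720.0, false);
    // Top-left, centre and bottom-right of the assumed render target.
    for (px, py) in [(0.0f32, 0.0f32), (640.0, 360.0), (1280.0, 720.0)] {
        let clip = apply([px, py, 0.0, 1.0], scale, offset);
        println!("pixel ({px}, {py}) -> clip ({:.3}, {:.3})", clip[0], clip[1]);
    }
}
```

The top-left pixel lands near clip (-1, +1) and the bottom-right near (+1, -1), which is the Y-down to Y-up flip the doc comment describes.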
@@ -26,6 +26,9 @@ struct XenosDrawConstants {
     // fetch-constant address so it indexes the uploaded window. 0 means "no
     // real vertex window" (procedural fallback path).
     vertex_base_dwords: u32,
+    // iterate-3S: guest viewport → host NDC XY transform (Y pre-flipped).
+    ndc_scale: vec2<f32>,
+    ndc_offset: vec2<f32>,
 };

 struct XenosConstants {
@@ -663,9 +666,14 @@ fn interpret_vertex_fetch(t: u32) {
     // window. When no real window was published (`vertex_base_dwords == 0`)
     // keep the absolute value (the `addr < n` guards below then skip the read
     // and the procedural fallback position is used).
+    // GPUBUG-108 (iterate-3S): the captured window begins exactly at the fetch
+    // base, so index from 0 (vertex i at i*stride). The uniform `fetch[]` holds
+    // the last-published per-frame constant, not this draw's — recomputing
+    // `abs_base` from it produced a stale out-of-window address (the splash
+    // collapsed to one pixel). Only consult the uniform for the no-window
+    // synthetic fallback.
     let abs_base = (fc0 & 0xFFFFFFFCu) >> 2u;
-    let base_dwords = select(abs_base, abs_base - draw_ctx.vertex_base_dwords,
-        draw_ctx.vertex_base_dwords != 0u && abs_base >= draw_ctx.vertex_base_dwords);
+    let base_dwords = select(abs_base, 0u, draw_ctx.vertex_base_dwords != 0u);
     // GPUBUG-102: per-format endian byte-swap. Xbox 360 vertex data is
     // big-endian; the host is little-endian. Pre-fix every dword was
     // bitcast as-is — vertex positions were byte-reversed garbage.
@@ -885,7 +893,13 @@ fn vs_main(@builtin(vertex_index) vidx: u32) -> VsOut {
     // Use registers[OPOS_REG] as position; the procedural fallback above
     // seeded it so an un-interpreted shader still draws a recognisable
     // circle.
-    out.position = vec4<f32>(registers[OPOS_REG].xyz, registers[OPOS_REG].w);
+    var opos = vec4<f32>(registers[OPOS_REG].xyz, registers[OPOS_REG].w);
+    // iterate-3S: guest VS position → host clip space (see translator.rs). When
+    // the transform is unset (procedural fallback) pass through unchanged.
+    if (draw_ctx.ndc_scale.x != 0.0 || draw_ctx.ndc_scale.y != 0.0) {
+        opos = vec4<f32>(opos.xy * draw_ctx.ndc_scale + draw_ctx.ndc_offset * opos.w, opos.z, opos.w);
+    }
+    out.position = opos;
     out.color = vec4<f32>(registers[OCOLOR_REG].rgb + vec3<f32>(vb_live), registers[OCOLOR_REG].a);
     return out;
 }

@@ -95,6 +95,8 @@ struct XenosDrawConstants {
     vertex_count: u32,
     prim_kind: u32,
     vertex_base_dwords: u32,
+    ndc_scale: vec2<f32>,
+    ndc_offset: vec2<f32>,
 };

 struct XenosConstants {
@@ -254,6 +256,18 @@ impl EmitCtx {
         match self.stage {
             Stage::Vertex => {
                 self.push("var out: VsOut;");
+                // iterate-3S: guest VS position → host clip space. The guest
+                // emits either clip-space or (screen-space, clip disabled)
+                // render-target-pixel coords; `ndc_scale`/`ndc_offset` (from
+                // canary's GetHostViewportInfo, computed CPU-side per draw)
+                // rescale XY into wgpu clip space with Y already flipped. When
+                // the transform is unset (all-zero scale, procedural fallback)
+                // pass the position through unchanged.
+                self.push("if (draw_ctx.ndc_scale.x != 0.0 || draw_ctx.ndc_scale.y != 0.0) {");
+                self.indent += 1;
+                self.push("opos = vec4<f32>(opos.xy * draw_ctx.ndc_scale + draw_ctx.ndc_offset * opos.w, opos.z, opos.w);");
+                self.indent -= 1;
+                self.push("}");
                 self.push("out.position = opos;");
                 self.push("out.color = ocolor;");
                 self.push("return out;");
@@ -401,38 +415,72 @@ impl EmitCtx {
     }

     fn emit_vfetch(&mut self, vf: &crate::ucode::fetch::VertexFetch) -> Result<(), &'static str> {
-        // v1: treat all vertex fetches as R32G32B32A32_FLOAT, stride = 4
-        // dwords. Matches the interpreter's MVP semantics; unlocks more
-        // formats alongside the CPU texture cache's format expansion.
+        // GPUBUG-107 (iterate-3S): decode the vertex FORMAT + dword STRIDE from
+        // the vfetch instruction instead of hardcoding R32G32B32A32 (4 floats,
+        // stride 4). Sylpheed's splash quads are `k_32_32_FLOAT` (2 floats,
+        // stride 2); over-reading them put the next vertex's X into .w → a
+        // negative W → the whole rectangle clipped behind the camera. We cover
+        // the float vertex formats (the UI / screen-space draws); other formats
+        // reject to the interpreter.
         //
-        // GPUBUG-102: the fetch constant (xe_gpu_vertex_fetch_t,
-        // xenos.h:1158-1172) holds the endian field in dword_1's low
-        // 2 bits. Vertex data on Xbox 360 is big-endian; the host is
-        // little-endian. Pre-fix, every dword was bitcast as-is →
-        // vertex positions were byte-reversed garbage and any draw
-        // that did reach the host produced clipped / NaN positions.
+        // GPUBUG-102: the fetch constant holds the endian field in dword_1's
+        // low 2 bits; Xbox 360 vertex data is big-endian, so `gpu_swap` undoes
+        // it per component.
+        let (comps, stride): (u32, u32) = match vf.format {
+            36 => (1, 1), // k_32_FLOAT
+            37 => (2, 2), // k_32_32_FLOAT
+            57 => (3, 3), // k_32_32_32_FLOAT
+            38 => (4, 4), // k_32_32_32_32_FLOAT
+            _ => return Err(reject::VFETCH_FMT),
+        };
+        // A stride of 0 in the instruction means "use the fetch-constant
+        // stride"; fall back to the tightly packed component count.
+        let stride = if vf.stride != 0 { vf.stride as u32 } else { stride };
         let fetch_const = (vf.raw[0] >> 5) & 0x1F;
         let src_reg = vf.src_register & 0x7F;
         let dst_reg = vf.dest_register & 0x7F;
+        // Build the per-component reads; unread lanes default to 0/0/0/1 so an
+        // XY-only position keeps W=1 (and Z=0).
+        let lane = |i: u32| -> String {
+            if i < comps {
+                format!("bitcast<f32>(gpu_swap(vertex_buffer[addr + {i}u], endian))")
+            } else if i == 3 {
+                "1.0".to_string()
+            } else {
+                "0.0".to_string()
+            }
+        };
+        let read_bound = comps - 1;
+        // GPUBUG-108 (iterate-3S): for the captured-geometry path the CPU
+        // uploads a vertex window that begins EXACTLY at the fetch base, so the
+        // base within `vertex_buffer` is 0 and vertex i sits at `i * stride`.
+        // The previous `abs_base - vertex_base_dwords` rebase recomputed the
+        // base from `xenos_consts.fetch[]`, but that uniform carries the
+        // *last-published* (per-frame) fetch constant, not this draw's — for
+        // the splash it was stale (0x8a000002 vs the real 0x0adf… base), so the
+        // rebase produced a huge out-of-window address, the bounds guard
+        // failed, and every vertex kept its seed (vertex_index, 0, 0, 1) →
+        // every quad collapsed to ~one pixel at the origin. Index from 0 when a
+        // real window is present (`vertex_base_dwords != 0`); only the
+        // synthetic/no-window fallback consults the uniform fetch constant.
+        let endian_term = format!("xenos_consts.fetch[{}u] & 0x3u", fetch_const * 2 + 1);
         self.push(&format!(
-            "{{ let fc0 = xenos_consts.fetch[{fc0_idx}u]; \
-                let fc1 = xenos_consts.fetch[{fc1_idx}u]; \
-                let endian = fc1 & 0x3u; \
-                let abs_base = (fc0 & 0xFFFFFFFCu) >> 2u; \
-                let base = select(abs_base, abs_base - draw_ctx.vertex_base_dwords, \
-                                  draw_ctx.vertex_base_dwords != 0u && abs_base >= draw_ctx.vertex_base_dwords); \
+            "{{ let endian = {endian_term}; \
                 let vidx = u32(r[{src_reg}u].x); \
-                let addr = base + vidx * 4u; \
+                var base = 0u; \
+                if (draw_ctx.vertex_base_dwords == 0u) {{ \
+                    base = (xenos_consts.fetch[{fc0_idx}u] & 0xFFFFFFFCu) >> 2u; \
+                }} \
+                let addr = base + vidx * {stride}u; \
                 let n = arrayLength(&vertex_buffer); \
-                if (addr + 3u < n) {{ \
-                    r[{dst_reg}u] = vec4<f32>( \
-                        bitcast<f32>(gpu_swap(vertex_buffer[addr + 0u], endian)), \
-                        bitcast<f32>(gpu_swap(vertex_buffer[addr + 1u], endian)), \
-                        bitcast<f32>(gpu_swap(vertex_buffer[addr + 2u], endian)), \
-                        bitcast<f32>(gpu_swap(vertex_buffer[addr + 3u], endian))); \
+                if (addr + {read_bound}u < n) {{ \
+                    r[{dst_reg}u] = vec4<f32>({l0}, {l1}, {l2}, {l3}); \
                 }} }}",
             fc0_idx = fetch_const * 2,
-            fc1_idx = fetch_const * 2 + 1,
+            l0 = lane(0),
+            l1 = lane(1),
+            l2 = lane(2),
+            l3 = lane(3),
         ));
         Ok(())
     }
@@ -582,13 +630,16 @@ mod tests {
         // Single Exec clause: ALU add r0 = r0 + r0; scalar_op = RETAIN_PREV
         // with full write-mask on vector, zero on scalar. Alloc(Position)
         // precedes so the ALU's export (if it were one) would target oPos.
-        // Word-0 bits 29-31 set so all three operands resolve as temps —
-        // matches the prior assertion `r[0u] = (r[0u] + r[0u])`.
-        let w0 = (1u32 << 29) | (1u32 << 30) | (1u32 << 31);
-        let w2 = (vop::ADD as u32)
-            | ((sop::RETAIN_PREV as u32) << 6)
-            | (0xF << 12) // vector_write_mask
-            | (0u32 << 16); // vector_dest = 0
+        // GPUBUG-106 canary layout: dest/mask/scalar_opc in w0; vector_opc +
+        // src_sel in w2. All three operands temps → r0.
+        let w0 = (0u32) // vector_dest = 0
+            | (0xFu32 << 16) // vector_write_mask = 0xF
+            | ((sop::RETAIN_PREV as u32) << 26); // scalar_opc
+        let w1 = 0u32;
+        let w2 = ((vop::ADD as u32) << 24) // vector_opc
+            | (1u32 << 31) // src1_sel = temp
+            | (1u32 << 30) // src2_sel = temp
+            | (1u32 << 29); // src3_sel = temp
         ParsedShader {
             cf: vec![
                 ControlFlowInstruction::Alloc {
@@ -604,7 +655,7 @@ mod tests {
                     predicate_condition: false,
                 },
             ],
-            instructions: vec![w0, 0, w2],
+            instructions: vec![w0, w1, w2],
         }
     }

@@ -692,19 +743,17 @@ mod tests {

     #[test]
     fn shader_using_c0_emits_xenos_consts_read() {
-        // ALU: r0 = c0 + r0. src_a (low byte) is constant index 0;
-        // src_b (next byte) is temp index 0. src_a_is_temp=false →
-        // src1_sel-style bit at w0 bit 29 = 0; src_b_is_temp=true →
-        // bit 30 = 1. (src_c left as 0/temp; unused.)
-        let w0 = 0x00u32 // src_a = c0
-            | (0x00u32 << 8) // src_b = r0
-            | (0x00u32 << 16) // src_c
-            | (0u32 << 29) // src_a_is_temp = false (constant)
-            | (1u32 << 30); // src_b_is_temp = true (register)
-        let w2 = (vop::ADD as u32)
-            | ((sop::RETAIN_PREV as u32) << 6)
-            | (0xF << 12)
-            | (0u32 << 16);
+        // ALU: r0 = c0 + r0. GPUBUG-106 canary layout. src_a = src1 (w2
+        // 16:23), src_b = src2 (w2 8:15). src1_sel (w2 bit31) = 0 → c0;
+        // src2_sel (w2 bit30) = 1 → r0.
+        let w0 = (0u32) // vector_dest = 0
+            | (0xFu32 << 16) // vector_write_mask
+            | ((sop::RETAIN_PREV as u32) << 26); // scalar_opc
+        let w2 = ((vop::ADD as u32) << 24) // vector_opc
+            | (0u32 << 16) // src1_reg = 0 → c0
+            | (0u32 << 8) // src2_reg = 0 → r0
+            | (0u32 << 31) // src1_sel = 0 (constant)
+            | (1u32 << 30); // src2_sel = 1 (temp)
         let shader = ParsedShader {
             cf: vec![
                 ControlFlowInstruction::Alloc {
@@ -748,6 +797,8 @@ mod tests {
             src_register: 0,
             dest_register: 0,
             dest_write_mask: 0xF,
+            format: 38, // k_32_32_32_32_FLOAT (4 floats)
+            stride: 4,
             raw: [0; 3],
         };
         ctx.emit_vfetch(&vf).expect("emit_vfetch");
@@ -772,9 +823,10 @@ mod tests {

     #[test]
     fn unsupported_op_rejected() {
-        let w2 = (29u32) // VOP_MAX_A, not in v1 subset
-            | ((sop::RETAIN_PREV as u32) << 6)
-            | (0xF << 12);
+        // GPUBUG-106 layout: vector_write_mask in w0 (16:19), vector_opc in
+        // w2 (24:28). MAX_A (29) is outside the supported subset → reject.
+        let w0 = (0xFu32 << 16) | ((sop::RETAIN_PREV as u32) << 26);
+        let w2 = (29u32) << 24; // VOP_MAX_A
         let shader = ParsedShader {
             cf: vec![ControlFlowInstruction::Exec {
                 address: 0,
@@ -784,7 +836,7 @@ mod tests {
                 predicated: false,
                 predicate_condition: false,
             }],
-            instructions: vec![0, 0, w2],
+            instructions: vec![w0, 0, w2],
         };
         assert!(matches!(
             translate(&shader, Stage::Vertex),

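The format table and lane defaults used by `emit_vfetch` can be exercised in isolation. The sketch below is a standalone approximation, not the translator's code: `float_format_layout` and `lane_exprs` are hypothetical names, and the numeric format codes are copied from the `match` in the diff rather than taken from any header.

```rust
/// Hypothetical helper: (components, tightly packed dword stride) for the
/// float vertex formats handled in the diff; anything else is rejected.
fn float_format_layout(format: u8) -> Option<(u32, u32)> {
    match format {
        36 => Some((1, 1)), // k_32_FLOAT
        37 => Some((2, 2)), // k_32_32_FLOAT
        57 => Some((3, 3)), // k_32_32_32_FLOAT
        38 => Some((4, 4)), // k_32_32_32_32_FLOAT
        _ => None,
    }
}

/// Hypothetical helper: WGSL expressions for the four destination lanes.
/// Lanes that are not fetched default to 0, 0, 0, 1 so XY-only data keeps W = 1.
fn lane_exprs(comps: u32) -> [String; 4] {
    std::array::from_fn(|i| {
        let i = i as u32;
        if i < comps {
            format!("bitcast<f32>(gpu_swap(vertex_buffer[addr + {i}u], endian))")
        } else if i == 3 {
            "1.0".to_string()
        } else {
            "0.0".to_string()
        }
    })
}

fn main() {
    // 37 is the code the diff labels k_32_32_FLOAT (the splash quads).
    if let Some((comps, stride)) = float_format_layout(37) {
        println!("comps = {comps}, stride = {stride}");
        for (i, expr) in lane_exprs(comps).iter().enumerate() {
            println!("lane {i}: {expr}");
        }
    }
}
```

For a 2-component format only lanes 0 and 1 touch `vertex_buffer`, so the bounds guard needs to cover `addr + 1` rather than `addr + 3`.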
@@ -71,33 +71,50 @@ pub fn decode_alu(words: [u32; 3]) -> AluInstruction {
     let w0 = words[0];
     let w1 = words[1];
     let w2 = words[2];
+    // GPUBUG-106 (iterate-3S): correct the dword field map to match canary's
+    // `AluInstruction` union (ucode.h:2036-2086). Pre-fix this read the
+    // dest/mask/export/scalar-opcode out of `w2`, but they live in `w0`; the
+    // vector opcode + source registers live in `w2`, and swizzle/negate/pred
+    // in `w1`. The misread made every *export* ALU decode with
+    // `vector_write_mask=0` → no oPos/oColor export emitted → the translated VS
+    // collapsed every vertex to the clip origin (degenerate, nothing drawn).
+    //
+    // w0: vector_dest(0:5) vector_dest_rel(6) abs_constants(7)
+    //     scalar_dest(8:13) scalar_dest_rel(14) export_data(15)
+    //     vector_write_mask(16:19) scalar_write_mask(20:23)
+    //     vector_clamp(24) scalar_clamp(25) scalar_opc(26:31)
+    // w1: src3_swiz(0:7) src2_swiz(8:15) src1_swiz(16:23)
+    //     src3/2/1_reg_negate(24/25/26) pred_condition(27) is_predicated(28)
+    // w2: src3_reg(0:7) src2_reg(8:15) src1_reg(16:23)
+    //     vector_opc(24:28) src3/2/1_sel(29/30/31)
+    //
+    // Our (a,b,c) operands map to canary's (src1,src2,src3).
     AluInstruction {
-        vector_opcode: (w2 & 0x3F) as u8,
-        scalar_opcode: ((w2 >> 6) & 0x3F) as u8,
-        vector_dest: ((w2 >> 16) & 0x7F) as u8,
-        scalar_dest: ((w2 >> 24) & 0x7F) as u8,
-        vector_write_mask: ((w2 >> 12) & 0xF) as u8,
-        scalar_write_mask: ((w2 >> 8) & 0xF) as u8,
-        vector_dest_is_export: ((w2 >> 23) & 1) != 0,
-        scalar_src_is_ps: ((w0 >> 26) & 1) != 0,
-        src_a: (w0 & 0xFF) as u8,
-        src_b: ((w0 >> 8) & 0xFF) as u8,
-        src_c: ((w0 >> 16) & 0xFF) as u8,
-        // Word-0 bits 29-31 are the per-operand temp-vs-constant
-        // selector (canary `src3_sel`/`src2_sel`/`src1_sel`,
-        // ucode.h:2078-2086). Our `src_a` is canary's third operand
-        // (low byte of w0), so its selector is bit 29.
-        src_a_is_temp: ((w0 >> 29) & 1) != 0,
-        src_b_is_temp: ((w0 >> 30) & 1) != 0,
-        src_c_is_temp: ((w0 >> 31) & 1) != 0,
-        src_a_swiz: (w1 & 0xFF) as u8,
+        vector_opcode: ((w2 >> 24) & 0x1F) as u8,
+        scalar_opcode: ((w0 >> 26) & 0x3F) as u8,
+        vector_dest: (w0 & 0x3F) as u8,
+        scalar_dest: ((w0 >> 8) & 0x3F) as u8,
+        vector_write_mask: ((w0 >> 16) & 0xF) as u8,
+        scalar_write_mask: ((w0 >> 20) & 0xF) as u8,
+        vector_dest_is_export: ((w0 >> 15) & 1) != 0,
+        // Not a real microcode bit — the scalar pipe selects `ps` implicitly
+        // via the *_PREV opcodes, which `scalar_expr` handles by opcode.
+        scalar_src_is_ps: false,
+        src_a: ((w2 >> 16) & 0xFF) as u8,
+        src_b: ((w2 >> 8) & 0xFF) as u8,
+        src_c: (w2 & 0xFF) as u8,
+        // sel==1 → operand is a temp register; sel==0 → ALU constant.
+        src_a_is_temp: ((w2 >> 31) & 1) != 0,
+        src_b_is_temp: ((w2 >> 30) & 1) != 0,
+        src_c_is_temp: ((w2 >> 29) & 1) != 0,
+        src_a_swiz: ((w1 >> 16) & 0xFF) as u8,
         src_b_swiz: ((w1 >> 8) & 0xFF) as u8,
-        src_c_swiz: ((w1 >> 16) & 0xFF) as u8,
-        src_a_negate: ((w1 >> 24) & 1) != 0,
+        src_c_swiz: (w1 & 0xFF) as u8,
+        src_a_negate: ((w1 >> 26) & 1) != 0,
         src_b_negate: ((w1 >> 25) & 1) != 0,
-        src_c_negate: ((w1 >> 26) & 1) != 0,
-        predicated: ((w0 >> 27) & 1) != 0,
-        predicate_condition: ((w0 >> 28) & 1) != 0,
+        src_c_negate: ((w1 >> 24) & 1) != 0,
+        predicated: ((w1 >> 28) & 1) != 0,
+        predicate_condition: ((w1 >> 27) & 1) != 0,
         raw: words,
     }
 }
@@ -225,19 +242,24 @@ mod tests {

     #[test]
     fn decode_extracts_opcodes_and_dests() {
-        // Build a minimal ALU word:
-        // vector_opcode = ADD (0), scalar_opcode = RCP (22),
-        // vector_dest = 3, scalar_dest = 7, vector_write_mask = 0xF
-        let w2 = (vop::ADD as u32)
-            | ((sop::RCP as u32) << 6)
-            | (0xF << 12) // vector_write_mask
-            | (3u32 << 16) // vector_dest
-            | (7u32 << 24); // scalar_dest
-        let alu = decode_alu([0, 0, w2]);
+        // GPUBUG-106: correct canary field map. w0 carries dest/mask/scalar_opc;
+        // w2 carries vector_opc + source regs.
+        // vector_opcode = ADD (0) → w2 bits 24:28
+        // scalar_opcode = RCP (22) → w0 bits 26:31
+        // vector_dest = 3 → w0 bits 0:5, scalar_dest = 7 → w0 bits 8:13
+        // vector_write_mask = 0xF → w0 bits 16:19, export_data → w0 bit 15
+        let w0 = 3u32 // vector_dest
+            | (7u32 << 8) // scalar_dest
+            | (1u32 << 15) // export_data
+            | (0xFu32 << 16) // vector_write_mask
+            | ((sop::RCP as u32) << 26); // scalar_opc
+        let w2 = (vop::ADD as u32) << 24; // vector_opc
+        let alu = decode_alu([w0, 0, w2]);
         assert_eq!(alu.vector_opcode, vop::ADD);
         assert_eq!(alu.scalar_opcode, sop::RCP);
         assert_eq!(alu.vector_dest, 3);
         assert_eq!(alu.scalar_dest, 7);
         assert_eq!(alu.vector_write_mask, 0xF);
+        assert!(alu.vector_dest_is_export);
     }
 }

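The corrected field map can be checked without the rest of the crate. The sketch below is a reduced, standalone version with hypothetical names (`AluFields`, `bits`, `decode_fields`). The bit positions are the ones listed in the GPUBUG-106 comment; the opcode values are arbitrary test numbers, not named opcodes.

```rust
/// Hypothetical reduced view of the ALU fields discussed in GPUBUG-106.
#[derive(Debug, PartialEq)]
struct AluFields {
    vector_dest: u8,
    scalar_dest: u8,
    export_data: bool,
    vector_write_mask: u8,
    scalar_opc: u8,
    vector_opc: u8,
    src1_is_temp: bool,
}

/// Extract `width` bits starting at bit `lo`.
fn bits(word: u32, lo: u32, width: u32) -> u32 {
    (word >> lo) & ((1u32 << width) - 1)
}

/// Decode using the w0/w2 positions from the comment in the diff.
fn decode_fields(w0: u32, w2: u32) -> AluFields {
    AluFields {
        vector_dest: bits(w0, 0, 6) as u8,
        scalar_dest: bits(w0, 8, 6) as u8,
        export_data: bits(w0, 15, 1) != 0,
        vector_write_mask: bits(w0, 16, 4) as u8,
        scalar_opc: bits(w0, 26, 6) as u8,
        vector_opc: bits(w2, 24, 5) as u8,
        src1_is_temp: bits(w2, 31, 1) != 0,
    }
}

fn main() {
    // Arbitrary test values packed at the documented positions.
    let w0 = 3u32 | (7u32 << 8) | (1u32 << 15) | (0xFu32 << 16) | (22u32 << 26);
    let w2 = (5u32 << 24) | (1u32 << 31);
    println!("{:?}", decode_fields(w0, w2));

    // The pre-fix decoder looked for the write mask at w2 bits 12:15,
    // which are zero for this instruction.
    println!("pre-fix mask read = {:#x}", bits(w2, 12, 4));
}
```

Reading the mask from `w2` yields 0 for this word, which matches the commit's description of export ALUs decoding with an empty write mask.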
@@ -25,6 +25,15 @@ pub struct VertexFetch {
     pub dest_register: u8,
     /// 4-bit write mask.
     pub dest_write_mask: u8,
+    /// iterate-3S (GPUBUG-107): `xenos::VertexFormat` (6 bits, dword1[16:21]).
+    /// Determines how many components to read and their packing. Pre-fix the
+    /// translator hardcoded `k_32_32_32_32_FLOAT` (4 floats, stride 4),
+    /// over-striding 2-float UI quads (`k_32_32_FLOAT`) → wrong/clipped
+    /// positions (the next vertex's X bled into .w, giving negative W → the
+    /// whole rectangle was clipped behind the camera).
+    pub format: u8,
+    /// Dword stride between consecutive vertices (dword2[0:7]).
+    pub stride: u8,
     pub raw: [u32; 3],
 }

@@ -72,6 +81,9 @@ pub fn decode_fetch(words: [u32; 3]) -> FetchInstruction {
             src_register: ((w0 >> 5) & 0x3F) as u8,
             dest_register: ((w0 >> 12) & 0x3F) as u8,
             dest_write_mask: (w1 & 0xF) as u8,
+            // dword1[16:21] = VertexFormat; dword2[0:7] = dword stride.
+            format: ((w1 >> 16) & 0x3F) as u8,
+            stride: (w2 & 0xFF) as u8,
             raw: words,
         }),
         op::TEXTURE_FETCH => FetchInstruction::Texture(TextureFetch {

@@ -663,6 +663,10 @@ impl RenderState {
             prim_kind,
             // Synthetic fallback path: no real vertex window.
             vertex_base_dwords: 0,
+            // No real geometry → no NDC transform (procedural positions are
+            // already in clip space).
+            ndc_scale: [0.0, 0.0],
+            ndc_offset: [0.0, 0.0],
         };
         if use_translated
             && let Some(p) = self.xenos_pipeline.translated_pipeline(vs_key, ps_key) {
@@ -788,6 +792,11 @@ impl RenderState {
                 vertex_count: cap.host_vertex_count.max(3),
                 prim_kind: cap.prim_code,
                 vertex_base_dwords: base,
+                // iterate-3S: apply the per-draw guest viewport → host NDC
+                // transform only when we have real geometry (otherwise the
+                // procedural fallback already emits clip-space positions).
+                ndc_scale: if cap.has_real_vertices { cap.ndc_scale } else { [0.0, 0.0] },
+                ndc_offset: if cap.has_real_vertices { cap.ndc_offset } else { [0.0, 0.0] },
             };
             if use_translated
                 && let Some(p) = self.xenos_pipeline.translated_pipeline(cap.vs_key, cap.ps_key)

@@ -39,6 +39,11 @@ struct DrawConstants {
     /// iterate-3O: guest dword base of the uploaded `vertex_buffer` window.
     /// The WGSL subtracts this from the absolute vertex-fetch address.
     vertex_base_dwords: u32,
+    /// iterate-3S: guest→host NDC XY transform (mirrors canary
+    /// `GetHostViewportInfo`). `clip.xy = pos.xy * ndc_scale + ndc_offset*pos.w`.
+    /// Y is pre-flipped for wgpu. 16 bytes so the block stays 16-byte aligned.
+    ndc_scale: [f32; 2],
+    ndc_offset: [f32; 2],
 }

 /// Submitted to [`XenosPipeline::render_one`] to render one captured draw.
@@ -53,6 +58,10 @@ pub struct DrawRequest {
     /// iterate-3O: guest dword base of the per-draw vertex window uploaded to
     /// `vertex_buffer` (b4). 0 = no real vertex window (procedural fallback).
     pub vertex_base_dwords: u32,
+    /// iterate-3S: guest→host NDC XY transform (Y pre-flipped). When all-zero
+    /// the shader leaves the position untransformed (procedural fallback).
+    pub ndc_scale: [f32; 2],
+    pub ndc_offset: [f32; 2],
 }

 /// Reasonable upper bound on a single shader blob (dwords). Most Xbox 360
@@ -199,6 +208,8 @@ impl XenosPipeline {
             vertex_count: 3,
             prim_kind: 4,
             vertex_base_dwords: 0,
+            ndc_scale: [0.0, 0.0],
+            ndc_offset: [0.0, 0.0],
         };
         let draw_ctx_buffer = device.create_buffer_init(&wgpu::util::BufferInitDescriptor {
             label: Some("xenos draw ctx"),
@@ -486,6 +497,8 @@ impl XenosPipeline {
             vertex_count: req.vertex_count.max(3),
             prim_kind: req.prim_kind,
             vertex_base_dwords: req.vertex_base_dwords,
+            ndc_scale: req.ndc_scale,
+            ndc_offset: req.ndc_offset,
         };
         queue.write_buffer(&self.draw_ctx_buffer, 0, bytemuck::bytes_of(&cb));

@@ -612,6 +625,8 @@ impl XenosPipeline {
             vertex_count: req.vertex_count.max(3),
             prim_kind: req.prim_kind,
             vertex_base_dwords: req.vertex_base_dwords,
+            ndc_scale: req.ndc_scale,
+            ndc_offset: req.ndc_offset,
         };
         queue.write_buffer(&self.draw_ctx_buffer, 0, bytemuck::bytes_of(&cb));

@@ -643,6 +658,6 @@ mod tests {

     #[test]
     fn draw_constants_layout_matches_wgsl_uniform() {
-        assert_eq!(std::mem::size_of::<DrawConstants>(), 16);
+        assert_eq!(std::mem::size_of::<DrawConstants>(), 32);
     }
 }
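
The 16 → 32 byte change in the layout test follows from four `u32` fields plus two 2-float vectors. The sketch below reproduces that arithmetic on a stand-in struct. `DrawConstantsSketch` and `field_offset` are hypothetical; the first field's name is an assumption, since the diff only shows the last three integer fields, and the pre-change size of 16 implies there are four.

```rust
use std::mem::size_of;

/// Stand-in for the uniform block. `draw_index` is an assumed name for the
/// field the diff does not show.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct DrawConstantsSketch {
    draw_index: u32,
    vertex_count: u32,
    prim_kind: u32,
    vertex_base_dwords: u32,
    ndc_scale: [f32; 2],  // pairs with WGSL vec2<f32>
    ndc_offset: [f32; 2], // pairs with WGSL vec2<f32>
}

/// Byte offset of `field` inside `base`, computed from addresses.
fn field_offset<T, F>(base: &T, field: &F) -> usize {
    field as *const F as usize - base as *const T as usize
}

fn main() {
    let c = DrawConstantsSketch::default();
    println!("size       = {}", size_of::<DrawConstantsSketch>());
    println!("ndc_scale  @ {}", field_offset(&c, &c.ndc_scale));
    println!("ndc_offset @ {}", field_offset(&c, &c.ndc_offset));
}
```

Because the four integers already fill 16 bytes, both vectors start on 8-byte boundaries (offsets 16 and 24) with no padding, so the `repr(C)` layout and the WGSL struct should agree at 32 bytes.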