The Phase-C VdSwap PM4 ring path (commit 82f3d61) emits two
"PM4_XE_SWAP not consumed by drain" warnings when running:
exec sylpheed.iso --ui --quiet --halt-on-deadlock \
--parallel --reservations-table
Lockstep -n 100M never trips it. Two distinct race windows:
(a) Inline backend (--ui forces it): drain(mem, 4096) hit its
fixed packet cap before reaching the PM4_XE_SWAP we'd just
injected at the WPTR tail. With 6 CPU threads, the ring
accumulates >4096 packets between vd_swap callbacks.
(b) Threaded backend (--parallel without --ui): the worker's
DrainFence handler has a 900 ms deadline and game-batched
IBs (8-10 M packets observed) keep it from reaching the
tail in any reasonable budget. If the worker eventually
drained past the injected packet later, the safety-net
direct notify would double-count.
Three changes:
* gpu_system.rs: new `drain_until_wptr(target, time_budget)`
draining by the canary `WorkerThreadMain` predicate
(read_offset != target) instead of a fixed packet count.
900 ms deadline mirrors the threaded DrainFence handler.
* handle.rs: inline `drain_to_current_wptr` switches to
`drain_until_wptr`. DrainFence handler publishes the digest
mirror BEFORE replying so the CPU's post-drain
`digest_snapshot` sees fresh stats.
* exports.rs (vd_swap): skip the PM4 ring injection
unconditionally and route swap notification through
`notify_xe_swap` directly. Tail-injection is unreliable
under --parallel for both backends. The slot-0
fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001);
draws=0 today so a stale slot 0 has no observable effect.
Verification:
* cargo test --workspace --release: 556 passing (unchanged).
* Lockstep -n 100M --stable-digest: bit-identical to
pre-fix master HEAD aa3f1d3.
{instructions:100000002, imports:987685, unimpl:0, draws:0,
swaps:2, ...}
* check --parallel --reservations-table -n 30M: 0 warnings
(was 2). swaps=2.
* exec --gpu-inline --parallel --reservations-table -n 30M:
0 warnings (was 2 with drained=8M-10M observed). swaps=2.
Audit IDs: GPUBUG-DRAIN-001 (closed),
GPUBUG-FETCH-PATCH-001 (filed, deferred).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO
write callback) uses `Ordering::Release` so any ring-memory writes
the guest performed before bumping WPTR are visible to a paired
`Acquire`-load on the consumer. The consumer here at `sync_with_mmio`
was using `Ordering::Relaxed` for both the WPTR load and the RPTR
mirror store — leaving the Release/Acquire pairing broken.
Under `--parallel`, this broken pairing means the GPU worker can
observe a fresh WPTR value while still reading stale ring-memory
contents at the corresponding offsets — garbage PM4 packets. The
audit's M11 grid run confirmed --parallel is non-deterministic
beyond the documented `packets` ±5% noise; this fix is one strand
of that.
Symmetric fix on the RPTR mirror store: Release pairs with any
guest-side Acquire-load of CP_RB_RPTR for ring-writeback
bookkeeping.
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged)
draws: 0 → 0 (unchanged)
packets: ~60M (within noise)
Tests: 149 (no count change; this is a memory-ordering correctness
fix, not a behavioral change visible at the digest level in
lockstep).
Closes GPUBUG-006 (P1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight of the register-index constants in draw_state.rs::reg pointed at
completely unrelated registers because the canonical canary table
(register_table.inc) was misread when the module was first authored.
Re-validated each value against canary's lines 1232-1336.
| Register | Pre-fix | Canary | Was-actually |
| ------------------------- | ------- | ------ | ------------- |
| VGT_DRAW_INITIATOR | 0x2281 | 0x21FC | (junk) |
| VGT_DMA_BASE | 0x2282 | 0x21FA | (junk) |
| VGT_DMA_SIZE | 0x2283 | 0x21FB | (junk) |
| PA_SC_WINDOW_SCISSOR_TL | 0x200E | 0x2081 | SCREEN_SCIS_TL|
| PA_SC_WINDOW_SCISSOR_BR | 0x200F | 0x2082 | SCREEN_SCIS_BR|
| RB_COLOR_INFO_1 | 0x2010 | 0x2003 | COHER_DEST_BASE_10|
| RB_COLOR_INFO_2 | 0x2011 | 0x2004 | COHER_DEST_BASE_11|
| RB_COLOR_INFO_3 | 0x2012 | 0x2005 | COHER_DEST_BASE_12|
| PA_SU_VTX_CNTL | 0x2083 | 0x2302 | PA_SC_CLIPRECT_RULE|
Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR
extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per
canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`).
The block comment in `extract()` was also wrong about the
intermediate field layout and has been refreshed.
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged)
draws: 0 → 0 (still gated — see below)
packets: ~61M (within noise)
Tests: 149 (no count change; existing draw_state tests cover the
new constants implicitly via behavioral round-trip).
The audit predicted Phases C+D+E together would unlock `draws > 0`,
but the runtime plateau is multi-causal per the audit's own analysis
(`project_xenia_rs_audit_2026_05_02.md`). The likely remaining
blockers in -n 100M:
* 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4,
0x42450b5c) — Phase F's XAM/spinlock fixes target this.
* shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD
yet because workers haven't loaded shader resources.
The register fixes here are still load-bearing for any draw that
DOES happen (every register read at 0x2281 was junk before this
commit) — landing them now is correct even if draws=0 persists until
Phase F unparks the resource-loader threads.
Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`,
xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1)
selecting kNone/k8in16/k8in32/k16in32 swap patterns per
`GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored
big-endian; the host is little-endian. Pre-fix every dword was
bitcast as-is — vertex positions decoded as byte-reversed garbage,
producing clipped or NaN positions in any draw that survived to the
host.
Mechanical changes:
- crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads
fetch_const dword 1 (endian) and wraps each lane's load in
`gpu_swap(value, endian)`. New `gpu_swap` helper added to the
emitted module header.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching
`gpu_swap` helper added to the runtime interpreter shader.
`interpret_vertex_fetch` reads fc1, computes the endian, and wraps
every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT
paths). Mirrors the AOT translator's emission.
Verification at -n 100M lockstep:
swaps: 2 → 2 (gated by Phase E for draws)
draws: 0 → 0
packets: ~60M (within noise)
Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call).
Closes GPUBUG-102 (P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Word-1 of every ALU triple holds three 8-bit component-relative
swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7
per canary ucode.h:2064-2066) and three per-operand negate flags
(bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT
translator discarded word-1 entirely with `_ = w1;` — every ALU
result was missing its swizzle (broadcast/permute patterns like
`.zyxw`, `.xxxx`) and any negated operand was used positive instead.
Component-relative semantics (canary's
`AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output
component i, the source component is `((swizzle >> (2*i)) + i) & 3`.
Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in
the interpreter shader treated it as absolute, also incorrect.
Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with
src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks
them from word 1.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses
component-relative semantics. interpret_alu decodes the modifiers
and applies via apply_swizzle + apply_modifiers (with abs=false).
- crates/xenia-gpu/src/translator.rs: src_operand emits the
precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`,
then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a
bare base expression so it round-trips with the trivial-shader
fixture.
Abs is omitted in this commit — the abs flag is dual-meaning (for
temps it lives at bit 7 of the src byte; for constants at word-2 bit
7 `abs_constants`). Wiring it up correctly requires more careful
case-split logic; deferred to Phase G.
Verification at -n 100M lockstep:
swaps: 2 → 2 (gated by Phase E for draws)
draws: 0 → 0
packets: ~58M (within noise)
Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise
because identity swizzle test merged into D1's parameterised test).
WGSL still validates via naga (combined_module_parses_as_wgsl).
Closes GPUBUG-100 (P0). Abs deferred to Phase G.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h:
2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags
selecting temp register (1) vs ALU constant (0); the corresponding
8-bit src byte indexes either:
- a temp register (bits 5:0 = index, bits 6/7 reserved for
relative-addressing / abs flags consumed by Phase D2), or
- an ALU constant (full 8-bit index).
Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F`
on the src byte and emitted `r[low7]` regardless of the operand class.
Every shader's WVP matrix / light constant / per-frame uniform read
came back as r[low7] — typically zero — yielding invisible rendering.
Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp /
src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our
src_a (low byte of w0) is canary's third operand, hence its selector
is bit 29 (canary src3_sel), not bit 31.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes
the is_temp flag; constants index xenos_consts.alu directly.
- crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the
interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when
constant.
The trivial-shader synthetic test was updated to set the temp flags so
its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the
flags set, all sources would now resolve as constants.
Bank-selection (cf-level relative addressing for higher banks of the
512 ALU constants) remains a Phase G+ extension — covers c0..c127
in bank 0, which most Sylpheed shaders use directly.
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged — gated by D2/D3/E for draws)
draws: 0 → 0
packets: ~61M (within noise)
Tests: 552 → 554 (+2 translator tests for the temp/constant decode).
Closes GPUBUG-101 (P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and
called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving
the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping
the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/
blur "sample frame N for frame N+1" path samples fetch-constant slot 0
expecting the frontbuffer descriptor; without the patch, slot 0 stayed
stale and any shader sampling it read garbage.
This commit writes the canary VdSwap PM4 sequence directly into the
primary ring at the current write pointer (read via the shared MMIO
atomic), then advances WPTR over the injection. The natural CP drain
consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant
slot 0 — without going through any direct kernel→GPU bypass.
Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521):
1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with
base_address re-patched from virtual to physical >> 12).
2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys
+ width + height.
Mechanism notes:
- buffer_ptr in xenia-rs is in the system command buffer, NOT the
primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs
ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to
buffer_ptr because its ring layout maps the reserved slot inside
the ring; xenia-rs's doesn't, so we have to write at the actual
ring WPTR address (cached on KernelState.ring_base from
VdInitializeRingBuffer).
- The original "buffer_ptr zero-fill + bump WPTR by 64" path is
preserved before the injection — it exposes any game-batched PM4
packets and keeps the buffer_ptr region skippable per existing
game compat behavior.
- A safety-net fallback at the end calls `notify_xe_swap` directly if
swaps_seen didn't advance during the drain (e.g. a ring-arithmetic
edge case). Idempotent — only fires when the PM4 path didn't.
- KRNBUG-Mm-04 deferred: virt→phys uses the masked stub
`virt & 0x1FFF_FFFF`, sufficient for the standard heap.
Mechanical changes:
- crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3
helpers + round-trip unit test (mirrors canary xenos.h:1682-1709).
- crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor
(Acquire-load) so the kernel can compute ring offsets.
- crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords
on KernelState (set by VdInitializeRingBuffer).
- crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit
block; patch fetch_dwords[1] base_address virt→phys before injection.
Verification at -n 100M lockstep:
swaps: 2 → 2 (game fires VdSwap exactly twice)
draws: 0 → 0 (gated by Phases D+E)
fallback warning: 0 occurrences (PM4 path consumed both swaps)
instructions: ~100M
Tests: 552 passing (553 with new pm4 round-trip test). Lockstep
stable-fields determinism: byte-identical across two 100M runs.
The "swaps > 2" prediction in the audit's plan assumed the game would
fire VdSwap more often once the path worked; empirically Sylpheed only
calls VdSwap twice within 100M instructions (this is the renderer
plateau the audit identified). The success criterion for Phase C is
that the PM4 path is now operational, which Phases D+E require for
visible draws.
Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rust reimplementation of the xenia Xbox 360 emulator targeting reverse-
engineering and preservation, initially scoped to Project Sylpheed.
Includes:
- XEX2 loader (LZX decompression, AES decryption, PE parsing)
- XISO / XGD2 disc image VFS
- PPC interpreter with 200+ opcodes and VMX128 decoding
- Static analyzer: functions, cross-references, labels, asm + SQLite output
- HLE kernel covering the xboxkrnl/xam subset used by Sylpheed init
- Debugger with in-memory and SQLite-backed execution tracing
- `xenia-rs` CLI with extract/dis/exec commands that produce cumulative,
superset SQLite databases and opt-in instruction/import/branch traces
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>