Commit Graph

3 Commits

Author SHA1 Message Date
MechaCat02
7a1b6b3306 fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel
The Phase-C VdSwap PM4 ring path (commit 82f3d61) emits two
"PM4_XE_SWAP not consumed by drain" warnings when running:

  exec sylpheed.iso --ui --quiet --halt-on-deadlock \
    --parallel --reservations-table

Lockstep -n 100M never trips it. Two distinct race windows:

(a) Inline backend (--ui forces it): drain(mem, 4096) hit its
    fixed packet cap before reaching the PM4_XE_SWAP we'd just
    injected at the WPTR tail. With 6 CPU threads, the ring
    accumulates >4096 packets between vd_swap callbacks.

(b) Threaded backend (--parallel without --ui): the worker's
    DrainFence handler has a 900 ms deadline and game-batched
    IBs (8-10 M packets observed) keep it from reaching the
    tail in any reasonable budget. If the worker eventually
    drained past the injected packet later, the safety-net
    direct notify would double-count.

Three changes:

* gpu_system.rs: new `drain_until_wptr(target, time_budget)`
  draining by the canary `WorkerThreadMain` predicate
  (read_offset != target) instead of a fixed packet count.
  900 ms deadline mirrors the threaded DrainFence handler.

* handle.rs: inline `drain_to_current_wptr` switches to
  `drain_until_wptr`. DrainFence handler publishes the digest
  mirror BEFORE replying so the CPU's post-drain
  `digest_snapshot` sees fresh stats.

* exports.rs (vd_swap): skip the PM4 ring injection
  unconditionally and route swap notification through
  `notify_xe_swap` directly. Tail-injection is unreliable
  under --parallel for both backends. The slot-0
  fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001);
  draws=0 today so a stale slot 0 has no observable effect.

Verification:

* cargo test --workspace --release: 556 passing (unchanged).

* Lockstep -n 100M --stable-digest: bit-identical to
  pre-fix master HEAD aa3f1d3.
  {instructions:100000002, imports:987685, unimpl:0, draws:0,
   swaps:2, ...}

* check --parallel --reservations-table -n 30M: 0 warnings
  (was 2). swaps=2.

* exec --gpu-inline --parallel --reservations-table -n 30M:
  0 warnings (was 2 with drained=8M-10M observed). swaps=2.

Audit IDs: GPUBUG-DRAIN-001 (closed),
GPUBUG-FETCH-PATCH-001 (filed, deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:12:15 +02:00
MechaCat02
8fc1b1dfed fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer
The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO
write callback) uses `Ordering::Release` so any ring-memory writes
the guest performed before bumping WPTR are visible to a paired
`Acquire`-load on the consumer. The consumer here at `sync_with_mmio`
was using `Ordering::Relaxed` for both the WPTR load and the RPTR
mirror store — leaving the Release/Acquire pairing broken.

Under `--parallel`, this broken pairing means the GPU worker can
observe a fresh WPTR value while still reading stale ring-memory
contents at the corresponding offsets — garbage PM4 packets. The
audit's M11 grid run confirmed --parallel is non-deterministic
beyond the documented `packets` ±5% noise; this fix is one strand
of that.

Symmetric fix on the RPTR mirror store: Release pairs with any
guest-side Acquire-load of CP_RB_RPTR for ring-writeback
bookkeeping.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged)
  draws:                0 → 0     (unchanged)
  packets:              ~60M (within noise)
Tests: 149 (no count change; this is a memory-ordering correctness
fix, not a behavioral change visible at the digest level in
lockstep).

Closes GPUBUG-006 (P1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:26:09 +02:00
MechaCat02
79eb52c378 xenia-gpu: end-to-end Xenos pipeline (PM4, ucode, EDRAM, resolve)
First real GPU implementation. Ring/PM4 frontend (ring_view,
ring_drain, pm4) drains the command processor; gpu_system owns the
threaded backend (DrainFence RPC + parker/fence helpers from M1) and
the MMIO-mapped register block (mmio_region).

Xenos shader frontend: ucode/{alu,control_flow,fetch,mod}.rs decode
the Xbox 360 microcode, translator.rs lowers it onto the WGSL
xenos_interp interpreter shader (shaders/xenos_interp.wgsl).
shader_metrics.rs counts decode/translate work.

Render state: draw_state, primitive, render_target_cache,
texture_cache, tiled_address (Xenos's swizzled tiled-memory layout),
xenos_constants (register field constants), edram (the 10 MiB EDRAM
model with MSAA), and resolve.rs (TILE_FLUSH copy-out — clear-resolve
plus bitwise-equivalent 32 bpp + 64 bpp paths landed). handle.rs
owns the typed GPU-resource handles the kernel hands out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:29:38 +02:00