The Phase-C VdSwap PM4 ring path (commit 82f3d61) emits two
"PM4_XE_SWAP not consumed by drain" warnings when running:
exec sylpheed.iso --ui --quiet --halt-on-deadlock \
--parallel --reservations-table
Lockstep -n 100M never trips it. Two distinct race windows:
(a) Inline backend (--ui forces it): drain(mem, 4096) hit its
fixed packet cap before reaching the PM4_XE_SWAP we'd just
injected at the WPTR tail. With 6 CPU threads, the ring
accumulates >4096 packets between vd_swap callbacks.
(b) Threaded backend (--parallel without --ui): the worker's
DrainFence handler has a 900 ms deadline, and game-batched
IBs (8-10 M packets observed) keep it from reaching the
tail in any reasonable budget. If the worker later drained
past the injected packet, the swap would be counted twice:
once by the safety-net direct notify, once by the packet.
Three changes:
* gpu_system.rs: new `drain_until_wptr(target, time_budget)`
that drains on the canary `WorkerThreadMain` predicate
(read_offset != target) instead of a fixed packet count.
Its 900 ms deadline mirrors the threaded DrainFence handler.
* handle.rs: inline `drain_to_current_wptr` switches to
`drain_until_wptr`. DrainFence handler publishes the digest
mirror BEFORE replying so the CPU's post-drain
`digest_snapshot` sees fresh stats.
* exports.rs (vd_swap): skip the PM4 ring injection
unconditionally and route swap notification through
`notify_xe_swap` directly. Tail-injection is unreliable
under --parallel for both backends. The slot-0
fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001);
draws=0 today so a stale slot 0 has no observable effect.
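Illustrative sketch of the drain change (not the code in gpu_system.rs).
It contrasts a fixed packet cap with a drain that runs until the read
offset reaches the target WPTR or a time budget expires. `Ring`,
`DrainOutcome`, `drain_capped`, and `execute_one_packet` are hypothetical
names, and each "packet" is assumed to be one dword to keep the model small:

```rust
use std::time::{Duration, Instant};

/// Why a drain call returned (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum DrainOutcome {
    /// The read offset caught up with the target WPTR.
    ReachedTarget { packets: u64 },
    /// The cap or the deadline was hit before reaching the target.
    Stopped { packets: u64 },
}

/// Hypothetical stand-in for the command processor's ring state.
struct Ring {
    read_offset: u32, // in dwords
    size: u32,        // ring size in dwords
}

impl Ring {
    /// Placeholder for real PM4 decoding: consume one dword and wrap.
    fn execute_one_packet(&mut self) {
        self.read_offset = (self.read_offset + 1) % self.size;
    }

    /// Old shape: stop after `max_packets`, even if the tail is further away.
    fn drain_capped(&mut self, target: u32, max_packets: u64) -> DrainOutcome {
        let mut packets = 0u64;
        while self.read_offset != target {
            if packets == max_packets {
                return DrainOutcome::Stopped { packets };
            }
            self.execute_one_packet();
            packets += 1;
        }
        DrainOutcome::ReachedTarget { packets }
    }

    /// New shape: loop on `read_offset != target`, bounded only by time.
    fn drain_until_wptr(&mut self, target: u32, time_budget: Duration) -> DrainOutcome {
        let deadline = Instant::now() + time_budget;
        let mut packets = 0u64;
        while self.read_offset != target {
            if Instant::now() >= deadline {
                return DrainOutcome::Stopped { packets };
            }
            self.execute_one_packet();
            packets += 1;
        }
        DrainOutcome::ReachedTarget { packets }
    }
}
```

With a tail 10,000 packets ahead, `drain_capped(10_000, 4096)` stops at
offset 4096 and never sees the tail, which is the shape of window (a);
`drain_until_wptr(10_000, Duration::from_millis(900))` runs to the target.
A real implementation would likely check the clock less often than once
per packet.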
Verification:
* cargo test --workspace --release: 556 passing (unchanged).
* Lockstep -n 100M --stable-digest: bit-identical to
pre-fix master HEAD aa3f1d3.
{instructions:100000002, imports:987685, unimpl:0, draws:0,
swaps:2, ...}
* check --parallel --reservations-table -n 30M: 0 warnings
(was 2). swaps=2.
* exec --gpu-inline --parallel --reservations-table -n 30M:
0 warnings (was 2 with drained=8M-10M observed). swaps=2.
Audit IDs: GPUBUG-DRAIN-001 (closed),
GPUBUG-FETCH-PATCH-001 (filed, deferred).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO
write callback) uses `Ordering::Release` so any ring-memory writes
the guest performed before bumping WPTR are visible to a paired
`Acquire`-load on the consumer. The consumer in `sync_with_mmio`
was using `Ordering::Relaxed` for both the WPTR load and the RPTR
mirror store, leaving the Release/Acquire pairing broken. The WPTR
load is now `Ordering::Acquire`.
Under `--parallel`, this broken pairing means the GPU worker can
observe a fresh WPTR value while still reading stale ring-memory
contents at the corresponding offsets, i.e. garbage PM4 packets. The
audit's M11 grid run confirmed --parallel is non-deterministic
beyond the documented `packets` ±5% noise; this fix addresses one
contributor to that non-determinism.
Symmetric fix on the RPTR mirror store: it is now a `Release` store,
which pairs with any guest-side Acquire-load of CP_RB_RPTR for
ring-writeback bookkeeping.
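Illustrative sketch of the pairing described above (not the code in
`mmio_region.rs` or `sync_with_mmio`). `SharedRing`, `produce`, `consume`,
and `run_once` are hypothetical names. Ring memory is modelled with
`AtomicU32` slots written with `Relaxed`, so the only thing that publishes
the packet contents is the Release store to `wptr` and the matching
Acquire load:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical model of the ring shared by the guest CPU and the GPU worker.
struct SharedRing {
    slots: Vec<AtomicU32>, // stands in for guest ring memory
    wptr: AtomicU32,       // stands in for CP_RB_WPTR
    rptr: AtomicU32,       // stands in for the CP_RB_RPTR mirror
}

impl SharedRing {
    fn new(len: usize) -> Self {
        SharedRing {
            slots: (0..len).map(|_| AtomicU32::new(0)).collect(),
            wptr: AtomicU32::new(0),
            rptr: AtomicU32::new(0),
        }
    }
}

/// Producer side: write the packets, then publish WPTR with Release so the
/// slot writes happen-before any Acquire load that observes the new WPTR.
fn produce(ring: &SharedRing, packets: &[u32]) {
    assert!(packets.len() <= ring.slots.len());
    for (slot, &p) in ring.slots.iter().zip(packets) {
        slot.store(p, Ordering::Relaxed);
    }
    ring.wptr.store(packets.len() as u32, Ordering::Release);
}

/// Consumer side: Acquire-load WPTR before reading the slots, then publish
/// the RPTR mirror with Release for any reader that Acquire-loads it.
fn consume(ring: &SharedRing) -> Vec<u32> {
    let wptr = loop {
        let w = ring.wptr.load(Ordering::Acquire);
        if w != 0 {
            break w;
        }
        thread::yield_now();
    };
    let packets: Vec<u32> = ring.slots[..wptr as usize]
        .iter()
        .map(|s| s.load(Ordering::Relaxed))
        .collect();
    ring.rptr.store(wptr, Ordering::Release);
    packets
}

/// Runs one producer/consumer exchange; `packets` must be non-empty,
/// otherwise the consumer in this sketch would wait forever.
fn run_once(packets: &[u32]) -> (Vec<u32>, u32) {
    assert!(!packets.is_empty());
    let ring = Arc::new(SharedRing::new(8));
    let consumer = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || consume(&ring))
    };
    produce(&ring, packets);
    let seen = consumer.join().expect("consumer thread panicked");
    (seen, ring.rptr.load(Ordering::Acquire))
}
```

If the WPTR load in `consume` were `Relaxed`, the memory model would allow
the slot loads to return the initial zeros even after the new WPTR value is
seen. That is the stale-ring-contents case this commit addresses.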
Verification at -n 100M lockstep:
swaps: 2 → 2 (unchanged)
draws: 0 → 0 (unchanged)
packets: ~60M (within noise)
Tests: 149 (no count change; this is a memory-ordering correctness
fix, not a behavioral change visible at the digest level in
lockstep).
Closes GPUBUG-006 (P1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>