Commit Graph

71 Commits

Author SHA1 Message Date
MechaCat02
2a9fd1fc86 feat(kernel): KRNBUG-AUDIT-002 — multi-frame guest stack capture at handle creation
Adds `walk_guest_back_chain` (PPC EABI back-chain walker) and a
`record_create_with_stack` audit hook gated on `--trace-handles-focus`.
NtCreateEvent / NtCreateSemaphore / NtCreateTimer / XamTaskSchedule now
route through the new helper so focused handles capture up to 6 stack
frames at allocation time. Diagnostic-only, read-only memory access:
unfocused handles pay one HashSet lookup, focused ones pay six
back-chain dereferences. Lockstep determinism preserved.

End-to-end finding: handles 0x1004 (8-instance pool via static ctor at
0x8280F810), 0x100c (singleton built inside main()), 0x15e0 (singleton
in distinct cluster) are silph-framework dispatcher objects whose
producer code is unreached at -n 500M. The producer hunt now has class
ownership; vtable/RTTI readout is the next step.

Tests: 576 → 581 green. `--stable-digest -n 100M` instructions=100000002
unchanged. Master HEAD prior: 9d45efe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:41:06 +02:00
MechaCat02
07068e7616 feat(audio): APUBUG-PRODUCER-001 — XAudio register driver client + opt-in callback ticker
Replace the three XAudio kernel-export stubs (Register/Unregister/SubmitFrame)
with canary-faithful implementations and add a periodic buffer-complete
callback ticker reusing the existing SavedCallbackCtx injection machinery.

Canary parity:
- xboxkrnl_audio.cc:56-93 — read callback_ptr[0..1], wrap callback_arg in a
  4-byte big-endian guest heap buffer (`wrapped_callback_arg`), write
  `0x4155_xxxx` to *driver_ptr.
- audio_system.cc:139-141 — guest callback receives r3 = wrapped pointer,
  not raw callback_arg.
- audio_driver.h:21-24 — frame rate 256 samples / 48 kHz ≈ 5.33 ms.

Implementation:
- New `crates/xenia-kernel/src/xaudio.rs` — `XAudioClient`, `XAudioState`
  (8-slot table, pending FIFO, dual-mode ticker), `XAUDIO_INSTR_PERIOD =
  48_000` (lockstep) and `XAUDIO_PERIOD = 5.333 ms` (--parallel), same
  pattern as KRNBUG-D08 v-sync.
- `try_inject_audio_callback` in xenia-app mirrors `try_inject_graphics_interrupt`,
  shares `interrupts.saved` slot for mutex with graphics callbacks.

Gating: ticker + injector run only when `--xaudio-tick` /
`XENIA_XAUDIO_TICK=1`. Default off because Sylpheed's audio callback
enters an infinite `KeWaitForSingleObject` loop on first invocation
(canary's host worker thread provides the buffer-completion fence we
don't model), which hijacks a guest HW thread and regresses
`swaps=2 → 1`. Default-off preserves the lockstep `sylpheed_n*m.json`
goldens exactly.

Producer hunt outcome (FALSIFIED for parked handles 0x1004/0x100c/0x15e4):
at `-n 500M --xaudio-tick` all 3 handles still show
`signal_attempts=0 (primary=0, ghost=0)`. Audio callback is not the
missing producer. Next candidate per audit-findings.md is Timer DPC
delivery (KeSetTimer / KeInsertQueueDpc).

Tests: 562 → 576 green (10 in `xaudio.rs`, 4 in `exports.rs`).
Lockstep `--stable-digest -n 100M` default-off: instructions=100000002,
swaps=2 (matches pre-change baseline byte-for-byte).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:50:22 +02:00
MechaCat02
691404e36e fix(xam): XAMBUG-PRODUCER-001 — XamTaskSchedule spawns a real guest thread
Replaces the no-op stub at xam.rs:204 with a canary-faithful
implementation mirroring xenia-canary/src/xenia/kernel/xam/xam_task.cc:43-80.
Allocates a ThreadImage, allocates a KernelObject::Thread handle, and
routes through Scheduler::spawn with entry=callback and
start_context=message_ptr (canary's third positional XThread ctor arg).
Stack size = max(0x4000, page-aligned 0x10_0000).

Producer-hypothesis outcome (500M --trace-handles-focus run): the call
site at 0x824a9a10 is never reached during this boot horizon, so
XamTaskSchedule cannot be the missing producer for the 3 parked
Event/Manual handles (0x1004, 0x100c, 0x15e4). The fix still lands —
the stub was a real correctness bug that would manifest the moment
the boot advances past the current deadlock. Next candidate per
audit-findings.md: XAudioRegisterRenderDriverClient.

- Workspace tests: 561 → 562 green (new test
  xam::tests::xam_task_schedule_spawns_real_thread).
- --stable-digest -n 100M: instructions=100000002 unchanged from
  baseline; lockstep determinism preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:32:40 +02:00
MechaCat02
27d3608174 fix(kernel): KRNBUG-D08 — wall-clock v-sync under --parallel
The synthetic v-sync ticker used a per-instruction proxy
(VSYNC_INSTR_PERIOD = 150 k) tuned for ~10 MIPS lockstep
throughput → 60 Hz. Audit M11 observed this drifts under
`--parallel`: with 6 worker threads sharing the kernel mutex,
the dispatcher executes more PPC instructions per tick
callback, so the accumulator never crosses 150 k. Result:
~629 v-syncs/100M lockstep → ~2 v-syncs/100M --parallel.

Hybrid solution preserves lockstep determinism (which the
goldens depend on) while fixing --parallel:

* `tick_vsync_instr(instr_count)` — legacy instruction-count
  ticker, used by lockstep. Bit-stable across runs.

* `tick_vsync_wallclock()` — new Instant-based ticker. Fires
  `floor(elapsed / VSYNC_PERIOD)` v-syncs since the anchor
  and advances the anchor by that many full periods (no
  lazy backlog). Capped at INTERRUPT_QUEUE_CAP per call so a
  forward-jumping clock can't overflow the FIFO.

* `KernelState.parallel_active` flag set at startup from
  `--parallel` / `XENIA_PARALLEL=1`. Read by `coord_pre_round`
  in main.rs to choose between the two tickers.

Verification:

* cargo test --workspace --release: 561 passing (+3 new
  wall-clock tests vs prior 558 baseline).
* lockstep -n 100M --stable-digest: BIT-IDENTICAL to
  pre-Phase-3 baseline. interrupts_delivered preserved at
  ~630 (was ~629 pre-fix).
* --parallel --reservations-table -n 30M: interrupts_delivered
  rose from ~2 to 17. (FIFO INTERRUPT_QUEUE_CAP=4 still caps
  burst delivery; that's a separate bottleneck — addressed
  by raising cap when --parallel queue depth becomes the
  next blocker.)

Trade-off: --parallel runs are non-deterministic at the
v-sync rate by design (per audit M05 PPCBUG-703 already).
Lockstep stays bit-identical, so the `sylpheed_n*m.json`
goldens are untouched.

Audit IDs: KRNBUG-D08 (closed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:34:30 +02:00
MechaCat02
d1105aafae diag(audit): KRNBUG-AUDIT-001 — focused parked-waiter ghost-trail diagnostic
Adds a one-run diagnostic that distinguishes "guest never called
Nt/KeSetEvent on this handle" from "signal landed but waiter wasn't
woken", for any handle named via `--trace-handles-focus`.

Parked-waiter context (project_xenia_rs_sylpheed_stage3_2026_04_29):
four worker threads block Sylpheed past `draws=0` on handles
0x1004 / 0x100c / 0x15e4 / 0x42450b5c (mr=true, sig=false). The
pre-existing audit dropped signal-attempts that targeted handles
without a primary trail, so we couldn't tell whether the producer
was unreachable in the guest or whether the signal landed but missed
its waiter.

Three changes:

* audit.rs: `HandleAudit` gains `focus: HashSet<u32>` and
  `ghost_trails: HashMap<u32, GhostTrail>`. `record_signal`
  auto-falls-through to a new `record_signal_attempt_ghost` when no
  primary trail exists AND the handle is in `focus`. Bounded by
  AUDIT_RING_CAPACITY per handle. Two new tests cover the focus
  ghost-trail and no-double-record invariants.

* main.rs: new `--trace-handles-focus=<LIST>` flag (hex 0x or decimal,
  comma-separated) populates `kernel.audit.focus`. Implies
  `--trace-handles`. New "=== Handle audit (focus) ===" section in
  `dump_thread_diagnostic` emits per-handle:
    - signal_attempts (primary + ghost), waits, wakes
    - merged cycle-sorted timeline (last 16)
    - GuestExport / KernelInternal classification
    - <AUDIT_BLIND> marker when waiter_count > 0 but the audit
      saw no waits (i.e. waiter parked via a non-audit path —
      CS / spinlock / DPC).
    - DIAGNOSIS conclusion that selects between five branches.

* `cmd_check` passes None for focus → goldens unaffected.

Empirical run output at -n 500M lockstep with
`--trace-handles-focus=0x1004,0x100c,0x15e4,0x42450b5c`:

  handle=0x00001004 kind=Event/Manual waiters=1 signaled=false
                    signal_attempts=0 (primary=0, ghost=0)
                    waits=1 wakes=0
     created cycle=0 tid=1 lr=0x824a9f6c src=NtCreateEvent
     => producer is a missing kernel signal source
        (or BST-paradox upstream)
  ... (same shape for 0x100c, 0x15e4)
  handle=0x42450b5c kind=<UNCREATED> waiters=1 signal_attempts=0
                    waits=0 wakes=0 <AUDIT_BLIND>
     => waiter parked via non-audited path

Conclusion: hypothesis (A) confirmed for all 4 handles. Producer is
NOT a wake/eligibility bug — it is a genuinely missing kernel signal
source. The 3 Event/Manual handles share a creator
(lr=0x824a9f6c, tid=1) and the same wait-call wrapper at
lr=0x824ac578 — these are 3 worker threads all parked on
"work-available" notifications that never come.

Verification:
* cargo test --workspace --release: 558 passing (+2 new ghost-trail
  tests vs prior 556 baseline)
* lockstep -n 100M --stable-digest: bit-identical to master HEAD

Audit IDs: KRNBUG-AUDIT-001 (closed — diagnostic instrumentation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:22:14 +02:00
MechaCat02
7a1b6b3306 fix(gpu): GPUBUG-DRAIN-001 — silence VdSwap PM4 fallback under --parallel
The Phase-C VdSwap PM4 ring path (commit 82f3d61) emits two
"PM4_XE_SWAP not consumed by drain" warnings when running:

  exec sylpheed.iso --ui --quiet --halt-on-deadlock \
    --parallel --reservations-table

Lockstep -n 100M never trips it. Two distinct race windows:

(a) Inline backend (--ui forces it): drain(mem, 4096) hit its
    fixed packet cap before reaching the PM4_XE_SWAP we'd just
    injected at the WPTR tail. With 6 CPU threads, the ring
    accumulates >4096 packets between vd_swap callbacks.

(b) Threaded backend (--parallel without --ui): the worker's
    DrainFence handler has a 900 ms deadline and game-batched
    IBs (8-10 M packets observed) keep it from reaching the
    tail in any reasonable budget. If the worker eventually
    drained past the injected packet later, the safety-net
    direct notify would double-count.

Three changes:

* gpu_system.rs: new `drain_until_wptr(target, time_budget)`
  draining by the canary `WorkerThreadMain` predicate
  (read_offset != target) instead of a fixed packet count.
  900 ms deadline mirrors the threaded DrainFence handler.

* handle.rs: inline `drain_to_current_wptr` switches to
  `drain_until_wptr`. DrainFence handler publishes the digest
  mirror BEFORE replying so the CPU's post-drain
  `digest_snapshot` sees fresh stats.

* exports.rs (vd_swap): skip the PM4 ring injection
  unconditionally and route swap notification through
  `notify_xe_swap` directly. Tail-injection is unreliable
  under --parallel for both backends. The slot-0
  fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001);
  draws=0 today so a stale slot 0 has no observable effect.

Verification:

* cargo test --workspace --release: 556 passing (unchanged).

* Lockstep -n 100M --stable-digest: bit-identical to
  pre-fix master HEAD aa3f1d3.
  {instructions:100000002, imports:987685, unimpl:0, draws:0,
   swaps:2, ...}

* check --parallel --reservations-table -n 30M: 0 warnings
  (was 2). swaps=2.

* exec --gpu-inline --parallel --reservations-table -n 30M:
  0 warnings (was 2 with drained=8M-10M observed). swaps=2.

Audit IDs: GPUBUG-DRAIN-001 (closed),
GPUBUG-FETCH-PATCH-001 (filed, deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:12:15 +02:00
MechaCat02
780e854c2f fix(memory): XMODBUG-002 — write_bulk bumps page_versions for touched pages
`GuestMemory::write_bulk` did the bulk copy via raw `copy_nonoverlapping`
without bumping page_versions for any of the pages it touched. The
per-byte `write_u8/u16/u32` methods all bump page_versions after their
store; downstream caches (texture cache, shader cache) Acquire-load the
slot to invalidate stale entries on guest writes. Without the bulk
bump, a caller like `NtReadFile` writing a texture/shader resource into
guest memory would leave any cache that had already keyed on the prior
version handing back stale decoded bytes.

After the copy, walk every page the write touched and bump it. Cheap:
the typical bulk write spans a few pages (NtReadFile uses 64-128 KB
chunks → 16-32 pages).

Reservation-table invalidation for `lwarx`/`stwcx.` (XMODBUG-001's
sibling) is NOT addressed here — the reservation table lives on
KernelState, not GuestMemory, and plumbing it through requires a wider
change. Callers that bulk-write code-bearing or atomic-bearing memory
should call `kernel.reservations.invalidate_for_write(addr)`
themselves; XEX-loader and NtReadFile are doing data-bearing writes
that don't intersect lwarx targets, so this is acceptable for now.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged)
  draws:                0 → 0
  texture_cache_entries: 0 → 0    (Sylpheed hasn't issued IM_LOAD yet
                                   — the bump is silent until a cache
                                   keys on a touched page, which won't
                                   happen until Phase F2/F3 unblocks
                                   the resource-loader workers)
  packets:              ~59M (within noise)
Tests: 16 memory pass.

Closes XMODBUG-002 (P1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:30:22 +02:00
MechaCat02
8fc1b1dfed fix(gpu): GPUBUG-006 — sync_with_mmio Acquire/Release pair the producer
The producer side (`mmio_region.rs:78`, the guest's CP_RB_WPTR MMIO
write callback) uses `Ordering::Release` so any ring-memory writes
the guest performed before bumping WPTR are visible to a paired
`Acquire`-load on the consumer. The consumer here at `sync_with_mmio`
was using `Ordering::Relaxed` for both the WPTR load and the RPTR
mirror store — leaving the Release/Acquire pairing broken.

Under `--parallel`, this broken pairing means the GPU worker can
observe a fresh WPTR value while still reading stale ring-memory
contents at the corresponding offsets — garbage PM4 packets. The
audit's M11 grid run confirmed --parallel is non-deterministic
beyond the documented `packets` ±5% noise; this fix is one strand
of that.

Symmetric fix on the RPTR mirror store: Release pairs with any
guest-side Acquire-load of CP_RB_RPTR for ring-writeback
bookkeeping.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged)
  draws:                0 → 0     (unchanged)
  packets:              ~60M (within noise)
Tests: 149 (no count change; this is a memory-ordering correctness
fix, not a behavioral change visible at the digest level in
lockstep).

Closes GPUBUG-006 (P1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:26:09 +02:00
MechaCat02
e7d0fcf2c9 fix(kernel): KRNBUG-017 — real Kf*SpinLock + KeReleaseSpinLockFromRaisedIrql
The Kf-family spinlock exports were registered as stubs:
  KfAcquireSpinLock              → stub_return_zero (didn't write lock)
  KfReleaseSpinLock              → stub_success     (didn't clear lock)
  KeReleaseSpinLockFromRaisedIrql → stub_success     (same)
  KeTryToAcquireSpinLockAtRaisedIrql → returned 1 but didn't set lock value

Guest code that read the lock value back (e.g. nested
acquire/release sanity checks, debug assertions) saw 0 even after
"acquiring", and could enter critical regions without contention
serialization. Under `--parallel` the coarse Arc<Mutex<KernelState>>
already serializes us, so the audit's P0-under-parallel ranking is
about correctness of the lock value visible to guest code, not
mutual-exclusion (which is provided by the host mutex).

Implementation mirrors canary's
`xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_threading.cc`:
  - KfAcquireSpinLock:           write 1 to *SpinLock, return 0 (old IRQL)
  - KfReleaseSpinLock:           write 0 to *SpinLock
  - KeReleaseSpinLockFromRaisedIrql: write 0 to *SpinLock
  - KeTryToAcquireSpinLockAtRaisedIrql: write 1 to *SpinLock, return 1

Single-threaded HLE: contention can never be observed (we never run
two guest threads simultaneously without holding the kernel mutex),
so the spin-loop can degenerate to an unconditional acquire.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged)
  draws:                0 → 0     (gated by F2/F3/G)
  packets:              ~59M (within noise)
Tests: 76 kernel pass (no count change; existing harness covers the
new write semantics implicitly via guest-memory smoke tests).

Closes KRNBUG-017 (P0 under --parallel).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:24:47 +02:00
MechaCat02
8723d6826b fix(gpu): GPUBUG-103/104/105 — fix 8 draw-state register addresses + index_size bit
Eight of the register-index constants in draw_state.rs::reg pointed at
completely unrelated registers because the canonical canary table
(register_table.inc) was misread when the module was first authored.
Re-validated each value against canary's lines 1232-1336.

| Register                  | Pre-fix | Canary | Was-actually  |
| ------------------------- | ------- | ------ | ------------- |
| VGT_DRAW_INITIATOR        | 0x2281  | 0x21FC | (junk)        |
| VGT_DMA_BASE              | 0x2282  | 0x21FA | (junk)        |
| VGT_DMA_SIZE              | 0x2283  | 0x21FB | (junk)        |
| PA_SC_WINDOW_SCISSOR_TL   | 0x200E  | 0x2081 | SCREEN_SCIS_TL|
| PA_SC_WINDOW_SCISSOR_BR   | 0x200F  | 0x2082 | SCREEN_SCIS_BR|
| RB_COLOR_INFO_1           | 0x2010  | 0x2003 | COHER_DEST_BASE_10|
| RB_COLOR_INFO_2           | 0x2011  | 0x2004 | COHER_DEST_BASE_11|
| RB_COLOR_INFO_3           | 0x2012  | 0x2005 | COHER_DEST_BASE_12|
| PA_SU_VTX_CNTL            | 0x2083  | 0x2302 | PA_SC_CLIPRECT_RULE|

Also corrected the `index_size` bit position in VGT_DRAW_INITIATOR
extraction: was bit 8 (which is `major_mode[0]`), should be bit 11 per
canary `registers.h:324` (`xenos::IndexFormat index_size : 1; // +11`).
The block comment in `extract()` was also wrong about the
intermediate field layout and has been refreshed.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged)
  draws:                0 → 0     (still gated — see below)
  packets:              ~61M (within noise)
Tests: 149 (no count change; existing draw_state tests cover the
new constants implicitly via behavioral round-trip).

The audit predicted Phases C+D+E together would unlock `draws > 0`,
but the runtime plateau is multi-causal per the audit's own analysis
(`project_xenia_rs_audit_2026_05_02.md`). The likely remaining
blockers in -n 100M:
  * 4 parked-waiter worker threads (handles 0x1004, 0x100c, 0x15e4,
    0x42450b5c) — Phase F's XAM/spinlock fixes target this.
  * shader_blobs_live=0 after 100M — the game hasn't issued IM_LOAD
    yet because workers haven't loaded shader resources.
The register fixes here are still load-bearing for any draw that
DOES happen (every register read at 0x2281 was junk before this
commit) — landing them now is correct even if draws=0 persists until
Phase F unparks the resource-loader threads.

Closes GPUBUG-103, GPUBUG-104, GPUBUG-105 (P0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:22:04 +02:00
MechaCat02
ec2d955dbd fix(gpu): GPUBUG-102 — apply per-format endian byte-swap to vertex fetch
The vertex fetch constant (canary `xe_gpu_vertex_fetch_t`,
xenos.h:1158-1172) holds an `endian` field (low 2 bits of dword_1)
selecting kNone/k8in16/k8in32/k16in32 swap patterns per
`GpuSwapInline` (xenos.h:1090-1109). Xbox 360 vertex data is stored
big-endian; the host is little-endian. Pre-fix every dword was
bitcast as-is — vertex positions decoded as byte-reversed garbage,
producing clipped or NaN positions in any draw that survived to the
host.

Mechanical changes:
- crates/xenia-gpu/src/translator.rs: AOT `emit_vfetch` reads
  fetch_const dword 1 (endian) and wraps each lane's load in
  `gpu_swap(value, endian)`. New `gpu_swap` helper added to the
  emitted module header.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: matching
  `gpu_swap` helper added to the runtime interpreter shader.
  `interpret_vertex_fetch` reads fc1, computes the endian, and wraps
  every format's per-lane load (including 8_8_8_8 and 16_16_FLOAT
  paths). Mirrors the AOT translator's emission.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (gated by Phase E for draws)
  draws:                0 → 0
  packets:              ~60M (within noise)
Tests: +1 (vfetch_emit_includes_gpu_swap_helper_call).

Closes GPUBUG-102 (P0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:18:46 +02:00
MechaCat02
c5c6713419 fix(gpu): GPUBUG-100 — apply per-operand swizzle + negate to ALU sources
Word-1 of every ALU triple holds three 8-bit component-relative
swizzles (`src1_swiz`/`src2_swiz`/`src3_swiz` at bits 16-23/8-15/0-7
per canary ucode.h:2064-2066) and three per-operand negate flags
(bits 24/25/26). Pre-fix, both the WGSL interpreter and the AOT
translator discarded word-1 entirely with `_ = w1;` — every ALU
result was missing its swizzle (broadcast/permute patterns like
`.zyxw`, `.xxxx`) and any negated operand was used positive instead.

Component-relative semantics (canary's
`AluInstruction::GetSwizzledComponentIndex`, ucode.h:1996): for output
component i, the source component is `((swizzle >> (2*i)) + i) & 3`.
Identity swizzle is 0x00, NOT 0xE4 — the original `apply_swizzle` in
the interpreter shader treated it as absolute, also incorrect.

Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: extend AluInstruction with
  src_X_swiz (u8) and src_X_negate (bool) fields. decode_alu unpacks
  them from word 1.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: apply_swizzle uses
  component-relative semantics. interpret_alu decodes the modifiers
  and applies via apply_swizzle + apply_modifiers (with abs=false).
- crates/xenia-gpu/src/translator.rs: src_operand emits the
  precomputed swizzle inline as `vec4<f32>(base.x, base.y, ...)`,
  then wraps in `(-…)` when negated. Identity swizzle (0x00) emits a
  bare base expression so it round-trips with the trivial-shader
  fixture.

Abs is omitted in this commit — the abs flag is dual-meaning (for
temps it lives at bit 7 of the src byte; for constants at word-2 bit
7 `abs_constants`). Wiring it up correctly requires more careful
case-split logic; deferred to Phase G.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (gated by Phase E for draws)
  draws:                0 → 0
  packets:              ~58M (within noise)
Tests: 554 → 555 (+1 swizzle/negate test, no count change otherwise
because identity swizzle test merged into D1's parameterised test).
WGSL still validates via naga (combined_module_parses_as_wgsl).

Closes GPUBUG-100 (P0). Abs deferred to Phase G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:15:07 +02:00
MechaCat02
78ea81c12a fix(gpu): GPUBUG-101 — decode src1/2/3_sel temp-vs-constant selector
Per canary AluInstruction layout (xenia-canary/src/xenia/gpu/ucode.h:
2078-2086), word-0 bits 29-31 are the per-operand `srcN_sel` flags
selecting temp register (1) vs ALU constant (0); the corresponding
8-bit src byte indexes either:
  - a temp register (bits 5:0 = index, bits 6/7 reserved for
    relative-addressing / abs flags consumed by Phase D2), or
  - an ALU constant (full 8-bit index).

Pre-fix, the WGSL interpreter and AOT translator both masked `& 0x7F`
on the src byte and emitted `r[low7]` regardless of the operand class.
Every shader's WVP matrix / light constant / per-frame uniform read
came back as r[low7] — typically zero — yielding invisible rendering.

Mechanical changes:
- crates/xenia-gpu/src/ucode/alu.rs: decode src_a_is_temp /
  src_b_is_temp / src_c_is_temp from w0 bits 29/30/31. Note that our
  src_a (low byte of w0) is canary's third operand, hence its selector
  is bit 29 (canary src3_sel), not bit 31.
- crates/xenia-gpu/src/shaders/xenos_interp.wgsl: `read_src` now takes
  the is_temp flag; constants index xenos_consts.alu directly.
- crates/xenia-gpu/src/translator.rs: `src_operand` mirrors the
  interpreter — `r[idx]` when temp, `xenos_consts.alu[idx]` when
  constant.

The trivial-shader synthetic test was updated to set the temp flags so
its `r[0u] = (r[0u] + r[0u])` assertion remains valid; without the
flags set, all sources would now resolve as constants.

Bank-selection (cf-level relative addressing for higher banks of the
512 ALU constants) remains a Phase G+ extension — covers c0..c127
in bank 0, which most Sylpheed shaders use directly.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (unchanged — gated by D2/D3/E for draws)
  draws:                0 → 0
  packets:              ~61M (within noise)
Tests: 552 → 554 (+2 translator tests for the temp/constant decode).

Closes GPUBUG-101 (P0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:10:11 +02:00
MechaCat02
82f3d611e2 fix(gpu,kernel): KRNBUG-Vd-04 / GPUBUG-001 / XMODBUG-013 — VdSwap PM4 ring path
The pre-fix VdSwap zero-filled the guest's reserved buffer with NOPs and
called `state.gpu.notify_xe_swap` directly — bypassing the ring, leaving
the PM4_XE_SWAP handler at gpu_system.rs:1232 dead code, and skipping
the PM4_TYPE0(SHADER_CONSTANT_FETCH_00_0, 6) patch. Sylpheed's bloom/
blur "sample frame N for frame N+1" path samples fetch-constant slot 0
expecting the frontbuffer descriptor; without the patch, slot 0 stayed
stale and any shader sampling it read garbage.

This commit writes the canary VdSwap PM4 sequence directly into the
primary ring at the current write pointer (read via the shared MMIO
atomic), then advances WPTR over the injection. The natural CP drain
consumes PM4_XE_SWAP — bumping `swaps_seen` and patching fetch-constant
slot 0 — without going through any direct kernel→GPU bypass.

Sequence per xenia-canary VdSwap_entry (xboxkrnl_video.cc:438-521):
  1) PM4_TYPE0(0x4800, count=6) + 6 fetch-header dwords (with
     base_address re-patched from virtual to physical >> 12).
  2) PM4_TYPE3(PM4_XE_SWAP, count=4) + signature + frontbuffer_phys
     + width + height.

Mechanism notes:
- buffer_ptr in xenia-rs is in the system command buffer, NOT the
  primary ring (verified empirically: buffer_ptr=0x4acd4df8 vs
  ring_base=0x0accb000, size 4 KB). Canary's VdSwap writes to
  buffer_ptr because its ring layout maps the reserved slot inside
  the ring; xenia-rs's doesn't, so we have to write at the actual
  ring WPTR address (cached on KernelState.ring_base from
  VdInitializeRingBuffer).
- The original "buffer_ptr zero-fill + bump WPTR by 64" path is
  preserved before the injection — it exposes any game-batched PM4
  packets and keeps the buffer_ptr region skippable per existing
  game compat behavior.
- A safety-net fallback at the end calls `notify_xe_swap` directly if
  swaps_seen didn't advance during the drain (e.g. a ring-arithmetic
  edge case). Idempotent — only fires when the PM4 path didn't.
- KRNBUG-Mm-04 deferred: virt→phys uses the masked stub
  `virt & 0x1FFF_FFFF`, sufficient for the standard heap.

Mechanical changes:
- crates/xenia-gpu/src/pm4.rs: add make_packet_type0 / type2 / type3
  helpers + round-trip unit test (mirrors canary xenos.h:1682-1709).
- crates/xenia-gpu/src/handle.rs: add mmio_cp_rb_wptr_load accessor
  (Acquire-load) so the kernel can compute ring offsets.
- crates/xenia-kernel/src/state.rs: cache ring_base / ring_size_dwords
  on KernelState (set by VdInitializeRingBuffer).
- crates/xenia-kernel/src/exports.rs: rewrite the vd_swap PM4-emit
  block; patch fetch_dwords[1] base_address virt→phys before injection.

Verification at -n 100M lockstep:
  swaps:                2 → 2     (game fires VdSwap exactly twice)
  draws:                0 → 0     (gated by Phases D+E)
  fallback warning:     0 occurrences (PM4 path consumed both swaps)
  instructions:         ~100M
Tests: 552 passing (553 with new pm4 round-trip test). Lockstep
stable-fields determinism: byte-identical across two 100M runs.

The "swaps > 2" prediction in the audit's plan assumed the game would
fire VdSwap more often once the path worked; empirically Sylpheed only
calls VdSwap twice within 100M instructions (this is the renderer
plateau the audit identified). The success criterion for Phase C is
that the PM4 path is now operational, which Phases D+E require for
visible draws.

Closes KRNBUG-Vd-04, GPUBUG-001, XMODBUG-013.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:00:23 +02:00
MechaCat02
1f416aaa2e test(check): ORACBUG-004 — sylpheed_n50m stable-digest oracle
Adds a regression-catcher golden for Sylpheed boot at -n 50M lockstep,
covering the first VdSwap pair (the n2m oracle is swap-blind because
the first VdSwap fires at ~18M instructions). The new --stable-digest
flag emits/compares only fields that are deterministic in lockstep:

  instructions, imports, unimpl, draws, swaps,
  unique_render_targets, shader_blobs_live, texture_cache_entries

Excluded:

  packets — empirically ±2-8% lockstep variance (GPU thread race per
    audit M11)
  resolves, interrupts_delivered, interrupts_dropped, texture_decodes —
    scheduling-sensitive under --parallel
  path — cwd-dependent

Empirical determinism: 3 consecutive lockstep -n 50M runs produce
byte-identical stable-digest output.

The n4b canonical-invocation golden the audit's recommended next sprint
also called for is deferred. Per audit memory `--parallel
--reservations-table` is pathologically slow (>32 min for -n 100M), so
-n 4B in that mode would be many hours per run, not the 5-15 min the
plan estimated. n4b will be captured one-shot post-renderer-unblock as
a manual artifact under audit-runs/post-fix/, not as a test golden. See
crates/xenia-app/tests/golden/README.md.

Test infrastructure:
- crates/xenia-app/tests/sylpheed_oracles.rs — invokes
  CARGO_BIN_EXE_xenia-rs against the ISO. Path resolved via SYLPHEED_ISO
  env var (skips gracefully if missing).
- #[ignore]-gated; run via:
    cargo test --release -p xenia-app --test sylpheed_oracles \\
      -- --ignored --nocapture

Closes ORACBUG-004 (P0). Partial: ORACBUG-006 (P1 deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:46:02 +02:00
MechaCat02
9ab986ec09 fix(cpu): SWAPBUG-001 — revert addi 32-bit truncation
The addi opcode was truncating its result to 32 bits per the post-P4-batch3
"32-bit ABI" rationale (commit bf8208e). Hunk-level bisection during the
2026-05 audit (M11) isolated this single cast as the cause of the
post-P8 swap regression: swaps dropped 2 → 1 and the renderer lost a
frame. PowerISA mandates sign-extension to 64 bits; canary does not
truncate addi. The truncation was a canary-divergent over-extension
of the addis fix (which IS canary-divergent by design, see
addis at interpreter.rs:121-134).

The addi_li_neg_one_zero_extends_upper test encoded the wrong invariant.
Replaced with a sign-extension test asserting canonical PowerISA
behavior (gpr[3] == 0xFFFF_FFFF_FFFF_FFFF for `li r3, -1`).

Verification at -n 100M lockstep:
  swaps:                1 → 2     (gate met)
  draws:                0 → 0     (unchanged — gated by Phase C+D+E)
  instructions:         ~100M (unchanged)
  imports:              11.4M → 987k    (game escapes retry loop)
  packets:              281M → 57M      (same)
  interrupts_delivered: 629 → 630
Tests: 551 passing (unchanged). Lockstep determinism: byte-identical
across two 100M runs except packets (±5%, GPU-thread-race noise floor).

Closes SWAPBUG-001 / PPCBUG-001.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:37:51 +02:00
MechaCat02
09c6c927bd refactor(cpu): fpscr round_single_toward_zero — collapse duplicate-branch ULP step
Post-P8 review nit: the if/else branches were identical
(`adj_bits - 1` either way). Both positive and negative finite f32
values use the IEEE-754 sign bit as the MSB, and subtracting 1 from
`to_bits()` always reduces magnitude by one ULP. Replace the
mock-conditional with the unconditional form + a comment explaining
why one operation works for both signs.

No behavior change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:45:55 +02:00
MechaCat02
f1166d0f75 fix(cpu): revert PPCBUG-105 — lwa/lwax/lwaux sign-extend per PowerISA
Post-P8 end-to-end review caught an ISA deviation introduced by P4
batch 5. The original code used `as i32 as i64 as u64` (correct
PowerISA sign-extension; canary's `SignExtend(INT64_TYPE)`). My P4
batch 5 commit (20a730d) changed all three to `as u64` (zero-extend),
citing the audit's "32-bit-ABI hazard" note for PPCBUG-105.

This deviation is wrong per PowerISA and any 64-bit-mode kernel code
that uses `lwa rT, off(rA)` will silently produce the wrong rT for
negative words (e.g. memory 0x80000000 should yield 0xFFFFFFFF_80000000
but was yielding 0x00000000_80000000).

Restore ISA-spec sign-extension for all three forms (lwa, lwax, lwaux).
The audit's 32-bit-ABI hazard concern was speculative — there's no
evidence that Xbox 360 user code emits `lwa` (compilers use `lwz`).
If a real bug surfaces from a 32-bit-ABI consumer that feeds an
`lwa`-loaded value into a u64 unsigned compare, that's a separate
issue to debug at the consumer site.

Test renamed: lwa_high_bit_set_zero_extends_upper → lwa_sign_extends_to_i64
with assertion flipped to expect 0xFFFFFFFF_80000000.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:43:47 +02:00
MechaCat02
1f9696ad47 test(cpu): rename vmsum3fp_… to vmaddfp_lane_fma per reviewer nit
P8 review feedback (non-blocking): the test fn name said vmsum3fp but
the encoding/body actually tests vmaddfp. Rename + clarify comment;
no behavior change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:22:39 +02:00
MechaCat02
261480616c test(cpu): PPCBUG-240/277/278/316/321/370/490/517 P8 batch 4 — VMX integer/permute/load-store
Phase 8 batch 4 — VMX integer + permute/pack + multiply-sum + load/store.

12 new tests:
- VMX add/sub (240): vaddubm byte add, vsubuwm word sub.
- VMX compare (277): vcmpequb lane mask.
- VMX min/max (278): vmaxsw signed lane max.
- VMX shift/rotate (316): vsl 128-bit left shift, vsraw arithmetic per-lane.
- VMX logical (321): vand lane-wise AND.
- VMX permute (370): vsldoi byte concatenation + shift.
- VMX multiply-sum (490): vmaddfp lane FMA.
- VMX load/store (517): lvx aligned quadword load, stvx aligned store,
  lvebx byte-lane load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:16:51 +02:00
MechaCat02
ebfd18a64e test(cpu): PPCBUG-187/208/228/438/439/440 P8 batch 3 — FPU + VMX float
Phase 8 batch 3 — FPU and VMX float test gap closure.

14 new tests:
- Single FPU (187): fadds, fmuls
- Double FPU (208): fmul, fdiv (zero-numerator), fneg, fabs, fmr
- FPU convert/compare (228): fcmpu, fcfid
- VMX float compare (438): vcmpeqfp lane mask
- VMX rounding (439): vrfip, vrfim, vrfiz
- VMX convert (440): vctsxs saturation to INT_MAX/INT_MIN

The VMX VX-form encoding nit (XO is 11 bits at PPC 21-31, host bits 10-0,
with bit 0 the LSB — not bit 1) was caught by initial test failures and
fixed before commit. VC-form (vcmpeqfp) has the same "XO at bit 0" layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:14:10 +02:00
MechaCat02
2d223eee69 test(cpu): PPCBUG-091/100/109-111/118/127/129/132/146-147/153/163/171 P8 batch 2 — load/store
Phase 8 batch 2 — load/store test gap closure.

15 new tests across the load/store opcodes:
- lbz zero-extend (091), lwbrx byte-swap (109/110), lwarx smoke (111),
  ld doubleword (118), lmw + lswi (127), lswx with XER TBC (127),
  lfs single-to-double widening (129).
- stb (132), sth, stw (146), std (153), stmw + stswx (163), stfs (171).

`lswx_uses_xer_tbc_for_byte_count` and `stswx_uses_xer_tbc_for_byte_count`
specifically lock in the new XER TBC infrastructure landed in P6 (68c0ee5);
both opcodes were permanent no-ops before that.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:10:26 +02:00
MechaCat02
9827b03f1a test(cpu): PPCBUG-055/067/070/081-085/089 P8 batch 1 — branch/CR/SPR/sync
Phase 8 batch 1 — test gap closure for the branch/CR-logical/SPR/MSR/
FPSCR/cache+sync groups.

12 new tests across the affected groups:
- PPCBUG-055 branch: blr, bctr, bcl-LK-on-not-taken
- PPCBUG-070 CR logical: cror, crand, crxor (crclr idiom)
- PPCBUG-067 trap+sc: sc smoke, tw TO=0 never-traps
- PPCBUG-081-085 SPR/MSR/FPSCR moves: mfcr 8-field assembly, mtfsb1/mtfsb0
- PPCBUG-089 cache+sync: sync state-non-mutation smoke

These groups previously had near-zero unit test coverage. New tests lock
in the current ISA-correct behavior; would catch a regression in any of
the dispatch/encoding/result paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 14:08:54 +02:00
MechaCat02
5ece5e315f refactor(cpu): mcrfs uses fpscr::VX_ALL constant per reviewer nit
P6 review nit: replace the inline `const VX_ALL_MASK` in the mcrfs arm
with the existing `fpscr::VX_ALL` constant (single source of truth).
Behaviorally identical.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:56:34 +02:00
MechaCat02
99e7814836 test(cpu): PPCBUG-022 verify mulld_ov INT_MIN*-1 + auto-resolved markers
Phase 6 batch 4 — overflow/cleanup verification.

- PPCBUG-022 mulld_ov INT_MIN * -1: the audit-claimed missing edge case
  is actually handled by `i64::checked_mul()` (returns None when the
  result would be -i64::MIN = i64::MAX+1, which doesn't fit). New
  regression tests in overflow.rs confirm: INT_MIN * -1 overflows;
  INT_MIN * 1 doesn't; (INT_MIN+1) * -1 = INT_MAX, no overflow.
  Audit's claim was incorrect; documented by the new tests.
- PPCBUG-021 (overflow.rs OE checks at bit 63): largely auto-resolved
  by P4 batch 6 (16993bb), which switched all 32-bit ABI ops to inline
  `true_sum != (result32 as i32) as i128`. Helpers like add_ov_64 are
  now only called from 64-bit ABI ops where bit-63 is correct.
- PPCBUG-027 (rlwimix upper-32 zeros): auto-resolved by P4 (rlwimix
  now writes via `as u32 as u64` truncation).
- PPCBUG-039 (cntlzdx 32-bit-ABI): wontfix per audit — only matters
  if a 32-bit-ABI binary emits cntlzd, which compilers don't.

Remaining low-impact items (PPCBUG-642 ISA-undefined fmt_bcctr decr,
PPCBUG-643/644 SIMM/D-form hex display, PPCBUG-367/368 vupkhpx/vpkpx
channel ordering, PPCBUG-487/495 vsum operand naming, PPCBUG-515/516
lvebx/lvsr documentation, PPCBUG-601 decode_op6 invariant doc) are
left for a P9 or follow-up batch — they're cosmetic/test-coverage
items rather than correctness bugs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:51:43 +02:00
MechaCat02
0f2a26c460 fix(cpu): PPCBUG-068/078/080 mcrfs VX recompute + mtmsrd L=1 + mfvscr zero
Phase 6 batch 3 — SPR/MSR/VSCR semantics.

- PPCBUG-078 mtmsrd L=1: PowerISA requires partial-MSR-write — only
  MSR[EE] (u64 bit 15) and MSR[RI] (u64 bit 0) modified, all other
  MSR bits preserved. Used by kernel code to toggle external interrupts.
  Previously merged with mtmsr (full overwrite), silently corrupting
  MSR for any L=1 caller.
- PPCBUG-080 mfvscr: ISA places VSCR in the rightmost word of VD with
  bytes 0-11 zeroed. Previously copied the full 128-bit ctx.vscr,
  leaking stale upper data to guest. Now zero-extends per canary.
- PPCBUG-068 mcrfs VX summary: when mcrfs clears VX* exception bits,
  the VX summary bit at FPSCR[2] must be recomputed (clears if all
  contributors are 0; remains 1 otherwise). Previously left stale,
  causing subsequent CR-test sequences to misread the FPU state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:50:10 +02:00
MechaCat02
68c0ee55ce fix(cpu): PPCBUG-123/124/125/126/161/162/566 XER TBC + lswi/stswi/lmw
Phase 6 batch 2 — XER TBC enabling + load/store-multiple cleanups.

- PPCBUG-123/124/161/566 (coupled): XER TBC field was unmodelled —
  `ctx.xer()` always returned 0 in bits 0-6, and `ctx.set_xer()`
  silently discarded any TBC writes. Result: `lswx` and `stswx` were
  permanent no-ops (the `while bytes_left > 0` loop never executed).
  Fix: add `pub xer_tbc: u8` to `PpcContext`; wire into `xer()` and
  `set_xer()`. Initialize to 0 in `PpcContext::new()`. lswx/stswx
  bodies are correct as-is once the infrastructure is wired.

- PPCBUG-125 lmw: PowerISA marks `lmw rT, D(rA)` invalid when rA is
  in [rT..31]; canary skips the write to rA to preserve the EA base.
  Now matches canary.

- PPCBUG-126/162 lswi/stswi: replaced `instr.rb()` with `instr.nb()`
  for the NB field. Both accessors return identical values today
  (bits 16-20), but the maintenance hazard from the misnomer is now
  removed. A future `rb()` type-system refactor would have broken
  lswi/stswi silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:48:03 +02:00
MechaCat02
d96986a10e fix(cpu): PPCBUG-063/064/065 trap PC + sc LEV + twi typed-trap logging
Phase 6 batch 1 — trap/sc semantics.

- PPCBUG-063 trap PC: previously ctx.pc was incremented to CIA+4 BEFORE
  StepResult::Trap returned, forcing handlers to .wrapping_sub(4) to
  recover the faulting instruction address. Now ctx.pc stays at CIA on
  trap, matching SRR0 semantics on real hardware. Critical for any
  future SEH/exception-delivery path (e.g. the Sylpheed C++ throw work).
- PPCBUG-065 typed-trap logging: `twi 31, r0, IMM` is the Xbox 360
  CRT/kernel typed-trap convention encoding C++ exception class via
  SIMM. The trace now logs the SIMM type code when this pattern fires.
  Routing the type code via a StepResult payload requires an enum
  extension (multiple consumer sites) that's deferred.
- PPCBUG-064 sc LEV logging: `sc 2` is the Xbox 360 hypervisor-call
  convention; canary dispatches it to a different handler than `sc 0`.
  Now logs a warning when LEV != 0. Routing LEV=2 to a HypervisorCall
  variant also requires a StepResult enum extension; deferred.

The two enum-extension follow-ups can land as a structural sub-batch
once a clear consumer (SEH dispatch, hypervisor-call HLE) is in place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:42:50 +02:00
MechaCat02
05f2f72c71 refactor(cpu): vrfin uses stdlib f32::round_ties_even() per reviewer nit
P5 review feedback (non-blocking): replace the inline round-to-even
implementation with the stable stdlib intrinsic (Rust 1.77+).
Functionally equivalent; cleaner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:37:54 +02:00
MechaCat02
6fe2cbf251 fix(cpu): PPCBUG-426/427/433 single-FMA vnmsubfp + vctsxs NaN saturation
Phase 5 batch 6 (5f): saturation and FMA-rounding fixes.

- PPCBUG-426 vnmsubfp: was `bi - ai * ci` (two rounding steps); now
  `-ai.mul_add(ci, -bi)` which is mathematically equivalent (= bi - ai*ci)
  but uses a single FMA round per ISA.
- PPCBUG-427 vnmsubfp128: same single-FMA fix.
- PPCBUG-433 vctsxs / vcfpsxws128 NaN saturation: AltiVec ISA saturates
  NaN to INT_MIN (0x80000000); xenia returned 0. The vctuxs (unsigned)
  NaN→0 is correct per ISA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:31:10 +02:00
MechaCat02
6ba8f83c30 fix(cpu): PPCBUG-184 fresx pre-quantize input to f32 (canary parity)
Phase 5 batch 5 (5e): minimal-viable fix for the estimate-precision
family. Hardware Xenon `fres` produces a ~12-bit LUT estimate; xenia
and canary both produce a fully IEEE single reciprocal, but canary
pre-quantizes the f64 input to f32 to at least match the input
precision. Now matches canary.

PPCBUG-428..431 (vrefp/vrsqrtefp/vexptefp/vlogefp) already operate on
f32 inputs naturally (no f64 → f32 quantization step needed); the
estimate-precision deviation is purely the output side. Newton-Raphson
convergence is unaffected. Documented in audit-findings.md as
LOW-impact full-fix-requires-LUT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:29:07 +02:00
MechaCat02
538fa5ab74 fix(cpu): PPCBUG-435/436/437 VSCR.NJ subnormal flush for VMX float
Phase 5 batch 4 (5d) — partial: VSCR.NJ subnormal flush for VMX float
arithmetic. Xbox 360 always boots with NJ=1, so games expect inputs
and outputs flushed to ±0.

- PPCBUG-435 vaddfp/vaddfp128/vsubfp/vsubfp128/vmulfp128: previously
  no flush at all on these opcodes (only vmaddfp family flushed).
  Now flushes both inputs and output per Canary's unconditional model.
- PPCBUG-436 vmsum3fp128/vmsum4fp128: per-product intermediates now
  flushed individually (was only the final sum).
- PPCBUG-437 vmaddfp/vmaddfp128/vmaddcfp128/vnmsubfp/vnmsubfp128:
  outputs now flushed (inputs were already flushed).

PPCBUG-185 (FPSCR.NI flush for scalar FPU) deferred — requires adding
a NI bit constant and post-op flush wrapper across all *sx arms; will
land in a focused sub-batch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:26:36 +02:00
MechaCat02
49bf74fae6 fix(cpu): PPCBUG-223/224/225/229/230 FPU XX bit on inexact conversions
Phase 5 batch 3 (5c) — partial: targeted XX-on-inexact fixes for the
float-to-int and double-to-single conversion family. (PPCBUG-180/200,
the broader update_after_op XX/FR/FI rework, deferred to a focused
sub-batch.)

- PPCBUG-225 frspx: set XX when the f64→f32 round produces a different
  value (i.e. precision loss). Almost every frsp call is inexact —
  previously games polling FPSCR.XX never saw the set bit after a frsp.
- PPCBUG-224 fcfidx: set XX when the i64 input has > 53 significant
  bits (precision lost in conversion to f64).
- PPCBUG-229 fctidx/fctidzx: set XX when input is non-integer (fractional
  part discarded by the conversion).
- PPCBUG-230 fctiwx/fctiwzx: same shape for word-width conversions.
- PPCBUG-223 verified already correct in current code (fcmpo sets
  VXSNAN/VXVC on NaN operands; the audit-cited drift was already fixed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:22:47 +02:00
MechaCat02
26b98975c3 fix(cpu): PPCBUG-181/182/183/202/203/205 FMA VXISI + NaN sign preservation
Phase 5 batch 2 (5b): VXISI / NaN handling for the FMA family.

The 8 FMA opcodes (fmaddx/fmaddsx/fmsubx/fmsubsx/fnmaddx/fnmaddsx/fnmsubx/
fnmsubsx) all share two fix shapes:

1. VXISI on the add/sub step. The previous code passed `a*c` to
   check_invalid_add, which has separate rounding from the FMA. In
   extreme cases this gives the wrong sign (PPCBUG-202) or wrong infinity
   status. Worse, fmsub/fnmadd/fnmsub had NO add-step VXISI check at all
   (PPCBUG-181/182/203). The fnmsub pattern is the canonical Newton-
   Raphson step — the most common FPU path in Xbox 360 graphics code.

2. NaN sign preservation in fnmadd/fnmsub. ISA Book I §4.3.4 forbids
   negation of a NaN FMA result; Rust's unary `-` flips the IEEE-754
   sign bit (PPCBUG-183/205).

Fixes:
- fpscr.rs: new helper `check_invalid_fma_add(ctx, a, c, b, sub)` that
  derives VXISI from input properties (mathematical-product sign +
  b sign) instead of from the lossy `a*c` value. Also covers SNaN.
- interpreter.rs: all 8 FMA arms now use the new helper; fnmadd[s]/
  fnmsub[s] gate the negation on `!fma.is_nan()`.

Tests:
- fmsub_inf_minus_inf_sets_vxisi: regression for PPCBUG-203.
- fnmadd_nan_input_preserves_nan_sign: regression for PPCBUG-205.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:20:02 +02:00
MechaCat02
f6a444b9d1 fix(cpu): PPCBUG-221+227 round_to_i64 + PPCBUG-432 vrfin round-to-even
Phase 5 batch 1 (5a): round-to-int correctness.

PPCBUG-221+227 (coupled): round_to_i64 NearestEven tie-breaking used
`(diff - 0.5).abs() < f64::EPSILON` to detect half-integers, but for
|v| > 2^52 every f64 value is an exact integer (v.trunc() == v), giving
diff == 0. The buggy check fell through to v.round() (round-half-away-
from-zero), giving wrong results for large odd half-integers. Replaced
with a fractional-part-only check that's exact for |v| <= 2^52 and
degenerates to truncation above.

PPCBUG-432: vrfin/vrfin128 used Rust's `f32::round()` which is round-
half-away-from-zero. ISA requires round-to-nearest-even (banker's
rounding). Implemented inline.

PPCBUG-201 (FPSCR.RN for double arithmetic) deferred — requires
MXCSR-set/restore wrappers around 10+ FPU arms; will land in a focused
sub-batch after the remaining 5a-5f fixes.

Tests:
- round_to_i64_nearest_even_on_tie: extended with 0.5, 1.5, -0.5, -1.5.
- round_to_i64_non_tie_cases: 0.4/0.6 (non-tie sanity).
- round_to_i32_nearest_even_on_tie: PPCBUG-227 coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:13:08 +02:00
MechaCat02
49103bb898 fix(cpu): P4 review-fix — subfx/subfcx OE predicate + mulli test rigor
Independent reviewer of the P4 branch found two issues:

(1) BLOCKING — subfx and subfcx OE handlers still called the legacy
`overflow::sum_overflow_64(true_diff, result32 as u64)` while batch 6
had migrated all add* sites to the inline `true_sum != (result32 as i32)
as i128` form. The legacy helper compares `true_diff` against
`(result32 as u64) as i64 as i128`, which views any bit-31-set result
as a positive i64 (e.g. result=0x80000000 → +2147483648 in i64). For a
legitimate i32::MIN result with no actual 32-bit overflow, this caused
spurious OV=1.

Concrete repro now caught by `subfo_no_spurious_ov_when_result_has_bit31_set`:
r3=1, r4=0x80000001 → result=0x80000000, true_diff=-2147483648, no OV.
Pre-fix: spurious OV=1.

(2) Minor — `mulli_overflow_wraps_to_32` rubber-stamped: with ra=0x80000000
and imm=2, both pre-fix (`as i64 as u64`) and post-fix (`as u32 as u64`)
write the same value. Replaced with ra=u64::MAX (polluted upper bits) where
pre-fix writes 0xFFFFFFFF_FFFFFFFE and post-fix writes 0x00000000_FFFFFFFE.

Fixes:
- interpreter.rs subfx/subfcx OE: switch to inline 32-bit predicate
  matching the rest of batch 6.
- subfo_sets_xer_ov_on_min_minus_one: renamed and updated to test 32-bit
  overflow (r4=0x80000000 - 1 = 0x7FFFFFFF, OV=1).
- New: subfo_no_spurious_ov_when_result_has_bit31_set (PPCBUG-017
  review-fix regression).
- New: subfco_no_spurious_ov_when_result_has_bit31_set (same for PPCBUG-007).
- mulli_overflow_wraps_to_32: redesigned with polluted upper bits to
  actually discriminate pre/post fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:07:32 +02:00
MechaCat02
16993bb8af fix(cpu): PPCBUG-012-017/020/023-026/032/044 4c+4d latent + CR0 catch-all
Phase 4 batch 6: latent writeback truncation (4c) and CR0 catch-all (4d).
~13 PPCBUGs across all remaining 32-bit ABI ALU sites.

Latent writeback (4c) — the 4a/4b fixes already eliminate the upstream
poisoning, but a defensive truncation here catches any future regression:
- PPCBUG-012 addx, PPCBUG-013 addcx, PPCBUG-014 addex, PPCBUG-015 addzex,
  PPCBUG-016 addmex, PPCBUG-017 subfx — all rewritten to compute on u32
  operands and write `as u64`. CA computed via 32-bit unsigned compare.
  Overflow now uses `true_sum != (result32 as i32) as i128` (32-bit
  predicate, since sum_overflow_64 is i64-bounded).
- PPCBUG-032 andx/orx/xorx — CR0 catch-all only (results inherit upper
  bits from operands; once those are clean, no truncation needed).

CR0 catch-all (4d) — fix the `update_cr_signed(0, X as i64)` pattern at
every 32-bit-ABI Rc=1 path:
- PPCBUG-020 catch-all: applied to mulhwx, mulhwux, divwux, mullwx (was
  already done in batch 4), addx/addcx/addex/addzex/addmex/subfx (now in
  4c above), andx/orx/xorx, andix, andisx, slwx, srwx, cntlzwx,
  rlwinmx, rlwimix, rlwnmx, mullwx (already), divwx (already),
  srawx/srawix (already in batch 4).
- PPCBUG-023 andisx: now correctly classifies bit-31 results as CR0.LT.
- PPCBUG-024 rlwinmx, PPCBUG-025 rlwimix, PPCBUG-026 rlwnmx.
- PPCBUG-044 slwx/srwx: bit-31 result like 0x80000000 now CR0.LT.

64-bit ABI ops (rldicl/rldicr/rldic/rldimi/rldcl/rldcr, sldx/srdx/sradx/
sradix, mulhdx/mulhdux/mulldx, divdx/divdux, cntlzdx) intentionally retain
the 64-bit `as i64` form per ISA — these are 64-bit-mode instructions.

Updated old tests:
- addo_sets_xer_ov_on_signed_overflow_and_stickies_so: i32::MAX + 1 → INT_MIN.
- addx_rc_uses_64bit_compare_not_32bit: renamed to ..._uses_32bit_compare_in_xbox_abi
  with assertions flipped to the correct 32-bit ABI behavior.

New tests:
- andisx_sign_bit_set_classifies_lt (PPCBUG-023).
- slwx_high_bit_result_classifies_lt (PPCBUG-044).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:55:50 +02:00
MechaCat02
20a730d69e fix(cpu): PPCBUG-095/096/097/098/105 halfword + lwa load truncation
Phase 4 batch 5: 5 PPCBUGs in the load family. lha/lhax/lhau/lhaux
sign-extended halfword results to u64 (active poisoning for negative
halfwords); lwa/lwax/lwaux sign-extended u32 results.

- PPCBUG-095/096/097/098 lha[ux]: `as i16 as i64 as u64` →
  `as i16 as i32 as u32 as u64`. Sign-extend to i32 then zero-extend.
  Common trigger: int16_t struct fields, PCM samples, packed vertex
  deltas. Memory 0x8000 was producing 0xFFFFFFFF_FFFF8000.
- PPCBUG-105 lwa/lwax/lwaux: `as i32 as i64 as u64` → `as u64`.
  Per-canary the 64-bit-mode form sign-extends, but in 32-bit ABI we
  must zero-extend (canary's behavior is rescued by x86 register
  zeroing in JIT; pure interpreter has no escape). Memory 0x80000000
  was producing 0xFFFFFFFF_80000000.

Tests:
- lha_negative_halfword_zero_extends_upper (PPCBUG-095).
- lhaux_negative_halfword_clean_writeback (PPCBUG-098 + EA update).
- lwa_high_bit_set_zero_extends_upper (PPCBUG-105).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:47:24 +02:00
MechaCat02
82a9bff934 fix(cpu): PPCBUG-009/010+011/041+042+043 mul/div + srawx truncation
Phase 4 batch 4: mulwx, divwx (coupled +CR0), srawx/srawix (coupled +CR0).

- PPCBUG-009 mullwx: 32-bit ABI. Product truncated to u32 before write.
  OE handler still uses full i64 product to detect overflow.
- PPCBUG-010+011 divwx (coupled): quotient zero-extended (canary uses
  ZeroExtend(v, INT64_TYPE)). CR0 view via i32 — without this, a negative
  i32 quotient (e.g. -3 from -10/3) would be classified as positive in
  i64 view of the now-zero-extended writeback.
- PPCBUG-041+042+043 srawx/srawix (coupled): writeback uses `as u32 as u64`
  (was `as i64 as u64`). All-ones case (sh>=32 with negative input) writes
  0x00000000_FFFFFFFF instead of u64::MAX. CR0 view via i32. CA logic
  preserved unchanged (audit-verified independently correct).

Tests:
- mullwx_overflow_truncates_to_32 (PPCBUG-009).
- divwx_negative_quotient_zero_extends (PPCBUG-010+011).
- srawx_negative_value_zero_extends_upper (PPCBUG-041+043).
- srawix_high_count_negative_input_yields_low32_all_ones (PPCBUG-042+043).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:44:34 +02:00
MechaCat02
bf8208e88c fix(cpu): PPCBUG-001/002/003/004/005/007 4b immediate ALU truncation
Phase 4 batch 3: 6 PPCBUGs in the same-shape-as-addis (4b) sub-section.
All share the pattern of computing on 64-bit values when the 32-bit ABI
requires u32 arithmetic.

- PPCBUG-001 addi: `li rT, -1` produced 0xFFFFFFFF_FFFFFFFF; now 0x00000000_FFFFFFFF.
- PPCBUG-002 addic: writeback truncated + CA from u32 unsigned compare
  matching canary's `AddDidCarry`.
- PPCBUG-003 addicx: same plus CR0 i32 view (regression vs. the frozen
  ppc-manual snapshot which had the correct form).
- PPCBUG-004 mulli: 64-bit signed product now truncated to 32 bits.
- PPCBUG-005 subficx: writeback + CA in u32 space; removes the bits-32-63
  pollution from sign-extended negative SIMM.
- PPCBUG-007 subfcx: defensive 32-bit truncation of CA compare. Same shape
  as the compare that broke addis (0x828F3F98 / 0x828F3F68 case).

Tests:
- addi_li_neg_one_zero_extends_upper (PPCBUG-001).
- addic_carry_uses_32bit_compare (PPCBUG-002).
- mulli_overflow_wraps_to_32 (PPCBUG-004).
- subficx_neg_simm_zero_extends (PPCBUG-005).
- subfcx_addis_incident_case (PPCBUG-007 — exact addis-incident case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:41:49 +02:00
MechaCat02
145a7a4019 fix(cpu): PPCBUG-034+035+036+037 extsbx/extshx writeback + CR0 (coupled)
Phase 4 batch 2: extsbx and extshx writeback truncation + CR0 view fix.
Coupled per audit — must land together because the writeback fix would
silently break CR0 sign classification if the CR0 fix didn't ship in
the same commit.

Before:
- extsbx: `as i8 as i64 as u64` — every negative byte poisoned upper
  32 bits (active poisoning, not latent). 0x80 → 0xFFFFFFFF_FFFFFF80.
- extshx: same shape for halfwords.
- CR0: `as i64` view — accidentally correct on the buggy 64-bit form
  because the high bits matched the byte's sign bit.

After:
- extsbx: `as i8 as i32 as u32 as u64` — sign-extend to i32 then
  zero-extend to u64. 0x80 → 0x00000000_FFFFFF80.
- extshx: same for halfwords.
- CR0: `as u32 as i32 as i64` — i32 view, so a result with bit 31 set
  is correctly classified as negative under the 32-bit ABI.

Tests:
- extsbx_negative_byte_zero_extends_upper: 0x80 input → 0x00000000_FFFFFF80
  with CR0.LT set.
- extshx_negative_halfword_zero_extends_upper: same shape for 0x8000.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:38:22 +02:00
MechaCat02
e18a0a40b8 fix(cpu): PPCBUG-006/008/018/019/028/029/030/031/033 4a active poisoning
Phase 4 batch 1: 9 PPCBUGs in the active-poisoning sub-section. All
follow the pattern `!val` on u64, which unconditionally flips the upper
32 bits and poisons the GPR even with clean inputs — every execution
corrupts the high 32 bits regardless of upstream state.

Sub/neg family:
- PPCBUG-006 negx: `(!ra).wrapping_add(1)` on u64 + neg_ov_64 checks
  64-bit INT_MIN. Fix: do arithmetic in u32, OE checks PPC[ra32==0x80000000].
- PPCBUG-008 subfex: same shape as above plus 64-bit unsigned CA compare.
  Fix: cast all operands to u32, compute, write `as u64`.
- PPCBUG-018 subfzex: `!ra` on u64. Fix: u32 arithmetic.
- PPCBUG-019 subfmex: `!ra` on u64 + always-true CA edge (`!ra != 0`
  was always true for clean ra<0xFFFFFFFF because high bits of !u64
  are non-zero). Fix: u32 arithmetic; CA predicate now correct.

Logical NOT family:
- PPCBUG-028 orcx: rs | !rb on u64 → high-bit poison.
- PPCBUG-029 norx: !(rs|rb) — the `not` simplified mnemonic. Hot path,
  every `not` corrupted GPR upper 32 bits.
- PPCBUG-030 nandx: !(rs&rb).
- PPCBUG-031 eqvx: !(rs^rb). The common `eqv rA,rA,rA` set-to-all-ones
  idiom now produces 0x00000000_FFFFFFFF instead of 0xFFFFFFFF_FFFFFFFF.
- PPCBUG-033 andcx: rs & !rb.

CR0 update at every Rc=1 path now uses `as u32 as i32 as i64` so a result
with bit 31 set gets classified as negative under the 32-bit ABI (was
positive before because upper bits were ones; will be positive in new
truncated form unless we cast through i32). This pre-emptively addresses
PPCBUG-020 for these specific opcodes; the catch-all sweep in batch 6
covers the remaining sites.

Tests:
- nego_sets_ov_only_on_int_min: updated from i64::MIN → 0x80000000 (32-bit).
- test_subfze_carry_only_when_ra_zero_and_ca_one: result expectations
  updated from u64::MAX → 0xFFFFFFFF (low 32 bits, upper 32 zero).
- New: neg_clean_input_no_upper_bits (PPCBUG-006 regression).
- New: norx_not_simplified_keeps_upper_bits_clean (PPCBUG-029 regression).
- New: eqvx_self_self_self_sets_low32_to_all_ones (PPCBUG-031 regression).
- New: andcx_bit_clear_keeps_upper_clean (PPCBUG-033 regression).
- New: subfex_clean_inputs_no_upper_bits (PPCBUG-008 regression).
- New: subfmex_ra_max_ca_zero_clears_ca (PPCBUG-019 always-true CA fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:35:05 +02:00
MechaCat02
7609dcd406 fix(cpu): PPCBUG-700 VMX128 register accessors match canary bitfield layout
Independent review of P3 batch 2 (52ece4b) found that all three VMX128
register accessors disagreed with canary's FormatVX128/VX128_R bitfield
struct (`xenia-canary/src/xenia/cpu/ppc/ppc_decode_data.h:484-663`). The
audit at line 2958 had marked these "confirmed-clean" but had miscounted
LSB-first bitfield offsets.

Canary's actual layout (LSB-first, GCC/Clang/MSVC on x86):
  VA128 = VA128l(5) | VA128h(1)<<5 | VA128H(1)<<6
        = PPC[11:15] | PPC[26]<<5 | PPC[21]<<6  (7-bit selector, 3 fields)
  VB128 = VB128l(5) | VB128h(2)<<5
        = PPC[16:20] | PPC[30:31]<<5            (7-bit selector, 2 fields)
  VD128 = VD128l(5) | VD128h(2)<<5
        = PPC[6:10]  | PPC[28:29]<<5            (7-bit selector, 2 fields)
  VX128_R Rc = PPC[25]  (host bit 6)             not PPC[27] as prior fix had

The buggy convention was internally consistent with hand-crafted test
fixtures (which set bits 29/21/22 to encode the high registers, matching
the buggy accessor). Real Xbox 360 game code follows canary's convention,
so any production VMX128 instruction with VR >= 32 was silently mis-decoded
— but no unit test exercised that path until the va128 fix in 52ece4b
exposed the inconsistency.

Changes:
- decoder.rs: rewrite va128/vb128/vd128/vx128r_rc_bit to canary positions.
  Drop the speculative `key4_dt` dot-form dispatch in decode_op6 — canary
  has no separate dot-form opcodes for VX128_R compute ops; Rc is a
  runtime modifier read by the interpreter via vx128r_rc_bit().
- decoder.rs tests: rewrite vmx128_test_word helper for canary layout;
  rename/re-encode vmx128_vd128_*, vmx128_va128_*, vmx128_vb128_* tests.
- interpreter.rs: update encode_vpkd3d128 test helper to encode VD via
  canary's VD128h field; tests now pass vd=96 explicitly.
- tests/disasm_goldens.rs: replace the vrlimi128/vsrw128/vpermwi128/
  vperm128 hand-encoded raws with canary-compliant encodings; introduce
  a shared `encode_vx128` helper.
- tests/golden/vmx128_registers.json: re-encode 9 entries (vperm128,
  vsrw128 ×2, vpermwi128, vrlimi128 ×2, vmaddfp128, vmaddcfp128,
  vnmsubfp128) to canary-compliant raws preserving the same expected
  operand strings.
- audit-findings.md: new PPCBUG-700 entry documenting the discovery and
  invalidating the audit's "confirmed-clean" assessment.

Affects all VMX128 binary ops (vaddfp128, vsubfp128, vmulfp128, vand128,
vor128, vxor128, vnor128, vandc128, vsel128, vslo128, vsro128, vperm128,
vsrw128, vmaddfp128, vmaddcfp128, vnmsubfp128, vpkd3d128, vpkshss128,
vpkshus128, vpkswss128, vpkswus128, vpkuhum128, vpkuhus128, vpkuwum128,
vpkuwus128, vmsum3fp128, vmsum4fp128, vrlimi128, vpermwi128 — 30+
opcodes), plus VX128_R compare dot-forms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 11:22:20 +02:00
MechaCat02
2be25bdd41 fix(disasm): PPCBUG-641+649 sync/lwsync L-field discrimination
PPCBUG-641: PpcOpcode::sync emitted "sync" regardless of the L-field at
PPC bit 10. The Xbox 360 acquire barrier (encoding 0x7C2004AC, L=1) is
lwsync, used in every spinlock. The disassembly DB stored every lwsync
as `mnemonic='sync'`, so `SELECT WHERE mnemonic='lwsync'` returned zero
rows regardless of binary content.

PPCBUG-649 (companion): the golden fixture for lwsync had no ext_mnemonic
field, pinning the wrong output and defeating regression detection.

Fix: in disasm.rs, gate on `(instr.raw >> 21) & 1` (PPC bit 10) — when
set, emit the lwsync extended form. Update extended_mnemonics.json
fixture to expect `ext_mnemonic: "lwsync"`.

Note: this is the disassembler-side fix only. The interpreter-side
PPCBUG-088 (lwsync vs sync semantics) is separate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:43:24 +02:00
MechaCat02
d4f6ea787b fix(disasm): PPCBUG-640+650 fmt_bc spurious condition suffix on bdnz/bdz
PPCBUG-640: For BO=16 (bdnz: decrement CTR, branch if non-zero, ignore CR)
and BO=18 (bdz: same with branch-if-zero), `fmt_bc` fell through to the
`if decr` block and computed `cond_name_opt` from the don't-care BI=0 /
cond_true=false pair, yielding `Some("ge")`. The output was therefore
`bdnzge` / `bdzge` — a CTR-only branch with a spurious CR-derived suffix.

PPCBUG-650 (companion): the golden fixture pinned the wrong output, so
the regression had no detection signal until now.

`fmt_bclr` already had the correct `if decr && uncond` guard at line 872
producing `bdnzlr` / `bdzlr`. `fmt_bc` lacked the equivalent.

Fix: gate the condition string on `!uncond` inside the `if decr` block.
For BO=16/18 (uncond bit set), the condition suffix is now empty.

Tests: extended_mnemonics.json fixture rows for bdnz/bdz now expect the
correct `ext_mnemonic: "bdnz"` / `"bdz"`.

Impact: every analysis-DB query for `bdnz` loops (common in pixel-shader
and vertex processing) was returning zero rows; matches stored as `bdnzge`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:40:45 +02:00
MechaCat02
3d8e2ced2e fix(cpu): PPCBUG-053+054 32-bit CTR semantics in bcx/bclrx + mtspr CTR
PPCBUG-053: bcx and bclrx tested `ctx.ctr != 0` against the full 64-bit
register, but the Xbox 360 ABI runs CTR as a 32-bit counter (canary
explicitly truncates: `f.Truncate(ctr, INT32_TYPE)`). When upstream 64-bit
GPR pollution flowed through `mtspr CTR, rN`, the upper 32 bits stayed
non-zero forever; bdnz then looped past the intended 32-bit zero point
because the 64-bit comparison still saw the high bits.

PPCBUG-054: `mtspr CTR` writeback wrote the full 64-bit GPR value,
acting as a firewall gap that fed PPCBUG-053. Defensive truncation
prevents CTR from ever acquiring non-zero upper 32 bits independently
of the GPR-pollution source.

Fixes:
- interpreter.rs:849, 879: ctr_ok now uses `(ctx.ctr as u32) != 0`
- interpreter.rs:1523: mtspr CTR writes `val as u32 as u64`

Tests:
- bcx_bdnz_uses_32bit_ctr_compare: bdnz with CTR=0x0000_0001_0000_0001
  decrements to 0x0000_0001_0000_0000 and exits (low 32 bits = 0).
- bclrx_uses_32bit_ctr_compare: same coverage for bdnzlr.
- mtspr_ctr_truncates_to_32_bits: gpr=0xFFFF_FFFF_8000_0001 → ctr=0x8000_0001.

Coupled fix per the audit: PPCBUG-053 and PPCBUG-054 land together because
either alone is necessary-but-not-sufficient — the truncation prevents new
pollution, the 32-bit compare protects against any pollution that slipped
in via routes other than mtspr (e.g. mfctr-mtctr roundtrips).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:38:18 +02:00
MechaCat02
52ece4bd86 fix(cpu): PPCBUG-424+425 vmaddfp128/vmaddcfp128 operand swap + va128 field fix
PPCBUG-424: vmaddfp128 computed VA×VB+VD instead of ISA-mandated VA×VD+VB.
PPCBUG-425: vmaddcfp128 computed VD×VB+VA instead of ISA-mandated VA×VD+VB.

Root-cause discovered while writing the operand-order regression tests:
va128() was extracting PPC bits 6-10 (the same field as vd128's low 5 bits),
not PPC bits 11-15 where VA lives in VX128 form. This meant va128() silently
aliased vd128 for any instruction where VA != VD, making the operand swap
invisible in the existing denorm-flush test (which used VA == VD == v2).

Fixes in this commit:
- decoder.rs: va128() now extracts PPC bits 11-15 (host bits 20-16) + bit29.
  The vmx128_va128_uses_bit29 test encoding updated to match the correct field.
- interpreter.rs: vmaddfp128 changed from ai.mul_add(bi,di) to ai.mul_add(di,bi)
  (VA×VD+VB). vmaddcfp128 changed from di.mul_add(bi,ai) to ai.mul_add(di,bi).
  vmaddfp128_flushes_denormal_inputs redesigned with distinct VA/VD/VB registers
  (v1/v2/v3) so the flush test is independent of the accessor fix.
  New vmaddfp128_operand_order_va_times_vd_plus_vb and
  vmaddcfp128_operand_order_va_times_vd_plus_vb tests verify 2×3+10=16.
- disasm_goldens.rs + vmx128_registers.json: vmaddfp128/vmaddcfp128/vnmsubfp128
  golden raws updated to properly encode VA at PPC bits 11-15 (new raws:
  0x146328D4 / 0x14632914 / 0x14632954). vperm128 / vsrw128 golden operands
  updated to reflect correct VA extraction (v4 instead of v3/v0).

Affects all VMX128 binary ops that call va128(): vaddfp128, vsubfp128,
vmulfp128, vmaddfp128, vmaddcfp128, vnmsubfp128, vperm128, vsrw128 etc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:33:24 +02:00
MechaCat02
cedee3c385 fix(cpu): PPCBUG-510 stvewx128 writes 16 bytes instead of 4
stvewx128 was aligning EA to 16 bytes and writing all 16 bytes of the
vector, corrupting 12 adjacent bytes on every call. ISA semantics:
word-align EA, extract word lane (EA & 0xF) >> 2, write 4 bytes only.

The non-128 stvewx was already correct; stvewx128 was never updated.
Mirror the stvewx body with instr.vs128() substituted for instr.rs().
The invalidate_for_write call from P1 now covers the correct word-aligned
EA rather than the over-wide 16-byte range.

interpreter.rs: stvewx128 arm (~line 2984)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 10:05:37 +02:00
MechaCat02
6b9de17925 fix(cpu): PPCBUG-363 PPCBUG-369 vpkd3d128 post-pack permutation
vpkd3d128 was storing the pack codec output directly into vd128 without
applying the MakePermuteMask permutation that merges the packed scalar(s)
into the previous register value according to pack (slot layout) and shift
(destination lane offset).

PPCBUG-363: vpkd3d128 was missing the post-pack lane-placement step.
PPCBUG-369: vpkd3d128 pack field not extracted; pack=0 still worked
  (identity), but pack=1/2/3 always wrote raw out instead of blending.

Fix: extract `pack = uimm & 3` and `shift = instr.vx128_4_z()` from the
VX128_4 IMM and z fields. For pack==0 (identity) store out directly as
before. For pack 1-3, read the existing vd128 value and select 4 u32
words from {prev, out} using the 3×4 static permutation tables from
canary ppc_emit_altivec.cc:2126-2188.

Tables derived from canary MakePermuteMask(r0,l0,…r3,l3):
  pack=1 (VPACK_32): out[3] placed at lane (3-shift), prev elsewhere
  pack=2 (64-bit):   out[2..3] placed at lanes (2-shift)..(3-shift)
  pack=3 (64-bit):   same as pack=2 except shift=3 → out[2] at lane 3

Tests: vpkd3d128_pack0_legacy_unchanged, vpkd3d128_pack1_shift0_d3d_vertex_pack,
       vpkd3d128_pack1_shift3_puts_out3_at_lane0

interpreter.rs: vpkd3d128 arm (~line 3999)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 22:06:00 +02:00
MechaCat02
64e8ecbfd0 fix(cpu): PPCBUG-361 PPCBUG-565 fix vsldoi128 SH field extraction
PPCBUG-565: Add vx128_5_sh() to decoder.rs — 4-bit shift at PPC bits
22-25 (host bits 6-9). The correct MSB is at PPC bit 22 (host bit 9).

PPCBUG-361: vsldoi128 was reading the SH MSB from host bit 4 (PPC bit
27, reserved) instead of host bit 9 (PPC bit 22). All shift amounts >= 8
decoded incorrectly (e.g. shift=8 executed as shift=0). Replace the
inline bit-shuffle with instr.vx128_5_sh().

Also fix vx128_p_perm_assembles_correctly test: replace nonexistent
DecodedInstr::from_raw() calls with struct literal construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:29:12 +02:00