The lockstep scheduler's pick_runnable is strict priority
(max_by_key (priority, -idx)). On a cooperative single-host HW slot,
a CPU-bound spinner that never blocks (the silph poll loop pinned by
affinity to hw=5) wins pick_runnable every round forever, permanently
starving a co-located peer (the submitter, tid6) that the spinner is
actually waiting on. On real hardware those threads run on separate SMT
contexts concurrently, so the spinner never starves the submitter; ours
collapses them onto one slot with no anti-starvation, turning priority
(or equal-priority index order) into permanent starvation.
The starved submitter never dequeued job-4 -> the worker-hub (tid5)
blocked INFINITE on completion event 0x1080 -> silph (tid13) wedged on
0x1078 -> no vsync -> draws_seen=0, the publisher splash never renders.
(decrement_quantum's within-slot rotation is dead: begin_slot_visit
unconditionally re-pick_runnable()s each round, discarding the rotated
running_idx. The fix is therefore evaluated at pick time, not via that
discarded rotation.)
Fix (Option A, bounded anti-starvation, deterministic):
- Add per-thread steps_starved counter to GuestThread.
- begin_slot_visit increments it for every Ready peer passed over this
visit, resets it to 0 for the picked thread.
- pick_runnable selects by effective_priority: once steps_starved
reaches STARVE_LIMIT (4096) the thread is lifted to i32::MAX and wins
exactly one pick, then resets. The genuinely higher-priority thread
still wins ~4095/4096 visits -- the boost grants periodic forward
progress only, it does NOT invert priority. Pure function of
counter/priority/index -> deterministic (no wall-clock, no RNG).
Cascade (lockstep exec, XENIA_CACHE_PERSIST=1, -n 200M):
- submitter dequeue sub_82458508 now fires 4x (was 3x); the 4th job
(buf 0x40baa2c0) is dequeued at cycle 6.15M.
- hub tid5 leaves Blocked(0x1080) -> now Ready (no more INFINITE wait).
- GPU packets 0 -> 116,101,363 (command stream now flowing).
- tid13 (silph::UImpl) advances past the old 0x1078 wedge to a NEW
downstream wait (handle 0x10a0); 3 new threads spawn (tid14/15/16).
- draws_seen still 0 -> the splash's first draw is a NEW downstream gate,
not this starvation.
Determinism: two cold lockstep `check -n 5M` runs byte-identical (full
and stable digests). New n50m stable digest deterministic across two
cold runs. Golden re-baselined: instructions 50000007->50000003,
imports 92317->90296 (trajectory shift from the changed pick order).
Tests: 666/666 (+1 test_anti_starvation_bounded_progress).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lockstep livelocked the scheduler the same way --parallel did before
0332d19: the kernel deadline-arithmetic (`now_basis_at`) read per-thread
`ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None`
so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7,
a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its
deadline via `parse_timeout` therefore read `now = 0` and registered
`deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past.
`coord_idle_advance` then re-armed that same constant 3000 deadline forever,
pinning virtual time and starving every other thread's real future deadline.
Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout
WaitForMultiple after its first jobs; that timeout never fired because vtime
was pinned at 3000, so virtual time never reached real future deadlines.
Fix (Option A — mirror the parallel fix): drive the existing deterministic
`Scheduler::global_clock` in lockstep too (floored up once per outer round
to `stats.instruction_count`, a pure function of retired guest instructions —
no wall-clock), and route `KernelState::now_basis_at` through `global_clock()`
in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it
monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged
(it already read `global_clock()`).
Verified (lockstep, 50M):
- DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical.
- LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique
values / 562,084 pinned at 3000 -> 18,586 fires / 18,567 unique /
0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends
Blocked with a real future deadline Some(60002824) instead of spin-Ready
on the past 3000.
- imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls).
Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget
instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue
(sub_82458508) still 3x — the lost 4th-job HANDOFF (count/notify between
sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate,
not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080
(job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple
(sub_824AB214), loop-top stops at cycle 6.23M. draws still 0, VdSwap 1.
Golden re-baseline (same commit): sylpheed_n50m
instructions 50000004 -> 50000007, imports 1790936 -> 92317
(swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged
(livelock onsets after 2M). Suite 665/665 + oracle green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In --parallel mode a long run livelocked: the scheduler spun
"advanced to deadline 3000 waking hw=2 idx=0" ~14k times in
microseconds. Root cause: each guest thread owns ctx.timebase
(+1/instr in step_block), and all kernel deadline arithmetic read
Scheduler::ctx(hw_id).timebase as "now". But the parallel worker
extracts its PpcContext via mem::replace(ctx_mut_ref, PpcContext::new())
— leaving a ZEROED timebase in the slot while it steps unlocked — and
advance_all_timebases_to only walks runqueue (never idle_ctx). So the
coordinator's coord_pre_round drain and a woken thread's parse_timeout
could read a zeroed/stale basis decoupled from the deadline the
scheduler just advanced to. The thread re-armed the same constant
deadline forever; the global clock never moved.
Fix: add a single monotonic Scheduler::global_clock, advanced by the
per-block retired-instruction count on each parallel writeback and
floored up by advance_all_timebases_to. Kernel deadline reads route
through KernelState::now_basis_at(hw_id), which returns global_clock
ONLY when parallel_active; lockstep keeps reading the exact pre-existing
ctx(hw_id).timebase expression, so the deterministic lockstep trace is
byte-identical (sylpheed_n50m golden unchanged, zero re-baseline).
Verified:
- 50M --parallel run completes (was: hung). Deadlines now strictly
increasing 5.4M -> 49.1M (18097 unique of 18116; max repeat 2) vs
pre-fix constant 3000 x ~14000.
- sylpheed_n50m golden byte-identical via plain `check` (no persist).
- Full suite 665/665 green.
Note: an intermittent parallel hang/crash (~1-2/20 at -n 5M) is
pre-existing (master 1/20, this build 2/20 — within noise) and distinct
from the timebase livelock: it is a parallel-race class (e.g. the
unsafe block_ptr deref in run_execution_parallel). Tracked separately;
lockstep remains the recommendation for long runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second differential audit, lead prong: hunt siblings of PPCBUG-020 (the
word-form ALU truncation fixed in 341196a, whose "32-bit ABI / MSR.SF=0"
premise was false — Xenon is a 64-bit core). Found three more band-aids of the
same class, each verified against the canary oracle. All three are genuine
oracle/ISA divergences but INERT on Sylpheed's lockstep trace (sylpheed_n50m
golden digest unchanged; no re-baseline). Fixed + directed tests anyway to
close the band-aid class (per audit decision).
1. slw/srw shift-count mask (PPCBUG-044 site). Ours tested the full u32 count
`< 32`; canary InstrEmit_slwx/srwx mask `rb & 0x3F` then test bit 5. A count
like 0x40 (low-6-bits 0) must pass the value through, not zero it. Fixed both
to `& 0x3F`. The 32-bit CR0 i32-view is unchanged (genuinely 32-bit).
2. sraw/srawi result extension (PPCBUG-041/042/043 "writeback truncation").
Ours zero-extended the 32-bit arithmetic-shift result (`result as u32 as u64`);
PowerISA + canary InstrEmit_srawx/srawix SIGN-extend it (`f.SignExtend`, the
`(i64.s)&¬m` fill). 0x80000000>>1 is now 0xFFFFFFFF_C0000000, not
0x00000000_C0000000. CA math and CR0 view byte-identical.
3. mtspr CTR width (PPCBUG-054). Ours stored `val as u32 as u64`, dropping the
upper 32 bits; CTR is a 64-bit SPR and canary InstrEmit_mtspr stores the full
GPR (`f.StoreCTR(rt)`). A later `mfspr rX, CTR` now round-trips correctly.
bdnz/bcctr still consume only CTR's low 32 bits (the bcx zero-TEST truncation
at line ~922 MATCHES canary's `f.Truncate(ctr, INT32_TYPE)` — left untouched).
Tests: updated srawx_negative_value_sign_extends_upper,
srawix_high_count_negative_input_sign_extends_all_ones, and
mtspr_ctr_keeps_full_64_bits (formerly premise-defending the bugs —
reading-error #24). Added slwx/srwx 6-bit-mask tests, mfspr_ctr round-trip, and
the rlwinm MB>ME wraparound-mask test (plan-requested gap closure). 665/665.
Left correct (re-confirmed vs canary, do NOT touch): bcx/bclr CTR 32-bit test,
divw/divwu zero-extend quotient (canary f.ZeroExtend, ISA upper undefined),
extsb/extsh, logical-NOT chain, mulhw/mulhwu, srawx 0x3F mask, pixel pack/unpack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Xenon is a 64-bit PPC core (32-bit *pointer* ABI, but 64-bit registers and
integer arithmetic). The interpreter was truncating every word-form integer
ALU writeback to 32 bits and zero-extending, on a false "MSR.SF=0 / 32-bit
ABI" premise. This silently corrupted any genuine 64-bit value flowing through
word-form arithmetic.
Confirmed load-bearing via runtime ours-vs-canary capture: Sylpheed's
millisecond->LARGE_INTEGER timeout converter sub_824ACA88 does
`clrldi; mulli r11,r11,-10000; std`. For a 16 ms wait the correct result is
-160000 = 0xFFFFFFFF_FFFD8F00 (relative). canary stores exactly that; ours'
truncating `mulli` stored 0x00000000_FFFD8F00 (positive) -> the i64 timeout
read as a huge *absolute* deadline -> a ~26000x over-wait that froze the main
frame loop. After the fix the timeout matches canary and the previously-frozen
frame/worker loops run (parallel boot NtWaitForMultipleObjectsEx 94 -> 30428;
KeWaitForSingleObject/critical-section loops resume).
Fix mirrors canary's INT64 emitters (ppc_emit_alu.cc) op-by-op for the 17
data-losing word-form ops: addis, addic(.), subfic(.), mulli, add(c/e/ze/me)x,
subf(c/e/ze/me)x, negx, mullwx. Only the result *writeback* widens to full
64 bit; the 32-bit carry (XER[CA]) and overflow (XER[OV]) computations and the
CR0 i32 view are preserved byte-identical (the low 32 bits of the new result
equal the old truncated result), so this is a strict no-op for clean 32-bit
values and only restores the previously-zeroed upper bits for genuine 64-bit
values. Genuinely-32-bit ops (rlwinm/slw/srw/cmpw, mulhw/divw whose upper bits
are ISA-undefined) are left untouched.
Updated 7 unit tests that asserted the truncation (they encoded the bug) to the
canary-correct full-64-bit values. Re-baselined the sylpheed_n50m golden
(imports 40454 -> 1790936: the unwedged frame/worker loops now cycle under the
instruction-count timebase); sylpheed_n2m unchanged (pre-frame-loop). Lockstep
determinism preserved (two 50M runs identical). Full suite 660/660.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the 2026-06-12 5-subsystem differential audit. All verified against
canary as oracle; 660/660 workspace tests green (655 + 5 new).
1. nt_create_event polarity (exports.rs) — `manual_reset = gpr[5] != 0`
was INVERTED. Canary xboxkrnl_threading.cc:668 `Initialize(!event_type,..)`
+ xevent.cc:41 (type 0 = NotificationEvent = manual, type 1 = Sync = auto).
Now `== 0`. Was the dormant 2.AI fix on chore/portable-snapshot, never
merged. The Ke-path was already correct; only the Nt-path was wrong.
2. 2.AF deadline drain (main.rs coord_pre_round) — expired KeWait/KeDelay
deadlines never fired under load because advance_to_next_wake_if_due was
only called in coord_idle_advance (no-Ready-threads path). Added a
per-round drain loop; covers BOTH lockstep and parallel outer loops since
both call coord_pre_round. Was the dormant 2.AF fix, never merged.
3. handle slab-recycle ABA guard (state.rs + scheduler.rs) — release_handle_slot
(my round-34 regression) recycled a closed slot even with a thread still
parked on it, risking a stale-waiter wake when the slot is re-minted. Added
Scheduler::any_thread_waiting_on; decline to recycle a still-waited slot.
4. vpkpx pixel-pack (vmx.rs) — wrong field mapping (~100% mismatch). Now
exact canary ppc_emit_altivec.cc:1795 shift/mask (red 6b out[15:10] from
w[24:19], green out[9:5] from w[14:10], blue out[4:0] from w[7:3]; no
fabricated alpha bit). +unit test.
5. VFS GDFX attribute plumbing (vfs/*, exports.rs query fns) — VfsEntry now
carries the real on-disc attribute byte (GDFX dirent +12, canary
disc_image_device.cc:136/154) instead of inferring directory-ness from
path shape. Query exports report the real FILE_ATTRIBUTE_* bits. Candidate
driver of the XamShowDirtyDiscErrorUI gate. +tests.
6. MmGetPhysicalAddress region-aware mirror (exports.rs) — flat 0x1FFFFFFF
mask missed canary's +0x1000 host_address_offset for 0xE0000000+ mirror
(memory.cc:2317). Read-only query; proven byte-identical 50M digest. +test.
Investigated and intentionally NOT changed:
- zero-on-recommit: no-op; ours has no region-reuse path (bump allocators,
free is a stub).
- 32-bit ALU writeback truncation (PPCBUG-020): documented-deliberate; premise
(MSR.SF=0) is questionable but flipping it is out of scope here.
- KeSetEvent/NtSetEvent return value: ours returns true previous state
(hardware-faithful); canary returns constant 1 — NOT an ours bug.
sylpheed_n50m golden will need re-baselining (legit behavior change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hook NtCreateEvent for the silph::UImpl tid=13 chain (entry=0x821748F0,
start_context=0x4024a840, frame-1 LR=0x821CB15C inside sub_821CB030+0x128)
and auto-signal the resulting handle after XENIA_SILPH_UI_AUTOSIGNAL_DELAY
instructions. Env-gated; default off.
SR4 verdict B (partial unwedge):
- handle 0x1078 signal_attempts 0->1
- tid=13 Blocked(WaitAny[0x1078]) -> Ready pc=0x824a9108
- ExCreateThread 10 -> 12 (new silph::UImpl tid=14, worker tid=15)
- New downstream wedges 0x1084 + 0x1088
- cxx_throw runtime_error on tid=5 inside R26 dispatcher
(BST not-registered instance lhs=0x715a7af0)
- VdSwap stays 1; no draws (POC is diagnostic, not final fix)
Confirms Phase C diagnosis end-to-end. The real signaler must (a) drive
NtSetEvent on the silph KEVENT AND (b) register the dispatcher's BST
instance upstream; this POC only does (a).
Reading-error class #20: ctx.lr at kernel export entry is the thunk
wrapper's return slot, NOT the guest caller's post-bl PC. Walk back-chain
1 step to get frames[1].lr.
Reading-error class #21: --parallel and lockstep have SEPARATE outer
loops in main.rs (run_execution_parallel line 2928 vs run_execution
line 2706). Per-round hooks must be wired in BOTH paths.
Files:
- crates/xenia-cpu/src/scheduler.rs: GuestThread.start_entry/start_context
fields + spawn() population + current_thread_entry_and_ctx() helper
- crates/xenia-kernel/src/state.rs: AutoSignalPending struct, env-parsed
silph_autosignal_delay, pending Vec, last_cycle_hint, set_now_cycle_hint,
maybe_register_silph_autosignal (walks back-chain), fire_due_silph_autosignals
- crates/xenia-kernel/src/exports.rs: hook in nt_create_event
- crates/xenia-app/src/main.rs: fire-site + cycle hint in both outer loops
- audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md
Tests 655/655 green. Default behavior byte-identical when env unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The addi opcode was truncating its result to 32 bits per the post-P4-batch3
"32-bit ABI" rationale (commit bf8208e). Hunk-level bisection during the
2026-05 audit (M11) isolated this single cast as the cause of the
post-P8 swap regression: swaps dropped 2 → 1 and the renderer lost a
frame. PowerISA mandates sign-extension to 64 bits; canary does not
truncate addi. The truncation was a canary-divergent over-extension
of the addis fix (which IS canary-divergent by design, see
addis at interpreter.rs:121-134).
The addi_li_neg_one_zero_extends_upper test encoded the wrong invariant.
Replaced with a sign-extension test asserting canonical PowerISA
behavior (gpr[3] == 0xFFFF_FFFF_FFFF_FFFF for `li r3, -1`).
Verification at -n 100M lockstep:
swaps: 1 → 2 (gate met)
draws: 0 → 0 (unchanged — gated by Phase C+D+E)
instructions: ~100M (unchanged)
imports: 11.4M → 987k (game escapes retry loop)
packets: 281M → 57M (same)
interrupts_delivered: 629 → 630
Tests: 551 passing (unchanged). Lockstep determinism: byte-identical
across two 100M runs except packets (±5%, GPU-thread-race noise floor).
Closes SWAPBUG-001 / PPCBUG-001.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-P8 review nit: the if/else branches were identical
(`adj_bits - 1` either way). Both positive and negative finite f32
values use the IEEE-754 sign bit as the MSB, and subtracting 1 from
`to_bits()` always reduces magnitude by one ULP. Replace the
mock-conditional with the unconditional form + a comment explaining
why one operation works for both signs.
No behavior change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Post-P8 end-to-end review caught an ISA deviation introduced by P4
batch 5. The original code used `as i32 as i64 as u64` (correct
PowerISA sign-extension; canary's `SignExtend(INT64_TYPE)`). My P4
batch 5 commit (20a730d) changed all three to `as u64` (zero-extend),
citing the audit's "32-bit-ABI hazard" note for PPCBUG-105.
This deviation is wrong per PowerISA and any 64-bit-mode kernel code
that uses `lwa rT, off(rA)` will silently produce the wrong rT for
negative words (e.g. memory 0x80000000 should yield 0xFFFFFFFF_80000000
but was yielding 0x00000000_80000000).
Restore ISA-spec sign-extension for all three forms (lwa, lwax, lwaux).
The audit's 32-bit-ABI hazard concern was speculative — there's no
evidence that Xbox 360 user code emits `lwa` (compilers use `lwz`).
If a real bug surfaces from a 32-bit-ABI consumer that feeds an
`lwa`-loaded value into a u64 unsigned compare, that's a separate
issue to debug at the consumer site.
Test renamed: lwa_high_bit_set_zero_extends_upper → lwa_sign_extends_to_i64
with assertion flipped to expect 0xFFFFFFFF_80000000.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P8 review feedback (non-blocking): the test fn name said vmsum3fp but
the encoding/body actually tests vmaddfp. Rename + clarify comment;
no behavior change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 8 batch 3 — FPU and VMX float test gap closure.
14 new tests:
- Single FPU (187): fadds, fmuls
- Double FPU (208): fmul, fdiv (zero-numerator), fneg, fabs, fmr
- FPU convert/compare (228): fcmpu, fcfid
- VMX float compare (438): vcmpeqfp lane mask
- VMX rounding (439): vrfip, vrfim, vrfiz
- VMX convert (440): vctsxs saturation to INT_MAX/INT_MIN
The VMX VX-form encoding nit (XO is 11 bits at PPC 21-31, host bits 10-0,
with bit 0 the LSB — not bit 1) was caught by initial test failures and
fixed before commit. VC-form (vcmpeqfp) has the same "XO at bit 0" layout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 8 batch 1 — test gap closure for the branch/CR-logical/SPR/MSR/
FPSCR/cache+sync groups.
12 new tests across the affected groups:
- PPCBUG-055 branch: blr, bctr, bcl-LK-on-not-taken
- PPCBUG-070 CR logical: cror, crand, crxor (crclr idiom)
- PPCBUG-067 trap+sc: sc smoke, tw TO=0 never-traps
- PPCBUG-081-085 SPR/MSR/FPSCR moves: mfcr 8-field assembly, mtfsb1/mtfsb0
- PPCBUG-089 cache+sync: sync state-non-mutation smoke
These groups previously had near-zero unit test coverage. New tests lock
in the current ISA-correct behavior; would catch a regression in any of
the dispatch/encoding/result paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P6 review nit: replace the inline `const VX_ALL_MASK` in the mcrfs arm
with the existing `fpscr::VX_ALL` constant (single source of truth).
Behaviorally identical.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 6 batch 4 — overflow/cleanup verification.
- PPCBUG-022 mulld_ov INT_MIN * -1: the audit-claimed missing edge case
is actually handled by `i64::checked_mul()` (returns None when the
result would be -i64::MIN = i64::MAX+1, which doesn't fit). New
regression tests in overflow.rs confirm: INT_MIN * -1 overflows;
INT_MIN * 1 doesn't; (INT_MIN+1) * -1 = INT_MAX, no overflow.
Audit's claim was incorrect; documented by the new tests.
- PPCBUG-021 (overflow.rs OE checks at bit 63): largely auto-resolved
by P4 batch 6 (16993bb), which switched all 32-bit ABI ops to inline
`true_sum != (result32 as i32) as i128`. Helpers like add_ov_64 are
now only called from 64-bit ABI ops where bit-63 is correct.
- PPCBUG-027 (rlwimix upper-32 zeros): auto-resolved by P4 (rlwimix
now writes via `as u32 as u64` truncation).
- PPCBUG-039 (cntlzdx 32-bit-ABI): wontfix per audit — only matters
if a 32-bit-ABI binary emits cntlzd, which compilers don't.
Remaining low-impact items (PPCBUG-642 ISA-undefined fmt_bcctr decr,
PPCBUG-643/644 SIMM/D-form hex display, PPCBUG-367/368 vupkhpx/vpkpx
channel ordering, PPCBUG-487/495 vsum operand naming, PPCBUG-515/516
lvebx/lvsr documentation, PPCBUG-601 decode_op6 invariant doc) are
left for a P9 or follow-up batch — they're cosmetic/test-coverage
items rather than correctness bugs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 6 batch 3 — SPR/MSR/VSCR semantics.
- PPCBUG-078 mtmsrd L=1: PowerISA requires partial-MSR-write — only
MSR[EE] (u64 bit 15) and MSR[RI] (u64 bit 0) modified, all other
MSR bits preserved. Used by kernel code to toggle external interrupts.
Previously merged with mtmsr (full overwrite), silently corrupting
MSR for any L=1 caller.
- PPCBUG-080 mfvscr: ISA places VSCR in the rightmost word of VD with
bytes 0-11 zeroed. Previously copied the full 128-bit ctx.vscr,
leaking stale upper data to guest. Now zero-extends per canary.
- PPCBUG-068 mcrfs VX summary: when mcrfs clears VX* exception bits,
the VX summary bit at FPSCR[2] must be recomputed (clears if all
contributors are 0; remains 1 otherwise). Previously left stale,
causing subsequent CR-test sequences to misread the FPU state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 6 batch 2 — XER TBC enabling + load/store-multiple cleanups.
- PPCBUG-123/124/161/566 (coupled): XER TBC field was unmodelled —
`ctx.xer()` always returned 0 in bits 0-6, and `ctx.set_xer()`
silently discarded any TBC writes. Result: `lswx` and `stswx` were
permanent no-ops (the `while bytes_left > 0` loop never executed).
Fix: add `pub xer_tbc: u8` to `PpcContext`; wire into `xer()` and
`set_xer()`. Initialize to 0 in `PpcContext::new()`. lswx/stswx
bodies are correct as-is once the infrastructure is wired.
- PPCBUG-125 lmw: PowerISA marks `lmw rT, D(rA)` invalid when rA is
in [rT..31]; canary skips the write to rA to preserve the EA base.
Now matches canary.
- PPCBUG-126/162 lswi/stswi: replaced `instr.rb()` with `instr.nb()`
for the NB field. Both accessors return identical values today
(bits 16-20), but the maintenance hazard from the misnomer is now
removed. A future `rb()` type-system refactor would have broken
lswi/stswi silently.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 6 batch 1 — trap/sc semantics.
- PPCBUG-063 trap PC: previously ctx.pc was incremented to CIA+4 BEFORE
StepResult::Trap returned, forcing handlers to .wrapping_sub(4) to
recover the faulting instruction address. Now ctx.pc stays at CIA on
trap, matching SRR0 semantics on real hardware. Critical for any
future SEH/exception-delivery path (e.g. the Sylpheed C++ throw work).
- PPCBUG-065 typed-trap logging: `twi 31, r0, IMM` is the Xbox 360
CRT/kernel typed-trap convention encoding C++ exception class via
SIMM. The trace now logs the SIMM type code when this pattern fires.
Routing the type code via a StepResult payload requires an enum
extension (multiple consumer sites) that's deferred.
- PPCBUG-064 sc LEV logging: `sc 2` is the Xbox 360 hypervisor-call
convention; canary dispatches it to a different handler than `sc 0`.
Now logs a warning when LEV != 0. Routing LEV=2 to a HypervisorCall
variant also requires a StepResult enum extension; deferred.
The two enum-extension follow-ups can land as a structural sub-batch
once a clear consumer (SEH dispatch, hypervisor-call HLE) is in place.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 6 (5f): saturation and FMA-rounding fixes.
- PPCBUG-426 vnmsubfp: was `bi - ai * ci` (two rounding steps); now
`-ai.mul_add(ci, -bi)` which is mathematically equivalent (= bi - ai*ci)
but uses a single FMA round per ISA.
- PPCBUG-427 vnmsubfp128: same single-FMA fix.
- PPCBUG-433 vctsxs / vcfpsxws128 NaN saturation: AltiVec ISA saturates
NaN to INT_MIN (0x80000000); xenia returned 0. The vctuxs (unsigned)
NaN→0 is correct per ISA.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 5 (5e): minimal-viable fix for the estimate-precision
family. Hardware Xenon `fres` produces a ~12-bit LUT estimate; xenia
and canary both produce a fully IEEE single reciprocal, but canary
pre-quantizes the f64 input to f32 to at least match the input
precision. Now matches canary.
PPCBUG-428..431 (vrefp/vrsqrtefp/vexptefp/vlogefp) already operate on
f32 inputs naturally (no f64 → f32 quantization step needed); the
estimate-precision deviation is purely the output side. Newton-Raphson
convergence is unaffected. Documented in audit-findings.md as
LOW-impact full-fix-requires-LUT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 4 (5d) — partial: VSCR.NJ subnormal flush for VMX float
arithmetic. Xbox 360 always boots with NJ=1, so games expect inputs
and outputs flushed to ±0.
- PPCBUG-435 vaddfp/vaddfp128/vsubfp/vsubfp128/vmulfp128: previously
no flush at all on these opcodes (only vmaddfp family flushed).
Now flushes both inputs and output per Canary's unconditional model.
- PPCBUG-436 vmsum3fp128/vmsum4fp128: per-product intermediates now
flushed individually (was only the final sum).
- PPCBUG-437 vmaddfp/vmaddfp128/vmaddcfp128/vnmsubfp/vnmsubfp128:
outputs now flushed (inputs were already flushed).
PPCBUG-185 (FPSCR.NI flush for scalar FPU) deferred — requires adding
a NI bit constant and post-op flush wrapper across all *sx arms; will
land in a focused sub-batch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 3 (5c) — partial: targeted XX-on-inexact fixes for the
float-to-int and double-to-single conversion family. (PPCBUG-180/200,
the broader update_after_op XX/FR/FI rework, deferred to a focused
sub-batch.)
- PPCBUG-225 frspx: set XX when the f64→f32 round produces a different
value (i.e. precision loss). Almost every frsp call is inexact —
previously games polling FPSCR.XX never saw the set bit after a frsp.
- PPCBUG-224 fcfidx: set XX when the i64 input has > 53 significant
bits (precision lost in conversion to f64).
- PPCBUG-229 fctidx/fctidzx: set XX when input is non-integer (fractional
part discarded by the conversion).
- PPCBUG-230 fctiwx/fctiwzx: same shape for word-width conversions.
- PPCBUG-223 verified already correct in current code (fcmpo sets
VXSNAN/VXVC on NaN operands; the audit-cited drift was already fixed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 2 (5b): VXISI / NaN handling for the FMA family.
The 8 FMA opcodes (fmaddx/fmaddsx/fmsubx/fmsubsx/fnmaddx/fnmaddsx/fnmsubx/
fnmsubsx) all share two fix shapes:
1. VXISI on the add/sub step. The previous code passed `a*c` to
check_invalid_add, which has separate rounding from the FMA. In
extreme cases this gives the wrong sign (PPCBUG-202) or wrong infinity
status. Worse, fmsub/fnmadd/fnmsub had NO add-step VXISI check at all
(PPCBUG-181/182/203). The fnmsub pattern is the canonical Newton-
Raphson step — the most common FPU path in Xbox 360 graphics code.
2. NaN sign preservation in fnmadd/fnmsub. ISA Book I §4.3.4 forbids
negation of a NaN FMA result; Rust's unary `-` flips the IEEE-754
sign bit (PPCBUG-183/205).
Fixes:
- fpscr.rs: new helper `check_invalid_fma_add(ctx, a, c, b, sub)` that
derives VXISI from input properties (mathematical-product sign +
b sign) instead of from the lossy `a*c` value. Also covers SNaN.
- interpreter.rs: all 8 FMA arms now use the new helper; fnmadd[s]/
fnmsub[s] gate the negation on `!fma.is_nan()`.
Tests:
- fmsub_inf_minus_inf_sets_vxisi: regression for PPCBUG-203.
- fnmadd_nan_input_preserves_nan_sign: regression for PPCBUG-205.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 batch 1 (5a): round-to-int correctness.
PPCBUG-221+227 (coupled): round_to_i64 NearestEven tie-breaking used
`(diff - 0.5).abs() < f64::EPSILON` to detect half-integers, but for
|v| > 2^52 every f64 value is an exact integer (v.trunc() == v), giving
diff == 0. The buggy check fell through to v.round() (round-half-away-
from-zero), giving wrong results for large odd half-integers. Replaced
with a fractional-part-only check that's exact for |v| <= 2^52 and
degenerates to truncation above.
PPCBUG-432: vrfin/vrfin128 used Rust's `f32::round()` which is round-
half-away-from-zero. ISA requires round-to-nearest-even (banker's
rounding). Implemented inline.
PPCBUG-201 (FPSCR.RN for double arithmetic) deferred — requires
MXCSR-set/restore wrappers around 10+ FPU arms; will land in a focused
sub-batch after the remaining 5a-5f fixes.
Tests:
- round_to_i64_nearest_even_on_tie: extended with 0.5, 1.5, -0.5, -1.5.
- round_to_i64_non_tie_cases: 0.4/0.6 (non-tie sanity).
- round_to_i32_nearest_even_on_tie: PPCBUG-227 coverage.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Independent reviewer of the P4 branch found two issues:
(1) BLOCKING — subfx and subfcx OE handlers still called the legacy
`overflow::sum_overflow_64(true_diff, result32 as u64)` while batch 6
had migrated all add* sites to the inline `true_sum != (result32 as i32)
as i128` form. The legacy helper compares `true_diff` against
`(result32 as u64) as i64 as i128`, which views any bit-31-set result
as a positive i64 (e.g. result=0x80000000 → +2147483648 in i64). For a
legitimate i32::MIN result with no actual 32-bit overflow, this caused
spurious OV=1.
Concrete repro now caught by `subfo_no_spurious_ov_when_result_has_bit31_set`:
r3=1, r4=0x80000001 → result=0x80000000, true_diff=-2147483648, no OV.
Pre-fix: spurious OV=1.
(2) Minor — `mulli_overflow_wraps_to_32` rubber-stamped: with ra=0x80000000
and imm=2, both pre-fix (`as i64 as u64`) and post-fix (`as u32 as u64`)
write the same value. Replaced with ra=u64::MAX (polluted upper bits) where
pre-fix writes 0xFFFFFFFF_FFFFFFFE and post-fix writes 0x00000000_FFFFFFFE.
Fixes:
- interpreter.rs subfx/subfcx OE: switch to inline 32-bit predicate
matching the rest of batch 6.
- subfo_sets_xer_ov_on_min_minus_one: renamed and updated to test 32-bit
overflow (r4=0x80000000 - 1 = 0x7FFFFFFF, OV=1).
- New: subfo_no_spurious_ov_when_result_has_bit31_set (PPCBUG-017
review-fix regression).
- New: subfco_no_spurious_ov_when_result_has_bit31_set (same for PPCBUG-007).
- mulli_overflow_wraps_to_32: redesigned with polluted upper bits to
actually discriminate pre/post fix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 4 batch 6: latent writeback truncation (4c) and CR0 catch-all (4d).
~13 PPCBUGs across all remaining 32-bit ABI ALU sites.
Latent writeback (4c) — the 4a/4b fixes already eliminate the upstream
poisoning, but a defensive truncation here catches any future regression:
- PPCBUG-012 addx, PPCBUG-013 addcx, PPCBUG-014 addex, PPCBUG-015 addzex,
PPCBUG-016 addmex, PPCBUG-017 subfx — all rewritten to compute on u32
operands and write `as u64`. CA computed via 32-bit unsigned compare.
Overflow now uses `true_sum != (result32 as i32) as i128` (32-bit
predicate, since sum_overflow_64 is i64-bounded).
- PPCBUG-032 andx/orx/xorx — CR0 catch-all only (results inherit upper
bits from operands; once those are clean, no truncation needed).
CR0 catch-all (4d) — fix the `update_cr_signed(0, X as i64)` pattern at
every 32-bit-ABI Rc=1 path:
- PPCBUG-020 catch-all: applied to mulhwx, mulhwux, divwux, mullwx (was
already done in batch 4), addx/addcx/addex/addzex/addmex/subfx (now in
4c above), andx/orx/xorx, andix, andisx, slwx, srwx, cntlzwx,
rlwinmx, rlwimix, rlwnmx, mullwx (already), divwx (already),
srawx/srawix (already in batch 4).
- PPCBUG-023 andisx: now correctly classifies bit-31 results as CR0.LT.
- PPCBUG-024 rlwinmx, PPCBUG-025 rlwimix, PPCBUG-026 rlwnmx.
- PPCBUG-044 slwx/srwx: bit-31 result like 0x80000000 now CR0.LT.
64-bit ABI ops (rldicl/rldicr/rldic/rldimi/rldcl/rldcr, sldx/srdx/sradx/
sradix, mulhdx/mulhdux/mulldx, divdx/divdux, cntlzdx) intentionally retain
the 64-bit `as i64` form per ISA — these are 64-bit-mode instructions.
Updated old tests:
- addo_sets_xer_ov_on_signed_overflow_and_stickies_so: i32::MAX + 1 → INT_MIN.
- addx_rc_uses_64bit_compare_not_32bit: renamed to ..._uses_32bit_compare_in_xbox_abi
with assertions flipped to the correct 32-bit ABI behavior.
New tests:
- andisx_sign_bit_set_classifies_lt (PPCBUG-023).
- slwx_high_bit_result_classifies_lt (PPCBUG-044).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 4 batch 5: 5 PPCBUGs in the load family. lha/lhax/lhau/lhaux
sign-extended halfword results to u64 (active poisoning for negative
halfwords); lwa/lwax/lwaux sign-extended u32 results.
- PPCBUG-095/096/097/098 lha[ux]: `as i16 as i64 as u64` →
`as i16 as i32 as u32 as u64`. Sign-extend to i32 then zero-extend.
Common trigger: int16_t struct fields, PCM samples, packed vertex
deltas. Memory 0x8000 was producing 0xFFFFFFFF_FFFF8000.
- PPCBUG-105 lwa/lwax/lwaux: `as i32 as i64 as u64` → `as u64`.
Per-canary the 64-bit-mode form sign-extends, but in 32-bit ABI we
must zero-extend (canary's behavior is rescued by x86 register
zeroing in JIT; pure interpreter has no escape). Memory 0x80000000
was producing 0xFFFFFFFF_80000000.
Tests:
- lha_negative_halfword_zero_extends_upper (PPCBUG-095).
- lhaux_negative_halfword_clean_writeback (PPCBUG-098 + EA update).
- lwa_high_bit_set_zero_extends_upper (PPCBUG-105).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 4 batch 3: 6 PPCBUGs in the same-shape-as-addis (4b) sub-section.
All share the pattern of computing on 64-bit values when the 32-bit ABI
requires u32 arithmetic.
- PPCBUG-001 addi: `li rT, -1` produced 0xFFFFFFFF_FFFFFFFF; now 0x00000000_FFFFFFFF.
- PPCBUG-002 addic: writeback truncated + CA from u32 unsigned compare
matching canary's `AddDidCarry`.
- PPCBUG-003 addicx: same plus CR0 i32 view (regression vs. the frozen
ppc-manual snapshot which had the correct form).
- PPCBUG-004 mulli: 64-bit signed product now truncated to 32 bits.
- PPCBUG-005 subficx: writeback + CA in u32 space; removes the bits-32-63
pollution from sign-extended negative SIMM.
- PPCBUG-007 subfcx: defensive 32-bit truncation of CA compare. Same shape
as the compare that broke addis (0x828F3F98 / 0x828F3F68 case).
Tests:
- addi_li_neg_one_zero_extends_upper (PPCBUG-001).
- addic_carry_uses_32bit_compare (PPCBUG-002).
- mulli_overflow_wraps_to_32 (PPCBUG-004).
- subficx_neg_simm_zero_extends (PPCBUG-005).
- subfcx_addis_incident_case (PPCBUG-007 — exact addis-incident case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 4 batch 2: extsbx and extshx writeback truncation + CR0 view fix.
Coupled per audit — must land together because the writeback fix would
silently break CR0 sign classification if the CR0 fix didn't ship in
the same commit.
Before:
- extsbx: `as i8 as i64 as u64` — every negative byte poisoned upper
32 bits (active poisoning, not latent). 0x80 → 0xFFFFFFFF_FFFFFF80.
- extshx: same shape for halfwords.
- CR0: `as i64` view — accidentally correct on the buggy 64-bit form
because the high bits matched the byte's sign bit.
After:
- extsbx: `as i8 as i32 as u32 as u64` — sign-extend to i32 then
zero-extend to u64. 0x80 → 0x00000000_FFFFFF80.
- extshx: same for halfwords.
- CR0: `as u32 as i32 as i64` — i32 view, so a result with bit 31 set
is correctly classified as negative under the 32-bit ABI.
Tests:
- extsbx_negative_byte_zero_extends_upper: 0x80 input → 0x00000000_FFFFFF80
with CR0.LT set.
- extshx_negative_halfword_zero_extends_upper: same shape for 0x8000.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 4 batch 1: 9 PPCBUGs in the active-poisoning sub-section. All
follow the pattern `!val` on u64, which unconditionally flips the upper
32 bits and poisons the GPR even with clean inputs — every execution
corrupts the high 32 bits regardless of upstream state.
Sub/neg family:
- PPCBUG-006 negx: `(!ra).wrapping_add(1)` on u64 + neg_ov_64 checks
64-bit INT_MIN. Fix: do arithmetic in u32, OE checks PPC[ra32==0x80000000].
- PPCBUG-008 subfex: same shape as above plus 64-bit unsigned CA compare.
Fix: cast all operands to u32, compute, write `as u64`.
- PPCBUG-018 subfzex: `!ra` on u64. Fix: u32 arithmetic.
- PPCBUG-019 subfmex: `!ra` on u64 + always-true CA edge (`!ra != 0`
was always true for clean ra<0xFFFFFFFF because high bits of !u64
are non-zero). Fix: u32 arithmetic; CA predicate now correct.
Logical NOT family:
- PPCBUG-028 orcx: rs | !rb on u64 → high-bit poison.
- PPCBUG-029 norx: !(rs|rb) — the `not` simplified mnemonic. Hot path,
every `not` corrupted GPR upper 32 bits.
- PPCBUG-030 nandx: !(rs&rb).
- PPCBUG-031 eqvx: !(rs^rb). The common `eqv rA,rA,rA` set-to-all-ones
idiom now produces 0x00000000_FFFFFFFF instead of 0xFFFFFFFF_FFFFFFFF.
- PPCBUG-033 andcx: rs & !rb.
CR0 update at every Rc=1 path now uses `as u32 as i32 as i64` so a result
with bit 31 set gets classified as negative under the 32-bit ABI (was
positive before because upper bits were ones; will be positive in new
truncated form unless we cast through i32). This pre-emptively addresses
PPCBUG-020 for these specific opcodes; the catch-all sweep in batch 6
covers the remaining sites.
Tests:
- nego_sets_ov_only_on_int_min: updated from i64::MIN → 0x80000000 (32-bit).
- test_subfze_carry_only_when_ra_zero_and_ca_one: result expectations
updated from u64::MAX → 0xFFFFFFFF (low 32 bits, upper 32 zero).
- New: neg_clean_input_no_upper_bits (PPCBUG-006 regression).
- New: norx_not_simplified_keeps_upper_bits_clean (PPCBUG-029 regression).
- New: eqvx_self_self_self_sets_low32_to_all_ones (PPCBUG-031 regression).
- New: andcx_bit_clear_keeps_upper_clean (PPCBUG-033 regression).
- New: subfex_clean_inputs_no_upper_bits (PPCBUG-008 regression).
- New: subfmex_ra_max_ca_zero_clears_ca (PPCBUG-019 always-true CA fix).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Independent review of P3 batch 2 (52ece4b) found that all three VMX128
register accessors disagreed with canary's FormatVX128/VX128_R bitfield
struct (`xenia-canary/src/xenia/cpu/ppc/ppc_decode_data.h:484-663`). The
audit at line 2958 had marked these "confirmed-clean" but had miscounted
LSB-first bitfield offsets.
Canary's actual layout (LSB-first, GCC/Clang/MSVC on x86):
VA128 = VA128l(5) | VA128h(1)<<5 | VA128H(1)<<6
= PPC[11:15] | PPC[26]<<5 | PPC[21]<<6 (7-bit selector, 3 fields)
VB128 = VB128l(5) | VB128h(2)<<5
= PPC[16:20] | PPC[30:31]<<5 (7-bit selector, 2 fields)
VD128 = VD128l(5) | VD128h(2)<<5
= PPC[6:10] | PPC[28:29]<<5 (7-bit selector, 2 fields)
VX128_R Rc = PPC[25] (host bit 6) not PPC[27] as prior fix had
The buggy convention was internally consistent with hand-crafted test
fixtures (which set bits 29/21/22 to encode the high registers, matching
the buggy accessor). Real Xbox 360 game code follows canary's convention,
so any production VMX128 instruction with VR >= 32 was silently mis-decoded
— but no unit test exercised that path until the va128 fix in 52ece4b
exposed the inconsistency.
Changes:
- decoder.rs: rewrite va128/vb128/vd128/vx128r_rc_bit to canary positions.
Drop the speculative `key4_dt` dot-form dispatch in decode_op6 — canary
has no separate dot-form opcodes for VX128_R compute ops; Rc is a
runtime modifier read by the interpreter via vx128r_rc_bit().
- decoder.rs tests: rewrite vmx128_test_word helper for canary layout;
rename/re-encode vmx128_vd128_*, vmx128_va128_*, vmx128_vb128_* tests.
- interpreter.rs: update encode_vpkd3d128 test helper to encode VD via
canary's VD128h field; tests now pass vd=96 explicitly.
- tests/disasm_goldens.rs: replace the vrlimi128/vsrw128/vpermwi128/
vperm128 hand-encoded raws with canary-compliant encodings; introduce
a shared `encode_vx128` helper.
- tests/golden/vmx128_registers.json: re-encode 9 entries (vperm128,
vsrw128 ×2, vpermwi128, vrlimi128 ×2, vmaddfp128, vmaddcfp128,
vnmsubfp128) to canary-compliant raws preserving the same expected
operand strings.
- audit-findings.md: new PPCBUG-700 entry documenting the discovery and
invalidating the audit's "confirmed-clean" assessment.
Affects all VMX128 binary ops (vaddfp128, vsubfp128, vmulfp128, vand128,
vor128, vxor128, vnor128, vandc128, vsel128, vslo128, vsro128, vperm128,
vsrw128, vmaddfp128, vmaddcfp128, vnmsubfp128, vpkd3d128, vpkshss128,
vpkshus128, vpkswss128, vpkswus128, vpkuhum128, vpkuhus128, vpkuwum128,
vpkuwus128, vmsum3fp128, vmsum4fp128, vrlimi128, vpermwi128 — 30+
opcodes), plus VX128_R compare dot-forms.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PPCBUG-641: PpcOpcode::sync emitted "sync" regardless of the L-field at
PPC bit 10. The Xbox 360 acquire barrier (encoding 0x7C2004AC, L=1) is
lwsync, used in every spinlock. The disassembly DB stored every lwsync
as `mnemonic='sync'`, so `SELECT WHERE mnemonic='lwsync'` returned zero
rows regardless of binary content.
PPCBUG-649 (companion): the golden fixture for lwsync had no ext_mnemonic
field, pinning the wrong output and defeating regression detection.
Fix: in disasm.rs, gate on `(instr.raw >> 21) & 1` (PPC bit 10) — when
set, emit the lwsync extended form. Update extended_mnemonics.json
fixture to expect `ext_mnemonic: "lwsync"`.
Note: this is the disassembler-side fix only. The interpreter-side
PPCBUG-088 (lwsync vs sync semantics) is separate.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PPCBUG-640: For BO=16 (bdnz: decrement CTR, branch if non-zero, ignore CR)
and BO=18 (bdz: same with branch-if-zero), `fmt_bc` fell through to the
`if decr` block and computed `cond_name_opt` from the don't-care BI=0 /
cond_true=false pair, yielding `Some("ge")`. The output was therefore
`bdnzge` / `bdzge` — a CTR-only branch with a spurious CR-derived suffix.
PPCBUG-650 (companion): the golden fixture pinned the wrong output, so
the regression had no detection signal until now.
`fmt_bclr` already had the correct `if decr && uncond` guard at line 872
producing `bdnzlr` / `bdzlr`. `fmt_bc` lacked the equivalent.
Fix: gate the condition string on `!uncond` inside the `if decr` block.
For BO=16/18 (uncond bit set), the condition suffix is now empty.
Tests: extended_mnemonics.json fixture rows for bdnz/bdz now expect the
correct `ext_mnemonic: "bdnz"` / `"bdz"`.
Impact: every analysis-DB query for `bdnz` loops (common in pixel-shader
and vertex processing) was returning zero rows; matches stored as `bdnzge`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PPCBUG-053: bcx and bclrx tested `ctx.ctr != 0` against the full 64-bit
register, but the Xbox 360 ABI runs CTR as a 32-bit counter (canary
explicitly truncates: `f.Truncate(ctr, INT32_TYPE)`). When upstream 64-bit
GPR pollution flowed through `mtspr CTR, rN`, the upper 32 bits stayed
non-zero forever; bdnz then looped past the intended 32-bit zero point
because the 64-bit comparison still saw the high bits.
PPCBUG-054: `mtspr CTR` writeback wrote the full 64-bit GPR value,
acting as a firewall gap that fed PPCBUG-053. Defensive truncation
prevents CTR from ever acquiring non-zero upper 32 bits independently
of the GPR-pollution source.
Fixes:
- interpreter.rs:849, 879: ctr_ok now uses `(ctx.ctr as u32) != 0`
- interpreter.rs:1523: mtspr CTR writes `val as u32 as u64`
Tests:
- bcx_bdnz_uses_32bit_ctr_compare: bdnz with CTR=0x0000_0001_0000_0001
decrements to 0x0000_0001_0000_0000 and exits (low 32 bits = 0).
- bclrx_uses_32bit_ctr_compare: same coverage for bdnzlr.
- mtspr_ctr_truncates_to_32_bits: gpr=0xFFFF_FFFF_8000_0001 → ctr=0x8000_0001.
Coupled fix per the audit: PPCBUG-053 and PPCBUG-054 land together because
either alone is necessary-but-not-sufficient — the truncation prevents new
pollution, the 32-bit compare protects against any pollution that slipped
in via routes other than mtspr (e.g. mfctr-mtctr roundtrips).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PPCBUG-424: vmaddfp128 computed VA×VB+VD instead of ISA-mandated VA×VD+VB.
PPCBUG-425: vmaddcfp128 computed VD×VB+VA instead of ISA-mandated VA×VD+VB.
Root-cause discovered while writing the operand-order regression tests:
va128() was extracting PPC bits 6-10 (the same field as vd128's low 5 bits),
not PPC bits 11-15 where VA lives in VX128 form. This meant va128() silently
aliased vd128 for any instruction where VA != VD, making the operand swap
invisible in the existing denorm-flush test (which used VA == VD == v2).
Fixes in this commit:
- decoder.rs: va128() now extracts PPC bits 11-15 (host bits 20-16) + bit29.
The vmx128_va128_uses_bit29 test encoding updated to match the correct field.
- interpreter.rs: vmaddfp128 changed from ai.mul_add(bi,di) to ai.mul_add(di,bi)
(VA×VD+VB). vmaddcfp128 changed from di.mul_add(bi,ai) to ai.mul_add(di,bi).
vmaddfp128_flushes_denormal_inputs redesigned with distinct VA/VD/VB registers
(v1/v2/v3) so the flush test is independent of the accessor fix.
New vmaddfp128_operand_order_va_times_vd_plus_vb and
vmaddcfp128_operand_order_va_times_vd_plus_vb tests verify 2×3+10=16.
- disasm_goldens.rs + vmx128_registers.json: vmaddfp128/vmaddcfp128/vnmsubfp128
golden raws updated to properly encode VA at PPC bits 11-15 (new raws:
0x146328D4 / 0x14632914 / 0x14632954). vperm128 / vsrw128 golden operands
updated to reflect correct VA extraction (v4 instead of v3/v0).
Affects all VMX128 binary ops that call va128(): vaddfp128, vsubfp128,
vmulfp128, vmaddfp128, vmaddcfp128, vnmsubfp128, vperm128, vsrw128 etc.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stvewx128 was aligning EA to 16 bytes and writing all 16 bytes of the
vector, corrupting 12 adjacent bytes on every call. ISA semantics:
word-align EA, extract word lane (EA & 0xF) >> 2, write 4 bytes only.
The non-128 stvewx was already correct; stvewx128 was never updated.
Mirror the stvewx body with instr.vs128() substituted for instr.rs().
The invalidate_for_write call from P1 now covers the correct word-aligned
EA rather than the over-wide 16-byte range.
interpreter.rs: stvewx128 arm (~line 2984)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vpkd3d128 was storing the pack codec output directly into vd128 without
applying the MakePermuteMask permutation that merges the packed scalar(s)
into the previous register value according to pack (slot layout) and shift
(destination lane offset).
PPCBUG-363: vpkd3d128 was missing the post-pack lane-placement step.
PPCBUG-369: vpkd3d128 pack field not extracted; pack=0 still worked
(identity), but pack=1/2/3 always wrote raw out instead of blending.
Fix: extract `pack = uimm & 3` and `shift = instr.vx128_4_z()` from the
VX128_4 IMM and z fields. For pack==0 (identity) store out directly as
before. For pack 1-3, read the existing vd128 value and select 4 u32
words from {prev, out} using the 3×4 static permutation tables from
canary ppc_emit_altivec.cc:2126-2188.
Tables derived from canary MakePermuteMask(r0,l0,…r3,l3):
pack=1 (VPACK_32): out[3] placed at lane (3-shift), prev elsewhere
pack=2 (64-bit): out[2..3] placed at lanes (2-shift)..(3-shift)
pack=3 (64-bit): same as pack=2 except shift=3 → out[2] at lane 3
Tests: vpkd3d128_pack0_legacy_unchanged, vpkd3d128_pack1_shift0_d3d_vertex_pack,
vpkd3d128_pack1_shift3_puts_out3_at_lane0
interpreter.rs: vpkd3d128 arm (~line 3999)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PPCBUG-565: Add vx128_5_sh() to decoder.rs — 4-bit shift at PPC bits
22-25 (host bits 6-9). The correct MSB is at PPC bit 22 (host bit 9).
PPCBUG-361: vsldoi128 was reading the SH MSB from host bit 4 (PPC bit
27, reserved) instead of host bit 9 (PPC bit 22). All shift amounts >= 8
decoded incorrectly (e.g. shift=8 executed as shift=0). Replace the
inline bit-shuffle with instr.vx128_5_sh().
Also fix vx128_p_perm_assembles_correctly test: replace nonexistent
DecodedInstr::from_raw() calls with struct literal construction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPCBUG-563: Add vx128_4_imm() (PPC bits 11-15) and vx128_4_z() (PPC bits
24-25) accessors to decoder.rs for VX128_4-form instructions.
PPCBUG-315: vrlimi128 was reading z from host bits 16-17 (a subset of IMM)
and mask from host bits 2-5 (a reserved/XO region). Replace with the
correct accessors: z selects which word-lane to start the rotation from
(0-3); IMM is the 5-bit per-lane blend mask.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPCBUG-562: Add vc_rc_bit() (PPC bit 21) and vx128r_rc_bit() (PPC bit 27)
to decoder.rs. The generic rc_bit() reads bit 0 (PPC bit 31); all vcmp XO
values are even so bit 0 is always 0, making CR6 permanently dead.
PPCBUG-275/276/420/421: Replace rc_bit() with vc_rc_bit() at all 8 pure
VC-form vcmp arms (vcmpequb, vcmpequh, vcmpgtub, vcmpgtsb, vcmpgtuh,
vcmpgtsh, vcmpgtuw, vcmpgtsw) and with the correct per-form accessor at
the 4 combined arms (vcmpeqfp|128, vcmpgefp|128, vcmpgtfp|128,
vcmpequw|128) and vcmpbfp|128.
PPCBUG-422: VX128_R-form 128-variants in combined arms now use
vx128r_rc_bit() instead of vc_rc_bit().
PPCBUG-423/600: Add 5 dot-form key entries to decode_op6 so
vcmp*fp128./vcmpequw128. decode as the correct opcode instead of Invalid.
Uses a 5-bit key (bits22-24 + bit25 + bit27) for dot-forms to avoid
aliasing against the shift/merge group (which sets bit25=1 when bit27=1).
Interpreter uses vx128r_rc_bit() to conditionally update CR6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPCBUG-561: Add DecodedInstr::mb_md() to decoder.rs — the correct MD-form
6-bit mask-begin reconstruction (MB[4:0] at PPC bits 21-25, MB[5] at PPC
bit 26). The disassembler already had the correct local formula; this
promotes it to a single source of truth on DecodedInstr.
PPCBUG-046: All 6 doubleword-rotate arms (rldicl, rldicr, rldic, rldimi,
rldcl, rldcr) inlined "(instr.mb() << 1) | ((instr.raw >> 1) & 1)" which
reads SH5 (host bit 1) instead of MB5 (host bit 5). For the canonical
"clrldi r3, r4, 32" zero-extend idiom (mb=32 → MB5=1, MB[4:0]=0), the
wrong formula produced mb=0, making the instruction a no-op and leaving
upper 32 bits of the GPR polluted. Replace all 6 sites with instr.mb_md().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPCBUG-040: decoder.rs sh64() assembled the XS-form shift amount as
(SH[4:0] << 1) | SH[5] instead of (SH[5] << 5) | SH[4:0]. Every
`sradi` with shift N ∈ 1..=62 executed with a completely wrong shift
count (e.g. shift=32 executed as shift=1).
PPCBUG-560: disasm_goldens.rs rldicl() test helper was encoding sh[5:1]
at PPC bits 16-20 and sh[0] at PPC bit 30 — exactly backwards. The wrong
encoder and wrong decoder cancelled out, hiding PPCBUG-040 from tests.
Fix both together so tests validate ISA-correct encodings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPCBUG-160 partial: stswi's single invalidate_for_write(ea) only covered
the first cache line; with nb up to 32, the write span can cross a 128-byte
line boundary. Replace with two-call guard:
first_line = ea & !RESERVATION_MASK
last_line = ea.wrapping_add(nb - 1) & !RESERVATION_MASK
invalidate first; if last != first, invalidate last.
PPCBUG-160 partial: stswx had the same single-call gap; nb from XER[0:6]
can be up to 127 bytes. Same two-call guard applied; wrapped in `if nb > 0`
to guard against nb==0 underflow (XER TBC field is 0 when no bytes to store).
dcbz: zeroes 32 bytes at a 32-byte-aligned EA — touches exactly one 128-byte
cache line; add canonical single-call invalidate guard (was entirely missing).
dcbz128: zeroes 128 bytes at a 128-byte-aligned EA — one full reservation
line; add canonical single-call invalidate guard (was entirely missing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds doc comments above lwarx/ldarx/stwcx./stdcx. clarifying that the
legacy per-ctx reservation path is only correct in strict lockstep
(single host thread); under --parallel the M3 scheduler must enable
the cross-thread ReservationTable before spawning a second host thread.
A debug_assert fires in the legacy stwcx./stdcx. branch if a
non-primary HW slot (hw_id != 0) takes that path — surfacing
ReservationTable-disabled misconfiguration early in debug builds.
Note: the primary slot (hw_id==0) racing other parallel slots is
not caught by the assert; that case requires the table to be enabled.
Affected:
PPCBUG-108 legacy per-ctx reservation path cannot invalidate
cross-thread; informational — no behavioral change
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Track lwarx vs ldarx reservation width in PpcContext as a u8 (4 = word,
8 = doubleword, 0 = none). stwcx. requires width==4; stdcx. requires
width==8. Cross-width pairs (lwarx + stdcx., ldarx + stwcx.) now fail
deterministically with CR0.EQ=0 instead of spuriously succeeding.
The width is held per-thread; the cross-thread reservation table keeps
its existing slot encoding because each host thread consults its own
ctx.reservation_width before committing.
Affected:
PPCBUG-151 stwcx./stdcx. shared the same reservation slot without
width discriminator; cross-width commits silently succeeded
Tests: lwarx_then_stdcx_cross_width_fails,
ldarx_then_stwcx_cross_width_fails
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continuation of the PPCBUG-107 cascade sweep. All 16 VMX store opcodes
(stvx/stvxl, stvebx/stvehx/stvewx, stvlx/stvrx and 128 variants of each)
now invalidate the reservation table before writing.
stvlx/stvrx partial-vector stores can write at non-16-byte-aligned EAs;
they invalidate both potentially-touched cache lines.
stvewx128 currently writes 16 bytes at the wrong EA scope (PPCBUG-510);
the invalidate guard fires at that over-wide EA today and will narrow
automatically when PPCBUG-510 is fixed in P3.
Affected:
PPCBUG-511 stvx, stvx128, stvxl, stvxl128
PPCBUG-512 stvebx, stvehx, stvewx, stvewx128
PPCBUG-513 stvlx, stvlx128, stvlxl, stvlxl128
PPCBUG-514 stvrx, stvrx128, stvrxl, stvrxl128
Tests: lwarx_then_plain_stvx_invalidates_reservation,
lwarx_then_plain_stvlx_invalidates_reservation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>