xenia-rs

Author	SHA1	Message	Date
MechaCat02	de21c7a544	[iterate-2G] db16cyc spin-hint cooperative yield: unblock title-screen 0x10a0 gate The silph title state machine (tid13) blocked on event 0x10a0, never signaled. Root: the event's producer chain runs on the silph worker (entry 0x821C4AD0, our tid14), which was starved. tid14 shares a HW slot with a guest spinlock/ barrier participant (sub_824D1328, entry 0x824D2940) that busy-spins on the db16cyc hint `or r31,r31,r31` (encoding 0x7FFFFB78) at 0x824D140C. Under our round-robin lockstep the spinner consumed its whole block every round and starved the co-located tid14 (only 9 progress hits over 200M instr) — so the producer never reached the event-create/duplicate/signal dance the canary oracle performs (handle F80000E8 set by the submitter F8000044 via a duplicated handle). Fix (canary-faithful): recognize the db16cyc spin hint exactly as canary's InstrEmit_orx does (code 0x7FFFFB78 -> DelayExecution) and surface it as a new StepResult::Yield. The scheduler's yield_current() promotes every Ready peer on the slot past STARVE_LIMIT so begin_slot_visit picks one next round, then they reset and the spinner reclaims the slot — fair alternation, no priority inversion, pure function of slot state (deterministic). Result (lockstep, cache-persist, -n 200M): tid14 progresses past its old stall into a real wait; tid13 advances off 0x10a0 to a new event; hub/submitter re-enter their wait loops. imports 280k->592k, packets 124M->164M, swaps 1->2. draws still 0 (the splash's first draw is a further-upstream gate). Determinism preserved (two cold n50m runs byte-identical). n50m golden re-baselined (imports 90296->339766, swaps 1->2; draws unchanged 0). n2m golden unchanged (db16cyc not reached in first 2M). Tests 670/670. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 10:38:17 +02:00
MechaCat02	f3b7e8b760	[iterate-2F] Scheduler anti-starvation floor: fix job-4 handoff render gate The lockstep scheduler's pick_runnable is strict priority (max_by_key (priority, -idx)). On a cooperative single-host HW slot, a CPU-bound spinner that never blocks (the silph poll loop pinned by affinity to hw=5) wins pick_runnable every round forever, permanently starving a co-located peer (the submitter, tid6) that the spinner is actually waiting on. On real hardware those threads run on separate SMT contexts concurrently, so the spinner never starves the submitter; ours collapses them onto one slot with no anti-starvation, turning priority (or equal-priority index order) into permanent starvation. The starved submitter never dequeued job-4 -> the worker-hub (tid5) blocked INFINITE on completion event 0x1080 -> silph (tid13) wedged on 0x1078 -> no vsync -> draws_seen=0, the publisher splash never renders. (decrement_quantum's within-slot rotation is dead: begin_slot_visit unconditionally re-pick_runnable()s each round, discarding the rotated running_idx. The fix is therefore evaluated at pick time, not via that discarded rotation.) Fix (Option A, bounded anti-starvation, deterministic): - Add per-thread steps_starved counter to GuestThread. - begin_slot_visit increments it for every Ready peer passed over this visit, resets it to 0 for the picked thread. - pick_runnable selects by effective_priority: once steps_starved reaches STARVE_LIMIT (4096) the thread is lifted to i32::MAX and wins exactly one pick, then resets. The genuinely higher-priority thread still wins ~4095/4096 visits -- the boost grants periodic forward progress only, it does NOT invert priority. Pure function of counter/priority/index -> deterministic (no wall-clock, no RNG). Cascade (lockstep exec, XENIA_CACHE_PERSIST=1, -n 200M): - submitter dequeue sub_82458508 now fires 4x (was 3x); the 4th job (buf 0x40baa2c0) is dequeued at cycle 6.15M. - hub tid5 leaves Blocked(0x1080) -> now Ready (no more INFINITE wait). - GPU packets 0 -> 116,101,363 (command stream now flowing). - tid13 (silph::UImpl) advances past the old 0x1078 wedge to a NEW downstream wait (handle 0x10a0); 3 new threads spawn (tid14/15/16). - draws_seen still 0 -> the splash's first draw is a NEW downstream gate, not this starvation. Determinism: two cold lockstep `check -n 5M` runs byte-identical (full and stable digests). New n50m stable digest deterministic across two cold runs. Golden re-baselined: instructions 50000007->50000003, imports 92317->90296 (trajectory shift from the changed pick order). Tests: 666/666 (+1 test_anti_starvation_bounded_progress). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 10:02:02 +02:00
MechaCat02	7e2603a9e5	[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix) Lockstep livelocked the scheduler the same way --parallel did before `0332d19`: the kernel deadline-arithmetic (`now_basis_at`) read per-thread `ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None` so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7, a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its deadline via `parse_timeout` therefore read `now = 0` and registered `deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past. `coord_idle_advance` then re-armed that same constant 3000 deadline forever, pinning virtual time and starving every other thread's real future deadline. Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout WaitForMultiple after its first jobs; that timeout never fired because vtime was pinned at 3000, so virtual time never reached real future deadlines. Fix (Option A — mirror the parallel fix): drive the existing deterministic `Scheduler::global_clock` in lockstep too (floored up once per outer round to `stats.instruction_count`, a pure function of retired guest instructions — no wall-clock), and route `KernelState::now_basis_at` through `global_clock()` in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged (it already read `global_clock()`). Verified (lockstep, 50M): - DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical. - LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique values / 562,084 pinned at 3000 -> 18,586 fires / 18,567 unique / 0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends Blocked with a real future deadline Some(60002824) instead of spin-Ready on the past 3000. - imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls). Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue (sub_82458508) still 3x — the lost 4th-job HANDOFF (count/notify between sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate, not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080 (job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple (sub_824AB214), loop-top stops at cycle 6.23M. draws still 0, VdSwap 1. Golden re-baseline (same commit): sylpheed_n50m instructions 50000004 -> 50000007, imports 1790936 -> 92317 (swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged (livelock onsets after 2M). Suite 665/665 + oracle green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 21:42:28 +02:00
MechaCat02	0332d1990d	[Track 2] Parallel-scoped global clock fixes timebase-desync livelock In --parallel mode a long run livelocked: the scheduler spun "advanced to deadline 3000 waking hw=2 idx=0" ~14k times in microseconds. Root cause: each guest thread owns ctx.timebase (+1/instr in step_block), and all kernel deadline arithmetic read Scheduler::ctx(hw_id).timebase as "now". But the parallel worker extracts its PpcContext via mem::replace(ctx_mut_ref, PpcContext::new()) — leaving a ZEROED timebase in the slot while it steps unlocked — and advance_all_timebases_to only walks runqueue (never idle_ctx). So the coordinator's coord_pre_round drain and a woken thread's parse_timeout could read a zeroed/stale basis decoupled from the deadline the scheduler just advanced to. The thread re-armed the same constant deadline forever; the global clock never moved. Fix: add a single monotonic Scheduler::global_clock, advanced by the per-block retired-instruction count on each parallel writeback and floored up by advance_all_timebases_to. Kernel deadline reads route through KernelState::now_basis_at(hw_id), which returns global_clock ONLY when parallel_active; lockstep keeps reading the exact pre-existing ctx(hw_id).timebase expression, so the deterministic lockstep trace is byte-identical (sylpheed_n50m golden unchanged, zero re-baseline). Verified: - 50M --parallel run completes (was: hung). Deadlines now strictly increasing 5.4M -> 49.1M (18097 unique of 18116; max repeat 2) vs pre-fix constant 3000 x ~14000. - sylpheed_n50m golden byte-identical via plain `check` (no persist). - Full suite 665/665 green. Note: an intermittent parallel hang/crash (~1-2/20 at -n 5M) is pre-existing (master 1/20, this build 2/20 — within noise) and distinct from the timebase livelock: it is a parallel-race class (e.g. the unsafe block_ptr deref in run_execution_parallel). Tracked separately; lockstep remains the recommendation for long runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 19:32:14 +02:00
MechaCat02	b20c99f141	[Subsystem-fixes] 6 verified ours-vs-canary divergence fixes From the 2026-06-12 5-subsystem differential audit. All verified against canary as oracle; 660/660 workspace tests green (655 + 5 new). 1. nt_create_event polarity (exports.rs) — `manual_reset = gpr[5] != 0` was INVERTED. Canary xboxkrnl_threading.cc:668 `Initialize(!event_type,..)` + xevent.cc:41 (type 0 = NotificationEvent = manual, type 1 = Sync = auto). Now `== 0`. Was the dormant 2.AI fix on chore/portable-snapshot, never merged. The Ke-path was already correct; only the Nt-path was wrong. 2. 2.AF deadline drain (main.rs coord_pre_round) — expired KeWait/KeDelay deadlines never fired under load because advance_to_next_wake_if_due was only called in coord_idle_advance (no-Ready-threads path). Added a per-round drain loop; covers BOTH lockstep and parallel outer loops since both call coord_pre_round. Was the dormant 2.AF fix, never merged. 3. handle slab-recycle ABA guard (state.rs + scheduler.rs) — release_handle_slot (my round-34 regression) recycled a closed slot even with a thread still parked on it, risking a stale-waiter wake when the slot is re-minted. Added Scheduler::any_thread_waiting_on; decline to recycle a still-waited slot. 4. vpkpx pixel-pack (vmx.rs) — wrong field mapping (~100% mismatch). Now exact canary ppc_emit_altivec.cc:1795 shift/mask (red 6b out[15:10] from w[24:19], green out[9:5] from w[14:10], blue out[4:0] from w[7:3]; no fabricated alpha bit). +unit test. 5. VFS GDFX attribute plumbing (vfs/, exports.rs query fns) — VfsEntry now carries the real on-disc attribute byte (GDFX dirent +12, canary disc_image_device.cc:136/154) instead of inferring directory-ness from path shape. Query exports report the real FILE_ATTRIBUTE_ bits. Candidate driver of the XamShowDirtyDiscErrorUI gate. +tests. 6. MmGetPhysicalAddress region-aware mirror (exports.rs) — flat 0x1FFFFFFF mask missed canary's +0x1000 host_address_offset for 0xE0000000+ mirror (memory.cc:2317). Read-only query; proven byte-identical 50M digest. +test. Investigated and intentionally NOT changed: - zero-on-recommit: no-op; ours has no region-reuse path (bump allocators, free is a stub). - 32-bit ALU writeback truncation (PPCBUG-020): documented-deliberate; premise (MSR.SF=0) is questionable but flipping it is out of scope here. - KeSetEvent/NtSetEvent return value: ours returns true previous state (hardware-faithful); canary returns constant 1 — NOT an ours bug. sylpheed_n50m golden will need re-baselining (legit behavior change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-12 14:57:38 +02:00
MechaCat02	db90ad0f7d	[AUDIT-059 R-D2] Phase D auto-signal POC confirms audit-049 wedge diagnosis Hook NtCreateEvent for the silph::UImpl tid=13 chain (entry=0x821748F0, start_context=0x4024a840, frame-1 LR=0x821CB15C inside sub_821CB030+0x128) and auto-signal the resulting handle after XENIA_SILPH_UI_AUTOSIGNAL_DELAY instructions. Env-gated; default off. SR4 verdict B (partial unwedge): - handle 0x1078 signal_attempts 0->1 - tid=13 Blocked(WaitAny[0x1078]) -> Ready pc=0x824a9108 - ExCreateThread 10 -> 12 (new silph::UImpl tid=14, worker tid=15) - New downstream wedges 0x1084 + 0x1088 - cxx_throw runtime_error on tid=5 inside R26 dispatcher (BST not-registered instance lhs=0x715a7af0) - VdSwap stays 1; no draws (POC is diagnostic, not final fix) Confirms Phase C diagnosis end-to-end. The real signaler must (a) drive NtSetEvent on the silph KEVENT AND (b) register the dispatcher's BST instance upstream; this POC only does (a). Reading-error class #20: ctx.lr at kernel export entry is the thunk wrapper's return slot, NOT the guest caller's post-bl PC. Walk back-chain 1 step to get frames[1].lr. Reading-error class #21: --parallel and lockstep have SEPARATE outer loops in main.rs (run_execution_parallel line 2928 vs run_execution line 2706). Per-round hooks must be wired in BOTH paths. Files: - crates/xenia-cpu/src/scheduler.rs: GuestThread.start_entry/start_context fields + spawn() population + current_thread_entry_and_ctx() helper - crates/xenia-kernel/src/state.rs: AutoSignalPending struct, env-parsed silph_autosignal_delay, pending Vec, last_cycle_hint, set_now_cycle_hint, maybe_register_silph_autosignal (walks back-chain), fire_due_silph_autosignals - crates/xenia-kernel/src/exports.rs: hook in nt_create_event - crates/xenia-app/src/main.rs: fire-site + cycle hint in both outer loops - audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md Tests 655/655 green. Default behavior byte-identical when env unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-11 18:38:38 +02:00
MechaCat02	c36cca14f9	xenia-cpu: VMX128, FPSCR, decoder split, scheduler, decode/block caches Split the monolithic interpreter into cohesive modules: dedicated decoder (decoder.rs) producing 8-byte DecodedInstr; opcode tables (opcode.rs); explicit traps (trap.rs); FPSCR helpers (fpscr.rs); overflow/carry helpers (overflow.rs); a 4 KiB-page-versioned decode cache and basic-block cache (block_cache.rs); and a full VMX/VMX128 implementation (vmx.rs) covering AltiVec + Xenon's 128-bit extensions. Add the parallel-execution substrate behind --parallel: a 7-party phaser (phaser.rs) for round-based barrier sync, ReservationTable (reservation.rs) for guest LL/SC, and the per-HW-thread scheduler core (scheduler.rs) that owns ThreadRefs, runqueues, and pending IRQs. Disassembler is now the single source of truth: disasm.rs gains the full base + extended + VMX128 mnemonic set, with golden JSON fixtures and a disasm_goldens test suite. Add a criterion-style interpreter bench. context.rs grows the per-thread state the new modules need (reservation slot, FPSCR, vector regs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:27:43 +02:00

7 Commits