xenia-rs

Author	SHA1	Message	Date
MechaCat02	acb29db444	[iterate-3AL] Superblock dispatch: chain basic blocks per slot-visit (~1.6x boot-to-splash) Replace the one-basic-block-per-slot-per-round lockstep dispatch with a SUPERBLOCK runner: each slot-visit chains straight-line blocks through their terminating branches up to a deterministic instruction budget, amortizing the per-round (timebase/coord/round_schedule) and per-slot (worker_prologue) dispatch tax over ~128 instructions instead of ~6. Yield-points (end the chain, return to the round) are pure functions of guest state, preserving the lockstep cross-thread interleaving correctness: - non-Continue step result (Yield/SystemCall/Trap/Unimpl/Halted); db16cyc Yield is the spin-wait producer hand-off. - sync-sensitive block: lwarx/ldarx/stwcx./stdcx. or sync/eieio/isync (new PpcOpcode::is_sync_sensitive, flagged on DecodedBlock at build). - MMIO touch: new GuestMemory::mmio_access_count() watermark, sampled per block, keeps GPU/register ordering at one-block granularity. - next PC leaves ordinary guest code (import thunk / halt sentinel / unmapped) -> hand to the full worker_prologue next round. - instruction budget reached. Instruction-count/clock accounting stays exact: per-block cycle_count deltas are summed and handed to worker_epilogue once (instruction_count + decrement_quantum advance by the precise retired count). XENIA_SUPERBLOCK_BUDGET=1 reproduces the old one-block schedule byte-for-byte. Budget tuned to 128 (env-overridable): boot progression stays healthy up to 256, sharp cliff at ~384 (a boot producer/consumer handoff starves); 128 is 3x below the cliff. Also scale the inline-GPU per-round fairness cap with the budget (flat 64 throttled GPU command processing 17x under superblocks and collapsed the present loop). PERF (check -n 100M --gpu-inline): 25.3 -> 42.7 MIPS (1.69x); 1B: 26.0 -> 41.4 MIPS (1.59x). 
Callgrind n=5M: host instructions 2.178B -> 1.507B (-31%); worker_prologue -90%, coord_pre_round -91%, begin_slot_visit / round_schedule_into / coord_post_round / update_timestamp_bundle each ~-90%; interpreter execute byte-identical (real work unchanged). GATES: C1 boot progression 150M draws 7391/swaps 2164 (baseline 7415/2172), 1B draws 88547/swaps 29228 linear no stall, K8888 decode + RTs=2 intact. C2 determinism: n50m stable digest byte-identical across fresh runs; golden re-baselined intentionally (pacing-only deltas: imports 333453->243387, draws 1274->1279). C3 milestone-1 render: texture_decodes/draws/swaps/present cadence track baseline (3AJ fade-in pacing preserved). C4: 690 tests green (+2 sync_sensitive). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 22:31:54 +02:00
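A minimal sketch of the superblock chaining rule described in the message above, not the project's actual runner. `Block`, `StepResult`, and `run_superblock` are hypothetical stand-ins: a slot-visit keeps chaining blocks until a block reports a non-Continue result, is flagged sync-sensitive, leaves ordinary guest code (`next == None`), or the instruction budget is reached. MMIO watermarking is omitted.

```rust
/// Hypothetical, simplified stand-ins for the emulator's real types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StepResult {
    Continue,
    Yield,
}

#[derive(Clone, Copy)]
struct Block {
    instr_count: u32,     // instructions retired by executing this block
    sync_sensitive: bool, // block contains lwarx/stwcx./sync-style instructions
    result: StepResult,   // outcome of executing the block
    next: Option<usize>,  // successor block; None = next PC leaves ordinary guest code
}

/// Chains blocks starting at `start` until a yield-point is hit.
/// Returns (retired instruction count, block to resume at next round).
/// Assumes every block retires at least one instruction, so the budget bounds the loop.
fn run_superblock(blocks: &[Block], start: usize, budget: u32) -> (u32, Option<usize>) {
    let mut retired = 0u32;
    let mut cur = start;
    loop {
        let b = blocks[cur];
        retired += b.instr_count;
        // Every yield-point is a pure function of the block and the running count.
        let stop = b.result != StepResult::Continue || b.sync_sensitive || retired >= budget;
        match b.next {
            Some(n) if !stop => cur = n,
            next => return (retired, next),
        }
    }
}
```

With a budget of 1 the chain ends after every block, which is the behaviour the message attributes to XENIA_SUPERBLOCK_BUDGET=1. The retired count is summed once per visit, mirroring the single hand-off to the epilogue.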
MechaCat02	dc1320cd4b	[iterate-3AK] Perf quick-wins: ~21% faster boot-to-splash (22→27 MIPS) Profile-driven low-risk optimizations attacking the ~48% per-block / per-round host-bookkeeping tax found by the callgrind profile. Measured on the bounded headless workload `check -n 100000000 --gpu-inline`: baseline ~4490 ms (22.3 MIPS) -> ~3700 ms (27.0 MIPS), +21%. Tier A (determinism-neutral; n50m golden byte-IDENTICAL, exit 0): 1. mem-watch write path: gate capture_mem_watch_old/check_mem_watch behind one has_mem_watch() predicted branch in write_u8/16/32/64 + write_bulk so the common (no-watch) store does no out-of-line call. check_mem_watch (4.8%) gone from the profile. 2. round-schedule alloc churn: add Scheduler::round_schedule_into filling a reusable [u8; HW_THREAD_COUNT] stack buffer; the lockstep round loop no longer __rust_alloc/__rust_dealloc a Vec<u8> per round. Identical ordering/RNG-advance. __rust_alloc/dealloc gone from the profile. 3. probe-firing: hoist a single KernelState::any_probe_active() guard to worker_prologue so the four fire_*_if_match calls don't happen at all when no probe is configured (was 4x call overhead/visit). All four gone from the profile. 4. thunk-map hash: range-reject pc against the registered import-thunk address band (KernelState::pc_in_thunk_band, two int compares) before the thunk_map.get(&pc) HashMap lookup. hash_one (4.3%) gone. Tier B (#5, time-granularity change — LANDED, no re-baseline needed): 5. update_timestamp_bundle: throttle to a 0.25 ms quantum (only re-write the KeTimeStampBundle when the deterministic clock advanced >= 2500 units). Inclusive cost 8.65% -> 1.08%. The quantum is far below the 1 ms granularity any guest deadline math needs (tick_count stays fresh; the hub gate is +66 ms; the fade-in is vsync-counter driven per 3AH, not this bundle). 
VERIFIED: n50m stable digest BYTE-IDENTICAL to the existing golden (so no re-baseline), 150M boot reaches the splash (draws=7415, swaps=2172, gpu.texture.decode{K8888}=448, RTs=2 — all match the post-3AJ baseline), 688 tests green, release n50m oracle ok. Remaining headroom: interpreter::execute (13%), decrement_quantum (8%), step_block (7%) are now the top self-costs — the structural superblock/JIT lever is the next step for the larger gain. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 22:05:53 +02:00
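Item 4 above (range-reject before the hash lookup) can be illustrated with a small sketch. `ThunkTable` and its fields are hypothetical; only the idea of two integer compares in front of `HashMap::get` comes from the message.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the kernel state's import-thunk registry.
struct ThunkTable {
    band_lo: u32,           // lowest registered thunk address (inclusive)
    band_hi: u32,           // highest registered thunk address (inclusive)
    map: HashMap<u32, u32>, // thunk pc -> import ordinal
}

impl ThunkTable {
    fn new() -> Self {
        // An empty band (lo > hi) rejects every pc.
        ThunkTable { band_lo: u32::MAX, band_hi: 0, map: HashMap::new() }
    }

    fn register(&mut self, pc: u32, ordinal: u32) {
        self.band_lo = self.band_lo.min(pc);
        self.band_hi = self.band_hi.max(pc);
        self.map.insert(pc, ordinal);
    }

    /// Two integer compares; cheap enough for the per-block hot path.
    fn pc_in_thunk_band(&self, pc: u32) -> bool {
        pc >= self.band_lo && pc <= self.band_hi
    }

    /// Only pays for hashing when the pc can possibly be a thunk.
    fn lookup(&self, pc: u32) -> Option<u32> {
        if !self.pc_in_thunk_band(pc) {
            return None;
        }
        self.map.get(&pc).copied()
    }
}
```

The band check is conservative: a pc inside the band that is not a thunk still falls through to the map and gets `None`, so results are identical to an unguarded lookup.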
MechaCat02	9d24dd0eaa	[iterate-3AJ] Present-anchor vsync so the splash logo fade-in renders The publisher/dev splash logo's intro fade-IN was skipped: the logo popped in at full brightness instead of ramping dim->bright like the canary oracle. Root (measured, iterate-3AF/3AI): ours' guest vsync counter is fed by a fixed-instruction-quantum proxy (one vsync per 150k retired instructions). During the ~1.1s splash asset-load the title's frame pump runs ~10M instructions inside a single guest frame, so the proxy fired ~66 vsyncs in that one frame. The pump's per-frame delta (counter_now - counter_last) was therefore ~66 on the first tick, which the anim tick (sub_823CDBF8) divides into the fade counter [item+72] @ 0x40c0add0 -> the counter JUMPED 0->0x42(66) in one step, landing past the fade-in region. Canary's wall-clock 60Hz vblank advances ~1 per heavy load frame, so its counter ramps smoothly 0->66 and the fade-in renders. Fix: anchor the lockstep vsync ticker to the guest's real present rate (VdSwap count), mirroring real hardware where the title double-buffers at vblank, so one heavy guest frame advances the vsync counter by ~1 instead of ~66. - interrupts.rs: tick_vsync_instr now takes the live present count. Two regimes: (1) bootstrap, before the guest's first present, keeps the original fixed instruction quantum unchanged -- the iterate-2W present-loop bootstrap needs vsyncs delivered BEFORE it can present (measured: callback registered ~6M instr, first delivered vsync and first present coincide; pure present-driven vsync would deadlock). (2) present-anchored, after the first present: one vblank per present, plus a small DRY_FALLBACK_CAP=4 instruction-quantum fallback per dry window so a non-presenting frame still ticks a few vsyncs (a small ramp like canary's 0/5/10/2/1...) without re-spiking to 66. - handle.rs: cheap GpuBackend::swaps_seen() accessor. - main.rs: pass the live present count into the lockstep ticker. 
Not masking: the fade dt/counter is never clamped or synthesized; the guest naturally computes a smooth dt once vblank tracks presents. Verified: - V1: fade counter 0x40c0add0 now ramps 0,6,8,10,12,13,+1... (was a 0->0x42 jump; direct baseline-vs-fix mem-watch). - V2 (--ui readback via per-frame logo vertex-alpha): logo alpha ramps 102,136,204,221,238,254 (dim->bright fade-IN) vs baseline all 255 (pop-in). Real artwork (has_real_vertices) still renders; milestone-1 intact. - V3: 150M boot progression intact -- texture_decodes=2, RTs=2, tex_cache=1 unchanged; draws/swaps higher (tighter present loop), 1B sanity linear, no stall/collapse. - V4: 50M --gpu-inline --stable-digest byte-identical 2x; golden re-baselined intentionally (pacing-only delta: draws 718->1274, swaps 147->259; structural fields unchanged). 688 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 21:01:33 +02:00
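The two vsync regimes can be sketched as follows. This is an illustration under assumptions, not the code in interrupts.rs: `VsyncTicker` and its fields are hypothetical, the 150k quantum and DRY_FALLBACK_CAP=4 are taken from the message, and the choice to discard elapsed quanta when a present arrives is a simplification.

```rust
const VSYNC_INSTR_QUANTUM: u64 = 150_000; // one proxy vsync per 150k retired instructions
const DRY_FALLBACK_CAP: u32 = 4;          // max fallback vsyncs per window without a present

/// Hypothetical ticker state.
#[derive(Default)]
struct VsyncTicker {
    last_tick_instr: u64, // instruction count at the last consumed quantum boundary
    presents_seen: u64,   // presents already converted into vblanks
    dry_ticks: u32,       // fallback vsyncs issued since the last present
}

impl VsyncTicker {
    /// Returns how many vsyncs to deliver for this call.
    fn tick(&mut self, instr_now: u64, presents_now: u64) -> u32 {
        let quanta = instr_now.saturating_sub(self.last_tick_instr) / VSYNC_INSTR_QUANTUM;
        self.last_tick_instr += quanta * VSYNC_INSTR_QUANTUM;

        // Regime 1 (bootstrap): no present yet, keep the fixed instruction quantum.
        if presents_now == 0 {
            return quanta as u32;
        }

        // Regime 2 (present-anchored): one vblank per new present.
        let new_presents = presents_now.saturating_sub(self.presents_seen);
        self.presents_seen = presents_now;
        if new_presents > 0 {
            self.dry_ticks = 0;
            return new_presents as u32;
        }

        // Dry window: allow a few quantum-driven vsyncs, capped so a long
        // non-presenting frame cannot spike the counter again.
        let allowed = u64::from(DRY_FALLBACK_CAP - self.dry_ticks);
        let fallback = quanta.min(allowed) as u32;
        self.dry_ticks += fallback;
        fallback
    }
}
```

In this sketch a 10M-instruction frame yields 66 vsyncs during bootstrap but only 1 once presents are flowing, which is the 66-to-1 change the message describes.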
MechaCat02	c62a355418	[iterate-3AE] Fix spurious WHITE TRIANGLE flashing before each splash logo The publisher and developer splash logos rendered correctly, but a fullscreen OPAQUE WHITE diagonal half-triangle flashed at boot, before each logo, and persisted across the dev-logo transition — canary shows a black background there. Readback-isolated it (env-gated frontbuffer grid + per-draw inventory, both removed) to the background-fill draws. ROOT (measured, refutes the prior "saturate/interpreter/depth" guesses): the position-only VS `0xd4c14f46` (one vfetch → oPos; exports NO color) paired with PS `0xed732b5a` (`ocolor0 = interp0`). The iterate-3T translator seeded `ointerp[0] = (1,1,1,1)` "so a VS that only exports position still yields a visible non-zero color" — a debug FAKE: it injects white that no guest value backs. So that fill's interp0 stayed white → opaque-white fullscreen triangle. Vertex windows of a WHITE frame and a steady BLACK frame were byte-identical; served_translated=true for all of them and depth is disabled in the replay, so the white came purely from the injected seed, not saturate/interp/depth. FIX (UI-translator only, golden byte-identical): - translator.rs: default un-exported interpolators to (0,0,0,0) instead of seeding interp0 white. A position-only VS now contributes nothing visible under its real blend (RGB=0 → black; A=0 → premult transparent), matching canary; every VS that really exports interp0 (the logo `0x03b7b020`, the color fill `0x36660986`) overwrites the seed → logos unaffected. - app.rs: clear the splash frontbuffer to BLACK, not the iterate-3S navy placeholder `[0.04,0.04,0.06]` (never matched to the guest). The fill is a fullscreen Xbox-360 RectangleList drawn as a single triangle in the replay (4th implied corner not yet synthesized), so its uncovered half exposed the clear; black makes the transition uniformly black like the oracle. (Full RectangleList→rectangle expansion is a separate follow-up.) 
READBACK (env-gated, removed): white-heavy frames 200+ → 0; navy frames 240 → 0; transition frames uniformly black; the publisher logo (white text + red dots) and the developer logos (colored, on black) still render. Determinism: changes feed only the UI translator/clear; n50m --gpu-inline --stable-digest byte-identical 2× and matches the committed golden (--expect exit 0). cargo test --workspace 686 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 14:09:54 +02:00
MechaCat02	3f8d3b6f1c	[iterate-3AD] Fix 2nd splash logo rendering black: re-upload evolving atlas The publisher (SQUARE ENIX) and the 2nd developer/studio splash logo share one K8888 atlas at physical base 0x4dbee000, sampled at different UVs. The publisher's white text occupies the top V-bands; the developer logo's (bluish/gold) artwork is CPU-written into the SAME surface AFTER the publisher frame, so the atlas evolves across frames. The UI host texture cache (`texture_cache_host::upload`) only re-uploads a `TextureKey` when `version_when_uploaded` increases. But the per-draw bind in `render.rs` hardcoded `version_when_uploaded = 1` for every draw, so once the atlas was first uploaded (during the publisher frame, with only the top bands filled) the cache pinned that partial upload. The 2nd logo, sampling a V-band that was still zero at first-upload time, read transparent-black -> rendered nothing (the "white-triangle / black stub" the user saw after SQUARE ENIX). Verdict: (G) a legitimate 2nd LOGO item whose real artwork lives in the same evolving atlas — NOT a spurious 3rd item, and NOT a geometry/shader/blend gap. Measured via readback: the 2nd-logo geometry rasterizes correctly (3 on-screen quads), interp1 (UV) and interp0 (color) reach the PS with real values, the texture content at the sampled bands exists — only the bound wgpu texture was the stale partial upload. Fix (UI-only, deterministic core untouched): - `gpu_system`: thread the real content `version` (from `span_max_version`) into `last_draw_textures` (now `(key, version, bytes)`). - `draw_capture::DrawCapture.textures`: same 3-tuple. - `render.rs`: use the real `version` (not a hardcoded 1) so the host cache re-uploads when the guest fills more of the atlas. - `exports.rs` `vd_swap`: the legacy single-texture `publish_texture` bridge drops the version (`(key, _v, bytes) -> (key, bytes)`). 
Readback (env-gated probe, removed before commit): after the fix the 2nd logo renders real varied artwork (blue + gold texels in a centered strip) instead of black. Determinism: `check -n50m --gpu-inline --stable-digest` byte-identical to the `c0c6088` baseline (captured both via git-stash). 686 tests green. No faking — real decoded texels through the real guest draw. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 13:38:35 +02:00
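The stale-atlas failure mode is easy to reproduce in miniature. The cache below is a hypothetical stand-in for `texture_cache_host::upload` (the real one uploads to wgpu; this one just stores bytes) and shows why a hardcoded version of 1 pins the first partial upload.

```rust
use std::collections::HashMap;

/// Hypothetical key; the real `TextureKey` carries more fields.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct TextureKey {
    base: u32,
    width: u32,
    height: u32,
}

#[derive(Default)]
struct HostTextureCache {
    entries: HashMap<TextureKey, (u64, Vec<u8>)>, // (version_when_uploaded, texels)
    uploads: u32,
}

impl HostTextureCache {
    /// Re-uploads only when `version` is newer than the cached one.
    /// Returns true when bytes were actually (re)uploaded.
    fn upload(&mut self, key: TextureKey, version: u64, bytes: &[u8]) -> bool {
        let stale = self.entries.get(&key).map_or(true, |e| e.0 < version);
        if stale {
            self.entries.insert(key, (version, bytes.to_vec()));
            self.uploads += 1;
        }
        stale
    }
}
```

Calling `upload` twice with version 1 keeps the first bytes even if the guest has since written more of the atlas; passing the real, increasing content version makes the second call replace them.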
MechaCat02	c0c6088e4d	[iterate-3AA] Fix logo upside-down: no Y-flip on the clip-enabled NDC path DEFECT 1 (logo upside down) ROOT + FIX. The publisher "SQUARE ENIX" logo rendered vertically mirrored vs the canary oracle (white upright on black). Measured (env-gated readback + texture-row + per-vertex dumps, all removed): - The K8888 logo texture decodes UPRIGHT (text in the top rows 1..161; the red dots sit at ~43% from the texture top). NOT a decoder row-order bug. - The logo geometry is a centered QuadList whose vertices are emitted in clip space (Y-UP, e.g. pos.y +0.085 top / -0.104 bottom), with the texture V mapped top->bottom (UV v 0.001 at the top vertex, 0.090 at the bottom). On both the Xbox 360 (D3D9) and wgpu, clip +Y maps to the framebuffer top — so a clip-space position is portable with NO Y-flip. - `compute_ndc_xy` unconditionally negated Y (the flip the screen-space pixel path legitimately needs). For the clip-enabled logo this swapped top<->bottom vertices while leaving the texture V unchanged, so the sampled sub-rect read bottom-up: red dots rendered at 58% from the top (a clean vertical mirror) instead of 43%. FIX: keep the Y-flip only on the clip_disable (screen-space pixel) branch where the framebuffer Y-down->wgpu Y-up flip is real; the clip-enabled branch now passes clip-Y-up through identity. Readback after the fix: red dots at 42% from the top (= texture's 43%) -> logo UPRIGHT, still centered. DEFECT 2 (background) was already correct + faithful; 3Z's contradiction is REFUTED by direct readback: the bg fill (vs 0x36660986 / ps 0xed732b5a, fullscreen RectangleList) reads its real vertex color (raw 0x818000c7 = -32896.5 as float) into r0, the PS exports it, and the GPUBUG-115 RB-UNORM saturate (canary spirv_shader_translator.cc:3607) clamps it to 0 -> BLACK, matching canary. The seed r0=(gvidx,...) does NOT show through (it's overwritten by the color vfetch). No code change needed. 
Readback of the full frame now matches canary: WHITE upright "SQUARE ENIX" + red dots on a BLACK field. UI-capture-path only (`compute_ndc_xy` runs solely when frame_captures is Some, i.e. --ui; None headless) -> deterministic core untouched, n50m --gpu-inline --stable-digest exit 0 (DRAW_INDX 275 / K8888 decode 137, identical across runs). cargo test --workspace green. Temp probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 21:18:30 +02:00
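A sketch of the branch split described for `compute_ndc_xy`. The signature here is hypothetical (the real function is said to mirror canary's viewport handling and likely returns a scale/offset rather than a transformed point); it only illustrates that the Y-flip belongs to the screen-space branch.

```rust
/// Hypothetical signature. `clip_disable == true` means `pos` is in
/// render-target pixels (Y down); otherwise `pos` is already clip space (Y up).
fn compute_ndc_xy(pos: [f32; 2], clip_disable: bool, rt_size: [f32; 2]) -> [f32; 2] {
    if clip_disable {
        // Screen-space path: rescale pixels to [-1, 1] and flip Y,
        // because framebuffer Y grows downward while clip Y grows upward.
        let x = pos[0] / rt_size[0] * 2.0 - 1.0;
        let y = 1.0 - pos[1] / rt_size[1] * 2.0;
        [x, y]
    } else {
        // Clip-enabled path: clip +Y already maps to the framebuffer top,
        // so the position passes through unchanged.
        pos
    }
}
```

Negating Y on the second branch as well would swap the top and bottom vertices while leaving the texture V untouched, which is the vertical mirror the message reports.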
MechaCat02	f6f3aac673	[iterate-3Z] Fix logo color (yellow->white): k_8_8_8_8 vfetch + vfetch field/stride/saturate Defect 2 of the three render-fidelity defects vs the canary oracle (the publisher "SQUARE ENIX" logo rendered YELLOW instead of WHITE). Root, measured by readback (env-gated probes, removed): the logo PS multiplies the sampled texture by the interpolated vertex COLOR; the K8888 texture itself decodes correctly (67,667 white texels + 2,087 red — the red dots — zero yellow), so the yellow came from the vertex-color attribute decode. Four coupled, canary-faithful fixes (all UI-translator/capture only — the deterministic headless core is untouched; n50m --gpu-inline --stable-digest golden byte-identical, exit 0): - GPUBUG-112 (translator vfetch): VertexFormat 6 = k_8_8_8_8 (4x u8 normalized, 1 dword), NOT k_16_16 (which is 25) per canary xenos.h:643. The logo color stream is k_8_8_8_8; decoding it as k_16_16 read only 2 of 4 channels and forced BLUE = 0 -> white texture x (R,G,0) = yellow. Now unpacks all four 8-bit channels (canary spirv_shader_translator_fetch.cc k_8_8_8_8 packed_offsets 0/8/16/24); added k_16_16 (format 25) too. - GPUBUG-113 (ucode/fetch): vfetch is_signed / is_normalized / is_mini_fetch bit positions were wrong (read bits 24/25, which sit inside exp_adjust). Per canary ucode.h:757-758,764: signed=format_comp_all (w1 bit12), normalized=(num_format_all==0) (w1 bit13), mini_fetch (w1 bit30). - GPUBUG-114 (translator vfetch): a vfetch_mini reuses the address AND STRIDE of the preceding full vfetch of the same stream (canary ucode.h:733); its own stride field is 0. Track the last full stride per fetch-const and inherit it so a mini color/UV attribute indexes by the real vertex stride, not its tight dword count. - GPUBUG-115 (translator PS export): saturate the color export to [0,1] before the UNORM render-target write, mirroring canary spirv_shader_translator.cc:3607 ("Saturate, flushing NaN to 0"). 
Without it an out-of-range guest color writes garbage to the sRGB target. Verified by env-gated frontbuffer readback (copy_texture_to_buffer, removed before commit): the logo now renders WHITE text + RED dots (bbox centered ~y322-389), zero yellow anywhere. Workspace tests green (added 4: k_8_8_8_8 4-channel unpack, mini-fetch stride inheritance, vfetch bit decode, PS saturate). Determinism: golden byte-identical. Remaining (defects 1 & 3, see memory iterate-3Z): logo orientation and the ed732b5a fullscreen background fill (renders ~white, canary shows black) — both localized but not yet cleanly resolved; plan in the memory file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 20:58:21 +02:00
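Two of the fixes above (the four-channel k_8_8_8_8 unpack and the saturate-before-UNORM export) reduce to a few lines of arithmetic. The function names are illustrative, endian swapping is left out, and which component maps to R/G/B/A is not asserted here; only the bit offsets 0/8/16/24 and the "flush NaN to 0" rule come from the message.

```rust
/// Unpacks one dword holding four unsigned-normalized 8-bit components.
/// Component i sits at bit offset 8*i (packed offsets 0/8/16/24).
fn unpack_8_8_8_8_unorm(dword: u32) -> [f32; 4] {
    let ch = |shift: u32| ((dword >> shift) & 0xFF) as f32 / 255.0;
    [ch(0), ch(8), ch(16), ch(24)]
}

/// Clamps to [0, 1], flushing NaN to 0 (f32::clamp alone would return NaN).
fn saturate(v: f32) -> f32 {
    if v.is_nan() {
        0.0
    } else {
        v.clamp(0.0, 1.0)
    }
}
```

Reading the same dword as two 16-bit components would leave the third and fourth channels undefined or zero, which is how a white vertex color can turn into (R, G, 0) and tint a white texture yellow.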
MechaCat02	2a992db47b	[iterate-3Y] Replay per-draw blend + write-mask so the logo composites visible The publisher logo rendered its real artwork in isolation (3X) but was overpainted in the full composite: every replayed draw used ONE fixed SrcAlpha/OneMinusSrcAlpha pipeline + an opaque-magenta texture stub, so the textured RectangleList draws whose sampler slot is shadowed by a vertex-fetch constant (no resolvable texture) wrote opaque magenta over the logo. Per-draw render-state inventory at the splash (env-gated probe, removed): - logo QuadList vs=0x03b7b020 ps=0x03b79001: bc0=0x07010701 (One,OneMinusSrcAlpha — premultiplied alpha), cmask=0xF, ntex=1 (real K8888) - RectangleList vs=0xd4c14f46 ps=0x03b79001: SAME premult blend, ntex=0 (slot 0 holds a type=3 vertex constant → texture decode rejects) → magenta - opaque fill vs=0x36660986 ps=0xed732b5a: bc0=0x00010001 (One,Zero) — green Draw order: the logo is drawn LAST per group, so order was not the problem; the fixed pipeline state was. Change (UI-side capture/replay only): - draw_capture: capture RB_BLENDCONTROL0 + RB_COLOR_MASK (+ colorcontrol / depthcontrol for follow-ups) per draw. - xenos_pipeline: new RenderState{blend_control,color_mask}; map Xenos blend factors/ops -> wgpu mirroring canary kBlendFactorMap/kBlendFactorAlphaMap; One,Zero,Add => blend:None (opaque); zero-channel mask => ColorWrites; cache translator AND interpreter pipelines keyed on (vs,ps,RenderState) / RenderState so each draw composites with its real state. - render: pass each capture's RenderState through both replay paths. - dummy texture magenta(255,0,255,255) -> transparent(0,0,0,0): an unresolvable texture now contributes nothing under its real premult blend instead of fabricating opaque magenta (removes a fake, adds none). Readback (env-gated, removed): full 1280x720 composite now shows the logo's real artwork (maxR=255, 50-102 distinct colors/cell) in a centered strip; no magenta anywhere. 
Background is uniform green (the 0xed732b5a opaque fill) — a separate vertex-color/shader fidelity issue, NOT compositing (next iterate). Determinism: UI-only; draw_capture additions only run when frame_captures=Some. check -n50m --gpu-inline --stable-digest --expect = "matches golden" (2x). cargo test --workspace = 682 passed. Temp probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 19:53:25 +02:00
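The blend-control values quoted in the inventory can be decoded with a short sketch. The bitfield layout used here (5-bit source factor at bit 0, 3-bit op at bit 5, 5-bit destination factor at bit 8, with the alpha half in the upper 16 bits) and the factor numbering (0 = Zero, 1 = One, 7 = OneMinusSrcAlpha, op 0 = Add) are assumptions; they are consistent with the two decoded examples in the message, but verify them against the register definitions before relying on them.

```rust
/// One half (color or alpha) of a blend-control dword. Field layout is assumed.
#[derive(Debug, PartialEq, Eq)]
struct BlendHalf {
    src: u32, // source factor
    op: u32,  // combine function
    dst: u32, // destination factor
}

fn decode_half(bits: u32) -> BlendHalf {
    BlendHalf {
        src: bits & 0x1F,
        op: (bits >> 5) & 0x7,
        dst: (bits >> 8) & 0x1F,
    }
}

/// Returns (color half, alpha half).
fn decode_blend_control(bc: u32) -> (BlendHalf, BlendHalf) {
    (decode_half(bc & 0xFFFF), decode_half(bc >> 16))
}

/// One, Zero, Add on both halves is a plain overwrite, so no blending is needed.
fn is_opaque(bc: u32) -> bool {
    let pass = BlendHalf { src: 1, op: 0, dst: 0 };
    let (color, alpha) = decode_blend_control(bc);
    color == pass && alpha == pass
}
```

Under these assumptions 0x07010701 decodes to (One, OneMinusSrcAlpha) on both halves, the premultiplied-alpha state reported for the logo, and 0x00010001 is the opaque fill.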
MechaCat02	89b5c39d8a	[iterate-3X] Real splash logo geometry renders: fix vertex-fetch const_index_sel + per-draw submit Two readback-proven root-cause fixes make the publisher-logo QuadList draw land its REAL captured vertex buffer (the texture was already correct from 3V). REFUTES iterate-3W's "logo geometry is auto-generated from vertex_id": the logo IS sourced from a 4-vertex QuadList buffer at guest physical 0x0adf60f0 (measured), it was just resolved at the wrong fetch-constant register. GPUBUG-110 (vertex fetch const_index_sel dropped). The Xenos vertex-fetch instruction encodes const_index (w0[20:24]) AND const_index_sel (w0[25:26]); the full constant index is const_index*3 + const_index_sel (canary ucode.h:700), packed 3 two-dword constants per 6-dword register group. ucode/fetch.rs decoded only const_index and read sub-slot 0 (fc*6). The logo vfetch is const_index=31, sel=2 -> the real base lives at reg 0x48BE, but ours read 0x48BA which held an unused 0x00000001 (base=0,size=0) slot. So resolve_vertex_window returned None -> has_real_vertices=false -> the logo fell to the procedural fullscreen magenta fallback. Fix: decode const_index_sel, add VertexFetch::const_reg_offset() = const_index*6 + sel*2, and use it in both draw_capture.rs (capture) and translator.rs (the WGSL endian term + no-window fallback base; the old expression there read the src_reg bits, not the const index). Measured: logo now resolves a 24-dword (4 verts x stride 6) window, base 0x0adf60f0. GPUBUG-111 (single batch encoder = last-draw-wins vertex data). In wgpu every queue.write_buffer staged before a single queue.submit is applied before ANY command in that submit runs. dispatch_xenos_captures recorded the whole batch into one encoder + one submit, so every draw read only the LAST draw's vertex buffer / per-draw uniforms. The logo quad therefore sampled the trailing fullscreen background quad's vertices and rasterized nothing where the logo was. 
Fix: submit one encoder per draw (frontbuffer LoadOp::Load composites identically). Measured (env-gated readback, removed): with this fix the logo draw in isolation renders real varied texels (e.g. (225,17,22)/(255,255,0)) in a centered strip (~20k px), vs 100% navy before. Determinism: all changes are UI-side (xenia-ui replay) or the UI translator / capture path (frame_captures None in headless); the fetch.rs field addition is purely additive and does not change any existing decoded value. Verified the deterministic core unchanged: check -n50M --gpu-inline --stable-digest exit 0 and all 136 metric counters byte-identical across two runs. All temp probes removed. cargo test --workspace green; new regression test vertex_fetch_const_index_sel_and_reg_offset. Known remaining (next iterate): a fullscreen flat QuadList (ps 0x03b79081, vertex color green, no texture) and other textureless draws overpaint the logo in the full composite (their per-draw blend/alpha render state is not yet replayed, and draw order alternates bg/logo). The logo artwork renders correctly in isolation; the composite is not yet clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 19:25:50 +02:00
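The GPUBUG-110 register arithmetic can be checked with a small sketch. The struct is a pared-down stand-in for the real `VertexFetch`, and the 0x4800 fetch-constant base is inferred from the 0x48BA / 0x48BE values quoted in the message (31*6 = 0xBA, plus 2*2 gives 0xBE), so treat it as an assumption.

```rust
/// Assumed base register of the fetch-constant block (inferred, see lead-in).
const FETCH_CONST_BASE: u32 = 0x4800;

/// Pared-down stand-in for the decoded vertex-fetch instruction.
struct VertexFetch {
    const_index: u32,     // w0[20:24], selects a 6-dword register group
    const_index_sel: u32, // w0[25:26], selects one of 3 two-dword constants in it
}

impl VertexFetch {
    fn decode(w0: u32) -> Self {
        VertexFetch {
            const_index: (w0 >> 20) & 0x1F,
            const_index_sel: (w0 >> 25) & 0x3,
        }
    }

    /// Dword offset of this fetch constant within the fetch-constant block.
    fn const_reg_offset(&self) -> u32 {
        self.const_index * 6 + self.const_index_sel * 2
    }

    fn const_reg(&self) -> u32 {
        FETCH_CONST_BASE + self.const_reg_offset()
    }
}
```

Dropping the selector is equivalent to always decoding sel = 0, which for the logo fetch lands four registers too low.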
MechaCat02	39723dfe37	[iterate-3W] draw_capture: walk CF exec sequence to find the real vertex fetch Fix the UI-side vertex-window resolver (`resolve_vertex_window`) so it identifies vertex fetches via the control-flow `Exec` clause `sequence` bitmap instead of blindly decoding every 3-dword triple. Root cause (GPUBUG-109): the Xenos instruction block packs ALU and fetch instructions identically (96 bits each); only the owning `Exec` clause's `sequence` bitmap (2 bits per instruction, bit[2i] = fetch/ALU) tells them apart. The old resolver scanned every triple and trusted the first that happened to decode as a vertex fetch, gated by a `dword0 & 3 == 3` "type" guard. On real shaders this mis-decoded ALU triples as fetches and either picked a garbage fetch-constant slot or rejected the clause before reaching the true vertex fetch. Now walk the CF exec clauses exactly as the translator does (`translator.rs::emit_exec`) and take the first sequence-flagged vertex fetch. Measured (env-gated probes, removed before commit): the resolver now reaches the real fetch on every splash VS. The RectangleList draws (vs 0x36660986 / 0xd4c14f46) keep resolving real geometry (valid fetch const 0). The publisher-logo QuadList (vs 0x03b7b020) is correctly seen to fetch from a fetch constant whose dword0 = 0x1 (no vertex buffer) — i.e. its geometry is NOT sourced from a memory vertex buffer, so it still (correctly) falls to the procedural path. That remaining gap (the logo's auto-generated/index-derived geometry) is the next milestone-1 step; this commit removes the decoder defect that masked it. Determinism: UI-only. `resolve_vertex_window` runs only when `frame_captures` is `Some` (i.e. `--ui`); the headless `--gpu-inline` core never calls it. `check -n50000000 --gpu-inline --stable-digest` exit 0 and byte-identical run-to-run. cargo test --workspace: 681 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:50:13 +02:00
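The `sequence` bitmap test is a one-liner. The helper name is hypothetical, and the sketch assumes that a set bit at position 2i marks instruction i as a fetch while the odd bit of each pair carries something else (ignored here).

```rust
/// Returns the clause-relative indices of instructions the `sequence`
/// bitmap flags as fetches (assumed: bit 2*i set = fetch, clear = ALU).
fn fetch_slots(sequence: u32, count: u32) -> Vec<u32> {
    (0..count)
        .filter(|&i| ((sequence >> (2 * i)) & 1) == 1)
        .collect()
}
```

A resolver built on this only decodes a triple as a fetch when its slot is flagged, instead of trusting whichever triple happens to look like one.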
MechaCat02	da7c29b6d2	[iterate-3V] Fix logo texture: map texture-fetch physical base onto backing window The publisher-logo texture (K8888 1280x768 linear, `E59B2B3D`'s tfetch surface) rendered flat/transparent because the GPU texture decode read the wrong host bytes — NOT because the asset was never decompressed. First-divergence (vs canary, measured both engines): - Ours DOES read game:\hidden\Resource3D\.xpr in full, builds a byte-identical cache, decompresses the logo, and CPU-writes the real artwork (~839K nonzero bytes) into the texture buffer — at the guest physical-aperture VA 0x4dbee000 (writer sub_823C3E70 @ 0x823c3f8c). This REFUTES the iterate-3U verdict that the texture was never filled. - BUT the GPU decode used the raw fetch-constant base 0x0dbee000 as a virtual address. In ours' flat 4GB memory, virtual 0x0dbee000 and the physical alias 0x4dbee000 are DIFFERENT host bytes (no aliasing in the read/write path), so the decode read all-zeros. The Xenos texture fetch constant carries a guest physical base; the CPU writes texels through its cached-physical aperture, which ours backs at the committed 0x4000_0000 window. Map the base via the existing `physical_to_backing` helper before reading — exactly as the vertex fetch path (draw_capture.rs, iterate-3Q) and as canary reads textures through its GPU shared memory (= physical). Measured after fix (env-gated probe, removed): the logo decode reads base=0x4dbee000 and produces 839068/3932160 nonzero bytes (21.3%) — a centered logo on a transparent field, matching canary's ~21% exactly. Determinism: GPU-side pure read; no CPU/guest-memory state changes. The n50m --gpu-inline --stable-digest golden is byte-identical (verified 2x, texture_cache_entries unchanged). cargo test --workspace green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 18:07:00 +02:00
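The address rebase itself is small. The sketch below is a guess at the shape of `physical_to_backing`, not its real signature: the 0x4000_0000 window comes from the message, while the 512 MiB mask is an assumption based on the console's physical memory size.

```rust
/// Base of the committed cached-physical window (from the message).
const PHYSICAL_WINDOW_BASE: u32 = 0x4000_0000;
/// Assumed 512 MiB physical address space.
const PHYSICAL_MASK: u32 = 0x1FFF_FFFF;

/// Maps a guest physical address from a fetch constant onto the
/// host-backed alias that the CPU actually wrote through.
fn physical_to_backing(phys: u32) -> u32 {
    PHYSICAL_WINDOW_BASE | (phys & PHYSICAL_MASK)
}
```

For the logo texture this turns the raw fetch-constant base 0x0dbee000 into 0x4dbee000, the address the message reports the writer using.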
MechaCat02	1b9918450f	[iterate-3T] Real UV interpolation + per-draw textures: shader/UV/bind chain complete Build the full texture-sampling chain for the publisher splash so the textured logo CAN sample real artwork at the guest's real UVs. Measured with an env-gated frontbuffer readback (since removed): the chain is correct end-to-end, but the sampled K8888 1280x768 texture is ALL-ZERO in the UI window's reachable boot range — the artwork is produced by an EDRAM resolve (RT->texture copy) that ours does not yet perform (resolves=0). So this lands the correct shader/UV/bind work and isolates the remaining blocker to the resolve gap, not the shader path. Translator (xenia-gpu/src/translator.rs), all UI-translator-only: - Real Xenos export-index model (replaces the AllocKind heuristic that collapsed every VS export to one color slot and DROPPED the texcoord). When export_data is set the 6-bit vector_dest IS the export index: VS 62=oPos, 0..15=interps; PS 0=RT0. The logo VS exports oPos(62), interp0(color), interp1(UV) distinctly. - Real interpolator passthrough: VsOut carries 8 interpolator locations; the PS seeds r[i] = in.interp[i] (Xenos PS-input-GPR mapping) so tfetch samples at the real interpolated texcoord (r1) instead of (0,0). - vfetch format 6 (k_16_16) packed-16 unpack + per-attribute dword offset, so the 3 vfetches sharing one fetch-constant (pos/UV/color in a 6-dword vertex) read the right attribute. Previously rejected the whole logo VS to the interpreter. - QuadList/RectangleList host->guest vertex-index remap in the VS (replay is non-indexed): QuadList 6 host verts -> guest [0,1,2,0,2,3] (full quad). fetch.rs: decode vfetch `offset` (dword2[8:15], dwords), `is_signed`, `is_normalized`. 
Per-draw textures: DrawCapture carries the decoded texture(s) (keyed off the active PS's tfetch slots, attached in gpu_system after decode); render.rs::dispatch_xenos_captures uploads + binds each capture's texture via the host texture cache before its draw, instead of one last-draw primary_texture. Determinism: all changes feed only the UI translator/capture path; frame_captures is None headless. `check -n50m --gpu-inline --stable-digest --expect` byte-identical (exit 0). 681 tests pass (+2 regression: logo VS now translates with interpolators; PS seeds interps into registers). Temp readback/dump probes removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 17:12:16 +02:00
MechaCat02	80fbff8bd1	[iterate-3S] Real splash geometry renders: fix ALU/vfetch decode + per-draw NDC transform The 3O→3R real-render slice ran the guest's real translated VS/PS on real captured vertices at full boot speed, but the --ui window stayed blank. Bifurcated with an env-gated frontbuffer readback + per-vertex NDC dump (both removed): the captured splash quads (RectangleList, k_32_32_FLOAT, 3 verts) were non-zero and sane, so this was a transform/decode chain of bugs, not missing geometry. Four coupled root causes: - GPUBUG-106 (ucode/alu.rs): decode_alu read EVERY field out of w2, but canary's AluInstruction lays dest/write-mask/export/scalar-opcode in w0, the vector opcode + source regs in w2, swizzle/negate/pred in w1. The misread made every export ALU decode with vector_write_mask=0 → no oPos/oColor export emitted → the translated VS collapsed every vertex to the clip origin. Rewrote the field map to match ucode.h:2036-2086. - GPUBUG-107 (ucode/fetch.rs + translator.rs): the translator hardcoded R32G32B32A32_FLOAT (4 floats, stride 4); the splash quads are k_32_32_FLOAT (2 floats, stride 2). Over-striding read the next vertex's X into .w → negative W → the rectangle clipped behind the camera. Decode the real VertexFormat + dword stride and emit the matching component read (1/2/3/4 float formats; others reject to the interpreter). - GPUBUG-108 (translator.rs + xenos_interp.wgsl): the vfetch recomputed the buffer base from xenos_consts.fetch[], but that uniform carries the last-published per-frame fetch constant, not this draw's (stale 0x8a000002 vs the real base). The captured window already begins at the fetch base, so index from 0 (vertex i at i*stride) when a real window is present; only the synthetic fallback consults the uniform. - iterate-3S NDC transform (draw_capture.rs + xenos_pipeline.rs + WGSL): the guest VS emits screen-space pixel coords (clip disabled, VTE viewport scale/offset off). 
Added compute_ndc_xy (mirrors canary GetHostViewportInfo): rescales render-target pixels to [-1,1] clip with the Y-flip for wgpu, plumbed per-draw into DrawConstants and applied in both the translated and interpreter VS. Result (env-gated readback, since removed): the real splash geometry now fills ~50% of the frontbuffer in a clean triangular coverage pattern, real positions from real guest vertices through the real translated shaders (textures are the next stage — sampled color is still the magenta/white texture stub, tex-cache=0). Headless-inert: draw_capture is only built when frame_captures is Some (--ui); the changed decoders feed only the UI translator/metrics. Golden byte-identical (check -n50m --gpu-inline --stable-digest exit 0); 679 workspace tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 16:35:01 +02:00
MechaCat02	6d8a2817a3	[iterate-3R] Fix --ui boot throttle: demote idle-advance scheduler log to DEBUG The publisher-splash draws first appear ~30M instructions and ramp through 80M-150M. Headless `check` reaches that window and renders the textured logo (DRAW_INDX=568, gpu.texture.decode{K8888}=279 at -n 150M). The interactive `exec --ui` path appeared to "self-terminate at ~8.8s / ~33M" before reaching the splash. Root cause (measured, not inferred): NO termination and NO guest halt. The `exec --ui` default path runs at the INFO log level (headless `check` runs `--quiet` = WARN). During the boot idle-spin the scheduler has no Ready thread for long stretches and `advance_to_next_wake_if_due` fires once per timed-wait deadline-wake — hundreds of thousands of times. That `tracing::info!` emitted ~286K lines / 154 MB to disk in ~25s, throttling the guest so hard that the instruction count crawled (deadline raced to 325M timebase while instructions stayed near ~5M). The verbose run never terminated — it was alive and logging-bound. Quiet `--ui` (no flood) reaches 161M instr / 2636 GPU draws in ~31s, exactly tracking headless. Fix: demote the per-deadline-wake log from INFO to DEBUG (it is a hot-path scheduler internal). Default `exec --ui` now emits ~1.6K log lines instead of ~235K over the same window; the idle-advance flood drops 286700 -> 0 at INFO, and boot flows to the splash window. Log-level only: deterministic golden byte-identical (n50m --stable-digest --expect matches, exit 0), 679 tests green. Refutes the iterate-3O/3Q "8.8s --ui auto-close / BreakOuter" hypothesis (T1-T5 all negative): the cause was logging wall-clock cost, not a window/present deadlock, watchdog, or guest exception. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:47:16 +02:00
MechaCat02	a3aa3cc7d6	[iterate-3Q] draw_capture: read vertex window via physical alias The UI geometry-capture read the vertex-fetch base at its bare low VA (~0x0adf_xxxx), which is unmapped in ours, so it copied all-zeros. The fetch constant's address:30 field is a guest physical dword address (canary reads it via Memory::TranslatePhysical, draw_util.cc:961). Ours only maps the cached-physical window at 0x4000_0000 (physical_to_backing). Rebase a low physical base onto that mapped alias when the raw VA is unmapped; window_base_dwords still carries the original base so the shader's rebase indexes the uploaded window. Decode itself was verified correct against canary (xe_gpu_vertex_fetch_t + GetVertexFetch + ucode.h vfetch const_index*3+const_index_sel): for the splash draws const_index_sel==0, so ours' stride-6 register offset lands on the exact same constant as canary's stride-2 offset; raw dwords match byte-for-byte. UI-only path (frame_captures is None in headless), so the deterministic --gpu-inline golden is byte-identical (verified) and 679 tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:28:49 +02:00
MechaCat02	6ff184694d	[iterate-3P] Real splash geometry in --ui: fix CF predication decode + translator op coverage Stage 1 of the iterate-3O resume plan: make the P7 translator actually compile the splash's real VS/PS so real per-vertex POSITIONS render via the host wgpu pipeline, instead of every draw falling to the interpreter (which emits a placeholder triangle). Two coupled fixes, both faithful (Route A): 1. ucode/control_flow.rs (GPUBUG-103): clause-level predication was decoded from payload bits 28/29, which fall inside the exec clause's `sequence_`/ `vc_hi_` fields, NOT the predicate flag. That stamped `predicated=true` on plain `kExec` clauses, so the translator rejected EVERY splash VS as `cf_cond`. Per canary ucode.h, clause predication is determined by the opcode (only kCondExecPred* = 5/6/13/14 are predicate-register-gated; their `condition_` is at word1 bit 9 = payload bit 41). kExec/kExecEnd (1/2) run unconditionally; kCondExec (3/4) is bool-constant-gated (not modeled). Diagnosed live in --ui: reject reason cf_cond on all 7 splash shader pairs → after fix, predicated=false and CF passes. 2. translator.rs: with CF passing, the next reject was `scl_op_unsupported` for scalar opcodes 4 (kMulsPrev2 / LIT emul) and 8 (kSgts), plus thin vector coverage. Expanded vector_expr + scalar_expr to mirror the runtime interpreter's op set (which mirrors canary AluVectorOpcode/AluScalarOpcode): CND_EQ/GE/GT, TRUNC, MAX4, DST for vectors; the full SEQS/SGTS/SGES/SNES, MULS_PREV2 (with the -FLT_MAX / non-finite / b<=0 guard), SUBS(_PREV), EXP/LOG/RCP/RSQ/SQRT/SIN/COS, FRCS/TRUNCS/FLOORS for scalars. Side-effecting ops (setp/kills/maxas*) still reject → interpreter fallback (honest). Result (--ui, measured): xlated-pipelines 0→6, all draws served by the translator (served_interp=0) — real VS/PS now run on the host GPU. 
The splash is still not visibly correct because the captured guest vertex windows read all-zero: the vertex-buffer base VA (~0x0adf_xxxx) is UNMAPPED in guest memory (mem.translate()==None). That is a CPU/kernel memory-mapping gap, not a GPU-render gap — the next stage. Determinism: both files are in xenia-gpu core but the CF `predicated` field only feeds the UI translator + a metric tag, never deterministic state. Verified: `check -n50000000 --gpu-inline --stable-digest` matches the golden byte-for-byte (exit 0); 679 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:07:06 +02:00
MechaCat02	504592ac13	[iterate-3O] Real-render slice: replay guest geometry in --ui (Route A) Replace the synthetic placeholder triangle in the --ui window with the splash's REAL guest geometry, proving the faithful-render pipe end to end. Architecture: Route A (UI-side replay). A per-draw capture channel carries each PM4_DRAW_INDX's real state to the UI, which replays it through the existing wgpu Xenos pipeline. The deterministic headless core is untouched: capture is gated on an Option<Vec<DrawCapture>> that is None in headless mode and only enabled on the --ui path, so the --gpu-inline n50m golden is byte-identical (verified 2x). The hard part was sourcing real vertices. The WGSL VS already does format-aware vertex fetch from the b4 storage buffer at the address from the fetch constant -- but b4 was never populated and the fetch address is an absolute guest dword address. The slice: xenia-gpu/draw_capture.rs: parse the active VS, find its first vertex fetch, read that fetch constant, copy a bounded window of guest memory at the fetch base. Best-effort: has_real_vertices=false falls back to procedural geometry (never fabricated pixels). * gpu_system.rs: accumulate one DrawCapture per draw into frame_captures. * exports.rs (vd_swap): drain + publish the frame's captures to the UI. * ui_bridge/bridge.rs: new publish_geometry channel + UiHandles.geometry. * WGSL (interp + translator): rebase the absolute fetch address by a new DrawConstants.vertex_base_dwords so it indexes the uploaded window. * render.rs: dispatch_xenos_captures uploads each draw's real vertex window + matching shader, issues real DrawRequests (real prim type, host vertex count, vs/ps keys). * app.rs: prefer the real-capture replay; HUD adds real-geo=N counter. Verified in --ui on Sylpheed: "first Xenos capture batch replayed (real geometry) captures=24 real_vertex_draws=24" -- all draws resolved a real guest vertex window; WGSL compiles; no validation errors over 1616 swaps. 
Still synthetic-free but not yet pixel-perfect: textures/UVs, DMA index buffers (auto-index only for now), and kCopy resolve routing are staged for follow-ups. Faithful: real vertex data, prim types, shaders, constants. cargo test --workspace green; n50m golden unchanged (2x byte-identical). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 22:38:46 +02:00
MechaCat02	6bb4355e3d	[iterate-3M] Fix Xenos shader CF/fetch decode so the textured logo binds The publisher splash (title idx0) rendered FLAT in ours while canary samples a texture: ours never decoded the logo's textured pixel shader (E59B2B3D, a `tfetch2D` sprite) even though our guest IM_LOADs the exact same microcode canary does (verified byte-identical against the Wine oracle). The shader was misparsed as flat. Three coupled bugs in the ucode decoder, all off vs canary `gpu/ucode.h`: 1. CF opcode table was off-by-one (`control_flow.rs`): mapped opcode 0→Exec and 1→Exit, but Xenos has 0=kNop, 1=kExec, 2=kExecEnd, 3..6/13..14 the cond-exec variants, 7/8 loop, 9/10 call/return, 11 condjmp, 12 alloc, 15 mark-vs-fetch-done. So a real `kExec` clause was read as a terminal `Exit`, truncating the CF block and dropping every instruction (incl. the `tfetch`) after it. Added Nop/MarkVsFetchDone variants; parse now ends on an END-bit exec clause. 2. exec/loop `address` is an absolute instruction-triple index from shader dword 0, but indexed our post-CF `instructions` slice directly (`ucode/mod.rs`). Rebase addresses by the CF triple count so `address*3` lands on the right instruction. 3. Fetch instruction bitfields were wrong (`ucode/fetch.rs`): `const_index` read from bit 5 (actually `src_reg`) instead of bit 20, and texture `dimension` from dword1 instead of dword2 bit14. The logo's `tfetch ..,tf0` was read as `tf1`, whose empty fetch-constant failed to decode → no texture. Also the `sequence` fetch/ALU bit is bit[0] of each pair, not bit[1] (`shader_metrics.rs`, `translator.rs`, `xenos_interp.wgsl`). Result (--gpu-inline, deterministic 2x): the active PS's `tfetch_slots` now resolves slot 0, the tf0 fetch-constant decodes (fmt K8888), and `gpu.texture.decode` fires (137x at -n 50M; texture_cache_entries 0→1, the only golden field that changed — all draw/swap counts unchanged). The same fixes correct the WGSL uber-shader's fetch/CF walk for the threaded/--ui path. 
Added a regression test that parses the real E59B2B3D microcode and asserts a tfetch slot is found. Golden re-baselined (texture_cache_entries 0→1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 21:53:35 +02:00
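The corrected CF opcode numbering listed in iterate-3M can be sketched as a coarse classifier. The enum `CfClass` and the function `classify_cf_opcode` are hypothetical names for illustration only; the grouping follows the numbers quoted in the commit message, and the actual variants in `control_flow.rs` may be finer-grained.

```rust
/// Hypothetical coarse grouping of Xenos control-flow opcodes, following
/// the numbering quoted in the iterate-3M commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CfClass {
    Nop,
    Exec,
    ExecEnd,
    CondExec,
    Loop,
    CallReturn,
    CondJmp,
    Alloc,
    MarkVsFetchDone,
}

/// Returns None for values outside 0..=15 (assumption: the opcode field
/// has exactly sixteen encodings, as the commit's list implies).
fn classify_cf_opcode(opcode: u8) -> Option<CfClass> {
    let class = match opcode {
        0 => CfClass::Nop,
        1 => CfClass::Exec,
        2 => CfClass::ExecEnd,
        3..=6 | 13 | 14 => CfClass::CondExec,
        7 | 8 => CfClass::Loop,
        9 | 10 => CfClass::CallReturn,
        11 => CfClass::CondJmp,
        12 => CfClass::Alloc,
        15 => CfClass::MarkVsFetchDone,
        _ => return None,
    };
    Some(class)
}
```

Under the old off-by-one table, opcode 1 was read as a terminal exit; in this sketch it classifies as a plain exec clause, so parsing continues past it.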
MechaCat02	3f5d5cf5f7	[iterate-2Z] Implement NtSetInformationFile FileRenameInformation for cache: files The GamePart title-logo gate first-divergence: Sylpheed's asset cache decompresses each packed resource to a staging `cache:\<hash><tail>.tmp` file, then renames it into its final nested path `cache:\<hash>\<dir>\<file>` (e.g. the title logo texture `\69d8e45c\e\534ffea`) via NtSetInformationFile class 10 (XFileRenameInformation). Our handler treated class 10 as a permissive no-op (catch-all `_ => STATUS_SUCCESS`), so the host rename never happened: the nested target directories were created but left EMPTY while the decompressed data stayed in the flat `.tmp` file. When the title later reads back `\69d8e45c\...` to build the logo texture the read misses, so the textured logo pixel shader (canary `E59B2B3D`, tfetch2D) is never dispatched and the logo never renders. Fix: implement class 10 faithfully, mirroring canary `xboxkrnl_io_info.cc:226` (`X_FILE_RENAME_INFORMATION{ replace_existing@0, root_dir_handle@4, ANSI_STRING@8 }` -> `file->Rename(TranslateAnsiPath)`). Read the target path from the embedded ANSI_STRING at info_ptr+8, resolve it against the host cache backing dir (`resolve_cache_path`), create the parent dirs, `std::fs::rename` the backing file, and update the handle's `path` + `host_path`. Non-cache (read-only VFS) sources keep the prior permissive acknowledge. Verified at runtime: 20 renames/80M now move `69d8e45ce534ffea.tmp -> 69d8e45c/e/534ffea` etc., and the nested cache tree now matches canary's HostPathDevice layout byte-for-byte (data present, not empty dirs). Made `path::read_ansi_string` pub so the handler can parse the rename target. 
Deterministic + golden-invariant: two `check --gpu-inline --stable-digest -n 50000000` runs are byte-identical and the 50M stable digest is unchanged (draws=718/swaps=147/6 shaders/tex=0); the logo read-back occurs later than the observable window so GPU counters at 1B/2.5B are unchanged (2.5B: draws=48734, swaps=16060, still 6 flat shaders, texture_decodes=0). The fix is a verified-necessary precondition — without it the nested asset read-back is guaranteed to miss. A downstream gate (the 2nd title thread's load-completion post skipped when its notify target `[r29+8]==0`, and the later read-back phase being beyond 2.5B) remains for follow-up. New test: `nt_set_information_file_rename_moves_cache_file` (678 total, was 677). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 21:33:25 +02:00
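The rename flow described in iterate-2Z (resolve the target under the cache backing dir, create parent dirs, move the backing file) might look roughly like the sketch below. `rename_into_cache_sketch` and its parameters are hypothetical; the real handler also parses the ANSI_STRING and updates the handle's `path` and `host_path`, and this sketch does no `..` sanitizing.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch: move a staged cache file to its nested target path.
/// `target_rel` uses guest-style backslash separators, e.g. `\69d8e45c\e\534ffea`.
fn rename_into_cache_sketch(
    cache_root: &Path,
    staged: &Path,
    target_rel: &str,
) -> io::Result<PathBuf> {
    // Strip leading separators so the join stays under cache_root.
    let rel = target_rel
        .trim_start_matches(|c| c == '\\' || c == '/')
        .replace('\\', "/");
    let dest = cache_root.join(rel);
    // Create the nested target directories before moving the file.
    if let Some(parent) = dest.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::rename(staged, &dest)?;
    Ok(dest)
}
```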
MechaCat02	2f55d1fd7d	[iterate-2X] Texture pipeline: un-stub RectangleList + draw-time texture decode Two faithful, deterministic GPU-backend changes that make the texture path correct for whatever textured draw the splash eventually dispatches. Both are currently inert on Sylpheed (the textured logo draw is still gated downstream — see below), but neither shifts the stable-digest golden, so they land safely. 1. Un-stub RectangleList primitive expansion (primitive.rs). The splash submits 2819 RectangleList draws at 200M, all of which were REJECTED by the P3 stub (`gpu.primitive.rejected{rectangle_list}`) → only ~592 flat point/quad draws rasterized. Mirror canary's intent (primitive_processor.cc:389-456 kRectangleListAsTriangleStrip) within our CPU index-rewrite idiom: emit each rect's 3 real vertices as one TriangleList triangle (v0,v1,v2), rejected=false, faithful host_vertex_count. The full quad (synthesized 4th corner v3=v0+v2-v1) needs real vertex fetch in vs_main — left as a documented TODO. Rejection warnings drop 2819→0. 2. Draw-time texture decode keyed off the active PS's real tfetch slots (gpu_system.rs + exports.rs vd_swap). Previously vd_swap decoded a hardcoded fetch-constant slot 0 at swap time. Now the DRAW handler parses the bound pixel shader (ucode::parse_shader), collects its tfetch fetch_const slots via new shader_metrics::tfetch_slots, reads each 6-dword fetch constant, and decode+caches it into GpuSystem::last_draw_textures. vd_swap publishes the first of these (UI binds one texture today), falling back to the legacy slot-0 probe on flat-only frames. New span_max_version helper walks page_version over the trait (draw-time &dyn MemoryAccess lacks the heap's inherent max_page_version). Pure function of guest writes — deterministic. 
Status: texture_decodes stays 0 on Sylpheed because all 6 live shaders are flat (no tfetch); canary's textured logo shaders E59B2B3D/F7B1457 are not yet dispatched by ours (a downstream title-state gate, the next frontier). The full P5 decode→publish→upload→sample path is already wired; this makes the decode side key off the real shader instead of a guess. Validation: stable-digest golden sylpheed_n50m unchanged (draws=718 swaps=147 tex=0), regenerated twice byte-identical; 200M run shows 0 RectangleList rejections. cargo test --workspace green (677, +2: rectangle_list_expansion, tfetch_slots_extracts_texture_fetch_constants). No temp hooks. Branch only; not pushed/merged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 21:34:43 +02:00
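The synthesized fourth corner mentioned in iterate-2X (v3 = v0 + v2 - v1) can be illustrated with a small sketch. Note this is the part the commit leaves as a TODO, not what it ships (the commit emits only the triangle v0, v1, v2); the function names and the 2D `[f32; 2]` vertex type are assumptions.

```rust
/// Hypothetical sketch: derive the implicit fourth rectangle corner from
/// the three vertices a RectangleList primitive supplies.
fn rect_fourth_corner(v0: [f32; 2], v1: [f32; 2], v2: [f32; 2]) -> [f32; 2] {
    [v0[0] + v2[0] - v1[0], v0[1] + v2[1] - v1[1]]
}

/// Expand one rectangle into two triangles that share the v0-v2 diagonal.
fn rect_to_triangles(v0: [f32; 2], v1: [f32; 2], v2: [f32; 2]) -> [[f32; 2]; 6] {
    let v3 = rect_fourth_corner(v0, v1, v2);
    [v0, v1, v2, v0, v2, v3]
}
```

For v0 = (0, 0), v1 = (10, 0), v2 = (10, 5), the formula yields v3 = (0, 5), which closes the rectangle.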
MechaCat02	a91f4c550b	[iterate-2W] Sustain the title present loop: viewport-size register + ISR CPU impersonation The title's per-frame loop (sub_822F1AA8) is clock-B-paced and only re-fires when the swap count [controller+88] changes, which advances only on source=1 CP swap-complete interrupts. Each present batch the guest submits (via the sub_824CE348 -> sub_824BF4D0 builder) ends with a WAIT_REG_MEM on a per-CPU swap-acknowledge fence [GCTX+0] (GCTX = [device+10772]); the GPU parks there until the graphics ISR (sub_824BE9A0) clears that CPU's bit. Two coupled gaps kept ours emitting only ONE source=1 then dead-locking (draws plateaued at 28, run halted ~19.27M): 1. GPU MMIO register 0x1961 (AVIVO_D1MODE_VIEWPORT_SIZE) read as 0. The swap callback sub_824CE2B8 divides by its low 12 bits (display height) as a refresh-pacing term, so a 0 read tripped its `twi` divide-by-zero guard and aborted the ISR before it reached the fence-clear. Mirror canary GraphicsSystem::ReadRegister (graphics_system.cc:311): return 0x050002D0 (1280x720). 2. The ISR ran on an arbitrary borrowed thread, so [r13+268] (the PCR processor number) did not match the interrupt's target CPU. The ISR clears `1 << current_cpu` from the fence; running on the wrong CPU cleared the wrong bit and the fence (bit 2, from cpu_mask 0x4) never reached 0. Carry the target CPU through the interrupt queue (bit index of the PM4_INTERRUPT cpu_mask for CP, 2 for vsync per canary DispatchInterruptCallback(0, 2)) and impersonate it on the borrowed thread's PCR around the ISR, mirroring canary EmulateCPInterruptDPC -> XThread::SetActiveCpu. With both fixes the fence clears, the GPU drains each present batch, source=1 sustains per-present, clock B advances, and the loop runs continuously. Draws climb linearly with the budget (no re-stall): 50M 28->718, 200M ->3411, 1B ->18734; swaps 2->147/950/6060. No "Unanticipated CPU_INTERRUPT" trap. 
Inline-deterministic (--stable-digest byte-identical x2); n50m golden re-baselined. 675 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 20:49:32 +02:00
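The fence arithmetic behind the iterate-2W fix can be sketched as follows. All three helper names are hypothetical, and picking the lowest set bit of a multi-bit `cpu_mask` is an assumption (the commit only shows the single-bit mask 0x4).

```rust
/// Hypothetical: the interrupt's target CPU is the bit index of cpu_mask.
fn target_cpu_from_mask(cpu_mask: u32) -> Option<u32> {
    if cpu_mask == 0 {
        None
    } else {
        Some(cpu_mask.trailing_zeros())
    }
}

/// The ISR clears `1 << current_cpu` from the swap-acknowledge fence.
fn isr_clear_fence(fence: u32, current_cpu: u32) -> u32 {
    fence & !(1u32 << current_cpu)
}

/// Low 12 bits of the viewport-size register give the display height,
/// which the swap callback divides by.
fn display_height(viewport_size_reg: u32) -> u32 {
    viewport_size_reg & 0xFFF
}
```

With cpu_mask 0x4 the target is CPU 2. Running the ISR as CPU 0 leaves the fence at 0x4, while running it as CPU 2 clears it to 0, which is why the impersonation matters. Likewise 0x050002D0 yields a height of 720 rather than the 0 that tripped the divide guard.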
MechaCat02	66bd805726	[iterate-2V] VdSwap: stop bumping primary CP_RB_WPTR out-of-band (canary-faithful) Ours' `vd_swap` wrote its 64-dword XE_SWAP block at the guest's reserved `buffer_ptr` slot AND then bumped the primary ring `CP_RB_WPTR` out-of-band via `state.gpu.extend_write_ptr_by(64)`. That bump was a bug: `buffer_ptr` (~0x4add6efc) is NOT inside the primary ring (base ~0x4adcd000, 8192 dwords) — it lives ~10k dwords past it, in the renderer indirect-buffer region. The bogus WPTR bump pushed the GPU read-pointer PAST the guest's real write-pointer; the drain treated the overshoot as a circular wrap and re-executed the splash's draw indirect-buffers ~2×, inflating draws to 78 (the real splash geometry is ~28 draws; 12 INDIRECT_BUFFERs vs the real 6). Canary's `VdSwap_entry` (xenia-canary xboxkrnl_video.cc:518-548) writes the fetch-constant patch + PM4_XE_SWAP + NOP pad into the reserved slot and returns — it NEVER touches CP_RB_WPTR. The guest advances the primary ring write-pointer itself via its own doorbell once it has populated the slot; swap-complete CP interrupts come only from the game's in-stream PM4_INTERRUPT packets, never from VdSwap. This fix removes only the out-of-band `extend_write_ptr_by(64)` call, keeping the buffer_ptr block write intact and byte-faithful to canary. Effect at `--gpu-inline -n 50M`: draws 78→28, INDIRECT_BUFFER 12→6 (re-execution artifact gone), swaps 4→2. The run now halts at ~19.27M instructions (worker threads exit) instead of spinning to 50M, because removing the corruption unmasks the real per-present-interrupt deadlock — the title loop needs a per-present PM4_INTERRUPT that the stalled game never submits. That deadlock is a SEPARATE, known gate tracked/addressed elsewhere; it is intentionally NOT papered over here. Re-baselined golden crates/xenia-app/tests/golden/sylpheed_n50m.json to the new honest values (regenerated twice, byte-identical). sylpheed_n2m.json is unaffected (draws=0 at 2M). 
cargo test --workspace: 675 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 19:58:05 +02:00
MechaCat02	ad9c8e4cb8	[iterate-2U] VdGlobalDevice: allocate a real device cell so the swap counter (clock B) can advance Sylpheed's title loop re-runs its per-frame manager update sub_821741C8 only when "clock B" ([controller+88], the swap count) changes. Clock B's sole source is the CP swap-complete callback sub_824CE2B8, which bumps [gfx+15160] via the TWO-LEVEL deref [[VdGlobalDevice]+0]+15160, where VdGlobalDevice is the kernel variable export 0x01BE at guest .data 0x82000750. Ours patched that import slot with literal 0 (the old "passed through to Vd* shims, write 0" behaviour). Consequences, both confirmed at runtime: * the guest's graphics init stores its D3D device object via `stw r31, 0([0x82000750])` (sub_824C6DC0 @0x824C6F18) — with the slot 0, that store lands at address 0; * the swap callback reads [[0x82000750]] = [0] = 0 and increments [0+15160] (the null page) instead of the real device's swap counter. So [gfx+15160] never moved, clock B stayed frozen at 0, sub_821741C8 fired exactly once, and the game submitted one render batch (the 78-draw splash) then stalled. Fix mirrors xenia-canary RegisterVideoExports (xboxkrnl_video.cc:557-564) exactly: allocate a 4-byte cell, point the import slot at it, zero the cell. The guest then stores its device into the cell, and the callback's two-level deref resolves correctly. Verified: [0x82000750] now holds a real cell whose [+0] is the device (gfx state), the swap callback bumps [gfx+15160] 0->1, clock B advances, and the per-frame chain steps forward (sub_821741C8 fires 1->2x, GamePart update sub_821C7CB8 0->1x). Determinism: --gpu-inline digest re-baselined and byte-identical across runs. The fix shifts the early execution trajectory (clock B unfreezing), so the n50m golden moves imports 451500->178937 and instructions 50000001->50000014; draws/swaps/RTs/shaders unchanged (78/4/2/3). n2m golden unchanged (early boot, pre-fix-effect). 675 workspace tests green; sylpheed_n50m oracle green. 
Note: this breaks the FIRST hard blocker (clock B could never advance at all). Full per-frame sustain (draws past 78) needs a further step: each GamePart update must submit a per-frame command buffer (with PM4_INTERRUPT) during the asset-streaming phase to keep generating CP interrupts; ours currently produces only the single seed interrupt from the initial batch, so the chain advances once and re-stalls. Tracked for the next iterate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 16:20:08 +02:00
MechaCat02	873c197ff1	[iterate-2T] VdSwap: route present through ring PM4_XE_SWAP, drop out-of-band swap interrupt Make ours' VdSwap present path faithful to xenia-canary `VdSwap_entry` (xboxkrnl_video.cc:518-548): write the reserved 64-dword ring slot with a PM4_TYPE0 fetch-constant patch + PM4_TYPE3(PM4_XE_SWAP) + NOP padding, then let the natural drain consume the swap packet in command-stream order. Remove the synthetic CP swap-complete interrupt that `notify_xe_swap` raised out-of-band. Root found this session (the actual present-path bug): ours' `notify_xe_swap` pushed an `InterruptSource::Swap` (→ INTERRUPT_SOURCE_CP) interrupt directly from the VdSwap HLE, decoupled from the GPU command stream. When that interrupt reached the graphics ISR `sub_824BE9A0` before D3D had armed its swap-callback slot (`[gfx+10772]+16` still the `0xBADF00D` placeholder), the ISR took its error path and hit the assert "ERR[D3D]: Unanticipated CPU_INTERRUPT. Sign of a corrupt command buffer?" (`bl sub_824C5DF0; twi` at 0x824BE9DC) — 2x per run on master. Canary's VdSwap raises NO interrupt; swap-complete CP interrupts come only from in-stream PM4_INTERRUPT packets, which are naturally ordered after the callback-arming Type-0 writes. Routing the swap through the ring packet matches that ordering and eliminates the trap (2 -> 0). Canary oracle confirmation (muted, audit_mem_watch + audit_jit_prolog_pc): canary's early/loading loop is present-driven — swap counter [gfx+15160] (0xBE56CA38) advances ~per-vblank from vblank 65 onward, reaching 0xD02 (3330) in ~60s via 6184 CP source=1 interrupts, with VdSwap called only ONCE. So the present interrupts are entirely in-stream, not from the VdSwap export. This is a correctness/faithfulness fix; it does NOT cascade. draws stay 78 at 200M and 1B because the upstream gate persists: the game submits one render batch then stalls (renderer sub_82506xxx 0x; 2nd title thread 0x821748F0 never spawns). 
The per-frame loop sub_822F1AA8 runs ~1207 iterations on vsync but clock B (swap count) only advances ~once, so the manager update sub_821741C8 fires once. That is the iterate-2Q/2F title-pipeline gate, not a present/ interrupt bug. swaps 3 -> 4 (the in-stream PM4_XE_SWAP now drains). Deterministic in inline mode (n50m --gpu-inline --stable-digest regenerated byte-identical twice; golden re-baselined: swaps 3 -> 4). cargo test --workspace 675 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 15:20:02 +02:00
MechaCat02	1ae472bd2b	[iterate-2S] GPU: implement CP SCRATCH_REG memory writeback — arms Sylpheed's swap-callback slot Sylpheed renders the splash (draws=78, iterate-2O) then plateaus: the title's per-frame manager (sub_821741C8) only re-fires when "clock B" ([gfx+15160], swap count) changes, which only the CP swap-complete callback sub_824CE2B8 increments. The graphics ISR sub_824BE9A0 indirect-calls that callback via [[gfx+10772]+16] on CP (source=1) interrupts, but the slot stayed NULL so the callback never ran. Root (runtime-verified, ours-side GPU): the guest arms the slot through the Xenos CP scratch-register writeback path, which ours never implemented. The arming IB (drained by ours at 0x4adf5180) contains a Type-0 register write of the callback PC 0x824ce2b8 into SCRATCH_REG4 (0x057C). On hardware/canary, writing a SCRATCH_REG{n} mirrors the value to SCRATCH_ADDR + n*4 in memory when the matching SCRATCH_UMSK bit is set. Runtime values: SCRATCH_ADDR=0x0b1d5000 (the [gfx+10772] descriptor), SCRATCH_UMSK=0x20033 (bit 4 set), so SCRATCH_REG4 -> 0x0b1d5010 = descriptor+16 = the callback slot (0x4b1d5010). Ours decoded the Type-0 write into the register file but performed no writeback (case a: drained-but-mishandled), so the slot stayed NULL. Fix mirrors canary's CommandProcessor::HandleSpecialRegisterWrite (command_processor.cc:545-552): a scratch_register_writeback() helper called from handle_type0/handle_type1 after every register write; for SCRATCH_REG0..7 with the UMSK bit set, it writes the value (big-endian, as mem.write_u32 already stores) to SCRATCH_ADDR + n*4 (projected via physical_to_backing). Deterministic given identical register state; proven by unit test. Cascade (verified by runtime probe): slot 0x4b1d5010 now armed with 0x824ce2b8; on the 2-3 CP interrupts that fire, the ISR reads the slot and bcctrl's into sub_824CE2B8 (runs 2x; 0x cascade on master); sub_824CE2B8 increments clock B ([gfx+15160]). 
The cascade does NOT yet reach draws>78: there are only ~3 CP interrupts (from the initial 9825- packet batch), and the title render loop stalls upstream (the iterate-2Q title-respawn gate) before it submits more PM4_INTERRUPT work, so the callback can't bootstrap a self-sustaining loop. This is the remaining update-17/18 arming gap closed; the upstream stall is the next gate. The default threaded GPU backend drains the ring on a separate host thread, so with the callback now doing work the exact CP-interrupt delivery instruction varies run to run (pre-existing GPU-thread race). Pin the n50m oracle test to --gpu-inline (instruction-count deterministic) and re-baseline its golden; bit-exact across repeated runs. New unit test scratch_reg_write_mirrors_to_memory_when_umsk_enabled. Tests: 675 pass (was 674). Golden re-baselined + determinism verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 14:21:30 +02:00
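The scratch-register writeback rule from iterate-2S (mirror SCRATCH_REG{n} to SCRATCH_ADDR + n*4 when UMSK bit n is set) can be sketched as an address computation. The function name is hypothetical, and `SCRATCH_REG0 = 0x0578` is inferred from the commit's SCRATCH_REG4 = 0x057C under the assumption that the eight registers are contiguous.

```rust
/// Assumed register index of SCRATCH_REG0 (inferred: REG4 is 0x057C).
const SCRATCH_REG0: u32 = 0x0578;

/// Hypothetical sketch: return the guest address a write to register `reg`
/// should be mirrored to, or None if no writeback applies.
fn scratch_writeback_target(reg: u32, scratch_addr: u32, scratch_umsk: u32) -> Option<u32> {
    // Only SCRATCH_REG0..=SCRATCH_REG7 participate.
    let n = reg.checked_sub(SCRATCH_REG0)?;
    if n > 7 {
        return None;
    }
    // The matching unmask bit must be set.
    if scratch_umsk & (1 << n) == 0 {
        return None;
    }
    Some(scratch_addr.wrapping_add(n * 4))
}
```

Using the runtime values from the commit (SCRATCH_ADDR=0x0b1d5000, SCRATCH_UMSK=0x20033), a write to 0x057C targets 0x0b1d5010, the callback slot at descriptor+16, before any physical-alias projection.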
MechaCat02	034ec8b47f	[iterate-2O] GPU: drain indirect buffers correctly — Sylpheed renders splash (draws 0→78) Ours' GPU never drained the D3D driver's system command buffer past the first 11-dword indirect buffer, so DRAW_INDX / reg-0x57C-arm packets never executed and draws stayed 0 (the long-hunted render gate; see UPDATE-18). Runtime tracing (temporary, removed) showed the guest submits 6 INDIRECT_BUFFER packets at boot (CP_RB_WPTR 22→37) but ours executed exactly ONE IB and then spun 15.7M packets inside it. Three coupled command-processor bugs, all corrected to match canary: 1. `sync_with_mmio` applied the primary CP_RB_WPTR to whichever ring was active, including an executing indirect buffer — `37 % 11 = 3` clobbered the IB's write pointer so its read pointer looped 0→2→5→0 forever and never popped back to the primary ring. CP_RB_WPTR governs ONLY the primary ring; while an IB executes, the primary is the bottom of the IB stack. Canary executes each IB through a separate `RingBuffer reader_` (command_processor.cc), so the primary write pointer is structurally inapplicable to an IB. 2. Indirect buffers were treated as circular rings: read wrapped at `size_dwords` (`11 % 11 = 0`) and never reached the fixed write pointer, so even without the clobber the IB could not terminate. An IB is a fixed linear sub-stream; add `RingBufferView.indirect` and drain `[0, ib_size)` monotonically, then pop. 3. `is_ready` only checked the active ring, so an IB that now correctly exhausts would never get `execute_one` called again to pop back to the primary ring (whose WPTR may have advanced). Check the whole IB stack. Also: the ring was sized `1 << size_log2` bytes (1024 dwords) vs canary's `1 << (size_log2 + 3)` (8192 dwords) — an 8× undersize that desynced WPTR-wrap math from the guest. Fixed in `GpuSystem::initialize_ring_buffer` (and the dead bookkeeping copy in `vd_initialize_ring_buffer`). 
Cascade (deterministic; threaded-default backend, byte-identical across runs): reg 0x57C now written, IB jumps 1→12, packets 15.7M→9,825, and the splash renders — draws 0→78, shaders 0→3, render_targets 0→2, swaps 2→3 — stable at 50M / 200M / 1B. Boot then reaches a new downstream gate (draws plateau at 78, interrupts keep climbing → engine alive, not deadlocked). golden `sylpheed_n50m.json` re-baselined (draws 78). `cargo test --workspace` green (674; +2 ring_view regression tests). vd_swap's synthetic-swap short-circuit is now redundant but left untouched (cascade works without changing it); cleaning it up is a separate follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 22:06:16 +02:00
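A small sketch of two points from iterate-2O: the ring sizing and the difference between a circular primary ring and a linear indirect buffer. `RingViewSketch` and its methods are hypothetical and are not the repository's `RingBufferView`; reading `1 << (size_log2 + 3)` as a byte count is an inference from the dword figures quoted in the commit.

```rust
/// Ring size in dwords, reading `1 << (size_log2 + 3)` as bytes (assumption).
fn ring_size_dwords(size_log2: u32) -> u32 {
    (1u32 << (size_log2 + 3)) / 4
}

/// Hypothetical view over either the primary ring or an indirect buffer.
struct RingViewSketch {
    size_dwords: u32,
    read: u32,
    write: u32,
    indirect: bool,
}

impl RingViewSketch {
    fn new_primary(size_dwords: u32, write: u32) -> Self {
        Self { size_dwords, read: 0, write, indirect: false }
    }

    /// An indirect buffer is a fixed linear sub-stream: write == size.
    fn new_indirect(size_dwords: u32) -> Self {
        Self { size_dwords, read: 0, write: size_dwords, indirect: true }
    }

    fn advance(&mut self, dwords: u32) {
        let next = self.read + dwords;
        self.read = if self.indirect {
            // Drain [0, size) monotonically; never wrap.
            next.min(self.size_dwords)
        } else {
            next % self.size_dwords
        };
    }

    fn exhausted(&self) -> bool {
        self.read == self.write
    }
}
```

Treating an 11-dword buffer as circular wraps the read pointer to 0 (11 % 11) so it never meets the write pointer at 11; the linear variant stops at 11 and can be popped.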
MechaCat02	93f60a3ba0	[iterate-2M] PCR+0x10C (PRCB.current_cpu): init per-HW-thread to unwedge spin-barrier Ours never initialized the PRCB `current_cpu` byte at PCR+0x10C (prcb_data@0x100 + current_cpu@0xC). Canary sets it from `GetFakeCpuNumber(affinity)` (xthread.cc:847 `pcr->prcb_data.current_cpu = cpu_index`), which equals the HW thread id ours already writes at PCR+0x2C. Left unwritten it read 0 for every thread. Guest spin-barrier `sub_824D1328` (used by the audio/update pump threads at entries 0x824D2878 / 0x824D2940, ours tid 9 / tid 10) indexes a per-HW-thread occupancy byte array via `lbz r11, 268(r13)` then `stbx ..., [array+index]`. With index 0 for all threads, every thread marked slot 0; the multi-byte rendezvous signature it then spins on (`ld [obj+0x164]` compared against the packed per-slot expectation) could never assemble. Both pump threads busied at pc 0x824d140c/0x824d1410 forever (Ready, 5M+ barrier iterations) and never ran their `KeSetEvent` loops — so the events they signal (the 21k-per-thread heartbeat in canary) never fired, starving the downstream worker handshake. Fix: write `hw_id` to PCR+0x10C alongside PCR+0x2C in both the static thread image init (thread.rs) and the dynamic PcrWriter (state.rs, used by scheduler spawn + affinity migration) so the two stay in sync. Runtime-verified BOTH engines. Post-fix the pump threads escape the barrier (barrier iterations 5M+ -> 3) and advance into their loop bodies, now correctly Blocked(WaitAny) at pc 0x824d28d0 / 0x824d29c0 (was spinning at 0x824d140c). imports at n50M 339,766 -> 451,508; deterministic (two cold runs byte-identical). draws still 0 (a later, separate render gate). golden re-baselined. cargo test --workspace: 672 passed, 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 18:08:46 +02:00
MechaCat02	2bdb93e51e	[iterate-2K] GPU physical-mirror aliasing: ring/IB/RPtr/resolve read wrong host region Root cause (physical-mirror aliasing gap → GPU read wrong region → ring never truly drained → render worker ring-space wait → no frame → no draw): The Xbox 360 maps its 512 MB of physical DRAM into several virtual mirror windows differing only in cache policy — bare physical (0x0xxxxxxx), write-combine (0x4xxxxxxx), and cached 0xA/0xC/0xExxxxxxx — all aliasing addr & 0x1FFF_FFFF. Ours has one flat membase and `heap_alloc` (MmAllocatePhysicalMemoryEx) commits physical backing in the 0x4xxxxxxx window. The guest masks its CP-ring allocation base to bare physical (0x4adcc000 & 0x1FFFFFFF = 0x0adcc000) before handing it to VdInitializeRingBuffer, and PM4 INDIRECT_BUFFER / writeback / resolve pointers are likewise bare-physical. Ours stored those verbatim and read `membase + 0x0adcc000`, a never-committed zero-filled page — so the GPU drained ~718k zero PM4 headers, never executed the real Type3/DRAW stream, and the RPtr writeback landed on a zero page the render worker (tid=8) polls, freezing it forever. Fix (GPU/Vd-boundary translation, not memory-layer): add `physical_to_backing(addr)` deriving the committed backing exactly from `heap_alloc`'s placement (0x4000_0000 | (addr & 0x1FFF_FFFF), idempotent for the WC window, flat for non-physical code/stack). Apply it at every point the GPU/kernel consumes a guest physical address: ring base (initialize_ring_buffer), RPtr writeback (enable_rptr_writeback), PM4 INDIRECT_BUFFER pointer, WAIT_REG_MEM / COND_WRITE memory poll+write, REG_TO_MEM / MEM_WRITE / EVENT_WRITE* / LOAD_ALU_CONSTANT / IM_LOAD addresses, the resolve dest write, and the vd_swap frontbuffer present read. This was chosen over memory-layer aliasing because the latter re-projects every CPU load/store and corrupts the guest's flat 0xA/0xC/0xE accesses (it caused an early PC=0xfffffffc fault). 
Two adjacent GPU-backend gates this exposed and also fixed (canary-faithful): - WaitCmp::from_wait_info was off by one vs canary's MatchValueAndRef selector (it decoded wait_info&7==3 as NotEqual instead of Equal), inverting the standard CP coherency wait so the GPU parked forever on the first INDIRECT_BUFFER. Remapped to 1=Less..7=Always, 0=Never. - Added MakeCoherent: a WAIT polling COHER_STATUS_HOST clears the status bit (mirrors command_processor.cc:801-838) so the coherency handshake resolves. Result: the GPU now decodes the real Type3 packets at 0x4adcc000 (ME_INIT, INDIRECT_BUFFER → real Type0/WAIT_REG_MEM at 0x4adf5080) instead of zero-headers; RPtr at 0x408619fc advances (0x13, 0x16, … written by the GPU worker); the frame loop sub_822F1AA8 actively writes the controller at 0x40d09a40 (0x20→0x21→0x23); no fault, full 200M/1B budget runs clean. draws_seen is still 0: the remaining gate is upstream and separate — the main frame loop never sets controller bit-28 (frame-ready) at [0x40d09a40] (stalls at 0x23, the known iterate-2C state-divergence gate), so the guest never enqueues a render IB; the GPU only ever replays the init IB. This fix correctly unblocks the GPU ring/IB/RPtr data path (gate-2 GPU backend); the bit-28 frame-ready gate is the next target. Stable golden (sylpheed_n50m) unchanged (draws/swaps/RTs/shaders identical at 50M); regenerated twice byte-identical. cargo test --workspace: 672 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 13:39:57 +02:00
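The translation described in the [iterate-2K] entry can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the constant names and the exact window bounds are assumptions, and the cached 0xA/0xC/0xE mirrors are simply left flat here because the message does not say how the real function treats them.

```rust
/// Low 29 bits select a byte within the 512 MB of physical DRAM.
const PHYS_MASK: u32 = 0x1FFF_FFFF;
/// Window in which `heap_alloc` commits physical backing, per the commit message.
const WC_WINDOW_BASE: u32 = 0x4000_0000;

/// Sketch of a GPU/Vd-boundary translation.
/// Assumption: only bare-physical (0x0000_0000..0x2000_0000) and write-combine
/// (0x4000_0000..0x6000_0000) addresses are folded; everything else stays flat.
pub fn physical_to_backing(addr: u32) -> u32 {
    let bare_physical = addr <= PHYS_MASK;
    let write_combine = (WC_WINDOW_BASE..WC_WINDOW_BASE + PHYS_MASK + 1).contains(&addr);
    if bare_physical || write_combine {
        // Fold the address into the committed write-combine backing.
        WC_WINDOW_BASE | (addr & PHYS_MASK)
    } else {
        // Code, stack and other non-physical ranges are returned unchanged.
        addr
    }
}
```

With the ring base from the message, `physical_to_backing(0x0adcc000)` yields `0x4adcc000`, and applying the function a second time leaves that value unchanged.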
MechaCat02	ed2e0e72fd	[iterate-2J] KeTimeStampBundle deterministic tick: fix frozen+mislaid guest clock The xboxkrnl data export KeTimeStampBundle (ordinal 0x00AD, import slot 0x820007d0 — confirmed via sylpheed.db imports table) was set up with TWO defects in the import-patch pass: 1. FROZEN: the block was written once at boot and never updated, so every field stayed a constant for the whole run (observed: the guest's clock reader sub_824AA830 = [[0x820007d0]+0x10] returned a constant 0x01d6bc0c from 5M..150M instructions). 2. WRONG LAYOUT: it stuffed the FILETIME high-dword at +0x10. The canonical X_TIME_STAMP_BUNDLE (xenia-canary kernel_state.h) is: +0x00 interrupt_time u64 (100ns since boot) +0x08 system_time u64 (FILETIME 100ns since 1601) +0x10 tick_count u32 (milliseconds since boot) +0x14 padding so [block+0x10] is tick_count in ms, not a FILETIME dword. Fix (deterministic, no wall-clock): * Initialize the block with the correct field layout (tick_count = 0 at boot, system_time = FILETIME base, interrupt_time = 0). * Store the block VA on KernelState::timestamp_bundle_addr during the import patch. * Add KernelState::update_timestamp_bundle(mem, clock) and call it every round in BOTH the lockstep (run_execution) and parallel (run_execution_parallel) outer loops, right where the deterministic Scheduler::global_clock is advanced. The clock is the retired-instruction monotonic global_clock, so every guest-visible time value stays a pure function of guest progress (lockstep byte-reproducible). * Cadence: 1 global_clock unit = 100ns (coherent with parse_timeout, which divides 100ns timeouts by 100 onto the same basis), so INSTRUCTIONS_PER_MS = 10_000. tick_count now advances 0 -> ~4999ms over a 50M-instruction window. Also make KeQuerySystemTime read the same 100ns clock instead of a frozen FILETIME constant. Verification: tick_count at 0x40002010 now advances (deadline arm at 0x82450d0c stores clock+66 = 0x260,0x269,...,0x51d,... 
advancing, vs the frozen 0x01d6bc4e before the fix). Determinism: two cold --stable-digest runs are byte-identical; the n50m golden is UNCHANGED (the clock-affected counter is not in the stable digest). 672/672 tests pass. HONEST CAVEAT — the predicted render cascade did NOT materialize on this branch. The diagnosed consuming gate at 0x82450b10 (the clock-vs-deadline compare in the worker-hub channel loop sub_82450A68) is unreachable here: the loop always branches away at 0x82450b0c ([this+220] >= channel-index), so the hub already dispatches sub_82450B68 342x in BOTH the frozen and fixed builds. Guest trajectory (imports 339766@50M / 1738001@200M / 9212446@1B), draws (0), swaps (2) and thread topology (tid14 Ready, not blocked on 0x109c) are identical frozen-vs-fixed. This commit is therefore a correct latent-clock-bug fix and determinism-safe prerequisite, NOT the render unblock. The 0x109c/tid14 starvation premise was not reproduced at f75bc96; the next gate must be re-localized. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 11:54:44 +02:00
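A rough sketch of the [iterate-2J] block layout and cadence, assuming a big-endian guest image and a caller-supplied FILETIME base. The function names and the `filetime_base` parameter are hypothetical; only the field offsets and `INSTRUCTIONS_PER_MS` come from the message.

```rust
/// 1 global_clock unit = 100 ns, so 10_000 units make one millisecond.
pub const INSTRUCTIONS_PER_MS: u64 = 10_000;

/// Fill a 0x18-byte X_TIME_STAMP_BUNDLE image from the deterministic clock.
/// `filetime_base` is a hypothetical boot-time FILETIME chosen by the caller.
pub fn write_timestamp_bundle(block: &mut [u8; 0x18], clock: u64, filetime_base: u64) {
    let interrupt_time = clock; // 100 ns since boot
    let system_time = filetime_base.wrapping_add(clock); // FILETIME, 100 ns units
    let tick_count = (clock / INSTRUCTIONS_PER_MS) as u32; // ms since boot

    // The guest is big-endian, so every field is stored with to_be_bytes.
    block[0x00..0x08].copy_from_slice(&interrupt_time.to_be_bytes());
    block[0x08..0x10].copy_from_slice(&system_time.to_be_bytes());
    block[0x10..0x14].copy_from_slice(&tick_count.to_be_bytes());
    block[0x14..0x18].copy_from_slice(&[0u8; 4]); // padding
}

/// Read back tick_count the way the guest's `[block + 0x10]` load would see it.
pub fn read_tick_count(block: &[u8; 0x18]) -> u32 {
    u32::from_be_bytes([block[0x10], block[0x11], block[0x12], block[0x13]])
}
```

Because every field is derived from the retired-instruction clock alone, two runs that retire the same instructions produce the same block contents.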
MechaCat02	f75bc96d17	[iterate-2H] PPC spin/yield/sync hint-class audit: lock no-over-yield + barrier-decode invariants Audited the full PowerPC spin/yield/sync/SMT-priority-hint instruction class against the canary oracle (ppc_emit_alu.cc InstrEmit_orx / ppc_emit_memory.cc sync/eieio/isync) and against what Project Sylpheed actually executes (static scan of the extracted image + disasm of the spin sites 0x824D1328 / 0x824C17AC / 0x824D3CF8). Findings (no behavior change required — the class is already faithful): - or rX,rX,rX SMT priority hints: canary special-cases EXACTLY 0x7FFFFB78 (db16cyc) -> DelayExecution; every OTHER or-self form -> Nop. Ours already matches (only 0x7FFFFB78 yields). Image scan: the documented priority hints or 1/2/3/6/26..30 do NOT appear in Sylpheed at all; the only SMT spin hint used is or 31,31,31 (db16cyc), already handled in `de21c7a`. The 854 `or 8,8,8` etc. are compiler register self-moves (plain no-ops), not spin hints. - sync / lwsync / ptesync share XO=598 -> all decode to PpcOpcode::sync (canary keys on XO only, identical); eieio (XO=854), isync (XO=150) decode correctly. All are value-neutral no-ops under the single-host model, matching canary MemoryBarrier/Nop. unimpl=0 in a 200M run confirms none trap. tlbsync is not implemented by canary either and is unused by Sylpheed. - mftb-based timed back-off (loop at 0x824D3CF8: mftb delta vs timeout, with db16cyc between polls and a timeout escape) relies on the already-landed db16cyc yield + coherent global-clock timebase; no deadlock, no new gap. - ori 0,0,0 canonical nop (140 sites) is value-neutral; matches canary Nop. 
Lands two regression tests that lock the audited invariants so a future change cannot over-yield on a benign priority hint (which would perturb the deterministic schedule) or break the sync L-field decode: - test_smt_priority_hints_are_nops_not_yields - test_lwsync_ptesync_eieio_isync_decode_as_benign_noops Determinism preserved (tests-only): two cold lockstep `check -n 5M` (no persist) byte-identical; golden digest unchanged (no re-baseline). Full workspace suite green. 200M cascade unchanged (packets~172M, draws=0, shaders=0, swaps=1) — confirms the hint class is exhausted; the render gate is now downstream (tid14 0x109c per-job completion event), not CPU semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 10:53:54 +02:00
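The or-self rule audited in [iterate-2H] could be expressed along these lines. The enum and function names are illustrative assumptions; the field positions follow the standard PowerPC X-form encoding (primary opcode 31, extended opcode 444 for `or`).

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum OrSelfHint {
    /// `or r31,r31,r31` (db16cyc): surface a cooperative yield.
    Yield,
    /// Any other `or rX,rX,rX`: a plain no-op.
    Nop,
    /// Not an or-self instruction at all.
    NotOrSelf,
}

/// Encoding of `or r31,r31,r31`, the only form treated as a spin hint.
pub const DB16CYC: u32 = 0x7FFF_FB78;

pub fn classify_or_self(word: u32) -> OrSelfHint {
    let primary = word >> 26;
    let rs = (word >> 21) & 0x1F;
    let ra = (word >> 16) & 0x1F;
    let rb = (word >> 11) & 0x1F;
    let xo = (word >> 1) & 0x3FF;
    let rc = word & 1;

    let is_or_self = primary == 31 && xo == 444 && rc == 0 && rs == ra && ra == rb;
    if !is_or_self {
        OrSelfHint::NotOrSelf
    } else if word == DB16CYC {
        OrSelfHint::Yield
    } else {
        OrSelfHint::Nop
    }
}
```

Keying the yield on the exact word rather than on the whole or-self family is what keeps compiler self-moves such as `or 8,8,8` from perturbing the schedule.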
MechaCat02	de21c7a544	[iterate-2G] db16cyc spin-hint cooperative yield: unblock title-screen 0x10a0 gate The silph title state machine (tid13) blocked on event 0x10a0, never signaled. Root: the event's producer chain runs on the silph worker (entry 0x821C4AD0, our tid14), which was starved. tid14 shares a HW slot with a guest spinlock/ barrier participant (sub_824D1328, entry 0x824D2940) that busy-spins on the db16cyc hint `or r31,r31,r31` (encoding 0x7FFFFB78) at 0x824D140C. Under our round-robin lockstep the spinner consumed its whole block every round and starved the co-located tid14 (only 9 progress hits over 200M instr) — so the producer never reached the event-create/duplicate/signal dance the canary oracle performs (handle F80000E8 set by the submitter F8000044 via a duplicated handle). Fix (canary-faithful): recognize the db16cyc spin hint exactly as canary's InstrEmit_orx does (code 0x7FFFFB78 -> DelayExecution) and surface it as a new StepResult::Yield. The scheduler's yield_current() promotes every Ready peer on the slot past STARVE_LIMIT so begin_slot_visit picks one next round, then they reset and the spinner reclaims the slot — fair alternation, no priority inversion, pure function of slot state (deterministic). Result (lockstep, cache-persist, -n 200M): tid14 progresses past its old stall into a real wait; tid13 advances off 0x10a0 to a new event; hub/submitter re-enter their wait loops. imports 280k->592k, packets 124M->164M, swaps 1->2. draws still 0 (the splash's first draw is a further-upstream gate). Determinism preserved (two cold n50m runs byte-identical). n50m golden re-baselined (imports 90296->339766, swaps 1->2; draws unchanged 0). n2m golden unchanged (db16cyc not reached in first 2M). Tests 670/670. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 10:38:17 +02:00
MechaCat02	f3b7e8b760	[iterate-2F] Scheduler anti-starvation floor: fix job-4 handoff render gate The lockstep scheduler's pick_runnable is strict priority (max_by_key (priority, -idx)). On a cooperative single-host HW slot, a CPU-bound spinner that never blocks (the silph poll loop pinned by affinity to hw=5) wins pick_runnable every round forever, permanently starving a co-located peer (the submitter, tid6) that the spinner is actually waiting on. On real hardware those threads run on separate SMT contexts concurrently, so the spinner never starves the submitter; ours collapses them onto one slot with no anti-starvation, turning priority (or equal-priority index order) into permanent starvation. The starved submitter never dequeued job-4 -> the worker-hub (tid5) blocked INFINITE on completion event 0x1080 -> silph (tid13) wedged on 0x1078 -> no vsync -> draws_seen=0, the publisher splash never renders. (decrement_quantum's within-slot rotation is dead: begin_slot_visit unconditionally re-pick_runnable()s each round, discarding the rotated running_idx. The fix is therefore evaluated at pick time, not via that discarded rotation.) Fix (Option A, bounded anti-starvation, deterministic): - Add per-thread steps_starved counter to GuestThread. - begin_slot_visit increments it for every Ready peer passed over this visit, resets it to 0 for the picked thread. - pick_runnable selects by effective_priority: once steps_starved reaches STARVE_LIMIT (4096) the thread is lifted to i32::MAX and wins exactly one pick, then resets. The genuinely higher-priority thread still wins ~4095/4096 visits -- the boost grants periodic forward progress only, it does NOT invert priority. Pure function of counter/priority/index -> deterministic (no wall-clock, no RNG). Cascade (lockstep exec, XENIA_CACHE_PERSIST=1, -n 200M): - submitter dequeue sub_82458508 now fires 4x (was 3x); the 4th job (buf 0x40baa2c0) is dequeued at cycle 6.15M. 
- hub tid5 leaves Blocked(0x1080) -> now Ready (no more INFINITE wait). - GPU packets 0 -> 116,101,363 (command stream now flowing). - tid13 (silph::UImpl) advances past the old 0x1078 wedge to a NEW downstream wait (handle 0x10a0); 3 new threads spawn (tid14/15/16). - draws_seen still 0 -> the splash's first draw is a NEW downstream gate, not this starvation. Determinism: two cold lockstep `check -n 5M` runs byte-identical (full and stable digests). New n50m stable digest deterministic across two cold runs. Golden re-baselined: instructions 50000007->50000003, imports 92317->90296 (trajectory shift from the changed pick order). Tests: 666/666 (+1 test_anti_starvation_bounded_progress). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 10:02:02 +02:00
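A minimal model of the bounded anti-starvation pick from [iterate-2F], assuming a much-reduced `GuestThread` with only the three fields the rule needs. It follows the description (counter bumped for passed-over Ready peers, reset on pick, boost to `i32::MAX` at `STARVE_LIMIT`); the exact visit at which the boost lands in the real scheduler may differ by one.

```rust
use std::cmp::Reverse;

pub const STARVE_LIMIT: u32 = 4096;

/// Reduced stand-in for the scheduler's thread record (assumption).
pub struct GuestThread {
    pub priority: i32,
    pub steps_starved: u32,
    pub ready: bool,
}

fn effective_priority(t: &GuestThread) -> i32 {
    if t.steps_starved >= STARVE_LIMIT { i32::MAX } else { t.priority }
}

/// Highest effective priority wins; ties go to the lower index.
pub fn pick_runnable(threads: &[GuestThread]) -> Option<usize> {
    threads
        .iter()
        .enumerate()
        .filter(|(_, t)| t.ready)
        .max_by_key(|(idx, t)| (effective_priority(t), Reverse(*idx)))
        .map(|(idx, _)| idx)
}

/// One slot visit: pick a thread, reset its counter, age every other Ready peer.
pub fn begin_slot_visit(threads: &mut [GuestThread]) -> Option<usize> {
    let picked = pick_runnable(threads)?;
    for (idx, t) in threads.iter_mut().enumerate() {
        if idx == picked {
            t.steps_starved = 0;
        } else if t.ready {
            t.steps_starved = t.steps_starved.saturating_add(1);
        }
    }
    Some(picked)
}
```

With one priority-10 spinner and one priority-5 peer, the spinner wins `STARVE_LIMIT` consecutive visits, the peer wins the next one, and the spinner then reclaims the slot. The result depends only on counters, priorities and indices, so it stays deterministic.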
MechaCat02	7e2603a9e5	[iterate-2E] Extend coherent monotonic clock to lockstep (timebase-desync livelock fix) Lockstep livelocked the scheduler the same way --parallel did before `0332d19`: the kernel deadline-arithmetic (`now_basis_at`) read per-thread `ctx(hw_id).timebase`, but a parked/poll thread has `running_idx == None` so `Scheduler::ctx()` returns `idle_ctx` (timebase 0). A poll thread (tid=7, a `KeWaitForSingleObject` loop with a 30ms relative timeout) computing its deadline via `parse_timeout` therefore read `now = 0` and registered `deadline = 0 + 3000 = 3000` — a constant ~7.78M units in the past. `coord_idle_advance` then re-armed that same constant 3000 deadline forever, pinning virtual time and starving every other thread's real future deadline. Render-gate impact: the submitter (tid=6) re-enters a 16ms-timeout WaitForMultiple after its first jobs; that timeout never fired because vtime was pinned at 3000, so virtual time never reached real future deadlines. Fix (Option A — mirror the parallel fix): drive the existing deterministic `Scheduler::global_clock` in lockstep too (floored up once per outer round to `stats.instruction_count`, a pure function of retired guest instructions — no wall-clock), and route `KernelState::now_basis_at` through `global_clock()` in BOTH modes. New `Scheduler::advance_global_clock_to(now)` floor-up keeps it monotone alongside `advance_all_timebases_to`. Parallel behavior unchanged (it already read `global_clock()`). Verified (lockstep, 50M): - DETERMINISM: two cold `check -n 5M` and two cold `-n 50M` runs byte-identical. - LIVELOCK GONE: "advanced to deadline" went from 592,679 fires / 2 unique values / 562,084 pinned at 3000 -> 18,586 fires / 18,567 unique / 0 pinned, strictly increasing 5.4M -> 50M. Poll thread tid=7 now ends Blocked with a real future deadline Some(60002824) instead of spin-Ready on the past 3000. - imports 1,790,936 -> 92,317 at 50M (the spin no longer burns import calls). 
Cascade (lockstep, XENIA_CACHE_PERSIST=1, -n 200M): engine now runs to budget instead of hard-deadlocking. Hub enqueue (sub_82458068) 4x; submitter dequeue (sub_82458508) still 3x — the lost 4th-job HANDOFF (count/notify between sub_82458068's tail and the submitter queue) is a SEPARATE downstream gate, not the timebase. New gate: tid=5 (hub) Blocked INFINITE on event 0x1080 (job-4 completion); tid=6 (submitter) Ready, parked in WaitForMultiple (sub_824AB214), loop-top stops at cycle 6.23M. draws still 0, VdSwap 1. Golden re-baseline (same commit): sylpheed_n50m instructions 50000004 -> 50000007, imports 1790936 -> 92317 (swaps/draws/RTs/shaders/textures unchanged). sylpheed_n2m unchanged (livelock onsets after 2M). Suite 665/665 + oracle green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 21:42:28 +02:00
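The floor-up clock and the deadline arithmetic from this [iterate-2E] entry can be illustrated with a toy scheduler. The struct is a stand-in and `deadline_for_relative_timeout` is a hypothetical helper, not the repository's `parse_timeout`; it only reproduces the stated conversion (100 ns timeouts divided by 100 onto the clock basis) and the NT convention that relative timeouts are negative.

```rust
/// Toy stand-in holding only the monotonic clock (assumption).
pub struct Scheduler {
    global_clock: u64,
}

impl Scheduler {
    pub fn new() -> Self {
        Scheduler { global_clock: 0 }
    }

    pub fn global_clock(&self) -> u64 {
        self.global_clock
    }

    /// Floor-up: the clock only ever moves forward.
    pub fn advance_global_clock_to(&mut self, now: u64) {
        self.global_clock = self.global_clock.max(now);
    }
}

/// Hypothetical helper: turn a relative NT timeout (negative, 100 ns units)
/// into an absolute deadline on the global-clock basis.
pub fn deadline_for_relative_timeout(sched: &Scheduler, timeout_100ns: i64) -> u64 {
    let units = timeout_100ns.unsigned_abs() / 100;
    sched.global_clock() + units
}
```

Reading `now` as 0 turns a 30 ms timeout (-300_000) into the constant deadline 3000 from the livelock. Reading it from the advancing clock places the deadline in the future instead.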
MechaCat02	5aaadfec36	[iterate-2E] Add XENIA_AUDIT_DEREF pointer-chase probe On each AUDIT-PC-PROBE fire, treat gpr[reg] as a base object, dump its first 64 bytes, follow [base+off] to a sub-object, dump that, then follow [[base+off]+0] to its vtable and dump 48 slots. Env-gated (XENIA_AUDIT_DEREF=<reg>:<off>), read-only, lockstep digest unaffected. Captures the live work-item + stream object + vtable at sub_824510E0 before the pool recycles the slot — which overturned the prior session's "infinite spin" diagnosis: the streaming read PROGRESSES 68/68 128KB chunks of a 9MB file, then the hub (tid=5) blocks INFINITE on a self-created Event/Manual (0x1060) that is never signaled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-12 20:29:01 +02:00
MechaCat02	0332d1990d	[Track 2] Parallel-scoped global clock fixes timebase-desync livelock In --parallel mode a long run livelocked: the scheduler spun "advanced to deadline 3000 waking hw=2 idx=0" ~14k times in microseconds. Root cause: each guest thread owns ctx.timebase (+1/instr in step_block), and all kernel deadline arithmetic read Scheduler::ctx(hw_id).timebase as "now". But the parallel worker extracts its PpcContext via mem::replace(ctx_mut_ref, PpcContext::new()) — leaving a ZEROED timebase in the slot while it steps unlocked — and advance_all_timebases_to only walks runqueue (never idle_ctx). So the coordinator's coord_pre_round drain and a woken thread's parse_timeout could read a zeroed/stale basis decoupled from the deadline the scheduler just advanced to. The thread re-armed the same constant deadline forever; the global clock never moved. Fix: add a single monotonic Scheduler::global_clock, advanced by the per-block retired-instruction count on each parallel writeback and floored up by advance_all_timebases_to. Kernel deadline reads route through KernelState::now_basis_at(hw_id), which returns global_clock ONLY when parallel_active; lockstep keeps reading the exact pre-existing ctx(hw_id).timebase expression, so the deterministic lockstep trace is byte-identical (sylpheed_n50m golden unchanged, zero re-baseline). Verified: - 50M --parallel run completes (was: hung). Deadlines now strictly increasing 5.4M -> 49.1M (18097 unique of 18116; max repeat 2) vs pre-fix constant 3000 x ~14000. - sylpheed_n50m golden byte-identical via plain `check` (no persist). - Full suite 665/665 green. Note: an intermittent parallel hang/crash (~1-2/20 at -n 5M) is pre-existing (master 1/20, this build 2/20 — within noise) and distinct from the timebase livelock: it is a parallel-race class (e.g. the unsafe block_ptr deref in run_execution_parallel). Tracked separately; lockstep remains the recommendation for long runs. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 19:32:14 +02:00
MechaCat02	6271ba1f55	chore: gitignore vkd3d-proton/DXVK runtime shader caches The Wine canary build drops vkd3d-proton.cache into the working dir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-12 18:06:25 +02:00
MechaCat02	48b19e490f	[Prong A] Three 32-bit ABI PPCBUG siblings corrected to canary semantics Second differential audit, lead prong: hunt siblings of PPCBUG-020 (the word-form ALU truncation fixed in `341196a`, whose "32-bit ABI / MSR.SF=0" premise was false — Xenon is a 64-bit core). Found three more band-aids of the same class, each verified against the canary oracle. All three are genuine oracle/ISA divergences but INERT on Sylpheed's lockstep trace (sylpheed_n50m golden digest unchanged; no re-baseline). Fixed + directed tests anyway to close the band-aid class (per audit decision). 1. slw/srw shift-count mask (PPCBUG-044 site). Ours tested the full u32 count `< 32`; canary InstrEmit_slwx/srwx mask `rb & 0x3F` then test bit 5. A count like 0x40 (low-6-bits 0) must pass the value through, not zero it. Fixed both to `& 0x3F`. The 32-bit CR0 i32-view is unchanged (genuinely 32-bit). 2. sraw/srawi result extension (PPCBUG-041/042/043 "writeback truncation"). Ours zero-extended the 32-bit arithmetic-shift result (`result as u32 as u64`); PowerISA + canary InstrEmit_srawx/srawix SIGN-extend it (`f.SignExtend`, the `(i64.s)&¬m` fill). 0x80000000>>1 is now 0xFFFFFFFF_C0000000, not 0x00000000_C0000000. CA math and CR0 view byte-identical. 3. mtspr CTR width (PPCBUG-054). Ours stored `val as u32 as u64`, dropping the upper 32 bits; CTR is a 64-bit SPR and canary InstrEmit_mtspr stores the full GPR (`f.StoreCTR(rt)`). A later `mfspr rX, CTR` now round-trips correctly. bdnz/bcctr still consume only CTR's low 32 bits (the bcx zero-TEST truncation at line ~922 MATCHES canary's `f.Truncate(ctr, INT32_TYPE)` — left untouched). Tests: updated srawx_negative_value_sign_extends_upper, srawix_high_count_negative_input_sign_extends_all_ones, and mtspr_ctr_keeps_full_64_bits (formerly premise-defending the bugs — reading-error #24). Added slwx/srwx 6-bit-mask tests, mfspr_ctr round-trip, and the rlwinm MB>ME wraparound-mask test (plan-requested gap closure). 665/665. 
Left correct (re-confirmed vs canary, do NOT touch): bcx/bclr CTR 32-bit test, divw/divwu zero-extend quotient (canary f.ZeroExtend, ISA upper undefined), extsb/extsh, logical-NOT chain, mulhw/mulhwu, srawx 0x3F mask, pixel pack/unpack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 17:25:41 +02:00
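Items 1 and 2 of the [Prong A] entry reduce to a few lines of value arithmetic. The sketch below models only the result value (no XER[CA] or CR0 handling), and the free-function form is an assumption made for illustration.

```rust
/// slw: shift the low word left; the count is masked to 6 bits and bit 5 zeroes the result.
pub fn slw(rs: u64, rb: u64) -> u64 {
    let n = (rb & 0x3F) as u32;
    if n & 0x20 != 0 { 0 } else { ((rs as u32) << n) as u64 }
}

/// srw: logical right shift of the low word, same count rule as slw.
pub fn srw(rs: u64, rb: u64) -> u64 {
    let n = (rb & 0x3F) as u32;
    if n & 0x20 != 0 { 0 } else { ((rs as u32) >> n) as u64 }
}

/// sraw: arithmetic right shift of the low word, SIGN-extended to 64 bits.
pub fn sraw(rs: u64, rb: u64) -> u64 {
    let n = (rb & 0x3F) as u32;
    let value = rs as u32 as i32;
    // Counts of 32..63 replicate the sign bit across the whole word.
    let shifted = if n & 0x20 != 0 { value >> 31 } else { value >> n };
    shifted as i64 as u64
}
```

A count of 0x40 has zero in its low six bits, so `slw(1, 0x40)` passes the value through, while `sraw(0x8000_0000, 1)` gives `0xFFFF_FFFF_C000_0000` as stated in the message.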
MechaCat02	341196a111	[Issue-1 PPCBUG-020] Word-form ALU ops produce full 64-bit results Xenon is a 64-bit PPC core (32-bit pointer ABI, but 64-bit registers and integer arithmetic). The interpreter was truncating every word-form integer ALU writeback to 32 bits and zero-extending, on a false "MSR.SF=0 / 32-bit ABI" premise. This silently corrupted any genuine 64-bit value flowing through word-form arithmetic. Confirmed load-bearing via runtime ours-vs-canary capture: Sylpheed's millisecond->LARGE_INTEGER timeout converter sub_824ACA88 does `clrldi; mulli r11,r11,-10000; std`. For a 16 ms wait the correct result is -160000 = 0xFFFFFFFF_FFFD8F00 (relative). canary stores exactly that; ours' truncating `mulli` stored 0x00000000_FFFD8F00 (positive) -> the i64 timeout read as a huge absolute deadline -> a ~26000x over-wait that froze the main frame loop. After the fix the timeout matches canary and the previously-frozen frame/worker loops run (parallel boot NtWaitForMultipleObjectsEx 94 -> 30428; KeWaitForSingleObject/critical-section loops resume). Fix mirrors canary's INT64 emitters (ppc_emit_alu.cc) op-by-op for the 17 data-losing word-form ops: addis, addic(.), subfic(.), mulli, add(c/e/ze/me)x, subf(c/e/ze/me)x, negx, mullwx. Only the result writeback widens to full 64 bit; the 32-bit carry (XER[CA]) and overflow (XER[OV]) computations and the CR0 i32 view are preserved byte-identical (the low 32 bits of the new result equal the old truncated result), so this is a strict no-op for clean 32-bit values and only restores the previously-zeroed upper bits for genuine 64-bit values. Genuinely-32-bit ops (rlwinm/slw/srw/cmpw, mulhw/divw whose upper bits are ISA-undefined) are left untouched. Updated 7 unit tests that asserted the truncation (they encoded the bug) to the canary-correct full-64-bit values. 
Re-baselined the sylpheed_n50m golden (imports 40454 -> 1790936: the unwedged frame/worker loops now cycle under the instruction-count timebase); sylpheed_n2m unchanged (pre-frame-loop). Lockstep determinism preserved (two 50M runs identical). Full suite 660/660. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-12 16:21:11 +02:00
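The timeout arithmetic in the [Issue-1 PPCBUG-020] entry can be checked with a small sketch. Both functions are illustrative assumptions that model only the `mulli` result value: one keeps the full 64-bit product, the other reproduces the old truncating writeback.

```rust
/// mulli with a full 64-bit writeback (the corrected behavior).
pub fn mulli_full(ra: u64, simm: i16) -> u64 {
    (ra as i64).wrapping_mul(simm as i64) as u64
}

/// mulli with the old behavior: keep the low 32 bits and zero-extend.
pub fn mulli_truncating(ra: u64, simm: i16) -> u64 {
    mulli_full(ra, simm) as u32 as u64
}
```

For a 16 ms wait, `mulli_full(16, -10000)` is `0xFFFF_FFFF_FFFD_8F00`, which reads as -160000 (a relative timeout). The truncated form `0x0000_0000_FFFD_8F00` reads as a large positive value, which is the over-wait the message describes.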
MechaCat02	b20c99f141	[Subsystem-fixes] 6 verified ours-vs-canary divergence fixes From the 2026-06-12 5-subsystem differential audit. All verified against canary as oracle; 660/660 workspace tests green (655 + 5 new). 1. nt_create_event polarity (exports.rs) — `manual_reset = gpr[5] != 0` was INVERTED. Canary xboxkrnl_threading.cc:668 `Initialize(!event_type,..)` + xevent.cc:41 (type 0 = NotificationEvent = manual, type 1 = Sync = auto). Now `== 0`. Was the dormant 2.AI fix on chore/portable-snapshot, never merged. The Ke-path was already correct; only the Nt-path was wrong. 2. 2.AF deadline drain (main.rs coord_pre_round) — expired KeWait/KeDelay deadlines never fired under load because advance_to_next_wake_if_due was only called in coord_idle_advance (no-Ready-threads path). Added a per-round drain loop; covers BOTH lockstep and parallel outer loops since both call coord_pre_round. Was the dormant 2.AF fix, never merged. 3. handle slab-recycle ABA guard (state.rs + scheduler.rs) — release_handle_slot (my round-34 regression) recycled a closed slot even with a thread still parked on it, risking a stale-waiter wake when the slot is re-minted. Added Scheduler::any_thread_waiting_on; decline to recycle a still-waited slot. 4. vpkpx pixel-pack (vmx.rs) — wrong field mapping (~100% mismatch). Now exact canary ppc_emit_altivec.cc:1795 shift/mask (red 6b out[15:10] from w[24:19], green out[9:5] from w[14:10], blue out[4:0] from w[7:3]; no fabricated alpha bit). +unit test. 5. VFS GDFX attribute plumbing (vfs/, exports.rs query fns) — VfsEntry now carries the real on-disc attribute byte (GDFX dirent +12, canary disc_image_device.cc:136/154) instead of inferring directory-ness from path shape. Query exports report the real FILE_ATTRIBUTE_ bits. Candidate driver of the XamShowDirtyDiscErrorUI gate. +tests. 6. MmGetPhysicalAddress region-aware mirror (exports.rs) — flat 0x1FFFFFFF mask missed canary's +0x1000 host_address_offset for 0xE0000000+ mirror (memory.cc:2317). 
Read-only query; proven byte-identical 50M digest. +test. Investigated and intentionally NOT changed: - zero-on-recommit: no-op; ours has no region-reuse path (bump allocators, free is a stub). - 32-bit ALU writeback truncation (PPCBUG-020): documented-deliberate; premise (MSR.SF=0) is questionable but flipping it is out of scope here. - KeSetEvent/NtSetEvent return value: ours returns true previous state (hardware-faithful); canary returns constant 1 — NOT an ours bug. sylpheed_n50m golden will need re-baselining (legit behavior change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-12 14:57:38 +02:00
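The vpkpx field mapping listed in item 4 of the [Subsystem-fixes] entry translates directly into shifts and masks. This sketch packs a single 32-bit word; the function name is an assumption, and the real instruction applies the same packing to every word of its vector operands.

```rust
/// Pack one 32-bit pixel word into 16 bits using the mapping from the commit message:
/// out[15:10] <- w[24:19], out[9:5] <- w[14:10], out[4:0] <- w[7:3].
pub fn vpkpx_word(w: u32) -> u16 {
    let red = (w >> 9) & 0xFC00; // w[24:19] lands in out[15:10]
    let green = (w >> 5) & 0x03E0; // w[14:10] lands in out[9:5]
    let blue = (w >> 3) & 0x001F; // w[7:3] lands in out[4:0]
    (red | green | blue) as u16
}
```

No bit outside those three source ranges reaches the output, which matches the note that no alpha bit is fabricated.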
MechaCat02	db90ad0f7d	[AUDIT-059 R-D2] Phase D auto-signal POC confirms audit-049 wedge diagnosis Hook NtCreateEvent for the silph::UImpl tid=13 chain (entry=0x821748F0, start_context=0x4024a840, frame-1 LR=0x821CB15C inside sub_821CB030+0x128) and auto-signal the resulting handle after XENIA_SILPH_UI_AUTOSIGNAL_DELAY instructions. Env-gated; default off. SR4 verdict B (partial unwedge): - handle 0x1078 signal_attempts 0->1 - tid=13 Blocked(WaitAny[0x1078]) -> Ready pc=0x824a9108 - ExCreateThread 10 -> 12 (new silph::UImpl tid=14, worker tid=15) - New downstream wedges 0x1084 + 0x1088 - cxx_throw runtime_error on tid=5 inside R26 dispatcher (BST not-registered instance lhs=0x715a7af0) - VdSwap stays 1; no draws (POC is diagnostic, not final fix) Confirms Phase C diagnosis end-to-end. The real signaler must (a) drive NtSetEvent on the silph KEVENT AND (b) register the dispatcher's BST instance upstream; this POC only does (a). Reading-error class #20: ctx.lr at kernel export entry is the thunk wrapper's return slot, NOT the guest caller's post-bl PC. Walk back-chain 1 step to get frames[1].lr. Reading-error class #21: --parallel and lockstep have SEPARATE outer loops in main.rs (run_execution_parallel line 2928 vs run_execution line 2706). Per-round hooks must be wired in BOTH paths. Files: - crates/xenia-cpu/src/scheduler.rs: GuestThread.start_entry/start_context fields + spawn() population + current_thread_entry_and_ctx() helper - crates/xenia-kernel/src/state.rs: AutoSignalPending struct, env-parsed silph_autosignal_delay, pending Vec, last_cycle_hint, set_now_cycle_hint, maybe_register_silph_autosignal (walks back-chain), fire_due_silph_autosignals - crates/xenia-kernel/src/exports.rs: hook in nt_create_event - crates/xenia-app/src/main.rs: fire-site + cycle hint in both outer loops - audit-runs/audit-059-handle-disambiguation/round-D2-autosignal-poc/FINDINGS.md Tests 655/655 green. Default behavior byte-identical when env unset. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-11 18:38:38 +02:00
MechaCat02	481591fdb2	[AUDIT-059 R-C1] Phase C: bit-28 setter hypothesis REFUTED via dump-addr Phase A's diagnosis (bit 28 of [0x40d09a40] gets set to exit sub_822F1AA8's loop) is falsified by direct probe + --dump-addr in 4 sub-rounds. Key evidence: - sub_821B55D8 candidate fn fires 0× in ours; sub_824AA858 (XamInputSetState wrapper) fires 0× in canary too — chain is dead code in both engines. - end-of-run dump shows [0x40d09a40+0] = 0x00000021, same as at entry — bit 28 is NEVER set. - bcctrl at PC 0x822F1B4C (sub_822F1AA8+0xA4) fires (LR=0x822F1B50) but the post-bcctrl BB head 0x822F1B50 fires 0× — bcctrl never returns. - sub_82173990 (vtable[0] of singleton at [0x828E1F08]) is the call target; tid=1 wedges inside this 768-byte function on a thread-join to handle 0x1070 (= tid=13's thread handle). - tid=13 (entry=sub_821748F0, ctx=0x4024a840, handle=0x1070) reaches sub_821C4EB0 (silph::UImpl@GamePart_Title) at cycle 1882 → audit-049 cluster IS reached, wedges on handle 0x1078 there. C.2 force-clear POC NOT EXECUTED — would be no-op since bit 28 is never set. Per plan stopping criterion, hand back instead of proceeding blind. Adds reading-error class #19: disasm-pattern-match without runtime verification (Phase A scanned 49 oris-0x1000 sites and declared one the setter without ever observing the bit get set). No xenia-rs source changes. Canary repo also unchanged (config edit reverted clean). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-11 17:57:27 +02:00
MechaCat02	52c30d82a7	[AUDIT-059 R-A] Phase A backward-trace: divergence is sub_822F1AA8 loop exit, not factory/registry Round-37 anchor reframe: both engines install the SAME static .rdata vtable 0x820A183C at [0x828E1F08]. Instance VAs differ only because of ε-class allocator divergence (audit-043). vtable bytes byte-identical; the user prompt's "factory/registry" framing was falsified. Phase A walkthrough (rounds A1..A8): - A.1 canary --audit_jit_prolog_pc=0x821741C8: tid=6, r3=0xBCCC4A80 (= inner sub-object of [0x828E1F08]'s singleton), LR=0x822F1D5C (return-from-bctrl inside sub_822F1AA8) - A.2 found tid=6 spawn site sub_821746B0 at PC 0x82174824 spawning entry=sub_821748F0 ctx=BC365700/BC366DA0. sub_822F1AA8 ALSO spawns a second thread (entry=sub_822F1EE0 ctx=BCE24A40) at PC 0x822F1B08 - A.3 sub_822F1AA8 has 2 callers, both in sub_8216EA68 (its sole caller is sub_824AB748 = entry_point) - A.4 ours mirror probe: sub_821746B0 enters, [0x828E2B14] gate passes, ExCreateThread fires returning handle 0x1070 (= tid=13). Ours' tid=13 IS the same logical thread as canary's spawned silph initializer - A.5 canary --audit_jit_prolog_pc=0x821749C0: fires only 2× on short-lived tid=17, tid=26 (the spawned initializers — NOT tid=6) - A.6 canary --audit_jit_prolog_pc=0x822F1AA8: fires 1× on tid=6 with r3=0xBCE24A40 LR=0x8216EE14 (the second sub_822F1AA8 call site) - A.7 canary --audit_jit_prolog_pc=0x824AB748 (entry_point): fires on tid=00000006. CONFIRMS canary's tid=6 = canary's main thread. Verdict: identical call chain entry_point → sub_8216EA68 → sub_822F1AA8 in both engines; same controller (ε-divergent VA, byte-identical fields). Canary's main thread stays in sub_822F1AA8's dispatcher loop firing sub_821741C8 ~1678×/30s. Ours' main thread exits the loop and thread-joins on the spawned initializer (tid=13), which is itself wedged on handle 0x1078 forever. Loop exit is gated by bit 28 of [r30+0] (the controller's flag word). 
Same value 0x21 at function entry in both engines. Some code between entry and loop check sets bit 28 in ours but not in canary. Mem-watch on 0x40d09a40 shows zero guest stores in ours' 50M parallel run — setter is either a kernel-side store, computed alias, or probe-quantum-elided JIT store. Phase B classification: Class 3a (state-divergence on controller object). The vtable is the same; the controller's bit 28 evolves differently during sub_822F1AA8 setup. Class 4 (synthesis) is now less attractive since we correctly reach the dispatcher with the right inputs — we just exit too soon. Phase C will need either JIT instrumentation to identify the bit-28 setter, or a kernel-side hook to clear bit 28 on entry to the loop check site. Findings notes: - round-A4b-ours-spawn-gate/FINDINGS.md (spawn topology + tid mapping) - round-A8-ours-822F1AA8-trace/FINDINGS.md (full loop structure + bit-28 gate) New reading-error class #18: probe-output anchor misframing (singleton[VA]=X vtable=Y was misread as "Y is canary-only vtable" when Y is the same .rdata vtable in both engines). Branch: iterate-2C/silph-ui-spawn-trace off master @ `229b46c`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-11 17:02:20 +02:00
MechaCat02	229b46c765	[Kernel] Slab-recycle handle allocator (AUDIT-059 R34) Adds a FIFO free list of closed handle slots so alloc_handle returns recycled IDs before bumping next_handle. Mirrors canary's slab-style ObjectTable: F8000098 reused 130x per 30s window in canary, but ours' monotonic bump allocator never reused slots, so a recycled slot in canary maps to a fresh, never-reused slot in ours, drifting kernel object identity per AUDIT-042's analysis. release_handle_slot is wired into nt_close's refcount==0 branch and gated to the canonical [0x1000, 0xF000_0000) range so synthetic XAudio park handles (AUDIT-048) are never recycled. Verified: all 655 workspace tests green, smoke tests at -n 50M show NtClose 115/run with handle table renumbering active (round-34 max handle 0x12ac vs round-16 baseline 0x12b8 over same workload). γ-cluster #2 wedge unchanged: silph wait still parks tid=13 on the renumbered handle (4216=0x1078 here vs 0x12a4 baseline), confirming the wedge is independent of allocator policy. Lands as a parity fix to bring our kernel-object identity in line with canary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-10 18:04:34 +02:00
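A small model of the slab-recycle policy from the [Kernel] entry: a FIFO free list consulted before the bump pointer, with recycling gated to [0x1000, 0xF000_0000). The struct layout, the starting handle and the step of 4 between handles are assumptions for illustration.

```rust
use std::collections::VecDeque;

const FIRST_HANDLE: u32 = 0x1000;
const RECYCLE_END: u32 = 0xF000_0000;
/// Assumed spacing between freshly minted handles.
const HANDLE_STEP: u32 = 4;

pub struct HandleTable {
    next_handle: u32,
    free: VecDeque<u32>,
}

impl HandleTable {
    pub fn new() -> Self {
        HandleTable { next_handle: FIRST_HANDLE, free: VecDeque::new() }
    }

    /// Return the oldest recycled slot if there is one, else bump next_handle.
    pub fn alloc_handle(&mut self) -> u32 {
        if let Some(handle) = self.free.pop_front() {
            return handle;
        }
        let handle = self.next_handle;
        self.next_handle += HANDLE_STEP;
        handle
    }

    /// Queue a closed slot for reuse; handles outside the canonical range are declined.
    pub fn release_handle_slot(&mut self, handle: u32) -> bool {
        if (FIRST_HANDLE..RECYCLE_END).contains(&handle) {
            self.free.push_back(handle);
            true
        } else {
            false
        }
    }
}
```

Releasing 0x1000 and allocating again returns 0x1000 rather than a fresh ID, while a synthetic handle at or above 0xF000_0000 is never queued for reuse.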
MechaCat02	40f208ea4e	[2.BF] Silph WorkerCtx: install canary's real sub-vtable at [+0x2C][0] Round-21 pivot of the audit-059 synth-spawn module. Round 20 made the silph::WorkerCtx workers run by attaching a 32-slot stub sub-vtable where every entry was a `li r3, 0; blr` stub — workers spawned but spun forever because slots 15/17 short-circuited to NULL ("no work"). Round 21 reads canary's real sub-vtable VA out of the XEX `.rdata` — `0x8200A168` — and points `[sub_object + 0]` at it directly. The vtable bytes live in the static image both engines map, so no guest memory is consumed and slot 15 (= `sub_824FCCC8`) and slot 17 (= `sub_824FCE38`) — the only slots `sub_82506B08` ever calls — become working game methods. Discovery method (canary probes in `audit-runs/audit-059-handle-disambiguation/round21-subvtable-canary/`): 1. `--audit_jit_prolog_pc=0x82506B08` to catch the first WorkerCtx virtual-dispatch entry; `[r3+0x2C]` revealed the sub-object VA. 2. Re-run with `--audit_jit_prolog_mem_dump=<sub-obj VA>` to deref `[sub-object + 0]` = sub-vtable VA = 0x8200A168. 3. PE inspection (`xex-text/xex-rdata` is the static image) reads all 31 slots; slot 15 -> sub_824FCCC8, slot 17 -> sub_824FCE38. Smoke metrics (50M instructions, `XENIA_CACHE_PERSIST=1 XENIA_SILPH_SYNTH=1`, audit-runs/audit-059-handle-disambiguation/round21-real-vtable/): * 4/4 workers spawned, no crash, no new fault * KeSetEvent 633885 -> 431860 (-32%) * KeWaitForSingleObject 258441 -> 185762 (-28%) * Per-handle state unchanged on the focused stalled set (0x1020/0x1090 still `<NO_SIGNALS_DESPITE_WAITS>`, 0x12a4/0x12ac/0x1218/0x1224 still `<UNCREATED>`). * No VdSwap/draws progression observed in this window. Verdict: B (partial). The workers no longer spin in a stub-loop — internal call density shifted — but the focused wedge handles still don't get signalled. 
Likely root cause: workers may now be waiting on the WorkerCtx's own KEVENTs (which we synthesised at +0x54/+0x94) for upstream work that no producer is enqueuing. Net LOC: 29 ins / 31 del. Tests: workspace passes (lockstep app tests, kernel 127/127, hir 288/288, scheduler 38/38). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 21:19:52 +02:00
MechaCat02	8683fb59ed	[2.BF] Silph WorkerCtx: synthesize sub-object + vtable at [+0x2C] Audit-059 round 19 isolated the round-18 worker fault: the four silph::WorkerCtx worker bodies all execute the sequence lwz r3, 44(rN) ; r3 = [ctx+0x2C] — sub-object pointer lwz r11, 0(r3) ; r11 = sub-object vtable lwz r11, 60(r11) ; r11 = sub-object vtable[15] mtctr r11 bctrl Ours left [ctx+0x2C] NULL → PC=0 fault on first virtual dispatch. Round 19 recommended materialising a sub-object whose vtable points entirely at an existing trivial-return stub so workers idle live, returning NULL work, without crashing. Changes (silph_synth.rs only, +63/-6): - Grow SILPH_CTX_SIZE 0x500 → 0x800 to embed sub-object at +0x300 and a 32-slot sub-vtable at +0x500 in the same heap_alloc. - After ctx header init, write sub-object pointer at [ctx+0x2C], the XEX-resident wrapper constant 0xBE568F00 (round-7 finding) at [ctx+0x30], and leave [ctx+0x28] NULL (matches canary first-fire snapshot). - Populate every slot of the 32-entry sub-vtable with VA 0x8216CAA4, the first 4-byte-aligned standalone `li r3, 0; blr` stub located by a fresh PE-text scan (preceded by a `blr` terminating the previous function). - Sub-object body itself is zero-filled apart from the [+0]=vtable_ptr write; round-19 disassembly confirms workers only touch slots 15/17. Smoke (XENIA_SILPH_SYNTH=1, persistent cache, 5e7 instr): - Lockstep: no crash, all 4 workers (tid=6/7/8/9) reach Ready in deep worker-body PCs (0x825067xx/0x825089xx/0x825091xx). Verdict (D) — workers run their idle loop returning NULL; existing silph waiters (0x1020, 0x1090) remain <NO_SIGNALS_DESPITE_WAITS> because we deliberately neutered productive work. - Parallel: identical picture, no PC=0/PC=garbage fault anywhere. No regression in 765-test suite. 
Next round: feed real work-items into the intrusive ring at ctx+0x210 so workers' returned-NULL idle becomes returned-work productive; or discover which sub-vtable slots actually need real callees (slot 15 worker drain, slot 17 producer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 21:04:04 +02:00
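The layout arithmetic in 8683fb59ed can be checked with a small model. This is a sketch under stated assumptions: the context is modelled as a plain big-endian byte buffer rather than real guest memory, and `build_ctx`, `resolve_slot`, `put_be32`, and `get_be32` are hypothetical helpers. The offsets, `SILPH_CTX_SIZE`, the stub VA, and the wrapper constant are taken from the commit message. `resolve_slot` follows the same three loads as the worker's `lwz` chain, where byte offset 60 corresponds to slot 15 (60 / 4).

```rust
pub const SILPH_CTX_SIZE: usize = 0x800;
pub const SUB_OBJECT_OFF: usize = 0x300;
pub const SUB_VTABLE_OFF: usize = 0x500;
pub const SUB_VTABLE_SLOTS: usize = 32;
pub const STUB_VA: u32 = 0x8216_CAA4; // `li r3, 0; blr` stub
pub const WRAPPER_CONST: u32 = 0xBE56_8F00;

/// Store a big-endian u32 (guest PowerPC byte order) at `off`.
pub fn put_be32(buf: &mut [u8], off: usize, v: u32) {
    buf[off..off + 4].copy_from_slice(&v.to_be_bytes());
}

/// Load a big-endian u32 from `off`.
pub fn get_be32(buf: &[u8], off: usize) -> u32 {
    u32::from_be_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Build a zero-filled ctx image whose sub-object and sub-vtable live in
/// the same allocation. `ctx_va` is the guest VA of the allocation.
pub fn build_ctx(ctx_va: u32) -> Vec<u8> {
    let mut buf = vec![0u8; SILPH_CTX_SIZE];
    let sub_obj_va = ctx_va + SUB_OBJECT_OFF as u32;
    let sub_vt_va = ctx_va + SUB_VTABLE_OFF as u32;
    put_be32(&mut buf, 0x2C, sub_obj_va); // [ctx+0x2C] = sub-object
    put_be32(&mut buf, 0x30, WRAPPER_CONST); // [ctx+0x30]; [ctx+0x28] stays NULL
    put_be32(&mut buf, SUB_OBJECT_OFF, sub_vt_va); // [sub-object+0] = vtable
    for slot in 0..SUB_VTABLE_SLOTS {
        put_be32(&mut buf, SUB_VTABLE_OFF + slot * 4, STUB_VA);
    }
    buf
}

/// Follow ctx -> sub-object -> vtable -> slot, as the worker bodies do.
pub fn resolve_slot(buf: &[u8], ctx_va: u32, slot: usize) -> u32 {
    let sub_obj_off = (get_be32(buf, 0x2C) - ctx_va) as usize;
    let vtable_off = (get_be32(buf, sub_obj_off) - ctx_va) as usize;
    get_be32(buf, vtable_off + slot * 4)
}
```

The 32-slot table occupies 0x80 bytes starting at +0x500, so it ends at +0x580 and fits inside the 0x800-byte allocation.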
MechaCat02	b5885b8560	[2.BF] Synthetic silph::WorkerCtx spawn (round 18 — opt-in landing) Adds infrastructure to synthesise the silph::WorkerCtx that AUDIT-058/059 identified as never reached by ours' static-init chain (real chain entry sits in audit-059 round 9's wrong-vtable wedge at sub_82172BA0+0x1E8). Ctx layout follows round 5's live hexdump from canary: +0x00 vtable = 0x8200A1E8 +0x04 self +0x08 intrusive list head -> self +0x0C init flag = 1 +0x10 packed byte field +0x18 2x float ~1.0 (UI rates) +0x24 flag = 1 +0x28..+0x30 3x foreign-arena pointers (left NULL — see below) +0x54..+0x84 4x X_KEVENT auto-reset, state=0 +0x94..+0xC4 4x X_KEVENT manual-reset, state=1 (pre-signaled) +0x210..+0x250 4-entry intrusive work-ring, empty Worker spawn mirrors AUDIT-048's audio-worker pattern in xaudio_register_render_driver: per-worker allocate_thread_image + state.scheduler.spawn with r3 = ctx_ptr. Trigger fires at the first dat/* VFS open (ours' earliest is dat/files.tbl), which is when canary runs the equivalent chain. ROUND 18 OUTCOME — opt-in only: With workers spawned Ready (XENIA_SILPH_SYNTH=1), boot CRASHES at cycle ~5.5M with PC=0 on hw=1, just after worker_3 (entry 0x825065B8) spawns. Per task constraints this is STOP-and-report: the ctx fields +0x28/+0x2C/+0x30 (foreign heap pointers — canary's 0x30057018, 0xBCE25640, 0xBE568F00, distinct arenas per audit-059 round 7) are left NULL, and the worker bodies plausibly dereference one of them. Synthesising those is a fresh investigation (round 19+). With workers spawned Suspended (XENIA_SILPH_SYNTH=suspend), boot completes normally (11 spawns, VdSwap=1, KeSetEvent=2, KeReleaseSemaphore=1 — matches default baseline). The ctx remains materialised in guest memory at the logged VA for downstream probing. Default (env var unset): no synth, no regression. 
Files: crates/xenia-kernel/src/silph_synth.rs (new, 225 LOC) crates/xenia-kernel/src/lib.rs (+1 LOC, register module) crates/xenia-kernel/src/exports.rs (+37 LOC, hook in open_vfs_file) crates/xenia-kernel/src/state.rs (+18 LOC, 4 silph_synth_* fields) Tests: cargo test --release --workspace = 765 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 20:44:29 +02:00
MechaCat02	9340ff4592	[Audit] --audit-r3-dump-bytes: dump N bytes at r3 when probe fires AUDIT-059 round 15 — diagnostic. When `--audit-r3-dump-bytes=N` is set, every `--audit-pc-probe-hex` fire emits a paired `AUDIT-R3-DUMP` line with N bytes of guest memory from r3 as u32 lanes (4-byte aligned, cap 256B). Sized for the 80-byte stack-local struct at sub_82452DC0's `r31+96` (probe sub_8245B000 entry where r3 IS the struct ptr). Settable via `XENIA_AUDIT_R3_DUMP_BYTES` env. Read-only; lockstep digest unaffected (empty-set fast path in fire_audit_pc_probe_if_match). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 19:39:22 +02:00
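A rough sketch of the lane dump described in 9340ff4592. The helper names, the exact text after the `AUDIT-R3-DUMP` tag, and the choice to round the length down to a multiple of 4 are assumptions; the commit message only states that the dump is emitted as u32 lanes, 4-byte aligned, and capped at 256 bytes. Guest memory is read big-endian here, matching the PowerPC guest.

```rust
/// Cap from the commit message: at most 256 bytes per dump.
pub const MAX_DUMP_BYTES: usize = 256;

/// Split the bytes starting at r3 into big-endian u32 lanes.
/// `mem` is a hypothetical view of guest memory beginning at r3.
/// The length is capped, clamped to the view, and rounded down to 4.
pub fn dump_lanes(mem: &[u8], n: usize) -> Vec<u32> {
    let len = n.min(MAX_DUMP_BYTES).min(mem.len()) & !3;
    mem[..len]
        .chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Render one dump line. The field layout after the tag is illustrative.
pub fn format_r3_dump(r3: u32, lanes: &[u32]) -> String {
    let body: Vec<String> = lanes.iter().map(|w| format!("{:08X}", w)).collect();
    format!("AUDIT-R3-DUMP r3=0x{:08X} {}", r3, body.join(" "))
}
```

For the 80-byte struct mentioned in the commit, `dump_lanes(mem, 80)` would yield 20 lanes.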
MechaCat02	bcd018659b	[Audit] --audit-mem-dump-chain: deref a guest address N levels for diagnosis Round-14 of AUDIT-2BF (singleton-dump). The bctrl at sub_822F1AA8+0x90 (PC 0x822F1B4C) loads [0x828E1F08] (a global singleton), dereferences its vtable, and indirect-calls vtable[0]. Canary returns; ours hangs. To name the resolved target we need to dump the (singleton, vtable, vtable[0]) chain on probe firing. Adds `--audit-mem-read-hex` / `XENIA_AUDIT_MEM_READ` taking a single guest VA. When set and any `--audit-pc-probe-hex` PC fires, the kernel emits a paired `AUDIT-MEM-READ` line with three guest reads: AUDIT-MEM-READ addr=0x828E1F08 val=<addr> vtable=<addr> \ vtable[0]=<addr+0> vtable[24]=<*addr+24> ... `vtable[24]` is included as the slot-6 method (audit-059 round 9 documented the canary silph chain dispatching slot 6 of a vtable here). Read-only; lockstep digest unaffected. ~30 LOC across state.rs and main.rs. `cmd_check` opts out of the flag (same policy as the existing audit_pc_probe_hex). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 12:13:42 +02:00
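The three-level read in bcd018659b (singleton, vtable, vtable slots) might look like the sketch below. `ChainDump` and `deref_chain` are hypothetical names, and guest memory is abstracted as a closure returning `None` for unmapped addresses; the real kernel's memory API is not shown in the commit message.

```rust
/// Result of walking addr -> object -> vtable -> slots.
/// A `None` field means that read (or an earlier one) hit unmapped memory.
#[derive(Debug, PartialEq)]
pub struct ChainDump {
    pub val: Option<u32>,    // [addr]: the singleton pointer
    pub vtable: Option<u32>, // [val]: the object's vtable pointer
    pub slot0: Option<u32>,  // [vtable + 0]
    pub slot6: Option<u32>,  // [vtable + 24]: slot 6, since 24 / 4 = 6
}

/// Walk the chain with a caller-supplied guest read function.
pub fn deref_chain<F>(read_u32: F, addr: u32) -> ChainDump
where
    F: Fn(u32) -> Option<u32>,
{
    let val = read_u32(addr);
    let vtable = val.and_then(|a| read_u32(a));
    let slot0 = vtable.and_then(|vt| read_u32(vt));
    let slot6 = vtable.and_then(|vt| read_u32(vt.wrapping_add(24)));
    ChainDump { val: val, vtable: vtable, slot0: slot0, slot6: slot6 }
}
```

Because every step is an `Option`, a NULL or unmapped singleton produces a dump with empty fields rather than a fault, which keeps the probe read-only.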
MechaCat02	09e59e09b7	Audit-2BF.delta: add --audit-pc-probe-hex for silph-init bctrl probe Adds a per-PC probe analogous to --lr-trace / --branch-probe but tuned for the silph init chain's virtual-dispatch site at sub_82172BA0+0x1E8 (PC 0x82172D88, the bctrl after a 3-deep `lwz` chain that loads vtable slot 6). Each fire emits one AUDIT-PC-PROBE line with (pc, tid, hw, cycle, lr, r3, r11) plus four guest-memory dereferences off r3 — the vtable, slot-6 method pointer, auxiliary handle field, and embedded sub-object vtable — so the line can be compared head-to-head with canary's round-9 capture (r3=0xBCCC52C0, [r3+0]=0x820A3644, slot6=sub_821B55D8, [r3+0xC]=0xF80000D8, [r3+0x30]=0x820A1870) to identify whether ours dispatches to the wrong vtable on a correct object (case A) or to a wrong object entirely (case B). Why this addition rather than reuse of an existing probe: --lr-trace emits JSONL designed for canary-side diffing and only captures r3/r4/r5/r6/lr (no memory dereferences); --branch-probe captures CR flags and lr but again no memory; --ctor-probe is single-shot per PC and walks the stack back-chain. None of them load the four indirect fields needed to identify a vtable-shape divergence. Implementation: - state.rs: new HashSet<u32> field `audit_pc_probe_pcs` and helper `fire_audit_pc_probe_if_match(hw_id, mem)`. Empty-set fast-path keeps the cost to one is_empty() check per worker_prologue call when the flag is unused. Read-only — no guest state mutation, lockstep digest unchanged. - main.rs: new CLI flag --audit-pc-probe-hex with bare-hex comma parsing (tolerates `0x` prefix), settable also via XENIA_AUDIT_PC_PROBE env var. Threaded through cmd_exec_inner; cmd_check passes None so check digests are unaffected. Probe wired into worker_prologue alongside fire_ctor_probe / fire_branch_probe / fire_lr_trace. 
Like its siblings, it fires once per basic-block entry — known limitation (audit-045 reading-error class 13); use a block-entry PC if probing a mid-block instruction. Verification: kernel 127/127, app 5/5 non-ignored, no behaviour change with empty flag. Cross-references audit-059 round 9's canary capture and lays the groundwork for the round-10 ours-side comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-07 10:59:03 +02:00
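The flag parsing and the empty-set fast path from 09e59e09b7 can be illustrated roughly as below. `parse_probe_pcs` and `probe_matches` are hypothetical names for this example; the commit message only specifies a `HashSet<u32>` of PCs, bare-hex comma-separated input that tolerates a `0x` prefix, and a single `is_empty()` check when the flag is unused.

```rust
use std::collections::HashSet;
use std::num::ParseIntError;

/// Parse a comma-separated list of hex PCs, with or without a `0x` prefix.
/// Empty segments (for example a trailing comma) are skipped.
pub fn parse_probe_pcs(arg: &str) -> Result<HashSet<u32>, ParseIntError> {
    arg.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|s| {
            let digits = s
                .strip_prefix("0x")
                .or_else(|| s.strip_prefix("0X"))
                .unwrap_or(s);
            u32::from_str_radix(digits, 16)
        })
        .collect()
}

/// Fast path: when no PCs are configured, bail out after one cheap check
/// before doing the hash lookup.
pub fn probe_matches(pcs: &HashSet<u32>, pc: u32) -> bool {
    if pcs.is_empty() {
        return false;
    }
    pcs.contains(&pc)
}
```

Collecting into `Result<HashSet<u32>, _>` stops at the first malformed entry, so a typo in the flag surfaces as an error instead of a silently dropped probe.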
MechaCat02	5a8fe21ad5	Iterate-2.BF.γ: refine is_in_callback gate to per-thread exclusion Lockstep vsync delivery was capped at 54/run despite the ticker firing 333 periods and dispatcher being called 1.2M times. Root cause: the blanket `is_in_callback()` gate skipped dispatch entirely whenever the async audio path held `interrupts.saved`, which is essentially the entire boot (audio worker rarely hits its LR_HALT_SENTINEL between back-to-back callbacks). 5.85M dispatch_skip_in_callback events drowned out the 55 with-pending windows. Graphics dispatch (iterate-2.BE) runs the ISR synchronously and restores the borrowed context before returning — it doesn't touch `interrupts.saved`. The only real conflict is if graphics picks the same thread audio borrowed (which would stomp audio's SavedCallbackCtx). Replace the blanket gate with per-thread exclusion: when audio is mid-flight, exclude only its `injected_ref` from victim selection. Falls through to the existing no-victim drop if that's the only candidate. Lockstep (50M instr): gpu.interrupt.delivered{source=0} 54 → 295 (5.5×), all 333 ticker periods either delivered or unarmed (no more queue_full_drops). Wallclock unchanged ~3 s. Parallel (30M instr): 1193 → 3458 baseline lift (2.9×), no regression. Tests: xenia-kernel 127/127, xenia-app 5/5 non-ignored. Lockstep goldens will drift (interrupts.delivered is in the digest); deferred to next iterate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-06 19:52:16 +02:00
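The gate change in 5a8fe21ad5, from "skip dispatch whenever audio is mid-callback" to "exclude only the thread audio borrowed", reduces to a small selection rule. The sketch below is an assumption-heavy illustration: thread references are modelled as plain `u32` values, `pick_victim` is a hypothetical name, and taking the first eligible candidate is a placeholder for whatever ordering the real victim selection uses.

```rust
/// Choose a thread to borrow for the graphics ISR.
/// `audio_injected` is the thread the async audio path currently holds
/// (the commit's `injected_ref`), or `None` when audio is not mid-flight.
/// Returns `None` when the only candidate is the excluded thread, which
/// corresponds to the existing no-victim drop.
pub fn pick_victim(candidates: &[u32], audio_injected: Option<u32>) -> Option<u32> {
    candidates
        .iter()
        .copied()
        .find(|tid| Some(*tid) != audio_injected)
}
```

Under the old blanket gate the first case below would have delivered nothing; with per-thread exclusion it still finds a victim.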

215 Commits