xenia-rs/audit-runs/iterate-2K-longer-budget-replay/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Iterate 2.K — Longer-budget cache-wipe replay (writer report)

Date: 2026-05-28. LOC delta: engine 0, canary 0. Pure measurement. Tests: N/A (no source modifications).

Headline

INSTALL-CHAIN-ABSENT-NEW-BLOCKER. The 500M-instruction budget run (10× 2.J's 50M) reaches the budget cap cleanly at wallclock=13.96s but emits ZERO new Phase-A events past 2.J's terminus. The event count (121,569) and tid=1's max guest_cycle (9,169,116) are both bit-identical to 2.J. The keystone sub_824F8398 install chain still has 0 fires, and the sub_825070F0 worker fan-out still has 0 fires. The final-state dump shows no live thread advancing: 9 of the 13 tids are parked in Blocked(WaitAny ...) waits (8 with deadline: None, tid=7 with deadline=3000), and the other 4 are Ready but never rescheduled. 5 of the blocked tids sit at PC 0x824ac578, the exact AUDIT-049 wedge PC. The 2.J "wedge moved / wait returns success" observation was an artifact of budget truncation: under a longer budget, the engine re-converges to a deadlock at the same call site. 2.J's NtWaitForSingleObjectEx return=0 events are the wrapper successfully returning on prior iterations of a tight wait → return → wait loop; the FINAL wait of each tid blocks forever and never emits a kernel.return. Cache parity was load-bearing but is NOT the keystone. The next blocker is upstream of the install chain, at the wedge-loop level.

Mode

ZERO LOC. Invocation:

```shell
XENIA_CACHE_WIPE=1 timeout 600 ./xenia-rs/target/release/xenia-rs exec \
  -n 500000000 --quiet \
  --phase-a-event-log audit-runs/iterate-2K-longer-budget-replay/ours-cold.jsonl \
  "Project Sylpheed - Arc of Deception (USA, Europe) (En,Ja).iso"
```

Identical to 2.J except for -n 50000000 → -n 500000000. The XDG cache was already absent before the run (no /home/fabi/.local/share/xenia-rs/cache/). XENIA_CACHE_WIPE=1 was set anyway, as belt-and-braces.

Run completed EXIT=0 at wallclock 13.96s. Final reason from non-quiet diagnostic re-run: reached max instruction count limit=500000000 (instruction budget hit, not a panic/fault/timeout). Total instructions executed: 500,000,004.

Primary gate results

| gate | 2.J | 2.K | result |
|---|---|---|---|
| sub_824F8398 install-chain fires | 0 | 0 | UNCHANGED |
| sub_825070F0 worker fan-out fires | 0 | 0 | UNCHANGED |

A grep against the full ours-cold.jsonl finds zero hits for either symbol across all event kinds (thread.create, import.call, kernel.call, kernel.return, payload fields). The grep was case-insensitive on the hex literal and also checked the per-tid first-kernel-call signature. The canary's tids 15/27/28 (the sub_825070F0 family workers) and tid 14 (audio worker, sub_824D2878-driven) are structurally absent from the thread fan-out in ours at this trajectory point, even with 10× the instruction budget.
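The grep above amounts to a case-insensitive substring scan over every line of the log. The Rust sketch below assumes the log is available as one string with one JSON event per line. `count_symbol_hits` is a hypothetical helper, not an existing xenia-rs function.

```rust
/// Hypothetical helper (not part of xenia-rs): for each target symbol, count
/// the JSONL lines that mention it anywhere, ignoring ASCII case, so that
/// "0x824F8398" and "0x824f8398" both match. No event-kind filter is applied,
/// so thread.create, import.call, kernel.call, kernel.return and payload
/// fields are all covered by the same scan.
fn count_symbol_hits(jsonl: &str, symbols: &[&str]) -> Vec<(String, usize)> {
    // Normalize the needles once; each line is lowercased once below.
    let needles: Vec<String> = symbols.iter().map(|s| s.to_ascii_lowercase()).collect();
    let mut counts = vec![0usize; needles.len()];
    for line in jsonl.lines() {
        let lower = line.to_ascii_lowercase();
        for (i, needle) in needles.iter().enumerate() {
            if lower.contains(needle.as_str()) {
                counts[i] += 1;
            }
        }
    }
    needles.into_iter().zip(counts).collect()
}
```

Substring matching can only over-count, for example when a hex literal is embedded in a longer value. A zero count therefore still means the symbol appears nowhere in the log.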

Secondary cascade gate results

Thread set

10 thread.create entries, bit-identical to 2.J (same entry_pcs, same ctx_ptrs). Per tripstone #28 (don't key on integer tid):

| entry_pc | ctx_ptr | canary analog |
|---|---|---|
| 0x82181830 | 0x828f3d08 | main bootstrap |
| 0x8245a5d0 | 0x828f4838 | early helper |
| 0x82450a28 | 0x828f3b68 | producer (AUDIT-069) |
| 0x82457ef0 | 0x828f3b08 | dispatcher tid=5 |
| 0x824cd458 | 0xbe8cbb3c | per-AUDIT-068 sister |
| 0x822f1ee0 | 0xbd184a40 | helper |
| 0x824d2878 | 0x00000000 | audio worker (no kernel calls) |
| 0x824d2940 | 0x00000000 | audio companion (no kernel calls) |
| 0x82178950 | 0x828f3ec0 | input/lifecycle |
| 0x821748f0 | 0xbc6c5640 | early helper |
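Tripstone #28's rule can be made mechanical by comparing spawn sets as multisets of (entry_pc, ctx_ptr) pairs. Tid renumbering between runs or engines then cannot produce a false diff. The sketch below works under that assumption. `SpawnKey`, `spawn_counts` and `surplus` are hypothetical names. Counting occurrences, rather than using a plain set, keeps repeated spawns of the same (entry_pc, ctx_ptr) pair from collapsing into one entry.

```rust
use std::collections::BTreeMap;

/// Hypothetical spawn key: what the thread runs and the context it was handed,
/// deliberately excluding the integer tid.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SpawnKey {
    entry_pc: u32,
    ctx_ptr: u32,
}

/// Multiset of spawn keys: key -> number of thread.create events with that key.
fn spawn_counts(keys: &[SpawnKey]) -> BTreeMap<SpawnKey, usize> {
    let mut m = BTreeMap::new();
    for k in keys {
        *m.entry(*k).or_insert(0) += 1;
    }
    m
}

/// Keys spawned more often in `a` than in `b`, paired with the surplus count.
fn surplus(a: &[SpawnKey], b: &[SpawnKey]) -> Vec<(SpawnKey, usize)> {
    let cb = spawn_counts(b);
    spawn_counts(a)
        .into_iter()
        .filter_map(|(k, n)| {
            let m = cb.get(&k).copied().unwrap_or(0);
            if n > m { Some((k, n - m)) } else { None }
        })
        .collect()
}
```

`surplus(canary, ours)` would list what the canary spawns that ours does not. If `surplus(run_2j, run_2k)` and its reverse both come back empty, that is the tid-independent form of the thread-set match stated above.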

NB: sub_824D2878 IS in the spawn set, but its tid emits no kernel calls in the entire 500M-instruction run (matches 2.J). The four sub_825070F0 workers and the secondary-burst tids never spawn.

VdSwap / draws (gameplay progression — tripstone #39)

  • VdSwap = 1 (same single swap at cycle=5,577,303 / host_ns=493.5ms as 2.J). Bit-identical timestamp.
  • Draws = 0 (no *Draw* kernel name emitted).
  • Gameplay progression NOT achieved. Honest "no" per #39.

Total event count

  • 121,569 events (bit-identical to 2.J).
  • File size is 28,724,871 bytes vs 2.J's 28,667,xxx-ish. The content is identical up to floating host_ns jitter, i.e. structurally equal.
  • Implication: between 50M and 500M instructions, the engine emitted 0 new kernel calls, 0 new wait.begin events and 0 new handle events. The host clock advanced ~3× (wallclock ~5s → 13.96s), but the guest committed no observable progress.
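The wallclock comparison reduces to two ratios:

  • new events per extra instruction;
  • wallclock of the longer run over the shorter one.

The small sketch below uses hypothetical names (`RunSummary`, `progress_delta`) and this report's figures. 2.J's budget is taken as a nominal 50,000,000 instructions and its wallclock as ~5 s, because the exact values are not recorded here.

```rust
/// Hypothetical per-run summary using figures quoted in this report.
struct RunSummary {
    instructions: u64,
    events: u64,
    wallclock_s: f64,
}

/// Returns (new Phase-A events per extra million instructions, wallclock ratio b/a).
/// Assumes `b` is the longer-budget run, i.e. b.instructions > a.instructions.
fn progress_delta(a: &RunSummary, b: &RunSummary) -> (f64, f64) {
    let extra_instr_m = (b.instructions - a.instructions) as f64 / 1e6;
    // Signed difference, so a shorter event log in `b` would show as negative.
    let extra_events = b.events as f64 - a.events as f64;
    (extra_events / extra_instr_m, b.wallclock_s / a.wallclock_s)
}
```

Take 2.J = (50,000,000 instructions, 121,569 events, ~5 s) and 2.K = (500,000,004 instructions, 121,569 events, 13.96 s). The event rate is then 0.0 per extra million instructions, and the wallclock ratio is about 2.79.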

Wedge state (final-state dump, non-quiet diagnostic re-run)

At budget exhaustion, no live thread is advancing: 9 tids are Blocked and 4 are Ready but never rescheduled.

| tid | PC | LR | state | handle waiting on |
|---|---|---|---|---|
| 1 | 0x824ac578 | 0x824ac578 | Blocked(WaitAny, no deadline) | 0x12c8 = Thread(id=13) |
| 11 | 0x824d2a94 | 0x824d2a94 | Blocked(WaitAny, no deadline) | 0x828a3244 = Event(sig=false) |
| 2 | 0x824a95f8 | 0x824a95f8 | Blocked(WaitAny, no deadline) | 0x8287093c = Event(sig=false) |
| 13 | 0x824ac578 | 0x824ac578 | Blocked(WaitAny, no deadline) | 0x12d0 = Event(sig=false) |
| 7 | 0x824cd4f4 | 0x824cd4f4 | Blocked(WaitAny, deadline=3000) | 0xbe8cbb5c = Event |
| 8 | 0x824ab214 | 0x824ab214 | Blocked(WaitAny, no deadline) | 0x10d8 = Semaphore(0/2^31-1) |
| 4 | 0x824ac578 | 0x824ac578 | Blocked(WaitAny, no deadline) | 0x1028 = Semaphore(0/2^31-1) |
| 5 | 0x824ac578 | 0x824ac578 | Blocked(WaitAny, no deadline) | 0x12e4 = Event(sig=false) |
| 9 | 0x824d1404 | 0x824d22b4 | Ready | (none) |
| 6 | 0x824ab214 | 0x824ab214 | Ready | (none) |
| 10 | 0x824d1404 | 0x824d22b4 | Ready | (none) |
| 12 | 0x824aa6a4 | 0x824aa6a4 | Ready | (none) |
| 3 | 0x824ac578 | 0x824ac578 | Blocked(WaitAny, no deadline) | 0x1020 = Event(sig=false) |

5 of the 13 tids are parked at PC 0x824ac578 (the AUDIT-049 wedge), including the canonical tid=1 → Thread(id=13) → Event circular wait. 4 tids are in Ready state but are never rescheduled to advance.
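Spotting a shared wedge site in the final-state dump is a group-by on PC over the blocked threads. The sketch below assumes the dump has already been parsed into per-thread records. `ThreadDump`, `State` and `blocked_by_pc` are hypothetical names. The deadline is kept as the raw printed value, with no assumption about its units.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    /// Raw deadline value as printed in the dump (units not assumed).
    Blocked { deadline: Option<u32> },
    Ready,
}

/// Hypothetical parsed row of the final-state dump.
struct ThreadDump {
    tid: u32,
    pc: u32,
    state: State,
}

/// Blocked tids grouped by PC, in dump order, so a shared wedge site shows up
/// as one PC with several waiters. Ready threads are skipped.
fn blocked_by_pc(dump: &[ThreadDump]) -> BTreeMap<u32, Vec<u32>> {
    let mut by_pc: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    for t in dump {
        if let State::Blocked { .. } = t.state {
            by_pc.entry(t.pc).or_default().push(t.tid);
        }
    }
    by_pc
}
```

Fed the 13 rows of the table above, the 0x824ac578 entry holds tids 1, 13, 4, 5 and 3, out of 9 blocked tids in total.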

tid=1's last kernel.return in the Phase-A log shows NtWaitForSingleObjectEx return_value=0 status=0x00000000 at cycle=9,169,116. That return belongs to an earlier iteration of the wait loop, NOT to the wait tid=1 is currently blocked on. The final wait (handle 0x12c8, the tid=13 thread handle) NEVER returned. No kernel.return event was emitted for it because the wrapper is parked indefinitely.

Reading-error #41 candidate (new this iterate)

Phase-A "kernel.return success" events do NOT imply forward progress when the call site is a tight wait-loop. 2.J's report observed "tid=1 NtWait returns success, wedge moved or absent" — but the events captured were prior loop iterations that fed back into the SAME wait call which then blocks forever. The honest interpretation is "wait wrapper made N successful round-trips, then the (N+1)th call blocked indefinitely." Recommend registering: return-success in Phase-A does not prove wedge resolution; cross-check against final-state thread diagnostic dump under the longest available budget.
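The recommended cross-check can be expressed as a join between two inputs:

  • the last logged kernel.return status per tid;
  • the set of tids the final-state dump shows as Blocked.

A minimal sketch follows. `stale_successes` is a hypothetical name, and both inputs are assumed to have been extracted from the Phase-A log and the diagnostic dump beforehand.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// NT STATUS_SUCCESS, as logged in kernel.return status fields.
const STATUS_SUCCESS: u32 = 0x0000_0000;

/// Tids whose last logged kernel.return was STATUS_SUCCESS but which the
/// final-state dump still shows Blocked. For such a tid, the logged success
/// belongs to an earlier loop iteration, not to the wait it is parked in.
fn stale_successes(last_status: &BTreeMap<u32, u32>, blocked_at_end: &[u32]) -> Vec<u32> {
    let blocked: BTreeSet<u32> = blocked_at_end.iter().copied().collect();
    last_status
        .iter()
        .filter(|&(tid, &status)| status == STATUS_SUCCESS && blocked.contains(tid))
        .map(|(&tid, _)| tid)
        .collect()
}
```

In this run, tid=1 qualifies. Its last logged NtWaitForSingleObjectEx returned 0x00000000, yet the dump shows it Blocked on handle 0x12c8.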

Comparison: 2.H → 2.J → 2.K

| gate | 2.H (no wipe) | 2.J (wipe, 50M) | 2.K (wipe, 500M) |
|---|---|---|---|
| cache probe | 0xc000000f FAIL | PASS (9/9) | PASS (9/9) |
| total events | 118,149 | 121,569 | 121,569 |
| tid=4 events | 160 | 2,075 | 2,075 |
| thread.create count | 10 | 10 | 10 |
| tid=1 last cycle | 9,140,200 | 9,169,116 | 9,169,116 |
| VdSwap count | 1 | 1 | 1 |
| draws | 0 | 0 | 0 |
| sub_824F8398 fires | 0 | 0 | 0 |
| sub_825070F0 fires | 0 | 0 | 0 |
| wedge PC 0x824ac578 parked | yes | "moved" (budget short) | 5 tids parked there |
| termination | 50M budget | 50M budget | 500M budget, clean |
| wallclock to terminate | ~5s | ~5s | 13.96s |

Critical finding: 2.J ≡ 2.K at the Phase-A event level, with all gates identical to 2.J. The 10× budget bought ~3× the wallclock (~5s → 13.96s) but zero additional observable guest progress. The engine is genuinely wedged from somewhere between cycle 9,140,200 and 9,169,116 onward.

Tripstone audit

  • #28 (cross-engine tid stability): All ours-internal claims keyed on entry_pc, not integer tid. 2.J ↔ 2.K both ours-side so integer tid stable; entry_pc/ctx_ptr columns bit-stable.
  • #39 (gameplay progression IS progression): Headline does NOT claim progression. VdSwap=1, draws=0 — same as 2.J. PASS claim is on characterization of the wedge (now visible at the same PC as AUDIT-049), not on cascade.
  • #40 (single-keystone framing): The 2.J framing "cache parity is the keystone, longer budget will reveal the install chain" is FALSIFIED by 2.K. Neither cache parity nor a longer budget unblocks sub_824F8398. The reading-error #40 class recurs here, in this iterate's expectation that a 10× budget would unblock the chain. Recommend registering reading-error #41: Phase-A kernel.return success events do not prove wedge resolution when the call site is a tight wait-loop with N successful spins before the (N+1)th terminal block.

Confidence

  • HIGH that 2.K reached 500M instructions cleanly (exec complete wall_ms=13959 instructions=500000004 in diagnostic re-run).
  • HIGH that Phase-A event log is bit-identical to 2.J at the structural level (count, last tid_event_idx, last guest_cycle).
  • HIGH that 5 tids parked at 0x824ac578 at budget exhaustion (final-state dump direct evidence).
  • HIGH that sub_824F8398 and sub_825070F0 are 0 fires (grepped across all event kinds + payload fields).
  • HIGH that the wallclock-vs-events ratio diverges ~3:1 between 2.J and 2.K. The engine is consuming host time without making guest-observable progress, i.e. spinning in the JIT loop on re-execution of already-blocked waits or on busy-loops.

Next iterate recommendation

Iterate 2.L should be ONE of:

  1. Walk the wedge backward from 0x824ac578 to find the missing signaler (~0-50 LOC instrumentation). Each parked tid is waiting on a specific event/semaphore handle. Identify per-tid: (a) who in canary signals that handle and when; (b) whether the signaler tid exists in ours; (c) if it exists, why doesn't it reach the signal site. The wedge handles in this run are:

    • tid=1 → 0x12c8 = Thread(id=13) — waiting for tid=13 to exit
    • tid=13 → 0x12d0 = Event — needs an external signaler
    • tid=3,4,5 → 0x1020 = Event, 0x1028 = Semaphore and 0x12e4 = Event, respectively
    • tid=8 → 0x10d8 = Semaphore (the AUDIT-069 work-semaphore class). This is essentially AUDIT-069 territory: producer underrun at the work-semaphore. ~0 LOC if reusing the existing --lr-trace / --branch-probe infra.
  2. Push the budget further (-n 5000000000, i.e. 10× 2.K and 100× 2.J) to see whether anything eventually fires (~0 LOC, ~2.5 min wallclock estimate, decisive negative). LOW PRIORITY: given 2.K's flat-zero events from 50M to 500M, we strongly predict 0 events from 500M to 5000M.

  3. 2.D-style diff re-measure of the (op, lr) missing-tuple count from the IAT producer LR side (~0-30 LOC). 2.J said "expected unchanged at 28/28". 2.K is structurally identical to 2.J, so the missing-tuple count is also expected to be unchanged. Re-measure to CONFIRM, and to refresh the producer rate at LR 0x824AB168, which was 9.97% in 2.D. Useful as a cascade sanity check even if negative.
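Option (1)'s waiter→signaler walk can start mechanically from the final-state dump. Follow Thread handles from tid to tid until the chain either:

  • reaches an Event or Semaphore, whose signaler must then be found in the canary per steps (a) through (c); or
  • revisits a tid, which is a true circular wait.

The sketch below uses hypothetical types (`Waitable`, `ChainEnd`, `walk`). It assumes each blocked tid waits on a single object, which matches the single-handle WaitAny rows in this run's dump.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical view of what a blocked tid waits on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Waitable {
    Thread(u32),    // signaled when that tid exits
    Event(u32),     // guest handle; signaler not recorded in the dump
    Semaphore(u32), // guest handle; releaser not recorded in the dump
}

#[derive(Debug, PartialEq, Eq)]
enum ChainEnd {
    /// Chain bottoms out at a non-thread object: an external signaler is needed.
    NeedsSignaler { tid: u32, object: Waitable },
    /// The last tid on the chain is not blocked (Ready, or absent from the map).
    NotBlocked(u32),
    /// Following Thread handles revisited a tid: the tids forming the loop.
    Cycle(Vec<u32>),
}

/// Follows Thread handles from `start`, returning the visited tids and how the chain ends.
fn walk(start: u32, waits: &BTreeMap<u32, Waitable>) -> (Vec<u32>, ChainEnd) {
    let mut path = Vec::new();
    let mut seen = BTreeSet::new();
    let mut tid = start;
    loop {
        if !seen.insert(tid) {
            // `seen` and `path` hold the same tids, so this lookup cannot fail.
            let pos = path.iter().position(|&t| t == tid).unwrap();
            let cycle = path[pos..].to_vec();
            return (path, ChainEnd::Cycle(cycle));
        }
        path.push(tid);
        match waits.get(&tid) {
            None => return (path, ChainEnd::NotBlocked(tid)),
            Some(&Waitable::Thread(next)) => tid = next,
            Some(&object) => return (path, ChainEnd::NeedsSignaler { tid, object }),
        }
    }
}
```

Started at tid=1 with this run's dump, the walk goes 1 → 13 and stops at Event 0x12d0 as NeedsSignaler. The dump does not record whether that event's signaler closes a loop back to tid=1, so step (a) has to answer that in the canary.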

Recommended priority: (1) — direct per-handle waiter→signaler walk on the 5 parked tids at 0x824ac578. Will identify the most upstream missing signaler and likely lead to either AUDIT-069's producer-underrun root or a new state-parity divergence upstream of the install epoch. ~0-50 LOC, ~30-60 min.

DO NOT pursue (2) without first attempting (1) — the structural evidence (event count flat, max-cycle flat, final-state genuine wedge) makes "longer budget" a high-confidence negative.

Artifacts

Under xenia-rs/audit-runs/iterate-2K-longer-budget-replay/:

  • ours-cold.jsonl (121,569 events, 500M-instr quiet run, ~28MB)
  • ours-cold.stdout.log / ours-cold.stderr.log (empty — quiet mode)
  • exit-diag-full.log (390 lines; non-quiet diagnostic re-run capturing the budget-hit message, final-state dump, thread diagnostics and metrics summary)
  • exit-diag.log (50-line tail of first diagnostic run)
  • exit-diag-head.log (100-line head of second diagnostic run)
  • writer-report.md (this file)

Cache wiped via XENIA_CACHE_WIPE=1 env (per-process tmpdir at /tmp/xenia-rs-cache-244570-0/). No XDG cache pre-existed.