xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/cold-vs-cold-result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+23 cold-vs-cold result (2026-05-18)

Outcome: ENGINE FIX LANDED

addis sign-extension fix at xenia-cpu/src/interpreter.rs resolves D-NEW-2 (ε-class timeout sign-extension on the canary tid=12 → ours tid=7 sister chain). 5 LOC effective. Determinism preserved (3× cold runs byte-identical post-fix).

Matched-prefix table (vs C+22 baseline)

| chain | C+22 | C+23 (fresh) | delta |
|---|---|---|---|
| canary tid=6 → ours tid=1 main | 104,607 | 104,607 | 0 |
| canary tid=4 → ours tid=11 | 11 | 11 | 0 |
| canary tid=7 → ours tid=2 | 32 | 32 | 0 |
| canary tid=12 → ours tid=7 | 3 | 4 | +1 |
| canary tid=14 → ours tid=9 | 41 | 41 | 0 |
| canary tid=15 → ours tid=10 | 16 | 16 | 0 |
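
For readers unfamiliar with the metric, the sketch below illustrates how a matched-prefix count of this kind can be computed for the canary tid=12 → ours tid=7 chain. It is not the project's diff tool: the `Event` struct, the `ev` and `chain` helpers, and the choice to leave the `handle.create` SID out of the compared fields (mirroring the absorbed SID difference) are assumptions made for illustration. Only the event kinds, timeout values, and the canary/post-fix return values come from this report.

```rust
/// One trace event reduced to the fields that are compared.
/// (Hypothetical shape; the real trace schema is not shown in this report.)
#[derive(Debug, Clone, PartialEq)]
struct Event {
    kind: &'static str,
    det: String,
}

fn ev(kind: &'static str, det: &str) -> Event {
    Event { kind, det: det.to_string() }
}

/// Length of the longest common prefix of two event chains.
fn matched_prefix(canary: &[Event], ours: &[Event]) -> usize {
    canary.iter().zip(ours).take_while(|(c, o)| c == o).count()
}

/// Events 0-4 of the KeWaitForSingleObject chain.
/// Assumption: the handle.create SID is excluded from comparison.
fn chain(timeout_ns: i64, return_value: u32) -> Vec<Event> {
    vec![
        ev("import.call", "KeWaitForSingleObject"),
        ev("kernel.call", "KeWaitForSingleObject"),
        ev("handle.create", ""),
        ev("wait.begin", &format!("timeout_ns={timeout_ns}")),
        ev("kernel.return", &format!("return_value={return_value}")),
    ]
}

fn main() {
    let canary = chain(-30_000_000, 258);
    // The pre-fix return value is a placeholder; the prefix ends before it.
    let ours_pre_fix = chain(429_466_729_600, 0);
    let ours_post_fix = chain(-30_000_000, 0);
    println!("pre-fix:  {}", matched_prefix(&canary, &ours_pre_fix)); // 3
    println!("post-fix: {}", matched_prefix(&canary, &ours_post_fix)); // 4
}
```

Under these assumptions the count moves from 3 to 4, matching the tid=12 → tid=7 row: the prefix used to stop at `wait.begin` and now stops at `kernel.return`.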

Floating-event absorption counts (fresh c23)

| chain | floating_create (c/o) | floating_wait (c/o) |
|---|---|---|
| canary tid=6 → ours tid=1 main | 2 / 0 | 3 / 0 |
| canary tid=15 → ours tid=10 | 0 / 1 | 0 / 0 |
| others | 0 / 0 | 0 / 0 |

C+18 absorber engaged on main chain (2 canary handle.create floated) and on tid=15→10 (1 ours handle.create floated). C+21 absorber engaged on main chain (3 canary wait.begin events floated — this canary cold sample took the contended slow path 3 times).

Cold-stable invariants

  • ours-cold byte-identical (det-fields) across 3 runs: digest 23cf4c4cbf61a577caa4118ab2308ba6. Replaces C+22's e1dfcb1559f987b35012a7f2dc6d93f5 baseline (digest moved due to engine source change). New baseline anchored here.
  • Event count unchanged: 121,569 ours events (matches C+22).
  • Phase B image_canonical_sha256 = ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18 — UNCHANGED. Image-loading path untouched.
  • Engine source change: xenia-cpu/src/interpreter.rs::addis (5 LOC effective, ~25 LOC including comment + commented-out truncation). No xenia-canary source changes. No diff-tool changes.
  • Tests: kernel 204 unchanged; cpu 288 → 291 (3 new regression tests for the addis fix).

Direct fix-verification at the divergence point

ours-cold post-fix, tid=7 events 0-4:

[0] import.call    KeWaitForSingleObject
[1] kernel.call    KeWaitForSingleObject
[2] handle.create  sid=6e3d96c5a52bf429
[3] wait.begin     {timeout_ns: -30000000, alertable: false, wait_type: any}
[4] kernel.return  return_value=0 status=0x00000000

canary-cold, tid=12 events 0-4:

[0] import.call    KeWaitForSingleObject
[1] kernel.call    KeWaitForSingleObject
[2] handle.create  sid=c49d8f0ab90401ea  (different SID, absorbed)
[3] wait.begin     {timeout_ns: -30000000, alertable: false, wait_type: any}
[4] kernel.return  return_value=258 status=0x00000102 (TIMEOUT)

timeout_ns: -30000000 MATCHES across engines (was 429466729600 pre-fix).
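
The two values are consistent with a `lis`/`ori` pair building a 30 ms relative timeout. The sketch below is a worked reconstruction, not the code in xenia-cpu/src/interpreter.rs: the function names are hypothetical, the immediates 0xFFFB and 0x6C20 are inferred from the observed numbers, and it assumes the timeout is held in 100 ns ticks and that the pre-fix behavior was a truncation of the `addis` result to 32 bits.

```rust
/// `addis rD, rA, SIMM` on a 64-bit PowerPC:
/// rD = (rA|0) + (sign_extend(SIMM) << 16), computed in 64 bits.
/// Pass 0 for `ra` to model `lis` (the rA = 0 form).
fn addis_sign_extended(ra: u64, simm: i16) -> u64 {
    let imm = ((simm as i64) << 16) as u64;
    ra.wrapping_add(imm)
}

/// Hypothetical pre-fix variant: the upper 32 bits are dropped.
fn addis_truncated(ra: u64, simm: i16) -> u64 {
    addis_sign_extended(ra, simm) & 0xFFFF_FFFF
}

/// `ori rA, rS, UIMM`: OR with the zero-extended 16-bit immediate.
fn ori(rs: u64, uimm: u16) -> u64 {
    rs | uimm as u64
}

/// Assumption: the guest timeout is in 100 ns ticks (negative = relative).
fn ticks_to_ns(ticks: i64) -> i64 {
    ticks * 100
}

fn main() {
    // -300_000 ticks = 0xFFFF_FFFF_FFFB_6C20, so: lis 0xFFFB ; ori 0x6C20.
    let hi = 0xFFFBu16 as i16; // -5 when interpreted as a signed immediate
    let lo = 0x6C20u16;

    let fixed = ori(addis_sign_extended(0, hi), lo) as i64;
    let broken = ori(addis_truncated(0, hi), lo) as i64;

    assert_eq!(fixed, -300_000);
    assert_eq!(broken, 4_294_667_296);
    println!("fixed:  {} ns", ticks_to_ns(fixed)); // -30000000
    println!("broken: {} ns", ticks_to_ns(broken)); // 429466729600
}
```

With sign extension the doubleword is -300,000 ticks (-30,000,000 ns). Without it, the same bit pattern reads as 4,294,667,296 ticks, which is the 429466729600 ns seen pre-fix.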

New downstream divergence at idx=4 (C+23 → C+24+ target)

The advance reveals the next-class issue at idx=4:

canary: [4] kernel.return KeWaitForSingleObject  return_value=258 (TIMEOUT)
ours:   [4] kernel.return KeWaitForSingleObject  return_value=0   (SUCCESS)

Classification: (A) scheduler-determinism, the same family as the C+20 and C+22 escalations. Our monolithic-thread runner never lets the 30 ms timeout window elapse without a signaler, so the wait returns SUCCESS. Either the event was already signaled at entry, or the wait was served on an implicit fast path; which of the two applies is not yet confirmed. Canary's contended scheduler lets the timeout fire. An engine-side fix requires the parallel scheduler-determinism track (a multi-session refactor).
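
The two outcomes can be illustrated with a toy wait resolver on a virtual clock. This is a sketch of the semantics only, not of either engine's scheduler: `resolve_wait` and its parameters are hypothetical, only relative (negative) timeouts are modeled, and the "already signaled at entry" input for ours is one of the unconfirmed hypotheses above.

```rust
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_TIMEOUT: u32 = 0x0000_0102; // 258

/// Resolve a single-object wait on a virtual clock (all times in ns).
/// `signal_at_ns` is when the object becomes signaled, or None if never.
/// Assumption: `timeout_ns` is a relative timeout, encoded as a negative value.
fn resolve_wait(now_ns: u64, signal_at_ns: Option<u64>, timeout_ns: i64) -> u32 {
    let deadline = now_ns + timeout_ns.unsigned_abs();
    match signal_at_ns {
        // Signaled at or before the deadline: the wait is satisfied.
        Some(t) if t <= deadline => STATUS_SUCCESS,
        // Never signaled, or signaled too late: the timeout fires.
        _ => STATUS_TIMEOUT,
    }
}

fn main() {
    let timeout_ns = -30_000_000; // 30 ms relative, as in wait.begin

    // Canary-like case: no signaler within the window.
    println!("{:#010x}", resolve_wait(0, None, timeout_ns)); // 0x00000102

    // One hypothesis for ours: the event is already signaled at entry.
    println!("{:#010x}", resolve_wait(0, Some(0), timeout_ns)); // 0x00000000
}
```

Both engines now pass the same timeout, so the remaining difference is in the inputs a scheduler supplies (whether and when the object is signaled), which is why this is classified as scheduler-determinism rather than an encoding bug.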

Verification that fix is NOT diff-tool jitter

Several independent lines of evidence:

  1. Direct ours-cold inspection — the wait.begin.timeout_ns field is read directly from ours-cold.jsonl (no diff-tool interpretation), and it's now -30000000.
  2. Unit tests: lis_ori_std_negative_timeout_writes_sign_extended_doubleword in xenia-cpu asserts the architectural fact directly.
  3. Determinism — 3× cold runs produce byte-identical det-fields digest. The fix isn't a race that flickered on this one sample.
  4. Phase B image hash unchanged: the fix is purely behavioral in the CPU interpreter (xenia-cpu/src/interpreter.rs), not a re-link or image change.

Cascade outcome

  • A=verify canary's timeout read logic: PASS (identical formula).
  • B=identify encoding bug class: PASS — (d) sign-extension.
  • C=land fix: PASS — 5 LOC + 3 tests.
  • D=tid=12→7 advances past 3: PASS (3 → 4).
  • E=no regression on main or other sisters: PASS (all preserved).

Files

  • investigation.md
  • cold-vs-cold-result.md (this file)
  • diff-cold-vs-cold.md
  • re-validation.md
  • ours-cold.jsonl / ours-cold-stdout.log / ours-cold-stderr.log
  • canary-cold-trunc.jsonl / canary-cold-stdout.log
  • canary-binary-cache-pre-wipe.tar.gz / canary-xdg-cache-pre-wipe.tar.gz
  • digest-cold-stable-1.json / -2.json / -3.json
  • fix.diff

Next-target recommendation

  • C+24 = D-NEW-3 (canary tid=14 → ours tid=9 idx=41): canary calls XAudioGetVoiceCategoryVolumeChangeMask; ours calls RtlEnterCriticalSection. Likely missing/stubbed XAudio export in ours causing fallback. Independent of scheduler-determinism.
  • Parallel scheduler-determinism track: tackle the C+20/C+22 family, plus the newly surfaced C+23 idx=4 case, at the root via a per-CS-pointer expected-contention inference layer. Multi-session.