xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/cold-vs-cold-result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+23 cold-vs-cold result (2026-05-18)
## Outcome: ENGINE FIX LANDED
`addis` sign-extension fix at `xenia-cpu/src/interpreter.rs` resolves
D-NEW-2 (ε-class timeout sign-extension on the canary tid=12 → ours
tid=7 sister chain). 5 LOC effective. Determinism preserved (3× cold
runs byte-identical post-fix).
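As a minimal sketch of the architectural rule the fix restores (not the actual code in `interpreter.rs`, which is not reproduced here): PowerPC `addis rD, rA, SIMM` adds the sign-extended value `SIMM << 16` to `rA` (or to 0 when the rA field is 0). The function names and the `Option<u64>` encoding of "rA field is 0" are assumptions made for illustration, and `addis_truncating` is only a reconstruction of the bug class, not the pre-fix source.

```rust
/// Sketch of PowerPC `addis rD, rA, SIMM` on a 64-bit register file.
/// `ra` is `None` when the rA field is 0, which the ISA treats as literal 0
/// (this is how `lis rD, SIMM` is encoded).
fn addis(ra: Option<u64>, simm: i16) -> u64 {
    // EXTS(SIMM || 0x0000): shift in the signed domain so the sign bit
    // propagates into the upper 32 bits of the 64-bit result.
    let imm = ((simm as i64) << 16) as u64;
    ra.unwrap_or(0).wrapping_add(imm)
}

/// Hypothetical reconstruction of the bug class: the shifted immediate is
/// zero-extended from 32 bits, so a negative SIMM leaves the upper word clear.
fn addis_truncating(ra: Option<u64>, simm: i16) -> u64 {
    let imm = ((simm as u16 as u32) << 16) as u64;
    ra.unwrap_or(0).wrapping_add(imm)
}
```

With `simm = -5` (`0xFFFB`), `addis(None, -5)` yields `0xFFFF_FFFF_FFFB_0000`, while the truncating variant yields `0x0000_0000_FFFB_0000`. The two agree for every non-negative immediate, which is why a bug of this class can stay hidden until a negative constant is built.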
## Matched-prefix table (vs C+22 baseline)
| chain | C+22 | C+23 (fresh) | delta |
|--------------------------------|---------|--------------|-------|
| canary tid=6 → ours tid=1 main | 104,607 | 104,607 | 0 |
| canary tid=4 → ours tid=11 | 11 | 11 | 0 |
| canary tid=7 → ours tid=2 | 32 | 32 | 0 |
| canary tid=12 → ours tid=7 | 3 | **4** | **+1** |
| canary tid=14 → ours tid=9 | 41 | 41 | 0 |
| canary tid=15 → ours tid=10 | 16 | 16 | 0 |
## Floating-event absorption counts (fresh c23)
| chain | floating_create (c/o) | floating_wait (c/o) |
|--------------------------------|-----------------------|---------------------|
| canary tid=6 → ours tid=1 main | 2 / 0 | 3 / 0 |
| canary tid=15 → ours tid=10 | 0 / 1 | 0 / 0 |
| others | 0 / 0 | 0 / 0 |
C+18 absorber engaged on main chain (2 canary handle.create floated)
and on tid=15→10 (1 ours handle.create floated). C+21 absorber engaged
on main chain (3 canary wait.begin events floated — this canary cold
sample took the contended slow path 3 times).
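The idea of a matched prefix with floating-event absorption can be illustrated with a deliberately simplified model. This is not the diff tool's algorithm: the `Event` shape, the `floatable` kind list, and the skip-one-side rule are all assumptions chosen to show how an event present on only one side can be absorbed without ending the matched prefix.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Event {
    kind: &'static str,
    payload: &'static str,
}

/// Small constructor to keep examples short (hypothetical helper).
fn ev(kind: &'static str, payload: &'static str) -> Event {
    Event { kind, payload }
}

#[derive(Debug, PartialEq)]
struct Prefix {
    matched: usize,
    floated_canary: usize,
    floated_ours: usize,
}

/// Walk both streams in lockstep. Equal events extend the matched prefix.
/// On a mismatch, an event whose kind is listed in `floatable` is skipped
/// on its own side and counted; any other mismatch ends the prefix.
fn matched_prefix(canary: &[Event], ours: &[Event], floatable: &[&str]) -> Prefix {
    let (mut i, mut j) = (0, 0);
    let mut p = Prefix { matched: 0, floated_canary: 0, floated_ours: 0 };
    while i < canary.len() && j < ours.len() {
        if canary[i] == ours[j] {
            p.matched += 1;
            i += 1;
            j += 1;
        } else if floatable.iter().any(|k| *k == canary[i].kind) {
            p.floated_canary += 1;
            i += 1;
        } else if floatable.iter().any(|k| *k == ours[j].kind) {
            p.floated_ours += 1;
            j += 1;
        } else {
            break;
        }
    }
    p
}
```

In this model a canary-only `handle.create` between two otherwise identical streams increments `floated_canary` and leaves `matched` unaffected, which is the shape of the `c/o` columns in the table above.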
## Cold-stable invariants
- **ours-cold byte-identical (det-fields) across 3 runs**:
digest `23cf4c4cbf61a577caa4118ab2308ba6`. Replaces C+22's
`e1dfcb1559f987b35012a7f2dc6d93f5` baseline (digest moved due
to engine source change). New baseline anchored here.
- **Event count** unchanged: 121,569 ours events (matches C+22).
- **Phase B `image_canonical_sha256` =
`ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`**
— UNCHANGED. Image-loading path untouched.
- **Engine source change**: `xenia-cpu/src/interpreter.rs::addis`
(5 LOC effective, ~25 LOC including comment + commented-out
truncation). No `xenia-canary` source changes. No diff-tool changes.
- **Tests**: kernel 204 unchanged; cpu 288 → 291 (3 new regression
tests for the addis fix).
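The byte-identical check across cold runs amounts to hashing only the deterministic fields of each event and comparing the digests. The sketch below assumes a hypothetical `TraceEvent` layout and uses the standard library's `DefaultHasher` as a stand-in; the real digest function and the actual set of det-fields are not stated in this note.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical event layout; `host_time_ns` stands for any field that
/// legitimately varies between runs and must stay out of the digest.
struct TraceEvent {
    tid: u32,
    kind: String,
    payload: String,
    host_time_ns: u64,
}

/// Digest over the deterministic fields only.
fn det_digest(events: &[TraceEvent]) -> u64 {
    let mut h = DefaultHasher::new();
    for e in events {
        e.tid.hash(&mut h);
        e.kind.hash(&mut h);
        e.payload.hash(&mut h);
        // host_time_ns is deliberately excluded.
    }
    h.finish()
}

/// True when every run produces the same det-fields digest.
fn cold_stable(runs: &[Vec<TraceEvent>]) -> bool {
    let mut digests = runs.iter().map(|r| det_digest(r));
    match digests.next() {
        Some(first) => digests.all(|d| d == first),
        None => true,
    }
}
```

Because the digest covers event content, an engine change that alters any logged value (as the `addis` fix does for `timeout_ns`) moves the digest even when the event count stays the same, so a new baseline has to be anchored after each such change.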
## Direct fix-verification at the divergence point
ours-cold post-fix, tid=7 events 0-4:
```
[0] import.call KeWaitForSingleObject
[1] kernel.call KeWaitForSingleObject
[2] handle.create sid=6e3d96c5a52bf429
[3] wait.begin {timeout_ns: -30000000, alertable: false, wait_type: any}
[4] kernel.return return_value=0 status=0x00000000
```
canary-cold, tid=12 events 0-4:
```
[0] import.call KeWaitForSingleObject
[1] kernel.call KeWaitForSingleObject
[2] handle.create sid=c49d8f0ab90401ea (different SID, absorbed)
[3] wait.begin {timeout_ns: -30000000, alertable: false, wait_type: any}
[4] kernel.return return_value=258 status=0x00000102 (TIMEOUT)
```
`timeout_ns: -30000000` MATCHES across engines (was `429466729600` pre-fix).
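The two logged values are consistent with a 32-bit constant built by `lis` + `ori` and interpreted as NT-style 100 ns ticks (negative meaning a relative timeout). The sketch below reproduces both numbers; the immediates `0xFFFB` and `0x6C20` are inferred from the logged values, not taken from the guest disassembly, and the function names are illustrative.

```rust
/// Kernel wait timeouts are expressed in 100 ns ticks.
const NS_PER_TICK: i64 = 100;

/// `lis rD, hi` followed by `ori rD, rD, lo` in a 64-bit register.
/// `sign_extend` selects the architecturally correct `lis`
/// (addis with rA = 0); `false` models the pre-fix truncation class.
fn lis_ori(hi: u16, lo: u16, sign_extend: bool) -> i64 {
    let upper = if sign_extend {
        ((hi as i16 as i64) << 16) as u64
    } else {
        (hi as u64) << 16
    };
    // `ori` zero-extends its immediate, so it only fills the low 16 bits.
    (upper | lo as u64) as i64
}

/// Convert a tick count to the nanosecond value logged in `wait.begin`.
fn timeout_ns(ticks: i64) -> i64 {
    ticks * NS_PER_TICK
}
```

With sign extension the register holds `0xFFFF_FFFF_FFFB_6C20` = -300,000 ticks, i.e. -30,000,000 ns (a 30 ms relative wait). Without it the register holds `0x0000_0000_FFFB_6C20` = 4,294,667,296 ticks, i.e. 429,466,729,600 ns, which is exactly the pre-fix value.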
## New downstream divergence at idx=4 (C+23 → C+24+ target)
The advance reveals the next-class issue at idx=4:
```
canary: [4] kernel.return KeWaitForSingleObject return_value=258 (TIMEOUT)
ours: [4] kernel.return KeWaitForSingleObject return_value=0 (SUCCESS)
```
Classification: **(A) scheduler-determinism**, same family as the C+20
and C+22 escalations. Ours' monolithic-thread runner does not let the
30 ms timeout window elapse with no signaler, so the wait returns
SUCCESS: either the event was already signaled at wait entry, or the
wait was served on an implicit fast path (which of the two is not yet
determined). Canary's contended scheduler lets the timeout fire. An
engine-side fix requires the parallel scheduler-determinism track
(multi-session refactor).
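The divergence can be stated as a small model of a relative timed wait. This is a sketch under assumptions, not either engine's scheduler: `signaled_after_ns` is a hypothetical input describing when (if ever) the object becomes signaled relative to wait entry, and only negative (relative) timeouts are modeled.

```rust
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_TIMEOUT: u32 = 0x0000_0102;

/// Outcome of a relative timed wait.
/// `signaled_after_ns`: `Some(0)` means already signaled at entry,
/// `Some(t)` means signaled `t` ns after entry, `None` means never.
/// `timeout_ns` is negative for a relative timeout, as in the trace.
fn wait_status(signaled_after_ns: Option<u64>, timeout_ns: i64) -> u32 {
    // The magnitude of a negative timeout is the length of the wait window.
    let window = timeout_ns.unsigned_abs();
    match signaled_after_ns {
        Some(t) if t <= window => STATUS_SUCCESS,
        _ => STATUS_TIMEOUT,
    }
}
```

Canary's result corresponds to `wait_status(None, -30_000_000)` (or any signal later than 30 ms), which gives `0x102`. Ours' result corresponds to a signal observed inside the window, for example `wait_status(Some(0), -30_000_000)`, which gives `0`. With the timeout encoding now identical on both sides, the remaining difference lies in when each scheduler considers the object signaled.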
## Verification that fix is NOT diff-tool jitter
Four independent lines of evidence:
1. **Direct ours-cold inspection** — the `wait.begin.timeout_ns`
field is read directly from ours-cold.jsonl (no diff-tool
interpretation), and it's now -30000000.
2. **Unit tests** — `lis_ori_std_negative_timeout_writes_sign_extended_doubleword`
in xenia-cpu asserts the architectural fact directly.
3. **Determinism** — 3× cold runs produce byte-identical det-fields
digest. The fix isn't a race that flickered on this one sample.
4. **Phase B image hash unchanged** — the fix is purely behavioral
in the CPU interpreter (`xenia-cpu/src/interpreter.rs`), not a re-link
or image change.
## Cascade outcome
- A=verify canary's timeout read logic: PASS (identical formula).
- B=identify encoding bug class: PASS — (d) sign-extension.
- C=land fix: PASS — 5 LOC + 3 tests.
- D=tid=12→7 advances past 3: PASS (3 → 4).
- E=no regression on main or other sisters: PASS (all preserved).
## Files
- `investigation.md`
- `cold-vs-cold-result.md` (this file)
- `diff-cold-vs-cold.md`
- `re-validation.md`
- `ours-cold.jsonl` / `ours-cold-stdout.log` / `ours-cold-stderr.log`
- `canary-cold-trunc.jsonl` / `canary-cold-stdout.log`
- `canary-binary-cache-pre-wipe.tar.gz` / `canary-xdg-cache-pre-wipe.tar.gz`
- `digest-cold-stable-1.json` / `-2.json` / `-3.json`
- `fix.diff`
## Next-target recommendation
- **C+24 = D-NEW-3** (canary tid=14 → ours tid=9 idx=41): canary
calls `XAudioGetVoiceCategoryVolumeChangeMask`; ours calls
`RtlEnterCriticalSection`. Likely missing/stubbed XAudio export
in ours causing fallback. Independent of scheduler-determinism.
- **Parallel scheduler-determinism track**: tackle the C+20/C+22
family, plus the newly surfaced C+23 idx=4 divergence, at the root via
a per-CS-pointer expected-contention inference layer. Multi-session.