xenia-rs/audit-runs/iterate-2N-rebaseline/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Iterate 2.N — Clean re-baseline post-2.F/2.H/2.L/2.M (writer report)

Date: 2026-05-28. LOC delta: engine 0, canary 0, tooling 0. Pure recon, no source modifications. Tests: N/A (no code change). Cascade: N/A — recon class per tripstone #39; #40 explicitly NOT claiming any cascade fix.

Headline

BASELINE-CLEAN-DIVERGENCE-CHARACTERIZED. Categorized first-divergence detected on the main chain (canary tid=6 → ours tid=1) at matched-prefix position 105,286, bit-identical to the Phase C+23 baseline (105,286 before yesterday's fixes; still 105,286 today). Engine fixes 2.F (VdSwap drain) + 2.H (vA0000000 bucket) + tooling fixes 2.L (categorized harness) + 2.M (always-on exit-thread-state.json) all verified operating as designed. ours wedge geometry bit-identical to 2.K/2.M (exit-thread-state.json diff-clean vs 2.M). One previously-hidden [return_value mismatch] surfaces on the sister chain (canary tid=12 → ours tid=7) at idx=4: KeWaitForSingleObject returns 258 (STATUS_TIMEOUT) on canary vs 0 (SUCCESS) on ours, a return-value class divergence that the categorized harness now flags explicitly.

Mode

Pure measurement, ZERO LOC change. Invocation (identical to 2.J/2.K/2.M):

XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
  -n 50000000 --quiet \
  --phase-a-event-log audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
  "<iso>"

XDG cache directory ~/.local/share/xenia-rs/cache/ empty at run start (belt-and-braces; XENIA_CACHE_WIPE=1 already redirects to per-pid tmpdir). Canary trace reused from phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl (cold-cache capture from 2026-05-18; matches the canonical Phase C+23 baseline used by 2.J/2.L). No fresh canary run needed.

Categorized harness invocation:

python3 tools/diff-events/diff_events.py \
  --canary audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl \
  --ours   audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
  --out    audit-runs/iterate-2N-rebaseline/diff-report.md

Infrastructure gate verification

| infrastructure | expectation | observed | result |
|---|---|---|---|
| 2.F VdSwap drain (900ms → 1ms) | host_ns at idx 105,285 ≈ 0.67s (not 1.6s) | 0.670s | PASS |
| 2.H vA0000000 physical heap bucket | 0xbe8cbb3c, 0xbd184a40, 0xbc6c5640 thread ctx_ptrs | identical 3 vA-bucket ctx_ptrs | PASS |
| 2.L categorized tags surface | [return_value mismatch] / [status mismatch] / [args_resolved.path mismatch] greppable | 1 [return_value mismatch] on sister chain tid=12→7 | PASS |
| 2.L raw idx surfaced both sides | canary raw tid_event_idx=N, ours raw tid_event_idx=M in report | present on every divergence line | PASS |
| 2.M exit-thread-state.json auto-emit | sibling file in trace dir, no flag needed | 9651 bytes, 13 threads + 10 wedge entries | PASS |
| 2.M stderr emission notice | exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies) | identical line emitted | PASS |

All four infrastructure pieces working as designed. No regression.

Cascade questions (recon-only — no fix claim)

(a) Has the matched prefix grown post-fixes?

No — matched prefix is bit-identical to Phase C+23 baseline at 105,286. This is the same prefix length 2.J/2.L/2.M produced. The cache-wipe + physical-heap + VdSwap-drain fixes did NOT advance the matched-prefix length on the main chain — they shifted what's running within the prefix (cache-rebuild tid=4 from 160 → 2,075 events, wedge PC geometry from 1.7s spin → 0.7s natural-end) but did not extend it. The divergence at the post-VdSwap control-flow boundary (VdGetCurrentDisplayGamma canary vs KeAcquireSpinLockAtRaisedIrql ours) was NOT what cache-wipe / heap-bucket / VdSwap-drain were addressing. Conclusion: Phase C+23 cap remains the next-frontier on the main chain.

(b) Is the first divergence still at idx 102424?

No — 102,424 (the NtQueryFullAttributesFile cache-probe SUCCESS/NO_SUCH_FILE inversion) is CLOSED. Main chain advances to 105,286. All 8 ours-side NtQueryFullAttributesFile cache-probe returns now equal 0xc000000f (STATUS_NO_SUCH_FILE), bit-aligned with canary's cold-cache returns (verified by enumerating every cache:\* probe in ours-cold.jsonl). The 2.J finding holds: cache-state parity restored, cache-probe inversion absent, harness correctly advances to the next-downstream divergence.
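The probe enumeration described above can be sketched as a short scan over the trace. This is a minimal sketch, not the harness's own code: the record layout (a top-level `kind`, plus `payload.name`, `payload.return_value` and `payload.args_resolved.path`) is an assumption inferred from the field names quoted in this report, and `cache_probe_returns` / `all_probes_miss` are hypothetical helpers.

```python
import json

STATUS_NO_SUCH_FILE = 0xC000000F


def cache_probe_returns(lines):
    """Yield (path, return_value) for every cache:\\ probe return in a trace.

    ASSUMED schema (not confirmed by the report): one JSON object per line,
    kind == "kernel.return", payload.{name, return_value, args_resolved.path}.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "kernel.return":
            continue
        payload = event.get("payload", {})
        if payload.get("name") != "NtQueryFullAttributesFile":
            continue
        path = payload.get("args_resolved", {}).get("path", "")
        if path.lower().startswith("cache:\\"):
            # Mask to 32 bits so a signed encoding of the NTSTATUS compares equal.
            yield path, payload.get("return_value", 0) & 0xFFFFFFFF


def all_probes_miss(lines):
    """True when at least one probe exists and all returned STATUS_NO_SUCH_FILE."""
    results = list(cache_probe_returns(lines))
    return bool(results) and all(rv == STATUS_NO_SUCH_FILE for _, rv in results)
```

Under those assumptions it could be run as `all_probes_miss(open("ours-cold.jsonl"))`; requiring at least one probe avoids reporting success on a trace that contains no probes at all.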

(c) What is the new first divergence's category + signature?

Main chain (canary tid=6 → ours tid=1) at matched-prefix 105,286 (canary raw idx=105298, ours raw idx=105286): payload.ord mismatch — NOT a categorized return_value / status / args case. Canary fires import.call VdGetCurrentDisplayGamma (ord=441) immediately after kernel.return VdSwap; ours fires import.call KeAcquireSpinLockAtRaisedIrql (ord=77) at the same position. Different functions called from the same matched-prefix tail = control-flow branch divergence inside the post-VdSwap guest code. Pre-context (last 5 matching events): VdGetSystemCommandBuffer (call+return) → VdSwap (import.call, kernel.call, kernel.return). After kernel.return VdSwap, the two engines branch.
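The distinction between a payload.ord divergence and a categorized return-value divergence can be illustrated with a small position-by-position comparison. This is a sketch under stated assumptions, not `diff_events.py` itself: events are modelled as flat dicts, the compared field names and tag strings are taken from this report's wording, and the sketch does not model whatever alignment makes the raw indices differ between the two sides.

```python
def first_divergence(canary, ours,
                     keys=("kind", "ord", "name", "return_value", "status")):
    """Return (matched_prefix_len, tag) for two per-thread event lists.

    Only the fields in `keys` are compared, so host-side fields such as
    host_ns never cause a mismatch. Field names are assumptions.
    """
    for idx, (c, o) in enumerate(zip(canary, ours)):
        for key in keys:
            if c.get(key) != o.get(key):
                # Return-value/status differences get a categorized tag;
                # anything else (e.g. a different ordinal) is reported raw.
                if key in ("return_value", "status"):
                    return idx, f"[{key} mismatch]"
                return idx, f"payload.{key} mismatch"
    # No differing pair: the shorter list is a prefix of the longer one.
    return min(len(canary), len(ours)), None


prefix = [{"kind": "kernel.return", "name": "VdSwap"}]
canary = prefix + [{"kind": "import.call", "name": "VdGetCurrentDisplayGamma", "ord": 441}]
ours = prefix + [{"kind": "import.call", "name": "KeAcquireSpinLockAtRaisedIrql", "ord": 77}]
print(first_divergence(canary, ours))  # (1, 'payload.ord mismatch')
```

Checking `ord` before `return_value` means a different function at the same position is never mislabelled as a return-value problem.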

Sister chain (canary tid=12 → ours tid=7) at idx=4 (FIRST iteration where 2.L's category tag actually fires): [return_value mismatch] kernel.return name=KeWaitForSingleObject: canary=258 ours=0. Canary returns STATUS_TIMEOUT (0x102 = 258); ours returns SUCCESS (0). Pre-context (idx 0-3 match exactly): import.call + kernel.call KeWaitForSingleObject → handle.create (different SIDs: canary c49d8f0ab90401ea vs ours 9559797117e919f0, but absorbed by Phase C+18 cross-tid SID matching) → wait.begin with timeout_ns=-30000000 (30ms relative wait, IDENTICAL on both sides) → divergent return. This is the AUDIT-069 / Phase C+23 KeWaitForSingleObject timeout-encoding family, but at a NEW position the categorized harness now exposes. ours's wait returns SUCCESS where canary's times out, implying ours's wait object is signaled within the 30ms window where canary's is not, the opposite of the audio underrun class (#34/#35). Worth investigating in the next iterate as a new lead.
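The arithmetic behind the two readings above (timeout_ns=-30000000 as a 30ms relative wait, and 258 as STATUS_TIMEOUT) can be checked with a few lines. A minimal sketch: the helper names are hypothetical, and treating a non-negative value as an absolute deadline is an assumption the report does not state.

```python
STATUS_SUCCESS = 0x00000000
STATUS_TIMEOUT = 0x00000102  # 258 decimal


def describe_timeout_ns(timeout_ns):
    """Read a signed timeout the way this report does: negative = relative."""
    if timeout_ns < 0:
        return ("relative", -timeout_ns / 1_000_000)  # nanoseconds -> milliseconds
    # Assumption: non-negative values denote an absolute deadline.
    return ("absolute", timeout_ns / 1_000_000)


def status_name(value):
    """Name the two status codes seen on the sister chain; hex for anything else."""
    names = {STATUS_SUCCESS: "STATUS_SUCCESS", STATUS_TIMEOUT: "STATUS_TIMEOUT"}
    masked = value & 0xFFFFFFFF
    return names.get(masked, hex(masked))


print(describe_timeout_ns(-30_000_000))  # ('relative', 30.0)
print(status_name(258), status_name(0))  # STATUS_TIMEOUT STATUS_SUCCESS
```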

(d) Does ours's exit-state show same 5 blocked tids at PC=0x824ac578?

Yes — bit-identical to 2.M. diff -q iterate-2N-rebaseline/exit-thread-state.json iterate-2M-exit-state-dump/exit-thread-state.json returns silent (no differences). 13 alive threads, 10 wedge entries. Blocked tids at PC 0x824ac578: tid 1, 13, 4, 5, 3 (same 5 as 2.K/2.M). Wedge map:

tid=1  → Thread(id=13)              (handle 0x000012c8, signaler=13 → circular)
tid=13 → Event(sig=false)           (handle 0x000012d0, signaler=null)
tid=4  → Semaphore(0/2147483647)    (handle 0x00001028, signaler=null = AUDIT-069 work-sem)
tid=5  → Event(sig=false)           (handle 0x000012e4, signaler=null)
tid=3  → Event(sig=false)           (handle 0x00001020, signaler=null)
tid=11 → Event(sig=false) × 2       (handles 0x828a3244, 0x828a3220)
tid=2  → Event(sig=false)           (handle 0x8287093c)
tid=8  → Event(sig=false)           (handle 0x000010ec)
tid=8  → Semaphore(0/2147483647)    (handle 0x000010d8)

Wedge geometry stable across 2.M ↔ 2.N (deterministic).
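One way to read a wedge map like the one above is to follow each waiter to its signaler until reaching a wait nobody is known to signal. The sketch below does that on a hand-copied subset of the map; the dict layout and the `root_blocker` helper are hypothetical and do not reflect the actual exit-thread-state.json schema.

```python
def root_blocker(wedge, tid):
    """Follow waiter -> signaler links starting at `tid`.

    `wedge` maps a blocked tid to (object_description, signaler_tid_or_None).
    Returns ("root", tid) for the thread whose wait has no known signaler,
    ("cycle", tid) if the chain loops back on itself, or ("running", tid)
    if the chain leaves the set of blocked threads.
    """
    seen = set()
    while tid in wedge:
        if tid in seen:
            return ("cycle", tid)
        seen.add(tid)
        _, signaler = wedge[tid]
        if signaler is None:
            return ("root", tid)
        tid = signaler
    return ("running", tid)


# Subset of the wedge map above, transcribed by hand.
wedge = {
    1: ("Thread(id=13)", 13),
    13: ("Event(sig=false)", None),
    4: ("Semaphore(0/2147483647)", None),
}
print(root_blocker(wedge, 1))  # ('root', 13)
```

On this subset, tid=1 resolves to tid=13, whose event has no recorded signaler.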

(e) ours's thread set vs canary at same wallclock — what is missing?

ours's 10 thread.create entry_pcs:

0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0

Canary's spawns up to canary host_ns ≤ 1.698s (matched-prefix tail +100ms slack): the SAME 8 entry_pcs in the SAME order (ours's 9th + 10th spawns happen slightly past the matched-prefix-tail wallclock in ours, but canary spawns those same two at host_ns=1.897s/1.902s, also past the matched-prefix tail). At the matched-prefix boundary the thread sets are bit-identical entry-pc-wise. Canary diverges by spawning 8 additional threads in the full 97s capture window:

0x821c4ad0 @ 1.924s  tid=17
0x822c6870 @ 1.928s  tid=18 (× 2)
0x824563e0 @ 2.050s  tid=6  (spawned-by)
0x82170430 @ 2.064s  tid=6
0x823dde30 @ 2.082s  tid=6
0x823ddb50 @ 2.083s  tid=6 (× 2)

These are the canary-only sub_825070F0 worker fan-out family + 0x821c4ad0 renderer + 0x822c6870 audio classes that the AUDIT-049/2.K/2.M lineage documents. ours never reaches the install epoch that spawns them because ours wedges/budget-ends at host_ns=1.008s (50M-instr budget cap), while the install epoch fires on canary at host_ns≈9.4s per AUDIT-068. Thread-set gap is the same as documented; no new missing-thread class surfaced.
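The thread-set comparison above keys on entry_pc rather than on raw tid, and compares up to a wallclock cutoff. A minimal sketch of that idea follows; the flat event layout (`kind`, `entry_pc`, `host_ns`) is an assumption, the helper names are hypothetical, and the host_ns of the shared spawn is a placeholder rather than a measured value.

```python
from collections import Counter


def entry_pcs_until(events, cutoff_ns):
    """Ordered entry_pcs of thread.create events with host_ns <= cutoff_ns."""
    return [e["entry_pc"] for e in events
            if e.get("kind") == "thread.create" and e["host_ns"] <= cutoff_ns]


def missing_spawns(canary_events, ours_events):
    """Multiset of entry_pcs that canary spawned more often than ours."""
    def count(events):
        return Counter(e["entry_pc"] for e in events
                       if e.get("kind") == "thread.create")
    return count(canary_events) - count(ours_events)


canary = [
    {"kind": "thread.create", "entry_pc": 0x82181830, "host_ns": 100_000_000},  # placeholder time
    {"kind": "thread.create", "entry_pc": 0x821C4AD0, "host_ns": 1_924_000_000},
    {"kind": "thread.create", "entry_pc": 0x822C6870, "host_ns": 1_928_000_000},
    {"kind": "thread.create", "entry_pc": 0x822C6870, "host_ns": 1_928_000_000},
]
ours = [{"kind": "thread.create", "entry_pc": 0x82181830, "host_ns": 100_000_000}]  # placeholder time

cutoff = 1_698_000_000  # 1.698s: matched-prefix tail plus slack, as quoted above
same_at_boundary = entry_pcs_until(canary, cutoff) == entry_pcs_until(ours, cutoff)
gap = {hex(pc): n for pc, n in missing_spawns(canary, ours).items()}
print(same_at_boundary, gap)
```

Using a multiset rather than a set keeps the "× 2" spawns visible as a count of 2 instead of collapsing them.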

Comparison table — Phase C+23 baseline vs 2.N

| metric | Phase C+23 (2.J/2.L) | 2.N | delta |
|---|---|---|---|
| Main-chain matched prefix | 105,286 | 105,286 | 0 (bit-identical) |
| Main-chain first divergence kind | payload.ord (import.call) | payload.ord (import.call) | UNCHANGED |
| Main-chain divergence: canary fn | VdGetCurrentDisplayGamma (ord 441) | VdGetCurrentDisplayGamma (ord 441) | UNCHANGED |
| Main-chain divergence: ours fn | KeAcquireSpinLockAtRaisedIrql (ord 77) | KeAcquireSpinLockAtRaisedIrql (ord 77) | UNCHANGED |
| Cache-probe inversions (NtQueryFullAttributesFile) | 0 (closed by 2.I/2.J) | 0 | UNCHANGED |
| ours total events | 121,569 (2.J/2.M) | 121,569 | 0 (bit-identical) |
| ours thread.create count | 10 (2.J/2.M) | 10 | 0 (bit-identical) |
| ours wedge map size | 10 entries (2.M) | 10 | bit-identical to 2.M |
| ours blocked tids at PC=0x824ac578 | {1,13,4,5,3} (2.M) | {1,13,4,5,3} | bit-identical to 2.M |
| Categorized [return_value mismatch] count | n/a (pre-2.L) | 1 (sister chain tid=12→7) | newly visible |
| Categorized [status mismatch] count | n/a | 0 | |
| Categorized [args_resolved.* mismatch] count | n/a | 0 | |
| exit-thread-state.json auto-emit | n/a (pre-2.M) | YES (no flag) | infrastructure-new |

Confidence

  • HIGH that ours's 2.N trace is deterministic vs 2.M (121,569 events bit-equal payload-wise; only host_ns and post-divergence guest_cycle differ on 6 of 121,569 lines).
  • HIGH that infrastructure 2.F/2.H/2.L/2.M all operate as designed (gate table all-PASS).
  • HIGH that matched-prefix length is 105,286 (categorized harness output explicit, raw idx printed on both sides per 2.L's reading-error #41 closure).
  • HIGH that wedge geometry is unchanged from 2.M (bit-identical exit-thread-state.json).
  • HIGH that the main-chain divergence is a payload.ord (not a return_value / status / args) class — i.e., the categorized harness correctly does NOT misclassify it.
  • MEDIUM-HIGH that the sister-chain [return_value mismatch] at tid=12→7 idx=4 (KeWaitForSingleObject 258 vs 0) is a NEW finding worth investigating. The categorized harness made this visible at-a-glance for the first time. Pre-2.L it would have surfaced as a generic payload.return_value line, not greppable as a return-value class.
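The determinism claim in the first bullet (payload bit-equal, with only host_ns and guest_cycle allowed to differ) could be checked along the following lines. This is a sketch, not the procedure actually used: it assumes host_ns and guest_cycle are top-level fields of each JSONL record, and the helper names are hypothetical.

```python
import json

VOLATILE = ("host_ns", "guest_cycle")  # fields the report says may differ between runs


def stable_view(line):
    """Parse one JSONL record and drop the volatile top-level fields."""
    event = json.loads(line)
    return {k: v for k, v in event.items() if k not in VOLATILE}


def differing_lines(run_a, run_b):
    """Indices where two equal-length traces differ once volatile fields are dropped."""
    if len(run_a) != len(run_b):
        raise ValueError("traces have different event counts")
    return [i for i, (a, b) in enumerate(zip(run_a, run_b))
            if stable_view(a) != stable_view(b)]
```

An empty result from `differing_lines` on the 2.M and 2.N traces would correspond to the "bit-equal payload-wise" statement; comparing parsed objects rather than raw text keeps key order and whitespace from registering as differences.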

Tripstone audit

  • #28 (cross-engine tid stability): comparisons keyed on entry_pc. Main chain identified by (canary tid=6, ours tid=1) — these are stable cross-engine identities established by the harness's alignment, not by raw tid integers. Wedge map intra-run tids acceptable (ours-only).
  • #39 (composite progression): recon class, NO progression claim made. VdSwap count UNCHANGED (1), draw count UNCHANGED (0).
  • #40 (single-keystone framing): explicitly NOT proposing any fix. This iterate verifies prior fixes are clean; it does NOT assert any one-step cascade unblock.
  • #41 (silent test-harness state leak): CLOSED at the harness output level by 2.L. Verified empirically — categorized tags emit on return-value mismatch (sister chain tid=12→7 idx=4), and raw per-tid idx surfaced on both sides of every divergence.
  • #42 (Phase-A blind to blocked-forever waits): CLOSED at the output level by 2.M. Verified empirically — exit-thread-state.json auto-emitted with full 13-thread snapshot + 10-entry wedge map. No flag required; no manual diag dump needed.

Next-iterate recommendation (single sentence, no fix proposal)

Two clean leads with the new visibility, in priority order: (1) the sister-chain [return_value mismatch] at tid=12→7 idx=4 (KeWaitForSingleObject canary=258 STATUS_TIMEOUT vs ours=0 SUCCESS) is brand-new actionable data the categorized harness uncovered — worth a ~0-LOC investigation iterate to localize which wait object differs and why ours's signaler races canary's by < 30ms; (2) the main-chain post-VdSwap branch at 105,286 is the same blocker as Phase C+23 and remains the strategic target, but is downstream of the install-epoch gap and likely needs the longer-budget replay (-n 500000000) plus the install-chain investigation already outlined in 2.K/2.J writer-reports.

Artifacts

Under xenia-rs/audit-runs/iterate-2N-rebaseline/:

  • ours-cold.jsonl (Phase-A trace, 121,569 events, ~28MB, payload bit-equal to 2.J/2.M)
  • ours-cold.stdout.log (empty — quiet mode)
  • ours-cold.stderr.log (single line: 2.M emission notice)
  • exit-thread-state.json (13 threads + 10 wedge entries; bit-equal to 2.M's)
  • diff-report.md (categorized harness output: 4 first-divergence blocks, 1 [return_value mismatch] tag, all with raw idx both sides)
  • writer-report.md (this file)

Canary trace REUSED (not re-captured): xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl (132MB, 565,773 events, cold-cache 2026-05-18 capture).