handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,267 @@
# Iterate 2.N — Clean re-baseline post-2.F/2.H/2.L/2.M (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **0**, canary **0**, tooling
**0**. Pure recon, no source modifications.
**Tests:** N/A (no code change).
**Cascade:** N/A (recon class per tripstone #39); per #40, explicitly NOT
claiming any cascade fix.
## Headline
**BASELINE-CLEAN-DIVERGENCE-CHARACTERIZED.** Categorized first-divergence
detected on the main chain (canary tid=6 → ours tid=1) at matched-prefix
position **105,286**, **bit-identical to the Phase C+23 baseline** (was
105,286 before yesterday's fixes; remains 105,286 today). Engine fixes
2.F (VdSwap drain) + 2.H (vA0000000 bucket) + tooling fix 2.L
(categorized harness) + 2.M (always-on `exit-thread-state.json`) all verified
operating as designed. ours wedge geometry **bit-identical to 2.K/2.M**
(`exit-thread-state.json` diff-clean vs 2.M). One previously-hidden
`[return_value mismatch]` surfaces on the sister chain
(canary tid=12 → ours tid=7) at idx=4 — KeWaitForSingleObject returns
`258` (STATUS_TIMEOUT) on canary vs `0` (SUCCESS) on ours, a
return-value class divergence that the categorized harness now flags
explicitly.
## Mode
Pure measurement, ZERO LOC change. Invocation (identical to 2.J/2.K/2.M):
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
-n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
"<iso>"
```
XDG cache directory `~/.local/share/xenia-rs/cache/` empty at run start
(belt-and-braces; `XENIA_CACHE_WIPE=1` already redirects to per-pid
tmpdir). Canary trace **reused** from
`phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl` (cold-cache
capture from 2026-05-18; matches the canonical Phase C+23 baseline used
by 2.J/2.L). No fresh canary run needed.
Categorized harness invocation:
```
python3 tools/diff-events/diff_events.py \
--canary audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl \
--ours audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
--out audit-runs/iterate-2N-rebaseline/diff-report.md
```
## Infrastructure gate verification
| infrastructure | expectation | observed | result |
|---|---|---|---|
| **2.F** VdSwap drain (900ms → 1ms) | host_ns at idx 105,285 ≈ 0.67s (not 1.6s) | 0.670s | **PASS** |
| **2.H** vA0000000 physical heap bucket | `0xbe8cbb3c`, `0xbd184a40`, `0xbc6c5640` thread ctx_ptrs | identical 3 vA-bucket ctx_ptrs | **PASS** |
| **2.L** categorized tags surface | `[return_value mismatch]` / `[status mismatch]` / `[args_resolved.path mismatch]` greppable | 1 `[return_value mismatch]` on sister chain tid=12→7 | **PASS** |
| **2.L** raw idx surfaced both sides | `canary raw tid_event_idx=N, ours raw tid_event_idx=M` in report | present on every divergence line | **PASS** |
| **2.M** `exit-thread-state.json` auto-emit | sibling file in trace dir, no flag needed | 9651 bytes, 13 threads + 10 wedge entries | **PASS** |
| **2.M** stderr emission notice | `exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies)` | identical line emitted | **PASS** |
All four infrastructure pieces (2.F, 2.H, 2.L, 2.M) are working as designed. No regression.
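The 2.L "greppable tags" gate can be re-checked mechanically. The following is a minimal sketch, assuming the tags appear in `diff-report.md` spelled exactly as in the gate table; the regex and the sample line are illustrative assumptions, not the harness's own code or real report output.

```python
import re
from collections import Counter

# Tag spellings are copied from the gate table; the regex itself is an assumption.
TAG_RE = re.compile(r"\[(return_value|status|args_resolved\.[\w.]+) mismatch\]")

def count_tags(report_text: str) -> Counter:
    """Count categorized mismatch tags found in a diff report."""
    return Counter(m.group(1) for m in TAG_RE.finditer(report_text))

# Hypothetical excerpt; the real diff-report.md layout may differ.
sample = (
    "tid=12->7 idx=4 [return_value mismatch] kernel.return "
    "name=KeWaitForSingleObject: canary=258 ours=0\n"
)
counts = count_tags(sample)
print(dict(counts))
```

Against the real report, passing the file contents to `count_tags` should agree with the per-category counts in the comparison table if the tag format matches this assumption.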
## Cascade questions (recon-only — no fix claim)
### (a) Has the matched prefix grown post-fixes?
**No — matched prefix is bit-identical to Phase C+23 baseline at
105,286.** This is the same prefix length 2.J/2.L/2.M produced. The
cache-wipe + physical-heap + VdSwap-drain fixes did NOT advance the
matched-prefix length on the main chain — they shifted *what's running*
within the prefix (cache-rebuild tid=4 from 160 → 2,075 events, wedge PC
geometry from 1.7s spin → 0.7s natural-end) but did not extend it. The
divergence at the post-VdSwap control-flow boundary
(`VdGetCurrentDisplayGamma` canary vs `KeAcquireSpinLockAtRaisedIrql`
ours) was NOT what cache-wipe / heap-bucket / VdSwap-drain were
addressing. Conclusion: Phase C+23 cap remains the next-frontier on the
main chain.
### (b) Is the first divergence still at idx 102424?
**No — 102,424 (the NtQueryFullAttributesFile cache-probe
SUCCESS/NO_SUCH_FILE inversion) is CLOSED. Main chain advances to
105,286.** All 8 ours-side `NtQueryFullAttributesFile` cache-probe
returns now equal `0xc000000f` (STATUS_NO_SUCH_FILE), bit-aligned with
canary's cold-cache returns (verified by enumerating every
`cache:\*` probe in ours-cold.jsonl). The 2.J finding holds: cache-state
parity restored, cache-probe inversion absent, harness correctly
advances to the next-downstream divergence.
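The "enumerate every cache probe" check can be sketched as follows. The JSONL field names (`kind`, `payload.name`, `payload.return_value`, `payload.args_resolved.path`) are assumptions inferred from the tag names in this report, not a confirmed schema, and the sample path is hypothetical.

```python
import json

STATUS_NO_SUCH_FILE = 0xC000000F

def cache_probe_returns(lines):
    """Yield (path, return_value) for each cache-partition probe return event."""
    for line in lines:
        ev = json.loads(line)
        payload = ev.get("payload", {})
        path = payload.get("args_resolved", {}).get("path", "")
        if (ev.get("kind") == "kernel.return"
                and payload.get("name") == "NtQueryFullAttributesFile"
                and path.startswith("cache:\\")):
            yield path, payload["return_value"]

# Hypothetical events standing in for lines of ours-cold.jsonl.
sample = [
    json.dumps({"kind": "kernel.return", "payload": {
        "name": "NtQueryFullAttributesFile",
        "return_value": STATUS_NO_SUCH_FILE,
        "args_resolved": {"path": "cache:\\example.bin"}}}),
    json.dumps({"kind": "kernel.return", "payload": {
        "name": "VdSwap", "return_value": 0}}),
]
probes = list(cache_probe_returns(sample))
all_cold = all(rv == STATUS_NO_SUCH_FILE for _, rv in probes)
print(len(probes), all_cold)
```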
### (c) What is the new first divergence's category + signature?
**Main chain (canary tid=6 → ours tid=1) at matched-prefix 105,286**
(canary raw idx=105298, ours raw idx=105286):
**`payload.ord` mismatch — NOT a categorized return_value / status /
args case.** Canary fires `import.call VdGetCurrentDisplayGamma`
(ord=441) immediately after `kernel.return VdSwap`; ours fires
`import.call KeAcquireSpinLockAtRaisedIrql` (ord=77) at the same
position. Different functions called from the same matched-prefix tail
= control-flow branch divergence inside the post-VdSwap guest code.
Pre-context (last 5 matching events):
`VdGetSystemCommandBuffer` (call+return) → `VdSwap` (import.call,
kernel.call, kernel.return). After `kernel.return VdSwap`, the two
engines branch.
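To make "matched-prefix position" concrete, here is a minimal sketch of a first-divergence search over two aligned event streams. It is not the `diff_events.py` implementation: the event dicts and the `(kind, name)` comparison key are assumptions, and the real harness reports the mismatch on `payload.ord`.

```python
from itertools import zip_longest

def first_divergence(canary, ours, key=lambda ev: (ev["kind"], ev["name"])):
    """Return (matched_prefix_len, canary_ev, ours_ev), or None if the streams match."""
    for i, (c, o) in enumerate(zip_longest(canary, ours)):
        if c is None or o is None or key(c) != key(o):
            return i, c, o
    return None

# Shared tail of the prefix, taken from the pre-context described above.
prefix = [
    {"kind": "import.call", "name": "VdSwap"},
    {"kind": "kernel.call", "name": "VdSwap"},
    {"kind": "kernel.return", "name": "VdSwap"},
]
canary = prefix + [{"kind": "import.call", "name": "VdGetCurrentDisplayGamma"}]
ours = prefix + [{"kind": "import.call", "name": "KeAcquireSpinLockAtRaisedIrql"}]

idx, c_ev, o_ev = first_divergence(canary, ours)
print(idx, c_ev["name"], o_ev["name"])
```

In this toy input the matched prefix is 3 events long; in the real run the same search lands at 105,286.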
**Sister chain (canary tid=12 → ours tid=7) at idx=4** (FIRST iteration
where 2.L's category tag actually fires):
**`[return_value mismatch] kernel.return name=KeWaitForSingleObject:
canary=258 ours=0`.** Canary returns `STATUS_TIMEOUT` (`0x102` = 258);
ours returns `SUCCESS` (0). Pre-context (idx 0-3 match exactly):
import.call + kernel.call `KeWaitForSingleObject` →
`handle.create` (different SIDs: canary `c49d8f0ab90401ea` vs ours
`9559797117e919f0`, but absorbed by Phase C+18 cross-tid SID matching) →
`wait.begin` with `timeout_ns=-30000000` (30ms relative wait, IDENTICAL
on both sides) → divergent return. This is the AUDIT-069 / phase-C+23
KeWaitForSingleObject timeout-encoding family but at a NEW position
the categorized harness now exposes. ours's wait returns SUCCESS where
canary times out, implying ours's wait object is signaled within the
30ms window where canary's is not — opposite of the audio underrun
class (#34/#35). Worth investigating in next iterate as a new lead.
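The numeric values in this divergence can be decoded with a small helper. This is a sketch: the status table covers only the three NTSTATUS codes cited in this report, and treating a negative `timeout_ns` as relative follows the report's own reading of `wait.begin`.

```python
# Only the NTSTATUS codes cited in this report.
NTSTATUS_NAMES = {
    0x00000000: "STATUS_SUCCESS",
    0x00000102: "STATUS_TIMEOUT",
    0xC000000F: "STATUS_NO_SUCH_FILE",
}

def describe_status(value: int) -> str:
    """Render a return value as hex plus its NTSTATUS name, if known."""
    value &= 0xFFFFFFFF
    return f"0x{value:08x} {NTSTATUS_NAMES.get(value, 'UNKNOWN')}"

def describe_timeout_ns(timeout_ns: int) -> str:
    """Negative values are read as relative waits, non-negative as absolute."""
    kind = "relative" if timeout_ns < 0 else "absolute"
    return f"{abs(timeout_ns) / 1_000_000:g} ms {kind}"

print(describe_status(258))            # canary
print(describe_status(0))              # ours
print(describe_timeout_ns(-30000000))  # both sides
```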
### (d) Does ours's exit-state show same 5 blocked tids at PC=0x824ac578?
**Yes — bit-identical to 2.M.** `diff -q
iterate-2N-rebaseline/exit-thread-state.json
iterate-2M-exit-state-dump/exit-thread-state.json` prints nothing
(no differences). 13 alive threads, 10 wedge entries. Blocked tids at
PC `0x824ac578`: **tid 1, 13, 4, 5, 3** (same 5 as 2.K/2.M).
Wedge map:
```
tid=1 → Thread(id=13) (handle 0x000012c8, signaler=13 → circular)
tid=13 → Event(sig=false) (handle 0x000012d0, signaler=null)
tid=4 → Semaphore(0/2147483647) (handle 0x00001028, signaler=null = AUDIT-069 work-sem)
tid=5 → Event(sig=false) (handle 0x000012e4, signaler=null)
tid=3 → Event(sig=false) (handle 0x00001020, signaler=null)
tid=11 → Event(sig=false) × 2 (handles 0x828a3244, 0x828a3220)
tid=2 → Event(sig=false) (handle 0x8287093c)
tid=8 → Event(sig=false) (handle 0x000010ec)
tid=8 → Semaphore(0/2147483647) (handle 0x000010d8)
```
Wedge geometry stable across 2.M ↔ 2.N (deterministic).
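One way to read the wedge map is as a wait-for chain: follow each blocked tid's signaler until reaching a wait that nobody is recorded as able to signal. The sketch below transcribes only the five tids blocked at PC `0x824ac578`; the dict layout is an assumption for illustration and not the `exit-thread-state.json` schema.

```python
# Transcribed from the wedge map above: tid -> (wait object, signaler tid or None).
WEDGE = {
    1: ("Thread(id=13)", 13),
    13: ("Event(sig=false)", None),
    4: ("Semaphore(0/2147483647)", None),
    5: ("Event(sig=false)", None),
    3: ("Event(sig=false)", None),
}

def root_blocker(tid, wedge):
    """Follow signaler links until a wait with no known signaler, or a cycle."""
    seen = []
    while tid in wedge and tid not in seen:
        seen.append(tid)
        signaler = wedge[tid][1]
        if signaler is None:
            return tid, seen
        tid = signaler
    return tid, seen

for t in WEDGE:
    root, chain = root_blocker(t, WEDGE)
    print(f"tid={t}: chain={chain} root={root} on {WEDGE[root][0]}")
```

Under this reading, tid=1 is blocked transitively on tid=13's unsignaled event, while tids 13, 4, 5, and 3 are each their own root.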
### (e) ours's thread set vs canary at same wallclock — what is missing?
ours's 10 thread.create entry_pcs:
```
0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0
```
Canary's spawns up to host_ns ≤ 1.698s (matched-prefix tail plus 100ms
slack) are the **SAME 8 entry_pcs in the SAME order**. ours's 9th and
10th spawns land slightly past the matched-prefix-tail wallclock; canary
spawns those same two at host_ns=1.897s/1.902s, also past the
matched-prefix tail. At the matched-prefix boundary the
thread sets are bit-identical entry-pc-wise. Canary diverges by
spawning **8 additional threads** in the full 97s capture window:
```
0x821c4ad0 @ 1.924s tid=17
0x822c6870 @ 1.928s tid=18 (× 2)
0x824563e0 @ 2.050s tid=6 (spawned-by)
0x82170430 @ 2.064s tid=6
0x823dde30 @ 2.082s tid=6
0x823ddb50 @ 2.083s tid=6 (× 2)
```
These are the canary-only `sub_825070F0` worker fan-out family +
`0x821c4ad0` renderer + `0x822c6870` audio classes that the AUDIT-049/
2.K/2.M lineage documents — ours never reaches the install epoch that
spawns them because ours wedges/budget-ends at host_ns=1.008s
(50M-instr budget cap), while the install epoch fires on canary at
host_ns≈9.4s per AUDIT-068. **Thread-set gap is the same as documented;
no new missing-thread class surfaced.**
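The thread-set gap is a multiset difference keyed on entry_pc (per tripstone #28, not on raw tids). A minimal sketch using only the addresses listed above, with canary's full spawn set reconstructed as ours's ten plus the canary-only entries:

```python
from collections import Counter

OURS = [0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
        0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0]

# Canary-only spawns with their multiplicities, from the list above.
CANARY_EXTRA = Counter({0x821c4ad0: 1, 0x822c6870: 2, 0x824563e0: 1,
                        0x82170430: 1, 0x823dde30: 1, 0x823ddb50: 2})

canary = Counter(OURS) + CANARY_EXTRA
missing = canary - Counter(OURS)  # Counter subtraction keeps positive counts only

print(sum(missing.values()), "threads,", len(missing), "distinct entry_pcs")
for pc, n in sorted(missing.items()):
    print(f"  0x{pc:08x} x{n}")
```

Using `Counter` rather than `set` preserves the `× 2` multiplicities, so the total matches the 8 additional threads stated above.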
## Comparison table — Phase C+23 baseline vs 2.N
| metric | Phase C+23 (2.J/2.L) | 2.N | delta |
|---|---|---|---|
| Main-chain matched prefix | 105,286 | **105,286** | **0** (bit-identical) |
| Main-chain first divergence kind | `payload.ord` (import.call) | `payload.ord` (import.call) | UNCHANGED |
| Main-chain divergence: canary fn | `VdGetCurrentDisplayGamma` (ord 441) | `VdGetCurrentDisplayGamma` (ord 441) | UNCHANGED |
| Main-chain divergence: ours fn | `KeAcquireSpinLockAtRaisedIrql` (ord 77) | `KeAcquireSpinLockAtRaisedIrql` (ord 77) | UNCHANGED |
| Cache-probe inversions (NtQueryFullAttributesFile) | 0 (closed by 2.I/2.J) | **0** | UNCHANGED |
| ours total events | 121,569 (2.J/2.M) | **121,569** | **0** (bit-identical) |
| ours thread.create count | 10 (2.J/2.M) | **10** | **0** (bit-identical) |
| ours wedge map size | 10 entries (2.M) | **10** | bit-identical to 2.M |
| ours blocked tids at PC=0x824ac578 | {1,13,4,5,3} (2.M) | **{1,13,4,5,3}** | bit-identical to 2.M |
| Categorized `[return_value mismatch]` count | n/a (pre-2.L) | **1** (sister chain tid=12→7) | newly visible |
| Categorized `[status mismatch]` count | n/a | 0 | — |
| Categorized `[args_resolved.* mismatch]` count | n/a | 0 | — |
| `exit-thread-state.json` auto-emit | n/a (pre-2.M) | **YES** (no flag) | infrastructure-new |
## Confidence
- **HIGH** that ours's 2.N trace is deterministic vs 2.M (121,569 events
bit-equal payload-wise; only host_ns and post-divergence guest_cycle
differ on 6 of 121,569 lines).
- **HIGH** that infrastructure 2.F/2.H/2.L/2.M all operate as designed
(gate table all-PASS).
- **HIGH** that matched-prefix length is 105,286 (categorized harness
output explicit, raw idx printed on both sides per 2.L's reading-error
#41 closure).
- **HIGH** that wedge geometry is unchanged from 2.M (bit-identical
`exit-thread-state.json`).
- **HIGH** that the main-chain divergence is a `payload.ord` (not a
return_value / status / args) class — i.e., the categorized harness
correctly does NOT misclassify it.
- **MEDIUM-HIGH** that the sister-chain `[return_value mismatch]` at
tid=12→7 idx=4 (KeWaitForSingleObject 258 vs 0) is a NEW finding
worth investigating. The categorized harness made this visible
at-a-glance for the first time. Pre-2.L it would have surfaced as a
generic `payload.return_value` line, not greppable as a return-value
class.
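The determinism claim in the first bullet amounts to comparing two runs line by line while ignoring timing fields. A minimal sketch, assuming `host_ns` and `guest_cycle` are top-level keys of each JSONL event (an assumption about the trace schema); the sample events are hypothetical.

```python
import json

VOLATILE = ("host_ns", "guest_cycle")  # field names taken from the bullet above

def stable_view(line: str) -> dict:
    """Parse one trace line and drop fields expected to vary between runs."""
    ev = json.loads(line)
    return {k: v for k, v in ev.items() if k not in VOLATILE}

def payload_diff_count(run_a, run_b) -> int:
    """Number of line positions whose non-volatile fields differ."""
    if len(run_a) != len(run_b):
        raise ValueError("runs have different event counts")
    return sum(stable_view(a) != stable_view(b) for a, b in zip(run_a, run_b))

# Hypothetical one-line traces that differ only in host_ns.
run_m = [json.dumps({"kind": "import.call", "host_ns": 10, "payload": {"ord": 441}})]
run_n = [json.dumps({"kind": "import.call", "host_ns": 12, "payload": {"ord": 441}})]
print(payload_diff_count(run_m, run_n))
```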
## Tripstone audit
- **#28** (cross-engine tid stability): comparisons keyed on entry_pc.
Main chain identified by (canary tid=6, ours tid=1) — these are
stable cross-engine identities established by the harness's
alignment, not by raw tid integers. Wedge map intra-run tids
acceptable (ours-only).
- **#39** (composite progression): recon class, NO progression claim
made. VdSwap count UNCHANGED (1), draw count UNCHANGED (0).
- **#40** (single-keystone framing): explicitly NOT proposing any fix.
This iterate verifies prior fixes are clean; it does NOT assert any
one-step cascade unblock.
- **#41** (silent test-harness state leak): CLOSED at the harness
output level by 2.L. Verified empirically — categorized tags emit on
return-value mismatch (sister chain tid=12→7 idx=4), and raw
per-tid idx surfaced on both sides of every divergence.
- **#42** (Phase-A blind to blocked-forever waits): CLOSED at the
output level by 2.M. Verified empirically — `exit-thread-state.json`
auto-emitted with full 13-thread snapshot + 10-entry wedge map. No
flag required; no manual diag dump needed.
## Next-iterate recommendation (single sentence, no fix proposal)
Two clean leads with the new visibility, in priority order:
**(1)** the sister-chain `[return_value mismatch]` at tid=12→7 idx=4
(KeWaitForSingleObject canary=258 STATUS_TIMEOUT vs ours=0 SUCCESS) is
brand-new actionable data the categorized harness uncovered — worth a
~0-LOC investigation iterate to localize which wait object differs and
why ours's signaler races canary's by < 30ms; **(2)** the main-chain
post-VdSwap branch at 105,286 is the same blocker as Phase C+23 and
remains the strategic target, but is downstream of the install-epoch
gap and likely needs the longer-budget replay (`-n 500000000`) plus the
install-chain investigation already outlined in 2.K/2.J writer-reports.
## Artifacts
Under `xenia-rs/audit-runs/iterate-2N-rebaseline/`:
- `ours-cold.jsonl` (Phase-A trace, 121,569 events, ~28MB, payload
bit-equal to 2.J/2.M)
- `ours-cold.stdout.log` (empty — quiet mode)
- `ours-cold.stderr.log` (single line: 2.M emission notice)
- `exit-thread-state.json` (13 threads + 10 wedge entries; bit-equal to
2.M's)
- `diff-report.md` (categorized harness output: 4 first-divergence
blocks, 1 `[return_value mismatch]` tag, all with raw idx both sides)
- `writer-report.md` (this file)
Canary trace REUSED (not re-captured):
`xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl`
(132MB, 565,773 events, cold-cache 2026-05-18 capture).