xenia-rs/audit-runs/iterate-2N-rebaseline/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Iterate 2.N — Clean re-baseline post-2.F/2.H/2.L/2.M (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **0**, canary **0**, tooling
**0**. Pure recon, no source modifications.
**Tests:** N/A (no code change).
**Cascade:** N/A — recon class per tripstone #39; #40 explicitly NOT
claiming any cascade fix.
## Headline
**BASELINE-CLEAN-DIVERGENCE-CHARACTERIZED.** Categorized first-divergence
detected on the main chain (canary tid=6 → ours tid=1) at matched-prefix
position **105,286**, **bit-identical to the Phase C+23 baseline** (was
105,286 before yesterday's fixes; remains 105,286 today). Engine fixes
2.F (VdSwap drain) + 2.H (vA0000000 bucket) + tooling fix 2.L
(categorized harness) + 2.M (always-on exit-state.json) all verified
operating as designed. ours wedge geometry **bit-identical to 2.K/2.M**
(`exit-thread-state.json` diff-clean vs 2.M). One previously-hidden
`[return_value mismatch]` surfaces on the sister chain
(canary tid=12 → ours tid=7) at idx=4 — KeWaitForSingleObject returns
`258` (STATUS_TIMEOUT) on canary vs `0` (SUCCESS) on ours, a
return-value class divergence that the categorized harness now flags
explicitly.
## Mode
Pure measurement, ZERO LOC change. Invocation (identical to 2.J/2.K/2.M):
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec \
  -n 50000000 --quiet \
  --phase-a-event-log audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
  "<iso>"
```
XDG cache directory `~/.local/share/xenia-rs/cache/` empty at run start
(belt-and-braces; `XENIA_CACHE_WIPE=1` already redirects to per-pid
tmpdir). Canary trace **reused** from
`phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl` (cold-cache
capture from 2026-05-18; matches the canonical Phase C+23 baseline used
by 2.J/2.L). No fresh canary run needed.
Categorized harness invocation:
```
python3 tools/diff-events/diff_events.py \
  --canary audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl \
  --ours audit-runs/iterate-2N-rebaseline/ours-cold.jsonl \
  --out audit-runs/iterate-2N-rebaseline/diff-report.md
```
## Infrastructure gate verification
| infrastructure | expectation | observed | result |
|---|---|---|---|
| **2.F** VdSwap drain (900ms → 1ms) | host_ns at idx 105,285 ≈ 0.67s (not 1.6s) | 0.670s | **PASS** |
| **2.H** vA0000000 physical heap bucket | `0xbe8cbb3c`, `0xbd184a40`, `0xbc6c5640` thread ctx_ptrs | identical 3 vA-bucket ctx_ptrs | **PASS** |
| **2.L** categorized tags surface | `[return_value mismatch]` / `[status mismatch]` / `[args_resolved.path mismatch]` greppable | 1 `[return_value mismatch]` on sister chain tid=12→7 | **PASS** |
| **2.L** raw idx surfaced both sides | `canary raw tid_event_idx=N, ours raw tid_event_idx=M` in report | present on every divergence line | **PASS** |
| **2.M** `exit-thread-state.json` auto-emit | sibling file in trace dir, no flag needed | 9651 bytes, 13 threads + 10 wedge entries | **PASS** |
| **2.M** stderr emission notice | `exit-thread-state: wrote 13 thread(s), 10 wedge entr(ies)` | identical line emitted | **PASS** |
All four infrastructure pieces (2.F, 2.H, 2.L, 2.M) work as designed across all six gates. No regression.
## Cascade questions (recon-only — no fix claim)
### (a) Has the matched prefix grown post-fixes?
**No — matched prefix is bit-identical to Phase C+23 baseline at
105,286.** This is the same prefix length 2.J/2.L/2.M produced. The
cache-wipe + physical-heap + VdSwap-drain fixes did NOT advance the
matched-prefix length on the main chain — they shifted *what's running*
within the prefix (cache-rebuild tid=4 from 160 → 2,075 events, wedge PC
geometry from 1.7s spin → 0.7s natural-end) but did not extend it. The
divergence at the post-VdSwap control-flow boundary
(`VdGetCurrentDisplayGamma` canary vs `KeAcquireSpinLockAtRaisedIrql`
ours) was NOT what cache-wipe / heap-bucket / VdSwap-drain were
addressing. Conclusion: Phase C+23 cap remains the next-frontier on the
main chain.
### (b) Is the first divergence still at idx 102424?
**No — 102,424 (the NtQueryFullAttributesFile cache-probe
SUCCESS/NO_SUCH_FILE inversion) is CLOSED. Main chain advances to
105,286.** All 8 ours-side `NtQueryFullAttributesFile` cache-probe
returns now equal `0xc000000f` (STATUS_NO_SUCH_FILE), bit-aligned with
canary's cold-cache returns (verified by enumerating every
`cache:\*` probe in ours-cold.jsonl). The 2.J finding holds: cache-state
parity restored, cache-probe inversion absent, harness correctly
advances to the next-downstream divergence.
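The cache-probe enumeration described above can be sketched as follows. This is a minimal illustration, not the project's tooling: the flat event keys `kind`, `name`, `path`, and `return_value` are assumptions made for the example, and the real Phase-A JSONL layout may nest them differently (for instance under `payload` or `args_resolved`).

```python
import json
from typing import Iterable, Iterator, Tuple

STATUS_NO_SUCH_FILE = 0xC000000F  # NTSTATUS value quoted in this report


def load_jsonl(path: str) -> Iterator[dict]:
    """Yield one decoded event per non-empty line of a JSONL trace."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def cache_probe_returns(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Yield (path, return_value) for every attribute probe on the cache device.

    Assumed (hypothetical) schema: flat keys kind / name / path / return_value.
    """
    for ev in events:
        if ev.get("kind") != "kernel.return":
            continue
        if ev.get("name") != "NtQueryFullAttributesFile":
            continue
        path = ev.get("path", "")
        if path.lower().startswith("cache:\\"):
            yield path, ev.get("return_value")


def summarize_probes(events: Iterable[dict]) -> Tuple[int, bool]:
    """Return (probe_count, True if every probe returned STATUS_NO_SUCH_FILE)."""
    probes = list(cache_probe_returns(events))
    return len(probes), all(rv == STATUS_NO_SUCH_FILE for _, rv in probes)
```

If the schema assumption held, `summarize_probes(load_jsonl("ours-cold.jsonl"))` would be expected to report 8 probes, all returning `0xc000000f`, for the run described here.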
### (c) What is the new first divergence's category + signature?
**Main chain (canary tid=6 → ours tid=1) at matched-prefix 105,286**
(canary raw idx=105298, ours raw idx=105286):
**`payload.ord` mismatch — NOT a categorized return_value / status /
args case.** Canary fires `import.call VdGetCurrentDisplayGamma`
(ord=441) immediately after `kernel.return VdSwap`; ours fires
`import.call KeAcquireSpinLockAtRaisedIrql` (ord=77) at the same
position. Different functions called from the same matched-prefix tail
= control-flow branch divergence inside the post-VdSwap guest code.
Pre-context (last 5 matching events):
`VdGetSystemCommandBuffer` (call+return) → `VdSwap` (import.call,
kernel.call, kernel.return). After `kernel.return VdSwap`, the two
engines branch.
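To make "matched prefix" and "categorized" concrete, here is a minimal sketch of a first-divergence search over two already-aligned per-thread event lists. It is not the real `diff_events.py`. The compared field names are assumptions inferred from the tags quoted in this report, and the Phase C+18 cross-tid SID absorption is not modeled.

```python
from typing import Optional, Sequence, Tuple

# Fields compared in order. Hypothetical flat names; the real trace may
# nest them (e.g. payload.ord, args_resolved.path).
COMPARED_FIELDS = ("kind", "ord", "name", "return_value", "status", "path")

# Fields that receive a greppable category tag; others stay uncategorized.
CATEGORY_TAGS = {
    "return_value": "[return_value mismatch]",
    "status": "[status mismatch]",
    "path": "[args_resolved.path mismatch]",
}


def first_divergence(
    canary: Sequence[dict], ours: Sequence[dict]
) -> Tuple[int, Optional[str]]:
    """Return (matched_prefix_length, first differing field or None)."""
    matched = 0
    for c, o in zip(canary, ours):
        for field in COMPARED_FIELDS:
            if c.get(field) != o.get(field):
                return matched, field
        matched += 1
    return matched, None


def describe(field: Optional[str]) -> str:
    """Map a differing field to a category tag or an uncategorized label."""
    if field is None:
        return "no divergence in common prefix"
    return CATEGORY_TAGS.get(field, f"payload.{field} mismatch (uncategorized)")
```

Under these assumptions, a pair of events that differ only in `ord` (441 on canary, 77 on ours) is reported as an uncategorized `payload.ord` mismatch, which is the behavior the report attributes to the harness for the main chain.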
**Sister chain (canary tid=12 → ours tid=7) at idx=4** (FIRST iteration
where 2.L's category tag actually fires):
**`[return_value mismatch] kernel.return name=KeWaitForSingleObject:
canary=258 ours=0`.** Canary returns `STATUS_TIMEOUT` (`0x102` = 258);
ours returns `SUCCESS` (0). Pre-context (idx 0-3 match exactly):
import.call + kernel.call `KeWaitForSingleObject` →
`handle.create` (different SIDs: canary `c49d8f0ab90401ea` vs ours
`9559797117e919f0`, but absorbed by Phase C+18 cross-tid SID matching) →
`wait.begin` with `timeout_ns=-30000000` (30ms relative wait, IDENTICAL
on both sides) → divergent return. This is the AUDIT-069 / phase-C+23
KeWaitForSingleObject timeout-encoding family but at a NEW position
the categorized harness now exposes. ours's wait returns SUCCESS where
canary times out, implying ours's wait object is signaled within the
30ms window where canary's is not — opposite of the audio underrun
class (#34/#35). Worth investigating in next iterate as a new lead.
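The numeric values in this finding decode as shown in the small sketch below. The status table holds only the three NTSTATUS codes that appear in this report. Treating a non-negative `timeout_ns` as absolute is an assumption borrowed from the NT timeout convention; the report itself only confirms that the negative value is a relative wait.

```python
# NTSTATUS codes that appear in this report (standard NT values).
NTSTATUS_NAMES = {
    0x00000000: "STATUS_SUCCESS",
    0x00000102: "STATUS_TIMEOUT",
    0xC000000F: "STATUS_NO_SUCH_FILE",
}


def describe_status(value: int) -> str:
    """Render a return value as 32-bit hex plus its NTSTATUS name, if known."""
    value &= 0xFFFFFFFF
    name = NTSTATUS_NAMES.get(value, "UNKNOWN")
    return f"0x{value:08x} ({name})"


def describe_timeout_ns(timeout_ns: int):
    """Return (kind, milliseconds) for a trace timeout given in nanoseconds.

    Negative means relative, as stated in the report; non-negative is
    assumed to mean absolute.
    """
    kind = "relative" if timeout_ns < 0 else "absolute"
    return kind, abs(timeout_ns) / 1_000_000


print(describe_status(258))            # 0x00000102 (STATUS_TIMEOUT)
print(describe_timeout_ns(-30000000))  # ('relative', 30.0)
```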
### (d) Does ours's exit-state show same 5 blocked tids at PC=0x824ac578?
**Yes — bit-identical to 2.M.** `diff -q
iterate-2N-rebaseline/exit-thread-state.json
iterate-2M-exit-state-dump/exit-thread-state.json` prints nothing
(the files do not differ). 13 alive threads, 10 wedge entries. Blocked tids at
PC `0x824ac578`: **tid 1, 13, 4, 5, 3** (same 5 as 2.K/2.M).
Wedge map:
```
tid=1 → Thread(id=13) (handle 0x000012c8, signaler=13 → circular)
tid=13 → Event(sig=false) (handle 0x000012d0, signaler=null)
tid=4 → Semaphore(0/2147483647) (handle 0x00001028, signaler=null = AUDIT-069 work-sem)
tid=5 → Event(sig=false) (handle 0x000012e4, signaler=null)
tid=3 → Event(sig=false) (handle 0x00001020, signaler=null)
tid=11 → Event(sig=false) × 2 (handles 0x828a3244, 0x828a3220)
tid=2 → Event(sig=false) (handle 0x8287093c)
tid=8 → Event(sig=false) (handle 0x000010ec)
tid=8 → Semaphore(0/2147483647) (handle 0x000010d8)
```
Wedge geometry stable across 2.M ↔ 2.N (deterministic).
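As a cross-check on the counts quoted in this section, the sketch below transcribes the wedge map into a list and summarizes it, alongside a byte-for-byte file comparison equivalent to `diff -q`. The tuple layout and helper names are illustrative only and do not reflect the actual `exit-thread-state.json` schema.

```python
import filecmp
from collections import Counter

# (waiting tid, object kind, handle), transcribed from the wedge map above.
WEDGE_MAP = [
    (1, "Thread", 0x000012C8),
    (13, "Event", 0x000012D0),
    (4, "Semaphore", 0x00001028),
    (5, "Event", 0x000012E4),
    (3, "Event", 0x00001020),
    (11, "Event", 0x828A3244),
    (11, "Event", 0x828A3220),
    (2, "Event", 0x8287093C),
    (8, "Event", 0x000010EC),
    (8, "Semaphore", 0x000010D8),
]


def files_identical(path_a: str, path_b: str) -> bool:
    """Byte-for-byte comparison, the Python analogue of `diff -q`."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def summarize_wedges(entries):
    """Return (entry_count, sorted distinct waiting tids, Counter of kinds)."""
    kinds = Counter(kind for _, kind, _ in entries)
    tids = sorted({tid for tid, _, _ in entries})
    return len(entries), tids, kinds


count, tids, kinds = summarize_wedges(WEDGE_MAP)
print(count, tids, dict(kinds))
```

The transcription yields 10 entries across 8 distinct waiting tids (7 events, 2 semaphores, 1 thread wait), which agrees with the 10 wedge entries reported.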
### (e) ours's thread set vs canary at same wallclock — what is missing?
ours's 10 thread.create entry_pcs:
```
0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0
```
Canary's spawns up to canary host_ns ≤ 1.698s (matched-prefix tail
+ 100ms slack) are the **SAME 8 entry_pcs in the SAME order** as ours's
first 8. Ours's 9th and 10th spawns happen slightly past the
matched-prefix-tail wallclock, and canary spawns those same two at
host_ns=1.897s/1.902s, also past its matched-prefix tail. At the
matched-prefix boundary the thread sets are therefore identical by
entry_pc. Canary then diverges by spawning **8 additional threads** over
the full 97s capture window:
```
0x821c4ad0 @ 1.924s tid=17
0x822c6870 @ 1.928s tid=18 (× 2)
0x824563e0 @ 2.050s tid=6 (spawned-by)
0x82170430 @ 2.064s tid=6
0x823dde30 @ 2.082s tid=6
0x823ddb50 @ 2.083s tid=6 (× 2)
```
These are the canary-only `sub_825070F0` worker fan-out family +
`0x821c4ad0` renderer + `0x822c6870` audio classes that the AUDIT-049/
2.K/2.M lineage documents — ours never reaches the install epoch that
spawns them because ours wedges/budget-ends at host_ns=1.008s
(50M-instr budget cap), while the install epoch fires on canary at
host_ns≈9.4s per AUDIT-068. **Thread-set gap is the same as documented;
no new missing-thread class surfaced.**
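The entry_pc comparison above can be reproduced from the values listed in this section. The sketch keys on entry_pc rather than raw tid (per tripstone #28) and uses a multiset difference so that the "× 2" rows count twice. The assumption that canary spawns ours's 9th and 10th entry_pcs in the same order is mine; the report only states that canary spawns "those same two".

```python
from collections import Counter

# ours's 10 thread.create entry_pcs, in spawn order (from the list above).
OURS_ENTRY_PCS = [
    0x82181830, 0x8245A5D0, 0x82450A28, 0x82457EF0, 0x824CD458,
    0x822F1EE0, 0x824D2878, 0x824D2940, 0x82178950, 0x821748F0,
]

# Canary-only spawns from the list above; rows marked "x 2" appear twice.
CANARY_EXTRA_ENTRY_PCS = [
    0x821C4AD0, 0x822C6870, 0x822C6870, 0x824563E0,
    0x82170430, 0x823DDE30, 0x823DDB50, 0x823DDB50,
]

# Assumed canary spawn order: the shared 10 first, then the extras.
CANARY_ENTRY_PCS = OURS_ENTRY_PCS + CANARY_EXTRA_ENTRY_PCS


def common_prefix_len(a, b) -> int:
    """Length of the longest common ordered prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def missing_in_ours(canary, ours) -> Counter:
    """Multiset of entry_pcs that canary spawns and ours does not."""
    return Counter(canary) - Counter(ours)


missing = missing_in_ours(CANARY_ENTRY_PCS, OURS_ENTRY_PCS)
print(sum(missing.values()), len(missing))  # 8 6
```

The result is 8 missing spawns across 6 distinct entry_pcs, matching the "8 additional threads" figure.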
## Comparison table — Phase C+23 baseline vs 2.N
| metric | Phase C+23 (2.J/2.L) | 2.N | delta |
|---|---|---|---|
| Main-chain matched prefix | 105,286 | **105,286** | **0** (bit-identical) |
| Main-chain first divergence kind | `payload.ord` (import.call) | `payload.ord` (import.call) | UNCHANGED |
| Main-chain divergence: canary fn | `VdGetCurrentDisplayGamma` (ord 441) | `VdGetCurrentDisplayGamma` (ord 441) | UNCHANGED |
| Main-chain divergence: ours fn | `KeAcquireSpinLockAtRaisedIrql` (ord 77) | `KeAcquireSpinLockAtRaisedIrql` (ord 77) | UNCHANGED |
| Cache-probe inversions (NtQueryFullAttributesFile) | 0 (closed by 2.I/2.J) | **0** | UNCHANGED |
| ours total events | 121,569 (2.J/2.M) | **121,569** | **0** (bit-identical) |
| ours thread.create count | 10 (2.J/2.M) | **10** | **0** (bit-identical) |
| ours wedge map size | 10 entries (2.M) | **10** | bit-identical to 2.M |
| ours blocked tids at PC=0x824ac578 | {1,13,4,5,3} (2.M) | **{1,13,4,5,3}** | bit-identical to 2.M |
| Categorized `[return_value mismatch]` count | n/a (pre-2.L) | **1** (sister chain tid=12→7) | newly visible |
| Categorized `[status mismatch]` count | n/a | 0 | — |
| Categorized `[args_resolved.* mismatch]` count | n/a | 0 | — |
| `exit-thread-state.json` auto-emit | n/a (pre-2.M) | **YES** (no flag) | infrastructure-new |
## Confidence
- **HIGH** that ours's 2.N trace is deterministic vs 2.M (121,569 events
bit-equal payload-wise; only host_ns and post-divergence guest_cycle
differ on 6 of 121,569 lines).
- **HIGH** that infrastructure 2.F/2.H/2.L/2.M all operate as designed
(gate table all-PASS).
- **HIGH** that matched-prefix length is 105,286 (categorized harness
output explicit, raw idx printed on both sides per 2.L's reading-error
#41 closure).
- **HIGH** that wedge geometry is unchanged from 2.M (bit-identical
`exit-thread-state.json`).
- **HIGH** that the main-chain divergence is a `payload.ord` (not a
return_value / status / args) class — i.e., the categorized harness
correctly does NOT misclassify it.
- **MEDIUM-HIGH** that the sister-chain `[return_value mismatch]` at
tid=12→7 idx=4 (KeWaitForSingleObject 258 vs 0) is a NEW finding
worth investigating. The categorized harness made this visible
at-a-glance for the first time. Pre-2.L it would have surfaced as a
generic `payload.return_value` line, not greppable as a return-value
class.
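The determinism claim in the first bullet amounts to comparing two runs line by line while ignoring timing fields. A minimal sketch of that check follows. The flat `host_ns` and `guest_cycle` keys are assumed for illustration, and the real events may carry these fields elsewhere.

```python
from typing import Iterable, Tuple

# Timing fields expected to vary between runs (hypothetical flat keys).
VOLATILE_FIELDS = ("host_ns", "guest_cycle")


def strip_volatile(event: dict) -> dict:
    """Copy an event without its run-to-run timing fields."""
    return {k: v for k, v in event.items() if k not in VOLATILE_FIELDS}


def compare_runs(run_a: Iterable[dict], run_b: Iterable[dict]) -> Tuple[int, int]:
    """Return (lines differing raw, lines differing after stripping timing).

    Both runs are assumed to have the same number of events; zip() stops
    at the shorter one, so compare lengths separately.
    """
    raw_diff = payload_diff = 0
    for a, b in zip(run_a, run_b):
        if a != b:
            raw_diff += 1
            if strip_volatile(a) != strip_volatile(b):
                payload_diff += 1
    return raw_diff, payload_diff
```

A result whose second element is 0 corresponds to "bit-equal payload-wise": any remaining raw differences are confined to the timing fields.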
## Tripstone audit
- **#28** (cross-engine tid stability): comparisons keyed on entry_pc.
Main chain identified by (canary tid=6, ours tid=1) — these are
stable cross-engine identities established by the harness's
alignment, not by raw tid integers. Wedge map intra-run tids
acceptable (ours-only).
- **#39** (composite progression): recon class, NO progression claim
made. VdSwap count UNCHANGED (1), draw count UNCHANGED (0).
- **#40** (single-keystone framing): explicitly NOT proposing any fix.
This iterate verifies prior fixes are clean; it does NOT assert any
one-step cascade unblock.
- **#41** (silent test-harness state leak): CLOSED at the harness
output level by 2.L. Verified empirically — categorized tags emit on
return-value mismatch (sister chain tid=12→7 idx=4), and raw
per-tid idx surfaced on both sides of every divergence.
- **#42** (Phase-A blind to blocked-forever waits): CLOSED at the
output level by 2.M. Verified empirically — `exit-thread-state.json`
auto-emitted with full 13-thread snapshot + 10-entry wedge map. No
flag required; no manual diag dump needed.
## Next-iterate recommendation (single sentence, no fix proposal)
Two clean leads with the new visibility, in priority order:
**(1)** the sister-chain `[return_value mismatch]` at tid=12→7 idx=4
(KeWaitForSingleObject canary=258 STATUS_TIMEOUT vs ours=0 SUCCESS) is
brand-new actionable data the categorized harness uncovered — worth a
~0-LOC investigation iterate to localize which wait object differs and
why ours's signaler races canary's by < 30ms; **(2)** the main-chain
post-VdSwap branch at 105,286 is the same blocker as Phase C+23 and
remains the strategic target, but is downstream of the install-epoch
gap and likely needs the longer-budget replay (`-n 500000000`) plus the
install-chain investigation already outlined in 2.K/2.J writer-reports.
## Artifacts
Under `xenia-rs/audit-runs/iterate-2N-rebaseline/`:
- `ours-cold.jsonl` (Phase-A trace, 121,569 events, ~28MB, payload
bit-equal to 2.J/2.M)
- `ours-cold.stdout.log` (empty — quiet mode)
- `ours-cold.stderr.log` (single line: 2.M emission notice)
- `exit-thread-state.json` (13 threads + 10 wedge entries; bit-equal to
2.M's)
- `diff-report.md` (categorized harness output: 4 first-divergence
blocks, 1 `[return_value mismatch]` tag, all with raw idx both sides)
- `writer-report.md` (this file)
Canary trace REUSED (not re-captured):
`xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/canary-cold-trunc.jsonl`
(132MB, 565,773 events, cold-cache 2026-05-18 capture).