handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,262 @@
# Iterate 2.J — Cache-wipe replay (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **0**, canary **0**. Pure
test-harness parity measurement (no code change).
**Tests:** N/A (no source modifications).
## Headline
**WEDGE-MOVED.** Primary gate **PASS**: 2.J's `NtQueryFullAttributesFile`
cache-probe calls now return `0xc000000f` (`STATUS_NO_SUCH_FILE`) for all
9 `cache:\*` paths, matching canary's cold-cache baseline (iterate 2.I
documented ours returning `STATUS_SUCCESS` for the same paths in 2.H —
the inversion identified there is closed by the env-var fix). Cascade is
**partial**: tid=4 (cache-rebuild worker) explodes from 160 → 2,075
events (~13×, +97% NtCreateFile/NtOpenFile/NtWriteFile to `cache:\` and
`cache:\<bucket>\<x>\<file>.tmp`); total event count 118,149 → 121,569
(+3,420, +2.9%); tid=1 wedge geometry changed (last `guest_cycle`
9,140,200 → 9,169,116, +28,916 cycles). VdSwap count unchanged (1
swap); thread set still 10 entries (no new spawns); `sub_824F8398` /
`sub_825070F0` still 0 fires. Cache-divergence is real and now closed,
but it was not the keystone for the AUDIT-068 install chain.
## Mode
Pure measurement, ZERO LOC change. Invocation:
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec -n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2J-cache-wipe-replay/ours-cold.jsonl \
"<iso>"
```
Identical to iterate-2H invocation, with `XENIA_CACHE_WIPE=1` prepended.
Belt-and-braces: also `rm -rf /home/fabi/.local/share/xenia-rs/cache/`
before run (backup at `/tmp/xenia-rs-cache-pre-2J-backup-*`).
## Cache wipe mechanism (verified)
From `xenia-rs/crates/xenia-kernel/src/state.rs:1837-1893`
(`resolve_default_cache_root`): `XENIA_CACHE_WIPE=1` redirects
`cache_root` to a per-process tmpdir at
`$TMPDIR/xenia-rs-cache-<pid>-<n>` AND returns `wipe=true`, which makes
`init_cache_root` (state.rs:728-758) do the clear-then-recreate dance.
This properly isolates ours from any pre-existing XDG cache. No
separate binary/JIT cache exists in this codebase
(only XDG cache at `$HOME/.local/share/xenia-rs/cache/`).
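The redirect-and-wipe behavior described above can be modeled in a few lines. This is a minimal Python sketch of the described logic, not the Rust code in `state.rs`; the function signatures, the meaning of `n` (assumed to be a per-process counter), and the non-wipe branch simply creating the directory are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path

def resolve_cache_root(env, pid, n):
    """Return (cache_root, wipe) following the behavior described in the report."""
    if env.get("XENIA_CACHE_WIPE") == "1":
        # Per-process directory under the temp dir; the XDG cache is never touched.
        tmp = Path(env.get("TMPDIR", tempfile.gettempdir()))
        return tmp / f"xenia-rs-cache-{pid}-{n}", True
    # Default: the persistent XDG cache location named in the report.
    return Path(env["HOME"]) / ".local/share/xenia-rs/cache", False

def init_cache_root(root, wipe):
    """Clear-then-recreate when wipe is set; otherwise only ensure the dir exists."""
    if wipe and root.exists():
        shutil.rmtree(root)
    root.mkdir(parents=True, exist_ok=True)
    return root

# Usage with a throwaway TMPDIR so nothing real is deleted.
with tempfile.TemporaryDirectory() as td:
    root, wipe = resolve_cache_root({"XENIA_CACHE_WIPE": "1", "TMPDIR": td}, pid=1234, n=0)
    init_cache_root(root, wipe)
```

Because the wiped root lives under the temp dir, the manual `rm -rf` of the XDG cache is redundant under this model, which is consistent with it being described as belt-and-braces.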
## Primary gate result — cache-probe return values
**PASS (9/9).** Every `NtQueryFullAttributesFile` call on a `cache:\*`
path in 2.J returns `0xc000000f` (`STATUS_NO_SUCH_FILE`). The first
divergence flagged by iterate 2.I (idx 102423,
`cache:\d4ea4615\e\46ee8ca`, ours `STATUS_SUCCESS` vs canary
`STATUS_NO_SUCH_FILE`) is now bit-aligned with canary's cold-cache
return.
Cache-probe paths and 2.J returns:
| tid_event_idx | path | 2.J status | canary baseline status |
|---|---|---|---|
| 102423 | `cache:\d4ea4615\e\46ee8ca` | `0xc000000f` | `0xc000000f` |
| 103840 | `cache:\69d8e45c\8\3421153` | `0xc000000f` | `0xc000000f` |
| 103996 | `cache:\69d8e45c\9\355f2f8` | `0xc000000f` | `0xc000000f` |
| 104453 | `cache:\69d8e45c\e\534ffea` | `0xc000000f` | `0xc000000f` |
| 105477 | `cache:\aab216c3\a\2c8c185` | `0xc000000f` | `0xc000000f` |
| 105792 | `cache:\69d8e45c\9\73a5c0a` | `0xc000000f` | `0xc000000f` |
| 106228 | `cache:\69d8e45c\9\39a9dcc` | `0xc000000f` | `0xc000000f` |
| (+others) | `cache:\aab216c3\5\ee70e0a` | `0xc000000f` | `0xc000000f` |

`cache:\` root open and `cache:\access`/`cache:\ignore`/`cache:\recent`
metadata probes also align with canary's cold-cache behavior.
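The primary gate can be re-checked mechanically from the event log. A minimal sketch, assuming each JSONL line is one event object carrying `name`, `args_resolved.path`, and `status` (as an int or a hex string); the exact schema and the helper names are assumptions, not the harness's API.

```python
import json

STATUS_NO_SUCH_FILE = 0xC000000F

def parse_status(value):
    """Accept either an int or a hex string such as '0xc000000f'."""
    return value if isinstance(value, int) else int(str(value), 16)

def cache_probe_gate(lines):
    """Return (passed, failures) for cache-path attribute probes in a JSONL stream."""
    failures = []
    probes = 0
    for line in lines:
        if not line.strip():
            continue
        ev = json.loads(line)
        # Only events that carry a status (the return side) are checked.
        if ev.get("name") != "NtQueryFullAttributesFile" or "status" not in ev:
            continue
        path = ev.get("args_resolved", {}).get("path", "")
        if not path.startswith("cache:\\"):
            continue
        probes += 1
        if parse_status(ev["status"]) != STATUS_NO_SUCH_FILE:
            failures.append((path, ev["status"]))
    # An empty probe set is treated as a failure, not a vacuous pass.
    return probes > 0 and not failures, failures

# Synthetic sample event, shaped after the first row of the table above.
sample = [json.dumps({
    "name": "NtQueryFullAttributesFile",
    "args_resolved": {"path": "cache:\\d4ea4615\\e\\46ee8ca"},
    "status": "0xc000000f",
})]
print(cache_probe_gate(sample))
```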
## Secondary cascade gate results
### (a) tid=1 last timestamp
- **2.H**: cycle=9,140,200 / host_ns=792,522,910 (NtWaitForSingleObjectEx return)
- **2.J**: cycle=9,169,116 / host_ns=749,717,731 (NtWaitForSingleObjectEx return)
- Delta: **+28,916 cycles** on tid=1 (continued progression). The host_ns
decrease (~43ms) is mechanical: 2.H spent that host wallclock spinning
at the wedge during the last few hundred matched events, whereas 2.J
spent its instruction budget on cache-rebuild work and so accumulated
less host-side spin time. Both runs end on the 50M-instr budget, not
on a wedge.
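The two deltas follow directly from the figures quoted above; a short worked check (values copied from the bullets, variable names are illustrative):

```python
# Last tid=1 event in each run, as reported above.
runs = {
    "2.H": {"cycle": 9_140_200, "host_ns": 792_522_910},
    "2.J": {"cycle": 9_169_116, "host_ns": 749_717_731},
}

cycle_delta = runs["2.J"]["cycle"] - runs["2.H"]["cycle"]
host_delta_ms = (runs["2.J"]["host_ns"] - runs["2.H"]["host_ns"]) / 1e6

print(cycle_delta)              # 28916
print(round(host_delta_ms, 1))  # -42.8
```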
### (b) Wedge PC
Per the prompt, the 2.F+2.I wedge target was tid=1 PC `0x824ac578` (the
`bl 0x8284E02C` NtWaitForSingleObjectEx with timeout=-1 on thread
handle `0x1210`). 2.J's tail shows tid=1 executing many `NtWait...`
calls past that wedge that **return success** (`return_value=0`,
`status=0x00000000`), not timeout. The wait wrapper is no longer
parked. The 50M-instr run terminates with all 14 tids in returning
`NtWait...` calls, not in blocked waits. **WEDGE-MOVED** (or possibly
absent within this instruction budget — would need a longer run to
distinguish).
### (c) `sub_824F8398` fires?
**0 fires.** Grep for `824f8398` across the full ours-cold.jsonl: zero
hits. The AUDIT-068 ctx-installer chain (`sub_824F8398 →
sub_824F7CD0 → sub_824F7800 → sub_824FD240+0x24`) is **still upstream
of the boot window** ours reaches in 50M instructions. Per canary
baseline this fires at host_ns≈9.4s; ours reaches host_ns≈759ms.
### (d) `sub_825070F0` fires?
**0 fires.** The post-VdSwap worker fan-out is still absent. Same
mechanism as (c) — downstream of an install chain that ours doesn't
reach inside the budget.
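The 0-fire checks in (c) and (d) amount to a case-insensitive substring count over the event log. A minimal sketch, assuming the log stores addresses as hex text; `count_hits` is a hypothetical helper and the sample lines (including the `lr` field) are synthetic.

```python
def count_hits(lines, needles):
    """Count lines containing each needle, ignoring case."""
    hits = {n: 0 for n in needles}
    for line in lines:
        low = line.lower()
        for n in needles:
            if n.lower() in low:
                hits[n] += 1
    return hits

# Synthetic lines for illustration only; not taken from ours-cold.jsonl.
sample = [
    '{"kind": "kernel.call", "name": "NtWaitForSingleObjectEx", "lr": "0x824AC57C"}',
    '{"kind": "kernel.return", "name": "NtWaitForSingleObjectEx", "status": "0x00000000"}',
]
print(count_hits(sample, ["824f8398", "825070f0"]))
```

A zero count is only meaningful if addresses really are serialized as hex strings; if any field stored them as decimal integers, a substring search would miss them.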
### (e) Thread set / spawn count
**10 thread.create entries (unchanged from 2.H).** The new
entry_pc list is bit-identical to 2.H:
```
0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0
```
Canary tids 15/27/28 worker analogs still **absent**. ctx_ptr columns
bit-stable vs 2.H (vA0000000 bucket fix retained):
`0xbe8cbb3c`, `0xbd184a40`, `0xbc6c5640`. Per tripstone #28, comparison
is keyed on entry_pc, not integer tid.
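Keying the comparison on `entry_pc` rather than integer tid reduces to a set difference. A minimal sketch using the ten entry PCs listed above; `thread_set_diff` is a hypothetical helper name, and no canary entry PCs are included because the report does not list them.

```python
def thread_set_diff(ours_entry_pcs, reference_entry_pcs):
    """Compare two runs by thread entry PC, ignoring tid numbering."""
    ours, ref = set(ours_entry_pcs), set(reference_entry_pcs)
    return {"missing": sorted(ref - ours), "extra": sorted(ours - ref)}

# Entry PCs of the 10 thread.create events reported for 2.J.
run_2j = [
    0x82181830, 0x8245A5D0, 0x82450A28, 0x82457EF0, 0x824CD458,
    0x822F1EE0, 0x824D2878, 0x824D2940, 0x82178950, 0x821748F0,
]
run_2h = list(run_2j)  # the report states the 2.H list is bit-identical

print(thread_set_diff(run_2j, run_2h))
```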
### (f) Total event count
**118,149 → 121,569 (+3,420, +2.9%).** The increment is concentrated on
the cache-rebuild worker (tid=4: 160 → 2,075 events, +1,915 = ~56% of
the delta).
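The per-thread share of the growth can be reproduced from per-tid event counts. A minimal sketch, assuming each JSONL event carries a `tid` field (an assumption about the schema); the counters below are built from the totals quoted in this report, with all non-tid=4 events lumped under a placeholder key.

```python
import json
from collections import Counter

def events_per_tid(lines):
    """Count events per tid in a JSONL stream."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["tid"]] += 1
    return counts

def delta_share(before, after, tid):
    """Fraction of the total event-count growth attributable to one tid."""
    total_delta = sum(after.values()) - sum(before.values())
    return (after[tid] - before[tid]) / total_delta

# Totals from the report; "other" is a placeholder bucket, not a real tid.
before = Counter({4: 160, "other": 118_149 - 160})
after = Counter({4: 2_075, "other": 121_569 - 2_075})

print(round(delta_share(before, after, 4) * 100))  # 56
```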
### (g) Missing (op, lr) tuples (iterate-2D method)
**Not re-measured.** Phase-A `--phase-a-event-log` capture does not feed
the 2.D diff pipeline (which consumes `--lr-trace` of IAT thunks at
`0x8284DDDC/E49C/DF5C/E07C`). 2.H report noted the same restriction.
Expected unchanged at 28/28 — the producer LRs that fire in canary
target downstream worker classes (`sub_825070F0` fan-out) that ours
still doesn't reach. Re-running 2.D requires a separate capture mode.
### (h) VdSwap count
**1 swap unchanged** (3 events = import.call + kernel.call + kernel.return
for the same single VdSwap call at cycle=5,577,303 / host_ns=489.2ms).
Per tripstone #39: gameplay-level progression (swaps > 1 or draws > 0)
NOT achieved. The 2.J run still ends (on the 50M-instr budget) before a second swap.
### (i) Draw count
**0 draws.** No `*Draw*` kernel-call names emitted (consistent with
VdSwap=1: pre-gameplay).
## Cascade roll-up
| gate | description | 2.H | 2.J | result |
|------|-------------|-----|-----|--------|
| PRIMARY | cache-probe `0xc000000f` matches canary | FAIL (returns SUCCESS) | PASS (9/9) | **PASS** |
| (a) tid=1 last cycle | progression | 9,140,200 | 9,169,116 | +28,916 |
| (b) wedge PC `0x824ac578` parked | wait timeout=-1 | parked | NtWait returns 0 | **MOVED** |
| (c) `sub_824F8398` fires | install chain | 0 | 0 | UNCHANGED |
| (d) `sub_825070F0` fires | fan-out | 0 | 0 | UNCHANGED |
| (e) thread set size | spawns | 10 entries | 10 entries | UNCHANGED |
| (f) total event count | volume | 118,149 | 121,569 | +2.9% |
| (g) missing-tuple count | 2.D diff | 28 | n/a (different capture) | NOT-MEASURED |
| (h) VdSwap count | gameplay swaps | 1 | 1 | UNCHANGED |
| (i) draws | gameplay draws | 0 | 0 | UNCHANGED |

**Outcome class: WEDGE-MOVED.** Primary gate fully passes. tid=1 wedge
geometry moved (wait now returns success). Cache-rebuild worker tid=4
springs into life (~13× event growth). But the deeper install chain
(`sub_824F8398` / `sub_825070F0`) remains downstream of the 50M-instr
budget; gameplay-level progression (VdSwap > 1, draws > 0) NOT achieved.
## What changed and why
The 2.I diagnosis was correct in its mechanism but only partially
correct in its prediction:
- **Mechanism correct**: ours's cache contained 9 files from previous
runs (276K total). `NtQueryFullAttributesFile` returned
`STATUS_SUCCESS` for files that should be missing on a cold boot.
Canary's capture protocol wipes both XDG and binary caches; ours's
warm-cache state put the engine on a cache-HIT replay branch instead
of cache-MISS reconstruction. tid=4 was hardly doing anything in 2.H
because the cache already existed. In 2.J it actively rebuilds the
cache (36 NtCreateFile, 24 NtOpenFile, 19 NtWriteFile to `*.tmp`
files and bucket directories).
- **Prediction partial**: closing the cache-state divergence did unblock
one wait wrapper (the previously-parked `0x824ac578` wait now returns
success), but did NOT cascade through to the
`sub_824F8398` install chain or `sub_825070F0` worker fan-out. The
install epoch on canary fires at host_ns≈9.4s; ours's 50M-instr run
ends at host_ns≈760ms. The wedge moved earlier, but the canary
trajectory is still ~12× further along in wallclock when its install
chain fires.
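The ~12× figure is the ratio of the two host timestamps quoted above. The sketch below also extrapolates an instruction budget from it, which rests on an assumption the report does not make: that host time grows linearly with instruction count and that the two engines' host clocks are comparable.

```python
canary_install_epoch_s = 9.4  # host time at which canary's install chain fires
ours_run_end_s = 0.760        # host time at which ours's 50M-instr run ends
budget_instr = 50_000_000

ratio = canary_install_epoch_s / ours_run_end_s

# ASSUMPTION: host time scales linearly with instructions executed.
linear_budget = budget_instr * ratio

print(round(ratio, 1))             # 12.4
print(round(linear_budget / 1e6))  # 618
```

Under that assumption, matching the canary epoch would take roughly 618M instructions. This is a rough bound only, since spin-wait time and cache-rebuild work do not scale uniformly.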
## Tripstone audit
- **#28** (per-engine tid stability): All cross-engine comparisons are
keyed on `entry_pc` and first-kernel-call signature, never on integer
tid. The "tid=1 wedge" / "tid=4 cache rebuild" identities are
ours-internal and stable across 2.H ↔ 2.J because both runs are
ours-side (deterministic scheduler).
- **#39** (composite progression): The headline does NOT claim "gameplay
progression" — VdSwap count unchanged at 1, draws unchanged at 0. The
PRIMARY-gate PASS is a **structural / state-parity** claim (cache
state matches canary baseline). The secondary observation ("tid=1 wedge
geometry MOVED") is reported with both improving (cycle +28,916) and
ambiguous (host_ns shifted backward due to less spin-wait) evidence.
- **#40** (single-keystone framing): The 2.I prompt framing
"cache-wipe single test-harness parity fix may unblock the wedge"
is **partially falsified**. Cache-state IS load-bearing (one wedge
moved, +3,420 events, tid=4 came alive) but is NOT the keystone for
the AUDIT-068 install chain (`sub_824F8398` still 0 fires). The
iterate 2.E reading-error #40 class ("single-keystone framing
falsified") REPEATS here. Recommend explicitly registering reading
error #41: **state-parity gate PASS does not imply cascade — even
bit-identical input state can land on different trajectories when
~12× wallclock separates the install epochs**.
## Confidence
- **HIGH** that primary gate genuinely passes (all 9 cache-probe paths
bit-aligned with canary).
- **HIGH** that tid=4 cache-rebuild work is the bulk of the +3,420
event delta (cache file I/O directly visible in args_resolved.path).
- **HIGH** that the wedge moved (NtWait at `0x824ac578` no longer
parked).
- **HIGH** that `sub_824F8398` / `sub_825070F0` still 0 fires
(confirmed by multiple independent greps).
- **MEDIUM** that the next blocker is "longer instruction budget +
install chain investigation" vs "additional state-parity divergence
upstream of install epoch". Both classes remain candidates.
## Next iterate recommendation
**Iterate 2.K should be one of:**
1. **Longer-budget replay (~0 LOC).** Re-run 2.J with `-n 500000000`
(10× budget, ~60s wallclock estimate) to push past host_ns≈9.4s and
see if the AUDIT-068 install chain fires naturally now that the
cache-state divergence is closed. If `sub_824F8398` fires in the
longer run, the cascade IS following, just at a slower wallclock pace.
If it still doesn't, there's a second state-parity divergence to find.
2. **Replay-then-replay determinism check (~0 LOC).** Run 2.J twice
back-to-back with `XENIA_CACHE_WIPE=1` and verify the second run
produces identical (or near-identical) event count + same tid=4
work pattern. Cross-check that the persistent-cache path doesn't
contaminate state between runs.
3. **2.I-style arg-diff at the NEW first-divergence (~50-100 LOC).**
2.I's diff harness was keyed on (kind, name, ord) only and missed
the return-value divergence. Now that those return values align,
re-run the diff to find the NEXT cross-engine first-divergence in
args_resolved or side_effects within the 0-1s window. Likely
reveals what state-parity divergence (if any) blocks the install
chain from firing earlier on ours.
Recommended priority: **(1) first** (zero LOC, ~5 min, decisive),
then **(3)** if (1) shows no install-chain fire.
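The determinism check in option 2 can be sketched as a multiset comparison of two event logs with wallclock fields stripped. This assumes each JSONL line is one event object and that `host_ns` is the only run-varying field; both are assumptions, and `fingerprint` / `compare_runs` are hypothetical helper names.

```python
import json
from collections import Counter

VOLATILE = {"host_ns"}  # wallclock fields expected to differ between runs

def fingerprint(lines):
    """Multiset of events with volatile fields removed."""
    bag = Counter()
    for line in lines:
        if line.strip():
            ev = json.loads(line)
            for key in VOLATILE:
                ev.pop(key, None)
            # sort_keys gives a canonical form regardless of field order.
            bag[json.dumps(ev, sort_keys=True)] += 1
    return bag

def compare_runs(a_lines, b_lines):
    """Count events present in one run but not the other."""
    a, b = fingerprint(a_lines), fingerprint(b_lines)
    return {"only_in_a": sum((a - b).values()), "only_in_b": sum((b - a).values())}

# Usage against two captures (file names are placeholders):
# with open("run1.jsonl") as f1, open("run2.jsonl") as f2:
#     print(compare_runs(f1, f2))
```

The comparison ignores event order on purpose, so it tolerates harmless interleaving differences while still catching changes in total count or in the tid=4 work pattern.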
## Artifacts
Under `xenia-rs/audit-runs/iterate-2J-cache-wipe-replay/`:
- `ours-cold.jsonl` (121,569 events, 50M-instr run, cache-wiped boot,
~28MB)
- `ours-cold.stdout.log` / `ours-cold.stderr.log` (empty — quiet mode)
- `writer-report.md` (this file)
Backup of pre-wipe XDG cache:
`/tmp/xenia-rs-cache-pre-2J-backup-<timestamp>` (276K, 9 files).