xenia-rs/audit-runs/iterate-2J-cache-wipe-replay/writer-report.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Iterate 2.J — Cache-wipe replay (writer report)
**Date:** 2026-05-28. **LOC delta:** engine **0**, canary **0**. Pure
test-harness parity measurement (no code change).
**Tests:** N/A (no source modifications).
## Headline
**WEDGE-MOVED.** Primary gate **PASS**: 2.J's `NtQueryFullAttributesFile`
cache-probe calls now return `0xc000000f` (`STATUS_NO_SUCH_FILE`) for all
9 `cache:\*` paths, matching canary's cold-cache baseline (iterate 2.I
documented ours returning `STATUS_SUCCESS` for the same paths in 2.H —
the inversion identified there is closed by the env-var fix). Cascade is
**partial**: tid=4 (cache-rebuild worker) explodes from 160 → 2,075
events (~13×, +97% NtCreateFile/NtOpenFile/NtWriteFile to `cache:\` and
`cache:\<bucket>\<x>\<file>.tmp`); total event count 118,149 → 121,569
(+3,420, +2.9%); tid=1 wedge geometry changed (last `guest_cycle`
9,140,200 → 9,169,116, +28,916 cycles). VdSwap count unchanged (1
swap); thread set still 10 entries (no new spawns); `sub_824F8398` /
`sub_825070F0` still 0 fires. Cache-divergence is real and now closed,
but it was not the keystone for the AUDIT-068 install chain.
## Mode
Pure measurement, ZERO LOC change. Invocation:
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec -n 50000000 --quiet \
--phase-a-event-log audit-runs/iterate-2J-cache-wipe-replay/ours-cold.jsonl \
"<iso>"
```
Identical to iterate-2H invocation, with `XENIA_CACHE_WIPE=1` prepended.
Belt-and-braces: also `rm -rf /home/fabi/.local/share/xenia-rs/cache/`
before run (backup at `/tmp/xenia-rs-cache-pre-2J-backup-*`).
## Cache wipe mechanism (verified)
From `xenia-rs/crates/xenia-kernel/src/state.rs:1837-1893`
(`resolve_default_cache_root`): `XENIA_CACHE_WIPE=1` redirects
`cache_root` to a per-process tmpdir at
`$TMPDIR/xenia-rs-cache-<pid>-<n>` AND returns `wipe=true`, which makes
`init_cache_root` (state.rs:728-758) do the clear-then-recreate dance.
This properly isolates ours from any pre-existing XDG cache. No
separate binary/JIT cache exists in this codebase
(only XDG cache at `$HOME/.local/share/xenia-rs/cache/`).
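The behaviour described above can be sketched as follows. This is a Python rendering of the prose description only, not the Rust source in `state.rs`; the function names are reused for readability, and the counter that supplies the `<n>` suffix is a hypothetical stand-in, since the report does not say how `<n>` is chosen.

```python
import os
import shutil
import tempfile
from itertools import count

_suffix = count()  # hypothetical source of the <n> suffix; not taken from state.rs


def resolve_cache_root(env, default_root):
    """Return (cache_root, wipe) following the behaviour described in the report."""
    if env.get("XENIA_CACHE_WIPE") == "1":
        tmp = env.get("TMPDIR") or tempfile.gettempdir()
        name = f"xenia-rs-cache-{os.getpid()}-{next(_suffix)}"
        return os.path.join(tmp, name), True
    return default_root, False


def init_cache_root(root, wipe):
    """Clear-then-recreate when wipe is set; otherwise only ensure the dir exists."""
    if wipe and os.path.isdir(root):
        shutil.rmtree(root)
    os.makedirs(root, exist_ok=True)
    return root
```

Because the wiped root lives under a per-process temporary directory, stale files in the default cache location cannot leak into the run even if the manual `rm -rf` is skipped.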
## Primary gate result — cache-probe return values
**PASS (9/9).** Every `NtQueryFullAttributesFile` call on a `cache:\*`
path in 2.J returns `0xc000000f` (`STATUS_NO_SUCH_FILE`). The first
divergence flagged by iterate 2.I (idx 102423,
`cache:\d4ea4615\e\46ee8ca`, ours `STATUS_SUCCESS` vs canary
`STATUS_NO_SUCH_FILE`) is now bit-aligned with canary's cold-cache
return.
Cache-probe paths and 2.J returns:
| tid_event_idx | path | 2.J status | canary baseline status |
|---|---|---|---|
| 102423 | `cache:\d4ea4615\e\46ee8ca` | `0xc000000f` | `0xc000000f` |
| 103840 | `cache:\69d8e45c\8\3421153` | `0xc000000f` | `0xc000000f` |
| 103996 | `cache:\69d8e45c\9\355f2f8` | `0xc000000f` | `0xc000000f` |
| 104453 | `cache:\69d8e45c\e\534ffea` | `0xc000000f` | `0xc000000f` |
| 105477 | `cache:\aab216c3\a\2c8c185` | `0xc000000f` | `0xc000000f` |
| 105792 | `cache:\69d8e45c\9\73a5c0a` | `0xc000000f` | `0xc000000f` |
| 106228 | `cache:\69d8e45c\9\39a9dcc` | `0xc000000f` | `0xc000000f` |
| (+others) | `cache:\aab216c3\5\ee70e0a` | `0xc000000f` | `0xc000000f` |
`cache:\` root open and `cache:\access`/`cache:\ignore`/`cache:\recent`
metadata probes also align with canary's cold-cache behavior.
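A gate check of this kind could be scripted along the following lines. This is a minimal sketch: the field names `kind`, `name`, `status`, and `args_resolved.path` appear in this report, but their exact layout in the JSONL (for example, `status` being a hex string on `kernel.return` events) is an assumption here.

```python
import json

STATUS_NO_SUCH_FILE = 0xC000000F


def load_events(jsonl_path):
    """Yield one dict per non-empty JSONL line."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def cache_probe_statuses(events):
    """Yield (path, status) for attribute probes on cache: paths."""
    for ev in events:
        if ev.get("kind") != "kernel.return":
            continue
        if ev.get("name") != "NtQueryFullAttributesFile":
            continue
        path = ev.get("args_resolved", {}).get("path", "")
        if path.lower().startswith("cache:\\"):
            # Assumption: status is serialized as a hex string such as "0xc000000f".
            yield path, int(ev["status"], 16)


def primary_gate(events):
    """Return (probes matching STATUS_NO_SUCH_FILE, total cache probes)."""
    probes = list(cache_probe_statuses(events))
    passed = sum(1 for _, status in probes if status == STATUS_NO_SUCH_FILE)
    return passed, len(probes)
```

Calling `primary_gate(load_events("ours-cold.jsonl"))` would return a `(passed, total)` pair; the gate passes when the two numbers are equal and non-zero.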
## Secondary cascade gate results
### (a) tid=1 last timestamp
- **2.H**: cycle=9,140,200 / host_ns=792,522,910 (NtWaitForSingleObjectEx return)
- **2.J**: cycle=9,169,116 / host_ns=749,717,731 (NtWaitForSingleObjectEx return)
- Delta: **+28,916 cycles** on tid=1 (continued progression). The host_ns
decrease (~43ms) is mechanical: 2.H spent that much host wallclock
spinning at the wedge during the last few hundred matched events,
whereas 2.J spun less on the host side because it spent its instruction
budget on cache-rebuild work. Both runs end by exhausting the 50M-instr
budget, not by a hard stall.
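The two deltas quoted above follow directly from the reported last-event timestamps; the short check below only restates that arithmetic.

```python
# Last tid=1 event in each run, as reported above.
RUN_2H = {"cycle": 9_140_200, "host_ns": 792_522_910}
RUN_2J = {"cycle": 9_169_116, "host_ns": 749_717_731}

cycle_delta = RUN_2J["cycle"] - RUN_2H["cycle"]
host_delta_ms = (RUN_2J["host_ns"] - RUN_2H["host_ns"]) / 1e6

print(f"guest cycle delta: {cycle_delta:+,}")        # +28,916
print(f"host_ns delta:     {host_delta_ms:+.1f} ms")  # -42.8 ms
```

The opposite signs are why the guest cycle count, not host_ns, is used as the progression measure here.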
### (b) Wedge PC
Per the prompt, the 2.F+2.I wedge target was tid=1 PC `0x824ac578` (the
`bl 0x8284E02C` NtWaitForSingleObjectEx with timeout=-1 on thread
handle `0x1210`). 2.J's tail shows tid=1 executing many `NtWait...`
calls past that wedge that **return success** (`return_value=0`,
`status=0x00000000`), not timeout. The wait wrapper is no longer
parked. The 50M-instr run terminates with all 14 tids in returning
`NtWait...` calls, not in blocked waits. **WEDGE-MOVED** (or possibly
absent within this instruction budget — would need a longer run to
distinguish).
### (c) `sub_824F8398` fires?
**0 fires.** Grep for `824f8398` across the full ours-cold.jsonl: zero
hits. The AUDIT-068 ctx-installer chain (`sub_824F8398 →
sub_824F7CD0 → sub_824F7800 → sub_824FD240+0x24`) is **still beyond
the boot window** ours reaches in 50M instructions. Per canary
baseline this fires at host_ns≈9.4s; ours reaches host_ns≈759ms.
### (d) `sub_825070F0` fires?
**0 fires.** The post-VdSwap worker fan-out is still absent. Same
mechanism as (c) — downstream of an install chain that ours doesn't
reach inside the budget.
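The zero-fire checks in (c) and (d) amount to a case-insensitive substring count over the event log. A minimal sketch of an equivalent check is shown below; `count_fires` is a hypothetical helper, not part of the harness. It only proves absence if addresses are serialized as hex text in the log, which is assumed here.

```python
def count_fires(lines, addresses):
    """Count lines mentioning each hex address, case-insensitively."""
    needles = {addr: addr.lower() for addr in addresses}
    hits = {addr: 0 for addr in addresses}
    for line in lines:
        low = line.lower()
        for addr, needle in needles.items():
            if needle in low:
                hits[addr] += 1
    return hits
```

With the capture from this run, usage would look like `count_fires(open("ours-cold.jsonl"), ["824f8398", "825070f0"])`. A substring match can over-count if the same digits appear inside unrelated values, so a non-zero result would still need a manual look.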
### (e) Thread set / spawn count
**10 thread.create entries (unchanged from 2.H).** The new
entry_pc list is bit-identical to 2.H:
```
0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0
```
Canary tids 15/27/28 worker analogs still **absent**. ctx_ptr columns
bit-stable vs 2.H (vA0000000 bucket fix retained):
`0xbe8cbb3c`, `0xbd184a40`, `0xbc6c5640`. Per tripstone #28, comparison
is keyed on entry_pc, not integer tid.
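Keying the comparison on `entry_pc` rather than on integer tid can be sketched as a set comparison. The `thread.create` kind and `entry_pc` field are named in this report; representing `entry_pc` as a hex string is an assumption, and the helper names are illustrative.

```python
# entry_pc values reported for 2.J (bit-identical to 2.H per the report).
ENTRY_PCS = [
    0x82181830, 0x8245A5D0, 0x82450A28, 0x82457EF0, 0x824CD458,
    0x822F1EE0, 0x824D2878, 0x824D2940, 0x82178950, 0x821748F0,
]


def entry_pc_set(events):
    """Collect thread entry points, ignoring the engine-specific integer tid."""
    return {
        int(ev["entry_pc"], 16)
        for ev in events
        if ev.get("kind") == "thread.create"
    }


def compare_thread_sets(ours, reference):
    """Report entry points present on only one side."""
    return {
        "missing_in_ours": sorted(reference - ours),
        "extra_in_ours": sorted(ours - reference),
    }
```

Two runs that create the same threads in a different order, or under different tids, compare as equal under this keying, which is the property tripstone #28 asks for.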
### (f) Total event count
**118,149 → 121,569 (+3,420, +2.9%).** The increment is concentrated on
the cache-rebuild worker (tid=4: 160 → 2,075 events, +1,915 = ~56% of
the delta).
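The share attributed to tid=4 can be reproduced from per-thread event counts. The sketch below assumes each event carries a `tid` field; the helper names are illustrative, and the final lines plug in only the totals reported above (the non-tid-4 events are lumped under a placeholder tid).

```python
from collections import Counter


def per_tid_counts(events):
    """Count events per thread id."""
    return Counter(ev["tid"] for ev in events)


def delta_share(before, after, tid):
    """Return (tid delta, total delta, tid share of the total delta)."""
    total_delta = sum(after.values()) - sum(before.values())
    tid_delta = after[tid] - before[tid]
    return tid_delta, total_delta, tid_delta / total_delta


# Reported totals: 2.H = 118,149 events (tid=4: 160); 2.J = 121,569 (tid=4: 2,075).
before = Counter({4: 160, "other": 118_149 - 160})
after = Counter({4: 2_075, "other": 121_569 - 2_075})
tid_delta, total_delta, share = delta_share(before, after, 4)
print(tid_delta, total_delta, f"{share:.0%}")  # 1915 3420 56%
```

The remaining ~44% of the delta (1,505 events) lands on other threads, so tid=4 is the largest single contributor rather than the sole one.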
### (g) Missing (op, lr) tuples (iterate-2D method)
**Not re-measured.** Phase-A `--phase-a-event-log` capture does not feed
the 2.D diff pipeline (which consumes `--lr-trace` of IAT thunks at
`0x8284DDDC/E49C/DF5C/E07C`). 2.H report noted the same restriction.
Expected unchanged at 28/28 — the producer LRs that fire in canary
target downstream worker classes (`sub_825070F0` fan-out) that ours
still doesn't reach. Re-running 2.D requires a separate capture mode.
### (h) VdSwap count
**1 swap unchanged** (3 events = import.call + kernel.call + kernel.return
for the same single VdSwap call at cycle=5,577,303 / host_ns=489.2ms).
Per tripstone #39: gameplay-level progression (swaps > 1 or draws > 0)
NOT achieved. The 2.J run still does not reach a second swap within the
50M-instr budget.
### (i) Draw count
**0 draws.** No `*Draw*` kernel-call names emitted (consistent with
VdSwap=1: pre-gameplay).
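Since one VdSwap call shows up as three events, counting swaps means counting one event kind only. A minimal sketch, assuming each event has `kind` and `name` fields and that `kernel.call` is emitted exactly once per call:

```python
def count_calls(events, predicate):
    """Count kernel calls whose name satisfies predicate, one per kernel.call event."""
    return sum(
        1
        for ev in events
        if ev.get("kind") == "kernel.call" and predicate(ev.get("name", ""))
    )


def progression_counts(events):
    """Return (swap count, draw count) for the tripstone #39 composite check."""
    events = list(events)
    swaps = count_calls(events, lambda name: name == "VdSwap")
    draws = count_calls(events, lambda name: "draw" in name.lower())
    return swaps, draws
```

For this run the expected result would be `(1, 0)`, matching gates (h) and (i).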
## Cascade roll-up
| gate | description | 2.H | 2.J | result |
|------|-------------|-----|-----|--------|
| PRIMARY | cache-probe `0xc000000f` matches canary | FAIL (returns SUCCESS) | PASS (9/9) | **PASS** |
| (a) tid=1 last cycle | progression | 9,140,200 | 9,169,116 | +28,916 |
| (b) wedge PC `0x824ac578` parked | wait timeout=-1 | parked | NtWait returns 0 | **MOVED** |
| (c) `sub_824F8398` fires | install chain | 0 | 0 | UNCHANGED |
| (d) `sub_825070F0` fires | fan-out | 0 | 0 | UNCHANGED |
| (e) thread set size | spawns | 10 entries | 10 entries | UNCHANGED |
| (f) total event count | volume | 118,149 | 121,569 | +2.9% |
| (g) missing-tuple count | 2.D diff | 28 | n/a (different capture) | NOT-MEASURED |
| (h) VdSwap count | gameplay swaps | 1 | 1 | UNCHANGED |
| (i) draws | gameplay draws | 0 | 0 | UNCHANGED |
**Outcome class: WEDGE-MOVED.** Primary gate fully passes. tid=1 wedge
geometry moved (wait now returns success). Cache-rebuild worker tid=4
springs into life (~13× event growth). But the deeper install chain
(`sub_824F8398` / `sub_825070F0`) remains downstream of the 50M-instr
budget; gameplay-level progression (VdSwap > 1, draws > 0) NOT achieved.
## What changed and why
The 2.I diagnosis was correct in its mechanism but only partially
correct in its prediction:
- **Mechanism correct**: ours's cache contained 9 files from previous
runs (276K total). `NtQueryFullAttributesFile` returned
`STATUS_SUCCESS` for files that should be missing on a cold boot.
Canary's capture protocol wipes both XDG and binary caches; ours's
warm-cache state put the engine on a cache-HIT replay branch instead
of cache-MISS reconstruction. tid=4 was hardly doing anything in 2.H
because the cache already existed. In 2.J it actively rebuilds the
cache (36 NtCreateFile, 24 NtOpenFile, 19 NtWriteFile to `*.tmp`
files and bucket directories).
- **Prediction partial**: closing the cache-state divergence did unblock
one wait wrapper (the previously-parked `0x824ac578` wait now returns
success), but did NOT cascade through to the
`sub_824F8398` install chain or `sub_825070F0` worker fan-out. The
install epoch on canary fires at host_ns≈9.4s; ours's 50M-instr run
ends at host_ns≈760ms. The wedge moved forward (+28,916 cycles on
tid=1), but the canary trajectory is still ~12× further along in
wallclock when its install chain fires.
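The ~12× figure, and what it implies for an instruction budget, can be worked through as below. The linear relation between instruction count and host_ns is an assumption made only for this estimate; the report does not establish it, and host_ns pacing may not be directly comparable across the two engines.

```python
CANARY_INSTALL_NS = 9.4e9   # canary install epoch, host_ns (reported)
OURS_END_NS = 0.76e9        # end of ours's 50M-instr run, host_ns (reported)
BUDGET = 50_000_000         # instructions in the 2.J run

ratio = CANARY_INSTALL_NS / OURS_END_NS

# Assumption: host_ns grows linearly with instructions executed.
needed_budget = BUDGET * ratio
reach_at_500m_s = OURS_END_NS * (500_000_000 / BUDGET) / 1e9

print(f"wallclock ratio:          {ratio:.1f}x")
print(f"budget to reach 9.4 s:    {needed_budget / 1e6:.0f}M instructions")
print(f"host time at 500M instrs: {reach_at_500m_s:.1f} s")
```

Under that assumption a 10× budget lands near 7.6 s, somewhat short of 9.4 s, so a longer-budget replay may need headroom beyond 10× before a missing install-chain fire can be read as a real divergence.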
## Tripstone audit
- **#28** (per-engine tid stability): All cross-engine comparisons are
keyed on `entry_pc` and first-kernel-call signature, never on integer
tid. The "tid=1 wedge" / "tid=4 cache rebuild" identities are
ours-internal and stable across 2.H ↔ 2.J because both runs are
ours-side (deterministic scheduler).
- **#39** (composite progression): The headline does NOT claim "gameplay
progression" — VdSwap count unchanged at 1, draws unchanged at 0. The
PRIMARY-gate PASS is a **structural / state-parity** claim (cache
state matches canary baseline). The secondary observation (tid=1 wedge
geometry MOVED) is reported with both improving evidence (cycle
+28,916) and ambiguous evidence (host_ns shifted backward due to less
spin-wait).
- **#40** (single-keystone framing): The 2.I prompt framing
"cache-wipe single test-harness parity fix may unblock the wedge"
is **partially falsified**. Cache-state IS load-bearing (one wedge
moved, +3,420 events, tid=4 came alive) but is NOT the keystone for
the AUDIT-068 install chain (`sub_824F8398` still 0 fires). The
iterate 2.E reading-error #40 class ("single-keystone framing
falsified") REPEATS here. Recommend explicitly registering reading
error #41: **state-parity gate PASS does not imply cascade — even
bit-identical input state can land on different trajectories when
~12× wallclock separates the install epochs**.
## Confidence
- **HIGH** that primary gate genuinely passes (all 9 cache-probe paths
bit-aligned with canary).
- **HIGH** that tid=4 cache-rebuild work is the bulk of the +3,420
event delta (cache file I/O directly visible in args_resolved.path).
- **HIGH** that the wedge moved (NtWait at `0x824ac578` no longer
parked).
- **HIGH** that `sub_824F8398` / `sub_825070F0` still 0 fires
(checked with multiple independent greps).
- **MEDIUM** that the next blocker is "longer instruction budget +
install chain investigation" vs "additional state-parity divergence
upstream of install epoch". Both classes remain candidates.
## Next iterate recommendation
**Iterate 2.K should be one of:**
1. **Longer-budget replay (~0 LOC).** Re-run 2.J with `-n 500000000`
(10× budget, ~60s wallclock estimate) to push past host_ns≈9.4s and
see if the AUDIT-068 install chain fires naturally now that the
cache-state divergence is closed. If `sub_824F8398` fires in the
longer run, the cascade IS following, just at a slower wallclock pace.
If it still doesn't, there is a second state-parity divergence to find.
2. **Replay-then-replay determinism check (~0 LOC).** Run 2.J twice
back-to-back with `XENIA_CACHE_WIPE=1` and verify the second run
produces identical (or near-identical) event count + same tid=4
work pattern. Cross-check that the persistent-cache path doesn't
contaminate state between runs.
3. **2.I-style arg-diff at the NEW first-divergence (~50-100 LOC).**
2.I's diff harness was keyed on (kind, name, ord) only and missed
the return-value divergence. Now that those return values align,
re-run the diff to find the NEXT cross-engine first-divergence in
args_resolved or side_effects within the 0-1s window. Likely
reveals what state-parity divergence (if any) blocks the install
chain from firing earlier on ours.
Recommended priority: **(1) first** (zero LOC, ~5 min, decisive),
then **(3)** if (1) shows no install-chain fire.
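For option (3), the gap noted in the 2.I harness (keying on `(kind, name, ord)` only) can be illustrated with a first-divergence search that also compares return values. This is a sketch, not the existing harness: the function names are hypothetical, the `return_value` and `status` fields are assumed to sit on the event dict, and events are paired positionally, whereas a real cross-engine diff would first align streams per thread by `entry_pc` (tripstone #28).

```python
from itertools import zip_longest


def shape_key(ev):
    """Key used by the 2.I-style diff: call shape only."""
    return (ev.get("kind"), ev.get("name"), ev.get("ord"))


def result_key(ev):
    """Extended key: call shape plus return value and status."""
    return shape_key(ev) + (ev.get("return_value"), ev.get("status"))


def first_divergence(ours, reference, key=result_key):
    """Return (index, ours_event, reference_event) at the first mismatch, else None."""
    for idx, (a, b) in enumerate(zip_longest(ours, reference)):
        if a is None or b is None or key(a) != key(b):
            return idx, a, b
    return None
```

Running the same pair of streams with `key=shape_key` and then with the default key shows whether a divergence is visible only in return values, which is the class of mismatch 2.I missed.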
## Artifacts
Under `xenia-rs/audit-runs/iterate-2J-cache-wipe-replay/`:
- `ours-cold.jsonl` (121,569 events, 50M-instr run, cache-wiped boot,
~28MB)
- `ours-cold.stdout.log` / `ours-cold.stderr.log` (empty — quiet mode)
- `writer-report.md` (this file)
Backup of pre-wipe XDG cache:
`/tmp/xenia-rs-cache-pre-2J-backup-<timestamp>` (276K, 9 files).