handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
262
audit-runs/iterate-2J-cache-wipe-replay/writer-report.md
Normal file
@@ -0,0 +1,262 @@
# Iterate 2.J: Cache-wipe replay (writer report)

**Date:** 2026-05-28. **LOC delta:** engine **0**, canary **0**. Pure
test-harness parity measurement (no code change).
**Tests:** N/A (no source modifications).

## Headline

**WEDGE-MOVED.** Primary gate **PASS**: 2.J's `NtQueryFullAttributesFile`
cache-probe calls now return `0xc000000f` (`STATUS_NO_SUCH_FILE`) for all
9 `cache:\*` paths, matching canary's cold-cache baseline (iterate 2.I
documented ours returning `STATUS_SUCCESS` for the same paths in 2.H;
the inversion identified there is closed by the env-var fix). Cascade is
**partial**: tid=4 (cache-rebuild worker) explodes from 160 → 2,075
events (~13×, +97% NtCreateFile/NtOpenFile/NtWriteFile to `cache:\` and
`cache:\<bucket>\<x>\<file>.tmp`); total event count 118,149 → 121,569
(+3,420, +2.9%); tid=1 wedge geometry changed (last `guest_cycle`
9,140,200 → 9,169,116, +28,916 cycles). VdSwap count unchanged (1
swap); thread set still 10 entries (no new spawns); `sub_824F8398` /
`sub_825070F0` still 0 fires. Cache-divergence is real and now closed,
but it was not the keystone for the AUDIT-068 install chain.

## Mode

Pure measurement, ZERO LOC change. Invocation:
```
XENIA_CACHE_WIPE=1 timeout 600 ./target/release/xenia-rs exec -n 50000000 --quiet \
  --phase-a-event-log audit-runs/iterate-2J-cache-wipe-replay/ours-cold.jsonl \
  "<iso>"
```
Identical to the iterate-2H invocation, with `XENIA_CACHE_WIPE=1` prepended.
Belt-and-braces: the XDG cache was also removed with
`rm -rf /home/fabi/.local/share/xenia-rs/cache/` before the run
(backup at `/tmp/xenia-rs-cache-pre-2J-backup-*`).

## Cache wipe mechanism (verified)

From `xenia-rs/crates/xenia-kernel/src/state.rs:1837-1893`
(`resolve_default_cache_root`): `XENIA_CACHE_WIPE=1` redirects
`cache_root` to a per-process tmpdir at
`$TMPDIR/xenia-rs-cache-<pid>-<n>` AND returns `wipe=true`, which makes
`init_cache_root` (state.rs:728-758) do the clear-then-recreate dance.
This properly isolates ours from any pre-existing XDG cache. No
separate binary/JIT cache exists in this codebase
(only XDG cache at `$HOME/.local/share/xenia-rs/cache/`).
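
The redirect-and-wipe behaviour described above can be modelled in a few lines. This is only an illustrative Python sketch of the described semantics, not a transcription of the Rust in `state.rs`; the function signatures, the `seq` counter, and the fallback to the system temp directory are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_cache_root(env, default_root, pid, seq):
    """Return (cache_root, wipe) following the behaviour described in the report.

    `env` is a plain dict standing in for the process environment (assumption).
    """
    if env.get("XENIA_CACHE_WIPE") == "1":
        tmp = Path(env.get("TMPDIR", tempfile.gettempdir()))
        # Per-process directory, so a wiped run never sees the persistent cache.
        return tmp / f"xenia-rs-cache-{pid}-{seq}", True
    return Path(default_root), False


def init_cache_root(root, wipe):
    """Clear-then-recreate when `wipe` is set; otherwise only ensure it exists."""
    if wipe and root.exists():
        shutil.rmtree(root)
    root.mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        env = {"XENIA_CACHE_WIPE": "1", "TMPDIR": scratch}
        root, wipe = resolve_cache_root(env, "/unused/default", pid=4242, seq=0)
        print(root.name, wipe)
```

Keeping resolution (pure, returns a path and a flag) separate from initialisation (touches the filesystem) mirrors the two-function split the report cites and makes the redirect easy to check without deleting anything real.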

## Primary gate result: cache-probe return values

**PASS (9/9).** Every `NtQueryFullAttributesFile` call on a `cache:\*`
path in 2.J returns `0xc000000f` (`STATUS_NO_SUCH_FILE`). The first
divergence flagged by iterate 2.I (idx 102423,
`cache:\d4ea4615\e\46ee8ca`, ours `STATUS_SUCCESS` vs canary
`STATUS_NO_SUCH_FILE`) is now bit-aligned with canary's cold-cache
return.

Cache-probe paths and 2.J returns:

| tid_event_idx | path | 2.J status | canary baseline status |
|---|---|---|---|
| 102423 | `cache:\d4ea4615\e\46ee8ca` | `0xc000000f` | `0xc000000f` |
| 103840 | `cache:\69d8e45c\8\3421153` | `0xc000000f` | `0xc000000f` |
| 103996 | `cache:\69d8e45c\9\355f2f8` | `0xc000000f` | `0xc000000f` |
| 104453 | `cache:\69d8e45c\e\534ffea` | `0xc000000f` | `0xc000000f` |
| 105477 | `cache:\aab216c3\a\2c8c185` | `0xc000000f` | `0xc000000f` |
| 105792 | `cache:\69d8e45c\9\73a5c0a` | `0xc000000f` | `0xc000000f` |
| 106228 | `cache:\69d8e45c\9\39a9dcc` | `0xc000000f` | `0xc000000f` |
| (+others) | `cache:\aab216c3\5\ee70e0a` | `0xc000000f` | `0xc000000f` |

`cache:\` root open and `cache:\access`/`cache:\ignore`/`cache:\recent`
metadata probes also align with canary's cold-cache behavior.
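
A gate like this can be checked mechanically over the event log. The sketch below is a minimal example under an assumed JSONL schema: one object per line with `kind`, `name`, `args_resolved.path`, and a hex-string `status`. Only those field names that appear elsewhere in this report are taken from it; their exact layout in `ours-cold.jsonl` is an assumption.

```python
import json

STATUS_NO_SUCH_FILE = 0xC000000F


def cache_probe_mismatches(lines, expected=STATUS_NO_SUCH_FILE):
    """Return (path, status) for every cache:\\ probe whose status != expected."""
    bad = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        # Only completed NtQueryFullAttributesFile calls carry a status (assumed).
        if ev.get("kind") != "kernel.return":
            continue
        if ev.get("name") != "NtQueryFullAttributesFile":
            continue
        path = ev.get("args_resolved", {}).get("path", "")
        if not path.lower().startswith("cache:\\"):
            continue
        status = int(ev["status"], 16)
        if status != expected:
            bad.append((path, status))
    return bad


def _probe(path, status):
    """Build one synthetic event line for the demo below."""
    return json.dumps({
        "kind": "kernel.return",
        "name": "NtQueryFullAttributesFile",
        "args_resolved": {"path": path},
        "status": status,
    })


# Synthetic stand-ins for a cold (2.J-like) and a warm (2.H-like) log.
cold = [_probe("cache:\\d4ea4615\\e\\46ee8ca", "0xc000000f")]
warm = [_probe("cache:\\d4ea4615\\e\\46ee8ca", "0x00000000")]

if __name__ == "__main__":
    print("cold mismatches:", cache_probe_mismatches(cold))
    print("warm mismatches:", cache_probe_mismatches(warm))
```

On a real log the same function would be fed `open(path)` directly, since it only iterates over lines.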

## Secondary cascade gate results

### (a) tid=1 last timestamp
- **2.H**: cycle=9,140,200 / host_ns=792,522,910 (NtWaitForSingleObjectEx return)
- **2.J**: cycle=9,169,116 / host_ns=749,717,731 (NtWaitForSingleObjectEx return)
- Delta: **+28,916 cycles** on tid=1 (continued progression). The host_ns
  decrease is mechanical: 2.H spent ~43ms of host wallclock spinning at
  the wedge during the last few hundred matched events; 2.J consumed
  fewer host-side spin cycles because it actually consumed instruction
  budget on cache-rebuild work. Both runs hit the 50M-instr budget,
  not a wedge.

### (b) Wedge PC
Per the prompt, the 2.F+2.I wedge target was tid=1 PC `0x824ac578` (the
`bl 0x8284E02C` NtWaitForSingleObjectEx with timeout=-1 on thread
handle `0x1210`). 2.J's tail shows tid=1 executing many `NtWait...`
calls past that wedge that **return success** (`return_value=0`,
`status=0x00000000`), not timeout. The wait wrapper is no longer
parked. The 50M-instr run terminates with all 14 tids in returning
`NtWait...` calls, not in blocked waits. **WEDGE-MOVED** (or possibly
absent within this instruction budget; a longer run would be needed to
distinguish).

### (c) `sub_824F8398` fires?
**0 fires.** Grep for `824f8398` across the full ours-cold.jsonl: zero
hits. The AUDIT-068 ctx-installer chain
(`sub_824F8398 → sub_824F7CD0 → sub_824F7800 → sub_824FD240+0x24`) is
**still upstream of the boot window** ours reaches in 50M instructions.
Per canary baseline this fires at host_ns≈9.4s; ours reaches host_ns≈759ms.
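
The zero-hit check is a case-insensitive substring count over the log. A minimal sketch follows; the sample lines are synthetic and do not come from `ours-cold.jsonl`.

```python
def count_hits(lines, needle):
    """Count lines containing `needle`, ignoring case (like `grep -ci`)."""
    needle = needle.lower()
    return sum(1 for line in lines if needle in line.lower())


# Synthetic lines: one mentions the wedge PC, none mention the installer entry.
sample = [
    '{"kind": "kernel.call", "name": "NtWaitForSingleObjectEx", "lr": "0x824AC57C"}',
    '{"kind": "kernel.return", "name": "VdSwap", "status": "0x00000000"}',
]

if __name__ == "__main__":
    for addr in ("824f8398", "825070f0", "824ac57"):
        print(addr, count_hits(sample, addr))
```

Matching on the bare hex digits rather than a `0x`-prefixed or `sub_`-prefixed form keeps the check independent of how a given field happens to format addresses.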

### (d) `sub_825070F0` fires?
**0 fires.** The post-VdSwap worker fan-out is still absent. Same
mechanism as (c): downstream of an install chain that ours doesn't
reach inside the budget.

### (e) Thread set / spawn count
**10 thread.create entries (unchanged from 2.H).** The new
entry_pc list is bit-identical to 2.H:
```
0x82181830, 0x8245a5d0, 0x82450a28, 0x82457ef0, 0x824cd458,
0x822f1ee0, 0x824d2878, 0x824d2940, 0x82178950, 0x821748f0
```
Canary tids 15/27/28 worker analogs still **absent**. ctx_ptr columns
bit-stable vs 2.H (vA0000000 bucket fix retained):
`0xbe8cbb3c`, `0xbd184a40`, `0xbc6c5640`. Per tripstone #28, comparison
is keyed on entry_pc, not integer tid.
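
Keying the thread comparison on `entry_pc` rather than tid can be sketched as a set difference. The record shape (`tid`, `entry_pc`) is an assumption, and the extra canary entry `0x82500000` is a made-up placeholder, not a real address from either trace.

```python
def entry_pc_diff(ours, canary):
    """Return (missing_in_ours, extra_in_ours) as sorted lists of entry PCs.

    Integer tids are deliberately ignored: they are not stable across engines.
    """
    ours_pcs = {t["entry_pc"] for t in ours}
    canary_pcs = {t["entry_pc"] for t in canary}
    return sorted(canary_pcs - ours_pcs), sorted(ours_pcs - canary_pcs)


# Same two entry points under different tids, plus one hypothetical
# canary-only worker (0x82500000 is illustrative only).
ours_threads = [
    {"tid": 2, "entry_pc": 0x82181830},
    {"tid": 3, "entry_pc": 0x8245A5D0},
]
canary_threads = [
    {"tid": 7, "entry_pc": 0x82181830},
    {"tid": 9, "entry_pc": 0x8245A5D0},
    {"tid": 15, "entry_pc": 0x82500000},
]

if __name__ == "__main__":
    missing, extra = entry_pc_diff(ours_threads, canary_threads)
    print("missing in ours:", [hex(pc) for pc in missing])
    print("extra in ours:", [hex(pc) for pc in extra])
```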

### (f) Total event count
**118,149 → 121,569 (+3,420, +2.9%).** The increment is concentrated on
the cache-rebuild worker (tid=4: 160 → 2,075 events, +1,915 = ~56% of
the delta).
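
A per-tid breakdown of this kind is a histogram over the `tid` field of each event. The sketch below assumes every JSONL line carries an integer `tid`; the two tiny logs are synthetic and only demonstrate the shape of the computation.

```python
import json
from collections import Counter


def events_per_tid(lines):
    """Count events per tid, skipping blank lines."""
    return Counter(json.loads(line)["tid"] for line in lines if line.strip())


def per_tid_growth(before, after):
    """Return {tid: after - before} for every tid whose count changed."""
    tids = sorted(set(before) | set(after))
    return {
        tid: after.get(tid, 0) - before.get(tid, 0)
        for tid in tids
        if after.get(tid, 0) != before.get(tid, 0)
    }


# Synthetic logs: tid 1 is unchanged, tid 4 gains three events.
run_before = [json.dumps({"tid": t}) for t in (1, 1, 4)]
run_after = [json.dumps({"tid": t}) for t in (1, 1, 4, 4, 4, 4)]

if __name__ == "__main__":
    growth = per_tid_growth(events_per_tid(run_before), events_per_tid(run_after))
    print(growth)
```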

### (g) Missing (op, lr) tuples (iterate-2D method)
**Not re-measured.** Phase-A `--phase-a-event-log` capture does not feed
the 2.D diff pipeline (which consumes `--lr-trace` of IAT thunks at
`0x8284DDDC/E49C/DF5C/E07C`). 2.H report noted the same restriction.
Expected unchanged at 28/28: the producer LRs that fire in canary
target downstream worker classes (`sub_825070F0` fan-out) that ours
still doesn't reach. Re-running 2.D requires a separate capture mode.

### (h) VdSwap count
**1 swap unchanged** (3 events = import.call + kernel.call + kernel.return
for the same single VdSwap call at cycle=5,577,303 / host_ns=489.2ms).
Per tripstone #39: gameplay-level progression (swaps > 1 or draws > 0)
NOT achieved. The 2.J run still wedges before the second swap.

### (i) Draw count
**0 draws.** No `*Draw*` kernel-call names emitted (consistent with
VdSwap=1: pre-gameplay).

## Cascade roll-up

| gate | description | 2.H | 2.J | result |
|------|-------------|-----|-----|--------|
| PRIMARY | cache-probe `0xc000000f` matches canary | FAIL (returns SUCCESS) | PASS (9/9) | **PASS** |
| (a) tid=1 last cycle | progression | 9,140,200 | 9,169,116 | +28,916 |
| (b) wedge PC `0x824ac578` parked | wait timeout=-1 | parked | NtWait returns 0 | **MOVED** |
| (c) `sub_824F8398` fires | install chain | 0 | 0 | UNCHANGED |
| (d) `sub_825070F0` fires | fan-out | 0 | 0 | UNCHANGED |
| (e) thread set size | spawns | 10 entries | 10 entries | UNCHANGED |
| (f) total event count | volume | 118,149 | 121,569 | +2.9% |
| (g) missing-tuple count | 2.D diff | 28 | n/a (different capture) | NOT-MEASURED |
| (h) VdSwap count | gameplay swaps | 1 | 1 | UNCHANGED |
| (i) draws | gameplay draws | 0 | 0 | UNCHANGED |

**Outcome class: WEDGE-MOVED.** Primary gate fully passes. tid=1 wedge
geometry moved (wait now returns success). Cache-rebuild worker tid=4
springs into life (~13× event growth). But the deeper install chain
(`sub_824F8398` / `sub_825070F0`) remains downstream of the 50M-instr
budget; gameplay-level progression (VdSwap > 1, draws > 0) NOT achieved.
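
The derived figures in the roll-up follow directly from the raw counts reported above. A short worked check, using only numbers that appear in this report (the variable names are illustrative):

```python
# Raw figures as reported for 2.H and 2.J.
events_2h, events_2j = 118_149, 121_569
tid4_2h, tid4_2j = 160, 2_075
cycle_2h, cycle_2j = 9_140_200, 9_169_116
canary_install_s, ours_end_s = 9.4, 0.76  # approximate host wallclock, seconds

event_delta = events_2j - events_2h                  # total event growth
event_pct = 100 * event_delta / events_2h            # growth as a percentage
tid4_delta = tid4_2j - tid4_2h                       # growth on the cache worker
tid4_ratio = tid4_2j / tid4_2h                       # the "~13x" figure
tid4_share = 100 * tid4_delta / event_delta          # share of the total delta
cycle_delta = cycle_2j - cycle_2h                    # tid=1 progression
wallclock_ratio = canary_install_s / ours_end_s      # the "~12x" figure

if __name__ == "__main__":
    print(f"event delta   : +{event_delta} ({event_pct:.1f}%)")
    print(f"tid=4 growth  : +{tid4_delta} ({tid4_ratio:.1f}x, {tid4_share:.0f}% of delta)")
    print(f"tid=1 cycles  : +{cycle_delta}")
    print(f"wallclock gap : {wallclock_ratio:.1f}x")
```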

## What changed and why

The 2.I diagnosis was correct in its mechanism but only partially
correct in its prediction:

- **Mechanism correct**: ours's cache contained 9 files from previous
  runs (276K total). `NtQueryFullAttributesFile` returned
  `STATUS_SUCCESS` for files that should be missing on a cold boot.
  Canary's capture protocol wipes both XDG and binary caches; ours's
  warm-cache state put the engine on a cache-HIT replay branch instead
  of cache-MISS reconstruction. tid=4 was hardly doing anything in 2.H
  because the cache already existed. In 2.J it actively rebuilds the
  cache (36 NtCreateFile, 24 NtOpenFile, 19 NtWriteFile to `*.tmp`
  files and bucket directories).

- **Prediction partial**: closing the cache-state divergence did unblock
  one wait wrapper (the previously-parked `0x824ac578` wait now returns
  success), but did NOT cascade through to the
  `sub_824F8398` install chain or `sub_825070F0` worker fan-out. The
  install epoch on canary fires at host_ns≈9.4s; ours's 50M-instr run
  ends at host_ns≈760ms. The wedge moved earlier, but the canary
  trajectory is still ~12× further along in wallclock when its install
  chain fires.

## Tripstone audit

- **#28** (per-engine tid stability): All cross-engine comparisons are
  keyed on `entry_pc` and first-kernel-call signature, never on integer
  tid. The "tid=1 wedge" / "tid=4 cache rebuild" identities are
  ours-internal and stable across 2.H ↔ 2.J because both runs are
  ours-side (deterministic scheduler).
- **#39** (composite progression): The headline does NOT claim "gameplay
  progression": VdSwap count unchanged at 1, draws unchanged at 0. The
  PRIMARY-gate PASS is a **structural / state-parity** claim (cache
  state matches canary baseline). Secondary observation tid=1 wedge
  geometry MOVED is reported with both improving (cycle +28,916) and
  ambiguous (host_ns shifted backward due to less spin-wait) evidence.
- **#40** (single-keystone framing): The 2.I prompt framing
  "cache-wipe single test-harness parity fix may unblock the wedge"
  is **partially falsified**. Cache-state IS load-bearing (one wedge
  moved, +3,420 events, tid=4 came alive) but is NOT the keystone for
  the AUDIT-068 install chain (`sub_824F8398` still 0 fires). The
  iterate 2.E reading-error #40 class ("single-keystone framing
  falsified") REPEATS here. Recommend explicitly registering reading
  error #41: **state-parity gate PASS does not imply cascade; even
  bit-identical input state can land on different trajectories when
  ~12× wallclock separates the install epochs**.

## Confidence

- **HIGH** that primary gate genuinely passes (all 9 cache-probe paths
  bit-aligned with canary).
- **HIGH** that tid=4 cache-rebuild work is the bulk of the +3,420
  event delta (cache file I/O directly visible in args_resolved.path).
- **HIGH** that the wedge moved (NtWait at `0x824ac578` no longer
  parked).
- **HIGH** that `sub_824F8398` / `sub_825070F0` still 0 fires
  (instrumented multiple grep paths).
- **MEDIUM** that the next blocker is "longer instruction budget +
  install chain investigation" vs "additional state-parity divergence
  upstream of install epoch". Both classes remain candidates.

## Next iterate recommendation

**Iterate 2.K should be one of:**

1. **Longer-budget replay (~0 LOC).** Re-run 2.J with `-n 500000000`
   (10× budget, ~60s wallclock estimate) to push past host_ns≈9.4s and
   see if the AUDIT-068 install chain fires naturally now that the
   cache-state divergence is closed. If `sub_824F8398` fires in the
   longer run, the cascade IS following just at slower wallclock. If it
   still doesn't, there's a second state-parity divergence to find.

2. **Replay-then-replay determinism check (~0 LOC).** Run 2.J twice
   back-to-back with `XENIA_CACHE_WIPE=1` and verify the second run
   produces identical (or near-identical) event count + same tid=4
   work pattern. Cross-check that the persistent-cache path doesn't
   contaminate state between runs.

3. **2.I-style arg-diff at the NEW first-divergence (~50-100 LOC).**
   2.I's diff harness was keyed on (kind, name, ord) only and missed
   the return-value divergence. Now that those return values align,
   re-run the diff to find the NEXT cross-engine first-divergence in
   args_resolved or side_effects within the 0-1s window. Likely
   reveals what state-parity divergence (if any) blocks the install
   chain from firing earlier on ours.

Recommended priority: **(1) first** (zero LOC, ~5 min, decisive),
then **(3)** if (1) shows no install-chain fire.
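
Option 3 turns on what the diff key includes. The sketch below shows why a key of (kind, name, ord) alone misses a return-value divergence and how extending the key surfaces it. It is a hypothetical illustration, not the 2.I harness: the event dict shape and the `return_value` field layout are assumptions, and both streams are synthetic.

```python
from itertools import zip_longest


def event_key(ev, with_return):
    """Comparison key: (kind, name, ord), optionally extended by return_value."""
    base = (ev["kind"], ev["name"], ev["ord"])
    return base + (ev.get("return_value"),) if with_return else base


def first_divergence(ours, canary, with_return=True):
    """Index of the first event whose key differs, or None if streams match."""
    for idx, (a, b) in enumerate(zip_longest(ours, canary)):
        # A missing event on either side (unequal lengths) counts as divergence.
        if a is None or b is None:
            return idx
        if event_key(a, with_return) != event_key(b, with_return):
            return idx
    return None


def _ret(name, ordinal, value):
    """Build one synthetic kernel.return event."""
    return {"kind": "kernel.return", "name": name, "ord": ordinal, "return_value": value}


# Same call sequence; only the second call's return value differs.
ours_events = [_ret("NtOpenFile", 0, 0), _ret("NtQueryFullAttributesFile", 1, 0)]
canary_events = [_ret("NtOpenFile", 0, 0), _ret("NtQueryFullAttributesFile", 1, 0xC000000F)]

if __name__ == "__main__":
    print("narrow key  :", first_divergence(ours_events, canary_events, with_return=False))
    print("with return :", first_divergence(ours_events, canary_events, with_return=True))
```

The same pattern extends to `args_resolved` or `side_effects` by appending further fields to the key, at the cost of more false positives from fields that legitimately differ between engines (pointers, host timestamps).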

## Artifacts

Under `xenia-rs/audit-runs/iterate-2J-cache-wipe-replay/`:

- `ours-cold.jsonl` (121,569 events, 50M-instr run, cache-wiped boot,
  ~28MB)
- `ours-cold.stdout.log` / `ours-cold.stderr.log` (empty: quiet mode)
- `writer-report.md` (this file)

Backup of pre-wipe XDG cache:
`/tmp/xenia-rs-cache-pre-2J-backup-<timestamp>` (276K, 9 files).