Canary Variance Characterization — Reading-Error #32
Source data
Re-analysis of the C+22 archived jitter jsonls + ours-cold.jsonl from xenia-rs/audit-runs/phase-c22-rtl-enter-leave-control-flow/. No fresh runs were done in this session; the C+22 samples (4 canary cold runs + 1 ours cold run) are sufficient to characterize the variance.
Files inspected
- `canary-cold-trunc.jsonl` (494 MB, truncated to ~250k tid=6 events): fresh c22
- `ours-cold.jsonl` (28 MB, 121,569 events)
- Archived: jitter-1, jitter-2, jitter-3 (referenced in C+22 memory + `investigation.md`)
- `cold-vs-cold-result.md`: variance table
Variance summary at tid=6 idx 104,604..104,620
Pattern of import.call events (E = RtlEnterCriticalSection, L = RtlLeaveCriticalSection):
| sample | observed pattern | wait.begin slow-path? | notes |
|---|---|---|---|
| C+21 archived (jitter-2 equivalent) | E E L L | no | fast-path acquire, fast-path nested-acquire, two releases |
| canary jitter-1 | E wait.begin E L L | yes (between first E's call and return) | slow-path on the OUTER acquire |
| canary jitter-2 | E E L L | no | same as C+21 |
| canary jitter-3 | E E L L (shifted by +3 indices upstream) | no | upstream tid=6 events have different ordering |
| fresh c22 | E wait.begin E L L | yes | same shape as jitter-1 |
| ours cold | E L NtClose | no | NO nested acquire; releases and proceeds to close |
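The pattern column above can be derived mechanically from a window of events. The following is a minimal sketch, not the actual diff tool: the JSONL schema is not shown in this note, so the `kind` and `name` field names are assumptions, and `summarize_window` is a hypothetical helper.

```python
# Hypothetical event schema (assumed, not taken from the trace files):
#   {"kind": "import.call" | "wait.begin" | ..., "name": "<export name>"}
ABBREV = {
    "RtlEnterCriticalSection": "E",
    "RtlLeaveCriticalSection": "L",
}


def summarize_window(events):
    """Return (pattern, slow_path) for a list of event dicts.

    pattern   - space-separated tokens in the notation of the table
    slow_path - True if any wait.begin event appears in the window
    """
    tokens = []
    slow_path = False
    for ev in events:
        kind = ev.get("kind")
        if kind == "wait.begin":
            slow_path = True
            tokens.append("wait.begin")
        elif kind == "import.call":
            name = ev.get("name", "?")
            # Exports without an abbreviation are printed by full name.
            tokens.append(ABBREV.get(name, name))
    return " ".join(tokens), slow_path


# Windows shaped like the jitter-1 and ours-cold rows.
jitter1 = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "wait.begin"},
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "import.call", "name": "RtlLeaveCriticalSection"},
    {"kind": "import.call", "name": "RtlLeaveCriticalSection"},
]
ours_cold = [
    {"kind": "import.call", "name": "RtlEnterCriticalSection"},
    {"kind": "import.call", "name": "RtlLeaveCriticalSection"},
    {"kind": "import.call", "name": "NtClose"},
]

print(summarize_window(jitter1))
print(summarize_window(ours_cold))
```

Keeping the slow-path flag separate from the pattern string makes it easy to compare samples on structure alone, which is the distinction the observations below rely on.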
Key observations
- Canary 5/5 samples have the second (nested) `E` regardless of whether the outer acquire took the slow path. The nested-Enter is canary-structural, not jitter.
- wait.begin presence varies: 2 of 5 canary samples emit it, 3 of 5 don't. The C+21 floating absorber correctly masks both cases via the shared-global SID `75ae880ec432eb36`.
- Ours-cold takes a different control-flow path: no second E, no nested cleanup, proceeds straight to NtClose. This is `RtlLeaveCriticalSection` followed by `NtClose` on the Event handle that the CS was protecting.
- The C+21 floating-absorb engages correctly in all canary samples (`floating_create (c/o) = 1/0`, `floating_wait (c/o)` varies 0-3/0). Matched-prefix is invariant at 104,607 across all canary cold samples after absorption.
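To illustrate why the matched prefix stays invariant whether or not wait.begin fires, here is a simplified sketch of a floating-absorbing prefix comparison. It is not the C+21 absorber itself. The event fields (`kind`, `name`, `sid`), the rule that a floating event is a wait.begin on a shared-global SID, and the helper names are all assumptions made for illustration.

```python
SHARED_GLOBAL_SID = "75ae880ec432eb36"  # SID quoted in the observations above


def is_floating(ev, floating_sids):
    # Assumption: only wait.begin events on a shared-global SID may float.
    return ev.get("kind") == "wait.begin" and ev.get("sid") in floating_sids


def event_key(ev):
    # Compare events on kind and export name only.
    return (ev.get("kind"), ev.get("name"))


def matched_prefix(canary, ours, floating_sids):
    """Return (matched_count, absorbed_per_side, first_divergence_or_None)."""
    i = j = matched = 0
    absorbed = {"canary": 0, "ours": 0}
    while i < len(canary) and j < len(ours):
        if is_floating(canary[i], floating_sids):
            absorbed["canary"] += 1   # skip without consuming an event on ours
            i += 1
        elif is_floating(ours[j], floating_sids):
            absorbed["ours"] += 1
            j += 1
        elif event_key(canary[i]) == event_key(ours[j]):
            matched += 1
            i += 1
            j += 1
        else:
            return matched, absorbed, (canary[i], ours[j])
    return matched, absorbed, None


def call(name):
    return {"kind": "import.call", "name": name}


E, L = call("RtlEnterCriticalSection"), call("RtlLeaveCriticalSection")
WAIT = {"kind": "wait.begin", "sid": SHARED_GLOBAL_SID}

slow = [E, WAIT, E, L, L]        # shape of jitter-1 / fresh c22
fast = [E, E, L, L]              # shape of jitter-2
ours = [E, L, call("NtClose")]   # shape of ours cold

res_slow = matched_prefix(slow, ours, {SHARED_GLOBAL_SID})
res_fast = matched_prefix(fast, ours, {SHARED_GLOBAL_SID})
print(res_slow[0], res_fast[0])
```

In this toy window both canary shapes reach the same matched count and the same divergent pair (nested Enter on canary, Leave on ours); only the absorbed counter differs.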
The structural divergence
After the C+21 absorber runs, the next event index on each side is:
- Canary: `import.call RtlEnterCriticalSection` (the nested second E at canary idx 104,610, post-absorption-aligned to ours idx 104,607).
- Ours: `import.call RtlLeaveCriticalSection` (the simple release at ours idx 104,607).
These are different guest control-flow paths. Both are correct executions of the SAME guest code under different scheduling assumptions:
- Canary path: tid=6 blocked on the dispatcher Event while another guest thread acquired the CS, mutated protected state (queue ptr / refcount / signaled flag), released, transferred the CS to tid=6. tid=6 woke, post-acquire branch reads MUTATED state, takes nested-cleanup path.
- Ours path: tid=1 (mapped from canary tid=6) was running monolithically under the cooperative scheduler. No other thread ran during what would have been the wait window. Post-acquire branch reads PRE-WAIT state (unchanged), takes simple-release path.
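The two paths can be reproduced with a toy model. This is a sketch under stated assumptions, not the guest code: the real protected state and branch condition are not shown in this note, so `state["mutated"]` stands in for the queue ptr / refcount / signaled flag, and `tid_trace` is a hypothetical function. The model treats "another thread ran" as independent of whether wait.begin was emitted, because the variance table shows the nested Enter in canary samples with and without the slow path.

```python
def tid_trace(contended, other_thread_ran):
    """Return the event pattern a thread emits in this toy model.

    contended        - outer acquire takes the slow path (emits wait.begin)
    other_thread_ran - another thread mutated the protected state before
                       this thread's post-acquire read
    """
    state = {"mutated": False}   # stand-in for the CS-protected state
    trace = ["E"]                # outer RtlEnterCriticalSection
    if contended:
        trace.append("wait.begin")
    if other_thread_ran:
        state["mutated"] = True
    if state["mutated"]:
        # Nested-cleanup path: second Enter, then two Leaves.
        trace += ["E", "L", "L"]
    else:
        # Simple-release path, as observed on ours cold.
        trace += ["L", "NtClose"]
    return " ".join(trace)


print(tid_trace(contended=True, other_thread_ran=True))
print(tid_trace(contended=False, other_thread_ran=True))
print(tid_trace(contended=False, other_thread_ran=False))
```

The model only covers the window shown in the variance table; what canary does after the nested path is outside that window and is not modeled here.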
Variance taxonomy
| variance dimension | observable | absorbable by current diff tool? | root cause |
|---|---|---|---|
| Whether wait.begin event fires | yes (event present/absent) | YES (C+21 absorber, shared-global SID) | host-OS scheduler decided contention/no-contention timing |
| Index offset in upstream events | yes (idx shifts ±3 across samples) | partial (C+21 absorbs ≤1 floating per side) | upstream contention propagates index drift |
| Whether nested Enter/Leave block fires | yes (E E L L vs E L) | NO (would cross reading-error #23) | post-wait state mutation by another thread; real guest control-flow |
| First-toucher tid for shared dispatcher | yes (varies tid=9, others) | YES (C+18 shared-global SID scheduling-invariant) | host-OS scheduler decided first-thread-touches-dispatcher |
| handle.create raw_handle_id | yes (differs across runs) | YES (SKIP_PAYLOAD_FIELDS) | canary stashes handle-table slot; ours uses dispatcher VA |
| KeQuerySystemTime returned value | yes (wallclock vs fixed) | partial (already-known void-export pattern from C+1) | canary wallclock vs ours fixed FILETIME |
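As an illustration of the payload-level row in the taxonomy, the sketch below drops skipped payload fields before two events are compared. Only the names `SKIP_PAYLOAD_FIELDS` and `raw_handle_id` come from this note. The set's full contents, the `payload` event layout, the `comparable` helper, and the sample values are assumptions invented for the example.

```python
# Assumed contents: the note names only raw_handle_id as a skipped field.
SKIP_PAYLOAD_FIELDS = frozenset({"raw_handle_id"})


def comparable(ev):
    """Reduce an event to the parts that should match across runs."""
    payload = ev.get("payload", {})
    kept = {k: v for k, v in payload.items() if k not in SKIP_PAYLOAD_FIELDS}
    # Sort by field name so the result does not depend on dict order.
    return (ev.get("kind"), tuple(sorted(kept.items())))


# Made-up values: one side stores a handle-table slot, the other a
# dispatcher VA, so the raw ids never agree.
canary_ev = {"kind": "handle.create",
             "payload": {"raw_handle_id": 4160, "type": "Event"}}
ours_ev = {"kind": "handle.create",
           "payload": {"raw_handle_id": 2147745792, "type": "Event"}}

print(comparable(canary_ev) == comparable(ours_ev))
```

This kind of field-level masking works for observation-side differences only; it cannot reconcile rows where the sequence of events itself differs.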
What this means for the plan
The C+21 absorber handles the observation-side jitter (the wait.begin event itself; the upstream index drift) up to the boundary of reading-error #23. Past 104,607, the variance becomes state-side: canary's tid=6 reads mutated protected state, ours's tid=1 doesn't. No event-level absorption can hide a different sequence of guest-code-executed instructions.
This is why the plan recommends approach H' (manifest replay): make ours produce the same state-side outcome (mutated CS state after a real wait) so that ours's tid=1 takes the same nested-cleanup path canary's tid=6 takes. The absorber stays unchanged; ours's events become structurally identical to canary's.
Fresh re-runs not performed
This session is plan-only — no fresh wine xenia_canary --mute=true cold runs. The C+22 jitter-1/2/3 + c21 + c22 samples are sufficient to characterize variance for plan-design purposes. Fresh re-runs will happen during Stage 0 spike and Stage 1 implementation per the validation criteria in plan.md.
Reading-error #32 status
MITIGATED at the diff-tool layer for shared-global SIDs (C+18) and wait.begin (C+21). Residual variance at 104,607 is OUT of #32 scope — it's state-mutation timing, addressed by the plan's Stage 3 forced-contention replay.