xenia-rs/audit-runs/scheduler-determinism-plan/canary-variance.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Canary Variance Characterization — Reading-Error #32

Source data

Re-analysis of the C+22 archived jitter jsonls + ours-cold.jsonl from xenia-rs/audit-runs/phase-c22-rtl-enter-leave-control-flow/. No fresh runs were done in this session; the C+22 samples (4 canary cold runs + 1 ours cold run), together with the archived C+21 sample, are sufficient to characterize the variance.

Files inspected

  • canary-cold-trunc.jsonl (494 MB, truncated to ~250k tid=6 events) — fresh c22
  • ours-cold.jsonl (28 MB, 121,569 events)
  • Archived: jitter-1, jitter-2, jitter-3 (referenced in C+22 memory + investigation.md)
  • cold-vs-cold-result.md — variance table

Variance summary at tid=6 idx 104,604..104,620

Pattern of import.call events (E = RtlEnterCriticalSection, L = RtlLeaveCriticalSection):

| sample | observed pattern | wait.begin slow-path? | notes |
|---|---|---|---|
| C+21 archived (jitter-2 equivalent) | E E L L | no | fast-path acquire, fast-path nested-acquire, two releases |
| canary jitter-1 | E wait.begin E L L | yes (between first E's call and return) | slow-path on the OUTER acquire |
| canary jitter-2 | E E L L | no | same as C+21 |
| canary jitter-3 | E E L L (shifted by +3 indices upstream) | no | upstream tid=6 events have different ordering |
| fresh c22 | E wait.begin E L L | yes | same shape as jitter-1 |
| ours cold | E L NtClose | no | NO nested acquire; releases and proceeds to close |
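
The shorthand in the table can be derived mechanically from a window of trace events. The following is a minimal sketch, assuming each jsonl record decodes to a dict with a `kind` field and, for import calls, a `name` field; that record shape and the helper names `pattern` and `call` are illustrative assumptions, not the actual trace schema.

```python
# Assumed (hypothetical) record shape: {"kind": "import.call", "name": ...}
# or {"kind": "wait.begin"}. The real jsonl schema may differ.
ABBREV = {
    "RtlEnterCriticalSection": "E",
    "RtlLeaveCriticalSection": "L",
}


def pattern(events):
    """Compress a window of trace events into the table's shorthand."""
    out = []
    for ev in events:
        if ev["kind"] == "import.call":
            # Unabbreviated imports (e.g. NtClose) are kept by name.
            out.append(ABBREV.get(ev["name"], ev["name"]))
        elif ev["kind"] == "wait.begin":
            out.append("wait.begin")
        # Any other event kind is ignored in this sketch.
    return " ".join(out)


def call(name):
    """Build a hypothetical import.call record."""
    return {"kind": "import.call", "name": name}


jitter_1 = [
    call("RtlEnterCriticalSection"),
    {"kind": "wait.begin"},
    call("RtlEnterCriticalSection"),
    call("RtlLeaveCriticalSection"),
    call("RtlLeaveCriticalSection"),
]
ours_cold = [
    call("RtlEnterCriticalSection"),
    call("RtlLeaveCriticalSection"),
    call("NtClose"),
]

print(pattern(jitter_1))   # E wait.begin E L L
print(pattern(ours_cold))  # E L NtClose
```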

Key observations

  1. All 5 canary samples have the second (nested) E, regardless of whether the outer acquire took the slow path. The nested Enter is canary-structural, not jitter.
  2. wait.begin presence varies: 2 of 5 canary samples emit it, 3 of 5 don't. The C+21 floating absorber correctly masks both cases via the shared-global SID 75ae880ec432eb36.
  3. Ours-cold takes a different control-flow path: no second E, no nested cleanup, proceeds straight to NtClose. This is RtlLeaveCriticalSection followed by NtClose on the Event handle that the CS was protecting.
  4. The C+21 floating-absorb engages correctly in all canary samples (floating_create (c/o) = 1/0, floating_wait (c/o) varies 0-3/0). Matched-prefix is invariant at 104,607 across all canary cold samples after absorption.
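
Observations 2 and 4 can be illustrated with a toy alignment. The sketch below is not the C+21 absorber itself: it treats an event as floating purely by its kind, whereas the real absorber is described above as keying on the shared-global SID. The function name `matched_prefix` and the shorthand-string event encoding are assumptions made for the example.

```python
def matched_prefix(canary, ours, floating=frozenset({"wait.begin"})):
    """Align two event streams, skipping floating events on either side.

    Returns (matched, absorbed_canary, absorbed_ours), where `matched`
    counts the events consumed on the ours side before the first
    divergence that absorption cannot explain.
    """
    i = j = 0
    absorbed_c = absorbed_o = 0
    while i < len(canary) and j < len(ours):
        if canary[i] == ours[j]:
            i += 1
            j += 1
        elif canary[i] in floating:
            i += 1            # absorb a floating event on the canary side
            absorbed_c += 1
        elif ours[j] in floating:
            j += 1            # absorb a floating event on the ours side
            absorbed_o += 1
        else:
            break             # structural divergence
    return j, absorbed_c, absorbed_o


slow = ["E", "wait.begin", "E", "L", "L"]   # jitter-1 / fresh c22 shape
fast = ["E", "E", "L", "L"]                 # jitter-2 / jitter-3 shape
ours = ["E", "L", "NtClose"]

print(matched_prefix(slow, ours))  # (1, 1, 0)
print(matched_prefix(fast, ours))  # (1, 0, 0)
```

In both cases the matched prefix has the same length, and the alignment stops at the nested E on the canary side against L on the ours side. Only the absorbed count differs, which mirrors the invariant matched-prefix reported above.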

The structural divergence

After the C+21 absorber runs, the next event index on each side is:

  • Canary: import.call RtlEnterCriticalSection (the nested second E at canary idx 104,610, post-absorption-aligned to ours idx 104,607).
  • Ours: import.call RtlLeaveCriticalSection (the simple release at ours idx 104,607).

These are different guest control-flow paths. Both are correct executions of the SAME guest code under different scheduling assumptions:

  • Canary path: tid=6 blocked on the dispatcher Event while another guest thread acquired the CS, mutated the protected state (queue ptr / refcount / signaled flag), released it, and handed the CS to tid=6. When tid=6 woke, its post-acquire branch read the MUTATED state and took the nested-cleanup path.
  • Ours path: tid=1 (mapped from canary tid=6) ran monolithically under the cooperative scheduler. No other thread ran during what would have been the wait window, so the post-acquire branch read the PRE-WAIT (unchanged) state and took the simple-release path.
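
The two paths can be reduced to a toy model in which the same routine emits a different Enter/Leave sequence depending only on whether another thread mutated the protected state first. This is a sketch under stated assumptions, not the guest code: the `pending` field, the `run_waiter` name, and the exact branch condition are all hypothetical.

```python
def run_waiter(other_thread_ran):
    """Toy model of the waiter; returns the Enter/Leave events it emits."""
    state = {"pending": 0}   # hypothetical CS-protected field
    events = []

    events.append("E")       # outer RtlEnterCriticalSection
    if other_thread_ran:
        # By the time the acquire returns, another thread has held the CS,
        # mutated the protected state, and released it.
        state["pending"] += 1

    if state["pending"]:     # post-acquire branch reads protected state
        events.append("E")   # nested acquire for the cleanup path
        state["pending"] = 0
        events.append("L")   # nested release
    events.append("L")       # outer RtlLeaveCriticalSection
    return events


print(run_waiter(True))   # ['E', 'E', 'L', 'L']  (canary-like)
print(run_waiter(False))  # ['E', 'L']            (ours-like)
```

The routine is identical in both calls; only the scheduling input differs, which is why the divergence is a property of state rather than of event observation.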

Variance taxonomy

| variance dimension | observable | absorbable by current diff tool? | root cause |
|---|---|---|---|
| Whether wait.begin event fires | yes (event present/absent) | YES (C+21 absorber, shared-global SID) | host-OS scheduler decided contention/no-contention timing |
| Index offset in upstream events | yes (idx shifts ±3 across samples) | partial (C+21 absorbs ≤1 floating per side) | upstream contention propagates index drift |
| Whether nested Enter/Leave block fires | yes (E E L L vs E L) | NO (would cross reading-error #23) | post-wait state mutation by another thread; real guest control-flow |
| First-toucher tid for shared dispatcher | yes (varies tid=9, others) | YES (C+18 shared-global SID scheduling-invariant) | host-OS scheduler decided first-thread-touches-dispatcher |
| handle.create raw_handle_id | yes (differs across runs) | YES (SKIP_PAYLOAD_FIELDS) | canary stashes handle-table slot; ours uses dispatcher VA |
| KeQuerySystemTime returned value | yes (wallclock vs fixed) | partial (already-known void-export pattern from C+1) | canary wallclock vs ours fixed FILETIME |
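
The handle.create row relies on dropping run-dependent payload fields before comparison. A minimal sketch of that idea follows, assuming events carry a `payload` dict. Only the `SKIP_PAYLOAD_FIELDS` name and the `raw_handle_id` field come from the table; the set's full contents, the `object_type` field, and the handle values are made up for illustration.

```python
# Contents assumed: only raw_handle_id is named in the taxonomy table.
SKIP_PAYLOAD_FIELDS = {"raw_handle_id"}


def normalize(event):
    """Return a comparable copy of an event with run-dependent fields dropped."""
    payload = {
        key: value
        for key, value in event.get("payload", {}).items()
        if key not in SKIP_PAYLOAD_FIELDS
    }
    return {"kind": event["kind"], "payload": payload}


# Hypothetical records: the handle values differ because one side stores a
# handle-table slot and the other a dispatcher VA.
canary_ev = {
    "kind": "handle.create",
    "payload": {"raw_handle_id": 0xF8000010, "object_type": "Event"},
}
ours_ev = {
    "kind": "handle.create",
    "payload": {"raw_handle_id": 0x82001234, "object_type": "Event"},
}

print(canary_ev == ours_ev)                        # False
print(normalize(canary_ev) == normalize(ours_ev))  # True
```

This kind of normalization only hides differences in what an event carries. It cannot help with the nested Enter/Leave row, where the two sides emit different events altogether.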

What this means for the plan

The C+21 absorber handles the observation-side jitter (the wait.begin event itself and the upstream index drift) up to the boundary of reading-error #23. Past 104,607, the variance becomes state-side: canary's tid=6 reads mutated protected state, while ours' tid=1 does not. No event-level absorption can hide a different sequence of executed guest instructions.

This is why the plan recommends approach H' (manifest replay): make ours produce the same state-side outcome (mutated CS state after a real wait) so that ours' tid=1 takes the same nested-cleanup path as canary's tid=6. The absorber stays unchanged; ours' events become structurally identical to canary's.

Fresh re-runs not performed

This session is plan-only: no fresh wine xenia_canary --mute=true cold runs were made. The C+22 jitter-1/2/3 + c21 + c22 samples are sufficient to characterize variance for plan-design purposes. Fresh re-runs will happen during the Stage 0 spike and Stage 1 implementation, per the validation criteria in plan.md.

Reading-error #32 status

MITIGATED at the diff-tool layer for shared-global SIDs (C+18) and wait.begin (C+21). Residual variance at 104,607 is OUT of #32 scope — it's state-mutation timing, addressed by the plan's Stage 3 forced-contention replay.