Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
70 lines
5.5 KiB
Markdown
# Canary Variance Characterization — Reading-Error #32
## Source data
Re-analysis of the archived C+22 jitter JSONL traces plus `ours-cold.jsonl` from `xenia-rs/audit-runs/phase-c22-rtl-enter-leave-control-flow/`. No fresh runs were done in this session; the C+22 samples (4 canary cold runs + 1 ours cold run) are sufficient to characterize the variance.
## Files inspected
- `canary-cold-trunc.jsonl` (494 MB, truncated to ~250k tid=6 events) — fresh c22
- `ours-cold.jsonl` (28 MB, 121,569 events)
- Archived: jitter-1, jitter-2, jitter-3 (referenced in C+22 memory + `investigation.md`)
- `cold-vs-cold-result.md` — variance table
## Variance summary at tid=6 idx 104,604..104,620
Pattern of `import.call` events (E = RtlEnterCriticalSection, L = RtlLeaveCriticalSection):
| sample | observed pattern | wait.begin slow-path? | notes |
|---|---|---|---|
| C+21 archived (jitter-2 equivalent) | E E L L | no | fast-path acquire, fast-path nested-acquire, two releases |
| canary jitter-1 | E **wait.begin** E L L | yes (between first E's call and return) | slow-path on the OUTER acquire |
| canary jitter-2 | E E L L | no | same as C+21 |
| canary jitter-3 | E E L L (shifted by +3 indices upstream) | no | upstream tid=6 events have different ordering |
| fresh c22 | E **wait.begin** E L L | yes | same shape as jitter-1 |
| **ours cold** | **E L NtClose** | no | NO nested acquire; releases and proceeds to close |
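
The compact patterns in the table can be derived mechanically from a window of trace events. A minimal sketch, assuming each JSONL record has a `kind` field and, for import calls, a `name` field; both field names and the `pattern`/`call` helpers are hypothetical and may not match the real trace schema:

```python
# Hypothetical event schema: {"kind": ..., "name": ...}. The real trace
# fields may differ; adjust the lookups accordingly.
ABBREV = {
    "RtlEnterCriticalSection": "E",
    "RtlLeaveCriticalSection": "L",
}

def pattern(events):
    """Render a window of events as the compact pattern used in the table."""
    out = []
    for ev in events:
        if ev["kind"] == "import.call":
            # Known CS calls get a one-letter code; other imports keep their name.
            out.append(ABBREV.get(ev["name"], ev["name"]))
        elif ev["kind"] == "wait.begin":
            out.append("wait.begin")
        # Any other kind is left out of the pattern.
    return " ".join(out)

def call(name):
    """Build a hypothetical import.call record."""
    return {"kind": "import.call", "name": name}

jitter_1 = [
    call("RtlEnterCriticalSection"),
    {"kind": "wait.begin"},
    call("RtlEnterCriticalSection"),
    call("RtlLeaveCriticalSection"),
    call("RtlLeaveCriticalSection"),
]
ours_cold = [
    call("RtlEnterCriticalSection"),
    call("RtlLeaveCriticalSection"),
    call("NtClose"),
]

print(pattern(jitter_1))   # E wait.begin E L L
print(pattern(ours_cold))  # E L NtClose
```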
## Key observations
1. **Canary 5/5 samples** have the second (nested) `E` regardless of whether the outer acquire took the slow path. The nested-Enter is canary-structural, not jitter.
2. **wait.begin presence varies**: 2 of 5 canary samples emit it, 3 of 5 don't. The C+21 floating absorber correctly masks both cases via the shared-global SID `75ae880ec432eb36`.
3. **Ours-cold takes a different control-flow path**: no second E, no nested cleanup, proceeds straight to NtClose. This is `RtlLeaveCriticalSection` followed by `NtClose` on the Event handle that the CS was protecting.
4. The C+21 floating-absorb engages correctly in all canary samples (`floating_create (c/o) = 1/0`, `floating_wait (c/o)` varies 0-3/0). Matched-prefix is invariant at 104,607 across all canary cold samples after absorption.
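
Observation 4 (matched-prefix invariant after absorption) can be illustrated with a toy prefix matcher. This is a sketch only, not the C+21 absorber: it treats events as plain strings and lets any `wait.begin` float, whereas the real tool keys on the shared-global SID. `FLOATING_KINDS` and `matched_prefix` are hypothetical names.

```python
FLOATING_KINDS = {"wait.begin"}  # assumption: event kinds allowed to float

def matched_prefix(canary, ours, floating=FLOATING_KINDS):
    """Count matched events before the first divergence that floating
    events cannot explain.

    Returns (matched, absorbed_canary, absorbed_ours).
    """
    i = j = matched = 0
    absorbed_c = absorbed_o = 0
    while i < len(canary) and j < len(ours):
        if canary[i] == ours[j]:
            i += 1
            j += 1
            matched += 1
        elif canary[i] in floating:
            i += 1          # skip a canary-only floating event
            absorbed_c += 1
        elif ours[j] in floating:
            j += 1          # skip an ours-only floating event
            absorbed_o += 1
        else:
            break           # structural divergence: stop here
    return matched, absorbed_c, absorbed_o

prefix = ["A", "B", "E"]  # stand-in for the shared upstream events
canary_slow = prefix + ["wait.begin", "E", "L", "L"]
canary_fast = prefix + ["E", "L", "L"]
ours = prefix + ["L", "NtClose"]

print(matched_prefix(canary_slow, ours))  # (3, 1, 0)
print(matched_prefix(canary_fast, ours))  # (3, 0, 0)
```

Both canary shapes stop at the same matched count; only the absorbed counter differs, which mirrors the `floating_wait (c/o)` variation noted above.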
## The structural divergence
After the C+21 absorber runs, the next event index on each side is:
- **Canary**: `import.call RtlEnterCriticalSection` (the nested second E at canary idx 104,610, post-absorption-aligned to ours idx 104,607).
- **Ours**: `import.call RtlLeaveCriticalSection` (the simple release at ours idx 104,607).
These are different guest control-flow paths. Both are correct executions of the SAME guest code under different scheduling assumptions:
- **Canary path**: tid=6 blocked on the dispatcher Event while another guest thread acquired the CS, mutated protected state (queue ptr / refcount / signaled flag), released, transferred the CS to tid=6. tid=6 woke, post-acquire branch reads MUTATED state, takes nested-cleanup path.
- **Ours path**: tid=1 (mapped from canary tid=6) was running monolithically under the cooperative scheduler. No other thread ran during what would have been the wait window. Post-acquire branch reads PRE-WAIT state (unchanged), takes simple-release path.
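
The two paths can be reproduced with a toy model. The guest's actual branch condition is not known from these notes, so `other_thread_ran` simply stands in for "another thread mutated the protected state before this thread's post-acquire read", and `outer_contended` for "the outer acquire took the slow path". All names here are hypothetical.

```python
def tid_trace(other_thread_ran, outer_contended=False):
    """Toy model of one thread's event pattern through the critical section."""
    state_mutated = other_thread_ran   # protected state as seen post-acquire
    trace = ["E"]                      # outer RtlEnterCriticalSection
    if outer_contended:
        trace.append("wait.begin")     # slow path: observation-side jitter only
    if state_mutated:
        trace += ["E", "L", "L"]       # nested cleanup, then outer release
    else:
        trace += ["L", "NtClose"]      # simple release, then close the handle
    return " ".join(trace)

print(tid_trace(other_thread_ran=True, outer_contended=True))   # E wait.begin E L L
print(tid_trace(other_thread_ran=True, outer_contended=False))  # E E L L
print(tid_trace(other_thread_ran=False))                        # E L NtClose
```

The model keeps the two dimensions separate on purpose: `wait.begin` changes only what is observed, while the state mutation changes which guest branch executes.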
## Variance taxonomy
| variance dimension | observable | absorbable by current diff tool? | root cause |
|---|---|---|---|
| Whether wait.begin event fires | yes (event present/absent) | YES (C+21 absorber, shared-global SID) | host-OS scheduler decided contention/no-contention timing |
| Index offset in upstream events | yes (idx shifts ±3 across samples) | partial (C+21 absorbs ≤1 floating per side) | upstream contention propagates index drift |
| Whether nested Enter/Leave block fires | yes (E E L L vs E L) | NO (would cross reading-error #23) | post-wait state mutation by another thread; real guest control-flow |
| First-toucher tid for shared dispatcher | yes (varies tid=9, others) | YES (C+18 shared-global SID scheduling-invariant) | host-OS scheduler decided first-thread-touches-dispatcher |
| handle.create raw_handle_id | yes (differs across runs) | YES (SKIP_PAYLOAD_FIELDS) | canary stashes handle-table slot; ours uses dispatcher VA |
| KeQuerySystemTime returned value | yes (wallclock vs fixed) | partial (already-known void-export pattern from C+1) | canary wallclock vs ours fixed FILETIME |
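
The `SKIP_PAYLOAD_FIELDS` row amounts to comparing events on a reduced key. A minimal sketch, assuming events carry a `kind` and a `payload` dict; the set contents, the `comparable` helper, the `type` field, and the handle values below are illustrative assumptions, not the diff tool's actual definitions:

```python
# Assumed contents; the real SKIP_PAYLOAD_FIELDS set may list other fields.
SKIP_PAYLOAD_FIELDS = {"raw_handle_id"}

def comparable(event):
    """Reduce an event to the parts that should match across emulators."""
    payload = {
        k: v
        for k, v in event.get("payload", {}).items()
        if k not in SKIP_PAYLOAD_FIELDS
    }
    # Sort by key so dict insertion order does not affect equality.
    return (event["kind"], tuple(sorted(payload.items())))

# Hypothetical records: the handle values are made up for illustration.
canary_ev = {"kind": "handle.create",
             "payload": {"raw_handle_id": 0xF800_0040, "type": "Event"}}
ours_ev = {"kind": "handle.create",
           "payload": {"raw_handle_id": 0x8201_0000, "type": "Event"}}

print(comparable(canary_ev) == comparable(ours_ev))  # True
```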
## What this means for the plan
The C+21 absorber handles the *observation-side* jitter (the wait.begin event itself; the upstream index drift) up to the boundary of reading-error #23. Past 104,607, the variance becomes *state-side*: canary's tid=6 reads mutated protected state; tid=1 on ours does not. No event-level absorption can hide a different sequence of executed guest instructions.

This is why the plan recommends approach H' (manifest replay): make ours produce the same state-side outcome (mutated CS state after a real wait) so that tid=1 on ours takes the same nested-cleanup path that canary's tid=6 takes. The absorber stays unchanged; the events from ours become structurally identical to canary's.
## Fresh re-runs not performed
This session is plan-only — no fresh `wine xenia_canary --mute=true` cold runs. The C+22 jitter-1/2/3 + c21 + c22 samples are sufficient to characterize variance for plan-design purposes. Fresh re-runs will happen during Stage 0 spike and Stage 1 implementation per the validation criteria in `plan.md`.
## Reading-error #32 status
**MITIGATED** at the diff-tool layer for shared-global SIDs (C+18) and wait.begin (C+21). Residual variance at 104,607 is OUT of #32 scope — it's state-mutation timing, addressed by the plan's Stage 3 forced-contention replay.