xenia-rs/audit-runs/phase-c21-wait-begin-floating-absorb/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+21 investigation — wait.begin floating-absorb (2026-05-14)
## Framing (extends C+20 reading-error #32)
C+20 escalated the divergence at canary tid=6 idx=104,606 to "scheduler
determinism" because the cross-3-cold-run canary jitter survey showed
the wait.begin was **host-scheduler-driven** in canary itself:
| jitter | idx 104,606 event |
|--------|----------------------------------------------------------------|
| 1 | `wait.begin sid=75ae880ec432eb36` |
| 2 | `kernel.return RtlEnterCriticalSection` (fast path — matches ours) |
| 3 | offset-shifted; wait.begin at idx 104,603 with different SID |
The diff harness's per-tid_event_idx matching anchors to whichever
canary cold sample is chosen. The bug is observational: ours'
behavior is structurally equivalent to canary's fast path, and the
divergence is induced by canary's slow-path entry under contention,
which does not reproduce in ours.
C+20 deferred a fix because the scope said "no scheduler
determinism". C+21 takes the lighter-weight option per the prompt:
**extend the C+18 floating-absorb pattern to wait.begin events
referencing shared-global SIDs**. This works because:
1. The SIDs at issue are shared-global dispatcher SIDs (same
pointer-derived recipe as the `KEVENT`/`KSEMAPHORE` cases C+18
addressed).
2. The wait.begin events themselves are observation-side artifacts of
thread-scheduling contention — they belong to the same harness-
observation-error class as the floating handle.create events.
## Verified observational
Checked against 3 archived canary cold jitter jsonls plus a fresh C+21
canary cold run captured under wiped-cache conditions:
```
SID `75ae880ec432eb36`:
- canary-jitter-1: handle.create on tid=9 idx=295; wait.begin on
tid=6/9/10/18 (15× total)
- canary-jitter-2: similar multi-tid usage (NOT at idx 104,606)
- canary-jitter-3: similar; wait.begin shifted to idx 104,603
- canary-cold-c21 (fresh): same SID pattern; idx 104,606 fast-paths
- ours-cold-c19: SID never appears (contention never reproduces)
```
Multi-tid SID usage on tids 6/9/10/18 (or 6/9/10/16/17/18/26 for the
related SID `a25a16a4f6f547aa`) is a robust shared-global signature.
The canary `EmitHandleCreateSharedGlobal` (`event_log.cc:435`)
asymmetry — it hashes the dispatcher VA but stashes
`object->handle()` as raw_handle_id — means canary's shared-global
handle.create events are NOT self-recognizable by the C+18 recipe
check alone. The C+21 fix adds a complementary cross-tid usage
heuristic that detects them through their multi-tid presence in
either handle.create OR wait.begin events.
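A minimal sketch of the cross-tid usage heuristic, not the actual `diff_events.py` code. The event shape is an assumption: each event is a dict with a `kind` field, handle.create carries a hypothetical `semantic_id` field, and wait.begin carries `handles_semantic_ids`. The helper name `cross_tid_sids` is also hypothetical.

```python
from collections import defaultdict

def cross_tid_sids(events_by_tid, min_tids=2):
    """Return SIDs referenced by handle.create or wait.begin on min_tids+ distinct tids.

    Assumed event shape (hypothetical): {"kind": ..., "semantic_id": ...} for
    handle.create and {"kind": ..., "handles_semantic_ids": [...]} for wait.begin.
    """
    tids_per_sid = defaultdict(set)
    for tid, events in events_by_tid.items():
        for ev in events:
            if ev["kind"] == "handle.create":
                sids = [ev["semantic_id"]]
            elif ev["kind"] == "wait.begin":
                sids = ev.get("handles_semantic_ids", [])
            else:
                continue  # other event kinds carry no SID of interest here
            for sid in sids:
                tids_per_sid[sid].add(tid)
    return {sid for sid, tids in tids_per_sid.items() if len(tids) >= min_tids}

# Synthetic data: one SID created on tid=9 and waited on from tid=6,
# plus a SID that only ever appears on a single tid.
canary = {
    9: [{"kind": "handle.create", "semantic_id": "75ae880ec432eb36"}],
    6: [
        {"kind": "wait.begin", "handles_semantic_ids": ["75ae880ec432eb36"]},
        {"kind": "handle.create", "semantic_id": "per-thread-sid"},
    ],
}
shared = cross_tid_sids(canary)
```

Here `shared` contains only `75ae880ec432eb36`; the single-tid SID stays strict. Running the same pass over both engines and taking the union would give the "either engine" behavior.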
## The fix (diff tool only — no engine changes)
`xenia-rs/tools/diff-events/diff_events.py`:
1. `collect_shared_global_sids(canary_by_tid, ours_by_tid)`: new
pre-pass union of (a) recipe-matching handle.create SIDs (C+18)
AND (b) any SID referenced by handle.create OR wait.begin on 2+
distinct tids in either engine (C+21 cross-tid heuristic).
2. `is_shared_global_wait_begin(ev, shared_sids)`: classifies a
wait.begin event as floating if ANY of its
`handles_semantic_ids` is in `shared_sids` (covers `wait_type=any`
and `wait_type=all`).
3. `diff_one_tid`: extends the two-pointer walk to absorb floating
wait.begin events on kind mismatch, mirroring the C+18
handle.create absorption logic. Per-thread waits remain strict —
only shared-global waits float.
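The classification in step 2 could look roughly like the following. This is a sketch under assumptions, not the shipped implementation: the function name and the `handles_semantic_ids` / `wait_type` fields come from the description above, while the `kind` field and the overall dict layout are hypothetical.

```python
def is_shared_global_wait_begin(ev, shared_sids):
    """True if ev is a wait.begin touching at least one shared-global SID.

    The wait_type ("any" or "all") is deliberately not inspected: a single
    shared-global handle in the wait set is enough to treat the wait as floating.
    """
    if ev.get("kind") != "wait.begin":
        return False
    return any(sid in shared_sids for sid in ev.get("handles_semantic_ids", []))

shared_sids = {"75ae880ec432eb36", "a25a16a4f6f547aa"}

# A multi-object wait mixing one shared-global SID with a private one.
wait_any = {
    "kind": "wait.begin",
    "wait_type": "any",
    "handles_semantic_ids": ["private-sid", "a25a16a4f6f547aa"],
}
# A per-thread wait: no handle is shared-global, so it stays strict.
wait_private = {
    "kind": "wait.begin",
    "wait_type": "all",
    "handles_semantic_ids": ["private-sid"],
}
# A non-wait event that happens to carry the same field is never floating.
not_a_wait = {"kind": "handle.create", "handles_semantic_ids": ["75ae880ec432eb36"]}
```

With these inputs only `wait_any` classifies as floating; `wait_private` and `not_a_wait` do not.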
Engine source UNCHANGED. Wire format UNCHANGED (`schema_version=1`
holds; payload structure is identical).
Total LOC: ~140 lines additive across `diff_events.py`,
`test_diff_events.py`, and `schema-v1.md`. 16 new diff-tool test
assertions on top of the existing 14 — 30 total, all PASS.
## 3-jitter verification (per RE class #32 discipline)
Pre-C+21 jitter-1 result (from C+19/C+20 baseline): tid=6→1 main
matched 102,553 (C+18) or 104,606 (C+20 — different SID).
Post-C+21:
| run               | tid=6→1 matched | floating_wait (canary / ours) |
|-------------------|-----------------|-------------------------------|
| jitter-1          | **104,607**     | 1 / 0                         |
| jitter-2          | **104,607**     | 0 / 0                         |
| jitter-3          | **104,607**     | 3 / 0                         |
| fresh cold-c21    | **104,607**     | 0 / 0                         |
All four canary cold samples converge on the SAME matched-prefix
(104,607). The C+21 absorb is doing exactly what it should:
- jitter-1 contended → 1 wait.begin absorbed → advance past 104,606.
- jitter-2 fast-pathed → 0 absorbed; matches strictly.
- jitter-3 had 3 absorbable contended waits scattered → 3 absorbed.
- fresh c21 fast-pathed → 0 absorbed; matches strictly.
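The convergence above can be illustrated with a simplified two-pointer walk over synthetic events. This is a sketch only: the real `diff_one_tid` is not reproduced here, and the `kind` / `name` fields, the helper names, and the equality rule are assumptions made for the example.

```python
def is_floating_wait(ev, shared_sids):
    """A wait.begin floats if any of its handles is a shared-global SID."""
    return ev["kind"] == "wait.begin" and any(
        sid in shared_sids for sid in ev.get("handles_semantic_ids", [])
    )

def matched_prefix(canary, ours, shared_sids):
    """Return (matched, absorbed_canary, absorbed_ours) for one tid pair."""
    i = j = matched = absorbed_c = absorbed_o = 0
    while i < len(canary) and j < len(ours):
        c, o = canary[i], ours[j]
        if c["kind"] == o["kind"] and c.get("name") == o.get("name"):
            matched += 1
            i += 1
            j += 1
        elif is_floating_wait(c, shared_sids):
            absorbed_c += 1  # skip the canary-side contended wait
            i += 1
        elif is_floating_wait(o, shared_sids):
            absorbed_o += 1  # skip the ours-side contended wait
            j += 1
        else:
            break  # real divergence: stop the matched prefix here
    return matched, absorbed_c, absorbed_o

def ev(kind, name=None, sids=None):
    e = {"kind": kind, "name": name}
    if sids is not None:
        e["handles_semantic_ids"] = sids
    return e

shared_sids = {"75ae880ec432eb36"}
cs = "RtlEnterCriticalSection"
fast = [ev("import.call", cs), ev("kernel.call", cs), ev("kernel.return", cs)]
contended = [ev("import.call", cs), ev("kernel.call", cs),
             ev("wait.begin", sids=["75ae880ec432eb36"]), ev("kernel.return", cs)]
private = [ev("import.call", cs), ev("kernel.call", cs),
           ev("wait.begin", sids=["private-sid"]), ev("kernel.return", cs)]
ours = fast

print(matched_prefix(contended, ours, shared_sids))  # (3, 1, 0)
print(matched_prefix(fast, ours, shared_sids))       # (3, 0, 0)
print(matched_prefix(private, ours, shared_sids))    # (2, 0, 0)
```

The contended and fast-path canary samples reach the same matched count, differing only in how many waits were absorbed, while a wait on a non-shared SID still stops the walk.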
Sister chains UNCHANGED:
| chain | C+19/C+20 | C+21 | delta |
|---------------|-----------|---------|-------|
| 4 → 11 | 11 | 11 | 0 |
| 7 → 2 | 32 | 32 | 0 |
| 12 → 7 | 3 | 3 | 0 |
| 14 → 9 | 41 | 41 | 0 |
| 15 → 10 | 16 | 16 | 0 |
The `floating_create` column (not reproduced in the tables above)
shows `0/1` on tid=15→10 (C+18's fix still operating) and `1/0` on
tid=6→1 of jitter-3 (jitter-3 had an extra canary-side handle.create
that C+21's recipe match detected). No spurious absorption.
## The new divergence beyond the jitter cloud
At canary tid=6 idx 104,607 (ours tid=1 idx 104,607 post-absorb):
```
[104604] ours+canary import.call RtlEnterCriticalSection
[104605] ours+canary kernel.call RtlEnterCriticalSection
[104606] ours+canary kernel.return RtlEnterCriticalSection (both fast-path)
[104607] canary import.call RtlEnterCriticalSection (ANOTHER CS)
[104607] ours import.call RtlLeaveCriticalSection (leaves CS)
```
This is a **REAL structural divergence** — canary entered a different
CS while ours moved on to leave one. Not in scope for C+21. Will be
addressed in C+22 framing.
## Reading-error class #32 — locked in
The C+20 documentation introduced #32. C+21 confirms its taxonomy
applies broadly:
> **#32 Canary itself is non-deterministic across cold runs in
> contention-dependent regions. Single-canary-cold-run sampling is
> unreliable for matched-prefix in those regions.**
The C+21 fix is the diff-tool counter-measure: SIDs that are
referenced by multi-tid usage are floating; their wait.begin events
get the same observation-side treatment as their handle.create
events. The matched-prefix metric becomes **deterministic across
canary cold samples** within shared-global contention windows.
## Cascade outcome
- A=design floating-absorb extension: PASS.
- B=implement + test in diff tool: PASS (~140 LOC, 16 new tests).
- C=verify across all 3 jitter jsonls: PASS (all yield 104,607).
- D=fresh canary measurement: matched-prefix > 104,606: PASS (104,607).
## Scope adherence
- Engine sources: UNCHANGED.
- Diff tool: `diff_events.py` + `test_diff_events.py` only.
- Docs: `schema-v1.md` v1.3 + this audit-run dir.
- GPU/audio/HID: untouched.
- D-NEW-2 (`KeWaitForSingleObject` timeout_ns mismatch on tid=12→7
idx=3): NOT fixed in C+21 — still the next downstream divergence
on the tid=12→7 chain (matched=3).