xenia-rs/audit-runs/phase-c21-wait-begin-floating-absorb/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+21 investigation — wait.begin floating-absorb (2026-05-14)

Framing (extends C+20 reading-error #32)

C+20 escalated the divergence at canary tid=6 idx=104,606 to "scheduler determinism" because the cross-3-cold-run canary jitter survey showed the wait.begin was host-scheduler-driven in canary itself:

| jitter | idx 104,606 event |
|--------|-------------------|
| 1 | wait.begin sid=75ae880ec432eb36 |
| 2 | kernel.return RtlEnterCriticalSection (fast path, matches ours) |
| 3 | offset-shifted; wait.begin at idx 104,603 with different SID |

The diff harness's per-tid_event_idx matching anchors to whichever canary cold sample is chosen. The bug is observational: ours behaves in a way structurally equivalent to canary's fast path, and the divergence is induced by canary's slow-path entry under contention, which does not reproduce in ours.
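A minimal sketch of why index-anchored matching is sensitive to the chosen canary sample. The event shape (`kind`, `name`) and the helper names `ev` and `strict_matched_prefix` are hypothetical stand-ins for illustration, not the real schema-v1 layout or the actual diff_events.py code.

```python
# Hypothetical event shape: {"kind": ..., "name": ...}; not the real schema-v1 layout.
def ev(kind, name=None):
    return {"kind": kind, "name": name}

def strict_matched_prefix(canary, ours):
    """Count leading positions where both streams carry the same event."""
    matched = 0
    for c, o in zip(canary, ours):
        if (c["kind"], c["name"]) != (o["kind"], o["name"]):
            break
        matched += 1
    return matched

CS = "RtlEnterCriticalSection"
ours = [ev("import.call", CS), ev("kernel.call", CS), ev("kernel.return", CS)]

# Canary sample that took the fast path: identical to ours.
canary_fast = list(ours)

# Canary sample that hit contention: an extra wait.begin before the return.
canary_contended = [ev("import.call", CS), ev("kernel.call", CS),
                    ev("wait.begin"), ev("kernel.return", CS)]

print(strict_matched_prefix(canary_fast, ours))
print(strict_matched_prefix(canary_contended, ours))
```

In this toy case the fast-path sample matches all three events, while the contended sample stops at the wait.begin, even though ours behaved identically in both comparisons.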

C+20 deferred a fix because the scope said "no scheduler determinism". C+21 takes the lighter-weight option per the prompt: extend the C+18 floating-absorb pattern to wait.begin events referencing shared-global SIDs. This works because:

  1. The SIDs at issue are shared-global dispatcher SIDs (same pointer-derived recipe as the KEVENT/KSEMAPHORE cases C+18 addressed).
  2. The wait.begin events themselves are observation-side artifacts of thread-scheduling contention; they belong to the same harness-observation-error class as the floating handle.create events.

Verified observational

Checked against 3 archived canary cold jitter jsonls plus a fresh C+21 canary cold run captured under wiped-cache conditions:

SID `75ae880ec432eb36`:
- canary-jitter-1: handle.create on tid=9 idx=295; wait.begin on
  tid=6/9/10/18 (15× total)
- canary-jitter-2: similar multi-tid usage (NOT at idx 104,606)
- canary-jitter-3: similar; wait.begin shifted to idx 104,603
- canary-cold-c21 (fresh): same SID pattern; idx 104,606 fast-paths
- ours-cold-c19: SID never appears (contention never reproduces)

Multi-tid SID usage on tids 6/9/10/18 (or 6/9/10/16/17/18/26 for the related SID a25a16a4f6f547aa) is a robust shared-global signature.

The canary EmitHandleCreateSharedGlobal (event_log.cc:435) asymmetry — it hashes the dispatcher VA but stashes object->handle() as raw_handle_id — means canary's shared-global handle.create events are NOT self-recognizable by the C+18 recipe check alone. The C+21 fix adds a complementary cross-tid usage heuristic that detects them through their multi-tid presence in either handle.create OR wait.begin events.

The fix (diff tool only — no engine changes)

xenia-rs/tools/diff-events/diff_events.py:

  1. collect_shared_global_sids(canary_by_tid, ours_by_tid): new pre-pass union of (a) recipe-matching handle.create SIDs (C+18) AND (b) any SID referenced by handle.create OR wait.begin on 2+ distinct tids in either engine (C+21 cross-tid heuristic).
  2. is_shared_global_wait_begin(ev, shared_sids): classifies a wait.begin event as floating if ANY of its handles_semantic_ids is in shared_sids (covers wait_type=any and wait_type=all).
  3. diff_one_tid: extends the two-pointer walk to absorb floating wait.begin events on kind mismatch, mirroring the C+18 handle.create absorption logic. Per-thread waits remain strict — only shared-global waits float.
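The cross-tid half of steps 1 and 2 can be sketched as below. This is an illustration under assumptions, not the code in diff_events.py: the `semantic_id` field on handle.create events and the helper names `sids_of` and `collect_cross_tid_sids` are hypothetical, and the C+18 recipe-matching branch (a) is deliberately left out because its recipe is not described here. Only `handles_semantic_ids` and the `is_shared_global_wait_begin(ev, shared_sids)` signature come from the notes above.

```python
from collections import defaultdict

def sids_of(ev):
    """Return the SIDs an event references (field names are assumed)."""
    if ev.get("kind") == "handle.create":
        sid = ev.get("semantic_id")  # hypothetical field name
        return [sid] if sid else []
    if ev.get("kind") == "wait.begin":
        return list(ev.get("handles_semantic_ids", []))
    return []

def collect_cross_tid_sids(*engines_by_tid, min_tids=2):
    """SIDs seen on min_tids or more distinct tids within any one engine."""
    shared = set()
    for by_tid in engines_by_tid:
        tids_by_sid = defaultdict(set)
        for tid, events in by_tid.items():
            for ev in events:
                for sid in sids_of(ev):
                    tids_by_sid[sid].add(tid)
        shared |= {s for s, tids in tids_by_sid.items() if len(tids) >= min_tids}
    return shared

def is_shared_global_wait_begin(ev, shared_sids):
    """A wait.begin floats if ANY of its SIDs is shared-global."""
    return ev.get("kind") == "wait.begin" and any(
        sid in shared_sids for sid in ev.get("handles_semantic_ids", []))

canary_by_tid = {
    6:  [{"kind": "wait.begin", "handles_semantic_ids": ["A"]}],
    9:  [{"kind": "handle.create", "semantic_id": "A"}],
    10: [{"kind": "wait.begin", "handles_semantic_ids": ["B"]}],
}
ours_by_tid = {1: [{"kind": "wait.begin", "handles_semantic_ids": ["B"]}]}

shared = collect_cross_tid_sids(canary_by_tid, ours_by_tid)
print(shared)
```

Tids are counted per engine rather than pooled, since canary and ours number their threads differently; in the toy data SID `A` qualifies (tids 6 and 9 in canary) while `B` appears on only one tid in each engine and stays strict.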

Engine source UNCHANGED. Wire format UNCHANGED (schema_version=1 holds; payload structure is identical).

Total LOC: ~140 lines additive across diff_events.py, test_diff_events.py, and schema-v1.md. 16 new diff-tool test assertions on top of the existing 14 — 30 total, all PASS.

3-jitter verification (per RE class #32 discipline)

Pre-C+21 jitter-1 result (from C+19/C+20 baseline): tid=6→1 main matched 102,553 (C+18) or 104,606 (C+20 — different SID).

Post-C+21:

| run | tid=6→1 matched | floating_wait (c/o) |
|-----|-----------------|---------------------|
| jitter-1 | 104,607 | 1 / 0 |
| jitter-2 | 104,607 | 0 / 0 |
| jitter-3 | 104,607 | 3 / 0 |
| fresh cold-c21 | 104,607 | 0 / 0 |

All four canary cold samples converge on the SAME matched-prefix (104,607). The C+21 absorb is doing exactly what it should:

  • jitter-1 contended → 1 wait.begin absorbed → advance past 104,606.
  • jitter-2 fast-pathed → 0 absorbed; matches strictly.
  • jitter-3 had 3 absorbable contended waits scattered → 3 absorbed.
  • fresh c21 fast-pathed → 0 absorbed; matches strictly.
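The per-run behavior listed above can be reproduced in miniature with a two-pointer walk that skips floating waits on a mismatch. This is a simplified sketch, not the real diff_one_tid: the event shape, the `key` comparison, and the choice to count only paired events as matched are assumptions made for illustration.

```python
def key(ev):
    """Comparison key; a simplified stand-in for the tool's real event equality."""
    return (ev["kind"], ev.get("name"), tuple(ev.get("handles_semantic_ids", [])))

def is_floating_wait(ev, shared_sids):
    return ev["kind"] == "wait.begin" and any(
        sid in shared_sids for sid in ev.get("handles_semantic_ids", []))

def diff_one_tid(canary, ours, shared_sids):
    """Return (matched, absorbed_canary, absorbed_ours) for one thread pair."""
    i = j = matched = absorbed_c = absorbed_o = 0
    while i < len(canary) and j < len(ours):
        c, o = canary[i], ours[j]
        if key(c) == key(o):
            i += 1
            j += 1
            matched += 1
        elif is_floating_wait(c, shared_sids):
            i += 1            # skip a canary-side shared-global wait
            absorbed_c += 1
        elif is_floating_wait(o, shared_sids):
            j += 1            # skip an ours-side shared-global wait
            absorbed_o += 1
        else:
            break             # genuine divergence: stop here
    return matched, absorbed_c, absorbed_o

CS = "RtlEnterCriticalSection"
base = [{"kind": "import.call", "name": CS},
        {"kind": "kernel.call", "name": CS},
        {"kind": "kernel.return", "name": CS},
        {"kind": "import.call", "name": "RtlLeaveCriticalSection"}]

def with_wait(sid):
    """Copy of base with a wait.begin on `sid` inserted before the return."""
    wait = {"kind": "wait.begin", "handles_semantic_ids": [sid]}
    return base[:2] + [wait] + base[2:]

shared = {"S"}
print(diff_one_tid(base, base, shared))            # fast-path sample
print(diff_one_tid(with_wait("S"), base, shared))  # contended, shared-global SID
print(diff_one_tid(with_wait("P"), base, shared))  # per-thread SID stays strict
```

In this toy setup the fast-path and contended samples both reach the same matched count, differing only in the absorbed counter, while a wait on a SID outside `shared` still ends the walk at the mismatch.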

Sister chains UNCHANGED:

| chain | C+19/C+20 | C+21 | delta |
|-------|-----------|------|-------|
| 4 → 11 | 11 | 11 | 0 |
| 7 → 2 | 32 | 32 | 0 |
| 12 → 7 | 3 | 3 | 0 |
| 14 → 9 | 41 | 41 | 0 |
| 15 → 10 | 16 | 16 | 0 |

The floating_create column shows 0/1 on tid=15→10 (C+18's fix still operating) and 1/0 on tid=6→1 of jitter-3 (jitter-3 had an extra canary-side handle.create that C+21's recipe match detected). No spurious absorption.

The new divergence beyond the jitter cloud

At canary tid=6 idx 104,607 (ours tid=1 idx 104,607 post-absorb):

[104604] ours+canary import.call RtlEnterCriticalSection
[104605] ours+canary kernel.call RtlEnterCriticalSection
[104606] ours+canary kernel.return RtlEnterCriticalSection  (both fast-path)
[104607] canary    import.call RtlEnterCriticalSection       (ANOTHER CS)
[104607] ours      import.call RtlLeaveCriticalSection       (leaves CS)

This is a REAL structural divergence — canary entered a different CS while ours moved on to leave one. Not in scope for C+21. Will be addressed in C+22 framing.

Reading-error class #32 — locked in

The C+20 documentation introduced #32. C+21 confirms its taxonomy applies broadly:

#32 Canary itself is non-deterministic across cold runs in contention-dependent regions. Single-canary-cold-run sampling is unreliable for matched-prefix in those regions.

The C+21 fix is the diff-tool counter-measure: SIDs that are referenced by multi-tid usage are floating; their wait.begin events get the same observation-side treatment as their handle.create events. The matched-prefix metric becomes deterministic across canary cold samples within shared-global contention windows.

Cascade outcome

  • A=design floating-absorb extension: PASS.
  • B=implement + test in diff tool: PASS (~140 LOC, 16 new tests).
  • C=verify across all 3 jitter jsonls: PASS, all yield 104,607.
  • D=fresh canary measurement: matched-prefix > 104,606: PASS (104,607).

Scope adherence

  • Engine sources: UNCHANGED.
  • Diff tool: diff_events.py + test_diff_events.py only.
  • Docs: schema-v1.md v1.3 + this audit-run dir.
  • GPU/audio/HID: untouched.
  • D-NEW-2 (KeWaitForSingleObject timeout_ns mismatch on tid=12→7 idx=3): NOT fixed in C+21 — still the next downstream divergence on the tid=12→7 chain (matched=3).