xenia-rs/audit-runs/phase-c17-keWait-native-object/cold-vs-cold-result.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+17 cold-vs-cold result (2026-05-14)

Matched-prefix table

| canary_tid | ours_tid | C+16 | C+17 | delta | first_divergence_at | kind |
|---|---|---|---|---|---|---|
| 6 | 1 | 102,171 | 102,553 | +382 | 102,553 | NtDuplicateObject no handle.create (NEW-1) |
| 4 | 11 | 8 | 11 | +3 | | no divergence in 11 events (ours stalls) |
| 7 | 2 | 30 | 32 | +2 | | no divergence in 32 events |
| 12 | 7 | 2 | 3 | +1 | 3 | timeout_ns differs in wait.begin (NEW-2) |
| 14 | 9 | 2 | 41 | +39 | 41 | unrelated XAudioGetVoiceCategoryVolumeChangeMask |
| 15 | 10 | 16 | 2 | -14 | 2 | ordering: ours emits handle.create on first thread-touch of shared dispatcher (NEW-3) |

Main chain advanced +382 (D-2/D-3/D-4 root cause resolved). 4 of 5 sister chains advanced. The tid=15→10 chain regressed by 14 events due to a cross-thread-caching ordering side effect (see broad-impact.md / NEW-3). The underlying state alignment stems from the same root cause, so the regression is "observation-side": canary's GetNativeObject is process-global, so the adoption happens on whichever thread touches the dispatcher first.
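The matched-prefix and delta columns can be read as the output of a simple pairwise comparison. The following is a minimal sketch of that idea, not the actual diff tool: the `Event` shape, `matched_prefix`, `delta`, and the `ev` helper are hypothetical names, and the real tool compares more fields (and skips SIDs, per Gate 3). It assumes 0-based event indices, which is consistent with the table, where the matched count equals `first_divergence_at`.

```rust
/// Simplified event: only the fields this sketch compares (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub kind: String, // e.g. "import.call", "handle.create"
    pub name: String, // e.g. "NtDuplicateObject"; empty when not applicable
}

/// Outcome of comparing one canary thread chain against the paired chain in ours.
#[derive(Debug, PartialEq, Eq)]
pub struct PrefixResult {
    /// Number of leading events that agree in both chains.
    pub matched: usize,
    /// 0-based index of the first disagreeing pair, or None when the shorter
    /// chain ends before any disagreement ("no divergence in N events").
    pub first_divergence_at: Option<usize>,
}

/// Convenience constructor for the sketch.
pub fn ev(kind: &str, name: &str) -> Event {
    Event { kind: kind.to_string(), name: name.to_string() }
}

/// Count the common prefix of two chains and report where they first differ.
pub fn matched_prefix(canary: &[Event], ours: &[Event]) -> PrefixResult {
    let matched = canary
        .iter()
        .zip(ours.iter())
        .take_while(|(c, o)| c == o)
        .count();
    let shorter = canary.len().min(ours.len());
    let first_divergence_at = if matched < shorter { Some(matched) } else { None };
    PrefixResult { matched, first_divergence_at }
}

/// Signed change in matched prefix between two phases (e.g. C+16 to C+17).
pub fn delta(previous: usize, current: usize) -> i64 {
    current as i64 - previous as i64
}
```

With this reading, `delta(102_171, 102_553)` gives the +382 of the main chain and `delta(16, 2)` gives the -14 of the tid=15→10 chain.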

New first divergence on main (idx=102,553)

canary: [102551] import.call NtDuplicateObject
ours:   [102551] import.call NtDuplicateObject
canary: [102552] kernel.call NtDuplicateObject
ours:   [102552] kernel.call NtDuplicateObject
canary: [102553] handle.create sid=df686b147b291902 (object_type=1)
ours:   [102553] kernel.return NtDuplicateObject
canary: [102554] kernel.return NtDuplicateObject
ours:   [102554] import.call RtlEnterCriticalSection

Canary's NtDuplicateObject_entry calls ObjectTable::DuplicateHandle, which fires AddHandle for the new slot and thereby emits handle.create. Our nt_duplicate_object short-circuits via handle aliasing (AUDIT-062's dup_id=source_id design) and does NOT emit a new handle.create. This is D-NEW-1 (HIGH), the first C+18 target.
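The difference between the two duplicate strategies can be illustrated with a toy handle table. This is a sketch under stated assumptions, not the code of either emulator: `HandleTable`, `add_handle`, `duplicate_new_slot`, and `duplicate_aliased` are hypothetical names, the handle step of 4 is an assumption, and only the 0x1000 starting value is taken from this note.

```rust
use std::collections::HashMap;

/// Toy handle table that records the events it would emit (hypothetical).
#[derive(Debug)]
pub struct HandleTable {
    next_handle: u32,
    slots: HashMap<u32, u64>, // handle -> object id
    pub log: Vec<String>,     // emitted events, in order
}

impl HandleTable {
    pub fn new() -> Self {
        // 0x1000 matches the next_handle start mentioned in this note.
        HandleTable { next_handle: 0x1000, slots: HashMap::new(), log: Vec::new() }
    }

    /// Allocate a fresh slot for `object` and emit handle.create.
    pub fn add_handle(&mut self, object: u64) -> u32 {
        let handle = self.next_handle;
        self.next_handle += 4; // step size is an assumption of this sketch
        self.slots.insert(handle, object);
        self.log.push(format!("handle.create handle={handle:#x}"));
        handle
    }

    /// New-slot strategy (the behavior described for canary): the duplicate
    /// gets its own slot, so a handle.create event is emitted.
    pub fn duplicate_new_slot(&mut self, source: u32) -> Option<u32> {
        let object = *self.slots.get(&source)?;
        Some(self.add_handle(object))
    }

    /// Aliasing strategy (dup_id = source_id): the source handle is returned
    /// unchanged, so nothing is emitted.
    pub fn duplicate_aliased(&self, source: u32) -> Option<u32> {
        self.slots.contains_key(&source).then_some(source)
    }
}
```

In the aliased variant the event log is unchanged after a duplicate, which is the missing handle.create at idx=102,553 in the trace above.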

Acceptance gates

  • Gate 1 (default-off digest): PASS — 3× reproducible at e1dfcb1559f987b35012a7f2dc6d93f5 (unchanged from C+13/C+15-α/C+16 baseline). The fix is observation-only at the digest level; the new shadow-handle refcount entries do not feed back into guest behavior inside the 50M-instruction window.
  • Gate 2 (cvar-on emit): PASS — ours 121,544 events (was 121,537 in C+16, +7 from new lazy handle.create emits in the main chain bring-up); canary 3,059,463 events in ~90s. Both JSONL parse cleanly.
  • Gate 3 (diff tool): PASS — diff tool produces 6-chain report with the new SID-skip semantics for wait.begin.handles_semantic_ids.
  • Gate 4 (cold-vs-cold): PASS — main matched prefix advances 102,171 → 102,553 (+382). 4 of 5 sister chains advance; 1 minor regression on tid=15→10 (NEW-3, observation-side).
  • Gate 5 (build clean): PASS — cargo build --release clean (1 pre-existing dead_code warning unrelated).
  • Gate 6 (tests): PASS — 186 → 191 (added 5 new lifecycle tests for ensure_dispatcher_object; all pass + entire workspace green).
  • Gate 7 (Phase B image hash): PASS — image_loaded_sha256 = ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18 (unchanged).
  • Gate 8 (event-log determinism): PASS — handle.create event stream (post-strip of host_ns) is bit-identical across 3 cold runs: md5 0bd91b4c61dea52d72859e7d9c3541ba.
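Gate 8 amounts to hashing one event stream after dropping the host timestamp. A minimal sketch of that check follows, assuming a hypothetical `LoggedEvent` shape and a hypothetical `stream_digest` helper. It uses the standard library's `DefaultHasher` as a stand-in for md5 (which is not in the Rust standard library), so its digests are not comparable to the md5 value quoted above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory form of one JSONL event line.
#[derive(Debug, Clone)]
pub struct LoggedEvent {
    pub host_ns: u64,    // host wall-clock stamp; varies from run to run
    pub tid: u32,        // emitting thread
    pub kind: String,    // e.g. "handle.create"
    pub payload: String, // remaining fields, already serialized
}

/// Digest of all events of one kind, ignoring host_ns.
/// Event order matters: reordering events changes the digest.
pub fn stream_digest(events: &[LoggedEvent], kind: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    for e in events.iter().filter(|e| e.kind == kind) {
        // host_ns is deliberately not hashed.
        e.tid.hash(&mut hasher);
        e.kind.hash(&mut hasher);
        e.payload.hash(&mut hasher);
    }
    hasher.finish()
}
```

Two cold runs pass this check when `stream_digest(&run_a, "handle.create") == stream_digest(&run_b, "handle.create")`, even though every `host_ns` differs.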

Sister-chain analysis

None of the 5 sister chains now has "wait.begin with SID=0" as its first divergence:

  • tid=4→11: was KeWaitForMultipleObjects at idx=8 with empty SIDs; now goes 11 events deep with NO divergence (ours stalls, but for reasons unrelated to D-2/D-3/D-4).
  • tid=7→2: was KeWaitForSingleObject at idx=30 with SID=0; now 32 events with NO divergence.
  • tid=12→7: was at idx=2 with SID=0; now idx=3 — the handle.create matches (SID skipped per diff-tool policy), divergence is now timeout_ns mismatch (-30000000 vs 429466729600) — a real game-side wait-quantum mismatch.
  • tid=14→9: was at idx=2 with SID=0; now idx=41, where it reaches a real XAudioGetVoiceCategoryVolumeChangeMask divergence (an audio export on this sister chain that the boot does not reach in ours).
  • tid=15→10: was at idx=16 (no divergence in 16 events); now idx=2 diverges because ours emits handle.create on this thread's first touch of a globally-shared semaphore dispatcher at 0x828a3230, while canary emitted it earlier on another thread. Observation-side ordering issue; underlying state model is the same. NEW-3 below.
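One arithmetic observation on the two tid=12→7 timeout values, offered as a sketch rather than a finding of this phase: 429466729600 equals (2^32 − 300000) × 100, and -30000000 equals -300000 × 100. NT-style wait timeouts are expressed in 100 ns units, with negative values meaning relative. The two numbers would therefore coincide if one emitter sign-extended a 32-bit tick count of -300000 and the other zero-extended it before converting to nanoseconds. Whether that is what actually happens in either emulator is an assumption the trace alone does not confirm. The function names below are hypothetical.

```rust
/// Convert a relative timeout given in 100 ns ticks to nanoseconds,
/// keeping the sign (sign-extension to 64 bits).
pub fn ticks_to_ns_sign_extended(ticks: i32) -> i64 {
    i64::from(ticks) * 100
}

/// Same conversion, but treating the low 32 bits as unsigned
/// (zero-extension to 64 bits) before scaling.
pub fn ticks_to_ns_zero_extended(ticks: i32) -> i64 {
    i64::from(ticks as u32) * 100
}
```

For `ticks = -300_000` the first function returns -30000000 and the second returns 429466729600, the pair seen in wait.begin.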

Refcount leak risk audit

The fix initializes state.handle_refcount[ptr] to 1 (via or_insert(1)) for each first-touch shadow. Three concerns and mitigations:

  1. Leak risk: no code path currently destroys these shadows (ensure_dispatcher_object adoptions). Canary's design has the same property — GetNativeObject-synthesized XObjects survive until process exit. No leak relative to canary's behavior.
  2. Double-bump risk: the early-return guard at the top of ensure_dispatcher_object (state.objects.contains_key(&ptr)) ensures the refcount entry is initialized exactly once per pointer. Test ensure_dispatcher_object_is_idempotent_on_repeated_touch verifies this.
  3. Refcount underflow risk: if a future change wires handle.destroy on shadow removal (e.g., if NtClose is ever called on a guest dispatcher pointer), the refcount must not underflow. The or_insert(1) form preserves any pre-existing refcount, for example if the same pointer had previously been allocated via alloc_handle_for. That collision is not possible in practice, since next_handle starts at 0x1000 and pointers live above 0x1_0000.
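The three points above can be sketched in a few lines. This is an illustration under assumptions, not the actual implementation: the `KernelState` layout, the `kind` parameter, and the `emitted` log are hypothetical, and the `release` path does not exist in the source (point 1 says nothing destroys these shadows today); it only shows one way a future destroy path could avoid underflow.

```rust
use std::collections::HashMap;

/// Hypothetical subset of kernel state relevant to shadow adoption.
#[derive(Debug, Default)]
pub struct KernelState {
    pub objects: HashMap<u32, &'static str>, // guest pointer -> object kind
    pub handle_refcount: HashMap<u32, u32>,  // guest pointer -> refcount
    pub emitted: Vec<String>,                // events emitted, in order
}

impl KernelState {
    /// Adopt a guest dispatcher pointer on first touch.
    /// Returns true only when a new shadow was created.
    pub fn ensure_dispatcher_object(&mut self, ptr: u32, kind: &'static str) -> bool {
        // Early-return guard: repeated touches are no-ops (point 2).
        if self.objects.contains_key(&ptr) {
            return false;
        }
        self.objects.insert(ptr, kind);
        // or_insert(1) keeps any pre-existing count instead of overwriting it.
        self.handle_refcount.entry(ptr).or_insert(1);
        self.emitted.push(format!("handle.create ptr={ptr:#010x}"));
        true
    }

    /// Hypothetical release path (NOT in the source): a checked decrement
    /// that returns None instead of wrapping below zero (point 3).
    pub fn release(&mut self, ptr: u32) -> Option<u32> {
        let count = self.handle_refcount.get_mut(&ptr)?;
        *count = count.checked_sub(1)?;
        Some(*count)
    }
}
```

Touching the same pointer twice emits a single handle.create, and a second `release` on a count of zero returns `None` and leaves the count at zero.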