xenia-rs/audit-runs/phase-c18-shared-global-race/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)

Framing verification (reading-error #28 discipline)

C+17 result: main matched-prefix advanced 102,171 → 102,553 (+382) when ours's ensure_dispatcher_object started emitting handle.create for synthesized shadows. But the matched prefix of the sister chain (canary tid=15 ↔ ours tid=10) REGRESSED from 16 → 2:

canary tid=15:                              ours tid=10:
[0] import.call KeWaitForSingleObject       [0] import.call KeWaitForSingleObject
[1] kernel.call KeWaitForSingleObject       [1] kernel.call KeWaitForSingleObject
[2] wait.begin sid=66ae1b598f928969         [2] handle.create sid=b9e6799594b746ee
[3] kernel.return                           [3] wait.begin sid=b9e6799594b746ee
                                            [4] kernel.return

The two engines disagree at idx=2: canary's tid=15 has wait.begin, ours's tid=10 has handle.create. The SIDs are different too (66ae1b598f928969 vs b9e6799594b746ee) but the diff tool already SKIPS SID fields per C+15-α schema-v1.

Root cause: shared-global first-toucher race

The dispatcher at guest pointer 0x828a3230 is a process-global KSEMAPHORE (object_type=3) that's touched by MULTIPLE guest threads during boot:

  • Canary: some thread other than tid=15 (likely the main boot thread, tid=6) touches it first → emits handle.create there. By the time tid=15 reaches KeWaitForSingleObject, the wrapper exists, so XObject::GetNativeObject short-circuits via the kXObjSignature marker and emits NO additional event. Canary tid=15's stream is 4 events long: import.call → kernel.call → wait.begin → kernel.return.

  • Ours: tid=10 happens to be the first toucher → ours's ensure_dispatcher_object emits handle.create on tid=10. Ours tid=10's stream is 5 events long: import.call → kernel.call → handle.create → wait.begin → kernel.return.

Both engines do the right thing semantically; whichever thread wins the "first toucher" race depends on thread scheduling, which is NOT bit-identical across engines (different host schedulers, JIT, etc.). The diff tool sees one extra event on one side and reports it as a divergence — but it's observation-side, not behavioral.

This is C+17 D-NEW-3.
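
The first-toucher effect can be reproduced in isolation. The following is a minimal Rust sketch, not the engine code: `Objects`, `Event`, `wait`, and `stream` are hypothetical names introduced for illustration, and the import.call / kernel.call / kernel.return events are omitted. It assumes only what the note states, namely that wrapping is idempotent per pointer and that the table is process-global.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a trace event; only the fields needed here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub tid: u32,
    pub kind: &'static str,
}

/// Hypothetical process-global table of already-wrapped dispatcher pointers.
#[derive(Default)]
pub struct Objects {
    wrapped: HashSet<u32>,
    pub log: Vec<Event>,
}

impl Objects {
    /// Idempotent per pointer: only the first toucher emits handle.create.
    pub fn ensure_dispatcher_object(&mut self, pointer: u32, tid: u32) {
        // HashSet::insert returns true only when the pointer was not present.
        if self.wrapped.insert(pointer) {
            self.log.push(Event { tid, kind: "handle.create" });
        }
    }

    /// Models the events a wait on `pointer` adds to the calling thread's stream.
    pub fn wait(&mut self, pointer: u32, tid: u32) {
        self.ensure_dispatcher_object(pointer, tid);
        self.log.push(Event { tid, kind: "wait.begin" });
    }

    /// Event kinds observed on one thread, in emission order.
    pub fn stream(&self, tid: u32) -> Vec<&'static str> {
        self.log
            .iter()
            .filter(|e| e.tid == tid)
            .map(|e| e.kind)
            .collect()
    }
}
```

Calling `wait(0x828a3230, 6)` and then `wait(0x828a3230, 15)` leaves tid=15 with only `wait.begin`, while calling `wait(0x828a3230, 10)` first puts `handle.create` on tid=10. The object state is the same in both orders; only the thread that carries the extra event changes.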

Verified via static + dynamic evidence

  1. Both ours's ensure_dispatcher_object (exports.rs:4363) and canary's XObject::GetNativeObject (xobject.cc:397-483) are per-pointer idempotent: re-entry on a pointer that already has the kXObjSignature marker short-circuits without emit.
  2. The shared objects table is process-global in both engines (KernelState::objects map; canary's KernelState::object_table()).
  3. In the ours-cold log, 0x828a3230 appears in exactly ONE handle.create (on tid=10) — confirming the per-pointer idempotence:
$ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
{"kind":"handle.create","tid":10,"tid_event_idx":2,...}
  4. The canary diff side reports [2] wait.begin with a SID that refers to a dispatcher whose handle.create was already emitted elsewhere (likely on canary tid=6 main chain or a worker).

  5. The SID computation in both engines uses semantic_id(create_site_pc=0, creating_tid, idx_at_creation, object_type). Both creating_tid and idx_at_creation depend on WHICH thread did the first touch, so even if both engines wrapped the same dispatcher, their SIDs would still differ.

Class of bug

Class η: harness observation-side asymmetry on scheduling-non-deterministic process-global state. Not a real engine bug; both engines are doing the right thing. The harness (per-tid sequence diff) is the wrong abstraction for this class of event.

Fix shape

Two coordinated changes, both small and additive:

(A) Engine: scheduling-invariant SID for process-global dispatchers

Add event_log::semantic_id_shared_global(pointer, object_type) (ours and canary) — a SID recipe keyed only on (pointer, object_type). Inputs to the existing FNV-1a:

create_site_pc = SHARED_GLOBAL_SID_MARKER (= 0xC01AB005, fixed sentinel)
creating_tid   = 0
tid_event_idx  = pointer as u64
object_type    = object_type

The marker constant is nonzero and sits outside any plausible guest-PC range (PPC text 0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so the shared-global input tuple NEVER coincides with the input tuple of a regular per-thread SID, whose create_site_pc is either a real PC or 0.

ensure_dispatcher_object (ours) and XObject::GetNativeObject (canary) route their handle.create emit through this recipe instead of the per-thread semantic_id. Both engines compute the same SID for the same dispatcher pointer regardless of which guest thread wins the first-toucher race.
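
A minimal Rust sketch of the recipe, under stated assumptions: the note names the four inputs and FNV-1a but not the hash width, field widths, or byte layout, so the 64-bit variant and the little-endian serialization below are guesses. `semantic_id` here is a hypothetical stand-in for the existing per-thread function, not its actual signature.

```rust
/// Fixed sentinel from the note; it occupies the create_site_pc slot.
pub const SHARED_GLOBAL_SID_MARKER: u64 = 0xC01A_B005;

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(FNV_OFFSET_BASIS, |h, b| (h ^ u64::from(*b)).wrapping_mul(FNV_PRIME))
}

/// ASSUMED layout: the four fields in order, each little-endian.
pub fn semantic_id(
    create_site_pc: u64,
    creating_tid: u32,
    tid_event_idx: u64,
    object_type: u32,
) -> u64 {
    let mut buf = Vec::with_capacity(24);
    buf.extend_from_slice(&create_site_pc.to_le_bytes());
    buf.extend_from_slice(&creating_tid.to_le_bytes());
    buf.extend_from_slice(&tid_event_idx.to_le_bytes());
    buf.extend_from_slice(&object_type.to_le_bytes());
    fnv1a64(&buf)
}

/// Scheduling-invariant SID: depends only on (pointer, object_type).
pub fn semantic_id_shared_global(pointer: u32, object_type: u32) -> u64 {
    semantic_id(SHARED_GLOBAL_SID_MARKER, 0, u64::from(pointer), object_type)
}
```

With this shape, `semantic_id_shared_global(0x828a3230, 3)` takes no thread-dependent input, whereas `semantic_id(0, 10, 2, 3)` and `semantic_id(0, 6, 2, 3)` differ because creating_tid differs.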

(B) Diff tool: cross-tid floating handle.create matching

Pre-pass: collect the set of shared-global SIDs across BOTH engines and ALL tids. A handle.create event is detected as shared-global by recomputing the deterministic SID from its (raw_handle_id, object_type) payload and matching against handle_semantic_id.

When per-tid comparison finds a kind mismatch where one side has a handle.create whose SID is in the floating set:

  • Advance only that side's stream pointer past the floating event.
  • Re-compare at the same canonical position.

This handles the "extra event on tid=10 but not tid=15" case symmetrically. Subsequent wait.begin events whose handles_semantic_ids element matches a shared-global SID continue to align under the schema-v1 strict-equality rule. SID fields are already skipped per the C+15-α SKIP_PAYLOAD_FIELDS_BY_KIND policy, but the deterministic recipe preserves the underlying object alignment, which is useful for future passes that re-enable SID comparison.
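
The skip-and-recompare rule can be sketched as below. This is an illustrative Rust model rather than the diff tool itself: `Ev` and `first_divergence` are hypothetical names, events are compared by kind only (payload comparison is omitted), the floating set is assumed to come from the pre-pass, and treating leftover non-floating events at the end of a stream as a divergence is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical reduced event: kind plus the SID carried by handle.create.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Ev {
    pub kind: &'static str,
    pub sid: Option<u64>,
}

/// Returns the (a, b) index pair of the first real divergence, or None.
/// `floating` is the set of shared-global SIDs collected by the pre-pass.
pub fn first_divergence(a: &[Ev], b: &[Ev], floating: &HashSet<u64>) -> Option<(usize, usize)> {
    let is_floating = |e: &Ev| {
        e.kind == "handle.create" && e.sid.map_or(false, |s| floating.contains(&s))
    };
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].kind == b[j].kind {
            i += 1;
            j += 1;
        } else if is_floating(&a[i]) {
            i += 1; // advance only this side, then re-compare
        } else if is_floating(&b[j]) {
            j += 1;
        } else {
            return Some((i, j));
        }
    }
    // Absorb floating events left at the tail of either stream.
    while i < a.len() && is_floating(&a[i]) {
        i += 1;
    }
    while j < b.len() && is_floating(&b[j]) {
        j += 1;
    }
    if i == a.len() && j == b.len() {
        None
    } else {
        Some((i, j))
    }
}
```

Applied to the canary tid=15 and ours tid=10 streams shown earlier, the extra handle.create on ours's side is absorbed when its SID is in the floating set; with an empty set the same input reports a divergence at index 2 on both sides.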

Why this is the right fix (not over-suppression)

  • Pointer-derived SIDs are unique per object identity. Two distinct dispatchers at the same pointer with different object_type get distinct SIDs (defense in depth).
  • Regular per-thread handle.create events keep strict alignment. Only events whose SID matches the deterministic shared-global recipe are eligible for cross-tid absorption. A regular file-handle create (allocated via alloc_handle_for/AddHandle) uses the per-(tid, idx) SID recipe and CANNOT match the shared-global hash by construction.
  • The diff tool still reports real divergences. Tests confirm:
    • test_non_floating_real_divergence_still_caught — an unrelated extra event on ours's side IS reported.
    • test_strict_alignment_without_floating — when the floating set is empty, legacy strict behavior holds.