handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions
--- a/audit-runs/phase-c18-shared-global-race/investigation.md
+++ b/audit-runs/phase-c18-shared-global-race/investigation.md
@@ -0,0 +1,143 @@
+# Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)
+
+## Framing verification (reading-error #28 discipline)
+
+C+17 result: the main chain's matched prefix advanced 102,171 → 102,553
+(+382) when ours's `ensure_dispatcher_object` started emitting
+`handle.create` for synthesized shadows. But the matched prefix of sister
+chain `tid=15→10` REGRESSED from 16 → 2:
+
+```
+canary tid=15:                              ours tid=10:
+[0] import.call KeWaitForSingleObject       [0] import.call KeWaitForSingleObject
+[1] kernel.call KeWaitForSingleObject       [1] kernel.call KeWaitForSingleObject
+[2] wait.begin sid=66ae1b598f928969         [2] handle.create sid=b9e6799594b746ee
+[3] kernel.return                           [3] wait.begin sid=b9e6799594b746ee
+                                            [4] kernel.return
+```
+
+The two engines disagree at idx=2: canary's tid=15 has `wait.begin`,
+ours's tid=10 has `handle.create`. The SIDs are different too
+(`66ae1b598f928969` vs `b9e6799594b746ee`) but the diff tool already
+SKIPS SID fields per C+15-α schema-v1.
+
+## Root cause: shared-global first-toucher race
+
+The dispatcher at guest pointer `0x828a3230` is a **process-global
+KSEMAPHORE** (object_type=3) that's touched by MULTIPLE guest threads
+during boot:
+
+- Canary: some thread other than tid=15 (likely the main boot thread,
+  tid=6) touches it first → emits `handle.create` there. By the time
+  tid=15 reaches `KeWaitForSingleObject`, the wrapper exists, so
+  `XObject::GetNativeObject` short-circuits via the `kXObjSignature`
+  marker and emits NO additional event. Canary tid=15's stream is
+  4 events long: import.call → kernel.call → wait.begin → kernel.return.
+
+- Ours: tid=10 happens to be the first toucher → ours's
+  `ensure_dispatcher_object` emits `handle.create` on tid=10. Ours
+  tid=10's stream is 5 events long: import.call → kernel.call →
+  **handle.create** → wait.begin → kernel.return.
+
+Both engines do the right thing semantically; whichever thread wins the
+"first toucher" race depends on thread scheduling, which is NOT
+bit-identical across engines (different host schedulers, JIT, etc.).
+The diff tool sees one extra event on one side and reports it as a
+divergence — but it's **observation-side**, not behavioral.
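The race can be reproduced in miniature. The following is a hypothetical Python sketch, not code from either engine: `run_boot` and its `touch_order` argument are invented for illustration, and the canary order `[6, 15]` assumes the main boot thread wins, which the note above only calls likely.

```python
from collections import defaultdict

def run_boot(touch_order, pointer=0x828A3230):
    """Replay KeWaitForSingleObject calls in the given thread order.

    Returns {tid: [event kinds]}. Only the first toucher of `pointer`
    emits handle.create, mirroring the per-pointer idempotence.
    """
    wrapped = set()              # stands in for the process-global object table
    streams = defaultdict(list)
    for tid in touch_order:
        stream = streams[tid]
        stream += ["import.call", "kernel.call"]
        if pointer not in wrapped:   # first touch: synthesize the wrapper
            wrapped.add(pointer)
            stream.append("handle.create")
        stream += ["wait.begin", "kernel.return"]
    return dict(streams)

canary = run_boot([6, 15])   # assumed: tid=6 touches first, tid=15 second
ours = run_boot([10])        # tid=10 is the first toucher

print(canary[15])
print(ours[10])
```

Both runs emit exactly one `handle.create` for the pointer; only the thread that carries it differs, which is what a per-tid sequence diff then flags.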
+
+This is C+17 D-NEW-3.
+
+## Verified via static + dynamic evidence
+
+1. Both ours's `ensure_dispatcher_object` (exports.rs:4363) and canary's
+   `XObject::GetNativeObject` (xobject.cc:397-483) are **per-pointer
+   idempotent**: re-entry on a pointer that already has the
+   `kXObjSignature` marker short-circuits without emit.
+2. The shared `objects` table is process-global in both engines
+   (`KernelState::objects` map; canary's `KernelState::object_table()`).
+3. In the ours-cold log, `0x828a3230` appears in exactly ONE
+   `handle.create` (on tid=10) — confirming the per-pointer
+   idempotence:
+
+```
+$ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
+{"kind":"handle.create","tid":10,"tid_event_idx":2,...}
+```
+
+4. The canary diff side reports `[2] wait.begin` with a SID that
+   refers to a dispatcher whose `handle.create` was already emitted
+   elsewhere (likely on canary tid=6 main chain or a worker).
+
+5. The SID computation in both engines uses
+   `semantic_id(create_site_pc=0, creating_tid, idx_at_creation,
+   object_type)`. Both `creating_tid` and `idx_at_creation` depend on
+   WHICH thread did the first touch — so even if both engines wrapped
+   the same dispatcher, their SIDs would still differ.
+
+## Class of bug
+
+Class η: **harness observation-side asymmetry on
+scheduling-nondeterministic process-global state**. Not a real engine bug;
+both engines are doing the right thing. The harness (per-tid sequence
+diff) is the wrong abstraction for this class of event.
+
+## Fix shape
+
+Two coordinated changes, both small and additive:
+
+### (A) Engine: scheduling-invariant SID for process-global dispatchers
+
+Add `event_log::semantic_id_shared_global(pointer, object_type)` to both
+ours and canary: a SID recipe keyed only on `(pointer, object_type)`.
+It feeds the existing FNV-1a the inputs below; the pointer takes the
+per-thread event-index slot:
+```
+create_site_pc = SHARED_GLOBAL_SID_MARKER (= 0xC01AB005, fixed sentinel)
+creating_tid   = 0
+tid_event_idx  = pointer as u64
+object_type    = object_type
+```
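+The marker constant sits outside any plausible guest-PC range (PPC text
+0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so the
+shared-global hash inputs NEVER coincide with those of a regular
+per-thread SID, whose `create_site_pc` is a real PC (or 0, as in item 5
+above). A clash would require an actual FNV-1a output collision.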
+
+`ensure_dispatcher_object` (ours) and `XObject::GetNativeObject`
+(canary) route their `handle.create` emit through this recipe instead
+of the per-thread `semantic_id`. Both engines compute the **same SID**
+for the same dispatcher pointer regardless of which guest thread wins
+the first-toucher race.
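A minimal Python sketch of the two recipes follows. It assumes 64-bit FNV-1a (consistent with the 16-hex-digit SIDs in the logs) and a hypothetical little-endian packing of the four inputs; the engines' actual byte layout is not given in this note, so the concrete hash values will not match real traces.

```python
import struct

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
SHARED_GLOBAL_SID_MARKER = 0xC01AB005  # fixed sentinel named in the note

def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: xor each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def semantic_id(create_site_pc, creating_tid, idx_at_creation, object_type):
    # Assumed layout (not from the source): u32 pc, u32 tid, u64 idx, u32 type.
    packed = struct.pack("<IIQI", create_site_pc, creating_tid,
                         idx_at_creation, object_type)
    return fnv1a_64(packed)

def semantic_id_shared_global(pointer, object_type):
    # Keyed only on (pointer, object_type); tid and event index drop out.
    return semantic_id(SHARED_GLOBAL_SID_MARKER, 0, pointer, object_type)

# Per-thread recipe: the SID depends on which thread touched first.
sid_canary = semantic_id(0, 6, 2, 3)
sid_ours = semantic_id(0, 10, 2, 3)
print(sid_canary != sid_ours)  # True

# Shared-global recipe: same pointer, same SID, whoever wins the race.
print(f"{semantic_id_shared_global(0x828A3230, 3):016x}")
```

Because the shared-global recipe has no thread-dependent input, both engines can compute it independently and agree without coordinating on scheduling.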
+
+### (B) Diff tool: cross-tid floating `handle.create` matching
+
+Pre-pass: collect the set of shared-global SIDs across BOTH engines and
+ALL tids. A `handle.create` event is detected as shared-global by
+recomputing the deterministic SID from its `(raw_handle_id,
+object_type)` payload and matching against `handle_semantic_id`.
+
+When per-tid comparison finds a kind mismatch where one side has a
+`handle.create` whose SID is in the floating set:
+- Advance only that side's stream pointer past the floating event.
+- Re-compare at the same canonical position.
+
+This handles the "extra event on tid=10 but not tid=15" case
+symmetrically. Subsequent `wait.begin` events whose
+`handles_semantic_ids` element matches a shared-global SID continue to
+align under the schema-v1 strict-equality rule. SID fields are still
+skipped per the C+15-α SKIP_PAYLOAD_FIELDS_BY_KIND policy, so this does
+not change today's diff output, but the deterministic recipe keeps the
+underlying objects aligned, which is useful for future passes that
+re-enable SID comparison.
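The absorption rule can be sketched in Python as below. This is an illustrative simplification, not the diff tool's code: `diff_tid_streams` is a hypothetical name, events are plain dicts, only `kind` is compared (payload comparison is omitted), and the floating set is passed in rather than built by the pre-pass.

```python
def diff_tid_streams(canary, ours, floating_sids):
    """Return the first divergence as (i, j, canary_event, ours_event), or None.

    A handle.create whose SID is in `floating_sids` may be skipped on one
    side only; every other kind mismatch is reported.
    """
    def is_floating(ev):
        return (ev["kind"] == "handle.create"
                and ev.get("handle_semantic_id") in floating_sids)

    i = j = 0
    while i < len(canary) and j < len(ours):
        a, b = canary[i], ours[j]
        if a["kind"] == b["kind"]:
            i += 1
            j += 1
        elif is_floating(a):
            i += 1          # advance only canary past the floating event
        elif is_floating(b):
            j += 1          # advance only ours past the floating event
        else:
            return (i, j, a, b)
    # Trailing floating events are absorbed; anything else left is a divergence.
    while i < len(canary) and is_floating(canary[i]):
        i += 1
    while j < len(ours) and is_floating(ours[j]):
        j += 1
    if i < len(canary) or j < len(ours):
        return (i, j,
                canary[i] if i < len(canary) else None,
                ours[j] if j < len(ours) else None)
    return None

kinds = ["import.call", "kernel.call", "wait.begin", "kernel.return"]
canary_15 = [{"kind": k} for k in kinds]
ours_10 = list(canary_15)
ours_10.insert(2, {"kind": "handle.create", "handle_semantic_id": "S"})  # "S" is a placeholder SID

print(diff_tid_streams(canary_15, ours_10, {"S"}))          # None
print(diff_tid_streams(canary_15, ours_10, set())[:2])      # (2, 2)
```

With an empty floating set the function degenerates to the strict per-index comparison, so the legacy behavior is preserved.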
+
+### Why this is the right fix (not over-suppression)
+
+- **Pointer-derived SIDs are unique per object identity**. Two distinct
+  dispatchers at the same pointer with different `object_type` get
+  distinct SIDs (defense in depth).
+- **Regular per-thread `handle.create` events keep strict alignment**.
+  Only events whose SID matches the deterministic shared-global recipe
+  are eligible for cross-tid absorption. A regular file-handle create
+  (allocated via `alloc_handle_for`/`AddHandle`) uses the per-(tid,
+  idx) SID recipe and CANNOT match the shared-global hash by
+  construction.
+- **The diff tool still reports real divergences**. Tests confirm:
+  - `test_non_floating_real_divergence_still_caught` — an unrelated
+    extra event on ours's side IS reported.
+  - `test_strict_alignment_without_floating` — when the floating set is
+    empty, legacy strict behavior holds.