handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,143 @@
# Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)
## Framing verification (reading-error #28 discipline)
C+17 result: main matched-prefix advanced 102,171 → 102,553 (+382) when
ours's `ensure_dispatcher_object` started emitting `handle.create` for
synthesized shadows. But the matched prefix of sister chain `tid=15→10` REGRESSED from 16 → 2:
```
canary tid=15:                            ours tid=10:
  [0] import.call KeWaitForSingleObject     [0] import.call KeWaitForSingleObject
  [1] kernel.call KeWaitForSingleObject     [1] kernel.call KeWaitForSingleObject
  [2] wait.begin sid=66ae1b598f928969       [2] handle.create sid=b9e6799594b746ee
  [3] kernel.return                         [3] wait.begin sid=b9e6799594b746ee
                                            [4] kernel.return
```
The two engines disagree at idx=2: canary's tid=15 has `wait.begin`,
ours's tid=10 has `handle.create`. The SIDs are different too
(`66ae1b598f928969` vs `b9e6799594b746ee`), but the diff tool already
SKIPS SID fields per C+15-α schema-v1, so the reported divergence is the
event-kind mismatch, not the SID mismatch.
## Root cause: shared-global first-toucher race
The dispatcher at guest pointer `0x828a3230` is a **process-global
KSEMAPHORE** (object_type=3) that's touched by MULTIPLE guest threads
during boot:
- Canary: some thread other than tid=15 (likely the main boot thread,
  tid=6) touches it first → emits `handle.create` there. By the time
  tid=15 reaches `KeWaitForSingleObject`, the wrapper exists, so
  `XObject::GetNativeObject` short-circuits via the `kXObjSignature`
  marker and emits NO additional event. Canary tid=15's stream is
  4 events long: import.call → kernel.call → wait.begin → kernel.return.
- Ours: tid=10 happens to be the first toucher → ours's
  `ensure_dispatcher_object` emits `handle.create` on tid=10. Ours
  tid=10's stream is 5 events long: import.call → kernel.call →
  **handle.create** → wait.begin → kernel.return.
Both engines do the right thing semantically; whichever thread wins the
"first toucher" race depends on thread scheduling, which is NOT
bit-identical across engines (different host schedulers, JIT, etc.).
The diff tool sees one extra event on one side and reports it as a
divergence — but it's **observation-side**, not behavioral.
This is C+17 D-NEW-3.
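
The race can be reproduced in a toy model. The sketch below is not either engine's code: `run_boot`, the event lists, and the assumption that every thread reaches the dispatcher through `KeWaitForSingleObject` are illustrative only, and the choice of tid=6 as canary's first toucher follows the "likely" guess above.

```python
# Toy model of the first-toucher race (illustrative names, not engine code).
SHARED_PTR = 0x828A3230  # the process-global KSEMAPHORE from the trace


def run_boot(touch_order):
    """Replay guest threads touching SHARED_PTR in the given order.

    Returns {tid: [event kinds]}, i.e. one event stream per tid.
    """
    wrapped = set()  # stands in for the process-global objects table
    streams = {}
    for tid in touch_order:
        events = streams.setdefault(tid, [])
        events += ["import.call", "kernel.call"]
        if SHARED_PTR not in wrapped:
            # Per-pointer idempotence: only the first toucher emits.
            wrapped.add(SHARED_PTR)
            events.append("handle.create")
        events += ["wait.begin", "kernel.return"]
    return streams


canary = run_boot([6, 15])  # assumed: tid=6 wins the race on canary
ours = run_boot([10, 6])    # tid=10 wins the race on ours

print(canary[15])  # no handle.create on the sister thread
print(ours[10])    # handle.create lands here instead
```

Both runs emit exactly one `handle.create` for the pointer; only the thread it lands on changes, which is what a per-tid sequence diff picks up.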
## Verified via static + dynamic evidence
1. Both ours's `ensure_dispatcher_object` (exports.rs:4363) and canary's
`XObject::GetNativeObject` (xobject.cc:397-483) are **per-pointer
idempotent**: re-entry on a pointer that already has the
`kXObjSignature` marker short-circuits without emit.
2. The shared `objects` table is process-global in both engines
(`KernelState::objects` map; canary's `KernelState::object_table()`).
3. In the ours-cold log, `0x828a3230` appears in exactly ONE
`handle.create` (on tid=10) — confirming the per-pointer
idempotence:
```
$ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
{"kind":"handle.create","tid":10,"tid_event_idx":2,...}
```
4. The canary diff side reports `[2] wait.begin` with a SID that
refers to a dispatcher whose `handle.create` was already emitted
elsewhere (likely on canary tid=6 main chain or a worker).
5. The SID computation in both engines uses
`semantic_id(create_site_pc=0, creating_tid, idx_at_creation,
object_type)`. Both `creating_tid` and `idx_at_creation` depend on
WHICH thread did the first touch — so even if both engines wrapped
the same dispatcher, their SIDs would still differ.
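
The single-emit check from item 3 can also be scripted rather than grepped. This is a minimal sketch: it assumes `raw_handle_id` is a top-level key of each JSONL record (the real records are elided above), and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def handle_create_tids(lines, raw_handle_id):
    """Count handle.create events per tid for one raw_handle_id."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        # Assumption: raw_handle_id sits at the top level of the record.
        if ev.get("kind") == "handle.create" and ev.get("raw_handle_id") == raw_handle_id:
            counts[ev["tid"]] += 1
    return counts


# Hypothetical sample records, not taken from a real trace.
sample = [
    '{"kind":"handle.create","tid":10,"tid_event_idx":2,"raw_handle_id":"0x828a3230"}',
    '{"kind":"wait.begin","tid":10,"tid_event_idx":3}',
    '{"kind":"handle.create","tid":6,"tid_event_idx":40,"raw_handle_id":"0x82001000"}',
]
result = handle_create_tids(sample, "0x828a3230")
print(result)  # Counter({10: 1})
```

A result with more than one tid, or a count above 1, would contradict the per-pointer idempotence claim.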
## Class of bug
Class η — **harness observation-side asymmetry on scheduling-non-
deterministic process-global state**. Not a real engine bug; both
engines are doing the right thing. The harness (per-tid sequence
diff) is the wrong abstraction for this class of event.
## Fix shape
Two coordinated changes, both small and additive:
### (A) Engine: scheduling-invariant SID for process-global dispatchers
Add `event_log::semantic_id_shared_global(pointer, object_type)` (ours
and canary) — a SID recipe keyed only on `(pointer, object_type)`.
Inputs to the existing FNV-1a:
```
create_site_pc = SHARED_GLOBAL_SID_MARKER (= 0xC01AB005, fixed sentinel)
creating_tid = 0
tid_event_idx = pointer as u64
object_type = object_type
```
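The marker constant sits outside any plausible guest-PC range (PPC text
0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so the
hash inputs of a shared-global SID can never equal those of a regular
per-thread SID (which use a real PC, or `create_site_pc=0` as in the
recipe quoted earlier). A clash would require a collision in the FNV-1a
output itself.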
`ensure_dispatcher_object` (ours) and `XObject::GetNativeObject`
(canary) route their `handle.create` emit through this recipe instead
of the per-thread `semantic_id`. Both engines compute the **same SID**
for the same dispatcher pointer regardless of which guest thread wins
the first-toucher race.
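
A minimal sketch of the recipe, written in Python for illustration rather than in the engines' Rust/C++. The FNV-1a constants are the standard 64-bit ones; the field serialization (four little-endian u64 values) is an assumption, since the engines' actual byte layout is not shown here.

```python
import struct

FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV-1a 64-bit prime
SHARED_GLOBAL_SID_MARKER = 0xC01AB005


def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def semantic_id(create_site_pc, creating_tid, tid_event_idx, object_type):
    # Assumed layout: four little-endian u64 fields. The engines may differ.
    packed = struct.pack("<4Q", create_site_pc, creating_tid, tid_event_idx, object_type)
    return fnv1a_64(packed)


def semantic_id_shared_global(pointer, object_type):
    # Keyed only on (pointer, object_type): no tid, no per-thread index.
    return semantic_id(SHARED_GLOBAL_SID_MARKER, 0, pointer, object_type)


ptr = 0x828A3230
# Per-thread recipe: the SID depends on which thread touched first.
per_thread_tid10 = semantic_id(0, 10, 2, 3)
per_thread_tid6 = semantic_id(0, 6, 2, 3)
# Shared-global recipe: one SID, whoever wins the race.
shared = semantic_id_shared_global(ptr, 3)
print(f"{per_thread_tid10:016x} {per_thread_tid6:016x} {shared:016x}")
```

Because the shared-global inputs contain neither `creating_tid` nor a per-thread event index, the scheduling-dependent terms drop out of the hash entirely.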
### (B) Diff tool: cross-tid floating `handle.create` matching
Pre-pass: collect the set of shared-global SIDs across BOTH engines and
ALL tids. A `handle.create` event is detected as shared-global by
recomputing the deterministic SID from its `(raw_handle_id,
object_type)` payload and matching against `handle_semantic_id`.
When per-tid comparison finds a kind mismatch where one side has a
`handle.create` whose SID is in the floating set:
- Advance only that side's stream pointer past the floating event.
- Re-compare at the same canonical position.
This handles the "extra event on tid=10 but not tid=15" case
symmetrically. Subsequent `wait.begin` events whose
`handles_semantic_ids` element matches a shared-global SID continue to
align under the schema-v1 strict-equality rule. SID fields are currently
skipped per the C+15-α SKIP_PAYLOAD_FIELDS_BY_KIND policy, so they are
not compared today, but the deterministic recipe keeps the underlying
object alignment intact, which is useful for future passes that
re-enable SID comparison.
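
The pre-pass and the skip rule can be sketched as follows. This is a simplified stand-in for the diff tool, not its actual code: the function names, the dict-shaped events, treating a length mismatch as a divergence, and the `toy_sid` recipe (used in place of the real deterministic hash) are all assumptions for illustration.

```python
def collect_floating_sids(streams_by_engine, recompute_sid):
    """Pre-pass: SIDs of handle.create events matching the shared-global recipe."""
    floating = set()
    for streams in streams_by_engine:  # one {tid: [event, ...]} per engine
        for events in streams.values():
            for ev in events:
                if ev["kind"] != "handle.create":
                    continue
                expected = recompute_sid(ev["raw_handle_id"], ev["object_type"])
                if ev["handle_semantic_id"] == expected:
                    floating.add(expected)
    return floating


def is_floating(ev, floating):
    return ev["kind"] == "handle.create" and ev.get("handle_semantic_id") in floating


def first_divergence(a, b, floating):
    """Return (i, j) at the first real kind mismatch, or None if aligned."""
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i]["kind"] == b[j]["kind"]:
            i, j = i + 1, j + 1
        elif is_floating(a[i], floating):
            i += 1  # advance only the side holding the floating event
        elif is_floating(b[j], floating):
            j += 1
        else:
            return (i, j)
    # Trailing floating events are absorbed; anything else is a divergence.
    while i < len(a) and is_floating(a[i], floating):
        i += 1
    while j < len(b) and is_floating(b[j], floating):
        j += 1
    return None if (i == len(a) and j == len(b)) else (i, j)


def toy_sid(pointer, object_type):
    # Stand-in for the deterministic shared-global recipe.
    return f"shared:{pointer}:{object_type}"


def kinds(*names):
    return [{"kind": n} for n in names]


create = {"kind": "handle.create", "raw_handle_id": "0x828a3230",
          "object_type": 3, "handle_semantic_id": toy_sid("0x828a3230", 3)}
canary_t15 = kinds("import.call", "kernel.call", "wait.begin", "kernel.return")
ours_t10 = kinds("import.call", "kernel.call") + [create] + kinds("wait.begin", "kernel.return")

floating = collect_floating_sids([{15: canary_t15}, {10: ours_t10}], toy_sid)
print(first_divergence(canary_t15, ours_t10, floating))  # None: create absorbed
print(first_divergence(canary_t15, ours_t10, set()))     # (2, 2): strict mode
```

With an empty floating set the function degenerates to the strict per-tid comparison, so the absorption path only activates for events that pass the recomputed-SID check.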
### Why this is the right fix (not over-suppression)
- **Pointer-derived SIDs are unique per object identity**. Two distinct
dispatchers at the same pointer with different `object_type` get
distinct SIDs (defense in depth).
- **Regular per-thread `handle.create` events keep strict alignment**.
Only events whose SID matches the deterministic shared-global recipe
are eligible for cross-tid absorption. A regular file-handle create
(allocated via `alloc_handle_for`/`AddHandle`) uses the per-(tid,
idx) SID recipe and CANNOT match the shared-global hash by
construction.
- **The diff tool still reports real divergences**. Tests confirm:
- `test_non_floating_real_divergence_still_caught` — an unrelated
extra event on ours's side IS reported.
- `test_strict_alignment_without_floating` — when the floating set is
empty, legacy strict behavior holds.