xenia-rs/audit-runs/phase-c18-shared-global-race/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)
## Framing verification (reading-error #28 discipline)
C+17 result: main matched-prefix advanced 102,171 → 102,553 (+382) when
ours's `ensure_dispatcher_object` started emitting `handle.create` for
synthesized shadows. But the sister chain `tid=15→10` (canary tid=15
paired with ours tid=10) REGRESSED: its matched prefix dropped from 16 to 2:
```
canary tid=15:                             ours tid=10:
[0] import.call KeWaitForSingleObject      [0] import.call KeWaitForSingleObject
[1] kernel.call KeWaitForSingleObject      [1] kernel.call KeWaitForSingleObject
[2] wait.begin sid=66ae1b598f928969        [2] handle.create sid=b9e6799594b746ee
[3] kernel.return                          [3] wait.begin sid=b9e6799594b746ee
                                           [4] kernel.return
```
The two engines disagree at idx=2: canary's tid=15 has `wait.begin`,
ours's tid=10 has `handle.create`. The SIDs differ too
(`66ae1b598f928969` vs `b9e6799594b746ee`), but that is not what the
diff reports: the tool already SKIPS SID fields per C+15-α schema-v1,
so the reported divergence is the event-kind mismatch.
## Root cause: shared-global first-toucher race
The dispatcher at guest pointer `0x828a3230` is a **process-global
KSEMAPHORE** (object_type=3) that's touched by MULTIPLE guest threads
during boot:
- Canary: some thread other than tid=15 (likely the main boot thread,
  tid=6) touches it first → emits `handle.create` there. By the time
  tid=15 reaches `KeWaitForSingleObject`, the wrapper exists, so
  `XObject::GetNativeObject` short-circuits via the `kXObjSignature`
  marker and emits NO additional event. Canary tid=15's stream is
  4 events long: import.call → kernel.call → wait.begin → kernel.return.
- Ours: tid=10 happens to be the first toucher → ours's
  `ensure_dispatcher_object` emits `handle.create` on tid=10. Ours
  tid=10's stream is 5 events long: import.call → kernel.call →
  **handle.create** → wait.begin → kernel.return.
Both engines do the right thing semantically; whichever thread wins the
"first toucher" race depends on thread scheduling, which is NOT
bit-identical across engines (different host schedulers, JIT, etc.).
The diff tool sees one extra event on one side and reports it as a
divergence — but it's **observation-side**, not behavioral.
This is C+17 D-NEW-3.
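
The first-toucher effect can be reproduced with a toy model. The sketch below is not either engine's code: `ObjectTable`, `touch`, and `wait_stream` are hypothetical names, and the assumption that canary's first toucher is tid=6 follows the "likely" guess above rather than a confirmed trace.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the process-global object table.
#[derive(Default)]
struct ObjectTable {
    /// guest pointer -> tid that created the wrapper
    wrapped: HashMap<u32, u32>,
}

impl ObjectTable {
    /// Per-pointer idempotent: only the first toucher of `ptr` gets a
    /// `handle.create`; later touches short-circuit and emit nothing.
    fn touch(&mut self, ptr: u32, tid: u32) -> Option<&'static str> {
        if self.wrapped.contains_key(&ptr) {
            return None;
        }
        self.wrapped.insert(ptr, tid);
        Some("handle.create")
    }
}

/// Per-tid event stream for one KeWaitForSingleObject call on `ptr`.
fn wait_stream(table: &mut ObjectTable, ptr: u32, tid: u32) -> Vec<&'static str> {
    let mut events = vec!["import.call", "kernel.call"];
    if let Some(e) = table.touch(ptr, tid) {
        events.push(e);
    }
    events.push("wait.begin");
    events.push("kernel.return");
    events
}
```

If another thread touches `0x828a3230` before the waiter does, the waiter's stream has four events; if the waiter is the first toucher, it has five. The wait itself behaves identically in both cases.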
## Verified via static + dynamic evidence
1. Both ours's `ensure_dispatcher_object` (exports.rs:4363) and canary's
   `XObject::GetNativeObject` (xobject.cc:397-483) are **per-pointer
   idempotent**: re-entry on a pointer that already has the
   `kXObjSignature` marker short-circuits without emit.
2. The shared `objects` table is process-global in both engines
   (`KernelState::objects` map; canary's `KernelState::object_table()`).
3. In the ours-cold log, `0x828a3230` appears in exactly ONE
   `handle.create` (on tid=10), confirming the per-pointer
   idempotence:
   ```
   $ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
   {"kind":"handle.create","tid":10,"tid_event_idx":2,...}
   ```
4. The canary diff side reports `[2] wait.begin` with a SID that
   refers to a dispatcher whose `handle.create` was already emitted
   elsewhere (likely on canary tid=6 main chain or a worker).
5. The SID computation in both engines uses
   `semantic_id(create_site_pc=0, creating_tid, idx_at_creation,
   object_type)`. Both `creating_tid` and `idx_at_creation` depend on
   WHICH thread did the first touch, so even if both engines wrapped
   the same dispatcher, their SIDs would still differ.
## Class of bug
Class η: **harness observation-side asymmetry on
scheduling-non-deterministic process-global state**. Not a real engine
bug; both engines are doing the right thing. The harness (per-tid
sequence diff) is the wrong abstraction for this class of event.
## Fix shape
Two coordinated changes, both small and additive:
### (A) Engine: scheduling-invariant SID for process-global dispatchers
Add `event_log::semantic_id_shared_global(pointer, object_type)` (ours
and canary) — a SID recipe keyed only on `(pointer, object_type)`.
Inputs to the existing FNV-1a:
```
create_site_pc = SHARED_GLOBAL_SID_MARKER (= 0xC01AB005, fixed sentinel)
creating_tid = 0
tid_event_idx = pointer as u64
object_type = object_type
```
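The marker constant sits outside any plausible guest-PC range (PPC text
0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so the
shared-global input tuple NEVER coincides with that of a regular
per-thread SID, whose `create_site_pc` is a real PC or 0. Any remaining
overlap would have to be an ordinary FNV-1a hash collision.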
`ensure_dispatcher_object` (ours) and `XObject::GetNativeObject`
(canary) route their `handle.create` emit through this recipe instead
of the per-thread `semantic_id`. Both engines compute the **same SID**
for the same dispatcher pointer regardless of which guest thread wins
the first-toucher race.
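
A minimal sketch of the recipe, under stated assumptions. The field widths and the little-endian serialization fed to FNV-1a are guesses, since the notes do not give the real `semantic_id` byte layout; only the four inputs and the marker value come from the text above. The FNV constants are the standard 64-bit offset basis and prime.

```rust
const FNV_OFFSET: u64 = 0xcbf29ce484222325; // standard FNV-1a 64-bit offset basis
const FNV_PRIME: u64 = 0x100000001b3; // standard FNV 64-bit prime
const SHARED_GLOBAL_SID_MARKER: u32 = 0xC01AB005;

fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Per-thread recipe. ASSUMED layout: pc (u32 LE), tid (u32 LE),
/// idx (u64 LE), object_type (u32 LE); the real layout may differ.
fn semantic_id(create_site_pc: u32, creating_tid: u32, tid_event_idx: u64, object_type: u32) -> u64 {
    let mut buf = Vec::with_capacity(20);
    buf.extend_from_slice(&create_site_pc.to_le_bytes());
    buf.extend_from_slice(&creating_tid.to_le_bytes());
    buf.extend_from_slice(&tid_event_idx.to_le_bytes());
    buf.extend_from_slice(&object_type.to_le_bytes());
    fnv1a(&buf)
}

/// Scheduling-invariant recipe: no tid or per-thread index goes in,
/// only the guest pointer and the object type.
fn semantic_id_shared_global(pointer: u32, object_type: u32) -> u64 {
    semantic_id(SHARED_GLOBAL_SID_MARKER, 0, pointer as u64, object_type)
}
```

With this shape, `semantic_id_shared_global(0x828a3230, 3)` yields the same value whether tid=10 or tid=15 performs the first touch, whereas the per-thread recipe changes as soon as `creating_tid` changes.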
### (B) Diff tool: cross-tid floating `handle.create` matching
Pre-pass: collect the set of shared-global SIDs across BOTH engines and
ALL tids. A `handle.create` event is detected as shared-global by
recomputing the deterministic SID from its `(raw_handle_id,
object_type)` payload and matching against `handle_semantic_id`.
When per-tid comparison finds a kind mismatch where one side has a
`handle.create` whose SID is in the floating set:
- Advance only that side's stream pointer past the floating event.
- Re-compare at the same canonical position.
This handles the "extra event on tid=10 but not tid=15" case
symmetrically: the skip applies to whichever side carries the floating
event. Subsequent `wait.begin` events still align under the schema-v1
rules. Their SID fields are already skipped per the C+15-α
SKIP_PAYLOAD_FIELDS_BY_KIND policy, but because the deterministic
recipe yields the same SID on both sides, their `handles_semantic_ids`
elements would also match if a future pass re-enables SID comparison.
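
The skip-and-recompare rule can be sketched as a two-pointer walk. This is an illustration, not the diff tool's code: `Event`, `ev`, and `first_divergence` are hypothetical names, the floating set is passed in directly instead of being built by the pre-pass, and only event kinds are compared (SIDs are skipped, as in schema-v1).

```rust
use std::collections::HashSet;

#[derive(Clone, Debug)]
struct Event {
    kind: &'static str,
    sid: Option<u64>,
}

fn ev(kind: &'static str, sid: Option<u64>) -> Event {
    Event { kind, sid }
}

/// Returns the (left, right) indices of the first real divergence,
/// or None when the streams align once floating events are absorbed.
fn first_divergence(a: &[Event], b: &[Event], floating: &HashSet<u64>) -> Option<(usize, usize)> {
    let is_floating = |e: &Event| {
        e.kind == "handle.create" && e.sid.map_or(false, |s| floating.contains(&s))
    };
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].kind == b[j].kind {
            i += 1;
            j += 1;
        } else if is_floating(&a[i]) {
            i += 1; // advance only the side holding the floating event
        } else if is_floating(&b[j]) {
            j += 1;
        } else {
            return Some((i, j)); // real divergence, still reported
        }
    }
    // Trailing floating events on either side are absorbed as well.
    while i < a.len() && is_floating(&a[i]) {
        i += 1;
    }
    while j < b.len() && is_floating(&b[j]) {
        j += 1;
    }
    if i == a.len() && j == b.len() { None } else { Some((i, j)) }
}
```

Fed the tid=15 and tid=10 streams from the framing section, this reports no divergence when the `handle.create` SID is in the floating set, and a divergence at index 2 on both sides when the set is empty.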
### Why this is the right fix (not over-suppression)
- **Pointer-derived SIDs are unique per object identity**. Two distinct
  dispatchers at the same pointer with different `object_type` get
  distinct SIDs (defense in depth).
- **Regular per-thread `handle.create` events keep strict alignment**.
  Only events whose SID matches the deterministic shared-global recipe
  are eligible for cross-tid absorption. A regular file-handle create
  (allocated via `alloc_handle_for`/`AddHandle`) uses the per-(tid,
  idx) SID recipe, whose inputs never include the shared-global marker,
  so by construction it CANNOT match the shared-global hash short of a
  hash collision.
- **The diff tool still reports real divergences**. Tests confirm:
  - `test_non_floating_real_divergence_still_caught`: an unrelated
    extra event on ours's side IS reported.
  - `test_strict_alignment_without_floating`: when the floating set is
    empty, legacy strict behavior holds.