Source changes (dormant parity infra, retained from iterate 2.AI/2.AO): - xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring - xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)

## Framing verification (reading-error #28 discipline)

C+17 result: main matched-prefix advanced 102,171 → 102,553 (+382) when
ours's `ensure_dispatcher_object` started emitting `handle.create` for
synthesized shadows. But the sister chain `tid=15→10` REGRESSED from 16 → 2:

```
canary tid=15:                            ours tid=10:
  [0] import.call KeWaitForSingleObject     [0] import.call KeWaitForSingleObject
  [1] kernel.call KeWaitForSingleObject     [1] kernel.call KeWaitForSingleObject
  [2] wait.begin sid=66ae1b598f928969       [2] handle.create sid=b9e6799594b746ee
  [3] kernel.return                         [3] wait.begin sid=b9e6799594b746ee
                                            [4] kernel.return
```

The two engines disagree at idx=2: canary's tid=15 has `wait.begin`,
ours's tid=10 has `handle.create`. The SIDs differ too
(`66ae1b598f928969` vs `b9e6799594b746ee`), but that is not what the diff
flags: the diff tool already SKIPS SID fields per C+15-α schema-v1.
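
To make the idx=2 disagreement concrete, here is a minimal sketch of a per-tid prefix comparison that ignores SID fields. The event dictionaries, the `sid` field name, and the `matched_prefix` helper are illustrative assumptions, not the actual diff tool's data model.

```python
# Hypothetical sketch: field names and helpers are assumptions, not the real diff tool.
SKIPPED_FIELDS = {"sid"}  # stand-in for the schema-v1 policy of skipping SID fields


def comparable(event):
    """Drop the fields the diff is assumed to ignore before comparing."""
    return {k: v for k, v in event.items() if k not in SKIPPED_FIELDS}


def matched_prefix(canary, ours):
    """Count how many leading events agree once skipped fields are removed."""
    count = 0
    for a, b in zip(canary, ours):
        if comparable(a) != comparable(b):
            break
        count += 1
    return count


canary_tid15 = [
    {"kind": "import.call", "name": "KeWaitForSingleObject"},
    {"kind": "kernel.call", "name": "KeWaitForSingleObject"},
    {"kind": "wait.begin", "sid": "66ae1b598f928969"},
    {"kind": "kernel.return"},
]
ours_tid10 = [
    {"kind": "import.call", "name": "KeWaitForSingleObject"},
    {"kind": "kernel.call", "name": "KeWaitForSingleObject"},
    {"kind": "handle.create", "sid": "b9e6799594b746ee"},
    {"kind": "wait.begin", "sid": "b9e6799594b746ee"},
    {"kind": "kernel.return"},
]

print(matched_prefix(canary_tid15, ours_tid10))  # 2
```

Under these assumptions the prefix stops at 2 because of the event kind, not the SID: removing the `handle.create` entry from the ours stream lets all four remaining events match.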

## Root cause: shared-global first-toucher race

The dispatcher at guest pointer `0x828a3230` is a **process-global
KSEMAPHORE** (object_type=3) that's touched by MULTIPLE guest threads
during boot:

- Canary: some thread other than tid=15 (likely the main boot thread,
  tid=6) touches it first → emits `handle.create` there. By the time
  tid=15 reaches `KeWaitForSingleObject`, the wrapper exists, so
  `XObject::GetNativeObject` short-circuits via the `kXObjSignature`
  marker and emits NO additional event. Canary tid=15's stream is
  4 events long: import.call → kernel.call → wait.begin → kernel.return.

- Ours: tid=10 happens to be the first toucher → ours's
  `ensure_dispatcher_object` emits `handle.create` on tid=10. Ours
  tid=10's stream is 5 events long: import.call → kernel.call →
  **handle.create** → wait.begin → kernel.return.

Both engines do the right thing semantically; whichever thread wins the
"first toucher" race depends on thread scheduling, which is NOT
bit-identical across engines (different host schedulers, JIT, etc.).
The diff tool sees one extra event on one side and reports it as a
divergence, but the divergence is **observation-side**, not behavioral.

This is C+17 D-NEW-3.
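
The race can be reproduced with a toy model. The sketch below is not engine code: `replay`, the `wrapped` set, and the touch orders are assumptions chosen only to show that the same idempotent wrapper logic yields different per-tid streams under different schedules.

```python
# Toy model (assumption): one shared set stands in for the process-global object table.
def replay(touch_order):
    """Replay (tid, pointer) waits in order and return the per-tid event streams."""
    wrapped = set()  # pointers that already carry a wrapper
    streams = {}
    for tid, pointer in touch_order:
        events = streams.setdefault(tid, [])
        if pointer not in wrapped:
            # Only the first toucher pays for the wrapper and emits the event.
            wrapped.add(pointer)
            events.append("handle.create")
        events.append("wait.begin")
    return streams


PTR = 0x828A3230

# Schedule A: the boot thread (tid=6) touches first, as inferred for canary.
schedule_a = replay([(6, PTR), (15, PTR)])
# Schedule B: the waiter itself touches first, as observed on ours.
schedule_b = replay([(15, PTR), (6, PTR)])

print(schedule_a[15])  # ['wait.begin']
print(schedule_b[15])  # ['handle.create', 'wait.begin']
```

In both schedules exactly one `handle.create` is emitted in total; only the thread that carries it changes, which is what a per-tid sequence diff reports as a mismatch.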

## Verified via static + dynamic evidence

1. Both ours's `ensure_dispatcher_object` (exports.rs:4363) and canary's
   `XObject::GetNativeObject` (xobject.cc:397-483) are **per-pointer
   idempotent**: re-entry on a pointer that already has the
   `kXObjSignature` marker short-circuits without emitting an event.

2. The shared `objects` table is process-global in both engines
   (ours's `KernelState::objects` map; canary's `KernelState::object_table()`).

3. In the ours-cold log, `0x828a3230` appears in exactly ONE
   `handle.create` (on tid=10), confirming the per-pointer
   idempotence:

   ```
   $ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
   {"kind":"handle.create","tid":10,"tid_event_idx":2,...}
   ```

4. The canary diff side reports `[2] wait.begin` with a SID that
   refers to a dispatcher whose `handle.create` was already emitted
   elsewhere (likely on canary tid=6 main chain or a worker).

5. The SID computation in both engines uses
   `semantic_id(create_site_pc=0, creating_tid, idx_at_creation,
   object_type)`. Both `creating_tid` and `idx_at_creation` depend on
   WHICH thread did the first touch, so even if both engines wrapped
   the same dispatcher, their SIDs would still differ.
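
Point 5 can be illustrated with a small FNV-1a sketch. The 64-bit FNV-1a constants are the standard ones, but the field packing (four little-endian u64 values) and the example `creating_tid`/`idx_at_creation` values for canary are assumptions; the real engines may serialize the inputs differently, so the printed values are not expected to match SIDs from actual logs.

```python
import struct

FNV_OFFSET_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = FNV_OFFSET_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF
    return h


def semantic_id(create_site_pc, creating_tid, idx_at_creation, object_type):
    """Per-thread SID sketch; the packing layout is an assumption."""
    packed = struct.pack(
        "<QQQQ", create_site_pc, creating_tid, idx_at_creation, object_type
    )
    return fnv1a_64(packed)


# Same dispatcher (object_type=3), different first toucher.
# tid=6 / idx=40 are made-up placeholders for the unknown canary creator.
canary_sid = semantic_id(0, 6, 40, 3)
ours_sid = semantic_id(0, 10, 2, 3)

print(f"{canary_sid:016x}")
print(f"{ours_sid:016x}")
```

Because the creator's tid and event index feed the hash, the two values differ even though the object is the same, which is why the existing recipe cannot serve as a cross-engine identity for shared globals.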

## Class of bug

Class η: **harness observation-side asymmetry on
scheduling-non-deterministic process-global state**. Not a real engine
bug; both engines are doing the right thing. The harness (per-tid
sequence diff) is the wrong abstraction for this class of event.

## Fix shape

Two coordinated changes, both small and additive:

### (A) Engine: scheduling-invariant SID for process-global dispatchers

Add `event_log::semantic_id_shared_global(pointer, object_type)` (ours
and canary): a SID recipe keyed only on `(pointer, object_type)`.
Inputs to the existing FNV-1a:

```
create_site_pc = SHARED_GLOBAL_SID_MARKER  (= 0xC01AB005, fixed sentinel)
creating_tid   = 0
tid_event_idx  = pointer as u64
object_type    = object_type
```

The marker constant sits outside any plausible guest-PC range (PPC text
0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so its
hash input NEVER coincides with that of a regular per-thread SID (which
uses a real PC, or `create_site_pc=0` for the synthesized shadows
described above).
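
The range claim is easy to check mechanically. In this sketch the range bounds are taken from the text, with the `x` wildcards expanded to their full span (an interpretation); the `ranges_containing` helper and the sample PC `0x82001000` are illustrative only.

```python
SHARED_GLOBAL_SID_MARKER = 0xC01AB005

# Inclusive ranges quoted in the text; wildcard digits expanded to the full span.
GUEST_PC_RANGES = {
    "ppc_text": (0x82000000, 0x82FFFFFF),
    "xex_header": (0x30010000, 0x3001FFFF),
    "heap": (0x40000000, 0x4FFFFFFF),
}


def ranges_containing(value):
    """Return the names of all listed ranges that contain the value."""
    return [name for name, (lo, hi) in GUEST_PC_RANGES.items() if lo <= value <= hi]


print(ranges_containing(SHARED_GLOBAL_SID_MARKER))  # []
print(ranges_containing(0x82001000))                # ['ppc_text']
```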

`ensure_dispatcher_object` (ours) and `XObject::GetNativeObject`
(canary) route their `handle.create` emit through this recipe instead
of the per-thread `semantic_id`. Both engines compute the **same SID**
for the same dispatcher pointer regardless of which guest thread wins
the first-toucher race.
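
A minimal sketch of the shared-global recipe follows, assuming the four inputs are packed as little-endian u64 values before hashing. That layout is an assumption, so the resulting value is illustrative and not necessarily what either engine emits.

```python
import struct

SHARED_GLOBAL_SID_MARKER = 0xC01AB005  # fixed sentinel from the recipe above
FNV_OFFSET_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a over a byte string."""
    h = FNV_OFFSET_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF
    return h


def semantic_id_shared_global(pointer: int, object_type: int) -> int:
    """SID keyed only on (pointer, object_type); packing layout is an assumption."""
    packed = struct.pack(
        "<QQQQ",
        SHARED_GLOBAL_SID_MARKER,  # create_site_pc slot
        0,                         # creating_tid slot, always 0
        pointer,                   # tid_event_idx slot carries the guest pointer
        object_type,
    )
    return fnv1a_64(packed)


sid = semantic_id_shared_global(0x828A3230, 3)
print(f"{sid:016x}")
```

The function takes no thread or event-index argument, so scheduling cannot influence the result: whichever thread touches `0x828a3230` first, both engines would feed the hash the same bytes.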

### (B) Diff tool: cross-tid floating `handle.create` matching

Pre-pass: collect the set of shared-global SIDs across BOTH engines and
ALL tids. A `handle.create` event is detected as shared-global by
recomputing the deterministic SID from its `(raw_handle_id,
object_type)` payload and matching against `handle_semantic_id`.

When per-tid comparison finds a kind mismatch where one side has a
`handle.create` whose SID is in the floating set:

- Advance only that side's stream pointer past the floating event.
- Re-compare at the same canonical position.

This handles the "extra event on tid=10 but not tid=15" case
symmetrically. Subsequent `wait.begin` events whose
`handles_semantic_ids` element matches a shared-global SID continue to
align via the schema-v1 strict-equality rule. SID fields are already
skipped per the C+15-α SKIP_PAYLOAD_FIELDS_BY_KIND policy, but the
deterministic recipe also keeps the underlying object identity aligned,
which is useful for future passes that re-enable SID comparison.
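
The skip-and-recompare step can be sketched as follows. This is a simplified model, not the actual diff tool: it compares event kinds only, takes the floating set as a ready-made argument instead of running the pre-pass, and uses a hypothetical `sid` field and placeholder SID string.

```python
# Simplified model (assumption): events are dicts compared by "kind" only.
def first_divergence(canary, ours, floating_sids):
    """Return (canary_idx, ours_idx) of the first real mismatch, or None."""

    def is_floating(event):
        return event["kind"] == "handle.create" and event.get("sid") in floating_sids

    i = j = 0
    while i < len(canary) and j < len(ours):
        if canary[i]["kind"] == ours[j]["kind"]:
            i += 1
            j += 1
        elif is_floating(canary[i]):
            i += 1  # advance only the canary side, then re-compare
        elif is_floating(ours[j]):
            j += 1  # advance only the ours side, then re-compare
        else:
            return (i, j)
    # Trailing floating events on either side are also absorbed.
    while i < len(canary) and is_floating(canary[i]):
        i += 1
    while j < len(ours) and is_floating(ours[j]):
        j += 1
    return None if i == len(canary) and j == len(ours) else (i, j)


SHARED = "shared-global-sid"  # placeholder, not a real SID
canary = [{"kind": k} for k in ("import.call", "kernel.call", "wait.begin", "kernel.return")]
ours = canary[:2] + [{"kind": "handle.create", "sid": SHARED}] + canary[2:]

print(first_divergence(canary, ours, {SHARED}))  # None
print(first_divergence(canary, ours, set()))     # (2, 2)
```

With the floating set populated, the extra `handle.create` is absorbed and the streams align; with an empty set, the same input is reported at idx=2 exactly as the strict comparison would report it.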

### Why this is the right fix (not over-suppression)

- **Pointer-derived SIDs are unique per object identity**. Two distinct
  dispatchers at the same pointer with different `object_type` get
  distinct SIDs (defense in depth).
- **Regular per-thread `handle.create` events keep strict alignment**.
  Only events whose SID matches the deterministic shared-global recipe
  are eligible for cross-tid absorption. A regular file-handle create
  (allocated via `alloc_handle_for`/`AddHandle`) uses the per-(tid,
  idx) SID recipe and CANNOT match the shared-global hash by
  construction.
- **The diff tool still reports real divergences**. Tests confirm:
  - `test_non_floating_real_divergence_still_caught`: an unrelated
    extra event on ours's side IS reported.
  - `test_strict_alignment_without_floating`: when the floating set is
    empty, legacy strict behavior holds.