handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,83 @@
# Phase C+18 cold-vs-cold result (2026-05-14)
## Matched-prefix table
| canary_tid | ours_tid | C+17 | C+18 | delta | first_divergence_at | kind |
|------------|----------|---------|---------|----------|---------------------|------|
| 6 | 1 | 102,553 | 102,553 | 0 | 102,553 | `NtDuplicateObject` no `handle.create` (D-NEW-1, unchanged) |
| 4 | 11 | 11 | 11 | 0 | — | no divergence in 11 events |
| 7 | 2 | 32 | 32 | 0 | — | no divergence in 32 events |
| 12 | 7 | 3 | 3 | 0 | 3 | `timeout_ns` mismatch (D-NEW-2, unchanged) |
| 14 | 9 | 41 | 41 | 0 | 41 | unrelated `XAudioGetVoiceCategoryVolumeChangeMask` |
| 15 | 10 | 2 | **16** | **+14** | — | **regression RESTORED** (D-NEW-3 fix landed) |

**tid=15→10 RESTORED**: matched-prefix advances 2 → 16 (+14), back to
the C+16 baseline. One ours-side floating `handle.create` event is
absorbed by Phase C+18 cross-tid SID matching (`floating_skipped`
column = `0/1`). No other chains regress. Main chain unchanged at 102,553.
## Acceptance gates
- **Gate 1 (default-off digest)**: PASS — 3× reproducible at
`e1dfcb1559f987b35012a7f2dc6d93f5` (unchanged from
C+13/C+15-α/C+16/C+17 baseline). The fix is observation-only at the
digest level; the new SID recipe is a string change in the event-log
emit, NOT a guest-behavior change.
- **Gate 2 (cvar-on emit)**: PASS — ours 121,544 events (unchanged
from C+17 — same emit count, different SID values), canary
3,517,980 events in ~90s.
- **Gate 3 (diff tool)**: PASS — produces 6-chain report with new
`floating_skipped (c/o)` column. tid=15→10 shows `0/1` — exactly
one ours-side floating create absorbed.
- **Gate 4 (cold-vs-cold)**: PASS — main matched prefix unchanged at
102,553, tid=15→10 restored from 2 → 16, all other sister chains
unchanged.
- **Gate 5 (build)**: PASS — both engines build clean (only the
pre-existing `dead_code` warning on `walk_committed_regions`).
- **Gate 6 (tests)**: PASS — ours kernel tests 191 → 193 (+2 new SID
determinism tests). Diff-tool tests: 14/14 PASS (new
`test_diff_events.py`).
- **Gate 7 (Phase B image hash)**: PASS — `image_loaded_sha256` =
`ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
(unchanged).
- **Gate 8 (event-log determinism)**: PASS — `handle.create` emit
count unchanged (121,544 → 121,544 in ours). The new SID recipe is
bit-deterministic over `(pointer, object_type)`.
## Sister-chain analysis
- **tid=4→11** (no divergence): unchanged 11 events matched.
- **tid=7→2** (no divergence): unchanged 32 events matched.
- **tid=12→7** (`timeout_ns` mismatch): unchanged at idx=3. D-NEW-2 is
next-after-D-NEW-1 in the queue.
- **tid=14→9** (audio export): unchanged at idx=41. No D-NEW number
assigned yet; to be triaged later.
- **tid=15→10** (RESTORED): the diff tool's floating-create absorb
pulled the matched count back up to 16, which covers all of ours's
17-event tid=10 stream apart from the one absorbed floating create.
Canary's tid=15 stream is capped at 20k events by truncation, so any
further divergence presumably lies beyond the end of ours's current stream.
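
One arithmetic observation on the tid=12→7 `timeout_ns` pair, offered as a possible starting point for D-NEW-2 and not as a diagnosis. NT wait timeouts are expressed in 100 ns units, with negative values meaning relative. If canary's value is converted to those units and its low 32 bits are then zero-extended instead of sign-extended, the result equals the value ours reports. The helper names below are hypothetical and do not correspond to functions in either engine.

```python
# Values copied from the tid=12→7 divergence (payload.timeout_ns).
CANARY_TIMEOUT_NS = -30_000_000
OURS_TIMEOUT_NS = 429_466_729_600


def to_100ns_units(timeout_ns: int) -> int:
    """Convert nanoseconds to NT-style 100 ns units, keeping the sign."""
    # Floor division is exact here because the value is a multiple of 100.
    return timeout_ns // 100


def zero_extend_low32(value: int) -> int:
    """Reinterpret the low 32 bits of a signed value as unsigned."""
    return value & 0xFFFF_FFFF


units = to_100ns_units(CANARY_TIMEOUT_NS)
misread_ns = zero_extend_low32(units) * 100
print(units, misread_ns, misread_ns == OURS_TIMEOUT_NS)
```

Whether ours actually takes this path is unverified; the match only suggests checking how the 64-bit timeout is read on ours's side.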
## Refcount and stability audit
The fix touches the SID computation only — the `handle_refcount` and
`state.objects` insertion logic is unchanged. The C+17 refcount-leak
risk audit (5 tests) continues to apply unchanged.

The deterministic SID is a fresh value computed at first-touch and
overwrites the registry entry. The old per-tid SID is never seen by
the diff tool. This introduces no double-insertion or stale-mapping issues.
## D-NEW-3 status
**RESOLVED**. The race is now invisible to the diff tool. Both engines
emit the same SID for the same dispatcher; the diff tool absorbs the
floating-tid mismatch via the cross-tid match.
## Next target
**C+19 = D-NEW-1 (`NtDuplicateObject` `handle.create`)**, unchanged
from C+17 plan. Canary's `ObjectTable::DuplicateHandle` allocates a
fresh slot via `AddHandle` (emits `handle.create`); ours's
`nt_duplicate_object` aliases via `dup_id=source_id` per AUDIT-062 and
does NOT emit a new event. The tradeoff is between mirroring canary
(~30-40 LOC, risk = AUDIT-062 worker-cluster regression) and
suppressing the event in the diff tool (a band-aid).


@@ -0,0 +1,135 @@
# Phase A diff report
**This report is the output of Phase A's diff harness. Divergences
shown here are INPUT for Phase B (first-divergence localization),
not findings of Phase A.** Phase A's job is to make the harness
itself correct, not to analyze what it surfaces.
## Summary
| canary_tid | ours_tid | matched | canary_total | ours_total | first_divergence_at | floating_skipped (c/o) |
|---|---|---|---|---|---|---|
| 4 | 11 | 11 | 20000 | 11 | — | 0/0 |
| 6 | 1 | 102553 | 250000 | 108490 | 102553 | 0/0 |
| 7 | 2 | 32 | 32 | 33 | — | 0/0 |
| 12 | 7 | 3 | 6954 | 5 | 3 | 0/0 |
| 14 | 9 | 41 | 20000 | 77 | 41 | 0/0 |
| 15 | 10 | 16 | 20000 | 17 | — | 0/1 |

*`floating_skipped (c/o)` counts shared-global `handle.create` events absorbed by Phase C+18 cross-tid SID matching (per-side, observation-side ordering of process-global dispatchers). See schema-v1.md §"Shared-global SIDs".*
## canary_tid=4 → ours_tid=11
No divergence within the 11 compared events (canary has 20000, ours has 11).
## canary_tid=6 → ours_tid=1
First divergence at `tid_event_idx=102553`: kind: canary='handle.create' ours='kernel.return'
**Pre-context (last 5 matching events):**
```
canary: [102548] import.call RtlLeaveCriticalSection
ours: [102548] import.call RtlLeaveCriticalSection
canary: [102549] kernel.call RtlLeaveCriticalSection
ours: [102549] kernel.call RtlLeaveCriticalSection
canary: [102550] kernel.return RtlLeaveCriticalSection
ours: [102550] kernel.return RtlLeaveCriticalSection
canary: [102551] import.call NtDuplicateObject
ours: [102551] import.call NtDuplicateObject
canary: [102552] kernel.call NtDuplicateObject
ours: [102552] kernel.call NtDuplicateObject
```
**Divergent event:**
```
canary: [102553] handle.create sid=df686b147b291902
ours: [102553] kernel.return NtDuplicateObject
```
**Next event after the divergence (if any):**
```
canary: [102554] kernel.return NtDuplicateObject
ours: [102554] import.call RtlEnterCriticalSection
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1424314500, "kind": "handle.create", "payload": {"handle_semantic_id": "df686b147b291902", "object_name": null, "object_type": 1, "raw_handle_id": "0xf8000044"}, "schema_version": 1, "tid": 6, "tid_event_idx": 102553}
{"deterministic": true, "engine": "ours", "guest_cycle": 5398419, "host_ns": 461742475, "kind": "kernel.return", "payload": {"name": "NtDuplicateObject", "return_value": 0, "side_effects": [], "status": "0x00000000"}, "schema_version": 1, "tid": 1, "tid_event_idx": 102553}
```
## canary_tid=7 → ours_tid=2
No divergence within the 32 compared events (canary has 32, ours has 33).
## canary_tid=12 → ours_tid=7
First divergence at `tid_event_idx=3`: payload.timeout_ns: canary=-30000000 ours=429466729600
**Pre-context (last 5 matching events):**
```
canary: [0] import.call KeWaitForSingleObject
ours: [0] import.call KeWaitForSingleObject
canary: [1] kernel.call KeWaitForSingleObject
ours: [1] kernel.call KeWaitForSingleObject
canary: [2] handle.create sid=c49d8f0ab90401ea
ours: [2] handle.create sid=6e3d96c5a52bf429
```
**Divergent event:**
```
canary: [3] wait.begin {'handles_semantic_ids': ['c49d8f0ab90401ea'], 'timeout_ns': -30000000, 'alertable': False, 'wait_type': 'any'}
ours: [3] wait.begin {'handles_semantic_ids': ['6e3d96c5a52bf429'], 'timeout_ns': 429466729600, 'alertable': False, 'wait_type': 'any'}
```
**Next event after the divergence (if any):**
```
canary: [4] kernel.return KeWaitForSingleObject
ours: [4] kernel.return KeWaitForSingleObject
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1570223300, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["c49d8f0ab90401ea"], "timeout_ns": -30000000, "wait_type": "any"}, "schema_version": 1, "tid": 12, "tid_event_idx": 3}
{"deterministic": true, "engine": "ours", "guest_cycle": 0, "host_ns": 485908293, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["6e3d96c5a52bf429"], "timeout_ns": 429466729600, "wait_type": "any"}, "schema_version": 1, "tid": 7, "tid_event_idx": 3}
```
## canary_tid=14 → ours_tid=9
First divergence at `tid_event_idx=41`: payload.ord: canary=503 ours=293
**Pre-context (last 5 matching events):**
```
canary: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
ours: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
canary: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
ours: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
canary: [38] import.call KfLowerIrql
ours: [38] import.call KfLowerIrql
canary: [39] kernel.call KfLowerIrql
ours: [39] kernel.call KfLowerIrql
canary: [40] kernel.return KfLowerIrql
ours: [40] kernel.return KfLowerIrql
```
**Divergent event:**
```
canary: [41] import.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [41] import.call RtlEnterCriticalSection
```
**Next event after the divergence (if any):**
```
canary: [42] kernel.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [42] kernel.call RtlEnterCriticalSection
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1811324400, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "XAudioGetVoiceCategoryVolumeChangeMask", "ord": 503}, "schema_version": 1, "tid": 14, "tid_event_idx": 41}
{"deterministic": true, "engine": "ours", "guest_cycle": 417, "host_ns": 1606502025, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlEnterCriticalSection", "ord": 293}, "schema_version": 1, "tid": 9, "tid_event_idx": 41}
```
## canary_tid=15 → ours_tid=10
No divergence within the 16 compared events (canary has 20000, ours has 17).


@@ -0,0 +1,10 @@
{
"instructions": 50000007,
"imports": 40390,
"unimpl": 0,
"draws": 0,
"swaps": 1,
"unique_render_targets": 0,
"shader_blobs_live": 0,
"texture_cache_entries": 0
}


@@ -0,0 +1,10 @@
{
"instructions": 50000007,
"imports": 40390,
"unimpl": 0,
"draws": 0,
"swaps": 1,
"unique_render_targets": 0,
"shader_blobs_live": 0,
"texture_cache_entries": 0
}


@@ -0,0 +1,10 @@
{
"instructions": 50000007,
"imports": 40390,
"unimpl": 0,
"draws": 0,
"swaps": 1,
"unique_render_targets": 0,
"shader_blobs_live": 0,
"texture_cache_entries": 0
}


@@ -0,0 +1,143 @@
# Phase C+18 Investigation — Shared-global first-toucher race (2026-05-14)
## Framing verification (reading-error #28 discipline)
C+17 result: main matched-prefix advanced 102,171 → 102,553 (+382) when
ours's `ensure_dispatcher_object` started emitting `handle.create` for
synthesized shadows. But sister chain `tid=15→10` REGRESSED from 16 → 2:
```
canary tid=15: ours tid=10:
[0] import.call KeWaitForSingleObject [0] import.call KeWaitForSingleObject
[1] kernel.call KeWaitForSingleObject [1] kernel.call KeWaitForSingleObject
[2] wait.begin sid=66ae1b598f928969 [2] handle.create sid=b9e6799594b746ee
[3] kernel.return [3] wait.begin sid=b9e6799594b746ee
[4] kernel.return
```
The two engines disagree at idx=2: canary's tid=15 has `wait.begin`,
ours's tid=10 has `handle.create`. The SIDs are different too
(`66ae1b598f928969` vs `b9e6799594b746ee`) but the diff tool already
SKIPS SID fields per C+15-α schema-v1.
## Root cause: shared-global first-toucher race
The dispatcher at guest pointer `0x828a3230` is a **process-global
KSEMAPHORE** (object_type=3) that's touched by MULTIPLE guest threads
during boot:
- Canary: some thread other than tid=15 (likely the main boot thread,
tid=6) touches it first → emits `handle.create` there. By the time
tid=15 reaches `KeWaitForSingleObject`, the wrapper exists, so
`XObject::GetNativeObject` short-circuits via the `kXObjSignature`
marker and emits NO additional event. Canary tid=15's stream is
4 events long: import → kernel.call → wait.begin → kernel.return.
- Ours: tid=10 happens to be the first toucher → ours's
`ensure_dispatcher_object` emits `handle.create` on tid=10. Ours
tid=10's stream is 5 events long: import → kernel.call →
**handle.create** → wait.begin → kernel.return.

Both engines do the right thing semantically; whichever thread wins the
"first toucher" race depends on thread scheduling, which is NOT
bit-identical across engines (different host schedulers, JIT, etc.).
The diff tool sees one extra event on one side and reports it as a
divergence, but it's **observation-side**, not behavioral.

This is C+17 D-NEW-3.
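
The race can be reproduced in a toy model. The sketch below is an illustration only, not engine code: `run`, the `"main"`/`"waiter"` thread labels, and the assumption that both threads perform the same wait are all hypothetical simplifications. It shows that a per-pointer idempotent wrapper emits `handle.create` on whichever thread arrives first, so the same guest behavior yields different per-tid streams under different schedules.

```python
from collections import defaultdict


def run(schedule):
    """Replay a list of (tid, pointer) waits; return per-tid event kinds."""
    wrapped = set()              # pointers that already have a wrapper
    streams = defaultdict(list)  # tid -> ordered list of event kinds
    for tid, pointer in schedule:
        streams[tid] += ["import.call", "kernel.call"]
        if pointer not in wrapped:
            # First touch only: later touches short-circuit without emit.
            wrapped.add(pointer)
            streams[tid].append("handle.create")
        streams[tid] += ["wait.begin", "kernel.return"]
    return dict(streams)


SEM = 0x828A3230  # the process-global KSEMAPHORE pointer from the text

# Same two waits, opposite arrival order.
main_first = run([("main", SEM), ("waiter", SEM)])
waiter_first = run([("waiter", SEM), ("main", SEM)])

print(main_first["waiter"])
print(waiter_first["waiter"])
```

The `"waiter"` stream gains one `handle.create` only in the schedule where it touches the dispatcher first, which mirrors the canary tid=15 versus ours tid=10 listing above.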
## Verified via static + dynamic evidence
1. Both ours's `ensure_dispatcher_object` (exports.rs:4363) and canary's
`XObject::GetNativeObject` (xobject.cc:397-483) are **per-pointer
idempotent**: re-entry on a pointer that already has the
`kXObjSignature` marker short-circuits without emit.
2. The shared `objects` table is process-global in both engines
(`KernelState::objects` map; canary's `KernelState::object_table()`).
3. In the ours-cold log, `0x828a3230` appears in exactly ONE
`handle.create` (on tid=10) — confirming the per-pointer
idempotence:
```
$ grep '"raw_handle_id":"0x828a3230"' ours-cold.jsonl
{"kind":"handle.create","tid":10,"tid_event_idx":2,...}
```
4. The canary diff side reports `[2] wait.begin` with a SID that
refers to a dispatcher whose `handle.create` was already emitted
elsewhere (likely on canary tid=6 main chain or a worker).
5. The SID computation in both engines uses
`semantic_id(create_site_pc=0, creating_tid, idx_at_creation,
object_type)`. Both `creating_tid` and `idx_at_creation` depend on
WHICH thread did the first touch — so even if both engines wrapped
the same dispatcher, their SIDs would still differ.
## Class of bug
Class η — **harness observation-side asymmetry on scheduling-non-
deterministic process-global state**. Not a real engine bug; both
engines are doing the right thing. The harness (per-tid sequence
diff) is the wrong abstraction for this class of event.
## Fix shape
Two coordinated changes, both small and additive:
### (A) Engine: scheduling-invariant SID for process-global dispatchers
Add `event_log::semantic_id_shared_global(pointer, object_type)` (ours
and canary) — a SID recipe keyed only on `(pointer, object_type)`.
Inputs to the existing FNV-1a:
```
create_site_pc = SHARED_GLOBAL_SID_MARKER (= 0xC01AB005, fixed sentinel)
creating_tid = 0
tid_event_idx = pointer as u64
object_type = object_type
```
The marker constant sits outside any plausible guest-PC range (PPC text
0x82000000-0x82FFFFFF; XEX header 0x3001xxxx; heap 0x4xxxxxxx), so the
recipe's inputs never coincide with those of a regular per-thread SID
(whose `create_site_pc` is either a real guest PC or 0). Short of an
FNV-1a hash collision, the two SID families stay disjoint.

`ensure_dispatcher_object` (ours) and `XObject::GetNativeObject`
(canary) route their `handle.create` emit through this recipe instead
of the per-thread `semantic_id`. Both engines compute the **same SID**
for the same dispatcher pointer regardless of which guest thread wins
the first-toucher race.
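
A minimal sketch of the recipe, assuming 64-bit FNV-1a over a fixed-width packing of the four inputs. The packing (`<IIQI`, little-endian u32/u32/u64/u32) is an assumption made for illustration; the engines' actual byte layout is not stated here, so the values this produces will not match real SIDs from the logs.

```python
import struct

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
SHARED_GLOBAL_SID_MARKER = 0xC01AB005


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFF_FFFF_FFFF_FFFF
    return h


def semantic_id(create_site_pc, creating_tid, tid_event_idx, object_type) -> str:
    # ASSUMPTION: field order and widths are illustrative, not the real layout.
    packed = struct.pack("<IIQI", create_site_pc, creating_tid,
                         tid_event_idx, object_type)
    return f"{fnv1a_64(packed):016x}"


def semantic_id_shared_global(pointer, object_type) -> str:
    """SID keyed only on (pointer, object_type); no thread-dependent input."""
    return semantic_id(SHARED_GLOBAL_SID_MARKER, 0, pointer, object_type)


print(semantic_id_shared_global(0x828A3230, 3))
```

Because neither `creating_tid` nor the per-thread event index feeds the hash, the result is the same no matter which thread reaches the dispatcher first.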
### (B) Diff tool: cross-tid floating `handle.create` matching
Pre-pass: collect the set of shared-global SIDs across BOTH engines and
ALL tids. A `handle.create` event is detected as shared-global by
recomputing the deterministic SID from its `(raw_handle_id,
object_type)` payload and matching against `handle_semantic_id`.

When per-tid comparison finds a kind mismatch where one side has a
`handle.create` whose SID is in the floating set:

- Advance only that side's stream pointer past the floating event.
- Re-compare at the same canonical position.

This handles the "extra event on tid=10 but not tid=15" case
symmetrically. Subsequent `wait.begin` events whose
`handles_semantic_ids` element matches a shared-global SID continue to
align via the schema-v1 strict-equality rule (SID fields are already
skipped per the C+15-α SKIP_PAYLOAD_FIELDS_BY_KIND policy, but the
underlying object alignment is preserved by the deterministic recipe,
which is useful for future passes that re-enable SID comparison).
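
The absorb rule can be sketched as a two-pointer walk. This is a simplified illustration, not the diff tool's code: events are reduced to `(kind, sid)` tuples, only `kind` is compared, and the function name, return shape, and the `"shared-sid"` value are hypothetical.

```python
def matched_prefix(canary, ours, floating_sids):
    """Walk two per-tid streams of (kind, sid) tuples.

    Returns (matched, first_divergence_or_None, (skipped_canary, skipped_ours)).
    """
    def is_floating(event):
        kind, sid = event
        return kind == "handle.create" and sid in floating_sids

    i = j = matched = 0
    skipped = [0, 0]
    while i < len(canary) and j < len(ours):
        if canary[i][0] == ours[j][0]:
            # Kinds agree; SID fields are skipped, as in the text.
            i += 1
            j += 1
            matched += 1
        elif is_floating(canary[i]):
            i += 1            # advance only canary past the floating event
            skipped[0] += 1
        elif is_floating(ours[j]):
            j += 1            # advance only ours past the floating event
            skipped[1] += 1
        else:
            return matched, matched, tuple(skipped)  # real divergence
    return matched, None, tuple(skipped)


canary = [("import.call", None), ("kernel.call", None),
          ("wait.begin", "shared-sid"), ("kernel.return", None)]
ours = [("import.call", None), ("kernel.call", None),
        ("handle.create", "shared-sid"),
        ("wait.begin", "shared-sid"), ("kernel.return", None)]

result = matched_prefix(canary, ours, {"shared-sid"})
print(result)
```

With an empty floating set the same inputs diverge at position 2, which is the strict legacy behavior.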
### Why this is the right fix (not over-suppression)
- **Pointer-derived SIDs are unique per object identity**. Two distinct
dispatchers at the same pointer with different `object_type` get
distinct SIDs (defense in depth).
- **Regular per-thread `handle.create` events keep strict alignment**.
Only events whose SID matches the deterministic shared-global recipe
are eligible for cross-tid absorption. A regular file-handle create
(allocated via `alloc_handle_for`/`AddHandle`) uses the per-(tid,
idx) SID recipe and CANNOT match the shared-global hash by
construction.
- **The diff tool still reports real divergences**. Tests confirm:
  - `test_non_floating_real_divergence_still_caught`: an unrelated
    extra event on ours's side IS reported.
  - `test_strict_alignment_without_floating`: when the floating set is
    empty, legacy strict behavior holds.


@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""Per-tid truncation for canary JSONL logs.

Canary's full boot log can exceed 800 MB; the diff tool loads the
entire file into RAM. We only need enough events per tid to walk past
the first divergence; anything beyond is dead weight. Cap each tid at
a configurable max (default: 250k for tid=6 main, 20k for others)."""

import json
import sys
from pathlib import Path

MAIN_CAP = 250_000   # tid=6 (canary's main chain, mapped to ours tid=1)
SISTER_CAP = 20_000  # everything else


def main() -> int:
    src = Path(sys.argv[1])
    dst = Path(sys.argv[2])
    counts: dict[int, int] = {}  # tid -> events kept so far
    kept = 0
    total = 0
    with src.open("r", encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            # Copy the first line through verbatim; it is not counted as an event.
            if lineno == 1:
                fout.write(line)
                continue
            total += 1
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON.
                continue
            tid = ev.get("tid", 0)
            cap = MAIN_CAP if tid == 6 else SISTER_CAP
            c = counts.get(tid, 0)
            if c >= cap:
                # This tid has reached its cap; drop the event.
                continue
            counts[tid] = c + 1
            fout.write(line)
            kept += 1
    print(f"kept {kept}/{total} events across {len(counts)} tids")
    for tid in sorted(counts):
        print(f"  tid={tid:4d}  {counts[tid]}")
    return 0


if __name__ == "__main__":
    sys.exit(main())