handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,108 @@
# Phase C+17 cold-vs-cold result (2026-05-14)
## Matched-prefix table
| canary_tid | ours_tid | C+16 | C+17 | delta | first_divergence_at | kind |
|------------|----------|---------|---------|-----------|---------------------|-----------------------------------------------------|
| 6 | 1 | 102,171 | 102,553 | **+382** | 102,553 | `NtDuplicateObject` no `handle.create` (NEW-1) |
| 4 | 11 | 8 | 11 | **+3** | — | no divergence in 11 events (ours stalls) |
| 7 | 2 | 30 | 32 | **+2** | — | no divergence in 32 events |
| 12 | 7 | 2 | 3 | **+1** | 3 | `timeout_ns` differs in `wait.begin` (NEW-2) |
| 14 | 9 | 2 | 41 | **+39** | 41 | unrelated `XAudioGetVoiceCategoryVolumeChangeMask` |
| 15 | 10 | 16 | 2 | **-14** | 2 | ordering: ours emits `handle.create` on first thread-touch of shared dispatcher (NEW-3) |
**Main chain advanced +382** (D-2/D-3/D-4 root cause resolved). 4 of 5 sister
chains advanced. The tid=15→10 chain regressed by 14 events due to a
cross-thread-caching ordering side-effect (see broad-impact.md / NEW-3). The
underlying state is aligned by the same root-cause fix, so the regression is
"observation-side" only: canary's `GetNativeObject` is process-global, so the
adoption happens on whichever thread touches the dispatcher first.
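For readers unfamiliar with the table columns, here is a minimal sketch of how a matched prefix and first-divergence index can be computed from two per-thread event streams. It is not the project's diff tool: the `Event` shape, `matched_prefix`, and `ev` are hypothetical names, and the real tool also applies policies such as SID skipping that are omitted here.

```rust
/// One trace event, reduced to the fields compared in this sketch.
/// (Hypothetical shape; the real JSONL events carry more fields.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub kind: String, // e.g. "import.call", "handle.create"
    pub name: String, // e.g. "NtDuplicateObject"
}

/// Convenience constructor used in the examples.
pub fn ev(kind: &str, name: &str) -> Event {
    Event { kind: kind.to_string(), name: name.to_string() }
}

/// Returns `(matched_prefix_len, first_divergence_index)`.
///
/// The divergence index is `None` when one stream simply ends before
/// any mismatch is seen (the "no divergence in N events" rows).
pub fn matched_prefix(canary: &[Event], ours: &[Event]) -> (usize, Option<usize>) {
    // Count leading positions where both streams hold equal events.
    let matched = canary
        .iter()
        .zip(ours.iter())
        .take_while(|(c, o)| c == o)
        .count();
    // A divergence exists only if both streams have an event at `matched`.
    let diverged = matched < canary.len() && matched < ours.len();
    (matched, if diverged { Some(matched) } else { None })
}
```

With zero-based indices the matched length equals the index of the first divergence, which is consistent with the main-chain row showing 102,553 in both columns.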
## New first divergence on main (idx=102,553)
```
canary: [102551] import.call NtDuplicateObject
ours: [102551] import.call NtDuplicateObject
canary: [102552] kernel.call NtDuplicateObject
ours: [102552] kernel.call NtDuplicateObject
canary: [102553] handle.create sid=df686b147b291902 (object_type=1)
ours: [102553] kernel.return NtDuplicateObject
canary: [102554] kernel.return NtDuplicateObject
ours: [102554] import.call RtlEnterCriticalSection
```
Canary's `NtDuplicateObject_entry` calls `ObjectTable::DuplicateHandle`, which
fires `AddHandle` for the new slot and emits `handle.create`. Our
`nt_duplicate_object` short-circuits via handle aliasing (AUDIT-062's
`dup_id=source_id` design) and does NOT emit a new `handle.create`. This is
**D-NEW-1 HIGH**, the first C+18 target.
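The difference between the two duplication strategies can be illustrated with a toy handle table. This is a sketch under assumptions, not the code in either emulator: `HandleTable`, `duplicate_aliased`, and `duplicate_with_new_slot` are hypothetical names, and the handle step of 4 is arbitrary.

```rust
use std::collections::HashMap;

/// Toy handle table that records the events it would emit.
pub struct HandleTable {
    next_handle: u32,
    slots: HashMap<u32, u64>, // handle -> object id
    pub events: Vec<String>,
}

impl HandleTable {
    pub fn new() -> Self {
        // 0x1000 mirrors the `next_handle` start mentioned in these notes.
        Self { next_handle: 0x1000, slots: HashMap::new(), events: Vec::new() }
    }

    /// Allocates a fresh slot and emits `handle.create`.
    pub fn add_handle(&mut self, object: u64) -> u32 {
        let h = self.next_handle;
        self.next_handle += 4; // arbitrary step for the sketch
        self.slots.insert(h, object);
        self.events.push(format!("handle.create handle={h:#x}"));
        h
    }

    /// Aliasing strategy (`dup_id = source_id`): returns the source
    /// handle itself, so no slot is allocated and no event is emitted.
    pub fn duplicate_aliased(&mut self, source: u32) -> Option<u32> {
        self.slots.contains_key(&source).then_some(source)
    }

    /// New-slot strategy: a second handle to the same object,
    /// which goes through `add_handle` and therefore emits an event.
    pub fn duplicate_with_new_slot(&mut self, source: u32) -> Option<u32> {
        let object = *self.slots.get(&source)?;
        Some(self.add_handle(object))
    }
}
```

Only the new-slot path produces the extra `handle.create` seen at idx=102,553 in the canary trace, which is why the aliasing design shows up as a trace divergence even when guest-visible behavior is otherwise equivalent.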
## Acceptance gates
- **Gate 1 (default-off digest)**: PASS — 3× reproducible at
`e1dfcb1559f987b35012a7f2dc6d93f5` (unchanged from C+13/C+15-α/C+16
baseline). The fix is observation-only at the digest level; the new
shadow-handle refcount entries do not feed back into guest behavior
inside the 50M-instruction window.
- **Gate 2 (cvar-on emit)**: PASS — ours 121,544 events (was 121,537 in
C+16, +7 from new lazy `handle.create` emits in the main chain
bring-up); canary 3,059,463 events in ~90s. Both JSONL parse cleanly.
- **Gate 3 (diff tool)**: PASS — diff tool produces 6-chain report with
the new SID-skip semantics for `wait.begin.handles_semantic_ids`.
- **Gate 4 (cold-vs-cold)**: PASS — main matched prefix advances
102,171 → 102,553 (+382). 4 of 5 sister chains advance; 1 minor
regression on tid=15→10 (NEW-3, observation-side).
- **Gate 5 (build clean)**: PASS — `cargo build --release` clean
(1 pre-existing dead_code warning unrelated).
- **Gate 6 (tests)**: PASS — 186 → 191 (added 5 new lifecycle tests for
`ensure_dispatcher_object`; all pass + entire workspace green).
- **Gate 7 (Phase B image hash)**: PASS — `image_loaded_sha256` =
`ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
(unchanged).
- **Gate 8 (event-log determinism)**: PASS — `handle.create` event
stream (post-strip of `host_ns`) is bit-identical across 3 cold
runs: md5 `0bd91b4c61dea52d72859e7d9c3541ba`.
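The Gate 8 check (strip `host_ns`, then hash the `handle.create` stream) can be sketched as follows. Everything here is illustrative: `LoggedEvent` and `stream_digest` are hypothetical names, the canonical line format is invented, and FNV-1a stands in for md5 because the Rust standard library has no md5. The digests it produces are therefore not comparable with the md5 quoted above.

```rust
/// One logged event. Only `host_ns` is expected to vary between runs.
pub struct LoggedEvent {
    pub host_ns: u64, // host wall-clock timestamp, excluded from the digest
    pub tid: u32,
    pub kind: String,
    pub payload: String,
}

/// FNV-1a 64-bit over `bytes`, continuing from `hash`.
/// Stand-in for md5; chosen only because it needs no external crate.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Digest of all events of one `kind`, with `host_ns` stripped.
pub fn stream_digest(events: &[LoggedEvent], kind: &str) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for e in events.iter().filter(|e| e.kind == kind) {
        // Canonical form deliberately omits `host_ns`.
        let line = format!("{}\t{}\t{}\n", e.tid, e.kind, e.payload);
        hash = fnv1a(line.as_bytes(), hash);
    }
    hash
}
```

Two cold runs that differ only in timestamps then hash identically, while any change in ordering or payload of the selected events changes the digest.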
## Sister-chain analysis
All 5 sister chains' first divergences are no longer "wait.begin with SID=0":
- tid=4→11: was `KeWaitForMultipleObjects` at idx=8 with empty SIDs;
now goes 11 events deep with NO divergence (ours stalls, but for
reasons unrelated to D-2/D-3/D-4).
- tid=7→2: was `KeWaitForSingleObject` at idx=30 with SID=0; now 32
events with NO divergence.
- tid=12→7: was at idx=2 with SID=0; now idx=3 — the `handle.create`
matches (SID skipped per diff-tool policy), divergence is now
`timeout_ns` mismatch (-30000000 vs 429466729600) — a real
game-side wait-quantum mismatch.
- tid=14→9: was at idx=2 with SID=0; now idx=41, where it reaches a real
`XAudioGetVoiceCategoryVolumeChangeMask` divergence (a sister-chain
audio export that our boot does not reach).
- tid=15→10: was at idx=16 (no divergence in 16 events); now idx=2
diverges because ours emits `handle.create` on this thread's first
touch of a globally-shared semaphore dispatcher at `0x828a3230`,
while canary emitted it earlier on another thread. Observation-side
ordering issue; underlying state model is the same. NEW-3 below.
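One arithmetic observation on the tid=12→7 `timeout_ns` values, offered as something to rule out rather than a diagnosis: both numbers are exactly what a single raw interval of -300000 would produce if it were scaled by 100 after being sign-extended on one side and zero-extended from 32 bits on the other. The 100 ns tick unit and the 32-bit truncation are assumptions here; the notes do not say how either side derives `timeout_ns`, and the function names below are hypothetical.

```rust
/// Raw interval as a 32-bit pattern: -300000 ticks (assumed 100 ns each).
pub const RAW: u32 = (-300_000i32) as u32;

/// Sign-extends the 32-bit value before scaling to nanoseconds.
pub fn ticks_to_ns_signed(raw: u32) -> i64 {
    i64::from(raw as i32) * 100
}

/// Zero-extends the 32-bit value before scaling to nanoseconds.
pub fn ticks_to_ns_unsigned(raw: u32) -> i64 {
    i64::from(raw) * 100
}
```

Here `ticks_to_ns_signed(RAW)` is -30000000 and `ticks_to_ns_unsigned(RAW)` is 429466729600, the two values in the divergence. If the emit path on either side handles only the low 32 bits of the timeout, the mismatch could be an extension difference in tracing rather than a difference in the guest's wait quantum.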
## Refcount leak risk audit
The fix initializes `state.handle_refcount[ptr]` to 1 for each first-touch shadow.
Three concerns and mitigations:
1. **Leak risk**: no code path currently destroys these shadows
(`ensure_dispatcher_object` adoptions). Canary's design has the same
property — `GetNativeObject`-synthesized `XObject`s survive until
process exit. No leak relative to canary's behavior.
2. **Double-bump risk**: the early-return guard at the top of
`ensure_dispatcher_object` (`state.objects.contains_key(&ptr)`)
ensures the refcount entry is initialized exactly once per pointer.
Test `ensure_dispatcher_object_is_idempotent_on_repeated_touch`
verifies this.
3. **Refcount underflow risk**: if a future change wires
`handle.destroy` on shadow removal (e.g., when `NtClose` is
somehow called on a guest dispatcher pointer), the refcount must
not underflow. The `or_insert(1)` form preserves any pre-existing
refcount, for example one left by an earlier `alloc_handle_for` on the
same key. That case is impossible in practice, since `next_handle`
starts at `0x1000` and pointers live above `0x1_0000`.
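The three mitigations can be seen together in a reduced model. This is a sketch assuming 32-bit guest pointers and a simplified state struct, not the actual `ensure_dispatcher_object`: the `KernelState` layout, the `bool` return value, and the `release` helper are hypothetical.

```rust
use std::collections::HashMap;

/// Reduced kernel state holding only what the audit discusses.
#[derive(Default)]
pub struct KernelState {
    pub objects: HashMap<u32, &'static str>, // guest ptr -> shadow kind
    pub handle_refcount: HashMap<u32, u32>,
    pub events: Vec<String>,
}

impl KernelState {
    /// Adopts a guest dispatcher pointer on first touch.
    /// Returns true only when a new shadow was created.
    pub fn ensure_dispatcher_object(&mut self, ptr: u32, kind: &'static str) -> bool {
        // Early-return guard: repeated touches are no-ops (double-bump risk).
        if self.objects.contains_key(&ptr) {
            return false;
        }
        self.objects.insert(ptr, kind);
        // `or_insert(1)` keeps any refcount that already exists for this key.
        self.handle_refcount.entry(ptr).or_insert(1);
        self.events.push(format!("handle.create ptr={ptr:#x}"));
        true
    }

    /// Hypothetical future release path: decrements without underflow.
    /// Returns the new count, or None if unknown or already zero.
    pub fn release(&mut self, ptr: u32) -> Option<u32> {
        let count = self.handle_refcount.get_mut(&ptr)?;
        *count = count.checked_sub(1)?;
        Some(*count)
    }
}
```

Using `checked_sub` in any future release path turns an underflow into an explicit `None` rather than a wrapped or panicking decrement, which keeps concern 3 testable if `handle.destroy` is ever wired up.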