handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,134 @@
# Phase C+17 — Broad-impact catalog (2026-05-14)
The C+17 fix touches a widely-used primitive (`ensure_dispatcher_object`,
called by `Ke{Wait,Set,Reset,Pulse}Event`, `Ke{Wait,Release}Semaphore`, etc.).
This catalog enumerates the surfaced divergences post-fix per chain.
## Resolved (3 of 5 catalogued in C+15-α)
### D-2 / D-3 / D-4 — KeWait*ForSingleObject native-obj handle (all 5 chains)
Class E asymmetry. Canary's `xeKeWaitForSingleObject` /
`KeWaitForMultipleObjects_entry` calls `XObject::GetNativeObject` which
emits `handle.create` for the synthesized wrapper; ours's
`ensure_dispatcher_object` did the same shadow synthesis but never emitted
the schema event. Fix: emit `handle.create` (with the appropriate
`object_type` from `KernelObject::schema_object_type`) on first
adoption, and register the SID so subsequent `wait.begin` events resolve
non-zero `handles_semantic_ids[]`.
Observed: all 5 chains' divergences move past the wait-begin idx that was
previously blocked at SID=0.
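A minimal sketch of the "emit on first adoption, then register the SID" behaviour described above. The `Event` enum, the `KernelState` fields, and the `wait_single` helper are hypothetical stand-ins for illustration; they are not the real types in exports.rs.

```rust
use std::collections::HashMap;

/// Hypothetical schema events; the real event log's shape is not shown in these notes.
#[derive(Debug, PartialEq)]
enum Event {
    HandleCreate { sid: u32, object_type: &'static str },
    WaitBegin { handles_semantic_ids: Vec<u32> },
}

/// Hypothetical, heavily reduced kernel state.
#[derive(Default)]
struct KernelState {
    /// Guest dispatcher pointer -> semantic ID (SID) of the shadow object.
    objects: HashMap<u32, u32>,
    next_sid: u32,
    log: Vec<Event>,
}

impl KernelState {
    /// Returns the SID for a guest dispatcher pointer. On first adoption it
    /// synthesizes the shadow object, emits `handle.create`, and registers the SID.
    fn ensure_dispatcher_object(&mut self, guest_ptr: u32, object_type: &'static str) -> u32 {
        if let Some(&sid) = self.objects.get(&guest_ptr) {
            return sid; // already adopted: no second emit
        }
        self.next_sid += 1;
        let sid = self.next_sid;
        self.objects.insert(guest_ptr, sid);
        self.log.push(Event::HandleCreate { sid, object_type });
        sid
    }

    /// A wait resolves the SID first, so `wait.begin` carries a non-zero ID.
    fn wait_single(&mut self, guest_ptr: u32, object_type: &'static str) {
        let sid = self.ensure_dispatcher_object(guest_ptr, object_type);
        self.log.push(Event::WaitBegin { handles_semantic_ids: vec![sid] });
    }
}
```

Waiting twice on the same pointer yields one `HandleCreate` followed by two `WaitBegin` events that both carry the same non-zero SID.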
## Advanced
### Main tid=6→1 (+382)
102,171 → 102,553. The 382 new matching events between the two indexes are
mostly `kernel.{call,return}`, `import.call`, `RtlEnter/LeaveCriticalSection`,
plus the now-aligned `handle.create`+`wait.begin` pairs from
`KeWaitForSingleObject` and `KeWaitForMultipleObjects` calls. Several
new shadow `handle.create` events fire on first encounter of
specific PKEVENT/PKSEMAPHORE pointers in the game's init path.
### Sister chains (+3 / +2 / +1 / +39)
- tid=4→11 +3: matches all 11 emitted events.
- tid=7→2 +2: matches all 32 events.
- tid=12→7 +1: matches through `handle.create` at idx=2.
- tid=14→9 +39: walks past all the now-aligned `KeWait*` framing into the
audio subsystem.
## Persisted (pre-existing bugs unaffected)
None of the C+15-α catalog's other groups are touched.
## NEW divergences (cataloged for future iterates)
### D-NEW-1 (HIGH) — main idx=102,553: `NtDuplicateObject` no `handle.create`
Canary's `NtDuplicateObject_entry` → `ObjectTable::DuplicateHandle`
allocates a new slot via `AddHandle(object, &new_handle)`
(util/object_table.cc:148-201), which fires the C+15-α-wired
`phase_a::EmitHandleCreateAuto`. Ours's `nt_duplicate_object`
(exports.rs) implements per-AUDIT-062 alias-on-dup
semantics: `dup_id = source_id`, so the same slot is re-used with a
bumped refcount. No new `handle.create` fires.
This is a genuine engine-architectural difference. Mirror options:
- (a) Make ours allocate a fresh handle on `NtDuplicateObject` and emit
`handle.create` (mirror canary). ~30-40 LOC; downstream impact on
every existing AUDIT-062-dependent code path needs audit.
- (b) Diff-tool suppress this `handle.create` site. Band-aid.
Recommendation: (a). C+18 target. Trade-off: AUDIT-062's "alias on dup"
was implemented to handle a specific worker-cluster handle-aliasing
issue; un-doing it may surface a different regression. The risk
profile is similar to C+15-α: invisible state divergences become
visible. ~30 LOC fix or ~30 LOC tactical revert.
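To make the two duplication models concrete, here is a toy handle table that supports both. `HandleTable`, its fields, and the `created` list (standing in for `handle.create` emits) are assumptions made for illustration, not the structures in either engine.

```rust
use std::collections::HashMap;

/// Hypothetical handle table, reduced to what the comparison needs.
#[derive(Default)]
struct HandleTable {
    /// Handle -> object id.
    slots: HashMap<u32, u32>,
    /// Object id -> reference count.
    refcounts: HashMap<u32, u32>,
    next_handle: u32,
    /// Handles for which a `handle.create` event would have been emitted.
    created: Vec<u32>,
}

impl HandleTable {
    /// Allocates a new slot and records the (simulated) `handle.create`.
    fn add_handle(&mut self, object: u32) -> u32 {
        self.next_handle += 4; // arbitrary stride for the sketch
        let handle = self.next_handle;
        self.slots.insert(handle, object);
        *self.refcounts.entry(object).or_insert(0) += 1;
        self.created.push(handle);
        handle
    }

    /// Alias-on-dup: returns the source handle and only bumps the refcount,
    /// so nothing is appended to `created`.
    fn duplicate_alias(&mut self, source: u32) -> Option<u32> {
        let object = *self.slots.get(&source)?;
        *self.refcounts.entry(object).or_insert(0) += 1;
        Some(source)
    }

    /// Option (a): allocate a fresh slot, which also records a create event.
    fn duplicate_fresh(&mut self, source: u32) -> Option<u32> {
        let object = *self.slots.get(&source)?;
        Some(self.add_handle(object))
    }
}
```

Both paths bump the refcount; only `duplicate_fresh` produces a new handle value and a new create event, which is the difference the diff tool sees at idx=102,553.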
### D-NEW-2 (MEDIUM) — tid=12→7 idx=3: `wait.begin.timeout_ns` mismatch
```
canary: wait.begin handles_semantic_ids=[SID-A] timeout_ns=-30000000
ours: wait.begin handles_semantic_ids=[SID-B] timeout_ns=429466729600
```
The SIDs differ (skipped per diff policy). The `timeout_ns` is the issue:
canary logs a 30 ms relative timeout; ours logs 429,466,729,600 ns, about
429.47 s, which reads as an absolute-time encoding. Likely cause: ours's
`decode_timeout_ns` returns the raw `mem.read_u64(timeout_ptr) as i64 * 100`
without applying the "negative=relative / positive=absolute" semantics
consistently with canary. Inspect `decode_timeout_ns` (exports.rs:4890).
Canary's threading.cc emit code also passes `(*timeout_ptr) * 100` directly
without sign conversion, so the divergence is upstream in how each engine
**writes** the TIMEOUT* struct. Probably ε-class (game-side state
encoding).
C+19 target estimate. ~10-30 LOC investigation.
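As a starting point for that investigation, the two logged values can be related by plain arithmetic. The sketch below assumes the `* 100` tick-to-nanosecond scaling quoted above; the function names are hypothetical. It shows that ours's value equals canary's tick count truncated to 32 bits and zero-extended. That is an observation about the numbers, not a confirmed root cause.

```rust
/// Scales raw 100 ns ticks to nanoseconds, as in the formula quoted above.
fn ticks_to_ns(raw_ticks: i64) -> i64 {
    raw_ticks * 100
}

/// Keeps only the low 32 bits of a tick count and zero-extends them to 64 bits,
/// which is what happens if the sign extension into the high dword is lost.
/// Hypothetical helper for the arithmetic check only.
fn zero_extend_low_dword(ticks: i32) -> i64 {
    (ticks as u32) as i64
}

/// Canary's logged timeout: -300,000 ticks = -30,000,000 ns (30 ms, relative).
fn canary_timeout_ns() -> i64 {
    ticks_to_ns(-300_000)
}

/// The same tick count with its sign extension dropped:
/// (2^32 - 300,000) * 100 = 429,466,729,600 ns, the value ours logged.
fn zero_extended_timeout_ns() -> i64 {
    ticks_to_ns(zero_extend_low_dword(-300_000))
}
```

If this match is not a coincidence, the first thing worth checking is whether the high dword of the 64-bit timeout is zero in ours's guest memory at the time of the wait.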
### D-NEW-3 (LOW) — tid=15→10 idx=2: `handle.create` ordering on shared dispatcher
Canary's `GetNativeObject` is **process-global**: once any thread adopts
a dispatcher pointer (stashing `kXObjSignature` in the wait_list), all
subsequent threads find the existing handle and do NOT re-emit. Canary's
`handle.create` for the semaphore at guest pointer `0x828a3230` (XAudio
voice volume changemask?) was emitted earlier on a different thread, so on
tid=15 the first wait goes straight to `wait.begin`.
Ours's `ensure_dispatcher_object` is also process-global (the `state.objects`
map is shared in `KernelState`). However, the **timing of first adoption**
differs because thread interleaving / boot ordering between the two engines
isn't bit-identical. Ours's tid=10 happens to be the first to touch
`0x828a3230`, so it emits `handle.create` at idx=2; canary's tid=15
arrived after another thread (probably tid=6 or tid=10) had already
adopted it.
This is a **timing-induced ordering** divergence, not a state-model
asymmetry. It's the inverse of the typical D-1/D-2 class — both engines
emit the SAME total number of `handle.create` events; the issue is which
thread happens to be the "first toucher". The diff tool currently treats
this as a divergence because it compares per-tid sequences strictly.
Three possible mitigations:
- (a) Diff-tool: relax ordering for `handle.create` emits when the
"next thread" event is `wait.begin` on the same dispatcher. Complex.
- (b) Suppress `handle.create` from the per-thread sequence entirely;
treat it as a global emit and only diff `wait.begin` SIDs against a
process-global SID-registry. Could work via `SKIP_PAYLOAD_FIELDS_BY_KIND`
extension to drop the event from per-tid alignment.
- (c) Live with the +0/-14 trade-off on tid=15→10 — the main chain
improvement dwarfs it.
Recommendation: (c) for now; C+20+ if the chain becomes load-bearing.
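For reference, option (b) could look roughly like the sketch below: drop the globally ordered kinds from each per-tid sequence and compare their totals separately. The `Event` struct and both function names are hypothetical; the real diff tool and its `SKIP_PAYLOAD_FIELDS_BY_KIND` mechanism are not shown in these notes.

```rust
/// Hypothetical, minimal trace event.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    tid: u32,
    kind: &'static str,
}

/// Builds the per-thread sequence used for alignment, leaving out any kind
/// that should be treated as a process-global emit.
fn per_tid_sequence<'a>(
    events: &'a [Event],
    tid: u32,
    global_kinds: &[&str],
) -> Vec<&'a Event> {
    events
        .iter()
        .filter(|e| e.tid == tid)
        .filter(|e| !global_kinds.contains(&e.kind))
        .collect()
}

/// Counts a kind across all threads, for the process-global comparison.
fn global_count(events: &[Event], kind: &str) -> usize {
    events.iter().filter(|e| e.kind == kind).count()
}
```

With toy traces where canary emits `handle.create` on one thread and `wait.begin` on another, while ours emits both on a single thread, the filtered per-tid sequences match and the global `handle.create` counts are equal.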
## Reading-error register
- **Reading-error #28 (verify framing first)**: FOLLOWED. Canary's
`GetNativeObject` was read end-to-end before any code change.
- **Reading-error #23 (widely-used primitive flip)**: MITIGATED. Cold-vs-cold
gate caught no main-chain regression; minor sister-chain regression on
tid=15→10 is documented as NEW-3.
- **Reading-error #19 (host-side emits)**: FOLLOWED. `event_log::is_enabled()`
guards every new emit; with logging off (the default), the cost per site is
one relaxed atomic-bool check and no event is built.
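The guard pattern can be sketched as follows. Only the name `is_enabled` comes from the notes; the static flag, `set_enabled`, and the `Vec<String>` sink are assumptions made to keep the example self-contained.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global switch; off by default.
static EVENT_LOG_ENABLED: AtomicBool = AtomicBool::new(false);

/// One relaxed load; this is the whole cost of an emit site while logging is off.
fn is_enabled() -> bool {
    EVENT_LOG_ENABLED.load(Ordering::Relaxed)
}

/// Hypothetical toggle, e.g. driven by a CLI flag or environment variable.
fn set_enabled(on: bool) {
    EVENT_LOG_ENABLED.store(on, Ordering::Relaxed);
}

/// Emit site: return before formatting or allocating anything when disabled.
fn emit_handle_create(sink: &mut Vec<String>, sid: u32) {
    if !is_enabled() {
        return;
    }
    sink.push(format!("handle.create sid={sid}"));
}
```

Relaxed ordering is enough here because the flag only gates an optional diagnostic and does not publish other data between threads.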