xenia-rs/audit-runs/phase-c17-keWait-native-object/broad-impact.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+17 — Broad-impact catalog (2026-05-14)
The C+17 fix touches a widely-used primitive (`ensure_dispatcher_object`,
called by `Ke{Wait,Set,Reset,Pulse}Event`, `Ke{Wait,Release}Semaphore`, etc.).
This catalog enumerates the surfaced divergences post-fix per chain.
## Resolved (3 of 5 catalogued in C+15-α)
### D-2 / D-3 / D-4 — KeWait*ForSingleObject native-obj handle (all 5 chains)
Class E asymmetry. Canary's `xeKeWaitForSingleObject` /
`KeWaitForMultipleObjects_entry` call `XObject::GetNativeObject`, which
emits `handle.create` for the synthesized wrapper; ours's
`ensure_dispatcher_object` did the same shadow synthesis but never emitted
the schema event. Fix: emit `handle.create` (with the appropriate
`object_type` from `KernelObject::schema_object_type`) on first
adoption, and register the SID so subsequent `wait.begin` events resolve
non-zero `handles_semantic_ids[]`.
Observed: all 5 chains' divergences move past the wait-begin idx that was
previously blocked at SID=0.
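
The emit-on-first-adoption behaviour described above can be sketched as follows. This is a minimal model, not the code in exports.rs: `Event`, the `KernelState` fields, `wait_single`, and the SID allocation scheme are all hypothetical stand-ins, and only `ensure_dispatcher_object`, `handle.create`, `wait.begin`, and `handles_semantic_ids` are names taken from the text.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the schema events discussed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    HandleCreate { sid: u64, object_type: &'static str },
    WaitBegin { handles_semantic_ids: Vec<u64> },
}

/// Hypothetical, heavily reduced kernel state.
#[derive(Default)]
pub struct KernelState {
    /// Guest dispatcher pointer -> semantic ID (assumed shape of the shadow map).
    pub objects: HashMap<u32, u64>,
    pub next_sid: u64,
    pub log: Vec<Event>,
}

impl KernelState {
    /// Adopt a guest dispatcher pointer. `handle.create` is emitted only the
    /// first time a given pointer is seen; later calls return the stored SID.
    pub fn ensure_dispatcher_object(&mut self, guest_ptr: u32, object_type: &'static str) -> u64 {
        if let Some(&sid) = self.objects.get(&guest_ptr) {
            return sid;
        }
        self.next_sid += 1;
        let sid = self.next_sid;
        self.objects.insert(guest_ptr, sid);
        self.log.push(Event::HandleCreate { sid, object_type });
        sid
    }

    /// A wait resolves the SID first, so `wait.begin` carries a non-zero ID.
    pub fn wait_single(&mut self, guest_ptr: u32, object_type: &'static str) {
        let sid = self.ensure_dispatcher_object(guest_ptr, object_type);
        self.log.push(Event::WaitBegin { handles_semantic_ids: vec![sid] });
    }
}
```

In this model, two waits on the same pointer produce one `HandleCreate` followed by two `WaitBegin` events that carry the same non-zero SID.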
## Advanced
### Main tid=6→1 (+382)
The divergence index moves 102,171 → 102,553. The 382 new matching events between the two indexes are
mostly `kernel.{call,return}`, `import.call`, `RtlEnter/LeaveCriticalSection`,
plus the now-aligned `handle.create`+`wait.begin` pairs from
`KeWaitForSingleObject` and `KeWaitForMultipleObjects` calls. Several
new shadow `handle.create` events fire on first encounter of
specific PKEVENT/PKSEMAPHORE pointers in the game's init path.
### Sister chains (+3 / +2 / +1 / +39)
- tid=4→11 +3: matches all 11 emitted events.
- tid=7→2 +2: matches all 32 events.
- tid=12→7 +1: matches through `handle.create` at idx=2.
- tid=14→9 +39: walks past all the now-aligned `KeWait*` framing into the
audio subsystem.
## Persisted (pre-existing bugs unaffected)
None of the C+15-α catalog's other groups are touched.
## NEW divergences (cataloged for future iterates)
### D-NEW-1 (HIGH) — main idx=102,553: `NtDuplicateObject` no `handle.create`
Canary's `NtDuplicateObject_entry` → `ObjectTable::DuplicateHandle`
allocates a new slot via `AddHandle(object, &new_handle)`
(util/object_table.cc:148-201), which fires the C+15-α-wired
`phase_a::EmitHandleCreateAuto`. Ours's `nt_duplicate_object` (in exports.rs)
implements the AUDIT-062 alias-on-dup semantics: `dup_id = source_id`, so the
duplicate is a refcount-bumped re-use of the same slot and no new
`handle.create` fires.
This is a genuine engine-architectural difference. Mirror options:
- (a) Make ours allocate a fresh handle on `NtDuplicateObject` and emit
`handle.create` (mirror canary). ~30-40 LOC; downstream impact on
every existing AUDIT-062-dependent code path needs audit.
- (b) Diff-tool suppress this `handle.create` site. Band-aid.
Recommendation: (a). C+18 target. Trade-off: AUDIT-062's "alias on dup"
was implemented to handle a specific worker-cluster handle-aliasing
issue; un-doing it may surface a different regression. The risk
profile is similar to C+15-α: invisible state divergences become
visible. ~30 LOC fix or ~30 LOC tactical revert.
### D-NEW-2 (MEDIUM) — tid=12→7 idx=3: `wait.begin.timeout_ns` mismatch
```
canary: wait.begin handles_semantic_ids=[SID-A] timeout_ns=-30000000
ours: wait.begin handles_semantic_ids=[SID-B] timeout_ns=429466729600
```
The SIDs differ (skipped per diff policy). The `timeout_ns` is the issue:
canary reports a 30 ms relative timeout (negative value); ours reports
+429,466,729,600 ns, roughly 429.47 s, which reads as an absolute-time
encoding. Likely cause: ours's `decode_timeout_ns` returns the raw
`mem.read_u64(timeout_ptr) as i64 * 100` without applying the
"negative=relative / positive=absolute" semantics consistently with
canary. Inspect `decode_timeout_ns` (exports.rs:4890) — canary's
threading.cc emit code passes `(*timeout_ptr) * 100` directly without
sign conversion either, so the divergence is upstream in how each engine
**writes** the TIMEOUT* struct. Probably ε-class (game-side state
encoding).
C+19 target estimate. ~10-30 LOC investigation.
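
One arithmetic observation may narrow the investigation. Dividing ours's value by 100 gives 4,294,667,296, which equals 2^32 − 300,000, and −300,000 ticks of 100 ns is exactly canary's −30,000,000 ns. The two values are therefore consistent with the same 30 ms relative timeout whose low 32 bits were zero-extended rather than sign-extended somewhere on ours's path. This is only a pattern in the numbers, not a confirmed cause; the helper names below are hypothetical.

```rust
/// Convert a count of 100 ns ticks to nanoseconds (the `* 100` step quoted above).
pub fn ticks_to_ns(ticks: i64) -> i64 {
    ticks * 100
}

/// Hypothetical failure mode: keep only the low 32 bits of the tick count
/// and widen them as unsigned, losing the sign of a relative timeout.
pub fn low32_zero_extended(ticks: i64) -> i64 {
    (ticks as u32) as i64
}

/// Canary's observed value: -300_000 ticks, i.e. a 30 ms relative timeout.
pub fn canary_like_timeout_ns() -> i64 {
    ticks_to_ns(-300_000)
}

/// The same tick count after the hypothetical 32-bit zero-extension.
pub fn ours_like_timeout_ns() -> i64 {
    ticks_to_ns(low32_zero_extended(-300_000))
}
```

Here `canary_like_timeout_ns()` evaluates to `-30_000_000` and `ours_like_timeout_ns()` to `429_466_729_600`, matching the two trace lines. If the pattern holds, the place to look is any 32-bit read or write of the timeout's low word, on either the guest-write or the host-decode side.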
### D-NEW-3 (LOW) — tid=15→10 idx=2: `handle.create` ordering on shared dispatcher
Canary's `GetNativeObject` is **process-global**: once any thread adopts
a dispatcher pointer (stashing `kXObjSignature` in the wait_list), all
subsequent threads find the existing handle and do NOT re-emit. Canary's
`handle.create` for the semaphore at guest pointer `0x828a3230` (XAudio
voice volume changemask?) was emitted earlier on a different thread, so on
tid=15 the first wait skips straight to `wait.begin`.
Ours's `ensure_dispatcher_object` is also process-global (the `state.objects`
map is shared in `KernelState`). However, the **timing of first adoption**
differs because thread interleaving / boot ordering between the two engines
isn't bit-identical. Ours's tid=10 happens to be the first to touch
`0x828a3230`, so it emits `handle.create` at idx=2; canary's tid=15
arrived after another thread (probably tid=6 or tid=10) had already
adopted it.
This is a **timing-induced ordering** divergence, not a state-model
asymmetry. It's the inverse of the typical D-1/D-2 class — both engines
emit the SAME total number of `handle.create` events; the issue is which
thread happens to be the "first toucher". The diff tool currently treats
this as a divergence because it compares per-tid sequences strictly.
Three possible mitigations:
- (a) Diff-tool: relax ordering for `handle.create` emits when the
"next thread" event is `wait.begin` on the same dispatcher. Complex.
- (b) Suppress `handle.create` from the per-thread sequence entirely;
treat it as a global emit and only diff `wait.begin` SIDs against a
process-global SID-registry. Could work via `SKIP_PAYLOAD_FIELDS_BY_KIND`
extension to drop the event from per-tid alignment.
- (c) Live with the +0/-14 trade-off on tid=15→10 — the main chain
improvement dwarfs it.
Recommendation: (c) for now; C+20+ if the chain becomes load-bearing.
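
For reference, mitigation (b) amounts to filtering a set of event kinds out of both per-thread sequences before aligning them. The sketch below is a simplified model, not the diff tool's code: `Ev`, `first_divergence`, and the `skip_kinds` parameter are hypothetical, and the process-global SID registry check is omitted.

```rust
/// Hypothetical reduced event: kind plus the guest dispatcher pointer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Ev {
    pub kind: &'static str,
    pub dispatcher: u32,
}

/// Return the index (within the filtered sequences) of the first mismatch
/// between two per-thread sequences, ignoring every event whose kind is
/// listed in `skip_kinds`. Returns `None` when the filtered sequences agree.
pub fn first_divergence(a: &[Ev], b: &[Ev], skip_kinds: &[&str]) -> Option<usize> {
    let fa: Vec<&Ev> = a.iter().filter(|e| !skip_kinds.contains(&e.kind)).collect();
    let fb: Vec<&Ev> = b.iter().filter(|e| !skip_kinds.contains(&e.kind)).collect();
    let common = fa.len().min(fb.len());
    for i in 0..common {
        if fa[i] != fb[i] {
            return Some(i);
        }
    }
    // A length mismatch after the common prefix is also a divergence.
    if fa.len() != fb.len() {
        Some(common)
    } else {
        None
    }
}
```

With a canary sequence of just `wait.begin` and an ours sequence of `handle.create` followed by `wait.begin` on the same dispatcher, strict comparison diverges at index 0, while passing `&["handle.create"]` as `skip_kinds` reports no divergence. The cost is that a genuinely missing `handle.create` would also go unnoticed unless it is checked separately at the process level.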
## Reading-error register
- **Reading-error #28 (verify framing first)**: FOLLOWED. Canary's
`GetNativeObject` was read end-to-end before any code change.
- **Reading-error #23 (widely-used primitive flip)**: MITIGATED. Cold-vs-cold
gate caught no main-chain regression; minor sister-chain regression on
tid=15→10 is documented as D-NEW-3.
- **Reading-error #19 (host-side emits)**: FOLLOWED. `event_log::is_enabled()`
guards every new emit; when logging is disabled (the default), the only
cost per emit site is one relaxed atomic-bool load.
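
The guard pattern in the last item can be sketched as below. Only the name `is_enabled` comes from the text. The static flag, `set_enabled`, and `emit_handle_create` are hypothetical, and the real `event_log` module may be structured differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global switch; logging is off by default.
static ENABLED: AtomicBool = AtomicBool::new(false);

/// One relaxed load: the entire cost of an emit site while logging is off.
pub fn is_enabled() -> bool {
    ENABLED.load(Ordering::Relaxed)
}

/// Hypothetical toggle for the switch.
pub fn set_enabled(on: bool) {
    ENABLED.store(on, Ordering::Relaxed);
}

/// Build and return the event payload only when logging is enabled.
/// The closure is not called otherwise, so no formatting or allocation
/// happens on the disabled path.
pub fn emit_handle_create(build: impl FnOnce() -> String) -> Option<String> {
    if !is_enabled() {
        return None;
    }
    Some(build())
}
```

Taking the payload as a closure keeps the disabled path down to the flag check, because the payload is never constructed unless the flag is set.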