xenia-rs/audit-runs/phase-c17-keWait-native-object/broad-impact.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+17 — Broad-impact catalog (2026-05-14)

The C+17 fix touches a widely used primitive (ensure_dispatcher_object, called by Ke{Wait,Set,Reset,Pulse}Event, Ke{Wait,Release}Semaphore, etc.). This catalog lists, chain by chain, the divergences that surfaced after the fix.

Resolved (3 of 5 catalogued in C+15-α)

D-2 / D-3 / D-4 — KeWait*ForSingleObject native-obj handle (all 5 chains)

Class E asymmetry. Canary's xeKeWaitForSingleObject and KeWaitForMultipleObjects_entry call XObject::GetNativeObject, which emits handle.create for the synthesized wrapper. Ours's ensure_dispatcher_object did the same shadow synthesis but never emitted the schema event. Fix: emit handle.create (with the appropriate object_type from KernelObject::schema_object_type) on first adoption, and register the SID so that subsequent wait.begin events resolve non-zero handles_semantic_ids[].

Observed: all 5 chains' divergences move past the wait-begin idx that was previously blocked at SID=0.
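The emit-on-first-adoption behaviour described above can be sketched as follows. This is a minimal model, not the code in exports.rs: KernelState, ObjectType, SchemaEvent and wait_single are hypothetical stand-ins, and the SID allocation scheme is an assumption made only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the schema object type of a dispatcher object.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObjectType {
    Event,
    Semaphore,
}

/// Hypothetical stand-in for the two schema events discussed in this note.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SchemaEvent {
    HandleCreate { sid: u64, object_type: ObjectType },
    WaitBegin { handles_semantic_ids: Vec<u64> },
}

#[derive(Default)]
pub struct KernelState {
    /// Guest dispatcher pointer -> semantic ID of the shadow object.
    objects: HashMap<u32, u64>,
    next_sid: u64,
    pub log: Vec<SchemaEvent>,
}

impl KernelState {
    /// Adopt a guest dispatcher pointer. handle.create is emitted only on
    /// first adoption; later calls return the already registered SID.
    pub fn ensure_dispatcher_object(&mut self, guest_ptr: u32, ty: ObjectType) -> u64 {
        if let Some(&sid) = self.objects.get(&guest_ptr) {
            return sid;
        }
        self.next_sid += 1;
        let sid = self.next_sid;
        self.objects.insert(guest_ptr, sid);
        self.log.push(SchemaEvent::HandleCreate { sid, object_type: ty });
        sid
    }

    /// A wait resolves the SID first, so wait.begin never carries SID=0.
    pub fn wait_single(&mut self, guest_ptr: u32, ty: ObjectType) {
        let sid = self.ensure_dispatcher_object(guest_ptr, ty);
        self.log.push(SchemaEvent::WaitBegin { handles_semantic_ids: vec![sid] });
    }
}
```

Two waits on the same pointer produce one handle.create followed by two wait.begin events that share a non-zero SID, which is the framing the diff now aligns on.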

Advanced

Main tid=6→1 (+382)

102,171 → 102,553. The 382 new matching events between the two indexes are mostly kernel.{call,return}, import.call, RtlEnter/LeaveCriticalSection, plus the now-aligned handle.create+wait.begin pairs from KeWaitForSingleObject and KeWaitForMultipleObjects calls. Several new shadow handle.create events fire on first encounter of specific PKEVENT/PKSEMAPHORE pointers in the game's init path.

Sister chains (+3 / +2 / +1 / +39)

  • tid=4→11 +3: matches all 11 emitted events.
  • tid=7→2 +2: matches all 32 events.
  • tid=12→7 +1: matches through handle.create at idx=2.
  • tid=14→9 +39: walks past all the now-aligned KeWait* framing into the audio subsystem.

Persisted (pre-existing bugs unaffected)

None of the C+15-α catalog's other groups are touched.

NEW divergences (cataloged for future iterates)

D-NEW-1 (HIGH) — main idx=102,553: NtDuplicateObject no handle.create

Canary's NtDuplicateObject_entry → ObjectTable::DuplicateHandle allocates a new slot via AddHandle(object, &new_handle) (util/object_table.cc:148-201), which fires the C+15-α-wired phase_a::EmitHandleCreateAuto. Ours's nt_duplicate_object (exports.rs nt_duplicate_object) implements per-AUDIT-062 alias-on-dup semantics: dup_id = source_id, i.e. a refcount-bumped reuse of the same slot, so no new handle.create fires.

This is a genuine engine-architectural difference. Mirror options:

  • (a) Make ours allocate a fresh handle on NtDuplicateObject and emit handle.create (mirror canary). ~30-40 LOC; downstream impact on every existing AUDIT-062-dependent code path needs audit.
  • (b) Diff-tool suppress this handle.create site. Band-aid.

Recommendation: (a). C+18 target. Trade-off: AUDIT-062's "alias on dup" was implemented to handle a specific worker-cluster handle-aliasing issue; un-doing it may surface a different regression. The risk profile is similar to C+15-α: invisible state divergences become visible. ~30 LOC fix or ~30 LOC tactical revert.
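The difference between the two duplication strategies can be modelled in a few lines. The sketch below is an assumption-laden toy: HandleTable, DupMode and the handle numbering are hypothetical and do not reflect the layout of canary's ObjectTable or of ours's handle table. It only shows why alias-on-dup leaves the handle.create count unchanged while a fresh slot adds one.

```rust
use std::collections::HashMap;

/// Hypothetical switch between the two behaviours compared above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DupMode {
    /// AUDIT-062 style: return the source handle, bump the refcount.
    Alias,
    /// Canary style: allocate a new slot and emit handle.create.
    FreshSlot,
}

#[derive(Default)]
pub struct HandleTable {
    /// Handle -> object id.
    slots: HashMap<u32, u32>,
    /// Object id -> reference count.
    refs: HashMap<u32, u32>,
    next_handle: u32,
    /// Handles for which a handle.create event was emitted.
    pub created: Vec<u32>,
}

impl HandleTable {
    /// Allocate a slot for an object and record the handle.create emit.
    pub fn add_handle(&mut self, object_id: u32) -> u32 {
        self.next_handle += 4; // step of 4 is illustrative only
        let handle = self.next_handle;
        self.slots.insert(handle, object_id);
        *self.refs.entry(object_id).or_insert(0) += 1;
        self.created.push(handle);
        handle
    }

    /// Duplicate a handle; returns None if the source handle is unknown.
    pub fn duplicate(&mut self, source: u32, mode: DupMode) -> Option<u32> {
        let object_id = *self.slots.get(&source)?;
        match mode {
            DupMode::Alias => {
                *self.refs.entry(object_id).or_insert(0) += 1;
                Some(source)
            }
            DupMode::FreshSlot => Some(self.add_handle(object_id)),
        }
    }

    /// Reference count of the object behind a handle.
    pub fn refcount(&self, handle: u32) -> Option<u32> {
        let object_id = self.slots.get(&handle)?;
        self.refs.get(object_id).copied()
    }
}
```

In both modes the object's refcount rises by one; only FreshSlot grows the `created` list, which is the event the diff currently sees on canary's side alone.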

D-NEW-2 (MEDIUM) — tid=12→7 idx=3: wait.begin.timeout_ns mismatch

canary: wait.begin handles_semantic_ids=[SID-A] timeout_ns=-30000000
ours:   wait.begin handles_semantic_ids=[SID-B] timeout_ns=429466729600

The SIDs differ (skipped per diff policy). The timeout_ns is the issue: canary logs a 30 ms relative timeout (negative value); ours logs a positive value of about 429.47 s, which reads as an absolute-time encoding. Likely cause: ours's decode_timeout_ns returns the raw mem.read_u64(timeout_ptr) as i64 * 100 without applying the "negative=relative / positive=absolute" semantics consistently with canary. Inspect decode_timeout_ns (exports.rs:4890). Canary's threading.cc emit code also passes (*timeout_ptr) * 100 directly without sign conversion, so the divergence is upstream, in how each engine writes the TIMEOUT* struct. Probably ε-class (game-side state encoding).

C+19 target estimate. ~10-30 LOC investigation.
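A quick arithmetic check on the two logged values, offered as a hypothesis to test rather than a finding: 429,466,729,600 equals (2^32 − 300,000) × 100, and −300,000 ticks of 100 ns is exactly canary's −30,000,000 ns. That is the pattern one would expect if only the low 32 bits of the tick count survived and were zero-extended somewhere between the guest write and the emit. The sketch below reproduces the numbers; the function names are illustrative and are not taken from exports.rs or threading.cc.

```rust
/// Convert a raw timeout in 100 ns ticks to nanoseconds, keeping the sign
/// (negative = relative, positive = absolute).
pub fn ticks_to_ns(raw_ticks: i64) -> i64 {
    raw_ticks.wrapping_mul(100)
}

/// Model of a lossy path: keep only the low 32 bits and zero-extend them.
/// Hypothetical; it exists only to reproduce the value logged by ours.
pub fn low32_zero_extended(raw_ticks: i64) -> i64 {
    (raw_ticks as u32) as i64
}

/// True when the tick count encodes a relative timeout.
pub fn is_relative(raw_ticks: i64) -> bool {
    raw_ticks < 0
}
```

With raw_ticks = -300_000, ticks_to_ns gives canary's value, and ticks_to_ns(low32_zero_extended(raw_ticks)) gives ours's value. If the guest-side TIMEOUT* contents turn out to have a zeroed high half in ours, that would narrow the C+19 investigation to wherever that half is written.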

D-NEW-3 (LOW) — tid=15→10 idx=2: handle.create ordering on shared dispatcher

Canary's GetNativeObject is process-global: once any thread adopts a dispatcher pointer (stashing kXObjSignature in the wait_list), all subsequent threads find the existing handle and do NOT re-emit. Canary's handle.create for the semaphore at guest pointer 0x828a3230 (XAudio voice volume changemask?) was emitted earlier on a different thread, so on tid=15 the first wait goes straight to wait.begin.

Ours's ensure_dispatcher_object is also process-global (the state.objects map is shared in KernelState). However, the timing of first adoption differs because thread interleaving / boot ordering between the two engines isn't bit-identical. Ours's tid=10 happens to be the first to touch 0x828a3230, so it emits handle.create at idx=2; canary's tid=15 arrived after another thread (probably tid=6 or tid=10) had already adopted it.

This is a timing-induced ordering divergence, not a state-model asymmetry. It's the inverse of the typical D-1/D-2 class — both engines emit the SAME total number of handle.create events; the issue is which thread happens to be the "first toucher". The diff tool currently treats this as a divergence because it compares per-tid sequences strictly.

Three possible mitigations:

  • (a) Diff-tool: relax ordering for handle.create emits when the "next thread" event is wait.begin on the same dispatcher. Complex.
  • (b) Suppress handle.create from the per-thread sequence entirely; treat it as a global emit and only diff wait.begin SIDs against a process-global SID-registry. Could work via SKIP_PAYLOAD_FIELDS_BY_KIND extension to drop the event from per-tid alignment.
  • (c) Live with the +0/-14 trade-off on tid=15→10 — the main chain improvement dwarfs it.

Recommendation: (c) for now; C+20+ if the chain becomes load-bearing.
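Should mitigation (b) ever be picked up, the core idea is to filter process-global event kinds out of each per-tid sequence before alignment. The following is a rough sketch under stated assumptions: Event, GLOBAL_KINDS and first_divergence are hypothetical names, and the real diff tool's data model and its SKIP_PAYLOAD_FIELDS_BY_KIND table may look quite different.

```rust
/// Hypothetical, simplified event record for one thread's trace.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub kind: &'static str,
    pub payload: String,
}

/// Event kinds treated as process-global and excluded from per-tid alignment.
pub const GLOBAL_KINDS: &[&str] = &["handle.create"];

/// View of a per-tid sequence with the process-global kinds removed.
pub fn per_tid_view(events: &[Event]) -> Vec<&Event> {
    events
        .iter()
        .filter(|e| !GLOBAL_KINDS.contains(&e.kind))
        .collect()
}

/// Index (in the filtered view) of the first mismatch, or None if the
/// filtered sequences are identical.
pub fn first_divergence(canary: &[Event], ours: &[Event]) -> Option<usize> {
    let (a, b) = (per_tid_view(canary), per_tid_view(ours));
    let mismatch = a.iter().zip(b.iter()).position(|(x, y)| x != y);
    mismatch.or_else(|| {
        if a.len() != b.len() {
            Some(a.len().min(b.len()))
        } else {
            None
        }
    })
}
```

Under this filter, a thread that happens to be the "first toucher" no longer diverges from one that is not, as long as the surrounding wait.begin events agree. The cost is that a genuinely missing handle.create would have to be caught by a separate process-global check.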

Reading-error register

  • Reading-error #28 (verify framing first): FOLLOWED. Canary's GetNativeObject was read end-to-end before any code change.
  • Reading-error #23 (widely-used primitive flip): MITIGATED. Cold-vs-cold gate caught no main-chain regression; the minor sister-chain regression on tid=15→10 is documented as D-NEW-3.
  • Reading-error #19 (host-side emits): FOLLOWED. event_log::is_enabled() guards every new emit; the default-off cost is a single relaxed atomic-bool load, and no event payload is built when logging is disabled.
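The guard pattern mentioned for host-side emits can be illustrated as below. This is a sketch assuming a global flag and a closure-based emit helper; the module body, set_enabled and emit_with are hypothetical and only mirror the name event_log::is_enabled() used in this note.

```rust
pub mod event_log {
    use std::sync::atomic::{AtomicBool, Ordering};

    /// Global switch; logging is off by default.
    static ENABLED: AtomicBool = AtomicBool::new(false);

    /// One relaxed load: the whole cost of a disabled emit site.
    pub fn is_enabled() -> bool {
        ENABLED.load(Ordering::Relaxed)
    }

    /// Hypothetical toggle, e.g. driven by a command-line flag.
    pub fn set_enabled(on: bool) {
        ENABLED.store(on, Ordering::Relaxed);
    }

    /// Build and record an event only when logging is enabled, so the
    /// payload closure never runs on the default-off path.
    pub fn emit_with<F: FnOnce() -> String>(sink: &mut Vec<String>, build: F) {
        if !is_enabled() {
            return;
        }
        sink.push(build());
    }
}
```

Passing the payload as a closure keeps string formatting and lookups behind the flag, which is what makes the disabled path cheap.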