handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
132
audit-runs/phase-w-wedge-reattack/escalation.md
Normal file
@@ -0,0 +1,132 @@
# Phase W escalation: wedge unbroken by accumulated tooling

**Outcome category: (C), escalating cleanly.**

The Phase W mini-fix landed (`VdInitializeEngines` now returns 1
instead of the old 0, matching canary `xboxkrnl_video.cc:271-279`).
This is a real correctness fix that advances the Phase A
matched-prefix **105,046 → 105,112 (+66 events)**. But on the brief's
actual gate (`swaps > 1` / `draws > 0` / `texture_cache_entries > 0`)
the `check --stable-digest -n 500000000` run is **byte-identical to
baseline**: `draws=0, swaps=1, render_targets=0`. The fix does not
unblock progression.

## What we verified afresh

1. The wedge is structurally identical to AUDIT-049/058/059/062/065:
   * tid=1 join-waits tid=13 at `sub_82173990+0x2D0` (handle `0x12c8`).
   * tid=13 wedges at `sub_821CB030+0x1B0` on Event `0x12d0`
     (`<NO_SIGNALS_DESPITE_WAITS>`).
   * `sub_825070F0` (vtable[1] worker-spawner) fires 0×.
   * 4 of 5 canary worker tids (canary's tid=14/15/4/+ several more)
     emit hundreds of thousands of events; our equivalents emit
     ≤80. AUDIT-057 thread-gap PERSISTS.

2. New tooling (handle.create/destroy, thread.create/exit,
   wait.begin, shared-global SID absorbers) was applied. It surfaces
   normal cold-vs-cold divergences past 105K but does NOT illuminate
   a new signal-flow gap on the wedge handle itself.

3. The wedge handle's SID `d5e23609d3948568` has zero matches in any
   canary cold trace. The per-tid-PC SID recipe yields different SIDs
   for what is *logically* the same Event across engines, because
   create-site PC + tid + tid_event_idx all participate in the hash.
   This is by design (it's NOT a process-global dispatcher), but it
   means the new wait.begin events cannot directly identify "which
   canary NtSetEvent call should signal this".

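A minimal sketch of why a per-tid-PC recipe cannot line up across engines. The real SID recipe is not reproduced here: the FNV-1a hash, the field names and widths, and any sample values are assumptions for illustration. Only the fact that create-site PC, tid, and tid_event_idx all feed the hash comes from the finding above.

```rust
/// Hypothetical inputs to the per-tid-PC SID recipe. Field names and
/// widths are assumptions; only the three participating inputs are
/// taken from the notes.
#[derive(Clone, Copy)]
struct SidInputs {
    create_site_pc: u32,
    tid: u32,
    tid_event_idx: u64,
}

/// FNV-1a 64-bit, used purely as a stand-in for the real hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hash all three inputs into one 64-bit SID.
fn sid(inputs: SidInputs) -> u64 {
    let mut buf = Vec::with_capacity(16);
    buf.extend_from_slice(&inputs.create_site_pc.to_le_bytes());
    buf.extend_from_slice(&inputs.tid.to_le_bytes());
    buf.extend_from_slice(&inputs.tid_event_idx.to_le_bytes());
    fnv1a64(&buf)
}
```

With such a recipe, two engines that create the logically same Event from the same PC but on different tids (for example 13 on our side and 9 on canary's) get unrelated SIDs, so a plain SID lookup across traces finds nothing.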
## Why this is hard: the structural impasse

The matched-prefix metric and the progression metric measure
different things. Matched-prefix tracks the **tid=1-only** event
sequence in lockstep up to the first kind-mismatch. The wedge is on
**tid=13** waiting for a signal that would come from a
**worker-cluster thread that never spawns**. The two threads barely
overlap in the matched-prefix view: tid=1 is fine for 105K events
*because* it hasn't reached the join-wait yet from Phase A's
perspective (`sub_82173990+0x2D0` is past idx 105,112 in canary's
tid=6 stream).

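The matched-prefix idea can be sketched as follows. This is an illustration under assumptions, not the project's actual diff tool: the `Event` layout, the `ev` helper, and the `matched_prefix` signature are hypothetical.

```rust
/// Minimal trace event. `kind` stands for names such as "wait.begin"
/// or "handle.create"; the struct layout is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    tid: u32,
    kind: String,
}

/// Convenience constructor (hypothetical helper).
fn ev(tid: u32, kind: &str) -> Event {
    Event { tid, kind: kind.to_string() }
}

/// Number of events that match in lockstep, by kind, between one
/// thread of our trace and its counterpart thread in canary's trace.
/// Counting stops at the first kind-mismatch or when either side ends.
fn matched_prefix(ours: &[Event], ours_tid: u32, canary: &[Event], canary_tid: u32) -> usize {
    let a = ours.iter().filter(|e| e.tid == ours_tid);
    let b = canary.iter().filter(|e| e.tid == canary_tid);
    a.zip(b).take_while(|(x, y)| x.kind == y.kind).count()
}
```

Because every event from other threads is filtered out before the comparison, a thread that is stuck in a wait (tid=13 here) contributes nothing to the score. The metric can keep rising while the wedge stays exactly where it is.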
Every Phase C fix has correctly advanced matched-prefix while
leaving the wedge untouched, because the wedge needs the worker
cluster to bootstrap, and the worker cluster's activation chain
(`sub_822F1AA8 → sub_82173990 → sub_821746B0 → sub_821748F0 →
sub_821C4EB0 → sub_821CC3F8 → sub_821CBA08 → sub_821CB030` and
in parallel `→ sub_82172BA0 → sub_821B55D8 → sub_824F8398 →
sub_824F7CD0 → sub_824F7800 → sub_825070F0 → 4 worker spawns`) is
gated on the tid=13 wait completing, which is gated on a worker
signal, which is gated on the worker cluster bootstrapping. This
is the **same self-referential lock** AUDIT-063 documented.

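The self-referential lock can be expressed as a cycle in a "gated on" graph. The sketch below is illustrative only: the node names are shorthand labels invented for the three conditions in the paragraph above, and `find_cycle` is a hypothetical helper, not part of the existing tooling.

```rust
use std::collections::HashMap;

/// Walks "X is gated on Y" edges depth-first from `start` and returns
/// the nodes of the first cycle found, or None if the walk is acyclic.
/// Intended for tiny hand-written graphs; it keeps no visited set.
fn find_cycle<'a>(
    gated_on: &HashMap<&'a str, Vec<&'a str>>,
    start: &'a str,
) -> Option<Vec<&'a str>> {
    fn visit<'a>(
        node: &'a str,
        g: &HashMap<&'a str, Vec<&'a str>>,
        path: &mut Vec<&'a str>,
    ) -> Option<Vec<&'a str>> {
        // Reaching a node already on the current path closes a cycle.
        if let Some(pos) = path.iter().position(|&n| n == node) {
            return Some(path[pos..].to_vec());
        }
        path.push(node);
        for &next in g.get(node).into_iter().flatten() {
            if let Some(cycle) = visit(next, g, path) {
                return Some(cycle);
            }
        }
        path.pop();
        None
    }
    visit(start, gated_on, &mut Vec::new())
}
```

Feeding it the three edges "worker_bootstrap gated on tid13_wait", "tid13_wait gated on worker_signal", and "worker_signal gated on worker_bootstrap" returns all three nodes as one cycle. Breaking the wedge means removing or satisfying one of those edges from outside the loop.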
## What new information Phase W produced

1. **VdInitializeEngines stub fix** (the landing). Trivially
   correct, advances matched-prefix +66, does not move progression.
   Worth keeping in canon for cold-vs-cold parity. New stable digest
   `73e99d60029128b4d5c3dd98e540457d82a52b8a962e7495132be2be31411aca`
   × 3 byte-identical.
2. **Confirmed via the new wait.begin events**: canary's tid=9
   (= our tid=13 logical role) calls `wait.begin` on shared-global
   dispatcher Event `0xf800004c` (SID `c9f426cc34f55865`) at idx 321
   *immediately* after `RtlEnterCriticalSection` issues. This proves
   that CS contention on canary's side awakens via the shared-global
   path while our per-tid Event takes the explicit
   `NtCreateEvent+NtWaitForSingleObjectEx` path. **These are two
   different objects, not one object waiting for the same signal.**
   The tooling correctly says so.
3. **The brief's hypothesis is correct**: matched-prefix is no
   longer the right metric. Progression has not moved across 25
   phases.

## Recommended next steps (ranked)

### Path 1 (recommended): accept C+25 fallback and continue normal iteration

Dispatch C+25 = `MmAllocatePhysicalMemoryEx` / `MmGetPhysicalAddress`
deterministic allocator (the new first divergence at idx 105,112 is
in this family). Normal Phase C cadence; advances matched-prefix
without claiming wedge unblocking. **Be honest in memory notes that
matched-prefix is the only metric moving.**

### Path 2: re-examine the absorbers

The C+18/C+21/D-extension absorbers all explicitly fold "scheduling
jitter" classes. Per the brief's Path B suggestion: is any absorber
HIDING a signal that would resolve the wedge? Specifically:
* C+18 shared-global SID absorber folds canary's
  `aafae4c71fd42890` work-queue semaphore creation into our
  emission window even when we never create the equivalent. If
  our worker fails to *enqueue* something canary's worker awaits,
  we'd never see the gap because the matched-prefix isn't on the
  worker tid in the first place.
* The D-extension absorber folds nested-CS cleanup blocks. If
  canary's `Enter/Leave` block contains the NtSetEvent that signals
  the wedge handle (via descendant `xeKeSetEvent`), the absorber
  hides that.

Concrete: un-absorb, re-diff, look for the first FOLDED canary block
that contains an `NtSetEvent` whose SID resolves to the wedge handle.
~3-5 hours of analysis, no LOC change.

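The "first FOLDED block containing the signal" search could look roughly like this. It is a sketch under assumptions: `TraceEvent`, `FoldedBlock`, and `first_block_signalling` are hypothetical names, and `wedge_sid` is assumed to be the canary-side SID already resolved to the wedge handle by some other means, since the per-tid-PC recipe gives different SIDs on each engine.

```rust
/// One event inside a block that an absorber folded away.
/// The layout is an assumption for illustration.
#[derive(Debug)]
struct TraceEvent {
    kind: String,
    sid: u64,
}

/// A run of canary events that an absorber collapsed, plus the
/// index in the canary stream where the run starts.
struct FoldedBlock {
    start_idx: usize,
    events: Vec<TraceEvent>,
}

/// Convenience constructor (hypothetical helper).
fn te(kind: &str, sid: u64) -> TraceEvent {
    TraceEvent { kind: kind.to_string(), sid }
}

/// Start index of the first folded block that contains an NtSetEvent
/// on `wedge_sid`, or None if no folded block hides such a signal.
fn first_block_signalling(blocks: &[FoldedBlock], wedge_sid: u64) -> Option<usize> {
    blocks
        .iter()
        .find(|b| {
            b.events
                .iter()
                .any(|e| e.kind == "NtSetEvent" && e.sid == wedge_sid)
        })
        .map(|b| b.start_idx)
}
```

A `None` result across all un-absorbed blocks would rule the absorbers out as the place where the signal is hidden; a `Some(idx)` gives the canary index to inspect first.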
### Path 3: install host-side mem-watch + diff on wedge handle's guest memory

AUDIT-067 established that vtable installs go through host-side
writes invisible to guest-PC traces. By the same logic, the wedge
handle's kernel object header may be mutated by host code (the
canary scheduler / dispatcher) in ways ours doesn't replicate. Hook
`Memory::write*` in canary on the wedge handle's address; compare
against ours.

### Path 4: scheduler determinism investment

The unfunded `scheduler_determinism_plan` artifact (per memory). Stage
0 was a null result; the contention manifest stages landed but didn't
move the cap. The PLAN doc explicitly notes the wedge is upstream of
contention, so this is unlikely to help WITHOUT additional work.

## Honesty note

19 prior audits attacked this same wedge and failed. Phase W is the
20th. We landed a correct mini-fix, but the wedge itself is
unchanged. The user's instinct to call this honest fallback is the
correct posture.