handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,384 @@
# Step 2 — Natural install-trigger sequence and ours divergence point
**Date:** 2026-05-21
**Mode:** PLAN-only (investigation; no engine LOC changes).
**Sources:** `canary-jitter-1.jsonl` (4.4 GB, 18.7M events) and
`phase-w-wedge-reattack/ours-postfix.jsonl` (28 MB, 121,569 events).
## TL;DR
The Step 2 plan's framing, "identify the canary tid=6 kernel-call
sequence in the install window [9.4s, 9.6s]", **cannot be applied
because ours stops emitting events at host_ns ≈ 1.73s.** Ours's tid=1
wedges roughly 8 seconds before the install epoch. The reframed
question, "what canary-tid=6 sequence between the matched-prefix wedge
point and the install epoch fails in ours?", resolves to a **single
root cause one level upstream of the wedge**:
> Canary's spawned cache-loader worker (canary tid=17, entry
> `0x821748F0`) executes ~4140 events and calls `ExTerminateThread`
> at host_ns = 2.092s, taking 154ms. Ours's analog (ours tid=13)
> executes 435 events, **never reaches its second wait iteration**,
> and wedges at its FIRST `NtWaitForSingleObjectEx` (no signaler ever
> fires). **Ours's tid=13 takes a different guest-code branch from
> the first wait onward — it calls `NtReleaseSemaphore` instead of
> `NtSetEvent` between `NtCreateEvent` and `NtWaitForSingleObjectEx`,
> so the event it then waits on is unsignaled.**
This is a **branch divergence inside guest code `sub_821CB030`'s
body**, NOT a missing kernel call in ours and NOT a wrong return
value from ours's kernel.
## Step 0 outcome — install epoch reachable on canary, not on ours
| Source | Coverage | Install-epoch status |
|---|---|---|
| canary tid=6 events in [9.0s..11.0s] | 16,175 kernel.calls captured | install epoch + worker-spawn covered ✓ |
| ours tid=1 events | 1.728s (last event before wedge) | install epoch is at ~9.5s — **8s in the future** |
Ours physically cannot reach 9.4s: tid=1 blocks on tid=13's thread handle
at host_ns=1.728s, and all other tids subsequently block too (see
`phase-w-wedge-reattack/halt-on-deadlock-dump.txt`). Therefore the
canary "kernel-call sequence ours doesn't make in the install window"
question is degenerate: ours makes **none** of canary's 16,175 calls
in that window because ours stops emitting at host_ns=1.73s.
The substantive Step 2 question reframes to: **"What does canary
do between matched-prefix idx ~108,476 (= ours's last events) and
the install epoch?"** Answer: it RUNS the worker tid=17 to
completion, which causes the join-wait on canary tid=6 (ours's analog
is tid=1) to return, after which tid=6 iterates `sub_822F1AA8`'s main
loop further and eventually triggers `sub_824FD240` and
`sub_825070F0`. Everything hinges on tid=17 completing.
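
The reachability check in Step 0 can be sketched as follows. This is a minimal illustration, assuming each JSONL record carries `tid` and `host_ns` fields as in the excerpts in this report; the real trace schema may differ, and the function names are hypothetical.

```python
import json
from typing import Dict, Iterable

INSTALL_EPOCH_NS = 9_400_000_000  # start of the install window quoted in this report


def last_host_ns_per_tid(lines: Iterable[str]) -> Dict[int, int]:
    """Return the largest host_ns seen for each tid in a stream of JSONL records."""
    last: Dict[int, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        tid, ns = event["tid"], event["host_ns"]  # assumed field names
        if ns > last.get(tid, -1):
            last[tid] = ns
    return last


def reaches(last: Dict[int, int], epoch_ns: int = INSTALL_EPOCH_NS) -> bool:
    """True if any thread emitted an event at or after epoch_ns."""
    return any(ns >= epoch_ns for ns in last.values())


# Two illustrative records. The tid=13 timestamp is rounded from the
# 1.7307s reported for ours tid=13's last event.
sample = [
    '{"tid": 1, "host_ns": 1727614433}',
    '{"tid": 13, "host_ns": 1730700000}',
]
last_seen = last_host_ns_per_tid(sample)
print(last_seen, reaches(last_seen))
```

On a real trace, an open file handle can be passed in place of `sample`, so the 4.4 GB canary log is streamed rather than loaded into memory.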
## Step 1 outcome — canary tid=6 spawns sub_821748F0 at host_ns=1.935s
Exact anchor:
```
canary tid=6 host_ns=1935433700 idx=108476
ExCreateThread(entry=0x821748f0, ctx=0xbc365620, stack=524288, susp=T)
→ handle.create raw=0xf80000a0 hsid=3bd922fbb385c2c9
canary tid=6 host_ns=1937223600 idx=108498
NtResumeThread
NtWaitForSingleObjectEx handles=[3bd922fbb385c2c9] timeout=-1
→ wait.begin
canary tid=6 host_ns=2092000000 idx=108499 (155 ms later)
kernel.return NtWaitForSingleObjectEx rv=0 status=0x00000000
```
The wait IS infinite (`timeout=-1`), yet it returns in 155ms because
the worker terminates (canary tid=17's last call is `ExTerminateThread`
at host_ns=2.0918s), which signals the thread handle the wait is blocked on.
Ours's mirror:
```
ours tid=1 host_ns=1727479660 idx=108481
ExCreateThread(entry=0x821748f0, ctx=0x4024d640, stack=0, susp=T)
→ handle.create raw=0x000012c8 hsid=8a25e09a8a739c1b
ours tid=1 host_ns=1727611893 idx=108505
wait.begin handles=[8a25e09a8a739c1b] timeout=-1
ours tid=1 host_ns=1727614433 idx=108506
kernel.return NtWaitForSingleObjectEx rv=0 ← but this is just the
return record from the entry probe, NOT actual unblock
```
(Note: `ours-postfix.jsonl` schema emits the entry-probe `kernel.return`
even on an infinite wait, because the probe wraps the wait wrapper.
Per `halt-on-deadlock-dump.txt`, tid=1 is in fact still `Blocked` on
handle `0x000012c8` = Thread(id=13) at deadlock-detection time.)
The spawn parameters look identical in shape (same entry PC; ctx and
stack are run-specific). **Spawn semantics match.**
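
The two join-waits can be compared numerically using the timestamps from the excerpts above. The sketch below is illustrative only: `looks_probe_only` is a hypothetical heuristic, and it assumes the entry-probe return is the final record on ours tid=1. The deadlock dump remains the authoritative evidence that the thread is still blocked.

```python
def wait_ms(begin_ns: int, return_ns: int) -> float:
    """Elapsed time between a wait.begin record and its kernel.return, in ms."""
    return (return_ns - begin_ns) / 1_000_000


def looks_probe_only(timeout: int, is_last_record_on_tid: bool) -> bool:
    """Heuristic (an assumption of this sketch): an infinite wait whose return
    record is the final record on its thread never really unblocked."""
    return timeout == -1 and is_last_record_on_tid


canary_ms = wait_ms(1_937_223_600, 2_092_000_000)  # canary tid=6 join-wait
ours_ms = wait_ms(1_727_611_893, 1_727_614_433)    # ours tid=1 entry-probe return

print(f"canary: {canary_ms:.1f} ms, ours: {ours_ms * 1000:.2f} us")
print(looks_probe_only(-1, False), looks_probe_only(-1, True))
```

A short elapsed time alone is not enough to flag a probe-only return, since a wait on an already-signaled event also returns within microseconds. That is why the heuristic keys on the return being the thread's last record.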
## Step 2 outcome — canary tid=17 vs ours tid=13 kernel-call differential
Lifetimes:
| | canary tid=17 | ours tid=13 |
|---|---|---|
| first event | host_ns=1.9378s | host_ns=1.7276s |
| last event | host_ns=2.0918s | host_ns=1.7307s |
| duration | **154 ms** | **3 ms** |
| total events | 4140 | 435 |
| kernel.call count | 1351 | 142 |
| terminates? | yes via `ExTerminateThread` | no — wedged on wait |
Per-call differential (top entries by |canary − ours|):
| kernel.call | canary tid=17 | ours tid=13 | Δ |
|---|---:|---:|---:|
| RtlEnterCriticalSection | 607 | 58 | +549 |
| RtlLeaveCriticalSection | 607 | 58 | +549 |
| NtClose | 19 | 2 | +17 |
| NtCreateEvent | 18 | 3 | +15 |
| NtDuplicateObject | 16 | 2 | +14 |
| RtlInitAnsiString | 11 | 1 | +10 |
| NtWaitForSingleObjectEx | 11 | 2 | +9 |
| RtlInitializeCriticalSectionAndSpinCount | 15 | 6 | +9 |
| NtQueryFullAttributesFile | 9 | 1 | +8 |
| NtReleaseSemaphore | 9 | 1 | +8 |
| RtlNtStatusToDosError | 9 | 1 | +8 |
| NtSetEvent | 8 | 1 | +7 |
| KeTlsSetValue | 2 | 0 | +2 |
| NtCreateFile | 2 | 0 | +2 |
| ExCreateThread | 1 | 0 | +1 |
| ExTerminateThread | 1 | 0 | +1 |
| KeTlsGetValue | 1 | 0 | +1 |
| KeQueryPerformanceFrequency | 0 | 1 | -1 |
**Set-difference of unique kernel-call names**: apart from
`KeQueryPerformanceFrequency` (which canary calls outside this
window), the set of APIs ours calls is a strict subset of canary's.
**No kernel API that canary uses is missing from ours's
implementation.** All of these APIs already work in ours (they are
called successfully on tid=5, tid=1, or tid=10 elsewhere in the same run).
The differential isn't "ours fails to implement a kernel call";
it's "ours gets through roughly 10× fewer kernel calls (142 vs 1351) of the same loop body before wedging."
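
The per-call differential and the name set-difference can be reproduced with `collections.Counter`. This sketch hard-codes only the rows shown in the table above (which lists top entries, not every call), so the sums do not add up to the 1351 / 142 totals.

```python
from collections import Counter

# Counts copied from the differential table (top entries only).
canary = Counter({
    "RtlEnterCriticalSection": 607, "RtlLeaveCriticalSection": 607,
    "NtClose": 19, "NtCreateEvent": 18, "NtDuplicateObject": 16,
    "RtlInitAnsiString": 11, "NtWaitForSingleObjectEx": 11,
    "RtlInitializeCriticalSectionAndSpinCount": 15,
    "NtQueryFullAttributesFile": 9, "NtReleaseSemaphore": 9,
    "RtlNtStatusToDosError": 9, "NtSetEvent": 8,
    "KeTlsSetValue": 2, "NtCreateFile": 2, "ExCreateThread": 1,
    "ExTerminateThread": 1, "KeTlsGetValue": 1,
})
ours = Counter({
    "RtlEnterCriticalSection": 58, "RtlLeaveCriticalSection": 58,
    "NtClose": 2, "NtCreateEvent": 3, "NtDuplicateObject": 2,
    "RtlInitAnsiString": 1, "NtWaitForSingleObjectEx": 2,
    "RtlInitializeCriticalSectionAndSpinCount": 6,
    "NtQueryFullAttributesFile": 1, "NtReleaseSemaphore": 1,
    "RtlNtStatusToDosError": 1, "NtSetEvent": 1,
    "KeQueryPerformanceFrequency": 1,
})

names = set(canary) | set(ours)
# A Counter returns 0 for missing keys, so absent calls need no special case.
delta = {name: canary[name] - ours[name] for name in names}
ranked = sorted(delta.items(), key=lambda kv: (-abs(kv[1]), kv[0]))

only_canary = sorted(n for n in names if ours[n] == 0)
only_ours = sorted(n for n in names if canary[n] == 0)

for name, d in ranked[:5]:
    print(f"{name:45s} {d:+d}")
print("only canary:", only_canary)
print("only ours:  ", only_ours)
```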
## The control-flow divergence (the root cause)
Canary tid=17, idx 339-356 — the FIRST wait pattern:
```
idx=339 NtCreateEvent
idx=340 handle.create raw=0xf80000b8 hsid=1070523eb111c6ea object_type=1 (Event)
idx=343 NtDuplicateObject → handle.create at idx=344
idx=347 NtSetEvent ← THE EVENT IS SIGNALED BEFORE THE WAIT
idx=350 NtClose → handle.destroy at idx=351
idx=354 NtWaitForSingleObjectEx
idx=355 wait.begin handles=[1070523eb111c6ea] timeout=-1
idx=356 kernel.return rv=0 ← wait completes in 23µs because event was signaled
```
Ours tid=13, idx 175-434 — the analog wait pattern:
```
idx=175 NtCreateEvent
idx=177 handle.create raw=0x000012d0 hsid=d5e23609d3948568 object_type=1 (Event)
… 240 RtlEnterCriticalSection / RtlLeaveCriticalSection ops in between …
idx=419 NtDuplicateObject → handle.create at idx=420
idx=429 NtReleaseSemaphore ← DIFFERENT API — semaphore, not event-set
idx=432 NtWaitForSingleObjectEx
idx=433 wait.begin handles=[d5e23609d3948568] timeout=-1
idx=434 kernel.return rv=0 (entry probe only; actual wait blocks forever)
⏸ WEDGE — event d5e23609d3948568 is never signaled.
```
The key observation: **between `NtCreateEvent` and the corresponding
`NtWaitForSingleObjectEx`**, canary calls `NtSetEvent` to signal
the very event it is about to wait on (idiomatic self-signaled
wait-pump barrier). Ours **skips the NtSetEvent**, calls
`NtReleaseSemaphore` instead, and then blocks on the unsignaled event.
This is a **guest-code branch divergence** inside the helper
hierarchy `sub_821CB030 → sub_821CBA08 → sub_821CC3F8 → sub_821C4EB0`
(per `sub_82173990.md` chain). The branch predicate is some state
read between `NtCreateEvent` and the call site of `NtSetEvent` /
`NtReleaseSemaphore`.
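
The pattern check described above can be sketched as a scan over an ordered list of kernel-call names. The call lists below are abbreviated from the two excerpts (ours's critical-section traffic is represented by a single enter/leave pair), and the function name is hypothetical. The sketch matches on call names only; it does not verify that the signaled handle is the one being waited on.

```python
from typing import List, Optional

SIGNAL_CALLS = {"NtSetEvent", "NtReleaseSemaphore"}


def signals_before_first_wait(calls: List[str]) -> Optional[List[str]]:
    """Return the signaling calls issued between the first NtCreateEvent and
    the next NtWaitForSingleObjectEx, or None if that pattern is absent."""
    try:
        start = calls.index("NtCreateEvent")
    except ValueError:
        return None
    for i in range(start + 1, len(calls)):
        if calls[i] == "NtWaitForSingleObjectEx":
            return [c for c in calls[start + 1:i] if c in SIGNAL_CALLS]
    return None


canary_tid17 = [
    "NtCreateEvent", "NtDuplicateObject", "NtSetEvent",
    "NtClose", "NtWaitForSingleObjectEx",
]
ours_tid13 = [
    "NtCreateEvent",
    "RtlEnterCriticalSection", "RtlLeaveCriticalSection",  # stand-in for the CS run
    "NtDuplicateObject", "NtReleaseSemaphore", "NtWaitForSingleObjectEx",
]

print("canary:", signals_before_first_wait(canary_tid17))
print("ours:  ", signals_before_first_wait(ours_tid13))
```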
## Step 3/4 — Why does the predicate differ between engines?
The deep root: this exact divergence pattern is what AUDIT-069 S5
already found through a different lens:
> **AUDIT-069 S5**: "Other producers: canary 25 vs ours 1." Canary
> has 24 additional thread sources releasing the work semaphore that
> ours doesn't have.
Combining S5 with this Step 2 finding:
1. Ours's tid=13 emits ONLY 1 NtReleaseSemaphore before wedging
(consistent with the 1 "other producer" S5 measured).
2. Canary's tid=17 emits 9 NtReleaseSemaphore + 8 NtSetEvent before
reaching ExTerminateThread. Each release/set comes from a
different cache-load iteration.
3. The iteration count is gated by the loop body completing each
iteration. Each iteration begins by waiting on an event that
must be PRE-SIGNALED to advance.
In canary, the event gets pre-signaled (NtSetEvent before NtWait).
In ours, the same code path takes the "release semaphore + wait
on event signaled by external" branch instead of the "set event +
wait on event" branch. **The state read by the predicate at the
branch differs.**
What state? Without disassembling `sub_821CB030`/`sub_821CBA08`
and binding the branch PC to the guest memory location the predicate
reads, we cannot say definitively. Candidate state sources:
- A bit/flag in the ctx (`0x4024d640` in ours vs `0xbc365620` in
canary — different addresses but same shape). Could be uninitialized
in ours due to ANON_Class vtable install at `sub_824FD240+0x24`
not having fired (AUDIT-068 S4). But that vtable install fires
much later (host_ns=9.4s in canary), so this is unlikely.
- The result of a prior `NtQueryFullAttributesFile` call. Canary
tid=17 calls this 9× before reaching ExTerminateThread; ours
tid=13 calls it 1× before wedging. The file being queried is in
the `cache:\` filesystem (per `sub_82173990.md` chain).
- A CS-protected pointer in shared guest memory, set by another tid
(canary tids 4/10/14 do 38+90+38 signal events in the
[1.9..2.1s] window; in ours, tids 4/5/14 are STILL working in
[0..1.73s] but their output is shifted to ours's tid=5, which
per AUDIT-069 S5 matches canary's tid=10 producer count almost
exactly — 90 NtReleaseSemaphore each).
## Cause attribution
Per the Step 5 framework:
1. **Missing ours implementation?** NO. Every kernel API canary
tid=17 calls is also implemented in ours and works (verified by
other tids using them successfully).
2. **Incorrect return value in ours?** UNLIKELY but unverified. Phase
A schema doesn't capture args/return values for most calls;
`args_resolved={}` is empty for nearly every call in this window.
3. **Missing side effect in ours?** POSSIBLY. If `NtQueryFullAttributesFile`
or `NtCreateFile` on `cache:\<hash>\...` has a slightly different
behavior in ours (e.g., succeeds when canary fails, or vice-versa),
the resulting branch could diverge.
4. **Upstream state divergence (most likely)**: a guest-memory value
read by a predicate inside `sub_821CB030`/`sub_821CBA08` differs
between engines. The CS-heavy stretch earlier on this tid (240+
enter/leave records between idx 177 and idx 423; the per-call table
counts 58 enter/leave pairs for the whole thread) processes some data
structure, the result of which selects the branch.
**Best single guess (MEDIUM confidence)**: a `NtQueryFullAttributesFile`
on a `cache:\<hash>\<filename>` path returns a different value in
ours than in canary (file present vs not, size mismatch, or attrib
mismatch). The branch chooses "we need to recompute the cache item"
(NtReleaseSemaphore path) instead of "cache item is ready, signal
event and proceed" (NtSetEvent path).
## Disjoint-gap count
**ONE gap** — the predicate divergence inside `sub_821CB030`'s
body. However, the predicate divergence likely has a **complex
upstream cause** that involves either filesystem state or
guest-memory state initialized by another tid that ALSO has the
same kind of subtle drift. So:
- **disjoint divergence sites in this trajectory**: 1 (control-flow
branch in sub_821CB030 chain).
- **disjoint hypothesized causes**: 2-3 (file attribute return value,
shared-memory state from tid=10/5 dispatch worker, or vtable install
bypass at upstream).
This is **NOT** the "50+ disjoint missing kernel patterns" failure
mode predicted in tripstone 7. It's a single branch divergence with
multiple candidate first-causes. Methodology pivot to Option C
(critical-path sweep) is **NOT** indicated; targeted iterate per
candidate first-cause IS indicated.
## Recommended next concrete action
**Iterate plan, ordered by minimum LOC + maximum signal**:
### Iterate Step 2.A — branch-probe inside sub_821CB030 body (~50-80 LOC ours + ~50 LOC canary)
Use existing `audit_61_branch_probe_pcs` to pin the divergent
branch inside `sub_821CB030` / `sub_821CBA08` / `sub_821CC3F8`.
Specifically probe every `bne`/`beq` PC inside these guest fns
that has reachable `bl NtSetEvent` on one branch and `bl
NtReleaseSemaphore` on the other. Use sylpheed.db cross-references
to enumerate `bl 0x824AA2F0` (NtSetEvent wrapper) and `bl 0x824AB158`
(NtReleaseSemaphore wrapper) call sites in these fns.
Capture both engines, diff branch-counts. The first divergent
branch is the answer.
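
One way to find "the first divergent branch" is to compare the two engines' ordered streams of branch outcomes rather than aggregate counts, since canary runs many more loop iterations and its totals will differ everywhere. The sketch below assumes the probe can emit `(branch label, taken?)` records in execution order. The labels are placeholders, not real PCs, and this is not the output format of `audit_61_branch_probe_pcs`.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

BranchHit = Tuple[str, bool]  # (branch label, taken?)


def first_divergence(
    a: List[BranchHit], b: List[BranchHit]
) -> Optional[Tuple[int, Optional[BranchHit], Optional[BranchHit]]]:
    """Return (position, a_record, b_record) at the first mismatch, or None.

    A stream that ends early yields None for its side of the tuple.
    """
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None


# Hypothetical streams: the engines agree on two branches, then split.
canary_hits = [("branch_a", True), ("branch_b", False), ("branch_c", True)]
ours_hits = [("branch_a", True), ("branch_b", False), ("branch_c", False)]

print(first_divergence(canary_hits, ours_hits))
```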
### Iterate Step 2.B — args/return-value capture for the 9 NtQueryFullAttributesFile calls on canary tid=17 (~30 LOC canary)
Extend `audit_61` or write a dedicated probe to log `r3` on entry (the
first argument, from which the filename is resolved) and `r3` on return
(the NTSTATUS; the PowerPC ABI returns values in `r3`, not `r0`) for every
`NtQueryFullAttributesFile` call inside this 154-ms window. Compare
against ours's 1 call. If the return values differ for the same file,
that's the trigger.
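
Once both engines log a path and an NTSTATUS per call, the comparison reduces to a join on the path. A minimal sketch, with hypothetical file labels and hypothetical captured values; the two status constants are standard NTSTATUS codes, but nothing here is measured data.

```python
from typing import Dict, List, Tuple

STATUS_SUCCESS = 0x00000000
STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034


def status_mismatches(
    canary: Dict[str, int], ours: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """List (path, canary_status, ours_status) for paths queried by both
    engines whose return status differs."""
    shared = sorted(set(canary) & set(ours))
    return [(p, canary[p], ours[p]) for p in shared if canary[p] != ours[p]]


# Hypothetical captures keyed by a placeholder path label.
canary_calls = {"file_a": STATUS_SUCCESS, "file_b": STATUS_SUCCESS}
ours_calls = {"file_a": STATUS_OBJECT_NAME_NOT_FOUND}

for path, c, o in status_mismatches(canary_calls, ours_calls):
    print(f"{path}: canary=0x{c:08X} ours=0x{o:08X}")
```

Paths that only one engine queries are deliberately ignored here, because ours wedges after its first call and cannot reach the other eight.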
### Iterate Step 2.C — guest-memory read-watch on the ctx struct (~20 LOC, reuses AUDIT-068 S3 read-probe)
Use `audit_68_host_mem_read_probe` to sample the worker ctx
(`0xbc365620` in canary / `0x4024d640` in ours) at ~1ms cadence in
the window [1.7..2.1s]. Identify whether a flag/byte in the ctx
differs at the predicate-read time. This pinpoints the actual
read location if Step 2.A's branch-probe doesn't immediately reveal
the predicate source.
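
Comparing two ctx snapshots comes down to a byte-wise diff. The sketch below uses made-up 8-byte snapshots and a hypothetical helper name; it is not the output of `audit_68_host_mem_read_probe`. On real data, embedded guest pointers will differ between engines for benign reasons (the ctx addresses themselves differ), so those offsets would need to be masked out before interpreting the result.

```python
from typing import List, Tuple


def differing_offsets(a: bytes, b: bytes) -> List[Tuple[int, int, int]]:
    """Return (offset, a_byte, b_byte) for every position where the two
    equally sized snapshots differ."""
    if len(a) != len(b):
        raise ValueError("snapshots must have the same length")
    return [(off, x, y) for off, (x, y) in enumerate(zip(a, b)) if x != y]


# Made-up snapshots: identical except for one flag byte at offset 3.
canary_ctx = bytes([0x00, 0x00, 0x00, 0x01, 0xDE, 0xAD, 0xBE, 0xEF])
ours_ctx = bytes([0x00, 0x00, 0x00, 0x00, 0xDE, 0xAD, 0xBE, 0xEF])

for off, c, o in differing_offsets(canary_ctx, ours_ctx):
    print(f"+0x{off:02X}: canary=0x{c:02X} ours=0x{o:02X}")
```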
## Tripstones honored
- **#28**: verified canary's actual behavior by reading the jsonl
directly; the AUDIT-069 S5 framing is corroborated, not assumed.
- **#32**: contention regions may jitter; ours tid=13's CS enter/leave
call count does NOT match canary tid=17's (58 vs 607). The
differential here may include scheduling-nondeterminism noise.
Mitigation: cross-validate with a 2nd cold canary run if
Step 2.A doesn't immediately converge.
- **#39**: matched-prefix did NOT drive this; first-draw progression
is the goal.
- **#5 of plan tripstones**: AUDIT-069 S5 "25 producers" finding IS
downstream of Step 2's identified branch divergence. The 25
producers correspond to canary tid=17's loop iterations that ours
tid=13 doesn't reach.
## Cascade
- A (acquire canary install-epoch event log): ✓ HIGH (16,175 kernel
calls captured cleanly in [9..11s] window).
- B (identify install-trigger sequence in canary): ✓ HIGH
(canary tid=6 spawns sub_821748F0 at host_ns=1.935s, join-wait
returns at 2.092s). The "install trigger" is not a single
kernel call but the **completion of worker tid=17**, which
causes the join wait to release tid=6 into the rest of the
main-loop dispatch.
- C (identify where ours diverges from canary): ✓ HIGH (ours
tid=13 wedges 3ms into its lifetime, vs canary tid=17 running
154ms; first kernel-call sequence divergence at the
NtSetEvent vs NtReleaseSemaphore branch).
- D (attribute the divergence to a specific cause): MEDIUM (3
candidate root causes; need iterate 2.A/2.B/2.C to disambiguate).
- E (produce Δ-gap count + roadmap): ✓ HIGH (1 divergence site;
3 candidate first-causes; ~50-200 LOC iterate plan).
## Honest assessment
- The wedge framing established by AUDIT-049 .. AUDIT-069 holds.
- Step 2 moves the trigger from "the install epoch at 9.4s" back
to "the worker tid=13's first wait at 1.73s", roughly 7.7 seconds
earlier in the run.
- The 25-producer finding from AUDIT-069 S5 IS a consequence of
the Step 2 branch divergence: each missing iteration of canary
tid=17's load loop is a missing "other producer" signal.
- The fix is NOT to mirror canary's kernel calls; ours implements
them correctly. The fix is to find why ours's `sub_821CB030`
predicate evaluates differently.
- Confidence that the fix is a single guest-state correction
(file-attribute mismatch, ctx-field uninitialized, or shared-memory
flag race): MEDIUM.
## Artifacts produced this session
All under `xenia-rs/audit-runs/review-a-step2-natural-trigger/`:
- `extract_canary_install_window.py` — scanner for canary in [9..11s].
- `extract_canary_tid6_pre_install.py` — scanner for tid=6 [1.5..11s].
- `extract_canary_worker_tid.py` — locates spawn worker by hsid.
- `extract_canary_tid17_full.py` — tid=17 timeline + diff vs ours tid=13.
- `extract_ours_tid1_full.py` — ours tid=1 timeline.
- `extract_ours_tid13_final.py` — ours tid=13 timeline.
- `find_signaler.py` — finds canary tid=17 wait signalers.
- `ours_signal_counts.py` — ours per-tid signal counts.
- `canary-tid6-install-window.csv` — 32,383 events.
- `canary-tid6-install-window.summary` — kernel.call frequencies.
- `canary-tid6-from-anchor.csv` — 139,202 events.
- `canary-tid17-worker-timeline.csv` — 4140 events.
- `ours-tid13-full-timeline.csv` — 435 events.
- `ours-tid1-final-150.csv` — last 150 events on ours tid=1.
- `ours-tid1-summary` — kernel.call frequencies.
- `canary-tid17-waits.csv` — 29 wait.begin events with handle binding.
- `differential-canary-tid17-vs-ours-tid13.txt` — full call-name diff.
- `step2-report.md` — this report.
**LOC delta in this session**: 0 to xenia-rs/canary engines; 0 to
sylpheed.db; ~600 LOC analysis scripts under audit-runs/.