handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
384
audit-runs/review-a-step2-natural-trigger/step2-report.md
Normal file
@@ -0,0 +1,384 @@
# Step 2: Natural install-trigger sequence and ours divergence point

**Date:** 2026-05-21
**Mode:** PLAN-only (investigation; no engine LOC changes).
**Sources:** `canary-jitter-1.jsonl` (4.4 GB, 18.7M events) and
`phase-w-wedge-reattack/ours-postfix.jsonl` (28 MB, 121,569 events).

## TL;DR

The Step 2 plan's framing,
"identify the canary tid=6 kernel-call sequence in the install window
[9.4s, 9.6s]", **cannot be applied because ours never gets past
host_ns ≈ 1.73s.** Ours's tid=1 wedges about 8 seconds before the install
epoch. The reframed question, "what canary-tid=6 sequence between
the matched-prefix wedge point and the install epoch fails in ours?",
resolves to a **single root cause one level upstream of the wedge**:

> Canary's spawned cache-loader worker (canary tid=17, entry
> `0x821748F0`) executes 4140 events and calls `ExTerminateThread`
> at host_ns = 2.092s, after running for 154ms. Ours's analog (ours tid=13)
> executes 435 events, **never reaches its second wait iteration**,
> and wedges at its FIRST `NtWaitForSingleObjectEx` (no signaler ever
> fires). **Ours's tid=13 takes a different guest-code branch from
> the first wait onward: it calls `NtReleaseSemaphore` instead of
> `NtSetEvent` between `NtCreateEvent` and `NtWaitForSingleObjectEx`,
> so the event it then waits on is unsignaled.**

This is a **branch divergence inside guest code `sub_821CB030`'s
body**, NOT a missing kernel call in ours and, as far as the captured
trace shows, NOT a wrong return value from ours's kernel.

## Step 0 outcome: install epoch reachable on canary, not on ours

| Source | Observation | Install-epoch status |
|---|---|---|
| canary tid=6 events in [9.0s..11.0s] | 16,175 kernel.calls captured | install epoch + worker-spawn covered ✓ |
| ours tid=1 events | last event before wedge at 1.728s | install epoch is at ~9.5s, **8s in the future** |

Ours physically cannot reach 9.4s; tid=1 blocks on tid=13's thread-handle
at host_ns=1.728s, and all other tids subsequently block too (see
`phase-w-wedge-reattack/halt-on-deadlock-dump.txt`). Therefore the
canary "kernel-call sequence ours doesn't make in the install window"
question is degenerate: ours makes **none** of canary's 16,175 calls
in that window because ours stops emitting at host_ns=1.73s.

The substantive Step 2 question reframes to: **"What does canary
do between matched-prefix idx ~108,476 (= ours's last events) and
the install epoch?"** Answer: it RUNS the worker tid=17 to
completion, which causes the join-wait on canary tid=6 (ours's analog:
tid=1) to return, after which tid=6 iterates `sub_822F1AA8`'s main loop
further and eventually triggers `sub_824FD240` and `sub_825070F0`.
Everything hinges on tid=17 completing.

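The degenerate-window check above can be reproduced with a small filter over parsed trace records. This is a minimal sketch, not one of the session's `extract_*.py` scripts: the field names `tid` and `host_ns` are assumptions about the JSONL schema, and the sample records are illustrative apart from the two ours tid=1 timestamps quoted in this report.

```python
import json

def parse_jsonl(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def events_in_window(events, tid, start_s, end_s):
    """Events on `tid` whose host_ns lies in [start_s, end_s] (seconds)."""
    lo, hi = round(start_s * 1e9), round(end_s * 1e9)
    return [e for e in events if e["tid"] == tid and lo <= e["host_ns"] <= hi]

def last_event_s(events, tid):
    """Timestamp in seconds of the last event on `tid`, or None if absent."""
    stamps = [e["host_ns"] for e in events if e["tid"] == tid]
    return max(stamps) / 1e9 if stamps else None

# Assumed record shape; only the host_ns values come from this report.
sample = [
    '{"tid": 1, "host_ns": 1727479660}',
    '{"tid": 1, "host_ns": 1727614433}',
]
ours = list(parse_jsonl(sample))
print(events_in_window(ours, tid=1, start_s=9.4, end_s=9.6))  # empty list
print(last_event_s(ours, tid=1))
```

An empty result for the install window together with a last-event timestamp near 1.73s is the signal that the window comparison cannot be applied to ours.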
## Step 1 outcome: canary tid=6 spawns sub_821748F0 at host_ns=1.935s

Exact anchor:

```
canary tid=6 host_ns=1935433700 idx=108476
  ExCreateThread(entry=0x821748f0, ctx=0xbc365620, stack=524288, susp=T)
  → handle.create raw=0xf80000a0 hsid=3bd922fbb385c2c9
canary tid=6 host_ns=1937223600 idx=108498
  NtResumeThread
  NtWaitForSingleObjectEx handles=[3bd922fbb385c2c9] timeout=-1
  → wait.begin
canary tid=6 host_ns=2092000000 idx=108499 (155 ms later)
  kernel.return NtWaitForSingleObjectEx rv=0 status=0x00000000
```

The wait IS infinite (timeout_ns=-1), yet it returns in 155ms because
the worker terminates (canary tid=17's last call is `ExTerminateThread`
at host_ns=2.0918s).

Ours's mirror:

```
ours tid=1 host_ns=1727479660 idx=108481
  ExCreateThread(entry=0x821748f0, ctx=0x4024d640, stack=0, susp=T)
  → handle.create raw=0x000012c8 hsid=8a25e09a8a739c1b
ours tid=1 host_ns=1727611893 idx=108505
  wait.begin handles=[8a25e09a8a739c1b] timeout=-1
ours tid=1 host_ns=1727614433 idx=108506
  kernel.return NtWaitForSingleObjectEx rv=0 ← but this is just the
      return record from the entry probe, NOT actual unblock
```

(Note: the `ours-postfix.jsonl` schema emits the entry-probe `kernel.return`
even on an infinite wait, because the probe wraps the wait wrapper.
Per `halt-on-deadlock-dump.txt`, tid=1 is in fact still `Blocked` on
handle `0x000012c8` = Thread(id=13) at deadlock-detection time.)

The spawn parameters look identical in shape (same entry PC; ctx and
stack are run-specific). **Spawn semantics match.**

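Because ours emits a `kernel.return` record even for a wait that never unblocks, latency alone cannot tell a satisfied wait from a wedged one. The sketch below pairs each `wait.begin` with the next `kernel.return` on the same thread and also flags whether the thread went quiet afterwards. The `kind` and `host_ns` field names are assumed, the "went quiet" test is only a heuristic that still needs the deadlock dump for confirmation, and the follow-up canary event is illustrative.

```python
def classify_waits(thread_events):
    """For one thread's events in trace order, pair each wait.begin with the
    next kernel.return and report (begin_ns, latency_ns, thread_went_quiet)."""
    results = []
    for i, ev in enumerate(thread_events):
        if ev["kind"] != "wait.begin":
            continue
        ret_pos = next(
            (j for j in range(i + 1, len(thread_events))
             if thread_events[j]["kind"] == "kernel.return"),
            None,
        )
        if ret_pos is None:
            # No return record at all: the thread never came back.
            results.append((ev["host_ns"], None, True))
            continue
        latency = thread_events[ret_pos]["host_ns"] - ev["host_ns"]
        went_quiet = ret_pos == len(thread_events) - 1
        results.append((ev["host_ns"], latency, went_quiet))
    return results

canary_tid6 = [
    {"kind": "wait.begin", "host_ns": 1937223600},
    {"kind": "kernel.return", "host_ns": 2092000000},
    {"kind": "kernel.call", "host_ns": 2092100000},  # illustrative follow-up
]
ours_tid1 = [
    {"kind": "wait.begin", "host_ns": 1727611893},
    {"kind": "kernel.return", "host_ns": 1727614433},  # entry-probe record
]
print(classify_waits(canary_tid6))
print(classify_waits(ours_tid1))
```

A microsecond-scale "return" that is also the last record on the thread is the pattern to cross-check against `halt-on-deadlock-dump.txt`.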
## Step 2 outcome: canary tid=17 vs ours tid=13 kernel-call differential

Lifetimes:

| Metric | canary tid=17 | ours tid=13 |
|---|---|---|
| first event | host_ns=1.9378s | host_ns=1.7276s |
| last event | host_ns=2.0918s | host_ns=1.7307s |
| duration | **154 ms** | **3 ms** |
| total events | 4140 | 435 |
| kernel.call count | 1351 | 142 |
| terminates? | yes via `ExTerminateThread` | no, wedged on wait |

Per-call differential (top entries by |canary − ours|):

| kernel.call | canary tid=17 | ours tid=13 | Δ |
|---|---:|---:|---:|
| RtlEnterCriticalSection | 607 | 58 | +549 |
| RtlLeaveCriticalSection | 607 | 58 | +549 |
| NtClose | 19 | 2 | +17 |
| NtCreateEvent | 18 | 3 | +15 |
| NtDuplicateObject | 16 | 2 | +14 |
| RtlInitAnsiString | 11 | 1 | +10 |
| NtWaitForSingleObjectEx | 11 | 2 | +9 |
| RtlInitializeCriticalSectionAndSpinCount | 15 | 6 | +9 |
| NtQueryFullAttributesFile | 9 | 1 | +8 |
| NtReleaseSemaphore | 9 | 1 | +8 |
| RtlNtStatusToDosError | 9 | 1 | +8 |
| NtSetEvent | 8 | 1 | +7 |
| KeTlsSetValue | 2 | 0 | +2 |
| NtCreateFile | 2 | 0 | +2 |
| ExCreateThread | 1 | 0 | +1 |
| ExTerminateThread | 1 | 0 | +1 |
| KeTlsGetValue | 1 | 0 | +1 |
| KeQueryPerformanceFrequency | 0 | 1 | -1 |

**Set-difference of unique kernel-call names**: ours's set of called
APIs is a subset of canary's, with one exception:
`KeQueryPerformanceFrequency`, which canary called outside this window.
**No kernel API that canary uses is missing from ours's implementation.**
All of these APIs already work in ours (they are called successfully on
tid=5, tid=1, or tid=10 elsewhere in the same run).

The differential isn't "ours fails to implement a kernel call";
it's "ours executes roughly 10× fewer iterations of the same loop body"
(1351 vs 142 kernel calls).

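The per-call differential and the set-difference above amount to two `Counter` operations over the per-thread call-name lists. The following is a minimal sketch of that computation, not the session's `extract_canary_tid17_full.py`; the input lists are rebuilt from three rows of the table purely for illustration.

```python
from collections import Counter

def call_differential(canary_calls, ours_calls):
    """Rows of (name, canary_count, ours_count, delta), largest |delta| first."""
    c, o = Counter(canary_calls), Counter(ours_calls)
    rows = [(name, c[name], o[name], c[name] - o[name]) for name in set(c) | set(o)]
    rows.sort(key=lambda row: (-abs(row[3]), row[0]))
    return rows

def only_in(a_calls, b_calls):
    """Call names that appear in a_calls but never in b_calls."""
    return sorted(set(a_calls) - set(b_calls))

# Counts taken from the table above; everything else is omitted for brevity.
canary = (["RtlEnterCriticalSection"] * 607
          + ["NtSetEvent"] * 8
          + ["ExTerminateThread"])
ours = (["RtlEnterCriticalSection"] * 58
        + ["NtSetEvent"]
        + ["KeQueryPerformanceFrequency"])

for row in call_differential(canary, ours):
    print(row)
print("only in ours:", only_in(ours, canary))
```

Sorting by absolute delta keeps the critical-section rows at the top, which is why the much smaller `NtSetEvent` and `NtReleaseSemaphore` rows are easy to overlook without the sequence-level view in the next section.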
## The control-flow divergence (the root cause)

Canary tid=17, idx 339-356 (the FIRST wait pattern):

```
idx=339  NtCreateEvent
idx=340  handle.create raw=0xf80000b8 hsid=1070523eb111c6ea object_type=1 (Event)
idx=343  NtDuplicateObject → handle.create at idx=344
idx=347  NtSetEvent ← THE EVENT IS SIGNALED BEFORE THE WAIT
idx=350  NtClose → handle.destroy at idx=351
idx=354  NtWaitForSingleObjectEx
idx=355  wait.begin handles=[1070523eb111c6ea] timeout=-1
idx=356  kernel.return rv=0 ← wait completes in 23µs because event was signaled
```

Ours tid=13, idx 175-434 (the analog wait pattern):

```
idx=175  NtCreateEvent
idx=177  handle.create raw=0x000012d0 hsid=d5e23609d3948568 object_type=1 (Event)
         … 240 RtlEnterCriticalSection / RtlLeaveCriticalSection ops in between …
idx=419  NtDuplicateObject → handle.create at idx=420
idx=429  NtReleaseSemaphore ← DIFFERENT API: semaphore, not event-set
idx=432  NtWaitForSingleObjectEx
idx=433  wait.begin handles=[d5e23609d3948568] timeout=-1
idx=434  kernel.return rv=0 (entry probe only; actual wait blocks forever)
         ⏸ WEDGE: event d5e23609d3948568 is never signaled.
```

The key observation: **between `NtCreateEvent` and the corresponding
`NtWaitForSingleObjectEx`**, canary calls `NtSetEvent` to signal
the very event it is about to wait on (an idiomatic self-signaled
wait-pump barrier). Ours **skips the `NtSetEvent`**, calls
`NtReleaseSemaphore` instead, and then blocks on the unsignaled event.

This is a **guest-code branch divergence** inside the helper
hierarchy `sub_821CB030 → sub_821CBA08 → sub_821CC3F8 → sub_821C4EB0`
(per the `sub_82173990.md` chain). The branch predicate is some state
read between `NtCreateEvent` and the call site of `NtSetEvent` /
`NtReleaseSemaphore`.

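The comparison of the two wait patterns can be sketched as a first-divergence search over call-name sequences. Dropping the critical-section calls before comparing is an assumption of this sketch (their counts differ between engines and may jitter); the two sequences are the call names from the traces above.

```python
# Calls treated as scheduling noise for the purpose of alignment (assumption).
NOISE = {"RtlEnterCriticalSection", "RtlLeaveCriticalSection"}

def first_divergence(a, b, ignore=NOISE):
    """Return (position, a_name, b_name) at the first mismatch after filtering,
    or None if the filtered sequences are identical."""
    fa = [name for name in a if name not in ignore]
    fb = [name for name in b if name not in ignore]
    for i, (x, y) in enumerate(zip(fa, fb)):
        if x != y:
            return i, x, y
    if len(fa) != len(fb):
        # One sequence is a strict prefix of the other.
        i = min(len(fa), len(fb))
        return i, (fa[i] if i < len(fa) else None), (fb[i] if i < len(fb) else None)
    return None

canary_seq = ["NtCreateEvent", "NtDuplicateObject", "NtSetEvent",
              "NtClose", "NtWaitForSingleObjectEx"]
ours_seq = ["NtCreateEvent", "RtlEnterCriticalSection", "RtlLeaveCriticalSection",
            "NtDuplicateObject", "NtReleaseSemaphore", "NtWaitForSingleObjectEx"]

print(first_divergence(canary_seq, ours_seq))
```

On these inputs the first mismatch is `NtSetEvent` against `NtReleaseSemaphore`, which is the branch this section identifies.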
## Step 3/4: Why does the predicate differ between engines?

The deep root: this exact divergence pattern is what AUDIT-069 S5
already found through a different lens:

> **AUDIT-069 S5**: "Other producers: canary 25 vs ours 1." Canary
> has 24 additional thread sources releasing the work semaphore that
> ours doesn't have.

Combining S5 with this Step 2 finding:

1. Ours's tid=13 emits ONLY 1 `NtReleaseSemaphore` before wedging
   (consistent with the 1 "other producer" S5 measured).
2. Canary's tid=17 emits 9 `NtReleaseSemaphore` + 8 `NtSetEvent` before
   reaching `ExTerminateThread`. Each release/set comes from a
   different cache-load iteration.
3. The iteration count is gated by the loop body completing each
   iteration. Each iteration begins by waiting on an event that
   must be PRE-SIGNALED to advance.

In canary, the event gets pre-signaled (`NtSetEvent` before the wait).
In ours, the same code path takes the "release semaphore + wait
on an event signaled externally" branch instead of the "set event +
wait on event" branch. **The state read by the predicate at the
branch differs.**

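The difference between the two branches can be modeled on the host with Python's `threading.Event`. This is only an analogy under stated assumptions: `threading.Event` behaves like a manual-reset event, the guest's infinite timeout is replaced by a short finite one so the sketch terminates, and `wait_pump` is a hypothetical name, not a function in either engine.

```python
import threading

def wait_pump(pre_signal, timeout_s=0.05):
    """Model one loop iteration: create an event, optionally set it, then wait.

    Returns True if the wait was satisfied and False if it timed out. The
    guest waits forever, so a timeout here stands in for the wedge.
    """
    event = threading.Event()      # stands in for NtCreateEvent
    if pre_signal:
        event.set()                # stands in for NtSetEvent before the wait
    return event.wait(timeout_s)   # stands in for NtWaitForSingleObjectEx

print("set-then-wait branch satisfied:", wait_pump(pre_signal=True))
print("release-semaphore branch satisfied:", wait_pump(pre_signal=False))
```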
What state? Without disassembling `sub_821CB030`/`sub_821CBA08`
and binding the branch PC to the guest memory location the predicate
reads, we cannot say definitively. Candidate state sources:

- A bit/flag in the ctx (`0x4024d640` in ours vs `0xbc365620` in
  canary: different addresses but same shape). It could be uninitialized
  in ours because the ANON_Class vtable install at `sub_824FD240+0x24`
  has not fired (AUDIT-068 S4). But that vtable install fires
  much later (host_ns=9.4s in canary), so this is unlikely.
- The result of a prior `NtQueryFullAttributesFile` call. Canary
  tid=17 calls this 9× before reaching `ExTerminateThread`; ours
  tid=13 calls it 1× before wedging. The file being queried is in
  the `cache:\` filesystem (per the `sub_82173990.md` chain).
- A shared, CS-protected guest-memory pointer set by another tid
  (canary tids 4/10/14 do 38+90+38 signal events in the
  [1.9..2.1s] window; in ours, tids 4/5/14 are STILL working in
  [0..1.73s] but their output is shifted to ours's tid=5, which
  per AUDIT-069 S5 matches canary's tid=10 producer count almost
  exactly: 90 `NtReleaseSemaphore` each).

## Cause attribution

Per the Step 5 framework:

1. **Missing ours implementation?** NO. Every kernel API canary
   tid=17 calls is also implemented in ours and works (verified by
   other tids using them successfully).
2. **Incorrect return value in ours?** UNLIKELY but unverified. The Phase
   A schema doesn't capture args/return values for most calls;
   `args_resolved={}` is empty for nearly every call in this window.
3. **Missing side effect in ours?** POSSIBLY. If `NtQueryFullAttributesFile`
   or `NtCreateFile` on `cache:\<hash>\...` behaves slightly differently
   in ours (e.g., succeeds when canary fails, or vice versa),
   the resulting branch could diverge.
4. **Upstream state divergence (most likely)**: a guest-memory value
   read by a predicate inside `sub_821CB030`/`sub_821CBA08` differs
   between engines. The CS blob earlier in this tid (240+ enter/leave
   ops between idx 177 and idx 423) processes some data structure,
   the result of which selects the branch.

**Best single guess (MEDIUM confidence)**: an `NtQueryFullAttributesFile`
on a `cache:\<hash>\<filename>` path returns a different value in
ours than in canary (file present vs not, size mismatch, or attribute
mismatch). The branch chooses "we need to recompute the cache item"
(`NtReleaseSemaphore` path) instead of "cache item is ready, signal
event and proceed" (`NtSetEvent` path).

## Disjoint-gap count

**ONE gap**: the predicate divergence inside `sub_821CB030`'s
body. However, the predicate divergence likely has a **complex
upstream cause** involving either filesystem state or
guest-memory state initialized by another tid that ALSO has the
same kind of subtle drift. So:

- **disjoint divergence sites in this trajectory**: 1 (control-flow
  branch in the `sub_821CB030` chain).
- **disjoint hypothesized causes**: 2-3 (file attribute return value,
  shared-memory state from tid=10/5 dispatch worker, or vtable install
  bypass at upstream).

This is **NOT** the "50+ disjoint missing kernel patterns" failure
mode predicted in tripstone 7. It's a single branch divergence with
multiple candidate first-causes. A methodology pivot to Option C
(critical-path sweep) is **NOT** indicated; a targeted iterate per
candidate first-cause IS indicated.

## Recommended next concrete action

**Iterate plan, ordered by minimum LOC + maximum signal**:

### Iterate Step 2.A: branch-probe inside sub_821CB030 body (~50-80 LOC ours + ~50 LOC canary)

Use the existing `audit_61_branch_probe_pcs` to pin the divergent
branch inside `sub_821CB030` / `sub_821CBA08` / `sub_821CC3F8`.
Specifically, probe every `bne`/`beq` PC inside these guest fns
that has a reachable `bl NtSetEvent` on one branch and a reachable
`bl NtReleaseSemaphore` on the other. Use sylpheed.db cross-references
to enumerate the `bl 0x824AA2F0` (NtSetEvent wrapper) and `bl 0x824AB158`
(NtReleaseSemaphore wrapper) call sites in these fns.

Capture both engines and diff the branch counts. The first divergent
branch is the answer.

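Once both engines have produced branch-probe logs, the diff could look like the sketch below. It assumes each log reduces to `(pc, taken)` tuples in execution order, which is a guess at what `audit_61_branch_probe_pcs` emits, and it compares the set of directions seen at each PC rather than raw counts, because canary runs many more loop iterations. The PCs are placeholders, not addresses from the guest binary.

```python
def direction_sets(records):
    """Map each branch PC to the set of directions (True = taken) observed."""
    seen = {}
    for pc, taken in records:
        seen.setdefault(pc, set()).add(bool(taken))
    return seen

def direction_mismatches(canary, ours):
    """PCs probed in both engines whose observed direction sets differ."""
    c, o = direction_sets(canary), direction_sets(ours)
    return [(pc, sorted(c[pc]), sorted(o[pc]))
            for pc in c if pc in o and c[pc] != o[pc]]

# Placeholder PCs and outcomes, for illustration only.
canary_log = [(0x1000, True), (0x1004, False), (0x1000, True)]
ours_log = [(0x1000, True), (0x1004, True)]

for pc, canary_dirs, ours_dirs in direction_mismatches(canary_log, ours_log):
    print(hex(pc), "canary:", canary_dirs, "ours:", ours_dirs)
```

A PC that canary only ever falls through and ours only ever takes (or the reverse) is the strongest candidate; PCs where canary saw both directions may simply be loop exits that ours never reached.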
### Iterate Step 2.B: args/return-value capture for the 9 NtQueryFullAttributesFile calls on canary tid=17 (~30 LOC canary)

Extend `audit_61` or write a dedicated probe to log `r3` on entry (the
first argument, through which the filename is reached) and `r3` on return
(the NTSTATUS, which the PowerPC calling convention returns in `r3`) for
every `NtQueryFullAttributesFile` call inside this 154-ms window. Compare
against ours's 1 call. If file-attribute return values differ on a
shared file, that's the trigger.

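With paths and statuses captured, the comparison is a keyed lookup. The sketch below assumes each probe record reduces to a `(path, ntstatus)` pair; the paths are placeholders that keep the report's elided `<hash>` component, and the statuses are chosen for illustration rather than taken from a real run. The two NTSTATUS constants are the standard Windows values.

```python
STATUS_SUCCESS = 0x00000000
STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034

def status_mismatches(canary, ours):
    """Paths queried by both engines where the returned NTSTATUS differs.

    Only the first canary result per path is kept, so repeated queries of
    the same file in later loop iterations do not mask the first answer.
    """
    canary_by_path = {}
    for path, status in canary:
        canary_by_path.setdefault(path, status)
    return [(path, canary_by_path[path], status)
            for path, status in ours
            if path in canary_by_path and canary_by_path[path] != status]

# Placeholder data: not captured from either engine.
canary_results = [("cache:\\<hash>\\item0", STATUS_SUCCESS),
                  ("cache:\\<hash>\\item1", STATUS_SUCCESS)]
ours_results = [("cache:\\<hash>\\item0", STATUS_OBJECT_NAME_NOT_FOUND)]

for path, canary_status, ours_status in status_mismatches(canary_results, ours_results):
    print(f"{path}: canary=0x{canary_status:08X} ours=0x{ours_status:08X}")
```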
### Iterate Step 2.C: guest-memory read-watch on the ctx struct (~20 LOC, reuses AUDIT-068 S3 read-probe)

Use `audit_68_host_mem_read_probe` to sample the worker ctx
(`0xbc365620` in canary / `0x4024d640` in ours) at ~1ms cadence in
the window [1.7..2.1s]. Identify whether a flag/byte in the ctx
differs at the predicate-read time. This pinpoints the actual
read location if Step 2.A's branch-probe doesn't immediately reveal
the predicate source.

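Since the ctx lives at different guest addresses in the two engines, sampled snapshots have to be compared by offset, and any embedded pointers will differ for benign reasons. A minimal sketch of that comparison follows; it assumes the probe output can be reduced to one `bytes` snapshot per engine, and the snapshot contents and the ignored range are invented for illustration.

```python
def diff_snapshots(canary, ours, ignore=()):
    """Offsets where two ctx snapshots differ, as (offset, canary_byte, ours_byte).

    `ignore` is a sequence of half-open (start, end) offset ranges to skip,
    for example fields known to hold run-specific pointers.
    """
    skipped = set()
    for start, end in ignore:
        skipped.update(range(start, end))
    limit = min(len(canary), len(ours))
    return [(off, canary[off], ours[off])
            for off in range(limit)
            if off not in skipped and canary[off] != ours[off]]

# Invented snapshots: a 4-byte pointer-like field followed by a 4-byte flag word.
canary_ctx = bytes([0xBC, 0x36, 0x56, 0x20, 0x00, 0x00, 0x00, 0x01])
ours_ctx = bytes([0x40, 0x24, 0xD6, 0x40, 0x00, 0x00, 0x00, 0x00])

print(diff_snapshots(canary_ctx, ours_ctx, ignore=[(0, 4)]))
```

Offsets that survive the ignore list and differ at the sample closest to the predicate read are the fields worth binding back to the branch PC found in Step 2.A.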
## Tripstones honored

- **#28**: verified canary's actual behavior by reading the jsonl
  directly; the AUDIT-069 S5 framing is corroborated, not assumed.
- **#32**: contention regions may jitter; the 240+ CS enter/leave
  ops in ours tid=13 do NOT match canary tid=17's count
  (`RtlEnterCriticalSection`: 607 in canary vs 58 in ours). The
  differential here may include scheduling-determinism
  noise. Mitigation: cross-validate with a 2nd cold canary run if
  Step 2.A doesn't immediately converge.
- **#39**: matched-prefix did NOT drive this; first-draw progression
  is the goal.
- **#5 of plan tripstones**: the AUDIT-069 S5 "25 producers" finding IS
  downstream of Step 2's identified branch divergence. The 25
  producers correspond to canary tid=17's loop iterations that ours
  tid=13 doesn't reach.

## Cascade

- A (acquire canary install-epoch event log): ✓ HIGH (16,175 kernel
  calls captured cleanly in the [9..11s] window).
- B (identify install-trigger sequence in canary): ✓ HIGH
  (canary tid=6 spawns `sub_821748F0` at host_ns=1.935s; the join-wait
  returns at 2.092s). The "install trigger" is not a single
  kernel call but the **completion of worker tid=17**, which
  causes the join wait to release tid=6 into the rest of the
  main-loop dispatch.
- C (identify where ours diverges from canary): ✓ HIGH (ours
  tid=13 wedges 3ms into its lifetime, vs canary tid=17 running
  154ms; first kernel-call sequence divergence at the
  `NtSetEvent` vs `NtReleaseSemaphore` branch).
- D (attribute the divergence to a specific cause): MEDIUM (3
  candidate root causes; need iterate 2.A/2.B/2.C to disambiguate).
- E (produce Δ-gap count + roadmap): ✓ HIGH (1 divergence site;
  3 candidate first-causes; ~50-200 LOC iterate plan).

## Honest assessment

- The wedge framing established by AUDIT-049 .. AUDIT-069 holds.
- Step 2 narrows the trigger from "the install epoch at 9.4s" down
  to "the worker tid=13's first wait at 1.73s", which moves the
  point of interest roughly 7.7 s earlier in the trace.
- The 25-producer finding from AUDIT-069 S5 IS a consequence of
  the Step 2 branch divergence: each missing iteration of canary
  tid=17's load loop is a missing "other producer" signal.
- The fix is NOT to mirror canary's kernel calls; ours implements
  them correctly. The fix is to find why ours's `sub_821CB030`
  predicate evaluates differently.
- Confidence that the fix is a single guest-state correction
  (file-attribute mismatch, uninitialized ctx field, or shared-memory
  flag race): MEDIUM.

## Artifacts produced this session

All under `xenia-rs/audit-runs/review-a-step2-natural-trigger/`:

- `extract_canary_install_window.py`: scanner for canary in [9..11s].
- `extract_canary_tid6_pre_install.py`: scanner for tid=6 [1.5..11s].
- `extract_canary_worker_tid.py`: locates spawn worker by hsid.
- `extract_canary_tid17_full.py`: tid=17 timeline + diff vs ours tid=13.
- `extract_ours_tid1_full.py`: ours tid=1 timeline.
- `extract_ours_tid13_final.py`: ours tid=13 timeline.
- `find_signaler.py`: finds canary tid=17 wait signalers.
- `ours_signal_counts.py`: ours per-tid signal counts.
- `canary-tid6-install-window.csv`: 32,383 events.
- `canary-tid6-install-window.summary`: kernel.call frequencies.
- `canary-tid6-from-anchor.csv`: 139,202 events.
- `canary-tid17-worker-timeline.csv`: 4140 events.
- `ours-tid13-full-timeline.csv`: 435 events.
- `ours-tid1-final-150.csv`: last 150 events on ours tid=1.
- `ours-tid1-summary`: kernel.call frequencies.
- `canary-tid17-waits.csv`: 29 wait.begin events with handle binding.
- `differential-canary-tid17-vs-ours-tid13.txt`: full call-name diff.
- `step2-report.md`: this report.

**LOC delta in this session**: 0 to xenia-rs/canary engines; 0 to
sylpheed.db; ~600 LOC analysis scripts under audit-runs/.