handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
audit-runs/audit-069-wait-signal-producer/writer-report-v4.md
# AUDIT-069 Session 4 — writer report v4

Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1/S2/S3)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357`
(UNCHANGED at start AND end of S4)
No ours source modifications. No canary instrumentation added.
Canary `audit_61_branch_probe_pcs` cvar used (pre-existing from S1).
Canary cache restored from `/tmp/canary-cache-bak-audit-068`.

## Headline (HIGH confidence — direct per-iteration measurement)

S3's "producer-loop underrun" framing pointed in the right direction
but mis-located the divergence. **Neither engine ever takes the
exit-branch in `sub_82450A68` (PC=0x82450B50, the LR=epilog path)**.
Both engines' dispatch threads stay in the loop indefinitely (no
deadlock; just waiting).

The actual divergence is in the **return value of the
`NtWaitForMultipleObjectsEx` call at PC=0x82450B44**:

- **Ours: r3 = 0x00000001 in 91/91 captures (100%)**: semaphore acquired.
- **Canary: r3 = 0x00000102 in 3/4 captures (75%)**: WAIT_TIMEOUT.

The two handles being waited on are:
- **handle[0] = NtCreateEvent** at `[r31+88]`: the STOP event (signal → exit).
- **handle[1] = NtCreateSemaphore(InitialCount=0, MaximumCount=0x7FFFFFFF)**
  at `[r31+92]`: the WORK semaphore (signal → process work).

Both are created by `sub_8244FF50` (spawn helper) BEFORE `ExCreateThread`.
mem-watch confirms handle slots in ours: `0x104C` (event) / `0x1050`
(semaphore) at run-1; absolute IDs drift across runs but the slot
layout is invariant.

This is **NOT an exit-branch divergence and NOT loop-underrun in the
literal sense; it is a SEMAPHORE-STATE divergence**. In ours the
work-semaphore count is non-zero at every wait entry (so the wait
always returns immediately with success); in canary the count is zero
at most wait entries (so the wait times out per the 16 ms relative
timeout).

## Method (READ-ONLY, no source mod)

1. Disassembled `sub_82450A68` body (80 instructions) via
   `xenia-rs disasm --at 0x82450A68 -n 200`. Saved to
   `s4/sub_82450A68-disasm.txt`.
2. Identified loop topology: prolog → wait-#1 → body (with inner search
   over 5-slot table at [r31+112..212]) → dispatch (bl 0x82450B68 →
   γ-signaler family) → re-wait → back-edge OR exit.
3. Ran ours-cold with `--branch-probe=` on 14 BB-entry PCs covering all
   loop-body paths. Captured 696 records over ~80s wallclock /
   91 loop iterations.
4. Ran canary-cold (cache wiped → restored from
   `/tmp/canary-cache-bak-audit-068`) with same `audit_61_branch_probe_pcs`
   cvar set. Canary process faulted in vkd3d-proton at ~10s wallclock;
   captured 35 records / 4 loop iterations. Sufficient to surface the
   r3 distribution.
5. Used `--mem-watch=0x828F3BC0,0x828F3BC4` to identify which ours
   handle IDs live in slots `[r31+88]` and `[r31+92]`. Then
   disassembled `sub_8244FF50` to confirm event-vs-semaphore types via
   the import jumps (`NtCreateEvent` at 0x824A9F18, `NtCreateSemaphore`
   at 0x824AB0C0).
6. Cross-checked ours's kernel handlers (`nt_wait_for_multiple_objects_ex`,
   `do_wait_multiple`, `handle_consume`, `nt_release_semaphore`,
   `try_release_semaphore`, `wake_eligible_waiters`). The code looks
   correct in isolation; the divergence is NOT in these handlers
   directly.

## Per-PC iteration counts

| PC | path | ours fires | canary fires | note |
|---|---|---:|---:|---|
| 0x82450AA4 | first-iter entry | 1 | 1 | both entered once |
| 0x82450AAC | back-edge target | 91 | 4 | canary crashed early |
| 0x82450AC0 | flag@212==0 → r4=5 | 2 | 0 | rare path |
| 0x82450AC8 | flag@212!=0 → search | 90 | 4 | dominant |
| 0x82450AE4 | inner-search continue | 72 | 17 | |
| 0x82450AF4 | search-exhausted | 8 | 3 | no candidate found |
| 0x82450AF8 | candidate-found | 82 | 1 | |
| 0x82450B04 | budget skip | 81 | 0 | |
| 0x82450B10 | budget refresh | 8 | 0 | |
| 0x82450B28 | dispatch entry | 74 | 1 | bl 0x82450B68 |
| 0x82450B34 | re-wait entry | 92 | 4 | |
| **0x82450B50** | **EXIT (epilog)** | **0** | **0** | **never reached** |

## r3 at back-edge (the divergence signal)

| r3 at back-edge | ours | canary |
|---|---|---|
| r3=0x1 | 91/91 (100%) | 1/4 (25%) |
| r3=0x102 (TIMEOUT) | 0/91 (0%) | 3/4 (75%) |
| r3=0x0 (handle[0] signaled) | 0/91 | 0/4 |
| r3=other | 0/91 | 0/4 |

This is the **per-iteration measurement** the user's framing predicted.
The matching iterations show different r3 values at the SAME PC. The
"load feeding the predicate" is, however, NOT a guest-memory load: it
is the kernel-side return of `NtWaitForMultipleObjectsEx`. The
divergent KERNEL STATE is the work-semaphore count.
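
The table can be reproduced from raw back-edge samples with a small tally. The following is a minimal sketch, not code from xenia-rs: the names `WaitOutcome`, `classify_r3`, and `tally` are hypothetical, and the input is assumed to be the list of r3 values captured at the back-edge PC.

```rust
use std::collections::BTreeMap;

/// Outcome of the two-handle wait, decoded from r3 at the back-edge.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum WaitOutcome {
    /// r3 == 0x0: handle[0] (stop event) signaled.
    StopEvent,
    /// r3 == 0x1: handle[1] (work semaphore) acquired.
    WorkSemaphore,
    /// r3 == 0x102: the wait timed out.
    Timeout,
    /// Any other status, kept verbatim for inspection.
    Other(u32),
}

/// Maps a raw r3 value onto the outcome buckets used in the table.
pub fn classify_r3(r3: u32) -> WaitOutcome {
    match r3 {
        0x0 => WaitOutcome::StopEvent,
        0x1 => WaitOutcome::WorkSemaphore,
        0x102 => WaitOutcome::Timeout,
        other => WaitOutcome::Other(other),
    }
}

/// Counts how many samples fall into each bucket.
pub fn tally(samples: &[u32]) -> BTreeMap<WaitOutcome, usize> {
    let mut counts = BTreeMap::new();
    for &r3 in samples {
        *counts.entry(classify_r3(r3)).or_insert(0) += 1;
    }
    counts
}
```

Feeding it 91 samples of `0x1` gives a single `WorkSemaphore` bucket, while the four canary samples split into one `WorkSemaphore` and three `Timeout` entries.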

## Wait wrapper chain (disasm-derived)

```
sub_824AB240:
  li r7, 0          ; alertable = 0
  b 0x824AB190      ; tail-jump

sub_824AB190(r3=numObj, r4=&handles, r5=WaitMode, r6=Timeout(=16 ms), r7=Alertable):
  ...
  bl 0x824ACA88     ; converts r4=16 ms → LARGE_INTEGER -160000 (relative 100-ns ticks)
  ...
  bl 0x8284E08C     ; NtWaitForMultipleObjectsEx (ord 254, import @ VA 0x8284E08C)
  ; returns NTSTATUS in r3:
  ;   0     = WAIT_OBJECT_0   = handle[0] (stop event) signaled
  ;   1     = WAIT_OBJECT_0+1 = handle[1] (work semaphore) acquired (atomically decrements count by 1)
  ;   0x102 = WAIT_TIMEOUT    = 16 ms elapsed with no signal
```
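
The `-160000` value in the listing follows from the NT convention that a negative timeout is relative and counted in 100 ns ticks. A minimal sketch of that arithmetic is below; `relative_timeout_ticks` is a hypothetical name, and the guest helper at `0x824ACA88` may do more than this.

```rust
/// Number of 100 ns ticks in one millisecond.
const TICKS_PER_MS: i64 = 10_000;

/// Converts a millisecond timeout into a relative NT timeout.
/// Relative timeouts are expressed as negative 100 ns tick counts.
pub fn relative_timeout_ticks(ms: u32) -> i64 {
    -(i64::from(ms) * TICKS_PER_MS)
}
```

For the 16 ms timeout used here, the result is `-160_000`, matching the disassembly comment.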

`sub_82450A68` branches on this:
- `cmplwi cr6, r3, 0; beq cr6, 0xB50` → r3 == 0 → EXIT (stop event signaled)
- `cmplwi cr6, r3, 0; bne cr6, 0xAAC` → r3 != 0 (including 0x102) → CONTINUE
  - r3 == 1 → at least one work-item is available → run the inner table search
  - r3 == 0x102 → just a 16 ms timer wake; inner search will likely find no candidate
    and the loop just re-waits

In canary's brief 4-iteration captured window, only iteration-0 had real
work (`r3=1`); iterations 1-3 were timer-wakes (`r3=0x102`). In ours's
91-iteration window, all back-edges saw `r3=1`: someone has released
the semaphore at least once between each consume.

## Handle slot identification (HIGH confidence)

Via `--mem-watch=0x828F3BC0,0x828F3BC4`:

```
MEM-WATCH addr=0x828f3bc0 old=0x00000000 new=0x0000104c
          store_addr=0x828f3bc0 store_len=4 tid=1 pc=0x8244ffb0 lr=0x8244ffb0
MEM-WATCH addr=0x828f3bc4 old=0x00000000 new=0x00001050
          store_addr=0x828f3bc4 store_len=4 tid=1 pc=0x8244ffcc lr=0x8244ffcc
```

Static disasm of writer PCs:
```
0x8244FFAC: bl 0x824A9F18        ; NtCreateEvent wrapper
0x8244FFB0: stw r3, 88(r30)      ; handle[0] = event = ours 0x104C
0x8244FFC8: bl 0x824AB0C0        ; NtCreateSemaphore wrapper (r4=0=Initial, r5=0x7FFFFFFF=Max)
0x8244FFCC: stw r3, 92(r30)      ; handle[1] = semaphore = ours 0x1050
```
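
The two watched addresses and the two `stw` displacements are mutually consistent, which can be checked with a few lines of arithmetic. This is a sketch under one assumption: that `r30` in the spawn helper and `r31` in the dispatch loop point at the same context structure, as the slot naming above implies. The base address it yields is inferred, not captured, and the function names are hypothetical.

```rust
/// Displacement of the stop-event handle slot (`stw r3, 88(r30)`).
pub const STOP_EVENT_OFFSET: u32 = 88;
/// Displacement of the work-semaphore handle slot (`stw r3, 92(r30)`).
pub const WORK_SEMAPHORE_OFFSET: u32 = 92;

/// Recovers the context base from the watched event-slot address.
/// Returns `None` if the address is too small to hold the displacement.
pub fn context_base_from_event_slot(event_slot_addr: u32) -> Option<u32> {
    event_slot_addr.checked_sub(STOP_EVENT_OFFSET)
}

/// Returns `(event_slot, semaphore_slot)` guest addresses for a context base.
pub fn slot_addresses(base: u32) -> Option<(u32, u32)> {
    let event = base.checked_add(STOP_EVENT_OFFSET)?;
    let semaphore = base.checked_add(WORK_SEMAPHORE_OFFSET)?;
    Some((event, semaphore))
}
```

With the watched address `0x828F3BC0`, the inferred base is `0x828F3B68`, and `slot_addresses` on that base returns the pair `0x828F3BC0` / `0x828F3BC4` passed to `--mem-watch`.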

The semaphore is created with **InitialCount=0**. So if no one ever
calls `NtReleaseSemaphore` on it, the wait will only ever return
`STATUS_TIMEOUT`. Canary's behavior (mostly 0x102, occasionally 0x1)
matches this: producers release the semaphore ~1× per ~16ms.

Ours's behavior (always 0x1) means **producers release the semaphore
FASTER THAN the consumer drains it**.
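
The two regimes can be illustrated with a toy counting semaphore. This is a deliberately simplified sketch, not the xenia-rs kernel object: it ignores the stop event, treats an empty semaphore as an immediate timeout rather than a 16 ms wait, and all names (`WorkSemaphore`, `simulate`) are hypothetical.

```rust
/// Status returned when the work semaphore (handle[1]) is acquired.
pub const WAIT_WORK: u32 = 0x1;
/// Status returned when the wait times out.
pub const WAIT_TIMEOUT: u32 = 0x102;

/// Toy counting semaphore with a maximum count.
pub struct WorkSemaphore {
    count: u32,
    max: u32,
}

impl WorkSemaphore {
    pub fn new(initial: u32, max: u32) -> Self {
        Self { count: initial.min(max), max }
    }

    /// Adds `n` tokens; refuses (returns false) if that would exceed `max`.
    pub fn release(&mut self, n: u32) -> bool {
        match self.count.checked_add(n) {
            Some(next) if next <= self.max => {
                self.count = next;
                true
            }
            _ => false,
        }
    }

    /// Consumes one token if available, otherwise reports a timeout.
    pub fn wait(&mut self) -> u32 {
        if self.count > 0 {
            self.count -= 1;
            WAIT_WORK
        } else {
            WAIT_TIMEOUT
        }
    }

    pub fn count(&self) -> u32 {
        self.count
    }
}

/// Runs `iterations` waits on a semaphore created with InitialCount=0.
/// `releases_before(i)` is the number of tokens released before wait `i`.
pub fn simulate(iterations: usize, releases_before: impl Fn(usize) -> u32) -> Vec<u32> {
    let mut sem = WorkSemaphore::new(0, 0x7FFF_FFFF);
    (0..iterations)
        .map(|i| {
            sem.release(releases_before(i));
            sem.wait()
        })
        .collect()
}
```

With at least one release before every wait, every result is `0x1`, as in ours. With a single release before the first wait and none afterwards, the sequence is `0x1` followed by `0x102` values, which is the shape of the 4-iteration canary capture.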

## NtReleaseSemaphore call graph (xrefs to wrapper sub_824AB158)

Wrapper sub_824AB158 calls NtReleaseSemaphore (ord 243, import @
VA 0x8284E07C). Called from 22 sites across 18 functions:

```
0x822c6770                 fn=0x822c6748
0x822c6848                 fn=0x822c6808
0x822c95c4 .. 0x822c9718   fn=0x822c8b50  (×6 inline call sites)
0x822f23e8                 fn=0x822f2328
0x823dd7f8                 fn=0x823dd770
0x823dda3c                 fn=0x823dd838
0x823df008..1b4            fn=0x823de4b8  (×3)
0x823df604                 fn=0x823df320
0x82450310                 fn=0x82450218  ← dispatcher-module enqueuer (callers: sub_82452DC0 ×2)
0x824504c4                 fn=0x824503A0  ← dispatcher-module enqueuer (callers: sub_82452690, sub_8245E1D8)
0x82450cdc                 fn=0x82450b68  ← THE DISPATCH FUNCTION itself (self-release)
0x82450d28                 fn=0x82450b68  ← THE DISPATCH FUNCTION itself (self-release)
0x82456b48                 fn=0x824569c0  (jump form)
0x82458020                 fn=0x82457fe0
0x824584c8                 fn=0x82458468
0x82459424                 fn=0x824591c0
0x8245ab6c                 fn=0x8245aaf0
0x8245ac6c                 fn=0x8245abd8
0x8245ade0                 fn=0x8245ad00
```

**Critical observation**: the dispatch function `sub_82450B68`
contains TWO release sites (at 0x82450CDC and 0x82450D28). Each successful
dispatch run can release the semaphore again. If both branches release
+1 token, and the wait consumes only -1 per iteration, the count would
drift up. This is consistent with the "ours over-released" hypothesis.

Some sub_82450B68 branches release the semaphore via `lwz r3, 92(r27)`,
which loads `handle[1]` of the dispatcher itself. So the dispatch function
re-fills its own pipe.

## Hypothesis (MEDIUM-HIGH confidence)

The semaphore is being over-released in ours due to a divergent
**dispatch-loop control flow inside `sub_82450B68`** that
differentially decides whether to fire the self-release. Either:
(a) ours takes a sub_82450B68 branch that releases when canary's doesn't
    (this is the dual of S3's question: which sub-branches differ?), OR
(b) ours's parse_timeout scales the 16 ms relative timeout by /100
    (exports.rs:4495: `magnitude.max(1) / 100`), turning a 16 ms wall-clock
    timeout into 1,600 emulator-ticks. This may differentially interact
    with how often the semaphore gets a release between wait entries.
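
The arithmetic behind (b) can be checked in isolation. The sketch below only mirrors the quoted expression `magnitude.max(1) / 100`; it assumes the magnitude is the absolute value of the raw relative tick count and that the division is integer division. `scaled_timeout` is a hypothetical name, not the real `parse_timeout` signature.

```rust
/// Applies the `magnitude.max(1) / 100` scaling to a raw NT timeout.
/// Assumption: `magnitude` is the absolute value of the raw tick count.
pub fn scaled_timeout(raw_ticks: i64) -> u64 {
    let magnitude = raw_ticks.unsigned_abs();
    magnitude.max(1) / 100
}
```

For the `-160000` relative timeout this yields 1,600. Under the same assumptions, any magnitude below 100 ticks scales to 0 because the `max(1)` clamp is applied before the division.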

The exit-branch-at-matching-iteration framing from the user's task spec
does NOT apply here: there IS no exit-branch divergence (both never
exit). The divergence is in the wait return value, which has no
proximate guest-memory load. The "load feeding the predicate" is a
kernel-state read (the semaphore count) performed inside the kernel
import handler itself.

## Most-recent kernel calls (tid=5 in ours, from S3 lr-trace data + S4 cross-check)

Most-recent kernel calls before each wait at PC=0x82450B44 (re-wait
site), on ours tid=5:

- `NtReleaseSemaphore(handle=0x1050, count=1)` via wrapper
  sub_824AB158, lr=0x82450CDC OR lr=0x82450D28 (both inside sub_82450B68
  dispatch body): self-release in the dispatch tail.
- `KeSetEvent(handle=0x10xx)` via wrapper sub_824AA2F0 OR sub_824AAF50:
  γ-signaler family fires (the audit's original signaler PCs from S1/S3).
- `KeQueryPerformanceCounter`-like via sub_824AA830: used in budget
  refresh path.

In **canary**, the equivalent sequence per S1's signal-probe-correlated.log
(180s window) is similar (γ-signalers fire 492× on tid=10), but the
SELF-RELEASE rate matters more; that rate determines whether the consumer
keeps seeing a non-zero semaphore.

## S5 recommendation (refined)

The right next step is **NOT** to walk further upstream in the
γ-signaler chain (S3's lead). It is to **measure the per-branch flow
inside `sub_82450B68` itself**: find which of its many branches
release the semaphore and how that branch is selected.

### Path A (RECOMMENDED, ~0 LOC, read-only)

`--branch-probe` covering `sub_82450B68` body (PCs 0x82450B68 ..
0x82451238, the dispatch body). Want to capture:

1. Frequency at the two release sites `0x82450CDC` and `0x82450D28`
   (per-call cumulative count on tid=5).
2. Frequency at the OTHER exit sites in sub_82450B68 (e.g. the early
   return at `0x82450EE8` which does NOT release).

If ours's release-rate at CDC/D28 is significantly higher than canary's,
that confirms (a). If similar, then (b) becomes the next theory.
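
One way to make "significantly higher" concrete is to normalise release counts by loop iterations, since the canary capture is much shorter than ours. The sketch below is illustrative only: `ReleaseStats`, `pick_theory`, and the `ratio_threshold` parameter are assumptions, and the threshold value itself is a judgment call that the report does not specify.

```rust
/// Which hypothesis the Path A measurement points at.
#[derive(Debug, PartialEq)]
pub enum NextTheory {
    /// (a): ours takes a releasing branch more often than canary.
    BranchDivergence,
    /// (b): release rates are similar, so look at timeout scaling next.
    TimeoutScaling,
}

/// Probe hit counts for one engine over one capture window.
pub struct ReleaseStats {
    /// Hits at release site 0x82450CDC.
    pub releases_cdc: u64,
    /// Hits at release site 0x82450D28.
    pub releases_d28: u64,
    /// Back-edge count of the dispatch loop in the same window.
    pub loop_iterations: u64,
}

impl ReleaseStats {
    /// Self-releases per loop iteration, or `None` for an empty window.
    pub fn releases_per_iteration(&self) -> Option<f64> {
        if self.loop_iterations == 0 {
            return None;
        }
        let total = (self.releases_cdc + self.releases_d28) as f64;
        Some(total / self.loop_iterations as f64)
    }
}

/// Picks (a) when ours releases more than `ratio_threshold` times as often
/// per iteration as canary, otherwise (b). Returns `None` if either
/// capture window is empty.
pub fn pick_theory(
    ours: &ReleaseStats,
    canary: &ReleaseStats,
    ratio_threshold: f64,
) -> Option<NextTheory> {
    let ours_rate = ours.releases_per_iteration()?;
    let canary_rate = canary.releases_per_iteration()?;
    if ours_rate > canary_rate * ratio_threshold {
        Some(NextTheory::BranchDivergence)
    } else {
        Some(NextTheory::TimeoutScaling)
    }
}
```

Normalising per iteration keeps a 4-iteration canary window comparable with a 91-iteration ours window, although a window that short still leaves the canary rate poorly constrained.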

### Path B (~80 LOC ours-side probe, requires source instrumentation)

The goal is to confirm that the magnitude/100 scaling in
`xenia_kernel::exports::parse_timeout` actually causes the divergence.
`--branch-probe` cannot be used for this, because parse_timeout is
Rust, not guest code, so this path requires source instrumentation.
Mid-priority.

### Path C (~30 LOC canary diagnostic)

Add canary cvar `audit_69_semaphore_count_probe = VA` that emits the
post-Set count for the semaphore at native VA matching ours's
[r31+92]'s underlying X_KSEMAPHORE. Compare per-iteration count
progression canary-vs-ours.

LOC budget for S5: Path A = 0, Path B = ~80, Path C = ~30.

**Path A first**: it narrows the divergence to specific sub_82450B68
sub-branch behavior at zero LOC cost.

## Cascade

- **A** (disasm sub_82450A68): PASS (HIGH). 80-instruction body,
  3 BB-paths, 12 BB-entries identified.
- **B** (ours per-iteration loop-branch trace): PASS (HIGH).
  91 back-edge captures, all r3=0x1.
- **C** (canary same trace): PARTIAL (MEDIUM). Canary crashed at
  4 iterations in vkd3d-proton on exit; 4 captures sufficient to surface
  r3=0x102 dominance, but not a long-window comparison.
- **D** (identify divergent load): PARTIAL (MEDIUM). No guest-memory
  load is the proximate cause; the divergence is in the kernel-side
  semaphore-count state. The "load" is conceptually inside
  `do_wait_multiple`'s read of `KernelObject::Semaphore.count`.

Net 2/4 PASS-HIGH, 2/4 PARTIAL-MEDIUM. Methodology learned: when both
engines stay in a loop, "which branch did ours take differently" is the
WRONG question; ask "what's different at the SAME branch."

## Confidence flags (summary)

| finding | confidence |
|---|---|
| Both engines never take exit-branch (B50) | HIGH |
| ours back-edge r3=1 always (91/91) | HIGH |
| canary back-edge r3=0x102 mostly (3/4) | HIGH |
| handle[1] is NtCreateSemaphore w/ InitialCount=0 | HIGH |
| handle[0] is NtCreateEvent | HIGH |
| Divergence is kernel-side semaphore-count state | MEDIUM-HIGH |
| sub_82450B68 self-release over-fires in ours | MEDIUM |
| parse_timeout /100 scaling is contributing | LOW-MEDIUM |

## Discipline

- xenia-rs HEAD `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` UNCHANGED
  (sha256 of `git diff HEAD` matches S1/S2/S3 end at session start AND end).
- READ-ONLY ours. No source mod. `--branch-probe` / `--lr-trace` /
  `--mem-watch` / `--trace-handles-focus` are runtime read-only flags
  documented as "lockstep digest unaffected" (state.rs comments).
- Canary `audit_61_branch_probe_pcs` cvar enabled with our PC set; set
  back to "" at session end. Verified.
- Canary `mute = true` set during run, restored to `false` at session end.
- Canary cache wiped before cold canary run, restored from
  `/tmp/canary-cache-bak-audit-068` at session end.

## Artifacts

```
audit-runs/audit-069-wait-signal-producer/s4/
  sub_82450A68-disasm.txt            (80 ins disasm: sub_82450A28 entry + body)
  ours-loop-branch-trace.stdout      (696 BRANCH-PROBE records, ours-cold)
  ours-loop-branch-trace.stderr      (empty under --quiet)
  canary-loop-branch-trace.stdout    (1074 lines, 35 AUDIT-061-BR records)
  canary-loop-branch-trace.stderr    (89 lines, wine/vkd3d setup + final fault)
  ours-mem-watch.stderr              (2 MEM-WATCH records identifying handle slots)
  ours-mem-watch.stdout              (empty)
  ours-signaler.jsonl                (95 lr-trace records on wrapper PCs)
  ours-handles.{stdout,stderr}       (probe for handle dump; --halt-on-deadlock didn't trigger)
  ours-trace-handles-summary.log     (21 lines: focus startup + 8 ExCreateThread spawns)
  divergence-analysis.md             (per-iter table, hypothesis, S5 leads)
  writer-report-v4.md                (this file)
```

No canary instrumentation diff this session. No `fix-canary-s4.diff`.

## Summary of S1 → S2 → S3 → S4 arc

- **S1** (2026-05-20 AM): identified canary tid=10 as the signaler;
  claimed ours lacks this thread (FALSIFIED by S2).
- **S2** (2026-05-20 noon): spawn-chain runs identically on ours tid=5;
  refined to "wrong-handle selection" downstream (REFINED by S3).
- **S3** (2026-05-20 PM): ours runs identical PC/LR chain but with
  ~5× fewer iterations. Producer-loop underrun classification.
  Wedge handle never even created in ours's truncated boot.
- **S4** (2026-05-20 evening): per-iteration branch-probe shows
  **NEITHER engine ever exits the loop**. Divergence is in
  `NtWaitForMultipleObjectsEx` return: ours r3=1 always (semaphore
  acquired), canary r3=0x102 mostly (timeout). Root cause is
  **semaphore-count state divergence**: ours's work-semaphore is
  over-released relative to consume rate, OR ours's timeout never
  fires before signal. Hypothesis: divergence inside `sub_82450B68`
  dispatch body's self-release logic.

The S5 question is no longer "which earlier kernel call differs";
it is "which sub-branch of `sub_82450B68` releases the semaphore in
ours that canary's doesn't release in." Read-only branch-probe on
sub_82450B68 body PCs.