xenia-rs/audit-runs/audit-069-wait-signal-producer/writer-report-v4.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# AUDIT-069 Session 4 — writer report v4
Date: 2026-05-20
xenia-rs HEAD: `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` (UNCHANGED from S1/S2/S3)
`git diff HEAD | sha256sum`: `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357`
(UNCHANGED at start AND end of S4)
No ours source modifications. No canary instrumentation added.
Canary `audit_61_branch_probe_pcs` cvar used (pre-existing from S1).
Canary cache restored from `/tmp/canary-cache-bak-audit-068`.
## Headline (HIGH confidence — direct per-iteration measurement)
S3's "producer-loop underrun" framing pointed in the right direction
but mis-located the divergence. **Neither engine ever takes the
exit-branch in `sub_82450A68` (PC=0x82450B50, the LR=epilog path)**.
Both engines' dispatch threads stay in the loop indefinitely (no
deadlock; just waiting).
The actual divergence is in the **return value of the
`NtWaitForMultipleObjectsEx` call at PC=0x82450B44**:
- **Ours: r3 = 0x00000001 in 91/91 captures (100%)** — semaphore acquired.
- **Canary: r3 = 0x00000102 in 3/4 captures (75%)** — WAIT_TIMEOUT.
The two handles being waited on are:
- **handle[0] = NtCreateEvent** at `[r31+88]` — the STOP event (signal → exit).
- **handle[1] = NtCreateSemaphore(InitialCount=0, MaximumCount=0x7FFFFFFF)**
at `[r31+92]` — the WORK semaphore (signal → process work).
Both created by `sub_8244FF50` (spawn helper) BEFORE `ExCreateThread`.
mem-watch confirms handle slots in ours: `0x104C` (event) / `0x1050`
(semaphore) at run-1; absolute IDs drift across runs but the slot
layout is invariant.
This is **NOT an exit-branch divergence, NOT loop-underrun in the
literal sense — it is a SEMAPHORE-STATE divergence**. In ours the
work-semaphore count is non-zero at every wait entry (so the wait
always returns immediately with success); in canary the count is zero
at most wait entries (so the wait times out per the 16ms relative
timeout).
## Method (READ-ONLY, no source mod)
1. Disassembled `sub_82450A68` body (80 instructions) via
`xenia-rs disasm --at 0x82450A68 -n 200`. Saved to
`s4/sub_82450A68-disasm.txt`.
2. Identified loop topology: prolog → wait-#1 → body (with inner search
over 5-slot table at [r31+112..212]) → dispatch (bl 0x82450B68 →
γ-signaler family) → re-wait → back-edge OR exit.
3. Ran ours-cold with `--branch-probe=` on 14 BB-entry PCs covering all
loop-body paths. Captured 696 records over ~80s wallclock /
91 loop iterations.
4. Ran canary-cold (cache wiped → restored from
`/tmp/canary-cache-bak-audit-068`) with same `audit_61_branch_probe_pcs`
cvar set. Canary process faulted in vkd3d-proton at ~10s wallclock;
captured 35 records / 4 loop iterations. Sufficient to surface the
r3 distribution.
5. Used `--mem-watch=0x828F3BC0,0x828F3BC4` to identify which ours
handle IDs live in slots `[r31+88]` and `[r31+92]`. Then
disassembled `sub_8244FF50` to confirm event-vs-semaphore types via
the import jumps (`NtCreateEvent` at 0x824A9F18, `NtCreateSemaphore`
at 0x824AB0C0).
6. Cross-checked ours's kernel handlers (`nt_wait_for_multiple_objects_ex`,
`do_wait_multiple`, `handle_consume`, `nt_release_semaphore`,
`try_release_semaphore`, `wake_eligible_waiters`) — code looks
correct in isolation; the divergence is NOT in these handlers
directly.
## Per-PC iteration counts
| PC | path | ours fires | canary fires | note |
|---|---|---:|---:|---|
| 0x82450AA4 | first-iter entry | 1 | 1 | both entered once |
| 0x82450AAC | back-edge target | 91 | 4 | canary crashed early |
| 0x82450AC0 | flag@212==0 → r4=5 | 2 | 0 | rare path |
| 0x82450AC8 | flag@212!=0 → search | 90 | 4 | dominant |
| 0x82450AE4 | inner-search continue | 72 | 17 | |
| 0x82450AF4 | search-exhausted | 8 | 3 | no candidate found |
| 0x82450AF8 | candidate-found | 82 | 1 | |
| 0x82450B04 | budget skip | 81 | 0 | |
| 0x82450B10 | budget refresh | 8 | 0 | |
| 0x82450B28 | dispatch entry | 74 | 1 | bl 0x82450B68 |
| 0x82450B34 | re-wait entry | 92 | 4 | |
| **0x82450B50** | **EXIT (epilog)** | **0** | **0** | **never reached** |
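
Raw fire counts are not directly comparable because canary only survived 4 loop iterations against ours's 91. A minimal sketch of the normalization, using a few rows from the table above (the function name and the row selection are illustrative, not part of the audit tooling):

```python
# Fire counts copied from the per-PC table above: pc -> (ours, canary).
FIRES = {
    0x82450AE4: (72, 17),  # inner-search continue
    0x82450AF4: (8, 3),    # search-exhausted
    0x82450AF8: (82, 1),   # candidate-found
    0x82450B28: (74, 1),   # dispatch entry
}
OURS_ITERS, CANARY_ITERS = 91, 4  # back-edge hits at 0x82450AAC


def per_iteration(fires, ours_iters, canary_iters):
    """Return pc -> (ours_rate, canary_rate), each in fires per loop iteration."""
    return {
        pc: (ours / ours_iters, canary / canary_iters)
        for pc, (ours, canary) in fires.items()
    }


if __name__ == "__main__":
    for pc, (o, c) in per_iteration(FIRES, OURS_ITERS, CANARY_ITERS).items():
        print(f"0x{pc:08X}  ours={o:.2f}/iter  canary={c:.2f}/iter")
```

With only 4 canary iterations the canary column is indicative at best, but the per-iteration view shows the shape of the difference: ours finds a candidate in roughly 0.90 of iterations, canary in 0.25.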
## r3 at back-edge (the divergence signal)
| r3 at back-edge | ours | canary |
|---|---|---|
| r3=0x1 | 91/91 (100%) | 1/4 (25%) |
| r3=0x102 (TIMEOUT) | 0/91 (0%) | 3/4 (75%) |
| r3=0x0 (handle[0] signaled) | 0/91 | 0/4 |
| r3=other | 0/91 | 0/4 |
This is the **per-iteration measurement** the user's framing predicted.
The matching iterations show different r3 values at the SAME PC. The
"load feeding the predicate" is, however, NOT a guest-memory load — it
is the kernel-side return of `NtWaitForMultipleObjectsEx`. The
divergent KERNEL STATE is the work-semaphore count.
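
The distribution above can be reproduced from a per-iteration list of r3 values. This is a sketch that assumes the r3 values have already been pulled out of the branch-probe records; the record format itself is not modeled here, and the labels are illustrative.

```python
from collections import Counter

WAIT_STOP_EVENT = 0x0    # WAIT_OBJECT_0: handle[0], the stop event, signaled
WAIT_SEMAPHORE = 0x1     # WAIT_OBJECT_0+1: handle[1], the work semaphore, acquired
WAIT_TIMEOUT = 0x102     # 16 ms elapsed with no signal

LABELS = {
    WAIT_STOP_EVENT: "stop-event",
    WAIT_SEMAPHORE: "semaphore",
    WAIT_TIMEOUT: "timeout",
}


def classify(r3):
    """Map a wait return value to a label; anything unexpected becomes 'other'."""
    return LABELS.get(r3, "other")


def tally(r3_values):
    """Return label -> (count, fraction) over the observed r3 values."""
    counts = Counter(classify(v) for v in r3_values)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


if __name__ == "__main__":
    ours = [0x1] * 91                       # all 91 back-edges saw r3=1
    canary = [0x1, 0x102, 0x102, 0x102]     # iteration 0 had work, 1-3 timed out
    print("ours  ", tally(ours))
    print("canary", tally(canary))
```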
## Wait wrapper chain (disasm-derived)
```
sub_824AB240:
li r7, 0 ; alertable = 0
b 0x824AB190 ; tail-jump
sub_824AB190(r3=numObj, r4=&handles, r5=WaitMode, r6=Timeout(=16 ms), r7=Alertable):
...
bl 0x824ACA88 ; converts r4=16 ms → LARGE_INTEGER -160000 (relative 100-ns ticks)
...
bl 0x8284E08C ; NtWaitForMultipleObjectsEx (ord 254, import @ VA 0x8284E08C)
; returns NTSTATUS in r3:
; 0 = WAIT_OBJECT_0 = handle[0] (stop event) signaled
; 1 = WAIT_OBJECT_0+1 = handle[1] (work semaphore) acquired (atomically decrements count by 1)
; 0x102 = WAIT_TIMEOUT = 16 ms elapsed with no signal
```
`sub_82450A68` branches on this:
- `cmplwi cr6, r3, 0; beq cr6, 0x82450B50` → r3 == 0 → EXIT (stop event signaled)
- `cmplwi cr6, r3, 0; bne cr6, 0x82450AAC` → r3 != 0 (including 0x102) → CONTINUE
  - r3 == 1 → at least one work item is available → run the inner table search
  - r3 == 0x102 → just a 16 ms timer wake; the inner search will likely find no
    candidate and the loop just re-waits
In canary's brief 4-iteration captured window, only iteration-0 had real
work (`r3=1`); iterations 1-3 were timer-wakes (`r3=0x102`). In ours's
91-iteration window, all back-edges saw `r3=1`: someone has released
the semaphore at least once between each consume.
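
For reference, the 16 ms → -160000 conversion noted in the wrapper chain is plain unit arithmetic: one tick is 100 ns, and a negative value marks the interval as relative. A small sketch (helper names are illustrative; this is not the guest routine at 0x824ACA88):

```python
TICKS_PER_MS = 10_000  # one tick is 100 ns, so 1 ms = 10,000 ticks


def relative_timeout_ticks(ms):
    """Encode a relative timeout: negative count of 100-ns ticks."""
    return -(ms * TICKS_PER_MS)


def ticks_to_ms(ticks):
    """Decode a relative timeout back to milliseconds."""
    if ticks >= 0:
        raise ValueError("non-negative values are absolute times, not relative intervals")
    return -ticks / TICKS_PER_MS


if __name__ == "__main__":
    print(relative_timeout_ticks(16))   # the loop's wait timeout
    print(ticks_to_ms(-160000))
```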
## Handle slot identification (HIGH confidence)
Via `--mem-watch=0x828F3BC0,0x828F3BC4`:
```
MEM-WATCH addr=0x828f3bc0 old=0x00000000 new=0x0000104c
store_addr=0x828f3bc0 store_len=4 tid=1 pc=0x8244ffb0 lr=0x8244ffb0
MEM-WATCH addr=0x828f3bc4 old=0x00000000 new=0x00001050
store_addr=0x828f3bc4 store_len=4 tid=1 pc=0x8244ffcc lr=0x8244ffcc
```
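
The two records above carry everything needed to map slot to handle to creating call. A sketch of a parser for exactly this record shape (field names are taken from the output above; the struct base is derived from the watched address being slot +88, and the "store follows the `bl`" step relies on the static disasm of the writer PCs):

```python
import re

# Two records copied from the mem-watch output above.
SAMPLE = """\
MEM-WATCH addr=0x828f3bc0 old=0x00000000 new=0x0000104c
store_addr=0x828f3bc0 store_len=4 tid=1 pc=0x8244ffb0 lr=0x8244ffb0
MEM-WATCH addr=0x828f3bc4 old=0x00000000 new=0x00001050
store_addr=0x828f3bc4 store_len=4 tid=1 pc=0x8244ffcc lr=0x8244ffcc
"""

FIELD = re.compile(r"(\w+)=(0x[0-9a-fA-F]+|\d+)")
STRUCT_BASE = 0x828F3BC0 - 88  # the watched address 0x828F3BC0 is slot +88


def parse_mem_watch(text):
    """Split on the MEM-WATCH tag and turn each record's key=value pairs into ints."""
    records = []
    for chunk in text.split("MEM-WATCH")[1:]:
        fields = {}
        for key, value in FIELD.findall(chunk):
            fields[key] = int(value, 16) if value.startswith("0x") else int(value)
        records.append(fields)
    return records


if __name__ == "__main__":
    for rec in parse_mem_watch(SAMPLE):
        offset = rec["addr"] - STRUCT_BASE
        call_pc = rec["pc"] - 4  # the store sits one instruction after the bl
        print(f"slot +{offset}: handle 0x{rec['new']:04X}, bl at 0x{call_pc:08X}")
```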
Static disasm of writer PCs:
```
0x8244FFAC: bl 0x824A9F18 ; NtCreateEvent wrapper
0x8244FFB0: stw r3, 88(r30) ; handle[0] = event = ours 0x104C
0x8244FFC8: bl 0x824AB0C0 ; NtCreateSemaphore wrapper (r4=0=Initial, r5=0x7FFFFFFF=Max)
0x8244FFCC: stw r3, 92(r30) ; handle[1] = semaphore = ours 0x1050
```
The semaphore is created with **InitialCount=0**. So if no one ever
calls `NtReleaseSemaphore` on it, the wait will only ever return
`STATUS_TIMEOUT`. Canary's behavior (mostly 0x102, occasionally 0x1)
matches this: producers release the semaphore less often than once per
16 ms wait window (one acquire across the four captured waits).
Ours's behavior (always 0x1) means **producers release the semaphore
FASTER THAN the consumer drains it**.
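
A toy model of how the release rate alone shapes the r3 distribution. It assumes a single producer releasing one token at a fixed interval and a consumer whose work costs zero time; both intervals below are hypothetical, not measured values from either engine.

```python
WAIT_SEMAPHORE = 0x1
WAIT_TIMEOUT = 0x102


def simulate(release_interval_ms, iterations, timeout_ms=16, work_ms=0):
    """Return the wait status seen at each loop iteration.

    The producer releases token k at time k * release_interval_ms (k = 1, 2, ...).
    The consumer waits up to timeout_ms for the next unconsumed token.
    """
    now = 0
    consumed = 0
    statuses = []
    for _ in range(iterations):
        deadline = now + timeout_ms
        next_token_at = (consumed + 1) * release_interval_ms
        acquire_at = max(now, next_token_at)  # immediate if the token already exists
        if acquire_at <= deadline:
            statuses.append(WAIT_SEMAPHORE)
            consumed += 1
            now = acquire_at + work_ms
        else:
            statuses.append(WAIT_TIMEOUT)
            now = deadline
    return statuses


if __name__ == "__main__":
    fast = simulate(release_interval_ms=4, iterations=91)   # hypothetical fast producer
    slow = simulate(release_interval_ms=64, iterations=8)   # hypothetical slow producer
    print("fast:", fast.count(WAIT_SEMAPHORE), "acquired,", fast.count(WAIT_TIMEOUT), "timeouts")
    print("slow:", slow.count(WAIT_SEMAPHORE), "acquired,", slow.count(WAIT_TIMEOUT), "timeouts")
```

Any release interval at or below the 16 ms timeout yields r3=1 on every iteration in this model, so an always-1 trace says little about how much faster the producer is, only that it is fast enough.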
## NtReleaseSemaphore call graph (xrefs to wrapper sub_824AB158)
Wrapper sub_824AB158 calls NtReleaseSemaphore (ord 243, import @
VA 0x8284E07C). Called from 22 sites across 18 functions:
```
0x822c6770 fn=0x822c6748
0x822c6848 fn=0x822c6808
0x822c95c4 .. 0x822c9718 fn=0x822c8b50 (×6 inline call sites)
0x822f23e8 fn=0x822f2328
0x823dd7f8 fn=0x823dd770
0x823dda3c fn=0x823dd838
0x823df008..1b4 fn=0x823de4b8 (×3)
0x823df604 fn=0x823df320
0x82450310 fn=0x82450218 ← dispatcher-module enqueuer (callers: sub_82452DC0 ×2)
0x824504c4 fn=0x824503A0 ← dispatcher-module enqueuer (callers: sub_82452690, sub_8245E1D8)
0x82450cdc fn=0x82450b68 ← THE DISPATCH FUNCTION itself (self-release)
0x82450d28 fn=0x82450b68 ← THE DISPATCH FUNCTION itself (self-release)
0x82456b48 fn=0x824569c0 (jump form)
0x82458020 fn=0x82457fe0
0x824584c8 fn=0x82458468
0x82459424 fn=0x824591c0
0x8245ab6c fn=0x8245aaf0
0x8245ac6c fn=0x8245abd8
0x8245ade0 fn=0x8245ad00
```
**Critical observation**: the dispatch function `sub_82450B68`
contains TWO release sites (at offsets 0xCDC, 0xD28). Each successful
dispatch run can release the semaphore again. If both branches release
+1 token, and the wait consumes only -1 per iteration, the count would
drift up. This is consistent with the "ours over-released" hypothesis.
Some sub_82450B68 branches release the semaphore via `lwz r3, 92(r27)`
which is `handle[1]` of the dispatcher itself. So the dispatch function
re-fills its own pipe.
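
The drift argument can be made concrete with a token-balance sketch. It assumes one seed release starts the loop and that every successful wait leads to one dispatch that self-releases a fixed number of tokens; real dispatches will vary per branch.

```python
def count_at_wait_entry(iterations, releases_per_dispatch, seed_tokens=1):
    """Return the semaphore count observed at each wait entry.

    seed_tokens models an external release that starts the loop (assumption).
    A wait with count == 0 times out: nothing is consumed and no dispatch runs.
    """
    count = seed_tokens
    trace = []
    for _ in range(iterations):
        trace.append(count)
        if count == 0:
            continue                      # timeout path, no self-release
        count -= 1                        # successful wait consumes one token
        count += releases_per_dispatch    # dispatch tail re-fills the semaphore
    return trace


if __name__ == "__main__":
    for releases in (0, 1, 2):
        print(releases, count_at_wait_entry(5, releases))
```

In this model a single self-release per dispatch is already enough to keep the count non-zero at every wait entry; two per dispatch makes it grow without bound (up to MaximumCount).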
## Hypothesis (MEDIUM-HIGH confidence)
The semaphore is being over-released in ours due to a divergent
**dispatch-loop control flow inside `sub_82450B68`** that
differentially decides whether to fire the self-release. Either:
(a) ours takes a sub_82450B68 branch that releases when canary's doesn't
(this is the dual of S3's question: which sub-branches differ?), OR
(b) ours's parse_timeout divides the relative-timeout magnitude by 100
(exports.rs:4495, `magnitude.max(1) / 100`), turning the 16 ms wall-clock
timeout (160,000 ticks of 100 ns) into 1,600 emulator ticks. This may
differentially interact with how often the semaphore gets a release
between wait entries.
The exit-branch-at-matching-iteration framing from the user's task spec
does NOT apply here: there IS no exit-branch divergence (both never
exit). The divergence is in the wait return value, which has no
proximate guest-memory load. The "load feeding the predicate" is a
kernel-state read (the semaphore count) performed inside the kernel
import handler itself.
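
The arithmetic behind theory (b), as a sketch. It mirrors only the quoted expression `magnitude.max(1) / 100` and assumes integer division on the absolute value of the relative timeout; the rest of parse_timeout is not modeled.

```python
def scaled_timeout(raw_ticks):
    """Apply max(magnitude, 1) // 100 to a relative (negative) 100-ns tick count."""
    magnitude = abs(raw_ticks)
    return max(magnitude, 1) // 100


if __name__ == "__main__":
    print(scaled_timeout(-160000))  # the 16 ms wait from sub_82450A68
    print(scaled_timeout(-50))      # any magnitude below 100 ticks collapses to 0
```

Under that assumption the `.max(1)` guard does not prevent a zero result, since the division happens after the clamp.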
## Most-recent kernel calls (tid=5 in ours, from S3 lr-trace data + S4 cross-check)
Most-recent kernel calls before each wait at PC=0x82450B44 (re-wait
site), on ours tid=5:
- `NtReleaseSemaphore(handle=0x1050, count=1)` via wrapper
sub_824AB158, lr=0x82450CDC OR lr=0x82450D28 (both inside sub_82450B68
dispatch body) — self-release in the dispatch tail.
- `KeSetEvent(handle=0x10xx)` via wrapper sub_824AA2F0 OR sub_824AAF50 —
γ-signaler family fires (the audit's original signaler PCs from S1/S3).
- `KeQueryPerformanceCounter`-like via sub_824AA830 — used in budget
refresh path.
In **canary**, the equivalent sequence per S1's signal-probe-correlated.log
(180s window) is similar (γ-signalers fire 492× on tid=10), but the
SELF-RELEASE rate matters more — that determines whether the consumer
keeps seeing a non-zero semaphore.
## S5 recommendation (refined)
The right next step is **NOT** to walk further upstream in the
γ-signaler chain (S3's lead). It is to **measure the per-branch flow
inside `sub_82450B68` itself** — find which of its many branches
release the semaphore and how that branch is selected.
### Path A (RECOMMENDED, ~0 LOC, read-only)
`--branch-probe` covering `sub_82450B68` body (PCs 0x82450B68 ..
0x82451238, the dispatch body). Want to capture:
1. Frequency at the two release sites `0x82450CDC` and `0x82450D28`
(per-call cumulative count on tid=5).
2. Frequency at the OTHER exit sites in sub_82450B68 (e.g. the early
return at `0x82450EE8` which does NOT release).
If ours's release-rate at CDC/D28 is significantly higher than canary's,
that confirms (a). If similar, then (b) becomes the next theory.
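
One way to turn the Path A counts into a decision, sketched under assumptions: hit counts are normalized per loop iteration (canary runs are short), and "significantly higher" is read as a ratio threshold of 2.0, which is an arbitrary placeholder. The counts in the example are hypothetical.

```python
RELEASE_SITES = (0x82450CDC, 0x82450D28)


def release_rate(hits_by_pc, loop_iterations):
    """Self-releases per loop iteration, summed over both release sites."""
    if loop_iterations <= 0:
        raise ValueError("need at least one loop iteration")
    return sum(hits_by_pc.get(pc, 0) for pc in RELEASE_SITES) / loop_iterations


def next_theory(ours_rate, canary_rate, ratio_threshold=2.0):
    """Return 'a' if ours self-releases markedly more often than canary, else 'b'."""
    if ours_rate > canary_rate * ratio_threshold:
        return "a"
    return "b"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    ours = release_rate({0x82450CDC: 60, 0x82450D28: 31}, loop_iterations=91)
    canary = release_rate({0x82450CDC: 1}, loop_iterations=4)
    print(ours, canary, next_theory(ours, canary))
```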
### Path B (~80 LOC ours-side probe, requires source instrumentation)
Confirm whether the magnitude/100 scaling in
`xenia_kernel::exports::parse_timeout` actually causes the divergence.
`--branch-probe` cannot reach it: parse_timeout is Rust host code, not
guest code, so this path needs source instrumentation. Mid-priority.
### Path C (~30 LOC canary diagnostic)
Add canary cvar `audit_69_semaphore_count_probe = VA` that emits the
post-Set count for the semaphore at native VA matching ours's
[r31+92]'s underlying X_KSEMAPHORE. Compare per-iteration count
progression canary-vs-ours.
LOC budget for S5: Path A = 0, Path B = ~80, Path C = ~30.
**Path A first** — narrows the divergence to specific sub_82450B68
sub-branch behavior at zero LOC cost.
## Cascade
- **A** (disasm sub_82450A68): PASS (HIGH) — 80-instruction body,
3 BB-paths, 12 BB-entries identified.
- **B** (ours per-iteration loop-branch trace): PASS (HIGH) —
91 back-edge captures, all r3=0x1.
- **C** (canary same trace): PARTIAL (MEDIUM) — canary crashed at
4 iterations in vkd3d-proton on exit; 4 captures sufficient to surface
r3=0x102 dominance, but not a long-window comparison.
- **D** (identify divergent load): PARTIAL (MEDIUM) — no guest-memory
load is the proximate cause; the divergence is in the kernel-side
semaphore-count state. The "load" is conceptually inside
`do_wait_multiple`'s read of `KernelObject::Semaphore.count`.
Net 2/4 PASS-HIGH, 2/4 PARTIAL-MEDIUM. Methodology learned: when both
engines stay in a loop, "which branch did ours take differently" is the
WRONG question — ask "what's different at the SAME branch."
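
The "same branch, different value" comparison reduces to aligning the two engines' per-iteration observations and reporting the first mismatch. A minimal sketch, with the r3 sequences from this session as input (the function name is illustrative):

```python
def first_divergence(ours, canary):
    """Return (iteration, ours_value, canary_value) at the first mismatch, or None.

    Only the overlapping prefix is compared, so a short canary run still works.
    """
    for i, (a, b) in enumerate(zip(ours, canary)):
        if a != b:
            return i, a, b
    return None


if __name__ == "__main__":
    ours_r3 = [0x1] * 91
    canary_r3 = [0x1, 0x102, 0x102, 0x102]
    print(first_divergence(ours_r3, canary_r3))
```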
## Confidence flags (summary)
| finding | confidence |
|---|---|
| Both engines never take exit-branch (B50) | HIGH |
| ours back-edge r3=1 always (91/91) | HIGH |
| canary back-edge r3=0x102 mostly (3/4) | HIGH |
| handle[1] is NtCreateSemaphore w/ InitialCount=0 | HIGH |
| handle[0] is NtCreateEvent | HIGH |
| Divergence is kernel-side semaphore-count state | MEDIUM-HIGH |
| sub_82450B68 self-release over-fires in ours | MEDIUM |
| parse_timeout /100 scaling is contributing | LOW-MEDIUM |
## Discipline
- xenia-rs HEAD `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` UNCHANGED
(sha256 of `git diff HEAD` matches S1/S2/S3 end at session start AND end).
- READ-ONLY ours. No source mod. `--branch-probe` / `--lr-trace` /
`--mem-watch` / `--trace-handles-focus` are runtime read-only flags
documented as "lockstep digest unaffected" (state.rs comments).
- Canary `audit_61_branch_probe_pcs` cvar enabled with our PC set; set
back to "" at session end. Verified.
- Canary `mute = true` set during run, restored to `false` at session end.
- Canary cache wiped before cold canary run, restored from
`/tmp/canary-cache-bak-audit-068` at session end.
## Artifacts
```
audit-runs/audit-069-wait-signal-producer/s4/
sub_82450A68-disasm.txt (80 ins disasm: sub_82450A68 entry + body)
ours-loop-branch-trace.stdout (696 BRANCH-PROBE records, ours-cold)
ours-loop-branch-trace.stderr (empty under --quiet)
canary-loop-branch-trace.stdout (1074 lines, 35 AUDIT-061-BR records)
canary-loop-branch-trace.stderr (89 lines, wine/vkd3d setup + final fault)
ours-mem-watch.stderr (2 MEM-WATCH records identifying handle slots)
ours-mem-watch.stdout (empty)
ours-signaler.jsonl (95 lr-trace records on wrapper PCs)
ours-handles.{stdout,stderr} (probe for handle dump; --halt-on-deadlock didn't trigger)
ours-trace-handles-summary.log (21 lines: focus startup + 8 ExCreateThread spawns)
divergence-analysis.md (per-iter table, hypothesis, S5 leads)
writer-report-v4.md (this file)
```
No canary instrumentation diff this session. No `fix-canary-s4.diff`.
## Summary of S1 → S2 → S3 → S4 arc
- **S1** (2026-05-20 AM): identified canary tid=10 as the signaler;
claimed ours lacks this thread (FALSIFIED by S2).
- **S2** (2026-05-20 noon): spawn-chain runs identically on ours tid=5;
refined to "wrong-handle selection" downstream (REFINED by S3).
- **S3** (2026-05-20 PM): ours runs identical PC/LR chain but with
~5× fewer iterations. Producer-loop underrun classification.
Wedge handle never even created in ours's truncated boot.
- **S4** (2026-05-20 evening): per-iteration branch-probe shows
**NEITHER engine ever exits the loop**. Divergence is in
`NtWaitForMultipleObjectsEx` return: ours r3=1 always (semaphore
acquired), canary r3=0x102 mostly (timeout). Root cause is
**semaphore-count state divergence** — ours's work-semaphore is
over-released relative to consume rate, OR ours's timeout never
fires before signal. Hypothesis: divergence inside `sub_82450B68`
dispatch body's self-release logic.
The S5 question is no longer "which earlier kernel call differs". It is
"which sub-branch of `sub_82450B68` releases the semaphore in ours but
not in canary?" The tool for it is a read-only branch-probe on the
`sub_82450B68` body PCs.