handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
271
audit-runs/audit-069-wait-signal-producer/writer-report.md
Normal file
@@ -0,0 +1,271 @@
# AUDIT-069 Session 1 — wait-signal producer identification

Date: 2026-05-20
Status: **LANDED — signaler tid + caller fns identified; AUDIT-066 circular framing FALSIFIED**

## Headline

The wait at `sub_821CB030+0x1AC` (PC `0x821CB1DC`) — the canonical
AUDIT-049/065 wedge wait — fires in canary on two tids (worker tid=17 and
cache-loader tid=26). Both wedges are signaled by **tid=10**, a worker
thread spawned EARLY (via `sub_8244FF50` → `ExCreateThread(entry=sub_82450A28)`),
NOT by any of the four workers spawned by `sub_825070F0`. This refutes
AUDIT-066's circular framing ("γ-signaler running inside the 4 workers
spawned by sub_825070F0"): the actual signaler reaches the production
phase WITHOUT depending on sub_825070F0 firing.

## Step 1 — wait site capture (canary)

Probe: `--audit_61_branch_probe_pcs=0x821CB1DC --mute=true`, 180s cold.

| tid | r3 (handle) | r4 (timeout) | r5 (wait_mode) | r6 (ctx) | r31 (stack) | lr |
|----:|------------:|-------------:|---------------:|---------:|------------:|---:|
| 17 | `F80000A4` | `FFFFFFFF` | `0` (auto) | `BC65CEC0` | `7064FA70` | `0x821CB1D0` |
| 26 | `F8000110` | `FFFFFFFF` | `0` (auto) | `BC667F80` | `708FF990` | `0x821CB1D0` |

Two distinct fires (one per logical caller). Both have r4=INFINITE timeout,
matching the dossier. The lr=`0x821CB1D0` is `sub_821CB030+0x1A0`, i.e. the
instruction AFTER the bl-wait. This is consistent with the branch probe firing
at the basic-block entry that follows the wait-call's return.

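
The offset arithmetic in the lr note can be checked mechanically. The following is a minimal sketch using only addresses quoted in this report; the helper name `offset_in` is illustrative, and the 4-byte step assumes fixed-width PowerPC instructions.

```python
# Addresses quoted in the headline and the Step 1 table.
SUB_821CB030 = 0x821CB030   # containing function entry
PROBE_PC = 0x821CB1DC       # branch-probe PC (wedge wait site)
LR = 0x821CB1D0             # link register captured at the probe


def offset_in(base: int, addr: int) -> int:
    """Return addr's byte offset from a function entry (illustrative helper)."""
    return addr - base


print(hex(offset_in(SUB_821CB030, PROBE_PC)))  # 0x1ac
print(hex(offset_in(SUB_821CB030, LR)))        # 0x1a0
# lr is the return address, so the bl itself sits one 4-byte instruction earlier.
print(hex(LR - 4))                             # 0x821cb1cc
```
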
Handle drift across cold runs is real: Step 1 vs Step 3 vs Step 4 trajectories
produced wait handles `{F80000A0,F8000108}` / `{F80000A0,F8000108}` /
`{F80000A4,F8000110}`. Per-run handles are still deterministic; the absolute
ID is not.

**Important framing correction**: The brief expected "~16 fires" per
AUDIT-065. This was already partly retracted by AUDIT-066 (which observed
that tid=17 "terminates via `ExTerminateThread(0)` WITHOUT ever calling
Wait inside its cache loop"). Step 1 confirms AUDIT-066's correction:
the wait at `+0x1AC` fires ~2× per boot (one for the work-queue load
that ANON_Class_713383D7 work goes through; one for the cache-loader
sister-flow). Not 16. The wait is the WORK-QUEUE wait, not a per-cache-file
IO wait.

Confidence: HIGH (probe fired, r3/r4/r5 match expected wait-call ABI,
two distinct logical fires reproducible across cold runs).

## Step 2 — instrumentation (canary, ~280 LOC additive)

New `audit_69_*` cvars + slowpath module:
- **cpu_flags.{h,cc}** (+23/+48 LOC, of which ~30 LOC are mine vs cumulative):
  - `--audit_69_event_signal_watch` (CSV of guest handle IDs, max 4)
  - `--audit_69_event_signal_native_ptr` (CSV of guest VAs, max 4)
  - `--audit_69_log_all_sets` (bool — log EVERY XEvent::Set/Pulse fire)
- **xenia-kernel/audit_69_event_signal_watch.h** (51 LOC) — fwd decls,
  hot-path inline wrapper (single relaxed atomic load + branch).
- **xenia-kernel/audit_69_event_signal_watch.cc** (193 LOC) — lazy parse +
  UINT32_MAX sentinel + `XThread::TryGetCurrentThread()` for lr/tid capture.
  Mirrors AUDIT-068's static-init gate pattern.
- **xenia-kernel/xevent.cc** (+9 LOC) — hook at `XEvent::Set` and
  `XEvent::Pulse` (the deepest convergence of Ke/Nt set + pulse paths).

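
The canary parser itself is not reproduced in this report. As a rough Python sketch of the behaviour the list describes (a CSV value, at most 4 entries, `UINT32_MAX` marking an unused slot), one might write the following. The function names, the padding of unused slots, and raising on more than 4 entries are all assumptions made for illustration, not the canary implementation.

```python
UINT32_MAX = 0xFFFFFFFF  # sentinel meaning "slot unused"
MAX_WATCH = 4            # the cvars accept at most 4 entries


def parse_watch_list(csv: str) -> list[int]:
    """Parse a CSV of hex handle IDs into exactly MAX_WATCH slots.

    Hypothetical helper: unused slots are padded with UINT32_MAX, and more
    than MAX_WATCH entries is treated as an error (an assumption).
    """
    values = [int(tok, 16) for tok in csv.split(",") if tok.strip()]
    if len(values) > MAX_WATCH:
        raise ValueError(f"at most {MAX_WATCH} entries supported")
    return values + [UINT32_MAX] * (MAX_WATCH - len(values))


def is_watched(slots: list[int], handle: int) -> bool:
    """True if handle matches a populated (non-sentinel) slot."""
    return handle != UINT32_MAX and handle in slots
```

With this shape, an empty cvar yields four sentinel slots, so the watch check rejects every handle without a separate "enabled" flag.
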
Reading-error registration: `XThread::GetCurrentThread()` asserts on host
threads; first iteration used it and crashed. Fixed by switching to
`TryGetCurrentThread()`. (Same lesson as AUDIT-067's bool-vs-pointer
asymmetry but in a different fn.)

Cumulative cross-run canary additions retained in tree (AUDIT-061/067/068/069).

## Step 3 — correlated capture

Run: cold, 180s, `--mute=true --audit_61_branch_probe_pcs=0x821CB1DC,0x824AA2F0,0x824AAF50 --audit_69_log_all_sets=true`.

Volume: 122,165 log lines (Step 3) / 155,627 lines (Step 4 with wrapper probes).

Wait fires (Step 4): 2 (tid=17, tid=26, as in Step 1 but with handle drift to F80000A4/F8000110).

Signals on wedge handles (Step 4):

| wedge handle (waited on) | wait tid | signal fires | signal lr | signaling fn | signal tid |
|---|---|---|---|---|---|
| `0xF80000A4` | 17 | **1** | `0x824AA304` | `sub_824AA2F0` (NtSetEvent wrapper) | **10** |
| `0xF8000110` | 26 | **100** | `0x824AAFC8` | `sub_824AAF50` (a generic event-set-with-arg wrapper) | **10** |

The 100 fires on F8000110 are repeats — auto-reset events fire on first
signal; the rest are no-ops. Volume reflects how often the work-queue
processes items targeting this synchronizer.

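
The signal table in this step is essentially a filter-and-count over the set-event log. The sketch below shows that reduction in Python. The `(handle, lr, tid)` record layout is hypothetical, since the actual log-line format is not shown here, and the sample records are synthetic, sized only to match the counts reported above.

```python
from collections import Counter

# Wedge handles observed at the wait site in Step 4.
WEDGE_HANDLES = {0xF80000A4, 0xF8000110}


def tally_signals(events):
    """Count set events per (handle, lr, tid), keeping only wedge handles."""
    return Counter(e for e in events if e[0] in WEDGE_HANDLES)


# Synthetic (handle, lr, tid) records, not real log lines.
events = (
    [(0xF80000A4, 0x824AA304, 10)]
    + [(0xF8000110, 0x824AAFC8, 10)] * 100
    + [(0xF8000200, 0x824AA304, 6)] * 3   # made-up unrelated handle, filtered out
)

for (handle, lr, tid), n in sorted(tally_signals(events).items()):
    print(f"{handle:08X} lr={lr:08X} tid={tid} fires={n}")
```
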
## Step 4 — signaler-fn resolution (sylpheed.db cross-check)

Wrapper-entry probe data for these two event-set wrappers, filtered to tid=10:

| wrapper | lr-of-caller | caller fn | tid=10 fire count |
|---|---|---|---|
| `sub_824AA2F0` (NtSetEvent wrapper) | `0x8245DA44` | **`sub_8245D9D8`** (γ-signaler D-A per AUDIT-062) | 23 |
| `sub_824AA2F0` (NtSetEvent wrapper) | `0x8245DB08` | **`sub_8245DA78`** (γ-signaler D-B per AUDIT-062) | 8 |
| `sub_824AAF50` (Ke-style wrapper) | `0x8245DC5C` | **`sub_8245DB40`** (NEW — not previously named) | 461 |

The `sub_824AAF50` disassembly needs follow-up, but lr=`0x824AAFC8`
(= `sub_824AAF50+0x78`) is consistent with a `bl xeKeSetEvent` followed by
a status check in an N-arg helper. The wrapper takes `(handle, ptr, size)`,
and the internally-signaled event has a different handle from the input.

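
Resolving an lr to its containing function is a "greatest start address not above lr" lookup. A minimal Python sketch follows, using only the function entry points named in this report. A real lookup would need the full function table from `sylpheed.db`; treating this partial list as complete is an assumption made to keep the example short.

```python
from bisect import bisect_right

# Function entry points named in this report (sorted ascending).
FUNCTION_STARTS = [0x8245D9D8, 0x8245DA78, 0x8245DB40, 0x824AA2F0, 0x824AAF50]


def containing_fn(addr: int) -> int:
    """Return the greatest known function start that is <= addr."""
    i = bisect_right(FUNCTION_STARTS, addr)
    if i == 0:
        raise LookupError(f"{addr:#x} precedes every known function")
    return FUNCTION_STARTS[i - 1]


for lr in (0x8245DA44, 0x8245DB08, 0x8245DC5C, 0x824AA304, 0x824AAFC8):
    fn = containing_fn(lr)
    print(f"lr={lr:#010x} -> sub_{fn:08X}+{lr - fn:#x}")
```
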
Containing-fn cross-check (`sylpheed.db`):
- `sub_8245D9D8` and `sub_8245DA78` are in the worker cluster
  (0x82450000-0x8245C000). Per AUDIT-062: both are γ-signaler-D family,
  hot from worker-side, missed by AUDIT-059/060 enumeration.
- `sub_8245DB40` is in the same cluster; callers are `sub_824528A8+0x54`
  and `sub_8245EE50+0x20` (both worker-cluster internal).
- All three are reached from tid=10's body fn `sub_82450A68`, the
  trampoline body for the entry `sub_82450A28` (which `ExCreateThread`
  registers via `sub_8244FF50`).

**tid=10 caller chain (canary)**:
```
sub_8244FEA8 (caller of sub_8244FF50; itself called from 11 sites)
  → sub_8244FF50 (spawner — calls ExCreateThread w/ entry=sub_82450A28)
      → sub_82450A28 (thread-entry trampoline:
                      KeSetThreadPriority(-2, 3); bl sub_82450A68)
          → sub_82450A68 (worker dispatch loop)
              → ... γ-signalers D9D8 / DA78 / DB40
```

`sub_82450A28` is referenced as a data pointer at `0x8244FFF8` (inside
`sub_8244FF50`). No call edges to it — it's purely a thread-entry data
constant passed to ExCreateThread.

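
Because the thread entry is only a data reference, a reachability walk over call edges alone stops at the spawner. The toy Python sketch below illustrates this with the edges from the caller chain above; the graph is deliberately reduced to those few nodes and is not a dump of `sylpheed.db`.

```python
from collections import deque

# Direct call edges taken from the caller chain above.
CALL_EDGES = {
    "sub_8244FEA8": ["sub_8244FF50"],
    "sub_8244FF50": ["ExCreateThread"],
    "sub_82450A28": ["KeSetThreadPriority", "sub_82450A68"],
}
# Data reference: sub_8244FF50 passes sub_82450A28 as the thread entry.
DATA_REFS = {"sub_8244FF50": ["sub_82450A28"]}


def reachable(root, *edge_maps):
    """Breadth-first reachability over the union of the given edge maps."""
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for edges in edge_maps:
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen


print("sub_82450A68" in reachable("sub_8244FEA8", CALL_EDGES))             # False
print("sub_82450A68" in reachable("sub_8244FEA8", CALL_EDGES, DATA_REFS))  # True
```
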
## Step 5 — ours cross-reference

All identified signaler fns (`sub_8245D9D8`, `sub_8245DA78`, `sub_8245DB40`,
`sub_824AA2F0`, `sub_824AAF50`, `sub_82450A28`, `sub_8244FF50`) are GAME
(XEX) code — not kernel-imports. In ours these execute under the JIT, with
no host-side analog to compare. The relevant question is whether the
trajectory in ours REACHES these PCs.

Direct evidence from prior runs:

**AUDIT-062 ours `--lr-trace=0x824aa2f0`** trace (`ours-ntset.jsonl`, 136
fires across cold boot up to deadlock):
- tid=6: 82 NtSet fires
- tid=1: 28 fires
- tid=5: 22 fires
- tid=8: 2 fires
- tid=13: 2 fires
- **tid=10: 0 fires**

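
A per-tid tally like the one above is a one-pass count over the JSONL trace. The Python sketch below assumes each record carries a `tid` key, which is a guess at the trace schema since the file layout is not shown here. The input is a synthetic stand-in sized to the listed counts, not the real `ours-ntset.jsonl`.

```python
import json
from collections import Counter


def fires_per_tid(lines):
    """Count trace records per thread id; assumes each record has a 'tid' key."""
    return Counter(json.loads(line)["tid"] for line in lines if line.strip())


# Synthetic stand-in for ours-ntset.jsonl, sized to the counts listed above.
counts = {6: 82, 1: 28, 5: 22, 8: 2, 13: 2}
lines = [json.dumps({"tid": tid}) for tid, n in counts.items() for _ in range(n)]

tally = fires_per_tid(lines)
print(sum(tally.values()))  # 136
print(tally[10])            # 0 (Counter returns 0 for a missing key)
```
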
ours NEVER spawns the canary-equivalent of tid=10 (the
`sub_8244FF50/sub_82450A28/sub_82450A68` worker). This is consistent with
AUDIT-057's "thread-gap" finding: ours has fewer threads than canary.

Within ours, the γ-signalers DO fire — but on tid=5 (calling sub_824AA2F0
from lr=`0x8245DA44` = `sub_8245D9D8+0x6C`) per AUDIT-062's
`ours-ntset.jsonl:line 1`. AUDIT-062 already established these signal
WRONG handles in ours (neighbors of `0x12AC` are signaled; the wedge
handle itself is not).

**Conclusion**: ours's signaler PCs exist and run, but on the wrong tids
(no tid=10 equivalent), and target the wrong handles. The PRODUCER →
SIGNALER chain in ours is structurally broken at the **thread-spawn**
layer, not the kernel-import layer.

Confidence (Step 5): MEDIUM-HIGH for the chain identification (data is
internally consistent and matches AUDIT-062's prior independent capture).
LOW on the ours-side resolution mechanism (this audit did not re-run
ours; cross-ref is read-only against prior dumps which may be stale
relative to current ours HEAD `e6d43a23…`).

## AUDIT-066 framing refutation

AUDIT-066 stated:

> the producer-side signal for THAT event comes from a γ-signaler running
> inside the 4 workers spawned by sub_825070F0 — per AUDIT-063's
> static-reachability survey of NtSet wrapper callers.

This is **falsified** by AUDIT-069 Step 3+4 evidence:

1. The signaler runs on **tid=10**, spawned by `sub_8244FF50` via
   `ExCreateThread(entry=sub_82450A28)`. This is NOT one of sub_825070F0's
   4 workers.
2. sub_8244FF50's caller chain does NOT require ANON_Class_713383D7's
   vtable to be installed; it does NOT require sub_825070F0 to fire.
3. The circular-bootstrap concern AUDIT-066 raised ("workers can't signal
   until they spawn; they can't spawn until the wedge clears") was
   structurally correct framing IF the signaler were inside the
   sub_825070F0 4-worker family. Since the actual signaler is tid=10
   (independently spawned), the circle is **broken** — the signaler IS
   reachable without the wedge clearing.

Reading-error class **#37**: static-reachability surveys (AUDIT-063 walked
12 hops from sub_82452DC0 to NtSet wrapper callers) are scoped to a
particular caller chain; they miss alternative producer paths reached via
unrelated thread-spawn sites. Always probe at the runtime SIGNAL site to
confirm which exact caller fired, not just which static path could fire.

## Cascade outcome

- **A** (capture wait site PC + r3=handle in canary): **PASS**. PC
  `0x821CB1DC`, r3 captures the handle on first fire reproducibly.
- **B** (capture signal fires on the wait targets): **PASS**. 1 fire on
  F80000A4 (wedge handle 1), 100 fires on F8000110 (wedge handle 2).
- **C** (resolve signaling fn + immediate caller fn): **PASS**.
  `sub_824AA2F0` ← `sub_8245D9D8` / `sub_8245DA78` (γ-signaler D family);
  `sub_824AAF50` ← `sub_8245DB40` (new). All on tid=10.
- **D** (ours-side cross-ref): **PARTIAL**. tid=10 IS missing in ours
  per existing AUDIT-062 data; γ-signalers DO fire but on wrong tids.
  Did not re-run ours in this session (per task discipline; cross-ref
  read-only against prior dumps).

Net 3/4 PASS, 1/4 PARTIAL.

## Discipline

- xenia-rs HEAD `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` UNCHANGED.
  `git diff HEAD | sha256sum` at session start =
  `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357`
  and at session end IDENTICAL.
- Canary patch is purely additive, cvar-gated default-off, UINT32_MAX
  sentinel + std::once parse pattern (per AUDIT-068 discipline).
- Every canary run used `--mute=true`.
- Cache wiped before each cold run (5 cold runs total: Step 1 90s,
  Step 1 180s rerun, Step 3 with handle watch, Step 3 with log_all_sets,
  Step 4 with wrapper probes). Each cache moved to `/tmp/_audit_069_step*`
  before next cold run.
- Cache restoration from `/tmp/canary-cache-bak-audit-068` deferred to
  session end (done after this report).

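
The "worktree unchanged" check in the first bullet can be scripted so that the start and end digests are compared automatically. This is a minimal Python sketch, assuming `git` is on `PATH`; the function names are illustrative and not part of any existing tooling in this repo.

```python
import hashlib
import subprocess


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of a byte string, matching sha256sum's digest column."""
    return hashlib.sha256(data).hexdigest()


def worktree_diff_digest(repo: str = ".") -> str:
    """Digest of `git diff HEAD` output for the given repository."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "HEAD"],
        check=True,
        capture_output=True,
    ).stdout
    return sha256_hex(out)
```

Calling `worktree_diff_digest()` once at session start and once at session end, then comparing the two strings, reproduces the manual check.
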
## Artifacts

```
xenia-rs/audit-runs/audit-069-wait-signal-producer/
  step1-wait-probe.log (90s baseline; 2 wait fires)
  step1-wait-probe.stdout
  step1-wait-probe-180s.log (180s rerun; 2 wait fires)
  step1-wait-probe-180s.stdout
  step3-signal-probe.log (180s; first signal-watch test;
                          handles drifted, partial correlation)
  step3-signal-probe.stdout
  step3-correlated.log (180s; log_all_sets; 120k signal fires)
  step3-correlated.stdout
  step4-wrapper-callers.log (180s; log_all_sets + wrapper entries;
                             155k events; correlated lr-to-caller)
  step4-wrapper-callers.stdout
  fix-canary.diff (cumulative canary diff vs 6de80dffe)
  writer-report.md (this file)
```

## Session 2 recommendation

Two paths, both <100 LOC ours-side:

**Path 1 (ours read-only probe + targeted root-cause)**: re-run ours with
`--ctor-probe=0x82450A28` (the canary-tid=10 entry) — confirm it never
fires. Then `--ctor-probe=0x8244FF50` (the spawner). If sub_8244FF50 also
never fires, walk up its 11 callers in sylpheed.db — likely one of them
gates on a flag/event that's not set in ours's early-boot trajectory.

**Path 2 (canary additional capture)**: probe canary's tid=10 spawn
sequence in detail. Add `audit_69_thread_spawn_watch` cvar that logs
every ExCreateThread call with (entry_pc, ctx, suspend_flag, caller_lr).
~40 LOC. Compare to ours's spawn list — find which call goes
unfired in ours.

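
The spawn-list comparison in Path 2 reduces to a set difference over thread entry PCs. A small Python sketch follows. The inputs are illustrative only: `0x82450A28` is the tid=10 entry from this report, while the other value is a made-up placeholder rather than a captured entry PC.

```python
def missing_spawns(canary_entries, ours_entries):
    """Entry PCs that canary passes to ExCreateThread but ours never does."""
    return sorted(set(canary_entries) - set(ours_entries))


# Illustrative data only: 0x82450A28 is the tid=10 entry from this report;
# the other value is a placeholder, not a captured entry PC.
canary = [0x82450A28, 0x82000000]
ours = [0x82000000]

print([f"{pc:#010x}" for pc in missing_spawns(canary, ours)])
```
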
Both paths are cheaper than continuing on the wedge directly. Path 1 is
preferred: it stays on the ours side which is the failing engine.

Predicted Session 2 cascade:
- A (find sub_82450A28's first-non-fire ancestor in ours): 75-85%
- B (identify the missing precondition for that ancestor): 50-60%
- C (fix LOC in ours ≤ 50): 30-40%
- D (draws>0): 15-25% (single wedge unlock)