Merge audit-2026-05-fix/p2-session-closeout
This commit is contained in:
@@ -3890,3 +3890,87 @@ plateau persists because:
|
||||
the addic/subfic class (canary semantics are a 64-bit add against
|
||||
guest memory the Mm layer doesn't fully model yet).
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Follow-up session 2026-05-03 — outcome
|
||||
|
||||
Three audit IDs closed across 3 commits, merged to master with `--no-ff`.
|
||||
HEAD: `8668550`. Tests: 556 → 561 (+5 from new wall-clock + ghost-trail tests).
|
||||
|
||||
### Audit IDs landed
|
||||
|
||||
| ID | Commit | Description |
|
||||
|---|---|---|
|
||||
| **GPUBUG-DRAIN-001** | `7a1b6b3` | VdSwap PM4 fallback warning silenced under `--parallel`. New `drain_until_wptr(target, time_budget)` mirrors canary's `WorkerThreadMain` predicate; vd_swap skips PM4 ring injection (unreliable when ring backs up under --parallel) and uses direct `notify_xe_swap`. The slot-0 fetch-constant patch is deferred (GPUBUG-FETCH-PATCH-001). DrainFence handler publishes the digest mirror before reply (was racing the CPU's post-drain digest_snapshot read). |
|
||||
| **KRNBUG-AUDIT-001** | `d1105aa` | Diagnostic instrumentation: `--trace-handles-focus=<LIST>` flag + per-handle DIAGNOSIS report. `record_signal` falls through to ghost-trail capture for focused handles even when no `record_create` exists. Producer-class classification (GuestExport / KernelInternal). Distinguishes "guest never tried" from "signal landed but missed waiter" in one run. |
|
||||
| **KRNBUG-D08** | `27d3608` | V-sync wall-clock under `--parallel`. Lockstep stays on the deterministic instruction-count proxy (sylpheed goldens unchanged). `--parallel` switches to wall-clock via `tick_vsync_wallclock`, raising delivered v-syncs from ~2 → 17 at -n 30M. INTERRUPT_QUEUE_CAP=4 still bottlenecks burst delivery. |
|
||||
|
||||
### Parked-waiter producer-trace finding
|
||||
|
||||
Empirical run at -n 500M lockstep with the new
|
||||
`--trace-handles-focus=0x1004,0x100c,0x15e4,0x42450b5c`:
|
||||
|
||||
```
|
||||
handle=0x00001004 kind=Event/Manual waiters=1 signaled=false
|
||||
signal_attempts=0 (primary=0, ghost=0) waits=1 wakes=0
|
||||
created cycle=0 tid=1 lr=0x824a9f6c src=NtCreateEvent
|
||||
timeline: cycle=0 tid=10 lr=0x824ac578 src=do_wait_single[wait]
|
||||
GuestExport=0 KernelInternal=0 waits=1
|
||||
=> producer is a missing kernel signal source (or BST-paradox upstream)
|
||||
```
|
||||
|
||||
Same shape for 0x100c and 0x15e4. 0x42450b5c shows `<UNCREATED>` +
|
||||
`<AUDIT_BLIND>` (waiter parked via a non-`do_wait_single` path).
|
||||
|
||||
**Conclusion**: hypothesis (A) confirmed for 3 of the 4 handles. The
|
||||
producer code path is genuinely missing — NO Nt/KeSetEvent /
|
||||
KePulseEvent / KeReleaseSemaphore call EVER targets these handles
|
||||
during 500M instructions of execution. The PPC-vs-Rust traversal
|
||||
paradox (BST-bug from `project_xenia_rs_sylpheed_event_chain_2026_04_29`)
|
||||
is **NOT** the cause for these specific handles. The 3 handles share
|
||||
the same creator (lr=0x824a9f6c, tid=1, all at cycle=0) and the same
|
||||
wait-call wrapper (lr=0x824ac578) — likely 3 sibling worker threads
|
||||
all waiting for "work to do" notifications that never come. Most
|
||||
likely producer-class candidates for next session:
|
||||
|
||||
- File I/O completion (`signal_io_completion_event`) — currently a
|
||||
real implementation but possibly never reached; trace `NtReadFile`
|
||||
paths to see if completion events would target these handles.
|
||||
- XAM async task completion — F2/F3 deferred from prior sprint.
|
||||
- Audio buffer-complete — `XAudioRegisterRenderDriverClient` is a
|
||||
one-shot stub.
|
||||
- Timer DPCs — `KeSetTimer` real impl but APC delivery may be
|
||||
routing wrong.
|
||||
|
||||
### Acceptance criteria
|
||||
|
||||
| # | Criterion | Met? |
|
||||
|---|-----------|------|
|
||||
| 1 | Phase 1: zero "PM4_XE_SWAP not consumed" warnings under canonical invocation | ✅ |
|
||||
| 2 | Phase 2: per-handle DIAGNOSIS for all four parked handles | ✅ |
|
||||
| 3 | Phase 3: vsync rate restored under --parallel; n2m golden untouched | ✅ partial — rate up but FIFO cap=4 still bottlenecks |
|
||||
| 4 | cargo test ≥556 | ✅ 561 |
|
||||
| 5 | All work merged to master | ✅ |
|
||||
| 6 | **STRETCH** ≥1 of 4 handles signals | ❌ — but data-driven hypothesis fail-fast tells us why (producer missing, not wake-eligibility bug) |
|
||||
| 7 | **STRETCH** draws > 0 at -n 100M lockstep | ❌ — gating remains parked-waiter handles |
|
||||
|
||||
### Recommended next session
|
||||
|
||||
1. **Producer hunt** for the 3 Event/Manual handles. With the
|
||||
diagnostic baked in, a focused hunt: identify the guest function
|
||||
at `lr=0x824ac578` (the shared wait-call wrapper), walk its
|
||||
callers, find what kernel signal source SHOULD be wired for each
|
||||
handle. Likely starting points: file I/O completion
|
||||
(`signal_io_completion_event`), XamTaskSchedule callback (F2),
|
||||
XAudio buffer-complete.
|
||||
2. **Raise INTERRUPT_QUEUE_CAP** for `--parallel` workloads — the
|
||||
3044 dropped vsyncs at -n 30M --parallel suggest the FIFO is the
|
||||
next bottleneck.
|
||||
3. **F2/F3** (XAM async completion) per the still-deferred list,
|
||||
especially if Phase 2 of next session pinpoints a missing XAM
|
||||
producer.
|
||||
4. **GPUBUG-FETCH-PATCH-001**: re-enable the PM4_TYPE0
|
||||
fetch-constant patch via a side-channel (GpuCommand variant)
|
||||
when draws actually start firing — relevant for bloom/blur N+1.
|
||||
|
||||
Reference in New Issue
Block a user