xenia-rs/audit-runs/audit-069-wait-signal-producer/writer-report-v5.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# AUDIT-069 Session 5 — writer report (RECOVERED from captured data; agent timed out before authoring)
Date: 2026-05-20.
Status: The dispatched agent (`a9380b477f5cb4b3f`) ran ~50 min and timed out via API stream-idle error. The instrumentation, builds, and capture runs completed. The agent did NOT author the final analysis. This report is composed by the parent agent from the captured artifact files (canary-release-trace.log, ours-release-trace.jsonl, fix-canary-s5.diff).
xenia-rs HEAD `e6d43a23ac393004d2e5adf2f0395fd0b5e6448b` UNCHANGED. `sha256(git diff HEAD)` = `ed30fd526643918f67311caff0a10d1346d73fd0c0323e02477883cf5ff20357` UNCHANGED (matches S1-S4 end).
## Canary handle identification
Canary's work-semaphore: handle `0xF800003C` (single semaphore released across 414 events). The wrapper inside canary records every release with `lr=0x824AB168` (the post-call PC inside `sub_824AB158`), so the captured LR is wrapper-internal rather than the guest caller's. To get the GUEST-side caller LR, S5 would need to probe at the wrapper-entry PC and capture the caller's LR; this was not done in this session.
## Per-tid release counts
### Canary (`canary-release-trace.log`, 414 events)
| tid | count | role |
|---:|---:|---|
| 10 | 382 | worker (self-release inside dispatch fn) |
| 18 | 14 | producer |
| 17 | 9 | producer |
| 6 | 7 | main thread |
| 16 | 1 | producer |
| 26 | 1 | producer |
### Ours (`ours-release-trace.jsonl`, 99 events)
| tid | count | role |
|---:|---:|---|
| 5 | 90 | worker (= canary tid=10 by entry/ctx identity) |
| 1 | 8 | main thread (= canary tid=6) |
| 13 | 1 | producer (the wedged thread) |
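
The per-tid tallies above can be reproduced with a short script. This is a minimal sketch, assuming each line of `ours-release-trace.jsonl` is a JSON object with a `tid` field; the field name and the sample lines are assumptions for illustration, not the confirmed trace schema.

```python
import json
from collections import Counter


def per_tid_counts(lines):
    """Count release events per thread id from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        event = json.loads(line)
        counts[event["tid"]] += 1  # "tid" is an assumed field name
    return counts


# Made-up sample lines in the assumed shape (not real trace data).
sample = [
    '{"tid": 5, "lr": "0x82450ce0", "host_ns": 100}',
    '{"tid": 5, "lr": "0x82450d2c", "host_ns": 200}',
    '{"tid": 1, "lr": "0x822f23ec", "host_ns": 50}',
]

for tid, n in per_tid_counts(sample).most_common():
    print(f"tid={tid} count={n}")
```

Against the real file, the same function could be fed `open("ours-release-trace.jsonl")` directly, since a file object iterates line by line.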
## Per-LR release counts (ours only — canary lr field captured wrapper-internal addr, not useful)
| ours lr | count | likely site |
|---|---:|---|
| 0x82450ce0 | 68 | inside sub_82450B68 dispatch fn (the dominant self-release) |
| 0x82450d2c | 7 | second self-release in same fn |
| 0x82450314 | 7 | sub_824502E0+0x34 (producer A) |
| 0x8245ab70 | 7 | sub_8245ab40+0x30 (producer B) |
| 0x824584cc | 4 | sub_82458480 area (producer C) |
| 0x82458024 | 4 | sub_82458000 area (producer D) |
| 0x824504c8 | 1 | sub_82450450+0x78 (producer E) |
| 0x822f23ec | 1 | sub_822F23B0 area (main-thread producer F) |
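
The "likely site" column is an LR-to-function mapping. As a hedged sketch, it can be approximated by finding the nearest function entry at or below each LR. The entry list below contains only the `sub_*` addresses named in the table; it is not a complete function table, so a real run would need the full symbol list to avoid attributing an LR to the wrong function.

```python
from bisect import bisect_right

# Function entry addresses taken from the sub_* names in the table above.
# Assumption: these are the true entry points; the list is not exhaustive.
FUNCTION_STARTS = sorted([
    0x822F23B0, 0x824502E0, 0x82450450, 0x82450B68,
    0x82458000, 0x82458480, 0x8245AB40,
])


def symbolize(lr):
    """Return 'sub_XXXXXXXX+0xOFF' for the nearest entry at or below lr."""
    i = bisect_right(FUNCTION_STARTS, lr) - 1
    if i < 0:
        raise ValueError(f"lr {lr:#x} is below every known function entry")
    start = FUNCTION_STARTS[i]
    return f"sub_{start:08X}+0x{lr - start:X}"


print(symbolize(0x82450314))  # sub_824502E0+0x34
print(symbolize(0x8245AB70))  # sub_8245AB40+0x30
print(symbolize(0x824504C8))  # sub_82450450+0x78
```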
## Hypothesis verdict
- **H1 (ours over-releases the work-semaphore)**: **FALSIFIED.** Ours releases 99 total vs canary 414 (24% of canary's count). The worker self-release shows 90 in ours vs 382 in canary (24%). Ours does NOT over-release.
- **H2 (canary processes a batch per iteration)**: **PARTIALLY SUPPORTED but insufficient.** Per-iteration rates (combining S4's iteration data):
  - Canary: 4 iterations in 10s with 382 worker releases ≈ 95 releases per iteration (HIGH variance, n=4 is too small)
  - Ours: 91 iterations in ~60s with 90 worker releases ≈ 1 release per iteration

  The per-iteration ratio is suggestive but the canary sample size remains too thin for a HIGH-confidence claim.
- **H3 (new): SYSTEMIC under-production of work in ours.** Producer-tid releases:
  - Canary: 32 events across 5 producer tids (16, 17, 18, 26 + main 6)
  - Ours: 9 events across 2 producer tids (1, 13)

  Ours has fewer producer threads contributing AND fewer events per producer. The bug isn't localized to a single fn or handle; it's distributed across the production side of the work-queue. The ours/canary ratio is ~28% (9/32), consistent with the worker self-release ratio.
## Reconciliation with S3
S3 measured γ-signals: ours 81 / canary 492 (16%). S5 measures semaphore releases: ours 99 / canary 414 (24%). Same shape of disparity, slightly different ratio because the two events are at different points in the dispatch path. Both consistent with H3.
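
The percentages quoted in the verdict and in this reconciliation follow directly from the measured counts. A small arithmetic check, using only numbers stated in this report (the helper name `pct` is illustrative):

```python
def pct(ours, canary):
    """Ours as a whole-number percentage of canary."""
    return round(100 * ours / canary)


print("total releases  :", pct(99, 414), "%")   # S5, all tids
print("worker releases :", pct(90, 382), "%")   # S5, worker tid only
print("producer events :", pct(9, 32), "%")     # S5, producer tids only
print("gamma-signals   :", pct(81, 492), "%")   # S3

# Per-iteration release rates behind H2 (iteration counts come from S4).
print("canary per-iter :", 382 / 4)
print("ours per-iter   :", round(90 / 91, 2))
```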
## Confidence labels
- Per-tid release counts (ours): HIGH (n=99 measured directly).
- Per-tid release counts (canary): HIGH for the count itself (n=414 measured), MEDIUM for "which canary tid is the worker" (relies on S2's entry/ctx-identity mapping).
- H1 falsification: HIGH.
- H2 partial support: LOW (canary iteration data still n=4).
- H3 (systemic under-production): MEDIUM-HIGH (consistent across two independent measurements — γ-signals from S3, releases from S5).
## Methodology pattern note
S1→S5 has been a sequence of progressively refined framings, each falsifying the prior:
- S1: "spawn-layer bug" — falsified by S2.
- S2: "wrong-handle queue" (per archive) — falsified by S3.
- S3: "producer-loop underrun" — refined by S4 (it's not underrun, it's overrun per S4's branch-probe).
- S4: "ours self-releases too much" → H1 — FALSIFIED by S5.
- S5: H3 — "systemic under-production" — at least testable across multiple measurements, NOT yet a fix point.
S5's H3 is not a localized bug. It says "ours's entire work-queue-producer ecosystem fires at only ~24-28% of canary's level". That's a symptom description, not a root cause. The next session needs to identify WHICH producer fn fails to fire as often, and WHY.
## S6 recommendation
Given S5's H3, the next session should **identify the specific producer-tid divergence**, not continue investigating the dispatch fn. Compare:
- Canary tid=18 (14 releases) vs ours's analog tid — does ours have an analog? Per-tid count divergence at the producer level.
- Canary tid=17 (9 releases) — note: per S1, canary tid=17 is the thread that completes 16+ `sub_821CB030` calls (the wedge wait site). It contributes 9 work-semaphore releases as a producer. Ours's analog is tid=13 (the wedged thread, releases 1).
**The wedge IS the producer divergence**: ours's tid=13 is wedged in `sub_821CB030+0x1AC` and can only release the semaphore 1× before blocking. Canary's tid=17 completes its loop and releases 9×. So the system has been circular all along:
- Worker (tid=5/10) needs work-items enqueued by producers.
- One major producer is tid=13/17 (the cache thread).
- tid=13 wedges in ours at sub_821CB030 because the worker doesn't process enough items to wake it.
- Worker doesn't process enough items because tid=13 doesn't produce enough.
This is **self-consistent with the AUDIT-049 framing**: the wedge is a producer-consumer ladder where one side can't progress without the other, and they share the work-semaphore at handle 0x1050.
The TRUE first divergence point is upstream of all this: **whatever bootstraps the system so that tid=17 (canary's cache thread) completes its initial work cycle.** Canary's first releases at host_ns=6600 and 9503200 (tid=6 main) happen before tid=10 starts. Ours's tid=1 main also fires releases. The QUESTION: does ours's tid=1 release the right semaphore at the right host_ns?
## S6 path
Capture the **first N=20 release events on each engine, time-ordered**. Compare wallclock + tid + LR. Find the first event canary fires that ours does NOT fire (or vice versa). That's the bootstrap divergence.
LOC: 0 ours, 0 canary (data already captured). Just analysis of the existing logs.
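
A minimal sketch of the first-N comparison, under stated assumptions: events are dicts with `tid` and `host_ns` fields (assumed names), and canary tids are translated to ours tids with the identities this report already uses (10→5, 6→1, 17→13). Only the event order is compared, because absolute `host_ns` values are not assumed to be comparable across engines, and canary's LR is wrapper-internal. Canary tids with no known analog (16, 18, 26) count as a divergence.

```python
from itertools import zip_longest

# Canary tid -> ours tid, per the identities stated in this report.
TID_MAP = {10: 5, 6: 1, 17: 13}


def first_divergence(canary_events, ours_events, n=20):
    """Return (index, canary_event, ours_event) at the first mismatch
    among the first n time-ordered events, or None if they all match."""
    canary_head = sorted(canary_events, key=lambda e: e["host_ns"])[:n]
    ours_head = sorted(ours_events, key=lambda e: e["host_ns"])[:n]
    for i, (c, o) in enumerate(zip_longest(canary_head, ours_head)):
        if c is None or o is None:
            return i, c, o  # one engine ran out of events first
        if TID_MAP.get(c["tid"]) != o["tid"]:
            return i, c, o  # different (mapped) thread fired this release
    return None


# Made-up sample events for illustration only (not real trace data).
canary = [{"tid": 6, "host_ns": 10}, {"tid": 6, "host_ns": 20},
          {"tid": 10, "host_ns": 30}]
ours = [{"tid": 1, "host_ns": 5}, {"tid": 1, "host_ns": 15},
        {"tid": 13, "host_ns": 25}]
print(first_divergence(canary, ours))
```

Comparing by mapped tid alone is deliberately coarse. Once the guest-side caller LR is available on the canary side, the comparison key could be extended to `(tid, lr)`.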
## Cascade outcome
- A (canary cvar implemented + captured): PASS HIGH
- B (ours captured): PASS HIGH (existing --lr-trace)
- C (cadence comparison): PASS MEDIUM (H1 falsified high-confidence; H2 partial-low; H3 medium-high)
- D (root cause identified): N/A — narrowed but not pinpointed.
3 PASS / 1 N/A.
## Discipline
- xenia-rs HEAD UNCHANGED.
- Canary instrumentation: 2 new files, cvar-gated and default-off (`audit_70_semaphore_release_watch.h` + `.cc`).
- Canary cache will need restore from `/tmp/canary-cache-bak-audit-068` (agent timed out before doing so — manual cleanup needed).
- `--mute=true` honored on canary runs.