xenia-rs/audit-runs/review-a-step1-crowbar/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Step 0 — framing verification

Read-only checks of the crowbar's expected parameters against xenia-rs/audit-runs/phase-nonmatch-investigation/create-thread-events.json, the AUDIT-068 S3/S4 memory dossier (write epoch 9.4-9.6 s, vtable base 0x8200A1E8), and our ExCreateThread (crates/xenia-kernel/src/exports.rs:294).

The 4 thread.create events (from canary-jitter-1.jsonl)

| Index | host_ns | tid (creator) | entry_pc | ctx_ptr | stack | susp | aff | prio |
|---|---|---|---|---|---|---|---|---|
| 20 | 10,382,912,900 | 6 | 0x82506528 | 0xBCE251C0 | 65536 | true | 0 | 0 |
| 21 | 10,383,282,200 | 6 | 0x82506558 | 0xBCE251C0 | 65536 | true | 0 | 0 |
| 22 | 10,383,647,200 | 6 | 0x82506588 | 0xBCE251C0 | 65536 | true | 0 | 0 |
| 23 | 10,384,161,700 | 6 | 0x825065B8 | 0xBCE251C0 | 65536 | true | 0 | 0 |

All 4 share ctx_ptr=0xBCE251C0 and are created on canary tid=6 (main), spaced roughly 365 to 515 µs apart (gaps of 369,300, 365,000, and 514,500 ns). affinity=0 means the scheduler chooses; priority=0 is the default.
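The spacing can be re-derived from the host_ns column. The sketch below is only a sanity check on the four timestamps listed in the table; the function names are illustrative and do not exist in the repo.

```rust
/// host_ns timestamps of the four thread.create events (indices 20..=23),
/// copied from the table above.
const CREATE_HOST_NS: [u64; 4] = [
    10_382_912_900,
    10_383_282_200,
    10_383_647_200,
    10_384_161_700,
];

/// Gap in nanoseconds between each pair of consecutive events.
/// Assumes the timestamps are sorted in ascending order.
fn consecutive_gaps(ts: &[u64]) -> Vec<u64> {
    ts.windows(2).map(|w| w[1] - w[0]).collect()
}

/// Duration of the whole burst, first event to last (0 for an empty slice).
fn burst_span(ts: &[u64]) -> u64 {
    match (ts.first(), ts.last()) {
        (Some(first), Some(last)) => last - first,
        _ => 0,
    }
}
```

Calling `consecutive_gaps(&CREATE_HOST_NS)` gives the three inter-spawn gaps, and `burst_span` gives the total burst length of about 1.25 ms.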

Canary's natural resume happens later, via NtResumeThread from worker code. That call is not captured in this jsonl excerpt and is deferred here: for the crowbar we resume directly after the 4-spawn burst, since the natural resume gate is downstream of the wedge.

The ctx layout @ ctx_ptr (per AUDIT-068 S3/S4)

At install epoch host_ns ≈ 9.416 s on canary tid=6, the POD-copy at guest PC sub_824FD240+0x24 writes three u32 slots together; the fourth word listed below, the refcount, is an observation from a later epoch:

[ctx_ptr + 0x00] = 0x8200A1E8   (vtable BASE — class ANON_Class_713383D7)
[ctx_ptr + 0x04] = ctx_ptr      (self pointer — doubly-linked list head)
[ctx_ptr + 0x08] = ctx_ptr      (self pointer — doubly-linked list head)
[ctx_ptr + 0x0C] = (refcount, observed = 1 at later epoch per S4)

Reading-error #37 discipline: the value 0x8200A1E8 is the vtable BASE, not a slot-N address. The 0x8200A208 cited in older AUDIT-058/060/067 is base + 0x20, i.e. the address of slot 8 within the vtable, which those audits mistook for the base. The install value is 0x8200A1E8 per the AUDIT-068 S3 measurement.
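The base-versus-slot distinction is plain pointer arithmetic and can be checked mechanically. This is a minimal sketch assuming 4-byte vtable slots (32-bit guest pointers); the helper names are made up for illustration.

```rust
/// Vtable base installed at [ctx_ptr + 0x00], per AUDIT-068 S3.
const VTABLE_BASE: u32 = 0x8200_A1E8;
/// Assumption: each vtable slot is one 32-bit guest pointer.
const SLOT_SIZE: u32 = 4;

/// Address of slot `slot` within the vtable starting at `base`.
fn slot_address(base: u32, slot: u32) -> u32 {
    base + slot * SLOT_SIZE
}

/// Inverse mapping: which slot does `addr` name, if any?
/// Returns None when `addr` is below the base or not slot-aligned.
fn slot_index(base: u32, addr: u32) -> Option<u32> {
    let offset = addr.checked_sub(base)?;
    if offset % SLOT_SIZE == 0 {
        Some(offset / SLOT_SIZE)
    } else {
        None
    }
}
```

With these helpers, `slot_index(VTABLE_BASE, 0x8200_A208)` resolves to slot 8, which is the misreading the older audits made when they treated that address as the base.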

Worker entry stubs

Per sub_825070F0.md, each of the 4 entries (0x82506528, +0x30, +0x60, +0x90) is a thin stub that does:

lwz r11, 0(r3)       ; load vtable base from ctx
lwz r11, 140(r11)    ; load fn ptr from vtable[35]
                     ; (each entry uses a different slot: 35/36/37/38)
mtctr r11
bctr

So the workers dispatch through the ctx's vtable. If slots 35-38 are not populated, or if 0x8200A1E8 lies in .rdata so that the slot reads succeed regardless, the workers will jump to whatever those slots happen to contain. The dossier describes the vtable as "7 entries", yet the worker stubs read at offsets 140/144/148/152, so the actual class must have at least 39 vtable entries (consistent with AUDIT-058's "this is a wider parent class" framing).
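To make the two-load dispatch concrete, here is a toy model of what each stub computes. It is a sketch only: guest memory is modelled as a word-addressed map rather than the emulator's real memory type, and `GuestMem`, `stub_target`, and `min_vtable_entries` are hypothetical names.

```rust
use std::collections::HashMap;

/// Toy guest memory: 4-byte-aligned guest address -> u32 stored there.
/// A missing key stands for an unmapped or unpopulated word.
type GuestMem = HashMap<u32, u32>;

/// Byte offsets read by the four worker stubs (slots 35, 36, 37, 38).
const STUB_SLOT_OFFSETS: [u32; 4] = [140, 144, 148, 152];

/// Mirrors the stub: load the vtable base from ctx, then the function
/// pointer at `slot_offset` inside that vtable. Returns the branch target,
/// or None if either load misses in the toy memory.
fn stub_target(mem: &GuestMem, ctx_ptr: u32, slot_offset: u32) -> Option<u32> {
    let vtable = *mem.get(&ctx_ptr)?; // lwz r11, 0(r3)
    mem.get(&vtable.wrapping_add(slot_offset)).copied() // lwz r11, off(r11)
}

/// Smallest vtable (in entries) that covers every offset the stubs read,
/// assuming 4-byte slots.
fn min_vtable_entries(offsets: &[u32]) -> u32 {
    offsets.iter().map(|off| off / 4 + 1).max().unwrap_or(0)
}
```

`min_vtable_entries(&STUB_SLOT_OFFSETS)` reproduces the "at least 39 entries" bound. In the real guest a read of an unpopulated slot would still return some word, so the `None` case here only marks where the crowbar's outcome is undetermined.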

The risk that the workers fault on a bad vtable load is real, and testing exactly that is the crowbar's job.

What our ex_create_thread does today

crates/xenia-kernel/src/exports.rs:294-405. It takes 6 PPC regs, allocates the thread image (stack + PCR + TLS), allocates a thread handle, calls scheduler.spawn(SpawnParams { ... }), installs the self-ref via state.retain_handle(handle), and writes the handle to r3 and the tid to r5. A Phase A thread.create event is emitted when event_log::is_enabled().

The host-side analog therefore only needs:

  1. Allocate ctx page via state.heap_alloc(0x1000, mem) → write the 4 u32s described above into it.
  2. For each of 4 entries: call a host-side ex_create_thread-like helper that takes (entry, ctx_ptr, stack_size, suspended, affinity, priority) directly, skipping the PPC-reg-marshalling.
  3. Resume each of the 4 spawned threads via the scheduler's resume_ref.
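The three steps above can be sketched against a small trait. Everything here is hypothetical scaffolding: `CrowbarHost`, `RecordingHost`, and `run_crowbar` are illustrative names, not the real KernelState or scheduler API, and writing 1 into the refcount slot is an assumption based on the value observed later in S4.

```rust
/// Hypothetical stand-in for the kernel/scheduler surface the crowbar needs.
trait CrowbarHost {
    fn heap_alloc(&mut self, size: u32) -> u32;
    fn write_u32(&mut self, addr: u32, value: u32);
    fn spawn_suspended(&mut self, entry_pc: u32, ctx_ptr: u32, stack_size: u32) -> u32;
    fn resume(&mut self, handle: u32);
}

const VTABLE_BASE: u32 = 0x8200_A1E8;
const WORKER_ENTRIES: [u32; 4] = [0x8250_6528, 0x8250_6558, 0x8250_6588, 0x8250_65B8];
const WORKER_STACK: u32 = 65_536;

/// Step 1: allocate and fill ctx. Step 2: spawn the 4 workers suspended.
/// Step 3: resume them right after the burst. Returns the 4 handles.
fn run_crowbar<H: CrowbarHost>(host: &mut H) -> [u32; 4] {
    let ctx = host.heap_alloc(0x1000);
    host.write_u32(ctx, VTABLE_BASE); // vtable base
    host.write_u32(ctx + 0x04, ctx); // self pointer
    host.write_u32(ctx + 0x08, ctx); // self pointer
    host.write_u32(ctx + 0x0C, 1); // refcount (assumed initial value)
    let handles = WORKER_ENTRIES.map(|entry| host.spawn_suspended(entry, ctx, WORKER_STACK));
    for handle in handles {
        host.resume(handle);
    }
    handles
}

/// Recording mock used to inspect the call sequence.
#[derive(Default)]
struct RecordingHost {
    writes: Vec<(u32, u32)>,
    spawned: Vec<(u32, u32, u32)>,
    resumed: Vec<u32>,
}

impl CrowbarHost for RecordingHost {
    fn heap_alloc(&mut self, _size: u32) -> u32 {
        0x4000_0000 // arbitrary fake guest address
    }
    fn write_u32(&mut self, addr: u32, value: u32) {
        self.writes.push((addr, value));
    }
    fn spawn_suspended(&mut self, entry_pc: u32, ctx_ptr: u32, stack_size: u32) -> u32 {
        self.spawned.push((entry_pc, ctx_ptr, stack_size));
        self.spawned.len() as u32 // handles 1, 2, 3, 4
    }
    fn resume(&mut self, handle: u32) {
        self.resumed.push(handle);
    }
}
```

Running `run_crowbar(&mut RecordingHost::default())` records four ctx writes, four suspended spawns sharing one ctx_ptr, and four resumes in spawn order. Keeping the sequence behind a trait is one way to unit-test the ordering without a live scheduler; the real helper may well call the scheduler directly.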

Trigger choice

coord_pre_round in xenia-app/src/main.rs:2038 is per-outer-round and has access to both KernelState and ExecStats. Adding a one-shot check on stats.instruction_count >= threshold is trivially additive.

Threshold default = 20_000_000. At ~6.7M instr/sec lockstep that's ~3 s wallclock; well past the 10-thread initial spawn burst (which peaks around the boot-init swap) but still early enough for the workers to have time before the 200M cap.

Configurable via env XENIA_CROWBAR_TRIGGER_INSTR=N.
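A one-shot trigger of this kind might look like the sketch below. `CrowbarTrigger` and `parse_trigger` are hypothetical names, and falling back to the default on an unparsable value is an assumption about the desired behaviour, not something the notes specify.

```rust
/// Default threshold from the notes: 20M instructions (~3 s at ~6.7M instr/s).
const DEFAULT_TRIGGER_INSTR: u64 = 20_000_000;

/// Parses the raw env value; falls back to the default when the variable
/// is unset or not a plain decimal u64 (note: "20_000_000" with
/// underscores does not parse and therefore yields the default).
fn parse_trigger(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_TRIGGER_INSTR)
}

/// One-shot latch checked once per outer round.
struct CrowbarTrigger {
    threshold: u64,
    fired: bool,
}

impl CrowbarTrigger {
    fn new(threshold: u64) -> Self {
        Self { threshold, fired: false }
    }

    /// Reads XENIA_CROWBAR_TRIGGER_INSTR from the process environment.
    fn from_env() -> Self {
        let raw = std::env::var("XENIA_CROWBAR_TRIGGER_INSTR").ok();
        Self::new(parse_trigger(raw.as_deref()))
    }

    /// Returns true exactly once: on the first call where the instruction
    /// count has reached the threshold.
    fn should_fire(&mut self, instruction_count: u64) -> bool {
        if self.fired || instruction_count < self.threshold {
            return false;
        }
        self.fired = true;
        true
    }
}
```

In a per-round hook such as coord_pre_round, the check would reduce to `if trigger.should_fire(stats.instruction_count) { /* run the crowbar */ }`, so later rounds cost only a flag test.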

LOC estimate

  • xenia-kernel/src/exports.rs host_spawn_worker_thread helper: ~50 LOC
  • xenia-kernel/src/state.rs crowbar CrowbarConfig field + tick_crowbar: ~40 LOC
  • xenia-app/src/main.rs cvar + trigger wire-up: ~30 LOC
  • Tests: ~50 LOC

Total ~170 LOC; trim by inlining the helper or sharing SpawnParams boilerplate. Target ≤150 LOC.