xenia-rs/audit-runs/phase-c23-keWait-timeout-encoding/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

Phase C+23 investigation — KeWaitForSingleObject timeout encoding (2026-05-18)

Divergence (input from C+22)

D-NEW-2 at canary tid=12 → ours tid=7 idx=3 sister chain:

canary: [3] wait.begin {handles_semantic_ids: ['c49d8f0ab90401ea'],
                        timeout_ns: -30000000, alertable: False, wait_type: 'any'}
ours:   [3] wait.begin {handles_semantic_ids: ['6e3d96c5a52bf429'],
                        timeout_ns: 429466729600, alertable: False, wait_type: 'any'}

Canary: -30,000,000 ns = -300,000 ticks of 100 ns, i.e. a 30 ms relative wait. Ours: +429,466,729,600 ns = +4,294,667,296 ticks of 100 ns, i.e. an absolute deadline of roughly +7 minutes. The error has the shape of a sign-extension bug.
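The arithmetic above can be checked with a small sketch. The names `NS_PER_TICK`, `Timeout`, `classify_timeout_ns` and `zero_extended_low32` are hypothetical and exist only for this illustration; the sketch assumes the usual NT convention that a negative timeout is a relative interval and a non-negative one is an absolute deadline.

```rust
/// Nanoseconds per timeout tick (one tick is 100 ns).
const NS_PER_TICK: i64 = 100;

/// Interpretation of a decoded timeout, assuming the NT convention:
/// negative = relative interval, non-negative = absolute deadline.
#[derive(Debug, PartialEq)]
enum Timeout {
    RelativeNs(i64),
    AbsoluteNs(i64),
}

/// Classify a timeout in nanoseconds by its sign.
fn classify_timeout_ns(timeout_ns: i64) -> Timeout {
    if timeout_ns < 0 {
        // saturating_neg avoids overflow on i64::MIN.
        Timeout::RelativeNs(timeout_ns.saturating_neg())
    } else {
        Timeout::AbsoluteNs(timeout_ns)
    }
}

/// Model of the suspected corruption: keep the low 32 bits of the tick
/// count and zero the upper 32 bits.
fn zero_extended_low32(ticks: i64) -> i64 {
    (ticks as u32) as i64
}
```

Feeding -300,000 ticks through `zero_extended_low32` yields 4,294,667,296 ticks, which is exactly the value ours reports once multiplied by `NS_PER_TICK`.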

Step 1 — Verify framing (reading-error #28)

Canary's xeKeWaitForSingleObject

xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_threading.cc:969-1013:

uint32_t xeKeWaitForSingleObject(void* object_ptr, uint32_t wait_reason,
                                 uint32_t processor_mode, uint32_t alertable,
                                 uint64_t* timeout_ptr) {
  ...
  if (phase_a::IsEnabled()) {
    uint64_t sid = 0;
    if (!object->handles().empty()) {
      sid = phase_a::LookupHandleSemanticId(object->handles()[0]);
    }
    int64_t timeout_ns = timeout_ptr
        ? (static_cast<int64_t>(*timeout_ptr) * 100) : -1;
    phase_a::EmitWaitBegin(&sid, 1, timeout_ns, alertable != 0, false);
  }
  ...
}

dword_result_t KeWaitForSingleObject_entry(lpvoid_t object_ptr,
                                           dword_t wait_reason,
                                           dword_t processor_mode,
                                           dword_t alertable,
                                           lpqword_t timeout_ptr) {
  uint64_t timeout = timeout_ptr ? static_cast<uint64_t>(*timeout_ptr) : 0u;
  return xeKeWaitForSingleObject(...);
}

lpqword_t is Xenia's BE-swapped 64-bit-aligned pointer accessor. Formula: read 8 BE bytes as int64, multiply by 100.

Ours's ke_wait_for_single_object

xenia-rs/crates/xenia-kernel/src/exports.rs:5051-5083 (and decode_timeout_ns at 4987-4995):

fn decode_timeout_ns(mem: &GuestMemory, timeout_ptr: u32) -> i64 {
    if timeout_ptr == 0 { return -1; }
    let raw = mem.read_u64(timeout_ptr) as i64;
    raw.saturating_mul(100)
}

mem.read_u64 reads 8 BE bytes (xenia-memory/heap.rs:521-533). Formula: read 8 BE bytes as int64, multiply by 100. Identical to canary.

Conclusion of Step 1

Both engines read 8 BE bytes from the same conceptual timeout_ptr and multiply by 100. If both read the same bytes from the same address, they produce the same timeout_ns. The divergence implies one of:

  1. The timeout_ptr address differs (upstream).
  2. The bytes at the same address differ (upstream).
  3. Wrong-register read in one of the engines (reading-error #25).

Step 2 — Sample the actual guest call (reading-error #25 discipline)

Added a TEMPORARY diagnostic dump to ke_wait_for_single_object (removed before landing the fix). Ran ours cold; the first hit for tid=7:

XRS_C23 KeWait tid=7 lr=0x824cd4f4 r3=0x42453b5c r4=0x3 r5=0x1 r6=0x0
        r7=0x71187eb0 r8=0x0 r9=0x0 r10=0x2
        bytes_at_r7=hi=0x0 lo=0xfffb6c20
  • r3 = 0x42453b5c — object pointer (PKEVENT at ctx+0x20).
  • r7 = 0x71187eb0 — timeout pointer (stack-allocated).
  • bytes at r7 = 0x00000000 0xFFFB6C20 (BE) → full 8 BE bytes = 0x00000000_FFFB6C20 = +4,294,667,296. Matches ours's output.

For canary's -300,000 (= -30,000,000 / 100), the 8 BE bytes would be 0xFFFFFFFF_FFFB6C20. So the high 4 bytes are zero in ours but all-Fs in canary. The low 32 bits match exactly.

The guest is writing the LARGE_INTEGER to its stack and our engine sees 0x00000000_FFFB6C20 while canary sees 0xFFFFFFFF_FFFB6C20. Different bytes at the same conceptual location ⇒ upstream divergence in how the guest computes the value.
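To make the byte-level comparison concrete, here is a minimal sketch that applies the Step 1 formula (read 8 BE bytes as int64, multiply by 100) to both byte patterns. `decode_timeout_ns_from_bytes` is a hypothetical stand-in that takes the bytes directly instead of going through GuestMemory; the two constants are the values discussed in this step.

```rust
/// Decode an 8-byte big-endian LARGE_INTEGER (100 ns ticks) into
/// nanoseconds: read as int64, then multiply by 100 with saturation.
fn decode_timeout_ns_from_bytes(bytes: [u8; 8]) -> i64 {
    i64::from_be_bytes(bytes).saturating_mul(100)
}

/// Bytes ours observed at r7 (high word zero).
const OURS_BYTES: [u8; 8] = [0x00, 0x00, 0x00, 0x00, 0xFF, 0xFB, 0x6C, 0x20];

/// Bytes that would produce canary's -300,000 ticks (high word all-Fs).
const CANARY_BYTES: [u8; 8] = [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFB, 0x6C, 0x20];
```

With identical decoding on both sides, the only way to get the two trace values is for the upper four bytes to differ, which is what moves the search upstream of the wait subsystem.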

Step 3 — Identify the encoding bug (root cause)

LR at the KeWait call = 0x824cd4f4. The thread entry (from thread.create.entry_pc) is 0x824cd458. Disassembling 0x824cd458 … 0x824cd4f0 (the prolog through the call):

824cd470: 0x3d60fffb   lis  r11, 0xFFFB        ; high half of -300,000
824cd478: 0x3ba10050   addi r29, r1, 80        ; r29 = stack timeout slot
824cd47c: 0x616b6c20   ori  r11, r11, 0x6C20   ; r11 |= 0x6C20
824cd480: 0xf9610050   std  r11, 80(r1)        ; store r11 as 64-bit DW
...
824cd4dc: 0x7fa7eb78   mr   r7, r29            ; r7 = timeout pointer
...
824cd4f0: 0x483808dd   bl   KeWaitForSingleObject
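The instruction words in the listing can be cross-checked by hand-decoding their D-form fields. The sketch below is illustrative only: `DForm` and `decode_d_form` are hypothetical names, not part of either engine, and std is left out because it is a DS-form instruction.

```rust
/// Fields of a PowerPC D-form instruction word (addi, addis, ori, ...).
#[derive(Debug, PartialEq)]
struct DForm {
    opcode: u32, // primary opcode, the top 6 bits
    rt: u32,     // rD for addi/addis; rS (the source) for ori
    ra: u32,     // rA; for ori this is the destination
    imm: u16,    // 16-bit immediate, the low half of the word
}

/// Split a 32-bit instruction word into its D-form fields.
fn decode_d_form(word: u32) -> DForm {
    DForm {
        opcode: word >> 26,
        rt: (word >> 21) & 0x1F,
        ra: (word >> 16) & 0x1F,
        imm: (word & 0xFFFF) as u16,
    }
}
```

Decoding 0x3d60fffb gives primary opcode 15 (addis) with rD = 11, rA = 0 and immediate 0xFFFB, which is the lis r11, 0xFFFB shown in the listing.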

In canonical PowerPC, lis r11, 0xFFFB is addis r11, 0, 0xFFFB and sign-extends the shifted immediate to 64 bits:

r11 = EXTS(0xFFFB) << 16 = 0xFFFFFFFF_FFFB0000

Canary's HIR emitter at xenia-canary/src/xenia/cpu/ppc/ppc_emit_alu.cc:138-150 (InstrEmit_addis) does exactly that:

Value* si = f.LoadConstantInt64(XEEXTS16(i.D.DS) << 16);

Subsequent ori r11, r11, 0x6C20 produces 0xFFFFFFFF_FFFB6C20, and std r11, 80(r1) writes all 64 bits → canary's wire bytes 0xFFFFFFFF_FFFB6C20 = -300,000 as int64.

Ours's addis at xenia-rs/crates/xenia-cpu/src/interpreter.rs:119-132 (before fix):

PpcOpcode::addis => {
    // (per the comment) truncate to 32 bits to simulate 32-bit ABI.
    let ra_val = if instr.ra() == 0 { 0u64 } else { ctx.gpr[instr.ra()] };
    let result = ra_val.wrapping_add((instr.simm16() as i64 as u64) << 16);
    ctx.gpr[instr.rd()] = result as u32 as u64;   // ⬅ ZERO-extends to 64
    ctx.pc += 4;
}

The result as u32 as u64 cast drops the high 32 bits before storage, producing 0x00000000_FFFB0000 instead of 0xFFFFFFFF_FFFB0000. After ori → 0x00000000_FFFB6C20. After std (which stores all 64 bits of the GPR) → wire bytes 0x00000000_FFFB6C20 = +4,294,667,296 as int64. This is the C+22 divergence value exactly.

Encoding bug class: (d) Sign-extension. Specifically:

addis performed a defensive 32-bit zero-extension truncation that defeats the architectural sign-extension semantics required when the result later flows into a 64-bit memory store (std).
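The two behaviours can be compared side by side with a standalone model of the lis/ori/std sequence. This is a sketch under stated assumptions, not the interpreter code: the function names are hypothetical, registers are plain u64 values, and std is modelled as returning the big-endian bytes it would write.

```rust
/// addis with architectural semantics: the 16-bit immediate is
/// sign-extended, shifted left by 16, and added to rA in 64 bits
/// (the caller passes 0 for ra_val when the rA field is 0).
fn addis_sign_extending(ra_val: u64, simm16: i16) -> u64 {
    ra_val.wrapping_add(((simm16 as i64) << 16) as u64)
}

/// The pre-fix behaviour: same sum, then truncated to 32 bits and
/// zero-extended back to 64.
fn addis_truncating(ra_val: u64, simm16: i16) -> u64 {
    addis_sign_extending(ra_val, simm16) as u32 as u64
}

/// ori: OR a zero-extended 16-bit immediate into the register value.
fn ori(rs_val: u64, uimm16: u16) -> u64 {
    rs_val | uimm16 as u64
}

/// std: all 64 bits of the GPR, as the big-endian bytes written to memory.
fn std_bytes(rs_val: u64) -> [u8; 8] {
    rs_val.to_be_bytes()
}
```

Running `std_bytes(ori(addis_sign_extending(0, 0xFFFBu16 as i16), 0x6C20))` reproduces canary's wire bytes, while swapping in `addis_truncating` reproduces ours's pre-fix bytes with the high word zeroed.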

Why the defensive truncation existed

The C+22-era comment cites correctness of the subfc/lwz carry chain in 32-bit ABI mode. Inspection of every consumer of GPRs that might receive an addis result confirms: every 32-bit-meaningful arithmetic op (subfcx, addic, addicx, subficx, etc.) already defensively truncates BOTH operands to u32 BEFORE computing. So the upstream sign-extended high bits never enter their result; they only become visible via std/mr/orx (operations that legitimately propagate the full 64-bit value).

Reverting the addis truncation does NOT regress any PPCBUG-002/-007/etc. fix; those operate at their consumer site, not at the producer.
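The consumer-site argument can be illustrated with a simplified model. `subfc32` below is a hypothetical stand-in for a 32-bit-meaningful op that truncates both operands before computing; it is not the interpreter's actual subfcx implementation. `mr` models an op that propagates the full 64-bit value.

```rust
/// Simplified subfc-style subtract: rD = rB - rA on the low 32 bits,
/// with carry set when no borrow occurs. Both operands are truncated
/// to u32 BEFORE computing, so upstream high bits cannot leak in.
fn subfc32(ra_val: u64, rb_val: u64) -> (u64, bool) {
    let a = ra_val as u32;
    let b = rb_val as u32;
    let result = b.wrapping_sub(a);
    let carry = b >= a;
    (result as u64, carry)
}

/// mr: copies the full 64-bit register, so high bits stay visible.
fn mr(rs_val: u64) -> u64 {
    rs_val
}
```

Given the sign-extended 0xFFFFFFFF_FFFB0000 and the truncated 0x00000000_FFFB0000 as inputs, `subfc32` returns the same result and carry for both, whereas `mr` passes the difference through unchanged. That is the sense in which the producer-side truncation was redundant for the arithmetic chain but harmful for std.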

The fix (5 LOC effective)

xenia-rs/crates/xenia-cpu/src/interpreter.rs:119-138:

PpcOpcode::addis => {
    // Phase C+23: sign-extend the shifted immediate to 64 bits before
    // adding to rA, matching canary's HIR emitter. Defensive 32-bit
    // truncation at each consumer site already handles the 32-bit-ABI
    // arithmetic chain correctness (see PPCBUG-002/-007/etc.).
    let ra_val = if instr.ra() == 0 { 0i64 } else { ctx.gpr[instr.ra()] as i64 };
    let shifted = (instr.simm16() as i64) << 16;
    let result = ra_val.wrapping_add(shifted);
    ctx.gpr[instr.rd()] = result as u64;
    ctx.pc += 4;
}

Tests added (3 new in xenia-cpu)

  • addis_with_negative_simm_sign_extends_to_64_bits — direct unit test for lis r11, 0xFFFB producing 0xFFFFFFFFFFFB0000.
  • lis_ori_std_negative_timeout_writes_sign_extended_doubleword — end-to-end regression: runs the actual 3-instruction sequence used by Sylpheed's KeWait setup, asserts wire bytes 0xFFFFFFFFFFFB6C20 and int64 round-trip to -300,000.
  • addis_with_nonzero_ra_adds_in_64_bit — ensures the rA-non-zero case still uses canonical 64-bit Add semantics.

Cross-engine encoding bug class summary

Per the prompt's hint catalog:

  • (a) Wrong register: ruled out. r3-r10 dump confirms r7 holds the timeout pointer in ours, matching canary's 5-arg ABI signature.
  • (b) Wrong-direction LARGE_INTEGER dereference: ruled out. Both engines read 8 BE bytes via the same idiom.
  • (c) Endianness: ruled out. Both BE.
  • (d) Sign-extension: CONFIRMED. Bug is in the CPU interpreter's addis opcode, not the wait subsystem.

Validation evidence

  • ours-cold (post-fix) tid=7 idx=3 wait.begin.timeout_ns = -30000000, matching canary exactly.
  • Sister chain canary tid=12 → ours tid=7 advances from matched=3 to matched=4.
  • New divergence at idx=4 is return_value: canary=258 (TIMEOUT) ours=0 (SUCCESS) — the C+22-class scheduler-determinism issue (ours's monolithic-thread runner sees no contention, so the 30 ms timeout doesn't fire). Out of scope for this phase.
  • Main chain matched-prefix 104,607 preserved (no regression).
  • All other sister chains at C+22 baseline.

Files

  • investigation.md (this file)
  • cold-vs-cold-result.md
  • diff-cold-vs-cold.md — full Phase A diff report
  • ours-cold.jsonl / ours-cold-stdout.log / ours-cold-stderr.log
  • canary-cold-trunc.jsonl / canary-cold-stdout.log
  • canary-binary-cache-pre-wipe.tar.gz / canary-xdg-cache-pre-wipe.tar.gz
  • re-validation.md
  • digest-cold-stable-1.json / -2.json / -3.json
  • fix.diff