xenia-rs/audit-runs/phase-c19-NtDuplicateObject-handle-create/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


Phase C+19 investigation — NtDuplicateObject handle.create (2026-05-14)

Verified canary semantics (reading-error #28 discipline)

NtDuplicateObject_entry — xboxkrnl_ob.cc:389-412

X_HANDLE new_handle = X_INVALID_HANDLE_VALUE;
X_STATUS result = kernel_state()->object_table()->DuplicateHandle(handle, &new_handle);
if (new_handle_ptr) { *new_handle_ptr = new_handle; }
if (options == 1 /* DUPLICATE_CLOSE_SOURCE */) {
    kernel_state()->object_table()->RemoveHandle(handle);
}
return result;

ObjectTable::DuplicateHandle — object_table.cc:210-223

X_STATUS ObjectTable::DuplicateHandle(X_HANDLE handle, X_HANDLE* out_handle) {
    X_STATUS result = X_STATUS_SUCCESS;
    handle = TranslateHandle(handle);
    XObject* object = LookupObject(handle, false);  // refcount +1
    if (object) {
        result = AddHandle(object, out_handle);     // alloc fresh slot, refcount +1, EMIT handle.create
        object->Release();                          // refcount -1 (offset LookupObject)
    } else {
        result = X_STATUS_INVALID_HANDLE;           // source handle not found
    }
    return result;
}

ObjectTable::AddHandle — object_table.cc:148-208

  • Finds a fresh slot via FindFreeSlot.
  • Stores entry.object = object; entry.handle_ref_count = 1;
  • Bumps handle = (slot << 2) + kHandleBase (or + kHandleHostBase).
  • object->handles().push_back(handle).
  • object->Retain().
  • Emits handle.create via phase_a::EmitHandleCreateAuto (cvar-gated, default-off). The SID is derived from the new handle's tid + tid_event_idx, and the emit is NOT short-circuited because this path is not inside GetNativeObject.

Net effect (source: S, dup: D, underlying XObject: O)

Before dup: O.refcount = 1, slots = {S → O}, handle.create(S) emitted earlier.

After dup:

  • O.refcount = 2 (one for each slot).
  • entry[S].handle_ref_count = 1.
  • entry[D].handle_ref_count = 1.
  • handle.create(D) emitted at this dup.

Subsequent NtClose on either S or D:

  • ReleaseHandle → entry.handle_ref_count--. If 0 → RemoveHandle → entry.object = nullptr + object->Release() → if O.refcount == 0 → object dtor (emit handle.destroy).

So:

  • NtClose(S) after dup: entry[S].handle_ref_count: 1→0 → RemoveHandle(S) → O.refcount: 2→1. Object STILL ALIVE through D. NO handle.destroy.

Wait — re-reading object_table.cc:294-295: phase_a::EmitHandleDestroyAuto(handle, ...) is emitted from inside RemoveHandle, which fires whenever a slot's ref_count hits 0. So canary emits handle.destroy on EVERY NtClose of EVERY slot, regardless of whether the underlying object still has other slots.

That means: canary emits handle.create(D) AND on close emits handle.destroy(D), then later handle.destroy(S). Two handle.create events / two handle.destroy events across the dup pair. Symmetric.
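The lifecycle above can be checked against a toy model. The following is a minimal Rust sketch, not canary's code: `ToyTable`, its fields, and the string event log are hypothetical names, and the object refcount is simplified to "number of live slots". It only mirrors the counting described here: a per-slot handle_ref_count, a shared object refcount, and one handle.destroy per removed slot.

```rust
use std::collections::HashMap;

/// Toy model of the two-level refcount described above (hypothetical names,
/// not canary's real types).
#[derive(Default)]
struct ToyTable {
    slots: HashMap<u32, (u32, u32)>, // handle -> (object id, handle_ref_count)
    object_refs: HashMap<u32, u32>,  // object id -> number of live slots
    next_handle: u32,
    events: Vec<String>,
}

impl ToyTable {
    /// AddHandle: fresh slot, slot count = 1, object refcount +1, emit handle.create.
    fn add_handle(&mut self, object: u32) -> u32 {
        self.next_handle += 4;
        let h = self.next_handle;
        self.slots.insert(h, (object, 1));
        *self.object_refs.entry(object).or_insert(0) += 1;
        self.events.push(format!("handle.create({h:#x})"));
        h
    }

    /// DuplicateHandle: a second slot bound to the same object.
    fn duplicate(&mut self, handle: u32) -> Option<u32> {
        let object = self.slots.get(&handle)?.0;
        Some(self.add_handle(object))
    }

    /// NtClose: slot count -1. At 0 the slot is removed and handle.destroy is
    /// emitted for that slot, whether or not the object survives.
    fn close(&mut self, handle: u32) -> bool {
        let Some(slot) = self.slots.get_mut(&handle) else { return false };
        slot.1 -= 1;
        if slot.1 == 0 {
            let object = slot.0;
            self.slots.remove(&handle);
            self.events.push(format!("handle.destroy({handle:#x})"));
            let refs = self.object_refs.get_mut(&object).expect("object is tracked");
            *refs -= 1;
            if *refs == 0 {
                self.object_refs.remove(&object); // last slot gone: object dies
            }
        }
        true
    }
}
```

Creating S, duplicating it to D, then closing S and D in that order yields two handle.create and two handle.destroy entries in `events`, with the object entry surviving until the second close.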

Ours's current behavior — exports.rs:5210-5240

fn nt_duplicate_object(...) {
    let source = resolve_pseudo_handle(state, ctx.gpr[3] as u32);
    if !state.objects.contains_key(&source) { return STATUS_INVALID_HANDLE; }
    if out_ptr != 0 { mem.write_u32(out_ptr, source); }  // dup_id = source_id
    if options & DUPLICATE_CLOSE_SOURCE == 0 {
        if let Some(c) = state.handle_refcount.get_mut(&source) { *c += 1; }
    }
    ctx.gpr[3] = STATUS_SUCCESS;
}

dup_id is aliased to source_id. Bumps state.handle_refcount[source] so the later NtClose pair (one per logical reference) doesn't destroy mid-flight. No handle.create event because no new id was allocated.

Subsequent nt_close(handle) decrements handle_refcount[handle], emits handle.destroy only when it reaches 0.
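For contrast with canary, the aliasing behavior can be modeled the same way. This is a hedged sketch rather than the code in exports.rs: `AliasingState` and its methods are hypothetical, liveness is checked against the refcount map instead of state.objects, and the DUPLICATE_CLOSE_SOURCE path is left out.

```rust
use std::collections::HashMap;

/// Toy model of the aliasing dup described above (hypothetical names).
#[derive(Default)]
struct AliasingState {
    handle_refcount: HashMap<u32, u32>, // handle id -> outstanding NtClose calls
    events: Vec<String>,
}

impl AliasingState {
    /// Object creation: one id, one handle.create.
    fn create(&mut self, id: u32) {
        self.handle_refcount.insert(id, 1);
        self.events.push(format!("handle.create({id:#x})"));
    }

    /// Aliasing dup: no new id and no event, only a refcount bump.
    fn duplicate(&mut self, source: u32) -> Option<u32> {
        let count = self.handle_refcount.get_mut(&source)?;
        *count += 1;
        Some(source) // dup_id = source_id
    }

    /// Close: handle.destroy is emitted only when the count reaches 0.
    fn close(&mut self, id: u32) -> bool {
        let Some(count) = self.handle_refcount.get_mut(&id) else { return false };
        *count -= 1;
        if *count == 0 {
            self.handle_refcount.remove(&id);
            self.events.push(format!("handle.destroy({id:#x})"));
        }
        true
    }
}
```

Over a create, dup, close, close sequence this model logs one handle.create and one handle.destroy, where the canary model logs two of each.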

Phase A divergence

At main idx=102553, canary's tid=6 sequence after NtDuplicateObject:

[102551] import.call NtDuplicateObject
[102552] kernel.call  NtDuplicateObject
[102553] handle.create sid=df686b147b291902
[102554] kernel.return NtDuplicateObject

Ours's tid=1:

[102551] import.call NtDuplicateObject
[102552] kernel.call  NtDuplicateObject
[102553] kernel.return NtDuplicateObject   ← canary's [102554]

The visible delta is the missing handle.create between kernel.call and kernel.return.
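Locating that delta amounts to finding the first position where the two event-kind sequences disagree. A minimal sketch follows; `first_divergence` is a hypothetical helper written for illustration and is not the comparison tooling used to produce the traces above.

```rust
/// Returns the offset of the first position where the two event-kind
/// sequences differ, or None if they are identical.
/// Hypothetical helper, not the project's actual diff tool.
fn first_divergence(canary: &[&str], ours: &[&str]) -> Option<usize> {
    let common = canary.len().min(ours.len());
    (0..common)
        .find(|&i| canary[i] != ours[i])
        // Equal prefix but different lengths: diverge where the shorter one ends.
        .or_else(|| (canary.len() != ours.len()).then_some(common))
}
```

Fed the two four- and three-entry windows above, starting at idx 102551, it returns offset 2, which corresponds to idx 102553.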

AUDIT-062 risk assessment (CRITICAL)

What AUDIT-062 verified

Ours DOES dup the wedge handle (kernel-aliasing hypothesis falsified): tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80 (out_ptr). Per ours's exports.rs:4263, NtDup aliases: dup_id = source_id = 0x12AC, refcount++. NOT a kernel bug.

The original AUDIT-062 framing said "NtDup aliasing is correct because the dup_id resolves to the same KernelObject in state.objects". The wedge bug was downstream (producer-side NtSetEvent(worker_idle_event) never firing).

What is load-bearing about the aliasing

The wedge case in AUDIT-062 was:

  1. tid=13 creates event 0x12AC.
  2. Some descendant calls NtDuplicateObject(0x12AC, &dup) → dup 0x12AC (aliased).
  3. tid=13 calls KeWaitForSingleObject(0x12AC) (the source).
  4. Worker thread (eventually) calls NtSetEvent(dup) on 0x12AC.

The load-bearing invariant is: signal on dup wakes wait on source. Why this works today: both ids ARE the same id, so state.objects.get(&0x12AC) finds the same KernelObject::Event with the same waiters Vec.

The trap

If we change nt_duplicate_object to allocate a fresh dup_id and store it as a NEW state.objects entry (e.g. cloning the Event), then signal-on-dup sets the CLONED event's signaled flag, NOT the source's. tid=13 waiting on source will sleep forever. WEDGE REGRESSION.

The fix

Allocate a fresh dup_id, do NOT clone the object. Instead store a handle alias dup_id → source_id in state.handle_aliases. Whenever the guest passes dup_id to any Nt*/Ke* call, resolve through the alias to get source_id. Lookup state.objects[source_id]. The single underlying KernelObject::Event retains the unified waiters list and signaled flag. Signal-on-dup still wakes wait-on-source because both ids canonicalize to the same source.

This mirrors canary: LookupObject always indexes by slot, but every slot created by a dup points at the same underlying XObject*. The alias map gives us the same sharing.
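The difference between the trap and the fix can be shown side by side. This is a minimal sketch under simplifying assumptions: `State`, `Event`, `dup_by_clone`, and `dup_by_alias` are hypothetical stand-ins, the event carries only a signaled flag (no waiters list), and dup ids are passed in rather than allocated.

```rust
use std::collections::HashMap;

#[derive(Clone, Default)]
struct Event {
    signaled: bool,
}

/// Hypothetical stand-in for the kernel state; only the fields needed here.
#[derive(Default)]
struct State {
    objects: HashMap<u32, Event>,
    handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
}

impl State {
    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    /// Signal through whatever id the guest passed, resolving aliases first.
    fn set_event(&mut self, h: u32) -> bool {
        let id = self.resolve_handle(h);
        match self.objects.get_mut(&id) {
            Some(event) => {
                event.signaled = true;
                true
            }
            None => false,
        }
    }

    /// The trap: the dup gets its own copy of the event.
    fn dup_by_clone(&mut self, source: u32, dup: u32) {
        if let Some(event) = self.objects.get(&source).cloned() {
            self.objects.insert(dup, event);
        }
    }

    /// The fix: the dup is only an alias; there is still one event.
    fn dup_by_alias(&mut self, source: u32, dup: u32) {
        let canonical = self.resolve_handle(source);
        self.handle_aliases.insert(dup, canonical);
    }
}
```

With `dup_by_clone`, `set_event(dup)` succeeds but leaves `objects[source].signaled` false, which is the wedge. With `dup_by_alias`, the same call sets the flag on the source's event.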

Refcount lifecycle

  • Source close after dup: alias entry dup_id → source_id stays; underlying object stays alive because handle_refcount[source_id] was bumped in nt_duplicate_object. No handle.destroy emit (refcount > 0 after decrement).

Actually — to match canary's per-slot handle.destroy emission, we need each NtClose on EITHER source or dup to emit handle.destroy (with the closed slot's SID), and we only drop the underlying object when ALL slots are gone.

Cleanest design: track per-handle-id refcount separately:

  • handle_refcount[source_id]: counts the source slot's references.
  • handle_refcount[dup_id]: counts the dup slot's references.

Both start at 1 (fresh allocation, fresh dup).

nt_close(source_id): decrement handle_refcount[source_id]. If 0, emit handle.destroy(source_id), remove the alias entries pointing AT source_id if applicable, and decrement underlying-object refcount.

Actually that's complex. Let me simplify: mirror canary's two-level refcount exactly via a new struct.

Simplest model that preserves AUDIT-062 + emits handle.create

  1. state.handle_aliases: HashMap<u32, u32> (alias_id → canonical_id).
  2. state.handle_refcount[id] continues to mean: how many NtClose calls are needed on THIS id before its slot goes away.
  3. nt_duplicate_object:
    • Compute canonical = resolve_alias(source) (in case source itself is an alias).
    • Alloc dup_id via state.alloc_handle().
    • Insert alias dup_id → canonical.
    • handle_refcount.insert(dup_id, 1).
    • Emit handle.create(dup_id, object_type) using state.objects[canonical].schema_object_type().
    • If options & DUPLICATE_CLOSE_SOURCE, perform the equivalent of NtClose(source) afterwards.
  4. nt_close(handle):
    • Decrement handle_refcount[handle] as today.
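    • If it reaches 0: resolve canonical for handle, emit handle.destroy(handle), remove the alias entry for handle (if it is an alias), and decrement canonical_refcount[canonical]. Remove state.objects[canonical] only when canonical_refcount[canonical] reaches 0, whether the last slot closed is the canonical id or an alias. Requiring handle == canonical here would leak the object whenever the source is closed before the dup (test 8 below).
    • canonical_refcount: HashMap<u32, u32> is the number of live handle slots bound to canonical. It is bumped at alloc/dup and decremented when a slot's handle_refcount reaches 0.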
  5. state.resolve_handle(h): returns handle_aliases.get(&h).copied().unwrap_or(h).
  6. Every Nt*/Ke* handler that looks up state.objects via a guest-provided handle id must call state.resolve_handle(h) first.
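The model above can be sketched end to end. This is a hedged illustration, not the planned patch: the `State` layout, the handle numbering in `alloc_handle`, the `Option`/`bool` return values, and the string event log are all assumptions. It also assumes the underlying object is dropped when canonical_refcount reaches 0, regardless of which slot is closed last.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Event {
    signaled: bool,
}

/// Hypothetical state layout; field names follow the list above.
#[derive(Default)]
struct State {
    objects: HashMap<u32, Event>,          // keyed by canonical id
    handle_aliases: HashMap<u32, u32>,     // alias_id -> canonical_id
    handle_refcount: HashMap<u32, u32>,    // NtClose calls needed per handle id
    canonical_refcount: HashMap<u32, u32>, // live slots bound to each canonical id
    next_handle: u32,
    events: Vec<String>,
}

impl State {
    fn alloc_handle(&mut self) -> u32 {
        self.next_handle += 4;
        0x1000 + self.next_handle // arbitrary numbering for the sketch
    }

    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    fn create_event(&mut self) -> u32 {
        let id = self.alloc_handle();
        self.objects.insert(id, Event::default());
        self.handle_refcount.insert(id, 1);
        self.canonical_refcount.insert(id, 1);
        self.events.push(format!("handle.create({id:#x})"));
        id
    }

    fn nt_duplicate_object(&mut self, source: u32, close_source: bool) -> Option<u32> {
        if !self.handle_refcount.contains_key(&source) {
            return None; // closed or unknown handle
        }
        let canonical = self.resolve_handle(source);
        let dup = self.alloc_handle();
        self.handle_aliases.insert(dup, canonical);
        self.handle_refcount.insert(dup, 1);
        *self.canonical_refcount.entry(canonical).or_insert(0) += 1;
        self.events.push(format!("handle.create({dup:#x})"));
        if close_source {
            self.nt_close(source);
        }
        Some(dup)
    }

    fn nt_close(&mut self, handle: u32) -> bool {
        let Some(rc) = self.handle_refcount.get_mut(&handle) else { return false };
        *rc -= 1;
        if *rc > 0 {
            return true;
        }
        // Resolve before dropping the alias entry, then retire the slot.
        let canonical = self.resolve_handle(handle);
        self.handle_refcount.remove(&handle);
        self.handle_aliases.remove(&handle);
        self.events.push(format!("handle.destroy({handle:#x})"));
        let live = self.canonical_refcount.get_mut(&canonical).expect("canonical is tracked");
        *live -= 1;
        if *live == 0 {
            self.canonical_refcount.remove(&canonical);
            self.objects.remove(&canonical); // last slot gone: drop the object
        }
        true
    }

    fn nt_set_event(&mut self, handle: u32) -> bool {
        if !self.handle_refcount.contains_key(&handle) {
            return false; // slot already closed, even if the object lives on
        }
        let id = self.resolve_handle(handle);
        self.objects.get_mut(&id).map(|e| e.signaled = true).is_some()
    }
}
```

One consequence of keying state.objects by the canonical id: after the source slot is closed, the object is still stored under the source's id while the dup keeps it alive. The sketch therefore checks handle_refcount for slot liveness before resolving, so that a closed source id cannot reach the object.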

Coverage of state.objects lookups

resolve_pseudo_handle (18 call sites in exports.rs) will be extended to chain through state.resolve_handle. Direct ctx.gpr[3] as u32 → state.objects.get sites need explicit resolution. Survey identified the following direct sites that need state.resolve_handle insertion:

  • nt_read_file (1630): let handle = ctx.gpr[3] as u32;
  • nt_write_file (similar)
  • nt_set_event (4628)
  • nt_clear_event (4651)
  • nt_query_information_file, nt_set_information_file, nt_query_directory_file, nt_query_volume_information_file, nt_flush_buffers_file (file operations)
  • nt_create_io_completion (and friends)

Will sweep these in the patch.
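Chaining resolve_pseudo_handle through resolve_handle might look like the sketch below. The real signature and pseudo-handle handling in exports.rs are not shown in this note, so everything here is an assumption made for illustration: the `State` fields, the `CURRENT_THREAD_PSEUDO` constant and its value, and the single pseudo-handle case.

```rust
use std::collections::HashMap;

/// Assumed pseudo-handle value, for illustration only.
const CURRENT_THREAD_PSEUDO: u32 = 0xFFFF_FFFE;

/// Hypothetical subset of the kernel state.
struct State {
    handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
    current_thread_handle: u32,
}

impl State {
    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }
}

/// Pseudo-handle expansion first, then alias resolution, so every caller of
/// this helper receives a canonical id without further changes.
fn resolve_pseudo_handle(state: &State, raw: u32) -> u32 {
    let real = if raw == CURRENT_THREAD_PSEUDO {
        state.current_thread_handle
    } else {
        raw
    };
    state.resolve_handle(real)
}
```

Sites that read ctx.gpr[3] directly would still need an explicit state.resolve_handle call, since they never pass through this helper.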

Tests to add

  1. nt_duplicate_object_allocates_fresh_handle_id: dup != source.
  2. nt_duplicate_object_emits_handle_create_event: cvar-on, both handle.create events present.
  3. nt_duplicate_object_alias_resolves_to_canonical: state.resolve_handle(dup) == source.
  4. nt_duplicate_object_signal_on_dup_wakes_wait_on_source (AUDIT-062 regression test): create event, dup, simulate NtSetEvent(dup), confirm state.objects[source].signaled == true.
  5. nt_duplicate_object_signal_on_source_wakes_wait_on_dup (reverse symmetry).
  6. nt_duplicate_object_then_close_dup_keeps_source_live: refcount and object presence after dup-close.
  7. nt_duplicate_object_then_close_source_keeps_dup_live: reverse.
  8. nt_duplicate_object_close_source_then_close_dup_destroys_object: final close destroys underlying.
  9. nt_duplicate_object_with_close_source_flag: dup + close source in one call.
  10. nt_duplicate_object_invalid_handle_returns_status_invalid_handle.
  11. nt_duplicate_object_writes_handle_id_to_out_ptr.

Plan

  1. Implement state.handle_aliases + state.canonical_refcount + resolve_handle.
  2. Rewrite nt_duplicate_object per Section "Simplest model".
  3. Adjust nt_close and resolve_pseudo_handle.
  4. Sweep direct state.objects.get sites: insert state.resolve_handle().
  5. Add 11 unit tests.
  6. Build + test.
  7. Cold-vs-cold rebaseline.