handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions

@@ -0,0 +1,237 @@
# Phase C+19 investigation — `NtDuplicateObject` handle.create (2026-05-14)
## Verified canary semantics (reading-error #28 discipline)
### `NtDuplicateObject_entry` — xboxkrnl_ob.cc:389-412
```cpp
X_HANDLE new_handle = X_INVALID_HANDLE_VALUE;
X_STATUS result = kernel_state()->object_table()->DuplicateHandle(handle, &new_handle);
if (new_handle_ptr) { *new_handle_ptr = new_handle; }
if (options == 1 /* DUPLICATE_CLOSE_SOURCE */) {
  kernel_state()->object_table()->RemoveHandle(handle);
}
return result;
```
### `ObjectTable::DuplicateHandle` — object_table.cc:210-223
```cpp
X_STATUS ObjectTable::DuplicateHandle(X_HANDLE handle, X_HANDLE* out_handle) {
  handle = TranslateHandle(handle);
  XObject* object = LookupObject(handle, false);  // refcount +1
  if (object) {
    result = AddHandle(object, out_handle);  // alloc fresh slot, refcount +1, EMIT handle.create
    object->Release();                       // refcount -1 (offset LookupObject)
  }
  return result;
}
```
### `ObjectTable::AddHandle` — object_table.cc:148-208
- Finds a fresh slot via `FindFreeSlot`.
- Stores `entry.object = object; entry.handle_ref_count = 1;`
- Bumps `handle = (slot << 2) + kHandleBase` (or `+ kHandleHostBase`).
- `object->handles().push_back(handle)`.
- `object->Retain()`.
- **Emits `handle.create`** via `phase_a::EmitHandleCreateAuto` (cvar-gated, default-off) using the new handle's tid + tid_event_idx for SID, NOT short-circuited because we're not inside `GetNativeObject`.
### Net effect (source: S, dup: D, underlying XObject: O)
Before dup: `O.refcount = 1`, slots = {S → O}, handle.create(S) emitted earlier.
After dup:
- `O.refcount = 2` (one for each slot).
- `entry[S].handle_ref_count = 1`.
- `entry[D].handle_ref_count = 1`.
- handle.create(D) emitted at this dup.
Subsequent NtClose on either S or D:
- ReleaseHandle → `entry.handle_ref_count--`. If 0 → `RemoveHandle` → `entry.object = nullptr` + `object->Release()` → if `O.refcount == 0` → object dtor.
- `handle.destroy` is NOT tied to the object dtor. Per object_table.cc:294-295, `phase_a::EmitHandleDestroyAuto(handle, ...)` is emitted from inside `RemoveHandle`, which fires whenever a slot's ref_count hits 0.
So:
- `NtClose(S)` after dup: `entry[S].handle_ref_count: 1→0` → `RemoveHandle(S)` → `O.refcount: 2→1`. Object STILL ALIVE through D, but handle.destroy(S) IS emitted.
- Canary therefore emits handle.destroy on EVERY NtClose that takes a slot's ref_count to 0, regardless of whether the underlying object still has other slots.
That means: canary emits handle.create(D) AND on close emits handle.destroy(D), then later handle.destroy(S). Two handle.create events / two handle.destroy events across the dup pair. Symmetric.
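The two-level counting above can be sketched as a toy model. This is a minimal illustration, not xenia code: `ToyObjectTable`, its fields, and the handle numbering (`0x10 + 4 * n`) are all hypothetical, and the object refcount here counts slot references only.

```rust
use std::collections::HashMap;

/// Toy model of the canary behaviour described above.
/// All names are hypothetical; this is not xenia code.
#[derive(Default)]
pub struct ToyObjectTable {
    /// slot handle -> (object id, per-slot handle_ref_count)
    slots: HashMap<u32, (u32, u32)>,
    /// object id -> object refcount (slot references only in this model)
    object_refs: HashMap<u32, u32>,
    next_slot: u32,
    /// Emitted events, in order, e.g. "handle.create(0x10)".
    pub events: Vec<String>,
}

impl ToyObjectTable {
    /// Bind a fresh slot to `object`: handle_ref_count = 1,
    /// object refcount +1, emit handle.create.
    pub fn add_handle(&mut self, object: u32) -> u32 {
        let handle = 0x10 + self.next_slot * 4; // made-up numbering
        self.next_slot += 1;
        self.slots.insert(handle, (object, 1));
        *self.object_refs.entry(object).or_insert(0) += 1;
        self.events.push(format!("handle.create({handle:#x})"));
        handle
    }

    /// Duplicate: find the object behind `handle` and bind a new slot to it.
    pub fn duplicate(&mut self, handle: u32) -> Option<u32> {
        let object = self.slots.get(&handle)?.0;
        Some(self.add_handle(object))
    }

    /// Close one slot reference. handle.destroy is emitted when the slot's
    /// own count reaches 0, whether or not the object survives.
    /// Returns Some(true) if the underlying object was dropped.
    pub fn close(&mut self, handle: u32) -> Option<bool> {
        let entry = self.slots.get_mut(&handle)?;
        entry.1 -= 1;
        if entry.1 > 0 {
            return Some(false);
        }
        let (object, _) = self.slots.remove(&handle)?;
        self.events.push(format!("handle.destroy({handle:#x})"));
        let refs = self.object_refs.get_mut(&object)?;
        *refs -= 1;
        if *refs == 0 {
            self.object_refs.remove(&object);
            return Some(true);
        }
        Some(false)
    }
}
```

Creating S, duplicating to D, then closing both yields two `handle.create` and two `handle.destroy` events, with the object dropped only on the second close.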
## Ours's current behavior — exports.rs:5210-5240
```rust
fn nt_duplicate_object(...) {
    let source = resolve_pseudo_handle(state, ctx.gpr[3] as u32);
    if !state.objects.contains_key(&source) { return STATUS_INVALID_HANDLE; }
    if out_ptr != 0 { mem.write_u32(out_ptr, source); } // dup_id = source_id
    if options & DUPLICATE_CLOSE_SOURCE == 0 {
        if let Some(c) = state.handle_refcount.get_mut(&source) { *c += 1; }
    }
    ctx.gpr[3] = STATUS_SUCCESS;
}
```
`dup_id` is aliased to `source_id`. Bumps `state.handle_refcount[source]` so the later `NtClose` pair (one per logical reference) doesn't destroy mid-flight. **No `handle.create` event** because no new id was allocated.
Subsequent `nt_close(handle)` decrements `handle_refcount[handle]`, emits `handle.destroy` only when it reaches 0.
## Phase A divergence
At main idx=102553, canary's tid=6 sequence after `NtDuplicateObject`:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] handle.create sid=df686b147b291902
[102554] kernel.return NtDuplicateObject
```
Ours's tid=1:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] kernel.return NtDuplicateObject ← canary's [102554]
```
The visible delta is the missing `handle.create` between `kernel.call` and `kernel.return`.
## AUDIT-062 risk assessment (CRITICAL)
### What AUDIT-062 verified
> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80 (out_ptr).
> Per ours's exports.rs:4263, NtDup aliases — dup_id = source_id = 0x12AC,
> refcount++. NOT a kernel bug.
The original AUDIT-062 framing said "NtDup aliasing is correct because the
dup_id resolves to the same KernelObject in `state.objects`". The wedge bug
was downstream (producer-side `NtSetEvent(worker_idle_event)` never firing).
### What is load-bearing about the aliasing
The wedge case in AUDIT-062 was:
1. tid=13 creates event `0x12AC`.
2. Some descendant calls `NtDuplicateObject(0x12AC, &dup)` → dup `0x12AC` (aliased).
3. tid=13 calls `KeWaitForSingleObject(0x12AC)` (the source).
4. Worker thread (eventually) calls `NtSetEvent(dup)` on `0x12AC`.
The load-bearing invariant is: **signal on dup wakes wait on source**. Why
this works today: both ids ARE the same id, so `state.objects.get(&0x12AC)`
finds the same `KernelObject::Event` with the same `waiters` Vec.
### The trap
If we change `nt_duplicate_object` to allocate a fresh `dup_id` and store it
as a NEW `state.objects` entry (e.g. cloning the Event), then signal-on-dup
sets the CLONED event's `signaled` flag, NOT the source's. tid=13 waiting on
source will sleep forever. **WEDGE REGRESSION.**
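A minimal sketch of the trap, assuming a reduced event type with only a `signaled` flag. `ClonedEvent`, `dup_by_clone`, and `set_event` are hypothetical names for illustration, not the functions in exports.rs.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `KernelObject::Event`; only the flag the notes mention.
#[derive(Clone, Debug, Default)]
pub struct ClonedEvent {
    pub signaled: bool,
}

/// The design to AVOID: duplicate by storing a clone of the object under a new id.
/// Returns false if `source` does not exist.
pub fn dup_by_clone(objects: &mut HashMap<u32, ClonedEvent>, source: u32, dup_id: u32) -> bool {
    match objects.get(&source).cloned() {
        Some(copy) => {
            objects.insert(dup_id, copy);
            true
        }
        None => false,
    }
}

/// Set the event stored under `handle`, if any. No alias resolution happens here.
pub fn set_event(objects: &mut HashMap<u32, ClonedEvent>, handle: u32) {
    if let Some(ev) = objects.get_mut(&handle) {
        ev.signaled = true;
    }
}
```

After `dup_by_clone(&mut objects, 0x12AC, dup)` and `set_event(&mut objects, dup)`, only the clone is signaled; `objects[&0x12AC].signaled` stays `false`, which is the state a waiter on the source would observe.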
### The fix
Allocate a fresh `dup_id`, do NOT clone the object. Instead store a
**handle alias** `dup_id → source_id` in `state.handle_aliases`. Whenever
the guest passes `dup_id` to any Nt*/Ke* call, resolve through the alias to
get `source_id`. Lookup `state.objects[source_id]`. The single underlying
`KernelObject::Event` retains the unified `waiters` list and `signaled`
flag. **Signal-on-dup still wakes wait-on-source** because both ids
canonicalize to the same source.
This mirrors canary, where `LookupObject` always indexes by slot while the
underlying `XObject*` is shared across slots. We achieve the same with the alias map.
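The alias approach can be sketched as follows. This is an illustration under assumptions: `SharedEvent` and `AliasState` are hypothetical reductions of the real state, and `waiters` is modelled as a plain list of thread ids.

```rust
use std::collections::HashMap;

/// Hypothetical reduced event: the `signaled` flag and `waiters` list the notes mention.
#[derive(Debug, Default)]
pub struct SharedEvent {
    pub signaled: bool,
    pub waiters: Vec<u32>, // thread ids blocked on this event
}

/// Hypothetical slice of kernel state: objects keyed by canonical id, plus the alias map.
#[derive(Default)]
pub struct AliasState {
    pub objects: HashMap<u32, SharedEvent>,
    pub handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
}

impl AliasState {
    /// Map an alias to its canonical id; non-alias ids pass through unchanged.
    pub fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    /// Register `tid` as waiting on whatever object `h` canonicalizes to.
    pub fn wait(&mut self, h: u32, tid: u32) -> bool {
        let canonical = self.resolve_handle(h);
        match self.objects.get_mut(&canonical) {
            Some(ev) => {
                ev.waiters.push(tid);
                true
            }
            None => false,
        }
    }

    /// Signal through any id; returns the thread ids that were woken.
    pub fn set_event(&mut self, h: u32) -> Vec<u32> {
        let canonical = self.resolve_handle(h);
        match self.objects.get_mut(&canonical) {
            Some(ev) => {
                ev.signaled = true;
                std::mem::take(&mut ev.waiters)
            }
            None => Vec::new(),
        }
    }
}
```

With `0x12B0` aliased to `0x12AC`, a wait registered through `0x12AC` is woken by `set_event(0x12B0)`, and no second object is ever stored under the dup id.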
### Refcount lifecycle
- Source close after dup: alias entry `dup_id → source_id` stays, and the
underlying object must stay alive for as long as the dup slot is live.
- Today's approach (bumping `handle_refcount[source_id]` in
`nt_duplicate_object`) keeps the object alive, but it suppresses the
`handle.destroy` emit on the first close because the refcount is still > 0
after the decrement. That does not match canary.
To match canary's per-slot handle.destroy emission, we need each
NtClose on EITHER source or dup to emit handle.destroy (with the closed slot's
SID), and we only drop the underlying object when ALL slots are gone.
A first cut tracks per-handle-id refcount separately:
- `handle_refcount[source_id]`: counts the source slot's references.
- `handle_refcount[dup_id]`: counts the dup slot's references.
Both start at 1 (fresh allocation, fresh dup).
`nt_close(source_id)`: decrement `handle_refcount[source_id]`. If 0, emit
`handle.destroy(source_id)`, remove the alias entries pointing AT source_id
if applicable, and decrement underlying-object refcount.
That bookkeeping gets complex. Simpler: mirror canary's two-level refcount
(per-slot count plus per-object slot count) exactly.
### Simplest model that preserves AUDIT-062 + emits handle.create
1. `state.handle_aliases: HashMap<u32, u32>` (alias_id → canonical_id).
2. `state.handle_refcount[id]` continues to mean: how many `NtClose` calls
are needed on THIS id before its slot goes away.
3. `nt_duplicate_object`:
- Compute `canonical = resolve_alias(source)` (in case source itself is an alias).
- Alloc `dup_id` via `state.alloc_handle()`.
- Insert alias `dup_id → canonical`.
- `handle_refcount.insert(dup_id, 1)`.
- Emit `handle.create(dup_id, object_type)` using `state.objects[canonical].schema_object_type()`.
- If `options & DUPLICATE_CLOSE_SOURCE`, treat as a `NtClose(source)` after.
4. `nt_close(handle)`:
- Decrement `handle_refcount[handle]` as today.
- If it reaches 0: emit `handle.destroy(handle)` and remove the alias entry for
`handle` (if it's an alias). Once NO live slots remain bound to canonical
(source or alias), remove `state.objects[canonical]`, whichever id was
closed last (the source may be closed before the dup).
- To know "any more slots pointing to canonical", maintain
`canonical_refcount: HashMap<u32, u32>` = number of live handle slots
bound to canonical. Bumped at alloc/dup, decremented at close-with-rc-0.
5. `state.resolve_handle(h)`: returns `handle_aliases.get(&h).copied().unwrap_or(h)`.
6. Every Nt*/Ke* handler that looks up `state.objects` via a guest-provided
handle id must call `state.resolve_handle(h)` first.
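The model above can be sketched end to end. This is a reduced illustration, not the patch: `SketchState`, `create_object`, the `0x1000`-based id allocator, and the string event log are all hypothetical, and objects are represented only by a type name. The sketch drops the object once the canonical slot count reaches 0, whichever id was closed last, so that the source-then-dup close order also frees it.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily reduced stand-in for the emulator's kernel state.
#[derive(Default)]
pub struct SketchState {
    pub objects: HashMap<u32, &'static str>, // canonical id -> object type name
    pub handle_aliases: HashMap<u32, u32>,   // alias_id -> canonical_id
    pub handle_refcount: HashMap<u32, u32>,  // closes needed per handle id
    pub canonical_refcount: HashMap<u32, u32>, // live slots bound to canonical
    pub events: Vec<String>,
    next_id: u32,
}

impl SketchState {
    /// Made-up allocator; real ids come from the emulator's own allocator.
    pub fn alloc_handle(&mut self) -> u32 {
        self.next_id += 4;
        0x1000 + self.next_id
    }

    pub fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    /// Fresh object: one slot, one canonical reference, one handle.create.
    pub fn create_object(&mut self, object_type: &'static str) -> u32 {
        let id = self.alloc_handle();
        self.objects.insert(id, object_type);
        self.handle_refcount.insert(id, 1);
        self.canonical_refcount.insert(id, 1);
        self.events.push(format!("handle.create({id:#x}, {object_type})"));
        id
    }

    /// Returns None where the real handler would return STATUS_INVALID_HANDLE.
    pub fn nt_duplicate_object(&mut self, source: u32, close_source: bool) -> Option<u32> {
        if !self.handle_refcount.contains_key(&source) {
            return None; // source slot is not live
        }
        let canonical = self.resolve_handle(source);
        let object_type = *self.objects.get(&canonical)?;
        let dup_id = self.alloc_handle();
        self.handle_aliases.insert(dup_id, canonical);
        self.handle_refcount.insert(dup_id, 1);
        *self.canonical_refcount.entry(canonical).or_insert(0) += 1;
        self.events.push(format!("handle.create({dup_id:#x}, {object_type})"));
        if close_source {
            self.nt_close(source); // DUPLICATE_CLOSE_SOURCE: close after the dup exists
        }
        Some(dup_id)
    }

    /// Returns false if `handle` is not a live slot.
    pub fn nt_close(&mut self, handle: u32) -> bool {
        let Some(count) = self.handle_refcount.get_mut(&handle) else {
            return false;
        };
        *count -= 1;
        if *count > 0 {
            return true;
        }
        self.handle_refcount.remove(&handle);
        self.events.push(format!("handle.destroy({handle:#x})"));
        let canonical = self.handle_aliases.remove(&handle).unwrap_or(handle);
        if let Some(live) = self.canonical_refcount.get_mut(&canonical) {
            *live -= 1;
            if *live == 0 {
                self.canonical_refcount.remove(&canonical);
                self.objects.remove(&canonical); // last slot gone: drop the object
            }
        }
        true
    }
}
```

One consequence worth checking against the real code: after the source slot is closed, its id is still the key in `objects`, so handlers should validate liveness against `handle_refcount` rather than `objects.contains_key`.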
### Coverage of state.objects lookups
`resolve_pseudo_handle` (18 call sites in exports.rs) will be extended to
chain through `state.resolve_handle`. Direct `ctx.gpr[3] as u32 → state.objects.get`
sites need explicit resolution. Survey identified the following direct sites
that need `state.resolve_handle` insertion:
- nt_read_file (1630): `let handle = ctx.gpr[3] as u32;`
- nt_write_file (similar)
- nt_set_event (4628)
- nt_clear_event (4651)
- nt_query_information_file, nt_set_information_file, nt_query_directory_file,
nt_query_volume_information_file, nt_flush_buffers_file (file operations)
- nt_create_io_completion (and friends)
Will sweep these in the patch.
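A sketch of how the two lookup paths could funnel through the alias map. Everything here is an assumption for illustration: `LookupState`, the `current_thread_handle` field, the helper `handle_arg`, the `0xFFFF_FFFE` pseudo-handle value, and the signature of `resolve_pseudo_handle` may all differ from exports.rs.

```rust
use std::collections::HashMap;

/// Assumed pseudo-handle value for "current thread"; check against the real constant.
pub const CURRENT_THREAD_PSEUDO: u32 = 0xFFFF_FFFE;

/// Hypothetical slice of kernel state needed for handle resolution.
pub struct LookupState {
    pub handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
    pub current_thread_handle: u32,
}

impl LookupState {
    /// Map an alias to its canonical id; non-alias ids pass through unchanged.
    pub fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }
}

/// Pseudo-handle resolution first, then alias resolution, so existing call
/// sites of this function pick up aliasing without further changes.
pub fn resolve_pseudo_handle(state: &LookupState, raw: u32) -> u32 {
    let concrete = if raw == CURRENT_THREAD_PSEUDO {
        state.current_thread_handle
    } else {
        raw
    };
    state.resolve_handle(concrete)
}

/// For direct sites that read the handle straight from a register
/// (the `ctx.gpr[3] as u32` pattern): truncate, then resolve.
pub fn handle_arg(state: &LookupState, gpr3: u64) -> u32 {
    state.resolve_handle(gpr3 as u32)
}
```

Routing the direct sites through one small helper keeps the sweep mechanical and makes any missed site easy to grep for.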
### Tests to add
1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. `nt_duplicate_object_emits_handle_create_event`: cvar-on, both
handle.create events present.
3. `nt_duplicate_object_alias_resolves_to_canonical`:
`state.resolve_handle(dup) == source`.
4. `nt_duplicate_object_signal_on_dup_wakes_wait_on_source` (AUDIT-062
regression test): create event, dup, simulate NtSetEvent(dup), confirm
`state.objects[source].signaled == true`.
5. `nt_duplicate_object_signal_on_source_wakes_wait_on_dup` (reverse
symmetry).
6. `nt_duplicate_object_then_close_dup_keeps_source_live`: refcount and
object presence after dup-close.
7. `nt_duplicate_object_then_close_source_keeps_dup_live`: reverse.
8. `nt_duplicate_object_close_source_then_close_dup_destroys_object`:
final close destroys underlying.
9. `nt_duplicate_object_with_close_source_flag`: dup + close source in one
call.
10. `nt_duplicate_object_invalid_handle_returns_status_invalid_handle`.
11. `nt_duplicate_object_writes_handle_id_to_out_ptr`.
## Plan
1. Implement `state.handle_aliases` + `state.canonical_refcount` + `resolve_handle`.
2. Rewrite `nt_duplicate_object` per Section "Simplest model".
3. Adjust `nt_close` and `resolve_pseudo_handle`.
4. Sweep direct `state.objects.get` sites: insert `state.resolve_handle()`.
5. Add 11 unit tests.
6. Build + test.
7. Cold-vs-cold rebaseline.