handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,237 @@
# Phase C+19 investigation: `NtDuplicateObject` handle.create (2026-05-14)

## Verified canary semantics (reading-error #28 discipline)

### `NtDuplicateObject_entry` (xboxkrnl_ob.cc:389-412)

```cpp
X_HANDLE new_handle = X_INVALID_HANDLE_VALUE;
X_STATUS result = kernel_state()->object_table()->DuplicateHandle(handle, &new_handle);
if (new_handle_ptr) { *new_handle_ptr = new_handle; }
if (options == 1 /* DUPLICATE_CLOSE_SOURCE */) {
  kernel_state()->object_table()->RemoveHandle(handle);
}
return result;
```

### `ObjectTable::DuplicateHandle` (object_table.cc:210-223)

```cpp
X_STATUS ObjectTable::DuplicateHandle(X_HANDLE handle, X_HANDLE* out_handle) {
  // Excerpt: the declaration of `result` is elided here.
  handle = TranslateHandle(handle);
  XObject* object = LookupObject(handle, false);  // refcount +1
  if (object) {
    result = AddHandle(object, out_handle);  // alloc fresh slot, refcount +1, EMIT handle.create
    object->Release();                       // refcount -1 (offsets LookupObject)
  }
  return result;
}
```

### `ObjectTable::AddHandle` (object_table.cc:148-208)

- Finds a fresh slot via `FindFreeSlot`.
- Stores `entry.object = object; entry.handle_ref_count = 1;`.
- Computes `handle = (slot << 2) + kHandleBase` (or `+ kHandleHostBase`).
- Calls `object->handles().push_back(handle)`.
- Calls `object->Retain()`.
- **Emits `handle.create`** via `phase_a::EmitHandleCreateAuto` (cvar-gated, default-off), using the new handle's tid + tid_event_idx for the SID. The emit is NOT short-circuited, because this path is not inside `GetNativeObject`.

### Net effect (source: S, dup: D, underlying XObject: O)

Before dup: `O.refcount = 1`, slots = {S → O}, handle.create(S) emitted earlier.

After dup:
- `O.refcount = 2` (one for each slot).
- `entry[S].handle_ref_count = 1`.
- `entry[D].handle_ref_count = 1`.
- handle.create(D) emitted at this dup.

Subsequent NtClose on either S or D:
- ReleaseHandle → `entry.handle_ref_count--`. If 0 → `RemoveHandle` → `entry.object = nullptr` + `object->Release()` → if `O.refcount == 0` → object dtor.

On a first reading, `handle.destroy` looked tied to the object dtor, which would give:
- `NtClose(S)` after dup: `entry[S].handle_ref_count: 1→0` → `RemoveHandle(S)` → `O.refcount: 2→1`. Object STILL ALIVE through D. NO handle.destroy.

Correction, on re-reading object_table.cc:294-295: `phase_a::EmitHandleDestroyAuto(handle, ...)` is emitted from inside `RemoveHandle`, which fires whenever a slot's `handle_ref_count` hits 0. So canary emits handle.destroy on EVERY NtClose that takes a slot's count to 0, regardless of whether the underlying object still has other slots.

That means: canary emits handle.create(D) at the dup, handle.destroy(D) when D is closed, and handle.destroy(S) when S is closed. Two handle.create events and two handle.destroy events across the dup pair. Symmetric.

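The accounting above can be replayed in a small model. This is a sketch under stated assumptions, not canary's code: `Table`, its fields, and the string events are hypothetical stand-ins for `ObjectTable`, its slot entries, and the Phase A emits, and the handle ids are illustrative rather than `kHandleBase` arithmetic. Only the order of refcount changes and emits follows the excerpts.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for canary's ObjectTable; not canary's code.
#[derive(Default)]
struct Table {
    next_handle: u32,
    slots: HashMap<u32, (u32, u32)>, // handle -> (object id, handle_ref_count)
    object_refs: HashMap<u32, u32>,  // object id -> object refcount
    events: Vec<String>,
}

impl Table {
    /// AddHandle: fresh slot with handle_ref_count = 1, Retain, emit handle.create.
    fn add_handle(&mut self, object: u32) -> u32 {
        self.next_handle += 4; // illustrative ids only
        let handle = self.next_handle;
        self.slots.insert(handle, (object, 1));
        *self.object_refs.entry(object).or_insert(0) += 1;
        self.events.push(format!("handle.create({handle:#x})"));
        handle
    }

    /// DuplicateHandle: a second slot bound to the same object.
    fn duplicate(&mut self, source: u32) -> Option<u32> {
        let object = self.slots.get(&source)?.0;
        Some(self.add_handle(object))
    }

    /// NtClose: decrement the slot count; at 0, remove the slot and emit.
    fn close(&mut self, handle: u32) -> bool {
        let Some(slot) = self.slots.get_mut(&handle) else { return false };
        slot.1 -= 1;
        if slot.1 > 0 {
            return true;
        }
        let object = slot.0;
        self.slots.remove(&handle);
        self.events.push(format!("handle.destroy({handle:#x})"));
        let refs = self.object_refs.get_mut(&object).expect("object is tracked");
        *refs -= 1;
        if *refs == 0 {
            self.object_refs.remove(&object); // last slot gone: object dies
        }
        true
    }
}

/// Replays: create S, dup to D, close S, close D.
fn replay_dup_pair() -> (Vec<String>, bool) {
    let mut table = Table::default();
    let s = table.add_handle(1);
    let d = table.duplicate(s).expect("source is live");
    table.close(s);
    let alive_after_closing_s = table.object_refs.contains_key(&1);
    table.close(d);
    (table.events, alive_after_closing_s)
}
```

`replay_dup_pair` yields two creates followed by two destroys, and the object is still tracked after the first close, which is the symmetric pattern the notes attribute to canary.
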
## Current behavior in ours (exports.rs:5210-5240)

```rust
// Condensed from the handler; argument decoding is omitted.
fn nt_duplicate_object(...) {
    let source = resolve_pseudo_handle(state, ctx.gpr[3] as u32);
    if !state.objects.contains_key(&source) { return STATUS_INVALID_HANDLE; }
    if out_ptr != 0 { mem.write_u32(out_ptr, source); }  // dup_id = source_id
    if (options & DUPLICATE_CLOSE_SOURCE) == 0 {
        // Source is kept: one more NtClose is now needed on this id.
        if let Some(c) = state.handle_refcount.get_mut(&source) { *c += 1; }
    }
    ctx.gpr[3] = STATUS_SUCCESS;
}
```

`dup_id` is aliased to `source_id`. The handler bumps `state.handle_refcount[source]` so that the later pair of `NtClose` calls (one per logical reference) does not destroy the object mid-flight. **No `handle.create` event** is emitted, because no new id was allocated.

A subsequent `nt_close(handle)` decrements `handle_refcount[handle]` and emits `handle.destroy` only when the count reaches 0.

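For contrast with canary, the aliasing behavior described here can be modeled in a few lines. This is a minimal sketch, assuming a hypothetical `AliasingModel` type with string events; it is not the handler in exports.rs, and only the refcount bump and the emit-on-zero rule are taken from the notes.

```rust
use std::collections::HashMap;

/// Hypothetical model of the current aliasing behavior; not the real handler.
#[derive(Default)]
struct AliasingModel {
    handle_refcount: HashMap<u32, u32>,
    events: Vec<String>,
}

impl AliasingModel {
    fn create(&mut self, id: u32) {
        self.handle_refcount.insert(id, 1);
        self.events.push(format!("handle.create({id:#x})"));
    }

    /// The dup id IS the source id; only the count changes, nothing is emitted.
    fn duplicate(&mut self, source: u32) -> Option<u32> {
        let count = self.handle_refcount.get_mut(&source)?;
        *count += 1;
        Some(source)
    }

    /// handle.destroy is emitted only when the shared count reaches 0.
    fn close(&mut self, id: u32) {
        let Some(count) = self.handle_refcount.get_mut(&id) else { return };
        *count -= 1;
        if *count == 0 {
            self.handle_refcount.remove(&id);
            self.events.push(format!("handle.destroy({id:#x})"));
        }
    }
}

/// Replays: create 0x12AC, dup it, close twice.
fn replay_aliased_dup() -> (u32, Vec<String>) {
    let mut model = AliasingModel::default();
    model.create(0x12AC);
    let dup = model.duplicate(0x12AC).expect("source is live");
    model.close(0x12AC);
    model.close(dup);
    (dup, model.events)
}
```

The replay produces one create and one destroy for the pair, against canary's two of each, which is the asymmetry the Phase A trace exposes.
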
## Phase A divergence

At main idx=102553, canary's tid=6 sequence for `NtDuplicateObject`:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] handle.create sid=df686b147b291902
[102554] kernel.return NtDuplicateObject
```

Ours, tid=1:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] kernel.return NtDuplicateObject   ← canary's [102554]
```

The visible delta is the missing `handle.create` between `kernel.call` and `kernel.return`.

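The index 102553 is simply the first position at which the two event streams disagree. A minimal sketch of that comparison follows; `first_divergence` and `divergence_idx` are hypothetical helpers, the real Phase A differ is not shown in these notes, and only the event kinds are compared here (names and SIDs are ignored).

```rust
/// Index of the first position where the two event-kind sequences differ,
/// or None if they are identical. A length mismatch counts as a divergence
/// at the end of the shorter sequence.
fn first_divergence(canary: &[&str], ours: &[&str]) -> Option<usize> {
    let common = canary.len().min(ours.len());
    (0..common)
        .find(|&i| canary[i] != ours[i])
        .or_else(|| (canary.len() != ours.len()).then_some(common))
}

/// Applies the comparison to the two excerpts above, offset by the first idx.
fn divergence_idx() -> Option<usize> {
    let base = 102551;
    let canary = ["import.call", "kernel.call", "handle.create", "kernel.return"];
    let ours = ["import.call", "kernel.call", "kernel.return"];
    first_divergence(&canary, &ours).map(|offset| base + offset)
}
```
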
## AUDIT-062 risk assessment (CRITICAL)

### What AUDIT-062 verified

> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80 (out_ptr).
> Per ours's exports.rs:4263, NtDup aliases: dup_id = source_id = 0x12AC,
> refcount++. NOT a kernel bug.

The original AUDIT-062 framing said "NtDup aliasing is correct because the
dup_id resolves to the same KernelObject in `state.objects`". The wedge bug
was downstream (producer-side `NtSetEvent(worker_idle_event)` never firing).

### What is load-bearing about the aliasing

The wedge case in AUDIT-062 was:
1. tid=13 creates event `0x12AC`.
2. Some descendant calls `NtDuplicateObject(0x12AC, &dup)` → `dup = 0x12AC` (aliased).
3. tid=13 calls `KeWaitForSingleObject(0x12AC)` (the source).
4. Worker thread (eventually) calls `NtSetEvent(dup)`, i.e. on `0x12AC`.

The load-bearing invariant is: **signal on dup wakes wait on source**. This
works today because both ids ARE the same id, so `state.objects.get(&0x12AC)`
finds the same `KernelObject::Event` with the same `waiters` Vec.

### The trap

If we change `nt_duplicate_object` to allocate a fresh `dup_id` and store it
as a NEW `state.objects` entry (e.g. cloning the Event), then signal-on-dup
sets the CLONED event's `signaled` flag, NOT the source's. tid=13 waiting on
source will sleep forever. **WEDGE REGRESSION.**

### The fix

Allocate a fresh `dup_id`, but do NOT clone the object. Instead store a
**handle alias** `dup_id → source_id` in `state.handle_aliases`. Whenever
the guest passes `dup_id` to any Nt*/Ke* call, resolve through the alias to
get `source_id`, then look up `state.objects[source_id]`. The single underlying
`KernelObject::Event` retains the unified `waiters` list and `signaled`
flag. **Signal-on-dup still wakes wait-on-source** because both ids
canonicalize to the same source.

This mirrors canary's `LookupObject`, which always indexes by slot while the
underlying `XObject*` is shared. The alias map achieves the same.

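The difference between the trap and the fix can be shown side by side. This is a sketch only: `State` and `Event` here are stripped-down hypothetical stand-ins for the real kernel state and `KernelObject::Event`, `dup_by_clone` and `dup_by_alias` are illustrative names, and `0x12B0` is an arbitrary fresh id.

```rust
use std::collections::HashMap;

/// Stand-in for KernelObject::Event; only the flag that matters here.
#[derive(Clone, Default)]
struct Event {
    signaled: bool,
}

/// Stand-in for the kernel state; field names follow the notes.
#[derive(Default)]
struct State {
    objects: HashMap<u32, Event>,
    handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
}

impl State {
    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    /// NtSetEvent-like: signals whatever object the handle resolves to.
    fn set_event(&mut self, h: u32) -> bool {
        let canonical = self.resolve_handle(h);
        match self.objects.get_mut(&canonical) {
            Some(event) => {
                event.signaled = true;
                true
            }
            None => false,
        }
    }
}

/// The trap: the dup gets its own copy of the event.
fn dup_by_clone(state: &mut State, source: u32, dup_id: u32) {
    if let Some(event) = state.objects.get(&source).cloned() {
        state.objects.insert(dup_id, event);
    }
}

/// The fix: the dup is only an alias for the canonical id.
fn dup_by_alias(state: &mut State, source: u32, dup_id: u32) {
    let canonical = state.resolve_handle(source);
    state.handle_aliases.insert(dup_id, canonical);
}

/// Creates an event at 0x12AC, dups it with `dup`, signals the dup,
/// and reports whether the SOURCE event ended up signaled.
fn source_signaled_after_set_on_dup(dup: fn(&mut State, u32, u32)) -> bool {
    let (source, dup_id) = (0x12AC, 0x12B0); // 0x12B0: arbitrary fresh id
    let mut state = State::default();
    state.objects.insert(source, Event::default());
    dup(&mut state, source, dup_id);
    state.set_event(dup_id);
    state.objects[&source].signaled
}
```

With `dup_by_clone` the source stays unsignaled, which is the wedge; with `dup_by_alias` the signal lands on the shared object.
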
### Refcount lifecycle

First attempt: keep bumping `handle_refcount[source_id]` in `nt_duplicate_object`.
- Source close after dup: alias entry `dup_id → source_id` stays; underlying
  object stays alive because `handle_refcount[source_id]` was bumped in
  `nt_duplicate_object`. No `handle.destroy` emit (refcount > 0 after
  decrement).

That does not match canary. To match canary's per-slot handle.destroy emission,
each NtClose on EITHER source or dup must emit handle.destroy (with the closed
slot's SID), and the underlying object is dropped only when ALL slots are gone.

Cleanest design: track per-handle-id refcount separately:
- `handle_refcount[source_id]`: counts the source slot's references.
- `handle_refcount[dup_id]`: counts the dup slot's references.

Both start at 1 (fresh allocation, fresh dup).

`nt_close(source_id)`: decrement `handle_refcount[source_id]`. If 0, emit
`handle.destroy(source_id)`, remove the alias entries pointing AT source_id
if applicable, and decrement underlying-object refcount.

That is complex. Simpler: mirror canary's two-level refcount
exactly via a new struct.

### Simplest model that preserves AUDIT-062 + emits handle.create

1. `state.handle_aliases: HashMap<u32, u32>` (alias_id → canonical_id).
2. `state.handle_refcount[id]` continues to mean: how many `NtClose` calls
   are needed on THIS id before its slot goes away.
3. `nt_duplicate_object`:
   - Compute `canonical = state.resolve_handle(source)` (in case source itself is an alias).
   - Alloc `dup_id` via `state.alloc_handle()`.
   - Insert alias `dup_id → canonical`.
   - `handle_refcount.insert(dup_id, 1)`.
   - Emit `handle.create(dup_id, object_type)` using `state.objects[canonical].schema_object_type()`.
   - If `options & DUPLICATE_CLOSE_SOURCE` is set, finish with the equivalent of `NtClose(source)`.
4. `nt_close(handle)`:
   - Decrement `handle_refcount[handle]` as today.
   - If it reaches 0: emit `handle.destroy(handle)` and remove the alias entry for
     `handle` (if it is an alias). If NO MORE slots are bound to canonical,
     remove `state.objects[canonical]`. This must not be gated on
     `handle == canonical`: when the source closes before the dup, the last
     close arrives on the dup id (test 8).
   - To know whether any slots are still bound to canonical, maintain
     `canonical_refcount: HashMap<u32, u32>` = number of live handle slots
     bound to canonical. Bumped at alloc/dup, decremented at close-with-rc-0.
5. `state.resolve_handle(h)`: returns `handle_aliases.get(&h).copied().unwrap_or(h)`.
6. Every Nt*/Ke* handler that looks up `state.objects` via a guest-provided
   handle id must call `state.resolve_handle(h)` first.

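The model above can be sketched end to end as follows. Everything here is an assumption for illustration: `Handles` is a hypothetical container, objects are reduced to a type-name string, the id allocator is made up, and events are plain strings. One deliberate choice: the object is dropped when `canonical_refcount` reaches 0, whichever slot closed last, so that closing the source before the dup still destroys the object on the final close.

```rust
use std::collections::HashMap;

/// Hypothetical container for the proposed bookkeeping; not the real state.
#[derive(Default)]
struct Handles {
    next_id: u32,
    objects: HashMap<u32, &'static str>,  // canonical id -> object type
    handle_aliases: HashMap<u32, u32>,    // alias_id -> canonical_id
    handle_refcount: HashMap<u32, u32>,   // per-slot NtClose count
    canonical_refcount: HashMap<u32, u32>, // live slots bound to canonical
    events: Vec<String>,
}

impl Handles {
    fn alloc_handle(&mut self) -> u32 {
        self.next_id += 4; // made-up allocator
        0x1000 + self.next_id
    }

    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }

    fn create(&mut self, object_type: &'static str) -> u32 {
        let id = self.alloc_handle();
        self.objects.insert(id, object_type);
        self.handle_refcount.insert(id, 1);
        self.canonical_refcount.insert(id, 1);
        self.events.push(format!("handle.create({id:#x}, {object_type})"));
        id
    }

    fn duplicate(&mut self, source: u32, close_source: bool) -> Option<u32> {
        if !self.handle_refcount.contains_key(&source) {
            return None; // closed or unknown slot: invalid handle
        }
        let canonical = self.resolve_handle(source);
        let object_type = *self.objects.get(&canonical)?;
        let dup_id = self.alloc_handle();
        self.handle_aliases.insert(dup_id, canonical);
        self.handle_refcount.insert(dup_id, 1);
        *self.canonical_refcount.entry(canonical).or_insert(0) += 1;
        self.events.push(format!("handle.create({dup_id:#x}, {object_type})"));
        if close_source {
            self.close(source);
        }
        Some(dup_id)
    }

    fn close(&mut self, handle: u32) -> bool {
        let Some(rc) = self.handle_refcount.get_mut(&handle) else { return false };
        *rc -= 1;
        if *rc > 0 {
            return true;
        }
        self.handle_refcount.remove(&handle);
        let canonical = self.resolve_handle(handle);
        self.handle_aliases.remove(&handle);
        self.events.push(format!("handle.destroy({handle:#x})"));
        let live = self.canonical_refcount.get_mut(&canonical).expect("slot was bound");
        *live -= 1;
        if *live == 0 {
            // Last slot gone, source or alias: drop the underlying object.
            self.canonical_refcount.remove(&canonical);
            self.objects.remove(&canonical);
        }
        true
    }
}

/// Replays: create, dup, close source, close dup.
fn replay_model() -> (Vec<String>, bool, bool) {
    let mut h = Handles::default();
    let source = h.create("Event");
    let dup = h.duplicate(source, false).expect("source is live");
    h.close(source);
    let alive_via_dup = h.objects.contains_key(&h.resolve_handle(dup));
    h.close(dup);
    let alive_at_end = !h.objects.is_empty();
    (h.events, alive_via_dup, alive_at_end)
}
```

Note that a closed source id still resolves to itself, so liveness has to be checked against `handle_refcount` (as `duplicate` does here) rather than against `objects` alone.
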
### Coverage of state.objects lookups

`resolve_pseudo_handle` (18 call sites in exports.rs) will be extended to
chain through `state.resolve_handle`. Sites that go directly from
`ctx.gpr[3] as u32` to `state.objects.get` need explicit resolution. The survey
identified the following direct sites that need a `state.resolve_handle` call:

- nt_read_file (1630): `let handle = ctx.gpr[3] as u32;`
- nt_write_file (similar)
- nt_set_event (4628)
- nt_clear_event (4651)
- nt_query_information_file, nt_set_information_file, nt_query_directory_file,
  nt_query_volume_information_file, nt_flush_buffers_file (file operations)
- nt_create_io_completion (and friends)

These will be swept in the patch.

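Chaining `resolve_pseudo_handle` through the alias map could look roughly like this. It is a sketch under assumptions: the pseudo-handle value follows the NT convention of -2 for the current thread, `current_thread_handle` is a hypothetical field, and the real function's signature and its set of pseudo-handles may differ.

```rust
use std::collections::HashMap;

/// Assumed pseudo-handle value (NT convention: -2 means "current thread").
const CURRENT_THREAD_PSEUDO_HANDLE: u32 = 0xFFFF_FFFE;

/// Hypothetical slice of the kernel state, reduced to what the sketch needs.
struct State {
    current_thread_handle: u32,        // hypothetical field
    handle_aliases: HashMap<u32, u32>, // alias_id -> canonical_id
}

impl State {
    fn resolve_handle(&self, h: u32) -> u32 {
        self.handle_aliases.get(&h).copied().unwrap_or(h)
    }
}

/// Pseudo-handle translation first, alias resolution second, so every
/// existing call site picks up alias handling without further changes.
fn resolve_pseudo_handle(state: &State, raw: u32) -> u32 {
    let real = if raw == CURRENT_THREAD_PSEUDO_HANDLE {
        state.current_thread_handle
    } else {
        raw
    };
    state.resolve_handle(real)
}
```

Direct `ctx.gpr[3] as u32` sites would not pass through this function, which is why they are listed separately for an explicit `state.resolve_handle` call.
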
### Tests to add

1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. `nt_duplicate_object_emits_handle_create_event`: cvar-on, both
   handle.create events present.
3. `nt_duplicate_object_alias_resolves_to_canonical`:
   `state.resolve_handle(dup) == source`.
4. `nt_duplicate_object_signal_on_dup_wakes_wait_on_source` (AUDIT-062
   regression test): create event, dup, simulate NtSetEvent(dup), confirm
   `state.objects[source].signaled == true`.
5. `nt_duplicate_object_signal_on_source_wakes_wait_on_dup` (reverse
   symmetry).
6. `nt_duplicate_object_then_close_dup_keeps_source_live`: refcount and
   object presence after dup-close.
7. `nt_duplicate_object_then_close_source_keeps_dup_live`: reverse.
8. `nt_duplicate_object_close_source_then_close_dup_destroys_object`:
   final close destroys underlying.
9. `nt_duplicate_object_with_close_source_flag`: dup + close source in one
   call.
10. `nt_duplicate_object_invalid_handle_returns_status_invalid_handle`.
11. `nt_duplicate_object_writes_handle_id_to_out_ptr`.

## Plan

1. Implement `state.handle_aliases` + `state.canonical_refcount` + `resolve_handle`.
2. Rewrite `nt_duplicate_object` per the "Simplest model" section.
3. Adjust `nt_close` and `resolve_pseudo_handle`.
4. Sweep direct `state.objects.get` sites: insert `state.resolve_handle()`.
5. Add 11 unit tests.
6. Build + test.
7. Cold-vs-cold rebaseline.