handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes

Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MechaCat02
2026-06-05 07:19:08 +02:00
parent acd1656753
commit ef93a4fa14
620 changed files with 108303 additions and 1 deletions


@@ -0,0 +1,102 @@
# AUDIT-062 regression check (Phase C+19)
## What AUDIT-062 verified
AUDIT-062 (2026-05-12, dossier:
`xenia-rs/docs/functions/sub_821CB030.md`, memory:
`project_xenia_rs_audit_062_worker_wake_gap_2026_05_12.md`) localized
the worker-cluster wedge to "the producer never signals the
worker-idle event". It explicitly RULED OUT NtDuplicate aliasing as
the bug, citing the live `ours-ntdup.jsonl` trace:
> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> `--lr-trace=0x8284DF7C` captured `tid=13 cycle=26711 r3=0x000012ac
> r4=0x40541E80` (out_ptr). Per ours's
> `crates/xenia-kernel/src/exports.rs:4263`, NtDup aliases: dup_id =
> source_id = 0x12AC, refcount++. NOT a kernel bug.
The load-bearing invariant from AUDIT-062 is:
**signal-on-dup wakes wait-on-source.**
Pre-C+19 mechanism: dup_id collided with source_id, so the same
`state.objects` entry was hit by both paths.
Post-C+19 mechanism: dup_id is a fresh slot mapped to source_id via
`state.handle_aliases`; every lookup through `resolve_handle`
canonicalizes to source_id, hitting the same `state.objects` entry.
## Risk assessment
| Risk | Pre-C+19 | Post-C+19 |
|------|----------|-----------|
| Signal-on-dup wakes wait-on-source | YES (id collision) | YES (alias canonicalize) |
| File ops on dup work | YES (id collision) | YES (alias canonicalize) |
| Thread suspend/resume on dup | YES (id collision) | YES (alias canonicalize) |
| Close-dup keeps source alive | partial (refcount sharing) | YES (per-slot refcount + canonical_slot_count) |
| Close-source keeps dup alive | partial | YES |
| handle.destroy emitted per slot | NO (one per object) | YES (one per slot — canary parity) |
## Tests proving AUDIT-062 invariant survives
11 new unit tests in `xenia-kernel/src/exports.rs::tests`:
1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. **`nt_duplicate_object_signal_on_dup_wakes_wait_on_source`**:
   **THE AUDIT-062 REGRESSION GUARD**. Creates an Event, dups,
   signals the dup, asserts source Event's `signaled == true`. If
   this test ever fails, the C+19 fix has broken AUDIT-062's
   worker-cluster wedge resolution.
3. `nt_duplicate_object_signal_on_source_visible_via_dup`: symmetric.
4. `nt_duplicate_object_refcount_lifecycle`: per-slot refcount =
   1 for both source and dup; canonical_slot_count = 2; alias map
   has `dup → source`.
5. `nt_duplicate_object_then_close_dup_keeps_source_live`:
   close dup, source still live and signalable.
6. `nt_duplicate_object_then_close_source_keeps_dup_live`:
   close source, dup still live and signalable (incl. signal
   propagation test).
7. `nt_duplicate_object_close_both_destroys_underlying`:
   close both → object gone; canonical_slot_count entry pruned.
8. `nt_duplicate_object_with_close_source_flag`:
   DUPLICATE_CLOSE_SOURCE atomically dups and closes source.
9. `nt_duplicate_object_invalid_handle_returns_invalid_handle`.
10. `nt_duplicate_object_dup_of_dup_canonicalizes`:
    transitive aliasing flattens to original source.
11. `nt_duplicate_object_works_for_semaphore`: non-Event type works
    identically.
All 11 pass. Kernel tests: 193 → 204 (+11). Full workspace test
suite passes.
## End-to-end runtime verification
Direct inspection of `ours-cold.jsonl` at tid=1 idx=102553:
```
idx=102551 kind=import.call name=NtDuplicateObject
idx=102552 kind=kernel.call name=NtDuplicateObject
idx=102553 kind=handle.create name= (FRESH slot) ← C+19 NEW
idx=102554 kind=kernel.return name=NtDuplicateObject ret=0
```
The `handle.create` at idx=102553 is the canary-symmetric event that
was missing pre-C+19. This confirms that the fix lands at the
observable boundary.
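
A window like the one above can be pulled out of a trace with a few lines of Python. This is a minimal sketch, not the project's tooling. It assumes one JSON object per line carrying the `tid`, `tid_event_idx`, `kind` and `payload.name` fields that appear in the raw events of the Phase A diff report; the helper names `events_around` and `describe` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def events_around(lines: Iterable[str], tid: int, idx: int, radius: int = 2) -> Iterator[dict]:
    """Yield events for `tid` whose tid_event_idx lies within `radius` of `idx`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort
        if not isinstance(ev, dict) or ev.get("tid") != tid:
            continue
        i = ev.get("tid_event_idx")
        if isinstance(i, int) and abs(i - idx) <= radius:
            yield ev


def describe(ev: dict) -> str:
    """Render one event in the idx/kind/name style used in these notes."""
    name = ev.get("payload", {}).get("name", "")
    return f"idx={ev['tid_event_idx']} kind={ev['kind']} name={name}"


# Tiny in-memory sample shaped like the excerpt above (illustrative, not real trace data).
sample = [
    json.dumps({"tid": 1, "tid_event_idx": 102552, "kind": "kernel.call",
                "payload": {"name": "NtDuplicateObject"}}),
    json.dumps({"tid": 1, "tid_event_idx": 102553, "kind": "handle.create", "payload": {}}),
    json.dumps({"tid": 2, "tid_event_idx": 102553, "kind": "import.call",
                "payload": {"name": "KfLowerIrql"}}),
    json.dumps({"tid": 1, "tid_event_idx": 102554, "kind": "kernel.return",
                "payload": {"name": "NtDuplicateObject"}}),
]
window = list(events_around(sample, tid=1, idx=102553, radius=1))
for ev in window:
    print(describe(ev))
```

For a real log, pass an open file instead of the sample list, for example `events_around(open("ours-cold.jsonl", encoding="utf-8"), tid=1, idx=102553)`.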
## Conclusion
AUDIT-062's load-bearing invariant — signal-on-dup wakes
wait-on-source — is PRESERVED by the C+19 fix. The invariant
relies on canonical kernel-object sharing, which is now achieved
via the alias map rather than id collision. The mechanism shift
is observation-equivalent to upstream callers: they pass dup_id
to Nt*/Ke* functions; ours resolves dup_id → source_id at lookup
time; the same `KernelObject::Event` (or whatever type) is
mutated regardless of which slot id the caller named.
The pre-C+19 mechanism (id collision) is a special case of the
post-C+19 mechanism (alias map): if no dup_id is ever allocated,
`handle_aliases.get(h)` returns `None`, `resolve_handle(h)` returns
`h` unchanged, and every lookup behaves exactly as it did before.
No AUDIT-062 regression detected.


@@ -0,0 +1,108 @@
# Phase C+19 cold-vs-cold result (2026-05-14)
## Verified resolution of D-NEW-1
Direct inspection of `ours-cold.jsonl` (50M instructions, cold cache)
on tid=1 around idx 102553 (the C+18 baseline divergence point):
```
idx=102551 kind=import.call name=NtDuplicateObject
idx=102552 kind=kernel.call name=NtDuplicateObject
idx=102553 kind=handle.create [NEW in C+19 — fresh dup slot]
idx=102554 kind=kernel.return name=NtDuplicateObject ret=0
```
**D-NEW-1 RESOLVED at the source.** Ours now emits `handle.create`
between `kernel.call NtDuplicateObject` and `kernel.return
NtDuplicateObject`, exactly mirroring canary's
`ObjectTable::DuplicateHandle` → `AddHandle` (object_table.cc:210-223
→ 148-208). The new `handle.create` payload carries:
- `raw_handle_id` = freshly allocated dup id (NOT source id).
- `object_type` = same as source's `KernelObject` variant.
- `handle_semantic_id` = per-tid SID at the allocation point.
## Acceptance gates
- **Gate 1 (default-off digest)**: PASS — 3× reproducible at
`e1dfcb1559f987b35012a7f2dc6d93f5` (unchanged from C+13/C+15-α/
C+16/C+17/C+18 baseline). C+19 is observation-only at the digest
level; instruction count, swaps, draws all bit-identical to C+18.
- **Gate 2 (cvar-on emit)**: PASS — ours-cold produces 121,569
events (matches C+18's 121,544 ± shared-global tid jitter; the
+25 events are the new dup-side `handle.create` and balancing
per-slot `handle.destroy` events).
- **Gate 3 (diff tool runs)**: PASS — produces 6-chain report.
- **Gate 4 (cold-vs-cold matched prefix)**: PARTIAL PASS; see
"Canary cache jitter" below.
- **Gate 5 (build)**: PASS — both engines build clean (only the
pre-existing `dead_code` warning on `walk_committed_regions`).
- **Gate 6 (tests)**: PASS — ours kernel tests 193 → 204 (+11 new
AUDIT-062 regression + dup lifecycle tests). Workspace tests all
pass.
- **Gate 7 (Phase B image hash)**: PASS — `image_loaded_sha256` =
`ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
(unchanged).
- **Gate 8 (event-log determinism)**: PASS — emit count bit-stable
across cold runs. The new `handle.create` and per-slot
`handle.destroy` events are deterministically emitted at the
canary-symmetric boundary.
- **Gate 9 (AUDIT-062 regression)**: PASS — see
`audit062-regression-check.md`. All 11 new tests guard the
signal-on-dup-wakes-wait-on-source invariant.
## Canary cache jitter
The diff tool reports main matched-prefix at 102,424 — below the
C+18 baseline of 102,553. Investigation shows this is **canary-side
cache jitter, not a regression of the C+19 fix**:
```
C+18 baseline canary tid=6 idx=102424: status=0xc000000f (NO_SUCH_FILE)
C+19 canary v2 tid=6 idx=102424: status=0x00000000 (SUCCESS)
C+19 canary v3 tid=6 idx=102424: status=0x00000000 (SUCCESS)
C+19 canary v5 tid=6 idx=102424: status=0x00000000 (SUCCESS)
```
`NtQueryFullAttributesFile` on canary's side returned a different
status across cold runs (cache-state-dependent). Ours's status at
this idx is unchanged (`0xc000000f` in both C+18 and C+19 baselines).
The canary log used to establish the C+18 baseline reflected a
specific cache state that successive cold-canary runs have not
reproduced; this is independent of any change in xenia-rs.
The C+19 fix's true effect is verified by direct inspection of
ours-cold.jsonl at idx 102553 (above), NOT by the canary-comparison
matched-prefix at 102,424.
## Sister chain summary
Unchanged from C+18 baseline (canary jitter doesn't affect sisters):
| chain | C+18 | C+19 | delta |
|--------------------------------|---------|---------|-------|
| canary tid=4 → ours tid=11 | 11 | 11 | 0 |
| canary tid=7 → ours tid=2 | 32 | 32 | 0 |
| canary tid=12 → ours tid=7 | 3 | 3 | 0 |
| canary tid=14 → ours tid=9 | 41 | 41 | 0 |
| canary tid=15 → ours tid=10 | 16 | 16 | 0 |
No sister-chain regressions.
## Conclusion
- Direct verification: D-NEW-1 RESOLVED.
- AUDIT-062 invariant: PRESERVED (11 new regression tests + framing
analysis in `audit062-regression-check.md`).
- Cold-stable digest: UNCHANGED.
- Build + tests: PASS.
- Sister chains: UNCHANGED.
- Canary-side cold-run jitter is an independent observability
concern; the C+19 fix itself is correct and minimal.
## Next target
**C+20 = D-NEW-2 (`KeWaitForSingleObject` `timeout_ns` mismatch on
canary tid=12 → ours tid=7 at idx=3)**. ε-class encoding divergence:
canary=`-30000000` ns, ours=`429466729600` ns. Likely a sign/scale
asymmetry in the timeout payload emitter.
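
The two values are consistent with one specific reading, offered here as a hypothesis rather than a confirmed root cause. If the guest timeout is -300000 in 100 ns units (30 ms, relative), then sign-preserving scaling by 100 gives canary's `-30000000` ns, while taking only the low 32 bits as unsigned and scaling by 100 gives exactly ours's `429466729600` ns. The check below assumes that unit and that scale factor; neither is stated in the logs, and the variable names are illustrative.

```python
TICK_NS = 100  # assumption: guest timeouts are expressed in 100 ns units

raw_ticks = -300_000                 # hypothetical guest value: 30 ms, relative (negative)
low32 = raw_ticks & 0xFFFF_FFFF      # low 32 bits reinterpreted as an unsigned integer

signed_ns = raw_ticks * TICK_NS      # sign-preserving path
zero_extended_ns = low32 * TICK_NS   # low word zero-extended, then scaled

print(hex(low32), signed_ns, zero_extended_ns)
```

If this reading holds, the C+20 fix is a sign-extension change on the read side of the emitter rather than a change of scale.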


@@ -0,0 +1,135 @@
# Phase A diff report
**This report is the output of Phase A's diff harness. Divergences
shown here are INPUT for Phase B (first-divergence localization),
not findings of Phase A.** Phase A's job is to make the harness
itself correct, not to analyze what it surfaces.
## Summary
| canary_tid | ours_tid | matched | canary_total | ours_total | first_divergence_at | floating_skipped (c/o) |
|---|---|---|---|---|---|---|
| 4 | 11 | 11 | 20000 | 11 | — | 0/0 |
| 6 | 1 | 102424 | 250000 | 108507 | 102424 | 0/0 |
| 7 | 2 | 32 | 32 | 33 | — | 0/0 |
| 12 | 7 | 3 | 20000 | 5 | 3 | 0/0 |
| 14 | 9 | 41 | 20000 | 77 | 41 | 0/0 |
| 15 | 10 | 16 | 20000 | 17 | — | 0/1 |
*`floating_skipped (c/o)` counts shared-global `handle.create` events absorbed by Phase C+18 cross-tid SID matching (per-side, observation-side ordering of process-global dispatchers). See schema-v1.md §"Shared-global SIDs".*
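
For readers new to the table, `matched` and `first_divergence_at` can be read as the result of a per-chain prefix walk. The sketch below is a simplified illustration and not the harness itself: the function names are hypothetical, the comparison key (event kind plus payload name) is an assumption, and the real tool also compares other payload fields and absorbs floating shared-global events.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple


def matched_prefix(canary: Sequence[dict], ours: Sequence[dict],
                   key: Callable[[dict], Hashable]) -> Tuple[int, Optional[int]]:
    """Return (matched, first_divergence_at).

    first_divergence_at is None when the shorter side runs out before any mismatch.
    """
    matched = 0
    for c, o in zip(canary, ours):
        if key(c) != key(o):
            return matched, matched
        matched += 1
    return matched, None


def kind_and_name(ev: dict) -> tuple:
    # Simplified comparison key (assumption): event kind plus payload name only.
    return ev["kind"], ev.get("payload", {}).get("name")


# Three-event sample modelled on the tail of the canary tid=14 / ours tid=9 chain.
canary_events = [
    {"kind": "kernel.call", "payload": {"name": "KfLowerIrql"}},
    {"kind": "kernel.return", "payload": {"name": "KfLowerIrql"}},
    {"kind": "import.call", "payload": {"name": "XAudioGetVoiceCategoryVolumeChangeMask"}},
]
ours_events = [
    {"kind": "kernel.call", "payload": {"name": "KfLowerIrql"}},
    {"kind": "kernel.return", "payload": {"name": "KfLowerIrql"}},
    {"kind": "import.call", "payload": {"name": "RtlEnterCriticalSection"}},
]
result = matched_prefix(canary_events, ours_events, kind_and_name)
print(result)
```

Indices in the sample are relative to the sample, not to the real chain. A row with `—` in `first_divergence_at` corresponds to the `None` case: one side simply ended first.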
## canary_tid=4 → ours_tid=11
No divergence within the 11 compared events (canary has 20000, ours has 11).
## canary_tid=6 → ours_tid=1
First divergence at `tid_event_idx=102424`: payload.return_value: canary=0 ours=18446744072635809807
**Pre-context (last 5 matching events):**
```
canary: [102419] import.call RtlInitAnsiString
ours: [102419] import.call RtlInitAnsiString
canary: [102420] kernel.call RtlInitAnsiString
ours: [102420] kernel.call RtlInitAnsiString
canary: [102421] kernel.return RtlInitAnsiString
ours: [102421] kernel.return RtlInitAnsiString
canary: [102422] import.call NtQueryFullAttributesFile
ours: [102422] import.call NtQueryFullAttributesFile
canary: [102423] kernel.call NtQueryFullAttributesFile
ours: [102423] kernel.call NtQueryFullAttributesFile
```
**Divergent event:**
```
canary: [102424] kernel.return NtQueryFullAttributesFile
ours: [102424] kernel.return NtQueryFullAttributesFile
```
**Next event after the divergence (if any):**
```
canary: [102425] import.call RtlEnterCriticalSection
ours: [102425] import.call RtlNtStatusToDosError
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1463590800, "kind": "kernel.return", "payload": {"name": "NtQueryFullAttributesFile", "return_value": 0, "side_effects": [], "status": "0x00000000"}, "schema_version": 1, "tid": 6, "tid_event_idx": 102424}
{"deterministic": true, "engine": "ours", "guest_cycle": 5391947, "host_ns": 477692480, "kind": "kernel.return", "payload": {"name": "NtQueryFullAttributesFile", "return_value": 18446744072635809807, "side_effects": [], "status": "0xc000000f"}, "schema_version": 1, "tid": 1, "tid_event_idx": 102424}
```
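
The two `return_value` fields look unrelated, but ours's number is its own `status` (`0xc000000f`) sign-extended from 32 to 64 bits and printed as an unsigned integer. The divergence is therefore the status itself, not a second encoding difference. A quick check of that arithmetic, using a hypothetical helper name:

```python
def sign_extend_32_to_u64(value: int) -> int:
    """Sign-extend a 32-bit value to 64 bits and return it as an unsigned integer."""
    value &= 0xFFFF_FFFF
    if value & 0x8000_0000:
        value -= 1 << 32
    return value & 0xFFFF_FFFF_FFFF_FFFF


ours_status = 0xC000000F
ours_return_value = sign_extend_32_to_u64(ours_status)
print(ours_return_value)
```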
## canary_tid=7 → ours_tid=2
No divergence within the 32 compared events (canary has 32, ours has 33).
## canary_tid=12 → ours_tid=7
First divergence at `tid_event_idx=3`: payload.timeout_ns: canary=-30000000 ours=429466729600
**Pre-context (last 5 matching events):**
```
canary: [0] import.call KeWaitForSingleObject
ours: [0] import.call KeWaitForSingleObject
canary: [1] kernel.call KeWaitForSingleObject
ours: [1] kernel.call KeWaitForSingleObject
canary: [2] handle.create sid=c49d8f0ab90401ea
ours: [2] handle.create sid=6e3d96c5a52bf429
```
**Divergent event:**
```
canary: [3] wait.begin {'handles_semantic_ids': ['c49d8f0ab90401ea'], 'timeout_ns': -30000000, 'alertable': False, 'wait_type': 'any'}
ours: [3] wait.begin {'handles_semantic_ids': ['6e3d96c5a52bf429'], 'timeout_ns': 429466729600, 'alertable': False, 'wait_type': 'any'}
```
**Next event after the divergence (if any):**
```
canary: [4] kernel.return KeWaitForSingleObject
ours: [4] kernel.return KeWaitForSingleObject
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1582189500, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["c49d8f0ab90401ea"], "timeout_ns": -30000000, "wait_type": "any"}, "schema_version": 1, "tid": 12, "tid_event_idx": 3}
{"deterministic": true, "engine": "ours", "guest_cycle": 0, "host_ns": 502700532, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["6e3d96c5a52bf429"], "timeout_ns": 429466729600, "wait_type": "any"}, "schema_version": 1, "tid": 7, "tid_event_idx": 3}
```
## canary_tid=14 → ours_tid=9
First divergence at `tid_event_idx=41`: payload.ord: canary=503 ours=293
**Pre-context (last 5 matching events):**
```
canary: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
ours: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
canary: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
ours: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
canary: [38] import.call KfLowerIrql
ours: [38] import.call KfLowerIrql
canary: [39] kernel.call KfLowerIrql
ours: [39] kernel.call KfLowerIrql
canary: [40] kernel.return KfLowerIrql
ours: [40] kernel.return KfLowerIrql
```
**Divergent event:**
```
canary: [41] import.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [41] import.call RtlEnterCriticalSection
```
**Next event after the divergence (if any):**
```
canary: [42] kernel.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [42] kernel.call RtlEnterCriticalSection
```
**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1818928300, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "XAudioGetVoiceCategoryVolumeChangeMask", "ord": 503}, "schema_version": 1, "tid": 14, "tid_event_idx": 41}
{"deterministic": true, "engine": "ours", "guest_cycle": 417, "host_ns": 1711325930, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlEnterCriticalSection", "ord": 293}, "schema_version": 1, "tid": 9, "tid_event_idx": 41}
```
## canary_tid=15 → ours_tid=10
No divergence within the 16 compared events (canary has 20000, ours has 17).


@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}


@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}


@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}


@@ -0,0 +1,237 @@
# Phase C+19 investigation — `NtDuplicateObject` handle.create (2026-05-14)
## Verified canary semantics (reading-error #28 discipline)
### `NtDuplicateObject_entry` — xboxkrnl_ob.cc:389-412
```cpp
X_HANDLE new_handle = X_INVALID_HANDLE_VALUE;
X_STATUS result = kernel_state()->object_table()->DuplicateHandle(handle, &new_handle);
if (new_handle_ptr) { *new_handle_ptr = new_handle; }
if (options == 1 /* DUPLICATE_CLOSE_SOURCE */) {
  kernel_state()->object_table()->RemoveHandle(handle);
}
return result;
```
### `ObjectTable::DuplicateHandle` — object_table.cc:210-223
```cpp
X_STATUS ObjectTable::DuplicateHandle(X_HANDLE handle, X_HANDLE* out_handle) {
  handle = TranslateHandle(handle);
  XObject* object = LookupObject(handle, false);  // refcount +1
  if (object) {
    result = AddHandle(object, out_handle);  // alloc fresh slot, refcount +1, EMIT handle.create
    object->Release();                       // refcount -1 (offset LookupObject)
  }
  return result;
}
```
### `ObjectTable::AddHandle` — object_table.cc:148-208
- Finds a fresh slot via `FindFreeSlot`.
- Stores `entry.object = object; entry.handle_ref_count = 1;`
- Bumps `handle = (slot << 2) + kHandleBase` (or `+ kHandleHostBase`).
- `object->handles().push_back(handle)`.
- `object->Retain()`.
- **Emits `handle.create`** via `phase_a::EmitHandleCreateAuto` (cvar-gated, default-off) using the new handle's tid + tid_event_idx for SID, NOT short-circuited because we're not inside `GetNativeObject`.
### Net effect (source: S, dup: D, underlying XObject: O)
Before dup: `O.refcount = 1`, slots = {S → O}, handle.create(S) emitted earlier.
After dup:
- `O.refcount = 2` (one for each slot).
- `entry[S].handle_ref_count = 1`.
- `entry[D].handle_ref_count = 1`.
- handle.create(D) emitted at this dup.
Subsequent NtClose on either S or D:
- ReleaseHandle → `entry.handle_ref_count--`. If 0 → `RemoveHandle` → `entry.object = nullptr` + `object->Release()` → if `O.refcount == 0` → object dtor (emit `handle.destroy`).
So:
- `NtClose(S)` after dup: `entry[S].handle_ref_count: 1→0` → `RemoveHandle(S)` → `O.refcount: 2→1`. Object STILL ALIVE through D. NO handle.destroy.
Wait — re-reading object_table.cc:294-295: `phase_a::EmitHandleDestroyAuto(handle, ...)` is emitted from inside `RemoveHandle`, which fires whenever a slot's ref_count hits 0. So canary emits handle.destroy on EVERY NtClose of EVERY slot, regardless of whether the underlying object still has other slots.
That means: canary emits handle.create(D) AND on close emits handle.destroy(D), then later handle.destroy(S). Two handle.create events / two handle.destroy events across the dup pair. Symmetric.
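
The bookkeeping above can be replayed with a toy model. This is a sketch of the behaviour as described in these notes, not canary's code: the class and method names are hypothetical, handle ids are arbitrary small integers, and the temporary retain/release pair around `LookupObject` is omitted because it nets to zero.

```python
class ToyObject:
    """Stand-in for the shared underlying object O."""

    def __init__(self):
        self.refcount = 0


class ToyObjectTable:
    """Per-slot handle_ref_count plus a shared per-object refcount."""

    def __init__(self):
        self.slots = {}       # handle -> [object, handle_ref_count]
        self.events = []      # emitted (kind, handle) pairs, in order
        self.next_handle = 1

    def add_handle(self, obj):
        handle = self.next_handle
        self.next_handle += 1
        self.slots[handle] = [obj, 1]
        obj.refcount += 1                        # models object->Retain()
        self.events.append(("handle.create", handle))
        return handle

    def duplicate_handle(self, handle):
        obj = self.slots[handle][0]
        return self.add_handle(obj)              # fresh slot, same object

    def close(self, handle):
        entry = self.slots[handle]
        entry[1] -= 1
        if entry[1] == 0:                        # models the RemoveHandle path
            del self.slots[handle]
            entry[0].refcount -= 1               # models object->Release()
            self.events.append(("handle.destroy", handle))


table = ToyObjectTable()
o = ToyObject()
s = table.add_handle(o)
d = table.duplicate_handle(s)
table.close(s)
refcount_after_closing_source = o.refcount      # object still held through d
table.close(d)
print(table.events)
```

The event list comes out as two `handle.create` entries followed by two `handle.destroy` entries, one per slot, which is the symmetry ours has to reproduce.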
## Ours's current behavior — exports.rs:5210-5240
```rust
fn nt_duplicate_object(...) {
    let source = resolve_pseudo_handle(state, ctx.gpr[3] as u32);
    if !state.objects.contains_key(&source) { return STATUS_INVALID_HANDLE; }
    if out_ptr != 0 { mem.write_u32(out_ptr, source); } // dup_id = source_id
    if options & DUPLICATE_CLOSE_SOURCE == 0 {
        if let Some(c) = state.handle_refcount.get_mut(&source) { *c += 1; }
    }
    ctx.gpr[3] = STATUS_SUCCESS;
}
```
`dup_id` is aliased to `source_id`. Bumps `state.handle_refcount[source]` so the later `NtClose` pair (one per logical reference) doesn't destroy mid-flight. **No `handle.create` event** because no new id was allocated.
Subsequent `nt_close(handle)` decrements `handle_refcount[handle]`, emits `handle.destroy` only when it reaches 0.
## Phase A divergence
At main idx=102553, canary's tid=6 sequence after `NtDuplicateObject`:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] handle.create sid=df686b147b291902
[102554] kernel.return NtDuplicateObject
```
Ours's tid=1:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] kernel.return NtDuplicateObject ← canary's [102554]
```
The visible delta is the missing `handle.create` between `kernel.call` and `kernel.return`.
## AUDIT-062 risk assessment (CRITICAL)
### What AUDIT-062 verified
> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80 (out_ptr).
> Per ours's exports.rs:4263, NtDup aliases — dup_id = source_id = 0x12AC,
> refcount++. NOT a kernel bug.
The original AUDIT-062 framing said "NtDup aliasing is correct because the
dup_id resolves to the same KernelObject in `state.objects`". The wedge bug
was downstream (producer-side `NtSetEvent(worker_idle_event)` never firing).
### What is load-bearing about the aliasing
The wedge case in AUDIT-062 was:
1. tid=13 creates event `0x12AC`.
2. Some descendant calls `NtDuplicateObject(0x12AC, &dup)` → dup `0x12AC` (aliased).
3. tid=13 calls `KeWaitForSingleObject(0x12AC)` (the source).
4. Worker thread (eventually) calls `NtSetEvent(dup)` on `0x12AC`.
The load-bearing invariant is: **signal on dup wakes wait on source**. Why
this works today: both ids ARE the same id, so `state.objects.get(&0x12AC)`
finds the same `KernelObject::Event` with the same `waiters` Vec.
### The trap
If we change `nt_duplicate_object` to allocate a fresh `dup_id` and store it
as a NEW `state.objects` entry (e.g. cloning the Event), then signal-on-dup
sets the CLONED event's `signaled` flag, NOT the source's. tid=13 waiting on
source will sleep forever. **WEDGE REGRESSION.**
### The fix
Allocate a fresh `dup_id`, do NOT clone the object. Instead store a
**handle alias** `dup_id → source_id` in `state.handle_aliases`. Whenever
the guest passes `dup_id` to any Nt*/Ke* call, resolve through the alias to
get `source_id`. Lookup `state.objects[source_id]`. The single underlying
`KernelObject::Event` retains the unified `waiters` list and `signaled`
flag. **Signal-on-dup still wakes wait-on-source** because both ids
canonicalize to the same source.
This mirrors canary's `LookupObject` which always indexes by slot, but the
underlying `XObject*` is shared. We achieve the same with the alias map.
### Refcount lifecycle
- Source close after dup: alias entry `dup_id → source_id` stays; underlying
object stays alive because `handle_refcount[source_id]` was bumped in
`nt_duplicate_object`. No `handle.destroy` emit (refcount > 0 after
decrement).
Actually — to match canary's per-slot handle.destroy emission, we need each
NtClose on EITHER source or dup to emit handle.destroy (with the closed slot's
SID), and we only drop the underlying object when ALL slots are gone.
Cleanest design: track per-handle-id refcount separately:
- `handle_refcount[source_id]`: counts the source slot's references.
- `handle_refcount[dup_id]`: counts the dup slot's references.
Both start at 1 (fresh allocation, fresh dup).
`nt_close(source_id)`: decrement `handle_refcount[source_id]`. If 0, emit
`handle.destroy(source_id)`, remove the alias entries pointing AT source_id
if applicable, and decrement underlying-object refcount.
Actually that's complex. Let me simplify: mirror canary's two-level refcount
exactly via a new struct.
### Simplest model that preserves AUDIT-062 + emits handle.create
1. `state.handle_aliases: HashMap<u32, u32>` (alias_id → canonical_id).
2. `state.handle_refcount[id]` continues to mean: how many `NtClose` calls
are needed on THIS id before its slot goes away.
3. `nt_duplicate_object`:
   - Compute `canonical = resolve_alias(source)` (in case source itself is an alias).
   - Alloc `dup_id` via `state.alloc_handle()`.
   - Insert alias `dup_id → canonical`.
   - `handle_refcount.insert(dup_id, 1)`.
   - Emit `handle.create(dup_id, object_type)` using `state.objects[canonical].schema_object_type()`.
   - If `options & DUPLICATE_CLOSE_SOURCE`, treat as a `NtClose(source)` after.
4. `nt_close(handle)`:
   - Decrement `handle_refcount[handle]` as today.
   - If reaches 0: emit `handle.destroy(handle)`. Remove the alias entry for
     `handle` (if it's an alias). If there are NO MORE alias slots pointing
     to canonical, AND `handle == canonical`, remove `state.objects[canonical]`.
   - To know "any more slots pointing to canonical", maintain
     `canonical_refcount: HashMap<u32, u32>` = number of live handle slots
     bound to canonical. Bumped at alloc/dup, decremented at close-with-rc-0.
5. `state.resolve_handle(h)`: returns `handle_aliases.get(&h).copied().unwrap_or(h)`.
6. Every Nt*/Ke* handler that looks up `state.objects` via a guest-provided
handle id must call `state.resolve_handle(h)` first.
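
Items 1 to 6 can be exercised end to end in a small Python model before touching the Rust. This is only a sketch under stated assumptions: `ToyKernelState` and its methods are hypothetical names, objects are plain dicts, and event emission is left out. One deliberate simplification: the object is dropped when the last bound slot closes, whichever slot that is, instead of requiring `handle == canonical`.

```python
class ToyKernelState:
    def __init__(self):
        self.objects = {}             # canonical_id -> object state (a dict here)
        self.handle_aliases = {}      # alias_id -> canonical_id
        self.handle_refcount = {}     # handle id -> NtClose calls still needed
        self.canonical_refcount = {}  # canonical_id -> live slots bound to it
        self.next_id = 0x1000         # arbitrary starting id for the model

    def alloc_handle(self):
        self.next_id += 4
        return self.next_id

    def resolve_handle(self, h):
        # Identity when h has no alias entry, so un-dup'd handles behave as before.
        return self.handle_aliases.get(h, h)

    def lookup(self, h):
        if h not in self.handle_refcount:      # slot already closed or never allocated
            return None
        return self.objects.get(self.resolve_handle(h))

    def create_event(self):
        h = self.alloc_handle()
        self.objects[h] = {"signaled": False}
        self.handle_refcount[h] = 1
        self.canonical_refcount[h] = 1
        return h

    def duplicate(self, source):
        if self.lookup(source) is None:
            return None                        # models STATUS_INVALID_HANDLE
        canonical = self.resolve_handle(source)
        dup = self.alloc_handle()
        self.handle_aliases[dup] = canonical   # alias only, the object is never cloned
        self.handle_refcount[dup] = 1
        self.canonical_refcount[canonical] += 1
        return dup

    def set_event(self, h):
        self.lookup(h)["signaled"] = True

    def close(self, h):
        canonical = self.resolve_handle(h)
        self.handle_refcount[h] -= 1
        if self.handle_refcount[h] > 0:
            return
        del self.handle_refcount[h]
        self.handle_aliases.pop(h, None)
        self.canonical_refcount[canonical] -= 1
        if self.canonical_refcount[canonical] == 0:
            del self.canonical_refcount[canonical]
            del self.objects[canonical]        # last slot gone, drop the object


state = ToyKernelState()
src = state.create_event()
dup = state.duplicate(src)
state.set_event(dup)                           # signal through the dup...
signaled_via_source = state.lookup(src)["signaled"]   # ...observed through the source
state.close(src)
dup_alive_after_source_close = state.lookup(dup) is not None
state.close(dup)
print(dup != src, signaled_via_source, dup_alive_after_source_close, state.objects)
```

The run covers the AUDIT-062 invariant (signal on dup is visible on source) and both close orders collapsing to an empty object table.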
### Coverage of state.objects lookups
`resolve_pseudo_handle` (18 call sites in exports.rs) will be extended to
chain through `state.resolve_handle`. Direct `ctx.gpr[3] as u32 → state.objects.get`
sites need explicit resolution. Survey identified the following direct sites
that need `state.resolve_handle` insertion:
- nt_read_file (1630): `let handle = ctx.gpr[3] as u32;`
- nt_write_file (similar)
- nt_set_event (4628)
- nt_clear_event (4651)
- nt_query_information_file, nt_set_information_file, nt_query_directory_file,
nt_query_volume_information_file, nt_flush_buffers_file (file operations)
- nt_create_io_completion (and friends)
Will sweep these in the patch.
### Tests to add
1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. `nt_duplicate_object_emits_handle_create_event`: cvar-on, both
handle.create events present.
3. `nt_duplicate_object_alias_resolves_to_canonical`:
`state.resolve_handle(dup) == source`.
4. `nt_duplicate_object_signal_on_dup_wakes_wait_on_source` (AUDIT-062
regression test): create event, dup, simulate NtSetEvent(dup), confirm
`state.objects[source].signaled == true`.
5. `nt_duplicate_object_signal_on_source_wakes_wait_on_dup` (reverse
symmetry).
6. `nt_duplicate_object_then_close_dup_keeps_source_live`: refcount and
object presence after dup-close.
7. `nt_duplicate_object_then_close_source_keeps_dup_live`: reverse.
8. `nt_duplicate_object_close_source_then_close_dup_destroys_object`:
final close destroys underlying.
9. `nt_duplicate_object_with_close_source_flag`: dup + close source in one
call.
10. `nt_duplicate_object_invalid_handle_returns_status_invalid_handle`.
11. `nt_duplicate_object_writes_handle_id_to_out_ptr`.
## Plan
1. Implement `state.handle_aliases` + `state.canonical_refcount` + `resolve_handle`.
2. Rewrite `nt_duplicate_object` per Section "Simplest model".
3. Adjust `nt_close` and `resolve_pseudo_handle`.
4. Sweep direct `state.objects.get` sites: insert `state.resolve_handle()`.
5. Add 11 unit tests.
6. Build + test.
7. Cold-vs-cold rebaseline.


@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""Per-tid truncation for canary JSONL logs.
Canary's full boot log can exceed 800 MB; the diff tool loads the
entire file into RAM. We only need enough events per tid to walk past
the first divergence — anything beyond is dead weight. Cap each tid at
a configurable max (default: 250k for tid=6 main, 20k for others)."""
import json
import sys
from pathlib import Path
MAIN_CAP = 250_000 # tid=6 (canary's main chain — mapped to ours tid=1)
SISTER_CAP = 20_000 # everything else

def main() -> int:
    src = Path(sys.argv[1])
    dst = Path(sys.argv[2])
    counts: dict[int, int] = {}  # events kept so far, per tid
    kept = 0
    total = 0
    with src.open("r", encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            # Line 1 is copied through uncounted (presumably a log header).
            if lineno == 1:
                fout.write(line)
                continue
            total += 1
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            tid = ev.get("tid", 0)
            cap = MAIN_CAP if tid == 6 else SISTER_CAP
            c = counts.get(tid, 0)
            if c >= cap:
                continue  # this tid has already reached its cap
            counts[tid] = c + 1
            fout.write(line)
            kept += 1
    print(f"kept {kept}/{total} events across {len(counts)} tids")
    for tid in sorted(counts):
        print(f" tid={tid:4d} {counts[tid]}")
    return 0


if __name__ == "__main__":
    sys.exit(main())