handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity + related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps (.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as regenerable local artifacts; see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,102 @@
# AUDIT-062 regression check (Phase C+19)

## What AUDIT-062 verified

AUDIT-062 (2026-05-12, dossier:
`xenia-rs/docs/functions/sub_821CB030.md`, memory:
`project_xenia_rs_audit_062_worker_wake_gap_2026_05_12.md`) traced
the worker-cluster wedge to "the producer never signals the
worker-idle event". It explicitly RULED OUT the NtDuplicate aliasing as
the bug, citing the live `ours-ntdup.jsonl` trace:

> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> `--lr-trace=0x8284DF7C` captured
> `tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80` (out_ptr). Per ours's
> `crates/xenia-kernel/src/exports.rs:4263`, NtDup aliases: dup_id = source_id = 0x12AC,
> refcount++. NOT a kernel bug.

The load-bearing invariant from AUDIT-062 is:
**signal-on-dup wakes wait-on-source.**

Pre-C+19 mechanism: dup_id collided with source_id, so the same
`state.objects` entry was hit by both paths.

Post-C+19 mechanism: dup_id is a fresh slot mapped to source_id via
`state.handle_aliases`; every lookup through `resolve_handle`
canonicalizes to source_id, hitting the same `state.objects` entry.

## Risk assessment

| Behavior at risk | Pre-C+19 | Post-C+19 |
|------------------|----------|-----------|
| Signal-on-dup wakes wait-on-source | YES (id collision) | YES (alias canonicalize) |
| File ops on dup work | YES (id collision) | YES (alias canonicalize) |
| Thread suspend/resume on dup | YES (id collision) | YES (alias canonicalize) |
| Close-dup keeps source alive | partial (refcount sharing) | YES (per-slot refcount + canonical_slot_count) |
| Close-source keeps dup alive | partial | YES |
| handle.destroy emitted per slot | NO (one per object) | YES (one per slot, canary parity) |

## Tests proving AUDIT-062 invariant survives

11 new unit tests in `xenia-kernel/src/exports.rs::tests`:

1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. **`nt_duplicate_object_signal_on_dup_wakes_wait_on_source`**:
   **THE AUDIT-062 REGRESSION GUARD**. Creates an Event, dups,
   signals the dup, asserts source Event's `signaled == true`. If
   this test ever fails, the C+19 fix has broken AUDIT-062's
   worker-cluster wedge resolution.
3. `nt_duplicate_object_signal_on_source_visible_via_dup`: symmetric.
4. `nt_duplicate_object_refcount_lifecycle`: per-slot refcount =
   1 for both source and dup; canonical_slot_count = 2; alias map
   has `dup → source`.
5. `nt_duplicate_object_then_close_dup_keeps_source_live`:
   close dup, source still live and signalable.
6. `nt_duplicate_object_then_close_source_keeps_dup_live`:
   close source, dup still live and signalable (incl. signal
   propagation test).
7. `nt_duplicate_object_close_both_destroys_underlying`:
   close both → object gone; canonical_slot_count entry pruned.
8. `nt_duplicate_object_with_close_source_flag`:
   DUPLICATE_CLOSE_SOURCE atomically dups and closes source.
9. `nt_duplicate_object_invalid_handle_returns_invalid_handle`.
10. `nt_duplicate_object_dup_of_dup_canonicalizes`:
    transitive aliasing flattens to original source.
11. `nt_duplicate_object_works_for_semaphore`: non-Event type works
    identically.

All 11 pass. Kernel tests: 193 → 204 (+11). Full workspace test
suite passes.

## End-to-end runtime verification

Direct inspection of `ours-cold.jsonl` at tid=1 idx=102553:

```
idx=102551 kind=import.call name=NtDuplicateObject
idx=102552 kind=kernel.call name=NtDuplicateObject
idx=102553 kind=handle.create name= (FRESH slot) ← C+19 NEW
idx=102554 kind=kernel.return name=NtDuplicateObject ret=0
```

The `handle.create` at idx=102553 is the canary-symmetric event that
was missing pre-C+19. This verifies that the fix lands at the observable
boundary.

## Conclusion

AUDIT-062's load-bearing invariant (signal-on-dup wakes
wait-on-source) is PRESERVED by the C+19 fix. The invariant
relies on canonical kernel-object sharing, which is now achieved
via the alias map rather than id collision. The mechanism shift
is observation-equivalent to upstream callers: they pass dup_id
to Nt*/Ke* functions; ours resolves dup_id → source_id at lookup
time; the same `KernelObject::Event` (or whatever type) is
mutated regardless of which slot id the caller named.

The pre-C+19 mechanism (id collision) is a special case of the
post-C+19 mechanism (alias map): if no dup_id is ever allocated,
`handle_aliases.get(h)` returns `None`, `resolve_handle(h)` returns
`h` unchanged, and every lookup behaves exactly as it did before.

No AUDIT-062 regression detected.
@@ -0,0 +1,108 @@
# Phase C+19 cold-vs-cold result (2026-05-14)

## Verified resolution of D-NEW-1

Direct inspection of `ours-cold.jsonl` (50M instructions, cold cache)
on tid=1 around idx 102553 (the C+18 baseline divergence point):

```
idx=102551 kind=import.call name=NtDuplicateObject
idx=102552 kind=kernel.call name=NtDuplicateObject
idx=102553 kind=handle.create [NEW in C+19: fresh dup slot]
idx=102554 kind=kernel.return name=NtDuplicateObject ret=0
```

**D-NEW-1 RESOLVED at the source.** Ours now emits `handle.create`
between `kernel.call NtDuplicateObject` and
`kernel.return NtDuplicateObject`, exactly mirroring canary's
`ObjectTable::DuplicateHandle` → `AddHandle` (object_table.cc:210-223
→ 148-208). The new `handle.create` payload carries:

- `raw_handle_id` = freshly allocated dup id (NOT source id).
- `object_type` = same as source's `KernelObject` variant.
- `handle_semantic_id` = per-tid SID at the allocation point.

## Acceptance gates

- **Gate 1 (default-off digest)**: PASS. 3× reproducible at
  `e1dfcb1559f987b35012a7f2dc6d93f5` (unchanged from C+13/C+15-α/
  C+16/C+17/C+18 baseline). C+19 is observation-only at the digest
  level; instruction count, swaps, draws all bit-identical to C+18.
- **Gate 2 (cvar-on emit)**: PASS. ours-cold produces 121,569
  events (matches C+18's 121,544 ± shared-global tid jitter; the
  +25 events are the new dup-side `handle.create` and balancing
  per-slot `handle.destroy` events).
- **Gate 3 (diff tool runs)**: PASS. Produces 6-chain report.
- **Gate 4 (cold-vs-cold matched prefix)**: PARTIAL PASS. See
  "Canary cache jitter" below.
- **Gate 5 (build)**: PASS. Both engines build clean (only the
  pre-existing `dead_code` warning on `walk_committed_regions`).
- **Gate 6 (tests)**: PASS. Ours kernel tests 193 → 204 (+11 new
  AUDIT-062 regression + dup lifecycle tests). Workspace tests all
  pass.
- **Gate 7 (Phase B image hash)**: PASS. `image_loaded_sha256` =
  `ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18`
  (unchanged).
- **Gate 8 (event-log determinism)**: PASS. Emit count bit-stable
  across cold runs. The new `handle.create` and per-slot
  `handle.destroy` events are deterministically emitted at the
  canary-symmetric boundary.
- **Gate 9 (AUDIT-062 regression)**: PASS. See
  `audit062-regression-check.md`. The 11 new tests include the guard for the
  signal-on-dup-wakes-wait-on-source invariant.

## Canary cache jitter

The diff tool reports main matched-prefix at 102,424, below the
C+18 baseline of 102,553. Investigation shows this is **canary-side
cache jitter, not a regression of the C+19 fix**:

```
C+18 baseline canary tid=6 idx=102424: status=0xc000000f (NO_SUCH_FILE)
C+19 canary v2 tid=6 idx=102424: status=0x00000000 (SUCCESS)
C+19 canary v3 tid=6 idx=102424: status=0x00000000 (SUCCESS)
C+19 canary v5 tid=6 idx=102424: status=0x00000000 (SUCCESS)
```

`NtQueryFullAttributesFile` on canary's side returned a different
status across cold runs (cache-state-dependent). Ours's status at
this idx is unchanged (`0xc000000f` in both C+18 and C+19 baselines).
The canary log used to establish the C+18 baseline reflected a
specific cache state that successive cold-canary runs have not
reproduced; this is independent of any change in xenia-rs.

The C+19 fix's true effect is verified by direct inspection of
ours-cold.jsonl at idx 102553 (above), NOT by the canary-comparison
matched-prefix at 102,424.

## Sister chain summary

Unchanged from C+18 baseline (canary jitter doesn't affect sisters):

| chain                          | C+18    | C+19    | delta |
|--------------------------------|---------|---------|-------|
| canary tid=4 → ours tid=11     | 11      | 11      | 0     |
| canary tid=7 → ours tid=2      | 32      | 32      | 0     |
| canary tid=12 → ours tid=7     | 3       | 3       | 0     |
| canary tid=14 → ours tid=9     | 41      | 41      | 0     |
| canary tid=15 → ours tid=10    | 16      | 16      | 0     |

No sister-chain regressions.

## Conclusion

- Direct verification: D-NEW-1 RESOLVED.
- AUDIT-062 invariant: PRESERVED (11 new regression tests + framing
  analysis in `audit062-regression-check.md`).
- Cold-stable digest: UNCHANGED.
- Build + tests: PASS.
- Sister chains: UNCHANGED.
- Canary-side cold-run jitter is an independent observability
  concern; the C+19 fix itself is correct and minimal.

## Next target

**C+20 = D-NEW-2 (`KeWaitForSingleObject` `timeout_ns` mismatch on
canary tid=12 → ours tid=7 at idx=3)**. ε-class encoding divergence:
canary=`-30000000` ns, ours=`429466729600` ns. Likely a sign/scale
asymmetry in the timeout payload emitter.
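
The two logged values are related by exact arithmetic, which narrows the sign/scale hypothesis. A relative 30 ms timeout is -300,000 in 100 ns ticks; reading the low 32 bits of that tick count as unsigned and scaling by 100 reproduces ours's figure. The sketch below only checks the arithmetic. That ours's emitter actually truncates to 32 bits is an assumption these notes do not confirm, and the variable names are illustrative.

```python
# Assumption: the guest timeout is a signed count of 100 ns ticks,
# with a negative value meaning "relative to now".
TICK_NS = 100

canary_timeout_ns = -30_000_000        # value canary logs in wait.begin
ticks = canary_timeout_ns // TICK_NS   # -300_000 ticks, i.e. 30 ms relative

# Hypothetical failure mode: keep only the low 32 bits, read them as unsigned.
low32_unsigned = ticks & 0xFFFF_FFFF
ours_candidate_ns = low32_unsigned * TICK_NS

print(ticks, hex(low32_unsigned), ours_candidate_ns)
```

If the hypothesis holds, the emitter would need to sign-extend the tick count before scaling it to nanoseconds.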
@@ -0,0 +1,135 @@
# Phase A diff report

**This report is the output of Phase A's diff harness. Divergences
shown here are INPUT for Phase B (first-divergence localization),
not findings of Phase A.** Phase A's job is to make the harness
itself correct, not to analyze what it surfaces.

## Summary

| canary_tid | ours_tid | matched | canary_total | ours_total | first_divergence_at | floating_skipped (c/o) |
|---|---|---|---|---|---|---|
| 4 | 11 | 11 | 20000 | 11 | — | 0/0 |
| 6 | 1 | 102424 | 250000 | 108507 | 102424 | 0/0 |
| 7 | 2 | 32 | 32 | 33 | — | 0/0 |
| 12 | 7 | 3 | 20000 | 5 | 3 | 0/0 |
| 14 | 9 | 41 | 20000 | 77 | 41 | 0/0 |
| 15 | 10 | 16 | 20000 | 17 | — | 0/1 |

*`floating_skipped (c/o)` counts shared-global `handle.create` events absorbed by Phase C+18 cross-tid SID matching (per-side, observation-side ordering of process-global dispatchers). See schema-v1.md §"Shared-global SIDs".*

## canary_tid=4 → ours_tid=11

No divergence within the 11 compared events (canary has 20000, ours has 11).

## canary_tid=6 → ours_tid=1

First divergence at `tid_event_idx=102424`: payload.return_value: canary=0 ours=18446744072635809807

**Pre-context (last 5 matching events):**
```
canary: [102419] import.call RtlInitAnsiString
ours: [102419] import.call RtlInitAnsiString
canary: [102420] kernel.call RtlInitAnsiString
ours: [102420] kernel.call RtlInitAnsiString
canary: [102421] kernel.return RtlInitAnsiString
ours: [102421] kernel.return RtlInitAnsiString
canary: [102422] import.call NtQueryFullAttributesFile
ours: [102422] import.call NtQueryFullAttributesFile
canary: [102423] kernel.call NtQueryFullAttributesFile
ours: [102423] kernel.call NtQueryFullAttributesFile
```

**Divergent event:**
```
canary: [102424] kernel.return NtQueryFullAttributesFile
ours: [102424] kernel.return NtQueryFullAttributesFile
```

**Next event after the divergence (if any):**
```
canary: [102425] import.call RtlEnterCriticalSection
ours: [102425] import.call RtlNtStatusToDosError
```

**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1463590800, "kind": "kernel.return", "payload": {"name": "NtQueryFullAttributesFile", "return_value": 0, "side_effects": [], "status": "0x00000000"}, "schema_version": 1, "tid": 6, "tid_event_idx": 102424}
{"deterministic": true, "engine": "ours", "guest_cycle": 5391947, "host_ns": 477692480, "kind": "kernel.return", "payload": {"name": "NtQueryFullAttributesFile", "return_value": 18446744072635809807, "side_effects": [], "status": "0xc000000f"}, "schema_version": 1, "tid": 1, "tid_event_idx": 102424}
```
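
For readers comparing the `return_value` and `status` fields by eye: ours's 18446744072635809807 is the 32-bit status `0xC000000F` sign-extended to 64 bits and printed as an unsigned integer, so both fields report the same status. A minimal check, with a hypothetical helper name:

```python
def sign_extend_32_to_u64(value: int) -> int:
    """Sign-extend a 32-bit value to 64 bits and return it as an unsigned int."""
    value &= 0xFFFF_FFFF
    if value & 0x8000_0000:          # top bit set: fill the upper word with ones
        value |= 0xFFFF_FFFF_0000_0000
    return value


STATUS_SUCCESS = 0x0000_0000
STATUS_NO_SUCH_FILE = 0xC000_000F    # the status string ours logs

ours_return_value = 18446744072635809807
print(hex(ours_return_value), sign_extend_32_to_u64(STATUS_NO_SUCH_FILE))
```

The divergence at this index is therefore the status itself (SUCCESS on canary, NO_SUCH_FILE on ours), not the width of the return value.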

## canary_tid=7 → ours_tid=2

No divergence within the 32 compared events (canary has 32, ours has 33).

## canary_tid=12 → ours_tid=7

First divergence at `tid_event_idx=3`: payload.timeout_ns: canary=-30000000 ours=429466729600

**Pre-context (last 5 matching events):**
```
canary: [0] import.call KeWaitForSingleObject
ours: [0] import.call KeWaitForSingleObject
canary: [1] kernel.call KeWaitForSingleObject
ours: [1] kernel.call KeWaitForSingleObject
canary: [2] handle.create sid=c49d8f0ab90401ea
ours: [2] handle.create sid=6e3d96c5a52bf429
```

**Divergent event:**
```
canary: [3] wait.begin {'handles_semantic_ids': ['c49d8f0ab90401ea'], 'timeout_ns': -30000000, 'alertable': False, 'wait_type': 'any'}
ours: [3] wait.begin {'handles_semantic_ids': ['6e3d96c5a52bf429'], 'timeout_ns': 429466729600, 'alertable': False, 'wait_type': 'any'}
```

**Next event after the divergence (if any):**
```
canary: [4] kernel.return KeWaitForSingleObject
ours: [4] kernel.return KeWaitForSingleObject
```

**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1582189500, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["c49d8f0ab90401ea"], "timeout_ns": -30000000, "wait_type": "any"}, "schema_version": 1, "tid": 12, "tid_event_idx": 3}
{"deterministic": true, "engine": "ours", "guest_cycle": 0, "host_ns": 502700532, "kind": "wait.begin", "payload": {"alertable": false, "handles_semantic_ids": ["6e3d96c5a52bf429"], "timeout_ns": 429466729600, "wait_type": "any"}, "schema_version": 1, "tid": 7, "tid_event_idx": 3}
```

## canary_tid=14 → ours_tid=9

First divergence at `tid_event_idx=41`: payload.ord: canary=503 ours=293

**Pre-context (last 5 matching events):**
```
canary: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
ours: [36] kernel.call KeReleaseSpinLockFromRaisedIrql
canary: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
ours: [37] kernel.return KeReleaseSpinLockFromRaisedIrql
canary: [38] import.call KfLowerIrql
ours: [38] import.call KfLowerIrql
canary: [39] kernel.call KfLowerIrql
ours: [39] kernel.call KfLowerIrql
canary: [40] kernel.return KfLowerIrql
ours: [40] kernel.return KfLowerIrql
```

**Divergent event:**
```
canary: [41] import.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [41] import.call RtlEnterCriticalSection
```

**Next event after the divergence (if any):**
```
canary: [42] kernel.call XAudioGetVoiceCategoryVolumeChangeMask
ours: [42] kernel.call RtlEnterCriticalSection
```

**Raw events (JSON):**
```json
{"deterministic": true, "engine": "canary", "guest_cycle": 0, "host_ns": 1818928300, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "XAudioGetVoiceCategoryVolumeChangeMask", "ord": 503}, "schema_version": 1, "tid": 14, "tid_event_idx": 41}
{"deterministic": true, "engine": "ours", "guest_cycle": 417, "host_ns": 1711325930, "kind": "import.call", "payload": {"module": "xboxkrnl.exe", "name": "RtlEnterCriticalSection", "ord": 293}, "schema_version": 1, "tid": 9, "tid_event_idx": 41}
```

## canary_tid=15 → ours_tid=10

No divergence within the 16 compared events (canary has 20000, ours has 17).
@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}
@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}
@@ -0,0 +1,10 @@
{
  "instructions": 50000007,
  "imports": 40390,
  "unimpl": 0,
  "draws": 0,
  "swaps": 1,
  "unique_render_targets": 0,
  "shader_blobs_live": 0,
  "texture_cache_entries": 0
}
@@ -0,0 +1,237 @@
# Phase C+19 investigation: `NtDuplicateObject` handle.create (2026-05-14)

## Verified canary semantics (reading-error #28 discipline)

### `NtDuplicateObject_entry` (xboxkrnl_ob.cc:389-412)

```cpp
X_HANDLE new_handle = X_INVALID_HANDLE_VALUE;
X_STATUS result = kernel_state()->object_table()->DuplicateHandle(handle, &new_handle);
if (new_handle_ptr) { *new_handle_ptr = new_handle; }
if (options == 1 /* DUPLICATE_CLOSE_SOURCE */) {
  kernel_state()->object_table()->RemoveHandle(handle);
}
return result;
```

### `ObjectTable::DuplicateHandle` (object_table.cc:210-223)

```cpp
X_STATUS ObjectTable::DuplicateHandle(X_HANDLE handle, X_HANDLE* out_handle) {
  handle = TranslateHandle(handle);
  XObject* object = LookupObject(handle, false);  // refcount +1
  if (object) {
    result = AddHandle(object, out_handle);  // alloc fresh slot, refcount +1, EMIT handle.create
    object->Release();  // refcount -1 (offset LookupObject)
  }
  return result;
}
```

### `ObjectTable::AddHandle` (object_table.cc:148-208)

- Finds a fresh slot via `FindFreeSlot`.
- Stores `entry.object = object; entry.handle_ref_count = 1;`
- Computes `handle = (slot << 2) + kHandleBase` (or `+ kHandleHostBase`).
- `object->handles().push_back(handle)`.
- `object->Retain()`.
- **Emits `handle.create`** via `phase_a::EmitHandleCreateAuto` (cvar-gated, default-off) using the new handle's tid + tid_event_idx for SID, NOT short-circuited because we're not inside `GetNativeObject`.

### Net effect (source: S, dup: D, underlying XObject: O)

Before dup: `O.refcount = 1`, slots = {S → O}, handle.create(S) emitted earlier.

After dup:
- `O.refcount = 2` (one for each slot).
- `entry[S].handle_ref_count = 1`.
- `entry[D].handle_ref_count = 1`.
- handle.create(D) emitted at this dup.

Subsequent NtClose on either S or D:
- ReleaseHandle → `entry.handle_ref_count--`. If 0 → `RemoveHandle` → `entry.object = nullptr` + `object->Release()` → if `O.refcount == 0` → object dtor (emit `handle.destroy`).

So, on a first reading:
- `NtClose(S)` after dup: `entry[S].handle_ref_count: 1→0` → `RemoveHandle(S)` → `O.refcount: 2→1`. Object STILL ALIVE through D. NO handle.destroy.

Correction, on re-reading object_table.cc:294-295: `phase_a::EmitHandleDestroyAuto(handle, ...)` is emitted from inside `RemoveHandle`, which fires whenever a slot's ref_count hits 0. So canary emits handle.destroy on EVERY NtClose of EVERY slot, regardless of whether the underlying object still has other slots.

That means: canary emits handle.create(D) AND on close emits handle.destroy(D), then later handle.destroy(S). Two handle.create events / two handle.destroy events across the dup pair. Symmetric.
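
The two-level refcount and per-slot emission described above can be modelled in a few lines. This is a toy model written for these notes, not canary's code: the class, the starting handle value and the event tuples are all assumptions, and each slot is closed exactly once.

```python
class ToyObjectTable:
    """Toy model: per-slot handle_ref_count plus a per-object refcount."""

    def __init__(self):
        self.slots = {}       # handle -> [object_name, handle_ref_count]
        self.obj_refs = {}    # object_name -> refcount
        self.events = []
        self._next = 0x1000   # arbitrary starting handle value (assumption)

    def add_handle(self, obj):
        handle = self._next
        self._next += 4
        self.slots[handle] = [obj, 1]
        self.obj_refs[obj] = self.obj_refs.get(obj, 0) + 1   # Retain
        self.events.append(("handle.create", handle))
        return handle

    def duplicate(self, handle):
        obj, _ = self.slots[handle]
        return self.add_handle(obj)          # fresh slot, same object

    def close(self, handle):
        entry = self.slots[handle]
        entry[1] -= 1
        if entry[1] == 0:                    # slot gone: emit per slot
            del self.slots[handle]
            self.events.append(("handle.destroy", handle))
            self.obj_refs[entry[0]] -= 1     # Release
            if self.obj_refs[entry[0]] == 0:
                del self.obj_refs[entry[0]]  # object destroyed


table = ToyObjectTable()
s = table.add_handle("O")
d = table.duplicate(s)
table.close(s)
alive_after_source_close = "O" in table.obj_refs
table.close(d)
print(table.events)
```

In this model the object survives the close of S, and the event stream holds two `handle.create` and two `handle.destroy` entries, one pair per slot.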

## Ours's current behavior (exports.rs:5210-5240)

```rust
fn nt_duplicate_object(...) {
    let source = resolve_pseudo_handle(state, ctx.gpr[3] as u32);
    if !state.objects.contains_key(&source) { return STATUS_INVALID_HANDLE; }
    if out_ptr != 0 { mem.write_u32(out_ptr, source); } // dup_id = source_id
    if options & DUPLICATE_CLOSE_SOURCE == 0 {
        if let Some(c) = state.handle_refcount.get_mut(&source) { *c += 1; }
    }
    ctx.gpr[3] = STATUS_SUCCESS;
}
```

`dup_id` is aliased to `source_id`. Bumps `state.handle_refcount[source]` so the later `NtClose` pair (one per logical reference) doesn't destroy mid-flight. **No `handle.create` event** because no new id was allocated.

Subsequent `nt_close(handle)` decrements `handle_refcount[handle]`, emits `handle.destroy` only when it reaches 0.

## Phase A divergence

At main idx=102553, canary's tid=6 sequence after `NtDuplicateObject`:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] handle.create sid=df686b147b291902
[102554] kernel.return NtDuplicateObject
```

Ours's tid=1:
```
[102551] import.call NtDuplicateObject
[102552] kernel.call NtDuplicateObject
[102553] kernel.return NtDuplicateObject ← canary's [102554]
```

The visible delta is the missing `handle.create` between `kernel.call` and `kernel.return`.

## AUDIT-062 risk assessment (CRITICAL)

### What AUDIT-062 verified

> ours DOES dup the wedge (kernel-aliasing hypothesis falsified):
> tid=13 cycle=26711 r3=0x000012ac r4=0x40541E80 (out_ptr).
> Per ours's exports.rs:4263, NtDup aliases: dup_id = source_id = 0x12AC,
> refcount++. NOT a kernel bug.

The original AUDIT-062 framing said "NtDup aliasing is correct because the
dup_id resolves to the same KernelObject in `state.objects`". The wedge bug
was downstream (producer-side `NtSetEvent(worker_idle_event)` never firing).

### What is load-bearing about the aliasing

The wedge case in AUDIT-062 was:
1. tid=13 creates event `0x12AC`.
2. Some descendant calls `NtDuplicateObject(0x12AC, &dup)` → dup `0x12AC` (aliased).
3. tid=13 calls `KeWaitForSingleObject(0x12AC)` (the source).
4. Worker thread (eventually) calls `NtSetEvent(dup)` on `0x12AC`.

The load-bearing invariant is: **signal on dup wakes wait on source**. Why
this works today: both ids ARE the same id, so `state.objects.get(&0x12AC)`
finds the same `KernelObject::Event` with the same `waiters` Vec.

### The trap

If we change `nt_duplicate_object` to allocate a fresh `dup_id` and store it
as a NEW `state.objects` entry (e.g. cloning the Event), then signal-on-dup
sets the CLONED event's `signaled` flag, NOT the source's. tid=13 waiting on
source will sleep forever. **WEDGE REGRESSION.**
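
A small Python model makes the difference concrete. The `Event` dataclass, the dup id `0x12B0` and the dictionaries are stand-ins invented for illustration; they are not the real `KernelObject::Event` or `state.objects`.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Event:
    signaled: bool = False
    waiters: list = field(default_factory=list)


SOURCE, DUP = 0x12AC, 0x12B0   # DUP is a hypothetical fresh id

# Trap: the dup gets its own cloned entry in the object table.
objects = {SOURCE: Event(waiters=[13])}
objects[DUP] = copy.deepcopy(objects[SOURCE])
objects[DUP].signaled = True                    # NtSetEvent(dup)
clone_wakes_source = objects[SOURCE].signaled   # source never sees the signal

# Alias: the dup id maps back to the single source entry.
objects = {SOURCE: Event(waiters=[13])}
handle_aliases = {DUP: SOURCE}
target = handle_aliases.get(DUP, DUP)           # canonicalize before lookup
objects[target].signaled = True                 # NtSetEvent(dup)
alias_wakes_source = objects[SOURCE].signaled

print(clone_wakes_source, alias_wakes_source)
```

With a clone, the waiter on the source never observes the signal; with an alias, both ids reach the same object.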

### The fix

Allocate a fresh `dup_id`, do NOT clone the object. Instead store a
**handle alias** `dup_id → source_id` in `state.handle_aliases`. Whenever
the guest passes `dup_id` to any Nt*/Ke* call, resolve through the alias to
get `source_id`. Lookup `state.objects[source_id]`. The single underlying
`KernelObject::Event` retains the unified `waiters` list and `signaled`
flag. **Signal-on-dup still wakes wait-on-source** because both ids
canonicalize to the same source.

This mirrors canary's `LookupObject` which always indexes by slot, but the
underlying `XObject*` is shared. We achieve the same with the alias map.

### Refcount lifecycle

- Source close after dup: alias entry `dup_id → source_id` stays; underlying
  object stays alive because `handle_refcount[source_id]` was bumped in
  `nt_duplicate_object`. No `handle.destroy` emit (refcount > 0 after
  decrement).

Actually, to match canary's per-slot handle.destroy emission, we need each
NtClose on EITHER source or dup to emit handle.destroy (with the closed slot's
SID), and we only drop the underlying object when ALL slots are gone.

Cleanest design: track per-handle-id refcount separately:
- `handle_refcount[source_id]`: counts the source slot's references.
- `handle_refcount[dup_id]`: counts the dup slot's references.

Both start at 1 (fresh allocation, fresh dup).

`nt_close(source_id)`: decrement `handle_refcount[source_id]`. If 0, emit
`handle.destroy(source_id)`, remove the alias entries pointing AT source_id
if applicable, and decrement underlying-object refcount.

Actually that's complex. Let me simplify: mirror canary's two-level refcount
exactly via a new struct.

### Simplest model that preserves AUDIT-062 + emits handle.create

1. `state.handle_aliases: HashMap<u32, u32>` (alias_id → canonical_id).
2. `state.handle_refcount[id]` continues to mean: how many `NtClose` calls
   are needed on THIS id before its slot goes away.
3. `nt_duplicate_object`:
   - Compute `canonical = resolve_alias(source)` (in case source itself is an alias).
   - Alloc `dup_id` via `state.alloc_handle()`.
   - Insert alias `dup_id → canonical`.
   - `handle_refcount.insert(dup_id, 1)`.
   - Emit `handle.create(dup_id, object_type)` using `state.objects[canonical].schema_object_type()`.
   - If `options & DUPLICATE_CLOSE_SOURCE`, treat as a `NtClose(source)` after.
4. `nt_close(handle)`:
   - Decrement `handle_refcount[handle]` as today.
   - If reaches 0: emit `handle.destroy(handle)`. Remove the alias entry for
     `handle` (if it's an alias). If there are NO MORE alias slots pointing
     to canonical, AND `handle == canonical`, remove `state.objects[canonical]`.
   - To know "any more slots pointing to canonical", maintain
     `canonical_refcount: HashMap<u32, u32>` = number of live handle slots
     bound to canonical. Bumped at alloc/dup, decremented at close-with-rc-0.
5. `state.resolve_handle(h)`: returns `handle_aliases.get(&h).copied().unwrap_or(h)`.
6. Every Nt*/Ke* handler that looks up `state.objects` via a guest-provided
   handle id must call `state.resolve_handle(h)` first.
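
The model above can be exercised end to end in a short Python sketch. It is an illustration under stated assumptions, not the xenia-rs implementation: the class, the dict-based objects and the starting handle id are invented here, source handles are not validated, and the object is dropped as soon as `canonical_refcount` reaches zero regardless of which slot closed last (a simplification of step 4).

```python
class KernelState:
    """Toy model of the alias-map design; field names follow the list above."""

    def __init__(self):
        self.objects = {}             # canonical_id -> object (a dict here)
        self.handle_aliases = {}      # alias_id -> canonical_id
        self.handle_refcount = {}     # per-slot count of pending NtClose calls
        self.canonical_refcount = {}  # canonical_id -> live slots bound to it
        self.events = []
        self._next_id = 0x1000        # arbitrary starting id (assumption)

    def alloc_handle(self):
        handle = self._next_id
        self._next_id += 4
        return handle

    def resolve_handle(self, handle):
        return self.handle_aliases.get(handle, handle)

    def create_event(self):
        handle = self.alloc_handle()
        self.objects[handle] = {"signaled": False}
        self.handle_refcount[handle] = 1
        self.canonical_refcount[handle] = 1
        self.events.append(("handle.create", handle))
        return handle

    def duplicate(self, source):
        canonical = self.resolve_handle(source)   # flattens dup-of-dup
        dup = self.alloc_handle()
        self.handle_aliases[dup] = canonical
        self.handle_refcount[dup] = 1
        self.canonical_refcount[canonical] += 1
        self.events.append(("handle.create", dup))
        return dup

    def set_event(self, handle):
        self.objects[self.resolve_handle(handle)]["signaled"] = True

    def close(self, handle):
        canonical = self.resolve_handle(handle)
        self.handle_refcount[handle] -= 1
        if self.handle_refcount[handle] > 0:
            return
        del self.handle_refcount[handle]
        self.handle_aliases.pop(handle, None)
        self.events.append(("handle.destroy", handle))   # one per slot
        self.canonical_refcount[canonical] -= 1
        if self.canonical_refcount[canonical] == 0:
            del self.canonical_refcount[canonical]
            del self.objects[canonical]                  # last slot gone


state = KernelState()
src = state.create_event()
dup = state.duplicate(src)
dup2 = state.duplicate(dup)          # dup of a dup
state.set_event(dup)
signaled_via_dup = state.objects[src]["signaled"]
state.close(src)
alive_after_source_close = src in state.objects
state.close(dup)
state.close(dup2)
print(signaled_via_dup, alive_after_source_close, state.objects)
```

A production version would also need to reject lookups through a slot whose `handle_refcount` entry is gone, since a closed source id still resolves to itself while a dup keeps the object alive.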

### Coverage of state.objects lookups

`resolve_pseudo_handle` (18 call sites in exports.rs) will be extended to
chain through `state.resolve_handle`. Direct `ctx.gpr[3] as u32 → state.objects.get`
sites need explicit resolution. Survey identified the following direct sites
that need `state.resolve_handle` insertion:

- nt_read_file (1630): `let handle = ctx.gpr[3] as u32;`
- nt_write_file (similar)
- nt_set_event (4628)
- nt_clear_event (4651)
- nt_query_information_file, nt_set_information_file, nt_query_directory_file,
  nt_query_volume_information_file, nt_flush_buffers_file (file operations)
- nt_create_io_completion (and friends)

Will sweep these in the patch.

### Tests to add

1. `nt_duplicate_object_allocates_fresh_handle_id`: dup != source.
2. `nt_duplicate_object_emits_handle_create_event`: cvar-on, both
   handle.create events present.
3. `nt_duplicate_object_alias_resolves_to_canonical`:
   `state.resolve_handle(dup) == source`.
4. `nt_duplicate_object_signal_on_dup_wakes_wait_on_source` (AUDIT-062
   regression test): create event, dup, simulate NtSetEvent(dup), confirm
   `state.objects[source].signaled == true`.
5. `nt_duplicate_object_signal_on_source_wakes_wait_on_dup` (reverse
   symmetry).
6. `nt_duplicate_object_then_close_dup_keeps_source_live`: refcount and
   object presence after dup-close.
7. `nt_duplicate_object_then_close_source_keeps_dup_live`: reverse.
8. `nt_duplicate_object_close_source_then_close_dup_destroys_object`:
   final close destroys underlying.
9. `nt_duplicate_object_with_close_source_flag`: dup + close source in one
   call.
10. `nt_duplicate_object_invalid_handle_returns_status_invalid_handle`.
11. `nt_duplicate_object_writes_handle_id_to_out_ptr`.

## Plan

1. Implement `state.handle_aliases` + `state.canonical_refcount` + `resolve_handle`.
2. Rewrite `nt_duplicate_object` per Section "Simplest model".
3. Adjust `nt_close` and `resolve_pseudo_handle`.
4. Sweep direct `state.objects.get` sites: insert `state.resolve_handle()`.
5. Add 11 unit tests.
6. Build + test.
7. Cold-vs-cold rebaseline.
@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""Per-tid truncation for canary JSONL logs.

Canary's full boot log can exceed 800 MB; the diff tool loads the
entire file into RAM. We only need enough events per tid to walk past
the first divergence; anything beyond is dead weight. Cap each tid at
a configurable max (default: 250k for tid=6 main, 20k for others)."""

import json
import sys
from pathlib import Path

MAIN_CAP = 250_000  # tid=6 (canary's main chain, mapped to ours tid=1)
SISTER_CAP = 20_000  # everything else


def main() -> int:
    src = Path(sys.argv[1])
    dst = Path(sys.argv[2])
    counts: dict[int, int] = {}
    kept = 0
    total = 0
    with src.open("r", encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            if lineno == 1:
                fout.write(line)
                continue
            total += 1
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue
            tid = ev.get("tid", 0)
            cap = MAIN_CAP if tid == 6 else SISTER_CAP
            c = counts.get(tid, 0)
            if c >= cap:
                continue
            counts[tid] = c + 1
            fout.write(line)
            kept += 1
    print(f"kept {kept}/{total} events across {len(counts)} tids")
    for tid in sorted(counts):
        print(f" tid={tid:4d} {counts[tid]}")
    return 0


if __name__ == "__main__":
    sys.exit(main())