xenia-rs/audit-runs/phase-c10-NtQueryFullAttributesFile/escalation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Phase C+10 — NtQueryFullAttributesFile — ESCALATION
## Outcome
**Phase 1 (emitter extension) — LANDED**.
**Phase 4 fix (cache-state seeding) — ESCALATED**, deferred to a
dedicated cache-subsystem session.
The Phase A emitter now resolves OBJECT_ATTRIBUTES path arguments on
both engines (cvar-gated, default-off, behaviorally inert when off).
This permanent infrastructure surfaces the resolved path string for
this and every future file-IO divergence.
The actual cache-seeding fix needed to advance main matched-prefix
past 102,404 is out of scope per the user's escalation criteria.
## Captured framing (post-extension)
Both engines now log the resolved path at `kernel.call.args_resolved`:
```
canary[6][102403]: NtQueryFullAttributesFile args_resolved.path = "cache:\\d4ea4615\\e\\46ee8ca"
ours [1][102403]: NtQueryFullAttributesFile args_resolved.path = "cache:\\d4ea4615\\e\\46ee8ca"
canary[6][102404]: kernel.return return_value = 0 (STATUS_SUCCESS)
ours [1][102404]: kernel.return return_value = 0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND)
```
Both engines query the **same path**. Canary returns SUCCESS because
its cache directory (`/home/fabi/.local/share/Xenia/cache/`) is
**pre-populated** with 23 files (~5 MB) accumulated over prior
Sylpheed boots. Ours's cache directory is wiped fresh per AUDIT-038.
After this query, canary follows up with `NtCreateFile` for the same
path (idx 102481), so it actually reads the cached data. Merely
returning a fake SUCCESS without backing bytes would therefore only
push the divergence ~78 events forward.
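
The "matched-prefix" figure used throughout this note can be illustrated with a minimal sketch. It assumes the matched prefix is the length of the longest common prefix of the two event streams, so its value is also the index of the first diverging event. The `Event` struct and its two fields are hypothetical stand-ins; the real Phase A records carry more fields.

```rust
/// One Phase A event reduced to the fields compared in this sketch.
/// Hypothetical shape: the real log records are richer than this.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    /// Event kind, e.g. "kernel.call" or "kernel.return".
    pub kind: String,
    /// Deterministic payload, e.g. a resolved path or a return value.
    pub payload: String,
}

/// Length of the longest common prefix of the two streams.
/// Events at indices below the returned value are identical on both
/// engines; the returned value is the index of the first divergence
/// (or the shorter stream's length if no event differs).
pub fn matched_prefix(ours: &[Event], canary: &[Event]) -> usize {
    ours.iter()
        .zip(canary.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

Under that reading, a matched prefix of 102,404 means the call at idx 102403 agrees on both engines and the return at idx 102404 is the first event that differs.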
## Classification (per plan Phase 4)
**(A) Missing file — narrowly true (this single cache entry), but**
**(D) Subsystem-required — actual scope**.
Choices considered:
1. **Plant a single file**: would only push the divergence to the
next cache-existence query (16+ distinct hashes in
`cache:\<HASH1>\<X>\<HASH2>` form). Most of the 23 files in canary's
cache follow this pattern, so after each plant the next query
still misses.
2. **Seed ours's cache from canary's**: 23 files, ~5 MB. Mechanically
easy (~30 LOC `copy_dir_all`) but violates AUDIT-038's no-oracle-
state line AND AUDIT-053's documented warm-start regression
(Sylpheed's `cache:\*.tmp` journal-style writes append per boot,
making a naive persistent seed self-inconsistent after the second
boot — `runtime_error` throws from version-check on reload).
3. **Lie SUCCESS on cache: existence + lie SUCCESS on subsequent
NtCreateFile + return zero-byte file**: changes Nt semantics
game-wide, likely breaks any read that expects valid content.
4. **Implement the game's cache-generation logic**: that's the
shader/PSO/material cache build subsystem — multi-hundred-LOC
generative subsystem, not in scope.
The user's escalation criteria explicitly call out
"cache-population infrastructure" as ESCALATION. Choices 2-4 fit
that. Choice 1 doesn't solve the problem.
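
To gauge how many plants choice 1 would need, the resolved paths from a Phase A log could be bucketed by their first hash component. The sketch below assumes the `cache:\<HASH1>\<X>\<HASH2>` form quoted above and a case-sensitive `cache:\` prefix; both function names are hypothetical and are not part of either engine.

```rust
use std::collections::BTreeSet;

/// Split a path of the form `cache:\<HASH1>\<X>\<HASH2>` into its three
/// components. Returns None for any other shape (e.g. `cache:\foo.tmp`).
pub fn parse_cache_path(path: &str) -> Option<(&str, &str, &str)> {
    let rest = path.strip_prefix("cache:\\")?;
    let mut parts = rest.split('\\');
    let (h1, x, h2) = (parts.next()?, parts.next()?, parts.next()?);
    // Reject extra components and empty segments.
    if parts.next().is_some() || h1.is_empty() || x.is_empty() || h2.is_empty() {
        return None;
    }
    Some((h1, x, h2))
}

/// Collect the distinct HASH1 values seen in a list of resolved paths.
pub fn distinct_first_hashes<'a>(paths: &[&'a str]) -> BTreeSet<&'a str> {
    paths
        .iter()
        .copied()
        .filter_map(parse_cache_path)
        .map(|(h1, _, _)| h1)
        .collect()
}
```

Each distinct value is a separate existence query that a single planted file would not satisfy.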
## What was landed (Phase 1 only)
Permanent emitter extension on both engines, schema-v1-compatible:
`args_resolved` was already part of v1; this just populates it for
exports that take an `OBJECT_ATTRIBUTES*`.
### Ours side (~50 LOC additive)
- `xenia-rs/crates/xenia-kernel/src/event_log.rs`:
  - New `emit_kernel_call_with_path(tid, cycle, name, Option<&str>)`
    that mirrors `emit_kernel_call` but adds
    `args_resolved:{"path":"..."}` when the path is non-empty.
    Degrades to the existing empty-object form otherwise so output
    is byte-identical to pre-extension when the path is null.
- `xenia-rs/crates/xenia-kernel/src/path.rs`:
  - New `object_attributes_raw_name(mem, ptr) -> Option<String>`
    that returns the **raw** trimmed path (no prefix-strip, no
    case-fold). The emitter uses raw form so the diff surfaces
    upstream differences (e.g. if one engine called with one prefix
    and the other with a different prefix), not just post-normalize
    differences.
- `xenia-rs/crates/xenia-kernel/src/state.rs`:
  - In `call_export`, when `phase_a_on` and `name` matches one of
    `{NtCreateFile, NtOpenFile, NtQueryFullAttributesFile,
    NtOpenSymbolicLinkObject}`, resolve `OBJECT_ATTRIBUTES*` from the
    appropriate gpr position (verified against canary's
    xboxkrnl_io.cc signatures) and call
    `emit_kernel_call_with_path`. Otherwise call the legacy
    `emit_kernel_call`.
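
The raw-name read described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `path.rs`: guest memory is modelled as a flat byte slice, the guest structures are assumed to be big-endian with the layout `OBJECT_ATTRIBUTES { root_directory: u32, name_ptr: u32, attributes: u32 }` and `ANSI_STRING { length: u16, maximum_length: u16, buffer: u32 }`, and "trimmed" is assumed to mean cutting at the first NUL and stripping surrounding whitespace.

```rust
/// Bounds-checked big-endian u16 read from a flat guest-memory slice.
fn be_u16(mem: &[u8], addr: usize) -> Option<u16> {
    let b = mem.get(addr..addr.checked_add(2)?)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Bounds-checked big-endian u32 read from a flat guest-memory slice.
fn be_u32(mem: &[u8], addr: usize) -> Option<u32> {
    let b = mem.get(addr..addr.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Return the raw path named by an OBJECT_ATTRIBUTES pointer, with no
/// prefix stripping and no case folding. Layout offsets are assumptions:
///   OBJECT_ATTRIBUTES: +0 root_directory, +4 name_ptr, +8 attributes
///   ANSI_STRING:       +0 length, +2 maximum_length, +4 buffer
pub fn object_attributes_raw_name(mem: &[u8], ptr: u32) -> Option<String> {
    if ptr == 0 {
        return None;
    }
    let name_ptr = be_u32(mem, (ptr as usize).checked_add(4)?)? as usize;
    if name_ptr == 0 {
        return None;
    }
    let len = be_u16(mem, name_ptr)? as usize;
    let buf = be_u32(mem, name_ptr.checked_add(4)?)? as usize;
    let bytes = mem.get(buf..buf.checked_add(len)?)?;
    // Assumed meaning of "trimmed": stop at the first NUL, then strip
    // surrounding whitespace. Nothing else is normalized.
    let end = bytes.iter().position(|&c| c == 0).unwrap_or(bytes.len());
    let name = String::from_utf8_lossy(&bytes[..end]).trim().to_string();
    if name.is_empty() {
        None
    } else {
        Some(name)
    }
}
```

Returning `None` for null pointers, out-of-range reads, and empty names lets the caller fall back to the legacy emitter without a separate error path.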
### Canary side (~80 LOC additive)
- `xenia-canary/src/xenia/kernel/event_log.h`:
  - New `EmitKernelCallWithPath(name, path)` mirroring ours.
- `xenia-canary/src/xenia/kernel/event_log.cc`:
  - Implementation of `EmitKernelCallWithPath`.
  - New `phase_a_bridge::EmitImportAndCallWithCtx(module, ord, name,
    ppc_context)` that dispatches by `name` to read OBJECT_ATTRIBUTES
    from the PPCContext gpr and call the path-bearing form. Falls
    back to the legacy form when name doesn't match.
  - Helper `ReadObjectAttributesRawName(obj_attrs_ptr)` that mirrors
    ours's `object_attributes_raw_name` semantically (raw trimmed,
    no normalization).
- `xenia-canary/src/xenia/kernel/util/shim_utils.h`:
  - Both trampolines (X::Trampoline / Y::Trampoline) switched from
    `EmitImportAndCall(...)` to `EmitImportAndCallWithCtx(...,
    ppc_context)`. PPCContext is already in scope at that call site
    (it's the first argument the trampoline receives).
Total: ~50 LOC on ours and ~80 LOC on canary. Both sides are behaviorally inert when the cvar is OFF.
## Gates (Phase 1 extension only — all pass)
| # | gate | result |
|---|---|---|
| 1 | cvar-OFF determinism 50M (3 runs) | PASS — all 3 = `b8fa0e0460359a4f660adb7605e053de` (matches C+9 baseline, unchanged) |
| 2 | Phase B `image_loaded_sha256` | PASS — `ea8d160e9369328a5b922258a92113efb8d7ce3e1a5c12cc521e375985c91c18` (matches baseline) |
| 3 | Phase A main matched-prefix | UNCHANGED — 102404 (extension was framing-only; no fix landed; no advance expected) |
| 4 | Both engines build clean | PASS |
| 5 | Phase A emitter det fields (2 runs) | PASS — both = `7489e90ef4c9be629af8c9fabb1cbdd7` (new; replaces C+9's `0b299c37…` because the new args_resolved.path field is part of the det signature) |
| 6 | Unit tests | PASS — 165 → 165 (no new, no regressions) |
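
Gates 1 and 5 both reduce to "every run digest equals one expected value". A minimal sketch of that check follows; the function name and the error wording are hypothetical, and the digests are assumed to be hex strings compared case-insensitively.

```rust
/// Pass only if at least one run was supplied and every run digest
/// equals the baseline (hex compared case-insensitively).
pub fn determinism_gate(baseline: &str, runs: &[&str]) -> Result<(), String> {
    if runs.is_empty() {
        return Err("no runs supplied".to_string());
    }
    for (i, digest) in runs.iter().enumerate() {
        if !digest.eq_ignore_ascii_case(baseline) {
            // Report the first mismatching run, numbered from 1.
            return Err(format!(
                "run {} digest {} != baseline {}",
                i + 1,
                digest,
                baseline
            ));
        }
    }
    Ok(())
}
```

Treating an empty run list as a failure avoids a gate that passes vacuously when a capture step silently produced nothing.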
## Schema status
The args_resolved field is part of schema-v1 already; this Phase only
**populates** it for a subset of exports. No schema version bump.
The schema-v1 example (`schema-v1.md:112`) shows exactly the form we
emit. We are now compliant with the documented schema for path-bearing
exports rather than emitting an empty stub.
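
The two emitted forms, populated and empty stub, can be sketched as a small renderer. This is a hedged illustration, not the emitter in `event_log.rs`: `args_resolved_json` and `json_escape` are hypothetical names, and only the `args_resolved` member is rendered rather than the whole record.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render the `args_resolved` value: `{"path":"..."}` when a non-empty
/// path is available, and the empty object `{}` otherwise, so output for
/// exports without a path is unchanged.
pub fn args_resolved_json(path: Option<&str>) -> String {
    match path {
        Some(p) if !p.is_empty() => format!("{{\"path\":\"{}\"}}", json_escape(p)),
        _ => "{}".to_string(),
    }
}
```

Escaping is why the captured log lines show doubled backslashes: the guest path `cache:\d4ea4615\e\46ee8ca` is written as `"cache:\\d4ea4615\\e\\46ee8ca"` in JSON.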
## Cascade prediction (resolution / next steps)
| stage | predicted | outcome |
|---|---|---|
| A=extend emitter cleanly | ~80% | LANDED |
| B=capture path string both engines | ~85% | LANDED — `cache:\d4ea4615\e\46ee8ca` matched both engines |
| C=classify root cause | ~75% | DONE — Class D (subsystem-required) |
| D=land fix in scope | ~55% | **ESCALATED** — fix is choice 2-4 above |
| E=main chain advances past 102404 | ~50% | NOT THIS SESSION |
## Reading-error class
NO new class. Existing classes #15 / ζ (VFS layout aliasing,
AUDIT-053) and AUDIT-038 (no oracle state) are re-affirmed:
* Class #15 ζ (AUDIT-053): persistent cache + journal `.tmp` writes
create a warm-start regression.
* AUDIT-038 line: oracle state is forbidden in default boot.
Both rules together make the cache-seeding fix subsystem-tier, not
single-fix-tier.
## Handoff to dedicated cache-subsystem session
The next session targeting this divergence should:
1. **Decide cache-state strategy**:
   - (a) Implement Sylpheed's cache-generation logic so ours builds
     its own cache from scratch (matches canary's own bootstrap
     experience, but multi-hundred-LOC).
   - (b) Seed-once-then-persist: copy canary's cache into ours's
     cache_root behind a new cvar `--cache-seed-from=<path>`, then
     enable persistence. AUDIT-053's warm-start regression must be
     re-tested with AUDIT-054's FILE_DIRECTORY_FILE fix in tree
     (it landed AFTER 053's regression was observed).
   - (c) Hybrid: synthesize a stub success at NtQueryFullAttributesFile
     for known-good cache hashes, then synthesize NtCreateFile/Read
     responses with bytes captured from canary's cache files. Closest
     to a "single missing file plant" but for 23 files.
2. **Re-validate after the fix** that the warm-start regression
identified in AUDIT-053 doesn't recur (AUDIT-054 may have fixed
it; needs explicit re-test).
3. **Expect cascading Phase A divergences**: the game looks up each
cache hash in turn, and the divergence at 102,404 is only the
FIRST. After `cache:\d4ea4615` is resolved, the game queries
`cache:\69d8e45c` (idx 103810 already visible in ours.jsonl) and
so on through 16+ distinct hashes per AUDIT-052.
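
For strategy (b), the seeding step itself is a recursive directory copy gated on an opt-in setting. The sketch below uses only `std::fs`; `maybe_seed_cache` and its `seed_from` parameter are hypothetical stand-ins for the proposed `--cache-seed-from=<path>` cvar, and the code does nothing to address the AUDIT-053 warm-start regression, which still needs the explicit re-test described above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively copy the contents of `src` into `dst`, creating
/// directories as needed. Returns the number of files copied.
/// Symlinks are not handled specially in this sketch.
pub fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<u64> {
    fs::create_dir_all(dst)?;
    let mut copied = 0;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copied += copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
            copied += 1;
        }
    }
    Ok(copied)
}

/// Seed the cache only when the (hypothetical) opt-in source is set.
/// With `None`, the default boot path touches nothing, which keeps
/// oracle state out of default runs.
pub fn maybe_seed_cache(seed_from: Option<&Path>, cache_root: &Path) -> io::Result<u64> {
    match seed_from {
        Some(src) => copy_dir_all(src, cache_root),
        None => Ok(0),
    }
}
```

Keeping the copy behind an `Option` mirrors the cvar-gated, default-off pattern already used for the Phase A emitter.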
## Files in this audit run
| file | content |
|---|---|
| `escalation.md` | this file |
| `investigation.md` | Phase 1-4 walkthrough |
| `re-validation.md` | gate results (Phase 1 extension only) |
| `ours.jsonl`, `ours-determ.jsonl`, `canary.jsonl` | Phase A logs with new args_resolved field |
| `diff-report.md` | re-run with path field populated |
| `snap/ours/` | Phase B snapshot (unchanged from C+9) |
| `digest-cvaroff-{1,2,3}.json` | 3× determinism (all = C+9 baseline) |
## Next target
**Same idx 102,404 NtQueryFullAttributesFile**, but in a dedicated
cache-subsystem session. Path framing is now captured for the next
investigator's first read.