xenia-rs/audit-runs/phase-c10-NtQueryFullAttributesFile/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00


# Phase C+10 — NtQueryFullAttributesFile — Investigation
## Phase 1: Emitter extension (LANDED)
### Problem
C+9 left the divergence with no resolved path string:
```
canary[6][102404] kernel.return NtQueryFullAttributesFile return_value=0
ours [1][102404] kernel.return NtQueryFullAttributesFile return_value=0xC0000034
```
`payload.args` and `payload.args_resolved` were both empty objects.
We had no way to identify WHICH file the engine was querying.
### Shape of the fix
Schema v1 already declares `args_resolved` as a free-form object
attached to `kernel.call` (schema-v1.md:108-117), and the existing
example explicitly shows `{"path":"..."}`. The emitter just wasn't
populating it. Extension is pure schema-v1 compliance, no version
bump.
#### Ours-side (event_log.rs / path.rs / state.rs)
- Added `event_log::emit_kernel_call_with_path(tid, cycle, name,
Option<&str>)` — same byte format as `emit_kernel_call`, but when
`path` is `Some(non_empty)` emits `args_resolved:{"path":"..."}`.
When `None` or empty, degrades to the existing
`args_resolved:{}` form so unrelated exports' output is
byte-identical to pre-extension.
- Added `path::object_attributes_raw_name(mem, ptr) -> Option<String>`
— returns the RAW path string (trimmed of whitespace, NO
prefix-strip / no case-fold) so the diff surfaces upstream
prefix-form differences instead of masking them via normalization.
Pre-existing `object_attributes_to_vfs_path` (which DOES normalize)
is kept as-is for VFS lookup callers; emitter uses the new raw
helper.
- `state.rs::call_export`, inside the `phase_a_on` guarded block:
  new `match name` resolves OBJECT_ATTRIBUTES* from the right gpr
  position. Argument positions verified against canary's
  `xboxkrnl/xboxkrnl_io.cc` signatures:
  - `NtQueryFullAttributesFile` → r3 = obj_attrs
  - `NtOpenSymbolicLinkObject` → r4 = obj_attrs
  - `NtCreateFile`, `NtOpenFile` → r5 = obj_attrs

  Then calls `emit_kernel_call_with_path(..., resolved.as_deref())`
  instead of `emit_kernel_call(...)`. All other exports fall through
  to `None` and the legacy form.
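
The degrade-to-legacy behaviour of the emitter can be sketched as follows. This is a minimal illustration, not the code in `event_log.rs`: the helper name `args_resolved_fragment` is hypothetical, it renders only the `args_resolved` member rather than the full event line, and the JSON escaping shown is an assumption about what a path with backslashes requires.

```rust
/// Hypothetical helper: renders only the `args_resolved` member of a
/// `kernel.call` event. The real emitter's full byte format is not shown.
fn args_resolved_fragment(path: Option<&str>) -> String {
    match path {
        // A non-empty path is emitted as {"path":"..."} with JSON escaping.
        Some(p) if !p.is_empty() => {
            let mut out = String::from("\"args_resolved\":{\"path\":\"");
            for c in p.chars() {
                match c {
                    '\\' => out.push_str("\\\\"),
                    '"' => out.push_str("\\\""),
                    // Control characters become \u00XX escapes.
                    c if (c as u32) < 0x20 => {
                        out.push_str(&format!("\\u{:04x}", c as u32))
                    }
                    c => out.push(c),
                }
            }
            out.push_str("\"}");
            out
        }
        // `None` or an empty string degrades to the legacy empty object,
        // so exports without a path keep their pre-extension output.
        _ => String::from("\"args_resolved\":{}"),
    }
}
```

Calling it with `Some(r"cache:\d4ea4615\e\46ee8ca")` yields a `path` member with doubled backslashes, while `None` and `Some("")` both yield `"args_resolved":{}`.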
#### Canary-side (event_log.h / event_log.cc / util/shim_utils.h)
- `event_log.h`: declared `EmitKernelCallWithPath(name, path)`.
- `event_log.cc`: implemented the same way as ours (degrades to the
legacy form for an empty path).
- `event_log.cc::phase_a_bridge::EmitImportAndCallWithCtx(module,
ord, name, ppc_context)` — new bridge function. PPCContext is
passed as `void*` to keep the header transitive include footprint
small (the bridge cc reinterprets to PPCContext* internally).
Inside the bridge, helper `ReadObjectAttributesRawName(ptr)` reads
the X_OBJECT_ATTRIBUTES.name_ptr, then the X_ANSI_STRING bytes
directly out of guest memory (no util::TranslateAnsiPath
normalization). Trims whitespace + trailing NULs to match ours's
semantics byte-for-byte.
- `util/shim_utils.h`: both export trampolines (X::Trampoline /
Y::Trampoline) switched the `phase_a_bridge::EmitImportAndCall`
call to `phase_a_bridge::EmitImportAndCallWithCtx`, passing the
existing `ppc_context` argument that's already in scope. The
legacy `EmitImportAndCall` stays declared and defined for any
future callers that don't have a PPCContext.
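
The raw-name read that both sides perform can be sketched in Rust as below. It is a sketch under stated assumptions: guest memory is modelled as a flat byte slice indexed directly by guest address (the real engines translate addresses), and the struct layouts (OBJECT_ATTRIBUTES with `name_ptr` at offset 4; ANSI_STRING as `u16 length`, `u16 maximum_length`, `u32 buffer`, all big-endian) are assumed rather than taken from this note. The helper names `be_u16` and `be_u32` are hypothetical.

```rust
/// Reads a big-endian u32 at `addr`, or None if out of bounds.
fn be_u32(mem: &[u8], addr: usize) -> Option<u32> {
    let b = mem.get(addr..addr.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Reads a big-endian u16 at `addr`, or None if out of bounds.
fn be_u16(mem: &[u8], addr: usize) -> Option<u16> {
    let b = mem.get(addr..addr.checked_add(2)?)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Returns the RAW path: no prefix stripping, no case folding.
/// Assumed layouts (big-endian):
///   OBJECT_ATTRIBUTES { root_directory: u32, name_ptr: u32, attributes: u32 }
///   ANSI_STRING       { length: u16, maximum_length: u16, buffer: u32 }
fn object_attributes_raw_name(mem: &[u8], obj_attrs_ptr: u32) -> Option<String> {
    if obj_attrs_ptr == 0 {
        return None;
    }
    let name_ptr = be_u32(mem, (obj_attrs_ptr as usize).checked_add(4)?)? as usize;
    if name_ptr == 0 {
        return None;
    }
    let len = be_u16(mem, name_ptr)? as usize;
    let buf = be_u32(mem, name_ptr.checked_add(4)?)? as usize;
    let bytes = mem.get(buf..buf.checked_add(len)?)?;
    // One byte maps to one char, so no byte is silently replaced.
    let raw: String = bytes.iter().map(|&b| b as char).collect();
    // Trim trailing NULs first, then surrounding whitespace.
    Some(raw.trim_end_matches('\0').trim().to_string())
}
```

Keeping the read free of normalization is what lets the diff compare the exact bytes each engine was handed.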
### Verification
- Build both engines clean.
- Determinism 3x: digest md5 = `b8fa0e0460359a4f660adb7605e053de`
(identical to C+9 baseline — extension is cvar-OFF zero-cost).
- Phase A emitter determinism 2x: det-fields md5 = `7489e90e…` byte
identical. (Different from C+9's `0b299c37…` because the path
field IS in the deterministic signature — but stable across runs.)
## Phase 2: Re-run + capture path string
After the extension, both engines emit the path at
`kernel.call.args_resolved.path`:
```
canary[6][102403] NtQueryFullAttributesFile path = "cache:\d4ea4615\e\46ee8ca"
ours [1][102403] NtQueryFullAttributesFile path = "cache:\d4ea4615\e\46ee8ca"
```
Both engines query the **same path**. No upstream divergence — the
ANSI_STRING content matches byte-for-byte.
## Phase 3: Why does ours say NOT_FOUND?
### Trace through ours's `nt_query_full_attributes_file`
`exports.rs:1913-1990`:
1. Read OBJECT_ATTRIBUTES → path =
`"cache:/d4ea4615/e/46ee8ca"` (after `normalize_path`).
2. `state.resolve_cache_path(&path)` returns
`Some(<temp_dir>/xenia-rs-cache-<pid>-0/d4ea4615/e/46ee8ca)`.
3. `std::fs::metadata(host_path)` returns `Err(NotFound)`.
4. Return `STATUS_OBJECT_NAME_NOT_FOUND` (`0xC0000034`).
The host path doesn't exist because ours's `init_cache_root`
(`state.rs:499-510`) **clears** the cache directory on every boot
(AUDIT-038 line: per-process tmpdir + full wipe so two consecutive
runs see byte-identical initial state).
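
The trace above reduces to a resolve-then-stat sequence. A minimal sketch follows, assuming a hypothetical `resolve_cache_path` that simply joins the normalized path under a host cache root. The real `state.resolve_cache_path` and `nt_query_full_attributes_file` are not reproduced here, and mapping every host error to `0xC0000034` is an assumption made for brevity.

```rust
use std::fs;
use std::path::{Path, PathBuf};

const STATUS_SUCCESS: u32 = 0;
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034;

/// Hypothetical stand-in for `state.resolve_cache_path`: maps a normalized
/// `cache:/...` guest path to a location under the host cache root.
fn resolve_cache_path(cache_root: &Path, guest_path: &str) -> Option<PathBuf> {
    let rel = guest_path.strip_prefix("cache:/")?;
    // Refuse parent-directory segments so the result stays under the root.
    if rel.split('/').any(|seg| seg == "..") {
        return None;
    }
    Some(cache_root.join(rel))
}

/// Sketch of the status decision only; the real export also fills in the
/// file-information structure on success.
fn query_full_attributes(cache_root: &Path, guest_path: &str) -> u32 {
    let host = match resolve_cache_path(cache_root, guest_path) {
        Some(p) => p,
        None => return STATUS_OBJECT_NAME_NOT_FOUND,
    };
    match fs::metadata(&host) {
        Ok(_) => STATUS_SUCCESS,
        // Assumption: any host error is reported as "name not found".
        Err(_) => STATUS_OBJECT_NAME_NOT_FOUND,
    }
}
```

With a freshly wiped root, the query for `cache:/d4ea4615/e/46ee8ca` takes the error branch; once a file exists at that host path the same call returns `0`, which is the asymmetry between the two engines.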
### Why does canary's lookup NOT fail?
`xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io.cc:474-513`:
1. Read OBJECT_ATTRIBUTES → target_path via TranslateAnsiPath.
2. `kernel_state()->file_system()->ResolvePath(target_path)`.
3. If `entry` found, populate file_info, return `X_STATUS_SUCCESS`.
4. Else return `X_STATUS_NO_SUCH_FILE` (`0xC000000F`).
Canary returns 0 → entry was found. Canary's cache mount is at
`/home/fabi/.local/share/Xenia/cache/` (a persistent host directory
populated over prior boots).
### Verification of canary's cache state
```
$ ls /home/fabi/.local/share/Xenia/cache/d4ea4615/e/
-rw-rw-r-- 1 fabi fabi 400 May 11 21:01 46ee8ca
```
Single 400-byte file. Total cache: 23 files, ~5 MB across 16
distinct top-level hash directories.
### Sibling-cache observations
ours.jsonl shows the SAME `NtQueryFullAttributesFile` fires for
multiple cache paths within the 50M window — all returning
`0xC0000034`. Example: idx 103810 queries
`cache:\69d8e45c\8\3421153`. So the divergence is not a single
missing file but a class of 16+ missing hashes.
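
One way to size that class is to tally the distinct top-level hash directories among the failing paths. The sketch below assumes the raw paths have already been extracted from `ours.jsonl` (the extraction itself is not shown), and the helper names `top_level_hash` and `distinct_cache_hashes` are hypothetical.

```rust
use std::collections::BTreeSet;

/// Returns the first path component after the `cache:\` prefix, if any.
fn top_level_hash(raw_path: &str) -> Option<&str> {
    let rest = raw_path.strip_prefix("cache:\\")?;
    let first = rest.split('\\').next()?;
    if first.is_empty() {
        None
    } else {
        Some(first)
    }
}

/// Collects the distinct top-level hashes across a set of raw paths.
/// Paths outside the `cache:` device are ignored.
fn distinct_cache_hashes<'a>(
    paths: impl IntoIterator<Item = &'a str>,
) -> BTreeSet<&'a str> {
    paths.into_iter().filter_map(top_level_hash).collect()
}
```

Fed the two paths quoted in this note, `cache:\d4ea4615\e\46ee8ca` and `cache:\69d8e45c\8\3421153`, it reports two distinct hashes, `d4ea4615` and `69d8e45c`.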
## Phase 4: Classification + scope decision
Per the plan, the classes are:
* **(A) Missing file** — a single plant fixes it (small).
* **(B) Path-normalization bug** — string operation (small).
* **(C) VFS mount missing** — add the mount (small-medium).
* **(D) Subsystem-required** — STFS or similar — **ESCALATE**.
* **(E) Upstream divergence** — walk back.
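This is **NOT (B)**: both engines read the identical raw path
(matching `args_resolved.path`), and ours's normalized form resolves
to the expected host location (Phase 3, step 2); the lookup fails
only because the file is absent.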
This is **NOT (E)** — upstream is bit-identical for 102,403 events.
This is **NOT (A)** for any single file — the game queries 16+
distinct cache hashes; planting one only postpones the divergence.
This is **closest to a hybrid (C+D)**:
* **(C)-ish**: canary's cache MOUNT resolves to a populated host dir;
  ours's mount resolves to a wiped tmp dir.
* **(D)-ish**: canary's cache is populated because it ran the game
  before and the game **built** the cache. To match canary's state
  on a fresh boot, we either:
  - implement the game's cache-build logic (subsystem),
  - copy canary's pre-built cache (oracle state; AUDIT-038
    violation),
  - or accept that ours runs cold and the divergence is a
    fundamental cold-vs-warm asymmetry.
### AUDIT-053 cross-check (warm-start regression risk)
Per AUDIT-053 memo:
> Phase 2 permanent fix REVERTED — warm-start regression from VFS
> layout aliasing: `open_cache_file` treats all `NtCreateFile` as
> files, but `cache:\d4ea4615 disp=CREATE` is meant as a DIRECTORY.
AUDIT-054 fixed that specific aliasing (FILE_DIRECTORY_FILE bit
threading). But there's still the AUDIT-053 secondary concern:
Sylpheed's `cache:\<hash>.tmp` journal-style writes append on each
boot — making naive persistence self-inconsistent across boots.
Whether AUDIT-054's fix fully unblocks persistence is **NOT
RE-VERIFIED** in this session. Re-testing the AUDIT-053 regression
under AUDIT-054's fix-in-tree is itself a follow-up.
### Scope per user direction
User said:
> If the fix requires major VFS work, STFS subsystem
> implementation, or cache-population infrastructure: ESCALATE.
Of the choices in `escalation.md`, Choice 1 does not solve the
problem and Choices 2-4 all qualify as "cache-population
infrastructure":
* Choice 1 (single file plant) won't solve the problem (16+ hashes).
* Choice 2 (seed from canary) is oracle state + warm-start regression
risk per AUDIT-053.
* Choice 3 (synthesize cache reads) is multi-export semantic-change.
* Choice 4 (build cache from scratch) is a full subsystem.
**ESCALATION declared.** Phase 1 emitter extension landed as the
session's permanent infrastructure contribution.
## Discipline check
* **Reading-error #28** (canary source-of-truth): verified canary's
actual `NtQueryFullAttributesFile_entry` body
(`xboxkrnl_io.cc:474-513`), did not assume.
* **Reading-error #23** (downstream regression): no fix landed, so
no regression risk. Emitter extension is cvar-OFF zero-cost.
* **Escalation discipline**: triggered cleanly; explicit memo;
contributing infrastructure (emitter path resolution) kept.
* **Path encoding**: ANSI_STRING raw bytes captured; both engines
agree byte-for-byte; no Unicode issues for the queried path.
* **AUDIT-054 deferred-item**: not re-touched. Cache persistence
remains opt-in via `XENIA_CACHE_PERSIST=1`. Default keeps the
AUDIT-038 wipe behavior.
* **`--mute=true`**: every canary run.
* **Renamed binaries**: `xrs-c10` / `xc-c10.exe`.
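
The wipe-by-default, persist-on-opt-in policy noted in the AUDIT-054 item can be sketched as below. This is an illustration, not the code at `state.rs:499-510`: the signature of `init_cache_root` here is hypothetical, and treating only the exact value `1` of `XENIA_CACHE_PERSIST` as enabled is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads the opt-in switch. Assumption: only the exact value "1" enables it.
fn cache_persist_enabled() -> bool {
    std::env::var("XENIA_CACHE_PERSIST")
        .map(|v| v == "1")
        .unwrap_or(false)
}

/// Hypothetical init routine: wipes the cache root unless persistence is
/// requested, then makes sure the directory exists.
fn init_cache_root(root: &Path, persist: bool) -> io::Result<()> {
    if !persist {
        match fs::remove_dir_all(root) {
            Ok(()) => {}
            // A missing directory is already the clean state we want.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    fs::create_dir_all(root)
}
```

A caller would pass `cache_persist_enabled()` as the `persist` argument. Taking the flag as a parameter keeps the wipe decision testable without mutating the process environment.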
## Confidence
* **Phase 1 emitter extension**: HIGH — schema-compliant, additive,
cvar-OFF zero-cost verified via determinism.
* **Phase 4 classification**: HIGH — three independent observations
agree (canary cache populated, ours cache wiped, multiple hashes).
* **Cascade prediction at 102,404**: a cache fix resolves only the
FIRST divergence in a series; the next cache hash will be the next
divergence. The likely net gain is several hundred to a few thousand
matched events per cache slot resolved, until a non-cache divergence
appears.