xenia-rs/audit-runs/phase-c10-NtQueryFullAttributesFile/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

9.5 KiB

Phase C+10 — NtQueryFullAttributesFile — Investigation

Phase 1: Emitter extension (LANDED)

Problem

C+9 left the divergence with no resolved path string:

canary[6][102404] kernel.return NtQueryFullAttributesFile return_value=0
ours  [1][102404] kernel.return NtQueryFullAttributesFile return_value=0xC0000034

payload.args and payload.args_resolved were both empty objects. We had no way to identify WHICH file the engine was querying.

Shape of the fix

Schema v1 already declares args_resolved as a free-form object attached to kernel.call (schema-v1.md:108-117), and the existing example explicitly shows {"path":"..."}. The emitter just wasn't populating it. Extension is pure schema-v1 compliance, no version bump.

Ours-side (event_log.rs / path.rs / state.rs)

  • Added event_log::emit_kernel_call_with_path(tid, cycle, name, Option<&str>) — same byte format as emit_kernel_call, but when path is Some(non_empty) emits args_resolved:{"path":"..."}. When None or empty, degrades to the existing args_resolved:{} form so unrelated exports' output is byte-identical to pre-extension.

  • Added path::object_attributes_raw_name(mem, ptr) -> Option<String> — returns the RAW path string (trimmed of whitespace, NO prefix-strip / no case-fold) so the diff surfaces upstream prefix-form differences instead of masking them via normalization. Pre-existing object_attributes_to_vfs_path (which DOES normalize) is kept as-is for VFS lookup callers; emitter uses the new raw helper.

  • state.rs::call_export, inside the phase_a_on guarded block: a new match on name reads the OBJECT_ATTRIBUTES* from the GPR that carries it for each export. Argument positions were verified against the signatures in canary's xboxkrnl/xboxkrnl_io.cc:

    • NtQueryFullAttributesFile → r3 = obj_attrs
    • NtOpenSymbolicLinkObject → r4 = obj_attrs
    • NtCreateFile, NtOpenFile → r5 = obj_attrs

    It then calls emit_kernel_call_with_path(..., resolved.as_deref()) instead of emit_kernel_call(...). All other exports fall through to None and the legacy form.
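
The register selection and the args_resolved degradation rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in state.rs or event_log.rs: the function names obj_attrs_gpr, json_escape and args_resolved_fragment are hypothetical, and the real emitter's full byte format is not reproduced here, only the args_resolved member.

```rust
/// Which GPR holds the OBJECT_ATTRIBUTES* argument for a given export.
/// Register numbers follow the list above; unlisted exports return None
/// and therefore keep the legacy (path-less) event form.
fn obj_attrs_gpr(export_name: &str) -> Option<usize> {
    match export_name {
        "NtQueryFullAttributesFile" => Some(3),
        "NtOpenSymbolicLinkObject" => Some(4),
        "NtCreateFile" | "NtOpenFile" => Some(5),
        _ => None,
    }
}

/// Hypothetical helper: minimal JSON string escaping
/// (backslash, double quote, ASCII control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render only the args_resolved member of a kernel.call event.
/// A non-empty path yields {"path":"..."}; None or "" yields the legacy {}.
fn args_resolved_fragment(path: Option<&str>) -> String {
    match path {
        Some(p) if !p.is_empty() => {
            format!("\"args_resolved\":{{\"path\":\"{}\"}}", json_escape(p))
        }
        _ => "\"args_resolved\":{}".to_string(),
    }
}
```

Treating an empty path the same as None is what keeps the output for unrelated exports unchanged: they never take the path branch, so their events stay in the legacy form.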

Canary-side (event_log.h / event_log.cc / util/shim_utils.h)

  • event_log.h: declared EmitKernelCallWithPath(name, path).
  • event_log.cc: implemented same as ours (degrades to legacy form for empty path).
  • event_log.cc::phase_a_bridge::EmitImportAndCallWithCtx(module, ord, name, ppc_context) — new bridge function. PPCContext is passed as void* to keep the header transitive include footprint small (the bridge cc reinterprets to PPCContext* internally). Inside the bridge, helper ReadObjectAttributesRawName(ptr) reads the X_OBJECT_ATTRIBUTES.name_ptr, then the X_ANSI_STRING bytes directly out of guest memory (no util::TranslateAnsiPath normalization). Trims whitespace + trailing NULs to match ours's semantics byte-for-byte.
  • util/shim_utils.h: both export trampolines (X::Trampoline / Y::Trampoline) switched the phase_a_bridge::EmitImportAndCall call to phase_a_bridge::EmitImportAndCallWithCtx, passing the existing ppc_context argument that's already in scope. The legacy EmitImportAndCall stays declared and defined for any future callers that don't have a PPCContext.
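
For comparison with the canary helper, the raw-name read can be sketched in Rust as below. Several things here are assumptions rather than facts from either codebase: guest memory is modeled as a flat byte slice indexed directly by guest address, the structure layouts (name_ptr at offset 4 of X_OBJECT_ATTRIBUTES; a big-endian u16 length at offset 0 and a big-endian u32 buffer pointer at offset 4 of X_ANSI_STRING) are the conventional ones and should be checked against the real headers, and read_raw_name is a hypothetical name, not the actual path::object_attributes_raw_name body.

```rust
/// Read a big-endian u16 from the modeled guest memory, or None if out of range.
fn read_be_u16(mem: &[u8], addr: usize) -> Option<u16> {
    let b = mem.get(addr..addr.checked_add(2)?)?;
    Some(u16::from_be_bytes([b[0], b[1]]))
}

/// Read a big-endian u32 from the modeled guest memory, or None if out of range.
fn read_be_u32(mem: &[u8], addr: usize) -> Option<u32> {
    let b = mem.get(addr..addr.checked_add(4)?)?;
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Hypothetical sketch: return the RAW path named by an OBJECT_ATTRIBUTES*,
/// with no prefix stripping, separator rewriting, or case folding.
fn read_raw_name(mem: &[u8], obj_attrs_ptr: u32) -> Option<String> {
    if obj_attrs_ptr == 0 {
        return None;
    }
    // Assumed layout: X_OBJECT_ATTRIBUTES.name_ptr lives at offset 4.
    let name_ptr = read_be_u32(mem, (obj_attrs_ptr as usize).checked_add(4)?)? as usize;
    if name_ptr == 0 {
        return None;
    }
    // Assumed layout: X_ANSI_STRING { u16 length; u16 max_length; u32 buffer }.
    let len = read_be_u16(mem, name_ptr)? as usize;
    let buf = read_be_u32(mem, name_ptr.checked_add(4)?)? as usize;
    let bytes = mem.get(buf..buf.checked_add(len)?)?;
    // Lossy decoding is a simplification; the queried path here is plain ASCII.
    let text = String::from_utf8_lossy(bytes);
    // Trim trailing NULs and whitespace, then leading whitespace.
    let trimmed = text
        .trim_end_matches(|c: char| c == '\0' || c.is_whitespace())
        .trim_start();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}
```

Every read is bounds-checked and returns None on failure, so a bad pointer degrades to the legacy args_resolved:{} form instead of faulting inside the emitter.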

Verification

  • Build both engines clean.
  • Determinism 3x: digest md5 = b8fa0e0460359a4f660adb7605e053de (identical to C+9 baseline — extension is cvar-OFF zero-cost).
  • Phase A emitter determinism 2x: det-fields md5 = 7489e90e… byte identical. (Different from C+9's 0b299c37… because the path field IS in the deterministic signature — but stable across runs.)

Phase 2: Re-run + capture path string

After the extension, both engines emit the path at kernel.call.args_resolved.path:

canary[6][102403] NtQueryFullAttributesFile  path = "cache:\d4ea4615\e\46ee8ca"
ours  [1][102403] NtQueryFullAttributesFile  path = "cache:\d4ea4615\e\46ee8ca"

Both engines query the same path. No upstream divergence — the ANSI_STRING content matches byte-for-byte.

Phase 3: Why does ours say NOT_FOUND?

Trace through ours's nt_query_full_attributes_file

exports.rs:1913-1990:

  1. Read OBJECT_ATTRIBUTES → path = "cache:/d4ea4615/e/46ee8ca" (after normalize_path).
  2. state.resolve_cache_path(&path) returns Some(<temp_dir>/xenia-rs-cache-<pid>-0/d4ea4615/e/46ee8ca).
  3. std::fs::metadata(host_path) returns Err(NotFound).
  4. Return STATUS_OBJECT_NAME_NOT_FOUND (0xC0000034).
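
The four steps above reduce to the following flow. This is a simplified sketch, not the body of nt_query_full_attributes_file: normalize_path and resolve_cache_path here are stand-ins that implement only what the trace shows (backslashes become forward slashes, and the cache: device maps onto a host directory), the cache root is passed in as a parameter instead of living in state, and every lookup failure is folded into one status.

```rust
use std::path::{Path, PathBuf};

const STATUS_SUCCESS: u32 = 0;
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034;

/// Stand-in for normalize_path: only the separator rewrite seen in step 1.
fn normalize_path(raw: &str) -> String {
    raw.trim().replace('\\', "/")
}

/// Stand-in for resolve_cache_path: map "cache:/a/b" to "<cache_root>/a/b".
/// Returns None for other devices and for paths containing "..".
fn resolve_cache_path(cache_root: &Path, normalized: &str) -> Option<PathBuf> {
    let rel = normalized.strip_prefix("cache:/")?;
    if rel.split('/').any(|segment| segment == "..") {
        return None;
    }
    Some(cache_root.join(rel))
}

/// Steps 1-4: normalize, resolve to a host path, stat it, map the result.
fn query_full_attributes(cache_root: &Path, raw_path: &str) -> u32 {
    let normalized = normalize_path(raw_path);
    let Some(host_path) = resolve_cache_path(cache_root, &normalized) else {
        return STATUS_OBJECT_NAME_NOT_FOUND;
    };
    match std::fs::metadata(&host_path) {
        Ok(_) => STATUS_SUCCESS,
        // Simplification: all errors, including NotFound, map to one status.
        Err(_) => STATUS_OBJECT_NAME_NOT_FOUND,
    }
}
```

With an empty cache root the function returns 0xC0000034 for cache:\d4ea4615\e\46ee8ca; once a file exists at that relative host path it returns 0. The return value therefore depends only on host cache contents, not on the path handling.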

The host path doesn't exist because ours's init_cache_root (state.rs:499-510) clears the cache directory on every boot (AUDIT-038 line: per-process tmpdir + full wipe so two consecutive runs see byte-identical initial state).
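
A minimal sketch of that wipe-on-boot behavior, assuming a directory name of the form xenia-rs-cache-<pid>-0 as seen in the resolved host path. The function below is a hypothetical stand-in for init_cache_root: the base directory and pid are passed in for testability, and the persist flag stands in for the XENIA_CACHE_PERSIST=1 opt-in (where the real code places a persistent cache may differ).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for init_cache_root.
/// With persist == false, any existing contents are removed so that every
/// boot starts from an empty cache directory.
fn init_cache_root(base: &Path, pid: u32, persist: bool) -> io::Result<PathBuf> {
    let root = base.join(format!("xenia-rs-cache-{pid}-0"));
    if !persist && root.exists() {
        // Full wipe: two consecutive runs see the same (empty) initial state.
        fs::remove_dir_all(&root)?;
    }
    fs::create_dir_all(&root)?;
    Ok(root)
}
```

Under the default (persist == false) no file written by an earlier boot can survive, so any attribute query for a cache file that the current boot has not yet created must fail.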

Why does canary's call NOT fail?

xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io.cc:474-513:

  1. Read OBJECT_ATTRIBUTES → target_path via TranslateAnsiPath.
  2. kernel_state()->file_system()->ResolvePath(target_path).
  3. If entry found, populate file_info, return X_STATUS_SUCCESS.
  4. Else return X_STATUS_NO_SUCH_FILE (0xC000000F).

Canary returns 0 → entry was found. Canary's cache mount is at /home/fabi/.local/share/Xenia/cache/ (a persistent host directory populated over prior boots).

Verification of canary's cache state

$ ls /home/fabi/.local/share/Xenia/cache/d4ea4615/e/
-rw-rw-r-- 1 fabi fabi 400 May 11 21:01 46ee8ca

Single 400-byte file. Total cache: 23 files, ~5 MB across 16 distinct top-level hash directories.

Sibling-cache observations

ours.jsonl shows the same NtQueryFullAttributesFile export firing for multiple cache paths within the 50M window, all returning 0xC0000034. Example: idx 103810 queries cache:\69d8e45c\8\3421153. So the divergence is not a single missing file but a class of 16+ missing hashes.
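
One way to size that class is to collect the distinct top-level hash directories from the queried paths. The sketch below assumes the path strings have already been extracted from the kernel.call events (JSON parsing is left out), and the helper names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Both separator forms are accepted so raw and normalized paths work alike.
const SEPARATORS: &[char] = &['\\', '/'];

/// First path component after the "cache:" device prefix, e.g.
/// "cache:\69d8e45c\8\3421153" -> "69d8e45c". Other devices return None.
fn cache_top_level(raw_path: &str) -> Option<&str> {
    let rest = raw_path.strip_prefix("cache:")?;
    let rest = rest.trim_start_matches(SEPARATORS);
    let first = rest.split(SEPARATORS).next()?;
    if first.is_empty() {
        None
    } else {
        Some(first)
    }
}

/// Distinct top-level cache hashes across all queried paths, sorted.
fn distinct_cache_hashes<'a, I>(paths: I) -> BTreeSet<&'a str>
where
    I: IntoIterator<Item = &'a str>,
{
    paths.into_iter().filter_map(cache_top_level).collect()
}
```

Comparing the size of this set for ours.jsonl against the number of top-level directories in canary's cache shows how many cache entries the window actually touches.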

Phase 4: Classification + scope decision

Per the plan, the classes are:

  • (A) Missing file — a single plant fixes it (small).
  • (B) Path-normalization bug — string operation (small).
  • (C) VFS mount missing — add the mount (small-medium).
  • (D) Subsystem-required — STFS or similar — ESCALATE.
  • (E) Upstream divergence — walk back.

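This is NOT (B): both engines receive the identical raw path (verified by matching args_resolved.path), and ours's normalized path resolves to the expected cache-relative host location (Phase 3, step 2). The lookup fails only because the file is absent.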

This is NOT (E) — upstream is bit-identical for 102,403 events.

This is NOT (A) for any single file — the game queries 16+ distinct cache hashes; planting one only postpones the divergence.

This is closest to a hybrid (C+D):

  • (C)-ish: canary's cache MOUNT resolves to a populated host dir; ours's mount resolves to a wiped tmp dir.
  • (D)-ish: canary's cache is populated because it ran the game before and the game built the cache. To match canary's state on a fresh boot, we either:
    • implement the game's cache-build logic (subsystem),
    • copy canary's pre-built cache (oracle state — AUDIT-038 violation),
    • or accept that ours runs cold and the divergence is a fundamental cold-vs-warm asymmetry.

AUDIT-053 cross-check (warm-start regression risk)

Per AUDIT-053 memo:

Phase 2 permanent fix REVERTED — warm-start regression from VFS layout aliasing: open_cache_file treats all NtCreateFile as files, but cache:\d4ea4615 disp=CREATE is meant as a DIRECTORY.

AUDIT-054 fixed that specific aliasing (FILE_DIRECTORY_FILE bit threading). But there's still the AUDIT-053 secondary concern: Sylpheed's cache:\<hash>.tmp journal-style writes append on each boot — making naive persistence self-inconsistent across boots.

Whether AUDIT-054's fix fully unblocks persistence is NOT RE-VERIFIED in this session. Re-testing the AUDIT-053 regression under AUDIT-054's fix-in-tree is itself a follow-up.

Scope per user direction

User said:

If the fix requires major VFS work, STFS subsystem implementation, or cache-population infrastructure: ESCALATE.

Of the choices in escalation.md, Choice 1 is insufficient and Choices 2-4 all qualify as "cache-population infrastructure":

  • Choice 1 (single file plant) won't solve the problem (16+ hashes).
  • Choice 2 (seed from canary) is oracle state + warm-start regression risk per AUDIT-053.
  • Choice 3 (synthesize cache reads) is multi-export semantic-change.
  • Choice 4 (build cache from scratch) is a full subsystem.

ESCALATION declared. Phase 1 emitter extension landed as the session's permanent infrastructure contribution.

Discipline check

  • Reading-error #28 (canary source-of-truth): verified canary's actual NtQueryFullAttributesFile_entry body (xboxkrnl_io.cc:474-513), did not assume.
  • Reading-error #23 (downstream regression): no fix landed, so no regression risk. Emitter extension is cvar-OFF zero-cost.
  • Escalation discipline: triggered cleanly; explicit memo; contributing infrastructure (emitter path resolution) kept.
  • Path encoding: ANSI_STRING raw bytes captured; both engines agree byte-for-byte; no Unicode issues for the queried path.
  • AUDIT-054 deferred-item: not re-touched. Cache persistence remains opt-in via XENIA_CACHE_PERSIST=1. Default keeps the AUDIT-038 wipe behavior.
  • --mute=true: every canary run.
  • Renamed binaries: xrs-c10 / xc-c10.exe.

Confidence

  • Phase 1 emitter extension: HIGH — schema-compliant, additive, cvar-OFF zero-cost verified via determinism.
  • Phase 4 classification: HIGH — three independent observations agree (canary cache populated, ours cache wiped, multiple hashes).
  • Cascade prediction at 102,404: a cache fix resolves only the FIRST divergence in a series; the next cache hash will be the next divergence. The likely net gain is several hundred to a few thousand matched events per cache slot resolved, until a non-cache divergence appears.