xenia-rs/audit-runs/cache-subsystem-plan/investigation.md
MechaCat02 ef93a4fa14 handoff: VSync/event-wedge fixes + iterate 2.A–2.BC research notes
Source changes (dormant parity infra, retained from iterate 2.AI/2.AO):
- xenia-kernel/exports.rs: nt_create_event manual_reset polarity +
  related event wiring
- xenia-gpu/mmio_region.rs: D1MODE_VBLANK_VLINE_STATUS hardcode parity

Also lands the audit-runs/ analysis notes (.md/.txt/.json digests) for the
iterate 2.x VSync/0x10e8/0x1004 wedge investigation. Raw trace dumps
(.jsonl/.gz/.csv/.stdout) and agent worktrees (.claude/) are gitignored as
regenerable local artifacts — see memory + HANDOFF for the running findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:19:08 +02:00

# Cache subsystem investigation — Phase C+11 planning (2026-05-14)
## Scope
This investigation informs the plan at [plan.md](plan.md). It was run as a
dedicated planning session after Phase C+10 escalated the cache divergence at
idx 102404. Findings are READ-ONLY observations; no source modified.
## 1. Canary's cache enumeration
Canary's mount: `~/.local/share/Xenia/cache/` (the POSIX `storage_root / "cache"`
convention; canary's `xenia-canary/src/xenia/app/xenia_main.cc:612-652` registers
three `HostPathDevice` mounts at `\\CACHE0`, `\\CACHE1`, `\\CACHE` aliased to
`cache0:`, `cache1:`, `cache:` symbolic links).
State at session start: 23 files / 4.8 MB across 16 hash buckets. Pre-populated
across many prior canary boots. Full enumeration in
[canary-cache-listing.csv](canary-cache-listing.csv).
Notable properties:
* **Zero `.tmp` files**: canary's cache holds only resolved hierarchical leaves
  (`<H1>/<X>/<H2>` form) plus two top-level manifests (`access`, `recent`). The
  `.tmp` flat-journal files Sylpheed uses for staging are renamed/removed before
  they persist.
* Top-level `access` and `recent` are **files**, not directories. Layouts:
  * `access`: 20×12-byte records `(hash1 u32 BE, hash2 u32 BE, refcount u32)`.
    The 240 B file enumerates the 20 known cache entries (note: 23 files total
    on disk but only 20 manifest entries; three of the on-disk files are not
    indexed, possibly `recent`-only or orphans).
  * `recent`: 20×8-byte records `(hash1 u32 BE, hash2 u32 BE)`. Recently-used
    ordering of the same hash pairs.
* Cache content is **game-asset cache**: Shift-JIS Japanese localization text
  (`d4ea4615/e/46ee8ca`: `[SYSTEM]/[LANGUAGE]/XC_LANGUAGE_*` table); `IPFB`-magic
  binary blobs (game-asset format, likely font/sprite/level data); large blobs
  up to 2.7 MB. This is NOT shader cache or PSO cache.
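The `access` record layout above can be decoded with a few lines of Rust. This is a minimal sketch, not code from either engine: `AccessRecord` and `parse_access` are hypothetical names, and the byte order of `refcount` is an assumption (the notes only state big-endian for the two hashes).

```rust
/// One 12-byte `access` record, following the layout described in the notes.
#[derive(Debug, PartialEq, Eq)]
struct AccessRecord {
    hash1: u32,
    hash2: u32,
    refcount: u32,
}

/// Decode a whole `access` file.
/// Returns `None` when the length is not a multiple of 12 bytes.
fn parse_access(bytes: &[u8]) -> Option<Vec<AccessRecord>> {
    if bytes.len() % 12 != 0 {
        return None;
    }
    let be = |b: &[u8]| u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
    Some(
        bytes
            .chunks_exact(12)
            .map(|r| AccessRecord {
                hash1: be(&r[0..4]),
                hash2: be(&r[4..8]),
                // ASSUMPTION: refcount is big-endian like the hashes;
                // the investigation does not state its byte order.
                refcount: be(&r[8..12]),
            })
            .collect(),
    )
}
```

A 240-byte input yields 20 records, matching the manifest size observed on disk. The 8-byte `recent` records would follow the same pattern without the third field.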
## 2. Canary's cache code (xenia-canary)
Mount/init:
* `xenia-canary/src/xenia/app/xenia_main.cc:612-652`: registers three
  `HostPathDevice` mounts.
* `xenia-canary/src/xenia/base/filesystem_posix.cc:76-97`: POSIX path
  resolution for `storage_root` via `$XDG_DATA_HOME` then `$HOME/.local/share`.
* `xenia-canary/src/xenia/vfs/devices/host_path_device.cc:31-48`: creates the
  host directory if missing (`std::filesystem::create_directories`). **No wipe
  logic anywhere in canary source.** Cache survives across boots.
* `xenia-canary/src/xenia/vfs/devices/host_path_entry.cc:78-98`:
  `CreateEntryInternal` calls `create_directories(parent)` + `OpenFile("wb")`.

NT IO handlers:
* `xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io.cc:39-111`: `NtCreateFile`
  routes through `FileSystem::OpenFile` with
  `is_directory = (create_options & FILE_DIRECTORY_FILE) != 0` and
  `is_non_directory = (create_options & FILE_NON_DIRECTORY_FILE) != 0`.
* `xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io.cc:474-513`:
  `NtQueryFullAttributesFile` returns `X_STATUS_SUCCESS` (0) on
  `ResolvePath` hit; `X_STATUS_NO_SUCH_FILE` (0xC000000F) on miss.
* `xenia-canary/src/xenia/kernel/xboxkrnl/xboxkrnl_io_info.cc:226-243`:
  **`NtSetInformationFile` class 10 (XFileRenameInformation)** correctly
  implemented:
```cpp
case XFileRenameInformation: {
  auto info = info_ptr.as<X_FILE_RENAME_INFORMATION*>();
  std::filesystem::path target_path =
      util::TranslateAnsiPath(kernel_memory(), &info->ansi_string);
  if (!IsValidPath(target_path.string(), false)) {
    return X_STATUS_OBJECT_NAME_INVALID;
  }
  if (!target_path.has_filename()) {
    return X_STATUS_INVALID_PARAMETER;
  }
  file->Rename(target_path);
  out_length = sizeof(*info);
  break;
}
```
All file IO is synchronous on the host (`XFile::Write` → `WriteSync` →
`std::fwrite`).
## 3. Ours's cache code (xenia-rs current HEAD)
Mount/init:
* `xenia-rs/crates/xenia-kernel/src/state.rs:1235-1273`: `resolve_default_cache_root`:
  * Default: per-process tmpdir `std::env::temp_dir()/xenia-rs-cache-{pid}-{counter}`
    with `wipe=true` (AUDIT-038).
  * `XENIA_CACHE_ROOT=<path>` env: explicit path, no wipe.
  * `XENIA_CACHE_PERSIST=1` (or "true" case-insensitive): `$XDG_DATA_HOME/xenia-rs/cache`
    or `$HOME/.local/share/xenia-rs/cache`, no wipe.
* `xenia-rs/crates/xenia-kernel/src/state.rs:499-510`: `init_cache_root`
  conditionally wipes and recreates.
* `xenia-rs/crates/xenia-kernel/src/state.rs:519-554`: `resolve_cache_path`:
  case-insensitive prefix-match on `cache:\`, `cache:/`, `cache0:\`, `cache0:/`,
  `cache1:\`, `cache1:/`; backslash → forward slash normalization; `..`, `.` and
  empty components filtered for traversal safety. **Funnels all three (cache, cache0,
  cache1) to a single backing root**, unlike canary, which has three
  separate `HostPathDevice` mounts.
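The path-resolution rules listed for `resolve_cache_path` can be sketched as follows. This is an illustration written from the description above, not the code in `state.rs`; the function name `resolve_cache_path_sketch` and its signature are hypothetical.

```rust
use std::path::{Path, PathBuf};

// All three guest prefixes funnel into the same backing root.
const PREFIXES: [&str; 3] = ["cache0:", "cache1:", "cache:"];

/// Map a guest path such as `cache:\d4ea4615\e` onto `root`.
/// Returns `None` when the path is not on a cache mount.
fn resolve_cache_path_sketch(root: &Path, guest: &str) -> Option<PathBuf> {
    // Case-insensitive prefix match; the prefix must be followed by `\` or `/`.
    let lower = guest.to_ascii_lowercase();
    let prefix = PREFIXES.iter().find(|p| {
        lower.starts_with(**p)
            && matches!(lower.as_bytes().get(p.len()), Some(b'\\') | Some(b'/'))
    })?;
    let rest = &guest[prefix.len() + 1..];

    let mut out = root.to_path_buf();
    // Treat both separators alike and drop `..`, `.` and empty components
    // so the result can never escape `root`.
    for part in rest.split(|c| c == '\\' || c == '/') {
        if part.is_empty() || part == "." || part == ".." {
            continue;
        }
        out.push(part);
    }
    Some(out)
}
```

Under this sketch, `cache1:/..\..\x` and `CACHE:\x` both resolve to `<root>/x`, which is the single-backing-root behaviour that differs from canary's three mounts.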
NT IO handlers:
* `xenia-rs/crates/xenia-kernel/src/exports.rs:1023-1196`: `open_cache_file`.
  AUDIT-054 `FILE_DIRECTORY_FILE`-bit handling at lines 1041-1051. The
  `is_dir_open` decision uses `(create_options & FILE_DIRECTORY_FILE) != 0 ||
  host_path.is_dir() || host_path == state.cache_root.unwrap_or(host_path)`. The
  last term is a tautology when `cache_root` is `None` (it reduces to
  `host_path == host_path`, so the whole expression is true); when `cache_root`
  is `Some(_)` it is harmless and matches only the cache root itself.
* `xenia-rs/crates/xenia-kernel/src/exports.rs:1354-1373`: `nt_create_file`
  reads `create_options` from `sp + 0x54` (per AUDIT-054's `shim_utils.h:49-50`
  citation). r5=obj_attrs, r10=create_disposition.
* `xenia-rs/crates/xenia-kernel/src/exports.rs:1375-1405`: `nt_open_file`
  reads `open_options` from r7 (AUDIT-053's r8→r7 fix, Phase C+5).
* `xenia-rs/crates/xenia-kernel/src/exports.rs:1809-1909`: `nt_set_information_file`
  validates `min_length` for class 10 at line 1822 (`10 => 16`), but **the match
  body at 1847-1905 has no case-arm for class 10**. The `_ =>
  (STATUS_SUCCESS, min_length)` catch-all at line 1904 fires for class 10,
  returning success without performing the rename. **This is bug #1 in the
  plan's headline finding.**
* `xenia-rs/crates/xenia-kernel/src/exports.rs:1913-1990`:
  `nt_query_full_attributes_file`. Cache short-circuit at lines 1930-1957
  uses `std::fs::metadata(&hp)` directly; returns
  `STATUS_OBJECT_NAME_NOT_FOUND` (0xC0000034) on miss. This differs from
  canary's 0xC000000F, but Sylpheed treats both the same: its wrapper
  `sub_82612A78` converts any error STATUS to -1 (see section 4).
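The `unwrap_or(host_path)` behaviour in the `is_dir_open` expression is easy to reproduce in isolation. The sketch below copies the three-term expression into a free function; the signature (an `Option<&Path>` argument instead of a field on the kernel state) is an assumption made to keep the example self-contained.

```rust
use std::path::Path;

// NT create-option bit for "open as directory".
const FILE_DIRECTORY_FILE: u32 = 0x0000_0001;

/// Stand-alone copy of the three-term `is_dir_open` decision.
fn is_dir_open(create_options: u32, host_path: &Path, cache_root: Option<&Path>) -> bool {
    (create_options & FILE_DIRECTORY_FILE) != 0
        || host_path.is_dir()
        // With `None`, this compares `host_path` to itself and is always true.
        || host_path == cache_root.unwrap_or(host_path)
}
```

With `cache_root = None`, a plain-file open of a path that does not exist on the host (no directory bit set) is still classified as a directory open; with `Some(root)` the third term is true only when `host_path` equals `root`.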
C+10 emitter extension:
* `xenia-rs/crates/xenia-kernel/src/state.rs:657-687`: `call_export`
  dispatches by name to `object_attributes_raw_name` (path.rs:109-115) for the 4
  `OBJECT_ATTRIBUTES*`-taking exports: NtQueryFullAttributesFile (r3),
  NtOpenSymbolicLinkObject (r4), NtCreateFile (r5), NtOpenFile (r5). Calls
  `emit_kernel_call_with_path` (event_log.rs:202-229). Not wired for
  NtSetInformationFile (info buffer has the path, not OBJECT_ATTRIBUTES).
  **Stage 1 of the plan extends this dispatch to class-10 rename targets.**

Tests:
* `xenia-rs/crates/xenia-kernel/src/exports.rs:6830-6980` — 5 cache-specific
tests: `cache_create_write_read_roundtrip`, `cache_file_create_collision`,
`cache_file_open_missing`, `cache_root_cleared_on_init`,
`cache_resolve_strips_path_traversal`. Plus 3 async/sync file tests.
* No tests cover `NtSetInformationFile` class 10. **Stage 1 of the plan adds
this test.**
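The missing class-10 coverage could be exercised with a host-side check along the following lines. This is a sketch only: `rename_cache_entry` is a hypothetical helper standing in for whatever the class-10 arm ends up calling, and the `file_name()` guard only loosely mirrors canary's `has_filename()` check.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// HYPOTHETICAL helper: host-side effect of a class-10 rename.
/// Rejects targets without a final file-name component, then renames.
fn rename_cache_entry(from: &Path, to: &Path) -> io::Result<()> {
    if to.file_name().is_none() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "rename target has no file name",
        ));
    }
    // The parent directories are expected to exist already: the event log
    // shows the guest creating `<H1>` and `<H1>\<X>` before the promotion.
    fs::rename(from, to)
}
```

A test built on this would stage a `.tmp` journal under a scratch cache root, promote it, and then assert that the journal is gone and the leaf holds the original bytes.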
## 4. Sylpheed's cache code (guest PPC binary)
Disassembly of the cache-fallback dispatcher chain (via xenia-rs disasm +
sylpheed.db):
* **`sub_82452DC0`** (PC 0x82452DC0-0x82453024): high-level dispatcher.
  * 0x82452DEC: tries primary data via `sub_82452068` + `sub_82452200`.
  * 0x82452E08: checks `r3 == 0`. On not-found, branches to cache fallback at
    0x82452E1C.
  * 0x82452E1C: calls cache gate `sub_8245B000`.
  * 0x82452E28: if cache returns 0 (miss), branches to 0x82452E88 (skip cache).
  * 0x82452E30: cache hit → call callback `sub_8245B078`.
* **`sub_8245B000`** (cache gate): validates hash key, calls `sub_8245AD00`.
* **`sub_8245AD00`** (cache query): formats path via `sub_82459130`
  (sprintf `cache:\<H1>\<X>\<H2>`); queries via `sub_82612A78` (NtQueryFullAttributesFile
  wrapper). On miss (`r3 == -1` at 0x8245AD90), branches to failure 0x8245ADFC.
  On hit, enters critical section + calls `sub_8245B1F8` (cache reader).
* **`sub_82459130`** (path formatter): pure sprintf, no cache write.
* **`sub_82612A78`** (NtQueryFullAttributesFile wrapper): wraps the kernel
  import; converts STATUS to -1 on error.
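The wrapper's STATUS-to-`-1` conversion explains why the two engines' different miss codes look identical to the guest. The sketch below assumes the wrapper uses the conventional NT test (error means the value is negative as a signed 32-bit integer); the disassembly notes above do not confirm which test it actually performs, and `guest_visible_result` is a hypothetical name.

```rust
const STATUS_SUCCESS: u32 = 0x0000_0000;
const STATUS_NO_SUCH_FILE: u32 = 0xC000_000F; // canary's miss code
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034; // xenia-rs's miss code

/// ASSUMPTION: "error" is the usual NT convention, sign bit set.
fn is_error_status(status: u32) -> bool {
    (status as i32) < 0
}

/// What the guest sees after the wrapper: -1 on any error status.
/// The 0 on success is a placeholder; the real success value is not
/// documented in these notes.
fn guest_visible_result(status: u32) -> i32 {
    if is_error_status(status) { -1 } else { 0 }
}
```

Under that assumption both miss codes collapse to `-1`, so the status-value difference alone cannot cause guest-visible divergence; only hit versus miss matters.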
**Cache-write path was NOT located in sub_82452DC0's disassembly.** The dispatcher
agent reported no NtCreateFile in the miss branch. Likely the cache build fires
from a different code path (probably inside `sub_82452068`/`sub_82452200`, the
"primary data" handlers, which on first-time access compute the data AND write
it to cache).

Sylpheed binary string references (all confirmed via .pe text-search):
* `cache:\access` at 0x820B5794
* `cache:\recent` at 0x820B5774
* `cache:\ignore` at 0x820B5784
* `cache:\*.tmp` at 0x820B5764
* `cache:\` at 0x820B57A4
* `%s%08x%08x.tmp` at 0x820B57AC (format string for `cache:\<H1><H2>.tmp` flat
journal)
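The `%s%08x%08x.tmp` format string maps directly onto a Rust `format!` call. In this sketch the function name `journal_path` is hypothetical, and passing `cache:\` as the `%s` argument with `hash1` before `hash2` follows the `cache:\<H1><H2>.tmp` reading given above.

```rust
/// Rust equivalent of the guest's `%s%08x%08x.tmp` format string:
/// a prefix followed by two zero-padded, lowercase, 8-digit hex hashes.
fn journal_path(prefix: &str, hash1: u32, hash2: u32) -> String {
    format!("{}{:08x}{:08x}.tmp", prefix, hash1, hash2)
}
```

For example, `journal_path("cache:\\", 0xd4ea4615, 0xe46ee8ca)` produces `cache:\d4ea4615e46ee8ca.tmp`, the same flat-journal name that appears in the event log.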
**Conclusion**: Sylpheed manages its own cache content. The game has both the
read path (sub_82452DC0 dispatcher) and the write path (currently unlocated,
likely in primary-data handlers). The write path is what creates `.tmp` files
and (we infer) calls `NtSetInformationFile` class 10 to rename them to
hierarchical leaves.
## 5. Event-log evidence (Phase A jsonl)
From `xenia-rs/audit-runs/phase-c10-NtQueryFullAttributesFile/ours.jsonl`,
tid=4's cache-build sequence on COLD cache:
| idx | event | path |
|---|---|---|
| 13 | NtOpenFile | `cache:\` (probe mount root) |
| 19 | NtClose | (close root probe) |
| 28 | NtCreateFile | `cache:\access` → returns 0xC0000034 NOT_FOUND on cold |
| 37 | NtCreateFile | `cache:\ignore` → returns 0xC0000034 |
| 46 | NtCreateFile | `cache:\recent` → returns 0xC0000034 |
| 64 | NtCreateFile | `cache:\d4ea4615e46ee8ca.tmp` (flat journal, FILE_CREATE) |
| 69 | NtSetInformationFile | (class TBD; ours emitter doesn't capture info_class) |
| 196 | NtCreateFile | `cache:\d4ea4615` (DIR, post-AUDIT-054) |
| 205 | NtCreateFile | `cache:\d4ea4615\e` (subdir) |
| 214 | NtOpenFile | `cache:\d4ea4615e46ee8ca.tmp` (reopen flat journal) |
| 286 | NtCreateFile | `cache:\69d8e45ce534ffea.tmp` (next flat journal) |
| 325 | NtOpenFile | `cache:\` |
| 409 | NtCreateFile | `cache:\access` (retry) |
| 466 | NtCreateFile | `cache:\69d8e45c` (DIR) |
| 475 | NtCreateFile | `cache:\69d8e45c\e` (subdir) |

Statistics across the 50M window:
* Ours emits 69 `cache:` events on tid=4, plus the main-chain divergent
events on tid=1.
* Ours emits **111** `NtSetInformationFile` calls; canary emits **0**.
Canary's cache is warm, so it skips cache-build entirely.
## 6. Persistence experiment
See [persistent-experiment.md](persistent-experiment.md) for the full table
and per-boot cache-content delta. Headline result:
* `XENIA_CACHE_PERSIST=1` + 50M boot 1 (cold): digest
`instructions=50000003 imports=40485 swaps=1 draws=0`. Differs from C+10
default-tmpdir baseline (`50000002`, `40465`) by +1 instruction / +20
imports. Persistent path is slightly different from tmpdir.
* `XENIA_CACHE_PERSIST=1` + 50M boot 2 (warm): same digest. No cxx_throw
regression at 50M.
* On-disk cache after boot 2: 7 `.tmp` flat journals (grew on each boot from
+400 B to +114 KB per file); `access`, `ignore`, `recent` as DIRECTORIES (bug
#2); zero hierarchical leaf files (bug #1 prevents promotion).
* Phase A diff vs canary baseline: matched-prefix on `canary_tid=6 → ours_tid=1`
  main chain = **102404** (unchanged from C+10's default-tmpdir result). Divergence
  at the same `NtQueryFullAttributesFile` return-value (canary=0 SUCCESS,
  ours=0xC0000034 NOT_FOUND).

**Persistence alone does not advance the matched-prefix.** The `.tmp` files
exist but the hierarchical leaf doesn't, so the `NtQueryFullAttributesFile`
call on the leaf still misses.
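For reference, the matched-prefix metric is the number of leading events on which the two logs agree. A minimal Rust sketch is shown below; it is not a port of `diff_events.py`, and the idea that an event compares by simple equality on (name, return value) is an assumption about how that tool matches events.

```rust
/// Length of the common prefix of two event sequences.
fn matched_prefix<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

With events modelled as `(name, status)` pairs, two logs that agree on an `NtOpenFile` and then differ on the `NtQueryFullAttributesFile` return value (0 versus 0xC0000034) have a matched prefix of 1.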
## 7. Discipline / methodology checks
* **`--mute=true`**: not used in this session because no canary runs were
required (the C+10 canary.jsonl was reused as-is for the matched-prefix
comparison). Future re-baselines under the plan must use `--mute=true`.
* **Binary rename for stop hook**: ours run via `xrs-c10` (pre-existing from
C+10). No background long-run; the experiments completed in <3 s wall-clock
on the test host.
* **Reading-error #28** (oracle source supersedes spec): verified canary's
`NtSetInformationFile` class-10 implementation by reading
`xboxkrnl_io_info.cc:226-243`; did not assume from docs.
* **No source touched**: this session was read-only-by-design. Plan-mode kept
the tree clean; the only file-system side effects were Phase A event log
output to `audit-runs/cache-subsystem-plan/persist-warm-events.jsonl` and
this directory's deliverables.
## 8. Confidence ratings
| claim | source | confidence |
|---|---|---|
| Bug #1: `nt_set_information_file` class 10 is a no-op stub | direct source read of [exports.rs:1809-1909](xenia-rs/crates/xenia-kernel/src/exports.rs#L1809-L1909) | HIGH |
| Bug #1 prevents .tmp-to-leaf promotion | indirect: ours's cache has .tmp + no leaf; canary's has leaf + no .tmp; canary properly implements class 10 | HIGH (3 independent confirmations) |
| Bug #2: top-level cache files mis-created as directories | direct on-disk observation post-experiment | HIGH |
| Bug #2 root cause: `is_dir_open` discriminator misclassification | source-read inference; not yet instrumented | MEDIUM (Stage 2 instrumentation required) |
| Persistence alone doesn't advance matched-prefix | experimentally verified via diff_events.py | HIGH |
| AUDIT-053 cxx_throw regression not reproduced at 50M | experimentally verified (2 sequential boots, same digest) | MEDIUM (AUDIT-053's regression was at 500M; this window is too short to fully rule it out) |
| Sylpheed has its own cache-build path that already fires in ours | event-log evidence (69 cache: events on tid=4) | HIGH |
| The two engine bugs are the ONLY blockers | inferred from the above; could be additional bugs uncovered post-Stage 1 | MEDIUM (Stages are independently rollback-able; if a Stage doesn't advance matched-prefix, investigate further) |
## 9. Open questions
See plan §"Open questions". Critical ones to resolve during implementation:
1. Confirm via instrumentation that Sylpheed actually calls
`NtSetInformationFile` class 10 for the .tmp→leaf rename. If it uses a
different path (NtDeleteFile + NtCreateFile, or some custom flow),
Stage 1's fix won't fully solve the problem.
2. Confirm via instrumentation whether `cache:\access`/`ignore`/`recent`
creates have `FILE_DIRECTORY_FILE` set in `create_options`, or whether
ours's arg-position read is wrong.
3. Validate whether `access` and `recent` manifest contents are deterministic
byte-for-byte across engines, or whether they include host-allocator
addresses / timestamps that need diff-tool canonicalization.
## 10. Recommended next session
See plan §"Recommended approach" and §"Implementation stages". Three landable
stages, ~150-200 LOC total, with an expected matched-prefix advance of hundreds
to thousands of events after Stage 3.