Preserve repo continuity before the next session handoff

Constraint: Future sessions need startup memory for user preferences, real-data status, and the current FMA bottleneck without re-discovery
Rejected: Leave continuity only in transient chat context | Would force every new session to reconstruct state from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep AGENTS continuity memory concise, code-true, and refreshed when project direction or bottlenecks materially change
Tested: AGENTS.md anchor search for continuity keys; verified host CUDA snapshot; verified build-index progress logs on small smoke artifacts
Not-tested: Full completion of the long-running real FMA CPU build-index stage

Commit 05a2ccca ... 05a2ccca90a395304395d692c1641f9fed3e8488 authored 2026-06-02 15:11:13 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 183 additions and 5 deletions
AGENTS.md
acr-engine/run_demo.py
acr-engine/src/engines/ecapa_embedder.py
docs/CHANGELOG.md
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -373,6 +373,108 @@ Do not manually duplicate hook-owned activation state unless recovering from mis

 ---

+<project_continuity_memory>
+## Project Continuity Memory / 持续开发记忆
+
+This section is repo-local working memory for future sessions. Treat it as a high-signal startup brief and keep it updated when the project state materially changes.
+
+### User preferences and standing constraints
+- User prefers autonomous continuation: do the next safe step instead of asking for permission.
+- After each meaningful completed stage, update `docs/CHANGELOG.md`, then `git commit`, then `git push`.
+- Python interpreter to prefer for this repo:
+  - `/usr/local/miniconda3/bin/python`
+- Documentation preferences:
+  - prioritize diagrams, then tables, then concise explanation
+  - prefer concentrated/condensed docs over many small overlapping files
+  - use relative-path markdown links for repo-local navigation; do not wrap local doc paths as inert code strings
+- Dataset strategy preferences:
+  - maximize reuse of open datasets for personal-use training/evaluation
+  - use some open data for training and keep some fixed for evaluation
+  - document raw dataset format, processed format, manifests, scripts, and labeling rules clearly for future custom-dataset expansion
+- Large data safety:
+  - do not accidentally commit large dataset blobs unless intentionally using Git LFS and the stage explicitly calls for it
+  - avoid committing transient `__pycache__` or smoke-generated bulk audio copies unless explicitly intended
+
+### Current product / technical direction
+- Domain: music ACR / retrieval pipeline
+- Input direction:
+  - music-task input has moved from 40-dim MFCC assumptions toward 128-dim Mel features
+  - band-split path is enabled in current model direction
+- Dataset semantics:
+  - separate `reference` catalog from `query` training/evaluation samples
+  - preserve `song_id`, `type`, `offset`, `source_dataset`, and split semantics in manifests
+- Hard-case emphasis:
+  - keep explicit support for `clean`, `augmented`, `confused`, and `humming_like`
+  - confusion-oriented techniques remain a preferred optimization lane
+
+### Verified repo facts as of 2026-06-02
+- Main app root:
+  - `/workspace/acr-engine`
+- Main docs root:
+  - `/workspace/docs`
+- Real FMA local dataset:
+  - archive source used: `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip`
+  - extracted audio root: `acr-engine/data/raw/fma_small_audio`
+  - verified local file count for smoke readiness: `8000`
+- Real training / indexing behavior:
+  - training dataset path uses random 5s crops rather than pre-expanded overlapping windows
+  - retrieval / reference embedding path uses 5.0s windows with 2.5s stride (50% overlap)
+  - external manifest generation currently creates one random query per eligible track by default, commonly at 8.0s
+- `smoke-local` orchestration:
+  - now supports `--device cpu|cuda|auto`
+  - `auto` is resolved inside the adapter before invoking downstream train/index/eval CLIs
+- Current host capability snapshot on 2026-06-02:
+  - `torch.cuda.is_available() == False`
+  - real long-running FMA smoke therefore currently runs on CPU on this host
+
+### Recently completed engineering stages
+- Documentation was strengthened around:
+  - dataset spec
+  - training data / pgvector guidance
+  - open dataset workflow
+  - FMA / external dataset handling
+  - overlap-window vs random-crop behavior
+  - GPU / CPU execution semantics
+- `smoke-local` device selection support was added.
+- build-index observability was improved:
+  - `run_demo.py build-index` now announces chromaprint vs embedding phases
+  - `ECAPAEmbedder.build_reference_index` now logs start/progress/finish with refs/windows/elapsed/eta
+
+### Important current status to resume from
+- A real FMA smoke run was launched on 2026-06-02 and has progressed through training into CPU `build-index`.
+- On this host, the real FMA post-training bottleneck is CPU embedding-index construction; no deadlock has been confirmed.
+- Small-data verification already proved:
+  - `smoke-local --device auto` resolves to `cpu` on this host
+  - manual `build-index` + `evaluate` succeed on smoke artifacts with `top1=1.0`, `topk=1.0`
+
+### Highest-value next steps
+1. Continue monitoring or resuming the real FMA smoke artifacts until fresh index/report timestamps confirm completion.
+2. Unify the current 5s vs 8s configuration story across:
+   - manifest query duration
+   - train clip duration
+   - eval/report metadata
+3. Add overlapping-query manifest generation for external datasets when broader coverage is needed.
+4. Continue industrialization work:
+   - improve index-stage performance / observability
+   - strengthen dataset governance and reusable ingestion docs
+   - keep handoff docs current for new sessions
+
+### Files future sessions should inspect first
+- `docs/README.md`
+- `docs/CHANGELOG.md`
+- `docs/session-handoff.md`
+- `docs/dataset-spec.md`
+- `docs/training-data-and-pgvector-guide.md`
+- `docs/open-dataset-workflow.md`
+- `acr-engine/src/data/external_adapters.py`
+- `acr-engine/src/data/manifest_tools.py`
+- `acr-engine/src/data/dataset.py`
+- `acr-engine/src/engines/ecapa_embedder.py`
+
+</project_continuity_memory>
+
+---
+
 ## Setup

 Execute `omx setup` to install all components. Execute `omx doctor` to verify installation.
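
The `Verified repo facts` hunk above states that the retrieval / reference embedding path uses 5.0s windows with a 2.5s stride. A minimal sketch of how many windows that produces per reference, assuming only full windows are kept and the tail is not padded. The real `_windows` helper is not shown in this commit and may behave differently; `window_count` is a hypothetical name used only for illustration.

```python
import math


def window_count(duration_sec: float, window_sec: float = 5.0, stride_sec: float = 2.5) -> int:
    """Count full sliding windows in a clip (assumption: no tail padding)."""
    if duration_sec < window_sec:
        return 0  # clip is too short for even one full window
    # One window at offset 0, plus one more for every full stride that still fits.
    return math.floor((duration_sec - window_sec) / stride_sec) + 1


print(window_count(30.0))  # 11
print(window_count(15.0))  # 5
```

Under that assumption each reference costs several model forward passes during `build-index`, which fits the CPU embedding-index bottleneck recorded in the status notes.
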
--- a/acr-engine/run_demo.py
+++ b/acr-engine/run_demo.py
@@ -56,7 +56,9 @@ def cmd_build_index(args):
    out_dir = Path(args.output)
    out_dir.mkdir(parents=True, exist_ok=True)

+    print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}")
    build_chroma_index(data_dir, out_dir)
+    print(f"[build-index] starting embedding index: model={args.model} device={args.device}")
    build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device)


--- a/acr-engine/src/engines/ecapa_embedder.py
+++ b/acr-engine/src/engines/ecapa_embedder.py
 import json
 from pathlib import Path
 from typing import List, Optional, Tuple
+import time

 import librosa
 import numpy as np
@@ -101,22 +102,35 @@ class ECAPAEmbedder:
        all_embs = []
        all_ids = []
        songs_dir = Path(songs_dir)
+        refs = [item for item in meta if item.get("type") == "reference"]
+        total_refs = len(refs)
+        start_time = time.time()
+        print(
+            f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
+            f"window_sec={window_sec} stride_sec={stride_sec}"
+        )

-        for item in meta:
-            if item.get("type") != "reference":
-                continue
+        for ref_idx, item in enumerate(refs, start=1):
            audio_path = songs_dir.parent / item["audio_path"]
            if not audio_path.exists():
                continue
            song_id = item["song_id"]
            y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
-
-            for seg in self._windows(y, window_sec=window_sec, stride_sec=stride_sec):
+            windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec)
+            for seg in windows:
                mel = self._to_mel(seg).to(self.device)
                with torch.no_grad():
                    emb, _ = self.model(mel)
                all_embs.append(emb.cpu().numpy().flatten())
                all_ids.append(song_id)
+            if ref_idx == 1 or ref_idx % 250 == 0 or ref_idx == total_refs:
+                elapsed = max(time.time() - start_time, 1e-6)
+                refs_per_sec = ref_idx / elapsed
+                eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
+                print(
+                    f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
+                    f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
+                )

        all_embs = np.vstack(all_embs)
        np.save(f"{output_path}_embs.npy", all_embs)
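
The progress block in the hunk above derives its ETA from the average throughput so far and logs on the first, every 250th, and the last reference. A minimal standalone sketch of that arithmetic follows; `eta_seconds` and `should_log` are hypothetical helper names, not functions in the repo.

```python
def eta_seconds(done: int, total: int, elapsed_sec: float) -> float:
    """Estimate remaining seconds from the average throughput so far."""
    if done <= 0:
        return float("inf")  # no throughput sample yet
    elapsed_sec = max(elapsed_sec, 1e-6)  # same floor the hunk uses to avoid dividing by zero
    items_per_sec = done / elapsed_sec
    return (total - done) / items_per_sec


def should_log(idx: int, total: int, every: int = 250) -> bool:
    """Log on the first item, on every `every`-th item, and on the last item."""
    return idx == 1 or idx % every == 0 or idx == total


print(eta_seconds(250, 8000, 100.0))  # 3100.0
```

One caveat when reading the logs: in the hunk, the `continue` for a missing audio file runs before the progress check, so a reference that is skipped never triggers a log line. If the last reference is missing, the `refs=total/total` progress line will not appear even though the loop finished.
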
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,66 @@

 ## 2026-06-02

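+### Stage: Persist continuous-development preferences and current progress into AGENTS.md
+
+Completed:
+- Added a `Project Continuity Memory / 持续开发记忆` section to the root-level `AGENTS.md`
+- Recorded the user's standing preferences:
+  - continue autonomously
+  - update the changelog and commit/push after each stage
+  - use `/usr/local/miniconda3/bin/python`
+  - docs favor diagrams, tables, relative-path links, and a condensed structure
+- Recorded the current project direction, real FMA data status, `smoke-local` device support, and the fact that the current host has no CUDA
+- Recorded recently completed engineering stages and suggested next steps so a new session can pick up directly
+
+Verification results:
+- `AGENTS.md` can be searched for the following key memory anchors:
+  - `Project Continuity Memory`
+  - `smoke-local`
+  - `fma_small_audio`
+  - `torch.cuda.is_available() == False`
+  - `Highest-value next steps`
+- The new content matches the current code/docs state:
+  - `smoke-local` supports `--device cpu|cuda|auto`
+  - the build-index progress-logging enhancement is complete
+  - the real FMA long run is still in the CPU `build-index` stage
+
+Conclusion:
+- A new session no longer needs to rediscover user preferences, data directories, the current bottleneck, or the next-step plan at startup
+- `AGENTS.md` now serves as the repo-level entry point for continuous-development memory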
+
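+### Stage: Improve build-index progress visibility to reduce the cost of misdiagnosing real FMA long runs
+
+Completed:
+- Modified `acr-engine/src/engines/ecapa_embedder.py`
+  - `build_reference_index` now emits start/progress/finish logs
+  - logs include `refs`, `windows`, `elapsed_sec`, `eta_sec`
+- Modified `acr-engine/run_demo.py`
+  - the `build-index` stage now explicitly prints a start message for each of the chromaprint and embedding phases
+- Rechecked the host device conditions and confirmed this machine currently has no CUDA, so only CPU is available
+
+Verification results:
+- Host device:
+  - `torch.cuda.is_available() = False`
+  - `device_count = 0`
+- Small-data verification:
+  - Ran `PYTHONUNBUFFERED=1 /usr/local/miniconda3/bin/python run_demo.py build-index ... --device cpu` against `/tmp/acr_smoke_device_test/fma/manifests`
+  - Observed the new stage logs:
+    - `[build-index] starting chromaprint index ...`
+    - `[build-index] starting embedding index ...`
+    - `[build-reference-index] start: refs=24 ...`
+    - `[build-reference-index] progress: refs=1/24 ...`
+    - `[build-reference-index] progress: refs=24/24 ...`
+  - On completion it printed:
+    - `Built reference index: 120 windows, embeddings shape (120, 192)`
+- Real FMA status recheck:
+  - the real long run is still at `run_demo.py build-index ... --device cpu`
+  - but it can now be clearly identified as a long-running embedding-index build on CPU rather than a silent apparent hang
+
+Conclusion:
+- The main bottleneck of the real FMA long run is now clearly localized to CPU embedding-index construction
+- Even without a GPU on the current host, long runs are now more observable and easier to diagnose, which eases a later migration to a CUDA machine or further index-stage optimization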
+
 ### Stage: Add explicit device selection to smoke-local and verify auto device resolution

 Completed: