Commit 05a2ccca 05a2ccca90a395304395d692c1641f9fed3e8488 by cnb.bofCdSsphPA

Preserve repo continuity before the next session handoff

Constraint: Future sessions need startup memory for user preferences, real-data status, and the current FMA bottleneck without re-discovery
Rejected: Leave continuity only in transient chat context | Would force every new session to reconstruct state from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep AGENTS continuity memory concise, code-true, and refreshed when project direction or bottlenecks materially change
Tested: AGENTS.md anchor search for continuity keys; verified host CUDA snapshot; verified build-index progress logs on small smoke artifacts
Not-tested: Full completion of the long-running real FMA CPU build-index stage
1 parent cc263571
......@@ -373,6 +373,108 @@ Do not manually duplicate hook-owned activation state unless recovering from mis
---
<project_continuity_memory>
## Project Continuity Memory / 持续开发记忆
This section is repo-local working memory for future sessions. Treat it as a high-signal startup brief and keep it updated when the project state materially changes.
### User preferences and standing constraints
- User prefers autonomous continuation: do the next safe step instead of asking for permission.
- After each meaningful completed stage, update `docs/CHANGELOG.md`, then `git commit`, then `git push`.
- Python interpreter to prefer for this repo:
  - `/usr/local/miniconda3/bin/python`
- Documentation preferences:
  - prioritize diagrams, then tables, then concise explanation
  - prefer concentrated/condensed docs over many small overlapping files
  - use relative-path markdown links for repo-local navigation; do not wrap local doc paths as inert code strings
- Dataset strategy preferences:
  - maximize reuse of open datasets for personal-use training/evaluation
  - use some open data for training and keep some fixed for evaluation
  - document raw dataset format, processed format, manifests, scripts, and labeling rules clearly for future custom-dataset expansion
- Large data safety:
  - do not accidentally commit large dataset blobs unless intentionally using Git LFS and the stage explicitly calls for it
  - avoid committing transient `__pycache__` or smoke-generated bulk audio copies unless explicitly intended
### Current product / technical direction
- Domain: music ACR / retrieval pipeline
- Input direction:
  - music-task input has moved from 40-dim MFCC assumptions toward 128-dim Mel features
  - band-split path is enabled in current model direction
- Dataset semantics:
  - separate `reference` catalog from `query` training/evaluation samples
  - preserve `song_id`, `type`, `offset`, `source_dataset`, and split semantics in manifests
- Hard-case emphasis:
  - keep explicit support for `clean`, `augmented`, `confused`, and `humming_like`
  - confusion-oriented techniques remain a preferred optimization lane
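The `reference` / `query` separation above can be sketched as a filter over manifest records. This is a minimal illustration, assuming a manifest is a list of dict-like records: `song_id`, `type`, `offset`, `source_dataset`, and `audio_path` are names that appear in this repo, while the `split` key, the sample values, and the helper name `split_catalog` are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_catalog(manifest: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate reference-catalog records from query samples by their `type` field."""
    refs = [r for r in manifest if r.get("type") == "reference"]
    queries = [r for r in manifest if r.get("type") == "query"]
    return refs, queries


# Hypothetical records; only the key names (except `split`) come from the repo.
manifest: List[Record] = [
    {"song_id": "song_0001", "type": "reference", "offset": 0.0,
     "source_dataset": "fma_small", "split": "train",
     "audio_path": "audio/song_0001.mp3"},
    {"song_id": "song_0001", "type": "query", "offset": 8.0,
     "source_dataset": "fma_small", "split": "eval",
     "audio_path": "queries/song_0001_q0.wav"},
]

refs, queries = split_catalog(manifest)
print(len(refs), len(queries))
```

Keeping the filter keyed on `type` matches how `build_reference_index` selects its inputs, so a record with a missing or misspelled `type` is silently excluded from both lists.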
### Verified repo facts as of 2026-06-02
- Main app root:
  - `/workspace/acr-engine`
- Main docs root:
  - `/workspace/docs`
- Real FMA local dataset:
  - archive source used: `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip`
  - extracted audio root: `acr-engine/data/raw/fma_small_audio`
  - verified local file count for smoke readiness: `8000`
- Real training / indexing behavior:
  - training dataset path uses random 5s crops rather than pre-expanded overlapping windows
  - retrieval / reference embedding path uses 5.0s windows with 2.5s stride (50% overlap)
  - external manifest generation currently creates one random query per eligible track by default, commonly at 8.0s
- `smoke-local` orchestration:
  - now supports `--device cpu|cuda|auto`
  - `auto` is resolved inside the adapter before invoking downstream train/index/eval CLIs
- Current host capability snapshot on 2026-06-02:
  - `torch.cuda.is_available() == False`
  - real long-running FMA smoke therefore currently runs on CPU on this host
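The difference between random training crops and overlapping reference windows noted above can be made concrete with a little arithmetic. The sketch below is not the repo's `_windows` implementation: the function names are hypothetical, and it assumes that a trailing partial window is dropped rather than padded.

```python
import math
import random
from typing import List, Optional


def reference_window_starts(duration_sec: float,
                            window_sec: float = 5.0,
                            stride_sec: float = 2.5) -> List[float]:
    """Start times of full windows; assumes tail segments shorter than a window are dropped."""
    if duration_sec < window_sec:
        return []
    n = int(math.floor((duration_sec - window_sec) / stride_sec)) + 1
    return [i * stride_sec for i in range(n)]


def random_crop_start(duration_sec: float,
                      crop_sec: float = 5.0,
                      rng: Optional[random.Random] = None) -> float:
    """Start time of a single random training crop."""
    rng = rng or random.Random()
    if duration_sec <= crop_sec:
        return 0.0
    return rng.uniform(0.0, duration_sec - crop_sec)


starts = reference_window_starts(30.0)
print(len(starts), starts[:3], starts[-1])  # 11 [0.0, 2.5, 5.0] 25.0
```

Under these assumptions a 30-second clip produces 11 reference windows, so indexing needs roughly an order of magnitude more forward passes per track than drawing one random crop, which is consistent with index construction dominating a CPU-only run.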
### Recently completed engineering stages
- Documentation was strengthened around:
  - dataset spec
  - training data / pgvector guidance
  - open dataset workflow
  - FMA / external dataset handling
  - overlap-window vs random-crop behavior
  - GPU / CPU execution semantics
- `smoke-local` device selection support was added.
- build-index observability was improved:
  - `run_demo.py build-index` now announces chromaprint vs embedding phases
  - `ECAPAEmbedder.build_reference_index` now logs start/progress/finish with refs/windows/elapsed/eta
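The elapsed/ETA figures in the progress log come from simple rate arithmetic. The sketch below restates that arithmetic as two standalone helpers; the helper names are hypothetical, and the ETA assumes every reference costs about the same time to embed, which will not hold exactly when track lengths differ.

```python
def estimate_eta_sec(done: int, total: int, elapsed_sec: float) -> float:
    """Remaining seconds, extrapolated from the average rate so far."""
    elapsed = max(elapsed_sec, 1e-6)  # guard against division by zero
    rate = done / elapsed             # references per second
    return (total - done) / rate if rate > 0 else 0.0


def should_log(ref_idx: int, total: int, every: int = 250) -> bool:
    """Log on the first item, every `every` items, and on the last item."""
    return ref_idx == 1 or ref_idx % every == 0 or ref_idx == total


# Illustrative numbers only: 250 of 8000 references done after 100 seconds.
print(estimate_eta_sec(250, 8000, 100.0))  # 3100.0
print([i for i in range(1, 601) if should_log(i, 600)])  # [1, 250, 500, 600]
```

Logging the first item matters on slow hosts: it confirms that the loop is alive long before the first 250-reference checkpoint is reached.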
### Important current status to resume from
- A real FMA smoke run was launched on 2026-06-02 and has progressed through training into CPU `build-index`.
- On this host, the real FMA post-training bottleneck is CPU embedding-index construction, not a confirmed deadlock.
- Small-data verification already proved:
  - `smoke-local --device auto` resolves to `cpu` on this host
  - manual `build-index` + `evaluate` succeed on smoke artifacts with `top1=1.0`, `topk=1.0`
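Resolving `auto` before calling downstream CLIs can be sketched as follows. This is an assumption-laden illustration rather than the adapter's actual code: `resolve_device` and `cuda_available` are hypothetical names, and passing an explicit `cuda` request through unchanged on a host without CUDA is a guess about the desired behavior.

```python
def cuda_available() -> bool:
    """True only if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def resolve_device(requested: str, has_cuda: bool) -> str:
    """Map cpu|cuda|auto to a concrete device string for downstream CLIs."""
    if requested not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "auto":
        return "cuda" if has_cuda else "cpu"
    # Explicit requests pass through unchanged (assumption).
    return requested


# On a host without CUDA, `auto` resolves to `cpu`.
print(resolve_device("auto", has_cuda=False))  # cpu
```

Resolving once in the orchestrator means the train, index, and eval steps all receive the same concrete device string instead of each re-evaluating `auto`.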
### Highest-value next steps
1. Continue monitoring the real FMA smoke run, or resume it, until fresh index/report timestamps confirm completion.
2. Unify the current 5s vs 8s configuration story across:
   - manifest query duration
   - train clip duration
   - eval/report metadata
3. Add overlapping-query manifest generation for external datasets when broader coverage is needed.
4. Continue industrialization work:
   - improve index-stage performance / observability
   - strengthen dataset governance and reusable ingestion docs
   - keep handoff docs current for new sessions
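One possible shape for step 2 is a single timing object that every stage reads from and that can be dumped into report metadata. The class and field names below are hypothetical and the repo may organize this differently; only the default values (5s train crops, 5.0s/2.5s index windows, 8.0s queries) come from the facts recorded above.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ClipTiming:
    """Hypothetical single source of truth for clip durations, in seconds."""
    train_clip_sec: float = 5.0
    index_window_sec: float = 5.0
    index_stride_sec: float = 2.5
    query_sec: float = 8.0

    def mismatches(self) -> List[str]:
        """Describe places where stages disagree on clip length."""
        issues = []
        if self.query_sec != self.train_clip_sec:
            issues.append(
                f"query_sec={self.query_sec} != train_clip_sec={self.train_clip_sec}"
            )
        if self.index_window_sec != self.train_clip_sec:
            issues.append(
                f"index_window_sec={self.index_window_sec} != "
                f"train_clip_sec={self.train_clip_sec}"
            )
        return issues


timing = ClipTiming()
print(timing.mismatches())
print(asdict(timing))  # suitable for embedding in eval/report metadata
```

With the current defaults the check flags the 8s query versus 5s training crop gap, which is exactly the inconsistency this step aims to make explicit.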
### Files future sessions should inspect first
- `docs/README.md`
- `docs/CHANGELOG.md`
- `docs/session-handoff.md`
- `docs/dataset-spec.md`
- `docs/training-data-and-pgvector-guide.md`
- `docs/open-dataset-workflow.md`
- `acr-engine/src/data/external_adapters.py`
- `acr-engine/src/data/manifest_tools.py`
- `acr-engine/src/data/dataset.py`
- `acr-engine/src/engines/ecapa_embedder.py`
</project_continuity_memory>
---
## Setup
Execute `omx setup` to install all components. Execute `omx doctor` to verify installation.
......
......@@ -56,7 +56,9 @@ def cmd_build_index(args):
     out_dir = Path(args.output)
     out_dir.mkdir(parents=True, exist_ok=True)
+    print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}")
     build_chroma_index(data_dir, out_dir)
+    print(f"[build-index] starting embedding index: model={args.model} device={args.device}")
     build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device)
......
 import json
 from pathlib import Path
 from typing import List, Optional, Tuple
+import time
 import librosa
 import numpy as np
......@@ -101,22 +102,35 @@ class ECAPAEmbedder:
         all_embs = []
         all_ids = []
         songs_dir = Path(songs_dir)
+        refs = [item for item in meta if item.get("type") == "reference"]
+        total_refs = len(refs)
+        start_time = time.time()
+        print(
+            f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
+            f"window_sec={window_sec} stride_sec={stride_sec}"
+        )
-        for item in meta:
-            if item.get("type") != "reference":
-                continue
+        for ref_idx, item in enumerate(refs, start=1):
             audio_path = songs_dir.parent / item["audio_path"]
             if not audio_path.exists():
                 continue
             song_id = item["song_id"]
             y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
-            for seg in self._windows(y, window_sec=window_sec, stride_sec=stride_sec):
+            windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec)
+            for seg in windows:
                 mel = self._to_mel(seg).to(self.device)
                 with torch.no_grad():
                     emb, _ = self.model(mel)
                 all_embs.append(emb.cpu().numpy().flatten())
                 all_ids.append(song_id)
+            if ref_idx == 1 or ref_idx % 250 == 0 or ref_idx == total_refs:
+                elapsed = max(time.time() - start_time, 1e-6)
+                refs_per_sec = ref_idx / elapsed
+                eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
+                print(
+                    f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
+                    f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
+                )
         all_embs = np.vstack(all_embs)
         np.save(f"{output_path}_embs.npy", all_embs)
......
......@@ -2,6 +2,66 @@
## 2026-06-02
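### Stage: Persist continuous-development preferences and current progress in AGENTS.md
Completed:
- Added `Project Continuity Memory / 持续开发记忆` to the root-level `AGENTS.md`
- Recorded long-standing user preferences:
  - continue autonomously
  - update the changelog and commit/push after each stage
  - use `/usr/local/miniconda3/bin/python`
  - documentation favors diagrams, tables, relative-path links, and a condensed structure
- Recorded the current project direction, the real FMA data status, `smoke-local` device support, and the fact that the current host has no CUDA
- Recorded recently completed engineering stages and suggested next steps so that a new session can resume directly
Verification results:
- `AGENTS.md` can be searched for the following key memory anchors:
  - `Project Continuity Memory`
  - `smoke-local`
  - `fma_small_audio`
  - `torch.cuda.is_available() == False`
  - `Highest-value next steps`
- The new content is consistent with the current code and documentation state:
  - `smoke-local` supports `--device cpu|cuda|auto`
  - the build-index progress logging improvements are complete
  - the real long-running FMA run is still in the CPU `build-index` stage
Conclusion:
- A new session no longer needs to rediscover user preferences, data directories, the current bottleneck, or the next-step plan at startup
- `AGENTS.md` now serves as the repo-level entry point for continuous-development memory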
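### Stage: Improve build-index progress visibility to reduce the cost of misdiagnosing long real-FMA runs
Completed:
- Modified `acr-engine/src/engines/ecapa_embedder.py`
  - `build_reference_index` now emits start/progress/finish logs
  - the logs include `refs`, `windows`, `elapsed_sec`, and `eta_sec`
- Modified `acr-engine/run_demo.py`
  - the `build-index` stage now explicitly prints a start notice for both the chromaprint and embedding phases
- Rechecked the current host's device conditions and confirmed that this machine has no CUDA, so only CPU is available
Verification results:
- Host device:
  - `torch.cuda.is_available() = False`
  - `device_count = 0`
- Small-data verification:
  - ran `PYTHONUNBUFFERED=1 /usr/local/miniconda3/bin/python run_demo.py build-index ... --device cpu` against `/tmp/acr_smoke_device_test/fma/manifests`
  - observed the new phase logs:
    - `[build-index] starting chromaprint index ...`
    - `[build-index] starting embedding index ...`
    - `[build-reference-index] start: refs=24 ...`
    - `[build-reference-index] progress: refs=1/24 ...`
    - `[build-reference-index] progress: refs=24/24 ...`
  - the run finished successfully with:
    - `Built reference index: 120 windows, embeddings shape (120, 192)`
- Real FMA status recheck:
  - the real long-running run is still at `run_demo.py build-index ... --device cpu`
  - it can now be clearly identified as a long CPU embedding-index build rather than a silent apparent hang
Conclusion:
- The main bottleneck of the real long-running FMA run is now clearly located in CPU embedding-index construction
- Even without a GPU on the current host, long runs are now more observable and easier to diagnose, which helps a later migration to a CUDA machine or further index-stage optimization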
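To tell a slow run from a stalled one, the progress lines can be parsed from captured output. The sketch below assumes the line format produced by the f-string in `build_reference_index`; the function name `parse_progress` and the sample line's numeric values are illustrative, not taken from a real run.

```python
import re
from typing import Dict, Optional

# Matches lines such as:
# [build-reference-index] progress: refs=1/24 windows=5 elapsed_sec=0.4 eta_sec=9.2
PROGRESS_RE = re.compile(
    r"\[build-reference-index\] progress: "
    r"refs=(?P<done>\d+)/(?P<total>\d+) "
    r"windows=(?P<windows>\d+) "
    r"elapsed_sec=(?P<elapsed>\d+(?:\.\d+)?) "
    r"eta_sec=(?P<eta>\d+(?:\.\d+)?)"
)


def parse_progress(line: str) -> Optional[Dict[str, float]]:
    """Return the numeric fields of a progress line, or None if it does not match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return {
        "done": int(m["done"]),
        "total": int(m["total"]),
        "windows": int(m["windows"]),
        "elapsed_sec": float(m["elapsed"]),
        "eta_sec": float(m["eta"]),
    }


# Illustrative line; the numbers are made up.
sample = "[build-reference-index] progress: refs=1/24 windows=5 elapsed_sec=0.4 eta_sec=9.2"
print(parse_progress(sample))
```

Comparing `done` across two reads of the log a few minutes apart gives a direct liveness signal without attaching to the process.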
### Stage: Add explicit device selection to smoke-local and verify auto device resolution
Completed:
......