Commit 05a2ccca 05a2ccca90a395304395d692c1641f9fed3e8488 by cnb.bofCdSsphPA

Preserve repo continuity before the next session handoff

Constraint: Future sessions need startup memory for user preferences, real-data status, and the current FMA bottleneck without re-discovery
Rejected: Leave continuity only in transient chat context | Would force every new session to reconstruct state from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep AGENTS continuity memory concise, code-true, and refreshed when project direction or bottlenecks materially change
Tested: AGENTS.md anchor search for continuity keys; verified host CUDA snapshot; verified build-index progress logs on small smoke artifacts
Not-tested: Full completion of the long-running real FMA CPU build-index stage
1 parent cc263571
...@@ -373,6 +373,108 @@ Do not manually duplicate hook-owned activation state unless recovering from mis
373 373
374 --- 374 ---
375 375
376 <project_continuity_memory>
377 ## Project Continuity Memory / 持续开发记忆
378
379 This section is repo-local working memory for future sessions. Treat it as a high-signal startup brief and keep it updated when the project state materially changes.
380
381 ### User preferences and standing constraints
382 - User prefers autonomous continuation: do the next safe step instead of asking for permission.
383 - After each meaningful completed stage, update `docs/CHANGELOG.md`, then `git commit`, then `git push`.
384 - Python interpreter to prefer for this repo:
385 - `/usr/local/miniconda3/bin/python`
386 - Documentation preferences:
387 - prioritize diagrams, then tables, then concise explanation
388 - prefer concentrated/condensed docs over many small overlapping files
389 - use relative-path markdown links for repo-local navigation; do not wrap local doc paths as inert code strings
390 - Dataset strategy preferences:
391 - maximize reuse of open datasets for personal-use training/evaluation
392 - use some open data for training and keep some fixed for evaluation
393 - document raw dataset format, processed format, manifests, scripts, and labeling rules clearly for future custom-dataset expansion
394 - Large data safety:
395 - do not accidentally commit large dataset blobs unless intentionally using Git LFS and the stage explicitly calls for it
396 - avoid committing transient `__pycache__` or smoke-generated bulk audio copies unless explicitly intended
397
398 ### Current product / technical direction
399 - Domain: music ACR / retrieval pipeline
400 - Input direction:
401 - music-task input has moved from 40-dim MFCC assumptions toward 128-dim Mel features
402 - band-split path is enabled in current model direction
403 - Dataset semantics:
404 - separate `reference` catalog from `query` training/evaluation samples
405 - preserve `song_id`, `type`, `offset`, `source_dataset`, and split semantics in manifests
406 - Hard-case emphasis:
407 - keep explicit support for `clean`, `augmented`, `confused`, and `humming_like`
408 - confusion-oriented techniques remain a preferred optimization lane
409
410 ### Verified repo facts as of 2026-06-02
411 - Main app root:
412 - `/workspace/acr-engine`
413 - Main docs root:
414 - `/workspace/docs`
415 - Real FMA local dataset:
416 - archive source used: `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip`
417 - extracted audio root: `acr-engine/data/raw/fma_small_audio`
418 - verified local file count for smoke readiness: `8000`
419 - Real training / indexing behavior:
420 - training dataset path uses random 5s crops rather than pre-expanded overlapping windows
421 - retrieval / reference embedding path uses 5.0s windows with 2.5s stride (50% overlap)
422 - external manifest generation currently creates one random query per eligible track by default, commonly at 8.0s
423 - `smoke-local` orchestration:
424 - now supports `--device cpu|cuda|auto`
425 - `auto` is resolved inside the adapter before invoking downstream train/index/eval CLIs
426 - Current host capability snapshot on 2026-06-02:
427 - `torch.cuda.is_available() == False`
428 - real long-running FMA smoke therefore currently runs on CPU on this host
429
430 ### Recently completed engineering stages
431 - Documentation was strengthened around:
432 - dataset spec
433 - training data / pgvector guidance
434 - open dataset workflow
435 - FMA / external dataset handling
436 - overlap-window vs random-crop behavior
437 - GPU / CPU execution semantics
438 - `smoke-local` device selection support was added.
439 - build-index observability was improved:
440 - `run_demo.py build-index` now announces chromaprint vs embedding phases
441 - `ECAPAEmbedder.build_reference_index` now logs start/progress/finish with refs/windows/elapsed/eta
442
443 ### Important current status to resume from
444 - A real FMA smoke run was launched on 2026-06-02 and has progressed through training into CPU `build-index`.
445 - On this host, the real FMA post-training bottleneck is CPU embedding-index construction, not a confirmed deadlock.
446 - Small-data verification already proved:
447 - `smoke-local --device auto` resolves to `cpu` on this host
448 - manual `build-index` + `evaluate` succeed on smoke artifacts with `top1=1.0`, `topk=1.0`
449
450 ### Highest-value next steps
451 1. Continue monitoring (or resume) the real FMA smoke run until fresh index/report timestamps on its artifacts confirm completion.
452 2. Unify the current 5s vs 8s configuration story across:
453 - manifest query duration
454 - train clip duration
455 - eval/report metadata
456 3. Add overlapping-query manifest generation for external datasets when broader coverage is needed.
457 4. Continue industrialization work:
458 - improve index-stage performance / observability
459 - strengthen dataset governance and reusable ingestion docs
460 - keep handoff docs current for new sessions
461
462 ### Files future sessions should inspect first
463 - `docs/README.md`
464 - `docs/CHANGELOG.md`
465 - `docs/session-handoff.md`
466 - `docs/dataset-spec.md`
467 - `docs/training-data-and-pgvector-guide.md`
468 - `docs/open-dataset-workflow.md`
469 - `acr-engine/src/data/external_adapters.py`
470 - `acr-engine/src/data/manifest_tools.py`
471 - `acr-engine/src/data/dataset.py`
472 - `acr-engine/src/engines/ecapa_embedder.py`
473
474 </project_continuity_memory>
475
476 ---
477
376 ## Setup 478 ## Setup
377 479
378 Execute `omx setup` to install all components. Execute `omx doctor` to verify installation. 480 Execute `omx setup` to install all components. Execute `omx doctor` to verify installation.
......
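The `--device cpu|cuda|auto` behavior recorded in the continuity memory (with `auto` resolved before the downstream train/index/eval CLIs are invoked) can be illustrated with a minimal sketch. The name `resolve_device` and the injected `cuda_available` probe are hypothetical and not taken from the repo; in practice the probe would presumably be `torch.cuda.is_available`.

```python
from typing import Callable

VALID_DEVICES = ("cpu", "cuda", "auto")


def resolve_device(requested: str, cuda_available: Callable[[], bool]) -> str:
    """Resolve a requested device string to a concrete 'cpu' or 'cuda'.

    `cuda_available` is an injected probe (an assumption of this sketch) so
    the function can be exercised on hosts without a GPU or without torch.
    """
    if requested not in VALID_DEVICES:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "auto":
        # Only 'auto' consults the probe; explicit choices pass through as-is.
        return "cuda" if cuda_available() else "cpu"
    return requested
```

On the host described above, `resolve_device("auto", lambda: False)` yields `"cpu"`, which matches the recorded small-data verification. Whether the real adapter rejects an explicit `cuda` request on a CUDA-less host is not shown in this diff, so the sketch simply passes it through.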
...@@ -56,7 +56,9 @@ def cmd_build_index(args):
56 out_dir = Path(args.output) 56 out_dir = Path(args.output)
57 out_dir.mkdir(parents=True, exist_ok=True) 57 out_dir.mkdir(parents=True, exist_ok=True)
58 58
59 print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}")
59 build_chroma_index(data_dir, out_dir) 60 build_chroma_index(data_dir, out_dir)
61 print(f"[build-index] starting embedding index: model={args.model} device={args.device}")
60 build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device) 62 build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device)
61 63
62 64
......
1 import json 1 import json
2 from pathlib import Path 2 from pathlib import Path
3 from typing import List, Optional, Tuple 3 from typing import List, Optional, Tuple
4 import time
4 5
5 import librosa 6 import librosa
6 import numpy as np 7 import numpy as np
...@@ -101,22 +102,35 @@ class ECAPAEmbedder:
101 all_embs = [] 102 all_embs = []
102 all_ids = [] 103 all_ids = []
103 songs_dir = Path(songs_dir) 104 songs_dir = Path(songs_dir)
105 refs = [item for item in meta if item.get("type") == "reference"]
106 total_refs = len(refs)
107 start_time = time.time()
108 print(
109 f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
110 f"window_sec={window_sec} stride_sec={stride_sec}"
111 )
104 112
105 for item in meta: 113 for ref_idx, item in enumerate(refs, start=1):
106 if item.get("type") != "reference":
107 continue
108 audio_path = songs_dir.parent / item["audio_path"] 114 audio_path = songs_dir.parent / item["audio_path"]
109 if not audio_path.exists(): 115 if not audio_path.exists():
110 continue 116 continue
111 song_id = item["song_id"] 117 song_id = item["song_id"]
112 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) 118 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
113 119 windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec)
114 for seg in self._windows(y, window_sec=window_sec, stride_sec=stride_sec): 120 for seg in windows:
115 mel = self._to_mel(seg).to(self.device) 121 mel = self._to_mel(seg).to(self.device)
116 with torch.no_grad(): 122 with torch.no_grad():
117 emb, _ = self.model(mel) 123 emb, _ = self.model(mel)
118 all_embs.append(emb.cpu().numpy().flatten()) 124 all_embs.append(emb.cpu().numpy().flatten())
119 all_ids.append(song_id) 125 all_ids.append(song_id)
126 if ref_idx == 1 or ref_idx % 250 == 0 or ref_idx == total_refs:
127 elapsed = max(time.time() - start_time, 1e-6)
128 refs_per_sec = ref_idx / elapsed
129 eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
130 print(
131 f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
132 f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
133 )
120 134
121 all_embs = np.vstack(all_embs) 135 all_embs = np.vstack(all_embs)
122 np.save(f"{output_path}_embs.npy", all_embs) 136 np.save(f"{output_path}_embs.npy", all_embs)
......
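The body of `self._windows` is not part of this diff. As a rough illustration of the 5.0s window / 2.5s stride (50% overlap) layout that the reference index uses, the following sketch computes window start offsets in samples. The function name `window_starts` and the choice to drop clips shorter than one window and any trailing partial window are assumptions of this sketch, not confirmed repo behavior.

```python
from typing import List


def window_starts(num_samples: int, sr: int,
                  window_sec: float = 5.0, stride_sec: float = 2.5) -> List[int]:
    """Return start offsets (in samples) of fixed-length overlapping windows.

    Assumption: only full-length windows are emitted; a trailing remainder
    shorter than one window is ignored.
    """
    win = int(round(window_sec * sr))
    hop = int(round(stride_sec * sr))
    if win <= 0 or hop <= 0:
        raise ValueError("window and stride must be positive")
    if num_samples < win:
        return []
    return list(range(0, num_samples - win + 1, hop))
```

Under these assumptions, a 30-second clip at 16 kHz yields 11 windows, each starting 40,000 samples after the previous one.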
...@@ -2,6 +2,66 @@
2 2
3 ## 2026-06-02 3 ## 2026-06-02
4 4
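5 ### Stage: Persist continuous-development preferences and current progress in AGENTS.md
6
7 Completed:
8 - Added `Project Continuity Memory / 持续开发记忆` to the root-level `AGENTS.md`
9 - Recorded the user's standing preferences:
10 - continue autonomously
11 - update the changelog and commit/push after each stage
12 - use `/usr/local/miniconda3/bin/python`
13 - in docs, prefer diagrams, tables, relative-path links, and a condensed structure
14 - Recorded the current project direction, the real FMA data status, `smoke-local` device support, and the fact that the current host has no CUDA
15 - Recorded recently completed engineering stages and suggested next steps so a new session can pick up directly
16
17 Verification:
18 - The following key memory anchors are now searchable in `AGENTS.md`:
19 - `Project Continuity Memory`
20 - `smoke-local`
21 - `fma_small_audio`
22 - `torch.cuda.is_available() == False`
23 - `Highest-value next steps`
24 - The new content is consistent with the current code and docs:
25 - `smoke-local` supports `--device cpu|cuda|auto`
26 - the build-index progress logging enhancement is complete
27 - the real FMA long run is still in the CPU `build-index` stage
28
29 Conclusion:
30 - A new session no longer needs to rediscover user preferences, data directories, the current bottleneck, or the next-step plan at startup
31 - `AGENTS.md` now serves as the repo-level entry point for continuity memory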
32
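33 ### Stage: Improve build-index progress visibility to reduce the cost of misjudging long real-FMA runs
34
35 Completed:
36 - Modified `acr-engine/src/engines/ecapa_embedder.py`
37 - `build_reference_index` now emits start/progress/finish logs
38 - logs include `refs`, `windows`, `elapsed_sec`, and `eta_sec`
39 - Modified `acr-engine/run_demo.py`
40 - the `build-index` stage now explicitly prints a start message for each of the chromaprint and embedding phases
41 - Rechecked the host's device state and confirmed that this machine currently has no CUDA, so only CPU can be used
42
43 Verification:
44 - Host device:
45 - `torch.cuda.is_available() = False`
46 - `device_count = 0`
47 - Small-data verification:
48 - Ran `PYTHONUNBUFFERED=1 /usr/local/miniconda3/bin/python run_demo.py build-index ... --device cpu` against `/tmp/acr_smoke_device_test/fma/manifests`
49 - Observed the new phase logs:
50 - `[build-index] starting chromaprint index ...`
51 - `[build-index] starting embedding index ...`
52 - `[build-reference-index] start: refs=24 ...`
53 - `[build-reference-index] progress: refs=1/24 ...`
54 - `[build-reference-index] progress: refs=24/24 ...`
55 - On completion, it successfully printed:
56 - `Built reference index: 120 windows, embeddings shape (120, 192)`
57 - Real FMA status recheck:
58 - the real long run is still at `run_demo.py build-index ... --device cpu`
59 - it can now be clearly identified as a long-running embedding index build on CPU rather than an apparent hang with no output
60
61 Conclusion:
62 - The main bottleneck of the real FMA long run is now clearly identified as CPU embedding index construction
63 - Even without a GPU on the current host, long runs are now more observable and diagnosable, which eases a later move to a CUDA machine or further index-stage optimization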
64
5 ### Stage: Add explicit device selection to smoke-local and verify auto device resolution 65 ### Stage: Add explicit device selection to smoke-local and verify auto device resolution
6 66
7 Completed: 67 Completed:
......
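The progress cadence and ETA estimate added to `build_reference_index` in this commit can be isolated into two small pure functions for reasoning or unit testing. This is a sketch that mirrors the arithmetic in the diff; the helper names `should_log` and `eta_seconds` are hypothetical and do not exist in the repo.

```python
def should_log(ref_idx: int, total_refs: int, every: int = 250) -> bool:
    """Log on the first reference, every `every`-th reference, and the last."""
    return ref_idx == 1 or ref_idx % every == 0 or ref_idx == total_refs


def eta_seconds(ref_idx: int, total_refs: int, elapsed_sec: float) -> float:
    """Estimate remaining seconds from the average throughput so far."""
    elapsed = max(elapsed_sec, 1e-6)  # guard against division by zero
    refs_per_sec = ref_idx / elapsed
    return (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
```

For example, after 250 of 1000 references in 50 seconds, the average rate is 5 references per second, so the estimate for the remaining 750 is 150 seconds. In the committed loop, a reference whose audio file is missing hits `continue` before the logging check, so a progress line may be skipped for that index.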