Preserve repo continuity before the next session handoff
Constraint: Future sessions need startup memory for user preferences, real-data status, and the current FMA bottleneck without re-discovery
Rejected: Leave continuity only in transient chat context | Would force every new session to reconstruct state from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep AGENTS continuity memory concise, code-true, and refreshed when project direction or bottlenecks materially change
Tested: AGENTS.md anchor search for continuity keys; verified host CUDA snapshot; verified build-index progress logs on small smoke artifacts
Not-tested: Full completion of the long-running real FMA CPU build-index stage
Showing 4 changed files with 183 additions and 5 deletions
| ... | @@ -373,6 +373,108 @@ Do not manually duplicate hook-owned activation state unless recovering from mis | ... | @@ -373,6 +373,108 @@ Do not manually duplicate hook-owned activation state unless recovering from mis |
| 373 | 373 | ||
| 374 | --- | 374 | --- |
| 375 | 375 | ||
| 376 | <project_continuity_memory> | ||
| 377 | ## Project Continuity Memory / 持续开发记忆 | ||
| 378 | |||
| 379 | This section is repo-local working memory for future sessions. Treat it as a high-signal startup brief and keep it updated when the project state materially changes. | ||
| 380 | |||
| 381 | ### User preferences and standing constraints | ||
| 382 | - User prefers autonomous continuation: do the next safe step instead of asking for permission. | ||
| 383 | - After each meaningful completed stage, update `docs/CHANGELOG.md`, then `git commit`, then `git push`. | ||
| 384 | - Python interpreter to prefer for this repo: | ||
| 385 | - `/usr/local/miniconda3/bin/python` | ||
| 386 | - Documentation preferences: | ||
| 387 | - prioritize diagrams, then tables, then concise explanation | ||
| 388 | - prefer concentrated/condensed docs over many small overlapping files | ||
| 389 | - use relative-path markdown links for repo-local navigation; do not wrap local doc paths as inert code strings | ||
| 390 | - Dataset strategy preferences: | ||
| 391 | - maximize reuse of open datasets for personal-use training/evaluation | ||
| 392 | - use some open data for training and keep some fixed for evaluation | ||
| 393 | - document raw dataset format, processed format, manifests, scripts, and labeling rules clearly for future custom-dataset expansion | ||
| 394 | - Large data safety: | ||
| 395 | - do not accidentally commit large dataset blobs unless intentionally using Git LFS and the stage explicitly calls for it | ||
| 396 | - avoid committing transient `__pycache__` or smoke-generated bulk audio copies unless explicitly intended | ||
| 397 | |||
| 398 | ### Current product / technical direction | ||
| 399 | - Domain: music ACR / retrieval pipeline | ||
| 400 | - Input direction: | ||
| 401 | - music-task input has moved from 40-dim MFCC assumptions toward 128-dim Mel features | ||
| 402 | - band-split path is enabled in current model direction | ||
| 403 | - Dataset semantics: | ||
| 404 | - separate `reference` catalog from `query` training/evaluation samples | ||
| 405 | - preserve `song_id`, `type`, `offset`, `source_dataset`, and split semantics in manifests | ||
| 406 | - Hard-case emphasis: | ||
| 407 | - keep explicit support for `clean`, `augmented`, `confused`, and `humming_like` | ||
| 408 | - confusion-oriented techniques remain a preferred optimization lane | ||
| 409 | |||
| 410 | ### Verified repo facts as of 2026-06-02 | ||
| 411 | - Main app root: | ||
| 412 | - `/workspace/acr-engine` | ||
| 413 | - Main docs root: | ||
| 414 | - `/workspace/docs` | ||
| 415 | - Real FMA local dataset: | ||
| 416 | - archive source used: `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` | ||
| 417 | - extracted audio root: `acr-engine/data/raw/fma_small_audio` | ||
| 418 | - verified local file count for smoke readiness: `8000` | ||
| 419 | - Real training / indexing behavior: | ||
| 420 | - training dataset path uses random 5s crops rather than pre-expanded overlapping windows | ||
| 421 | - retrieval / reference embedding path uses 5.0s windows with 2.5s stride (50% overlap) | ||
| 422 | - external manifest generation currently creates one random query per eligible track by default, commonly at 8.0s | ||
| 423 | - `smoke-local` orchestration: | ||
| 424 | - now supports `--device cpu|cuda|auto` | ||
| 425 | - `auto` is resolved inside the adapter before invoking downstream train/index/eval CLIs | ||
| 426 | - Current host capability snapshot on 2026-06-02: | ||
| 427 | - `torch.cuda.is_available() == False` | ||
| 428 | - real long-running FMA smoke therefore currently runs on CPU on this host | ||
| 429 | |||
| 430 | ### Recently completed engineering stages | ||
| 431 | - Documentation was strengthened around: | ||
| 432 | - dataset spec | ||
| 433 | - training data / pgvector guidance | ||
| 434 | - open dataset workflow | ||
| 435 | - FMA / external dataset handling | ||
| 436 | - overlap-window vs random-crop behavior | ||
| 437 | - GPU / CPU execution semantics | ||
| 438 | - `smoke-local` device selection support was added. | ||
| 439 | - build-index observability was improved: | ||
| 440 | - `run_demo.py build-index` now announces chromaprint vs embedding phases | ||
| 441 | - `ECAPAEmbedder.build_reference_index` now logs start/progress/finish with refs/windows/elapsed/eta | ||
| 442 | |||
| 443 | ### Important current status to resume from | ||
| 444 | - A real FMA smoke run was launched on 2026-06-02 and has progressed through training into CPU `build-index`. | ||
| 445 | - On this host, the real FMA post-training bottleneck is CPU embedding-index construction, not a confirmed deadlock. | ||
| 446 | - Small-data verification already proved: | ||
| 447 | - `smoke-local --device auto` resolves to `cpu` on this host | ||
| 448 | - manual `build-index` + `evaluate` succeed on smoke artifacts with `top1=1.0`, `topk=1.0` | ||
| 449 | |||
| 450 | ### Highest-value next steps | ||
| 451 | 1. Continue monitoring or resuming the real FMA smoke artifacts until fresh index/report timestamps confirm completion. | ||
| 452 | 2. Unify the current 5s vs 8s configuration story across: | ||
| 453 | - manifest query duration | ||
| 454 | - train clip duration | ||
| 455 | - eval/report metadata | ||
| 456 | 3. Add overlapping-query manifest generation for external datasets when broader coverage is needed. | ||
| 457 | 4. Continue industrialization work: | ||
| 458 | - improve index-stage performance / observability | ||
| 459 | - strengthen dataset governance and reusable ingestion docs | ||
| 460 | - keep handoff docs current for new sessions | ||
| 461 | |||
| 462 | ### Files future sessions should inspect first | ||
| 463 | - `docs/README.md` | ||
| 464 | - `docs/CHANGELOG.md` | ||
| 465 | - `docs/session-handoff.md` | ||
| 466 | - `docs/dataset-spec.md` | ||
| 467 | - `docs/training-data-and-pgvector-guide.md` | ||
| 468 | - `docs/open-dataset-workflow.md` | ||
| 469 | - `acr-engine/src/data/external_adapters.py` | ||
| 470 | - `acr-engine/src/data/manifest_tools.py` | ||
| 471 | - `acr-engine/src/data/dataset.py` | ||
| 472 | - `acr-engine/src/engines/ecapa_embedder.py` | ||
| 473 | |||
| 474 | </project_continuity_memory> | ||
| 475 | |||
| 476 | --- | ||
| 477 | |||
| 376 | ## Setup | 478 | ## Setup |
| 377 | 479 | ||
| 378 | Execute `omx setup` to install all components. Execute `omx doctor` to verify installation. | 480 | Execute `omx setup` to install all components. Execute `omx doctor` to verify installation. | ... | ... |
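Reviewer note: the continuity memory states that the reference embedding path uses 5.0s windows with a 2.5s stride. The sketch below shows how many windows that produces for a given track length. `window_spans` is a hypothetical helper, not the repo's `_windows` method, and it assumes a trailing partial window is dropped; the real implementation may pad or handle short clips differently.

```python
def window_spans(duration_sec, window_sec=5.0, stride_sec=2.5):
    """Return (start, end) spans in seconds for fixed-length overlapping windows.

    Assumption: a window is kept only if it fits entirely inside the track.
    """
    spans = []
    start = 0.0
    # Small tolerance guards against float rounding at the final boundary.
    while start + window_sec <= duration_sec + 1e-9:
        spans.append((start, start + window_sec))
        start += stride_sec
    return spans


if __name__ == "__main__":
    for duration in (3.0, 15.0, 30.0):
        print(duration, len(window_spans(duration)))
```

Under these assumptions a 30s clip yields 11 windows, which is why index construction over thousands of references does far more model forward passes than training on one random 5s crop per sample.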
| ... | @@ -56,7 +56,9 @@ def cmd_build_index(args): | ... | @@ -56,7 +56,9 @@ def cmd_build_index(args): |
| 56 | out_dir = Path(args.output) | 56 | out_dir = Path(args.output) |
| 57 | out_dir.mkdir(parents=True, exist_ok=True) | 57 | out_dir.mkdir(parents=True, exist_ok=True) |
| 58 | 58 | ||
| 59 | print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}") | ||
| 59 | build_chroma_index(data_dir, out_dir) | 60 | build_chroma_index(data_dir, out_dir) |
| 61 | print(f"[build-index] starting embedding index: model={args.model} device={args.device}") | ||
| 60 | build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device) | 62 | build_embedding_index(data_dir, Path(args.model), out_dir / 'reference', args.device) |
| 61 | 63 | ||
| 62 | 64 | ... | ... |
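Reviewer note: the new log line prints `args.device`, and the continuity memory says `auto` is resolved in the adapter before the downstream CLIs run. The adapter code is not part of this diff, so the following is only a minimal sketch of what such a resolution step could look like. `resolve_device` and `cuda_available` are hypothetical names; the only real API used is `torch.cuda.is_available()`.

```python
import importlib.util


def cuda_available():
    """Report CUDA availability; returns False when torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


def resolve_device(requested, has_cuda=None):
    """Map 'cpu' | 'cuda' | 'auto' to a concrete device string.

    Assumption: an explicit 'cuda' request is passed through unchanged, so a
    missing GPU surfaces as an error downstream rather than a silent fallback.
    """
    if requested not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if has_cuda is None:
        has_cuda = cuda_available()
    if requested == "auto":
        return "cuda" if has_cuda else "cpu"
    return requested


if __name__ == "__main__":
    print(resolve_device("auto"))
```

Resolving `auto` once, before invoking train/index/eval, keeps the value printed by `[build-index] starting embedding index` equal to the device actually used.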
| 1 | import json | 1 | import json |
| 2 | from pathlib import Path | 2 | from pathlib import Path |
| 3 | from typing import List, Optional, Tuple | 3 | from typing import List, Optional, Tuple |
| 4 | import time | ||
| 4 | 5 | ||
| 5 | import librosa | 6 | import librosa |
| 6 | import numpy as np | 7 | import numpy as np |
| ... | @@ -101,22 +102,35 @@ class ECAPAEmbedder: | ... | @@ -101,22 +102,35 @@ class ECAPAEmbedder: |
| 101 | all_embs = [] | 102 | all_embs = [] |
| 102 | all_ids = [] | 103 | all_ids = [] |
| 103 | songs_dir = Path(songs_dir) | 104 | songs_dir = Path(songs_dir) |
| 105 | refs = [item for item in meta if item.get("type") == "reference"] | ||
| 106 | total_refs = len(refs) | ||
| 107 | start_time = time.time() | ||
| 108 | print( | ||
| 109 | f"[build-reference-index] start: refs={total_refs} device={self.device.type} " | ||
| 110 | f"window_sec={window_sec} stride_sec={stride_sec}" | ||
| 111 | ) | ||
| 104 | 112 | ||
| 105 | for item in meta: | 113 | for ref_idx, item in enumerate(refs, start=1): |
| 106 | if item.get("type") != "reference": | ||
| 107 | continue | ||
| 108 | audio_path = songs_dir.parent / item["audio_path"] | 114 | audio_path = songs_dir.parent / item["audio_path"] |
| 109 | if not audio_path.exists(): | 115 | if not audio_path.exists(): |
| 110 | continue | 116 | continue |
| 111 | song_id = item["song_id"] | 117 | song_id = item["song_id"] |
| 112 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) | 118 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) |
| 113 | 119 | windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec) | |
| 114 | for seg in self._windows(y, window_sec=window_sec, stride_sec=stride_sec): | 120 | for seg in windows: |
| 115 | mel = self._to_mel(seg).to(self.device) | 121 | mel = self._to_mel(seg).to(self.device) |
| 116 | with torch.no_grad(): | 122 | with torch.no_grad(): |
| 117 | emb, _ = self.model(mel) | 123 | emb, _ = self.model(mel) |
| 118 | all_embs.append(emb.cpu().numpy().flatten()) | 124 | all_embs.append(emb.cpu().numpy().flatten()) |
| 119 | all_ids.append(song_id) | 125 | all_ids.append(song_id) |
| 126 | if ref_idx == 1 or ref_idx % 250 == 0 or ref_idx == total_refs: | ||
| 127 | elapsed = max(time.time() - start_time, 1e-6) | ||
| 128 | refs_per_sec = ref_idx / elapsed | ||
| 129 | eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0 | ||
| 130 | print( | ||
| 131 | f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " | ||
| 132 | f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" | ||
| 133 | ) | ||
| 120 | 134 | ||
| 121 | all_embs = np.vstack(all_embs) | 135 | all_embs = np.vstack(all_embs) |
| 122 | np.save(f"{output_path}_embs.npy", all_embs) | 136 | np.save(f"{output_path}_embs.npy", all_embs) | ... | ... |
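Reviewer note: the progress logic in this hunk can be isolated into two small pure functions, which makes the logging cadence and ETA arithmetic easy to check without running the embedder. The function names below are hypothetical; the formulas mirror the diff (log on the first reference, every 250th, and the last; ETA is remaining references divided by the observed rate). One thing that seems worth checking in the diff itself: because the `continue` for a missing `audio_path` comes before the progress check, a skipped reference at a logging boundary, including the final one, would not emit its progress line.

```python
def should_log(ref_idx, total_refs, every=250):
    """True on the first reference, every `every`-th reference, and the last."""
    return ref_idx == 1 or ref_idx % every == 0 or ref_idx == total_refs


def eta_seconds(ref_idx, total_refs, elapsed_sec):
    """Estimate remaining seconds from the average rate so far."""
    elapsed = max(elapsed_sec, 1e-6)  # avoid division by zero on the first item
    refs_per_sec = ref_idx / elapsed
    return (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0


if __name__ == "__main__":
    # 250 of 1000 references done after 50s: 5 refs/s, 750 remaining.
    print(eta_seconds(250, 1000, 50.0))
    print([i for i in range(1, 25) if should_log(i, 24)])
```

The ETA is a simple running average, so it will drift if track lengths vary widely; that is usually acceptable for distinguishing slow progress from a hang.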
| ... | @@ -2,6 +2,66 @@ | ... | @@ -2,6 +2,66 @@ |
| 2 | 2 | ||
| 3 | ## 2026-06-02 | 3 | ## 2026-06-02 |
| 4 | 4 | ||
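| 5 | ### Stage: Persist continuous-development preferences and current progress in AGENTS.md | ||
| 6 | |||
| 7 | Completed: | ||
| 8 | - Added `Project Continuity Memory / 持续开发记忆` to the root-level `AGENTS.md` | ||
| 9 | - Recorded the user's long-term preferences: | ||
| 10 | - continue autonomously | ||
| 11 | - update the changelog and commit/push after each stage | ||
| 12 | - use `/usr/local/miniconda3/bin/python` | ||
| 13 | - documentation favors diagrams, tables, relative-path links, and a condensed structure | ||
| 14 | - Recorded the current project direction, real FMA data status, `smoke-local` device capability, and the fact that the current host has no CUDA | ||
| 15 | - Recorded the recently completed engineering stages and suggested next steps so a new session can resume directly | ||
| 16 | |||
| 17 | Verification results: | ||
| 18 | - `AGENTS.md` can now be searched for the following key memory anchors: | ||
| 19 | - `Project Continuity Memory` | ||
| 20 | - `smoke-local` | ||
| 21 | - `fma_small_audio` | ||
| 22 | - `torch.cuda.is_available() == False` | ||
| 23 | - `Highest-value next steps` | ||
| 24 | - The new content is consistent with the current code/doc state: | ||
| 25 | - `smoke-local` supports `--device cpu|cuda|auto` | ||
| 26 | - the build-index progress logging enhancement is complete | ||
| 27 | - the current real FMA long run is still in the CPU `build-index` stage | ||
| 28 | |||
| 29 | Conclusion: | ||
| 30 | - A new session no longer needs to rediscover user preferences, data directories, the current bottleneck, or the next-step plan at startup | ||
| 31 | - `AGENTS.md` now serves as the repo-level entry point for continuous-development memory | ||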
| 32 | |||
| 33 | ### Stage: Improve build-index progress visibility to reduce the cost of misdiagnosing the long real FMA run | ||
| 34 | |||
| 35 | Completed: | ||
| 36 | - Modified `acr-engine/src/engines/ecapa_embedder.py` | ||
| 37 | - `build_reference_index` now emits start/progress/finish logs | ||
| 38 | - logs include `refs`, `windows`, `elapsed_sec`, `eta_sec` | ||
| 39 | - Modified `acr-engine/run_demo.py` | ||
| 40 | - the `build-index` stage now explicitly prints start notices for the chromaprint and embedding phases | ||
| 41 | - Rechecked the host device conditions and confirmed the machine currently has no CUDA, so only the CPU path is available | ||
| 42 | |||
| 43 | Verification results: | ||
| 44 | - Host device: | ||
| 45 | - `torch.cuda.is_available() = False` | ||
| 46 | - `device_count = 0` | ||
| 47 | - Small-data verification: | ||
| 48 | - Ran `PYTHONUNBUFFERED=1 /usr/local/miniconda3/bin/python run_demo.py build-index ... --device cpu` using `/tmp/acr_smoke_device_test/fma/manifests` | ||
| 49 | - Observed the new stage logs: | ||
| 50 | - `[build-index] starting chromaprint index ...` | ||
| 51 | - `[build-index] starting embedding index ...` | ||
| 52 | - `[build-reference-index] start: refs=24 ...` | ||
| 53 | - `[build-reference-index] progress: refs=1/24 ...` | ||
| 54 | - `[build-reference-index] progress: refs=24/24 ...` | ||
| 55 | - On completion, it successfully printed: | ||
| 56 | - `Built reference index: 120 windows, embeddings shape (120, 192)` | ||
| 57 | - Real FMA status recheck: | ||
| 58 | - the real long run is still in `run_demo.py build-index ... --device cpu` | ||
| 59 | - but it can now be clearly identified as a long-running CPU embedding-index build rather than a "silent apparent hang" | ||
| 60 | |||
| 61 | Conclusion: | ||
| 62 | - The main bottleneck of the real FMA long run has now been clearly identified as CPU embedding-index construction | ||
| 63 | - Even though the current host has no GPU, long-run diagnostics are now more observable, which eases a later migration to a CUDA machine or further index-stage optimization | ||
| 64 | |||
| 5 | ### Stage: Add explicit device selection to smoke-local and verify auto device resolution | 65 | ### Stage: Add explicit device selection to smoke-local and verify auto device resolution |
| 6 | 66 | ||
| 7 | Completed: | 67 | Completed: | ... | ... |
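Reviewer note: the commit message and changelog both cite an "anchor search" of `AGENTS.md` as verification. A check of that kind could be scripted roughly as follows. `missing_anchors` is a hypothetical helper; the anchor strings are the ones listed in the changelog entry, and the sample text is a stand-in so the sketch runs without the repo checked out.

```python
ANCHORS = [
    "Project Continuity Memory",
    "smoke-local",
    "fma_small_audio",
    "torch.cuda.is_available() == False",
    "Highest-value next steps",
]


def missing_anchors(text, anchors=ANCHORS):
    """Return the anchors that do not appear verbatim in `text`."""
    return [anchor for anchor in anchors if anchor not in text]


if __name__ == "__main__":
    # In the repo this would be: text = Path("AGENTS.md").read_text(encoding="utf-8")
    sample = "## Project Continuity Memory\n- `smoke-local` orchestration\n"
    print(missing_anchors(sample))
```

An empty result means every anchor is present. A plain substring match is deliberately strict, so renaming a heading in `AGENTS.md` would surface as a missing anchor.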