Commit 5679b5d6 5679b5d6ee364bfb5a547b722eb03fb0dffd7026 by cnb.bofCdSsphPA

Add a detailed handoff doc for future development sessions

Constraint: New sessions need a fast, durable understanding of the project state, open-dataset workflow, verified evidence, and next steps
Rejected: Rely on scattered docs and git history alone | Too slow for session handoff and easy to miss critical workflow context
Confidence: high
Scope-risk: narrow
Directive: Keep this handoff doc updated whenever a major workflow milestone or verified capability changes
Tested: existence checks for docs/session-handoff.md and docs/README.md, plus docs index link presence
Not-tested: Manual human review across multiple markdown renderers
1 parent d2218523
......@@ -69,6 +69,7 @@ flowchart TD
### C. Services and engineering
- [Service API](./service-api.md)
- [Session handoff for continued development](./session-handoff.md)
- [Changelog](./CHANGELOG.md)
### D. Research and roadmap
......
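# Session Handoff (continuous-development handoff document)
> Updated: 2026-06-02
> Purpose: let a new session or agent entering the repository understand the current state of the project as quickly as possible and continue development.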
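## One-page summary
This is a **music ACR / music retrieval** project that is moving from prototype toward industrialization.
Completed so far:
1. **The prototype runs end to end**
- synthetic data generation
- training
- index building
- recognition
- evaluation
2. **The open-dataset ingestion pipeline is closed end to end**
- inspect-local / inspect-batch
- prepare-local
- validate-local
- train
- build-index
- evaluate
- generate_artifacts
3. **The docs have been condensed**
- the docs entry point is split into 4 groups
- relative-path links are navigable
- the open-dataset workflow has a single-page document
Most important next steps:
- replace the synthetic stand-in with real local FMA / MTG-Jamendo audio directories
- run a smoke test on real open data
- keep improving accuracy, especially on `confused` / `humming_like`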
---
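## 1. What the project is
This is an ACR engine for **music clip recognition / music retrieval**. Its core approach combines:
- fingerprint retrieval (Chromaprint-like)
- embedding retrieval (ECAPA-derived)
- optional melody-aware fusion
- retrieval-first evaluation and optimization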
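To make the idea of combining fingerprint and embedding retrieval concrete, here is a minimal sketch of late score fusion. It is an illustration only and not the logic in `hybrid_engine.py`: the function name `fuse_scores`, the weighted-sum rule, and the assumption that both score maps are already normalized to [0, 1] are all hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    fingerprint: Dict[str, float],
    embedding: Dict[str, float],
    fingerprint_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank candidate tracks by a weighted sum of two similarity scores.

    Assumption: both maps hold scores normalized to [0, 1]. A track that is
    missing from one map contributes 0.0 from that source.
    """
    if not 0.0 <= fingerprint_weight <= 1.0:
        raise ValueError("fingerprint_weight must be in [0, 1]")
    embedding_weight = 1.0 - fingerprint_weight
    fused = {
        track_id: fingerprint_weight * fingerprint.get(track_id, 0.0)
        + embedding_weight * embedding.get(track_id, 0.0)
        for track_id in set(fingerprint) | set(embedding)
    }
    # Sort by score descending, then by track id so ties are deterministic.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumptions, tuning `fingerprint_weight` at retrieval time changes the ranking without retraining the embedding model.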
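It is no longer just a "classification model training script" but a fairly complete engineering prototype with the following layers:
- data
- training
- indexing
- recognition
- evaluation
- documentation
- open-dataset ingestion
- release artifacts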
---
## 2. Which docs to read first
### The 4 core entry points
- [docs/README.md](./README.md)
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/dataset-spec.md](./dataset-spec.md)
- [docs/industrialization-roadmap.md](./industrialization-roadmap.md)
### If you work on algorithms / models
- [docs/dataset-spec.md](./dataset-spec.md)
- [docs/sota-research-2026.md](./sota-research-2026.md)
- [docs/industrial-benchmark-spec.md](./industrial-benchmark-spec.md)
### If you work on data ingestion
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md)
- [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md)
### If you work on engineering / services
- [docs/service-api.md](./service-api.md)
- [docs/CHANGELOG.md](./CHANGELOG.md)
---
## 3. Key parts of the current code structure
### Main entry points for training and evaluation
- [acr-engine/train.py](../acr-engine/train.py)
- [acr-engine/evaluate.py](../acr-engine/evaluate.py)
- [acr-engine/run_demo.py](../acr-engine/run_demo.py)
### Data layer
- [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py)
- [acr-engine/src/data/synthetic.py](../acr-engine/src/data/synthetic.py)
- [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py)
- [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py)
### Retrieval and model layer
- [acr-engine/src/engines/hybrid_engine.py](../acr-engine/src/engines/hybrid_engine.py)
- [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py)
- [acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py)
- [acr-engine/src/models/ecapa_tdnn.py](../acr-engine/src/models/ecapa_tdnn.py)
- [acr-engine/src/models/losses.py](../acr-engine/src/models/losses.py)
### Service layer
- [acr-engine/src/service/app.py](../acr-engine/src/service/app.py)
---
## 4. Key capabilities already in place
### 4.1 Prototype and synthetic data
- the synthetic dataset can be generated
- `train.py --dry-run` passes
- a checkpoint can be trained
- build-index works
- recognize works
- evaluate works
### 4.2 Open-dataset ingestion
The following commands are available:
- `inspect-local`
- `inspect-batch`
- `prepare-local`
- `validate-local`
- `smoke-local`
They all live in:
- [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py)
### 4.3 Documentation and release artifacts
The open-data smoke run can also generate:
- benchmark report
- model card
- release checklist
- artifact manifest
---
## 5. How the open-data workflow actually works today
### Where real audio should go
- [acr-engine/data/raw/fma_small_audio/](../acr-engine/data/raw/fma_small_audio/)
- [acr-engine/data/raw/mtg_jamendo_audio/](../acr-engine/data/raw/mtg_jamendo_audio/)
Explanatory file:
- [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md)
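Before running anything against these drop zones, it can help to confirm that they actually contain audio. The following is a minimal sketch rather than part of the repository: `count_audio_files`, `drop_zone_status`, and the extension list are hypothetical names and assumptions, and the paths are taken relative to `acr-engine/`.

```python
from pathlib import Path
from typing import Dict

# Assumption: these are the audio formats of interest; adjust as needed.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg"}

# Drop zones named in this document, relative to the acr-engine directory.
DROP_ZONES = ("data/raw/fma_small_audio", "data/raw/mtg_jamendo_audio")


def count_audio_files(root: Path) -> int:
    """Count audio files under root, recursively; 0 if root is missing."""
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


def drop_zone_status(engine_root: Path) -> Dict[str, int]:
    """Map each drop zone to the number of audio files found in it."""
    return {zone: count_audio_files(engine_root / zone) for zone in DROP_ZONES}
```

A count of 0 for both zones suggests the real corpora have not been placed yet and the synthetic stand-in is still the only available input.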
### Currently recommended commands
#### FMA
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
```
#### MTG-Jamendo
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
```
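The two commands above differ only in the dataset name and the audio directory, so they can be assembled programmatically. This is a minimal sketch: `build_smoke_command` is a hypothetical helper, not something that exists in the repository, and it only uses the flags shown in the commands above.

```python
import sys
from typing import List


def build_smoke_command(
    dataset: str,
    audio_dir: str,
    output_root: str = "data/external_smoke",
    eval_ratio: float = 0.2,
    query_duration: float = 8.0,
    train_epochs: int = 1,
    batch_size: int = 2,
    python: str = sys.executable,
) -> List[str]:
    """Build the argv list for a smoke-local run (hypothetical helper)."""
    return [
        python,
        "src/data/external_adapters.py",
        "smoke-local",
        dataset,
        audio_dir,
        "--output-root", output_root,
        "--eval-ratio", str(eval_ratio),
        "--query-duration", str(query_duration),
        "--train-epochs", str(train_epochs),
        "--batch-size", str(batch_size),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. The script path is relative, so this assumes the working directory is `acr-engine/`, which the commands above imply but do not state.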
### What smoke-local is verified to do
`smoke-local` automatically runs:
1. inspect-local
2. prepare-local
3. validate-local
4. train
5. build-index
6. evaluate
7. generate_artifacts
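The ordered steps above follow a common pattern: run each stage in sequence and stop at the first failure. The sketch below illustrates that pattern only. It is not how `external_adapters.py` implements `smoke-local`, and `run_stages` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is a (name, zero-argument callable) pair.
Stage = Tuple[str, Callable[[], None]]


def run_stages(stages: Sequence[Stage]) -> List[str]:
    """Run stages in order and return the names of those that completed.

    The first exception aborts the run and is re-raised as a RuntimeError
    that names the failing stage, so later stages never see bad inputs.
    """
    completed: List[str] = []
    for name, action in stages:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"stage '{name}' failed after completing {completed}"
            ) from exc
        completed.append(name)
    return completed
```

Failing fast matters here because each stage consumes the previous stage's output: training on a manifest that did not validate would waste time and produce misleading metrics.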
---
## 6. Most important verification evidence so far
### 6.1 synthetic-as-open-fixed (open-data stand-in)
Successfully verified:
- `prepare-local`
- `validate-local`
- `train.py`
- `build-index`
- `evaluate.py`
- `generate_artifacts.py`
Key results:
- `num_queries=8`
- `top1=1.0`
- `topk=1.0`
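For readers unfamiliar with the `top1` / `topk` figures, the sketch below shows the usual definition: the fraction of queries whose correct track appears within the first k ranked results. It is an assumption that `evaluate.py` computes the metrics this way, and `top_k_accuracy` is a hypothetical name.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_results: Sequence[Sequence[str]],
    expected: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose expected id is among the first k results."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_results) != len(expected):
        raise ValueError("ranked_results and expected must have equal length")
    if not expected:
        return 0.0
    hits = sum(
        1
        for ranking, target in zip(ranked_results, expected)
        if target in ranking[:k]
    )
    return hits / len(expected)
```

Under this definition, `top1=1.0` over `num_queries=8` would mean all 8 queries returned the correct track in first position.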
Related directories:
- [acr-engine/data/external_ingested/synthetic_as_open_fixed/](../acr-engine/data/external_ingested/synthetic_as_open_fixed/)
- [acr-engine/reports/open-smoke-fixed/fma/](../acr-engine/reports/open-smoke-fixed/fma/)
### 6.2 One-command smoke-local
Verified:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/synthetic_v2/songs --output-root data/external_smoke --eval-ratio 0.2 --query-duration 5.0 --train-epochs 1 --batch-size 2
```
Key results:
- `num_audio_files=24`
- `catalog=24`
- `train_queries=16`
- `test_queries=8`
- `top1=1.0`
- `topk=1.0`
Related directories:
- [acr-engine/data/external_smoke/](../acr-engine/data/external_smoke/)
---
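## 7. Most important open tasks
### Priority A: replace the stand-in with real open data
Goal:
- replace the synthetic stand-in with real local FMA / MTG-Jamendo audio
Steps:
1. Put the real audio into:
- `acr-engine/data/raw/fma_small_audio/`
- `acr-engine/data/raw/mtg_jamendo_audio/`
2. Run `smoke-local` directly
3. Record:
- inspect scale (for example `num_audio_files`)
- train/test query counts
- top1/topk
- artifact bundle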
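One way to keep these records comparable across runs is a small structured record serialized to JSON. This is a sketch only: the `SmokeRecord` class, its field names, and the `to_json` helper are hypothetical and do not describe any format the repository emits.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SmokeRecord:
    """One smoke-local run, holding the fields listed above (hypothetical)."""

    dataset: str
    num_audio_files: int
    train_queries: int
    test_queries: int
    top1: float
    topk: float
    artifact_dir: str


def to_json(record: SmokeRecord) -> str:
    """Serialize a record with stable key order so diffs stay readable."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Example values copied from the synthetic smoke run reported in section 6.2;
# artifact_dir is the --output-root used there.
example = SmokeRecord(
    dataset="fma",
    num_audio_files=24,
    train_queries=16,
    test_queries=8,
    top1=1.0,
    topk=1.0,
    artifact_dir="data/external_smoke",
)
```

Storing one such record per run makes it straightforward to compare the synthetic stand-in against the first real FMA or MTG-Jamendo run.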
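### Priority B: keep improving hard-case accuracy
Findings so far:
- naive oversampling: failed
- type-aware weighting: partially effective
- sample-level weighting: improved `confused`
- retrieval fusion tuning: more consistently effective
Focus for the next phase:
- `confused`
- `humming_like`
- hard-case buckets on real open data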
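Tracking hard cases requires accuracy broken down by query type, not just one overall figure. The sketch below shows one way to compute top-1 accuracy per bucket. The function name `top1_by_bucket` and the input tuple shape are assumptions, not the repository's evaluation format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (query_type, expected_track_id, predicted_track_id).
QueryResult = Tuple[str, str, str]


def top1_by_bucket(results: Iterable[QueryResult]) -> Dict[str, float]:
    """Return top-1 accuracy for each query type present in the results."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for query_type, expected, predicted in results:
        totals[query_type] += 1
        if expected == predicted:
            hits[query_type] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy input for illustration only; these are not measured results.
toy_results = [
    ("confused", "song_1", "song_1"),
    ("confused", "song_2", "song_9"),
    ("humming_like", "song_3", "song_3"),
]
```

A breakdown like this shows whether a change that raises overall accuracy is also helping `confused` and `humming_like`, or only the easy buckets.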
### Priority C: foundation model / SOTA baseline
Already recorded in the docs:
- MERT
- MuQ
- a stronger retrieval-first approach
Possible follow-ups:
- frozen backbone baseline
- adapter fine-tune
---
## 8. Latest key commits (to help a new session orient quickly)
Recent key commits worth reading first:
- `d221852` Add explicit drop zones for real open-music corpora
- `eee15ac` Automate the full open-dataset smoke workflow behind one command
- `8795907` Generate release artifacts for the open-dataset smoke path
- `dc9ef1b` Close the open-dataset smoke loop through evaluation
- `b766c74` Make open-dataset manifests trainable end to end
- `fa23144` Add a single-page open dataset workflow for training prep
- `af33be3` Condense docs and add manifest validation before training
Together these commits cover most of the recent open-data and documentation work.
---
## 9. Recommended actions when a new session takes over
If you are a new session, proceed in this order:
1. Read:
- [docs/README.md](./README.md)
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/session-handoff.md](./session-handoff.md)
2. Check whether real data is in place:
- `acr-engine/data/raw/fma_small_audio/`
- `acr-engine/data/raw/mtg_jamendo_audio/`
3. If real audio is already available:
- run `smoke-local` directly
4. If there is no real audio yet:
- keep improving synthetic-as-open-fixed
- or keep building out open-dataset download/cleaning automation
5. After completing each stage:
- update [docs/CHANGELOG.md](./CHANGELOG.md)
- `git commit`
- `git push`
---
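## 10. Caveats
- This repository contains tracked `__pycache__` files; when committing, take care that they do not pollute your changes.
- The most reliable direction for improvement right now is not blind tuning of training weights, but:
- retrieval-time fusion
- more realistic open data
- more realistic evaluation
- The open-data layout now depends on a "self-contained output root" containing:
- `audio/`
- `manifests/`
Do not break this layout in later changes.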
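A cheap guard against breaking the self-contained layout is to check for both subdirectories before using an output root. This is a minimal sketch, and `missing_layout_dirs` is a hypothetical helper rather than a function in the repository.

```python
from pathlib import Path
from typing import List

# Subdirectories this document says an output root must contain.
REQUIRED_SUBDIRS = ("audio", "manifests")


def missing_layout_dirs(output_root: Path) -> List[str]:
    """Return the required subdirectories that are absent from output_root."""
    return [
        name for name in REQUIRED_SUBDIRS if not (output_root / name).is_dir()
    ]
```

An empty result means the root satisfies the layout described above. Any other result names exactly which directory is missing.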
---
## Sources
- [README.md](./README.md)
- [open-dataset-workflow.md](./open-dataset-workflow.md)
- [CHANGELOG.md](./CHANGELOG.md)