Commit 5679b5d6 5679b5d6ee364bfb5a547b722eb03fb0dffd7026 by cnb.bofCdSsphPA

Add a detailed handoff doc for future development sessions

Constraint: New sessions need a fast, durable understanding of the project state, open-dataset workflow, verified evidence, and next steps
Rejected: Rely on scattered docs and git history alone | Too slow for session handoff and easy to miss critical workflow context
Confidence: high
Scope-risk: narrow
Directive: Keep this handoff doc updated whenever a major workflow milestone or verified capability changes
Tested: existence checks for docs/session-handoff.md and docs/README.md, plus docs index link presence
Not-tested: Manual human review across multiple markdown renderers
1 parent d2218523
...@@ -69,6 +69,7 @@ flowchart TD ...@@ -69,6 +69,7 @@ flowchart TD
69 69
70 ### C. Services and engineering 70 ### C. Services and engineering
71 - [Service API](./service-api.md) 71 - [Service API](./service-api.md)
72 - [Session handoff doc](./session-handoff.md)
72 - [Changelog](./CHANGELOG.md) 73 - [Changelog](./CHANGELOG.md)
73 74
74 ### D. Research and roadmap 75 ### D. Research and roadmap
......
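1 # Session Handoff
2
3 > Updated: 2026-06-02
4 > Purpose: let a new session or new agent entering the repository understand the project's current state as quickly as possible and continue development.
5
6 ## One-page summary
7
8 This is a **music ACR / music retrieval** project moving from prototype toward industrialization.
9 Completed so far:
10
11 1. **The prototype runs end to end**
12 - synthetic data generation
13 - training
14 - index building
15 - recognition
16 - evaluation
17
18 2. **The open-dataset ingestion pipeline forms a complete closed loop**
19 - inspect-local / inspect-batch
20 - prepare-local
21 - validate-local
22 - train
23 - build-index
24 - evaluate
25 - generate_artifacts
26
27 3. **Docs have been condensed**
28 - the docs entry point is split into 4 groups
29 - relative paths work as navigable links
30 - the open-dataset workflow has a single-page doc
31
32 Most important next steps:
33 - replace the synthetic stand-in with real local FMA / MTG-Jamendo audio directories
34 - run a smoke test on real open data
35 - keep improving accuracy, especially for `confused` / `humming_like`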
36
37 ---
38
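39 ## 1. What the project is
40
41 This is an ACR engine for **music clip recognition / music retrieval**. Its core approach:
42
43 - fingerprint retrieval (Chromaprint-like)
44 - embedding retrieval (ECAPA-derived)
45 - optional melody-aware fusion
46 - retrieval-first evaluation and optimization
47
48 It is no longer just a "classification model training script" but a fairly complete engineering prototype:
49 - data layer
50 - training layer
51 - index layer
52 - recognition layer
53 - evaluation layer
54 - documentation layer
55 - open-dataset ingestion layer
56 - release artifact layer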
57
58 ---
59
60 ## 2. Which docs to read first
61
62 ### Core entry points (4 groups)
63 - [docs/README.md](./README.md)
64 - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
65 - [docs/dataset-spec.md](./dataset-spec.md)
66 - [docs/industrialization-roadmap.md](./industrialization-roadmap.md)
67
68 ### If you work on algorithms / models
69 - [docs/dataset-spec.md](./dataset-spec.md)
70 - [docs/sota-research-2026.md](./sota-research-2026.md)
71 - [docs/industrial-benchmark-spec.md](./industrial-benchmark-spec.md)
72
73 ### If you work on data ingestion
74 - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
75 - [docs/dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md)
76 - [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md)
77
78 ### If you work on engineering / services
79 - [docs/service-api.md](./service-api.md)
80 - [docs/CHANGELOG.md](./CHANGELOG.md)
81
82 ---
83
84 ## 3. Key parts of the current code structure
85
86 ### Main entry points for training and evaluation
87 - [acr-engine/train.py](../acr-engine/train.py)
88 - [acr-engine/evaluate.py](../acr-engine/evaluate.py)
89 - [acr-engine/run_demo.py](../acr-engine/run_demo.py)
90
91 ### Data layer
92 - [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py)
93 - [acr-engine/src/data/synthetic.py](../acr-engine/src/data/synthetic.py)
94 - [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py)
95 - [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py)
96
97 ### Retrieval and model layer
98 - [acr-engine/src/engines/hybrid_engine.py](../acr-engine/src/engines/hybrid_engine.py)
99 - [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py)
100 - [acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py)
101 - [acr-engine/src/models/ecapa_tdnn.py](../acr-engine/src/models/ecapa_tdnn.py)
102 - [acr-engine/src/models/losses.py](../acr-engine/src/models/losses.py)
103
104 ### Service layer
105 - [acr-engine/src/service/app.py](../acr-engine/src/service/app.py)
106
107 ---
108
109 ## 4. Key capabilities already completed
110
111 ### 4.1 Prototype and synthetic data
112 - the synthetic dataset can be generated
113 - `train.py --dry-run` passes
114 - a checkpoint can be trained
115 - build-index works
116 - recognize works
117 - evaluate works
118
119 ### 4.2 Open-dataset ingestion
120 The following commands are available:
121
122 - `inspect-local`
123 - `inspect-batch`
124 - `prepare-local`
125 - `validate-local`
126 - `smoke-local`
127
128 All of them live in:
129 - [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py)
130
131 ### 4.3 Docs and release artifacts
132 The open-dataset smoke run can also generate:
133 - benchmark report
134 - model card
135 - release checklist
136 - artifact manifest
137
138 ---
139
140 ## 5. How the open-dataset workflow currently works in practice
141
142 ### Where to put real audio
143 - [acr-engine/data/raw/fma_small_audio/](../acr-engine/data/raw/fma_small_audio/)
144 - [acr-engine/data/raw/mtg_jamendo_audio/](../acr-engine/data/raw/mtg_jamendo_audio/)
145
146 Instructions file:
147 - [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md)
148
149 ### Currently recommended commands
150
151 #### FMA
152 ```bash
153 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
154 ```
155
156 #### MTG-Jamendo
157 ```bash
158 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
159 ```
160
161 ### What smoke-local has been verified to do
162 `smoke-local` automatically runs:
163 1. inspect-local
164 2. prepare-local
165 3. validate-local
166 4. train
167 5. build-index
168 6. evaluate
169 7. generate_artifacts
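
The seven stages above run strictly in sequence, so a failure in an early stage should stop the later ones. The following is a minimal sketch of that pattern only, not the actual `smoke-local` implementation in `external_adapters.py`; `run_pipeline` and the step callables are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; each function takes the shared
# context dict and returns an updated one. All names are hypothetical.
Step = Tuple[str, Callable[[Dict], Dict]]


def run_pipeline(steps: List[Step], context: Dict) -> Dict:
    """Run steps in order, stopping at the first failure."""
    for name, step in steps:
        try:
            context = step(context)
        except Exception as exc:
            # Name the failing stage so the operator knows where to resume.
            raise RuntimeError(f"step '{name}' failed: {exc}") from exc
        context.setdefault("completed", []).append(name)
    return context


if __name__ == "__main__":
    # Placeholder stages that only record that they ran.
    stage_names = [
        "inspect-local", "prepare-local", "validate-local",
        "train", "build-index", "evaluate", "generate_artifacts",
    ]
    demo_steps = [(name, lambda ctx: dict(ctx)) for name in stage_names]
    print(run_pipeline(demo_steps, {})["completed"])
```

Recording the completed stage names makes it easy to see how far a run got when a real corpus exposes a problem that the synthetic stand-in did not.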
170
171 ---
172
173 ## 6. Most important verification evidence
174
175 ### 6.1 synthetic-as-open-fixed (open-data stand-in)
176 Successfully verified:
177 - `prepare-local`
178 - `validate-local`
179 - `train.py`
180 - `build-index`
181 - `evaluate.py`
182 - `generate_artifacts.py`
183
184 Key results:
185 - `num_queries=8`
186 - `top1=1.0`
187 - `topk=1.0`
188
189 Related directories:
190 - [acr-engine/data/external_ingested/synthetic_as_open_fixed/](../acr-engine/data/external_ingested/synthetic_as_open_fixed/)
191 - [acr-engine/reports/open-smoke-fixed/fma/](../acr-engine/reports/open-smoke-fixed/fma/)
192
193 ### 6.2 One-command smoke-local
194 Verified:
195 ```bash
196 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/synthetic_v2/songs --output-root data/external_smoke --eval-ratio 0.2 --query-duration 5.0 --train-epochs 1 --batch-size 2
197 ```
198
199 Key results:
200 - `num_audio_files=24`
201 - `catalog=24`
202 - `train_queries=16`
203 - `test_queries=8`
204 - `top1=1.0`
205 - `topk=1.0`
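
For readers new to retrieval metrics, `top1` and `topk` are conventionally the fraction of queries whose correct song appears in the first 1 or first k ranked results. The sketch below shows that conventional definition; it is not taken from `evaluate.py`, and the function name and toy data are made up for illustration.

```python
from typing import Dict, List


def topk_accuracy(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int) -> float:
    """Fraction of queries whose true song id is among the top-k results."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truth:
        raise ValueError("truth must contain at least one query")
    hits = sum(
        1 for query, gold in truth.items()
        if gold in ranked.get(query, [])[:k]
    )
    return hits / len(truth)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    ranked = {"q1": ["s1", "s2"], "q2": ["s3", "s2"], "q3": ["s9", "s3"]}
    truth = {"q1": "s1", "q2": "s2", "q3": "s3"}
    print("top1:", topk_accuracy(ranked, truth, k=1))
    print("top2:", topk_accuracy(ranked, truth, k=2))
```

With only 8 test queries, a single miss changes the score by 0.125, so a perfect result on the stand-in says little about accuracy on real audio.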
206
207 Related directory:
208 - [acr-engine/data/external_smoke/](../acr-engine/data/external_smoke/)
209
210 ---
211
212 ## 7. Most important open tasks
213
214 ### Priority A: replace the stand-in with real open data
215 Goal:
216 - replace the synthetic stand-in with real local FMA / MTG-Jamendo audio
217
218 Steps:
219 1. Put the real audio into:
220 - `acr-engine/data/raw/fma_small_audio/`
221 - `acr-engine/data/raw/mtg_jamendo_audio/`
222 2. Run `smoke-local` directly
223 3. Record:
224 - dataset size reported by inspect
225 - number of train/test queries
226 - top1/topk
227 - artifact bundle
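
Before running `smoke-local` on a drop zone, it can help to confirm that the directory actually contains audio. The following is a small sketch of such a check, run from the repository root; the extension list is an assumption, not the set of formats the adapters accept, and `count_audio_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable

# Assumption: common audio extensions; adjust to what the adapters accept.
AUDIO_EXTENSIONS = (".mp3", ".wav", ".flac", ".ogg")


def count_audio_files(root: Path, extensions: Iterable[str] = AUDIO_EXTENSIONS) -> int:
    """Count audio files under root, recursively; 0 if root is missing."""
    if not root.is_dir():
        return 0
    wanted = {ext.lower() for ext in extensions}
    return sum(
        1 for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


if __name__ == "__main__":
    for drop_zone in (
        "acr-engine/data/raw/fma_small_audio",
        "acr-engine/data/raw/mtg_jamendo_audio",
    ):
        print(drop_zone, count_audio_files(Path(drop_zone)))
```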
228
229 ### Priority B: keep improving hard-case accuracy
230 Findings so far:
231 - naive oversampling: failed
232 - type-aware weighting: partially effective
233 - sample-level weighting: improved `confused`
234 - retrieval fusion tuning: more consistently effective
235
236 Focus for the next phase:
237 - `confused`
238 - `humming_like`
239 - hard-case buckets on real open data
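
To make "retrieval fusion tuning" concrete, one common form is late fusion: combine the fingerprint score and the embedding score for each candidate with a tunable weight, then re-rank. The sketch below illustrates that general idea only; it is not the logic in `hybrid_engine.py`, and it assumes both scores are already normalized to a comparable range. `fuse_scores` and `weight` are hypothetical names.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    fingerprint: Dict[str, float],
    embedding: Dict[str, float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted late fusion of two per-candidate score maps.

    weight is the share given to the fingerprint score; a candidate
    missing from one retriever gets 0.0 from that retriever.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    candidates = set(fingerprint) | set(embedding)
    fused = {
        song: weight * fingerprint.get(song, 0.0)
        + (1.0 - weight) * embedding.get(song, 0.0)
        for song in candidates
    }
    # Highest fused score first; ties broken by song id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy scores, invented for this example only.
    fp = {"song_a": 0.9, "song_b": 0.2}
    emb = {"song_a": 0.4, "song_b": 0.8, "song_c": 0.5}
    for w in (0.0, 0.5, 1.0):
        print(w, [song for song, _ in fuse_scores(fp, emb, weight=w)])
```

Because the weight is applied at retrieval time, it can be swept over a held-out query set without retraining, which fits the finding that fusion tuning was more consistently effective than reweighting the training data.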
240
241 ### Priority C: foundation model / SOTA baseline
242 Already recorded in the docs:
243 - MERT
244 - MuQ
245 - stronger retrieval-first approaches
246
247 Possible follow-ups:
248 - frozen backbone baseline
249 - adapter fine-tune
250
251 ---
252
253 ## 8. Latest key commits (to help a new session orient quickly)
254
255 Recent key commits worth reading first:
256
257 - `d221852` Add explicit drop zones for real open-music corpora
258 - `eee15ac` Automate the full open-dataset smoke workflow behind one command
259 - `8795907` Generate release artifacts for the open-dataset smoke path
260 - `dc9ef1b` Close the open-dataset smoke loop through evaluation
261 - `b766c74` Make open-dataset manifests trainable end to end
262 - `fa23144` Add a single-page open dataset workflow for training prep
263 - `af33be3` Condense docs and add manifest validation before training
264
265 These commits largely cover the main line of recent open-dataset and docs work.
266
267 ---
268
269 ## 9. Recommended actions when a new session takes over
270
271 If you are a new session, the suggested order is:
272
273 1. Read:
274 - [docs/README.md](./README.md)
275 - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
276 - [docs/session-handoff.md](./session-handoff.md)
277
278 2. Check whether real data is in place:
279 - `acr-engine/data/raw/fma_small_audio/`
280 - `acr-engine/data/raw/mtg_jamendo_audio/`
281
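282 3. If real audio is already present:
283 - run `smoke-local` directly
284
285 4. If there is no real audio yet:
286 - keep improving synthetic-as-open-fixed
287 - or keep building out automation for open-data download and cleaning
288
289 5. After completing each stage:
290 - update [docs/CHANGELOG.md](./CHANGELOG.md)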
291 - `git commit`
292 - `git push`
293
294 ---
295
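296 ## 10. Caveats
297
298 - This repository contains tracked `__pycache__` files; when committing, take care that they do not pollute the change set.
299 - The most reliable direction for improvement right now is not blind tuning of training weights, but:
300 - retrieval-time fusion
301 - more realistic open data
302 - more realistic evaluation
303 - The open-data layout now depends on a "self-contained output root":
304 - `audio/`
305 - `manifests/`
306 Do not break this in later changes.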
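
The self-contained output root can be guarded with a quick layout check before or after a run. This sketch only tests for the two subdirectories named above; `missing_layout_dirs` is a hypothetical helper, and the repository may enforce further rules that it does not capture.

```python
from pathlib import Path
from typing import List

# The two subdirectories the handoff doc says an output root must contain.
REQUIRED_SUBDIRS = ("audio", "manifests")


def missing_layout_dirs(output_root: Path) -> List[str]:
    """Return the required subdirectory names absent from output_root."""
    return [name for name in REQUIRED_SUBDIRS if not (output_root / name).is_dir()]


if __name__ == "__main__":
    root = Path("acr-engine/data/external_smoke")
    missing = missing_layout_dirs(root)
    print("layout ok" if not missing else f"missing under {root}: {missing}")
```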
307
308 ---
309
310 ## Sources
311 - [README.md](./README.md)
312 - [open-dataset-workflow.md](./open-dataset-workflow.md)
313 - [CHANGELOG.md](./CHANGELOG.md)