Open Dataset Workflow
0. Local real-data readiness check
Before running smoke-local, confirm that the directory actually contains enough audio:
/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
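Readiness criteria:
- At least 2 audio files
- At least 2 files with a duration >= 8s that can be sliced into queries
- Proceed to the full smoke run only when `ready_for_smoke=true`
If the directory is empty, the status snapshot script also reports explicitly that the data is not ready.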
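The three criteria can be sketched as a small predicate. This is only an illustration of the rule as stated here, not the code behind check-local-ready: the function name `check_ready`, the `Readiness` dataclass, and the assumption that per-file durations are already known are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    num_audio_files: int
    num_query_capable: int
    ready_for_smoke: bool


def check_ready(durations_s, query_duration=8.0, min_files=2, min_query_files=2):
    """Apply the readiness criteria to a list of per-file durations in seconds."""
    num_files = len(durations_s)
    # A file can supply a query only if it is at least query_duration long.
    num_query_capable = sum(1 for d in durations_s if d >= query_duration)
    ready = num_files >= min_files and num_query_capable >= min_query_files
    return Readiness(num_files, num_query_capable, ready)


# Three files, two of them long enough for an 8 s query.
print(check_ready([30.0, 12.5, 4.0]))
# An empty directory is never ready.
print(check_ready([]))
```

The thresholds are keyword arguments so the same check can follow a different `--query-duration` value.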
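Updated: 2026-06-02
One-page summary
If you want to actually bring open music collections such as FMA / MTG-Jamendo into the project, this is the only chain you need to remember:
- inspect-local / inspect-batch
- prepare-local
- validate-local
- then move on to training and evaluation
- generate benchmark / model card / release artifacts
- or simply use the one-shot `smoke-local`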
1. Workflow diagram
flowchart LR
A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
B --> C[prepare-local]
C --> D[validate-local]
D --> E[train.py]
D --> F[evaluate.py]
F --> G[artifact bundle]
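2. Minimal command table
| Step | Command | Purpose |
|---|---|---|
| Pre-check | `src/data/external_adapters.py inspect-local ...` | Check whether the dataset is large enough |
| Batch comparison | `src/data/external_adapters.py inspect-batch ...` | Compare multiple candidate directories |
| Generate manifests | `src/data/external_adapters.py prepare-local ...` | Produce train/test/catalog |
| Pre-training validation | `src/data/external_adapters.py validate-local ...` | Confirm the structure is correct |
| Training smoke | `train.py --data ... --dry-run` | Verify that the manifests can go straight into training |
| Release artifacts | `scripts/generate_artifacts.py` | Generate benchmark/model-card/release-checklist |
| One-shot smoke | `src/data/external_adapters.py smoke-local ...` | Run the full chain automatically |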
3. Recommended order
3.1 Single directory
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
/usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
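When the same sequence has to be repeated for another dataset, the first three commands can be assembled programmatically. The sketch below only reuses the subcommands and flags shown above. The helper name `ingest_commands` is hypothetical, and the `<output-root>/<name>/manifests` layout is inferred from the example paths rather than from any documented contract.

```python
import shlex

PYTHON = "/usr/local/miniconda3/bin/python"
ADAPTER = "src/data/external_adapters.py"


def ingest_commands(name, audio_dir, output_root="data/external_ingested",
                    eval_ratio=0.2, query_duration=8.0):
    """Build argv lists for inspect-local, prepare-local and validate-local."""
    split_flags = ["--eval-ratio", str(eval_ratio),
                   "--query-duration", str(query_duration)]
    # Assumption: manifests land under <output_root>/<name>/manifests.
    manifests = f"{output_root}/{name}/manifests"
    return [
        [PYTHON, ADAPTER, "inspect-local", name, audio_dir, *split_flags],
        [PYTHON, ADAPTER, "prepare-local", name, audio_dir,
         "--output-root", output_root, *split_flags],
        [PYTHON, ADAPTER, "validate-local", name, manifests],
    ]


# Print the commands instead of running them; pass each list to
# subprocess.run(argv, check=True) to execute inside the repository.
for argv in ingest_commands("fma", "data/raw/fma_small_audio"):
    print(shlex.join(argv))
```

Keeping each command as an argv list avoids shell quoting problems if a dataset path ever contains spaces.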
3.2 Comparing multiple directories
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
3.3 One-shot smoke
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
For where to place the real dataset directories, see:
- acr-engine/data/raw/README.md
- acr-engine/data/raw/fma_small_audio/
- acr-engine/data/raw/mtg_jamendo_audio/
4. Output files
- catalog.json: reference manifest used to build the index
- train.json: training queries + references
- test.json: fixed evaluation queries + references
- val.json: optional validation set
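A quick presence check on these files can catch an incomplete manifest directory before training. The sketch below is not a substitute for validate-local and makes no claim about the manifest schema: it only checks that the three required files exist and parse as JSON. The function name `check_manifest_dir` and the `[]` placeholder contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("catalog.json", "train.json", "test.json")
OPTIONAL = ("val.json",)


def check_manifest_dir(manifest_dir):
    """Report which manifest files are present, missing, or not valid JSON."""
    root = Path(manifest_dir)
    report = {"present": [], "missing": [], "invalid": []}
    for name in REQUIRED + OPTIONAL:
        path = root / name
        if not path.is_file():
            # val.json is optional, so its absence is not an error.
            if name in REQUIRED:
                report["missing"].append(name)
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            report["invalid"].append(name)
        else:
            report["present"].append(name)
    report["ok"] = not report["missing"] and not report["invalid"]
    return report


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED:
        (Path(tmp) / name).write_text("[]", encoding="utf-8")
    print(check_manifest_dir(tmp))
```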
5. Current validation evidence
The open-data workflow has been run end to end locally on data/synthetic_v2/songs:
- `inspect-local`: num_audio_files=24, recommended_train_queries=19, recommended_test_queries=5
- `prepare-local`: catalog=24, train_queries=16, test_queries=8
- `validate-local`: ok=true
- `train.py --dry-run`: Dry run passed! Pipeline is working.
- `build-index` + `evaluate`: top1=1.0, topk=1.0
- `generate_artifacts`: benchmark-report.md, model-card.md, release-checklist.md
- `smoke-local`: returns a summary of the inspect / prepare / validate / report paths in a single call
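The inspect-local recommendation of 19 train and 5 test queries is consistent with applying an eval ratio of 0.2 to 24 files and rounding to the nearest integer. That rounding rule is an assumption, and the function name `recommended_split` is hypothetical. The prepare-local counts (16 and 8) evidently follow a different rule that is not described here, so the sketch does not attempt to reproduce them.

```python
def recommended_split(num_files, eval_ratio):
    """Split a file count into (train, test) query counts."""
    # Assumption: the test count is num_files * eval_ratio, rounded to nearest.
    num_test = round(num_files * eval_ratio)
    return num_files - num_test, num_test


train, test = recommended_split(24, 0.2)
print(train, test)  # 19 5
```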
Single preparation command after the FMA download completes
cd acr-engine
/usr/local/miniconda3/bin/python scripts/fma_postdownload_ready.py
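When the archive is complete, this script automatically runs:
- extract
- check-local-ready
- inspect-local
If the archive has not finished downloading, it returns a structured `archive_not_complete` result.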