
Open Dataset Workflow

Updated: 2026-06-02

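One-Page Summary

If you want to actually wire an open-source music catalog such as FMA or MTG-Jamendo into the project, this is the only pipeline you need to remember: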

  1. inspect-local / inspect-batch
  2. prepare-local
  3. validate-local
  4. Only then move on to training and evaluation
  5. Generate the benchmark / model card / release artifacts

1. Workflow Diagram

flowchart LR
    A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
    B --> C[prepare-local]
    C --> D[validate-local]
    D --> E[train.py]
    D --> F[evaluate.py]
    F --> G[artifact bundle]
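
The dependencies in the diagram can also be written down as data. The following is a minimal sketch, not part of the project: the node names are copied from the flowchart, while `DEPENDENCIES` and `execution_order` are hypothetical names used only for illustration. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its value set (edges copied from the flowchart).
DEPENDENCIES = {
    "inspect-local / inspect-batch": {"Local Open Audio Dir"},
    "prepare-local": {"inspect-local / inspect-batch"},
    "validate-local": {"prepare-local"},
    "train.py": {"validate-local"},
    "evaluate.py": {"validate-local"},
    "artifact bundle": {"evaluate.py"},
}


def execution_order(dependencies):
    """Return one order in which every step comes after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step in execution_order(DEPENDENCIES):
        print(step)
```

The diagram only requires that train.py and evaluate.py both follow validate-local, so the sorter may emit them in either order. In the command sequence of section 3.1, evaluate.py reads a model file from the training output directory, so training runs first in practice.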

2. Quick Command Reference

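| Step | Command | Purpose |
| --- | --- | --- |
| Pre-check | `src/data/external_adapters.py inspect-local ...` | Check whether the dataset is large enough |
| Batch comparison | `src/data/external_adapters.py inspect-batch ...` | Compare multiple candidate directories |
| Generate manifests | `src/data/external_adapters.py prepare-local ...` | Produce train/test/catalog |
| Pre-training validation | `src/data/external_adapters.py validate-local ...` | Confirm the structure is correct |
| Training smoke test | `train.py --data ... --dry-run` | Verify that the manifests can go straight into training |
| Release artifacts | `scripts/generate_artifacts.py` | Generate benchmark/model-card/release-checklist |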

3. Recommended Order

3.1 Single Directory

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
/usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
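
The first four commands above can be chained so that a failure stops the run before training starts. This is a minimal sketch under several assumptions: `build_commands` and `run_pipeline` are hypothetical helper names, the flags are copied from the commands above, the manifests path `<output-root>/<dataset>/manifests` is inferred from the fma example, and the scripts are assumed to exit with a non-zero status on failure.

```python
import subprocess
import sys

ADAPTER = "src/data/external_adapters.py"


def build_commands(dataset, raw_dir, output_root="data/external_ingested",
                   eval_ratio="0.2", query_duration="8.0", python=sys.executable):
    """Build the inspect -> prepare -> validate -> dry-run commands as argv lists."""
    split_flags = ["--eval-ratio", eval_ratio, "--query-duration", query_duration]
    # Assumption: prepare-local writes manifests to <output_root>/<dataset>/manifests.
    manifests = f"{output_root}/{dataset}/manifests"
    return [
        [python, ADAPTER, "inspect-local", dataset, raw_dir, *split_flags],
        [python, ADAPTER, "prepare-local", dataset, raw_dir,
         "--output-root", output_root, *split_flags],
        [python, ADAPTER, "validate-local", dataset, manifests],
        [python, "train.py", "--data", manifests,
         "--output", f"data/models_{dataset}_smoke",
         "--device", "cpu", "--epochs", "1", "--batch-size", "2", "--dry-run"],
    ]


def run_pipeline(commands, execute=False):
    """Print each command; with execute=True also run it and stop on the first failure."""
    for argv in commands:
        print("+", " ".join(argv))
        if execute:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Prints the commands only; pass execute=True to actually run them.
    run_pipeline(build_commands("fma", "data/raw/fma_small_audio"))
```

Passing argv lists instead of a shell string avoids quoting problems when a directory name contains spaces.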

3.2 Comparing Multiple Directories

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
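
When the list of candidate directories grows, the `name=path` arguments can be generated from a mapping instead of typed by hand. This is a rough sketch: `batch_specs` and `inspect_batch_command` are hypothetical helpers, and the `name=path` form is copied from the command above without knowing how the CLI parses it.

```python
def batch_specs(candidates):
    """Turn {"name": "dir"} into the name=path arguments used by inspect-batch."""
    specs = []
    for name, path in candidates.items():
        # A name that is empty or contains "=" would make the pair ambiguous.
        if not name or "=" in name:
            raise ValueError(f"invalid dataset name: {name!r}")
        specs.append(f"{name}={path}")
    return specs


def inspect_batch_command(candidates, eval_ratio="0.2", query_duration="8.0",
                          python="python"):
    """Build the inspect-batch argv list with the flags shown in the command above."""
    return [python, "src/data/external_adapters.py", "inspect-batch",
            *batch_specs(candidates),
            "--eval-ratio", eval_ratio, "--query-duration", query_duration]


if __name__ == "__main__":
    command = inspect_batch_command({
        "fma": "data/raw/fma_small_audio",
        "mtg_jamendo": "data/raw/mtg_jamendo_audio",
    })
    print(" ".join(command))
```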

4. Output Artifacts


5. Current Validation Evidence

The open-data workflow has been run end to end locally on data/synthetic_v2/songs:

  • inspect-local
    • num_audio_files=24
    • recommended_train_queries=19
    • recommended_test_queries=5
  • prepare-local
    • catalog=24
    • train_queries=16
    • test_queries=8
  • validate-local
    • ok=true
  • train.py --dry-run
    • Dry run passed! Pipeline is working.
  • build-index + evaluate
    • top1=1.0
    • topk=1.0
  • generate_artifacts
    • benchmark-report.md
    • model-card.md
    • release-checklist.md
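
The query counts in the list above can be cross-checked with simple arithmetic. This sketch assumes that every catalog entry yields exactly one query, which the list does not state; `split_summary` is a hypothetical helper, and the numbers are copied from the inspect-local and prepare-local entries.

```python
def split_summary(catalog, train_queries, test_queries):
    """Check that train + test covers the catalog and return the test share."""
    if train_queries + test_queries != catalog:
        raise ValueError("train and test queries do not add up to the catalog size")
    return test_queries / catalog


if __name__ == "__main__":
    # Numbers copied from the validation evidence list.
    print(f"inspect-local recommendation: {split_summary(24, 19, 5):.3f}")
    print(f"prepare-local result:         {split_summary(24, 16, 8):.3f}")
```

Both splits add up to the 24 catalog entries. The recommended split puts roughly 0.21 of the queries in the test set, while the prepared manifests put roughly 0.33 there. The `--eval-ratio` used for this run is not recorded here, so the difference is worth confirming before the numbers are reused.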
