
Open Dataset Workflow

Updated: 2026-06-02

One-Page Summary

To actually integrate an open-source music catalog such as FMA or MTG-Jamendo into the project, this pipeline is the only thing you need to remember:

  1. inspect-local / inspect-batch
  2. prepare-local
  3. validate-local
  4. Then move on to training and evaluation

1. Workflow diagram

flowchart LR
    A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
    B --> C[prepare-local]
    C --> D[validate-local]
    D --> E[train.json]
    D --> F[test.json]
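
The last stage of the diagram hands two manifest files to training. The following is a minimal sketch of a quick existence-and-parse check on those files. It is not the implementation of validate-local: the helper name `check_manifests` is hypothetical, and the only assumptions are that train.json and test.json sit in one manifest directory and contain valid JSON.

```python
import json
from pathlib import Path


def check_manifests(manifest_dir, names=("train.json", "test.json")):
    """Return a dict mapping each manifest file name to a status string.

    Hypothetical helper: it only checks that each file exists and parses as
    JSON. It does not reproduce whatever validate-local actually checks.
    """
    root = Path(manifest_dir)
    report = {}
    for name in names:
        path = root / name
        if not path.is_file():
            report[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            report[name] = "invalid JSON"
        else:
            report[name] = "ok"
    return report
```

Calling `check_manifests("data/external_ingested/fma/manifests")` after prepare-local gives a fast first signal before running the full validate-local step.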

2. Quick command reference

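| Step | Command | Purpose |
| --- | --- | --- |
| Pre-check | src/data/external_adapters.py inspect-local ... | Check whether the dataset is large enough |
| Batch comparison | src/data/external_adapters.py inspect-batch ... | Compare several candidate directories |
| Generate manifests | src/data/external_adapters.py prepare-local ... | Produce train/test/catalog |
| Pre-training validation | src/data/external_adapters.py validate-local ... | Confirm the structure is correct |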

3. Recommended order

3.1 Single directory

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
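
The three commands above share most of their arguments, so they can be generated from one set of parameters. The sketch below only builds the argument lists and, optionally, runs them in order. The helper names `build_single_dir_commands` and `run_all` are hypothetical, and the manifest location `<output-root>/<name>/manifests` is an assumption inferred from the example paths rather than documented behaviour.

```python
import subprocess
import sys
from pathlib import PurePosixPath

SCRIPT = "src/data/external_adapters.py"


def build_single_dir_commands(name, audio_dir,
                              output_root="data/external_ingested",
                              eval_ratio=0.2, query_duration=8.0,
                              python=sys.executable):
    """Build the inspect / prepare / validate commands for one directory."""
    split_flags = ["--eval-ratio", str(eval_ratio),
                   "--query-duration", str(query_duration)]
    # Assumption: manifests land in <output_root>/<name>/manifests,
    # mirroring the paths used in the commands above.
    manifest_dir = str(PurePosixPath(output_root) / name / "manifests")
    return [
        [python, SCRIPT, "inspect-local", name, audio_dir, *split_flags],
        [python, SCRIPT, "prepare-local", name, audio_dir,
         "--output-root", output_root, *split_flags],
        [python, SCRIPT, "validate-local", name, manifest_dir],
    ]


def run_all(commands):
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(build_single_dir_commands("fma", "data/raw/fma_small_audio"))` would run the same three steps, using whichever interpreter is executing the script instead of the hard-coded miniconda path.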

3.2 Comparing multiple directories

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
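
The inspect-batch command takes its candidates as `name=path` pairs. As a rough sketch, those pairs can be produced from a dictionary so that adding a candidate is a one-line change. The helper name `build_inspect_batch_command` is hypothetical; only the flags and the `name=path` form shown above are taken from the source.

```python
import sys

SCRIPT = "src/data/external_adapters.py"


def build_inspect_batch_command(candidates, eval_ratio=0.2,
                                query_duration=8.0, python=sys.executable):
    """Build an inspect-batch command from a {name: audio_dir} mapping."""
    if not candidates:
        raise ValueError("at least one candidate directory is required")
    specs = []
    for name, path in candidates.items():
        # A '=' inside the name would make the name=path pair ambiguous.
        if "=" in name:
            raise ValueError(f"dataset name must not contain '=': {name!r}")
        specs.append(f"{name}={path}")
    return [python, SCRIPT, "inspect-batch", *specs,
            "--eval-ratio", str(eval_ratio),
            "--query-duration", str(query_duration)]


candidates = {
    "fma": "data/raw/fma_small_audio",
    "mtg_jamendo": "data/raw/mtg_jamendo_audio",
}
command = build_inspect_batch_command(candidates, python="python")
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`.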

4. Output artifacts


5. Current validation evidence

The open-data workflow has been run end to end on the local data/synthetic_v2/songs directory:

  • inspect-local
    • num_audio_files=24
    • recommended_train_queries=19
    • recommended_test_queries=5
  • prepare-local
    • catalog=24
    • train_queries=16
    • test_queries=8
  • validate-local
    • ok=true
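
The reported counts can be cross-checked with a few lines of arithmetic. This is a sketch under one stated assumption: the test share is rounded up, which is a guess at the rounding rule and not something the source confirms. The helper names `split_counts` and `eval_fraction` are hypothetical.

```python
import math


def split_counts(num_items, eval_ratio):
    """Split num_items into (train, test) counts.

    Assumption: the test share is rounded up. The source does not state
    the rounding rule used by inspect-local.
    """
    num_test = math.ceil(num_items * eval_ratio)
    return num_items - num_test, num_test


def eval_fraction(train_queries, test_queries):
    """Fraction of all queries that ended up in the test split."""
    return test_queries / (train_queries + test_queries)


# Figures copied from the run reported above.
CATALOG_SIZE = 24
inspect_report = {"train": 19, "test": 5}
prepare_report = {"train": 16, "test": 8}

# An eval ratio of 0.2 over 24 files reproduces the inspect-local numbers.
assert split_counts(CATALOG_SIZE, 0.2) == (19, 5)

# Both reports account for every file in the catalog.
for report in (inspect_report, prepare_report):
    assert report["train"] + report["test"] == CATALOG_SIZE

print(round(eval_fraction(19, 5), 3))  # 0.208
print(round(eval_fraction(16, 8), 3))  # 0.333
```

Both reports cover all 24 catalog files. The prepare-local split corresponds to a test fraction of one third rather than 0.2; the figures above do not say why the two steps differ.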
