
Open Dataset Workflow

0. Local Real-Data Readiness Check

Before running smoke-local, confirm that the directory actually contains enough audio:

/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0

Readiness criteria:

  • At least 2 audio files
  • At least 2 files with a duration >= 8 s, long enough to cut a query from
  • Proceed to the full smoke run only when ready_for_smoke=true
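
The criteria above can be sketched as a small pure function. This is only an illustration: `is_ready_for_smoke`, `num_queryable_files`, and the arguments are hypothetical names, not the actual implementation in src/data/external_adapters.py, and the sketch assumes the file durations have already been measured in seconds.

```python
from typing import Dict, List


def is_ready_for_smoke(durations_s: List[float],
                       query_duration: float = 8.0,
                       min_files: int = 2) -> Dict[str, object]:
    """Hypothetical readiness check mirroring the three criteria above."""
    # Files long enough to cut a query of `query_duration` seconds from.
    queryable = [d for d in durations_s if d >= query_duration]
    ready = len(durations_s) >= min_files and len(queryable) >= min_files
    return {
        "num_audio_files": len(durations_s),
        "num_queryable_files": len(queryable),
        "ready_for_smoke": ready,
    }
```

An empty directory maps to an empty list, so `ready_for_smoke` is false in that case without any special handling.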

If the directory is empty, the status snapshot script also reports explicitly that the data is not ready.

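Updated: 2026-06-02

One-Page Summary

If you want to actually integrate an open-source music catalog such as FMA / MTG-Jamendo into the project, this is the one pipeline to remember: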

  1. inspect-local / inspect-batch
  2. prepare-local
  3. validate-local
  4. Then proceed to training and evaluation
  5. Generate the benchmark / model card / release artifacts
  6. Or use the one-shot smoke-local directly
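
The first three steps of the pipeline above could be chained from a script roughly as follows. This is a sketch under assumptions: `build_pipeline_commands` is a hypothetical helper, the flags are the ones shown in this document, and the manifests location `<output_root>/<dataset>/manifests` is inferred from the single-directory example rather than confirmed from the adapter code.

```python
import sys
from pathlib import Path
from typing import List

ADAPTER = "src/data/external_adapters.py"


def build_pipeline_commands(dataset: str, audio_dir: str, output_root: str,
                            eval_ratio: float = 0.2,
                            query_duration: float = 8.0) -> List[List[str]]:
    """Return argv lists for inspect-local, prepare-local and validate-local."""
    split_flags = ["--eval-ratio", str(eval_ratio),
                   "--query-duration", str(query_duration)]
    # Assumed layout: prepare-local writes to <output_root>/<dataset>/manifests.
    manifests = (Path(output_root) / dataset / "manifests").as_posix()
    base = [sys.executable, ADAPTER]
    return [
        base + ["inspect-local", dataset, audio_dir] + split_flags,
        base + ["prepare-local", dataset, audio_dir,
                "--output-root", output_root] + split_flags,
        base + ["validate-local", dataset, manifests],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the chain before training starts.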

1. Workflow Diagram

flowchart LR
    A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
    B --> C[prepare-local]
    C --> D[validate-local]
    D --> E[train.py]
    D --> F[evaluate.py]
    F --> G[artifact bundle]

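2. Minimal Command Table

| Step | Command | Purpose |
| --- | --- | --- |
| Pre-check | src/data/external_adapters.py inspect-local ... | Check whether the dataset is large enough |
| Batch comparison | src/data/external_adapters.py inspect-batch ... | Compare multiple candidate directories |
| Generate manifests | src/data/external_adapters.py prepare-local ... | Produce train/test/catalog |
| Pre-training validation | src/data/external_adapters.py validate-local ... | Confirm the structure is correct |
| Training smoke | train.py --data ... --dry-run | Verify that the manifests can go straight into training |
| Release artifacts | scripts/generate_artifacts.py | Generate benchmark/model-card/release-checklist |
| One-shot smoke | src/data/external_adapters.py smoke-local ... | Run the full pipeline automatically |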

3. Recommended Order

3.1 Single Directory

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
/usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local

3.2 Comparing Multiple Directories

/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
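
The `name=path` arguments in the command above pair an adapter name with a candidate directory. How the adapter parses them is not shown in this document; a minimal sketch of one reasonable approach, with `parse_dataset_specs` as a hypothetical helper name, might look like this:

```python
from typing import Dict, List


def parse_dataset_specs(specs: List[str]) -> Dict[str, str]:
    """Parse ['fma=some/dir', ...] into {'fma': 'some/dir', ...}."""
    parsed: Dict[str, str] = {}
    for spec in specs:
        # Split on the first '=' only, so paths containing '=' survive intact.
        name, sep, path = spec.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {spec!r}")
        if name in parsed:
            raise ValueError(f"duplicate dataset name: {name!r}")
        parsed[name] = path
    return parsed
```

Rejecting duplicate names up front keeps a batch comparison from silently overwriting one candidate with another.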

3.3 One-Shot Smoke

/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 --device auto

For where to place real dataset directories, see:


4. Output Artifacts


5. Current Validation Evidence

The open-data workflow has been run end to end locally on data/synthetic_v2/songs:

  • inspect-local
    • num_audio_files=24
    • recommended_train_queries=19
    • recommended_test_queries=5
  • prepare-local
    • catalog=24
    • train_queries=16
    • test_queries=8
  • validate-local
    • ok=true
  • train.py --dry-run
    • Dry run passed! Pipeline is working.
  • build-index + evaluate
    • top1=1.0
    • topk=1.0
  • generate_artifacts
    • benchmark-report.md
    • model-card.md
    • release-checklist.md
  • smoke-local
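    • Returns a summary of the inspect / prepare / validate / report paths in a single call
    • Now supports --device cpu|cuda|auto
    • auto is resolved to the actual device inside smoke, so the literal string auto is never passed to the embedding/eval side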
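
The `auto` resolution described in the last bullet can be illustrated with a small pure function. This is a sketch, not the project's code: `resolve_device` is a hypothetical name, and CUDA availability is passed in as a plain boolean so the example does not depend on any particular framework.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map cpu|cuda|auto to a concrete device string ('cpu' or 'cuda')."""
    if requested not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "auto":
        # Resolve once, up front, so downstream code never sees 'auto'.
        return "cuda" if cuda_available else "cpu"
    # Explicit choices are passed through unchanged; an explicit 'cuda'
    # on a machine without CUDA is left for the downstream code to reject.
    return requested
```

If the project uses PyTorch (the `best_model.pt` file name suggests so, but this document does not confirm it), the boolean would typically come from `torch.cuda.is_available()`.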

Single Preparation Command After the FMA Download Completes

cd acr-engine
/usr/local/miniconda3/bin/python scripts/fma_postdownload_ready.py

When the archive is complete, this script automatically runs:

  1. extract
  2. check-local-ready
  3. inspect-local

If the archive has not finished downloading, it returns a structured archive_not_complete result.
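
One way to detect an incomplete archive and return a structured result is sketched below. The actual check in scripts/fma_postdownload_ready.py is not shown in this document, so everything here is an assumption: `check_archive`, the `reason` and `size_bytes` fields, and the `archive_complete` status are hypothetical, and only the `archive_not_complete` status comes from the text above.

```python
import zipfile
from pathlib import Path
from typing import Dict


def check_archive(archive: Path) -> Dict[str, object]:
    """Return a structured status for a zip that may still be downloading."""
    if not archive.exists():
        return {"ok": False, "status": "archive_not_complete",
                "reason": "missing"}
    size = archive.stat().st_size
    # A zip's end-of-central-directory record sits at the end of the file,
    # so a partially downloaded archive normally fails this check.
    if not zipfile.is_zipfile(str(archive)):
        return {"ok": False, "status": "archive_not_complete",
                "reason": "not_a_valid_zip", "size_bytes": size}
    return {"ok": True, "status": "archive_complete", "size_bytes": size}
```

`zipfile.is_zipfile` only verifies that the end-of-archive record is present; a stricter check could also call `ZipFile.testzip()` to verify the CRC of every member, at the cost of reading the whole archive.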

Wait for the FMA Download and Switch Over Automatically

cd acr-engine
/usr/local/miniconda3/bin/python scripts/wait_for_fma_and_prepare.py --interval 30 --max-cycles 120

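What it does:

  • Periodically checks whether fma_small.zip has finished downloading
  • Once it has, automatically hands off to scripts/fma_postdownload_ready.py
  • If it has not, returns waiting together with the most recent progress snapshot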
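
The polling behaviour behind --interval and --max-cycles can be sketched as a bounded loop. This is an illustration under assumptions, not the script's real code: `wait_until_complete`, the `complete` status, and the `cycles` field are hypothetical, and the completion check and the sleep function are injected so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Dict


def wait_until_complete(is_complete: Callable[[], bool],
                        interval: float = 30.0,
                        max_cycles: int = 120,
                        sleep: Callable[[float], None] = time.sleep
                        ) -> Dict[str, object]:
    """Poll `is_complete` at most `max_cycles` times, `interval` s apart."""
    for cycle in range(1, max_cycles + 1):
        if is_complete():
            return {"status": "complete", "cycles": cycle}
        # Skip the sleep after the final check; nothing follows it.
        if cycle < max_cycles:
            sleep(interval)
    return {"status": "waiting", "cycles": max_cycles}
```

With the values from the command above (30 s interval, 120 cycles), a loop of this shape gives up after roughly an hour and reports waiting instead of blocking indefinitely.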
