Open Dataset Integration Plan
Recommended order
- FMA small
  - URL: https://github.com/mdeff/fma
  - Why: easiest small realistic music subset for retrieval experiments
- MTG-Jamendo
  - URL: https://github.com/MTG/mtg-jamendo-dataset
  - Why: larger CC-licensed corpus with scriptable upstream tooling
- QBSH / humming corpora
  - Why: add after retrieval baseline is stable
Repo strategy
- Keep external dataset ingestion optional
- Convert external tracks into:
  - `catalog.json` for searchable references
  - query segment manifests for evaluation
- Start with small local subsets before full-corpus scaling
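
The strategy above could be sketched roughly as follows. This is a minimal illustration, not the repo's actual converter: the field names (`track_id`, `source`, `path`, `query_id`, `start_seconds`, `duration_seconds`), the `query_segments.json` file name, the 10-second segment length, and every function name are assumptions made for the example. Only `catalog.json` comes from the plan itself.

```python
import json
from pathlib import Path

# Assumed set of audio extensions; adjust to what each dataset ships.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg"}


def build_catalog(audio_dir, source, limit=None):
    """List audio files under audio_dir as catalog entries (hypothetical schema)."""
    root = Path(audio_dir)
    paths = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    if limit is not None:
        paths = paths[:limit]  # small local subset before full-corpus scaling
    catalog = []
    for p in paths:
        rel = p.relative_to(root)
        catalog.append({
            "track_id": f"{source}:{rel.with_suffix('').as_posix()}",
            "source": source,
            "path": rel.as_posix(),
        })
    return catalog


def build_query_manifest(catalog, segment_seconds=10.0, offset_seconds=0.0):
    """One query segment per track; the expected match is the track itself."""
    return [
        {
            "query_id": f"{entry['track_id']}@{offset_seconds:g}s",
            "track_id": entry["track_id"],
            "start_seconds": offset_seconds,
            "duration_seconds": segment_seconds,
        }
        for entry in catalog
    ]


def ingest_optional(audio_dir, source, out_dir, limit=50):
    """Return None when the dataset is not present, so ingestion stays optional."""
    if not Path(audio_dir).is_dir():
        return None
    catalog = build_catalog(audio_dir, source, limit=limit)
    manifest = build_query_manifest(catalog)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "catalog.json").write_text(json.dumps(catalog, indent=2))
    (out / "query_segments.json").write_text(json.dumps(manifest, indent=2))
    return catalog, manifest
```

A call such as `ingest_optional("data/external/fma_small", "fma_small", "data/processed/fma_small", limit=50)` (both paths are hypothetical) would write the two files for the first 50 tracks, or return `None` if the dataset has not been downloaded. The sketch only records segment offsets. Cutting the audio itself is left to whatever tooling the evaluation uses.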