Commit fa231444 fa2314445d2af3ac0518ae6139dcdfa2a31b29e9 by cnb.bofCdSsphPA

Add a single-page open dataset workflow for training prep

Constraint: Open-dataset onboarding needed one short executable path instead of scattered instructions across many docs
Rejected: Leave ingestion knowledge split across multiple pages only | Raises setup friction before real FMA or MTG-Jamendo training
Confidence: high
Scope-risk: narrow
Directive: Use the single-page workflow as the default operator path before adding more open-dataset sources
Tested: /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/synthetic_v2/songs --eval-ratio 0.2 --query-duration 5.0; /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/synthetic_v2/songs --output-root data/external_ingested/synthetic_as_open --eval-ratio 0.2 --query-duration 5.0; /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/synthetic_as_open/fma/manifests
Not-tested: Real FMA or MTG-Jamendo local download directories
1 parent af33be35
[
{
"song_id": "fma_00000",
"audio_path": "songs/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "songs/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "songs/song_0002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "songs/song_0003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "songs/song_0004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "songs/song_0005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "songs/song_0006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "songs/song_0007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "songs/song_0008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "songs/song_0009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "songs/song_0010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "songs/song_0011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "songs/song_0012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "songs/song_0013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "songs/song_0014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "songs/song_0015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "songs/song_0016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "songs/song_0017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "songs/song_0018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "songs/song_0019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "songs/song_0020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "songs/song_0021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "songs/song_0022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "songs/song_0023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00000",
"audio_path": "songs/song_0000.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.394,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "songs/song_0003.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.922,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "songs/song_0004.wav",
"duration": 5.0,
"type": "clean",
"offset": 4.219,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "songs/song_0006.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.265,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "songs/song_0009.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.094,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "songs/song_0011.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.403,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "songs/song_0013.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.927,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "songs/song_0020.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.046,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "songs/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "songs/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "songs/song_0002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "songs/song_0003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "songs/song_0004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "songs/song_0005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "songs/song_0006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "songs/song_0007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "songs/song_0008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "songs/song_0009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "songs/song_0010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "songs/song_0011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "songs/song_0012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "songs/song_0013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "songs/song_0014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "songs/song_0015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "songs/song_0016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "songs/song_0017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "songs/song_0018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "songs/song_0019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "songs/song_0020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "songs/song_0021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "songs/song_0022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "songs/song_0023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00001",
"audio_path": "songs/song_0001.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.75,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "songs/song_0002.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.365,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "songs/song_0005.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.186,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "songs/song_0007.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.499,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "songs/song_0008.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.204,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "songs/song_0010.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.058,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "songs/song_0012.wav",
"duration": 5.0,
"type": "clean",
"offset": 9.572,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "songs/song_0014.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.475,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "songs/song_0015.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.071,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "songs/song_0016.wav",
"duration": 5.0,
"type": "clean",
"offset": 5.362,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "songs/song_0017.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.785,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "songs/song_0018.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.294,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "songs/song_0019.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.617,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "songs/song_0021.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.279,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "songs/song_0022.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.798,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "songs/song_0023.wav",
"duration": 5.0,
"type": "clean",
"offset": 1.01,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "songs/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "songs/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "songs/song_0002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "songs/song_0003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "songs/song_0004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "songs/song_0005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "songs/song_0006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "songs/song_0007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "songs/song_0008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "songs/song_0009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "songs/song_0010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "songs/song_0011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "songs/song_0012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "songs/song_0013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "songs/song_0014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "songs/song_0015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "songs/song_0016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "songs/song_0017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "songs/song_0018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "songs/song_0019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "songs/song_0020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "songs/song_0021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "songs/song_0022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "songs/song_0023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
......@@ -25,6 +25,31 @@
- Readers no longer have to face a long flat list of file names first
- Relative paths are now better suited to direct navigation
### Stage: Single-page open dataset workflow
Completed:
- Added [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- Condensed the open dataset ingestion flow into:
  - `inspect-local / inspect-batch`
  - `prepare-local`
  - `validate-local`
- Linked the workflow under the "Data and Evaluation" group in [docs/README.md](./README.md)
Verification results:
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/synthetic_v2/songs --eval-ratio 0.2 --query-duration 5.0` succeeded
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/synthetic_v2/songs --output-root data/external_ingested/synthetic_as_open --eval-ratio 0.2 --query-duration 5.0` succeeded
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/synthetic_as_open/fma/manifests` succeeded
- Current results:
  - `num_audio_files=24`
  - `catalog=24`
  - `train_queries=16`
  - `test_queries=8`
  - `ok=true`
Conclusion:
- The open dataset ingestion path is now condensed into a single-page executable workflow
- Onboarding will cost less when real FMA / MTG-Jamendo local directories are connected later
### Stage: confused targeted optimization v6 (sample-level weighting)
Completed:
......
......@@ -59,6 +59,7 @@ flowchart TD
### B. Data and Evaluation
- [Dataset specification](./dataset-spec.md)
- [Open dataset workflow](./open-dataset-workflow.md)
- [Dataset sources and ingestion](./dataset-sources-and-licensing.md)
- [Industrial benchmark specification](./industrial-benchmark-spec.md)
......
......@@ -18,6 +18,9 @@
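- CCMusic / ModelScope: use primarily as a supplementary evaluation or exploration source
- Keep the license notes, but no longer treat "commercial-use blocking" as the main blocker for personal experiments
Recommended reading first:
- [Open dataset workflow](./open-dataset-workflow.md)
Suggested ingestion order:
1. Download or prepare a local audio directory for FMA or MTG-Jamendo
2. Run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` / `inspect-batch`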
......
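# Open Dataset Workflow
> Updated: 2026-06-02
## One-page summary
To bring an open music directory such as FMA or MTG-Jamendo into the project, the one chain worth remembering is:
1. **inspect-local / inspect-batch**
2. **prepare-local**
3. **validate-local**
4. Then move on to training and evaluation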
---
## 1. Workflow diagram
```mermaid
flowchart LR
A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
B --> C[prepare-local]
C --> D[validate-local]
D --> E[train.json]
D --> F[test.json]
```
---
## 2. Minimal command table
| Step | Command | Purpose |
|---|---|---|
| Pre-check | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `inspect-local ...` | Check whether the directory is large enough |
| Batch comparison | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `inspect-batch ...` | Compare multiple candidate directories |
| Manifest generation | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `prepare-local ...` | Produce the train/test/catalog manifests |
| Pre-training validation | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `validate-local ...` | Confirm the structure is correct |
---
## 3. Recommended order
### 3.1 Single directory
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
```
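For scripted runs, the same three steps can be assembled as argument lists. This is a minimal sketch: the interpreter path, script path, subcommands, and flags are copied from the commands above, while `build_single_dir_commands` is a hypothetical helper and the `<output_root>/<dataset>/manifests` layout is an assumption inferred from the example paths.

```python
import shlex

PYTHON = "/usr/local/miniconda3/bin/python"  # interpreter path used in the commands above
ADAPTER = "src/data/external_adapters.py"


def build_single_dir_commands(dataset, audio_dir,
                              output_root="data/external_ingested",
                              eval_ratio=0.2, query_duration=8.0):
    """Hypothetical helper: return the inspect, prepare, validate argv lists in order."""
    split_args = ["--eval-ratio", str(eval_ratio),
                  "--query-duration", str(query_duration)]
    # Assumption: manifests land in <output_root>/<dataset>/manifests,
    # as in the example paths of this document.
    manifest_dir = f"{output_root}/{dataset}/manifests"
    return [
        [PYTHON, ADAPTER, "inspect-local", dataset, audio_dir, *split_args],
        [PYTHON, ADAPTER, "prepare-local", dataset, audio_dir,
         "--output-root", output_root, *split_args],
        [PYTHON, ADAPTER, "validate-local", dataset, manifest_dir],
    ]


commands = build_single_dir_commands("fma", "data/raw/fma_small_audio")
for argv in commands:
    print(shlex.join(argv))
```

Passing each list to `subprocess.run(argv, check=True)` would stop the chain at the first failing step, so `prepare-local` never runs on a directory that failed inspection.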
### 3.2 Comparing multiple directories
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
```
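When the candidate directories live in a config or a dict, the `name=path` arguments can be generated instead of typed by hand. A small sketch, assuming only the argument shape shown in the command above; `build_inspect_batch_command` is a hypothetical name, not part of `external_adapters.py`.

```python
def build_inspect_batch_command(candidates, eval_ratio=0.2, query_duration=8.0,
                                python="/usr/local/miniconda3/bin/python"):
    """Hypothetical helper: format a {dataset_name: audio_dir} mapping for inspect-batch."""
    # Each candidate becomes one positional "name=path" argument.
    pairs = [f"{name}={path}" for name, path in candidates.items()]
    return [python, "src/data/external_adapters.py", "inspect-batch", *pairs,
            "--eval-ratio", str(eval_ratio),
            "--query-duration", str(query_duration)]


batch_command = build_inspect_batch_command({
    "fma": "data/raw/fma_small_audio",
    "mtg_jamendo": "data/raw/mtg_jamendo_audio",
})
print(" ".join(batch_command))
```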
---
## 4. Output files
- [catalog.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json): reference manifest used to build the index
- [train.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json): training queries + references
- [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json): fixed evaluation queries + references
- [val.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json): optional validation set
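Because `train.json` and `test.json` mix query rows and reference rows in one array, a consumer has to separate them. The sketch below does that on an inline sample shaped like the manifests committed here: reference rows carry `"type": "reference"`, and query rows carry an `offset` and `segment_type`. Treating every non-reference row as a query is an assumption, and `split_manifest` is a hypothetical helper.

```python
import json

# Inline sample copied from the shape of the committed manifests.
SAMPLE_MANIFEST = json.loads("""
[
  {"song_id": "fma_00000", "audio_path": "songs/song_0000.wav", "duration": 5.0,
   "type": "clean", "offset": 6.394, "segment_type": "external_query",
   "source_dataset": "fma"},
  {"song_id": "fma_00000", "audio_path": "songs/song_0000.wav", "duration": 15.0,
   "type": "reference", "source_dataset": "fma"},
  {"song_id": "fma_00001", "audio_path": "songs/song_0001.wav", "duration": 15.0,
   "type": "reference", "source_dataset": "fma"}
]
""")


def split_manifest(entries):
    """Hypothetical helper: return (queries, references) from one manifest array."""
    references = [e for e in entries if e["type"] == "reference"]
    # Assumption: anything that is not a reference row is a query row.
    queries = [e for e in entries if e["type"] != "reference"]
    return queries, references


queries, references = split_manifest(SAMPLE_MANIFEST)
print(len(queries), "queries,", len(references), "references")
```

For a real file, the same function would be applied to the result of `json.load` on the manifest path.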
---
## 5. Current verification evidence
The open dataset workflow has been run end to end on the local `data/synthetic_v2/songs` directory:
- `inspect-local`
  - `num_audio_files=24`
  - `recommended_train_queries=19`
  - `recommended_test_queries=5`
- `prepare-local`
  - `catalog=24`
  - `train_queries=16`
  - `test_queries=8`
- `validate-local`
  - `ok=true`
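The kind of structural check this step implies can be sketched independently of the tool. The code below is not the implementation of `validate-local`; it is an illustrative set of assumed checks that the committed manifests happen to satisfy: every queried song exists in the catalog, each query window (`offset + duration`) fits inside the reference duration, and no song is queried in both splits. All function names are hypothetical.

```python
def check_manifests(catalog, train, test):
    """Return a list of problems; an empty list means these assumed checks passed."""
    problems = []
    ref_duration = {entry["song_id"]: entry["duration"] for entry in catalog}

    def collect_query_ids(entries, split):
        ids = set()
        for entry in entries:
            if entry["type"] == "reference":
                continue  # only query rows carry an offset
            song_id = entry["song_id"]
            ids.add(song_id)
            if song_id not in ref_duration:
                problems.append(f"{split}: {song_id} missing from catalog")
            elif entry["offset"] + entry["duration"] > ref_duration[song_id]:
                problems.append(f"{split}: {song_id} query window exceeds reference")
        return ids

    overlap = collect_query_ids(train, "train") & collect_query_ids(test, "test")
    if overlap:
        problems.append(f"songs queried in both splits: {sorted(overlap)}")
    return problems


def ref(song_id):
    return {"song_id": song_id, "duration": 15.0, "type": "reference"}


def query(song_id, offset):
    return {"song_id": song_id, "duration": 5.0, "type": "clean", "offset": offset}


catalog = [ref("fma_00000"), ref("fma_00001")]
train = [query("fma_00001", 2.75)] + catalog
test = [query("fma_00000", 6.394)] + catalog
problems = check_manifests(catalog, train, test)
# A 5 s query starting at 12 s cannot fit inside a 15 s reference.
bad = check_manifests(catalog, [query("fma_00001", 12.0)], test)
print(problems)
print(bad)
```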
---
## Sources
- See [dataset-spec.md](./dataset-spec.md)
- See [dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md)