Commit fb1d00b6 fb1d00b69c8f8285193b0160a582d922bfa342f3 by cnb.bofCdSsphPA

Unify open dataset preparation behind adapter commands

Constraint: Personal-use experimentation needs a single entrypoint from local open-audio directories to train/eval manifests
Rejected: Separate manual manifest generation per dataset | Too error-prone and slows iterative training/evaluation
Confidence: high
Scope-risk: narrow
Directive: Point real FMA or MTG-Jamendo local download folders at prepare-local before expanding training runs
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py; /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma tmp/open_music_demo --output-root data/external_ingested/demo_via_adapter --eval-ratio 0.5 --query-duration 5.0
Not-tested: Full upstream corpus import and large-scale training
1 parent 167aa6e5
[
{
"song_id": "fma_00000",
"audio_path": "open_music_demo/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "open_music_demo/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00000",
"audio_path": "open_music_demo/song_0000.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.394,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "open_music_demo/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "open_music_demo/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00001",
"audio_path": "open_music_demo/song_0001.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.75,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "open_music_demo/song_0000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "open_music_demo/song_0001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
......@@ -7,6 +7,7 @@ from pathlib import Path
from typing import Dict, List
import argparse
import json
import subprocess
@dataclass
......@@ -42,6 +43,36 @@ class BaseAdapter:
json.dump(manifest, f, indent=2, ensure_ascii=False)
return manifest
    def prepare_local_audio(
        self,
        input_dir: Path,
        output_root: Path,
        eval_ratio: float = 0.2,
        query_duration: float = 8.0,
        seed: int = 42,
    ) -> Dict:
        """Split a local audio directory into manifests via manifest_tools.py."""
        output_root.mkdir(parents=True, exist_ok=True)
        # Delegate to the low-level CLI; note that the interpreter path is hard-coded.
        cmd = [
            "/usr/local/miniconda3/bin/python",
            "src/data/manifest_tools.py",
            "audio-dir-to-splits",
            str(input_dir),
            str(output_root),
            "--source-dataset",
            self.name,
            "--eval-ratio",
            str(eval_ratio),
            "--query-duration",
            str(query_duration),
            "--seed",
            str(seed),
        ]
        # The tool's stdout is parsed as a JSON summary; a non-zero exit raises CalledProcessError.
        result = subprocess.check_output(cmd, text=True)
        summary = json.loads(result)
        summary["input_dir"] = str(input_dir)
        summary["dataset"] = self.name
        return summary
class FMAAdapter(BaseAdapter):
name = "fma"
......@@ -156,6 +187,14 @@ def main():
    p = sub.add_parser("describe")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p = sub.add_parser("prepare-local")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("input_dir")
    p.add_argument("--output-root", default="data/external_ingested")
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
......@@ -165,6 +204,16 @@ def main():
        print(json.dumps(ADAPTERS[args.dataset].init_layout(root), indent=2, ensure_ascii=False))
    elif args.cmd == "describe":
        print(json.dumps(ADAPTERS[args.dataset].describe(), indent=2, ensure_ascii=False))
    elif args.cmd == "prepare-local":
        root = Path(args.output_root) / args.dataset
        summary = ADAPTERS[args.dataset].prepare_local_audio(
            Path(args.input_dir),
            root,
            eval_ratio=args.eval_ratio,
            query_duration=args.query_duration,
            seed=args.seed,
        )
        print(json.dumps(summary, indent=2, ensure_ascii=False))
if __name__ == "__main__":
......
......@@ -97,6 +97,29 @@
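- An open-source music dataset directory can now be split directly into training/evaluation inputs
- The next stage can move on to wiring up real FMA / MTG-Jamendo download directories
### Stage: adapter-level ingestion of local open-audio directories
Completed:
- Extended `src/data/external_adapters.py`
- Added the `prepare-local` command
- The adapter entrypoint can now convert a local open-audio directory directly into:
  - `catalog.json`
  - `train.json`
  - `test.json`
  - `val.json`
Verification:
- `python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py` succeeded
- `python src/data/external_adapters.py prepare-local fma tmp/open_music_demo --output-root data/external_ingested/demo_via_adapter --eval-ratio 0.5 --query-duration 5.0` succeeded
- Output:
  - `catalog=2`
  - `train_queries=1`
  - `test_queries=1`
Conclusion:
- Real FMA / MTG-Jamendo directories can now be ingested without assembling manifests by hand
- The adapter now serves as the single entrypoint for managing the train/eval splits of open datasets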
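The reported counts can be cross-checked against the manifests themselves. The following is a minimal sketch, assuming that query entries are the ones carrying `segment_type == "external_query"` (as in the demo manifests in this commit) and that catalog entries have `type == "reference"`. `summarize` is a hypothetical helper and is not part of `manifest_tools.py`.

```python
from collections import Counter
from typing import Dict, List


def summarize(manifest: List[Dict]) -> Dict[str, int]:
    """Count query and reference entries in one manifest (hypothetical helper)."""
    counts = Counter(
        "query" if entry.get("segment_type") == "external_query" else entry.get("type", "unknown")
        for entry in manifest
    )
    return {"queries": counts["query"], "references": counts["reference"]}


# Entries abbreviated from the demo manifests shown in this commit.
reference_entries = [
    {"song_id": "fma_00000", "type": "reference", "duration": 15.0},
    {"song_id": "fma_00001", "type": "reference", "duration": 15.0},
]
query_split = [
    {"song_id": "fma_00000", "type": "clean", "duration": 5.0,
     "offset": 6.394, "segment_type": "external_query"},
] + reference_entries

print(summarize(reference_entries))
print(summarize(query_split))
```

A split manifest in this layout carries its single query alongside both reference entries, so the query count and the catalog size can be read from the same file.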
## 2026-06-02
### Stage: documentation completion + minimal runnable ACR pipeline
......
......@@ -18,6 +18,12 @@
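- CCMusic / ModelScope: use primarily as a supplementary evaluation or exploration source
- Keep the license notes, but stop treating the "commercial-use blocker" as the main blocker for personal experiments
Recommended ingestion order:
1. Download/prepare a local audio directory for FMA or MTG-Jamendo
2. Run `external_adapters.py prepare-local ...`
3. Generate the `catalog/train/test/val` manifests
4. Use `train.json` for training and `test.json` for fixed evaluation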
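Step 4 could look roughly like the sketch below once the manifests exist. It assumes the four file names listed for `prepare-local` and treats every non-`reference` entry as a query. `load_splits` and `queries_of` are hypothetical helpers, and the directory in the usage lines is only an example path.

```python
import json
from pathlib import Path
from typing import Dict, List

SPLIT_FILES = ("catalog.json", "train.json", "test.json", "val.json")


def load_splits(root: Path) -> Dict[str, List[dict]]:
    """Load whichever of the four manifests exist under `root` (hypothetical helper)."""
    splits: Dict[str, List[dict]] = {}
    for name in SPLIT_FILES:
        path = root / name
        if path.exists():
            with path.open(encoding="utf-8") as f:
                splits[path.stem] = json.load(f)
    return splits


def queries_of(manifest: List[dict]) -> List[dict]:
    """Keep only query entries; assumes every non-reference entry is a query."""
    return [entry for entry in manifest if entry.get("type") != "reference"]


# Example path only; missing files are skipped, so this is safe to run anywhere.
splits = load_splits(Path("data/external_ingested/fma"))
train_queries = queries_of(splits.get("train", []))
test_queries = queries_of(splits.get("test", []))
print(len(train_queries), len(test_queries))
```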
---
## 1. Source tiering diagram
......
......@@ -142,6 +142,10 @@ flowchart LR
- Pin at least some tracks to `test.json` only; they must not also take part in training
- Even a small dataset must yield at least 1 train query + 1 test query
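Both rules can be checked mechanically before a training run. The sketch below takes the stricter reading that no song may supply queries to both splits, and it compares only query entries, on the assumption that reference entries legitimately appear in every manifest. `check_splits` is a hypothetical helper.

```python
from typing import Dict, List, Set


def query_song_ids(manifest: List[Dict]) -> Set[str]:
    """Song ids of query entries; assumes every non-reference entry is a query."""
    return {entry["song_id"] for entry in manifest if entry.get("type") != "reference"}


def check_splits(train: List[Dict], test: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the splits pass."""
    problems = []
    train_ids = query_song_ids(train)
    test_ids = query_song_ids(test)
    if not train_ids:
        problems.append("train split has no query")
    if not test_ids:
        problems.append("test split has no query")
    leaked = sorted(train_ids & test_ids)
    if leaked:
        problems.append(f"songs queried in both train and test: {leaked}")
    return problems


# Abbreviated from the demo manifests in this commit.
refs = [{"song_id": "fma_00000", "type": "reference"},
        {"song_id": "fma_00001", "type": "reference"}]
split_a = [{"song_id": "fma_00000", "type": "clean"}] + refs
split_b = [{"song_id": "fma_00001", "type": "clean"}] + refs
print(check_splits(split_a, split_b))
```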
CLI entrypoints:
- Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits`
- High-level unified entrypoint: `src/data/external_adapters.py prepare-local <dataset> <input_dir>`
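To see how the high-level entrypoint resolves its arguments, the sketch below rebuilds only the `prepare-local` sub-parser from the diff and parses the command used for verification. The `ADAPTERS` dict here is a stand-in with a single key, and `required=True` on the sub-parsers is an assumption; the real parser setup is not shown in the hunk.

```python
import argparse
from pathlib import Path

# Stand-in for the real registry; only the "fma" key is used in this example.
ADAPTERS = {"fma": None}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="external_adapters.py")
    sub = parser.add_subparsers(dest="cmd", required=True)
    p = sub.add_parser("prepare-local")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("input_dir")
    p.add_argument("--output-root", default="data/external_ingested")
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--seed", type=int, default=42)
    return parser


args = build_parser().parse_args([
    "prepare-local", "fma", "tmp/open_music_demo",
    "--output-root", "data/external_ingested/demo_via_adapter",
    "--eval-ratio", "0.5", "--query-duration", "5.0",
])
# As in main(), the dataset name is appended to the output root.
root = Path(args.output_root) / args.dataset
print(root.as_posix(), args.eval_ratio, args.query_duration, args.seed)
```

Because the dataset name is appended to `--output-root`, the verification command writes its manifests under `data/external_ingested/demo_via_adapter/fma`.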
## 5. Explanatory notes
### 5.1 Why catalog and query must be kept separate
......