Unify open dataset preparation behind adapter commands

Constraint: Personal-use experimentation needs a single entrypoint from local open-audio directories to train/eval manifests
Rejected: Separate manual manifest generation per dataset | Too error-prone and slows iterative training/evaluation
Confidence: high
Scope-risk: narrow
Directive: Point real FMA or MTG-Jamendo local download folders at prepare-local before expanding training runs
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py; /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma tmp/open_music_demo --output-root data/external_ingested/demo_via_adapter --eval-ratio 0.5 --query-duration 5.0
Not-tested: Full upstream corpus import and large-scale training

Commit fb1d00b6 ... fb1d00b69c8f8285193b0160a582d922bfa342f3 authored 2026-06-02 12:40:45 +0800 by cnb.bofCdSsphPA
Showing 8 changed files with 149 additions and 0 deletions
acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json
acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json
acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json
acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json
acr-engine/src/data/external_adapters.py
docs/CHANGELOG.md
docs/dataset-sources-and-licensing.md
docs/dataset-spec.md
--- a/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json 0 → 100644
+++ b/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json 0 → 100644
+[
+  {
+    "song_id": "fma_00000",
+    "audio_path": "open_music_demo/song_0000.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  },
+  {
+    "song_id": "fma_00001",
+    "audio_path": "open_music_demo/song_0001.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  }
+]
\ No newline at end of file
--- a/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json 0 → 100644
+++ b/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json 0 → 100644
+[
+  {
+    "song_id": "fma_00000",
+    "audio_path": "open_music_demo/song_0000.wav",
+    "duration": 5.0,
+    "type": "clean",
+    "offset": 6.394,
+    "segment_type": "external_query",
+    "source_dataset": "fma"
+  },
+  {
+    "song_id": "fma_00000",
+    "audio_path": "open_music_demo/song_0000.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  },
+  {
+    "song_id": "fma_00001",
+    "audio_path": "open_music_demo/song_0001.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  }
+]
\ No newline at end of file
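The query entry in the test.json hunk above carries an `offset` and a 5.0 `duration`, while the reference entry for the same `song_id` carries the full 15.0 `duration`. A minimal sketch of a sanity check that every query window fits inside its reference track follows. It assumes `offset` is the window start in seconds and `duration` the window length in seconds, which the commit does not state; `check_query_windows` is a hypothetical helper, not part of this change.

```python
from typing import Dict, List


def check_query_windows(manifest: List[Dict]) -> List[str]:
    """Return song_ids whose query window runs past the reference duration.

    Assumption (not confirmed by the commit): `offset` is the window start in
    seconds and `duration` is the window length in seconds.
    """
    # Full-track durations come from the "reference" entries.
    ref_duration = {
        e["song_id"]: e["duration"] for e in manifest if e["type"] == "reference"
    }
    bad = []
    for e in manifest:
        if e["type"] == "reference":
            continue
        end = e["offset"] + e["duration"]
        # A query with no matching reference entry is reported as well.
        if e["song_id"] not in ref_duration or end > ref_duration[e["song_id"]]:
            bad.append(e["song_id"])
    return bad


# Entries copied from the test.json hunk (trimmed to the fields used here).
test_manifest = [
    {"song_id": "fma_00000", "duration": 5.0, "type": "clean", "offset": 6.394},
    {"song_id": "fma_00000", "duration": 15.0, "type": "reference"},
    {"song_id": "fma_00001", "duration": 15.0, "type": "reference"},
]

print(check_query_windows(test_manifest))  # prints []
```

For the demo entry the window ends at 6.394 + 5.0 = 11.394 seconds, inside the 15.0-second reference, so nothing is reported.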
--- a/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json 0 → 100644
+++ b/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json 0 → 100644
+[
+  {
+    "song_id": "fma_00001",
+    "audio_path": "open_music_demo/song_0001.wav",
+    "duration": 5.0,
+    "type": "clean",
+    "offset": 2.75,
+    "segment_type": "external_query",
+    "source_dataset": "fma"
+  },
+  {
+    "song_id": "fma_00000",
+    "audio_path": "open_music_demo/song_0000.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  },
+  {
+    "song_id": "fma_00001",
+    "audio_path": "open_music_demo/song_0001.wav",
+    "duration": 15.0,
+    "type": "reference",
+    "source_dataset": "fma"
+  }
+]
\ No newline at end of file
--- a/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json 0 → 100644
+++ b/acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json 0 → 100644
+[]
\ No newline at end of file
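Across the manifests above, the demo split puts the `fma_00001` query in train.json and the `fma_00000` query in test.json, while both files list every reference track. A hedged sketch of a leakage check over the query entries is shown below. It treats any entry whose `type` is not `"reference"` as a query, which is an assumption drawn from the demo data; `query_song_ids` and `leaked_songs` are hypothetical names.

```python
from typing import Dict, List, Set


def query_song_ids(manifest: List[Dict]) -> Set[str]:
    """Collect song_ids of query entries (assumed: every non-reference entry)."""
    return {e["song_id"] for e in manifest if e["type"] != "reference"}


def leaked_songs(train: List[Dict], test: List[Dict]) -> Set[str]:
    """Return song_ids that are queried in both the train and the test split."""
    return query_song_ids(train) & query_song_ids(test)


# Entries copied from the train.json and test.json hunks (fields trimmed).
train = [
    {"song_id": "fma_00001", "type": "clean"},
    {"song_id": "fma_00000", "type": "reference"},
    {"song_id": "fma_00001", "type": "reference"},
]
test = [
    {"song_id": "fma_00000", "type": "clean"},
    {"song_id": "fma_00000", "type": "reference"},
    {"song_id": "fma_00001", "type": "reference"},
]

print(sorted(leaked_songs(train, test)))  # prints []
```

Reference entries are deliberately ignored here, since the generated train.json and test.json both carry the full catalog.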
--- a/acr-engine/src/data/external_adapters.py
+++ b/acr-engine/src/data/external_adapters.py
@@ -7,6 +7,7 @@ from pathlib import Path
 from typing import Dict, List
 import argparse
 import json
+import subprocess


 @dataclass
@@ -42,6 +43,36 @@ class BaseAdapter:
            json.dump(manifest, f, indent=2, ensure_ascii=False)
        return manifest

+    def prepare_local_audio(
+        self,
+        input_dir: Path,
+        output_root: Path,
+        eval_ratio: float = 0.2,
+        query_duration: float = 8.0,
+        seed: int = 42,
+    ) -> Dict:
+        output_root.mkdir(parents=True, exist_ok=True)
+        cmd = [
+            "/usr/local/miniconda3/bin/python",
+            "src/data/manifest_tools.py",
+            "audio-dir-to-splits",
+            str(input_dir),
+            str(output_root),
+            "--source-dataset",
+            self.name,
+            "--eval-ratio",
+            str(eval_ratio),
+            "--query-duration",
+            str(query_duration),
+            "--seed",
+            str(seed),
+        ]
+        result = subprocess.check_output(cmd, text=True)
+        summary = json.loads(result)
+        summary["input_dir"] = str(input_dir)
+        summary["dataset"] = self.name
+        return summary
+

 class FMAAdapter(BaseAdapter):
    name = "fma"
@@ -156,6 +187,14 @@ def main():
    p = sub.add_parser("describe")
    p.add_argument("dataset", choices=sorted(ADAPTERS))

+    p = sub.add_parser("prepare-local")
+    p.add_argument("dataset", choices=sorted(ADAPTERS))
+    p.add_argument("input_dir")
+    p.add_argument("--output-root", default="data/external_ingested")
+    p.add_argument("--eval-ratio", type=float, default=0.2)
+    p.add_argument("--query-duration", type=float, default=8.0)
+    p.add_argument("--seed", type=int, default=42)
+
    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
@@ -165,6 +204,16 @@ def main():
        print(json.dumps(ADAPTERS[args.dataset].init_layout(root), indent=2, ensure_ascii=False))
    elif args.cmd == "describe":
        print(json.dumps(ADAPTERS[args.dataset].describe(), indent=2, ensure_ascii=False))
+    elif args.cmd == "prepare-local":
+        root = Path(args.output_root) / args.dataset
+        summary = ADAPTERS[args.dataset].prepare_local_audio(
+            Path(args.input_dir),
+            root,
+            eval_ratio=args.eval_ratio,
+            query_duration=args.query_duration,
+            seed=args.seed,
+        )
+        print(json.dumps(summary, indent=2, ensure_ascii=False))


 if __name__ == "__main__":
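`prepare_local_audio` in the hunk above shells out with a fixed interpreter path (`/usr/local/miniconda3/bin/python`) and a script path relative to the working directory, so it presumably only runs on a machine with that conda layout and from the `acr-engine` root. One possible variant, offered as a sketch and not as the commit's code, builds the same argument list from `sys.executable` and an explicit tools directory. `build_split_command` and its `tools_dir` parameter are hypothetical; the subcommand and flags are the ones shown in the diff.

```python
import sys
from pathlib import Path
from typing import List


def build_split_command(
    tools_dir: Path,
    input_dir: Path,
    output_root: Path,
    source_dataset: str,
    eval_ratio: float = 0.2,
    query_duration: float = 8.0,
    seed: int = 42,
) -> List[str]:
    """Build the audio-dir-to-splits argv without a hard-coded interpreter."""
    return [
        sys.executable,  # the interpreter running the current process
        str(tools_dir / "manifest_tools.py"),
        "audio-dir-to-splits",
        str(input_dir),
        str(output_root),
        "--source-dataset", source_dataset,
        "--eval-ratio", str(eval_ratio),
        "--query-duration", str(query_duration),
        "--seed", str(seed),
    ]


# Same arguments as the "Tested:" invocation in the commit message.
cmd = build_split_command(
    tools_dir=Path("src/data"),
    input_dir=Path("tmp/open_music_demo"),
    output_root=Path("data/external_ingested/demo_via_adapter/fma"),
    source_dataset="fma",
    eval_ratio=0.5,
    query_duration=5.0,
)
print(cmd[1:])
```

Inside the adapter, `tools_dir` could be `Path(__file__).resolve().parent`, assuming `manifest_tools.py` stays next to `external_adapters.py` as the paths in this diff suggest. The resulting list can be passed to `subprocess.check_output(cmd, text=True)` exactly as the committed method does.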
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -97,6 +97,29 @@
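 - Open-source music data directories can now be split directly into training/evaluation inputs
 - The next stage can move on to real FMA / MTG-Jamendo download directories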

+### Stage: adapter-level ingestion of local open-dataset directories
+
+Completed:
+- Extended `src/data/external_adapters.py`
+- Added the `prepare-local` command
+- A local open-audio directory can now be converted through the adapter entrypoint directly into:
+  - `catalog.json`
+  - `train.json`
+  - `test.json`
+  - `val.json`
+
+Verification:
+- `python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py` succeeded
+- `python src/data/external_adapters.py prepare-local fma tmp/open_music_demo --output-root data/external_ingested/demo_via_adapter --eval-ratio 0.5 --query-duration 5.0` succeeded
+- Output:
+  - `catalog=2`
+  - `train_queries=1`
+  - `test_queries=1`
+
+Conclusion:
+- Ingesting a real FMA / MTG-Jamendo directory no longer requires assembling manifests by hand
+- The adapter now serves as the single entrypoint for managing train/eval splits of open datasets
+
 ## 2026-06-02

 ### Stage: documentation completion + minimal runnable ACR pipeline
--- a/docs/dataset-sources-and-licensing.md
+++ b/docs/dataset-sources-and-licensing.md
@@ -18,6 +18,12 @@
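 - CCMusic / ModelScope: use primarily as a supplementary evaluation or exploration source
 - Keep the license annotations, but stop treating the "commercial-use blocker" as the main blocker for personal experiments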

+Recommended ingestion order:
+1. Download/prepare a local audio directory for FMA or MTG-Jamendo
+2. Run `external_adapters.py prepare-local ...`
+3. Generate the `catalog/train/test/val` manifests
+4. Use `train.json` for training and `test.json` for fixed evaluation
+
 ---

 ## 1. Source tier diagram
--- a/docs/dataset-spec.md
+++ b/docs/dataset-spec.md
@@ -142,6 +142,10 @@ flowchart LR
 - Reserve at least some tracks for `test.json` only; they must not also take part in training
 - Even a small dataset must yield at least 1 train query + 1 test query

+CLI entrypoints:
+- Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits`
+- High-level unified entrypoint: `src/data/external_adapters.py prepare-local <dataset> <input_dir>`
+
 ## 5. Written explanation

 ### 5.1 Why catalog and query must be kept separate