Commit d75fbf81 d75fbf81a733df56cb4f099774ab52f4309a6915 by cnb.bofCdSsphPA

Add batch inventory for multiple open music directories

Constraint: Personal-use dataset preparation needs fast comparison across several local open-music corpora before ingestion
Rejected: Inspect each dataset directory manually one by one | Slows repeated train/eval setup and comparison
Confidence: high
Scope-risk: narrow
Directive: Use inspect-batch on real FMA and MTG-Jamendo folders before selecting training and held-out evaluation corpora
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py; /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=tmp/open_music_demo_fma mtg_jamendo=tmp/open_music_demo_jamendo --eval-ratio 0.5 --query-duration 5.0
Not-tested: Real upstream corpus inventory on downloaded full-size open datasets
1 parent c734a31e
......@@ -19,6 +19,11 @@ Optional pre-check:
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
```
Batch pre-check across multiple candidate corpora:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
```
Then generate manifests:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
......
......@@ -194,6 +194,21 @@ def write_registry(output_path: str):
    return out


def inspect_batch(pairs: List[str], eval_ratio: float, query_duration: float) -> Dict:
    results = []
    for pair in pairs:
        dataset, input_dir = pair.split("=", 1)
        if dataset not in ADAPTERS:
            raise SystemExit(f"Unknown dataset adapter: {dataset}")
        summary = ADAPTERS[dataset].inspect_local_audio(
            Path(input_dir),
            eval_ratio=eval_ratio,
            query_duration=query_duration,
        )
        results.append(summary)
    return {"datasets": results, "count": len(results)}


def main():
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="cmd", required=True)
......@@ -222,6 +237,11 @@ def main():
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)

    p = sub.add_parser("inspect-batch")
    p.add_argument("pairs", nargs="+", help="dataset=input_dir")
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)

    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
......@@ -248,6 +268,9 @@ def main():
            query_duration=args.query_duration,
        )
        print(json.dumps(summary, indent=2, ensure_ascii=False))
    elif args.cmd == "inspect-batch":
        summary = inspect_batch(args.pairs, args.eval_ratio, args.query_duration)
        print(json.dumps(summary, indent=2, ensure_ascii=False))


if __name__ == "__main__":
......
......@@ -147,6 +147,26 @@
- Real FMA / MTG-Jamendo directories can now be sized up before ingestion
- This speeds up dataset preparation for personal use
### Stage: Multi-directory batch inventory (inspect-batch)

Completed:
- Extended `src/data/external_adapters.py`
- Added `inspect-batch`
- Supports inspecting multiple open-dataset directories in one invocation, for example:
  - `fma=<dir>`
  - `mtg_jamendo=<dir>`

Verification:
- `python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py` succeeded
- `python src/data/external_adapters.py inspect-batch fma=tmp/open_music_demo_fma mtg_jamendo=tmp/open_music_demo_jamendo --eval-ratio 0.5 --query-duration 5.0` succeeded
- Returned:
  - `count=2`
  - Each data source reported `num_audio_files / eligible_query_files / recommended_train_queries / recommended_test_queries`

Conclusion:
- The usable size of multiple candidate open-dataset directories can now be compared in one batch
- This makes onboarding real FMA / MTG-Jamendo / other music collections more efficient
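
To compare the per-corpus summaries side by side, the batch output could be post-processed along these lines. This is a minimal sketch and not part of the commit: `rank_corpora` and the sample numbers are hypothetical. It assumes only that the output has a `datasets` list in the same order as the `dataset=<dir>` arguments and that each entry carries the four counters listed above.

```python
from typing import Dict, List, Tuple


def rank_corpora(
    labels: List[str],
    batch_summary: Dict,
    key: str = "eligible_query_files",
) -> List[Tuple[str, int]]:
    """Pair each label with its summary (same order as the CLI pairs)
    and sort by the chosen counter, largest first."""
    datasets = batch_summary["datasets"]
    if len(labels) != len(datasets):
        raise ValueError("labels and batch_summary['datasets'] differ in length")
    rows = [(label, summary.get(key, 0)) for label, summary in zip(labels, datasets)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative placeholder values only; these are NOT real inspect-batch results.
example = {
    "datasets": [
        {
            "num_audio_files": 10,
            "eligible_query_files": 8,
            "recommended_train_queries": 4,
            "recommended_test_queries": 4,
        },
        {
            "num_audio_files": 15,
            "eligible_query_files": 12,
            "recommended_train_queries": 6,
            "recommended_test_queries": 6,
        },
    ],
    "count": 2,
}

ranking = rank_corpora(["fma", "mtg_jamendo"], example)
for label, value in ranking:
    print(f"{label}: {value}")
```

Passing the labels explicitly avoids relying on any name field inside each summary, since these notes do not confirm that one exists.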
## 2026-06-02
### Stage: Documentation completion + minimal runnable ACR pipeline
......
......@@ -146,6 +146,7 @@ CLI entry points:
- Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits`
- High-level unified entry point: `src/data/external_adapters.py prepare-local <dataset> <input_dir>`
- Pre-ingestion check: `src/data/external_adapters.py inspect-local <dataset> <input_dir>`
- Multi-directory batch pre-check: `src/data/external_adapters.py inspect-batch fma=<dir> mtg_jamendo=<dir> ...`
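
Because `inspect-batch` takes its targets as `dataset=<dir>` tokens, a caller may want to validate them before invoking the CLI. The following is a minimal sketch under assumptions: `parse_pairs` is a hypothetical helper, not a function in `external_adapters.py`, and the known adapter names are passed in explicitly rather than read from the real adapter registry.

```python
from typing import Iterable, List, Tuple


def parse_pairs(pairs: Iterable[str], known: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'dataset=input_dir' token on the first '=' and validate it."""
    known_set = set(known)
    parsed = []
    for pair in pairs:
        # partition() splits on the first '=', so directory names may contain '='.
        dataset, sep, input_dir = pair.partition("=")
        if not sep or not dataset or not input_dir:
            raise ValueError(f"Expected dataset=input_dir, got: {pair!r}")
        if dataset not in known_set:
            raise ValueError(f"Unknown dataset adapter: {dataset}")
        parsed.append((dataset, input_dir))
    return parsed


pairs = parse_pairs(
    ["fma=data/raw/fma_small_audio", "mtg_jamendo=data/raw/mtg_jamendo_audio"],
    known=["fma", "mtg_jamendo"],
)
print(pairs)
```

Checking for a missing `=` up front gives a clearer message than the unpacking error that a bare `pair.split("=", 1)` would raise on a token without a separator.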
## 5. Text description
......