Commit c734a31e c734a31eeaa91c64e491d2601b3161043045fefb by cnb.bofCdSsphPA

Add open-dataset inventory checks before ingestion

Constraint: Personal-use dataset setup needs quick scale visibility before generating train/eval manifests
Rejected: Generate splits blindly | Hides whether a local corpus is large enough for meaningful train/test separation
Confidence: high
Scope-risk: narrow
Directive: Run inspect-local on real FMA or MTG-Jamendo folders before prepare-local and training
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/manifest_tools.py src/data/external_adapters.py; /usr/local/miniconda3/bin/python src/data/manifest_tools.py inspect-audio-dir tmp/open_music_demo --query-duration 5.0 --eval-ratio 0.5; /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma tmp/open_music_demo --eval-ratio 0.5 --query-duration 5.0
Not-tested: Real large external corpus inventory on downloaded FMA or MTG-Jamendo directories
1 parent fb1d00b6
# External Open-Music Ingestion
## Goal
Convert local open-music audio folders into ACR-ready manifests for:
- training queries
- evaluation queries
- reference catalog indexing
## Recommended personal-use flow
### 1. Prepare a local audio directory
Examples:
- `data/raw/fma_small_audio/`
- `data/raw/mtg_jamendo_audio/`
### 2. Generate manifests through the adapter entrypoint
Optional pre-check:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
```
Then generate manifests:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
```
or
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
```
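One way to make the optional pre-check actionable is to gate `prepare-local` on the inspection summary. The sketch below is a minimal illustration and is not part of the repository: `enough_queries`, `inspect_local`, and the `min_train` / `min_test` thresholds are hypothetical names. Only the `inspect-local` arguments and the `recommended_train_queries` / `recommended_test_queries` keys come from this commit. It assumes the script is run from the repository root under the project's interpreter, which is why it uses `sys.executable`.

```python
import json
import subprocess
import sys


def enough_queries(summary: dict, min_train: int = 1, min_test: int = 1) -> bool:
    """Return True when the inspection summary recommends enough queries.

    The thresholds are hypothetical defaults; tune them to your corpus.
    """
    return (
        summary.get("recommended_train_queries", 0) >= min_train
        and summary.get("recommended_test_queries", 0) >= min_test
    )


def inspect_local(dataset: str, input_dir: str, eval_ratio: float = 0.2,
                  query_duration: float = 8.0) -> dict:
    """Run the inspect-local subcommand and parse its JSON output."""
    cmd = [
        sys.executable,  # assumption: same interpreter as the project
        "src/data/external_adapters.py",
        "inspect-local", dataset, input_dir,
        "--eval-ratio", str(eval_ratio),
        "--query-duration", str(query_duration),
    ]
    return json.loads(subprocess.check_output(cmd, text=True))
```

A caller could then run `prepare-local` only when `enough_queries(inspect_local("fma", "data/raw/fma_small_audio"), min_train=50, min_test=10)` is true, with thresholds chosen for the experiment at hand.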
### 3. Use outputs
Generated files:
- `catalog.json`: reference tracks for indexing
- `train.json`: train queries + references
- `test.json`: held-out eval queries + references
- `val.json`: optional validation split
## Notes
- Small datasets are guarded automatically: with at least two eligible files, at least one query is held out so that neither the train nor the test query set ends up empty.
- For personal use, FMA and MTG-Jamendo should be the first real baselines.
- Keep `test.json` fixed across experiments to compare models fairly.
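To check that `test.json` really stays fixed between experiments, one option is to record a content hash once and compare it before each run. This is a sketch under assumptions: `manifest_fingerprint` and `assert_unchanged` are hypothetical helpers, not part of the repository, and the location of `test.json` depends on your `--output-root`.

```python
import hashlib
from pathlib import Path


def manifest_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a manifest file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_unchanged(path: Path, expected_hex: str) -> None:
    """Raise if the manifest no longer matches the recorded fingerprint."""
    actual = manifest_fingerprint(path)
    if actual != expected_hex:
        raise RuntimeError(f"{path} changed: expected {expected_hex}, got {actual}")
```

Storing the digest next to the experiment results makes it possible to tell later whether two runs were evaluated on the same held-out queries.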
......@@ -73,6 +73,27 @@ class BaseAdapter:
        summary["dataset"] = self.name
        return summary

    def inspect_local_audio(
        self,
        input_dir: Path,
        query_duration: float = 8.0,
        eval_ratio: float = 0.2,
    ) -> Dict:
        cmd = [
            "/usr/local/miniconda3/bin/python",
            "src/data/manifest_tools.py",
            "inspect-audio-dir",
            str(input_dir),
            "--query-duration",
            str(query_duration),
            "--eval-ratio",
            str(eval_ratio),
        ]
        result = subprocess.check_output(cmd, text=True)
        summary = json.loads(result)
        summary["dataset"] = self.name
        return summary


class FMAAdapter(BaseAdapter):
    name = "fma"
......@@ -195,6 +216,12 @@ def main():
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--seed", type=int, default=42)

    p = sub.add_parser("inspect-local")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("input_dir")
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)

    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
......@@ -214,6 +241,13 @@ def main():
            seed=args.seed,
        )
        print(json.dumps(summary, indent=2, ensure_ascii=False))
    elif args.cmd == "inspect-local":
        summary = ADAPTERS[args.dataset].inspect_local_audio(
            Path(args.input_dir),
            eval_ratio=args.eval_ratio,
            query_duration=args.query_duration,
        )
        print(json.dumps(summary, indent=2, ensure_ascii=False))


if __name__ == "__main__":
......
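`inspect_local_audio` in the hunk above shells out with a hard-coded interpreter path and a script path relative to the working directory, so it presumably works only when invoked from the repository root on a machine with that Miniconda install. A possible alternative is sketched below under assumptions: `build_inspect_cmd` is a hypothetical helper, and it assumes `manifest_tools.py` sits next to the adapter module, which matches the `src/data/` paths in this commit.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_inspect_cmd(
    input_dir: Path,
    query_duration: float = 8.0,
    eval_ratio: float = 0.2,
    tools_script: Optional[Path] = None,
) -> List[str]:
    """Build the inspect-audio-dir command without hard-coded absolute paths."""
    if tools_script is None:
        # Assumption: manifest_tools.py lives beside this module.
        tools_script = Path(__file__).resolve().parent / "manifest_tools.py"
    return [
        sys.executable,  # the interpreter currently running this code
        str(tools_script),
        "inspect-audio-dir",
        str(input_dir),
        "--query-duration",
        str(query_duration),
        "--eval-ratio",
        str(eval_ratio),
    ]
```

The returned list can be passed to `subprocess.check_output(cmd, text=True)` exactly as the committed method does.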
......@@ -106,6 +106,44 @@ def build_train_eval_from_audio_dir(
    }


def inspect_audio_dir(
    audio_dir: Path,
    exts: tuple[str, ...] = (".wav", ".mp3", ".flac", ".ogg"),
    query_duration: float = 8.0,
    eval_ratio: float = 0.2,
):
    files = [p for p in sorted(audio_dir.rglob("*")) if p.suffix.lower() in exts]
    durations = []
    eligible = 0
    for path in files:
        try:
            duration = float(sf.info(str(path)).duration)
        except Exception:
            duration = 0.0
        durations.append(duration)
        if duration >= query_duration:
            eligible += 1

    durations_sorted = sorted(durations)
    total = len(files)
    train_queries = max(0, eligible - max(1 if eligible >= 2 else 0, round(eligible * eval_ratio)))
    test_queries = 0 if eligible == 0 else max(1 if eligible >= 2 else eligible, round(eligible * eval_ratio))

    return {
        "audio_dir": str(audio_dir),
        "num_audio_files": total,
        "eligible_query_files": eligible,
        "query_duration": query_duration,
        "recommended_train_queries": train_queries,
        "recommended_test_queries": test_queries,
        "duration_stats": {
            "min": round(durations_sorted[0], 3) if durations_sorted else 0.0,
            "median": round(durations_sorted[len(durations_sorted) // 2], 3) if durations_sorted else 0.0,
            "max": round(durations_sorted[-1], 3) if durations_sorted else 0.0,
        },
    }


def main():
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="cmd", required=True)
......@@ -124,6 +162,11 @@ def main():
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--seed", type=int, default=42)

    p = sub.add_parser("inspect-audio-dir")
    p.add_argument("audio_dir")
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--eval-ratio", type=float, default=0.2)

    args = parser.parse_args()
    if args.cmd == "csv-to-catalog":
        count = csv_to_catalog(Path(args.csv_path), Path(args.output_path), args.path_field, args.id_field)
......@@ -138,6 +181,13 @@ def main():
            seed=args.seed,
        )
        print(json.dumps({"status": "ok", **summary}, ensure_ascii=False))
    elif args.cmd == "inspect-audio-dir":
        summary = inspect_audio_dir(
            Path(args.audio_dir),
            query_duration=args.query_duration,
            eval_ratio=args.eval_ratio,
        )
        print(json.dumps({"status": "ok", **summary}, ensure_ascii=False))


if __name__ == "__main__":
......
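Two details of `inspect_audio_dir` are easy to miss when reading its summary. A file that `sf.info` cannot read is recorded with duration 0.0, so it can pull `min` down to zero. The reported median is the element at index `len // 2`, which for an even count is the upper of the two middle values rather than their mean. The sketch below reproduces only the `duration_stats` logic from the diff as a standalone function (the function name `duration_stats` is hypothetical) and compares it with `statistics.median`.

```python
import statistics
from typing import Dict, List


def duration_stats(durations: List[float]) -> Dict[str, float]:
    """Mirror the min/median/max computation shown in the diff."""
    ordered = sorted(durations)
    if not ordered:
        return {"min": 0.0, "median": 0.0, "max": 0.0}
    return {
        "min": round(ordered[0], 3),
        "median": round(ordered[len(ordered) // 2], 3),  # upper-middle for even counts
        "max": round(ordered[-1], 3),
    }


sample = [0.0, 6.0, 10.0, 30.0]  # 0.0 stands in for an unreadable file
stats = duration_stats(sample)
print(stats)  # {'min': 0.0, 'median': 10.0, 'max': 30.0}
print(statistics.median(sample))  # 8.0
```

If a conventional median is preferred, or unreadable files should be excluded from the statistics, the summary would need to filter them first. Whether that is desirable for this project is a design choice the commit does not state.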
......@@ -120,6 +120,33 @@
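- Hooking up real FMA / MTG-Jamendo directories no longer requires assembling manifests by hand
- The adapter can now serve as the unified entry point for managing train/eval splits of open datasets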
### Stage: Open-dataset directory scale scan (inspect-local)
Completed:
- Extended `src/data/manifest_tools.py`
  - Added `inspect-audio-dir`
- Extended `src/data/external_adapters.py`
  - Added `inspect-local`
- Before manifests are actually generated, the tools can now report:
  - the number of audio files
  - the number of files long enough to cut a query from
  - the recommended train/test query counts
  - basic duration statistics
Verification results:
- `python -m py_compile src/data/manifest_tools.py src/data/external_adapters.py` succeeded
- `python src/data/manifest_tools.py inspect-audio-dir tmp/open_music_demo --query-duration 5.0 --eval-ratio 0.5` succeeded
- `python src/data/external_adapters.py inspect-local fma tmp/open_music_demo --eval-ratio 0.5 --query-duration 5.0` succeeded
- Returned values:
  - `num_audio_files=2`
  - `eligible_query_files=2`
  - `recommended_train_queries=1`
  - `recommended_test_queries=1`
Conclusion:
- Real FMA / MTG-Jamendo directories can now be sized up before ingestion
- This is particularly helpful for quick data preparation in personal-use setups
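The counts reported in the verification run follow directly from the arithmetic in `inspect_audio_dir`. As a hedged worked example, the snippet below copies the two expressions from the diff into a standalone helper (`recommended_counts` is a hypothetical name) and evaluates a few cases. Two caveats apply: Python's `round` rounds halves to the nearest even integer, and with a single eligible file the expressions as written each yield 1, so the recommended counts add up to more than the number of eligible files.

```python
from typing import Tuple


def recommended_counts(eligible: int, eval_ratio: float) -> Tuple[int, int]:
    """Return (train_queries, test_queries) using the expressions from the diff."""
    held_out = round(eligible * eval_ratio)
    train = max(0, eligible - max(1 if eligible >= 2 else 0, held_out))
    test = 0 if eligible == 0 else max(1 if eligible >= 2 else eligible, held_out)
    return train, test


# (eligible files, eval ratio) pairs, including the verified 2-file demo case
for eligible, ratio in [(0, 0.2), (1, 0.2), (2, 0.5), (5, 0.5), (10, 0.2)]:
    print(eligible, ratio, recommended_counts(eligible, ratio))
```

The `(2, 0.5)` case reproduces the demo result of one train query and one test query. The `(5, 0.5)` case gives 3 train and 2 test queries because `round(2.5)` is 2.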
## 2026-06-02
### Stage: Documentation completion + minimal runnable ACR pipeline
......
......@@ -145,6 +145,7 @@ flowchart LR
CLI entry points:
- Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits`
- High-level unified entry point: `src/data/external_adapters.py prepare-local <dataset> <input_dir>`
- Pre-ingestion check: `src/data/external_adapters.py inspect-local <dataset> <input_dir>`
## 5. Text description
......