Commit dc9ef1b8 dc9ef1b810185d2eda7258f5095c9dd5cb297f3f by cnb.bofCdSsphPA

Close the open-dataset smoke loop through evaluation

Constraint: Open-dataset support was not complete until imported corpora could train, build indexes, and produce eval outputs without manual path surgery
Rejected: Stop at train.py dry-run | Does not prove the retrieval/evaluation half of the workflow actually works
Confidence: high
Scope-risk: moderate
Directive: Keep future external dataset layouts self-contained and manifests-root aware across training, indexing, and evaluation paths
Tested: /usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2; /usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --output data/index_open_smoke_fixed --device cpu; /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --index-prefix data/index_open_smoke_fixed/reference --split test --device cpu --fast-eval --output-json reports/open-smoke-fixed/fma/eval.json; /usr/local/miniconda3/bin/python -m py_compile evaluate.py run_demo.py src/engines/ecapa_embedder.py src/engines/chromaprint_matcher.py src/data/dataset.py src/data/manifest_tools.py src/data/external_adapters.py train.py
Not-tested: Real downloaded FMA or MTG-Jamendo corpora at larger scale
1 parent b766c74e
{
"fma_00001": 0,
"fma_00002": 1,
"fma_00005": 2,
"fma_00007": 3,
"fma_00008": 4,
"fma_00010": 5,
"fma_00012": 6,
"fma_00014": 7,
"fma_00015": 8,
"fma_00016": 9,
"fma_00017": 10,
"fma_00018": 11,
"fma_00019": 12,
"fma_00021": 13,
"fma_00022": 14,
"fma_00023": 15
}
\ No newline at end of file
@@ -31,6 +31,7 @@ def main():
     args = parser.parse_args()
     data_dir = Path(args.data)
+    asset_root = data_dir.parent if data_dir.name == "manifests" else data_dir
     matcher = ChromaprintMatcher()
     matcher.load(str(Path(args.index_prefix).parent / "chromaprint.pkl"))
     embedder = ECAPAEmbedder(model_path=args.model, device=args.device)
@@ -53,7 +54,7 @@ def main():
         engine.load_metadata(str(p))
     items = load_items(data_dir / f"{args.split}.json")
-    queries = [x for x in items if str(x.get("audio_path", "")).startswith("segments/")]
+    queries = [x for x in items if x.get("type") != "reference"]
     if not queries:
         raise SystemExit("No segment queries found for evaluation")
@@ -63,7 +64,7 @@ def main():
     failures = []
     for item in queries:
-        result = engine.recognize(str(data_dir / item["audio_path"]), top_n=args.top_k)
+        result = engine.recognize(str(asset_root / item["audio_path"]), top_n=args.top_k)
         preds = [c["song_id"] for c in result["candidates"]]
         truth = item["song_id"]
         qtype = item.get("type", "unknown")
......
{
"split": "test",
"num_queries": 8,
"top1": 1.0,
"topk": 1.0,
"by_type": {
"clean": {
"n": 8,
"top1": 1.0,
"topk": 1.0
}
},
"hard_case_summary": {},
"sample_failures": []
}
\ No newline at end of file
@@ -29,9 +29,10 @@ def cmd_generate_data(args):
 def build_chroma_index(data_dir: Path, output_dir: Path):
     matcher = ChromaprintMatcher()
+    metadata_path = data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'
     matcher.index_songs_from_dir(
-        songs_dir=str(data_dir / 'songs'),
-        metadata_path=str(data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'),
+        songs_dir=str(data_dir),
+        metadata_path=str(metadata_path),
         cache_path=str(output_dir / 'chromaprint.pkl'),
     )
     print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}")
@@ -40,9 +41,10 @@ def build_chroma_index(data_dir: Path, output_dir: Path):
 def build_embedding_index(data_dir: Path, model_path: Path, output_prefix: Path, device: str):
     embedder = ECAPAEmbedder(model_path=str(model_path), device=device)
+    metadata_path = data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'
     ref_embs, ref_ids = embedder.build_reference_index(
-        songs_dir=str(data_dir / 'songs'),
-        metadata_path=str(data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'),
+        songs_dir=str(data_dir),
+        metadata_path=str(metadata_path),
         output_path=str(output_prefix),
     )
     print(f"[done] embedding index built: {len(ref_ids)} refs")
......
@@ -82,7 +82,7 @@ class ChromaprintMatcher:
         songs_dir = Path(songs_dir)
         for item in meta:
-            if "songs" not in item.get("audio_path", ""):
+            if item.get("type") != "reference":
                 continue
             audio_path = songs_dir.parent / item["audio_path"]
             if not audio_path.exists():
......
@@ -103,7 +103,7 @@ class ECAPAEmbedder:
         songs_dir = Path(songs_dir)
         for item in meta:
-            if item.get("type") != "reference" and "songs/" not in item.get("audio_path", ""):
+            if item.get("type") != "reference":
                 continue
             audio_path = songs_dir.parent / item["audio_path"]
             if not audio_path.exists():
......
@@ -72,6 +72,27 @@
- The open-dataset path now not only generates manifests but also feeds into actual training
- When real FMA / MTG-Jamendo data is connected later, it can go through the same pipeline directly
### Stage: Full open-dataset smoke loop (train/index/eval)
Completed:
- Fixed the index entry-point assumptions in `run_demo.py` so they hold for the self-contained open-dataset layout
- Fixed the hard-coded reference-path filtering in `src/engines/ecapa_embedder.py` / `src/engines/chromaprint_matcher.py`
- Fixed how `evaluate.py` resolves open-dataset queries and the `manifests` root path
- Connected the open-dataset flow `prepare-local -> validate-local -> train -> build-index -> evaluate` end to end
Verification results:
- `/usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2` succeeded
- `/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --output data/index_open_smoke_fixed --device cpu` succeeded
- `/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --index-prefix data/index_open_smoke_fixed/reference --split test --device cpu --fast-eval --output-json reports/open-smoke-fixed/fma/eval.json` succeeded
- Current results:
  - `num_queries=8`
  - `top1=1.0`
  - `topk=1.0`
Conclusion:
- The open-dataset ingestion pipeline now runs end to end, from ingestion through evaluation
- When real local FMA / MTG-Jamendo directories are ingested, the same workflow can be reused directly
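
The two rules this stage relies on can be sketched in a few lines. This is a minimal illustration, not the project's code: `resolve_asset_root` and `split_items` are hypothetical helper names and the manifest entries are invented. Only the `manifests` directory rule and the `type == "reference"` convention are taken from the changes in this stage.

```python
from pathlib import Path


def resolve_asset_root(data_dir: Path) -> Path:
    """Return the directory that relative audio paths are resolved against.

    When --data points at a `manifests` directory, the audio sits next to it,
    so the parent is used; otherwise the data directory itself is the root.
    """
    return data_dir.parent if data_dir.name == "manifests" else data_dir


def split_items(items):
    """Separate reference tracks from query clips using the `type` field."""
    references = [x for x in items if x.get("type") == "reference"]
    queries = [x for x in items if x.get("type") != "reference"]
    return references, queries


# Invented manifest entries, for illustration only.
items = [
    {"song_id": "fma_00001", "type": "reference", "audio_path": "audio/fma_00001.wav"},
    {"song_id": "fma_00001", "type": "clean", "audio_path": "queries/fma_00001_q0.wav"},
]

root = resolve_asset_root(Path("data/external_ingested/fma/manifests"))
references, queries = split_items(items)
print(root / queries[0]["audio_path"])
```

Filtering on `type` rather than on substrings of `audio_path` means an external dataset can use any directory names it likes, as long as its manifests label reference tracks.
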
### Stage: confused targeted optimization v6 (sample-level weighting)
Completed:
......
@@ -47,6 +47,8 @@ flowchart LR
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
/usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
```
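
The train, build-index, and evaluate steps can also be chained from Python. The sketch below is hypothetical (`smoke_commands` and `run_all` are not part of the repository); it only reuses the script names and flags from the commands above. It leaves `--dry-run` off the train step, on the assumption that a real (if tiny) training run is what writes `best_model.pt`.

```python
import subprocess
import sys


def smoke_commands(manifests, model_dir, index_dir, report_json, python=sys.executable):
    """Build the train -> build-index -> evaluate commands as argument lists."""
    model = f"{model_dir}/best_model.pt"
    return [
        [python, "train.py", "--data", manifests, "--output", model_dir,
         "--device", "cpu", "--epochs", "1", "--batch-size", "2"],
        [python, "run_demo.py", "build-index", "--data", manifests,
         "--model", model, "--output", index_dir, "--device", "cpu"],
        [python, "evaluate.py", "--data", manifests, "--model", model,
         "--index-prefix", f"{index_dir}/reference", "--split", "test",
         "--device", "cpu", "--fast-eval", "--output-json", report_json],
    ]


def run_all(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = smoke_commands(
    manifests="data/external_ingested/fma/manifests",
    model_dir="data/models_fma_smoke",
    index_dir="data/index_fma_smoke",
    report_json="reports/fma-smoke/eval.json",
)
for cmd in commands:
    print(" ".join(cmd))
# Call run_all(commands) from the repository root to execute the steps.
```

Passing argument lists rather than a shell string avoids quoting problems if a dataset path contains spaces.
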
### 3.2 Multi-directory comparison
@@ -82,6 +84,9 @@ flowchart LR
- `ok=true`
- `train.py --dry-run`
- `Dry run passed! Pipeline is working.`
- `build-index + evaluate`
- `top1=1.0`
- `topk=1.0`
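
For readers unfamiliar with the two numbers: `top1` is the share of queries whose first candidate is the true song, and `topk` is the share whose true song appears anywhere in the returned candidates. The sketch below shows one conventional way to compute them, overall and per query type. It assumes `evaluate.py` aggregates in this usual fashion, which is not confirmed here; `summarize` is a hypothetical helper and the predictions are invented.

```python
from collections import defaultdict


def summarize(results, split="test"):
    """Aggregate top-1 / top-k hit rates overall and per query type.

    `results` is a list of (query_type, true_song_id, ranked_predictions).
    """
    by_type = defaultdict(lambda: {"n": 0, "top1_hits": 0, "topk_hits": 0})
    for qtype, truth, preds in results:
        bucket = by_type[qtype]
        bucket["n"] += 1
        bucket["top1_hits"] += int(bool(preds) and preds[0] == truth)
        bucket["topk_hits"] += int(truth in preds)

    total = sum(b["n"] for b in by_type.values())
    if total == 0:
        raise ValueError("no queries to summarize")
    return {
        "split": split,
        "num_queries": total,
        "top1": sum(b["top1_hits"] for b in by_type.values()) / total,
        "topk": sum(b["topk_hits"] for b in by_type.values()) / total,
        "by_type": {
            t: {"n": b["n"], "top1": b["top1_hits"] / b["n"], "topk": b["topk_hits"] / b["n"]}
            for t, b in by_type.items()
        },
    }


# Invented predictions: one hit at rank 1 and one hit at rank 2.
results = [
    ("clean", "fma_00001", ["fma_00001", "fma_00002"]),
    ("clean", "fma_00005", ["fma_00007", "fma_00005"]),
]
summary = summarize(results)
print(summary)
```

With only eight queries in the smoke run, a score of `1.0` shows that the pipeline is wired correctly; it says little about retrieval quality at scale.
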
---
......