Close the open-dataset smoke loop through evaluation

Constraint: Open-dataset support was not complete until imported corpora could train, build indexes, and produce eval outputs without manual path surgery
Rejected: Stop at train.py dry-run | Does not prove the retrieval/evaluation half of the workflow actually works
Confidence: high
Scope-risk: moderate
Directive: Keep future external dataset layouts self-contained and manifests-root aware across training, indexing, and evaluation paths
Tested: /usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2;
  /usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --output data/index_open_smoke_fixed --device cpu;
  /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --index-prefix data/index_open_smoke_fixed/reference --split test --device cpu --fast-eval --output-json reports/open-smoke-fixed/fma/eval.json;
  /usr/local/miniconda3/bin/python -m py_compile evaluate.py run_demo.py src/engines/ecapa_embedder.py src/engines/chromaprint_matcher.py src/data/dataset.py src/data/manifest_tools.py src/data/external_adapters.py train.py
Not-tested: Real downloaded FMA or MTG-Jamendo corpora at larger scale

Commit dc9ef1b8 ... dc9ef1b810185d2eda7258f5095c9dd5cb297f3f authored 2026-06-02 12:59:41 +0800 by cnb.bofCdSsphPA
Showing 12 changed files with 70 additions and 8 deletions
acr-engine/data/index_open_smoke_fixed/chromaprint.pkl
acr-engine/data/index_open_smoke_fixed/reference_embs.npy
acr-engine/data/index_open_smoke_fixed/reference_ids.npy
acr-engine/data/models_open_smoke_fixed/best_model.pt
acr-engine/data/models_open_smoke_fixed/song_to_idx.json
acr-engine/evaluate.py
acr-engine/reports/open-smoke-fixed/fma/eval.json
acr-engine/run_demo.py
acr-engine/src/engines/chromaprint_matcher.py
acr-engine/src/engines/ecapa_embedder.py
docs/CHANGELOG.md
docs/open-dataset-workflow.md
--- a/acr-engine/data/index_open_smoke_fixed/chromaprint.pkl 0 → 100644
+++ b/acr-engine/data/index_open_smoke_fixed/chromaprint.pkl 0 → 100644
--- a/acr-engine/data/index_open_smoke_fixed/reference_embs.npy 0 → 100644
+++ b/acr-engine/data/index_open_smoke_fixed/reference_embs.npy 0 → 100644
--- a/acr-engine/data/index_open_smoke_fixed/reference_ids.npy 0 → 100644
+++ b/acr-engine/data/index_open_smoke_fixed/reference_ids.npy 0 → 100644
--- a/acr-engine/data/models_open_smoke_fixed/best_model.pt 0 → 100644
+++ b/acr-engine/data/models_open_smoke_fixed/best_model.pt 0 → 100644
--- a/acr-engine/data/models_open_smoke_fixed/song_to_idx.json 0 → 100644
+++ b/acr-engine/data/models_open_smoke_fixed/song_to_idx.json 0 → 100644
+{
+  "fma_00001": 0,
+  "fma_00002": 1,
+  "fma_00005": 2,
+  "fma_00007": 3,
+  "fma_00008": 4,
+  "fma_00010": 5,
+  "fma_00012": 6,
+  "fma_00014": 7,
+  "fma_00015": 8,
+  "fma_00016": 9,
+  "fma_00017": 10,
+  "fma_00018": 11,
+  "fma_00019": 12,
+  "fma_00021": 13,
+  "fma_00022": 14,
+  "fma_00023": 15
+}
\ No newline at end of file
--- a/acr-engine/evaluate.py
+++ b/acr-engine/evaluate.py
@@ -31,6 +31,7 @@ def main():
    args = parser.parse_args()

    data_dir = Path(args.data)
+    asset_root = data_dir.parent if data_dir.name == "manifests" else data_dir
    matcher = ChromaprintMatcher()
    matcher.load(str(Path(args.index_prefix).parent / "chromaprint.pkl"))
    embedder = ECAPAEmbedder(model_path=args.model, device=args.device)
@@ -53,7 +54,7 @@ def main():
            engine.load_metadata(str(p))

    items = load_items(data_dir / f"{args.split}.json")
-    queries = [x for x in items if str(x.get("audio_path", "")).startswith("segments/")]
+    queries = [x for x in items if x.get("type") != "reference"]
    if not queries:
        raise SystemExit("No segment queries found for evaluation")

@@ -63,7 +64,7 @@ def main():
    failures = []

    for item in queries:
-        result = engine.recognize(str(data_dir / item["audio_path"]), top_n=args.top_k)
+        result = engine.recognize(str(asset_root / item["audio_path"]), top_n=args.top_k)
        preds = [c["song_id"] for c in result["candidates"]]
        truth = item["song_id"]
        qtype = item.get("type", "unknown")
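The path handling in this hunk can be illustrated in isolation. The sketch below is not code from the commit: `resolve_asset_root` and `select_queries` are hypothetical helper names that mirror the `asset_root` expression and the `type`-based query filter shown above, and the manifest entries are made-up examples.

```python
from __future__ import annotations

from pathlib import Path


def resolve_asset_root(data_dir: Path) -> Path:
    """Return the directory that manifest-relative audio paths resolve against.

    If --data points at a `manifests` folder, audio lives one level up;
    otherwise the data directory itself is treated as the root.
    """
    return data_dir.parent if data_dir.name == "manifests" else data_dir


def select_queries(items: list[dict]) -> list[dict]:
    """Keep every manifest entry that is not a reference track."""
    return [x for x in items if x.get("type") != "reference"]


# Hypothetical manifest entries, for illustration only.
items = [
    {"song_id": "fma_00001", "type": "reference", "audio_path": "songs/fma_00001.wav"},
    {"song_id": "fma_00001", "type": "clean", "audio_path": "segments/fma_00001_q0.wav"},
]

root = resolve_asset_root(Path("data/external_ingested/fma/manifests"))
queries = select_queries(items)
query_paths = [root / q["audio_path"] for q in queries]
```

Filtering on `type` rather than on an `audio_path` prefix means a dataset can place query clips under any folder name, as long as its manifest labels reference entries consistently.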
--- a/acr-engine/reports/open-smoke-fixed/fma/eval.json 0 → 100644
+++ b/acr-engine/reports/open-smoke-fixed/fma/eval.json 0 → 100644
+{
+  "split": "test",
+  "num_queries": 8,
+  "top1": 1.0,
+  "topk": 1.0,
+  "by_type": {
+    "clean": {
+      "n": 8,
+      "top1": 1.0,
+      "topk": 1.0
+    }
+  },
+  "hard_case_summary": {},
+  "sample_failures": []
+}
\ No newline at end of file
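The aggregation code that produces this report is not part of the diff. As a rough sketch of how `top1`, `topk`, and `by_type` figures of this shape could be derived from the `preds`, `truth`, and `qtype` values seen in `evaluate.py`, assuming plain hit-rate averaging (the `summarize` function and its input tuples are hypothetical):

```python
from collections import defaultdict


def summarize(results, k):
    """Compute overall and per-type top-1 / top-k hit rates.

    results: iterable of (truth, preds, qtype), with preds ordered best-first.
    """
    counts = defaultdict(lambda: {"n": 0, "top1": 0, "topk": 0})
    for truth, preds, qtype in results:
        bucket = counts[qtype]
        bucket["n"] += 1
        bucket["top1"] += int(bool(preds) and preds[0] == truth)
        bucket["topk"] += int(truth in preds[:k])

    total = sum(b["n"] for b in counts.values())
    return {
        "num_queries": total,
        "top1": sum(b["top1"] for b in counts.values()) / total if total else 0.0,
        "topk": sum(b["topk"] for b in counts.values()) / total if total else 0.0,
        "by_type": {
            t: {"n": b["n"], "top1": b["top1"] / b["n"], "topk": b["topk"] / b["n"]}
            for t, b in counts.items()
        },
    }


# Two made-up queries: one ranked first, one only found within the top 2.
report = summarize(
    [("a", ["a", "b"], "clean"), ("b", ["a", "b"], "clean")],
    k=2,
)
```

With only 8 clean queries in the smoke run, a perfect score shows that the pipeline is wired end to end; it says little about retrieval quality.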
--- a/acr-engine/run_demo.py
+++ b/acr-engine/run_demo.py
@@ -29,9 +29,10 @@ def cmd_generate_data(args):

 def build_chroma_index(data_dir: Path, output_dir: Path):
    matcher = ChromaprintMatcher()
+    metadata_path = data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'
    matcher.index_songs_from_dir(
-        songs_dir=str(data_dir / 'songs'),
-        metadata_path=str(data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'),
+        songs_dir=str(data_dir),
+        metadata_path=str(metadata_path),
        cache_path=str(output_dir / 'chromaprint.pkl'),
    )
    print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}")
@@ -40,9 +41,10 @@ def build_chroma_index(data_dir: Path, output_dir: Path):

 def build_embedding_index(data_dir: Path, model_path: Path, output_prefix: Path, device: str):
    embedder = ECAPAEmbedder(model_path=str(model_path), device=device)
+    metadata_path = data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'
    ref_embs, ref_ids = embedder.build_reference_index(
-        songs_dir=str(data_dir / 'songs'),
-        metadata_path=str(data_dir / 'catalog.json' if (data_dir / 'catalog.json').exists() else data_dir / 'train.json'),
+        songs_dir=str(data_dir),
+        metadata_path=str(metadata_path),
        output_path=str(output_prefix),
    )
    print(f"[done] embedding index built: {len(ref_ids)} refs")
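Both index builders now compute the same `catalog.json` / `train.json` fallback. One way to remove the duplication is a small shared helper. This is a sketch only: `pick_metadata_path` is a hypothetical name and does not appear in the commit.

```python
import tempfile
from pathlib import Path


def pick_metadata_path(data_dir: Path) -> Path:
    """Prefer catalog.json when it exists, otherwise fall back to train.json."""
    catalog = data_dir / "catalog.json"
    return catalog if catalog.exists() else data_dir / "train.json"


# Demonstrate both branches in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    manifests = Path(tmp) / "manifests"
    manifests.mkdir()
    without_catalog = pick_metadata_path(manifests).name
    (manifests / "catalog.json").write_text("[]")
    with_catalog = pick_metadata_path(manifests).name
```

The helper does not check that `train.json` exists, which matches the inline expression in the diff; a missing file would surface later, when the metadata is loaded.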
--- a/acr-engine/src/engines/chromaprint_matcher.py
+++ b/acr-engine/src/engines/chromaprint_matcher.py
@@ -82,7 +82,7 @@ class ChromaprintMatcher:

        songs_dir = Path(songs_dir)
        for item in meta:
-            if "songs" not in item.get("audio_path", ""):
+            if item.get("type") != "reference":
                continue
            audio_path = songs_dir.parent / item["audio_path"]
            if not audio_path.exists():
--- a/acr-engine/src/engines/ecapa_embedder.py
+++ b/acr-engine/src/engines/ecapa_embedder.py
@@ -103,7 +103,7 @@ class ECAPAEmbedder:
        songs_dir = Path(songs_dir)

        for item in meta:
-            if item.get("type") != "reference" and "songs/" not in item.get("audio_path", ""):
+            if item.get("type") != "reference":
                continue
            audio_path = songs_dir.parent / item["audio_path"]
            if not audio_path.exists():
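Both engines resolve reference audio as `songs_dir.parent / item["audio_path"]`, and `run_demo.py` now passes the `--data` directory itself as `songs_dir`. The sketch below reproduces only that path arithmetic, with a hypothetical `reference_paths` function and made-up manifest entries, to show where files would be looked up. It suggests that a layout where `--data` is the dataset root rather than a `manifests` folder would resolve one level too high, unless the caller adjusts `data_dir` in code the diff does not show.

```python
from pathlib import Path


def reference_paths(songs_dir, meta):
    """Mimic the engines' lookup: keep reference items, resolve against songs_dir.parent."""
    base = Path(songs_dir).parent
    return [
        base / item["audio_path"]
        for item in meta
        if item.get("type") == "reference"
    ]


# Hypothetical manifest entries, for illustration only.
meta = [
    {"type": "reference", "audio_path": "songs/fma_00001.wav"},
    {"type": "clean", "audio_path": "segments/fma_00001_q0.wav"},
]

# --data points at a manifests folder: resolves inside the dataset root.
open_layout = reference_paths("data/external_ingested/fma/manifests", meta)

# --data points at the dataset root itself: resolves one directory above it.
root_layout = reference_paths("data/synth", meta)
```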
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -72,6 +72,27 @@
 - The open-data path can now not only generate manifests but also actually run training
 - When real FMA / MTG-Jamendo data is onboarded later, it can go through the same pipeline directly

+### Stage: Complete open-data smoke loop (train/index/eval)
+
+Completed:
+- Fixed the index-entry assumptions in `run_demo.py` about the self-contained open-data layout
+- Fixed the hardcoded reference-path filtering in `src/engines/ecapa_embedder.py` / `src/engines/chromaprint_matcher.py`
+- Fixed how `evaluate.py` resolves open-data queries and the `manifests` root path
+- Connected the open-data flow end to end: `prepare-local -> validate-local -> train -> build-index -> evaluate`
+
+Verification results:
+- `/usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2` succeeded
+- `/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --output data/index_open_smoke_fixed --device cpu` succeeded
+- `/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --model data/models_open_smoke_fixed/best_model.pt --index-prefix data/index_open_smoke_fixed/reference --split test --device cpu --fast-eval --output-json reports/open-smoke-fixed/fma/eval.json` succeeded
+- Current results:
+  - `num_queries=8`
+  - `top1=1.0`
+  - `topk=1.0`
+
+Conclusion:
+- The open-data onboarding pipeline now forms a complete closed loop
+- When a real FMA / MTG-Jamendo local directory is onboarded, the same flow can be reused directly
+
 ### Stage: Targeted optimization for confused cases v6 (sample-level weighting)

 Completed:
--- a/docs/open-dataset-workflow.md
+++ b/docs/open-dataset-workflow.md
@@ -47,6 +47,8 @@ flowchart LR
 /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
 /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
 /usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
+/usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu
+/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
 ```

 ### 3.2 Multi-directory comparison
@@ -82,6 +84,9 @@ flowchart LR
  - `ok=true`
 - `train.py --dry-run`：
  - `Dry run passed! Pipeline is working.`
+- `build-index + evaluate`：
+  - `top1=1.0`
+  - `topk=1.0`

 ---