Make retrieval fusion tuning reproducible for fast evaluation

Constraint: Need fresh, like-for-like evidence on stable v6 assets before changing defaults Rejected: More training-weight tuning | v7 and v8 regressed hard-case and overall accuracy Confidence: high Scope-risk: narrow Directive: Use open datasets as separate train/eval assets and tune fusion on held-out eval manifests before retraining Tested: /usr/local/miniconda3/bin/python -m py_compile evaluate.py; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --chroma-weight 0.2 --ecapa-weight 0.55 --melody-weight 0.25 --output-json reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json Not-tested: Full melody-enabled sweep across multiple weight grids and real external datasets

Make retrieval fusion tuning reproducible for fast evaluation
Constraint: Need fresh, like-for-like evidence on stable v6 assets before changing defaults Rejected: More training-weight tuning | v7 and v8 regressed hard-case and overall accuracy Confidence: high Scope-risk: narrow Directive: Use open datasets as separate train/eval assets and tune fusion on held-out eval manifests before retraining Tested: /usr/local/miniconda3/bin/python -m py_compile evaluate.py; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --chroma-weight 0.2 --ecapa-weight 0.55 --melody-weight 0.25 --output-json reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json Not-tested: Full melody-enabled sweep across multiple weight grids and real external datasets
cnb.bofCdSsphPA
Commit d665b1fd ... d665b1fd1897a4313e3a5eec4b5308ebe8173875 authored 2026-06-02 12:37:25 +0800 by cnb.bofCdSsphPA
Showing 5 changed files with 137 additions and 1 deletions
acr-engine/evaluate.py
acr-engine/reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json
docs/CHANGELOG.md
docs/dataset-spec.md
docs/sota-research-2026.md
--- a/acr-engine/evaluate.py
View file @d665b1f
+++ b/acr-engine/evaluate.py
View file @d665b1f
@@ -25,6 +25,9 @@ def main():
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--output-json", default=None)
    parser.add_argument("--fast-eval", action="store_true")
+    parser.add_argument("--chroma-weight", type=float, default=0.25)
+    parser.add_argument("--ecapa-weight", type=float, default=0.5)
+    parser.add_argument("--melody-weight", type=float, default=0.25)
    args = parser.parse_args()

    data_dir = Path(args.data)
@@ -34,7 +37,16 @@ def main():
    ref_embs = np.load(f"{args.index_prefix}_embs.npy")
    ref_ids = np.load(f"{args.index_prefix}_ids.npy", allow_pickle=True).tolist()

-    engine = HybridEngine(matcher, embedder, ref_embs, ref_ids, disable_melody=args.fast_eval)
+    engine = HybridEngine(
+        matcher,
+        embedder,
+        ref_embs,
+        ref_ids,
+        chroma_weight=args.chroma_weight,
+        ecapa_weight=args.ecapa_weight,
+        melody_weight=args.melody_weight,
+        disable_melody=args.fast_eval,
+    )
    for split in ["train.json", "val.json", "test.json"]:
        p = data_dir / split
        if p.exists():
--- a/acr-engine/reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json 0 → 100644
View file @d665b1f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json 0 → 100644
View file @d665b1f
+{
+  "split": "test",
+  "num_queries": 20,
+  "top1": 0.7,
+  "topk": 0.95,
+  "by_type": {
+    "clean": {
+      "n": 8,
+      "top1": 1.0,
+      "topk": 1.0
+    },
+    "augmented": {
+      "n": 4,
+      "top1": 0.75,
+      "topk": 1.0
+    },
+    "humming_like": {
+      "n": 4,
+      "top1": 0.5,
+      "topk": 1.0
+    },
+    "confused": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 0.75
+    }
+  },
+  "hard_case_summary": {
+    "humming_like": {
+      "n": 4,
+      "top1": 0.5,
+      "topk": 1.0
+    },
+    "confused": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 0.75
+    }
+  },
+  "sample_failures": [
+    {
+      "truth": "song_0023",
+      "query": "segments/song_0023_seg_04_confused.wav",
+      "type": "confused",
+      "preds": [
+        "song_0006",
+        "song_0002",
+        "song_0001",
+        "song_0019",
+        "song_0022"
+      ]
+    }
+  ]
+}
\ No newline at end of file
--- a/docs/CHANGELOG.md
View file @d665b1f
+++ b/docs/CHANGELOG.md
View file @d665b1f
@@ -28,6 +28,50 @@
 - `confused top1` 从 0.00 提升到 0.25，说明 sample-level 权重有效
 - `humming_like top1` 从 0.50 回落到 0.25，说明两类 hard case 需要分治，而不能只靠单轴加权

+### Stage: v7 平衡采样试验（未保留）
+
+完成项：
+- 尝试降低 confused 偏置并提高 humming_like 采样强度
+- 重跑 `smoke-v7` 全链路验证
+- 基于失败样本回查 residual hard case 的 segment 分布
+
+验证结果：
+- `smoke-v7` 结果退化为：
+  - overall top1=0.55, top5=0.80
+  - humming_like top1=0.00
+  - confused top1=0.00
+- 因结果明显回退，已回滚该试验，不作为主线版本保留
+
+结论：
+- 单纯重调全局采样比率不稳定
+- 当前最优保留点仍是 `smoke-v6`
+- residual confused failure 主要落在 `intro` 片段，下一轮应改做 `segment_type-aware hard negatives`
+
+### Stage: 检索融合权重参数化 + fast-eval 调优
+
+完成项：
+- `evaluate.py` 新增融合参数：
+  - `--chroma-weight`
+  - `--ecapa-weight`
+  - `--melody-weight`
+- 在稳定的 `models_v6 + index_v6` 资产上做 fresh fast-eval 对比
+- 验证 retrieval-time fusion 调优是否比继续改训练权重更有效
+
+验证结果：
+- 默认 fast-eval：
+  - overall top1=0.65
+  - humming_like top1=0.25
+  - confused top1=0.25
+- 调整为 `chroma=0.2 / ecapa=0.55 / melody=0.25` 后：
+  - overall top1=0.70
+  - humming_like top1=0.50
+  - confused top1=0.25
+
+结论：
+- 在当前阶段，**检索融合调优** 比继续调训练侧权重更稳定、更划算
+- `ecapa` 权重略升、`chroma` 略降能恢复 `humming_like`，同时保持 `confused`
+- 下一阶段应继续把外部开源数据集真正接成 train/eval manifests，而不是只停在 bootstrap
+
 ## 2026-06-02

 ### Stage: 文档补全 + ACR 最小可运行链路
--- a/docs/dataset-spec.md
View file @d665b1f
+++ b/docs/dataset-spec.md
View file @d665b1f
@@ -97,6 +97,26 @@ flowchart LR

 ---

+## 4.2 检索融合参数图
+
+```mermaid
+flowchart LR
+    A[Chromaprint Score] --> D[Fused Score]
+    B[ECAPA Score] --> D
+    C[Melody Score] --> D
+```
+
+| 参数 | 默认值 | 当前验证更优值（fast-eval） | 含义 |
+|---|---:|---:|---|
+| `chroma_weight` | 0.25 | 0.20 | 降低纯指纹主导 |
+| `ecapa_weight` | 0.50 | 0.55 | 提高 embedding 检索主导 |
+| `melody_weight` | 0.25 | 0.25 | 暂时保持不变 |
+
+说明：
+- 当前仓库已经支持在 `evaluate.py` 中直接传入融合参数
+- 对个人使用场景，推荐把一部分开源数据集固定成 **fusion tuning eval set**
+- 这样训练、检索、调参可以分离，而不是每次都重训
+
 ## 5. 文字说明

 ### 5.1 为什么必须分离 catalog 和 query
@@ -119,10 +139,12 @@ flowchart LR
 - 简单过采样会导致整体退化
 - type-aware weighting 能提升一部分 hard case
 - confused 类需要更高权重，但过强偏置会回伤 `humming_like`
+- residual confused failure 往往集中在 `intro` 片段，因此 `segment_type` 不只是元数据，还应参与后续难负例设计
 - 因此 dataset 规范中必须保留 `type` 字段，后续才能继续做：
  - confusion-aware negative mining
  - melody-aware reranking
  - 双支路 hard-case curriculum
+  - segment-type-aware hard negatives

 ---

--- a/docs/sota-research-2026.md
View file @d665b1f
+++ b/docs/sota-research-2026.md
View file @d665b1f
@@ -119,6 +119,10 @@ flowchart LR
 3. 对音乐 ACR 来说，`confused` 与 `humming_like` 不是同一种难度来源：
   - `confused` 更偏 timbre / arrangement / retrieval ambiguity
   - `humming_like` 更偏 melody / pitch contour mismatch
+4. 当前仓库里的 residual confused failure 进一步显示：
+   - `intro` 片段是更高风险区域
+   - 下一步应引入 `segment_type-aware hard negatives`
+   - 这比继续全局调 sample ratio 更接近工业有效路径

 ## 5. 2026 年是否已经有更好的方案？