Commit d665b1fd d665b1fd1897a4313e3a5eec4b5308ebe8173875 by cnb.bofCdSsphPA

Make retrieval fusion tuning reproducible for fast evaluation

Constraint: Need fresh, like-for-like evidence on stable v6 assets before changing defaults
Rejected: More training-weight tuning | v7 and v8 regressed hard-case and overall accuracy
Confidence: high
Scope-risk: narrow
Directive: Use open datasets as separate train/eval assets and tune fusion on held-out eval manifests before retraining
Tested: /usr/local/miniconda3/bin/python -m py_compile evaluate.py; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --chroma-weight 0.2 --ecapa-weight 0.55 --melody-weight 0.25 --output-json reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json
Not-tested: Full melody-enabled sweep across multiple weight grids and real external datasets
1 parent c89ef4f9
......@@ -25,6 +25,9 @@ def main():
parser.add_argument("--device", default="cpu")
parser.add_argument("--output-json", default=None)
parser.add_argument("--fast-eval", action="store_true")
parser.add_argument("--chroma-weight", type=float, default=0.25)
parser.add_argument("--ecapa-weight", type=float, default=0.5)
parser.add_argument("--melody-weight", type=float, default=0.25)
args = parser.parse_args()
data_dir = Path(args.data)
......@@ -34,7 +37,16 @@ def main():
ref_embs = np.load(f"{args.index_prefix}_embs.npy")
ref_ids = np.load(f"{args.index_prefix}_ids.npy", allow_pickle=True).tolist()
engine = HybridEngine(matcher, embedder, ref_embs, ref_ids, disable_melody=args.fast_eval)
engine = HybridEngine(
matcher,
embedder,
ref_embs,
ref_ids,
chroma_weight=args.chroma_weight,
ecapa_weight=args.ecapa_weight,
melody_weight=args.melody_weight,
disable_melody=args.fast_eval,
)
for split in ["train.json", "val.json", "test.json"]:
p = data_dir / split
if p.exists():
......
{
"split": "test",
"num_queries": 20,
"top1": 0.7,
"topk": 0.95,
"by_type": {
"clean": {
"n": 8,
"top1": 1.0,
"topk": 1.0
},
"augmented": {
"n": 4,
"top1": 0.75,
"topk": 1.0
},
"humming_like": {
"n": 4,
"top1": 0.5,
"topk": 1.0
},
"confused": {
"n": 4,
"top1": 0.25,
"topk": 0.75
}
},
"hard_case_summary": {
"humming_like": {
"n": 4,
"top1": 0.5,
"topk": 1.0
},
"confused": {
"n": 4,
"top1": 0.25,
"topk": 0.75
}
},
"sample_failures": [
{
"truth": "song_0023",
"query": "segments/song_0023_seg_04_confused.wav",
"type": "confused",
"preds": [
"song_0006",
"song_0002",
"song_0001",
"song_0019",
"song_0022"
]
}
]
}
\ No newline at end of file
......@@ -28,6 +28,50 @@
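- `confused top1` rose from 0.00 to 0.25, which indicates that sample-level weighting works
- `humming_like top1` fell from 0.50 to 0.25, which indicates that the two hard-case types need to be handled separately rather than through single-axis weighting alone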
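### Stage: v7 balanced-sampling experiment (not kept)
Completed:
- Tried reducing the confused bias and increasing the humming_like sampling strength
- Re-ran the full `smoke-v7` pipeline validation
- Traced the segment distribution of residual hard cases back from the failure samples
Validation results:
- `smoke-v7` regressed to:
  - overall top1=0.55, top5=0.80
  - humming_like top1=0.00
  - confused top1=0.00
- Because the results clearly regressed, the experiment was rolled back and is not kept as a mainline version
Conclusions:
- Retuning only the global sampling ratio is unstable
- The best retained checkpoint is still `smoke-v6`
- Residual confused failures fall mainly in `intro` segments; the next round should switch to `segment_type-aware hard negatives`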
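### Stage: Parameterized retrieval fusion weights + fast-eval tuning
Completed:
- Added fusion parameters to `evaluate.py`:
  - `--chroma-weight`
  - `--ecapa-weight`
  - `--melody-weight`
- Ran a fresh fast-eval comparison on the stable `models_v6 + index_v6` assets
- Checked whether retrieval-time fusion tuning is more effective than further changes to training weights
Validation results:
- Default fast-eval:
  - overall top1=0.65
  - humming_like top1=0.25
  - confused top1=0.25
- After switching to `chroma=0.2 / ecapa=0.55 / melody=0.25`:
  - overall top1=0.70
  - humming_like top1=0.50
  - confused top1=0.25
Conclusions:
- At this stage, **retrieval fusion tuning** is more stable and more cost-effective than further tuning of training-side weights
- Raising the `ecapa` weight slightly and lowering `chroma` slightly recovers `humming_like` while holding `confused` steady
- The next stage should wire external open datasets in as real train/eval manifests rather than stopping at bootstrap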
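
The comparison above could be extended into a small weight sweep on a held-out eval manifest. The following is only a minimal sketch under stated assumptions: the per-query score layout (`scores` mapping a song id to a `(chroma, ecapa, melody)` tuple) and the helper names `top1_accuracy` and `sweep_weights` are hypothetical and do not come from the repository, and the fused score is assumed to be a plain weighted sum.

```python
from itertools import product


def top1_accuracy(queries, weights):
    """Fraction of queries whose best fused candidate equals the ground truth.

    Assumption: each query is {"truth": song_id, "scores": {song_id: (chroma, ecapa, melody)}}.
    """
    if not queries:
        return 0.0
    w_c, w_e, w_m = weights
    hits = 0
    for query in queries:
        fused = {
            song_id: w_c * c + w_e * e + w_m * m
            for song_id, (c, e, m) in query["scores"].items()
        }
        # Iterate ids in sorted order so ties are broken deterministically.
        best = max(sorted(fused), key=fused.get)
        hits += best == query["truth"]
    return hits / len(queries)


def sweep_weights(queries, steps=20):
    """Try every (chroma, ecapa, melody) triple on a grid whose weights sum to 1."""
    best_acc, best_weights = -1.0, None
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        weights = (i / steps, j / steps, (steps - i - j) / steps)
        acc = top1_accuracy(queries, weights)
        if acc > best_acc:
            best_acc, best_weights = acc, weights
    return best_acc, best_weights
```

Running the sweep only on a manifest that was never used for training keeps the tuned weights from fitting the training queries.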
## 2026-06-02
### Stage: Documentation completion + minimal runnable ACR pipeline
......
......@@ -97,6 +97,26 @@ flowchart LR
---
## 4.2 Retrieval fusion parameter diagram
```mermaid
flowchart LR
A[Chromaprint Score] --> D[Fused Score]
B[ECAPA Score] --> D
C[Melody Score] --> D
```
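| Parameter | Default | Better value in current validation (fast-eval) | Meaning |
|---|---:|---:|---|
| `chroma_weight` | 0.25 | 0.20 | Reduces the dominance of pure fingerprint matching |
| `ecapa_weight` | 0.50 | 0.55 | Increases the dominance of embedding retrieval |
| `melody_weight` | 0.25 | 0.25 | Unchanged for now |
Notes:
- The repository already supports passing fusion parameters directly to `evaluate.py`
- For personal use, it is recommended to pin a portion of the open datasets as a **fusion tuning eval set**
- This decouples training, retrieval, and tuning, so tuning does not require retraining every time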
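
To make the diagram and table concrete, here is a minimal sketch of how three per-candidate scores might be combined. It assumes a plain weighted sum and assumes that the melody term is simply skipped when no melody scores are supplied. The function `fuse_scores` is hypothetical, and the actual `HybridEngine` logic is not shown in this diff and may differ.

```python
def fuse_scores(chroma, ecapa, melody=None, *,
                chroma_weight=0.25, ecapa_weight=0.50, melody_weight=0.25):
    """Rank candidate song ids by a weighted sum of component scores.

    Each argument maps song_id -> score; a missing score counts as 0.0.
    Passing melody=None drops the melody term (an assumed analogue of fast-eval).
    """
    candidates = set(chroma) | set(ecapa) | set(melody or {})
    fused = {}
    for song_id in candidates:
        score = (chroma_weight * chroma.get(song_id, 0.0)
                 + ecapa_weight * ecapa.get(song_id, 0.0))
        if melody is not None:
            score += melody_weight * melody.get(song_id, 0.0)
        fused[song_id] = score
    # Highest fused score first; ties are broken by song id for determinism.
    return sorted(fused, key=lambda song_id: (-fused[song_id], song_id))


ranking = fuse_scores(
    {"song_a": 0.9, "song_b": 0.2},   # fingerprint scores
    {"song_a": 0.1, "song_b": 0.8},   # embedding scores
    chroma_weight=0.20, ecapa_weight=0.55,
)
```

With these illustrative inputs the embedding score dominates, so `song_b` ranks ahead of `song_a`.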
## 5. Explanatory notes
### 5.1 Why catalog and query must be separated
......@@ -119,10 +139,12 @@ flowchart LR
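- Simple oversampling degrades overall accuracy
- type-aware weighting improves some hard cases
- The confused class needs a higher weight, but too strong a bias hurts `humming_like` in turn
- Residual confused failures tend to concentrate in `intro` segments, so `segment_type` is not just metadata; it should also feed into later hard-negative design
- The dataset spec must therefore keep the `type` field so that the following work remains possible:
  - confusion-aware negative mining
  - melody-aware reranking
  - dual-branch hard-case curriculum
  - segment-type-aware hard negatives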
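
One possible reading of "segment-type-aware hard negatives" is to prefer negatives that come from a different song but share the anchor's segment type, for example `intro` against `intro`. The sketch below illustrates that reading only. The record fields `song_id` and `segment_type` and the function `sample_hard_negatives` are assumptions, not the repository's API.

```python
import random


def sample_hard_negatives(anchor, pool, k=2, seed=0):
    """Pick k negatives for an anchor, preferring the same segment type.

    Assumption: every record is a dict with "song_id" and "segment_type" keys.
    """
    rng = random.Random(seed)
    # Negatives must come from a different song than the anchor.
    others = [r for r in pool if r["song_id"] != anchor["song_id"]]
    same_type = [r for r in others if r["segment_type"] == anchor["segment_type"]]
    diff_type = [r for r in others if r["segment_type"] != anchor["segment_type"]]
    rng.shuffle(same_type)
    rng.shuffle(diff_type)
    # Same-type negatives come first; other types only fill the remainder.
    return (same_type + diff_type)[:k]
```

Falling back to other segment types keeps the batch size fixed when few same-type negatives exist.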
---
......
......@@ -119,6 +119,10 @@ flowchart LR
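3. For music ACR, `confused` and `humming_like` do not share the same source of difficulty:
   - `confused` leans toward timbre / arrangement / retrieval ambiguity
   - `humming_like` leans toward melody / pitch contour mismatch
4. The residual confused failures in the current repository further show that:
   - `intro` segments are a higher-risk region
   - The next step should introduce `segment_type-aware hard negatives`
   - This is closer to an industrially effective path than continuing to tune the global sample ratio
## 5. Are there already better approaches in 2026?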
......