Make retrieval fusion tuning reproducible for fast evaluation
Constraint: Need fresh, like-for-like evidence on stable v6 assets before changing defaults
Rejected: More training-weight tuning | v7 and v8 regressed hard-case and overall accuracy
Confidence: high
Scope-risk: narrow
Directive: Use open datasets as separate train/eval assets and tune fusion on held-out eval manifests before retraining
Tested:
  /usr/local/miniconda3/bin/python -m py_compile evaluate.py
  /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval
  /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --chroma-weight 0.2 --ecapa-weight 0.55 --melody-weight 0.25 --output-json reports/smoke-v6/synthetic_v2/eval-fusion-tuned.json
Not-tested: Full melody-enabled sweep across multiple weight grids and real external datasets
Showing 5 changed files with 137 additions and 1 deletion
| ... | @@ -25,6 +25,9 @@ def main(): | ... | @@ -25,6 +25,9 @@ def main(): |
| 25 | parser.add_argument("--device", default="cpu") | 25 | parser.add_argument("--device", default="cpu") |
| 26 | parser.add_argument("--output-json", default=None) | 26 | parser.add_argument("--output-json", default=None) |
| 27 | parser.add_argument("--fast-eval", action="store_true") | 27 | parser.add_argument("--fast-eval", action="store_true") |
| 28 | parser.add_argument("--chroma-weight", type=float, default=0.25) | ||
| 29 | parser.add_argument("--ecapa-weight", type=float, default=0.5) | ||
| 30 | parser.add_argument("--melody-weight", type=float, default=0.25) | ||
| 28 | args = parser.parse_args() | 31 | args = parser.parse_args() |
| 29 | 32 | ||
| 30 | data_dir = Path(args.data) | 33 | data_dir = Path(args.data) |
| ... | @@ -34,7 +37,16 @@ def main(): | ... | @@ -34,7 +37,16 @@ def main(): |
| 34 | ref_embs = np.load(f"{args.index_prefix}_embs.npy") | 37 | ref_embs = np.load(f"{args.index_prefix}_embs.npy") |
| 35 | ref_ids = np.load(f"{args.index_prefix}_ids.npy", allow_pickle=True).tolist() | 38 | ref_ids = np.load(f"{args.index_prefix}_ids.npy", allow_pickle=True).tolist() |
| 36 | 39 | ||
| 37 | engine = HybridEngine(matcher, embedder, ref_embs, ref_ids, disable_melody=args.fast_eval) | 40 | engine = HybridEngine( |
| 41 | matcher, | ||
| 42 | embedder, | ||
| 43 | ref_embs, | ||
| 44 | ref_ids, | ||
| 45 | chroma_weight=args.chroma_weight, | ||
| 46 | ecapa_weight=args.ecapa_weight, | ||
| 47 | melody_weight=args.melody_weight, | ||
| 48 | disable_melody=args.fast_eval, | ||
| 49 | ) | ||
| 38 | for split in ["train.json", "val.json", "test.json"]: | 50 | for split in ["train.json", "val.json", "test.json"]: |
| 39 | p = data_dir / split | 51 | p = data_dir / split |
| 40 | if p.exists(): | 52 | if p.exists(): | ... | ... |
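
The hunk above passes three weights into `HybridEngine`, whose internals are not part of this diff. As a minimal sketch of what a weighted late fusion of this kind typically looks like, the hypothetical `fuse_scores` helper below combines per-candidate scores linearly. The function names, the dict-based score layout, and the renormalization over the enabled branches are assumptions for illustration, not the repository's implementation.

```python
from typing import Dict, List, Optional


def fuse_scores(
    chroma: Dict[str, float],
    ecapa: Dict[str, float],
    melody: Optional[Dict[str, float]] = None,
    chroma_weight: float = 0.25,
    ecapa_weight: float = 0.5,
    melody_weight: float = 0.25,
) -> Dict[str, float]:
    """Hypothetical linear fusion of per-candidate scores (higher is better)."""
    weights = {"chroma": chroma_weight, "ecapa": ecapa_weight}
    sources = {"chroma": chroma, "ecapa": ecapa}
    if melody is not None:  # assumption: a disabled branch is simply skipped
        weights["melody"] = melody_weight
        sources["melody"] = melody

    total = sum(weights.values())
    if total <= 0:
        raise ValueError("fusion weights must sum to a positive value")

    candidates = set().union(*(s.keys() for s in sources.values()))
    fused = {}
    for cand in candidates:
        # A candidate missing from one branch contributes 0 for that branch.
        fused[cand] = sum(
            weights[name] / total * sources[name].get(cand, 0.0)
            for name in sources
        )
    return fused


def rank(fused: Dict[str, float], k: int = 5) -> List[str]:
    """Return the top-k candidate ids, best first (ties broken by id)."""
    return sorted(fused, key=lambda c: (-fused[c], c))[:k]


# Toy scores: chroma prefers song_0001, ecapa prefers song_0002.
chroma = {"song_0001": 0.9, "song_0002": 0.2}
ecapa = {"song_0001": 0.4, "song_0002": 0.7}

default_rank = rank(fuse_scores(chroma, ecapa))
tuned_rank = rank(fuse_scores(chroma, ecapa, chroma_weight=0.2, ecapa_weight=0.55))
print(default_rank, tuned_rank)
```

In this toy case the small shift from `chroma` to `ecapa` is enough to swap the top candidate. Dividing by the weight total rescales every candidate by the same factor, so it does not change the ranking; it only keeps fused scores on a comparable scale when a branch is disabled.
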
| 1 | { | ||
| 2 | "split": "test", | ||
| 3 | "num_queries": 20, | ||
| 4 | "top1": 0.7, | ||
| 5 | "topk": 0.95, | ||
| 6 | "by_type": { | ||
| 7 | "clean": { | ||
| 8 | "n": 8, | ||
| 9 | "top1": 1.0, | ||
| 10 | "topk": 1.0 | ||
| 11 | }, | ||
| 12 | "augmented": { | ||
| 13 | "n": 4, | ||
| 14 | "top1": 0.75, | ||
| 15 | "topk": 1.0 | ||
| 16 | }, | ||
| 17 | "humming_like": { | ||
| 18 | "n": 4, | ||
| 19 | "top1": 0.5, | ||
| 20 | "topk": 1.0 | ||
| 21 | }, | ||
| 22 | "confused": { | ||
| 23 | "n": 4, | ||
| 24 | "top1": 0.25, | ||
| 25 | "topk": 0.75 | ||
| 26 | } | ||
| 27 | }, | ||
| 28 | "hard_case_summary": { | ||
| 29 | "humming_like": { | ||
| 30 | "n": 4, | ||
| 31 | "top1": 0.5, | ||
| 32 | "topk": 1.0 | ||
| 33 | }, | ||
| 34 | "confused": { | ||
| 35 | "n": 4, | ||
| 36 | "top1": 0.25, | ||
| 37 | "topk": 0.75 | ||
| 38 | } | ||
| 39 | }, | ||
| 40 | "sample_failures": [ | ||
| 41 | { | ||
| 42 | "truth": "song_0023", | ||
| 43 | "query": "segments/song_0023_seg_04_confused.wav", | ||
| 44 | "type": "confused", | ||
| 45 | "preds": [ | ||
| 46 | "song_0006", | ||
| 47 | "song_0002", | ||
| 48 | "song_0001", | ||
| 49 | "song_0019", | ||
| 50 | "song_0022" | ||
| 51 | ] | ||
| 52 | } | ||
| 53 | ] | ||
| 54 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
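
The overall figures in the report above should equal the query-weighted average of the per-type figures. A short sketch that checks this, with the numbers copied from the report; the `weighted_average` helper is illustrative and not part of the repository.

```python
import math

# Per-type results copied from the JSON report above.
by_type = {
    "clean": {"n": 8, "top1": 1.0, "topk": 1.0},
    "augmented": {"n": 4, "top1": 0.75, "topk": 1.0},
    "humming_like": {"n": 4, "top1": 0.5, "topk": 1.0},
    "confused": {"n": 4, "top1": 0.25, "topk": 0.75},
}


def weighted_average(groups, metric):
    """Average a per-group metric, weighting each group by its query count."""
    total = sum(g["n"] for g in groups.values())
    return sum(g["n"] * g[metric] for g in groups.values()) / total


num_queries = sum(g["n"] for g in by_type.values())
overall_top1 = weighted_average(by_type, "top1")
overall_topk = weighted_average(by_type, "topk")

# These match "num_queries", "top1", and "topk" at the top of the report.
assert num_queries == 20
assert math.isclose(overall_top1, 0.7)
assert math.isclose(overall_topk, 0.95)
```

With only 4 queries per hard-case type, one query moves that type's top1 by 0.25, so a change such as 0.25 to 0.50 corresponds to a single additional correct query.
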
| ... | @@ -28,6 +28,50 @@ | ... | @@ -28,6 +28,50 @@ |
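| 28 | - `confused top1` rose from 0.00 to 0.25, showing that sample-level weighting works | 28 | - `confused top1` rose from 0.00 to 0.25, showing that sample-level weighting works |
| 29 | - `humming_like top1` fell back from 0.50 to 0.25, showing that the two hard-case types need to be handled separately rather than by single-axis weighting alone | 29 | - `humming_like top1` fell back from 0.50 to 0.25, showing that the two hard-case types need to be handled separately rather than by single-axis weighting alone |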
| 30 | 30 | ||
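| 31 | ### Stage: v7 balanced-sampling trial (not kept) | ||
| 32 | |||
| 33 | Completed: | ||
| 34 | - Tried lowering the confused bias and raising the humming_like sampling strength | ||
| 35 | - Re-ran the full `smoke-v7` pipeline for validation | ||
| 36 | - Traced the segment distribution of residual hard cases back from the failed samples | ||
| 37 | |||
| 38 | Validation results: | ||
| 39 | - `smoke-v7` results regressed to: | ||
| 40 | - overall top1=0.55, top5=0.80 | ||
| 41 | - humming_like top1=0.00 | ||
| 42 | - confused top1=0.00 | ||
| 43 | - Because the results clearly regressed, the trial was rolled back and is not kept as a mainline version | ||
| 44 | |||
| 45 | Conclusions: | ||
| 46 | - Simply retuning the global sampling ratio is unstable | ||
| 47 | - The best retained checkpoint is still `smoke-v6` | ||
| 48 | - Residual confused failures fall mainly on `intro` segments; the next round should switch to `segment_type-aware hard negatives` | ||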
| 49 | |||
| 50 | ### Stage: Parameterized retrieval fusion weights + fast-eval tuning | ||
| 51 | |||
| 52 | Completed: | ||
| 53 | - Added fusion parameters to `evaluate.py`: | ||
| 54 | - `--chroma-weight` | ||
| 55 | - `--ecapa-weight` | ||
| 56 | - `--melody-weight` | ||
| 57 | - Ran a fresh fast-eval comparison on the stable `models_v6 + index_v6` assets | ||
| 58 | - Checked whether retrieval-time fusion tuning is more effective than further changes to training weights | ||
| 59 | |||
| 60 | Validation results: | ||
| 61 | - Default fast-eval: | ||
| 62 | - overall top1=0.65 | ||
| 63 | - humming_like top1=0.25 | ||
| 64 | - confused top1=0.25 | ||
| 65 | - After switching to `chroma=0.2 / ecapa=0.55 / melody=0.25`: | ||
| 66 | - overall top1=0.70 | ||
| 67 | - humming_like top1=0.50 | ||
| 68 | - confused top1=0.25 | ||
| 69 | |||
| 70 | Conclusions: | ||
| 71 | - At this stage, **retrieval fusion tuning** is more stable and more cost-effective than further tuning of training-side weights | ||
| 72 | - Raising the `ecapa` weight slightly and lowering `chroma` slightly recovers `humming_like` while holding `confused` steady | ||
| 73 | - The next stage should wire the external open datasets into real train/eval manifests rather than stopping at bootstrap | ||
| 74 | |||
| 31 | ## 2026-06-02 | 75 | ## 2026-06-02 |
| 32 | 76 | ||
| 33 | ### Stage: Documentation completion + minimal runnable ACR pipeline | 77 | ### Stage: Documentation completion + minimal runnable ACR pipeline | ... | ... |
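
The tuning stage above compares only two weight settings. A wider sweep could be scripted along the following lines. This is a sketch that only builds command lines from the flags shown in this commit; the 0.05 grid step, the fixed melody weight, the sum-to-one constraint, and the `reports/sweep` output directory are assumptions.

```python
import shlex

# Base command taken from the "Tested:" lines of the commit message,
# with the interpreter path shortened to "python".
BASE_CMD = [
    "python", "evaluate.py",
    "--data", "data/synthetic_v2",
    "--model", "data/models_v6/best_model.pt",
    "--index-prefix", "data/index_v6/reference",
    "--split", "test",
    "--device", "cpu",
    "--fast-eval",
]


def weight_grid(step_percent=5, melody_percent=25):
    """Yield (chroma, ecapa, melody) weights that sum to 1, melody held fixed."""
    remaining = 100 - melody_percent
    for chroma_percent in range(step_percent, remaining, step_percent):
        ecapa_percent = remaining - chroma_percent
        yield chroma_percent / 100, ecapa_percent / 100, melody_percent / 100


def build_commands(out_dir="reports/sweep"):  # hypothetical output directory
    """Yield one evaluate.py command (as an argument list) per grid point."""
    for chroma, ecapa, melody in weight_grid():
        tag = f"c{chroma:.2f}-e{ecapa:.2f}-m{melody:.2f}"
        yield BASE_CMD + [
            "--chroma-weight", f"{chroma:.2f}",
            "--ecapa-weight", f"{ecapa:.2f}",
            "--melody-weight", f"{melody:.2f}",
            "--output-json", f"{out_dir}/eval-{tag}.json",
        ]


grid = list(weight_grid())
commands = list(build_commands())
for cmd in commands:
    print(shlex.join(cmd))
```

Working in integer percentages avoids floating-point drift in the grid. To actually run the sweep, each argument list can be passed to `subprocess.run(cmd, check=True)`; the grid includes both the default `0.25 / 0.50 / 0.25` and the tuned `0.20 / 0.55 / 0.25` settings.
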
| ... | @@ -97,6 +97,26 @@ flowchart LR | ... | @@ -97,6 +97,26 @@ flowchart LR |
| 97 | 97 | ||
| 98 | --- | 98 | --- |
| 99 | 99 | ||
| 100 | ## 4.2 Retrieval fusion parameter diagram | ||
| 101 | |||
| 102 | ```mermaid | ||
| 103 | flowchart LR | ||
| 104 | A[Chromaprint Score] --> D[Fused Score] | ||
| 105 | B[ECAPA Score] --> D | ||
| 106 | C[Melody Score] --> D | ||
| 107 | ``` | ||
| 108 | |||
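| 109 | | Parameter | Default | Best validated value so far (fast-eval) | Meaning | | ||
| 110 | |---|---:|---:|---| | ||
| 111 | | `chroma_weight` | 0.25 | 0.20 | Reduce pure-fingerprint dominance | | ||
| 112 | | `ecapa_weight` | 0.50 | 0.55 | Increase embedding-retrieval dominance | | ||
| 113 | | `melody_weight` | 0.25 | 0.25 | Unchanged for now | | ||
| 114 | |||
| 115 | Notes: | ||
| 116 | - The repository now supports passing fusion parameters directly to `evaluate.py` | ||
| 117 | - For personal use, we recommend pinning part of the open datasets as a **fusion tuning eval set** | ||
| 118 | - This keeps training, retrieval, and tuning separate, so tuning does not require retraining every time | ||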
| 119 | |||
| 100 | ## 5. Written explanation | 120 | ## 5. Written explanation |
| 101 | 121 | ||
| 102 | ### 5.1 Why catalog and query must be separated | 122 | ### 5.1 Why catalog and query must be separated |
| ... | @@ -119,10 +139,12 @@ flowchart LR | ... | @@ -119,10 +139,12 @@ flowchart LR |
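| 119 | - Naive oversampling causes overall regression | 139 | - Naive oversampling causes overall regression |
| 120 | - type-aware weighting improves some hard cases | 140 | - type-aware weighting improves some hard cases |
| 121 | - The confused class needs a higher weight, but too strong a bias hurts `humming_like` in turn | 141 | - The confused class needs a higher weight, but too strong a bias hurts `humming_like` in turn |
| 142 | - Residual confused failures tend to cluster in `intro` segments, so `segment_type` is not just metadata and should also inform later hard-negative design | ||
| 122 | - The dataset spec must therefore keep the `type` field so that later work can continue on: | 143 | - The dataset spec must therefore keep the `type` field so that later work can continue on: |
| 123 | - confusion-aware negative mining | 144 | - confusion-aware negative mining |
| 124 | - melody-aware reranking | 145 | - melody-aware reranking |
| 125 | - dual-branch hard-case curriculum | 146 | - dual-branch hard-case curriculum |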
| 147 | - segment-type-aware hard negatives | ||
| 126 | 148 | ||
| 127 | --- | 149 | --- |
| 128 | 150 | ... | ... |
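
The list above names segment-type-aware hard negatives as a follow-up without describing how they would be built. One possible reading, sketched below, is to draw negatives from other songs that share the anchor's segment type, so that an `intro` query is contrasted against other intros. The manifest layout (`song_id`, `segment_type`, `path`), the file paths, and the `sample_hard_negatives` helper are all hypothetical and do not come from the repository.

```python
import random
from typing import Dict, List

Item = Dict[str, str]


def sample_hard_negatives(
    manifest: List[Item], anchor: Item, k: int = 2, seed: int = 0
) -> List[Item]:
    """Pick up to k items from other songs with the anchor's segment_type."""
    pool = [
        item
        for item in manifest
        if item["segment_type"] == anchor["segment_type"]
        and item["song_id"] != anchor["song_id"]
    ]
    rng = random.Random(seed)  # seeded so that sampling is reproducible
    return rng.sample(pool, min(k, len(pool)))


# Toy manifest with made-up paths, for illustration only.
manifest = [
    {"song_id": "song_0023", "segment_type": "intro", "path": "a.wav"},
    {"song_id": "song_0006", "segment_type": "intro", "path": "b.wav"},
    {"song_id": "song_0002", "segment_type": "intro", "path": "c.wav"},
    {"song_id": "song_0002", "segment_type": "chorus", "path": "d.wav"},
    {"song_id": "song_0023", "segment_type": "chorus", "path": "e.wav"},
]

anchor = manifest[0]
negatives = sample_hard_negatives(manifest, anchor)
print([n["path"] for n in negatives])
```

Filtering on `song_id` keeps other segments of the same song out of the negative pool, which would otherwise penalize correct matches.
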
| ... | @@ -119,6 +119,10 @@ flowchart LR | ... | @@ -119,6 +119,10 @@ flowchart LR |
| 119 | 3. For music ACR, `confused` and `humming_like` do not share the same source of difficulty: | 119 | 3. For music ACR, `confused` and `humming_like` do not share the same source of difficulty: |
| 120 | - `confused` leans toward timbre / arrangement / retrieval ambiguity | 120 | - `confused` leans toward timbre / arrangement / retrieval ambiguity |
| 121 | - `humming_like` leans toward melody / pitch contour mismatch | 121 | - `humming_like` leans toward melody / pitch contour mismatch |
| 122 | 4. Residual confused failures in the current repository further show: | ||
| 123 | - `intro` segments are a higher-risk region | ||
| 124 | - The next step should introduce `segment_type-aware hard negatives` | ||
| 125 | - This is closer to what works in industrial practice than continuing to tune the global sample ratio | ||
| 122 | 126 | ||
| 123 | ## 5. Is there already a better approach in 2026? | 127 | ## 5. Is there already a better approach in 2026? |
| 124 | 128 | ... | ... |