explain the v5-v6 hard-case split with source-backed evidence

Constraint: The handoff must convert baseline metrics into an actionable causal explanation without staging report artifacts
Rejected: Start a new weighting experiment immediately | Source-backed explanation of the existing split is cheaper and reduces blind iteration risk
Confidence: high
Scope-risk: narrow
Directive: Treat dual-axis hard-case weighting as the next design lane, using v6 as the base and v5 as the humming_like reference
Tested: Verified source-backed v5/v6 definitions from changelog and smoke-v6 config artifacts, then updated handoff/changelog docs
Not-tested: A new merged weighting strategy or its downstream metric impact
Showing 5 changed files with 102 additions and 32 deletions
| ... | @@ -74,23 +74,21 @@ | ... | @@ -74,23 +74,21 @@ |
| 74 | 74 | ||
| 75 | ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02) | 75 | ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02) |
| 76 | 76 | ||
| 77 | ### Latest snapshot (15:45 UTC) | 77 | ### Latest snapshot (15:46 UTC) |
| 78 | 78 | ||
| 79 | - Remote sync baseline: `d4961b1` (before this update) | 79 | - Remote sync baseline: `93dfa15` (before this update) |
| 80 | - Most important new evidence: **the most reasonable baseline for the next round of hard-case optimization has been determined**. | 80 | - Most important new evidence: **the source of the hard-case difference between `v5` and `v6` has now been explained**. |
| 81 | - Baseline sweep conclusions: | 81 | - `v5` = `type-aware hard-case weighting`: |
| 82 | - `v6` is best overall: `top1=0.65`, `topk=0.95` | 82 | - `humming_like top1=0.50` |
| 83 | - `v5` is stronger on `humming_like`: `top1=0.5` | 83 | - `confused top1=0.00` |
| 84 | - Breakdown: | 84 | - `v6` = `sample-level confused-priority weighting`: |
| 85 | - `v3`: `hum=0.0`, `conf=0.25` | 85 | - `humming_like top1=0.25` |
| 86 | - `v4`: `hum=0.0`, `conf=0.0` | 86 | - `confused top1=0.25` |
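| 87 | - `v5`: `hum=0.5`, `conf=0.0` | 87 | - This shows that the most worthwhile next step is not another blind sweep, but designing a dual-axis strategy that controls `humming_like` and `confused` separately. |
| 88 | - `v6`: `hum=0.25`, `conf=0.25` | ||
| 89 | - This shows that the next round of optimization should not change things blindly; it should use `v6` as the overall primary baseline and selectively absorb `v5`'s advantage on `humming_like`. | ||
| 90 | - Next events worth committing: | 88 | - Next events worth committing: |
| 91 | 1. Results of the `v5` vs `v6` difference audit | 89 | 1. A dual-axis hard-case weighting / sampling scheme is implemented |
| 92 | 2. Hard-case metric improvement after the merged experiment | 90 | 2. Its hard-case metric improvement over `v6` |
| 93 | 3. Dual-track (real-path clean + synthetic hard-case) retest results | 91 | 3. Dual-track regression results are stable |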
| 94 | 92 | ||
| 95 | 93 | ||
| 96 | ## 6. High-risk caveats | 94 | ## 6. High-risk caveats | ... | ... |
| 1 | ## 2026-06-02 15:46 UTC / v5-v6 hard-case difference is now causally explained | ||
| 2 | |||
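| 3 | - Based on the historical experiment records in the repository, filled in the explanation of where the `v5` vs `v6` hard-case performance difference comes from | ||
| 4 | - fresh evidence: | ||
| 5 | - `docs/CHANGELOG.md:2954+` explicitly records that `v6` is **sample-level confused-priority weighting** | ||
| 6 | - `docs/CHANGELOG.md:6805+` explicitly records that `v5` is **type-aware hard-case weighting** | ||
| 7 | - `docs/sota-research-2026.md:113-114` already summarizes the difference between the two: | ||
| 8 | - `v5` improves `humming_like` but makes no breakthrough on `confused` | ||
| 9 | - `v6` lifts `confused`, but `humming_like` regresses | ||
| 10 | - Current actionable conclusions: | ||
| 11 | - `v5`'s advantage comes mainly from **type-aware weighting**, which gives `humming_like` a gentler boost | ||
| 12 | - `v6`'s advantage comes mainly from **sample-level confused-priority weighting**, which gives `confused` a targeted lift | ||
| 13 | - The next round should not redo a blind sweep; it should design a **dual-axis hard-case weighting / divide-and-conquer strategy** | ||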
| 14 | |||
| 1 | ## 2026-06-02 15:45 UTC / hard-case baseline sweep pinned the next optimization baseline | 15 | ## 2026-06-02 15:45 UTC / hard-case baseline sweep pinned the next optimization baseline |
| 2 | 16 | ||
| 3 | - Ran one unified hard-case evaluation sweep of the existing `v3~v6` baselines on `data/synthetic_v2` | 17 | - Ran one unified hard-case evaluation sweep of the existing `v3~v6` baselines on `data/synthetic_v2` | ... | ... |
| ... | @@ -248,3 +248,27 @@ | ... | @@ -248,3 +248,27 @@ |
| 248 | 248 | ||
| 249 | - The most reasonable baseline for the next round of experiments is currently `v6`, because it is the most stable overall. | 249 | - The most reasonable baseline for the next round of experiments is currently `v6`, because it is the most stable overall. |
| 250 | - However, `v5` is clearly stronger on `humming_like` and is worth a targeted diff / absorption. | 250 | - However, `v5` is clearly stronger on `humming_like` and is worth a targeted diff / absorption. |
| 251 | |||
| 252 | ## Additional delivery in this update (2026-06-02 15:46 UTC) | ||
| 253 | |||
| 254 | ### New difference-audit evidence | ||
| 255 | |||
| 256 | | Category | Content | | ||
| 257 | |---|---| | ||
| 258 | | v5 origin | `type-aware hard-case weighting` | | ||
| 259 | | v6 origin | `sample-level confused-priority weighting` | | ||
| 260 | | Explanation | `v5` favors `humming_like`; `v6` favors `confused` | | ||
| 261 | | Decision | The next round should use dual-axis hard-case weighting / divide-and-conquer rather than single-axis weighting | | ||
| 262 | |||
| 263 | ### Most important fresh evidence | ||
| 264 | |||
| 265 | - `docs/CHANGELOG.md:2954+`: `v6` = sample-level confused-priority weighting | ||
| 266 | - `docs/CHANGELOG.md:6805+`: `v5` = type-aware hard-case weighting | ||
| 267 | - `docs/sota-research-2026.md:113-114`: | ||
| 268 | - `v5`: `overall=0.60`, `humming_like=0.50`, `confused=0.00` | ||
| 269 | - `v6`: `overall=0.65`, `humming_like=0.25`, `confused=0.25` | ||
| 270 | |||
| 271 | ### Conclusion | ||
| 272 | |||
| 273 | - We now know not only which of `v5/v6` is stronger, but also why. | ||
| 274 | - The next round should model or weight `humming_like` and `confused` separately. | ... | ... |
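The trade-off recorded in this file can be checked mechanically. The following is a minimal sketch that takes only the `v5`/`v6` numbers quoted above from `docs/sota-research-2026.md:113-114` and reports which variant leads on each metric; the `REPORTED` dictionary and the `best_per_axis` helper are illustrative names, not part of the repository.

```python
# Metrics copied from the diff above; the dictionary layout is an assumption
# made for illustration, not the repository's own data format.
REPORTED = {
    "v5": {"overall": 0.60, "humming_like": 0.50, "confused": 0.00},
    "v6": {"overall": 0.65, "humming_like": 0.25, "confused": 0.25},
}


def best_per_axis(metrics):
    """Return, for each metric name, the variant with the highest score."""
    axes = next(iter(metrics.values())).keys()
    return {axis: max(metrics, key=lambda v: metrics[v][axis]) for axis in axes}


winners = best_per_axis(REPORTED)
for axis, variant in winners.items():
    print(f"{axis}: {variant}")
```

No single variant leads on every axis, which is the observation that motivates controlling `humming_like` and `confused` separately.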
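| 1 | ## Additional update to this delivery package (2026-06-02 15:46 UTC) | ||
| 2 | |||
| 3 | ### Delivery conclusion | ||
| 4 | |||
| 5 | The latest milestone has moved from "deciding whether v6 or v5 is the better baseline" to **explaining clearly why they behave the way they do**: | ||
| 6 | - Remote baseline is currently `93dfa15` (before this update) | ||
| 7 | - The key mechanism of `v5` is `type-aware hard-case weighting` | ||
| 8 | - The key mechanism of `v6` is `sample-level confused-priority weighting` | ||
| 9 | - So the most reasonable next step is not another blind sweep, but a dual-axis divide-and-conquer strategy for `humming_like` and `confused` | ||
| 10 | |||
| 11 | ### Latest facts | ||
| 12 | |||
| 13 | #### Source of the v5 / v6 difference | ||
| 14 | - `v5`: | ||
| 15 | - Location in history: `docs/CHANGELOG.md:6805+` | ||
| 16 | - Definition: `type-aware hard-case weighting` | ||
| 17 | - Result: `humming_like top1=0.50`, `confused top1=0.00` | ||
| 18 | - `v6`: | ||
| 19 | - Location in history: `docs/CHANGELOG.md:2954+` | ||
| 20 | - Definition: `sample-level confused-priority weighting` | ||
| 21 | - Result: `humming_like top1=0.25`, `confused top1=0.25` | ||
| 22 | - Summary explanation: `docs/sota-research-2026.md:113-114` | ||
| 23 | |||
| 24 | ### Current assessment | ||
| 25 | |||
| 26 | - The difference between `v5` and `v6` is now explainable; it is no longer a black-box empirical difference. | ||
| 27 | - The most worthwhile work for the next phase is: | ||
| 28 | 1. Design dual-axis hard-case weighting; | ||
| 29 | 2. Control `humming_like` and `confused` separately; | ||
| 30 | 3. Then run regression tests with the existing dual-track validation chain. | ||
| 31 | |||
| 32 | --- | ||
| 33 | |||
| 1 | ## Additional update to this delivery package (2026-06-02 15:45 UTC) | 34 | ## Additional update to this delivery package (2026-06-02 15:45 UTC) |
| 2 | 35 | ||
| 3 | ### Delivery conclusion | 36 | ### Delivery conclusion | ... | ... |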
| ... | @@ -5,23 +5,24 @@ | ... | @@ -5,23 +5,24 @@ |
| 5 | 5 | ||
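| 6 | ## One-page summary | 6 | ## One-page summary |
| 7 | 7 | ||
| 8 | ### Latest delivery snapshot (2026-06-02 15:45 UTC) | 8 | ### Latest delivery snapshot (2026-06-02 15:46 UTC) |
| 9 | 9 | ||
| 10 | - Current remote sync baseline: `d4961b1` (before this update) | 10 | - Current remote sync baseline: `93dfa15` (before this update) |
| 11 | - Most important new fact: **the most reasonable baseline for the next round of hard-case optimization has been determined** | 11 | - Most important new fact: **the source of the hard-case difference between `v5` and `v6` has now been explained** |
| 12 | - Baseline sweep conclusions: | 12 | - `v5`: `type-aware hard-case weighting` |
| 13 | - `v6`: best overall, `top1=0.65`, `topk=0.95` | 13 | - `humming_like top1=0.50` |
| 14 | - `v5`: best on `humming_like`, `top1=0.5` | 14 | - `confused top1=0.00` |
| 15 | - Breakdown: | 15 | - `v6`: `sample-level confused-priority weighting` |
| 16 | - `v3`: `hum=0.0`, `conf=0.25` | 16 | - `humming_like top1=0.25` |
| 17 | - `v4`: `hum=0.0`, `conf=0.0` | 17 | - `confused top1=0.25` |
| 18 | - `v5`: `hum=0.5`, `conf=0.0` | 18 | - Conclusions: |
| 19 | - `v6`: `hum=0.25`, `conf=0.25` | 19 | - `v5` leans toward improving `humming_like` |
| 20 | - Conclusion: the next phase should not blindly redo a large-scale clean smoke run; it should use `v6` as the primary baseline and selectively absorb `v5`'s `humming_like` advantage. | 20 | - `v6` leans toward improving `confused` |
| 21 | - The next round should design a dual-axis hard-case weighting / divide-and-conquer strategy instead of continuing with single-axis weighting | ||
| 21 | - Top priorities for the new session: | 22 | - Top priorities for the new session: |
| 22 | 1. Audit the differences between `v5` and `v6` | 23 | 1. Design weighting or sampling that controls `humming_like` / `confused` separately |
| 23 | 2. Design a merged experiment combining "v6 overall + v5 humming_like advantage" | 24 | 2. Reuse the existing `v6` primary baseline for a minimal-change experiment |
| 24 | 3. Retest on both tracks: real-path clean + synthetic hard-case | 25 | 3. Run dual-track regression: real-path clean + synthetic hard-case |
| 25 | 26 | ||
| 26 | ### Latest observability fix (2026-06-02 15:18 UTC) | 27 | ### Latest observability fix (2026-06-02 15:18 UTC) |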
| 27 | 28 | ... | ... |
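The docs in this commit recommend a dual-axis scheme but, per the commit message, no merged weighting strategy has been built or tested yet. As a rough illustration of what "controlling `humming_like` and `confused` separately" could mean, here is a minimal sketch. Everything in it is an assumption: the `sample_weights` function, the per-sample `hard_type` tags, and the example weight values are hypothetical and are not taken from the repository's `v5` or `v6` configs.

```python
def sample_weights(hard_types, humming_like_weight, confused_weight,
                   default_weight=1.0):
    """Assign one training weight per sample with two independent knobs.

    hard_types: list of tags such as "humming_like", "confused", or anything
    else (treated as a non-hard sample). The tag names mirror the categories
    in the docs; the weighting rule itself is a hypothetical illustration.
    """
    if not hard_types:
        return []
    table = {
        "humming_like": humming_like_weight,  # axis 1, tuned on its own
        "confused": confused_weight,          # axis 2, tuned on its own
    }
    raw = [table.get(tag, default_weight) for tag in hard_types]
    # Normalise to mean 1.0 so the overall loss scale stays comparable
    # when either knob is changed.
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Example with made-up weights: boost humming_like more than confused.
tags = ["clean", "humming_like", "confused", "clean"]
weights = sample_weights(tags, humming_like_weight=3.0, confused_weight=2.0)
print([round(w, 3) for w in weights])
```

Keeping the two weights as separate parameters lets one axis be tuned while the other stays fixed, which a single shared hard-case weight cannot do. Whether this improves on `v6` is untested.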