explain the v5-v6 hard-case split with source-backed evidence

Constraint: The handoff must convert baseline metrics into an actionable causal explanation without staging report artifacts
Rejected: Start a new weighting experiment immediately | Source-backed explanation of the existing split is cheaper and reduces blind iteration risk
Confidence: high
Scope-risk: narrow
Directive: Treat dual-axis hard-case weighting as the next design lane, using v6 as the base and v5 as the humming_like reference
Tested: Verified source-backed v5/v6 definitions from changelog and smoke-v6 config artifacts, then updated handoff/changelog docs
Not-tested: A new merged weighting strategy or its downstream metric impact

cnb.bofCdSsphPA
Commit 7812b589 ... 7812b5892c283c1887f976ce51af75a6671c388c authored 2026-06-02 23:49:10 +0800 by cnb.bofCdSsphPA
Showing 5 changed files with 102 additions and 32 deletions
AGENT.md
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
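
The per-type figures this commit records (quoted in the diff from `docs/sota-research-2026.md:113-114`) can be compared mechanically. The following is a minimal Python sketch, not code from the repository: `leader_per_axis` and `top1_delta` are hypothetical helper names, and only the numbers are taken from the commit.

```python
# top1 figures as quoted in this commit for v5 and v6.
TOP1 = {
    "v5": {"overall": 0.60, "humming_like": 0.50, "confused": 0.00},
    "v6": {"overall": 0.65, "humming_like": 0.25, "confused": 0.25},
}


def leader_per_axis(top1):
    """Return, for each metric, the variant with the highest top1 (hypothetical helper)."""
    axes = next(iter(top1.values())).keys()
    return {axis: max(top1, key=lambda variant: top1[variant][axis]) for axis in axes}


def top1_delta(top1, base="v6", ref="v5"):
    """Return ref minus base per metric, rounded to hide float noise (hypothetical helper)."""
    return {axis: round(top1[ref][axis] - top1[base][axis], 2) for axis in top1[base]}


if __name__ == "__main__":
    print(leader_per_axis(TOP1))
    print(top1_delta(TOP1))
```

With these inputs, `v6` leads on `overall` and `confused` while `v5` leads on `humming_like`, which is the split the commit sets out to explain.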
--- a/AGENT.md
+++ b/AGENT.md
@@ -74,23 +74,21 @@
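 ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02)
-### Latest snapshot (15:45 UTC)
+### Latest snapshot (15:46 UTC)
-- Remote sync baseline: `d4961b1` (before this update)
+- Remote sync baseline: `93dfa15` (before this update)
-- Most important new evidence: **the most reasonable optimization baseline for the next hard-case round has been determined**.
+- Most important new evidence: **the source of the hard-case difference between `v5` and `v6` has been explained**.
-- Baseline sweep conclusions:
+- `v5` = `type-aware hard-case weighting`:
-  - `v6` is best overall: `top1=0.65`, `topk=0.95`
+  - `humming_like top1=0.50`
-  - `v5` is stronger on `humming_like`: `top1=0.5`
+  - `confused top1=0.00`
-- Breakdown:
+- `v6` = `sample-level confused-priority weighting`:
-  - `v3`: `hum=0.0`, `conf=0.25`
+  - `humming_like top1=0.25`
-  - `v4`: `hum=0.0`, `conf=0.0`
+  - `confused top1=0.25`
-  - `v5`: `hum=0.5`, `conf=0.0`
+- This means the most valuable next step is not another blind sweep, but a dual-axis strategy that controls `humming_like` and `confused` separately.
-  - `v6`: `hum=0.25`, `conf=0.25`
-- This means the next optimization round should not make blind changes; it should take `v6` as the overall primary baseline and selectively absorb `v5`'s advantage on `humming_like`.
 - Next events worth committing:
-  1. `v5` vs `v6` difference audit results
+  1. A dual-axis hard-case weighting / sampling scheme is implemented
-  2. Hard-case metric improvement after the merged experiment
+  2. Its hard-case metric improvement over `v6`
-  3. Dual-track (real-path clean + synthetic hard-case) retest results
+  3. Dual-track regression results are stable
 ## 6. High-risk notes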
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 15:46 UTC / v5-v6 hard-case difference is now causally explained
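+- Based on the historical experiment records in the repository, filled in the explanation of where the `v5` vs `v6` hard-case performance difference comes from
+- fresh evidence:
+  - `docs/CHANGELOG.md:2954+` explicitly records that `v6` is **sample-level confused-priority weighting**
+  - `docs/CHANGELOG.md:6805+` explicitly records that `v5` is **type-aware hard-case weighting**
+  - `docs/sota-research-2026.md:113-114` already summarizes the difference between the two:
+    - `v5` improves `humming_like`, but makes no breakthrough on `confused`
+    - `v6` raises `confused`, but `humming_like` falls back
+- Actionable conclusions:
+  - `v5`'s advantage comes mainly from the gentler boost that **type-aware weighting** gives `humming_like`
+  - `v6`'s advantage comes mainly from the targeted lift that **sample-level confused-priority weighting** gives `confused`
+  - The next round should not repeat a blind sweep; it should design a **dual-axis hard-case weighting / divide-and-conquer strategy**
 ## 2026-06-02 15:45 UTC / hard-case baseline sweep pinned the next optimization baseline
 - Ran one unified hard-case evaluation sweep of the existing `v3~v6` baselines on `data/synthetic_v2`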
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -248,3 +248,27 @@
 - The most reasonable baseline for the next round of experiments is `v6`, because it is the most stable overall.
 - However, `v5` is clearly stronger on `humming_like` and is worth a targeted diff / absorption.
+## Additional delivery in this update (2026-06-02 15:46 UTC)
+### New difference-audit evidence
+| Category | Content |
+|---|---|
+| v5 source | `type-aware hard-case weighting` |
+| v6 source | `sample-level confused-priority weighting` |
+| Explanation | `v5` favors `humming_like`; `v6` favors `confused` |
+| Decision | The next round should use dual-axis hard-case weighting / divide-and-conquer rather than single-axis weighting |
+### Most important fresh evidence
+- `docs/CHANGELOG.md:2954+`: `v6` = sample-level confused-priority weighting
+- `docs/CHANGELOG.md:6805+`: `v5` = type-aware hard-case weighting
+- `docs/sota-research-2026.md:113-114`:
+  - `v5`: `overall=0.60`, `humming_like=0.50`, `confused=0.00`
+  - `v6`: `overall=0.65`, `humming_like=0.25`, `confused=0.25`
+### Conclusion
+- We now know not only which of `v5`/`v6` is stronger, but also why.
+- The next round should model or weight `humming_like` and `confused` separately.
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
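+## Additional update to this delivery package (2026-06-02 15:46 UTC)
+### Delivery conclusion
+The latest milestone has moved from "deciding whether v6 or v5 is the better baseline" to **explaining why they behave the way they do**:
+- Current remote baseline: `93dfa15` (before this update)
+- The key mechanism of `v5` is `type-aware hard-case weighting`
+- The key mechanism of `v6` is `sample-level confused-priority weighting`
+- The most reasonable next step is therefore not another blind sweep, but a dual-axis divide-and-conquer strategy for `humming_like` and `confused`
+### Latest facts
+#### Source of the v5 / v6 difference
+- `v5`:
+  - Historical record location: `docs/CHANGELOG.md:6805+`
+  - Definition: `type-aware hard-case weighting`
+  - Result: `humming_like top1=0.50`, `confused top1=0.00`
+- `v6`:
+  - Historical record location: `docs/CHANGELOG.md:2954+`
+  - Definition: `sample-level confused-priority weighting`
+  - Result: `humming_like top1=0.25`, `confused top1=0.25`
+- Summary explanation: `docs/sota-research-2026.md:113-114`
+### Current assessment
+- The difference between `v5` and `v6` is now explainable; it is no longer a black-box empirical difference.
+- The most valuable work for the next phase:
+  1. Design dual-axis hard-case weighting;
+  2. Control `humming_like` and `confused` separately;
+  3. Then run regression tests with the existing dual-track validation chain.
+---
 ## Additional update to this delivery package (2026-06-02 15:45 UTC)
 ### Delivery conclusion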
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -5,23 +5,24 @@
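 ## One-page summary
-### Latest delivery snapshot (2026-06-02 15:45 UTC)
+### Latest delivery snapshot (2026-06-02 15:46 UTC)
-- Current remote sync baseline: `d4961b1` (before this update)
+- Current remote sync baseline: `93dfa15` (before this update)
-- Most important new fact: **the most reasonable optimization baseline for the next hard-case round has been determined**
+- Most important new fact: **the source of the hard-case difference between `v5` and `v6` has been explained**
-- Baseline sweep conclusions:
+- `v5`: `type-aware hard-case weighting`
-  - `v6`: best overall, `top1=0.65`, `topk=0.95`
+  - `humming_like top1=0.50`
-  - `v5`: best on `humming_like`, `top1=0.5`
+  - `confused top1=0.00`
-- Breakdown:
+- `v6`: `sample-level confused-priority weighting`
-  - `v3`: `hum=0.0`, `conf=0.25`
+  - `humming_like top1=0.25`
-  - `v4`: `hum=0.0`, `conf=0.0`
+  - `confused top1=0.25`
-  - `v5`: `hum=0.5`, `conf=0.0`
+- Conclusion:
-  - `v6`: `hum=0.25`, `conf=0.25`
+  - `v5` leans toward improving `humming_like`
-- Conclusion: the next phase should not blindly redo a large-scale clean smoke run; it should take `v6` as the primary baseline and selectively absorb `v5`'s `humming_like` advantage.
+  - `v6` leans toward improving `confused`
+  - The next round should design a dual-axis hard-case weighting / divide-and-conquer strategy rather than continue with single-axis weighting
 - First priorities for the new session:
-  1. Audit the differences between `v5` and `v6`
+  1. Design weighting or sampling that controls `humming_like` / `confused` separately
-  2. Design a merged experiment combining "v6 overall + v5 humming_like advantage"
+  2. Reuse the existing `v6` primary baseline for a minimal-change experiment
-  3. Retest on both tracks: real-path clean + synthetic hard-case
+  3. Run dual-track regression: real-path clean + synthetic hard-case
 ### Latest observability fix (2026-06-02 15:18 UTC)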
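
One possible reading of the "dual-axis" directive is a per-sample weight built from two independent multipliers: a type-level gain for `humming_like` (the `v5` idea) and a sample-level gain for confused samples (the `v6` idea). The sketch below is illustrative only and is not the repository's implementation. The `Sample` class, the `confused` flag, the gain values, and the mean normalization are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    case_type: str   # assumed label, e.g. "humming_like", "confused", "clean"
    confused: bool   # assumed sample-level flag, e.g. set by an earlier evaluation pass


def dual_axis_weights(
    samples: Sequence[Sample],
    hum_gain: float = 1.5,   # hypothetical type-level gain (axis 1)
    conf_gain: float = 2.0,  # hypothetical sample-level gain (axis 2)
) -> List[float]:
    """Combine the two axes multiplicatively, then rescale to mean 1.0."""
    if not samples:
        return []
    raw = []
    for s in samples:
        w = 1.0
        if s.case_type == "humming_like":
            w *= hum_gain    # axis 1: depends only on the hard-case type
        if s.confused:
            w *= conf_gain   # axis 2: depends only on the per-sample flag
        raw.append(w)
    mean = sum(raw) / len(raw)
    # Mean normalization keeps the overall loss scale unchanged.
    return [w / mean for w in raw]


if __name__ == "__main__":
    batch = [
        Sample("clean", False),
        Sample("humming_like", False),
        Sample("confused", True),
        Sample("humming_like", True),
    ]
    print(dual_axis_weights(batch))
```

Because the two gains are separate parameters, either axis can be tuned or disabled (by setting its gain to 1.0) without touching the other, which is the property the handoff asks for when it says `humming_like` and `confused` should be controlled separately.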