select the next hard-case optimization baseline from fresh sweeps

Constraint:…

… Handoff must encode the new baseline decision without staging temporary sweep artifacts
Rejected: Jump straight into retraining without baseline comparison | Fresh sweep evidence now makes a targeted v6-vs-v5 optimization path cheaper and safer
Confidence: high
Scope-risk: narrow
Directive: Use v6 as the overall baseline and treat v5 as the humming_like comparison target before changing training or segmentation logic
Tested: Ran a synthetic_v2 hard-case sweep across v3-v6, verified summary metrics, and updated handoff/changelog docs with the baseline decision
Not-tested: Whether a merged v6-plus-v5 strategy improves real open-data derived hard cases

Commit 93dfa158 ... 93dfa158d6f4df1e5ced10bae639d7cef9c285f1 authored 2026-06-02 23:46:12 +0800 by cnb.bofCdSsphPA
Showing 5 changed files with 107 additions and 34 deletions
AGENT.md
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -74,20 +74,23 @@
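 ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02)
-### Current latest snapshot (15:43 UTC)
+### Current latest snapshot (15:45 UTC)
- Remote sync baseline: `81704ac` (before this update)
+- Remote sync baseline: `d4961b1` (before this update)
- Most important new evidence: **the hard-case weakness has been clearly quantified**.
+- Most important new evidence: **the most reasonable baseline for the next round of hard-case optimization has been determined**.
- real-path clean closed loop: `num_queries=35`, `top1=0.8571`, `topk=1.0`
+- Baseline sweep conclusions:
- synthetic hard-case smoke: `num_queries=16`, `top1=0.6875`, `topk=1.0`
+  - `v6` is best overall: `top1=0.65`, `topk=0.95`
-  - `humming_like: n=4, top1=0.25, topk=1.0`
+  - `v5` is stronger on `humming_like`: `top1=0.5`
-  - `confused: n=1, top1=0.0, topk=1.0`
+- Breakdown:
- Key explanation: the real-path FMA external smoke manifest currently contains only `clean` queries; `humming_like` / `confused` must be verified separately with a hard-case set such as `data/synthetic_v2`.
+  - `v3`: `hum=0.0`, `conf=0.25`
- This shows that the engineering pipeline is now stable enough; the largest gain in the next step lies not in clean smoke but in improving hard-case top1.
+  - `v4`: `hum=0.0`, `conf=0.0`
+  - `v5`: `hum=0.5`, `conf=0.0`
+  - `v6`: `hum=0.25`, `conf=0.25`
+- This shows that the next round of optimization should not change things blindly; it should take `v6` as the overall primary baseline and selectively absorb `v5`'s advantage on `humming_like`.
 - Next events worth committing:
-  1. Improved metrics after hard-case optimization
+  1. `v5` vs `v6` difference audit results
-  2. A hard-case generation/annotation scheme landed on real open data
+  2. Improved hard-case metrics after the merge experiment
-  3. Retest results at a larger query/reference scale
+  3. dual-track (real-path clean + synthetic hard-case) retest results
 ## 6. High-risk notes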
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 15:45 UTC / hard-case baseline sweep pinned the next optimization baseline
+- Ran a unified hard-case evaluation sweep of the existing `v3~v6` baselines on `data/synthetic_v2`
+- fresh evidence (`2026-06-02 15:45:18 UTC`):
+  - Summary file: `/tmp/synth_v2_baseline_sweep/summary.json`
+  - Unified config: `evaluate.py --data data/synthetic_v2 --fast-eval --split test --top-k 5 --seed 42`
+- Key results:
+  - `v3`: `top1=0.6`, `topk=0.75`, `humming_like top1=0.0`, `confused top1=0.25`
+  - `v4`: `top1=0.4`, `topk=0.8`, `humming_like top1=0.0`, `confused top1=0.0`
+  - `v5`: `top1=0.6`, `topk=0.9`, `humming_like top1=0.5`, `confused top1=0.0`
+  - `v6`: `top1=0.65`, `topk=0.95`, `humming_like top1=0.25`, `confused top1=0.25`
+- Conclusions:
+  - Judged on overall results and the clean/augmented balance: `v6` is currently strongest
+  - Judged on `humming_like top1` alone: `v5` is currently stronger (`0.5` vs `0.25`)
+  - The next round of optimization should therefore use `v6` as the overall baseline, while comparing against `v5` and absorbing the factors that make it better on `humming_like`
 ## 2026-06-02 15:43 UTC / hard-case gap verified after the first real-path closure
 - After the first real-path `build-index -> evaluate` closed loop, ran an additional smoke on the existing `synthetic_v2` hard-case set to verify the next optimization focus
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -221,3 +221,30 @@
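 - We now know not only that "the system runs end to end" but also "where to optimize first": hard-case top1.
 - The more valuable next round is input-layer, slicing, confusion-augmentation, and hard-negative optimization around `humming_like` / `confused`.
+## Additional delivery in this update (2026-06-02 15:45 UTC)
+### New runtime evidence
+| Category | Details |
+|---|---|
+| baseline sweep | unified hard-case sweep completed for `v3~v6` |
+| Best overall | `v6`: `top1=0.65`, `topk=0.95` |
+| Best humming_like | `v5`: `top1=0.5` |
+| Best confused | `v3` / `v6`: `top1=0.25` |
+### Most important fresh evidence
+- Observed at: `2026-06-02 15:45:18 UTC`
+- Summary file: `/tmp/synth_v2_baseline_sweep/summary.json`
+- Unified evaluation set: `data/synthetic_v2`
+- Result excerpt (overall shown as `top1/topk`):
+  - `v3`: overall `0.6/0.75`, hard-case `hum=0.0`, `conf=0.25`
+  - `v4`: overall `0.4/0.8`, hard-case `hum=0.0`, `conf=0.0`
+  - `v5`: overall `0.6/0.9`, hard-case `hum=0.5`, `conf=0.0`
+  - `v6`: overall `0.65/0.95`, hard-case `hum=0.25`, `conf=0.25`
+### Conclusions
+- The most reasonable baseline for the next round of experiments is `v6`, because it is the most stable overall.
+- However, `v5` is stronger on `humming_like` (`top1=0.5` vs `0.25`) and is worth a targeted diff / absorption.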
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
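+## Additional update to this delivery package (2026-06-02 15:45 UTC)
+### Delivery conclusion
+The latest milestone has moved from "knowing there is a hard-case gap" to **knowing which historical baseline is the best starting point for the next round of optimization**:
+- Remote baseline is currently: `d4961b1` (before this update)
+- `v6` is the current best overall baseline: `top1=0.65`, `topk=0.95`
+- `v5` is stronger on `humming_like`: `top1=0.5`
+- The next round should therefore not change things blindly; it should use `v6` as the primary baseline and absorb `v5`'s hard-case advantage by comparison
+### Current facts
+#### hard-case baseline sweep
+- Observed at: `2026-06-02 15:45:18 UTC`
+- Summary: `/tmp/synth_v2_baseline_sweep/summary.json`
+- Results:
+  - `v3`: overall `top1=0.6`, `topk=0.75`; `humming_like=0.0`, `confused=0.25`
+  - `v4`: overall `top1=0.4`, `topk=0.8`; `humming_like=0.0`, `confused=0.0`
+  - `v5`: overall `top1=0.6`, `topk=0.9`; `humming_like=0.5`, `confused=0.0`
+  - `v6`: overall `top1=0.65`, `topk=0.95`; `humming_like=0.25`, `confused=0.25`
+### Current assessment
+- `v6` is suitable as the primary baseline for the next round of overall optimization.
+- `v5` is suitable as the `humming_like` comparison baseline.
+- The most valuable next steps are:
+  1. Audit the config/data/slicing differences between `v5` and `v6`;
+  2. Migrate `v5`'s `humming_like` advantage to `v6`;
+  3. Then retest on both tracks: real-path clean + synthetic hard-case.
+---
 ## Additional update to this delivery package (2026-06-02 15:43 UTC)
 ### Delivery conclusion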
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -5,28 +5,23 @@
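 ## One-page summary
-### Latest delivery snapshot (2026-06-02 15:43 UTC)
+### Latest delivery snapshot (2026-06-02 15:45 UTC)
- Current remote sync baseline: `81704ac` (before this update)
+- Current remote sync baseline: `d4961b1` (before this update)
- Most important new fact: **the hard-case weakness has been clearly quantified**
+- Most important new fact: **the most reasonable baseline for the next round of hard-case optimization has been determined**
- real-path clean closed-loop results:
+- Baseline sweep conclusions:
-  - `num_queries=35`
+  - `v6`: best overall, `top1=0.65`, `topk=0.95`
-  - `top1=0.8571`
+  - `v5`: best on `humming_like`, `top1=0.5`
-  - `topk=1.0`
+- Breakdown:
- synthetic hard-case smoke results:
+  - `v3`: `hum=0.0`, `conf=0.25`
-  - `num_queries=16`
+  - `v4`: `hum=0.0`, `conf=0.0`
-  - `top1=0.6875`
+  - `v5`: `hum=0.5`, `conf=0.0`
-  - `topk=1.0`
+  - `v6`: `hum=0.25`, `conf=0.25`
-  - `humming_like: n=4, top1=0.25, topk=1.0`
+- Conclusion: the next phase should not blindly redo large-scale clean smoke; it should use `v6` as the primary baseline and selectively absorb `v5`'s `humming_like` advantage.
-  - `confused: n=1, top1=0.0, topk=1.0`
- Key explanation:
-  - real-path FMA external smoke currently has only `clean` queries
-  - `humming_like` / `confused` can currently be verified only with a hard-case set such as `data/synthetic_v2`
- Conclusion: the next phase should not keep repeating clean smoke; it should prioritize raising hard-case top1.
 - Top priorities for a new session:
-  1. Input-layer and slicing optimization around `humming_like` / `confused`
+  1. Audit the differences between `v5` and `v6`
-  2. Design a hard-case generation/annotation pipeline on real open data
+  2. Design a merge experiment combining "v6 overall + v5 humming_like advantage"
-  3. Then retest at a larger query / reference scale
+  3. Retest on both tracks: real-path clean + synthetic hard-case
 ### Latest observability fix (2026-06-02 15:18 UTC)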
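
The baseline decision recorded across these docs (v6 overall, v5 for `humming_like`, v3/v6 tied on `confused`) can be reproduced from the sweep numbers with a few lines of Python. This is a minimal sketch, not the project's tooling: the `SWEEP` dict layout, the key names, and the helpers `best_versions` and `select_baselines` are hypothetical, and the real schema of `summary.json` is not shown in this commit. Only the metric values are taken from the diff above.

```python
from typing import Dict, List

# Metric values copied from the sweep results in this commit.
# ASSUMPTION: this dict layout and its key names are illustrative only;
# they are not the schema of /tmp/synth_v2_baseline_sweep/summary.json.
SWEEP: Dict[str, Dict[str, float]] = {
    "v3": {"top1": 0.6, "topk": 0.75, "humming_like": 0.0, "confused": 0.25},
    "v4": {"top1": 0.4, "topk": 0.8, "humming_like": 0.0, "confused": 0.0},
    "v5": {"top1": 0.6, "topk": 0.9, "humming_like": 0.5, "confused": 0.0},
    "v6": {"top1": 0.65, "topk": 0.95, "humming_like": 0.25, "confused": 0.25},
}


def best_versions(sweep: Dict[str, Dict[str, float]], metric: str) -> List[str]:
    """Return every version tied for the highest value of `metric`, sorted by name."""
    best = max(metrics[metric] for metrics in sweep.values())
    return sorted(v for v, metrics in sweep.items() if metrics[metric] == best)


def select_baselines(sweep: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Pick the overall baseline by top1 and a comparison target per hard-case category."""
    return {
        "overall": best_versions(sweep, "top1"),
        "humming_like": best_versions(sweep, "humming_like"),
        "confused": best_versions(sweep, "confused"),
    }


if __name__ == "__main__":
    for role, versions in select_baselines(SWEEP).items():
        print(f"{role}: {', '.join(versions)}")
    # overall: v6
    # humming_like: v5
    # confused: v3, v6
```

Returning all tied versions rather than a single winner keeps the `confused` tie between `v3` and `v6` visible, which matches how the changelist table reports it.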