pin down the hard-case gap after the first real-path closure

Constraint: Handoff must distinguish clean real-path evidence from hard-case evidence without staging temporary evaluation artifacts
Rejected: Keep scaling clean-only FMA smoke first | Fresh evidence shows the next highest-yield work is hard-case top1 improvement
Confidence: high
Scope-risk: narrow
Directive: Treat humming_like and confused as the primary optimization targets before investing more cycles in larger clean-only smoke runs
Tested: Audited manifest type coverage, verified synthetic_v2 hard-case evaluate results, and updated handoff/changelog docs with the gap analysis
Not-tested: Post-optimization hard-case improvements on real open-data derived hard cases

Commit d4961b14 ... d4961b1467dda9b819eeacba30ec166c9c034af1 authored 2026-06-02 23:44:12 +0800 by cnb.bofCdSsphPA
Showing 5 changed files with 117 additions and 33 deletions
AGENT.md
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -74,24 +74,20 @@
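 ## 5.5 Latest real FMA / chromaprint run status (2026-06-02)
-### Current latest snapshot (15:40 UTC)
+### Current latest snapshot (15:43 UTC)
-- Remote sync baseline: `9371e94` (before this update)
+- Remote sync baseline: `81704ac` (before this update)
-- Most important new evidence: **the fixed real-path 200-ref rerun has produced its first explicit evaluate metrics**.
+- Most important new evidence: **the hard-case gap has been explicitly quantified**.
-- index path: `/tmp/fma_realpath_small_rerun_index2`
+- real-path clean closed loop: `num_queries=35`, `top1=0.8571`, `topk=1.0`
-- eval path: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
+- synthetic hard-case smoke: `num_queries=16`, `top1=0.6875`, `topk=1.0`
-- Current results:
+  - `humming_like: n=4, top1=0.25, topk=1.0`
-  - `num_queries=35`
+  - `confused: n=1, top1=0.0, topk=1.0`
-  - `top1=0.8571`
+- Key explanation: the real-path FMA external smoke manifest currently contains only `clean` queries; `humming_like` / `confused` have to be validated through a hard-case set such as `data/synthetic_v2`.
-  - `topk=1.0`
+- Takeaway: the engineering pipeline is now stable enough that the biggest remaining gain lies in hard-case top1 improvement, not in more clean smoke runs.
-  - `by_type.clean: n=35, top1=0.8571, topk=1.0`
-- Query count note: of the `235` overlap test items, only `35` are non-`reference` queries, so `--max-queries 50` ended up evaluating `35`.
-- Takeaway: there is now a complete, reusable real-path smoke evidence chain:
-  `chromaprint complete -> reference complete -> evaluate complete`
 - Next events worth committing:
-  1. Evaluation with more queries / a larger reference set
+  1. Metric improvements after hard-case optimization
-  2. `confused` / `humming_like` / hard negative metrics
+  2. A working hard-case generation/annotation scheme on real open data
-  3. Results on data mixes closer to commercial scenarios
+  3. Retest results at larger query/reference scale
 ## 6. High-risk notes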
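
Reviewer note on the query count in the AGENT.md hunk: the removed lines explain that 235 overlap test items yield only 35 non-`reference` queries, so `--max-queries 50` evaluates 35. That is a filter-then-cap step. A minimal sketch of the behavior, assuming each manifest item carries a `type` field and that the cap is applied after filtering. `select_queries` and the item layout are hypothetical and are not taken from `evaluate.py`:

```python
def select_queries(items, max_queries):
    """Keep non-reference items, then cap the list at max_queries."""
    # Assumption: reference entries are marked with type == "reference".
    queries = [item for item in items if item["type"] != "reference"]
    # The cap only bites when there are more queries than max_queries.
    return queries[:max_queries]


# Toy manifest matching the counts in the diff: 235 items, 35 of them queries.
# The 200 reference entries are inferred from 235 - 35.
items = [{"type": "reference"}] * 200 + [{"type": "clean"}] * 35

selected = select_queries(items, max_queries=50)
print(len(selected))
```

With fewer eligible queries than the cap, the evaluated count equals the number of eligible queries, which is why a larger `--max-queries` would not change the result here.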
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 15:43 UTC / hard-case gap verified after the first real-path closure
+- After the first real-path `build-index -> evaluate` closed loop, ran an additional hard-case smoke on the existing `synthetic_v2` set to confirm the next optimization focus
+- fresh evidence (`2026-06-02 15:43:17 UTC`):
+  - Eval file: `/tmp/synthetic_v2_eval_v6_top16.json`
+  - Combination: `data/synthetic_v2` + `data/models_v6/best_model.pt` + `data/index_v6/reference`
+  - Result: `num_queries=16`, `top1=0.6875`, `topk=1.0`
+  - by_type:
+    - `clean`: `n=7`, `top1=1.0`, `topk=1.0`
+    - `augmented`: `n=4`, `top1=0.75`, `topk=1.0`
+    - `humming_like`: `n=4`, `top1=0.25`, `topk=1.0`
+    - `confused`: `n=1`, `top1=0.0`, `topk=1.0`
+- Key explanation:
+  - The current real-path FMA external smoke manifest contains only `clean` queries, with no `confused` / `humming_like`
+  - So the real-path evaluation only demonstrates the clean closed loop; it is not enough to demonstrate hard-case robustness
+- Conclusion: the clearest optimization direction has converged on raising top1 for `humming_like` / `confused`, rather than repeating clean smoke runs
 ## 2026-06-02 15:40 UTC / real-path 200-ref rerun closed the first explicit evaluate loop
 - Building on the completed `200-ref` real-path index, ran an explicit `evaluate.py` smoke
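
Reviewer note on the CHANGELOG numbers: the per-type results can be cross-checked against the overall `top1=0.6875`. A small sketch, assuming the overall top1 is the query-weighted mean of the per-type top1 values; the dictionary layout and function names are illustrative and are not the schema of the eval JSON:

```python
# Per-type results copied from the changelog entry: type -> (n, top1).
by_type = {
    "clean": (7, 1.0),
    "augmented": (4, 0.75),
    "humming_like": (4, 0.25),
    "confused": (1, 0.0),
}


def overall_top1(results):
    """Query-weighted top1 across all types."""
    total = sum(n for n, _ in results.values())
    hits = sum(round(n * top1) for n, top1 in results.values())
    return hits / total


def misses_by_type(results):
    """Number of queries per type whose top-1 result was wrong."""
    return {name: n - round(n * top1) for name, (n, top1) in results.items()}


overall = overall_top1(by_type)
misses = misses_by_type(by_type)
print(overall, misses)
```

Under that assumption the per-type rows add up to 11 hits out of 16, and four of the five top-1 misses fall in `humming_like` and `confused`, which is in line with the conclusion stated in the entry.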
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -192,3 +192,32 @@
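 - This is no longer just a successful index build; we now have the first real-path `build-index -> evaluate` closed-loop evidence.
 - The next round should shift focus to larger evaluation scale and hard case / confusion evaluation.
+## Additional delivery in this update (2026-06-02 15:43 UTC)
+### New run evidence
+| Category | Details |
+|---|---|
+| hard-case smoke | explicit evaluation of `synthetic_v2 + models_v6 + index_v6` completed |
+| Overall | `num_queries=16`, `top1=0.6875`, `topk=1.0` |
+| hard case | `humming_like top1=0.25`, `confused top1=0.0` |
+| Conclusion | The current gap is clearly in hard-case top1, not in clean/topk |
+### Most important fresh evidence
+- Observed at: `2026-06-02 15:43:17 UTC`
+- Result file: `/tmp/synthetic_v2_eval_v6_top16.json`
+- Evaluation results:
+  - `top1=0.6875`
+  - `topk=1.0`
+  - `humming_like: n=4, top1=0.25, topk=1.0`
+  - `confused: n=1, top1=0.0, topk=1.0`
+- Manifest audit results:
+  - real-path FMA external smoke has only `clean` queries
+  - only synthetic_v2 contains `augmented` / `humming_like` / `confused`
+### Conclusion
+- We now know not only that "the system runs end to end" but also "where to optimize first": hard-case top1.
+- The more valuable next round is to work on `humming_like` / `confused` through the input layer, slicing, confusion augmentation, and hard negative optimization.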
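
Reviewer note on the manifest audit in the changelist: checking which query types a manifest covers amounts to a count per type. A hedged sketch, assuming manifest items are dictionaries with a `type` field; the helper names and the toy manifests are hypothetical and only mirror the counts reported in this commit:

```python
from collections import Counter


def query_type_coverage(manifest_items):
    """Count query items per type, ignoring reference entries."""
    return Counter(
        item["type"] for item in manifest_items if item["type"] != "reference"
    )


def missing_types(manifest_items, required=("humming_like", "confused")):
    """Return the required hard-case types that have no queries at all."""
    coverage = query_type_coverage(manifest_items)
    # Counter returns 0 for absent keys, so no KeyError handling is needed.
    return [t for t in required if coverage[t] == 0]


# Toy manifests that mirror the reported per-type counts (not real data).
real_path = [{"type": "clean"}] * 35
synthetic = (
    [{"type": "clean"}] * 7
    + [{"type": "augmented"}] * 4
    + [{"type": "humming_like"}] * 4
    + [{"type": "confused"}] * 1
)

print(missing_types(real_path))
print(missing_types(synthetic))
```

A check of this shape makes the gap explicit: a clean-only manifest reports both hard-case types as missing, so its metrics cannot speak to hard-case robustness.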
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
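+## Additional update to this delivery package (2026-06-02 15:43 UTC)
+### Delivery conclusion
+The latest milestone has moved from "the real-path clean closed loop runs end to end" to **the hard-case gap has been explicitly quantified**:
+- Remote baseline is currently: `81704ac` (before this update)
+- The real-path FMA smoke has shown that the `clean` closed loop runs end to end
+- The synthetic hard-case smoke has shown that the main gap is top1 on `humming_like` / `confused`
+- So the next phase should not repeat clean smoke runs; it should focus on hard-case robustness
+### Latest facts
+#### hard-case smoke results
+- Observed at: `2026-06-02 15:43:17 UTC`
+- Combination: `data/synthetic_v2` + `data/models_v6/best_model.pt` + `data/index_v6/reference`
+- Result file: `/tmp/synthetic_v2_eval_v6_top16.json`
+- Evaluation results:
+  - `num_queries=16`
+  - `top1=0.6875`
+  - `topk=1.0`
+  - `clean: n=7, top1=1.0, topk=1.0`
+  - `augmented: n=4, top1=0.75, topk=1.0`
+  - `humming_like: n=4, top1=0.25, topk=1.0`
+  - `confused: n=1, top1=0.0, topk=1.0`
+#### Key explanation
+- The real-path FMA external smoke manifest currently has only `clean` queries:
+  - external test = `1613 clean`
+  - rerun overlap test = `35 clean`
+- The only ready-made evaluation set in the repo that provides `humming_like` / `confused` is `data/synthetic_v2`.
+### Current assessment
+- The real-path closed loop is already sufficient evidence that the engineering pipeline runs.
+- The highest-yield work for the next phase has converged on:
+  1. raising `humming_like` top1;
+  2. raising `confused` top1;
+  3. bringing hard-case generation/annotation into the real open-data evaluation chain.
+---
 ## Additional update to this delivery package (2026-06-02 15:40 UTC)
 ### Delivery conclusion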
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -5,27 +5,28 @@
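 ## One-page summary
-### Latest delivery snapshot (2026-06-02 15:40 UTC)
+### Latest delivery snapshot (2026-06-02 15:43 UTC)
-- Current remote sync baseline: `9371e94` (before this update)
+- Current remote sync baseline: `81704ac` (before this update)
-- Most important new fact: **the fixed real-path 200-ref rerun has produced its first explicit evaluate metrics**
+- Most important new fact: **the hard-case gap has been explicitly quantified**
-- index path: `/tmp/fma_realpath_small_rerun_index2`
+- real-path clean closed-loop results:
-- eval path: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
-- Current core results: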
  - `num_queries=35`
  - `top1=0.8571`
  - `topk=1.0`
-  - `by_type.clean: n=35, top1=0.8571, topk=1.0`
+- synthetic hard-case smoke results:
-- Query count note:
+  - `num_queries=16`
-  - overlap test items = `235`
+  - `top1=0.6875`
-  - non-`reference` queries = `35`
+  - `topk=1.0`
-  - so `--max-queries 50` actually evaluated `35` queries
+  - `humming_like: n=4, top1=0.25, topk=1.0`
-- Conclusion: the first complete real-path closed loop is now in place:
+  - `confused: n=1, top1=0.0, topk=1.0`
-  `chromaprint -> reference -> evaluate`
+- Key explanation:
+  - real-path FMA external smoke currently has only `clean` queries
+  - `humming_like` / `confused` can currently only be validated through a hard-case set such as `data/synthetic_v2`
+- Conclusion: the next phase should not keep repeating clean smoke runs; it should prioritize raising hard-case top1.
 - Top priorities for the new session:
-  1. Scale up query / reference size
+  1. Input-layer and slicing optimization targeting `humming_like` / `confused`
-  2. Add hard-case (`confused` / `humming_like`) evaluation
+  2. Design a hard-case generation/annotation chain on real open data
-  3. Then decide whether to move to a larger FMA subset or the full set
+  3. Then retest at larger query / reference scale
 ### Latest observability fix (2026-06-02 15:18 UTC)