select the next hard-case optimization baseline from fresh sweeps

Constraint:…
… Handoff must encode the new baseline decision without staging temporary sweep artifacts
Rejected: Jump straight into retraining without baseline comparison | Fresh sweep evidence now makes a targeted v6-vs-v5 optimization path cheaper and safer
Confidence: high
Scope-risk: narrow
Directive: Use v6 as the overall baseline and treat v5 as the humming_like comparison target before changing training or segmentation logic
Tested: Ran a synthetic_v2 hard-case sweep across v3-v6, verified summary metrics, and updated handoff/changelog docs with the baseline decision
Not-tested: Whether a merged v6-plus-v5 strategy improves real open-data derived hard cases
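
The baseline decision in the Directive can be rechecked from the sweep numbers recorded in this commit's changelog entry. The following is only a minimal sketch: the `SWEEP` dictionary and the `best_by` helper are illustrative names, not part of the repository, and the real `summary.json` layout is not shown in the commit, so the metrics are copied in by hand.

```python
# Metrics copied from the changelog entry in this commit (synthetic_v2 sweep).
# The dictionary layout is an assumption; the real summary.json schema is not shown.
SWEEP = {
    "v3": {"top1": 0.60, "topk": 0.75, "humming_like": 0.00, "confused": 0.25},
    "v4": {"top1": 0.40, "topk": 0.80, "humming_like": 0.00, "confused": 0.00},
    "v5": {"top1": 0.60, "topk": 0.90, "humming_like": 0.50, "confused": 0.00},
    "v6": {"top1": 0.65, "topk": 0.95, "humming_like": 0.25, "confused": 0.25},
}


def best_by(metric, sweep=SWEEP):
    """Return the sorted list of versions tied for the highest value of `metric`."""
    top = max(row[metric] for row in sweep.values())
    return sorted(version for version, row in sweep.items() if row[metric] == top)


overall_best = best_by("top1")          # overall baseline candidate(s)
humming_best = best_by("humming_like")  # humming_like comparison target(s)
confused_best = best_by("confused")     # may be a tie

print(overall_best, humming_best, confused_best)
```

Returning a list rather than a single version keeps ties visible, which matters here because two versions share the best `confused` score.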
Showing 5 changed files with 107 additions and 34 deletions
| ... | @@ -74,20 +74,23 @@ | ... | @@ -74,20 +74,23 @@ |
| 74 | 74 | ||
| 75 | ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02) | 75 | ## 5.5 Latest real FMA / chromaprint runtime state (2026-06-02) |
| 76 | 76 | ||
| 77 | ### Latest snapshot (15:43 UTC) | 77 | ### Latest snapshot (15:45 UTC) |
| 78 | 78 | ||
| 79 | - Remote sync baseline: `81704ac` (before this update) | 79 | - Remote sync baseline: `d4961b1` (before this update) |
| 80 | - Most important new evidence: **the hard-case weakness has now been explicitly quantified**. | 80 | - Most important new evidence: **the most reasonable baseline for the next round of hard-case optimization has been determined**. |
| 81 | - real-path clean closure: `num_queries=35`, `top1=0.8571`, `topk=1.0` | 81 | - baseline sweep conclusions: |
| 82 | - synthetic hard-case smoke: `num_queries=16`, `top1=0.6875`, `topk=1.0` | 82 | - `v6` is best overall: `top1=0.65`, `topk=0.95` |
| 83 | - `humming_like: n=4, top1=0.25, topk=1.0` | 83 | - `v5` is stronger on `humming_like`: `top1=0.5` |
| 84 | - `confused: n=1, top1=0.0, topk=1.0` | 84 | - Breakdown: |
| 85 | - Key explanation: the real-path FMA external smoke manifest currently contains only `clean` queries; `humming_like` / `confused` have to be verified separately with a hard-case set such as `data/synthetic_v2`. | 85 | - `v3`: `hum=0.0`, `conf=0.25` |
| 86 | - Implication: the current engineering pipeline is stable enough; the biggest gain next is not in clean smoke but in raising hard-case top1. | 86 | - `v4`: `hum=0.0`, `conf=0.0` |
| 87 | - `v5`: `hum=0.5`, `conf=0.0` | ||
| 88 | - `v6`: `hum=0.25`, `conf=0.25` | ||
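| 89 | - Implication: the next optimization round should not change things blindly; it should use `v6` as the overall primary baseline and selectively absorb `v5`'s advantage on `humming_like`. | ||
| 87 | - Events worth the next commit: | 90 | - Events worth the next commit: |
| 88 | 1. Metric improvement after hard-case optimization | 91 | 1. Results of the `v5` vs `v6` difference audit |
| 89 | 2. A hard-case generation/annotation scheme landed on real open data | 92 | 2. Hard-case metric improvement after the merge experiment |
| 90 | 3. Retest results at a larger query/reference scale | 93 | 3. dual-track (real-path clean + synthetic hard-case) retest results |
| 91 | 94 | ||
| 92 | 95 | ||
| 93 | ## 6. High-risk notes | 96 | ## 6. High-risk notes | ... | ... |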
| 1 | ## 2026-06-02 15:45 UTC / hard-case baseline sweep pinned the next optimization baseline | ||
| 2 | |||
| 3 | - Ran one unified hard-case evaluation sweep of the existing `v3~v6` baselines on `data/synthetic_v2` | ||
| 4 | - fresh evidence (`2026-06-02 15:45:18 UTC`): | ||
| 5 | - Summary file: `/tmp/synth_v2_baseline_sweep/summary.json` | ||
| 6 | - Shared configuration: `evaluate.py --data data/synthetic_v2 --fast-eval --split test --top-k 5 --seed 42` | ||
| 7 | - Key results: | ||
| 8 | - `v3`: `top1=0.6`, `topk=0.75`, `humming_like top1=0.0`, `confused top1=0.25` | ||
| 9 | - `v4`: `top1=0.4`, `topk=0.8`, `humming_like top1=0.0`, `confused top1=0.0` | ||
| 10 | - `v5`: `top1=0.6`, `topk=0.9`, `humming_like top1=0.5`, `confused top1=0.0` | ||
| 11 | - `v6`: `top1=0.65`, `topk=0.95`, `humming_like top1=0.25`, `confused top1=0.25` | ||
| 12 | - Conclusions: | ||
| 13 | - On overall performance and clean/augmented balance: `v6` is currently strongest | ||
| 14 | - On `humming_like top1` alone: `v5` is currently stronger (`0.5` vs `0.25`) | ||
| 15 | - So the next optimization round should use `v6` as the overall baseline while comparing against `v5` and absorbing the factors that make it better on `humming_like` | ||
| 16 | |||
| 1 | ## 2026-06-02 15:43 UTC / hard-case gap verified after the first real-path closure | 17 | ## 2026-06-02 15:43 UTC / hard-case gap verified after the first real-path closure |
| 2 | 18 | ||
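| 3 | - After the first real-path `build-index -> evaluate` closure, ran an additional hard-case smoke on the existing `synthetic_v2` set to verify the next optimization focus | 19 | - After the first real-path `build-index -> evaluate` closure, ran an additional hard-case smoke on the existing `synthetic_v2` set to verify the next optimization focus | ... | ... |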
| ... | @@ -221,3 +221,30 @@ | ... | @@ -221,3 +221,30 @@ |
| 221 | 221 | ||
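| 222 | - We now know not only that "the system runs end to end" but also "where optimization matters most": hard-case top1. | 222 | - We now know not only that "the system runs end to end" but also "where optimization matters most": hard-case top1. |
| 223 | - The more valuable next round is input-layer, segmentation, confusion-augmentation, and hard-negative optimization around `humming_like` / `confused`. | 223 | - The more valuable next round is input-layer, segmentation, confusion-augmentation, and hard-negative optimization around `humming_like` / `confused`. |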
| 224 | |||
| 225 | ## Additional delivery in this update (2026-06-02 15:45 UTC) | ||
| 226 | |||
| 227 | ### New runtime evidence | ||
| 228 | |||
| 229 | | Category | Content | | ||
| 230 | |---|---| | ||
| 231 | | baseline sweep | `v3~v6` unified hard-case sweep completed | | ||
| 232 | | Best overall | `v6`: `top1=0.65`, `topk=0.95` | | ||
| 233 | | Best on humming_like | `v5`: `top1=0.5` | | ||
| 234 | | Best on confused | `v3` / `v6`: `top1=0.25` | | ||
| 235 | |||
| 236 | ### Most important fresh evidence right now | ||
| 237 | |||
| 238 | - Observed at: `2026-06-02 15:45:18 UTC` | ||
| 239 | - Summary file: `/tmp/synth_v2_baseline_sweep/summary.json` | ||
| 240 | - Shared evaluation set: `data/synthetic_v2` | ||
| 241 | - Result excerpt: | ||
| 242 | - `v3`: overall `0.6/0.75`, hard-case `hum=0.0`, `conf=0.25` | ||
| 243 | - `v4`: overall `0.4/0.8`, hard-case `hum=0.0`, `conf=0.0` | ||
| 244 | - `v5`: overall `0.6/0.9`, hard-case `hum=0.5`, `conf=0.0` | ||
| 245 | - `v6`: overall `0.65/0.95`, hard-case `hum=0.25`, `conf=0.25` | ||
| 246 | |||
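| 247 | ### Conclusion | ||
| 248 | |||
| 249 | - The most reasonable baseline for the next round of experiments is `v6`, because it is the most stable overall. | ||
| 250 | - But `v5` is clearly stronger on `humming_like`, so a targeted diff / absorption is worthwhile. | ... | ... |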
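| 1 | ## Additional update to this delivery package (2026-06-02 15:45 UTC) | ||
| 2 | |||
| 3 | ### Delivery conclusion | ||
| 4 | |||
| 5 | The latest milestone has moved from "knowing there is a hard-case gap" to **knowing which historical baseline is most worth using as the starting point for the next optimization round**: | ||
| 6 | - Remote baseline is currently: `d4961b1` (before this update) | ||
| 7 | - `v6` is currently the best overall baseline: `top1=0.65`, `topk=0.95` | ||
| 8 | - `v5` is stronger on `humming_like`: `top1=0.5` | ||
| 9 | - So the next round should not change things blindly; it should use `v6` as the primary baseline and, by comparison, absorb `v5`'s hard-case advantage | ||
| 10 | |||
| 11 | ### Latest facts | ||
| 12 | |||
| 13 | #### hard-case baseline sweep | ||
| 14 | - Observed at: `2026-06-02 15:45:18 UTC` | ||
| 15 | - Summary: `/tmp/synth_v2_baseline_sweep/summary.json` | ||
| 16 | - Results: | ||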
| 17 | - `v3`: overall `top1=0.6`, `topk=0.75`; `humming_like=0.0`, `confused=0.25` | ||
| 18 | - `v4`: overall `top1=0.4`, `topk=0.8`; `humming_like=0.0`, `confused=0.0` | ||
| 19 | - `v5`: overall `top1=0.6`, `topk=0.9`; `humming_like=0.5`, `confused=0.0` | ||
| 20 | - `v6`: overall `top1=0.65`, `topk=0.95`; `humming_like=0.25`, `confused=0.25` | ||
| 21 | |||
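| 22 | ### Current assessment | ||
| 23 | |||
| 24 | - `v6` is suited to be the primary baseline for the next round of overall optimization. | ||
| 25 | - `v5` is suited to be the comparison baseline for `humming_like`. | ||
| 26 | - The most worthwhile next steps are: | ||
| 27 | 1. Audit the configuration/data/segmentation differences between `v5` and `v6`; | ||
| 28 | 2. Transfer `v5`'s `humming_like` advantage to `v6`; | ||
| 29 | 3. Then retest on both tracks: real-path clean + synthetic hard-case. | ||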
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 1 | ## Additional update to this delivery package (2026-06-02 15:43 UTC) | 33 | ## Additional update to this delivery package (2026-06-02 15:43 UTC) |
| 2 | 34 | ||
| 3 | ### Delivery conclusion | 35 | ### Delivery conclusion | ... | ... |
| ... | @@ -5,28 +5,23 @@ | ... | @@ -5,28 +5,23 @@ |
| 5 | 5 | ||
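| 6 | ## One-page summary | 6 | ## One-page summary |
| 7 | 7 | ||
| 8 | ### Latest delivery snapshot (2026-06-02 15:43 UTC) | 8 | ### Latest delivery snapshot (2026-06-02 15:45 UTC) |
| 9 | 9 | ||
| 10 | - Current remote sync baseline: `81704ac` (before this update) | 10 | - Current remote sync baseline: `d4961b1` (before this update) |
| 11 | - Most important new fact: **the hard-case weakness has now been explicitly quantified** | 11 | - Most important new fact: **the most reasonable baseline for the next round of hard-case optimization has been determined** |
| 12 | - real-path clean closure results: | 12 | - Baseline sweep conclusions: |
| 13 | - `num_queries=35` | 13 | - `v6`: best overall, `top1=0.65`, `topk=0.95` |
| 14 | - `top1=0.8571` | 14 | - `v5`: best on `humming_like`, `top1=0.5` |
| 15 | - `topk=1.0` | 15 | - Breakdown: |
| 16 | - synthetic hard-case smoke results: | 16 | - `v3`: `hum=0.0`, `conf=0.25` |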
| 17 | - `num_queries=16` | 17 | - `v4`: `hum=0.0`, `conf=0.0` |
| 18 | - `top1=0.6875` | 18 | - `v5`: `hum=0.5`, `conf=0.0` |
| 19 | - `topk=1.0` | 19 | - `v6`: `hum=0.25`, `conf=0.25` |
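| 20 | - `humming_like: n=4, top1=0.25, topk=1.0` | 20 | - Conclusion: the next stage should not blindly redo large-scale clean smoke; it should use `v6` as the primary baseline and selectively absorb `v5`'s `humming_like` advantage. |
| 21 | - `confused: n=1, top1=0.0, topk=1.0` | ||
| 22 | - Key explanation: | ||
| 23 | - real-path FMA external smoke currently contains only `clean` queries | ||
| 24 | - `humming_like` / `confused` can currently be verified only with a hard-case set such as `data/synthetic_v2` | ||
| 25 | - Conclusion: the next stage should stop repeating clean smoke and prioritize raising top1 on hard cases. | ||
| 26 | - Top priorities for the new session: | 21 | - Top priorities for the new session: |
| 27 | 1. Optimize the input layer and segmentation around `humming_like` / `confused` | 22 | 1. Audit the differences between `v5` and `v6` |
| 28 | 2. Design a hard-case generation/annotation pipeline on real open data | 23 | 2. Design a merge experiment combining "v6 overall + v5 humming_like advantage" |
| 29 | 3. Then retest at a larger query / reference scale | 24 | 3. Retest on both tracks: real-path clean + synthetic hard-case |
| 30 | 25 | ||
| 31 | ### Latest observability fix (2026-06-02 15:18 UTC) | 26 | ### Latest observability fix (2026-06-02 15:18 UTC) |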
| 32 | 27 | ... | ... |
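
The dual-track retest named in the priority list could be expressed as a simple acceptance gate. This is a hedged sketch, not the project's actual check: `TrackResult`, `no_regression`, and `passes_dual_track` are hypothetical names, the pass criteria (no regression on either track, strict improvement on `humming_like`) are an assumed policy, and only the reference numbers come from the docs in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackResult:
    """Top-1 and top-k accuracy for one evaluation track."""
    top1: float
    topk: float


# Reference numbers quoted in the docs touched by this commit.
CLEAN_REF = TrackResult(top1=0.8571, topk=1.0)  # real-path clean closure, 35 queries
HARD_REF = TrackResult(top1=0.65, topk=0.95)    # v6 on data/synthetic_v2
HUMMING_REF_TOP1 = 0.25                         # v6 humming_like top1


def no_regression(candidate: TrackResult, ref: TrackResult, tol: float = 1e-9) -> bool:
    """True if the candidate is at least as good as the reference on both metrics."""
    return candidate.top1 >= ref.top1 - tol and candidate.topk >= ref.topk - tol


def passes_dual_track(clean: TrackResult, hard: TrackResult,
                      humming_top1: float, tol: float = 1e-9) -> bool:
    """Assumed policy: neither track regresses and humming_like top1 strictly improves."""
    return (
        no_regression(clean, CLEAN_REF, tol)
        and no_regression(hard, HARD_REF, tol)
        and humming_top1 > HUMMING_REF_TOP1 + tol
    )


# Candidate numbers below are made up purely to exercise the gate.
accepted = passes_dual_track(TrackResult(0.8571, 1.0), TrackResult(0.70, 0.95), 0.5)
rejected = passes_dual_track(TrackResult(0.80, 1.0), TrackResult(0.70, 0.95), 0.5)
print(accepted, rejected)
```

Checking both tracks in one gate guards against a merged experiment that lifts `humming_like` while quietly degrading the clean path.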