Commit d1f13203 d1f132034b28317088dcb5b94eb8460f5a3c66f3 by cnb.bofCdSsphPA

Promote cap48 guidance once the third seed confirmed the stable winner

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign
1 parent d13a3b8b
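
The three-seed aggregate cited in the `Tested:` trailer can be reproduced with a short sketch like the one below. It is not the project's aggregation script. The `TOP1` table and the `aggregate` helper are hypothetical names used for illustration. Per-seed `hybrid` values are only stated for seed=999; the other two are inferred from the reported min and max, and their assignment to the `default` and `123` seeds is an assumption. The reported `stdev_top1=0.0680` is consistent with a population standard deviation (a sample standard deviation over the same values would be about 0.0833), so the sketch uses `statistics.pstdev`.

```python
from statistics import mean, pstdev

# top1 per strategy across the three cap48 seeds (default, 123, 999).
# Assumption: only seed=999 is documented per seed; the remaining hybrid
# values come from the reported min/max, and their order is a guess.
TOP1 = {
    "high_energy": [0.9167, 0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583, 0.8750],
}


def aggregate(values):
    """Return mean/min/max/population stdev of top1, rounded to 4 decimals."""
    return {
        "mean_top1": round(mean(values), 4),
        "min_top1": min(values),
        "max_top1": max(values),
        "stdev_top1": round(pstdev(values), 4),
    }


summary = {name: aggregate(vals) for name, vals in TOP1.items()}
for name, stats in summary.items():
    print(name, stats)
```

With only three seeds and 24 queries each, a single query flips top1 by about 0.0417, so the spread is a coarse signal rather than a precise variance estimate.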
1 ## 2026-06-02 cap48 seed999 completion and three-seed aggregate checkpoint
2
3 Completed:
4 - `cap48 top2 seed=999` has finished.
5 - Final evaluation results for `high_energy` and `hybrid` are in.
6 - Aggregated the three cap48 seeds and updated the default-strategy wording.
7
8 Final results (seed=999):
9 - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
10 - `hybrid`: `num_queries=24, top1=0.8750, topk=1.0`
11 - winner: `high_energy`
12
13 cap48 three-seed aggregate:
14 - `high_energy`
15   - `mean_top1=0.9167`
16   - `min_top1=0.9167`
17   - `max_top1=0.9167`
18   - `stdev_top1=0.0`
19 - `hybrid`
20   - `mean_top1=0.8750`
21   - `min_top1=0.7917`
22   - `max_top1=0.9583`
23   - `stdev_top1=0.0680`
24
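25 Conclusion:
26 - Under the current cap48 real-FMA smoke condition, `high_energy` shows a higher and more stable top1 than `hybrid`.
27 - The default-strategy wording advances from "wait for more seeds" to:
28   - prefer `high_energy` under cap48
29   - keep `hybrid` as the optimization target and comparison baseline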
30
1 ## 2026-06-02 seed999 intermediate-result checkpoint (hybrid written to disk) 31 ## 2026-06-02 seed999 intermediate-result checkpoint (hybrid written to disk)
2 32
3 Completed: 33 Completed:
......
...@@ -62,3 +62,5 @@ cd /workspace/acr-engine ...@@ -62,3 +62,5 @@ cd /workspace/acr-engine
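62 - This commit records the fresh verification evidence so the next session does not have to repeat the investigation. 62 - This commit records the fresh verification evidence so the next session does not have to repeat the investigation.
63 63
64 - Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24` 64 - Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24`
65
66 - Added the final `seed=999` results and completed the cap48 three-seed aggregate summary.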
......
...@@ -22,9 +22,10 @@ ...@@ -22,9 +22,10 @@
22 22
23 Latest status: 23 Latest status:
24 - `hybrid` reference index is complete 24 - `hybrid` reference index is complete
25 - `hybrid` evaluation is complete; the current result is `top1=0.875 / topk=1.0 / num_queries=24` 25 - `hybrid` evaluation complete: `top1=0.875 / topk=1.0 / num_queries=24`
26 - `high_energy` is still running 26 - `high_energy` evaluation complete: `top1=0.9167 / topk=1.0 / num_queries=24`
27 - `report.json` has not been written to disk yet 27 - `report.json` has been written to disk, winner=`high_energy`
28 - cap48 three-seed aggregate is now available
28 29
29 To check: 30 To check:
30 - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` 31 - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json`
......
...@@ -216,7 +216,7 @@ ...@@ -216,7 +216,7 @@
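216 - A new session can now continue from this file and `AGENT.md`. 216 - A new session can now continue from this file and `AGENT.md`.
217 217
218 ### Current blockers 218 ### Current blockers
219 - `cap48 top2 seed=999` is still running; it is confirmed to have moved from `hybrid build-index` into `evaluate.py`, but the final `report.json` and the 3-seed aggregate conclusion have not been written back yet 219 - `cap48 top2 seed=999` has finished; the three-seed aggregate can now be computed
220 - The workspace contains many data and model artifacts; for now, commit only the specific documentation files. 220 - The workspace contains many data and model artifacts; for now, commit only the specific documentation files.
221 221
222 ### Latest verification evidence (around 2026-06-02 18:21 UTC) 222 ### Latest verification evidence (around 2026-06-02 18:21 UTC)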
...@@ -231,17 +231,19 @@ ...@@ -231,17 +231,19 @@
231 - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json` 231 - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json`
232 - Process tree confirmed to have entered: 232 - Process tree confirmed to have entered:
233 - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24` 233 - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24`
234 - As of this checkpoint: 234 - Final results (seed=999):
235 - The `hybrid` seed=999 evaluation result has been written to `hybrid/fma_reports_smoke/eval.json` 235 - `hybrid`: `num_queries=24, top1=0.875, topk=1.0`
236 - Current `hybrid` result: `num_queries=24, top1=0.875, topk=1.0` 236 - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
237 - The overall report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated yet 237 - winner: `high_energy`
238 - `high_energy` is still running and has not yet written its final `eval.json` 238 - Three-seed aggregate (cap48):
239 - `high_energy`: `mean_top1=0.9167, min=0.9167, max=0.9167, stdev=0.0`
240 - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`
239 241
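240 ### Top-priority to-dos 242 ### Top-priority to-dos
241 1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated 243 1. Design the cap64 benchmark based on the 3-seed results
242 2. If it has, compute the aggregate over the three seeds `default + 123 + 999` 244 2. Add a bucket/style-aware benchmark
243 3. Update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md` 245 3. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases
244 4. Commit and push. 246 4. Commit and push, then continue with the next round of verification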
245 247
246 ### What not to do when resuming 248 ### What not to do when resuming
247 - Do not run `git add .` 249 - Do not run `git add .`
......