Promote cap48 guidance once the third seed confirmed the stable winner

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign
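
The three-seed aggregate cited under Tested can be reproduced with a few lines of Python. This is a minimal sketch, not the project's actual aggregation script. It assumes each top1 score is an exact fraction of the 24 queries (0.9167 = 22/24, 0.7917 = 19/24, 0.8750 = 21/24, 0.9583 = 23/24). The diff does not say which of the other two seeds produced the hybrid minimum and which the maximum, so the list order below is arbitrary. The names `top1`, `aggregate`, and `summary` are hypothetical.

```python
from statistics import mean, pstdev

# Per-seed top1 scores, assumed to be correct-hits / 24 queries.
# Order within each list is arbitrary; only seed=999 is pinned down by
# the diff (high_energy 22/24, hybrid 21/24).
top1 = {
    "high_energy": [22 / 24, 22 / 24, 22 / 24],
    "hybrid": [19 / 24, 21 / 24, 23 / 24],
}


def aggregate(scores):
    """Summarize one strategy's per-seed top1 scores."""
    return {
        "mean_top1": round(mean(scores), 4),
        "min_top1": round(min(scores), 4),
        "max_top1": round(max(scores), 4),
        # Population standard deviation (divide by n, not n - 1).
        "stdev_top1": round(pstdev(scores), 4),
    }


summary = {name: aggregate(scores) for name, scores in top1.items()}
for name, stats in summary.items():
    print(name, stats)
```

Under these assumptions the population standard deviation matches the reported `stdev_top1=0.0680` for `hybrid`. The sample standard deviation (`statistics.stdev`) would instead give about 0.0833, which suggests the reported figure divides by n rather than n - 1.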

Commit d1f13203 ... d1f132034b28317088dcb5b94eb8460f5a3c66f3 authored 2026-06-02 18:29:00 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 48 additions and 13 deletions
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 cap48 seed999 completion and three-seed aggregate checkpoint
+
+Completed:
+- `cap48 top2 seed=999` has finished.
+- Final evaluation results for `high_energy` and `hybrid` are in.
+- Aggregated the three cap48 seeds and updated the default-strategy wording.
+
+Final results (seed=999):
+- `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
+- `hybrid`: `num_queries=24, top1=0.8750, topk=1.0`
+- winner: `high_energy`
+
+cap48 three-seed aggregate:
+- `high_energy`:
+  - `mean_top1=0.9167`
+  - `min_top1=0.9167`
+  - `max_top1=0.9167`
+  - `stdev_top1=0.0`
+- `hybrid`:
+  - `mean_top1=0.8750`
+  - `min_top1=0.7917`
+  - `max_top1=0.9583`
+  - `stdev_top1=0.0680`
+
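+Conclusion:
+- Under the current cap48 real-FMA smoke condition, `high_energy` has shown a higher and more stable top1 than `hybrid`.
+- The default-priority wording moves from "wait for more seeds" to:
+  - prefer `high_energy` under cap48
+  - keep `hybrid` as a target for optimization and as a comparison baseline
+
 ## 2026-06-02 seed999 intermediate-result checkpoint (hybrid written to disk)

 Completed: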
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -62,3 +62,5 @@ cd /workspace/acr-engine
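 - This commit records this fresh verification evidence so the next session does not have to repeat the investigation.

 - Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24`.
+
+- Filled in the final `seed=999` results and completed the cap48 three-seed aggregate summary.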
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
@@ -22,9 +22,10 @@

 Current status:
 - `hybrid` reference index is complete
-- `hybrid` evaluation is complete; the current result is `top1=0.875 / topk=1.0 / num_queries=24`
-- `high_energy` is still running
-- The overall `report.json` has not been written yet
+- `hybrid` evaluation complete: `top1=0.875 / topk=1.0 / num_queries=24`
+- `high_energy` evaluation complete: `top1=0.9167 / topk=1.0 / num_queries=24`
+- The overall `report.json` has been written; winner=`high_energy`
+- The cap48 three-seed aggregate is now available

 To check:
 - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json`
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -216,7 +216,7 @@
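 - A new session can continue from this file and `AGENT.md`.

 ### Current blockers
-- `cap48 top2 seed=999` is still running; it is confirmed to have moved from `hybrid build-index` into `evaluate.py`, but the final `report.json` and the 3-seed aggregate conclusion have not been written back yet.
+- `cap48 top2 seed=999` has finished, and the three-seed aggregate can now be computed.
 - The workspace contains a large volume of data and model artifacts; for now, commit only the specific documentation files.

 ### Latest verification evidence (around 2026-06-02 18:21 UTC)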
@@ -231,17 +231,19 @@
  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json`
 - The process tree is confirmed to have entered:
  - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24`
-- As of this checkpoint:
-  - The `hybrid` seed=999 evaluation result has been written to `hybrid/fma_reports_smoke/eval.json`
-  - Current `hybrid` result: `num_queries=24, top1=0.875, topk=1.0`
-  - The overall report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated yet
-  - `high_energy` is still running and has not yet written its final `eval.json`
+- Final results (seed=999):
+  - `hybrid`: `num_queries=24, top1=0.875, topk=1.0`
+  - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
+  - winner: `high_energy`
+- Three-seed aggregate (cap48):
+  - `high_energy`: `mean_top1=0.9167, min=0.9167, max=0.9167, stdev=0.0`
+  - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`

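 ### Top-priority to-dos
-1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated.
-2. If it has, compute the aggregate over the three seeds `default + 123 + 999`.
-3. Update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`.
-4. Commit and push.
+1. Design the cap64 benchmark based on the 3-seed results.
+2. Add a bucket/style-aware benchmark.
+3. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.
+4. Commit and push, then continue with the next round of validation.

 ### What not to do when resuming
 - Do not run `git add .`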