Commit d1f13203 d1f132034b28317088dcb5b94eb8460f5a3c66f3 by cnb.bofCdSsphPA

Promote cap48 guidance once the third seed confirmed the stable winner

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign
1 parent d13a3b8b
## 2026-06-02 cap48 seed999 completion and three-seed aggregate checkpoint
Completed:
- `cap48 top2 seed=999` has finished.
- Final evaluation results for `high_energy` and `hybrid` are in.
- Aggregated the three cap48 seeds and updated the default-strategy wording.
Final results (seed=999):
- `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
- `hybrid`: `num_queries=24, top1=0.8750, topk=1.0`
- winner: `high_energy`
cap48 three-seed aggregate:
- `high_energy`
  - `mean_top1=0.9167`
  - `min_top1=0.9167`
  - `max_top1=0.9167`
  - `stdev_top1=0.0`
- `hybrid`
  - `mean_top1=0.8750`
  - `min_top1=0.7917`
  - `max_top1=0.9583`
  - `stdev_top1=0.0680`
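
The aggregate figures above can be reproduced from per-seed top1 scores. The following is a minimal sketch, not the project's actual aggregation script. `aggregate_top1` is a hypothetical helper. The per-seed scores are reconstructed as correct-query counts out of 24 from the reported min, mean, and max; the source does not say which of the two non-999 seeds produced which `hybrid` score. Population standard deviation is an assumption, chosen because it reproduces the reported `stdev_top1=0.0680`.

```python
from statistics import mean, pstdev


def aggregate_top1(scores):
    """Summarize per-seed top1 scores (hypothetical helper, not from the repo)."""
    return {
        "mean_top1": round(mean(scores), 4),
        "min_top1": round(min(scores), 4),
        "max_top1": round(max(scores), 4),
        # Assumption: population stdev (divide by N), which matches the
        # reported 0.0680; sample stdev (divide by N - 1) would be larger.
        "stdev_top1": round(pstdev(scores), 4),
    }


NUM_QUERIES = 24

# Reconstructed scores: 22/24 = 0.9167 on every seed for high_energy.
high_energy = [22 / NUM_QUERIES] * 3
# Reconstructed scores for hybrid: 0.7917, 0.8750 (seed=999), 0.9583.
# The order of the first and last value across seeds is unknown.
hybrid = [19 / NUM_QUERIES, 21 / NUM_QUERIES, 23 / NUM_QUERIES]

print("high_energy", aggregate_top1(high_energy))
print("hybrid", aggregate_top1(hybrid))
```

With three seeds the choice between population and sample standard deviation changes the number noticeably, so it is worth stating which one a report uses.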
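Conclusions:
- Under the current cap48 real-FMA smoke condition, `high_energy` shows a higher and more stable top1 than `hybrid`.
- The default-strategy wording moves from "wait for more seeds" to:
  - prefer `high_energy` under cap48
  - keep `hybrid` as an optimization target and comparison baseline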
## 2026-06-02 seed999 intermediate-result checkpoint (`hybrid` written to disk)
Completed:
......
......@@ -62,3 +62,5 @@ cd /workspace/acr-engine
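- This commit records the fresh verification evidence so that the next session does not have to repeat the investigation.
- Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24`
- Filled in the final `seed=999` results and completed the cap48 three-seed aggregate summary.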
......
......@@ -22,9 +22,10 @@
Latest status:
- `hybrid` reference index is complete
- `hybrid` evaluation is complete; current result: `top1=0.875 / topk=1.0 / num_queries=24`
- `high_energy` is still running
- `report.json` has not been written yet
- `hybrid` evaluation complete: `top1=0.875 / topk=1.0 / num_queries=24`
- `high_energy` evaluation complete: `top1=0.9167 / topk=1.0 / num_queries=24`
- `report.json` has been written, winner=`high_energy`
- cap48 three-seed aggregate is now available
To check:
- `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json`
......
......@@ -216,7 +216,7 @@
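- A new session can now continue from this file and `AGENT.md`.
### Current blockers
- `cap48 top2 seed=999` is still running; it is confirmed to have moved from `hybrid build-index` into `evaluate.py`, but the final `report.json` and the 3-seed aggregate conclusion have not been written back yet
- `cap48 top2 seed=999` has finished; the three-seed aggregate can now be computed
- The workspace contains a large amount of data and model artifacts; for now, commit only the documentation files and add them explicitly by path.
### Latest verification evidence (around 2026-06-02 18:21 UTC)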
......@@ -231,17 +231,19 @@
- `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json`
- The process tree is confirmed to have entered:
  - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24`
- As of this checkpoint:
  - The `hybrid` seed=999 evaluation result has been written to `hybrid/fma_reports_smoke/eval.json`
  - Current `hybrid` result: `num_queries=24, top1=0.875, topk=1.0`
  - The overall report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated yet
  - `high_energy` is still running and has not written its final `eval.json`
- Final results (seed=999):
  - `hybrid`: `num_queries=24, top1=0.875, topk=1.0`
  - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0`
  - winner: `high_energy`
- Three-seed aggregate (cap48):
  - `high_energy`: `mean_top1=0.9167, min=0.9167, max=0.9167, stdev=0.0`
  - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`
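
The per-seed winner reported above can be derived by comparing the `top1` value in each strategy's `eval.json`. The sketch below is illustrative only. `pick_winner` is a hypothetical helper, the JSON keys `num_queries`, `top1`, and `topk` are assumed from the way results are quoted in these notes, and the demo writes its own temporary files rather than reading the real `/tmp/ab_smoke_seg_cap48_top2_seed999/...` outputs.

```python
import json
import tempfile
from pathlib import Path


def pick_winner(eval_paths):
    """Return (winner, scores), where scores maps strategy name to top1.

    Hypothetical helper: assumes each eval.json has a numeric "top1" key.
    On a tie, the strategy listed first in eval_paths wins.
    """
    scores = {}
    for name, path in eval_paths.items():
        with open(path, encoding="utf-8") as fh:
            scores[name] = json.load(fh)["top1"]
    winner = max(scores, key=scores.get)
    return winner, scores


# Demo using the seed=999 numbers quoted in the notes, written to temp files.
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, top1 in {"high_energy": 0.9167, "hybrid": 0.875}.items():
        path = Path(tmp) / name / "eval.json"
        path.parent.mkdir(parents=True)
        payload = {"num_queries": 24, "top1": top1, "topk": 1.0}
        path.write_text(json.dumps(payload), encoding="utf-8")
        paths[name] = path
    winner, scores = pick_winner(paths)
    print(winner, scores)
```

Comparing on `top1` alone is a simplification; both strategies report `topk=1.0` here, so `topk` would not break a tie.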
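### Top-priority to-dos
1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated
2. If it has, compute the aggregate over the three seeds `default + 123 + 999`
3. Update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`
4. Commit and push.
1. Design the cap64 benchmark based on the 3-seed results
2. Add a bucket/style-aware benchmark
3. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases
4. Commit and push, then continue with the next verification round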
### Things not to do when resuming
- Do not run `git add .`
......