Promote cap48 guidance once the third seed confirmed the stable winner

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign
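
The three-seed aggregate quoted in this commit can be re-derived with a few lines of Python. This is a minimal sketch, not the project's own aggregation script: the function name `aggregate_top1` is hypothetical, the per-seed hit counts (22/24 for `high_energy`; 19/24, 21/24 and 23/24 for `hybrid`) are inferred from the rounded top1 rates with `num_queries=24`, and the order of the two non-999 `hybrid` values across the other seeds is not stated anywhere in the diff. The population standard deviation is assumed because it matches the reported `0.0680`; the sample standard deviation of the same values would be about `0.0833`.

```python
from statistics import mean, pstdev


def aggregate_top1(values):
    """Summarize per-seed top1 rates (hypothetical helper, not from the repo)."""
    return {
        "mean_top1": round(mean(values), 4),
        "min_top1": round(min(values), 4),
        "max_top1": round(max(values), 4),
        # Population stdev: an assumption chosen because it reproduces 0.0680.
        "stdev_top1": round(pstdev(values), 4),
    }


# Per-seed top1 rates, inferred as hits / 24 queries (assumption).
high_energy = [22 / 24, 22 / 24, 22 / 24]
# 21/24 = 0.875 is the seed=999 result; which of the other two seeds
# produced 19/24 and which produced 23/24 is not stated in the source.
hybrid = [19 / 24, 21 / 24, 23 / 24]

if __name__ == "__main__":
    print("high_energy", aggregate_top1(high_energy))
    print("hybrid", aggregate_top1(hybrid))
```

Under these assumptions the output agrees with the figures in the diff below: `high_energy` has identical mean, min and max with zero spread, while `hybrid` averages 0.875 with a spread of roughly 0.068.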
Showing 4 changed files with 48 additions and 13 deletions
| 1 | ## 2026-06-02 cap48 seed999 completion and three-seed aggregate checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - `cap48 top2 seed=999` has finished. | ||
| 5 | - Final evaluation results for `high_energy` and `hybrid` are in. | ||
| 6 | - Completed the aggregate summary over the three cap48 seeds and updated the default-strategy wording. | ||
| 7 | |||
| 8 | Final results (seed=999): | ||
| 9 | - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0` | ||
| 10 | - `hybrid`: `num_queries=24, top1=0.8750, topk=1.0` | ||
| 11 | - winner: `high_energy` | ||
| 12 | |||
| 13 | cap48 three-seed aggregate: | ||
| 14 | - `high_energy`: | ||
| 15 | - `mean_top1=0.9167` | ||
| 16 | - `min_top1=0.9167` | ||
| 17 | - `max_top1=0.9167` | ||
| 18 | - `stdev_top1=0.0` | ||
| 19 | - `hybrid`: | ||
| 20 | - `mean_top1=0.8750` | ||
| 21 | - `min_top1=0.7917` | ||
| 22 | - `max_top1=0.9583` | ||
| 23 | - `stdev_top1=0.0680` | ||
| 24 | |||
| 25 | Conclusions: | ||
| 26 | - Under the current cap48 real-FMA smoke condition, `high_energy` shows a higher and more stable top1 than `hybrid`. | ||
| 27 | - The default-strategy wording moves from "wait for more seeds" to: | ||
| 28 | - prefer `high_energy` under the cap48 condition | ||
| 29 | - keep `hybrid` as a target for optimization and as a comparison baseline | ||
| 30 | |||
| 1 | ## 2026-06-02 seed999 intermediate-result checkpoint (hybrid written to disk) | 31 | ## 2026-06-02 seed999 intermediate-result checkpoint (hybrid written to disk) |
| 2 | 32 | ||
| 3 | Completed: | 33 | Completed: | ... | ... |
| ... | @@ -62,3 +62,5 @@ cd /workspace/acr-engine | ... | @@ -62,3 +62,5 @@ cd /workspace/acr-engine |
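| 62 | - This commit records this fresh verification evidence so that the next session does not have to repeat the investigation. | 62 | - This commit records this fresh verification evidence so that the next session does not have to repeat the investigation. |
| 63 | 63 | ||
| 64 | - Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24`. | 64 | - Recorded the intermediate `hybrid` seed=999 result: `top1=0.875 / topk=1.0 / num_queries=24`. |
| 65 | |||
| 66 | - Added the final `seed=999` results and completed the cap48 three-seed aggregate summary. | ... | ... |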
| ... | @@ -22,9 +22,10 @@ | ... | @@ -22,9 +22,10 @@ |
| 22 | 22 | ||
| 23 | Latest status: | 23 | Latest status: |
| 24 | - `hybrid` reference index is complete | 24 | - `hybrid` reference index is complete |
| 25 | - `hybrid` evaluation is complete; the current result is `top1=0.875 / topk=1.0 / num_queries=24` | 25 | - `hybrid` evaluation complete: `top1=0.875 / topk=1.0 / num_queries=24` |
| 26 | - `high_energy` is still running | 26 | - `high_energy` evaluation complete: `top1=0.9167 / topk=1.0 / num_queries=24` |
| 27 | - The overall `report.json` has not been written to disk yet | 27 | - The overall `report.json` has been written to disk, winner=`high_energy` |
| 28 | - The cap48 three-seed aggregate is now available | ||
| 28 | 29 | ||
| 29 | To check: | 30 | To check: |
| 30 | - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` | 31 | - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` | ... | ... |
| ... | @@ -216,7 +216,7 @@ | ... | @@ -216,7 +216,7 @@ |
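| 216 | - A new session can now continue from this file and `AGENT.md`. | 216 | - A new session can now continue from this file and `AGENT.md`. |
| 217 | 217 | ||
| 218 | ### Current blockers | 218 | ### Current blockers |
| 219 | - `cap48 top2 seed=999` is still running; it has been confirmed to have moved from `hybrid build-index` into `evaluate.py`, but the final `report.json` and the 3-seed aggregate conclusion have not been written back yet. | 219 | - `cap48 top2 seed=999` is complete, and the three-seed aggregate can now be computed. |
| 220 | - The workspace contains a large amount of data and model artifacts; for now, commit only the specific documentation files. | 220 | - The workspace contains a large amount of data and model artifacts; for now, commit only the specific documentation files. |
| 221 | 221 | ||
| 222 | ### Latest verification evidence (around 2026-06-02 18:21 UTC) | 222 | ### Latest verification evidence (around 2026-06-02 18:21 UTC) |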
| ... | @@ -231,17 +231,19 @@ | ... | @@ -231,17 +231,19 @@ |
| 231 | - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json` | 231 | - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json` |
| 232 | - The process tree is confirmed to have entered: | 232 | - The process tree is confirmed to have entered: |
| 233 | - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24` | 233 | - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24` |
| 234 | - As of this checkpoint: | 234 | - Final results (seed=999): |
| 235 | - The `hybrid` seed=999 evaluation result has been written to `hybrid/fma_reports_smoke/eval.json` | 235 | - `hybrid`: `num_queries=24, top1=0.875, topk=1.0` |
| 236 | - Current `hybrid` result: `num_queries=24, top1=0.875, topk=1.0` | 236 | - `high_energy`: `num_queries=24, top1=0.9167, topk=1.0` |
| 237 | - The overall report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated yet | 237 | - winner: `high_energy` |
| 238 | - `high_energy` is still running and has not yet written its final `eval.json` | 238 | - Three-seed aggregate (cap48): |
| 239 | - `high_energy`: `mean_top1=0.9167, min=0.9167, max=0.9167, stdev=0.0` | ||
| 240 | - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680` | ||
| 239 | 241 | ||
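| 240 | ### Top-priority to-dos | 242 | ### Top-priority to-dos |
| 241 | 1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated. | 243 | 1. Continue designing the cap64 benchmark based on the 3-seed results. |
| 242 | 2. If it has, compute the aggregate over the three seeds `default + 123 + 999`. | 244 | 2. Add a bucket/style-aware benchmark. |
| 243 | 3. Update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`. | 245 | 3. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases. |
| 244 | 4. Commit and push. | 246 | 4. After committing and pushing, continue with the next round of verification. |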
| 245 | 247 | ||
| 246 | ### What not to do when resuming | 248 | ### What not to do when resuming |
| 247 | - Do not run `git add .` | 249 | - Do not run `git add .` | ... | ... |