Commit e49dc0b9 e49dc0b9de5fc6a4f0de81c2f4f0c386eee1469d by cnb.bofCdSsphPA

Record the cap64 reversal once the larger benchmark finished

Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
Confidence: high
Scope-risk: moderate
Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design
1 parent 8f2e6016
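
The top1 figures quoted in the `Tested:` line can be turned back into hit counts to show how large the cap64 gap is in absolute terms. The snippet below is a minimal sketch, not the project's `evaluate.py`: it uses only the numbers reported in this commit (32 queries, top1 0.875 and 0.625) and assumes the winner is simply the strategy with the higher top1.

```python
# Figures copied from this commit's "Tested:" line and cap64 changelog entry.
NUM_QUERIES = 32
top1 = {"hybrid": 0.875, "high_energy": 0.625}

# Convert each top-1 rate into a whole number of correctly matched queries.
hits = {name: round(rate * NUM_QUERIES) for name, rate in top1.items()}

# Assumed winner rule: highest top-1 rate wins.
winner = max(top1, key=top1.get)

# Absolute gap in queries between the two strategies.
gap = hits["hybrid"] - hits["high_energy"]

# One query moves top1 by this much at this sample size.
step = 1 / NUM_QUERIES

print(hits, winner, gap, step)
```

With 32 queries, one query is worth 0.03125 of top1, so the reported lead corresponds to 8 queries on a single seed. That is consistent with the `Not-tested:` line listing multi-seed cap64 aggregates as still open.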
+## 2026-06-02 cap64 completion checkpoint
+
+Completed:
+- The cap64 comparison on real FMA data has finished.
+- The final evaluation results for `high_energy` and `hybrid`, and the winner, are in hand.
+
+Final results (cap64, seed=42):
+- `hybrid`: `num_queries=32, top1=0.8750, topk=1.0`
+- `high_energy`: `num_queries=32, top1=0.6250, topk=1.0`
+- winner: `hybrid`
+
+Conclusions:
+- cap64 and cap48 give different conclusions:
+- cap48, across three seeds: `high_energy` is more stable and leads
+- cap64, currently a single seed: `hybrid` leads clearly
+- This shows that the choice of default strategy now depends on subset size / data composition.
+- The next step must be:
+- bucket/style-aware benchmark
+- more systematic hard-case / genre-bucket evaluation
+
 ## 2026-06-02 cap64 hybrid index complete, evaluation started checkpoint
 
 Completed:
......
@@ -78,3 +78,5 @@ cd /workspace/acr-engine
 - Added fresh cap64 evidence: the run session confirms that `hybrid` completed `Epoch 1/1` in full.
 
 - Added fresh cap64 evidence: the `hybrid` reference index is complete (`64 refs / 657 windows / 192-d`) and the run has entered `evaluate.py`.
+
+- Added the final cap64 result: `hybrid=0.875`, `high_energy=0.625`, winner=`hybrid`.
......
@@ -61,5 +61,6 @@ test -f /tmp/ab_smoke_seg_cap48_top2_seed999/report.json && cat /tmp/ab_smoke_se
 
 - New benchmark: `/tmp/ab_smoke_seg_cap64_top2`
 - Current stage: `high_energy` evaluation is complete, with `top1=0.625 / topk=1.0 / num_queries=32`
-- The `hybrid` index is complete and the run is now in the evaluate stage
-- The next session should first check the `hybrid` result and whether `report.json` has been generated
+- cap64 is complete; results: `hybrid=0.875`, `high_energy=0.625`
+- cap64 winner=`hybrid`
+- The next session should move to the bucket/style-aware benchmark first
......
@@ -240,10 +240,10 @@
 - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`
 
 ### Top-priority to-dos
-1. Follow up on the cap64 `hybrid` result and the final `/tmp/ab_smoke_seg_cap64_top2/report.json`
-2. Once cap64 completes, update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`
-3. Then add the bucket/style-aware benchmark
-4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability
+1. Design and launch the bucket/style-aware benchmark
+2. Compare the cap48 and cap64 disagreement and add per-scale conclusions
+3. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability
+4. Continue committing and pushing against the new benchmark baseline
 
 ### What not to do when resuming
 - Do not run `git add .`
@@ -675,8 +675,9 @@ seed123 final conclusion:
 - Started: `/tmp/ab_smoke_seg_cap64_top2`
 - Config: `subset_size=64`, `max_test_queries=32`, `seed=42`
 - Latest evidence:
-- `high_energy` evaluation complete: `num_queries=32, top1=0.625, topk=1.0`
-- `hybrid` reference index complete: `64 refs / 657 windows / 192-d`
-- `hybrid` has entered `evaluate.py`
-- `report.json` has not been generated yet
+- cap64 is complete:
+- `hybrid`: `num_queries=32, top1=0.875, topk=1.0`
+- `high_energy`: `num_queries=32, top1=0.625, topk=1.0`
+- cap64 winner: `hybrid`
+- The conclusions now disagree across subset sizes, so the bucket/style-aware benchmark must go ahead
 
......
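
The `hybrid` aggregate quoted in this diff (`mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`) does not say which standard-deviation convention it uses. The check below is a rough sketch under one loud assumption: that the aggregate summarizes exactly three per-seed top1 values, which this diff does not state for that line. The middle value is inferred from the mean and is not read from any report.

```python
import statistics

# Values copied from the aggregate line in this diff.
mean_top1, lowest, highest = 0.8750, 0.7917, 0.9583

# ASSUMPTION: exactly three seeds, so the remaining value follows from the mean.
middle = 3 * mean_top1 - lowest - highest
values = [lowest, middle, highest]

# Population standard deviation divides by n; sample divides by n - 1.
population_sd = statistics.pstdev(values)
sample_sd = statistics.stdev(values)

print(round(population_sd, 4), round(sample_sd, 4))
```

Under that assumption the population form reproduces the logged `0.0680`, while the sample form gives about `0.0833`. If multi-seed cap64 aggregates are added later, using the same convention would keep the cap48 and cap64 spreads comparable.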