Record the cap64 reversal once the larger benchmark finished
Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
Confidence: high
Scope-risk: moderate
Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design
Showing 4 changed files with 34 additions and 10 deletions
| 1 | ## 2026-06-02 cap64 completion checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - The cap64 comparison on real FMA data is complete. | ||
| 5 | - The final evaluation results for `high_energy` and `hybrid`, and the winner, are now available. | ||
| 6 | |||
| 7 | Final results (cap64, seed=42): | ||
| 8 | - `hybrid`: `num_queries=32, top1=0.8750, topk=1.0` | ||
| 9 | - `high_energy`: `num_queries=32, top1=0.6250, topk=1.0` | ||
| 10 | - winner: `hybrid` | ||
| 11 | |||
| 12 | Conclusions: | ||
| 13 | - cap64 and cap48 lead to different conclusions: | ||
| 14 | - cap48, across three seeds: `high_energy` is more stable and leads | ||
| 15 | - cap64, currently a single seed: `hybrid` leads clearly | ||
| 16 | - This shows that the choice of default strategy now depends on subset size / data composition. | ||
| 17 | - The next step must be: | ||
| 18 | - bucket/style-aware benchmark | ||
| 19 | - more systematic hard-case / genre-bucket evaluation | ||
| 20 | |||
| 1 | ## 2026-06-02 cap64 hybrid index built and evaluation started checkpoint | 21 | ## 2026-06-02 cap64 hybrid index built and evaluation started checkpoint |
| 2 | 22 | ||
| 3 | Completed: | 23 | Completed: | ... | ... |
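The winner call recorded in the checkpoint above can be reproduced with a few lines of Python. This is a minimal sketch, not the project's actual reporting code: the dictionary layout only mirrors the `num_queries` / `top1` / `topk` fields quoted in the notes, and `pick_winner` is a hypothetical helper name introduced for illustration.

```python
def pick_winner(results):
    """Return (winner_name, top1_margin) for a dict of strategy -> metrics.

    Assumes each metrics dict carries a "top1" accuracy; the real
    eval.json schema may differ.
    """
    ranked = sorted(results.items(), key=lambda kv: kv[1]["top1"], reverse=True)
    (best_name, best), (_, runner_up) = ranked[0], ranked[1]
    return best_name, best["top1"] - runner_up["top1"]


# Values copied from the cap64, seed=42 checkpoint.
cap64 = {
    "hybrid": {"num_queries": 32, "top1": 0.875, "topk": 1.0},
    "high_energy": {"num_queries": 32, "top1": 0.625, "topk": 1.0},
}

winner, margin = pick_winner(cap64)
extra_hits = round(margin * cap64[winner]["num_queries"])
print(winner, margin, extra_hits)
```

With 32 queries, the 0.25 gap in top1 corresponds to 8 additional correct top-1 matches for `hybrid` (28 versus 20).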
| ... | @@ -78,3 +78,5 @@ cd /workspace/acr-engine | ... | @@ -78,3 +78,5 @@ cd /workspace/acr-engine |
| 78 | - Added fresh cap64 evidence: the run session confirms that `hybrid` completed `Epoch 1/1` in full. | 78 | - Added fresh cap64 evidence: the run session confirms that `hybrid` completed `Epoch 1/1` in full. |
| 79 | 79 | ||
| 80 | - Added fresh cap64 evidence: the `hybrid` reference index is complete (`64 refs / 657 windows / 192-d`) and the run has entered `evaluate.py`. | 80 | - Added fresh cap64 evidence: the `hybrid` reference index is complete (`64 refs / 657 windows / 192-d`) and the run has entered `evaluate.py`. |
| 81 | |||
| 82 | - Added the final cap64 results: `hybrid=0.875`, `high_energy=0.625`, winner=`hybrid`. | ... | ... |
| ... | @@ -61,5 +61,6 @@ test -f /tmp/ab_smoke_seg_cap48_top2_seed999/report.json && cat /tmp/ab_smoke_se | ... | @@ -61,5 +61,6 @@ test -f /tmp/ab_smoke_seg_cap48_top2_seed999/report.json && cat /tmp/ab_smoke_se |
| 61 | 61 | ||
| 62 | - New benchmark: `/tmp/ab_smoke_seg_cap64_top2` | 62 | - New benchmark: `/tmp/ab_smoke_seg_cap64_top2` |
| 63 | - Current stage: `high_energy` evaluation is complete, with `top1=0.625 / topk=1.0 / num_queries=32` | 63 | - Current stage: `high_energy` evaluation is complete, with `top1=0.625 / topk=1.0 / num_queries=32` |
| 64 | - The `hybrid` index is complete; the run is now in the evaluate stage | 64 | - cap64 is complete; results: `hybrid=0.875`, `high_energy=0.625` |
| 65 | - The next session should first check whether the `hybrid` results and `report.json` have been generated | 65 | - cap64 winner=`hybrid` |
| 66 | - The next session should prioritize the bucket/style-aware benchmark | ... | ... |
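Because the cap64 numbers come from a single seed and 32 queries, it can help to see roughly how much sampling noise a top1 value carries at that size. The sketch below uses the standard binomial standard error. It is a rough guide only: it treats queries as independent trials and the two strategies as independent samples, which is an assumption and probably too simple if both strategies are scored on the same query set. The function names are hypothetical.

```python
import math


def top1_standard_error(top1, num_queries):
    """Binomial standard error of a top-1 accuracy estimate."""
    return math.sqrt(top1 * (1.0 - top1) / num_queries)


def gap_in_standard_errors(top1_a, top1_b, num_queries):
    """Gap between two accuracies, in units of the combined standard error.

    Assumes both strategies were scored on num_queries independent queries.
    """
    se_gap = math.sqrt(
        top1_standard_error(top1_a, num_queries) ** 2
        + top1_standard_error(top1_b, num_queries) ** 2
    )
    return (top1_a - top1_b) / se_gap


# Values copied from the cap64, seed=42 results.
se_hybrid = top1_standard_error(0.875, 32)
se_high_energy = top1_standard_error(0.625, 32)
gap = gap_in_standard_errors(0.875, 0.625, 32)
print(f"hybrid SE={se_hybrid:.4f}, high_energy SE={se_high_energy:.4f}, gap={gap:.2f} SE")
```

Under these assumptions the gap is a little over two standard errors, which is consistent with treating the cap64 result as a clear lead while still listing multi-seed cap64 aggregates as not yet tested.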
| ... | @@ -240,10 +240,10 @@ | ... | @@ -240,10 +240,10 @@ |
| 240 | - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680` | 240 | - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680` |
| 241 | 241 | ||
| 242 | ### Top-priority to-dos | 242 | ### Top-priority to-dos |
| 243 | 1. Follow up on the cap64 `hybrid` results and the final `/tmp/ab_smoke_seg_cap64_top2/report.json`. | 243 | 1. Design and launch the bucket/style-aware benchmark. |
| 244 | 2. Once cap64 finishes, update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`. | 244 | 2. Compare the inconsistent cap48 and cap64 outcomes and add per-scale conclusions. |
| 245 | 3. Then add the bucket/style-aware benchmark. | 245 | 3. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. |
| 246 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | 246 | 4. Continue committing and pushing against the new benchmark baseline. |
| 247 | 247 | ||
| 248 | ### What not to do when resuming | 248 | ### What not to do when resuming |
| 249 | - Do not run `git add .` | 249 | - Do not run `git add .` |
| ... | @@ -675,8 +675,9 @@ seed123 final conclusion: | ... | @@ -675,8 +675,9 @@ seed123 final conclusion: |
| 675 | - Started: `/tmp/ab_smoke_seg_cap64_top2` | 675 | - Started: `/tmp/ab_smoke_seg_cap64_top2` |
| 676 | - Config: `subset_size=64`, `max_test_queries=32`, `seed=42` | 676 | - Config: `subset_size=64`, `max_test_queries=32`, `seed=42` |
| 677 | - Latest evidence: | 677 | - Latest evidence: |
| 678 | - `high_energy` evaluation complete: `num_queries=32, top1=0.625, topk=1.0` | 678 | - cap64 complete: |
| 679 | - `hybrid` reference index complete: `64 refs / 657 windows / 192-d` | 679 | - `hybrid`: `num_queries=32, top1=0.875, topk=1.0` |
| 680 | - `hybrid` has now entered `evaluate.py` | 680 | - `high_energy`: `num_queries=32, top1=0.625, topk=1.0` |
| 681 | - The overall `report.json` has not been generated yet | 681 | - cap64 winner: `hybrid` |
| 682 | - The conclusion is now "winners are inconsistent across subset sizes"; bucket/style-aware benchmarking must continue | ||
| 682 | 683 | ... | ... |
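The directive in the commit message (no single global default while cap48 and cap64 disagree) can be expressed as a small guard. This is an illustrative sketch only: `global_default` is a hypothetical helper, and the per-cap winners are the ones stated in the notes (cap48 over three seeds, cap64 over one seed).

```python
def global_default(winners_by_cap):
    """Return the shared winner if every subset size agrees, else None.

    None signals that no global default should be documented until
    bucketed benchmarks explain the divergence.
    """
    distinct = set(winners_by_cap.values())
    if len(distinct) == 1:
        return next(iter(distinct))
    return None


# Winners as recorded in the notes.
winners = {"cap48": "high_energy", "cap64": "hybrid"}

default = global_default(winners)
if default is None:
    print("No global default: winners diverge across subset sizes", winners)
else:
    print("Global default:", default)
```

Returning `None` rather than falling back to the most frequent winner keeps the divergence visible to whoever reads the docs next.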