Record the cap64 reversal once the larger benchmark finished

Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
Confidence: high
Scope-risk: moderate
Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design
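
The cap64 figures come from a single seed with 32 queries, so it may help to see how much uncertainty that leaves. The sketch below is only an illustration, not part of the repository: it converts the reported top-1 rates back into hit counts and computes a Wilson score interval for each. The helper names `hits_from_rate` and `wilson_interval` are hypothetical, and the only inputs are the rates and `num_queries=32` quoted in this commit.

```python
import math


def hits_from_rate(top1: float, num_queries: int) -> int:
    """Convert a reported top-1 rate back into a whole number of correct queries."""
    return round(top1 * num_queries)


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Rates reported for the cap64 run (seed=42, num_queries=32).
results = {"hybrid": 0.875, "high_energy": 0.625}
n = 32

summary = {}
for name, top1 in results.items():
    h = hits_from_rate(top1, n)
    summary[name] = (h, wilson_interval(h, n))

for name, (h, (low, high)) in summary.items():
    print(f"{name}: {h}/{n} correct, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

The rates correspond to 28/32 and 20/32 correct queries. At this sample size both intervals are wide and they overlap, which is consistent with listing multi-seed cap64 aggregates under `Not-tested`.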

Commit e49dc0b9 ... e49dc0b9de5fc6a4f0de81c2f4f0c386eee1469d authored 2026-06-02 18:44:58 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 34 additions and 10 deletions
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
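+## 2026-06-02 cap64 completion checkpoint
+
+Completed:
+- The cap64 comparison on real FMA data has finished.
+- Final evaluation results and the winner are now available for `high_energy` and `hybrid`.
+
+Final results (cap64, seed=42):
+- `hybrid`: `num_queries=32, top1=0.8750, topk=1.0`
+- `high_energy`: `num_queries=32, top1=0.6250, topk=1.0`
+- winner: `hybrid`
+
+Conclusions:
+- cap64 and cap48 reach different conclusions:
+  - cap48, across three seeds: `high_energy` is more stable and leads
+  - cap64, single seed so far: `hybrid` leads clearly
+- This means the choice of default strategy now depends on subset size and data composition.
+- Next steps must include:
+  - bucket/style-aware benchmark
+  - more systematic hard-case / genre-bucket evaluation
+
 ## 2026-06-02 cap64 hybrid index complete, evaluation started checkpoint

 Completed: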
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -78,3 +78,5 @@ cd /workspace/acr-engine
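 - Added fresh cap64 evidence: the run session confirms that `hybrid` completed `Epoch 1/1` in full.

 - Added fresh cap64 evidence: the `hybrid` reference index is complete (`64 refs / 657 windows / 192-d`) and the run has entered `evaluate.py`.
+
+- Added the final cap64 results: `hybrid=0.875`, `high_energy=0.625`, winner=`hybrid`.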
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
@@ -61,5 +61,6 @@ test -f /tmp/ab_smoke_seg_cap48_top2_seed999/report.json && cat /tmp/ab_smoke_se

 - New benchmark: `/tmp/ab_smoke_seg_cap64_top2`
 - Current stage: `high_energy` evaluation is complete, with `top1=0.625 / topk=1.0 / num_queries=32`
- The `hybrid` index is complete and the run is now in the evaluate stage
- The next session should first check whether the `hybrid` result and `report.json` have been generated
+- cap64 is complete. Results: `hybrid=0.875`, `high_energy=0.625`
+- cap64 winner=`hybrid`
+- The next session should start with the bucket/style-aware benchmark
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -240,10 +240,10 @@
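  - `hybrid`: `mean_top1=0.8750, min=0.7917, max=0.9583, stdev=0.0680`

 ### Top-priority to-dos
-1. Follow up on the cap64 `hybrid` result and the final `/tmp/ab_smoke_seg_cap64_top2/report.json`.
-2. After cap64 completes, update `open-dataset-workflow.md / session-handoff.md / CHANGELOG.md`.
-3. Then add the bucket/style-aware benchmark.
-4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.
+1. Design and launch the bucket/style-aware benchmark.
+2. Compare the cap48 and cap64 disagreement and add per-scale conclusions.
+3. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.
+4. Continue committing and pushing against the new benchmark baseline.

 ### What not to do when resuming
 - Do not run `git add .`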
@@ -675,8 +675,9 @@ seed123 final conclusion:
 - Started: `/tmp/ab_smoke_seg_cap64_top2`
 - Config: `subset_size=64`, `max_test_queries=32`, `seed=42`
 - Latest evidence:
-  - `high_energy` evaluation complete: `num_queries=32, top1=0.625, topk=1.0`
-  - `hybrid` reference index complete: `64 refs / 657 windows / 192-d`
-  - `hybrid` has entered `evaluate.py`
-  - The overall `report.json` has not been generated yet
+  - cap64 is complete:
+    - `hybrid`: `num_queries=32, top1=0.875, topk=1.0`
+    - `high_energy`: `num_queries=32, top1=0.625, topk=1.0`
+  - cap64 winner: `hybrid`
+  - The current conclusion is that results disagree across subset sizes, so the bucket/style-aware benchmark must continue
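
The `mean_top1 / min / max / stdev` summary quoted for `hybrid` in `docs/session-handoff.md` is the kind of aggregate that multi-seed cap64 runs would also need. The sketch below shows one way such a line could be computed. It is an assumption-laden illustration: the function name `summarize` is hypothetical, and the per-seed values (19/24, 21/24, 23/24) are not taken from the repository. They were chosen only because, assuming three runs, they are consistent with the quoted mean, min, and max, and they reproduce the quoted `stdev` when the population standard deviation is used.

```python
from statistics import mean, pstdev


def summarize(top1_values: list[float]) -> dict[str, float]:
    """Aggregate per-seed top-1 rates into the summary fields used in the handoff notes."""
    return {
        "mean_top1": round(mean(top1_values), 4),
        "min": round(min(top1_values), 4),
        "max": round(max(top1_values), 4),
        # Population standard deviation (divides by N, not N - 1).
        "stdev": round(pstdev(top1_values), 4),
    }


# Hypothetical per-seed rates, assuming three runs of 24 queries each.
hybrid_top1 = [19 / 24, 21 / 24, 23 / 24]

stats = summarize(hybrid_top1)
print(", ".join(f"{key}={value:.4f}" for key, value in stats.items()))
```

If the project's tooling uses the sample standard deviation instead (`statistics.stdev`), the same inputs would give a larger spread, so it is worth confirming which convention the existing reports follow before comparing cap48 and cap64 aggregates.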