Commit d82d217a d82d217a2fa89e8935d83e8f95c5ee45f421f437 by cnb.bofCdSsphPA

Revise the default-strategy story after the cap48 reversal

Persist the larger 48-track benchmark where high_energy overtook hybrid, and downgrade the previously overconfident default-strategy claim to a conditional recommendation pending broader validation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep asserting hybrid as fully settled default after cap48 | The 48-track capped benchmark materially contradicts that stronger claim
Confidence: high
Scope-risk: narrow
Directive: Resolve the hybrid vs high_energy default question with larger, multi-seed, style-aware benchmarks before making a final hard default claim
Tested: Verified /tmp/ab_smoke_seg_cap48_top2/report.json; verified high_energy eval.json; verified docs now record high_energy=24/0.9167/1.0 and hybrid=24/0.7917/1.0
Not-tested: Multi-seed or style-balanced follow-up benchmark beyond the single cap48 run
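
As a rough illustration of why a single 24-query run only supports a conditional recommendation, the sketch below converts the recorded `top1` values back to hit counts and computes Wilson score intervals for each strategy. It assumes `top1` is a plain hit fraction over `num_queries`; the helper name `wilson_interval` and the 95% level are illustrative choices, not part of the repository.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Return the (low, high) Wilson score interval for hits/n (z=1.96 ~ 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Values recorded in the commit message: (num_queries, top1).
results = {"high_energy": (24, 0.9167), "hybrid": (24, 0.7917)}

intervals = {}
hit_counts = {}
for name, (n, top1) in results.items():
    # Assumption: top1 = hits / num_queries, so rounding recovers the hit count.
    hit_counts[name] = round(top1 * n)
    intervals[name] = wilson_interval(hit_counts[name], n)
    low, high = intervals[name]
    print(f"{name}: {hit_counts[name]}/{n} hits, interval ~ [{low:.3f}, {high:.3f}]")

# If the lower bound of the leader sits below the upper bound of the runner-up,
# the two intervals overlap and the ranking is not clearly separated.
overlap = intervals["high_energy"][0] < intervals["hybrid"][1]
print("intervals overlap:", overlap)
```

Under that assumption the gap is three queries (22 versus 19 hits out of 24) and the two intervals overlap. Overlapping intervals are only an informal check, not a paired significance test, but they are consistent with deferring the final default until the multi-seed follow-up has run.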
1 parent 7769be8c
...@@ -2,6 +2,30 @@ ...@@ -2,6 +2,30 @@
2 2
3 ## 2026-06-02 3 ## 2026-06-02
4 4
5 ### Stage: Wrap up the cap48 top2 real-FMA comparison and find that high_energy overtook hybrid
6
7 Completed:
8 - Read `/tmp/ab_smoke_seg_cap48_top2/report.json`
9 - Read:
10 - `/tmp/ab_smoke_seg_cap48_top2/hybrid/fma_reports_smoke/eval.json`
11 - `/tmp/ab_smoke_seg_cap48_top2/high_energy/fma_reports_smoke/eval.json`
12 - Updated:
13 - [open-dataset-workflow.md](./open-dataset-workflow.md)
14 - [session-handoff.md](./session-handoff.md)
15 - [CHANGELOG.md](./CHANGELOG.md)
16
17 Final results (subset=48, `max_test_queries=24`):
18 - `high_energy`: `num_queries=24`, `top1=0.9167`, `topk=1.0`
19 - `hybrid`: `num_queries=24`, `top1=0.7917`, `topk=1.0`
20
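21 Conclusions:
22 - cap48 points in the opposite direction from cap24 / cap32
23 - The claim that "the default strategy is fully settled as hybrid" must therefore be downgraded to a **provisional conclusion**
24 - A more defensible statement for now:
25 - `hybrid` can stay as the conservative default
26 - `high_energy` is now a strong contender that must be taken seriously
27 - Next, a larger-sample / multi-seed / style-aware benchmark is needed before fixing the final default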
28
5 ### Stage: Launch the cap48 top2 real-FMA comparison and record the run stages 29 ### Stage: Launch the cap48 top2 real-FMA comparison and record the run stages
6 30
7 Completed: 31 Completed:
......
...@@ -189,6 +189,23 @@ flowchart LR ...@@ -189,6 +189,23 @@ flowchart LR
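189 - `hybrid` still leads clearly on the larger real subset 189 - `hybrid` still leads clearly on the larger real subset
190 - `high_energy` can serve as a strategy biased toward high-energy regions, but it is not stable enough to be the default 190 - `high_energy` can serve as a strategy biased toward high-energy regions, but it is not stable enough to be the default
191 - The default strategy can now be safely hard-coded as **`hybrid`** 191 - The default strategy can now be safely hard-coded as **`hybrid`**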
192
193 ### Update: cap48 top2 comparison (subset=48, `max_test_queries=24`)
194
195 After expanding to a 48-track real FMA subset, the **result reversed**:
196
197 | Rank | Strategy | num_queries | top1 | topk |
198 |---:|---|---:|---:|---:|
199 | 1 | `high_energy` | 24 | 0.9167 | 1.0 |
200 | 2 | `hybrid` | 24 | 0.7917 | 1.0 |
201
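202 What this run shows:
203 - The earlier cap24 / cap32 runs favored `hybrid`
204 - On cap48, however, `high_energy` overtook it
205 - The conclusion should therefore move from "the default strategy is fully settled" to:
206 - **`hybrid` remains the current conservative default**
207 - **`high_energy` is now a strong contender**
208 - Next, the result must be re-checked with a larger sample or multiple random seeds; a single cap48 run is not enough to switch the default outright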
192 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json 209 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
193 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local 210 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
194 ``` 211 ```
......
...@@ -459,7 +459,7 @@ cap32 top2 final conclusions: ...@@ -459,7 +459,7 @@ cap32 top2 final conclusions:
459 459
460 --- 460 ---
461 461
462 ## 12. cap48 top2 comparison experiment (in progress) 462 ## 12. cap48 top2 comparison experiment (completed)
463 463
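464 To keep extending the real-data evidence chain, a larger FMA top2 comparison has been launched: 464 To keep extending the real-data evidence chain, a larger FMA top2 comparison has been launched: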
465 465
...@@ -479,15 +479,15 @@ cd /workspace/acr-engine ...@@ -479,15 +479,15 @@ cd /workspace/acr-engine
479 --output-json /tmp/ab_smoke_seg_cap48_top2/report.json 479 --output-json /tmp/ab_smoke_seg_cap48_top2/report.json
480 ``` 480 ```
481 481
482 Current fresh evidence: 482 Final results:
483 483
484 | Item | Status | 484 | Item | Status |
485 |---|---| 485 |---|---|
486 | `subset_size` | `48` | 486 | `subset_size` | `48` |
487 | `max_test_queries` | `24` | 487 | `max_test_queries` | `24` |
488 | `high_energy` | `num_queries=24`, `top1=0.9167`, `topk=1.0` |
488 | `hybrid` | `num_queries=24`, `top1=0.7917`, `topk=1.0` | 489 | `hybrid` | `num_queries=24`, `top1=0.7917`, `topk=1.0` |
489 | `high_energy` | `evaluate.py --max-queries 24` | 490 | `report.json` | generated |
490 | `report.json` | not yet generated |
491 491
492 Recovery check command: 492 Recovery check command:
493 493
...@@ -495,9 +495,14 @@ cd /workspace/acr-engine ...@@ -495,9 +495,14 @@ cd /workspace/acr-engine
495 pgrep -af 'ab_smoke_seg_cap48_top2|external_adapters.py smoke-local fma /tmp/ab_smoke_seg_cap48_top2|evaluate.py --data /tmp/ab_smoke_seg_cap48_top2|run_demo.py build-index --data /tmp/ab_smoke_seg_cap48_top2|train.py --data /tmp/ab_smoke_seg_cap48_top2' 495 pgrep -af 'ab_smoke_seg_cap48_top2|external_adapters.py smoke-local fma /tmp/ab_smoke_seg_cap48_top2|evaluate.py --data /tmp/ab_smoke_seg_cap48_top2|run_demo.py build-index --data /tmp/ab_smoke_seg_cap48_top2|train.py --data /tmp/ab_smoke_seg_cap48_top2'
496 ``` 496 ```
497 497
498 Files to wait for first: 498 cap48 top2 final conclusions:
499 - `/tmp/ab_smoke_seg_cap48_top2/high_energy/fma_reports_smoke/eval.json` 499 - `high_energy`: `24 / 0.9167 / 1.0`
500 - `/tmp/ab_smoke_seg_cap48_top2/report.json` 500 - `hybrid`: `24 / 0.7917 / 1.0`
501 - This run disagrees with cap24 / cap32, so the default-strategy conclusion **cannot yet be treated as final**
502 - Next priorities:
503 1. A larger subset (e.g. 64+)
504 2. Multi-seed reruns
505 3. style-aware bucket benchmark
501 - `b766c74` Make open-dataset manifests trainable end to end 506 - `b766c74` Make open-dataset manifests trainable end to end
502 - `fa23144` Add a single-page open dataset workflow for training prep 507 - `fa23144` Add a single-page open dataset workflow for training prep
503 - `af33be3` Condense docs and add manifest validation before training 508 - `af33be3` Condense docs and add manifest validation before training
......