Commit e519dab7 e519dab7471810641056e3d9fc75354069b9e5c3 by cnb.bofCdSsphPA

Reframe the cap48 finding as seed-sensitive after the second rerun

Persist the completed seed123 benchmark showing hybrid ahead again, and update the strategy guidance from single-run winner claims to a multi-seed interpretation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep framing cap48 as a stable high_energy win | The second seed materially weakens that interpretation
Confidence: high
Scope-risk: narrow
Directive: Base the hybrid vs high_energy default decision on aggregated multi-seed evidence, not any single cap48 run
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/report.json; verified high_energy eval.json; verified docs now record hybrid=24/0.9583/1.0 and high_energy=24/0.9167/1.0 for seed123
Not-tested: Formal aggregation across multiple seeds beyond these two cap48 runs
1 parent a3a5303f
...@@ -2,6 +2,31 @@ ...@@ -2,6 +2,31 @@
2 2
3 ## 2026-06-02 3 ## 2026-06-02
4 4
5 ### Stage: Wrap up cap48 seed123 and confirm cap48 is seed-sensitive
6
7 Completed:
8 - Read `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json`
9 - Read:
10 - `/tmp/ab_smoke_seg_cap48_top2_seed123/hybrid/fma_reports_smoke/eval.json`
11 - `/tmp/ab_smoke_seg_cap48_top2_seed123/high_energy/fma_reports_smoke/eval.json`
12 - Updated:
13 - [open-dataset-workflow.md](./open-dataset-workflow.md)
14 - [session-handoff.md](./session-handoff.md)
15 - [CHANGELOG.md](./CHANGELOG.md)
16
17 Final results (subset=48, `max_test_queries=24`, `seed=123`):
18 - `hybrid`: `num_queries=24`, `top1=0.9583`, `topk=1.0`
19 - `high_energy`: `num_queries=24`, `top1=0.9167`, `topk=1.0`
20
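21 Conclusions:
22 - cap48 already shows clearly different rankings under different seeds:
23 - Default seed: `high_energy > hybrid`
24 - `seed=123`: `hybrid > high_energy`
25 - This means the most reliable cap48 conclusion at present is not "who wins outright", but:
26 - **This comparison is sensitive to seed/sampling**
27 - The current default-strategy decision must rely on aggregated multi-seed results
28 - `hybrid` can still be kept as the conservative default; `high_energy` remains a strong competitor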
29
5 ### Stage: Launch a second cap48 seed to recheck the reversal 30 ### Stage: Launch a second cap48 seed to recheck the reversal
6 31
7 Completed: 32 Completed:
......
...@@ -206,6 +206,23 @@ flowchart LR ...@@ -206,6 +206,23 @@ flowchart LR
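206 - **`hybrid` remains the current conservative default** 206 - **`hybrid` remains the current conservative default**
207 - **`high_energy` has become a strong competitor** 207 - **`high_energy` has become a strong competitor**
208 - The next step must be a recheck with a larger sample or multiple random seeds; a single cap48 run is not enough to change the default outright 208 - The next step must be a recheck with a larger sample or multiple random seeds; a single cap48 run is not enough to change the default outright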
209
210 ### Update: cap48 second-seed recheck (subset=48, `max_test_queries=24`, `seed=123`)
211
212 Rerunning the same scale with a second seed puts `hybrid` back in the lead:
213
214 | Rank | Strategy | num_queries | top1 | topk |
215 |---:|---|---:|---:|---:|
216 | 1 | `hybrid` | 24 | 0.9583 | 1.0 |
217 | 2 | `high_energy` | 24 | 0.9167 | 1.0 |
218
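219 This shows:
220 - The cap48 strategy ranking is **sensitive** to the seed / sampled subset
221 - A single cap48 run is not sufficient evidence that "high_energy has overtaken across the board"
222 - The safest conclusion for now remains:
223 - `hybrid` stays the conservative default
224 - `high_energy` stays a strong competitor
225 - A **multi-seed aggregated conclusion** is needed next, rather than reading a single run's score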
209 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json 226 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
210 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local 227 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
211 ``` 228 ```
......
...@@ -506,7 +506,7 @@ cap48 top2 final conclusions: ...@@ -506,7 +506,7 @@ cap48 top2 final conclusions:
506 506
507 --- 507 ---
508 508
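509 ## 13. cap48 top2 second seed (in progress) 509 ## 13. cap48 top2 second seed (completed)
510 510
511 To verify whether cap48's "high_energy overtake" is stable, a second seed has been launched: 511 To verify whether cap48's "high_energy overtake" is stable, a second seed has been launched: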
512 512
...@@ -527,7 +527,7 @@ cd /workspace/acr-engine ...@@ -527,7 +527,7 @@ cd /workspace/acr-engine
527 --output-json /tmp/ab_smoke_seg_cap48_top2_seed123/report.json 527 --output-json /tmp/ab_smoke_seg_cap48_top2_seed123/report.json
528 ``` 528 ```
529 529
530 Current fresh evidence 530 Final results
531 531
532 | Item | Status | 532 | Item | Status |
533 |---|---| 533 |---|---|
...@@ -535,14 +535,14 @@ cd /workspace/acr-engine ...@@ -535,14 +535,14 @@ cd /workspace/acr-engine
535 | `max_test_queries` | `24` | 535 | `max_test_queries` | `24` |
536 | `seed` | `123` | 536 | `seed` | `123` |
537 | `hybrid` | `num_queries=24`, `top1=0.9583`, `topk=1.0` | 537 | `hybrid` | `num_queries=24`, `top1=0.9583`, `topk=1.0` |
538 | `high_energy` | `run_demo.py build-index --resume --checkpoint-every-refs 100` | 538 | `high_energy` | `num_queries=24`, `top1=0.9167`, `topk=1.0` |
539 | `report.json` | Not yet generated | 539 | `report.json` | Generated |
540
541 Recovery check command:
542 540
543 ```bash 541 seed123 final conclusions:
544 pgrep -af 'ab_smoke_seg_cap48_top2_seed123|external_adapters.py smoke-local fma /tmp/ab_smoke_seg_cap48_top2_seed123|evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed123|run_demo.py build-index --data /tmp/ab_smoke_seg_cap48_top2_seed123|train.py --data /tmp/ab_smoke_seg_cap48_top2_seed123' 542 - `hybrid`: `24 / 0.9583 / 1.0`
545 ``` 543 - `high_energy`: `24 / 0.9167 / 1.0`
544 - cap48 has at least shown clear **seed sensitivity**
545 - The current default-strategy decision should therefore be based on **multi-seed aggregation**, not on a single cap48 reversal
546 - `b766c74` Make open-dataset manifests trainable end to end 546 - `b766c74` Make open-dataset manifests trainable end to end
547 - `fa23144` Add a single-page open dataset workflow for training prep 547 - `fa23144` Add a single-page open dataset workflow for training prep
548 - `af33be3` Condense docs and add manifest validation before training 548 - `af33be3` Condense docs and add manifest validation before training
......
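
The multi-seed directive in this commit can be made concrete with a small aggregation sketch. The block below is illustrative only: `Run`, `correct`, and `pooled_top1` are hypothetical names that do not exist in the repository, and the only data used is the seed123 result recorded above (default-seed numbers are not given in this diff, so they are left out). It recovers integer hit counts from the rounded `top1` values, which shows that the seed123 gap between `hybrid` and `high_energy` is a single query out of 24, and it pools counts across whatever runs are supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One strategy result from one seed (hypothetical container, not repo code)."""
    seed: str
    strategy: str
    num_queries: int
    top1: float  # reported rate, rounded to 4 decimals

    @property
    def correct(self) -> int:
        # Recover the integer number of top-1 hits from the rounded rate.
        return round(self.top1 * self.num_queries)


def pooled_top1(runs, strategy):
    """Pool hits and totals for one strategy across all supplied runs."""
    picked = [r for r in runs if r.strategy == strategy]
    total = sum(r.num_queries for r in picked)
    if total == 0:
        raise ValueError(f"no runs recorded for strategy {strategy!r}")
    return sum(r.correct for r in picked) / total


# seed123 values as recorded in the docs updated by this commit.
runs = [
    Run(seed="123", strategy="hybrid", num_queries=24, top1=0.9583),
    Run(seed="123", strategy="high_energy", num_queries=24, top1=0.9167),
]

hits = {r.strategy: r.correct for r in runs}
gap_in_queries = hits["hybrid"] - hits["high_energy"]
print(hits, "gap:", gap_in_queries, "query")
```

Further seeds would be appended to `runs` as their `eval.json` files become available. Pooling counts rather than averaging rounded rates keeps runs with different `num_queries` weighted correctly.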