Reframe the cap48 finding as seed-sensitive after the second rerun
Persist the completed seed123 benchmark showing hybrid ahead again, and update the strategy guidance from single-run winner claims to a multi-seed interpretation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep framing cap48 as a stable high_energy win | The second seed materially weakens that interpretation
Confidence: high
Scope-risk: narrow
Directive: Base the hybrid vs high_energy default decision on aggregated multi-seed evidence, not any single cap48 run
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/report.json; verified high_energy eval.json; verified docs now record hybrid=24/0.9583/1.0 and high_energy=24/0.9167/1.0 for seed123
Not-tested: Formal aggregation across multiple seeds beyond these two cap48 runs
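The directive above asks for aggregated multi-seed evidence. A minimal sketch of one way to pool results follows. `SeedResult`, `top1_hits`, and `pooled_top1` are hypothetical helpers, not part of the repository, and pooling raw hit counts is only one possible aggregation. The only data used is the seed123 result recorded in this commit; results from other seeds would be appended to the list as they are verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedResult:
    """One strategy's result for one seed (hypothetical container)."""
    seed: int
    strategy: str
    num_queries: int
    top1: float


def top1_hits(result: SeedResult) -> int:
    """Recover the number of top-1 hits from the rounded accuracy."""
    return round(result.top1 * result.num_queries)


def pooled_top1(results: list[SeedResult]) -> dict[str, float]:
    """Pool hits and queries per strategy across all seeds supplied."""
    hits: dict[str, int] = {}
    total: dict[str, int] = {}
    for r in results:
        hits[r.strategy] = hits.get(r.strategy, 0) + top1_hits(r)
        total[r.strategy] = total.get(r.strategy, 0) + r.num_queries
    return {name: hits[name] / total[name] for name in hits}


# seed123 values as recorded in this commit; add further seeds here.
results = [
    SeedResult(seed=123, strategy="hybrid", num_queries=24, top1=0.9583),
    SeedResult(seed=123, strategy="high_energy", num_queries=24, top1=0.9167),
]

for r in results:
    print(r.strategy, top1_hits(r), "of", r.num_queries)
print(pooled_top1(results))
```

With 24 queries, the seed123 figures correspond to 23 and 22 top-1 hits, so the gap between the two strategies in this run is a single query. That is consistent with treating any one cap48 run as weak evidence.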
Showing 3 changed files with 51 additions and 9 deletions
| ... | @@ -2,6 +2,31 @@ | ... | @@ -2,6 +2,31 @@ |
| 2 | 2 | ||
| 3 | ## 2026-06-02 | 3 | ## 2026-06-02 |
| 4 | 4 | ||
| 5 | ### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive | ||
| 6 | |||
| 7 | Completed: | ||
| 8 | - Read `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json` | ||
| 9 | - Read: | ||
| 10 | - `/tmp/ab_smoke_seg_cap48_top2_seed123/hybrid/fma_reports_smoke/eval.json` | ||
| 11 | - `/tmp/ab_smoke_seg_cap48_top2_seed123/high_energy/fma_reports_smoke/eval.json` | ||
| 12 | - Updated: | ||
| 13 | - [open-dataset-workflow.md](./open-dataset-workflow.md) | ||
| 14 | - [session-handoff.md](./session-handoff.md) | ||
| 15 | - [CHANGELOG.md](./CHANGELOG.md) | ||
| 16 | |||
| 17 | Final results (subset=48, `max_test_queries=24`, `seed=123`): | ||
| 18 | - `hybrid`: `num_queries=24`, `top1=0.9583`, `topk=1.0` | ||
| 19 | - `high_energy`: `num_queries=24`, `top1=0.9167`, `topk=1.0` | ||
| 20 | |||
| 21 | Conclusions: | ||
| 22 | - cap48 ranks the two strategies differently depending on the seed: | ||
| 23 | - Default seed: `high_energy > hybrid` | ||
| 24 | - `seed=123`: `hybrid > high_energy` | ||
| 25 | - The most reliable cap48 conclusion right now is therefore not which strategy wins outright, but: | ||
| 26 | - **The comparison is sensitive to seed / sampling** | ||
| 27 | - The default-strategy decision must rest on aggregated multi-seed results | ||
| 28 | - `hybrid` can stay as the conservative default; `high_energy` remains a strong contender | ||
| 29 | |||
| 5 | ### Stage: Start a second cap48 seed to re-check the reversal | 30 | ### Stage: Start a second cap48 seed to re-check the reversal |
| 6 | 31 | ||
| 7 | Completed: | 32 | Completed: | ... | ... |
| ... | @@ -206,6 +206,23 @@ flowchart LR | ... | @@ -206,6 +206,23 @@ flowchart LR |
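| 206 | - **`hybrid` is still the current conservative default** | 206 | - **`hybrid` is still the current conservative default** |
| 207 | - **`high_energy` is now a strong contender** | 207 | - **`high_energy` is now a strong contender** |
| 208 | - The next step must be a larger sample or a multi-seed re-check; a single cap48 round is not enough to change the default outright | 208 | - The next step must be a larger sample or a multi-seed re-check; a single cap48 round is not enough to change the default outright |
| 209 | |||
| 210 | ### Update: cap48 second-seed re-check (subset=48, `max_test_queries=24`, `seed=123`) | ||
| 211 | |||
| 212 | Running a second seed at the same scale puts `hybrid` back in the lead: | ||
| 213 | |||
| 214 | | Rank | Strategy | num_queries | top1 | topk | | ||
| 215 | |---:|---|---:|---:|---:| | ||
| 216 | | 1 | `hybrid` | 24 | 0.9583 | 1.0 | | ||
| 217 | | 2 | `high_energy` | 24 | 0.9167 | 1.0 | | ||
| 218 | |||
| 219 | This shows that: | ||
| 220 | - cap48 strategy rankings are **sensitive** to the seed / sampled subset | ||
| 221 | - A single cap48 run is not sufficient evidence that high_energy has overtaken hybrid across the board | ||
| 222 | - The safest conclusion for now remains: | ||
| 223 | - Keep `hybrid` as the conservative default | ||
| 224 | - Keep `high_energy` as a strong contender | ||
| 225 | - Next, draw conclusions from **multi-seed aggregation** rather than from a single run's score | ||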
| 209 | /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json | 226 | /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json |
| 210 | /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local | 227 | /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local |
| 211 | ``` | 228 | ``` | ... | ... |
| ... | @@ -506,7 +506,7 @@ cap48 top2 final conclusion: | ... | @@ -506,7 +506,7 @@ cap48 top2 final conclusion: |
| 506 | 506 | ||
| 507 | --- | 507 | --- |
| 508 | 508 | ||
| 509 | ## 13. cap48 top2 second seed (in progress) | 509 | ## 13. cap48 top2 second seed (completed) |
| 510 | 510 | ||
| 511 | To verify whether the cap48 "high_energy overtakes hybrid" result is stable, a second seed has been started: | 511 | To verify whether the cap48 "high_energy overtakes hybrid" result is stable, a second seed has been started: |
| 512 | 512 | ||
| ... | @@ -527,7 +527,7 @@ cd /workspace/acr-engine | ... | @@ -527,7 +527,7 @@ cd /workspace/acr-engine |
| 527 | --output-json /tmp/ab_smoke_seg_cap48_top2_seed123/report.json | 527 | --output-json /tmp/ab_smoke_seg_cap48_top2_seed123/report.json |
| 528 | ``` | 528 | ``` |
| 529 | 529 | ||
| 530 | Current fresh evidence: | 530 | Final results: |
| 531 | 531 | ||
| 532 | | Item | Status | | 532 | | Item | Status | |
| 533 | |---|---| | 533 | |---|---| |
| ... | @@ -535,14 +535,14 @@ cd /workspace/acr-engine | ... | @@ -535,14 +535,14 @@ cd /workspace/acr-engine |
| 535 | | `max_test_queries` | `24` | | 535 | | `max_test_queries` | `24` | |
| 536 | | `seed` | `123` | | 536 | | `seed` | `123` | |
| 537 | | `hybrid` | `num_queries=24`, `top1=0.9583`, `topk=1.0` | | 537 | | `hybrid` | `num_queries=24`, `top1=0.9583`, `topk=1.0` | |
| 538 | | `high_energy` | `run_demo.py build-index --resume --checkpoint-every-refs 100` | | 538 | | `high_energy` | `num_queries=24`, `top1=0.9167`, `topk=1.0` | |
| 539 | | `report.json` | not yet generated | | 539 | | `report.json` | generated | |
| 540 | |||
| 541 | Recovery check command: | ||
| 542 | 540 | ||
| 543 | ```bash | 541 | seed123 final conclusion: |
| 544 | pgrep -af 'ab_smoke_seg_cap48_top2_seed123|external_adapters.py smoke-local fma /tmp/ab_smoke_seg_cap48_top2_seed123|evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed123|run_demo.py build-index --data /tmp/ab_smoke_seg_cap48_top2_seed123|train.py --data /tmp/ab_smoke_seg_cap48_top2_seed123' | 542 | - `hybrid`: `24 / 0.9583 / 1.0` |
| 545 | ``` | 543 | - `high_energy`: `24 / 0.9167 / 1.0` |
| 544 | - cap48 has, at a minimum, already shown clear **seed sensitivity** | ||
| 545 | - The current default-strategy decision should therefore be based on **multi-seed aggregation**, not on a single cap48 reversal | ||
| 546 | - `b766c74` Make open-dataset manifests trainable end to end | 546 | - `b766c74` Make open-dataset manifests trainable end to end |
| 547 | - `fa23144` Add a single-page open dataset workflow for training prep | 547 | - `fa23144` Add a single-page open dataset workflow for training prep |
| 548 | - `af33be3` Condense docs and add manifest validation before training | 548 | - `af33be3` Condense docs and add manifest validation before training | ... | ... |