Promote the cap48 discussion from single runs to two-seed aggregates
Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates
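The aggregate named in the Tested trailer can be reproduced with a few lines of Python. This is a minimal sketch, not the script actually used for the commit (which is not shown): the function name `aggregate_top1` and the `per_seed_top1` dictionary are hypothetical, and the per-seed values are read off the documented min/max columns rather than loaded from the `report.json` files. It assumes the recorded `stdev_top1` is a population standard deviation, since that is what yields 0.0833 for `hybrid`.

```python
import statistics


def aggregate_top1(values):
    """Summarize per-seed top-1 accuracies for one strategy (hypothetical helper)."""
    return {
        "runs": len(values),
        "mean_top1": round(statistics.fmean(values), 4),
        "min_top1": round(min(values), 4),
        "max_top1": round(max(values), 4),
        # Assumption: population stdev, which matches the 0.0833 in the docs.
        "stdev_top1": round(statistics.pstdev(values), 4),
    }


# Per-seed top-1 values inferred from the documented min/max (two runs each).
per_seed_top1 = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}

for strategy, values in per_seed_top1.items():
    print(strategy, aggregate_top1(values))
```

With only two runs the choice of estimator matters: the sample standard deviation (`statistics.stdev`) for `hybrid` would be roughly 0.1178 instead of 0.0833, so it is worth stating which one a report uses once more seeds are added.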
Showing 3 changed files with 61 additions and 0 deletions
| ... | @@ -2,6 +2,38 @@ | ... | @@ -2,6 +2,38 @@ |
| 2 | 2 | ||
| 3 | ## 2026-06-02 | 3 | ## 2026-06-02 |
| 4 | 4 | ||
| 5 | ### Stage: Aggregate the cap48 metrics across two seeds | ||
| 6 | |||
| 7 | Completed: | ||
| 8 | - Aggregated: | ||
| 9 | - `/tmp/ab_smoke_seg_cap48_top2/report.json` | ||
| 10 | - `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json` | ||
| 11 | - Computed aggregate metrics over the current 2 cap48 seeds | ||
| 12 | - Updated: | ||
| 13 | - [open-dataset-workflow.md](./open-dataset-workflow.md) | ||
| 14 | - [session-handoff.md](./session-handoff.md) | ||
| 15 | - [CHANGELOG.md](./CHANGELOG.md) | ||
| 16 | |||
| 17 | cap48 aggregate results (2 seeds): | ||
| 18 | - `high_energy`: | ||
| 19 | - `mean_top1 = 0.9167` | ||
| 20 | - `min_top1 = 0.9167` | ||
| 21 | - `max_top1 = 0.9167` | ||
| 22 | - `stdev_top1 = 0.0` | ||
| 23 | - `hybrid`: | ||
| 24 | - `mean_top1 = 0.8750` | ||
| 25 | - `min_top1 = 0.7917` | ||
| 26 | - `max_top1 = 0.9583` | ||
| 27 | - `stdev_top1 = 0.0833` | ||
| 28 | |||
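| 29 | Conclusions: | ||
| 30 | - Looking only at the current two cap48 seeds, `high_energy` leads on both mean and stability | ||
| 31 | - `hybrid` fluctuates more but reaches a higher peak | ||
| 32 | - The most defensible strategy assessment should now be upgraded to: | ||
| 33 | - Single-run results are not reliable | ||
| 34 | - The default strategy should be based on **multi-seed aggregates** | ||
| 35 | - As a next step, adding more seeds or style-aware buckets is more valuable than simply adding another single run | ||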
| 36 | |||
| 5 | ### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive | 37 | ### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive |
| 6 | 38 | ||
| 7 | Completed: | 39 | Completed: | ... | ... |
| ... | @@ -223,6 +223,23 @@ flowchart LR | ... | @@ -223,6 +223,23 @@ flowchart LR |
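| 223 | - `hybrid` remains the conservative default | 223 | - `hybrid` remains the conservative default |
| 224 | - `high_energy` remains a strong competing option | 224 | - `high_energy` remains a strong competing option |
| 225 | - Going forward, conclusions need **multi-seed aggregates** rather than single-run scores | 225 | - Going forward, conclusions need **multi-seed aggregates** rather than single-run scores |
| 226 | |||
| 227 | ### cap48 multi-seed aggregate summary (currently 2 runs) | ||
| 228 | |||
| 229 | Looking at the two cap48 seeds together: | ||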
| 230 | |||
| 231 | | Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | mean_topk | | ||
| 232 | |---|---:|---:|---:|---:|---:|---:| | ||
| 233 | | `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | 1.0 | | ||
| 234 | | `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | 1.0 | | ||
| 235 | |||
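| 236 | Current interpretation: | ||
| 237 | - `high_energy` has **a higher mean and is more stable** across these two cap48 runs | ||
| 238 | - `hybrid` is stronger on the second seed but also fluctuates more | ||
| 239 | - So the most accurate statement right now is not "who wins outright", but: | ||
| 240 | - **On cap48, `high_energy` currently leads on aggregate mean** | ||
| 241 | - **`hybrid` remains the more conservative default candidate** | ||
| 242 | - The final default should still wait for confirmation from more seeds or a larger sample | ||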
| 226 | /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json | 243 | /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json |
| 227 | /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local | 244 | /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local |
| 228 | ``` | 245 | ``` | ... | ... |
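| ... | @@ -543,6 +543,18 @@ seed123 final conclusion: | ... | @@ -543,6 +543,18 @@ seed123 final conclusion: |
| 543 | - `high_energy`: `24 / 0.9167 / 1.0` | 543 | - `high_energy`: `24 / 0.9167 / 1.0` |
| 544 | - cap48 has at least shown clear **seed sensitivity** | 544 | - cap48 has at least shown clear **seed sensitivity** |
| 545 | - The current default-strategy decision should therefore rest on **multi-seed aggregates**, not on a single cap48 reversal | 545 | - The current default-strategy decision should therefore rest on **multi-seed aggregates**, not on a single cap48 reversal |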
| 546 | |||
| 547 | ### Current aggregate conclusion for the two cap48 seeds | ||
| 548 | |||
| 549 | | Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | | ||
| 550 | |---|---:|---:|---:|---:|---:| | ||
| 551 | | `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | | ||
| 552 | | `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | | ||
| 553 | |||
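| 554 | Most defensible current interpretation: | ||
| 555 | - `high_energy` **currently leads on mean** across the two cap48 seeds | ||
| 556 | - `hybrid` results fluctuate more, but its single-run peak is higher | ||
| 557 | - The default strategy should not hinge on any single run; keep accumulating aggregate results across seeds / style buckets | ||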
| 546 | - `b766c74` Make open-dataset manifests trainable end to end | 558 | - `b766c74` Make open-dataset manifests trainable end to end |
| 547 | - `fa23144` Add a single-page open dataset workflow for training prep | 559 | - `fa23144` Add a single-page open dataset workflow for training prep |
| 548 | - `af33be3` Condense docs and add manifest validation before training | 560 | - `af33be3` Condense docs and add manifest validation before training | ... | ... |