Commit ae0d14a5 ae0d14a57929fbcf15e326f4dac8bc5769564c12 by cnb.bofCdSsphPA

Promote the cap48 discussion from single runs to two-seed aggregates

Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates
1 parent e519dab7
@@ -2,6 +2,38 @@
2 2
3 3 ## 2026-06-02
4 4
5 ### Stage: Aggregate metrics across the two cap48 seeds
6
7 Completed:
8 - Aggregated:
9 - `/tmp/ab_smoke_seg_cap48_top2/report.json`
10 - `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json`
11 - Computed aggregate metrics over the 2 cap48 seeds run so far
12 - Updated:
13 - [open-dataset-workflow.md](./open-dataset-workflow.md)
14 - [session-handoff.md](./session-handoff.md)
15 - [CHANGELOG.md](./CHANGELOG.md)
16
17 cap48 aggregate results (2 seeds):
18 - `high_energy`:
19 - `mean_top1 = 0.9167`
20 - `min_top1 = 0.9167`
21 - `max_top1 = 0.9167`
22 - `stdev_top1 = 0.0`
23 - `hybrid`:
24 - `mean_top1 = 0.8750`
25 - `min_top1 = 0.7917`
26 - `max_top1 = 0.9583`
27 - `stdev_top1 = 0.0833`
28
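29 Conclusions:
30 - On the two cap48 seeds so far, `high_energy` has the higher mean and is more stable
31 - `hybrid` varies more from run to run but reaches a higher peak
32 - The safest strategy judgment should now be upgraded to:
33 - A single-run result is not trustworthy on its own
34 - The default strategy should be informed by **multi-seed aggregates**
35 - As a next step, adding more seeds or style-aware buckets is more valuable than simply adding another single run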
36
5 37 ### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive
6 38
7 39 Completed:
......
@@ -223,6 +223,23 @@ flowchart LR
223 223 - `hybrid` kept as the conservative default
224 224 - `high_energy` kept as a strong competing option
225 225 - Going forward, **multi-seed aggregate conclusions** are needed, rather than judging from a single run's score
226
227 ### cap48 multi-seed aggregate summary (currently 2 runs)
228
229 Looking at the two cap48 seeds together:
230
231 | Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | mean_topk |
232 |---|---:|---:|---:|---:|---:|---:|
233 | `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | 1.0 |
234 | `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | 1.0 |
235
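236 Current interpretation:
237 - On these two cap48 runs, `high_energy` has **the higher mean and is more stable**
238 - `hybrid` is stronger on the second seed but also varies more
239 - So the most accurate statement for now is not "who wins outright" but:
240 - **On cap48, the aggregate mean of `high_energy` is currently ahead**
241 - **`hybrid` is still the more conservative default candidate**
242 - The final default should still wait for confirmation from more seeds or a larger sample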
226 243 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
227 244 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
228 245 ```
......
@@ -543,6 +543,18 @@ seed123 final conclusion:
543 543 - `high_energy`: `24 / 0.9167 / 1.0`
544 544 - cap48 has at least shown clear **seed sensitivity**
545 545 - The current default-strategy judgment should therefore rest on **multi-seed aggregates**, not on a single cap48 reversal
546
547 ### Current aggregate conclusion for the two cap48 seeds
548
549 | Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 |
550 |---|---:|---:|---:|---:|---:|
551 | `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 |
552 | `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 |
553
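554 Safest current interpretation:
555 - `high_energy` is **currently ahead on the mean** across the two cap48 seeds
556 - `hybrid` varies more, but its single-run peak is higher
557 - The default strategy should not be decided from any one run; keep accumulating aggregate results across seeds / style buckets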
546 558 - `b766c74` Make open-dataset manifests trainable end to end
547 559 - `fa23144` Add a single-page open dataset workflow for training prep
548 560 - `af33be3` Condense docs and add manifest validation before training
......
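
The aggregate rows recorded in this commit can be reproduced from the per-seed top-1 values with a few lines of Python. This is a minimal sketch, not the script used for the commit. The names `aggregate_top1` and `per_seed_top1` are hypothetical. The per-seed values are read off the `min_top1` / `max_top1` columns (with two runs, those are the two observations) rather than loaded from the `report.json` files, whose schema is not shown here. The sketch assumes the population standard deviation (`statistics.pstdev`), because that is what reproduces `stdev_top1 = 0.0833` for `hybrid`; the sample standard deviation of the same two values would be about 0.1178.

```python
from statistics import mean, pstdev


def aggregate_top1(values):
    """Summarize per-seed top-1 accuracies (hypothetical helper)."""
    return {
        "runs": len(values),
        "mean_top1": round(mean(values), 4),
        "min_top1": round(min(values), 4),
        "max_top1": round(max(values), 4),
        # Assumption: population stdev, which matches the reported 0.0833.
        "stdev_top1": round(pstdev(values), 4),
    }


# Per-seed top-1 values taken from the min/max columns of the table.
per_seed_top1 = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}

summary = {name: aggregate_top1(vals) for name, vals in per_seed_top1.items()}

for name, stats in summary.items():
    print(name, stats)
```

If later aggregates switch to the sample standard deviation, the existing two-seed rows would need to be recomputed to stay comparable.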