Promote the cap48 discussion from single runs to two-seed aggregates

Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates

Commit ae0d14a5 ... ae0d14a57929fbcf15e326f4dac8bc5769564c12 authored 2026-06-02 18:15:34 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 61 additions and 0 deletions
docs/CHANGELOG.md
docs/open-dataset-workflow.md
docs/session-handoff.md
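
The aggregate figures recorded in the diff below can be reproduced from the per-seed `top1` values with a few lines of Python. This is a minimal sketch, not the script the author used: the `report.json` schema is not shown in this commit, so the scores are typed in by hand from the tables, and the names `TOP1_BY_STRATEGY` and `aggregate` are hypothetical. The use of the population standard deviation is an inference, chosen because it is the variant that matches the reported `stdev_top1 = 0.0833` for `hybrid`.

```python
import statistics

# Per-seed top1 scores copied by hand from the tables in this commit.
# Order is (first seed, seed123); the docs state hybrid is stronger on the second seed.
TOP1_BY_STRATEGY = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}


def aggregate(scores):
    """Summarize a list of per-seed top1 scores (hypothetical helper)."""
    return {
        "runs": len(scores),
        "mean_top1": round(statistics.fmean(scores), 4),
        "min_top1": min(scores),
        "max_top1": max(scores),
        # Population stdev (divides by N). Assumption: inferred from the
        # reported 0.0833, not confirmed by the source.
        "stdev_top1": round(statistics.pstdev(scores), 4),
    }


summary = {name: aggregate(scores) for name, scores in TOP1_BY_STRATEGY.items()}
for name, stats in summary.items():
    print(name, stats)
```

With only two runs the choice of estimator matters: the sample standard deviation (`statistics.stdev`, which divides by N - 1) would give roughly 0.1178 for `hybrid` instead of 0.0833, so it helps to state which one is reported once more seeds are added.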
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,38 @@
 ## 2026-06-02
+### Stage: Aggregate the metrics from the two cap48 seed runs
+Completed:
+- Aggregated:
+  - `/tmp/ab_smoke_seg_cap48_top2/report.json`
+  - `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json`
+- Computed aggregate metrics over the current 2 cap48 seed runs
+- Updated:
+  - [open-dataset-workflow.md](./open-dataset-workflow.md)
+  - [session-handoff.md](./session-handoff.md)
+  - [CHANGELOG.md](./CHANGELOG.md)
+cap48 aggregate results (2 seeds):
+- `high_energy`:
+  - `mean_top1 = 0.9167`
+  - `min_top1 = 0.9167`
+  - `max_top1 = 0.9167`
+  - `stdev_top1 = 0.0`
+- `hybrid`:
+  - `mean_top1 = 0.8750`
+  - `min_top1 = 0.7917`
+  - `max_top1 = 0.9583`
+  - `stdev_top1 = 0.0833`
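+Conclusions:
+- Looking only at the current two cap48 seeds, `high_energy` leads on both mean and stability
+- `hybrid` varies more across seeds but reaches a higher peak
+- The safest strategy assessment should now be upgraded to:
+  - Single-run results are not reliable
+  - The default strategy should be chosen from **multi-seed aggregates**
+  - As a next step, adding more seeds or style-aware buckets is more valuable than simply adding another single run
 ### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive
 Completed: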
--- a/docs/open-dataset-workflow.md
+++ b/docs/open-dataset-workflow.md
@@ -223,6 +223,23 @@ flowchart LR
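  - `hybrid` stays as the conservative default
  - `high_energy` stays as a strong competing option
  - Going forward, conclusions need to come from **multi-seed aggregates**, not from a single run's score
+### cap48 multi-seed aggregate summary (currently 2 runs)
+Taking the two cap48 seeds together:
+| Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | mean_topk |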
+|---|---:|---:|---:|---:|---:|---:|
+| `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | 1.0 |
+| `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | 1.0 |
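+Current interpretation:
+- Across these two cap48 runs, `high_energy` has the **higher mean and is more stable**
+- `hybrid` is stronger on the second seed but also varies more
+- So the most accurate statement for now is not "which one wins outright" but:
+  - **On cap48, `high_energy` currently leads on aggregate mean**
+  - **`hybrid` remains the more conservative default candidate**
+  - The final default should still wait for confirmation from more seeds or a larger sample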
 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
 ```
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -543,6 +543,18 @@ seed123 final conclusions:
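 - `high_energy`: `24 / 0.9167 / 1.0`
 - cap48 has at least shown clear **seed sensitivity**
 - So the current default-strategy decision should rest on **multi-seed aggregates**, not on a single cap48 reversal
+### Current aggregate conclusions for the two cap48 seeds
+| Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 |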
+|---|---:|---:|---:|---:|---:|
+| `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 |
+| `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 |
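+Safest current reading:
+- `high_energy` **currently leads on mean** across the two cap48 seeds
+- `hybrid` varies more but has the higher single-run peak
+- Going forward, the default strategy should not hinge on any one run; keep accumulating aggregates across seeds / style buckets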
 - `b766c74` Make open-dataset manifests trainable end to end
 - `fa23144` Add a single-page open dataset workflow for training prep
 - `af33be3` Condense docs and add manifest validation before training