Commit ae0d14a5 ae0d14a57929fbcf15e326f4dac8bc5769564c12 by cnb.bofCdSsphPA

Promote the cap48 discussion from single runs to two-seed aggregates

Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates
1 parent e519dab7
......@@ -2,6 +2,38 @@
## 2026-06-02
### Stage: Aggregate metrics across the two cap48 seeds
Completed:
- Aggregated:
  - `/tmp/ab_smoke_seg_cap48_top2/report.json`
  - `/tmp/ab_smoke_seg_cap48_top2_seed123/report.json`
- Computed aggregate metrics for the two current cap48 seeds
- Updated:
  - [open-dataset-workflow.md](./open-dataset-workflow.md)
  - [session-handoff.md](./session-handoff.md)
  - [CHANGELOG.md](./CHANGELOG.md)
cap48 aggregate results (2 seeds):
- `high_energy`:
  - `mean_top1 = 0.9167`
  - `min_top1 = 0.9167`
  - `max_top1 = 0.9167`
  - `stdev_top1 = 0.0`
- `hybrid`:
  - `mean_top1 = 0.8750`
  - `min_top1 = 0.7917`
  - `max_top1 = 0.9583`
  - `stdev_top1 = 0.0833`
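Conclusions:
- Looking only at the two current cap48 seeds, `high_energy` has the higher mean and the better stability
- `hybrid` fluctuates more but reaches a higher peak
- The safest strategy assessment should now be upgraded to:
  - Single-run results are not reliable
  - The default strategy should be based on **multi-seed aggregates**
- As a next step, adding more seeds or style-aware buckets is more valuable than simply adding another single run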
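The aggregate figures above can be reproduced with a minimal sketch like the following. It assumes the per-seed `top1` values are the `min_top1` and `max_top1` entries listed above (with two runs these are the only two values), and that `stdev_top1` is a population standard deviation: `pstdev` yields 0.0833 for `hybrid`, whereas the sample standard deviation would be roughly 0.1178. The function name `aggregate_top1` and the dictionary layout are hypothetical and not taken from the repository.

```python
from statistics import mean, pstdev


def aggregate_top1(values):
    """Summarise per-seed top1 scores (hypothetical helper, not from the repo)."""
    return {
        "runs": len(values),
        "mean_top1": round(mean(values), 4),
        "min_top1": min(values),
        "max_top1": max(values),
        # Population standard deviation (assumption: this is what the docs report).
        "stdev_top1": round(pstdev(values), 4),
    }


# Per-seed top1 values, assumed from the min/max entries in the notes above.
per_seed_top1 = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}

summary = {name: aggregate_top1(vals) for name, vals in per_seed_top1.items()}

for name, stats in summary.items():
    print(name, stats)
```

Adding a third seed only requires appending one more value to each list; the choice between `pstdev` and `stdev` starts to matter less as the number of runs grows.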
### Stage: Wrap up cap48 seed123 and confirm that cap48 is seed-sensitive
Completed:
......
......@@ -223,6 +223,23 @@ flowchart LR
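- `hybrid` remains the conservative default
- `high_energy` remains a strong competing option
- Going forward, conclusions need to come from **multi-seed aggregates** rather than single-run scores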
### cap48 multi-seed aggregate summary (currently 2 seeds)
Looking at the two cap48 seeds together:
| Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | mean_topk |
|---|---:|---:|---:|---:|---:|---:|
| `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | 1.0 |
| `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | 1.0 |
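This can currently be read as:
- Across these two cap48 runs, `high_energy` has **a higher mean and is more stable**
- `hybrid` is stronger on the second seed, but also fluctuates more
- So the most accurate statement right now is not "who wins outright" but:
  - **On cap48, the aggregate mean of `high_energy` is ahead for now**
  - **`hybrid` is still the more conservative default candidate**
- The final default should still wait for confirmation from more seeds or a larger sample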
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
```
......
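......@@ -543,6 +543,18 @@ seed123 final conclusions:
- `high_energy`: `24 / 0.9167 / 1.0`
- At the very least, cap48 has already shown clear **seed sensitivity**
- The current default-strategy assessment should therefore rest on **multi-seed aggregates**, not on a single cap48 reversal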
### Current aggregate conclusions for the two cap48 seeds
| Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 |
|---|---:|---:|---:|---:|---:|
| `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 |
| `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 |
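The safest current interpretation:
- The **mean** of `high_energy` across the two cap48 seeds is **ahead for now**
- `hybrid` fluctuates more, but its single-run peak is higher
- The default strategy should not be decided from any one run; keep accumulating aggregate results across seeds and style buckets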
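As more seeds accumulate, the table rows above could be regenerated rather than edited by hand. The sketch below is one way to do that, assuming the per-seed `top1` values equal the `min_top1` and `max_top1` columns and that `stdev_top1` is a population standard deviation. The helper name `table_row` is hypothetical, and the formatting always prints four decimals, so a zero deviation appears as `0.0000` rather than `0.0`.

```python
from statistics import mean, pstdev


def table_row(strategy, top1_values):
    """Format one markdown table row from per-seed top1 scores (hypothetical helper)."""
    cells = [
        f"`{strategy}`",
        str(len(top1_values)),
        f"{mean(top1_values):.4f}",
        f"{min(top1_values):.4f}",
        f"{max(top1_values):.4f}",
        # Population standard deviation (assumed to match the documented values).
        f"{pstdev(top1_values):.4f}",
    ]
    return "| " + " | ".join(cells) + " |"


# Per-seed top1 values, assumed from the table above; extend the lists for new seeds.
runs = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}

rows = [table_row(name, values) for name, values in runs.items()]
print("\n".join(rows))
```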
- `b766c74` Make open-dataset manifests trainable end to end
- `fa23144` Add a single-page open dataset workflow for training prep
- `af33be3` Condense docs and add manifest validation before training
......