Lock the final cap16 FMA benchmark ranking into the workflow docs

Persist the completed capped real-data benchmark results so future sessions can use the final strategy ordering and recommendation without replaying the run.

Constraint: Only documentation should change because benchmark artifacts live outside version control
Rejected: Leave the result only in /tmp report files | Would make the evidence fragile across sessions
Confidence: high
Scope-risk: narrow
Directive: Use cap16 as the current default evidence point until a larger capped benchmark supersedes it
Tested: Verified /tmp/ab_smoke_seg_cap16/report.json; verified repeated_section_aware eval.json; verified docs reflect final ranking hybrid/high_energy/beat_aware/repeated_section_aware
Not-tested: Larger real-dataset benchmark beyond the 16-track capped subset

Commit c659380d ... c659380d9cceff27e1fdde325b89191486ea1ad5 authored 2026-06-02 17:27:36 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 48 additions and 10 deletions
docs/CHANGELOG.md
docs/open-dataset-workflow.md
docs/session-handoff.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,27 @@

 ## 2026-06-02

+### Stage: Wrap up the cap16 real-FMA capped segmentation benchmark
+
+Completed:
+- Read `/tmp/ab_smoke_seg_cap16/report.json`
+- Confirmed the final evaluation result for `repeated_section_aware`
+- Updated:
+  - [open-dataset-workflow.md](./open-dataset-workflow.md)
+  - [session-handoff.md](./session-handoff.md)
+  - [CHANGELOG.md](./CHANGELOG.md)
+
+Final results (subset=16, `max_test_queries=12`):
+- `hybrid`: `num_queries=12`, `top1=1.0`, `topk=1.0`
+- `high_energy`: `num_queries=12`, `top1=1.0`, `topk=1.0`
+- `beat_aware`: `num_queries=12`, `top1=0.9167`, `topk=1.0`
+- `repeated_section_aware`: `num_queries=12`, `top1=0.8333`, `topk=1.0`
+
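+Conclusions:
+- Under a fixed query budget, `hybrid` remains the current default first choice
+- `high_energy` is the strongest runner-up and ties `hybrid` on this cap16 run
+- `beat_aware` and `repeated_section_aware` are less stable on their own than the hybrid strategy
+
 ### Stage: Deliver the handoff for resuming the current segmentation benchmark

 Completed: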
--- a/docs/open-dataset-workflow.md
+++ b/docs/open-dataset-workflow.md
@@ -144,6 +144,23 @@ flowchart LR
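 Why this step matters:
 - The earlier A/B ranking leaned toward "coverage"
 - With the cap in place, strategies can be compared more fairly on recognition quality at equal query cost
+
+### Latest real-FMA capped results (subset=16, `max_test_queries=12`)
+
+One fairer round of real-FMA A/B testing has been completed:
+
+| Rank | Strategy | num_queries | top1 | topk |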
+|---:|---|---:|---:|---:|
+| 1 | `hybrid` | 12 | 1.0 | 1.0 |
+| 2 | `high_energy` | 12 | 1.0 | 1.0 |
+| 3 | `beat_aware` | 12 | 0.9167 | 1.0 |
+| 4 | `repeated_section_aware` | 12 | 0.8333 | 1.0 |
+
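+Current recommendations:
+- **`hybrid` remains the preferred default training / query strategy**
+- `high_energy` is currently the strongest runner-up, tied with `hybrid`, and suits data dominated by main sections / high-energy regions
+- `beat_aware` is better suited to styles with strong, regular beats, but was slightly weaker on this FMA subset
+- `repeated_section_aware` on its own is less stable than the hybrid strategy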
 /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
 /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
 ```
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -331,14 +331,14 @@ cd /workspace/acr-engine
  --output-json /tmp/ab_smoke_seg_cap16/report.json
 ```

-Partial results available at the time of this handoff:
+At the time of this handoff, cap16 has completed; the final results are:

 | Strategy | num_queries | top1 | topk | Status |
 |---|---:|---:|---:|---|
 | `hybrid` | 12 | 1.0 | 1.0 | Done |
-| `beat_aware` | 12 | 0.9167 | 1.0 | Done |
 | `high_energy` | 12 | 1.0 | 1.0 | Done |
-| `repeated_section_aware` | - | - | - | Not started / incomplete |
+| `beat_aware` | 12 | 0.9167 | 1.0 | Done |
+| `repeated_section_aware` | 12 | 0.8333 | 1.0 | Done |

 ### First-priority actions after a restart

@@ -348,10 +348,10 @@ cd /workspace/acr-engine
 pgrep -af 'ab_smoke_seg_cap16|external_adapters.py smoke-local fma /tmp/ab_smoke_seg_cap16|evaluate.py --data /tmp/ab_smoke_seg_cap16|run_demo.py build-index --data /tmp/ab_smoke_seg_cap16'
 ```

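-2. If it is still running, wait for `/tmp/ab_smoke_seg_cap16/report.json`
+2. If `report.json` already exists, read it first and sync the docs
 3. If the run was interrupted:
-   - Keep the existing `/tmp/ab_smoke_seg_cap16/hybrid` and `/tmp/ab_smoke_seg_cap16/beat_aware` results as a manual record
-   - Re-run the remaining strategies, or run one individually:
+   - Keep the existing `/tmp/ab_smoke_seg_cap16/*` results as a manual record
+   - Re-run the missing strategies, or run one individually: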

 ```bash
 cd /workspace/acr-engine
@@ -369,10 +369,10 @@ cd /workspace/acr-engine
  --seed 42
 ```

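-4. Once the full results are in:
-   - Update [open-dataset-workflow.md](./open-dataset-workflow.md)
-   - Update [CHANGELOG.md](./CHANGELOG.md)
-   - commit + push
+4. The final recommendation from this cap16 run is now settled:
+   - Default: `hybrid`
+   - Strong runner-up: `high_energy`
+   - `beat_aware` / `repeated_section_aware` are better used as supplementary comparisons than as default strategies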
 - `b766c74` Make open-dataset manifests trainable end to end
 - `fa23144` Add a single-page open dataset workflow for training prep
 - `af33be3` Condense docs and add manifest validation before training
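
As a cross-check on the ranking this commit records, the sketch below re-sorts the four cap16 rows by `top1` and then `topk`. It is a minimal illustration, not the project's tooling: it does not read `/tmp/ab_smoke_seg_cap16/report.json` (its schema is not shown in this commit), the rows are hard-coded from the result tables in the diff, and `rank` and `implied_hits` are hypothetical helper names. The hit counts are inferred from the rounded rates, on the assumption that each rate is correct queries divided by `num_queries`.

```python
# Rows copied from the cap16 tables (subset=16, max_test_queries=12),
# listed in the order of the earlier partial-result table.
RESULTS = [
    {"strategy": "hybrid", "num_queries": 12, "top1": 1.0, "topk": 1.0},
    {"strategy": "beat_aware", "num_queries": 12, "top1": 0.9167, "topk": 1.0},
    {"strategy": "high_energy", "num_queries": 12, "top1": 1.0, "topk": 1.0},
    {"strategy": "repeated_section_aware", "num_queries": 12, "top1": 0.8333, "topk": 1.0},
]


def rank(rows):
    """Sort by top1, then topk, both descending.

    sorted() is stable, so rows with identical metrics keep their input order.
    """
    return sorted(rows, key=lambda r: (-r["top1"], -r["topk"]))


def implied_hits(rate, num_queries):
    """Hypothetical helper: nearest whole number of correct queries
    for a rate that was rounded to four decimals."""
    return round(rate * num_queries)


ranking = [r["strategy"] for r in rank(RESULTS)]
top1_hits = {r["strategy"]: implied_hits(r["top1"], r["num_queries"]) for r in RESULTS}

print(ranking)
print(top1_hits)
```

`hybrid` and `high_energy` have identical metrics here, so the sort separates them only by input order. The preference for `hybrid` as the default comes from the recommendation in the docs, not from a measured gap on this run.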