Preserve fresh benchmark evidence before the evaluation finishes

Constraint: The running cap48 seed=999 benchmark has not emitted its final report yet, so only in-flight evidence can be recorded safely
Rejected: Claim a new three-seed conclusion now | The aggregate would be speculative without report.json and eval outputs
Confidence: high
Scope-risk: narrow
Directive: When a long benchmark is still active, checkpoint stage evidence explicitly and wait for report.json before changing strategy guidance
Tested: Verified process tree shows hybrid moved from build-index to evaluate.py; verified reference_progress.json reports 48 refs, 491 windows, 192-d embeddings, and complete status; verified report.json is still absent
Not-tested: Final hybrid eval metrics, subsequent high_energy run, and final three-seed aggregate
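The checks listed under Tested could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: the function name `checkpoint_status` and the shape of its return value are assumptions made for illustration, while the file locations and JSON keys are the ones quoted in this commit's docs.

```python
import json
from pathlib import Path


def checkpoint_status(run_dir):
    """Summarize in-flight evidence for one benchmark run directory.

    Hypothetical helper: only the paths and JSON keys come from the commit.
    """
    run_dir = Path(run_dir)
    progress_path = run_dir / "hybrid" / "fma_index_smoke" / "reference_progress.json"
    report_path = run_dir / "report.json"

    # A missing progress file is treated as "index not complete".
    progress = {}
    if progress_path.is_file():
        progress = json.loads(progress_path.read_text(encoding="utf-8"))

    return {
        "index_complete": progress.get("status") == "complete",
        "refs_done": progress.get("refs_done"),
        "windows_done": progress.get("windows_done"),
        "embedding_shape": progress.get("embedding_shape"),
        # Strategy guidance should only change once this is True.
        "report_ready": report_path.is_file(),
    }
```

Calling `checkpoint_status("/tmp/ab_smoke_seg_cap48_top2_seed999")` would return the index counters plus a `report_ready` flag, which is the condition the Directive says to wait for.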

Commit bdc04f72 ... bdc04f720ff1250976ade4d9d8fcde28b87fefa5 authored 2026-06-02 18:22:40 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 49 additions and 1 deletion
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 Checkpoint: fresh evidence from the in-flight benchmark
+
+Completed:
+- Recorded fresh evidence from the running `cap48 top2 seed=999` benchmark.
+- Confirmed that the pipeline did not hang at `hybrid build-index`; it advanced to `evaluate.py`.
+- Recorded the metrics of the completed reference index: `48 refs / 491 windows / 192-d embedding`.
+
+Verification evidence:
+- `reference_progress.json` shows:
+  - `status=complete`
+  - `refs_done=48`
+  - `windows_done=491`
+  - `embedding_shape=[491, 192]`
+  - `elapsed_sec=80.26`
+- The process tree shows:
+  - `evaluate.py ... --seed 999 --max-queries 24` is running
+
+Notes:
+- As of this checkpoint, `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated.
+- The 3-seed aggregate conclusion is therefore left unchanged in this update.
+
 ## 2026-06-02 Delivery checkpoint: handoff / changelist / agent memory

 Completed:
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -54,3 +54,9 @@ cd /workspace/acr-engine
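 3. Write the results back to workflow / handoff / changelog.
 4. Commit and push.
 5. Then start the cap64 or bucket benchmark.
+
+## Evidence added in this commit
+
+- Confirmed that `cap48 top2 seed=999` is not stuck at build-index.
+- `hybrid` finished building the reference index and then entered `evaluate.py`.
+- This commit records the fresh verification evidence so the next session does not have to repeat the investigation.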
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
@@ -20,6 +20,11 @@

 ### Blocker 1: seed=999 benchmark not finished

+Latest status:
+- `hybrid` reference index is complete
+- `evaluate.py` is currently running
+- The overall `report.json` has not been written to disk yet
+
 To check:
 - `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json`
 - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json`
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -216,9 +216,25 @@
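 - A new session can continue from this file and `AGENT.md`.

 ### Current blockers
-- `cap48 top2 seed=999` is still running or awaiting wrap-up; the final 3-seed aggregate conclusion has not been written back yet.
+- `cap48 top2 seed=999` is still running. It has been confirmed to have moved from `hybrid build-index` into `evaluate.py`, but the final `report.json` and the 3-seed aggregate conclusion have not been written back yet.
 - The workspace holds a large number of data and model artifacts; for now, commit only the specific documentation files.

+### Latest verification evidence (around 2026-06-02 18:21 UTC)
+- The `hybrid` reference index is complete: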
+  - `refs_done=48 / refs_total=48`
+  - `windows_done=491`
+  - `embedding_shape=[491, 192]`
+  - `elapsed_sec=80.26`
+- The corresponding files have been written:
+  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_embs.npy`
+  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_ids.npy`
+  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json`
+- The process tree confirms the run has entered:
+  - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24`
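+- As of this checkpoint:
+  - The overall report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated
+  - The `high_energy` stage has not yet produced any visible evaluation results
+
 ### Top-priority to-dos
 1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated.
 2. If it has, compute the aggregate over the three seeds `default + 123 + 999`.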
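
The two to-do items above amount to a gate followed by an average, which might be sketched like this. The function name `aggregate_if_complete` is hypothetical, and the sketch assumes each `report.json` is a flat JSON object with numeric top-level metrics; the real report schema is not shown in this commit, and the report paths for the `default` and `123` seeds have to be supplied by the caller.

```python
import json
from pathlib import Path
from statistics import mean


def aggregate_if_complete(report_paths):
    """Return per-metric means across seeds, or None if any report is missing.

    report_paths maps a seed label to the path of that seed's report.json.
    Assumption: reports are flat JSON objects with numeric top-level metrics.
    """
    paths = {seed: Path(p) for seed, p in report_paths.items()}
    if not paths or not all(p.is_file() for p in paths.values()):
        return None  # at least one seed has not finished; do not aggregate

    reports = [json.loads(p.read_text(encoding="utf-8")) for p in paths.values()]

    def is_number(value):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    # Average only the keys that every report has with a numeric value.
    shared = set.intersection(*(set(r) for r in reports))
    return {
        key: mean(r[key] for r in reports)
        for key in sorted(shared)
        if all(is_number(r[key]) for r in reports)
    }
```

Returning `None` while any report is absent mirrors the decision recorded in this commit: no 3-seed conclusion is drawn until the seed=999 report exists.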