Commit bdc04f72 bdc04f720ff1250976ade4d9d8fcde28b87fefa5 by cnb.bofCdSsphPA

Preserve fresh benchmark evidence before the evaluation finishes

Constraint: The running cap48 seed=999 benchmark has not emitted its final report yet, so only in-flight evidence can be recorded safely
Rejected: Claim a new three-seed conclusion now | The aggregate would be speculative without report.json and eval outputs
Confidence: high
Scope-risk: narrow
Directive: When a long benchmark is still active, checkpoint stage evidence explicitly and wait for report.json before changing strategy guidance
Tested: Verified process tree shows hybrid moved from build-index to evaluate.py; verified reference_progress.json reports 48 refs, 491 windows, 192-d embeddings, and complete status; verified report.json is still absent
Not-tested: Final hybrid eval metrics, subsequent high_energy run, and final three-seed aggregate
1 parent 0d40b05c
## 2026-06-02 Checkpoint: new evidence from the in-flight benchmark
Completed:
- Appended fresh run evidence for `cap48 top2 seed=999`.
- Confirmed that the run is not stuck in `hybrid build-index` and has moved on to `evaluate.py`.
- Recorded the completed reference index metrics: `48 refs / 491 windows / 192-d embedding`.
Verification evidence:
- `reference_progress.json` shows:
  - `status=complete`
  - `refs_done=48`
  - `windows_done=491`
  - `embedding_shape=[491, 192]`
  - `elapsed_sec=80.26`
- The process tree shows:
  - `evaluate.py ... --seed 999 --max-queries 24` is running
Notes:
- As of this checkpoint, `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated.
- The 3-seed aggregate conclusion is therefore left unchanged in this update.
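
The index check recorded above can be repeated with a small script. This is a minimal sketch, assuming `reference_progress.json` is a flat JSON object with the field names quoted in this checkpoint; `index_is_complete` and `load_progress` are hypothetical helpers, not part of the repository, and the rule that the embedding row count equals `windows_done` is an assumption inferred from the values listed above.

```python
import json
from pathlib import Path


def index_is_complete(progress: dict) -> bool:
    """Return True when the record describes a finished reference index.

    Assumes the field names quoted in this checkpoint (status,
    windows_done, embedding_shape); the real file may carry more fields.
    """
    shape = progress.get("embedding_shape") or []
    return (
        progress.get("status") == "complete"
        and len(shape) == 2
        # Assumption: one embedding row per processed window.
        and shape[0] == progress.get("windows_done")
    )


def load_progress(path: Path) -> dict:
    """Read a progress file from disk and parse it as JSON."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A record such as `{"status": "complete", "windows_done": 491, "embedding_shape": [491, 192]}` passes the check, while a record whose row count disagrees with `windows_done` does not.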
## 2026-06-02 Delivery checkpoint: handoff / changelist / agent memory
Completed:
......
......@@ -54,3 +54,9 @@ cd /workspace/acr-engine
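3. Write the results back to workflow / handoff / changelog.
4. Commit and push.
5. Then start the cap64 or bucket benchmark.
## Evidence appended in this update
- Confirmed that `cap48 top2 seed=999` is not stuck in build-index.
- `hybrid` finished the reference index and then entered `evaluate.py`.
- This commit preserves the fresh verification evidence so that the next session does not have to repeat the investigation.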
......
......@@ -20,6 +20,11 @@
### Blocker 1: seed=999 benchmark not finished
Latest status:
- The `hybrid` reference index is complete
- `evaluate.py` is currently running
- `report.json` has not been written to disk yet
Still to check:
- `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json`
- `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json`
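
The two pending paths above can be polled with a short script. This is a minimal sketch: `missing_outputs` is a hypothetical helper, and treating an empty file as "not yet written" is an assumption made here to guard against a file that has been created but not flushed.

```python
from pathlib import Path

# Paths copied from the "still to check" list above.
PENDING = [
    "/tmp/ab_smoke_seg_cap48_top2_seed999/report.json",
    "/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json",
]


def missing_outputs(paths: list[str]) -> list[str]:
    """Return the paths that are not yet present as non-empty regular files."""
    missing = []
    for raw in paths:
        p = Path(raw)
        # A zero-byte file is treated as missing (assumption, see lead-in).
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(raw)
    return missing


if __name__ == "__main__":
    for path in missing_outputs(PENDING):
        print(f"still missing: {path}")
```

An empty result means both outputs exist and the three-seed aggregate can be attempted.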
......
......@@ -216,9 +216,25 @@
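- A new session can continue from this file and `AGENT.md`.
### Current blockers
- `cap48 top2 seed=999` is still running or awaiting wrap-up; the final 3-seed aggregate conclusion has not been written back.
- `cap48 top2 seed=999` is still running. It has been confirmed to have moved from `hybrid build-index` into `evaluate.py`, but neither the final `report.json` nor the 3-seed aggregate conclusion has been written back yet.
- The working tree contains many data and model artifacts; for now, commit only the specific documentation files.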
### Latest verification evidence (around 2026-06-02 18:21 UTC)
- The `hybrid` reference index is complete:
  - `refs_done=48 / refs_total=48`
  - `windows_done=491`
  - `embedding_shape=[491, 192]`
  - `elapsed_sec=80.26`
- The corresponding files have been written:
  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_embs.npy`
  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_ids.npy`
  - `/tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_index_smoke/reference_progress.json`
- The process tree confirms the run has entered:
  - `evaluate.py --data /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma/manifests ... --output-json /tmp/ab_smoke_seg_cap48_top2_seed999/hybrid/fma_reports_smoke/eval.json --seed 999 --max-queries 24`
- As of this checkpoint:
  - The top-level report `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has not been generated
  - The `high_energy` stage has not yet produced any visible evaluation results
### Top-priority to-dos
1. Check whether `/tmp/ab_smoke_seg_cap48_top2_seed999/report.json` has been generated.
2. If it has, compute the aggregate over the three seeds `default + 123 + 999`.
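
Once all three reports exist, the aggregate step could look roughly like the following. This is a sketch under stated assumptions: the schema of `report.json` is not shown in this document, so the input here is a hypothetical mapping from seed label to a flat `{metric_name: value}` dict, and `aggregate_seeds` is an illustrative helper rather than the project's own aggregation code.

```python
from statistics import mean, pstdev


def aggregate_seeds(per_seed: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Compute mean and population standard deviation per metric across seeds.

    Only metrics present for every seed are aggregated, so a partially
    finished run cannot silently skew the result.
    """
    if not per_seed:
        return {}
    # Metric names shared by every seed (hypothetical flat schema).
    shared = set.intersection(*(set(metrics) for metrics in per_seed.values()))
    return {
        name: {
            "mean": mean(run[name] for run in per_seed.values()),
            "std": pstdev(run[name] for run in per_seed.values()),
            "n": len(per_seed),
        }
        for name in sorted(shared)
    }
```

The population standard deviation is used because the three seeds are the whole set being summarized; with so few runs, the spread is best read as a rough indication only.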
......