capture the first real-path index-to-evaluate closure

Constraint: Delivery state must reflect fresh evaluate evidence without staging temporary eval assets
Rejected: Wait for larger-scale or hard-case metrics | The first explicit evaluate closure is already a meaningful milestone and restart-safe handoff point
Confidence: high
Scope-risk: narrow
Directive: Reuse /tmp/fma_realpath_small_rerun_index2 and /tmp/fma_realpath_small_rerun_eval as the next validation baseline before scaling up
Tested: Verified eval_top50.json at num_queries 35 with top1 0.8571 and topk 1.0, confirmed query-count explanation, and updated handoff/changelog docs
Not-tested: Larger query caps, hard-case buckets, and full-scale FMA evaluate runs

Commit 81704ace ... 81704aceac547a76573fe45c658df3ff529cfcf6 authored 2026-06-02 23:41:33 +0800 by cnb.bofCdSsphPA
Showing 5 changed files with 123 additions and 46 deletions
AGENT.md
docs/CHANGELOG.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/session-handoff.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -74,31 +74,24 @@
 ## 5.5 Latest real FMA / chromaprint run state (2026-06-02)
-### Current latest snapshot (15:35 UTC)
+### Current latest snapshot (15:40 UTC)
- Remote sync baseline: `41c4d7c` (before this update)
+- Remote sync baseline: `9371e94` (before this update)
- Most important new evidence: **the fixed real-path 200-ref rerun has fully produced the final reference index**.
+- Most important new evidence: **the fixed real-path 200-ref rerun has produced its first explicit evaluate metrics**.
- Output directory: `/tmp/fma_realpath_small_rerun_index2`
+- Index path: `/tmp/fma_realpath_small_rerun_index2`
- chromaprint complete:
+- Eval path: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
-  - `status=complete`
+- Current results:
-  - `refs_done=200/200`
+  - `num_queries=35`
-  - `skipped_refs=0`
+  - `top1=0.8571`
-  - `hashes=57577`
+  - `topk=1.0`
-  - `postings=187446`
+  - `by_type.clean: n=35, top1=0.8571, topk=1.0`
- reference complete:
+- Query count note: of the `235` overlap test items, only `35` are non-`reference` queries, so `--max-queries 50` ends up evaluating `35`.
-  - `status=complete`
+- This means a complete, reusable real-path smoke evidence chain now exists:
-  - `refs_done=200/200`
+  `chromaprint complete -> reference complete -> evaluate complete`
-  - `windows_done=2068`
-  - `embedding_shape=[2068, 192]`
-  - `skipped_refs=0`
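- Final artifacts now present:
-  - `reference_embs.npy`
-  - `reference_ids.npy`
- This means that, after the `flush=True` and bad-audio skip-tolerance fixes, the real-path rerun has passed completely through both core index-building stages.
 - Next events worth committing:
-  1. `evaluate.py` starts, or an explicit evaluate smoke completes
+  1. Evaluation with more queries / a larger reference set
-  2. identify / retrieval metrics produced
+  2. `confused` / `humming_like` / hard-negative metrics
-  3. Or new larger-sample / full rerun results
+  3. Results on data mixes closer to commercial scenarios
 ## 6. High-risk notes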
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
+## 2026-06-02 15:40 UTC / real-path 200-ref rerun closed the first explicit evaluate loop
+- Ran one explicit `evaluate.py` smoke on top of the completed `200-ref` real-path index
+- First located and fixed an evaluation environment problem:
+  - The initial failure was caused by missing `/tmp/fma_realpath_small_rerun_eval/audio/...`
+  - Filled the gap with a symlink: `/tmp/fma_realpath_small_rerun_eval/audio -> /workspace/acr-engine/data/external_smoke/fma/audio`
+- Fresh evidence (`2026-06-02 15:40:30 UTC`):
+  - `eval_top50.json` written to disk: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
+  - Result: `num_queries=35`, `top1=0.8571`, `topk=1.0`
+  - `by_type.clean`: `n=35`, `top1=0.8571`, `topk=1.0`
+- Additional verification:
+  - overlap test items = `235`
+  - of which non-`reference` queries = `35`
+  - so even with `--max-queries 50`, only `35` queries are ever evaluated
+- Conclusion: the first real-path `build-index -> evaluate` closed-loop evidence is now in hand
+- Next key milestones:
+  1. Scale to more queries or a larger reference set
+  2. Add hard-case evaluation such as `confused` / `humming_like`
 ## 2026-06-02 15:35 UTC / real-path 200-ref rerun finished reference index
 - fixed real-path 200 reference rerun: `/tmp/fma_realpath_small_rerun_index2` has completed the reference/embedding stage
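The environment fix recorded in the CHANGELOG hunk above (a missing `audio/` directory repaired with a symlink) could be scripted roughly as follows. This is a minimal sketch, not code from the repository: the helper name `ensure_audio_link` is hypothetical, and the only paths assumed are the two quoted in the changelog entry.

```python
import os
from pathlib import Path


def ensure_audio_link(link_path: str, target_dir: str) -> Path:
    """Create link_path as a symlink to target_dir unless it already exists.

    Hypothetical helper for illustration; not part of the repository.
    """
    link = Path(link_path)
    target = Path(target_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"audio source directory missing: {target}")
    if link.is_symlink() or link.exists():
        # Leave an existing directory or link untouched.
        return link
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(target, link, target_is_directory=True)
    return link


# Example call with the paths quoted in the changelog entry:
# ensure_audio_link(
#     "/tmp/fma_realpath_small_rerun_eval/audio",
#     "/workspace/acr-engine/data/external_smoke/fma/audio",
# )
```

Checking for an existing path first keeps the helper safe to call again on a restart, which fits the restart-safe handoff goal stated in the commit message.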
--- a/docs/changelist-2026-06-02.md
+++ b/docs/changelist-2026-06-02.md
@@ -162,3 +162,33 @@
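 - Confirmed so far: the fixed real-path rerun not only reaches the reference stage but also fully produces the final embedding index.
 - The highest-value work for the next round should shift to whether the evaluation chain connects automatically and, if needed, to adding an explicit evaluate smoke.
+## Additional delivery in this update (2026-06-02 15:40 UTC)
+### New run evidence
+| Category | Content |
+|---|---|
+| evaluate | Explicit `evaluate.py` smoke completed |
+| Query scale | `num_queries=35` (all non-reference queries in the overlap) |
+| Metrics | `top1=0.8571`, `topk=1.0` |
+| by_type | `clean: n=35, top1=0.8571, topk=1.0` |
+### Most important fresh evidence
+- Observed at: `2026-06-02 15:40:30 UTC`
+- Result file: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
+- Evaluation results:
+  - `split=test`
+  - `num_queries=35`
+  - `top1=0.8571`
+  - `topk=1.0`
+- Query count note:
+  - overlap test items = `235`
+  - non-`reference` queries = `35`
+  - so `--max-queries 50` actually evaluates `35`
+### Conclusion
+- This is no longer just a successful index build: the first real-path `build-index -> evaluate` closed-loop evidence is now in hand.
+- The next round should focus on larger evaluation scale and hard-case / confusion evaluation.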
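The query-count note in the changelist above amounts to a filter followed by a cap: 235 overlap items minus 35 queries leaves 200 reference items, which is consistent with the 200-ref catalog. A minimal sketch, assuming each test item carries a `type` field whose value is `reference` for catalog entries; that field name, the `select_queries` helper, and the synthetic items are illustrative assumptions, not the actual `evaluate.py` logic.

```python
from typing import Dict, List


def select_queries(items: List[Dict[str, str]], max_queries: int) -> List[Dict[str, str]]:
    """Drop reference items, then keep at most max_queries of the rest."""
    queries = [item for item in items if item.get("type") != "reference"]
    return queries[:max_queries]


# Synthetic stand-in for the 235 overlap items: 200 reference + 35 clean queries.
items = [{"id": f"ref{i}", "type": "reference"} for i in range(200)]
items += [{"id": f"q{i}", "type": "clean"} for i in range(35)]

selected = select_queries(items, max_queries=50)
print(len(selected))  # 35, because only 35 non-reference items exist
```

Under these assumptions, raising `--max-queries` alone cannot increase the evaluated count; more non-reference test items are needed first.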
--- a/docs/delivery-handoff-2026-06-02.md
+++ b/docs/delivery-handoff-2026-06-02.md
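+## Additional update to this delivery package (2026-06-02 15:40 UTC)
+### Delivery conclusion
+The latest milestone has moved from "reference index complete" to **the fixed real-path 200-ref rerun has produced its first explicit evaluate metrics**:
+- Remote baseline is currently: `9371e94` (before this update)
+- The real-path `200-ref` index is fully complete
+- The explicit `evaluate.py` smoke is complete
+- First results: `top1=0.8571`, `topk=1.0`, `num_queries=35`
+- The main line has therefore moved from "can the index run end to end" to the "evaluation quality and hard-case expansion" stage
+### Current facts
+#### evaluate smoke path
+- Observed at: `2026-06-02 15:40:30 UTC`
+- Result file: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
+- Evaluation results:
+  - `split=test`
+  - `num_queries=35`
+  - `top1=0.8571`
+  - `topk=1.0`
+  - `by_type.clean`: `n=35`, `top1=0.8571`, `topk=1.0`
+- Where the query count comes from:
+  - Overlap between the 200-ref catalog and the existing external smoke test = `235` items
+  - of which non-`reference` queries = `35`
+  - so `--max-queries 50` actually evaluates only `35`
+### Current assessment
+- A complete, reusable real-path smoke evidence chain now exists:
+  `chromaprint complete -> reference complete -> evaluate complete`
+- The more valuable next steps are:
+  1. Increase the number of evaluation queries and the reference scale;
+  2. Add `confused` / `humming_like` / hard-negative evaluation.
+---
 ## Additional update to this delivery package (2026-06-02 15:35 UTC)
 ### Delivery conclusion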
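As a sanity check on the numbers reported in the handoff above, `top1=0.8571` over `num_queries=35` is consistent with 30 correct top-1 matches, since 30/35 ≈ 0.857143. The sketch below recovers that count from a result dictionary. The dictionary literal mirrors only the fields quoted in this diff; the layout of the real `eval_top50.json` beyond those fields and the exact rounding it uses are not known from this diff, and `implied_hits` is a hypothetical helper.

```python
def implied_hits(rate: float, num_queries: int) -> int:
    """Return the whole number of hits closest to rate * num_queries."""
    return round(rate * num_queries)


# Fields as quoted in the handoff; any other keys in the real file are unknown.
result = {"num_queries": 35, "top1": 0.8571, "topk": 1.0}

top1_hits = implied_hits(result["top1"], result["num_queries"])
topk_hits = implied_hits(result["topk"], result["num_queries"])

# The reported rate should match hits / num_queries to 4 decimal places.
assert round(top1_hits / result["num_queries"], 4) == result["top1"]
print(top1_hits, topk_hits)  # 30 35
```

With only 35 queries, each miss moves top-1 by about 2.9 points, which is one reason the larger query sets listed as next steps matter.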
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -5,29 +5,27 @@
 ## One-page summary
-### Latest delivery snapshot (2026-06-02 15:35 UTC)
+### Latest delivery snapshot (2026-06-02 15:40 UTC)
- Current remote sync baseline: `41c4d7c` (before this update)
+- Current remote sync baseline: `9371e94` (before this update)
- Most important new fact: **the fixed real-path 200-ref rerun has fully produced the final reference index**
+- Most important new fact: **the fixed real-path 200-ref rerun has produced its first explicit evaluate metrics**
- Output directory: `/tmp/fma_realpath_small_rerun_index2`
+- Index path: `/tmp/fma_realpath_small_rerun_index2`
- chromaprint stage:
+- Eval path: `/tmp/fma_realpath_small_rerun_eval/eval_top50.json`
-  - `status=complete`
+- Current core results:
-  - `refs_done=200/200`
+  - `num_queries=35`
-  - `skipped_refs=0`
+  - `top1=0.8571`
- reference stage:
+  - `topk=1.0`
-  - `status=complete`
+  - `by_type.clean: n=35, top1=0.8571, topk=1.0`
-  - `refs_done=200/200`
+- Query count note:
-  - `windows_done=2068`
+  - overlap test items = `235`
-  - `embedding_shape=[2068, 192]`
+  - non-`reference` queries = `35`
-  - `skipped_refs=0`
+  - so `--max-queries 50` actually evaluates `35`
- Final artifacts now present:
+- Conclusion: the first complete real-path closed loop now exists:
-  - `reference_embs.npy`
+  `chromaprint -> reference -> evaluate`
-  - `reference_ids.npy`
- Conclusion: the fixed real-path rerun has fully crossed the two core stages `chromaprint -> reference`; the next priority is verifying that the evaluation chain connects.
 - First priorities for a new session:
-  1. Check whether follow-up evaluate / identify evidence already exists
+  1. Scale up query / reference size
-  2. If not, run an explicit evaluate smoke on this completed index
+  2. Add hard-case (`confused` / `humming_like`) evaluation
-  3. Then decide whether to continue scaling to a larger sample / full FMA
+  3. Then decide whether to push to a larger FMA subset or the full set
 ### Latest observability fix (2026-06-02 15:18 UTC)