Record the first business-corpus voice correctness check

Constraint: the repo needs to distinguish runtime success from business-level song_id correctness before any production claim
Rejected: treating the workspace_music20 smoke as good enough | the current type_7 batch result is top1=0.0 and top3=0.05, which is far below a usable threshold
Confidence: high
Scope-risk: narrow
Directive: keep all future business-corpus voice evaluations written to local_eval artifacts and mirrored into changelog/checklist/handoff before push
Tested: /usr/local/miniconda3/bin/python -m unittest discover -s acr-engine/tests -v; generated acr-engine/data/local_eval/voice_workspace20_type7_eval.json with num_queries=20, top1=0.0, top3=0.05
Not-tested: improved business-corpus correctness after further retrieval tuning
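
The commit message reports `top1` and `top3` without defining them. A minimal sketch, assuming they are the usual top-k hit rates (the fraction of queries whose expected song_id appears among the first k returned candidates). The function name `top_k_accuracy` and the synthetic data are hypothetical and are not taken from the repo. Under that assumption, `top3=0.05` over 20 queries would correspond to a single hit, and `top1=0.0` to no query ranking the right song first.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of queries whose expected song_id is within the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        return 0.0
    hits = sum(1 for candidates, expected in zip(ranked, truth) if expected in candidates[:k])
    return hits / len(truth)


# Synthetic data (not the real corpus): 20 queries, all candidates wrong...
truth = [f"song_{i}" for i in range(20)]
ranked = [[f"wrong_{i}_a", f"wrong_{i}_b", f"wrong_{i}_c"] for i in range(20)]
# ...except one query where the expected song_id shows up at rank 2.
ranked[7][1] = truth[7]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
print(f"num_queries={len(truth)} top1={top1} top3={top3}")
```

On this synthetic input the sketch reproduces the reported shape of the result: no top-1 hits and one top-3 hit out of 20.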

Commit 5a01ab7faa1dad545daa57c00dd9f624249fb76f authored 2026-06-03 18:09:35 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 10 additions and 1 deletion
acr-engine/data/local_eval/voice_workspace20_type7_eval.json
docs/CHANGELOG.md
docs/release-checklist.md
docs/session-handoff.md
--- a/acr-engine/data/local_eval/voice_workspace20_type7_eval.json 0 → 100644
+++ b/acr-engine/data/local_eval/voice_workspace20_type7_eval.json 0 → 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
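+- Added `acr-engine/data/local_eval/voice_workspace20_type7_eval.json`: a 20-query `type_7` batch validation against the current `workspace_music20` semantics gives `top1=0.0` and `top3=0.05`, showing that business song_id correctness is still clearly insufficient.
+- Current architect review verdict: `APPROVED (WATCH)`. Work may continue on the current architecture, but the current business-corpus result must not be treated as done.
 - `docs/session-handoff.md` has been refreshed to the latest voice service runtime status, stating that `/health` is available, that `/recognize/voice` still times out, and the shortest next debugging path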

 ## 2026-06-03 voice-to-chunk and context export foundation
--- a/docs/release-checklist.md
+++ b/docs/release-checklist.md
@@ -24,7 +24,7 @@ flowchart TD
 | benchmark report generated |  |
 | model card generated |  |
 | license registry updated |  |
-| service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but business top1 correctness still needs manual/metric validation |
+| service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but batch validation is currently poor (`type_7 top1=0.0`, `top3=0.05`) |
 | dataset whitelist confirmed |  |
 | changelog updated | yes |
 | architect review completed | yes (approved with watch) |
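
The checklist row above separates a passing smoke test from business correctness. A minimal sketch of how such a gate could be checked against the evaluation artifact, assuming the JSON exposes `num_queries`, `top1`, and `top3` as top-level keys (the names come from the commit message; the real schema is not shown here). The thresholds and the function names are hypothetical placeholders, not values from the repo.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical thresholds; the commit does not state what "usable" means.
MIN_TOP1 = 0.5
MIN_TOP3 = 0.7


def load_report(path: str) -> Dict:
    """Read an evaluation artifact from disk (schema assumed, see above)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_business_correctness(report: Dict,
                               min_top1: float = MIN_TOP1,
                               min_top3: float = MIN_TOP3) -> List[str]:
    """Return a list of human-readable problems; an empty list means the gate passes."""
    problems: List[str] = []
    if report.get("num_queries", 0) <= 0:
        problems.append("report contains no queries")
    if report.get("top1", 0.0) < min_top1:
        problems.append(f"top1={report.get('top1', 0.0)} is below {min_top1}")
    if report.get("top3", 0.0) < min_top3:
        problems.append(f"top3={report.get('top3', 0.0)} is below {min_top3}")
    return problems


# Values reported in this commit, inlined so the sketch runs without the file.
report = {"num_queries": 20, "top1": 0.0, "top3": 0.05}
problems = check_business_correctness(report)
for problem in problems:
    print("FAIL:", problem)
```

With the numbers from this commit both metric checks fail, which matches the `partial` status recorded in the checklist.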
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -46,6 +46,13 @@
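 3. Wire the humming evaluation set into `evaluate.py` or a standalone evaluation script
 4. Continue the second round of docs consolidation, keeping only the currently valid main documents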

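+- Initial business-correctness check on the current `workspace_music20` (`acr-engine/data/local_eval/voice_workspace20_type7_eval.json`):
+  - `num_queries=20`
+  - `top1=0.0`
+  - `top3=0.05`
+  - This shows that although the current business sample semantics now run end to end, song_id correctness is still very poor. It must keep being optimized and cannot be treated as a usable recognition capability.
+- Current architect review verdict: `APPROVED (WATCH)`. Work may continue on the current architecture, but "pipeline runs" must be clearly distinguished from "business-correct".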
+
 ### Latest update (2026-06-03 voice service runtime)

 - Confirmed under the current interpreter `/usr/local/miniconda3/bin/python`: