Extend the business-corpus voice correctness baseline to type8 and type16
Constraint: we need a complete hard-query picture before claiming the workspace_music20 voice lane is usable or deciding where pgvector work should start
Rejected: extrapolating from type_7 alone | the type_8 and type_16 lanes can fail differently and need their own measured baselines
Confidence: high
Scope-risk: narrow
Directive: keep all future business-corpus voice evaluations split by query type so we can see exactly which hard lanes fail and why
Tested: /usr/local/miniconda3/bin/python -m unittest discover -s acr-engine/tests -v; generated voice_workspace20_type8_eval.json (top1=0.0, top3=0.0) and voice_workspace20_type16_eval.json (top1=0.0, top3=0.0)
Not-tested: improved business-corpus voice correctness after moving to embedding/pgvector retrieval
Showing 5 changed files with 6 additions and 1 deletion
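| 1 | - Added `acr-engine/data/local_eval/voice_workspace20_type7_eval.json`, a batch validation of 20 `type_7` queries against the current `workspace_music20` semantics: `top1=0.0`, `top3=0.05`, which shows that business song_id correctness is still clearly insufficient. | 1 | - Added `acr-engine/data/local_eval/voice_workspace20_type7_eval.json`, a batch validation of 20 `type_7` queries against the current `workspace_music20` semantics: `top1=0.0`, `top3=0.05`, which shows that business song_id correctness is still clearly insufficient. |
| 2 | - Added `acr-engine/data/local_eval/voice_workspace20_type8_eval.json` and `voice_workspace20_type16_eval.json` to extend the business-corpus voice correctness baseline: `type_8 top1=0.0/top3=0.0`, `type_16 top1=0.0/top3=0.0`. | ||
| 2 | - Current architect review conclusion: `APPROVED (WATCH)`. Work may continue along the current architecture, but the current business-corpus results must not be treated as done. | 3 | - Current architect review conclusion: `APPROVED (WATCH)`. Work may continue along the current architecture, but the current business-corpus results must not be treated as done. |
| 3 | - `docs/session-handoff.md` has been refreshed to the latest voice service runtime status: `/health` is available, `/recognize/voice` still times out, and the shortest next debugging path is spelled out | 4 | - `docs/session-handoff.md` has been refreshed to the latest voice service runtime status: `/health` is available, `/recognize/voice` still times out, and the shortest next debugging path is spelled out |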
| 4 | 5 | ... | ... |
| ... | @@ -24,7 +24,7 @@ flowchart TD | ... | @@ -24,7 +24,7 @@ flowchart TD |
| 24 | | benchmark report generated | | | 24 | | benchmark report generated | | |
| 25 | | model card generated | | | 25 | | model card generated | | |
| 26 | | license registry updated | | | 26 | | license registry updated | | |
| 27 | | service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but batch validation is currently poor (`type_7 top1=0.0`, `top3=0.05`) | | 27 | | service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but batch validation is currently poor (`type_7 top1=0.0/top3=0.05`, `type_8 top1=0.0/top3=0.0`, `type_16 top1=0.0/top3=0.0`) | |
| 28 | | dataset whitelist confirmed | | | 28 | | dataset whitelist confirmed | | |
| 29 | | changelog updated | yes | | 29 | | changelog updated | yes | |
| 30 | | architect review completed | yes (approved with watch) | | 30 | | architect review completed | yes (approved with watch) | | ... | ... |
| ... | @@ -51,6 +51,10 @@ | ... | @@ -51,6 +51,10 @@ |
| 51 | - `top1=0.0` | 51 | - `top1=0.0` |
| 52 | - `top3=0.05` | 52 | - `top3=0.05` |
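| 53 | - This shows that although the current business sample semantics now run end to end, song_id correctness is still very poor. It must keep being optimized and cannot be treated as a usable recognition capability. | 53 | - This shows that although the current business sample semantics now run end to end, song_id correctness is still very poor. It must keep being optimized and cannot be treated as a usable recognition capability. |
| 54 | - The `type_8 / type_16` business-corpus voice correctness baselines have now been filled in as well: | ||
| 55 | - `voice_workspace20_type8_eval.json`: `num_queries=15`, `top1=0.0`, `top3=0.0` | ||
| 56 | - `voice_workspace20_type16_eval.json`: `num_queries=12`, `top1=0.0`, `top3=0.0` | ||
| 57 | - This shows that the current `/workspace`-based local chroma+FAISS voice lane is almost unusable on hard queries. Follow-up work should prioritize moving to an embedding/pgvector evaluation path that is closer to production. | ||
| 54 | - Current architect review conclusion: `APPROVED (WATCH)`. Work may continue along the current architecture, but "the pipeline runs" must be clearly distinguished from "the business result is correct". | 58 | - Current architect review conclusion: `APPROVED (WATCH)`. Work may continue along the current architecture, but "the pipeline runs" must be clearly distinguished from "the business result is correct". |
| 55 | 59 | ||
| 56 | ### Latest update (2026-06-03 voice service runtime) | 60 | ### Latest update (2026-06-03 voice service runtime) | ... | ... |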
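
The per-type `top1` / `top3` figures quoted in this diff could be computed along the following lines. This is only a minimal sketch, not the repository's actual evaluation code: the function name `topk_by_type` and the record keys `query_type`, `expected_song_id`, and `ranked_song_ids` are assumptions made for illustration, and the sample records are synthetic.

```python
from collections import defaultdict


def topk_by_type(results, ks=(1, 3)):
    """Compute top-k hit rates separately for each query type.

    `results` is an iterable of dicts with the (hypothetical) keys
    'query_type', 'expected_song_id', and 'ranked_song_ids'.
    """
    hits = defaultdict(lambda: {k: 0 for k in ks})
    counts = defaultdict(int)
    for record in results:
        query_type = record["query_type"]
        counts[query_type] += 1
        for k in ks:
            # A hit at k means the expected song is among the first k candidates.
            if record["expected_song_id"] in record["ranked_song_ids"][:k]:
                hits[query_type][k] += 1
    report = {}
    for query_type, n in counts.items():
        report[query_type] = {"num_queries": n}
        for k in ks:
            report[query_type][f"top{k}"] = hits[query_type][k] / n
    return report


# Synthetic data only: 20 type_7 queries, one of which is found at rank 2,
# plus 15 type_8 queries with no hits at all.
sample = [
    {"query_type": "type_7", "expected_song_id": "s0",
     "ranked_song_ids": ["x", "s0", "y"]},
]
sample += [
    {"query_type": "type_7", "expected_song_id": "s1",
     "ranked_song_ids": ["x", "y", "z"]}
    for _ in range(19)
]
sample += [
    {"query_type": "type_8", "expected_song_id": "s2",
     "ranked_song_ids": ["x", "y", "z"]}
    for _ in range(15)
]
report = topk_by_type(sample)
```

Keeping one entry per query type in the report, rather than a single pooled number, follows the commit's directive to split business-corpus voice evaluations by query type so that each hard lane's failure stays visible.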