Industrial Benchmark Spec


Updated: 2026-06-02


One-page summary


An industrial-grade ACR system cannot be judged on overall top1 alone.
It must also be evaluated on:


hard-case
rejection / false accept
latency / scale
license provenance completeness


1. Benchmark layer diagram

flowchart TD
    A[Industrial Benchmark] --> B[Accuracy]
    A --> C[Robustness]
    A --> D[Operational]
    A --> E[Compliance]

    B --> B1[top1/top5/MRR]
    C --> C1[humming/confused/noisy]
    D --> D1[latency/indexing/throughput]
    E --> E1[data provenance/license coverage]


2. Metrics table


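| Dimension | Metric | Goal |
| --- | --- | --- |
| Accuracy | top1 / top5 / MRR | Primary recognition quality |
| Robustness | humming/confused/noisy top1 | Hard-case quality |
| Operational | p50/p95 latency | Serving capacity |
| Operational | index throughput | Index-building capacity |
| Safety | false accept rate | Misidentification risk |
| Compliance | license coverage | Prerequisite for commercial use |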
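
The accuracy and safety rows can be made concrete with a small sketch. The function names and the toy data below are hypothetical, and the false accept rate uses an assumed definition (the share of out-of-catalog queries that the system accepted anyway), since the spec does not pin one down.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of queries whose true ID appears among the first k candidates."""
    if not truth:
        raise ValueError("need at least one query")
    hits = sum(1 for cands, t in zip(ranked, truth) if t in cands[:k])
    return hits / len(truth)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], truth: Sequence[str]) -> float:
    """Mean of 1/rank of the true ID; a query whose true ID is absent scores 0."""
    if not truth:
        raise ValueError("need at least one query")
    total = 0.0
    for cands, t in zip(ranked, truth):
        if t in cands:
            total += 1.0 / (list(cands).index(t) + 1)
    return total / len(truth)


def false_accept_rate(accepted: Sequence[bool]) -> float:
    """Assumed definition: share of out-of-catalog queries that were accepted."""
    if not accepted:
        raise ValueError("need at least one out-of-catalog query")
    return sum(accepted) / len(accepted)


# Toy data (made up): ranked candidate IDs per query and the true ID per query.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p", "q", "r"]]
truth = ["a", "y", "o", "not_returned"]

print("top1:", top_k_accuracy(ranked, truth, k=1))
print("top3:", top_k_accuracy(ranked, truth, k=3))
print("MRR:", mean_reciprocal_rank(ranked, truth))
print("FAR:", false_accept_rate([True, False, False, False]))
```

MRR rewards a correct answer at rank 2 or 3 that top1 ignores, which is why the table lists both.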


3. Scenario diagram

flowchart LR
    Q[Queries] --> Q1[clean]
    Q --> Q2[augmented]
    Q --> Q3[humming_like]
    Q --> Q4[confused]
    Q --> Q5[noisy/compressed]


4. Explanatory notes


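4.1 Why hard cases get their own report

Overall top1 easily masks failures in the humming and confused scenarios, and those are exactly the scenarios users are most sensitive to.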
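
A minimal sketch of the masking effect, using made-up counts: 900 clean queries at 98% top1 and 100 humming_like queries at 40% top1. The helper name `top1_by_bucket` and the record layout are assumptions for illustration only.

```python
from collections import defaultdict


def top1_by_bucket(records):
    """records: iterable of (bucket, is_correct) pairs; returns {bucket: top1}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for bucket, ok in records:
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up evaluation results, dominated by the easy bucket.
records = (
    [("clean", True)] * 882
    + [("clean", False)] * 18
    + [("humming_like", True)] * 40
    + [("humming_like", False)] * 60
)

overall = sum(ok for _, ok in records) / len(records)
per_bucket = top1_by_bucket(records)

print(f"overall top1: {overall:.3f}")
for bucket, value in per_bucket.items():
    print(f"{bucket} top1: {value:.3f}")
```

With these numbers the overall figure stays above 0.92 while the humming_like bucket sits at 0.40, which a single aggregate would never reveal.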


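4.2 Why add operational metrics

An industrial-grade system is not an offline competition model; it has to account for serving latency and the cost of incremental indexing.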
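
One way to report p50/p95 latency is the nearest-rank percentile, sketched below. The spec does not say which percentile method to use, so nearest-rank is an assumption here, and the latency samples are invented.

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented per-query latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14,
                15, 13, 12, 14, 13, 12, 16, 15, 14, 13]

print("p50:", percentile_nearest_rank(latencies_ms, 50))
print("p95:", percentile_nearest_rank(latencies_ms, 95))
print("max:", percentile_nearest_rank(latencies_ms, 100))
```

Reporting p95 next to p50 surfaces tail behavior that a mean would blur; with only 20 samples a single outlier still falls outside p95, so larger sample sizes are needed before the tail figure means much.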


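4.3 Why put compliance in the benchmark

For a commercial system, if the provenance of the training or evaluation data cannot be traced, the system cannot be shipped safely no matter how high its accuracy is.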
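
License coverage could be computed roughly as follows. The record schema (`id`, `source`, `license`) and the definition (an item counts as covered only when both fields are non-empty) are assumptions for illustration, not something the spec defines.

```python
def license_coverage(items):
    """Share of items with both a non-empty 'source' and a non-empty 'license' (assumed definition)."""
    if not items:
        raise ValueError("need at least one item")
    covered = [it for it in items if it.get("source") and it.get("license")]
    return len(covered) / len(items)


def untraceable_ids(items):
    """IDs of items missing a source or a license, for follow-up before release."""
    return [it["id"] for it in items if not (it.get("source") and it.get("license"))]


# Hypothetical dataset manifest entries.
manifest = [
    {"id": "trk-001", "source": "vendor-a", "license": "commercial"},
    {"id": "trk-002", "source": "vendor-a", "license": "commercial"},
    {"id": "trk-003", "source": "", "license": "CC-BY-4.0"},
    {"id": "trk-004", "source": "in-house", "license": "owned"},
]

print("license coverage:", license_coverage(manifest))
print("needs review:", untraceable_ids(manifest))
```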


5. Appendix: details

Recommended release gate:


clean top1 >= 0.95
noisy top1 >= 0.85
confused top1 >= 0.70
humming_like top1 >= 0.60
top5 >= 0.95 on production-relevant buckets
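
The gate above could be enforced with a check along these lines. The thresholds are the ones listed; the function name, the input layout, and the choice of which buckets count as production-relevant are assumptions left to the caller.

```python
# Thresholds copied from the recommended release gate.
TOP1_GATES = {"clean": 0.95, "noisy": 0.85, "confused": 0.70, "humming_like": 0.60}
TOP5_GATE = 0.95


def check_release_gate(top1, top5, production_buckets):
    """Return a list of human-readable failures; an empty list means the gate passes.

    top1 / top5: {bucket: score}. production_buckets: buckets the caller treats
    as production-relevant (hypothetical parameter; the spec does not list them).
    """
    failures = []
    for bucket, minimum in TOP1_GATES.items():
        value = top1.get(bucket)
        if value is None or value < minimum:
            failures.append(f"{bucket} top1 {value} is below {minimum}")
    for bucket in production_buckets:
        value = top5.get(bucket)
        if value is None or value < TOP5_GATE:
            failures.append(f"{bucket} top5 {value} is below {TOP5_GATE}")
    return failures


# Made-up scores: everything passes except humming_like top1.
top1 = {"clean": 0.97, "noisy": 0.88, "confused": 0.74, "humming_like": 0.52}
top5 = {"clean": 0.99, "noisy": 0.96}

failures = check_release_gate(top1, top5, production_buckets=["clean", "noisy"])
print("PASS" if not failures else "FAIL:\n" + "\n".join(failures))
```

A missing bucket is treated as a failure rather than skipped, so a run that silently drops a hard-case bucket cannot pass the gate.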


Sources


See docs/references-and-sources.md for the current source map.