

Industrial Benchmark Spec


Updated: 2026-06-02


One-page summary


An industrial-grade ACR system cannot be judged by overall top1 alone.
It must also be evaluated on:


hard-case
rejection / false accept
latency / scale
license provenance completeness


1. Benchmark layer diagram

flowchart TD
    A[Industrial Benchmark] --> B[Accuracy]
    A --> C[Robustness]
    A --> D[Operational]
    A --> E[Compliance]

    B --> B1[top1/top5/MRR]
    C --> C1[humming/confused/noisy]
    D --> D1[latency/indexing/throughput]
    E --> E1[data provenance/license coverage]


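2. Metrics table

| Dimension | Metric | Purpose |
|---|---|---|
| Accuracy | top1 / top5 / MRR | Core recognition quality |
| Robustness | humming/confused/noisy top1 | Hard-case quality |
| Operational | p50/p95 latency | Serving capacity |
| Operational | index throughput | Index-building capacity |
| Safety | false accept rate | Misidentification risk |
| Compliance | license coverage | Prerequisite for commercial use |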
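The accuracy and safety metrics in the table can be sketched as follows. This is a minimal illustration and not the repository's implementation: the function names, the input shapes (a ranked candidate list per query, one ground-truth ID per query), and the reading of false accept rate as "share of out-of-catalog queries that still got a match" are all assumptions.

```python
from typing import Optional, Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of queries whose true ID appears among the first k candidates."""
    hits = sum(1 for cands, t in zip(ranked, truth) if t in list(cands)[:k])
    return hits / len(truth)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], truth: Sequence[str]) -> float:
    """Mean of 1/rank of the true ID; a query whose true ID is absent contributes 0."""
    total = 0.0
    for cands, t in zip(ranked, truth):
        cands = list(cands)
        if t in cands:
            total += 1.0 / (cands.index(t) + 1)
    return total / len(truth)


def false_accept_rate(returned: Sequence[Optional[str]]) -> float:
    """Assumed definition: every query here is out-of-catalog, and returned[i]
    is the matched track ID, or None when the system rejected the query."""
    return sum(1 for r in returned if r is not None) / len(returned)


# Hypothetical toy data: three queries, the third has no correct candidate.
ranked = [["a", "b", "c"], ["x", "b", "y"], ["q", "r", "s"]]
truth = ["a", "b", "z"]

top1 = top_k_accuracy(ranked, truth, 1)      # 1 of 3 queries correct at rank 1
top5 = top_k_accuracy(ranked, truth, 5)      # 2 of 3 queries correct within 5
mrr = mean_reciprocal_rank(ranked, truth)    # (1 + 1/2 + 0) / 3
far = false_accept_rate([None, "a", None, None])  # 1 of 4 accepted
```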


3. Scenario diagram

flowchart LR
    Q[Queries] --> Q1[clean]
    Q --> Q2[augmented]
    Q --> Q3[humming_like]
    Q --> Q4[confused]
    Q --> Q5[noisy/compressed]


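4. Explanatory notes


4.1 Why hard cases get a separate report

Overall top1 easily masks failures on humming and confusable queries, and those are exactly the scenarios users are most sensitive to.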
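A small worked example of the masking effect, assuming per-query results are available as `(bucket, is_correct)` pairs. The counts are invented purely for illustration and are not measured results; `top1_by_bucket` is a hypothetical helper.

```python
from collections import defaultdict


def top1_by_bucket(results):
    """results: iterable of (bucket_name, is_correct). Returns {bucket_name: top1}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for bucket, ok in results:
        totals[bucket] += 1
        hits[bucket] += int(ok)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Invented query mix: 900 clean queries and 100 humming_like queries.
results = (
    [("clean", True)] * 882 + [("clean", False)] * 18
    + [("humming_like", True)] * 40 + [("humming_like", False)] * 60
)

overall_top1 = sum(ok for _, ok in results) / len(results)  # 922 / 1000
per_bucket = top1_by_bucket(results)  # clean: 882/900, humming_like: 40/100
```

With this mix the overall figure is 0.922 while humming_like sits at 0.40, which is why the bucket-level numbers need their own report.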


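4.2 Why include operational metrics

An industrial system is not an offline competition model; it has to account for serving latency and the cost of incremental indexing.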
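One way to obtain the p50/p95 latency figures is to time each query and take nearest-rank percentiles. The sketch below uses only the standard library; `measure_latencies_ms` and `percentile` are hypothetical helpers, and the sample latencies are made up.

```python
import math
import time


def measure_latencies_ms(handler, queries):
    """Call handler(query) once per query and return wall-clock latencies in ms."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        handler(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies (ms) with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 90, 16, 12, 13, 250]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Here p50 is 13 ms and p95 is 250 ms, which shows how the tail can be far worse than the median even when most requests are fast.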


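4.3 Why put compliance into the benchmark

For a commercial system, if the source of the training or evaluation data cannot be traced, no level of accuracy makes it safe to ship.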
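License coverage could be computed from a per-track manifest, as sketched below. The manifest schema (`track_id`, `license`, `source`) is an assumption made for this example; the document does not define one.

```python
REQUIRED_FIELDS = ("license", "source")  # assumed provenance fields


def license_coverage(manifest):
    """Return (coverage, missing_ids): coverage is the share of tracks whose
    required provenance fields are all present and non-empty."""
    missing_ids = [
        entry["track_id"]
        for entry in manifest
        if not all(entry.get(field) for field in REQUIRED_FIELDS)
    ]
    coverage = 1.0 - len(missing_ids) / len(manifest)
    return coverage, missing_ids


# Hypothetical manifest: t3 has no license field, t4 has an empty source.
manifest = [
    {"track_id": "t1", "license": "CC-BY-4.0", "source": "fma_small"},
    {"track_id": "t2", "license": "CC0-1.0", "source": "fma_small"},
    {"track_id": "t3", "source": "fma_small"},
    {"track_id": "t4", "license": "CC-BY-4.0", "source": ""},
]
coverage, missing_ids = license_coverage(manifest)
```

Listing the offending IDs alongside the ratio makes the gap actionable rather than just a number.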


5. Detailed appendix

Recommended release gate:


clean top1 >= 0.95
noisy top1 >= 0.85
confused top1 >= 0.70
humming_like top1 >= 0.60
top5 >= 0.95 on production-relevant buckets
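The gate above can be checked mechanically. The following is a minimal sketch: the thresholds are the ones listed, while the function name, the metric dictionaries, and the `production_buckets` argument are assumptions (the document does not say which buckets count as production-relevant).

```python
TOP1_GATES = {"clean": 0.95, "noisy": 0.85, "confused": 0.70, "humming_like": 0.60}
TOP5_GATE = 0.95


def check_release_gate(top1, top5, production_buckets):
    """Return a list of human-readable failures; an empty list means the gate passes.

    top1 / top5 map bucket name -> metric value. A missing metric counts as a failure.
    """
    failures = []
    for bucket, threshold in TOP1_GATES.items():
        value = top1.get(bucket)
        if value is None or value < threshold:
            failures.append(f"{bucket} top1 {value} < {threshold}")
    for bucket in production_buckets:
        value = top5.get(bucket)
        if value is None or value < TOP5_GATE:
            failures.append(f"{bucket} top5 {value} < {TOP5_GATE}")
    return failures


# Hypothetical metric values for illustration only.
top1 = {"clean": 0.97, "noisy": 0.88, "confused": 0.72, "humming_like": 0.55}
top5 = {"clean": 0.99, "noisy": 0.96}
failures = check_release_gate(top1, top5, production_buckets=["clean", "noisy"])
```

In this example only humming_like falls short, so `failures` holds a single entry and the release would be blocked.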


Sources


See references-and-sources.md for the current source map.


6. Bucket / Style-aware baseline

The repository now includes a runnable baseline script:


../acr-engine/scripts/ab_smoke_bucketed.py


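Purpose:


Split the data into several small subsets according to a bucket config file
Run the existing ab_smoke_segmentation.py on each bucket separately
Output the per-bucket winner and the aggregate mean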

Recommended minimal config file format:

{
  "buckets": [
    {"name": "prefix_000_a", "patterns": ["fma_small/000/00000?.mp3"], "subset_size": 4},
    {"name": "prefix_000_b", "patterns": ["fma_small/000/00014?.mp3"], "subset_size": 4}
  ]
}
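To make the config semantics concrete, here is a sketch of how such a file could be resolved into per-bucket file lists. It assumes that `patterns` are shell-style globs matched against paths relative to the input directory and that `subset_size` caps the number of files taken in sorted order; the actual script may interpret both differently. The file listing is hypothetical.

```python
import json
from fnmatch import fnmatchcase

CONFIG_TEXT = """
{
  "buckets": [
    {"name": "prefix_000_a", "patterns": ["fma_small/000/00000?.mp3"], "subset_size": 4},
    {"name": "prefix_000_b", "patterns": ["fma_small/000/00014?.mp3"], "subset_size": 4}
  ]
}
"""


def select_bucket_files(config, relative_paths, default_subset_size=4):
    """Return {bucket_name: [paths]} using glob matching and a size cap (assumed semantics)."""
    selected = {}
    for bucket in config["buckets"]:
        matches = sorted(
            path for path in relative_paths
            if any(fnmatchcase(path, pattern) for pattern in bucket["patterns"])
        )
        size = bucket.get("subset_size", default_subset_size)
        selected[bucket["name"]] = matches[:size]
    return selected


config = json.loads(CONFIG_TEXT)
# Hypothetical listing of files under the input directory.
paths = [f"fma_small/000/{n:06d}.mp3" for n in (2, 5, 10, 140, 141, 148, 190)]
selected = select_bucket_files(config, paths)
```

With this listing, `prefix_000_a` picks up the two files matching `00000?` and `prefix_000_b` the three matching `00014?`; `000010.mp3` and `000190.mp3` fall into neither bucket.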


Recommended command:

/usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_bucketed.py \
  --dataset fma \
  --input-dir data/raw/fma_small_audio \
  --bucket-config /tmp/cap64_bucket_test.json \
  --work-root /tmp/ab_smoke_bucketed_smoke \
  --default-subset-size 4 \
  --query-duration 8 \
  --train-epochs 1 \
  --batch-size 2 \
  --device cpu \
  --strategies high_energy hybrid \
  --max-test-queries 4 \
  --seed 42 \
  --output-json /tmp/ab_smoke_bucketed_smoke/report.json