Commit f04a314e f04a314e7e9cb2f11def143390a13ea273036667 by cnb.bofCdSsphPA

Benchmark segmentation strategies on a real FMA mini-smoke set

Constraint: Strategy comparisons need real-audio evidence, but the benchmark must stay cheap enough to run repeatedly on CPU during active development
Rejected: Judge winners only by top1/topk on a tiny subset | ties hide the practical value of strategies that generate far more usable queries
Confidence: medium
Scope-risk: narrow
Directive: Keep num_queries as a tie-breaker for tiny-smoke comparisons; increase subset size before promoting benchmark winners to default training policy
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_segmentation.py --dataset fma --input-dir acr-engine/data/raw/fma_small_audio --work-root /tmp/ab_smoke_seg --subset-size 8 --query-duration 8 --train-epochs 1 --batch-size 2 --device cpu --output-json /tmp/ab_smoke_seg/report.json; post-run ranking verification from /tmp/ab_smoke_seg/report.json
Not-tested: Larger FMA subsets or difficult internal query mixes in the same benchmark script
1 parent 8ed3e34e
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import shutil
import subprocess
from pathlib import Path

PYTHON = "/usr/local/miniconda3/bin/python"

DEFAULT_STRATEGIES = [
    "random",
    "silence_aware",
    "high_energy",
    "beat_aware",
    "repeated_section_aware",
    "hybrid",
]


def run(cmd: list[str], cwd: Path) -> str:
    return subprocess.check_output(cmd, cwd=str(cwd), text=True)


def parse_last_json(text: str) -> dict:
    for start in range(len(text) - 1, -1, -1):
        if text[start] != "{":
            continue
        try:
            return json.loads(text[start:])
        except json.JSONDecodeError:
            continue
    raise ValueError("No JSON object found in command output")


def prepare_subset(src_dir: Path, subset_dir: Path, limit: int) -> dict:
    files = sorted(src_dir.rglob("*.mp3"))[:limit]
    subset_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in files:
        rel = src.relative_to(src_dir)
        dst = subset_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if not dst.exists():
            shutil.copy2(src, dst)
        copied.append(str(dst))
    return {
        "source_dir": str(src_dir),
        "subset_dir": str(subset_dir),
        "num_files": len(copied),
        "sample_files": copied[:5],
    }


def train_strategy_for_query(strategy: str) -> str:
    if strategy == "sliding":
        return "random"
    return strategy


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="fma")
    parser.add_argument("--input-dir", default="data/raw/fma_small_audio")
    parser.add_argument("--work-root", default="data/ab_smoke_segmentation")
    parser.add_argument("--subset-size", type=int, default=12)
    parser.add_argument("--query-duration", type=float, default=8.0)
    parser.add_argument("--query-stride", type=float, default=None)
    parser.add_argument("--train-epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=2)
    parser.add_argument("--device", default="cpu")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--strategies", nargs="*", default=DEFAULT_STRATEGIES)
    parser.add_argument("--output-json", default=None)
    args = parser.parse_args()

    repo = Path(__file__).resolve().parents[1]
    input_dir = (repo / args.input_dir).resolve()
    work_root = (repo / args.work_root).resolve()
    subset_dir = work_root / "subset_audio"
    subset_info = prepare_subset(input_dir, subset_dir, args.subset_size)

    results = []
    for strategy in args.strategies:
        smoke_root = work_root / strategy
        if smoke_root.exists():
            shutil.rmtree(smoke_root)
        smoke_root.mkdir(parents=True, exist_ok=True)
        cmd = [
            PYTHON,
            "src/data/external_adapters.py",
            "smoke-local",
            args.dataset,
            str(subset_dir),
            "--output-root",
            str(smoke_root),
            "--eval-ratio",
            "0.2",
            "--query-duration",
            str(args.query_duration),
            "--query-strategy",
            strategy,
            "--segment-strategy",
            train_strategy_for_query(strategy),
            "--train-epochs",
            str(args.train_epochs),
            "--batch-size",
            str(args.batch_size),
            "--device",
            args.device,
            "--seed",
            str(args.seed),
        ]
        if args.query_stride is not None:
            cmd.extend(["--query-stride", str(args.query_stride)])
        output = run(cmd, cwd=repo)
        summary = parse_last_json(output)
        eval_json = Path(summary["eval_json"])
        eval_report = json.loads(eval_json.read_text())
        results.append({
            "strategy": strategy,
            "train_segment_strategy": train_strategy_for_query(strategy),
            "num_queries": eval_report["num_queries"],
            "top1": eval_report["top1"],
            "topk": eval_report["topk"],
            "eval_json": str(eval_json),
            "report_dir": summary["report_dir"],
            "sample_failures": eval_report.get("sample_failures", [])[:3],
        })

    results.sort(key=lambda x: (x["top1"], x["topk"], x["num_queries"]), reverse=True)
    report = {
        "dataset": args.dataset,
        "subset": subset_info,
        "query_duration": args.query_duration,
        "query_stride": args.query_stride,
        "train_epochs": args.train_epochs,
        "batch_size": args.batch_size,
        "device": args.device,
        "strategies": results,
        "winner": results[0] if results else None,
    }
    text = json.dumps(report, ensure_ascii=False, indent=2)
    if args.output_json:
        out = Path(args.output_json)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text)
    print(text)


if __name__ == "__main__":
    main()
......@@ -5675,3 +5675,50 @@
- Stronger follow-ups to pursue next:
  - chorus-like multi-feature ranking
  - small-scale A/B comparison of strategies on real data
### Stage: real FMA mini-subset segmentation A/B smoke benchmark
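Completed:
- New script:
  - `acr-engine/scripts/ab_smoke_segmentation.py`
- Capabilities:
  - Extracts a fixed-size subset from a local directory of real audio
  - Runs `smoke-local` once per strategy, in sequence
  - Automatically compares the smoke results across segmentation strategies
  - Summarizes `top1 / topk / num_queries`
- Ranking fix:
  - No longer ranks by `top1/topk` alone
  - Now ranks by `top1 -> topk -> num_queries`
  - When scores tie, this avoids misjudging a strategy with fewer queries as the winner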
Validation results:
- Real data source:
  - `data/raw/fma_small_audio`
- Smoke subset:
  - `8` FMA tracks
  - `query_duration=8`
  - `train_epochs=1`
  - `batch_size=2`
- Strategies compared:
  - `random`
  - `silence_aware`
  - `high_energy`
  - `beat_aware`
  - `repeated_section_aware`
  - `hybrid`
- Report path:
  - `/tmp/ab_smoke_seg/report.json`
- Results after the ranking fix:
  1. `hybrid`: `num_queries=37`, `top1=1.0`, `topk=1.0`
  2. `beat_aware`: `num_queries=13`, `top1=1.0`, `topk=1.0`
  3. `high_energy`: `num_queries=12`, `top1=1.0`, `topk=1.0`
  4. `repeated_section_aware`: `num_queries=12`, `top1=1.0`, `topk=1.0`
  5. `random`: `num_queries=4`, `top1=1.0`, `topk=1.0`
  6. `silence_aware`: `num_queries=2`, `top1=1.0`, `topk=1.0`
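
The ordering above can be reproduced offline from the reported numbers. The sketch below is illustrative only: the function names `rank_with_tiebreak` and `rank_scores_only` are hypothetical, and the "scores only" variant is an assumption about what ranking by `top1/topk` alone would look like, not a copy of the earlier code. It relies on Python's sort being stable, so entries with equal keys keep their run order.

```python
# Numbers copied from the smoke report, listed in the order the strategies ran.
results = [
    {"strategy": "random", "top1": 1.0, "topk": 1.0, "num_queries": 4},
    {"strategy": "silence_aware", "top1": 1.0, "topk": 1.0, "num_queries": 2},
    {"strategy": "high_energy", "top1": 1.0, "topk": 1.0, "num_queries": 12},
    {"strategy": "beat_aware", "top1": 1.0, "topk": 1.0, "num_queries": 13},
    {"strategy": "repeated_section_aware", "top1": 1.0, "topk": 1.0, "num_queries": 12},
    {"strategy": "hybrid", "top1": 1.0, "topk": 1.0, "num_queries": 37},
]


def rank_with_tiebreak(entries):
    """Rank by top1, then topk, then num_queries (all descending)."""
    return sorted(
        entries,
        key=lambda r: (r["top1"], r["topk"], r["num_queries"]),
        reverse=True,
    )


def rank_scores_only(entries):
    """Assumed older rule: rank by top1 and topk only (hypothetical)."""
    return sorted(entries, key=lambda r: (r["top1"], r["topk"]), reverse=True)


ranked = rank_with_tiebreak(results)
for position, entry in enumerate(ranked, start=1):
    print(position, entry["strategy"], entry["num_queries"])

# With every score tied, the scores-only rule just keeps the run order,
# so the first strategy that ran would be reported as the winner.
print("scores-only winner:", rank_scores_only(results)[0]["strategy"])
```

Because `high_energy` and `repeated_section_aware` tie on all three keys, their relative order is decided only by the order in which they ran.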
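Conclusions:
- On this very small real-data smoke subset, every strategy reaches `top1/top5 = 1.0`
- Judged by **query coverage**, however:
  - `hybrid` is currently the best
  - `beat_aware / high_energy / repeated_section_aware` are the stronger runner-up candidates
- Next, the real-data subset should be enlarged and harder query types introduced, to widen the differences between strategies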
......
......@@ -86,6 +86,38 @@ flowchart LR
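- `smoke-local` now also enables `--resume` for `build-index` internally by default
- The checkpoint records a `model_signature`
- If the `best_model.pt` produced by the current training run is not the same model as the one behind an old partial checkpoint, the resume is rejected automatically and the index is rebuilt from scratch, so embeddings from different models are never mixed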
## Small-scale strategy A/B smoke
To quickly compare different query / training segmentation strategies, run:
```bash
/usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_segmentation.py \
--dataset fma \
--input-dir acr-engine/data/raw/fma_small_audio \
--work-root /tmp/ab_smoke_seg \
--subset-size 8 \
--query-duration 8 \
--train-epochs 1 \
--batch-size 2 \
--device cpu \
--output-json /tmp/ab_smoke_seg/report.json
```
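The script currently compares:
- `random`
- `silence_aware`
- `high_energy`
- `beat_aware`
- `repeated_section_aware`
- `hybrid`

Ranking rule:
- First by `top1`
- Then by `topk`
- Finally by `num_queries`

When top1/top5 are tied, this ranks the strategy that **covers more queries** first, rather than mistakenly putting a strategy with fewer queries at the top.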
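
After a run, the ranking in the written report can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `check_report` is hypothetical, and it assumes only the `strategies` and `winner` keys that `ab_smoke_segmentation.py` writes, plus the report path used in the command above.

```python
from __future__ import annotations

import json
from pathlib import Path


def ranking_key(entry: dict) -> tuple:
    """Key matching the documented rule: top1, then topk, then num_queries."""
    return (entry["top1"], entry["topk"], entry["num_queries"])


def check_report(report: dict) -> list[str]:
    """Return a list of problems found in the report; empty means consistent."""
    problems = []
    strategies = report.get("strategies", [])
    keys = [ranking_key(entry) for entry in strategies]
    if keys != sorted(keys, reverse=True):
        problems.append("strategies are not in descending ranking order")
    expected_winner = strategies[0] if strategies else None
    if report.get("winner") != expected_winner:
        problems.append("winner does not match the first ranked strategy")
    return problems


if __name__ == "__main__":
    # Path taken from the example command; adjust if --output-json differs.
    report_path = Path("/tmp/ab_smoke_seg/report.json")
    if report_path.exists():
        issues = check_report(json.loads(report_path.read_text()))
        print("OK" if not issues else "\n".join(issues))
    else:
        print(f"No report found at {report_path}")
```

Returning a list of problems instead of raising keeps the check usable both interactively and inside a larger verification script.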
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
```
......