Commit bd66c06b bd66c06bd7512295f9d9510ddb3ae45a150685c0 by cnb.bofCdSsphPA

Add voice chunking and match-context foundations for ACR service

Constraint: keep humming/recording query support lightweight and compatible with the existing FAISS-first local workflow while production retrieval remains pgvector-oriented
Rejected: delaying service-path scaffolding until full production retrieval is ready | would block validation of voice-to-chunk and context export behavior
Confidence: high
Scope-risk: moderate
Directive: keep  semantics song_id-first and treat resource paths only as supporting evidence/context artifacts
Tested: /usr/local/miniconda3/bin/python -m unittest discover -s acr-engine/tests -v
Not-tested: live FastAPI smoke until uvicorn is available in the current interpreter environment
1 parent 69843933
......@@ -123,3 +123,29 @@ cd acr-engine
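- Hybrid scores are normalized before fusion
- full-demo automatic training
- Open-source datasets can be plugged in later
## Humming / recording recognition endpoint (voice -> chunk -> song_id)
Two foundational pieces are now in place:
- `src/data/voice_chunker.py`: splits raw voice / humming audio into retrievable chunks
- `src/utils/context_exporter.py`: exports the matched reference window as a context clip (10 s by default)
Target FastAPI endpoint:
- `POST /recognize/voice`
Input:
- An externally uploaded voice / recording file
Output:
- `song_id`
- `reference_audio_path`
- `best_chunk`
- `context_clip`
- `chunk_results`
Notes:
- The endpoint code is already wired into `src/service/app.py`.
- The current environment still lacks `uvicorn`, so the service smoke test can only run after that runtime dependency is installed.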
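One way the `song_id`-first semantics could be applied to `chunk_results` is to sum per-chunk scores and return the best-scoring song. The sketch below is only an illustration: the `song_id` and `score` field names inside each chunk result, the helper name `aggregate_song_id`, and the sum-of-scores rule are all assumptions, not taken from `src/service/app.py`.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def aggregate_song_id(chunk_results: List[Dict]) -> Optional[str]:
    """Pick the song_id with the highest summed score across chunks.

    Assumes (hypothetically) that each chunk result looks like
    {'song_id': str, 'score': float}.
    """
    totals: Dict[str, float] = defaultdict(float)
    for item in chunk_results:
        totals[item['song_id']] += float(item['score'])
    if not totals:
        return None
    # max() keeps the first maximum it sees, so ties resolve to the
    # song_id that appeared first in chunk_results.
    return max(totals.items(), key=lambda kv: kv[1])[0]


example = [
    {'song_id': 'song_a', 'score': 0.71},
    {'song_id': 'song_b', 'score': 0.80},
    {'song_id': 'song_a', 'score': 0.65},
]
print(aggregate_song_id(example))
```

In this made-up input, `song_a` wins because two moderately confident chunks outweigh one stronger chunk for `song_b`; resource paths such as `reference_audio_path` would then be attached as supporting context only.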
......
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import argparse
import json
from pathlib import Path


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument('--chunks-json', required=True)
    ap.add_argument('--song-id', required=True)
    ap.add_argument('--split', default='test')
    ap.add_argument('--output', required=True)
    ap.add_argument('--source-dataset', default='humming_real')
    args = ap.parse_args()

    payload = json.loads(Path(args.chunks_json).read_text(encoding='utf-8'))
    rows = []
    for chunk in payload.get('chunks', []):
        rows.append({
            'song_id': args.song_id,
            'audio_path': chunk['audio_path'],
            'duration': chunk['duration_sec'],
            'type': 'humming_real',
            'segment_type': 'humming_query',
            'offset': chunk['start_sec'],
            'source_dataset': args.source_dataset,
            'split': args.split,
        })

    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding='utf-8')
    print(json.dumps({'rows': len(rows), 'output': str(out)}, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
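For orientation, the per-chunk mapping this script applies can be replayed on a single hand-written record. The sample values below are made up for illustration, and `chunk_to_row` is a hypothetical helper name; the field names follow the script above and the chunk records written by `voice_chunker.py`.

```python
import json
from typing import Dict


def chunk_to_row(chunk: Dict, song_id: str, split: str = 'test',
                 source_dataset: str = 'humming_real') -> Dict:
    """Mirror the manifest row built for each chunk in the script."""
    return {
        'song_id': song_id,
        'audio_path': chunk['audio_path'],
        'duration': chunk['duration_sec'],
        'type': 'humming_real',
        'segment_type': 'humming_query',
        'offset': chunk['start_sec'],
        'source_dataset': source_dataset,
        'split': split,
    }


# Hypothetical chunk record in the shape produced by voice_chunker.py.
sample_chunk = {
    'chunk_id': 'chunk_000',
    'audio_path': 'chunks/chunk_000.wav',
    'start_sec': 1.0,
    'end_sec': 9.0,
    'duration_sec': 8.0,
    'padded': False,
    'source_audio_path': 'hum.wav',
}
row = chunk_to_row(sample_chunk, song_id='song_0001')
print(json.dumps(row, ensure_ascii=False, indent=2))
```

Note that `type` stays fixed at `humming_real` even when `--source-dataset` is overridden, which matches the script as committed.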
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import json
import subprocess
import time
from pathlib import Path
from urllib.request import Request, urlopen

BASE = 'http://127.0.0.1:8000'


def post_multipart(url: str, file_path: Path):
    boundary = '----acrboundary'
    data = file_path.read_bytes()
    body = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"\r\n'
        f'Content-Type: audio/wav\r\n\r\n'
    ).encode('utf-8') + data + f'\r\n--{boundary}--\r\n'.encode('utf-8')
    req = Request(url, data=body, method='POST')
    req.add_header('Content-Type', f'multipart/form-data; boundary={boundary}')
    with urlopen(req, timeout=20) as resp:
        return json.loads(resp.read().decode('utf-8'))


def main():
    cmd = [
        '/usr/local/miniconda3/bin/python', '-m', 'uvicorn', 'src.service.app:app', '--host', '127.0.0.1', '--port', '8000'
    ]
    proc = subprocess.Popen(cmd, cwd='/root/vprecog/acr-engine', stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    query = Path('/workspace/downloads/111/type_7/75cd601b-7604-4b37-8132-cfab39e7c644.mp3')
    try:
        for _ in range(20):
            time.sleep(0.5)
            try:
                result = post_multipart(BASE + '/recognize/voice', query)
                print(json.dumps({
                    'status': 'ok',
                    'chunk_count': result.get('chunk_count'),
                    'top_song_id': result.get('candidates', [{}])[0].get('song_id') if result.get('candidates') else None,
                    'has_context': bool(result.get('candidates', [{}])[0].get('context_clip')) if result.get('candidates') else False,
                }, ensure_ascii=False, indent=2))
                return
            except Exception:
                continue
        raise SystemExit('service voice smoke failed: service not ready or endpoint failed')
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait(timeout=5)


if __name__ == '__main__':
    main()
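The smoke script labels every upload as `audio/wav`, although the sample query is an `.mp3`. A hedged variant that guesses the part's content type with the standard library's `mimetypes` module is sketched below. `build_multipart_body` is a hypothetical helper name, and whether the service inspects the part's content type at all is not something this diff shows.

```python
import mimetypes
from typing import Tuple


def build_multipart_body(filename: str, data: bytes,
                         boundary: str = '----acrboundary') -> Tuple[bytes, str]:
    """Return (body, Content-Type header value) for a single-file upload."""
    # Fall back to a generic binary type when the extension is unknown.
    content_type = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
    head = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'
    ).encode('utf-8')
    tail = f'\r\n--{boundary}--\r\n'.encode('utf-8')
    return head + data + tail, f'multipart/form-data; boundary={boundary}'


body, header = build_multipart_body('query.mp3', b'\x00\x01')
print(header)
print(body[:120])
```

The returned header value would be passed to `req.add_header('Content-Type', ...)` in the same way the script already does.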
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import argparse
import json
from pathlib import Path
from typing import List, Dict

import librosa
import numpy as np
import soundfile as sf


def normalize_audio(audio_path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(audio_path, sr=sr, mono=True)
    return y.astype(np.float32)


def detect_voiced_intervals(y: np.ndarray, sr: int, top_db: int = 30, min_voiced_sec: float = 2.0) -> List[tuple[int, int]]:
    intervals = librosa.effects.split(y, top_db=top_db)
    min_len = int(sr * min_voiced_sec)
    kept = []
    for start, end in intervals:
        if end - start >= min_len:
            kept.append((int(start), int(end)))
    return kept


def chunk_intervals(intervals: List[tuple[int, int]], sr: int, target_chunk_sec: float = 8.0, stride_sec: float = 4.0) -> List[tuple[int, int, bool]]:
    chunk_len = int(sr * target_chunk_sec)
    stride = int(sr * stride_sec)
    chunks: List[tuple[int, int, bool]] = []
    for start, end in intervals:
        seg_len = end - start
        if seg_len < chunk_len:
            chunks.append((start, end, True))
            continue
        pos = start
        while pos + chunk_len <= end:
            chunks.append((pos, pos + chunk_len, False))
            pos += stride
        if pos < end and end - pos >= int(sr * 2.0):
            tail_start = max(start, end - chunk_len)
            chunks.append((tail_start, end, end - tail_start < chunk_len))
    deduped = []
    seen = set()
    for item in chunks:
        key = (item[0], item[1])
        if key not in seen:
            deduped.append(item)
            seen.add(key)
    return deduped


def write_chunks(y: np.ndarray, sr: int, chunks: List[tuple[int, int, bool]], output_dir: str, source_audio_path: str) -> List[Dict]:
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_len = None
    results = []
    for idx, (start, end, padded) in enumerate(chunks):
        clip = y[start:end]
        if chunk_len is None:
            chunk_len = max(len(clip), 1)
        target_len = max(chunk_len, len(clip))
        if padded and len(clip) < target_len:
            clip = np.pad(clip, (0, target_len - len(clip)))
        chunk_path = out_dir / f'chunk_{idx:03d}.wav'
        sf.write(str(chunk_path), clip, sr)
        results.append({
            'chunk_id': f'chunk_{idx:03d}',
            'audio_path': str(chunk_path),
            'start_sec': round(start / sr, 4),
            'end_sec': round(end / sr, 4),
            'duration_sec': round(len(clip) / sr, 4),
            'padded': padded,
            'source_audio_path': source_audio_path,
        })
    return results


def voice_to_chunks(audio_path: str, output_dir: str, target_chunk_sec: float = 8.0, stride_sec: float = 4.0, min_voiced_sec: float = 2.0, top_db: int = 30, sr: int = 16000) -> List[Dict]:
    y = normalize_audio(audio_path, sr=sr)
    intervals = detect_voiced_intervals(y, sr=sr, top_db=top_db, min_voiced_sec=min_voiced_sec)
    chunks = chunk_intervals(intervals, sr=sr, target_chunk_sec=target_chunk_sec, stride_sec=stride_sec)
    return write_chunks(y, sr, chunks, output_dir, source_audio_path=audio_path)


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument('--input', required=True)
    ap.add_argument('--output-dir', required=True)
    ap.add_argument('--target-chunk-sec', type=float, default=8.0)
    ap.add_argument('--stride-sec', type=float, default=4.0)
    ap.add_argument('--min-voiced-sec', type=float, default=2.0)
    ap.add_argument('--top-db', type=int, default=30)
    ap.add_argument('--sr', type=int, default=16000)
    ap.add_argument('--output-json', default='chunks.json')
    args = ap.parse_args()

    chunks = voice_to_chunks(
        audio_path=args.input,
        output_dir=args.output_dir,
        target_chunk_sec=args.target_chunk_sec,
        stride_sec=args.stride_sec,
        min_voiced_sec=args.min_voiced_sec,
        top_db=args.top_db,
        sr=args.sr,
    )
    out_json = Path(args.output_dir) / args.output_json
    out_json.write_text(json.dumps({'chunks': chunks}, ensure_ascii=False, indent=2), encoding='utf-8')
    print(json.dumps({'chunks': chunks}, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
from __future__ import annotations

import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Dict, Tuple

import librosa
import numpy as np
import soundfile as sf


def load_audio(audio_path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(audio_path, sr=sr, mono=True)
    return y.astype(np.float32)


def chroma_embedding(y: np.ndarray, sr: int) -> np.ndarray:
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=12)
    feat = np.concatenate([chroma.mean(axis=1), chroma.std(axis=1)], axis=0).astype(np.float32)
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat


def find_best_matching_window(
    query_audio_path: str,
    reference_audio_path: str,
    sr: int = 16000,
    stride_sec: float = 1.0,
) -> Dict:
    query_y = load_audio(query_audio_path, sr=sr)
    ref_y = load_audio(reference_audio_path, sr=sr)
    query_len = len(query_y)
    if query_len == 0:
        raise ValueError('Empty query audio')
    if len(ref_y) < query_len:
        ref_y = np.pad(ref_y, (0, query_len - len(ref_y)))
    query_feat = chroma_embedding(query_y, sr)
    stride = max(1, int(sr * stride_sec))
    best_score = -1.0
    best_start = 0
    for start in range(0, max(len(ref_y) - query_len + 1, 1), stride):
        window = ref_y[start:start + query_len]
        if len(window) < query_len:
            window = np.pad(window, (0, query_len - len(window)))
        score = float(np.dot(query_feat, chroma_embedding(window, sr)))
        if score > best_score:
            best_score = score
            best_start = start
    return {
        'window_start_sec': round(best_start / sr, 4),
        'window_end_sec': round((best_start + query_len) / sr, 4),
        'window_score': round(best_score, 6),
        'query_duration_sec': round(query_len / sr, 4),
    }


def export_match_context(
    audio_path: str,
    window_start_sec: float,
    window_end_sec: float,
    output_path: str,
    context_sec: float = 10.0,
    output_format: str = 'mp3',
    sr: int = 16000,
) -> Dict:
    y = load_audio(audio_path, sr=sr)
    center = (window_start_sec + window_end_sec) / 2.0
    half = context_sec / 2.0
    clip_start_sec = max(0.0, center - half)
    clip_end_sec = min(len(y) / sr, center + half)
    start = int(clip_start_sec * sr)
    end = max(start + 1, int(clip_end_sec * sr))
    clip = y[start:end]
    output = Path(output_path)
    output.parent.mkdir(parents=True, exist_ok=True)
    actual_format = output_format
    if output_format == 'mp3' and shutil.which('ffmpeg'):
        with tempfile.TemporaryDirectory() as tmp:
            wav_path = Path(tmp) / 'context.wav'
            sf.write(wav_path, clip, sr)
            cmd = [shutil.which('ffmpeg') or 'ffmpeg', '-y', '-i', str(wav_path), str(output)]
            subprocess.run(cmd, check=True, capture_output=True)
    else:
        if output_format == 'mp3':
            actual_format = 'wav'
            output = output.with_suffix('.wav')
        sf.write(output, clip, sr)
    return {
        'source_audio_path': audio_path,
        'clip_start_sec': round(clip_start_sec, 4),
        'clip_end_sec': round(clip_end_sec, 4),
        'duration_sec': round((end - start) / sr, 4),
        'output_path': str(output),
        'output_format': actual_format,
    }
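The clip bounds in `export_match_context` are centered on the matched window and clamped to the audio length. A small sketch of just that arithmetic, with no audio I/O, is given below; `context_bounds` is a hypothetical helper introduced for illustration and does not exist in the module.

```python
from typing import Tuple


def context_bounds(window_start_sec: float, window_end_sec: float,
                   audio_duration_sec: float, context_sec: float = 10.0) -> Tuple[float, float]:
    """Center a context_sec-long clip on the window, clamped to the audio."""
    center = (window_start_sec + window_end_sec) / 2.0
    half = context_sec / 2.0
    clip_start = max(0.0, center - half)
    clip_end = min(audio_duration_sec, center + half)
    return clip_start, clip_end


# Window in the middle of a 12 s reference: full 10 s of context.
print(context_bounds(4.0, 7.0, 12.0))
# Window at the very start: the clip is cut short, not shifted.
print(context_bounds(0.0, 3.0, 12.0))
```

The second call shows a consequence of clamping: near the edges of the reference the exported clip is shorter than `context_sec` (6.5 s here) rather than being shifted to keep the full length.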
from pathlib import Path
import sys

ROOT = Path(__file__).resolve().parents[1]
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

import tempfile
import unittest
from pathlib import Path

import test_bootstrap
import numpy as np
import soundfile as sf

from src.utils.context_exporter import export_match_context, find_best_matching_window


class ContextExporterTests(unittest.TestCase):
    def test_find_best_matching_window_returns_valid_range(self):
        sr = 16000
        with tempfile.TemporaryDirectory() as tmp:
            query = Path(tmp) / 'query.wav'
            ref = Path(tmp) / 'ref.wav'
            tone = 0.2 * np.sin(2 * np.pi * 440 * np.linspace(0, 3, sr * 3, endpoint=False)).astype(np.float32)
            ref_y = np.concatenate([np.zeros(sr), tone, np.zeros(sr)]).astype(np.float32)
            sf.write(query, tone, sr)
            sf.write(ref, ref_y, sr)
            match = find_best_matching_window(str(query), str(ref), sr=sr, stride_sec=0.5)
            self.assertGreaterEqual(match['window_start_sec'], 0.0)
            self.assertGreater(match['window_end_sec'], match['window_start_sec'])

    def test_export_match_context_writes_audio(self):
        sr = 16000
        with tempfile.TemporaryDirectory() as tmp:
            ref = Path(tmp) / 'ref.wav'
            out = Path(tmp) / 'context.wav'
            y = 0.2 * np.sin(2 * np.pi * 440 * np.linspace(0, 12, sr * 12, endpoint=False)).astype(np.float32)
            sf.write(ref, y, sr)
            info = export_match_context(str(ref), 4.0, 7.0, str(out), context_sec=10.0, output_format='wav', sr=sr)
            self.assertTrue(Path(info['output_path']).exists())
            self.assertEqual(info['output_format'], 'wav')


if __name__ == '__main__':
    unittest.main()
......@@ -2,6 +2,8 @@ import tempfile
import unittest
from pathlib import Path
import test_bootstrap
from scripts.local_music20_acr import collect_pairs, first_file
......
import tempfile
import unittest
from pathlib import Path

import test_bootstrap
import numpy as np
import soundfile as sf

from src.data.voice_chunker import detect_voiced_intervals, chunk_intervals, voice_to_chunks


class VoiceChunkerTests(unittest.TestCase):
    def test_detect_voiced_intervals_filters_short_segments(self):
        sr = 16000
        y = np.concatenate([
            np.zeros(sr),
            0.2 * np.sin(2 * np.pi * 440 * np.linspace(0, 3, sr * 3, endpoint=False)),
            np.zeros(sr // 2),
        ]).astype(np.float32)
        intervals = detect_voiced_intervals(y, sr=sr, top_db=30, min_voiced_sec=2.0)
        self.assertEqual(len(intervals), 1)

    def test_chunk_intervals_handles_short_and_long_regions(self):
        sr = 16000
        chunks = chunk_intervals([(0, sr * 3), (sr * 5, sr * 15)], sr=sr, target_chunk_sec=8.0, stride_sec=4.0)
        self.assertTrue(any(padded for _, _, padded in chunks))
        self.assertGreaterEqual(len(chunks), 2)

    def test_voice_to_chunks_writes_chunk_files(self):
        sr = 16000
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / 'hum.wav'
            out = Path(tmp) / 'chunks'
            y = np.concatenate([
                np.zeros(sr),
                0.2 * np.sin(2 * np.pi * 330 * np.linspace(0, 4, sr * 4, endpoint=False)),
                np.zeros(sr),
            ]).astype(np.float32)
            sf.write(src, y, sr)
            chunks = voice_to_chunks(str(src), str(out), target_chunk_sec=3.0, stride_sec=2.0, min_voiced_sec=2.0, sr=sr)
            self.assertGreaterEqual(len(chunks), 1)
            self.assertTrue(Path(chunks[0]['audio_path']).exists())


if __name__ == '__main__':
    unittest.main()
## 2026-06-03 voice-to-chunk and context export foundation
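- Added `acr-engine/src/data/voice_chunker.py`, which splits voice / humming audio into chunks.
- Added `acr-engine/scripts/build_humming_eval_manifest.py`, which builds a `humming_real` evaluation manifest from chunk results.
- Added `acr-engine/src/utils/context_exporter.py`, which exports the matched reference window as a context clip.
- Extended `acr-engine/src/service/app.py` with a first draft of the `POST /recognize/voice` endpoint.
- The docs entry point `docs/README.md` has been trimmed down to the latest architecture and the shortest reading order.
Fresh evidence:
- `/usr/local/miniconda3/bin/python -m unittest discover -s acr-engine/tests -v` => `Ran 7 tests, OK`
- The current environment lacks `uvicorn`, so the service smoke test cannot be started directly yet; the runtime dependency has to be installed first.
## 2026-06-03 20-song local ACR workflow in acr-engine
- Added `acr-engine/scripts/local_music20_acr.py`, which provides a local 20-song small-sample ACR workflow inside `acr-engine`, based on `/workspace/downloads`.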
......
# ACR Docs Overview
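> Updated: 2026-06-02
> Keeps only the latest architecture and the shortest path to getting started. Historical details remain in the repo, but the default reading list is limited to the 6 main documents below.
## One-page summary
## Shortest reading order
There are currently too many documentation entry points; they are now condensed into **5 groups of main documents**
1. [session-handoff.md](./session-handoff.md)
2. [CHANGELOG.md](./CHANGELOG.md)
3. [acr-architecture.md](./acr-architecture.md)
4. [dataset-spec.md](./dataset-spec.md)
5. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
6. [runbook.md](./runbook.md)
1. **Project and architecture**
2. **Data and evaluation**
3. **Business data onboarding**
4. **Service and engineering**
5. **Research and roadmap**
## Currently recommended: read only these categories
Start with just these 5 groups; there is no need to read every detailed document in one pass.
### 1. Project architecture
- [acr-architecture.md](./acr-architecture.md)
- [session-handoff.md](./session-handoff.md)
---
### 2. Data and evaluation
- [dataset-spec.md](./dataset-spec.md)
- [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- [open-dataset-workflow.md](./open-dataset-workflow.md)
## 1. Documentation navigation map
### 3. Running and serving
- [runbook.md](./runbook.md)
- [service-api.md](./service-api.md)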
```mermaid
flowchart TD
A[Docs Entry] --> B[Project Responsibility]
A --> C[Architecture]
A --> D[Dataset Spec]
A --> E[Business Export Chain]
A --> F[Service API]
A --> G[Industrial Benchmark]
A --> H[Industrialization Roadmap]
A --> I[Licensing & Sources]
A --> J[SOTA Research]
B --> C
C --> D
D --> E
E --> F
G --> H
I --> H
J --> H
```
### 4. Latest hard-case conclusions
- [acr-hard-case-analysis.md](../acr-engine/../docs/acr-hard-case-analysis.md)
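## The current architecture in one sentence
---
## 2. Condensed reading entry points
| Reader role | Read first |
|---|---|
| New team member | [Project and architecture](./project-responsibility-map.md), [System architecture](./acr-architecture.md) |
| Algorithms / models | [Dataset spec](./dataset-spec.md), [SOTA research](./sota-research-2026.md) |
| Platform / backend | [Service API](./service-api.md), [Evaluation spec](./industrial-benchmark-spec.md) |
| Data onboarding | [Open dataset workflow](./open-dataset-workflow.md), [Business export cookbook](./business-export-cookbook.md) |
| Lead / planning | [Industrialization roadmap](./industrialization-roadmap.md), [Handoff document](./session-handoff.md) |
---
## 2.5 Shortest reading order for a new session
If you are picking this up in a new session, follow this order:
1. [Ongoing development handoff document](./session-handoff.md)
2. [Changelog](./CHANGELOG.md)
3. [Business export cookbook](./business-export-cookbook.md) or [Open dataset workflow](./open-dataset-workflow.md)
How to choose:
- Onboarding your own business assets: read `business-export-cookbook.md` first
- Working with open datasets such as FMA / MTG-Jamendo: read `open-dataset-workflow.md` first
## 2.6 Shortest runnable command for a new session
If you just want to confirm that the business export chain still runs, execute: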
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/business_export_offline_smoke.py \
--output-root /tmp/business_export_offline_smoke
```
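Expected results:
- A business export sample is generated
- Manifest-ready JSONL is generated
- The project `catalog/train/test/val` is generated
- `train.py --dry-run` passes
## 3. Main document groups
### A. Project and architecture
- [Project responsibility map](./project-responsibility-map.md)
- [System architecture](./acr-architecture.md)
### B. Data and evaluation
- [Dataset spec](./dataset-spec.md)
- [Open dataset workflow](./open-dataset-workflow.md)
- [Training data and pgvector guide](./training-data-and-pgvector-guide.md)
- [Production encoder freezing and embedding strategy Q&A](./production-encoder-freeze-and-embedding-strategy.md)
- [Dataset sources and onboarding](./dataset-sources-and-licensing.md)
- [Industrial benchmark spec](./industrial-benchmark-spec.md)
Quick-start entry points:
- [Open dataset workflow](./open-dataset-workflow.md)
- [Local landing directory for open datasets](../acr-engine/data/raw/README.md)
- Verified by offline smoke test: `acr-engine/scripts/business_export_offline_smoke.py`
### C. Business data onboarding
- [Business asset types and bucket guide](./business-music-bucket-and-type-guide.md)
- [Business manifest and type-role spec](./business-manifest-and-type-role-spec.md)
- [Business export cookbook](./business-export-cookbook.md)
- [Business data to project manifest adapter](./business-project-manifest-adapter.md)
Shortest chain for business data:
1. [Business export cookbook](./business-export-cookbook.md)
2. `acr-engine/scripts/normalize_business_export.py`
3. `acr-engine/scripts/split_business_manifest_ready.py`
4. `acr-engine/scripts/build_business_project_manifests.py`
5. `acr-engine/scripts/business_export_offline_smoke.py`
### D. Service and engineering
- [Service API](./service-api.md)
- [Ongoing development handoff document](./session-handoff.md)
- [Current capability map](./current-capability-map.md)
- [First-run checklist](../acr-engine/FIRST_RUN_CHECKLIST.md)
- [Changelog](./CHANGELOG.md)
### E. Research and roadmap
- [Industrialization roadmap](./industrialization-roadmap.md)
- [SOTA research](./sota-research-2026.md)
- [Master list of references and sources](./references-and-sources.md)
---
## 4. Notes
The goal from here on is to reduce the reading cost of duplicated documents at the same level:
- Group documents on the entry page first
- Then keep 1 to 3 main documents in each group
- Put secondary details inside a group wherever possible instead of adding more files side by side
---
## 5. Detailed appendix
Suggested usage:
- To understand the project, read [Project responsibility map](./project-responsibility-map.md) + [System architecture](./acr-architecture.md) first
- To train / evaluate, read [Dataset spec](./dataset-spec.md) first
- To onboard open datasets, read [Dataset sources and onboarding](./dataset-sources-and-licensing.md) first
- To see how things evolved, read the [Changelog](./CHANGELOG.md)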
## Sources
- This file is an internal documentation navigation artifact for the current repo state.
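- `/workspace`: source of samples and assets
- `acr-engine/`: main project for training, indexing, recognition, and serving
- Local small-sample validation: prefer **FAISS**
- Production vector retrieval: standardize on **pgvector**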
......