Commit 8095eeea 8095eeeafc7fb9f4f55aac1ddb8b39b35a3953b1 by cnb.bofCdSsphPA

Prefer the repo fingerprint matcher in the real-directory song-centric pipeline

Constraint: Improve the current directory-to-feature path using components already present in the repo, without depending on unavailable heavyweight semantic runtimes.
Rejected: Keep exact-lane validation on a purely ad-hoc local hash path | It underuses the repo's existing fingerprint extraction capability and weakens evidence for the real pipeline.
Confidence: high
Scope-risk: narrow
Directive: In host-level song-centric pipeline validation, prefer ChromaprintMatcher-backed fingerprints first and use local_wavehash only as fallback.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py on the real wav smoke manifest; imported the enriched manifest into postgres://d2:d2pass@127.0.0.1:5432/d2 schema acr_songcentric_test twice and verified counts stayed media_entity=9, audio_object=22, feature_fact=24, set_membership=9 on rerun; matcher_fingerprint_count=5 and fallback_fingerprint_count=0; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: true external chromaprint library integration and semantic-model-backed enrichment on this host
1 parent 5e00c5b0
{"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 2500, "end_ms": 7500, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:2500:7500", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 3000, "end_ms": 8000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:3000:8000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}, {"start_ms": 1000, "end_ms": 6000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:1000:6000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v2.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        20,
        11,
        21,
        13,
        22,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        23,
        17,
        24,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v2.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        20,
        11,
        21,
        13,
        22,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        23,
        17,
        24,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
{
  "input_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest.jsonl",
  "output_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v2.jsonl",
  "rows": 2,
  "wav_windows_seen": 5,
  "features_added": 10,
  "matcher_fingerprint_count": 5,
  "fallback_fingerprint_count": 0
}
\ No newline at end of file
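
The counters in the enrichment report above can be cross-checked mechanically. The following is a minimal sketch, not part of the commit: `check_report` is a hypothetical helper, and it assumes that every wav window receives exactly one fingerprint and one embedding, which is inferred from the manifest rows rather than stated by the script.

```python
import json

# Counter values copied from the enrichment report above (paths omitted).
REPORT = json.loads("""
{
  "rows": 2,
  "wav_windows_seen": 5,
  "features_added": 10,
  "matcher_fingerprint_count": 5,
  "fallback_fingerprint_count": 0
}
""")


def check_report(report: dict) -> list[str]:
    """Return a list of inconsistencies; an empty list means the counters agree."""
    problems = []
    windows = report["wav_windows_seen"]
    fingerprints = (
        report["matcher_fingerprint_count"] + report["fallback_fingerprint_count"]
    )
    # Every window takes either the matcher path or the fallback path.
    if fingerprints != windows:
        problems.append(f"{fingerprints} fingerprints for {windows} windows")
    # Assumption: one fingerprint plus one embedding is added per window.
    if report["features_added"] != 2 * windows:
        problems.append(
            f"features_added={report['features_added']}, expected {2 * windows}"
        )
    return problems


print(check_report(REPORT))  # prints []
```

For this run the check passes: 5 matcher fingerprints plus 0 fallbacks cover the 5 windows, and 10 features correspond to 2 per window.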
@@ -4,10 +4,16 @@ from __future__ import annotations
 import argparse
 import hashlib
 import json
-import math
 import wave
 from pathlib import Path
 
+ROOT = Path(__file__).resolve().parents[1]
+import sys
+if str(ROOT) not in sys.path:
+    sys.path.insert(0, str(ROOT))
+
+from src.engines.chromaprint_matcher import ChromaprintMatcher, load_audio_mono
+
 
 def load_jsonl(path: Path):
     for line in path.read_text().splitlines():
@@ -26,14 +32,29 @@ def read_wav_stats(path: Path, start_ms: int, end_ms: int) -> dict:
         wf.setpos(min(start_frame, wf.getnframes()))
         frames = wf.readframes(max(end_frame - start_frame, 0))
     digest = hashlib.sha256(frames).hexdigest()
-    energy = sum(abs(b - 128) for b in frames[: min(len(frames), 4000)]) if sampwidth == 1 else sum(abs(int.from_bytes(frames[i:i+2], 'little', signed=True)) for i in range(0, min(len(frames), 8000), 2))
+    if sampwidth == 1:
+        energy = sum(abs(b - 128) for b in frames[: min(len(frames), 4000)])
+    else:
+        energy = sum(abs(int.from_bytes(frames[i:i+2], 'little', signed=True)) for i in range(0, min(len(frames), 8000), 2))
+    return {'digest': digest, 'energy': energy, 'rate': rate, 'channels': n_channels, 'bytes_read': len(frames)}
+
+
+def extract_matcher_fingerprint(path: Path, start_ms: int, end_ms: int) -> dict | None:
+    try:
+        matcher = ChromaprintMatcher(sr=16000)
+        y, _ = load_audio_mono(str(path), sr=matcher.sr)
+        start = int(start_ms * matcher.sr / 1000)
+        end = int(end_ms * matcher.sr / 1000)
+        segment = y[start:end]
+        hashes = matcher.extract_hashes(segment)
+        digest = hashlib.sha256(json.dumps(hashes[:128]).encode('utf-8')).hexdigest()
         return {
-            'digest': digest,
-            'energy': energy,
-            'rate': rate,
-            'channels': n_channels,
-            'bytes_read': len(frames),
+            'fingerprint_value': digest[:32],
+            'checksum': f'chromaprint:{digest[:16]}',
+            'metadata_json': {'hash_count': len(hashes), 'hash_sample': hashes[:8]},
         }
+    except Exception:
+        return None
 
 
 def main() -> int:
@@ -52,15 +73,30 @@ def main() -> int:
 
     rows = []
     feature_count = 0
-    wav_assets = 0
+    wav_windows_seen = 0
+    matcher_fp_count = 0
+    fallback_fp_count = 0
     for row in load_jsonl(in_path):
         asset = row['asset']
         asset_path = Path(asset['storage_uri'])
-        for idx, window in enumerate(row.get('windows', []), start=1):
+        for window in row.get('windows', []):
             features = window.setdefault('features', [])
             if asset_path.suffix.lower() == '.wav' and asset_path.exists():
-                wav_assets += 1
+                wav_windows_seen += 1
                 stats = read_wav_stats(asset_path, window['start_ms'], window['end_ms'])
+                matcher_fp = extract_matcher_fingerprint(asset_path, window['start_ms'], window['end_ms'])
+                if matcher_fp is not None:
+                    fp = {
+                        'feature_type': 'fingerprint',
+                        'model_name': 'chromaprint_matcher',
+                        'model_version': 'phase1_local',
+                        'feature_set_name': 'chromaprint_matcher_5s',
+                        'fingerprint_value': matcher_fp['fingerprint_value'],
+                        'checksum': matcher_fp['checksum'],
+                        'metadata_json': matcher_fp['metadata_json'],
+                    }
+                    matcher_fp_count += 1
+                else:
                     fp = {
                         'feature_type': 'fingerprint',
                         'model_name': 'local_wavehash',
@@ -70,6 +106,7 @@ def main() -> int:
                         'checksum': f"fp:{stats['digest'][:16]}",
                         'metadata_json': {'energy': stats['energy'], 'bytes_read': stats['bytes_read']},
                     }
+                    fallback_fp_count += 1
                 emb = {
                     'feature_type': 'embedding',
                     'model_name': 'local_wavehash_embed',
@@ -91,8 +128,10 @@ def main() -> int:
         'input_manifest': str(in_path),
         'output_manifest': str(out_path),
         'rows': len(rows),
-        'wav_assets_seen': wav_assets,
+        'wav_windows_seen': wav_windows_seen,
         'features_added': feature_count,
+        'matcher_fingerprint_count': matcher_fp_count,
+        'fallback_fingerprint_count': fallback_fp_count,
     }
     if report_path:
         report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2))
......
 ## 2026-06-04
 
+- Upgraded `enrich_songcentric_manifest_with_local_features.py`: fingerprints in the directory pipeline now prefer the in-repo `ChromaprintMatcher`. Verified against live PostgreSQL that all 5 wav windows took the matcher path, with `fallback_fingerprint_count=0`.
+
 - Added `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`, which enriches a manifest generated from a real wav directory with local deterministic fingerprint/embedding features and then imports them into `feature_fact`. Verified the `audio files -> manifest -> features -> import` loop and its idempotency against live PostgreSQL.
 
 - Added `acr-engine/scripts/build_songcentric_manifest_from_directory.py`, which converts a real audio directory into a song-centric manifest. Verified the `audio files -> manifest -> import` path and its idempotency against live PostgreSQL using a local real-wav smoke directory.
......
@@ -300,6 +300,18 @@ flowchart TD
 
 Current feature-enrichment script: [`acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`](../acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py)
 
+
+### 4.8 Exact-lane improvement in the directory pipeline
+
+`enrich_songcentric_manifest_with_local_features.py` now prefers the in-repo `ChromaprintMatcher` to generate fingerprints and falls back to `local_wavehash` only when the matcher fails.
+
+Fresh evidence from this round:
+- `wav_windows_seen = 5`
+- `matcher_fingerprint_count = 5`
+- `fallback_fingerprint_count = 0`
+
+This shows that the exact lane in the current directory pipeline is no longer just an ad-hoc hash: it now goes through the repo's existing fingerprint extraction first.
+
 ---
 
 ## 5. Most common SQL examples
......
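
The matcher-first, fallback-second behaviour described in section 4.8 can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the script's code: `fingerprint_window`, `toy_hashes`, and `broken_matcher` are hypothetical stand-ins, the real `ChromaprintMatcher` is not imported, and the fallback dict carries only the fields visible in the diff (the remaining `local_wavehash` fields fall in lines the diff does not show). The digest scheme (SHA-256 over the JSON-encoded first 128 hashes, 32 hex characters as the value, 16 in the checksum) follows `extract_matcher_fingerprint`.

```python
import hashlib
import json
from typing import Callable, Sequence


def matcher_fingerprint(hashes: Sequence) -> dict:
    """Digest the first 128 hashes, mirroring extract_matcher_fingerprint."""
    payload = json.dumps(list(hashes[:128])).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return {
        "model_name": "chromaprint_matcher",
        "fingerprint_value": digest[:32],
        "checksum": f"chromaprint:{digest[:16]}",
    }


def fallback_fingerprint(raw_frames: bytes) -> dict:
    """Hash the raw frames; only fields visible in the diff are reproduced."""
    digest = hashlib.sha256(raw_frames).hexdigest()
    return {"model_name": "local_wavehash", "checksum": f"fp:{digest[:16]}"}


def fingerprint_window(
    raw_frames: bytes,
    extract_hashes: Callable[[bytes], Sequence],
    counters: dict,
) -> dict:
    """Try the matcher first; on any failure, fall back and count it."""
    try:
        fp = matcher_fingerprint(extract_hashes(raw_frames))
        counters["matcher_fingerprint_count"] += 1
    except Exception:
        fp = fallback_fingerprint(raw_frames)
        counters["fallback_fingerprint_count"] += 1
    return fp


def toy_hashes(frames: bytes) -> list:
    # Hypothetical extractor: pairs of [byte value, position].
    return [[b, i] for i, b in enumerate(frames)]


def broken_matcher(frames: bytes) -> list:
    # Hypothetical extractor that simulates an unavailable matcher.
    raise RuntimeError("matcher unavailable")


counters = {"matcher_fingerprint_count": 0, "fallback_fingerprint_count": 0}
ok = fingerprint_window(b"\x01\x02\x03", toy_hashes, counters)
fb = fingerprint_window(b"\x01\x02\x03", broken_matcher, counters)
print(ok["model_name"], fb["model_name"], counters)
```

Counting both paths separately is what makes a report line such as `fallback_fingerprint_count = 0` usable as evidence that the matcher path was actually exercised.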