Commit 35d883a8 35d883a8d0279174504cbf1f04fe30d31b5f0600 by cnb.bofCdSsphPA

Make semantic feature enrichment runtime-aware on the song-centric path

Constraint: Keep the current real-directory import path executable on this host while making semantic-lane readiness explicit instead of pretending the heavyweight runtime exists.
Rejected: Hardwire semantic enrichment to the local fallback without reporting missing runtime state | It hides the true blocker and weakens the upgrade path to real semantic models.
Confidence: high
Scope-risk: narrow
Directive: On this host, treat local_wavehash_embed as a fallback semantic backend and persist missing runtime evidence until torch/torchaudio/transformers are installed.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py on the real wav smoke manifest; imported the v3 enriched manifest twice into postgres://d2:d2pass@127.0.0.1:5432/d2 schema acr_songcentric_test and verified counts stayed media_entity=9, audio_object=22, feature_fact=24, set_membership=9; report shows semantic_runtime_available=false and missing=[torch, torchaudio, transformers]; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: real MERT/MuQ extraction on this host
1 parent 7e3b0136
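
The readiness probe named in the Directive can also be sketched without executing the heavy packages. This is a minimal sketch and not the commit's code: the diff below uses `importlib.import_module`, while this variant assumes `importlib.util.find_spec` is an acceptable stand-in. `find_spec` only checks that a module can be located, so it will not notice a package that is installed but fails during import. The name `probe_runtime` is hypothetical.

```python
import importlib.util

# Packages the commit message lists as the semantic runtime.
REQUIRED = ("torch", "torchaudio", "transformers")


def probe_runtime(required=REQUIRED):
    """Return (ready, missing) by locating each module without importing it."""
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return (not missing, missing)


ready, missing = probe_runtime()
print({"semantic_runtime_available": ready, "semantic_runtime_missing": missing})
```

Locating instead of importing keeps the probe cheap, at the cost of being less strict than the check the commit actually ships.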
1 {"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 2500, "end_ms": 7500, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": 
"wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:2500:7500", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 3000, "end_ms": 8000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:3000:8000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
2 {"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 1000, "end_ms": 6000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": 
"wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:1000:6000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v3.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        20,
        11,
        21,
        13,
        22,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        23,
        17,
        24,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v3.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        20,
        11,
        21,
        13,
        22,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        23,
        17,
        24,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
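
The two import reports above are identical. Assuming they correspond to the two import runs mentioned in the Tested line, the idempotency claim can be checked mechanically. The sketch below is illustrative only: `import_is_idempotent` is a hypothetical helper, and the dictionaries are trimmed copies of the reports, not the full files.

```python
def import_is_idempotent(first: dict, second: dict) -> bool:
    """True when a re-import left both the table counts and the assigned ids unchanged."""
    return first["counts"] == second["counts"] and first["imported"] == second["imported"]


# Trimmed copies of the two reports (only the fields compared here).
first_run = {
    "counts": {"media_entity": 9, "audio_object": 22, "feature_fact": 24, "set_membership": 9},
    "imported": [
        {"song_id": 8, "asset_id": 16, "window_ids": [17, 18, 19]},
        {"song_id": 9, "asset_id": 20, "window_ids": [21, 22]},
    ],
}
second_run = {
    "counts": dict(first_run["counts"]),
    "imported": [dict(entry) for entry in first_run["imported"]],
}

print(import_is_idempotent(first_run, second_run))
```

Comparing the `imported` ids as well as the counts catches the case where a re-import deletes and recreates rows while leaving the totals unchanged.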
{
  "input_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest.jsonl",
  "output_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_v3.jsonl",
  "rows": 2,
  "wav_windows_seen": 5,
  "features_added": 10,
  "matcher_fingerprint_count": 5,
  "fallback_fingerprint_count": 0,
  "semantic_runtime_available": false,
  "semantic_runtime_missing": [
    "torch",
    "torchaudio",
    "transformers"
  ],
  "semantic_runtime_ready_count": 0,
  "semantic_fallback_count": 5
}
\ No newline at end of file
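
The counters in the enrichment report above are related to one another, so a reviewer can sanity-check a report before trusting it. The sketch below encodes relations that appear to hold for the script as shown in this diff (one fingerprint and one embedding per wav window); treat them as assumptions rather than a documented contract. `check_enrichment_report` is a hypothetical helper.

```python
def check_enrichment_report(report: dict) -> list:
    """Return a list of inconsistencies; an empty list means the counters agree."""
    problems = []
    windows = report["wav_windows_seen"]
    # Each window receives one fingerprint and one embedding.
    if report["features_added"] != 2 * windows:
        problems.append("features_added != 2 * wav_windows_seen")
    # Every fingerprint comes from either the matcher or the hash fallback.
    if report["matcher_fingerprint_count"] + report["fallback_fingerprint_count"] != windows:
        problems.append("fingerprint counters do not sum to wav_windows_seen")
    # Every embedding is either runtime-ready or a fallback.
    if report["semantic_runtime_ready_count"] + report["semantic_fallback_count"] != windows:
        problems.append("semantic counters do not sum to wav_windows_seen")
    # The availability flag must agree with the list of missing modules.
    if report["semantic_runtime_available"] != (len(report["semantic_runtime_missing"]) == 0):
        problems.append("semantic_runtime_available disagrees with semantic_runtime_missing")
    return problems


# Counter values copied from the report above.
report = {
    "wav_windows_seen": 5,
    "features_added": 10,
    "matcher_fingerprint_count": 5,
    "fallback_fingerprint_count": 0,
    "semantic_runtime_available": False,
    "semantic_runtime_missing": ["torch", "torchaudio", "transformers"],
    "semantic_runtime_ready_count": 0,
    "semantic_fallback_count": 5,
}
print(check_enrichment_report(report))  # prints []
```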
@@ -3,6 +3,7 @@ from __future__ import annotations
 
 import argparse
 import hashlib
+import importlib
 import json
 import wave
 from pathlib import Path
@@ -22,6 +23,20 @@ def load_jsonl(path: Path):
             yield json.loads(line)
 
 
+def module_available(name: str) -> bool:
+    try:
+        importlib.import_module(name)
+        return True
+    except Exception:
+        return False
+
+
+def semantic_runtime_available() -> tuple[bool, list[str]]:
+    required = ['torch', 'torchaudio', 'transformers']
+    missing = [m for m in required if not module_available(m)]
+    return (len(missing) == 0, missing)
+
+
 def read_wav_stats(path: Path, start_ms: int, end_ms: int) -> dict:
     with wave.open(str(path), 'rb') as wf:
         rate = wf.getframerate()
@@ -57,6 +72,40 @@ def extract_matcher_fingerprint(path: Path, start_ms: int, end_ms: int) -> dict
         return None
 
 
+def build_semantic_feature(stats: dict, start_ms: int, end_ms: int, runtime_ok: bool, missing: list[str]) -> dict:
+    if runtime_ok:
+        return {
+            'feature_type': 'embedding',
+            'model_name': 'semantic_runtime_ready_placeholder',
+            'model_version': 'awaiting_real_adapter',
+            'feature_set_name': 'semantic_runtime_ready_5s',
+            'feature_schema_ver': 'v1',
+            'embedding_dim': 8,
+            'embedding_uri': f"runtime-ready://{stats['digest'][:16]}:{start_ms}:{end_ms}",
+            'vector_table_name': 'audio_embedding_vector_8_placeholder',
+            'checksum': f"emb:{stats['digest'][:16]}",
+            'metadata_json': {'semantic_backend': 'runtime_ready_placeholder'},
+        }
+    return {
+        'feature_type': 'embedding',
+        'model_name': 'local_wavehash_embed',
+        'model_version': 'v1',
+        'feature_set_name': 'wavehash_embed_5s',
+        'feature_schema_ver': 'v1',
+        'embedding_dim': 8,
+        'embedding_uri': f"inline://{stats['digest'][:16]}:{start_ms}:{end_ms}",
+        'vector_table_name': 'audio_embedding_vector_8_placeholder',
+        'checksum': f"emb:{stats['digest'][:16]}",
+        'metadata_json': {
+            'energy': stats['energy'],
+            'rate': stats['rate'],
+            'channels': stats['channels'],
+            'semantic_backend': 'local_fallback',
+            'runtime_missing': missing,
+        },
+    }
+
+
 def main() -> int:
     parser = argparse.ArgumentParser()
     parser.add_argument('--input-manifest', required=True)
@@ -71,11 +120,16 @@ def main() -> int:
     if report_path:
         report_path.parent.mkdir(parents=True, exist_ok=True)
 
+    runtime_ok, missing_runtime = semantic_runtime_available()
+
     rows = []
     feature_count = 0
     wav_windows_seen = 0
     matcher_fp_count = 0
     fallback_fp_count = 0
+    semantic_runtime_ready_count = 0
+    semantic_fallback_count = 0
+
     for row in load_jsonl(in_path):
         asset = row['asset']
         asset_path = Path(asset['storage_uri'])
@@ -107,18 +161,13 @@ def main() -> int:
                     'metadata_json': {'energy': stats['energy'], 'bytes_read': stats['bytes_read']},
                 }
                 fallback_fp_count += 1
-            emb = {
-                'feature_type': 'embedding',
-                'model_name': 'local_wavehash_embed',
-                'model_version': 'v1',
-                'feature_set_name': 'wavehash_embed_5s',
-                'feature_schema_ver': 'v1',
-                'embedding_dim': 8,
-                'embedding_uri': f"inline://{stats['digest'][:16]}:{window['start_ms']}:{window['end_ms']}",
-                'vector_table_name': 'audio_embedding_vector_8_placeholder',
-                'checksum': f"emb:{stats['digest'][:16]}",
-                'metadata_json': {'energy': stats['energy'], 'rate': stats['rate'], 'channels': stats['channels']},
-            }
+
+            emb = build_semantic_feature(stats, window['start_ms'], window['end_ms'], runtime_ok, missing_runtime)
+            if runtime_ok:
+                semantic_runtime_ready_count += 1
+            else:
+                semantic_fallback_count += 1
+
             features.extend([fp, emb])
             feature_count += 2
         rows.append(row)
@@ -132,6 +181,10 @@ def main() -> int:
         'features_added': feature_count,
         'matcher_fingerprint_count': matcher_fp_count,
         'fallback_fingerprint_count': fallback_fp_count,
+        'semantic_runtime_available': runtime_ok,
+        'semantic_runtime_missing': missing_runtime,
+        'semantic_runtime_ready_count': semantic_runtime_ready_count,
+        'semantic_fallback_count': semantic_fallback_count,
     }
     if report_path:
         report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2))
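
The `embedding_uri` values produced by `build_semantic_feature` follow the pattern `<scheme>://<digest16>:<start_ms>:<end_ms>`, where the scheme is `inline` for the fallback and `runtime-ready` for the placeholder branch. A consumer can use the scheme to tell the two apart. The parser below is a minimal sketch under that reading of the diff; `parse_embedding_uri` is a hypothetical helper and is not part of the commit.

```python
def parse_embedding_uri(uri: str) -> dict:
    """Split '<scheme>://<digest>:<start_ms>:<end_ms>' into typed fields."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {uri!r}")
    parts = rest.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected digest:start_ms:end_ms in {uri!r}")
    digest, start_ms, end_ms = parts
    return {
        "scheme": scheme,
        "digest": digest,
        "start_ms": int(start_ms),
        "end_ms": int(end_ms),
        # Assumption: only the 'inline' scheme marks the local fallback backend.
        "is_fallback": scheme == "inline",
    }


# URI copied from the enriched manifest in this commit.
print(parse_embedding_uri("inline://593c7a661cc87444:0:5000"))
```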
......
 ## 2026-06-04
 
+- Upgraded `enrich_songcentric_manifest_with_local_features.py` to runtime-aware semantic adapter selection: because `torch/torchaudio/transformers` are missing on the current host, the semantic lane explicitly writes the `local_wavehash_embed` fallback and records the missing dependencies in the report and metadata.
+
 - Upgraded `enrich_songcentric_manifest_with_local_features.py`: fingerprints in the directory chain now prefer the in-repo `ChromaprintMatcher`; verified on live PostgreSQL that all 5 wav windows hit the matcher path, with `fallback_fingerprint_count=0`
 
 - Added `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`, which automatically adds local deterministic fingerprint/embedding features to a manifest generated from a real wav directory and then imports them into `feature_fact`; the `audio files -> manifest -> features -> import` loop and its idempotency have been verified on live PostgreSQL.
......
@@ -312,6 +312,20 @@ flowchart TD
 
 This shows that the exact lane in the current directory chain is no longer just a temporary hash; it now prefers the repository's existing fingerprint extraction capability.
 
+
+### 4.9 Runtime selection for the semantic lane in the directory chain
+
+`enrich_songcentric_manifest_with_local_features.py` currently makes a **runtime-aware** choice for the semantic lane:
+- If `torch / torchaudio / transformers` are available, it reserves an entry point for a real semantic adapter
+- If they are not available, it explicitly falls back to `local_wavehash_embed` and writes the missing dependencies into the metadata/report
+
+Fresh evidence from this round:
+- `semantic_runtime_available = false`
+- `semantic_runtime_missing = ["torch", "torchaudio", "transformers"]`
+- `semantic_fallback_count = 5`
+
+This shows that the semantic lane on the current host is not yet connected to a real model, but the chain already has explicit runtime branching and auditable evidence.
+
 ---
 
 ## 5. Most commonly used SQL examples
......
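
The auditable evidence described in section 4.9 lives in each embedding's `metadata_json`, so it can be tallied straight from an enriched manifest. The sketch below is illustrative: `tally_semantic_backends` is a hypothetical helper, and the sample row is a trimmed stand-in for a manifest line that keeps only the fields the tally reads.

```python
import json
from collections import Counter


def tally_semantic_backends(jsonl_text: str) -> Counter:
    """Count embedding features per semantic_backend across all windows."""
    tally = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        row = json.loads(line)
        for window in row.get("windows", []):
            for feature in window.get("features", []):
                if feature.get("feature_type") != "embedding":
                    continue
                metadata = feature.get("metadata_json", {})
                tally[metadata.get("semantic_backend", "unknown")] += 1
    return tally


# Trimmed sample row: two windows, each with one fingerprint and one fallback embedding.
sample = json.dumps({
    "windows": [
        {"features": [
            {"feature_type": "fingerprint", "metadata_json": {}},
            {"feature_type": "embedding", "metadata_json": {"semantic_backend": "local_fallback"}},
        ]},
        {"features": [
            {"feature_type": "fingerprint", "metadata_json": {}},
            {"feature_type": "embedding", "metadata_json": {"semantic_backend": "local_fallback"}},
        ]},
    ]
})
print(tally_semantic_backends(sample))
```

Run against the full v3 manifest, the `local_fallback` tally would be expected to match `semantic_fallback_count` in the enrichment report, assuming every embedding carries the `semantic_backend` field.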