Commit 5e00c5b0 5e00c5b0ae5a861fa13c4066b279b8b51c16ecd3 by cnb.bofCdSsphPA

Complete the real-directory song-centric pipeline through feature_fact

Constraint: Finish the current real-directory onboarding loop without depending on missing heavyweight model runtimes, while still writing concrete feature rows into the fused schema.
Rejected: Wait for MERT/MuQ runtime availability before validating directory-to-feature ingestion | It would leave the Phase-1 data path unproven on this host.
Confidence: high
Scope-risk: narrow
Directive: Use enrich_songcentric_manifest_with_local_features.py as the temporary deterministic feature stage for host-level pipeline validation until full model runtimes are installed.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py on the real wav smoke manifest; imported the enriched manifest twice into postgres://d2:d2pass@127.0.0.1:5432/d2 schema acr_songcentric_test and verified counts remained media_entity=9, audio_object=22, feature_fact=19, set_membership=9; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: semantic quality of the temporary local features and large-scale feature enrichment throughput
1 parent 0f75787b
{"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 2500, "end_ms": 7500, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:2500:7500", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 3000, "end_ms": 8000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:3000:8000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "4ed2ccfa55b10b886c60bb1cfdfb0a72", "checksum": "fp:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}, {"start_ms": 1000, "end_ms": 6000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "4ed2ccfa55b10b886c60bb1cfdfb0a72", "checksum": "fp:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:1000:6000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        10,
        11,
        12,
        13,
        14,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        16,
        17,
        18,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 19,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "embedding",
    "model_name": "local_wavehash_embed",
    "model_version": "v1",
    "feature_set_name": "wavehash_embed_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        10,
        11,
        12,
        13,
        14,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        16,
        17,
        18,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 19,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "embedding",
    "model_name": "local_wavehash_embed",
    "model_version": "v1",
    "feature_set_name": "wavehash_embed_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
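The two import reports above are identical, which is what the commit message means by counts that "remained" stable after importing twice. A minimal sketch of that comparison follows. `changed_tables` is a hypothetical helper and not part of the commit; the inline dictionaries copy only the `counts` section from the reports.

```python
# Counts copied from the two import reports; all other report fields are omitted.
FIRST_RUN = {"counts": {"media_entity": 9, "audio_object": 22,
                        "feature_fact": 19, "set_membership": 9}}
SECOND_RUN = {"counts": {"media_entity": 9, "audio_object": 22,
                         "feature_fact": 19, "set_membership": 9}}


def changed_tables(first: dict, second: dict) -> list:
    """Hypothetical helper: names of tables whose row count differs between runs."""
    before, after = first["counts"], second["counts"]
    tables = before.keys() | after.keys()
    return sorted(t for t in tables if before.get(t) != after.get(t))


drift = changed_tables(FIRST_RUN, SECOND_RUN)
print("idempotent" if not drift else f"row counts changed in: {drift}")
```

An empty result only shows that row counts did not grow on re-import. It says nothing about whether the stored values are semantically meaningful, which the commit lists as not tested.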
{
  "input_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest.jsonl",
  "output_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl",
  "rows": 2,
  "wav_assets_seen": 5,
  "features_added": 10
}
\ No newline at end of file
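The numbers in this report can be cross-checked against the enriched manifest. The sketch below uses trimmed copies of the two manifest rows, keeping only window bounds and feature types; `window` and `tally` are hypothetical helpers, not part of the commit. Note that `wav_assets_seen` equals the number of windows (5) rather than the number of assets (2), because the script increments that counter once per window.

```python
import json


def window(start_ms: int, end_ms: int) -> dict:
    """Hypothetical helper: one window carrying a fingerprint and an embedding."""
    return {"start_ms": start_ms, "end_ms": end_ms,
            "features": [{"feature_type": "fingerprint"},
                         {"feature_type": "embedding"}]}


# Trimmed stand-ins for the two JSONL lines of the enriched manifest.
MANIFEST_LINES = [
    json.dumps({"song": {"biz_key": "song_alpha"},
                "windows": [window(0, 5000), window(2500, 7500), window(3000, 8000)]}),
    json.dumps({"song": {"biz_key": "song_beta"},
                "windows": [window(0, 5000), window(1000, 6000)]}),
]
REPORT = {"rows": 2, "wav_assets_seen": 5, "features_added": 10}


def tally(lines) -> dict:
    """Count rows, windows and features across manifest lines."""
    rows = [json.loads(line) for line in lines if line.strip()]
    windows = [w for r in rows for w in r.get("windows", [])]
    features = sum(len(w.get("features", [])) for w in windows)
    return {"rows": len(rows), "windows": len(windows), "features": features}


totals = tally(MANIFEST_LINES)
print(totals)
print("report consistent:",
      totals["rows"] == REPORT["rows"]
      and totals["windows"] == REPORT["wav_assets_seen"]
      and totals["features"] == REPORT["features_added"])
```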
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import argparse
import hashlib
import json
import wave
from pathlib import Path


def load_jsonl(path: Path):
    # Yield one parsed object per non-empty line of the JSONL file.
    for line in path.read_text(encoding='utf-8').splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def read_wav_stats(path: Path, start_ms: int, end_ms: int) -> dict:
    # Read the raw PCM bytes of the [start_ms, end_ms) window.
    with wave.open(str(path), 'rb') as wf:
        rate = wf.getframerate()
        sampwidth = wf.getsampwidth()
        n_channels = wf.getnchannels()
        start_frame = int(start_ms * rate / 1000)
        end_frame = int(end_ms * rate / 1000)
        wf.setpos(min(start_frame, wf.getnframes()))
        frames = wf.readframes(max(end_frame - start_frame, 0))
    digest = hashlib.sha256(frames).hexdigest()
    # Cheap energy proxy over the head of the window only: the first 4000 bytes
    # for 8-bit audio, otherwise the first 8000 bytes read as 16-bit little-endian
    # samples (any sample width other than 1 is treated as 16-bit).
    if sampwidth == 1:
        energy = sum(abs(b - 128) for b in frames[:4000])
    else:
        energy = sum(
            abs(int.from_bytes(frames[i:i + 2], 'little', signed=True))
            for i in range(0, min(len(frames), 8000), 2)
        )
    return {
        'digest': digest,
        'energy': energy,
        'rate': rate,
        'channels': n_channels,
        'bytes_read': len(frames),
    }


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-manifest', required=True)
    parser.add_argument('--output-manifest', required=True)
    parser.add_argument('--report-output')
    args = parser.parse_args()

    in_path = Path(args.input_manifest).resolve()
    out_path = Path(args.output_manifest).resolve()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    report_path = Path(args.report_output).resolve() if args.report_output else None
    if report_path:
        report_path.parent.mkdir(parents=True, exist_ok=True)

    rows = []
    feature_count = 0
    wav_assets = 0
    for row in load_jsonl(in_path):
        asset = row['asset']
        asset_path = Path(asset['storage_uri'])
        for window in row.get('windows', []):
            features = window.setdefault('features', [])
            if asset_path.suffix.lower() == '.wav' and asset_path.exists():
                # NOTE: incremented once per window, so the reported
                # 'wav_assets_seen' counts wav-backed windows, not distinct assets.
                wav_assets += 1
                stats = read_wav_stats(asset_path, window['start_ms'], window['end_ms'])
                fp = {
                    'feature_type': 'fingerprint',
                    'model_name': 'local_wavehash',
                    'model_version': 'v1',
                    'feature_set_name': 'wavehash_5s',
                    'fingerprint_value': stats['digest'][:32],
                    'checksum': f"fp:{stats['digest'][:16]}",
                    'metadata_json': {'energy': stats['energy'], 'bytes_read': stats['bytes_read']},
                }
                # Placeholder embedding record: only a URI and a table name are
                # written; no vector values are computed here.
                emb = {
                    'feature_type': 'embedding',
                    'model_name': 'local_wavehash_embed',
                    'model_version': 'v1',
                    'feature_set_name': 'wavehash_embed_5s',
                    'feature_schema_ver': 'v1',
                    'embedding_dim': 8,
                    'embedding_uri': f"inline://{stats['digest'][:16]}:{window['start_ms']}:{window['end_ms']}",
                    'vector_table_name': 'audio_embedding_vector_8_placeholder',
                    'checksum': f"emb:{stats['digest'][:16]}",
                    'metadata_json': {'energy': stats['energy'], 'rate': stats['rate'], 'channels': stats['channels']},
                }
                features.extend([fp, emb])
                feature_count += 2
        rows.append(row)

    out_path.write_text(
        '\n'.join(json.dumps(r, ensure_ascii=False) for r in rows) + ('\n' if rows else ''),
        encoding='utf-8',
    )
    report = {
        'input_manifest': str(in_path),
        'output_manifest': str(out_path),
        'rows': len(rows),
        'wav_assets_seen': wav_assets,
        'features_added': feature_count,
    }
    if report_path:
        report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding='utf-8')
    print(json.dumps(report, ensure_ascii=False, indent=2))
    return 0


if __name__ == '__main__':
    raise SystemExit(main())
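The "deterministic" property the commit relies on comes from hashing the raw PCM bytes of each window. The sketch below illustrates that on a synthetic tone written to a temporary file. `write_test_tone` and `window_digest` are hypothetical helpers for illustration, the tone parameters are arbitrary assumptions, and `window_digest` mirrors only the windowing and hashing part of `read_wav_stats`.

```python
import hashlib
import math
import struct
import tempfile
import wave
from pathlib import Path


def write_test_tone(path: Path, seconds: float = 2.0, rate: int = 16000) -> None:
    """Write a mono 16-bit sine tone; purely synthetic test data."""
    n_samples = int(seconds * rate)
    pcm = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n_samples)
    )
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def window_digest(path: Path, start_ms: int, end_ms: int) -> tuple:
    """Return (sha256 hex digest, byte count) for one window of raw PCM."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        start = int(start_ms * rate / 1000)
        end = int(end_ms * rate / 1000)
        wf.setpos(min(start, wf.getnframes()))
        frames = wf.readframes(max(end - start, 0))
    return hashlib.sha256(frames).hexdigest(), len(frames)


with tempfile.TemporaryDirectory() as tmp:
    wav_path = Path(tmp) / "tone.wav"
    write_test_tone(wav_path)
    first = window_digest(wav_path, 0, 1000)
    second = window_digest(wav_path, 0, 1000)

print("same window, same digest:", first == second)
print("bytes in a 1 s mono 16-bit window at 16 kHz:", first[1])
```

Because the digest depends only on the bytes read, re-running enrichment on unchanged files yields the same `fingerprint_value` and `checksum`.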
## 2026-06-04

- Added `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`, which automatically adds local deterministic fingerprint/embedding features to a manifest generated from a real wav directory so that it can then be imported into `feature_fact`; the `audio files -> manifest -> features -> import` loop and its idempotency were verified on live PostgreSQL.

- Added `acr-engine/scripts/build_songcentric_manifest_from_directory.py`, which automatically converts a real audio directory into a song-centric manifest; the `audio files -> manifest -> import` path and its idempotency were verified on live PostgreSQL using a local smoke directory of real wav files.

- Extended `import_songcentric_manifest_live.py` to write `feature_fact` rows directly from the manifest's `windows[].features[]`; the full `song -> asset -> window -> feature -> membership` import loop and its idempotency were verified on live PostgreSQL using `songcentric_feature_manifest_sample.jsonl`.
......
@@ -285,6 +285,21 @@ flowchart TD

Current directory build script: [`acr-engine/scripts/build_songcentric_manifest_from_directory.py`](../acr-engine/scripts/build_songcentric_manifest_from_directory.py)


### 4.7 Flow: enrich a real-directory manifest with features, then import

```mermaid
flowchart TD
    A[real audio directory] --> B[build_songcentric_manifest_from_directory.py]
    B --> C[songcentric_directory_manifest.jsonl]
    C --> D[enrich_songcentric_manifest_with_local_features.py]
    D --> E[songcentric_directory_manifest_with_features.jsonl]
    E --> F[import_songcentric_manifest_live.py]
    F --> G[feature_fact]
```

Current feature enrichment script: [`acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`](../acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py)

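As a rough illustration of the C -> D -> E step in the flowchart, the sketch below assembles the command line for the enrichment script. Only the flags visible in the script's argument parser (`--input-manifest`, `--output-manifest`, `--report-output`) are used. `enrich_command` is a hypothetical helper, and the relative paths assume the command is run from the repository root.

```python
import sys

# Script path relative to the repository root (assumed working directory).
SCRIPT = "acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py"
DATA_DIR = "acr-engine/data/pgvector_eval/music20"


def enrich_command(input_manifest: str, output_manifest: str, report_output=None) -> list:
    """Hypothetical helper: build the argv list for the enrichment step."""
    cmd = [sys.executable, SCRIPT,
           "--input-manifest", input_manifest,
           "--output-manifest", output_manifest]
    if report_output:
        # Optional: also write the JSON report to a file.
        cmd += ["--report-output", report_output]
    return cmd


cmd = enrich_command(
    f"{DATA_DIR}/songcentric_directory_manifest.jsonl",
    f"{DATA_DIR}/songcentric_directory_manifest_with_features.jsonl",
)
print(" ".join(cmd))
# To execute it for real: import subprocess; subprocess.run(cmd, check=True)
```

The build and import scripts are left out on purpose, since their command-line options are not shown here.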
---

## 5. Most common SQL examples
......