Complete the real-directory song-centric pipeline through feature_fact
Constraint: Finish the current real-directory onboarding loop without depending on missing heavyweight model runtimes, while still writing concrete feature rows into the fused schema.
Rejected: Wait for MERT/MuQ runtime availability before validating directory-to-feature ingestion | It would leave the Phase-1 data path unproven on this host.
Confidence: high
Scope-risk: narrow
Directive: Use enrich_songcentric_manifest_with_local_features.py as the temporary deterministic feature stage for host-level pipeline validation until full model runtimes are installed.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py on the real wav smoke manifest; imported the enriched manifest twice into postgres://d2:d2pass@127.0.0.1:5432/d2 schema acr_songcentric_test and verified counts remained media_entity=9, audio_object=22, feature_fact=19, set_membership=9; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: semantic quality of the temporary local features and large-scale feature enrichment throughput
Showing 7 changed files with 262 additions and 0 deletions
| 1 | {"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 2500, "end_ms": 7500, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:2500:7500", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}, {"start_ms": 3000, "end_ms": 8000, "features": [{"feature_type": "fingerprint", "model_name": 
"local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "593c7a661cc8744423107546c4e86249", "checksum": "fp:593c7a661cc87444", "metadata_json": {"energy": 30555200, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:3000:8000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]} | ||
| 2 | {"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "4ed2ccfa55b10b886c60bb1cfdfb0a72", "checksum": "fp:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}, {"start_ms": 1000, "end_ms": 6000, "features": [{"feature_type": "fingerprint", "model_name": "local_wavehash", "model_version": "v1", "feature_set_name": "wavehash_5s", "fingerprint_value": "4ed2ccfa55b10b886c60bb1cfdfb0a72", "checksum": "fp:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "bytes_read": 160000}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:1000:6000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": 
"asset", "priority": 100}]} |
| 1 | { | ||
| 2 | "schema": "acr_songcentric_test", | ||
| 3 | "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl", | ||
| 4 | "imported": [ | ||
| 5 | { | ||
| 6 | "song_id": 8, | ||
| 7 | "asset_id": 16, | ||
| 8 | "window_ids": [ | ||
| 9 | 17, | ||
| 10 | 18, | ||
| 11 | 19 | ||
| 12 | ], | ||
| 13 | "feature_ids": [ | ||
| 14 | 10, | ||
| 15 | 11, | ||
| 16 | 12, | ||
| 17 | 13, | ||
| 18 | 14, | ||
| 19 | 15 | ||
| 20 | ], | ||
| 21 | "membership_ids": [ | ||
| 22 | 8 | ||
| 23 | ] | ||
| 24 | }, | ||
| 25 | { | ||
| 26 | "song_id": 9, | ||
| 27 | "asset_id": 20, | ||
| 28 | "window_ids": [ | ||
| 29 | 21, | ||
| 30 | 22 | ||
| 31 | ], | ||
| 32 | "feature_ids": [ | ||
| 33 | 16, | ||
| 34 | 17, | ||
| 35 | 18, | ||
| 36 | 19 | ||
| 37 | ], | ||
| 38 | "membership_ids": [ | ||
| 39 | 9 | ||
| 40 | ] | ||
| 41 | } | ||
| 42 | ], | ||
| 43 | "counts": { | ||
| 44 | "media_entity": 9, | ||
| 45 | "audio_object": 22, | ||
| 46 | "feature_fact": 19, | ||
| 47 | "set_membership": 9 | ||
| 48 | }, | ||
| 49 | "window_lineage_sample": { | ||
| 50 | "window_id": 22, | ||
| 51 | "asset_id": 20, | ||
| 52 | "song_id": 9, | ||
| 53 | "title": "song beta", | ||
| 54 | "start_ms": 1000, | ||
| 55 | "end_ms": 6000 | ||
| 56 | }, | ||
| 57 | "feature_lineage_sample": { | ||
| 58 | "feature_type": "embedding", | ||
| 59 | "model_name": "local_wavehash_embed", | ||
| 60 | "model_version": "v1", | ||
| 61 | "feature_set_name": "wavehash_embed_5s", | ||
| 62 | "window_id": 22, | ||
| 63 | "song_id": 9, | ||
| 64 | "title": "song beta" | ||
| 65 | } | ||
| 66 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
| 1 | { | ||
| 2 | "schema": "acr_songcentric_test", | ||
| 3 | "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl", | ||
| 4 | "imported": [ | ||
| 5 | { | ||
| 6 | "song_id": 8, | ||
| 7 | "asset_id": 16, | ||
| 8 | "window_ids": [ | ||
| 9 | 17, | ||
| 10 | 18, | ||
| 11 | 19 | ||
| 12 | ], | ||
| 13 | "feature_ids": [ | ||
| 14 | 10, | ||
| 15 | 11, | ||
| 16 | 12, | ||
| 17 | 13, | ||
| 18 | 14, | ||
| 19 | 15 | ||
| 20 | ], | ||
| 21 | "membership_ids": [ | ||
| 22 | 8 | ||
| 23 | ] | ||
| 24 | }, | ||
| 25 | { | ||
| 26 | "song_id": 9, | ||
| 27 | "asset_id": 20, | ||
| 28 | "window_ids": [ | ||
| 29 | 21, | ||
| 30 | 22 | ||
| 31 | ], | ||
| 32 | "feature_ids": [ | ||
| 33 | 16, | ||
| 34 | 17, | ||
| 35 | 18, | ||
| 36 | 19 | ||
| 37 | ], | ||
| 38 | "membership_ids": [ | ||
| 39 | 9 | ||
| 40 | ] | ||
| 41 | } | ||
| 42 | ], | ||
| 43 | "counts": { | ||
| 44 | "media_entity": 9, | ||
| 45 | "audio_object": 22, | ||
| 46 | "feature_fact": 19, | ||
| 47 | "set_membership": 9 | ||
| 48 | }, | ||
| 49 | "window_lineage_sample": { | ||
| 50 | "window_id": 22, | ||
| 51 | "asset_id": 20, | ||
| 52 | "song_id": 9, | ||
| 53 | "title": "song beta", | ||
| 54 | "start_ms": 1000, | ||
| 55 | "end_ms": 6000 | ||
| 56 | }, | ||
| 57 | "feature_lineage_sample": { | ||
| 58 | "feature_type": "embedding", | ||
| 59 | "model_name": "local_wavehash_embed", | ||
| 60 | "model_version": "v1", | ||
| 61 | "feature_set_name": "wavehash_embed_5s", | ||
| 62 | "window_id": 22, | ||
| 63 | "song_id": 9, | ||
| 64 | "title": "song beta" | ||
| 65 | } | ||
| 66 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
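The two import reports above carry identical `counts`, which matches the idempotency check described in the commit message (counts unchanged after importing the enriched manifest twice). A minimal sketch of that comparison follows, assuming each report is a JSON object with a `counts` key as shown; `counts_unchanged` is a hypothetical helper and is not part of the repository.

```python
import json


def counts_unchanged(first_report: str, second_report: str) -> bool:
    """Return True when both import reports carry identical table counts."""
    first = json.loads(first_report)["counts"]
    second = json.loads(second_report)["counts"]
    return first == second


# Counts copied from the reports above; the surrounding keys are omitted.
expected = {"media_entity": 9, "audio_object": 22, "feature_fact": 19, "set_membership": 9}
run_1 = json.dumps({"counts": expected})
run_2 = json.dumps({"counts": dict(expected)})

# A non-idempotent import would show up as a changed count, for example
# feature_fact growing by the 10 features in the enriched manifest.
drifted = json.dumps({"counts": {**expected, "feature_fact": 29}})

print(counts_unchanged(run_1, run_2))
print(counts_unchanged(run_1, drifted))
```

Comparing only `counts` is a coarse check: it detects duplicated rows but not rows that were rewritten in place.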
acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features_report.json
0 → 100644
| 1 | { | ||
| 2 | "input_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest.jsonl", | ||
| 3 | "output_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_directory_manifest_with_features.jsonl", | ||
| 4 | "rows": 2, | ||
| 5 | "wav_assets_seen": 5, | ||
| 6 | "features_added": 10 | ||
| 7 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
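The figures in this report can be reproduced from the enriched manifest shown earlier. Note that the script increments `wav_assets` inside the per-window loop, so `wav_assets_seen` appears to count windows rather than distinct WAV files. A small sketch of the arithmetic, with window counts read off the two manifest rows (`windows_per_row` and `FEATURES_PER_WINDOW` are illustrative names only):

```python
# Windows per manifest row, as listed in the enriched JSONL above.
windows_per_row = {"song_alpha": 3, "song_beta": 2}

# The script appends one fingerprint and one embedding per window.
FEATURES_PER_WINDOW = 2

rows = len(windows_per_row)
windows = sum(windows_per_row.values())  # what the script reports as wav_assets_seen
features_added = windows * FEATURES_PER_WINDOW

print({"rows": rows, "wav_assets_seen": windows, "features_added": features_added})
```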
| 1 | #!/usr/bin/env /usr/local/miniconda3/bin/python | ||
| 2 | from __future__ import annotations | ||
| 3 | |||
| 4 | import argparse | ||
| 5 | import hashlib | ||
| 6 | import json | ||
| 7 | import math | ||
| 8 | import wave | ||
| 9 | from pathlib import Path | ||
| 10 | |||
| 11 | |||
| 12 | def load_jsonl(path: Path): | ||
| 13 | for line in path.read_text().splitlines(): | ||
| 14 | line = line.strip() | ||
| 15 | if line: | ||
| 16 | yield json.loads(line) | ||
| 17 | |||
| 18 | |||
| 19 | def read_wav_stats(path: Path, start_ms: int, end_ms: int) -> dict: | ||
| 20 | with wave.open(str(path), 'rb') as wf: | ||
| 21 | rate = wf.getframerate() | ||
| 22 | sampwidth = wf.getsampwidth() | ||
| 23 | n_channels = wf.getnchannels() | ||
| 24 | start_frame = int(start_ms * rate / 1000) | ||
| 25 | end_frame = int(end_ms * rate / 1000) | ||
| 26 | wf.setpos(min(start_frame, wf.getnframes())) | ||
| 27 | frames = wf.readframes(max(end_frame - start_frame, 0)) | ||
| 28 | digest = hashlib.sha256(frames).hexdigest() | ||
| 29 | energy = sum(abs(b - 128) for b in frames[: min(len(frames), 4000)]) if sampwidth == 1 else sum(abs(int.from_bytes(frames[i:i+2], 'little', signed=True)) for i in range(0, min(len(frames), 8000), 2)) | ||
| 30 | return { | ||
| 31 | 'digest': digest, | ||
| 32 | 'energy': energy, | ||
| 33 | 'rate': rate, | ||
| 34 | 'channels': n_channels, | ||
| 35 | 'bytes_read': len(frames), | ||
| 36 | } | ||
| 37 | |||
| 38 | |||
| 39 | def main() -> int: | ||
| 40 | parser = argparse.ArgumentParser() | ||
| 41 | parser.add_argument('--input-manifest', required=True) | ||
| 42 | parser.add_argument('--output-manifest', required=True) | ||
| 43 | parser.add_argument('--report-output') | ||
| 44 | args = parser.parse_args() | ||
| 45 | |||
| 46 | in_path = Path(args.input_manifest).resolve() | ||
| 47 | out_path = Path(args.output_manifest).resolve() | ||
| 48 | out_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 49 | report_path = Path(args.report_output).resolve() if args.report_output else None | ||
| 50 | if report_path: | ||
| 51 | report_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 52 | |||
| 53 | rows = [] | ||
| 54 | feature_count = 0 | ||
| 55 | wav_assets = 0 | ||
| 56 | for row in load_jsonl(in_path): | ||
| 57 | asset = row['asset'] | ||
| 58 | asset_path = Path(asset['storage_uri']) | ||
| 59 | for idx, window in enumerate(row.get('windows', []), start=1): | ||
| 60 | features = window.setdefault('features', []) | ||
| 61 | if asset_path.suffix.lower() == '.wav' and asset_path.exists(): | ||
| 62 | wav_assets += 1 | ||
| 63 | stats = read_wav_stats(asset_path, window['start_ms'], window['end_ms']) | ||
| 64 | fp = { | ||
| 65 | 'feature_type': 'fingerprint', | ||
| 66 | 'model_name': 'local_wavehash', | ||
| 67 | 'model_version': 'v1', | ||
| 68 | 'feature_set_name': 'wavehash_5s', | ||
| 69 | 'fingerprint_value': stats['digest'][:32], | ||
| 70 | 'checksum': f"fp:{stats['digest'][:16]}", | ||
| 71 | 'metadata_json': {'energy': stats['energy'], 'bytes_read': stats['bytes_read']}, | ||
| 72 | } | ||
| 73 | emb = { | ||
| 74 | 'feature_type': 'embedding', | ||
| 75 | 'model_name': 'local_wavehash_embed', | ||
| 76 | 'model_version': 'v1', | ||
| 77 | 'feature_set_name': 'wavehash_embed_5s', | ||
| 78 | 'feature_schema_ver': 'v1', | ||
| 79 | 'embedding_dim': 8, | ||
| 80 | 'embedding_uri': f"inline://{stats['digest'][:16]}:{window['start_ms']}:{window['end_ms']}", | ||
| 81 | 'vector_table_name': 'audio_embedding_vector_8_placeholder', | ||
| 82 | 'checksum': f"emb:{stats['digest'][:16]}", | ||
| 83 | 'metadata_json': {'energy': stats['energy'], 'rate': stats['rate'], 'channels': stats['channels']}, | ||
| 84 | } | ||
| 85 | features.extend([fp, emb]) | ||
| 86 | feature_count += 2 | ||
| 87 | rows.append(row) | ||
| 88 | |||
| 89 | out_path.write_text('\n'.join(json.dumps(r, ensure_ascii=False) for r in rows) + ('\n' if rows else '')) | ||
| 90 | report = { | ||
| 91 | 'input_manifest': str(in_path), | ||
| 92 | 'output_manifest': str(out_path), | ||
| 93 | 'rows': len(rows), | ||
| 94 | 'wav_assets_seen': wav_assets, | ||
| 95 | 'features_added': feature_count, | ||
| 96 | } | ||
| 97 | if report_path: | ||
| 98 | report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2)) | ||
| 99 | print(json.dumps(report, ensure_ascii=False, indent=2)) | ||
| 100 | return 0 | ||
| 101 | |||
| 102 | |||
| 103 | if __name__ == '__main__': | ||
| 104 | raise SystemExit(main()) |
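To illustrate the windowed hashing that `read_wav_stats` performs, here is a minimal, self-contained sketch that writes a synthetic 16 kHz mono WAV and hashes two windows of it using the same `wave` and `hashlib` calls. The helper names `write_ramp_wav` and `window_digest` are hypothetical, the ramp signal is an assumption chosen so that windows differ, and the energy statistic from the script is omitted.

```python
import hashlib
import tempfile
import wave
from pathlib import Path


def write_ramp_wav(path: Path, seconds: int = 2, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV whose samples form a rising ramp (non-periodic)."""
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        # seconds * rate = 32000 samples, all below the int16 maximum of 32767.
        samples = b"".join(
            i.to_bytes(2, "little", signed=True) for i in range(seconds * rate)
        )
        wf.writeframes(samples)


def window_digest(path: Path, start_ms: int, end_ms: int) -> str:
    """SHA-256 over the raw frames of one window, mirroring read_wav_stats."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        start_frame = int(start_ms * rate / 1000)
        end_frame = int(end_ms * rate / 1000)
        wf.setpos(min(start_frame, wf.getnframes()))
        frames = wf.readframes(max(end_frame - start_frame, 0))
    return hashlib.sha256(frames).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    wav_path = Path(tmp) / "ramp.wav"
    write_ramp_wav(wav_path)
    first = window_digest(wav_path, 0, 1000)
    repeat = window_digest(wav_path, 0, 1000)
    shifted = window_digest(wav_path, 500, 1500)

print(first == repeat, first == shifted)
```

Because the digest covers raw PCM bytes, it is deterministic for a given file and window but changes with any re-encoding or resampling, which is consistent with the commit message leaving the semantic quality of these temporary features untested.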
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
| 3 | - Added `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`, which automatically fills in local deterministic fingerprint/embedding features for a manifest generated from a real wav directory before it is imported into `feature_fact`; verified the `audio files -> manifest -> features -> import` loop and its idempotency on live PostgreSQL. | ||
| 4 | |||
| 3 | - Added `acr-engine/scripts/build_songcentric_manifest_from_directory.py`, which automatically converts a real audio directory into a song-centric manifest; verified the `audio files -> manifest -> import` chain and its idempotency on live PostgreSQL using a local real wav smoke directory. | 5 | - Added `acr-engine/scripts/build_songcentric_manifest_from_directory.py`, which automatically converts a real audio directory into a song-centric manifest; verified the `audio files -> manifest -> import` chain and its idempotency on live PostgreSQL using a local real wav smoke directory. |
| 4 | 6 | ||
| 5 | - Extended `import_songcentric_manifest_live.py` to write `feature_fact` rows directly from the manifest's `windows[].features[]`, and verified the complete `song -> asset -> window -> feature -> membership` import loop and its idempotency on live PostgreSQL using `songcentric_feature_manifest_sample.jsonl`. | 7 | - Extended `import_songcentric_manifest_live.py` to write `feature_fact` rows directly from the manifest's `windows[].features[]`, and verified the complete `song -> asset -> window -> feature -> membership` import loop and its idempotency on live PostgreSQL using `songcentric_feature_manifest_sample.jsonl`. | ... | ... |
| ... | @@ -285,6 +285,21 @@ flowchart TD | ... | @@ -285,6 +285,21 @@ flowchart TD |
| 285 | 285 | ||
| 286 | Current directory build script: [`acr-engine/scripts/build_songcentric_manifest_from_directory.py`](../acr-engine/scripts/build_songcentric_manifest_from_directory.py) | 286 | Current directory build script: [`acr-engine/scripts/build_songcentric_manifest_from_directory.py`](../acr-engine/scripts/build_songcentric_manifest_from_directory.py) |
| 287 | 287 | ||
| 288 | |||
| 289 | ### 4.7 Real-directory feature enrichment and import flow | ||
| 290 | |||
| 291 | ```mermaid | ||
| 292 | flowchart TD | ||
| 293 | A[real audio directory] --> B[build_songcentric_manifest_from_directory.py] | ||
| 294 | B --> C[songcentric_directory_manifest.jsonl] | ||
| 295 | C --> D[enrich_songcentric_manifest_with_local_features.py] | ||
| 296 | D --> E[songcentric_directory_manifest_with_features.jsonl] | ||
| 297 | E --> F[import_songcentric_manifest_live.py] | ||
| 298 | F --> G[feature_fact] | ||
| 299 | ``` | ||
| 300 | |||
| 301 | Current feature enrichment script: [`acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`](../acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py) | ||
| 302 | |||
| 288 | --- | 303 | --- |
| 289 | 304 | ||
| 290 | ## 5. Most common SQL examples | 305 | ## 5. Most common SQL examples | ... | ... |