Commit 3b4b3684 3b4b3684db879c726713a3719bad04ad251cd0a7 by cnb.bofCdSsphPA

Collapse the song-centric directory workflow into one live runner

Constraint: Keep the current real-directory onboarding path executable end-to-end on this host while exposing exact/semantic backend selection in one reproducible report.
Rejected: Leave the song-centric pipeline as multiple manual commands only | It raises handoff cost and makes repeated host validation slower and noisier.
Confidence: high
Scope-risk: narrow
Directive: Use run_songcentric_directory_pipeline_live.py as the default smoke/verification entrypoint for the current song-centric ingestion path.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py --dsn postgres://d2:d2pass@127.0.0.1:5432/d2 --schema acr_songcentric_test --input-root acr-engine/data/songcentric_builder_smoke --output-dir acr-engine/data/pgvector_eval/music20; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: very large directory trees and true semantic runtime readiness on this host
1 parent 35d883a8
{
  "input_root": "/workspace/acr-engine/data/songcentric_builder_smoke",
  "output": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl",
  "song_count": 2,
  "asset_count": 2,
  "window_count": 5,
  "window_ms": 5000,
  "stride_ms": 2500,
  "set_name": "phase1_hot_reference_v1"
}
\ No newline at end of file
{
  "input_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl",
  "output_manifest": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl",
  "rows": 2,
  "wav_windows_seen": 5,
  "features_added": 10,
  "matcher_fingerprint_count": 5,
  "fallback_fingerprint_count": 0,
  "semantic_runtime_available": false,
  "semantic_runtime_missing": [
    "torch",
    "torchaudio",
    "transformers"
  ],
  "semantic_runtime_ready_count": 0,
  "semantic_fallback_count": 5
}
\ No newline at end of file
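The enrich report above records which semantic runtime modules were missing on this host. The detection logic of `enrich_songcentric_manifest_with_local_features.py` is not shown in this commit view, so the following is only a minimal sketch of how such a probe could work, assuming a plain importability check. The function names and the `runtime_model` label are hypothetical; only the module names, the report field names, and the `local_fallback` value come from the reports in this commit.

```python
from importlib.util import find_spec

# Module names taken from the "semantic_runtime_missing" list in the enrich report.
SEMANTIC_RUNTIME_MODULES = ("torch", "torchaudio", "transformers")


def probe_semantic_runtime(modules=SEMANTIC_RUNTIME_MODULES):
    """Return report-style fields describing which runtime modules are importable."""
    # find_spec returns None for a top-level module that cannot be located.
    missing = [name for name in modules if find_spec(name) is None]
    return {
        "semantic_runtime_available": not missing,
        "semantic_runtime_missing": missing,
    }


def choose_semantic_backend(probe):
    """Pick a backend label from a probe result.

    "local_fallback" matches the value recorded in the enriched manifest;
    "runtime_model" is a hypothetical label used only for illustration.
    """
    return "runtime_model" if probe["semantic_runtime_available"] else "local_fallback"


if __name__ == "__main__":
    probe = probe_semantic_runtime()
    print(probe, choose_semantic_backend(probe))
```

Keeping the probe separate from the backend choice means the same probe result can be written into the report and into each feature's metadata without re-checking the environment.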
{
  "schema": "acr_songcentric_test",
  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl",
  "imported": [
    {
      "song_id": 8,
      "asset_id": 16,
      "window_ids": [
        17,
        18,
        19
      ],
      "feature_ids": [
        20,
        11,
        21,
        13,
        22,
        15
      ],
      "membership_ids": [
        8
      ]
    },
    {
      "song_id": 9,
      "asset_id": 20,
      "window_ids": [
        21,
        22
      ],
      "feature_ids": [
        23,
        17,
        24,
        19
      ],
      "membership_ids": [
        9
      ]
    }
  ],
  "counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "window_lineage_sample": {
    "window_id": 22,
    "asset_id": 20,
    "song_id": 9,
    "title": "song beta",
    "start_ms": 1000,
    "end_ms": 6000
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
{"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000}, {"start_ms": 2500, "end_ms": 7500}, {"start_ms": 3000, "end_ms": 8000}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000}, {"start_ms": 1000, "end_ms": 6000}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
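The window lists in the two manifest rows above are consistent with a stride-based scan (`window_ms=5000`, `stride_ms=2500`) followed by one extra window aligned to the end of the clip. The builder script itself is not part of this commit view, so the sketch below is inferred from the sample data only; `sliding_windows` is a hypothetical name, and the handling of clips shorter than one window is an assumption.

```python
def sliding_windows(duration_ms: int, window_ms: int = 5000, stride_ms: int = 2500) -> list[dict]:
    """Cover a clip with fixed-size windows, adding an end-aligned tail window if needed."""
    if duration_ms <= window_ms:
        # Assumption: a clip shorter than one window yields a single truncated window.
        return [{"start_ms": 0, "end_ms": duration_ms}]

    windows = []
    start = 0
    # Regular stride-based windows that fit entirely inside the clip.
    while start + window_ms <= duration_ms:
        windows.append({"start_ms": start, "end_ms": start + window_ms})
        start += stride_ms

    # If the last regular window stops short of the clip end, add a tail window
    # that ends exactly at the clip end.
    if windows[-1]["end_ms"] < duration_ms:
        windows.append({"start_ms": duration_ms - window_ms, "end_ms": duration_ms})
    return windows


if __name__ == "__main__":
    print(sliding_windows(8000))  # clip1.wav has duration_ms = 8000
    print(sliding_windows(6000))  # clip2.wav has duration_ms = 6000
```

With the two sample durations this yields three windows for the 8000 ms clip and two for the 6000 ms clip, matching the five windows counted in the build report.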
{"song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 8000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 2500, "end_ms": 7500, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:2500:7500", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 3000, "end_ms": 8000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "dc0c731425f360787f462da693ff4a50", "checksum": "chromaprint:dc0c731425f36078", "metadata_json": {"hash_count": 2643, "hash_sample": [[1842187, 11], [1842188, 11], [1842189, 11], [1842201, 11], [1842212, 11], [1842213, 11], [1842214, 11], [1842438, 11]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://593c7a661cc87444:3000:8000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:593c7a661cc87444", "metadata_json": {"energy": 30555200, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{"song": {"biz_key": "song_beta", "title": "song beta", "artist_name": "artist b"}, "asset": {"source_type": "official", "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "storage_scheme": "file", "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_beta/artist_b/clip2.wav", "codec": "wav", "sample_rate": 16000, "channels": 1, "duration_ms": 6000}, "windows": [{"start_ms": 0, "end_ms": 5000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:0:5000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}, {"start_ms": 1000, "end_ms": 6000, "features": [{"feature_type": "fingerprint", "model_name": "chromaprint_matcher", "model_version": "phase1_local", "feature_set_name": "chromaprint_matcher_5s", "fingerprint_value": "d8fc2442b4ec3ce5ae180c5845cffccb", "checksum": "chromaprint:d8fc2442b4ec3ce5", "metadata_json": {"hash_count": 2202, "hash_sample": [[2763289, 23], [2763524, 23], [2763541, 23], [2763549, 23], [2763566, 23], [2763801, 23], [2764050, 23], [2764075, 23]]}}, {"feature_type": "embedding", "model_name": "local_wavehash_embed", "model_version": "v1", "feature_set_name": "wavehash_embed_5s", "feature_schema_ver": "v1", "embedding_dim": 8, "embedding_uri": "inline://4ed2ccfa55b10b88:1000:6000", "vector_table_name": "audio_embedding_vector_8_placeholder", "checksum": "emb:4ed2ccfa55b10b88", "metadata_json": {"energy": 30555680, "rate": 16000, "channels": 1, "semantic_backend": "local_fallback", "runtime_missing": ["torch", "torchaudio", "transformers"]}}]}], "memberships": [{"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}]}
{
  "schema": "acr_songcentric_test",
  "input_root": "acr-engine/data/songcentric_builder_smoke",
  "steps": [
    {
      "name": "build_manifest",
      "command": "/usr/local/miniconda3/bin/python acr-engine/scripts/build_songcentric_manifest_from_directory.py --input-root acr-engine/data/songcentric_builder_smoke --output acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl --report-output acr-engine/data/pgvector_eval/music20/songcentric_pipeline_build_report.json",
      "returncode": 0
    },
    {
      "name": "enrich_features",
      "command": "/usr/local/miniconda3/bin/python acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py --input-manifest acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl --output-manifest acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl --report-output acr-engine/data/pgvector_eval/music20/songcentric_pipeline_enrich_report.json",
      "returncode": 0
    },
    {
      "name": "import_manifest",
      "command": "/usr/local/miniconda3/bin/python acr-engine/scripts/import_songcentric_manifest_live.py --dsn postgres://d2:d2pass@127.0.0.1:5432/d2 --schema acr_songcentric_test --manifest acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl --output acr-engine/data/pgvector_eval/music20/songcentric_pipeline_import_report.json",
      "returncode": 0
    }
  ],
  "build_summary": {
    "input_root": "/workspace/acr-engine/data/songcentric_builder_smoke",
    "output": "/workspace/acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl",
    "song_count": 2,
    "asset_count": 2,
    "window_count": 5,
    "window_ms": 5000,
    "stride_ms": 2500,
    "set_name": "phase1_hot_reference_v1"
  },
  "enrich_summary": {
    "wav_windows_seen": 5,
    "features_added": 10,
    "matcher_fingerprint_count": 5,
    "fallback_fingerprint_count": 0,
    "semantic_runtime_available": false,
    "semantic_runtime_missing": [
      "torch",
      "torchaudio",
      "transformers"
    ],
    "semantic_runtime_ready_count": 0,
    "semantic_fallback_count": 5
  },
  "import_counts": {
    "media_entity": 9,
    "audio_object": 22,
    "feature_fact": 24,
    "set_membership": 9
  },
  "feature_lineage_sample": {
    "feature_type": "fingerprint",
    "model_name": "chromaprint_matcher",
    "model_version": "phase1_local",
    "feature_set_name": "chromaprint_matcher_5s",
    "window_id": 22,
    "song_id": 9,
    "title": "song beta"
  }
}
\ No newline at end of file
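Because the runner report above is meant to serve as the smoke/verification artifact, a caller may want to turn it into a pass/fail signal. The following is a minimal sketch under stated assumptions: `check_runner_report` is a hypothetical helper that is not part of this commit, and the two consistency rules (windows seen by enrichment equal windows built, and no fallback fingerprints) are assumptions drawn from the sample values, not documented guarantees of the pipeline.

```python
import json
import sys
from pathlib import Path


def check_runner_report(report: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the report looks healthy."""
    problems = []

    # Every recorded step should have exited with return code 0.
    for step in report.get("steps", []):
        if step.get("returncode") != 0:
            problems.append(f"step {step.get('name')} exited with {step.get('returncode')}")

    build = report.get("build_summary", {})
    enrich = report.get("enrich_summary", {})

    # Assumption: enrichment should see exactly the windows the build step produced.
    if enrich.get("wav_windows_seen") != build.get("window_count"):
        problems.append("enrich step saw a different number of windows than build produced")

    # Assumption: the exact lane is expected to use the matcher for every window.
    if enrich.get("fallback_fingerprint_count", 0) != 0:
        problems.append("some fingerprints used the fallback path instead of the matcher")

    return problems


if __name__ == "__main__":
    # Usage: python check_report.py path/to/songcentric_pipeline_runner_report.json
    found = check_runner_report(json.loads(Path(sys.argv[1]).read_text()))
    print("\n".join(found) if found else "OK")
    sys.exit(1 if found else 0)
```

The semantic lane is deliberately not treated as a failure here, since the commit lists true semantic runtime readiness as not tested on this host.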
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import argparse
import json
import subprocess
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]
PYTHON = '/usr/local/miniconda3/bin/python'


def run_step(name: str, cmd: list[str]) -> dict:
    proc = subprocess.run(cmd, cwd=str(ROOT.parent), capture_output=True, text=True)
    return {
        'name': name,
        'command': ' '.join(cmd),
        'returncode': proc.returncode,
        'stdout': proc.stdout,
        'stderr': proc.stderr,
    }


def load_json(path: Path) -> dict:
    return json.loads(path.read_text())


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('--dsn', required=True)
    parser.add_argument('--schema', default='acr_songcentric_test')
    parser.add_argument('--input-root', default='acr-engine/data/songcentric_builder_smoke')
    parser.add_argument('--output-dir', default='acr-engine/data/pgvector_eval/music20')
    args = parser.parse_args()

    out_dir = (ROOT.parent / args.output_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    manifest = out_dir / 'songcentric_pipeline_manifest.jsonl'
    build_report = out_dir / 'songcentric_pipeline_build_report.json'
    enriched_manifest = out_dir / 'songcentric_pipeline_manifest_with_features.jsonl'
    enrich_report = out_dir / 'songcentric_pipeline_enrich_report.json'
    import_report = out_dir / 'songcentric_pipeline_import_report.json'

    steps = []
    steps.append(run_step('build_manifest', [
        PYTHON, 'acr-engine/scripts/build_songcentric_manifest_from_directory.py',
        '--input-root', args.input_root,
        '--output', str(manifest.relative_to(ROOT.parent)),
        '--report-output', str(build_report.relative_to(ROOT.parent)),
    ]))
    if steps[-1]['returncode'] != 0:
        raise SystemExit(json.dumps({'failed_step': steps[-1]}, ensure_ascii=False, indent=2))

    steps.append(run_step('enrich_features', [
        PYTHON, 'acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py',
        '--input-manifest', str(manifest.relative_to(ROOT.parent)),
        '--output-manifest', str(enriched_manifest.relative_to(ROOT.parent)),
        '--report-output', str(enrich_report.relative_to(ROOT.parent)),
    ]))
    if steps[-1]['returncode'] != 0:
        raise SystemExit(json.dumps({'failed_step': steps[-1]}, ensure_ascii=False, indent=2))

    steps.append(run_step('import_manifest', [
        PYTHON, 'acr-engine/scripts/import_songcentric_manifest_live.py',
        '--dsn', args.dsn,
        '--schema', args.schema,
        '--manifest', str(enriched_manifest.relative_to(ROOT.parent)),
        '--output', str(import_report.relative_to(ROOT.parent)),
    ]))
    if steps[-1]['returncode'] != 0:
        raise SystemExit(json.dumps({'failed_step': steps[-1]}, ensure_ascii=False, indent=2))

    build = load_json(build_report)
    enrich = load_json(enrich_report)
    imp = load_json(import_report)

    summary = {
        'schema': args.schema,
        'input_root': args.input_root,
        'steps': [{k: v for k, v in s.items() if k in ('name', 'command', 'returncode')} for s in steps],
        'build_summary': build,
        'enrich_summary': {
            'wav_windows_seen': enrich['wav_windows_seen'],
            'features_added': enrich['features_added'],
            'matcher_fingerprint_count': enrich['matcher_fingerprint_count'],
            'fallback_fingerprint_count': enrich['fallback_fingerprint_count'],
            'semantic_runtime_available': enrich['semantic_runtime_available'],
            'semantic_runtime_missing': enrich['semantic_runtime_missing'],
            'semantic_runtime_ready_count': enrich['semantic_runtime_ready_count'],
            'semantic_fallback_count': enrich['semantic_fallback_count'],
        },
        'import_counts': imp['counts'],
        'feature_lineage_sample': imp.get('feature_lineage_sample'),
    }
    report_path = out_dir / 'songcentric_pipeline_runner_report.json'
    report_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2))
    print(json.dumps(summary, ensure_ascii=False, indent=2))
    return 0


if __name__ == '__main__':
    raise SystemExit(main())
## 2026-06-04

- Added `acr-engine/scripts/run_songcentric_directory_pipeline_live.py`, which consolidates "real directory -> manifest -> feature enrichment -> live PostgreSQL import" into a single repeatable runner and outputs a summary of the exact/semantic backend selection and the import counts.

- Upgraded `enrich_songcentric_manifest_with_local_features.py` to runtime-aware semantic adapter selection: because `torch/torchaudio/transformers` are missing on the current host, the semantic lane explicitly writes the `local_wavehash_embed` fallback and records the missing dependencies in the report/metadata.

- Upgraded `enrich_songcentric_manifest_with_local_features.py`: fingerprints in the directory chain now prefer the in-repo `ChromaprintMatcher`, and a run against live PostgreSQL verified that all 5 wav windows hit the matcher path with `fallback_fingerprint_count=0`
......
@@ -326,6 +326,25 @@ flowchart TD

This shows that the semantic lane on the current host is not yet connected to a real model, but the chain already has explicit runtime branching and auditable evidence.


### 4.10 One-command song-centric directory-chain runner

```mermaid
flowchart TD
    A[run_songcentric_directory_pipeline_live.py] --> B[build manifest]
    B --> C[enrich features]
    C --> D[import manifest]
    D --> E[runner report]
```

Current runner: [`acr-engine/scripts/run_songcentric_directory_pipeline_live.py`](../acr-engine/scripts/run_songcentric_directory_pipeline_live.py)

With a single command it outputs:
- the directory scan results
- whether the exact lane went through `ChromaprintMatcher`
- whether the semantic lane is runtime-ready
- the counts and a lineage sample after the live PostgreSQL import
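
As an illustration of how the four items listed above map onto the fields of `songcentric_pipeline_runner_report.json`, here is a minimal sketch. `summarize_runner_report` is a hypothetical helper, not part of the runner; the field names are the ones visible in the runner report in this commit.

```python
def summarize_runner_report(report: dict) -> list[str]:
    """Render the four headline items of a runner report as short text lines."""
    build = report["build_summary"]
    enrich = report["enrich_summary"]
    return [
        # 1. Directory scan results.
        f"scan: {build['song_count']} songs, {build['asset_count']} assets, "
        f"{build['window_count']} windows",
        # 2. Exact lane: how many windows used the matcher versus the fallback.
        f"exact lane: {enrich['matcher_fingerprint_count']} matcher, "
        f"{enrich['fallback_fingerprint_count']} fallback",
        # 3. Semantic lane: runtime readiness and any missing modules.
        f"semantic lane: runtime-ready={enrich['semantic_runtime_available']}, "
        f"missing={enrich['semantic_runtime_missing']}",
        # 4. Row counts after the live PostgreSQL import.
        f"import counts: {report['import_counts']}",
    ]


if __name__ == "__main__":
    sample = {
        "build_summary": {"song_count": 2, "asset_count": 2, "window_count": 5},
        "enrich_summary": {
            "matcher_fingerprint_count": 5,
            "fallback_fingerprint_count": 0,
            "semantic_runtime_available": False,
            "semantic_runtime_missing": ["torch", "torchaudio", "transformers"],
        },
        "import_counts": {"media_entity": 9, "audio_object": 22,
                          "feature_fact": 24, "set_membership": 9},
    }
    for line in summarize_runner_report(sample):
        print(line)
```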

---

## 5. Most common SQL examples
......
@@ -153,6 +153,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint ->
- `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py`
- `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py`
- `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py`
- `acr-engine/scripts/run_songcentric_directory_pipeline_live.py`

---

......