Import song-centric manifests into live PostgreSQL with idempotent upserts

Constraint: Extend the current 4-table song-centric schema with a practical manifest ingestion path without introducing the older split-table model or hidden side metadata tables.
Rejected: Leave ingestion as handwritten SQL or one-off bootstrap logic | It slows real asset onboarding and makes repeatability hard to verify.
Confidence: high
Scope-risk: narrow
Directive: Use import_songcentric_manifest_live.py plus a manifest JSONL as the default path for batch asset/window onboarding into the fused schema.
Tested: /usr/local/miniconda3/bin/python acr-engine/scripts/import_songcentric_manifest_live.py --dsn postgres://d2:d2pass@127.0.0.1:5432/d2 --schema acr_songcentric_test --manifest acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl; repeated the import and verified counts remained media_entity=5, audio_object=11, feature_fact=6, set_membership=5; git diff --check; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: feature_fact generation during manifest import and large-scale manifest throughput

Commit 8002dfb0 ... 8002dfb03ecaffefd2ee691b9fd275131a159cf4 authored 2026-06-04 14:45:31 +0800 by cnb.bofCdSsphPA
Showing 6 changed files with 248 additions and 0 deletions
acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report.json
acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report_rerun.json
acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl
acr-engine/scripts/import_songcentric_manifest_live.py
docs/CHANGELOG.md
docs/postgres_db_schema_samples.md
--- a/acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report.json 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report.json 0 → 100644
+{
+  "schema": "acr_songcentric_test",
+  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl",
+  "imported": [
+    {
+      "song_id": 4,
+      "asset_id": 7,
+      "window_ids": [
+        8,
+        9
+      ],
+      "membership_ids": [
+        4
+      ]
+    },
+    {
+      "song_id": 5,
+      "asset_id": 10,
+      "window_ids": [
+        11
+      ],
+      "membership_ids": [
+        5
+      ]
+    }
+  ],
+  "counts": {
+    "media_entity": 5,
+    "audio_object": 11,
+    "feature_fact": 6,
+    "set_membership": 5
+  },
+  "window_lineage_sample": {
+    "window_id": 11,
+    "asset_id": 10,
+    "song_id": 5,
+    "title": "Manifest Song 2",
+    "start_ms": 5000,
+    "end_ms": 10000
+  }
+}
\ No newline at end of file
--- a/acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report_rerun.json 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/songcentric_manifest_import_report_rerun.json 0 → 100644
+{
+  "schema": "acr_songcentric_test",
+  "manifest": "acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl",
+  "imported": [
+    {
+      "song_id": 4,
+      "asset_id": 7,
+      "window_ids": [
+        8,
+        9
+      ],
+      "membership_ids": [
+        4
+      ]
+    },
+    {
+      "song_id": 5,
+      "asset_id": 10,
+      "window_ids": [
+        11
+      ],
+      "membership_ids": [
+        5
+      ]
+    }
+  ],
+  "counts": {
+    "media_entity": 5,
+    "audio_object": 11,
+    "feature_fact": 6,
+    "set_membership": 5
+  },
+  "window_lineage_sample": {
+    "window_id": 11,
+    "asset_id": 10,
+    "song_id": 5,
+    "title": "Manifest Song 2",
+    "start_ms": 5000,
+    "end_ms": 10000
+  }
+}
\ No newline at end of file
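
The two reports above are the evidence for the idempotency claim in the commit message: the rerun returns the same IDs and the same table counts. A minimal sketch of checking that automatically follows. The helper name `reports_match` is hypothetical and not part of the commit; the inline values are copied from the two report files.

```python
import json


def reports_match(first: dict, rerun: dict) -> bool:
    """Return True when a rerun reported the same IDs and table counts.

    Hypothetical helper: compares only the 'imported' and 'counts' keys
    that appear in the report files of this commit.
    """
    return (
        first.get("imported") == rerun.get("imported")
        and first.get("counts") == rerun.get("counts")
    )


# Values copied from songcentric_manifest_import_report.json.
first_report = {
    "imported": [
        {"song_id": 4, "asset_id": 7, "window_ids": [8, 9], "membership_ids": [4]},
        {"song_id": 5, "asset_id": 10, "window_ids": [11], "membership_ids": [5]},
    ],
    "counts": {
        "media_entity": 5,
        "audio_object": 11,
        "feature_fact": 6,
        "set_membership": 5,
    },
}

# The rerun report in this commit carries identical values; a JSON
# round-trip gives an independent deep copy for the comparison.
rerun_report = json.loads(json.dumps(first_report))

print(reports_match(first_report, rerun_report))
```

In practice the two dictionaries would be loaded from the report files with `json.loads(path.read_text())`; a mismatch in either key suggests the second run inserted rows instead of reusing existing ones.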
--- a/acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl 0 → 100644
+{"song":{"biz_key":"song-20001","title":"Manifest Song 1","artist_name":"Manifest Artist 1"},"asset":{"source_type":"official","storage_uri":"/workspace/downloads/song-20001/master.wav","storage_scheme":"file","checksum":"sha256:manifest-song-20001","codec":"wav","sample_rate":16000,"channels":1,"duration_ms":210000},"windows":[{"start_ms":10000,"end_ms":15000},{"start_ms":60000,"end_ms":65000}],"memberships":[{"set_type":"reference_set","set_name":"phase1_hot_reference_v1","member_type":"asset","priority":100}]}
+{"song":{"biz_key":"song-20002","title":"Manifest Song 2","artist_name":"Manifest Artist 2"},"asset":{"source_type":"ugc","storage_uri":"/workspace/downloads/song-20002/clip.wav","storage_scheme":"file","checksum":"sha256:manifest-song-20002","codec":"wav","sample_rate":16000,"channels":1,"duration_ms":95000},"windows":[{"start_ms":5000,"end_ms":10000}],"memberships":[{"set_type":"eval_set","set_name":"phase1_eval_v1","member_type":"asset","priority":50}]}
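
Each manifest line bundles one song, one asset, its windows, and its set memberships. The sketch below shows one way a line could be checked before import. The function `validate_manifest_row` is hypothetical and not part of the commit. The required keys are the ones the import script reads with `[...]` indexing; the window bound checks (`start_ms < end_ms`, `end_ms <= duration_ms`) are assumptions about what a sane manifest looks like, not rules the script enforces.

```python
from __future__ import annotations


def validate_manifest_row(row: dict) -> list[str]:
    """Return a list of problems found in one parsed manifest line."""
    problems: list[str] = []

    # Keys the import script accesses without a default.
    song = row.get("song", {})
    for key in ("biz_key", "title"):
        if not song.get(key):
            problems.append(f"song.{key} is missing")

    asset = row.get("asset", {})
    if not asset.get("checksum"):
        problems.append("asset.checksum is missing")
    duration = asset.get("duration_ms")

    for i, win in enumerate(row.get("windows", [])):
        start, end = win.get("start_ms"), win.get("end_ms")
        if start is None or end is None:
            problems.append(f"windows[{i}] needs start_ms and end_ms")
            continue
        # Assumed sanity rules, not enforced by the import script.
        if start >= end:
            problems.append(f"windows[{i}] has start_ms >= end_ms")
        if duration is not None and end > duration:
            problems.append(f"windows[{i}] ends after the asset")

    for i, member in enumerate(row.get("memberships", [])):
        for key in ("set_type", "set_name", "member_type"):
            if key not in member:
                problems.append(f"memberships[{i}].{key} is missing")

    return problems


# Second sample line from the manifest, with asset fields trimmed.
sample_row = {
    "song": {"biz_key": "song-20002", "title": "Manifest Song 2"},
    "asset": {"checksum": "sha256:manifest-song-20002", "duration_ms": 95000},
    "windows": [{"start_ms": 5000, "end_ms": 10000}],
    "memberships": [
        {"set_type": "eval_set", "set_name": "phase1_eval_v1", "member_type": "asset"}
    ],
}

print(validate_manifest_row(sample_row))
```

Collecting problems into a list rather than raising on the first one makes it easier to report every bad line of a large manifest in a single pass.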
--- a/acr-engine/scripts/import_songcentric_manifest_live.py 0 → 100755
+++ b/acr-engine/scripts/import_songcentric_manifest_live.py 0 → 100755
+#!/usr/bin/env /usr/local/miniconda3/bin/python
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import psycopg
+from psycopg.rows import dict_row
+
+
+def quote_ident(name: str) -> str:
+    return '"' + name.replace('"', '""') + '"'
+
+
+def load_jsonl(path: Path):
+    for line in path.read_text().splitlines():
+        line = line.strip()
+        if line:
+            yield json.loads(line)
+
+
+def ensure_song(cur, song: dict) -> int:
+    row = cur.execute(
+        "select entity_id from media_entity where entity_type='song' and biz_key=%s",
+        (song['biz_key'],),
+    ).fetchone()
+    if row:
+        return row['entity_id']
+    return cur.execute(
+        "insert into media_entity (entity_type,biz_key,title,artist_name) values ('song',%s,%s,%s) returning entity_id",
+        (song['biz_key'], song['title'], song.get('artist_name')),
+    ).fetchone()['entity_id']
+
+
+def ensure_asset(cur, song_id: int, asset: dict) -> int:
+    row = cur.execute(
+        "select object_id from audio_object where object_type='asset' and song_id=%s and checksum=%s",
+        (song_id, asset['checksum']),
+    ).fetchone()
+    if row:
+        return row['object_id']
+    return cur.execute(
+        """
+        insert into audio_object (
+            object_type,song_id,source_type,storage_uri,storage_scheme,checksum,codec,sample_rate,channels,duration_ms
+        ) values ('asset',%s,%s,%s,%s,%s,%s,%s,%s,%s) returning object_id
+        """,
+        (
+            song_id, asset.get('source_type'), asset.get('storage_uri'), asset.get('storage_scheme'),
+            asset.get('checksum'), asset.get('codec'), asset.get('sample_rate'), asset.get('channels'), asset.get('duration_ms'),
+        ),
+    ).fetchone()['object_id']
+
+
+def ensure_window(cur, song_id: int, asset_id: int, win: dict) -> int:
+    row = cur.execute(
+        "select object_id from audio_object where object_type='window' and parent_object_id=%s and start_ms=%s and end_ms=%s",
+        (asset_id, win['start_ms'], win['end_ms']),
+    ).fetchone()
+    if row:
+        return row['object_id']
+    return cur.execute(
+        "insert into audio_object (object_type,song_id,parent_object_id,start_ms,end_ms,duration_ms) values ('window',%s,%s,%s,%s,%s) returning object_id",
+        (song_id, asset_id, win['start_ms'], win['end_ms'], win['end_ms'] - win['start_ms']),
+    ).fetchone()['object_id']
+
+
+def ensure_membership(cur, m: dict, member_id: int, song_id: int) -> int:
+    row = cur.execute(
+        "select membership_id from set_membership where set_type=%s and set_name=%s and member_type=%s and member_id=%s",
+        (m['set_type'], m['set_name'], m['member_type'], member_id),
+    ).fetchone()
+    if row:
+        return row['membership_id']
+    return cur.execute(
+        "insert into set_membership (set_type,set_name,member_type,member_id,song_id,priority) values (%s,%s,%s,%s,%s,%s) returning membership_id",
+        (m['set_type'], m['set_name'], m['member_type'], member_id, song_id, m.get('priority', 100)),
+    ).fetchone()['membership_id']
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--dsn', required=True)
+    parser.add_argument('--schema', default='acr_songcentric_test')
+    parser.add_argument('--manifest', required=True)
+    parser.add_argument('--output', required=True)
+    args = parser.parse_args()
+
+    manifest_path = Path(args.manifest)
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    qschema = quote_ident(args.schema)
+
+    report = {
+        'schema': args.schema,
+        'manifest': str(manifest_path),
+        'imported': [],
+    }
+
+    with psycopg.connect(args.dsn, row_factory=dict_row) as conn:
+        with conn.cursor() as cur:
+            cur.execute(f'set search_path to {qschema}, public')
+            for row in load_jsonl(manifest_path):
+                song_id = ensure_song(cur, row['song'])
+                asset_id = ensure_asset(cur, song_id, row['asset'])
+                window_ids = [ensure_window(cur, song_id, asset_id, w) for w in row.get('windows', [])]
+                membership_ids = []
+                for m in row.get('memberships', []):
+                    member_id = asset_id if m['member_type'] == 'asset' else song_id
+                    membership_ids.append(ensure_membership(cur, m, member_id, song_id))
+                report['imported'].append({
+                    'song_id': song_id,
+                    'asset_id': asset_id,
+                    'window_ids': window_ids,
+                    'membership_ids': membership_ids,
+                })
+
+            counts = {}
+            for table in ['media_entity', 'audio_object', 'feature_fact', 'set_membership']:
+                counts[table] = cur.execute(f'select count(*) as c from {table}').fetchone()['c']
+            report['counts'] = counts
+            report['window_lineage_sample'] = cur.execute(
+                """
+                select win.object_id as window_id,
+                       ast.object_id as asset_id,
+                       song.entity_id as song_id,
+                       song.title,
+                       win.start_ms,
+                       win.end_ms
+                from audio_object win
+                join audio_object ast on ast.object_id = win.parent_object_id and ast.object_type='asset'
+                join media_entity song on song.entity_id = win.song_id and song.entity_type='song'
+                where win.object_type='window'
+                order by win.object_id desc
+                limit 1
+                """
+            ).fetchone()
+        conn.commit()
+
+    output_path.write_text(json.dumps(report, ensure_ascii=False, indent=2))
+    print(json.dumps(report, ensure_ascii=False, indent=2))
+    return 0
+
+
+if __name__ == '__main__':
+    raise SystemExit(main())
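
The `ensure_*` helpers in the script all follow the same lookup-then-insert pattern: select by a natural key, return the existing ID if found, otherwise insert and return the new ID. The sketch below reproduces that pattern for `ensure_song` against an in-memory SQLite database so that it runs without a PostgreSQL server. SQLite is only a stand-in here, and the `create table` statement is an assumed, simplified shape of `media_entity`, not the real schema from the repository.

```python
import sqlite3


def ensure_song(cur: sqlite3.Cursor, song: dict) -> int:
    """Return the ID of the song with this biz_key, inserting it if absent."""
    row = cur.execute(
        "select entity_id from media_entity "
        "where entity_type='song' and biz_key=?",
        (song["biz_key"],),
    ).fetchone()
    if row:
        return row[0]
    cur.execute(
        "insert into media_entity (entity_type,biz_key,title,artist_name) "
        "values ('song',?,?,?)",
        (song["biz_key"], song["title"], song.get("artist_name")),
    )
    # SQLite has no 'returning' in older versions; lastrowid plays that role.
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Assumed, simplified table shape for illustration only.
cur.execute(
    "create table media_entity ("
    "entity_id integer primary key, entity_type text not null, "
    "biz_key text not null, title text, artist_name text)"
)

song = {
    "biz_key": "song-20001",
    "title": "Manifest Song 1",
    "artist_name": "Manifest Artist 1",
}
first_id = ensure_song(cur, song)
second_id = ensure_song(cur, song)  # rerun: should find the existing row
row_count = cur.execute("select count(*) from media_entity").fetchone()[0]
conn.commit()

print(first_id, second_id, row_count)
```

This pattern is idempotent for a single writer, which matches the rerun check in the commit message. With concurrent importers, two sessions could both miss the lookup and both insert; guarding against that would need a unique constraint on the natural key, which this sketch does not assume exists.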
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 ## 2026-06-04

+- Added `acr-engine/scripts/import_songcentric_manifest_live.py` and the sample manifest `acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl`, moving the current 4-table schema to the stage where song/asset/window/set_membership records can be batch-imported from a JSONL manifest.
+
 - Added `acr-engine/scripts/bootstrap_songcentric_phase1_live.py`, moving the current 4-table schema from "single-row smoke write" to "repeatable Phase-1 bootstrap"; a fresh live initialization check against `acr_songcentric_test` is being prepared.

 - Added the formal SQL file `acr-engine/sql/acr_pg_schema_songcentric_v1.sql` and the live smoke script `acr-engine/scripts/smoke_songcentric_schema_live.py`, moving the 4 fusion-first tables from a documentation draft to an executable schema; a fresh validation on the user's PostgreSQL is being prepared.
--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -252,6 +252,21 @@ flowchart TD

 Current live bootstrap script: [`acr-engine/scripts/bootstrap_songcentric_phase1_live.py`](../acr-engine/scripts/bootstrap_songcentric_phase1_live.py)

+
+### 4.5 Manifest import flow
+
+```mermaid
+flowchart TD
+    A[songcentric_manifest_sample.jsonl] --> B[import_songcentric_manifest_live.py]
+    B --> C[media_entity song]
+    B --> D[audio_object asset]
+    B --> E[audio_object window x N]
+    B --> F[set_membership]
+```
+
+Current sample manifest: [`acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl`](../acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl)  
+Current import script: [`acr-engine/scripts/import_songcentric_manifest_live.py`](../acr-engine/scripts/import_songcentric_manifest_live.py)
+
 ---

 ## 5. Most commonly used SQL samples