Bridge pgvector exports toward actual PostgreSQL bulk ingestion

Constraint: Schema and manifest-export templates are useful, but practical adoption still needs an explicit handoff into database load order and SQL shapes
Rejected: Stop at export JSON only | Leaves later sessions to redesign the bulk-ingest bridge from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep bulk-load templates declarative until a real database target is available, then add a live loader without changing manifest semantics
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/pgvector_bulk_load_template.py; /usr/local/miniconda3/bin/python acr-engine/scripts/pgvector_bulk_load_template.py --input acr-engine/reports/pgvector_manifest_export_test.json --output acr-engine/reports/pgvector_bulk_load_plan_test.json
Not-tested: Live PostgreSQL execution remains pending a database environment

Commit 44bbfcb5 ... 44bbfcb50895dad287867165c5a5c15943dc6ec6 authored 2026-06-02 13:51:37 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 132 additions and 0 deletions
acr-engine/reports/pgvector_bulk_load_plan_test.json
acr-engine/scripts/pgvector_bulk_load_template.py
docs/CHANGELOG.md
docs/training-data-and-pgvector-guide.md
--- a/acr-engine/reports/pgvector_bulk_load_plan_test.json 0 → 100644
+++ b/acr-engine/reports/pgvector_bulk_load_plan_test.json 0 → 100644
--- a/acr-engine/scripts/pgvector_bulk_load_template.py 0 → 100755
+++ b/acr-engine/scripts/pgvector_bulk_load_template.py 0 → 100755
+#!/usr/bin/env python3
+"""Template bulk loader for pgvector-related metadata tables.
+This script intentionally avoids requiring psycopg at runtime for now.
+It produces the SQL statements and row payloads that a future live loader can
+execute via COPY or execute_batch.
+"""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+SQL_STATEMENTS = {
+    "songs": """
+INSERT INTO songs (song_id, title, artist, version_id, source_dataset, license)
+VALUES (%(song_id)s, %(title)s, %(artist)s, %(version_id)s, %(source_dataset)s, %(license)s)
+ON CONFLICT (song_id) DO UPDATE SET
+    title = EXCLUDED.title,
+    artist = EXCLUDED.artist,
+    version_id = EXCLUDED.version_id,
+    source_dataset = EXCLUDED.source_dataset,
+    license = EXCLUDED.license;
+""".strip(),
+    # REFERENCES is a reserved word in PostgreSQL, so the table name must be quoted.
+    "references": """
+INSERT INTO "references" (song_id, audio_uri, duration_sec, sample_rate)
+VALUES (%(song_id)s, %(audio_uri)s, %(duration_sec)s, %(sample_rate)s);
+""".strip(),
+    "segments": """
+INSERT INTO segments (song_id, audio_uri, offset_sec, duration_sec, split, type, segment_type, source_dataset)
+VALUES (%(song_id)s, %(audio_uri)s, %(offset_sec)s, %(duration_sec)s, %(split)s, %(type)s, %(segment_type)s, %(source_dataset)s);
+""".strip(),
+}
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--input", required=True, help="JSON exported by export_manifest_to_pgvector_json.py")
+    parser.add_argument("--output", required=True, help="Output JSON plan for later DB execution")
+    args = parser.parse_args()
+    payload = json.loads(Path(args.input).read_text(encoding="utf-8"))
+    plan = {
+        "counts": {
+            "songs": len(payload.get("songs", [])),
+            "references": len(payload.get("references", [])),
+            "segments": len(payload.get("segments", [])),
+        },
+        "sql": SQL_STATEMENTS,
+        "rows": {
+            "songs": payload.get("songs", []),
+            "references": payload.get("references", []),
+            "segments": payload.get("segments", []),
+        },
+        "notes": [
+            "Execute songs before references and segments.",
+            "Embedding rows should be loaded only after reference_id/segment_id resolution.",
+            "A live loader can replace row-wise inserts with COPY/execute_batch.",
+        ],
+    }
+    out = Path(args.output)
+    out.parent.mkdir(parents=True, exist_ok=True)
+    out.write_text(json.dumps(plan, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(json.dumps({
+        "status": "ok",
+        "output": str(out.resolve()),
+        **plan["counts"],
+    }, indent=2, ensure_ascii=False))
+if __name__ == "__main__":
+    main()
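
Because the plan stays declarative, it can be sanity-checked before any database exists. The sketch below is one possible check, not part of this commit: it verifies that every `%(name)s` placeholder in each SQL template has a matching key in every row. The helper name `missing_placeholders` and the sample plan are hypothetical; only the `sql` / `rows` layout is taken from the script above.

```python
import re

# Matches named placeholders of the form %(song_id)s used in the SQL templates.
PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def missing_placeholders(plan):
    """Return {table: sorted missing keys} for tables whose rows lack a key.

    Hypothetical helper: assumes the plan layout written by
    pgvector_bulk_load_template.py ("sql" and "rows" keyed by table name).
    """
    problems = {}
    for table, sql in plan["sql"].items():
        needed = set(PLACEHOLDER.findall(sql))
        missing = set()
        for row in plan["rows"].get(table, []):
            # Collect every placeholder name that this row cannot supply.
            missing |= needed - set(row)
        if missing:
            problems[table] = sorted(missing)
    return problems


# Illustrative plan, not real data: the second row has no "title" key.
sample_plan = {
    "sql": {
        "songs": "INSERT INTO songs (song_id, title) "
                 "VALUES (%(song_id)s, %(title)s);",
    },
    "rows": {
        "songs": [{"song_id": "s1", "title": "A"}, {"song_id": "s2"}],
    },
}

print(missing_placeholders(sample_plan))
```

An empty result means every row can bind every placeholder in its template. It says nothing about column types or constraints, which only a live database can check.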
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -235,6 +235,29 @@
+### Stage: pgvector bulk load plan template
+Completed:
+- Added [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
+- Added a PostgreSQL bulk-load plan template for the pgvector export results
+- Documented it in [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
+Verification:
+- `/usr/local/miniconda3/bin/python -m py_compile scripts/pgvector_bulk_load_template.py` succeeded
+- `/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py --input reports/pgvector_manifest_export_test.json --output reports/pgvector_bulk_load_plan_test.json` succeeded
+- Current results:
+  - `songs=24`
+  - `references=24`
+  - `segments=20`
+Conclusion:
+- The pgvector track now has:
+  - a schema template
+  - a manifest export template
+  - a bulk-load plan template
+- When a real PostgreSQL instance is connected later, only the live loader is missing; the data entry path does not have to be designed from scratch
 ### Stage: pgvector database-load template
 Completed:
--- a/docs/training-data-and-pgvector-guide.md
+++ b/docs/training-data-and-pgvector-guide.md
@@ -539,6 +539,40 @@ cd acr-engine
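 2. Later you can load these rows into PostgreSQL with bulk insert / COPY / ETL
 3. Write embeddings into the `vector(192)` column once they have been generated
+### Bulk load plan template
+The repository now also includes:
+- [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
+It takes the manifest-friendly JSON exported in the previous step and organizes it further into:
+- SQL statement templates
+- songs / references / segments row data
+- load-order notes
+Example: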
+```bash
+cd acr-engine
+/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py \
+  --input reports/pgvector_manifest_export_test.json \
+  --output reports/pgvector_bulk_load_plan_test.json
+```
+Currently verified results:
+- `songs=24`
+- `references=24`
+- `segments=20`
+With this in place, connecting a real PostgreSQL instance later can proceed in three steps:
+1. manifest -> pgvector-friendly JSON
+2. JSON -> bulk load plan
+3. bulk load plan -> actual PostgreSQL / pgvector writes
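
Step 3 is not implemented in this commit. The sketch below shows one possible shape for it, assuming a DB-API style cursor whose `execute(sql, params)` accepts named `%(key)s` parameters, as psycopg cursors do. `LOAD_ORDER`, `load_plan`, `RecordingCursor`, and `demo_plan` are hypothetical names. The stub cursor only records calls, so the sketch runs without a database.

```python
# Load order follows the plan's notes: songs before references and segments.
LOAD_ORDER = ("songs", "references", "segments")


def load_plan(cursor, plan):
    """Execute each table's SQL template once per row, in LOAD_ORDER.

    Returns the number of rows sent per table. Committing or rolling back
    is left to the caller.
    """
    executed = {}
    for table in LOAD_ORDER:
        sql = plan["sql"][table]
        rows = plan["rows"].get(table, [])
        for row in rows:
            cursor.execute(sql, row)
        executed[table] = len(rows)
    return executed


class RecordingCursor:
    """Stand-in for a real cursor: stores calls instead of running them."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params=None):
        self.calls.append((sql, params))


# Illustrative plan, not real data; dict order deliberately differs
# from LOAD_ORDER to show that the loader imposes the order itself.
demo_plan = {
    "sql": {
        "songs": "INSERT INTO songs (song_id) VALUES (%(song_id)s);",
        "references": 'INSERT INTO "references" (song_id) VALUES (%(song_id)s);',
        "segments": "INSERT INTO segments (song_id) VALUES (%(song_id)s);",
    },
    "rows": {
        "segments": [{"song_id": "s1"}, {"song_id": "s1"}],
        "songs": [{"song_id": "s1"}],
        "references": [{"song_id": "s1"}],
    },
}

cursor = RecordingCursor()
counts = load_plan(cursor, demo_plan)
print(counts)
```

With a real connection, the same function could be handed a cursor obtained from the driver instead of the stub. Row-wise `execute` is the simplest option, and a later version could switch to COPY or batched execution without changing the plan format. Embedding rows are out of scope here, since the plan notes say they should wait for `reference_id` / `segment_id` resolution.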
 ## Sources
 - Current code behavior from: