Commit 44bbfcb5 44bbfcb50895dad287867165c5a5c15943dc6ec6 by cnb.bofCdSsphPA

Bridge pgvector exports toward actual PostgreSQL bulk ingestion

Constraint: Schema and manifest-export templates are useful, but practical adoption still needs an explicit handoff into database load order and SQL shapes
Rejected: Stop at export JSON only | Leaves later sessions to redesign the bulk-ingest bridge from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep bulk-load templates declarative until a real database target is available, then add a live loader without changing manifest semantics
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/pgvector_bulk_load_template.py; /usr/local/miniconda3/bin/python acr-engine/scripts/pgvector_bulk_load_template.py --input acr-engine/reports/pgvector_manifest_export_test.json --output acr-engine/reports/pgvector_bulk_load_plan_test.json
Not-tested: Live PostgreSQL execution remains pending a database environment
1 parent 528cc473
#!/usr/bin/env python3
"""Template bulk loader for pgvector-related metadata tables.
This script intentionally avoids requiring psycopg at runtime for now.
It produces the SQL statements and row payloads that a future live loader can
execute via COPY or execute_batch.
"""
from __future__ import annotations
import argparse
import json
from pathlib import Path
SQL_STATEMENTS = {
"songs": """
INSERT INTO songs (song_id, title, artist, version_id, source_dataset, license)
VALUES (%(song_id)s, %(title)s, %(artist)s, %(version_id)s, %(source_dataset)s, %(license)s)
ON CONFLICT (song_id) DO UPDATE SET
title = EXCLUDED.title,
artist = EXCLUDED.artist,
version_id = EXCLUDED.version_id,
source_dataset = EXCLUDED.source_dataset,
license = EXCLUDED.license;
""".strip(),
"references": """
INSERT INTO references (song_id, audio_uri, duration_sec, sample_rate)
VALUES (%(song_id)s, %(audio_uri)s, %(duration_sec)s, %(sample_rate)s);
""".strip(),
"segments": """
INSERT INTO segments (song_id, audio_uri, offset_sec, duration_sec, split, type, segment_type, source_dataset)
VALUES (%(song_id)s, %(audio_uri)s, %(offset_sec)s, %(duration_sec)s, %(split)s, %(type)s, %(segment_type)s, %(source_dataset)s);
""".strip(),
}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True, help="JSON exported by export_manifest_to_pgvector_json.py")
parser.add_argument("--output", required=True, help="Output JSON plan for later DB execution")
args = parser.parse_args()
payload = json.loads(Path(args.input).read_text())
plan = {
"counts": {
"songs": len(payload.get("songs", [])),
"references": len(payload.get("references", [])),
"segments": len(payload.get("segments", [])),
},
"sql": SQL_STATEMENTS,
"rows": {
"songs": payload.get("songs", []),
"references": payload.get("references", []),
"segments": payload.get("segments", []),
},
"notes": [
"Execute songs before references and segments.",
"Embedding rows should be loaded only after reference_id/segment_id resolution.",
"A live loader can replace row-wise inserts with COPY/execute_batch.",
],
}
out = Path(args.output)
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(plan, indent=2, ensure_ascii=False))
print(json.dumps({
"status": "ok",
"output": str(out.resolve()),
**plan["counts"],
}, indent=2, ensure_ascii=False))
if __name__ == "__main__":
main()
......@@ -235,6 +235,29 @@
### Stage: pgvector bulk load plan 模板
完成项:
- 新增 [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
- 为 pgvector 导出结果补充 PostgreSQL bulk-load plan 模板
-[docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) 中补充对应说明
验证结果:
- `/usr/local/miniconda3/bin/python -m py_compile scripts/pgvector_bulk_load_template.py` 成功
- `/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py --input reports/pgvector_manifest_export_test.json --output reports/pgvector_bulk_load_plan_test.json` 成功
- 当前结果:
- `songs=24`
- `references=24`
- `segments=20`
结论:
- pgvector 方向现在已经具备:
- schema 模板
- manifest 导出模板
- bulk-load plan 模板
- 后续接真实 PostgreSQL 时,只差 live loader,而不是从零设计数据入口
### Stage: pgvector 落库模板
完成项:
......
......@@ -539,6 +539,40 @@ cd acr-engine
2. 后续你们可以再用 bulk insert / COPY / ETL 把这些行落到 PostgreSQL
3. embedding 生成后再写入 `vector(192)`
### Bulk load plan 模板
仓库里现在还新增了:
- [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
它会把前一步导出的 manifest-friendly JSON,进一步整理成:
- SQL 语句模板
- songs / references / segments 行数据
- 导入顺序说明
示例:
```bash
cd acr-engine
/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py \
--input reports/pgvector_manifest_export_test.json \
--output reports/pgvector_bulk_load_plan_test.json
```
当前已验证结果:
- `songs=24`
- `references=24`
- `segments=20`
这样后续如果你们接真实 PostgreSQL,可以分三步走:
1. manifest -> pgvector-friendly JSON
2. JSON -> bulk load plan
3. bulk load plan -> PostgreSQL / pgvector 实际写入
## Sources
- Current code behavior from:
......