Bridge pgvector exports toward actual PostgreSQL bulk ingestion

Constraint: Schema and manifest-export templates are useful, but practical adoption still needs an explicit handoff into database load order and SQL shapes
Rejected: Stop at export JSON only | Leaves later sessions to redesign the bulk-ingest bridge from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep bulk-load templates declarative until a real database target is available, then add a live loader without changing manifest semantics
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/pgvector_bulk_load_template.py; /usr/local/miniconda3/bin/python acr-engine/scripts/pgvector_bulk_load_template.py --input acr-engine/reports/pgvector_manifest_export_test.json --output acr-engine/reports/pgvector_bulk_load_plan_test.json
Not-tested: Live PostgreSQL execution remains pending a database environment
Showing 4 changed files with 132 additions and 0 deletions
```python
#!/usr/bin/env python3
"""Template bulk loader for pgvector-related metadata tables.

This script intentionally avoids requiring psycopg at runtime for now.
It produces the SQL statements and row payloads that a future live loader can
execute via COPY or execute_batch.
"""

from __future__ import annotations

import argparse
import json
from pathlib import Path


# Parameterized statements keyed by table name. Placeholders use the
# named %(key)s style, so each row dict can be passed as the parameters.
SQL_STATEMENTS = {
    "songs": """
INSERT INTO songs (song_id, title, artist, version_id, source_dataset, license)
VALUES (%(song_id)s, %(title)s, %(artist)s, %(version_id)s, %(source_dataset)s, %(license)s)
ON CONFLICT (song_id) DO UPDATE SET
    title = EXCLUDED.title,
    artist = EXCLUDED.artist,
    version_id = EXCLUDED.version_id,
    source_dataset = EXCLUDED.source_dataset,
    license = EXCLUDED.license;
""".strip(),
    # REFERENCES is a reserved word in PostgreSQL, so the table name is quoted.
    "references": """
INSERT INTO "references" (song_id, audio_uri, duration_sec, sample_rate)
VALUES (%(song_id)s, %(audio_uri)s, %(duration_sec)s, %(sample_rate)s);
""".strip(),
    "segments": """
INSERT INTO segments (song_id, audio_uri, offset_sec, duration_sec, split, type, segment_type, source_dataset)
VALUES (%(song_id)s, %(audio_uri)s, %(offset_sec)s, %(duration_sec)s, %(split)s, %(type)s, %(segment_type)s, %(source_dataset)s);
""".strip(),
}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="JSON exported by export_manifest_to_pgvector_json.py")
    parser.add_argument("--output", required=True, help="Output JSON plan for later DB execution")
    args = parser.parse_args()

    # Read explicitly as UTF-8; the plan is written with ensure_ascii=False.
    payload = json.loads(Path(args.input).read_text(encoding="utf-8"))
    plan = {
        "counts": {
            "songs": len(payload.get("songs", [])),
            "references": len(payload.get("references", [])),
            "segments": len(payload.get("segments", [])),
        },
        "sql": SQL_STATEMENTS,
        "rows": {
            "songs": payload.get("songs", []),
            "references": payload.get("references", []),
            "segments": payload.get("segments", []),
        },
        "notes": [
            "Execute songs before references and segments.",
            "Embedding rows should be loaded only after reference_id/segment_id resolution.",
            "A live loader can replace row-wise inserts with COPY/execute_batch.",
        ],
    }

    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(plan, indent=2, ensure_ascii=False), encoding="utf-8")

    # Print a short machine-readable summary: status, output path, row counts.
    print(json.dumps({
        "status": "ok",
        "output": str(out.resolve()),
        **plan["counts"],
    }, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
```
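Because the plan pairs each `%(name)s` statement with raw row dicts, a row that lacks one of the named keys would only fail once a database executes it. The following is a minimal sketch of a pre-flight check, assuming the plan keeps the `sql` / `rows` layout written by the script above. The names `PLACEHOLDER`, `missing_keys`, and `validate_plan` are hypothetical and not part of this commit.

```python
from __future__ import annotations

import re

# Matches named placeholders such as %(song_id)s (hypothetical helper).
PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def missing_keys(sql: str, rows: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (row_index, missing_names) for rows that lack a placeholder key."""
    required = sorted(set(PLACEHOLDER.findall(sql)))
    problems = []
    for index, row in enumerate(rows):
        missing = [name for name in required if name not in row]
        if missing:
            problems.append((index, missing))
    return problems


def validate_plan(plan: dict) -> dict[str, list[tuple[int, list[str]]]]:
    """Check every table's rows against that table's SQL template."""
    return {
        table: missing_keys(sql, plan["rows"].get(table, []))
        for table, sql in plan["sql"].items()
    }
```

An empty list for every table means each row supplies all the parameters its statement names; it says nothing about types or foreign keys, which only the database can confirm.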
@@ -235,6 +235,29 @@

### Stage: pgvector bulk load plan template

Completed:
- Added [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
- Added a PostgreSQL bulk-load plan template for the pgvector export output
- Documented it in [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)

Verification:
- `/usr/local/miniconda3/bin/python -m py_compile scripts/pgvector_bulk_load_template.py` succeeded
- `/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py --input reports/pgvector_manifest_export_test.json --output reports/pgvector_bulk_load_plan_test.json` succeeded
- Current result:
  - `songs=24`
  - `references=24`
  - `segments=20`

Conclusion:
- The pgvector track now has:
  - a schema template
  - a manifest export template
  - a bulk-load plan template
- Connecting a real PostgreSQL instance later only requires a live loader, not a data entry point designed from scratch

### Stage: pgvector database persistence template

Completed:
...
@@ -539,6 +539,40 @@ cd acr-engine
2. You can later load these rows into PostgreSQL with bulk insert / COPY / ETL
3. Once embeddings are generated, write them into the `vector(192)` column

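For the embedding step, pgvector accepts a vector as a bracketed, comma-separated text literal such as `[0.1,0.2,0.3]`. The sketch below shows one way to build that literal and check it against the 192 dimensions of the column before anything is sent to the database. `to_vector_literal` is a hypothetical helper, not part of the repository.

```python
from __future__ import annotations

import math
from typing import Sequence


def to_vector_literal(embedding: Sequence[float], dim: int = 192) -> str:
    """Format an embedding as a pgvector text literal, e.g. "[0.5,1.0]"."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    values = [float(x) for x in embedding]
    # pgvector does not accept NaN or infinite components.
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(x) for x in values) + "]"
```

The resulting string can be bound as an ordinary query parameter and cast with `::vector` on the SQL side; catching a dimension mismatch in Python gives a clearer error than a failed insert halfway through a batch.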
### Bulk load plan template

The repository now also includes:

- [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)

It takes the manifest-friendly JSON exported in the previous step and organizes it further into:

- SQL statement templates
- row data for songs / references / segments
- notes on load order

Example:

```bash
cd acr-engine
/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py \
  --input reports/pgvector_manifest_export_test.json \
  --output reports/pgvector_bulk_load_plan_test.json
```

Verified result so far:

- `songs=24`
- `references=24`
- `segments=20`

With this in place, connecting a real PostgreSQL instance later can proceed in three steps:

1. manifest -> pgvector-friendly JSON
2. JSON -> bulk load plan
3. bulk load plan -> actual write into PostgreSQL / pgvector

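For step 3, the plan notes mention COPY as an alternative to row-wise inserts. The sketch below shows how the plan's row dicts might be turned into CSV text plus a matching `COPY ... FROM STDIN` statement. `rows_to_csv` and `copy_statement` are hypothetical helpers, and the column lists are assumed to mirror those in the INSERT templates.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, Sequence


def rows_to_csv(rows: Iterable[dict], columns: Sequence[str]) -> str:
    """Serialize row dicts to CSV in a fixed column order.

    Missing keys and None become empty fields, which COPY's csv format
    reads as NULL by default (empty strings are not distinguished).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row.get(column) for column in columns])
    return buffer.getvalue()


def copy_statement(table: str, columns: Sequence[str]) -> str:
    """Build a COPY statement; the table name is quoted because
    "references" is a reserved word in PostgreSQL."""
    return f'COPY "{table}" ({", ".join(columns)}) FROM STDIN WITH (FORMAT csv)'
```

With psycopg2, `cursor.copy_expert(sql, file)` is one way to stream such a CSV buffer. COPY has no ON CONFLICT clause, so keeping the upsert behavior of the `songs` statement would require a different approach, for example copying into a staging table first.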
## Sources

- Current code behavior from:
...