Commit 44bbfcb5 44bbfcb50895dad287867165c5a5c15943dc6ec6 by cnb.bofCdSsphPA

Bridge pgvector exports toward actual PostgreSQL bulk ingestion

Constraint: Schema and manifest-export templates are useful, but practical adoption still needs an explicit handoff into database load order and SQL shapes
Rejected: Stop at export JSON only | Leaves later sessions to redesign the bulk-ingest bridge from scratch
Confidence: high
Scope-risk: narrow
Directive: Keep bulk-load templates declarative until a real database target is available, then add a live loader without changing manifest semantics
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/pgvector_bulk_load_template.py; /usr/local/miniconda3/bin/python acr-engine/scripts/pgvector_bulk_load_template.py --input acr-engine/reports/pgvector_manifest_export_test.json --output acr-engine/reports/pgvector_bulk_load_plan_test.json
Not-tested: Live PostgreSQL execution remains pending a database environment
1 parent 528cc473
#!/usr/bin/env python3
"""Template bulk loader for pgvector-related metadata tables.

This script intentionally avoids requiring psycopg at runtime for now.
It produces the SQL statements and row payloads that a future live loader can
execute via COPY or execute_batch.
"""

from __future__ import annotations

import argparse
import json
from pathlib import Path


# Parameterized statements using named %(name)s placeholders, keyed by table.
SQL_STATEMENTS = {
    "songs": """
INSERT INTO songs (song_id, title, artist, version_id, source_dataset, license)
VALUES (%(song_id)s, %(title)s, %(artist)s, %(version_id)s, %(source_dataset)s, %(license)s)
ON CONFLICT (song_id) DO UPDATE SET
    title = EXCLUDED.title,
    artist = EXCLUDED.artist,
    version_id = EXCLUDED.version_id,
    source_dataset = EXCLUDED.source_dataset,
    license = EXCLUDED.license;
""".strip(),
    # REFERENCES is a reserved word in PostgreSQL, so the table name is quoted.
    "references": """
INSERT INTO "references" (song_id, audio_uri, duration_sec, sample_rate)
VALUES (%(song_id)s, %(audio_uri)s, %(duration_sec)s, %(sample_rate)s);
""".strip(),
    "segments": """
INSERT INTO segments (song_id, audio_uri, offset_sec, duration_sec, split, type, segment_type, source_dataset)
VALUES (%(song_id)s, %(audio_uri)s, %(offset_sec)s, %(duration_sec)s, %(split)s, %(type)s, %(segment_type)s, %(source_dataset)s);
""".strip(),
}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="JSON exported by export_manifest_to_pgvector_json.py")
    parser.add_argument("--output", required=True, help="Output JSON plan for later DB execution")
    args = parser.parse_args()

    # Read the export explicitly as UTF-8 so non-ASCII titles survive on any platform.
    payload = json.loads(Path(args.input).read_text(encoding="utf-8"))
    plan = {
        # Row counts per table, echoed on stdout as a quick sanity check.
        "counts": {
            "songs": len(payload.get("songs", [])),
            "references": len(payload.get("references", [])),
            "segments": len(payload.get("segments", [])),
        },
        "sql": SQL_STATEMENTS,
        # Row payloads are passed through unchanged from the export.
        "rows": {
            "songs": payload.get("songs", []),
            "references": payload.get("references", []),
            "segments": payload.get("segments", []),
        },
        "notes": [
            "Execute songs before references and segments.",
            "Embedding rows should be loaded only after reference_id/segment_id resolution.",
            "A live loader can replace row-wise inserts with COPY/execute_batch.",
        ],
    }

    out = Path(args.output)
    out.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False emits raw non-ASCII characters, so write as UTF-8.
    out.write_text(json.dumps(plan, indent=2, ensure_ascii=False), encoding="utf-8")
    print(json.dumps({
        "status": "ok",
        "output": str(out.resolve()),
        **plan["counts"],
    }, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()
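
The docstring and the plan notes mention that a live loader could use COPY instead of row-wise inserts. The sketch below covers only the serialization half of that idea, assuming the plan layout produced by the script above. `rows_to_csv` and `SONG_COLUMNS` are hypothetical names that do not exist in the repository, and handing the buffer to a driver (for example psycopg2's `cursor.copy_expert` with `COPY songs (...) FROM STDIN WITH (FORMAT csv)`) is an untested assumption.

```python
import csv
import io

# Hypothetical column order for the songs table, copied from the INSERT template.
SONG_COLUMNS = ["song_id", "title", "artist", "version_id", "source_dataset", "license"]


def rows_to_csv(rows, columns):
    """Serialize plan rows (a list of dicts) into an in-memory CSV buffer.

    Missing keys and None values become empty unquoted fields, which
    PostgreSQL's CSV format reads as NULL by default. Note that a genuine
    empty string in the data would therefore also be loaded as NULL.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if row.get(col) is None else row[col] for col in columns])
    buf.seek(0)
    return buf


if __name__ == "__main__":
    demo_rows = [
        {"song_id": "s1", "title": "Hello, World", "artist": "A",
         "version_id": "v1", "source_dataset": "demo", "license": None},
    ]
    print(rows_to_csv(demo_rows, SONG_COLUMNS).getvalue(), end="")
```

COPY has no `ON CONFLICT` clause, so keeping the upsert behaviour of the `songs` template would likely require copying into a staging table first and then running an `INSERT ... SELECT ... ON CONFLICT` from it.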
@@ -235,6 +235,29 @@

### Stage: pgvector bulk load plan template

Completed:
- Added [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)
- Added a PostgreSQL bulk-load plan template for the pgvector export output
- Added the corresponding notes to [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)

Verification:
- `/usr/local/miniconda3/bin/python -m py_compile scripts/pgvector_bulk_load_template.py` succeeded
- `/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py --input reports/pgvector_manifest_export_test.json --output reports/pgvector_bulk_load_plan_test.json` succeeded
- Current result:
  - `songs=24`
  - `references=24`
  - `segments=20`

Conclusion:
- The pgvector track now has:
  - a schema template
  - a manifest export template
  - a bulk-load plan template
- When a real PostgreSQL instance is connected later, only the live loader is missing; the data entry path does not have to be designed from scratch

### Stage: pgvector database-load template

Completed:
......
@@ -539,6 +539,40 @@ cd acr-engine
2. Later you can use bulk insert / COPY / ETL to load these rows into PostgreSQL
3. Once the embeddings have been generated, write them into `vector(192)`

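Step 3 above writes embeddings into a `vector(192)` column. As a minimal sketch, the helper below checks the dimension and converts an embedding into pgvector's text input format (`[x1,x2,...]`). `to_vector_literal` and `EMBEDDING_DIM` are hypothetical names used only for illustration; they are not part of the repository.

```python
import math

EMBEDDING_DIM = 192  # assumed to match the vector(192) column mentioned above


def to_vector_literal(embedding, dim=EMBEDDING_DIM):
    """Return a pgvector text literal such as '[0.5,1.0]' after validating the input."""
    values = [float(v) for v in embedding]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    # Reject NaN/inf early instead of letting the database report the error.
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"


if __name__ == "__main__":
    print(to_vector_literal([0.5, 1, -2.25], dim=3))
```

Validating the length on the client side surfaces a mismatched model output before any rows are sent, which is easier to debug than a failed batch.
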

### Bulk load plan template

The repository now also includes:

- [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py)

It takes the manifest-friendly JSON exported in the previous step and further organizes it into:

- SQL statement templates
- row data for songs / references / segments
- notes on load order
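
Because the plan pairs SQL templates that use named `%(name)s` placeholders with row dictionaries, it can be checked before any database is involved. The sketch below reports which placeholders are missing from the rows; `missing_params` is a hypothetical helper, and it assumes the plan keeps the `sql` and `rows` keys written by the script.

```python
import re

# Matches named placeholders of the form %(name)s used in the SQL templates.
PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def missing_params(plan):
    """Map each table to the sorted placeholder names that at least one row lacks."""
    problems = {}
    for table, sql in plan["sql"].items():
        required = set(PLACEHOLDER.findall(sql))
        missing = set()
        for row in plan["rows"].get(table, []):
            missing |= required - set(row)
        if missing:
            problems[table] = sorted(missing)
    return problems


if __name__ == "__main__":
    demo_plan = {
        "sql": {
            "references": 'INSERT INTO "references" (song_id, audio_uri) '
                          "VALUES (%(song_id)s, %(audio_uri)s);",
        },
        "rows": {"references": [{"song_id": "s1"}]},
    }
    print(missing_params(demo_plan))
```

A driver that binds named parameters would typically fail on the first row with a missing key, so running a check like this over the whole plan gives a complete list of gaps up front.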

Example:

```bash
cd acr-engine
/usr/local/miniconda3/bin/python scripts/pgvector_bulk_load_template.py \
  --input reports/pgvector_manifest_export_test.json \
  --output reports/pgvector_bulk_load_plan_test.json
```

Currently verified result:

- `songs=24`
- `references=24`
- `segments=20`

This way, if you later connect a real PostgreSQL instance, you can proceed in three steps:

1. manifest -> pgvector-friendly JSON
2. JSON -> bulk load plan
3. bulk load plan -> actual write into PostgreSQL / pgvector

## Sources

- Current code behavior from:
......