Commit 58041e10 58041e10b6c2c0709ad738224363a0d0d36334c7 by cnb.bofCdSsphPA

Connect internal asset exports to pgvector preparation early

Constraint: Internal CSV ingestion should reach a pgvector-ready payload without requiring a second custom export path
Rejected: Limit the mapper to manifest outputs only | Forces another transformation layer before database loading
Confidence: high
Scope-risk: narrow
Directive: Keep pgvector payloads aligned with the shared songs/references/segments contract while preserving internal asset metadata fields
Tested: internal_asset_type_mapper.py with --emit-pgvector-json produced songs=2 references=2 segments=2 and included audio_role/asset_type_code/validation_status in sample rows
Not-tested: Direct bulk load into PostgreSQL using a live pgvector database
1 parent 5334df1f
......@@ -176,6 +176,68 @@ def build_manifest_bundle(
}
def build_pgvector_payload(
    references: List[Dict],
    queries: List[Dict],
    split: str,
) -> Dict[str, List[Dict]]:
    songs: Dict[str, Dict] = {}
    reference_rows: List[Dict] = []
    segment_rows: List[Dict] = []
    for row in references:
        song_id = row["song_id"]
        songs.setdefault(song_id, {
            "song_id": song_id,
            "title": song_id,
            "artist": None,
            "version_id": row.get("version_id"),
            "source_dataset": row.get("source_dataset", "internal_assets"),
            "license": None,
        })
        reference_rows.append({
            "song_id": song_id,
            "audio_uri": row["audio_path"],
            "duration_sec": row.get("duration", 0.0),
            "sample_rate": 16000,
            "audio_role": row.get("audio_role"),
            "asset_type_code": row.get("asset_type_code"),
            "audio_exists": row.get("audio_exists"),
            "validation_status": row.get("validation_status"),
        })
    for row in queries:
        song_id = row["song_id"]
        songs.setdefault(song_id, {
            "song_id": song_id,
            "title": song_id,
            "artist": None,
            "version_id": row.get("version_id"),
            "source_dataset": row.get("source_dataset", "internal_assets"),
            "license": None,
        })
        segment_rows.append({
            "song_id": song_id,
            "audio_uri": row["audio_path"],
            "offset_sec": row.get("offset", 0.0) if row.get("offset") is not None else 0.0,
            "duration_sec": row.get("duration", 0.0),
            "split": split,
            "type": row.get("type", "unknown"),
            "segment_type": row.get("segment_type"),
            "source_dataset": row.get("source_dataset", "internal_assets"),
            "audio_role": row.get("audio_role"),
            "asset_type_code": row.get("asset_type_code"),
            "audio_exists": row.get("audio_exists"),
            "validation_status": row.get("validation_status"),
        })
    return {
        "songs": list(songs.values()),
        "references": reference_rows,
        "segments": segment_rows,
    }
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("csv_path")
......@@ -192,6 +254,8 @@ def main():
    parser.add_argument("--audio-root", default=None)
    parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
    parser.add_argument("--emit-manifests", action="store_true")
    parser.add_argument("--emit-pgvector-json", action="store_true")
    parser.add_argument("--pgvector-split", default="train")
    parser.add_argument("--eval-ratio", type=float, default=0.2)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()
......@@ -242,6 +306,19 @@ def main():
summary["manifest_test_rows"] = len(bundle["test"])
summary["manifest_val_rows"] = len(bundle["val"])
    if args.emit_pgvector_json:
        pgvector_payload = build_pgvector_payload(
            references=references,
            queries=queries,
            split=args.pgvector_split,
        )
        pgvector_path = out_dir / "pgvector_payload.json"
        pgvector_path.write_text(json.dumps(pgvector_payload, indent=2, ensure_ascii=False))
        summary["pgvector_payload"] = str(pgvector_path)
        summary["pgvector_songs"] = len(pgvector_payload["songs"])
        summary["pgvector_references"] = len(pgvector_payload["references"])
        summary["pgvector_segments"] = len(pgvector_payload["segments"])
    for name, payload in outputs.items():
        (out_dir / name).write_text(json.dumps(payload, indent=2, ensure_ascii=False))
......
......@@ -2,6 +2,39 @@
## 2026-06-02
### Stage: Add pgvector-ready JSON export to the internal asset mapping script
Completed:
- Extended `acr-engine/scripts/internal_asset_type_mapper.py`
- Added `--emit-pgvector-json`
- Added `--pgvector-split`
- Can now export directly:
  - `pgvector_payload.json`
- The export structure is compatible with the existing pgvector export tooling and contains:
  - `songs`
  - `references`
  - `segments`
- It additionally preserves:
  - `audio_role`
  - `asset_type_code`
  - `audio_exists`
  - `validation_status`
Verification results:
- Ran:
  - `internal_asset_type_mapper.py ... --emit-pgvector-json --pgvector-split train`
- Output summary:
  - `pgvector_songs = 2`
  - `pgvector_references = 2`
  - `pgvector_segments = 2`
- Spot check:
  - reference rows contain `duration_sec/sample_rate/audio_role/asset_type_code`
  - segment rows contain `offset_sec/split/type/segment_type/audio_role`
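The spot check above could be automated along these lines. This is a minimal sketch, not part of the commit: the helper name `missing_fields` is hypothetical, and the expected field sets are taken only from the fields listed in the spot check.

```python
from typing import Dict, List

# Field names copied from the spot check; the helper itself is hypothetical.
REFERENCE_FIELDS = {"duration_sec", "sample_rate", "audio_role", "asset_type_code"}
SEGMENT_FIELDS = {"offset_sec", "split", "type", "segment_type", "audio_role"}


def missing_fields(payload: Dict[str, List[Dict]]) -> Dict[str, List[str]]:
    """Return, per table, the expected fields that are absent from at least one row."""
    expected = {"references": REFERENCE_FIELDS, "segments": SEGMENT_FIELDS}
    report: Dict[str, List[str]] = {}
    for table, fields in expected.items():
        absent = set()
        for row in payload.get(table, []):
            # Collect every expected key this row does not carry.
            absent |= fields - set(row)
        report[table] = sorted(absent)
    return report
```

An empty list for both `references` and `segments` would mean every row carries the fields named in the spot check.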
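Conclusion:
- Internal asset CSVs can now be bridged directly to the pgvector load-preparation stage
- When a loader or direct database write is added later, the internal asset export structure will not need to be redesigned
### Stage: Add audio existence and duration validation to the internal asset mapping script
Completed: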
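To illustrate why the export structure should not need a redesign, here is a rough sketch of how a future loader might turn payload rows into insert parameters. Everything in it is an assumption: no loader exists in this commit, the column order is hypothetical, and the `%s` placeholder style is simply the one used by common PostgreSQL drivers. It does not open a database connection.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical column order; the real table schema is not defined in this changelog.
SEGMENT_COLUMNS = ("song_id", "audio_uri", "offset_sec", "duration_sec", "split", "type")


def rows_to_params(rows: List[Dict], columns: Sequence[str]) -> List[Tuple]:
    """Flatten payload rows into tuples ordered by `columns` (missing keys become None)."""
    return [tuple(row.get(col) for col in columns) for row in rows]


def insert_statement(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized INSERT; table and column names must come from trusted code."""
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
```

A loader could then pass `insert_statement("segments", SEGMENT_COLUMNS)` and `rows_to_params(payload["segments"], SEGMENT_COLUMNS)` to its driver's batch-execute call.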
......
......@@ -495,6 +495,8 @@ query:
- `manifest_bundle/train.json`
- `manifest_bundle/test.json`
- `manifest_bundle/val.json`
- Optionally generate directly:
- `pgvector_payload.json`
- Optionally run audio validation:
- `audio_exists`
- `duration_sec`
......@@ -523,6 +525,23 @@ query:
- `duration`
- `missing_audio` summary
If your next step is loading into PostgreSQL / pgvector, you can export directly:
```bash
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-pgvector-json --pgvector-split train
```
The output will contain:
- `songs`
- `references`
- `segments`
and each row additionally carries:
- `audio_role`
- `asset_type_code`
- `audio_exists`
- `validation_status`
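After the export, a quick sanity check is to count the rows in each section of `pgvector_payload.json` and compare them with the `pgvector_songs`, `pgvector_references`, and `pgvector_segments` values in the script's summary. The sketch below assumes only the three top-level keys listed above; the function name `summarize_payload` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union


def summarize_payload(path: Union[str, Path]) -> Dict[str, int]:
    """Count rows per top-level section of an exported pgvector payload file."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    # A missing section is counted as zero rows rather than raising.
    return {
        f"pgvector_{name}": len(payload.get(name, []))
        for name in ("songs", "references", "segments")
    }
```

For example, `summarize_payload("out/internal_asset_map/pgvector_payload.json")` would read the file written by the command above, given the same `--output-dir`.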
If you want to temporarily include accompaniment-type assets in the export as well, use:
```bash
......