Commit f048e400 f048e4001b0dbce0e395517e4cbf564cb94bc2f3 by cnb.bofCdSsphPA

Bridge internal CSV exports into manifest bundles before ingestion at scale

Constraint: Internal asset exports should reach train/test-ready manifests without repeated manual reshaping
Rejected: Stop at references/queries JSON only | Still leaves each import needing custom bundle assembly and split logic
Confidence: high
Scope-risk: narrow
Directive: Keep internal manifest emission conservative and deterministic; preserve train/test query presence even on tiny exports
Tested: internal_asset_type_mapper.py sample run with --emit-manifests produced catalog/train/test/val and balanced 1 query in both train and test
Not-tested: Duration/offset enrichment from live source metadata and audio-path existence checks on production exports
1 parent 728ef117
......@@ -10,6 +10,7 @@ from __future__ import annotations
import argparse
import csv
import json
import random
from pathlib import Path
from typing import Dict, List, Tuple
......@@ -107,6 +108,50 @@ def route_records(rows: List[Dict], include_conditionals_as: str) -> Tuple[List[
    return references, queries, metadata_only, excluded


def build_manifest_bundle(
    references: List[Dict],
    queries: List[Dict],
    eval_ratio: float,
    seed: int,
) -> Dict[str, List[Dict]]:
    # A seeded RNG keeps the split deterministic for a given input order.
    rng = random.Random(seed)
    # Group queries by song so each song is split independently.
    grouped_queries: Dict[str, List[Dict]] = {}
    for row in queries:
        grouped_queries.setdefault(row["song_id"], []).append(row)

    train_queries: List[Dict] = []
    test_queries: List[Dict] = []
    val_queries: List[Dict] = []  # never populated here, so val.json is written empty

    for song_id, items in grouped_queries.items():
        items = list(items)
        rng.shuffle(items)
        if len(items) == 1:
            # A song with a single query stays in train.
            train_queries.extend(items)
            continue
        # At least one test query per multi-query song, but never all of them.
        num_test = max(1, round(len(items) * eval_ratio))
        num_test = min(num_test, len(items) - 1)
        test_part = items[:num_test]
        train_part = items[num_test:]
        if not train_part and test_part:
            train_part.append(test_part.pop())
        train_queries.extend(train_part)
        test_queries.extend(test_part)

    # Tiny-export fallback: with two or more queries overall, move one across
    # so that neither split is left without a query.
    if len(queries) >= 2 and not test_queries and train_queries:
        test_queries.append(train_queries.pop())
    if len(queries) >= 2 and not train_queries and test_queries:
        train_queries.append(test_queries.pop())

    return {
        "catalog": references,
        "train": train_queries + references,
        "test": test_queries + references,
        "val": val_queries,
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("csv_path")
......@@ -121,6 +166,9 @@ def main():
    parser.add_argument("--artist-field", default="artist")
    parser.add_argument("--platform-field", default="source_platform")
    parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
    parser.add_argument("--emit-manifests", action="store_true")
    parser.add_argument("--eval-ratio", type=float, default=0.2)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    rows = []
......@@ -133,20 +181,38 @@ def main():
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
outputs = {
"references.json": references,
"queries.json": queries,
"metadata_only.json": metadata_only,
"excluded.json": excluded,
"summary.json": {
summary = {
"input_rows": len(rows),
"references": len(references),
"queries": len(queries),
"metadata_only": len(metadata_only),
"excluded": len(excluded),
"include_conditionals_as": args.include_conditionals_as,
},
}
outputs = {
"references.json": references,
"queries.json": queries,
"metadata_only.json": metadata_only,
"excluded.json": excluded,
"summary.json": summary,
}
    if args.emit_manifests:
        manifest_dir = out_dir / "manifest_bundle"
        manifest_dir.mkdir(parents=True, exist_ok=True)
        bundle = build_manifest_bundle(
            references=references,
            queries=queries,
            eval_ratio=args.eval_ratio,
            seed=args.seed,
        )
        for split, payload in bundle.items():
            (manifest_dir / f"{split}.json").write_text(json.dumps(payload, indent=2, ensure_ascii=False))
        summary["manifest_bundle"] = str(manifest_dir)
        summary["manifest_train_rows"] = len(bundle["train"])
        summary["manifest_test_rows"] = len(bundle["test"])
        summary["manifest_val_rows"] = len(bundle["val"])

    for name, payload in outputs.items():
        (out_dir / name).write_text(json.dumps(payload, indent=2, ensure_ascii=False))
......
......@@ -2,6 +2,39 @@
## 2026-06-02
### Stage: Have the internal asset mapping script emit train/test manifests directly
Completed:
- Extended `acr-engine/scripts/internal_asset_type_mapper.py`
  - Added `--emit-manifests`
  - Added `--eval-ratio`
  - Added `--seed`
- In addition to the existing `references/queries/metadata_only/excluded` outputs, the script now writes:
  - `manifest_bundle/catalog.json`
  - `manifest_bundle/train.json`
  - `manifest_bundle/test.json`
  - `manifest_bundle/val.json`
- Added a small-sample safeguard:
  - Even when there are very few queries, the script tries to keep at least one query in both `train` and `test`
- Updated [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
Verification results:
- Ran against a 6-row sample CSV:
  - `internal_asset_type_mapper.py ... --emit-manifests --eval-ratio 0.5 --seed 42`
- Output summary:
  - `manifest_bundle` was generated
  - `manifest_train_rows = 3`
  - `manifest_test_rows = 3`
  - `manifest_val_rows = 0`
- Manifest check:
  - `catalog`: 2 references
  - `train`: 1 query + 2 references
  - `test`: 1 query + 2 references
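
The query counts in this verification run can be reproduced with a condensed sketch of the split logic. `split_queries` is a hypothetical stand-in for `build_manifest_bundle` that returns only the query splits, and the sample rows are invented for illustration, since the actual 6-row CSV is not shown here.

```python
import random
from typing import Dict, List, Tuple


def split_queries(
    queries: List[Dict], eval_ratio: float, seed: int
) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical condensed version of the per-song split in build_manifest_bundle."""
    rng = random.Random(seed)
    grouped: Dict[str, List[Dict]] = {}
    for row in queries:
        grouped.setdefault(row["song_id"], []).append(row)

    train: List[Dict] = []
    test: List[Dict] = []
    for items in grouped.values():
        items = list(items)
        rng.shuffle(items)
        if len(items) == 1:
            # A single-query song stays in train.
            train.extend(items)
            continue
        # At least one test query, but never every query of the song.
        num_test = min(max(1, round(len(items) * eval_ratio)), len(items) - 1)
        test.extend(items[:num_test])
        train.extend(items[num_test:])

    # Tiny-export fallback: move one query across if a split is empty.
    if len(queries) >= 2 and not test and train:
        test.append(train.pop())
    if len(queries) >= 2 and not train and test:
        train.append(test.pop())
    return train, test


# Invented rows: "id" is illustrative, only "song_id" is used by the split.
same_song = [{"id": "q1", "song_id": "s1"}, {"id": "q2", "song_id": "s1"}]
different_songs = [{"id": "q1", "song_id": "s1"}, {"id": "q2", "song_id": "s2"}]

for label, rows in [("same song", same_song), ("different songs", different_songs)]:
    train, test = split_queries(rows, eval_ratio=0.5, seed=42)
    print(label, len(train), len(test))
# prints:
# same song 1 1
# different songs 1 1
```

Either arrangement of two queries gives one query per split, and adding the 2 references to each split yields the 3 rows reported for `manifest_train_rows` and `manifest_test_rows`. In the different-songs case the fallback moves a song's only query into test, so that song is left without a training query.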
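Conclusion:
- Internal asset CSVs can now be turned into a near-training-ready manifest bundle in a single step
- Adding duration/offset/audio-existence validation later would let the bundle plug into the production training pipeline more smoothly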
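
One possible shape for the audio-existence check mentioned in the conclusion, as a sketch only. The `audio_path` field name and the `find_missing_audio` helper are assumptions; the manifest row schema is not shown in this change.

```python
from pathlib import Path
from typing import Dict, List


def find_missing_audio(
    rows: List[Dict], audio_root: Path, path_field: str = "audio_path"
) -> List[Dict]:
    """Return rows whose path field is empty or whose audio file does not exist.

    `path_field` is a hypothetical column name; adjust it to the real schema.
    """
    missing: List[Dict] = []
    for row in rows:
        rel = row.get(path_field)
        # Relative paths are resolved against audio_root; absolute paths are used as-is.
        if not rel or not (audio_root / rel).is_file():
            missing.append(row)
    return missing
```

Running such a check over `catalog.json`, `train.json`, and `test.json` before training would surface broken rows early without changing how the bundle is produced.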
### Stage: Turn the internal asset type policy into an executable mapping script
Completed:
......
......@@ -490,6 +490,11 @@ query:
- `queries.json`
- `metadata_only.json`
- `excluded.json`
- Optionally, also generated directly:
  - `manifest_bundle/catalog.json`
  - `manifest_bundle/train.json`
  - `manifest_bundle/test.json`
  - `manifest_bundle/val.json`
Shortest example:
......@@ -497,6 +502,12 @@ query:
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map
```
To produce training-ready manifests directly:
```bash
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map --emit-manifests --eval-ratio 0.2
```
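
After a run with `--emit-manifests`, a quick sanity check on the bundle can catch an empty split before training. The sketch below assumes the `manifest_bundle/` layout listed earlier and that `train.json` and `test.json` each contain their queries plus every reference, as `build_manifest_bundle` does; `summarize_bundle` is a hypothetical helper, not part of the script.

```python
import json
from pathlib import Path
from typing import Dict, List


def summarize_bundle(manifest_dir: Path) -> Dict[str, int]:
    """Count rows per split and derive query counts by subtracting catalog rows."""
    bundle: Dict[str, List[dict]] = {
        # Assumes the files are UTF-8 encoded JSON arrays.
        split: json.loads((manifest_dir / f"{split}.json").read_text(encoding="utf-8"))
        for split in ("catalog", "train", "test", "val")
    }
    n_refs = len(bundle["catalog"])
    return {
        "catalog": n_refs,
        "train_queries": len(bundle["train"]) - n_refs,
        "test_queries": len(bundle["test"]) - n_refs,
        "val": len(bundle["val"]),
    }
```

With the command above, the bundle would be read via `summarize_bundle(Path("out/internal_asset_map/manifest_bundle"))`; a `train_queries` or `test_queries` value of 0 suggests the export had too few queries to split.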
To temporarily include accompaniment-type assets in the export as well, use:
```bash
......