Commit 5334df1f 5334df1fb3faf970851523337c0603f43726305b by cnb.bofCdSsphPA

Validate internal audio assets before manifest-scale training

Constraint: Internal CSV exports should expose missing audio and usable durations before they are treated as train-ready manifests
Rejected: Defer path and duration checks to later training failures | Would make ingestion debugging slow and noisy
Confidence: high
Scope-risk: narrow
Directive: Keep internal asset validation lightweight at mapping time; surface existence and duration early, then layer richer QC rules incrementally
Tested: internal_asset_type_mapper.py with --audio-root on a 6-row sample detected missing_audio=2 and emitted durations for existing reference/query assets
Not-tested: Production-scale scans over the full internal asset repository
1 parent f048e400
@@ -13,6 +13,7 @@ import json
import random
from pathlib import Path
from typing import Dict, List, Tuple
import soundfile as sf
REFERENCE = "reference"
QUERY = "query"
@@ -43,11 +44,29 @@ TYPE_POLICY: Dict[int, Dict[str, str]] = {
}
def inspect_audio(asset_path: str | None, audio_root: Path | None) -> Tuple[bool, float | None]:
    if not asset_path:
        return False, None
    path = Path(asset_path)
    if audio_root and not path.is_absolute():
        path = audio_root / path
    if not path.exists():
        return False, None
    try:
        info = sf.info(str(path))
        return True, float(info.duration)
    except Exception:
        return True, None
def normalize_row(row: Dict[str, str], args) -> Dict:
    type_code = int(row[args.type_field])
    policy = TYPE_POLICY.get(type_code, {"bucket": EXCLUDED, "audio_role": "unknown", "train_type": "none", "priority": "unknown"})
    canonical_song_id = row.get(args.song_field) or row.get(args.canonical_song_field) or row.get(args.asset_id_field) or "unknown_song"
    version_id = row.get(args.version_field) or f"{canonical_song_id}_type_{type_code}"
    audio_path = row.get(args.path_field)
    audio_exists, duration_sec = inspect_audio(audio_path, Path(args.audio_root) if args.audio_root else None)
    validation_status = "ok" if audio_exists else "missing_audio"
    record = {
        "asset_id": row.get(args.asset_id_field),
        "canonical_song_id": canonical_song_id,
@@ -57,7 +76,10 @@ def normalize_row(row: Dict[str, str], args) -> Dict:
        "recommended_train_type": policy["train_type"],
        "priority": policy["priority"],
        "bucket": policy["bucket"],
        "audio_path": row.get(args.path_field),
        "audio_path": audio_path,
        "audio_exists": audio_exists,
        "duration_sec": duration_sec,
        "validation_status": validation_status,
        "title": row.get(args.title_field),
        "artist": row.get(args.artist_field),
        "source_platform": row.get(args.platform_field) or "internal",
@@ -72,6 +94,8 @@ def to_manifest_record(record: Dict, bucket: str) -> Dict:
        "asset_type_code": record["asset_type_code"],
        "audio_role": record["audio_role"],
        "audio_path": record["audio_path"],
        "audio_exists": record["audio_exists"],
        "validation_status": record["validation_status"],
        "source_dataset": "internal_assets",
        "source_platform": record["source_platform"],
    }
@@ -79,12 +103,12 @@ def to_manifest_record(record: Dict, bucket: str) -> Dict:
        return {
            **base,
            "type": "reference",
            "duration": 0.0,
            "duration": record["duration_sec"] or 0.0,
        }
    return {
        **base,
        "type": record["recommended_train_type"],
        "duration": 0.0,
        "duration": record["duration_sec"] or 0.0,
        "offset": None,
        "segment_type": "external_query",
    }
@@ -165,6 +189,7 @@ def main():
    parser.add_argument("--title-field", default="title")
    parser.add_argument("--artist-field", default="artist")
    parser.add_argument("--platform-field", default="source_platform")
    parser.add_argument("--audio-root", default=None)
    parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
    parser.add_argument("--emit-manifests", action="store_true")
    parser.add_argument("--eval-ratio", type=float, default=0.2)
@@ -178,6 +203,8 @@ def main():
rows.append(normalize_row(row, args))
    references, queries, metadata_only, excluded = route_records(rows, args.include_conditionals_as)
    missing_audio = sum(1 for row in rows if not row["audio_exists"])
    trainable_audio_rows = sum(1 for row in rows if row["audio_exists"] and row["bucket"] in {REFERENCE, QUERY, CONDITIONAL})
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
@@ -187,6 +214,8 @@ def main():
        "queries": len(queries),
        "metadata_only": len(metadata_only),
        "excluded": len(excluded),
        "missing_audio": missing_audio,
        "trainable_audio_rows": trainable_audio_rows,
        "include_conditionals_as": args.include_conditionals_as,
    }
    outputs = {
......
@@ -2,6 +2,35 @@
## 2026-06-02
### Stage: Add audio existence and duration validation to the internal asset mapping script
Completed:
- Extended `acr-engine/scripts/internal_asset_type_mapper.py`
  - Added `--audio-root`
  - Automatically detects `audio_exists`
  - Automatically detects `duration_sec`
  - Automatically writes `validation_status`
- Added to the summary:
  - `missing_audio`
  - `trainable_audio_rows`
- Updated [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
Verification results:
- Built a 6-row sample CSV: 4 rows point to real audio files and 2 to missing paths
- Ran:
  - `internal_asset_type_mapper.py ... --audio-root /tmp/internal_assets_audio --emit-manifests`
- Summary output:
  - `missing_audio = 2`
  - `trainable_audio_rows = 4`
- The generated reference/query records now carry:
  - `audio_exists = true`
  - `validation_status = ok`
  - the correct `duration`
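
The two summary counters can be derived from the normalized rows roughly as follows. This is a minimal sketch, not the script itself: the six sample rows are made up to mirror the shape of the test described above, and the `"conditional"` bucket string is an assumption (only `"reference"` and `"query"` are shown in the source).

```python
from typing import Dict, List

REFERENCE = "reference"
QUERY = "query"
CONDITIONAL = "conditional"  # assumed value; not shown in the source


def summarize_audio(rows: List[Dict]) -> Dict[str, int]:
    """Count rows with missing audio and rows that are usable for training."""
    trainable_buckets = {REFERENCE, QUERY, CONDITIONAL}
    missing_audio = sum(1 for row in rows if not row["audio_exists"])
    trainable = sum(
        1 for row in rows
        if row["audio_exists"] and row["bucket"] in trainable_buckets
    )
    return {"missing_audio": missing_audio, "trainable_audio_rows": trainable}


# Hypothetical sample: 4 rows with audio on disk, 2 with missing paths.
rows = [
    {"bucket": REFERENCE, "audio_exists": True},
    {"bucket": REFERENCE, "audio_exists": True},
    {"bucket": QUERY, "audio_exists": True},
    {"bucket": QUERY, "audio_exists": True},
    {"bucket": REFERENCE, "audio_exists": False},
    {"bucket": QUERY, "audio_exists": False},
]
print(summarize_audio(rows))
```

A row only counts as trainable when its audio exists and its bucket is one that can reach a manifest, so metadata-only or excluded rows never inflate the count.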
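Conclusion:
- The internal asset CSV-to-manifest pipeline now has baseline pre-training quality validation
- Adding offset handling or finer-grained quality rules later will not require restructuring the existing script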
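
As a sketch of how a downstream consumer might use these fields before training: manifest records carry `validation_status`, and `duration` falls back to `0.0` when no duration could be read, so both can be checked. The `filter_train_ready` function and its `min_duration` threshold are hypothetical and not part of the script.

```python
from typing import Dict, List


def filter_train_ready(records: List[Dict], min_duration: float = 0.0) -> List[Dict]:
    """Keep records whose audio exists and whose duration exceeds min_duration."""
    kept = []
    for rec in records:
        # Rows flagged missing_audio never reach training.
        if rec.get("validation_status") != "ok":
            continue
        # A 0.0 duration means the file exists but its length is unknown.
        if (rec.get("duration") or 0.0) <= min_duration:
            continue
        kept.append(rec)
    return kept


# Hypothetical manifest records for illustration only.
sample = [
    {"audio_path": "a.wav", "validation_status": "ok", "duration": 12.5},
    {"audio_path": "b.wav", "validation_status": "missing_audio", "duration": 0.0},
    {"audio_path": "c.wav", "validation_status": "ok", "duration": 0.0},
]
train_ready = filter_train_ready(sample)
print([rec["audio_path"] for rec in train_ready])
```

Checking the duration as well as the status matters because a file that exists but cannot be decoded is still marked `ok` and only shows up as a zero duration.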
### Stage: Have the internal asset mapping script emit train/test manifests directly
Completed:
......
@@ -495,6 +495,10 @@ query:
- `manifest_bundle/train.json`
- `manifest_bundle/test.json`
- `manifest_bundle/val.json`
- Optional audio validation:
- `audio_exists`
- `duration_sec`
- `validation_status`
Minimal example:
@@ -508,6 +512,17 @@ query:
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map --emit-manifests --eval-ratio 0.2
```
If your CSV contains relative paths, we recommend passing the audio root directory:
```bash
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-manifests
```
The script will then automatically fill in:
- `audio_exists`
- `duration`
- the `missing_audio` summary count
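
The path resolution and duration probe behind `--audio-root` can be sketched roughly as below. This is a minimal illustration rather than the script's actual code: it swaps in the standard-library `wave` module (WAV files only) for `soundfile`, and the `probe_audio` name is hypothetical.

```python
import wave
from pathlib import Path
from typing import Optional, Tuple


def probe_audio(asset_path: Optional[str], audio_root: Optional[Path]) -> Tuple[bool, Optional[float]]:
    """Return (exists, duration_sec) for a path value taken from the CSV.

    Hypothetical helper: stdlib `wave` stands in for `soundfile`, so only WAV is handled.
    """
    if not asset_path:
        return False, None
    path = Path(asset_path)
    # Relative CSV paths are resolved against the audio root; absolute ones are kept.
    if audio_root is not None and not path.is_absolute():
        path = audio_root / path
    if not path.exists():
        return False, None
    try:
        with wave.open(str(path), "rb") as wav:
            return True, wav.getnframes() / float(wav.getframerate())
    except (wave.Error, EOFError):
        # The file exists but could not be decoded, so the duration stays unknown.
        return True, None
```

Absolute paths in the CSV bypass the root entirely, so one export can mix both styles.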
If you want to temporarily include accompaniment-type assets in the export, use:
```bash
......