Validate internal audio assets before manifest-scale training

Constraint: Internal CSV exports should expose missing audio and usable durations before they are treated as train-ready manifests
Rejected: Defer path and duration checks to later training failures | Would make ingestion debugging slow and noisy
Confidence: high
Scope-risk: narrow
Directive: Keep internal asset validation lightweight at mapping time; surface existence and duration early, then layer richer QC rules incrementally
Tested: internal_asset_type_mapper.py with --audio-root on a 6-row sample detected missing_audio=2 and emitted durations for existing reference/query assets
Not-tested: Production-scale scans over the full internal asset repository
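
The directive to "layer richer QC rules incrementally" can be illustrated with one possible next step. In the diff below, a file that exists but fails to decode is still reported with `validation_status` "ok", because the status is derived from existence alone. The following is a minimal sketch, not the script's actual code: it assumes WAV input and uses the standard-library `wave` module in place of `soundfile` so that it runs without extra dependencies. The function name `check_wav` and the `unreadable_audio` status are hypothetical.

```python
import wave
from pathlib import Path
from typing import Optional, Tuple


def check_wav(asset_path: Optional[str], audio_root: Optional[Path]) -> Tuple[str, Optional[float]]:
    """Return (validation_status, duration_sec) for one CSV row's audio path."""
    if not asset_path:
        return "missing_audio", None
    path = Path(asset_path)
    # Relative CSV paths are resolved against the audio root, as in the commit.
    if audio_root is not None and not path.is_absolute():
        path = audio_root / path
    if not path.exists():
        return "missing_audio", None
    try:
        with wave.open(str(path), "rb") as wav:
            duration = wav.getnframes() / float(wav.getframerate())
    except (wave.Error, EOFError, OSError, ZeroDivisionError):
        # Hypothetical status: the file exists but cannot be decoded.
        return "unreadable_audio", None
    return "ok", duration
```

Keeping the third status separate from `missing_audio` would let a summary distinguish path problems from corrupt files, at the cost of one more value for downstream consumers to handle.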

Commit 5334df1fb3faf970851523337c0603f43726305b authored 2026-06-02 15:38:16 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 76 additions and 3 deletions
acr-engine/scripts/internal_asset_type_mapper.py
docs/CHANGELOG.md
docs/training-data-and-pgvector-guide.md
--- a/acr-engine/scripts/internal_asset_type_mapper.py
+++ b/acr-engine/scripts/internal_asset_type_mapper.py
@@ -13,6 +13,7 @@ import json
 import random
 from pathlib import Path
 from typing import Dict, List, Tuple
+import soundfile as sf

 REFERENCE = "reference"
 QUERY = "query"
@@ -43,11 +44,29 @@ TYPE_POLICY: Dict[int, Dict[str, str]] = {
 }


+def inspect_audio(asset_path: str | None, audio_root: Path | None) -> Tuple[bool, float | None]:
+    if not asset_path:
+        return False, None
+    path = Path(asset_path)
+    if audio_root and not path.is_absolute():
+        path = audio_root / path
+    if not path.exists():
+        return False, None
+    try:
+        info = sf.info(str(path))
+        return True, float(info.duration)
+    except Exception:
+        return True, None
+
+
 def normalize_row(row: Dict[str, str], args) -> Dict:
    type_code = int(row[args.type_field])
    policy = TYPE_POLICY.get(type_code, {"bucket": EXCLUDED, "audio_role": "unknown", "train_type": "none", "priority": "unknown"})
    canonical_song_id = row.get(args.song_field) or row.get(args.canonical_song_field) or row.get(args.asset_id_field) or "unknown_song"
    version_id = row.get(args.version_field) or f"{canonical_song_id}_type_{type_code}"
+    audio_path = row.get(args.path_field)
+    audio_exists, duration_sec = inspect_audio(audio_path, Path(args.audio_root) if args.audio_root else None)
+    validation_status = "ok" if audio_exists else "missing_audio"
    record = {
        "asset_id": row.get(args.asset_id_field),
        "canonical_song_id": canonical_song_id,
@@ -57,7 +76,10 @@ def normalize_row(row: Dict[str, str], args) -> Dict:
        "recommended_train_type": policy["train_type"],
        "priority": policy["priority"],
        "bucket": policy["bucket"],
-        "audio_path": row.get(args.path_field),
+        "audio_path": audio_path,
+        "audio_exists": audio_exists,
+        "duration_sec": duration_sec,
+        "validation_status": validation_status,
        "title": row.get(args.title_field),
        "artist": row.get(args.artist_field),
        "source_platform": row.get(args.platform_field) or "internal",
@@ -72,6 +94,8 @@ def to_manifest_record(record: Dict, bucket: str) -> Dict:
        "asset_type_code": record["asset_type_code"],
        "audio_role": record["audio_role"],
        "audio_path": record["audio_path"],
+        "audio_exists": record["audio_exists"],
+        "validation_status": record["validation_status"],
        "source_dataset": "internal_assets",
        "source_platform": record["source_platform"],
    }
@@ -79,12 +103,12 @@ def to_manifest_record(record: Dict, bucket: str) -> Dict:
        return {
            **base,
            "type": "reference",
-            "duration": 0.0,
+            "duration": record["duration_sec"] or 0.0,
        }
    return {
        **base,
        "type": record["recommended_train_type"],
-        "duration": 0.0,
+        "duration": record["duration_sec"] or 0.0,
        "offset": None,
        "segment_type": "external_query",
    }
@@ -165,6 +189,7 @@ def main():
    parser.add_argument("--title-field", default="title")
    parser.add_argument("--artist-field", default="artist")
    parser.add_argument("--platform-field", default="source_platform")
+    parser.add_argument("--audio-root", default=None)
    parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
    parser.add_argument("--emit-manifests", action="store_true")
    parser.add_argument("--eval-ratio", type=float, default=0.2)
@@ -178,6 +203,8 @@ def main():
            rows.append(normalize_row(row, args))

    references, queries, metadata_only, excluded = route_records(rows, args.include_conditionals_as)
+    missing_audio = sum(1 for row in rows if not row["audio_exists"])
+    trainable_audio_rows = sum(1 for row in rows if row["audio_exists"] and row["bucket"] in {REFERENCE, QUERY, CONDITIONAL})

    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
@@ -187,6 +214,8 @@ def main():
        "queries": len(queries),
        "metadata_only": len(metadata_only),
        "excluded": len(excluded),
+        "missing_audio": missing_audio,
+        "trainable_audio_rows": trainable_audio_rows,
        "include_conditionals_as": args.include_conditionals_as,
    }
    outputs = {
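
The two new summary counters can be checked by hand against the 6-row sample described in the commit message. The sketch below is illustrative only: the helper name `summarize_audio`, the sample rows, and the string value assumed for the conditional bucket are hypothetical, while the counting logic mirrors the two added lines in `main()`.

```python
from typing import Dict, List

# "reference" and "query" match the script's constants; "conditional" is an assumed value.
TRAINABLE_BUCKETS = {"reference", "query", "conditional"}


def summarize_audio(rows: List[Dict]) -> Dict[str, int]:
    """Count rows with missing audio and rows that are both present and trainable."""
    missing = sum(1 for row in rows if not row["audio_exists"])
    trainable = sum(
        1 for row in rows
        if row["audio_exists"] and row["bucket"] in TRAINABLE_BUCKETS
    )
    return {"missing_audio": missing, "trainable_audio_rows": trainable}


# Hypothetical sample shaped like the 6-row test: 4 files present, 2 missing.
sample_rows = [
    {"audio_exists": True, "bucket": "reference"},
    {"audio_exists": True, "bucket": "reference"},
    {"audio_exists": True, "bucket": "query"},
    {"audio_exists": True, "bucket": "query"},
    {"audio_exists": False, "bucket": "reference"},
    {"audio_exists": False, "bucket": "query"},
]
print(summarize_audio(sample_rows))
```

Note that a row whose audio exists but whose bucket is excluded or metadata-only counts toward neither number, so the two counters need not sum to the total row count.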
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,35 @@

 ## 2026-06-02

+### Stage: Add audio existence and duration checks to the internal asset mapping script
+
+Completed:
+- Extended `acr-engine/scripts/internal_asset_type_mapper.py`
+  - Added `--audio-root`
+  - Automatically detects `audio_exists`
+  - Automatically detects `duration_sec`
+  - Automatically writes `validation_status`
+- Added to the summary:
+  - `missing_audio`
+  - `trainable_audio_rows`
+- Updated [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
+
+Verification results:
+- Built a 6-row sample CSV with 4 real audio files and 2 missing paths
+- Ran:
+  - `internal_asset_type_mapper.py ... --audio-root /tmp/internal_assets_audio --emit-manifests`
+- Summary output:
+  - `missing_audio = 2`
+  - `trainable_audio_rows = 4`
+- The generated reference/query records now carry:
+  - `audio_exists = true`
+  - `validation_status = ok`
+  - the correct `duration`
+
+Conclusion:
+- The internal asset CSV-to-manifest pipeline now has the most basic pre-training quality checks
+- Adding offset or finer-grained quality rules later will not require restructuring the existing script
+
 ### Stage: Have the internal asset mapping script emit train/test manifests directly

 Completed:
--- a/docs/training-data-and-pgvector-guide.md
+++ b/docs/training-data-and-pgvector-guide.md
@@ -495,6 +495,10 @@ query:
  - `manifest_bundle/train.json`
  - `manifest_bundle/test.json`
  - `manifest_bundle/val.json`
+- Optional audio validation:
+  - `audio_exists`
+  - `duration_sec`
+  - `validation_status`

 Minimal example:

@@ -508,6 +512,17 @@ query:
 /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map --emit-manifests --eval-ratio 0.2
 ```

+If your CSV uses relative paths, passing the audio root directory is recommended:
+
+```bash
+/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-manifests
+```
+
+The script then fills in automatically:
+- `audio_exists`
+- `duration`
+- the `missing_audio` summary count
+
 To temporarily include accompaniment-type assets in the export as well, use:

 ```bash