Preserve internal query window semantics for trainable asset exports

Constraint: Internal assets must support both manually labeled clips and whole-track auto-window generation without breaking pgvector export
Rejected: Treat missing query duration as full audio duration | prevents multi-window query expansion for long source audio
Confidence: high
Scope-risk: narrow
Directive: Keep explicit CSV offset authoritative; only auto-expand when offset is absent and query_stride is set
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/internal_asset_type_mapper.py; local 30s/40s WAV fixture export with manifest + pgvector verification
Not-tested: End-to-end retraining with newly expanded internal manifests

Commit d61ee980 ... d61ee9806973e69d3510cd953ef130deeb51bb06 authored 2026-06-02 15:53:57 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 112 additions and 3 deletions
acr-engine/scripts/internal_asset_type_mapper.py
docs/CHANGELOG.md
docs/training-data-and-pgvector-guide.md
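
The auto-expansion described in the commit message comes down to a small piece of offset arithmetic: step through the audio by the stride, then add one last window flush with the end of the audio if the stride did not land there. The following is a minimal standalone sketch of that arithmetic, modeled on `expand_query_records` in the diff below. The helper name `window_offsets` is hypothetical and is not part of the script.

```python
import math
from typing import List


def window_offsets(audio_duration: float, query_duration: float, stride: float) -> List[float]:
    """Start offsets for sliding query windows (hypothetical helper, not in the script)."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Audio that is no longer than one window yields a single query at offset 0.
    if audio_duration <= 0 or query_duration <= 0 or audio_duration <= query_duration:
        return [0.0]
    # Last start position at which a full window still fits inside the audio.
    max_offset = audio_duration - query_duration
    n_steps = math.floor(max_offset / stride)
    offsets = [round(i * stride, 3) for i in range(n_steps + 1)]
    # Add a tail window aligned to the end of the audio if the stride missed it.
    if round(max_offset, 3) > offsets[-1]:
        offsets.append(round(max_offset, 3))
    return offsets


# The 30s fixture with an 8s query and a 4s stride, as in the changelog entry.
print(window_offsets(30.0, 8.0, 4.0))  # [0.0, 4.0, 8.0, 12.0, 16.0, 20.0, 22.0]
```

The tail window is why the last two offsets (`20` and `22`) are only 2s apart: it trades uniform spacing for full coverage of the end of the track.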
--- a/acr-engine/scripts/internal_asset_type_mapper.py
+++ b/acr-engine/scripts/internal_asset_type_mapper.py
@@ -10,6 +10,7 @@ from __future__ import annotations
 import argparse
 import csv
 import json
+import math
 import random
 from pathlib import Path
 from typing import Dict, List, Tuple
@@ -106,7 +107,7 @@ def normalize_row(row: Dict[str, str], args) -> Dict:
 def to_manifest_record(record: Dict, bucket: str, args) -> Dict:
    inferred_query_duration = record["csv_duration_sec"]
    if inferred_query_duration is None:
-        inferred_query_duration = record["duration_sec"] if record["duration_sec"] is not None else args.default_query_duration
+        inferred_query_duration = args.default_query_duration
    inferred_query_offset = record["csv_offset_sec"]
    if inferred_query_offset is None:
        inferred_query_offset = args.default_query_offset
@@ -133,7 +134,9 @@ def to_manifest_record(record: Dict, bucket: str, args) -> Dict:
        "type": record["recommended_train_type"],
        "duration": inferred_query_duration or 0.0,
        "offset": inferred_query_offset,
+        "offset_is_explicit": record["csv_offset_sec"] is not None,
        "segment_type": "external_query",
+        "source_audio_duration": record["duration_sec"],
    }


@@ -155,6 +158,52 @@ def route_records(rows: List[Dict], include_conditionals_as: str, args) -> Tuple
    return references, queries, metadata_only, excluded


+def expand_query_records(queries: List[Dict], query_stride: float | None) -> List[Dict]:
+    if not query_stride or query_stride <= 0:
+        return queries
+
+    expanded: List[Dict] = []
+    for row in queries:
+        duration = float(row.get("duration", 0.0) or 0.0)
+        audio_duration = float(duration or 0.0)
+        source_duration = row.get("source_audio_duration")
+        if source_duration is not None:
+            try:
+                audio_duration = float(source_duration)
+            except (TypeError, ValueError):
+                pass
+
+        explicit_offset = row.get("offset")
+        offset_is_explicit = bool(row.get("offset_is_explicit"))
+        if offset_is_explicit and explicit_offset not in (None, ""):
+            clone = dict(row)
+            clone["query_index"] = 0
+            expanded.append(clone)
+            continue
+
+        if audio_duration <= 0 or duration <= 0 or audio_duration <= duration:
+            clone = dict(row)
+            clone["offset"] = 0.0
+            clone["query_index"] = 0
+            expanded.append(clone)
+            continue
+
+        max_offset = max(0.0, audio_duration - duration)
+        n_steps = int(math.floor(max_offset / query_stride))
+        offsets = [round(i * query_stride, 3) for i in range(n_steps + 1)]
+        if not offsets:
+            offsets = [0.0]
+        if round(max_offset, 3) > offsets[-1]:
+            offsets.append(round(max_offset, 3))
+
+        for idx, offset in enumerate(offsets):
+            clone = dict(row)
+            clone["offset"] = offset
+            clone["query_index"] = idx
+            expanded.append(clone)
+    return expanded
+
+
 def build_manifest_bundle(
    references: List[Dict],
    queries: List[Dict],
@@ -227,6 +276,7 @@ def build_pgvector_payload(
            "asset_type_code": row.get("asset_type_code"),
            "audio_exists": row.get("audio_exists"),
            "validation_status": row.get("validation_status"),
+            "query_index": row.get("query_index"),
        })

    for row in queries:
@@ -252,6 +302,7 @@ def build_pgvector_payload(
            "asset_type_code": row.get("asset_type_code"),
            "audio_exists": row.get("audio_exists"),
            "validation_status": row.get("validation_status"),
+            "query_index": row.get("query_index"),
        })

    return {
@@ -279,6 +330,7 @@ def main():
    parser.add_argument("--audio-root", default=None)
    parser.add_argument("--default-query-duration", type=float, default=8.0)
    parser.add_argument("--default-query-offset", type=float, default=0.0)
+    parser.add_argument("--query-stride", type=float, default=None)
    parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
    parser.add_argument("--emit-manifests", action="store_true")
    parser.add_argument("--emit-pgvector-json", action="store_true")
@@ -294,6 +346,7 @@ def main():
            rows.append(normalize_row(row, args))

    references, queries, metadata_only, excluded = route_records(rows, args.include_conditionals_as, args)
+    queries = expand_query_records(queries, args.query_stride)
    missing_audio = sum(1 for row in rows if not row["audio_exists"])
    trainable_audio_rows = sum(1 for row in rows if row["audio_exists"] and row["bucket"] in {REFERENCE, QUERY, CONDITIONAL})

@@ -308,6 +361,7 @@ def main():
        "missing_audio": missing_audio,
        "trainable_audio_rows": trainable_audio_rows,
        "include_conditionals_as": args.include_conditionals_as,
+        "query_stride": args.query_stride,
    }
    outputs = {
        "references.json": references,
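
The `offset_is_explicit` flag added in this file exists because, once a missing CSV offset has been replaced by the default, a defaulted `0.0` and an explicitly labeled `0.0` look identical in the `offset` field. A simplified sketch of that distinction follows. The helper names `resolve_offset` and `should_expand` are hypothetical, and the real script performs additional checks (for example on empty-string offsets and on durations).

```python
from typing import Dict, Optional


def resolve_offset(csv_offset: Optional[float], default_offset: float = 0.0) -> Dict[str, object]:
    """Record both the effective offset and whether the CSV supplied it (hypothetical helper)."""
    return {
        "offset": csv_offset if csv_offset is not None else default_offset,
        "offset_is_explicit": csv_offset is not None,
    }


def should_expand(record: Dict[str, object], stride: Optional[float]) -> bool:
    """Expand only when a positive stride is set and the offset was not given in the CSV."""
    return bool(stride and stride > 0) and not record["offset_is_explicit"]


labeled = resolve_offset(0.0)     # CSV explicitly says offset 0
defaulted = resolve_offset(None)  # CSV has no offset, default 0.0 is used

# Both records carry offset 0.0, but only the defaulted one may be expanded.
print(labeled["offset"] == defaulted["offset"])  # True
print(should_expand(labeled, 4.0))               # False
print(should_expand(defaulted, 4.0))             # True
print(should_expand(defaulted, None))            # False
```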
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -5372,3 +5372,29 @@
 Conclusions:
 - Type-aware weighting is more effective than naive oversampling
 - The next round should design stronger negative mining / confusion-aware signals specifically for the confused classes
+
+### Stage: internal asset query stride fix + type policy hardening
+
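+Completed:
+- Fixed the automatic window-expansion logic for internal asset queries in `acr-engine/scripts/internal_asset_type_mapper.py`
+- Added `source_audio_duration` passthrough so that long audio can be expanded by `--query-stride` based on its real total duration
+- Fixed the "default offset=0 misread as an explicit offset" issue, so that window expansion is disabled only when the CSV explicitly provides an offset
+- Added `query_index` to the `segments` of `pgvector_payload.json`
+- Added the internal asset sliding-window rules, a recommended-parameters table, and an auto-expansion example to [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
+
+Verification results:
+- Verified with a local 30s `songA.wav`:
+  - `--default-query-duration 8 --query-stride 4`
+  - `queries.json` exported `7` queries
+  - offsets are `0, 4, 8, 12, 16, 20, 22`
+  - `query_index` is `0..6`
+- Verified with a local 40s `songB.wav` + an explicit CSV `offset=12`:
+  - Still exports only `1` query
+  - Not overridden by auto-expansion
+- `manifest_bundle/*.json` and `pgvector_payload.json` both reflect the expansion results
+
+Conclusions:
+- Internal assets now reliably support two modes:
+  - **Short video clips with a manually labeled offset**: remain a single query
+  - **Assets with only a full-length track and no query start point**: automatically generate multi-window queries
+- This lets query-type assets such as `7/8/16/18` enter the training and evaluation pipeline more directly, while keeping traceability for `pgvector` ingestion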
--- a/docs/training-data-and-pgvector-guide.md
+++ b/docs/training-data-and-pgvector-guide.md
@@ -530,11 +530,23 @@ query:
 - `--offset-field`
 - `--default-query-duration`
 - `--default-query-offset`
+- `--query-stride`

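 The rules are:
 - A query uses the CSV's own `duration/offset` first
-- When absent, prefer the probed audio duration
-- When offset is absent, fall back to the default (usually `0.0`)
+- When duration is absent, fall back to the default query duration (e.g. `8.0s`) rather than the full audio duration
+- The total audio duration is kept separately as `source_audio_duration` for sliding-window query expansion
+- When the CSV gives an explicit offset, keep a single query and do not auto-expand
+- When offset has no explicit value and `--query-stride` is set, the query is expanded into multiple queries using a sliding window
+- If `--query-stride` is not set, a missing explicit offset falls back to the default (usually `0.0`)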
+
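+Recommended parameters:
+
+| Scenario | Recommended parameters | Notes |
+|---|---|---|
+| Internal short video clips with manually labeled start points | `--offset-field offset_sec` | Keeps the manual timestamps and prevents auto-expansion from overriding manual labels |
+| Only full-length raw audio, no query start point | `--default-query-duration 8 --query-stride 4` | Automatically produces multi-window queries with 50% overlap |
+| Just want a minimal usable set first | `--default-query-duration 8` | Exports only 1 clip per query, default offset=0 |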

 If your next step is to load the data into PostgreSQL / pgvector, you can export directly:

@@ -542,6 +554,23 @@ query:
 /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-pgvector-json --pgvector-split train
 ```

+Auto-expansion example:
+
+```bash
+/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv \
+  --audio-root data/internal_audio \
+  --output-dir out/internal_asset_map \
+  --default-query-duration 8 \
+  --query-stride 4 \
+  --emit-manifests \
+  --emit-pgvector-json
+```
+
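+For example, a 30s audio file with an `8s` query and a `4s` stride exports the offsets:
+- `0, 4, 8, 12, 16, 20, 22`
+
+Both the exported `queries.json` and `pgvector_payload.json` keep `query_index`, which makes it easier to trace each window back to its source.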
+
 The output will contain:
 - `songs`
 - `references`