Commit d61ee980 d61ee9806973e69d3510cd953ef130deeb51bb06 by cnb.bofCdSsphPA

Preserve internal query window semantics for trainable asset exports

Constraint: Internal assets must support both manually labeled clips and whole-track auto-window generation without breaking pgvector export
Rejected: Treat missing query duration as full audio duration | prevents multi-window query expansion for long source audio
Confidence: high
Scope-risk: narrow
Directive: Keep explicit CSV offset authoritative; only auto-expand when offset is absent and query_stride is set
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/internal_asset_type_mapper.py; local 30s/40s WAV fixture export with manifest + pgvector verification
Not-tested: End-to-end retraining with newly expanded internal manifests
1 parent 3e13c578
......@@ -10,6 +10,7 @@ from __future__ import annotations
import argparse
import csv
import json
import math
import random
from pathlib import Path
from typing import Dict, List, Tuple
......@@ -106,7 +107,7 @@ def normalize_row(row: Dict[str, str], args) -> Dict:
def to_manifest_record(record: Dict, bucket: str, args) -> Dict:
    inferred_query_duration = record["csv_duration_sec"]
    if inferred_query_duration is None:
-       inferred_query_duration = record["duration_sec"] if record["duration_sec"] is not None else args.default_query_duration
        inferred_query_duration = args.default_query_duration
    inferred_query_offset = record["csv_offset_sec"]
    if inferred_query_offset is None:
        inferred_query_offset = args.default_query_offset
......@@ -133,7 +134,9 @@ def to_manifest_record(record: Dict, bucket: str, args) -> Dict:
"type": record["recommended_train_type"],
"duration": inferred_query_duration or 0.0,
"offset": inferred_query_offset,
"offset_is_explicit": record["csv_offset_sec"] is not None,
"segment_type": "external_query",
"source_audio_duration": record["duration_sec"],
}
......@@ -155,6 +158,52 @@ def route_records(rows: List[Dict], include_conditionals_as: str, args) -> Tuple
    return references, queries, metadata_only, excluded


def expand_query_records(queries: List[Dict], query_stride: float | None) -> List[Dict]:
    if not query_stride or query_stride <= 0:
        return queries

    expanded: List[Dict] = []
    for row in queries:
        duration = float(row.get("duration", 0.0) or 0.0)
        audio_duration = float(duration or 0.0)
        source_duration = row.get("source_audio_duration")
        if source_duration is not None:
            try:
                audio_duration = float(source_duration)
            except (TypeError, ValueError):
                pass

        explicit_offset = row.get("offset")
        offset_is_explicit = bool(row.get("offset_is_explicit"))
        if offset_is_explicit and explicit_offset not in (None, ""):
            clone = dict(row)
            clone["query_index"] = 0
            expanded.append(clone)
            continue

        if audio_duration <= 0 or duration <= 0 or audio_duration <= duration:
            clone = dict(row)
            clone["offset"] = 0.0
            clone["query_index"] = 0
            expanded.append(clone)
            continue

        max_offset = max(0.0, audio_duration - duration)
        n_steps = int(math.floor(max_offset / query_stride))
        offsets = [round(i * query_stride, 3) for i in range(n_steps + 1)]
        if not offsets:
            offsets = [0.0]
        if round(max_offset, 3) > offsets[-1]:
            offsets.append(round(max_offset, 3))

        for idx, offset in enumerate(offsets):
            clone = dict(row)
            clone["offset"] = offset
            clone["query_index"] = idx
            expanded.append(clone)
    return expanded

def build_manifest_bundle(
references: List[Dict],
queries: List[Dict],
......@@ -227,6 +276,7 @@ def build_pgvector_payload(
"asset_type_code": row.get("asset_type_code"),
"audio_exists": row.get("audio_exists"),
"validation_status": row.get("validation_status"),
"query_index": row.get("query_index"),
})
for row in queries:
......@@ -252,6 +302,7 @@ def build_pgvector_payload(
"asset_type_code": row.get("asset_type_code"),
"audio_exists": row.get("audio_exists"),
"validation_status": row.get("validation_status"),
"query_index": row.get("query_index"),
})
return {
......@@ -279,6 +330,7 @@ def main():
parser.add_argument("--audio-root", default=None)
parser.add_argument("--default-query-duration", type=float, default=8.0)
parser.add_argument("--default-query-offset", type=float, default=0.0)
parser.add_argument("--query-stride", type=float, default=None)
parser.add_argument("--include-conditionals-as", choices=["skip", "query", "reference"], default="skip")
parser.add_argument("--emit-manifests", action="store_true")
parser.add_argument("--emit-pgvector-json", action="store_true")
......@@ -294,6 +346,7 @@ def main():
rows.append(normalize_row(row, args))
references, queries, metadata_only, excluded = route_records(rows, args.include_conditionals_as, args)
queries = expand_query_records(queries, args.query_stride)
missing_audio = sum(1 for row in rows if not row["audio_exists"])
trainable_audio_rows = sum(1 for row in rows if row["audio_exists"] and row["bucket"] in {REFERENCE, QUERY, CONDITIONAL})
......@@ -308,6 +361,7 @@ def main():
"missing_audio": missing_audio,
"trainable_audio_rows": trainable_audio_rows,
"include_conditionals_as": args.include_conditionals_as,
"query_stride": args.query_stride,
}
outputs = {
"references.json": references,
......
......@@ -5372,3 +5372,29 @@
Conclusions:
- Type-aware weighting is more effective than naive oversampling
- The next round should design stronger negative mining / confusion-aware signals specifically for the confused classes
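### Stage: internal asset query stride fix + type policy hardening
Completed:
- Fixed the automatic window expansion logic for internal-asset queries in `acr-engine/scripts/internal_asset_type_mapper.py`
- Added pass-through of `source_audio_duration`, so long audio can be expanded by `--query-stride` based on its real total duration
- Fixed the bug where the default offset=0 was mistaken for an explicit offset; window expansion is now disabled only when the CSV explicitly provides an offset
- Added `query_index` to the `segments` in `pgvector_payload.json`
- Added the internal-asset sliding-window rules, a recommended-parameter table, and an auto-expansion example to [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
Verification results:
- Verified with a local 30s `songA.wav`:
  - `--default-query-duration 8 --query-stride 4`
  - `queries.json` exported `7` queries
  - offsets are `0, 4, 8, 12, 16, 20, 22`
  - `query_index` is `0..6`
- Verified with a local 40s `songB.wav` + explicit CSV `offset=12`:
  - still exports only `1` query
  - not overridden by auto-expansion
- `manifest_bundle/*.json` and `pgvector_payload.json` both reflect the expansion results
Conclusions:
- Internal assets now reliably support two modes:
  - **Short video clips with a manually labeled offset**: kept as a single query
  - **Assets with only a whole track and no query start point**: multi-window queries are generated automatically
- This lets query-type assets such as types `7/8/16/18` enter the training and evaluation pipelines more directly, while keeping them traceable through `pgvector` ingestion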
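
The 30s verification above can be reproduced with a small standalone sketch of the offset arithmetic. `window_offsets` is a hypothetical helper, not a function in the script; it mirrors the floor-and-tail-append logic of `expand_query_records` for a single row and assumes no explicit CSV offset is present.

```python
import math
from typing import List


def window_offsets(audio_duration: float, duration: float, stride: float) -> List[float]:
    """Hypothetical helper: start offsets for sliding query windows.

    Steps by `stride` while a full window still fits, then adds one
    tail window flush with the end of the audio if the last regular
    step did not already reach it.
    """
    if stride <= 0 or duration <= 0 or audio_duration <= duration:
        return [0.0]  # nothing to slide over: a single window at the start
    max_offset = audio_duration - duration
    n_steps = math.floor(max_offset / stride)
    offsets = [round(i * stride, 3) for i in range(n_steps + 1)]
    if round(max_offset, 3) > offsets[-1]:
        offsets.append(round(max_offset, 3))
    return offsets


print(window_offsets(30.0, 8.0, 4.0))
# [0.0, 4.0, 8.0, 12.0, 16.0, 20.0, 22.0]
```

The last offset, `22`, is the tail window: it sits only 2s after the previous one so that the final 8s of the track are still covered.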
......
......@@ -530,11 +530,23 @@ query:
- `--offset-field`
- `--default-query-duration`
- `--default-query-offset`
- `--query-stride`
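The rules are:
- Queries use the CSV's own `duration/offset` first
-- If absent, prefer the probed audio duration
-- If offset is absent, fall back to the default (usually `0.0`)
- If duration is absent, fall back to the default query duration (for example `8.0s`), not the full track length
- The total audio duration is kept separately as `source_audio_duration` for sliding-window query expansion
- If offset has an explicit CSV value, keep a single query and do not auto-expand
- If offset has no explicit value and `--query-stride` is set, the query is expanded into multiple queries using a sliding window
- If `--query-stride` is not set, a missing offset falls back to the default (usually `0.0`)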
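
The current rules can be read as a small decision function. The sketch below is a simplified illustration, not the script's code: `plan_query`, its parameters, and its return keys are hypothetical names, and it leaves out the extra check that the source audio must be longer than the query window before expansion yields more than one query.

```python
from typing import Optional


def plan_query(
    csv_duration: Optional[float],
    csv_offset: Optional[float],
    default_duration: float = 8.0,
    default_offset: float = 0.0,
    query_stride: Optional[float] = None,
) -> dict:
    """Hypothetical summary of how one CSV row is turned into query windows."""
    # Duration: the CSV value wins; otherwise use the default query
    # duration, never the full length of the source audio.
    duration = csv_duration if csv_duration is not None else default_duration

    if csv_offset is not None:
        # An explicit CSV offset (including 0.0) is authoritative.
        return {"duration": duration, "offset": csv_offset, "mode": "single_explicit"}
    if query_stride is not None and query_stride > 0:
        # No explicit offset and a stride is set: sliding-window expansion.
        return {"duration": duration, "offset": None, "mode": "auto_expand"}
    # No explicit offset and no stride: one query at the default offset.
    return {"duration": duration, "offset": default_offset, "mode": "single_default"}


print(plan_query(None, 12.0, query_stride=4.0)["mode"])  # single_explicit
print(plan_query(None, None, query_stride=4.0)["mode"])  # auto_expand
print(plan_query(None, None)["mode"])                    # single_default
```

Testing `csv_offset is not None` rather than truthiness is what keeps an explicit `0.0` in the CSV distinct from the default `0.0`.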
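Recommended parameters:
| Scenario | Recommended parameters | Notes |
|---|---|---|
| Internal short-video clips with manually labeled start points | `--offset-field offset_sec` | Keeps the manual timestamps and prevents auto-expansion from overriding manual labels |
| Only whole-track source audio, no query start point | `--default-query-duration 8 --query-stride 4` | Automatically produces multi-window queries with 50% overlap |
| Just want a minimum viable set first | `--default-query-duration 8` | Each query exports only 1 segment, default offset=0 |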
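
The 50% figure in the table follows from the ratio of stride to window length. As a rough sketch (the function name `overlap_fraction` is hypothetical), the overlap between two consecutive windows of equal length can be computed as:

```python
def overlap_fraction(prev_offset: float, next_offset: float, duration: float) -> float:
    """Fraction of a window shared with the next window of the same length."""
    shared = max(0.0, duration - (next_offset - prev_offset))
    return shared / duration


# Regular steps: 8s windows, 4s stride.
print(overlap_fraction(0.0, 4.0, 8.0))    # 0.5
# Tail window of a 30s file: offsets 20 and 22.
print(overlap_fraction(20.0, 22.0, 8.0))  # 0.75
```

The second call shows that the tail window can overlap its predecessor by more than the nominal 50%, so the last two queries of a track may be more similar to each other than the rest.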
If your next step is PostgreSQL / pgvector, you can export directly:
......@@ -542,6 +554,23 @@ query:
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-pgvector-json --pgvector-split train
```
Auto-expansion example:
```bash
/usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv \
--audio-root data/internal_audio \
--output-dir out/internal_asset_map \
--default-query-duration 8 \
--query-stride 4 \
--emit-manifests \
--emit-pgvector-json
```
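For example, a 30s audio file with an `8s` query and a `4s` stride exports the offsets:
- `0, 4, 8, 12, 16, 20, 22`
Both the exported `queries.json` and `pgvector_payload.json` keep `query_index`, making it easy to trace which window a record came from.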
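
As one hedged illustration of that traceability, the sketch below maps exported records back to the time spans they cover. The record layout is an assumption: the keys `offset`, `duration`, and `query_index` follow the manifest fields in this change, while `source_id` is a hypothetical identifier used only for this example, and the sample records are made up.

```python
from typing import Dict, List, Tuple


def window_span(record: Dict) -> Tuple[float, float]:
    """Return (start, end) in seconds for one exported query record."""
    start = float(record["offset"])
    return start, start + float(record["duration"])


def windows_by_source(records: List[Dict]) -> Dict[str, List[Tuple[int, float, float]]]:
    """Group records by the hypothetical `source_id`, ordered by query_index."""
    grouped: Dict[str, List[Tuple[int, float, float]]] = {}
    for rec in records:
        start, end = window_span(rec)
        grouped.setdefault(rec["source_id"], []).append((rec["query_index"], start, end))
    for spans in grouped.values():
        spans.sort()  # tuples sort by query_index first
    return grouped


# Illustrative records only; not taken from a real export.
records = [
    {"source_id": "songA", "offset": 4.0, "duration": 8.0, "query_index": 1},
    {"source_id": "songA", "offset": 0.0, "duration": 8.0, "query_index": 0},
    {"source_id": "songB", "offset": 12.0, "duration": 8.0, "query_index": 0},
]
print(windows_by_source(records)["songA"])
# [(0, 0.0, 8.0), (1, 4.0, 12.0)]
```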
The output will contain:
- `songs`
- `references`
......