Make smoke metadata explicit before more real-data comparisons

Constraint: Real-data smoke reports must distinguish manifest query duration from training segment duration to avoid 5s-vs-8s confusion across runs
Rejected: Keep a single ambiguous query_duration field | Makes cross-run analysis and handoff error-prone
Confidence: high
Scope-risk: narrow
Directive: Preserve explicit duration semantics in future smoke/report artifacts and keep legacy aliases only for compatibility
Tested: build_smoke_config_summary() emits manifest_query_duration=8.0 and train_segment_duration=5.0 using configs/default.yaml
Not-tested: End-to-end regeneration of the still-running real FMA smoke report bundle with the new config schema

Commit d7df0087171d5c9e89596a4ea01a326bdf28c88f, authored 2026-06-02 15:14:22 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 87 additions and 11 deletions
acr-engine/src/data/external_adapters.py
docs/CHANGELOG.md
docs/training-data-and-pgvector-guide.md
--- a/acr-engine/src/data/external_adapters.py
+++ b/acr-engine/src/data/external_adapters.py
@@ -9,6 +9,7 @@ import argparse
 import json
 import subprocess
 import torch
+import yaml


 AUDIO_EXTS = (".wav", ".mp3", ".flac", ".ogg")
@@ -22,6 +23,47 @@ def resolve_device(device: str) -> str:
    return device


+def load_default_training_config(config_path: str = "configs/default.yaml") -> Dict:
+    with open(config_path) as f:
+        return yaml.safe_load(f)
+
+
+def build_smoke_config_summary(
+    dataset: str,
+    manifests_dir: Path,
+    manifest_query_duration: float,
+    train_epochs: int,
+    batch_size: int,
+    requested_device: str,
+    resolved_device: str,
+    base_cfg: Dict,
+) -> Dict:
+    return {
+        "model": {
+            "embed_dim": base_cfg["model"]["embed_dim"],
+            "channels": base_cfg["model"]["channels"],
+            "n_mels": base_cfg["model"]["n_mels"],
+            "use_band_split": base_cfg["model"].get("use_band_split", True),
+        },
+        "data": {
+            "source_dataset": dataset,
+            "manifests_dir": str(manifests_dir),
+            "manifest_query_duration": manifest_query_duration,
+            "train_segment_duration": base_cfg["data"]["segment_dur"],
+            "sample_rate": base_cfg["data"]["sample_rate"],
+            "n_fft": base_cfg["data"]["n_fft"],
+            "hop_length": base_cfg["data"]["hop_length"],
+            "query_duration_legacy": manifest_query_duration,
+        },
+        "run": {
+            "train_epochs": train_epochs,
+            "batch_size": batch_size,
+            "requested_device": requested_device,
+            "resolved_device": resolved_device,
+        },
+    }
+
+
 @dataclass
 class DatasetRecord:
    name: str
@@ -340,6 +382,7 @@ def smoke_local_dataset(
    )
    manifests_dir = Path(prepare_summary["output_dir"])
    validate_summary = adapter.validate_local_manifests(manifests_dir)
+    base_cfg = load_default_training_config()

    model_dir = output_root / f"{dataset}_models_smoke"
    index_dir = output_root / f"{dataset}_index_smoke"
@@ -380,16 +423,16 @@ def smoke_local_dataset(
        "--output-json", str(eval_json),
    ], check=True)

-    config = {
-        "model": {"embed_dim": 192, "channels": 512, "n_mels": 128, "use_band_split": True},
-        "data": {"source_dataset": dataset, "manifests_dir": str(manifests_dir), "query_duration": query_duration},
-        "run": {
-            "train_epochs": train_epochs,
-            "batch_size": batch_size,
-            "requested_device": device,
-            "resolved_device": resolved_device,
-        },
-    }
+    config = build_smoke_config_summary(
+        dataset=dataset,
+        manifests_dir=manifests_dir,
+        manifest_query_duration=query_duration,
+        train_epochs=train_epochs,
+        batch_size=batch_size,
+        requested_device=device,
+        resolved_device=resolved_device,
+        base_cfg=base_cfg,
+    )
    report_dir.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))

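For consumers of `config.json`, the change above means older reports carry `data.query_duration`, while newer ones carry `data.manifest_query_duration` plus the `data.query_duration_legacy` alias. A minimal sketch of a reader that tolerates both, assuming only the key names visible in this diff; the helper name `read_manifest_query_duration` is hypothetical and not part of the repository:

```python
from typing import Any, Dict, Optional


def read_manifest_query_duration(config: Dict[str, Any]) -> Optional[float]:
    """Return the manifest query duration from a smoke config summary.

    Hypothetical helper: prefers the explicit key, then falls back to aliases.
    """
    data = config.get("data", {})
    # Explicit key first, then the compatibility alias, then the pre-change key.
    for key in ("manifest_query_duration", "query_duration_legacy", "query_duration"):
        if key in data:
            return float(data[key])
    return None


# Shapes taken from the diff: the new summary and the old hard-coded dict.
new_style = {"data": {"manifest_query_duration": 8.0,
                      "train_segment_duration": 5.0,
                      "query_duration_legacy": 8.0}}
old_style = {"data": {"query_duration": 8.0}}

print(read_manifest_query_duration(new_style))
print(read_manifest_query_duration(old_style))
```

Checking the explicit key first keeps the legacy names as a fallback only, in line with the commit's directive to keep aliases purely for compatibility.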
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,35 @@

 ## 2026-06-02

+### Stage: Explicitly separate the 8s query and 5s training segment semantics in the smoke config
+
+Completed:
+- Modified `acr-engine/src/data/external_adapters.py`
+  - Added `load_default_training_config()`
+  - Added `build_smoke_config_summary()`
+  - The `config.json` produced by `smoke-local` now explicitly records:
+    - `manifest_query_duration`
+    - `train_segment_duration`
+    - `sample_rate`
+    - `n_fft`
+    - `hop_length`
+    - `query_duration_legacy`
+- Updated [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) to explain the old and new config conventions
+
+Verification:
+- Verified the output by calling `build_smoke_config_summary()` directly:
+  - `manifest_query_duration = 8.0`
+  - `train_segment_duration = 5.0`
+  - `requested_device = auto`
+  - `resolved_device = cpu`
+- The default training config is read from:
+  - `configs/default.yaml`
+  - where `data.segment_dur = 5.0`
+
+Conclusion:
+- The smoke config summary now clearly distinguishes the manifest query duration from the training clip duration
+- Later cross-experiment comparisons of report artifacts are less likely to confuse the 5s and 8s semantics
+
+
 ### Stage: Record continuous-development preferences and current progress in AGENTS.md

 Completed:
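
The verification in the changelog entry above was done by calling `build_smoke_config_summary()` directly. A comparable check on an already-written summary could look like the sketch below. It assumes only the `data` keys shown in the diff; the function name `find_duration_problems` is hypothetical and not part of the repository:

```python
from typing import Any, Dict, List

# Duration-related keys that the diff adds under "data".
REQUIRED_DATA_KEYS = (
    "manifest_query_duration",
    "train_segment_duration",
    "query_duration_legacy",
)


def find_duration_problems(summary: Dict[str, Any]) -> List[str]:
    """Hypothetical check: list problems with the duration fields of a summary."""
    data = summary.get("data", {})
    problems = [f"missing data.{key}" for key in REQUIRED_DATA_KEYS if key not in data]
    if problems:
        return problems
    # In the diff, the legacy alias is set to the same value as the manifest query duration.
    if data["query_duration_legacy"] != data["manifest_query_duration"]:
        problems.append("query_duration_legacy differs from manifest_query_duration")
    return problems


# Values reported in the changelog entry: 8.0 s query, 5.0 s training segment.
summary = {"data": {"manifest_query_duration": 8.0,
                    "train_segment_duration": 5.0,
                    "query_duration_legacy": 8.0}}
print(find_duration_problems(summary))
```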
--- a/docs/training-data-and-pgvector-guide.md
+++ b/docs/training-data-and-pgvector-guide.md
@@ -326,7 +326,11 @@ flowchart TD

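 Explanation:
 - The **manifest query duration**, the **training crop duration**, and the **query_duration recorded in the report** currently do not all come from the same config source;
-- The existing `fma_reports_smoke/config.json` has a timestamp earlier than the latest manifests, an experiment-artifact consistency issue that still needs to be addressed;
+- The old `fma_reports_smoke/config.json` has a timestamp earlier than the latest manifests, which is a historical experiment-artifact consistency issue;
+- The code now starts to split the smoke config summary explicitly into:
+  - `manifest_query_duration`
+  - `train_segment_duration`
+  - `query_duration_legacy`
 - Going forward, as the pipeline is hardened toward production grade, "manifest query duration / train clip duration / eval query duration / report metadata" should be brought together into one explicit config structure.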

 ---
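
The guide's closing bullet asks for a single explicit config structure covering the manifest, training, eval, and report durations. One possible shape is sketched below. The class `DurationConfig`, its field names, and `to_report_metadata` are all hypothetical; in particular, `eval_query_duration` does not appear in the summary built by this commit and is included only because the guide mentions it:

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DurationConfig:
    """Hypothetical container; field names are illustrative, not from the repo."""

    manifest_query_duration: float  # seconds, query clips in the manifests
    train_segment_duration: float   # seconds, crop length used for training
    eval_query_duration: Optional[float] = None  # seconds; assumed, not in the diff

    def to_report_metadata(self) -> Dict[str, Any]:
        """Serialize for a report, keeping the legacy alias for older readers."""
        meta = asdict(self)
        meta["query_duration_legacy"] = self.manifest_query_duration
        return meta


# Values from the commit message: 8.0 s manifest queries, 5.0 s training segments.
cfg = DurationConfig(manifest_query_duration=8.0, train_segment_duration=5.0)
print(cfg.to_report_metadata())
```

Deriving the legacy alias at serialization time, rather than storing it as a field, keeps one source of truth for the manifest query duration.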