Commit d7df0087 d7df0087171d5c9e89596a4ea01a326bdf28c88f by cnb.bofCdSsphPA

Make smoke metadata explicit before more real-data comparisons

Constraint: Real-data smoke reports must distinguish manifest query duration from training segment duration to avoid 5s-vs-8s confusion across runs
Rejected: Keep a single ambiguous query_duration field | Makes cross-run analysis and handoff error-prone
Confidence: high
Scope-risk: narrow
Directive: Preserve explicit duration semantics in future smoke/report artifacts and keep legacy aliases only for compatibility
Tested: build_smoke_config_summary() emits manifest_query_duration=8.0 and train_segment_duration=5.0 using configs/default.yaml
Not-tested: End-to-end regeneration of the still-running real FMA smoke report bundle with the new config schema
1 parent 05a2ccca
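
The `Directive` trailer above asks future smoke/report artifacts to keep the explicit duration fields and to retain legacy aliases only for compatibility. A minimal sketch of what a tolerant reader on the consuming side might look like is shown below. `read_manifest_query_duration` is a hypothetical helper that is not part of this commit, and the fallback order (explicit key, then `query_duration_legacy`, then the pre-commit `query_duration`) is an assumption.

```python
from typing import Any, Dict, Optional


def read_manifest_query_duration(config: Dict[str, Any]) -> Optional[float]:
    """Return the manifest query duration from a smoke config.json payload.

    Hypothetical helper: prefers the explicit key and falls back to the
    legacy aliases so that reports written before this commit still load.
    """
    data = config.get("data", {})
    for key in ("manifest_query_duration", "query_duration_legacy", "query_duration"):
        if key in data:
            return float(data[key])
    return None


# A pre-commit report only carries the ambiguous field.
old_report = {"data": {"query_duration": 8.0}}

# A post-commit report carries the explicit fields plus the legacy alias.
new_report = {
    "data": {
        "manifest_query_duration": 8.0,
        "train_segment_duration": 5.0,
        "query_duration_legacy": 8.0,
    }
}

print(read_manifest_query_duration(old_report))  # 8.0
print(read_manifest_query_duration(new_report))  # 8.0
```

Keeping the fallback in one reader function, rather than scattering `dict.get` calls across analysis scripts, makes it easier to drop the legacy aliases later.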
@@ -9,6 +9,7 @@ import argparse
import json
import subprocess
import torch
import yaml
AUDIO_EXTS = (".wav", ".mp3", ".flac", ".ogg")
@@ -22,6 +23,47 @@ def resolve_device(device: str) -> str:
    return device


def load_default_training_config(config_path: str = "configs/default.yaml") -> Dict:
    with open(config_path) as f:
        return yaml.safe_load(f)


def build_smoke_config_summary(
    dataset: str,
    manifests_dir: Path,
    manifest_query_duration: float,
    train_epochs: int,
    batch_size: int,
    requested_device: str,
    resolved_device: str,
    base_cfg: Dict,
) -> Dict:
    return {
        "model": {
            "embed_dim": base_cfg["model"]["embed_dim"],
            "channels": base_cfg["model"]["channels"],
            "n_mels": base_cfg["model"]["n_mels"],
            "use_band_split": base_cfg["model"].get("use_band_split", True),
        },
        "data": {
            "source_dataset": dataset,
            "manifests_dir": str(manifests_dir),
            "manifest_query_duration": manifest_query_duration,
            "train_segment_duration": base_cfg["data"]["segment_dur"],
            "sample_rate": base_cfg["data"]["sample_rate"],
            "n_fft": base_cfg["data"]["n_fft"],
            "hop_length": base_cfg["data"]["hop_length"],
            "query_duration_legacy": manifest_query_duration,
        },
        "run": {
            "train_epochs": train_epochs,
            "batch_size": batch_size,
            "requested_device": requested_device,
            "resolved_device": resolved_device,
        },
    }


@dataclass
class DatasetRecord:
    name: str
@@ -340,6 +382,7 @@ def smoke_local_dataset(
    )
    manifests_dir = Path(prepare_summary["output_dir"])
    validate_summary = adapter.validate_local_manifests(manifests_dir)
    base_cfg = load_default_training_config()
    model_dir = output_root / f"{dataset}_models_smoke"
    index_dir = output_root / f"{dataset}_index_smoke"
@@ -380,16 +423,16 @@ def smoke_local_dataset(
        "--output-json", str(eval_json),
    ], check=True)
    config = {
        "model": {"embed_dim": 192, "channels": 512, "n_mels": 128, "use_band_split": True},
        "data": {"source_dataset": dataset, "manifests_dir": str(manifests_dir), "query_duration": query_duration},
        "run": {
            "train_epochs": train_epochs,
            "batch_size": batch_size,
            "requested_device": device,
            "resolved_device": resolved_device,
        },
    }
    config = build_smoke_config_summary(
        dataset=dataset,
        manifests_dir=manifests_dir,
        manifest_query_duration=query_duration,
        train_epochs=train_epochs,
        batch_size=batch_size,
        requested_device=device,
        resolved_device=resolved_device,
        base_cfg=base_cfg,
    )
    report_dir.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
......
@@ -2,6 +2,35 @@
## 2026-06-02
### Stage: Explicitly separate the 8s query and 5s training segment semantics in the smoke config

Completed:
- Modified `acr-engine/src/data/external_adapters.py`
  - Added `load_default_training_config()`
  - Added `build_smoke_config_summary()`
- The `config.json` produced by `smoke-local` now explicitly records:
  - `manifest_query_duration`
  - `train_segment_duration`
  - `sample_rate`
  - `n_fft`
  - `hop_length`
  - `query_duration_legacy`
- Updated [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) to explain how the old and new config fields relate

Verification:
- Called `build_smoke_config_summary()` directly and verified the output:
  - `manifest_query_duration = 8.0`
  - `train_segment_duration = 5.0`
  - `requested_device = auto`
  - `resolved_device = cpu`
- The default training config is read from:
  - `configs/default.yaml`
  - where `data.segment_dur = 5.0`

Conclusion:
- The smoke config summary now clearly distinguishes the manifest query duration from the training clip duration
- When report artifacts are later compared across experiments, the 5s and 8s semantics are less likely to be confused
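
The direct-call verification recorded in this stage can be reproduced without the repository checkout. The sketch below copies `build_smoke_config_summary()` from this commit and feeds it an in-memory stand-in for `configs/default.yaml`. Only `segment_dur = 5.0` is taken from the log; the `sample_rate`, `n_fft`, and `hop_length` values, the `train_epochs` and `batch_size` arguments, and the `manifests/fma` path are placeholders assumed for illustration.

```python
from pathlib import Path
from typing import Dict


def build_smoke_config_summary(
    dataset: str,
    manifests_dir: Path,
    manifest_query_duration: float,
    train_epochs: int,
    batch_size: int,
    requested_device: str,
    resolved_device: str,
    base_cfg: Dict,
) -> Dict:
    """Copy of the function added in this commit."""
    return {
        "model": {
            "embed_dim": base_cfg["model"]["embed_dim"],
            "channels": base_cfg["model"]["channels"],
            "n_mels": base_cfg["model"]["n_mels"],
            "use_band_split": base_cfg["model"].get("use_band_split", True),
        },
        "data": {
            "source_dataset": dataset,
            "manifests_dir": str(manifests_dir),
            "manifest_query_duration": manifest_query_duration,
            "train_segment_duration": base_cfg["data"]["segment_dur"],
            "sample_rate": base_cfg["data"]["sample_rate"],
            "n_fft": base_cfg["data"]["n_fft"],
            "hop_length": base_cfg["data"]["hop_length"],
            "query_duration_legacy": manifest_query_duration,
        },
        "run": {
            "train_epochs": train_epochs,
            "batch_size": batch_size,
            "requested_device": requested_device,
            "resolved_device": resolved_device,
        },
    }


# Stand-in for configs/default.yaml. Only segment_dur=5.0 comes from the log;
# every other value here is a placeholder assumed for illustration.
base_cfg = {
    "model": {"embed_dim": 192, "channels": 512, "n_mels": 128},
    "data": {"segment_dur": 5.0, "sample_rate": 16000, "n_fft": 1024, "hop_length": 256},
}

summary = build_smoke_config_summary(
    dataset="fma",
    manifests_dir=Path("manifests/fma"),  # hypothetical path
    manifest_query_duration=8.0,
    train_epochs=1,
    batch_size=4,
    requested_device="auto",
    resolved_device="cpu",
    base_cfg=base_cfg,
)

# The two durations are reported under separate, unambiguous keys.
assert summary["data"]["manifest_query_duration"] == 8.0
assert summary["data"]["train_segment_duration"] == 5.0
# The legacy alias mirrors the manifest value, not the training value.
assert summary["data"]["query_duration_legacy"] == 8.0
# use_band_split falls back to True when the key is absent from the config.
assert summary["model"]["use_band_split"] is True
```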
### Stage: Persist continuous-development preferences and current progress in AGENTS.md

Completed:
......
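@@ -326,7 +326,11 @@ flowchart TD
Explanation:
- The **manifest query duration**, the **training crop duration**, and the **query_duration recorded in the report** currently do not come from exactly the same config source;
- The existing `fma_reports_smoke/config.json` has a timestamp earlier than the latest manifests, which is an experiment-artifact consistency problem that still needs to be addressed;
- The old `fma_reports_smoke/config.json` has a timestamp earlier than the latest manifests, which is a consistency problem with a historical experiment artifact;
- On the code side, the smoke config summary is now being split explicitly into:
  - `manifest_query_duration`
  - `train_segment_duration`
  - `query_duration_legacy`
- Therefore, as the production-hardening work continues, "manifest query duration / train clip duration / eval query duration / report metadata" should all be brought into a single explicit config structure.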
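
One possible shape for such a single explicit config structure is sketched below. `DurationConfig` and `to_report_metadata` are hypothetical names that do not exist in the repository, and setting the eval query duration equal to the manifest query duration is an assumption made only for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DurationConfig:
    """Hypothetical single source of truth for every duration a run uses."""

    manifest_query_duration: float  # length of query clips written to manifests
    train_segment_duration: float   # length of crops fed to the model in training
    eval_query_duration: float      # length of queries used at evaluation time

    def to_report_metadata(self) -> dict:
        """Serialize for config.json, keeping the legacy alias for old readers."""
        meta = asdict(self)
        meta["query_duration_legacy"] = self.manifest_query_duration
        return meta


durations = DurationConfig(
    manifest_query_duration=8.0,
    train_segment_duration=5.0,
    eval_query_duration=8.0,  # assumed equal to the manifest value
)

# Every stage (manifest prep, training, eval, reporting) would read from this
# one object instead of from separate, possibly diverging, config sources.
print(json.dumps(durations.to_report_metadata(), indent=2))
```

Freezing the dataclass is a deliberate choice in this sketch: a run's durations cannot be mutated after construction, so the values written to the report are the values every stage actually saw.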
---
......