Commit 4ceaa995 4ceaa995820cda86de67bbdb1881c26b0d142f46 by cnb.bofCdSsphPA

Resume smoke indexing safely without mixing model generations

Constraint: smoke-local must recover long CPU index builds automatically, but partial embeddings from an older model must never contaminate a newly trained index
Rejected: Always reuse any existing partial checkpoint | can silently blend embeddings from different model generations into one index
Confidence: high
Scope-risk: moderate
Directive: Keep model-signature checks on all future index resume paths; auto-resume should fall back to clean rebuild on any signature mismatch
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/src/engines/ecapa_embedder.py acr-engine/src/data/external_adapters.py acr-engine/run_demo.py; same-model partial checkpoint resume vs fresh rebuild equality; mismatched-model checkpoint rejection and clean rebuild equality
Not-tested: Reattaching the currently running real FMA smoke process after an external interruption
1 parent e45896b7
......@@ -372,6 +372,7 @@ def smoke_local_dataset(
query_strategy: str,
segment_strategy: str,
silence_top_db: int,
index_checkpoint_every_refs: int,
seed: int,
train_epochs: int,
batch_size: int,
......@@ -432,6 +433,8 @@ def smoke_local_dataset(
"--model", str(model_dir / "best_model.pt"),
"--output", str(index_dir),
"--device", resolved_device,
"--resume",
"--checkpoint-every-refs", str(index_checkpoint_every_refs),
], check=True)
report_dir.mkdir(parents=True, exist_ok=True)
......@@ -461,6 +464,8 @@ def smoke_local_dataset(
config["data"]["manifest_query_stride"] = query_stride
config["data"]["manifest_query_strategy"] = query_strategy
config["data"]["silence_top_db"] = silence_top_db
config["run"]["index_checkpoint_every_refs"] = index_checkpoint_every_refs
config["run"]["index_resume_enabled"] = True
config["run"]["train_segment_strategy"] = segment_strategy
report_dir.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps(config, indent=2))
......@@ -546,6 +551,7 @@ def main():
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--segment-strategy", choices=["random", "silence_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--index-checkpoint-every-refs", type=int, default=100)
p.add_argument("--seed", type=int, default=42)
p.add_argument("--train-epochs", type=int, default=1)
p.add_argument("--batch-size", type=int, default=2)
......@@ -605,6 +611,7 @@ def main():
query_strategy=args.query_strategy,
segment_strategy=args.segment_strategy,
silence_top_db=args.silence_top_db,
index_checkpoint_every_refs=args.index_checkpoint_every_refs,
seed=args.seed,
train_epochs=args.train_epochs,
batch_size=args.batch_size,
......
......@@ -19,10 +19,12 @@ class ECAPAEmbedder:
hop_length: int = 160,
):
self.device = torch.device(device)
self.model_path = Path(model_path)
self.sr = sr
self.n_mels = n_mels
self.n_fft = n_fft
self.hop_length = hop_length
self.model_signature = self._build_model_signature(self.model_path)
from src.models.ecapa_tdnn import ECAPA_ACR
......@@ -54,6 +56,14 @@ class ECAPAEmbedder:
y, _ = librosa.load(path, sr=self.sr, mono=True)
return y
def _build_model_signature(self, model_path: Path) -> dict:
stat = model_path.stat()
return {
"path": str(model_path),
"size_bytes": int(stat.st_size),
"mtime_ns": int(stat.st_mtime_ns),
}
def _resolve_audio_path(self, songs_dir: Path, rel_path: str) -> Path:
candidate = songs_dir / rel_path
if candidate.exists():
......@@ -131,6 +141,11 @@ class ECAPAEmbedder:
if resume and progress_path.exists() and partial_embs_path.exists() and partial_ids_path.exists():
try:
progress = json.loads(progress_path.read_text())
progress_sig = progress.get("model_signature")
if progress_sig and progress_sig != self.model_signature:
raise ValueError(
f"model signature mismatch: checkpoint={progress_sig} current={self.model_signature}"
)
refs_done = int(progress.get("refs_done", 0) or 0)
partial_embs = np.load(partial_embs_path)
partial_ids = np.load(partial_ids_path, allow_pickle=True).tolist()
......@@ -145,6 +160,12 @@ class ECAPAEmbedder:
refs_done = 0
all_embs = []
all_ids = []
for stale_path in (partial_embs_path, partial_ids_path):
try:
if stale_path.exists():
stale_path.unlink()
except OSError:
pass
print(
f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
......@@ -170,6 +191,7 @@ class ECAPAEmbedder:
"device": self.device.type,
"window_sec": window_sec,
"stride_sec": stride_sec,
"model_signature": self.model_signature,
"partial_embs_path": str(partial_embs_path),
"partial_ids_path": str(partial_ids_path),
}, indent=2))
......@@ -185,6 +207,7 @@ class ECAPAEmbedder:
"device": self.device.type,
"window_sec": window_sec,
"stride_sec": stride_sec,
"model_signature": self.model_signature,
"final_embs_path": str(final_embs_path),
"final_ids_path": str(final_ids_path),
"embedding_shape": list(emb_shape),
......
......@@ -5482,3 +5482,43 @@
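- Long-running CPU `build-index` jobs can now resume from a partial checkpoint even after an interruption
- This recovery logic now has fresh evidence that "the resumed result is exactly identical to a fresh rebuild"
- The next step is to wire this resume capability into the auto-recovery strategy of `smoke-local`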
### Stage: smoke-local auto resume + model-signature safety gate
Completed:
- In `acr-engine/src/engines/ecapa_embedder.py`, added a `model_signature` to the index checkpoint:
  - `path`
  - `size_bytes`
  - `mtime_ns`
- On resume, if the checkpoint's `model_signature` does not match the current `best_model.pt`:
  - the old checkpoint is rejected automatically
  - the old partial files are cleaned up
  - the embedding index is rebuilt from 0
- In `acr-engine/src/data/external_adapters.py`, `smoke-local` now enables by default:
  - `run_demo.py build-index --resume`
  - `--checkpoint-every-refs`
- In [docs/open-dataset-workflow.md](./open-dataset-workflow.md), added a note on the model-signature guardrail
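
The signature gate described above can be sketched as follows. This is a minimal illustration, not the code from `ecapa_embedder.py`: the helper names `build_model_signature` and `can_resume`, the file names in the demo, and the choice to treat a checkpoint with no recorded signature as a mismatch are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def build_model_signature(model_path: Path) -> dict:
    """Cheap file-identity signature: path, size and mtime (no content hash)."""
    stat = model_path.stat()
    return {
        "path": str(model_path),
        "size_bytes": int(stat.st_size),
        "mtime_ns": int(stat.st_mtime_ns),
    }


def can_resume(progress_path: Path, current_sig: dict, partial_paths: list) -> bool:
    """Return True only if the checkpoint was written for the same model file.

    On any mismatch the stale partial files are deleted, so the caller
    rebuilds the index from 0.
    """
    if not progress_path.exists():
        return False
    try:
        progress = json.loads(progress_path.read_text())
    except ValueError:
        progress = {}  # unreadable progress file: treat as a mismatch
    # Assumption: a missing signature also counts as a mismatch here.
    if progress.get("model_signature") == current_sig:
        return True
    for stale in partial_paths:
        stale.unlink(missing_ok=True)
    return False


# Demo with hypothetical file names inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    model = root / "best_model.pt"
    model.write_bytes(b"weights-v1")
    partial = root / "partial_embs.npy"
    partial.write_bytes(b"placeholder")
    progress = root / "progress.json"
    progress.write_text(json.dumps({
        "refs_done": 2,
        "model_signature": build_model_signature(model),
    }))

    same_model_ok = can_resume(progress, build_model_signature(model), [partial])

    # Retraining rewrites the model file; the size differs, so the signature changes.
    model.write_bytes(b"retrained-weights-v2")
    other_model_ok = can_resume(progress, build_model_signature(model), [partial])
    partial_removed = not partial.exists()
```

A signature built from path, size and mtime is cheap to compute but only identifies the file, not its contents; a content hash would be stricter at the cost of reading the whole model file.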
Verification results:
- Compile check:
  - `/usr/local/miniconda3/bin/python -m py_compile src/engines/ecapa_embedder.py src/data/external_adapters.py run_demo.py`
- Same-model resume check (`models_v6 -> models_v6`):
  - Manually constructed a partial checkpoint covering the first `2` references
  - Log shows:
    - `[build-reference-index] resuming from checkpoint: refs_done=2/24 windows_done=10`
  - Compared with a fresh rebuild:
    - `same_final_ids_equal == True`
    - `same_final_embs_equal == True`
    - `same_progress_status == complete`
- Cross-model resume rejection check (`models_v6 partial -> models_v5 current`):
  - Log shows:
    - `resume checkpoint ignored due to load failure: model signature mismatch`
  - The index was then rebuilt from 0:
    - `start: refs=24 ... resume=True refs_done=0`
  - Compared with a `models_v5` fresh rebuild:
    - `mismatch_final_ids_equal == True`
    - `mismatch_final_embs_equal == True`
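
The "resumed result equals fresh rebuild" check above can be reproduced in miniature. The sketch below is a toy, not the project's test: `toy_embed`, `build_index`, the `partial.json` checkpoint file and the `stop_after` interruption switch are hypothetical stand-ins, and each reference yields a single embedding rather than several windows.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(ref_id: str) -> list:
    """Deterministic stand-in for a real embedder (hypothetical)."""
    return [float(len(ref_id)), float(sum(ref_id.encode()) % 97)]


def build_index(refs, out_dir: Path, checkpoint_every: int, resume: bool, stop_after=None):
    """Embed refs in order, writing a checkpoint every `checkpoint_every` refs.

    `stop_after` simulates an interruption once that many refs are done.
    Returns (ids, embs) when the build completes, otherwise None.
    """
    ckpt = out_dir / "partial.json"
    ids, embs = [], []
    if resume and ckpt.exists():
        state = json.loads(ckpt.read_text())
        ids, embs = state["ids"], state["embs"]
    for i in range(len(ids), len(refs)):
        if stop_after is not None and i >= stop_after:
            return None  # interrupted; the last checkpoint stays on disk
        ids.append(refs[i])
        embs.append(toy_embed(refs[i]))
        if (i + 1) % checkpoint_every == 0:
            ckpt.write_text(json.dumps({"ids": ids, "embs": embs}))
    return ids, embs


refs = [f"track_{n:02d}" for n in range(24)]
with tempfile.TemporaryDirectory() as fresh_dir, tempfile.TemporaryDirectory() as resume_dir:
    fresh = build_index(refs, Path(fresh_dir), checkpoint_every=2, resume=False)
    # First attempt stops after 5 refs; only the checkpoint at 4 refs survives.
    interrupted = build_index(refs, Path(resume_dir), checkpoint_every=2, resume=True, stop_after=5)
    resumed = build_index(refs, Path(resume_dir), checkpoint_every=2, resume=True)

same_final_ids_equal = resumed[0] == fresh[0]
same_final_embs_equal = resumed[1] == fresh[1]
```

Work done after the last checkpoint is recomputed on resume, which is why a smaller `checkpoint_every` trades more checkpoint writes for less repeated embedding.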
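Conclusion:
- `smoke-local` now has safe auto-resume: it can recover an interrupted build, but it will not mistakenly reuse embeddings from an older model
- This matters most for long-running CPU jobs such as the real FMA run: a restart resumes where it left off, and switching models does not contaminate the index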
......
......@@ -81,6 +81,11 @@ flowchart LR
--device cpu \
--resume \
--checkpoint-every-refs 100
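Notes:
- `smoke-local` now also enables `--resume` for `build-index` internally by default
- The checkpoint records a `model_signature`
- If the `best_model.pt` produced by this training run is not the same model as the one behind the old partial checkpoint, the resume is rejected automatically and the index is rebuilt from 0, so embeddings from different models are never mixed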
/usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json
/usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local
```
......