Commit 707449b80170541b2d62065e7d0037323d80799b by cnb.bofCdSsphPA

Prevent a single bad MP3 from collapsing the whole build-index pipeline

Constraint: Real-path investigation exposed decode failures from mpg123/librosa on some MP3s during long index runs
Rejected: Abort the entire job on first decode error | it turns one bad asset into total index failure
Confidence: high
Scope-risk: narrow
Directive: Keep per-file skip logging and skipped_refs accounting while continuing the real-path root-cause run
Tested: Verified /tmp/chroma_skip_repro with 1 good MP3 + 1 bad MP3 completes RC=0, logs skip decode failure, writes reference outputs, and records skipped_refs=1
Not-tested: Full real-path FMA rerun after tolerance change is still pending
1 parent 6ece1fa7
@@ -89,6 +89,7 @@
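- Neither `reference_*` nor `evaluate.py` has appeared yet.
- The next round of work must therefore shift to **investigating the abnormal build-index exit**, rather than continuing to treat it as a purely linear slow task.
- One low-risk fix is done: key `print()` calls now pass `flush=True`, and a tiny-sample `RC=1` failure repro confirmed that logs and tracebacks reach disk in real time, so the `0 bytes` log black box no longer occurs.
- One high-value fault-tolerance fix is done: bad MP3s / missing audio are skipped in the chromaprint/reference stages, and a `1 good + 1 bad` minimal repro confirmed `RC=0` and that `reference_*` outputs are produced.
- Events worth committing next:
  1. Finding clear failure evidence / the exit reason
  2. Reproducing the failure on a small sample and adding logging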
......
@@ -103,6 +103,7 @@ class ChromaprintMatcher:
refs = [item for item in meta if item.get("type") == "reference"]
total_refs = len(refs)
start_time = time.time()
skipped_refs = 0
progress_file = Path(progress_path) if progress_path else None
cache_file = Path(cache_path) if cache_path else None
@@ -121,15 +122,29 @@ class ChromaprintMatcher:
"eta_sec": round(eta_sec, 3),
"hashes": self.num_hashes,
"postings": self.index_size,
"skipped_refs": skipped_refs,
"cache_path": str(cache_file) if cache_file else None,
}, indent=2))
for ref_idx, item in enumerate(refs, start=1):
audio_path = self._resolve_audio_path(songs_dir, item["audio_path"])
if not audio_path.exists():
skipped_refs += 1
print(
f"[chromaprint-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}",
flush=True,
)
continue
song_id = item["song_id"]
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
try:
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
except Exception as exc:
skipped_refs += 1
print(
f"[chromaprint-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}",
flush=True,
)
continue
self.index_song(song_id, y)
if ref_idx == 1 or ref_idx == total_refs or (checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0):
elapsed = max(time.time() - start_time, 1e-6)
@@ -138,7 +153,7 @@ class ChromaprintMatcher:
print(
f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} "
f"hashes={self.num_hashes} postings={self.index_size} "
f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}"
, flush=True)
if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0:
if cache_file is not None:
......
@@ -131,6 +131,7 @@ class ECAPAEmbedder:
final_embs_path = Path(f"{output_path}_embs.npy")
final_ids_path = Path(f"{output_path}_ids.npy")
refs_done = 0
skipped_refs = 0
if resume and final_embs_path.exists() and final_ids_path.exists():
print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True)
@@ -191,6 +192,7 @@ class ECAPAEmbedder:
"device": self.device.type,
"window_sec": window_sec,
"stride_sec": stride_sec,
"skipped_refs": skipped_refs,
"model_signature": self.model_signature,
"partial_embs_path": str(partial_embs_path),
"partial_ids_path": str(partial_ids_path),
@@ -207,6 +209,7 @@ class ECAPAEmbedder:
"device": self.device.type,
"window_sec": window_sec,
"stride_sec": stride_sec,
"skipped_refs": skipped_refs,
"model_signature": self.model_signature,
"final_embs_path": str(final_embs_path),
"final_ids_path": str(final_ids_path),
@@ -222,9 +225,22 @@ class ECAPAEmbedder:
for ref_idx, item in enumerate(refs[refs_done:], start=refs_done + 1):
audio_path = self._resolve_audio_path(songs_dir, item["audio_path"])
if not audio_path.exists():
skipped_refs += 1
print(
f"[build-reference-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}",
flush=True,
)
continue
song_id = item["song_id"]
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
try:
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
except Exception as exc:
skipped_refs += 1
print(
f"[build-reference-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}",
flush=True,
)
continue
windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec)
for seg in windows:
mel = self._to_mel(seg).to(self.device)
@@ -238,7 +254,7 @@ class ECAPAEmbedder:
eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
print(
f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}"
, flush=True)
if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs):
write_checkpoint(ref_idx)
......
## 2026-06-02 15:22 UTC / bad-mp3 skip tolerance verified
- Added per-file fault tolerance to the reference index-building loops in `chromaprint_matcher.py` and `ecapa_embedder.py`:
  - missing audio -> warning + skip
  - decode failure -> warning + skip
- New field in the progress JSON: `skipped_refs`
- Minimal repro verification: `/tmp/chroma_skip_repro`
  - 1 valid mp3 + 1 corrupted mp3
  - `RC=0`
  - chromaprint stage: `skipped_refs=1`
  - reference stage: `skipped_refs=1`
  - Successfully produced:
    - `reference_embs.npy`
    - `reference_ids.npy`
    - `reference_progress.json`
- Conclusion: the high-probability failure mode "a single bad MP3 brings down the entire build-index run" is now fixed
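
The skip-and-count pattern described in this entry can be sketched in isolation as follows. This is a minimal illustration, not the code from `chromaprint_matcher.py` or `ecapa_embedder.py`: the function name `index_with_skips` and the injected `load_audio` / `index_one` callables are hypothetical stand-ins for `librosa.load` and the real indexing step.

```python
from pathlib import Path
from typing import Callable, Iterable


def index_with_skips(
    refs: Iterable[dict],
    load_audio: Callable[[Path], object],      # hypothetical stand-in for librosa.load
    index_one: Callable[[str, object], None],  # hypothetical stand-in for the indexing step
) -> int:
    """Index every reference; skip missing or undecodable files.

    Returns the number of skipped references.
    """
    skipped_refs = 0
    for item in refs:
        audio_path = Path(item["audio_path"])
        song_id = item.get("song_id")
        if not audio_path.exists():
            skipped_refs += 1
            print(f"[index] skip missing audio: song_id={song_id} path={audio_path}", flush=True)
            continue
        try:
            y = load_audio(audio_path)
        except Exception as exc:  # broad on purpose: decoder errors differ by backend
            skipped_refs += 1
            print(f"[index] skip decode failure: song_id={song_id} path={audio_path} error={exc}", flush=True)
            continue
        index_one(song_id, y)
    return skipped_refs
```

Catching `Exception` rather than a specific decoder error is a deliberate trade-off in this sketch: it keeps a long run alive at the cost of possibly hiding unrelated bugs, which is why each skip is logged and counted.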
## 2026-06-02 15:18 UTC / build-index log flush hardening
- Added `flush=True` to key `print()` calls in `run_demo.py`, `chromaprint_matcher.py`, and `ecapa_embedder.py`
......
@@ -68,3 +68,29 @@
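- The observability problem "logs are completely invisible on failure" is now fixed.
- The next round of root-cause investigation can rely directly on real-time logs instead of waiting blind.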
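
Why `flush=True` matters here can be shown with a small sketch. When stdout is redirected to a file it is typically block-buffered, so output can sit in memory and be lost if the process dies abruptly. The `log` helper below is a hypothetical wrapper, not a function from this repository.

```python
import sys


def log(msg: str, stream=None) -> None:
    """Print a line and flush immediately so it reaches the file right away."""
    target = sys.stdout if stream is None else stream
    # flush=True forces the buffered text out of the Python-level buffer now,
    # instead of waiting for the buffer to fill or the process to exit cleanly.
    print(msg, file=target, flush=True)
```

Calling `log("[build-index] start")` at each key step means a redirected log file shows progress up to the last flushed line even if the job later exits with a non-zero return code.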
## Additional deliverables in this round (2026-06-02 15:22 UTC)
### New code fixes
| File | Change |
|---|---|
| [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | Skip bad/missing audio; add `skipped_refs` to progress |
| [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | Skip bad/missing audio; add `skipped_refs` to progress |
### New verification evidence
- Minimal fault-tolerance repro: `/tmp/chroma_skip_repro`
  - Input: `1 good mp3 + 1 bad mp3`
  - Result: `RC=0`
  - Checks:
    - `skip decode failure` is visible in the log
    - `chromaprint_progress.json`: `status=complete`
    - `reference_progress.json`: `status=complete`
    - Both progress files record `skipped_refs=1`
    - `reference_embs.npy` / `reference_ids.npy` are produced at the end
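
The progress-file checks listed above could be automated with a sketch like the following. It assumes only the two keys named in this document (`status` and `skipped_refs`); the helper name `summarize_progress` is hypothetical, and the rest of the progress JSON layout is not assumed.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def summarize_progress(path: Path) -> Tuple[Optional[str], int]:
    """Return (status, skipped_refs) from a progress JSON file.

    Missing keys fall back to None and 0 so older progress files still load.
    """
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("status"), int(data.get("skipped_refs", 0))
```

For the repro described here, one would expect `summarize_progress(Path("chromaprint_progress.json"))` and the `reference_progress.json` equivalent to both return `("complete", 1)`, assuming the files sit in the current working directory.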
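### Conclusion
- Verified: a single bad MP3 no longer brings down the entire `build-index` run.
- The next round should return to the real-path repro and confirm whether the main problem is in fact triggered by a bad MP3.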
......
@@ -49,6 +49,15 @@
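- The tiny sample `/tmp/chroma_repro_tiny12` has verified that on failure, logs and tracebacks reach disk in real time and no longer stay at `0 bytes`.
- This means the next session can use the logs as first-hand evidence during the investigation instead of facing a black box.
#### Completed bad-audio fault-tolerance fix
- Per-file fault tolerance has been added to both index-building stages (chromaprint and reference): bad MP3s / missing audio are logged and skipped.
- The minimal repro `/tmp/chroma_skip_repro` has verified:
  - `RC=0`
  - the `skip decode failure` log line is visible
  - `reference_embs.npy` / `reference_ids.npy` are produced
  - progress records `skipped_refs=1`
- This shows that a single bad MP3 no longer brings down the entire `build-index` run.
## Takeover order for a new session
1. Start with [./session-handoff.md](./session-handoff.md)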
......