Prevent a single bad MP3 from collapsing the whole build-index pipeline
Constraint: Real-path investigation exposed decode failures from mpg123/librosa on some MP3s during long index runs Rejected: Abort the entire job on first decode error | it turns one bad asset into total index failure Confidence: high Scope-risk: narrow Directive: Keep per-file skip logging and skipped_refs accounting while continuing the real-path root-cause run Tested: Verified /tmp/chroma_skip_repro with 1 good MP3 + 1 bad MP3 completes RC=0, logs skip decode failure, writes reference outputs, and records skipped_refs=1 Not-tested: Full real-path FMA rerun after tolerance change is still pending
Showing
6 changed files
with
88 additions
and
4 deletions
| ... | @@ -89,6 +89,7 @@ | ... | @@ -89,6 +89,7 @@ |
| 89 | - 当前仍未出现 `reference_*` 或 `evaluate.py`。 | 89 | - 当前仍未出现 `reference_*` 或 `evaluate.py`。 |
| 90 | - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 | 90 | - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 |
| 91 | - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。 | 91 | - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。 |
| 92 | - 已完成一个高价值容错修复:坏 MP3 / 缺失音频会在 chromaprint/reference 阶段被跳过,并已用 `1 good + 1 bad` 最小复现验证 `RC=0`、`reference_*` 成功产出。 | ||
| 92 | - 下一次值得提交的事件: | 93 | - 下一次值得提交的事件: |
| 93 | 1. 找到明确失败证据/退出原因 | 94 | 1. 找到明确失败证据/退出原因 |
| 94 | 2. 成功小样本复现并补日志 | 95 | 2. 成功小样本复现并补日志 | ... | ... |
| ... | @@ -103,6 +103,7 @@ class ChromaprintMatcher: | ... | @@ -103,6 +103,7 @@ class ChromaprintMatcher: |
| 103 | refs = [item for item in meta if item.get("type") == "reference"] | 103 | refs = [item for item in meta if item.get("type") == "reference"] |
| 104 | total_refs = len(refs) | 104 | total_refs = len(refs) |
| 105 | start_time = time.time() | 105 | start_time = time.time() |
| 106 | skipped_refs = 0 | ||
| 106 | 107 | ||
| 107 | progress_file = Path(progress_path) if progress_path else None | 108 | progress_file = Path(progress_path) if progress_path else None |
| 108 | cache_file = Path(cache_path) if cache_path else None | 109 | cache_file = Path(cache_path) if cache_path else None |
| ... | @@ -121,15 +122,29 @@ class ChromaprintMatcher: | ... | @@ -121,15 +122,29 @@ class ChromaprintMatcher: |
| 121 | "eta_sec": round(eta_sec, 3), | 122 | "eta_sec": round(eta_sec, 3), |
| 122 | "hashes": self.num_hashes, | 123 | "hashes": self.num_hashes, |
| 123 | "postings": self.index_size, | 124 | "postings": self.index_size, |
| 125 | "skipped_refs": skipped_refs, | ||
| 124 | "cache_path": str(cache_file) if cache_file else None, | 126 | "cache_path": str(cache_file) if cache_file else None, |
| 125 | }, indent=2)) | 127 | }, indent=2)) |
| 126 | 128 | ||
| 127 | for ref_idx, item in enumerate(refs, start=1): | 129 | for ref_idx, item in enumerate(refs, start=1): |
| 128 | audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) | 130 | audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) |
| 129 | if not audio_path.exists(): | 131 | if not audio_path.exists(): |
| 132 | skipped_refs += 1 | ||
| 133 | print( | ||
| 134 | f"[chromaprint-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}", | ||
| 135 | flush=True, | ||
| 136 | ) | ||
| 130 | continue | 137 | continue |
| 131 | song_id = item["song_id"] | 138 | song_id = item["song_id"] |
| 132 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) | 139 | try: |
| 140 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) | ||
| 141 | except Exception as exc: | ||
| 142 | skipped_refs += 1 | ||
| 143 | print( | ||
| 144 | f"[chromaprint-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}", | ||
| 145 | flush=True, | ||
| 146 | ) | ||
| 147 | continue | ||
| 133 | self.index_song(song_id, y) | 148 | self.index_song(song_id, y) |
| 134 | if ref_idx == 1 or ref_idx == total_refs or (checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0): | 149 | if ref_idx == 1 or ref_idx == total_refs or (checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0): |
| 135 | elapsed = max(time.time() - start_time, 1e-6) | 150 | elapsed = max(time.time() - start_time, 1e-6) |
| ... | @@ -138,7 +153,7 @@ class ChromaprintMatcher: | ... | @@ -138,7 +153,7 @@ class ChromaprintMatcher: |
| 138 | print( | 153 | print( |
| 139 | f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " | 154 | f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " |
| 140 | f"hashes={self.num_hashes} postings={self.index_size} " | 155 | f"hashes={self.num_hashes} postings={self.index_size} " |
| 141 | f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" | 156 | f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}" |
| 142 | , flush=True) | 157 | , flush=True) |
| 143 | if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: | 158 | if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: |
| 144 | if cache_file is not None: | 159 | if cache_file is not None: | ... | ... |
| ... | @@ -131,6 +131,7 @@ class ECAPAEmbedder: | ... | @@ -131,6 +131,7 @@ class ECAPAEmbedder: |
| 131 | final_embs_path = Path(f"{output_path}_embs.npy") | 131 | final_embs_path = Path(f"{output_path}_embs.npy") |
| 132 | final_ids_path = Path(f"{output_path}_ids.npy") | 132 | final_ids_path = Path(f"{output_path}_ids.npy") |
| 133 | refs_done = 0 | 133 | refs_done = 0 |
| 134 | skipped_refs = 0 | ||
| 134 | 135 | ||
| 135 | if resume and final_embs_path.exists() and final_ids_path.exists(): | 136 | if resume and final_embs_path.exists() and final_ids_path.exists(): |
| 136 | print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True) | 137 | print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True) |
| ... | @@ -191,6 +192,7 @@ class ECAPAEmbedder: | ... | @@ -191,6 +192,7 @@ class ECAPAEmbedder: |
| 191 | "device": self.device.type, | 192 | "device": self.device.type, |
| 192 | "window_sec": window_sec, | 193 | "window_sec": window_sec, |
| 193 | "stride_sec": stride_sec, | 194 | "stride_sec": stride_sec, |
| 195 | "skipped_refs": skipped_refs, | ||
| 194 | "model_signature": self.model_signature, | 196 | "model_signature": self.model_signature, |
| 195 | "partial_embs_path": str(partial_embs_path), | 197 | "partial_embs_path": str(partial_embs_path), |
| 196 | "partial_ids_path": str(partial_ids_path), | 198 | "partial_ids_path": str(partial_ids_path), |
| ... | @@ -207,6 +209,7 @@ class ECAPAEmbedder: | ... | @@ -207,6 +209,7 @@ class ECAPAEmbedder: |
| 207 | "device": self.device.type, | 209 | "device": self.device.type, |
| 208 | "window_sec": window_sec, | 210 | "window_sec": window_sec, |
| 209 | "stride_sec": stride_sec, | 211 | "stride_sec": stride_sec, |
| 212 | "skipped_refs": skipped_refs, | ||
| 210 | "model_signature": self.model_signature, | 213 | "model_signature": self.model_signature, |
| 211 | "final_embs_path": str(final_embs_path), | 214 | "final_embs_path": str(final_embs_path), |
| 212 | "final_ids_path": str(final_ids_path), | 215 | "final_ids_path": str(final_ids_path), |
| ... | @@ -222,9 +225,22 @@ class ECAPAEmbedder: | ... | @@ -222,9 +225,22 @@ class ECAPAEmbedder: |
| 222 | for ref_idx, item in enumerate(refs[refs_done:], start=refs_done + 1): | 225 | for ref_idx, item in enumerate(refs[refs_done:], start=refs_done + 1): |
| 223 | audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) | 226 | audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) |
| 224 | if not audio_path.exists(): | 227 | if not audio_path.exists(): |
| 228 | skipped_refs += 1 | ||
| 229 | print( | ||
| 230 | f"[build-reference-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}", | ||
| 231 | flush=True, | ||
| 232 | ) | ||
| 225 | continue | 233 | continue |
| 226 | song_id = item["song_id"] | 234 | song_id = item["song_id"] |
| 227 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) | 235 | try: |
| 236 | y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) | ||
| 237 | except Exception as exc: | ||
| 238 | skipped_refs += 1 | ||
| 239 | print( | ||
| 240 | f"[build-reference-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}", | ||
| 241 | flush=True, | ||
| 242 | ) | ||
| 243 | continue | ||
| 228 | windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec) | 244 | windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec) |
| 229 | for seg in windows: | 245 | for seg in windows: |
| 230 | mel = self._to_mel(seg).to(self.device) | 246 | mel = self._to_mel(seg).to(self.device) |
| ... | @@ -238,7 +254,7 @@ class ECAPAEmbedder: | ... | @@ -238,7 +254,7 @@ class ECAPAEmbedder: |
| 238 | eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0 | 254 | eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0 |
| 239 | print( | 255 | print( |
| 240 | f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " | 256 | f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " |
| 241 | f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" | 257 | f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}" |
| 242 | , flush=True) | 258 | , flush=True) |
| 243 | if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): | 259 | if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): |
| 244 | write_checkpoint(ref_idx) | 260 | write_checkpoint(ref_idx) | ... | ... |
| 1 | ## 2026-06-02 15:22 UTC / bad-mp3 skip tolerance verified | ||
| 2 | |||
| 3 | - 为 `chromaprint_matcher.py` 与 `ecapa_embedder.py` 的 reference 建索引循环增加单文件容错: | ||
| 4 | - missing audio -> warning + skip | ||
| 5 | - decode failure -> warning + skip | ||
| 6 | - progress JSON 新增字段:`skipped_refs` | ||
| 7 | - 最小复现验证:`/tmp/chroma_skip_repro` | ||
| 8 | - 1 个正常 mp3 + 1 个损坏 mp3 | ||
| 9 | - `RC=0` | ||
| 10 | - chromaprint 阶段:`skipped_refs=1` | ||
| 11 | - reference 阶段:`skipped_refs=1` | ||
| 12 | - 成功产出: | ||
| 13 | - `reference_embs.npy` | ||
| 14 | - `reference_ids.npy` | ||
| 15 | - `reference_progress.json` | ||
| 16 | - 结论:当前已修复“单个坏 MP3 拖垮整轮 build-index”的高概率故障模式 | ||
| 17 | |||
| 1 | ## 2026-06-02 15:18 UTC / build-index log flush hardening | 18 | ## 2026-06-02 15:18 UTC / build-index log flush hardening |
| 2 | 19 | ||
| 3 | - 为 `run_demo.py`、`chromaprint_matcher.py`、`ecapa_embedder.py` 的关键 `print()` 增加 `flush=True` | 20 | - 为 `run_demo.py`、`chromaprint_matcher.py`、`ecapa_embedder.py` 的关键 `print()` 增加 `flush=True` | ... | ... |
| ... | @@ -68,3 +68,29 @@ | ... | @@ -68,3 +68,29 @@ |
| 68 | 68 | ||
| 69 | - 当前已修复“失败时日志完全不可见”的可观测性问题。 | 69 | - 当前已修复“失败时日志完全不可见”的可观测性问题。 |
| 70 | - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。 | 70 | - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。 |
| 71 | |||
| 72 | ## 本次追加交付(2026-06-02 15:22 UTC) | ||
| 73 | |||
| 74 | ### 新增代码修复 | ||
| 75 | |||
| 76 | | 文件 | 变更 | | ||
| 77 | |---|---| | ||
| 78 | | [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | 坏音频/缺失音频跳过;progress 增加 `skipped_refs` | | ||
| 79 | | [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | 坏音频/缺失音频跳过;progress 增加 `skipped_refs` | | ||
| 80 | |||
| 81 | ### 新增验证证据 | ||
| 82 | |||
| 83 | - 最小容错复现:`/tmp/chroma_skip_repro` | ||
| 84 | - 输入:`1 good mp3 + 1 bad mp3` | ||
| 85 | - 结果:`RC=0` | ||
| 86 | - 验证点: | ||
| 87 | - 日志可见 `skip decode failure` | ||
| 88 | - `chromaprint_progress.json` 为 `status=complete` | ||
| 89 | - `reference_progress.json` 为 `status=complete` | ||
| 90 | - 两个 progress 文件都记录 `skipped_refs=1` | ||
| 91 | - 最终成功产出 `reference_embs.npy` / `reference_ids.npy` | ||
| 92 | |||
| 93 | ### 结论 | ||
| 94 | |||
| 95 | - 当前已验证:单个坏 MP3 不再拖垮整轮 `build-index`。 | ||
| 96 | - 下一轮应回到真实路径复现,确认主问题是否就是由坏 MP3 触发。 | ... | ... |
| ... | @@ -49,6 +49,15 @@ | ... | @@ -49,6 +49,15 @@ |
| 49 | - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes`。 | 49 | - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes`。 |
| 50 | - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。 | 50 | - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。 |
| 51 | 51 | ||
| 52 | #### 已完成的坏音频容错修复 | ||
| 53 | - 已为 chromaprint/reference 两个建索引阶段增加单文件容错:坏 MP3 / 缺失音频会被记录并跳过。 | ||
| 54 | - 最小复现 `/tmp/chroma_skip_repro` 已验证: | ||
| 55 | - `RC=0` | ||
| 56 | - `skip decode failure` 日志可见 | ||
| 57 | - `reference_embs.npy` / `reference_ids.npy` 成功产出 | ||
| 58 | - progress 中记录 `skipped_refs=1` | ||
| 59 | - 这说明:单个坏 MP3 不再拖垮整轮 `build-index`。 | ||
| 60 | |||
| 52 | ## 新 session 接管顺序 | 61 | ## 新 session 接管顺序 |
| 53 | 62 | ||
| 54 | 1. 先看 [./session-handoff.md](./session-handoff.md) | 63 | 1. 先看 [./session-handoff.md](./session-handoff.md) | ... | ... |
-
Please register or sign in to post a comment