Commit 707449b8 707449b80170541b2d62065e7d0037323d80799b by cnb.bofCdSsphPA

Prevent a single bad MP3 from collapsing the whole build-index pipeline

Constraint: Real-path investigation exposed decode failures from mpg123/librosa on some MP3s during long index runs
Rejected: Abort the entire job on first decode error | it turns one bad asset into total index failure
Confidence: high
Scope-risk: narrow
Directive: Keep per-file skip logging and skipped_refs accounting while continuing the real-path root-cause run
Tested: Verified /tmp/chroma_skip_repro with 1 good MP3 + 1 bad MP3 completes RC=0, logs skip decode failure, writes reference outputs, and records skipped_refs=1
Not-tested: Full real-path FMA rerun after tolerance change is still pending
1 parent 6ece1fa7
...@@ -89,6 +89,7 @@ ...@@ -89,6 +89,7 @@
89 - 当前仍未出现 `reference_*``evaluate.py` 89 - 当前仍未出现 `reference_*``evaluate.py`
90 - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 90 - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。
91 - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。 91 - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。
92 - 已完成一个高价值容错修复:坏 MP3 / 缺失音频会在 chromaprint/reference 阶段被跳过,并已用 `1 good + 1 bad` 最小复现验证 `RC=0``reference_*` 成功产出。
92 - 下一次值得提交的事件: 93 - 下一次值得提交的事件:
93 1. 找到明确失败证据/退出原因 94 1. 找到明确失败证据/退出原因
94 2. 成功小样本复现并补日志 95 2. 成功小样本复现并补日志
......
...@@ -103,6 +103,7 @@ class ChromaprintMatcher: ...@@ -103,6 +103,7 @@ class ChromaprintMatcher:
103 refs = [item for item in meta if item.get("type") == "reference"] 103 refs = [item for item in meta if item.get("type") == "reference"]
104 total_refs = len(refs) 104 total_refs = len(refs)
105 start_time = time.time() 105 start_time = time.time()
106 skipped_refs = 0
106 107
107 progress_file = Path(progress_path) if progress_path else None 108 progress_file = Path(progress_path) if progress_path else None
108 cache_file = Path(cache_path) if cache_path else None 109 cache_file = Path(cache_path) if cache_path else None
...@@ -121,15 +122,29 @@ class ChromaprintMatcher: ...@@ -121,15 +122,29 @@ class ChromaprintMatcher:
121 "eta_sec": round(eta_sec, 3), 122 "eta_sec": round(eta_sec, 3),
122 "hashes": self.num_hashes, 123 "hashes": self.num_hashes,
123 "postings": self.index_size, 124 "postings": self.index_size,
125 "skipped_refs": skipped_refs,
124 "cache_path": str(cache_file) if cache_file else None, 126 "cache_path": str(cache_file) if cache_file else None,
125 }, indent=2)) 127 }, indent=2))
126 128
127 for ref_idx, item in enumerate(refs, start=1): 129 for ref_idx, item in enumerate(refs, start=1):
128 audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) 130 audio_path = self._resolve_audio_path(songs_dir, item["audio_path"])
129 if not audio_path.exists(): 131 if not audio_path.exists():
132 skipped_refs += 1
133 print(
134 f"[chromaprint-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}",
135 flush=True,
136 )
130 continue 137 continue
131 song_id = item["song_id"] 138 song_id = item["song_id"]
139 try:
132 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) 140 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
141 except Exception as exc:
142 skipped_refs += 1
143 print(
144 f"[chromaprint-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}",
145 flush=True,
146 )
147 continue
133 self.index_song(song_id, y) 148 self.index_song(song_id, y)
134 if ref_idx == 1 or ref_idx == total_refs or (checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0): 149 if ref_idx == 1 or ref_idx == total_refs or (checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0):
135 elapsed = max(time.time() - start_time, 1e-6) 150 elapsed = max(time.time() - start_time, 1e-6)
...@@ -138,7 +153,7 @@ class ChromaprintMatcher: ...@@ -138,7 +153,7 @@ class ChromaprintMatcher:
138 print( 153 print(
139 f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " 154 f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} "
140 f"hashes={self.num_hashes} postings={self.index_size} " 155 f"hashes={self.num_hashes} postings={self.index_size} "
141 f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" 156 f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}"
142 , flush=True) 157 , flush=True)
143 if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: 158 if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0:
144 if cache_file is not None: 159 if cache_file is not None:
......
...@@ -131,6 +131,7 @@ class ECAPAEmbedder: ...@@ -131,6 +131,7 @@ class ECAPAEmbedder:
131 final_embs_path = Path(f"{output_path}_embs.npy") 131 final_embs_path = Path(f"{output_path}_embs.npy")
132 final_ids_path = Path(f"{output_path}_ids.npy") 132 final_ids_path = Path(f"{output_path}_ids.npy")
133 refs_done = 0 133 refs_done = 0
134 skipped_refs = 0
134 135
135 if resume and final_embs_path.exists() and final_ids_path.exists(): 136 if resume and final_embs_path.exists() and final_ids_path.exists():
136 print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True) 137 print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True)
...@@ -191,6 +192,7 @@ class ECAPAEmbedder: ...@@ -191,6 +192,7 @@ class ECAPAEmbedder:
191 "device": self.device.type, 192 "device": self.device.type,
192 "window_sec": window_sec, 193 "window_sec": window_sec,
193 "stride_sec": stride_sec, 194 "stride_sec": stride_sec,
195 "skipped_refs": skipped_refs,
194 "model_signature": self.model_signature, 196 "model_signature": self.model_signature,
195 "partial_embs_path": str(partial_embs_path), 197 "partial_embs_path": str(partial_embs_path),
196 "partial_ids_path": str(partial_ids_path), 198 "partial_ids_path": str(partial_ids_path),
...@@ -207,6 +209,7 @@ class ECAPAEmbedder: ...@@ -207,6 +209,7 @@ class ECAPAEmbedder:
207 "device": self.device.type, 209 "device": self.device.type,
208 "window_sec": window_sec, 210 "window_sec": window_sec,
209 "stride_sec": stride_sec, 211 "stride_sec": stride_sec,
212 "skipped_refs": skipped_refs,
210 "model_signature": self.model_signature, 213 "model_signature": self.model_signature,
211 "final_embs_path": str(final_embs_path), 214 "final_embs_path": str(final_embs_path),
212 "final_ids_path": str(final_ids_path), 215 "final_ids_path": str(final_ids_path),
...@@ -222,9 +225,22 @@ class ECAPAEmbedder: ...@@ -222,9 +225,22 @@ class ECAPAEmbedder:
222 for ref_idx, item in enumerate(refs[refs_done:], start=refs_done + 1): 225 for ref_idx, item in enumerate(refs[refs_done:], start=refs_done + 1):
223 audio_path = self._resolve_audio_path(songs_dir, item["audio_path"]) 226 audio_path = self._resolve_audio_path(songs_dir, item["audio_path"])
224 if not audio_path.exists(): 227 if not audio_path.exists():
228 skipped_refs += 1
229 print(
230 f"[build-reference-index] skip missing audio: song_id={item.get('song_id')} path={audio_path}",
231 flush=True,
232 )
225 continue 233 continue
226 song_id = item["song_id"] 234 song_id = item["song_id"]
235 try:
227 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True) 236 y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
237 except Exception as exc:
238 skipped_refs += 1
239 print(
240 f"[build-reference-index] skip decode failure: song_id={song_id} path={audio_path} error={exc}",
241 flush=True,
242 )
243 continue
228 windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec) 244 windows = self._windows(y, window_sec=window_sec, stride_sec=stride_sec)
229 for seg in windows: 245 for seg in windows:
230 mel = self._to_mel(seg).to(self.device) 246 mel = self._to_mel(seg).to(self.device)
...@@ -238,7 +254,7 @@ class ECAPAEmbedder: ...@@ -238,7 +254,7 @@ class ECAPAEmbedder:
238 eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0 254 eta_sec = (total_refs - ref_idx) / refs_per_sec if refs_per_sec > 0 else 0.0
239 print( 255 print(
240 f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " 256 f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
241 f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" 257 f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f} skipped_refs={skipped_refs}"
242 , flush=True) 258 , flush=True)
243 if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): 259 if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs):
244 write_checkpoint(ref_idx) 260 write_checkpoint(ref_idx)
......
1 ## 2026-06-02 15:22 UTC / bad-mp3 skip tolerance verified
2
3 -`chromaprint_matcher.py``ecapa_embedder.py` 的 reference 建索引循环增加单文件容错:
4 - missing audio -> warning + skip
5 - decode failure -> warning + skip
6 - progress JSON 新增字段:`skipped_refs`
7 - 最小复现验证:`/tmp/chroma_skip_repro`
8 - 1 个正常 mp3 + 1 个损坏 mp3
9 - `RC=0`
10 - chromaprint 阶段:`skipped_refs=1`
11 - reference 阶段:`skipped_refs=1`
12 - 成功产出:
13 - `reference_embs.npy`
14 - `reference_ids.npy`
15 - `reference_progress.json`
16 - 结论:当前已修复“单个坏 MP3 拖垮整轮 build-index”的高概率故障模式
17
1 ## 2026-06-02 15:18 UTC / build-index log flush hardening 18 ## 2026-06-02 15:18 UTC / build-index log flush hardening
2 19
3 -`run_demo.py``chromaprint_matcher.py``ecapa_embedder.py` 的关键 `print()` 增加 `flush=True` 20 -`run_demo.py``chromaprint_matcher.py``ecapa_embedder.py` 的关键 `print()` 增加 `flush=True`
......
...@@ -68,3 +68,29 @@ ...@@ -68,3 +68,29 @@
68 68
69 - 当前已修复“失败时日志完全不可见”的可观测性问题。 69 - 当前已修复“失败时日志完全不可见”的可观测性问题。
70 - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。 70 - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。
71
72 ## 本次追加交付(2026-06-02 15:22 UTC)
73
74 ### 新增代码修复
75
76 | 文件 | 变更 |
77 |---|---|
78 | [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | 坏音频/缺失音频跳过;progress 增加 `skipped_refs` |
79 | [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | 坏音频/缺失音频跳过;progress 增加 `skipped_refs` |
80
81 ### 新增验证证据
82
83 - 最小容错复现:`/tmp/chroma_skip_repro`
84 - 输入:`1 good mp3 + 1 bad mp3`
85 - 结果:`RC=0`
86 - 验证点:
87 - 日志可见 `skip decode failure`
88 - `chromaprint_progress.json``status=complete`
89 - `reference_progress.json``status=complete`
90 - 两个 progress 文件都记录 `skipped_refs=1`
91 - 最终成功产出 `reference_embs.npy` / `reference_ids.npy`
92
93 ### 结论
94
95 - 当前已验证:单个坏 MP3 不再拖垮整轮 `build-index`
96 - 下一轮应回到真实路径复现,确认主问题是否就是由坏 MP3 触发。
......
...@@ -49,6 +49,15 @@ ...@@ -49,6 +49,15 @@
49 - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes` 49 - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes`
50 - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。 50 - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。
51 51
52 #### 已完成的坏音频容错修复
53 - 已为 chromaprint/reference 两个建索引阶段增加单文件容错:坏 MP3 / 缺失音频会被记录并跳过。
54 - 最小复现 `/tmp/chroma_skip_repro` 已验证:
55 - `RC=0`
56 - `skip decode failure` 日志可见
57 - `reference_embs.npy` / `reference_ids.npy` 成功产出
58 - progress 中记录 `skipped_refs=1`
59 - 这说明:单个坏 MP3 不再拖垮整轮 `build-index`
60
52 ## 新 session 接管顺序 61 ## 新 session 接管顺序
53 62
54 1. 先看 [./session-handoff.md](./session-handoff.md) 63 1. 先看 [./session-handoff.md](./session-handoff.md)
......