Make build-index failures observable so root-cause analysis can proceed from real logs
Constraint: The live build-index investigation was blocked by stdout/stderr buffering that left log files at 0 bytes during long runs Rejected: Keep diagnosing from progress files alone | they do not preserve traceback or stage-transition context Confidence: high Scope-risk: narrow Directive: Preserve flush-on-progress behavior while chasing the remaining real-path build-index root cause Tested: Verified tiny repro /tmp/chroma_repro_tiny12 writes live logs and traceback with RC=1 after flush=True change Not-tested: No final fix for the real-path build-index exit yet
Showing
8 changed files
with
70 additions
and
19 deletions
| ... | @@ -88,6 +88,7 @@ | ... | @@ -88,6 +88,7 @@ |
| 88 | - `refs_done=4420/8000` | 88 | - `refs_done=4420/8000` |
| 89 | - 当前仍未出现 `reference_*` 或 `evaluate.py`。 | 89 | - 当前仍未出现 `reference_*` 或 `evaluate.py`。 |
| 90 | - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 | 90 | - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 |
| 91 | - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。 | ||
| 91 | - 下一次值得提交的事件: | 92 | - 下一次值得提交的事件: |
| 92 | 1. 找到明确失败证据/退出原因 | 93 | 1. 找到明确失败证据/退出原因 |
| 93 | 2. 成功小样本复现并补日志 | 94 | 2. 成功小样本复现并补日志 | ... | ... |
| ... | @@ -24,7 +24,7 @@ def cmd_generate_data(args): | ... | @@ -24,7 +24,7 @@ def cmd_generate_data(args): |
| 24 | segment_duration=args.segment_duration, | 24 | segment_duration=args.segment_duration, |
| 25 | seed=args.seed, | 25 | seed=args.seed, |
| 26 | ) | 26 | ) |
| 27 | print(f"[done] dataset generated at {args.output}") | 27 | print(f"[done] dataset generated at {args.output}", flush=True) |
| 28 | 28 | ||
| 29 | 29 | ||
| 30 | def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: int = 0): | 30 | def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: int = 0): |
| ... | @@ -37,7 +37,7 @@ def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: | ... | @@ -37,7 +37,7 @@ def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: |
| 37 | checkpoint_every_refs=checkpoint_every_refs, | 37 | checkpoint_every_refs=checkpoint_every_refs, |
| 38 | progress_path=str(output_dir / 'chromaprint_progress.json'), | 38 | progress_path=str(output_dir / 'chromaprint_progress.json'), |
| 39 | ) | 39 | ) |
| 40 | print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}") | 40 | print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}", flush=True) |
| 41 | return matcher | 41 | return matcher |
| 42 | 42 | ||
| 43 | 43 | ||
| ... | @@ -58,7 +58,7 @@ def build_embedding_index( | ... | @@ -58,7 +58,7 @@ def build_embedding_index( |
| 58 | checkpoint_every_refs=checkpoint_every_refs, | 58 | checkpoint_every_refs=checkpoint_every_refs, |
| 59 | resume=resume, | 59 | resume=resume, |
| 60 | ) | 60 | ) |
| 61 | print(f"[done] embedding index built: {len(ref_ids)} refs") | 61 | print(f"[done] embedding index built: {len(ref_ids)} refs", flush=True) |
| 62 | return embedder, ref_embs, ref_ids | 62 | return embedder, ref_embs, ref_ids |
| 63 | 63 | ||
| 64 | 64 | ||
| ... | @@ -67,12 +67,12 @@ def cmd_build_index(args): | ... | @@ -67,12 +67,12 @@ def cmd_build_index(args): |
| 67 | out_dir = Path(args.output) | 67 | out_dir = Path(args.output) |
| 68 | out_dir.mkdir(parents=True, exist_ok=True) | 68 | out_dir.mkdir(parents=True, exist_ok=True) |
| 69 | 69 | ||
| 70 | print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}") | 70 | print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}", flush=True) |
| 71 | build_chroma_index(data_dir, out_dir, checkpoint_every_refs=args.chromaprint_checkpoint_every_refs) | 71 | build_chroma_index(data_dir, out_dir, checkpoint_every_refs=args.chromaprint_checkpoint_every_refs) |
| 72 | print( | 72 | print( |
| 73 | f"[build-index] starting embedding index: model={args.model} device={args.device} " | 73 | f"[build-index] starting embedding index: model={args.model} device={args.device} " |
| 74 | f"resume={args.resume} checkpoint_every_refs={args.checkpoint_every_refs}" | 74 | f"resume={args.resume} checkpoint_every_refs={args.checkpoint_every_refs}" |
| 75 | ) | 75 | , flush=True) |
| 76 | build_embedding_index( | 76 | build_embedding_index( |
| 77 | data_dir, | 77 | data_dir, |
| 78 | Path(args.model), | 78 | Path(args.model), |
| ... | @@ -108,7 +108,7 @@ def cmd_recognize(args): | ... | @@ -108,7 +108,7 @@ def cmd_recognize(args): |
| 108 | engine.load_metadata(str(p)) | 108 | engine.load_metadata(str(p)) |
| 109 | 109 | ||
| 110 | result = engine.recognize(args.query, top_n=args.top_n) | 110 | result = engine.recognize(args.query, top_n=args.top_n) |
| 111 | print(json.dumps(result, ensure_ascii=False, indent=2)) | 111 | print(json.dumps(result, ensure_ascii=False, indent=2), flush=True) |
| 112 | 112 | ||
| 113 | 113 | ||
| 114 | def cmd_full_demo(args): | 114 | def cmd_full_demo(args): |
| ... | @@ -125,7 +125,7 @@ def cmd_full_demo(args): | ... | @@ -125,7 +125,7 @@ def cmd_full_demo(args): |
| 125 | segment_duration=args.segment_duration, | 125 | segment_duration=args.segment_duration, |
| 126 | seed=args.seed, | 126 | seed=args.seed, |
| 127 | ) | 127 | ) |
| 128 | print(f"[done] dataset generated at {data_dir}") | 128 | print(f"[done] dataset generated at {data_dir}", flush=True) |
| 129 | 129 | ||
| 130 | model_path = model_dir / 'best_model.pt' | 130 | model_path = model_dir / 'best_model.pt' |
| 131 | if not model_path.exists(): | 131 | if not model_path.exists(): |
| ... | @@ -136,7 +136,7 @@ def cmd_full_demo(args): | ... | @@ -136,7 +136,7 @@ def cmd_full_demo(args): |
| 136 | '--data', str(data_dir), '--output', str(model_dir), | 136 | '--data', str(data_dir), '--output', str(model_dir), |
| 137 | '--device', args.device, '--epochs', '3', '--batch-size', '8' | 137 | '--device', args.device, '--epochs', '3', '--batch-size', '8' |
| 138 | ] | 138 | ] |
| 139 | print('[full-demo] training model:', ' '.join(cmd)) | 139 | print('[full-demo] training model:', ' '.join(cmd), flush=True) |
| 140 | subprocess.run(cmd, check=True) | 140 | subprocess.run(cmd, check=True) |
| 141 | 141 | ||
| 142 | index_dir.mkdir(parents=True, exist_ok=True) | 142 | index_dir.mkdir(parents=True, exist_ok=True) |
| ... | @@ -152,8 +152,8 @@ def cmd_full_demo(args): | ... | @@ -152,8 +152,8 @@ def cmd_full_demo(args): |
| 152 | for split in ['train.json', 'val.json', 'test.json']: | 152 | for split in ['train.json', 'val.json', 'test.json']: |
| 153 | engine.load_metadata(str(data_dir / split)) | 153 | engine.load_metadata(str(data_dir / split)) |
| 154 | result = engine.recognize(str(query_path), top_n=5) | 154 | result = engine.recognize(str(query_path), top_n=5) |
| 155 | print('[demo-query]', query_item['song_id'], query_item['audio_path']) | 155 | print('[demo-query]', query_item['song_id'], query_item['audio_path'], flush=True) |
| 156 | print(json.dumps(result, ensure_ascii=False, indent=2)) | 156 | print(json.dumps(result, ensure_ascii=False, indent=2), flush=True) |
| 157 | 157 | ||
| 158 | 158 | ||
| 159 | if __name__ == '__main__': | 159 | if __name__ == '__main__': | ... | ... |
| ... | @@ -139,7 +139,7 @@ class ChromaprintMatcher: | ... | @@ -139,7 +139,7 @@ class ChromaprintMatcher: |
| 139 | f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " | 139 | f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " |
| 140 | f"hashes={self.num_hashes} postings={self.index_size} " | 140 | f"hashes={self.num_hashes} postings={self.index_size} " |
| 141 | f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" | 141 | f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" |
| 142 | ) | 142 | , flush=True) |
| 143 | if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: | 143 | if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: |
| 144 | if cache_file is not None: | 144 | if cache_file is not None: |
| 145 | self.save(str(cache_file)) | 145 | self.save(str(cache_file)) | ... | ... |
| ... | @@ -48,7 +48,7 @@ class ECAPAEmbedder: | ... | @@ -48,7 +48,7 @@ class ECAPAEmbedder: |
| 48 | ) | 48 | ) |
| 49 | missing = self.model.load_state_dict(state["model_state_dict"], strict=False) | 49 | missing = self.model.load_state_dict(state["model_state_dict"], strict=False) |
| 50 | if missing.unexpected_keys: | 50 | if missing.unexpected_keys: |
| 51 | print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}") | 51 | print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}", flush=True) |
| 52 | self.model.to(self.device) | 52 | self.model.to(self.device) |
| 53 | self.model.eval() | 53 | self.model.eval() |
| 54 | 54 | ||
| ... | @@ -133,7 +133,7 @@ class ECAPAEmbedder: | ... | @@ -133,7 +133,7 @@ class ECAPAEmbedder: |
| 133 | refs_done = 0 | 133 | refs_done = 0 |
| 134 | 134 | ||
| 135 | if resume and final_embs_path.exists() and final_ids_path.exists(): | 135 | if resume and final_embs_path.exists() and final_ids_path.exists(): |
| 136 | print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}") | 136 | print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True) |
| 137 | final_embs = np.load(final_embs_path) | 137 | final_embs = np.load(final_embs_path) |
| 138 | final_ids = np.load(final_ids_path, allow_pickle=True).tolist() | 138 | final_ids = np.load(final_ids_path, allow_pickle=True).tolist() |
| 139 | return final_embs, final_ids | 139 | return final_embs, final_ids |
| ... | @@ -154,9 +154,9 @@ class ECAPAEmbedder: | ... | @@ -154,9 +154,9 @@ class ECAPAEmbedder: |
| 154 | print( | 154 | print( |
| 155 | f"[build-reference-index] resuming from checkpoint: refs_done={refs_done}/{total_refs} " | 155 | f"[build-reference-index] resuming from checkpoint: refs_done={refs_done}/{total_refs} " |
| 156 | f"windows_done={len(all_ids)}" | 156 | f"windows_done={len(all_ids)}" |
| 157 | ) | 157 | , flush=True) |
| 158 | except Exception as exc: | 158 | except Exception as exc: |
| 159 | print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}") | 159 | print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}", flush=True) |
| 160 | refs_done = 0 | 160 | refs_done = 0 |
| 161 | all_embs = [] | 161 | all_embs = [] |
| 162 | all_ids = [] | 162 | all_ids = [] |
| ... | @@ -170,7 +170,7 @@ class ECAPAEmbedder: | ... | @@ -170,7 +170,7 @@ class ECAPAEmbedder: |
| 170 | print( | 170 | print( |
| 171 | f"[build-reference-index] start: refs={total_refs} device={self.device.type} " | 171 | f"[build-reference-index] start: refs={total_refs} device={self.device.type} " |
| 172 | f"window_sec={window_sec} stride_sec={stride_sec} resume={resume} refs_done={refs_done}" | 172 | f"window_sec={window_sec} stride_sec={stride_sec} resume={resume} refs_done={refs_done}" |
| 173 | ) | 173 | , flush=True) |
| 174 | 174 | ||
| 175 | def write_checkpoint(ref_idx: int): | 175 | def write_checkpoint(ref_idx: int): |
| 176 | if not all_embs: | 176 | if not all_embs: |
| ... | @@ -214,7 +214,7 @@ class ECAPAEmbedder: | ... | @@ -214,7 +214,7 @@ class ECAPAEmbedder: |
| 214 | }, indent=2)) | 214 | }, indent=2)) |
| 215 | 215 | ||
| 216 | if refs_done > total_refs: | 216 | if refs_done > total_refs: |
| 217 | print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting") | 217 | print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting", flush=True) |
| 218 | refs_done = 0 | 218 | refs_done = 0 |
| 219 | all_embs = [] | 219 | all_embs = [] |
| 220 | all_ids = [] | 220 | all_ids = [] |
| ... | @@ -239,7 +239,7 @@ class ECAPAEmbedder: | ... | @@ -239,7 +239,7 @@ class ECAPAEmbedder: |
| 239 | print( | 239 | print( |
| 240 | f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " | 240 | f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " |
| 241 | f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" | 241 | f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" |
| 242 | ) | 242 | , flush=True) |
| 243 | if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): | 243 | if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): |
| 244 | write_checkpoint(ref_idx) | 244 | write_checkpoint(ref_idx) |
| 245 | 245 | ||
| ... | @@ -252,7 +252,7 @@ class ECAPAEmbedder: | ... | @@ -252,7 +252,7 @@ class ECAPAEmbedder: |
| 252 | np.save(final_embs_path, all_embs) | 252 | np.save(final_embs_path, all_embs) |
| 253 | np.save(final_ids_path, np.array(all_ids)) | 253 | np.save(final_ids_path, np.array(all_ids)) |
| 254 | write_complete(len(all_ids), all_embs.shape) | 254 | write_complete(len(all_ids), all_embs.shape) |
| 255 | print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}") | 255 | print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}", flush=True) |
| 256 | return all_embs, all_ids | 256 | return all_embs, all_ids |
| 257 | 257 | ||
| 258 | def search(self, query_emb: np.ndarray, ref_embs: np.ndarray, ref_ids: List[str], top_k: int = 10): | 258 | def search(self, query_emb: np.ndarray, ref_embs: np.ndarray, ref_ids: List[str], top_k: int = 10): | ... | ... |
| 1 | ## 2026-06-02 15:18 UTC / build-index log flush hardening | ||
| 2 | |||
| 3 | - 为 `run_demo.py`、`chromaprint_matcher.py`、`ecapa_embedder.py` 的关键 `print()` 增加 `flush=True` | ||
| 4 | - 目的:避免 `build-index` 长时间运行时日志文件保持 `0 bytes`,导致“无声运行/无声退出”难排查 | ||
| 5 | - fresh evidence:极小样本复现 `/tmp/chroma_repro_tiny12` 已验证日志实时落盘 | ||
| 6 | - `RC=1` | ||
| 7 | - 日志中可见:`[build-index] starting chromaprint index` | ||
| 8 | - 日志中可见:`[build-reference-index] start: refs=12 ...` | ||
| 9 | - 日志中可见最终 traceback:`ValueError: No reference embeddings were produced ...` | ||
| 10 | - 结论:当前至少已修复“失败时日志完全不可见”的可观测性问题,下一步可继续针对真实路径复现 root cause | ||
| 11 | |||
| 1 | ## 2026-06-02 15:09 UTC / build-index unexpected exit checkpoint | 12 | ## 2026-06-02 15:09 UTC / build-index unexpected exit checkpoint |
| 2 | 13 | ||
| 3 | - 新鲜证据:observable 与 legacy 两个 `build-index` 进程都已退出 | 14 | - 新鲜证据:observable 与 legacy 两个 `build-index` 进程都已退出 | ... | ... |
| ... | @@ -44,3 +44,27 @@ | ... | @@ -44,3 +44,27 @@ |
| 44 | 1. 当前已不应继续把它描述成“仅仅线性慢”。 | 44 | 1. 当前已不应继续把它描述成“仅仅线性慢”。 |
| 45 | 2. 下一轮工作应转向 **build-index 异常退出排查**。 | 45 | 2. 下一轮工作应转向 **build-index 异常退出排查**。 |
| 46 | 3. 新提交已经有意义,因为状态从“运行中”变成了“已退出且无下游产物”。 | 46 | 3. 新提交已经有意义,因为状态从“运行中”变成了“已退出且无下游产物”。 |
| 47 | |||
| 48 | ## 本次追加交付(2026-06-02 15:18 UTC) | ||
| 49 | |||
| 50 | ### 新增代码修复 | ||
| 51 | |||
| 52 | | 文件 | 变更 | | ||
| 53 | |---|---| | ||
| 54 | | [../acr-engine/run_demo.py](../acr-engine/run_demo.py) | `build-index` / demo 关键日志统一 `flush=True` | | ||
| 55 | | [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | chromaprint 阶段 progress 日志 `flush=True` | | ||
| 56 | | [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | embedding/reference 阶段关键日志 `flush=True` | | ||
| 57 | |||
| 58 | ### 新增验证证据 | ||
| 59 | |||
| 60 | - 极小样本复现:`/tmp/chroma_repro_tiny12` | ||
| 61 | - 结果:`RC=1` | ||
| 62 | - 现在日志已实时落盘,不再是 `0 bytes`: | ||
| 63 | - `[build-index] starting chromaprint index ...` | ||
| 64 | - `[build-reference-index] start: refs=12 ...` | ||
| 65 | - `ValueError: No reference embeddings were produced ...` | ||
| 66 | |||
| 67 | ### 结论 | ||
| 68 | |||
| 69 | - 当前已修复“失败时日志完全不可见”的可观测性问题。 | ||
| 70 | - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。 | ... | ... |
| ... | @@ -44,6 +44,11 @@ | ... | @@ -44,6 +44,11 @@ |
| 44 | - 这不再是“CPU-only 长时间构建但仍在推进”的状态。 | 44 | - 这不再是“CPU-only 长时间构建但仍在推进”的状态。 |
| 45 | - 现在更像是:**`build-index` 在 chromaprint 阶段中途退出,但没有留下显式下游产物**。 | 45 | - 现在更像是:**`build-index` 在 chromaprint 阶段中途退出,但没有留下显式下游产物**。 |
| 46 | 46 | ||
| 47 | #### 已完成的低风险修复 | ||
| 48 | - 已把 `run_demo.py`、`chromaprint_matcher.py`、`ecapa_embedder.py` 的关键日志改为 `flush=True`。 | ||
| 49 | - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes`。 | ||
| 50 | - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。 | ||
| 51 | |||
| 47 | ## 新 session 接管顺序 | 52 | ## 新 session 接管顺序 |
| 48 | 53 | ||
| 49 | 1. 先看 [./session-handoff.md](./session-handoff.md) | 54 | 1. 先看 [./session-handoff.md](./session-handoff.md) | ... | ... |
| ... | @@ -31,6 +31,16 @@ | ... | @@ -31,6 +31,16 @@ |
| 31 | 2. 查 silent failure / OOM / shell termination 证据 | 31 | 2. 查 silent failure / OOM / shell termination 证据 |
| 32 | 3. 用小样本复现异常并补日志 | 32 | 3. 用小样本复现异常并补日志 |
| 33 | 33 | ||
| 34 | ### 最新可观测性修复(2026-06-02 15:18 UTC) | ||
| 35 | |||
| 36 | - 已为 `run_demo.py`、`src/engines/chromaprint_matcher.py`、`src/engines/ecapa_embedder.py` 的关键 `print()` 增加 `flush=True`。 | ||
| 37 | - 极小样本复现 `/tmp/chroma_repro_tiny12` 已验证: | ||
| 38 | - 日志文件不再保持 `0 bytes` | ||
| 39 | - traceback 可实时落盘 | ||
| 40 | - 当前已确认:至少“失败时无日志”这个问题已被修复;下一步继续追真实路径 root cause。 | ||
| 41 | |||
| 42 | - 验证结果补充:`RC=1`,日志中可见 `ValueError: No reference embeddings were produced ...`。 | ||
| 43 | |||
| 34 | 这是一个正在从原型向工业化推进的 **音乐 ACR / music retrieval** 项目。 | 44 | 这是一个正在从原型向工业化推进的 **音乐 ACR / music retrieval** 项目。 |
| 35 | 当前已经完成: | 45 | 当前已经完成: |
| 36 | 46 | ... | ... |
-
Please register or sign in to post a comment