Commit 6ece1fa7 6ece1fa7f3c14810170c6635193ce3d2cdc4dc3e by cnb.bofCdSsphPA

Make build-index failures observable so root-cause analysis can proceed from real logs

Constraint: The live build-index investigation was blocked by stdout/stderr buffering that left log files at 0 bytes during long runs
Rejected: Keep diagnosing from progress files alone | they do not preserve traceback or stage-transition context
Confidence: high
Scope-risk: narrow
Directive: Preserve flush-on-progress behavior while chasing the remaining real-path build-index root cause
Tested: Verified tiny repro /tmp/chroma_repro_tiny12 writes live logs and traceback with RC=1 after flush=True change
Not-tested: No final fix for the real-path build-index exit yet
1 parent 7bb69662
...@@ -88,6 +88,7 @@ ...@@ -88,6 +88,7 @@
88 - `refs_done=4420/8000` 88 - `refs_done=4420/8000`
89 - 当前仍未出现 `reference_*``evaluate.py` 89 - 当前仍未出现 `reference_*``evaluate.py`
90 - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。 90 - 因此下一轮工作重心必须切到:**排查 build-index 异常退出**,而不是继续把它当作纯线性慢任务。
91 - 已完成一个低风险修复:关键 `print()` 已加 `flush=True`,并已用极小样本 `RC=1` 失败复现验证日志/traceback 可实时落盘,不再出现 `0 bytes` 日志黑箱。
91 - 下一次值得提交的事件: 92 - 下一次值得提交的事件:
92 1. 找到明确失败证据/退出原因 93 1. 找到明确失败证据/退出原因
93 2. 成功小样本复现并补日志 94 2. 成功小样本复现并补日志
......
...@@ -24,7 +24,7 @@ def cmd_generate_data(args): ...@@ -24,7 +24,7 @@ def cmd_generate_data(args):
24 segment_duration=args.segment_duration, 24 segment_duration=args.segment_duration,
25 seed=args.seed, 25 seed=args.seed,
26 ) 26 )
27 print(f"[done] dataset generated at {args.output}") 27 print(f"[done] dataset generated at {args.output}", flush=True)
28 28
29 29
30 def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: int = 0): 30 def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: int = 0):
...@@ -37,7 +37,7 @@ def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: ...@@ -37,7 +37,7 @@ def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs:
37 checkpoint_every_refs=checkpoint_every_refs, 37 checkpoint_every_refs=checkpoint_every_refs,
38 progress_path=str(output_dir / 'chromaprint_progress.json'), 38 progress_path=str(output_dir / 'chromaprint_progress.json'),
39 ) 39 )
40 print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}") 40 print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}", flush=True)
41 return matcher 41 return matcher
42 42
43 43
...@@ -58,7 +58,7 @@ def build_embedding_index( ...@@ -58,7 +58,7 @@ def build_embedding_index(
58 checkpoint_every_refs=checkpoint_every_refs, 58 checkpoint_every_refs=checkpoint_every_refs,
59 resume=resume, 59 resume=resume,
60 ) 60 )
61 print(f"[done] embedding index built: {len(ref_ids)} refs") 61 print(f"[done] embedding index built: {len(ref_ids)} refs", flush=True)
62 return embedder, ref_embs, ref_ids 62 return embedder, ref_embs, ref_ids
63 63
64 64
...@@ -67,12 +67,12 @@ def cmd_build_index(args): ...@@ -67,12 +67,12 @@ def cmd_build_index(args):
67 out_dir = Path(args.output) 67 out_dir = Path(args.output)
68 out_dir.mkdir(parents=True, exist_ok=True) 68 out_dir.mkdir(parents=True, exist_ok=True)
69 69
70 print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}") 70 print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}", flush=True)
71 build_chroma_index(data_dir, out_dir, checkpoint_every_refs=args.chromaprint_checkpoint_every_refs) 71 build_chroma_index(data_dir, out_dir, checkpoint_every_refs=args.chromaprint_checkpoint_every_refs)
72 print( 72 print(
73 f"[build-index] starting embedding index: model={args.model} device={args.device} " 73 f"[build-index] starting embedding index: model={args.model} device={args.device} "
74 f"resume={args.resume} checkpoint_every_refs={args.checkpoint_every_refs}" 74 f"resume={args.resume} checkpoint_every_refs={args.checkpoint_every_refs}"
75 ) 75 , flush=True)
76 build_embedding_index( 76 build_embedding_index(
77 data_dir, 77 data_dir,
78 Path(args.model), 78 Path(args.model),
...@@ -108,7 +108,7 @@ def cmd_recognize(args): ...@@ -108,7 +108,7 @@ def cmd_recognize(args):
108 engine.load_metadata(str(p)) 108 engine.load_metadata(str(p))
109 109
110 result = engine.recognize(args.query, top_n=args.top_n) 110 result = engine.recognize(args.query, top_n=args.top_n)
111 print(json.dumps(result, ensure_ascii=False, indent=2)) 111 print(json.dumps(result, ensure_ascii=False, indent=2), flush=True)
112 112
113 113
114 def cmd_full_demo(args): 114 def cmd_full_demo(args):
...@@ -125,7 +125,7 @@ def cmd_full_demo(args): ...@@ -125,7 +125,7 @@ def cmd_full_demo(args):
125 segment_duration=args.segment_duration, 125 segment_duration=args.segment_duration,
126 seed=args.seed, 126 seed=args.seed,
127 ) 127 )
128 print(f"[done] dataset generated at {data_dir}") 128 print(f"[done] dataset generated at {data_dir}", flush=True)
129 129
130 model_path = model_dir / 'best_model.pt' 130 model_path = model_dir / 'best_model.pt'
131 if not model_path.exists(): 131 if not model_path.exists():
...@@ -136,7 +136,7 @@ def cmd_full_demo(args): ...@@ -136,7 +136,7 @@ def cmd_full_demo(args):
136 '--data', str(data_dir), '--output', str(model_dir), 136 '--data', str(data_dir), '--output', str(model_dir),
137 '--device', args.device, '--epochs', '3', '--batch-size', '8' 137 '--device', args.device, '--epochs', '3', '--batch-size', '8'
138 ] 138 ]
139 print('[full-demo] training model:', ' '.join(cmd)) 139 print('[full-demo] training model:', ' '.join(cmd), flush=True)
140 subprocess.run(cmd, check=True) 140 subprocess.run(cmd, check=True)
141 141
142 index_dir.mkdir(parents=True, exist_ok=True) 142 index_dir.mkdir(parents=True, exist_ok=True)
...@@ -152,8 +152,8 @@ def cmd_full_demo(args): ...@@ -152,8 +152,8 @@ def cmd_full_demo(args):
152 for split in ['train.json', 'val.json', 'test.json']: 152 for split in ['train.json', 'val.json', 'test.json']:
153 engine.load_metadata(str(data_dir / split)) 153 engine.load_metadata(str(data_dir / split))
154 result = engine.recognize(str(query_path), top_n=5) 154 result = engine.recognize(str(query_path), top_n=5)
155 print('[demo-query]', query_item['song_id'], query_item['audio_path']) 155 print('[demo-query]', query_item['song_id'], query_item['audio_path'], flush=True)
156 print(json.dumps(result, ensure_ascii=False, indent=2)) 156 print(json.dumps(result, ensure_ascii=False, indent=2), flush=True)
157 157
158 158
159 if __name__ == '__main__': 159 if __name__ == '__main__':
......
...@@ -139,7 +139,7 @@ class ChromaprintMatcher: ...@@ -139,7 +139,7 @@ class ChromaprintMatcher:
139 f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} " 139 f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} "
140 f"hashes={self.num_hashes} postings={self.index_size} " 140 f"hashes={self.num_hashes} postings={self.index_size} "
141 f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" 141 f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
142 ) 142 , flush=True)
143 if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0: 143 if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0:
144 if cache_file is not None: 144 if cache_file is not None:
145 self.save(str(cache_file)) 145 self.save(str(cache_file))
......
...@@ -48,7 +48,7 @@ class ECAPAEmbedder: ...@@ -48,7 +48,7 @@ class ECAPAEmbedder:
48 ) 48 )
49 missing = self.model.load_state_dict(state["model_state_dict"], strict=False) 49 missing = self.model.load_state_dict(state["model_state_dict"], strict=False)
50 if missing.unexpected_keys: 50 if missing.unexpected_keys:
51 print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}") 51 print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}", flush=True)
52 self.model.to(self.device) 52 self.model.to(self.device)
53 self.model.eval() 53 self.model.eval()
54 54
...@@ -133,7 +133,7 @@ class ECAPAEmbedder: ...@@ -133,7 +133,7 @@ class ECAPAEmbedder:
133 refs_done = 0 133 refs_done = 0
134 134
135 if resume and final_embs_path.exists() and final_ids_path.exists(): 135 if resume and final_embs_path.exists() and final_ids_path.exists():
136 print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}") 136 print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True)
137 final_embs = np.load(final_embs_path) 137 final_embs = np.load(final_embs_path)
138 final_ids = np.load(final_ids_path, allow_pickle=True).tolist() 138 final_ids = np.load(final_ids_path, allow_pickle=True).tolist()
139 return final_embs, final_ids 139 return final_embs, final_ids
...@@ -154,9 +154,9 @@ class ECAPAEmbedder: ...@@ -154,9 +154,9 @@ class ECAPAEmbedder:
154 print( 154 print(
155 f"[build-reference-index] resuming from checkpoint: refs_done={refs_done}/{total_refs} " 155 f"[build-reference-index] resuming from checkpoint: refs_done={refs_done}/{total_refs} "
156 f"windows_done={len(all_ids)}" 156 f"windows_done={len(all_ids)}"
157 ) 157 , flush=True)
158 except Exception as exc: 158 except Exception as exc:
159 print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}") 159 print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}", flush=True)
160 refs_done = 0 160 refs_done = 0
161 all_embs = [] 161 all_embs = []
162 all_ids = [] 162 all_ids = []
...@@ -170,7 +170,7 @@ class ECAPAEmbedder: ...@@ -170,7 +170,7 @@ class ECAPAEmbedder:
170 print( 170 print(
171 f"[build-reference-index] start: refs={total_refs} device={self.device.type} " 171 f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
172 f"window_sec={window_sec} stride_sec={stride_sec} resume={resume} refs_done={refs_done}" 172 f"window_sec={window_sec} stride_sec={stride_sec} resume={resume} refs_done={refs_done}"
173 ) 173 , flush=True)
174 174
175 def write_checkpoint(ref_idx: int): 175 def write_checkpoint(ref_idx: int):
176 if not all_embs: 176 if not all_embs:
...@@ -214,7 +214,7 @@ class ECAPAEmbedder: ...@@ -214,7 +214,7 @@ class ECAPAEmbedder:
214 }, indent=2)) 214 }, indent=2))
215 215
216 if refs_done > total_refs: 216 if refs_done > total_refs:
217 print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting") 217 print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting", flush=True)
218 refs_done = 0 218 refs_done = 0
219 all_embs = [] 219 all_embs = []
220 all_ids = [] 220 all_ids = []
...@@ -239,7 +239,7 @@ class ECAPAEmbedder: ...@@ -239,7 +239,7 @@ class ECAPAEmbedder:
239 print( 239 print(
240 f"[build-reference-index] progress: refs={ref_idx}/{total_refs} " 240 f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
241 f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}" 241 f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
242 ) 242 , flush=True)
243 if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs): 243 if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs):
244 write_checkpoint(ref_idx) 244 write_checkpoint(ref_idx)
245 245
...@@ -252,7 +252,7 @@ class ECAPAEmbedder: ...@@ -252,7 +252,7 @@ class ECAPAEmbedder:
252 np.save(final_embs_path, all_embs) 252 np.save(final_embs_path, all_embs)
253 np.save(final_ids_path, np.array(all_ids)) 253 np.save(final_ids_path, np.array(all_ids))
254 write_complete(len(all_ids), all_embs.shape) 254 write_complete(len(all_ids), all_embs.shape)
255 print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}") 255 print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}", flush=True)
256 return all_embs, all_ids 256 return all_embs, all_ids
257 257
258 def search(self, query_emb: np.ndarray, ref_embs: np.ndarray, ref_ids: List[str], top_k: int = 10): 258 def search(self, query_emb: np.ndarray, ref_embs: np.ndarray, ref_ids: List[str], top_k: int = 10):
......
1 ## 2026-06-02 15:18 UTC / build-index log flush hardening
2
3 -`run_demo.py``chromaprint_matcher.py``ecapa_embedder.py` 的关键 `print()` 增加 `flush=True`
4 - 目的:避免 `build-index` 长时间运行时日志文件保持 `0 bytes`,导致“无声运行/无声退出”难排查
5 - fresh evidence:极小样本复现 `/tmp/chroma_repro_tiny12` 已验证日志实时落盘
6 - `RC=1`
7 - 日志中可见:`[build-index] starting chromaprint index`
8 - 日志中可见:`[build-reference-index] start: refs=12 ...`
9 - 日志中可见最终 traceback:`ValueError: No reference embeddings were produced ...`
10 - 结论:当前至少已修复“失败时日志完全不可见”的可观测性问题,下一步可继续针对真实路径复现 root cause
11
1 ## 2026-06-02 15:09 UTC / build-index unexpected exit checkpoint 12 ## 2026-06-02 15:09 UTC / build-index unexpected exit checkpoint
2 13
3 - 新鲜证据:observable 与 legacy 两个 `build-index` 进程都已退出 14 - 新鲜证据:observable 与 legacy 两个 `build-index` 进程都已退出
......
...@@ -44,3 +44,27 @@ ...@@ -44,3 +44,27 @@
44 1. 当前已不应继续把它描述成“仅仅线性慢”。 44 1. 当前已不应继续把它描述成“仅仅线性慢”。
45 2. 下一轮工作应转向 **build-index 异常退出排查** 45 2. 下一轮工作应转向 **build-index 异常退出排查**
46 3. 新提交已经有意义,因为状态从“运行中”变成了“已退出且无下游产物”。 46 3. 新提交已经有意义,因为状态从“运行中”变成了“已退出且无下游产物”。
47
48 ## 本次追加交付(2026-06-02 15:18 UTC)
49
50 ### 新增代码修复
51
52 | 文件 | 变更 |
53 |---|---|
54 | [../acr-engine/run_demo.py](../acr-engine/run_demo.py) | `build-index` / demo 关键日志统一 `flush=True` |
55 | [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | chromaprint 阶段 progress 日志 `flush=True` |
56 | [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | embedding/reference 阶段关键日志 `flush=True` |
57
58 ### 新增验证证据
59
60 - 极小样本复现:`/tmp/chroma_repro_tiny12`
61 - 结果:`RC=1`
62 - 现在日志已实时落盘,不再是 `0 bytes`
63 - `[build-index] starting chromaprint index ...`
64 - `[build-reference-index] start: refs=12 ...`
65 - `ValueError: No reference embeddings were produced ...`
66
67 ### 结论
68
69 - 当前已修复“失败时日志完全不可见”的可观测性问题。
70 - 下一轮 root cause 排查可以直接依赖实时日志,而不再需要盲等。
......
...@@ -44,6 +44,11 @@ ...@@ -44,6 +44,11 @@
44 - 这不再是“CPU-only 长时间构建但仍在推进”的状态。 44 - 这不再是“CPU-only 长时间构建但仍在推进”的状态。
45 - 现在更像是:**`build-index` 在 chromaprint 阶段中途退出,但没有留下显式下游产物** 45 - 现在更像是:**`build-index` 在 chromaprint 阶段中途退出,但没有留下显式下游产物**
46 46
47 #### 已完成的低风险修复
48 - 已把 `run_demo.py``chromaprint_matcher.py``ecapa_embedder.py` 的关键日志改为 `flush=True`
49 - 极小样本 `/tmp/chroma_repro_tiny12` 已验证:失败时日志与 traceback 可实时落盘,不再保持 `0 bytes`
50 - 这意味着下一 session 继续排查时,日志可作为一手证据,而不是黑箱。
51
47 ## 新 session 接管顺序 52 ## 新 session 接管顺序
48 53
49 1. 先看 [./session-handoff.md](./session-handoff.md) 54 1. 先看 [./session-handoff.md](./session-handoff.md)
......
...@@ -31,6 +31,16 @@ ...@@ -31,6 +31,16 @@
31 2. 查 silent failure / OOM / shell termination 证据 31 2. 查 silent failure / OOM / shell termination 证据
32 3. 用小样本复现异常并补日志 32 3. 用小样本复现异常并补日志
33 33
34 ### 最新可观测性修复(2026-06-02 15:18 UTC)
35
36 - 已为 `run_demo.py``src/engines/chromaprint_matcher.py``src/engines/ecapa_embedder.py` 的关键 `print()` 增加 `flush=True`
37 - 极小样本复现 `/tmp/chroma_repro_tiny12` 已验证:
38 - 日志文件不再保持 `0 bytes`
39 - traceback 可实时落盘
40 - 当前已确认:至少“失败时无日志”这个问题已被修复;下一步继续追真实路径 root cause。
41
42 - 验证结果补充:`RC=1`,日志中可见 `ValueError: No reference embeddings were produced ...`
43
34 这是一个正在从原型向工业化推进的 **音乐 ACR / music retrieval** 项目。 44 这是一个正在从原型向工业化推进的 **音乐 ACR / music retrieval** 项目。
35 当前已经完成: 45 当前已经完成:
36 46
......