Commit 6ece1fa7f3c14810170c6635193ce3d2cdc4dc3e by cnb.bofCdSsphPA

Make build-index failures observable so root-cause analysis can proceed from real logs

Constraint: The live build-index investigation was blocked by stdout/stderr buffering that left log files at 0 bytes during long runs
Rejected: Keep diagnosing from progress files alone | they do not preserve traceback or stage-transition context
Confidence: high
Scope-risk: narrow
Directive: Preserve flush-on-progress behavior while chasing the remaining real-path build-index root cause
Tested: Verified tiny repro /tmp/chroma_repro_tiny12 writes live logs and traceback with RC=1 after flush=True change
Not-tested: No final fix for the real-path build-index exit yet
1 parent 7bb69662
......@@ -88,6 +88,7 @@
- `refs_done=4420/8000`
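- Neither `reference_*` nor `evaluate.py` has appeared yet
- The next round of work must therefore shift to **investigating the abnormal build-index exit** rather than continuing to treat it as a purely linear slow task.
- One low-risk fix is done: key `print()` calls now pass `flush=True`, and a tiny-sample failure repro (`RC=1`) confirmed that logs and the traceback reach disk in real time, so the `0 bytes` log black box no longer occurs.
- Events that would justify the next commit:
1. Finding clear failure evidence or the exit reason
2. Successfully reproducing on a small sample and adding logs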
......
......@@ -24,7 +24,7 @@ def cmd_generate_data(args):
segment_duration=args.segment_duration,
seed=args.seed,
)
-print(f"[done] dataset generated at {args.output}")
+print(f"[done] dataset generated at {args.output}", flush=True)
def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs: int = 0):
......@@ -37,7 +37,7 @@ def build_chroma_index(data_dir: Path, output_dir: Path, checkpoint_every_refs:
checkpoint_every_refs=checkpoint_every_refs,
progress_path=str(output_dir / 'chromaprint_progress.json'),
)
-print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}")
+print(f"[done] chromaprint index built: hashes={matcher.num_hashes}, postings={matcher.index_size}", flush=True)
return matcher
......@@ -58,7 +58,7 @@ def build_embedding_index(
checkpoint_every_refs=checkpoint_every_refs,
resume=resume,
)
-print(f"[done] embedding index built: {len(ref_ids)} refs")
+print(f"[done] embedding index built: {len(ref_ids)} refs", flush=True)
return embedder, ref_embs, ref_ids
......@@ -67,12 +67,12 @@ def cmd_build_index(args):
out_dir = Path(args.output)
out_dir.mkdir(parents=True, exist_ok=True)
-print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}")
+print(f"[build-index] starting chromaprint index: data={data_dir} output={out_dir}", flush=True)
build_chroma_index(data_dir, out_dir, checkpoint_every_refs=args.chromaprint_checkpoint_every_refs)
print(
f"[build-index] starting embedding index: model={args.model} device={args.device} "
f"resume={args.resume} checkpoint_every_refs={args.checkpoint_every_refs}"
-)
+, flush=True)
build_embedding_index(
data_dir,
Path(args.model),
......@@ -108,7 +108,7 @@ def cmd_recognize(args):
engine.load_metadata(str(p))
result = engine.recognize(args.query, top_n=args.top_n)
-print(json.dumps(result, ensure_ascii=False, indent=2))
+print(json.dumps(result, ensure_ascii=False, indent=2), flush=True)
def cmd_full_demo(args):
......@@ -125,7 +125,7 @@ def cmd_full_demo(args):
segment_duration=args.segment_duration,
seed=args.seed,
)
-print(f"[done] dataset generated at {data_dir}")
+print(f"[done] dataset generated at {data_dir}", flush=True)
model_path = model_dir / 'best_model.pt'
if not model_path.exists():
......@@ -136,7 +136,7 @@ def cmd_full_demo(args):
'--data', str(data_dir), '--output', str(model_dir),
'--device', args.device, '--epochs', '3', '--batch-size', '8'
]
-print('[full-demo] training model:', ' '.join(cmd))
+print('[full-demo] training model:', ' '.join(cmd), flush=True)
subprocess.run(cmd, check=True)
index_dir.mkdir(parents=True, exist_ok=True)
......@@ -152,8 +152,8 @@ def cmd_full_demo(args):
for split in ['train.json', 'val.json', 'test.json']:
engine.load_metadata(str(data_dir / split))
result = engine.recognize(str(query_path), top_n=5)
-print('[demo-query]', query_item['song_id'], query_item['audio_path'])
-print(json.dumps(result, ensure_ascii=False, indent=2))
+print('[demo-query]', query_item['song_id'], query_item['audio_path'], flush=True)
+print(json.dumps(result, ensure_ascii=False, indent=2), flush=True)
if __name__ == '__main__':
......
......@@ -139,7 +139,7 @@ class ChromaprintMatcher:
f"[chromaprint-index] progress: refs={ref_idx}/{total_refs} "
f"hashes={self.num_hashes} postings={self.index_size} "
f"elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
-)
+, flush=True)
if checkpoint_every_refs > 0 and ref_idx % checkpoint_every_refs == 0:
if cache_file is not None:
self.save(str(cache_file))
......
......@@ -48,7 +48,7 @@ class ECAPAEmbedder:
)
missing = self.model.load_state_dict(state["model_state_dict"], strict=False)
if missing.unexpected_keys:
-print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}")
+print(f"[warn] unexpected keys while loading model: {missing.unexpected_keys}", flush=True)
self.model.to(self.device)
self.model.eval()
......@@ -133,7 +133,7 @@ class ECAPAEmbedder:
refs_done = 0
if resume and final_embs_path.exists() and final_ids_path.exists():
-print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}")
+print(f"[build-reference-index] resume hit complete index: {final_embs_path} / {final_ids_path}", flush=True)
final_embs = np.load(final_embs_path)
final_ids = np.load(final_ids_path, allow_pickle=True).tolist()
return final_embs, final_ids
......@@ -154,9 +154,9 @@ class ECAPAEmbedder:
print(
f"[build-reference-index] resuming from checkpoint: refs_done={refs_done}/{total_refs} "
f"windows_done={len(all_ids)}"
-)
+, flush=True)
except Exception as exc:
-print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}")
+print(f"[build-reference-index] resume checkpoint ignored due to load failure: {exc}", flush=True)
refs_done = 0
all_embs = []
all_ids = []
......@@ -170,7 +170,7 @@ class ECAPAEmbedder:
print(
f"[build-reference-index] start: refs={total_refs} device={self.device.type} "
f"window_sec={window_sec} stride_sec={stride_sec} resume={resume} refs_done={refs_done}"
-)
+, flush=True)
def write_checkpoint(ref_idx: int):
if not all_embs:
......@@ -214,7 +214,7 @@ class ECAPAEmbedder:
}, indent=2))
if refs_done > total_refs:
-print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting")
+print(f"[build-reference-index] resume refs_done={refs_done} exceeds refs_total={total_refs}; restarting", flush=True)
refs_done = 0
all_embs = []
all_ids = []
......@@ -239,7 +239,7 @@ class ECAPAEmbedder:
print(
f"[build-reference-index] progress: refs={ref_idx}/{total_refs} "
f"windows={len(all_ids)} elapsed_sec={elapsed:.1f} eta_sec={eta_sec:.1f}"
-)
+, flush=True)
if checkpoint_every_refs > 0 and (ref_idx % checkpoint_every_refs == 0 or ref_idx == total_refs):
write_checkpoint(ref_idx)
......@@ -252,7 +252,7 @@ class ECAPAEmbedder:
np.save(final_embs_path, all_embs)
np.save(final_ids_path, np.array(all_ids))
write_complete(len(all_ids), all_embs.shape)
-print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}")
+print(f"Built reference index: {len(all_ids)} windows, embeddings shape {all_embs.shape}", flush=True)
return all_embs, all_ids
def search(self, query_emb: np.ndarray, ref_embs: np.ndarray, ref_ids: List[str], top_k: int = 10):
......
## 2026-06-02 15:18 UTC / build-index log flush hardening
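- Added `flush=True` to key `print()` calls in `run_demo.py`, `chromaprint_matcher.py`, and `ecapa_embedder.py`
- Purpose: prevent the log file from staying at `0 bytes` during long `build-index` runs, which made silent runs and silent exits hard to diagnose
- Fresh evidence: the tiny-sample repro `/tmp/chroma_repro_tiny12` confirmed that logs reach disk in real time
- `RC=1`
- Visible in the log: `[build-index] starting chromaprint index`
- Visible in the log: `[build-reference-index] start: refs=12 ...`
- Final traceback visible in the log: `ValueError: No reference embeddings were produced ...`
- Conclusion: at least the observability problem of logs being completely invisible on failure is now fixed; the next step is to keep reproducing the root cause on the real path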
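
The buffering effect described above can be reproduced in isolation. The following is a minimal sketch, not the project's repro: it assumes the log file is produced by redirecting the process's stdout to a file, and the function name `log_size_while_running` and the `ready` handshake on stderr are hypothetical. It measures the log size while the child process is still alive.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def log_size_while_running(flush: bool) -> int:
    """Return the log file size in bytes while the child is still running."""
    # The child prints one line, signals on stderr that print() has returned,
    # then blocks on stdin so that it stays alive during the measurement.
    child = (
        "import sys\n"
        f"print('[build-index] starting chromaprint index', flush={flush})\n"
        "sys.stderr.write('ready\\n'); sys.stderr.flush()\n"
        "sys.stdin.readline()\n"
    )
    # Drop PYTHONUNBUFFERED so the child uses default buffering for stdout.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "build_index.log"
        with open(log_path, "wb") as log_file:
            proc = subprocess.Popen(
                [sys.executable, "-c", child],
                stdin=subprocess.PIPE,
                stdout=log_file,
                stderr=subprocess.PIPE,
                env=env,
            )
            proc.stderr.readline()  # wait until the child has called print()
            size = log_path.stat().st_size
            proc.communicate()  # closes stdin, which lets the child exit
        return size


if __name__ == "__main__":
    print("without flush:", log_size_while_running(flush=False), "bytes")
    print("with flush:   ", log_size_while_running(flush=True), "bytes")
```

When stdout is a regular file, Python block-buffers it, so a short message stays in memory until the buffer fills or the interpreter shuts down cleanly. This is consistent with the `0 bytes` logs observed during long runs.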
## 2026-06-02 15:09 UTC / build-index unexpected exit checkpoint
- Fresh evidence: both the observable and legacy `build-index` processes have exited
......
......@@ -44,3 +44,27 @@
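1. It should no longer be described as "merely linearly slow".
2. The next round of work should turn to **investigating the abnormal build-index exit**
3. A new commit is now meaningful because the state changed from "running" to "exited with no downstream artifacts".
## Additional deliverables in this update (2026-06-02 15:18 UTC)
### New code fixes
| File | Change |
|---|---|
| [../acr-engine/run_demo.py](../acr-engine/run_demo.py) | Key `build-index` / demo logs now all use `flush=True` |
| [../acr-engine/src/engines/chromaprint_matcher.py](../acr-engine/src/engines/chromaprint_matcher.py) | Progress logs in the chromaprint stage use `flush=True` |
| [../acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | Key logs in the embedding/reference stage use `flush=True` |
### New verification evidence
- Tiny-sample repro: `/tmp/chroma_repro_tiny12`
- Result: `RC=1`
- Logs now reach disk in real time and are no longer `0 bytes`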
- `[build-index] starting chromaprint index ...`
- `[build-reference-index] start: refs=12 ...`
- `ValueError: No reference embeddings were produced ...`
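### Conclusion
- The observability problem of logs being completely invisible on failure is now fixed.
- The next round of root-cause investigation can rely directly on live logs instead of waiting blind.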
......
......@@ -44,6 +44,11 @@
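- This is no longer a "CPU-only long build that is still making progress" state.
- It now looks more like: **`build-index` exited partway through the chromaprint stage without leaving explicit downstream artifacts**
#### Completed low-risk fix
- Key logs in `run_demo.py`, `chromaprint_matcher.py`, and `ecapa_embedder.py` now use `flush=True`
- The tiny sample `/tmp/chroma_repro_tiny12` confirmed that on failure the log and traceback reach disk in real time and no longer stay at `0 bytes`
- This means the next session can treat the logs as first-hand evidence rather than a black box when it continues the investigation.
## Takeover order for a new session
1. Read [./session-handoff.md](./session-handoff.md) first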
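
One way to keep the flush-on-progress behavior from regressing as new log lines are added is to route them through a single helper. This is a sketch only: the name `log` and the `CountingStream` class are hypothetical and are not part of this commit, which edits each `print()` call directly.

```python
import functools
import io

# Hypothetical helper: every call behaves like print(..., flush=True).
log = functools.partial(print, flush=True)


class CountingStream(io.StringIO):
    """In-memory stream that counts flush() calls (demonstration only)."""

    def __init__(self):
        super().__init__()
        self.flush_count = 0

    def flush(self):
        self.flush_count += 1
        super().flush()


def demo() -> int:
    """Write two log lines and return how many times the stream was flushed."""
    stream = CountingStream()
    log("[build-index] starting chromaprint index", file=stream)
    log("[chromaprint-index] progress: refs=1/12", file=stream)
    return stream.flush_count


if __name__ == "__main__":
    print("flush calls:", demo())
```

Each `log(...)` call flushes its target stream once, so a line written just before an unexpected exit is already on disk.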
......
......@@ -31,6 +31,16 @@
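2. Look for evidence of silent failure / OOM / shell termination
3. Reproduce the anomaly on a small sample and add logs
### Latest observability fix (2026-06-02 15:18 UTC)
- Added `flush=True` to key `print()` calls in `run_demo.py`, `src/engines/chromaprint_matcher.py`, and `src/engines/ecapa_embedder.py`
- The tiny-sample repro `/tmp/chroma_repro_tiny12` confirmed:
- The log file no longer stays at `0 bytes`
- The traceback reaches disk in real time
- Confirmed so far: at least the "no logs on failure" problem is fixed; the next step is to keep chasing the real-path root cause.
- Additional verification result: `RC=1`; the log shows `ValueError: No reference embeddings were produced ...`
This is a **music ACR / music retrieval** project moving from prototype toward production.
Completed so far: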
......