Commit d6d67893 d6d67893f179fa1f4d61916a65b34f2851f1bba2 by cnb.bofCdSsphPA

Make data onboarding and long FMA transfer supervision easier to sustain

Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions
Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development
Confidence: high
Scope-risk: narrow
Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect
Not-tested: Full archive completion, extraction, and real-data smoke remain pending
1 parent 83a3f89f
#!/usr/bin/env python3
"""Ensure the FMA archive download keeps running; restart if stalled or dead."""

from __future__ import annotations

import argparse
import json
import subprocess
import time
from pathlib import Path

PYTHON = "/usr/local/miniconda3/bin/python"
# The helper paths below are relative to acr-engine/, so subprocesses run from
# there regardless of the caller's working directory.
ENGINE_ROOT = Path(__file__).resolve().parents[1]
INSPECT = [PYTHON, "scripts/prepare_fma_archive.py", "inspect"]
BG = [PYTHON, "scripts/prepare_fma_archive.py", "bg-download"]
DEFAULT_LOG = Path("/tmp/fma_modelscope_watch.log")


def inspect() -> dict:
    """Return the JSON status printed by `prepare_fma_archive.py inspect`."""
    out = subprocess.check_output(INSPECT, text=True, cwd=ENGINE_ROOT)
    return json.loads(out)


def bg_download() -> dict:
    """Trigger `prepare_fma_archive.py bg-download` and return the JSON it prints."""
    out = subprocess.check_output(BG, text=True, cwd=ENGINE_ROOT)
    return json.loads(out)


def has_live_curl() -> bool:
    """Return True if any running process mentions fma_small.zip in its command line."""
    proc = subprocess.run(
        ["bash", "-lc", "ps -ef | grep 'fma_small.zip' | grep -v grep >/dev/null"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--interval", type=float, default=5.0,
                        help="seconds to sleep between cycles")
    parser.add_argument("--cycles", type=int, default=3,
                        help="number of check cycles to run")
    parser.add_argument("--log-path", default=str(DEFAULT_LOG),
                        help="where to write the JSON event log")
    args = parser.parse_args()

    log_path = Path(args.log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    events = []

    previous_size = None
    for _ in range(args.cycles):
        snapshot = inspect()
        size = int(snapshot["archive_size"])
        live = has_live_curl()
        restarted = None
        # Restart when the archive has not grown since the previous cycle,
        # or when no download process is running.
        if (previous_size is not None and size <= previous_size) or not live:
            restarted = bg_download()
            time.sleep(2)  # give the new download process a moment to start
            snapshot = inspect()
            size = int(snapshot["archive_size"])
            live = has_live_curl()
        event = {
            "snapshot": snapshot,
            "live_curl": live,
            "restarted": restarted,
        }
        events.append(event)
        previous_size = size
        time.sleep(args.interval)

    text = json.dumps({"status": "ok", "events": events}, indent=2, ensure_ascii=False)
    log_path.write_text(text)
    print(text)


if __name__ == "__main__":
    main()
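
The restart rule inside the loop can be isolated as a pure function, which makes it testable without spawning any processes. This is a minimal sketch rather than part of the commit: `should_restart` is a hypothetical helper name, and the sizes in the demo are the values recorded in the verification notes.

```python
from __future__ import annotations


def should_restart(previous_size: int | None, size: int, live: bool) -> bool:
    """Return True when the download looks stalled or its process is gone.

    Mirrors the condition used in main(): restart if the archive did not grow
    since the previous cycle, or if no download process is running.
    """
    stalled = previous_size is not None and size <= previous_size
    return stalled or not live


if __name__ == "__main__":
    print(should_restart(None, 513179648, True))        # first cycle, process alive
    print(should_restart(513179648, 516587520, True))   # archive grew
    print(should_restart(516587520, 516587520, True))   # no growth since last cycle
    print(should_restart(516587520, 522387456, False))  # grew, but process is gone
```

Note that a finished download also satisfies this rule (no growth, no live process), so the sketch, like the loop it mirrors, relies on `bg-download` to handle an already complete archive.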
@@ -232,6 +232,54 @@
232 232
233 233
234 234
235
236
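237 ### Stage: FMA download auto-watchdog
238
239 Completed:
240 - Added [acr-engine/scripts/watch_fma_download.py](../acr-engine/scripts/watch_fma_download.py)
241 - On each cycle, the watcher:
242   - inspects the current download progress
243   - checks whether the curl process is still alive
244   - re-triggers `bg-download` automatically if the download has stalled or the process has disappeared
245
246 Verification:
247 - `/usr/local/miniconda3/bin/python -m py_compile scripts/watch_fma_download.py` succeeded
248 - `/usr/local/miniconda3/bin/python scripts/watch_fma_download.py --cycles 2 --interval 2` succeeded
249 - The watcher observed:
250   - cycle 1: `archive_size=513179648`
251   - cycle 2: `archive_size=516587520`
252   - `live_curl=true` in both cycles
253   - `restarted=null` in both cycles (the download was progressing normally, so no restart was needed)
254 - Latest inspect:
255   - `archive_size=522387456`
256   - `archive_progress_percent=6.8023`
257
258 Conclusion:
259 - The real FMA download can now be recovered manually and also has basic automatic supervision
260 - Long-running downloads are more resilient to stalls and dead processes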
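
The watcher writes its events as JSON (`{"status": "ok", "events": [...]}`), so a run can be summarized after the fact. The following is a minimal sketch under that assumption: `summarize_events` and `summarize_watch_log` are hypothetical helper names, and the demo events carry only the `archive_size` field of the snapshot, whereas a real snapshot contains whatever `inspect` returns.

```python
from __future__ import annotations

import json
from pathlib import Path


def summarize_events(events: list[dict]) -> dict:
    """Summarize archive growth and restarts across watcher cycles."""
    sizes = [int(e["snapshot"]["archive_size"]) for e in events]
    return {
        "cycles": len(events),
        "bytes_gained": sizes[-1] - sizes[0] if sizes else 0,
        "restarts": sum(1 for e in events if e["restarted"] is not None),
        "all_live": all(e["live_curl"] for e in events),
    }


def summarize_watch_log(log_path: Path) -> dict:
    """Load a watcher log file and summarize its events."""
    return summarize_events(json.loads(log_path.read_text())["events"])


if __name__ == "__main__":
    # The two cycles recorded in the verification notes.
    observed = [
        {"snapshot": {"archive_size": 513179648}, "live_curl": True, "restarted": None},
        {"snapshot": {"archive_size": 516587520}, "live_curl": True, "restarted": None},
    ]
    print(summarize_events(observed))
```

For the two recorded cycles this reports a gain of 3407872 bytes (3.25 MiB) with no restarts.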
261
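262 ### Stage: Training data and pgvector guide
263
264 Completed:
265 - Added [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
266 - The guide covers, in detail:
267   - the format currently expected for training data and input data
268   - the distinct roles of `reference` and `query`
269   - how to turn BGM, phone recordings, and livestream screen recordings into training data
270   - labeling recommendations for `song_id`, `type`, `offset`, `segment_type`, and related fields
271   - the recommended table structure and field design for a future `pgvector` integration
272 - Linked the guide from [docs/README.md](./README.md) and [docs/session-handoff.md](./session-handoff.md)
273
274 Verification:
275 - The guide's content matches the current behavior of:
276   - `acr-engine/src/data/dataset.py`
277   - `acr-engine/src/data/manifest_tools.py`
278 - Added a complete layered walkthrough: raw audio -> normalized assets -> manifest -> pgvector
279
280 Conclusion:
281 - The project now has standalone, handoff-ready, actionable documentation for both "what training data should look like" and "how to connect pgvector later"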
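
To make the labels listed above concrete, here is a hedged sketch of one manifest record and a possible `pgvector` table. Only the label names `song_id`, `type`, `offset`, and `segment_type` come from the notes; `audio_path`, the allowed `type` values, the example `segment_type` value, the unit of `offset`, the table name, and the 128-dimensional embedding are all assumptions for illustration and may not match `manifest_tools.py` or the guide.

```python
import json
from dataclasses import asdict, dataclass

# Assumption: `type` carries the reference/query role described in the guide.
ALLOWED_TYPES = {"reference", "query"}


@dataclass
class ManifestEntry:
    audio_path: str    # hypothetical field: location of the normalized audio asset
    song_id: str
    type: str
    offset: float      # assumed to be seconds from the start of the reference track
    segment_type: str  # free-form label here; the real vocabulary is not shown

    def validate(self) -> None:
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")
        if self.offset < 0:
            raise ValueError("offset must be non-negative")


def to_jsonl(entry: ManifestEntry) -> str:
    """Serialize one validated entry as a single JSON line."""
    entry.validate()
    return json.dumps(asdict(entry), ensure_ascii=False)


# Hypothetical schema. OFFSET is a reserved word in PostgreSQL, so the column
# is named offset_seconds instead of offset.
PGVECTOR_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS audio_segments (
    id BIGSERIAL PRIMARY KEY,
    song_id TEXT NOT NULL,
    segment_type TEXT,
    offset_seconds REAL NOT NULL,
    embedding vector(128) NOT NULL
);
CREATE INDEX IF NOT EXISTS audio_segments_embedding_idx
    ON audio_segments USING hnsw (embedding vector_cosine_ops);
"""

if __name__ == "__main__":
    entry = ManifestEntry("audio/ref/0001.wav", "song-0001", "reference", 0.0, "chorus")
    print(to_jsonl(entry))
```

The authoritative field list remains the guide itself together with `dataset.py` and `manifest_tools.py`; this sketch only shows the shape such a record and table could take.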
282
235 283 ### Stage: FMA background resume recovery
236 284
237 285 Completed:
......
@@ -275,6 +275,7 @@
275 275 - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
276 276 - [docs/session-handoff.md](./session-handoff.md)
277 277 - [docs/current-capability-map.md](./current-capability-map.md)
278 - [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
278 279 - [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
279 280 - A download scaffold for the real FMA subset already exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py). The most recent check found that the old direct link returns `403` and the historical page-level URL returns `404`, but `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` was verified to return `200 OK` and to support range requests
280 281 - Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
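
Because the ModelScope URL is recorded as supporting range requests, an interrupted transfer can continue from the bytes already on disk. The sketch below illustrates the idea with the Python standard library only; it is not how `prepare_fma_archive.py` performs the download (the notes indicate curl is used), and `build_range_request` and `resume_download` are hypothetical names.

```python
from pathlib import Path
import urllib.request

FMA_URL = "https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip"


def build_range_request(url: str, have_bytes: int) -> urllib.request.Request:
    """Build a GET request that asks only for the bytes not yet downloaded."""
    req = urllib.request.Request(url)
    if have_bytes > 0:
        req.add_header("Range", f"bytes={have_bytes}-")
    return req


def resume_download(url: str, partial: Path, chunk_size: int = 1 << 20) -> None:
    """Append the missing bytes to `partial`, or start over if Range is ignored."""
    have = partial.stat().st_size if partial.exists() else 0
    with urllib.request.urlopen(build_range_request(url, have)) as resp:
        # 206 Partial Content means the server honored the Range header;
        # any other success status carries the full body, so overwrite.
        mode = "ab" if resp.status == 206 else "wb"
        with partial.open(mode) as fh:
            while chunk := resp.read(chunk_size):
                fh.write(chunk)


if __name__ == "__main__":
    # Only builds the request; no network traffic happens here.
    req = build_range_request(FMA_URL, 513179648)
    print(req.get_header("Range"))
```

A request for a range starting at or beyond the end of the file is typically answered with `416`, which `urllib` raises as an `HTTPError`, so a caller would want to treat that case as "already complete" or re-check the expected size.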
......