Make data onboarding and long FMA transfer supervision easier to sustain
Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions
Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development
Confidence: high
Scope-risk: narrow
Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect
Not-tested: Full archive completion, extraction, and real-data smoke remain pending
Showing 4 changed files with 124 additions and 0 deletions
acr-engine/scripts/watch_fma_download.py
0 → 100755
#!/usr/bin/env python3
"""Ensure the FMA archive download keeps running; restart if stalled or dead."""

from __future__ import annotations

import argparse
import json
import subprocess
import time
from pathlib import Path

PYTHON = "/usr/local/miniconda3/bin/python"
INSPECT = [PYTHON, "scripts/prepare_fma_archive.py", "inspect"]
BG = [PYTHON, "scripts/prepare_fma_archive.py", "bg-download"]
DEFAULT_LOG = Path("/tmp/fma_modelscope_watch.log")


def inspect() -> dict:
    out = subprocess.check_output(INSPECT, text=True)
    return json.loads(out)


def bg_download() -> dict:
    out = subprocess.check_output(BG, text=True)
    return json.loads(out)


def has_live_curl() -> bool:
    proc = subprocess.run(
        ["bash", "-lc", "ps -ef | grep 'fma_small.zip' | grep -v grep >/dev/null"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--interval", type=float, default=5.0)
    parser.add_argument("--cycles", type=int, default=3)
    parser.add_argument("--log-path", default=str(DEFAULT_LOG))
    args = parser.parse_args()

    log_path = Path(args.log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    events = []

    previous_size = None
    for _ in range(args.cycles):
        snapshot = inspect()
        size = int(snapshot["archive_size"])
        live = has_live_curl()
        restarted = None
        if (previous_size is not None and size <= previous_size) or not live:
            restarted = bg_download()
            time.sleep(2)
            snapshot = inspect()
            size = int(snapshot["archive_size"])
            live = has_live_curl()
        event = {
            "snapshot": snapshot,
            "live_curl": live,
            "restarted": restarted,
        }
        events.append(event)
        previous_size = size
        time.sleep(args.interval)

    text = json.dumps({"status": "ok", "events": events}, indent=2, ensure_ascii=False)
    log_path.write_text(text)
    print(text)


if __name__ == "__main__":
    main()
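
The restart condition used in the watcher loop (no growth since the previous cycle, or no live curl process) can be pulled out into a pure function so it can be checked without starting a real download. This is a minimal sketch only: `should_restart` is a hypothetical helper name and is not part of this commit.

```python
from __future__ import annotations


def should_restart(previous_size: int | None, size: int, live: bool) -> bool:
    """Mirror the watcher's condition: restart when stalled or when no process is alive.

    Hypothetical helper for illustration; the committed script inlines this check.
    """
    # "Stalled" is only meaningful once a previous observation exists.
    stalled = previous_size is not None and size <= previous_size
    return stalled or not live


if __name__ == "__main__":
    # First cycle: nothing to compare against, process alive.
    print(should_restart(None, 513179648, True))        # False
    # Archive grew and the process is alive.
    print(should_restart(513179648, 516587520, True))   # False
    # No growth between two cycles.
    print(should_restart(516587520, 516587520, True))   # True
    # Archive grew but no matching process was found.
    print(should_restart(513179648, 516587520, False))  # True
```

Keeping the decision separate from the `subprocess` calls would also make it easier to add a tolerance later (for example, requiring several stalled cycles before restarting), should that turn out to be needed.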
@@ -232,6 +232,54 @@

### Stage: FMA download auto-watchdog

Completed:
- Added [acr-engine/scripts/watch_fma_download.py](../acr-engine/scripts/watch_fma_download.py)
- The script periodically:
  - inspects the current download progress
  - checks whether the curl process is still alive
  - re-triggers `bg-download` automatically if the download has stalled or the process has disappeared

Verification:
- `/usr/local/miniconda3/bin/python -m py_compile scripts/watch_fma_download.py` succeeded
- `/usr/local/miniconda3/bin/python scripts/watch_fma_download.py --cycles 2 --interval 2` succeeded
- The watcher observed:
  - cycle 1: `archive_size=513179648`
  - cycle 2: `archive_size=516587520`
  - `live_curl=true` in both cycles
  - `restarted=null` in both cycles (the download was advancing normally and no restart was needed)
- Latest inspect:
  - `archive_size=522387456`
  - `archive_progress_percent=6.8023`

Conclusion:
- The real FMA download can now be recovered manually and also has basic automatic supervision
- Long-running downloads are more resilient to interruption

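
The figures recorded above are enough for a rough throughput and time-remaining estimate. The sketch below is illustrative only. It assumes the two watcher observations were about 2 seconds apart (the `--interval` value, ignoring the time each inspect call takes) and that the total archive size can be inferred from the latest `archive_size` and `archive_progress_percent` pair. The function names are hypothetical and do not exist in the repository.

```python
from __future__ import annotations


def estimate_total_bytes(archive_size: int, progress_percent: float) -> float:
    """Infer the full archive size from a size/percent pair reported by inspect."""
    if progress_percent <= 0:
        raise ValueError("progress_percent must be positive")
    return archive_size / (progress_percent / 100.0)


def throughput(size_before: int, size_after: int, elapsed_s: float) -> float:
    """Average bytes per second between two observations."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (size_after - size_before) / elapsed_s


def eta_seconds(archive_size: int, total_bytes: float, rate: float) -> float | None:
    """Remaining seconds at the given rate, or None if the transfer is not advancing."""
    if rate <= 0:
        return None
    return max(total_bytes - archive_size, 0.0) / rate


if __name__ == "__main__":
    total = estimate_total_bytes(522387456, 6.8023)
    # Assumption: roughly 2 s elapsed between the two watcher cycles.
    rate = throughput(513179648, 516587520, elapsed_s=2.0)
    eta = eta_seconds(522387456, total, rate)
    print(f"estimated total: {total / 1e9:.2f} GB")
    print(f"throughput: {rate / 1e6:.2f} MB/s")
    print(f"eta: {eta / 60:.0f} min" if eta is not None else "eta: unknown")
```

With these inputs the estimate comes out to roughly 7.68 GB in total, about 1.7 MB/s, and on the order of 70 minutes remaining. Because the real gap between observations is somewhat longer than the sleep interval, the true rate is probably a little lower than this figure.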
### Stage: Training data and pgvector guide

Completed:
- Added [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- The guide covers, in detail:
  - the format that current training data and input data should take
  - the distinct roles of `reference` and `query`
  - how to turn BGM, phone recordings, and livestream screen recordings into training data
  - recommended labels such as `song_id`, `type`, `offset`, and `segment_type`
  - the recommended table structure and field design for a future `pgvector` integration
- Linked the guide from [docs/README.md](./README.md) and [docs/session-handoff.md](./session-handoff.md)

Verification:
- The guide's content is aligned with current code behavior in:
  - `acr-engine/src/data/dataset.py`
  - `acr-engine/src/data/manifest_tools.py`
- Added a complete layered description of the path "raw audio -> normalized assets -> manifest -> pgvector"

Conclusion:
- The project now has standalone, handoff-ready, actionable documentation for "what training data should look like" and "how to connect pgvector later"

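
To make the labels listed above more concrete, here is a hedged sketch of what a manifest entry and a matching pgvector table could look like. The guide itself is not shown in this diff, so everything here is an assumption for illustration: the `ManifestEntry` class, the `audio_path` field, the idea that `type` holds the `reference`/`query` role, the unit of `offset`, the table and column names, and the embedding dimension. Only the `vector(n)` column type and `CREATE EXTENSION vector` come from pgvector itself.

```python
from dataclasses import dataclass

# Assumption: the `type` label carries the reference/query role.
ALLOWED_TYPES = {"reference", "query"}


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest row; field names follow the labels listed above."""

    song_id: str
    type: str          # "reference" or "query" (assumed meaning)
    offset: float      # assumed to be seconds from the start of the source track
    segment_type: str  # free-form label; concrete values are not specified here
    audio_path: str    # hypothetical field, not named in the source

    def __post_init__(self) -> None:
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
        if self.offset < 0:
            raise ValueError("offset must be non-negative")


def embedding_table_ddl(dim: int) -> str:
    """Return illustrative pgvector DDL; table and column names are assumptions."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    # `offset` is renamed because OFFSET is a reserved keyword in PostgreSQL.
    return (
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        "CREATE TABLE IF NOT EXISTS segment_embeddings (\n"
        "    id BIGSERIAL PRIMARY KEY,\n"
        "    song_id TEXT NOT NULL,\n"
        "    entry_type TEXT NOT NULL,\n"
        "    offset_seconds DOUBLE PRECISION NOT NULL,\n"
        "    segment_type TEXT,\n"
        f"    embedding vector({dim}) NOT NULL\n"
        ");"
    )


if __name__ == "__main__":
    entry = ManifestEntry("song-0001", "reference", 12.5, "chorus", "assets/song-0001.wav")
    print(entry)
    print(embedding_table_ddl(128))  # 128 is an arbitrary example dimension
```

The actual field names and table layout should be taken from `acr-engine/src/data/dataset.py`, `acr-engine/src/data/manifest_tools.py`, and the guide, not from this sketch.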
### Stage: FMA background resume recovery

Completed:
@@ -275,6 +275,7 @@
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/session-handoff.md](./session-handoff.md)
- [docs/current-capability-map.md](./current-capability-map.md)
- [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
- Scaffolding for downloading a real FMA subset already exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py). In the most recent check the old direct link returned `403` and the historical page-level URL returned `404`, but `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` was verified to return `200 OK` and to support range requests
- Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
docs/training-data-and-pgvector-guide.md
0 → 100644