Make data onboarding and long FMA transfer supervision easier to sustain

Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development Confidence: high Scope-risk: narrow Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect Not-tested: Full archive completion, extraction, and real-data smoke remain pending

Make data onboarding and long FMA transfer supervision easier to sustain
Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development Confidence: high Scope-risk: narrow Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect Not-tested: Full archive completion, extraction, and real-data smoke remain pending
cnb.bofCdSsphPA
Commit d6d67893 ... d6d67893f179fa1f4d61916a65b34f2851f1bba2 authored 2026-06-02 13:47:54 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 124 additions and 0 deletions
acr-engine/scripts/watch_fma_download.py
docs/CHANGELOG.md
docs/session-handoff.md
docs/training-data-and-pgvector-guide.md
--- a/acr-engine/scripts/watch_fma_download.py 0 → 100755
View file @d6d6789
+++ b/acr-engine/scripts/watch_fma_download.py 0 → 100755
View file @d6d6789
+#!/usr/bin/env python3
+"""Ensure the FMA archive download keeps running; restart if stalled or dead."""
+from __future__ import annotations
+import argparse
+import json
+import subprocess
+import time
+from pathlib import Path
+PYTHON = "/usr/local/miniconda3/bin/python"
+INSPECT = [PYTHON, "scripts/prepare_fma_archive.py", "inspect"]
+BG = [PYTHON, "scripts/prepare_fma_archive.py", "bg-download"]
+DEFAULT_LOG = Path("/tmp/fma_modelscope_watch.log")
+def inspect() -> dict:
+    out = subprocess.check_output(INSPECT, text=True)
+    return json.loads(out)
+def bg_download() -> dict:
+    out = subprocess.check_output(BG, text=True)
+    return json.loads(out)
+def has_live_curl() -> bool:
+    proc = subprocess.run(
+        ["bash", "-lc", "ps -ef | grep 'fma_small.zip' | grep -v grep >/dev/null"],
+        capture_output=True,
+        text=True,
+    )
+    return proc.returncode == 0
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--interval", type=float, default=5.0)
+    parser.add_argument("--cycles", type=int, default=3)
+    parser.add_argument("--log-path", default=str(DEFAULT_LOG))
+    args = parser.parse_args()
+    log_path = Path(args.log_path)
+    log_path.parent.mkdir(parents=True, exist_ok=True)
+    events = []
+    previous_size = None
+    for _ in range(args.cycles):
+        snapshot = inspect()
+        size = int(snapshot["archive_size"])
+        live = has_live_curl()
+        restarted = None
+        if (previous_size is not None and size <= previous_size) or not live:
+            restarted = bg_download()
+            time.sleep(2)
+            snapshot = inspect()
+            size = int(snapshot["archive_size"])
+            live = has_live_curl()
+        event = {
+            "snapshot": snapshot,
+            "live_curl": live,
+            "restarted": restarted,
+        }
+        events.append(event)
+        previous_size = size
+        time.sleep(args.interval)
+    text = json.dumps({"status": "ok", "events": events}, indent=2, ensure_ascii=False)
+    log_path.write_text(text)
+    print(text)
+if __name__ == "__main__":
+    main()
--- a/docs/CHANGELOG.md
View file @d6d6789
+++ b/docs/CHANGELOG.md
View file @d6d6789
@@ -232,6 +232,54 @@
+### Stage: FMA 下载自动守护
+完成项：
+- 新增 [acr-engine/scripts/watch_fma_download.py](../acr-engine/scripts/watch_fma_download.py)
+- 支持周期性：
+  - inspect 当前下载进度
+  - 检查 curl 进程是否存活
+  - 如果停滞或进程消失则自动重新触发 `bg-download`
+验证结果：
+- `/usr/local/miniconda3/bin/python -m py_compile scripts/watch_fma_download.py` 成功
+- `/usr/local/miniconda3/bin/python scripts/watch_fma_download.py --cycles 2 --interval 2` 成功
+- 当前 watcher 观测到：
+  - 第 1 次 `archive_size=513179648`
+  - 第 2 次 `archive_size=516587520`
+  - 两次 `live_curl=true`
+  - 两次 `restarted=null`（说明下载健康推进，无需重启）
+- 最新 inspect：
+  - `archive_size=522387456`
+  - `archive_progress_percent=6.8023`
+结论：
+- 现在真实 FMA 下载不只可手动恢复，也具备基础自动守护能力
+- 长时间下载的持续性进一步增强
+### Stage: 训练数据与 pgvector 专项文档
+完成项：
+- 新增 [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
+- 单独详细说明：
+  - 当前训练数据/输入数据应该是什么格式
+  - `reference` / `query` 的角色区分
+  - BGM / 手机录音 / 直播录屏如何转成训练数据
+  - `song_id` / `type` / `offset` / `segment_type` 等标签建议
+  - 未来接 `pgvector` 时的推荐表结构与字段设计
+- 将该文档挂接到 [docs/README.md](./README.md) 与 [docs/session-handoff.md](./session-handoff.md)
+验证结果：
+- 文档内容已对齐当前代码行为：
+  - `acr-engine/src/data/dataset.py`
+  - `acr-engine/src/data/manifest_tools.py`
+- 已补充从“原始音频 -> 标准化资产 -> manifest -> pgvector”的完整分层说明
+结论：
+- 现在项目对“训练数据应该长什么样”与“以后怎么接 pgvector”已经有独立、可交接、可执行的文档说明
 ### Stage: FMA 后台续传恢复
 完成项：
--- a/docs/session-handoff.md
View file @d6d6789
+++ b/docs/session-handoff.md
View file @d6d6789
@@ -275,6 +275,7 @@
   - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
   - [docs/session-handoff.md](./session-handoff.md)
   - [docs/current-capability-map.md](./current-capability-map.md)
+- [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
   - [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
 - FMA 真实子集下载脚手架已存在：[acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py)；最近验证结果是旧直链 `403`、页面级历史 URL `404`；但 `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` 已验证 `200 OK` 且支持 range
   - 运行 [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
--- a/docs/training-data-and-pgvector-guide.md 0 → 100644
View file @d6d6789
+++ b/docs/training-data-and-pgvector-guide.md 0 → 100644
View file @d6d6789