Commit d6d67893 d6d67893f179fa1f4d61916a65b34f2851f1bba2 by cnb.bofCdSsphPA

Make data onboarding and long FMA transfer supervision easier to sustain

Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions
Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development
Confidence: high
Scope-risk: narrow
Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect
Not-tested: Full archive completion, extraction, and real-data smoke remain pending
1 parent 83a3f89f
#!/usr/bin/env python3
"""Ensure the FMA archive download keeps running; restart if stalled or dead."""
from __future__ import annotations

import argparse
import json
import subprocess
import time
from pathlib import Path

PYTHON = "/usr/local/miniconda3/bin/python"
# Script paths are relative to the current working directory, so the watcher
# must be started from the directory that contains scripts/.
INSPECT = [PYTHON, "scripts/prepare_fma_archive.py", "inspect"]
BG = [PYTHON, "scripts/prepare_fma_archive.py", "bg-download"]
DEFAULT_LOG = Path("/tmp/fma_modelscope_watch.log")


def inspect() -> dict:
    """Return the JSON status printed by the `inspect` subcommand."""
    out = subprocess.check_output(INSPECT, text=True)
    return json.loads(out)


def bg_download() -> dict:
    """Trigger the `bg-download` subcommand and return its JSON output."""
    out = subprocess.check_output(BG, text=True)
    return json.loads(out)


def has_live_curl() -> bool:
    """Return True if any process command line mentions fma_small.zip.

    The match is on the archive name, not on curl itself, so any process
    whose command line contains the file name counts as live.
    """
    proc = subprocess.run(
        ["bash", "-lc", "ps -ef | grep 'fma_small.zip' | grep -v grep >/dev/null"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--interval", type=float, default=5.0)
    parser.add_argument("--cycles", type=int, default=3)
    parser.add_argument("--log-path", default=str(DEFAULT_LOG))
    args = parser.parse_args()

    log_path = Path(args.log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)

    events = []
    previous_size = None
    for _ in range(args.cycles):
        snapshot = inspect()
        size = int(snapshot["archive_size"])
        live = has_live_curl()
        restarted = None
        # Restart when the archive did not grow since the last cycle
        # or when no download process is running.
        if (previous_size is not None and size <= previous_size) or not live:
            restarted = bg_download()
            time.sleep(2)
            # Re-read the state so the logged event reflects the restart.
            snapshot = inspect()
            size = int(snapshot["archive_size"])
            live = has_live_curl()
        event = {
            "snapshot": snapshot,
            "live_curl": live,
            "restarted": restarted,
        }
        events.append(event)
        previous_size = size
        time.sleep(args.interval)

    text = json.dumps({"status": "ok", "events": events}, indent=2, ensure_ascii=False)
    # ensure_ascii=False may emit non-ASCII characters, so fix the encoding.
    log_path.write_text(text, encoding="utf-8")
    print(text)


if __name__ == "__main__":
    main()
......@@ -232,6 +232,54 @@
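### Stage: FMA download watchdog
Completed:
- Added [acr-engine/scripts/watch_fma_download.py](../acr-engine/scripts/watch_fma_download.py)
- On each cycle, the script:
  - runs `inspect` to read the current download progress
  - checks whether the curl process is still alive
  - re-triggers `bg-download` if the download has stalled or the process has disappeared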
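
The restart rule described above can be isolated as a pure function, which makes it testable without touching the network. This is a minimal sketch that mirrors the condition in `watch_fma_download.py`; the name `should_restart` is hypothetical and does not exist in the script.

```python
from typing import Optional


def should_restart(previous_size: Optional[int], size: int, live: bool) -> bool:
    """Return True when the download looks stalled or the process is gone.

    Hypothetical helper; it mirrors the condition in watch_fma_download.py.
    """
    # The first cycle has no earlier size to compare against.
    stalled = previous_size is not None and size <= previous_size
    return stalled or not live


# First cycle, process alive: nothing to compare yet.
print(should_restart(None, 513179648, True))
# Archive grew between cycles and the process is alive.
print(should_restart(513179648, 516587520, True))
# No growth between cycles: treated as stalled.
print(should_restart(516587520, 516587520, True))
# Process disappeared, even though the size grew.
print(should_restart(516587520, 520000000, False))
```

Keeping the rule separate from the subprocess calls means the stall threshold can be changed later (for example, requiring several flat cycles) without rewriting the loop.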
Verification:
- `/usr/local/miniconda3/bin/python -m py_compile scripts/watch_fma_download.py` succeeded
- `/usr/local/miniconda3/bin/python scripts/watch_fma_download.py --cycles 2 --interval 2` succeeded
- The watcher observed:
  - cycle 1: `archive_size=513179648`
  - cycle 2: `archive_size=516587520`
  - `live_curl=true` on both cycles
  - `restarted=null` on both cycles (the download was progressing normally, so no restart was needed)
- Latest inspect:
  - `archive_size=522387456`
  - `archive_progress_percent=6.8023`
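
As a rough cross-check of these numbers, the total archive size and the remaining bytes can be estimated from `archive_size` and `archive_progress_percent`. The sketch below assumes the percentage is simply `archive_size / total * 100`, which this log does not state explicitly; `estimate_total_bytes` is a hypothetical helper.

```python
def estimate_total_bytes(archive_size: int, progress_percent: float) -> float:
    """Estimate the full size, assuming percent = archive_size / total * 100."""
    if progress_percent <= 0:
        raise ValueError("progress_percent must be positive")
    return archive_size / (progress_percent / 100.0)


# Values copied from the watcher run and the latest inspect above.
first_size = 513179648
second_size = 516587520
latest_size = 522387456
latest_percent = 6.8023

growth = second_size - first_size
total = estimate_total_bytes(latest_size, latest_percent)
remaining = total - latest_size

print(f"growth between the two watcher cycles: {growth} bytes")
print(f"estimated total: {total / 1e9:.2f} GB")
print(f"estimated remaining: {remaining / 1e9:.2f} GB")
```

The two watcher cycles differ by 3,407,872 bytes. A transfer rate is deliberately not derived here, because each cycle also includes the time spent running `inspect`, so the elapsed time between samples is longer than `--interval`.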
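Conclusion:
- The real FMA download can now be recovered manually and also has basic automatic watchdog coverage
- Long-running downloads are more resilient as a result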
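### Stage: Training data and pgvector guide
Completed:
- Added [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- The guide covers, in detail:
  - the format currently expected for training data and input data
  - how the `reference` and `query` roles differ
  - how to turn BGM, phone recordings, and livestream screen recordings into training data
  - labeling recommendations for `song_id`, `type`, `offset`, `segment_type`, and related fields
  - the recommended table structure and column design for a future `pgvector` integration
- Linked the guide from [docs/README.md](./README.md) and [docs/session-handoff.md](./session-handoff.md)
Verification:
- The guide's content has been aligned with current code behavior in:
  - `acr-engine/src/data/dataset.py`
  - `acr-engine/src/data/manifest_tools.py`
- Added a complete layer-by-layer description of the path "raw audio -> normalized assets -> manifest -> pgvector"
Conclusion:
- The project now has standalone, handoff-ready, actionable documentation for what training data should look like and how to connect pgvector later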
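
To make the labels above concrete, here is a hypothetical manifest record built from the field names mentioned in this entry. The allowed values, the `audio_path` field, and the one-JSON-object-per-line layout are assumptions for illustration only; the guide and `manifest_tools.py` remain the source of truth for the real schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ManifestEntry:
    """Illustrative record only; value vocabularies and audio_path are assumed."""

    song_id: str       # identity shared by a reference and its queries
    type: str          # assumed values: "reference" or "query"
    offset: float      # assumed: seconds from the start of the reference track
    segment_type: str  # assumed free-form label, e.g. "chorus"
    audio_path: str    # hypothetical field, not named in this log

    def validate(self) -> None:
        """Reject records that break the assumptions stated above."""
        if self.type not in {"reference", "query"}:
            raise ValueError(f"unexpected type: {self.type!r}")
        if self.offset < 0:
            raise ValueError("offset must be non-negative")


entry = ManifestEntry("song-0001", "query", 12.5, "chorus", "audio/q_0001.wav")
entry.validate()

# Serialize as a single JSON line (assumed manifest layout).
line = json.dumps(asdict(entry), ensure_ascii=False)
print(line)
```

Validating records when the manifest is written, rather than when it is loaded for training, tends to surface labeling mistakes closer to where they were made.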
### Stage: FMA background resumable-download recovery
Completed:
......
......@@ -275,6 +275,7 @@
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/session-handoff.md](./session-handoff.md)
- [docs/current-capability-map.md](./current-capability-map.md)
- [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
- The FMA real-subset download scaffold already exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py). In the most recent check, the old direct link returned `403` and the page-level historical URL returned `404`, but `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` was verified to return `200 OK` and to support range requests
- Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
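
The range-support check mentioned for the ModelScope URL can be reproduced roughly as follows. This is a standard-library sketch, not code from `fetch_fma_subset.py`; the helper names are hypothetical. A server that honors a `Range` header answers `206 Partial Content` and includes a `Content-Range` header.

```python
import urllib.request
from typing import Optional

URL = "https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip"


def build_range_request(url: str, start: int = 0, end: int = 0) -> urllib.request.Request:
    """Build a GET request asking only for bytes start..end (inclusive)."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def honors_range(status: int, content_range: Optional[str]) -> bool:
    """Interpret a response: 206 plus a Content-Range header means resumable."""
    return status == 206 and bool(content_range) and content_range.startswith("bytes ")


def probe(url: str = URL, timeout: float = 10.0) -> bool:
    """Perform the network check; call this manually when online."""
    request = build_range_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return honors_range(resp.status, resp.headers.get("Content-Range"))


# Offline demonstration of the interpretation step only.
print(honors_range(206, "bytes 0-0/100"))
print(honors_range(200, None))
```

Requesting a single byte keeps the probe cheap, and a plain `200` in response to a ranged request indicates the server ignored the header, in which case a resumed download would restart from zero.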
......