Commit b32e002b b32e002b701880d25cb7769699ff264b6b5dfeaf by cnb.bofCdSsphPA

Preserve the first verified real-FMA download path and blocker evidence

Constraint: Continuous dataset landing work needs concrete failed-path evidence so future sessions do not restart from outdated assumptions
Rejected: Omit the failed download automation because it did not complete | Loses reproducible evidence about the current 403 and missing-tool barriers
Confidence: high
Scope-risk: narrow
Directive: Replace this bounded fetch path only after verifying a stable official archive or mirror-based download route
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/fetch_fma_subset.py; /usr/local/miniconda3/bin/python acr-engine/scripts/fetch_fma_subset.py --report acr-engine/reports/fma_fetch_subset_report.json
Not-tested: Successful real FMA audio download remains blocked by current upstream/tooling availability
1 parent 2d501547
{
  "output_dir": "/workspace/acr-engine/data/raw/fma_small_audio",
  "requested": 8,
  "downloaded": 0,
  "existing": 0,
  "failures": [
    {
      "track_id": 2,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000002.mp3"
    },
    {
      "track_id": 5,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000005.mp3"
    },
    {
      "track_id": 10,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000010.mp3"
    },
    {
      "track_id": 20,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000020.mp3"
    },
    {
      "track_id": 26,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000026.mp3"
    },
    {
      "track_id": 30,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000030.mp3"
    },
    {
      "track_id": 46,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000046.mp3"
    },
    {
      "track_id": 48,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000048.mp3"
    }
  ],
  "results": [
    {
      "track_id": 2,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000002.mp3"
    },
    {
      "track_id": 5,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000005.mp3"
    },
    {
      "track_id": 10,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000010.mp3"
    },
    {
      "track_id": 20,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000020.mp3"
    },
    {
      "track_id": 26,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000026.mp3"
    },
    {
      "track_id": 30,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000030.mp3"
    },
    {
      "track_id": 46,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000046.mp3"
    },
    {
      "track_id": 48,
      "status": "http_error",
      "code": 403,
      "url": "https://files.freemusicarchive.org/storage-freemusicarchive-org/music/000/000048.mp3"
    }
  ]
}
\ No newline at end of file
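
A report of this shape can be condensed before it is quoted in a handoff note. The following is a minimal sketch, not part of the commit: `summarize_failures` is a hypothetical helper, and the embedded sample is a trimmed copy (two of the eight entries) of the report above.

```python
import json
from collections import Counter


def summarize_failures(report: dict) -> dict:
    """Tally the failed fetches in a report dict by status and HTTP code."""
    failures = report.get("failures", [])
    by_code = Counter((f.get("status"), f.get("code")) for f in failures)
    return {
        "requested": report.get("requested", 0),
        "downloaded": report.get("downloaded", 0),
        "failed_track_ids": [f["track_id"] for f in failures],
        "by_status_code": {f"{s}:{c}": n for (s, c), n in by_code.items()},
    }


# Trimmed sample in the same shape as the committed report (illustrative only).
sample = json.loads("""
{
  "requested": 8,
  "downloaded": 0,
  "failures": [
    {"track_id": 2, "status": "http_error", "code": 403},
    {"track_id": 5, "status": "http_error", "code": 403}
  ]
}
""")

summary = summarize_failures(sample)
print(json.dumps(summary, indent=2))
```

To use it on the real file, one would pass `json.loads(Path(report_path).read_text())` instead of the inline sample.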
#!/usr/bin/env python3
"""Download a bounded real FMA subset through yt-dlp when direct archive URLs are unavailable."""
from __future__ import annotations

import argparse
import json
import shutil
import subprocess
from pathlib import Path

DEFAULT_TRACK_IDS = [2, 5, 10, 20, 26, 30, 46, 48]
FMA_TRACK_URL = "https://freemusicarchive.org/music/track/{track_id}"


def ensure_ytdlp() -> str:
    """Return the yt-dlp executable path, or exit with a structured `blocked` JSON payload."""
    path = shutil.which("yt-dlp")
    if not path:
        raise SystemExit(json.dumps({
            "status": "blocked",
            "reason": "yt_dlp_missing",
            "recommendation": "Install yt-dlp or provide local FMA audio manually into data/raw/fma_small_audio",
        }, indent=2, ensure_ascii=False))
    return path


def fetch_one(track_id: int, output_dir: Path, ytdlp: str, overwrite: bool = False) -> dict:
    """Fetch one track page through yt-dlp and return a per-track result record."""
    outtmpl = str(output_dir / "%(id)s.%(ext)s")
    url = FMA_TRACK_URL.format(track_id=track_id)
    cmd = [
        ytdlp,
        "--no-playlist",
        "-o", outtmpl,
    ]
    if not overwrite:
        cmd.append("--no-overwrites")
    cmd.append(url)
    proc = subprocess.run(cmd, text=True, capture_output=True)
    return {
        "track_id": track_id,
        "url": url,
        "status": "downloaded" if proc.returncode == 0 else "failed",
        "returncode": proc.returncode,
        # Keep only the tail of each stream so the report stays bounded.
        "stdout": proc.stdout[-1200:],
        "stderr": proc.stderr[-1200:],
    }


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", default="data/raw/fma_small_audio")
    parser.add_argument("--track-ids", nargs="*", type=int, default=DEFAULT_TRACK_IDS)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--report", default=None)
    args = parser.parse_args()

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    ytdlp = ensure_ytdlp()
    results = [fetch_one(track_id, output_dir, ytdlp, overwrite=args.overwrite) for track_id in args.track_ids]
    summary = {
        "output_dir": str(output_dir.resolve()),
        "requested": len(args.track_ids),
        "downloaded": sum(1 for x in results if x["status"] == "downloaded"),
        "failed": sum(1 for x in results if x["status"] != "downloaded"),
        "results": results,
    }
    text = json.dumps(summary, indent=2, ensure_ascii=False)
    if args.report:
        report_path = Path(args.report)
        report_path.parent.mkdir(parents=True, exist_ok=True)
        # ensure_ascii=False can emit non-ASCII text, so write the report as UTF-8 explicitly.
        report_path.write_text(text, encoding="utf-8")
    print(text)


if __name__ == "__main__":
    main()
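
The committed report was produced by an earlier direct-link variant of this script that is not included in the diff. The sketch below is an assumption-laden reconstruction of that probe, not the original code: `direct_url` and `probe` are hypothetical names, and the rule that the directory equals the first three digits of the zero-padded track ID is inferred from the URLs in the report (all of which fall under `000/`).

```python
import urllib.error
import urllib.request

# Base path copied from the URLs recorded in the report.
DIRECT_BASE = "https://files.freemusicarchive.org/storage-freemusicarchive-org/music"


def direct_url(track_id: int) -> str:
    """Build a legacy direct-file URL for a track ID."""
    # Assumption: directory = first three digits of the six-digit, zero-padded ID.
    name = f"{track_id:06d}"
    return f"{DIRECT_BASE}/{name[:3]}/{name}.mp3"


def probe(track_id: int, timeout: float = 10.0) -> dict:
    """Send a HEAD request and record the outcome in the report's record shape."""
    url = direct_url(track_id)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {"track_id": track_id, "status": "ok", "code": response.status, "url": url}
    except urllib.error.HTTPError as exc:
        # A 403 lands here; this matches the "http_error" entries in the report.
        return {"track_id": track_id, "status": "http_error", "code": exc.code, "url": url}
    except OSError as exc:
        # URLError and timeouts are OSError subclasses (DNS failure, refused connection, ...).
        return {"track_id": track_id, "status": "network_error", "reason": str(exc), "url": url}


print(direct_url(2))
```

Calling `probe(2)` requires network access, so the example only prints the constructed URL. A probe like this is a cheap way to re-check whether the direct links still return `403` before replacing the yt-dlp path.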
......@@ -223,6 +223,27 @@
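### Stage: FMA real-download scaffold
Completed:
- Added [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py)
- First verified the legacy direct FMA file-link fetch path
- Then switched to a page-level fetch scaffold that explicitly reports the blocking reason
- Recorded the current real-FMA download status in:
  - [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
  - [docs/session-handoff.md](./session-handoff.md)
Verification results:
- `/usr/local/miniconda3/bin/python scripts/fetch_fma_subset.py --report reports/fma_fetch_subset_report.json` was run for two verification rounds
  - Round 1: all 8 track IDs returned `HTTP 403`
  - Round 2: `yt-dlp not found`; the script returned a structured `blocked` JSON payload
Conclusions:
- An automation entry point for real FMA downloads now exists
- The current environment still lacks a stable, usable download channel, so real FMA data cannot yet be claimed as successfully landed
- This blocker is recorded explicitly in the handoff docs so that new sessions do not hit the same pitfall again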
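
A caller that wraps the script can detect the structured blocker instead of treating it as an opaque failure. When Python exits through `SystemExit` with a string argument, it prints that string to stderr and exits with status 1, so the payload arrives on stderr. The following is a minimal sketch under that assumption; `parse_blocked` is a hypothetical helper, and the sample payload mirrors the one in `ensure_ytdlp`.

```python
import json
from typing import Optional


def parse_blocked(stderr_text: str) -> Optional[dict]:
    """Return the blocker payload if stderr_text is a `blocked` JSON object, else None."""
    try:
        payload = json.loads(stderr_text)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and payload.get("status") == "blocked":
        return payload
    return None


# Sample stderr in the shape the script emits when yt-dlp is missing.
sample_stderr = json.dumps({
    "status": "blocked",
    "reason": "yt_dlp_missing",
    "recommendation": "Install yt-dlp or provide local FMA audio manually into data/raw/fma_small_audio",
}, indent=2) + "\n"

blocked = parse_blocked(sample_stderr)
if blocked is not None:
    print(f"blocked: {blocked['reason']}")
```

In practice `stderr_text` would come from `subprocess.run(..., capture_output=True, text=True).stderr` when the return code is non-zero.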
### Stage: Raw open-data LFS governance
Completed:
......
......@@ -276,6 +276,7 @@
- [docs/session-handoff.md](./session-handoff.md)
- [docs/current-capability-map.md](./current-capability-map.md)
- [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
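- An FMA real-subset download scaffold exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py); the latest verification found `403` on the legacy direct links and `yt-dlp` missing in the current environment
- Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
- Or view the latest persisted snapshot directly: `acr-engine/.omx/latest_status_snapshot.json`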
......