Make data onboarding and long FMA transfer supervision easier to sustain
Constraint: The user needs detailed data-format guidance now, while the real FMA archive transfer still requires durable hands-off supervision across long sessions
Rejected: Treat documentation and download-watch work as separate later tasks | Would leave either user guidance or transfer resilience lagging behind active development
Confidence: high
Scope-risk: narrow
Directive: Keep the new training-data/pgvector guide aligned with actual manifest fields and use watch_fma_download.py as the first-line long-transfer watchdog
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/scripts/watch_fma_download.py; /usr/local/miniconda3/bin/python acr-engine/scripts/watch_fma_download.py --cycles 2 --interval 2; /usr/local/miniconda3/bin/python acr-engine/scripts/prepare_fma_archive.py inspect
Not-tested: Full archive completion, extraction, and real-data smoke remain pending
Showing 4 changed files with 642 additions and 0 deletions
acr-engine/scripts/watch_fma_download.py
0 → 100755
| 1 | #!/usr/bin/env python3 | ||
| 2 | """Ensure the FMA archive download keeps running; restart if stalled or dead.""" | ||
| 3 | |||
| 4 | from __future__ import annotations | ||
| 5 | |||
| 6 | import argparse | ||
| 7 | import json | ||
| 8 | import subprocess | ||
| 9 | import time | ||
| 10 | from pathlib import Path | ||
| 11 | |||
| 12 | PYTHON = "/usr/local/miniconda3/bin/python" | ||
| 13 | INSPECT = [PYTHON, "scripts/prepare_fma_archive.py", "inspect"] | ||
| 14 | BG = [PYTHON, "scripts/prepare_fma_archive.py", "bg-download"] | ||
| 15 | DEFAULT_LOG = Path("/tmp/fma_modelscope_watch.log") | ||
| 16 | |||
| 17 | |||
| 18 | def inspect() -> dict: | ||
| 19 |     out = subprocess.check_output(INSPECT, text=True) | ||
| 20 |     return json.loads(out) | ||
| 21 | |||
| 22 | |||
| 23 | def bg_download() -> dict: | ||
| 24 |     out = subprocess.check_output(BG, text=True) | ||
| 25 |     return json.loads(out) | ||
| 26 | |||
| 27 | |||
| 28 | def has_live_curl() -> bool: | ||
| 29 |     proc = subprocess.run( | ||
| 30 |         ["bash", "-lc", "ps -ef | grep 'fma_small.zip' | grep -v grep >/dev/null"], | ||
| 31 |         capture_output=True, | ||
| 32 |         text=True, | ||
| 33 |     ) | ||
| 34 |     return proc.returncode == 0 | ||
| 35 | |||
| 36 | |||
| 37 | def main(): | ||
| 38 |     parser = argparse.ArgumentParser() | ||
| 39 |     parser.add_argument("--interval", type=float, default=5.0) | ||
| 40 |     parser.add_argument("--cycles", type=int, default=3) | ||
| 41 |     parser.add_argument("--log-path", default=str(DEFAULT_LOG)) | ||
| 42 |     args = parser.parse_args() | ||
| 43 | |||
| 44 |     log_path = Path(args.log_path) | ||
| 45 |     log_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 46 |     events = [] | ||
| 47 | |||
| 48 |     previous_size = None | ||
| 49 |     for _ in range(args.cycles): | ||
| 50 |         snapshot = inspect() | ||
| 51 |         size = int(snapshot["archive_size"]) | ||
| 52 |         live = has_live_curl() | ||
| 53 |         restarted = None | ||
| 54 |         if (previous_size is not None and size <= previous_size) or not live: | ||
| 55 |             restarted = bg_download() | ||
| 56 |             time.sleep(2) | ||
| 57 |             snapshot = inspect() | ||
| 58 |             size = int(snapshot["archive_size"]) | ||
| 59 |             live = has_live_curl() | ||
| 60 |         event = { | ||
| 61 |             "snapshot": snapshot, | ||
| 62 |             "live_curl": live, | ||
| 63 |             "restarted": restarted, | ||
| 64 |         } | ||
| 65 |         events.append(event) | ||
| 66 |         previous_size = size | ||
| 67 |         time.sleep(args.interval) | ||
| 68 | |||
| 69 |     text = json.dumps({"status": "ok", "events": events}, indent=2, ensure_ascii=False) | ||
| 70 |     log_path.write_text(text) | ||
| 71 |     print(text) | ||
| 72 | |||
| 73 | |||
| 74 | if __name__ == "__main__": | ||
| 75 |     main() |
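The restart rule inside the watcher loop can be pulled out into a pure function, which makes it testable without spawning any process. The following is a minimal sketch, not part of the committed script: `should_restart` is a hypothetical helper, and the completion guard assumes that the `archive_progress_percent` field reported by `inspect` reaches 100 when the archive is complete.

```python
from __future__ import annotations


def should_restart(
    previous_size: int | None,
    size: int,
    live: bool,
    progress_percent: float | None = None,
) -> bool:
    """Return True when the download looks stalled or dead.

    Hypothetical helper mirroring the watcher's rule. The completion
    guard is an assumption: without it, a finished archive (size no
    longer growing, no live process) would be restarted every cycle.
    """
    if progress_percent is not None and progress_percent >= 100.0:
        return False  # assumed: a complete archive needs no restart
    stalled = previous_size is not None and size <= previous_size
    return stalled or not live
```

The first cycle has no previous size, so only the liveness check can trigger a restart there, which matches the behavior of the loop above.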
| ... | @@ -232,6 +232,54 @@ | ... | @@ -232,6 +232,54 @@ |
| 232 | 232 | ||
| 233 | 233 | ||
| 234 | 234 | ||
| 235 | |||
| 236 | |||
| 237 | ### Stage: FMA download auto-watchdog | ||
| 238 | |||
| 239 | Completed: | ||
| 240 | - Added [acr-engine/scripts/watch_fma_download.py](../acr-engine/scripts/watch_fma_download.py) | ||
| 241 | - On each cycle, the watcher: | ||
| 242 |   - inspects the current download progress | ||
| 243 |   - checks whether the curl process is still alive | ||
| 244 |   - re-triggers `bg-download` automatically if the download has stalled or the process has disappeared | ||
| 245 | |||
| 246 | Verification: | ||
| 247 | - `/usr/local/miniconda3/bin/python -m py_compile scripts/watch_fma_download.py` succeeded | ||
| 248 | - `/usr/local/miniconda3/bin/python scripts/watch_fma_download.py --cycles 2 --interval 2` succeeded | ||
| 249 | - The watcher observed: | ||
| 250 |   - cycle 1: `archive_size=513179648` | ||
| 251 |   - cycle 2: `archive_size=516587520` | ||
| 252 |   - `live_curl=true` in both cycles | ||
| 253 |   - `restarted=null` in both cycles (the download was progressing normally, so no restart was needed) | ||
| 254 | - Latest inspect: | ||
| 255 |   - `archive_size=522387456` | ||
| 256 |   - `archive_progress_percent=6.8023` | ||
| 257 | |||
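As a worked check on the figures above, the reported size and progress percentage imply the total archive size and the amount still to download. This is a back-of-the-envelope sketch; `estimate_total_bytes` is a hypothetical helper, and it assumes the percentage is simply `archive_size / total * 100`.

```python
def estimate_total_bytes(archive_size: int, progress_percent: float) -> float:
    """Infer the full archive size from a partial size and its percentage."""
    if progress_percent <= 0:
        raise ValueError("progress_percent must be positive")
    return archive_size / (progress_percent / 100.0)


# Values taken from the latest inspect output quoted above.
total = estimate_total_bytes(522_387_456, 6.8023)
remaining = total - 522_387_456
print(f"estimated total: {total / 1e9:.2f} GB, remaining: {remaining / 1e9:.2f} GB")
```

The estimate comes out at a little under 7.7 GB in total, so a watchdog has to cover a transfer that is still more than 90% unfinished.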
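| 258 | Conclusion: | ||
| 259 | - The real FMA download can now be resumed manually and also has basic automatic watchdog coverage | ||
| 260 | - Long-running downloads are more resilient to stalls and dead processes | ||
| 261 | |||
| 262 | ### Stage: Dedicated training-data and pgvector documentation | ||
| 263 | |||
| 264 | Completed: | ||
| 265 | - Added [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 266 | - It covers, in detail and in one place: | ||
| 267 |   - what format the current training data and input data should take | ||
| 268 |   - how the `reference` and `query` roles differ | ||
| 269 |   - how to turn BGM, phone recordings, and livestream screen recordings into training data | ||
| 270 |   - label recommendations for `song_id` / `type` / `offset` / `segment_type` and related fields | ||
| 271 |   - the recommended table structure and column design for a future `pgvector` integration | ||
| 272 | - Linked the document from [docs/README.md](./README.md) and [docs/session-handoff.md](./session-handoff.md) | ||
| 273 | |||
| 274 | Verification: | ||
| 275 | - The document content matches the current code behavior in: | ||
| 276 |   - `acr-engine/src/data/dataset.py` | ||
| 277 |   - `acr-engine/src/data/manifest_tools.py` | ||
| 278 | - Added a complete layered description of the path "raw audio -> standardized assets -> manifest -> pgvector" | ||
| 279 | |||
| 280 | Conclusion: | ||
| 281 | - The project now has a standalone, hand-off-ready, actionable document covering both what training data should look like and how to connect pgvector later | ||
| 282 | |||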
| 235 | ### Stage: FMA background resume recovery | 283 | ### Stage: FMA background resume recovery |
| 236 | 284 | ||
| 237 | Completed: | 285 | Completed: | ... | ... |
| ... | @@ -275,6 +275,7 @@ | ... | @@ -275,6 +275,7 @@ |
| 275 | - [docs/open-dataset-workflow.md](./open-dataset-workflow.md) | 275 | - [docs/open-dataset-workflow.md](./open-dataset-workflow.md) |
| 276 | - [docs/session-handoff.md](./session-handoff.md) | 276 | - [docs/session-handoff.md](./session-handoff.md) |
| 277 | - [docs/current-capability-map.md](./current-capability-map.md) | 277 | - [docs/current-capability-map.md](./current-capability-map.md) |
| 278 | - [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 278 | - [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md) | 279 | - [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md) |
| 279 | - A download scaffold for the real FMA subset already exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py); the most recent check found that the old direct link returns `403` and the historical page-level URL returns `404`, but `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` was verified to return `200 OK` and to support range requests | 280 | - A download scaffold for the real FMA subset already exists: [acr-engine/scripts/fetch_fma_subset.py](../acr-engine/scripts/fetch_fma_subset.py); the most recent check found that the old direct link returns `403` and the historical page-level URL returns `404`, but `https://modelscope.cn/datasets/pengzhendong/fma/resolve/master/fma_small.zip` was verified to return `200 OK` and to support range requests |
| 280 | - Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py) | 281 | - Run [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py) | ... | ... |
docs/training-data-and-pgvector-guide.md
0 → 100644
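| 1 | # Training Data, Input Format, and pgvector Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | To turn this ACR project into stable, sustainable data engineering, split the data into four layers: | ||
| 8 | |||
| 9 | 1. **Raw audio layer**: BGM, song masters, recordings, livestream clips, phone recordings | ||
| 10 | 2. **Standardized audio asset layer**: trainable audio with a unified sample rate, channel layout, and file naming | ||
| 11 | 3. **Manifest layer**: `catalog.json` / `train.json` / `test.json` / `val.json` | ||
| 12 | 4. **Vector index layer**: embeddings, metadata, and the pgvector/ANN retrieval store | ||
| 13 | |||
| 14 | What training actually consumes today is not "an arbitrary folder of audio files" but: | ||
| 15 | |||
| 16 | - audio files | ||
| 17 | - the accompanying manifests | ||
| 18 | - `song_id`-level labels | ||
| 19 | - a query/reference role split | ||
| 20 | |||
| 21 | When `pgvector` is added later, do not put raw audio directly into the database. Instead: | ||
| 22 | |||
| 23 | - keep audio on the file system or in object storage | ||
| 24 | - store metadata in PostgreSQL | ||
| 25 | - store embedding vectors in `pgvector` | ||
| 26 | - keep `song_id` / `segment_id` / `type` / `offset` as structured columns | ||
| 27 | |||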
| 28 | --- | ||
| 29 | |||
| 30 | ## 1. Layered structure | ||
| 31 | |||
| 32 | ```mermaid | ||
| 33 | flowchart LR | ||
| 34 | A[Raw BGM / recordings / songs] --> B[Standardized audio assets] | ||
| 35 | B --> C[Reference Catalog] | ||
| 36 | B --> D[Query Segments] | ||
| 37 | C --> E[Manifest JSON] | ||
| 38 | D --> E | ||
| 39 | E --> F[Training / evaluation] | ||
| 40 | C --> G[Embedding / Fingerprint index] | ||
| 41 | G --> H[pgvector / retrieval service] | ||
| 42 | ``` | ||
| 43 | |||
| 44 | --- | ||
| 45 | |||
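| 46 | ## 2. What the current code actually accepts | ||
| 47 | |||
| 48 | ### 2.1 Audio-layer requirements | ||
| 49 | |||
| 50 | In the current code, the core loading logic is: | ||
| 51 | |||
| 52 | - `librosa.load(..., sr=16000, mono=True)` | ||
| 53 | - Default training clip length: `5s` | ||
| 54 | - Recommended query length for external data: `8s` | ||
| 55 | - Spectral input: Mel spectrogram | ||
| 56 | - Target input layer in the current docs: **128 Mel bands** | ||
| 57 | |||
| 58 | Raw assets should therefore be standardized first to: | ||
| 59 | |||
| 60 | | Item | Recommended value | Notes | | ||
| 61 | |---|---|---| | ||
| 62 | | File type | `.wav` / `.mp3` / `.flac` / `.ogg` | The current conversion tool supports these extensions | | ||
| 63 | | Sample rate | `16k` | Audio is resampled to 16 kHz when loaded for training/evaluation | | ||
| 64 | | Channels | mono | The current pipeline loads audio as mono | | ||
| 65 | | Reference duration | Full track where possible | Used to build the index | | ||
| 66 | | Query duration | `5s` or `8s` | 5s is typical for training; 8s is typical for queries cut from open datasets | | ||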
| 67 | |||
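To make the audio-layer numbers concrete, the sketch below works out the sample count and Mel-spectrogram shape for a 5s and an 8s clip. It uses only arithmetic, not the project's loader. The hop length of 512 is an assumption (it is librosa's default for `melspectrogram`); the project's actual value is not stated here and may differ, as may its framing mode.

```python
from __future__ import annotations

SAMPLE_RATE = 16_000  # from the guide: audio is loaded at 16 kHz
N_MELS = 128          # from the guide: 128 Mel bands
HOP_LENGTH = 512      # ASSUMPTION: librosa's default hop length


def mel_shape(duration_sec: float, hop_length: int = HOP_LENGTH) -> tuple[int, int]:
    """Return (n_mels, n_frames) for a mono clip, assuming centered framing."""
    n_samples = int(round(duration_sec * SAMPLE_RATE))
    # With centered (padded) framing there is one frame per hop, plus one.
    n_frames = 1 + n_samples // hop_length
    return (N_MELS, n_frames)


for seconds in (5.0, 8.0):
    print(seconds, int(seconds * SAMPLE_RATE), mel_shape(seconds))
```

Under these assumptions a 5s clip is 80,000 samples and an 8s clip is 128,000 samples, so mixing the two lengths in one batch requires padding or cropping along the frame axis.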
| 68 | ### 2.2 Manifest-layer requirements | ||
| 69 | |||
| 70 | The key fields the project actually uses: | ||
| 71 | |||
| 72 | | Field | Required | Purpose | | ||
| 73 | |---|---|---| | ||
| 74 | | `song_id` | Yes | Primary label, shared by every query/reference of the same song | | ||
| 75 | | `audio_path` | Yes | Relative path to the audio file | | ||
| 76 | | `duration` | Yes | Duration; drives clip cutting and validity checks | | ||
| 77 | | `type` | Yes | `reference` / `clean` / `augmented` / `confused` / `humming_like` | | ||
| 78 | | `offset` | Recommended | Start offset of the query within the original track | | ||
| 79 | | `segment_type` | Recommended | `intro` / `mid` / `outro` / `external_query` | | ||
| 80 | | `source_dataset` | Recommended | Marks where the data came from | | ||
| 81 | |||
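The field table above translates directly into a small validation pass. The following is a minimal sketch, not the project's own validator: `validate_entry` is a hypothetical helper, and the only rules it enforces are the required fields and `type` values listed in the table plus a positive `duration`.

```python
from __future__ import annotations

REQUIRED_FIELDS = ("song_id", "audio_path", "duration", "type")
KNOWN_TYPES = {"reference", "clean", "augmented", "confused", "humming_like"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one manifest entry (empty if none)."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in entry
    ]
    if "type" in entry and entry["type"] not in KNOWN_TYPES:
        errors.append(f"unknown type: {entry['type']!r}")
    duration = entry.get("duration")
    if duration is not None and not (
        isinstance(duration, (int, float)) and duration > 0
    ):
        errors.append("duration must be a positive number")
    return errors
```

Returning a list rather than raising on the first problem makes it easier to report every issue in a large manifest in a single pass.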
| 82 | --- | ||
| 83 | |||
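| 84 | ## 3. What training data should actually look like | ||
| 85 | |||
| 86 | ### 3.1 Reference data | ||
| 87 | |||
| 88 | A reference is the canonical copy of a track in the catalog and is used to build the index. | ||
| 89 | |||
| 90 | Example: | ||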
| 91 | |||
| 92 | ```json | ||
| 93 | { | ||
| 94 | "song_id": "song_0001", | ||
| 95 | "audio_path": "audio/song_0001.wav", | ||
| 96 | "duration": 183.4, | ||
| 97 | "type": "reference", | ||
| 98 | "source_dataset": "internal_bgm" | ||
| 99 | } | ||
| 100 | ``` | ||
| 101 | |||
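| 102 | Requirements: | ||
| 103 | |||
| 104 | - Each song has at least one reference | ||
| 105 | - A reference is usually the full track or a long primary version | ||
| 106 | - Avoid using noisy, heavily reverberant, or ambient recordings as the primary reference asset | ||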
| 107 | |||
| 108 | --- | ||
| 109 | |||
| 110 | ### 3.2 Query data | ||
| 111 | |||
| 112 | A query is an input sample that the system is asked to identify. | ||
| 113 | |||
| 114 | Example: | ||
| 115 | |||
| 116 | ```json | ||
| 117 | { | ||
| 118 | "song_id": "song_0001", | ||
| 119 | "audio_path": "audio/song_0001_query_03.wav", | ||
| 120 | "duration": 5.0, | ||
| 121 | "type": "clean", | ||
| 122 | "offset": 63.2, | ||
| 123 | "segment_type": "mid", | ||
| 124 | "source_dataset": "internal_bgm" | ||
| 125 | } | ||
| 126 | ``` | ||
| 127 | |||
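| 128 | A query should contain one of: | ||
| 129 | |||
| 130 | - a 5s/8s clip cut from the original track | ||
| 131 | - a synthetically degraded clip | ||
| 132 | - a clip recorded in the real world | ||
| 133 | |||
| 134 | #### Common query types | ||
| 135 | |||
| 136 | | type | Meaning | Typical use | | ||
| 137 | |---|---|---| | ||
| 138 | | `clean` | Clip cut directly from the original track | Mainstay of standard training/evaluation | | ||
| 139 | | `augmented` | Clip with added noise, reverb, compression, or EQ | Improves robustness | | ||
| 140 | | `confused` | Clip that closely resembles other music and is easy to misidentify | Hard-example reinforcement | | ||
| 141 | | `humming_like` | Approximate melody, sparse instrumentation, phone-recording style | Melody/humming-style approximate queries | | ||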
| 142 | |||
| 143 | --- | ||
| 144 | |||
| 145 | ## 4. Turning BGM and music recordings into training data | ||
| 146 | |||
| 147 | ### 4.1 Scenario A: you have a batch of standard BGM masters or finished tracks | ||
| 148 | |||
| 149 | This is the easiest case. | ||
| 150 | |||
| 151 | #### Recommended approach | ||
| 152 | |||
| 153 | 1. Use each full BGM track as a `reference` | ||
| 154 | 2. Cut several query clips at random from the same track | ||
| 155 | 3. Give those clips the same `song_id` | ||
| 156 | 4. Assign the queries to `train/test/val` according to their intended use | ||
| 157 | |||
| 158 | #### Resulting structure | ||
| 159 | |||
| 160 | ```mermaid | ||
| 161 | flowchart TD | ||
| 162 | A[Full BGM track] --> B[reference] | ||
| 163 | A --> C1[query 1] | ||
| 164 | A --> C2[query 2] | ||
| 165 | A --> C3[query 3] | ||
| 166 | C1 --> D[train/test] | ||
| 167 | C2 --> D | ||
| 168 | C3 --> D | ||
| 169 | ``` | ||
| 170 | |||
| 171 | #### Key points | ||
| 172 | |||
| 173 | - `song_id` must be stable | ||
| 174 | - Keep track versions apart: do not treat the main version, instrumental version, short edit, and rearrangement as the same song by mistake | ||
| 175 | - If versions differ substantially, give them separate `song_id` values | ||
| 176 | |||
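The four steps above can be sketched as manifest generation alone, without touching audio. This is a minimal illustration, not the project's `manifest_tools.py`: `make_entries` and its query file-naming pattern are hypothetical, and the actual cutting of audio files is left to whatever tool writes the clips.

```python
from __future__ import annotations

import random


def make_entries(
    song_id: str,
    audio_path: str,
    duration: float,
    n_queries: int = 3,
    clip_sec: float = 5.0,
    source_dataset: str = "internal_bgm",
    seed: int = 0,
) -> tuple[dict, list[dict]]:
    """Build one reference entry and n random clean-query entries."""
    reference = {
        "song_id": song_id,
        "audio_path": audio_path,
        "duration": duration,
        "type": "reference",
        "source_dataset": source_dataset,
    }
    max_offset = duration - clip_sec
    if max_offset < 0:
        return reference, []  # track is shorter than one clip
    rng = random.Random(seed)  # seeded so splits are reproducible
    queries = []
    for i in range(n_queries):
        queries.append({
            "song_id": song_id,  # shared with the reference
            # Hypothetical naming pattern for the cut clip.
            "audio_path": f"audio/{song_id}_query_{i:02d}.wav",
            "duration": clip_sec,
            "type": "clean",
            "offset": rng.uniform(0.0, max_offset),
            "source_dataset": source_dataset,
        })
    return reference, queries
```

Seeding the generator per song keeps the query offsets stable across reruns, which helps when train/test membership must not drift between manifest builds.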
| 177 | --- | ||
| 178 | |||
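| 179 | ### 4.2 Scenario B: you have phone recordings, ambient recordings, or livestream screen recordings | ||
| 180 | |||
| 181 | Do not use this kind of data wholesale as the primary reference asset; it is better suited to query or hard-case samples. | ||
| 182 | |||
| 183 | #### Recommended roles | ||
| 184 | |||
| 185 | | Asset | Role | | ||
| 186 | |---|---| | ||
| 187 | | Clean master / official audio source | reference | | ||
| 188 | | Phone recording / ambient recording | query | | ||
| 189 | | Clip captured from a livestream | query | | ||
| 190 | | Short clip with background noise | query | | ||
| 191 | |||
| 192 | #### Labeling recommendations | ||
| 193 | |||
| 194 | If the recording can be matched unambiguously to a reference: | ||
| 195 | |||
| 196 | ```json | ||
| 197 | { | ||
| 198 | "song_id": "song_0001", | ||
| 199 | "audio_path": "queries/phone/song_0001_live_01.wav", | ||
| 200 | "duration": 5.0, | ||
| 201 | "type": "augmented", | ||
| 202 | "segment_type": "external_query", | ||
| 203 | "source_dataset": "phone_recording" | ||
| 204 | } | ||
| 205 | ``` | ||
| 206 | |||
| 207 | If the recording is very noisy, with the melodic contour still recognizable but the timbre badly distorted: | ||
| 208 | |||
| 209 | - label it `type=humming_like` | ||
| 210 | - or add an extended type of your own, such as `field_recording` | ||
| 211 | |||
| 212 | Note, however: | ||
| 213 | - Before training, map any extended type back to a type the current system already knows, so the code never meets a value it does not recognize | ||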
| 214 | |||
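The mapping step mentioned in the note can be sketched as a small normalization pass run just before training. Everything here is illustrative: `normalize_type`, the `TYPE_FALLBACKS` table, and the `original_type` field are hypothetical names, and the choice to map `field_recording` to `humming_like` simply follows the suggestion above.

```python
KNOWN_TYPES = {"reference", "clean", "augmented", "confused", "humming_like"}

# Hypothetical mapping from project-specific extended types to known types.
TYPE_FALLBACKS = {"field_recording": "humming_like"}


def normalize_type(entry: dict) -> dict:
    """Return a copy of entry whose type is one the training code knows."""
    raw = entry["type"]
    if raw in KNOWN_TYPES:
        return dict(entry)
    if raw not in TYPE_FALLBACKS:
        raise ValueError(f"no fallback defined for type {raw!r}")
    normalized = dict(entry)
    normalized["original_type"] = raw  # keep the finer label for later analysis
    normalized["type"] = TYPE_FALLBACKS[raw]
    return normalized
```

Keeping the original label in a separate field means the finer distinction is not lost and can still be used when analyzing errors by recording type.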
| 215 | --- | ||
| 216 | |||
| 217 | ### 4.3 Scenario C: you only have folders of audio files with no careful labeling yet | ||
| 218 | |||
| 219 | A weakly supervised standardization pass is better than none. | ||
| 220 | |||
| 221 | #### Minimum viable approach | ||
| 222 | |||
| 223 | - One file = one `song_id` | ||
| 224 | - The whole file = `reference` | ||
| 225 | - Random clips cut from the whole file = `clean` queries | ||
| 226 | - Fix mislabels, duplicate names, and version conflicts by hand later | ||
| 227 | |||
| 228 | This is also what the following tools currently do: | ||
| 229 | |||
| 230 | - `manifest_tools.py audio-dir-to-splits` | ||
| 231 | - `external_adapters.py prepare-local` | ||
| 234 | |||
| 235 | #### Where this applies | ||
| 236 | |||
| 237 | - FMA | ||
| 238 | - MTG-Jamendo | ||
| 239 | - A local folder of BGM files | ||
| 240 | - Any data with known song-level identity but no segment-level annotation | ||
| 241 | |||
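The "one file = one `song_id`" rule can be sketched with a directory scan. This is a simplified illustration and not the behavior of `audio-dir-to-splits` or `prepare-local`, whose exact logic is not reproduced here: `weak_reference_entries` is a hypothetical helper, the `song_id` naming scheme is an assumption, and `duration` is deliberately left out because filling it in requires decoding the audio.

```python
from __future__ import annotations

from pathlib import Path

# Extensions the guide lists as supported by the conversion tool.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}


def weak_reference_entries(audio_dir: str, source_dataset: str) -> list[dict]:
    """Create one reference entry per audio file found under audio_dir."""
    root = Path(audio_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES:
            entries.append({
                # Assumed scheme: dataset prefix plus the file stem.
                "song_id": f"{source_dataset}_{path.stem}",
                "audio_path": path.relative_to(root).as_posix(),
                "type": "reference",
                "source_dataset": source_dataset,
                # "duration" must be added after decoding the file.
            })
    return entries
```

Two files with the same stem in different subfolders would collide under this naming scheme, which is one of the conflicts the manual clean-up pass is meant to catch.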
| 242 | --- | ||
| 243 | |||
| 244 | ## 5. How to design labels | ||
| 245 | |||
| 246 | ### 5.1 Primary label: `song_id` | ||
| 247 | |||
| 248 | This is the most important label. | ||
| 249 | |||
| 250 | For a given song, all of the following: | ||
| 251 | |||
| 252 | - reference | ||
| 253 | - clean query | ||
| 254 | - phone-recording query | ||
| 255 | - reverb/noise query | ||
| 256 | |||
| 257 | should share the same `song_id`. | ||
| 258 | |||
| 259 | #### Recommended naming | ||
| 260 | |||
| 261 | | Scenario | Recommended form | | ||
| 262 | |---|---| | ||
| 263 | | Internal BGM | `bgm_000001` | | ||
| 264 | | Commercial catalog | `catalog_000001` | | ||
| 265 | | Open-source data | `fma_012345` | | ||
| 266 | | Multi-version projects | `song_0001_vocal` / `song_0001_inst` | | ||
| 267 | |||
| 268 | --- | ||
| 269 | |||
| 270 | ### 5.2 Auxiliary labels | ||
| 271 | |||
| 272 | Also worth keeping: | ||
| 273 | |||
| 274 | | Field | Purpose | | ||
| 275 | |---|---| | ||
| 276 | | `version_id` | Distinguishes original / instrumental / rearranged versions | | ||
| 277 | | `segment_type` | Distinguishes intro / mid / outro | | ||
| 278 | | `recording_type` | Distinguishes studio / phone / live / screen_capture | | ||
| 279 | | `noise_level` | Distinguishes clean / mild / heavy | | ||
| 280 | | `source_dataset` | Preserves provenance for auditing | | ||
| 281 | |||
| 282 | At minimum, the current code already recommends keeping: | ||
| 283 | |||
| 284 | - `type` | ||
| 285 | - `offset` | ||
| 286 | - `segment_type` | ||
| 287 | - `source_dataset` | ||
| 288 | |||
| 289 | --- | ||
| 290 | |||
| 291 | ## 6. Preparing now for a future pgvector integration | ||
| 292 | |||
| 293 | ### 6.1 Do not put raw audio directly into pgvector | ||
| 294 | |||
| 295 | Recommended division of responsibilities: | ||
| 296 | |||
| 297 | | Layer | Where it lives | | ||
| 298 | |---|---| | ||
| 299 | | Raw audio files | Object storage / file system | | ||
| 300 | | manifest / metadata | Regular PostgreSQL tables | | ||
| 301 | | Embedding vectors | `pgvector` column | | ||
| 302 | | Audio fingerprints | PostgreSQL JSON / separate index / files | | ||
| 303 | |||
| 304 | --- | ||
| 305 | |||
| 306 | ### 6.2 Recommended table structure | ||
| 307 | |||
| 308 | ```mermaid | ||
| 309 | flowchart TD | ||
| 310 | A[songs] --> B[segments] | ||
| 311 | A --> C[references] | ||
| 312 | C --> D[reference_embeddings] | ||
| 313 | B --> E[query_embeddings] | ||
| 314 | ``` | ||
| 315 | |||
| 316 | #### songs | ||
| 317 | |||
| 318 | | Column | Description | | ||
| 319 | |---|---| | ||
| 320 | | `song_id` | Primary key / unique business key | | ||
| 321 | | `title` | Track title | | ||
| 322 | | `artist` | Author / performer | | ||
| 323 | | `version_id` | Version identifier | | ||
| 324 | | `source_dataset` | Source | | ||
| 325 | | `license` | License | | ||
| 326 | |||
| 327 | #### references | ||
| 328 | |||
| 329 | | Column | Description | | ||
| 330 | |---|---| | ||
| 331 | | `reference_id` | Primary key | | ||
| 332 | | `song_id` | Foreign key | | ||
| 333 | | `audio_uri` | File path / object-storage address | | ||
| 334 | | `duration_sec` | Duration | | ||
| 335 | | `sample_rate` | Sample rate | | ||
| 336 | |||
| 337 | #### segments | ||
| 338 | |||
| 339 | | Column | Description | | ||
| 340 | |---|---| | ||
| 341 | | `segment_id` | Primary key | | ||
| 342 | | `song_id` | Foreign key | | ||
| 343 | | `audio_uri` | Path to the query file | | ||
| 344 | | `offset_sec` | Start offset | | ||
| 345 | | `duration_sec` | Clip length | | ||
| 346 | | `type` | clean/augmented/confused/humming_like | | ||
| 347 | | `segment_type` | intro/mid/outro/external_query | | ||
| 348 | | `split` | train/test/val | | ||
| 349 | |||
| 350 | #### reference_embeddings / query_embeddings | ||
| 351 | |||
| 352 | | Column | Description | | ||
| 353 | |---|---| | ||
| 354 | | `id` | Primary key | | ||
| 355 | | `song_id` | Foreign key | | ||
| 356 | | `segment_id` / `reference_id` | Foreign key | | ||
| 357 | | `embedding` | `vector(n)` | | ||
| 358 | | `model_version` | Model version | | ||
| 359 | | `created_at` | Time the embedding was generated | | ||
| 360 | |||
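The embedding tables above can be turned into DDL along the following lines. This is a sketch that only builds SQL text and has not been run against a database here. The column types (`TEXT`, `BIGINT`, `BIGSERIAL`, `TIMESTAMPTZ`) are assumptions, as is the helper name `embedding_table_ddl`. One point to watch: `REFERENCES` is a reserved word in PostgreSQL, so a table literally named `references` has to be double-quoted or renamed.

```python
# The pgvector extension must be enabled once per database.
SETUP_SQL = "CREATE EXTENSION IF NOT EXISTS vector;"


def embedding_table_ddl(
    table: str, owner_key: str, owner_table: str, dim: int = 192
) -> str:
    """Build CREATE TABLE text for an embeddings table.

    Identifiers are interpolated directly, so pass only trusted,
    hard-coded names (never user input).
    """
    if dim <= 0:
        raise ValueError("dim must be positive")
    return (
        f"CREATE TABLE {table} (\n"
        f"    id BIGSERIAL PRIMARY KEY,\n"
        f"    song_id TEXT NOT NULL REFERENCES songs (song_id),\n"
        f"    {owner_key} BIGINT NOT NULL REFERENCES {owner_table} ({owner_key}),\n"
        f"    embedding vector({dim}) NOT NULL,\n"
        f"    model_version TEXT NOT NULL,\n"
        f"    created_at TIMESTAMPTZ NOT NULL DEFAULT now()\n"
        f");"
    )


# The quoted name works around the reserved word discussed above.
print(embedding_table_ddl("reference_embeddings", "reference_id", '"references"'))
```

Fixing the dimension in the column type means a model change that alters the embedding size needs a new column or table, which is one reason to record `model_version` alongside every vector.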
| 361 | --- | ||
| 362 | |||
| 363 | ### 6.3 Recommended pgvector practices | ||
| 364 | |||
| 365 | If the current embedding dimension is 192, the column can be declared as: | ||
| 366 | |||
| 367 | ```sql | ||
| 368 | embedding vector(192) | ||
| 369 | ``` | ||
| 370 | |||
| 371 | #### Additional fields worth keeping | ||
| 372 | |||
| 373 | - `model_version` | ||
| 374 | - `data_version` | ||
| 375 | - `index_version` | ||
| 376 | - `source_dataset` | ||
| 377 | - `type` | ||
| 378 | - `offset_sec` | ||
| 379 | - `segment_type` | ||
| 380 | |||
| 381 | These make it possible to: | ||
| 382 | |||
| 383 | - rebuild embeddings per model version | ||
| 384 | - filter candidates by source dataset | ||
| 385 | - analyze misidentifications by query type | ||
| 386 | - apply intro/outro-specific strategies based on `segment_type` | ||
| 387 | |||
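Keeping `model_version` next to each vector lets a lookup restrict itself to embeddings produced by the same model as the query. The sketch below only builds the SQL text and parameter tuple; it assumes a `reference_embeddings` table shaped like the one in section 6.2, `%s` placeholders in the style of psycopg, and pgvector's `<=>` cosine-distance operator. The helper names are hypothetical, and nothing here has been run against a database.

```python
from __future__ import annotations


def to_vector_literal(values: list[float]) -> str:
    """Format floats as pgvector's text form, e.g. [0.5,1.0]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_references_query(
    embedding: list[float], model_version: str, k: int = 5
) -> tuple[str, tuple]:
    """Return (sql, params) for a top-k cosine-distance lookup."""
    literal = to_vector_literal(embedding)
    sql = (
        "SELECT song_id, reference_id, "
        "embedding <=> %s::vector AS distance "
        "FROM reference_embeddings "
        "WHERE model_version = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # The vector appears twice: once for the output column, once for ordering.
    return sql, (literal, model_version, literal, k)
```

Comparing a query vector against embeddings from a different model version gives meaningless distances, so the `WHERE` clause is a correctness guard rather than an optimization.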
| 388 | --- | ||
| 389 | |||
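| 390 | ### 6.4 What the project should prepare for pgvector now | ||
| 391 | |||
| 392 | The first step is not to build the database, but to make sure these fields are well defined: | ||
| 393 | |||
| 394 | 1. `song_id` is stable and unique | ||
| 395 | 2. `audio_path` is traceable | ||
| 396 | 3. `type` is explicit | ||
| 397 | 4. `offset` can be traced back to the original track | ||
| 398 | 5. `source_dataset` is clear | ||
| 399 | 6. Each embedding the model produces maps one-to-one to its metadata | ||
| 400 | |||
| 401 | In other words: | ||
| 402 | |||
| 403 | > Get the manifest design right first, and the later pgvector integration will not require rework. | ||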
| 404 | |||
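Item 6 in the list above, the one-to-one correspondence between embeddings and metadata, is easy to check mechanically before anything is written to a database. The following is a minimal sketch: `pair_embeddings` is a hypothetical helper, and keying embeddings by `audio_path` is an assumption about how the embedding step labels its output.

```python
from __future__ import annotations


def pair_embeddings(
    entries: list[dict], embeddings: dict[str, list[float]], dim: int = 192
) -> list[tuple[dict, list[float]]]:
    """Pair each manifest entry with its embedding, or raise on any mismatch."""
    paths = [entry["audio_path"] for entry in entries]
    if len(set(paths)) != len(paths):
        raise ValueError("duplicate audio_path in manifest")
    missing = sorted(set(paths) - set(embeddings))
    extra = sorted(set(embeddings) - set(paths))
    if missing or extra:
        raise ValueError(f"missing embeddings: {missing}; unmatched embeddings: {extra}")
    rows = []
    for entry in entries:
        vector = embeddings[entry["audio_path"]]
        if len(vector) != dim:
            raise ValueError(
                f"{entry['audio_path']}: expected {dim} dims, got {len(vector)}"
            )
        rows.append((entry, vector))
    return rows
```

Failing loudly at this stage is cheaper than discovering orphaned vectors or wrong-sized embeddings after they have been indexed.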
| 405 | --- | ||
| 406 | |||
| 407 | ## 7. Recommended directory layout and data processing flow | ||
| 408 | |||
| 409 | ### 7.1 Raw layer | ||
| 410 | |||
| 411 | ```text | ||
| 412 | acr-engine/data/raw/ | ||
| 413 | bgm_master/ | ||
| 414 | phone_recordings/ | ||
| 415 | live_captures/ | ||
| 416 | fma_small_audio/ | ||
| 417 | ``` | ||
| 418 | |||
| 419 | ### 7.2 Standardized layer | ||
| 420 | |||
| 421 | ```text | ||
| 422 | acr-engine/data/curated/ | ||
| 423 | my_bgm_v1/ | ||
| 424 | audio/ | ||
| 425 | manifests/ | ||
| 426 | ``` | ||
| 427 | |||
| 428 | ### 7.3 Training/evaluation layer | ||
| 429 | |||
| 430 | ```text | ||
| 431 | catalog.json | ||
| 432 | train.json | ||
| 433 | test.json | ||
| 434 | val.json | ||
| 435 | ``` | ||
| 436 | |||
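The raw and standardized layers above can be created in one step. This is a convenience sketch only: `create_layout` is a hypothetical helper, and the folder names are taken from the examples in sections 7.1 and 7.2.

```python
from __future__ import annotations

from pathlib import Path

# Folder names from the raw-layer example above.
RAW_SUBDIRS = ("bgm_master", "phone_recordings", "live_captures", "fma_small_audio")


def create_layout(data_root: str, dataset_name: str = "my_bgm_v1") -> list[Path]:
    """Create the raw and curated directory skeleton under data_root."""
    root = Path(data_root)
    targets = [root / "raw" / name for name in RAW_SUBDIRS]
    targets += [
        root / "curated" / dataset_name / name for name in ("audio", "manifests")
    ]
    for path in targets:
        # exist_ok makes the call safe to repeat on an existing tree.
        path.mkdir(parents=True, exist_ok=True)
    return targets
```

For the layout shown above, `data_root` would be `acr-engine/data`; the manifest files from section 7.3 then go into the `manifests/` folder of the curated dataset.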
| 437 | --- | ||
| 438 | |||
| 439 | ## 8. Recommended workflow for the current project | ||
| 440 | |||
| 441 | ### 8.1 If you have BGM on hand | ||
| 442 | |||
| 443 | Recommended steps: | ||
| 444 | |||
| 445 | 1. Put every full BGM track into a single directory | ||
| 446 | 2. Apply a consistent naming scheme | ||
| 447 | 3. Run: | ||
| 448 | |||
| 449 | ```bash | ||
| 450 | /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma <your-directory> | ||
| 451 | /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma <your-directory> --output-root data/external_ingested | ||
| 452 | /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests | ||
| 453 | ``` | ||
| 454 | |||
| 455 | If the data is your own BGM, it does not have to be ingested under the `fma` name; a dedicated internal adapter can be added later. | ||
| 456 | |||
| 457 | --- | ||
| 458 | |||
| 459 | ### 8.2 If you have recordings or captured clips | ||
| 460 | |||
| 461 | Recommendations: | ||
| 462 | |||
| 463 | - Do not use these recordings directly as the primary reference library | ||
| 464 | - First map them to the `song_id` of an existing reference | ||
| 465 | - Bring them into training or evaluation as query / hard-case data | ||
| 466 | |||
| 467 | If full manual labeling is not feasible yet: | ||
| 468 | |||
| 469 | - Keep them in a separate directory first | ||
| 470 | - Start with song-level or file-level alignment | ||
| 471 | - Add segment-level annotations incrementally later | ||
| 472 | |||
| 473 | --- | ||
| 474 | |||
| 475 | ## 9. Summary table | ||
| 476 | |||
| 477 | | Data you have | Recommended conversion | Role in the system | | ||
| 478 | |---|---|---| | ||
| 479 | | Full BGM masters | Keep the full track + cut random queries | reference + clean query | | ||
| 480 | | Official song files | Keep the full track + cut clips | reference + clean query | | ||
| 481 | | Phone recordings | Align to an existing `song_id` | augmented / humming_like query | | ||
| 482 | | Livestream screen recordings | Extract the music segments and align them to a `song_id` | external query | | ||
| 483 | | Recordings with background noise | Use as hard cases | confused / augmented | | ||
| 484 | | Open-source bulk datasets | Run `inspect-local`/`prepare-local` first | baseline train/eval corpus | | ||
| 485 | |||
| 486 | --- | ||
| 487 | |||
| 488 | ## 10. Fields worth locking down right away | ||
| 489 | |||
| 490 | If `pgvector` is definitely coming, every sample should at minimum be traceable through: | ||
| 491 | |||
| 492 | - `song_id` | ||
| 493 | - `audio_path` | ||
| 494 | - `duration` | ||
| 495 | - `type` | ||
| 496 | - `offset` | ||
| 497 | - `segment_type` | ||
| 498 | - `source_dataset` | ||
| 499 | - `split` | ||
| 500 | - `model_version` (recorded when the embedding is generated) | ||
| 501 | |||
| 502 | --- | ||
| 503 | |||
| 504 | ## 11. Related documents | ||
| 505 | |||
| 506 | - [dataset-spec.md](./dataset-spec.md) | ||
| 507 | - [open-dataset-workflow.md](./open-dataset-workflow.md) | ||
| 508 | - [dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md) | ||
| 509 | - [session-handoff.md](./session-handoff.md) | ||
| 510 | |||
| 511 | ## Sources | ||
| 512 | |||
| 513 | - Current code behavior from: | ||
| 514 | - [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | ||
| 515 | - [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) | ||
| 516 | - Existing project docs: | ||
| 517 | - [dataset-spec.md](./dataset-spec.md) | ||
| 518 | - [open-dataset-workflow.md](./open-dataset-workflow.md) |