handoff for next session in GPU/CPU inter
Showing 2 changed files with 423 additions and 0 deletions
| ... | @@ -18,3 +18,4 @@ acr-engine/configs/manifests/examples/business_asset_export_real_smoke.csv | ... | @@ -18,3 +18,4 @@ acr-engine/configs/manifests/examples/business_asset_export_real_smoke.csv |
| 18 | acr-engine/__pycache__/ | 18 | acr-engine/__pycache__/ |
| 19 | acr-engine/**/__pycache__/ | 19 | acr-engine/**/__pycache__/ |
| 20 | acr-engine/**/*.pyc | 20 | acr-engine/**/*.pyc |
| 21 | best_model.pt | ... | ... |
docs/coverhunter_handoff.md
0 → 100644
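| 1 | # CoverHunter Training Topic Handoff Document | ||
| 2 | |||
| 3 | ## 1. Current Goals and Status | ||
| 4 | |||
| 5 | The goal of this round was to organize the CoverHunter fine-tuning pipeline into an extensible topic and to complete: | ||
| 6 | |||
| 7 | - Dual-stream architecture implementation | ||
| 8 | - Training configuration cleanup | ||
| 9 | - Lightweight configuration for a 4GB GPU | ||
| 10 | - Automated environment setup | ||
| 11 | - Verification of the CPU fallback path | ||
| 12 | - Topic documentation | ||
| 13 | |||
| 14 | Current conclusions: | ||
| 15 | |||
| 16 | - **Topic documentation is complete** | ||
| 17 | - **The dual-stream training code is largely wired up** | ||
| 18 | - **Environment dependencies were installed automatically** | ||
| 19 | - **The GPU is still unusable** (the driver is incompatible with the current torch/cu130 build) | ||
| 20 | - **The CPU path has been verified deep into the dry-run, but fixes remain unfinished** | ||
| 21 | - Testing needs to continue after switching environments | ||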
| 22 | |||
| 23 | --- | ||
| 24 | |||
| 25 | ## 2. Core Work Completed | ||
| 26 | |||
| 27 | ### 2.1 Training Topic and Process Documentation | ||
| 28 | |||
| 29 | Added/maintained: | ||
| 30 | |||
| 31 | - `docs/coverhunter_finetune_topic.md` | ||
| 32 | - `docs/coverhunter_training_process.md` | ||
| 33 | - `docs/coverhunter_env_setup.md` | ||
| 34 | |||
| 35 | Notes: | ||
| 36 | |||
| 37 | - `coverhunter_finetune_topic.md`: overall topic plan; describes in detail the current audio sources, training plan, stage goals, and intended use of the weights | ||
| 38 | - `coverhunter_training_process.md`: standard training process | ||
| 39 | - `coverhunter_env_setup.md`: environment setup, verification, and blocker notes | ||
| 40 | |||
| 41 | ### 2.2 Dual-Stream Training Architecture | ||
| 42 | |||
| 43 | The code has been reworked along the "dual-stream" design: | ||
| 44 | |||
| 45 | - **Stream A: MERT + melody/chroma branch** | ||
| 46 | - **Stream B: ECAPA branch** | ||
| 47 | - Dual-stream fusion: `DualStreamFusion` | ||
| 48 | - Retrieval head: `CoverHunterHead` | ||
| 49 | - Loss: `InfoNCE + AAMSoftmax` | ||
| 50 | |||
| 51 | Main files: | ||
| 52 | |||
| 53 | - `acr-engine/src/models/ecapa_tdnn.py` | ||
| 54 | - `acr-engine/src/models/losses.py` | ||
| 55 | - `acr-engine/src/data/dataset.py` | ||
| 56 | - `acr-engine/train.py` | ||
| 57 | |||
| 58 | ### 2.3 Data Augmentation | ||
| 59 | |||
| 60 | The augmentations required by the topic have been integrated into `acr-engine/src/utils/augment.py`: | ||
| 61 | |||
| 62 | - Simulated recordings: | ||
| 63 |   - `AddGaussianNoise` | ||
| 64 |   - `AddBackgroundNoise` | ||
| 65 |   - `BandPassFilter` | ||
| 66 |   - `Mp3Compression` | ||
| 67 | - Simulated covers: | ||
| 68 |   - `PitchShift` | ||
| 69 |   - `TimeStretch` | ||
| 70 |   - `Frequency Masking` (applied to the mel spectrogram) | ||
| 71 | |||
| 72 | Graceful degradation was also added for missing optional dependencies, so that the code does not crash outright. | ||
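The degradation described above can be sketched roughly as follows. This is a minimal illustration of an import guard, not the actual contents of `augment.py`; the helper names `load_optional` and `usable_transforms` are hypothetical.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional(module_name, attr_names):
    """Return {name: object or None}; never raises if the module is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Optional dependency absent: log once and mark every transform unavailable.
        logger.warning("Optional module %s unavailable: %s", module_name, exc)
        return {name: None for name in attr_names}
    # The module is present, but individual names may still be missing.
    return {name: getattr(module, name, None) for name in attr_names}


def usable_transforms(available):
    """Keep only the transform names whose class could actually be resolved."""
    return [name for name, obj in available.items() if obj is not None]
```

A caller might pass `"audiomentations"` together with the transform names listed above and build its pipeline only from the names that `usable_transforms` returns.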
| 73 | |||
| 74 | ### 2.4 Training Configuration | ||
| 75 | |||
| 76 | Main configurations at present: | ||
| 77 | |||
| 78 | - `acr-engine/configs/default.yaml` | ||
| 79 | - `acr-engine/configs/coverhunter_finetune.yaml` | ||
| 80 | - `acr-engine/configs/coverhunter_finetune_4gb.yaml` | ||
| 81 | |||
| 82 | Of these: | ||
| 83 | |||
| 84 | - `coverhunter_finetune_4gb.yaml` is a lightweight configuration targeting the **Quadro P1000 4GB** | ||
| 85 | - `acr-engine/scripts/run_coverhunter_finetune.py` now uses this configuration by default | ||
| 86 | |||
| 87 | ### 2.5 Environment Automation | ||
| 88 | |||
| 89 | An environment setup script has been added: | ||
| 90 | |||
| 91 | - `acr-engine/scripts/setup_coverhunter_env.py` | ||
| 92 | |||
| 93 | It has actually been run: | ||
| 94 | |||
| 95 | ```bash | ||
| 96 | /usr/local/miniconda3/bin/python acr-engine/scripts/setup_coverhunter_env.py | ||
| 97 | ``` | ||
| 98 | |||
| 99 | Report file: | ||
| 100 | |||
| 101 | - `acr-engine/reports/coverhunter_env_setup_report.json` | ||
| 102 | |||
| 103 | Dependency installation succeeded, including: | ||
| 104 | |||
| 105 | - `torch` | ||
| 106 | - `torchaudio` | ||
| 107 | - `transformers` | ||
| 108 | - `huggingface_hub` | ||
| 109 | - `librosa` | ||
| 110 | - `soundfile` | ||
| 111 | - `audiomentations` | ||
| 112 | |||
| 113 | --- | ||
| 114 | |||
| 115 | ## 3. Available Audio Sources and Data Assets | ||
| 116 | |||
| 117 | Data directly usable for training and pipeline verification: | ||
| 118 | |||
| 119 | - `acr-engine/data/synthetic_v2/train.json` | ||
| 120 | - `acr-engine/data/synthetic_v2/test.json` | ||
| 121 | - `acr-engine/data/synthetic_v2/segments/*.wav` | ||
| 122 | |||
| 123 | Statistics: | ||
| 124 | |||
| 125 | - Samples: **96** | ||
| 126 | - `song_id` values: **16** | ||
| 127 | - Type distribution: | ||
| 128 |   - `reference`: 16 | ||
| 129 |   - `clean`: 32 | ||
| 130 |   - `augmented`: 16 | ||
| 131 |   - `humming_like`: 16 | ||
| 132 |   - `confused`: 16 | ||
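Statistics like these could be recomputed with a short script along the following lines. It is only a sketch: it assumes the manifest is a JSON list of records carrying `song_id` and `type` fields, which the handoff document does not confirm, and the function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(records):
    """Count samples, distinct songs, and samples per type.

    Assumes each record is a dict with 'song_id' and 'type' keys
    (an assumed schema, not taken from the real manifest).
    """
    types = Counter(record["type"] for record in records)
    songs = {record["song_id"] for record in records}
    return {"samples": len(records), "songs": len(songs), "types": dict(types)}


def summarize_file(path):
    """Load a JSON manifest from disk and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))
```

If the real manifest nests its records under a key or uses different field names, only the two accessors in `summarize` would need to change.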
| 133 | |||
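| 134 | Conclusions: | ||
| 135 | |||
| 136 | - Suitable for: | ||
| 137 |   - Training pipeline verification | ||
| 138 |   - Dual-stream architecture verification | ||
| 139 |   - Hyperparameter / GPU memory tuning | ||
| 140 |   - Verification of the output artifact structure | ||
| 141 | - Not suitable for directly finalizing production weights | ||
| 142 | |||
| 143 | The topic still needs the following additions: | ||
| 144 | |||
| 145 | - More reference / clean original tracks | ||
| 146 | - Real recordings and environmental noise samples | ||
| 147 | - More real covers | ||
| 148 | - Hard negatives | ||
| 149 | - More humming_like material | ||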
| 150 | |||
| 151 | --- | ||
| 152 | |||
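| 153 | ## 4. Environment Status and Blockers | ||
| 154 | |||
| 155 | ### 4.1 Python Interpreter | ||
| 156 | |||
| 157 | Always use: | ||
| 158 | |||
| 159 | ```bash | ||
| 160 | /usr/local/miniconda3/bin/python | ||
| 161 | ``` | ||
| 162 | |||
| 163 | ### 4.2 GPU Status | ||
| 164 | |||
| 165 | The system can see the GPU, but the current PyTorch build cannot use it: | ||
| 166 | |||
| 167 | - `nvidia-smi` lists the device | ||
| 168 | - `torch.cuda.is_available()` returns **False** | ||
| 169 | |||
| 170 | The environment report explicitly warns: | ||
| 171 | |||
| 172 | - The current driver version is too old | ||
| 173 | - It is incompatible with the installed `torch 2.12.0+cu130` | ||
| 174 | |||
| 175 | In other words: | ||
| 176 | |||
| 177 | - **This is not a code problem** | ||
| 178 | - It is a **driver / CUDA / torch version combination problem** | ||
| 179 | |||
| 180 | ### 4.3 Recommended Environment Follow-up | ||
| 181 | |||
| 182 | After switching environments, prioritize one of: | ||
| 183 | |||
| 184 | 1. Upgrade the NVIDIA driver, or | ||
| 185 | 2. Install a torch build for a lower CUDA version that is compatible with the existing driver | ||
| 186 | |||
| 187 | Suggested first goals: | ||
| 188 | |||
| 189 | - First get `torch.cuda.is_available()` to return `True` | ||
| 190 | - Then continue training verification | ||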
| 191 | |||
| 192 | --- | ||
| 193 | |||
| 194 | ## 5. CPU Testing Progress | ||
| 195 | |||
| 196 | User requirements: | ||
| 197 | |||
| 198 | - Support falling back to CPU | ||
| 199 | - Test end-to-end integrity on CPU first | ||
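The CPU fallback requirement can be illustrated with a small device-resolution helper. This is a hedged sketch, not the logic actually used by `train.py` or `run_coverhunter_finetune.py`; `resolve_device` is a hypothetical name, and CUDA availability is passed in as a plain boolean so the sketch does not depend on torch.

```python
import warnings


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device to use, degrading 'cuda' to 'cpu' when CUDA is unusable."""
    if requested.startswith("cuda") and not cuda_available:
        # Covers the case where nvidia-smi sees a GPU but the torch build cannot use it.
        warnings.warn("CUDA requested but not available; falling back to CPU")
        return "cpu"
    return requested
```

In a real entry point the second argument would typically be `torch.cuda.is_available()`.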
| 200 | |||
| 201 | Command executed: | ||
| 202 | |||
| 203 | ```bash | ||
| 204 | /usr/local/miniconda3/bin/python /mnt/e/hikoon-ACR/acr-engine/scripts/run_coverhunter_finetune.py \ | ||
| 205 | --python /usr/local/miniconda3/bin/python \ | ||
| 206 | --config configs/coverhunter_finetune_4gb.yaml \ | ||
| 207 | --data data/synthetic_v2 \ | ||
| 208 | --device cpu \ | ||
| 209 | --segment-strategy hybrid \ | ||
| 210 | --dry-run | ||
| 211 | ``` | ||
| 212 | |||
| 213 | ### 5.1 Issues Fixed | ||
| 214 | |||
| 215 | #### Issue 1: invalid argument to `librosa.hz_to_midi` | ||
| 216 | |||
| 217 | Error: | ||
| 218 | |||
| 219 | - `hz_to_midi() got an unexpected keyword argument 'bins_per_octave'` | ||
| 220 | |||
| 221 | Fixed: | ||
| 222 | |||
| 223 | - File: `acr-engine/src/data/dataset.py` | ||
| 224 | - Removed the unsupported argument | ||
| 225 | |||
| 226 | #### Issue 2: optional audiomentations dependencies cause noisy warnings / potential interruptions | ||
| 227 | |||
| 228 | Handled: | ||
| 229 | |||
| 230 | - File: `acr-engine/src/utils/augment.py` | ||
| 231 | - Added import guards and fallback logic | ||
| 232 | - Missing optional augmentations should no longer cause an outright crash | ||
| 233 | |||
| 234 | ### 5.2 How Far the CPU Dry-Run Currently Gets | ||
| 235 | |||
| 236 | The latest CPU dry-run got as far as: | ||
| 237 | |||
| 238 | - Data read successfully | ||
| 239 | - Batch built successfully | ||
| 240 | - Reached the stage just before the model forward pass | ||
| 241 | - Console output: | ||
| 242 |   - `Device: cpu` | ||
| 243 |   - `Dry batch shape: torch.Size([6, 96, 501]) torch.Size([6])` | ||
| 244 |   - `Classes: 16` | ||
| 245 |   - `Train songs: 64` | ||
| 246 |   - `Dry run: running one batch through forward/backward...` | ||
| 247 | |||
| 248 | This shows: | ||
| 249 | |||
| 250 | - The data loading pipeline essentially works | ||
| 251 | - collate essentially works | ||
| 252 | - The configuration can be read | ||
| 253 | - The training entry point runs deep into the pipeline, to around the forward/backward step | ||
| 254 | |||
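| 255 | ### 5.3 Remaining Blocker in the CPU Dry-Run | ||
| 256 | |||
| 257 | The latest visible error comes from: | ||
| 258 | |||
| 259 | - `acr-engine/src/models/ecapa_tdnn.py` | ||
| 260 | - `FrozenMERTFeatureExtractor` | ||
| 261 | |||
| 262 | Symptoms (from the previous run's logs): | ||
| 263 | |||
| 264 | - The network was unreachable, so fetching MERT from HuggingFace failed | ||
| 265 | - The fallback logic still has a path that is not fully closed | ||
| 266 | - Specifically: | ||
| 267 |   - `TypeError: 'NoneType' object is not callable` | ||
| 268 | |||
| 269 | I have already made one fix: | ||
| 270 | |||
| 271 | - In `FrozenMERTFeatureExtractor`, initialize the local fallback `proj` first | ||
| 272 | - If `AutoModel.from_pretrained(model_name)` fails, fall back to the local frozen projection path | ||
| 273 | |||
| 274 | However, **the last tool call was interrupted**, so no new complete stderr log was obtained to confirm whether this fix is fully effective. | ||
| 275 | |||
| 276 | The current status should therefore be regarded as: | ||
| 277 | |||
| 278 | - **The CPU fallback has progressed significantly** | ||
| 279 | - **But no fully successful dry-run result has been obtained yet** | ||
| 280 | - In the next environment, rerun the CPU dry-run first | ||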
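The fix described in this section follows a pattern that can be sketched without torch. The class below is a hypothetical stand-in, not the real `FrozenMERTFeatureExtractor`: `load_backbone` represents a call such as `AutoModel.from_pretrained(model_name)`, and `fallback` represents the local frozen projection. The point is that the callable attribute is assigned before the risky load, so it can never be left as `None`.

```python
import warnings


class FallbackFeatureExtractor:
    """Torch-free sketch of 'initialize the fallback first, then try the backbone'."""

    def __init__(self, load_backbone, fallback):
        # Assign the fallback first so self.extract is always callable.
        self.extract = fallback
        self.using_fallback = True
        try:
            backbone = load_backbone()
        except Exception as exc:  # e.g. network unreachable during download
            warnings.warn(f"Backbone load failed, using fallback: {exc}")
            return
        # Guard the second failure mode: a loader that returns None instead of raising.
        if callable(backbone):
            self.extract = backbone
            self.using_fallback = False

    def __call__(self, features):
        return self.extract(features)
```

When the dry-run is rerun, an error like `'NoneType' object is not callable` would suggest that some code path still replaces the callable with `None` after construction, which this sketch rules out only for the constructor itself.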
| 281 | |||
| 282 | --- | ||
| 283 | |||
| 284 | ## 6. Important Run Artifacts Generated | ||
| 285 | |||
| 286 | ### 6.1 Environment Setup Report | ||
| 287 | |||
| 288 | - `acr-engine/reports/coverhunter_env_setup_report.json` | ||
| 289 | |||
| 290 | ### 6.2 CPU Dry-Run Directories | ||
| 291 | |||
| 292 | Known CPU-related run directories: | ||
| 293 | |||
| 294 | - `/mnt/e/hikoon-ACR/data/training_runs/coverhunter_finetune_20260608T130103Z/` | ||
| 295 | - `/mnt/e/hikoon-ACR/data/training_runs/coverhunter_finetune_20260608T130306Z/` | ||
| 296 | - `/mnt/e/hikoon-ACR/data/training_runs/coverhunter_finetune_20260608T130514Z/` | ||
| 297 | |||
| 298 | Files worth consulting inside: | ||
| 299 | |||
| 300 | - `stdout.log` | ||
| 301 | - `stderr.log` | ||
| 302 | - `run_request.json` | ||
| 303 | - `run_summary.json` | ||
| 304 | |||
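| 305 | Notes: | ||
| 306 | |||
| 307 | - Note that these directories are under `/mnt/e/hikoon-ACR/data/training_runs/`, not `acr-engine/data/training_runs/` | ||
| 308 | - This is because the script was launched from the repository root and the output path was relative | ||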
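One way to make the output location independent of the launch directory is to resolve relative paths against a fixed base instead of the current working directory. The helper below is a sketch under that assumption; `resolve_output_dir` is a hypothetical name and is not part of the existing scripts.

```python
from pathlib import Path


def resolve_output_dir(output, base_dir):
    """Resolve a possibly relative output path against base_dir, not the CWD."""
    path = Path(output)
    # Absolute paths are respected as given; relative ones are anchored at base_dir.
    return path if path.is_absolute() else Path(base_dir) / path
```

Inside `acr-engine/scripts/run_coverhunter_finetune.py`, a plausible base would be `Path(__file__).resolve().parents[1]`, which points at `acr-engine/` given that file location.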
| 309 | |||
| 310 | --- | ||
| 311 | |||
| 312 | ## 7. Recommended Recovery Steps After Switching Environments | ||
| 313 | |||
| 314 | Continue in this order: | ||
| 315 | |||
| 316 | ### Step 1: Verify the Python environment first | ||
| 317 | |||
| 318 | ```bash | ||
| 319 | /usr/local/miniconda3/bin/python --version | ||
| 320 | /usr/local/miniconda3/bin/python -m pip show torch transformers librosa soundfile audiomentations | ||
| 321 | ``` | ||
| 322 | |||
| 323 | ### Step 2: Verify that CUDA works | ||
| 324 | |||
| 325 | ```bash | ||
| 326 | /usr/local/miniconda3/bin/python - <<'PY' | ||
| 327 | import torch | ||
| 328 | print(torch.__version__) | ||
| 329 | print(torch.cuda.is_available()) | ||
| 330 | if torch.cuda.is_available(): | ||
| 331 |     print(torch.cuda.device_count()) | ||
| 332 |     for i in range(torch.cuda.device_count()): | ||
| 333 |         print(i, torch.cuda.get_device_name(i)) | ||
| 334 | PY | ||
| 335 | ``` | ||
| 336 | |||
| 337 | ### Step 3: Rerun the CPU dry-run first to confirm integrity | ||
| 338 | |||
| 339 | ```bash | ||
| 340 | cd /mnt/e/hikoon-ACR/acr-engine && \ | ||
| 341 | /usr/local/miniconda3/bin/python scripts/run_coverhunter_finetune.py \ | ||
| 342 | --python /usr/local/miniconda3/bin/python \ | ||
| 343 | --config configs/coverhunter_finetune_4gb.yaml \ | ||
| 344 | --data data/synthetic_v2 \ | ||
| 345 | --device cpu \ | ||
| 346 | --segment-strategy hybrid \ | ||
| 347 | --dry-run | ||
| 348 | ``` | ||
| 349 | |||
| 350 | If it still fails, check first: | ||
| 351 | |||
| 352 | - `acr-engine/src/models/ecapa_tdnn.py` | ||
| 353 | - `FrozenMERTFeatureExtractor` | ||
| 354 | - Whether the fallback for a failed MERT download still has an empty (`None`) path | ||
| 355 | |||
| 356 | ### Step 4: Once CPU works, run the GPU dry-run | ||
| 357 | |||
| 358 | ```bash | ||
| 359 | cd /mnt/e/hikoon-ACR/acr-engine && \ | ||
| 360 | /usr/local/miniconda3/bin/python scripts/run_coverhunter_finetune.py \ | ||
| 361 | --python /usr/local/miniconda3/bin/python \ | ||
| 362 | --config configs/coverhunter_finetune_4gb.yaml \ | ||
| 363 | --data data/synthetic_v2 \ | ||
| 364 | --device cuda \ | ||
| 365 | --segment-strategy hybrid \ | ||
| 366 | --dry-run | ||
| 367 | ``` | ||
| 368 | |||
| 369 | ### Step 5: Then run a small-scale trial training | ||
| 370 | |||
| 371 | ```bash | ||
| 372 | cd /mnt/e/hikoon-ACR/acr-engine && \ | ||
| 373 | /usr/local/miniconda3/bin/python train.py \ | ||
| 374 | --config configs/coverhunter_finetune_4gb.yaml \ | ||
| 375 | --data data/synthetic_v2 \ | ||
| 376 | --output data/training_runs/coverhunter_4gb_trial \ | ||
| 377 | --device cuda \ | ||
| 378 | --segment-strategy hybrid \ | ||
| 379 | --batch-size 2 \ | ||
| 380 | --epochs 2 | ||
| 381 | ``` | ||
| 382 | |||
| 383 | --- | ||
| 384 | |||
| 385 | ## 8. Key File List | ||
| 386 | |||
| 387 | ### Model and Training | ||
| 388 | |||
| 389 | - `acr-engine/src/models/ecapa_tdnn.py` | ||
| 390 | - `acr-engine/src/models/losses.py` | ||
| 391 | - `acr-engine/src/data/dataset.py` | ||
| 392 | - `acr-engine/src/utils/augment.py` | ||
| 393 | - `acr-engine/train.py` | ||
| 394 | - `acr-engine/scripts/run_coverhunter_finetune.py` | ||
| 395 | |||
| 396 | ### Environment | ||
| 397 | |||
| 398 | - `acr-engine/scripts/setup_coverhunter_env.py` | ||
| 399 | - `acr-engine/reports/coverhunter_env_setup_report.json` | ||
| 400 | |||
| 401 | ### Configuration | ||
| 402 | |||
| 403 | - `acr-engine/configs/default.yaml` | ||
| 404 | - `acr-engine/configs/coverhunter_finetune.yaml` | ||
| 405 | - `acr-engine/configs/coverhunter_finetune_4gb.yaml` | ||
| 406 | |||
| 407 | ### Documentation | ||
| 408 | |||
| 409 | - `docs/coverhunter_finetune_topic.md` | ||
| 410 | - `docs/coverhunter_training_process.md` | ||
| 411 | - `docs/coverhunter_env_setup.md` | ||
| 412 | - `docs/coverhunter_handoff.md` | ||
| 413 | |||
| 414 | --- | ||
| 415 | |||
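| 416 | ## 9. One-Sentence Handoff Summary | ||
| 417 | |||
| 418 | The topic has completed **architecture, configuration, environment automation, and documentation**, but because **the GPU driver is incompatible with the torch CUDA build** and **the CPU fallback has not yet produced a successful dry-run**, the best handoff point for this round is: | ||
| 419 | |||
| 420 | - **Keep the current code and documentation** | ||
| 421 | - **After switching environments, rerun the CPU dry-run first** | ||
| 422 | - **Then resume GPU verification and formal trial training** |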