Capture real FMA smoke execution evidence for restart handoff
Constraint: This checkpoint records running-smoke evidence only and must not stage data, model artifacts, or tmp outputs
Rejected: Wait for the full real FMA smoke to finish before updating handoff docs | The running-state evidence is already valuable for the next session and should not be lost
Confidence: high
Scope-risk: narrow
Directive: Keep future restart notes aligned with the live smoke status and continue using explicit file staging
Tested: Re-verified real FMA smoke is running on CPU, manifests validate, and the documented no-GPU condition explains the long training phase
Not-tested: Did not wait for Epoch 1 completion, model checkpoint emission, or downstream build-index/evaluate completion
Showing 3 changed files with 52 additions and 1 deletion
| 1 | ## 2026-06-02 Checkpoint: real FMA smoke started and entered long CPU training | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Started an end-to-end smoke run on real local FMA data: | ||
| 5 | - Input directory: `acr-engine/data/raw/fma_small_audio` | ||
| 6 | - Output directory: `/tmp/fma_real_smoke_stopcheck` | ||
| 7 | - Confirmed that the real FMA manifests were generated and pass validation: | ||
| 8 | - `catalog_references=8000` | ||
| 9 | - `train_queries=6401` | ||
| 10 | - `test_queries=1593` | ||
| 11 | - `val_queries=0` | ||
| 12 | - Confirmed that no NVIDIA GPU is available in the current environment: | ||
| 13 | - `nvidia-smi` returned `NO_NVIDIA_GPU` | ||
| 14 | - `torch.cuda.is_available() = false` | ||
| 15 | - Confirmed that the real smoke is currently in the CPU training phase and still making progress: | ||
| 16 | - Output directory of the training command: `/tmp/fma_real_smoke_stopcheck/fma_models_smoke` | ||
| 17 | - At the time of this checkpoint, progress had reached `Epoch 1 step 836/3201` | ||
| 18 | |||
| 19 | Verification results: | ||
| 20 | - `check-local-ready fma ...` => `ready_for_smoke=true` | ||
| 21 | - `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true` | ||
| 22 | - The `train.py` process is still running, `ELAPSED≈08:22` | ||
| 23 | |||
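| 24 | Conclusions: | ||
| 25 | - The real FMA data is no longer merely "checkable"; it is now running through a real end-to-end smoke. | ||
| 26 | - The main cause of the current slowness is "no GPU + the relatively large scale of real FMA", not a hung pipeline. | ||
| 27 | - It is normal for the `fma_models_smoke` directory to contain no files for now; per the current `train.py` logic, `best_model.pt` is first written to disk after `Epoch 1` finishes. | ||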
| 28 | |||
| 1 | ## 2026-06-02 Checkpoint: Python cache-noise ignore rules completed | 29 | ## 2026-06-02 Checkpoint: Python cache-noise ignore rules completed |
| 2 | 30 | ||
| 3 | Completed: | 31 | Completed: | ... | ... |
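The checkpoint above records `Epoch 1 step 836/3201` at `ELAPSED≈08:22`. As a rough cross-check that the run is slow rather than stuck, the remaining time for Epoch 1 can be extrapolated linearly. This is only a sketch under stated assumptions: the function names are hypothetical, `08:22` is read as `mm:ss` in the `ps`-style `[[DD-]HH:]MM:SS` format, and every step is assumed to cost the same as the average so far (process startup and data loading are included in the elapsed time, so the estimate is on the pessimistic side).

```python
def parse_elapsed(text: str) -> int:
    """Convert a ps-style ELAPSED string ([[DD-]HH:]MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def estimate_remaining_seconds(step: int, total_steps: int, elapsed_s: float) -> float:
    """Linear extrapolation: average cost per step so far times steps left."""
    if step <= 0 or total_steps < step:
        raise ValueError("need 0 < step <= total_steps")
    return elapsed_s / step * (total_steps - step)


if __name__ == "__main__":
    elapsed = parse_elapsed("08:22")  # value taken from the checkpoint notes
    remaining = estimate_remaining_seconds(836, 3201, elapsed)
    print(f"elapsed={elapsed}s, estimated remaining for Epoch 1: {remaining / 60:.1f} min")
```

With the recorded numbers this comes to roughly 0.6 s per step and a little under 24 minutes left in Epoch 1, which is consistent with the "long CPU training, not a hang" reading. It says nothing about later epochs or the downstream build-index/evaluate stages.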
| ... | @@ -103,3 +103,10 @@ cd /workspace/acr-engine | ... | @@ -103,3 +103,10 @@ cd /workspace/acr-engine |
| 103 | 103 | ||
| 104 | - Added `acr-engine/scripts/business_export_offline_smoke.py` and obtained fresh evidence from an end-to-end offline smoke. | 104 | - Added `acr-engine/scripts/business_export_offline_smoke.py` and obtained fresh evidence from an end-to-end offline smoke. |
| 105 | - Confirmed the chain: business export sample -> normalization -> project manifest -> `train.py --dry-run`. | 105 | - Confirmed the chain: business export sample -> normalization -> project manifest -> `train.py --dry-run`. |
| 106 | |||
| 107 | - Recorded in-progress fresh evidence for the real FMA smoke: | ||
| 108 | - `fma_small_audio` is `ready_for_smoke=true` | ||
| 109 | - Real smoke output directory: `/tmp/fma_real_smoke_stopcheck` | ||
| 110 | - Manifest validation passed: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593` | ||
| 111 | - No GPU in the current environment; the real smoke is in its long training phase on CPU | ||
| 112 | - An empty `fma_models_smoke/` mid-training is normal, because `train.py` first saves `best_model.pt` only after `Epoch 1` finishes | ... | ... |
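The no-GPU condition recorded above was established with `nvidia-smi` and `torch.cuda.is_available()`. A minimal sketch of the same two checks in one place follows, so a restart session can re-verify the condition before assuming a long CPU run. The helper names are hypothetical and not part of the repository; `nvidia-smi -L` is used to list GPUs, and a missing `torch` install is treated as "no CUDA".

```python
import shutil
import subprocess


def nvidia_smi_sees_gpu() -> bool:
    """True if an nvidia-smi binary exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


def torch_sees_cuda() -> bool:
    """True only if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def pick_device() -> str:
    """Device string to train on; falls back to CPU when CUDA is unavailable."""
    return "cuda" if torch_sees_cuda() else "cpu"


if __name__ == "__main__":
    print(f"nvidia-smi sees GPU: {nvidia_smi_sees_gpu()}")
    print(f"torch sees CUDA:     {torch_sees_cuda()}")
    print(f"device:              {pick_device()}")
```

In the environment described by these notes, both checks would be expected to report no GPU and the device would resolve to `cpu`.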
| ... | @@ -34,6 +34,23 @@ | ... | @@ -34,6 +34,23 @@ |
| 34 | - Run a smoke on real open data | 34 | - Run a smoke on real open data |
| 35 | - Keep improving accuracy, especially for `confused` / `humming_like` | 35 | - Keep improving accuracy, especially for `confused` / `humming_like` |
| 36 | 36 | ||
| 37 | ### Latest real FMA run facts (added 2026-06-02) | ||
| 38 | |||
| 39 | - `fma_small.zip` is fully present locally and has been extracted to `acr-engine/data/raw/fma_small_audio` | ||
| 40 | - `check-local-ready fma ...` verified: | ||
| 41 | - `ready_for_smoke=true` | ||
| 42 | - `num_audio_files=8000` | ||
| 43 | - `eligible_query_files=7994` | ||
| 44 | - The real FMA smoke has actually been started and has reached: | ||
| 45 | - Output directory: `/tmp/fma_real_smoke_stopcheck` | ||
| 46 | - Manifest validation: `ok=true` | ||
| 47 | - Current training scale: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593` | ||
| 48 | - No GPU in the current environment: | ||
| 49 | - `nvidia-smi` => `NO_NVIDIA_GPU` | ||
| 50 | - `torch.cuda.is_available() = false` | ||
| 51 | - So this real smoke run currently shows up as **long CPU training**, not an abnormal hang. | ||
| 52 | - Important: `train.py` uses an **epoch-end save** strategy, so `best_model.pt` is first written to disk after `Epoch 1` finishes; seeing an empty `fma_models_smoke/` directory mid-training is therefore normal. | ||
| 53 | |||
| 37 | --- | 54 | --- |
| 38 | 55 | ||
| 39 | ## 1. What the project is | 56 | ## 1. What the project is |
| ... | @@ -734,4 +751,3 @@ seed123 final conclusion: | ... | @@ -734,4 +751,3 @@ seed123 final conclusion: |
| 734 | - `high_energy`: `3 / 1.0 / 1.0` | 751 | - `high_energy`: `3 / 1.0 / 1.0` |
| 735 | - winner: `hybrid` | 752 | - winner: `hybrid` |
| 736 | - The second bucket `prefix_000_b` is currently still running | 753 | - The second bucket `prefix_000_b` is currently still running |
| 737 | ... | ... |
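The run facts recorded in the handoff notes above can be cross-checked arithmetically: `train_queries + val_queries + test_queries = 6401 + 0 + 1593 = 7994`, which equals `eligible_query_files`, and `catalog_references` equals `num_audio_files` at 8000. The sketch below encodes that check. It assumes, based only on the numbers adding up, that the query splits partition the eligible files; the function name and the returned keys are hypothetical and do not reflect the real `validate-splits` output.

```python
def check_split_counts(counts: dict, eligible_query_files: int, num_audio_files: int) -> dict:
    """Compare manifest split sizes against the check-local-ready counts."""
    query_total = (
        counts["train_queries"] + counts["val_queries"] + counts["test_queries"]
    )
    return {
        "query_total": query_total,
        # Assumption: every eligible file lands in exactly one query split.
        "queries_match_eligible": query_total == eligible_query_files,
        "catalog_matches_audio": counts["catalog_references"] == num_audio_files,
        "train_fraction": round(counts["train_queries"] / query_total, 4),
        "has_validation_split": counts["val_queries"] > 0,
    }


if __name__ == "__main__":
    reported = {
        "catalog_references": 8000,
        "train_queries": 6401,
        "test_queries": 1593,
        "val_queries": 0,
    }
    for key, value in check_split_counts(reported, 7994, 8000).items():
        print(f"{key}: {value}")
```

On the reported numbers both equality checks hold and the train share is about 80%. The check also surfaces that `val_queries=0`, so any step that relies on a validation split would have nothing to evaluate on in this smoke.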