Commit fd574b22 fd574b2248ef075ab2e48bee7cc0d6bbff098360 by cnb.bofCdSsphPA

Capture real FMA smoke execution evidence for restart handoff

Constraint: This checkpoint records running-smoke evidence only and must not stage data, model artifacts, or tmp outputs
Rejected: Wait for the full real FMA smoke to finish before updating handoff docs | The running-state evidence is already valuable for the next session and should not be lost
Confidence: high
Scope-risk: narrow
Directive: Keep future restart notes aligned with the live smoke status and continue using explicit file staging
Tested: Re-verified real FMA smoke is running on CPU, manifests validate, and the documented no-GPU condition explains the long training phase
Not-tested: Did not wait for Epoch 1 completion, model checkpoint emission, or downstream build-index/evaluate completion
1 parent b90754c6
## 2026-06-02 Real FMA smoke started and entered long CPU training checkpoint
Completed:
- Started an end-to-end smoke run on real local FMA data:
  - Input directory: `acr-engine/data/raw/fma_small_audio`
  - Output directory: `/tmp/fma_real_smoke_stopcheck`
- Confirmed that the real FMA manifests were generated and pass validation:
  - `catalog_references=8000`
  - `train_queries=6401`
  - `test_queries=1593`
  - `val_queries=0`
- Confirmed that the current environment has no usable NVIDIA GPU:
  - `nvidia-smi` returns `NO_NVIDIA_GPU`
  - `torch.cuda.is_available() = false`
- Confirmed that the real smoke is currently in the CPU training phase and is still making progress:
  - Training command output directory: `/tmp/fma_real_smoke_stopcheck/fma_models_smoke`
  - Progress at this checkpoint had reached `Epoch 1 step 836/3201`
Verification results:
- `check-local-ready fma ...` => `ready_for_smoke=true`
- `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true`
- The `train.py` process is still running, `ELAPSED≈08:22`
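The manifest counts can also be cross-checked by hand. The sketch below is only an illustration and is not the logic of `validate-splits`: it assumes that every eligible query file lands in exactly one query split, and it compares the split totals against the `eligible_query_files=7994` figure that `check-local-ready` reports elsewhere in these notes. The helper names `total_queries` and `train_fraction` are hypothetical.

```python
# Counts copied from the notes. The check itself is an illustrative assumption,
# not the actual behaviour of `validate-splits`.
reported = {
    "catalog_references": 8000,
    "train_queries": 6401,
    "test_queries": 1593,
    "val_queries": 0,
}
eligible_query_files = 7994  # from the `check-local-ready` output


def total_queries(counts: dict) -> int:
    """Sum every query split (train, test, val); skips catalog_references."""
    return sum(v for k, v in counts.items() if k.endswith("_queries"))


def train_fraction(counts: dict) -> float:
    """Share of all queries assigned to the train split."""
    return counts["train_queries"] / total_queries(counts)


if __name__ == "__main__":
    print("total queries:", total_queries(reported))
    print("matches eligible files:", total_queries(reported) == eligible_query_files)
    print(f"train fraction: {train_fraction(reported):.3f}")
```

Under that assumption the three splits account for all eligible query files, and the train share is close to 80%.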
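Conclusions:
- Real FMA data is no longer merely "checkable"; it is now running in a real end-to-end smoke.
- The main cause of the current slowness is "no GPU plus the large size of real FMA", not a stalled pipeline.
- An empty `fma_models_smoke` directory is expected for now; with the current `train.py` logic, `best_model.pt` is first written to disk after `Epoch 1` ends.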
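To judge "slow but progressing" against "stalled", a rough time-to-epoch-end can be derived from the recorded numbers. This is a minimal sketch with two assumptions: the `ELAPSED≈08:22` reading was taken at about the same moment as `step 836/3201`, and the step rate stays constant. Because `ELAPSED` also covers process startup, the per-step time is slightly overestimated. The helper names are hypothetical, and only `MM:SS` and `HH:MM:SS` strings are handled.

```python
def parse_elapsed(text: str) -> int:
    """Convert an `MM:SS` or `HH:MM:SS` elapsed string into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def remaining_seconds(elapsed: str, step: int, total_steps: int) -> float:
    """Linear estimate of the time left in the epoch at the observed average rate."""
    if not 0 < step <= total_steps:
        raise ValueError("step must be in 1..total_steps")
    per_step = parse_elapsed(elapsed) / step
    return per_step * (total_steps - step)


if __name__ == "__main__":
    eta = remaining_seconds("08:22", step=836, total_steps=3201)
    print(f"about {eta / 60:.0f} min left in Epoch 1")
```

With the figures above, the estimate is roughly 24 more minutes before `Epoch 1` ends and the first `best_model.pt` can appear.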
## 2026-06-02 Python cache-noise ignore rules completed checkpoint
Completed:
......
......@@ -103,3 +103,10 @@ cd /workspace/acr-engine
- Added `acr-engine/scripts/business_export_offline_smoke.py` and obtained fresh evidence from an end-to-end offline smoke.
  - Confirmed pipeline: business export sample -> normalization -> project manifest -> `train.py --dry-run`
- Recorded fresh in-progress evidence for the real FMA smoke:
  - `fma_small_audio`: `ready_for_smoke=true`
  - Real smoke output directory: `/tmp/fma_real_smoke_stopcheck`
  - Manifest validation passed: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593`
  - The current environment has no GPU, so the real smoke is in a long training phase on CPU
  - An empty `fma_models_smoke/` mid-training is expected, because `train.py` first saves `best_model.pt` only after `Epoch 1` ends
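The empty-directory behaviour follows from saving only at epoch boundaries. The toy loop below is a sketch of that pattern, not the project's `train.py`: the function `train`, its parameters, the save-only-on-improvement rule, and the JSON payload standing in for a real binary checkpoint are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def train(out_dir: Path, epochs: int, steps_per_epoch: int, epoch_losses: list) -> list:
    """Toy loop: returns the epochs at which a checkpoint was written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    best_loss = float("inf")
    saved_at = []
    for epoch in range(1, epochs + 1):
        for _step in range(steps_per_epoch):
            pass  # a real loop would run forward/backward here; nothing is written to disk
        loss = epoch_losses[epoch - 1]  # stand-in for a measured epoch loss
        if loss < best_loss:  # assumed rule: save at epoch end, only on improvement
            best_loss = loss
            # Stand-in for a real checkpoint write (e.g. torch.save).
            payload = json.dumps({"epoch": epoch, "loss": loss})
            (out_dir / "best_model.pt").write_text(payload)
            saved_at.append(epoch)
    return saved_at


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "fma_models_smoke"
        out.mkdir()
        print("files before epoch 1 ends:", list(out.iterdir()))
        print("saved at epochs:", train(out, 3, 5, [0.9, 0.7, 0.8]))
```

Under this pattern, no file can exist in the output directory at any step inside the first epoch, however long that epoch takes.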
......
......@@ -34,6 +34,23 @@
- Run the real open-data smoke
- Keep improving accuracy, especially for `confused` / `humming_like`
### Latest real FMA run facts (added 2026-06-02)
- `fma_small.zip` is fully in place and has been extracted to `acr-engine/data/raw/fma_small_audio`
- `check-local-ready fma ...` has verified:
  - `ready_for_smoke=true`
  - `num_audio_files=8000`
  - `eligible_query_files=7994`
- The real FMA smoke has actually been started, with:
  - Output directory: `/tmp/fma_real_smoke_stopcheck`
  - Manifest validation: `ok=true`
  - Current training scale: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593`
- The current environment has no GPU:
  - `nvidia-smi` => `NO_NVIDIA_GPU`
  - `torch.cuda.is_available() = false`
- As a result, this real smoke run currently shows up as **long CPU training**, not an abnormal hang.
- Important: `train.py` uses an **epoch-end save** strategy, so `best_model.pt` is first written to disk after `Epoch 1` ends; an empty `fma_models_smoke/` directory mid-training is therefore expected.
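The no-GPU condition can be re-verified at the start of a new session before interpreting training speed. The sketch below assumes nothing about how `train.py` selects its device; `pick_device` and `has_nvidia_smi` are hypothetical helpers. The PyTorch import is optional so the check also runs where PyTorch is not installed.

```python
import shutil


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional import so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def has_nvidia_smi() -> bool:
    """True when an `nvidia-smi` executable is on PATH (says nothing about driver health)."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("device:", pick_device(), "| nvidia-smi on PATH:", has_nvidia_smi())
```

If this reports `cpu`, the long training phase described above is the expected behaviour and not a sign of a hang.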
---
## 1. What the project is
......@@ -734,4 +751,3 @@ seed123 final conclusion:
- `high_energy`: `3 / 1.0 / 1.0`
- winner: `hybrid`
- The second bucket, `prefix_000_b`, is still running
......