Commit fd574b22 fd574b2248ef075ab2e48bee7cc0d6bbff098360 by cnb.bofCdSsphPA

Capture real FMA smoke execution evidence for restart handoff

Constraint: This checkpoint records running-smoke evidence only and must not stage data, model artifacts, or tmp outputs
Rejected: Wait for the full real FMA smoke to finish before updating handoff docs | The running-state evidence is already valuable for the next session and should not be lost
Confidence: high
Scope-risk: narrow
Directive: Keep future restart notes aligned with the live smoke status and continue using explicit file staging
Tested: Re-verified real FMA smoke is running on CPU, manifests validate, and the documented no-GPU condition explains the long training phase
Not-tested: Did not wait for Epoch 1 completion, model checkpoint emission, or downstream build-index/evaluate completion
1 parent b90754c6
1 ## 2026-06-02 Checkpoint: real FMA smoke started and entered long CPU training
2
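3 Completed:
4 - Started the end-to-end smoke on real local FMA data:
5 - Input directory: `acr-engine/data/raw/fma_small_audio`
6 - Output directory: `/tmp/fma_real_smoke_stopcheck`
7 - Confirmed that the real FMA manifests were generated and pass validation: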
8 - `catalog_references=8000`
9 - `train_queries=6401`
10 - `test_queries=1593`
11 - `val_queries=0`
12 - Confirmed that no NVIDIA GPU is available in the current environment:
13 - `nvidia-smi` returns `NO_NVIDIA_GPU`
14 - `torch.cuda.is_available() = false`
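15 - Confirmed that the real smoke is currently in the CPU training phase and still making progress:
16 - Training output directory: `/tmp/fma_real_smoke_stopcheck/fma_models_smoke`
17 - Progress at the time of this checkpoint: `Epoch 1 step 836/3201`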
18
19 Verification results:
20 - `check-local-ready fma ...` => `ready_for_smoke=true`
21 - `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true`
22 - The `train.py` process is still running, `ELAPSED≈08:22`
23
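24 Conclusions:
25 - The real FMA data is no longer merely "checkable"; it has entered actual end-to-end smoke execution.
26 - The slowness is mainly due to "no GPU + the large scale of real FMA", not a stalled pipeline.
27 - An empty `fma_models_smoke` directory is normal for now; per the current `train.py` logic, `best_model.pt` is first written to disk after `Epoch 1` completes.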
28
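The progress figures recorded in this checkpoint (`Epoch 1 step 836/3201` at `ELAPSED≈08:22`) allow a rough estimate of when the first epoch, and with it the first `best_model.pt`, can be expected. The sketch below assumes a constant per-step cost and that the elapsed time is dominated by training steps rather than startup; both are assumptions, and the helper names `parse_mmss` and `estimate_epoch_eta` are hypothetical.

```python
def parse_mmss(text: str) -> int:
    """Convert an 'MM:SS' elapsed string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def estimate_epoch_eta(step: int, total_steps: int, elapsed_s: int) -> tuple[float, float]:
    """Return (estimated seconds for the whole epoch, estimated seconds remaining).

    Assumes every step costs the same; startup time is not separated out.
    """
    per_step = elapsed_s / step
    total = per_step * total_steps
    return total, total - elapsed_s


elapsed = parse_mmss("08:22")  # 502 seconds
total, remaining = estimate_epoch_eta(836, 3201, elapsed)
print(f"epoch ~{total / 60:.0f} min, remaining ~{remaining / 60:.0f} min")
```

Under these assumptions the first epoch takes roughly 32 minutes on this CPU, with about 24 minutes still to go at the time of the checkpoint.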
1 29 ## 2026-06-02 Checkpoint: ignore rules for Python cache noise completed
2 30
3 31 Completed:
......
@@ -103,3 +103,10 @@ cd /workspace/acr-engine
103 103
104 104 - Added `acr-engine/scripts/business_export_offline_smoke.py` and obtained fresh end-to-end offline smoke evidence.
105 105 - Confirmed pipeline: business export sample -> normalization -> project manifest -> `train.py --dry-run`
106
107 - Recorded in-progress fresh evidence for the real FMA smoke:
108 - `fma_small_audio`: `ready_for_smoke=true`
109 - Real smoke output directory: `/tmp/fma_real_smoke_stopcheck`
110 - Manifest validation passed: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593`
111 - No GPU in the current environment; the real smoke is in a long training phase on CPU
112 - An empty `fma_models_smoke/` mid-training is normal, because `train.py` first saves `best_model.pt` only after `Epoch 1` ends
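To illustrate why an empty `fma_models_smoke/` directory mid-epoch is expected, here is a minimal sketch of an epoch-end save loop. It is not the project's `train.py`: the function `run_training`, the fake loss values, and the JSON placeholder written instead of a real checkpoint are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_training(out_dir: Path, epochs: int, steps_per_epoch: int) -> list[int]:
    """Toy loop that records how many files exist in out_dir at every step.

    The checkpoint is written only after an epoch finishes, and only when
    the (fake) epoch loss improves on the best value seen so far.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    best_loss = float("inf")
    files_seen = []
    for epoch in range(1, epochs + 1):
        for _step in range(steps_per_epoch):
            # Mid-epoch: nothing is saved, so the directory can still be empty.
            files_seen.append(len(list(out_dir.iterdir())))
        epoch_loss = 1.0 / epoch  # hypothetical loss that decreases each epoch
        if epoch_loss < best_loss:
            best_loss = epoch_loss
            # Placeholder for a real checkpoint write (e.g. torch.save).
            (out_dir / "best_model.pt").write_text(json.dumps({"epoch": epoch}))
    return files_seen


with tempfile.TemporaryDirectory() as tmp:
    seen = run_training(Path(tmp) / "fma_models_smoke", epochs=2, steps_per_epoch=3)
    print(seen)  # [0, 0, 0, 1, 1, 1]
```

In this toy run the directory holds zero files for every step of the first epoch and one file afterwards, which matches the behaviour described for the real run.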
......
@@ -34,6 +34,23 @@
34 34 - Run a smoke on real open data
35 35 - Continue improving accuracy, especially for `confused` / `humming_like`
36 36
37 ### Latest real FMA run facts (added 2026-06-02)
38
39 - `fma_small.zip` is fully in place and has been extracted to `acr-engine/data/raw/fma_small_audio`
40 - `check-local-ready fma ...` verified:
41 - `ready_for_smoke=true`
42 - `num_audio_files=8000`
43 - `eligible_query_files=7994`
44 - The real FMA smoke has actually run as far as:
45 - Output directory: `/tmp/fma_real_smoke_stopcheck`
46 - Manifest validation: `ok=true`
47 - Current training scale: `catalog_references=8000`, `train_queries=6401`, `test_queries=1593`
48 - No GPU in the current environment:
49 - `nvidia-smi` => `NO_NVIDIA_GPU`
50 - `torch.cuda.is_available() = false`
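51 - The real smoke therefore currently shows up as **long CPU training**, not an abnormal hang.
52 - Important: `train.py` uses an **epoch-end save** strategy, so `best_model.pt` is first written to disk after `Epoch 1` completes; an empty `fma_models_smoke/` directory mid-training is therefore normal.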
53
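The counts recorded for this run are mutually consistent: the query splits add up to `eligible_query_files`, and the catalog size matches `num_audio_files`. A small sketch of that cross-check follows. The dictionary layout and the function name `check_split_counts` are assumptions, not the output format of `check-local-ready` or `validate-splits`; `val_queries=0` is taken from the handoff checkpoint entry for the same run.

```python
def check_split_counts(report: dict) -> dict:
    """Cross-check manifest counts against the local-readiness counts."""
    query_total = (
        report["train_queries"] + report["val_queries"] + report["test_queries"]
    )
    return {
        # Every eligible query file should land in exactly one split.
        "queries_match_eligible": query_total == report["eligible_query_files"],
        # Every audio file should appear once in the reference catalog.
        "catalog_matches_audio": report["catalog_references"] == report["num_audio_files"],
        "train_fraction": round(report["train_queries"] / query_total, 4),
    }


# Values copied from the run facts recorded in these notes.
report = {
    "num_audio_files": 8000,
    "eligible_query_files": 7994,
    "catalog_references": 8000,
    "train_queries": 6401,
    "val_queries": 0,
    "test_queries": 1593,
}
result = check_split_counts(report)
print(result)
```

With these numbers both checks hold (6401 + 0 + 1593 = 7994), and the training share of the queries comes out at about 0.80.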
37 54 ---
38 55
39 56 ## 1. What the project is
@@ -734,4 +751,3 @@ seed123 final conclusion:
734 751 - `high_energy`: `3 / 1.0 / 1.0`
735 752 - winner: `hybrid`
736 753 - The second bucket `prefix_000_b` is still running
737
......