Commit 47390fe9 47390fe9c391192b1d74f806a61417622ca72025 by cnb.bofCdSsphPA

Record fresh FMA smoke verification before epoch completion

Update the handoff package with newer runtime evidence so the next session can distinguish a still-progressing epoch from a hung pipeline while waiting for the first saved model file.

Constraint: Verification had to rely on live process state because Epoch 1 has not completed yet
Rejected: Leave the prior checkpoint as-is | would force the next session to re-check whether progress continued
Confidence: high
Scope-risk: narrow
Directive: Continue checking for the first transition into saved model output, build-index, or evaluate before drawing quality conclusions
Tested: ps on PID 311629; process scan for smoke-local/build-index/evaluate; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests; find on /tmp/fma_real_smoke_stopcheck/fma_models_smoke
Not-tested: Final FMA smoke report and accuracy metrics
1 parent 60e0f9e3
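## 2026-06-02 Real FMA smoke fresh-evidence re-verification checkpoint
Completed:
- Re-checked the runtime state of the real FMA smoke run and confirmed that training is still progressing.
- Updated `docs/session-handoff.md` with fresh evidence from 12:11 UTC.
- Updated `docs/changelist-2026-06-02.md` and `docs/delivery-handoff-2026-06-02.md` to stress that the run has not yet entered `build-index/evaluate`.
Verification results:
- `ps -p 311629 -o pid,etime,%cpu,%mem,cmd` => `ELAPSED=14:25`
- Only the `smoke-local` and `train.py` processes currently exist; no new `build-index/evaluate` process was seen.
- `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true`
- `find /tmp/fma_real_smoke_stopcheck/fma_models_smoke ...` returns only the directory itself.
Conclusion:
- The real FMA smoke run is still progressing inside an epoch; there is no evidence of a hang.
- As of this checkpoint, the pipeline has not yet entered the index-building or evaluation stage.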
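The `ELAPSED` readings recorded above can be compared across checkpoints once they are converted to seconds. The following is a minimal sketch, not part of the repository: `parse_etime` and `still_progressing` are hypothetical helper names, and the `min_cpu` threshold is an assumed value. It treats a growing elapsed time on the same PID plus a non-trivial `%CPU` reading as a heuristic sign of progress, not as proof, since `ps` reports `%CPU` averaged over the process lifetime.

```python
def parse_etime(etime: str) -> int:
    """Convert a ps ELAPSED value in [[dd-]hh:]mm:ss form to seconds."""
    days = 0
    etime = etime.strip()
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields (hours) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def still_progressing(prev_etime: str, curr_etime: str,
                      cpu_percent: float, min_cpu: float = 50.0) -> bool:
    """Heuristic: same process has kept running and is using CPU.

    min_cpu is an assumed threshold chosen for illustration only.
    """
    grew = parse_etime(curr_etime) > parse_etime(prev_etime)
    return grew and cpu_percent >= min_cpu
```

For the reading in this checkpoint, `parse_etime("14:25")` gives 865 seconds. The previous checkpoint's elapsed value would be passed as `prev_etime`.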
## 2026-06-02 Checkpoint: delivery package completed with the latest real FMA runtime state and restart handoff notes
Completed:
......
......@@ -138,3 +138,11 @@ cd /workspace/acr-engine
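1. Read [./session-handoff.md](./session-handoff.md) first.
2. Then check whether the real FMA smoke run has produced `best_model.pt` or entered `build-index/evaluate`.
3. If it has completed, first update the docs and changelog, commit, and push, then continue with the next benchmark round.
## 12:11 UTC re-verification addendum
- Obtained fresh evidence newer than the previous commit: `train.py ELAPSED=14:25`.
- Confirmed that no `build-index` or `evaluate` process has started yet.
- Confirmed that the model output directory is still empty, containing only the directory itself.
- This is further evidence that the run is a long CPU training job, not a hung process.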
......
......@@ -28,6 +28,7 @@
- `train_queries=6401`
- `test_queries=1593`
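- The current model directory `/tmp/fma_real_smoke_stopcheck/fma_models_smoke/` is still empty, but this is expected given the current `train.py` implementation: `best_model.pt` is first saved after `Epoch 1` finishes.
- As of 2026-06-02 12:11 UTC, re-verification shows the run has still not entered `build-index` / `evaluate`; the latest reading is `train.py ELAPSED=14:25`.
- So the most important deliverable of this round is not "final accuracy" but **a clear record of the state of the in-progress, real large-scale smoke run, its blockers, and how to resume it**.
## Current blockers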
......
......@@ -79,6 +79,28 @@
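3. **The working tree is still very noisy**
- When committing, continue to explicitly stage only docs / scripts; do not accidentally include `data/external_smoke`, `data/raw`, checkpoints, or `__pycache__`.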
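The staging rule above can be expressed as a small path filter to run over candidate files before `git add`. This is an illustrative sketch only: `safe_to_stage` is a hypothetical helper, and the checkpoint suffixes `.pt` and `.ckpt` are an assumption, since the source says only "checkpoint" without naming file extensions.

```python
from pathlib import PurePosixPath

# Directories named in the handoff note as must-not-commit.
EXCLUDED_DIRS = (PurePosixPath("data/external_smoke"), PurePosixPath("data/raw"))
# Directory names that must not appear anywhere in the path.
EXCLUDED_PARTS = {"__pycache__"}
# Assumed checkpoint file suffixes (not stated in the source).
CHECKPOINT_SUFFIXES = {".pt", ".ckpt"}


def safe_to_stage(path: str) -> bool:
    """Return True if a repo-relative path passes the staging rule."""
    p = PurePosixPath(path)
    # Reject the excluded directories themselves and anything below them.
    if any(p == d or d in p.parents for d in EXCLUDED_DIRS):
        return False
    if EXCLUDED_PARTS & set(p.parts):
        return False
    if p.suffix in CHECKPOINT_SUFFIXES:
        return False
    return True
```

A file such as `docs/session-handoff.md` passes the filter, while anything under `data/raw` is rejected. The filter only screens out the noise listed above; it does not itself restrict staging to docs and scripts.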
### Fresh evidence, updated (2026-06-02 12:11 UTC)
- Compared with the previous delivery, the real FMA smoke run is still progressing rather than stalled:
- `train.py ELAPSED=14:25`
- `%CPU≈615`
- `%MEM≈10.4`
- Only two key processes are currently seen running:
- `PID=311494`: `external_adapters.py smoke-local fma ...`
- `PID=311629`: `train.py --data /tmp/fma_real_smoke_stopcheck/fma/manifests ...`
- No new `build-index` / `evaluate` process has appeared yet.
- `/tmp/fma_real_smoke_stopcheck/fma_models_smoke/` still contains only the directory itself, with no model files.
- The manifest re-validation result is unchanged:
- `catalog_references=8000`
- `train_queries=6401`
- `test_queries=1593`
- `val_queries=0`
- `ok=true`
This shows:
- The current state is **real FMA full-scale training still progressing inside an epoch**.
- `Epoch 1` has not finished yet, so `best_model.pt` cannot be expected on disk yet.
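The reasoning above (which processes are running, and whether a model file exists) can be condensed into one stage check for the next session. The sketch below is hypothetical and not taken from the repository: `classify_stage` and its stage labels are invented for illustration, and matching stages by substring in the command line is an assumption about how the later steps would appear in a process scan.

```python
from typing import Sequence


def classify_stage(process_cmds: Sequence[str],
                   model_dir_entries: Sequence[str]) -> str:
    """Guess the pipeline stage from process command lines and the
    file names found in the model output directory.

    Later stages are checked first, so a leftover model file does not
    mask a running build-index or evaluate step.
    """
    if any("evaluate" in cmd for cmd in process_cmds):
        return "evaluate"
    if any("build-index" in cmd for cmd in process_cmds):
        return "build-index"
    if "best_model.pt" in model_dir_entries:
        return "model-saved"
    if any("train.py" in cmd for cmd in process_cmds):
        return "training"
    return "unknown"
```

With the two processes listed above and an empty model directory, this returns `"training"`, matching the conclusion of this checkpoint. An `"unknown"` result, with no matching process and no model file, would warrant a manual look.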
### First-priority actions after restart
1. First check whether the real FMA smoke run has completed:
......