Record fresh FMA smoke verification before epoch completion
Update the handoff package with newer runtime evidence so the next session can distinguish a still-progressing epoch from a hung pipeline while waiting for the first saved model file.

Constraint: Verification had to rely on live process state because Epoch 1 has not completed yet
Rejected: Leave the prior checkpoint as-is | would force the next session to re-check whether progress continued
Confidence: high
Scope-risk: narrow
Directive: Continue checking for the first transition into saved model output, build-index, or evaluate before drawing quality conclusions
Tested: ps on PID 311629; process scan for smoke-local/build-index/evaluate; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests; find on /tmp/fma_real_smoke_stopcheck/fma_models_smoke
Not-tested: Final FMA smoke report and accuracy metrics
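The distinction this commit draws (still-progressing epoch vs. hung pipeline vs. first saved model vs. downstream stage) can be sketched as a small classifier over the same observations. This is a minimal sketch under stated assumptions, not the project's tooling: the function names, the state labels, and the `min_cpu` threshold are all hypothetical, and the caller is assumed to have already collected the `ps`, process-scan, and `find` results.

```python
from __future__ import annotations

from pathlib import Path


def parse_etime(etime: str) -> int:
    """Convert a ps ELAPSED value ([[dd-]hh:]mm:ss) into seconds."""
    days = 0
    etime = etime.strip()
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def saved_model_files(model_dir: Path) -> list[Path]:
    """List regular files under model_dir; empty while only the directory exists."""
    if not model_dir.is_dir():
        return []
    return sorted(p for p in model_dir.rglob("*") if p.is_file())


def classify(
    train_alive: bool,
    cpu_percent: float,
    model_files: list[Path],
    downstream_running: bool,
    min_cpu: float = 5.0,  # hypothetical threshold, not taken from the project
) -> str:
    """Map the observed signals onto one of five hypothetical state labels."""
    if downstream_running:  # a build-index or evaluate process was seen
        return "downstream-started"
    if model_files:  # e.g. best_model.pt has been written
        return "model-saved"
    if not train_alive:
        return "exited-without-model"
    if cpu_percent >= min_cpu:
        return "training-in-progress"
    return "possibly-hung"


# Readings recorded in this commit: train.py alive, %CPU around 615,
# empty model directory, no build-index/evaluate process.
state = classify(True, 615.0, [], False)
elapsed_seconds = parse_etime("14:25")
```

With the readings recorded here, `state` is `"training-in-progress"`. A single sample is weak evidence on its own, so comparing two samples taken some minutes apart is a more reliable check than one reading.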
Showing 4 changed files with 48 additions and 0 deletions
| 1 | ## 2026-06-02 Checkpoint: fresh-evidence re-verification of the real FMA smoke run | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Re-checked the runtime state of the real FMA smoke run and confirmed that training is still progressing. | ||
| 5 | - Updated `docs/session-handoff.md` with the fresh evidence from 12:11 UTC. | ||
| 6 | - Updated `docs/changelist-2026-06-02.md` and `docs/delivery-handoff-2026-06-02.md` to stress that the run has not yet entered `build-index/evaluate`. | ||
| 7 | |||
| 8 | Verification results: | ||
| 9 | - `ps -p 311629 -o pid,etime,%cpu,%mem,cmd` => `ELAPSED=14:25` | ||
| 10 | - Only `smoke-local` and `train.py` processes are currently running; no new `build-index/evaluate` process was seen | ||
| 11 | - `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true` | ||
| 12 | - `find /tmp/fma_real_smoke_stopcheck/fma_models_smoke ...` returns only the directory itself | ||
| 13 | |||
| 14 | Conclusions: | ||
| 15 | - The real FMA smoke run is still progressing inside an epoch; there is no evidence of a hang. | ||
| 16 | - As of this checkpoint, the pipeline has not yet entered the index-building or evaluation stage. | ||
| 17 | |||
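| 1 | ## 2026-06-02 Checkpoint: delivery package updated with the latest real FMA runtime state and restart handoff notes | 18 | ## 2026-06-02 Checkpoint: delivery package updated with the latest real FMA runtime state and restart handoff notes |
| 2 | 19 | ||
| 3 | Completed: | 20 | Completed: | ... | ... |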
| ... | @@ -138,3 +138,11 @@ cd /workspace/acr-engine | ... | @@ -138,3 +138,11 @@ cd /workspace/acr-engine |
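| 138 | 1. Read [./session-handoff.md](./session-handoff.md) first. | 138 | 1. Read [./session-handoff.md](./session-handoff.md) first. |
| 139 | 2. Then check whether the real FMA smoke run has produced `best_model.pt` or entered `build-index/evaluate`. | 139 | 2. Then check whether the real FMA smoke run has produced `best_model.pt` or entered `build-index/evaluate`. |
| 140 | 3. If it has finished, update the docs and changelog, commit, and push before continuing with the next benchmark round. | 140 | 3. If it has finished, update the docs and changelog, commit, and push before continuing with the next benchmark round. |
| 141 | |||
| 142 | |||
| 143 | ## 12:11 UTC re-verification addendum | ||
| 144 | |||
| 145 | - Obtained fresh evidence newer than the previous commit: `train.py ELAPSED=14:25`. | ||
| 146 | - Confirmed that the run has still not switched to a `build-index` or `evaluate` process. | ||
| 147 | - Confirmed that the model output directory is still empty, containing only the directory itself. | ||
| 148 | - This further demonstrates that the run is a long CPU training job, not a hung process. | ... | ... |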
| ... | @@ -28,6 +28,7 @@ | ... | @@ -28,6 +28,7 @@ |
| 28 | - `train_queries=6401` | 28 | - `train_queries=6401` |
| 29 | - `test_queries=1593` | 29 | - `test_queries=1593` |
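| 30 | - The model directory `/tmp/fma_real_smoke_stopcheck/fma_models_smoke/` is still empty, but this is expected with the current `train.py` implementation: `best_model.pt` is first saved after `Epoch 1` finishes. | 30 | - The model directory `/tmp/fma_real_smoke_stopcheck/fma_models_smoke/` is still empty, but this is expected with the current `train.py` implementation: `best_model.pt` is first saved after `Epoch 1` finishes. |
| 31 | - As of 2026-06-02 12:11 UTC, re-verification shows the run has still not entered `build-index` / `evaluate`; the latest reading is `train.py ELAPSED=14:25`. | ||
| 31 | - So the most important part of this delivery is not the "final accuracy" but **clearly recording the state of the real large-scale smoke run in progress, its blockers, and how to resume it**. | 32 | - So the most important part of this delivery is not the "final accuracy" but **clearly recording the state of the real large-scale smoke run in progress, its blockers, and how to resume it**. |
| 32 | 33 | ||
| 33 | ## Current blockers | 34 | ## Current blockers | ... | ... |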
| ... | @@ -79,6 +79,28 @@ | ... | @@ -79,6 +79,28 @@ |
| 79 | 3. **The working tree is still very noisy** | 79 | 3. **The working tree is still very noisy** |
| 80 | - When committing, keep explicitly staging only docs / scripts; do not accidentally include `data/external_smoke`, `data/raw`, checkpoint, or `__pycache__`. | 80 | - When committing, keep explicitly staging only docs / scripts; do not accidentally include `data/external_smoke`, `data/raw`, checkpoint, or `__pycache__`. |
| 81 | 81 | ||
| 82 | ### Updated fresh evidence (2026-06-02 12:11 UTC) | ||
| 83 | |||
| 84 | - Compared with the previous delivery, the real FMA smoke run is still progressing rather than stalled: | ||
| 85 | - `train.py ELAPSED=14:25` | ||
| 86 | - `%CPU≈615` | ||
| 87 | - `%MEM≈10.4` | ||
| 88 | - Only two key running processes are still visible: | ||
| 89 | - `PID=311494`: `external_adapters.py smoke-local fma ...` | ||
| 90 | - `PID=311629`: `train.py --data /tmp/fma_real_smoke_stopcheck/fma/manifests ...` | ||
| 91 | - No new `build-index` / `evaluate` process has appeared yet. | ||
| 92 | - `/tmp/fma_real_smoke_stopcheck/fma_models_smoke/` still contains only the directory itself, with no model files. | ||
| 93 | - Re-validating the manifests gave unchanged results: | ||
| 94 | - `catalog_references=8000` | ||
| 95 | - `train_queries=6401` | ||
| 96 | - `test_queries=1593` | ||
| 97 | - `val_queries=0` | ||
| 98 | - `ok=true` | ||
| 99 | |||
| 100 | This shows that: | ||
| 101 | - The current state is **real FMA full training still progressing inside an epoch**. | ||
| 102 | - `Epoch 1` has not finished yet, so `best_model.pt` cannot be expected on disk yet. | ||
| 103 | |||
| 82 | ### First-priority actions after restart | 104 | ### First-priority actions after restart |
| 83 | 105 | ||
| 84 | 1. First check whether the real FMA smoke run has finished: | 106 | 1. First check whether the real FMA smoke run has finished: | ... | ... |