Capture another live FMA smoke progress checkpoint
Keep the restart artifacts synchronized with the newest observed elapsed time so the next session can see that the real FMA smoke is still advancing without yet reaching model save or evaluation stages.

Constraint: Training remains inside Epoch 1, so verification is limited to live runtime evidence
Rejected: Stop at the prior 17:07 checkpoint | would leave handoff docs behind the latest verified state
Confidence: high
Scope-risk: narrow
Directive: Continue monitoring until the first saved model file or stage transition appears
Tested: ps on PID 311629; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests; find on /tmp/fma_real_smoke_stopcheck/fma_models_smoke
Not-tested: End-of-epoch artifacts, build-index, evaluate, final metrics
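The checks named under Tested (a `ps` elapsed-time reading and a `find` over the model directory) could be scripted roughly as below. This is only a sketch, not code from this repository: `parse_etime`, `process_etime`, and `saved_files` are hypothetical helper names, and it assumes a Linux `ps` that prints elapsed time as `[[dd-]hh:]mm:ss`.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def parse_etime(etime: str) -> int:
    """Convert a ps etime string ([[dd-]hh:]mm:ss) into seconds."""
    etime = etime.strip()
    days = 0
    if "-" in etime:
        # A leading "dd-" carries the day count.
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    # Pad missing leading fields so we always have hours, minutes, seconds.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def process_etime(pid: int) -> int | None:
    """Return elapsed seconds for pid, or None if the process is gone."""
    result = subprocess.run(
        ["ps", "-p", str(pid), "-o", "etime="],  # "etime=" suppresses the header
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return None
    return parse_etime(result.stdout)


def saved_files(model_dir: Path) -> list[Path]:
    """List regular files under model_dir (empty if nothing was saved yet)."""
    if not model_dir.is_dir():
        return []
    return sorted(p for p in model_dir.rglob("*") if p.is_file())
```

Calling `process_etime(311629)` at two points in time and comparing the results would show whether the run is still advancing, while a non-empty `saved_files(Path("/tmp/fma_real_smoke_stopcheck/fma_models_smoke"))` would be the first sign that a model file has been written.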
Showing 3 changed files with 45 additions and 0 deletions
| 1 | ## 2026-06-02 Real FMA smoke fresh evidence: 18:22 checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Re-checked the runtime state of the real FMA smoke run and confirmed that the `train.py` elapsed time has advanced to 18:22. | ||
| 5 | - Updated `docs/session-handoff.md` and `docs/changelist-2026-06-02.md` to sync the later live evidence. | ||
| 6 | |||
| 7 | Verification results: | ||
| 8 | - `ps -p 311629 -o pid,etime,%cpu,%mem,cmd` => `ELAPSED=18:22` | ||
| 9 | - Still no new `build-index/evaluate` processes | ||
| 10 | - `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true` | ||
| 11 | - `fma_models_smoke/` still contains only the directory itself | ||
| 12 | |||
| 13 | Conclusions: | ||
| 14 | - The full real FMA smoke run is still advancing within the epoch, with no sign of interruption. | ||
| 15 | - As of this checkpoint, there is still no first model file and no evidence of a downstream stage transition. | ||
| 16 | |||
| 1 | ## 2026-06-02 Real FMA smoke fresh evidence: 17:07 checkpoint | 17 | ## 2026-06-02 Real FMA smoke fresh evidence: 17:07 checkpoint |
| 2 | 18 | ||
| 3 | Completed: | 19 | Completed: | ... | ... |
| ... | @@ -161,3 +161,10 @@ cd /workspace/acr-engine | ... | @@ -161,3 +161,10 @@ cd /workspace/acr-engine |
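| 161 | - Latest live evidence has advanced to: `train.py ELAPSED=17:07`. | 161 | - Latest live evidence has advanced to: `train.py ELAPSED=17:07`. |
| 162 | - Still no model files, and no switch to `build-index/evaluate`. | 162 | - Still no model files, and no switch to `build-index/evaluate`. |
| 163 | - The manifest validation result is unchanged and still passes. | 163 | - The manifest validation result is unchanged and still passes. |
| 164 | |||
| 165 | |||
| 166 | ## 12:15 UTC elapsed-time progress addendum | ||
| 167 | |||
| 168 | - Latest live evidence has advanced to: `train.py ELAPSED=18:22`. | ||
| 169 | - Still no model files, and no switch to `build-index/evaluate`. | ||
| 170 | - The manifest re-check still passes, with statistics unchanged. | ... | ... |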
| ... | @@ -145,6 +145,28 @@ | ... | @@ -145,6 +145,28 @@ |
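| 145 | - The current smoke run is still advancing within the first epoch. | 145 | - The current smoke run is still advancing within the first epoch. |
| 146 | - As of 12:14 UTC, it has not yet reached saving the first model file or the downstream retrieval/evaluation stages. | 146 | - As of 12:14 UTC, it has not yet reached saving the first model file or the downstream retrieval/evaluation stages. |
| 147 | 147 | ||
| 148 | ### Further fresh evidence of progress (2026-06-02 12:15 UTC) | ||
| 149 | |||
| 150 | - The real FMA smoke run has advanced to: | ||
| 151 | - `train.py ELAPSED=18:22` | ||
| 152 | - `%CPU≈615` | ||
| 153 | - `%MEM≈10.5` | ||
| 154 | - The process structure still shows no stage transition: | ||
| 155 | - `PID=311494`: `external_adapters.py smoke-local fma ...` | ||
| 156 | - `PID=311629`: `train.py --data /tmp/fma_real_smoke_stopcheck/fma/manifests ...` | ||
| 157 | - Still no new `build-index` / `evaluate` processes. | ||
| 158 | - `fma_models_smoke/` still contains only the directory itself, with no model files. | ||
| 159 | - The manifest re-check still passes: | ||
| 160 | - `ok=true` | ||
| 161 | - `catalog_references=8000` | ||
| 162 | - `train_queries=6401` | ||
| 163 | - `test_queries=1593` | ||
| 164 | - `val_queries=0` | ||
| 165 | |||
| 166 | This shows: | ||
| 167 | - The run is still only advancing inside epoch 1. | ||
| 168 | - As of 12:15 UTC, there is still no evidence of a first model file or of the later retrieval/evaluation stages. | ||
| 169 | |||
| 148 | ### First-priority actions after restart | 170 | ### First-priority actions after restart |
| 149 | 171 | ||
| 150 | 1. First check whether the real FMA smoke run has finished: | 172 | 1. First check whether the real FMA smoke run has finished: | ... | ... |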