Advance handoff evidence with continued epoch progress
Keep the restart package aligned with the newest observed runtime so the next session inherits proof that the real FMA smoke continues moving forward inside Epoch 1.

Constraint: The only new evidence available was live process progress because training has not finished the epoch
Rejected: Reuse the 15:12 checkpoint | would leave handoff evidence stale by another monitoring cycle
Confidence: high
Scope-risk: narrow
Directive: Keep watching for first model output or stage transition before changing any roadmap conclusions
Tested: ps on PID 311629; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests; find on /tmp/fma_real_smoke_stopcheck/fma_models_smoke
Not-tested: Final checkpoint write, build-index, evaluate, and report generation
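The `ps` and `find` checks listed under `Tested:` could be approximated with a small helper like the one below. This is a minimal sketch, not the project's tooling: the function names `etime_to_seconds` and `model_files` are hypothetical, and it assumes the `[[DD-]hh:]mm:ss` elapsed-time format that `ps -o etime` prints.

```python
from __future__ import annotations

import os
import re

# Assumed format of `ps -o etime`: [[DD-]hh:]mm:ss
_ETIME_RE = re.compile(r"(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)")


def etime_to_seconds(etime: str) -> int:
    """Convert a ps elapsed-time string such as '17:07' to seconds."""
    match = _ETIME_RE.fullmatch(etime.strip())
    if match is None:
        raise ValueError(f"unrecognized etime value: {etime!r}")
    # Missing day/hour fields are treated as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def model_files(root: str) -> list[str]:
    """Return every regular file found under root (empty if none or missing)."""
    found: list[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        found.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(found)
```

Comparing the two checkpoints this way, `etime_to_seconds("17:07") - etime_to_seconds("15:12")` gives 115 seconds of additional runtime, and an empty list from `model_files(...)` corresponds to the "only the directory itself" observation.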
Showing 3 changed files with 45 additions and 0 deletions
| 1 | ## 2026-06-02 Real FMA smoke fresh evidence 17:07 checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Refreshed the real FMA smoke runtime state again and confirmed that `train.py` elapsed has advanced to 17:07. | ||
| 5 | - Updated `docs/session-handoff.md` and `docs/changelist-2026-06-02.md` to record the later live evidence. | ||
| 6 | |||
| 7 | Verification results: | ||
| 8 | - `ps -p 311629 -o pid,etime,%cpu,%mem,cmd` => `ELAPSED=17:07` | ||
| 9 | - No new `build-index/evaluate` processes have appeared yet | ||
| 10 | - `validate-splits /tmp/fma_real_smoke_stopcheck/fma/manifests` => `ok=true` | ||
| 11 | - `fma_models_smoke/` still contains only the directory itself | ||
| 12 | |||
| 13 | Conclusions: | ||
| 14 | - The real FMA full-dataset smoke is still progressing within the epoch; there is no evidence of a pipeline interruption. | ||
| 15 | - As of this checkpoint, neither a first model file nor any evidence of the downstream evaluation stage has been produced. | ||
| 16 | |||
| 1 | ## 2026-06-02 Real FMA smoke fresh evidence 15:12 checkpoint | 17 | ## 2026-06-02 Real FMA smoke fresh evidence 15:12 checkpoint |
| 2 | 18 | ||
| 3 | Completed: | 19 | Completed: | ... | ... |
| ... | @@ -154,3 +154,10 @@ cd /workspace/acr-engine | ... | @@ -154,3 +154,10 @@ cd /workspace/acr-engine |
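| 154 | - Current CPU / memory observation: `%CPU≈614`, `%MEM≈10.5`. | 154 | - Current CPU / memory observation: `%CPU≈614`, `%MEM≈10.5`. |
| 155 | - Still no `build-index/evaluate` process, and no first model file. | 155 | - Still no `build-index/evaluate` process, and no first model file. |
| 156 | - This indicates the run is simply still in the in-epoch training stage of the real FMA full-dataset smoke. | 156 | - This indicates the run is simply still in the in-epoch training stage of the real FMA full-dataset smoke. |
| 157 | |||
| 158 | |||
| 159 | ## 12:14 UTC progress update | ||
| 160 | |||
| 161 | - The latest live evidence has advanced to: `train.py ELAPSED=17:07`. | ||
| 162 | - Still no model file, and no switch to `build-index/evaluate`. | ||
| 163 | - The manifest validation result is unchanged and still passes. | ... | ... |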
| ... | @@ -123,6 +123,28 @@ | ... | @@ -123,6 +123,28 @@ |
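| 123 | - The real FMA full-dataset smoke is still progressing inside `Epoch 1`. | 123 | - The real FMA full-dataset smoke is still progressing inside `Epoch 1`. |
| 124 | - As of 12:12 UTC, no first model file has been written to disk and no downstream stage transition has occurred. | 124 | - As of 12:12 UTC, no first model file has been written to disk and no downstream stage transition has occurred. |
| 125 | 125 | ||
| 126 | ### Fresh evidence, refreshed again (2026-06-02 12:14 UTC) | ||
| 127 | |||
| 128 | - The real FMA smoke continues to advance: | ||
| 129 | - `train.py ELAPSED=17:07` | ||
| 130 | - `%CPU≈615` | ||
| 131 | - `%MEM≈10.4` | ||
| 132 | - Only training-stage processes are currently present: | ||
| 133 | - `PID=311494`:`external_adapters.py smoke-local fma ...` | ||
| 134 | - `PID=311629`:`train.py --data /tmp/fma_real_smoke_stopcheck/fma/manifests ...` | ||
| 135 | - Still no new `build-index` / `evaluate` processes observed. | ||
| 136 | - `fma_models_smoke/` still contains only the directory itself, with no model files. | ||
| 137 | - The manifest re-check still passes: | ||
| 138 | - `ok=true` | ||
| 139 | - `catalog_references=8000` | ||
| 140 | - `train_queries=6401` | ||
| 141 | - `test_queries=1593` | ||
| 142 | - `val_queries=0` | ||
| 143 | |||
| 144 | This further indicates: | ||
| 145 | - The current smoke is still advancing steadily within the first epoch. | ||
| 146 | - As of 12:14 UTC, it has not yet reached the point of saving the first model file or the downstream retrieval/evaluation stage. | ||
| 147 | |||
| 126 | ### First-priority actions after restart | 148 | ### First-priority actions after restart |
| 127 | 149 | ||
| 128 | 1. First check whether the real FMA smoke has finished: | 150 | 1. First check whether the real FMA smoke has finished: | ... | ... |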