Commits · 47390fe9c391192b1d74f806a61417622ca72025 · wanghai-tech / hikoon-ACR

02 Jun, 2026 40 commits

Record fresh FMA smoke verification before epoch completion · 47390fe9

Update the handoff package with newer runtime evidence so the next session can distinguish a still-progressing epoch from a hung pipeline while waiting for the first saved model file.

Constraint: Verification had to rely on live process state because Epoch 1 has not completed yet
Rejected: Leave the prior checkpoint as-is | would force the next session to re-check whether progress continued
Confidence: high
Scope-risk: narrow
Directive: Continue checking for the first transition into saved model output, build-index, or evaluate before drawing quality conclusions
Tested: ps on PID 311629; process scan for smoke-local/build-index/evaluate; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests; find on /tmp/fma_real_smoke_stopcheck/fma_models_smoke
Not-tested: Final FMA smoke report and accuracy metrics

authored 2026-06-02 20:12:18 +0800
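The liveness checks listed under Tested in 47390fe9 (a ps lookup on the training PID plus a find over the model output directory) can be approximated in a few lines. This is a minimal sketch, not the project's tooling: the function names are hypothetical, the check is POSIX-only, and the `*.pt` pattern is an assumption based on the `best_model.pt` file named elsewhere in this log.

```python
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def saved_models(model_dir, pattern="*.pt"):
    """List model files under model_dir; the '*.pt' pattern is an assumption."""
    directory = Path(model_dir)
    return sorted(directory.rglob(pattern)) if directory.is_dir() else []


def smoke_state(pid, model_dir):
    """Classify the run as model-saved, still-training, or stopped-without-model."""
    if saved_models(model_dir):
        return "model-saved"
    return "still-training" if pid_alive(pid) else "stopped-without-model"
```

Calling `smoke_state(311629, "/tmp/fma_real_smoke_stopcheck/fma_models_smoke")` would separate a still-progressing epoch from a run that died before writing a model, which is the distinction this checkpoint is meant to preserve. It cannot tell a hung process from a slow one; that still needs the progress output.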

Preserve live FMA smoke state for fast session restart · 60e0f9e3

Capture the current real-FMA CPU smoke checkpoint, restart path, and delivery handoff so the next session can resume without re-diagnosing an expected long-running training stage.

Constraint: Real FMA smoke is still running on CPU with no GPU available
Rejected: Wait for final smoke completion before documenting | would delay a usable handoff artifact
Confidence: high
Scope-risk: narrow
Directive: Keep staging explicit; do not include datasets, smoke outputs, checkpoints, or caches
Tested: git diff review; live process check; validate-splits on /tmp/fma_real_smoke_stopcheck/fma/manifests
Not-tested: Final FMA smoke metrics after Epoch 1 completion

authored 2026-06-02 20:10:59 +0800

Capture real FMA smoke execution evidence for restart handoff · fd574b22

Constraint: This checkpoint records running-smoke evidence only and must not stage data, model artifacts, or tmp outputs
Rejected: Wait for the full real FMA smoke to finish before updating handoff docs | The running-state evidence is already valuable for the next session and should not be lost
Confidence: high
Scope-risk: narrow
Directive: Keep future restart notes aligned with the live smoke status and continue using explicit file staging
Tested: Re-verified real FMA smoke is running on CPU, manifests validate, and the documented no-GPU condition explains the long training phase
Not-tested: Did not wait for Epoch 1 completion, model checkpoint emission, or downstream build-index/evaluate completion

authored 2026-06-02 20:07:21 +0800

Reduce restart noise further by ignoring common untracked Python cache artifacts · b90754c6

Constraint: Limit this checkpoint to ignore rules and handoff notes; do not change tracked artifact history
Rejected: Expand ignore coverage to all noisy data trees immediately | This pass only suppresses well-understood untracked cache noise
Confidence: high
Scope-risk: narrow
Directive: Keep ignore changes incremental and distinguish between untracked cache noise and already-tracked historical artifacts
Tested: Confirmed untracked __pycache__ and .pyc noise disappeared from git status after the ignore update
Not-tested: Did not rewrite tracking state for already-versioned cache or data artifacts

authored 2026-06-02 19:12:39 +0800

Reduce restart noise by ignoring known local smoke artifacts · 0184cb37

Constraint: Limit this checkpoint to ignore rules and handoff notes; do not alter dataset contents
Rejected: Ignore broad data or cache trees immediately | This pass only suppresses confirmed local-generated noise with low risk
Confidence: high
Scope-risk: narrow
Directive: Keep adding ignore rules incrementally and only for artifacts proven to be local/generated noise
Tested: Confirmed the targeted .omx wait files and real-smoke CSV no longer appear in git status after the ignore update
Not-tested: Did not broaden ignore coverage to larger data/cache trees in this checkpoint

authored 2026-06-02 19:11:52 +0800

Unify the first runnable command across all primary restart entrypoints · db60ba0f

Constraint: Limit this checkpoint to handoff and changelog documentation consistency
Rejected: Leave small formatting mismatches across entrypoints | Restart guidance should be byte-level easy to copy and compare
Confidence: high
Scope-risk: narrow
Directive: Keep the first verification command identical across AGENT, README, session handoff, and delivery handoff
Tested: Verified all four primary entrypoints contain the same runnable command block
Not-tested: No code or training path executed in this consistency-only checkpoint

authored 2026-06-02 19:10:21 +0800
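The byte-level consistency check described in db60ba0f (the same runnable command block in all four entrypoints) could be automated along these lines. This is a sketch under stated assumptions: the entrypoints are taken to be Markdown files, the command is taken to be the first fenced block in each, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Built at runtime so this sketch can itself sit inside a fenced block.
FENCE = "`" * 3
BLOCK_RE = re.compile(rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}", re.DOTALL | re.MULTILINE)


def first_code_block(text):
    """Return the body of the first fenced block, or None if there is none."""
    match = BLOCK_RE.search(text)
    return match.group(1) if match else None


def commands_agree(doc_paths):
    """Return (all_identical, blocks_by_path) for the first block of each doc."""
    blocks = {
        str(path): first_code_block(Path(path).read_text(encoding="utf-8"))
        for path in doc_paths
    }
    values = list(blocks.values())
    same = all(v is not None for v in values) and len(set(values)) == 1
    return same, blocks
```

The comparison is an exact string match, so a trailing space or a different prompt character counts as a mismatch, in line with the "byte-level easy to copy and compare" goal stated in the Rejected trailer.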

Keep every primary handoff entry aligned on the first runnable verification command · 8659ce9e

Constraint: Restrict this checkpoint to handoff documentation consistency only
Rejected: Leave delivery handoff behind the newer restart guidance | All primary restart entrypoints should expose the same first verification command
Confidence: high
Scope-risk: narrow
Directive: Keep the first runnable command identical across AGENT, README, session handoff, and delivery handoff
Tested: Rechecked relative links in the updated delivery handoff and changelog docs
Not-tested: No code or training path executed in this handoff-consistency checkpoint

authored 2026-06-02 19:08:46 +0800

Synchronize the shortest runnable restart command into agent memory · cfdd1765

Constraint: Limit this checkpoint to memory and handoff documentation only
Rejected: Keep the restart command only in README | New sessions should see the first verification command directly in AGENT memory too
Confidence: high
Scope-risk: narrow
Directive: Keep AGENT memory focused on restart-critical commands and avoid duplicating full workflow specs there
Tested: Rechecked 174 relative links across AGENT, changelog, and handoff docs
Not-tested: No code or training path executed in this memory-only checkpoint

authored 2026-06-02 19:08:08 +0800
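Many of the documentation checkpoints in this log report rechecking relative links (174 in cfdd1765, 215 in later commits). The log does not show how that check is run; the following is a minimal sketch of one way to do it, assuming the docs are Markdown and use inline `[text](target)` links. The function name and return shape are hypothetical.

```python
import re
from pathlib import Path

# Inline Markdown links and images: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Targets with a URL scheme (https:, mailto:, ...) are not relative links.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(doc_paths):
    """Return (links_checked, broken), where broken is a list of (doc, target)."""
    checked = 0
    broken = []
    for doc in map(Path, doc_paths):
        text = doc.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL or same-page anchor
            checked += 1
            relative = target.split("#", 1)[0]  # drop any fragment
            if not (doc.parent / relative).exists():
                broken.append((str(doc), target))
    return checked, broken
```

The first element of the result corresponds to the "N relative links" count quoted in the Tested trailers; an empty second element means every relative target resolved on disk. Reference-style links and HTML anchors are not covered by this sketch.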

Add the shortest runnable restart command to the docs overview and reconfirm the offline smoke · 74313c01

Constraint: Restrict this checkpoint to navigation and handoff documentation backed by fresh local verification
Rejected: Keep restart guidance read-only without a first command to run | New sessions benefit from an immediate executable sanity check
Confidence: high
Scope-risk: narrow
Directive: Keep README focused on compressed restart guidance and use the offline smoke only as an environment and chain sanity check
Tested: Re-ran business_export_offline_smoke.py successfully and rechecked 215 relative links across the updated docs
Not-tested: Did not connect to a live business export or run full training/evaluation beyond dry-run

authored 2026-06-02 19:07:18 +0800

Make the docs overview self-consistent and add the shortest restart reading path · d3082ce2

Constraint: Restrict this checkpoint to navigation documentation only
Rejected: Leave the overview mismatch and rely on users to infer reading order | Restart sessions should get a direct, explicit path
Confidence: high
Scope-risk: narrow
Directive: Keep README focused on compressed navigation and restart order, not on duplicating full specs
Tested: Rechecked 215 relative links across the updated overview, changelog, and handoff docs
Not-tested: No code or training path executed in this navigation-only checkpoint

authored 2026-06-02 19:06:14 +0800

Expose the business-data intake chain directly from the docs overview · ec59c9b1

Constraint: Keep this checkpoint limited to navigation docs and preserve the condensed doc structure
Rejected: Keep the new business-export material discoverable only through deep links | New sessions should find the intake chain from the overview immediately
Confidence: high
Scope-risk: narrow
Directive: Maintain README as the compressed navigation surface and avoid expanding it into another full spec
Tested: Rechecked 211 relative links across the updated overview, changelog, and handoff docs
Not-tested: No code or training path executed in this navigation-only checkpoint

authored 2026-06-02 19:05:18 +0800

Record the proven offline smoke so the handoff reflects executable evidence · 55974514

Constraint: Limit this checkpoint to documentation updates backed by already-collected local evidence
Rejected: Leave the smoke result only in transient chat output | The next session needs the proof captured in repo-native handoff files
Confidence: high
Scope-risk: narrow
Directive: Keep treating the offline smoke as an integration proof, not as a substitute for real business-data validation
Tested: Rechecked 183 relative links and documented the successful offline smoke summary already verified locally
Not-tested: No new code path executed in this documentation-only checkpoint

authored 2026-06-02 19:04:01 +0800

Prove the offline business-export chain with a runnable smoke over local audio · 7eff944b

Constraint: Keep verification offline-only and avoid touching real databases or production assets
Rejected: Stop at manifest generation without execution evidence | A dry-run smoke gives the next session stronger handoff confidence
Confidence: high
Scope-risk: narrow
Directive: Stage local sample audio inside the smoke workspace so manifest paths remain self-contained and reproducible
Tested: Ran business_export_offline_smoke.py end-to-end; verified normalize/build summaries and train.py --dry-run success; rechecked adapter doc links
Not-tested: Did not run full training/evaluation on live business exports or connect to any database

authored 2026-06-02 19:02:36 +0800

Finish the offline business-export chain by generating project manifests directly from normalized rows · 3bdc0139

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave final manifest shaping as a manual next-session task | The handoff is stronger when catalog/train/test/val can already be produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat these generated manifests as integration-stage scaffolds and validate final field policy again before production data ingestion
Tested: Ran build_business_project_manifests.py on normalized sample data and verified catalog/train/test/val structure; rechecked 70 relative links
Not-tested: Did not run the generated manifests through full training/evaluation against live business audio

authored 2026-06-02 18:59:32 +0800

Complete the business-export chain by splitting manifest-ready rows into role-specific lists · b9feaccc

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline

authored 2026-06-02 18:58:03 +0800
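The role split described in b9feaccc, combined with the rule from 51d789e1 that ambiguous assets default to excluded, can be illustrated as follows. This is not the code of split_business_manifest_ready.py, which this log does not show; the `role` field name and the row shape are assumptions made for the example.

```python
ROLES = ("reference", "query", "excluded")


def split_by_role(rows, role_field="role"):
    """Group rows into reference/query/excluded lists.

    Any row whose role is missing, empty, or unrecognized is treated as
    excluded, so nothing ambiguous reaches the reference or query lists.
    """
    buckets = {role: [] for role in ROLES}
    for row in rows:
        role = str(row.get(role_field) or "").strip().lower()
        if role not in ("reference", "query"):
            role = "excluded"
        buckets[role].append(row)
    return buckets
```

With three rows tagged `reference`, `query`, and an unknown value, the result holds one row per list, matching the 1 reference + 1 query + 1 excluded outcome reported in the Tested trailer.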

Turn business export guidance into a runnable normalization step for the next session · b5981c79

Constraint: Keep this checkpoint offline-only and avoid touching real databases, datasets, or model artifacts
Rejected: Stop at static CSV/JSONL examples only | The next session needs an executable normalization path, not just samples
Confidence: high
Scope-risk: narrow
Directive: Treat normalized JSONL as manifest-ready staging output and keep final manifest shaping explicit in the integration step
Tested: Ran normalize_business_export.py on the sample CSV and JSONL inputs; verified 3 output rows each; rechecked 71 relative links
Not-tested: Did not run against a live business export or connect to any database

authored 2026-06-02 18:57:07 +0800

Provide export cookbook samples so business tables can flow into manifests without guesswork · b7d4b1b6

Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
Confidence: high
Scope-risk: narrow
Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
Not-tested: Did not connect to a production database or execute a live export

authored 2026-06-02 18:56:00 +0800

Make business asset tables exportable into manifest and role mapping templates · 51d789e1

Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
Confidence: high
Scope-risk: narrow
Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
Not-tested: Did not connect to a real database or run a live export in this checkpoint

authored 2026-06-02 18:54:54 +0800

Map business asset types into runnable training and bucket guidance for the next session · 8739bf35

Constraint: Keep this checkpoint documentation-first and avoid staging dataset, cache, or model artifacts
Rejected: Leave the asset-type strategy implicit in chat only | The next session needs repo-native guidance and templates
Confidence: high
Scope-risk: narrow
Directive: Treat type-based buckets as a starting scaffold and keep hard-negative curation manual until evidence supports automation
Tested: Parsed both bucket JSON templates and rechecked 104 relative links across the new docs
Not-tested: Did not run a fresh business-type benchmark in this checkpoint

authored 2026-06-02 18:53:40 +0800

Provide a runnable semantic-bucket template so the next benchmark step can start immediately · 75fa5e93

Constraint: Keep the checkpoint lightweight and avoid touching dataset or model artifacts
Rejected: Wait to add buckets until automatic semantic labeling exists | Manual curated buckets are enough to unblock the next session now
Confidence: high
Scope-risk: narrow
Directive: Use the template as a curated benchmark scaffold, not as evidence that filenames imply semantics
Tested: Parsed the new JSON template; ran ab_smoke_bucketed.py --help; rechecked targeted relative links
Not-tested: Did not launch a new semantic bucket benchmark run in this checkpoint

authored 2026-06-02 18:51:59 +0800

Capture the finished bucket benchmark and handoff state for the next session · 1bdca61b

Constraint: Avoid staging datasets, smoke artifacts, /tmp outputs, and caches
Rejected: Delay handoff until larger semantic buckets exist | User asked for immediate delivery and resumability now
Confidence: high
Scope-risk: narrow
Directive: Treat toy prefix buckets as a methodology baseline, not a product conclusion
Tested: Verified /tmp/ab_smoke_bucketed_smoke/report.json and bucket_report.json outputs; reviewed targeted git diff
Not-tested: No new training or benchmark execution in this documentation-only checkpoint

authored 2026-06-02 18:50:23 +0800

Promote bucket benchmarking from a plan to a runnable baseline · c1a22cbb

Constraint: The cap48/cap64 reversal means strategy guidance can no longer rely on a single overall subset result
Rejected: Keep bucket benchmarking as a doc-only next step | The repo now needs an executable baseline so later sessions can measure scale/style divergence directly
Confidence: high
Scope-risk: moderate
Directive: Treat ab_smoke_bucketed.py as the canonical seed for style-aware evaluation, and expand bucket definitions before revisiting global default-strategy claims
Tested: Verified acr-engine/scripts/ab_smoke_bucketed.py passes py_compile; verified first bucket prefix_000_a produced bucket_report.json with hybrid 4/1.0/1.0 and high_energy 3/1.0/1.0; verified second bucket execution is in progress
Not-tested: Full multi-bucket report.json completion, richer bucket definitions, and bucket-level aggregate conclusions

authored 2026-06-02 18:48:23 +0800

Record the cap64 reversal once the larger benchmark finished · e49dc0b9

Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
Confidence: high
Scope-risk: moderate
Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design

authored 2026-06-02 18:44:58 +0800

Preserve proof that cap64 hybrid advanced into evaluation before results landed · 8f2e6016

Constraint: The cap64 run is still incomplete, so only verified hybrid index-complete and evaluation-running evidence can be recorded safely now
Rejected: Wait for hybrid eval.json before checkpointing | Would lose the verified handoff that hybrid indexing finished and evaluate.py is already running
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 high_energy and hybrid checkpoints symmetric so the final comparison can be written from docs alone if needed
Tested: Verified hybrid reference_progress.json shows 64 refs, 657 windows, 192-d embeddings, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:43:15 +0800

Preserve proof that cap64 hybrid training fully finished before scoring lands · fee2a39c

Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded now without overstating results
Rejected: Wait for hybrid eval before checkpointing | Would lose the stronger handoff evidence that the full hybrid epoch already completed
Confidence: high
Scope-risk: narrow
Directive: Keep distinguishing hybrid training-complete from hybrid index/eval completion until report.json lands
Tested: Verified live session output shows hybrid Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests while hybrid eval.json and report.json remain absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:41:16 +0800

Preserve proof that cap64 hybrid advanced into indexing · 65cc45c2

Constraint: The cap64 run is still in progress, so this checkpoint can only record verified hybrid stage transitions, not final comparisons
Rejected: Wait for hybrid eval before checkpointing | Would lose the verified evidence that hybrid training finished and indexing has already started
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 branch checkpoints symmetric so high_energy and hybrid can be compared later without re-reading process history
Tested: Verified active process is run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified /tmp/ab_smoke_seg_cap64_top2/hybrid/fma_models_smoke/best_model.pt exists; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:39:26 +0800

Preserve proof that cap64 has entered the hybrid training branch · df7bd04b

Constraint: The cap64 run is still incomplete, so only branch-transition evidence can be recorded safely at this point
Rejected: Wait for the hybrid eval before checkpointing | Would lose the verified handoff that execution has moved beyond high_energy into hybrid training
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 branch progression explicit so the next session can resume from the current strategy leg without re-inspection
Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified active processes show external_adapters.py on /tmp/ab_smoke_seg_cap64_top2/hybrid and train.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:37:43 +0800

Preserve the first cap64 score before the second strategy finishes · 398d12c3

Constraint: The cap64 run has only produced the high_energy leg so far, so any larger conclusion must wait for hybrid and the final report
Rejected: Wait for report.json before checkpointing | Would lose the verified cap64 high_energy score and the proof that execution has already switched into the hybrid branch
Confidence: high
Scope-risk: narrow
Directive: Do not compare cap64 strategy winners until both legs and the final report land; treat the current 0.625 high_energy score as an intermediate checkpoint only
Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified progress.json records the same result; verified the active process has switched to the hybrid smoke-local branch and report.json is still absent
Not-tested: Final cap64 hybrid metrics, final report.json, and any cap64-based strategy conclusion

authored 2026-06-02 18:36:56 +0800

Preserve proof that cap64 advanced into evaluation before results landed · 3243aebb

Constraint: The cap64 run is still active, so this checkpoint can only record stage completion evidence rather than final benchmark conclusions
Rejected: Wait for eval.json or report.json before committing | Would lose the verified handoff that indexing finished and evaluate.py is now running
Confidence: high
Scope-risk: narrow
Directive: Keep stage checkpoints explicit—training complete, index complete, evaluation running, report complete—until cap64 fully settles
Tested: Verified reference_progress.json shows 64 refs, 657 windows, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests; verified high_energy eval.json and report.json are still absent
Not-tested: Final cap64 high_energy metrics, hybrid branch execution, and post-cap64 strategy guidance

authored 2026-06-02 18:35:47 +0800
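The stage milestones that 3243aebb asks future checkpoints to keep explicit can be inferred from which artifacts exist under the work root. The sketch below is an illustration, not project code: the `fma_models_smoke/best_model.pt`, `fma_reports_smoke/eval.json`, and root-level `report.json` locations follow paths quoted in other entries of this log, while the location of `reference_progress.json` and its `status` key are assumptions.

```python
import json
from pathlib import Path


def strategy_stage(root, strategy, reference_progress=None):
    """Return the furthest stage a strategy leg has reached, judged by files only.

    reference_progress is an optional path to reference_progress.json; its
    location and its "status" field are assumptions, so it is passed in.
    """
    root = Path(root)
    leg = root / strategy
    if (root / "report.json").exists():
        return "report complete"
    if (leg / "fma_reports_smoke" / "eval.json").exists():
        return "evaluation complete"
    if reference_progress is not None and Path(reference_progress).exists():
        progress = json.loads(Path(reference_progress).read_text(encoding="utf-8"))
        if progress.get("status") == "complete":
            return "index complete"
    if (leg / "fma_models_smoke" / "best_model.pt").exists():
        return "training complete"
    return "training or earlier"
```

For example, `strategy_stage("/tmp/ab_smoke_seg_cap64_top2", "high_energy", progress_path)` would report "index complete" in the state this commit records. File presence alone cannot show that evaluate.py is currently running, so the "evaluation running" milestone still requires a process check.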

Preserve proof that cap64 training finished before indexing completes · efd63cd9

Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded without overstating results
Rejected: Keep only the older build-index note | The live session now proves the entire high_energy epoch finished, which is stronger handoff evidence
Confidence: high
Scope-risk: narrow
Directive: Distinguish clearly between training-complete, indexing-complete, and report-complete milestones in future cap64 checkpoints
Tested: Verified live session output shows high_energy Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests
Not-tested: Final cap64 eval metrics, hybrid branch progress, and report.json generation

authored 2026-06-02 18:34:10 +0800

Preserve the cap64 stage transition before the larger run finishes · 3d12fc0a

Constraint: The cap64 benchmark is still running, so only verified stage-transition evidence can be documented safely
Rejected: Wait for cap64 completion before checkpointing | Would leave the next session without proof that the run advanced from training into build-index
Confidence: high
Scope-risk: narrow
Directive: Keep recording cap64 milestones as they happen, but avoid updating winner guidance until report.json lands
Tested: Verified cap64 processes are active, confirmed the high_energy branch advanced from train.py to run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and confirmed report.json is still absent
Not-tested: Final cap64 scores, hybrid branch progression, and any post-cap64 strategy conclusion

authored 2026-06-02 18:31:37 +0800

Preserve proof that the cap64 benchmark has started before it finishes · ef9b24f8

Constraint: The new cap64 run is still in-flight, so only startup and stage-transition evidence can be documented safely
Rejected: Wait for cap64 results before checkpointing | Would leave the next session without a verified handoff that the larger benchmark is already running
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 artifacts out of git and update strategy guidance only after report.json lands
Tested: Verified the cap64 ab_smoke process is running, confirmed the high_energy smoke-local branch entered train.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and recorded the active work root and parameters in docs
Not-tested: Final cap64 metrics, hybrid branch execution, and any post-cap64 strategy conclusion

authored 2026-06-02 18:30:31 +0800

Promote cap48 guidance once the third seed confirmed the stable winner · d1f13203

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign

authored 2026-06-02 18:29:00 +0800

Preserve the hybrid seed999 score before the second strategy finishes · d13a3b8b

Constraint: The cap48 seed=999 run has only completed the hybrid leg, so the three-seed aggregate is still incomplete
Rejected: Wait for high_energy to finish before checkpointing | Would risk losing the verified hybrid seed999 score from the active Ralph session
Confidence: high
Scope-risk: narrow
Directive: Keep recording verified partial benchmark milestones, but do not revise default-strategy guidance until both strategies and the final report are available
Tested: Verified hybrid eval.json reports num_queries=24, top1=0.875, topk=1.0; verified progress.json records the same result; verified high_energy is still running and report.json is still absent
Not-tested: Final high_energy seed999 metrics, final report.json, and updated three-seed aggregate

authored 2026-06-02 18:25:51 +0800

Preserve fresh benchmark evidence before the evaluation finishes · bdc04f72

Constraint: The running cap48 seed=999 benchmark has not emitted its final report yet, so only in-flight evidence can be recorded safely
Rejected: Claim a new three-seed conclusion now | The aggregate would be speculative without report.json and eval outputs
Confidence: high
Scope-risk: narrow
Directive: When a long benchmark is still active, checkpoint stage evidence explicitly and wait for report.json before changing strategy guidance
Tested: Verified process tree shows hybrid moved from build-index to evaluate.py; verified reference_progress.json reports 48 refs, 491 windows, 192-d embeddings, and complete status; verified report.json is still absent
Not-tested: Final hybrid eval metrics, subsequent high_energy run, and final three-seed aggregate

authored 2026-06-02 18:22:40 +0800

Preserve restartable delivery state before the long benchmark finishes · 0d40b05c

Constraint: The cap48 seed=999 benchmark is still running, so this checkpoint must avoid unverified algorithm conclusions
Rejected: Wait for the CPU benchmark to finish | Would delay handoff and leave the next session without a clean restart package
Confidence: high
Scope-risk: narrow
Directive: Keep future doc-only checkpoints surgically staged and do not add data/raw, external_smoke, /tmp outputs, or model artifacts
Tested: Verified staged diff only includes AGENT memory, handoff, changelog, and changelist docs; confirmed /tmp cap48 seed=999 report is not ready yet
Not-tested: The in-flight cap48 seed=999 benchmark result and any follow-up aggregate metrics

authored 2026-06-02 18:20:30 +0800

Promote the cap48 discussion from single runs to two-seed aggregates · ae0d14a5

Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates

authored 2026-06-02 18:15:34 +0800
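The mean/min/max/stdev aggregate mentioned in ae0d14a5 needs only the standard library. The helper below is a hypothetical sketch, not the script used for the reported numbers, and it assumes the sample standard deviation is the one intended by "stdev".

```python
import statistics


def aggregate_top1(scores):
    """Summarize per-seed top1 scores for one strategy."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "n": len(scores),
        "mean": statistics.fmean(scores),
        "min": min(scores),
        "max": max(scores),
        # Sample standard deviation is undefined for a single run; report 0.0.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

A strategy that repeats the same score on every seed, as this log later reports for high_energy at 0.9167, gives a stdev of zero: `aggregate_top1([0.9167, 0.9167, 0.9167])`. Reading each seed's top1 from its eval.json would replace the hard-coded list in practice.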

Reframe the cap48 finding as seed-sensitive after the second rerun · e519dab7

Persist the completed seed123 benchmark showing hybrid ahead again, and update the strategy guidance from single-run winner claims to a multi-seed interpretation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep framing cap48 as a stable high_energy win | The second seed materially weakens that interpretation
Confidence: high
Scope-risk: narrow
Directive: Base the hybrid vs high_energy default decision on aggregated multi-seed evidence, not any single cap48 run
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/report.json; verified high_energy eval.json; verified docs now record hybrid=24/0.9583/1.0 and high_energy=24/0.9167/1.0 for seed123
Not-tested: Formal aggregation across multiple seeds beyond these two cap48 runs

authored 2026-06-02 18:13:48 +0800

Record the first cap48 seed123 hybrid score for the multi-seed check · a3a5303f

Persist the newly finished cap48 seed123 hybrid result so the second-seed validation run now has measured evidence instead of only a runtime checkpoint.

Constraint: seed123 high_energy and the final report are still pending
Rejected: Wait for the full seed123 report before updating docs | Would leave the multi-seed evidence stale across sessions
Confidence: high
Scope-risk: narrow
Directive: Replace the seed123 partial section with the final two-strategy ranking once high_energy eval and report.json land
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=24/0.9583/1.0 and high_energy still in build-index
Not-tested: Final seed123 comparison because high_energy has not finished yet

authored 2026-06-02 18:10:08 +0800

Refresh the second cap48 seed checkpoint now that hybrid reached evaluation · ef7e4493

Update the handoff and changelog with the newer seed123 runtime milestone so later sessions know the hybrid lane has advanced from build-index into capped evaluation.

Constraint: No measured seed123 score is available yet, only a later execution milestone
Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
Confidence: high
Scope-risk: narrow
Directive: Replace the seed123 runtime note with measured scores as soon as hybrid eval.json or report.json land
Tested: Verified active seed123 hybrid evaluate.py process; verified docs now record seed123 current phase as evaluate.py --max-queries 24
Not-tested: Seed123 strategy scores because hybrid eval.json has not landed yet

authored 2026-06-02 18:08:52 +0800