Commits · 7eff944bedeed205a3bf5d9f2225d954ceb13a61 · wanghai-tech / hikoon-ACR

02 Jun, 2026 40 commits

Prove the offline business-export chain with a runnable smoke over local audio · 7eff944b ...

Constraint: Keep verification offline-only and avoid touching real databases or production assets
Rejected: Stop at manifest generation without execution evidence | A dry-run smoke gives the next session stronger handoff confidence
Confidence: high
Scope-risk: narrow
Directive: Stage local sample audio inside the smoke workspace so manifest paths remain self-contained and reproducible
Tested: Ran business_export_offline_smoke.py end-to-end; verified normalize/build summaries and train.py --dry-run success; rechecked adapter doc links
Not-tested: Did not run full training/evaluation on live business exports or connect to any database

authored 2026-06-02 19:02:36 +0800

Finish the offline business-export chain by generating project manifests directly from normalized rows · 3bdc0139 ...

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave final manifest shaping as a manual next-session task | The handoff is stronger when catalog/train/test/val can already be produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat these generated manifests as integration-stage scaffolds and validate final field policy again before production data ingestion
Tested: Ran build_business_project_manifests.py on normalized sample data and verified catalog/train/test/val structure; rechecked 70 relative links
Not-tested: Did not run the generated manifests through full training/evaluation against live business audio

authored 2026-06-02 18:59:32 +0800

Complete the business-export chain by splitting manifest-ready rows into role-specific lists · b9feaccc ...

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline

authored 2026-06-02 18:58:03 +0800
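
The role split described in b9feaccc can be sketched as follows. This is a minimal illustration, not the contents of split_business_manifest_ready.py: the `role` and `asset_id` field names and the sample rows are assumptions, chosen so the counts match the 1 reference + 1 query + 1 excluded result quoted in the commit.

```python
# Hypothetical sketch of splitting manifest-ready rows into role-specific lists.
# Field names ("asset_id", "role") are assumptions, not the script's real schema.
ROLES = ("reference", "query", "excluded")


def split_by_role(rows):
    """Group rows by role; unrecognized or missing roles go to 'excluded'."""
    lists = {role: [] for role in ROLES}
    for row in rows:
        role = row.get("role")
        lists[role if role in lists else "excluded"].append(row)
    return lists


rows = [
    {"asset_id": "a1", "role": "reference"},
    {"asset_id": "a2", "role": "query"},
    {"asset_id": "a3", "role": None},  # no usable role yet
]
split = split_by_role(rows)
counts = {role: len(items) for role, items in split.items()}
print(counts)
```

Sending rows without a recognized role to the excluded list keeps the split consistent with the "default ambiguous assets to excluded" directive in 51d789e1.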

Turn business export guidance into a runnable normalization step for the next session · b5981c79 ...

Constraint: Keep this checkpoint offline-only and avoid touching real databases, datasets, or model artifacts
Rejected: Stop at static CSV/JSONL examples only | The next session needs an executable normalization path, not just samples
Confidence: high
Scope-risk: narrow
Directive: Treat normalized JSONL as manifest-ready staging output and keep final manifest shaping explicit in the integration step
Tested: Ran normalize_business_export.py on the sample CSV and JSONL inputs; verified 3 output rows each; rechecked 71 relative links
Not-tested: Did not run against a live business export or connect to any database

authored 2026-06-02 18:57:07 +0800
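
A normalization step of the kind b5981c79 describes might look like the sketch below. It is not normalize_business_export.py itself: the column names, the sample data, and the trimming and lowercasing rules are all assumptions. It only shows how CSV and JSONL inputs can be reduced to the same row shape, mirroring the "3 output rows each" check in the commit.

```python
import csv
import io
import json


def normalize_row(raw):
    """Hypothetical rule: lowercase the keys and trim string values."""
    return {
        key.strip().lower(): (value.strip() if isinstance(value, str) else value)
        for key, value in raw.items()
    }


def read_csv_rows(text):
    return [normalize_row(row) for row in csv.DictReader(io.StringIO(text))]


def read_jsonl_rows(text):
    return [normalize_row(json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up sample inputs; the real sample files are not shown in this log.
csv_text = (
    "Asset_ID,Song_ID,Audio_Path\n"
    "a1, s1 ,audio/a1.wav\n"
    "a2,s1,audio/a2.wav\n"
    "a3,,audio/a3.wav\n"
)
jsonl_text = "\n".join([
    '{"asset_id": "a1", "song_id": "s1", "audio_path": "audio/a1.wav"}',
    '{"asset_id": "a2", "song_id": "s1", "audio_path": "audio/a2.wav"}',
    '{"asset_id": "a3", "song_id": "", "audio_path": "audio/a3.wav"}',
])

csv_rows = read_csv_rows(csv_text)
jsonl_rows = read_jsonl_rows(jsonl_text)
# Normalized JSONL staging output, one JSON object per line.
normalized_jsonl = "\n".join(json.dumps(row, sort_keys=True) for row in csv_rows)
```

Both readers funnel through one `normalize_row` function so the staging output does not depend on which export format was supplied.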

Provide export cookbook samples so business tables can flow into manifests without guesswork · b7d4b1b6 ...

Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
Confidence: high
Scope-risk: narrow
Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
Not-tested: Did not connect to a production database or execute a live export

authored 2026-06-02 18:56:00 +0800

Make business asset tables exportable into manifest and role mapping templates · 51d789e1 ...

Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
Confidence: high
Scope-risk: narrow
Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
Not-tested: Did not connect to a real database or run a live export in this checkpoint

authored 2026-06-02 18:54:54 +0800
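
The "default ambiguous assets to excluded" directive in 51d789e1 can be illustrated with a small sketch. The asset type names, the `reviewed` flag, and the mapping itself are hypothetical; the real mapping is whatever print_business_type_mapping.py emits, which this log does not show.

```python
import json

# Hypothetical asset types and roles, for illustration only.
TYPE_TO_ROLE = {
    "studio_master": "reference",
    "user_upload": "query",
}


def assign_role(asset):
    """Return a role only when type, song identity, and review are all settled."""
    role = TYPE_TO_ROLE.get(asset.get("asset_type"))
    # Anything ambiguous stays excluded until manual review confirms it.
    if role is None or not asset.get("song_id") or not asset.get("reviewed"):
        return "excluded"
    return role


assets = [
    {"asset_id": "a1", "asset_type": "studio_master", "song_id": "s1", "reviewed": True},
    {"asset_id": "a2", "asset_type": "user_upload", "song_id": "s1", "reviewed": True},
    {"asset_id": "a3", "asset_type": "user_upload", "song_id": None, "reviewed": False},
    {"asset_id": "a4", "asset_type": "unknown_kind", "song_id": "s2", "reviewed": True},
]
roles = {asset["asset_id"]: assign_role(asset) for asset in assets}
print(json.dumps(roles, indent=2))
```

Making exclusion the fall-through case means a new or misspelled asset type cannot silently enter the reference or query sets.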

Map business asset types into runnable training and bucket guidance for the next session · 8739bf35 ...

Constraint: Keep this checkpoint documentation-first and avoid staging dataset, cache, or model artifacts
Rejected: Leave the asset-type strategy implicit in chat only | The next session needs repo-native guidance and templates
Confidence: high
Scope-risk: narrow
Directive: Treat type-based buckets as a starting scaffold and keep hard-negative curation manual until evidence supports automation
Tested: Parsed both bucket JSON templates and rechecked 104 relative links across the new docs
Not-tested: Did not run a fresh business-type benchmark in this checkpoint

authored 2026-06-02 18:53:40 +0800

Provide a runnable semantic-bucket template so the next benchmark step can start immediately · 75fa5e93 ...

Constraint: Keep the checkpoint lightweight and avoid touching dataset or model artifacts
Rejected: Wait to add buckets until automatic semantic labeling exists | Manual curated buckets are enough to unblock the next session now
Confidence: high
Scope-risk: narrow
Directive: Use the template as a curated benchmark scaffold, not as evidence that filenames imply semantics
Tested: Parsed the new JSON template; ran ab_smoke_bucketed.py --help; rechecked targeted relative links
Not-tested: Did not launch a new semantic bucket benchmark run in this checkpoint

authored 2026-06-02 18:51:59 +0800

Capture the finished bucket benchmark and handoff state for the next session · 1bdca61b ...

Constraint: Avoid staging datasets, smoke artifacts, /tmp outputs, and caches
Rejected: Delay handoff until larger semantic buckets exist | User asked for immediate delivery and resumability now
Confidence: high
Scope-risk: narrow
Directive: Treat toy prefix buckets as a methodology baseline, not a product conclusion
Tested: Verified /tmp/ab_smoke_bucketed_smoke/report.json and bucket_report.json outputs; reviewed targeted git diff
Not-tested: No new training or benchmark execution in this documentation-only checkpoint

authored 2026-06-02 18:50:23 +0800

Promote bucket benchmarking from a plan to a runnable baseline · c1a22cbb ...

Constraint: The cap48/cap64 reversal means strategy guidance can no longer rely on a single overall subset result
Rejected: Keep bucket benchmarking as a doc-only next step | The repo now needs an executable baseline so later sessions can measure scale/style divergence directly
Confidence: high
Scope-risk: moderate
Directive: Treat ab_smoke_bucketed.py as the canonical seed for style-aware evaluation, and expand bucket definitions before revisiting global default-strategy claims
Tested: Verified acr-engine/scripts/ab_smoke_bucketed.py passes py_compile; verified first bucket prefix_000_a produced bucket_report.json with hybrid 4/1.0/1.0 and high_energy 3/1.0/1.0; verified second bucket execution is in progress
Not-tested: Full multi-bucket report.json completion, richer bucket definitions, and bucket-level aggregate conclusions

authored 2026-06-02 18:48:23 +0800

Record the cap64 reversal once the larger benchmark finished · e49dc0b9 ...

Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
Confidence: high
Scope-risk: moderate
Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design

authored 2026-06-02 18:44:58 +0800
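
The scale sensitivity recorded in e49dc0b9 becomes visible when the top1 scores quoted across this log are tabulated by subset size. The sketch below uses only those quoted scores (for cap48, the first run from d82d217a); the dictionary layout and the `winner` helper are illustrative and are not part of the repository.

```python
# top1 per strategy, copied from the commit messages in this log.
# cap48 uses the first (unseeded) run; later seeds are aggregated separately.
top1 = {
    "cap24": {"hybrid": 1.0, "high_energy": 0.8125},
    "cap32": {"hybrid": 0.95, "high_energy": 0.5},
    "cap48": {"hybrid": 0.7917, "high_energy": 0.9167},
    "cap64": {"hybrid": 0.875, "high_energy": 0.625},
}


def winner(scores):
    """Return the strategy name with the highest top1."""
    return max(scores, key=scores.get)


winners = {cap: winner(scores) for cap, scores in top1.items()}
for cap, name in winners.items():
    print(f"{cap}: {name}")
```

The winner flips at cap48 and flips back at cap64, which is why the directive asks for bucketed, style-aware benchmarks before naming a single global default again.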

Preserve proof that cap64 hybrid advanced into evaluation before results landed · 8f2e6016 ...

Constraint: The cap64 run is still incomplete, so only verified hybrid index-complete and evaluation-running evidence can be recorded safely now
Rejected: Wait for hybrid eval.json before checkpointing | Would lose the verified handoff that hybrid indexing finished and evaluate.py is already running
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 high_energy and hybrid checkpoints symmetric so the final comparison can be written from docs alone if needed
Tested: Verified hybrid reference_progress.json shows 64 refs, 657 windows, 192-d embeddings, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:43:15 +0800

Preserve proof that cap64 hybrid training fully finished before scoring lands · fee2a39c ...

Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded now without overstating results
Rejected: Wait for hybrid eval before checkpointing | Would lose the stronger handoff evidence that the full hybrid epoch already completed
Confidence: high
Scope-risk: narrow
Directive: Keep distinguishing hybrid training-complete from hybrid index/eval completion until report.json lands
Tested: Verified live session output shows hybrid Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests while hybrid eval.json and report.json remain absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:41:16 +0800

Preserve proof that cap64 hybrid advanced into indexing · 65cc45c2 ...

Constraint: The cap64 run is still in progress, so this checkpoint can only record verified hybrid stage transitions, not final comparisons
Rejected: Wait for hybrid eval before checkpointing | Would lose the verified evidence that hybrid training finished and indexing has already started
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 branch checkpoints symmetric so high_energy and hybrid can be compared later without re-reading process history
Tested: Verified active process is run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified /tmp/ab_smoke_seg_cap64_top2/hybrid/fma_models_smoke/best_model.pt exists; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:39:26 +0800

Preserve proof that cap64 has entered the hybrid training branch · df7bd04b ...

Constraint: The cap64 run is still incomplete, so only branch-transition evidence can be recorded safely at this point
Rejected: Wait for the hybrid eval before checkpointing | Would lose the verified handoff that execution has moved beyond high_energy into hybrid training
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 branch progression explicit so the next session can resume from the current strategy leg without re-inspection
Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified active processes show external_adapters.py on /tmp/ab_smoke_seg_cap64_top2/hybrid and train.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion

authored 2026-06-02 18:37:43 +0800

Preserve the first cap64 score before the second strategy finishes · 398d12c3 ...

Constraint: The cap64 run has only produced the high_energy leg so far, so any larger conclusion must wait for hybrid and the final report
Rejected: Wait for report.json before checkpointing | Would lose the verified cap64 high_energy score and the proof that execution has already switched into the hybrid branch
Confidence: high
Scope-risk: narrow
Directive: Do not compare cap64 strategy winners until both legs and the final report land; treat the current 0.625 high_energy score as an intermediate checkpoint only
Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified progress.json records the same result; verified the active process has switched to the hybrid smoke-local branch and report.json is still absent
Not-tested: Final cap64 hybrid metrics, final report.json, and any cap64-based strategy conclusion

authored 2026-06-02 18:36:56 +0800

Preserve proof that cap64 advanced into evaluation before results landed · 3243aebb ...

Constraint: The cap64 run is still active, so this checkpoint can only record stage completion evidence rather than final benchmark conclusions
Rejected: Wait for eval.json or report.json before committing | Would lose the verified handoff that indexing finished and evaluate.py is now running
Confidence: high
Scope-risk: narrow
Directive: Keep stage checkpoints explicit—training complete, index complete, evaluation running, report complete—until cap64 fully settles
Tested: Verified reference_progress.json shows 64 refs, 657 windows, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests; verified high_energy eval.json and report.json are still absent
Not-tested: Final cap64 high_energy metrics, hybrid branch execution, and post-cap64 strategy guidance

authored 2026-06-02 18:35:47 +0800
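
The stage checkpoints named in 3243aebb (training complete, index complete, evaluation running, report complete) were verified by hand from files and the process list. A rough sketch of the file-based part of that check is shown below. The `fma_models_smoke/best_model.pt` and `fma_reports_smoke/eval.json` sub-paths are quoted elsewhere in this log; the location of `reference_progress.json` and its `status` key are assumptions, and `infer_stage` is a hypothetical helper, not a script in the repository.

```python
import json
import tempfile
from pathlib import Path


def infer_stage(strategy_dir):
    """Return the furthest milestone with on-disk evidence under strategy_dir."""
    model = strategy_dir / "fma_models_smoke" / "best_model.pt"
    eval_json = strategy_dir / "fma_reports_smoke" / "eval.json"
    # Assumed location and "status" key for the index progress file.
    progress = strategy_dir / "reference_progress.json"

    if eval_json.exists():
        return "evaluation complete"
    if progress.exists() and json.loads(progress.read_text()).get("status") == "complete":
        return "index complete"
    if model.exists():
        return "training complete"
    return "no milestone yet"


# Walk a throwaway directory through the milestones in order.
stages = []
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    stages.append(infer_stage(root))
    (root / "fma_models_smoke").mkdir()
    (root / "fma_models_smoke" / "best_model.pt").write_bytes(b"")
    stages.append(infer_stage(root))
    (root / "reference_progress.json").write_text(json.dumps({"status": "complete"}))
    stages.append(infer_stage(root))
    (root / "fma_reports_smoke").mkdir()
    (root / "fma_reports_smoke" / "eval.json").write_text("{}")
    stages.append(infer_stage(root))
print(stages)
```

File presence cannot distinguish "evaluation running" from "index complete"; the commits establish that state by inspecting the active process, so a helper like this would complement that check rather than replace it.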

Preserve proof that cap64 training finished before indexing completes · efd63cd9 ...

Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded without overstating results
Rejected: Keep only the older build-index note | The live session now proves the entire high_energy epoch finished, which is stronger handoff evidence
Confidence: high
Scope-risk: narrow
Directive: Distinguish clearly between training-complete, indexing-complete, and report-complete milestones in future cap64 checkpoints
Tested: Verified live session output shows high_energy Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests
Not-tested: Final cap64 eval metrics, hybrid branch progress, and report.json generation

authored 2026-06-02 18:34:10 +0800

Preserve the cap64 stage transition before the larger run finishes · 3d12fc0a ...

Constraint: The cap64 benchmark is still running, so only verified stage-transition evidence can be documented safely
Rejected: Wait for cap64 completion before checkpointing | Would leave the next session without proof that the run advanced from training into build-index
Confidence: high
Scope-risk: narrow
Directive: Keep recording cap64 milestones as they happen, but avoid updating winner guidance until report.json lands
Tested: Verified cap64 processes are active, confirmed the high_energy branch advanced from train.py to run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and confirmed report.json is still absent
Not-tested: Final cap64 scores, hybrid branch progression, and any post-cap64 strategy conclusion

authored 2026-06-02 18:31:37 +0800

Preserve proof that the cap64 benchmark has started before it finishes · ef9b24f8 ...

Constraint: The new cap64 run is still in-flight, so only startup and stage-transition evidence can be documented safely
Rejected: Wait for cap64 results before checkpointing | Would leave the next session without a verified handoff that the larger benchmark is already running
Confidence: high
Scope-risk: narrow
Directive: Keep cap64 artifacts out of git and update strategy guidance only after report.json lands
Tested: Verified the cap64 ab_smoke process is running, confirmed the high_energy smoke-local branch entered train.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and recorded the active work root and parameters in docs
Not-tested: Final cap64 metrics, hybrid branch execution, and any post-cap64 strategy conclusion

authored 2026-06-02 18:30:31 +0800

Promote cap48 guidance once the third seed confirmed the stable winner · d1f13203 ...

Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
Confidence: high
Scope-risk: narrow
Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign

authored 2026-06-02 18:29:00 +0800
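
The three-seed cap48 aggregate reported in d1f13203 can be reproduced from the per-seed top1 scores quoted in this log. The sketch below is illustrative only: the per-seed values are the rounded figures from the commit messages, and population standard deviation is an assumption, since the log does not say which stdev variant was used.

```python
from statistics import mean, pstdev

# top1 per seed, copied from the cap48 commit messages in this log.
# high_energy is reported as repeating the same score on all three seeds.
top1_by_strategy = {
    "high_energy": [0.9167, 0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583, 0.8750],
}


def summarize(scores):
    """Aggregate per-seed top1 into mean/min/max/stdev, rounded to 4 places."""
    return {
        "mean_top1": round(mean(scores), 4),
        "min_top1": min(scores),
        "max_top1": max(scores),
        "stdev_top1": round(pstdev(scores), 4),
    }


summary = {name: summarize(scores) for name, scores in top1_by_strategy.items()}
for name, stats in summary.items():
    print(name, stats)
```

The hybrid mean stays at 0.8750 after the third seed because the new score (0.8750) equals the earlier two-seed mean, while its spread from 0.7917 to 0.9583 is what the commit calls volatile.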

Preserve the hybrid seed999 score before the second strategy finishes · d13a3b8b ...

Constraint: The cap48 seed=999 run has only completed the hybrid leg, so the three-seed aggregate is still incomplete
Rejected: Wait for high_energy to finish before checkpointing | Would risk losing the verified hybrid seed999 score from the active Ralph session
Confidence: high
Scope-risk: narrow
Directive: Keep recording verified partial benchmark milestones, but do not revise default-strategy guidance until both strategies and the final report are available
Tested: Verified hybrid eval.json reports num_queries=24, top1=0.875, topk=1.0; verified progress.json records the same result; verified high_energy is still running and report.json is still absent
Not-tested: Final high_energy seed999 metrics, final report.json, and updated three-seed aggregate

authored 2026-06-02 18:25:51 +0800

Preserve fresh benchmark evidence before the evaluation finishes · bdc04f72 ...

Constraint: The running cap48 seed=999 benchmark has not emitted its final report yet, so only in-flight evidence can be recorded safely
Rejected: Claim a new three-seed conclusion now | The aggregate would be speculative without report.json and eval outputs
Confidence: high
Scope-risk: narrow
Directive: When a long benchmark is still active, checkpoint stage evidence explicitly and wait for report.json before changing strategy guidance
Tested: Verified process tree shows hybrid moved from build-index to evaluate.py; verified reference_progress.json reports 48 refs, 491 windows, 192-d embeddings, and complete status; verified report.json is still absent
Not-tested: Final hybrid eval metrics, subsequent high_energy run, and final three-seed aggregate

authored 2026-06-02 18:22:40 +0800

Preserve restartable delivery state before the long benchmark finishes · 0d40b05c ...

Constraint: The cap48 seed=999 benchmark is still running, so this checkpoint must avoid unverified algorithm conclusions
Rejected: Wait for the CPU benchmark to finish | Would delay handoff and leave the next session without a clean restart package
Confidence: high
Scope-risk: narrow
Directive: Keep future doc-only checkpoints surgically staged and do not add data/raw, external_smoke, /tmp outputs, or model artifacts
Tested: Verified staged diff only includes AGENT memory, handoff, changelog, and changelist docs; confirmed /tmp cap48 seed=999 report is not ready yet
Not-tested: The in-flight cap48 seed=999 benchmark result and any follow-up aggregate metrics

authored 2026-06-02 18:20:30 +0800

Promote the cap48 discussion from single runs to two-seed aggregates · ae0d14a5 ...

Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.

Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
Confidence: high
Scope-risk: narrow
Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
Not-tested: Aggregates beyond two seeds or style-bucketed aggregates

authored 2026-06-02 18:15:34 +0800

Reframe the cap48 finding as seed-sensitive after the second rerun · e519dab7 ...

Persist the completed seed123 benchmark showing hybrid ahead again, and update the strategy guidance from single-run winner claims to a multi-seed interpretation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep framing cap48 as a stable high_energy win | The second seed materially weakens that interpretation
Confidence: high
Scope-risk: narrow
Directive: Base the hybrid vs high_energy default decision on aggregated multi-seed evidence, not any single cap48 run
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/report.json; verified high_energy eval.json; verified docs now record hybrid=24/0.9583/1.0 and high_energy=24/0.9167/1.0 for seed123
Not-tested: Formal aggregation across multiple seeds beyond these two cap48 runs

authored 2026-06-02 18:13:48 +0800

Record the first cap48 seed123 hybrid score for the multi-seed check · a3a5303f ...

Persist the newly finished cap48 seed123 hybrid result so the second-seed validation run now has measured evidence instead of only a runtime checkpoint.

Constraint: seed123 high_energy and the final report are still pending
Rejected: Wait for the full seed123 report before updating docs | Would leave the multi-seed evidence stale across sessions
Confidence: high
Scope-risk: narrow
Directive: Replace the seed123 partial section with the final two-strategy ranking once high_energy eval and report.json land
Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=24/0.9583/1.0 and high_energy still in build-index
Not-tested: Final seed123 comparison because high_energy has not finished yet

authored 2026-06-02 18:10:08 +0800

Refresh the second cap48 seed checkpoint now that hybrid reached evaluation · ef7e4493 ...

Update the handoff and changelog with the newer seed123 runtime milestone so later sessions know the hybrid lane has advanced from build-index into capped evaluation.

Constraint: No measured seed123 score is available yet, only a later execution milestone
Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
Confidence: high
Scope-risk: narrow
Directive: Replace the seed123 runtime note with measured scores as soon as hybrid eval.json or report.json lands
Tested: Verified active seed123 hybrid evaluate.py process; verified docs now record seed123 current phase as evaluate.py --max-queries 24
Not-tested: Seed123 strategy scores because hybrid eval.json has not landed yet

authored 2026-06-02 18:08:52 +0800

Checkpoint the second cap48 seed while the rerun is still building · 124d4612 ...

Preserve the second-seed cap48 entry point and current build-index phase so later sessions can validate whether the cap48 reversal was stable or a seed artifact.

Constraint: The second-seed run has not produced scores yet, so only execution-state evidence is available
Rejected: Wait for the seed123 scores before recording anything | Risks losing the multi-seed validation checkpoint if the session ends first
Confidence: high
Scope-risk: narrow
Directive: Replace the seed123 running-state section with measured scores once hybrid eval.json or report.json lands
Tested: Verified active cap48 seed123 processes; verified handoff records work-root, seed, subset size, query cap, and current build-index phase
Not-tested: cap48 seed123 strategy scores because the run is still in progress

authored 2026-06-02 18:04:26 +0800

Revise the default-strategy story after the cap48 reversal · d82d217a ...

Persist the larger 48-track benchmark where high_energy overtook hybrid, and downgrade the previously overconfident default-strategy claim to a conditional recommendation pending broader validation.

Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
Rejected: Keep asserting hybrid as fully settled default after cap48 | The 48-track capped benchmark materially contradicts that stronger claim
Confidence: high
Scope-risk: narrow
Directive: Resolve the hybrid vs high_energy default question with larger, multi-seed, style-aware benchmarks before making a final hard default claim
Tested: Verified /tmp/ab_smoke_seg_cap48_top2/report.json; verified high_energy eval.json; verified docs now record high_energy=24/0.9167/1.0 and hybrid=24/0.7917/1.0
Not-tested: Multi-seed or style-balanced follow-up benchmark beyond the single cap48 run

authored 2026-06-02 18:00:55 +0800

Refresh the cap48 checkpoint now that high-energy reached evaluation · 7769be8c ...

Update the handoff and changelog with the newer cap48 runtime milestone so later sessions know the high_energy lane has advanced from build-index into capped evaluation.

Constraint: No measured cap48 high_energy score is available yet, only a later execution milestone
Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
Confidence: high
Scope-risk: narrow
Directive: Replace the cap48 runtime note with final top-two scores as soon as high_energy eval.json or report.json lands
Tested: Verified active cap48 high_energy evaluate.py process; verified docs now record high_energy current phase as evaluate.py --max-queries 24
Not-tested: Final cap48 comparison because high_energy eval.json has not landed yet

authored 2026-06-02 17:59:27 +0800

Record the first cap48 hybrid score while the larger run continues · 0f84d109 ...

Persist the newly finished cap48 hybrid result so the next session can continue the 48-track validation run from measured evidence instead of only a runtime checkpoint.

Constraint: cap48 high_energy and the final report are still pending
Rejected: Wait for the full cap48 report before updating docs | Would leave the largest current real-data checkpoint stale across sessions
Confidence: high
Scope-risk: narrow
Directive: Replace the cap48 partial section with the final two-strategy ranking once high_energy eval and report.json land
Tested: Verified /tmp/ab_smoke_seg_cap48_top2/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=24/0.7917/1.0 and high_energy still in build-index
Not-tested: Final cap48 comparison because high_energy has not finished yet

authored 2026-06-02 17:55:53 +0800

Refresh the cap48 checkpoint now that hybrid reached evaluation · 727f06c5 ...

Update the handoff and changelog with the newer cap48 runtime milestone so later sessions know the run has advanced from build-index into capped evaluation.

Constraint: No measured cap48 score is available yet, only a later execution milestone
Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
Confidence: high
Scope-risk: narrow
Directive: Replace the cap48 runtime note with hybrid scores as soon as eval.json lands
Tested: Verified active cap48 evaluate.py process; verified docs now record cap48 current phase as evaluate.py --max-queries 24
Not-tested: cap48 strategy scores because hybrid eval.json has not landed yet

authored 2026-06-02 17:54:44 +0800

Checkpoint the cap48 benchmark while the larger run is still building · 026b5539 ...

Preserve the new 48-track top-two benchmark entry point and current build-index phase so later sessions can continue the expanding validation ladder without rediscovering runtime state.

Constraint: cap48 has not produced scores yet, so only execution-state evidence is available
Rejected: Wait for cap48 scores before recording anything | Risks losing the larger-benchmark checkpoint if the session ends first
Confidence: high
Scope-risk: narrow
Directive: Replace the cap48 running-state section with measured scores once hybrid eval.json or report.json lands
Tested: Verified active cap48 processes; verified handoff records work-root, subset size, query cap, and current build-index phase
Not-tested: cap48 strategy scores because the run is still in progress

authored 2026-06-02 17:50:57 +0800

Lock the cap32 result and harden the hybrid default recommendation · f05e7023 ...

Persist the larger 32-track benchmark showing hybrid strongly outperforming high_energy, so the default strategy decision rests on multiple larger real-data checkpoints instead of a single subset.

Constraint: Only documentation changes are allowed because benchmark artifacts stay outside version control
Rejected: Keep the default recommendation tentative after cap32 | The 24-track and 32-track capped benchmarks now agree on hybrid superiority
Confidence: high
Scope-risk: narrow
Directive: Use cap24 and cap32 together as the current strongest strategy evidence until a broader multi-style benchmark supersedes them
Tested: Verified /tmp/ab_smoke_seg_cap32_top2/report.json; verified high_energy eval.json; verified docs now record hybrid=20/0.95/1.0 and high_energy=20/0.5/1.0
Not-tested: Wider style-balanced benchmark beyond the FMA top-two subsets

authored 2026-06-02 17:46:42 +0800

Record the first cap32 hybrid score while the larger run continues · f228197d ...

Persist the newly finished cap32 hybrid result so the next session can continue the top-two validation run from measured evidence instead of only a running-state checkpoint.

Constraint: cap32 high_energy and the final report are still pending
Rejected: Wait for the full cap32 report before updating docs | Would leave the larger-subset evidence stale across sessions
Confidence: high
Scope-risk: narrow
Directive: Replace the cap32 partial section with the final two-strategy ranking once high_energy eval and report.json land
Tested: Verified /tmp/ab_smoke_seg_cap32_top2/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=20/0.95/1.0 and high_energy still training
Not-tested: Final cap32 comparison because high_energy has not finished yet

authored 2026-06-02 17:42:43 +0800

Checkpoint the larger cap32 benchmark before results land · 5dadbae3 ...

Preserve the new 32-track top-two benchmark entry point and current build-index phase so a later session can continue the stronger validation run without losing runtime context.

Constraint: The cap32 benchmark is still running, so only execution-state evidence is available
Rejected: Wait for cap32 results before recording anything | Risks losing the larger-benchmark checkpoint if the session ends first
Confidence: high
Scope-risk: narrow
Directive: Replace the cap32 running-state section with measured scores once hybrid eval.json and report.json land
Tested: Verified active cap32 processes; verified handoff records work-root, subset size, query cap, and current build-index phase
Not-tested: cap32 strategy scores because the run is still in progress

authored 2026-06-02 17:41:01 +0800

Promote hybrid to the default strategy using the stronger cap24 evidence · 08379e56 ...

Persist the larger real-FMA benchmark result showing hybrid clearly outperforming high_energy, so the project recommendation can converge on one default instead of an unresolved tie.

Constraint: Only docs change because benchmark outputs remain outside version control
Rejected: Keep treating hybrid and high_energy as co-equal defaults | The larger 24-track capped benchmark now separates them clearly
Confidence: high
Scope-risk: narrow
Directive: Use cap24 top-two as the current strongest public evidence until a larger capped benchmark supersedes it
Tested: Verified /tmp/ab_smoke_seg_cap24_top2/report.json; verified high_energy eval.json; verified docs now state hybrid=16/1.0/1.0 and high_energy=16/0.8125/1.0
Not-tested: Broader strategy comparison beyond hybrid vs high_energy on the 24-track subset

authored 2026-06-02 17:36:12 +0800

Preserve the larger cap24 top-two benchmark checkpoint · 48a5957a ...

Record the new 24-track capped benchmark setup and the first completed hybrid result so the next session can continue the stronger tie-break experiment without rediscovering runtime state.

Constraint: The cap24 benchmark is still in progress, so only partial evidence can be documented now
Rejected: Wait for high_energy to finish before updating handoff | Risks losing the fresh larger-subset evidence if the session ends first
Confidence: high
Scope-risk: narrow
Directive: Replace the partial cap24 section with the final two-strategy ranking once report.json lands
Tested: Verified /tmp/ab_smoke_seg_cap24_top2/hybrid/fma_reports_smoke/eval.json; verified active cap24 processes; verified docs include the exact work-root and resume command
Not-tested: Final cap24 top-two comparison because high_energy is still training

authored 2026-06-02 17:33:42 +0800

Lock the final cap16 FMA benchmark ranking into the workflow docs · c659380d ...

Persist the completed capped real-data benchmark results so future sessions can use the final strategy ordering and recommendation without replaying the run.

Constraint: Only documentation should change because benchmark artifacts live outside version control
Rejected: Leave the result only in /tmp report files | Would make the evidence fragile across sessions
Confidence: high
Scope-risk: narrow
Directive: Use cap16 as the current default evidence point until a larger capped benchmark supersedes it
Tested: Verified /tmp/ab_smoke_seg_cap16/report.json; verified repeated_section_aware eval.json; verified docs reflect final ranking hybrid/high_energy/beat_aware/repeated_section_aware
Not-tested: Larger real-dataset benchmark beyond the 16-track capped subset

authored 2026-06-02 17:27:36 +0800