1. 02 Jun, 2026 40 commits
    • Constraint: Keep verification offline-only and avoid touching real databases or production assets
      Rejected: Stop at manifest generation without execution evidence | A dry-run smoke gives the next session stronger handoff confidence
      Confidence: high
      Scope-risk: narrow
      Directive: Stage local sample audio inside the smoke workspace so manifest paths remain self-contained and reproducible
      Tested: Ran business_export_offline_smoke.py end-to-end; verified normalize/build summaries and train.py --dry-run success; rechecked adapter doc links
      Not-tested: Did not run full training/evaluation on live business exports or connect to any database
      cnb.bofCdSsphPA authored
    • …y from normalized rows
      
      Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
      Rejected: Leave final manifest shaping as a manual next-session task | The handoff is stronger when catalog/train/test/val can already be produced automatically
      Confidence: high
      Scope-risk: narrow
      Directive: Treat these generated manifests as integration-stage scaffolds and validate final field policy again before production data ingestion
      Tested: Ran build_business_project_manifests.py on normalized sample data and verified catalog/train/test/val structure; rechecked 70 relative links
      Not-tested: Did not run the generated manifests through full training/evaluation against live business audio
      cnb.bofCdSsphPA authored
    • Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
      Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
      Confidence: high
      Scope-risk: narrow
      Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
      Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
      Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline
      cnb.bofCdSsphPA authored
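The role split described in the commit above can be sketched as follows. This is a minimal illustration, not the contents of split_business_manifest_ready.py: the field names `asset_id`, `song_id`, and `role`, and the helper `split_by_role`, are hypothetical. The only behavior taken from this log is that rows end up in reference, query, or excluded lists, and that ambiguous assets default to excluded.

```python
# Hypothetical sketch: field names and the helper are assumptions,
# not the real script's schema.
VALID_ROLES = {"reference", "query"}


def split_by_role(rows):
    """Group normalized rows into reference/query/excluded staging lists."""
    buckets = {"reference": [], "query": [], "excluded": []}
    for row in rows:
        role = (row.get("role") or "").strip().lower()
        has_identity = bool(row.get("song_id"))
        # Anything without a usable role or a confirmed song identity
        # falls through to "excluded" until it is reviewed manually.
        if role in VALID_ROLES and has_identity:
            buckets[role].append(row)
        else:
            buckets["excluded"].append(row)
    return buckets


# Three illustrative rows, mirroring the 1 + 1 + 1 shape of the sample run.
sample_rows = [
    {"asset_id": "a1", "song_id": "s1", "role": "reference"},
    {"asset_id": "a2", "song_id": "s1", "role": "query"},
    {"asset_id": "a3", "song_id": "", "role": "reference"},  # identity missing
]

splits = split_by_role(sample_rows)
counts = {name: len(items) for name, items in splits.items()}
print(counts)
```

Defaulting to excluded keeps the split conservative: a row only becomes a reference or query when both its role and its song identity are explicit.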
    • Constraint: Keep this checkpoint offline-only and avoid touching real databases, datasets, or model artifacts
      Rejected: Stop at static CSV/JSONL examples only | The next session needs an executable normalization path, not just samples
      Confidence: high
      Scope-risk: narrow
      Directive: Treat normalized JSONL as manifest-ready staging output and keep final manifest shaping explicit in the integration step
      Tested: Ran normalize_business_export.py on the sample CSV and JSONL inputs; verified 3 output rows each; rechecked 71 relative links
      Not-tested: Did not run against a live business export or connect to any database
      cnb.bofCdSsphPA authored
    • Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
      Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
      Confidence: high
      Scope-risk: narrow
      Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
      Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
      Not-tested: Did not connect to a production database or execute a live export
      cnb.bofCdSsphPA authored
    • Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
      Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
      Confidence: high
      Scope-risk: narrow
      Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
      Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
      Not-tested: Did not connect to a real database or run a live export in this checkpoint
      cnb.bofCdSsphPA authored
    • Constraint: Keep this checkpoint documentation-first and avoid staging dataset, cache, or model artifacts
      Rejected: Leave the asset-type strategy implicit in chat only | The next session needs repo-native guidance and templates
      Confidence: high
      Scope-risk: narrow
      Directive: Treat type-based buckets as a starting scaffold and keep hard-negative curation manual until evidence supports automation
      Tested: Parsed both bucket JSON templates and rechecked 104 relative links across the new docs
      Not-tested: Did not run a fresh business-type benchmark in this checkpoint
      cnb.bofCdSsphPA authored
    • Constraint: Keep the checkpoint lightweight and avoid touching dataset or model artifacts
      Rejected: Wait to add buckets until automatic semantic labeling exists | Manual curated buckets are enough to unblock the next session now
      Confidence: high
      Scope-risk: narrow
      Directive: Use the template as a curated benchmark scaffold, not as evidence that filenames imply semantics
      Tested: Parsed the new JSON template; ran ab_smoke_bucketed.py --help; rechecked targeted relative links
      Not-tested: Did not launch a new semantic bucket benchmark run in this checkpoint
      cnb.bofCdSsphPA authored
    • Constraint: Avoid staging datasets, smoke artifacts, /tmp outputs, and caches
      Rejected: Delay handoff until larger semantic buckets exist | User asked for immediate delivery and resumability now
      Confidence: high
      Scope-risk: narrow
      Directive: Treat toy prefix buckets as a methodology baseline, not a product conclusion
      Tested: Verified /tmp/ab_smoke_bucketed_smoke/report.json and bucket_report.json outputs; reviewed targeted git diff
      Not-tested: No new training or benchmark execution in this documentation-only checkpoint
      cnb.bofCdSsphPA authored
    • Constraint: The cap48/cap64 reversal means strategy guidance can no longer rely on a single overall subset result
      Rejected: Keep bucket benchmarking as a doc-only next step | The repo now needs an executable baseline so later sessions can measure scale/style divergence directly
      Confidence: high
      Scope-risk: moderate
      Directive: Treat ab_smoke_bucketed.py as the canonical seed for style-aware evaluation, and expand bucket definitions before revisiting global default-strategy claims
      Tested: Verified acr-engine/scripts/ab_smoke_bucketed.py passes py_compile; verified first bucket prefix_000_a produced bucket_report.json with hybrid 4/1.0/1.0 and high_energy 3/1.0/1.0; verified second bucket execution is in progress
      Not-tested: Full multi-bucket report.json completion, richer bucket definitions, and bucket-level aggregate conclusions
      cnb.bofCdSsphPA authored
    • Constraint: Strategy guidance must now reflect that cap48 and cap64 produce different winners under verified runs
      Rejected: Keep high_energy as the generic default | The completed cap64 run shows hybrid winning clearly at a larger subset size, so the docs must acknowledge scale sensitivity
      Confidence: high
      Scope-risk: moderate
      Directive: Do not present a single global default strategy again until bucketed and style-aware benchmarks explain the cap48/cap64 divergence
      Tested: Verified cap64 report.json, progress.json, high_energy eval.json, and hybrid eval.json; confirmed cap64 winner=hybrid with top1 0.875 vs high_energy 0.625
      Not-tested: Multi-seed cap64 aggregates, bucket/style-aware benchmarks, and any revised hybrid training design
      cnb.bofCdSsphPA authored
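The scale sensitivity noted in the commit above can be checked against the top1 values recorded in this log. The sketch below only restates those numbers; the dictionary layout and the helper `winner` are illustrative assumptions, and the cap48 entry uses the three-seed mean rather than any single run.

```python
# top1 values copied from the commit messages in this log.
# The data layout and helper name are illustrative assumptions.
TOP1_BY_CAP = {
    "cap24": {"hybrid": 1.0, "high_energy": 0.8125},
    "cap32": {"hybrid": 0.95, "high_energy": 0.5},
    "cap48": {"hybrid": 0.8750, "high_energy": 0.9167},  # three-seed mean
    "cap64": {"hybrid": 0.875, "high_energy": 0.625},
}


def winner(scores):
    """Return the strategy name with the highest top1."""
    return max(scores, key=scores.get)


winners = {cap: winner(scores) for cap, scores in TOP1_BY_CAP.items()}

for cap, name in winners.items():
    scores = TOP1_BY_CAP[cap]
    margin = scores[name] - min(scores.values())
    print(f"{cap}: winner={name} margin={margin:.4f}")

# More than one distinct winner means no single global default is supported.
has_single_default = len(set(winners.values())) == 1
print("single global default supported:", has_single_default)
```

Because cap48 and cap64 disagree, `has_single_default` is false for these values, which matches the directive to avoid a global default until bucketed benchmarks explain the divergence.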
    • Constraint: The cap64 run is still incomplete, so only verified hybrid index-complete and evaluation-running evidence can be recorded safely now
      Rejected: Wait for hybrid eval.json before checkpointing | Would lose the verified handoff that hybrid indexing finished and evaluate.py is already running
      Confidence: high
      Scope-risk: narrow
      Directive: Keep cap64 high_energy and hybrid checkpoints symmetric so the final comparison can be written from docs alone if needed
      Tested: Verified hybrid reference_progress.json shows 64 refs, 657 windows, 192-d embeddings, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
      Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded now without overstating results
      Rejected: Wait for hybrid eval before checkpointing | Would lose the stronger handoff evidence that the full hybrid epoch already completed
      Confidence: high
      Scope-risk: narrow
      Directive: Keep distinguishing hybrid training-complete from hybrid index/eval completion until report.json lands
      Tested: Verified live session output shows hybrid Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests while hybrid eval.json and report.json remain absent
      Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 run is still in progress, so this checkpoint can only record verified hybrid stage transitions, not final comparisons
      Rejected: Wait for hybrid eval before checkpointing | Would lose the verified evidence that hybrid training finished and indexing has already started
      Confidence: high
      Scope-risk: narrow
      Directive: Keep cap64 branch checkpoints symmetric so high_energy and hybrid can be compared later without re-reading process history
      Tested: Verified active process is run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified /tmp/ab_smoke_seg_cap64_top2/hybrid/fma_models_smoke/best_model.pt exists; verified hybrid eval.json and report.json are still absent
      Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 run is still incomplete, so only branch-transition evidence can be recorded safely at this point
      Rejected: Wait for the hybrid eval before checkpointing | Would lose the verified handoff that execution has moved beyond high_energy into hybrid training
      Confidence: high
      Scope-risk: narrow
      Directive: Keep cap64 branch progression explicit so the next session can resume from the current strategy leg without re-inspection
      Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified active processes show external_adapters.py on /tmp/ab_smoke_seg_cap64_top2/hybrid and train.py on /tmp/ab_smoke_seg_cap64_top2/hybrid/fma/manifests; verified hybrid eval.json and report.json are still absent
      Not-tested: Final hybrid cap64 metrics, final report.json, and any cap64 winner conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 run has only produced the high_energy leg so far, so any larger conclusion must wait for hybrid and the final report
      Rejected: Wait for report.json before checkpointing | Would lose the verified cap64 high_energy score and the proof that execution has already switched into the hybrid branch
      Confidence: high
      Scope-risk: narrow
      Directive: Do not compare cap64 strategy winners until both legs and the final report land; treat the current 0.625 high_energy score as an intermediate checkpoint only
      Tested: Verified high_energy eval.json reports num_queries=32, top1=0.625, topk=1.0; verified progress.json records the same result; verified the active process has switched to the hybrid smoke-local branch and report.json is still absent
      Not-tested: Final cap64 hybrid metrics, final report.json, and any cap64-based strategy conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 run is still active, so this checkpoint can only record stage completion evidence rather than final benchmark conclusions
      Rejected: Wait for eval.json or report.json before committing | Would lose the verified handoff that indexing finished and evaluate.py is now running
      Confidence: high
      Scope-risk: narrow
      Directive: Keep stage checkpoints explicit (training complete, index complete, evaluation running, report complete) until cap64 fully settles
      Tested: Verified reference_progress.json shows 64 refs, 657 windows, and complete status; verified active process is evaluate.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests; verified high_energy eval.json and report.json are still absent
      Not-tested: Final cap64 high_energy metrics, hybrid branch execution, and post-cap64 strategy guidance
      cnb.bofCdSsphPA authored
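The milestone distinction in the commit above can be approximated from on-disk evidence alone. The sketch below is an assumption-laden illustration: the helper `detect_stage` is hypothetical, the location of reference_progress.json and its `status` key are guesses, and only the file names best_model.pt, reference_progress.json, and eval.json come from this log. It cannot detect "evaluation running", which the commits verified from the process list rather than from files.

```python
import json
import tempfile
from pathlib import Path


def detect_stage(branch_root: Path) -> str:
    """Return the furthest milestone for which a file exists on disk."""
    model = branch_root / "fma_models_smoke" / "best_model.pt"
    progress = branch_root / "reference_progress.json"  # location is assumed
    eval_json = branch_root / "fma_reports_smoke" / "eval.json"

    if eval_json.exists():
        return "evaluation complete"
    if progress.exists():
        # The "status" key is an assumption about the file's schema.
        status = json.loads(progress.read_text()).get("status")
        return "index complete" if status == "complete" else "indexing"
    if model.exists():
        return "training complete"
    return "training or not started"


# Walk a throwaway directory through each milestone to show the transitions.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "hybrid"
    (root / "fma_models_smoke").mkdir(parents=True)
    (root / "fma_reports_smoke").mkdir()

    observed = [detect_stage(root)]
    (root / "fma_models_smoke" / "best_model.pt").write_bytes(b"")
    observed.append(detect_stage(root))
    (root / "reference_progress.json").write_text(json.dumps({"status": "complete"}))
    observed.append(detect_stage(root))
    (root / "fma_reports_smoke" / "eval.json").write_text("{}")
    observed.append(detect_stage(root))

print(observed)
```

Checking the latest artifact first means a stale earlier file never masks a later milestone.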
    • Constraint: The cap64 run is still active, so only verified training-complete evidence can be recorded without overstating results
      Rejected: Keep only the older build-index note | The live session now proves the entire high_energy epoch finished, which is stronger handoff evidence
      Confidence: high
      Scope-risk: narrow
      Directive: Distinguish clearly between training-complete, indexing-complete, and report-complete milestones in future cap64 checkpoints
      Tested: Verified live session output shows high_energy Epoch 1 progressed from 0/32 to 32/32, and verified the active process remains run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests
      Not-tested: Final cap64 eval metrics, hybrid branch progress, and report.json generation
      cnb.bofCdSsphPA authored
    • Constraint: The cap64 benchmark is still running, so only verified stage-transition evidence can be documented safely
      Rejected: Wait for cap64 completion before checkpointing | Would leave the next session without proof that the run advanced from training into build-index
      Confidence: high
      Scope-risk: narrow
      Directive: Keep recording cap64 milestones as they happen, but avoid updating winner guidance until report.json lands
      Tested: Verified cap64 processes are active, confirmed the high_energy branch advanced from train.py to run_demo.py build-index on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and confirmed report.json is still absent
      Not-tested: Final cap64 scores, hybrid branch progression, and any post-cap64 strategy conclusion
      cnb.bofCdSsphPA authored
    • Constraint: The new cap64 run is still in-flight, so only startup and stage-transition evidence can be documented safely
      Rejected: Wait for cap64 results before checkpointing | Would leave the next session without a verified handoff that the larger benchmark is already running
      Confidence: high
      Scope-risk: narrow
      Directive: Keep cap64 artifacts out of git and update strategy guidance only after report.json lands
      Tested: Verified the cap64 ab_smoke process is running, confirmed the high_energy smoke-local branch entered train.py on /tmp/ab_smoke_seg_cap64_top2/high_energy/fma/manifests, and recorded the active work root and parameters in docs
      Not-tested: Final cap64 metrics, hybrid branch execution, and any post-cap64 strategy conclusion
      cnb.bofCdSsphPA authored
    • Constraint: Strategy guidance had to wait until the full seed=999 report landed and all three cap48 runs could be aggregated consistently
      Rejected: Keep treating cap48 as unresolved | The third seed now confirms high_energy repeats the same score while hybrid remains volatile
      Confidence: high
      Scope-risk: narrow
      Directive: Treat high_energy as the cap48 default only within the documented FMA smoke condition until larger cap64 and bucketed benchmarks either confirm or overturn it
      Tested: Verified seed=999 report.json, high_energy eval.json, hybrid eval.json, and computed three-seed aggregate showing high_energy mean_top1=0.9167 with zero variance versus hybrid mean_top1=0.8750
      Not-tested: cap64-or-larger benchmarks, bucket/style-aware evaluations, and any future hybrid redesign
      cnb.bofCdSsphPA authored
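The three-seed aggregate in the commit above can be reproduced from the per-seed scores recorded in this log. In the sketch below, the correct-at-1 counts are derived by multiplying each reported top1 by num_queries=24. The seed=999 high_energy count is inferred from the statement that it repeats the same score. Using the population standard deviation is an assumption, since the log does not say which estimator was used.

```python
import statistics

NUM_QUERIES = 24  # query cap reported for the cap48 runs

# Correct top-1 hits per seed (first run, seed123, seed999), derived from the
# reported top1 values: 0.9167 -> 22, 0.7917 -> 19, 0.9583 -> 23, 0.875 -> 21.
CORRECT_AT_1 = {
    "high_energy": [22, 22, 22],  # seed999 value inferred from "same score"
    "hybrid": [19, 23, 21],
}


def summarize(counts):
    """Compute mean/min/max/stdev of top1 across seeds, rounded to 4 places."""
    top1 = [c / NUM_QUERIES for c in counts]
    return {
        "mean": round(statistics.mean(top1), 4),
        "min": round(min(top1), 4),
        "max": round(max(top1), 4),
        # Population stdev is an assumption; the log only says "stdev".
        "stdev": round(statistics.pstdev(top1), 4),
    }


summary = {name: summarize(counts) for name, counts in CORRECT_AT_1.items()}
for name, stats in summary.items():
    print(name, stats)
```

The spread is the point of the aggregate: high_energy has the higher mean and no variation across seeds, while hybrid ranges from 0.7917 to 0.9583.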
    • Constraint: The cap48 seed=999 run has only completed the hybrid leg, so the three-seed aggregate is still incomplete
      Rejected: Wait for high_energy to finish before checkpointing | Would risk losing the verified hybrid seed999 score from the active Ralph session
      Confidence: high
      Scope-risk: narrow
      Directive: Keep recording verified partial benchmark milestones, but do not revise default-strategy guidance until both strategies and the final report are available
      Tested: Verified hybrid eval.json reports num_queries=24, top1=0.875, topk=1.0; verified progress.json records the same result; verified high_energy is still running and report.json is still absent
      Not-tested: Final high_energy seed999 metrics, final report.json, and updated three-seed aggregate
      cnb.bofCdSsphPA authored
    • Constraint: The running cap48 seed=999 benchmark has not emitted its final report yet, so only in-flight evidence can be recorded safely
      Rejected: Claim a new three-seed conclusion now | The aggregate would be speculative without report.json and eval outputs
      Confidence: high
      Scope-risk: narrow
      Directive: When a long benchmark is still active, checkpoint stage evidence explicitly and wait for report.json before changing strategy guidance
      Tested: Verified process tree shows hybrid moved from build-index to evaluate.py; verified reference_progress.json reports 48 refs, 491 windows, 192-d embeddings, and complete status; verified report.json is still absent
      Not-tested: Final hybrid eval metrics, subsequent high_energy run, and final three-seed aggregate
      cnb.bofCdSsphPA authored
    • Constraint: The cap48 seed=999 benchmark is still running, so this checkpoint must avoid unverified algorithm conclusions
      Rejected: Wait for the CPU benchmark to finish | Would delay handoff and leave the next session without a clean restart package
      Confidence: high
      Scope-risk: narrow
      Directive: Keep future doc-only checkpoints surgically staged and do not add data/raw, external_smoke, /tmp outputs, or model artifacts
      Tested: Verified staged diff only includes AGENT memory, handoff, changelog, and changelist docs; confirmed /tmp cap48 seed=999 report is not ready yet
      Not-tested: The in-flight cap48 seed=999 benchmark result and any follow-up aggregate metrics
      cnb.bofCdSsphPA authored
    • Persist the current two-seed cap48 summary so the strategy recommendation is grounded in aggregated evidence rather than whichever single run happened most recently.
      
      Constraint: Only documentation changes are allowed because benchmark artifacts remain outside version control
      Rejected: Keep narrating cap48 one run at a time | The aggregate is now more informative than any individual cap48 run
      Confidence: high
      Scope-risk: narrow
      Directive: Prefer reporting aggregate seed statistics once two or more runs exist; avoid re-elevating single-seed claims above the aggregate
      Tested: Verified both cap48 report.json files; computed aggregate mean/min/max/stdev; verified docs now record high_energy mean_top1=0.9167 and hybrid mean_top1=0.8750
      Not-tested: Aggregates beyond two seeds or style-bucketed aggregates
      cnb.bofCdSsphPA authored
    • Persist the completed seed123 benchmark showing hybrid ahead again, and update the strategy guidance from single-run winner claims to a multi-seed interpretation.
      
      Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
      Rejected: Keep framing cap48 as a stable high_energy win | The second seed materially weakens that interpretation
      Confidence: high
      Scope-risk: narrow
      Directive: Base the hybrid vs high_energy default decision on aggregated multi-seed evidence, not any single cap48 run
      Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/report.json; verified high_energy eval.json; verified docs now record hybrid=24/0.9583/1.0 and high_energy=24/0.9167/1.0 for seed123
      Not-tested: Formal aggregation across multiple seeds beyond these two cap48 runs
      cnb.bofCdSsphPA authored
    • Persist the newly finished cap48 seed123 hybrid result so the second-seed validation run now has measured evidence instead of only a runtime checkpoint.
      
      Constraint: seed123 high_energy and the final report are still pending
      Rejected: Wait for the full seed123 report before updating docs | Would leave the multi-seed evidence stale across sessions
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the seed123 partial section with the final two-strategy ranking once high_energy eval and report.json land
      Tested: Verified /tmp/ab_smoke_seg_cap48_top2_seed123/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=24/0.9583/1.0 and high_energy still in build-index
      Not-tested: Final seed123 comparison because high_energy has not finished yet
      cnb.bofCdSsphPA authored
    • Update the handoff and changelog with the newer seed123 runtime milestone so later sessions know the hybrid lane has advanced from build-index into capped evaluation.
      
      Constraint: No measured seed123 score is available yet, only a later execution milestone
      Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the seed123 runtime note with measured scores as soon as hybrid eval.json or report.json lands
      Tested: Verified active seed123 hybrid evaluate.py process; verified docs now record seed123 current phase as evaluate.py --max-queries 24
      Not-tested: Seed123 strategy scores because hybrid eval.json has not landed yet
      cnb.bofCdSsphPA authored
    • Preserve the second-seed cap48 entry point and current build-index phase so later sessions can validate whether the cap48 reversal was stable or a seed artifact.
      
      Constraint: The second-seed run has not produced scores yet, so only execution-state evidence is available
      Rejected: Wait for the seed123 scores before recording anything | Risks losing the multi-seed validation checkpoint if the session ends first
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the seed123 running-state section with measured scores once hybrid eval.json or report.json lands
      Tested: Verified active cap48 seed123 processes; verified handoff records work-root, seed, subset size, query cap, and current build-index phase
      Not-tested: cap48 seed123 strategy scores because the run is still in progress
      cnb.bofCdSsphPA authored
    • Persist the larger 48-track benchmark where high_energy overtook hybrid, and downgrade the previously overconfident default-strategy claim to a conditional recommendation pending broader validation.
      
      Constraint: Only documentation changes are allowed because benchmark outputs remain outside version control
      Rejected: Keep asserting hybrid as fully settled default after cap48 | The 48-track capped benchmark materially contradicts that stronger claim
      Confidence: high
      Scope-risk: narrow
      Directive: Resolve the hybrid vs high_energy default question with larger, multi-seed, style-aware benchmarks before making a final hard default claim
      Tested: Verified /tmp/ab_smoke_seg_cap48_top2/report.json; verified high_energy eval.json; verified docs now record high_energy=24/0.9167/1.0 and hybrid=24/0.7917/1.0
      Not-tested: Multi-seed or style-balanced follow-up benchmark beyond the single cap48 run
      cnb.bofCdSsphPA authored
    • Update the handoff and changelog with the newer cap48 runtime milestone so later sessions know the high_energy lane has advanced from build-index into capped evaluation.
      
      Constraint: No measured cap48 high_energy score is available yet, only a later execution milestone
      Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap48 runtime note with final top-two scores as soon as high_energy eval.json or report.json lands
      Tested: Verified active cap48 high_energy evaluate.py process; verified docs now record high_energy current phase as evaluate.py --max-queries 24
      Not-tested: Final cap48 comparison because high_energy eval.json has not landed yet
      cnb.bofCdSsphPA authored
    • Persist the newly finished cap48 hybrid result so the next session can continue the 48-track validation run from measured evidence instead of only a runtime checkpoint.
      
      Constraint: cap48 high_energy and the final report are still pending
      Rejected: Wait for the full cap48 report before updating docs | Would leave the largest current real-data checkpoint stale across sessions
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap48 partial section with the final two-strategy ranking once high_energy eval and report.json land
      Tested: Verified /tmp/ab_smoke_seg_cap48_top2/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=24/0.7917/1.0 and high_energy still in build-index
      Not-tested: Final cap48 comparison because high_energy has not finished yet
      cnb.bofCdSsphPA authored
    • Update the handoff and changelog with the newer cap48 runtime milestone so later sessions know the run has advanced from build-index into capped evaluation.
      
      Constraint: No measured cap48 score is available yet, only a later execution milestone
      Rejected: Leave the older build-index note in place | Would make the restart handoff stale and less actionable
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap48 runtime note with hybrid scores as soon as eval.json lands
      Tested: Verified active cap48 evaluate.py process; verified docs now record cap48 current phase as evaluate.py --max-queries 24
      Not-tested: cap48 strategy scores because hybrid eval.json has not landed yet
      cnb.bofCdSsphPA authored
    • Preserve the new 48-track top-two benchmark entry point and current build-index phase so later sessions can continue the expanding validation ladder without rediscovering runtime state.
      
      Constraint: cap48 has not produced scores yet, so only execution-state evidence is available
      Rejected: Wait for cap48 scores before recording anything | Risks losing the larger-benchmark checkpoint if the session ends first
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap48 running-state section with measured scores once hybrid eval.json or report.json lands
      Tested: Verified active cap48 processes; verified handoff records work-root, subset size, query cap, and current build-index phase
      Not-tested: cap48 strategy scores because the run is still in progress
      cnb.bofCdSsphPA authored
    • Persist the larger 32-track benchmark showing hybrid strongly outperforming high_energy, so the default strategy decision rests on multiple larger real-data checkpoints instead of a single subset.
      
      Constraint: Only documentation changes are allowed because benchmark artifacts stay outside version control
      Rejected: Keep the default recommendation tentative after cap32 | The 24-track and 32-track capped benchmarks now agree on hybrid superiority
      Confidence: high
      Scope-risk: narrow
      Directive: Use cap24 and cap32 together as the current strongest strategy evidence until a broader multi-style benchmark supersedes them
      Tested: Verified /tmp/ab_smoke_seg_cap32_top2/report.json; verified high_energy eval.json; verified docs now record hybrid=20/0.95/1.0 and high_energy=20/0.5/1.0
      Not-tested: Wider style-balanced benchmark beyond the FMA top-two subsets
      cnb.bofCdSsphPA authored
    • Persist the newly finished cap32 hybrid result so the next session can continue the top-two validation run from measured evidence instead of only a running-state checkpoint.
      
      Constraint: cap32 high_energy and the final report are still pending
      Rejected: Wait for the full cap32 report before updating docs | Would leave the larger-subset evidence stale across sessions
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap32 partial section with the final two-strategy ranking once high_energy eval and report.json land
      Tested: Verified /tmp/ab_smoke_seg_cap32_top2/hybrid/fma_reports_smoke/eval.json; verified docs record hybrid=20/0.95/1.0 and high_energy still training
      Not-tested: Final cap32 comparison because high_energy has not finished yet
      cnb.bofCdSsphPA authored
    • Preserve the new 32-track top-two benchmark entry point and current build-index phase so a later session can continue the stronger validation run without losing runtime context.
      
      Constraint: The cap32 benchmark is still running, so only execution-state evidence is available
      Rejected: Wait for cap32 results before recording anything | Risks losing the larger-benchmark checkpoint if the session ends first
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the cap32 running-state section with measured scores once hybrid eval.json and report.json land
      Tested: Verified active cap32 processes; verified handoff records work-root, subset size, query cap, and current build-index phase
      Not-tested: cap32 strategy scores because the run is still in progress
      cnb.bofCdSsphPA authored
    • Persist the larger real-FMA benchmark result showing hybrid clearly outperforming high_energy, so the project recommendation can converge on one default instead of an unresolved tie.
      
      Constraint: Only docs change because benchmark outputs remain outside version control
      Rejected: Keep treating hybrid and high_energy as co-equal defaults | The larger 24-track capped benchmark now separates them clearly
      Confidence: high
      Scope-risk: narrow
      Directive: Use cap24 top-two as the current strongest public evidence until a larger capped benchmark supersedes it
      Tested: Verified /tmp/ab_smoke_seg_cap24_top2/report.json; verified high_energy eval.json; verified docs now state hybrid=16/1.0/1.0 and high_energy=16/0.8125/1.0
      Not-tested: Broader strategy comparison beyond hybrid vs high_energy on the 24-track subset
      cnb.bofCdSsphPA authored
    • Record the new 24-track capped benchmark setup and the first completed hybrid result so the next session can continue the stronger tie-break experiment without rediscovering runtime state.
      
      Constraint: The cap24 benchmark is still in progress, so only partial evidence can be documented now
      Rejected: Wait for high_energy to finish before updating handoff | Risks losing the fresh larger-subset evidence if the session ends first
      Confidence: high
      Scope-risk: narrow
      Directive: Replace the partial cap24 section with the final two-strategy ranking once report.json lands
      Tested: Verified /tmp/ab_smoke_seg_cap24_top2/hybrid/fma_reports_smoke/eval.json; verified active cap24 processes; verified docs include the exact work-root and resume command
      Not-tested: Final cap24 top-two comparison because high_energy is still training
      cnb.bofCdSsphPA authored
    • Persist the completed capped real-data benchmark results so future sessions can use the final strategy ordering and recommendation without replaying the run.
      
      Constraint: Only documentation should change because benchmark artifacts live outside version control
      Rejected: Leave the result only in /tmp report files | Would make the evidence fragile across sessions
      Confidence: high
      Scope-risk: narrow
      Directive: Use cap16 as the current default evidence point until a larger capped benchmark supersedes it
      Tested: Verified /tmp/ab_smoke_seg_cap16/report.json; verified repeated_section_aware eval.json; verified docs reflect final ranking hybrid/high_energy/beat_aware/repeated_section_aware
      Not-tested: Larger real-dataset benchmark beyond the 16-track capped subset
      cnb.bofCdSsphPA authored