Expand external dataset coverage before harder real-data training

Constraint: Open-dataset ingestion needs a way to generate multiple overlapping queries per track, otherwise training/eval coverage stays too sparse
Rejected: Keep only one random external query per track | Leaves long songs underrepresented and weakens reproducibility
Confidence: high
Scope-risk: moderate
Directive: Preserve single-query behavior as the default, but keep overlap-query generation configurable through query_stride for future corpora
Tested: manifest_tools audio-dir-to-splits --help shows --query-stride; prepare-local on data/synthetic_v2/songs with query_duration=8.0 and query_stride=4.0 produced 72 queries with query_index fields
Not-tested: Full end-to-end smoke-local completion on the still-running real FMA corpus with overlap-query mode enabled
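For readers who have not seen the diff, the overlap-query behavior described in the trailers can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the function name `query_windows`, the `seed` parameter, and the tuple layout are hypothetical, and only `query_duration`, `query_stride`, and `query_index` come from the message above. It assumes that leaving `query_stride` unset keeps the single-random-query default, and that tracks shorter than one query are skipped.

```python
import random
from typing import List, Optional, Tuple

# (query_index, start_sec, end_sec); this layout is an assumption for illustration.
Window = Tuple[int, float, float]


def query_windows(
    track_duration: float,
    query_duration: float,
    query_stride: Optional[float] = None,
    seed: int = 0,
) -> List[Window]:
    """Return the query windows for one track.

    With query_stride=None, a single randomly placed query is returned
    (assumed default). With a positive query_stride, windows start every
    query_stride seconds and overlap whenever query_stride < query_duration.
    """
    if query_duration <= 0:
        raise ValueError("query_duration must be positive")
    if track_duration < query_duration:
        return []  # assumption: tracks shorter than one query are skipped

    if query_stride is None:
        # Seeded RNG so the single-query default stays reproducible.
        rng = random.Random(seed)
        start = rng.uniform(0.0, track_duration - query_duration)
        return [(0, start, start + query_duration)]

    if query_stride <= 0:
        raise ValueError("query_stride must be positive")

    # Count windows up front instead of accumulating float offsets.
    count = int((track_duration - query_duration) // query_stride) + 1
    return [
        (index, index * query_stride, index * query_stride + query_duration)
        for index in range(count)
    ]


if __name__ == "__main__":
    # A 30 s track with 8 s queries every 4 s yields 6 overlapping windows.
    for window in query_windows(30.0, 8.0, query_stride=4.0):
        print(window)
```

Under these assumptions the number of queries grows with track length, which is what keeps long songs from being underrepresented, while omitting `query_stride` leaves existing single-query manifests unchanged.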

Showing 5 changed files with 85 additions and 17 deletions