parameterize dual-axis hard-case weighting for low-risk experiments

Constrain…
…t: Keep the training pipeline behavior stable while exposing humming_like and confused controls through config only
Rejected: Add a brand-new sampler framework first | The smallest useful step is config-driven control on the existing dataset weighting path
Confidence: high
Scope-risk: narrow
Directive: Run weight-search experiments through training.sample_type_weights and training.pair_type_weights before attempting broader training-stack refactors
Tested: py_compile passed, train.py dry-run on synthetic_v2 passed, and custom SongPairDataset weighting instantiation produced expected hard_weight output
Not-tested: End-to-end retraining and metric improvements from new dual-axis weight combinations
Showing 8 changed files with 137 additions and 44 deletions
| ... | @@ -74,21 +74,22 @@ | ... | @@ -74,21 +74,22 @@ |
| 74 | 74 | ||
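| 75 | ## 5.5 Latest real FMA / chromaprint run state (2026-06-02) | 75 | ## 5.5 Latest real FMA / chromaprint run state (2026-06-02) |
| 76 | 76 | ||
| 77 | ### Current latest snapshot (15:46 UTC) | 77 | ### Current latest snapshot (15:47 UTC) |
| 78 | 78 | ||
| 79 | - Remote sync baseline: `93dfa15` (before update) | 79 | - Remote sync baseline: `7812b58` (before update) |
| 80 | - Most important new evidence: **the source of the hard-case difference between `v5` and `v6` has been explained**. | 80 | - Most important new evidence: **dual-axis hard-case weighting is now parameterized in code**. |
| 81 | - `v5` = `type-aware hard-case weighting`: | 81 | - Currently tunable entry points: |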
| 82 | - `humming_like top1=0.50` | 82 | - `training.sample_type_weights` |
| 83 | - `confused top1=0.00` | 83 | - `training.pair_type_weights` |
| 84 | - `v6` = `sample-level confused-priority weighting`: | 84 | - fresh verification: |
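| 85 | - `humming_like top1=0.25` | 85 | - `py_compile` passed |
| 86 | - `confused top1=0.25` | 86 | - `train.py --dry-run` passed |
| 87 | - Implication: the most worthwhile next step is not another blind sweep, but designing a dual-axis strategy that controls `humming_like` and `confused` separately. | 87 | - Custom-weight instantiation check passed |
| 88 | - Implication: the next round can run weight-search experiments directly, without first restructuring the dataset or training framework. | ||
| 88 | - Events worth committing next: | 89 | - Events worth committing next: |
| 89 | 1. A dual-axis hard-case weighting / sampling scheme lands | 90 | 1. Results of the first round of dual-axis weight experiments |
| 90 | 2. Its improvement of hard-case metrics over `v6` | 91 | 2. A combination that improves `humming_like` without regressing `confused` |
| 91 | 3. Stable dual-track regression results | 92 | 3. Dual-track regression verification results |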
| 92 | 93 | ||
| 93 | 94 | ||
| 94 | ## 6. High-risk caveats | 95 | ## 6. High-risk caveats | ... | ... |
| ... | @@ -31,6 +31,15 @@ training: | ... | @@ -31,6 +31,15 @@ training: |
| 31 | gradient_clip: 1.0 | 31 | gradient_clip: 1.0 |
| 32 | save_every: 10 | 32 | save_every: 10 |
| 33 | log_every: 10 | 33 | log_every: 10 |
| 34 | sample_type_weights: | ||
| 35 | default: 1 | ||
| 36 | humming_like: 3 | ||
| 37 | confused: 5 | ||
| 38 | pair_type_weights: | ||
| 39 | default: 1.0 | ||
| 40 | augmented: 1.4 | ||
| 41 | humming_like: 2.5 | ||
| 42 | confused: 4.0 | ||
| 34 | 43 | ||
| 35 | engine: | 44 | engine: |
| 36 | chromaprint: | 45 | chromaprint: | ... | ... |
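How a partial override under `training.sample_type_weights` or `training.pair_type_weights` combines with the built-in defaults can be sketched as follows. This is a minimal sketch that mirrors the dict-unpacking merge visible in the `SongPairDataset` constructor hunk; `merge_weights` and the two constant names are hypothetical and do not appear in the repository.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]

# Default values copied from the configs/default.yaml hunk.
DEFAULT_SAMPLE_TYPE_WEIGHTS: Dict[str, Number] = {
    "default": 1,
    "humming_like": 3,
    "confused": 5,
}
DEFAULT_PAIR_TYPE_WEIGHTS: Dict[str, Number] = {
    "default": 1.0,
    "augmented": 1.4,
    "humming_like": 2.5,
    "confused": 4.0,
}


def merge_weights(
    defaults: Dict[str, Number],
    overrides: Optional[Dict[str, Number]] = None,
) -> Dict[str, Number]:
    """Hypothetical helper: override keys win, missing keys keep their defaults."""
    return {**defaults, **(overrides or {})}


# Overriding a single type leaves the other entries untouched.
only_humming = merge_weights(DEFAULT_SAMPLE_TYPE_WEIGHTS, {"humming_like": 4})

# A missing config section (None) falls back to the defaults entirely.
no_override = merge_weights(DEFAULT_PAIR_TYPE_WEIGHTS, None)
```

Under this merge, a config file only needs to list the types whose weights it changes.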
| ... | @@ -331,6 +331,8 @@ class SongPairDataset(Dataset): | ... | @@ -331,6 +331,8 @@ class SongPairDataset(Dataset): |
| 331 | augment: bool = True, | 331 | augment: bool = True, |
| 332 | segment_strategy: str = "random", | 332 | segment_strategy: str = "random", |
| 333 | silence_top_db: int = 30, | 333 | silence_top_db: int = 30, |
| 334 | sample_type_weights: Optional[Dict[str, int]] = None, | ||
| 335 | pair_type_weights: Optional[Dict[str, float]] = None, | ||
| 334 | ): | 336 | ): |
| 335 | self.sr = sr | 337 | self.sr = sr |
| 336 | self.n_mels = n_mels | 338 | self.n_mels = n_mels |
| ... | @@ -342,6 +344,19 @@ class SongPairDataset(Dataset): | ... | @@ -342,6 +344,19 @@ class SongPairDataset(Dataset): |
| 342 | self.silence_top_db = silence_top_db | 344 | self.silence_top_db = silence_top_db |
| 343 | self.data_dir = Path(data_dir) | 345 | self.data_dir = Path(data_dir) |
| 344 | self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir | 346 | self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir |
| 347 | self.sample_type_weights = { | ||
| 348 | "default": 1, | ||
| 349 | "humming_like": 3, | ||
| 350 | "confused": 5, | ||
| 351 | **(sample_type_weights or {}), | ||
| 352 | } | ||
| 353 | self.pair_type_weights = { | ||
| 354 | "default": 1.0, | ||
| 355 | "augmented": 1.4, | ||
| 356 | "humming_like": 2.5, | ||
| 357 | "confused": 4.0, | ||
| 358 | **(pair_type_weights or {}), | ||
| 359 | } | ||
| 345 | 360 | ||
| 346 | with open(self.data_dir / f"{split}.json") as f: | 361 | with open(self.data_dir / f"{split}.json") as f: |
| 347 | metadata = json.load(f) | 362 | metadata = json.load(f) |
| ... | @@ -358,12 +373,9 @@ class SongPairDataset(Dataset): | ... | @@ -358,12 +373,9 @@ class SongPairDataset(Dataset): |
| 358 | self.sample_song_ids = [] | 373 | self.sample_song_ids = [] |
| 359 | for sid, items in self.by_song.items(): | 374 | for sid, items in self.by_song.items(): |
| 360 | item_types = {x.get("type") for x in items} | 375 | item_types = {x.get("type") for x in items} |
| 361 | if "confused" in item_types: | 376 | weight = self.sample_type_weights.get("default", 1) |
| 362 | weight = 5 | 377 | for item_type in item_types: |
| 363 | elif "humming_like" in item_types: | 378 | weight = max(weight, int(self.sample_type_weights.get(item_type, weight))) |
| 364 | weight = 3 | ||
| 365 | else: | ||
| 366 | weight = 1 | ||
| 367 | self.sample_song_ids.extend([sid] * weight) | 379 | self.sample_song_ids.extend([sid] * weight) |
| 368 | self.song_to_idx = {sid: i for i, sid in enumerate(self.song_ids)} | 380 | self.song_to_idx = {sid: i for i, sid in enumerate(self.song_ids)} |
| 369 | 381 | ||
| ... | @@ -432,14 +444,9 @@ class SongPairDataset(Dataset): | ... | @@ -432,14 +444,9 @@ class SongPairDataset(Dataset): |
| 432 | else: | 444 | else: |
| 433 | a, b = random.sample(choices, 2) | 445 | a, b = random.sample(choices, 2) |
| 434 | 446 | ||
| 435 | type_to_weight = { | ||
| 436 | "confused": 4.0, | ||
| 437 | "humming_like": 2.5, | ||
| 438 | "augmented": 1.4, | ||
| 439 | } | ||
| 440 | pair_weights = [ | 447 | pair_weights = [ |
| 441 | type_to_weight.get(a.get("type", "unknown"), 1.0), | 448 | self.pair_type_weights.get(a.get("type", "unknown"), self.pair_type_weights.get("default", 1.0)), |
| 442 | type_to_weight.get(b.get("type", "unknown"), 1.0), | 449 | self.pair_type_weights.get(b.get("type", "unknown"), self.pair_type_weights.get("default", 1.0)), |
| 443 | ] | 450 | ] |
| 444 | 451 | ||
| 445 | wavs = [] | 452 | wavs = [] | ... | ... |
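The two lookups changed in this file can be isolated into standalone functions to make their behavior easier to reason about. The sketch below follows the logic shown in the hunks; the function names `song_weight` and `pair_hard_weights` are hypothetical, and the `"original"` type used in the comments is only an illustrative label for a type with no configured weight.

```python
from typing import Dict, Iterable, List, Optional


def song_weight(
    item_types: Iterable[Optional[str]],
    sample_type_weights: Dict[str, int],
) -> int:
    """Song-level multiplicity: the largest weight among the song's item types."""
    weight = sample_type_weights.get("default", 1)
    for item_type in item_types:
        # Unknown types fall back to the running weight, so they never lower it.
        weight = max(weight, int(sample_type_weights.get(item_type, weight)))
    return weight


def pair_hard_weights(
    type_a: Optional[str],
    type_b: Optional[str],
    pair_type_weights: Dict[str, float],
) -> List[float]:
    """Pair-level weights: one lookup per item, falling back to "default"."""
    default = pair_type_weights.get("default", 1.0)
    return [
        pair_type_weights.get(type_a, default),
        pair_type_weights.get(type_b, default),
    ]


SAMPLE_DEFAULTS = {"default": 1, "humming_like": 3, "confused": 5}
PAIR_DEFAULTS = {"default": 1.0, "augmented": 1.4, "humming_like": 2.5, "confused": 4.0}

# A song containing both hard-case types takes the larger of the two weights.
mixed = song_weight(["humming_like", "confused"], SAMPLE_DEFAULTS)

# A type weight below "default" has no effect because of the max().
floored = song_weight(["confused"], {"default": 2, "confused": 1})

# One hard-case item paired with an unconfigured type.
pair = pair_hard_weights("confused", "original", PAIR_DEFAULTS)
```

Two consequences follow from the code as written: `int(...)` truncates fractional sample weights, and `max(...)` means a per-type sample weight can raise a song's multiplicity above `default` but cannot lower it.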
| ... | @@ -157,6 +157,8 @@ def main(): | ... | @@ -157,6 +157,8 @@ def main(): |
| 157 | augment=True, | 157 | augment=True, |
| 158 | segment_strategy=args.segment_strategy, | 158 | segment_strategy=args.segment_strategy, |
| 159 | silence_top_db=args.silence_top_db, | 159 | silence_top_db=args.silence_top_db, |
| 160 | sample_type_weights=cfg["training"].get("sample_type_weights"), | ||
| 161 | pair_type_weights=cfg["training"].get("pair_type_weights"), | ||
| 160 | ) | 162 | ) |
| 161 | 163 | ||
| 162 | catalog_dataset = ACRDataset( | 164 | catalog_dataset = ACRDataset( | ... | ... |
| 1 | ## 2026-06-02 15:47 UTC / dual-axis hard-case weighting is now configurable in code | ||
| 2 | |||
| 3 | - Changed the hard-case sampling weights and pair loss weights in `SongPairDataset` from hard-coded values to config-driven ones | ||
| 4 | - Code changes: | ||
| 5 | - `src/data/dataset.py`: added the `sample_type_weights` / `pair_type_weights` parameters | ||
| 6 | - `train.py`: passes these settings through from `cfg["training"]` | ||
| 7 | - `configs/default.yaml`: added default dual-axis hard-case weight settings | ||
| 8 | - fresh verification: | ||
| 9 | - `python -m py_compile train.py src/data/dataset.py` passed | ||
| 10 | - `train.py --data data/synthetic_v2 --device cpu --epochs 1 --batch-size 4 --dry-run` passed | ||
| 11 | - Custom-weight instantiation check passed: | ||
| 12 | - `dataset_len=96` | ||
| 13 | - `unique_songs=16` | ||
| 14 | - `sample_multiplicity_minmax=6/6` | ||
| 15 | - Example `hard_weight=[5.0, 1.0]` | ||
| 16 | - Conclusion: the next round can experiment with dual-axis weighting combinations for `humming_like` / `confused` without changing the code structure | ||
| 17 | |||
| 1 | ## 2026-06-02 15:46 UTC / v5-v6 hard-case difference is now causally explained | 18 | ## 2026-06-02 15:46 UTC / v5-v6 hard-case difference is now causally explained |
| 2 | 19 | ||
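| 3 | - Based on the historical experiment records in the repository, filled in the explanation of where the `v5` vs `v6` hard-case performance difference comes from | 20 | - Based on the historical experiment records in the repository, filled in the explanation of where the `v5` vs `v6` hard-case performance difference comes from | ... | ... |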
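The instantiation check logged in the 15:47 UTC entry reports `dataset_len=96`, `unique_songs=16`, and `sample_multiplicity_minmax=6/6`, which is consistent with every song being repeated six times (16 × 6 = 96). A check of that shape might look like the sketch below. The song ids and the per-song weight of 6 are assumptions chosen to match the logged numbers, since the custom weights used in the real check are not recorded here, and the sketch also assumes the dataset length equals the length of `sample_song_ids`.

```python
from collections import Counter

# Hypothetical stand-ins: 16 song ids, each assigned a sampling weight of 6.
song_ids = [f"song_{i:02d}" for i in range(16)]
weight_per_song = 6  # assumed value, not taken from the repository

# Same expansion as in SongPairDataset: repeat each song id `weight` times.
sample_song_ids = []
for sid in song_ids:
    sample_song_ids.extend([sid] * weight_per_song)

# Count how often each song appears in the expanded sampling list.
counts = Counter(sample_song_ids)
summary = {
    "dataset_len": len(sample_song_ids),
    "unique_songs": len(counts),
    "sample_multiplicity_minmax": (min(counts.values()), max(counts.values())),
}
```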
| ... | @@ -272,3 +272,28 @@ | ... | @@ -272,3 +272,28 @@ |
| 272 | 272 | ||
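| 273 | - We now know not only which of `v5/v6` is stronger, but also why. | 273 | - We now know not only which of `v5/v6` is stronger, but also why. |
| 274 | - The next round should model or weight `humming_like` and `confused` separately. | 274 | - The next round should model or weight `humming_like` and `confused` separately. |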
| 275 | |||
| 276 | ## Additional delivery in this update (2026-06-02 15:47 UTC) | ||
| 277 | |||
| 278 | ### New code capabilities | ||
| 279 | |||
| 280 | | File | Change | | ||
| 281 | |---|---| | ||
| 282 | | [../acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | hard-case sampling weights and pair weights are now config-driven | | ||
| 283 | | [../acr-engine/train.py](../acr-engine/train.py) | the training pipeline passes the dual-axis weight config through | | ||
| 284 | | [../acr-engine/configs/default.yaml](../acr-engine/configs/default.yaml) | added default `sample_type_weights` / `pair_type_weights` settings | | ||
| 285 | |||
| 286 | ### Most important fresh evidence | ||
| 287 | |||
| 288 | - `python -m py_compile train.py src/data/dataset.py`: passed | ||
| 289 | - `train.py --data data/synthetic_v2 --device cpu --epochs 1 --batch-size 4 --dry-run`: passed | ||
| 290 | - Custom-weight instantiation check: | ||
| 291 | - `dataset_len=96` | ||
| 292 | - `unique_songs=16` | ||
| 293 | - `sample_multiplicity_minmax=6/6` | ||
| 294 | - `hard_weight=[5.0, 1.0]` | ||
| 295 | |||
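| 296 | ### Conclusion | ||
| 297 | |||
| 298 | - Dual-axis hard-case weighting has moved from a "design suggestion" to something that can be tuned and tested directly in code. | ||
| 299 | - The next round can run minimal experiments directly on `sample_type_weights` and `pair_type_weights`. | ... | ... |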
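A minimal experiment over these two keys could start from a small grid of config overrides, one per (`humming_like`, `confused`) combination. The sketch below only builds the override dictionaries; `build_weight_grid` is a hypothetical helper, the candidate values are placeholders rather than recommended settings, and how each override is merged into the YAML config and fed to `train.py` is left open.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def build_weight_grid(
    key: str,
    humming_values: Sequence[Number],
    confused_values: Sequence[Number],
) -> List[Dict[str, Dict[str, Dict[str, Number]]]]:
    """Hypothetical helper: one override dict per (humming_like, confused) pair.

    `key` is either "sample_type_weights" or "pair_type_weights".
    """
    grid = []
    for humming, confused in product(humming_values, confused_values):
        grid.append(
            {"training": {key: {"humming_like": humming, "confused": confused}}}
        )
    return grid


# Placeholder candidate values around the defaults (3 and 5); not tuned.
sample_grid = build_weight_grid("sample_type_weights", [2, 3, 4], [4, 5, 6])
```

Keeping each override limited to the two hard-case types leaves `default` and `augmented` at their configured values, so any metric change can be attributed to the two axes under test.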
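| 1 | ## Additional update to this delivery package (2026-06-02 15:47 UTC) | ||
| 2 | |||
| 3 | ### Delivery conclusion | ||
| 4 | |||
| 5 | The latest milestone has advanced from "knowing we should do dual-axis" to **dual-axis hard-case weighting is now parameterized in code**: | ||
| 6 | - Current remote baseline: `7812b58` (before update) | ||
| 7 | - `sample_type_weights` and `pair_type_weights` are now configurable | ||
| 8 | - The training dry-run passed | ||
| 9 | - So the next round can go straight to minimal tuning experiments instead of restructuring code first | ||
| 10 | |||
| 11 | ### Latest facts | ||
| 12 | |||
| 13 | #### Code locations | ||
| 14 | - `src/data/dataset.py`: | ||
| 15 | - `sample_type_weights` controls song-level sampling multiplicity | ||
| 16 | - `pair_type_weights` controls the pair-level `hard_weight` | ||
| 17 | - `train.py`: passes the settings through from the `training` config | ||
| 18 | - `configs/default.yaml`: provides the default dual-axis config | ||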
| 19 | |||
| 20 | #### fresh verification | ||
| 21 | - `python -m py_compile train.py src/data/dataset.py`: passed | ||
| 22 | - `train.py --data data/synthetic_v2 --device cpu --epochs 1 --batch-size 4 --dry-run`: passed | ||
| 23 | - Custom-weight instantiation check: | ||
| 24 | - `dataset_len=96` | ||
| 25 | - `sample_multiplicity_minmax=6/6` | ||
| 26 | - `hard_weight=[5.0, 1.0]` | ||
| 27 | |||
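| 28 | ### Current assessment | ||
| 29 | |||
| 30 | - There is now a minimal, low-risk dual-axis entry point that supports repeated experiments. | ||
| 31 | - The most worthwhile next step is to search `humming_like` / `confused` weight combinations directly, rather than continuing read-only analysis. | ||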
| 32 | |||
| 33 | --- | ||
| 34 | |||
| 1 | ## Additional update to this delivery package (2026-06-02 15:46 UTC) | 35 | ## Additional update to this delivery package (2026-06-02 15:46 UTC) |
| 2 | 36 | ||
| 3 | ### Delivery conclusion | 37 | ### Delivery conclusion | ... | ... |
| ... | @@ -5,24 +5,22 @@ | ... | @@ -5,24 +5,22 @@ |
| 5 | 5 | ||
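| 6 | ## One-page summary | 6 | ## One-page summary |
| 7 | 7 | ||
| 8 | ### Latest delivery snapshot (2026-06-02 15:46 UTC) | 8 | ### Latest delivery snapshot (2026-06-02 15:47 UTC) |
| 9 | 9 | ||
| 10 | - Current remote sync baseline: `93dfa15` (before update) | 10 | - Current remote sync baseline: `7812b58` (before update) |
| 11 | - Most important new fact: **the source of the hard-case difference between `v5` and `v6` has been explained** | 11 | - Most important new fact: **dual-axis hard-case weighting is now parameterized in code** |
| 12 | - `v5`: `type-aware hard-case weighting` | 12 | - New tunable entry points: |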
| 13 | - `humming_like top1=0.50` | 13 | - `training.sample_type_weights` |
| 14 | - `confused top1=0.00` | 14 | - `training.pair_type_weights` |
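| 15 | - `v6`: `sample-level confused-priority weighting` | 15 | - fresh verification: |
| 16 | - `humming_like top1=0.25` | 16 | - `py_compile` passed |
| 17 | - `confused top1=0.25` | 17 | - `train.py --dry-run` passed |
| 18 | - Conclusion: | 18 | - Custom-weight instantiation check passed |
| 19 | - `v5` leans toward improving `humming_like` | 19 | - Conclusion: the next round does not need code restructuring first; minimal tuning experiments can start directly. |
| 20 | - `v6` leans toward improving `confused` | ||
| 21 | - The next round should design a dual-axis hard-case weighting / divide-and-conquer strategy rather than continue single-axis weighting | ||
| 22 | - Top priorities for the new session: | 20 | - Top priorities for the new session: |
| 23 | 1. Design weighting or sampling that controls `humming_like` / `confused` separately | 21 | 1. Search dual-axis weight combinations on top of the `v6` main baseline |
| 24 | 2. Reuse the existing `v6` main baseline for minimal-change experiments | 22 | 2. Prioritize improving `humming_like top1` while not losing `confused top1` |
| 25 | 3. Run dual-track regression with real-path clean + synthetic hard-case | 23 | 3. Re-test on both tracks with real-path clean + synthetic hard-case |
| 26 | 24 | ||
| 27 | ### Latest observability fix (2026-06-02 15:18 UTC) | 25 | ### Latest observability fix (2026-06-02 15:18 UTC) |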
| 28 | 26 | ... | ... |
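The second priority above (improve `humming_like top1` without losing `confused top1`) can be turned into an explicit acceptance rule for experiment results. The sketch below is one possible reading of that rule, not the project's evaluation code: `pick_candidate` and the result-dictionary keys are hypothetical, and the only numbers taken from the docs are the `v6` baseline values (`humming_like top1=0.25`, `confused top1=0.25`).

```python
from typing import Dict, List, Optional

# v6 baseline as recorded in the status docs.
V6_BASELINE: Dict[str, float] = {"humming_like_top1": 0.25, "confused_top1": 0.25}


def pick_candidate(
    results: List[Dict[str, float]],
    baseline: Dict[str, float] = V6_BASELINE,
) -> Optional[Dict[str, float]]:
    """Hypothetical rule: keep runs that beat the baseline on humming_like
    and do not regress on confused, then return the best humming_like run."""
    eligible = [
        r
        for r in results
        if r["humming_like_top1"] > baseline["humming_like_top1"]
        and r["confused_top1"] >= baseline["confused_top1"]
    ]
    if not eligible:
        return None  # no combination satisfies both constraints
    return max(eligible, key=lambda r: r["humming_like_top1"])
```

Returning `None` when nothing qualifies keeps a run that trades `confused` accuracy for `humming_like` accuracy from being reported as an improvement.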