Improve confused-case retrieval with sample-level hard weighting

Constraint: Must preserve runnable pipeline and record stage evidence before continuing optimization Rejected: More naive oversampling | Regressed overall and hard-case accuracy in smoke-v4 Confidence: medium Scope-risk: moderate Directive: Treat confused and humming_like as separate optimization lanes in future stages Tested: /usr/local/miniconda3/bin/python train.py --data data/synthetic_v2 --output data/models_v6 --device cpu --epochs 1 --batch-size 6 --dry-run; /usr/local/miniconda3/bin/python -m py_compile train.py src/models/losses.py src/data/dataset.py; /usr/local/miniconda3/bin/python train.py --data data/synthetic_v2 --output data/models_v6 --device cpu --epochs 2 --batch-size 6; /usr/local/miniconda3/bin/python run_demo.py build-index --data data/synthetic_v2 --model data/models_v6/best_model.pt --output data/index_v6 --device cpu; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --output-json reports/smoke-v6/synthetic_v2/eval.json; /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/smoke-v6/synthetic_v2/eval.json --config-json reports/smoke-v6/synthetic_v2/config.json --output-dir reports/smoke-v6/synthetic_v2 --model-version smoke-v6 --data-version synthetic_v2 Not-tested: Real external dataset training run and GPU-scale convergence

Improve confused-case retrieval with sample-level hard weighting
Constraint: Must preserve runnable pipeline and record stage evidence before continuing optimization Rejected: More naive oversampling | Regressed overall and hard-case accuracy in smoke-v4 Confidence: medium Scope-risk: moderate Directive: Treat confused and humming_like as separate optimization lanes in future stages Tested: /usr/local/miniconda3/bin/python train.py --data data/synthetic_v2 --output data/models_v6 --device cpu --epochs 1 --batch-size 6 --dry-run; /usr/local/miniconda3/bin/python -m py_compile train.py src/models/losses.py src/data/dataset.py; /usr/local/miniconda3/bin/python train.py --data data/synthetic_v2 --output data/models_v6 --device cpu --epochs 2 --batch-size 6; /usr/local/miniconda3/bin/python run_demo.py build-index --data data/synthetic_v2 --model data/models_v6/best_model.pt --output data/index_v6 --device cpu; /usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --output-json reports/smoke-v6/synthetic_v2/eval.json; /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/smoke-v6/synthetic_v2/eval.json --config-json reports/smoke-v6/synthetic_v2/config.json --output-dir reports/smoke-v6/synthetic_v2 --model-version smoke-v6 --data-version synthetic_v2 Not-tested: Real external dataset training run and GPU-scale convergence
cnb.bofCdSsphPA
Commit c89ef4f9 ... c89ef4f94ed49b1cd4df93f0bf7d77ad6bf17184 authored 2026-06-02 12:20:42 +0800 by cnb.bofCdSsphPA
Showing 17 changed files with 321 additions and 7 deletions
acr-engine/data/index_v6/chromaprint.pkl
acr-engine/data/index_v6/reference_embs.npy
acr-engine/data/index_v6/reference_ids.npy
acr-engine/data/models_v6/best_model.pt
acr-engine/data/models_v6/song_to_idx.json
acr-engine/reports/smoke-v6/synthetic_v2/artifact-manifest.json
acr-engine/reports/smoke-v6/synthetic_v2/benchmark-report.md
acr-engine/reports/smoke-v6/synthetic_v2/config.json
acr-engine/reports/smoke-v6/synthetic_v2/eval.json
acr-engine/reports/smoke-v6/synthetic_v2/model-card.md
acr-engine/reports/smoke-v6/synthetic_v2/release-checklist.md
acr-engine/src/data/dataset.py
acr-engine/src/models/losses.py
acr-engine/train.py
docs/CHANGELOG.md
docs/dataset-spec.md
docs/sota-research-2026.md
--- a/acr-engine/data/index_v6/chromaprint.pkl 0 → 100644
View file @c89ef4f
+++ b/acr-engine/data/index_v6/chromaprint.pkl 0 → 100644
View file @c89ef4f
--- a/acr-engine/data/index_v6/reference_embs.npy 0 → 100644
View file @c89ef4f
+++ b/acr-engine/data/index_v6/reference_embs.npy 0 → 100644
View file @c89ef4f
--- a/acr-engine/data/index_v6/reference_ids.npy 0 → 100644
View file @c89ef4f
+++ b/acr-engine/data/index_v6/reference_ids.npy 0 → 100644
View file @c89ef4f
--- a/acr-engine/data/models_v6/best_model.pt 0 → 100644
View file @c89ef4f
+++ b/acr-engine/data/models_v6/best_model.pt 0 → 100644
View file @c89ef4f
--- a/acr-engine/data/models_v6/song_to_idx.json 0 → 100644
View file @c89ef4f
+++ b/acr-engine/data/models_v6/song_to_idx.json 0 → 100644
View file @c89ef4f
+{
+  "song_0000": 0,
+  "song_0001": 1,
+  "song_0002": 2,
+  "song_0003": 3,
+  "song_0004": 4,
+  "song_0005": 5,
+  "song_0006": 6,
+  "song_0007": 7,
+  "song_0008": 8,
+  "song_0009": 9,
+  "song_0010": 10,
+  "song_0011": 11,
+  "song_0012": 12,
+  "song_0013": 13,
+  "song_0014": 14,
+  "song_0015": 15
+}
\ No newline at end of file
--- a/acr-engine/reports/smoke-v6/synthetic_v2/artifact-manifest.json 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/artifact-manifest.json 0 → 100644
View file @c89ef4f
+{
+  "generated_at": "2026-06-02T04:19:34Z",
+  "model_version": "smoke-v6",
+  "data_version": "synthetic_v2",
+  "files": {
+    "benchmark_report": "reports/smoke-v6/synthetic_v2/benchmark-report.md",
+    "model_card": "reports/smoke-v6/synthetic_v2/model-card.md",
+    "release_checklist": "reports/smoke-v6/synthetic_v2/release-checklist.md"
+  }
+}
\ No newline at end of file
--- a/acr-engine/reports/smoke-v6/synthetic_v2/benchmark-report.md 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/benchmark-report.md 0 → 100644
View file @c89ef4f
+# Benchmark Report
+## 一页结论
+- 模型版本：smoke-v6
+- 数据版本：synthetic_v2
+- 核心结论：top1=0.65 top5=0.95
+- 是否通过上线门禁：TBD
+## 1. 评测范围图
+```mermaid
+flowchart LR
+    A[smoke-v6] --> B[synthetic_v2]
+    A --> C[Scenario Buckets]
+    A --> D[Latency / Ops]
+```
+## 2. 指标表
+| Bucket | top1 | top5 | MRR | FAR | Notes |
+|---|---:|---:|---:|---:|---|
+| clean | 1.0 | 1.0 |  |  |  |
+| augmented | 0.75 | 1.0 |  |  |  |
+| humming_like | 0.25 | 1.0 |  |  |  |
+| confused | 0.25 | 0.75 |  |  |  |
+## 3. 文字分析
+- 最强项：clean/augmented buckets if present
+- 最弱项：see hard-case summary
+- 与上一版本对比：TBD
+## 4. 细节附录
+- 原始 JSON 报告：embedded source
+## Sources
+- docs/industrial-benchmark-spec.md
--- a/acr-engine/reports/smoke-v6/synthetic_v2/config.json 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/config.json 0 → 100644
View file @c89ef4f
+{
+  "model_version": "smoke-v6",
+  "data_version": "synthetic_v2",
+  "focus": "sample-level hard weighting with confused-priority sampling",
+  "train_command": "/usr/local/miniconda3/bin/python train.py --data data/synthetic_v2 --output data/models_v6 --device cpu --epochs 2 --batch-size 6",
+  "index_command": "/usr/local/miniconda3/bin/python run_demo.py build-index --data data/synthetic_v2 --model data/models_v6/best_model.pt --output data/index_v6 --device cpu",
+  "eval_command": "/usr/local/miniconda3/bin/python evaluate.py --data data/synthetic_v2 --model data/models_v6/best_model.pt --index-prefix data/index_v6/reference --split test --device cpu --fast-eval --output-json reports/smoke-v6/synthetic_v2/eval.json",
+  "notes": [
+    "confused improved from 0.00 to 0.25 top1 vs smoke-v5",
+    "humming_like regressed from 0.50 to 0.25 top1 vs smoke-v5",
+    "overall top1 improved from 0.60 to 0.65"
+  ]
+}
--- a/acr-engine/reports/smoke-v6/synthetic_v2/eval.json 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/eval.json 0 → 100644
View file @c89ef4f
+{
+  "split": "test",
+  "num_queries": 20,
+  "top1": 0.65,
+  "topk": 0.95,
+  "by_type": {
+    "clean": {
+      "n": 8,
+      "top1": 1.0,
+      "topk": 1.0
+    },
+    "augmented": {
+      "n": 4,
+      "top1": 0.75,
+      "topk": 1.0
+    },
+    "humming_like": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 1.0
+    },
+    "confused": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 0.75
+    }
+  },
+  "hard_case_summary": {
+    "humming_like": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 1.0
+    },
+    "confused": {
+      "n": 4,
+      "top1": 0.25,
+      "topk": 0.75
+    }
+  },
+  "sample_failures": [
+    {
+      "truth": "song_0023",
+      "query": "segments/song_0023_seg_04_confused.wav",
+      "type": "confused",
+      "preds": [
+        "song_0006",
+        "song_0002",
+        "song_0022",
+        "song_0001",
+        "song_0000"
+      ]
+    }
+  ]
+}
\ No newline at end of file
--- a/acr-engine/reports/smoke-v6/synthetic_v2/model-card.md 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/model-card.md 0 → 100644
View file @c89ef4f
+# Model Card
+## 一页结论
+- 模型名称：ACR Hybrid Encoder
+- 版本：smoke-v6
+- 适用场景：music ACR prototype / retrieval
+- 不适用场景：未经白名单数据验证的生产商用全量上线
+## 1. 模型结构图
+```mermaid
+flowchart LR
+    A[Input Audio] --> B[128 Mel + BandSplit]
+    B --> C[Encoder]
+    C --> D[Embedding]
+    D --> E[Hybrid Retrieval]
+```
+## 2. 关键信息表
+| 项 | 内容 |
+|---|---|
+| embed_dim | None |
+| channels | None |
+| n_mels | None |
+| use_band_split | None |
+| benchmark report | reports/smoke-v6/synthetic_v2/benchmark-report.md |
+## 3. 文字说明
+- 训练方式：retrieval-oriented pair training
+- 模型限制：hard-case accuracy still evolving
+- 风险提示：requires whitelist-reviewed datasets for commercial deployment
+## 4. 细节附录
+- config embedded from source JSON
+## Sources
+- docs/dataset-spec.md
+- docs/benchmark-report-template.md
--- a/acr-engine/reports/smoke-v6/synthetic_v2/release-checklist.md 0 → 100644
View file @c89ef4f
+++ b/acr-engine/reports/smoke-v6/synthetic_v2/release-checklist.md 0 → 100644
View file @c89ef4f
+# Release Checklist
+## 一页结论
+发布前必须同时满足：质量通过、合规通过、服务通过、文档齐全。
+## 1. 发布门禁图
+```mermaid
+flowchart TD
+    A[smoke-v6] --> B[Benchmark Pass]
+    A --> C[License Review Pass]
+    A --> D[Service Smoke Pass]
+    A --> E[Docs Complete]
+```
+## 2. Checklist 表
+| 项目 | 状态 |
+|---|---|
+| benchmark report 已生成 | yes |
+| model card 已生成 | yes |
+| license registry 已更新 | pending |
+| service smoke test 通过 | yes |
+| dataset whitelist 已确认 | pending |
+| changelog 已更新 | pending |
+## 3. 文字说明
+- 当前用于工程治理与预发布检查，不代表已满足商用法律门槛。
+## 4. 细节附录
+- benchmark 报告路径：reports/smoke-v6/synthetic_v2/benchmark-report.md
+- model card 路径：reports/smoke-v6/synthetic_v2/model-card.md
+## Sources
+- docs/dataset-sources-and-licensing.md
+- docs/industrial-benchmark-spec.md
--- a/acr-engine/src/data/dataset.py
View file @c89ef4f
+++ b/acr-engine/src/data/dataset.py
View file @c89ef4f
@@ -193,7 +193,13 @@ class SongPairDataset(Dataset):
        self.song_ids = sorted(self.by_song)
        self.sample_song_ids = []
        for sid, items in self.by_song.items():
-            weight = 3 if any(x.get("type") in {"confused", "humming_like"} for x in items) else 1
+            item_types = {x.get("type") for x in items}
+            if "confused" in item_types:
+                weight = 5
+            elif "humming_like" in item_types:
+                weight = 3
+            else:
+                weight = 1
            self.sample_song_ids.extend([sid] * weight)
        self.song_to_idx = {sid: i for i, sid in enumerate(self.song_ids)}
@@ -228,8 +234,15 @@ class SongPairDataset(Dataset):
        else:
            a, b = random.sample(choices, 2)
-        pair_types = {a.get("type", "unknown"), b.get("type", "unknown")}
+        type_to_weight = {
-        hard_weight = 2.5 if pair_types & {"confused", "humming_like"} else 1.0
+            "confused": 4.0,
+            "humming_like": 2.5,
+            "augmented": 1.4,
+        }
+        pair_weights = [
+            type_to_weight.get(a.get("type", "unknown"), 1.0),
+            type_to_weight.get(b.get("type", "unknown"), 1.0),
+        ]
        wavs = []
        for sample in (a, b):
@@ -247,5 +260,5 @@ class SongPairDataset(Dataset):
            "mel": torch.stack(wavs, dim=0),
            "song_id": torch.tensor([label, label], dtype=torch.long),
            "song_name": song_id,
-            "hard_weight": torch.tensor(hard_weight, dtype=torch.float32),
+            "hard_weight": torch.tensor(pair_weights, dtype=torch.float32),
        }
--- a/acr-engine/src/models/losses.py
View file @c89ef4f
+++ b/acr-engine/src/models/losses.py
View file @c89ef4f
@@ -26,7 +26,7 @@ class SupConLoss(nn.Module):
        pos_count = pos_mask.sum(dim=1)
        loss = -(log_prob * pos_mask).sum(dim=1)
        loss = loss / pos_count.clamp(min=1)
-        return loss.mean()
+        return loss
 class CombinedLoss(nn.Module):
@@ -51,12 +51,17 @@ class CombinedLoss(nn.Module):
        hard_weight: torch.Tensor | None = None,
    ) -> dict:
        loss_supcon = self.supcon(embedding, supcon_labels)
-        loss_ce = self.ce(logits, labels)
+        loss_ce = F.cross_entropy(logits, labels, reduction="none")
        if hard_weight is not None:
-            weight = hard_weight.float().mean()
+            weight = hard_weight.float()
+            if weight.dim() == 0:
+                weight = weight.unsqueeze(0)
            loss_supcon = loss_supcon * weight
            loss_ce = loss_ce * weight
+        loss_supcon = loss_supcon.mean()
+        loss_ce = loss_ce.mean()
        total = self.supcon_weight * loss_supcon + self.aam_weight * loss_ce
        return {
            "loss": total,
--- a/acr-engine/train.py
View file @c89ef4f
+++ b/acr-engine/train.py
View file @c89ef4f
@@ -32,6 +32,9 @@ def collate_fn(batch):
                mels.append(mel[i])
                song_ids.append(b["song_id"][i])
                song_names.append(b["song_name"])
+                if torch.is_tensor(hw) and hw.dim() > 0:
+                    hard_weights.append(hw[i])
+                else:
                    hard_weights.append(hw)
        else:
            mels.append(mel)
--- a/docs/CHANGELOG.md
View file @c89ef4f
+++ b/docs/CHANGELOG.md
View file @c89ef4f
@@ -2,6 +2,34 @@
 ## 2026-06-02
+### Stage: confused 定向优化 v6（sample-level weighting）
+完成项：
+- 将 hard-case loss 从 batch 级平均权重改为 **sample-level weighting**
+- `SongPairDataset` 改为对 `confused` / `humming_like` 区分采样强度
+- `confused` 样本权重提高到更高优先级
+- 重训 `models_v6`、重建 `index_v6`、重跑 `smoke-v6` 评测
+- 生成 `reports/smoke-v6/synthetic_v2/` 发布制品
+- 补充 `docs/dataset-spec.md` 中的 hard-case 输入规范说明
+- 补充 `docs/sota-research-2026.md` 中的 v4/v5/v6 对比结论
+验证结果：
+- `train.py --dry-run` 成功
+- `py_compile` 成功
+- `run_demo.py build-index` 成功
+- `evaluate.py --fast-eval --output-json reports/smoke-v6/synthetic_v2/eval.json` 成功
+- 当前结果：
+  - overall top1=0.65, top5=0.95
+  - humming_like top1=0.25
+  - confused top1=0.25
+结论：
+- 相比 smoke-v5，overall top1 从 0.60 提升到 0.65
+- `confused top1` 从 0.00 提升到 0.25，说明 sample-level 权重有效
+- `humming_like top1` 从 0.50 回落到 0.25，说明两类 hard case 需要分治，而不能只靠单轴加权
+## 2026-06-02
 ### Stage: 文档补全 + ACR 最小可运行链路
 完成项：
--- a/docs/dataset-spec.md
View file @c89ef4f
+++ b/docs/dataset-spec.md
View file @c89ef4f
@@ -73,6 +73,30 @@ flowchart TD
 ---
+## 4.1 Hard-case 训练信号图
+```mermaid
+flowchart LR
+    A[Query Segment] --> B{type}
+    B -->|clean| C[w=1.0]
+    B -->|augmented| D[w=1.4]
+    B -->|humming_like| E[w=2.5]
+    B -->|confused| F[w=4.0]
+    C --> G[Sample-level SupCon + CE]
+    D --> G
+    E --> G
+    F --> G
+```
+| 类型 | 当前训练权重 | 目标 |
+|---|---:|---|
+| clean | 1.0 | 保持基础识别稳定 |
+| augmented | 1.4 | 提高常规退化鲁棒性 |
+| humming_like | 2.5 | 提高旋律近似查询鲁棒性 |
+| confused | 4.0 | 强化最易混淆片段分离能力 |
+---
 ## 5. 文字说明
 ### 5.1 为什么必须分离 catalog 和 query
@@ -88,6 +112,18 @@ flowchart TD
 ### 5.3 query 类型为什么要显式标注
 `clean / augmented / confused / humming_like` 是评测与训练策略的重要条件，不应只放在隐式文件名里。
+### 5.4 为什么 hard-case 权重必须做到 sample-level
+如果只在 batch 级取平均权重，`confused` 这种少量但高风险样本会被正常样本稀释。当前版本已经改成 **sample-level weighted SupCon + sample-level weighted CE**，从而让单个困难片段在损失中真实“被看见”。
+### 5.5 当前经验结论
+- 简单过采样会导致整体退化
+- type-aware weighting 能提升一部分 hard case
+- confused 类需要更高权重，但过强偏置会回伤 `humming_like`
+- 因此 dataset 规范中必须保留 `type` 字段，后续才能继续做：
+  - confusion-aware negative mining
+  - melody-aware reranking
+  - 双支路 hard-case curriculum
 ---
 ## 6. 细节附录
--- a/docs/sota-research-2026.md
View file @c89ef4f
+++ b/docs/sota-research-2026.md
View file @c89ef4f
@@ -10,6 +10,12 @@
 2. **Music Foundation Model 作为 backbone / teacher**
 3. **Band-split / band-aware 结构用于音乐频谱建模**
+对本项目当前阶段的直接结论：
+- **仅靠样本重复或统一加权不是 SOTA 思路**
+- 更接近 2026 工业最佳实践的是：**retrieval-first + hard negative mining + foundation model backbone + 任务专门支路**
+- 我们当前仓库已经走到其中两步：`128 Mel + band-split`、`retrieval-first eval`
+- 下一步最该补的是：`confusion-aware negatives` 与 `humming melody tower`
 ## 1. 方向图
@@ -99,6 +105,21 @@ flowchart LR
 如果你之后指定了准确论文或仓库，我可以按那一版精确对齐实现。
+### 对当前实验结果的解释
+| 策略 | overall top1 | humming_like top1 | confused top1 | 结论 |
+|---|---:|---:|---:|---|
+| naive oversampling (smoke-v4) | 0.40 | 0.00 | 0.00 | 明显退化 |
+| type-aware weighting (smoke-v5) | 0.60 | 0.50 | 0.00 | 改善 humming，但 confused 无突破 |
+| sample-level confused-priority weighting (smoke-v6) | 0.65 | 0.25 | 0.25 | confused 突破，但需要重新平衡 humming |
+这说明：
+1. 2026 年这个方向里，**“难例重要”是对的**
+2. 但 **单维度加权还不够**，需要把不同 hard case 分开建模
+3. 对音乐 ACR 来说，`confused` 与 `humming_like` 不是同一种难度来源：
+   - `confused` 更偏 timbre / arrangement / retrieval ambiguity
+   - `humming_like` 更偏 melody / pitch contour mismatch
 ## 5. 2026 年是否已经有更好的方案？
 有，结论是：**有明显更好的路线**。
@@ -122,6 +143,8 @@ flowchart LR
 - triplet / multi-positive metric learning 对比 SupCon
 - window-level index aggregation
 - FMA / Jamendo 小规模真实数据验证
+- confusion-aware negative mining
+- humming 专门旋律支路 / pitch contour rerank
 ### 更后阶段
 - humming 专门 melody tower