Map business asset types into runnable training and bucket guidance for the next session

Constraint: Keep this checkpoint documentation-first and avoid staging dataset, cache, or model artifacts
Rejected: Leave the asset-type strategy implicit in chat only | The next session needs repo-native guidance and templates
Confidence: high
Scope-risk: narrow
Directive: Treat type-based buckets as a starting scaffold and keep hard-negative curation manual until evidence supports automation
Tested: Parsed both bucket JSON templates and rechecked 104 relative links across the new docs
Not-tested: Did not run a fresh business-type benchmark in this checkpoint

Commit 8739bf35 ... 8739bf35b4d3f5d1f40f093794e321b2de19a842 authored 2026-06-02 18:53:40 +0800 by cnb.bofCdSsphPA
Showing 7 changed files with 304 additions and 2 deletions
AGENT.md
acr-engine/configs/buckets/business_type_bucket_template.json
docs/CHANGELOG.md
docs/business-music-bucket-and-type-guide.md
docs/open-dataset-workflow.md
docs/session-handoff.md
docs/training-data-and-pgvector-guide.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -53,7 +53,9 @@
 ## 5. Current resume priorities
 1. Upgrade the toy prefix buckets to semantic buckets.
-   - Template entry point: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
+   - Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
+   - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json`
+   - Business guide: `docs/business-music-bucket-and-type-guide.md`
 2. Add the cap64 multi-seed aggregate.
 3. Update:
   - `docs/open-dataset-workflow.md`
--- a/acr-engine/configs/buckets/business_type_bucket_template.json 0 → 100644
+++ b/acr-engine/configs/buckets/business_type_bucket_template.json 0 → 100644
+{
+  "notes": {
+    "purpose": "Business-oriented bucket template mapped from asset types such as lossless master, compressed master, short-video clip, chorus clip, and demo.",
+    "how_to_use": "Replace placeholder patterns with exported file lists or curated globs from your own asset store before running ab_smoke_bucketed.py.",
+    "warning": "Keep reference and query roles logically separate even if they share the same song_id."
+  },
+  "buckets": [
+    {
+      "name": "lossless_reference_core",
+      "patterns": [
+        "business_audio/*/type10_or_type11_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "type 10/11 high-quality master/reference candidates"
+    },
+    {
+      "name": "compressed_reference_realworld",
+      "patterns": [
+        "business_audio/*/type1_or_type9_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "type 1/9 compressed real-world distribution"
+    },
+    {
+      "name": "short_video_hook",
+      "patterns": [
+        "business_audio/*/type7_type8_type16_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "short-video or chorus-hook recognition bucket"
+    },
+    {
+      "name": "with_harmony_shift",
+      "patterns": [
+        "business_audio/*/type2_or_type12_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "with-harmony accompaniment robustness bucket"
+    },
+    {
+      "name": "demo_variation_pool",
+      "patterns": [
+        "business_audio/*/type18_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "demo-version variation and weak-supervision bucket"
+    },
+    {
+      "name": "hard_negative_confusable",
+      "patterns": [
+        "business_audio/*/manual_confusable_REPLACE_*.mp3"
+      ],
+      "subset_size": 16,
+      "label_hint": "manually curated confusable tracks"
+    }
+  ]
+}
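
The commit message says both bucket templates were parsed before the checkpoint. A minimal sketch of what such a check could look like is below. It assumes the only required bucket keys are the ones visible in this template (`name`, `patterns`, `subset_size`); the real schema expected by `ab_smoke_bucketed.py` is not shown in this diff, and `validate_bucket_template` is a hypothetical helper name.

```python
import json

# Assumption: these are the keys visible in the template above; the actual
# loader in ab_smoke_bucketed.py may require more or fewer.
REQUIRED_BUCKET_KEYS = {"name", "patterns", "subset_size"}


def validate_bucket_template(text: str) -> list:
    """Return human-readable problems; an empty list means the template looks usable."""
    data = json.loads(text)
    problems = []
    seen = set()
    for index, bucket in enumerate(data.get("buckets", [])):
        missing = REQUIRED_BUCKET_KEYS - bucket.keys()
        if missing:
            problems.append(f"bucket #{index}: missing keys {sorted(missing)}")
            continue
        name = bucket["name"]
        if name in seen:
            problems.append(f"{name}: duplicate bucket name")
        seen.add(name)
        size = bucket["subset_size"]
        if not isinstance(size, int) or size <= 0:
            problems.append(f"{name}: subset_size must be a positive integer")
        if not bucket["patterns"]:
            problems.append(f"{name}: no patterns defined")
        for pattern in bucket["patterns"]:
            # The template ships with REPLACE placeholders that must be
            # swapped for real globs or exported file lists before a run.
            if "REPLACE" in pattern:
                problems.append(f"{name}: placeholder pattern still present: {pattern}")
    if not seen and not problems:
        problems.append("no buckets defined")
    return problems
```

Run against the template as committed, this sketch would report one placeholder problem per bucket, which matches the `how_to_use` note: the patterns are scaffolding until they are replaced.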
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
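+## 2026-06-02 Business asset type→bucket guide delivery checkpoint
+Completed:
+- Added a guide to business asset types and buckets: `docs/business-music-bucket-and-type-guide.md`
+- Added a business asset bucket template: `acr-engine/configs/buckets/business_type_bucket_template.json`
+- Linked this entry point back into the pgvector guide, the open dataset workflow, and the session handoff.
+Initial business buckets: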
+- `lossless_reference_core`
+- `compressed_reference_realworld`
+- `short_video_hook`
+- `with_harmony_shift`
+- `demo_variation_pool`
+- `hard_negative_confusable`
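+Conclusions:
+- Alongside the generic semantic bucket template, there is now a business bucket template that matches your asset types.
+- The next session can stratify training and evaluation by asset type directly, without re-deriving the mapping from the table schema.
 ## 2026-06-02 Semantic bucket template delivery checkpoint
 Completed: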
--- a/docs/business-music-bucket-and-type-guide.md 0 → 100644
+++ b/docs/business-music-bucket-and-type-guide.md 0 → 100644
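+# Business Music Bucket and Type Guide
+> Updated: 2026-06-02  
+> Related docs: [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Industrial Benchmark Spec](./industrial-benchmark-spec.md)
+## One-page summary
+For your existing asset `type` field, **do not mix every file into training**.  
+Instead, use the assets in three tiers: reference master assets, query-derived assets, and hard-case evaluation assets.
+### Types most recommended for training and index building
+| Priority | type | Meaning | Training use |
+|---|---:|---|---|
+| High | `10` | Accompaniment without harmony, lossless | Cleanest reference candidate |
+| High | `11` | Original song, lossless | Primary reference / primary training asset |
+| High | `9` | Accompaniment without harmony, compressed | Reference supplement / compressed-domain adaptation |
+| High | `1` | Original song, compressed | Training-domain supplement / real production distribution |
+| Medium | `18` | Audio demo | Usable as a weak-supervision supplement after manual screening |
+| Medium | `8` | Clip (chorus) | Usable for repeated-section / highly distinctive queries |
+| Medium | `7` | Douyin clip | Usable for short-video-domain query evaluation |
+| Medium | `16` | Kuaishou clip | Usable for short-video-domain query evaluation |
+### Types that usually do not go directly into main training
+| type | Meaning | Reason |
+|---:|---|---|
+| `2` / `12` | Accompaniment with harmony | Tends to introduce extra "same song, different vocal layer" variation; better suited to a separate later experiment |
+| `3` / `13` / `20` | Lyrics / LRC / scrolling translated lyrics | Non-audio asset |
+| `4` / `14` / `19` | Cover art / PSD / sheet-music image | Non-audio asset |
+| `5` | Authorization letter | Compliance document; not model input |
+| `6` | Album information | Metadata; not model input |
+| `17` | Lyrics-and-composition archive | Must be unpacked first; should not be fed to the model directly |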
+---
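+## 1. Business asset responsibility map
+```mermaid
+flowchart LR
+    A[Lossless master assets\n10 / 11] --> B[Reference main library]
+    C[Compressed master assets\n1 / 9] --> D[Training-domain augmentation]
+    E[Short-video clips\n7 / 16 / 8] --> F[Query evaluation set]
+    G[Recording / demo\n18] --> H[Weak-supervision supplement pool]
+    B --> I[Training / index building]
+    D --> I
+    F --> J[Short-clip evaluation / hard-case]
+    H --> K[Enters I or J only after manual screening]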
+```
+---
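+## 2. How to use your types
+## 2.1 Recommended for main training / main index building
+### A. First priority: `10` + `11`
+Why:
+- Best audio quality
+- Most stable label semantics
+- Best suited as the ground-truth reference
+Recommended uses:
+- `reference`
+- Primary training assets
+- Main pgvector embedding table
+### B. Second priority: `9` + `1`
+Why:
+- Closer to the real compressed distribution in production
+- Can improve the model's robustness to codec degradation
+Recommended uses:
+- Training supplement
+- Clean/compressed queries during evaluation
+- Reference-domain expansion
+### C. Third priority: `8` / `7` / `16`
+Why:
+- Closer to real recognition entry points
+- Useful for short-clip, chorus, and short-video-domain evaluation
+Recommended uses:
+- Query evaluation set
+- repeated-section-rich bucket
+- short-video bucket
+### D. Use with caution: `18`
+Why:
+- The mix, arrangement, and completeness of a `demo` vary widely
+- It is easy to place samples that are not the same final version under the same label
+Recommended uses:
+- Put them in a manual screening pool first
+- Include them in training or hard-case sets only after confirming they share the same song and main melody as the official release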
+---
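+## 2.2 Types not recommended for main training at the start
+### `2` / `12` Accompaniment with harmony
+Risks:
+- Adds vocal/harmony interference under the same `song_id`
+- If the current system targets music ACR / BGM recognition, these assets are better used later as a domain-robustness control
+Recommendations:
+- Start by placing them in a separate `with_harmony_shift` bucket
+- Do not mix them directly into training with `10`/`9` at the start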
+---
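+## 3. Recommended training/evaluation tiers
+```mermaid
+flowchart TD
+    A[Reference main library] --> A1[10 / 11]
+    B[Training supplement] --> B1[1 / 9]
+    C[Short-clip evaluation] --> C1[7 / 16 / 8]
+    D[Special controls] --> D1[2 / 12 / 18]
+    E[Non-audio metadata] --> E1[3 / 4 / 5 / 6 / 13 / 14 / 17 / 19 / 20]
+```
+### Recommended first-version strategy
+| Tier | Recommended type |
+|---|---|
+| Reference main library | `10`, `11` |
+| Training supplement | `1`, `9` |
+| Query evaluation | `7`, `8`, `16` |
+| Can be added after manual screening | `18` |
+| Later robustness-specific experiments | `2`, `12` |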
+---
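+## 4. Recommended business-semantic buckets
+## 4.1 First batch of buckets most worth building
+| Bucket name | Recommended source type | Purpose |
+|---|---|---|
+| `lossless_reference_core` | `10`, `11` | Cleanest ground-truth library |
+| `compressed_reference_realworld` | `1`, `9` | Production compressed domain |
+| `short_video_hook` | `7`, `16`, `8` | Short-video / chorus recognition |
+| `with_harmony_shift` | `2`, `12` | Interference from accompaniment with harmony |
+| `demo_variation_pool` | `18` | Risk of differences between demo and official version |
+| `hard_negative_confusable` | Manually curated | Similar style, arrangement, or melody |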
+---
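
The type-to-bucket table above can be expressed as a small lookup. This is a sketch only: `TYPE_TO_BUCKET`, `NON_AUDIO_TYPES`, and `bucket_for_type` are hypothetical names, and the mapping simply copies the table and the non-audio list from this guide. `hard_negative_confusable` is left out on purpose because the guide keeps it manually curated rather than type-derived.

```python
from typing import Optional

# Copied from the "first batch" bucket table in this guide.
TYPE_TO_BUCKET = {
    10: "lossless_reference_core",
    11: "lossless_reference_core",
    1: "compressed_reference_realworld",
    9: "compressed_reference_realworld",
    7: "short_video_hook",
    8: "short_video_hook",
    16: "short_video_hook",
    2: "with_harmony_shift",
    12: "with_harmony_shift",
    18: "demo_variation_pool",
}

# Non-audio types listed in the guide; they never enter training or indexing.
NON_AUDIO_TYPES = frozenset({3, 4, 5, 6, 13, 14, 17, 19, 20})


def bucket_for_type(asset_type: int) -> Optional[str]:
    """Return the suggested bucket, or None for non-audio assets."""
    if asset_type in NON_AUDIO_TYPES:
        return None
    try:
        return TYPE_TO_BUCKET[asset_type]
    except KeyError:
        # Types the guide does not mention are surfaced instead of guessed.
        raise ValueError(f"unmapped asset type: {asset_type}") from None
```

Raising on unmapped types, rather than returning a default bucket, keeps a newly introduced `type` value from silently landing in a training tier.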
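+## 4.2 Why this fits the business better than generic semantic buckets
+Your data is not a purely academic dataset; it **carries business asset semantics**:
+- Master assets in compressed and lossless versions
+- Short-video clips
+- Chorus clips
+- Accompaniment with and without harmony
+So the first buckets to build are not abstract genre buckets, but:
+1. **Version-form buckets**
+2. **Entry-scenario buckets**
+3. **Confusion-risk buckets**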
+---
+## 5. Recommended config templates
+Companion templates:
+- [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json)
+- [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json)
+Of these:
+- `fma_semantic_bucket_template.json` leans toward generic methodology
+- `business_type_bucket_template.json` leans toward the form of your existing business assets
+---
+## 6. Working with pgvector
+If this later lands in pgvector, keep at least these fields:
+| Field | Description |
+|---|---|
+| `song_id` | Primary song ID |
+| `asset_id` | Specific asset ID |
+| `type` | Your asset type |
+| `bucket` | Current evaluation/training bucket |
+| `role` | `reference` / `query` |
+| `source_dataset` | Source |
+| `offset_sec` | Query start offset |
+| `duration_sec` | Query length |
+| `embedding` | pgvector vector |
+This makes it possible to:
+- Filter by `type`
+- Report by `bucket`
+- Separate reference from query by `role`
+- Run multi-source analysis by `source_dataset`
+---
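
As an illustration of the field list above, here is a minimal in-memory sketch of one row plus a per-bucket count. The field names come from the table; the Python types, the `EmbeddingRow` class, and the `count_by_bucket` helper are assumptions for illustration, not the repository's schema, and no database is involved.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional

VALID_ROLES = ("reference", "query")


@dataclass(frozen=True)
class EmbeddingRow:
    """One embedding record; field names follow the guide, types are assumed."""
    song_id: str
    asset_id: str
    type: int
    bucket: str
    role: str
    source_dataset: str
    embedding: List[float]
    # Offsets only make sense for query segments, so they default to None.
    offset_sec: Optional[float] = None
    duration_sec: Optional[float] = None

    def __post_init__(self) -> None:
        if self.role not in VALID_ROLES:
            raise ValueError(f"role must be one of {VALID_ROLES}, got {self.role!r}")


def count_by_bucket(rows: Iterable[EmbeddingRow], role: str) -> Counter:
    """Count rows per bucket for a single role, e.g. for a per-bucket report."""
    return Counter(row.bucket for row in rows if row.role == role)
```

Keeping `role` as a validated field reflects the template's warning that reference and query roles should stay logically separate even when they share a `song_id`.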
+## 7. Immediate actions for the next session
+1. Use this guide to select the first batch of usable types: `10`, `11`, `9`, `1`, `8`, `7`, `16`
+2. Map them into:
+   - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json)
+3. Run the bucket benchmark
+4. Check whether `hybrid` and `high_energy` diverge across business buckets
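
One way to approach the last step is to compare the two strategies bucket by bucket and keep only the buckets where the gap is large enough to matter. The sketch below assumes a hypothetical `{bucket: {strategy: metric}}` dictionary and an arbitrary `min_gap` threshold; the actual report format written by the benchmark script is not shown in this diff, so the input would need adapting.

```python
from typing import Dict


def bucket_divergence(
    scores: Dict[str, Dict[str, float]],
    a: str = "hybrid",
    b: str = "high_energy",
    min_gap: float = 0.05,
) -> Dict[str, float]:
    """Return {bucket: score[a] - score[b]} for buckets where |gap| >= min_gap.

    `scores` is an assumed shape, not the benchmark's real output format.
    A positive gap means strategy `a` scored higher in that bucket.
    """
    gaps = {}
    for bucket, by_strategy in scores.items():
        gap = by_strategy[a] - by_strategy[b]
        if abs(gap) >= min_gap:
            gaps[bucket] = round(gap, 4)
    return gaps
```

Buckets whose gaps have opposite signs are the ones where the two strategies genuinely diverge by business scenario, which is the signal this step is looking for.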
+## Sources
+- See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
+- See [industrial-benchmark-spec.md](./industrial-benchmark-spec.md)
--- a/docs/open-dataset-workflow.md
+++ b/docs/open-dataset-workflow.md
@@ -396,3 +396,10 @@ cd /workspace/acr-engine
  --seed 42 \
  --output-json /tmp/ab_smoke_bucketed_semantic/report.json
 ```
+### Business asset bucket template
+If the next step is to switch from FMA to your own song/BGM assets, start with:
+- [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
+- [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json)
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -256,6 +256,7 @@
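 ### Top-priority to-dos
 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case).
   - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
+   - For business assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
 2. Compare the inconsistency between cap48 and cap64, and add per-scale conclusions.
 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed.
 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.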
--- a/docs/training-data-and-pgvector-guide.md
+++ b/docs/training-data-and-pgvector-guide.md
 # Training Data, Input Format, and pgvector Guide
 > Updated: 2026-06-02  
-> Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Data Sources and Integration](./dataset-sources-and-licensing.md) · [Service API](./service-api.md)
+> Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Data Sources and Integration](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Music Bucket and Type Guide](./business-music-bucket-and-type-guide.md)
 ## One-page summary