Map business asset types into runnable training and bucket guidance for the next session
Constraint: Keep this checkpoint documentation-first and avoid staging dataset, cache, or model artifacts
Rejected: Leave the asset-type strategy implicit in chat only | The next session needs repo-native guidance and templates
Confidence: high
Scope-risk: narrow
Directive: Treat type-based buckets as a starting scaffold and keep hard-negative curation manual until evidence supports automation
Tested: Parsed both bucket JSON templates and rechecked 104 relative links across the new docs
Not-tested: Did not run a fresh business-type benchmark in this checkpoint
Showing 7 changed files with 304 additions and 2 deletions
| ... | @@ -53,7 +53,9 @@ | ... | @@ -53,7 +53,9 @@ |
| 53 | ## 5. Current resume priorities | 53 | ## 5. Current resume priorities |
| 54 | 54 | ||
| 55 | 1. Upgrade the toy prefix buckets to semantic buckets. | 55 | 1. Upgrade the toy prefix buckets to semantic buckets. |
| 56 | - Template entry point: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` | 56 | - General template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` |
| 57 | - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` | ||
| 58 | - Business guide: `docs/business-music-bucket-and-type-guide.md` | ||
| 57 | 2. Add the cap64 multi-seed aggregate. | 59 | 2. Add the cap64 multi-seed aggregate. |
| 58 | 3. Update: | 60 | 3. Update: |
| 59 | - `docs/open-dataset-workflow.md` | 61 | - `docs/open-dataset-workflow.md` | ... | ... |
| 1 | { | ||
| 2 | "notes": { | ||
| 3 | "purpose": "Business-oriented bucket template mapped from asset types such as lossless master, compressed master, short-video clip, chorus clip, and demo.", | ||
| 4 | "how_to_use": "Replace placeholder patterns with exported file lists or curated globs from your own asset store before running ab_smoke_bucketed.py.", | ||
| 5 | "warning": "Keep reference and query roles logically separate even if they share the same song_id." | ||
| 6 | }, | ||
| 7 | "buckets": [ | ||
| 8 | { | ||
| 9 | "name": "lossless_reference_core", | ||
| 10 | "patterns": [ | ||
| 11 | "business_audio/*/type10_or_type11_REPLACE_*.mp3" | ||
| 12 | ], | ||
| 13 | "subset_size": 16, | ||
| 14 | "label_hint": "type 10/11 high-quality master/reference candidates" | ||
| 15 | }, | ||
| 16 | { | ||
| 17 | "name": "compressed_reference_realworld", | ||
| 18 | "patterns": [ | ||
| 19 | "business_audio/*/type1_or_type9_REPLACE_*.mp3" | ||
| 20 | ], | ||
| 21 | "subset_size": 16, | ||
| 22 | "label_hint": "type 1/9 compressed real-world distribution" | ||
| 23 | }, | ||
| 24 | { | ||
| 25 | "name": "short_video_hook", | ||
| 26 | "patterns": [ | ||
| 27 | "business_audio/*/type7_type8_type16_REPLACE_*.mp3" | ||
| 28 | ], | ||
| 29 | "subset_size": 16, | ||
| 30 | "label_hint": "short-video or chorus-hook recognition bucket" | ||
| 31 | }, | ||
| 32 | { | ||
| 33 | "name": "with_harmony_shift", | ||
| 34 | "patterns": [ | ||
| 35 | "business_audio/*/type2_or_type12_REPLACE_*.mp3" | ||
| 36 | ], | ||
| 37 | "subset_size": 16, | ||
| 38 | "label_hint": "with-harmony accompaniment robustness bucket" | ||
| 39 | }, | ||
| 40 | { | ||
| 41 | "name": "demo_variation_pool", | ||
| 42 | "patterns": [ | ||
| 43 | "business_audio/*/type18_REPLACE_*.mp3" | ||
| 44 | ], | ||
| 45 | "subset_size": 16, | ||
| 46 | "label_hint": "demo-version variation and weak-supervision bucket" | ||
| 47 | }, | ||
| 48 | { | ||
| 49 | "name": "hard_negative_confusable", | ||
| 50 | "patterns": [ | ||
| 51 | "business_audio/*/manual_confusable_REPLACE_*.mp3" | ||
| 52 | ], | ||
| 53 | "subset_size": 16, | ||
| 54 | "label_hint": "manually curated confusable tracks" | ||
| 55 | } | ||
| 56 | ] | ||
| 57 | } |
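Before running a benchmark against this template, it may help to confirm that every bucket carries the four keys shown above and that no `REPLACE` placeholder is left in a pattern. The following is a minimal sketch of such a check and is not part of the repository: the function name `check_bucket_template` and the second pattern in the example are hypothetical, and nothing here assumes how `ab_smoke_bucketed.py` itself reads the file.

```python
import json

# Keys that every bucket in the template above carries.
REQUIRED_KEYS = {"name", "patterns", "subset_size", "label_hint"}


def check_bucket_template(template: dict) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    seen_names = set()
    for index, bucket in enumerate(template.get("buckets", [])):
        missing = REQUIRED_KEYS - set(bucket)
        if missing:
            problems.append(f"bucket #{index}: missing keys {sorted(missing)}")
            continue
        name = bucket["name"]
        if name in seen_names:
            problems.append(f"{name}: duplicate bucket name")
        seen_names.add(name)
        for pattern in bucket["patterns"]:
            # The template marks unfinished globs with the literal text REPLACE.
            if "REPLACE" in pattern:
                problems.append(f"{name}: placeholder not replaced: {pattern}")
    return problems


# Two-bucket excerpt; the second pattern is a made-up "already curated" glob.
EXAMPLE = json.loads("""
{
  "buckets": [
    {"name": "lossless_reference_core",
     "patterns": ["business_audio/*/type10_or_type11_REPLACE_*.mp3"],
     "subset_size": 16,
     "label_hint": "type 10/11 high-quality master/reference candidates"},
    {"name": "demo_variation_pool",
     "patterns": ["business_audio/*/type18_curated_*.mp3"],
     "subset_size": 16,
     "label_hint": "demo-version variation and weak-supervision bucket"}
  ]
}
""")

if __name__ == "__main__":
    for problem in check_bucket_template(EXAMPLE):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets all six buckets be reviewed in a single pass while the placeholders are being replaced.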
| 1 | ## 2026-06-02 Business asset type→bucket guide delivery checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Added the business asset type and bucket guide: `docs/business-music-bucket-and-type-guide.md` | ||
| 5 | - Added the business asset bucket template: `acr-engine/configs/buckets/business_type_bucket_template.json` | ||
| 6 | - Linked this entry point back into the pgvector guide, the open-dataset workflow, and the session handoff. | ||
| 7 | |||
| 8 | Initial business buckets: | ||
| 9 | - `lossless_reference_core` | ||
| 10 | - `compressed_reference_realworld` | ||
| 11 | - `short_video_hook` | ||
| 12 | - `with_harmony_shift` | ||
| 13 | - `demo_variation_pool` | ||
| 14 | - `hard_negative_confusable` | ||
| 15 | |||
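| 16 | Conclusions: | ||
| 17 | - In addition to the general semantic bucket template, there is now a business bucket template matched to your asset `type` values. | ||
| 18 | - The next session can stratify training and evaluation by asset type directly, without re-deriving it from the table schema. | ||
| 19 | |||
| 1 | ## 2026-06-02 Semantic bucket template delivery checkpoint | 20 | ## 2026-06-02 Semantic bucket template delivery checkpoint |
| 2 | 21 | ||
| 3 | Completed: | 22 | Completed: | ... | ... |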
docs/business-music-bucket-and-type-guide.md
0 → 100644
| 1 | # Business Music Bucket and Type Guide | ||
| 2 | |||
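| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Industrial Benchmark Spec](./industrial-benchmark-spec.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | For your existing asset `type` field, **do not mix every file into training**. | ||
| 9 | Instead, use the assets in three tiers: reference master assets, derived query assets, and hard-case evaluation assets. | ||
| 10 | |||
| 11 | ### Types most recommended for training / index building | ||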
| 12 | |||
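| 13 | | Priority | type | Meaning | Training use | | ||
| 14 | |---|---:|---|---| | ||
| 15 | | High | `10` | Accompaniment without harmony, lossless | Cleanest reference candidate | | ||
| 16 | | High | `11` | Original track, lossless | Primary reference / primary training asset | | ||
| 17 | | High | `9` | Accompaniment without harmony, compressed | Reference supplement / compressed-domain adaptation | | ||
| 18 | | High | `1` | Original track, compressed | Training-domain supplement / real production distribution | | ||
| 19 | | Medium | `18` | Audio demo | Usable as a weak-supervision supplement; needs manual screening | | ||
| 20 | | Medium | `8` | Clip (chorus) | Usable for repeated-section / highly distinctive queries | | ||
| 21 | | Medium | `7` | Douyin clip | Usable for short-video-domain query evaluation | | ||
| 22 | | Medium | `16` | Kuaishou clip | Usable for short-video-domain query evaluation | | ||
| 23 | |||
| 24 | ### Types that usually do not go directly into main training | ||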
| 25 | |||
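| 26 | | type | Meaning | Reason | | ||
| 27 | |---:|---|---| | ||
| 28 | | `2` / `12` | Accompaniment with harmony | Tends to add extra "same song, different vocal layer" variation; better suited to a separate later experiment | | ||
| 29 | | `3` / `13` / `20` | Lyrics / LRC / translated scrolling lyrics | Non-audio asset | | ||
| 30 | | `4` / `14` / `19` | Cover art / PSD / sheet-music image | Non-audio asset | | ||
| 31 | | `5` | Authorization letter | Compliance document; not model input | | ||
| 32 | | `6` | Album information | Metadata; not model input | | ||
| 33 | | `17` | Lyrics-and-composition archive | Must be unpacked first; should not be fed to the model directly | | ||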
| 34 | |||
| 35 | --- | ||
| 36 | |||
| 37 | ## 1. Business asset responsibility map | ||
| 38 | |||
| 39 | ```mermaid | ||
| 40 | flowchart LR | ||
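| 41 | A[Lossless master assets\n10 / 11] --> B[Main reference library] | ||
| 42 | C[Compressed master assets\n1 / 9] --> D[Training-domain augmentation] | ||
| 43 | E[Short-video clips\n7 / 16 / 8] --> F[Query evaluation set] | ||
| 44 | G[Recording / demo\n18] --> H[Weak-supervision supplement pool] | ||
| 45 | B --> I[Training / index building] | ||
| 46 | D --> I | ||
| 47 | F --> J[Short-clip evaluation / hard-case] | ||
| 48 | H --> K[Enters I or J only after manual screening] | ||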
| 49 | ``` | ||
| 50 | |||
| 51 | --- | ||
| 52 | |||
| 53 | ## 2. How to use your types | ||
| 54 | |||
| 55 | ## 2.1 Recommended for main training / main index building | ||
| 56 | |||
| 57 | ### A. First priority: `10` + `11` | ||
| 58 | |||
| 59 | Reasons: | ||
| 60 | - Best audio quality | ||
| 61 | - Most stable label semantics | ||
| 62 | - Best suited as the ground-truth reference | ||
| 63 | |||
| 64 | Recommended uses: | ||
| 65 | - `reference` | ||
| 66 | - Primary training asset | ||
| 67 | - Main pgvector embedding table | ||
| 68 | |||
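| 69 | ### B. Second priority: `9` + `1` | ||
| 70 | |||
| 71 | Reasons: | ||
| 72 | - Closer to the real compressed distribution seen in production | ||
| 73 | - Can improve the model's robustness to codec artifacts | ||
| 74 | |||
| 75 | Recommended uses: | ||
| 76 | - Training supplement | ||
| 77 | - Clean/compressed queries during evaluation | ||
| 78 | - Reference-domain extension | ||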
| 79 | |||
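| 80 | ### C. Third priority: `8` / `7` / `16` | ||
| 81 | |||
| 82 | Reasons: | ||
| 83 | - Closer to real recognition entry points | ||
| 84 | - Helps short-clip / chorus / short-video-domain evaluation | ||
| 85 | |||
| 86 | Recommended uses: | ||
| 87 | - Query evaluation set | ||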
| 88 | - repeated-section-rich bucket | ||
| 89 | - short-video bucket | ||
| 90 | |||
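| 91 | ### D. Use with caution: `18` | ||
| 92 | |||
| 93 | Reasons: | ||
| 94 | - The mixing, arrangement, and completeness of a `demo` vary widely | ||
| 95 | - Samples that are not the same final version can easily end up under the same label | ||
| 96 | |||
| 97 | Recommended uses: | ||
| 98 | - Place in a manual screening pool first | ||
| 99 | - Add to training or hard-case sets only after confirming the same song and the same main melody as the released version | ||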
| 100 | |||
| 101 | --- | ||
| 102 | |||
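| 103 | ## 2.2 Types not recommended for main training at the start | ||
| 104 | |||
| 105 | ### `2` / `12` Accompaniment with harmony | ||
| 106 | |||
| 107 | Risks: | ||
| 108 | - Adds vocal/harmony interference under the same `song_id` | ||
| 109 | - If the current system target is "music ACR / BGM recognition", this material is better suited as a later domain-robustness control | ||
| 110 | |||
| 111 | Recommendations: | ||
| 112 | - Keep it in a separate `with_harmony_shift` bucket first | ||
| 113 | - Do not mix it into training with `10`/`9` at the start | ||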
| 114 | |||
| 115 | --- | ||
| 116 | |||
| 117 | ## 3. Recommended training/evaluation tiers | ||
| 118 | |||
| 119 | ```mermaid | ||
| 120 | flowchart TD | ||
| 121 | A[Main reference library] --> A1[10 / 11] | ||
| 122 | B[Training supplement] --> B1[1 / 9] | ||
| 123 | C[Short-clip evaluation] --> C1[7 / 16 / 8] | ||
| 124 | D[Special controls] --> D1[2 / 12 / 18] | ||
| 125 | E[Non-audio assets and metadata] --> E1[3 / 4 / 5 / 6 / 13 / 14 / 17 / 19 / 20] | ||
| 126 | ``` | ||
| 127 | |||
| 128 | ### Recommended first-version strategy | ||
| 129 | |||
| 130 | | Tier | Recommended type | | ||
| 131 | |---|---| | ||
| 132 | | Main reference library | `10`, `11` | | ||
| 133 | | Training supplement | `1`, `9` | | ||
| 134 | | Query evaluation | `7`, `8`, `16` | | ||
| 135 | | Supplement after manual screening | `18` | | ||
| 136 | | Later robustness-focused experiments | `2`, `12` | | ||
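The tier table above can be applied mechanically when an asset export is filtered. The sketch below is one possible way to do that, under stated assumptions: the tier labels (`reference`, `train_supplement`, and so on), the `excluded` fallback for non-audio types, and the `(asset_id, type)` input shape are all illustrative and do not come from the repository.

```python
from collections import defaultdict

# Tier labels are illustrative names, not identifiers used by the repository.
TIER_BY_TYPE = {
    10: "reference",
    11: "reference",
    1: "train_supplement",
    9: "train_supplement",
    7: "query_eval",
    8: "query_eval",
    16: "query_eval",
    18: "manual_review",
    2: "robustness_experiment",
    12: "robustness_experiment",
}


def tier_for(asset_type: int) -> str:
    """Map a numeric asset type to its tier; unlisted types are excluded."""
    return TIER_BY_TYPE.get(asset_type, "excluded")


def group_by_tier(assets):
    """Group (asset_id, type) pairs by tier, preserving input order."""
    grouped = defaultdict(list)
    for asset_id, asset_type in assets:
        grouped[tier_for(asset_type)].append(asset_id)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical asset IDs paired with type codes from the tables above.
    sample = [("a1", 11), ("a2", 1), ("a3", 7), ("a4", 18), ("a5", 5)]
    for tier, asset_ids in group_by_tier(sample).items():
        print(tier, asset_ids)
```

Keeping type `18` in its own `manual_review` tier, rather than folding it into training, mirrors the guidance that demos enter training or hard-case sets only after manual screening.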
| 137 | |||
| 138 | --- | ||
| 139 | |||
| 140 | ## 4. Business-semantic bucket recommendations | ||
| 141 | |||
| 142 | ## 4.1 First-batch buckets most worth building | ||
| 143 | |||
| 144 | | Bucket name | Recommended source type | Purpose | | ||
| 145 | |---|---|---| | ||
| 146 | | `lossless_reference_core` | `10`, `11` | Cleanest ground-truth library | | ||
| 147 | | `compressed_reference_realworld` | `1`, `9` | Production compressed domain | | ||
| 148 | | `short_video_hook` | `7`, `16`, `8` | Short-video / chorus recognition | | ||
| 149 | | `with_harmony_shift` | `2`, `12` | Interference from accompaniment with harmony | | ||
| 150 | | `demo_variation_pool` | `18` | Risk of differences between demo and released version | | ||
| 151 | | `hard_negative_confusable` | Manually curated | Similar style, similar arrangement, similar melody | | ||
| 152 | |||
| 153 | --- | ||
| 154 | |||
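| 155 | ## 4.2 Why this fits the business better than general semantic buckets | ||
| 156 | |||
| 157 | Your data is not a purely academic dataset; it **carries business asset semantics**: | ||
| 158 | - Master assets / compressed versions / lossless versions | ||
| 159 | - Short-video clips | ||
| 160 | - Chorus clips | ||
| 161 | - Accompaniment with and without harmony | ||
| 162 | |||
| 163 | So the first buckets to build are not abstract genre buckets, but: | ||
| 164 | 1. **Version-form buckets** | ||
| 165 | 2. **Entry-scenario buckets** | ||
| 166 | 3. **Confusion-risk buckets** | ||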
| 167 | |||
| 168 | --- | ||
| 169 | |||
| 170 | ## 5. Recommended config templates | ||
| 171 | |||
| 172 | Companion templates: | ||
| 173 | - [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json) | ||
| 174 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) | ||
| 175 | |||
| 176 | Of these: | ||
| 177 | - `fma_semantic_bucket_template.json` leans toward general methodology | ||
| 178 | - `business_type_bucket_template.json` leans toward the shape of your existing business assets | ||
| 179 | |||
| 180 | --- | ||
| 181 | |||
| 182 | ## 6. Working with pgvector | ||
| 183 | |||
| 184 | If this later lands in pgvector, keep at least these fields: | ||
| 185 | |||
| 186 | | Field | Description | | ||
| 187 | |---|---| | ||
| 188 | | `song_id` | Main song ID | | ||
| 189 | | `asset_id` | Specific asset ID | | ||
| 190 | | `type` | Your asset type | | ||
| 191 | | `bucket` | Current evaluation/training bucket | | ||
| 192 | | `role` | `reference` / `query` | | ||
| 193 | | `source_dataset` | Source | | ||
| 194 | | `offset_sec` | Query start | | ||
| 195 | | `duration_sec` | Query length | | ||
| 196 | | `embedding` | pgvector vector | | ||
| 197 | |||
| 198 | This lets you later: | ||
| 199 | - Filter by `type` | ||
| 200 | - Report by `bucket` | ||
| 201 | - Distinguish reference/query by `role` | ||
| 202 | - Do multi-source analysis by `source_dataset` | ||
| 204 | --- | ||
| 205 | |||
| 206 | ## 7. Direct actions for the next session | ||
| 207 | |||
| 208 | 1. First, use this document to select the initial usable types: `10`, `11`, `9`, `1`, `8`, `7`, `16` | ||
| 209 | 2. Then map them into: | ||
| 210 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) | ||
| 211 | 3. Run the bucket benchmark | ||
| 212 | 4. Check whether `hybrid` / `high_energy` diverge across the business buckets | ||
| 213 | |||
| 214 | ## Sources | ||
| 215 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 216 | - See [industrial-benchmark-spec.md](./industrial-benchmark-spec.md) |
| ... | @@ -396,3 +396,10 @@ cd /workspace/acr-engine | ... | @@ -396,3 +396,10 @@ cd /workspace/acr-engine |
| 396 | --seed 42 \ | 396 | --seed 42 \ |
| 397 | --output-json /tmp/ab_smoke_bucketed_semantic/report.json | 397 | --output-json /tmp/ab_smoke_bucketed_semantic/report.json |
| 398 | ``` | 398 | ``` |
| 399 | |||
| 400 | |||
| 401 | ### Business asset bucket template | ||
| 402 | |||
| 403 | If the next step is to move from FMA to your own song/BGM assets, start with: | ||
| 404 | - [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 405 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) | ... | ... |
| ... | @@ -256,6 +256,7 @@ | ... | @@ -256,6 +256,7 @@ |
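| 256 | ### Top-priority to-dos | 256 | ### Top-priority to-dos |
| 257 | 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case). | 257 | 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case). |
| 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` | 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` |
| 259 | - For business assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 259 | 2. Compare the inconsistency between cap48 and cap64, and add per-scale conclusions. | 260 | 2. Compare the inconsistency between cap48 and cap64, and add per-scale conclusions. |
| 260 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. | 261 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. |
| 261 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | 262 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | ... | ... |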
| 1 | # Training Data, Input Format, and pgvector Guide | 1 | # Training Data, Input Format, and pgvector Guide |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-02 | 3 | > Updated: 2026-06-02 |
| 4 | > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Access](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) | 4 | > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Access](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) |
| 5 | 5 | ||
| 6 | ## One-page summary | 6 | ## One-page summary |
| 7 | 7 | ... | ... |