Commit 51d789e1 51d789e1bb207aa3c70d58f96c132451d18b16ed by cnb.bofCdSsphPA

Make business asset tables exportable into manifest and role mapping templates

Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
Confidence: high
Scope-risk: narrow
Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
Not-tested: Did not connect to a real database or run a live export in this checkpoint
1 parent 8739bf35
...@@ -56,6 +56,7 @@ ...@@ -56,6 +56,7 @@
56 - Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` 56 - Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
57 - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` 57 - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json`
58 - Business guide: `docs/business-music-bucket-and-type-guide.md` 58 - Business guide: `docs/business-music-bucket-and-type-guide.md`
59 - Manifest spec: `docs/business-manifest-and-type-role-spec.md`
59 2. Add the cap64 multi-seed aggregate. 60 2. Add the cap64 multi-seed aggregate.
60 3. Update: 61 3. Update:
61 - `docs/open-dataset-workflow.md` 62 - `docs/open-dataset-workflow.md`
......
1 {
2 "dataset": "business_music",
3 "version": "v1_template",
4 "description": "Template manifest row schema for business music assets before training/indexing/evaluation.",
5 "required_fields": [
6 "song_id",
7 "asset_id",
8 "type",
9 "role",
10 "split",
11 "audio_path",
12 "source_dataset"
13 ],
14 "recommended_fields": [
15 "title",
16 "artist",
17 "album_id",
18 "bucket",
19 "offset_sec",
20 "duration_sec",
21 "sample_rate",
22 "bitrate",
23 "license",
24 "is_lossless",
25 "parent_song_id",
26 "notes"
27 ],
28 "role_enum": ["reference", "query", "excluded"],
29 "split_enum": ["train", "val", "test", "holdout"],
30 "example_rows": [
31 {
32 "song_id": "song_0001",
33 "asset_id": "asset_0001_master_lossless",
34 "type": 11,
35 "role": "reference",
36 "split": "train",
37 "audio_path": "business_audio/song_0001/master_lossless.wav",
38 "source_dataset": "internal_catalog",
39 "bucket": "lossless_reference_core",
40 "offset_sec": 0.0,
41 "duration_sec": 180.0,
42 "is_lossless": true
43 },
44 {
45 "song_id": "song_0001",
46 "asset_id": "asset_0001_douyin_clip",
47 "type": 7,
48 "role": "query",
49 "split": "test",
50 "audio_path": "business_audio/song_0001/douyin_clip.mp3",
51 "source_dataset": "internal_catalog",
52 "bucket": "short_video_hook",
53 "offset_sec": 42.5,
54 "duration_sec": 8.0,
55 "is_lossless": false
56 }
57 ]
58 }
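The manifest template above lists required fields plus `role_enum` and `split_enum`. A minimal sketch of checking one row against that contract follows; `validate_row` is a hypothetical helper name that is not part of this commit, and the constants are copied by hand from the template rather than loaded from disk.

```python
from __future__ import annotations

# Copied from business_asset_manifest_template.json in this commit.
REQUIRED_FIELDS = [
    "song_id", "asset_id", "type", "role",
    "split", "audio_path", "source_dataset",
]
ROLE_ENUM = {"reference", "query", "excluded"}
SPLIT_ENUM = {"train", "val", "test", "holdout"}


def validate_row(row: dict) -> list[str]:
    """Hypothetical helper: return the problems found in one manifest row."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in row
    ]
    # Enum checks only run when the field is present, so a missing field
    # is reported once rather than twice.
    if "role" in row and row["role"] not in ROLE_ENUM:
        problems.append(f"invalid role: {row['role']!r}")
    if "split" in row and row["split"] not in SPLIT_ENUM:
        problems.append(f"invalid split: {row['split']!r}")
    return problems


# Required fields of the second example row from the template.
example = {
    "song_id": "song_0001",
    "asset_id": "asset_0001_douyin_clip",
    "type": 7,
    "role": "query",
    "split": "test",
    "audio_path": "business_audio/song_0001/douyin_clip.mp3",
    "source_dataset": "internal_catalog",
}
print(validate_row(example))
```

An empty list means the row satisfies the required-field and enum checks; recommended fields are deliberately not enforced here.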
1 {
2 "notes": {
3 "purpose": "Starting mapping from business asset type to manifest role and default bucket.",
4 "warning": "This is a starting policy; keep ambiguous assets in excluded until manually reviewed."
5 },
6 "mappings": [
7 {"type": 10, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true},
8 {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true},
9 {"type": 9, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true},
10 {"type": 1, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true},
11 {"type": 8, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
12 {"type": 7, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
13 {"type": 16, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
14 {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool", "priority": "review", "trainable": false},
15 {"type": 2, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false},
16 {"type": 12, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false},
17 {"type": 3, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
18 {"type": 4, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
19 {"type": 5, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
20 {"type": 6, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
21 {"type": 13, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
22 {"type": 14, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
23 {"type": 17, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
24 {"type": 19, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
25 {"type": 20, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}
26 ]
27 }
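The mapping file above can be turned into a lookup keyed by `type`. The sketch below is one possible way to apply it, assuming the commit's directive that anything not covered stays `excluded`; `build_lookup` and `assign_role` are hypothetical names, and the inline `mapping_doc` holds only a few entries copied from the file for illustration.

```python
from __future__ import annotations

# A few entries copied from business_type_role_mapping.json for illustration.
mapping_doc = {
    "mappings": [
        {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core",
         "priority": "high", "trainable": True},
        {"type": 7, "role": "query", "default_bucket": "short_video_hook",
         "priority": "medium", "trainable": True},
        {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool",
         "priority": "review", "trainable": False},
    ]
}


def build_lookup(doc: dict) -> dict[int, dict]:
    """Index the mapping entries by their integer type code."""
    return {entry["type"]: entry for entry in doc["mappings"]}


def assign_role(asset_type: int, lookup: dict[int, dict]) -> tuple[str, str | None]:
    """Return (role, default_bucket); unmapped types fall back to excluded.

    The fallback bucket of None is an assumption made for this sketch.
    """
    entry = lookup.get(asset_type)
    if entry is None:
        return "excluded", None
    return entry["role"], entry["default_bucket"]


lookup = build_lookup(mapping_doc)
print(assign_role(11, lookup))
print(assign_role(99, lookup))  # 99 is not a mapped type in this sketch
```

In practice the full file would be loaded with `json.loads`, as `print_business_type_mapping.py` does, instead of the inline dictionary.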
1 #!/usr/bin/env python3
2 from __future__ import annotations
3 import argparse, json
4 from pathlib import Path
5
6 def main() -> None:
7     parser = argparse.ArgumentParser(description='Print the business type->role/bucket mapping template')
8     parser.add_argument('--config', default='configs/manifests/business_type_role_mapping.json')
9     args = parser.parse_args()
10     path = Path(__file__).resolve().parents[1] / args.config
11     data = json.loads(path.read_text())
12     print(json.dumps(data, ensure_ascii=False, indent=2))
13
14 if __name__ == '__main__':
15     main()
1 ## 2026-06-02 Business manifest and type-role spec delivery checkpoint
2
3 Completed:
4 - Added `docs/business-manifest-and-type-role-spec.md`
5 - Added `acr-engine/configs/manifests/business_asset_manifest_template.json`
6 - Added `acr-engine/configs/manifests/business_type_role_mapping.json`
7 - Added `acr-engine/scripts/print_business_type_mapping.py`
8
9 Conclusions:
10 - The next session can export the fields the manifest requires directly from the business database tables.
11 - `type -> role(reference/query/excluded) -> bucket` now has repo-native default rules, so they no longer need to be reconstructed from chat history.
12
1 ## 2026-06-02 Business asset type→bucket guide delivery checkpoint 13 ## 2026-06-02 Business asset type→bucket guide delivery checkpoint
2 14
3 Completed: 15 Completed:
......
1 # Business Manifest and Type-Role Spec
2
3 > Updated: 2026-06-02
4 > Related docs: [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md)
5
6 ## One-page summary
7
8 The repository now contains two business onboarding templates that can be reused directly:
9 - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
10 - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
11
12 They answer two questions:
13 1. Which manifest fields the business table columns must map to, at a minimum.
14 2. Which of `reference / query / excluded` each of your `type` values should default to.
15
16 ---
17
18 ## 1. Mapping diagram
19
20 ```mermaid
21 flowchart LR
22 A[Business table record] --> B[type-role mapping]
23 B --> C[reference]
24 B --> D[query]
25 B --> E[excluded]
26 C --> F[manifest rows]
27 D --> F
28 F --> G[train / build-index / evaluate]
29 ```
30
31 ---
32
33 ## 2. Minimum manifest fields
34
35 | Field | Required | Description |
36 |---|---|---|
37 | `song_id` | Yes | Primary song ID |
38 | `asset_id` | Yes | ID of the specific asset |
39 | `type` | Yes | Your existing asset type |
40 | `role` | Yes | `reference` / `query` / `excluded` |
41 | `split` | Yes | `train` / `val` / `test` / `holdout` |
42 | `audio_path` | Yes | Accessible audio path |
43 | `source_dataset` | Yes | Source identifier |
44 | `bucket` | No | Label for bucketed evaluation |
45 | `offset_sec` | No | Query start offset, in seconds |
46 | `duration_sec` | No | Clip length, in seconds |
47
48 ---
49
50 ## 3. Default type-role rules
51
52 | type | Default role | Default bucket | Notes |
53 |---:|---|---|---|
54 | `10` / `11` | `reference` | `lossless_reference_core` | Lossless master library |
55 | `9` / `1` | `reference` | `compressed_reference_realworld` | Compressed, real-world distribution |
56 | `8` / `7` / `16` | `query` | `short_video_hook` | Short-video / chorus entry points |
57 | `18` | `excluded` | `demo_variation_pool` | Review manually first |
58 | `2` / `12` | `excluded` | `with_harmony_shift` | Build a dedicated bucket first |
59 | Remaining non-audio types (`3`, `4`, `5`, `6`, `13`, `14`, `17`, `19`, `20`) | `excluded` | `non_audio` | Not used for modeling |
60
61 ---
62
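63 ## 4. Export principles
64
65 1. **Even when they belong to the same song, do not merge reference and query into a single asset record.**
66 2. **If you cannot confirm that an asset is the same song and the same version, default it to `excluded`.**
67 3. **Do not automatically merge `type=18` demos into train; review them manually first.**
68 4. **Export short-video clips as `query` first; do not use them directly as reference.**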
69
70 ---
71
72 ## 5. Templates and script
73
74 - Manifest template:
75 - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
76 - Type-role template:
77 - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
78 - Print script:
79 - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
80
81 Example command:
82
83 ```bash
84 cd /workspace/acr-engine
85 /usr/local/miniconda3/bin/python scripts/print_business_type_mapping.py
86 ```
87
88 ---
89
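90 ## 6. Immediate actions for the next session
91
92 1. Map the database table columns to manifest rows according to this spec.
93 2. Use `business_type_role_mapping.json` to assign each asset its default `role` / `bucket`.
94 3. Export the `reference` / `query` lists first, then move on to training and the bucket benchmark.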
95
96 ## Sources
97 - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
98 - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
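The export principles and next-session actions in the spec above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's export tool: the record keys `path` and `identity_confirmed`, the default `split` and `source` values, and the helper names `export_rows` and `select_role` are all hypothetical.

```python
from __future__ import annotations


def export_rows(records: list[dict], lookup: dict[int, dict],
                split: str = "train", source: str = "internal_catalog") -> list[dict]:
    """Turn database records into manifest rows, one row per asset."""
    rows = []
    for rec in records:
        entry = lookup.get(rec["type"])
        # Principle 2: without confirmed song identity, stay excluded.
        confirmed = rec.get("identity_confirmed", False)
        if entry is None or not confirmed:
            role = "excluded"
        else:
            role = entry["role"]
        rows.append({
            "song_id": rec["song_id"],
            "asset_id": rec["asset_id"],
            "type": rec["type"],
            "role": role,
            "split": split,
            "audio_path": rec["path"],       # hypothetical column name
            "source_dataset": source,
            "bucket": entry["default_bucket"] if entry else None,
        })
    return rows


def select_role(rows: list[dict], role: str) -> list[dict]:
    """Filter rows by role, e.g. to produce the reference or query list."""
    return [row for row in rows if row["role"] == role]
```

Because every asset becomes its own row, a reference and a query that share a `song_id` are never merged into one record, which matches principle 1. Types mapped to `excluded` (such as `18`) stay excluded even when identity is confirmed, so they only enter training after someone changes their role by hand.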
...@@ -257,6 +257,7 @@ ...@@ -257,6 +257,7 @@
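257 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case). 257 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case).
258 - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` 258 - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
259 - For business assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) 259 - For business assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
260 - For manifest / role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
260 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions. 261 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions.
261 3. Keep adding cap64 multi-seed runs instead of keeping only a single seed. 262 3. Keep adding cap64 multi-seed runs instead of keeping only a single seed.
262 4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases. 263 4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.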
......
1 # Training Data, Input Format, and pgvector Guide 1 # Training Data, Input Format, and pgvector Guide
2 2
3 > Updated: 2026-06-02 3 > Updated: 2026-06-02
4 > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) 4 > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md)
5 5
6 ## One-page summary 6 ## One-page summary
7 7
......