Commit 51d789e1 51d789e1bb207aa3c70d58f96c132451d18b16ed by cnb.bofCdSsphPA

Make business asset tables exportable into manifest and role mapping templates

Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
Confidence: high
Scope-risk: narrow
Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
Not-tested: Did not connect to a real database or run a live export in this checkpoint
1 parent 8739bf35
......@@ -56,6 +56,7 @@
- Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
- Business template: `acr-engine/configs/buckets/business_type_bucket_template.json`
- Business guide: `docs/business-music-bucket-and-type-guide.md`
- Manifest spec: `docs/business-manifest-and-type-role-spec.md`
2. Add the cap64 multi-seed aggregate.
3. Update:
- `docs/open-dataset-workflow.md`
......
{
  "dataset": "business_music",
  "version": "v1_template",
  "description": "Template manifest row schema for business music assets before training/indexing/evaluation.",
  "required_fields": [
    "song_id",
    "asset_id",
    "type",
    "role",
    "split",
    "audio_path",
    "source_dataset"
  ],
  "recommended_fields": [
    "title",
    "artist",
    "album_id",
    "bucket",
    "offset_sec",
    "duration_sec",
    "sample_rate",
    "bitrate",
    "license",
    "is_lossless",
    "parent_song_id",
    "notes"
  ],
  "role_enum": ["reference", "query", "excluded"],
  "split_enum": ["train", "val", "test", "holdout"],
  "example_rows": [
    {
      "song_id": "song_0001",
      "asset_id": "asset_0001_master_lossless",
      "type": 11,
      "role": "reference",
      "split": "train",
      "audio_path": "business_audio/song_0001/master_lossless.wav",
      "source_dataset": "internal_catalog",
      "bucket": "lossless_reference_core",
      "offset_sec": 0.0,
      "duration_sec": 180.0,
      "is_lossless": true
    },
    {
      "song_id": "song_0001",
      "asset_id": "asset_0001_douyin_clip",
      "type": 7,
      "role": "query",
      "split": "test",
      "audio_path": "business_audio/song_0001/douyin_clip.mp3",
      "source_dataset": "internal_catalog",
      "bucket": "short_video_hook",
      "offset_sec": 42.5,
      "duration_sec": 8.0,
      "is_lossless": false
    }
  ]
}
{
  "notes": {
    "purpose": "Starting mapping from business asset type to manifest role and default bucket.",
    "warning": "This is a starting policy; keep ambiguous assets in excluded until manually reviewed."
  },
  "mappings": [
    {"type": 10, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true},
    {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true},
    {"type": 9, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true},
    {"type": 1, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true},
    {"type": 8, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
    {"type": 7, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
    {"type": 16, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true},
    {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool", "priority": "review", "trainable": false},
    {"type": 2, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false},
    {"type": 12, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false},
    {"type": 3, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 4, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 5, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 6, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 13, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 14, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 17, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 19, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false},
    {"type": 20, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}
  ]
}
#!/usr/bin/env python3
"""Print the business type->role/bucket mapping template as formatted JSON."""
from __future__ import annotations

import argparse
import json
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description='Print the business type->role/bucket mapping template')
    parser.add_argument('--config', default='configs/manifests/business_type_role_mapping.json')
    args = parser.parse_args()
    # Resolve the config path relative to the parent of the scripts/ directory.
    path = Path(__file__).resolve().parents[1] / args.config
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    data = json.loads(path.read_text(encoding='utf-8'))
    print(json.dumps(data, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
## 2026-06-02 Business manifest and type-role spec delivery checkpoint
Completed:
- Added `docs/business-manifest-and-type-role-spec.md`
- Added `acr-engine/configs/manifests/business_asset_manifest_template.json`
- Added `acr-engine/configs/manifests/business_type_role_mapping.json`
- Added `acr-engine/scripts/print_business_type_mapping.py`
Conclusions:
- The next session can export the fields the manifest requires directly from the business database tables.
- `type -> role(reference/query/excluded) -> bucket` now has repo-native default rules, so it no longer has to be reconstructed from chat logs.
## 2026-06-02 Business asset type→bucket guide delivery checkpoint
Completed:
......
# Business Manifest and Type-Role Spec
> Updated: 2026-06-02
> Related docs: [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md)
## One-page summary
The repository now contains two business-onboarding templates that can be reused directly:
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
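They answer two questions:
1. Which manifest fields the columns of the business database tables must map to, at a minimum.
2. Which of `reference / query / excluded` each of your `type` values should fall into by default.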
---
## 1. Mapping diagram
```mermaid
flowchart LR
A[Business DB table records] --> B[type-role mapping]
B --> C[reference]
B --> D[query]
B --> E[excluded]
C --> F[manifest rows]
D --> F
F --> G[train / build-index / evaluate]
```
---
## 2. Minimum manifest fields
| Field | Required | Description |
|---|---|---|
| `song_id` | Yes | Primary song ID |
| `asset_id` | Yes | ID of the specific asset |
| `type` | Yes | Your existing asset type |
| `role` | Yes | `reference` / `query` / `excluded` |
| `split` | Yes | `train` / `val` / `test` / `holdout` |
| `audio_path` | Yes | Accessible audio path |
| `source_dataset` | Yes | Source identifier |
| `bucket` | No | Bucket label for per-bucket evaluation |
| `offset_sec` | No | Query start offset, in seconds |
| `duration_sec` | No | Clip duration, in seconds |
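
A minimal sketch of how one manifest row could be checked against the required fields above. The `validate_row` helper is hypothetical and not part of the repository; the field names and enum values are copied from `business_asset_manifest_template.json`.

```python
from __future__ import annotations

# Field names and enums copied from business_asset_manifest_template.json.
REQUIRED_FIELDS = (
    "song_id", "asset_id", "type", "role", "split", "audio_path", "source_dataset",
)
ROLE_ENUM = {"reference", "query", "excluded"}
SPLIT_ENUM = {"train", "val", "test", "holdout"}


def validate_row(row: dict) -> list[str]:
    """Return the problems found in one manifest row (hypothetical helper)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if "role" in row and row["role"] not in ROLE_ENUM:
        problems.append(f"invalid role: {row['role']!r}")
    if "split" in row and row["split"] not in SPLIT_ENUM:
        problems.append(f"invalid split: {row['split']!r}")
    return problems
```

An empty list means the row carries every required field with valid `role` and `split` values; optional fields such as `bucket` are not checked here.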
---
## 3. Default type-role rules
| type | Default role | Default bucket | Notes |
|---:|---|---|---|
| `10` / `11` | `reference` | `lossless_reference_core` | Lossless core library |
| `9` / `1` | `reference` | `compressed_reference_realworld` | Compressed, real-world distribution |
| `8` / `7` / `16` | `query` | `short_video_hook` | Short-video / chorus hook entry point |
| `18` | `excluded` | `demo_variation_pool` | Screen manually first |
| `2` / `12` | `excluded` | `with_harmony_shift` | Handle in a dedicated bucket first |
| Other non-audio types | `excluded` | `non_audio` | Not fed to the model |
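
The table above can be applied as a simple lookup. The sketch below inlines three entries from `business_type_role_mapping.json` for illustration; `default_role_and_bucket` is a hypothetical helper, and treating a type that is missing from the mapping as `excluded` is an assumption that follows the commit's directive to exclude ambiguous assets until reviewed.

```python
from __future__ import annotations

# A subset of business_type_role_mapping.json, inlined for illustration only.
DEFAULT_MAPPINGS = [
    {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core"},
    {"type": 7, "role": "query", "default_bucket": "short_video_hook"},
    {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool"},
]


def default_role_and_bucket(
    asset_type: int, mappings: list[dict]
) -> tuple[str, str | None]:
    """Look up the default role and bucket for a business asset type."""
    by_type = {entry["type"]: entry for entry in mappings}
    entry = by_type.get(asset_type)
    if entry is None:
        # Assumption: a type absent from the mapping stays excluded until reviewed.
        return "excluded", None
    return entry["role"], entry["default_bucket"]
```

In practice the `mappings` list would be loaded from the `mappings` key of the JSON file rather than inlined.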
---
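## 4. Export principles
1. **Even when a reference and a query come from the same song, do not merge them into a single asset record.**
2. **If you cannot confirm that two assets are the same song and the same version, default to `excluded`.**
3. **Do not merge `type=18` (demo) into train automatically; review it manually first.**
4. **Export short-video clips as `query` by preference; do not use them directly as references.**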
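
Principles 1 and 2 lend themselves to small automated checks. The sketch below is illustrative only: both helpers and the `identity_confirmed` flag are hypothetical, and it assumes that keeping one asset per record means `asset_id` is unique across manifest rows.

```python
from __future__ import annotations

from collections import Counter


def duplicate_asset_ids(rows: list[dict]) -> list[str]:
    """Return asset_ids that appear on more than one manifest row."""
    counts = Counter(row["asset_id"] for row in rows)
    return sorted(asset_id for asset_id, n in counts.items() if n > 1)


def resolve_role(default_role: str, identity_confirmed: bool) -> str:
    """Fall back to `excluded` when song/version identity is unconfirmed."""
    return default_role if identity_confirmed else "excluded"
```

A non-empty result from `duplicate_asset_ids` suggests that two assets, for example a reference and a query of the same song, were collapsed into one record.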
---
## 5. Templates and script
- Manifest template:
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- Type-role template:
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
- Print script:
- [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
Example command:
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/print_business_type_mapping.py
```
---
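## 6. Immediate actions for the next session
1. Map the database table columns to manifest rows according to this spec.
2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`.
3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark.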
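
Step 3 could look roughly like the sketch below, which splits already-labelled manifest rows into the two export lists. `split_export_lists` is a hypothetical helper, not an existing script in the repository; it assumes every row already carries a valid `role`.

```python
from __future__ import annotations


def split_export_lists(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split manifest rows into (reference, query) lists, dropping excluded rows."""
    reference = [row for row in rows if row["role"] == "reference"]
    query = [row for row in rows if row["role"] == "query"]
    return reference, query
```

Rows with `role == "excluded"` end up in neither list, so they stay out of training and the benchmark until they are reviewed.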
## Sources
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
......@@ -257,6 +257,7 @@
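1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case).
- Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
- For business assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- For manifest and role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions.
3. Keep adding cap64 multi-seed runs rather than keeping only a single seed.
4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.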
......
# Training Data, Input Format, and pgvector Guide
> Updated: 2026-06-02
> Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md)
> Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md)
## One-page summary
......