Make business asset tables exportable into manifest and role mapping templates
Constraint: Keep the checkpoint lightweight and avoid touching real datasets or generated artifacts
Rejected: Defer manifest guidance until a DB export tool exists | The next session needs repo-native field and role contracts now
Confidence: high
Scope-risk: narrow
Directive: Default ambiguous assets to excluded until manual review confirms song identity and usable role
Tested: Parsed manifest templates; verified print_business_type_mapping.py emits valid JSON; rechecked 94 relative links
Not-tested: Did not connect to a real database or run a live export in this checkpoint
Showing 8 changed files with 213 additions and 1 deletion
| ... | @@ -56,6 +56,7 @@ | ... | @@ -56,6 +56,7 @@ |
| 56 | - Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` | 56 | - Generic template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` |
| 57 | - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` | 57 | - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` |
| 58 | - Business guide: `docs/business-music-bucket-and-type-guide.md` | 58 | - Business guide: `docs/business-music-bucket-and-type-guide.md` |
| 59 | - Manifest spec: `docs/business-manifest-and-type-role-spec.md` | ||
| 59 | 2. Add the cap64 multi-seed aggregate. | 60 | 2. Add the cap64 multi-seed aggregate. |
| 60 | 3. Update: | 61 | 3. Update: |
| 61 | - `docs/open-dataset-workflow.md` | 62 | - `docs/open-dataset-workflow.md` | ... | ... |
| 1 | { | ||
| 2 | "dataset": "business_music", | ||
| 3 | "version": "v1_template", | ||
| 4 | "description": "Template manifest row schema for business music assets before training/indexing/evaluation.", | ||
| 5 | "required_fields": [ | ||
| 6 | "song_id", | ||
| 7 | "asset_id", | ||
| 8 | "type", | ||
| 9 | "role", | ||
| 10 | "split", | ||
| 11 | "audio_path", | ||
| 12 | "source_dataset" | ||
| 13 | ], | ||
| 14 | "recommended_fields": [ | ||
| 15 | "title", | ||
| 16 | "artist", | ||
| 17 | "album_id", | ||
| 18 | "bucket", | ||
| 19 | "offset_sec", | ||
| 20 | "duration_sec", | ||
| 21 | "sample_rate", | ||
| 22 | "bitrate", | ||
| 23 | "license", | ||
| 24 | "is_lossless", | ||
| 25 | "parent_song_id", | ||
| 26 | "notes" | ||
| 27 | ], | ||
| 28 | "role_enum": ["reference", "query", "excluded"], | ||
| 29 | "split_enum": ["train", "val", "test", "holdout"], | ||
| 30 | "example_rows": [ | ||
| 31 | { | ||
| 32 | "song_id": "song_0001", | ||
| 33 | "asset_id": "asset_0001_master_lossless", | ||
| 34 | "type": 11, | ||
| 35 | "role": "reference", | ||
| 36 | "split": "train", | ||
| 37 | "audio_path": "business_audio/song_0001/master_lossless.wav", | ||
| 38 | "source_dataset": "internal_catalog", | ||
| 39 | "bucket": "lossless_reference_core", | ||
| 40 | "offset_sec": 0.0, | ||
| 41 | "duration_sec": 180.0, | ||
| 42 | "is_lossless": true | ||
| 43 | }, | ||
| 44 | { | ||
| 45 | "song_id": "song_0001", | ||
| 46 | "asset_id": "asset_0001_douyin_clip", | ||
| 47 | "type": 7, | ||
| 48 | "role": "query", | ||
| 49 | "split": "test", | ||
| 50 | "audio_path": "business_audio/song_0001/douyin_clip.mp3", | ||
| 51 | "source_dataset": "internal_catalog", | ||
| 52 | "bucket": "short_video_hook", | ||
| 53 | "offset_sec": 42.5, | ||
| 54 | "duration_sec": 8.0, | ||
| 55 | "is_lossless": false | ||
| 56 | } | ||
| 57 | ] | ||
| 58 | } |
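
The template above lists required fields and two enums but does not ship a checker. A minimal sketch of one, assuming rows arrive as plain dictionaries; the function name `validate_row` and the wording of the messages are illustrative and not part of the repository:

```python
from typing import Any, Dict, List

# Field names and enum values copied from business_asset_manifest_template.json.
REQUIRED_FIELDS = [
    "song_id", "asset_id", "type", "role", "split", "audio_path", "source_dataset",
]
ROLE_ENUM = {"reference", "query", "excluded"}
SPLIT_ENUM = {"train", "val", "test", "holdout"}


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the row passes."""
    # Missing required fields are reported in template order.
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in row
    ]
    # Enum checks only run when the field is present, to avoid double reporting.
    if "role" in row and row["role"] not in ROLE_ENUM:
        problems.append(f"invalid role: {row['role']!r}")
    if "split" in row and row["split"] not in SPLIT_ENUM:
        problems.append(f"invalid split: {row['split']!r}")
    return problems
```

Both entries in `example_rows` pass this check. A row with a misspelled role or no `split` comes back with one message per problem.
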
| 1 | { | ||
| 2 | "notes": { | ||
| 3 | "purpose": "Starting mapping from business asset type to manifest role and default bucket.", | ||
| 4 | "warning": "This is a starting policy; keep ambiguous assets in excluded until manually reviewed." | ||
| 5 | }, | ||
| 6 | "mappings": [ | ||
| 7 | {"type": 10, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true}, | ||
| 8 | {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core", "priority": "high", "trainable": true}, | ||
| 9 | {"type": 9, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true}, | ||
| 10 | {"type": 1, "role": "reference", "default_bucket": "compressed_reference_realworld", "priority": "high", "trainable": true}, | ||
| 11 | {"type": 8, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true}, | ||
| 12 | {"type": 7, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true}, | ||
| 13 | {"type": 16, "role": "query", "default_bucket": "short_video_hook", "priority": "medium", "trainable": true}, | ||
| 14 | {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool", "priority": "review", "trainable": false}, | ||
| 15 | {"type": 2, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false}, | ||
| 16 | {"type": 12, "role": "excluded", "default_bucket": "with_harmony_shift", "priority": "review", "trainable": false}, | ||
| 17 | {"type": 3, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 18 | {"type": 4, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 19 | {"type": 5, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 20 | {"type": 6, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 21 | {"type": 13, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 22 | {"type": 14, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 23 | {"type": 17, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 24 | {"type": 19, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false}, | ||
| 25 | {"type": 20, "role": "excluded", "default_bucket": "non_audio", "priority": "drop", "trainable": false} | ||
| 26 | ] | ||
| 27 | } |
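
One way to consume the mapping above is a dictionary keyed by `type`. The sketch below inlines a three-entry subset so it runs without the repository. The fallback for types missing from the table is an assumption that follows the file's own warning to keep ambiguous assets in `excluded`; the names `FALLBACK`, `build_lookup`, and `resolve` are hypothetical:

```python
from typing import Any, Dict, List

# Subset of the "mappings" array in business_type_role_mapping.json.
MAPPINGS: List[Dict[str, Any]] = [
    {"type": 11, "role": "reference", "default_bucket": "lossless_reference_core", "trainable": True},
    {"type": 7, "role": "query", "default_bucket": "short_video_hook", "trainable": True},
    {"type": 18, "role": "excluded", "default_bucket": "demo_variation_pool", "trainable": False},
]

# Assumption: a type that is not listed is treated as excluded and untrainable.
FALLBACK: Dict[str, Any] = {"role": "excluded", "default_bucket": None, "trainable": False}


def build_lookup(mappings: List[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
    """Index mapping entries by their integer type code."""
    return {entry["type"]: entry for entry in mappings}


def resolve(asset_type: int, lookup: Dict[int, Dict[str, Any]]) -> Dict[str, Any]:
    """Return the default role, bucket, and trainable flag for one asset type."""
    entry = lookup.get(asset_type, FALLBACK)
    return {
        "role": entry["role"],
        "bucket": entry["default_bucket"],
        "trainable": entry["trainable"],
    }
```

In real use, `MAPPINGS` would be replaced by the `mappings` array loaded from the JSON file.
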
| 1 | #!/usr/bin/env python3 | ||
| 2 | from __future__ import annotations | ||
| 3 | import argparse, json | ||
| 4 | from pathlib import Path | ||
| 5 | |||
| 6 | def main() -> None: | ||
| 7 |     parser = argparse.ArgumentParser(description='Print the business type->role/bucket mapping template') | ||
| 8 |     parser.add_argument('--config', default='configs/manifests/business_type_role_mapping.json') | ||
| 9 |     args = parser.parse_args() | ||
| 10 |     path = Path(__file__).resolve().parents[1] / args.config | ||
| 11 |     data = json.loads(path.read_text()) | ||
| 12 |     print(json.dumps(data, ensure_ascii=False, indent=2)) | ||
| 13 | |||
| 14 | if __name__ == '__main__': | ||
| 15 |     main() |
| 1 | ## 2026-06-02 Business manifest and type-role spec delivery checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Added `docs/business-manifest-and-type-role-spec.md` | ||
| 5 | - Added `acr-engine/configs/manifests/business_asset_manifest_template.json` | ||
| 6 | - Added `acr-engine/configs/manifests/business_type_role_mapping.json` | ||
| 7 | - Added `acr-engine/scripts/print_business_type_mapping.py` | ||
| 8 | |||
| 9 | Conclusions: | ||
| 10 | - The next session can export the fields the manifest requires directly from the business database tables. | ||
| 11 | - `type -> role (reference/query/excluded) -> bucket` now has repo-native default rules, so they no longer have to be reconstructed from chat logs. | ||
| 12 | |||
| 1 | ## 2026-06-02 Business asset type→bucket guide delivery checkpoint | 13 | ## 2026-06-02 Business asset type→bucket guide delivery checkpoint |
| 2 | 14 | ||
| 3 | Completed: | 15 | Completed: | ... | ... |
docs/business-manifest-and-type-role-spec.md
0 → 100644
| 1 | # Business Manifest and Type-Role Spec / 业务 Manifest 与 Type-Role 规范 | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Business Asset Types and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) | ||
| 5 | |||
| 6 | ## One-Page Summary | ||
| 7 | |||
| 8 | The repository now contains two directly reusable templates for business onboarding: | ||
| 9 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 10 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 11 | |||
| 12 | They answer two questions: | ||
| 13 | 1. Which manifest fields, at minimum, the business database columns must map to. | ||
| 14 | 2. Which of `reference / query / excluded` each of your `type` values should default to. | ||
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Mapping Diagram | ||
| 19 | |||
| 20 | ```mermaid | ||
| 21 | flowchart LR | ||
| 22 | A[Business DB table records] --> B[type-role mapping] | ||
| 23 | B --> C[reference] | ||
| 24 | B --> D[query] | ||
| 25 | B --> E[excluded] | ||
| 26 | C --> F[manifest rows] | ||
| 27 | D --> F | ||
| 28 | F --> G[train / build-index / evaluate] | ||
| 29 | ``` | ||
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 33 | ## 2. Minimum Manifest Fields | ||
| 34 | |||
| 35 | | Field | Required | Description | | ||
| 36 | |---|---|---| | ||
| 37 | | `song_id` | Yes | Primary song ID | | ||
| 38 | | `asset_id` | Yes | ID of the specific asset | | ||
| 39 | | `type` | Yes | Your existing asset type | | ||
| 40 | | `role` | Yes | `reference` / `query` / `excluded` | | ||
| 41 | | `split` | Yes | `train` / `val` / `test` / `holdout` | | ||
| 42 | | `audio_path` | Yes | Accessible audio path | | ||
| 43 | | `source_dataset` | Yes | Source identifier | | ||
| 44 | | `bucket` | No | Bucket label for per-bucket evaluation | | ||
| 45 | | `offset_sec` | No | Query start offset, in seconds | | ||
| 46 | | `duration_sec` | No | Clip length, in seconds | | ||
| 47 | |||
| 48 | --- | ||
| 49 | |||
| 50 | ## 3. Default Type-Role Rules | ||
| 51 | |||
| 52 | | type | Default role | Default bucket | Notes | | ||
| 53 | |---:|---|---|---| | ||
| 54 | | `10` / `11` | `reference` | `lossless_reference_core` | Lossless main library | | ||
| 55 | | `9` / `1` | `reference` | `compressed_reference_realworld` | Compressed, real-world distribution | | ||
| 56 | | `8` / `7` / `16` | `query` | `short_video_hook` | Short-video / chorus entry point | | ||
| 57 | | `18` | `excluded` | `demo_variation_pool` | Screen manually first | | ||
| 58 | | `2` / `12` | `excluded` | `with_harmony_shift` | Handle as a dedicated bucket first | | ||
| 59 | | Remaining non-audio types | `excluded` | `non_audio` | Not fed to the model | | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
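| 63 | ## 4. Export Principles | ||
| 64 | |||
| 65 | 1. **Do not merge a reference and a query into one asset record, even when they belong to the same song.** | ||
| 66 | 2. **If you cannot confirm that an asset is the same song and the same version, default to `excluded`.** | ||
| 67 | 3. **Do not merge `type=18 demo` assets into train automatically; review them manually first.** | ||
| 68 | 4. **Export short-video clips as `query` by preference; do not use them directly as reference.** | ||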
| 69 | |||
| 70 | --- | ||
| 71 | |||
| 72 | ## 5. Templates and Script | ||
| 73 | |||
| 74 | - Manifest template: | ||
| 75 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 76 | - Type-role template: | ||
| 77 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 78 | - Print script: | ||
| 79 | - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py) | ||
| 80 | |||
| 81 | Example command: | ||
| 82 | |||
| 83 | ```bash | ||
| 84 | cd /workspace/acr-engine | ||
| 85 | /usr/local/miniconda3/bin/python scripts/print_business_type_mapping.py | ||
| 86 | ``` | ||
| 87 | |||
| 88 | --- | ||
| 89 | |||
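| 90 | ## 6. Immediate Actions for the Next Session | ||
| 91 | |||
| 92 | 1. Map the database columns to manifest rows according to this spec. | ||
| 93 | 2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`. | ||
| 94 | 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark. | ||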
| 95 | |||
| 96 | ## Sources | ||
| 97 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 98 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) |
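
The three next-session steps in the spec (map columns, assign default role and bucket, export separate reference and query lists) can be sketched as one function. This is an illustration under stated assumptions, not an export tool from the repository. The input column names are hypothetical, `split` is assumed to be assigned upstream, and `TYPE_ROLE` inlines only a subset of the mapping file:

```python
from typing import Any, Dict, List, Optional, Tuple

# type -> (role, default_bucket); a subset of business_type_role_mapping.json.
TYPE_ROLE: Dict[int, Tuple[str, Optional[str]]] = {
    11: ("reference", "lossless_reference_core"),
    7: ("query", "short_video_hook"),
    18: ("excluded", "demo_variation_pool"),
    3: ("excluded", "non_audio"),
}


def export_manifest(
    records: List[Dict[str, Any]], source_dataset: str
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split business records into (reference_rows, query_rows); drop excluded ones."""
    reference_rows: List[Dict[str, Any]] = []
    query_rows: List[Dict[str, Any]] = []
    for rec in records:
        # Unknown types fall back to excluded, following the spec's default rule.
        role, bucket = TYPE_ROLE.get(rec["type"], ("excluded", None))
        if role == "excluded":
            continue
        row = {
            "song_id": rec["song_id"],
            "asset_id": rec["asset_id"],
            "type": rec["type"],
            "role": role,
            "split": rec["split"],  # assumption: split is decided before export
            "audio_path": rec["audio_path"],
            "source_dataset": source_dataset,
            "bucket": bucket,
        }
        # Reference and query rows stay in separate lists, one row per asset.
        (reference_rows if role == "reference" else query_rows).append(row)
    return reference_rows, query_rows
```

Dropping excluded rows at export time is a design choice of this sketch. Writing them to a third list for manual review would match the spec equally well.
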
| ... | @@ -257,6 +257,7 @@ | ... | @@ -257,6 +257,7 @@ |
| 257 | 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case). | 257 | 1. Upgrade the completed toy bucket baseline to semantic buckets (style / structure / hard-case). |
| 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` | 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` |
| 259 | - For business assets, read first: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | 259 | - For business assets, read first: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) |
| 260 | - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) | ||
| 260 | 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions. | 261 | 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions. |
| 261 | 3. Keep adding cap64 multi-seed runs instead of keeping only a single seed. | 262 | 3. Keep adding cap64 multi-seed runs instead of keeping only a single seed. |
| 262 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | 263 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | ... | ... |
| 1 | # Training Data, Input Format, and pgvector Guide / 训练数据、输入格式与 pgvector 指南 | 1 | # Training Data, Input Format, and pgvector Guide / 训练数据、输入格式与 pgvector 指南 |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-02 | 3 | > Updated: 2026-06-02 |
| 4 | > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Types and Bucket Guide](./business-music-bucket-and-type-guide.md) | 4 | > Related docs: [Dataset Spec](./dataset-spec.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Onboarding](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business Asset Types and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) |
| 5 | 5 | |
| 6 | ## One-Page Summary | 6 | ## One-Page Summary |
| 7 | 7 | ... | ... |