Commit b7d4b1b6 b7d4b1b6c1ed22c23f47291ba7505d39f21166ad by cnb.bofCdSsphPA

Provide export cookbook samples so business tables can flow into manifests without guesswork

Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
Confidence: high
Scope-risk: narrow
Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
Not-tested: Did not connect to a production database or execute a live export
1 parent 51d789e1
...@@ -57,6 +57,7 @@ ...@@ -57,6 +57,7 @@
57 - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` 57 - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json`
58 - Business guide: `docs/business-music-bucket-and-type-guide.md` 58 - Business guide: `docs/business-music-bucket-and-type-guide.md`
59 - Manifest spec: `docs/business-manifest-and-type-role-spec.md` 59 - Manifest spec: `docs/business-manifest-and-type-role-spec.md`
60 - Export cookbook: `docs/business-export-cookbook.md`
60 2. Add the cap64 multi-seed aggregate. 61 2. Add the cap64 multi-seed aggregate.
61 3. Update: 62 3. Update:
62 - `docs/open-dataset-workflow.md` 63 - `docs/open-dataset-workflow.md`
......
1 song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless
2 song_0001,asset_0001_master_lossless,11,reference,train,business_audio/song_0001/master_lossless.wav,internal_catalog,Song A,Artist A,album_01,lossless_reference_core,0,180,44100,1411,licensed,true
3 song_0001,asset_0001_douyin_clip,7,query,test,business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,album_01,short_video_hook,42.5,8,44100,192,licensed,false
4 song_0002,asset_0002_demo,18,excluded,holdout,business_audio/song_0002/demo.mp3,internal_catalog,Song B,Artist B,album_02,demo_variation_pool,0,95,44100,192,review_pending,false
1 {"song_id":"song_0001","asset_id":"asset_0001_master_lossless","type":11,"role":"reference","split":"train","audio_path":"business_audio/song_0001/master_lossless.wav","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"lossless_reference_core","offset_sec":0.0,"duration_sec":180.0,"sample_rate":44100,"bitrate":1411,"license":"licensed","is_lossless":true}
2 {"song_id":"song_0001","asset_id":"asset_0001_douyin_clip","type":7,"role":"query","split":"test","audio_path":"business_audio/song_0001/douyin_clip.mp3","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"short_video_hook","offset_sec":42.5,"duration_sec":8.0,"sample_rate":44100,"bitrate":192,"license":"licensed","is_lossless":false}
3 {"song_id":"song_0002","asset_id":"asset_0002_demo","type":18,"role":"excluded","split":"holdout","audio_path":"business_audio/song_0002/demo.mp3","source_dataset":"internal_catalog","title":"Song B","artist":"Artist B","album_id":"album_02","bucket":"demo_variation_pool","offset_sec":0.0,"duration_sec":95.0,"sample_rate":44100,"bitrate":192,"license":"review_pending","is_lossless":false}
1 ## 2026-06-02 Business export cookbook and sample delivery checkpoint
2
3 Completed:
4 - Added `docs/business-export-cookbook.md`
5 - Added CSV sample: `acr-engine/configs/manifests/examples/business_asset_export_example.csv`
6 - Added JSONL sample: `acr-engine/configs/manifests/examples/business_asset_export_example.jsonl`
7
8 Conclusions:
9 - The next session now has an SQL field-mapping reference plus CSV/JSONL intermediate-format samples.
10 - The manual interpretation needed for the last step, from business tables to manifest, is reduced further.
11
1 ## 2026-06-02 Business manifest and type-role spec delivery checkpoint 12 ## 2026-06-02 Business manifest and type-role spec delivery checkpoint
2 13
3 Completed: 14 Completed:
......
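1 # Business Export Cookbook
2
3 > Updated: 2026-06-02
4 > Related docs: [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md)
5
6 ## One-page summary
7
8 If the next session needs to export real training/evaluation lists from your business tables, follow this order:
9
10 1. Export the base audio-asset fields with SQL
11 2. Use the `type-role mapping` to fill in `role` / `bucket`
12 3. Write the result to an intermediate CSV or JSONL file
13 4. Convert that file into the project manifest
14
15 The repository already includes these reference files: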
16 - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
17 - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
18 - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
19 - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
20
21 ---
22
23 ## 1. Recommended SQL export fields
24
25 ```sql
26 SELECT
27 s.id AS song_id,
28 a.id AS asset_id,
29 a.type AS type,
30 a.file_path AS audio_path,
31 s.title AS title,
32 s.artist_name AS artist,
33 s.album_id AS album_id,
34 a.duration_sec AS duration_sec,
35 a.sample_rate AS sample_rate,
36 a.bitrate AS bitrate,
37 a.license_code AS license,
38 a.created_at AS created_at
39 FROM music_asset a
40 JOIN song s ON s.id = a.song_id
41 WHERE a.type IN (1,7,8,9,10,11,16,18,2,12);
42 ```
43
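44 Notes:
45 - This SQL is not mandatory; it is only a field-mapping sample.
46 - What matters is not the table names but collecting every field the manifest spec requires.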
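
To check the column aliases without touching any real database, the sample query can be run against a throwaway in-memory SQLite database. This is only a sketch: the `CREATE TABLE` statements, the inserted rows, the non-audio `type` value `99`, and the `ORDER BY` clause are assumptions added for illustration, and the real schema will differ.

```python
import sqlite3

# Hypothetical schema: only the columns that the sample query reads.
SCHEMA = """
CREATE TABLE song (id TEXT PRIMARY KEY, title TEXT, artist_name TEXT, album_id TEXT);
CREATE TABLE music_asset (
    id TEXT PRIMARY KEY, song_id TEXT, type INTEGER, file_path TEXT,
    duration_sec REAL, sample_rate INTEGER, bitrate INTEGER,
    license_code TEXT, created_at TEXT
);
"""

# Same field mapping as the sample query; ORDER BY is added for stable output.
EXPORT_SQL = """
SELECT s.id AS song_id, a.id AS asset_id, a.type AS type,
       a.file_path AS audio_path, s.title AS title, s.artist_name AS artist,
       s.album_id AS album_id, a.duration_sec AS duration_sec,
       a.sample_rate AS sample_rate, a.bitrate AS bitrate,
       a.license_code AS license, a.created_at AS created_at
FROM music_asset a
JOIN song s ON s.id = a.song_id
WHERE a.type IN (1,7,8,9,10,11,16,18,2,12)
ORDER BY a.id
"""


def demo_connection():
    """Build an in-memory database with one audio and one non-audio asset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO song VALUES ('song_0001', 'Song A', 'Artist A', 'album_01')"
    )
    conn.executemany(
        "INSERT INTO music_asset VALUES (?,?,?,?,?,?,?,?,?)",
        [
            ("asset_0001_master_lossless", "song_0001", 11,
             "business_audio/song_0001/master_lossless.wav",
             180.0, 44100, 1411, "licensed", "2026-06-02"),
            # type 99 is a made-up non-audio type that the WHERE clause drops.
            ("asset_0001_cover", "song_0001", 99,
             "business_audio/song_0001/cover.jpg",
             None, None, None, "licensed", "2026-06-02"),
        ],
    )
    return conn


def export_rows(conn):
    """Run the export query and return each row as a plain dict."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(EXPORT_SQL)]


rows = export_rows(demo_connection())
print(list(rows[0].keys()))
```

The printed keys are the manifest-side field names, which is the part worth carrying over to the real schema.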
47
48 ---
49
50 ## 2. Fields to add after export
51
52 | Field | Source | Notes |
53 |---|---|---|
54 | `role` | `business_type_role_mapping.json` | Mapped from `type` |
55 | `bucket` | `business_type_role_mapping.json` | Default business bucket |
56 | `split` | Export script or post-processing | `train/val/test/holdout` |
57 | `source_dataset` | Fixed value | e.g. `internal_catalog` |
58 | `offset_sec` | Filled in for clip-type assets | Non-clip assets can start at `0` |
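
The table above can be read as a small enrichment step applied to each exported row. The sketch below is one possible shape for it. The `TYPE_MAPPING` dict is a hypothetical stand-in for `business_type_role_mapping.json` (its real structure is not shown here), populated only with the three type/role/bucket combinations that appear in the sample files, and `enrich_row` is an illustrative helper name. The split is passed in by the caller because no assignment policy is specified.

```python
ALLOWED_SPLITS = {"train", "val", "test", "holdout"}

# Hypothetical excerpt; the real mapping file may be structured differently.
TYPE_MAPPING = {
    11: {"role": "reference", "bucket": "lossless_reference_core"},
    7: {"role": "query", "bucket": "short_video_hook"},
    18: {"role": "excluded", "bucket": "demo_variation_pool"},
}


def enrich_row(row, split, type_mapping=TYPE_MAPPING,
               source_dataset="internal_catalog"):
    """Return a copy of an exported row with the post-export fields added."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    entry = type_mapping[row["type"]]  # KeyError flags an unmapped type
    enriched = dict(row)
    enriched.update(
        role=entry["role"],
        bucket=entry["bucket"],
        split=split,
        source_dataset=source_dataset,
    )
    # Clip-type assets keep their own offset; everything else starts at 0.
    enriched.setdefault("offset_sec", 0)
    return enriched


clip = enrich_row(
    {"asset_id": "asset_0001_douyin_clip", "type": 7, "offset_sec": 42.5},
    split="test",
)
master = enrich_row(
    {"asset_id": "asset_0001_master_lossless", "type": 11},
    split="train",
)
print(clip["role"], clip["offset_sec"], master["offset_sec"])
```

Raising on an unmapped `type` rather than guessing a role keeps mapping gaps visible during integration.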
59
60 ---
61
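62 ## 3. Recommended intermediate formats
63
64 ### CSV
65 Suited to:
66 - Business colleagues doing the initial data export
67 - Checking the data in Excel / spreadsheet tools
68
69 Sample: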
70 - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
71
72 ### JSONL
73 Suited to:
74 - Streaming processing in scripts
75 - Direct conversion to a manifest later
76
77 Sample:
78 - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
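
Because the two sample files carry the same records, moving from CSV to JSONL is mostly a matter of restoring types that CSV flattens to strings. The following is a minimal sketch of that conversion; the typed-field groupings are inferred from the sample files rather than taken from the manifest spec, and `csv_to_jsonl` is an illustrative name.

```python
import csv
import io
import json

# Field groupings inferred from the sample CSV/JSONL pair (an assumption).
INT_FIELDS = ("type", "sample_rate", "bitrate")
FLOAT_FIELDS = ("offset_sec", "duration_sec")
BOOL_FIELDS = ("is_lossless",)


def csv_to_jsonl(csv_text):
    """Convert CSV export text into JSONL text, one JSON object per row."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in INT_FIELDS:
            row[name] = int(row[name])
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        for name in BOOL_FIELDS:
            row[name] = row[name].strip().lower() == "true"
        # Compact separators keep each record on one line without padding.
        lines.append(json.dumps(row, ensure_ascii=False, separators=(",", ":")))
    return "\n".join(lines) + "\n"


SAMPLE_CSV = (
    "song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,"
    "album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless\n"
    "song_0001,asset_0001_douyin_clip,7,query,test,"
    "business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,"
    "album_01,short_video_hook,42.5,8,44100,192,licensed,false\n"
)

jsonl_text = csv_to_jsonl(SAMPLE_CSV)
record = json.loads(jsonl_text.splitlines()[0])
print(record["type"], record["offset_sec"], record["is_lossless"])
```

`ensure_ascii=False` is used so that non-ASCII titles or artist names survive the round trip unescaped.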
79
80 ---
81
82 ## 4. Suggested post-processing rules
83
84 1. `type=10/11`: default to `reference`
85 2. `type=1/9`: default to compressed-domain `reference`
86 3. `type=7/8/16`: default to `query`
87 4. `type=18/2/12`: default to `excluded` for now
88 5. Filter out non-audio assets
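
The five rules above can be sketched as a lookup plus a filter. The helper names are hypothetical, and one assumption is made explicit: any `type` outside the listed values is treated as a non-audio asset and dropped, which may be too aggressive if the real catalog has audio types not mentioned here.

```python
REFERENCE_TYPES = frozenset({10, 11})
COMPRESSED_REFERENCE_TYPES = frozenset({1, 9})
QUERY_TYPES = frozenset({7, 8, 16})
EXCLUDED_TYPES = frozenset({18, 2, 12})


def default_role(asset_type):
    """Return the default role for a type, or None if it should be filtered."""
    if asset_type in REFERENCE_TYPES or asset_type in COMPRESSED_REFERENCE_TYPES:
        return "reference"
    if asset_type in QUERY_TYPES:
        return "query"
    if asset_type in EXCLUDED_TYPES:
        return "excluded"
    return None  # assumption: unlisted types are non-audio assets


def is_compressed_reference(asset_type):
    """True for reference types that rule 2 places in the compressed domain."""
    return asset_type in COMPRESSED_REFERENCE_TYPES


def apply_default_roles(rows):
    """Attach a default role to each row and drop rows with no role."""
    result = []
    for row in rows:
        role = default_role(row["type"])
        if role is not None:
            result.append({**row, "role": role})
    return result


labelled = apply_default_roles(
    [{"asset_id": "a", "type": 11}, {"asset_id": "b", "type": 7},
     {"asset_id": "c", "type": 18}, {"asset_id": "d", "type": 99}]
)
print([(row["asset_id"], row["role"]) for row in labelled])
```

Keeping `excluded` rows in the output, rather than dropping them, matches rule 4: they are parked for now, not discarded.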
89
90 ---
91
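92 ## 5. Most direct actions for the next session
93
94 1. Export real data from the business database once, following the SQL sample
95 2. Save it as CSV or JSONL
96 3. Fill in `role` / `bucket` using the mapping rules in the repository
97 4. Convert the result into the manifest the project needs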
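
Before the final conversion step, it may be worth checking that every intermediate record is complete. The sketch below validates JSONL text against the 17 columns used in the sample files and against the role and split values mentioned in this cookbook; treating all 17 columns as required is an assumption, and the manifest spec remains the authority on which fields are mandatory.

```python
import json

# Column list copied from the header of the sample CSV.
REQUIRED_FIELDS = (
    "song_id", "asset_id", "type", "role", "split", "audio_path",
    "source_dataset", "title", "artist", "album_id", "bucket", "offset_sec",
    "duration_sec", "sample_rate", "bitrate", "license", "is_lossless",
)
ALLOWED_ROLES = {"reference", "query", "excluded"}
ALLOWED_SPLITS = {"train", "val", "test", "holdout"}


def validate_record(record):
    """Return a list of human-readable problems found in one record."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS
              if name not in record]
    if "role" in record and record["role"] not in ALLOWED_ROLES:
        errors.append(f"unknown role: {record['role']!r}")
    if "split" in record and record["split"] not in ALLOWED_SPLITS:
        errors.append(f"unknown split: {record['split']!r}")
    return errors


def validate_jsonl(text):
    """Return {line_number: [problems]} for every record that fails."""
    problems = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        errors = validate_record(json.loads(line))
        if errors:
            problems[number] = errors
    return problems


bad = validate_record({"song_id": "song_0003", "role": "master", "split": "dev"})
print(len(bad))
```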
98
99 ## Sources
100 - See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
101 - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
...@@ -93,6 +93,9 @@ cd /workspace/acr-engine ...@@ -93,6 +93,9 @@ cd /workspace/acr-engine
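93 2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket` 93 2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`
94 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark. 94 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark.
95 95
96 ## Further reading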
97 - [business-export-cookbook.md](./business-export-cookbook.md)
98
96 ## Sources 99 ## Sources
97 - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) 100 - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
98 - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) 101 - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
......
...@@ -258,6 +258,7 @@ ...@@ -258,6 +258,7 @@
258 - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` 258 - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
259 - For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) 259 - For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
260 - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) 260 - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
261 - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
261 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions. 262 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions.
262 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. 263 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed.
263 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. 264 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.
......