Commit b7d4b1b6 b7d4b1b6c1ed22c23f47291ba7505d39f21166ad by cnb.bofCdSsphPA

Provide export cookbook samples so business tables can flow into manifests without guesswork

Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
Confidence: high
Scope-risk: narrow
Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
Not-tested: Did not connect to a production database or execute a live export
1 parent 51d789e1
......@@ -57,6 +57,7 @@
- Business template: `acr-engine/configs/buckets/business_type_bucket_template.json`
- Business guide: `docs/business-music-bucket-and-type-guide.md`
- Manifest spec: `docs/business-manifest-and-type-role-spec.md`
- Export cookbook: `docs/business-export-cookbook.md`
2. Add the cap64 multi-seed aggregate.
3. Update:
- `docs/open-dataset-workflow.md`
......
song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless
song_0001,asset_0001_master_lossless,11,reference,train,business_audio/song_0001/master_lossless.wav,internal_catalog,Song A,Artist A,album_01,lossless_reference_core,0,180,44100,1411,licensed,true
song_0001,asset_0001_douyin_clip,7,query,test,business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,album_01,short_video_hook,42.5,8,44100,192,licensed,false
song_0002,asset_0002_demo,18,excluded,holdout,business_audio/song_0002/demo.mp3,internal_catalog,Song B,Artist B,album_02,demo_variation_pool,0,95,44100,192,review_pending,false
{"song_id":"song_0001","asset_id":"asset_0001_master_lossless","type":11,"role":"reference","split":"train","audio_path":"business_audio/song_0001/master_lossless.wav","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"lossless_reference_core","offset_sec":0.0,"duration_sec":180.0,"sample_rate":44100,"bitrate":1411,"license":"licensed","is_lossless":true}
{"song_id":"song_0001","asset_id":"asset_0001_douyin_clip","type":7,"role":"query","split":"test","audio_path":"business_audio/song_0001/douyin_clip.mp3","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"short_video_hook","offset_sec":42.5,"duration_sec":8.0,"sample_rate":44100,"bitrate":192,"license":"licensed","is_lossless":false}
{"song_id":"song_0002","asset_id":"asset_0002_demo","type":18,"role":"excluded","split":"holdout","audio_path":"business_audio/song_0002/demo.mp3","source_dataset":"internal_catalog","title":"Song B","artist":"Artist B","album_id":"album_02","bucket":"demo_variation_pool","offset_sec":0.0,"duration_sec":95.0,"sample_rate":44100,"bitrate":192,"license":"review_pending","is_lossless":false}
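## 2026-06-02 Checkpoint: business export cookbook and samples delivered
Completed:
- Added `docs/business-export-cookbook.md`
- Added CSV sample: `acr-engine/configs/manifests/examples/business_asset_export_example.csv`
- Added JSONL sample: `acr-engine/configs/manifests/examples/business_asset_export_example.jsonl`
Conclusions:
- The next session now has a SQL field-mapping reference plus CSV/JSONL intermediate-format samples.
- The manual interpretation needed for the last stretch from business tables to the manifest is further reduced.
## 2026-06-02 Checkpoint: business manifest and type-role spec delivered
Completed: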
......
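# Business Export Cookbook
> Updated: 2026-06-02
> Related docs: [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md)
## One-page summary
If the next session needs to export real training/evaluation lists from your business tables, follow this order:
1. Export the base audio-asset fields with SQL
2. Use the `type-role mapping` to fill in `role` / `bucket`
3. Write the result to an intermediate CSV or JSONL file
4. Convert it into the project manifest
The repository already includes these reference files: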
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
- [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
- [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
---
## 1. Recommended SQL export fields
```sql
SELECT
    s.id           AS song_id,
    a.id           AS asset_id,
    a.type         AS type,
    a.file_path    AS audio_path,
    s.title        AS title,
    s.artist_name  AS artist,
    s.album_id     AS album_id,
    a.duration_sec AS duration_sec,
    a.sample_rate  AS sample_rate,
    a.bitrate      AS bitrate,
    a.license_code AS license,
    a.created_at   AS created_at
FROM music_asset a
JOIN song s ON s.id = a.song_id
WHERE a.type IN (1,7,8,9,10,11,16,18,2,12);
```
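Notes:
- This SQL is not mandatory; it is only a field-mapping sample.
- What matters is not the table names but collecting every field the manifest spec requires.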
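
To check the field mapping without touching a real database, the query can be run against a throwaway in-memory SQLite database. This is only a sketch: the `song` and `music_asset` schemas below are guessed from the column names in the query, the inserted rows are demo data, and `type = 99` is a made-up non-audio type used to show the `WHERE` filter at work.

```python
import sqlite3

# Same column list as the cookbook query; table and column names remain illustrative.
EXPORT_SQL = """
SELECT
    s.id AS song_id, a.id AS asset_id, a.type AS type,
    a.file_path AS audio_path, s.title AS title, s.artist_name AS artist,
    s.album_id AS album_id, a.duration_sec AS duration_sec,
    a.sample_rate AS sample_rate, a.bitrate AS bitrate,
    a.license_code AS license, a.created_at AS created_at
FROM music_asset a
JOIN song s ON s.id = a.song_id
WHERE a.type IN (1,7,8,9,10,11,16,18,2,12)
"""


def build_demo_db():
    """Create an in-memory database with an assumed (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(
        """
        CREATE TABLE song (
            id TEXT PRIMARY KEY, title TEXT, artist_name TEXT, album_id TEXT
        );
        CREATE TABLE music_asset (
            id TEXT PRIMARY KEY, song_id TEXT, type INTEGER, file_path TEXT,
            duration_sec REAL, sample_rate INTEGER, bitrate INTEGER,
            license_code TEXT, created_at TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO song VALUES (?, ?, ?, ?)",
        ("song_0001", "Song A", "Artist A", "album_01"),
    )
    conn.executemany(
        "INSERT INTO music_asset VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("asset_0001_master_lossless", "song_0001", 11,
             "business_audio/song_0001/master_lossless.wav",
             180, 44100, 1411, "licensed", "2026-06-02"),
            # Hypothetical non-audio asset; its type is outside the IN (...) list.
            ("asset_0001_cover_art", "song_0001", 99,
             "business_audio/song_0001/cover.jpg",
             None, None, None, "licensed", "2026-06-02"),
        ],
    )
    return conn


def export_rows(conn):
    """Run the export query and return plain dicts keyed by the aliased names."""
    return [dict(row) for row in conn.execute(EXPORT_SQL)]
```

Calling `export_rows(build_demo_db())` returns one dict per audio asset, with keys that already match the manifest-facing column names; the real table names still need adapting during integration.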
---
## 2. Fields to add after export
| Field | Source | Notes |
|---|---|---|
| `role` | `business_type_role_mapping.json` | Mapped from `type` |
| `bucket` | `business_type_role_mapping.json` | Default business bucket |
| `split` | Export script or post-processing | One of `train`, `val`, `test`, `holdout` |
| `source_dataset` | Fixed value | e.g. `internal_catalog` |
| `offset_sec` | Filled in for clip-type assets | Non-clip assets can start with `0` |
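
The table above could be applied per row roughly as follows. This is a minimal sketch: the shape of `DEMO_MAPPING` (type code as string key, `role` and `bucket` as values) is an assumption, since the actual layout of `business_type_role_mapping.json` is not shown here, and `enrich_row` is a hypothetical helper name. The three demo entries mirror the rows in the sample CSV.

```python
ALLOWED_SPLITS = {"train", "val", "test", "holdout"}

# Assumed mapping shape; entries copied from the three sample CSV rows.
DEMO_MAPPING = {
    "11": {"role": "reference", "bucket": "lossless_reference_core"},
    "7": {"role": "query", "bucket": "short_video_hook"},
    "18": {"role": "excluded", "bucket": "demo_variation_pool"},
}


def enrich_row(row, type_mapping, split, source_dataset="internal_catalog"):
    """Return a copy of an exported row with the post-export fields filled in."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    entry = type_mapping[str(row["type"])]
    enriched = dict(row)  # leave the caller's row untouched
    enriched["role"] = entry["role"]
    enriched["bucket"] = entry["bucket"]
    enriched["split"] = split
    enriched["source_dataset"] = source_dataset
    # Non-clip assets have no offset in the export; default them to 0.
    enriched.setdefault("offset_sec", 0)
    return enriched
```

How `split` is chosen is left to the caller on purpose: in the sample data the same song appears in both `train` (as a reference) and `test` (as a query), so the split is not a simple function of `song_id`.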
---
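## 3. Recommended intermediate formats
### CSV
Suited for:
- Business colleagues doing the initial data export
- Checking in Excel / spreadsheet tools
Sample: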
- [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
### JSONL
Suited for:
- Streaming processing in scripts
- Direct conversion to the manifest later
Sample:
- [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
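
Moving from the CSV form to the JSONL form is mostly a matter of type coercion, because CSV carries every value as text. The sketch below assumes the column names of the sample files; which columns are integers, floats, or booleans is inferred from the sample JSONL, and `csv_to_jsonl` is a hypothetical helper name.

```python
import csv
import io
import json

# Column typing inferred from the sample JSONL rows.
INT_FIELDS = ("type", "sample_rate", "bitrate")
FLOAT_FIELDS = ("offset_sec", "duration_sec")
BOOL_FIELDS = ("is_lossless",)


def coerce_row(row):
    """Convert the string values of one CSV row to JSON-friendly types."""
    typed = dict(row)
    for name in INT_FIELDS:
        typed[name] = int(typed[name])
    for name in FLOAT_FIELDS:
        typed[name] = float(typed[name])
    for name in BOOL_FIELDS:
        typed[name] = typed[name].strip().lower() == "true"
    return typed


def csv_to_jsonl(csv_text):
    """Return one compact JSON string per CSV data row, keeping column order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps(coerce_row(row), ensure_ascii=False, separators=(",", ":"))
        for row in reader
    ]


# Header and one data row copied from the sample CSV.
SAMPLE_CSV = (
    "song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,"
    "album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless\n"
    "song_0001,asset_0001_douyin_clip,7,query,test,"
    "business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,"
    "album_01,short_video_hook,42.5,8,44100,192,licensed,false\n"
)

if __name__ == "__main__":
    for line in csv_to_jsonl(SAMPLE_CSV):
        print(line)
```

`ensure_ascii=False` keeps non-ASCII titles and artist names readable in the output instead of escaping them.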
---
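## 4. Suggested post-processing rules
1. `type=10/11`: default to `reference`
2. `type=1/9`: default to compressed-domain `reference`
3. `type=7/8/16`: default to `query`
4. `type=18/2/12`: default to `excluded` for now
5. Filter out non-audio assets directly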
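
The five rules above can be written as a lookup table plus a filter. This is a sketch under two assumptions that are not stated in the rules themselves: a `type` outside the table is treated as a non-audio asset and dropped, and the compressed-domain distinction is carried in a hypothetical `compressed_reference` flag rather than in a separate role name.

```python
# Rules 1-4 as a lookup table.
DEFAULT_ROLE_BY_TYPE = {
    10: "reference", 11: "reference",           # rule 1
    1: "reference", 9: "reference",             # rule 2 (compressed domain)
    7: "query", 8: "query", 16: "query",        # rule 3
    18: "excluded", 2: "excluded", 12: "excluded",  # rule 4
}
COMPRESSED_REFERENCE_TYPES = {1, 9}


def apply_default_roles(rows):
    """Attach a default role to each row and drop rows with unlisted types."""
    result = []
    for row in rows:
        asset_type = int(row["type"])
        role = DEFAULT_ROLE_BY_TYPE.get(asset_type)
        if role is None:
            # Rule 5, under the assumption that unlisted types are non-audio.
            continue
        result.append({
            **row,
            "role": role,
            # Hypothetical flag marking compressed-domain references.
            "compressed_reference": asset_type in COMPRESSED_REFERENCE_TYPES,
        })
    return result
```

These are defaults only; a per-asset override from the mapping file or from manual review would take precedence in a real export.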
---
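## 5. Most direct actions for the next session
1. Run one real-data export from the business database, following the SQL sample
2. Save it as CSV or JSONL
3. Fill in `role` / `bucket` using the mapping rules in the repository
4. Convert the result into the manifest the project needs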
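
Before the final conversion step, a quick sanity check on the intermediate rows can catch missing fields early. The sketch below is not the project's validator: the required column list is taken from the header of the sample CSV, the allowed `role` values are the three that appear in the samples and rules (the spec may define more), and `validate_rows` is a hypothetical helper name.

```python
# Columns taken from the header of the sample CSV.
REQUIRED_COLUMNS = (
    "song_id", "asset_id", "type", "role", "split", "audio_path",
    "source_dataset", "title", "artist", "album_id", "bucket",
    "offset_sec", "duration_sec", "sample_rate", "bitrate",
    "license", "is_lossless",
)
ALLOWED_ROLES = {"reference", "query", "excluded"}  # assumption: may be incomplete
ALLOWED_SPLITS = {"train", "val", "test", "holdout"}


def validate_rows(rows):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for index, row in enumerate(rows, start=1):
        # 0 and False are valid values, so only None and "" count as missing.
        missing = [
            name for name in REQUIRED_COLUMNS
            if row.get(name) is None or row.get(name) == ""
        ]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
        if row.get("role") not in ALLOWED_ROLES:
            problems.append(f"row {index}: unexpected role {row.get('role')!r}")
        if row.get("split") not in ALLOWED_SPLITS:
            problems.append(f"row {index}: unexpected split {row.get('split')!r}")
    return problems


# One row copied from the sample JSONL.
GOOD_ROW = {
    "song_id": "song_0001", "asset_id": "asset_0001_douyin_clip", "type": 7,
    "role": "query", "split": "test",
    "audio_path": "business_audio/song_0001/douyin_clip.mp3",
    "source_dataset": "internal_catalog", "title": "Song A",
    "artist": "Artist A", "album_id": "album_01", "bucket": "short_video_hook",
    "offset_sec": 42.5, "duration_sec": 8.0, "sample_rate": 44100,
    "bitrate": 192, "license": "licensed", "is_lossless": False,
}
```

Running `validate_rows` on rows that skipped step 3 reports the missing `role` and `bucket` fields, which makes it a cheap gate between the intermediate file and the manifest.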
## Sources
- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
......@@ -93,6 +93,9 @@ cd /workspace/acr-engine
2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`.
3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark.
## Further reading
- [business-export-cookbook.md](./business-export-cookbook.md)
## Sources
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
......
......@@ -258,6 +258,7 @@
- Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json`
- For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
- SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
2. Compare the inconsistencies between cap48 and cap64 and add conclusions broken down by scale.
3. Keep adding cap64 multi-seed runs instead of keeping only a single seed.
4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.
......