Provide export cookbook samples so business tables can flow into manifests without guesswork
Constraint: Keep this checkpoint static and avoid any real database connectivity or dataset mutation
Rejected: Leave export details implicit until a live exporter exists | The next session needs concrete SQL, CSV, and JSONL examples now
Confidence: high
Scope-risk: narrow
Directive: Treat the SQL as a field-mapping example only and adapt table names to the real schema during integration
Tested: Parsed the CSV and JSONL examples and rechecked 69 relative links across the export docs
Not-tested: Did not connect to a production database or execute a live export
Showing 7 changed files with 124 additions and 0 deletions
| ... | @@ -57,6 +57,7 @@ | ... | @@ -57,6 +57,7 @@ |
| 57 | - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` | 57 | - Business template: `acr-engine/configs/buckets/business_type_bucket_template.json` |
| 58 | - Business guide: `docs/business-music-bucket-and-type-guide.md` | 58 | - Business guide: `docs/business-music-bucket-and-type-guide.md` |
| 59 | - Manifest spec: `docs/business-manifest-and-type-role-spec.md` | 59 | - Manifest spec: `docs/business-manifest-and-type-role-spec.md` |
| 60 | - Export cookbook: `docs/business-export-cookbook.md` | ||
| 60 | 2. Add the cap64 multi-seed aggregate. | 61 | 2. Add the cap64 multi-seed aggregate. |
| 61 | 3. Update: | 62 | 3. Update: |
| 62 | - `docs/open-dataset-workflow.md` | 63 | - `docs/open-dataset-workflow.md` | ... | ... |
| 1 | song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless | ||
| 2 | song_0001,asset_0001_master_lossless,11,reference,train,business_audio/song_0001/master_lossless.wav,internal_catalog,Song A,Artist A,album_01,lossless_reference_core,0,180,44100,1411,licensed,true | ||
| 3 | song_0001,asset_0001_douyin_clip,7,query,test,business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,album_01,short_video_hook,42.5,8,44100,192,licensed,false | ||
| 4 | song_0002,asset_0002_demo,18,excluded,holdout,business_audio/song_0002/demo.mp3,internal_catalog,Song B,Artist B,album_02,demo_variation_pool,0,95,44100,192,review_pending,false |
| 1 | {"song_id":"song_0001","asset_id":"asset_0001_master_lossless","type":11,"role":"reference","split":"train","audio_path":"business_audio/song_0001/master_lossless.wav","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"lossless_reference_core","offset_sec":0.0,"duration_sec":180.0,"sample_rate":44100,"bitrate":1411,"license":"licensed","is_lossless":true} | ||
| 2 | {"song_id":"song_0001","asset_id":"asset_0001_douyin_clip","type":7,"role":"query","split":"test","audio_path":"business_audio/song_0001/douyin_clip.mp3","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"short_video_hook","offset_sec":42.5,"duration_sec":8.0,"sample_rate":44100,"bitrate":192,"license":"licensed","is_lossless":false} | ||
| 3 | {"song_id":"song_0002","asset_id":"asset_0002_demo","type":18,"role":"excluded","split":"holdout","audio_path":"business_audio/song_0002/demo.mp3","source_dataset":"internal_catalog","title":"Song B","artist":"Artist B","album_id":"album_02","bucket":"demo_variation_pool","offset_sec":0.0,"duration_sec":95.0,"sample_rate":44100,"bitrate":192,"license":"review_pending","is_lossless":false} |
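The commit message says the CSV and JSONL examples were parsed, but the check itself is not part of the diff. A minimal sketch of such a check might look like the following. It assumes the two files are meant to carry the same records, with CSV cells coerced to the types used in the JSONL rows. The names `coerce_csv_row`, `load_csv`, and `load_jsonl` are hypothetical helpers, not code from the repository, and the inline sample is the `douyin_clip` row copied from the two example files.

```python
import csv
import io
import json

# One record copied from the CSV example (header plus the douyin_clip row).
CSV_TEXT = """song_id,asset_id,type,role,split,audio_path,source_dataset,title,artist,album_id,bucket,offset_sec,duration_sec,sample_rate,bitrate,license,is_lossless
song_0001,asset_0001_douyin_clip,7,query,test,business_audio/song_0001/douyin_clip.mp3,internal_catalog,Song A,Artist A,album_01,short_video_hook,42.5,8,44100,192,licensed,false
"""

# The matching record copied from the JSONL example.
JSONL_TEXT = '{"song_id":"song_0001","asset_id":"asset_0001_douyin_clip","type":7,"role":"query","split":"test","audio_path":"business_audio/song_0001/douyin_clip.mp3","source_dataset":"internal_catalog","title":"Song A","artist":"Artist A","album_id":"album_01","bucket":"short_video_hook","offset_sec":42.5,"duration_sec":8.0,"sample_rate":44100,"bitrate":192,"license":"licensed","is_lossless":false}\n'

# Assumed typing, inferred from how the JSONL sample encodes each field.
INT_FIELDS = ("type", "sample_rate", "bitrate")
FLOAT_FIELDS = ("offset_sec", "duration_sec")


def coerce_csv_row(row):
    """Convert the all-string CSV row into the types used by the JSONL rows."""
    typed = dict(row)
    for name in INT_FIELDS:
        typed[name] = int(row[name])
    for name in FLOAT_FIELDS:
        typed[name] = float(row[name])
    typed["is_lossless"] = row["is_lossless"].strip().lower() == "true"
    return typed


def load_csv(text):
    """Parse CSV text with a header row into a list of typed dicts."""
    return [coerce_csv_row(r) for r in csv.DictReader(io.StringIO(text))]


def load_jsonl(text):
    """Parse JSONL text (one JSON object per non-empty line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    same = load_csv(CSV_TEXT) == load_jsonl(JSONL_TEXT)
    print("CSV and JSONL samples agree:", same)
```

Coercing on the CSV side keeps the JSONL rows as the typed reference, which fits the cookbook's note that JSONL is the format to convert directly into the manifest.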
| 1 | ## 2026-06-02 Business export cookbook and sample delivery checkpoint | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Added `docs/business-export-cookbook.md` | ||
| 5 | - Added CSV sample: `acr-engine/configs/manifests/examples/business_asset_export_example.csv` | ||
| 6 | - Added JSONL sample: `acr-engine/configs/manifests/examples/business_asset_export_example.jsonl` | ||
| 7 | |||
| 8 | Conclusions: | ||
| 9 | - The next session now has a SQL field-mapping reference plus CSV/JSONL intermediate-format samples. | ||
| 10 | - The manual interpretation needed for the last step, from business tables to manifest, is further reduced. | ||
| 11 | |||
| 1 | ## 2026-06-02 Business manifest and type-role spec delivery checkpoint | 12 | ## 2026-06-02 Business manifest and type-role spec delivery checkpoint |
| 2 | 13 | ||
| 3 | Completed: | 14 | Completed: | ... | ... |
docs/business-export-cookbook.md
0 → 100644
| 1 | # Business Export Cookbook | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | If the next session needs to export real training/evaluation lists from your business tables, follow this order: | ||
| 9 | |||
| 10 | 1. Export the base audio-asset fields with SQL | ||
| 11 | 2. Fill in `role` / `bucket` using the `type-role mapping` | ||
| 12 | 3. Write the result to a CSV or JSONL intermediate file | ||
| 13 | 4. Convert that file into the project manifest | ||
| 14 | |||
| 15 | The repository already contains these reference files: | ||
| 16 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 17 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 18 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv) | ||
| 19 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl) | ||
| 20 | |||
| 21 | --- | ||
| 22 | |||
| 23 | ## 1. Recommended SQL export fields | ||
| 24 | |||
| 25 | ```sql | ||
| 26 | SELECT | ||
| 27 | s.id AS song_id, | ||
| 28 | a.id AS asset_id, | ||
| 29 | a.type AS type, | ||
| 30 | a.file_path AS audio_path, | ||
| 31 | s.title AS title, | ||
| 32 | s.artist_name AS artist, | ||
| 33 | s.album_id AS album_id, | ||
| 34 | a.duration_sec AS duration_sec, | ||
| 35 | a.sample_rate AS sample_rate, | ||
| 36 | a.bitrate AS bitrate, | ||
| 37 | a.license_code AS license, | ||
| 38 | a.created_at AS created_at | ||
| 39 | FROM music_asset a | ||
| 40 | JOIN song s ON s.id = a.song_id | ||
| 41 | WHERE a.type IN (1,7,8,9,10,11,16,18,2,12); | ||
| 42 | ``` | ||
| 43 | |||
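| 44 | Notes: | ||
| 45 | - This SQL is not mandatory; it is only a field-mapping example. | ||
| 46 | - The table names do not matter; what matters is gathering every field the manifest spec requires. | ||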
| 47 | |||
| 48 | --- | ||
| 49 | |||
| 50 | ## 2. Fields to fill in after export | ||
| 51 | |||
| 52 | | Field | Source | Notes | | ||
| 53 | |---|---|---| | ||
| 54 | | `role` | `business_type_role_mapping.json` | Mapped from `type` | | ||
| 55 | | `bucket` | `business_type_role_mapping.json` | Default business bucket | | ||
| 56 | | `split` | Export script or post-processing | `train/val/test/holdout` | | ||
| 57 | | `source_dataset` | Fixed value | e.g. `internal_catalog` | | ||
| 58 | | `offset_sec` | Optional; fill in for clip-type assets | Set to `0` for non-clip assets for now | | ||
| 59 | |||
| 60 | --- | ||
| 61 | |||
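| 62 | ## 3. Recommended intermediate formats | ||
| 63 | |||
| 64 | ### CSV | ||
| 65 | Best for: | ||
| 66 | - Business colleagues exporting the data first | ||
| 67 | - Checking in Excel or other spreadsheet tools | ||
| 68 | |||
| 69 | Sample: | ||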
| 70 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv) | ||
| 71 | |||
| 72 | ### JSONL | ||
| 73 | Best for: | ||
| 74 | - Streaming processing in scripts | ||
| 75 | - Direct conversion to the manifest afterwards | ||
| 76 | |||
| 77 | Sample: | ||
| 78 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl) | ||
| 79 | |||
| 80 | --- | ||
| 81 | |||
| 82 | ## 4. Suggested post-processing rules | ||
| 83 | |||
| 84 | 1. `type=10/11` defaults to `reference` | ||
| 85 | 2. `type=1/9` defaults to compressed-domain `reference` | ||
| 86 | 3. `type=7/8/16` defaults to `query` | ||
| 87 | 4. `type=18/2/12` defaults to `excluded` for now | ||
| 88 | 5. Filter out non-audio assets entirely | ||
| 89 | |||
| 90 | --- | ||
| 91 | |||
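| 92 | ## 5. Most direct actions for the next session | ||
| 93 | |||
| 94 | 1. Run one real export from the business database, following the SQL example | ||
| 95 | 2. Save it as CSV or JSONL | ||
| 96 | 3. Fill in `role` / `bucket` using the mapping rules in the repository | ||
| 97 | 4. Convert it into the manifest the project needs | ||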
| 98 | |||
| 99 | ## Sources | ||
| 100 | - See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) | ||
| 101 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) |
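The post-processing rules in section 4 of the cookbook can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the role table below is copied from the section 4 rules rather than read from `business_type_role_mapping.json`, whose schema is not shown in this diff; `bucket` is left out for the same reason; and treating any unmapped `type` as a non-audio asset to be dropped is an assumption about rule 5. The names `DEFAULT_ROLE_BY_TYPE`, `enrich_row`, and `enrich_rows` are hypothetical.

```python
import json

# Default roles copied from section 4 of the cookbook. In a real integration
# this table would come from business_type_role_mapping.json instead.
DEFAULT_ROLE_BY_TYPE = {
    10: "reference", 11: "reference",
    1: "reference", 9: "reference",   # compressed-domain reference
    7: "query", 8: "query", 16: "query",
    18: "excluded", 2: "excluded", 12: "excluded",
}


def enrich_row(row, source_dataset="internal_catalog"):
    """Return a copy of an exported row with role, source_dataset and offset_sec set.

    Returns None when the type has no default role; this sketch assumes such
    rows are non-audio assets and drops them (rule 5).
    """
    asset_type = int(row["type"])
    if asset_type not in DEFAULT_ROLE_BY_TYPE:
        return None
    enriched = dict(row)
    enriched["role"] = DEFAULT_ROLE_BY_TYPE[asset_type]
    enriched.setdefault("source_dataset", source_dataset)
    enriched.setdefault("offset_sec", 0)  # non-clip assets default to 0
    return enriched


def enrich_rows(rows):
    """Yield enriched rows, skipping any that enrich_row filters out."""
    for row in rows:
        enriched = enrich_row(row)
        if enriched is not None:
            yield enriched


if __name__ == "__main__":
    exported = [
        {"song_id": "song_0001", "asset_id": "asset_0001_douyin_clip", "type": 7},
        {"song_id": "song_0002", "asset_id": "asset_0002_demo", "type": 18},
        {"song_id": "song_0003", "asset_id": "asset_0003_cover_image", "type": 99},
    ]
    # Emit JSONL, one enriched record per line.
    for record in enrich_rows(exported):
        print(json.dumps(record, ensure_ascii=False))
```

Rows are copied rather than mutated so the raw export stays available for comparison. `split` is deliberately not assigned here, since the cookbook leaves it to the export script or a later post-processing step.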
| ... | @@ -93,6 +93,9 @@ cd /workspace/acr-engine | ... | @@ -93,6 +93,9 @@ cd /workspace/acr-engine |
| 93 | 2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`. | 93 | 2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`. |
| 94 | 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark. | 94 | 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark. |
| 95 | 95 | ||
| 96 | ## Further reading | ||
| 97 | - [business-export-cookbook.md](./business-export-cookbook.md) | ||
| 98 | |||
| 96 | ## Sources | 99 | ## Sources |
| 97 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | 100 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) |
| 98 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | 101 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ... | ... |
| ... | @@ -258,6 +258,7 @@ | ... | @@ -258,6 +258,7 @@ |
| 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` | 258 | - Template: `acr-engine/configs/buckets/fma_semantic_bucket_template.json` |
| 259 | - For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | 259 | - For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) |
| 260 | - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) | 260 | - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) |
| 261 | - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md) | ||
| 261 | 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions. | 262 | 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions. |
| 262 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. | 263 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. |
| 263 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | 264 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | ... | ... |