Commit b5981c79 b5981c7964836f2d2cb871e57f99acc42a6939e7 by cnb.bofCdSsphPA

Turn business export guidance into a runnable normalization step for the next session

Constraint: Keep this checkpoint offline-only and avoid touching real databases, datasets, or model artifacts
Rejected: Stop at static CSV/JSONL examples only | The next session needs an executable normalization path, not just samples
Confidence: high
Scope-risk: narrow
Directive: Treat normalized JSONL as manifest-ready staging output and keep final manifest shaping explicit in the integration step
Tested: Ran normalize_business_export.py on the sample CSV and JSONL inputs; verified 3 output rows each; rechecked 71 relative links
Not-tested: Did not run against a live business export or connect to any database
1 parent b7d4b1b6
......@@ -58,6 +58,7 @@
- Business guide: `docs/business-music-bucket-and-type-guide.md`
- Manifest spec: `docs/business-manifest-and-type-role-spec.md`
- Export cookbook: `docs/business-export-cookbook.md`
- Normalization script: `acr-engine/scripts/normalize_business_export.py`
2. Add the cap64 multi-seed aggregate.
3. Update:
- `docs/open-dataset-workflow.md`
......
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import csv
import json
from pathlib import Path
from typing import Iterable


def load_rows(path: Path) -> list[dict]:
    """Load export rows from a CSV or JSONL file, chosen by file suffix."""
    suffix = path.suffix.lower()
    if suffix == '.csv':
        # Read as UTF-8 explicitly so non-ASCII titles do not depend on the locale.
        with path.open(newline='', encoding='utf-8') as f:
            return list(csv.DictReader(f))
    if suffix == '.jsonl':
        text = path.read_text(encoding='utf-8')
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f'unsupported input format: {path}')


def load_mapping(path: Path) -> dict[int, dict]:
    """Index the type-role mapping entries by integer asset type."""
    data = json.loads(path.read_text(encoding='utf-8'))
    return {int(item['type']): item for item in data['mappings']}


def parse_bool(value):
    """Return True/False for recognized values, None for anything else."""
    if isinstance(value, bool):
        return value
    if value is None:
        return None
    s = str(value).strip().lower()
    if s in {'true', '1', 'yes'}:
        return True
    if s in {'false', '0', 'no'}:
        return False
    return None


def parse_float(value):
    """Return a float, or None when the value is empty or not numeric."""
    if value in (None, ''):
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def normalize_row(row: dict, mapping: dict[int, dict], source_dataset: str, default_split: str) -> dict:
    row = dict(row)
    asset_type = int(row['type'])
    # Types missing from the mapping are excluded and never trainable.
    rule = mapping.get(asset_type, {'role': 'excluded', 'default_bucket': 'unknown', 'trainable': False})
    normalized = {
        'song_id': row['song_id'],
        'asset_id': row['asset_id'],
        'type': asset_type,
        'role': row.get('role') or rule['role'],
        'split': row.get('split') or default_split,
        'audio_path': row['audio_path'],
        'source_dataset': row.get('source_dataset') or source_dataset,
        'title': row.get('title'),
        'artist': row.get('artist'),
        'album_id': row.get('album_id'),
        'bucket': row.get('bucket') or rule.get('default_bucket'),
        'offset_sec': parse_float(row.get('offset_sec')),
        'duration_sec': parse_float(row.get('duration_sec')),
        'sample_rate': int(row['sample_rate']) if row.get('sample_rate') not in (None, '') else None,
        'bitrate': int(row['bitrate']) if row.get('bitrate') not in (None, '') else None,
        'license': row.get('license'),
        'is_lossless': parse_bool(row.get('is_lossless')),
        'trainable': bool(rule.get('trainable', False)),
    }
    return normalized


def emit_jsonl(rows: Iterable[dict], output: Path) -> None:
    output.parent.mkdir(parents=True, exist_ok=True)
    with output.open('w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')


def main() -> None:
    parser = argparse.ArgumentParser(description='Normalize business CSV/JSONL export into manifest-ready JSONL rows')
    parser.add_argument('--input', required=True, help='Input CSV or JSONL export')
    parser.add_argument('--mapping', default='configs/manifests/business_type_role_mapping.json')
    parser.add_argument('--source-dataset', default='internal_catalog')
    parser.add_argument('--default-split', default='holdout')
    parser.add_argument('--output', required=True, help='Output JSONL path')
    args = parser.parse_args()

    # Relative paths are resolved against the directory above scripts/, not the cwd.
    repo = Path(__file__).resolve().parents[1]
    input_path = Path(args.input)
    if not input_path.is_absolute():
        input_path = (repo / input_path).resolve()
    mapping_path = Path(args.mapping)
    if not mapping_path.is_absolute():
        mapping_path = (repo / mapping_path).resolve()
    output_path = Path(args.output)
    if not output_path.is_absolute():
        output_path = (repo / output_path).resolve()

    rows = load_rows(input_path)
    mapping = load_mapping(mapping_path)
    normalized = [normalize_row(row, mapping, args.source_dataset, args.default_split) for row in rows]
    emit_jsonl(normalized, output_path)

    summary = {
        'input_rows': len(rows),
        'output_rows': len(normalized),
        'output': str(output_path),
        'roles': sorted({row['role'] for row in normalized}),
        'buckets': sorted({row['bucket'] for row in normalized if row.get('bucket')}),
    }
    print(json.dumps(summary, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
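## 2026-06-02 Business export normalization script delivery checkpoint
Completed:
- Added `acr-engine/scripts/normalize_business_export.py`
- Moved the business export cookbook from "sample walkthrough" to "runnable conversion script + sample inputs".
Conclusions:
- The next session can convert business CSV/JSONL exports directly into manifest-ready JSONL.
- The default `type -> role -> bucket` rules are no longer only a documentation convention; an executable script now implements them.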
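A minimal sketch of how the `type -> role -> bucket` defaults resolve, for readers who want to see the rule in isolation. The mapping entry below (type `1`, role `reference`, bucket `studio`) is a hypothetical illustration, and `resolve_defaults` is a name invented for this sketch; the key names and the `excluded` / `unknown` fallback follow what `normalize_business_export.py` does.

```python
# Hypothetical mapping entry; the real values live in business_type_role_mapping.json.
MAPPING = {
    1: {"type": 1, "role": "reference", "default_bucket": "studio", "trainable": True},
}

# Fallback used when an asset type is not present in the mapping.
FALLBACK = {"role": "excluded", "default_bucket": "unknown", "trainable": False}


def resolve_defaults(asset_type, row_role=None, row_bucket=None):
    """Values already present on the row win; the mapping only fills gaps."""
    rule = MAPPING.get(asset_type, FALLBACK)
    return {
        "role": row_role or rule["role"],
        "bucket": row_bucket or rule.get("default_bucket"),
        # trainable always comes from the mapping, never from the row.
        "trainable": bool(rule.get("trainable", False)),
    }


print(resolve_defaults(1))
print(resolve_defaults(999))
```

Note that an unmapped type is not an error in this scheme: the row is kept, marked `excluded`, and flagged as not trainable.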
## 2026-06-02 Business export cookbook and samples delivery checkpoint
Completed:
......
......@@ -11,6 +11,7 @@
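2. Apply the `type-role mapping` to fill in `role` / `bucket`
3. Write a CSV or JSONL intermediate file
4. Then convert it into the project manifest
5. Or first use the repository script to convert it directly into manifest-ready JSONL
The repository already includes the following reference materials: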
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
......@@ -99,3 +100,24 @@ WHERE a.type IN (1,7,8,9,10,11,16,18,2,12);
## Sources
- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
## 6. Lightweight normalization script
The repository now includes a directly runnable conversion script:
- [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
Example:
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
--input configs/manifests/examples/business_asset_export_example.csv \
--output /tmp/business_asset_manifest_ready.jsonl
```
This script:
1. Reads a CSV or JSONL export
2. Applies `business_type_role_mapping.json`
3. Fills in default values for `role / bucket / source_dataset / split`
4. Writes manifest-ready JSONL
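Because CSV exports deliver every field as a string, the script coerces values before writing JSONL. The following is a reduced sketch of that coercion on an in-memory CSV, not the script itself. The sample rows and column values are made up for illustration; the `holdout` and `internal_catalog` defaults match the script's CLI defaults.

```python
import csv
import io
import json

# Hypothetical two-row export; the second row leaves optional fields empty.
SAMPLE_CSV = (
    "song_id,asset_id,type,audio_path,duration_sec,is_lossless\n"
    "s1,a1,1,audio/a1.wav,12.5,yes\n"
    "s2,a2,7,audio/a2.wav,,\n"
)


def parse_bool(value):
    """True/False for recognized spellings, None otherwise."""
    if isinstance(value, bool):
        return value
    if value is None:
        return None
    s = str(value).strip().lower()
    if s in {"true", "1", "yes"}:
        return True
    if s in {"false", "0", "no"}:
        return False
    return None


def parse_float(value):
    """Float, or None when the field is empty or not numeric."""
    if value in (None, ""):
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def stage(row):
    """Coerce one CSV row and fill the split / source_dataset defaults."""
    return {
        "song_id": row["song_id"],
        "asset_id": row["asset_id"],
        "type": int(row["type"]),
        "audio_path": row["audio_path"],
        "duration_sec": parse_float(row.get("duration_sec")),
        "is_lossless": parse_bool(row.get("is_lossless")),
        "split": row.get("split") or "holdout",
        "source_dataset": row.get("source_dataset") or "internal_catalog",
    }


staged = [stage(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for item in staged:
    # One JSON object per line, keeping non-ASCII text readable.
    print(json.dumps(item, ensure_ascii=False))
```

Empty or unrecognized optional values become `null` in the output rather than raising, so downstream manifest shaping has to decide how to treat missing durations or lossless flags.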
......
......@@ -77,6 +77,8 @@ flowchart LR
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
- Print script:
- [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
- Normalization script:
- [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
Example command:
......
......@@ -259,6 +259,7 @@
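- For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- For manifest and role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
- SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
- Normalization script: `acr-engine/scripts/normalize_business_export.py`
2. Compare the inconsistencies between cap48 and cap64, and add conclusions broken down by scale.
3. Keep adding cap64 multi-seed runs instead of keeping only a single seed.
4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.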
......