Turn business export guidance into a runnable normalization step for the next session

Constraint: Keep this checkpoint offline-only and avoid touching real databases, datasets, or model artifacts
Rejected: Stop at static CSV/JSONL examples only | The next session needs an executable normalization path, not just samples
Confidence: high
Scope-risk: narrow
Directive: Treat normalized JSONL as manifest-ready staging output and keep final manifest shaping explicit in the integration step
Tested: Ran normalize_business_export.py on the sample CSV and JSONL inputs; verified 3 output rows each; rechecked 71 relative links
Not-tested: Did not run against a live business export or connect to any database

Commit b5981c7964836f2d2cb871e57f99acc42a6939e7 authored 2026-06-02 18:57:07 +0800 by cnb.bofCdSsphPA
Showing 6 changed files with 153 additions and 0 deletions
AGENT.md
acr-engine/scripts/normalize_business_export.py
docs/CHANGELOG.md
docs/business-export-cookbook.md
docs/business-manifest-and-type-role-spec.md
docs/session-handoff.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -58,6 +58,7 @@
   - Business overview: `docs/business-music-bucket-and-type-guide.md`
   - Manifest spec: `docs/business-manifest-and-type-role-spec.md`
   - Export cookbook: `docs/business-export-cookbook.md`
+   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
 2. Add the cap64 multi-seed aggregate.
 3. Update:
   - `docs/open-dataset-workflow.md`
--- a/acr-engine/scripts/normalize_business_export.py 0 → 100755
+++ b/acr-engine/scripts/normalize_business_export.py 0 → 100755
+#!/usr/bin/env python3
+from __future__ import annotations
+import argparse
+import csv
+import json
+from pathlib import Path
+from typing import Iterable
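+def load_rows(path: Path) -> list[dict]:
+    """Load export rows from a CSV or JSONL file, chosen by file suffix."""
+    suffix = path.suffix.lower()
+    if suffix == '.csv':
+        # Read as UTF-8 explicitly so non-ASCII titles do not depend on the locale.
+        with path.open(newline='', encoding='utf-8') as f:
+            return list(csv.DictReader(f))
+    if suffix == '.jsonl':
+        # One JSON object per line; blank lines are skipped.
+        return [json.loads(line) for line in path.read_text(encoding='utf-8').splitlines() if line.strip()]
+    raise ValueError(f'unsupported input format: {path}')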
+def load_mapping(path: Path) -> dict[int, dict]:
+    data = json.loads(path.read_text())
+    return {int(item['type']): item for item in data['mappings']}
+def parse_bool(value):
+    if isinstance(value, bool):
+        return value
+    if value is None:
+        return None
+    s = str(value).strip().lower()
+    if s in {'true', '1', 'yes'}:
+        return True
+    if s in {'false', '0', 'no'}:
+        return False
+    return None
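+def parse_float(value):
+    """Return value as a float, or None when it is empty or not numeric."""
+    if value in (None, ''):
+        return None
+    try:
+        return float(value)
+    except (TypeError, ValueError):
+        # TypeError covers non-scalar JSONL values such as lists or objects.
+        return None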
+def normalize_row(row: dict, mapping: dict[int, dict], source_dataset: str, default_split: str) -> dict:
+    row = dict(row)
+    asset_type = int(row['type'])
+    rule = mapping.get(asset_type, {'role': 'excluded', 'default_bucket': 'unknown', 'trainable': False})
+    normalized = {
+        'song_id': row['song_id'],
+        'asset_id': row['asset_id'],
+        'type': asset_type,
+        'role': row.get('role') or rule['role'],
+        'split': row.get('split') or default_split,
+        'audio_path': row['audio_path'],
+        'source_dataset': row.get('source_dataset') or source_dataset,
+        'title': row.get('title'),
+        'artist': row.get('artist'),
+        'album_id': row.get('album_id'),
+        'bucket': row.get('bucket') or rule.get('default_bucket'),
+        'offset_sec': parse_float(row.get('offset_sec')),
+        'duration_sec': parse_float(row.get('duration_sec')),
+        'sample_rate': int(row['sample_rate']) if row.get('sample_rate') not in (None, '') else None,
+        'bitrate': int(row['bitrate']) if row.get('bitrate') not in (None, '') else None,
+        'license': row.get('license'),
+        'is_lossless': parse_bool(row.get('is_lossless')),
+        'trainable': bool(rule.get('trainable', False)),
+    }
+    return normalized
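+def emit_jsonl(rows: Iterable[dict], output: Path) -> None:
+    """Write one JSON object per line, creating parent directories as needed."""
+    output.parent.mkdir(parents=True, exist_ok=True)
+    # ensure_ascii=False emits raw non-ASCII text, so the file is written as UTF-8.
+    with output.open('w', encoding='utf-8') as f:
+        for row in rows:
+            f.write(json.dumps(row, ensure_ascii=False) + '\n')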
+def main() -> None:
+    parser = argparse.ArgumentParser(description='Normalize business CSV/JSONL export into manifest-ready JSONL rows')
+    parser.add_argument('--input', required=True, help='Input CSV or JSONL export')
+    parser.add_argument('--mapping', default='configs/manifests/business_type_role_mapping.json')
+    parser.add_argument('--source-dataset', default='internal_catalog')
+    parser.add_argument('--default-split', default='holdout')
+    parser.add_argument('--output', required=True, help='Output JSONL path')
+    args = parser.parse_args()
+    repo = Path(__file__).resolve().parents[1]
+    input_path = Path(args.input)
+    if not input_path.is_absolute():
+        input_path = (repo / input_path).resolve()
+    mapping_path = Path(args.mapping)
+    if not mapping_path.is_absolute():
+        mapping_path = (repo / mapping_path).resolve()
+    output_path = Path(args.output)
+    if not output_path.is_absolute():
+        output_path = (repo / output_path).resolve()
+    rows = load_rows(input_path)
+    mapping = load_mapping(mapping_path)
+    normalized = [normalize_row(row, mapping, args.source_dataset, args.default_split) for row in rows]
+    emit_jsonl(normalized, output_path)
+    summary = {
+        'input_rows': len(rows),
+        'output_rows': len(normalized),
+        'output': str(output_path),
+        'roles': sorted({row['role'] for row in normalized}),
+        'buckets': sorted({row['bucket'] for row in normalized if row.get('bucket')}),
+    }
+    print(json.dumps(summary, ensure_ascii=False, indent=2))
+if __name__ == '__main__':
+    main()
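A minimal sketch of the defaulting rules in `normalize_row`, for checking the precedence without the real mapping file. The mapping entries (`reference`, `query`, `studio`, `live`) and the helper names `load_mapping_from_text` and `resolve_defaults` are hypothetical; only the `mappings` list shape, the `type` / `role` / `default_bucket` / `trainable` keys, and the `excluded` / `unknown` fallback are taken from the script above.

```python
import json

# Hypothetical mapping payload; the real business_type_role_mapping.json
# is not shown in this diff.
MAPPING_JSON = '''
{"mappings": [
  {"type": 1, "role": "reference", "default_bucket": "studio", "trainable": true},
  {"type": 7, "role": "query", "default_bucket": "live", "trainable": false}
]}
'''

# Same fallback rule the script uses for types missing from the mapping.
FALLBACK = {'role': 'excluded', 'default_bucket': 'unknown', 'trainable': False}


def load_mapping_from_text(text: str) -> dict:
    """Index mapping entries by integer type, mirroring load_mapping."""
    data = json.loads(text)
    return {int(item['type']): item for item in data['mappings']}


def resolve_defaults(row: dict, mapping: dict) -> dict:
    """Apply the precedence: explicit row value, then mapping rule, then fallback."""
    asset_type = int(row['type'])
    rule = mapping.get(asset_type, FALLBACK)
    return {
        'type': asset_type,
        # An empty string counts as "not provided", so the rule value is used.
        'role': row.get('role') or rule['role'],
        'bucket': row.get('bucket') or rule.get('default_bucket'),
        # trainable is read from the rule only, never from the row.
        'trainable': bool(rule.get('trainable', False)),
    }


mapping = load_mapping_from_text(MAPPING_JSON)
known = resolve_defaults({'type': '1', 'role': ''}, mapping)
override = resolve_defaults({'type': '7', 'role': 'reference'}, mapping)
unknown = resolve_defaults({'type': '99'}, mapping)
print(known, override, unknown, sep='\n')
```

One behavior worth noting when reviewing: a `trainable` column in the export has no effect, because the script derives that flag from the mapping rule alone.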
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
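+## 2026-06-02 Business export normalization script delivery checkpoint
+Completed:
+- Added `acr-engine/scripts/normalize_business_export.py`
+- Advanced the business export cookbook from "sample walkthrough" to "runnable conversion script + sample inputs".
+Conclusions:
+- The next session can convert business CSV/JSONL exports directly into manifest-ready JSONL.
+- The default `type -> role -> bucket` rules are no longer only a documentation convention; an executable script now implements them.
 ## 2026-06-02 Business export cookbook and samples delivery checkpoint
 Completed: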
--- a/docs/business-export-cookbook.md
+++ b/docs/business-export-cookbook.md
@@ -11,6 +11,7 @@
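 2. Fill in `role` / `bucket` using the `type-role mapping`
 3. Write the result to an intermediate CSV or JSONL file
 4. Then convert it into the project manifest
+5. Or convert it directly to manifest-ready JSONL first with the repository script
 The repository already includes the following reference materials: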
 - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
@@ -99,3 +100,24 @@ WHERE a.type IN (1,7,8,9,10,11,16,18,2,12);
 ## Sources
 - See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
 - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
+## 6. Lightweight normalization script
+The repository now includes a conversion script that can be run directly:
+- [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
+Example:
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
+  --input configs/manifests/examples/business_asset_export_example.csv \
+  --output /tmp/business_asset_manifest_ready.jsonl
+```
+This script will:
+1. Read a CSV or JSONL export
+2. Apply `business_type_role_mapping.json`
+3. Fill in default values for `role / bucket / source_dataset / split`
+4. Emit manifest-ready JSONL
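The four steps above can be sketched as an offline smoke check that stays inside a temporary directory. It re-implements a reduced version of the flow rather than calling the repository script; the sample rows, the `RULES` table, and `run_smoke_check` are assumptions for illustration, while the `holdout` and `internal_catalog` defaults match the script's CLI defaults.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows; the repository's example CSV is not reproduced here.
SAMPLE_ROWS = [
    {'song_id': 's1', 'asset_id': 'a1', 'type': '1', 'audio_path': 'audio/a1.wav', 'split': ''},
    {'song_id': 's1', 'asset_id': 'a2', 'type': '7', 'audio_path': 'audio/a2.wav', 'split': 'train'},
    {'song_id': 's2', 'asset_id': 'a3', 'type': '99', 'audio_path': 'audio/a3.wav', 'split': ''},
]
# Hypothetical type rules standing in for business_type_role_mapping.json.
RULES = {
    1: {'role': 'reference', 'default_bucket': 'studio'},
    7: {'role': 'query', 'default_bucket': 'live'},
}
FALLBACK = {'role': 'excluded', 'default_bucket': 'unknown'}


def run_smoke_check(workdir: Path) -> list:
    """Write a tiny CSV export, normalize it, and return the JSONL rows read back."""
    src = workdir / 'export.csv'
    out = workdir / 'manifest_ready.jsonl'

    # Step 1: produce and read the CSV export.
    with src.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(SAMPLE_ROWS[0]))
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    with src.open(newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))

    # Steps 2 and 3: apply the type rules and fill in defaults.
    normalized = []
    for row in rows:
        asset_type = int(row['type'])
        rule = RULES.get(asset_type, FALLBACK)
        normalized.append({
            'song_id': row['song_id'],
            'asset_id': row['asset_id'],
            'type': asset_type,
            'role': row.get('role') or rule['role'],
            'bucket': row.get('bucket') or rule['default_bucket'],
            'split': row.get('split') or 'holdout',
            'source_dataset': row.get('source_dataset') or 'internal_catalog',
            'audio_path': row['audio_path'],
        })

    # Step 4: emit JSONL, then read it back to confirm the row count.
    with out.open('w', encoding='utf-8') as f:
        for item in normalized:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    lines = out.read_text(encoding='utf-8').splitlines()
    return [json.loads(line) for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    result = run_smoke_check(Path(tmp))
print(len(result), [r['split'] for r in result])
```

Because everything is written under `tempfile.TemporaryDirectory()`, the check touches no real export, database, or dataset, which keeps it within the offline-only constraint stated in the commit message.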
--- a/docs/business-manifest-and-type-role-spec.md
+++ b/docs/business-manifest-and-type-role-spec.md
@@ -77,6 +77,8 @@ flowchart LR
  - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
 - Printing script:
  - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
+- Normalization script:
+  - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
 Example command:
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -259,6 +259,7 @@
   - For business-type assets, start with: [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
   - For manifest and role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
   - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
+   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions.
 3. Keep adding cap64 multi-seed runs instead of keeping only a single seed.
 4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.