Commit b9feaccc b9feaccce273c63d5e1e694ec781e0d26493b1c4 by cnb.bofCdSsphPA

Complete the business-export chain by splitting manifest-ready rows into role-specific lists

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline
1 parent b5981c79
......@@ -59,6 +59,7 @@
   - Manifest spec: `docs/business-manifest-and-type-role-spec.md`
   - Export cookbook: `docs/business-export-cookbook.md`
   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
2. Add the cap64 multi-seed aggregate.
3. Update:
   - `docs/open-dataset-workflow.md`
......
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    # One JSON object per non-blank line.
    text = path.read_text(encoding='utf-8')
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_json(path: Path, rows: list[dict]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False emits non-ASCII text as-is, so write UTF-8 explicitly.
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding='utf-8')


def main() -> None:
    parser = argparse.ArgumentParser(
        description='Split manifest-ready business JSONL into reference/query/excluded JSON files'
    )
    parser.add_argument('--input', required=True)
    parser.add_argument('--output-dir', required=True)
    args = parser.parse_args()

    input_path = Path(args.input).resolve()
    output_dir = Path(args.output_dir).resolve()
    rows = load_jsonl(input_path)

    # Rows without a 'role' key are treated as excluded.
    grouped = {'reference': [], 'query': [], 'excluded': []}
    for row in rows:
        role = row.get('role', 'excluded')
        grouped.setdefault(role, []).append(row)

    # Only these three roles are written. A row with any other role value is
    # still counted in role_counts below but does not land in an output file.
    write_json(output_dir / 'reference.json', grouped.get('reference', []))
    write_json(output_dir / 'query.json', grouped.get('query', []))
    write_json(output_dir / 'excluded.json', grouped.get('excluded', []))

    summary = {
        'input_rows': len(rows),
        'role_counts': dict(Counter(row.get('role', 'excluded') for row in rows)),
        'outputs': {
            'reference': str((output_dir / 'reference.json').resolve()),
            'query': str((output_dir / 'query.json').resolve()),
            'excluded': str((output_dir / 'excluded.json').resolve()),
        },
    }
    print(json.dumps(summary, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
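## 2026-06-02 checkpoint: manifest-ready role-splitting script delivered
Completed:
- Added `acr-engine/scripts/split_business_manifest_ready.py`
- The normalized business output is now carried one step further into three JSON lists: `reference/query/excluded`.
Conclusions:
- For the next session, everything from business export through role splitting now forms one continuous script chain.
- The only remaining work is the final project-manifest adaptation; roles no longer need to be split by hand.
## 2026-06-02 checkpoint: business-export normalization script delivered
Completed: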
......
......@@ -121,3 +121,25 @@ cd /workspace/acr-engine
2. Apply `business_type_role_mapping.json`
3. Automatically fill in default values for `role / bucket / source_dataset / split`
4. Output manifest-ready JSONL
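
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the code in `normalize_business_export.py`: the mapping shape (business type to role), the `type` field name, the default values, and both helper names are assumptions made for the example.

```python
from __future__ import annotations

import json

# Hypothetical defaults; the real values are defined by the normalization script.
ASSUMED_DEFAULTS = {
    "role": "excluded",
    "bucket": "unknown",
    "source_dataset": "business",
    "split": "unassigned",
}


def to_manifest_ready(row: dict, type_to_role: dict[str, str]) -> dict:
    """Apply an assumed type -> role mapping, then fill any missing fields."""
    out = dict(row)
    # Assumption: the business export carries its asset type in a "type" field.
    mapped_role = type_to_role.get(out.get("type", ""))
    if mapped_role is not None:
        out["role"] = mapped_role
    # setdefault only fills keys that are absent; existing values are kept.
    for key, value in ASSUMED_DEFAULTS.items():
        out.setdefault(key, value)
    return out


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
```

In this sketch a row whose type is missing from the mapping falls back to the `excluded` default, which matches how the splitting script treats rows without a `role`.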
## 7. Split into role lists
If you already have the manifest-ready JSONL, you can continue with:
- [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
Example:
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/split_business_manifest_ready.py \
  --input /tmp/business_asset_manifest_ready.jsonl \
  --output-dir /tmp/business_asset_manifest_split
```
It outputs:
- `reference.json`
- `query.json`
- `excluded.json`
This lets the next session more quickly shape business assets into the lists needed for training and evaluation.
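
To sanity-check a split, the three files can be read back and counted. The sketch below relies only on what the script shows, namely that each file is a JSON array of row objects; the helper name `count_split_outputs` is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path

# File stems written by split_business_manifest_ready.py.
ROLE_FILES = ("reference", "query", "excluded")


def count_split_outputs(output_dir: Path) -> dict[str, int]:
    """Return the number of rows in each role file under output_dir."""
    counts = {}
    for role in ROLE_FILES:
        path = output_dir / f"{role}.json"
        rows = json.loads(path.read_text(encoding="utf-8"))
        counts[role] = len(rows)
    return counts
```

For the example command, `count_split_outputs(Path("/tmp/business_asset_manifest_split"))` returns one count per role. If the three counts add up to less than `input_rows` in the printed summary, some rows carried a `role` other than the three names above, since the script does not write those rows to any file.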
......
......@@ -79,6 +79,8 @@ flowchart LR
  - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
- Normalization script:
  - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
- Role-splitting script:
  - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
Example commands:
......
......@@ -260,6 +260,7 @@
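   - For manifest and role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
   - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions.
3. Keep adding cap64 multi-seed runs rather than retaining only a single seed.
4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.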
......