Commit b9feaccc b9feaccce273c63d5e1e694ec781e0d26493b1c4 by cnb.bofCdSsphPA

Complete the business-export chain by splitting manifest-ready rows into role-specific lists

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline
1 parent b5981c79
@@ -59,6 +59,7 @@
    - Manifest spec: `docs/business-manifest-and-type-role-spec.md`
    - Export cookbook: `docs/business-export-cookbook.md`
    - Normalization script: `acr-engine/scripts/normalize_business_export.py`
+   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
 2. Add the cap64 multi-seed aggregate.
 3. Update:
    - `docs/open-dataset-workflow.md`
......
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    # One JSON object per non-blank line. UTF-8 is set explicitly so the
    # result does not depend on the platform's default encoding.
    text = path.read_text(encoding='utf-8')
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_json(path: Path, rows: list[dict]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding='utf-8')


def main() -> None:
    parser = argparse.ArgumentParser(description='Split manifest-ready business JSONL into reference/query/excluded JSON files')
    parser.add_argument('--input', required=True)
    parser.add_argument('--output-dir', required=True)
    args = parser.parse_args()

    input_path = Path(args.input).resolve()
    output_dir = Path(args.output_dir).resolve()
    rows = load_jsonl(input_path)

    # Rows without a 'role' key fall back to 'excluded'. Rows whose role is
    # not one of the three keys below are grouped under their own key; they
    # are counted in role_counts but not written to any output file.
    grouped = {'reference': [], 'query': [], 'excluded': []}
    for row in rows:
        role = row.get('role', 'excluded')
        grouped.setdefault(role, []).append(row)

    write_json(output_dir / 'reference.json', grouped.get('reference', []))
    write_json(output_dir / 'query.json', grouped.get('query', []))
    write_json(output_dir / 'excluded.json', grouped.get('excluded', []))

    # Print a machine-readable summary of what was read and where it went.
    summary = {
        'input_rows': len(rows),
        'role_counts': dict(Counter(row.get('role', 'excluded') for row in rows)),
        'outputs': {
            'reference': str((output_dir / 'reference.json').resolve()),
            'query': str((output_dir / 'query.json').resolve()),
            'excluded': str((output_dir / 'excluded.json').resolve()),
        },
    }
    print(json.dumps(summary, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
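
The grouping behavior in `main()` can be exercised without touching the filesystem. The following is a minimal sketch and not part of the commit: `group_by_role` is a hypothetical helper that mirrors the loop in the script, and the sample rows (including the `asset_id` field and the `holdout` role) are invented for illustration.

```python
KNOWN_ROLES = ("reference", "query", "excluded")


def group_by_role(rows: list[dict]) -> dict[str, list[dict]]:
    """Mirror the script's loop: a missing role falls back to 'excluded'."""
    grouped: dict[str, list[dict]] = {role: [] for role in KNOWN_ROLES}
    for row in rows:
        grouped.setdefault(row.get("role", "excluded"), []).append(row)
    return grouped


# Hypothetical sample rows; every field name other than 'role' is an assumption.
rows = [
    {"asset_id": "a1", "role": "reference"},
    {"asset_id": "a2", "role": "query"},
    {"asset_id": "a3"},                     # no role, so it falls back to 'excluded'
    {"asset_id": "a4", "role": "holdout"},  # a role the script does not write out
]

grouped = group_by_role(rows)

# Rows that would land in one of the three output files.
written = sum(len(grouped[role]) for role in KNOWN_ROLES)

# Rows grouped under roles that have no output file.
unwritten = {
    role: len(items) for role, items in grouped.items() if role not in KNOWN_ROLES
}

print(written)
print(unwritten)
```

Under these assumptions, comparing `input_rows` from the printed summary against the combined length of the three lists is a cheap way to notice rows carrying an unexpected role.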
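+## 2026-06-02 manifest-ready role-splitting script delivery checkpoint
+
+Completed:
+- Added `acr-engine/scripts/split_business_manifest_ready.py`
+- Extended the normalized business output into three JSON lists: `reference`, `query`, and `excluded`.
+
+Conclusions:
+- For the next session, the path from business export to role splitting is now a continuous script chain.
+- Only the final project-manifest adaptation remains to be added; roles no longer need to be split by hand.
+
 ## 2026-06-02 business-export normalization script delivery checkpoint

 Completed: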
......
@@ -121,3 +121,25 @@ cd /workspace/acr-engine
 2. Apply `business_type_role_mapping.json`
 3. Automatically fill in default values for `role / bucket / source_dataset / split`
 4. Output manifest-ready JSONL
+
+
+## 7. Split into role lists
+
+If you already have the manifest-ready JSONL, you can continue with:
+- [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
+
+Example:
+
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/split_business_manifest_ready.py \
+  --input /tmp/business_asset_manifest_ready.jsonl \
+  --output-dir /tmp/business_asset_manifest_split
+```
+
+It outputs:
+- `reference.json`
+- `query.json`
+- `excluded.json`
+
+This lets the next session more quickly shape business assets into the lists needed for training and evaluation.
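
A downstream step would typically read the three files back before adapting them to the final project manifest. The following is a minimal sketch under stated assumptions: `load_split` is a hypothetical helper, the temporary directory stands in for a real `--output-dir`, and the sample rows (with an assumed `asset_id` field) are invented. Only the three file names come from the section above.

```python
import json
import tempfile
from pathlib import Path

ROLES = ("reference", "query", "excluded")


def load_split(output_dir: Path) -> dict[str, list[dict]]:
    """Load the three role lists from a split output directory."""
    return {
        role: json.loads((output_dir / f"{role}.json").read_text(encoding="utf-8"))
        for role in ROLES
    }


# Stand-in for a real --output-dir: write tiny hypothetical lists to a temp dir.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    samples = {
        "reference": [{"asset_id": "a1", "role": "reference"}],
        "query": [{"asset_id": "a2", "role": "query"}],
        "excluded": [],
    }
    for role, items in samples.items():
        (out / f"{role}.json").write_text(
            json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    split = load_split(out)
    counts = {role: len(items) for role, items in split.items()}

print(counts)
```

Treating the loaded lists as staging data, as the commit's directive suggests, keeps the final manifest adaptation as an explicit separate step.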
......
@@ -79,6 +79,8 @@ flowchart LR
   - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
 - Normalization script:
   - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
+- Role-splitting script:
+  - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)

 Example command:

......
@@ -260,6 +260,7 @@
    - For the manifest and role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
    - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
    - Normalization script: `acr-engine/scripts/normalize_business_export.py`
+   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions.
 3. Continue adding cap64 multi-seed runs rather than keeping only a single seed.
 4. Continue optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.
......