Complete the business-export chain by splitting manifest-ready rows into role-specific lists

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave role splitting as a manual next-session step | The export chain is more usable when reference/query/excluded lists are produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat the split outputs as staging lists and keep final project-manifest adaptation explicit in the downstream integration step
Tested: Normalized the sample CSV, ran split_business_manifest_ready.py, verified 1 reference + 1 query + 1 excluded row, and rechecked 73 relative links
Not-tested: Did not run against a live business export or feed the split outputs into the full training pipeline
Showing 6 changed files with 87 additions and 0 deletions
| ... | @@ -59,6 +59,7 @@ | ... | @@ -59,6 +59,7 @@ |
| 59 | - Manifest spec: `docs/business-manifest-and-type-role-spec.md` | 59 | - Manifest spec: `docs/business-manifest-and-type-role-spec.md` |
| 60 | - Export cookbook: `docs/business-export-cookbook.md` | 60 | - Export cookbook: `docs/business-export-cookbook.md` |
| 61 | - Normalization script: `acr-engine/scripts/normalize_business_export.py` | 61 | - Normalization script: `acr-engine/scripts/normalize_business_export.py` |
| 62 | - Role-split script: `acr-engine/scripts/split_business_manifest_ready.py` | ||
| 62 | 2. Add the cap64 multi-seed aggregate. | 63 | 2. Add the cap64 multi-seed aggregate. |
| 63 | 3. Update: | 64 | 3. Update: |
| 64 | - `docs/open-dataset-workflow.md` | 65 | - `docs/open-dataset-workflow.md` | ... | ... |
| 1 | #!/usr/bin/env python3 | ||
| 2 | from __future__ import annotations | ||
| 3 | |||
| 4 | import argparse | ||
| 5 | import json | ||
| 6 | from collections import Counter | ||
| 7 | from pathlib import Path | ||
| 8 | |||
| 9 | |||
| 10 | def load_jsonl(path: Path) -> list[dict]: | ||
| 11 |     return [json.loads(line) for line in path.read_text().splitlines() if line.strip()] | ||
| 12 | |||
| 13 | |||
| 14 | def write_json(path: Path, rows: list[dict]) -> None: | ||
| 15 |     path.parent.mkdir(parents=True, exist_ok=True) | ||
| 16 |     path.write_text(json.dumps(rows, ensure_ascii=False, indent=2)) | ||
| 17 | |||
| 18 | |||
| 19 | def main() -> None: | ||
| 20 |     parser = argparse.ArgumentParser(description='Split manifest-ready business JSONL into reference/query/excluded JSON files') | ||
| 21 |     parser.add_argument('--input', required=True) | ||
| 22 |     parser.add_argument('--output-dir', required=True) | ||
| 23 |     args = parser.parse_args() | ||
| 24 | |||
| 25 |     input_path = Path(args.input).resolve() | ||
| 26 |     output_dir = Path(args.output_dir).resolve() | ||
| 27 |     rows = load_jsonl(input_path) | ||
| 28 | |||
| 29 |     grouped = {'reference': [], 'query': [], 'excluded': []} | ||
| 30 |     for row in rows: | ||
| 31 |         role = row.get('role', 'excluded') | ||
| 32 |         grouped.setdefault(role, []).append(row) | ||
| 33 | |||
| 34 |     write_json(output_dir / 'reference.json', grouped.get('reference', [])) | ||
| 35 |     write_json(output_dir / 'query.json', grouped.get('query', [])) | ||
| 36 |     write_json(output_dir / 'excluded.json', grouped.get('excluded', [])) | ||
| 37 | |||
| 38 |     summary = { | ||
| 39 |         'input_rows': len(rows), | ||
| 40 |         'role_counts': dict(Counter(row.get('role', 'excluded') for row in rows)), | ||
| 41 |         'outputs': { | ||
| 42 |             'reference': str((output_dir / 'reference.json').resolve()), | ||
| 43 |             'query': str((output_dir / 'query.json').resolve()), | ||
| 44 |             'excluded': str((output_dir / 'excluded.json').resolve()), | ||
| 45 |         }, | ||
| 46 |     } | ||
| 47 |     print(json.dumps(summary, ensure_ascii=False, indent=2)) | ||
| 48 | |||
| 49 | |||
| 50 | if __name__ == '__main__': | ||
| 51 |     main() |
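One detail of the grouping step is worth noting: a row whose `role` is present but not one of the three expected values is counted in `role_counts` yet written to none of the output files. A minimal sketch of a stricter grouping follows, assuming such rows (and rows with a missing or null role) should fall back to `excluded`. That fallback policy and the names `KNOWN_ROLES` and `split_by_role` are assumptions for illustration, not part of this commit.

```python
from __future__ import annotations

# Assumption: these are the only roles the downstream steps expect.
KNOWN_ROLES = ("reference", "query", "excluded")


def split_by_role(rows: list[dict]) -> dict[str, list[dict]]:
    """Group rows by role; anything outside KNOWN_ROLES goes to 'excluded'."""
    grouped: dict[str, list[dict]] = {role: [] for role in KNOWN_ROLES}
    for row in rows:
        role = row.get("role")
        # Assumption: a missing, null, or unrecognized role is treated as excluded.
        if role not in KNOWN_ROLES:
            role = "excluded"
        grouped[role].append(row)
    return grouped


# Hypothetical sample rows; the "id" field is illustrative only.
sample = [
    {"id": "a", "role": "reference"},
    {"id": "b", "role": "query"},
    {"id": "c"},
    {"id": "d", "role": "unknown-role"},
]
grouped = split_by_role(sample)
counts = {role: len(items) for role, items in grouped.items()}
print(counts)
```

With this variant the three output lists always add up to the number of input rows, which makes the printed summary easier to trust.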
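| 1 | ## 2026-06-02 checkpoint: manifest-ready role-split script delivered | ||
| 2 | |||
| 3 | Completed: | ||
| 4 | - Added `acr-engine/scripts/split_business_manifest_ready.py` | ||
| 5 | - The normalized business output is now carried one step further into three JSON lists: `reference/query/excluded`. | ||
| 6 | |||
| 7 | Conclusions: | ||
| 8 | - For the next session, business export through role splitting now forms one continuous script chain. | ||
| 9 | - Only the final project-manifest adaptation remains; roles no longer need to be split by hand. | ||
| 10 | |||
| 1 | ## 2026-06-02 checkpoint: business-export normalization script delivered | 11 | ## 2026-06-02 checkpoint: business-export normalization script delivered |
| 2 | 12 | ||
| 3 | Completed: | 13 | Completed: | ... | ... |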
| ... | @@ -121,3 +121,25 @@ cd /workspace/acr-engine | ... | @@ -121,3 +121,25 @@ cd /workspace/acr-engine |
| 121 | 2. Apply `business_type_role_mapping.json` | 121 | 2. Apply `business_type_role_mapping.json` |
| 122 | 3. Auto-fill default values for `role / bucket / source_dataset / split` | 122 | 3. Auto-fill default values for `role / bucket / source_dataset / split` |
| 123 | 4. Output manifest-ready JSONL | 123 | 4. Output manifest-ready JSONL |
| 124 | |||
| 125 | |||
| 126 | ## 7. Split into role lists | ||
| 127 | |||
| 128 | If you already have the manifest-ready JSONL, you can continue with: | ||
| 129 | - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py) | ||
| 130 | |||
| 131 | Example: | ||
| 132 | |||
| 133 | ```bash | ||
| 134 | cd /workspace/acr-engine | ||
| 135 | /usr/local/miniconda3/bin/python scripts/split_business_manifest_ready.py \ | ||
| 136 | --input /tmp/business_asset_manifest_ready.jsonl \ | ||
| 137 | --output-dir /tmp/business_asset_manifest_split | ||
| 138 | ``` | ||
| 139 | |||
| 140 | It outputs: | ||
| 141 | - `reference.json` | ||
| 142 | - `query.json` | ||
| 143 | - `excluded.json` | ||
| 144 | |||
| 145 | This lets the next session shape business assets into the lists needed for training and evaluation more quickly. | ... | ... |
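A quick consistency check on the three output files can catch rows that went missing during the split. The sketch below compares the number of non-empty lines in the input JSONL with the combined length of `reference.json`, `query.json`, and `excluded.json`. The helper name `check_split` is hypothetical, and the demo runs on throwaway files rather than the `/tmp` paths from the example command.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def check_split(input_jsonl: Path, output_dir: Path) -> dict[str, int]:
    """Return per-role row counts; raise if they do not add up to the input."""
    text = input_jsonl.read_text(encoding="utf-8")
    input_rows = sum(1 for line in text.splitlines() if line.strip())
    counts: dict[str, int] = {}
    for role in ("reference", "query", "excluded"):
        rows = json.loads((output_dir / f"{role}.json").read_text(encoding="utf-8"))
        counts[role] = len(rows)
    if sum(counts.values()) != input_rows:
        raise ValueError(f"row mismatch: input={input_rows}, split={counts}")
    return counts


# Demo on temporary files; real use would pass the --input and --output-dir paths.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    src = tmp_path / "ready.jsonl"
    src.write_text(
        '{"id": 1, "role": "reference"}\n{"id": 2, "role": "query"}\n',
        encoding="utf-8",
    )
    (tmp_path / "reference.json").write_text('[{"id": 1, "role": "reference"}]', encoding="utf-8")
    (tmp_path / "query.json").write_text('[{"id": 2, "role": "query"}]', encoding="utf-8")
    (tmp_path / "excluded.json").write_text("[]", encoding="utf-8")
    result = check_split(src, tmp_path)
print(result)
```

A mismatch raises `ValueError`, which is a cheap signal that some rows carried a role outside the three expected values.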
| ... | @@ -79,6 +79,8 @@ flowchart LR | ... | @@ -79,6 +79,8 @@ flowchart LR |
| 79 | - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py) | 79 | - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py) |
| 80 | - Normalization script: | 80 | - Normalization script: |
| 81 | - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py) | 81 | - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py) |
| 82 | - Role-split script: | ||
| 83 | - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py) | ||
| 82 | 84 | ||
| 83 | Example command: | 85 | Example command: |
| 84 | 86 | ... | ... |
| ... | @@ -260,6 +260,7 @@ | ... | @@ -260,6 +260,7 @@ |
| 260 | - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) | 260 | - For manifest/role mapping, see: [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) |
| 261 | - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md) | 261 | - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md) |
| 262 | - Normalization script: `acr-engine/scripts/normalize_business_export.py` | 262 | - Normalization script: `acr-engine/scripts/normalize_business_export.py` |
| 263 | - Role-split script: `acr-engine/scripts/split_business_manifest_ready.py` | ||
| 263 | 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions. | 264 | 2. Compare the inconsistencies between cap48 and cap64, and add per-scale conclusions. |
| 264 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. | 265 | 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed. |
| 265 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | 266 | 4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability. | ... | ... |