Finish the offline business-export chain by generating project manifests directly from normalized rows

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave final manifest shaping as a manual next-session task | The handoff is stronger when catalog/train/test/val can already be produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat these generated manifests as integration-stage scaffolds and validate final field policy again before production data ingestion
Tested: Ran build_business_project_manifests.py on normalized sample data and verified catalog/train/test/val structure; rechecked 70 relative links
Not-tested: Did not run the generated manifests through full training/evaluation against live business audio

Commit 3bdc0139 ... 3bdc01393f0d8f7d102847837c04ab67f2916bfe authored 2026-06-02 18:59:32 +0800 by cnb.bofCdSsphPA
Showing 6 changed files with 189 additions and 0 deletions
AGENT.md
acr-engine/scripts/build_business_project_manifests.py
docs/CHANGELOG.md
docs/business-export-cookbook.md
docs/business-project-manifest-adapter.md
docs/session-handoff.md
--- a/AGENT.md
+++ b/AGENT.md
@@ -60,6 +60,7 @@
   - Export cookbook: `docs/business-export-cookbook.md`
   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
+   - Project manifest adapter: `acr-engine/scripts/build_business_project_manifests.py`
 2. Add the cap64 multi-seed aggregate.
 3. Update:
   - `docs/open-dataset-workflow.md`
--- a/acr-engine/scripts/build_business_project_manifests.py 0 → 100755
+++ b/acr-engine/scripts/build_business_project_manifests.py 0 → 100755
+#!/usr/bin/env python3
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+
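+def load_jsonl(path: Path) -> list[dict]:
+    # Read as UTF-8 explicitly so non-ASCII metadata does not depend on the locale.
+    text = path.read_text(encoding='utf-8')
+    return [json.loads(line) for line in text.splitlines() if line.strip()]
+
+
+def write_json(path: Path, rows: list[dict]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    # ensure_ascii=False emits raw non-ASCII characters, so write UTF-8 explicitly.
+    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding='utf-8')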
+
+
+def build_reference(row: dict) -> dict:
+    return {
+        'song_id': row['song_id'],
+        'audio_path': row['audio_path'],
+        'duration': row.get('duration_sec') or 0.0,
+        'type': 'reference',
+        'source_dataset': row.get('source_dataset', 'business_music'),
+    }
+
+
+def build_query(row: dict) -> dict:
+    return {
+        'song_id': row['song_id'],
+        'audio_path': row['audio_path'],
+        'duration': row.get('duration_sec') or 8.0,
+        'type': 'clean',
+        'offset': row.get('offset_sec') or 0.0,
+        'segment_type': 'external_query',
+        'source_dataset': row.get('source_dataset', 'business_music'),
+    }
+
+
+def dedupe_refs(rows: list[dict]) -> list[dict]:
+    seen = set()
+    out = []
+    for row in rows:
+        key = (row['song_id'], row['audio_path'])
+        if key in seen:
+            continue
+        seen.add(key)
+        out.append(row)
+    return out
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description='Build project manifests from business manifest-ready JSONL')
+    parser.add_argument('--input', required=True, help='manifest-ready JSONL from normalize_business_export.py')
+    parser.add_argument('--output-dir', required=True, help='output manifests dir')
+    parser.add_argument('--include-holdout-in-val', action='store_true', help='map holdout queries into val.json')
+    args = parser.parse_args()
+
+    rows = load_jsonl(Path(args.input).resolve())
+    refs_src = [row for row in rows if row.get('role') == 'reference']
+    query_src = [row for row in rows if row.get('role') == 'query']
+
+    refs = dedupe_refs([build_reference(row) for row in refs_src])
+    # Queries whose split is not train/test/val are skipped (holdout only with the flag).
+    train_queries = [build_query(row) for row in query_src if row.get('split') == 'train']
+    test_queries = [build_query(row) for row in query_src if row.get('split') == 'test']
+    val_queries = [build_query(row) for row in query_src if row.get('split') == 'val']
+    if args.include_holdout_in_val:
+        val_queries.extend(build_query(row) for row in query_src if row.get('split') == 'holdout')
+
+    out_dir = Path(args.output_dir).resolve()
+    write_json(out_dir / 'catalog.json', refs)
+    # train.json and test.json list the queries first, then the full reference set.
+    write_json(out_dir / 'train.json', train_queries + refs)
+    write_json(out_dir / 'test.json', test_queries + refs)
+    # val.json carries queries only.
+    write_json(out_dir / 'val.json', val_queries)
+
+    summary = {
+        'catalog_refs': len(refs),
+        'train_queries': len(train_queries),
+        'test_queries': len(test_queries),
+        'val_queries': len(val_queries),
+        'output_dir': str(out_dir),
+    }
+    print(json.dumps(summary, ensure_ascii=False, indent=2))
+
+
+if __name__ == '__main__':
+    main()
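Reviewer note: the split and dedupe behaviour of the script above can be checked without touching any files. The following is a minimal sketch, not part of the commit. It assumes the field names used by the script (`role`, `split`, `song_id`, `audio_path`); the sample rows and the helper name `summarize` are hypothetical.

```python
import json

# Hypothetical manifest-ready rows. Field names follow the script above;
# the values are made up for illustration only.
SAMPLE_JSONL = """
{"song_id": "s1", "audio_path": "refs/s1.wav", "role": "reference", "duration_sec": 201.5}
{"song_id": "s1", "audio_path": "refs/s1.wav", "role": "reference", "duration_sec": 201.5}
{"song_id": "s1", "audio_path": "queries/s1_q1.wav", "role": "query", "split": "train", "offset_sec": 12.0}
{"song_id": "s1", "audio_path": "queries/s1_q2.wav", "role": "query", "split": "holdout"}
"""


def summarize(jsonl_text: str, include_holdout_in_val: bool = False) -> dict:
    """Count what would land in each manifest, mirroring the script's filters."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # References are deduplicated on (song_id, audio_path), as in dedupe_refs.
    refs = {(r["song_id"], r["audio_path"]) for r in rows if r.get("role") == "reference"}
    queries = [r for r in rows if r.get("role") == "query"]
    val_splits = {"val", "holdout"} if include_holdout_in_val else {"val"}
    return {
        "catalog_refs": len(refs),
        "train_queries": sum(r.get("split") == "train" for r in queries),
        "test_queries": sum(r.get("split") == "test" for r in queries),
        "val_queries": sum(r.get("split") in val_splits for r in queries),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_JSONL))
    print(summarize(SAMPLE_JSONL, include_holdout_in_val=True))
```

With these sample rows the duplicate reference collapses to one catalog entry, and the `holdout` query is counted only when the flag is set. Without the flag it appears in none of the four manifests.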
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
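+## 2026-06-02 Project manifest adapter script delivery checkpoint
+
+Completed:
+- Added `acr-engine/scripts/build_business_project_manifests.py`
+- Added `docs/business-project-manifest-adapter.md`
+- Advanced the business-export chain to the point where the project's `catalog/train/test/val` manifests can be generated directly.
+
+Conclusions:
+- The next session should mostly no longer need to assemble project manifests by hand.
+- The offline adapter chain from business export to project manifests is now in place.
+
 ## 2026-06-02 manifest-ready role-splitting script delivery checkpoint

 Completed: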
--- a/docs/business-export-cookbook.md
+++ b/docs/business-export-cookbook.md
@@ -143,3 +143,10 @@ cd /workspace/acr-engine
 - `excluded.json`

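 This lets the next session shape business assets into the lists needed for training/evaluation more quickly.
+
+
+## 8. Generate project manifests
+
+If you already have manifest-ready JSONL, you can go straight on to generate the four manifests the project currently needs: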
+- [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
+- [business-project-manifest-adapter.md](./business-project-manifest-adapter.md)
--- a/docs/business-project-manifest-adapter.md 0 → 100644
+++ b/docs/business-project-manifest-adapter.md 0 → 100644
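+# Business Project Manifest Adapter: Adapting Business Data to Project Manifests
+
+> Updated: 2026-06-02  
+> Related docs: [Business Export Cookbook](./business-export-cookbook.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md)
+
+## One-page summary
+
+The repository now has an offline script chain that gets close to the project's training/evaluation manifests:
+
+1. Export the business database tables to CSV / JSONL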
+2. [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
+3. [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
+4. [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
+
+The last step directly generates:
+- `catalog.json`
+- `train.json`
+- `test.json`
+- `val.json`
+
+The format matches the project's existing manifest structure.
+
+---
+
+## 1. Aligned project format
+
+### `catalog.json`
+- Contains references only
+- Fields: `song_id / audio_path / duration / type=reference / source_dataset`
+
+### `train.json` / `test.json`
+- Query entries come first
+- The deduplicated reference entries are appended after them
+- Query fields:
+  - `song_id`
+  - `audio_path`
+  - `duration`
+  - `type=clean`
+  - `offset`
+  - `segment_type=external_query`
+  - `source_dataset`
+
+### `val.json`
+- By default contains only queries with `split=val`
+- Optionally merges `holdout` queries into `val` (`--include-holdout-in-val`)
+
+---
+
+## 2. Example commands
+
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
+  --input configs/manifests/examples/business_asset_export_example.csv \
+  --output /tmp/business_asset_manifest_ready.jsonl
+
+/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
+  --input /tmp/business_asset_manifest_ready.jsonl \
+  --output-dir /tmp/business_project_manifests
+```
+
+If you want to merge `holdout` into `val.json` for now:
+
+```bash
+/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
+  --input /tmp/business_asset_manifest_ready.jsonl \
+  --output-dir /tmp/business_project_manifests \
+  --include-holdout-in-val
+```
+
+---
+
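+## 3. Scope of the adapter
+
+This step is not yet the final "real production business integration", but it is enough for the next session to:
+- Run real business export samples through the manifest structure end to end
+- Hook into `train.py / evaluate.py / run_demo.py`
+- Then make only small fixes to the final field details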
+
+## Sources
+- See [business-export-cookbook.md](./business-export-cookbook.md)
+- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
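Reviewer note: the commit message asks that the generated manifests be validated again before production ingestion. The sketch below shows one way to check the layout described in the adapter doc (catalog holds references only, `train.json` / `test.json` hold queries followed by the reference block, `val.json` holds queries only). It is not part of the commit, the function name `check_manifests` is hypothetical, and it checks structure only, not field policy.

```python
def check_manifests(catalog: list, train: list, test: list, val: list) -> list:
    """Return a list of human-readable problems; an empty list means the layout matches."""
    problems = []
    if any(entry.get("type") != "reference" for entry in catalog):
        problems.append("catalog.json contains non-reference entries")
    for name, entries in (("train.json", train), ("test.json", test)):
        types = [entry.get("type") for entry in entries]
        # Index where the trailing reference block is expected to start.
        first_ref = types.index("reference") if "reference" in types else len(types)
        if any(t != "reference" for t in types[first_ref:]):
            problems.append(f"{name}: query found after the reference block")
        if entries[first_ref:] != catalog:
            problems.append(f"{name}: reference block differs from catalog.json")
    if any(entry.get("type") == "reference" for entry in val):
        problems.append("val.json contains reference entries")
    return problems


if __name__ == "__main__":
    # Hypothetical in-memory manifests; real use would json.load the four files.
    catalog = [{"song_id": "s1", "audio_path": "refs/s1.wav", "type": "reference"}]
    query = {"song_id": "s1", "audio_path": "queries/s1_q1.wav", "type": "clean"}
    print(check_manifests(catalog, [query] + catalog, list(catalog), [query]))
```

Returning a list of problems rather than raising on the first one makes it easier to report every mismatch in a single pass over the output directory.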
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -261,6 +261,7 @@
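   - SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
   - Normalization script: `acr-engine/scripts/normalize_business_export.py`
   - Role-splitting script: `acr-engine/scripts/split_business_manifest_ready.py`
+   - Project manifest adapter: `acr-engine/scripts/build_business_project_manifests.py`
 2. Compare the inconsistencies between cap48 and cap64 and add per-scale conclusions.
 3. Keep adding cap64 multi-seed runs rather than keeping only a single seed.
 4. Keep optimizing `hybrid`, focusing on reducing variance and improving stability on hard cases.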