Commit 3bdc0139 3bdc01393f0d8f7d102847837c04ab67f2916bfe by cnb.bofCdSsphPA

Finish the offline business-export chain by generating project manifests directly from normalized rows

Constraint: Keep this checkpoint offline-only and avoid touching real business data, datasets, or model artifacts
Rejected: Leave final manifest shaping as a manual next-session task | The handoff is stronger when catalog/train/test/val can already be produced automatically
Confidence: high
Scope-risk: narrow
Directive: Treat these generated manifests as integration-stage scaffolds and validate final field policy again before production data ingestion
Tested: Ran build_business_project_manifests.py on normalized sample data and verified catalog/train/test/val structure; rechecked 70 relative links
Not-tested: Did not run the generated manifests through full training/evaluation against live business audio
1 parent b9feaccc
......@@ -60,6 +60,7 @@
- Export cookbook: `docs/business-export-cookbook.md`
- Normalization script: `acr-engine/scripts/normalize_business_export.py`
- Role-split script: `acr-engine/scripts/split_business_manifest_ready.py`
- Project manifest adapter: `acr-engine/scripts/build_business_project_manifests.py`
2. Add the cap64 multi-seed aggregate.
3. Update:
- `docs/open-dataset-workflow.md`
......
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    # One JSON object per non-blank line; read as UTF-8 regardless of locale.
    text = path.read_text(encoding='utf-8')
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def write_json(path: Path, rows: list[dict]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False emits raw non-ASCII text, so write explicitly as UTF-8.
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding='utf-8')


def build_reference(row: dict) -> dict:
    # Catalog entry; a missing or null duration falls back to 0.0.
    return {
        'song_id': row['song_id'],
        'audio_path': row['audio_path'],
        'duration': row.get('duration_sec') or 0.0,
        'type': 'reference',
        'source_dataset': row.get('source_dataset', 'business_music'),
    }


def build_query(row: dict) -> dict:
    # Query entry; defaults are an 8.0 s duration and a 0.0 s offset.
    return {
        'song_id': row['song_id'],
        'audio_path': row['audio_path'],
        'duration': row.get('duration_sec') or 8.0,
        'type': 'clean',
        'offset': row.get('offset_sec') or 0.0,
        'segment_type': 'external_query',
        'source_dataset': row.get('source_dataset', 'business_music'),
    }


def dedupe_refs(rows: list[dict]) -> list[dict]:
    # Keep the first occurrence of each (song_id, audio_path) pair, in order.
    seen = set()
    out = []
    for row in rows:
        key = (row['song_id'], row['audio_path'])
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


def main() -> None:
    parser = argparse.ArgumentParser(description='Build project manifests from business manifest-ready JSONL')
    parser.add_argument('--input', required=True, help='manifest-ready JSONL from normalize_business_export.py')
    parser.add_argument('--output-dir', required=True, help='output manifests dir')
    parser.add_argument('--include-holdout-in-val', action='store_true', help='map holdout queries into val.json')
    args = parser.parse_args()

    rows = load_jsonl(Path(args.input).resolve())
    refs_src = [row for row in rows if row.get('role') == 'reference']
    query_src = [row for row in rows if row.get('role') == 'query']

    refs = dedupe_refs([build_reference(row) for row in refs_src])
    train_queries = [build_query(row) for row in query_src if row.get('split') == 'train']
    test_queries = [build_query(row) for row in query_src if row.get('split') == 'test']
    val_queries = [build_query(row) for row in query_src if row.get('split') == 'val']
    if args.include_holdout_in_val:
        val_queries.extend(build_query(row) for row in query_src if row.get('split') == 'holdout')

    # train/test carry queries followed by references; val carries queries only.
    out_dir = Path(args.output_dir).resolve()
    write_json(out_dir / 'catalog.json', refs)
    write_json(out_dir / 'train.json', train_queries + refs)
    write_json(out_dir / 'test.json', test_queries + refs)
    write_json(out_dir / 'val.json', val_queries)

    summary = {
        'catalog_refs': len(refs),
        'train_queries': len(train_queries),
        'test_queries': len(test_queries),
        'val_queries': len(val_queries),
        'output_dir': str(out_dir),
    }
    print(json.dumps(summary, ensure_ascii=False, indent=2))


if __name__ == '__main__':
    main()
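## 2026-06-02 Project manifest adapter script delivery checkpoint
Completed:
- Added `acr-engine/scripts/build_business_project_manifests.py`
- Added `docs/business-project-manifest-adapter.md`
- Advanced the business-export chain to the point where it can directly generate the project's `catalog/train/test/val` manifests.
Conclusions:
- The next session should largely no longer need to assemble project manifests by hand.
- The offline adapter chain from business export to project manifests is now in place.
## 2026-06-02 manifest-ready role-split script delivery checkpoint
Completed: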
......
......@@ -143,3 +143,10 @@ cd /workspace/acr-engine
- `excluded.json`
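This lets the next session shape business assets into the lists needed for training and evaluation more quickly.
## 8. Generate project manifests
If you already have a manifest-ready JSONL file, you can go straight on to generate the four manifests the project currently needs: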
- [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
- [business-project-manifest-adapter.md](./business-project-manifest-adapter.md)
......
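# Business Project Manifest Adapter: Adapting Business Data to Project Manifests
> Updated: 2026-06-02
> Related docs: [Business Export Cookbook](./business-export-cookbook.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md)
## One-page summary
The repository now has an offline script chain that gets close to the project's training/evaluation manifests:
1. Export the business tables to CSV / JSONL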
2. [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
3. [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
4. [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
The last step directly generates:
- `catalog.json`
- `train.json`
- `test.json`
- `val.json`
The format matches the project's existing manifest structure.
---
## 1. Aligned project format
### `catalog.json`
- Contains reference entries only
- Fields: `song_id / audio_path / duration / type=reference / source_dataset`
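As a minimal sketch of how one manifest-ready row maps onto a `catalog.json` entry, the snippet below mirrors the field mapping in `build_business_project_manifests.py`: `duration_sec` becomes `duration`, and `source_dataset` defaults to `business_music`. The sample row and its values are hypothetical and only illustrate the shape.

```python
def build_reference(row: dict) -> dict:
    """Map one manifest-ready reference row to a catalog entry."""
    return {
        "song_id": row["song_id"],
        "audio_path": row["audio_path"],
        # A missing or null duration falls back to 0.0, as in the adapter script.
        "duration": row.get("duration_sec") or 0.0,
        "type": "reference",
        "source_dataset": row.get("source_dataset", "business_music"),
    }


# Hypothetical sample row; the ID and path are illustrative only.
sample_row = {
    "song_id": "song_0001",
    "audio_path": "audio/song_0001.wav",
    "duration_sec": 215.4,
    "role": "reference",
}

catalog_entry = build_reference(sample_row)
print(catalog_entry)
```

Note that the `role` field is used only for routing and does not appear in the output entry.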
### `train.json` / `test.json`
- Query entries come first
- Reference entries are appended after the queries
- Query fields:
- `song_id`
- `audio_path`
- `duration`
- `type=clean`
- `offset`
- `segment_type=external_query`
- `source_dataset`
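The layout above can be sketched as follows. The query mapping and its defaults (8.0 s duration, 0.0 s offset) follow the adapter script; the sample rows, IDs, and paths are hypothetical.

```python
def build_query(row: dict) -> dict:
    """Map one manifest-ready query row to a query entry."""
    return {
        "song_id": row["song_id"],
        "audio_path": row["audio_path"],
        # Defaults mirror the adapter script: 8.0 s duration, 0.0 s offset.
        "duration": row.get("duration_sec") or 8.0,
        "type": "clean",
        "offset": row.get("offset_sec") or 0.0,
        "segment_type": "external_query",
        "source_dataset": row.get("source_dataset", "business_music"),
    }


# Hypothetical inputs; IDs and paths are illustrative only.
query_rows = [{"song_id": "song_0001", "audio_path": "queries/q_0001.wav"}]
reference_entries = [{
    "song_id": "song_0001",
    "audio_path": "audio/song_0001.wav",
    "duration": 215.4,
    "type": "reference",
    "source_dataset": "business_music",
}]

# train.json / test.json layout: queries first, then the reference entries.
train_manifest = [build_query(row) for row in query_rows] + reference_entries
print([entry["type"] for entry in train_manifest])
```

Because the defaults are applied with `or`, an explicit `duration_sec` of `0` is also replaced by `8.0`; whether that is the desired policy is worth confirming before production ingestion.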
### `val.json`
- By default, contains only queries with `split=val`
- Optionally, `holdout` queries can be merged into `val` with `--include-holdout-in-val`
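A minimal sketch of this routing, assuming rows carry the `role` and `split` fields that the adapter script reads. The helper name `select_val_rows` is hypothetical; the script itself does the same filtering inline and appends holdout queries after the `val` queries.

```python
def select_val_rows(rows: list[dict], include_holdout: bool = False) -> list[dict]:
    """Pick the query rows that would end up in val.json."""
    queries = [row for row in rows if row.get("role") == "query"]
    selected = [row for row in queries if row.get("split") == "val"]
    if include_holdout:
        # Holdout queries are appended after the val queries.
        selected.extend(row for row in queries if row.get("split") == "holdout")
    return selected


# Hypothetical rows; only the routing fields are shown.
rows = [
    {"song_id": "a", "role": "query", "split": "holdout"},
    {"song_id": "b", "role": "query", "split": "val"},
    {"song_id": "c", "role": "query", "split": "train"},
    {"song_id": "d", "role": "reference", "split": "val"},
]

default_ids = [row["song_id"] for row in select_val_rows(rows)]
merged_ids = [row["song_id"] for row in select_val_rows(rows, include_holdout=True)]
print(default_ids, merged_ids)
```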
---
## 2. Example commands
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
--input configs/manifests/examples/business_asset_export_example.csv \
--output /tmp/business_asset_manifest_ready.jsonl
/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
--input /tmp/business_asset_manifest_ready.jsonl \
--output-dir /tmp/business_project_manifests
```
To merge `holdout` into `val.json` for now:
```bash
/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
--input /tmp/business_asset_manifest_ready.jsonl \
--output-dir /tmp/business_project_manifests \
--include-holdout-in-val
```
---
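## 3. Adapter boundaries
This step is not yet the final "real business production integration", but it is already enough for the next session to:
- Run the manifest structure end to end with real business export samples
- Hook into `train.py / evaluate.py / run_demo.py`
- Then make only small fixes targeted at the final field details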
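Before handing generated manifests to downstream scripts, a quick field check may help. The sketch below is one possible approach, not part of the repository: the helper names `find_field_problems` and `check_manifest_file` are hypothetical, and the required-field sets are taken from the field lists in section 1.

```python
import json
from pathlib import Path

# Required fields per entry type, taken from the field lists in section 1.
REFERENCE_FIELDS = {"song_id", "audio_path", "duration", "type", "source_dataset"}
QUERY_FIELDS = REFERENCE_FIELDS | {"offset", "segment_type"}


def find_field_problems(entries: list[dict]) -> list[str]:
    """Return one message per entry that lacks a required field."""
    problems = []
    for index, entry in enumerate(entries):
        required = REFERENCE_FIELDS if entry.get("type") == "reference" else QUERY_FIELDS
        missing = sorted(required - set(entry))
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return problems


def check_manifest_file(path: Path) -> list[str]:
    """Load one generated manifest (a JSON array) and check its entries."""
    return find_field_problems(json.loads(path.read_text(encoding="utf-8")))


# Hypothetical in-memory entries; the second one lacks `offset`.
entries = [
    {"song_id": "a", "audio_path": "a.wav", "duration": 10.0,
     "type": "reference", "source_dataset": "business_music"},
    {"song_id": "a", "audio_path": "q.wav", "duration": 8.0, "type": "clean",
     "segment_type": "external_query", "source_dataset": "business_music"},
]
problems = find_field_problems(entries)
print(problems)
```

For files on disk, `check_manifest_file(Path("/tmp/business_project_manifests/train.json"))` would apply the same check to the output of the example commands in section 2.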
## Sources
- See [business-export-cookbook.md](./business-export-cookbook.md)
- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
......@@ -261,6 +261,7 @@
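- SQL/CSV/JSONL export reference: [business-export-cookbook.md](./business-export-cookbook.md)
- Normalization script: `acr-engine/scripts/normalize_business_export.py`
- Role-split script: `acr-engine/scripts/split_business_manifest_ready.py`
- Project manifest adapter: `acr-engine/scripts/build_business_project_manifests.py`
2. Compare the inconsistencies between cap48 and cap64, and add scale-specific conclusions.
3. Keep adding cap64 multi-seed runs rather than keeping only a single seed.
4. Keep optimizing `hybrid`, focusing on reducing variance and improving hard-case stability.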
......