Commit af33be35 af33be35c971b2b007897a658418f2437b19609c by cnb.bofCdSsphPA

Condense docs and add manifest validation before training

Constraint: Readers need fewer entry documents and clickable relative links before scaling open-dataset usage
Rejected: Keep expanding flat documentation pages | Increases navigation cost and hides the main execution path
Confidence: high
Scope-risk: moderate
Directive: Route future dataset operations through inspect-local/inspect-batch/prepare-local/validate-local and keep docs grouped by role
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/manifest_tools.py src/data/external_adapters.py; /usr/local/miniconda3/bin/python src/data/manifest_tools.py validate-splits data/external_ingested/demo_via_adapter/fma/manifests; /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/demo_via_adapter/fma/manifests; python3 targeted-doc-link scan over docs/README.md docs/dataset-spec.md docs/dataset-sources-and-licensing.md docs/industrialization-roadmap.md docs/service-api.md docs/industrial-benchmark-spec.md acr-engine/data/external_ingested/README.md
Not-tested: Real browser/rendered markdown click-through behavior across every client
1 parent d75fbf81
......@@ -10,8 +10,8 @@ Convert local open-music audio folders into ACR-ready manifests for:
### 1. Prepare a local audio directory
Examples:
- `data/raw/fma_small_audio/`
- `data/raw/mtg_jamendo_audio/`
- [data/raw/fma_small_audio/](../raw/fma_small_audio/)
- [data/raw/mtg_jamendo_audio/](../raw/mtg_jamendo_audio/)
### 2. Generate manifests through the adapter entrypoint
Optional pre-check:
......@@ -37,12 +37,12 @@ or
### 3. Use outputs
Generated files:
- `catalog.json`: reference tracks for indexing
- `train.json`: train queries + references
- `test.json`: held-out eval queries + references
- `val.json`: optional validation split
- [catalog.json](./demo_via_adapter/fma/manifests/catalog.json): reference tracks for indexing
- [train.json](./demo_via_adapter/fma/manifests/train.json): train queries + references
- [test.json](./demo_via_adapter/fma/manifests/test.json): held-out eval queries + references
- [val.json](./demo_via_adapter/fma/manifests/val.json): optional validation split
## Notes
- Small datasets are automatically protected so both train/test query sets exist.
- For personal use, FMA and MTG-Jamendo should be the first real baselines.
- Keep `test.json` fixed across experiments to compare models fairly.
- Keep [test.json](./demo_via_adapter/fma/manifests/test.json) fixed across experiments to compare models fairly.
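One way to check that `test.json` really stays fixed between experiments is to record a content fingerprint once and compare it before each run. The sketch below is an illustration only: the helper names `manifest_fingerprint` and `is_unchanged` are hypothetical and not part of this repository.

```python
import hashlib
import tempfile
from pathlib import Path


def manifest_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a manifest file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_unchanged(path: Path, recorded: str) -> bool:
    """True when the file still matches a previously recorded fingerprint."""
    return manifest_fingerprint(path) == recorded


if __name__ == "__main__":
    # Throwaway file standing in for a real test.json (hypothetical content).
    with tempfile.TemporaryDirectory() as tmp:
        test_manifest = Path(tmp) / "test.json"
        test_manifest.write_text('[{"type": "query"}]', encoding="utf-8")
        recorded = manifest_fingerprint(test_manifest)
        print(is_unchanged(test_manifest, recorded))  # True
        test_manifest.write_text("[]", encoding="utf-8")
        print(is_unchanged(test_manifest, recorded))  # False
```

Storing the recorded digest next to experiment results makes it easy to spot runs that were evaluated against a different test split.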
......
......@@ -94,6 +94,18 @@ class BaseAdapter:
        summary["dataset"] = self.name
        return summary

    def validate_local_manifests(self, manifests_dir: Path) -> Dict:
        cmd = [
            "/usr/local/miniconda3/bin/python",
            "src/data/manifest_tools.py",
            "validate-splits",
            str(manifests_dir),
        ]
        result = subprocess.check_output(cmd, text=True)
        summary = json.loads(result)
        summary["dataset"] = self.name
        return summary


class FMAAdapter(BaseAdapter):
    name = "fma"
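The added method hard-codes the interpreter path `/usr/local/miniconda3/bin/python`, which only works on machines with that exact layout. A possible portable variant, shown here as a sketch rather than the committed code, uses `sys.executable`. The function names `build_validate_cmd` and `run_validate` are hypothetical, and the sketch assumes the command is run from the `acr-engine` root so the relative script path resolves.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def build_validate_cmd(manifests_dir: Path, script: str = "src/data/manifest_tools.py") -> List[str]:
    """Build the validate-splits command using the interpreter running this process."""
    return [sys.executable, script, "validate-splits", str(manifests_dir)]


def run_validate(manifests_dir: Path, dataset: str) -> Dict:
    """Run the validator and tag its parsed JSON summary with the dataset name."""
    output = subprocess.check_output(build_validate_cmd(manifests_dir), text=True)
    summary = json.loads(output)
    summary["dataset"] = dataset
    return summary


if __name__ == "__main__":
    # Only prints the command; run_validate needs the real script on disk.
    print(build_validate_cmd(Path("data/external_ingested/demo_via_adapter/fma/manifests")))
```

Note that `subprocess.check_output` raises `CalledProcessError` on a non-zero exit code, so callers may want to catch it if the validator is ever changed to signal failure through its exit status.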
......@@ -242,6 +254,10 @@ def main():
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)

    p = sub.add_parser("validate-local")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("manifests_dir")

    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
......@@ -271,6 +287,9 @@ def main():
    elif args.cmd == "inspect-batch":
        summary = inspect_batch(args.pairs, args.eval_ratio, args.query_duration)
        print(json.dumps(summary, indent=2, ensure_ascii=False))
    elif args.cmd == "validate-local":
        summary = ADAPTERS[args.dataset].validate_local_manifests(Path(args.manifests_dir))
        print(json.dumps(summary, indent=2, ensure_ascii=False))


if __name__ == "__main__":
......
......@@ -144,6 +144,48 @@ def inspect_audio_dir(
    }


def validate_splits(manifests_dir: Path):
    required = ["catalog.json", "train.json", "test.json", "val.json"]
    missing = [name for name in required if not (manifests_dir / name).exists()]
    if missing:
        return {"ok": False, "missing_files": missing}

    catalog = json.loads((manifests_dir / "catalog.json").read_text())
    train = json.loads((manifests_dir / "train.json").read_text())
    test = json.loads((manifests_dir / "test.json").read_text())
    val = json.loads((manifests_dir / "val.json").read_text())

    catalog_refs = [x for x in catalog if x.get("type") == "reference"]
    train_queries = [x for x in train if x.get("type") != "reference"]
    test_queries = [x for x in test if x.get("type") != "reference"]
    val_queries = [x for x in val if x.get("type") != "reference"]

    source_values = {
        x.get("source_dataset", "unknown")
        for x in catalog_refs + train_queries + test_queries + val_queries
    }

    errors = []
    if not catalog_refs:
        errors.append("catalog_has_no_references")
    if not train_queries:
        errors.append("train_has_no_queries")
    if not test_queries:
        errors.append("test_has_no_queries")
    if len(source_values) > 1:
        errors.append("mixed_source_dataset_values")

    return {
        "ok": len(errors) == 0,
        "errors": errors,
        "catalog_references": len(catalog_refs),
        "train_queries": len(train_queries),
        "test_queries": len(test_queries),
        "val_queries": len(val_queries),
        "source_datasets": sorted(source_values),
    }
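`validate_splits` lists `val.json` among the required files, while the README hunk in this same commit describes it as an optional validation split. If "optional" is the intended behavior, one possible variant treats a missing `val.json` as an empty split. This is a sketch under that assumption, not the committed logic; `load_entries` and `count_queries` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_entries(path: Path, required: bool = True) -> List[dict]:
    """Load a manifest list; a missing optional file counts as an empty list."""
    if not path.exists():
        if required:
            raise FileNotFoundError(path.name)
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def count_queries(manifests_dir: Path) -> dict:
    """Count non-reference entries per split, tolerating an absent val.json."""
    counts = {}
    for name, required in (("train.json", True), ("test.json", True), ("val.json", False)):
        entries = load_entries(manifests_dir / name, required=required)
        counts[name] = sum(1 for e in entries if e.get("type") != "reference")
    return counts


if __name__ == "__main__":
    # Made-up manifests without val.json, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "train.json").write_text(json.dumps([{"type": "query"}, {"type": "reference"}]))
        (root / "test.json").write_text(json.dumps([{"type": "query"}]))
        print(count_queries(root))  # {'train.json': 1, 'test.json': 1, 'val.json': 0}
```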


def main():
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="cmd", required=True)
......@@ -167,6 +209,9 @@ def main():
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--eval-ratio", type=float, default=0.2)

    p = sub.add_parser("validate-splits")
    p.add_argument("manifests_dir")

    args = parser.parse_args()
    if args.cmd == "csv-to-catalog":
        count = csv_to_catalog(Path(args.csv_path), Path(args.output_path), args.path_field, args.id_field)
......@@ -188,6 +233,9 @@ def main():
            eval_ratio=args.eval_ratio,
        )
        print(json.dumps({"status": "ok", **summary}, ensure_ascii=False))
    elif args.cmd == "validate-splits":
        summary = validate_splits(Path(args.manifests_dir))
        print(json.dumps(summary, ensure_ascii=False))


if __name__ == "__main__":
......
......@@ -2,6 +2,29 @@
## 2026-06-02
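### Stage: Documentation condensation and relative-link navigation
Completed:
- Restructured [docs/README.md](./README.md) into 4 groups of main document entries
- Converted relative paths in several places from backtick text to clickable Markdown links
- Consolidated the "data onboarding" reading entry to reduce the perceived number of documents
- Fixed how the docs link to scripts, manifests, and related documents
Verification:
- The entry document is now grouped into:
  - Project and architecture
  - Data and evaluation
  - Service and engineering
  - Research and roadmap
- `dataset-spec.md` / `dataset-sources-and-licensing.md` / `industrialization-roadmap.md` / `service-api.md` / `industrial-benchmark-spec.md` now use relative links in place of some backtick paths
Conclusion:
- The documentation entry points are noticeably more condensed
- Readers no longer have to face a long flat list of file names first
- Relative paths are now better suited to direct navigation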
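The commit message mentions a targeted doc-link scan but does not include the script. A minimal sketch of such a scan, assuming only inline Markdown links of the form `[text](target)` need checking, could look like this. The name `broken_relative_links` is hypothetical, and the regex deliberately ignores reference-style links and targets that contain spaces.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

# Inline Markdown links such as [text](./target.md); image links match too.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# URL schemes such as https: or mailto: mark absolute targets.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> List[Tuple[int, str]]:
    """Return (line_number, target) pairs whose relative target does not exist."""
    broken = []
    text = markdown_file.read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for target in LINK_RE.findall(line):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # skip absolute URLs and in-page anchors
            path_part = target.split("#", 1)[0]  # drop any #fragment
            if not (markdown_file.parent / path_part).exists():
                broken.append((lineno, target))
    return broken


if __name__ == "__main__":
    # Throwaway docs folder with one valid and one broken relative link.
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp)
        (docs / "a.md").write_text("# A\n", encoding="utf-8")
        readme = docs / "README.md"
        readme.write_text(
            "[A](./a.md)\n[B](./missing.md)\n[C](https://example.com)\n",
            encoding="utf-8",
        )
        print(broken_relative_links(readme))  # [(2, './missing.md')]
```

As the `Not-tested` trailer notes, a scan like this only confirms that targets exist on disk; it says nothing about how a particular Markdown renderer resolves or displays the links.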
### Stage: confused targeted optimization v6 (sample-level weighting)
Completed:
......
......@@ -4,15 +4,14 @@
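## One-page summary
This documentation set has been restructured as "**key points → diagrams → tables → prose → details**"; read it in the order below
There were too many documentation entry points, so they are now condensed into **4 groups of main documents**
1. **Project positioning and responsibilities**
2. **System architecture**
3. **Data specification**
4. **Service API**
5. **Benchmark and industrialization roadmap**
6. **Data sources and licensing**
7. **SOTA research**
1. **Project and architecture**
2. **Data and evaluation**
3. **Service and engineering**
4. **Research and roadmap**
Start with just these 4 groups; there is no need to read every detailed document in one go.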
---
......@@ -40,56 +39,56 @@ flowchart TD
---
## 2. Reading order table
## 2. Condensed reading entry points
| Reader role | Read first |
|---|---|
| Product/owner | `industrialization-roadmap.md` |
| Algorithm/model | `acr-architecture.md`, `dataset-spec.md`, `sota-research-2026.md` |
| Platform/backend | `service-api.md`, `industrial-benchmark-spec.md` |
| Data/compliance | `dataset-sources-and-licensing.md` |
| New members | `project-responsibility-map.md`, `README.md` |
| New members | [Project and architecture](./project-responsibility-map.md), [System architecture](./acr-architecture.md) |
| Algorithm/model | [Data specification](./dataset-spec.md), [SOTA research](./sota-research-2026.md) |
| Platform/backend | [Service API](./service-api.md), [Evaluation specification](./industrial-benchmark-spec.md) |
| Data onboarding | [Data sources and onboarding](./dataset-sources-and-licensing.md) |
| Owner/planning | [Industrialization roadmap](./industrialization-roadmap.md) |
---
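## 3. Document list
## 3. Main document groups
### A. Project and architecture
- [Project responsibility map](./project-responsibility-map.md)
- [System architecture](./acr-architecture.md)
### B. Data and evaluation
- [Data specification](./dataset-spec.md)
- [Data sources and onboarding](./dataset-sources-and-licensing.md)
- [Industrial evaluation specification](./industrial-benchmark-spec.md)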
- `project-responsibility-map.md`
- `acr-architecture.md`
- `dataset-spec.md`
- `service-api.md`
- `industrial-benchmark-spec.md`
- `industrialization-roadmap.md`
- `dataset-sources-and-licensing.md`
- `sota-research-2026.md`
- `CHANGELOG.md`
### C. Service and engineering
- [Service API](./service-api.md)
- [Changelog](./CHANGELOG.md)
### D. Research and roadmap
- [Industrialization roadmap](./industrialization-roadmap.md)
- [SOTA research](./sota-research-2026.md)
- [References and sources](./references-and-sources.md)
---
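## 4. Explanatory notes
This documentation set is not a flat manual; it is oriented as far as possible toward:
- Decision-making
- Division of work
- Layering
- Industrialization path
So each document presents, in priority order:
- Key conclusions
- Diagrams of relationships
- Tabular summaries
- Explanatory prose
- Detailed appendix
The goal now is to cut the reading cost of duplicate documents at the same level:
- Group documents on the entry page first
- Then keep 1 to 3 main documents in each group
- Put secondary details inside a group where possible instead of adding more files side by side
---
## 5. Detailed appendix
Suggested future additions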
- Benchmark report 模板
- Model card 模板
- License review checklist
- Release checklist
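Suggested usage
- To understand the project, start with [Project responsibility map](./project-responsibility-map.md) + [System architecture](./acr-architecture.md)
- To train or evaluate, start with [Data specification](./dataset-spec.md)
- To onboard open data, start with [Data sources and onboarding](./dataset-sources-and-licensing.md)
- To follow the project history, read the [Changelog](./CHANGELOG.md)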
## Sources
- This file is an internal documentation navigation artifact for the current repo state.
......
......@@ -20,9 +20,10 @@
Suggested onboarding order:
1. Download or prepare a local audio directory for FMA or MTG-Jamendo
2. Run `external_adapters.py prepare-local ...`
3. Generate the `catalog/train/test/val` manifests
4. Use `train.json` for training and `test.json` for fixed evaluation
2. Run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` or `inspect-batch`
3. Then run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local`
4. Generate [catalog.json / train.json / test.json / val.json](../acr-engine/data/external_ingested/README.md)
5. Use [train.json](../acr-engine/data/external_ingested/README.md) for training and [test.json](../acr-engine/data/external_ingested/README.md) for fixed evaluation
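The onboarding order above, followed by the `validate-local` step this commit adds, can be sketched as a list of commands. The subcommand names and positional arguments come from this diff; the helper name `onboarding_commands` is hypothetical, and the sketch assumes it is run from the `acr-engine` root. Where `prepare-local` writes its manifests is not shown in this diff, so the manifests directory is passed in explicitly.

```python
import sys
from typing import List

ADAPTER_SCRIPT = "src/data/external_adapters.py"  # relative to acr-engine/ (assumed cwd)


def onboarding_commands(dataset: str, input_dir: str, manifests_dir: str) -> List[List[str]]:
    """Return the inspect -> prepare -> validate commands in run order."""
    base = [sys.executable, ADAPTER_SCRIPT]
    return [
        base + ["inspect-local", dataset, input_dir],
        base + ["prepare-local", dataset, input_dir],
        base + ["validate-local", dataset, manifests_dir],
    ]


if __name__ == "__main__":
    # Paths below are the example locations that appear in this diff.
    commands = onboarding_commands(
        "fma",
        "data/raw/fma_small_audio",
        "data/external_ingested/demo_via_adapter/fma/manifests",
    )
    for cmd in commands:
        print(" ".join(cmd))
```

Each entry can be passed to `subprocess.run(cmd, check=True)` to execute the steps in order and stop at the first failure.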
---
......@@ -91,4 +92,4 @@ flowchart LR
## Sources
- See `docs/references-and-sources.md` for the current source map.
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
......@@ -132,21 +132,21 @@ flowchart LR
| Artifact | Purpose | Notes |
|---|---|---|
| `catalog.json` | Index building | All reference tracks |
| `train.json` | Training queries | query + references |
| `test.json` | Evaluation queries | query + references |
| `val.json` | Reserved validation set | May currently be empty |
| [catalog.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json) | Index building | All reference tracks |
| [train.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json) | Training queries | query + references |
| [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) | Evaluation queries | query + references |
| [val.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json) | Reserved validation set | May currently be empty |
Recommended rules (personal use):
- FMA / MTG-Jamendo can be used first as real train/eval baselines
- Pin at least some tracks to `test.json` only; they must not also take part in training
- Pin at least some tracks to [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) only; they must not also take part in training
- Even small datasets need at least 1 train query + 1 test query
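The rule that some tracks appear only in `test.json` can be checked mechanically by intersecting the query track identifiers of the two splits. The sketch below assumes each entry carries a track identifier under a key named `track_id`; that key is a placeholder, since the actual manifest schema is not shown in this diff. The `type` field and its `reference` value follow `validate_splits` from this commit.

```python
from typing import Iterable, Set


def query_track_ids(entries: Iterable[dict], id_key: str = "track_id") -> Set[str]:
    """Collect identifiers of non-reference entries (queries) that carry id_key."""
    return {
        str(e[id_key])
        for e in entries
        if e.get("type") != "reference" and id_key in e
    }


def leaked_tracks(train: Iterable[dict], test: Iterable[dict], id_key: str = "track_id") -> Set[str]:
    """Return track ids that appear as queries in both train and test."""
    return query_track_ids(train, id_key) & query_track_ids(test, id_key)


if __name__ == "__main__":
    # Made-up entries; "track_id" is an assumed field name.
    train = [
        {"type": "query", "track_id": "t1"},
        {"type": "query", "track_id": "t2"},
        {"type": "reference", "track_id": "t3"},
    ]
    test = [
        {"type": "query", "track_id": "t2"},
        {"type": "reference", "track_id": "t3"},
    ]
    print(sorted(leaked_tracks(train, test)))  # ['t2']
```

References are excluded on purpose: both splits are described as "query + references", so shared reference tracks are expected and only shared queries indicate leakage.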
CLI entry points:
- Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits`
- High-level unified entry point: `src/data/external_adapters.py prepare-local <dataset> <input_dir>`
- Pre-import check: `src/data/external_adapters.py inspect-local <dataset> <input_dir>`
- Multi-directory batch pre-check: `src/data/external_adapters.py inspect-batch fma=<dir> mtg_jamendo=<dir> ...`
- Low-level tool: [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py)
- High-level unified entry point: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local <dataset> <input_dir>`
- Pre-import check: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local <dataset> <input_dir>`
- Multi-directory batch pre-check: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-batch fma=<dir> mtg_jamendo=<dir> ...`
## 5. Explanatory notes
......@@ -205,4 +205,4 @@ CLI entry points:
## Sources
- See `docs/references-and-sources.md` for the current source map.
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
......@@ -80,4 +80,4 @@ flowchart LR
## Sources
- See `docs/references-and-sources.md` for the current source map.
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
......@@ -75,10 +75,10 @@ flowchart LR
## 5. Detailed appendix
Related documents:
- `docs/dataset-sources-and-licensing.md`
- `docs/industrial-benchmark-spec.md`
- `docs/service-api.md`
- [Data sources and onboarding](./dataset-sources-and-licensing.md)
- [Industrial evaluation specification](./industrial-benchmark-spec.md)
- [Service API](./service-api.md)
## Sources
- See `docs/references-and-sources.md` for the current source map.
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
......@@ -94,4 +94,4 @@ sequenceDiagram
## Sources
- See `docs/references-and-sources.md` for the current source map.
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
......