Commit af33be35 af33be35c971b2b007897a658418f2437b19609c by cnb.bofCdSsphPA

Condense docs and add manifest validation before training

Constraint: Readers need fewer entry documents and clickable relative links before scaling open-dataset usage
Rejected: Keep expanding flat documentation pages | Increases navigation cost and hides the main execution path
Confidence: high
Scope-risk: moderate
Directive: Route future dataset operations through inspect-local/inspect-batch/prepare-local/validate-local and keep docs grouped by role
Tested: /usr/local/miniconda3/bin/python -m py_compile src/data/manifest_tools.py src/data/external_adapters.py; /usr/local/miniconda3/bin/python src/data/manifest_tools.py validate-splits data/external_ingested/demo_via_adapter/fma/manifests; /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/demo_via_adapter/fma/manifests; python3 targeted-doc-link scan over docs/README.md docs/dataset-spec.md docs/dataset-sources-and-licensing.md docs/industrialization-roadmap.md docs/service-api.md docs/industrial-benchmark-spec.md acr-engine/data/external_ingested/README.md
Not-tested: Real browser/rendered markdown click-through behavior across every client
1 parent d75fbf81
...@@ -10,8 +10,8 @@ Convert local open-music audio folders into ACR-ready manifests for: ...@@ -10,8 +10,8 @@ Convert local open-music audio folders into ACR-ready manifests for:
10 10
11 ### 1. Prepare a local audio directory 11 ### 1. Prepare a local audio directory
12 Examples: 12 Examples:
13 - `data/raw/fma_small_audio/` 13 - [data/raw/fma_small_audio/](../raw/fma_small_audio/)
14 - `data/raw/mtg_jamendo_audio/` 14 - [data/raw/mtg_jamendo_audio/](../raw/mtg_jamendo_audio/)
15 15
16 ### 2. Generate manifests through the adapter entrypoint 16 ### 2. Generate manifests through the adapter entrypoint
17 Optional pre-check: 17 Optional pre-check:
...@@ -37,12 +37,12 @@ or ...@@ -37,12 +37,12 @@ or
37 37
38 ### 3. Use outputs 38 ### 3. Use outputs
39 Generated files: 39 Generated files:
40 - `catalog.json`: reference tracks for indexing 40 - [catalog.json](./demo_via_adapter/fma/manifests/catalog.json): reference tracks for indexing
41 - `train.json`: train queries + references 41 - [train.json](./demo_via_adapter/fma/manifests/train.json): train queries + references
42 - `test.json`: held-out eval queries + references 42 - [test.json](./demo_via_adapter/fma/manifests/test.json): held-out eval queries + references
43 - `val.json`: optional validation split 43 - [val.json](./demo_via_adapter/fma/manifests/val.json): optional validation split
44 44
45 ## Notes 45 ## Notes
46 - Small datasets are automatically protected so both train/test query sets exist. 46 - Small datasets are automatically protected so both train/test query sets exist.
47 - For personal use, FMA and MTG-Jamendo should be the first real baselines. 47 - For personal use, FMA and MTG-Jamendo should be the first real baselines.
48 - Keep `test.json` fixed across experiments to compare models fairly. 48 - Keep [test.json](./demo_via_adapter/fma/manifests/test.json) fixed across experiments to compare models fairly.
......
...@@ -94,6 +94,18 @@ class BaseAdapter: ...@@ -94,6 +94,18 @@ class BaseAdapter:
94 summary["dataset"] = self.name 94 summary["dataset"] = self.name
95 return summary 95 return summary
96 96
97 def validate_local_manifests(self, manifests_dir: Path) -> Dict:
98 cmd = [
99 "/usr/local/miniconda3/bin/python",
100 "src/data/manifest_tools.py",
101 "validate-splits",
102 str(manifests_dir),
103 ]
104 result = subprocess.check_output(cmd, text=True)
105 summary = json.loads(result)
106 summary["dataset"] = self.name
107 return summary
108
97 109
98 class FMAAdapter(BaseAdapter): 110 class FMAAdapter(BaseAdapter):
99 name = "fma" 111 name = "fma"
...@@ -242,6 +254,10 @@ def main(): ...@@ -242,6 +254,10 @@ def main():
242 p.add_argument("--eval-ratio", type=float, default=0.2) 254 p.add_argument("--eval-ratio", type=float, default=0.2)
243 p.add_argument("--query-duration", type=float, default=8.0) 255 p.add_argument("--query-duration", type=float, default=8.0)
244 256
257 p = sub.add_parser("validate-local")
258 p.add_argument("dataset", choices=sorted(ADAPTERS))
259 p.add_argument("manifests_dir")
260
245 args = parser.parse_args() 261 args = parser.parse_args()
246 if args.cmd == "registry": 262 if args.cmd == "registry":
247 path = write_registry(args.output) 263 path = write_registry(args.output)
...@@ -271,6 +287,9 @@ def main(): ...@@ -271,6 +287,9 @@ def main():
271 elif args.cmd == "inspect-batch": 287 elif args.cmd == "inspect-batch":
272 summary = inspect_batch(args.pairs, args.eval_ratio, args.query_duration) 288 summary = inspect_batch(args.pairs, args.eval_ratio, args.query_duration)
273 print(json.dumps(summary, indent=2, ensure_ascii=False)) 289 print(json.dumps(summary, indent=2, ensure_ascii=False))
290 elif args.cmd == "validate-local":
291 summary = ADAPTERS[args.dataset].validate_local_manifests(Path(args.manifests_dir))
292 print(json.dumps(summary, indent=2, ensure_ascii=False))
274 293
275 294
276 if __name__ == "__main__": 295 if __name__ == "__main__":
......
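The `validate_local_manifests` method in the hunk above shells out to `manifest_tools.py` through a hardcoded interpreter path (`/usr/local/miniconda3/bin/python`) and a script path relative to the working directory. As a minimal sketch, assuming the validator should run under whichever interpreter launched the adapter, the same call could be built from `sys.executable`. The helper names `build_validate_cmd` and `run_validate` are hypothetical and not part of this commit.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def build_validate_cmd(manifests_dir: Path, script: Path) -> List[str]:
    """Build the validate-splits command using the current interpreter."""
    # sys.executable replaces the hardcoded miniconda path (an assumption,
    # not what the commit does).
    return [sys.executable, str(script), "validate-splits", str(manifests_dir)]


def run_validate(manifests_dir: Path, script: Path, dataset: str) -> Dict:
    """Run the validator and tag its JSON summary with the dataset name."""
    cmd = build_validate_cmd(manifests_dir, script)
    # check_output raises CalledProcessError on a non-zero exit status.
    result = subprocess.check_output(cmd, text=True)
    summary = json.loads(result)
    summary["dataset"] = dataset
    return summary
```

Passing the script path in explicitly also removes the dependence on the caller's working directory, which may or may not matter for how this repo is normally run.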
...@@ -144,6 +144,48 @@ def inspect_audio_dir( ...@@ -144,6 +144,48 @@ def inspect_audio_dir(
144 } 144 }
145 145
146 146
147 def validate_splits(manifests_dir: Path):
148 required = ["catalog.json", "train.json", "test.json", "val.json"]
149 missing = [name for name in required if not (manifests_dir / name).exists()]
150 if missing:
151 return {"ok": False, "missing_files": missing}
152
153 catalog = json.loads((manifests_dir / "catalog.json").read_text())
154 train = json.loads((manifests_dir / "train.json").read_text())
155 test = json.loads((manifests_dir / "test.json").read_text())
156 val = json.loads((manifests_dir / "val.json").read_text())
157
158 catalog_refs = [x for x in catalog if x.get("type") == "reference"]
159 train_queries = [x for x in train if x.get("type") != "reference"]
160 test_queries = [x for x in test if x.get("type") != "reference"]
161 val_queries = [x for x in val if x.get("type") != "reference"]
162
163 source_values = {
164 x.get("source_dataset", "unknown")
165 for x in catalog_refs + train_queries + test_queries + val_queries
166 }
167
168 errors = []
169 if not catalog_refs:
170 errors.append("catalog_has_no_references")
171 if not train_queries:
172 errors.append("train_has_no_queries")
173 if not test_queries:
174 errors.append("test_has_no_queries")
175 if len(source_values) > 1:
176 errors.append("mixed_source_dataset_values")
177
178 return {
179 "ok": len(errors) == 0,
180 "errors": errors,
181 "catalog_references": len(catalog_refs),
182 "train_queries": len(train_queries),
183 "test_queries": len(test_queries),
184 "val_queries": len(val_queries),
185 "source_datasets": sorted(source_values),
186 }
187
188
147 def main(): 189 def main():
148 parser = argparse.ArgumentParser() 190 parser = argparse.ArgumentParser()
149 sub = parser.add_subparsers(dest="cmd", required=True) 191 sub = parser.add_subparsers(dest="cmd", required=True)
...@@ -167,6 +209,9 @@ def main(): ...@@ -167,6 +209,9 @@ def main():
167 p.add_argument("--query-duration", type=float, default=8.0) 209 p.add_argument("--query-duration", type=float, default=8.0)
168 p.add_argument("--eval-ratio", type=float, default=0.2) 210 p.add_argument("--eval-ratio", type=float, default=0.2)
169 211
212 p = sub.add_parser("validate-splits")
213 p.add_argument("manifests_dir")
214
170 args = parser.parse_args() 215 args = parser.parse_args()
171 if args.cmd == "csv-to-catalog": 216 if args.cmd == "csv-to-catalog":
172 count = csv_to_catalog(Path(args.csv_path), Path(args.output_path), args.path_field, args.id_field) 217 count = csv_to_catalog(Path(args.csv_path), Path(args.output_path), args.path_field, args.id_field)
...@@ -188,6 +233,9 @@ def main(): ...@@ -188,6 +233,9 @@ def main():
188 eval_ratio=args.eval_ratio, 233 eval_ratio=args.eval_ratio,
189 ) 234 )
190 print(json.dumps({"status": "ok", **summary}, ensure_ascii=False)) 235 print(json.dumps({"status": "ok", **summary}, ensure_ascii=False))
236 elif args.cmd == "validate-splits":
237 summary = validate_splits(Path(args.manifests_dir))
238 print(json.dumps(summary, ensure_ascii=False))
191 239
192 240
193 if __name__ == "__main__": 241 if __name__ == "__main__":
......
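The README hunk describes `val.json` as an "optional validation split", while `validate_splits` as committed lists it among the required files. If "optional" is meant to allow the file to be absent, one possible variant is sketched below. The function name `validate_splits_optional_val`, the helper `load_entries`, and the assumption that a missing `val.json` counts as an empty split are all hypothetical. The manifest shape (a JSON list of objects with a `type` field) follows the committed code.

```python
import json
from pathlib import Path


def load_entries(path: Path) -> list:
    """Return the manifest list, or [] when the file does not exist."""
    return json.loads(path.read_text()) if path.exists() else []


def validate_splits_optional_val(manifests_dir: Path) -> dict:
    # Assumption: val.json may be absent; the other three files stay required.
    required = ["catalog.json", "train.json", "test.json"]
    missing = [name for name in required if not (manifests_dir / name).exists()]
    if missing:
        return {"ok": False, "missing_files": missing}

    catalog = load_entries(manifests_dir / "catalog.json")
    refs = [x for x in catalog if x.get("type") == "reference"]
    queries = {
        split: [
            x
            for x in load_entries(manifests_dir / f"{split}.json")
            if x.get("type") != "reference"
        ]
        for split in ("train", "test", "val")
    }
    all_items = refs + [q for items in queries.values() for q in items]
    sources = {x.get("source_dataset", "unknown") for x in all_items}

    errors = []
    if not refs:
        errors.append("catalog_has_no_references")
    if not queries["train"]:
        errors.append("train_has_no_queries")
    if not queries["test"]:
        errors.append("test_has_no_queries")
    if len(sources) > 1:
        errors.append("mixed_source_dataset_values")

    return {
        "ok": not errors,
        "errors": errors,
        "catalog_references": len(refs),
        **{f"{split}_queries": len(items) for split, items in queries.items()},
        "source_datasets": sorted(sources),
    }
```

The summary keys match the committed function, so a caller such as the `validate-local` subcommand would not need to change if this variant were adopted.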
...@@ -2,6 +2,29 @@ ...@@ -2,6 +2,29 @@
2 2
3 ## 2026-06-02 3 ## 2026-06-02
4 4
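5 ### Stage: Documentation condensation and relative-link navigation
6
7 Completed:
8 - Restructured [docs/README.md](./README.md) into 4 groups of main document entry points
9 - Converted relative paths in many places from backtick text to clickable Markdown links
10 - Consolidated the "data ingestion" reading entry points so the doc set feels smaller
11 - Fixed how the docs link to scripts, manifests, and related documents
12
13 Verification results:
14 - The entry document now sorts documents into:
15 - Project and architecture
16 - Data and evaluation
17 - Service and engineering
18 - Research and roadmap
19 as its four groups
20 - `dataset-spec.md` / `dataset-sources-and-licensing.md` / `industrialization-roadmap.md` / `service-api.md` / `industrial-benchmark-spec.md`
21 now use relative links in place of some backtick paths
22
23 Conclusions:
24 - The documentation entry points are now clearly condensed
25 - Readers no longer face a long flat list of file names first
26 - Relative paths are now better suited to direct navigation
27
5 ### Stage: confused targeted optimization v6 (sample-level weighting) 28 ### Stage: confused targeted optimization v6 (sample-level weighting)
6 29
7 Completed: 30 Completed: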
......
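The commit message lists a "targeted-doc-link scan" among its tests but does not show it. A minimal sketch of such a scan, assuming it only needs to confirm that each relative Markdown link target exists on disk, might look like the following. The function name `find_broken_links` and the regular expression are illustrative and are not the script the author ran.

```python
import re
from pathlib import Path

# Matches inline links such as [text](target); captures the target only.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(markdown_file: Path) -> list:
    """Return relative link targets in markdown_file that do not exist."""
    broken = []
    text = markdown_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip absolute URLs and in-page anchors
        path_part = target.split("#", 1)[0]  # drop any trailing #fragment
        # Relative links resolve against the directory of the linking file.
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

A check like this covers file and directory targets only. As the commit's Not-tested line says, how each Markdown client renders and follows the links is a separate question.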
...@@ -4,15 +4,14 @@ ...@@ -4,15 +4,14 @@
4 4
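5 ## One-page summary 5 ## One-page summary
6 6
7 These docs have been restructured as "**key points → diagrams → tables → text → details**"; the suggested reading order is: 7 There are currently too many documentation entry points, so they are now condensed into **4 main document groups**:
8 8
9 1. **Project positioning and responsibilities** 9 1. **Project and architecture**
10 2. **System architecture** 10 2. **Data and evaluation**
11 3. **Dataset specification** 11 3. **Service and engineering**
12 4. **Service API** 12 4. **Research and roadmap**
13 5. **Benchmark and industrialization roadmap** 13
14 6. **Dataset sources and licensing** 14 Start with just these 4 groups; there is no need to read every detailed document in one pass.
15 7. **SOTA research**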
16 15
17 --- 16 ---
18 17
...@@ -40,56 +39,56 @@ flowchart TD ...@@ -40,56 +39,56 @@ flowchart TD
40 39
41 --- 40 ---
42 41
43 ## 2. Reading order table 42 ## 2. Condensed reading entry points
44 43
45 | Reader role | Read first | 44 | Reader role | Read first |
46 |---|---| 45 |---|---|
47 | Product / owner | `industrialization-roadmap.md` | 46 | New member | [Project and architecture](./project-responsibility-map.md), [System architecture](./acr-architecture.md) |
48 | Algorithms / models | `acr-architecture.md`, `dataset-spec.md`, `sota-research-2026.md` | 47 | Algorithms / models | [Dataset specification](./dataset-spec.md), [SOTA research](./sota-research-2026.md) |
49 | Platform / backend | `service-api.md`, `industrial-benchmark-spec.md` | 48 | Platform / backend | [Service API](./service-api.md), [Benchmark specification](./industrial-benchmark-spec.md) |
50 | Data / compliance | `dataset-sources-and-licensing.md` | 49 | Data ingestion | [Dataset sources and ingestion](./dataset-sources-and-licensing.md) |
51 | New member | `project-responsibility-map.md`, `README.md` | 50 | Owner / planning | [Industrialization roadmap](./industrialization-roadmap.md) |
52 51
53 --- 52 ---
54 53
55 ## 3. Document list 54 ## 3. Main document groups
55
56 ### A. Project and architecture
57 - [Project responsibility map](./project-responsibility-map.md)
58 - [System architecture](./acr-architecture.md)
59
60 ### B. Data and evaluation
61 - [Dataset specification](./dataset-spec.md)
62 - [Dataset sources and ingestion](./dataset-sources-and-licensing.md)
63 - [Industrial benchmark specification](./industrial-benchmark-spec.md)
56 64
57 - `project-responsibility-map.md` 65 ### C. Service and engineering
58 - `acr-architecture.md` 66 - [Service API](./service-api.md)
59 - `dataset-spec.md` 67 - [Changelog](./CHANGELOG.md)
60 - `service-api.md` 68
61 - `industrial-benchmark-spec.md` 69 ### D. Research and roadmap
62 - `industrialization-roadmap.md` 70 - [Industrialization roadmap](./industrialization-roadmap.md)
63 - `dataset-sources-and-licensing.md` 71 - [SOTA research](./sota-research-2026.md)
64 - `sota-research-2026.md` 72 - [References and sources index](./references-and-sources.md)
65 - `CHANGELOG.md`
66 73
67 --- 74 ---
68 75
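69 ## 4. Narrative notes 76 ## 4. Narrative notes
70 77
71 These docs are not a "flat manual"; they aim to serve: 78 From now on, reduce the reading cost of "duplicate documents at the same level":
72 - Decisions 79 - Group from the entry page first
73 - Division of work 80 - Then keep 1 to 3 main documents per group
74 - Layering 81 - Put secondary details inside a group where possible instead of adding more top-level files
75 - Industrialization path
76
77 So each document presents, in priority order:
78 - Key conclusions
79 - Diagrammed relationships
80 - Tabular summaries
81 - Narrative notes
82 - Detailed appendix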
83 82
84 --- 83 ---
85 84
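86 ## 5. Detailed appendix 85 ## 5. Detailed appendix
87 86
88 Suggested future additions: 87 Suggested usage:
89 - Benchmark report template 88 - To understand the project, first read [Project responsibility map](./project-responsibility-map.md) + [System architecture](./acr-architecture.md)
90 - Model card template 89 - To train or evaluate, first read [Dataset specification](./dataset-spec.md)
91 - License review checklist 90 - To ingest open data, first read [Dataset sources and ingestion](./dataset-sources-and-licensing.md)
92 - Release checklist 91 - For the project history, then read [Changelog](./CHANGELOG.md)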
93 92
94 ## Sources 93 ## Sources
95 - This file is an internal documentation navigation artifact for the current repo state. 94 - This file is an internal documentation navigation artifact for the current repo state.
......
...@@ -20,9 +20,10 @@ ...@@ -20,9 +20,10 @@
20 20
21 Suggested ingestion order: 21 Suggested ingestion order:
22 1. Download/prepare a local audio directory for FMA or MTG-Jamendo 22 1. Download/prepare a local audio directory for FMA or MTG-Jamendo
23 2. Run `external_adapters.py prepare-local ...` 23 2. Run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` or `inspect-batch`
24 3. Generate `catalog/train/test/val` manifests 24 3. Then run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local`
25 4. Use `train.json` for training and `test.json` for fixed evaluation 25 4. Generate [catalog.json / train.json / test.json / val.json](../acr-engine/data/external_ingested/README.md)
26 5. Use [train.json](../acr-engine/data/external_ingested/README.md) for training and [test.json](../acr-engine/data/external_ingested/README.md) for fixed evaluation
26 27
27 --- 28 ---
28 29
...@@ -91,4 +92,4 @@ flowchart LR ...@@ -91,4 +92,4 @@ flowchart LR
91 92
92 93
93 ## Sources 94 ## Sources
94 - See `docs/references-and-sources.md` for the current source map. 95 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
...@@ -132,21 +132,21 @@ flowchart LR ...@@ -132,21 +132,21 @@ flowchart LR
132 132
133 | Artifact | Purpose | Notes | 133 | Artifact | Purpose | Notes |
134 |---|---|---| 134 |---|---|---|
135 | `catalog.json` | Build the index | All reference tracks | 135 | [catalog.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json) | Build the index | All reference tracks |
136 | `train.json` | Training queries | query + references | 136 | [train.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json) | Training queries | query + references |
137 | `test.json` | Evaluation queries | query + references | 137 | [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) | Evaluation queries | query + references |
138 | `val.json` | Reserved validation set | May currently be empty | 138 | [val.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json) | Reserved validation set | May currently be empty |
139 139
140 Recommended rules (personal use): 140 Recommended rules (personal use):
141 - FMA / MTG-Jamendo are good first choices for a real train/eval baseline 141 - FMA / MTG-Jamendo are good first choices for a real train/eval baseline
142 - Reserve at least some tracks for `test.json` only; do not also train on them 142 - Reserve at least some tracks for [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) only; do not also train on them
143 - Even small datasets must have at least 1 train query + 1 test query 143 - Even small datasets must have at least 1 train query + 1 test query
144 144
145 CLI entry points: 145 CLI entry points:
146 - Low-level tool: `src/data/manifest_tools.py audio-dir-to-splits` 146 - Low-level tool: [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py)
147 - High-level unified entry point: `src/data/external_adapters.py prepare-local <dataset> <input_dir>` 147 - High-level unified entry point: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local <dataset> <input_dir>`
148 - Pre-import check: `src/data/external_adapters.py inspect-local <dataset> <input_dir>` 148 - Pre-import check: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local <dataset> <input_dir>`
149 - Batch pre-check across directories: `src/data/external_adapters.py inspect-batch fma=<dir> mtg_jamendo=<dir> ...` 149 - Batch pre-check across directories: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-batch fma=<dir> mtg_jamendo=<dir> ...`
150 150
151 ## 5. Narrative notes 151 ## 5. Narrative notes
152 152
...@@ -205,4 +205,4 @@ CLI entry points: ...@@ -205,4 +205,4 @@ CLI entry points:
205 205
206 206
207 ## Sources 207 ## Sources
208 - See `docs/references-and-sources.md` for the current source map. 208 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
...@@ -80,4 +80,4 @@ flowchart LR ...@@ -80,4 +80,4 @@ flowchart LR
80 80
81 81
82 ## Sources 82 ## Sources
83 - See `docs/references-and-sources.md` for the current source map. 83 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
...@@ -75,10 +75,10 @@ flowchart LR ...@@ -75,10 +75,10 @@ flowchart LR
75 ## 5. Detailed appendix 75 ## 5. Detailed appendix
76 76
77 Related documents: 77 Related documents:
78 - `docs/dataset-sources-and-licensing.md` 78 - [Dataset sources and ingestion](./dataset-sources-and-licensing.md)
79 - `docs/industrial-benchmark-spec.md` 79 - [Industrial benchmark specification](./industrial-benchmark-spec.md)
80 - `docs/service-api.md` 80 - [Service API](./service-api.md)
81 81
82 82
83 ## Sources 83 ## Sources
84 - See `docs/references-and-sources.md` for the current source map. 84 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
......
...@@ -94,4 +94,4 @@ sequenceDiagram ...@@ -94,4 +94,4 @@ sequenceDiagram
94 94
95 95
96 ## Sources 96 ## Sources
97 - See `docs/references-and-sources.md` for the current source map. 97 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
......