Commit eee15aca eee15aca7bf6230c2bcb57b19f424c8741c892a8 by cnb.bofCdSsphPA

Automate the full open-dataset smoke workflow behind one command

Constraint: Real FMA or MTG-Jamendo onboarding should require only an input directory change, not a long manual command chain
Rejected: Keep the smoke steps separate only | Slows repeated validation and increases operator error risk
Confidence: high
Scope-risk: moderate
Directive: Use smoke-local as the default first-pass validation path for every new local open-music corpus
Tested: /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/synthetic_v2/songs --output-root data/external_smoke --eval-ratio 0.2 --query-duration 5.0 --train-epochs 1 --batch-size 2; /usr/local/miniconda3/bin/python -m py_compile src/data/external_adapters.py src/data/manifest_tools.py train.py run_demo.py evaluate.py scripts/generate_artifacts.py
Not-tested: Real downloaded FMA or MTG-Jamendo directories on larger-scale smoke runs
1 parent 87959076
Showing 42 changed files with 1050 additions and 0 deletions
[
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
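A minimal sketch of a structural check for reference entries like the ones in the manifest above. The helper name `check_reference_entries` and its rule set (required keys, unique `song_id`, positive `duration`) are assumptions for illustration; this is not the repository's `validate-local` logic.

```python
from typing import Dict, List

# Keys observed on every reference entry in the manifest above.
REQUIRED_KEYS = {"song_id", "audio_path", "duration", "type", "source_dataset"}


def check_reference_entries(entries: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    seen_ids = set()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        if entry["song_id"] in seen_ids:
            problems.append(f"entry {i}: duplicate song_id {entry['song_id']}")
        seen_ids.add(entry["song_id"])
        if entry["duration"] <= 0:
            problems.append(f"entry {i}: non-positive duration")
    return problems
```

Loading the manifest with `json.load` and passing the resulting list to this helper would be one way to catch malformed entries before training.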
[
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.394,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.922,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 5.0,
"type": "clean",
"offset": 4.219,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.265,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.094,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.403,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.927,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.046,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
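Each `external_query` entry above carries an `offset` and a `duration`, while the matching `reference` entry carries the full track length. A minimal sketch, under the assumption that a query window should lie entirely inside its reference track; `out_of_bounds_queries` and the tolerance `tol` are hypothetical names, not part of the commit.

```python
from typing import Dict, List


def out_of_bounds_queries(entries: List[Dict], tol: float = 1e-6) -> List[str]:
    """Return song_ids whose query window falls outside the reference track."""
    # Full track length per song, taken from the reference entries.
    ref_duration = {
        e["song_id"]: e["duration"] for e in entries if e["type"] == "reference"
    }
    bad = []
    for e in entries:
        if e.get("segment_type") != "external_query":
            continue
        total = ref_duration.get(e["song_id"])
        end = e["offset"] + e["duration"]
        # A query with no reference entry is reported as well.
        if total is None or e["offset"] < 0 or end > total + tol:
            bad.append(e["song_id"])
    return bad
```

For example, the `fma_00003` query above ends at 8.922 + 5.0 = 13.922 s, which is inside its 15.0 s reference track.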
[
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.75,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.365,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.186,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.499,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.204,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.058,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 5.0,
"type": "clean",
"offset": 9.572,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.475,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.071,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 5.0,
"type": "clean",
"offset": 5.362,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.785,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.294,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.617,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.279,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.798,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 5.0,
"type": "clean",
"offset": 1.01,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
{
"fma_00001": 0,
"fma_00002": 1,
"fma_00005": 2,
"fma_00007": 3,
"fma_00008": 4,
"fma_00010": 5,
"fma_00012": 6,
"fma_00014": 7,
"fma_00015": 8,
"fma_00016": 9,
"fma_00017": 10,
"fma_00018": 11,
"fma_00019": 12,
"fma_00021": 13,
"fma_00022": 14,
"fma_00023": 15
}
\ No newline at end of file
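The mapping above assigns labels 0 to 15 to the 16 song IDs that have query segments in the preceding manifest, in sorted order. A minimal sketch that reproduces a mapping of this shape; `build_label_map` is a hypothetical helper, and whether the repository derives its labels this way is an assumption.

```python
from typing import Dict, List


def build_label_map(entries: List[Dict]) -> Dict[str, int]:
    """Map each queried song_id to a contiguous integer label."""
    # Only entries marked as query segments contribute a label.
    query_ids = sorted(
        {e["song_id"] for e in entries if e.get("segment_type") == "external_query"}
    )
    return {song_id: label for label, song_id in enumerate(query_ids)}
```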
{
"generated_at": "2026-06-02T05:04:14Z",
"model_version": "fma-smoke",
"data_version": "fma_local",
"files": {
"benchmark_report": "data/external_smoke/fma_reports_smoke/benchmark-report.md",
"model_card": "data/external_smoke/fma_reports_smoke/model-card.md",
"release_checklist": "data/external_smoke/fma_reports_smoke/release-checklist.md"
}
}
\ No newline at end of file
# Benchmark Report
## One-Page Summary
- Model version: fma-smoke
- Data version: fma_local
- Key result: top1=1.0 top5=1.0
- Passes release gate: TBD
## 1. Evaluation Scope Diagram
```mermaid
flowchart LR
A[fma-smoke] --> B[fma_local]
A --> C[Scenario Buckets]
A --> D[Latency / Ops]
```
## 2. Metrics Table
| Bucket | top1 | top5 | MRR | FAR | Notes |
|---|---:|---:|---:|---:|---|
| clean | 1.0 | 1.0 | | | |
## 3. Analysis
- Strongest area: clean/augmented buckets if present
- Weakest area: see hard-case summary
- Comparison with previous version: TBD
## 4. Appendix
- Raw JSON report: embedded source
## Sources
- docs/industrial-benchmark-spec.md
{
"model": {
"embed_dim": 192,
"channels": 512,
"n_mels": 128,
"use_band_split": true
},
"data": {
"source_dataset": "fma",
"manifests_dir": "data/external_smoke/fma/manifests",
"query_duration": 5.0
},
"run": {
"train_epochs": 1,
"batch_size": 2
}
}
\ No newline at end of file
{
"split": "test",
"num_queries": 8,
"top1": 1.0,
"topk": 1.0,
"by_type": {
"clean": {
"n": 8,
"top1": 1.0,
"topk": 1.0
}
},
"hard_case_summary": {},
"sample_failures": []
}
\ No newline at end of file
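A minimal sketch of how a summary with the shape of the JSON above could be aggregated from per-query rankings. The input tuple layout, the helper name `summarize`, and the cutoff `k=5` are assumptions for illustration; the actual logic lives in `evaluate.py`, which this diff does not show.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-query record: (query_type, true_song_id, ranked_candidate_ids).
Result = Tuple[str, str, List[str]]


def summarize(results: List[Result], k: int = 5) -> Dict:
    """Aggregate top-1 and top-k hit rates overall and per query type."""
    if not results:
        raise ValueError("no query results to summarize")

    def rates(items: List[Result]) -> Dict:
        n = len(items)
        top1 = sum(1 for _, true_id, ranked in items if ranked[:1] == [true_id])
        topk = sum(1 for _, true_id, ranked in items if true_id in ranked[:k])
        return {"n": n, "top1": top1 / n, "topk": topk / n}

    # Group results by query type so each bucket gets its own rates.
    by_type = defaultdict(list)
    for item in results:
        by_type[item[0]].append(item)

    overall = rates(results)
    return {
        "num_queries": overall["n"],
        "top1": overall["top1"],
        "topk": overall["topk"],
        "by_type": {t: rates(items) for t, items in by_type.items()},
    }
```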
# Model Card
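## One-Page Summary
- Model name: ACR Hybrid Encoder
- Version: fma-smoke
- Intended use: music ACR prototype / retrieval
- Out of scope: full commercial production rollout without validation on whitelisted data
## 1. Model Architecture Diagram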
```mermaid
flowchart LR
A[Input Audio] --> B[128 Mel + BandSplit]
B --> C[Encoder]
C --> D[Embedding]
D --> E[Hybrid Retrieval]
```
## 2. Key Information
| Item | Value |
|---|---|
| embed_dim | 192 |
| channels | 512 |
| n_mels | 128 |
| use_band_split | True |
| benchmark report | data/external_smoke/fma_reports_smoke/benchmark-report.md |
## 3. Notes
- Training method: retrieval-oriented pair training
- Model limitations: hard-case accuracy still evolving
- Risk notice: requires whitelist-reviewed datasets for commercial deployment
## 4. Appendix
- config embedded from source JSON
## Sources
- docs/dataset-spec.md
- docs/benchmark-report-template.md
# Release Checklist
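## One-Page Summary
Before release, all of the following must hold: quality passed, compliance passed, service passed, and documentation complete.
## 1. Release Gate Diagram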
```mermaid
flowchart TD
A[fma-smoke] --> B[Benchmark Pass]
A --> C[License Review Pass]
A --> D[Service Smoke Pass]
A --> E[Docs Complete]
```
## 2. Checklist
| Item | Status |
|---|---|
| benchmark report generated | yes |
| model card generated | yes |
| license registry updated | pending |
| service smoke test passed | yes |
| dataset whitelist confirmed | pending |
| changelog updated | pending |
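## 3. Notes
- Currently used for engineering governance and pre-release checks; it does not mean the legal threshold for commercial use has been met.
## 4. Appendix
- benchmark report path: data/external_smoke/fma_reports_smoke/benchmark-report.md
- model card path: data/external_smoke/fma_reports_smoke/model-card.md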
## Sources
- docs/dataset-sources-and-licensing.md
- docs/industrial-benchmark-spec.md
......@@ -221,6 +221,100 @@ def inspect_batch(pairs: List[str], eval_ratio: float, query_duration: float) ->
    return {"datasets": results, "count": len(results)}


def smoke_local_dataset(
    dataset: str,
    input_dir: Path,
    output_root: Path,
    eval_ratio: float,
    query_duration: float,
    seed: int,
    train_epochs: int,
    batch_size: int,
) -> Dict:
    adapter = ADAPTERS[dataset]
    inspect_summary = adapter.inspect_local_audio(input_dir, query_duration=query_duration, eval_ratio=eval_ratio)
    prepare_summary = adapter.prepare_local_audio(
        input_dir,
        output_root / dataset,
        eval_ratio=eval_ratio,
        query_duration=query_duration,
        seed=seed,
    )
    manifests_dir = Path(prepare_summary["output_dir"])
    validate_summary = adapter.validate_local_manifests(manifests_dir)

    model_dir = output_root / f"{dataset}_models_smoke"
    index_dir = output_root / f"{dataset}_index_smoke"
    report_dir = output_root / f"{dataset}_reports_smoke"
    config_path = report_dir / "config.json"

    subprocess.run([
        "/usr/local/miniconda3/bin/python",
        "train.py",
        "--data", str(manifests_dir),
        "--output", str(model_dir),
        "--device", "cpu",
        "--epochs", str(train_epochs),
        "--batch-size", str(batch_size),
    ], check=True)

    subprocess.run([
        "/usr/local/miniconda3/bin/python",
        "run_demo.py",
        "build-index",
        "--data", str(manifests_dir),
        "--model", str(model_dir / "best_model.pt"),
        "--output", str(index_dir),
        "--device", "cpu",
    ], check=True)

    report_dir.mkdir(parents=True, exist_ok=True)
    eval_json = report_dir / "eval.json"
    subprocess.run([
        "/usr/local/miniconda3/bin/python",
        "evaluate.py",
        "--data", str(manifests_dir),
        "--model", str(model_dir / "best_model.pt"),
        "--index-prefix", str(index_dir / "reference"),
        "--split", "test",
        "--device", "cpu",
        "--fast-eval",
        "--output-json", str(eval_json),
    ], check=True)

    config = {
        "model": {"embed_dim": 192, "channels": 512, "n_mels": 128, "use_band_split": True},
        "data": {"source_dataset": dataset, "manifests_dir": str(manifests_dir), "query_duration": query_duration},
        "run": {
            "train_epochs": train_epochs,
            "batch_size": batch_size,
        },
    }
    report_dir.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))

    subprocess.run([
        "/usr/local/miniconda3/bin/python",
        "scripts/generate_artifacts.py",
        "--eval-json", str(eval_json),
        "--config-json", str(config_path),
        "--output-dir", str(report_dir),
        "--model-version", f"{dataset}-smoke",
        "--data-version", f"{dataset}_local",
    ], check=True)

    return {
        "dataset": dataset,
        "inspect": inspect_summary,
        "prepare": prepare_summary,
        "validate": validate_summary,
        "model_dir": str(model_dir),
        "index_dir": str(index_dir),
        "report_dir": str(report_dir),
        "eval_json": str(eval_json),
    }


def main():
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="cmd", required=True)
......@@ -258,6 +352,16 @@ def main():
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("manifests_dir")

    p = sub.add_parser("smoke-local")
    p.add_argument("dataset", choices=sorted(ADAPTERS))
    p.add_argument("input_dir")
    p.add_argument("--output-root", default="data/external_smoke")
    p.add_argument("--eval-ratio", type=float, default=0.2)
    p.add_argument("--query-duration", type=float, default=8.0)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--train-epochs", type=int, default=1)
    p.add_argument("--batch-size", type=int, default=2)

    args = parser.parse_args()
    if args.cmd == "registry":
        path = write_registry(args.output)
......@@ -290,6 +394,18 @@ def main():
    elif args.cmd == "validate-local":
        summary = ADAPTERS[args.dataset].validate_local_manifests(Path(args.manifests_dir))
        print(json.dumps(summary, indent=2, ensure_ascii=False))
    elif args.cmd == "smoke-local":
        summary = smoke_local_dataset(
            dataset=args.dataset,
            input_dir=Path(args.input_dir),
            output_root=Path(args.output_root),
            eval_ratio=args.eval_ratio,
            query_duration=args.query_duration,
            seed=args.seed,
            train_epochs=args.train_epochs,
            batch_size=args.batch_size,
        )
        print(json.dumps(summary, indent=2, ensure_ascii=False))


if __name__ == "__main__":
......
......@@ -115,6 +115,35 @@
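- The open-data pipeline now goes beyond merely running; it also produces basic release/reporting artifacts
- Once real local FMA / MTG-Jamendo directories are swapped in, the same release flow can be reused directly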
### Stage: one-command open-dataset smoke
Completed:
- Extended `src/data/external_adapters.py`
- Added `smoke-local`
- One command now runs, in order:
  - inspect-local
  - prepare-local
  - validate-local
  - train
  - build-index
  - evaluate
  - generate_artifacts
Verification:
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/synthetic_v2/songs --output-root data/external_smoke --eval-ratio 0.2 --query-duration 5.0 --train-epochs 1 --batch-size 2` succeeded
- Current results:
  - `num_audio_files=24`
  - `catalog=24`
  - `train_queries=16`
  - `test_queries=8`
  - `top1=1.0`
  - `topk=1.0`
- Artifact directory: `data/external_smoke/fma_reports_smoke`
Conclusion:
- Replacing `input_dir` is now all it takes to run the full smoke against a real local FMA / MTG-Jamendo directory
- This lowers the cost of onboarding and validating real open datasets
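
The step chain in this stage stops at the first failing stage because each subprocess is run with `check=True`. A minimal sketch of that fail-fast pattern; `run_steps` is a hypothetical helper, and it uses `sys.executable` rather than a fixed interpreter path, which is an assumption about portability and not what the commit does.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_steps(steps: Sequence[Tuple[str, List[str]]]) -> List[str]:
    """Run each (name, argv) step in order and stop at the first failure."""
    completed = []
    for name, argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a broken earlier stage.
        subprocess.run([sys.executable, *argv], check=True)
        completed.append(name)
    return completed
```

A caller would pass entries such as `("train", ["train.py", "--data", manifests_dir])`, where the argument list mirrors the commands in `smoke_local_dataset`.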
### Stage: confused targeted optimization v6 (sample-level weighting)
Completed:
......
......@@ -11,6 +11,7 @@
3. **validate-local**
4. Then proceed to training and evaluation
5. Generate benchmark / model card / release artifacts
6. Or run the one-command `smoke-local` directly
---
......@@ -38,6 +39,7 @@ flowchart LR
| Pre-training validation | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `validate-local ...` | Confirm the structure is correct |
| Training smoke | [`train.py`](../acr-engine/train.py) `--data ... --dry-run` | Verify that the manifests can feed directly into training |
| Release artifacts | [`scripts/generate_artifacts.py`](../acr-engine/scripts/generate_artifacts.py) | Generate benchmark/model-card/release-checklist |
| One-command smoke | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `smoke-local ...` | Run the full pipeline automatically |
---
......@@ -61,6 +63,12 @@ flowchart LR
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
```
### 3.3 One-command smoke
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
```
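
After a run like the one above, the headline metrics can be read back from the report directory. A minimal sketch, assuming the `<output_root>/<dataset>_reports_smoke/eval.json` layout used by `smoke_local_dataset`; the helper name `load_smoke_metrics` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def load_smoke_metrics(output_root: str, dataset: str) -> Dict:
    """Read the headline metrics from the eval.json that smoke-local writes."""
    # Layout follows smoke_local_dataset: <output_root>/<dataset>_reports_smoke/eval.json
    eval_path = Path(output_root) / f"{dataset}_reports_smoke" / "eval.json"
    metrics = json.loads(eval_path.read_text())
    return {key: metrics[key] for key in ("num_queries", "top1", "topk")}
```

Reading the file is more robust than parsing the command's standard output, since the training and evaluation subprocesses may also print to the same stream.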
---
## 4. Output Artifacts
......@@ -95,6 +103,8 @@ flowchart LR
- `benchmark-report.md`
- `model-card.md`
- `release-checklist.md`
- `smoke-local`
- Returns a single summary covering inspect / prepare / validate results and report paths
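
A minimal sketch of a sanity check on that summary, using the keys returned by `smoke_local_dataset` in this commit; the helper name `missing_summary_keys` is an assumption for illustration.

```python
from typing import Dict, List

# Keys present in the dict returned by smoke_local_dataset.
EXPECTED_KEYS = (
    "dataset", "inspect", "prepare", "validate",
    "model_dir", "index_dir", "report_dir", "eval_json",
)


def missing_summary_keys(summary: Dict) -> List[str]:
    """Return the expected keys that are absent from a smoke-local summary."""
    return [key for key in EXPECTED_KEYS if key not in summary]
```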
---
......