Reduce ACR docs to the current song-centric storage design

Constraint: Keep only documentation that directly serves the current Phase-1 song-centric + fused-table storage and retrieval design.
Rejected: Preserve broad historical, dataset, business-export, and template docs in the main docs root | They increase handoff cost and blur the active design surface.
Confidence: high
Scope-risk: moderate
Directive: Treat postgresql-data-model.md as the single source of truth for where slices, models, and features are stored until a concrete fused DDL supersedes it.
Tested: git diff --check on touched docs; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files; final docs root reduced to 12 files
Not-tested: external markdown renderers and downstream readers that may still expect removed auxiliary docs
Showing 25 changed files with 181 additions and 3232 deletions
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
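| 3 | - Added a table and flowchart to `docs/postgresql-data-model.md` showing which tables hold slice data, models, and features, and fixed the current default trace-back chain as `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`. | ||
| 4 | - Narrowed `docs/README.md` to an entry point for the current song-centric design, and removed template, open-dataset, business-export, and historical-roadmap docs that are unrelated to the current design from the docs directory. | ||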
| 5 | |||
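| 3 | - Narrowed the doc entry chain: added `docs/start-here.md` and unified the onboarding path for new teammates as `README -> start-here -> session-handoff`. | 6 | - Narrowed the doc entry chain: added `docs/start-here.md` and unified the onboarding path for new teammates as `README -> start-here -> session-handoff`. |
| 4 | - Rewrote `docs/README.md`, reorganizing navigation into "onboarding / design / implementation / operations / roles" to lower the cost of a first read. | 7 | - Rewrote `docs/README.md`, reorganizing navigation into "onboarding / design / implementation / operations / roles" to lower the cost of a first read. |
| 5 | - Restructured `docs/session-handoff.md`, consolidating the latest Phase-1 runner, stable conclusions, blockers, and next actions into a single page. | 8 | - Restructured `docs/session-handoff.md`, consolidating the latest Phase-1 runner, stable conclusions, blockers, and next actions into a single page. |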
| ... | @@ -12,6 +15,8 @@ | ... | @@ -12,6 +15,8 @@ |
| 12 | 15 | ||
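| 13 | - Following the new constraint "fuse as much as possible and relate through multiple types", added a "fusion-first" modeling view to `docs/postgresql-data-model.md`: it recommends four generic tables, `media_entity`, `audio_object`, `feature_fact`, and `set_membership`, for the Phase-1 physical implementation, while keeping the logical layering `song/recording/asset/window/feature`. | 16 | - Following the new constraint "fuse as much as possible and relate through multiple types", added a "fusion-first" modeling view to `docs/postgresql-data-model.md`: it recommends four generic tables, `media_entity`, `audio_object`, `feature_fact`, and `set_membership`, for the Phase-1 physical implementation, while keeping the logical layering `song/recording/asset/window/feature`. |
| 14 | 17 | ||
| 18 | - Following the new constraint "versions do not matter for now; multiple audio files only need to map stably to the same song_id", further narrowed the default Phase-1 wording in `docs/postgresql-data-model.md`, `docs/README.md`, and `docs/start-here.md` to `song -> asset -> window -> feature`; `recording` becomes a future extension layer rather than a mandatory primary layer for now. | ||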
| 19 | |||
| 15 | ## 2026-06-04 | 20 | ## 2026-06-04 |
| 16 | 21 | ||
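| 17 | - Updated the top of `docs/README.md` to the same "shortest start path" as `session-handoff`, and re-ran `run_planner_validation_commands_live.py` with that entry command, confirming that the fresh result is still `executed_count=4` and `all_passed=true`. | 22 | - Updated the top of `docs/README.md` to the same "shortest start path" as `session-handoff`, and re-ran `run_planner_validation_commands_live.py` with that entry command, confirming that the fresh result is still `executed_count=4` and `all_passed=true`. | ... | ... |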
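The default trace-back chain named in the changelog entry above (`feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`) can be illustrated with in-memory rows. This is a minimal sketch under assumptions: the field names `object_type`, `parent_id`, and `entity_id` and the helper `trace_song_id` are hypothetical, since the actual columns live in `docs/postgresql-data-model.md`, which this diff does not show.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AudioObject:
    object_id: str
    object_type: str            # "asset" or "window" (hypothetical values)
    parent_id: Optional[str]    # window -> asset; None for an asset
    entity_id: Optional[str]    # asset -> media_entity (song); None for a window


@dataclass(frozen=True)
class FeatureFact:
    fact_id: str
    object_id: str              # the window the feature was computed on
    feature_type: str           # "fingerprint" or "embedding"


def trace_song_id(fact: FeatureFact, objects: Dict[str, AudioObject]) -> str:
    """Walk feature_fact -> window -> asset -> song and return the song id."""
    window = objects[fact.object_id]
    if window.object_type != "window" or window.parent_id is None:
        raise ValueError("feature_fact must point at a window")
    asset = objects[window.parent_id]
    if asset.object_type != "asset" or asset.entity_id is None:
        raise ValueError("window must point at an asset linked to a song")
    return asset.entity_id


objects = {
    "asset-1": AudioObject("asset-1", "asset", None, "song-42"),
    "win-7": AudioObject("win-7", "window", "asset-1", None),
}
fact = FeatureFact("fact-1", "win-7", "embedding")
print(trace_song_id(fact, objects))
```

Keeping both windows and assets in one `audio_object` shape mirrors the "fusion-first" idea: the object type, not a separate table, distinguishes the layers.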
| 1 | # ACR Docs Overview | 1 | # ACR Docs Overview |
| 2 | 2 | ||
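| 3 | > Main entry point for music ACR docs covering "copyright protection / song recognition / version attribution". | 3 | > Only docs directly related to the **song-centric + fusion-first** ACR design are kept for now. |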
| 4 | 4 | ||
| 5 | --- | 5 | --- |
| 6 | 6 | ||
| 7 | ## 0. What new teammates should do first | 7 | ## 0. What new teammates should do first |
| 8 | 8 | ||
| 9 | ### Run it first; do not start by reading a pile of docs | ||
| 10 | |||
| 11 | ```bash | 9 | ```bash |
| 12 | cd /workspace/acr-engine | 10 | cd /workspace/acr-engine |
| 13 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | 11 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ |
| ... | @@ -21,105 +19,53 @@ cd /workspace/acr-engine | ... | @@ -21,105 +19,53 @@ cd /workspace/acr-engine |
| 21 | - `executed_count = 4` | 19 | - `executed_count = 4` |
| 22 | - `all_passed = true` | 20 | - `all_passed = true` |
| 23 | 21 | ||
| 24 | ### Then follow this reading path | ||
| 25 | 1. [start-here.md](./start-here.md) | ||
| 26 | 2. [session-handoff.md](./session-handoff.md) | ||
| 27 | 3. [acr-architecture.md](./acr-architecture.md) | ||
| 28 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 29 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 30 | |||
| 31 | --- | 22 | --- |
| 32 | 23 | ||
| 33 | ## 1. Documentation map | 24 | ## 1. Current default design |
| 34 | 25 | ||
| 35 | ### A. Taking over the project / restoring context | 26 | Phase-1 currently defaults to the following reading: |
| 36 | - [start-here.md](./start-here.md): 10-minute onboarding entry for new teammates | ||
| 37 | - [session-handoff.md](./session-handoff.md): current status, blockers, next steps | ||
| 38 | - [CHANGELOG.md](./CHANGELOG.md): change log | ||
| 39 | 27 | ||
| 40 | ### B. System design / main design line | 28 | ```text |
| 41 | - [acr-architecture.md](./acr-architecture.md): overall architecture and layering | 29 | song -> asset -> window -> fingerprint / embedding |
| 42 | - [sota-evolution-guide.md](./sota-evolution-guide.md): SOTA evolution path | 30 | ``` |
| 43 | - [postgresql-data-model.md](./postgresql-data-model.md): PostgreSQL master data / feature model | ||
| 44 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy | ||
| 45 | 31 | ||
| 46 | ### C. How to land the first phase | 32 | Corresponding fusion-first physical tables: |
| 47 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 execution checklist | ||
| 48 | - [postgresql-data-model.md](./postgresql-data-model.md): includes the Phase-1 minimal schema and the fusion-first view | ||
| 49 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set bootstrap | ||
| 50 | - [phase1-worker-contract.md](./phase1-worker-contract.md): contract for workers, jobs, and failure semantics | ||
| 51 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples | ||
| 52 | 33 | ||
| 53 | ### D. Operations / services / data governance | 34 | ```text |
| 54 | - [runbook.md](./runbook.md): operations runbook | 35 | media_entity -> audio_object -> feature_fact -> set_membership |
| 55 | - [service-api.md](./service-api.md): service API | 36 | ``` |
| 56 | - [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md): training / vector retrieval guide | ||
| 57 | - [open-dataset-workflow.md](./open-dataset-workflow.md): open dataset onboarding workflow | ||
| 58 | 37 | ||
| 59 | --- | 38 | --- |
| 60 | 39 | ||
| 61 | ## 2. Reading by role | 40 | ## 2. Required reading |
| 62 | 41 | ||
| 63 | ### Product / business / copyright strategy | ||
| 64 | 1. [start-here.md](./start-here.md) | 42 | 1. [start-here.md](./start-here.md) |
| 65 | 2. [acr-architecture.md](./acr-architecture.md) | ||
| 66 | 3. [project-responsibility-map.md](./project-responsibility-map.md) | ||
| 67 | 4. [business-export-cookbook.md](./business-export-cookbook.md) | ||
| 68 | |||
| 69 | ### Data / platform / PostgreSQL | ||
| 70 | 1. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 71 | 2. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 72 | 3. [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | ||
| 73 | 4. [runbook.md](./runbook.md) | ||
| 74 | |||
| 75 | ### Algorithms / retrieval / models | ||
| 76 | 1. [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 77 | 2. [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ||
| 78 | 3. [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 79 | 4. [sota-research-2026.md](./sota-research-2026.md) | ||
| 80 | |||
| 81 | ### Development / implementation / delivery | ||
| 82 | 1. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 83 | 2. [session-handoff.md](./session-handoff.md) | 43 | 2. [session-handoff.md](./session-handoff.md) |
| 84 | 3. [CHANGELOG.md](./CHANGELOG.md) | 44 | 3. [acr-architecture.md](./acr-architecture.md) |
| 85 | 4. [release-checklist.md](./release-checklist.md) | 45 | 4. [postgresql-data-model.md](./postgresql-data-model.md) |
| 46 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 86 | 47 | ||
| 87 | --- | 48 | --- |
| 88 | 49 | ||
| 89 | ## 3. Most important stable conclusions so far | 50 | ## 3. Implementation docs |
| 90 | 51 | ||
| 91 | - The target scenario is not ordinary song recommendation but **copyright protection / song recognition / version attribution**. | 52 | - [postgresql-data-model.md](./postgresql-data-model.md): the only default data model for now; includes the table mapping for slices/models/features and a flowchart |
| 92 | - Phase-1 takes the **encoder-only** route first and does not start by fine-tuning the base model. | 53 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples |
| 93 | - exact lane: `Chromaprint`. | 54 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set bootstrap |
| 94 | - semantic baseline: `MERT-v1-95M`. | 55 | - [phase1-worker-contract.md](./phase1-worker-contract.md): contract for workers, jobs, and failure semantics |
| 95 | - semantic challenger: `MuQ`. | 56 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 implementation checklist |
| 96 | - `ECAPA` is kept as a historical baseline and is no longer the long-term primary base model. | 57 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy |
| 97 | - The evolvable full main chain is: | 58 | - [sota-evolution-guide.md](./sota-evolution-guide.md): current SOTA evolution main line |
| 98 | |||
| 99 | ```text | ||
| 100 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 101 | ``` | ||
| 102 | |||
| 103 | - For the Phase-1 minimal skeleton alone, read it as follows first: | ||
| 104 | |||
| 105 | ```text | ||
| 106 | song -> recording -> asset -> window -> fingerprint / embedding | ||
| 107 | ``` | ||
| 108 | |||
| 109 | - The model/feature main chain is fixed as: | ||
| 110 | |||
| 111 | ```text | ||
| 112 | model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry | ||
| 113 | ``` | ||
| 114 | 59 | ||
| 115 | --- | 60 | --- |
| 116 | 61 | ||
| 117 | ## 4. Directions not worth time right now | 62 | ## 4. Current stable conclusions |
| 118 | 63 | ||
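| 119 | - Do not fall back to a flat structure with only a `song_id`. | 64 | - The final attribution target currently only needs to return a stable `song_id` |
| 120 | - Do not store embeddings as fixed columns (such as `mert_embedding` / `muq_embedding`). | 65 | - One `song` may have multiple audio files |
| 121 | - Do not start Phase-1 by discussing retraining the base model. | 66 | - `recording/version` is not a required return object for now |
| 122 | - Do not mistake the current blocker for a PostgreSQL schema problem; the main blockers are audio mounting and runtime dependencies. | 67 | - `window` is kept because it is the smallest unit of evidence / offset / retrieval |
| 68 | - `feature_fact` holds both `fingerprint` and `embedding` | ||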
| 123 | 69 | ||
| 124 | --- | 70 | --- |
| 125 | 71 | ||
| ... | @@ -129,19 +75,4 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> | ... | @@ -129,19 +75,4 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> |
| 129 | /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs | 75 | /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs |
| 130 | ``` | 76 | ``` |
| 131 | 77 | ||
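| 132 | Purpose: after cleaning up or reorganizing docs, quickly find broken relative links under `docs/`. Historical archive docs such as `CHANGELOG.md` are skipped by default. | 78 | Historical archive docs such as `CHANGELOG.md` are skipped by default. |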
| 133 | |||
| 134 | --- | ||
| 135 | |||
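| 136 | ## 6. Supplementary docs, not recommended as a first entry | ||
| 137 | |||
| 138 | The following docs are kept as topical supplements; new teammates should not read them in the first pass: | ||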
| 139 | - [dataset-spec.md](./dataset-spec.md) | ||
| 140 | - [dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md) | ||
| 141 | - [references-and-sources.md](./references-and-sources.md) | ||
| 142 | - [current-capability-map.md](./current-capability-map.md) | ||
| 143 | - [industrialization-roadmap.md](./industrialization-roadmap.md) | ||
| 144 | - [industrial-benchmark-spec.md](./industrial-benchmark-spec.md) | ||
| 145 | - [benchmark-report-template.md](./benchmark-report-template.md) | ||
| 146 | - [model-card-template.md](./model-card-template.md) | ||
| 147 | - [report-layout.md](./report-layout.md) | ... | ... |
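The README hunk above removes many links and relies on `scripts/check_markdown_links.py` to catch broken ones. That script is not part of this diff, so the following is only an independent sketch of what a relative-link check may look like, not the script's actual implementation. The function name `find_broken_links` and the `skip` default are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Matches inline links such as [text](./file.md) or [text](file.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(root, skip=("CHANGELOG.md",)):
    """Return (markdown file name, target) pairs whose relative target is missing."""
    broken = []
    for md in sorted(Path(root).rglob("*.md")):
        if md.name in skip:
            continue  # historical archive docs are not checked
        for target in LINK_RE.findall(md.read_text(encoding="utf-8")):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL or same-page anchor
            path = target.split("#", 1)[0]
            if not (md.parent / path).exists():
                broken.append((md.name, target))
    return broken


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "start-here.md").write_text("# start", encoding="utf-8")
    (docs / "README.md").write_text(
        "[ok](./start-here.md) [gone](./dataset-spec.md)", encoding="utf-8"
    )
    (docs / "CHANGELOG.md").write_text("[old](./removed.md)", encoding="utf-8")
    result = find_broken_links(docs)
    print(result)
```

In this toy tree only the README link to the removed `dataset-spec.md` is reported; the stale link inside `CHANGELOG.md` is ignored because that file is skipped.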
| ... | @@ -79,21 +79,22 @@ flowchart TD | ... | @@ -79,21 +79,22 @@ flowchart TD |
| 79 | then Phase-1 can simply proceed with the following minimal skeleton: | 79 | then Phase-1 can simply proceed with the following minimal skeleton: |
| 80 | 80 | ||
| 81 | ```text | 81 | ```text |
| 82 | song -> recording -> asset -> window -> fingerprint / embedding | 82 | song -> asset -> window -> fingerprint / embedding |
| 83 | ``` | 83 | ``` |
| 84 | 84 | ||
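| 85 | Why these are kept: | 85 | Why these are kept: |
| 86 | - `recording` cannot be removed: one song has multiple versions | ||
| 87 | - `window` cannot be removed: it is the smallest unit for offset / evidence / multi-segment voting | 86 | - `window` cannot be removed: it is the smallest unit for offset / evidence / multi-segment voting |
| 88 | - `feature_set_registry` cannot be removed: otherwise a future switch to MERT/MuQ would hard-code the schema | 87 | - `feature_set_registry` / `feature_fact` cannot be removed: otherwise a future switch to MERT/MuQ would hard-code the schema |
| 88 | - `asset` cannot be removed: one `song` has multiple real audio files | ||
| 89 | 89 | ||
| 90 | Can be deferred: | 90 | Can be deferred: |
| 91 | - `recording` | ||
| 91 | - `work` | 92 | - `work` |
| 92 | - a heavier `retrieval_index_registry` | 93 | - a heavier `retrieval_index_registry` |
| 93 | - finer-grained end-to-end audit tables | 94 | - finer-grained end-to-end audit tables |
| 94 | 95 | ||
| 95 | So the recommended position is not "cut every layer", but: | 96 | So the recommended position is not "cut every layer", but: |
| 96 | > **Phase-1 ships the minimal usable layer first; version attribution / cover / work governance add layers later.** | 97 | > **Phase-1 ships the song-centric minimal usable layer first; version attribution / cover / work governance add layers later.** |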
| 97 | 98 | ||
| 98 | --- | 99 | --- |
| 99 | 100 | ||
| ... | @@ -125,7 +126,6 @@ song -> recording -> asset -> window -> fingerprint / embedding | ... | @@ -125,7 +126,6 @@ song -> recording -> asset -> window -> fingerprint / embedding |
| 125 | Read first: | 126 | Read first: |
| 126 | - This document | 127 | - This document |
| 127 | - [postgresql-data-model.md](./postgresql-data-model.md) | 128 | - [postgresql-data-model.md](./postgresql-data-model.md) |
| 128 | - [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 129 | 129 | ||
| 130 | --- | 130 | --- |
| 131 | 131 | ||
| ... | @@ -139,8 +139,8 @@ song -> recording -> asset -> window -> fingerprint / embedding | ... | @@ -139,8 +139,8 @@ song -> recording -> asset -> window -> fingerprint / embedding |
| 139 | 139 | ||
| 140 | Read first: | 140 | Read first: |
| 141 | - This document | 141 | - This document |
| 142 | - [service-api.md](./service-api.md) | ||
| 143 | - [postgresql-data-model.md](./postgresql-data-model.md) | 142 | - [postgresql-data-model.md](./postgresql-data-model.md) |
| 143 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 144 | 144 | ||
| 145 | --- | 145 | --- |
| 146 | 146 | ||
| ... | @@ -154,8 +154,8 @@ song -> recording -> asset -> window -> fingerprint / embedding | ... | @@ -154,8 +154,8 @@ song -> recording -> asset -> window -> fingerprint / embedding |
| 154 | 154 | ||
| 155 | Read first: | 155 | Read first: |
| 156 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | 156 | - [sota-evolution-guide.md](./sota-evolution-guide.md) |
| 157 | - [sota-research-2026.md](./sota-research-2026.md) | ||
| 158 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | 157 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) |
| 158 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 159 | 159 | ||
| 160 | --- | 160 | --- |
| 161 | 161 | ||
| ... | @@ -235,4 +235,4 @@ flowchart LR | ... | @@ -235,4 +235,4 @@ flowchart LR |
| 235 | If you are: | 235 | If you are: |
| 236 | - **Architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next | 236 | - **Architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next |
| 237 | - **Data/backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next | 237 | - **Data/backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next |
| 238 | - **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then return to [sota-research-2026.md](./sota-research-2026.md) | 238 | - **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ... | ... |
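The start-here hunk keeps `window` as the smallest unit for offset, evidence, and multi-segment voting while the result only needs a stable `song_id`. A minimal sketch of such voting follows, assuming each window-level hit already carries a `song_id` and a similarity score. `WindowHit`, `vote_song`, and the sum-of-scores rule are hypothetical choices for illustration, not the project's retrieval logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowHit:
    song_id: str
    asset_id: str       # a song may own several audio files
    offset_sec: float   # where in the asset the matching window starts
    score: float


def vote_song(hits):
    """Sum window scores per song; return (song_id, evidence hits) or None."""
    if not hits:
        return None
    totals = defaultdict(float)
    for hit in hits:
        totals[hit.song_id] += hit.score
    best = max(totals, key=totals.get)
    evidence = [h for h in hits if h.song_id == best]
    return best, evidence


hits = [
    WindowHit("song-1", "asset-a", 0.0, 0.9),
    WindowHit("song-1", "asset-b", 5.0, 0.8),   # second file of the same song
    WindowHit("song-2", "asset-c", 0.0, 0.95),
]
best, evidence = vote_song(hits)
print(best, len(evidence))
```

Hits from different assets of the same song accumulate under one `song_id`, which is the behavior the song-centric wording asks for; the retained hits serve as offset-level evidence.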
docs/benchmark-report-template.md
deleted
100644 → 0
| 1 | # Benchmark Report Template | ||
| 2 | |||
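| 3 | > Used for the evaluation output of each model version | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | - Model version: | ||
| 7 | - Data version: | ||
| 8 | - Key conclusions: | ||
| 9 | - Passes the release gate: | ||
| 10 | |||
| 11 | ## 1. Evaluation scope diagram | ||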
| 12 | |||
| 13 | ```mermaid | ||
| 14 | flowchart LR | ||
| 15 | A[Model Version] --> B[Datasets] | ||
| 16 | A --> C[Scenario Buckets] | ||
| 17 | A --> D[Latency / Ops] | ||
| 18 | ``` | ||
| 19 | |||
| 20 | ## 2. Metrics table | ||
| 21 | |||
| 22 | | Bucket | top1 | top5 | MRR | FAR | Notes | | ||
| 23 | |---|---:|---:|---:|---:|---| | ||
| 24 | | clean | | | | | | | ||
| 25 | | humming_like | | | | | | | ||
| 26 | | confused | | | | | | | ||
| 27 | |||
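| 28 | ## 3. Written analysis | ||
| 29 | - Strongest area: | ||
| 30 | - Weakest area: | ||
| 31 | - Comparison with the previous version: | ||
| 32 | |||
| 33 | ## 4. Detailed appendix | ||
| 34 | - Evaluation commands | ||
| 35 | - Data manifest | ||
| 36 | - Path to the raw JSON report | ||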
| 37 | |||
| 38 | ## Sources | ||
| 39 | - `docs/industrial-benchmark-spec.md` |
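The removed template's metrics table lists `top1`, `top5`, and `MRR` per bucket. As a reminder of what those columns mean, here is a small sketch using the standard definitions of top-k accuracy and mean reciprocal rank. The function names and the toy data are illustrative assumptions; FAR is left out because the template does not define how it is measured.

```python
def top_k_accuracy(ranked, truth, k):
    """Fraction of queries whose true song id appears in the first k results."""
    hits = sum(1 for results, target in zip(ranked, truth) if target in results[:k])
    return hits / len(truth)


def mean_reciprocal_rank(ranked, truth):
    """Average of 1/rank of the true song id; a miss contributes 0."""
    total = 0.0
    for results, target in zip(ranked, truth):
        if target in results:
            total += 1.0 / (results.index(target) + 1)
    return total / len(truth)


# Three toy queries: a hit at rank 1, a hit at rank 2, and a miss.
ranked = [["s1", "s2"], ["s3", "s2"], ["s4", "s5"]]
truth = ["s1", "s2", "s9"]

print(top_k_accuracy(ranked, truth, 1))
print(top_k_accuracy(ranked, truth, 5))
print(mean_reciprocal_rank(ranked, truth))
```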
docs/business-export-cookbook.md
deleted
100644 → 0
| 1 | # Business Export Cookbook | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) · [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | If the next session needs to export real training/evaluation manifests from your business tables, follow this order: | ||
| 9 | |||
| 10 | 1. Export the base fields of audio assets with SQL | ||
| 11 | 2. Fill in `role` / `bucket` using the `type-role mapping` | ||
| 12 | 3. Write an intermediate CSV or JSONL file | ||
| 13 | 4. Convert it into the project manifest | ||
| 14 | 5. Alternatively, use the repository script to convert directly into manifest-ready JSONL | ||
| 15 | |||
| 16 | The repository already includes these reference files: | ||
| 17 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 18 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 19 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv) | ||
| 20 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl) | ||
| 21 | |||
| 22 | --- | ||
| 23 | |||
| 24 | ## 1. Recommended SQL export fields | ||
| 25 | |||
| 26 | ```sql | ||
| 27 | SELECT | ||
| 28 | s.id AS song_id, | ||
| 29 | a.id AS asset_id, | ||
| 30 | a.type AS type, | ||
| 31 | a.file_path AS audio_path, | ||
| 32 | s.title AS title, | ||
| 33 | s.artist_name AS artist, | ||
| 34 | s.album_id AS album_id, | ||
| 35 | a.duration_sec AS duration_sec, | ||
| 36 | a.sample_rate AS sample_rate, | ||
| 37 | a.bitrate AS bitrate, | ||
| 38 | a.license_code AS license, | ||
| 39 | a.created_at AS created_at | ||
| 40 | FROM music_asset a | ||
| 41 | JOIN song s ON s.id = a.song_id | ||
| 42 | WHERE a.type IN (1,7,8,9,10,11,16,18,2,12); | ||
| 43 | ``` | ||
| 44 | |||
| 45 | Notes: | ||
| 46 | - This SQL is not mandatory; it is only a field-mapping sample. | ||
| 47 | - What matters is not the table names but gathering the fields the manifest spec requires. | ||
| 48 | |||
| 49 | --- | ||
| 50 | |||
| 51 | ## 2. Fields to fill in after export | ||
| 52 | |||
| 53 | | Field | Source | Notes | | ||
| 54 | |---|---|---| | ||
| 55 | | `role` | `business_type_role_mapping.json` | Mapped from `type` | | ||
| 56 | | `bucket` | `business_type_role_mapping.json` | Default business bucket | | ||
| 57 | | `split` | Export script or post-processing | `train/val/test/holdout` | | ||
| 58 | | `source_dataset` | Fixed value | e.g. `internal_catalog` | | ||
| 59 | | `offset_sec` | Can be set for clip assets | Set to `0` for non-clips for now | | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
| 63 | ## 3. Recommended intermediate formats | ||
| 64 | |||
| 65 | ### CSV | ||
| 66 | Suited for: | ||
| 67 | - business teammates exporting the data first | ||
| 68 | - checking in Excel / spreadsheet tools | ||
| 69 | |||
| 70 | Sample: | ||
| 71 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv) | ||
| 72 | |||
| 73 | ### JSONL | ||
| 74 | Suited for: | ||
| 75 | - streaming processing in scripts | ||
| 76 | - direct conversion to a manifest later | ||
| 77 | |||
| 78 | Sample: | ||
| 79 | - [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl) | ||
| 80 | |||
| 81 | --- | ||
| 82 | |||
| 83 | ## 4. Suggested post-processing rules | ||
| 84 | |||
| 85 | 1. `type=10/11` defaults to `reference` | ||
| 86 | 2. `type=1/9` defaults to compressed-domain `reference` | ||
| 87 | 3. `type=7/8/16` defaults to `query` | ||
| 88 | 4. `type=18/2/12` defaults to `excluded` for now | ||
| 89 | 5. Non-audio assets are filtered out directly | ||
| 90 | |||
| 91 | --- | ||
| 92 | |||
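| 93 | ## 5. Most direct actions for the next session | ||
| 94 | |||
| 95 | 1. Export real data once from the business database using the SQL sample | ||
| 96 | 2. Save it as CSV or JSONL | ||
| 97 | 3. Fill in `role` / `bucket` with the repository's mapping rules | ||
| 98 | 4. Convert it into the manifest the project needs | ||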
| 99 | |||
| 100 | ## Sources | ||
| 101 | - See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) | ||
| 102 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 103 | |||
| 104 | |||
| 105 | ## 6. Lightweight normalization script | ||
| 106 | |||
| 107 | The repository already includes a directly runnable conversion script: | ||
| 108 | - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py) | ||
| 109 | |||
| 110 | Example: | ||
| 111 | |||
| 112 | ```bash | ||
| 113 | cd /workspace/acr-engine | ||
| 114 | /usr/local/miniconda3/bin/python scripts/normalize_business_export.py \ | ||
| 115 | --input configs/manifests/examples/business_asset_export_example.csv \ | ||
| 116 | --output /tmp/business_asset_manifest_ready.jsonl | ||
| 117 | ``` | ||
| 118 | |||
| 119 | The script will: | ||
| 120 | 1. Read a CSV or JSONL export | ||
| 121 | 2. Apply `business_type_role_mapping.json` | ||
| 122 | 3. Fill in default values for `role / bucket / source_dataset / split` | ||
| 123 | 4. Write manifest-ready JSONL | ||
| 124 | |||
| 125 | |||
| 126 | ## 7. Split into per-role lists | ||
| 127 | |||
| 128 | If you already have manifest-ready JSONL, you can continue with: | ||
| 129 | - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py) | ||
| 130 | |||
| 131 | Example: | ||
| 132 | |||
| 133 | ```bash | ||
| 134 | cd /workspace/acr-engine | ||
| 135 | /usr/local/miniconda3/bin/python scripts/split_business_manifest_ready.py \ | ||
| 136 | --input /tmp/business_asset_manifest_ready.jsonl \ | ||
| 137 | --output-dir /tmp/business_asset_manifest_split | ||
| 138 | ``` | ||
| 139 | |||
| 140 | It writes: | ||
| 141 | - `reference.json` | ||
| 142 | - `query.json` | ||
| 143 | - `excluded.json` | ||
| 144 | |||
| 145 | This lets the next session shape business assets into the lists needed for training/evaluation more quickly. | ||
| 146 | |||
| 147 | |||
| 148 | ## 8. Generate project manifests | ||
| 149 | |||
| 150 | If you already have manifest-ready JSONL, you can go on to generate the four manifests the project currently needs: | ||
| 151 | - [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py) | ||
| 152 | - [business-project-manifest-adapter.md](./business-project-manifest-adapter.md) |
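The cookbook's post-processing rules (section 4) can be sketched as a small mapping step. This is not the code of `normalize_business_export.py`, which this diff does not show; `TYPE_RULES` and `normalize_row` are hypothetical names, and the bucket names are taken from the default type-role table in the spec that follows. The real rules live in `business_type_role_mapping.json`, whose layout is also not shown here.

```python
# Hypothetical in-code copy of the default rules; the real source is
# business_type_role_mapping.json, whose layout is not shown in this diff.
TYPE_RULES = {
    10: ("reference", "lossless_reference_core"),
    11: ("reference", "lossless_reference_core"),
    1: ("reference", "compressed_reference_realworld"),
    9: ("reference", "compressed_reference_realworld"),
    7: ("query", "short_video_hook"),
    8: ("query", "short_video_hook"),
    16: ("query", "short_video_hook"),
    18: ("excluded", "demo_variation_pool"),
    2: ("excluded", "with_harmony_shift"),
    12: ("excluded", "with_harmony_shift"),
}


def normalize_row(row, source_dataset="internal_catalog"):
    """Fill role/bucket/defaults for one exported row; None for non-audio types."""
    rule = TYPE_RULES.get(int(row["type"]))
    if rule is None:
        return None  # rule 5: non-audio assets are filtered out
    role, bucket = rule
    out = dict(row)
    out["role"] = role
    out["bucket"] = bucket
    out.setdefault("source_dataset", source_dataset)
    out.setdefault("offset_sec", 0)
    return out


rows = [
    {"song_id": "s1", "asset_id": "a1", "type": "11"},
    {"song_id": "s1", "asset_id": "a2", "type": "7"},
    {"song_id": "s1", "asset_id": "a3", "type": "5"},  # type 5: authorization letter
]
normalized = [r for r in map(normalize_row, rows) if r is not None]
print([(r["asset_id"], r["role"], r["bucket"]) for r in normalized])
```

The `split` field is deliberately not assigned here, since the cookbook leaves the split policy to the export script or later post-processing.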
| 1 | # Business Manifest and Type-Role Spec | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Business Asset Type and Bucket Guide](./business-music-bucket-and-type-guide.md) · [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The repository now has two directly reusable business onboarding templates: | ||
| 9 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 10 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 11 | |||
| 12 | They answer two questions: | ||
| 13 | 1. Which manifest fields the business table fields must map to at minimum. | ||
| 14 | 2. Which of `reference / query / excluded` each of your `type` values should fall into by default. | ||
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Mapping diagram | ||
| 19 | |||
| 20 | ```mermaid | ||
| 21 | flowchart LR | ||
| 22 | A[Business table record] --> B[type-role mapping] | ||
| 23 | B --> C[reference] | ||
| 24 | B --> D[query] | ||
| 25 | B --> E[excluded] | ||
| 26 | C --> F[manifest rows] | ||
| 27 | D --> F | ||
| 28 | F --> G[train / build-index / evaluate] | ||
| 29 | ``` | ||
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 33 | ## 2. Minimal manifest fields | ||
| 34 | |||
| 35 | | Field | Required | Notes | | ||
| 36 | |---|---|---| | ||
| 37 | | `song_id` | Yes | Primary song ID | | ||
| 38 | | `asset_id` | Yes | Specific asset ID | | ||
| 39 | | `type` | Yes | Your existing asset type | | ||
| 40 | | `role` | Yes | `reference` / `query` / `excluded` | | ||
| 41 | | `split` | Yes | `train` / `val` / `test` / `holdout` | | ||
| 42 | | `audio_path` | Yes | Accessible audio path | | ||
| 43 | | `source_dataset` | Yes | Source identifier | | ||
| 44 | | `bucket` | No | Bucketed evaluation label | | ||
| 45 | | `offset_sec` | No | Query start offset | | ||
| 46 | | `duration_sec` | No | Clip length | | ||
| 47 | |||
| 48 | --- | ||
| 49 | |||
| 50 | ## 3. Default type-role rules | ||
| 51 | |||
| 52 | | type | Default role | Default bucket | Notes | | ||
| 53 | |---:|---|---|---| | ||
| 54 | | `10` / `11` | `reference` | `lossless_reference_core` | Lossless main library | | ||
| 55 | | `9` / `1` | `reference` | `compressed_reference_realworld` | Real-world compressed distribution | | ||
| 56 | | `8` / `7` / `16` | `query` | `short_video_hook` | Short-video / chorus entry | | ||
| 57 | | `18` | `excluded` | `demo_variation_pool` | Manual screening first | | ||
| 58 | | `2` / `12` | `excluded` | `with_harmony_shift` | Dedicated bucket first | | ||
| 59 | | All other non-audio types | `excluded` | `non_audio` | Not used for modeling | | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
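| 63 | ## 4. Export principles | ||
| 64 | |||
| 65 | 1. **Even for the same song, do not merge reference and query into one asset record.** | ||
| 66 | 2. **If a record cannot be confirmed as the same song and same version, default to `excluded`.** | ||
| 67 | 3. **Do not automatically merge `type=18 demo` into train; review it manually first.** | ||
| 68 | 4. **Export short-video clips as `query` first; do not use them directly as reference.** | ||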
| 69 | |||
| 70 | --- | ||
| 71 | |||
| 72 | ## 5. Templates and scripts | ||
| 73 | |||
| 74 | - Manifest template: | ||
| 75 | - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json) | ||
| 76 | - Type-role template: | ||
| 77 | - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json) | ||
| 78 | - Print script: | ||
| 79 | - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py) | ||
| 80 | - Normalization script: | ||
| 81 | - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py) | ||
| 82 | - Role-splitting script: | ||
| 83 | - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py) | ||
| 84 | |||
| 85 | Example command: | ||
| 86 | |||
| 87 | ```bash | ||
| 88 | cd /workspace/acr-engine | ||
| 89 | /usr/local/miniconda3/bin/python scripts/print_business_type_mapping.py | ||
| 90 | ``` | ||
| 91 | |||
| 92 | --- | ||
| 93 | |||
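| 94 | ## 6. Direct actions for the next session | ||
| 95 | |||
| 96 | 1. Map table fields to manifest rows according to this spec. | ||
| 97 | 2. Use `business_type_role_mapping.json` to assign a default `role` / `bucket` to each asset. | ||
| 98 | 3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark. | ||
| 99 | |||
| 100 | ## Further reading | ||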
| 101 | - [business-export-cookbook.md](./business-export-cookbook.md) | ||
| 102 | |||
| 103 | ## Sources | ||
| 104 | - See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 105 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) |
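The minimal manifest fields table in this spec translates naturally into a row validator. The sketch below checks only what the table states: the seven required fields and the allowed `role` and `split` values. `validate_manifest_row` and the error message wording are hypothetical, and this is not a replacement for whatever checks the repository scripts perform.

```python
REQUIRED_FIELDS = (
    "song_id", "asset_id", "type", "role", "split", "audio_path", "source_dataset",
)
ROLES = {"reference", "query", "excluded"}
SPLITS = {"train", "val", "test", "holdout"}


def validate_manifest_row(row):
    """Return a list of problems for one manifest row; empty means valid."""
    errors = []
    for name in REQUIRED_FIELDS:
        if row.get(name) in (None, ""):
            errors.append(f"missing field: {name}")
    role = row.get("role")
    if role not in (None, "") and role not in ROLES:
        errors.append(f"invalid role: {role}")
    split = row.get("split")
    if split not in (None, "") and split not in SPLITS:
        errors.append(f"invalid split: {split}")
    return errors


good = {
    "song_id": "s1", "asset_id": "a1", "type": 11, "role": "reference",
    "split": "train", "audio_path": "/data/a1.flac", "source_dataset": "internal_catalog",
}
bad = dict(good, role="demo", audio_path="")

print(validate_manifest_row(good))
print(validate_manifest_row(bad))
```

Optional fields (`bucket`, `offset_sec`, `duration_sec`) are intentionally not validated, matching their "No" entries in the table.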
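| 1 | # Business Music Bucket and Type Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Industrial Benchmark Spec](./industrial-benchmark-spec.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | For your existing asset `type` field, **do not mix every file into training**. | ||
| 9 | A three-layer split is recommended instead: reference primary assets + query derived assets + hard-case evaluation assets. | ||
| 10 | |||
| 11 | ### Types most recommended for training / index building | ||
| 12 | |||
| 13 | | Priority | type | Meaning | Training use | | ||
| 14 | |---|---:|---|---| | ||
| 15 | | High | `10` | Accompaniment without harmony, lossless | Cleanest reference candidate | | ||
| 16 | | High | `11` | Original track, lossless | Primary reference / primary training asset | | ||
| 17 | | High | `9` | Accompaniment without harmony, compressed | Reference supplement / compressed-domain adaptation | | ||
| 18 | | High | `1` | Original track, compressed | Training-domain supplement / real online distribution | | ||
| 19 | | Medium | `18` | Audio demo | Usable as a weak-supervision supplement; needs manual screening | | ||
| 20 | | Medium | `8` | Clip (chorus) | Usable for repeated-section / highly distinctive queries | | ||
| 21 | | Medium | `7` | Douyin clip | Usable for short-video-domain query evaluation | | ||
| 22 | | Medium | `16` | Kuaishou clip | Usable for short-video-domain query evaluation | | ||
| 23 | |||
| 24 | ### Types that usually do not go directly into main training | ||
| 25 | |||
| 26 | | type | Meaning | Reason | | ||
| 27 | |---:|---|---| | ||
| 28 | | `2` / `12` | Accompaniment with harmony | Easily introduces extra "same song, different vocal layer" variation; better suited to a separate later experiment | | ||
| 29 | | `3` / `13` / `20` | Lyrics / LRC / scrolling translated lyrics | Non-audio asset | | ||
| 30 | | `4` / `14` / `19` | Cover art / PSD / sheet-music image | Non-audio asset | | ||
| 31 | | `5` | Authorization letter | Compliance file, not used for modeling | | ||
| 32 | | `6` | Album info | Metadata, not used for modeling | | ||
| 33 | | `17` | Lyrics-and-composition archive | Must be unpacked first; should not be used for modeling directly | | ||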
| 34 | |||
| 35 | --- | ||
| 36 | |||
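| 37 | ## 1. Business asset responsibility diagram | ||
| 38 | |||
| 39 | ```mermaid | ||
| 40 | flowchart LR | ||
| 41 | A[Lossless primary assets\n10 / 11] --> B[reference main library] | ||
| 42 | C[Compressed primary assets\n1 / 9] --> D[Training-domain augmentation] | ||
| 43 | E[Short-video clips\n7 / 16 / 8] --> F[query evaluation set] | ||
| 44 | G[Recordings/demo\n18] --> H[Weak-supervision supplement pool] | ||
| 45 | B --> I[Training / index building] | ||
| 46 | D --> I | ||
| 47 | F --> J[Short-clip evaluation / hard-case] | ||
| 48 | H --> K[Enter I or J only after manual screening] | ||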
| 49 | ``` | ||
| 50 | |||
| 51 | --- | ||
| 52 | |||
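| 53 | ## 2. How to use your types | ||
| 54 | |||
| 55 | ## 2.1 Recommended for main training / main index building | ||
| 56 | |||
| 57 | ### A. First priority: `10` + `11` | ||
| 58 | |||
| 59 | Reasons: | ||
| 60 | - Best audio quality | ||
| 61 | - Most stable label semantics | ||
| 62 | - Best suited as the ground-truth reference | ||
| 63 | |||
| 64 | Recommended uses: | ||
| 65 | - `reference` | ||
| 66 | - Primary training assets | ||
| 67 | - Primary pgvector embedding table | ||
| 68 | |||
| 69 | ### B. Second priority: `9` + `1` | ||
| 70 | |||
| 71 | Reasons: | ||
| 72 | - Closer to the real online compressed distribution | ||
| 73 | - Can improve the model's robustness to codec degradation | ||
| 74 | |||
| 75 | Recommended uses: | ||
| 76 | - Training supplement | ||
| 77 | - clean/compressed queries during evaluation | ||
| 78 | - Reference-domain extension | ||
| 79 | |||
| 80 | ### C. Third priority: `8` / `7` / `16` | ||
| 81 | |||
| 82 | Reasons: | ||
| 83 | - Closer to real recognition entry points | ||
| 84 | - Helps short-clip / chorus / short-video-domain evaluation | ||
| 85 | |||
| 86 | Recommended uses: | ||
| 87 | - query evaluation set | ||
| 88 | - repeated-section-rich bucket | ||
| 89 | - short-video bucket | ||
| 90 | |||
| 91 | ### D. Use with caution: `18` | ||
| 92 | |||
| 93 | Reasons: | ||
| 94 | - `demo` mixes, arrangements, and completeness vary widely | ||
| 95 | - Samples that are not the same final version can easily end up under the same label | ||
| 96 | |||
| 97 | Recommended uses: | ||
| 98 | - Put them in a manual screening pool first | ||
| 99 | - Add them to training or hard-case only after confirming the same song and main melody as the official version | ||
| 100 | |||
| 101 | --- | ||
| 102 | |||
| 103 | ## 2.2 Types not recommended for main training at the start | ||
| 104 | |||
| 105 | ### `2` / `12` accompaniment with harmony | ||
| 106 | |||
| 107 | Risks: | ||
| 108 | - Adds vocal/harmony interference under the same `song_id` | ||
| 109 | - If the current system goal is "music ACR / BGM recognition", these assets fit better as a later domain-robustness comparison | ||
| 110 | |||
| 111 | Suggestions: | ||
| 112 | - Put them in a separate `with_harmony_accompaniment` bucket first | ||
| 113 | - Do not train them together with `10`/`9` at the start | ||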
| 114 | |||
| 115 | --- | ||
| 116 | |||
| 117 | ## 3. Suggested training/evaluation layering | ||
| 118 | |||
| 119 | ```mermaid | ||
| 120 | flowchart TD | ||
| 121 | A[Main library reference] --> A1[10 / 11] | ||
| 122 | B[Training supplement] --> B1[1 / 9] | ||
| 123 | C[Short-clip evaluation] --> C1[7 / 16 / 8] | ||
| 124 | D[Special comparison] --> D1[2 / 12 / 18] | ||
| 125 | E[Non-audio metadata] --> E1[3 / 4 / 5 / 6 / 13 / 14 / 17 / 19 / 20] | ||
| 126 | ``` | ||
| 127 | |||
| 128 | ### Recommended first-version strategy | ||
| 129 | |||
| 130 | | Layer | Recommended type | | ||
| 131 | |---|---| | ||
| 132 | | reference main library | `10`, `11` | | ||
| 133 | | Training supplement | `1`, `9` | | ||
| 134 | | query evaluation | `7`, `8`, `16` | | ||
| 135 | | Can be added after manual screening | `18` | | ||
| 136 | | Later robustness-focused experiments | `2`, `12` | | ||
| 137 | |||
| 138 | --- | ||
| 139 | |||
| 140 | ## 4. Suggested business-semantic buckets | ||
| 141 | |||
| 142 | ## 4.1 First batch of buckets most worth building | ||
| 143 | |||
| 144 | | Bucket name | Recommended source type | Purpose | | ||
| 145 | |---|---|---| | ||
| 146 | | `lossless_reference_core` | `10`, `11` | Cleanest ground-truth library | | ||
| 147 | | `compressed_reference_realworld` | `1`, `9` | Online compressed domain | | ||
| 148 | | `short_video_hook` | `7`, `16`, `8` | Short-video / chorus recognition | | ||
| 149 | | `with_harmony_shift` | `2`, `12` | Interference from accompaniment with harmony | | ||
| 150 | | `demo_variation_pool` | `18` | Risk of differences between demo and official version | | ||
| 151 | | `hard_negative_confusable` | Manually curated | Similar style, arrangement, or melody | | ||
| 152 | |||
| 153 | --- | ||
| 154 | |||
| 155 | ## 4.2 Why this fits the business better than generic semantic buckets | ||
| 156 | |||
| 157 | Because your data is not a purely academic dataset; it **carries business asset semantics**: | ||
| 158 | - primary assets / compressed versions / lossless versions | ||
| 159 | - short-video clips | ||
| 160 | - chorus clips | ||
| 161 | - accompaniment with / without harmony | ||
| 162 | |||
| 163 | So the first buckets to build are not abstract genre buckets but: | ||
| 164 | 1. **Version-form buckets** | ||
| 165 | 2. **Entry-scenario buckets** | ||
| 166 | 3. **Confusion-risk buckets** | ||
| 167 | |||
| 168 | --- | ||
| 169 | |||
| 170 | ## 5. Recommended config templates | ||
| 171 | |||
| 172 | Companion templates: | ||
| 173 | - [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json) | ||
| 174 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) | ||
| 175 | |||
| 176 | Of these: | ||
| 177 | - `fma_semantic_bucket_template.json` leans toward general methodology | ||
| 178 | - `business_type_bucket_template.json` leans toward your existing business asset forms | ||
| 179 | |||
| 180 | --- | ||
| 181 | |||
| 182 | ## 6. Working with pgvector | ||
| 183 | |||
| 184 | If this later lands in pgvector, keep at least these fields: | ||
| 185 | |||
| 186 | | Field | Description | | ||
| 187 | |---|---| | ||
| 188 | | `song_id` | Primary song ID | | ||
| 189 | | `asset_id` | Specific asset ID | | ||
| 190 | | `type` | Your asset material type | | ||
| 191 | | `bucket` | Current evaluation/training bucket | | ||
| 192 | | `role` | `reference` / `query` | | ||
| 193 | | `source_dataset` | Source | | ||
| 194 | | `offset_sec` | Query start offset | | ||
| 195 | | `duration_sec` | Query length | | ||
| 196 | | `embedding` | pgvector vector | | ||
| 197 | |||
| 198 | This makes it possible later to: | ||
| 199 | - filter by `type` | ||
| 200 | - report by `bucket` | ||
| 201 | - separate reference/query by `role` | ||
| 202 | - run multi-source analysis by `source_dataset` | ||
| 203 | |||
| 204 | --- | ||
| 205 | |||
| 206 | ## 7. Direct actions for the next session | ||
| 207 | |||
| 208 | 1. Use this document to select the first batch of usable types: `10`, `11`, `9`, `1`, `8`, `7`, `16` | ||
| 209 | 2. Map them into: | ||
| 210 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) | ||
| 211 | 3. Run the bucket benchmark | ||
| 212 | 4. Compare whether `hybrid` / `high_energy` diverge across business buckets | ||
| 213 | |||
| 214 | ## Sources | ||
| 215 | - See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 216 | - See [industrial-benchmark-spec.md](./industrial-benchmark-spec.md) |
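The bucket table in section 4.1 of the document above can be read as a simple lookup from asset type to bucket. The following is a minimal sketch of that lookup, not the repository's implementation: the function names and the asset dict shape (`asset_id`, `type`, borrowed from the pgvector field list in section 6) are assumptions made for illustration.

```python
from collections import defaultdict

# Bucket -> source types, copied from the section 4.1 table.
# `hard_negative_confusable` is manually curated, so it has no type list here.
BUCKET_TYPES = {
    "lossless_reference_core": [10, 11],
    "compressed_reference_realworld": [1, 9],
    "short_video_hook": [7, 16, 8],
    "with_harmony_shift": [2, 12],
    "demo_variation_pool": [18],
}


def buckets_for_type(asset_type: int) -> list[str]:
    """Return every bucket whose source-type list contains `asset_type`."""
    return [name for name, types in BUCKET_TYPES.items() if asset_type in types]


def group_by_bucket(assets):
    """Group asset IDs by bucket; `assets` is an iterable of dicts (hypothetical shape)."""
    grouped = defaultdict(list)
    for asset in assets:
        for bucket in buckets_for_type(asset["type"]):
            grouped[bucket].append(asset["asset_id"])
    return dict(grouped)
```

Types that appear in no bucket (for example the non-audio metadata types) simply map to an empty list, which keeps them out of any benchmark run.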
| 1 | # Business Project Manifest Adapter | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Business Export Cookbook](./business-export-cookbook.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The repository now has an offline script chain that produces manifests close to the project's training/evaluation format: | ||
| 9 | |||
| 10 | 1. Export business database tables to CSV / JSONL | ||
| 11 | 2. [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py) | ||
| 12 | 3. [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py) | ||
| 13 | 4. [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py) | ||
| 14 | |||
| 15 | The last step directly generates: | ||
| 16 | - `catalog.json` | ||
| 17 | - `train.json` | ||
| 18 | - `test.json` | ||
| 19 | - `val.json` | ||
| 20 | |||
| 21 | The format matches the project's existing manifest structure. | ||
| 22 | |||
| 23 | --- | ||
| 24 | |||
| 25 | ## 1. Aligned project format | ||
| 26 | |||
| 27 | ### `catalog.json` | ||
| 28 | - Contains reference records only | ||
| 29 | - Fields: `song_id / audio_path / duration / type=reference / source_dataset` | ||
| 30 | |||
| 31 | ### `train.json` / `test.json` | ||
| 32 | - The first part holds the query records | ||
| 33 | - The reference records are appended after them | ||
| 34 | - Query fields: | ||
| 35 | - `song_id` | ||
| 36 | - `audio_path` | ||
| 37 | - `duration` | ||
| 38 | - `type=clean` | ||
| 39 | - `offset` | ||
| 40 | - `segment_type=external_query` | ||
| 41 | - `source_dataset` | ||
| 42 | |||
| 43 | ### `val.json` | ||
| 44 | - By default contains only queries with `split=val` | ||
| 45 | - `holdout` can optionally be merged into `val` | ||
| 46 | |||
| 47 | --- | ||
| 48 | |||
| 49 | ## 2. Example commands | ||
| 50 | |||
| 51 | ```bash | ||
| 52 | cd /workspace/acr-engine | ||
| 53 | /usr/local/miniconda3/bin/python scripts/normalize_business_export.py \ | ||
| 54 | --input configs/manifests/examples/business_asset_export_example.csv \ | ||
| 55 | --output /tmp/business_asset_manifest_ready.jsonl | ||
| 56 | |||
| 57 | /usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \ | ||
| 58 | --input /tmp/business_asset_manifest_ready.jsonl \ | ||
| 59 | --output-dir /tmp/business_project_manifests | ||
| 60 | ``` | ||
| 61 | |||
| 62 | To merge `holdout` into `val.json` first: | ||
| 63 | |||
| 64 | ```bash | ||
| 65 | /usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \ | ||
| 66 | --input /tmp/business_asset_manifest_ready.jsonl \ | ||
| 67 | --output-dir /tmp/business_project_manifests \ | ||
| 68 | --include-holdout-in-val | ||
| 69 | ``` | ||
| 70 | |||
| 71 | --- | ||
| 72 | |||
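| 73 | ## 3. Scope of this adapter | ||
| 74 | |||
| 75 | This step is not yet the final production integration of real business data, but it is enough for the next session to: | ||
| 76 | - run the manifest structure end to end with real business export samples | ||
| 77 | - hook into `train.py / evaluate.py / run_demo.py` | ||
| 78 | - then make only small fixes for the final field details | ||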
| 79 | |||
| 80 | ## Sources | ||
| 81 | - See [business-export-cookbook.md](./business-export-cookbook.md) | ||
| 82 | - See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md) |
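The record layout described in section 1 of the adapter document above (reference-only `catalog.json`, and `train.json` / `test.json` with queries first and references appended) can be sketched as below. This is an illustration under stated assumptions, not the code in `build_business_project_manifests.py`: the helper names are hypothetical, and only the field names and fixed values come from the document.

```python
import json


def reference_record(song_id, audio_path, duration, source_dataset):
    """Reference entry with the catalog.json fields listed in section 1."""
    return {
        "song_id": song_id,
        "audio_path": audio_path,
        "duration": duration,
        "type": "reference",
        "source_dataset": source_dataset,
    }


def query_record(song_id, audio_path, duration, offset, source_dataset):
    """Query entry with the train.json / test.json query fields."""
    return {
        "song_id": song_id,
        "audio_path": audio_path,
        "duration": duration,
        "type": "clean",
        "offset": offset,
        "segment_type": "external_query",
        "source_dataset": source_dataset,
    }


def build_split(queries, references):
    """Queries come first, then the reference records are appended."""
    return list(queries) + list(references)


def to_json(records):
    """Serialize a manifest as a JSON list, keeping non-ASCII paths readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Keeping the reference and query builders separate makes it harder to emit a query without `offset` or a reference with `segment_type` by accident.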
docs/current-capability-map.md
deleted
100644 → 0
| 1 | # Current Capability Map | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | The project currently has three classes of capability: | ||
| 8 | |||
| 9 | 1. **Fully closed loop** | ||
| 10 | 2. **Connected end to end, but still smoke-level** | ||
| 11 | 3. **Still awaiting validation on real data / at larger scale** | ||
| 12 | |||
| 13 | --- | ||
| 14 | |||
| 15 | ## 1. Capability status table | ||
| 16 | |||
| 17 | | Capability | Current status | Notes | | ||
| 18 | |---|---|---| | ||
| 19 | | Synthetic data generation | Done | Reliably generates synthetic training/evaluation data | | ||
| 20 | | Synthetic training | Done | `train.py` runs end to end | | ||
| 21 | | Synthetic index building | Done | `run_demo.py build-index` runs end to end | | ||
| 22 | | Synthetic evaluation | Done | `evaluate.py` outputs JSON | | ||
| 23 | | Synthetic release artifacts | Done | Generates benchmark/model-card/checklist | | ||
| 24 | | Open data inspect | Done | `inspect-local` / `inspect-batch` | | ||
| 25 | | Open data prepare | Done | `prepare-local` | | ||
| 26 | | Open data validate | Done | `validate-local` | | ||
| 27 | | Open data training smoke | Done | Verified on stand-in data | | ||
| 28 | | Open data indexing smoke | Done | Verified on stand-in data | | ||
| 29 | | Open data evaluation smoke | Done | Verified on stand-in data | | ||
| 30 | | Open data release artifacts smoke | Done | Verified on stand-in data | | ||
| 31 | | One-shot smoke-local | Done | inspect→prepare→validate→train→index→eval→artifacts | | ||
| 32 | | Real FMA local directory smoke | Waiting on external data | Code is ready; the real audio directory is missing | | ||
| 33 | | Real MTG-Jamendo local directory smoke | Waiting on external data | Code is ready; the real audio directory is missing | | ||
| 34 | | Hard-case accuracy tuning | In progress | confused / humming_like still need ongoing work | | ||
| 35 | | foundation model baseline | Not done | Only documentation research and route planning are complete | | ||
| 36 | | Industrial-grade production deployment | Not done | Service skeleton exists; production governance is not done | | ||
| 37 | |||
| 38 | --- | ||
| 39 | |||
| 40 | ## 2. Shortest path | ||
| 41 | |||
| 42 | ```mermaid | ||
| 43 | flowchart LR | ||
| 44 | A[Local Audio Dir] --> B[inspect-local] | ||
| 45 | B --> C[prepare-local] | ||
| 46 | C --> D[validate-local] | ||
| 47 | D --> E[train] | ||
| 48 | E --> F[build-index] | ||
| 49 | F --> G[evaluate] | ||
| 50 | G --> H[generate_artifacts] | ||
| 51 | ``` | ||
| 52 | |||
| 53 | --- | ||
| 54 | |||
| 55 | ## 3. Most reliable entry points today | ||
| 56 | |||
| 57 | - [docs/open-dataset-workflow.md](./open-dataset-workflow.md) | ||
| 58 | - [docs/session-handoff.md](./session-handoff.md) | ||
| 59 | - [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md) | ||
| 60 | - [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py) | ||
| 61 | |||
| 62 | --- | ||
| 63 | |||
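| 64 | ## 4. Most important gaps today | ||
| 65 | |||
| 66 | 1. Real FMA local audio is not yet in place | ||
| 67 | 2. Real MTG-Jamendo local audio is not yet in place | ||
| 68 | 3. Hard-case performance on real data is unknown | ||
| 69 | 4. The foundation model baseline has not been started | ||
| 70 | 5. Service and deployment are still prototype-level | ||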
| 71 | |||
| 72 | --- | ||
| 73 | |||
| 74 | ## Sources | ||
| 75 | - [session-handoff.md](./session-handoff.md) | ||
| 76 | - [open-dataset-workflow.md](./open-dataset-workflow.md) | ||
| 77 | - [CHANGELOG.md](./CHANGELOG.md) |
| 1 | # Dataset Sources and Licensing | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | - The current priority has changed to: **make full use of open-source datasets for personal use** | ||
| 8 | - External dataset ingestion must now do more than bootstrap; it must actually split the data into train/eval manifests | ||
| 9 | - Current recommended priority: | ||
| 10 | 1. FMA | ||
| 11 | 2. MTG-Jamendo | ||
| 12 | 3. CCMusic (after approval/verification) | ||
| 13 | 4. ModelScope music datasets (after whitelisting) | ||
| 14 | - Neither ModelScope nor CCMusic may currently go into commercial training by default | ||
| 15 | |||
| 16 | Direct recommendations for personal use: | ||
| 17 | - FMA / MTG-Jamendo: convert into training and evaluation assets first | ||
| 18 | - CCMusic / ModelScope: use first as supplementary evaluation or exploration sources | ||
| 19 | - Keep the license notes, but stop treating "blocked for commercial use" as the main blocker for personal experiments | ||
| 20 | |||
| 21 | Read first: | ||
| 22 | - [Open Dataset Workflow](./open-dataset-workflow.md) | ||
| 23 | |||
| 24 | Recommended ingestion order: | ||
| 25 | 1. Download/prepare a local audio directory for FMA or MTG-Jamendo | ||
| 26 | 2. Run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` or `inspect-batch` | ||
| 27 | 3. Then run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local` | ||
| 28 | 4. Generate [catalog.json / train.json / test.json / val.json](../acr-engine/data/external_ingested/README.md) | ||
| 29 | 5. Use [train.json](../acr-engine/data/external_ingested/README.md) for training and [test.json](../acr-engine/data/external_ingested/README.md) for fixed evaluation | ||
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 33 | ## 1. Source tiers | ||
| 34 | |||
| 35 | ```mermaid | ||
| 36 | flowchart TD | ||
| 37 | A[Candidate Datasets] --> B[Open / MIR Baselines] | ||
| 38 | A --> C[Chinese / Regional Sources] | ||
| 39 | A --> D[Discovery Surfaces] | ||
| 40 | |||
| 41 | B --> B1[FMA] | ||
| 42 | B --> B2[MTG-Jamendo] | ||
| 43 | C --> C1[CCMusic] | ||
| 44 | D --> D1[ModelScope music datasets] | ||
| 45 | ``` | ||
| 46 | |||
| 47 | --- | ||
| 48 | |||
| 49 | ## 2. Data source table | ||
| 50 | |||
| 51 | | Data source | Role | Risk | Current policy | | ||
| 52 | |---|---|---|---| | ||
| 53 | | FMA | First real-data baseline | Per-track license needs verification | review_required | | ||
| 54 | | MTG-Jamendo | retrieval/tagging corpus | CC terms need verification | review_required | | ||
| 55 | | CCMusic | Chinese MIR resource | May require application / may carry restrictions | review_required | | ||
| 56 | | ModelScope music | Dataset discovery entry point | Licenses vary across datasets | deny_until_whitelisted | | ||
| 57 | |||
| 58 | --- | ||
| 59 | |||
| 60 | ## 3. Whitelisting flow | ||
| 61 | |||
| 62 | ```mermaid | ||
| 63 | flowchart LR | ||
| 64 | A[Discover dataset] --> B[Collect license / terms] | ||
| 65 | B --> C[Legal / compliance review] | ||
| 66 | C --> D{Commercial use allowed?} | ||
| 67 | D -- Yes --> E[Add to whitelist] | ||
| 68 | D -- No --> F[Blocked from training] | ||
| 69 | ``` | ||
| 70 | |||
| 71 | --- | ||
| 72 | |||
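| 73 | ## 4. Notes | ||
| 74 | |||
| 75 | ### 4.1 Why ModelScope can only be a discovery surface for now | ||
| 76 | Dataset sources and terms differ widely, so being hosted on ModelScope does not make a dataset commercially usable by default. | ||
| 77 | |||
| 78 | ### 4.2 Why CCMusic needs separate review | ||
| 79 | It is valuable for Chinese music tasks, but some subsets may involve applications, agreements, or non-standard commercial licensing boundaries. | ||
| 80 | |||
| 81 | ### 4.3 Why the license registry must be bound to model versions | ||
| 82 | That makes it possible to trace later: | ||
| 83 | - exactly which data a given model used | ||
| 84 | - whether that data permits the corresponding commercial scenario | ||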
| 85 | |||
| 86 | --- | ||
| 87 | |||
| 88 | ## 5. Appendix | ||
| 89 | |||
| 90 | Entry links: | ||
| 91 | - FMA: https://github.com/mdeff/fma | ||
| 92 | - MTG-Jamendo: https://github.com/MTG/mtg-jamendo-dataset | ||
| 93 | - CCMusic: https://ccmusic-database.github.io/en/database/ccm.html | ||
| 94 | - ModelScope search: https://modelscope.cn/search?page=1&search=music&type=dataset | ||
| 95 | |||
| 96 | |||
| 97 | ## Sources | ||
| 98 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. | ||
| 99 | |||
| 100 | |||
| 101 | ## Download / LFS governance | ||
| 102 | |||
| 103 | ### Preferred repository behavior | ||
| 104 | |||
| 105 | ```mermaid | ||
| 106 | flowchart TD | ||
| 107 | A[Upstream dataset source] --> B[Local raw drop zone] | ||
| 108 | B --> C[Git LFS tracked large files] | ||
| 109 | C --> D[check-local-ready] | ||
| 110 | D --> E[prepare-local / smoke-local] | ||
| 111 | ``` | ||
| 112 | |||
| 113 | ### Current repo policy | ||
| 114 | |||
| 115 | | Item | Policy | Reason | | ||
| 116 | |---|---|---| | ||
| 117 | | `acr-engine/data/raw/**/*.zip` | Git LFS | avoid bloating normal git history | | ||
| 118 | | `acr-engine/data/raw/**/*.wav` / `.mp3` / `.flac` / `.ogg` | Git LFS | allow local reproducibility without normal blob explosion | | ||
| 119 | | FMA Small | acceptable as first real-data engineering baseline | easiest realistic open music smoke path | | ||
| 120 | | MTG-Jamendo | default to research/eval lane | do not assume commercial-safe rights without subset-specific proof | | ||
| 121 | |||
| 122 | ### Operational note | ||
| 123 | |||
| 124 | Even when a dataset is technically downloadable, this project should separate: | ||
| 125 | |||
| 126 | - **engineering usability** | ||
| 127 | - **benchmark suitability** | ||
| 128 | - **commercial deployment suitability** | ||
| 129 | |||
| 130 | These are not the same thing. |
docs/dataset-spec.md
deleted
100644 → 0
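| 1 | # ACR Dataset / Input-Output Spec | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Dataset Sources and Ingestion](./dataset-sources-and-licensing.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | Four things matter most in the project's current data spec: | ||
| 9 | |||
| 10 | 1. **Training input is not the whole mp3 file itself, but a manifest-driven system of reference + query samples.** | ||
| 11 | 2. **Training and retrieval slice audio differently**: training currently crops **one 5s clip** per sample (random by default), while retrieval/index building uses **5s windows with a 2.5s stride, i.e. 50% overlapping sliding windows**. | ||
| 12 | 3. **External open datasets must first be converted into the unified manifest** before training, evaluation, indexing, or pgvector ingestion. | ||
| 13 | 4. **The input layer for the music task has switched to a 128-band Mel spectrogram**, with the band-split direction enabled; for real data such as FMA, prefer a GPU. | ||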
| 14 | |||
| 15 | --- | ||
| 16 | |||
| 17 | ## 1. Data flow | ||
| 18 | |||
| 19 | ```mermaid | ||
| 20 | flowchart LR | ||
| 21 | A[Raw Audio\nFMA / MTG / in-house BGM / recordings] --> B[Manifest Conversion] | ||
| 22 | B --> C[Catalog Manifest\nreference] | ||
| 23 | B --> D[Train/Test Manifest\nquery] | ||
| 24 | C --> E[Reference Index Build\nsliding windows] | ||
| 25 | D --> F[Training / Evaluation] | ||
| 26 | E --> G[Hybrid Retrieval] | ||
| 27 | F --> G | ||
| 28 | G --> H[pgvector / report / service] | ||
| 29 | ``` | ||
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 33 | ## 2. Data object table | ||
| 34 | |||
| 35 | | Object | Role | Required fields | Notes | | ||
| 36 | |---|---|---|---| | ||
| 37 | | Reference | Searchable catalog | `song_id`, `audio_path`, `duration`, `type=reference` | Used to build the index | | ||
| 38 | | Query Segment | Segment to identify | `song_id`, `audio_path`, `duration`, `type` | Used for training/evaluation | | ||
| 39 | | Catalog Manifest | Master list of references | JSON list | Used for offline indexing | | ||
| 40 | | Query Manifest | Master list of queries | JSON list | Used for training and evaluation | | ||
| 41 | |||
| 42 | --- | ||
| 43 | |||
| 44 | ## 3. Manifest structure | ||
| 45 | |||
| 46 | ```mermaid | ||
| 47 | flowchart TD | ||
| 48 | M[Manifest] --> R[Reference Records] | ||
| 49 | M --> Q[Query Records] | ||
| 50 | R --> R1[song_id] | ||
| 51 | R --> R2[audio_path] | ||
| 52 | R --> R3[duration] | ||
| 53 | R --> R4[type=reference] | ||
| 54 | Q --> Q1[song_id] | ||
| 55 | Q --> Q2[audio_path] | ||
| 56 | Q --> Q3[duration] | ||
| 57 | Q --> Q4[type=clean/augmented/confused/humming_like] | ||
| 58 | Q --> Q5[offset] | ||
| 59 | ``` | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
| 63 | ## 4. Input/output overview | ||
| 64 | |||
| 65 | | Stage | Input | Output | | ||
| 66 | |---|---|---| | ||
| 67 | | Training | query segments + references | embeddings + logits | | ||
| 68 | | Indexing | catalog references | chromaprint index + embedding index | | ||
| 69 | | Recognition | query audio | ranked candidates | | ||
| 70 | | Evaluation | query manifest + catalog | top1/top5/hard-case report | | ||
| 71 | | Ingestion | manifest + embedding | pgvector-ready JSON / SQL rows | | ||
| 72 | |||
| 73 | --- | ||
| 74 | |||
| 75 | ## 5. From a 3-minute mp3 to 5–8 second clips: how slicing works today | ||
| 76 | |||
| 77 | ### 5.1 The current code has **3 different slicing paths** | ||
| 78 | |||
| 79 | ```mermaid | ||
| 80 | flowchart TD | ||
| 81 | A[3min mp3] --> B[Training Dataset] | ||
| 82 | A --> C[Retrieval / index building] | ||
| 83 | A --> D[External dataset manifest generation] | ||
| 84 | |||
| 85 | B --> B1[Randomly crop one 5s clip] | ||
| 86 | C --> C1[5s window + 2.5s stride\n50% overlap] | ||
| 87 | D --> D1[Randomly sample 1 query per song\ndefault 8s] | ||
| 88 | ``` | ||
| 89 | |||
| 90 | | Scenario | Current implementation | Overlap? | Code location | | ||
| 91 | |---|---|---:|---| | ||
| 92 | | Training `SongPairDataset` | Each sample picks one 5s clip according to `segment_strategy`; random by default, or drawn from music-aware candidates | No, **not a full fixed sliding-window expansion** | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | | ||
| 93 | | Retrieval / embedding / index building | `window_sec=5.0`, `stride_sec=2.5` | Yes, **50% overlap** | [acr-engine/src/utils/audio.py](../acr-engine/src/utils/audio.py), [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | | ||
| 94 | | `audio-dir-to-splits` default | Generates a query per song; random, or produced by a music-aware strategy | No | [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) | | ||
| 95 | | `audio-dir-to-splits --query-stride 4.0` example | Generates multiple sliding-window queries for a single song | Yes, configurable | [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) | | ||
| 96 | |||
| 97 | ### Direct answers to your question | ||
| 98 | |||
| 99 | - **Overlapping windows exist, mainly in the retrieval/index path; the training side does not expand the full set of sliding windows.** | ||
| 100 | - **The main training path is not limited to random cropping**: each batch dynamically picks one 5s segment, and candidates can come from `random / silence_aware / high_energy / onset_aware / beat_aware / repeated_section_aware / hybrid`. | ||
| 101 | - **The external dataset manifest generator is no longer limited to random queries either**: `--query-strategy` selects a music-aware slicing strategy, and `--query-stride` enables multi-query / overlapping query generation. | ||
| 102 | |||
| 103 | --- | ||
| 104 | |||
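| 105 | ### 5.2 Why it is designed this way | ||
| 106 | |||
| 107 | | Design point | Current benefit | Current limitation | | ||
| 108 | |---|---|---| | ||
| 109 | | Random cropping in training | Saves storage; no need to pre-generate tens of thousands of clips | Limited time coverage of each song within a single epoch | | ||
| 110 | | Overlapping sliding windows in retrieval | Closer to real ACR reference coverage | Larger index | | ||
| 111 | | Music-aware candidate clips | More likely to hit main sections, onsets, beats, and non-silent regions | Higher CPU analysis cost | | ||
| 112 | | Few-query smoke on external data | Smoke runs are lighter and validate faster | Insufficient training/evaluation coverage | | ||
| 113 | |||
| 114 | A useful way to think about it: | ||
| 115 | - The **training side** behaves like a "random data-augmentation sampler" | ||
| 116 | - The **retrieval side** behaves like "sliding-window indexing for recall coverage" | ||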
| 117 | |||
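| 118 | ### 5.3 Do we actually use librosa's segmentation logic | ||
| 119 | |||
| 120 | Yes, and it is already in the main path, but not as a wholesale replacement of random sampling by the full structural segmentation API. | ||
| 121 | |||
| 122 | `librosa` music-aware logic already in use: | ||
| 123 | |||
| 124 | | Logic | Current use | Code location | | ||
| 125 | |---|---|---| | ||
| 126 | | `librosa.effects.split` | `silence_aware`: avoid silent regions | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | | ||
| 127 | | `librosa.onset.onset_detect` | `onset_aware`: prefer regions near onsets | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | | ||
| 128 | | `librosa.beat.beat_track` | `beat_aware`: prefer regular beat positions | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | | ||
| 129 | | `librosa.feature.chroma_cqt` | `repeated_section_aware`: approximate search for repeated main sections / hooks | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | | ||
| 130 | |||
| 131 | We have **not** yet adopted the heavier `librosa.segment.*` structural segmentation flow, mainly because: | ||
| 132 | |||
| 133 | 1. **Real training queries do not always align with section boundaries**; full structural segmentation would make the training distribution too "tidy"; | ||
| 134 | 2. **CPU cost is higher**, which is too heavy for smoke runs and batch manifest generation on large directories such as FMA / MTG; | ||
| 135 | 3. **The current stage prioritizes robustness and reproducibility**, so the silence, onset, beat, and repeated-section candidates land first because their benefit is more direct. | ||
| 136 | |||
| 137 | So the current design is not "librosa segmentation was never considered", but rather: | ||
| 138 | - **The lightweight, high-yield parts of librosa are already in use** | ||
| 139 | - **random is kept as a generalization augmentation** | ||
| 140 | - **Heavier structural segmentation is left as a later enhancement rather than an upfront replacement for all sampling logic** | ||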
| 141 | |||
| 142 | --- | ||
| 143 | |||
| 144 | ## 6. Current training signals and hard-case rules | ||
| 145 | |||
| 146 | ### 6.1 Hard-case training signal | ||
| 147 | |||
| 148 | ```mermaid | ||
| 149 | flowchart LR | ||
| 150 | A[Query Segment] --> B{type} | ||
| 151 | B -->|clean| C[w=1.0] | ||
| 152 | B -->|augmented| D[w=1.4] | ||
| 153 | B -->|humming_like| E[w=2.5] | ||
| 154 | B -->|confused| F[w=4.0] | ||
| 155 | C --> G[Sample-level SupCon + CE] | ||
| 156 | D --> G | ||
| 157 | E --> G | ||
| 158 | F --> G | ||
| 159 | ``` | ||
| 160 | |||
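| 161 | | Type | Current training weight | Goal | | ||
| 162 | |---|---:|---| | ||
| 163 | | clean | 1.0 | Keep baseline recognition stable | | ||
| 164 | | augmented | 1.4 | Improve robustness to common degradations | | ||
| 165 | | humming_like | 2.5 | Improve robustness to melody-approximate queries | | ||
| 166 | | confused | 4.0 | Strengthen separation of the most easily confused segments | | ||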
| 167 | |||
| 168 | --- | ||
| 169 | |||
| 170 | ### 6.2 Retrieval fusion parameters | ||
| 171 | |||
| 172 | ```mermaid | ||
| 173 | flowchart LR | ||
| 174 | A[Chromaprint Score] --> D[Fused Score] | ||
| 175 | B[ECAPA Score] --> D | ||
| 176 | C[Melody Score] --> D | ||
| 177 | ``` | ||
| 178 | |||
| 179 | | Parameter | Default | Currently validated better value (fast-eval) | Meaning | | ||
| 180 | |---|---:|---:|---| | ||
| 181 | | `chroma_weight` | 0.25 | 0.20 | Reduce the dominance of pure fingerprinting | | ||
| 182 | | `ecapa_weight` | 0.50 | 0.55 | Increase the dominance of embedding retrieval | | ||
| 183 | | `melody_weight` | 0.25 | 0.25 | Unchanged for now | | ||
| 184 | |||
| 185 | --- | ||
| 186 | |||
| 187 | ## 7. Open dataset train/eval split | ||
| 188 | |||
| 189 | ```mermaid | ||
| 190 | flowchart LR | ||
| 191 | A[Open Audio Dir] --> B[audio-dir-to-splits] | ||
| 192 | B --> C[catalog.json] | ||
| 193 | B --> D[train.json] | ||
| 194 | B --> E[test.json] | ||
| 195 | B --> F[val.json] | ||
| 196 | ``` | ||
| 197 | |||
| 198 | | Artifact | Purpose | Notes | | ||
| 199 | |---|---|---| | ||
| 200 | | [catalog.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json) | Index building | All reference tracks | | ||
| 201 | | [train.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json) | Training queries | query + references | | ||
| 202 | | [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) | Evaluation queries | query + references | | ||
| 203 | | [val.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json) | Reserved validation set | May currently be empty | | ||
| 204 | |||
| 205 | Rules of thumb: | ||
| 206 | - FMA / MTG-Jamendo can be used first for a real train/eval baseline | ||
| 207 | - Pin at least some tracks to [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json) only; they must not also take part in training | ||
| 208 | - Even small datasets need at least 1 train query + 1 test query | ||
| 209 | |||
| 210 | CLI entry points: | ||
| 211 | - Low-level tool: [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) | ||
| 212 | - High-level unified entry: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local <dataset> <input_dir>` | ||
| 213 | - Pre-import check: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local <dataset> <input_dir>` | ||
| 214 | - Multi-directory batch check: [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-batch fma=<dir> mtg_jamendo=<dir> ...` | ||
| 215 | |||
| 216 | --- | ||
| 217 | |||
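| 218 | ## 8. Current input-layer spec | ||
| 219 | |||
| 220 | | Item | Current value / recommendation | Notes | | ||
| 221 | |---|---|---| | ||
| 222 | | Sample rate | `16kHz` | Unified audio loading convention | | ||
| 223 | | Channels | `mono` | The current pipeline processes mono audio | | ||
| 224 | | Spectrogram | `128 Mel` | Input layer for the music task | | ||
| 225 | | Training clip | `5s` | Fact from the current training code | | ||
| 226 | | External query manifest default | `8s` | Current default of `prepare-local/smoke-local` | | ||
| 227 | | Band split | `enabled` | Included in the current model config | | ||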
| 228 | |||
| 229 | --- | ||
| 230 | |||
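| 231 | ## 9. Notes | ||
| 232 | |||
| 233 | ### 9.1 Why catalog and query must be separated | ||
| 234 | In an industrial system the searchable catalog and the training/evaluation queries must be kept apart; otherwise evaluation gets mixed up with real service semantics. | ||
| 235 | |||
| 236 | ### 9.2 Why the input layer is 128 Mel | ||
| 237 | Music tasks need a richer frequency-band representation; 128 Mel bands suit timbre/harmony/band-split modeling better than 40-dimensional MFCCs. | ||
| 238 | |||
| 239 | ### 9.3 Why query types must be labeled explicitly | ||
| 240 | `clean / augmented / confused / humming_like` are not just label names; they are the entry point for training weights, evaluation buckets, and hard-case governance. | ||
| 241 | |||
| 242 | ### 9.4 A current caveat about 5s vs 8s | ||
| 243 | The repository currently uses two durations: | ||
| 244 | - **Training Dataset and default model training: 5s** | ||
| 245 | - **Open-data manifest/query default: 8s** | ||
| 246 | |||
| 247 | These belong to different configuration layers, so docs and experiment reports must distinguish them explicitly; do not mistake them for one unified parameter. | ||
| 248 | |||
| 249 | ### 9.5 Current empirical conclusions | ||
| 250 | - Naive oversampling degrades overall results | ||
| 251 | - type-aware weighting improves some hard cases | ||
| 252 | - The confused class needs a higher weight, but too strong a bias hurts `humming_like` in return | ||
| 253 | - Residual confused failures tend to cluster in `intro` segments, so `segment_type` is not just metadata; it should also feed later hard-negative design | ||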
| 254 | |||
| 255 | --- | ||
| 256 | |||
| 257 | ## 10. Appendix | ||
| 258 | |||
| 259 | ### Reference example | ||
| 260 | ```json | ||
| 261 | { | ||
| 262 | "song_id": "song_0001", | ||
| 263 | "audio_path": "songs/song_0001.wav", | ||
| 264 | "duration": 20.0, | ||
| 265 | "type": "reference" | ||
| 266 | } | ||
| 267 | ``` | ||
| 268 | |||
| 269 | ### Query example | ||
| 270 | ```json | ||
| 271 | { | ||
| 272 | "song_id": "song_0001", | ||
| 273 | "audio_path": "segments/song_0001_seg_04_confused.wav", | ||
| 274 | "duration": 5.0, | ||
| 275 | "type": "confused", | ||
| 276 | "offset": 8.3, | ||
| 277 | "segment_type": "mid" | ||
| 278 | } | ||
| 279 | ``` | ||
| 280 | |||
| 281 | ## Sources | ||
| 282 | - Current code facts come from [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py), [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py), [acr-engine/src/utils/audio.py](../acr-engine/src/utils/audio.py), [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py), [acr-engine/train.py](../acr-engine/train.py) |
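Two numeric rules from the dataset spec above lend themselves to a small worked example: the retrieval-side sliding window (`window_sec=5.0`, `stride_sec=2.5`) and the three-way score fusion from section 6.2. The sketch below is an assumption-laden illustration, not the code in `audio.py` or the retrieval engine: it assumes only full windows are kept (the source does not say how a trailing partial window is handled) and that fusion is a plain weighted sum of the three scores. Both function names are hypothetical.

```python
def window_starts(duration_sec, window_sec=5.0, stride_sec=2.5):
    """Start offsets (seconds) of every full window that fits in the audio.

    Assumption: a trailing partial window is dropped rather than padded.
    """
    starts = []
    t = 0.0
    while t + window_sec <= duration_sec + 1e-9:
        starts.append(round(t, 3))
        t += stride_sec
    return starts


def fused_score(chroma, ecapa, melody, weights=(0.25, 0.50, 0.25)):
    """Weighted sum of the three per-candidate scores.

    Defaults are the documented chroma/ecapa/melody weights; the linear
    form itself is an assumption.
    """
    w_chroma, w_ecapa, w_melody = weights
    return w_chroma * chroma + w_ecapa * ecapa + w_melody * melody


if __name__ == "__main__":
    # A 20s reference (as in the Reference example) yields 7 overlapping windows.
    print(window_starts(20.0))
    # Same candidate scored with the default and the fast-eval weights.
    print(fused_score(0.9, 0.6, 0.3))
    print(fused_score(0.9, 0.6, 0.3, weights=(0.20, 0.55, 0.25)))
```

With a stride of half the window, every instant of the reference except the first and last 2.5s is covered by two windows, which is where the larger index size noted in section 5.2 comes from.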
docs/industrial-benchmark-spec.md
deleted
100644 → 0
| 1 | # Industrial Benchmark Spec | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | - Industrial-grade ACR cannot be judged on overall top1 alone | ||
| 8 | - It must also track: | ||
| 9 | 1. hard-case | ||
| 10 | 2. rejection / false accept | ||
| 11 | 3. latency / scale | ||
| 12 | 4. license provenance completeness | ||
| 13 | |||
| 14 | --- | ||
| 15 | |||
| 16 | ## 1. Benchmark tiers | ||
| 17 | |||
| 18 | ```mermaid | ||
| 19 | flowchart TD | ||
| 20 | A[Industrial Benchmark] --> B[Accuracy] | ||
| 21 | A --> C[Robustness] | ||
| 22 | A --> D[Operational] | ||
| 23 | A --> E[Compliance] | ||
| 24 | |||
| 25 | B --> B1[top1/top5/MRR] | ||
| 26 | C --> C1[humming/confused/noisy] | ||
| 27 | D --> D1[latency/indexing/throughput] | ||
| 28 | E --> E1[data provenance/license coverage] | ||
| 29 | ``` | ||
| 30 | |||
| 31 | --- | ||
| 32 | |||
| 33 | ## 2. Metrics table | ||
| 34 | |||
| 35 | | Dimension | Metric | Goal | | ||
| 36 | |---|---|---| | ||
| 37 | | Accuracy | top1 / top5 / MRR | Primary recognition quality | | ||
| 38 | | Robustness | humming/confused/noisy top1 | Hard-case quality | | ||
| 39 | | Operational | p50/p95 latency | Serving capability | | ||
| 40 | | Operational | index throughput | Index-building capability | | ||
| 41 | | Safety | false accept rate | Misidentification risk | | ||
| 42 | | Compliance | license coverage | Precondition for commercial use | | ||
| 43 | |||
| 44 | --- | ||
| 45 | |||
| 46 | ## 3. Scenarios | ||
| 47 | |||
| 48 | ```mermaid | ||
| 49 | flowchart LR | ||
| 50 | Q[Queries] --> Q1[clean] | ||
| 51 | Q --> Q2[augmented] | ||
| 52 | Q --> Q3[humming_like] | ||
| 53 | Q --> Q4[confused] | ||
| 54 | Q --> Q5[noisy/compressed] | ||
| 55 | ``` | ||
| 56 | |||
| 57 | --- | ||
| 58 | |||
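| 59 | ## 4. Notes | ||
| 60 | |||
| 61 | ### 4.1 Why hard cases get their own report | ||
| 62 | Overall top1 easily hides failures in humming and confusion scenarios, which are exactly the scenarios users are most sensitive to. | ||
| 63 | |||
| 64 | ### 4.2 Why operational metrics are included | ||
| 65 | An industrial system is not an offline competition model; it has to account for serving latency and incremental indexing cost. | ||
| 66 | |||
| 67 | ### 4.3 Why compliance belongs in the benchmark | ||
| 68 | For a commercial system, if the provenance of training/evaluation data cannot be traced, no level of accuracy makes it safe to ship. | ||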
| 69 | |||
| 70 | --- | ||
| 71 | |||
| 72 | ## 5. Appendix | ||
| 73 | |||
| 74 | Recommended release gate: | ||
| 75 | - clean top1 >= 0.95 | ||
| 76 | - noisy top1 >= 0.85 | ||
| 77 | - confused top1 >= 0.70 | ||
| 78 | - humming_like top1 >= 0.60 | ||
| 79 | - top5 >= 0.95 on production-relevant buckets | ||
| 80 | |||
| 81 | |||
| 82 | ## Sources | ||
| 83 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. | ||
| 84 | |||
| 85 | |||
| 86 | ## 6. Bucket / style-aware baseline | ||
| 87 | |||
| 88 | The repository now includes a runnable baseline script: | ||
| 89 | - [../acr-engine/scripts/ab_smoke_bucketed.py](../acr-engine/scripts/ab_smoke_bucketed.py) | ||
| 90 | |||
| 91 | Purpose: | ||
| 92 | - Split the data into multiple small subsets according to a bucket config file | ||
| 93 | - Run the existing `ab_smoke_segmentation.py` on each bucket | ||
| 94 | - Output per-bucket winners and aggregate means | ||
| 95 | |||
| 96 | Recommended minimal config file format: | ||
| 97 | |||
| 98 | ```json | ||
| 99 | { | ||
| 100 | "buckets": [ | ||
| 101 | {"name": "prefix_000_a", "patterns": ["fma_small/000/00000?.mp3"], "subset_size": 4}, | ||
| 102 | {"name": "prefix_000_b", "patterns": ["fma_small/000/00014?.mp3"], "subset_size": 4} | ||
| 103 | ] | ||
| 104 | } | ||
| 105 | ``` | ||
| 106 | |||
| 107 | Recommended command: | ||
| 108 | |||
| 109 | ```bash | ||
| 110 | /usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_bucketed.py --dataset fma --input-dir data/raw/fma_small_audio --bucket-config /tmp/cap64_bucket_test.json --work-root /tmp/ab_smoke_bucketed_smoke --default-subset-size 4 --query-duration 8 --train-epochs 1 --batch-size 2 --device cpu --strategies high_energy hybrid --max-test-queries 4 --seed 42 --output-json /tmp/ab_smoke_bucketed_smoke/report.json | ||
| 111 | ``` | ||
| 112 | |||
| 113 | |||
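| 114 | Minimal results verified so far: | ||
| 115 | - `prefix_000_a` winner=`hybrid` | ||
| 116 | - `prefix_000_b` winner=`high_energy` | ||
| 117 | - At the aggregate level, both strategies have a `mean_top1` of `1.0` | ||
| 118 | |||
| 119 | So the current value of the bucket benchmark is not to pick a single winner, but to provide a reusable execution framework for later semantic buckets / hard-case buckets. | ||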
| 120 | |||
| 121 | |||
| 122 | Recommended template: | ||
| 123 | - [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json) | ||
| 124 | |||
| 125 | It is not an auto-labeler; it is an execution template for assigning buckets manually first and then reusing the unified benchmark flow. |
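The release gate listed in the appendix of the benchmark spec above (per-bucket top1 thresholds plus a top5 floor) can be expressed as a small check. This is a minimal sketch under stated assumptions, not part of the repository: the function name, the input shape (a dict of top1 per bucket plus a single top5 value), and treating a missing bucket as a failure are all choices made for illustration.

```python
# Thresholds copied from the "recommended release gate" list.
TOP1_GATE = {
    "clean": 0.95,
    "noisy": 0.85,
    "confused": 0.70,
    "humming_like": 0.60,
}
TOP5_GATE = 0.95


def check_release_gate(top1_by_bucket, top5):
    """Return a list of human-readable failures; an empty list means the gate passes.

    Assumption: a bucket with no reported top1 counts as a failure.
    """
    failures = []
    for bucket, threshold in TOP1_GATE.items():
        value = top1_by_bucket.get(bucket)
        if value is None:
            failures.append(f"{bucket}: top1 missing (needs >= {threshold})")
        elif value < threshold:
            failures.append(f"{bucket}: top1 {value} < {threshold}")
    if top5 < TOP5_GATE:
        failures.append(f"top5 {top5} < {TOP5_GATE}")
    return failures
```

Returning the full list of failures, rather than stopping at the first, matches the document's point that hard-case buckets should be reported separately instead of being hidden behind an overall number.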
docs/industrialization-roadmap.md
deleted
100644 → 0
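| 1 | # Industrialization Roadmap | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | The project has completed: | ||
| 8 | - A runnable prototype | ||
| 9 | - An initial retrieval-first rework | ||
| 10 | - A service skeleton | ||
| 11 | - Early external-data adapters | ||
| 12 | |||
| 13 | The next stage must focus on three things: | ||
| 14 | 1. **Real data ingestion** | ||
| 15 | 2. **Hard-case accuracy** | ||
| 16 | 3. **Commercial compliance and service stability** | ||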
| 17 | |||
| 18 | --- | ||
| 19 | |||
| 20 | ## 1. Roadmap diagram | ||
| 21 | |||
| 22 | ```mermaid | ||
| 23 | flowchart LR | ||
| 24 | P0[P0 Prototype running] --> P1[P1 Real-data validation] | ||
| 25 | P1 --> P2[P2 Engineering and service] | ||
| 26 | P2 --> P3[P3 Large-scale indexing] | ||
| 27 | P3 --> P4[P4 Commercial launch] | ||
| 28 | ``` | ||
| 29 | |||
| 30 | --- | ||
| 31 | |||
| 32 | ## 2. Stage table | ||
| 33 | |||
| 34 | | Stage | Goal | Current status | Key deliverables | | ||
| 35 | |---|---|---|---| | ||
| 36 | | P0 | End-to-end prototype | Done | demo/train/index/eval | | ||
| 37 | | P1 | Whitelisted real-data ingestion | In progress | adapters/manifests/benchmark | | ||
| 38 | | P2 | API / benchmark / ops | In progress | FastAPI + spec | | ||
| 39 | | P3 | ANN / incremental indexing | Not done | Faiss/HNSW | | ||
| 40 | | P4 | Commercially usable platform | Not done | license gate / SLA / release flow | | ||
| 41 | |||
| 42 | --- | ||
| 43 | |||
| 44 | ## 3. Near-term priorities | ||
| 45 | |||
| 46 | ### Priority A | ||
| 47 | - Ingest small whitelisted FMA / Jamendo subsets | ||
| 48 | - Improve humming_like / confused accuracy | ||
| 49 | - Make the service configurable and run a real deployment smoke | ||
| 50 | |||
| 51 | ### Priority B | ||
| 52 | - ANN vector index | ||
| 53 | - Rejection / false-accept metrics | ||
| 54 | - Model versioning | ||
| 55 | |||
| 56 | ### Priority C | ||
| 57 | - foundation model baseline | ||
| 58 | - Online evaluation and monitoring | ||
| 59 | - Commercial deployment process | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
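| 63 | ## 4. Responsibilities by layer | ||
| 64 | |||
| 65 | | Layer | Focus | | ||
| 66 | |---|---| | ||
| 67 | | Data layer | Ingest only auditable, whitelisted data | | ||
| 68 | | Model layer | Lead with retrieval metrics; do not over-trust the classification head | | ||
| 69 | | Retrieval layer | Strengthen hard-case handling and rejection | | ||
| 70 | | Service layer | Stable API, configurable, observable | | ||
| 71 | | Compliance layer | Any shipped model must have traceable data provenance | | ||
| 72 | |||
| 73 | --- | ||
| 74 | |||
| 75 | ## 5. Appendix | ||
| 76 | |||
| 77 | Related docs: | ||
| 78 | - [Dataset Sources and Ingestion](./dataset-sources-and-licensing.md) | ||
| 79 | - [Industrial Benchmark Spec](./industrial-benchmark-spec.md) | ||
| 80 | - [Service API](./service-api.md) | ||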
| 81 | |||
| 82 | |||
| 83 | ## Sources | ||
| 84 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. |
docs/model-card-template.md
deleted
100644 → 0
| 1 | # Model Card Template | ||
| 2 | |||
| 3 | ## One-page summary | ||
| 4 | - Model name: | ||
| 5 | - Version: | ||
| 6 | - Intended use: | ||
| 7 | - Out-of-scope use: | ||
| 8 | |||
| 9 | ## 1. Model architecture | ||
| 10 | |||
| 11 | ```mermaid | ||
| 12 | flowchart LR | ||
| 13 | A[Input Audio] --> B[128 Mel + BandSplit] | ||
| 14 | B --> C[Encoder] | ||
| 15 | C --> D[Embedding] | ||
| 16 | D --> E[Hybrid Retrieval] | ||
| 17 | ``` | ||
| 18 | |||
| 19 | ## 2. Key information | ||
| 20 | |||
| 21 | | Item | Content | | ||
| 22 | |---|---| | ||
| 23 | | Training data | | | ||
| 24 | | Evaluation data | | | ||
| 25 | | Primary metrics | | | ||
| 26 | | Known risks | | | ||
| 27 | | License constraints | | | ||
| 28 | |||
| 29 | ## 3. Notes | ||
| 30 | - Training method: | ||
| 31 | - Model limitations: | ||
| 32 | - Risk notes: | ||
| 33 | |||
| 34 | ## 4. Appendix | ||
| 35 | - checkpoint path | ||
| 36 | - config path | ||
| 37 | - benchmark report path | ||
| 38 | |||
| 39 | ## Sources | ||
| 40 | - `docs/dataset-spec.md` | ||
| 41 | - `docs/benchmark-report-template.md` |
docs/open-dataset-workflow.md
deleted
100644 → 0
| 1 | # Open Dataset Workflow | ||
| 2 | |||
| 3 | ## 0. Local real-data readiness check | ||
| 4 | |||
| 5 | Before running `smoke-local`, confirm the directory actually contains enough audio: | ||
| 6 | |||
| 7 | ```bash | ||
| 8 | /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 9 | /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 10 | ``` | ||
| 11 | |||
| 12 | Criteria: | ||
| 13 | |||
| 14 | - At least `2` audio files | ||
| 15 | - At least `2` files with duration `>= 8s` from which a query can be cut | ||
| 16 | - Proceed to the full smoke only when `ready_for_smoke=true` | ||
| 17 | |||
| 18 | If the directory is empty, the status snapshot script also reports it explicitly as not ready. | ||
| 19 | |||
| 20 | |||
| 21 | > Updated: 2026-06-02 | ||
| 22 | |||
| 23 | ## One-page summary | ||
| 24 | |||
| 25 | To actually bring an open music directory such as FMA / MTG-Jamendo into the project, remember just this one chain: | ||
| 26 | |||
| 27 | 1. **inspect-local / inspect-batch** | ||
| 28 | 2. **prepare-local** | ||
| 29 | 3. **validate-local** | ||
| 30 | 4. Then move on to training and evaluation | ||
| 31 | 5. Generate benchmark / model card / release artifacts | ||
| 32 | 6. Or use the one-shot `smoke-local` directly | ||
| 33 | |||
| 34 | --- | ||
| 35 | |||
| 36 | ## 1. 工作流图 | ||
| 37 | |||
| 38 | ```mermaid | ||
| 39 | flowchart LR | ||
| 40 | A[Local Open Audio Dir] --> B[inspect-local / inspect-batch] | ||
| 41 | B --> C[prepare-local] | ||
| 42 | C --> D[validate-local] | ||
| 43 | D --> E[train.py] | ||
| 44 | D --> F[evaluate.py] | ||
| 45 | F --> G[artifact bundle] | ||
| 46 | ``` | ||
| 47 | |||
| 48 | --- | ||
| 49 | |||
| 50 | ## 2. 最短命令表 | ||
| 51 | |||
| 52 | | 步骤 | 命令 | 作用 | | ||
| 53 | |---|---|---| | ||
| 54 | | 预检查 | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `inspect-local ...` | 看规模是否足够 | | ||
| 55 | | 批量比较 | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `inspect-batch ...` | 比较多个候选目录 | | ||
| 56 | | 生成清单 | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `prepare-local ...` | 产出 train/test/catalog | | ||
| 57 | | 训练前校验 | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `validate-local ...` | 确认结构正确 | | ||
| 58 | | 训练 smoke | [`train.py`](../acr-engine/train.py) `--data ... --dry-run` | 验证 manifests 可直接进入训练 | | ||
| 59 | | 发布制品 | [`scripts/generate_artifacts.py`](../acr-engine/scripts/generate_artifacts.py) | 生成 benchmark/model-card/release-checklist | | ||
| 60 | | 一键 smoke | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `smoke-local ...` | 自动跑完整链路 | | ||
| 61 | |||
| 62 | --- | ||
| 63 | |||
| 64 | ## 3. 推荐顺序 | ||
| 65 | |||
| 66 | ### 3.1 单目录 | ||
| 67 | |||
| 68 | ```bash | ||
| 69 | /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 70 | /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0 | ||
| 71 | /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0 --query-stride 4.0 | ||
| 72 | /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests | ||
| 73 | /usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run | ||
| 74 | /usr/local/miniconda3/bin/python run_demo.py build-index --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --output data/index_fma_smoke --device cpu | ||
| 75 | |||
| 76 | # 如果长时间 CPU 建索引被中断,可从 partial checkpoint 续跑 | ||
| 77 | /usr/local/miniconda3/bin/python run_demo.py build-index \ | ||
| 78 | --data data/external_ingested/fma/manifests \ | ||
| 79 | --model data/models_fma_smoke/best_model.pt \ | ||
| 80 | --output data/index_fma_smoke \ | ||
| 81 | --device cpu \ | ||
| 82 | --resume \ | ||
| 83 | --checkpoint-every-refs 100 | ||
| 84 | |||
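| 85 | Notes: | ||
| 86 | - `smoke-local` now also enables `--resume` for `build-index` internally by default | ||
| 87 | - The checkpoint records a `model_signature` | ||
| 88 | - If the newly trained `best_model.pt` is not the same model as the old partial checkpoint, resume is rejected automatically and the index is rebuilt from 0, so embeddings from different models are never mixed | ||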
| 89 | |||
| 90 | ## 小规模策略 A/B smoke | ||
| 91 | |||
| 92 | 如果你想快速比较不同 query / training 切片策略,可直接运行: | ||
| 93 | |||
| 94 | ```bash | ||
| 95 | /usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_segmentation.py \ | ||
| 96 | --dataset fma \ | ||
| 97 | --input-dir acr-engine/data/raw/fma_small_audio \ | ||
| 98 | --work-root /tmp/ab_smoke_seg \ | ||
| 99 | --subset-size 8 \ | ||
| 100 | --query-duration 8 \ | ||
| 101 | --train-epochs 1 \ | ||
| 102 | --batch-size 2 \ | ||
| 103 | --device cpu \ | ||
| 104 | --output-json /tmp/ab_smoke_seg/report.json | ||
| 105 | ``` | ||
| 106 | |||
| 107 | The script currently compares: | ||
| 108 | - `random` | ||
| 109 | - `silence_aware` | ||
| 110 | - `high_energy` | ||
| 111 | - `beat_aware` | ||
| 112 | - `repeated_section_aware` | ||
| 113 | - `hybrid` | ||
| 114 | |||
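| 115 | Ranking rule: | ||
| 116 | - First by `top1` | ||
| 117 | - Then by `topk` | ||
| 118 | - Finally by `num_queries` | ||
| 119 | |||
| 120 | This way, when `top1` and `topk` are tied, the strategy that **covers more queries** is ranked first, rather than a strategy with fewer queries being mistakenly placed on top. | ||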
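
The ranking rule amounts to a descending sort on a three-part key. The sketch below is an illustration only: the dict layout and the function name `rank_strategies` are assumptions, not the structure used by `ab_smoke_segmentation.py`. The sample numbers are the `prefix_000_a` bucket values reported later in this file.

```python
from typing import Dict, List


def rank_strategies(results: List[Dict]) -> List[Dict]:
    """Hypothetical helper: order by top1, then topk, then num_queries, all descending."""
    return sorted(
        results,
        key=lambda r: (r["top1"], r["topk"], r["num_queries"]),
        reverse=True,
    )


# top1 and topk are tied, so query coverage decides the order.
ranked = rank_strategies([
    {"strategy": "high_energy", "num_queries": 3, "top1": 1.0, "topk": 1.0},
    {"strategy": "hybrid", "num_queries": 4, "top1": 1.0, "topk": 1.0},
])
```

Because tuples compare element by element, `num_queries` only matters once both accuracy figures are equal, which is the tie-break described above.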
| 121 | |||
| 122 | 如果你要做**更公平**的策略比较,建议再加 `--max-test-queries`,让每个策略在同样的 query 预算下评测: | ||
| 123 | |||
| 124 | ```bash | ||
| 125 | /usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_segmentation.py \ | ||
| 126 | --dataset fma \ | ||
| 127 | --input-dir acr-engine/data/raw/fma_small_audio \ | ||
| 128 | --work-root /tmp/ab_smoke_seg_cap \ | ||
| 129 | --subset-size 6 \ | ||
| 130 | --query-duration 8 \ | ||
| 131 | --train-epochs 1 \ | ||
| 132 | --batch-size 2 \ | ||
| 133 | --device cpu \ | ||
| 134 | --strategies hybrid \ | ||
| 135 | --max-test-queries 5 \ | ||
| 136 | --output-json /tmp/ab_smoke_seg_cap/report.json | ||
| 137 | ``` | ||
| 138 | |||
| 139 | 已验证: | ||
| 140 | - 最终报告会显式记录 `max_test_queries` | ||
| 141 | - `evaluate.py` 会按 `--seed` 复现抽样 | ||
| 142 | - 端到端 smoke 报告中的 `num_queries` 已成功收敛到 `5` | ||
| 143 | |||
| 144 | 这一步的意义是: | ||
| 145 | - 之前的 A/B 排名更偏“覆盖能力” | ||
| 146 | - 加上 cap 后,可以更公平地比较“同等 query 成本下的识别质量” | ||
| 147 | |||
| 148 | ### 最新真实 FMA capped 结果(subset=16, `max_test_queries=12`) | ||
| 149 | |||
| 150 | 已完成一轮更公平的真实 FMA A/B: | ||
| 151 | |||
| 152 | | 排名 | 策略 | num_queries | top1 | topk | | ||
| 153 | |---:|---|---:|---:|---:| | ||
| 154 | | 1 | `hybrid` | 12 | 1.0 | 1.0 | | ||
| 155 | | 2 | `high_energy` | 12 | 1.0 | 1.0 | | ||
| 156 | | 3 | `beat_aware` | 12 | 0.9167 | 1.0 | | ||
| 157 | | 4 | `repeated_section_aware` | 12 | 0.8333 | 1.0 | | ||
| 158 | |||
| 159 | 当前建议: | ||
| 160 | - **默认训练 / query 策略仍优先 `hybrid`** | ||
| 161 | - `high_energy` 是当前最强的并列次选,适合更偏主段/高能区的数据 | ||
| 162 | - `beat_aware` 更适合规则节拍较强的风格,但在这轮 FMA 子集上略弱 | ||
| 163 | - `repeated_section_aware` 单独使用不如混合策略稳 | ||
| 164 | |||
| 165 | ### 更新:更大 cap24 top2 对照(subset=24, `max_test_queries=16`) | ||
| 166 | |||
| 167 | 在更大的真实 FMA 子集上,只保留前两名策略继续对照: | ||
| 168 | |||
| 169 | | 排名 | 策略 | num_queries | top1 | topk | | ||
| 170 | |---:|---|---:|---:|---:| | ||
| 171 | | 1 | `hybrid` | 16 | 1.0 | 1.0 | | ||
| 172 | | 2 | `high_energy` | 16 | 0.8125 | 1.0 | | ||
| 173 | |||
| 174 | 这轮结果比 cap16 更有区分度,说明: | ||
| 175 | - `hybrid` 不只是“和 `high_energy` 打平” | ||
| 176 | - 在更大的真实子集上,`hybrid` 的稳定性更强 | ||
| 177 | - 当前默认推荐应明确收敛到 **`hybrid`** | ||
| 178 | |||
| 179 | ### 更新:cap32 top2 对照(subset=32, `max_test_queries=20`) | ||
| 180 | |||
| 181 | 进一步扩大到 32 首真实 FMA 子集后,结论继续强化: | ||
| 182 | |||
| 183 | | 排名 | 策略 | num_queries | top1 | topk | | ||
| 184 | |---:|---|---:|---:|---:| | ||
| 185 | | 1 | `hybrid` | 20 | 0.95 | 1.0 | | ||
| 186 | | 2 | `high_energy` | 20 | 0.5 | 1.0 | | ||
| 187 | |||
| 188 | 这说明: | ||
| 189 | - `hybrid` 在更大真实子集上仍明显领先 | ||
| 190 | - `high_energy` 虽然可作为高能区偏置策略,但稳定性不足以成为默认 | ||
| 191 | - 当前默认策略已经可以稳定写死为 **`hybrid`** | ||
| 192 | |||
| 193 | ### 更新:cap48 top2 对照(subset=48, `max_test_queries=24`) | ||
| 194 | |||
| 195 | 继续扩大到 48 首真实 FMA 子集后,出现了**结果反转**: | ||
| 196 | |||
| 197 | | 排名 | 策略 | num_queries | top1 | topk | | ||
| 198 | |---:|---|---:|---:|---:| | ||
| 199 | | 1 | `high_energy` | 24 | 0.9167 | 1.0 | | ||
| 200 | | 2 | `hybrid` | 24 | 0.7917 | 1.0 | | ||
| 201 | |||
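| 202 | This round shows: | ||
| 203 | - The earlier cap24 / cap32 runs support `hybrid` | ||
| 204 | - But on cap48 `high_energy` overtakes it | ||
| 205 | - The conclusion should therefore change from "the default strategy is fully fixed" to: | ||
| 206 | - **`hybrid` remains the conservative default for now** | ||
| 207 | - **`high_energy` has become a strong competitor** | ||
| 208 | - The next step must be a re-check with a larger sample or multiple random seeds; a single cap48 run is not enough to change the default | ||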
| 209 | |||
| 210 | ### 更新:cap48 第二个 seed 复核(subset=48, `max_test_queries=24`, `seed=123`) | ||
| 211 | |||
| 212 | 对同一规模再跑第二个 seed 后,结果又回到 `hybrid` 领先: | ||
| 213 | |||
| 214 | | 排名 | 策略 | num_queries | top1 | topk | | ||
| 215 | |---:|---|---:|---:|---:| | ||
| 216 | | 1 | `hybrid` | 24 | 0.9583 | 1.0 | | ||
| 217 | | 2 | `high_energy` | 24 | 0.9167 | 1.0 | | ||
| 218 | |||
| 219 | 这说明: | ||
| 220 | - cap48 的策略排名对 seed / 抽样子集 **敏感** | ||
| 221 | - 单次 cap48 不能作为“high_energy 已全面反超”的充分证据 | ||
| 222 | - 当前最稳妥结论仍是: | ||
| 223 | - `hybrid` 保留为保守默认 | ||
| 224 | - `high_energy` 保留为强竞争方案 | ||
| 225 | - 后续需要 **多 seed 聚合结论**,而不是看单次跑分 | ||
| 226 | |||
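| 227 | ### cap48 multi-seed aggregate summary (2 runs so far) | ||
| 228 | |||
| 229 | Putting the two cap48 seeds side by side: | ||
| 230 | |||
| 231 | | Strategy | runs | mean_top1 | min_top1 | max_top1 | stdev_top1 | mean_topk | | ||
| 232 | |---|---:|---:|---:|---:|---:|---:| | ||
| 233 | | `high_energy` | 2 | 0.9167 | 0.9167 | 0.9167 | 0.0 | 1.0 | | ||
| 234 | | `hybrid` | 2 | 0.8750 | 0.7917 | 0.9583 | 0.0833 | 1.0 | | ||
| 235 | |||
| 236 | This can be read as follows (the `stdev_top1` values match a population standard deviation over the two runs): | ||
| 237 | - On these two cap48 runs, `high_energy` has the **higher mean and is more stable** | ||
| 238 | - `hybrid` is stronger on the second seed, but also fluctuates more | ||
| 239 | - So the most accurate statement right now is not "who wins outright", but: | ||
| 240 | - **On cap48, the aggregate mean of `high_energy` is temporarily ahead** | ||
| 241 | - **`hybrid` is still the more conservative default candidate** | ||
| 242 | - The final default should still wait for confirmation from more seeds or a larger sample | ||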
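
The aggregate table can be reproduced from the two per-seed `top1` values with the standard library. This is a sketch under assumptions: the `runs` dict and the `aggregate` helper are illustrative names, and the population standard deviation (`statistics.pstdev`) is used because it is the variant that reproduces the `0.0833` shown for `hybrid`; the real aggregation script is not shown in this document.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Per-seed top1 values taken from the two cap48 tables in this file.
runs: Dict[str, List[float]] = {
    "high_energy": [0.9167, 0.9167],
    "hybrid": [0.7917, 0.9583],
}


def aggregate(top1_values: List[float]) -> Dict[str, float]:
    """Hypothetical helper: summarize top1 across seeds, rounded to 4 decimals."""
    return {
        "runs": len(top1_values),
        "mean_top1": round(mean(top1_values), 4),
        "min_top1": min(top1_values),
        "max_top1": max(top1_values),
        "stdev_top1": round(pstdev(top1_values), 4),
    }


summary = {name: aggregate(values) for name, values in runs.items()}
```

With only two runs per strategy the spread estimate is very coarse, which is consistent with the call above for more seeds before changing the default.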
| 243 | /usr/local/miniconda3/bin/python evaluate.py --data data/external_ingested/fma/manifests --model data/models_fma_smoke/best_model.pt --index-prefix data/index_fma_smoke/reference --split test --device cpu --fast-eval --output-json reports/fma-smoke/eval.json | ||
| 244 | /usr/local/miniconda3/bin/python scripts/generate_artifacts.py --eval-json reports/fma-smoke/eval.json --config-json reports/fma-smoke/config.json --output-dir reports/fma-smoke --model-version fma-smoke --data-version fma_local | ||
| 245 | ``` | ||
| 246 | |||
| 247 | ### 3.2 多目录比较 | ||
| 248 | |||
| 249 | ```bash | ||
| 250 | /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-batch fma=data/raw/fma_small_audio mtg_jamendo=data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 251 | ``` | ||
| 252 | |||
| 253 | ### 3.3 一键 smoke | ||
| 254 | |||
| 255 | ```bash | ||
| 256 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 | ||
| 257 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 --device auto | ||
| 258 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --query-stride 4.0 --train-epochs 1 --batch-size 2 --device auto | ||
| 259 | ``` | ||
| 260 | |||
| 261 | 真实目录放置位置可参考: | ||
| 262 | - [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md) | ||
| 263 | - [acr-engine/data/raw/fma_small_audio/](../acr-engine/data/raw/fma_small_audio/) | ||
| 264 | - [acr-engine/data/raw/mtg_jamendo_audio/](../acr-engine/data/raw/mtg_jamendo_audio/) | ||
| 265 | |||
| 266 | --- | ||
| 267 | |||
| 268 | ## 4. 输出物说明 | ||
| 269 | |||
| 270 | - [catalog.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/catalog.json):建索引用 reference 清单 | ||
| 271 | - [train.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/train.json):训练 queries + references | ||
| 272 | - [test.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/test.json):固定评估 queries + references | ||
| 273 | - [val.json](../acr-engine/data/external_ingested/demo_via_adapter/fma/manifests/val.json):可选验证集 | ||
| 274 | |||
| 275 | --- | ||
| 276 | |||
| 277 | ## 5. 当前验证证据 | ||
| 278 | |||
| 279 | 已在本地 `data/synthetic_v2/songs` 上按开放数据流程跑通: | ||
| 280 | |||
| 281 | - `inspect-local`: | ||
| 282 | - `num_audio_files=24` | ||
| 283 | - `recommended_train_queries=19` | ||
| 284 | - `recommended_test_queries=5` | ||
| 285 | - `prepare-local`: | ||
| 286 | - `catalog=24` | ||
| 287 | - `train_queries=16` | ||
| 288 | - `test_queries=8` | ||
| 289 | - `validate-local`: | ||
| 290 | - `ok=true` | ||
| 291 | - `train.py --dry-run`: | ||
| 292 | - `Dry run passed! Pipeline is working.` | ||
| 293 | - `build-index + evaluate`: | ||
| 294 | - `top1=1.0` | ||
| 295 | - `topk=1.0` | ||
| 296 | - `generate_artifacts`: | ||
| 297 | - `benchmark-report.md` | ||
| 298 | - `model-card.md` | ||
| 299 | - `release-checklist.md` | ||
| 300 | - `smoke-local`: | ||
| 301 | - Returns a summary of the inspect / prepare / validate / report paths in one go | ||
| 302 | - Now supports `--device cpu|cuda|auto` | ||
| 303 | - `auto` is resolved to the actual device inside smoke, so the literal string `auto` is never passed to the embedding/eval side | ||
| 304 | - Now supports `--query-stride` | ||
| 305 | - When `--query-stride < query-duration`, multiple overlapping queries are generated per song instead of a single random query | ||
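
To make the stride behaviour concrete, here is a minimal sketch of how overlapping query windows could be laid out. It assumes fixed-step windows starting at 0 s; the function name `query_starts` is hypothetical, and the real `--query-stride` logic in `external_adapters.py` may place windows differently.

```python
from typing import List


def query_starts(track_s: float,
                 query_duration: float = 8.0,
                 query_stride: float = 4.0) -> List[float]:
    """Hypothetical helper: start offsets (seconds) of every full window that fits."""
    starts: List[float] = []
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while i * query_stride + query_duration <= track_s:
        starts.append(i * query_stride)
        i += 1
    return starts
```

For a 20 s track with an 8 s query and a 4 s stride this yields four windows, each overlapping its neighbour by 4 s; with a stride equal to the query duration the windows would simply tile the track without overlap.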
| 306 | |||
| 307 | --- | ||
| 308 | |||
| 309 | |||
| 310 | ### FMA 下载完成后的单条准备命令 | ||
| 311 | |||
| 312 | ```bash | ||
| 313 | cd acr-engine | ||
| 314 | /usr/local/miniconda3/bin/python scripts/fma_postdownload_ready.py | ||
| 315 | ``` | ||
| 316 | |||
| 317 | 这个脚本会在归档完整时自动执行: | ||
| 318 | |||
| 319 | 1. `extract` | ||
| 320 | 2. `check-local-ready` | ||
| 321 | 3. `inspect-local` | ||
| 322 | |||
| 323 | 如果归档还没下完,会返回结构化 `archive_not_complete`。 | ||
| 324 | |||
| 325 | |||
| 326 | ### FMA 完成前等待并自动切换 | ||
| 327 | |||
| 328 | ```bash | ||
| 329 | cd acr-engine | ||
| 330 | /usr/local/miniconda3/bin/python scripts/wait_for_fma_and_prepare.py --interval 30 --max-cycles 120 | ||
| 331 | ``` | ||
| 332 | |||
| 333 | 作用: | ||
| 334 | |||
| 335 | - 周期性检查 `fma_small.zip` 是否完成 | ||
| 336 | - 一旦完成,自动进入 [scripts/fma_postdownload_ready.py](../acr-engine/scripts/fma_postdownload_ready.py) | ||
| 337 | - 如果还没完成,则返回 `waiting` 和最近的进度快照 | ||
| 338 | |||
| 339 | ## Sources | ||
| 340 | - See [dataset-spec.md](./dataset-spec.md) | ||
| 341 | - See [dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md) | ||
| 342 | |||
| 343 | |||
| 344 | ### Bucket / style-aware benchmark 基线 | ||
| 345 | |||
| 346 | 为了避免只看单一子集规模,现在仓库里已经有可运行的 bucket benchmark 基线: | ||
| 347 | - [../acr-engine/scripts/ab_smoke_bucketed.py](../acr-engine/scripts/ab_smoke_bucketed.py) | ||
| 348 | |||
| 349 | 它的作用是: | ||
| 350 | 1. 从同一大目录中按 pattern 划出多个 bucket | ||
| 351 | 2. 每个 bucket 各自运行 `ab_smoke_segmentation.py` | ||
| 352 | 3. 生成 bucket 级 winner 与 aggregate summary | ||
| 353 | |||
| 354 | 最小 smoke 已验证: | ||
| 355 | - bucket: `prefix_000_a` | ||
| 356 | - `hybrid`: `4 / 1.0 / 1.0` | ||
| 357 | - `high_energy`: `3 / 1.0 / 1.0` | ||
| 358 | - winner: `hybrid` | ||
| 359 | |||
| 360 | 完整 bucket 汇总现已完成: | ||
| 361 | - `prefix_000_a` winner=`hybrid` | ||
| 362 | - `prefix_000_b` winner=`high_energy` | ||
| 363 | - aggregate: | ||
| 364 | - `hybrid`:`mean_top1=1.0, mean_topk=1.0, mean_num_queries=4.0` | ||
| 365 | - `high_energy`:`mean_top1=1.0, mean_topk=1.0, mean_num_queries=3.5` | ||
| 366 | |||
| 367 | 当前结论: | ||
| 368 | - bucket baseline 已经能稳定复现“不同子集会选出不同 winner”。 | ||
| 369 | - 下一步不是继续做 prefix toy bucket,而是升级到更有业务意义的 bucket。 | ||
| 370 | |||
| 371 | 推荐直接从模板开始: | ||
| 372 | - [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json) | ||
| 373 | |||
| 374 | 建议先人工挑一批歌,再把 glob 替换成你自己的候选集合,优先覆盖: | ||
| 375 | 1. `energy_dominant` | ||
| 376 | 2. `repeated_section_rich` | ||
| 377 | 3. `steady_beat_regular_meter` | ||
| 378 | 4. `hard_negative_confusable` | ||
| 379 | |||
| 380 | 对应命令: | ||
| 381 | |||
| 382 | ```bash | ||
| 383 | cd /workspace/acr-engine | ||
| 384 | /usr/local/miniconda3/bin/python scripts/ab_smoke_bucketed.py \ | ||
| 385 | --dataset fma \ | ||
| 386 | --input-dir data/raw/fma_small_audio \ | ||
| 387 | --bucket-config configs/buckets/fma_semantic_bucket_template.json \ | ||
| 388 | --work-root /tmp/ab_smoke_bucketed_semantic \ | ||
| 389 | --default-subset-size 16 \ | ||
| 390 | --query-duration 8 \ | ||
| 391 | --train-epochs 1 \ | ||
| 392 | --batch-size 2 \ | ||
| 393 | --device cpu \ | ||
| 394 | --strategies high_energy hybrid \ | ||
| 395 | --max-test-queries 8 \ | ||
| 396 | --seed 42 \ | ||
| 397 | --output-json /tmp/ab_smoke_bucketed_semantic/report.json | ||
| 398 | ``` | ||
| 399 | |||
| 400 | |||
| 401 | ### 业务素材 bucket 模板 | ||
| 402 | |||
| 403 | 如果下一步不是继续用 FMA,而是切到你们自己的歌曲/BGM 素材,优先看: | ||
| 404 | - [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md) | ||
| 405 | - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json) |
| ... | @@ -256,6 +256,132 @@ window -> fingerprint / embedding -> candidate -> aggregate | ... | @@ -256,6 +256,132 @@ window -> fingerprint / embedding -> candidate -> aggregate |
| 256 | 256 | ||
| 257 | --- | 257 | --- |
| 258 | 258 | ||
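| 259 | ## 1.2 Change in current business premise: versions do not matter for now, go song-centric first | ||
| 260 | |||
| 261 | If the current business constraint is: | ||
| 262 | |||
| 263 | > **One song may have multiple recordings or multiple audio files, but version semantics are not a concern for now; everything only needs to resolve stably to the same `song_id`** | ||
| 264 | |||
| 265 | then the recommended default for the current Phase-1 narrows further to: | ||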
| 266 | |||
| 267 | ```text | ||
| 268 | song -> asset -> window -> feature | ||
| 269 | ``` | ||
| 270 | |||
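| 271 | In other words: | ||
| 272 | - `song` is currently the only ownership object that must be returned stably | ||
| 273 | - Multiple audio files may exist under the same `song` | ||
| 274 | - These audio files may come from different sources: official, crawled, BGM, clips, query samples, and so on | ||
| 275 | - For now, "recording version differences" are not promoted to a core layer that must be modeled separately | ||
| 276 | |||
| 277 | ### Currently recommended physical implementation | ||
| 278 | |||
| 279 | Under this business premise, the recommendation is to adopt **3+1 fused tables** directly: | ||
| 280 | |||
| 281 | | Physical table | Main types | Current role | | ||
| 282 | |---|---|---| | ||
| 283 | | `media_entity` | `song` | Holds only the final business ownership object | | ||
| 284 | | `audio_object` | `asset`, `window` | Holds audio files and slice windows | | ||
| 285 | | `feature_fact` | `fingerprint`, `embedding` | Holds retrieval feature facts | | ||
| 286 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | Holds set relations such as reference / eval | | ||
| 287 | |||
| 288 | Corresponding logical main chain: | ||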
| 289 | |||
| 290 | ```text | ||
| 291 | song -> asset -> window -> feature | ||
| 292 | ``` | ||
| 293 | |||
| 294 | ### Which tables slices, models, and features land in | ||
| 295 | |||
| 296 | Under the current **song-centric + fusion-first** design, read it as follows: | ||
| 297 | |||
| 298 | | Object of interest | Recommended table | Key type / fields | Role | | ||
| 299 | |---|---|---|---| | ||
| 300 | | Song master entity | `media_entity` | `entity_type=song` | Which `song_id` the result finally belongs to | | ||
| 301 | | Raw audio file | `audio_object` | `object_type=asset`, `song_id`, `storage_uri`, `checksum` | Stores the multiple audio files under one song | | ||
| 302 | | Slice window | `audio_object` | `object_type=window`, `parent_object_id=<asset_id>`, `start_ms`, `end_ms` | Stores the retrieval windows cut from an asset | | ||
| 303 | | Model info | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | Records which model and which parameter set computed the feature | | ||
| 304 | | Fingerprint feature | `feature_fact` | `feature_type=fingerprint`, `fingerprint_value` | Stores exact-lane features | | ||
| 305 | | Embedding feature | `feature_fact` | `feature_type=embedding`, `embedding_dim`, `embedding_uri`, `vector_table_name` | Stores semantic-lane features | | ||
| 306 | | Reference / eval membership | `set_membership` | `set_type`, `member_type`, `member_id` | Decides which assets/windows/songs enter the reference set | | ||
| 307 | |||
| 308 | The key point is: | ||
| 309 | |||
| 310 | > **Slices themselves also land in `audio_object`, just with `object_type=window`; models and features land uniformly in `feature_fact`.** | ||
| 311 | |||
| 312 | ### Corresponding flowchart | ||
| 313 | |||
| 314 | ```mermaid | ||
| 315 | flowchart TD | ||
| 316 | A[media_entity | ||
| 317 | entity_type=song] --> B[audio_object | ||
| 318 | object_type=asset] | ||
| 319 | B --> C[audio_object | ||
| 320 | object_type=window] | ||
| 321 | C --> D1[feature_fact | ||
| 322 | feature_type=fingerprint] | ||
| 323 | C --> D2[feature_fact | ||
| 324 | feature_type=embedding] | ||
| 325 | D1 --> E[set_membership | ||
| 326 | reference_set / eval_set] | ||
| 327 | D2 --> E | ||
| 328 | ``` | ||
| 329 | |||
| 330 | ### Corresponding write flow | ||
| 331 | |||
| 332 | ```mermaid | ||
| 333 | sequenceDiagram | ||
| 334 | participant ING as Ingest Job | ||
| 335 | participant DB as PostgreSQL | ||
| 336 | |||
| 337 | ING->>DB: write media_entity(song) | ||
| 338 | ING->>DB: write audio_object(asset) | ||
| 339 | ING->>DB: cut windows then write audio_object(window) | ||
| 340 | ING->>DB: write feature_fact(fingerprint) | ||
| 341 | ING->>DB: write feature_fact(embedding) | ||
| 342 | ING->>DB: write set_membership(reference/eval) | ||
| 343 | ``` | ||
| 344 | |||
| 345 | ### A practical query back-trace convention | ||
| 346 | |||
| 347 | If a query hits an embedding/fingerprint row, the back-trace path is: | ||
| 348 | |||
| 349 | ```text | ||
| 350 | feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song) | ||
| 351 | ``` | ||
| 352 | |||
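| 353 | This chain is enough to answer the questions that matter most right now: | ||
| 354 | - Which audio file this slice came from | ||
| 355 | - Which `song_id` that audio file belongs to | ||
| 356 | - Which model / feature set computed this feature | ||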
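
The back-trace chain can be sketched as a three-way join. The example below is an assumption-heavy illustration rather than the project's DDL: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, the key columns `object_id` and `feature_id` and the `feature_fact.object_id` link to the window are invented for the sketch, and the sample rows are made up. Only the table names and the `entity_type`, `object_type`, `song_id`, `parent_object_id`, `storage_uri`, `start_ms`, `end_ms`, `feature_type`, `model_name`, and `model_version` fields come from the tables above.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE media_entity (
        entity_id   INTEGER PRIMARY KEY,
        entity_type TEXT NOT NULL
    );
    CREATE TABLE audio_object (
        object_id        INTEGER PRIMARY KEY,  -- assumed key name
        object_type      TEXT NOT NULL CHECK (object_type IN ('asset', 'window')),
        song_id          INTEGER REFERENCES media_entity(entity_id),
        parent_object_id INTEGER REFERENCES audio_object(object_id),
        storage_uri      TEXT,
        start_ms         INTEGER,
        end_ms           INTEGER
    );
    CREATE TABLE feature_fact (
        feature_id    INTEGER PRIMARY KEY,     -- assumed key name
        object_id     INTEGER NOT NULL REFERENCES audio_object(object_id),  -- assumed link to the window
        feature_type  TEXT NOT NULL,
        model_name    TEXT,
        model_version TEXT
    );
    -- Made-up sample rows: one song, one asset, one 8 s window, one embedding.
    INSERT INTO media_entity VALUES (1, 'song');
    INSERT INTO audio_object VALUES (10, 'asset', 1, NULL, 'file:///tmp/demo.wav', NULL, NULL);
    INSERT INTO audio_object VALUES (11, 'window', 1, 10, NULL, 0, 8000);
    INSERT INTO feature_fact VALUES (100, 11, 'embedding', 'demo-encoder', 'v0');
    """
)

# feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)
BACKTRACE_SQL = """
SELECT s.entity_id, a.storage_uri, w.start_ms, w.end_ms, f.model_name, f.model_version
FROM feature_fact AS f
JOIN audio_object AS w ON w.object_id = f.object_id AND w.object_type = 'window'
JOIN audio_object AS a ON a.object_id = w.parent_object_id AND a.object_type = 'asset'
JOIN media_entity AS s ON s.entity_id = a.song_id AND s.entity_type = 'song'
WHERE f.feature_id = ?
"""


def backtrace(feature_id: int):
    """Return (song_id, asset_uri, start_ms, end_ms, model_name, model_version) or None."""
    return conn.execute(BACKTRACE_SQL, (feature_id,)).fetchone()
```

One row of the result answers all three questions in the list above: the asset URI identifies the source file, `entity_id` is the `song_id`, and the model columns identify what computed the feature.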
| 357 | |||
| 358 | --- | ||
| 359 | |||
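| 360 | ### Why `recording` need not be a strong entity for now | ||
| 361 | |||
| 362 | Because you currently do not care about: | ||
| 363 | - Strict version distinction between official / live / remaster | ||
| 364 | - Separate archiving for a cover/version lane | ||
| 365 | - Results that must be precise down to `recording_id` | ||
| 366 | |||
| 367 | What you actually care about right now is: | ||
| 368 | |||
| 369 | > Whether this batch of audio, from different sources and in different forms, can all be resolved stably to the same `song_id` | ||
| 370 | |||
| 371 | With this goal, making `recording` a strong primary layer raises the cost of understanding while bringing limited benefit to the current first phase. | ||
| 372 | |||
| 373 | ### But this does not mean `recording` is never needed | ||
| 374 | |||
| 375 | The recommended approach is: | ||
| 376 | - **The current schema does not push `recording` by default** | ||
| 377 | - If version ownership starts to matter later, promote `recording` out of `media_entity(entity_type=recording)` or `audio_object.metadata_json` | ||
| 378 | |||
| 379 | In other words: | ||
| 380 | - **For now, do song-centric retrieval ownership** | ||
| 381 | - **Later, evolve toward recording-centric / work-centric governance** | ||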
| 383 | --- | ||
| 384 | |||
| 259 | ## 1.2.1 融合优先:逻辑分层保留,物理表尽量收敛 | 385 | ## 1.2.1 融合优先:逻辑分层保留,物理表尽量收敛 |
| 260 | 386 | ||
| 261 | 如果你的核心诉求是: | 387 | 如果你的核心诉求是: |
| ... | @@ -264,7 +390,7 @@ window -> fingerprint / embedding -> candidate -> aggregate | ... | @@ -264,7 +390,7 @@ window -> fingerprint / embedding -> candidate -> aggregate |
| 264 | 390 | ||
| 265 | 那么推荐采用下面这个口径: | 391 | 那么推荐采用下面这个口径: |
| 266 | 392 | ||
| 267 | - **逻辑层** 仍然保留 `song / recording / asset / window / feature` | 393 | - **逻辑层** 当前默认保留 `song / asset / window / feature`;`recording` 仅保留为未来扩展语义 |
| 268 | - **物理层** 尽量融合成少数几张通用表 | 394 | - **物理层** 尽量融合成少数几张通用表 |
| 269 | 395 | ||
| 270 | 也就是说: | 396 | 也就是说: |
| ... | @@ -275,7 +401,7 @@ window -> fingerprint / embedding -> candidate -> aggregate | ... | @@ -275,7 +401,7 @@ window -> fingerprint / embedding -> candidate -> aggregate |
| 275 | 401 | ||
| 276 | | 物理表 | 主要 type | 作用 | | 402 | | 物理表 | 主要 type | 作用 | |
| 277 | |---|---|---| | 403 | |---|---|---| |
| 278 | | `media_entity` | `song`, `work`, `recording` | 承载业务归属对象 | | 404 | | `media_entity` | `song`(当前默认), `work`/`recording`(未来扩展) | 承载业务归属对象 | |
| 279 | | `audio_object` | `asset`, `window` | 承载真实音频文件与切片对象 | | 405 | | `audio_object` | `asset`, `window` | 承载真实音频文件与切片对象 | |
| 280 | | `feature_fact` | `fingerprint`, `embedding` | 承载检索特征事实 | | 406 | | `feature_fact` | `fingerprint`, `embedding` | 承载检索特征事实 | |
| 281 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | 承载集合归属关系 | | 407 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | 承载集合归属关系 | |
| ... | @@ -292,9 +418,9 @@ media_entity -> audio_object -> feature_fact -> set_membership | ... | @@ -292,9 +418,9 @@ media_entity -> audio_object -> feature_fact -> set_membership |
| 292 | 418 | ||
| 293 | #### `media_entity` | 419 | #### `media_entity` |
| 294 | 用 `entity_type` 区分: | 420 | 用 `entity_type` 区分: |
| 295 | - `song` | 421 | - `song`(当前默认必用) |
| 296 | - `work` | 422 | - `work`(可选) |
| 297 | - `recording` | 423 | - `recording`(未来扩展) |
| 298 | 424 | ||
| 299 | 公共字段可统一为: | 425 | 公共字段可统一为: |
| 300 | - `entity_id` | 426 | - `entity_id` |
| ... | @@ -354,7 +480,7 @@ media_entity -> audio_object -> feature_fact -> set_membership | ... | @@ -354,7 +480,7 @@ media_entity -> audio_object -> feature_fact -> set_membership |
| 354 | 480 | ||
| 355 | 优点: | 481 | 优点: |
| 356 | 1. **新同学更容易理解**:看到的是 3~4 张核心表,而不是十几张专用表 | 482 | 1. **新同学更容易理解**:看到的是 3~4 张核心表,而不是十几张专用表 |
| 357 | 2. **多 type 复用更自然**:`song/work/recording`、`asset/window` 都能用 type 统一表达 | 483 | 2. **更符合当前业务前提**:多个音频直接挂到同一个 `song_id`,先不强区分 recording |
| 358 | 3. **模型演进更平滑**:`feature_fact` 可以同时容纳不同模型与不同特征 | 484 | 3. **模型演进更平滑**:`feature_fact` 可以同时容纳不同模型与不同特征 |
| 359 | 4. **更符合当前目标**:先把识别闭环跑通,而不是先把治理模型拆到很细 | 485 | 4. **更符合当前目标**:先把识别闭环跑通,而不是先把治理模型拆到很细 |
| 360 | 486 | ||
| ... | @@ -395,7 +521,7 @@ song_everything | ... | @@ -395,7 +521,7 @@ song_everything |
| 395 | 521 | ||
| 396 | | 层 | 融合优先推荐表 | 当前作用 | | 522 | | 层 | 融合优先推荐表 | 当前作用 | |
| 397 | |---|---|---| | 523 | |---|---|---| |
| 398 | | 实体层 | `media_entity` | 统一承载 `song/work/recording` | | 524 | | 实体层 | `media_entity` | 当前默认只承载 `song` | |
| 399 | | 音频对象层 | `audio_object` | 统一承载 `asset/window` | | 525 | | 音频对象层 | `audio_object` | 统一承载 `asset/window` | |
| 400 | | 特征层 | `feature_fact` | 统一承载 `fingerprint/embedding` | | 526 | | 特征层 | `feature_fact` | 统一承载 `fingerprint/embedding` | |
| 401 | | 集合层 | `set_membership` | 统一承载 `reference/hot/eval` 等集合关系 | | 527 | | 集合层 | `set_membership` | 统一承载 `reference/hot/eval` 等集合关系 | |
| ... | @@ -406,10 +532,10 @@ song_everything | ... | @@ -406,10 +532,10 @@ song_everything |
| 406 | media_entity -> audio_object -> feature_fact -> set_membership | 532 | media_entity -> audio_object -> feature_fact -> set_membership |
| 407 | ``` | 533 | ``` |
| 408 | 534 | ||
| 409 | 如果按逻辑语义理解,则仍然对应: | 535 | 如果按逻辑语义理解,则当前默认对应: |
| 410 | 536 | ||
| 411 | ```text | 537 | ```text |
| 412 | song/work/recording -> asset/window -> fingerprint/embedding -> reference membership | 538 | song -> asset/window -> fingerprint/embedding -> reference membership |
| 413 | ``` | 539 | ``` |
| 414 | 540 | ||
| 415 | ### 这版极简 schema 明确不要求第一天就重投入的内容 | 541 | ### 这版极简 schema 明确不要求第一天就重投入的内容 | ... | ... |
| 1 | # Production Encoder Freeze & Embedding Strategy / 生产 Encoder 冻结与 Embedding 策略答疑 | 1 | # Production Encoder Freeze & Embedding Strategy / 生产 Encoder 冻结与 Embedding 策略答疑 |
| 2 | 2 | ||
| 3 | > 更新:2026-06-03 | 3 | > 更新:2026-06-03 |
| 4 | > 关联文档:[持续开发交接文档](./session-handoff.md) · [训练数据与 pgvector 指南](./training-data-and-pgvector-guide.md) · [开放数据工作流](./open-dataset-workflow.md) · [服务接口](./service-api.md) | 4 | > 关联文档:[持续开发交接文档](./session-handoff.md) · [PostgreSQL 数据模型](./postgresql-data-model.md) · [Phase-1 实施清单](./phase1-implementation-checklist.md) |
| 5 | 5 | ||
| 6 | ## 一页结论 | 6 | ## 一页结论 |
| 7 | 7 | ||
| ... | @@ -623,9 +623,9 @@ prod_artifacts/ | ... | @@ -623,9 +623,9 @@ prod_artifacts/ |
| 623 | 623 | ||
| 624 | ## Sources | 624 | ## Sources |
| 625 | - [持续开发交接文档](./session-handoff.md) | 625 | - [持续开发交接文档](./session-handoff.md) |
| 626 | - [训练数据与 pgvector 指南](./training-data-and-pgvector-guide.md) | 626 | - [postgresql-data-model.md](./postgresql-data-model.md) |
| 627 | - [开放数据工作流](./open-dataset-workflow.md) | 627 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) |
| 628 | - [服务接口](./service-api.md) | 628 | - [phase1-worker-contract.md](./phase1-worker-contract.md) |
| 629 | - [acr-engine/train.py](../acr-engine/train.py) | 629 | - [acr-engine/train.py](../acr-engine/train.py) |
| 630 | - [acr-engine/run_demo.py](../acr-engine/run_demo.py) | 630 | - [acr-engine/run_demo.py](../acr-engine/run_demo.py) |
| 631 | - [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | 631 | - [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | ... | ... |
docs/project-responsibility-map.md
deleted
100644 → 0
| 1 | # ACR 项目职责图 | ||
| 2 | |||
| 3 | > 更新:2026-06-02 | ||
| 4 | |||
| 5 | ## 一页结论 | ||
| 6 | |||
| 7 | - 本项目已经从“算法原型”升级为“**面向工业化的 ACR 平台雏形**” | ||
| 8 | - 当前系统分为 **数据层、训练层、检索层、服务层、评测层、合规层** | ||
| 9 | - 近期重点不是再堆功能,而是: | ||
| 10 | 1. 提升 `humming_like` / `confused` 准确率 | ||
| 11 | 2. 接入真实白名单数据集 | ||
| 12 | 3. 完善服务、索引、benchmark 与合规闭环 | ||
| 13 | |||
| 14 | --- | ||
| 15 | |||
| 16 | ## 1. 分层图 | ||
| 17 | |||
| 18 | ```mermaid | ||
| 19 | flowchart TD | ||
| 20 | A[L1 业务目标层] --> B[L2 系统能力层] | ||
| 21 | B --> C[L3 核心模块层] | ||
| 22 | C --> D[L4 工程服务层] | ||
| 23 | C --> E[L5 数据与合规层] | ||
| 24 | |||
| 25 | A1[听歌识曲 / 哼唱识别 / 商业可用]:::goal --> A | ||
| 26 | |||
| 27 | B1[高准确率识别] --> B | ||
| 28 | B2[可扩展曲库] --> B | ||
| 29 | B3[可服务化调用] --> B | ||
| 30 | B4[可审计数据来源] --> B | ||
| 31 | |||
| 32 | C1[训练与表征学习] --> C | ||
| 33 | C2[指纹检索] --> C | ||
| 34 | C3[向量检索] --> C | ||
| 35 | C4[混合重排] --> C | ||
| 36 | C5[评测基准] --> C | ||
| 37 | |||
| 38 | D1[FastAPI] --> D | ||
| 39 | D2[Index Build] --> D | ||
| 40 | D3[Manifest Tools] --> D | ||
| 41 | |||
| 42 | E1[External Adapters] --> E | ||
| 43 | E2[Dataset Registry] --> E | ||
| 44 | E3[License Review] --> E | ||
| 45 | |||
| 46 | classDef goal fill:#e8f5e9,stroke:#2e7d32; | ||
| 47 | ``` | ||
| 48 | |||
| 49 | --- | ||
| 50 | |||
| 51 | ## 2. 职责总表 | ||
| 52 | |||
| 53 | | 层级 | 模块 | 负责内容 | 当前状态 | | ||
| 54 | |---|---|---|---| | ||
| 55 | | 数据层 | `src/data/*` | synthetic 数据、external adapters、manifest | 已有基础 | | ||
| 56 | | 训练层 | `train.py` / `src/models/*` | 128 Mel、band-split、embedding 学习 | 可运行 | | ||
| 57 | | 检索层 | `src/engines/*` | chromaprint、embedding、melody-aware hybrid | 可运行 | | ||
| 58 | | 服务层 | `src/service/*` | health / recognize / index build | 骨架已通 | | ||
| 59 | | 评测层 | `evaluate.py` | top1/top5/hard-case benchmark | 已建立 | | ||
| 60 | | 合规层 | registry/docs | dataset source / licensing / whitelist | 雏形已建 | | ||
| 61 | |||
| 62 | --- | ||
| 63 | |||
| 64 | ## 3. 分工图 | ||
| 65 | |||
| 66 | ```mermaid | ||
| 67 | flowchart LR | ||
| 68 | D[数据团队] --> D1[数据接入] | ||
| 69 | D --> D2[manifest 标准化] | ||
| 70 | D --> D3[license 审查] | ||
| 71 | |||
| 72 | M[模型团队] --> M1[特征与模型] | ||
| 73 | M --> M2[鲁棒训练] | ||
| 74 | M --> M3[hard-case 优化] | ||
| 75 | |||
| 76 | R[检索团队] --> R1[指纹索引] | ||
| 77 | R --> R2[向量索引] | ||
| 78 | R --> R3[融合与拒识] | ||
| 79 | |||
| 80 | S[平台团队] --> S1[API 服务] | ||
| 81 | S --> S2[部署] | ||
| 82 | S --> S3[监控] | ||
| 83 | |||
| 84 | Q[质量团队] --> Q1[benchmark] | ||
| 85 | Q --> Q2[回归验证] | ||
| 86 | Q --> Q3[上线门禁] | ||
| 87 | ``` | ||
| 88 | |||
| 89 | --- | ||
| 90 | |||
| 91 | ## 4. 文字说明 | ||
| 92 | |||
| 93 | ### 4.1 数据层 | ||
| 94 | 负责把不同来源的数据集(synthetic、FMA、Jamendo、CCMusic、ModelScope 白名单集)转成统一的 `catalog/query manifest`。 | ||
| 95 | |||
| 96 | ### 4.2 训练层 | ||
| 97 | 负责音乐任务特征建模,目前已经从低维说话人风格输入升级到: | ||
| 98 | - 128 Mel | ||
| 99 | - band-split | ||
| 100 | - retrieval-first 训练方向 | ||
| 101 | |||
| 102 | ### 4.3 检索层 | ||
| 103 | 负责三路信息融合: | ||
| 104 | - 指纹匹配 | ||
| 105 | - embedding 匹配 | ||
| 106 | - melody-aware 重排 | ||
| 107 | |||
| 108 | ### 4.4 服务层 | ||
| 109 | 负责把离线原型包装成可调用系统,目前已有 FastAPI 骨架。 | ||
| 110 | |||
| 111 | ### 4.5 评测层 | ||
| 112 | 负责质量门禁,不能只看总体 top1,要看 hard-case、拒识、误接收。 | ||
| 113 | |||
| 114 | ### 4.6 合规层 | ||
| 115 | 负责商用前提,任何外部数据集都必须进入 registry 和白名单流程。 | ||
| 116 | |||
| 117 | --- | ||
| 118 | |||
| 119 | ## 5. 细节附录 | ||
| 120 | |||
| 121 | 关键文档: | ||
| 122 | - `docs/dataset-spec.md` | ||
| 123 | - `docs/industrial-benchmark-spec.md` | ||
| 124 | - `docs/dataset-sources-and-licensing.md` | ||
| 125 | - `docs/industrialization-roadmap.md` | ||
| 126 | |||
| 127 | |||
| 128 | ## Sources | ||
| 129 | - See `docs/references-and-sources.md` for the current source map. |
docs/references-and-sources.md
deleted
100644 → 0
| 1 | # References and Sources Map | ||
| 2 | |||
| 3 | > 更新:2026-06-02 | ||
| 4 | |||
| 5 | ## 一页结论 | ||
| 6 | |||
| 7 | 当前项目的引用分成四类: | ||
| 8 | 1. **开源数据集来源** | ||
| 9 | 2. **研究/SOTA 来源** | ||
| 10 | 3. **服务与工程规范来源** | ||
| 11 | 4. **项目内部文档来源** | ||
| 12 | |||
| 13 | --- | ||
| 14 | |||
| 15 | ## 1. 引用分层图 | ||
| 16 | |||
| 17 | ```mermaid | ||
| 18 | flowchart TD | ||
| 19 | A[References] --> B[Datasets] | ||
| 20 | A --> C[Research] | ||
| 21 | A --> D[Engineering] | ||
| 22 | A --> E[Internal Docs] | ||
| 23 | |||
| 24 | B --> B1[FMA] | ||
| 25 | B --> B2[MTG-Jamendo] | ||
| 26 | B --> B3[CCMusic] | ||
| 27 | B --> B4[ModelScope] | ||
| 28 | |||
| 29 | C --> C1[Neural AFP] | ||
| 30 | C --> C2[Music Foundation Models] | ||
| 31 | C --> C3[Band-split] | ||
| 32 | C --> C4[Data Balancing] | ||
| 33 | ``` | ||
| 34 | |||
| 35 | --- | ||
| 36 | |||
| 37 | ## 2. External sources table | ||
| 38 | |||
| 39 | | Category | Name | URL | Current use | | ||
| 40 | |---|---|---|---| | ||
| 41 | | Dataset | FMA | https://github.com/mdeff/fma | Candidate for a real retrieval baseline | | ||
| 42 | | Dataset | MTG-Jamendo | https://github.com/MTG/mtg-jamendo-dataset | Candidate for real music retrieval | | ||
| 43 | | Dataset | CCMusic | https://ccmusic-database.github.io/en/database/ccm.html | Candidate Chinese MIR data source | | ||
| 44 | | Dataset | ModelScope music search | https://modelscope.cn/search?page=1&search=music&type=dataset | Dataset discovery entry point | | ||
| 45 | | Research | MERT | https://arxiv.org/abs/2306.00107 | Reference for the foundation-model direction | | ||
| 46 | | Research | MuQ | https://arxiv.org/abs/2501.01108 | Reference for the music representation direction | | ||
| 47 | | Research | Band-split RNN | https://arxiv.org/abs/2209.15174 | Reference for frequency-band modeling | | ||
| 48 | | Research | BAGAN | https://arxiv.org/abs/1803.09655 | Reference for data balancing and augmentation | | ||
| 49 | |||
| 50 | --- | ||
| 51 | |||
| 52 | ## 3. Internal document dependency diagram | ||
| 53 | |||
| 54 | ```mermaid | ||
| 55 | flowchart LR | ||
| 56 | A[references-and-sources.md] --> B[dataset-sources-and-licensing.md] | ||
| 57 | A --> C[sota-research-2026.md] | ||
| 58 | A --> D[industrialization-roadmap.md] | ||
| 59 | ``` | ||
| 60 | |||
| 61 | --- | ||
| 62 | |||
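| 63 | ## 4. Description | ||
| 64 | |||
| 65 | ### 4.1 Why a separate References Map | ||
| 66 | The number of documents will keep growing. Unless we systematically record which conclusions come from which sources, traceability will be lost quickly. | ||
| 67 | |||
| 68 | ### 4.2 Current reference quality | ||
| 69 | - dataset sources: official repos / official homepages preferred | ||
| 70 | - research sources: arXiv / paper homepages preferred | ||
| 71 | - service / engineering sources: currently mostly in-house engineering conventions | ||
| 72 | |||
| 73 | ### 4.3 Areas to strengthen | ||
| 74 | - Add a "Sources" section at the bottom of each core document | ||
| 75 | - Benchmark reports and model cards should explicitly cite training data and paper versions | ||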
| 76 | |||
| 77 | --- | ||
| 78 | |||
| 79 | ## 5. Appendix | ||
| 80 | |||
| 81 | Suggested additions: | ||
| 82 | - Add a `Sources` section to every document | ||
| 83 | - Emit a reference snapshot with every model release | ||
| 84 | |||
| 85 | ## Sources | ||
| 86 | - FMA: https://github.com/mdeff/fma | ||
| 87 | - MTG-Jamendo: https://github.com/MTG/mtg-jamendo-dataset | ||
| 88 | - CCMusic: https://ccmusic-database.github.io/en/database/ccm.html | ||
| 89 | - ModelScope music search: https://modelscope.cn/search?page=1&search=music&type=dataset |
docs/release-checklist.md
deleted
100644 → 0
| 1 | # Release Checklist | ||
| 2 | |||
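| 3 | ## One-page summary | ||
| 4 | Before a release, all of the following must hold: | ||
| 5 | - quality passes | ||
| 6 | - compliance passes | ||
| 7 | - service passes | ||
| 8 | - documentation is complete | ||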
| 9 | |||
| 10 | ## 1. Release gate diagram | ||
| 11 | |||
| 12 | ```mermaid | ||
| 13 | flowchart TD | ||
| 14 | A[Release Candidate] --> B[Benchmark Pass] | ||
| 15 | A --> C[License Review Pass] | ||
| 16 | A --> D[Service Smoke Pass] | ||
| 17 | A --> E[Docs Complete] | ||
| 18 | ``` | ||
| 19 | |||
| 20 | ## 2. Checklist table | ||
| 21 | |||
| 22 | | Item | Status | | ||
| 23 | |---|---| | ||
| 24 | | benchmark report generated | | | ||
| 25 | | model card generated | | | ||
| 26 | | license registry updated | | | ||
| 27 | | service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but batch validation is currently poor (`type_7 top1=0.0/top3=0.05`, `type_8 top1=0.0/top3=0.0`, `type_16 top1=0.0/top3=0.0`) | | ||
| 28 | | dataset whitelist confirmed | | | ||
| 29 | | changelog updated | yes | | ||
| 30 | | architect review completed | yes (approved with watch) | | ||
| 31 | |||
| 32 | ## 3. Notes | ||
| 33 | - If any item is missing, the release cannot be treated as commercially releasable | ||
| 34 | |||
| 35 | ## 4. Appendix | ||
| 36 | - release commit | ||
| 37 | - benchmark report path | ||
| 38 | - model card path | ||
| 39 | - license review record path | ||
| 40 | |||
| 41 | ## Sources | ||
| 42 | - `docs/dataset-sources-and-licensing.md` | ||
| 43 | - `docs/industrial-benchmark-spec.md` | ||
| 44 | |||
| 45 | |||
| 46 | ## 2026-06-03 voice-query service foundation | ||
| 47 | |||
| 48 | - `/health` is available | ||
| 49 | - The `/recognize/voice` route is wired in, but inference is still blocked by the missing `torch` dependency | ||
| 50 | - Local FAISS 20-song validation is complete | ||
| 51 | - handoff / changelog / docs README are in sync | ||
| 52 | |||
| 53 | - handoff refreshed: yes (now points to the current voice service runtime state and the next debugging path) | ||
| 54 | |||
| 55 | - business-corpus song_id baseline generated: yes (`data/pgvector_eval/music20/songid_eval_report.json`) |
docs/report-layout.md
deleted
100644 → 0
| 1 | # Report Layout Convention | ||
| 2 | |||
| 3 | ## One-page summary | ||
| 4 | |||
| 5 | All evaluation and release artifacts go under: | ||
| 6 | |||
| 7 | - `reports/<model-version>/<data-version>/eval.json` | ||
| 8 | - `reports/<model-version>/<data-version>/benchmark-report.md` | ||
| 9 | - `reports/<model-version>/<data-version>/model-card.md` | ||
| 10 | - `reports/<model-version>/<data-version>/release-checklist.md` | ||
| 11 | - `reports/<model-version>/<data-version>/artifact-manifest.json` | ||
| 12 | |||
| 13 | --- | ||
| 14 | |||
| 15 | ## 1. Layout diagram | ||
| 16 | |||
| 17 | ```mermaid | ||
| 18 | flowchart TD | ||
| 19 | A[reports/] --> B[model-version] | ||
| 20 | B --> C[data-version] | ||
| 21 | C --> D[eval.json] | ||
| 22 | C --> E[benchmark-report.md] | ||
| 23 | C --> F[model-card.md] | ||
| 24 | C --> G[release-checklist.md] | ||
| 25 | C --> H[artifact-manifest.json] | ||
| 26 | ``` | ||
| 27 | |||
| 28 | --- | ||
| 29 | |||
| 30 | ## 2. Conventions table | ||
| 31 | |||
| 32 | | File | Purpose | | ||
| 33 | |---|---| | ||
| 34 | | eval.json | Machine-readable evaluation output | | ||
| 35 | | benchmark-report.md | Human-readable benchmark summary | | ||
| 36 | | model-card.md | Model description | | ||
| 37 | | release-checklist.md | Release gate | | ||
| 38 | | artifact-manifest.json | Artifact index | | ||
| 39 | |||
| 40 | --- | ||
| 41 | |||
| 42 | ## 3. Notes | ||
| 43 | - Every release candidate should have its own directory | ||
| 44 | - Do not mix temporary smoke files with formal release reports | ||
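The layout convention above can be sketched as a small path helper. This is only an illustration under stated assumptions: `report_paths` and `missing_reports` are hypothetical names that do not come from the repository, and the default `reports` root is taken from the paths listed in the summary.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the layout convention in this document.
REPORT_FILES = (
    "eval.json",
    "benchmark-report.md",
    "model-card.md",
    "release-checklist.md",
    "artifact-manifest.json",
)


def report_paths(model_version: str, data_version: str, root: str = "reports") -> Dict[str, Path]:
    """Map each expected report file name to its path for one release candidate.

    Hypothetical helper: it only encodes reports/<model-version>/<data-version>/<file>.
    """
    base = Path(root) / model_version / data_version
    return {name: base / name for name in REPORT_FILES}


def missing_reports(model_version: str, data_version: str, root: str = "reports") -> List[str]:
    """Return the names of expected report files that do not exist on disk."""
    paths = report_paths(model_version, data_version, root)
    return sorted(name for name, path in paths.items() if not path.is_file())
```

A release script could refuse to proceed while `missing_reports(...)` is non-empty, which keeps temporary smoke output from being mistaken for a complete release directory.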
| 45 | |||
| 46 | ## Sources | ||
| 47 | - docs/benchmark-report-template.md | ||
| 48 | - docs/model-card-template.md | ||
| 49 | - docs/release-checklist.md |
docs/runbook.md
deleted
100644 → 0
| 1 | # ACR Project Runbook | ||
| 2 | |||
| 3 | ## 1. Environment | ||
| 4 | |||
| 5 | ```bash | ||
| 6 | cd acr-engine | ||
| 7 | python -m venv .venv | ||
| 8 | source .venv/bin/activate | ||
| 9 | pip install -r requirements.txt | ||
| 10 | ``` | ||
| 11 | |||
| 12 | ## 2. Generate data | ||
| 13 | |||
| 14 | ```bash | ||
| 15 | python run_demo.py generate-data --output data/synthetic --num-songs 24 | ||
| 16 | ``` | ||
| 17 | |||
| 18 | ## 3. Validate the training pipeline | ||
| 19 | |||
| 20 | ```bash | ||
| 21 | python train.py --data data/synthetic --dry-run --device cpu | ||
| 22 | ``` | ||
| 23 | |||
| 24 | ## 4. Minimal training run | ||
| 25 | |||
| 26 | ```bash | ||
| 27 | python train.py --data data/synthetic --output data/models --device cpu --epochs 1 --batch-size 8 | ||
| 28 | ``` | ||
| 29 | |||
| 30 | ## 5. Build the index | ||
| 31 | |||
| 32 | ```bash | ||
| 33 | python run_demo.py build-index --data data/synthetic --model data/models/best_model.pt --output data/index --device cpu | ||
| 34 | ``` | ||
| 35 | |||
| 36 | ## 6. Run recognition | ||
| 37 | |||
| 38 | ```bash | ||
| 39 | python run_demo.py recognize \ | ||
| 40 | --query data/synthetic/segments/song_0020_seg_00.wav \ | ||
| 41 | --data data/synthetic \ | ||
| 42 | --model data/models/best_model.pt \ | ||
| 43 | --index-prefix data/index/reference \ | ||
| 44 | --device cpu | ||
| 45 | ``` | ||
| 46 | |||
| 47 | ## 7. Success criteria | ||
| 48 | |||
| 49 | At minimum: | ||
| 50 | |||
| 51 | - JSON output is produced | ||
| 52 | - `candidates` is returned | ||
| 53 | - the result contains `song_id` and `confidence` |
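The success criteria above could be checked mechanically along the following lines. This is a sketch under an assumption about the output shape: it supposes a top-level JSON object with a `candidates` list whose items each carry `song_id` and `confidence`. The runbook does not spell out the exact structure, and `check_recognize_output` is a hypothetical name.

```python
import json
from typing import List


def check_recognize_output(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the criteria hold.

    Assumed shape (not confirmed by the runbook):
    {"candidates": [{"song_id": ..., "confidence": ...}, ...]}
    """
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(result, dict):
        return ["top-level JSON value is not an object"]

    candidates = result.get("candidates")
    if not isinstance(candidates, list) or not candidates:
        return ["`candidates` is missing or empty"]

    problems = []
    for index, candidate in enumerate(candidates):
        if not isinstance(candidate, dict):
            problems.append(f"candidate {index} is not an object")
            continue
        for key in ("song_id", "confidence"):
            if key not in candidate:
                problems.append(f"candidate {index} lacks `{key}`")
    return problems
```

Saving the stdout of the recognition command and passing it to this function turns the three bullet points into a pass/fail check.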
docs/service-api.md
deleted
100644 → 0
| 1 | # ACR Service API | ||
| 2 | |||
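| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## One-page summary | ||
| 6 | |||
| 7 | - The current service is an industrialization skeleton, not the final production gateway | ||
| 8 | - Minimal callable capabilities already provided: | ||
| 9 | 1. health | ||
| 10 | 2. ready | ||
| 11 | 3. config | ||
| 12 | 4. cache | ||
| 13 | 5. recognize | ||
| 14 | 6. index build | ||
| 15 | - Also added: a service readiness probe, basic cache visibility, and index/model existence checks | ||
| 16 | - Next-phase priorities: authentication, async jobs, ANN indexing, monitoring, and standardized error codes | ||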
| 17 | |||
| 18 | --- | ||
| 19 | |||
| 20 | ## 1. Service structure diagram | ||
| 21 | |||
| 22 | ```mermaid | ||
| 23 | flowchart LR | ||
| 24 | C[Client] --> H[/health] | ||
| 25 | C --> H2[/ready] | ||
| 26 | C --> G[/config] | ||
| 27 | C --> C2[/cache] | ||
| 28 | C --> R[/recognize] | ||
| 29 | C --> I[/index/build] | ||
| 30 | |||
| 31 | R --> E[Hybrid Engine Cache] | ||
| 32 | I --> B[Index Builders] | ||
| 33 | ``` | ||
| 34 | |||
| 35 | --- | ||
| 36 | |||
| 37 | ## 2. Endpoint table | ||
| 38 | |||
| 39 | | Endpoint | Method | Purpose | | ||
| 40 | |---|---|---| | ||
| 41 | | `/health` | GET | Health check | | ||
| 42 | | `/config` | GET | Show the default configuration | | ||
| 43 | | `/ready` | GET | Report whether the model / index / manifest are ready | | ||
| 44 | | `/cache` | GET | Show the current engine cache state | | ||
| 45 | | `/recognize` | POST | Take a query and return candidates | | ||
| 46 | | `/index/build` | POST | Trigger an offline index build | | ||
| 47 | |||
| 48 | --- | ||
| 49 | |||
| 50 | ## 3. Request flow diagram | ||
| 51 | |||
| 52 | ```mermaid | ||
| 53 | sequenceDiagram | ||
| 54 | participant Client | ||
| 55 | participant API | ||
| 56 | participant Engine | ||
| 57 | |||
| 58 | Client->>API: POST /recognize | ||
| 59 | API->>Engine: load matcher/index/model | ||
| 60 | Engine-->>API: top-k candidates | ||
| 61 | API-->>Client: JSON result | ||
| 62 | ``` | ||
| 63 | |||
| 64 | --- | ||
| 65 | |||
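| 66 | ## 4. Description | ||
| 67 | |||
| 68 | ### 4.1 Why expose a file-path API first | ||
| 69 | At this stage the priority is validating the system end to end, so an upload storage layer and async job orchestration are deferred. | ||
| 70 | |||
| 71 | ### 4.2 Purpose of `/config` | ||
| 72 | Lets the service side and callers quickly confirm the current default data directory, model path, and index prefix. | ||
| 73 | |||
| 74 | ### 4.3 Remaining gaps before production | ||
| 75 | - No authentication | ||
| 76 | - No object-storage upload | ||
| 77 | - No async index jobs | ||
| 78 | - No observability | ||
| 79 | - No error-code or SLA specification | ||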
| 80 | |||
| 81 | --- | ||
| 82 | |||
| 83 | ## 5. Appendix | ||
| 84 | |||
| 85 | ### `/health` | ||
| 86 | Returns: | ||
| 87 | ```json | ||
| 88 | {"status":"ok","service":"acr","version":"0.2.0"} | ||
| 89 | ``` | ||
| 90 | |||
| 91 | ### `/config` | ||
| 92 | Returns: | ||
| 93 | ```json | ||
| 94 | { | ||
| 95 | "data_dir":"data/synthetic_v2", | ||
| 96 | "model_path":"data/models_v3/best_model.pt", | ||
| 97 | "index_prefix":"data/index_v3/reference", | ||
| 98 | "device":"cpu" | ||
| 99 | } | ||
| 100 | ``` | ||
| 101 | |||
| 102 | |||
| 103 | ### `/ready` | ||
| 104 | Returns: | ||
| 105 | ```json | ||
| 106 | { | ||
| 107 | "service":"acr", | ||
| 108 | "version":"0.3.0", | ||
| 109 | "ready":true, | ||
| 110 | "files":{...}, | ||
| 111 | "manifests":[...], | ||
| 112 | "engine_cache_size":0 | ||
| 113 | } | ||
| 114 | ``` | ||
| 115 | |||
| 116 | ### `/cache` | ||
| 117 | Returns engine cache statistics for the current process. | ||
| 118 | |||
| 119 | |||
| 120 | |||
| 121 | ## 6. Local run and smoke test | ||
| 122 | |||
| 123 | ```bash | ||
| 124 | cd acr-engine | ||
| 125 | /usr/local/miniconda3/bin/python -m uvicorn src.service.app:app --host 127.0.0.1 --port 8000 | ||
| 126 | ``` | ||
| 127 | |||
| 128 | In another terminal, run: | ||
| 129 | |||
| 130 | ```bash | ||
| 131 | cd acr-engine | ||
| 132 | /usr/local/miniconda3/bin/python scripts/service_smoke.py | ||
| 133 | ``` | ||
| 134 | |||
| 135 | The smoke test currently checks: | ||
| 136 | |||
| 137 | - `/health` | ||
| 138 | - `/ready` | ||
| 139 | - `/config` | ||
| 140 | - `/cache` | ||
| 141 | |||
| 142 | ## Sources | ||
| 143 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. |
docs/sota-research-2026.md
deleted
100644 → 0
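| 1 | # ACR / Music Retrieval SOTA Research (as of 2026-06-02) | ||
| 2 | |||
| 3 | ## Summary | ||
| 4 | |||
| 5 | By 2025-2026, this field has clearly moved beyond the traditional approach of training a small ECAPA embedding from scratch. | ||
| 6 | |||
| 7 | The stronger directions today fall into three groups: | ||
| 8 | |||
| 9 | 1. **Robust training augmentation for Neural Audio Fingerprinting** | ||
| 10 | 2. **Music Foundation Models as backbone / teacher** | ||
| 11 | 3. **Band-split / band-aware architectures for music spectrogram modeling** | ||
| 12 | |||
| 13 | Direct conclusions for the project's current stage: | ||
| 14 | - **Sample duplication or uniform weighting alone is not a SOTA approach** | ||
| 15 | - Closer to 2026 industrial best practice: **retrieval-first + hard negative mining + foundation model backbone + task-specific branches** | ||
| 16 | - The repository already covers two of these steps: `128 Mel + band-split` and `retrieval-first eval` | ||
| 17 | - The most important next additions are `confusion-aware negatives` and a `humming melody tower` | ||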
| 18 | |||
| 19 | |||
| 20 | ## 1. Direction map | ||
| 21 | |||
| 22 | ```mermaid | ||
| 23 | flowchart LR | ||
| 24 | A[2026 ACR / MIR SOTA] --> B[Neural AFP Robustness] | ||
| 25 | A --> C[Music Foundation Models] | ||
| 26 | A --> D[Band-aware Architectures] | ||
| 27 | A --> E[Data Balancing / Hard Negatives] | ||
| 28 | ``` | ||
| 29 | |||
| 30 | ## 1. Stronger practice in Neural AFP | ||
| 31 | |||
| 32 | ### Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification (2025) | ||
| 33 | - arXiv: https://arxiv.org/abs/2506.22661 | ||
| 34 | |||
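| 35 | Key points: | ||
| 36 | - Notes that many neural AFP works do not simulate real-world degradation realistically | ||
| 37 | - Systematically compares metric learning methods | ||
| 38 | - Finds that a self-supervised triplet loss variant performs better on this task | ||
| 39 | - Stresses that multiple positive samples affect different losses differently | ||
| 40 | |||
| 41 | Implications for this project: | ||
| 42 | - Do not rely solely on the current simple SupCon + CE | ||
| 43 | - Add more realistic degradation augmentation | ||
| 44 | - Select on retrieval metrics explicitly rather than looking only at the classification head | ||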
| 45 | |||
| 46 | ## 2. Music Foundation Model Backbones | ||
| 47 | |||
| 48 | ### Robust Neural Audio Fingerprinting using Music Foundation Models (2025) | ||
| 49 | - arXiv: https://arxiv.org/abs/2511.05399 | ||
| 50 | |||
| 51 | Key points: | ||
| 52 | - Uses pretrained music foundation models (e.g. MuQ, MERT) as the neural fingerprinting backbone | ||
| 53 | - Outperforms models trained from scratch on distorted / compressed / manipulated audio | ||
| 54 | - Also does better at segment-level localization | ||
| 55 | |||
| 56 | ### MERT (2023) | ||
| 57 | - arXiv: https://arxiv.org/abs/2306.00107 | ||
| 58 | |||
| 59 | Key points: | ||
| 60 | - Large-scale self-supervised music understanding model | ||
| 61 | - Strong results across multiple music understanding tasks | ||
| 62 | |||
| 63 | ### MuQ (2025) | ||
| 64 | - arXiv: https://arxiv.org/abs/2501.01108 | ||
| 65 | |||
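| 66 | Key points: | ||
| 67 | - Self-supervised representation learning model for music | ||
| 68 | - Uses a Mel-RVQ target | ||
| 69 | - Outperforms earlier work on a range of downstream tasks | ||
| 70 | |||
| 71 | Implications for this project: | ||
| 72 | - In 2026, continuing to train only small models from scratch is unlikely to be the best route | ||
| 73 | - A more sensible route: | ||
| 74 | - keep the lightweight self-trained baseline in the current repository | ||
| 75 | - in the next phase, add a MERT / MuQ frozen-encoder or adapter fine-tune variant | ||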
| 76 | |||
| 77 | ## 3. Band-split / band-aware architectures | ||
| 78 | |||
| 79 | ### Music Source Separation with Band-split RNN (2022) | ||
| 80 | - arXiv: https://arxiv.org/abs/2209.15174 | ||
| 81 | |||
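| 82 | Key points: | ||
| 83 | - Explicitly splits the spectrum into multiple bands before modeling | ||
| 84 | - For music tasks, outperforms directly reusing general-purpose audio architectures | ||
| 85 | |||
| 86 | Although the paper targets source separation rather than ACR, it is a useful source of ideas about music frequency-band priors. | ||
| 87 | |||
| 88 | Implications for this project: | ||
| 89 | - Adding band-split at the input layer is a reasonable engineering direction | ||
| 90 | - Possible future extensions: | ||
| 91 | - band-aware attention | ||
| 92 | - multi-band retrieval heads | ||
| 93 | - harmonic/rhythm two-tower structure | ||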
| 94 | |||
| 95 | ## 4. Data balancing and generative augmentation | ||
| 96 | |||
| 97 | ### BAGAN: Data Augmentation with Balancing GAN (2018) | ||
| 98 | - arXiv: https://arxiv.org/abs/1803.09655 | ||
| 99 | |||
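| 100 | Strictly speaking, for the `pro-WGAN` you mentioned, I did not find a clear, authoritative primary reference under that name that is widely standardized for this task. The closest family of methods with a clear authoritative source is **BAGAN / balancing GAN**, which targets augmentation for imbalanced data. | ||
| 101 | |||
| 102 | This implementation therefore uses: | ||
| 103 | - **a pro-WGAN-style engineering approximation of a balancing strategy** | ||
| 104 | - it does not claim to reproduce any specific `pro-WGAN` SOTA paper | ||
| 105 | |||
| 106 | If you later point to a specific paper or repository, I can align the implementation to that version exactly. | ||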
| 107 | |||
| 108 | ### Interpreting the current experimental results | ||
| 109 | |||
| 110 | | Strategy | overall top1 | humming_like top1 | confused top1 | Conclusion | | ||
| 111 | |---|---:|---:|---:|---| | ||
| 112 | | naive oversampling (smoke-v4) | 0.40 | 0.00 | 0.00 | Clear regression | | ||
| 113 | | type-aware weighting (smoke-v5) | 0.60 | 0.50 | 0.00 | Improves humming, no breakthrough on confused | | ||
| 114 | | sample-level confused-priority weighting (smoke-v6) | 0.65 | 0.25 | 0.25 | Breakthrough on confused, but humming needs rebalancing | | ||
| 115 | |||
| 116 | This shows: | ||
| 117 | 1. In this field in 2026, **"hard cases matter" holds** | ||
| 118 | 2. But **single-dimension weighting is not enough**; different hard cases need to be modeled separately | ||
| 119 | 3. For music ACR, `confused` and `humming_like` are hard for different reasons: | ||
| 120 | - `confused` leans toward timbre / arrangement / retrieval ambiguity | ||
| 121 | - `humming_like` leans toward melody / pitch contour mismatch | ||
| 122 | 4. The residual confused failures in the current repository further show: | ||
| 123 | - `intro` segments are a higher-risk region | ||
| 124 | - the next step should introduce `segment_type-aware hard negatives` | ||
| 125 | - this is closer to an industrially effective path than continuing to tune the global sample ratio | ||
| 126 | |||
| 127 | ## 5. Are there already better approaches in 2026? | ||
| 128 | |||
| 129 | Yes: **there are clearly better routes**. | ||
| 130 | |||
| 131 | The most useful references: | ||
| 132 | 1. Use a **music foundation model** as the backbone | ||
| 133 | 2. Use **more realistic degradation simulation + retrieval-first metric learning** | ||
| 134 | 3. Use **segment-level / window-level indexing** rather than a whole-track averaged embedding | ||
| 135 | 4. Add a **dedicated melody / pitch contour branch** for the humming task | ||
| 136 | |||
| 137 | ## 6. Prioritized recommendations for this project | ||
| 138 | |||
| 139 | ### Current phase (already started) | ||
| 140 | - Replace the low-dimensional speaker-style input with 128 Mel | ||
| 141 | - Band-split input layer | ||
| 142 | - Stronger confusion augmentation | ||
| 143 | - Retrieval-first evaluation | ||
| 144 | |||
| 145 | ### Next phase | ||
| 146 | - MERT / MuQ frozen feature baseline | ||
| 147 | - Triplet / multi-positive metric learning compared against SupCon | ||
| 148 | - Window-level index aggregation | ||
| 149 | - Small-scale real-data validation on FMA / Jamendo | ||
| 150 | - Confusion-aware negative mining | ||
| 151 | - Dedicated humming melody branch / pitch contour rerank | ||
| 152 | |||
| 153 | ### Later phases | ||
| 154 | - Dedicated humming melody tower | ||
| 155 | - Foundation model + lightweight fingerprint head | ||
| 156 | - ANN + reranker two-stage industrial retrieval | ||
| 157 | |||
| 158 | ## Sources | ||
| 159 | - Araz et al., 2025, Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification: https://arxiv.org/abs/2506.22661 | ||
| 160 | - Singh et al., 2025, Robust Neural Audio Fingerprinting using Music Foundation Models: https://arxiv.org/abs/2511.05399 | ||
| 161 | - Li et al., 2023, MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training: https://arxiv.org/abs/2306.00107 | ||
| 162 | - Zhu et al., 2025, MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization: https://arxiv.org/abs/2501.01108 | ||
| 163 | - Luo & Yu, 2022, Music Source Separation with Band-split RNN: https://arxiv.org/abs/2209.15174 | ||
| 164 | - Mariani et al., 2018, BAGAN: Data Augmentation with Balancing GAN: https://arxiv.org/abs/1803.09655 |
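| 1 | # Training Data, Input Format, and pgvector Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | > Related docs: [Dataset spec](./dataset-spec.md) · [Open dataset workflow](./open-dataset-workflow.md) · [Data sources and ingestion](./dataset-sources-and-licensing.md) · [Service API](./service-api.md) · [Business asset types and bucket guide](./business-music-bucket-and-type-guide.md) · [Business manifest and type-role spec](./business-manifest-and-type-role-spec.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | Your latest questions can be condensed into five statements: | ||
| 9 | |||
| 10 | 1. **The smallest unit of training input is "a query sample with a `song_id` + a reference asset + a manifest"**; a batch of 3-minute mp3 files is not fed to the model directly. | ||
| 11 | 2. **On the training side, a 3-minute mp3 is usually not pre-cut into a full set of overlapping windows; a 5s clip is cropped at random at runtime. Overlapping sliding windows are used only on the retrieval side.** | ||
| 12 | 3. **With a GPU, training on real data such as FMA speeds up markedly. `train.py` supports `auto/cuda`, and `smoke-local` now supports `--device cpu|cuda|auto`, where `auto` is resolved to the actual device inside the smoke run.** | ||
| 13 | 4. **FMA, MTG-Jamendo, and in-house BGM / recordings should all be converted to a unified manifest first, and only then used for training, evaluation, and pgvector ingestion.** | ||
| 14 | 5. **When you extend your own datasets later, what matters most is not the file extension but structured fields such as `song_id / type / offset / source_dataset / split`.** | ||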
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Overall structure diagram | ||
| 19 | |||
| 20 | ```mermaid | ||
| 21 | flowchart LR | ||
| 22 | A[Raw material\nBGM / songs / recordings / FMA] --> B[Normalized audio assets\n16k / mono] | ||
| 23 | B --> C[Reference full track / long segment] | ||
| 24 | B --> D[Query short clip\n5s / 8s] | ||
| 25 | C --> E[Manifest\ncatalog/train/val/test] | ||
| 26 | D --> E | ||
| 27 | E --> F[Training / evaluation / index build] | ||
| 28 | F --> G[Embedding / Fingerprint] | ||
| 29 | G --> H[PostgreSQL + pgvector] | ||
| 30 | ``` | ||
| 31 | |||
| 32 | --- | ||
| 33 | |||
| 34 | ## 2. What format the training data actually takes | ||
| 35 | |||
| 36 | ## 2.1 Not "one mp3 file" but three layers of objects | ||
| 37 | |||
| 38 | | Layer | What is required | Purpose | | ||
| 39 | |---|---|---| | ||
| 40 | | Audio asset layer | `.wav/.mp3/.flac/.ogg` | The content that is actually read | | ||
| 41 | | Annotation layer | `song_id`, `type`, `offset`, `source_dataset` | Tells the system which song it is and what kind of sample it is | | ||
| 42 | | Manifest layer | `catalog.json` / `train.json` / `test.json` / `val.json` | Drives training, index building, and evaluation | | ||
| 43 | |||
| 44 | In other words, the **smallest trainable unit** is not "a file" but: | ||
| 45 | - one `audio_path` | ||
| 46 | - one `song_id` | ||
| 47 | - one `type` | ||
| 48 | - one `duration` | ||
| 49 | - for a query, usually also an `offset` | ||
| 50 | |||
| 51 | --- | ||
| 52 | |||
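| 53 | ## 2.2 Currently recommended audio constraints | ||
| 54 | |||
| 55 | | Item | Recommended value | Notes | | ||
| 56 | |---|---|---| | ||
| 57 | | Sample rate | `16000 Hz` | Resampled to 16k on load | | ||
| 58 | | Channels | `mono` | The current pipeline assumes mono | | ||
| 59 | | Training clip length | `5s` | As implemented in the training dataset code | | ||
| 60 | | Default external query length | `8s` | Default for `prepare-local/smoke-local` | | ||
| 61 | | Spectral input | `128-dim Mel` | Input layer for the current music task | | ||
| 62 | | Model config | `use_band_split=true` | Band-split module already integrated | | ||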
| 63 | |||
| 64 | --- | ||
| 65 | |||
| 66 | ## 3. How a 3-minute mp3 enters training | ||
| 67 | |||
| 68 | ## 3.1 Current slicing strategy diagram | ||
| 69 | |||
| 70 | ```mermaid | ||
| 71 | flowchart TD | ||
| 72 | A[3min raw mp3] --> B[Training Dataset] | ||
| 73 | A --> C[Index build / retrieval] | ||
| 74 | A --> D[External dataset query generation] | ||
| 75 | |||
| 76 | B --> B1[Random offset] | ||
| 77 | B1 --> B2[Take 5s clip] | ||
| 78 | |||
| 79 | C --> C1[5s window] | ||
| 80 | C1 --> C2[2.5s stride] | ||
| 81 | C2 --> C3[50% overlap windows] | ||
| 82 | |||
| 83 | D --> D1[Randomly take one 8s query] | ||
| 84 | ``` | ||
| 85 | |||
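| 86 | ## 3.2 Direct answer: are there overlapping windows? | ||
| 87 | |||
| 88 | | Pipeline | Current answer | | ||
| 89 | |---|---| | ||
| 90 | | Training | **No fixed set of overlapping sliding windows**; clips are cropped at random | | ||
| 91 | | Retrieval / reference index | **Yes**, 50% overlap by default | | ||
| 92 | | Open-data manifest generation | **No**, one query per song by default | | ||
| 93 | |||
| 94 | So: | ||
| 95 | - If you are worried that only a small part of a 3-minute track is seen, that worry is reasonable; | ||
| 96 | - Training coverage currently accumulates through random sampling over multiple epochs, rather than slicing the whole track in one pass; | ||
| 97 | - To improve recall / robustness later, the "multi-query / overlap query manifest" enhancement track can be added. | ||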
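The difference between the two slicing modes above can be sketched as follows. The 5s window, 2.5s stride, and 5s training crop come from this document; the function names, and the handling of tracks shorter than one window and of a trailing partial window, are assumptions for illustration and may differ from what `src/utils/audio.py` and `src/data/dataset.py` actually do.

```python
import random
from typing import List, Optional


def sliding_window_offsets(duration: float, window: float = 5.0, stride: float = 2.5) -> List[float]:
    """Start offsets (seconds) of fixed windows that fit fully inside the track."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if duration < window:
        return []  # assumption: tracks shorter than one window yield no windows
    # Small epsilon guards against floating-point error at exact multiples.
    count = int((duration - window) / stride + 1e-9) + 1
    return [i * stride for i in range(count)]


def random_crop_offset(duration: float, clip: float = 5.0,
                       rng: Optional[random.Random] = None) -> float:
    """Pick one random start offset for a training crop of length `clip`."""
    rng = rng or random.Random()
    if duration <= clip:
        return 0.0  # assumption: short tracks are read from the start
    return rng.uniform(0.0, duration - clip)
```

Under these assumptions, a 180s track yields 71 index windows with offsets from 0.0 to 175.0, while training draws a single 5s crop per access. This is why training coverage depends on the number of epochs and index coverage does not.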
| 98 | |||
| 99 | --- | ||
| 100 | |||
| 101 | ## 4. How to turn BGM and music recordings into training data | ||
| 102 | |||
| 103 | ## 4.1 Recommended division of roles | ||
| 104 | |||
| 105 | ```mermaid | ||
| 106 | flowchart LR | ||
| 107 | A[In-house BGM / song masters] --> B[reference] | ||
| 108 | A --> C[clean query] | ||
| 109 | A --> D[augmented query] | ||
| 110 | A --> E[confused / humming_like] | ||
| 111 | F[Phone recordings / ambient recordings] --> C | ||
| 112 | F --> E | ||
| 113 | ``` | ||
| 114 | |||
| 115 | ## 4.2 Conversion rules table | ||
| 116 | |||
| 117 | | Raw material | Converted to | Suggested label | | ||
| 118 | |---|---|---| | ||
| 119 | | Full BGM / full song | `reference` | `type=reference` | | ||
| 120 | | 5s/8s cut directly from the original | `clean` query | `type=clean` | | ||
| 121 | | Clip with added noise / compression / reverb / EQ | `augmented` query | `type=augmented` | | ||
| 122 | | Clip easily confused with other songs | hard-case query | `type=confused` | | ||
| 123 | | Humming-like, weak melody, phone-recording style | hard-case query | `type=humming_like` | | ||
| 124 | |||
| 125 | --- | ||
| 126 | |||
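| 127 | ## 4.3 Minimum rules when extending your own dataset | ||
| 128 | |||
| 129 | 1. **Every song must have a stable `song_id`.** | ||
| 130 | 2. **Prefer the full or primary version as the `reference`.** | ||
| 131 | 3. **A query must be traceable to its time position in the reference**, so keep `offset` where possible. | ||
| 132 | 4. **Records from different sources must keep `source_dataset`.** | ||
| 133 | 5. **Training, validation, and test must keep `split` semantics**; do not infer them from folder names later. | ||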
| 134 | |||
| 135 | --- | ||
| 136 | |||
| 137 | ## 5. Directory conventions for FMA / MTG / in-house data | ||
| 138 | |||
| 139 | ## 5.1 Recommended directory diagram | ||
| 140 | |||
| 141 | ```mermaid | ||
| 142 | flowchart TD | ||
| 143 | A[acr-engine/data/raw] --> B[fma_small_audio] | ||
| 144 | A --> C[mtg_jamendo_audio] | ||
| 145 | A --> D[my_bgm_audio] | ||
| 146 | |||
| 147 | E[acr-engine/data/external_ingested] --> F[fma/manifests] | ||
| 148 | E --> G[mtg_jamendo/manifests] | ||
| 149 | E --> H[my_bgm/manifests] | ||
| 150 | |||
| 151 | I[acr-engine/data/external_smoke] --> J[fma_*] | ||
| 152 | ``` | ||
| 153 | |||
| 154 | ## 5.2 Directory responsibilities | ||
| 155 | |||
| 156 | | Directory | Purpose | | ||
| 157 | |---|---| | ||
| 158 | | `acr-engine/data/raw/fma_small_audio/` | FMA raw audio directory | | ||
| 159 | | `acr-engine/data/raw/mtg_jamendo_audio/` | MTG-Jamendo raw audio directory | | ||
| 160 | | `acr-engine/data/external_ingested/<dataset>/manifests/` | Manifests after unified conversion | | ||
| 161 | | `acr-engine/data/external_smoke/` | Smoke training / index / evaluation artifacts | | ||
| 162 | |||
| 163 | --- | ||
| 164 | |||
| 165 | ## 6. FMA specifics | ||
| 166 | |||
| 167 | ## 6.1 Verified FMA facts | ||
| 168 | |||
| 169 | | Item | Current state | | ||
| 170 | |---|---| | ||
| 171 | | Data source | User-specified ModelScope FMA Small link | | ||
| 172 | | Local directory | `acr-engine/data/raw/fma_small_audio` | | ||
| 173 | | Audio file count | `8000` | | ||
| 174 | | Files usable for query cutting | `7994` | | ||
| 175 | | Median duration | `29.977s` | | ||
| 176 | | Real smoke run | **Running / intermediate artifacts already produced** | | ||
| 177 | |||
| 178 | Entry points for checking against real data: | ||
| 179 | - [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `check-local-ready` | ||
| 180 | - [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` | ||
| 181 | |||
| 182 | --- | ||
| 183 | |||
| 184 | ## 6.2 FMA cookbook | ||
| 185 | |||
| 186 | ```bash | ||
| 187 | cd /workspace/acr-engine | ||
| 188 | |||
| 189 | /usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local \ | ||
| 190 | fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 191 | |||
| 192 | /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local \ | ||
| 193 | fma data/raw/fma_small_audio --output-root data/external_ingested \ | ||
| 194 | --eval-ratio 0.2 --query-duration 8.0 | ||
| 195 | |||
| 196 | /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local \ | ||
| 197 | fma data/external_ingested/fma/manifests | ||
| 198 | ``` | ||
| 199 | |||
| 200 | To validate only the end-to-end pipeline: | ||
| 201 | |||
| 202 | ```bash | ||
| 203 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local \ | ||
| 204 | fma data/raw/fma_small_audio --output-root data/external_smoke \ | ||
| 205 | --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 | ||
| 206 | ``` | ||
| 207 | |||
| 208 | --- | ||
| 209 | |||
| 210 | ## 7. Script and responsibility index | ||
| 211 | |||
| 212 | | Script / file | Purpose | | ||
| 213 | |---|---| | ||
| 214 | | [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) | Audio directory -> manifest | | ||
| 215 | | [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) | Unified entry point for inspect / prepare / validate / smoke | | ||
| 216 | | [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py) | Training / test dataset loading and random cropping | | ||
| 217 | | [acr-engine/src/utils/audio.py](../acr-engine/src/utils/audio.py) | General audio processing and sliding windows | | ||
| 218 | | [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | Embedding extraction and reference sliding-window indexing | | ||
| 219 | | [acr-engine/train.py](../acr-engine/train.py) | Main training entry point | | ||
| 220 | | [acr-engine/evaluate.py](../acr-engine/evaluate.py) | Main evaluation entry point | | ||
| 221 | | [acr-engine/run_demo.py](../acr-engine/run_demo.py) | build-index / query demo | | ||
| 222 | | [acr-engine/scripts/export_manifest_to_pgvector_json.py](../acr-engine/scripts/export_manifest_to_pgvector_json.py) | Exports manifests as pgvector-ready JSON | | ||
| 223 | | [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py) | PostgreSQL bulk-load template | | ||
| 224 | | [acr-engine/sql/pgvector_schema.sql](../acr-engine/sql/pgvector_schema.sql) | pgvector schema template | | ||
| 225 | |||
| 226 | --- | ||
| 227 | |||
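| 228 | ## 8. Is a GPU much faster? | ||
| 229 | |||
| 230 | ## 8.1 Conclusion first | ||
| 231 | |||
| 232 | **Yes. For real data at FMA scale (8000 tracks), a GPU is usually much faster than a CPU.** | ||
| 233 | |||
| 234 | Reasons: | ||
| 235 | - After Mel feature extraction, the ECAPA forward and backward passes are mostly tensor computation; | ||
| 236 | - `train.py` already supports `--device auto`, and the CUDA path has mixed precision enabled; | ||
| 237 | - The real FMA smoke run currently executes on CPU, where training over thousands of batches is noticeably slow. | ||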
| 238 | |||
| 239 | ## 8.2 Current code state | ||
| 240 | |||
| 241 | | Pipeline | Current state | | ||
| 242 | |---|---| | ||
| 243 | | `train.py` | Supports `--device auto/cuda/cpu` | | ||
| 244 | | CUDA mixed precision | Supported | | ||
| 245 | | `smoke-local` | Now supports `--device cpu|cuda|auto` | | ||
| 246 | | `evaluate.py` | CLI currently defaults to `cpu` | | ||
| 247 | | `run_demo.py build-index` | Also runs on `cpu` in the current smoke | | ||
| 248 | |||
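| 249 | ### One thing to note | ||
| 250 | |||
| 251 | `smoke-local` now supports explicit device selection, but one implementation detail must be made clear: | ||
| 252 | - `train.py` understands `auto` directly | ||
| 253 | - the embedding side of `run_demo.py / evaluate.py` cannot accept the string `auto` directly | ||
| 254 | |||
| 255 | So `smoke-local` currently: | ||
| 256 | - accepts `--device auto` externally | ||
| 257 | - resolves it to the real device internally, then passes that to training / index build / evaluation | ||
| 258 | |||
| 259 | This lets a real-data smoke run reuse the GPU directly, without manually splitting it into several commands. | ||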
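The "resolve once, then pass the concrete device on" behaviour described above might look roughly like the sketch below. `resolve_device` is a hypothetical name, not the function used by `smoke-local`; the only library call relied on is `torch.cuda.is_available()`, and the `cuda_available` override exists purely so the logic can be exercised without PyTorch installed.

```python
from typing import Optional


def resolve_device(requested: str, cuda_available: Optional[bool] = None) -> str:
    """Turn `cpu|cuda|auto` into a concrete `cpu` or `cuda` string.

    Hypothetical helper; it illustrates resolving `auto` once so that
    downstream tools that do not understand `auto` never see it.
    """
    if requested not in {"cpu", "cuda", "auto"}:
        raise ValueError(f"unsupported device: {requested!r}")

    if cuda_available is None:
        try:
            import torch  # imported lazily so the sketch works without PyTorch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False

    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("cuda was requested but is not available")
    return requested
```

Resolving in one place means the train, build-index, and evaluate steps all receive the same concrete value, so a run cannot silently train on GPU and evaluate on a differently resolved device.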
| 260 | |||
| 261 | --- | ||
| 262 | |||
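| 263 | ## 9. If we want to store to pgvector later, what to do now | ||
| 264 | |||
| 265 | ## 9.1 Correct layering | ||
| 266 | |||
| 267 | | Content | Storage location | | ||
| 268 | |---|---| | ||
| 269 | | Raw audio / normalized audio | File system / NAS / object storage | | ||
| 270 | | Manifest / metadata | Regular PostgreSQL tables | | ||
| 271 | | Vector embeddings | `pgvector` column | | ||
| 272 | | Retrieval parameters and versions | PostgreSQL / config center | | ||
| 273 | |||
| 274 | Principles: | ||
| 275 | - **Do not put raw mp3 data directly into pgvector tables**; | ||
| 276 | - Normalize the audio and manifests first; | ||
| 277 | - Then produce embeddings and structured records from the manifests. | ||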
| 278 | |||
| 279 | --- | ||
| 280 | |||
| 281 | ## 9.2 Recommended pgvector data model | ||
| 282 | |||
| 283 | ```mermaid | ||
| 284 | flowchart TD | ||
| 285 | A[songs] --> B[references] | ||
| 286 | A --> C[segments] | ||
| 287 | B --> D[reference_embeddings] | ||
| 288 | C --> E[query_embeddings] | ||
| 289 | ``` | ||
| 290 | |||
| 291 | Current repository templates: | ||
| 292 | - [acr-engine/sql/pgvector_schema.sql](../acr-engine/sql/pgvector_schema.sql) | ||
| 293 | - [acr-engine/scripts/export_manifest_to_pgvector_json.py](../acr-engine/scripts/export_manifest_to_pgvector_json.py) | ||
| 294 | - [acr-engine/scripts/pgvector_bulk_load_template.py](../acr-engine/scripts/pgvector_bulk_load_template.py) | ||
| 295 | |||
| 296 | --- | ||
| 297 | |||
| 298 | ## 9.3 Fields to keep from today onward | ||
| 299 | |||
| 300 | | Field | Purpose | | ||
| 301 | |---|---| | ||
| 302 | | `song_id` | Primary join key | | ||
| 303 | | `version_id` | Multi-version extension | | ||
| 304 | | `audio_path` / `audio_uri` | Trace back to the audio asset | | ||
| 305 | | `duration` | Slice validity check | | ||
| 306 | | `offset` | Time position of the query in the reference | | ||
| 307 | | `type` | Training / retrieval role | | ||
| 308 | | `segment_type` | intro/mid/outro/external | | ||
| 309 | | `source_dataset` | Data-source governance | | ||
| 310 | | `license` | Compliance governance | | ||
| 311 | | `split` | train/val/test | | ||
| 312 | | `model_version` | Vector version control | | ||
| 313 | | `data_version` | Data snapshot version | | ||
| 315 | --- | ||
| 316 | |||
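| 317 | ## 10. A real caveat: the 5s / 8s configuration mismatch | ||
| 318 | |||
| 319 | The repository has a real issue that must be recorded in the handoff docs: | ||
| 320 | |||
| 321 | | Item | Current fact | | ||
| 322 | |---|---| | ||
| 323 | | `smoke-local` command | Commonly run with `--query-duration 8.0` | | ||
| 324 | | Training dataset | Still reads with `segment_dur=5.0` | | ||
| 325 | | Existing FMA smoke report `config.json` | Older artifacts recorded `query_duration=5.0` | | ||
| 326 | |||
| 327 | Explanation: | ||
| 328 | - The **manifest query duration**, the **training crop duration**, and the **query_duration recorded in reports** do not currently come from a single configuration source; | ||
| 329 | - The older `fma_reports_smoke/config.json` has a timestamp earlier than the latest manifests, so it is a consistency issue in historical experiment artifacts; | ||
| 330 | - The code has started to split the smoke config summary explicitly into: | ||
| 331 | - `manifest_query_duration` | ||
| 332 | - `train_segment_duration` | ||
| 333 | - `query_duration_legacy` | ||
| 334 | - Going forward, "manifest query duration / train clip duration / eval query duration / report metadata" should be unified into one explicit configuration structure. | ||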
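One possible shape for the single explicit configuration structure suggested above is sketched here. `manifest_query_duration` and `train_segment_duration` are names mentioned in this document; `eval_query_duration`, the `DurationConfig` class, and its methods are hypothetical and only illustrate the idea of one source of truth that is also written into report metadata.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DurationConfig:
    manifest_query_duration: float  # length of queries written into manifests
    train_segment_duration: float   # length of the random training crop
    eval_query_duration: float      # hypothetical: length used at evaluation time

    def mismatches(self) -> List[str]:
        """List places where durations differ, so the difference is explicit."""
        notes = []
        if self.train_segment_duration != self.manifest_query_duration:
            notes.append(
                f"train crop {self.train_segment_duration}s != "
                f"manifest query {self.manifest_query_duration}s"
            )
        if self.eval_query_duration != self.manifest_query_duration:
            notes.append(
                f"eval query {self.eval_query_duration}s != "
                f"manifest query {self.manifest_query_duration}s"
            )
        return notes

    def to_report_metadata(self) -> str:
        """Serialize all durations so a report records exactly what was used."""
        return json.dumps(asdict(self), sort_keys=True)
```

With the values described in this section, `DurationConfig(8.0, 5.0, 8.0).mismatches()` reports the 5s versus 8s gap instead of leaving it implicit in three separate places.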
| 335 | |||
| 336 | --- | ||
| 337 | |||
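| 338 | ## 11. Practical advice for building your own dataset | ||
| 339 | |||
| 340 | 1. **Build the reference pool from the full catalog first.** | ||
| 341 | 2. **Cut clean queries from the reference pool as the first-layer training set.** | ||
| 342 | 3. **Then produce the three kinds of harder queries: augmented / confused / humming_like.** | ||
| 343 | 4. **Hold out a fixed portion that is never trained on and serves only as the test set.** | ||
| 344 | 5. **Complete the manifest fields first, then move on to pgvector and the production service.** | ||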
| 345 | |||
| 346 | --- | ||
| 347 | |||
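| 348 | ## 11.5 Slicing strategy: do not rely on random cuts alone | ||
| 349 | |||
| 350 | The project now supports several slicing strategies, each with a different role: | ||
| 351 | |||
| 352 | | Strategy | Where it applies | Purpose | Integrated | | ||
| 353 | |---|---|---|---| | ||
| 354 | | `random` | training query | Improves generalization; simulates unknown user cut points | Yes | | ||
| 355 | | `sliding` | index building / query generation | Guarantees coverage; reduces missed recalls | Yes | | ||
| 356 | | `silence_aware` | training query / external query generation | Prefers to avoid silence and land on segments with real musical content | Yes | | ||
| 357 | | `high_energy` | training query / external query generation | Prefers high-RMS regions, closer to chorus / lead vocal / strong-rhythm passages | Yes | | ||
| 358 | | `onset_aware` | training query / external query generation | Prefers positions near onset events; fewer cuts into decay tails or empty beats | Yes | | ||
| 359 | | `beat_aware` | training query / external query generation | Prefers positions near beats; suits strongly rhythmic pop / electronic / dance music | Yes | | ||
| 360 | | `repeated_section_aware` | training query / external query generation | Prefers the repeated main section most similar to the other windows, approximating a chorus / repeated hook | Yes | | ||
| 361 | | `hybrid` | training query / external query generation | Mixes repeated-section / beat / energy / onset / silence / random | Yes | | ||
| 362 | |||
| 363 | How to read this: | ||
| 364 | |||
| 365 | 1. **Training is not all random cuts** | ||
| 366 | The training set can currently use `random / silence_aware / high_energy / onset_aware / beat_aware / repeated_section_aware / hybrid` | ||
| 367 | 2. **Reference index building does not use random cuts** | ||
| 368 | Index building still uses a fixed sliding window | ||
| 369 | 3. **External-data query generation is not limited to random cuts either** | ||
| 370 | It now accepts `--query-strategy random|silence_aware|high_energy|onset_aware|beat_aware|repeated_section_aware|hybrid` | ||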
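
To make the difference between `random` and an energy-driven strategy concrete, here is a minimal sketch of how a `high_energy` style picker could choose a window start. It is not the repository's implementation: the function names, the non-overlapping 0.5 s frames, and the pure-Python RMS are all assumptions made to keep the example dependency-free.

```python
import math


def frame_rms(samples, frame_len):
    """RMS of each non-overlapping frame of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def pick_high_energy_offset(samples, sample_rate, window_sec, frame_sec=0.5):
    """Return the start (seconds) of the window with the highest summed RMS.

    Hypothetical helper: ties are resolved in favor of the earliest window.
    """
    frame_len = int(sample_rate * frame_sec)
    rms = frame_rms(samples, frame_len)
    frames_per_window = max(1, int(round(window_sec / frame_sec)))
    if len(rms) <= frames_per_window:
        return 0.0  # audio shorter than one window: start at the beginning
    best_start, best_energy = 0, -1.0
    for start in range(len(rms) - frames_per_window + 1):
        energy = sum(rms[start:start + frames_per_window])
        if energy > best_energy:
            best_start, best_energy = start, energy
    return best_start * frame_sec


# Synthetic 10 s signal at 100 Hz: quiet, with a loud burst from 6 s to 8 s.
signal = [0.01] * 600 + [1.0] * 200 + [0.01] * 200
offset = pick_high_energy_offset(signal, sample_rate=100, window_sec=2.0)
```

On this synthetic signal the picker lands on the loud burst, whereas a random cut would most often land in the quiet region. A `hybrid` strategy could sample between such pickers and a plain random cut, which is the trade-off the list above describes.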
| 371 | |||
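| 372 | ### 11.6 Why we did not switch wholesale to `librosa.segment.*` | ||
| 373 | |||
| 374 | This was considered; the current choice is a deliberately conservative engineering trade-off: | ||
| 375 | |||
| 376 | - `librosa.effects.split`, `librosa.onset.onset_detect`, `librosa.beat.beat_track`, and `librosa.feature.chroma_cqt` are already integrated | ||
| 377 | - The high-payoff candidates (non-silent regions, onsets, beats, repeated sections) were wired up first | ||
| 378 | - Heavier structural segmentation is not yet part of the default main pipeline | ||
| 379 | |||
| 380 | Reasons: | ||
| 381 | |||
| 382 | 1. **ACR queries are not always structurally aligned segments** | ||
| 383 | What a user captures may be the chorus, but it may also be an interlude, a screen-recording fragment, or a short-video remix clip. | ||
| 384 | 2. **Heavy structural segmentation costs more CPU** | ||
| 385 | It is not lightweight enough for batch prepare/smoke runs on real open datasets such as FMA. | ||
| 386 | 3. **Training still needs randomness** | ||
| 387 | Pure structural segmentation reduces the diversity of the cut-point distribution. | ||
| 388 | |||
| 389 | The more reasonable strategy for now: | ||
| 390 | - `hybrid` as the recommended default training slicing strategy | ||
| 391 | - `beat_aware / repeated_section_aware` as stronger options biased toward the main musical sections | ||
| 392 | - `random` kept as the generalization baseline | ||
| 393 | |||
| 394 | Why not rely entirely on music-structure segmentation? | ||
| 395 | |||
| 396 | - Real ACR queries often come from short videos, screen recordings, and casual captures, and do not necessarily align with beat or section boundaries | ||
| 397 | - **Silence-aware segmentation** comes first: it has the largest payoff and the lowest risk | ||
| 398 | - More complex beat / chorus / onset segmentation can serve as a next-stage enhancement, but should not replace the existing random augmentation | ||
| 399 | |||
| 400 | ### Training-side recommendation | ||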
| 401 | |||
| 402 | ```bash | ||
| 403 | /usr/local/miniconda3/bin/python acr-engine/train.py \ | ||
| 404 | --data data/your_manifests \ | ||
| 405 | --segment-strategy hybrid \ | ||
| 406 | --silence-top-db 30 | ||
| 407 | ``` | ||
| 408 | |||
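| 409 | Suggestions: | ||
| 410 | - baseline: `random` | ||
| 411 | - more robust for music tasks: `hybrid` | ||
| 412 | - source audio known to contain a lot of silence: `silence_aware` | ||
| 413 | - closer to chorus / strong rhythm: `high_energy` | ||
| 414 | - closer to short-note onsets / hits: `onset_aware` | ||
| 415 | - closer to a stable beat grid: `beat_aware` | ||
| 416 | - closer to chorus / repeated hook: `repeated_section_aware` | ||
| 417 | |||
| 418 | ### Recommended external-data query generation | ||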
| 419 | |||
| 420 | ```bash | ||
| 421 | /usr/local/miniconda3/bin/python acr-engine/src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio \ | ||
| 422 | --output-root data/external_ingested \ | ||
| 423 | --query-duration 8 \ | ||
| 424 | --query-stride 4 \ | ||
| 425 | --query-strategy high_energy \ | ||
| 426 | --silence-top-db 30 | ||
| 427 | ``` | ||
| 428 | |||
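| 429 | This generates queries preferentially from high-energy regions, rather than sampling at random from long silent heads/tails or low-energy interludes. | ||
| 430 | |||
| 431 | Additional suggestions: | ||
| 432 | |||
| 433 | | Scenario | Recommended strategy | | ||
| 434 | |---|---| | ||
| 435 | | Recordings with long silent heads/tails | `silence_aware` | | ||
| 436 | | Closer to chorus / main section | `high_energy` | | ||
| 437 | | Closer to hits / vocal entry points | `onset_aware` | | ||
| 438 | | Closer to regular beats / rhythmic skeleton | `beat_aware` | | ||
| 439 | | Closer to repeated main section / chorus hook | `repeated_section_aware` | | ||
| 440 | | Music-aware but still generalizing | `hybrid` | | ||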
| 441 | |||
| 442 | --- | ||
| 443 | |||
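| 444 | ## 12. Which of your internal asset types should take part in training | ||
| 445 | |||
| 446 | ## 12.1 One-page summary | ||
| 447 | |||
| 448 | If the goal is **music ACR / song identification**, use the following priority: | ||
| 449 | |||
| 450 | - **Primary reference (first choice)**: `11 original, lossless` | ||
| 451 | - **Secondary reference / compatibility augmentation**: `1 original, compressed` | ||
| 452 | - **Main query sources**: `7 Douyin clip`, `8 clip (chorus)`, `16 Kuaishou clip`, `18 audio demo` | ||
| 453 | - **Instrumentals**: `2/9/10/12` should not be mixed blindly into the "original song" main task under the same label, unless your business explicitly needs to identify instrumental versions | ||
| 454 | - **Plain text / images / licenses / archives**: excluded from audio training; metadata governance only | ||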
| 455 | |||
| 456 | --- | ||
| 457 | |||
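| 458 | ## 12.2 Recommended mapping table | ||
| 459 | |||
| 460 | | type | Content | Suggested role | In main training? | Notes | | ||
| 461 | |---:|---|---|---|---| | ||
| 462 | | 1 | Original, compressed (mp3) | secondary reference / source of clean queries | Yes | Can serve as primary reference when 11 is missing; when 11 exists, better used for compression-degradation augmentation | | ||
| 463 | | 2 | Instrumental with harmony, compressed (mp3) | optional separate version library / hard negative | Conditional | Should not share the same training semantics as the original | | ||
| 464 | | 3 | TXT lyrics | metadata | No | Can be stored for retrieval enhancement; not fed to the audio model | | ||
| 465 | | 4 | Cover art | metadata | No | Not used in audio training | | ||
| 466 | | 5 | License letter | compliance metadata | No | Compliance governance only | | ||
| 467 | | 6 | Album info (txt) | metadata | No | Metadata only | | ||
| 468 | | 7 | Douyin clip | query | Yes | Well suited to real-world query training/evaluation | | ||
| 469 | | 8 | Clip (chorus) | query | Yes | High-value query; keep with priority | | ||
| 470 | | 9 | Instrumental without harmony, compressed (mp3) | optional separate version library / hard negative | Conditional | Should not be merged into the original's main label by default | | ||
| 471 | | 10 | Instrumental without harmony, lossless | optional reference / hard negative | Conditional | Enters main training only for the "identify instrumental versions" task | | ||
| 472 | | 11 | Original, lossless (wav/flac) | primary reference | **Strongly recommended** | Best suited as the canonical reference ground truth | | ||
| 473 | | 12 | Instrumental with harmony, lossless | optional separate version library / hard negative | Conditional | Acoustically far from the original; do not merge into the original main task by default | | ||
| 474 | | 13 | Synced lyrics (lrc) | metadata | No | Usable for lyrics-side retrieval; not fed to the audio model | | ||
| 475 | | 14 | Cover source file (psd) | metadata | No | Not used in training | | ||
| 476 | | 16 | Kuaishou clip | query | Yes | Similar to 7; suited to real short-video evaluation | | ||
| 477 | | 17 | Lyrics-and-composition archive | archive metadata | No | Unpack and govern first; do not train on it directly | | ||
| 478 | | 18 | Audio demo (mp3/wav) | query / weak reference | Conditional | Tier by quality first; usable as query, and as auxiliary reference if needed | | ||
| 479 | | 19 | Sheet music (png) | metadata | No | Not used in audio training | | ||
| 480 | | 20 | Translated synced lyrics | metadata | No | Text-side extension only | | ||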
| 481 | |||
| 482 | --- | ||
| 483 | |||
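| 484 | ## 12.3 Most recommended training combination for the main task | ||
| 485 | |||
| 486 | ### A. If your goal is "identify the original song" | ||
| 487 | |||
| 488 | ```text | ||
| 489 | reference: | ||
| 490 | - 11 original, lossless (primary) | ||
| 491 | - 1 original, compressed (secondary) | ||
| 492 | |||
| 493 | query: | ||
| 494 | - 5s / 8s clean queries cut from 11 / 1 | ||
| 495 | - 7 Douyin clips | ||
| 496 | - 8 chorus clips | ||
| 497 | - 16 Kuaishou clips | ||
| 498 | - 18 audio demos (after quality filtering) | ||
| 499 | ``` | ||
| 500 | |||
| 501 | ### B. If your goal is "identify both originals and instrumental versions" | ||
| 502 | |||
| 503 | Do not simply merge originals and instrumentals into one training label. Keep at least: | ||
| 504 | |||
| 505 | - `canonical_song_id`: work-level ID | ||
| 506 | - `version_id`: version-level ID | ||
| 507 | - `audio_role`: `original / inst_with_harmony / inst_no_harmony / short_clip / demo` | ||
| 508 | |||
| 509 | This leaves two strategies open later: | ||
| 510 | |||
| 511 | 1. **Work-level identification** | ||
| 512 | - The original and the instrumentals share a `canonical_song_id` | ||
| 513 | - But keep different `version_id` values | ||
| 514 | 2. **Version-level identification** | ||
| 515 | - The original and the instrumentals get completely separate labels | ||
| 516 | |||
| 517 | If your main goal right now is only for ACR to identify "which song this is", not to tell instrumental versions apart, start with strategy 1, but **do not let instrumental versions take too large a share of training**, or they will pull the main model off target. | ||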
| 518 | |||
| 519 | --- | ||
| 520 | |||
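| 521 | ## 12.4 My practical recommendations | ||
| 522 | |||
| 523 | ### Data that must go into the first batch | ||
| 524 | |||
| 525 | - `11 original, lossless` | ||
| 526 | - `1 original, compressed` | ||
| 527 | - `7 Douyin clip` | ||
| 528 | - `8 chorus clip` | ||
| 529 | - `16 Kuaishou clip` | ||
| 530 | |||
| 531 | ### Data to add in a controlled second batch | ||
| 532 | |||
| 533 | - `18 audio demo` | ||
| 534 | - Filter by quality first | ||
| 535 | - Clean demos can serve as queries | ||
| 536 | - Demos that are clearly truncated or heavily noisy go into the hard-case pool | ||
| 537 | |||
| 538 | ### Data that should not be merged into the main training labels in phase one | ||
| 539 | |||
| 540 | - `2 instrumental with harmony` | ||
| 541 | - `9 instrumental without harmony, compressed` | ||
| 542 | - `10 instrumental without harmony, lossless` | ||
| 543 | - `12 instrumental with harmony, lossless` | ||
| 544 | |||
| 545 | A safer approach: | ||
| 546 | - Store them separately first | ||
| 547 | - Use them first as an evaluation set / hard negatives | ||
| 548 | - Decide whether to add a multi-version task only after the main model is stable | ||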
| 549 | |||
| 550 | --- | ||
| 551 | |||
| 552 | ## 12.5 Suggested manifest / pgvector fields | ||
| 553 | |||
| 554 | If you want these types to actually land in training and in the database, add at least the following fields: | ||
| 555 | |||
| 556 | | Field | Example | Purpose | | ||
| 557 | |---|---|---| | ||
| 558 | | `canonical_song_id` | `song_123` | Work-level primary key | | ||
| 559 | | `version_id` | `song_123_orig_lossless` | Version-level primary key | | ||
| 560 | | `asset_type_code` | `11` | Original type enum | | ||
| 561 | | `audio_role` | `original_lossless` | Training and retrieval semantics | | ||
| 562 | | `type` | `reference / clean / augmented / confused` | Role in model training | | ||
| 563 | | `source_platform` | `douyin / kuaishou / internal` | Source governance | | ||
| 564 | |||
| 565 | ### A practical mapping example | ||
| 566 | |||
| 567 | | Original type | Recommended `audio_role` | Recommended training `type` | | ||
| 568 | |---:|---|---| | ||
| 569 | | 11 | `original_lossless` | `reference` | | ||
| 570 | | 1 | `original_lossy` | `reference` or `clean` | | ||
| 571 | | 7 | `short_video_clip` | `clean` / `confused` | | ||
| 572 | | 8 | `chorus_clip` | `clean` | | ||
| 573 | | 16 | `short_video_clip` | `clean` / `confused` | | ||
| 574 | | 18 | `demo_audio` | `clean` / `augmented` | | ||
| 575 | | 2/9/10/12 | `instrumental_variant` | Keep out of main training for now, or use as hard negative | | ||
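
The mapping table above can be written down as a small lookup. The sketch below is illustrative only and is not the code of `internal_asset_type_mapper.py`: the names `ROLE_BY_TYPE`, `INSTRUMENTAL_TYPES`, and `map_asset_type`, and the shape of the returned dict, are hypothetical. Type codes that the table does not list are treated here as metadata-only, which is an assumption drawn from section 12.2.

```python
# Audio roles and candidate training types, copied from the table above.
ROLE_BY_TYPE = {
    11: ("original_lossless", ("reference",)),
    1: ("original_lossy", ("reference", "clean")),
    7: ("short_video_clip", ("clean", "confused")),
    8: ("chorus_clip", ("clean",)),
    16: ("short_video_clip", ("clean", "confused")),
    18: ("demo_audio", ("clean", "augmented")),
}

# Instrumental variants: kept out of main training by default.
INSTRUMENTAL_TYPES = frozenset({2, 9, 10, 12})


def map_asset_type(asset_type_code):
    """Return a hypothetical routing record for one original type code."""
    if asset_type_code in ROLE_BY_TYPE:
        role, train_types = ROLE_BY_TYPE[asset_type_code]
        return {
            "asset_type_code": asset_type_code,
            "audio_role": role,
            "train_types": list(train_types),
            "in_main_training": True,
        }
    if asset_type_code in INSTRUMENTAL_TYPES:
        return {
            "asset_type_code": asset_type_code,
            "audio_role": "instrumental_variant",
            "train_types": [],
            "in_main_training": False,
        }
    # Assumption: everything else (lyrics, covers, licenses, ...) is metadata.
    return {
        "asset_type_code": asset_type_code,
        "audio_role": None,
        "train_types": [],
        "in_main_training": False,
    }
```

Keeping `asset_type_code` alongside `audio_role` in the record means a later change of policy (for example, admitting instrumentals as hard negatives) only touches the lookup, not the stored rows.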
| 576 | |||
| 577 | ## 12.6 The repository now has an executable mapping script | ||
| 578 | |||
| 579 | Script: | ||
| 580 | - [acr-engine/scripts/internal_asset_type_mapper.py](../acr-engine/scripts/internal_asset_type_mapper.py) | ||
| 581 | |||
| 582 | What it does: | ||
| 583 | - Reads the internal asset CSV | ||
| 584 | - Routes rows automatically by the `type` enum into: | ||
| 585 | - `references.json` | ||
| 586 | - `queries.json` | ||
| 587 | - `metadata_only.json` | ||
| 588 | - `excluded.json` | ||
| 589 | - Optionally generates directly: | ||
| 590 | - `manifest_bundle/catalog.json` | ||
| 591 | - `manifest_bundle/train.json` | ||
| 592 | - `manifest_bundle/test.json` | ||
| 593 | - `manifest_bundle/val.json` | ||
| 594 | - Optionally generates directly: | ||
| 595 | - `pgvector_payload.json` | ||
| 596 | - Optionally validates audio: | ||
| 597 | - `audio_exists` | ||
| 598 | - `duration_sec` | ||
| 599 | - `validation_status` | ||
| 600 | |||
| 601 | Shortest example: | ||
| 602 | |||
| 603 | ```bash | ||
| 604 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map | ||
| 605 | ``` | ||
| 606 | |||
| 607 | If you want trainable manifests produced directly: | ||
| 608 | |||
| 609 | ```bash | ||
| 610 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map --emit-manifests --eval-ratio 0.2 | ||
| 611 | ``` | ||
| 612 | |||
| 613 | If your CSV uses relative paths, add the audio root directory: | ||
| 614 | |||
| 615 | ```bash | ||
| 616 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-manifests | ||
| 617 | ``` | ||
| 618 | |||
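| 619 | The script will then fill in automatically: | ||
| 620 | - `audio_exists` | ||
| 621 | - `duration` | ||
| 622 | - a `missing_audio` summary | ||
| 623 | |||
| 624 | The script now also supports: | ||
| 625 | - `--duration-field` | ||
| 626 | - `--offset-field` | ||
| 627 | - `--default-query-duration` | ||
| 628 | - `--default-query-offset` | ||
| 629 | - `--query-stride` | ||
| 630 | |||
| 631 | The rules are: | ||
| 632 | - A query uses the CSV's own `duration/offset` first | ||
| 633 | - When duration is missing, it falls back to the default query duration (for example `8.0s`) rather than the full audio length | ||
| 634 | - The full audio length is kept separately as `source_audio_duration`, for use in sliding-window query expansion | ||
| 635 | - When the CSV gives an explicit offset, the query stays a single entry and is not auto-expanded | ||
| 636 | - When there is no explicit offset and `--query-stride` is set, the query is auto-expanded into multiple queries by sliding window | ||
| 637 | - If `--query-stride` is not set and there is no explicit offset, the offset falls back to the default (usually `0.0`) | ||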
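
The precedence in the rules above can be sketched as a single decision function. This is a reading of the rules, not the script's actual code: `resolve_query_plan`, its return shape, and the `mode` labels are hypothetical, and the parameters only mirror the CLI flags (`--duration-field`, `--offset-field`, `--default-query-duration`, `--default-query-offset`, `--query-stride`) by name.

```python
def resolve_query_plan(row, default_duration=8.0, default_offset=0.0, stride=None,
                       duration_field="duration", offset_field="offset"):
    """Decide how one CSV row becomes queries (hypothetical helper).

    Returns a dict with the resolved duration and one of three modes:
    explicit_offset, sliding, or default_offset.
    """

    def _value(field):
        # Treat missing keys and empty CSV cells the same way.
        raw = row.get(field)
        return None if raw in (None, "") else float(raw)

    # Rule: CSV duration wins; otherwise use the default query duration,
    # never the full audio length.
    duration = _value(duration_field)
    if duration is None:
        duration = default_duration

    # Rule: an explicit CSV offset keeps the row as a single query.
    offset = _value(offset_field)
    if offset is not None:
        return {"mode": "explicit_offset", "duration": duration, "offset": offset}

    # Rule: no explicit offset plus a stride means sliding-window expansion.
    if stride is not None:
        return {"mode": "sliding", "duration": duration, "offset": None,
                "stride": stride}

    # Rule: otherwise fall back to the default offset.
    return {"mode": "default_offset", "duration": duration, "offset": default_offset}
```

The order of the checks is the point of the sketch: a hand-labeled offset is tested before the stride, so setting a stride globally cannot override manual timestamps.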
| 638 | |||
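| 639 | Recommended parameters: | ||
| 640 | |||
| 641 | | Scenario | Recommended parameters | Notes | | ||
| 642 | |---|---|---| | ||
| 643 | | Internal short-video clips with hand-labeled start points | `--offset-field offset_sec` | Keeps the manual timestamps; auto-expansion does not override manual labels | | ||
| 644 | | Only full-length source audio, no query start points | `--default-query-duration 8 --query-stride 4` | Produces multi-window queries with 50% overlap | | ||
| 645 | | Just want a minimal usable set first | `--default-query-duration 8` | Each query exports a single segment, default offset=0 | | ||
| 646 | |||
| 647 | If your next step is to load into PostgreSQL / pgvector, you can export directly: | ||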
| 648 | |||
| 649 | ```bash | ||
| 650 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --audio-root data/internal_audio --output-dir out/internal_asset_map --emit-pgvector-json --pgvector-split train | ||
| 651 | ``` | ||
| 652 | |||
| 653 | Auto-expansion example: | ||
| 654 | |||
| 655 | ```bash | ||
| 656 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv \ | ||
| 657 | --audio-root data/internal_audio \ | ||
| 658 | --output-dir out/internal_asset_map \ | ||
| 659 | --default-query-duration 8 \ | ||
| 660 | --query-stride 4 \ | ||
| 661 | --emit-manifests \ | ||
| 662 | --emit-pgvector-json | ||
| 663 | ``` | ||
| 664 | |||
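| 665 | For example, a 30s audio file with an `8s` query and a `4s` stride exports the offsets: | ||
| 666 | - `0, 4, 8, 12, 16, 20, 22` (the last value, 22 = 30 − 8, is off the 4s grid; it is the window aligned to the end of the audio) | ||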
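
The offset list above is consistent with a stride grid plus one extra window aligned to the end of the audio. A minimal sketch of that arithmetic follows. The function name `sliding_offsets` and the handling of audio shorter than one window are assumptions, and the script's real behavior in edge cases may differ.

```python
def sliding_offsets(total_sec, window_sec, stride_sec):
    """Window start offsets: a stride grid, plus a tail-aligned last window."""
    if total_sec <= window_sec:
        return [0]  # assumption: short audio yields a single window at 0
    last_start = total_sec - window_sec  # latest start that still fits
    steps = int(last_start // stride_sec)
    offsets = [i * stride_sec for i in range(steps + 1)]
    if offsets[-1] < last_start:
        offsets.append(last_start)  # cover the tail of the audio
    return offsets


offsets = sliding_offsets(30, 8, 4)
# offsets == [0, 4, 8, 12, 16, 20, 22]
```

Without the tail-aligned window, the final 2 seconds of a 30s track (28s to 30s) would never appear in any query.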
| 667 | |||
| 668 | The exported `queries.json` and `pgvector_payload.json` both keep `query_index`, which makes it easy to trace each window back to its source later. | ||
| 669 | |||
| 670 | The output contains: | ||
| 671 | - `songs` | ||
| 672 | - `references` | ||
| 673 | - `segments` | ||
| 674 | |||
| 675 | and additionally carries: | ||
| 676 | - `audio_role` | ||
| 677 | - `asset_type_code` | ||
| 678 | - `audio_exists` | ||
| 679 | - `validation_status` | ||
| 680 | |||
| 681 | If you want to include the instrumental types in the export temporarily, use: | ||
| 682 | |||
| 683 | ```bash | ||
| 684 | /usr/local/miniconda3/bin/python acr-engine/scripts/internal_asset_type_mapper.py assets.csv --output-dir out/internal_asset_map --include-conditionals-as query | ||
| 685 | ``` | ||
| 686 | |||
| 687 | The recommended default, however, remains: | ||
| 688 | - `--include-conditionals-as skip` | ||
| 689 | |||
| 690 | This fits the current main-task strategy better: get original-song identification solid first, then bring in instrumental versions step by step. | ||
| 691 | |||
| 692 | ## Sources | ||
| 693 | - The facts about the current code come from [acr-engine/src/data/dataset.py](../acr-engine/src/data/dataset.py), [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py), [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py), [acr-engine/src/utils/audio.py](../acr-engine/src/utils/audio.py), [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py), [acr-engine/train.py](../acr-engine/train.py) |