Make the Phase-1 ACR plan executable for each delivery role
Constraint: The architecture and schema docs were already in place, but teams still lacked a concrete implementation checklist and registry bootstrap contract for encoder-only rollout
Rejected: leaving execution guidance implicit in architecture prose | would slow Phase-1 delivery and cause inconsistent model/feature initialization
Confidence: high
Scope-risk: narrow
Directive: treat Phase-1 implementation sequencing and model/feature/reference-set bootstrap as first-class docs that evolve with the schema
Tested: git diff --check on changed docs; Python document sanity check; README/CHANGELOG link coverage verified with rg
Not-tested: no runtime behavior changed; no database apply executed
Showing 4 changed files with 461 additions and 8 deletions
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
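| 3 | - Added the [Phase-1 Implementation Checklist](./phase1-implementation-checklist.md), which breaks the encoder-only route into executable stages: master data, reference set, feature set, indexing, and evaluation. | ||
| 4 | - Added the [Model & Feature Set Bootstrap Guide](./model-feature-registry-bootstrap.md), which fills in the initialization conventions and example SQL for model_registry / feature_set_registry / reference_set_registry. | ||
| 3 | - Restructured the main documentation reading path and added role-based entry points: architecture, development, operations, and model foundation. | 5 | - Restructured the main documentation reading path and added role-based entry points: architecture, development, operations, and model foundation. |
| 4 | - Added the [SOTA Evolution Guide](./sota-evolution-guide.md), clarifying the Phase-1 encoder-only route, the roles of MERT/MuQ, and the later version/cover evolution. | 6 | - Added the [SOTA Evolution Guide](./sota-evolution-guide.md), clarifying the Phase-1 encoder-only route, the roles of MERT/MuQ, and the later version/cover evolution. |
| 5 | - Rewrote the [ACR System Blueprint](./acr-architecture.md), adding role views, the offline/online division of responsibilities, and a mapping from the current implementation to the target implementation. | 7 | - Rewrote the [ACR System Blueprint](./acr-architecture.md), adding role views, the offline/online division of responsibilities, and a mapping from the current implementation to the target implementation. | ... | ... |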
| ... | @@ -20,25 +20,30 @@ | ... | @@ -20,25 +20,30 @@ |
| 20 | 1. [acr-architecture.md](./acr-architecture.md) | 20 | 1. [acr-architecture.md](./acr-architecture.md) |
| 21 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) | 21 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) |
| 22 | 3. [postgresql-data-model.md](./postgresql-data-model.md) | 22 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 23 | 4. [session-handoff.md](./session-handoff.md) | 23 | 4. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) |
| 24 | 5. [session-handoff.md](./session-handoff.md) | ||
| 24 | 25 | ||
| 25 | ### 2. Development / Data / Retrieval Engineers | 26 | ### 2. Development / Data / Retrieval Engineers |
| 26 | 1. [postgresql-data-model.md](./postgresql-data-model.md) | 27 | 1. [postgresql-data-model.md](./postgresql-data-model.md) |
| 27 | 2. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | 28 | 2. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) |
| 28 | 3. [acr-architecture.md](./acr-architecture.md) | 29 | 3. [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) |
| 29 | 4. [runbook.md](./runbook.md) | 30 | 4. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) |
| 31 | 5. [acr-architecture.md](./acr-architecture.md) | ||
| 32 | 6. [runbook.md](./runbook.md) | ||
| 30 | 33 | ||
| 31 | ### 3. Operations / Platform / Service Engineers | 34 | ### 3. Operations / Platform / Service Engineers |
| 32 | 1. [acr-architecture.md](./acr-architecture.md) | 35 | 1. [acr-architecture.md](./acr-architecture.md) |
| 33 | 2. [postgresql-data-model.md](./postgresql-data-model.md) | 36 | 2. [postgresql-data-model.md](./postgresql-data-model.md) |
| 34 | 3. [service-api.md](./service-api.md) | 37 | 3. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) |
| 35 | 4. [runbook.md](./runbook.md) | 38 | 4. [service-api.md](./service-api.md) |
| 39 | 5. [runbook.md](./runbook.md) | ||
| 36 | 40 | ||
| 37 | ### 4. Model / Foundation / Research Engineers | 41 | ### 4. Model / Foundation / Research Engineers |
| 38 | 1. [sota-research-2026.md](./sota-research-2026.md) | 42 | 1. [sota-research-2026.md](./sota-research-2026.md) |
| 39 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) | 43 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) |
| 40 | 3. [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | 44 | 3. [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) |
| 41 | 4. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | 45 | 4. [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) |
| 46 | 5. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 42 | 47 | ||
| 43 | --- | 48 | --- |
| 44 | 49 | ||
| ... | @@ -49,6 +54,8 @@ | ... | @@ -49,6 +54,8 @@ |
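| 49 | | [acr-architecture.md](./acr-architecture.md) | Current system blueprint, role responsibilities, online/offline pipelines | Architecture, development, operations | | 54 | | [acr-architecture.md](./acr-architecture.md) | Current system blueprint, role responsibilities, online/offline pipelines | Architecture, development, operations | |
| 50 | | [sota-evolution-guide.md](./sota-evolution-guide.md) | SOTA evolution path, Phase-1 encoder-only plan, subsequent upgrade route | Architecture, model, retrieval | | 55 | | [sota-evolution-guide.md](./sota-evolution-guide.md) | SOTA evolution path, Phase-1 encoder-only plan, subsequent upgrade route | Architecture, model, retrieval | |
| 51 | | [postgresql-data-model.md](./postgresql-data-model.md) | PostgreSQL data dictionary, DDL design intent, flowcharts, query paths | Data, backend, retrieval, platform | | 56 | | [postgresql-data-model.md](./postgresql-data-model.md) | PostgreSQL data dictionary, DDL design intent, flowcharts, query paths | Data, backend, retrieval, platform | |
| 57 | | [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | Phase-1 rollout checklist, with work items broken down by stage | Architecture, development, platform | | ||
| 58 | | [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | Bootstrap guide for models, feature sets, and reference sets | Model, retrieval, data | | ||
| 52 | | [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | Description of the current training/manifest/pgvector prototype chain | Development, data | | 59 | | [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | Description of the current training/manifest/pgvector prototype chain | Development, data | |
| 53 | | [session-handoff.md](./session-handoff.md) | Latest status and context for resuming work | Whoever picks up a new session | | 60 | | [session-handoff.md](./session-handoff.md) | Latest status and context for resuming work | Whoever picks up a new session | |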
| 54 | 61 | ... | ... |
docs/model-feature-registry-bootstrap.md
0 → 100644
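| 1 | # Model & Feature Registry Bootstrap | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: define the Phase-1 initialization conventions for `model_registry`, `feature_set_registry`, and `reference_set_registry`, so that onboarding a new encoder does not require redesigning them each time. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | When Phase-1 does not fine-tune the foundation model, what actually needs initializing is not a "training job" but three kinds of objects: | ||
| 9 | |||
| 10 | 1. **Model definitions**: `model_registry` | ||
| 11 | 2. **Feature definitions**: `feature_set_registry` | ||
| 12 | 3. **Reference set definitions**: `reference_set_registry` | ||
| 13 | |||
| 14 | In other words, first write down how you intend to use each model, then start extracting features. | ||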
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Recommended naming conventions | ||
| 19 | |||
| 20 | ### 1.1 model_name | ||
| 21 | Use fixed lowercase names: | ||
| 22 | - `chromaprint` | ||
| 23 | - `mert` | ||
| 24 | - `muq` | ||
| 25 | - `ecapa` | ||
| 26 | - `coverhunter_encoder` (Phase-2+) | ||
| 27 | |||
| 28 | ### 1.2 model_version | ||
| 29 | The version should make the source and scale clear: | ||
| 30 | - `v1-95m` | ||
| 31 | - `v1-330m` | ||
| 32 | - `large-msd-iter` | ||
| 33 | - `acr-baseline-v1` | ||
| 34 | |||
| 35 | ### 1.3 Feature set naming | ||
| 36 | Recommended format: | ||
| 37 | |||
| 38 | ```text | ||
| 39 | <model_name>__<feature_name>__<window>_<hop>__<pooling>__<metric> | ||
| 40 | ``` | ||
| 41 | |||
| 42 | Examples: | ||
| 43 | - `mert__semantic_embedding__5s_2.5s__mean__cosine` | ||
| 44 | - `mert__semantic_embedding__10s_5s__mean__cosine` | ||
| 45 | - `muq__semantic_embedding__5s_2.5s__mean__cosine` | ||
| 46 | |||
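The naming convention above can be enforced in code rather than typed by hand. The following is a minimal sketch, assuming the separator layout seen in the example names (`__` between fields, `_` between window and hop); the `FeatureSetName` class and both helper functions are hypothetical and not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSetName:
    """Hypothetical container for the fields encoded in a feature set name."""
    model_name: str
    feature_name: str
    window_sec: float
    hop_sec: float
    pooling: str
    metric: str


def _fmt_sec(value: float) -> str:
    # 5.0 -> "5s", 2.5 -> "2.5s" (the "g" format drops a trailing ".0")
    return f"{value:g}s"


def build_feature_set_name(fs: FeatureSetName) -> str:
    """Join the fields with double underscores, as in the example names."""
    window = f"{_fmt_sec(fs.window_sec)}_{_fmt_sec(fs.hop_sec)}"
    return "__".join([fs.model_name, fs.feature_name, window, fs.pooling, fs.metric])


def parse_feature_set_name(name: str) -> FeatureSetName:
    """Inverse of build_feature_set_name; raises ValueError on malformed names."""
    model_name, feature_name, window, pooling, metric = name.split("__")
    window_part, hop_part = window.split("_")
    return FeatureSetName(
        model_name,
        feature_name,
        float(window_part.rstrip("s")),
        float(hop_part.rstrip("s")),
        pooling,
        metric,
    )
```

Because single underscores are allowed inside `model_name` and `feature_name`, the parser splits on the double underscore first and only then splits the window field.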
| 47 | --- | ||
| 48 | |||
| 49 | ## 2. Recommended Phase-1 bootstrap objects | ||
| 50 | |||
| 51 | ### 2.1 Model list | ||
| 52 | |||
| 53 | | model_name | model_version | Role | | ||
| 54 | |---|---|---| | ||
| 55 | | `chromaprint` | `v1` | exact lane | | ||
| 56 | | `mert` | `v1-95m` | primary semantic baseline | | ||
| 57 | | `muq` | `large-msd-iter` | semantic challenger | | ||
| 58 | | `ecapa` | `acr-baseline-v1` | legacy baseline / control | | ||
| 59 | |||
| 60 | --- | ||
| 61 | |||
| 62 | ### 2.2 Feature set list | ||
| 63 | |||
| 64 | | feature_set | Purpose | | ||
| 65 | |---|---| | ||
| 66 | | `chromaprint asset-level` | exact matching | | ||
| 67 | | `mert 5s/2.5s mean` | primary semantic baseline | | ||
| 68 | | `mert 10s/5s mean` | longer-context validation | | ||
| 69 | | `muq 5s/2.5s mean` | challenger baseline | | ||
| 70 | | `ecapa 5s/2.5s` | legacy control | | ||
| 71 | |||
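The `mean` entries in the feature set list describe how per-frame encoder outputs collapse into one vector per window, and the registration SQL in this document pairs that with `normalize_l2` and a `cosine` metric. The sketch below illustrates those three steps in plain Python, assuming the encoder output is already available as a list of frame vectors; the function names are hypothetical, and a real pipeline would more likely use a tensor library.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors into a single window-level vector."""
    if not frames:
        raise ValueError("cannot pool an empty window")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_of_unit_vectors(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, assuming both inputs are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Normalizing once at write time keeps query-time scoring to a single dot product, which is one reason to record the normalization flag on the feature set rather than leaving it implicit.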
| 72 | --- | ||
| 73 | |||
| 74 | ## 3. Recommended bootstrap SQL | ||
| 75 | |||
| 76 | ### 3.1 Register models | ||
| 77 | |||
| 78 | ```sql | ||
| 79 | insert into model_registry ( | ||
| 80 | model_name, model_family, model_version, model_source, model_uri, | ||
| 81 | license_name, input_modality, input_sample_rate, input_channel_mode, | ||
| 82 | default_window_sec, default_hop_sec, output_embedding_dim, | ||
| 83 | pooling_supported, layer_selection_supported, is_trainable | ||
| 84 | ) values | ||
| 85 | ('chromaprint', 'fingerprint', 'v1', 'local', null, | ||
| 86 | null, 'audio', 16000, 'mono', | ||
| 87 | 5.0, 2.5, null, | ||
| 88 | array['none'], false, false), | ||
| 89 | ('mert', 'music_ssl', 'v1-95m', 'github', 'https://github.com/yizhilll/MERT', | ||
| 90 | null, 'audio', 24000, 'mono', | ||
| 91 | 5.0, 2.5, 768, | ||
| 92 | array['mean','cls'], true, false), | ||
| 93 | ('muq', 'music_ssl', 'large-msd-iter', 'github', 'https://github.com/tencent-ailab/MuQ', | ||
| 94 | null, 'audio', 24000, 'mono', | ||
| 95 | 5.0, 2.5, 768, | ||
| 96 | array['mean','cls'], true, false), | ||
| 97 | ('ecapa', 'speech_derived', 'acr-baseline-v1', 'local', null, | ||
| 98 | null, 'audio', 16000, 'mono', | ||
| 99 | 5.0, 2.5, 192, | ||
| 100 | array['mean'], false, true); | ||
| 101 | ``` | ||
| 102 | |||
| 103 | --- | ||
| 104 | |||
| 105 | ### 3.2 Register feature sets | ||
| 106 | |||
| 107 | ```sql | ||
| 108 | insert into feature_set_registry ( | ||
| 109 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 110 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 111 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 112 | ) | ||
| 113 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 114 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 115 | true, 'cosine', 'none', 'v1' | ||
| 116 | from model_registry | ||
| 117 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 118 | |||
| 119 | insert into feature_set_registry ( | ||
| 120 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 121 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 122 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 123 | ) | ||
| 124 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 125 | 10.0, 5.0, 768, 'mean', 'final', | ||
| 126 | true, 'cosine', 'none', 'v1' | ||
| 127 | from model_registry | ||
| 128 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 129 | |||
| 130 | insert into feature_set_registry ( | ||
| 131 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 132 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 133 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 134 | ) | ||
| 135 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 136 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 137 | true, 'cosine', 'none', 'v1' | ||
| 138 | from model_registry | ||
| 139 | where model_name = 'muq' and model_version = 'large-msd-iter'; | ||
| 140 | ``` | ||
| 141 | |||
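The feature set statements above resolve `model_id` with `insert ... select` instead of hard-coding an ID. The sketch below replays that pattern against an in-memory SQLite database so it can be run without PostgreSQL. The table definitions are a simplified, hypothetical subset of the real DDL, and `insert or ignore` is SQLite syntax (in PostgreSQL the comparable clause is `on conflict do nothing`); the unique constraints are assumptions added here to make re-runs idempotent.

```python
import sqlite3

# Simplified, hypothetical subset of the two registry tables (SQLite, not the real DDL).
SCHEMA = """
create table model_registry (
    model_id integer primary key,
    model_name text not null,
    model_version text not null,
    output_embedding_dim integer,
    unique (model_name, model_version)
);
create table feature_set_registry (
    feature_set_id integer primary key,
    model_id integer not null references model_registry (model_id),
    feature_name text not null,
    window_sec real not null,
    hop_sec real not null,
    embedding_dim integer not null,
    pooling_strategy text not null,
    unique (model_id, feature_name, window_sec, hop_sec, pooling_strategy)
);
"""


def open_registry() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_model(conn: sqlite3.Connection, name: str, version: str, dim: int) -> None:
    conn.execute(
        "insert or ignore into model_registry "
        "(model_name, model_version, output_embedding_dim) values (?, ?, ?)",
        (name, version, dim),
    )


def register_feature_set(conn, name, version, window_sec, hop_sec, pooling="mean") -> None:
    # The select yields zero rows when the model is not registered, so nothing is inserted.
    conn.execute(
        """
        insert or ignore into feature_set_registry
            (model_id, feature_name, window_sec, hop_sec, embedding_dim, pooling_strategy)
        select model_id, 'semantic_embedding', ?, ?, output_embedding_dim, ?
        from model_registry
        where model_name = ? and model_version = ?
        """,
        (window_sec, hop_sec, pooling, name, version),
    )


def count_feature_sets(conn: sqlite3.Connection) -> int:
    return conn.execute("select count(*) from feature_set_registry").fetchone()[0]
```

One consequence worth noting: a typo in `model_name` or `model_version` makes the select return no rows, so the insert silently does nothing. A bootstrap script may want to check the row count after each statement.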
| 142 | --- | ||
| 143 | |||
| 144 | ### 3.3 Register a reference set | ||
| 145 | |||
| 146 | ```sql | ||
| 147 | insert into reference_set_registry ( | ||
| 148 | set_name, description, encoder_scope, status | ||
| 149 | ) values ( | ||
| 150 | 'phase1_hot_reference_v1', | ||
| 151 | 'Phase-1 primary reference set; contains only the recordings currently serving as online hot references', | ||
| 152 | 'mert-v1-95m / muq-large-msd-iter', | ||
| 153 | 'active' | ||
| 154 | ); | ||
| 155 | ``` | ||
| 156 | |||
| 157 | --- | ||
| 158 | |||
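| 159 | ## 4. Operating principles for reference sets | ||
| 160 | |||
| 161 | ### Current recommendations | ||
| 162 | - Allow only one primary `active` hot reference set at any point in time | ||
| 163 | - When a new encoder or a new aggregation strategy goes live, create a new set instead of overwriting the old one | ||
| 164 | - During A/B or shadow runs, multiple sets may coexist, but only one carries the primary online marker | ||
| 165 | |||
| 166 | ### Why | ||
| 167 | This makes it possible to support: | ||
| 168 | - online rollback | ||
| 169 | - encoder upgrades | ||
| 170 | - hot index switching | ||
| 171 | - offline replay | ||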
| 172 | |||
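The "one primary active set, never overwrite" rule can be expressed as a small state machine. The following is a sketch only: the source shows just the `active` status, so the `shadow` and `retired` values, the `ReferenceSetRegistry` class, and its methods are all illustrative assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReferenceSetRegistry:
    """In-memory stand-in for reference_set_registry (hypothetical statuses)."""
    status: Dict[str, str] = field(default_factory=dict)

    def active(self) -> Optional[str]:
        """Return the name of the single primary active set, if any."""
        for name, state in self.status.items():
            if state == "active":
                return name
        return None

    def register(self, set_name: str, status: str = "shadow") -> None:
        """Create a new set; existing sets are never overwritten."""
        if set_name in self.status:
            raise ValueError(f"{set_name} already exists; create a new set instead")
        if status == "active" and self.active() is not None:
            raise ValueError("only one primary active set is allowed")
        self.status[set_name] = status

    def promote(self, set_name: str) -> Optional[str]:
        """Make set_name the primary set and return the previous one for rollback."""
        if set_name not in self.status:
            raise KeyError(set_name)
        previous = self.active()
        if previous is not None and previous != set_name:
            self.status[previous] = "retired"  # kept, not deleted, so rollback stays possible
        self.status[set_name] = "active"
        return previous
```

Rollback is then just `promote` on the previous set name, which is why the old set is retired rather than removed.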
| 173 | --- | ||
| 174 | |||
| 175 | ## 5. Dimension extension rules | ||
| 176 | |||
| 177 | The current DDL only demonstrates: | ||
| 178 | - `audio_embedding_vector_192` | ||
| 179 | - `audio_embedding_vector_768` | ||
| 180 | |||
| 181 | If a later encoder introduces a new dimension, such as `1024`: | ||
| 182 | |||
| 183 | 1. Add `audio_embedding_vector_1024` | ||
| 184 | 2. Set `embedding_dim=1024` on the corresponding feature_set | ||
| 185 | 3. Build a separate index | ||
| 186 | 4. Switch over through `retrieval_index_registry` | ||
| 187 | |||
| 188 | ### Principles | ||
| 189 | - A dimension change is a feature set upgrade, not a master data model upgrade | ||
| 190 | - The master data layer should not need table changes because of an encoder upgrade | ||
| 191 | |||
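Writers need a way to pick the physical vector table from a feature set's `embedding_dim`. A minimal sketch of that routing follows, assuming the `audio_embedding_vector_<dim>` naming shown above holds for every dimension; the `vector_table_for_dim` helper and the `KNOWN_VECTOR_DIMS` set are hypothetical.

```python
from typing import FrozenSet

# Dimensions that currently have a physical table, per the DDL described above.
KNOWN_VECTOR_DIMS: FrozenSet[int] = frozenset({192, 768})


def vector_table_for_dim(embedding_dim: int,
                         known_dims: FrozenSet[int] = KNOWN_VECTOR_DIMS) -> str:
    """Map a feature set's embedding_dim to its physical vector table name."""
    if embedding_dim not in known_dims:
        # Fail loudly instead of writing into a table that does not exist yet.
        raise LookupError(
            f"no vector table for dim {embedding_dim}; "
            f"add audio_embedding_vector_{embedding_dim} and its index first"
        )
    return f"audio_embedding_vector_{embedding_dim}"
```

Adding a 1024-dimensional encoder then means extending the known set after the new table and index exist, with no change to the entity tables.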
| 192 | --- | ||
| 193 | |||
| 194 | ## 6. Recommended order | ||
| 195 | |||
| 196 | ```mermaid | ||
| 197 | flowchart TD | ||
| 198 | A[Register model_registry] --> B[Register feature_set_registry] | ||
| 199 | B --> C[Register reference_set_registry] | ||
| 200 | C --> D[Extract embeddings/fingerprint] | ||
| 201 | D --> E[Write audio_embedding/audio_fingerprint] | ||
| 202 | E --> F[Build retrieval_index_registry] | ||
| 203 | ``` | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
| 207 | ## 7. Final recommendation | ||
| 208 | |||
| 209 | If you are starting Phase-1 initialization today, register at least the following first: | ||
| 210 | |||
| 211 | 1. `chromaprint v1` | ||
| 212 | 2. `mert v1-95m` | ||
| 213 | 3. `muq large-msd-iter` | ||
| 214 | 4. `mert 5s/2.5s mean` | ||
| 215 | 5. `muq 5s/2.5s mean` | ||
| 216 | 6. `phase1_hot_reference_v1` | ||
| 217 | |||
| 218 | With these in place, the data, model, and index tracks all have a stable entry point. |
docs/phase1-implementation-checklist.md
0 → 100644
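| 1 | # Phase-1 Encoder-only Implementation Checklist | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: break the Phase-1 plan ("no fine-tuning yet, start with open-source encoders") into executable steps so that the data, retrieval, platform, and operations teams can proceed in parallel. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The Phase-1 delivery goal is not to "prove that some new model is absolutely the best", but to: | ||
| 9 | |||
| 10 | 1. Get the **PostgreSQL master data model** firmly in place | ||
| 11 | 2. Get **reference assets / windows / feature_set** working end to end | ||
| 12 | 3. Establish an encoder-only baseline with **MERT + MuQ** | ||
| 13 | 4. Get the aggregation chain for the **fingerprint lane + semantic lane** running first | ||
| 14 | 5. Leave interfaces ready for the Phase-2 version/cover lane | ||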
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Delivery scope | ||
| 19 | |||
| 20 | ### Must be completed in this phase | ||
| 21 | - Load `canonical_song / work / recording / recording_asset / audio_window` into the database | ||
| 22 | - Initialize `model_registry / feature_set_registry` | ||
| 23 | - MERT/MuQ encoder-only feature extraction | ||
| 24 | - Build the hot reference set | ||
| 25 | - Build the semantic index | ||
| 26 | - A basic closed loop of query -> candidate -> canonical_song | ||
| 27 | |||
| 28 | ### Not required in this phase | ||
| 29 | - Foundation model fine-tuning | ||
| 30 | - Cover-specific training | ||
| 31 | - A dedicated melody tower for humming | ||
| 32 | - Bringing all cold data into the hot index | ||
| 33 | |||
| 34 | --- | ||
| 35 | |||
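| 36 | ## 2. Roles and responsibilities | ||
| 37 | |||
| 38 | | Role | Main deliverables | | ||
| 39 | |---|---| | ||
| 40 | | Data engineering | Asset cleaning, deduplication, entity mapping, windowing manifest | | ||
| 41 | | Backend/DBA | PostgreSQL DDL, indexes, write path, validation constraints | | ||
| 42 | | Retrieval engineering | fingerprint lane, semantic lane, aggregation logic | | ||
| 43 | | Model engineering | MERT/MuQ integration, feature_set design, feature extraction scripts | | ||
| 44 | | Platform/operations | Offline job orchestration, object storage, hot/cold index governance | | ||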
| 45 | |||
| 46 | --- | ||
| 47 | |||
| 48 | ## 3. Staged checklist | ||
| 49 | |||
| 50 | ## Stage 1: Load master data | ||
| 51 | |||
| 52 | ### Goal | ||
| 53 | Stabilize the business fact layer, independent of any specific encoder. | ||
| 54 | |||
| 55 | ### Checklist | ||
| 56 | - [ ] Create the database by running `acr-engine/sql/acr_pg_schema_v2.sql` | ||
| 57 | - [ ] Initialize `canonical_song` | ||
| 58 | - [ ] Initialize `work` | ||
| 59 | - [ ] Initialize `recording` | ||
| 60 | - [ ] Initialize `recording_asset` | ||
| 61 | - [ ] Verify that the lineage trigger works | ||
| 62 | - [ ] Run an insert smoke test with a small batch of reference data | ||
| 63 | |||
| 64 | ### Deliverables | ||
| 65 | - PostgreSQL schema v2 | ||
| 66 | - Initial entity data | ||
| 67 | - Reusable data import scripts | ||
| 68 | |||
| 69 | --- | ||
| 70 | |||
| 71 | ## Stage 2: Reference assets and windowing | ||
| 72 | |||
| 73 | ### Goal | ||
| 74 | Build the set of references that can be retrieved against. | ||
| 75 | |||
| 76 | ### Checklist | ||
| 77 | - [ ] Select recordings with `is_reference=true` | ||
| 78 | - [ ] Create the `reference_set_registry` entry | ||
| 79 | - [ ] Backfill `reference_set_member` | ||
| 80 | - [ ] Standardize audio paths | ||
| 81 | - [ ] Generate `audio_window` | ||
| 82 | - [ ] Mark `active_for_index` | ||
| 83 | |||
| 84 | ### Recommended rules | ||
| 85 | - Start with primary reference versions only | ||
| 86 | - Default to `5s / 2.5s hop` first | ||
| 87 | - Intros/outros can be kept for now; apply quality pruning later | ||
| 88 | |||
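The default `5s / 2.5s hop` rule above determines which `audio_window` rows get generated for each asset. The sketch below computes window boundaries for a given duration. It assumes a trailing partial window is dropped rather than padded, which the checklist does not specify; the `sliding_windows` function is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(duration_sec: float,
                    window_sec: float = 5.0,
                    hop_sec: float = 2.5) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for full windows that fit inside the audio.

    Assumption: a final window that would run past the end is dropped.
    """
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    windows = []
    index = 0
    while True:
        start = index * hop_sec  # multiply instead of accumulating to avoid float drift
        end = start + window_sec
        if end > duration_sec + 1e-9:
            break
        windows.append((start, end))
        index += 1
    return windows
```

With a 50% overlap, each second of audio (away from the edges) is covered by two windows, so the window count is roughly `duration / hop`, which is useful when estimating embedding storage.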
| 89 | --- | ||
| 90 | |||
| 91 | ## Stage 3: Model and feature_set initialization | ||
| 92 | |||
| 93 | ### Goal | ||
| 94 | Stabilize model registration and feature version definitions. | ||
| 95 | |||
| 96 | ### Checklist | ||
| 97 | - [ ] Register `chromaprint` | ||
| 98 | - [ ] Register `mert v1-95m` | ||
| 99 | - [ ] Register `muq large-msd-iter` | ||
| 100 | - [ ] Register `mert 5s/2.5s mean pool` | ||
| 101 | - [ ] Register `mert 10s/5s mean pool` | ||
| 102 | - [ ] Register `muq 5s/2.5s mean pool` | ||
| 103 | - [ ] Define the metric / quantization / dim of each feature_set | ||
| 104 | |||
| 105 | ### Deliverables | ||
| 106 | - `model_registry` initialization data | ||
| 107 | - `feature_set_registry` initialization data | ||
| 108 | - Feature set naming convention | ||
| 109 | |||
| 110 | --- | ||
| 111 | |||
| 112 | ## Stage 4: Encoder-only feature extraction | ||
| 113 | |||
| 114 | ### Goal | ||
| 115 | Without any training yet, turn the reference set directly into retrievable embeddings. | ||
| 116 | |||
| 117 | ### Checklist | ||
| 118 | - [ ] Extract MERT window embeddings | ||
| 119 | - [ ] Extract MuQ window embeddings | ||
| 120 | - [ ] Write to `audio_embedding` | ||
| 121 | - [ ] Write hot data to `audio_embedding_vector_768` or the corresponding physical table | ||
| 122 | - [ ] Store cold data in object storage/parquet | ||
| 123 | - [ ] Backfill `is_indexed` | ||
| 124 | |||
| 125 | ### Verification | ||
| 126 | - [ ] Spot-check random samples to confirm the `window -> embedding -> feature_set` back-link works | ||
| 127 | - [ ] Check vector norm / missing rate / duplicate rate | ||
| 128 | |||
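The norm, missing-rate, and duplicate-rate checks above can be computed in one pass over a sample of extracted vectors. The sketch below is one possible reading of those checks: it assumes vectors are expected to be L2-normalized, that a missing embedding is represented as `None`, and that duplicates are detected after rounding to six decimals. The function name and thresholds are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def embedding_health(vectors: Sequence[Optional[Sequence[float]]],
                     norm_tol: float = 1e-3) -> Dict[str, float]:
    """Summarize missing, off-norm, and duplicate rates for a batch of embeddings."""
    total = len(vectors)
    if total == 0:
        raise ValueError("no vectors to check")
    present = [v for v in vectors if v is not None]

    # Vectors whose L2 norm deviates from 1.0 by more than the tolerance.
    off_norm = sum(
        1 for v in present
        if abs(math.sqrt(sum(x * x for x in v)) - 1.0) > norm_tol
    )

    # Count every repeat of an already-seen vector (rounded to absorb float noise).
    seen = set()
    duplicates = 0
    for v in present:
        key = tuple(round(x, 6) for x in v)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    n = len(present)
    return {
        "missing_rate": (total - n) / total,
        "off_norm_rate": off_norm / n if n else 0.0,
        "duplicate_rate": duplicates / n if n else 0.0,
    }
```

A high duplicate rate often points at silence or padded windows being encoded, so it is worth inspecting before the vectors reach the index.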
| 129 | --- | ||
| 130 | |||
| 131 | ## Stage 5: Indexing and recall | ||
| 132 | |||
| 133 | ### Goal | ||
| 134 | Get dual-lane recall working across the semantic lane and the exact lane. | ||
| 135 | |||
| 136 | ### Checklist | ||
| 137 | - [ ] Build the fingerprint index | ||
| 138 | - [ ] Build the semantic index | ||
| 139 | - [ ] Backfill `retrieval_index_registry` | ||
| 140 | - [ ] Encode queries | ||
| 141 | - [ ] Return `retrieval_candidate` | ||
| 142 | - [ ] Aggregate to `recording / work / canonical_song` | ||
| 143 | |||
| 144 | ### Suggested first-version aggregation | ||
| 145 | - max score | ||
| 146 | - top-k average | ||
| 147 | - hit windows count | ||
| 148 | - exact lane / semantic lane agreement bonus | ||
| 149 | |||
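The four aggregation signals listed above can be computed per `canonical_song` from window-level hits. The sketch below shows one way to do it. It assumes scores from both lanes are already scaled to a comparable range, and the way the signals are combined into `final_score` (max score plus a fixed agreement bonus) is purely illustrative; `WindowHit`, `aggregate_hits`, and the default parameters are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class WindowHit:
    """One retrieved window, already mapped to its canonical song (hypothetical shape)."""
    canonical_song_id: str
    lane: str  # "exact" or "semantic"
    score: float


def aggregate_hits(hits: Iterable[WindowHit],
                   top_k: int = 3,
                   agreement_bonus: float = 0.1) -> List[Dict[str, object]]:
    """Group window hits by song and compute the first-version aggregation signals."""
    by_song = defaultdict(list)
    for hit in hits:
        by_song[hit.canonical_song_id].append(hit)

    results = []
    for song_id, song_hits in by_song.items():
        scores = sorted((h.score for h in song_hits), reverse=True)
        top = scores[:top_k]
        lanes = {h.lane for h in song_hits}
        # Bonus only when both lanes independently point at the same song.
        bonus = agreement_bonus if {"exact", "semantic"} <= lanes else 0.0
        results.append({
            "canonical_song_id": song_id,
            "max_score": scores[0],
            "topk_avg": sum(top) / len(top),
            "hit_windows": len(song_hits),
            "agreement_bonus": bonus,
            "final_score": scores[0] + bonus,  # illustrative combination only
        })
    results.sort(key=lambda r: r["final_score"], reverse=True)
    return results
```

Keeping the individual signals in the output, rather than only the combined score, makes it easier to tune the combination later against the Stage 6 buckets.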
| 150 | --- | ||
| 151 | |||
| 152 | ## Stage 6: Baseline evaluation and release gates | ||
| 153 | |||
| 154 | ### Goal | ||
| 155 | First prove that the Phase-1 structure is usable. | ||
| 156 | |||
| 157 | ### Checklist | ||
| 158 | - [ ] exact query bucket | ||
| 159 | - [ ] noisy/BGM bucket | ||
| 160 | - [ ] version-like bucket (even though the cover lane is not trained yet) | ||
| 161 | - [ ] Top1/Top3/MRR | ||
| 162 | - [ ] canonical_song recall | ||
| 163 | - [ ] work-level recall | ||
| 164 | - [ ] Record the reference set version | ||
| 165 | |||
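Top1, Top3, and MRR from the checklist above are straightforward to compute once each query has a ranked candidate list and a known correct ID. The sketch below uses the standard definitions; the `evaluate_ranking` function and its input shape are hypothetical, and the same function can be run per bucket and at either `canonical_song` or `work` level by changing which IDs are passed in.

```python
from typing import Dict, Sequence


def evaluate_ranking(ranked: Sequence[Sequence[str]],
                     truth: Sequence[str],
                     ks: Sequence[int] = (1, 3)) -> Dict[str, float]:
    """Compute top-k accuracy and mean reciprocal rank over a set of queries.

    ranked[i] is the ordered candidate ID list for query i; truth[i] is the correct ID.
    """
    if len(ranked) != len(truth) or not truth:
        raise ValueError("ranked and truth must be non-empty and the same length")

    hits = {k: 0 for k in ks}
    reciprocal_rank_sum = 0.0
    for candidates, expected in zip(ranked, truth):
        if expected not in candidates:
            continue  # a miss contributes 0 to every metric
        rank = list(candidates).index(expected) + 1
        reciprocal_rank_sum += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1

    n = len(truth)
    metrics = {f"top{k}": hits[k] / n for k in ks}
    metrics["mrr"] = reciprocal_rank_sum / n
    return metrics
```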
| 166 | --- | ||
| 167 | |||
| 168 | ## 4. Recommended sequence | ||
| 169 | |||
| 170 | ```mermaid | ||
| 171 | flowchart TD | ||
| 172 | A[Load schema v2] --> B[Import entities] | ||
| 173 | B --> C[Initialize reference set] | ||
| 174 | C --> D[Generate audio_window] | ||
| 175 | D --> E[Initialize model/feature_set] | ||
| 176 | E --> F[MERT/MuQ feature extraction] | ||
| 177 | F --> G[semantic index] | ||
| 178 | C --> H[fingerprint index] | ||
| 179 | G --> I[candidate aggregation] | ||
| 180 | H --> I | ||
| 181 | I --> J[Phase-1 benchmark] | ||
| 182 | ``` | ||
| 183 | |||
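The flowchart above is a dependency graph, so it also shows which steps can run in parallel. As a rough sketch, the edges can be transcribed into Python's standard `graphlib` module (Python 3.9+) to list the steps that become ready together. The node names below are hypothetical labels for the flowchart boxes, not identifiers from the repository.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# step -> prerequisites, transcribed from the flowchart edges
PREREQS: Dict[str, Set[str]] = {
    "entity_import": {"schema_v2"},
    "reference_set_init": {"entity_import"},
    "audio_window": {"reference_set_init"},
    "model_feature_set_init": {"audio_window"},
    "encoder_extraction": {"model_feature_set_init"},
    "semantic_index": {"encoder_extraction"},
    "fingerprint_index": {"reference_set_init"},
    "candidate_aggregation": {"semantic_index", "fingerprint_index"},
    "phase1_benchmark": {"candidate_aggregation"},
}


def parallel_batches(prereqs: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into batches whose prerequisites are all complete."""
    sorter = TopologicalSorter(prereqs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Under this transcription, the fingerprint index becomes ready at the same time as window generation, which matches the flowchart's separate branch from the reference set to the exact lane.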
| 184 | --- | ||
| 185 | |||
| 186 | ## 5. First-version acceptance criteria | ||
| 187 | |||
| 188 | ### Data layer | ||
| 189 | - Can reliably insert `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 190 | - Can support at least one `reference_set` | ||
| 191 | |||
| 192 | ### Model/feature layer | ||
| 193 | - Multiple `model_registry / feature_set_registry` entries can coexist | ||
| 194 | - MERT/MuQ encoder-only feature extraction runs end to end | ||
| 195 | |||
| 196 | ### Retrieval layer | ||
| 197 | - Can return candidates from both the fingerprint lane and the semantic lane | ||
| 198 | - Can aggregate results into an output `canonical_song_id` | ||
| 199 | |||
| 200 | ### Operations layer | ||
| 201 | - Can rebuild the reference set | ||
| 202 | - Can rebuild the semantic index | ||
| 203 | - Can record the feature_set and index version | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
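| 207 | ## 6. Common pitfalls in this phase | ||
| 208 | |||
| 209 | 1. Locking the embedding storage design to one model's dimension up front | ||
| 210 | 2. Keeping only song_id and dropping work/recording | ||
| 211 | 3. Not versioning the reference set | ||
| 212 | 4. Query results that cannot be traced back to a specific evidence window | ||
| 213 | 5. Removing the exact lane too early | ||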
| 214 | |||
| 215 | --- | ||
| 216 | |||
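| 217 | ## 7. Current recommendation | ||
| 218 | |||
| 219 | If you need to schedule work right away, follow this priority order: | ||
| 220 | |||
| 221 | 1. Schema v2 and master data import | ||
| 222 | 2. reference set + audio_window | ||
| 223 | 3. MERT/MuQ feature_set initialization | ||
| 224 | 4. Encoder-only feature extraction | ||
| 225 | 5. Dual-lane recall and aggregation | ||
| 226 | 6. Benchmark and release gates |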