Why the schema docs need one song-centric story, not parallel histories
Constraint: New teammates must understand where slice/model/feature data lands without reading deprecated v2/planner-worker material
Rejected: Keep old docs with disclaimers | still leaves two competing mental models in the default docs path
Confidence: high
Scope-risk: narrow
Directive: Keep future docs anchored on the 4-table song-centric path unless the physical schema default truly changes
Tested: markdown link check on /workspace/docs; staged diff review; verified referenced wrapper script is present
Not-tested: No database or pipeline rerun was needed for this docs-only consolidation
Showing 13 changed files with 608 additions and 3713 deletions
| 1 | #!/usr/bin/env bash | ||
| 2 | set -euo pipefail | ||
| 3 | |||
| 4 | ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" | ||
| 5 | PYTHON_BIN="${PYTHON_BIN:-/usr/local/miniconda3/bin/python}" | ||
| 6 | DSN="${1:-${PG_DSN:-}}" | ||
| 7 | SCHEMA="${2:-${PG_SCHEMA:-acr_songcentric_test}}" | ||
| 8 | INPUT_ROOT="${3:-$ROOT_DIR/data/songcentric_builder_smoke}" | ||
| 9 | OUTPUT_DIR="${4:-$ROOT_DIR/data/pgvector_eval/music20}" | ||
| 10 | |||
| 11 | if [[ -z "$DSN" ]]; then | ||
| 12 | echo "usage: $0 <postgres-dsn> [schema] [input-root] [output-dir]" >&2 | ||
| 13 | echo "or set PG_DSN before running this script" >&2 | ||
| 14 | exit 1 | ||
| 15 | fi | ||
| 16 | |||
| 17 | cd "$ROOT_DIR/.." | ||
| 18 | "$PYTHON_BIN" acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | ||
| 19 | --dsn "$DSN" \ | ||
| 20 | --schema "$SCHEMA" \ | ||
| 21 | --input-root "${INPUT_ROOT#$ROOT_DIR/..\/}" \ | ||
| 22 | --output-dir "${OUTPUT_DIR#$ROOT_DIR/..\/}" |
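The wrapper above resolves each setting from a positional argument first, then an environment variable, then a built-in default. A minimal Python sketch of that precedence follows; the function names `pick` and `resolve_wrapper_args` are hypothetical and exist only for illustration, and the sketch raises `ValueError` where the shell script prints usage and exits.

```python
from typing import Mapping, Optional, Sequence


def pick(positional: Optional[str], env_value: Optional[str], default: str = "") -> str:
    """Mirror bash `${1:-${VAR:-default}}`: unset or empty values fall through."""
    return positional or env_value or default


def resolve_wrapper_args(argv: Sequence[str], env: Mapping[str, str], root_dir: str) -> dict:
    """Resolve the four wrapper settings with the same precedence as the script."""

    def arg(index: int) -> Optional[str]:
        return argv[index] if index < len(argv) else None

    dsn = pick(arg(0), env.get("PG_DSN"))
    if not dsn:
        # The shell script prints a usage message and exits 1 at this point.
        raise ValueError("a DSN is required: pass it as argument 1 or set PG_DSN")
    return {
        "dsn": dsn,
        "schema": pick(arg(1), env.get("PG_SCHEMA"), "acr_songcentric_test"),
        "input_root": pick(arg(2), None, f"{root_dir}/data/songcentric_builder_smoke"),
        "output_dir": pick(arg(3), None, f"{root_dir}/data/pgvector_eval/music20"),
    }
```

Because empty strings fall through just like unset values, passing `''` as the DSN argument still lets `PG_DSN` take effect, which matches the `:-` form used in the script.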
| 1 | # Changelog | ||
| 2 | |||
| 1 | ## 2026-06-04 | 3 | ## 2026-06-04 |
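| 4 | - Consolidated `docs/` onto the current song-centric mainline, keeping only the six core documents `README / start-here / session-handoff / postgresql-data-model / postgres_db_schema_samples / CHANGELOG` and deleting the old v2 / planner-worker / registry extension docs, so that new teammates do not wander into designs that are no longer the mainline. | ||
| 5 | - Rewrote `docs/postgresql-data-model.md` to spell out where `stored slice data + model + feature` land: a `window` lands in `audio_object`, model identity lands in `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` values also land in `feature_fact`. | ||
| 6 | - Rewrote `docs/postgres_db_schema_samples.md` and the entry docs, adding a flowchart of the current 4-table main chain, typical SQL samples, the query trace-back path, and the write order, and aligned all docs on `media_entity -> audio_object -> feature_fact -> set_membership`. | ||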
| 7 | |||
| 8 | ## 2026-06-04 | ||
| 9 | |||
| 10 | - Added `acr-engine/scripts/start_songcentric_shortest_path.sh`, narrowing the current default mainline to a single shell entry point that can be copied and run directly; re-verified with fresh runner results. | ||
| 2 | 11 | ||
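| 3 | - Promoted `run_songcentric_directory_pipeline_live.py` to the current default mainline entry point and synced the fresh runner results to `docs/README.md`, `docs/start-here.md`, and `docs/session-handoff.md`, lowering the cost of resuming in the next session. | 12 | - Promoted `run_songcentric_directory_pipeline_live.py` to the current default mainline entry point and synced the fresh runner results to `docs/README.md`, `docs/start-here.md`, and `docs/session-handoff.md`, lowering the cost of resuming in the next session. |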
| 4 | 13 | ... | ... |
| 1 | # ACR Docs Overview | 1 | # ACR Docs Overview |
| 2 | 2 | ||
| 3 | > Only documents directly related to the **song-centric + fusion-first** ACR design are kept. | 3 | > docs currently keeps only documents directly related to the **song-centric + 4-table fused schema**. |
| 4 | 4 | ||
| 5 | --- | 5 | --- |
| 6 | 6 | ||
| 7 | ## 0. What new teammates should do first | 7 | ## 1. What to read first |
| 8 | 8 |
| 9 | To continue on the song-centric mainline, run this first: | 9 | Onboarding order for new teammates: |
| 10 | 10 | ||
| 11 | ```bash | 11 | 1. [start-here.md](./start-here.md) |
| 12 | cd /workspace | 12 | 2. [session-handoff.md](./session-handoff.md) |
| 13 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | 13 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 14 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | 14 | 4. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) |
| 15 | --schema acr_songcentric_test \ | 15 | 5. [CHANGELOG.md](./CHANGELOG.md) |
| 16 | --input-root acr-engine/data/songcentric_builder_smoke \ | ||
| 17 | --output-dir acr-engine/data/pgvector_eval/music20 | ||
| 18 | ``` | ||
| 19 | |||
| 20 | To regress against the old planner/worker contract, then run: | ||
| 21 | |||
| 22 | ```bash | ||
| 23 | cd /workspace/acr-engine | ||
| 24 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 25 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 26 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 27 | ``` | ||
| 28 | |||
| 29 | Alternatively, use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | ||
| 30 | |||
| 31 | Current fresh evidence: | ||
| 32 | - `executed_count = 4` | ||
| 33 | - `all_passed = true` | ||
| 34 | 16 | ||
| 35 | --- | 17 | --- |
| 36 | 18 | ||
| 37 | ## 1. Current default design stance | 19 | ## 2. Current default design stance |
| 38 | 20 |
| 39 | Phase-1 is currently understood as follows by default: | 21 | Logical semantics: |
| 40 | 22 | ||
| 41 | ```text | 23 | ```text |
| 42 | song -> asset -> window -> fingerprint / embedding | 24 | song -> asset -> window -> fingerprint / embedding |
| 43 | ``` | 25 | ``` |
| 44 | 26 | ||
| 45 | Corresponding fusion-first physical tables: | 27 | Physical tables: |
| 46 | 28 | ||
| 47 | ```text | 29 | ```text |
| 48 | media_entity -> audio_object -> feature_fact -> set_membership | 30 | media_entity -> audio_object -> feature_fact -> set_membership |
| 49 | ``` | 31 | ``` |
| 50 | 32 | ||
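| 33 | Core goals: | ||
| 34 | - Ultimately return a stable `song_id` | ||
| 35 | - Allow multiple audio files under the same `song` | ||
| 36 | - `window` is the smallest unit for slicing, evidence, and recall | ||
| 37 | - `feature_fact` carries both the exact lane and the semantic lane | ||
| 38 | - Phase-1 reuses open-source encoders directly, with no training or fine-tuning first | ||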
| 39 | |||
| 51 | --- | 40 | --- |
| 52 | 41 | ||
| 53 | ## 2. Required reading | 42 | ## 3. One-command verification of the main chain |
| 54 | 43 | ||
| 55 | 1. [start-here.md](./start-here.md) | 44 | ```bash |
| 56 | 2. [session-handoff.md](./session-handoff.md) | 45 | cd /workspace |
| 57 | 3. [acr-architecture.md](./acr-architecture.md) | 46 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ |
| 58 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | 47 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ |
| 59 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | 48 | --schema acr_songcentric_test \ |
| 49 | --input-root acr-engine/data/songcentric_builder_smoke \ | ||
| 50 | --output-dir acr-engine/data/pgvector_eval/music20 | ||
| 51 | ``` | ||
| 60 | 52 | ||
| 61 | --- | 53 | Wrapper script: |
| 62 | 54 | ||
| 63 | ## 3. Implementation-related documents | 55 | ```bash |
| 56 | acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2' | ||
| 57 | ``` | ||
| 64 | 58 | ||
| 65 | - [postgresql-data-model.md](./postgresql-data-model.md): the only current default data model; includes the slice/model/feature table mapping and flowcharts | 59 | Current fresh evidence: |
| 66 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples | 60 | - `song_count = 2` |
| 67 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization | 61 | - `asset_count = 2` |
| 68 | - [phase1-worker-contract.md](./phase1-worker-contract.md): worker, job, and failure-semantics contract | 62 | - `window_count = 5` |
| 69 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 implementation checklist | 63 | - `matcher_fingerprint_count = 5` |
| 70 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy | 64 | - `fallback_fingerprint_count = 0` |
| 71 | - [sota-evolution-guide.md](./sota-evolution-guide.md): current SOTA evolution mainline | 65 | - `semantic_runtime_available = false` |
| 66 | - `import_counts.feature_fact = 24` | ||
| 72 | 67 | ||
| 73 | --- | 68 | --- |
| 74 | 69 | ||
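| 75 | ## 4. Current stable conclusions | 70 | ## 4. What each retained document covers |
| 76 | 71 |
| 77 | - The final attribution target currently only needs to return a stable `song_id` | 72 | - [start-here.md](./start-here.md): 10-minute onboarding entry for new teammates |
| 78 | - Multiple audio files are allowed under the same `song` | 73 | - [session-handoff.md](./session-handoff.md): where to pick up at the next start |
| 79 | - `recording/version` is not currently a required return object | 74 | - [postgresql-data-model.md](./postgresql-data-model.md): table design, field semantics, flowcharts, design trade-offs |
| 80 | - `window` is still kept because it is the smallest unit for evidence / offset / retrieval | 75 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): DDL, sample data, typical SQL, import and query paths |
| 81 | - `feature_fact` carries both `fingerprint` and `embedding` | 76 | - [CHANGELOG.md](./CHANGELOG.md): change history |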
| 82 | 77 | ||
| 83 | --- | 78 | --- |
| 84 | 79 | ||
| 85 | ## 5. Docs maintenance commands | 80 | ## 5. Docs maintenance commands |
| 86 | 81 | ||
| 87 | ```bash | 82 | ```bash |
| 88 | /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs | 83 | /usr/local/miniconda3/bin/python /workspace/scripts/check_markdown_links.py --root /workspace/docs |
| 89 | ``` | 84 | ``` |
| 90 | |||
| 91 | By default, historical archive documents such as `CHANGELOG.md` are skipped. | ... | ... |
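To make the 4-table chain in the README concrete, here is a minimal sketch using an in-memory SQLite database. Only the four table names and the `feature_fact` columns `model_name`, `model_version`, and `feature_set_name` come from the docs; every other column, the sample values, and the helper names `build_demo` and `trace_song` are assumptions made for illustration, and the real PostgreSQL DDL may differ.

```python
import sqlite3

# Hypothetical minimal DDL: only the table names and the three feature_fact
# identity columns are taken from the docs; everything else is illustrative.
DDL = """
CREATE TABLE media_entity (
    entity_id INTEGER PRIMARY KEY,
    song_id   TEXT NOT NULL
);
CREATE TABLE audio_object (
    object_id   INTEGER PRIMARY KEY,
    entity_id   INTEGER NOT NULL REFERENCES media_entity(entity_id),
    object_kind TEXT NOT NULL,          -- 'asset' or 'window'
    start_sec   REAL,
    end_sec     REAL
);
CREATE TABLE feature_fact (
    fact_id          INTEGER PRIMARY KEY,
    object_id        INTEGER NOT NULL REFERENCES audio_object(object_id),
    model_name       TEXT NOT NULL,
    model_version    TEXT NOT NULL,
    feature_set_name TEXT NOT NULL,
    payload          TEXT NOT NULL
);
CREATE TABLE set_membership (
    set_name TEXT NOT NULL,
    fact_id  INTEGER NOT NULL REFERENCES feature_fact(fact_id)
);
"""


def build_demo() -> sqlite3.Connection:
    """Write one row per table in the documented order of the chain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute("INSERT INTO media_entity VALUES (1, 'song-001')")
    conn.execute("INSERT INTO audio_object VALUES (10, 1, 'asset', NULL, NULL)")
    conn.execute("INSERT INTO audio_object VALUES (11, 1, 'window', 0.0, 5.0)")
    conn.execute(
        "INSERT INTO feature_fact VALUES "
        "(100, 11, 'chromaprint', 'v1', 'demo_fingerprint', 'AQAA')"
    )
    conn.execute("INSERT INTO set_membership VALUES ('demo_set', 100)")
    return conn


def trace_song(conn: sqlite3.Connection, fact_id: int):
    """Walk feature_fact -> audio_object -> media_entity back to song_id."""
    return conn.execute(
        """
        SELECT m.song_id, o.start_sec, o.end_sec, f.model_name
        FROM feature_fact f
        JOIN audio_object o ON o.object_id = f.object_id
        JOIN media_entity m ON m.entity_id = o.entity_id
        WHERE f.fact_id = ?
        """,
        (fact_id,),
    ).fetchone()
```

Calling `trace_song(build_demo(), 100)` returns the song, the window bounds, and the model that produced the feature, which is the trace-back path a matched feature would follow to a `song_id`.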
docs/acr-architecture.md
deleted
100644 → 0
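| 1 | # ACR System Blueprint / Architecture Blueprint | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: unify the current ACR prototype, the future SOTA evolution path, and the concerns of different roles into one readable system blueprint. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The repository has already validated a working hybrid recognition prototype: | ||
| 9 | |||
| 10 | - `Chromaprint / fingerprint`: fast exact / near-duplicate recall | ||
| 11 | - `ECAPA-style embedding`: the current semantic vector recall baseline | ||
| 12 | - `melody-aware rerank`: reinforcement for weak-melody cases | ||
| 13 | |||
| 14 | But to meet the future target of **copyright protection + 1M audio files / 300K songs**, the system should evolve toward: | ||
| 15 | |||
| 16 | 1. **Stable data specification**: `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 17 | 2. **Replaceable base models**: `model_registry -> feature_set_registry -> embedding/index` | ||
| 18 | 3. **Layered retrieval chain**: exact lane + semantic lane + version/cover lane + aggregation | ||
| 19 | 4. **Separation of serving and operations**: offline library building, online recall, review and normalization, and monitoring and governance each have clear responsibilities | ||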
| 20 | |||
| 21 | --- | ||
| 22 | |||
| 23 | ## 1. Overall system diagram | ||
| 24 | |||
| 25 | ```mermaid | ||
| 26 | flowchart TD | ||
| 27 | A[Audio Sources\nofficial masters / platform audio / crawled audio / UGC / recordings] --> B[Asset Normalization] | ||
| 28 | B --> C[Canonical Data Model\nSong / Work / Recording / Asset / Window] | ||
| 29 | |||
| 30 | C --> D1[Exact Lane\nChromaprint / Neural AFP] | ||
| 31 | C --> D2[Semantic Lane\nFoundation Encoder] | ||
| 32 | C --> D3[Version/Cover Lane\nPhase-2+] | ||
| 33 | |||
| 34 | D1 --> E[Candidate Aggregation] | ||
| 35 | D2 --> E | ||
| 36 | D3 --> E | ||
| 37 | |||
| 38 | E --> F[Canonical Song Decision] | ||
| 39 | F --> G[Service / Review / Audit] | ||
| 40 | ``` | ||
| 41 | |||
| 42 | --- | ||
| 43 | |||
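| 44 | ## 2. Current implementation vs target implementation | ||
| 45 | |||
| 46 | | Dimension | Current implementation | Target implementation | | ||
| 47 | |---|---|---| | ||
| 48 | | Base vector model | ECAPA-style baseline | Primarily foundation encoders such as MERT / MuQ | | ||
| 49 | | Retrieval structure | chromaprint + embedding + melody | exact + semantic + version/cover + rerank | | ||
| 50 | | Primary data key | Centered on `song_id` | Layered `canonical_song / work / recording / asset / window` | | ||
| 51 | | Storage form | Prototype pgvector schema + file artifacts | PostgreSQL master data + replaceable vector/index layer | | ||
| 52 | | Service goal | Validate the closed loop | Copyright protection / attribution decisions / industrial-grade operations | | ||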
| 53 | |||
| 54 | --- | ||
| 55 | |||
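| 56 | ## 2.1 Why it currently looks like there are so many layers | ||
| 57 | |||
| 58 | Because the current blueprint covers three dimensions at once: | ||
| 59 | |||
| 60 | 1. **Business attribution**: `song/work/recording` | ||
| 61 | 2. **Audio entities**: `asset/window` | ||
| 62 | 3. **Retrieval computation**: `feature/index/candidate/decision` | ||
| 63 | |||
| 64 | Putting all three kinds of concerns in one overview diagram makes them look like one very long chain. | ||
| 65 | In engineering terms, however, they are separate responsibilities: | ||
| 66 | |||
| 67 | - The business attribution layer answers: **who the result ultimately belongs to** | ||
| 68 | - The audio entity layer answers: **which segment of audio was hit** | ||
| 69 | - The retrieval computation layer answers: **how that segment was recalled** | ||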
| 70 | |||
| 71 | --- | ||
| 72 | |||
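| 73 | ## 2.2 How far the current minimum viable architecture can be reduced | ||
| 74 | |||
| 75 | If the current stage only aims to: | ||
| 76 | |||
| 77 | > Quickly and reliably match a query to the correct `song_id` | ||
| 78 | |||
| 79 | then Phase-1 can proceed entirely on the following minimal skeleton: | ||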
| 80 | |||
| 81 | ```text | ||
| 82 | song -> asset -> window -> fingerprint / embedding | ||
| 83 | ``` | ||
| 84 | |||
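| 85 | Why these are kept: | ||
| 86 | - `window` cannot be removed: it is the smallest unit for offset/evidence/multi-segment voting | ||
| 87 | - `feature_set_registry` / `feature_fact` cannot be removed: otherwise a future switch to MERT/MuQ would hard-code the schema | ||
| 88 | - `asset` cannot be removed: a single `song` will have multiple real audio files | ||
| 89 | |||
| 90 | Can be deferred: | ||
| 91 | - `recording` | ||
| 92 | - `work` | ||
| 93 | - a heavier `retrieval_index_registry` | ||
| 94 | - finer-grained end-to-end audit tables | ||
| 95 | |||
| 96 | So the recommended stance is not to cut every layer, but: | ||
| 97 | > **Phase-1 ships the song-centric minimum viable layers first; layers for version attribution/cover/work governance are added later.** | ||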
| 98 | |||
| 99 | --- | ||
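The role of `window` as the smallest unit for multi-segment voting can be illustrated with a short sketch. This is not the repository's aggregation logic: the `WindowHit` type, the score-summing rule, and the `min_windows` threshold are all assumptions chosen to show why per-window evidence is needed before a `song_id` is returned.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class WindowHit(NamedTuple):
    """One matched reference window (hypothetical shape)."""
    song_id: str
    window_start_sec: float
    score: float


def vote_song(hits: Iterable[WindowHit], min_windows: int = 2) -> Optional[str]:
    """Sum scores per song and return the best song supported by enough windows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hit in hits:
        totals[hit.song_id] += hit.score
        counts[hit.song_id] += 1
    # Require several agreeing windows so one strong but isolated hit cannot win.
    eligible = {song: total for song, total in totals.items() if counts[song] >= min_windows}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)
```

With `min_windows=1` a single high-scoring window can decide the result, while a higher threshold favors songs that are hit consistently across several offsets.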
| 100 | |||
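| 101 | ## 3. Role views | ||
| 102 | |||
| 103 | ## 3.1 Product / architecture roles | ||
| 104 | |||
| 105 | Focus: | ||
| 106 | - Whether copyright protection can ultimately resolve to a `canonical_song_id` | ||
| 107 | - Whether the distinction between `recording` and `work` is clear | ||
| 108 | - Whether the current stage sticks to "freeze the specification first, iterate on models later" | ||
| 109 | - Whether the interfaces between teams are clear | ||
| 110 | |||
| 111 | Read first: | ||
| 112 | - This document | ||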
| 113 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 114 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 115 | |||
| 116 | --- | ||
| 117 | |||
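| 118 | ## 3.2 Developer roles (backend / retrieval / data) | ||
| 119 | |||
| 120 | Focus: | ||
| 121 | - How to import audio into the unified entity model | ||
| 122 | - How to cut windows, build feature_sets, and attach indexes | ||
| 123 | - How to go from a query to candidates and then normalize to `canonical_song_id` | ||
| 124 | - How to support future switches of `model_name / model_version / feature_set` | ||
| 125 | |||
| 126 | Read first: | ||
| 127 | - This document | ||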
| 128 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 129 | |||
| 130 | --- | ||
| 131 | |||
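| 132 | ## 3.3 Operations / platform roles | ||
| 133 | |||
| 134 | Focus: | ||
| 135 | - Offline jobs: feature extraction, index building, index rebuilding | ||
| 136 | - Online services: recall, aggregation, caching, observability | ||
| 137 | - Storage tiers: object storage, PostgreSQL, index backend | ||
| 138 | - Versioning: how encoder changes are canaried, rolled back, and dual-written / dual-indexed | ||
| 139 | |||
| 140 | Read first: | ||
| 141 | - This document | ||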
| 142 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 143 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 144 | |||
| 145 | --- | ||
| 146 | |||
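| 147 | ## 3.4 Base model / research roles | ||
| 148 | |||
| 149 | Focus: | ||
| 150 | - Which open-source encoder to choose while Phase-1 skips fine-tuning | ||
| 151 | - How to define a feature_set: window length, hop, pooling, layer selection | ||
| 152 | - How to upgrade from encoder-only to a version/cover lane later | ||
| 153 | - How to onboard new models without breaking the data layer | ||
| 154 | |||
| 155 | Read first: | ||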
| 156 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 157 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ||
| 158 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 159 | |||
| 160 | --- | ||
| 161 | |||
| 162 | ## 4. Offline / online responsibility split | ||
| 163 | |||
| 164 | ```mermaid | ||
| 165 | flowchart LR | ||
| 166 | A[Offline\ndata governance / windowing / feature extraction / index building] --> B[Registered Artifacts\nfeature_set / index / metadata] | ||
| 167 | B --> C[Online\nquery encode / retrieve / aggregate / decide] | ||
| 168 | ``` | ||
| 169 | |||
| 170 | ### Offline responsibilities | ||
| 171 | - Asset normalization | ||
| 172 | - Metadata normalization | ||
| 173 | - Windowing | ||
| 174 | - Model feature extraction | ||
| 175 | - fingerprint / embedding index building | ||
| 176 | - Backfilling PostgreSQL metadata | ||
| 177 | |||
| 178 | ### Online responsibilities | ||
| 179 | - Receive the query | ||
| 180 | - Query chunking / encoding | ||
| 181 | - exact / semantic / version lane recall | ||
| 182 | - recording/work/song aggregation | ||
| 183 | - Output `canonical_song_id` + evidence | ||
| 184 | |||
| 185 | --- | ||
| 186 | |||
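| 187 | ## 5. Why the roles must be separated | ||
| 188 | |||
| 189 | Because this project is no longer a single model script, but: | ||
| 190 | |||
| 191 | 1. **A data governance system**: whose audio it is and which recording/work/song it belongs to | ||
| 192 | 2. **A retrieval system**: how to find candidates from a query | ||
| 193 | 3. **A decision system**: which `canonical_song_id` is finally output | ||
| 194 | 4. **A serving system**: how to expose APIs and observability | ||
| 195 | 5. **An evolution system**: base models will change, but the data specification must not churn with them | ||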
| 196 | |||
| 197 | --- | ||
| 198 | |||
| 199 | ## 6. Recommendations for the current stage | ||
| 200 | |||
| 201 | ### The priority now is not more training changes, but: | ||
| 202 | |||
| 203 | 1. Stabilize the PostgreSQL data specification first | ||
| 204 | 2. Solidify the `model_registry / feature_set_registry` structure first | ||
| 205 | 3. In Phase-1, use an open-source encoder directly as the semantic lane baseline | ||
| 206 | 4. Keep the current ECAPA as a historical baseline / control group | ||
| 207 | |||
| 208 | ### Items kept in the current system | ||
| 209 | - `Chromaprint`: keep | ||
| 210 | - `ECAPA baseline`: keep as a control group | ||
| 211 | - `melody rerank`: keep as a supplementary lane, no longer the main evolution direction | ||
| 212 | |||
| 213 | ### Items upgraded in the current system | ||
| 214 | - semantic lane main encoder -> foundation model | ||
| 215 | - pgvector prototype schema -> extensible PostgreSQL data model | ||
| 216 | - flat song_id -> canonical/work/recording/recording_asset/audio_window | ||
| 217 | |||
| 218 | --- | ||
| 219 | |||
| 220 | ## 7. Mapping to code | ||
| 221 | |||
| 222 | | Code / document | Current role | | ||
| 223 | |---|---| | ||
| 224 | | `acr-engine/src/engines/chromaprint_matcher.py` | exact lane prototype | | ||
| 225 | | `acr-engine/src/engines/ecapa_embedder.py` | current embedding lane baseline | | ||
| 226 | | `acr-engine/src/engines/hybrid_engine.py` | current aggregation prototype | | ||
| 227 | | `acr-engine/sql/pgvector_schema.sql` | early pgvector prototype | | ||
| 228 | | `acr-engine/sql/acr_pg_schema_v2.sql` | recommended PostgreSQL V2 schema | | ||
| 229 | | [postgresql-data-model.md](./postgresql-data-model.md) | V2 schema design notes | | ||
| 230 | |||
| 231 | --- | ||
| 232 | |||
| 233 | ## 8. Reading suggestions | ||
| 234 | |||
| 235 | If you are: | ||
| 236 | - **Architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next | ||
| 237 | - **Data / backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next | ||
| 238 | - **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) |
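| 1 | # Model and Feature Set Initialization Guide / Model & Feature Registry Bootstrap | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: define the initialization conventions for `model_registry`, `feature_set_registry`, and `reference_set_registry` in Phase-1, so that the design is not redone every time a new encoder is onboarded. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | When Phase-1 does not fine-tune the base model, what actually needs initializing is not a "training job" but three kinds of objects: | ||
| 9 | |||
| 10 | 1. **Model definition**: `model_registry` | ||
| 11 | 2. **Feature definition**: `feature_set_registry` | ||
| 12 | 3. **Reference set definition**: `reference_set_registry` | ||
| 13 | |||
| 14 | In other words, first write down how the model will be used, then start extracting features. | ||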
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Recommended naming conventions | ||
| 19 | |||
| 20 | ## 1.1 model_name | ||
| 21 | Recommended: fixed lowercase: | ||
| 22 | - `chromaprint` | ||
| 23 | - `mert` | ||
| 24 | - `muq` | ||
| 25 | - `ecapa` | ||
| 26 | - `coverhunter_encoder` (Phase-2+) | ||
| 27 | |||
| 28 | ## 1.2 model_version | ||
| 29 | Recommended: make the source and scale explicit: | ||
| 30 | - `v1-95m` | ||
| 31 | - `v1-330m` | ||
| 32 | - `large-msd-iter` | ||
| 33 | - `acr-baseline-v1` | ||
| 34 | |||
| 35 | ## 1.3 Feature set naming | ||
| 36 | Recommended format: | ||
| 37 | |||
| 38 | ```text | ||
| 39 | <model_name>__<feature_name>__<window>x<hop>__<pooling>__<metric> | ||
| 40 | ``` | ||
| 41 | |||
| 42 | Examples: | ||
| 43 | - `mert__semantic_embedding__5s_2.5s__mean__cosine` | ||
| 44 | - `mert__semantic_embedding__10s_5s__mean__cosine` | ||
| 45 | - `muq__semantic_embedding__5s_2.5s__mean__cosine` | ||
| 46 | |||
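The naming convention above can be captured in a small builder and parser. This is a sketch, not code from the repository: the function names are hypothetical, and the window/hop segment follows the underscore form used in the three examples (`5s_2.5s`) rather than the `x` shown in the format template.

```python
from typing import Dict


def _fmt_sec(seconds: float) -> str:
    """Render 5.0 as '5s' and 2.5 as '2.5s'."""
    return f"{seconds:g}s"


def feature_set_name(model_name: str, feature_name: str, window_sec: float,
                     hop_sec: float, pooling: str, metric: str) -> str:
    """Join the five naming segments with a double underscore."""
    window_hop = f"{_fmt_sec(window_sec)}_{_fmt_sec(hop_sec)}"
    return "__".join([model_name.lower(), feature_name, window_hop, pooling, metric])


def parse_feature_set_name(name: str) -> Dict[str, str]:
    """Split a name back into its segments; raises if the shape is unexpected."""
    parts = name.split("__")
    if len(parts) != 5:
        raise ValueError(f"expected 5 segments separated by '__', got {len(parts)}")
    window, hop = parts[2].split("_")
    return {
        "model_name": parts[0],
        "feature_name": parts[1],
        "window": window,
        "hop": hop,
        "pooling": parts[3],
        "metric": parts[4],
    }
```

Using `__` as the segment separator keeps single underscores free for use inside segments such as `semantic_embedding`.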
| 47 | --- | ||
| 48 | |||
| 49 | ## 2. Recommended Phase-1 initialization objects | ||
| 50 | |||
| 51 | ## 2.1 Model list | ||
| 52 | |||
| 53 | | model_name | model_version | Role | | ||
| 54 | |---|---|---| | ||
| 55 | | `chromaprint` | `v1` | exact lane | | ||
| 56 | | `mert` | `v1-95m` | main semantic baseline | | ||
| 57 | | `muq` | `large-msd-iter` | semantic challenger | | ||
| 58 | | `ecapa` | `acr-baseline-v1` | historical baseline / control | | ||
| 59 | |||
| 60 | --- | ||
| 61 | |||
| 62 | ## 2.2 Feature set list | ||
| 63 | |||
| 64 | | feature_set | Purpose | | ||
| 65 | |---|---| | ||
| 66 | | `chromaprint asset-level` | exact matching | | ||
| 67 | | `mert 5s/2.5s mean` | main semantic baseline | | ||
| 68 | | `mert 10s/5s mean` | longer-context validation | | ||
| 69 | | `muq 5s/2.5s mean` | challenger baseline | | ||
| 70 | | `ecapa 5s/2.5s` | historical control | | ||
| 71 | |||
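The `5s/2.5s` and `10s/5s` entries above describe a window length and a hop. A minimal sketch of how such a sliding window could be laid over an audio file is shown below; the function name is hypothetical, and dropping a trailing partial window is an assumption, since the docs do not say how the tail is handled.

```python
from typing import List, Tuple


def sliding_windows(duration_sec: float, window_sec: float = 5.0,
                    hop_sec: float = 2.5) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for full windows that fit inside the audio."""
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    windows = []
    index = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while index * hop_sec + window_sec <= duration_sec + 1e-9:
        start = index * hop_sec
        windows.append((start, start + window_sec))
        index += 1
    return windows
```

With a hop of half the window length, every instant away from the file edges is covered by two windows, which gives the aggregation step overlapping evidence to work with.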
| 72 | --- | ||
| 73 | |||
| 74 | ## 3. Recommended initialization SQL | ||
| 75 | |||
| 76 | ## 3.1 Register models | ||
| 77 | |||
| 78 | ```sql | ||
| 79 | insert into model_registry ( | ||
| 80 | model_name, model_family, model_version, model_source, model_uri, | ||
| 81 | license_name, input_modality, input_sample_rate, input_channel_mode, | ||
| 82 | default_window_sec, default_hop_sec, output_embedding_dim, | ||
| 83 | pooling_supported, layer_selection_supported, is_trainable | ||
| 84 | ) values | ||
| 85 | ('chromaprint', 'fingerprint', 'v1', 'local', null, | ||
| 86 | null, 'audio', 16000, 'mono', | ||
| 87 | 5.0, 2.5, null, | ||
| 88 | array['none'], false, false), | ||
| 89 | ('mert', 'music_ssl', 'v1-95m', 'github', 'https://github.com/yizhilll/MERT', | ||
| 90 | null, 'audio', 24000, 'mono', | ||
| 91 | 5.0, 2.5, 768, | ||
| 92 | array['mean','cls'], true, false), | ||
| 93 | ('muq', 'music_ssl', 'large-msd-iter', 'github', 'https://github.com/tencent-ailab/MuQ', | ||
| 94 | null, 'audio', 24000, 'mono', | ||
| 95 | 5.0, 2.5, 768, | ||
| 96 | array['mean','cls'], true, false), | ||
| 97 | ('ecapa', 'speech_derived', 'acr-baseline-v1', 'local', null, | ||
| 98 | null, 'audio', 16000, 'mono', | ||
| 99 | 5.0, 2.5, 192, | ||
| 100 | array['mean'], false, true); | ||
| 101 | ``` | ||
| 102 | |||
| 103 | --- | ||
| 104 | |||
| 105 | ## 3.2 Register feature sets | ||
| 106 | |||
| 107 | ```sql | ||
| 108 | insert into feature_set_registry ( | ||
| 109 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 110 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 111 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 112 | ) | ||
| 113 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 114 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 115 | true, 'cosine', 'none', 'v1' | ||
| 116 | from model_registry | ||
| 117 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 118 | |||
| 119 | insert into feature_set_registry ( | ||
| 120 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 121 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 122 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 123 | ) | ||
| 124 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 125 | 10.0, 5.0, 768, 'mean', 'final', | ||
| 126 | true, 'cosine', 'none', 'v1' | ||
| 127 | from model_registry | ||
| 128 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 129 | |||
| 130 | insert into feature_set_registry ( | ||
| 131 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 132 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 133 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 134 | ) | ||
| 135 | select model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 136 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 137 | true, 'cosine', 'none', 'v1' | ||
| 138 | from model_registry | ||
| 139 | where model_name = 'muq' and model_version = 'large-msd-iter'; | ||
| 140 | ``` | ||
| 141 | |||
| 142 | --- | ||
| 143 | |||
| 144 | ## 3.3 Register the reference set | ||
| 145 | |||
| 146 | ```sql | ||
| 147 | insert into reference_set_registry ( | ||
| 148 | set_name, description, encoder_scope, status | ||
| 149 | ) values ( | ||
| 150 | 'phase1_hot_reference_v1', | ||
| 151 | 'Phase-1 main reference set, containing only the currently live hot reference recordings', | ||
| 152 | 'mert-v1-95m / muq-large-msd-iter', | ||
| 153 | 'active' | ||
| 154 | ); | ||
| 155 | ``` | ||
| 156 | |||
| 157 | --- | ||
| 158 | |||
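| 159 | ## 4. Operating principles for reference sets | ||
| 160 | |||
| 161 | ### Current recommendations | ||
| 162 | - Only one primary `active` hot reference set is allowed at any point in time | ||
| 163 | - When a new encoder / new aggregation strategy goes live, create a new set instead of overwriting the old one | ||
| 164 | - During A/B or shadow periods, multiple sets may coexist, but only one carries the primary live marker | ||
| 165 | |||
| 166 | ### Why | ||
| 167 | This supports: | ||
| 168 | - Live rollback | ||
| 169 | - Encoder upgrades | ||
| 170 | - Hot index switching | ||
| 171 | - Offline replay | ||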
| 172 | |||
| 173 | --- | ||
| 174 | |||
| 175 | ## 5. Dimension extension rules | ||
| 176 | |||
| 177 | The current DDL only demonstrates: | ||
| 178 | - `audio_embedding_vector_192` | ||
| 179 | - `audio_embedding_vector_768` | ||
| 180 | |||
| 181 | If a new encoder dimension such as `1024` is onboarded later: | ||
| 182 | |||
| 183 | 1. Add `audio_embedding_vector_1024` | ||
| 184 | 2. Set `embedding_dim=1024` on the corresponding feature_set | ||
| 185 | 3. Build a separate index | ||
| 186 | 4. Switch over through `retrieval_index_registry` | ||
| 187 | |||
| 188 | ### Principles | ||
| 189 | - A dimension change is a feature set upgrade, not a master data model upgrade | ||
| 190 | - The master data layer should not need table changes because of an encoder upgrade | ||
| 191 | |||
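The dimension rule above amounts to a lookup from `embedding_dim` to a per-dimension vector table. A hedged sketch follows; the function name and the idea of rejecting unregistered dimensions are assumptions, while the two table names come from the DDL mentioned in this section.

```python
from typing import FrozenSet

# Dimensions the current DDL demonstrates, per the section above.
KNOWN_DIMS: FrozenSet[int] = frozenset({192, 768})


def vector_table_for(embedding_dim: int, known_dims: FrozenSet[int] = KNOWN_DIMS) -> str:
    """Map an embedding dimension to its per-dimension vector table name."""
    if embedding_dim not in known_dims:
        # A new dimension needs its own table and index before it can be used.
        raise KeyError(f"no vector table registered for dim={embedding_dim}")
    return f"audio_embedding_vector_{embedding_dim}"
```

Onboarding a 1024-dimensional encoder would then mean creating the new table and extending the known set, for example `vector_table_for(1024, KNOWN_DIMS | {1024})`, without touching the master data tables.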
| 192 | --- | ||
| 193 | |||
| 194 | ## 6. Current recommended order | ||
| 195 | |||
| 196 | ```mermaid | ||
| 197 | flowchart TD | ||
| 198 | A[Register model_registry] --> B[Register feature_set_registry] | ||
| 199 | B --> C[Register reference_set_registry] | ||
| 200 | C --> D[Extract embeddings/fingerprint] | ||
| 201 | D --> E[Write audio_embedding/audio_fingerprint] | ||
| 202 | E --> F[Build retrieval_index_registry] | ||
| 203 | ``` | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
| 207 | ## 7. Final recommendation | ||
| 208 | |||
| 209 | If you start Phase-1 initialization today, register at least the following first: | ||
| 210 | |||
| 211 | 1. `chromaprint v1` | ||
| 212 | 2. `mert v1-95m` | ||
| 213 | 3. `muq large-msd-iter` | ||
| 214 | 4. `mert 5s/2.5s mean` | ||
| 215 | 5. `muq 5s/2.5s mean` | ||
| 216 | 6. `phase1_hot_reference_v1` | ||
| 217 | |||
| 218 | This gives the data, model, and index tracks each a stable entry point. | ||
| 219 | |||
| 220 | --- | ||
| 221 | |||
| 222 | ## 10. Phase-1 worker contract (new execution layer) | ||
| 223 | |||
| 224 | This is no longer just registry/bootstrap; a minimal real worker execution surface has also been added: | ||
| 225 | |||
| 226 | - `acr-engine/scripts/bootstrap_phase1_reference_members_live.py` | ||
| 227 | - `acr-engine/workers/mark_job_status.py` | ||
| 228 | - `acr-engine/workers/run_chromaprint_job.py` | ||
| 229 | - `acr-engine/workers/run_embedding_job.py` | ||
| 230 | |||
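| 231 | The purpose of this layer is not to run real feature extraction to completion right away, but to first get the following chain working end to end: | ||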
| 232 | |||
| 233 | ```text | ||
| 234 | planner -> feature_extraction_job -> worker -> PostgreSQL status update | ||
| 235 | ``` | ||
| 236 | |||
| 237 | ### Current capabilities | ||
| 238 | |||
| 239 | 1. Read `feature_extraction_job` | ||
| 240 | 2. Resolve the job by joining `feature_set_registry + model_registry` | ||
| 241 | 3. Parse `target_scope` | ||
| 242 | 4. Write back `pending -> running -> completed` | ||
| 243 | 5. Keep a stable contract for later real model inference | ||
| 244 | |||
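The `pending -> running -> completed` write-back listed above can be modeled as a small state machine. This sketch is illustrative only: the `failed` state and the `advance` helper are assumptions and are not confirmed by the worker scripts named in this section.

```python
from typing import Dict, Set

# Allowed status transitions; 'failed' is a hypothetical extra terminal state.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal job status transition: {current} -> {new}")
    return new
```

Validating transitions before writing them back keeps a worker from, for example, marking a job `completed` that was never claimed as `running`.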
| 245 | ### Recommended reading | ||
| 246 | |||
| 247 | For the detailed contract and flowcharts, see: | ||
| 248 | |||
| 249 | - [docs/phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 250 | |||
| 251 | --- | ||
| 252 | |||
| 253 | ## 8. Live PostgreSQL bootstrap script | ||
| 254 | |||
| 255 | To avoid running SQL by hand every time, the repository now provides a live bootstrap script that connects to PostgreSQL directly: | ||
| 256 | |||
| 257 | - `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | ||
| 258 | |||
| 259 | Purpose: | ||
| 260 | - Write `model_registry` into the target schema | ||
| 261 | - Write `feature_set_registry` | ||
| 262 | - Write `reference_set_registry` | ||
| 263 | - Uses an **idempotent upsert / ensure** approach, so it is safe to run repeatedly | ||
| 264 | |||
| 265 | ### 8.1 Command | ||
| 266 | |||
| 267 | ```bash | ||
| 268 | cd /workspace/acr-engine | ||
| 269 | /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py \ | ||
| 270 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 271 | --schema acr_test \ | ||
| 272 | --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json | ||
| 273 | ``` | ||
| 274 | |||
| 275 | ### 8.2 Currently verified results (acr_test) | ||
| 276 | |||
| 277 | This round was actually executed on the `acr_test` schema, with the following write results: | ||
| 278 | |||
| 279 | | Object | Count | | ||
| 280 | |---|---:| | ||
| 281 | | `model_registry` | `5` | | ||
| 282 | | `feature_set_registry` | `6` | | ||
| 283 | | `reference_set_registry` | `2` | | ||
| 284 | |||
| 285 | Among these, the newly added Phase-1 objects include: | ||
| 286 | |||
| 287 | #### models | ||
| 288 | - `chromaprint v1` | ||
| 289 | - `mert v1-95m` | ||
| 290 | - `muq large-msd-iter` | ||
| 291 | - `ecapa acr-baseline-v1` | ||
| 292 | |||
| 293 | #### feature sets | ||
| 294 | - `chromaprint fingerprint_asset` | ||
| 295 | - `mert semantic_embedding 5s/2.5s` | ||
| 296 | - `mert semantic_embedding 10s/5s` | ||
| 297 | - `muq semantic_embedding 5s/2.5s` | ||
| 298 | - `ecapa semantic_embedding 5s/2.5s` | ||
| 299 | |||
| 300 | #### reference set | ||
| 301 | - `phase1_hot_reference_v1` | ||
| 302 | |||
| 303 | ### 8.3 Current artifacts | ||
| 304 | |||
| 305 | - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | ||
| 306 | - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json` | ||
| 307 | |||
| 308 | The report already records: | ||
| 309 | - model_id | ||
| 310 | - feature_set_id | ||
| 311 | - reference_set_id | ||
| 312 | - final table counts | ||
| 313 | |||
| 314 | So the next session does not need to start from hand-running SQL snippets; it can pick up directly from the live bootstrap script. | ||
| 315 | |||
| 316 | ### 8.4 Idempotency verification (done) | ||
| 317 | |||
| 318 | After running the same command twice in a row on the `acr_test` schema, real idempotency evidence was obtained: | ||
| 319 | |||
| 320 | | Item | Run 1 | Run 2 | | ||
| 321 | |---|---:|---:| | ||
| 322 | | `model_registry` | `5` | `5` | | ||
| 323 | | `feature_set_registry` | `6` | `6` | | ||
| 324 | | `reference_set_registry` | `2` | `2` | | ||
| 325 | |||
| 326 | On the second run: | ||
| 327 | - All `models` show as `updated` | ||
| 328 | - All `feature_sets` show as `reused` | ||
| 329 | - `reference_set` shows as `updated` | ||
| 330 | |||
| 331 | Conclusion: | ||
| 332 | |||
| 333 | > The current bootstrap script can be run repeatedly without flooding the Phase-1 registry with duplicate rows. | ||
| 334 | |||
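The idempotency result above (same row counts on both runs, with models reported as `updated` the second time) can be reproduced in miniature. This is a sketch against in-memory SQLite, not the bootstrap script itself: `ensure_model` and `bootstrap` are hypothetical names, the table is reduced to three columns taken from the registration SQL in section 3.1, and the unique key on `(model_name, model_version)` is an assumption. In PostgreSQL the same effect is commonly achieved with `INSERT ... ON CONFLICT ... DO UPDATE`.

```python
import sqlite3
from typing import List, Optional, Tuple

MODELS: List[Tuple[str, str, Optional[int]]] = [
    ("chromaprint", "v1", None),
    ("mert", "v1-95m", 768),
    ("muq", "large-msd-iter", 768),
    ("ecapa", "acr-baseline-v1", 192),
]


def ensure_model(conn: sqlite3.Connection, name: str, version: str,
                 dim: Optional[int]) -> str:
    """Insert the model if missing, otherwise update it; report what happened."""
    row = conn.execute(
        "SELECT model_id FROM model_registry WHERE model_name = ? AND model_version = ?",
        (name, version),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO model_registry (model_name, model_version, output_embedding_dim) "
            "VALUES (?, ?, ?)",
            (name, version, dim),
        )
        return "inserted"
    conn.execute(
        "UPDATE model_registry SET output_embedding_dim = ? WHERE model_id = ?",
        (dim, row[0]),
    )
    return "updated"


def bootstrap(conn: sqlite3.Connection) -> List[str]:
    """Ensure every model exists; safe to call any number of times."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_registry ("
        "model_id INTEGER PRIMARY KEY, model_name TEXT NOT NULL, "
        "model_version TEXT NOT NULL, output_embedding_dim INTEGER, "
        "UNIQUE (model_name, model_version))"
    )
    with conn:  # commit the whole batch as one transaction
        return [ensure_model(conn, *model) for model in MODELS]
```

Returning a per-row status makes the second run auditable: the caller can confirm that nothing was inserted while still seeing that every row was touched.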
| 335 | --- | ||
| 336 | |||
| 337 | ## 9. Phase-1 extraction job bootstrap | ||
| 338 | |||
| 339 | Once `model_registry / feature_set_registry / reference_set_registry` all exist, the next step is not to run feature extraction by hand right away, but to first write the **pending jobs** into `feature_extraction_job`. | ||
| 340 | |||
| 341 | The repository now provides: | ||
| 342 | |||
| 343 | - `acr-engine/scripts/bootstrap_phase1_extraction_jobs_live.py` | ||
| 344 | |||
| 345 | Purpose: | ||
| 346 | - Based on the existing `feature_set_registry` | ||
| 347 | - Generate pending extraction jobs for `phase1_hot_reference_v1` | ||
| 348 | - Put the Phase-1 exact / semantic lanes into one PostgreSQL job table | ||
| 349 | |||
| 350 | ### 9.1 Command | ||
| 351 | |||
| 352 | ```bash | ||
| 353 | cd /workspace/acr-engine | ||
| 354 | /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_extraction_jobs_live.py \ | ||
| 355 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 356 | --schema acr_test \ | ||
| 357 | --output data/pgvector_eval/music20/phase1_extraction_jobs_report.json | ||
| 358 | ``` | ||
| 359 | |||
| 360 | ### 9.2 Currently verified results (acr_test) | ||
| 361 | |||
| 362 | This round actually created 5 pending jobs: | ||
| 363 | |||
| 364 | | lane | model | feature | target_scope | status | | ||
| 365 | |---|---|---|---|---| | ||
| 366 | | exact | `chromaprint` | `fingerprint_asset` | `reference_set:phase1_hot_reference_v1` | `pending` | | ||
| 367 | | semantic | `mert` | `semantic_embedding` 5s/2.5s | `reference_set:phase1_hot_reference_v1` | `pending` | | ||
| 368 | | semantic | `mert` | `semantic_embedding` 10s/5s | `reference_set:phase1_hot_reference_v1` | `pending` | | ||
| 369 | | semantic | `muq` | `semantic_embedding` 5s/2.5s | `reference_set:phase1_hot_reference_v1` | `pending` | | ||
| 370 | | semantic | `ecapa` | `semantic_embedding` 5s/2.5s | `reference_set:phase1_hot_reference_v1` | `pending` | | ||
| 371 | |||
| 372 | Corresponding live report: | ||
| 373 | - `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json` | ||
| 374 | |||
| 375 | This means: | ||
| 376 | |||
| 377 | > PostgreSQL now holds not only the model and feature definitions, but also a structured entry point for **which feature extraction jobs to run next**. | ||
| 378 | |||
| 379 | --- | ||
| 380 | |||
| 381 | ## 10. Phase-1 extraction plan (generated from pending jobs) | ||
| 382 | |||
| 383 | Once `feature_extraction_job` exists, the next step is usually not to type commands by hand right away, but to first generate a **unified execution plan** from PostgreSQL. | ||
| 384 | |||
| 385 | The repository now provides: | ||
| 386 | |||
| 387 | - `acr-engine/scripts/plan_phase1_extraction_jobs_live.py` | ||
| 388 | |||
| 389 | Purpose: | ||
| 390 | - Read `feature_extraction_job` | ||
| 391 | - Filter on `job_status=pending` | ||
| 392 | - Join `feature_set_registry + model_registry` | ||
| 393 | - Generate an execution plan ordered by lane / priority | ||
| 394 | |||
| 395 | ### 10.1 Command | ||
| 396 | |||
| 397 | ```bash | ||
| 398 | cd /workspace/acr-engine | ||
| 399 | /usr/local/miniconda3/bin/python scripts/plan_phase1_extraction_jobs_live.py \ | ||
| 400 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 401 | --schema acr_test \ | ||
| 402 | --job-status pending \ | ||
| 403 | --output data/pgvector_eval/music20/phase1_extraction_plan_report.json | ||
| 404 | ``` | ||
| 405 | |||
| 406 | ### 10.2 Currently verified results (acr_test) | ||
| 407 | |||
| 408 | This round actually generated an ordered execution plan: | ||
| 409 | |||
| 410 | | order | lane | model | feature | physical_target | | ||
| 411 | |---|---|---|---|---| | ||
| 412 | | 1 | `exact` | `chromaprint` | `fingerprint_asset` | `audio_fingerprint` | | ||
| 413 | | 2 | `semantic` | `mert` | `semantic_embedding 5s/2.5s` | `audio_embedding` | | ||
| 414 | | 3 | `semantic` | `mert` | `semantic_embedding 10s/5s` | `audio_embedding` | | ||
| 415 | | 4 | `semantic` | `muq` | `semantic_embedding 5s/2.5s` | `audio_embedding` | | ||
| 416 | | 5 | `semantic` | `ecapa` | `semantic_embedding 5s/2.5s` | `audio_embedding` | | ||
| 417 | |||
| 418 | The planner also emits automatically: | ||
| 419 | - `vector_table` | ||
| 420 | - `audio_embedding_vector_768` | ||
| 421 | - `audio_embedding_vector_192` | ||
| 422 | - `target_scope` | ||
| 423 | - `execution_notes` | ||
| 424 | |||
| 425 | Current artifact: | ||
| 426 | - `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json` | ||
| 427 | |||
| 428 | Conclusion: | ||
| 429 | |||
| 430 | > PostgreSQL can now do more than describe which jobs exist: it can directly produce a **feature-extraction plan sorted in execution order**. | ||
| 431 | |||
| 432 | ### 10.3 Ready-to-run command suggestions (now complete) | ||
| 433 | |||
| 434 | This round also upgraded the planner so that **every job gets a command suggestion**. | ||
| 435 | |||
| 436 | Examples: | ||
| 437 | |||
| 438 | #### exact lane | ||
| 439 | |||
| 440 | ```bash | ||
| 441 | cd /workspace/acr-engine && PG_DSN="${PG_DSN:?set PG_DSN}" EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \ | ||
| 442 | /usr/local/miniconda3/bin/python workers/run_chromaprint_job.py --complete-dry-run | ||
| 443 | ``` | ||
| 444 | |||
| 445 | #### semantic lane | ||
| 446 | |||
| 447 | ```bash | ||
| 448 | cd /workspace/acr-engine && PG_DSN="${PG_DSN:?set PG_DSN}" EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \ | ||
| 449 | /usr/local/miniconda3/bin/python workers/run_embedding_job.py --complete-dry-run | ||
| 450 | ``` | ||
| 451 | |||
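| 452 | This means the next session does not need to assemble environment variables and job bindings by hand; the command templates can be copied straight from the planner report. | ||
| 453 | |||
| 454 | ### 10.4 The planner now also ships validation commands | ||
| 455 | |||
| 456 | Besides per-job command suggestions, the planner now also outputs a set of global validation entry points: | ||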
| 457 | |||
| 458 | - `prereq_audit` | ||
| 459 | - `worker_contract_smoke` | ||
| 460 | - `semantic_vector_negative_matrix` | ||
| 461 | - `asset_level_upsert_validation` | ||
| 462 | |||
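| 463 | That is: | ||
| 464 | |||
| 465 | 1. Audit the host prerequisites first | ||
| 466 | 2. Then run the exact + semantic contract smoke test | ||
| 467 | 3. Then check that the semantic vector-table negative cases are stable | ||
| 468 | 4. Then validate the asset-level upsert contract | ||
| 469 | |||
| 470 | This turns the planner from something that only orders jobs into a deliverable that also provides pre-execution check entry points. | ||
| 471 | |||
| 472 | |||
| 473 | ### 10.5 Planner validation_commands verified by direct consumption | ||
| 474 | |||
| 475 | This round added another layer of fresh evidence: rather than only checking that the planner JSON contains commands, the commands were **read directly from `phase1_extraction_plan_report.json` and executed**. | ||
| 476 | |||
| 477 | Corresponding artifact: | ||
| 478 | |||
| 479 | - `acr-engine/data/pgvector_eval/music20/phase1_validation_commands_execution_report.json` | ||
| 480 | |||
| 481 | Verified so far: | ||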
| 482 | |||
| 483 | - `validation_commands.prereq_audit` -> `returncode = 0` | ||
| 484 | - `validation_commands.worker_contract_smoke` -> `returncode = 0` | ||
| 485 | - `validation_commands.semantic_vector_negative_matrix` -> `returncode = 0` | ||
| 486 | - `validation_commands.asset_level_upsert_validation` -> `returncode = 0` | ||
| 487 | |||
| 488 | `phase1_validation_commands_execution_report.json` currently reports: | ||
| 489 | |||
| 490 | - `executed_commands = 4` | ||
| 491 | - `all_passed = true` | ||
| 492 | |||
| 493 | This shows that the planner report can now be consumed by scripts as a real execution entry point, not just used to display commands. | ||
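A minimal sketch of what scripted consumption of the report could look like. It assumes that `validation_commands` is a JSON object mapping a name to a shell command string; the report layout is not shown in this document, so the function name `run_validation_commands` and that structure are illustrative only, not the actual runner script.

```python
import json
import subprocess


def run_validation_commands(report_path):
    """Run every command under `validation_commands` and summarise the return codes.

    Assumption: the report stores {"validation_commands": {name: shell_command}}.
    """
    with open(report_path, encoding="utf-8") as handle:
        report = json.load(handle)

    returncodes = {}
    for name, command in report.get("validation_commands", {}).items():
        # shell=True because the planner templates are full shell lines (cd ... && VAR=... python ...).
        completed = subprocess.run(command, shell=True, capture_output=True, text=True)
        returncodes[name] = completed.returncode

    return {
        "executed_commands": len(returncodes),
        "all_passed": all(code == 0 for code in returncodes.values()),
        "returncodes": returncodes,
    }
```

Running commands with `shell=True` is only reasonable here because the report is produced by the repository's own planner; it should not be used on untrusted input.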
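| 1 | # Phase-1 Encoder-only Implementation Checklist | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: break the Phase-1 plan (no fine-tuning yet, open-source encoders first) into executable steps so the data, retrieval, platform, and ops teams can work in parallel. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The Phase-1 deliverable is not to prove that some new model is absolutely the best, but to: | ||
| 9 | |||
| 10 | 1. Stabilize the **PostgreSQL master data model** | ||
| 11 | 2. Get **reference assets / windows / feature_set** working end to end | ||
| 12 | 3. Establish an encoder-only baseline with **MERT + MuQ** | ||
| 13 | 4. Get the aggregation chain across the **fingerprint lane + semantic lane** working | ||
| 14 | 5. Leave interfaces ready for the Phase-2 version/cover lane | ||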
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Delivery scope | ||
| 19 | |||
| 20 | ### Must be completed in this phase | ||
| 21 | - Load `canonical_song / work / recording / recording_asset / audio_window` into the database | ||
| 22 | - Initialize `model_registry / feature_set_registry` | ||
| 23 | - MERT/MuQ encoder-only feature extraction | ||
| 24 | - Build the hot reference set | ||
| 25 | - Build the semantic index | ||
| 26 | - Basic closed loop of query -> candidate -> canonical_song | ||
| 27 | |||
| 28 | ### Not required in this phase | ||
| 29 | - Backbone fine-tuning | ||
| 30 | - Cover-specific training | ||
| 31 | - Humming-specific melody tower | ||
| 32 | - Loading all cold data into the hot index | ||
| 33 | |||
| 34 | --- | ||
| 35 | |||
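| 36 | ## 2. Roles and responsibilities | ||
| 37 | |||
| 38 | | Role | Main deliverables | | ||
| 39 | |---|---| | ||
| 40 | | Data engineering | Asset cleaning, deduplication, entity mapping, windowing manifest | | ||
| 41 | | Backend/DBA | PostgreSQL DDL, indexes, write path, validation constraints | | ||
| 42 | | Retrieval engineering | Fingerprint lane, semantic lane, aggregation logic | | ||
| 43 | | Model engineering | MERT/MuQ integration, feature_set design, feature-extraction scripts | | ||
| 44 | | Platform/ops | Offline job orchestration, object storage, hot/cold index governance | | ||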
| 45 | |||
| 46 | --- | ||
| 47 | |||
| 48 | ## 3. Stage-by-stage checklist | ||
| 49 | |||
| 50 | ## Stage 1: Load master data | ||
| 51 | |||
| 52 | ### Goal | ||
| 53 | Stabilize the business-fact layer independently of any particular encoder. | ||
| 54 | |||
| 55 | ### Checklist | ||
| 56 | - [ ] Create the database by running `acr-engine/sql/acr_pg_schema_v2.sql` | ||
| 57 | - [ ] Initialize `canonical_song` | ||
| 58 | - [ ] Initialize `work` | ||
| 59 | - [ ] Initialize `recording` | ||
| 60 | - [ ] Initialize `recording_asset` | ||
| 61 | - [ ] Verify the lineage trigger works | ||
| 62 | - [ ] Run an insert smoke test with a small batch of reference data | ||
| 63 | |||
| 64 | ### Outputs | ||
| 65 | - PostgreSQL schema v2 | ||
| 66 | - Initial entity data | ||
| 67 | - Reusable data import scripts | ||
| 68 | |||
| 69 | --- | ||
| 70 | |||
| 71 | ## Stage 2: Reference assets and windowing | ||
| 72 | |||
| 73 | ### Goal | ||
| 74 | Build the reference set that can be searched. | ||
| 75 | |||
| 76 | ### Checklist | ||
| 77 | - [ ] Select recordings with `is_reference=true` | ||
| 78 | - [ ] Create `reference_set_registry` | ||
| 79 | - [ ] Backfill `reference_set_member` | ||
| 80 | - [ ] Normalize audio paths to one standard | ||
| 81 | - [ ] Generate `audio_window` | ||
| 82 | - [ ] Mark `active_for_index` | ||
| 83 | |||
| 84 | ### Recommended rules | ||
| 85 | - Start with only the primary reference version | ||
| 86 | - Default to `5s / 2.5s hop` first | ||
| 87 | - Keep intro/outro for now; do quality pruning later | ||
| 88 | |||
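The default `5s / 2.5s hop` rule above can be sketched as a small helper. The function name `make_windows` and the choice to drop a trailing partial window are assumptions made for illustration; the document does not say how the tail of an asset is handled.

```python
def make_windows(duration_ms, window_ms=5000, hop_ms=2500):
    """Return (start_ms, end_ms) pairs for fixed-length sliding windows.

    Assumption: only full-length windows are kept, so the tail shorter
    than `window_ms` is dropped.
    """
    if window_ms <= 0 or hop_ms <= 0:
        raise ValueError("window_ms and hop_ms must be positive")

    windows = []
    start = 0
    while start + window_ms <= duration_ms:
        windows.append((start, start + window_ms))
        start += hop_ms
    return windows


if __name__ == "__main__":
    # A 12-second asset yields three overlapping 5-second windows.
    print(make_windows(12_000))
```

Working in integer milliseconds avoids float drift on the 2.5 s hop and matches window ranges such as `0-5000ms` and `2500-7500ms`.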
| 89 | --- | ||
| 90 | |||
| 91 | ## Stage 3: Model and feature_set initialization | ||
| 92 | |||
| 93 | ### Goal | ||
| 94 | Stabilize model registration and feature version definitions. | ||
| 95 | |||
| 96 | ### Checklist | ||
| 97 | - [ ] Register `chromaprint` | ||
| 98 | - [ ] Register `mert v1-95m` | ||
| 99 | - [ ] Register `muq` | ||
| 100 | - [ ] Register `mert 5s/2.5s mean pool` | ||
| 101 | - [ ] Register `mert 10s/5s mean pool` | ||
| 102 | - [ ] Register `muq 5s/2.5s mean pool` | ||
| 103 | - [ ] Specify metric / quantization / dim for each feature_set | ||
| 104 | |||
| 105 | ### Outputs | ||
| 106 | - `model_registry` seed data | ||
| 107 | - `feature_set_registry` seed data | ||
| 108 | - Feature set naming convention | ||
| 109 | |||
| 110 | --- | ||
| 111 | |||
| 112 | ## Stage 4: Encoder-only feature extraction | ||
| 113 | |||
| 114 | ### Goal | ||
| 115 | Skip training for now and turn the reference set directly into searchable embeddings. | ||
| 116 | |||
| 117 | ### Checklist | ||
| 118 | - [ ] Extract MERT window embeddings | ||
| 119 | - [ ] Extract MuQ window embeddings | ||
| 120 | - [ ] Write to `audio_embedding` | ||
| 121 | - [ ] Write hot data to `audio_embedding_vector_768` or the corresponding physical table | ||
| 122 | - [ ] Store cold data in object storage/parquet | ||
| 123 | - [ ] Backfill `is_indexed` | ||
| 124 | |||
| 125 | ### Validation | ||
| 126 | - [ ] Spot-check random samples to confirm the `window -> embedding -> feature_set` back-links resolve | ||
| 127 | - [ ] Check vector norm / missing rate / duplicate rate | ||
| 128 | |||
| 129 | --- | ||
| 130 | |||
| 131 | ## Stage 5: Indexing and recall | ||
| 132 | |||
| 133 | ### Goal | ||
| 134 | Get dual-path recall working across the semantic lane and the exact lane. | ||
| 135 | |||
| 136 | ### Checklist | ||
| 137 | - [ ] Build the fingerprint index | ||
| 138 | - [ ] Build the semantic index | ||
| 139 | - [ ] Backfill `retrieval_index_registry` | ||
| 140 | - [ ] Implement query encoding | ||
| 141 | - [ ] Return `retrieval_candidate` | ||
| 142 | - [ ] Aggregate to `recording / work / canonical_song` | ||
| 143 | - [ ] Run `phase1_prereq_audit` | ||
| 144 | - [ ] Run `phase1_worker_contract_smoke` | ||
| 145 | - [ ] Run `semantic_vector_negative_matrix` | ||
| 146 | - [ ] Run `asset_level_upsert_validation` | ||
| 147 | |||
| 148 | ### Suggested aggregation for the first version | ||
| 149 | - max score | ||
| 150 | - top-k average | ||
| 151 | - hit windows count | ||
| 152 | - exact lane / semantic lane agreement bonus | ||
| 153 | |||
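One way the four aggregation signals above could be combined per song is sketched below. The input shape `(song_id, lane, score)`, the `top_k` and `agreement_bonus` defaults, and the rule "final score = max score + bonus" are all assumptions for illustration; the document lists the signals but does not define how they are weighted.

```python
from collections import defaultdict


def aggregate_candidates(hits, top_k=3, agreement_bonus=0.1):
    """Aggregate window-level hits into one ranked row per song.

    `hits` is an iterable of (song_id, lane, score) tuples, where lane is
    "exact" or "semantic" (hypothetical input shape).
    """
    by_song = defaultdict(list)
    for song_id, lane, score in hits:
        by_song[song_id].append((lane, score))

    ranked = []
    for song_id, items in by_song.items():
        scores = sorted((score for _, score in items), reverse=True)
        top = scores[:top_k]
        lanes = {lane for lane, _ in items}
        # Bonus only when both lanes independently point at the same song.
        bonus = agreement_bonus if {"exact", "semantic"} <= lanes else 0.0
        ranked.append(
            {
                "song_id": song_id,
                "max_score": scores[0],
                "topk_avg": sum(top) / len(top),
                "hit_windows": len(items),
                "final_score": scores[0] + bonus,
            }
        )

    ranked.sort(key=lambda row: (row["final_score"], row["hit_windows"]), reverse=True)
    return ranked
```

Keeping all four signals in the output row makes it possible to retune the final score later without re-running retrieval.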
| 154 | --- | ||
| 155 | |||
| 156 | ## Stage 6: Baseline evaluation and release gates | ||
| 157 | |||
| 158 | ### Goal | ||
| 159 | First show that the Phase-1 structure is usable. | ||
| 160 | |||
| 161 | ### Checklist | ||
| 162 | - [ ] exact query bucket | ||
| 163 | - [ ] noisy/BGM bucket | ||
| 164 | - [ ] version-like bucket (even though the cover lane is not trained yet) | ||
| 165 | - [ ] Top1/Top3/MRR | ||
| 166 | - [ ] canonical_song recall | ||
| 167 | - [ ] work-level recall | ||
| 168 | - [ ] Reference set version records | ||
| 169 | |||
| 170 | --- | ||
| 171 | |||
| 172 | ## 4. Recommended order | ||
| 173 | |||
| 174 | ```mermaid | ||
| 175 | flowchart TD | ||
| 176 | A[Load Schema v2] --> B[Import entities] | ||
| 177 | B --> C[Initialize reference set] | ||
| 178 | C --> D[Generate audio_window] | ||
| 179 | D --> E[Initialize model/feature_set] | ||
| 180 | E --> F[MERT/MuQ feature extraction] | ||
| 181 | F --> G[semantic index] | ||
| 182 | C --> H[fingerprint index] | ||
| 183 | G --> I[candidate aggregation] | ||
| 184 | H --> I | ||
| 185 | I --> J[Phase-1 benchmark] | ||
| 186 | ``` | ||
| 187 | |||
| 188 | --- | ||
| 189 | |||
| 190 | ## 5. Acceptance criteria for the first version | ||
| 191 | |||
| 192 | ### Data layer | ||
| 193 | - Can reliably insert `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 194 | - Can support at least one `reference_set` | ||
| 195 | |||
| 196 | ### Model/feature layer | ||
| 197 | - Multiple `model_registry / feature_set_registry` entries can coexist | ||
| 198 | - MERT/MuQ encoder-only feature extraction runs end to end | ||
| 199 | |||
| 200 | ### Retrieval layer | ||
| 201 | - Can return candidates from both the fingerprint lane and the semantic lane | ||
| 202 | - Can aggregate to an output `canonical_song_id` | ||
| 203 | |||
| 204 | ### Operations layer | ||
| 205 | - Can rebuild the reference set | ||
| 206 | - Can rebuild the semantic index | ||
| 207 | - Can record feature_set and index versions | ||
| 208 | |||
| 209 | --- | ||
| 210 | |||
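| 211 | ## 6. Common pitfalls in this phase | ||
| 212 | |||
| 213 | 1. Hard-wiring embedding storage to one model's dimension up front | ||
| 214 | 2. Keeping only song_id and dropping work/recording | ||
| 215 | 3. Leaving the reference set unversioned | ||
| 216 | 4. Query results that cannot be traced back to a specific evidence window | ||
| 217 | 5. Removing the exact lane too early | ||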
| 218 | |||
| 219 | --- | ||
| 220 | |||
| 221 | ## 7. Current recommendation | ||
| 222 | |||
| 223 | If you need to schedule work right away, follow this priority order: | ||
| 224 | |||
| 225 | 1. Schema v2 and master data import | ||
| 226 | 2. reference set + audio_window | ||
| 227 | 3. MERT/MuQ feature_set initialization | ||
| 228 | 4. Encoder-only feature extraction | ||
| 229 | 5. Dual-path recall and aggregation | ||
| 230 | 6. Benchmark and release gates | ||
| 231 | |||
| 232 | |||
| 233 | ## 6.1 Validation entry points the planner currently provides | ||
| 234 | |||
| 235 | Besides job-level `command_suggestions`, `acr-engine/scripts/plan_phase1_extraction_jobs_live.py` now also includes in `phase1_extraction_plan_report.json`: | ||
| 236 | |||
| 237 | - `validation_commands.prereq_audit` | ||
| 238 | - `validation_commands.worker_contract_smoke` | ||
| 239 | - `validation_commands.semantic_vector_negative_matrix` | ||
| 240 | - `validation_commands.asset_level_upsert_validation` | ||
| 241 | |||
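| 242 | This means the next session can run the global validation entry points first and then decide whether to execute specific jobs, without assembling test commands by hand. | ||
| 243 | |||
| 244 | |||
| 245 | ## 6.2 Recommended one-shot validation entry point | ||
| 246 | |||
| 247 | If you only want to confirm that the current host is ready to continue with Phase-1, run this first: | ||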
| 248 | |||
| 249 | ```bash | ||
| 250 | cd /workspace/acr-engine | ||
| 251 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 252 | ``` | ||
| 253 | |||
| 254 | It reads `validation_commands` directly from `phase1_extraction_plan_report.json` and runs them as a batch: | ||
| 255 | |||
| 256 | - `prereq_audit` | ||
| 257 | - `worker_contract_smoke` | ||
| 258 | - `semantic_vector_negative_matrix` | ||
| 259 | - `asset_level_upsert_validation` | ||
| 260 | |||
| 261 | Current live results: | ||
| 262 | |||
| 263 | - `executed_count = 4` | ||
| 264 | - `all_passed = true` |
docs/phase1-worker-contract.md
deleted
100644 → 0
| 1 | # Phase-1 Worker Contract / Job Executor Contract | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: move Phase-1 from having only a registry / plan to workers that actually consume PostgreSQL jobs and update their status. | ||
| 5 | |||
| 6 | --- | ||
| 7 | |||
| 8 | ## One-page summary | ||
| 9 | |||
| 10 | Phase-1 now has a minimal real execution chain: | ||
| 11 | |||
| 12 | 1. The planner reads pending jobs from `feature_extraction_job` | ||
| 13 | 2. The worker reads `extraction_job_id` | ||
| 14 | 3. The worker resolves `feature_set_registry + model_registry` via a join | ||
| 15 | 4. The worker resolves `target_scope` | ||
| 16 | 5. The worker writes back `feature_extraction_job.job_status / input_count / output_count / metadata_json` | ||
| 17 | |||
| 18 | In other words, PostgreSQL is no longer just a data dictionary; it now also serves as: | ||
| 19 | - the job orchestration plane | ||
| 20 | - the state machine plane | ||
| 21 | - the execution evidence plane | ||
| 22 | |||
| 23 | --- | ||
| 24 | |||
| 25 | ## 1. Workers currently in place | ||
| 26 | |||
| 27 | Located at: | ||
| 28 | |||
| 29 | - `acr-engine/scripts/bootstrap_phase1_reference_members_live.py` | ||
| 30 | - `acr-engine/workers/mark_job_status.py` | ||
| 31 | - `acr-engine/workers/run_chromaprint_job.py` | ||
| 32 | - `acr-engine/workers/run_embedding_job.py` | ||
| 33 | - `acr-engine/workers/_job_common.py` | ||
| 34 | |||
| 35 | ### Roles | ||
| 36 | |||
| 37 | | worker | Role | | ||
| 38 | |---|---| | ||
| 39 | | `mark_job_status.py` | Generic status advancer | | ||
| 40 | | `run_chromaprint_job.py` | exact lane worker | | ||
| 41 | | `run_embedding_job.py` | semantic lane worker | | ||
| 42 | | `_job_common.py` | Shared logic for job loading, scope resolution, and status write-back | | ||
| 43 | |||
| 44 | ### Supporting bootstrap | ||
| 45 | |||
| 46 | So that workers no longer face an empty scope, this round also added: | ||
| 47 | |||
| 48 | - `acr-engine/scripts/bootstrap_phase1_reference_members_live.py` | ||
| 49 | |||
| 50 | It attaches recordings with `recording.is_reference = true` to: | ||
| 51 | |||
| 52 | - `phase1_hot_reference_v1` | ||
| 53 | |||
| 54 | This lets the worker see real values for: | ||
| 55 | - `recording_count` | ||
| 56 | - `ready_asset_count` | ||
| 57 | - `active_window_count` | ||
| 58 | |||
| 59 | --- | ||
| 60 | |||
| 61 | ## 2. Current state machine | ||
| 62 | |||
| 63 | ```mermaid | ||
| 64 | flowchart LR | ||
| 65 | A[pending] --> B[running] | ||
| 66 | B --> C[completed] | ||
| 67 | B --> D[failed] | ||
| 68 | ``` | ||
| 69 | |||
| 70 | ### Verified state transitions | ||
| 71 | |||
| 72 | - `pending -> running` | ||
| 73 | - `running -> completed` (dry-run mode) | ||
| 74 | |||
| 75 | ### Current state guards | ||
| 76 | |||
| 77 | - Claiming a job requires the prior status to be `pending` | ||
| 78 | - Completing a job requires the prior status to be `running` | ||
| 79 | - `mark_job_status.py` accepts only: | ||
| 80 | - `pending` | ||
| 81 | - `running` | ||
| 82 | - `completed` | ||
| 83 | - `failed` | ||
| 84 | - `finished_at` is set only on first completion and is never overwritten afterwards | ||
| 85 | |||
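The prior-status guard described above is typically implemented as a conditional update (compare-and-set). The sketch below uses an in-memory SQLite table as a stand-in for PostgreSQL so that it runs without a server; the reduced two-column table and the `transition` helper are assumptions, not the actual code in `_job_common.py`.

```python
import sqlite3


def transition(conn, job_id, expected, new_status):
    """Move a job to `new_status` only if it is still in `expected`.

    Returns True when exactly one row was updated, False when the guard
    rejected the transition.
    """
    cursor = conn.execute(
        "UPDATE feature_extraction_job SET job_status = ? "
        "WHERE extraction_job_id = ? AND job_status = ?",
        (new_status, job_id, expected),
    )
    conn.commit()
    return cursor.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feature_extraction_job ("
        "extraction_job_id INTEGER PRIMARY KEY, job_status TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO feature_extraction_job VALUES (1, 'pending')")

    # First run: pending -> running -> completed.
    first_run = transition(conn, 1, "pending", "running") and transition(
        conn, 1, "running", "completed"
    )
    # Second claim without a reset: rejected because the job is no longer pending.
    second_claim = transition(conn, 1, "pending", "running")
    conn.close()
    return first_run, second_claim
```

Because the status check and the write happen in one statement, two workers racing for the same job cannot both succeed.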
| 86 | ### Verified guard behavior | ||
| 87 | |||
| 88 | Verified against a live run: | ||
| 89 | |||
| 90 | 1. First dry-run of a chromaprint job: | ||
| 91 | - Succeeds with `pending -> running -> completed` | ||
| 92 | 2. Second run of the same job without a reset: | ||
| 93 | - Rejected by the prior-status guard | ||
| 94 | |||
| 95 | Evidence: | ||
| 96 | |||
| 97 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_double_claim_guard_report.json` | ||
| 98 | |||
| 99 | ### Design intent | ||
| 100 | |||
| 101 | Fix the **job contract and state transitions** first, then plug in real model inference. | ||
| 102 | That way, whichever of the following is swapped in later: | ||
| 103 | - `Chromaprint` | ||
| 104 | - `MERT` | ||
| 105 | - `MuQ` | ||
| 106 | - `CoverHunter encoder` | ||
| 107 | |||
| 108 | the orchestration data structures do not need to be redone. | ||
| 109 | |||
| 110 | --- | ||
| 111 | |||
| 112 | ## 3. Worker input contract | ||
| 113 | |||
| 114 | ### Environment variables | ||
| 115 | |||
| 116 | | Variable | Description | | ||
| 117 | |---|---| | ||
| 118 | | `PG_DSN` | PostgreSQL connection string | | ||
| 119 | | `PG_SCHEMA` | Target schema | | ||
| 120 | | `EXTRACTION_JOB_ID` | ID of the job to execute | | ||
| 121 | | `FEATURE_SET_ID` | Attached at planning time; the worker can use it for a consistency check | | ||
| 122 | | `TARGET_SCOPE` | Attached at planning time; the worker currently treats the job record in the DB as authoritative | | ||
| 123 | | `MODEL_NAME` | Used by the embedding worker as a sanity check | | ||
| 124 | | `MODEL_VERSION` | Used by the embedding worker as a sanity check | | ||
| 125 | | `VECTOR_TABLE` | Target vector table for the embedding worker | | ||
| 126 | | `OUTPUT_TARGET` | `audio_fingerprint` or `audio_embedding` | | ||
| 127 | |||
| 128 | ### CLI arguments | ||
| 129 | |||
| 130 | All three workers accept explicit CLI arguments that override env. | ||
| 131 | |||
| 132 | ### Current conventions for planner command templates | ||
| 133 | |||
| 134 | `plan_phase1_extraction_jobs_live.py` now explicitly generates: | ||
| 135 | |||
| 136 | ```bash | ||
| 137 | cd /workspace/acr-engine && PG_DSN="${PG_DSN:?set PG_DSN}" ... | ||
| 138 | ``` | ||
| 139 | |||
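| 140 | That way, if the caller forgets to supply the database connection string when copying the command, it fails immediately instead of silently running against nothing. | ||
| 141 | |||
| 142 | The planner also explicitly uses: | ||
| 143 | |||
| 144 | ```bash | ||
| 145 | /usr/local/miniconda3/bin/python | ||
| 146 | ``` | ||
| 147 | |||
| 148 | The reason is that `python` is not on PATH in the current environment, whereas this interpreter path has been verified to work. | ||
| 149 | |||
| 150 | For the current dry-run workers, the planner's primary command template also explicitly includes: | ||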
| 151 | |||
| 152 | ```bash | ||
| 153 | --complete-dry-run | ||
| 154 | ``` | ||
| 155 | |||
| 156 | so that `primary_command` directly reproduces: | ||
| 157 | |||
| 158 | ```text | ||
| 159 | pending -> running -> completed | ||
| 160 | ``` | ||
| 161 | |||
| 162 | --- | ||
| 163 | |||
| 164 | ## 4. PostgreSQL read contract | ||
| 165 | |||
| 166 | The worker currently reads: | ||
| 167 | |||
| 168 | 1. `feature_extraction_job` | ||
| 169 | 2. `feature_set_registry` | ||
| 170 | 3. `model_registry` | ||
| 171 | 4. `reference_set_registry` / `reference_set_member` | ||
| 172 | 5. `recording_asset` | ||
| 173 | 6. `audio_window` | ||
| 174 | |||
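| 175 | ### Why read the scope summary | ||
| 176 | |||
| 177 | Because the core of the first stage of Phase-1 is not to extract embeddings immediately but to first establish: | ||
| 178 | |||
| 179 | - which reference set this job targets | ||
| 180 | - how many recordings are involved | ||
| 181 | - how many ready assets are involved | ||
| 182 | - how many active windows are involved | ||
| 183 | |||
| 184 | Later work on: | ||
| 185 | - sharding | ||
| 186 | - parallelism | ||
| 187 | - retries | ||
| 188 | - SLA estimation | ||
| 189 | |||
| 190 | then has a stable baseline. | ||
| 191 | |||
| 192 | --- | ||
| 193 | |||
| 194 | ## 5. What the current dry-run actually means | ||
| 195 | |||
| 196 | The workers do not yet call a model for feature extraction; what they do is: | ||
| 197 | |||
| 198 | 1. Verify that planner command templates can actually be consumed | ||
| 199 | 2. Verify the job -> feature_set -> model join contract | ||
| 200 | 3. Verify target scope resolution | ||
| 201 | 4. Verify job status write-back to PostgreSQL | ||
| 202 | 5. Keep a stable entry point for real inference in the next step | ||
| 203 | |||
| 204 | So this is not a paper exercise, but rather: | ||
| 205 | |||
| 206 | > **Get the skeleton of the production execution plane working first, then fill in model inference.** | ||
| 207 | |||
| 208 | --- | ||
| 209 | |||
| 210 | ## 6. Recommended execution order | ||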
| 211 | |||
| 212 | ```mermaid | ||
| 213 | flowchart TD | ||
| 214 | A[bootstrap model/feature/reference registry] --> B[bootstrap feature_extraction_job] | ||
| 215 | B --> C[plan pending jobs] | ||
| 216 | C --> D[run worker dry-run] | ||
| 217 | D --> E[validate status transitions] | ||
| 218 | E --> F[replace dry-run with real extractor] | ||
| 219 | ``` | ||
| 220 | |||
| 221 | --- | ||
| 222 | |||
| 223 | ## 7. Future replacement points for the exact lane and semantic lane | ||
| 224 | |||
| 225 | ### 7.1 Chromaprint worker | ||
| 226 | |||
| 227 | Later, put the following logic into `run_chromaprint_job.py`: | ||
| 228 | |||
| 229 | 1. Read `recording_asset` | ||
| 230 | 2. Read the available audio and extract the exact-lane hash | ||
| 231 | 3. Write the artifact JSON | ||
| 232 | 4. Write `audio_fingerprint` | ||
| 233 | 5. Update `output_count` | ||
| 234 | 6. Mark `completed` | ||
| 235 | |||
| 236 | ### Actual current state of the exact lane | ||
| 237 | |||
| 238 | This round moved `run_chromaprint_job.py` beyond dry-run only: | ||
| 239 | |||
| 240 | - If the source audio is readable: | ||
| 241 | - Generate a repo-local chromaprint-style hash artifact | ||
| 242 | - Write to `audio_fingerprint` | ||
| 243 | - If the source audio is not readable: | ||
| 244 | - Explicitly mark the job as `failed` | ||
| 245 | - Write `failure_reason`, `missing_asset_count`, and `missing_asset_samples` back to PostgreSQL | ||
| 246 | |||
| 247 | ### Current failure semantics | ||
| 248 | |||
| 249 | The exact lane currently uses **all succeed, otherwise fail**: | ||
| 250 | |||
| 251 | - If any asset in scope: | ||
| 252 | - has a missing file | ||
| 253 | - fails to decode | ||
| 254 | - fails hash extraction | ||
| 255 | |||
| 256 | the whole job is marked: | ||
| 257 | |||
| 258 | - `job_status = failed` | ||
| 259 | - `failure_reason = unreadable_audio_assets` | ||
| 260 | |||
| 261 | This prevents a partial success from being disguised as `completed`. | ||
| 262 | |||
| 263 | ### Current dependency strategy | ||
| 264 | |||
| 265 | The exact lane no longer hard-depends on `librosa`: | ||
| 266 | |||
| 267 | - Prefer `librosa` when it is available in the environment | ||
| 268 | - Otherwise fall back to: | ||
| 269 | - Python `wave` | ||
| 270 | - `numpy` linear resampling | ||
| 271 | - `numpy` FFT spectrogram | ||
| 272 | |||
| 273 | This lets the worker contract keep working in a leaner runtime environment. | ||
| 274 | |||
| 275 | ### Current idempotency protection | ||
| 276 | |||
| 277 | `audio_fingerprint` now has: | ||
| 278 | |||
| 279 | - `UNIQUE(feature_set_id, asset_id)` | ||
| 280 | |||
| 281 | The worker write was changed accordingly to: | ||
| 282 | |||
| 283 | - `INSERT ... ON CONFLICT DO UPDATE` | ||
| 284 | |||
| 285 | As a result, repeated exact-lane writes for the same `(feature_set_id, asset_id)` no longer rely on an application-level check-then-write. | ||
| 286 | |||
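The unique-key-plus-upsert pattern above can be tried out with SQLite, which accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form (SQLite 3.24 or newer). In this sketch only `feature_set_id` and `asset_id` come from the document; the `fp_hash` column, the reduced table definition, and the helper name are hypothetical.

```python
import sqlite3


def upsert_fingerprint(conn, feature_set_id, asset_id, fp_hash):
    """Insert a fingerprint row, or overwrite it if the key already exists."""
    conn.execute(
        "INSERT INTO audio_fingerprint (feature_set_id, asset_id, fp_hash) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (feature_set_id, asset_id) "
        "DO UPDATE SET fp_hash = excluded.fp_hash",
        (feature_set_id, asset_id, fp_hash),
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audio_fingerprint ("
        "feature_set_id INTEGER NOT NULL, "
        "asset_id INTEGER NOT NULL, "
        "fp_hash TEXT NOT NULL, "  # hypothetical payload column
        "UNIQUE (feature_set_id, asset_id))"
    )
    upsert_fingerprint(conn, 2, 101, "hash-v1")
    upsert_fingerprint(conn, 2, 101, "hash-v2")  # same key: updates in place
    rows = conn.execute(
        "SELECT feature_set_id, asset_id, fp_hash FROM audio_fingerprint"
    ).fetchall()
    conn.close()
    return rows
```

Re-running a job then rewrites the existing row instead of failing on the unique constraint or creating duplicates.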
| 287 | ### 7.2 Embedding worker | ||
| 288 | |||
| 289 | `run_embedding_job.py` is no longer a plain dry-run. It now: | ||
| 290 | |||
| 291 | 1. Reads the real `reference_set -> audio_window -> recording_asset` scope | ||
| 292 | 2. Checks that the target vector table exists and matches the dimension | ||
| 293 | 3. Checks that the model runtime dependencies are all present | ||
| 294 | 4. Checks that the source audio exists | ||
| 295 | 5. Writes blockers explicitly back to `feature_extraction_job.metadata_json` | ||
| 296 | 6. Marks the job as `failed` whenever a blocker exists | ||
| 297 | |||
| 298 | ### Current failure semantics | ||
| 299 | |||
| 300 | The semantic lane currently uses **preflight all-or-nothing**: | ||
| 301 | |||
| 302 | - If any audio path in scope is unreachable or the file does not exist, record: | ||
| 303 | - `unreadable_audio_assets` | ||
| 304 | - If the model runtime dependencies cannot be imported, record: | ||
| 305 | - `model_runtime_unavailable` | ||
| 306 | - If the target vector table is invalid, missing, or has a mismatched dimension, record the corresponding blocker | ||
| 307 | |||
| 308 | The worker aggregates these blockers into: | ||
| 309 | |||
| 310 | - `failure_reason = preflight_failed` | ||
| 311 | - `preflight_blockers = [...]` | ||
| 312 | |||
| 313 | This avoids recording a model that cannot run as completed, and avoids surfacing only the first error. | ||
| 314 | |||
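A minimal sketch of the collect-all-blockers preflight described above. The blocker names are taken from this document; the function signature, the `VECTOR_TABLE_DIMS` allowlist, and the shape of the returned dictionary are assumptions, not the actual implementation in `run_embedding_job.py`.

```python
import importlib.util
import os

# Hypothetical allowlist of vector tables and their dimensions.
VECTOR_TABLE_DIMS = {
    "audio_embedding_vector_768": 768,
    "audio_embedding_vector_192": 192,
}


def preflight(audio_paths, required_modules, vector_table, expected_dim, tables_in_schema):
    """Run every check and collect all blockers instead of stopping at the first."""
    blockers = []

    if any(not os.path.isfile(path) for path in audio_paths):
        blockers.append("unreadable_audio_assets")

    # find_spec returns None for a top-level module that is not installed.
    missing = [name for name in required_modules if importlib.util.find_spec(name) is None]
    if missing:
        blockers.append("model_runtime_unavailable")

    if vector_table not in VECTOR_TABLE_DIMS:
        blockers.append("vector_table_not_allowlisted")
    elif VECTOR_TABLE_DIMS[vector_table] != expected_dim:
        blockers.append("vector_table_dim_mismatch")
    elif vector_table not in tables_in_schema:
        blockers.append("vector_table_missing_in_schema")

    if blockers:
        return {
            "job_status": "failed",
            "failure_reason": "preflight_failed",
            "preflight_blockers": blockers,
            "missing_dependencies": missing,
        }
    return {"job_status": "running", "preflight_blockers": [], "missing_dependencies": []}
```

Collecting every blocker in one pass means a single failed run reports the missing audio, the missing runtime, and the table problem together.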
| 315 | ### Current negative-case evidence for vector tables | ||
| 316 | |||
| 317 | Beyond the normal existence check for `audio_embedding_vector_768`, this round added three kinds of live negative cases for the semantic lane: | ||
| 318 | |||
| 319 | - `audio_embedding_vector_192` -> `vector_table_dim_mismatch` | ||
| 320 | - `audio_embedding_vector_1024` -> `vector_table_not_allowlisted` | ||
| 321 | - An isolated schema missing `audio_embedding_vector_768` -> `vector_table_missing_in_schema` | ||
| 322 | |||
| 323 | Corresponding artifacts: | ||
| 324 | |||
| 325 | - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py` | ||
| 326 | - `acr-engine/data/pgvector_eval/music20/embedding_vector_table_negative_matrix_report.json` | ||
| 327 | |||
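| 328 | This shows that the semantic worker fails not only when the environment lacks dependencies; it also records a **misconfigured vector table** precisely. | ||
| 329 | |||
| 330 | ### Current live evidence | ||
| 331 | |||
| 332 | The MERT 5s/2.5s job (`extraction_job_id=2`) has been verified live on `acr_test`: | ||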
| 333 | |||
| 334 | - `scope_window_count = 20` | ||
| 335 | - `job_status = failed` | ||
| 336 | - `output_count = 0` | ||
| 337 | - `preflight_blockers = ['unreadable_audio_assets', 'model_runtime_unavailable']` | ||
| 338 | - `runtime_report.missing_dependencies = ['torch', 'torchaudio', 'transformers']` | ||
| 339 | - `audio_embedding_vector_768` passed the existence and dimension checks | ||
| 340 | |||
| 341 | Corresponding artifacts: | ||
| 342 | |||
| 343 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_attempt.json` | ||
| 344 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_guard_report.json` | ||
| 345 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_post_state.json` | ||
| 346 | |||
| 347 | ### Current idempotency protection | ||
| 348 | |||
| 349 | To support real window embedding upserts later, `audio_embedding` now has two unique keys: | ||
| 350 | |||
| 351 | - `UNIQUE(feature_set_id, window_id) WHERE window_id IS NOT NULL` | ||
| 352 | - `UNIQUE(feature_set_id, asset_id) WHERE window_id IS NULL AND asset_id IS NOT NULL` | ||
| 353 | |||
| 354 | Once a real encoder is wired in, this allows direct: | ||
| 355 | |||
| 356 | - window-level embedding upsert | ||
| 357 | - asset-level embedding upsert | ||
| 358 | |||
| 359 | without a check-then-write. | ||
| 360 | |||
| 361 | Of these two unique keys, the asset-level path already has live evidence: | ||
| 362 | |||
| 363 | - `scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 364 | - `audio_embedding_asset_upsert_live_report.json` | ||
| 365 | |||
| 366 | Verified: | ||
| 367 | |||
| 368 | - A duplicate `INSERT` is rejected by `uq_audio_embedding_feature_asset` | ||
| 369 | - `ON CONFLICT ... DO UPDATE` reuses the same `embedding_id` | ||
| 370 | - The row counts of `audio_embedding` / `audio_embedding_vector_192` both stay at `1` | ||
| 371 | |||
| 372 | ### Next replacement point | ||
| 373 | |||
| 374 | Once the runtime and audio mounts are in place, only the guarded failure path needs to be replaced with real inference: | ||
| 375 | |||
| 376 | 1. Load `MERT` / `MuQ` / `ECAPA` | ||
| 377 | 2. Extract vectors | ||
| 378 | 3. Write `audio_embedding` | ||
| 379 | 4. Write `audio_embedding_vector_<dim>` | ||
| 380 | 5. Update `output_count` | ||
| 381 | 6. Mark `completed` | ||
| 382 | |||
| 383 | In other words, **the PostgreSQL worker contract is now fixed; the next thing to change is the encoder adapter, not the orchestration structure.** | ||
| 384 | |||
| 385 | --- | ||
| 386 | |||
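| 387 | ## 8. What problems this solved | ||
| 388 | |||
| 389 | Landing the worker contract mainly solved four problems: | ||
| 390 | |||
| 391 | 1. **The planner is no longer just a plan on paper** | ||
| 392 | 2. **Job status now has a real advancer** | ||
| 393 | 3. **Swapping models later does not require redoing orchestration** | ||
| 394 | 4. **The execution chain can be validated with a dry-run before heavy models are plugged in** | ||
| 395 | |||
| 396 | --- | ||
| 397 | |||
| 398 | ## 9. Current limits | ||
| 399 | |||
| 400 | Not yet done: | ||
| 401 | |||
| 402 | - The exact lane has a real write path, but the live environment is still blocked by the missing `/workspace/downloads` | ||
| 403 | - The semantic lane has a real preflight failure contract but is not yet wired to a real `MERT / MuQ / ECAPA` inference adapter | ||
| 404 | - Retry policy for `failed` jobs | ||
| 405 | - Job sharding executor | ||
| 406 | - A more complete embedding artifact / checksum governance policy | ||
| 407 | |||
| 408 | But this is already enough to support the next stage: | ||
| 409 | |||
| 410 | > **Wire the real extractor onto the already verified PostgreSQL worker contract.** | ||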
| 1 | # PostgreSQL DB Schema Samples / fusion-first DDL draft and query samples | 1 | # PostgreSQL Schema Samples / song-centric 4-table DDL and samples |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-04 | 3 | > Updated: 2026-06-04 |
| 4 | > Goal: turn the current **song-centric + fusion-first** design into a PostgreSQL DDL draft that can be reviewed directly and built on. | ||
| 5 | > SQL file: [`acr-engine/sql/acr_pg_schema_songcentric_v1.sql`](../acr-engine/sql/acr_pg_schema_songcentric_v1.sql) | 4 | > SQL file: [`acr-engine/sql/acr_pg_schema_songcentric_v1.sql`](../acr-engine/sql/acr_pg_schema_songcentric_v1.sql) |
| 6 | > live smoke: [`acr-engine/scripts/smoke_songcentric_schema_live.py`](../acr-engine/scripts/smoke_songcentric_schema_live.py) | ||
| 7 | 5 | ||
| 8 | --- | 6 | --- |
| 9 | 7 | ||
| 10 | ## One-page summary | 8 | ## 1. One-page summary |
| 11 | 9 | ||
| 12 | The current default physical model has only 4 tables: | 10 | Current default physical model: |
| 13 | 11 | ||
| 14 | ```text | 12 | ```text |
| 15 | media_entity -> audio_object -> feature_fact -> set_membership | 13 | media_entity -> audio_object -> feature_fact -> set_membership |
| 16 | ``` | 14 | ``` |
| 17 | 15 | ||
| 18 | Corresponding logical semantics: | 16 | Current default logical semantics: |
| 19 | 17 | ||
| 20 | ```text | 18 | ```text |
| 21 | song -> asset -> window -> fingerprint / embedding | 19 | song -> asset -> window -> fingerprint / embedding |
| 22 | ``` | 20 | ``` |
| 23 | 21 | ||
| 24 | Where: | 22 | Where: |
| 25 | - `media_entity`: currently holds only `song` by default | 23 | - `audio_object` holds both original audio and windows |
| 26 | - `audio_object`: holds both `asset` and `window` | 24 | - `feature_fact` holds both exact and semantic features |
| 27 | - `feature_fact`: holds both `fingerprint` and `embedding` | 25 | - `set_membership` holds reference/eval/hot set relationships |
| 28 | - `set_membership`: holds set relationships such as `reference / hot / eval` | ||
| 29 | 26 | ||
| 30 | --- | 27 | --- |
| 31 | 28 | ||
| 32 | ## 1. What each of the 4 tables stores | 29 | ## 2. Which table windows / models / features land in |
| 33 | 30 | ||
| 34 | | Table | Current main types | What it stores | Why it exists | | 31 | | Object | Table | Key fields | Example | |
| 35 | |---|---|---|---| | 32 | |---|---|---|---| |
| 36 | | `media_entity` | `song` | Song master entity | Final attribution target is `song_id` | | 33 | | song | `media_entity` | `entity_type='song'` | `song_000001` | |
| 37 | | `audio_object` | `asset`, `window` | Original audio files + windows | One song can have multiple audio files; windows still need evidence | | 34 | | asset | `audio_object` | `object_type='asset'` | Original wav/mp3/flac of a song | |
| 38 | | `feature_fact` | `fingerprint`, `embedding` | Model, feature set, feature results | Unifies exact/semantic feature facts | | 35 | | window | `audio_object` | `object_type='window'` | `0-5000ms`, `2500-7500ms` | |
| 39 | | `set_membership` | `reference_set`, `eval_set`, `hot_set` | Which member belongs to which set | Manages reference and evaluation scope | | 36 | | fingerprint | `feature_fact` | `feature_type='fingerprint'` | chromaprint | |
| 37 | | embedding | `feature_fact` | `feature_type='embedding'` | MERT/MuQ/fallback vector | | ||
| 38 | | model | `feature_fact` | `model_name`, `model_version` | `mert-v1-95m`, `muq-base`, `local_wavehash_embed` | | ||
| 39 | | feature set | `feature_fact` | `feature_set_name`, `feature_schema_ver` | `mert_5s_hop2.5_v1` | | ||
| 40 | 40 | ||
| 41 | --- | 41 | --- |
| 42 | 42 | ||
| 43 | ## 2. Currently recommended DDL draft | 43 | ## 3. DDL |
| 44 | 44 | ||
| 45 | ### 2.1 `media_entity` | 45 | ### 3.1 `media_entity` |
| 46 | 46 | ||
| 47 | ```sql | 47 | ```sql |
| 48 | create table if not exists media_entity ( | 48 | create table if not exists media_entity ( |
| ... | @@ -62,16 +62,9 @@ create table if not exists media_entity ( | ... | @@ -62,16 +62,9 @@ create table if not exists media_entity ( |
| 62 | constraint fk_media_entity_parent | 62 | constraint fk_media_entity_parent |
| 63 | foreign key (parent_entity_id) references media_entity(entity_id) | 63 | foreign key (parent_entity_id) references media_entity(entity_id) |
| 64 | ); | 64 | ); |
| 65 | |||
| 66 | create unique index if not exists uq_media_entity_song_biz_key | ||
| 67 | on media_entity(entity_type, biz_key) | ||
| 68 | where biz_key is not null; | ||
| 69 | |||
| 70 | create index if not exists idx_media_entity_root_song | ||
| 71 | on media_entity(root_song_id); | ||
| 72 | ``` | 65 | ``` |
| 73 | 66 | ||
| 74 | ### 2.2 `audio_object` | 67 | ### 3.2 `audio_object` |
| 75 | 68 | ||
| 76 | ```sql | 69 | ```sql |
| 77 | create table if not exists audio_object ( | 70 | create table if not exists audio_object ( |
| ... | @@ -99,23 +92,9 @@ create table if not exists audio_object ( | ... | @@ -99,23 +92,9 @@ create table if not exists audio_object ( |
| 99 | or (object_type = 'window' and parent_object_id is not null) | 92 | or (object_type = 'window' and parent_object_id is not null) |
| 100 | ) | 93 | ) |
| 101 | ); | 94 | ); |
| 102 | |||
| 103 | create index if not exists idx_audio_object_song_type | ||
| 104 | on audio_object(song_id, object_type); | ||
| 105 | |||
| 106 | create index if not exists idx_audio_object_parent | ||
| 107 | on audio_object(parent_object_id); | ||
| 108 | |||
| 109 | create unique index if not exists uq_audio_object_asset_checksum | ||
| 110 | on audio_object(song_id, checksum) | ||
| 111 | where object_type = 'asset' and checksum is not null; | ||
| 112 | |||
| 113 | create unique index if not exists uq_audio_object_window_range | ||
| 114 | on audio_object(parent_object_id, start_ms, end_ms) | ||
| 115 | where object_type = 'window'; | ||
| 116 | ``` | 95 | ``` |
| 117 | 96 | ||
| 118 | ### 2.3 `feature_fact` | 97 | ### 3.3 `feature_fact` |
| 119 | 98 | ||
| 120 | ```sql | 99 | ```sql |
| 121 | create table if not exists feature_fact ( | 100 | create table if not exists feature_fact ( |
| ... | @@ -142,23 +121,9 @@ create table if not exists feature_fact ( | ... | @@ -142,23 +121,9 @@ create table if not exists feature_fact ( |
| 142 | or (feature_type = 'embedding' and (embedding_uri is not null or vector_table_name is not null)) | 121 | or (feature_type = 'embedding' and (embedding_uri is not null or vector_table_name is not null)) |
| 143 | ) | 122 | ) |
| 144 | ); | 123 | ); |
| 145 | |||
| 146 | create index if not exists idx_feature_fact_object_type | ||
| 147 | on feature_fact(object_id, feature_type); | ||
| 148 | |||
| 149 | create index if not exists idx_feature_fact_song_type | ||
| 150 | on feature_fact(song_id, feature_type); | ||
| 151 | |||
| 152 | create unique index if not exists uq_feature_fact_embedding | ||
| 153 | on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type) | ||
| 154 | where feature_type = 'embedding'; | ||
| 155 | |||
| 156 | create unique index if not exists uq_feature_fact_fingerprint | ||
| 157 | on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type) | ||
| 158 | where feature_type = 'fingerprint'; | ||
| 159 | ``` | 124 | ``` |
| 160 | 125 | ||
| 161 | ### 2.4 `set_membership` | 126 | ### 3.4 `set_membership` |
| 162 | 127 | ||
| 163 | ```sql | 128 | ```sql |
| 164 | create table if not exists set_membership ( | 129 | create table if not exists set_membership ( |
| ... | @@ -174,285 +139,245 @@ create table if not exists set_membership ( | ... | @@ -174,285 +139,245 @@ create table if not exists set_membership ( |
| 174 | created_at timestamptz not null default now(), | 139 | created_at timestamptz not null default now(), |
| 175 | updated_at timestamptz not null default now() | 140 | updated_at timestamptz not null default now() |
| 176 | ); | 141 | ); |
| 177 | |||
| 178 | create unique index if not exists uq_set_membership_unique | ||
| 179 | on set_membership(set_type, set_name, member_type, member_id); | ||
| 180 | |||
| 181 | create index if not exists idx_set_membership_set_lookup | ||
| 182 | on set_membership(set_type, set_name, is_active, priority); | ||
| 183 | ``` | 142 | ``` |
| 184 | 143 | ||
| 185 | --- | 144 | --- |
| 186 | 145 | ||
| 187 | ## 3. Which table slices / models / features land in | 146 | ## 4. Typical write flow diagrams |
| 188 | |||
| 189 | | Object | Table | Key fields | | ||
| 190 | |---|---|---| | ||
| 191 | | song | `media_entity` | `entity_type='song'` | | ||
| 192 | | Raw audio | `audio_object` | `object_type='asset'` | | ||
| 193 | | Slice window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | | ||
| 194 | | Fingerprint feature | `feature_fact` | `feature_type='fingerprint'` | | ||
| 195 | | Embedding feature | `feature_fact` | `feature_type='embedding'` | | ||
| 196 | | Model name/version | `feature_fact` | `model_name`, `model_version` | | ||
| 197 | | feature set | `feature_fact` | `feature_set_name`, `feature_schema_ver` | | ||
| 198 | | Reference set membership | `set_membership` | `set_type='reference_set'` | | ||
| 199 | |||
| 200 | --- | ||
| 201 | |||
| 202 | ## 4. Flow diagrams | ||
| 203 | 147 | ||
| 204 | ### 4.1 Ingestion flow | 148 | ### 4.1 Table write order |
| 205 | 149 | ||
| 206 | ```mermaid | 150 | ```mermaid |
| 207 | flowchart TD | 151 | flowchart TD |
| 208 | A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset] | 152 | A[insert song] --> B[insert asset] |
| 209 | B --> C[audio_object\nobject_type=window] | 153 | B --> C[insert windows] |
| 210 | C --> D1[feature_fact\nfeature_type=fingerprint] | 154 | C --> D1[insert fingerprint facts] |
| 211 | C --> D2[feature_fact\nfeature_type=embedding] | 155 | C --> D2[insert embedding facts] |
| 212 | B --> E[set_membership\nreference_set] | 156 | A --> E[insert set_membership] |
| 157 | B --> E | ||
| 213 | C --> E | 158 | C --> E |
| 214 | ``` | 159 | ``` |
| 215 | 160 | ||
| 216 | ### 4.2 Query traceback flow | 161 | ### 4.2 Query traceback order |
| 217 | 162 | ||
| 218 | ```mermaid | 163 | ```mermaid |
| 219 | flowchart LR | 164 | flowchart LR |
| 220 | A[feature_fact] --> B[audio_object window] | 165 | A[query features] --> B[feature_fact] |
| 221 | B --> C[audio_object asset] | 166 | B --> C[window] |
| 222 | C --> D[media_entity song] | 167 | C --> D[asset] |
| 168 | D --> E[song] | ||
| 223 | ``` | 169 | ``` |
| 224 | 170 | ||
| 225 | ### 4.3 Write sequence diagram | 171 | --- |
| 226 | |||
| 227 | ```mermaid | ||
| 228 | sequenceDiagram | ||
| 229 | participant ING as Ingest/Extract Job | ||
| 230 | participant DB as PostgreSQL | ||
| 231 | |||
| 232 | ING->>DB: insert media_entity(song) | ||
| 233 | ING->>DB: insert audio_object(asset) | ||
| 234 | ING->>DB: insert audio_object(window) | ||
| 235 | ING->>DB: insert feature_fact(fingerprint) | ||
| 236 | ING->>DB: insert feature_fact(embedding) | ||
| 237 | ING->>DB: insert set_membership(reference_set) | ||
| 238 | ``` | ||
| 239 | 172 | ||
| 173 | ## 5. Sample data | ||
| 240 | 174 | ||
| 241 | ### 4.4 Phase-1 bootstrap flow | 175 | ### 5.1 Write a song |
| 242 | 176 | ||
| 243 | ```mermaid | 177 | ```sql |
| 244 | flowchart TD | 178 | insert into media_entity ( |
| 245 | A[bootstrap_songcentric_phase1_live.py] --> B[media_entity song x N] | 179 | entity_type, biz_key, title, artist_name, metadata_json |
| 246 | B --> C[audio_object asset x N] | 180 | ) values ( |
| 247 | C --> D[audio_object window x N] | 181 | 'song', 'song_000001', 'Song Alpha', 'Artist A', |
| 248 | D --> E1[feature_fact fingerprint x N] | 182 | '{"source":"catalog_import","language":"zh"}'::jsonb |
| 249 | D --> E2[feature_fact embedding x N] | 183 | ) |
| 250 | C --> F[set_membership reference_set x N] | 184 | returning entity_id; |
| 251 | ``` | 185 | ``` |
| 252 | 186 | ||
| 253 | Current live bootstrap script: [`acr-engine/scripts/bootstrap_songcentric_phase1_live.py`](../acr-engine/scripts/bootstrap_songcentric_phase1_live.py) | 187 | ### 5.2 Write an asset |
| 254 | |||
| 255 | 188 | ||
| 256 | ### 4.5 Manifest import flow | 189 | ```sql |
| 257 | 190 | insert into audio_object ( | |
| 258 | ```mermaid | 191 | object_type, song_id, source_type, storage_uri, storage_scheme, |
| 259 | flowchart TD | 192 | checksum, codec, sample_rate, channels, duration_ms, metadata_json |
| 260 | A[songcentric_manifest_sample.jsonl] --> B[import_songcentric_manifest_live.py] | 193 | ) values ( |
| 261 | B --> C[media_entity song] | 194 | 'asset', :song_id, 'catalog_master', |
| 262 | B --> D[audio_object asset] | 195 | 's3://bucket/song_alpha/clip1.wav', 's3', |
| 263 | B --> E[audio_object window x N] | 196 | 'sha256:asset001', 'wav', 44100, 2, 183000, |
| 264 | B --> F[feature_fact] | 197 | '{"uploader":"pipeline_v1"}'::jsonb |
| 265 | B --> G[set_membership] | 198 | ) |
| 199 | returning object_id; | ||
| 266 | ``` | 200 | ``` |
| 267 | 201 | ||
| 268 | Current sample manifest: [`acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl`](../acr-engine/data/pgvector_eval/music20/songcentric_manifest_sample.jsonl) | 202 | ### 5.3 Write windows |
| 269 | Current import script: [`acr-engine/scripts/import_songcentric_manifest_live.py`](../acr-engine/scripts/import_songcentric_manifest_live.py) | ||
| 270 | 203 | ||
| 271 | Current sample manifest with features: [`acr-engine/data/pgvector_eval/music20/songcentric_feature_manifest_sample.jsonl`](../acr-engine/data/pgvector_eval/music20/songcentric_feature_manifest_sample.jsonl) | 204 | ```sql |
| 272 | 205 | insert into audio_object ( | |
| 273 | 206 | object_type, song_id, parent_object_id, | |
| 274 | ### 4.6 Building a manifest from a real directory | 207 | start_ms, end_ms, duration_ms, metadata_json |
| 275 | 208 | ) values | |
| 276 | ```mermaid | 209 | ('window', :song_id, :asset_id, 0, 5000, 5000, '{"hop_ms":2500}'::jsonb), |
| 277 | flowchart TD | 210 | ('window', :song_id, :asset_id, 2500, 7500, 5000, '{"hop_ms":2500}'::jsonb) |
| 278 | A[real audio directory] --> B[build_songcentric_manifest_from_directory.py] | 211 | returning object_id; |
| 279 | B --> C[songcentric_directory_manifest.jsonl] | ||
| 280 | C --> D[import_songcentric_manifest_live.py] | ||
| 281 | D --> E[media_entity] | ||
| 282 | D --> F[audio_object] | ||
| 283 | D --> G[set_membership] | ||
| 284 | ``` | 212 | ``` |
| 285 | 213 | ||
| 286 | Current directory build script: [`acr-engine/scripts/build_songcentric_manifest_from_directory.py`](../acr-engine/scripts/build_songcentric_manifest_from_directory.py) | 214 | ### 5.4 Write a fingerprint |
| 287 | |||
| 288 | |||
| 289 | ### 4.7 Enriching a real-directory manifest with features before import | ||
| 290 | 215 | ||
| 291 | ```mermaid | 216 | ```sql |
| 292 | flowchart TD | 217 | insert into feature_fact ( |
| 293 | A[real audio directory] --> B[build_songcentric_manifest_from_directory.py] | 218 | feature_type, object_id, song_id, |
| 294 | B --> C[songcentric_directory_manifest.jsonl] | 219 | model_name, model_version, feature_set_name, feature_schema_ver, |
| 295 | C --> D[enrich_songcentric_manifest_with_local_features.py] | 220 | fingerprint_value, checksum, metadata_json |
| 296 | D --> E[songcentric_directory_manifest_with_features.jsonl] | 221 | ) values ( |
| 297 | E --> F[import_songcentric_manifest_live.py] | 222 | 'fingerprint', :window_id, :song_id, |
| 298 | F --> G[feature_fact] | 223 | 'chromaprint', '1.0', 'chromaprint_5s_v1', 'v1', |
| 224 | 'AQAAE0mUaEkSZSo...', 'sha256:fp001', | ||
| 225 | '{"lane":"exact"}'::jsonb | ||
| 226 | ); | ||
| 299 | ``` | 227 | ``` |
| 300 | 228 | ||
| 301 | Current feature enrichment script: [`acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`](../acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py) | 229 | ### 5.5 Write an embedding |
| 302 | |||
| 303 | |||
| 304 | ### 4.8 Exact-lane upgrade in the directory pipeline | ||
| 305 | 230 | ||
| 306 | `enrich_songcentric_manifest_with_local_features.py` now prefers the in-repo `ChromaprintMatcher` to generate fingerprints, and falls back to `local_wavehash` only when that fails. | 231 | ```sql |
| 307 | 232 | insert into feature_fact ( |
| 308 | Fresh evidence from this run: | 233 | feature_type, object_id, song_id, |
| 309 | - `wav_windows_seen = 5` | 234 | model_name, model_version, feature_set_name, feature_schema_ver, |
| 310 | - `matcher_fingerprint_count = 5` | 235 | embedding_dim, embedding_uri, vector_table_name, checksum, metadata_json |
| 311 | - `fallback_fingerprint_count = 0` | 236 | ) values ( |
| 312 | 237 | 'embedding', :window_id, :song_id, | |
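| 313 | This shows that the exact lane in the current directory pipeline is no longer just a temporary hash; it uses the repository's existing fingerprint extraction first. | 238 | 'mert-v1-95m', 'hf-main', 'mert_5s_hop2.5_v1', 'v1', |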
| 314 | 239 | 768, 's3://bucket/embeddings/song_alpha_win0001.npy', 'audio_embedding_vector_768', | |
| 315 | 240 | 'sha256:emb001', '{"lane":"semantic"}'::jsonb | |
| 316 | ### 4.9 Runtime selection for the semantic lane in the directory pipeline | 241 | ); |
| 317 | |||
| 318 | `enrich_songcentric_manifest_with_local_features.py` currently makes a **runtime-aware** choice for the semantic lane: | ||
| 319 | - If `torch / torchaudio / transformers` are available, an entry point for a real semantic adapter is reserved | ||
| 320 | - If they are not available, it explicitly falls back to `local_wavehash_embed` and records the missing dependencies in the metadata/report | ||
| 321 | |||
| 322 | Fresh evidence from this run: | ||
| 323 | - `semantic_runtime_available = false` | ||
| 324 | - `semantic_runtime_missing = ["torch", "torchaudio", "transformers"]` | ||
| 325 | - `semantic_fallback_count = 5` | ||
| 326 | |||
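| 327 | This shows that the semantic lane on the current host is not yet connected to a real model, but the pipeline already has explicit runtime branching and auditable evidence. | ||
| 328 | |||
| 329 | |||
| 330 | ### 4.10 One-command song-centric directory pipeline runner | ||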
| 331 | |||
| 332 | ```mermaid | ||
| 333 | flowchart TD | ||
| 334 | A[run_songcentric_directory_pipeline_live.py] --> B[build manifest] | ||
| 335 | B --> C[enrich features] | ||
| 336 | C --> D[import manifest] | ||
| 337 | D --> E[runner report] | ||
| 338 | ``` | 242 | ``` |
| 339 | 243 | ||
| 340 | Current runner: [`acr-engine/scripts/run_songcentric_directory_pipeline_live.py`](../acr-engine/scripts/run_songcentric_directory_pipeline_live.py) | 244 | ### 5.6 Write set membership |
| 341 | 245 | ||
| 342 | A single command outputs: | 246 | ```sql |
| 343 | - Directory scan results | 247 | insert into set_membership ( |
| 344 | - Whether the exact lane went through `ChromaprintMatcher` | 248 | set_type, set_name, member_type, member_id, song_id, priority, metadata_json |
| 345 | - Whether the semantic lane is runtime-ready | 249 | ) values |
| 346 | - Counts and lineage samples after the live PostgreSQL import | 250 | ('reference_set', 'phase1_hot_reference_v1', 'song', :song_id, :song_id, 100, '{}'::jsonb), |
| 251 | ('reference_set', 'phase1_hot_reference_v1', 'asset', :asset_id, :song_id, 100, '{}'::jsonb), | ||
| 252 | ('reference_set', 'phase1_hot_reference_v1', 'window', :window_id, :song_id, 100, '{}'::jsonb); | ||
| 253 | ``` | ||
| 347 | 254 | ||
| 348 | --- | 255 | --- |
| 349 | 256 | ||
| 350 | ## 5. Most common SQL examples | 257 | ## 6. Typical queries |
| 351 | 258 | ||
| 352 | ### 5.1 Write a song | 259 | ### 6.1 List the assets of a song |
| 353 | 260 | ||
| 354 | ```sql | 261 | ```sql |
| 355 | insert into media_entity (entity_type, biz_key, title, artist_name) | 262 | select object_id, storage_uri, checksum, duration_ms |
| 356 | values ('song', 'song-10001', 'Song 10001', 'Artist A') | 263 | from audio_object |
| 357 | returning entity_id; | 264 | where song_id = :song_id |
| 265 | and object_type = 'asset' | ||
| 266 | order by object_id; | ||
| 358 | ``` | 267 | ``` |
| 359 | 268 | ||
| 360 | ### 5.2 Write an asset | 269 | ### 6.2 List the windows cut from an asset |
| 361 | 270 | ||
| 362 | ```sql | 271 | ```sql |
| 363 | insert into audio_object ( | 272 | select object_id, start_ms, end_ms, duration_ms |
| 364 | object_type, song_id, source_type, storage_uri, storage_scheme, | 273 | from audio_object |
| 365 | checksum, codec, sample_rate, channels, duration_ms | 274 | where parent_object_id = :asset_id |
| 366 | ) values ( | 275 | and object_type = 'window' |
| 367 | 'asset', :song_id, 'official', 's3://bucket/song10001/master.wav', 's3', | 276 | order by start_ms; |
| 368 | 'sha256:xxx', 'wav', 44100, 2, 215000 | ||
| 369 | ) returning object_id; | ||
| 370 | ``` | 277 | ``` |
| 371 | 278 | ||
| 372 | ### 5.3 Write a window | 279 | ### 6.3 List the models that have encoded a window |
| 373 | 280 | ||
| 374 | ```sql | 281 | ```sql |
| 375 | insert into audio_object ( | 282 | select feature_type, model_name, model_version, feature_set_name, embedding_dim, |
| 376 | object_type, song_id, parent_object_id, start_ms, end_ms, duration_ms | 283 | fingerprint_value, embedding_uri, vector_table_name |
| 377 | ) values ( | 284 | from feature_fact |
| 378 | 'window', :song_id, :asset_id, 30000, 35000, 5000 | 285 | where object_id = :window_id |
| 379 | ) returning object_id; | 286 | order by feature_type, model_name, model_version; |
| 380 | ``` | 287 | ``` |
| 381 | 288 | ||
| 382 | ### 5.4 Write an embedding | 289 | ### 6.4 Trace a feature back to its song |
| 383 | 290 | ||
| 384 | ```sql | 291 | ```sql |
| 385 | insert into feature_fact ( | 292 | select ff.feature_id, |
| 386 | feature_type, object_id, song_id, | 293 | ff.feature_type, |
| 387 | model_name, model_version, feature_set_name, | 294 | ff.model_name, |
| 388 | feature_schema_ver, embedding_dim, embedding_uri, vector_table_name | 295 | w.object_id as window_id, |
| 389 | ) values ( | 296 | w.start_ms, |
| 390 | 'embedding', :window_id, :song_id, | 297 | w.end_ms, |
| 391 | 'mert', 'v1-95m', 'mert_5s_hop2.5_meanpool', | 298 | a.object_id as asset_id, |
| 392 | 'v1', 768, 's3://bucket/emb/song10001_win0001.npy', 'audio_embedding_vector_768' | 299 | a.storage_uri, |
| 393 | ); | 300 | s.entity_id as song_id, |
| 301 | s.title, | ||
| 302 | s.artist_name | ||
| 303 | from feature_fact ff | ||
| 304 | join audio_object w | ||
| 305 | on w.object_id = ff.object_id | ||
| 306 | and w.object_type = 'window' | ||
| 307 | join audio_object a | ||
| 308 | on a.object_id = w.parent_object_id | ||
| 309 | and a.object_type = 'asset' | ||
| 310 | join media_entity s | ||
| 311 | on s.entity_id = ff.song_id | ||
| 312 | where ff.feature_id = :feature_id; | ||
| 394 | ``` | 313 | ``` |
| 395 | 314 | ||
| 396 | ### 5.5 Attach an asset to a reference set | 315 | ### 6.5 Query all windows in a reference set |
| 397 | 316 | ||
| 398 | ```sql | 317 | ```sql |
| 399 | insert into set_membership ( | 318 | select sm.set_name, |
| 400 | set_type, set_name, member_type, member_id, song_id, priority | 319 | sm.member_id as window_id, |
| 401 | ) values ( | 320 | sm.song_id, |
| 402 | 'reference_set', 'phase1_hot_reference_v1', 'asset', :asset_id, :song_id, 100 | 321 | ao.parent_object_id as asset_id, |
| 403 | ); | 322 | ao.start_ms, |
| 323 | ao.end_ms | ||
| 324 | from set_membership sm | ||
| 325 | join audio_object ao | ||
| 326 | on ao.object_id = sm.member_id | ||
| 327 | and sm.member_type = 'window' | ||
| 328 | where sm.set_type = 'reference_set' | ||
| 329 | and sm.set_name = 'phase1_hot_reference_v1' | ||
| 330 | and sm.is_active = true | ||
| 331 | order by sm.song_id, ao.start_ms; | ||
| 404 | ``` | 332 | ``` |
| 405 | 333 | ||
| 406 | ### 5.6 Trace an embedding back to its song | 334 | --- |
| 407 | 335 | ||
| 408 | ```sql | 336 | ## 7. How to read a minimal storage example |
| 409 | select ff.feature_id, | 337 | |
| 410 | ff.model_name, | 338 | ```text |
| 411 | ff.model_version, | 339 | song(Song Alpha) |
| 412 | ff.feature_set_name, | 340 | -> asset(clip1.wav) |
| 413 | win.object_id as window_id, | 341 | -> window(0-5000ms) |
| 414 | ast.object_id as asset_id, | 342 | -> fingerprint(chromaprint) |
| 415 | song.entity_id as song_id, | 343 | -> embedding(mert-v1-95m) |
| 416 | song.title, | 344 | -> window(2500-7500ms) |
| 417 | song.artist_name | 345 | -> fingerprint(chromaprint) |
| 418 | from feature_fact ff | 346 | -> embedding(mert-v1-95m) |
| 419 | join audio_object win | ||
| 420 | on win.object_id = ff.object_id | ||
| 421 | and win.object_type = 'window' | ||
| 422 | join audio_object ast | ||
| 423 | on ast.object_id = win.parent_object_id | ||
| 424 | and ast.object_type = 'asset' | ||
| 425 | join media_entity song | ||
| 426 | on song.entity_id = ff.song_id | ||
| 427 | and song.entity_type = 'song' | ||
| 428 | where ff.feature_id = :feature_id; | ||
| 429 | ``` | 347 | ``` |
| 430 | 348 | ||
| 349 | Once written to the tables, this means: | ||
| 350 | - `song` lives in `media_entity` | ||
| 351 | - `asset/window` both live in `audio_object` | ||
| 352 | - `chromaprint/mert` both live in `feature_fact` | ||
| 353 | - Whether they belong to the hot reference set is recorded in `set_membership` | ||
| 354 | |||
| 431 | --- | 355 | --- |
| 432 | 356 | ||
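| 433 | ## 6. Current design intent | 357 | ## 8. Design intent summary |
| 358 | |||
| 359 | This structure mainly addresses: | ||
| 360 | - Multiple audio files under the same song | ||
| 361 | - Multiple slice windows under the same asset | ||
| 362 | - The same window being encoded by multiple models | ||
| 363 | - Fingerprints and embeddings stored in one place | ||
| 364 | - Unified set governance for reference/eval/hot | ||
| 365 | - Fast attribution to `song_id` after a query | ||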
| 434 | 366 | ||
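| 435 | ### Why slices and raw audio both use `audio_object` | 367 | --- |
| 436 | - Easier for new teammates to understand | ||
| 437 | - asset/window share many fields | ||
| 438 | - Fewer special-purpose tables | ||
| 439 | 368 | ||
| 440 | ### Why models and features both use `feature_fact` | 369 | ## 9. Follow-up evolution points to watch most closely |
| 441 | - No more one table per model | ||
| 442 | - No more sprawl that starts with one fingerprint table and one embedding table | ||
| 443 | - Better suited to swapping in MERT / MuQ / new models later | ||
| 444 | 370 | ||
| 445 | ### Why reference sets use `set_membership` | 371 | 1. Keep the 4-table main chain unchanged |
| 446 | - song / asset / window / feature can all be attached to a set | 372 | 2. Connect a real `MERT` / `MuQ` adapter to the semantic lane |
| 447 | - Switching between reference / eval / hot is handled uniformly | 373 | 3. Keep reusing `feature_fact.model_name/model_version/feature_set_name` for model evolution |
| 374 | 4. Add heavier registry / vector table governance only when needed | ||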
| 448 | 375 | ||
| 449 | --- | 376 | --- |
| 450 | 377 | ||
| 451 | ## 7. Currently recommended implementation order | 378 | ## 10. Related documents |
| 452 | 379 | ||
| 453 | 1. Create `media_entity` first | 380 | - [README.md](./README.md) |
| 454 | 2. Then create `audio_object` | 381 | - [start-here.md](./start-here.md) |
| 455 | 3. Then create `feature_fact` | 382 | - [session-handoff.md](./session-handoff.md) |
| 456 | 4. Finally create `set_membership` | 383 | - [postgresql-data-model.md](./postgresql-data-model.md) |
| 457 | 5. First get `song -> asset -> window -> embedding/fingerprint` working end to end | ||
| 458 | 6. Then continue adding heavier governance capabilities | ... | ... |
| 1 | # PostgreSQL Data Model and DDL Design Notes | 1 | # PostgreSQL Data Model / Current song-centric 4-table scheme |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-04 | 3 | > Updated: 2026-06-04 |
| 4 | > Related SQL: [`acr-engine/sql/acr_pg_schema_v2.sql`](../acr-engine/sql/acr_pg_schema_v2.sql) | 4 | > Related SQL: [`acr-engine/sql/acr_pg_schema_songcentric_v1.sql`](../acr-engine/sql/acr_pg_schema_songcentric_v1.sql) |
| 5 | > Goal: provide a PostgreSQL data dictionary, DDL design intent, flow diagrams, and typical usage paths for copyright protection / large-scale catalogs / swappable encoders. | ||
| 6 | |||
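| 7 | ## One-page summary | ||
| 8 | |||
| 9 | The currently recommended PostgreSQL design is no longer built around "the embedding table of one particular model"; it is built around the following stable main chain: | ||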
| 10 | |||
| 11 | ```text | ||
| 12 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 13 | -> model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint | ||
| 14 | -> reference_set_registry -> retrieval_index_registry -> retrieval_candidate -> match_decision | ||
| 15 | ``` | ||
| 16 | |||
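| 17 | This design addresses: | ||
| 18 | |||
| 19 | 1. **song/work/recording being mixed together** | ||
| 20 | 2. **having to change tables whenever the model changes** | ||
| 21 | 3. **window-level retrieval that cannot be traced back to evidence** | ||
| 22 | 4. **exact / semantic / future cover lanes that cannot be aggregated uniformly** | ||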
| 23 | |||
| 24 | --- | ||
| 25 | |||
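| 26 | ## 0.1 Why the chain feels so long | ||
| 27 | |||
| 28 | In essence, the current document puts **3 kinds of problems** into one overall diagram, so the chain looks long: | ||
| 29 | |||
| 30 | 1. **Business attribution layer**: `canonical_song / work / recording` | ||
| 31 | - Answers "which song_id / work_id / recording_id does this finally belong to" | ||
| 32 | 2. **Physical audio layer**: `recording_asset / audio_window` | ||
| 33 | - Answers "what the actual file is and which retrieval windows it was cut into" | ||
| 34 | 3. **Retrieval computation layer**: `model_registry / feature_set_registry / audio_embedding / audio_fingerprint / retrieval_index_registry` | ||
| 35 | - Answers "which model, which feature set, and which index were used" | ||
| 36 | |||
| 37 | So this is not a single chain but: | ||
| 38 | - an **attribution traceback chain** | ||
| 39 | - an **audio asset chain** | ||
| 40 | - a **feature/index chain** | ||
| 41 | |||
| 42 | Reading the three as one gives the false impression that every query must manually pass through every table. In practice, online retrieval only cares about: | ||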
| 43 | |||
| 44 | ```text | ||
| 45 | window -> candidate -> recording -> song | ||
| 46 | ``` | ||
| 47 | |||
| 48 | --- | ||
| 49 | |||
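| 50 | ## 0.2 Can this be simplified now | ||
| 51 | |||
| 52 | Yes. | ||
| 53 | |||
| 54 | If the goal at the current stage is: | ||
| 55 | |||
| 56 | > Serve the copyright protection scenario first, so that a query hits the correct `song_id` quickly and reliably | ||
| 57 | |||
| 58 | then Phase-1 can be reduced to the following **minimum viable skeleton**: | ||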
| 59 | |||
| 60 | ```text | ||
| 61 | song -> recording -> recording_asset -> audio_window | ||
| 62 | -> audio_fingerprint | ||
| 63 | -> audio_embedding | ||
| 64 | ``` | ||
| 65 | |||
| 66 | To support model replacement, keep one lightweight version registration layer: | ||
| 67 | |||
| 68 | ```text | ||
| 69 | feature_set_registry | ||
| 70 | ``` | ||
| 71 | |||
| 72 | In other words, the minimal Phase-1 main chain can be compressed to: | ||
| 73 | |||
| 74 | ```text | ||
| 75 | song -> recording -> asset -> window -> feature | ||
| 76 | ``` | ||
| 77 | |||
| 78 | Here `feature` is stored concretely as: | ||
| 79 | - `audio_fingerprint` | ||
| 80 | - `audio_embedding` | ||
| 81 | |||
| 82 | --- | ||
| 83 | |||
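| 84 | ## 0.3 Which layers to keep in Phase-1 and which can be de-emphasized | ||
| 85 | |||
| 86 | ### Recommended to keep | ||
| 87 | |||
| 88 | | Layer | Keep? | Reason | | ||
| 89 | |---|---|---| | ||
| 90 | | `song` | Keep | The final object returned to the business | | ||
| 91 | | `recording` | Keep | One song can have multiple versions/recordings | | ||
| 92 | | `recording_asset` | Keep | One recording may have multiple actual files | | ||
| 93 | | `audio_window` | Keep | The smallest unit of computation for retrieval and evidence | | ||
| 94 | | `feature_set_registry` | Keep | Avoids hard-coding embedding/fingerprint as table columns | | ||
| 95 | | `audio_embedding` / `audio_fingerprint` | Keep | The actual retrieval feature fact tables | | ||
| 96 | |||
| 97 | ### Can be de-emphasized or deferred | ||
| 98 | |||
| 99 | | Layer | Current recommendation | Reason | | ||
| 100 | |---|---|---| | ||
| 101 | | `work` | Can be deferred | If only a stable `song_id` is needed for now, the work layer need not be split out explicitly yet | | ||
| 102 | | `canonical_song` | Can be treated as merged with `song` | The current focus is not deep rights-layer governance but first establishing a usable attribution key | | ||
| 103 | | `retrieval_index_registry` | Can be de-emphasized for now | Phase-1 can keep index governance light instead of heavy from the start | | ||
| 104 | | Full `match_decision` audit | Can be filled in gradually | First close the recall loop, then strengthen auditing/explainability | | ||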
| 105 | |||
| 106 | --- | ||
| 107 | |||
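| 108 | ## 0.3.1 Can `recording` and `recording_asset` be merged | ||
| 109 | |||
| 110 | They can, but **only for very early, tightly controlled datasets**. | ||
| 111 | |||
| 112 | ### When a temporary merge is acceptable | ||
| 113 | |||
| 114 | The two can be treated temporarily as one object only when essentially all of the following hold: | ||
| 115 | |||
| 116 | 1. Each `song` has only one usable recording version | ||
| 117 | 2. Each recording has only one audio file | ||
| 118 | 3. No distinction is made between master / distribution / captured / query_sample | ||
| 119 | 4. There is no need to track multiple source files for the same recording | ||
| 120 | 5. There is no need to later add high-bitrate files, masters, or platform versions | ||
| 121 | |||
| 122 | In that case, the model can be read for now as: | ||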
| 123 | |||
| 124 | ```text | ||
| 125 | song -> recording_asset -> audio_window | ||
| 126 | ``` | ||
| 127 | |||
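| 128 | That is, `recording_asset` takes on both the "version object" and the "physical file object" roles. | ||
| 129 | |||
| 130 | ### Why merging is not recommended long term | ||
| 131 | |||
| 132 | Because `recording` and `recording_asset` answer two different questions: | ||
| 133 | |||
| 134 | - `recording` answers: **which recording version is this** | ||
| 135 | - `recording_asset` answers: **which concrete file corresponds to this recording version** | ||
| 136 | |||
| 137 | In a real copyright protection scenario, the following cases become very common: | ||
| 138 | |||
| 139 | 1. **One recording has multiple file versions** | ||
| 140 | For example wav/flac/mp3, different bitrates, exports from different platforms. | ||
| 141 | 2. **One song has multiple recording versions** | ||
| 142 | For example official/live/remaster/short/bgm cut. | ||
| 143 | 3. **One recording needs to be linked to multiple sources** | ||
| 144 | For example platform crawls, business exports, manual backfills. | ||
| 145 | 4. **The query hits an asset, but attribution must land on recording/song** | ||
| 146 | Without separate layers, later aggregation and deduplication get messy. | ||
| 147 | |||
| 148 | ### Current recommended judgment | ||
| 149 | |||
| 150 | For your current target: | ||
| 151 | - about 1,000,000 audio files (`100w`) | ||
| 152 | - about 300,000 songs (`30w`) | ||
| 153 | - aimed at copyright protection / song recognition / version attribution | ||
| 154 | |||
| 155 | **Merging `recording` and `recording_asset` in the formal schema is not recommended.** | ||
| 156 | |||
| 157 | The reasons are straightforward: | ||
| 158 | - The data volume is already substantial | ||
| 159 | - Multi-version, multi-source, multi-file problems will very likely appear later | ||
| 160 | - Dropping a layer now makes later refactoring more expensive | ||
| 161 | |||
| 162 | ### A more pragmatic compromise | ||
| 163 | |||
| 164 | If the current implementation feels like too much mental overhead, you can avoid repeatedly stressing `recording_asset` in product/algorithm discussions and use the following wording instead: | ||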
| 165 | |||
| 166 | ```text | ||
| 167 | song -> recording -> asset -> window -> feature | ||
| 168 | ``` | ||
| 169 | |||
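| 170 | That is: | ||
| 171 | - **Conceptually keep** the two layers `recording` and `asset` | ||
| 172 | - **In communication, abbreviate** to `recording -> asset` | ||
| 173 | - **In implementation, keep separate tables** to avoid future rework | ||
| 174 | |||
| 175 | This is usually the safest compromise for Phase-1. | ||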
| 176 | 5 | ||
| 177 | --- | 6 | --- |
| 178 | 7 | ||
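| 179 | ## 0.4 An easier way to think about it | 8 | ## 1. One-page summary |
| 180 | |||
| 181 | It helps to read the current system as the following two core chains: | ||
| 182 | 9 | ||
| 183 | ### Attribution chain | 10 | By default only 4 core physical tables are recognized: |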
| 184 | 11 | ||
| 185 | ```text | 12 | ```text |
| 186 | window -> asset -> recording -> song | 13 | media_entity -> audio_object -> feature_fact -> set_membership |
| 187 | ``` | 14 | ``` |
| 188 | 15 | ||
| 189 | Purpose: | 16 | The logical semantics read as follows: |
| 190 | - After a retrieval hit, trace back to the `song_id` it finally belongs to | ||
| 191 | |||
| 192 | ### Feature chain | ||
| 193 | 17 | ||
| 194 | ```text | 18 | ```text |
| 195 | window -> fingerprint / embedding -> candidate -> aggregate | 19 | song -> asset -> window -> fingerprint / embedding |
| 196 | ``` | 20 | ``` |
| 197 | 21 | ||
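| 198 | Purpose: | 22 | Core value of this design: |
| 199 | - Actually performs recall, scoring, aggregation, and ranking | 23 | - **song-centric**: the final, stable return value is `song_id` |
| 200 | 24 | - **Fusion first**: lowers the first-phase cost of understanding `recording/work/version` |
| 201 | Seen this way, the whole design is no longer "many unnecessary layers" but: | 25 | - **Unified features**: the exact lane and the semantic lane both land in `feature_fact` |
| 202 | - **The attribution layer answers who it is** | 26 | - **Swappable models**: changing `model_name/model_version/feature_set_name` does not require re-splitting the schema |
| 203 | - **The window layer answers which segment was hit** | 27 | - **Traceable evidence**: any recall can be traced back to a specific `window -> asset -> song` |
| 204 | - **The feature layer answers how it was retrieved** | ||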
| 205 | |||
| 206 | --- | ||
| 207 | |||
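| 208 | ## 1. Design intent | ||
| 209 | |||
| 210 | ## 1.1 What this design aims to solve | ||
| 211 | |||
| 212 | ### Problem A: one song may have multiple recording versions | ||
| 213 | So a distinction is required between: | ||
| 214 | - `canonical_song`: the final normalized song for the business | ||
| 215 | - `work`: the work layer | ||
| 216 | - `recording`: a specific recording version | ||
| 217 | |||
| 218 | ### Problem B: one recording may have multiple file assets | ||
| 219 | So there must be: | ||
| 220 | - `recording_asset` | ||
| 221 | |||
| 222 | ### Problem C: retrieval actually hits a segment, not a whole song | ||
| 223 | So there must be: | ||
| 224 | - `audio_window` | ||
| 225 | |||
| 226 | ### Problem D: the underlying model will change in the future | ||
| 227 | So there must be: | ||
| 228 | - `model_registry` | ||
| 229 | - `feature_set_registry` | ||
| 230 | |||
| 231 | ### Problem E: multiple index backends will coexist | ||
| 232 | So there must be: | ||
| 233 | - `retrieval_index_registry` | ||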
| 234 | 28 | ||
| 235 | --- | 29 | --- |
| 236 | 30 | ||
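| 237 | ## 1.2 Why not keep extending prototype tables like "reference_embeddings / query_embeddings" | 31 | ## 2. Why converge on 4 tables now |
| 238 | |||
| 239 | Because the prototype tables have several limitations: | ||
| 240 | |||
| 241 | 1. The dimension is hard-coded, e.g. `vector(192)` | ||
| 242 | 2. The data objects are too flat and revolve only around `song_id` | ||
| 243 | 3. Multiple encoders cannot be supported cleanly | ||
| 244 | 4. Multiple assets, windows, and feature_sets under the same recording cannot be expressed | ||
| 245 | |||
| 246 | So the prototype SQL suits a demo, not your current target of 1,000,000 (100w) audio files. | ||
| 247 | 32 | ||
| 248 | ### Currently recommended simplified framing | 33 | The current goal is not to first build the most complete music copyright knowledge graph, but to first make the following reliable: |
| 249 | 34 | ||
| 250 | If the team is entering Phase-1 implementation, there is no need to treat every table as part of "a complex system that must ship in the first batch". | 35 | > Given a query related to a recording/BGM/clip/cover, quickly locate the `song_id` it most likely corresponds to. |
| 251 | It is better to understand and populate the tables in this order: | ||
| 252 | 36 | ||
| 253 | 1. `song -> recording -> recording_asset -> audio_window` | 37 | The current priorities are therefore: |
| 254 | 2. `feature_set_registry -> audio_fingerprint / audio_embedding` | 38 | 1. Fix `song` as the final attribution object first |
| 255 | 3. `reference_set_registry` and heavier index governance are added afterwards | 39 | 2. Keep `asset` to support multiple audio files under the same `song` |
| 40 | 3. Keep `window` to support slice-level evidence and offsets | ||
| 41 | 4. Use a single `feature_fact` table to hold both fingerprints and embeddings | ||
| 42 | 5. Use a single `set_membership` table to manage reference/eval/hot sets | ||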
| 256 | 43 | ||
| 257 | --- | 44 | --- |
| 258 | 45 | ||
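| 259 | ## 1.2 Change in current business premise: versions do not matter for now, go song-centric first | 46 | ## 3. What each of the 4 tables solves |
| 260 | 47 | ||
| 261 | If the current business constraint is: | 48 | | Table | Current main types | Problem solved | |
| 262 | |||
| 263 | > **A song can have multiple recordings or audio files, but version semantics do not matter for now; everything only needs to resolve reliably to the same `song_id`** | ||
| 264 | |||
| 265 | then the recommended default framing for Phase-1 should converge further to: | ||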
| 266 | |||
| 267 | ```text | ||
| 268 | song -> asset -> window -> feature | ||
| 269 | ``` | ||
| 270 | |||
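| 271 | That is: | ||
| 272 | - `song` is currently the only attribution object that must be returned reliably | ||
| 273 | - Multiple audio files are allowed under the same `song` | ||
| 274 | - These audio files can come from different sources: official, crawled, BGM, clips, query samples, and so on | ||
| 275 | - For now, "recording version differences" are not promoted to a core layer that must be modeled separately | ||
| 276 | |||
| 277 | ### Currently recommended physical implementation | ||
| 278 | |||
| 279 | Under this business premise, the recommendation is to adopt **3+1 fused tables** directly: | ||
| 280 | |||
| 281 | | Physical table | Main types | Current role | | ||
| 282 | |---|---|---| | 49 | |---|---|---| |
| 283 | | `media_entity` | `song` | Holds only the final business attribution object | | 50 | | `media_entity` | `song` | Who the final attribution object is | |
| 284 | | `audio_object` | `asset`, `window` | Holds audio files and slice windows | | 51 | | `audio_object` | `asset`, `window` | What the actual audio file is and which windows it was cut into | |
| 285 | | `feature_fact` | `fingerprint`, `embedding` | Holds retrieval feature facts | | 52 | | `feature_fact` | `fingerprint`, `embedding` | Which model each window/object used and what feature it produced | |
| 286 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | Holds set relationships such as reference / eval | | 53 | | `set_membership` | `reference_set`, `eval_set`, `hot_set` | Which song/asset/window/feature belongs to which set | |
| 287 | |||
| 288 | Corresponding logical main chain: | ||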
| 289 | 54 | ||
| 290 | ```text | 55 | --- |
| 291 | song -> asset -> window -> feature | ||
| 292 | ``` | ||
| 293 | |||
| 294 | ### Which tables slice data, models, and features land in | ||
| 295 | 56 | ||
| 296 | Under the current **song-centric + fusion-first** design, read it directly as follows: | 57 | ## 4. Which table holds slices / models / features |
| 297 | 58 | ||
| 298 | | Object of interest | Recommended table | Key type / fields | Role | | 59 | | Business object | Physical table | Key fields | Purpose | |
| 299 | |---|---|---|---| | 60 | |---|---|---|---| |
| 300 | | Song master entity | `media_entity` | `entity_type=song` | Which `song_id` it finally belongs to | | 61 | | song | `media_entity` | `entity_type='song'` | Final returned `song_id` | |
| 301 | | Raw audio file | `audio_object` | `object_type=asset`, `song_id`, `storage_uri`, `checksum` | Stores multiple audio files under the same song | | 62 | | asset | `audio_object` | `object_type='asset'` | Stores raw audio file metadata | |
| 302 | | Slice window | `audio_object` | `object_type=window`, `parent_object_id=<asset_id>`, `start_ms`, `end_ms` | Stores retrieval windows cut from an asset | | 63 | | window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | Stores slice range, offset, evidence | |
| 303 | | Model information | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | Records which model and parameter set computed this feature | | 64 | | fingerprint | `feature_fact` | `feature_type='fingerprint'`, `fingerprint_value` | Exact-lane retrieval | |
| 304 | | Fingerprint feature | `feature_fact` | `feature_type=fingerprint`, `fingerprint_value` | Stores exact-lane features | | 65 | | embedding | `feature_fact` | `feature_type='embedding'`, `embedding_uri/vector_table_name`, `embedding_dim` | Semantic-lane retrieval | |
| 305 | | Embedding feature | `feature_fact` | `feature_type=embedding`, `embedding_dim`, `embedding_uri`, `vector_table_name` | Stores semantic-lane features | | 66 | | model identity | `feature_fact` | `model_name`, `model_version` | Distinguishes MERT / MuQ / ECAPA / fallback | |
| 306 | | reference / eval membership | `set_membership` | `set_type`, `member_type`, `member_id` | Decides which asset/window/song enters the reference set | | 67 | | feature set identity | `feature_fact` | `feature_set_name`, `feature_schema_ver` | Distinguishes feature configuration, window strategy, schema version | |
| 68 | | reference routing | `set_membership` | `set_type`, `set_name` | Controls reference/eval/hot scope | | ||
| 69 | |||
| 70 | ### 4.1 A key design point | ||
| 71 | |||
| 72 | Currently **model information is not placed in a separate registry table as a default main-chain dependency**; it is stored directly in `feature_fact` first: | ||
| 73 | - This keeps Phase-1 lighter | ||
| 74 | - It better fits the current strategy of "reusing open-source encoders directly without training/fine-tuning first" | ||
| 75 | - If a registry is added later, the facts already in `feature_fact` can be registered retroactively | ||
| 307 | 76 | ||
| 308 | The most important point is: | 77 | --- |
| 309 | 78 | ||
| 310 | > **Slices themselves also land in `audio_object`, just with `object_type=window`; models and features all land in `feature_fact`.** | 79 | ## 5. Core flow diagrams |
| 311 | 80 | ||
| 312 | ### Corresponding flow diagram | 81 | ### 5.1 Ingestion flow |
| 313 | 82 | ||
| 314 | ```mermaid | 83 | ```mermaid |
| 315 | flowchart TD | 84 | flowchart TD |
| 316 | A[media_entity | 85 | A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset] |
| 317 | entity_type=song] --> B[audio_object | 86 | B --> C[audio_object\nobject_type=window] |
| 318 | object_type=asset] | 87 | C --> D1[feature_fact\nfeature_type=fingerprint] |
| 319 | B --> C[audio_object | 88 | C --> D2[feature_fact\nfeature_type=embedding] |
| 320 | object_type=window] | 89 | A --> E[set_membership] |
| 321 | C --> D1[feature_fact | 90 | B --> E |
| 322 | feature_type=fingerprint] | 91 | C --> E |
| 323 | C --> D2[feature_fact | ||
| 324 | feature_type=embedding] | ||
| 325 | D1 --> E[set_membership | ||
| 326 | reference_set / eval_set] | ||
| 327 | D2 --> E | ||
| 328 | ``` | 92 | ``` |
| 329 | 93 | ||
| 330 | ### Corresponding write flow | 94 | ### 5.2 Query traceback flow |
| 331 | 95 | ||
| 332 | ```mermaid | 96 | ```mermaid |
| 333 | sequenceDiagram | 97 | flowchart LR |
| 334 | participant ING as Ingest Job | 98 | A[query audio] --> B[slice into query windows] |
| 335 | participant DB as PostgreSQL | 99 | B --> C[extract fingerprint / embedding] |
| 336 | 100 | C --> D[hit feature_fact] |
| 337 | ING->>DB: write media_entity(song) | 101 | D --> E[audio_object window] |
| 338 | ING->>DB: write audio_object(asset) | 102 | E --> F[audio_object asset] |
| 339 | ING->>DB: after windowing, write audio_object(window) | 103 | F --> G[media_entity song] |
| 340 | ING->>DB: write feature_fact(fingerprint) | 104 | G --> H[output song_id + evidence] |
| 341 | ING->>DB: write feature_fact(embedding) | ||
| 342 | ING->>DB: write set_membership(reference/eval) | ||
| 343 | ``` | 105 | ``` |
| 344 | 106 | ||
| 345 | ### The most practical trace-back rule | 107 | ### 5.3 Table responsibility view |
| 346 | |||
| 347 | If a query hits an embedding/fingerprint row, the trace-back path is: | ||
| 348 | 108 | ||
| 349 | ```text | 109 | ```mermaid |
| 350 | feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song) | 110 | flowchart TB |
| | 111 | M[media_entity\nwho] --> A[audio_object\nwhich audio file / which slice] | ||
| | 112 | A --> F[feature_fact\nwhich model / what feature was produced] | ||
| | 113 | M --> S[set_membership\nwhich reference/eval/hot set] | ||
| | 114 | A --> S | ||
| | 115 | F --> R[recall / match / aggregate] | ||
| 351 | ``` | 116 | ``` |
| 352 | 117 | ||
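| 353 | This chain is already enough to answer the questions you care about most right now: | ||
| 354 | - which audio file this slice came from | ||
| 355 | - which `song_id` that audio file belongs to | ||
| 356 | - which model / feature set produced this feature | ||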
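
The trace-back chain `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)` can be exercised end to end with a small sketch. The assumptions are loud ones: SQLite stands in for PostgreSQL only so the example runs anywhere, the column subset is trimmed to the fields named in this document, `song_id` on `audio_object` is assumed to reference `media_entity.entity_id`, and every inserted value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE media_entity (entity_id INTEGER PRIMARY KEY, entity_type TEXT, title TEXT);
CREATE TABLE audio_object (
    object_id INTEGER PRIMARY KEY, object_type TEXT, song_id INTEGER,
    parent_object_id INTEGER, storage_uri TEXT, start_ms INTEGER, end_ms INTEGER);
CREATE TABLE feature_fact (
    feature_id INTEGER PRIMARY KEY, feature_type TEXT, object_id INTEGER,
    model_name TEXT, model_version TEXT, feature_set_name TEXT);
-- Hypothetical sample data: one song, one asset, one window, one embedding fact.
INSERT INTO media_entity VALUES (1, 'song', 'Demo Song');
INSERT INTO audio_object VALUES (10, 'asset', 1, NULL, 's3://bucket/demo.flac', NULL, NULL);
INSERT INTO audio_object VALUES (11, 'window', 1, 10, NULL, 5000, 10000);
INSERT INTO feature_fact VALUES (100, 'embedding', 11, 'mert', 'v1-95m', 'semantic_5s');
""")

# Walk feature -> window -> parent asset -> song in a single query.
TRACE_SQL = """
SELECT s.entity_id, a.storage_uri, w.start_ms, w.end_ms, f.model_name, f.feature_set_name
FROM feature_fact f
JOIN audio_object w ON w.object_id = f.object_id AND w.object_type = 'window'
JOIN audio_object a ON a.object_id = w.parent_object_id AND a.object_type = 'asset'
JOIN media_entity s ON s.entity_id = a.song_id AND s.entity_type = 'song'
WHERE f.feature_id = ?
"""


def trace_feature(connection, feature_id):
    """Return (song_id, asset URI, window start/end, model, feature set) for one feature row."""
    return connection.execute(TRACE_SQL, (feature_id,)).fetchone()


hit = trace_feature(conn, 100)
```

The single row in `hit` answers all three questions at once: the source file, the owning song, and the model and feature set that produced the feature.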
| 357 | |||
| 358 | --- | 118 | --- |
| 359 | 119 | ||
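| 360 | ### Why `recording` does not need to be a strong entity yet | 120 | ## 6. Design intent of each table |
| 361 | |||
| 362 | Because you currently do not care about: | ||
| 363 | - strict distinction between official / live / remaster versions | ||
| 364 | - separate archiving for a cover/version lane | ||
| 365 | - results that must resolve to an exact recording_id | ||
| 366 | |||
| 367 | What you actually care about right now is: | ||
| 368 | |||
| 369 | > Whether this batch of audio, from different sources and in different forms, can all be reliably attributed to the same `song_id` | ||
| 370 | |||
| 371 | Under that goal, making `recording` a strong primary layer adds cognitive overhead while bringing limited benefit in the first phase. | ||
| 372 | |||
| 373 | ### This does not mean `recording` is ruled out forever | ||
| 374 | |||
| 375 | The recommended approach: | ||
| 376 | - **The current schema does not push `recording` by default** | ||
| 377 | - If version attribution starts to matter later, promote `recording` out of `media_entity(entity_type=recording)` or `audio_object.metadata_json` | ||
| 378 | |||
| 379 | In other words: | ||
| 380 | - **For now, do song-centric retrieval attribution** | ||
| 381 | - **Later, evolve toward recording-centric / work-centric governance** | ||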
| 382 | |||
| 383 | --- | ||
| 384 | |||
| 385 | ## 1.2.1 融合优先:逻辑分层保留,物理表尽量收敛 | ||
| 386 | |||
| 387 | 如果你的核心诉求是: | ||
| 388 | |||
| 389 | > **尽量减少表数量,用 `type` + 通用关联表达多种对象,而不是一路拆很多表再 join** | ||
| 390 | |||
| 391 | 那么推荐采用下面这个口径: | ||
| 392 | |||
| 393 | - **逻辑层** 当前默认保留 `song / asset / window / feature`;`recording` 仅保留为未来扩展语义 | ||
| 394 | - **物理层** 尽量融合成少数几张通用表 | ||
| 395 | |||
| 396 | 也就是说: | ||
| 397 | |||
| 398 | > **概念上分层,落库上收敛。** | ||
| 399 | |||
| 400 | ### 推荐的融合优先物理视图 | ||
| 401 | |||
| 402 | | 物理表 | 主要 type | 作用 | | ||
| 403 | |---|---|---| | ||
| 404 | | `media_entity` | `song`(当前默认), `work`/`recording`(未来扩展) | 承载业务归属对象 | | ||
| 405 | | `audio_object` | `asset`, `window` | 承载真实音频文件与切片对象 | | ||
| 406 | | `feature_fact` | `fingerprint`, `embedding` | 承载检索特征事实 | | ||
| 407 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | 承载集合归属关系 | | ||
| 408 | |||
| 409 | 这样,Phase-1 在物理表层面可以被收敛成: | ||
| 410 | |||
| 411 | ```text | ||
| 412 | media_entity -> audio_object -> feature_fact -> set_membership | ||
| 413 | ``` | ||
| 414 | |||
| 415 | 而不是新同学第一眼就看到很多高度专用表。 | ||
| 416 | 121 | ||
| 417 | ### 对应的逻辑语义 | 122 | ### 6.1 `media_entity` |
| 418 | 123 | ||
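| 419 | #### `media_entity` | 124 | Purpose: |
| 420 | Distinguished by `entity_type`: | 125 | - Serves as the primary song entity table |
| 421 | - `song` (required by default today) | 126 | - Is the single home of `song_id` |
| 422 | - `work` (optional) | 127 | - `work/recording` types may be kept later if needed, but for now only `song` is treated as the primary semantics |
| 423 | - `recording` (future extension) | ||
| 424 | 128 | ||
| 425 | Common fields can be unified as: | 129 | Most commonly used fields today: |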
| 426 | - `entity_id` | 130 | - `entity_id` |
| 427 | - `entity_type` | 131 | - `entity_type` |
| 428 | - `parent_entity_id` | 132 | - `biz_key` |
| 429 | - `root_song_id` | ||
| 430 | - `title` | 133 | - `title` |
| 431 | - `artist_name` | 134 | - `artist_name` |
| 432 | - `entity_status` | ||
| 433 | - `metadata_json` | 135 | - `metadata_json` |
| 434 | 136 | ||
| 435 | #### `audio_object` | 137 | 设计意图: |
| 436 | 用 `object_type` 区分: | 138 | - 不再把 song 相关字段散落到多张表 |
| 437 | - `asset` | 139 | - 先把最终归属对象固定下来 |
| 438 | - `window` | ||
| 439 | 140 | ||
| 440 | 公共字段可统一为: | 141 | ### 6.2 `audio_object` |
| 441 | - `object_id` | 142 | |
| 143 | 用途: | ||
| 144 | - 同时管理 `asset` 与 `window` | ||
| 145 | - 用 `parent_object_id` 建立 `asset -> window` 父子关系 | ||
| 146 | |||
| 147 | 当前最常用字段: | ||
| 442 | - `object_type` | 148 | - `object_type` |
| 443 | - `recording_entity_id` | 149 | - `song_id` |
| 444 | - `parent_object_id` | 150 | - `parent_object_id` |
| 445 | - `storage_uri` | 151 | - `storage_uri` |
| 446 | - `codec` | ||
| 447 | - `sample_rate` | ||
| 448 | - `start_ms` / `end_ms` | ||
| 449 | - `duration_ms` | ||
| 450 | - `checksum` | 152 | - `checksum` |
| 451 | - `metadata_json` | 153 | - `duration_ms` |
| 154 | - `start_ms` | ||
| 155 | - `end_ms` | ||
| 452 | 156 | ||
| 453 | 解释: | 157 | 设计意图: |
| 454 | - `asset` 行表示真实音频文件 | 158 | - 同一 `song` 下可有多个音频文件 |
| 455 | - `window` 行表示由某个 `asset` 切出来的检索窗口 | 159 | - 同一音频文件可切成多个检索窗口 |
| 160 | - 查询命中后可以回查具体 offset | ||
| 456 | 161 | ||
| 457 | #### `feature_fact` | 162 | ### 6.3 `feature_fact` |
| 458 | 用 `feature_type` 区分: | ||
| 459 | - `fingerprint` | ||
| 460 | - `embedding` | ||
| 461 | 163 | ||
| 462 | 公共字段可统一为: | 164 | 用途: |
| 463 | - `feature_id` | 165 | - 统一存 exact lane 和 semantic lane 的特征事实 |
| 166 | - 统一挂模型信息、特征集信息、特征载荷位置 | ||
| 167 | |||
| 168 | 当前最常用字段: | ||
| 464 | - `feature_type` | 169 | - `feature_type` |
| 465 | - `object_id` | 170 | - `object_id` |
| 171 | - `song_id` | ||
| 466 | - `model_name` | 172 | - `model_name` |
| 467 | - `model_version` | 173 | - `model_version` |
| 468 | - `feature_set_name` | 174 | - `feature_set_name` |
| 469 | - `embedding_dim` | 175 | - `embedding_dim` |
| 470 | - `fingerprint_value` / `embedding_uri` | 176 | - `fingerprint_value` |
| 177 | - `embedding_uri` | ||
| 471 | - `vector_table_name` | 178 | - `vector_table_name` |
| 472 | - `metadata_json` | ||
| 473 | |||
| 474 | 这样可以避免: | ||
| 475 | - 一套模型一张表 | ||
| 476 | - 一类特征一张表 | ||
| 477 | - 后续换模型就改 schema | ||
| 478 | |||
| 479 | ### 为什么这比“纯拆表”更适合当前 Phase-1 | ||
| 480 | |||
| 481 | 优点: | ||
| 482 | 1. **新同学更容易理解**:看到的是 3~4 张核心表,而不是十几张专用表 | ||
| 483 | 2. **更符合当前业务前提**:多个音频直接挂到同一个 `song_id`,先不强区分 recording | ||
| 484 | 3. **模型演进更平滑**:`feature_fact` 可以同时容纳不同模型与不同特征 | ||
| 485 | 4. **更符合当前目标**:先把识别闭环跑通,而不是先把治理模型拆到很细 | ||
| 486 | |||
| 487 | ### 但不要融合过头 | ||
| 488 | |||
| 489 | 虽然推荐物理收敛,但仍然不建议极端融合成一张大全表。 | ||
| 490 | 例如下面这种仍然过度扁平: | ||
| 491 | |||
| 492 | ```text | ||
| 493 | song_everything | ||
| 494 | ``` | ||
| 495 | |||
| 496 | 原因是它会把: | ||
| 497 | - 归属对象 | ||
| 498 | - 音频对象 | ||
| 499 | - 检索特征 | ||
| 500 | - 集合关系 | ||
| 501 | |||
| 502 | 全部揉在一起,导致: | ||
| 503 | - 空字段过多 | ||
| 504 | - 约束难写 | ||
| 505 | - 批量写入难做 | ||
| 506 | - 查询语义不清晰 | ||
| 507 | |||
| 508 | 因此更推荐的边界是: | ||
| 509 | |||
| 510 | ```text | ||
| 511 | 实体一张表 + 音频对象一张表 + 特征事实一张表 + 集合关系一张表 | ||
| 512 | ``` | ||
| 513 | |||
| 514 | 这是“融合优先但不过度融合”的平衡点。 | ||
| 515 | |||
| 516 | --- | ||
| 517 | |||
| 518 | ## 1.3 Phase-1 极简 schema 视图 | ||
| 519 | |||
| 520 | 如果只从“第一阶段必须落哪些表”来理解,推荐优先采用“融合优先”的最小表集合: | ||
| 521 | |||
| 522 | | 层 | 融合优先推荐表 | 当前作用 | | ||
| 523 | |---|---|---| | ||
| 524 | | 实体层 | `media_entity` | 当前默认只承载 `song` | | ||
| 525 | | 音频对象层 | `audio_object` | 统一承载 `asset/window` | | ||
| 526 | | 特征层 | `feature_fact` | 统一承载 `fingerprint/embedding` | | ||
| 527 | | 集合层 | `set_membership` | 统一承载 `reference/hot/eval` 等集合关系 | | ||
| 528 | |||
| 529 | 也就是说,Phase-1 如果按物理实现优先,真正应该先落稳的是: | ||
| 530 | |||
| 531 | ```text | ||
| 532 | media_entity -> audio_object -> feature_fact -> set_membership | ||
| 533 | ``` | ||
| 534 | |||
| 535 | 如果按逻辑语义理解,则当前默认对应: | ||
| 536 | |||
| 537 | ```text | ||
| 538 | song -> asset/window -> fingerprint/embedding -> reference membership | ||
| 539 | ``` | ||
| 540 | |||
| 541 | ### 这版极简 schema 明确不要求第一天就重投入的内容 | ||
| 542 | |||
| 543 | 可以后补: | ||
| 544 | - `work` | ||
| 545 | - 更重的 `retrieval_index_registry` | ||
| 546 | - 更细的 `retrieval_candidate / match_decision` 在线审计表 | ||
| 547 | - 复杂的多 lane 重排治理表 | ||
| 548 | |||
| 549 | ### 但是极简不等于扁平 | ||
| 550 | |||
| 551 | 即使走极简版,也**不建议**退回到下面这种扁平结构: | ||
| 552 | |||
| 553 | ```text | ||
| 554 | song -> embedding | ||
| 555 | song -> fingerprint | ||
| 556 | ``` | ||
| 557 | |||
| 558 | 原因: | ||
| 559 | - 没有 `recording`,版本信息会丢 | ||
| 560 | - 没有 `asset`,文件来源与去重会乱 | ||
| 561 | - 没有 `window`,evidence/offset/多段聚合会弱很多 | ||
| 562 | - 没有 `feature_set_registry`,模型升级会把 schema 写死 | ||
| 563 | |||
| 564 | ### 一个最实用的实现口径 | ||
| 565 | |||
| 566 | 如果团队现在就要开干,最推荐的实施顺序是: | ||
| 567 | |||
| 568 | 1. 先落 `song / recording / recording_asset / audio_window` | ||
| 569 | 2. 再落 `feature_set_registry / audio_fingerprint / audio_embedding` | ||
| 570 | 3. 再落 `reference_set_registry / reference_set_member` | ||
| 571 | 4. 最后再补 `work / retrieval_index_registry / match_decision` 等增强层 | ||
| 572 | |||
| 573 | 这样既能保持当前 Phase-1 简洁,也不会破坏未来扩展。 | ||
| 574 | |||
| 575 | --- | ||
| 576 | |||
| 577 | ## 2. 数据主链 | ||
| 578 | |||
| 579 | ```mermaid | ||
| 580 | flowchart LR | ||
| 581 | A[canonical_song] --> B[work] | ||
| 582 | B --> C[recording] | ||
| 583 | C --> D[recording_asset] | ||
| 584 | D --> E[audio_window] | ||
| 585 | E --> F[audio_embedding] | ||
| 586 | E --> G[audio_fingerprint] | ||
| 587 | F --> H[retrieval_index_registry] | ||
| 588 | G --> H | ||
| 589 | H --> I[retrieval_candidate] | ||
| 590 | I --> J[match_decision] | ||
| 591 | ``` | ||
| 592 | |||
| 593 | --- | ||
| 594 | |||
| 595 | ## 3. 表分组 | ||
| 596 | |||
| 597 | | 分组 | 表 | 作用 | | ||
| 598 | |---|---|---| | ||
| 599 | | 版权与实体 | `canonical_song`, `work`, `recording` | 统一业务归属 | | ||
| 600 | | 资产层 | `recording_asset` | 管理真实文件资产 | | ||
| 601 | | 窗口层 | `audio_window` | 管理检索最小证据片段 | | ||
| 602 | | 模型与特征 | `model_registry`, `feature_set_registry`, `audio_embedding`, `audio_fingerprint` | 管理模型版本与特征事实 | | ||
| 603 | | reference 集 | `reference_set_registry`, `reference_set_member` | 管理热 reference 集与版本化切换 | | ||
| 604 | | 索引层 | `retrieval_index_registry` | 记录后端索引 | | ||
| 605 | | 匹配层 | `retrieval_candidate`, `match_decision` | 在线召回与最终归一 | | ||
| 606 | |||
| 607 | --- | ||
| 608 | 179 | ||
| 609 | ## 4. 关键表说明 | 180 | 设计意图: |
| 181 | - 避免为不同模型建一堆平行 embedding 表 | ||
| 182 | - 未来换 MERT / MuQ / 其他 encoder 时只增 feature rows,不改主 schema | ||
| 183 | - exact / semantic 两条 lane 可以共用同一归属链 | ||
| 610 | 184 | ||
| 611 | ## 4.1 `canonical_song` | 185 | ### 6.4 `set_membership` |
| 612 | 最终业务主键。 | ||
| 613 | 186 | ||
| 614 | 用途: | 187 | 用途: |
| 615 | - 服务最终返回 `canonical_song_id` | 188 | - 统一管理 reference_set / eval_set / hot_set |
| 616 | - 权利归属、产品展示、对外业务都以它为准 | 189 | - member 可以是 `song/asset/window/feature` |
| 617 | 190 | ||
| 618 | ## 4.2 `work` | 191 | 设计意图: |
| 619 | 作品层。 | 192 | - reference 范围不硬编码到 song 表里 |
| 620 | 193 | - 评测集、热集、灰度集能共用一张关系表 | |
| 621 | 用途: | ||
| 622 | - 同一首歌的不同翻唱/演绎归一到作品层 | ||
| 623 | - future phase 的 cover/version lane 常常先聚到 `work_id` | ||
| 624 | |||
| 625 | ## 4.3 `recording` | ||
| 626 | 录音层。 | ||
| 627 | |||
| 628 | 用途: | ||
| 629 | - official/live/remaster/cover/ugc 等不同版本分开管理 | ||
| 630 | - 允许多个 recording 最终映射到同一个 `canonical_song` | ||
| 631 | |||
| 632 | ## 4.4 `recording_asset` | ||
| 633 | 文件资产层。 | ||
| 634 | |||
| 635 | 用途: | ||
| 636 | - 同一个 recording 可有多个文件版本 | ||
| 637 | - 可区分 master/reference/distribution/captured/query_sample | ||
| 638 | |||
| 639 | ## 4.5 `audio_window` | ||
| 640 | 窗口层。 | ||
| 641 | |||
| 642 | 用途: | ||
| 643 | - 建指纹 | ||
| 644 | - 抽 embedding | ||
| 645 | - 在线输出 evidence window | ||
| 646 | - 对 intro/chorus 等片段做后续治理 | ||
| 647 | |||
| 648 | ## 4.6 `model_registry` | ||
| 649 | 模型注册表。 | ||
| 650 | |||
| 651 | 用途: | ||
| 652 | - 记录 `model_name/model_version/output_embedding_dim` | ||
| 653 | - 未来切换 MERT/MuQ/其他底座时不改业务表 | ||
| 654 | |||
| 655 | ## 4.7 `feature_set_registry` | ||
| 656 | 特征版本表。 | ||
| 657 | |||
| 658 | 用途: | ||
| 659 | - 记录窗长、hop、pooling、layer、metric | ||
| 660 | - 同一模型不同用法变成不同 feature_set | ||
| 661 | |||
| 662 | ## 4.8 `audio_embedding` | ||
| 663 | embedding 元数据事实表。 | ||
| 664 | |||
| 665 | 用途: | ||
| 666 | - 记录某个 asset/window 由哪个 feature_set 生成了什么 embedding | ||
| 667 | - 可指向 pgvector,也可只指向外部 parquet/npy | ||
| 668 | |||
| 669 | ## 4.9 `reference_set_registry` / `reference_set_member` | ||
| 670 | reference 集版本表。 | ||
| 671 | |||
| 672 | 用途: | ||
| 673 | - 把“当前线上热 reference 集”提升成显式对象 | ||
| 674 | - 支持 A/B、灰度、回滚、历史回放 | ||
| 675 | - 让 `is_reference` 从单条 recording 标签升级为“可切换集合” | ||
| 676 | |||
| 677 | ## 4.10 `retrieval_index_registry` | ||
| 678 | 索引注册表。 | ||
| 679 | |||
| 680 | 用途: | ||
| 681 | - 同一 feature_set 可挂多个 backend / shard / version | ||
| 682 | - 支持 pgvector / faiss / milvus 并存 | ||
| 683 | |||
| 684 | ## 4.11 `retrieval_candidate` | ||
| 685 | 召回候选。 | ||
| 686 | |||
| 687 | 用途: | ||
| 688 | - 保存 exact lane / semantic lane / future cover lane 的候选 | ||
| 689 | - 便于线下分析与线上回放 | ||
| 690 | |||
| 691 | ## 4.12 `match_decision` | ||
| 692 | 最终判定。 | ||
| 693 | |||
| 694 | 用途: | ||
| 695 | - 输出 `canonical_song_id / work_id / recording_id` | ||
| 696 | - 保留判定理由与分数 | ||
| 697 | 194 | ||
| 698 | --- | 195 | --- |
| 699 | 196 | ||
| 700 | ## 5. 示例流程图 | 197 | ## 7. 为什么“切片数据 + 模型 + feature”这样分布最合理 |
| 701 | 198 | ||
| 702 | ## 5.1 离线建库流程 | 199 | ### 切片数据放 `audio_object` |
| 200 | 因为切片本质是音频对象的一种: | ||
| 201 | - 它有父 asset | ||
| 202 | - 它有 `start_ms/end_ms` | ||
| 203 | - 它需要被回溯和复用 | ||
| 703 | 204 | ||
| 704 | ```mermaid | 205 | ### 模型信息放 `feature_fact` |
| 705 | flowchart TD | 206 | 因为模型是“某次特征计算”的属性: |
| 706 | A[导入音频资产] --> B[写 recording_asset] | 207 | - 同一个 window 可能被多个模型重复编码 |
| 707 | B --> C[切窗并写 audio_window] | 208 | - 同一个模型也可能有多个版本 |
| 708 | C --> D[注册 model_registry / feature_set_registry] | 209 | - 模型名和版本应该和 feature 结果绑定,而不是只和 asset 绑定 |
| 709 | D --> E[抽取 embedding / fingerprint] | ||
| 710 | E --> F[写 audio_embedding / audio_fingerprint] | ||
| 711 | F --> G[构建 retrieval index] | ||
| 712 | G --> H[登记 retrieval_index_registry] | ||
| 713 | ``` | ||
| 714 | 210 | ||
| 715 | ## 5.2 在线检索流程 | 211 | ### feature 放 `feature_fact` |
| 212 | 因为 feature 是事实: | ||
| 213 | - 某个对象 | ||
| 214 | - 用某个模型 | ||
| 215 | - 以某个 feature set | ||
| 216 | - 产出某个结果 | ||
| 716 | 217 | ||
| 717 | ```mermaid | 218 | 这正好就是一条事实记录。 |
| 718 | sequenceDiagram | ||
| 719 | participant Q as Query Audio | ||
| 720 | participant DB as PostgreSQL | ||
| 721 | participant IDX as Retrieval Index | ||
| 722 | participant SVC as Matching Service | ||
| 723 | |||
| 724 | Q->>SVC: 输入 query | ||
| 725 | SVC->>DB: 读取 active feature_set | ||
| 726 | SVC->>IDX: exact lane / semantic lane 查询 | ||
| 727 | IDX-->>SVC: 候选 window / recording | ||
| 728 | SVC->>DB: 回查 window -> recording -> work -> canonical_song | ||
| 729 | SVC->>DB: 写 retrieval_candidate | ||
| 730 | SVC->>DB: 写 match_decision | ||
| 731 | SVC-->>Q: 返回 canonical_song_id + evidence | ||
| 732 | ``` | ||
| 733 | 219 | ||
| 734 | --- | 220 | --- |
| 735 | 221 | ||
| 736 | ## 5.3 Four points to harden before the production freeze | 222 | ## 8. How the first phase serves 1M audio files / 300K songs |
| 737 | |||
| 738 | ### A. lineage 硬约束 | ||
| 739 | 建议通过 trigger / transaction invariant 保证以下链路永远一致: | ||
| 740 | - `recording.work_id -> work.work_id` | ||
| 741 | - `recording.canonical_song_id -> work.canonical_song_id` | ||
| 742 | - `audio_window.asset_id -> recording_asset.recording_id -> recording/work/song` | ||
| 743 | - `audio_embedding.window_id -> audio_window.recording/work/song` | ||
| 744 | |||
| 745 | ### B. reference set 版本化 | ||
| 746 | 建议把“热 reference 集”提升成显式对象,而不是只依赖 `is_reference`。 | ||
| 747 | 这样可以支持: | ||
| 748 | - hot/cold reference 切换 | ||
| 749 | - A/B 对照 | ||
| 750 | - encoder 升级期间的双索引并存 | ||
| 751 | - 历史回放 | ||
| 752 | |||
| 753 | ### C. 候选实体多态约束 | ||
| 754 | `candidate_level + candidate_id` 很灵活,但生产化时至少要加枚举/约束,避免数据面上出现无效 level。 | ||
| 755 | |||
| 756 | ### D. 向量维度扩展规则 | ||
| 757 | 当前 `192/768` 物理表是热路径实现,不是最终维度上限。新增 encoder 维度时应遵循固定 playbook: | ||
| 758 | 1. 新增一张 `audio_embedding_vector_<dim>` 物理表 | ||
| 759 | 2. 回填对应 `feature_set` 的 embeddings | ||
| 760 | 3. 构建对应索引 | ||
| 761 | 4. 通过 `retrieval_index_registry` 切换 active 热索引 | ||
| 762 | 223 | ||
| 763 | --- | 224 | ### Recommended write order |
| 764 | 225 | ||
| 765 | ## 6. Main principles of the recommended DDL | 226 | 1. Write `media_entity(song)` first |
| | 227 | 2. Then write `audio_object(asset)` | ||
| | 228 | 3. Then slice in batches into `audio_object(window)` | ||
| | 229 | 4. Then write `feature_fact` in per-model batches | ||
| | 230 | 5. Finally write `set_membership(reference_set/hot_set/eval_set)` | ||
| 766 | 231 | ||
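| 767 | ## Principle 1: object relationships are stable, models are variable | 232 | ### Why write in this order |
| 768 | What stays stable: | ||
| 769 | - `song/work/recording/asset/window` | ||
| 770 | 233 | ||
| 771 | What varies: | 234 | Because it decouples the "audio object lifecycle" from the "model computation lifecycle": |
| 772 | - `model_name` | 235 | - Audio is ingested first |
| 773 | - `feature_set` | 236 | - Slices are fixed first |
| 774 | - `index_backend` | 237 | - The exact lane can run first |
| 775 | 238 | - The semantic lane can be backfilled later without affecting the main chain |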
| 776 | ## 原则 2:向量不要写死为唯一真相 | ||
| 777 | 推荐把向量事实拆成: | ||
| 778 | - PostgreSQL 元数据主表 | ||
| 779 | - 向量可在 pgvector 分表或外部文件中存放 | ||
| 780 | |||
| 781 | ## 原则 3:窗口是最小证据粒度 | ||
| 782 | 因为版权保护最终不只是“命中这首歌”,还要回答: | ||
| 783 | - 命中的是哪一段 | ||
| 784 | - 哪个录音版本 | ||
| 785 | - 归属到哪个 work/song | ||
| 786 | 239 | ||
| 787 | --- | 240 | --- |
| 788 | 241 | ||
| 789 | ## 7. 推荐的物理实现思路 | 242 | ## 9. Phase-1 推荐策略 |
| 790 | 243 | ||
| 791 | ## 7.1 PostgreSQL 负责 | 244 | ### 9.1 exact lane |
| 792 | - 主数据 | 245 | - 默认:`ChromaprintMatcher` |
| 793 | - 模型注册 | 246 | - 落到:`feature_fact(feature_type='fingerprint')` |
| 794 | - 特征注册 | ||
| 795 | - 索引注册 | ||
| 796 | - 检索候选 | ||
| 797 | - 审核/决策 | ||
| 798 | 247 | ||
| 799 | ## 7.2 pgvector 负责 | 248 | ### 9.2 semantic lane |
| 800 | - 热 reference 集合 | 249 | - 当前优先:`MERT` |
| 801 | - 线上低延迟近邻查询 | 250 | - challenger:`MuQ` |
| 251 | - 当前 host 若 runtime 不可用,保留 fallback | ||
| 252 | - 落到:`feature_fact(feature_type='embedding')` | ||
| 802 | 253 | ||
| 803 | ## 7.3 外部对象存储/文件层负责 | 254 | ### 9.3 为什么不是 ECAPA-TDNN 主导 |
| 804 | - 原始音频 | 255 | - ECAPA 更偏 speaker/audio identity 方向 |
| 805 | - 标准化音频 | 256 | - 当前目标是版权保护 / song-level ACR |
| 806 | - 大体量 embedding parquet/npy | 257 | - `MERT` / `MuQ` 更适合作为 song semantic baseline/challenger |
| 807 | - 索引 shard 文件 | ||
| 808 | 258 | ||
| 809 | --- | 259 | --- |
| 810 | 260 | ||
| 811 | ## 8. 为什么这个设计更适合 SOTA 演进 | 261 | ## 10. 当前方案解决的问题 |
| 812 | |||
| 813 | 因为未来你最可能变化的不是 `canonical_song` 结构,而是: | ||
| 814 | |||
| 815 | | 会变化的东西 | 对应表 | | ||
| 816 | |---|---| | ||
| 817 | | 底座模型 | `model_registry` | | ||
| 818 | | 特征版本 | `feature_set_registry` | | ||
| 819 | | embedding dim | `model_registry.output_embedding_dim` | | ||
| 820 | | 池化与层选择 | `feature_set_registry.pooling_strategy/layer_selection` | | ||
| 821 | | 索引后端 | `retrieval_index_registry.index_backend` | | ||
| 822 | 262 | ||
| 823 | 所以 schema 的目标是: | 263 | 这套 4 表设计,当前主要解决: |
| 824 | 264 | - 同一 `song` 下多音频文件管理 | |
| 825 | > **允许模型变、索引变、特征变,但不让主数据和业务归属逻辑跟着崩。** | 265 | - 切片级 evidence 管理 |
| 266 | - fingerprint 与 embedding 统一落库 | ||
| 267 | - 模型切换时不重构主 schema | ||
| 268 | - reference/eval/hot 集统一治理 | ||
| 269 | - 检索命中后快速回到 `song_id` | ||
| 826 | 270 | ||
| 827 | --- | 271 | --- |
| 828 | 272 | ||
| 829 | ## 9. DDL 文件说明 | 273 | ## 11. 当前不刻意解决的问题 |
| 830 | |||
| 831 | 推荐直接使用: | ||
| 832 | - [`acr-engine/sql/acr_pg_schema_v2.sql`](../acr-engine/sql/acr_pg_schema_v2.sql) | ||
| 833 | |||
| 834 | 其中包含: | ||
| 835 | - 主数据表 | ||
| 836 | - 模型注册表 | ||
| 837 | - 特征表 | ||
| 838 | - 向量物理表(192/768 维示例) | ||
| 839 | - 索引建议 | ||
| 840 | |||
| 841 | 而原有: | ||
| 842 | - [`acr-engine/sql/pgvector_schema.sql`](../acr-engine/sql/pgvector_schema.sql) | ||
| 843 | 274 | ||
| 844 | 建议视为: | 275 | Phase-1 暂不强求: |
| 845 | - 原型版 / demo 版 / 兼容参考 | 276 | - 复杂 `work / recording / version` 治理 |
| 277 | - 完整权利层图谱 | ||
| 278 | - 训练/微调闭环 | ||
| 279 | - 重型 registry-first 体系 | ||
| 846 | 280 | ||
| 847 | --- | 281 | 这些都可以后续逐步加,但不该反向阻塞当前主链。 |
| 848 | |||
| 849 | ## 10. 实施顺序建议 | ||
| 850 | |||
| 851 | ### 第一批必须先落 | ||
| 852 | 1. `canonical_song` | ||
| 853 | 2. `work` | ||
| 854 | 3. `recording` | ||
| 855 | 4. `recording_asset` | ||
| 856 | 5. `audio_window` | ||
| 857 | 6. `model_registry` | ||
| 858 | 7. `feature_set_registry` | ||
| 859 | 8. `audio_embedding` | ||
| 860 | 9. `retrieval_index_registry` | ||
| 861 | |||
| 862 | ### 第二批再补 | ||
| 863 | 1. `retrieval_candidate` | ||
| 864 | 2. `match_decision` | ||
| 865 | 3. `audio_fingerprint` | ||
| 866 | 4. 更多维度的向量物理表 | ||
| 867 | 282 | ||
| 868 | --- | 283 | --- |
| 869 | 284 | ||
| 870 | ## 11. 典型注册与查询示例 | 285 | ## 12. 相关文档 |
| 871 | |||
| 872 | ## 11.1 注册一个开源模型 | ||
| 873 | |||
| 874 | ```sql | ||
| 875 | insert into model_registry ( | ||
| 876 | model_name, model_family, model_version, model_source, model_uri, | ||
| 877 | input_sample_rate, default_window_sec, default_hop_sec, output_embedding_dim, | ||
| 878 | pooling_supported, layer_selection_supported, is_trainable | ||
| 879 | ) values ( | ||
| 880 | 'mert', 'music_ssl', 'v1-95m', 'github', 'https://github.com/yizhilll/MERT', | ||
| 881 | 24000, 5.0, 2.5, 768, | ||
| 882 | array['mean','cls'], true, false | ||
| 883 | ); | ||
| 884 | ``` | ||
| 885 | |||
| 886 | ## 11.2 注册一个 feature set | ||
| 887 | |||
| 888 | ```sql | ||
| 889 | insert into feature_set_registry ( | ||
| 890 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 891 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 892 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 893 | ) | ||
| 894 | select | ||
| 895 | model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 896 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 897 | true, 'cosine', 'none', 'v1' | ||
| 898 | from model_registry | ||
| 899 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 900 | ``` | ||
| 901 | |||
| 902 | ## 11.3 查询当前激活的 reference feature set | ||
| 903 | |||
| 904 | ```sql | ||
| 905 | select fs.feature_set_id, mr.model_name, mr.model_version, | ||
| 906 | fs.window_sec, fs.hop_sec, fs.pooling_strategy, fs.distance_metric | ||
| 907 | from feature_set_registry fs | ||
| 908 | join model_registry mr on mr.model_id = fs.model_id | ||
| 909 | where fs.status = 'active' | ||
| 910 | and fs.feature_level = 'window' | ||
| 911 | and fs.feature_name = 'semantic_embedding' | ||
| 912 | order by fs.feature_set_id desc; | ||
| 913 | ``` | ||
| 914 | |||
| 915 | ## 11.4 从候选 window 回查到最终 song | ||
| 916 | |||
| 917 | ```sql | ||
| 918 | select rc.query_id, rc.rank_no, rc.normalized_score, | ||
| 919 | aw.window_id, aw.start_sec, aw.end_sec, | ||
| 920 | r.recording_id, r.version_type, | ||
| 921 | w.work_id, | ||
| 922 | cs.canonical_song_id, cs.title, cs.primary_artist | ||
| 923 | from retrieval_candidate rc | ||
| 924 | join audio_window aw on aw.window_id = rc.evidence_window_id | ||
| 925 | join recording r on r.recording_id = aw.recording_id | ||
| 926 | join work w on w.work_id = aw.work_id | ||
| 927 | join canonical_song cs on cs.canonical_song_id = aw.canonical_song_id | ||
| 928 | where rc.query_id = :query_id | ||
| 929 | order by rc.rank_no asc; | ||
| 930 | ``` | ||
| 931 | |||
| 932 | ## 11.5 查询某个 song 的全部 reference 资产和窗口 | ||
| 933 | |||
| 934 | ```sql | ||
| 935 | select cs.canonical_song_id, cs.title, | ||
| 936 | r.recording_id, r.version_type, r.is_reference, | ||
| 937 | ra.asset_id, ra.storage_uri, | ||
| 938 | aw.window_id, aw.window_index, aw.start_sec, aw.end_sec | ||
| 939 | from canonical_song cs | ||
| 940 | join recording r on r.canonical_song_id = cs.canonical_song_id | ||
| 941 | join recording_asset ra on ra.recording_id = r.recording_id | ||
| 942 | left join audio_window aw on aw.asset_id = ra.asset_id | ||
| 943 | where cs.canonical_song_id = :canonical_song_id | ||
| 944 | order by r.reference_priority asc, ra.asset_id asc, aw.window_index asc; | ||
| 945 | ``` | ||
| 946 | |||
| 947 | ## 11.6 查询某个 feature set 是否已完成索引构建 | ||
| 948 | |||
| 949 | ```sql | ||
| 950 | select fs.feature_set_id, mr.model_name, mr.model_version, | ||
| 951 | ri.index_backend, ri.index_type, ri.row_count, ri.index_status, ri.built_at | ||
| 952 | from feature_set_registry fs | ||
| 953 | join model_registry mr on mr.model_id = fs.model_id | ||
| 954 | left join retrieval_index_registry ri on ri.feature_set_id = fs.feature_set_id | ||
| 955 | where fs.feature_set_id = :feature_set_id; | ||
| 956 | ``` | ||
| 957 | |||
| 958 | --- | ||
| 959 | |||
| 960 | ## 12. 当前建议结论 | ||
| 961 | |||
| 962 | 如果你今天就要开始 PostgreSQL 落库,最推荐的做法是: | ||
| 963 | 286 | ||
| 964 | 1. 先把 `song/work/recording/asset/window` 落稳 | 287 | - [README.md](./README.md) |
| 965 | 2. 同时把 `model_registry / feature_set_registry` 落稳 | 288 | - [start-here.md](./start-here.md) |
| 966 | 3. Phase-1 只注册开源 encoder feature set,不写死到某个 embedding 列 | 289 | - [session-handoff.md](./session-handoff.md) |
| 967 | 4. 先把热 reference 集上 pgvector,冷数据通过外部文件或后续索引层接入 | 290 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ... | ... |
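| 1 | # Production Encoder Freeze & Embedding Strategy Q&A | ||
| 2 | |||
| 3 | > Updated: 2026-06-03 | ||
| 4 | > Related docs: [Ongoing development handoff](./session-handoff.md) · [PostgreSQL data model](./postgresql-data-model.md) · [Phase-1 implementation checklist](./phase1-implementation-checklist.md) | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The production questions you care about most right now can be compressed into six points: | ||
| 9 | |||
| 10 | 1. **Freezing the encoder now is the right call.** For a 300,000-song production catalog, a stable embedding space matters more than continuing to change the model frequently. | ||
| 11 | 2. **The current architecture can generalize.** The online recognition path is "fixed encoder extracts embeddings + reference retrieval", not a closed-set classifier limited to labels seen during training. | ||
| 12 | 3. **Model weights can be externalized and reused.** You can treat the current `best_model.pt` as a standalone encoder and apply it to other wav/mp3/flac/ogg song collections. | ||
| 13 | 4. **If the encoder changes, the vector store must be rebuilt.** Song metadata does not need to be redone, but all old embeddings / ANN indexes / pgvector vectors should be treated as an old version. | ||
| 14 | 5. **At 300,000 songs, embeddings must be versioned.** Overwriting the old store in place is not recommended; maintain `encoder_version=v1/v2/...` in parallel instead. | ||
| 15 | 6. **The safest rollout strategy: freeze v1 and ship it; new models only get an offline shadow build + A/B, and you switch only when the gain is large enough.** | ||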
| 17 | --- | ||
| 18 | |||
| 19 | ## 1. 当前项目进度:已经走到哪里 | ||
| 20 | |||
| 21 | ## 1.1 当前状态 | ||
| 22 | |||
| 23 | 根据 [持续开发交接文档](./session-handoff.md),当前项目已经从“原型是否能跑通”进入“真实数据验证 + 工程化推进”阶段: | ||
| 24 | |||
| 25 | - synthetic 数据链路已跑通:生成、训练、建索引、识别、评测都已具备 | ||
| 26 | - 开放数据链路已闭环:`inspect-local -> prepare-local -> validate-local -> train -> build-index -> evaluate -> generate_artifacts` | ||
| 27 | - 当前最佳候选方向已收敛到 `hum_focus` | ||
| 28 | - 真实 FMA smoke 已跨过训练,进入了 `build-index` 阶段 | ||
| 29 | |||
| 30 | 这说明当前不是“从零开始想方案”,而是已经具备: | ||
| 31 | |||
| 32 | 1. **一个可以独立抽 embedding 的 encoder** | ||
| 33 | 2. **一个可以对 reference 曲库建索引的 pipeline** | ||
| 34 | 3. **一个可以对 query 做识别的 hybrid 检索链路** | ||
| 35 | |||
| 36 | ## 1.2 当前最适合的策略 | ||
| 37 | |||
| 38 | 在这个阶段,最重要的不是继续频繁换 encoder,而是: | ||
| 39 | |||
| 40 | ```mermaid | ||
| 41 | flowchart TD | ||
| 42 | A[已有可运行模型与索引链] --> B[冻结生产 Encoder v1] | ||
| 43 | B --> C[完成 30 万首建库与使用] | ||
| 44 | C --> D[收集真实线上/离线问题] | ||
| 45 | D --> E[离线评估新 Encoder v2] | ||
| 46 | E --> F[证明显著收益后再迁移] | ||
| 47 | ``` | ||
| 48 | |||
| 49 | 也就是:**先让一版可用的 embedding 体系稳定下来,再做后续升级。** | ||
| 50 | |||
| 51 | --- | ||
| 52 | |||
| 53 | ## 2. 当前结构是否有泛化能力 | ||
| 54 | |||
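| 55 | ## 2.1 Short answer | ||
| 56 | |||
| 57 | **Yes.** | ||
| 58 | |||
| 59 | But "generalization" here has two distinct levels: | ||
| 60 | |||
| 61 | | Generalization level | Supported today | Notes | | ||
| 62 | |---|---|---| | ||
| 63 | | Newly ingested songs can be recognized | Yes | Only requires generating reference embeddings with the same encoder | | ||
| 64 | | Stable high accuracy on a completely unseen distribution | Not yet fully proven | So far the architecture is shown to be viable; real large-scale results still need validation | | ||
| 65 | |||
| 66 | ## 2.2 Why the current architecture can generalize | ||
| 67 | |||
| 68 | The main recognition path is not: | ||
| 69 | - "query in -> classification head directly outputs a fixed song class" | ||
| 70 | |||
| 71 | It is: | ||
| 72 | - "query in -> encoder extracts an embedding -> similarity search over the reference embedding store" | ||
| 73 | |||
| 74 | The structure is shown below: | ||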
| 75 | |||
| 76 | ```mermaid | ||
| 77 | flowchart LR | ||
| 78 | Q[query wav/mp3 片段] --> E1[冻结 Encoder v1] | ||
| 79 | E1 --> QE[query embedding] | ||
| 80 | QE --> S[similarity search] | ||
| 81 | |||
| 82 | R[30 万首 reference 曲库] --> E2[同一个冻结 Encoder v1] | ||
| 83 | E2 --> RE[reference embeddings] | ||
| 84 | RE --> S | ||
| 85 | |||
| 86 | S --> O[候选歌曲 + 分数] | ||
| 87 | ``` | ||
| 88 | |||
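| 89 | This structure naturally supports: | ||
| 90 | |||
| 91 | - adding new songs to the reference catalog | ||
| 92 | - growing the catalog without retraining the model | ||
| 93 | - retrieving against the query rather than classifying into fixed classes | ||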
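
A toy version of this retrieval path shows why a new song needs no retraining: it only needs a reference embedding from the same encoder. This is a minimal sketch, not the project's search code. The 3-dimensional vectors are invented, there is one vector per song (the real reference side is windowed), and `cosine` and `search` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_emb, reference, top_k=3):
    """Brute-force nearest neighbours over (song_id, embedding) pairs, best first."""
    scored = [(cosine(query_emb, emb), song_id) for song_id, emb in reference]
    scored.sort(reverse=True)
    return [(song_id, score) for score, song_id in scored[:top_k]]


# Made-up embeddings standing in for frozen-encoder output.
reference = [("song_a", [1.0, 0.0, 0.0]), ("song_b", [0.0, 1.0, 0.0])]

# A new song joins the catalog by appending its embedding; the encoder is untouched.
reference.append(("song_c", [0.0, 0.6, 0.8]))

results = search([0.0, 0.5, 0.9], reference, top_k=2)
```

Here `results` ranks `song_c` first even though it was added after the other two. A production system would swap the brute-force loop for pgvector or another ANN index.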
| 94 | |||
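| 95 | ## 2.3 What limits how far it generalizes | ||
| 96 | |||
| 97 | Real-world accuracy is still constrained mainly by these factors: | ||
| 98 | |||
| 99 | 1. **Training data distribution** | ||
| 100 | - Hard cases currently focus on `humming_like` / `confused` | ||
| 101 | - Whether your 300,000-song production catalog matches the current training/validation distribution still needs to be validated on real data | ||
| 102 | |||
| 103 | 2. **How the reference store is built** | ||
| 104 | - The reference side currently uses overlapping sliding windows: **5-second window + 2.5-second stride** | ||
| 105 | - This is good for snippet recognition, but it raises the window count and the build cost | ||
| 106 | |||
| 107 | 3. **Differences in query form** | ||
| 108 | - Lossless full tracks, compressed mp3, recorded snippets, truncated snippets, reverb, phone capture and so on all affect results | ||
| 109 | |||
| 110 | 4. **Hybrid retrieval strategy** | ||
| 111 | - Retrieval is not embedding-only; it is a hybrid of `chromaprint + embedding + melody rerank` | ||
| 112 | - This helps robustness, but it also means production governance must cover both the fingerprint index and the vector index | ||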
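
The build cost of the 5-second window with a 2.5-second stride can be estimated with simple arithmetic. The sketch below makes several assumptions that the document does not state: a trailing partial window is dropped, audio shorter than one window yields no window, each song has one asset, and the 4-minute average track length is purely hypothetical.

```python
import math


def window_count(duration_sec, window_sec=5.0, stride_sec=2.5):
    """Number of full sliding windows that fit in a track (partial tail dropped)."""
    if duration_sec < window_sec:
        return 0  # assumption: audio shorter than one window yields no window
    return math.floor((duration_sec - window_sec) / stride_sec) + 1


per_track = window_count(240.0)        # hypothetical 4-minute track -> 95 windows
catalog_windows = per_track * 300_000  # 300,000 songs, one asset each -> 28,500,000
```

Under those assumptions a 300,000-song catalog yields tens of millions of window embeddings, which is why the stride choice directly drives storage and index build time.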
| 113 | |||
| 114 | --- | ||
| 115 | |||
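| 116 | ## 3. Why freeze the encoder now | ||
| 117 | |||
| 118 | ## 3.1 Business reasons | ||
| 119 | |||
| 120 | Your production environment currently holds **300,000 songs**. That leads to one decisive engineering fact: | ||
| 121 | |||
| 122 | > **Once the encoder changes, it is not as simple as "swapping a model file"; the entire vector space may change.** | ||
| 123 | |||
| 124 | If the encoder is not frozen, these problems appear: | ||
| 125 | |||
| 126 | - The embeddings built today for 300,000 songs may all be invalid tomorrow | ||
| 127 | - pgvector / Faiss / Milvus / in-house ANN indexes all need a full rebuild | ||
| 128 | - Offline evaluation and online canary rollout cannot be aligned reliably | ||
| 129 | - Rollback is hard: you can hardly tell which model computed the query and which model built the reference | ||
| 130 | |||
| 131 | ## 3.2 Engineering reasons | ||
| 132 | |||
| 133 | Only after the encoder is frozen can you stabilize the following assets: | ||
| 134 | |||
| 135 | | Asset | Frozen together with the encoder? | Reason | | ||
| 136 | |---|---|---| | ||
| 137 | | `best_model.pt` | Yes | It defines the vector space | | ||
| 138 | | `n_mels/sample_rate/window/stride` | Yes | They determine the embedding distribution | | ||
| 139 | | `embedding_dim` | Yes | Vector table structure and indexes depend on it | | ||
| 140 | | reference embeddings | Yes | Must be the same version as the query embeddings | | ||
| 141 | | ANN index / pgvector index | Yes | Built on a specific embedding space | | ||
| 142 | |||
| 143 | ## 3.3 Recommended freeze definition | ||
| 144 | |||
| 145 | Freeze the current production encoder as an explicit set of configuration values, not just a file name. | ||
| 146 | |||
| 147 | At minimum, record these fields: | ||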
| 148 | |||
| 149 | ```yaml | ||
| 150 | encoder_version: ecapa_hum_focus_v1 | ||
| 151 | checkpoint_path: /abs/path/to/best_model.pt | ||
| 152 | embedding_dim: 192 | ||
| 153 | sample_rate: 16000 | ||
| 154 | n_mels: 128 | ||
| 155 | n_fft: 512 | ||
| 156 | hop_length: 160 | ||
| 157 | reference_window_sec: 5.0 | ||
| 158 | reference_stride_sec: 2.5 | ||
| 159 | query_runtime_window_sec: 5.0 | ||
| 160 | feature_family: mel | ||
| 161 | index_family: hybrid_chromaprint_ecapa | ||
| 162 | status: frozen_for_production | ||
| 163 | ``` | ||
| 164 | |||
| 165 | --- | ||
| 166 | |||
| 167 | ## 4. 模型权重能否外置,给其他歌曲使用 | ||
| 168 | |||
| 169 | ## 4.1 简短回答 | ||
| 170 | |||
| 171 | **可以,而且应该这样做。** | ||
| 172 | |||
| 173 | 你可以把当前冻结好的 `best_model.pt` 视为一个独立的音乐 embedding encoder,用来处理: | ||
| 174 | |||
| 175 | - 新增歌曲入库 | ||
| 176 | - 旧歌库全量建 embedding | ||
| 177 | - 任意 query 音频片段识别 | ||
| 178 | - 后续 pgvector / ANN 向量服务对接 | ||
| 179 | |||
| 180 | ## 4.2 外置使用时的推荐组织方式 | ||
| 181 | |||
| 182 | 推荐把“模型”和“索引产物”分开管理: | ||
| 183 | |||
| 184 | ```mermaid | ||
| 185 | flowchart TD | ||
| 186 | A[models/] --> A1[ecapa_hum_focus_v1/best_model.pt] | ||
| 187 | A --> A2[ecapa_hum_focus_v1/encoder_manifest.yaml] | ||
| 188 | |||
| 189 | B[indexes/] --> B1[ecapa_hum_focus_v1/chromaprint.pkl] | ||
| 190 | B --> B2[ecapa_hum_focus_v1/reference_embs.npy] | ||
| 191 | B --> B3[ecapa_hum_focus_v1/reference_ids.npy] | ||
| 192 | B --> B4[ecapa_hum_focus_v1/index_metadata.json] | ||
| 193 | |||
| 194 | C[metadata/] --> C1[song_catalog.csv] | ||
| 195 | C --> C2[manifests/catalog.json] | ||
| 196 | ``` | ||
| 197 | |||
| 198 | ## 4.3 外置后能做什么 | ||
| 199 | |||
| 200 | ### 场景 A:直接给新歌曲建库 | ||
| 201 | |||
| 202 | - 你有一批新歌(wav/mp3/flac/ogg) | ||
| 203 | - 不重训 | ||
| 204 | - 用冻结 encoder 直接建 reference embedding | ||
| 205 | - 加入曲库识别 | ||
| 206 | |||
| 207 | ### 场景 B:给 query 片段做识别 | ||
| 208 | |||
| 209 | - 你有 5~10 秒左右的查询片段 | ||
| 210 | - 用同一 encoder 抽 query embedding | ||
| 211 | - 在 reference 库做相似度匹配 | ||
| 212 | |||
| 213 | ### 场景 C:离线批量生成 pgvector/ANN 数据 | ||
| 214 | |||
| 215 | - 你可以把 reference embeddings 作为离线产物导出 | ||
| 216 | - 再灌入 PostgreSQL + pgvector / Faiss / Milvus / 自研检索服务 | ||
| 217 | |||
| 218 | --- | ||
| 219 | |||
| 220 | ## 5. 你手头有无损、压缩、片段等 wav/mp3 文件集合,应该怎么直接使用 | ||
| 221 | |||
| 222 | ## 5.1 总体原则 | ||
| 223 | |||
| 224 | 你手头的文件不需要先人工区分“是不是训练专用格式”。 | ||
| 225 | |||
| 226 | 当前最重要的是把它们统一进入这套结构: | ||
| 227 | |||
| 228 | ```mermaid | ||
| 229 | flowchart LR | ||
| 230 | A[Raw wav/mp3/flac/ogg] --> B[Normalized directory] | ||
| 231 | B --> C[manifest] | ||
| 232 | C --> D[Build reference index] | ||
| 233 | C --> E[Query validation] | ||
| 234 | D --> F[Online recognition/retrieval] | ||
| 235 | ``` | ||
| 236 | |||
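| 237 | ## 5.2 Start by sorting material into three categories | ||
| 238 | |||
| 239 | | Category | Purpose | Suggested handling | | ||
| 240 | |---|---|---| | ||
| 241 | | Full songs / primary versions | Reference | Ingest all of them | | ||
| 242 | | Truncated clips / business query samples | Evaluation queries | Hold out as a fixed test set | | ||
| 243 | | Low-bitrate / phone-recorded / reverberant or compressed clips | Hard-case queries | Use to verify robustness | | ||
| 244 | |||
| 245 | ## 5.3 Shortest path to direct use | ||
| 246 | |||
| 247 | ### Step 1: freeze encoder v1 | ||
| 248 | |||
| 249 | Stop swapping models for now and pin the checkpoint that has already been chosen. | ||
| 250 | |||
| 251 | ### Step 2: prepare the audio directory | ||
| 252 | |||
| 253 | For example: | ||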
| 254 | |||
| 255 | ```text | ||
| 256 | input_music/ | ||
| 257 | song_a.flac | ||
| 258 | song_b.mp3 | ||
| 259 | song_c.wav | ||
| 260 | ... | ||
| 261 | ``` | ||
| 262 | |||
| 263 | ### Step 3: check whether the directory is suitable for the current pipeline | ||
| 264 | |||
| 265 | ```bash | ||
| 266 | cd /root/vprecog/acr-engine | ||
| 267 | /usr/local/miniconda3/bin/python src/data/manifest_tools.py inspect-audio-dir \ | ||
| 268 | /abs/path/to/input_music \ | ||
| 269 | --query-duration 8.0 \ | ||
| 270 | --eval-ratio 0.2 | ||
| 271 | ``` | ||
| 272 | |||
| 273 | ### Step 4: generate a unified manifest | ||
| 274 | |||
| 275 | ```bash | ||
| 276 | cd /root/vprecog/acr-engine | ||
| 277 | /usr/local/miniconda3/bin/python src/data/manifest_tools.py audio-dir-to-splits \ | ||
| 278 | /abs/path/to/input_music \ | ||
| 279 | /abs/path/to/output_dataset \ | ||
| 280 | --source-dataset prod_music \ | ||
| 281 | --eval-ratio 0.2 \ | ||
| 282 | --query-duration 8.0 \ | ||
| 283 | --query-strategy hybrid \ | ||
| 284 | --query-stride 2.5 | ||
| 285 | ``` | ||
| 286 | |||
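| 287 | Notes on the suggested values: | ||
| 288 | - `query-duration=8.0`: a suitable query length for offline validation | ||
| 289 | - `query-strategy=hybrid`: closer to the music-aware slicing direction the project already uses | ||
| 290 | - `query-stride=2.5`: generates more queries if you want higher validation coverage | ||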
| 291 | |||
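To see why a smaller `query-stride` yields more queries, the arithmetic can be sketched as follows. This is a minimal sketch that assumes a plain sliding window over each track; the actual `hybrid` strategy in `manifest_tools.py` is not described here and may select windows differently. `sliding_window_offsets` is a hypothetical helper, not a function from the repository.

```python
import math


def sliding_window_offsets(duration_sec, window_sec, stride_sec):
    """Start offsets of every full window that fits inside the track.

    Assumes a plain sliding window: count = floor((D - w) / s) + 1.
    """
    if duration_sec < window_sec:
        return []
    count = math.floor((duration_sec - window_sec) / stride_sec) + 1
    return [i * stride_sec for i in range(count)]


# A 30-second clip, 8-second queries, 2.5-second stride (values from the command above).
offsets = sliding_window_offsets(30.0, 8.0, 2.5)
# Same clip with non-overlapping 8-second queries, for comparison.
sparse_offsets = sliding_window_offsets(30.0, 8.0, 8.0)
```

Under this assumption the 2.5-second stride produces three times as many queries from the same clip as a non-overlapping stride, which is the coverage trade-off the note describes.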
| 292 | ### Step 5: build the reference index with the frozen encoder | ||
| 293 | |||
| 294 | ```bash | ||
| 295 | cd /root/vprecog/acr-engine | ||
| 296 | /usr/local/miniconda3/bin/python run_demo.py build-index \ | ||
| 297 | --data /abs/path/to/output_dataset/manifests \ | ||
| 298 | --model /abs/path/to/frozen/best_model.pt \ | ||
| 299 | --output /abs/path/to/output_index \ | ||
| 300 | --device cpu | ||
| 301 | ``` | ||
| 302 | |||
| 303 | ### Step 6: run offline recognition validation | ||
| 304 | |||
| 305 | ```bash | ||
| 306 | cd /root/vprecog/acr-engine | ||
| 307 | /usr/local/miniconda3/bin/python evaluate.py \ | ||
| 308 | --data /abs/path/to/output_dataset/manifests \ | ||
| 309 | --model /abs/path/to/frozen/best_model.pt \ | ||
| 310 | --index-prefix /abs/path/to/output_index/reference \ | ||
| 311 | --split test \ | ||
| 312 | --device cpu \ | ||
| 313 | --fast-eval \ | ||
| 314 | --output-json /abs/path/to/output_reports/eval.json | ||
| 315 | ``` | ||
| 316 | |||
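| 317 | ### Step 7: push to the production vector store only after confirmation | ||
| 318 | |||
| 319 | Once offline evaluation meets the minimum bar, import the reference embeddings into the production retrieval system, rather than refreshing the full 300,000-song production catalog right away. | ||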
| 320 | |||
| 321 | --- | ||
| 322 | |||
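| 323 | ## 6. How to do a quick fine-tune if needed | ||
| 324 | |||
| 325 | ## 6.1 Principle | ||
| 326 | |||
| 327 | **"Quick fine-tuning" does not mean "immediately retrain on all 300,000 songs".** | ||
| 328 | |||
| 329 | For your current situation, the most reasonable approach is: | ||
| 330 | |||
| 331 | 1. Freeze the production encoder v1 first | ||
| 332 | 2. Train/fine-tune a candidate v2 on a representative subset | ||
| 333 | 3. Evaluate v2 only in the offline environment | ||
| 334 | 4. Upgrade the production encoder only if the gain is shown to outweigh the migration cost | ||
| 335 | |||
| 336 | ## 6.2 Recommended composition of the fine-tuning subset | ||
| 337 | |||
| 338 | Start by sampling a **representative set of a few thousand to a few tens of thousands of songs** that covers: | ||
| 339 | |||
| 340 | - Lossless, high-quality versions | ||
| 341 | - Common compressed versions (mp3/aac, etc.) | ||
| 342 | - Short clips | ||
| 343 | - Clips from the beginning/middle/end | ||
| 344 | - Easily confused songs with similar melodies or arrangements | ||
| 345 | - Phone-captured / screen-recorded / re-compressed audio | ||
| 346 | - The most common failure samples seen in the business | ||
| 347 | |||
| 348 | ## 6.3 Recommended fine-tuning flow | ||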
| 349 | |||
| 350 | ```mermaid | ||
| 351 | flowchart TD | ||
| 352 | A[Freeze Encoder v1] --> B[Sample representative subset] | ||
| 353 | B --> C[Generate manifests] | ||
| 354 | C --> D[Train v2 candidate] | ||
| 355 | D --> E[Build v2 index] | ||
| 356 | E --> F[Evaluate on fixed test set] | ||
| 357 | F --> G{Significant gain?} | ||
| 358 | G -- No --> H[Keep using v1] | ||
| 359 | G -- Yes --> I[Prepare offline rebuild of 300k songs with v2] | ||
| 360 | ``` | ||
| 361 | |||
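| 362 | ## 6.4 Criteria for judging a fine-tune | ||
| 363 | |||
| 364 | Do not look only at training loss; at minimum also check: | ||
| 365 | |||
| 366 | | Metric | Purpose | | ||
| 367 | |---|---| | ||
| 368 | | top1 / topk | Basic recognition rate | | ||
| 369 | | Hard-case hit rate | Whether gains on compressed/clipped/confusable samples are real | | ||
| 370 | | Regression of the new model on existing strengths | Avoid "patching one hole and opening many more" | | ||
| 371 | | Full-catalog build speed | Whether the new model significantly raises production cost | | ||
| 372 | | Online query latency | Whether inference cost is acceptable | | ||
| 373 | |||
| 374 | ## 6.5 Release principles after fine-tuning | ||
| 375 | |||
| 376 | After fine-tuning, do not replace v1 immediately. Instead: | ||
| 377 | |||
| 378 | 1. Tag it as `encoder_version=v2_candidate` | ||
| 379 | 2. Use it only for offline index builds and evaluation | ||
| 380 | 3. A/B it against v1 on real samples | ||
| 381 | 4. Upgrade only once it is significantly better | ||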
| 382 | |||
| 383 | --- | ||
| 384 | |||
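| 385 | ## 7. If the embedding changes, which data must be rebuilt | ||
| 386 | |||
| 387 | ## 7.1 Distinguish "metadata" from "vector data" | ||
| 388 | |||
| 389 | | Data type | Rebuild needed? | Notes | | ||
| 390 | |---|---|---| | ||
| 391 | | Song master data (song_id/path/business tags) | Usually not | Does not depend on the vector space | | ||
| 392 | | manifest | Usually no full redo | Unless your slicing strategy or data-governance rules also changed | | ||
| 393 | | reference embeddings | Must rebuild | The encoder changed | | ||
| 394 | | Cached query embeddings | Must rebuild | Otherwise they live in a different space from the references | | ||
| 395 | | pgvector row data | Must rebuild | The vectors themselves changed | | ||
| 396 | | ANN index (Faiss/Milvus/HNSW/IVF, etc.) | Must rebuild | Built on top of the old vectors | | ||
| 397 | | chromaprint index | Not necessarily | Reusable independently as long as the fingerprint algorithm is unchanged | | ||
| 398 | |||
| 399 | ## 7.2 Why a rebuild is required | ||
| 400 | |||
| 401 | Vector similarity search assumes by default that: | ||
| 402 | |||
| 403 | > the query embedding and the reference embedding come from the same feature space. | ||
| 404 | |||
| 405 | If queries use `encoder v2` while references are still on `encoder v1`, you get: | ||
| 406 | |||
| 407 | - Scores that are not comparable | ||
| 408 | - A marked drop in recall | ||
| 409 | - Randomly fluctuating online results | ||
| 410 | - Distorted A/B conclusions | ||
| 411 | |||
| 412 | ## 7.3 Recommended migration strategy | ||
| 413 | |||
| 414 | Do not "overwrite the old embeddings in place"; run two versions side by side instead: | ||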
| 415 | |||
| 416 | ```mermaid | ||
| 417 | flowchart LR | ||
| 418 | A[Encoder v1] --> B[index_v1] | ||
| 419 | C[Encoder v2] --> D[index_v2] | ||
| 420 | E[Query] --> F{Which version?} | ||
| 421 | F --> B | ||
| 422 | F --> D | ||
| 423 | ``` | ||
| 424 | |||
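| 425 | Recommended steps: | ||
| 426 | |||
| 427 | 1. Leave the `v1` production store untouched | ||
| 428 | 2. Generate `v2` embeddings for the 300,000 songs offline | ||
| 429 | 3. Build the `v2` ANN / pgvector index | ||
| 430 | 4. Compare v1 and v2 on the same query set | ||
| 431 | 5. Switch main traffic once the gain is confirmed | ||
| 432 | 6. To roll back, switch straight back to v1 | ||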
| 433 | |||
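The side-by-side strategy above can be sketched as a small router that keeps both indexes registered, refuses queries whose encoder version does not match the active index, and rolls back with a single switch. This is a minimal sketch under stated assumptions: `IndexRouter` and its method names are hypothetical, and the index values are plain strings standing in for real index handles.

```python
class IndexRouter:
    """Keeps one index per encoder version and tracks which one serves traffic."""

    def __init__(self):
        self._indexes = {}      # encoder_version -> index handle (a string here)
        self.active = None      # version currently serving main traffic
        self._previous = None   # version to return to on rollback

    def register(self, encoder_version, index_handle):
        self._indexes[encoder_version] = index_handle
        if self.active is None:
            self.active = encoder_version

    def switch_to(self, encoder_version):
        if encoder_version not in self._indexes:
            raise KeyError(f"no index registered for {encoder_version}")
        self._previous, self.active = self.active, encoder_version

    def rollback(self):
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.active, self._previous = self._previous, None

    def index_for(self, query_encoder_version):
        # Query and reference embeddings must come from the same feature space.
        if query_encoder_version != self.active:
            raise ValueError("query encoder does not match the active index")
        return self._indexes[self.active]


router = IndexRouter()
router.register("v1", "index_v1")   # production index stays untouched
router.register("v2", "index_v2")   # built offline next to v1
router.switch_to("v2")              # cut main traffic over after the comparison
```

Because `v1` is never overwritten, `router.rollback()` is the whole rollback procedure.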
| 434 | --- | ||
| 435 | |||
| 436 | ## 8. Recommended versioning scheme for the 300,000-song production environment | ||
| 437 | |||
| 438 | ## 8.1 Recommended minimum fields | ||
| 439 | |||
| 440 | ### Song metadata table | ||
| 441 | |||
| 442 | | Field | Description | | ||
| 443 | |---|---| | ||
| 444 | | `song_id` | Stable song ID | | ||
| 445 | | `audio_uri` | Original audio path / object-storage address | | ||
| 446 | | `duration_sec` | Duration | | ||
| 447 | | `codec/container` | Format information | | ||
| 448 | | `catalog_status` | Whether it may enter the reference catalog | | ||
| 449 | | `business_tags` | Business tags | | ||
| 450 | |||
| 451 | ### embedding table | ||
| 452 | |||
| 453 | | Field | Description | | ||
| 454 | |---|---| | ||
| 455 | | `song_id` | Corresponding song | | ||
| 456 | | `encoder_version` | e.g. `ecapa_hum_focus_v1` | | ||
| 457 | | `window_index` | Index of the reference window | | ||
| 458 | | `offset_sec` | Window start | | ||
| 459 | | `embedding_dim` | e.g. 192 | | ||
| 460 | | `embedding` | The vector itself | | ||
| 461 | | `built_at` | Build time | | ||
| 462 | | `source_audio_hash` | Fingerprint/digest of the source audio, for deduplication and invalidation | | ||
| 463 | |||
| 464 | ### index manifest | ||
| 465 | |||
| 466 | | Field | Description | | ||
| 467 | |---|---| | ||
| 468 | | `encoder_version` | Encoder this index corresponds to | | ||
| 469 | | `checkpoint_path` | Model file | | ||
| 470 | | `feature_config` | mel/n_mels/sample_rate, etc. | | ||
| 471 | | `reference_window_sec` | e.g. 5.0 | | ||
| 472 | | `reference_stride_sec` | e.g. 2.5 | | ||
| 473 | | `catalog_size` | Catalog size | | ||
| 474 | | `num_reference_windows` | Total number of windows | | ||
| 475 | | `built_at` | Build time | | ||
| 476 | |||
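One way to put the fields above to work is a consistency check run before an embedding row is loaded into an index. The sketch below uses only a subset of the listed fields; `IndexManifest`, `EmbeddingRow`, and `validate_row` are hypothetical names for illustration, and the two checks are an assumption about what is worth validating, not the project's actual loader.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexManifest:
    # Subset of the index manifest fields listed above.
    encoder_version: str
    reference_window_sec: float
    reference_stride_sec: float


@dataclass(frozen=True)
class EmbeddingRow:
    # Subset of the embedding table fields listed above.
    song_id: str
    encoder_version: str
    window_index: int
    offset_sec: float
    embedding_dim: int
    embedding: List[float]


def validate_row(row: EmbeddingRow, manifest: IndexManifest) -> List[str]:
    """Return a list of problems; an empty list means the row may be loaded."""
    problems = []
    if row.encoder_version != manifest.encoder_version:
        problems.append("encoder_version mismatch")
    if len(row.embedding) != row.embedding_dim:
        problems.append("embedding length != embedding_dim")
    return problems


manifest = IndexManifest("ecapa_hum_focus_v1", 5.0, 2.5)
good_row = EmbeddingRow("song_a", "ecapa_hum_focus_v1", 0, 0.0, 3, [0.1, 0.2, 0.3])
stale_row = EmbeddingRow("song_a", "v2_candidate", 0, 0.0, 4, [0.1, 0.2, 0.3])
```

Rejecting rows whose `encoder_version` differs from the manifest is what keeps vectors from two feature spaces out of the same index.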
| 477 | ## 8.2 Recommended directory layout | ||
| 478 | |||
| 479 | ```text | ||
| 480 | prod_artifacts/ | ||
| 481 | models/ | ||
| 482 | ecapa_hum_focus_v1/ | ||
| 483 | best_model.pt | ||
| 484 | encoder_manifest.yaml | ||
| 485 | |||
| 486 | indexes/ | ||
| 487 | ecapa_hum_focus_v1/ | ||
| 488 | chromaprint.pkl | ||
| 489 | chromaprint_progress.json | ||
| 490 | reference_embs.npy | ||
| 491 | reference_ids.npy | ||
| 492 | index_metadata.json | ||
| 493 | |||
| 494 | reports/ | ||
| 495 | ecapa_hum_focus_v1/ | ||
| 496 | eval.json | ||
| 497 | hard_case_eval.json | ||
| 498 | ``` | ||
| 499 | |||
| 500 | --- | ||
| 501 | |||
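| 502 | ## 9. Currently recommended production runbook | ||
| 503 | |||
| 504 | ## 9.1 Goal | ||
| 505 | |||
| 506 | The current goal is not to keep changing the model, but to first complete: | ||
| 507 | |||
| 508 | 1. Freeze encoder v1 | ||
| 509 | 2. Use v1 to build the first 300,000-song catalog | ||
| 510 | 3. Establish versioning conventions | ||
| 511 | 4. Reserve a migration mechanism for a later v2 upgrade | ||
| 512 | |||
| 513 | ## 9.2 Step-by-step | ||
| 514 | |||
| 515 | ### Phase 1: freeze and archive | ||
| 516 | |||
| 517 | - [ ] Select the current production checkpoint | ||
| 518 | - [ ] Generate `encoder_manifest.yaml` for that checkpoint | ||
| 519 | - [ ] Record `encoder_version` | ||
| 520 | - [ ] Pin the reference window / stride / mel configuration | ||
| 521 | - [ ] Forbid overwriting this checkpoint directly | ||
| 522 | |||
| 523 | ### Phase 2: small-scale validation on real data | ||
| 524 | |||
| 525 | - [ ] Sample 1k to 10k real songs for a first index build | ||
| 526 | - [ ] Sample a real query set for evaluation | ||
| 527 | - [ ] Collect top1/topk/hard-case results | ||
| 528 | - [ ] Verify index build speed, disk usage, and query latency | ||
| 529 | |||
| 530 | ### Phase 3: full offline build of 300,000 songs | ||
| 531 | |||
| 532 | - [ ] Clean song_id values and metadata | ||
| 533 | - [ ] Generate the standard catalog/manifests | ||
| 534 | - [ ] Batch-generate reference embeddings with the frozen encoder v1 | ||
| 535 | - [ ] Generate the chromaprint / vector indexes | ||
| 536 | - [ ] Import into the production retrieval service | ||
| 537 | |||
| 538 | ### Phase 4: launch and observe | ||
| 539 | |||
| 540 | - [ ] Launch v1 | ||
| 541 | - [ ] Log failed query samples | ||
| 542 | - [ ] Archive hard cases | ||
| 543 | - [ ] Prepare an offline sample set for v2 fine-tuning | ||
| 544 | |||
| 545 | ### Phase 5: future upgrade | ||
| 546 | |||
| 547 | - [ ] Train `v2_candidate` | ||
| 548 | - [ ] Rebuild `index_v2` offline over the full catalog | ||
| 549 | - [ ] Run an offline A/B | ||
| 550 | - [ ] Switch the primary version only after a significant gain | ||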
| 551 | |||
| 552 | --- | ||
| 553 | |||
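| 554 | ## 10. FAQ | ||
| 555 | |||
| 556 | ## 10.1 I have a large pile of wav/mp3 files. Can I use them directly? | ||
| 557 | |||
| 558 | **Yes.** | ||
| 559 | |||
| 560 | Do not get stuck on the format itself. Organize the files into a unified audio directory, generate manifests, and then build the reference index with the frozen encoder. | ||
| 561 | |||
| 562 | ## 10.2 Should lossless and compressed versions be kept separate? | ||
| 563 | |||
| 564 | **Keep the source information, but give priority to the "primary version" in the main reference catalog.** | ||
| 565 | |||
| 566 | If the same song has several encoded versions: | ||
| 567 | - The primary version is the main reference asset | ||
| 568 | - Other compressed versions are used first as evaluation queries / hard cases | ||
| 569 | |||
| 570 | This gives a more realistic assessment of robustness. | ||
| 571 | |||
| 572 | ## 10.3 Can clip files be used directly as queries? | ||
| 573 | |||
| 574 | **Yes.** | ||
| 575 | |||
| 576 | They are especially suitable as: | ||
| 577 | - clean query | ||
| 578 | - compressed query | ||
| 579 | - hard case query | ||
| 580 | |||
| 581 | If you want to use a clip as a training sample, though, it should still be traceable to a stable `song_id` and its original reference. | ||
| 582 | |||
| 583 | ## 10.4 What if the encoder later turns out not to be good enough? | ||
| 584 | |||
| 585 | Do not replace the live system directly. The right approach is: | ||
| 586 | |||
| 587 | 1. Leave v1 untouched | ||
| 588 | 2. Train v2 offline | ||
| 589 | 3. Rebuild the full set of v2 embeddings offline | ||
| 590 | 4. Compare v1 and v2 | ||
| 591 | 5. Switch only after the gain is confirmed | ||
| 592 | |||
| 593 | ## 10.5 Is it necessary to go to the full 300,000 songs right now? | ||
| 594 | |||
| 595 | **Run a round of medium-scale validation first, then go to full scale.** | ||
| 596 | |||
| 597 | Recommended order: | ||
| 598 | - A small validation on 1k songs | ||
| 599 | - A medium-scale validation on 10k to 50k songs | ||
| 600 | - Move to 300,000 only after speed/accuracy/storage are verified | ||
| 601 | |||
| 602 | ## 10.6 What is the single most important piece of advice at this stage? | ||
| 603 | |||
| 604 | > **Freeze encoder v1 first, get embedding/version/index governance right, and only then discuss v2.** | ||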
| 605 | |||
| 606 | --- | ||
| 607 | |||
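| 608 | ## 11. Final recommendations | ||
| 609 | |||
| 610 | For the stage you are at now, my recommended priorities are: | ||
| 611 | |||
| 612 | 1. **Freeze the current encoder** | ||
| 613 | 2. **Establish embedding versioning conventions** | ||
| 614 | 3. **Validate index builds on small-to-medium real datasets first** | ||
| 615 | 4. **Then proceed to the full 300,000-song build** | ||
| 616 | 5. **Make model upgrades a standard "offline rebuild + A/B + switch" procedure** | ||
| 617 | |||
| 618 | The benefits are: | ||
| 619 | |||
| 620 | - You can start using the system right now | ||
| 621 | - Continued model tuning will not drag the production catalog into disorder later | ||
| 622 | - Future upgrades follow a clear path, avoiding the "the model changed and all the data is a mess" problem | ||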
| 623 | |||
| 624 | ## Sources | ||
| 625 | - [Session handoff document](./session-handoff.md) | ||
| 626 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 627 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 628 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 629 | - [acr-engine/train.py](../acr-engine/train.py) | ||
| 630 | - [acr-engine/run_demo.py](../acr-engine/run_demo.py) | ||
| 631 | - [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py) | ||
| 632 | - [acr-engine/src/data/manifest_tools.py](../acr-engine/src/data/manifest_tools.py) |
| 1 | # Session Handoff / Continuous Development Handoff Document | 1 | # Session Handoff / Continuous Development Handoff Document |
| 2 | 2 | ||
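| 3 | > Updated: 2026-06-04 | 3 | > Goal: let the next new session know where to start within **3 to 10 minutes**. |
| 4 | > Purpose: let the next new session be clear, within **3 to 10 minutes**, about: | ||
| 5 | > 1. How far the project has come | ||
| 6 | > 2. Which documents to read first | ||
| 7 | > 3. Which step to resume from | ||
| 8 | > 4. Which conclusions are stable and which gaps still need verification | ||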
| 9 | 4 | ||
| 10 | --- | 5 | --- |
| 11 | 6 | ||
| 12 | ## 0. What to do first next time | 7 | ## 1. What to do first next time |
| 13 | 8 | ||
| 14 | If the next session continues the current main path (**song-centric real directory -> feature -> PostgreSQL**), run this first: | 9 | Run the current main path directly first: |
| 15 | 10 | ||
| 16 | ```bash | 11 | ```bash |
| 17 | cd /workspace | 12 | cd /workspace |
| ... | @@ -22,8 +17,15 @@ cd /workspace | ... | @@ -22,8 +17,15 @@ cd /workspace |
| 22 | --output-dir acr-engine/data/pgvector_eval/music20 | 17 | --output-dir acr-engine/data/pgvector_eval/music20 |
| 23 | ``` | 18 | ``` |
| 24 | 19 | ||
| 20 | Or: | ||
| 21 | |||
| 22 | ```bash | ||
| 23 | acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2' | ||
| 24 | ``` | ||
| 25 | |||
| 25 | Current fresh evidence: | 26 | Current fresh evidence: |
| 26 | - `song_count = 2` | 27 | - `song_count = 2` |
| 28 | - `asset_count = 2` | ||
| 27 | - `window_count = 5` | 29 | - `window_count = 5` |
| 28 | - `matcher_fingerprint_count = 5` | 30 | - `matcher_fingerprint_count = 5` |
| 29 | - `fallback_fingerprint_count = 0` | 31 | - `fallback_fingerprint_count = 0` |
| ... | @@ -31,198 +33,128 @@ cd /workspace | ... | @@ -31,198 +33,128 @@ cd /workspace |
| 31 | - `semantic_runtime_missing = [torch, torchaudio, transformers]` | 33 | - `semantic_runtime_missing = [torch, torchaudio, transformers]` |
| 32 | - `import_counts = media_entity:9 / audio_object:22 / feature_fact:24 / set_membership:9` | 34 | - `import_counts = media_entity:9 / audio_object:22 / feature_fact:24 / set_membership:9` |
| 33 | 35 | ||
| 34 | If you only need to regression-test the historical Phase-1 planner/worker contract, then run: | ||
| 35 | |||
| 36 | ```bash | ||
| 37 | cd /workspace/acr-engine | ||
| 38 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 39 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 40 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 41 | ``` | ||
| 42 | |||
| 43 | You can also use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | ||
| 44 | |||
| 45 | Current fresh evidence: | ||
| 46 | - `executed_count = 4` | ||
| 47 | - `all_passed = true` | ||
| 48 | |||
| 49 | This runner executes the following in one go: | ||
| 50 | 1. `prereq_audit` | ||
| 51 | 2. `worker_contract_smoke` | ||
| 52 | 3. `semantic_vector_negative_matrix` | ||
| 53 | 4. `asset_level_upsert_validation` | ||
| 54 | |||
| 55 | If the result is still: | ||
| 56 | - `downloads_root_exists = false` | ||
| 57 | - `ready_jobs = 0` | ||
| 58 | - exact = `failed/unreadable_audio_assets` | ||
| 59 | - semantic = `4/4 failed` | ||
| 60 | |||
| 61 | then the current priorities are still: | ||
| 62 | 1. Restore the `/workspace/downloads` mount | ||
| 63 | 2. Install `torch / torchaudio / transformers / speechbrain` | ||
| 64 | |||
| 65 | rather than going back to doubt the PostgreSQL contract. | ||
| 66 | |||
| 67 | --- | 36 | --- |
| 68 | 37 | ||
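| 69 | ## 1. Project status in one sentence | 38 | ## 2. Current status in one sentence |
| 70 | 39 | ||
| 71 | The project has moved from "can the prototype run end to end" to: | 40 | > **The 4-table song-centric schema has been exercised end to end on live PostgreSQL along the host chain "real directory -> slicing -> exact/semantic feature enrichment -> import -> feature_fact".** |
| 72 | 41 | ||
| 73 | > **An evolvable music ACR system for copyright protection / song recognition / version attribution.** | 42 | The most important next step is: |
| 74 | 43 | > **Without breaking this host chain, upgrade the semantic lane from the runtime-aware fallback to a real MERT / MuQ adapter.** | |
| 75 | The current goal is for 1 million audio files and roughly 300,000 songs to support retrieval, attribution, upgrades, and rollback through a stable data main chain and model registry mechanism. | ||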
| 76 | 44 | ||
| 77 | --- | 45 | --- |
| 78 | 46 | ||
| 79 | ## 2. Current stable conclusions | 47 | ## 3. Current stable conclusions |
| 80 | 48 | ||
| 81 | ### Technical route | 49 | ### 3.1 Default physical model |
| 82 | - exact lane:`Chromaprint` | ||
| 83 | - semantic baseline:`MERT-v1-95M` | ||
| 84 | - semantic challenger:`MuQ` | ||
| 85 | - `ECAPA`:historical baseline | ||
| 86 | - Phase-1: go **encoder-only** first, without fine-tuning the base model for now | ||
| 87 | 50 | ||
| 88 | ### Data main chain | ||
| 89 | ```text | 51 | ```text |
| 90 | canonical_song -> work -> recording -> recording_asset -> audio_window | 52 | media_entity -> audio_object -> feature_fact -> set_membership |
| 91 | ``` | 53 | ``` |
| 92 | 54 | ||
| 93 | ### Model main chain | 55 | ### 3.2 Default logical semantics |
| 56 | |||
| 94 | ```text | 57 | ```text |
| 95 | model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry | 58 | song -> asset -> window -> fingerprint / embedding |
| 96 | ``` | 59 | ``` |
| 97 | 60 | ||
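| 98 | ### reference set conclusions | 61 | ### 3.3 Key design trade-offs |
| 99 | - Keep `is_reference=true` | 62 | - The final attribution target currently only needs to return `song_id` reliably |
| 100 | - But production switching must rely on `reference_set_registry / reference_set_member` | 63 | - Multiple audio files are allowed under the same `song` |
| 101 | - Later A/B tests, hot switching, and rollback all revolve around reference set versioning | 64 | - `window` is kept because it is the smallest unit for slicing/evidence/offset/recall |
| 65 | - `feature_fact` carries both `fingerprint` and `embedding` | ||
| 66 | - Phase-1 does not train/fine-tune first; it reuses open-source encoders directly | ||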
| 102 | 67 | ||
| 103 | --- | 68 | --- |
| 104 | 69 | ||
| 105 | ## 3. Key deliverables completed so far | 70 | ## 4. Which table holds slices / models / features |
| 106 | 71 | ||
| 107 | ### Documentation and design | 72 | | Object | Table | Key fields | |
| 108 | - The main documentation entry point has converged on `README -> start-here -> session-handoff` | 73 | |---|---|---| |
| 109 | - The SOTA evolution path is defined | 74 | | song | `media_entity` | `entity_type='song'` | |
| 110 | - The PostgreSQL master-data and feature model is fixed to the recommended v2 design | 75 | | asset | `audio_object` | `object_type='asset'` | |
| 111 | - The Phase-1 implementation checklist / registry bootstrap / worker contract documents are complete | 76 | | window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | |
| 112 | 77 | | model identity | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | | |
| 113 | ### PostgreSQL / live contract | 78 | | fingerprint payload | `feature_fact` | `feature_type='fingerprint'`, `fingerprint_value` | |
| 114 | - `acr-engine/sql/acr_pg_schema_v2.sql` has landed | 79 | | embedding payload | `feature_fact` | `feature_type='embedding'`, `embedding_uri/vector_table_name`, `embedding_dim` | |
| 115 | - `model_registry / feature_set_registry / reference_set_registry` have been verified by live bootstrap | 80 | | set routing | `set_membership` | `set_type`, `set_name`, `member_type`, `member_id` | |
| 116 | - The asset-level idempotent upsert for `audio_embedding` has been verified live | ||
| 117 | - The semantic vector-table negative-case matrix has been verified live | ||
| 118 | - The planner validation commands can now be run by the runner in one step | ||
| 119 | |||
| 120 | ### worker / script | ||
| 121 | - `run_chromaprint_job.py` has a real write path | ||
| 122 | - `run_embedding_job.py` has a preflight failure contract | ||
| 123 | - `run_phase1_prereq_audit_live.py` can output a host prerequisite audit | ||
| 124 | - `run_planner_validation_commands_live.py` has converged on the shortest validation path | ||
| 125 | 81 | ||
| 126 | --- | 82 | --- |
| 127 | 83 | ||
| 128 | ## 4. Currently known blockers | 84 | ## 5. Current flowchart |
| 129 | 85 | ||
| 130 | ### Environment blockers | 86 | ```mermaid |
| 131 | - `/workspace/downloads` is missing | 87 | flowchart TD |
| 132 | - `torch` is missing | 88 | A[song / media_entity] --> B[asset / audio_object] |
| 133 | - `torchaudio` is missing | 89 | B --> C[window / audio_object] |
| 134 | - `transformers` is missing | 90 | C --> D1[fingerprint / feature_fact] |
| 135 | - `speechbrain` is missing | 91 | C --> D2[embedding / feature_fact] |
| 136 | 92 | A --> E[set_membership] | |
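| 137 | ### Capability blockers | 93 | B --> E |
| 138 | - Real `MERT / MuQ` inference has not yet run end to end | 94 | C --> E |
| 139 | - The online fusion strategy is not yet complete | 95 | D1 --> F[Recall and attribution to song_id] |
| 140 | - A larger real reference set has not yet been connected | 96 | D2 --> F |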
| 141 | 97 | ``` | |
| 142 | So the top priority right now is: | ||
| 143 | > **Complete the environment, then move the semantic lane from guarded failure to real feature extraction.** | ||
| 144 | |||
| 145 | --- | ||
| 146 | |||
| 147 | ## 5. What to read first next time | ||
| 148 | |||
| 149 | Read in this order: | ||
| 150 | 1. [README.md](./README.md) | ||
| 151 | 2. [start-here.md](./start-here.md) | ||
| 152 | 3. [acr-architecture.md](./acr-architecture.md) | ||
| 153 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 154 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 155 | 6. [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | ||
| 156 | 7. [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 157 | 8. [CHANGELOG.md](./CHANGELOG.md) | ||
| 158 | 98 | ||
| 159 | --- | 99 | --- |
| 160 | 100 | ||
| 161 | ## 6. Priority actions next time | 101 | ## 6. What has actually been verified so far |
| 162 | 102 | ||
| 163 | ### Route A: continue environment recovery | 103 | ### live PostgreSQL |
| 164 | 1. Check whether `/workspace/downloads` is mounted | 104 | - DSN: `postgres://d2:d2pass@127.0.0.1:5432/d2` |
| 165 | 2. Check whether `torch / torchaudio / transformers / speechbrain` can be imported | 105 | - schema: `acr_songcentric_test` |
| 166 | 3. Re-run the planner validation runner | ||
| 167 | 4. Confirm whether `ready_jobs` starts rising from `0` | ||
| 168 | 106 | ||
| 169 | ### Route B: continue implementing semantic feature extraction | 107 | ### Verified paths |
| 170 | 1. Review `acr-engine/workers/run_embedding_job.py` | 108 | 1. `acr-engine/sql/acr_pg_schema_songcentric_v1.sql` can create the tables for real |
| 171 | 2. Keep the existing failure-semantics contract | 109 | 2. `bootstrap_songcentric_phase1_live.py` can seed repeatedly |
| 172 | 3. Plug in a real inference adapter | 110 | 3. `import_songcentric_manifest_live.py` can idempotently import `song/asset/window/membership` |
| 173 | 4. Reuse the existing `audio_embedding` upsert logic | 111 | 4. `windows[].features[]` in the manifest can now land directly in `feature_fact` |
| 174 | 112 | 5. Real directory -> manifest -> import has been verified | |
| 175 | ### Route C: continue loading data at scale | 113 | 6. Real directory -> fingerprint enrichment -> import has been verified |
| 176 | 1. Review [postgresql-data-model.md](./postgresql-data-model.md) | 114 | 7. The exact lane now preferentially reuses the in-repo `ChromaprintMatcher` |
| 177 | 2. Review [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | 115 | 8. The semantic lane is runtime-aware, but on the current host it still falls back because dependencies are missing |
| 178 | 3. Plan import batches for the 1 million audio files | ||
| 179 | 4. Pin `reference_set` / `feature_set` / `index` version governance | ||
| 180 | 116 | ||
| 181 | --- | 117 | --- |
| 182 | 118 | ||
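| 183 | ## 7. Conclusions that should not be re-debated | 119 | ## 7. The real blockers on the current host |
| 184 | 120 | ||
| 185 | - Do not fall back to a flat table with only `song_id` | 121 | - `torch` is missing |
| 186 | - Do not design embeddings as fixed columns | 122 | - `torchaudio` is missing |
| 187 | - Do not discuss retraining the base model first in Phase-1 | 123 | - `transformers` is missing |
| 188 | - Do not misdiagnose the current problem as a PostgreSQL schema problem | 124 | - So currently `semantic_runtime_available = false` |
| 189 | 125 | ||
| 190 | --- | 126 | This shows that the main blocker right now is: |
| 127 | > **The semantic encoder runtime is not ready; the schema design is not the problem.** | ||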
| 191 | 128 | ||
| 192 | ## 8. Key file entry points | 129 | --- |
| 193 | 130 | ||
| 194 | ### Documents | 131 | ## 8. Which files to look at first when resuming |
| 195 | - [README.md](./README.md) | ||
| 196 | - [start-here.md](./start-here.md) | ||
| 197 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 198 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 199 | 132 | ||
| 200 | ### Code and scripts | 133 | 1. [README.md](./README.md) |
| 201 | - `acr-engine/sql/acr_pg_schema_v2.sql` | 134 | 2. [start-here.md](./start-here.md) |
| 202 | - `acr-engine/workers/run_chromaprint_job.py` | 135 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 203 | - `acr-engine/workers/run_embedding_job.py` | 136 | 4. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) |
| 204 | - `acr-engine/scripts/run_planner_validation_commands_live.py` | 137 | 5. [CHANGELOG.md](./CHANGELOG.md) |
| 205 | - `acr-engine/scripts/run_phase1_prereq_audit_live.py` | 138 | |
| 139 | Key code: | ||
| 140 | - `acr-engine/sql/acr_pg_schema_songcentric_v1.sql` | ||
| 141 | - `acr-engine/scripts/run_songcentric_directory_pipeline_live.py` | ||
| 142 | - `acr-engine/scripts/build_songcentric_manifest_from_directory.py` | ||
| 143 | - `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py` | ||
| 144 | - `acr-engine/scripts/import_songcentric_manifest_live.py` | ||
| 145 | - `acr-engine/scripts/start_songcentric_shortest_path.sh` | ||
| 206 | 146 | ||
| 207 | --- | 147 | --- |
| 208 | 148 | ||
| 209 | ## 9. Summary of current verification status | 149 | ## 9. Next-step priority order |
| 210 | |||
| 211 | ### Verified | ||
| 212 | - The planner validation runner is executable and `all_passed = true` | ||
| 213 | - The exact lane currently reports honestly as `failed/unreadable_audio_assets` | ||
| 214 | - The semantic lane currently reports honestly as `preflight_failed` | ||
| 215 | - The asset-level embedding upsert idempotency contract is verified | ||
| 216 | - The vector table negative-case matrix is verified | ||
| 217 | - The prerequisites audit is verified | ||
| 218 | 150 | ||
| 219 | ### Not verified | 151 | 1. Keep the current 4-table schema; do not revert it |
| 220 | - Real MERT / MuQ inference | 152 | 2. Wire a real semantic adapter into `enrich_songcentric_manifest_with_local_features.py` |
| 221 | - Importing a larger production reference set | 153 | 3. Keep the fallback branch so the current host stays runnable |
| 222 | - The final online fusion and re-ranking strategy | 154 | 4. Re-run the main-path runner and confirm fresh evidence for the semantic lane |
| 223 | 155 | ||
| 224 | --- | 156 | --- |
| 225 | 157 | ||
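| 226 | ## One-sentence handoff | 158 | ## One-sentence handoff |
| 227 | 159 | ||
| 228 | > When picking this up next time, do not start from the overall plan again; run the runner first. If the result still shows downloads/runtime missing, fix the environment first, then move the semantic lane to real feature extraction. | 160 | > Next time, do not start by debating the overall plan again; run the song-centric runner directly. If exact works and semantic still falls back, keep working on the real semantic adapter and its dependencies. | ... | ... |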
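The new handoff's mapping from logical objects to the four physical tables can be expressed as a small lookup. This is a sketch for orientation only: the table names and discriminator values are taken from the mapping table in the handoff, while `LOGICAL_TO_PHYSICAL` and `locate` are hypothetical names that do not exist in the repository.

```python
# logical object -> (physical table, discriminator column, discriminator value)
LOGICAL_TO_PHYSICAL = {
    "song": ("media_entity", "entity_type", "song"),
    "asset": ("audio_object", "object_type", "asset"),
    "window": ("audio_object", "object_type", "window"),
    "fingerprint": ("feature_fact", "feature_type", "fingerprint"),
    "embedding": ("feature_fact", "feature_type", "embedding"),
}


def locate(kind):
    """Return the table and the column filter that select one logical object kind."""
    table, column, value = LOGICAL_TO_PHYSICAL[kind]
    return {"table": table, "filter": {column: value}}


window_location = locate("window")
embedding_location = locate("embedding")
```

Assets and windows share `audio_object`, and fingerprints and embeddings share `feature_fact`, so the discriminator column is what tells them apart.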
docs/sota-evolution-guide.md
deleted
100644 → 0
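| 1 | # SOTA Evolution Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: lay out a Phase-1 route that "skips fine-tuning for now and uses open-source encoders first", and make clear how it later evolves into a stronger copyright-protection / version-attribution system. | ||
| 5 | |||
| 6 | ## One-page conclusion | ||
| 7 | |||
| 8 | If the current constraints are: | ||
| 9 | - No fine-tuning of the base model yet | ||
| 10 | - Data conventions must land first | ||
| 11 | - The basic retrieval and attribution problem for 1 million audio files / 300,000 songs must be solved first | ||
| 12 | |||
| 13 | then the most reasonable Phase-1 route is not "retrain a brand-new model", but: | ||
| 14 | |||
| 15 | 1. **Keep the exact lane**: Chromaprint / fingerprint | ||
| 16 | 2. **Main base model for the semantic lane**: MERT-v1-95M | ||
| 17 | 3. **Semantic lane challenger**: MuQ | ||
| 18 | 4. **Stabilize the database first**: `model_registry + feature_set_registry + audio_embedding + retrieval_index_registry` | ||
| 19 | 5. **Aggregate results by layer first**: window -> recording -> work -> canonical_song | ||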
| 20 | |||
| 21 | --- | ||
| 22 | |||
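| 23 | ## 1. Why take the encoder-only Phase-1 route now | ||
| 24 | |||
| 25 | Because your most pressing problem right now is not "the ceiling of model accuracy", but: | ||
| 26 | |||
| 27 | - The catalog is large: 1 million audio files / 300,000 songs | ||
| 28 | - Data relationships are complex: one song may have multiple recordings, versions, and source assets | ||
| 29 | - If data conventions are unstable, every future model upgrade will cause repeated rework | ||
| 30 | |||
| 31 | So the Phase-1 goal should be: | ||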
| 32 | |||
| 33 | ```mermaid | ||
| 34 | flowchart LR | ||
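| 35 | A[Freeze data conventions] --> B[Plug in open-source encoder] | ||
| 36 | B --> C[Establish semantic baseline] | ||
| 37 | C --> D[Validate large-scale indexing and aggregation] | ||
| 38 | D --> E[Then decide whether to enter fine-tuning / version lane] | ||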
| 39 | ``` | ||
| 40 | |||
| 41 | --- | ||
| 42 | |||
| 43 | ## 2. Recommended phase breakdown | ||
| 44 | |||
| 45 | ## Phase-0: current repository stage (already in place) | ||
| 46 | - `Chromaprint + ECAPA + melody rerank` | ||
| 47 | - The training / index build / evaluation / serving loop runs end to end | ||
| 48 | - Suitable as a baseline, not as the final production base model | ||
| 49 | |||
| 50 | ## Phase-1: encoder-only foundation baseline (currently recommended) | ||
| 51 | - exact lane: Chromaprint | ||
| 52 | - semantic lane: MERT-v1-95M | ||
| 53 | - challenger: MuQ | ||
| 54 | - No fine-tuning of the base model | ||
| 55 | - Only feature extraction + index + aggregation | ||
| 56 | |||
| 57 | ## Phase-2: Version / Cover lane | ||
| 58 | - After the Phase-1 data model is stable | ||
| 59 | - Introduce a dedicated cover/version branch | ||
| 60 | - Strengthen work-level attribution | ||
| 61 | |||
| 62 | ## Phase-3: Industrial retrieval stack | ||
| 63 | - ANN + reranker | ||
| 64 | - online/offline artifact registry | ||
| 65 | - Monitoring, replay, auditing, manual review | ||
| 66 | |||
| 67 | --- | ||
| 68 | |||
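| 69 | ## 3. Recommended model combination for Phase-1 | ||
| 70 | |||
| 71 | ## 3.1 Exact lane | ||
| 72 | ### Choice | ||
| 73 | - Chromaprint / landmark hash | ||
| 74 | |||
| 75 | ### Role | ||
| 76 | - Clips of the original track | ||
| 77 | - Platform transcodes | ||
| 78 | - near-duplicate | ||
| 79 | - Strong matching of local segments | ||
| 80 | |||
| 81 | ### Why keep it | ||
| 82 | Copyright protection cannot rely on semantic embeddings alone. In many real complaint/evidence-gathering scenarios, the exact lane is still the fastest first path with the strongest evidence. | ||
| 83 | |||
| 84 | --- | ||
| 85 | |||
| 86 | ## 3.2 Main semantic-lane model: MERT-v1-95M | ||
| 87 | |||
| 88 | ### Why it is recommended | ||
| 89 | - It is a music SSL foundation model | ||
| 90 | - A public paper and implementation exist | ||
| 91 | - It fits the role of a music-task base model better than a small self-trained ECAPA | ||
| 92 | - Using it directly as a frozen encoder in Phase-1 has lower cost and risk | ||
| 93 | |||
| 94 | ### Role in Phase-1 | ||
| 95 | - Produces window embeddings as the main encoder | ||
| 96 | - Serves as the baseline for noisy/BGM/general cross-domain retrieval | ||
| 97 | - Can later continue as a teacher or remain compatible with older index versions | ||
| 98 | |||
| 99 | ### Recommended feature sets | ||
| 100 | 1. `mert_v1_95m__window_5s_hop_2.5s__meanpool__l2` | ||
| 101 | 2. `mert_v1_95m__window_10s_hop_5s__meanpool__l2` | ||
| 102 | |||
| 103 | ### Why start with two | ||
| 104 | - `5s/2.5s`: better for local localization | ||
| 105 | - `10s/5s`: better for overall semantic stability | ||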
| 106 | |||
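The two feature set names above appear to follow a `model__window_Xs_hop_Ys__pooling__norm` pattern. The sketch below parses names of that shape; the convention is inferred from these two examples only and is an assumption, and `parse_feature_set_name` is a hypothetical helper, not an existing function in the repository.

```python
import re

# Matches the window segment, e.g. "window_5s_hop_2.5s" (pattern inferred from the examples).
_WINDOW_RE = re.compile(r"^window_(\d+(?:\.\d+)?)s_hop_(\d+(?:\.\d+)?)s$")


def parse_feature_set_name(name):
    """Split a feature set name into model, window/hop seconds, pooling, and norm."""
    parts = name.split("__")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '__'-separated segments, got {len(parts)}")
    model, window, pooling, norm = parts
    match = _WINDOW_RE.match(window)
    if match is None:
        raise ValueError(f"unrecognized window segment: {window}")
    return {
        "model": model,
        "window_sec": float(match.group(1)),
        "hop_sec": float(match.group(2)),
        "pooling": pooling,
        "norm": norm,
    }


short_fs = parse_feature_set_name("mert_v1_95m__window_5s_hop_2.5s__meanpool__l2")
long_fs = parse_feature_set_name("mert_v1_95m__window_10s_hop_5s__meanpool__l2")
```

Both examples use a hop of half the window length, so consecutive windows overlap by 50 percent in either configuration.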
| 107 | --- | ||
| 108 | |||
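| 109 | ## 3.3 Semantic-lane challenger: MuQ | ||
| 110 | |||
| 111 | ### Why it is recommended | ||
| 112 | - Newer, and closer to the next-generation music foundation model route | ||
| 113 | - Worth having as a challenger baseline | ||
| 114 | - Even without fine-tuning, it may outperform earlier base models on some MIR tasks | ||
| 115 | |||
| 116 | ### Current recommendation | ||
| 117 | - In Phase-1, use it as a control group first; do not replace MERT immediately | ||
| 118 | - Focus on verifying: stability of the vector distribution, window-level retrieval performance, memory/inference cost | ||
| 119 | |||
| 120 | --- | ||
| 121 | |||
| 122 | ## 3.4 Why Phase-1 does not take CoverHunter as the main line | ||
| 123 | |||
| 124 | Because CoverHunter's strengths lie in: | ||
| 125 | - cover song identification | ||
| 126 | - alignment / refined attention / coarse-to-fine training | ||
| 127 | |||
| 128 | whereas your current constraints are: | ||
| 129 | - No fine-tuning yet | ||
| 130 | - Use open-source encoders first | ||
| 131 | - Get the data and retrieval conventions settled first | ||
| 132 | |||
| 133 | So it is better suited as **the direction for the Phase-2 version/cover lane** than as the main Phase-1 baseline. | ||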
| 134 | |||
| 135 | --- | ||
| 136 | |||
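| 137 | ## 4. Concerns by role | ||
| 138 | |||
| 139 | ## 4.1 Model base role | ||
| 140 | Focus on: | ||
| 141 | - Which encoders are registered in `model_registry` | ||
| 142 | - Each encoder's input SR, window, pooling, and embedding dim | ||
| 143 | - Which feature sets are online candidates and which are only experimental candidates | ||
| 144 | |||
| 145 | ## 4.2 Retrieval role | ||
| 146 | Focus on: | ||
| 147 | - How the fingerprint lane and semantic lane are combined | ||
| 148 | - `recording/work/song` aggregation rules | ||
| 149 | - How to output top-k candidates stably | ||
| 150 | |||
| 151 | ## 4.3 Data role | ||
| 152 | Focus on: | ||
| 153 | - Asset deduplication | ||
| 154 | - Reference asset selection | ||
| 155 | - window manifest | ||
| 156 | - Whether full rebuilds of features and indexes are supported | ||
| 157 | |||
| 158 | ## 4.4 Operations / platform role | ||
| 159 | Focus on: | ||
| 160 | - Whether encoder version switches can be rolled out gradually | ||
| 161 | - Whether index rebuilds can run in parallel | ||
| 162 | - Whether hot/cold indexes and historical indexes can be rolled back | ||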
| 163 | |||
| 164 | --- | ||
| 165 | |||
| 166 | ## 5. Phase-1 implementation order | ||
| 167 | |||
| 168 | ```mermaid | ||
| 169 | flowchart TD | ||
| 170 | A[Freeze PostgreSQL data conventions] --> B[Import canonical/work/recording/asset/window] | ||
| 171 | B --> C[Register model_registry / feature_set_registry] | ||
| 172 | C --> D[Extract MERT features] | ||
| 173 | C --> E[Extract MuQ features] | ||
| 174 | D --> F[Build semantic index] | ||
| 175 | E --> F | ||
| 176 | F --> G[Aggregate with fingerprint lane] | ||
| 177 | G --> H[Output canonical_song_id / work_id / recording_id] | ||
| 178 | ``` | ||
| 179 | |||
| 180 | --- | ||
| 181 | |||
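| 182 | ## 6. What each phase solves | ||
| 183 | |||
| 184 | | Phase | Problems solved | Problems deferred | | ||
| 185 | |---|---|---| | ||
| 186 | | Phase-1 | Data conventions, open-source base-model baseline, rebuildable indexes, song/work/recording aggregation | Base-model fine-tuning, cover-specific training, melody tower | | ||
| 187 | | Phase-2 | version/cover attribution, work-level recall | More complex cross-modal humming | | ||
| 188 | | Phase-3 | Industrialized serving, replay, monitoring, manual review loop | Extreme research SOTA | | ||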
| 189 | |||
| 190 | --- | ||
| 191 | |||
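| 192 | ## 7. Relationship to the current repository | ||
| 193 | |||
| 194 | ### Kept as-is | ||
| 195 | - `ECAPA baseline`: kept for comparison, not as the long-term main base model | ||
| 196 | - `Chromaprint`: kept, and very important for copyright-protection scenarios | ||
| 197 | - `melody rerank`: kept as an auxiliary lane | ||
| 198 | |||
| 199 | ### Newly added | ||
| 200 | - `model_registry` | ||
| 201 | - `feature_set_registry` | ||
| 202 | - foundation encoder feature extraction and registration | ||
| 203 | - A clearer `canonical_song / work / recording` data structure | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
| 207 | ## 8. Current recommended conclusion | ||
| 208 | |||
| 209 | If a Phase-1 plan had to be fixed today, I would recommend: | ||
| 210 | |||
| 211 | 1. **Leave the main training line unchanged for now and do not remove ECAPA** | ||
| 212 | 2. **Add a MERT-v1-95M semantic lane** | ||
| 213 | 3. **Add a MuQ challenger lane** | ||
| 214 | 4. **Build the hot index first from only the primary reference windows with `is_reference=true`** | ||
| 215 | 5. **Treat the PostgreSQL design as the primary deliverable first** | ||
| 216 | |||
| 217 | In other words: | ||
| 218 | |||
| 219 | > The core of Phase-1 is not "which model wins in the end", but stabilizing the long-term structure of "data conventions + model registry + feature registry + index registry" first. | ||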
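| 1 | # Start Here / Onboarding Entry Point for New Teammates | 1 | # Start Here / Onboarding Entry Point for New Teammates |
| 2 | 2 | ||
| 3 | > Goal: let a new teammate know within **10 minutes** what to run first, what to read first, where things are currently stuck, and what to do next. | 3 | > Goal: let a new teammate know within **10 minutes** the current main path, what to run first, and what to look at first. |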
| 4 | 4 | ||
| 5 | --- | 5 | --- |
| 6 | 6 | ||
| 7 | ## 1. Run this command first | 7 | ## 1. Run this command first |
| 8 | 8 | ||
| 9 | If the current goal is to verify the **song-centric real directory -> feature -> PostgreSQL** main path, run this first: | ||
| 10 | |||
| 11 | ```bash | 9 | ```bash |
| 12 | cd /workspace | 10 | cd /workspace |
| 13 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | 11 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ |
| ... | @@ -17,167 +15,116 @@ cd /workspace | ... | @@ -17,167 +15,116 @@ cd /workspace |
| 17 | --output-dir acr-engine/data/pgvector_eval/music20 | 15 | --output-dir acr-engine/data/pgvector_eval/music20 |
| 18 | ``` | 16 | ``` |
| 19 | 17 | ||
| 18 | Or: | |
| 19 | |||
| 20 | ```bash | ||
| 21 | acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2' | ||
| 22 | ``` | ||
| 23 | |||
| 20 | Current fresh evidence: | 24 | Current fresh evidence: |
| 21 | - `song_count = 2` | 25 | - `song_count = 2` |
| 26 | - `asset_count = 2` | ||
| 22 | - `window_count = 5` | 27 | - `window_count = 5` |
| 23 | - `matcher_fingerprint_count = 5` | 28 | - `matcher_fingerprint_count = 5` |
| 24 | - `fallback_fingerprint_count = 0` | 29 | - `fallback_fingerprint_count = 0` |
| 25 | - `semantic_runtime_available = false` | 30 | - `semantic_runtime_available = false` |
| 26 | - `import_counts.feature_fact = 24` | 31 | - `semantic_runtime_missing = [torch, torchaudio, transformers]` |
| 27 | 32 | - `import_counts = media_entity:9 / audio_object:22 / feature_fact:24 / set_membership:9` | |
| 28 | If your current goal is to validate the old Phase-1 planner/worker contract, also run the following: | |
| 29 | |||
| 30 | ```bash | ||
| 31 | cd /workspace/acr-engine | ||
| 32 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 33 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 34 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 35 | ``` | ||
| 36 | |||
| 37 | You can also use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | |
| 38 | |||
| 39 | ### Current fresh evidence | |
| 40 | - `executed_count = 4` | ||
| 41 | - `all_passed = true` | ||
| 42 | |||
| 43 | ### This command runs | |
| 44 | 1. `prereq_audit` | ||
| 45 | 2. `worker_contract_smoke` | ||
| 46 | 3. `semantic_vector_negative_matrix` | ||
| 47 | 4. `asset_level_upsert_validation` | ||
| 48 | |||
| 49 | ### How to interpret the following results | |
| 50 | If you see: | |
| 51 | - `downloads_root_exists = false` | ||
| 52 | - `ready_jobs = 0` | ||
| 53 | - exact lane = `failed/unreadable_audio_assets` | ||
| 54 | - semantic lane = `4/4 failed` | ||
| 55 | |||
| 56 | then the current priorities are: | |
| 57 | 1. Mount `/workspace/downloads` | |
| 58 | 2. Install `torch / torchaudio / transformers / speechbrain` | |
| 59 | |||
| 60 | In other words: | |
| 61 | > The primary problem right now is runtime environment prerequisites, not the PostgreSQL schema and not a design flaw in the worker contract. | |
| 62 | 33 | ||
| 63 | --- | 34 | --- |
| 64 | 35 | ||
| 65 | ## 2. When taking over, read only these 5 documents | 36 | ## 2. Read only these 4 documents |
| 66 | 37 | ||
| 67 | 1. [README.md](./README.md) | 38 | 1. [README.md](./README.md) |
| 68 | 2. [session-handoff.md](./session-handoff.md) | 39 | 2. [session-handoff.md](./session-handoff.md) |
| 69 | 3. [acr-architecture.md](./acr-architecture.md) | 40 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 70 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | 41 | 4. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) |
| 71 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 72 | |||
| 73 | If you work on algorithms or retrieval, also read: | |
| 74 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 75 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | ||
| 76 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 77 | 42 | ||
| 78 | --- | 43 | --- |
| 79 | 44 | ||
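| 80 | ## 3. The project in one sentence | 45 | ## 3. The current project in one sentence |
| 81 | 46 | |
| 82 | We are building a music ACR system for **copyright protection / song identification / version attribution**. | 47 | We are currently building a **song-centric music ACR system for copyright protection**: |
| 83 | The goal is to quickly locate the correct `song_id` among `100w` (1 million) audio files and roughly `30w` (300,000) songs; at this stage, version/recording is not a required return object. | 48 | The goal is, across roughly `100w` (1 million) audio files and roughly `30w` (300,000) songs, to resolve queries involving recordings, BGM, clips, and covers to the `song_id` they belong to as quickly as possible. |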
| 84 | 49 | ||
| 85 | --- | 50 | --- |
| 86 | 51 | ||
| 87 | ## 4. Current main-line approach | 52 | ## 4. The most important current design conclusions |
| 88 | 53 | |
| 89 | ### Retrieval main line | 54 | ### 4.1 The old multi-layer v2 design is no longer the default |
| 90 | - exact lane: `Chromaprint` | 55 | By default, only 4 core physical tables are recognized: |
| 91 | - semantic lane baseline: `MERT-v1-95M` | |
| 92 | - semantic lane challenger: `MuQ` | |
| 93 | - historical baseline: `ECAPA` | |
| 94 | 56 | |
| 95 | ### Current Phase-1 minimal main line | |
| 96 | ```text | 57 | ```text |
| 97 | song -> asset -> window | 58 | media_entity -> audio_object -> feature_fact -> set_membership |
| 98 | ``` | 59 | ``` |
| 99 | 60 | ||
| 100 | ### Evolvable full main line | 61 | ### 4.2 How to read the logical semantics |
| 101 | ```text | ||
| 102 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 103 | ``` | ||
| 104 | 62 | ||
| 105 | ### Model main line | |
| 106 | ```text | 63 | ```text |
| 107 | model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry | 64 | song -> asset -> window -> fingerprint / embedding |
| 108 | ``` | 65 | ``` |
| 109 | 66 | ||
| 110 | --- | 67 | ### 4.3 Where slices / models / features actually land |
| 111 | 68 | |
| 112 | ## 5. What is already stable | 69 | | Object | Table | Key fields | |
| 70 | |---|---|---| | |
| 71 | | song | `media_entity` | `entity_type='song'` | | |
| 72 | | Original audio file | `audio_object` | `object_type='asset'` | | |
| 73 | | Slice window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | | |
| 74 | | Fingerprint feature | `feature_fact` | `feature_type='fingerprint'`, `fingerprint_value` | | |
| 75 | | Embedding feature | `feature_fact` | `feature_type='embedding'`, `embedding_uri/vector_table_name` | | |
| 76 | | Model information | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | | |
| 77 | | reference/eval/hot sets | `set_membership` | `set_type`, `set_name` | | |
| 113 | 78 | ||
| 114 | - PostgreSQL v2 schema is in place | 79 | --- |
| 115 | - registry bootstrap has live validation | |
| 116 | - worker contract has live validation | |
| 117 | - failure semantics for the exact / semantic lanes are auditable | |
| 118 | - planner can already emit validation commands | |
| 119 | - planner validation runner can be run with a single command | |
| 120 | 80 | |
| 121 | ## 6. What is not done yet | 81 | ## 5. Flowchart of the current main pipeline |
| 122 | 82 | |
| 123 | - MERT / MuQ inference has not actually been run end to end yet | 83 | ```mermaid |
| 124 | - the current host has no `/workspace/downloads` | 84 | flowchart TD |
| 125 | - the current host is missing `torch / torchaudio / transformers / speechbrain` | 85 | A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset] |
| 126 | - the final online fusion strategy is not finished yet | 86 | B --> C[audio_object\nobject_type=window] |
| 127 | - a larger real reference set has not been connected yet | 87 | C --> D1[feature_fact\nfingerprint] |
| 88 | C --> D2[feature_fact\nembedding] | ||
| 89 | B --> E[set_membership\nreference_set / eval_set / hot_set] | ||
| 90 | C --> E | ||
| 91 | ``` | ||
| 128 | 92 | ||
| 129 | --- | 93 | --- |
| 130 | 94 | ||
| 131 | ## 7. If you continue from here, follow this order | 95 | ## 6. What is already stable |
| 132 | |||
| 133 | ### Route A: fix the environment first | |
| 134 | 1. Mount `/workspace/downloads` | |
| 135 | 2. Install the semantic runtime dependencies | |
| 136 | 3. Rerun the planner validation runner | |
| 137 | 4. Check whether `ready_jobs` starts to recover | |
| 138 | 96 | |
| 139 | ### Route B: fix the implementation first | 97 | - the live PostgreSQL schema's tables have actually been created successfully |
| 140 | 1. Read [phase1-worker-contract.md](./phase1-worker-contract.md) | 98 | - real directory -> manifest -> import works end to end |
| 141 | 2. Read `acr-engine/workers/run_embedding_job.py` | 99 | - real directory -> fingerprint enrichment -> import works end to end |
| 142 | 3. Replace the guarded failure path with a real inference adapter | 100 | - the semantic lane is now runtime-aware |
| 143 | 4. Keep the current PostgreSQL contract unchanged | 101 | - when the current host lacks `torch/torchaudio/transformers`, it falls back explicitly and does not fake success |
| 144 | 102 | - the exact lane now reuses the in-repo `ChromaprintMatcher` by preference |
| 145 | ### Route C: fix the data first | |
| 146 | 1. Read [postgresql-data-model.md](./postgresql-data-model.md) | |
| 147 | 2. Read [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | |
| 148 | 3. Prepare a larger reference set | |
| 149 | 4. Keep `reference_set_registry / reference_set_member` versioned | |
| 150 | 103 | ||
| 151 | --- | 104 | --- |
| 152 | 105 | ||
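| 153 | ## 8. What not to prioritize right now | 106 | ## 7. What to continue with first |
| 154 | 107 | |
| 155 | - Do not reopen the debate on whether to layer `song/work/recording` | 108 | ### Top priority |
| 156 | - Do not fall back to a flat table with only `song_id` | 109 | Upgrade the semantic lane from fallback to a real encoder adapter without breaking the existing host pipeline. |
| 157 | - Do not start by discussing retraining the foundation model | |
| 158 | - Do not misdiagnose the current problem as a PostgreSQL contract design issue | |
| 159 | 110 | |
| 160 | --- | 111 | ### Current host facts |
| 112 | - `torch` is missing | |
| 113 | - `torchaudio` is missing | |
| 114 | - `transformers` is missing | |
| 115 | - as a result, `semantic_runtime_available = false` for now | |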
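Review note: the runtime-aware behaviour described in these rows can be illustrated with a minimal sketch. It assumes the check only needs to know whether the three packages are importable. The function names `probe_semantic_runtime` and `choose_semantic_mode` are hypothetical and are not taken from the repository; the report keys mirror the `semantic_runtime_available` and `semantic_runtime_missing` fields shown in the fresh evidence.

```python
import importlib.util

# Packages the new Start Here doc lists as missing on the current host.
REQUIRED_SEMANTIC_PACKAGES = ("torch", "torchaudio", "transformers")


def probe_semantic_runtime(required=REQUIRED_SEMANTIC_PACKAGES):
    """Report which packages are importable, without importing them.

    Hypothetical helper: the real pipeline may detect this differently.
    """
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return {
        "semantic_runtime_available": not missing,
        "semantic_runtime_missing": missing,
    }


def choose_semantic_mode(report):
    """Pick an explicit mode so a fallback is never reported as success."""
    return "encoder" if report["semantic_runtime_available"] else "fallback"


if __name__ == "__main__":
    report = probe_semantic_runtime()
    print(report, choose_semantic_mode(report))
```

Using `importlib.util.find_spec` avoids importing heavy packages just to test for their presence, which keeps the probe cheap on hosts where they are installed.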
| 161 | 116 | ||
| 162 | ## 9. Common repository entry points | 117 | --- |
| 163 | 118 | |
| 164 | ### Docs | 119 | ## 8. Points not to circle back to |
| 165 | - [README.md](./README.md) | ||
| 166 | - [session-handoff.md](./session-handoff.md) | ||
| 167 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 168 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 169 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 170 | 120 | ||
| 171 | ### Scripts | 121 | - Do not fall back to the old v2 schema as the default reference |
| 172 | - `acr-engine/scripts/run_planner_validation_commands_live.py` | 122 | - Do not reintroduce `recording/work/version` as required Phase-1 return objects |
| 173 | - `acr-engine/scripts/run_phase1_prereq_audit_live.py` | 123 | - Do not start by discussing training/fine-tuning |
| 174 | - `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py` | 124 | - Do not mistake "the foundation model is swappable" for "the database must be split into many layers again" |
| 175 | - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py` | ||
| 176 | - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 177 | - `acr-engine/scripts/run_songcentric_directory_pipeline_live.py` | ||
| 178 | 125 | ||
| 179 | --- | 126 | --- |
| 180 | 127 | ||
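| 181 | ## One-sentence conclusion | 128 | ## One-sentence conclusion |
| 182 | 129 | |
| 183 | > When a new teammate takes over, run the runner first, then read the 5 core documents; the primary problem right now is environment prerequisites, not the schema/contract itself. | 130 | > The most important thing right now is to hold the 4-table song-centric main pipeline and actually wire in the semantic encoder on top of it. | ... | ... |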
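As an illustration of section 4.3 in the new Start Here doc, the sketch below maps one song, one asset, and its windows onto the four tables. Only the key fields listed in that table come from the docs. The id columns (`entity_id`, `object_id`), the `name` and `uri` columns, and the sample values for `model_name`, `model_version`, `feature_set_name`, and `set_name` are hypothetical placeholders, and the real schema may differ.

```python
from itertools import count


def build_rows(song_name, asset_uri, window_fingerprints, set_name="reference_demo"):
    """Map one song, one asset, and its windows onto the 4 tables.

    Column names other than the key fields from section 4.3 are
    hypothetical placeholders chosen for this sketch.
    """
    next_id = count(1)
    rows = {
        "media_entity": [],
        "audio_object": [],
        "feature_fact": [],
        "set_membership": [],
    }

    # song -> media_entity (entity_type='song')
    song_id = next(next_id)
    rows["media_entity"].append(
        {"entity_id": song_id, "entity_type": "song", "name": song_name}
    )

    # original audio file -> audio_object (object_type='asset')
    asset_id = next(next_id)
    rows["audio_object"].append(
        {"object_id": asset_id, "entity_id": song_id, "object_type": "asset",
         "parent_object_id": None, "uri": asset_uri}
    )

    for fingerprint in window_fingerprints:
        # slice window -> audio_object (object_type='window', parent is the asset)
        window_id = next(next_id)
        rows["audio_object"].append(
            {"object_id": window_id, "entity_id": song_id, "object_type": "window",
             "parent_object_id": asset_id}
        )
        # fingerprint and model info share one feature_fact row
        rows["feature_fact"].append(
            {"object_id": window_id, "feature_type": "fingerprint",
             "fingerprint_value": fingerprint,
             "model_name": "chromaprint",       # placeholder value
             "model_version": "unknown",        # placeholder value
             "feature_set_name": "exact_lane"}  # placeholder value
        )
        # reference/eval/hot membership -> set_membership
        rows["set_membership"].append(
            {"object_id": window_id, "set_type": "reference_set", "set_name": set_name}
        )
    return rows


if __name__ == "__main__":
    demo = build_rows("demo song", "demo.wav", ["fp-a", "fp-b"])
    for table, table_rows in demo.items():
        print(table, len(table_rows))
```

The point of the sketch is that assets and windows share `audio_object` and are distinguished only by `object_type` and `parent_object_id`, so swapping the model changes `feature_fact` rows rather than the table layout.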