Why the schema docs need one song-centric story, not parallel histories
Constraint: New teammates must understand where slice/model/feature data lands without reading deprecated v2/planner-worker material
Rejected: Keep old docs with disclaimers | still leaves two competing mental models in the default docs path
Confidence: high
Scope-risk: narrow
Directive: Keep future docs anchored on the 4-table song-centric path unless the physical schema default truly changes
Tested: markdown link check on /workspace/docs; staged diff review; verified referenced wrapper script is present
Not-tested: No database or pipeline rerun was needed for this docs-only consolidation
Showing 13 changed files with 142 additions and 892 deletions
| 1 | #!/usr/bin/env bash | ||
| 2 | set -euo pipefail | ||
| 3 | |||
| 4 | ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" | ||
| 5 | PYTHON_BIN="${PYTHON_BIN:-/usr/local/miniconda3/bin/python}" | ||
| 6 | DSN="${1:-${PG_DSN:-}}" | ||
| 7 | SCHEMA="${2:-${PG_SCHEMA:-acr_songcentric_test}}" | ||
| 8 | INPUT_ROOT="${3:-$ROOT_DIR/data/songcentric_builder_smoke}" | ||
| 9 | OUTPUT_DIR="${4:-$ROOT_DIR/data/pgvector_eval/music20}" | ||
| 10 | |||
| 11 | if [[ -z "$DSN" ]]; then | ||
| 12 | echo "usage: $0 <postgres-dsn> [schema] [input-root] [output-dir]" >&2 | ||
| 13 | echo "or set PG_DSN before running this script" >&2 | ||
| 14 | exit 1 | ||
| 15 | fi | ||
| 16 | |||
| 17 | cd "$ROOT_DIR/.." | ||
| 18 | "$PYTHON_BIN" acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | ||
| 19 | --dsn "$DSN" \ | ||
| 20 | --schema "$SCHEMA" \ | ||
| 21 | --input-root "${INPUT_ROOT#$ROOT_DIR/..\/}" \ | ||
| 22 | --output-dir "${OUTPUT_DIR#$ROOT_DIR/..\/}" |
| 1 | # Changelog | ||
| 2 | |||
| 1 | ## 2026-06-04 | 3 | ## 2026-06-04 |
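| 4 | - Consolidated `docs/` onto the current song-centric mainline, keeping only six core documents (`README / start-here / session-handoff / postgresql-data-model / postgres_db_schema_samples / CHANGELOG`) and deleting the old v2 / planner-worker / registry extension docs, so that new teammates do not wander into designs that have been demoted to secondary status. | ||
| 5 | - Rewrote `docs/postgresql-data-model.md` to spell out where slice data, models, and features land: a `window` lands in `audio_object`, model identity lands in `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` values also land in `feature_fact`. | ||
| 6 | - Rewrote `docs/postgres_db_schema_samples.md` and the entry docs, adding a flowchart of the current 4-table main chain, typical SQL samples, the query trace-back path, and the write order, and aligned all docs on `media_entity -> audio_object -> feature_fact -> set_membership`. | ||
| 7 | |||
| 8 | ## 2026-06-04 | ||
| 9 | |||
| 10 | - Added `acr-engine/scripts/start_songcentric_shortest_path.sh`, narrowing the current default mainline down to a single shell entry point that can be copied and run directly; re-verified against the fresh runner results. | ||
| 2 | 11 | ||
| 3 | - Promoted `run_songcentric_directory_pipeline_live.py` to the current default mainline entry point and synced the fresh runner results into `docs/README.md`, `docs/start-here.md`, and `docs/session-handoff.md`, lowering the cost of resuming in the next session. | 12 | - Promoted `run_songcentric_directory_pipeline_live.py` to the current default mainline entry point and synced the fresh runner results into `docs/README.md`, `docs/start-here.md`, and `docs/session-handoff.md`, lowering the cost of resuming in the next session. |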
| 4 | 13 | ... | ... |
| 1 | # ACR Docs Overview | 1 | # ACR Docs Overview |
| 2 | 2 | ||
| 3 | > Only documents directly related to the **song-centric + fusion-first** ACR design are kept. | 3 | > The docs now keep only documents directly related to the **song-centric + 4-table fused schema**. |
| 4 | 4 | ||
| 5 | --- | 5 | --- |
| 6 | 6 | ||
| 7 | ## 0. What new teammates should do first | 7 | ## 1. What to read first |
| 8 | 8 | ||
| 9 | To continue on the song-centric mainline, run this first: | 9 | Reading order for new teammates: |
| 10 | 10 | ||
| 11 | ```bash | 11 | 1. [start-here.md](./start-here.md) |
| 12 | cd /workspace | 12 | 2. [session-handoff.md](./session-handoff.md) |
| 13 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | 13 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 14 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | 14 | 4. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) |
| 15 | --schema acr_songcentric_test \ | 15 | 5. [CHANGELOG.md](./CHANGELOG.md) |
| 16 | --input-root acr-engine/data/songcentric_builder_smoke \ | ||
| 17 | --output-dir acr-engine/data/pgvector_eval/music20 | ||
| 18 | ``` | ||
| 19 | |||
| 20 | To regression-test the old planner/worker contract, then run: | ||
| 21 | |||
| 22 | ```bash | ||
| 23 | cd /workspace/acr-engine | ||
| 24 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 25 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 26 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 27 | ``` | ||
| 28 | |||
| 29 | Alternatively, use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | ||
| 30 | |||
| 31 | Current fresh evidence: | ||
| 32 | - `executed_count = 4` | ||
| 33 | - `all_passed = true` | ||
| 34 | 16 | ||
| 35 | --- | 17 | --- |
| 36 | 18 | ||
| 37 | ## 1. Current default design baseline | 19 | ## 2. Current default design baseline |
| 38 | 20 | ||
| 39 | Phase-1 is currently understood as follows by default: | 21 | Logical semantics: |
| 40 | 22 | ||
| 41 | ```text | 23 | ```text |
| 42 | song -> asset -> window -> fingerprint / embedding | 24 | song -> asset -> window -> fingerprint / embedding |
| 43 | ``` | 25 | ``` |
| 44 | 26 | ||
| 45 | Corresponding fusion-first physical tables: | 27 | Physical tables: |
| 46 | 28 | ||
| 47 | ```text | 29 | ```text |
| 48 | media_entity -> audio_object -> feature_fact -> set_membership | 30 | media_entity -> audio_object -> feature_fact -> set_membership |
| 49 | ``` | 31 | ``` |
| 50 | 32 | ||
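| 33 | Core goals: | ||
| 34 | - Reliably return a `song_id` as the final result | ||
| 35 | - A single `song` may have multiple audio files | ||
| 36 | - `window` is the smallest unit for slicing, evidence, and recall | ||
| 37 | - `feature_fact` carries both the exact lane and the semantic lane | ||
| 38 | - Phase-1 reuses open-source encoders directly, with no training or fine-tuning up front | ||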
| 39 | |||
| 51 | --- | 40 | --- |
| 52 | 41 | ||
| 53 | ## 2. Required reading | 42 | ## 3. One-command validation of the main chain |
| 54 | 43 | ||
| 55 | 1. [start-here.md](./start-here.md) | 44 | ```bash |
| 56 | 2. [session-handoff.md](./session-handoff.md) | 45 | cd /workspace |
| 57 | 3. [acr-architecture.md](./acr-architecture.md) | 46 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ |
| 58 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | 47 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ |
| 59 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | 48 | --schema acr_songcentric_test \ |
| 49 | --input-root acr-engine/data/songcentric_builder_smoke \ | ||
| 50 | --output-dir acr-engine/data/pgvector_eval/music20 | ||
| 51 | ``` | ||
| 60 | 52 | ||
| 61 | --- | 53 | Wrapper script: |
| 62 | 54 | ||
| 63 | ## 3. Implementation-related documents | 55 | ```bash |
| 56 | acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2' | ||
| 57 | ``` | ||
| 64 | 58 | ||
| 65 | - [postgresql-data-model.md](./postgresql-data-model.md): the only current default data model; includes the table mapping for slices/models/features and a flowchart | 59 | Current fresh evidence: |
| 66 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples | 60 | - `song_count = 2` |
| 67 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization | 61 | - `asset_count = 2` |
| 68 | - [phase1-worker-contract.md](./phase1-worker-contract.md): contract for worker, job, and failure semantics | 62 | - `window_count = 5` |
| 69 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 implementation checklist | 63 | - `matcher_fingerprint_count = 5` |
| 70 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy | 64 | - `fallback_fingerprint_count = 0` |
| 71 | - [sota-evolution-guide.md](./sota-evolution-guide.md): current SOTA evolution mainline | 65 | - `semantic_runtime_available = false` |
| 66 | - `import_counts.feature_fact = 24` | ||
| 72 | 67 | ||
| 73 | --- | 68 | --- |
| 74 | 69 | ||
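| 75 | ## 4. Current stable conclusions | 70 | ## 4. What each retained document is for |
| 76 | 71 | ||
| 77 | - For final attribution, the only current requirement is to reliably return a `song_id` | 72 | - [start-here.md](./start-here.md): 10-minute onboarding entry point for new teammates |
| 78 | - A single `song` may have multiple audio files | 73 | - [session-handoff.md](./session-handoff.md): where to pick up in the next session |
| 79 | - `recording/version` is not a required return object for now | 74 | - [postgresql-data-model.md](./postgresql-data-model.md): table design, field semantics, flowcharts, design trade-offs |
| 80 | - `window` is retained because it is the smallest unit for evidence / offset / retrieval | 75 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): DDL, sample data, typical SQL, import and query paths |
| 81 | - `feature_fact` holds both `fingerprint` and `embedding` | 76 | - [CHANGELOG.md](./CHANGELOG.md): change history |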
| 82 | 77 | ||
| 83 | --- | 78 | --- |
| 84 | 79 | ||
| 85 | ## 5. Docs maintenance command | 80 | ## 5. Docs maintenance command |
| 86 | 81 | ||
| 87 | ```bash | 82 | ```bash |
| 88 | /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs | 83 | /usr/local/miniconda3/bin/python /workspace/scripts/check_markdown_links.py --root /workspace/docs |
| 89 | ``` | 84 | ``` |
| 90 | |||
| 91 | By default, historical archive documents such as `CHANGELOG.md` are skipped. | ... | ... |
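
The 4-table main chain that the new README settles on (`media_entity -> audio_object -> feature_fact -> set_membership`) can be illustrated with a small self-contained sketch. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. Every column name other than `model_name`, `model_version`, and `feature_set_name` (which the changelog names) is an assumption made for illustration, not the repository's actual DDL; the real schema is described in `postgres_db_schema_samples.md`.

```python
import sqlite3

# Hypothetical, simplified DDL. The real tables live in PostgreSQL and
# almost certainly have more (and differently named) columns.
DDL = """
CREATE TABLE media_entity (entity_id TEXT PRIMARY KEY, entity_type TEXT);
CREATE TABLE audio_object (
    object_id TEXT PRIMARY KEY,
    entity_id TEXT REFERENCES media_entity(entity_id),
    parent_object_id TEXT,
    object_type TEXT,
    start_sec REAL,
    end_sec REAL
);
CREATE TABLE feature_fact (
    fact_id INTEGER PRIMARY KEY,
    object_id TEXT REFERENCES audio_object(object_id),
    feature_kind TEXT,
    model_name TEXT,
    model_version TEXT,
    feature_set_name TEXT,
    payload TEXT
);
CREATE TABLE set_membership (set_name TEXT, object_id TEXT);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database holding one song, one asset, one window."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute("INSERT INTO media_entity VALUES ('song-1', 'song')")
    # The asset and its window are both rows in audio_object.
    conn.execute(
        "INSERT INTO audio_object VALUES ('asset-1', 'song-1', NULL, 'asset', 0.0, 12.5)"
    )
    conn.execute(
        "INSERT INTO audio_object VALUES ('win-1', 'song-1', 'asset-1', 'window', 0.0, 5.0)"
    )
    # Model identity and the feature value share one feature_fact row.
    conn.execute(
        "INSERT INTO feature_fact "
        "(object_id, feature_kind, model_name, model_version, feature_set_name, payload) "
        "VALUES ('win-1', 'fingerprint', 'chromaprint', 'v1', 'chromaprint_5s', 'demo-payload')"
    )
    conn.execute("INSERT INTO set_membership VALUES ('reference_hot', 'win-1')")
    return conn


def trace_back(conn: sqlite3.Connection, fact_id: int):
    """Walk feature_fact -> audio_object (window) -> media_entity (song)."""
    return conn.execute(
        """
        SELECT me.entity_id, ao.parent_object_id, ao.object_id,
               ff.feature_kind, ff.model_name
        FROM feature_fact ff
        JOIN audio_object ao ON ao.object_id = ff.object_id
        JOIN media_entity me ON me.entity_id = ao.entity_id
        WHERE ff.fact_id = ?
        """,
        (fact_id,),
    ).fetchone()
```

Calling `trace_back(build_demo(), 1)` returns the song, asset, and window that a stored feature belongs to, which is the trace-back path the docs describe for resolving a hit to a `song_id`.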
docs/acr-architecture.md
deleted
100644 → 0
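| 1 | # ACR Architecture Blueprint | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: bring the current ACR prototype, the future SOTA evolution path, and the concerns of each role together in one readable system blueprint. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The repository has already validated a working hybrid recognition prototype: | ||
| 9 | |||
| 10 | - `Chromaprint / fingerprint`: fast exact / near-duplicate recall | ||
| 11 | - `ECAPA-style embedding`: the current semantic-vector recall baseline | ||
| 12 | - `melody-aware rerank`: reinforcement for weak-melody cases | ||
| 13 | |||
| 14 | However, to meet the future target of **copyright protection + 1M audio files / 300k songs**, the system should evolve toward: | ||
| 15 | |||
| 16 | 1. **Stable data conventions**: `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 17 | 2. **Replaceable base models**: `model_registry -> feature_set_registry -> embedding/index` | ||
| 18 | 3. **Layered retrieval chain**: exact lane + semantic lane + version/cover lane + aggregation | ||
| 19 | 4. **Serving separated from operations**: offline index building, online recall, review and normalization, and monitoring and governance each have clear responsibilities | ||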
| 20 | |||
| 21 | --- | ||
| 22 | |||
| 23 | ## 1. Overall system diagram | ||
| 24 | |||
| 25 | ```mermaid | ||
| 26 | flowchart TD | ||
| 27 | A[Audio Sources\nofficial masters / platform audio / crawled audio / UGC / recordings] --> B[Asset Normalization] | ||
| 28 | B --> C[Canonical Data Model\nSong / Work / Recording / Asset / Window] | ||
| 29 | |||
| 30 | C --> D1[Exact Lane\nChromaprint / Neural AFP] | ||
| 31 | C --> D2[Semantic Lane\nFoundation Encoder] | ||
| 32 | C --> D3[Version/Cover Lane\nPhase-2+] | ||
| 33 | |||
| 34 | D1 --> E[Candidate Aggregation] | ||
| 35 | D2 --> E | ||
| 36 | D3 --> E | ||
| 37 | |||
| 38 | E --> F[Canonical Song Decision] | ||
| 39 | F --> G[Service / Review / Audit] | ||
| 40 | ``` | ||
| 41 | |||
| 42 | --- | ||
| 43 | |||
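| 44 | ## 2. Current implementation vs target implementation | ||
| 45 | |||
| 46 | | Dimension | Current implementation | Target implementation | | ||
| 47 | |---|---|---| | ||
| 48 | | Base embedding model | ECAPA-style baseline | Primarily foundation encoders such as MERT / MuQ | | ||
| 49 | | Retrieval structure | chromaprint + embedding + melody | exact + semantic + version/cover + rerank | | ||
| 50 | | Primary data key | Centered on `song_id` | Layered `canonical_song / work / recording / asset / window` | | ||
| 51 | | Storage form | Prototype pgvector schema + file artifacts | PostgreSQL master data + replaceable vector/index layer | | ||
| 52 | | Service goal | Validate the closed loop | Copyright protection / attribution decisions / industrial-grade operations | | ||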
| 53 | |||
| 54 | --- | ||
| 55 | |||
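| 56 | ## 2.1 Why there seem to be "so many layers" right now | ||
| 57 | |||
| 58 | Because the current blueprint covers three dimensions at once: | ||
| 59 | |||
| 60 | 1. **Business attribution**: `song/work/recording` | ||
| 61 | 2. **Audio entities**: `asset/window` | ||
| 62 | 3. **Retrieval computation**: `feature/index/candidate/decision` | ||
| 63 | |||
| 64 | Putting these three kinds of concerns into one overall diagram makes them look like one very long chain. | ||
| 65 | In engineering terms, however, they are separate responsibilities: | ||
| 66 | |||
| 67 | - The business attribution layer answers: **who does it ultimately belong to** | ||
| 68 | - The audio entity layer answers: **which piece of audio was hit** | ||
| 69 | - The retrieval computation layer answers: **how that piece of audio was recalled** | ||
| 70 | |||
| 71 | --- | ||
| 72 | |||
| 73 | ## 2.2 How far the current minimum viable architecture can be narrowed | ||
| 74 | |||
| 75 | If the only goal at this stage is: | ||
| 76 | |||
| 77 | > Quickly and reliably match a query to the correct `song_id` | ||
| 78 | |||
| 79 | then Phase-1 can proceed entirely on the following minimal skeleton: | ||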
| 80 | |||
| 81 | ```text | ||
| 82 | song -> asset -> window -> fingerprint / embedding | ||
| 83 | ``` | ||
| 84 | |||
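| 85 | Why these are kept: | ||
| 86 | - `window` cannot be dropped: it is the smallest unit for offset, evidence, and multi-segment voting | ||
| 87 | - `feature_set_registry` / `feature_fact` cannot be dropped: otherwise the schema would be hard-wired to one model, and a later switch to MERT/MuQ would require changing it | ||
| 88 | - `asset` cannot be dropped: a single `song` will have multiple real audio files | ||
| 89 | |||
| 90 | Can be deferred: | ||
| 91 | - `recording` | ||
| 92 | - `work` | ||
| 93 | - A heavier `retrieval_index_registry` | ||
| 94 | - Finer-grained end-to-end audit tables | ||
| 95 | |||
| 96 | So the recommended position is not "cut every layer" but: | ||
| 97 | > **Phase-1 ships the song-centric minimum viable layers first; layers for version attribution, cover, and work governance are added later.** | ||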
| 98 | |||
| 99 | --- | ||
| 100 | |||
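| 101 | ## 3. Role views | ||
| 102 | |||
| 103 | ## 3.1 Product / architecture roles | ||
| 104 | |||
| 105 | Concerns: | ||
| 106 | - Whether copyright protection can ultimately resolve to a `canonical_song_id` | ||
| 107 | - Whether the distinction between `recording` and `work` is clear | ||
| 108 | - Whether the current stage sticks to "freeze the conventions first, iterate on models later" | ||
| 109 | - Whether the interfaces between teams are clear | ||
| 110 | |||
| 111 | Most relevant reading: | ||
| 112 | - This document | ||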
| 113 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 114 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 115 | |||
| 116 | --- | ||
| 117 | |||
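| 118 | ## 3.2 Development roles (backend / retrieval / data) | ||
| 119 | |||
| 120 | Concerns: | ||
| 121 | - How to import audio into the unified entity model | ||
| 122 | - How to window audio, build feature sets, and attach indexes | ||
| 123 | - How to go from a query to candidates and then normalize to a `canonical_song_id` | ||
| 124 | - How to support future switches of `model_name / model_version / feature_set` | ||
| 125 | |||
| 126 | Most relevant reading: | ||
| 127 | - This document | ||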
| 128 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 129 | |||
| 130 | --- | ||
| 131 | |||
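| 132 | ## 3.3 Operations / platform roles | ||
| 133 | |||
| 134 | Concerns: | ||
| 135 | - Offline jobs: feature extraction, index building, index rebuilding | ||
| 136 | - Online services: recall, aggregation, caching, observability | ||
| 137 | - Storage tiers: object storage, PostgreSQL, index backend | ||
| 138 | - Versioning: how encoder changes are rolled out gradually, rolled back, and dual-written / dual-indexed | ||
| 139 | |||
| 140 | Most relevant reading: | ||
| 141 | - This document | ||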
| 142 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 143 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 144 | |||
| 145 | --- | ||
| 146 | |||
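| 147 | ## 3.4 Base model / research roles | ||
| 148 | |||
| 149 | Concerns: | ||
| 150 | - Which open-source encoder to choose while Phase-1 does no fine-tuning | ||
| 151 | - How to define a feature_set: window length, hop, pooling, layer selection | ||
| 152 | - How to upgrade later from encoder-only to a version/cover lane | ||
| 153 | - How to plug in new models without breaking the data layer | ||
| 154 | |||
| 155 | Most relevant reading: | ||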
| 156 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 157 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ||
| 158 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 159 | |||
| 160 | --- | ||
| 161 | |||
| 162 | ## 4. Offline / online responsibility split | ||
| 163 | |||
| 164 | ```mermaid | ||
| 165 | flowchart LR | ||
| 166 | A[Offline\ndata governance / windowing / feature extraction / index building] --> B[Registered Artifacts\nfeature_set / index / metadata] | ||
| 167 | B --> C[Online\nquery encode / retrieve / aggregate / decide] | ||
| 168 | ``` | ||
| 169 | |||
| 170 | ### Offline responsibilities | ||
| 171 | - Asset normalization | ||
| 172 | - Metadata normalization | ||
| 173 | - Windowing | ||
| 174 | - Model feature extraction | ||
| 175 | - fingerprint / embedding index building | ||
| 176 | - Backfilling PostgreSQL metadata | ||
| 177 | |||
| 178 | ### Online responsibilities | ||
| 179 | - Receive the query | ||
| 180 | - Chunk / encode the query | ||
| 181 | - exact / semantic / version lane recall | ||
| 182 | - recording/work/song aggregation | ||
| 183 | - Output `canonical_song_id` + evidence | ||
| 184 | |||
| 185 | --- | ||
| 186 | |||
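| 187 | ## 5. Why the roles must be separated | ||
| 188 | |||
| 189 | Because this project is no longer a single model script but: | ||
| 190 | |||
| 191 | 1. **A data governance system**: whose audio it is and which recording/work/song it belongs to | ||
| 192 | 2. **A retrieval system**: how to find candidates from a query | ||
| 193 | 3. **A decision system**: which `canonical_song_id` is finally output | ||
| 194 | 4. **A serving system**: how to expose APIs and observability | ||
| 195 | 5. **An evolution system**: base models will change, but the data conventions must not churn along with them | ||
| 196 | |||
| 197 | --- | ||
| 198 | |||
| 199 | ## 6. Recommendations for the current stage | ||
| 200 | |||
| 201 | ### The priority now is not more training changes but: | ||
| 202 | |||
| 203 | 1. Stabilize the PostgreSQL data conventions first | ||
| 204 | 2. Solidify the `model_registry / feature_set_registry` structure first | ||
| 205 | 3. In Phase-1, use open-source encoders directly as the semantic lane baseline | ||
| 206 | 4. Keep the current ECAPA as the historical baseline / control group | ||
| 207 | |||
| 208 | ### Items retained in the current system | ||
| 209 | - `Chromaprint`: retained | ||
| 210 | - `ECAPA baseline`: retained as a control group | ||
| 211 | - `melody rerank`: retained as a supplementary lane, no longer the main direction of evolution | ||
| 212 | |||
| 213 | ### Items being upgraded in the current system | ||
| 214 | - Primary semantic lane encoder -> foundation model | ||
| 215 | - Prototype pgvector schema -> extensible PostgreSQL data model | ||
| 216 | - Flat song_id -> canonical/work/recording/recording_asset/audio_window | ||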
| 217 | |||
| 218 | --- | ||
| 219 | |||
| 220 | ## 7. Mapping to code | ||
| 221 | |||
| 222 | | Code / doc | Current role | | ||
| 223 | |---|---| | ||
| 224 | | `acr-engine/src/engines/chromaprint_matcher.py` | exact lane prototype | | ||
| 225 | | `acr-engine/src/engines/ecapa_embedder.py` | current embedding lane baseline | | ||
| 226 | | `acr-engine/src/engines/hybrid_engine.py` | current aggregation prototype | | ||
| 227 | | `acr-engine/sql/pgvector_schema.sql` | early pgvector prototype | | ||
| 228 | | `acr-engine/sql/acr_pg_schema_v2.sql` | recommended PostgreSQL V2 schema | | ||
| 229 | | [postgresql-data-model.md](./postgresql-data-model.md) | V2 schema design notes | | ||
| 230 | |||
| 231 | --- | ||
| 232 | |||
| 233 | ## 8. Reading suggestions | ||
| 234 | |||
| 235 | If you are: | ||
| 236 | - **An architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next | ||
| 237 | - **A data/backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next | ||
| 238 | - **A model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) |
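
The deleted blueprint argues that the feature layer must stay model-agnostic so that a later switch to MERT/MuQ does not force a schema change. One way to read that requirement is that features are keyed by model identity instead of being stored in model-specific columns. The sketch below illustrates the idea in memory; `FeatureKey`, `FeatureStore`, and the version strings are hypothetical names chosen for this example, not code from the repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureKey:
    """Model identity for one stored feature (hypothetical helper)."""
    model_name: str
    model_version: str
    feature_set_name: str


class FeatureStore:
    """In-memory stand-in for a feature table keyed by (window, model identity)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, FeatureKey], List[float]] = {}

    def put(self, window_id: str, key: FeatureKey, vector: List[float]) -> None:
        # Copy the vector so later caller-side mutation does not leak in.
        self._rows[(window_id, key)] = list(vector)

    def get(self, window_id: str, key: FeatureKey) -> List[float]:
        return self._rows[(window_id, key)]

    def feature_sets_for(self, window_id: str) -> List[str]:
        """List the feature sets that currently exist for one window."""
        return sorted(k.feature_set_name for (w, k) in self._rows if w == window_id)


# Two encoders with different dimensionality coexist for the same window.
ecapa = FeatureKey("ecapa", "baseline", "ecapa__window_5s")
mert = FeatureKey("mert", "v1-95m", "mert_v1_95m__window_5s_hop_2.5s__meanpool__l2")

store = FeatureStore()
store.put("win-1", ecapa, [0.1, 0.2])
store.put("win-1", mert, [0.3, 0.4, 0.5])
```

Adding a third encoder here means adding rows under a new key, not new columns, which is the property the blueprint asks the data layer to preserve.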
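| 1 | # Phase-1 Encoder-only Implementation Checklist | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: break the Phase-1 plan of "no fine-tuning yet, open-source encoders first" into executable steps so the data, retrieval, platform, and operations teams can proceed in parallel. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | The Phase-1 deliverable is not to "prove that some new model is the absolute best" but to: | ||
| 9 | |||
| 10 | 1. Settle the **PostgreSQL master data model** | ||
| 11 | 2. Get **reference assets / window / feature_set** working end to end | ||
| 12 | 3. Establish an encoder-only baseline with **MERT + MuQ** | ||
| 13 | 4. Get the aggregation chain for **fingerprint lane + semantic lane** working first | ||
| 14 | 5. Leave an interface ready for the Phase-2 version/cover lane | ||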
| 15 | |||
| 16 | --- | ||
| 17 | |||
| 18 | ## 1. Delivery scope | ||
| 19 | |||
| 20 | ### Must be completed in this phase | ||
| 21 | - Load `canonical_song / work / recording / recording_asset / audio_window` into the database | ||
| 22 | - Initialize `model_registry / feature_set_registry` | ||
| 23 | - MERT/MuQ encoder-only feature extraction | ||
| 24 | - Build the hot reference set | ||
| 25 | - Build the semantic index | ||
| 26 | - The basic closed loop query -> candidate -> canonical_song | ||
| 27 | |||
| 28 | ### Not required in this phase | ||
| 29 | - Base model fine-tuning | ||
| 30 | - Cover-specific training | ||
| 31 | - A humming-specific melody tower | ||
| 32 | - Loading all cold data into the hot index | ||
| 33 | |||
| 34 | --- | ||
| 35 | |||
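| 36 | ## 2. Role responsibilities | ||
| 37 | |||
| 38 | | Role | Main deliverables | | ||
| 39 | |---|---| | ||
| 40 | | Data engineering | Asset cleaning, deduplication, entity mapping, window manifest | | ||
| 41 | | Backend/DBA | PostgreSQL DDL, indexes, write path, validation constraints | | ||
| 42 | | Retrieval engineering | fingerprint lane, semantic lane, aggregation logic | | ||
| 43 | | Model engineering | MERT/MuQ integration, feature_set design, feature extraction scripts | | ||
| 44 | | Platform/operations | Offline job orchestration, object storage, hot/cold index governance | | ||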
| 45 | |||
| 46 | --- | ||
| 47 | |||
| 48 | ## 3. Staged checklist | ||
| 49 | |||
| 50 | ## Stage 1: Load master data | ||
| 51 | |||
| 52 | ### Goal | ||
| 53 | Stabilize the business fact layer, independent of any specific encoder. | ||
| 54 | |||
| 55 | ### Checklist | ||
| 56 | - [ ] Create the database by running `acr-engine/sql/acr_pg_schema_v2.sql` | ||
| 57 | - [ ] Initialize `canonical_song` | ||
| 58 | - [ ] Initialize `work` | ||
| 59 | - [ ] Initialize `recording` | ||
| 60 | - [ ] Initialize `recording_asset` | ||
| 61 | - [ ] Verify the lineage trigger works | ||
| 62 | - [ ] Smoke-test inserts with a small batch of reference data | ||
| 63 | |||
| 64 | ### Outputs | ||
| 65 | - PostgreSQL schema v2 | ||
| 66 | - Initial entity data | ||
| 67 | - Reusable data import scripts | ||
| 68 | |||
| 69 | --- | ||
| 70 | |||
| 71 | ## Stage 2: Reference assets and windowing | ||
| 72 | |||
| 73 | ### Goal | ||
| 74 | Build the "retrievable" reference collection. | ||
| 75 | |||
| 76 | ### Checklist | ||
| 77 | - [ ] Select recordings with `is_reference=true` | ||
| 78 | - [ ] Create `reference_set_registry` | ||
| 79 | - [ ] Backfill `reference_set_member` | ||
| 80 | - [ ] Standardize audio paths uniformly | ||
| 81 | - [ ] Generate `audio_window` | ||
| 82 | - [ ] Mark `active_for_index` | ||
| 83 | |||
| 84 | ### Recommended rules | ||
| 85 | - Include only the primary reference version at first | ||
| 86 | - Default to `5s / 2.5s hop` first | ||
| 87 | - intro/outro can be kept for now, with quality pruning done later | ||
| 88 | |||
| 89 | --- | ||
| 90 | |||
| 91 | ## Stage 3: Model and feature_set initialization | ||
| 92 | |||
| 93 | ### Goal | ||
| 94 | Stabilize model registration and feature version definitions. | ||
| 95 | |||
| 96 | ### Checklist | ||
| 97 | - [ ] Register `chromaprint` | ||
| 98 | - [ ] Register `mert v1-95m` | ||
| 99 | - [ ] Register `muq` | ||
| 100 | - [ ] Register `mert 5s/2.5s mean pool` | ||
| 101 | - [ ] Register `mert 10s/5s mean pool` | ||
| 102 | - [ ] Register `muq 5s/2.5s mean pool` | ||
| 103 | - [ ] Specify the metric / quantization / dim of each feature_set | ||
| 104 | |||
| 105 | ### Outputs | ||
| 106 | - `model_registry` initialization data | ||
| 107 | - `feature_set_registry` initialization data | ||
| 108 | - feature set naming convention | ||
| 109 | |||
| 110 | --- | ||
| 111 | |||
| 112 | ## Stage 4: Encoder-only feature extraction | ||
| 113 | |||
| 114 | ### Goal | ||
| 115 | Without any training, turn the reference set directly into retrievable embeddings. | ||
| 116 | |||
| 117 | ### Checklist | ||
| 118 | - [ ] Extract MERT window embeddings | ||
| 119 | - [ ] Extract MuQ window embeddings | ||
| 120 | - [ ] Write to `audio_embedding` | ||
| 121 | - [ ] Write hot data to `audio_embedding_vector_768` or the corresponding physical table | ||
| 122 | - [ ] Store cold data in object storage / parquet | ||
| 123 | - [ ] Backfill `is_indexed` | ||
| 124 | |||
| 125 | ### Verification | ||
| 126 | - [ ] Spot-check by random sampling that the `window -> embedding -> feature_set` back-link works | ||
| 127 | - [ ] Check vector norm / missing rate / duplicate rate | ||
| 128 | |||
| 129 | --- | ||
| 130 | |||
| 131 | ## Stage 5: Indexing and recall | ||
| 132 | |||
| 133 | ### Goal | ||
| 134 | Get dual-lane recall across the semantic lane and the exact lane working. | ||
| 135 | |||
| 136 | ### Checklist | ||
| 137 | - [ ] Build the fingerprint index | ||
| 138 | - [ ] Build the semantic index | ||
| 139 | - [ ] Backfill `retrieval_index_registry` | ||
| 140 | - [ ] Encode the query | ||
| 141 | - [ ] Return `retrieval_candidate` | ||
| 142 | - [ ] Aggregate to `recording / work / canonical_song` | ||
| 143 | - [ ] Run `phase1_prereq_audit` | ||
| 144 | - [ ] Run `phase1_worker_contract_smoke` | ||
| 145 | - [ ] Run `semantic_vector_negative_matrix` | ||
| 146 | - [ ] Run `asset_level_upsert_validation` | ||
| 147 | |||
| 148 | ### Suggested aggregation for the first version | ||
| 149 | - max score | ||
| 150 | - top-k average | ||
| 151 | - hit windows count | ||
| 152 | - exact lane / semantic lane agreement bonus | ||
| 153 | |||
| 154 | --- | ||
| 155 | |||
| 156 | ## Stage 6: Baseline evaluation and release gates | ||
| 157 | |||
| 158 | ### Goal | ||
| 159 | First show that the Phase-1 structure is usable. | ||
| 160 | |||
| 161 | ### Checklist | ||
| 162 | - [ ] exact query bucket | ||
| 163 | - [ ] noisy/BGM bucket | ||
| 164 | - [ ] version-like bucket (even though the cover lane is not trained for now) | ||
| 165 | - [ ] Top1/Top3/MRR | ||
| 166 | - [ ] canonical_song recall | ||
| 167 | - [ ] work-level recall | ||
| 168 | - [ ] reference set version records | ||
| 169 | |||
| 170 | --- | ||
| 171 | |||
| 172 | ## 4. Recommended order | ||
| 173 | |||
| 174 | ```mermaid | ||
| 175 | flowchart TD | ||
| 176 | A[Load Schema v2] --> B[Entity import] | ||
| 177 | B --> C[reference set initialization] | ||
| 178 | C --> D[audio_window generation] | ||
| 179 | D --> E[model/feature_set initialization] | ||
| 180 | E --> F[MERT/MuQ feature extraction] | ||
| 181 | F --> G[semantic index] | ||
| 182 | C --> H[fingerprint index] | ||
| 183 | G --> I[candidate aggregation] | ||
| 184 | H --> I | ||
| 185 | I --> J[Phase-1 benchmark] | ||
| 186 | ``` | ||
| 187 | |||
| 188 | --- | ||
| 189 | |||
| 190 | ## 5. Acceptance criteria for the first version | ||
| 191 | |||
| 192 | ### Data layer | ||
| 193 | - Can reliably insert `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 194 | - Can support at least one `reference_set` | ||
| 195 | |||
| 196 | ### Model/feature layer | ||
| 197 | - Multiple `model_registry / feature_set_registry` entries can coexist | ||
| 198 | - MERT/MuQ encoder-only feature extraction runs end to end | ||
| 199 | |||
| 200 | ### Retrieval layer | ||
| 201 | - Can return candidates from both the fingerprint lane and the semantic lane | ||
| 202 | - Can aggregate to an output `canonical_song_id` | ||
| 203 | |||
| 204 | ### Operations layer | ||
| 205 | - Can rebuild the reference set | ||
| 206 | - Can rebuild the semantic index | ||
| 207 | - Can record feature_set and index versions | ||
| 208 | |||
| 209 | --- | ||
| 210 | |||
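| 211 | ## 6. Common pitfalls in this phase | ||
| 212 | |||
| 213 | 1. Locking the embedding storage design to one model's dimensionality too early | ||
| 214 | 2. Keeping only song_id and dropping work/recording | ||
| 215 | 3. Not versioning the reference set | ||
| 216 | 4. Query results that cannot be traced back to a specific evidence window | ||
| 217 | 5. Removing the exact lane too early | ||
| 218 | |||
| 219 | --- | ||
| 220 | |||
| 221 | ## 7. Current recommendation | ||
| 222 | |||
| 223 | If you need to schedule work right away, the suggested priority is: | ||
| 224 | |||
| 225 | 1. Schema v2 and master data import | ||
| 226 | 2. reference set + audio_window | ||
| 227 | 3. MERT/MuQ feature_set initialization | ||
| 228 | 4. Encoder-only feature extraction | ||
| 229 | 5. Dual-lane recall and aggregation | ||
| 230 | 6. Benchmark and release gates | ||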
| 231 | |||
| 232 | |||
| 233 | ## 6.1 Validation entrypoints the planner currently provides | ||
| 234 | |||
| 235 | In addition to job-level `command_suggestions`, `acr-engine/scripts/plan_phase1_extraction_jobs_live.py` now also includes the following in `phase1_extraction_plan_report.json`: | ||
| 236 | |||
| 237 | - `validation_commands.prereq_audit` | ||
| 238 | - `validation_commands.worker_contract_smoke` | ||
| 239 | - `validation_commands.semantic_vector_negative_matrix` | ||
| 240 | - `validation_commands.asset_level_upsert_validation` | ||
| 241 | |||
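| 242 | This means that on the next start you can run the "global validation entrypoint" first and then decide whether to run specific jobs, without hand-assembling test commands. | ||
| 243 | |||
| 244 | |||
| 245 | ## 6.2 Currently recommended one-command validation entrypoint | ||
| 246 | |||
| 247 | If you only want to confirm whether the current host is ready to continue with Phase-1, run this first: | ||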
| 248 | |||
| 249 | ```bash | ||
| 250 | cd /workspace/acr-engine | ||
| 251 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 252 | ``` | ||
| 253 | |||
| 254 | It reads `validation_commands` directly from `phase1_extraction_plan_report.json` and runs the following in batch: | ||
| 255 | |||
| 256 | - `prereq_audit` | ||
| 257 | - `worker_contract_smoke` | ||
| 258 | - `semantic_vector_negative_matrix` | ||
| 259 | - `asset_level_upsert_validation` | ||
| 260 | |||
| 261 | Current live results: | ||
| 262 | |||
| 263 | - `executed_count = 4` | ||
| 264 | - `all_passed = true` |
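
The first-version aggregation suggestions in Stage 5 of the deleted checklist (max score, top-k average, hit windows count, and an exact/semantic agreement bonus) can be sketched as a small window-to-song vote. This is an illustration only: `WindowHit`, the `top_k` and `agreement_bonus` defaults, and the way the final score is combined are assumptions made for this example, since the checklist lists the signals without fixing a formula.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class WindowHit:
    """One recalled window (hypothetical shape)."""
    song_id: str
    lane: str     # "exact" or "semantic"
    score: float  # higher is better; assumed comparable across lanes


def aggregate(hits: Iterable[WindowHit], top_k: int = 3,
              agreement_bonus: float = 0.1) -> List[dict]:
    """Group window hits by song and compute the four suggested signals."""
    by_song: Dict[str, List[WindowHit]] = defaultdict(list)
    for hit in hits:
        by_song[hit.song_id].append(hit)

    ranked = []
    for song_id, song_hits in by_song.items():
        scores = sorted((h.score for h in song_hits), reverse=True)
        top = scores[:top_k]
        lanes = {h.lane for h in song_hits}
        # Bonus only when both lanes independently point at the same song.
        bonus = agreement_bonus if {"exact", "semantic"} <= lanes else 0.0
        ranked.append({
            "song_id": song_id,
            "max_score": scores[0],
            "topk_avg": sum(top) / len(top),
            "hit_windows": len(song_hits),
            "final": scores[0] + bonus,  # illustrative combination, not the repo's rule
        })
    # Rank by final score, breaking ties by the number of supporting windows.
    ranked.sort(key=lambda r: (r["final"], r["hit_windows"]), reverse=True)
    return ranked
```

With this combination, a song supported by both lanes can outrank a song with a slightly higher single-lane score, which is the behavior the agreement bonus is meant to encourage.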
docs/phase1-worker-contract.md
deleted
100644 → 0
docs/sota-evolution-guide.md
deleted
100644 → 0
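| 1 | # SOTA Evolution Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: lay out a Phase-1 route of "no fine-tuning yet, open-source encoders first" and clarify how to evolve later into a stronger copyright protection / version attribution system. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | If the current constraints are: | ||
| 9 | - No fine-tuning of the base model yet | ||
| 10 | - Data conventions must be settled first | ||
| 11 | - The basic retrieval and attribution problems for 1M audio files / 300k songs must be solved first | ||
| 12 | |||
| 13 | then the most reasonable Phase-1 route is not to "retrain a new model" but to: | ||
| 14 | |||
| 15 | 1. **Keep the exact lane**: Chromaprint / fingerprint | ||
| 16 | 2. **Primary semantic lane base model**: MERT-v1-95M | ||
| 17 | 3. **Semantic lane challenger**: MuQ | ||
| 18 | 4. **Stabilize the database first**: `model_registry + feature_set_registry + audio_embedding + retrieval_index_registry` | ||
| 19 | 5. **Aggregate results layer by layer first**: window -> recording -> work -> canonical_song | ||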
| 20 | |||
| 21 | --- | ||
| 22 | |||
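| 23 | ## 1. Why Phase-1 is encoder-only for now | ||
| 24 | |||
| 25 | Because your most pressing problem right now is not "the limit of model accuracy" but: | ||
| 26 | |||
| 27 | - The catalog is large: 1M audio files / 300k songs | ||
| 28 | - Data relationships are complex: one song may have multiple recordings, versions, and assets from multiple sources | ||
| 29 | - If the data conventions are unstable, every future model upgrade will cause repeated rework | ||
| 30 | |||
| 31 | So the Phase-1 goal should be: | ||
| 32 | |||
| 33 | ```mermaid | ||
| 34 | flowchart LR | ||
| 35 | A[Freeze data conventions] --> B[Integrate open-source encoders] | ||
| 36 | B --> C[Establish semantic baseline] | ||
| 37 | C --> D[Validate large-scale indexing and aggregation] | ||
| 38 | D --> E[Then decide whether to move to fine-tuning / version lane] | ||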
| 39 | ``` | ||
| 40 | |||
| 41 | --- | ||
| 42 | |||
| 43 | ## 2. Recommended phases | ||
| 44 | |||
| 45 | ## Phase-0: Current repository state (already in place) | ||
| 46 | - `Chromaprint + ECAPA + melody rerank` | ||
| 47 | - The training / index building / evaluation / serving loop runs end to end | ||
| 48 | - Suitable as a baseline, not as the final production base model | ||
| 49 | |||
| 50 | ## Phase-1: Encoder-only foundation baseline (currently recommended) | ||
| 51 | - exact lane: Chromaprint | ||
| 52 | - semantic lane: MERT-v1-95M | ||
| 53 | - challenger: MuQ | ||
| 54 | - No base model fine-tuning | ||
| 55 | - Only feature extraction + index + aggregation | ||
| 56 | |||
| 57 | ## Phase-2: Version / Cover lane | ||
| 58 | - After the Phase-1 data model has stabilized | ||
| 59 | - Introduce a dedicated cover/version branch | ||
| 60 | - Strengthen work-level attribution | ||
| 61 | |||
| 62 | ## Phase-3: Industrial retrieval stack | ||
| 63 | - ANN + reranker | ||
| 64 | - online/offline artifact registry | ||
| 65 | - Monitoring, replay, auditing, manual review | ||
| 66 | |||
| 67 | --- | ||
| 68 | |||
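| 69 | ## 3. Recommended model combination for Phase-1 | ||
| 70 | |||
| 71 | ## 3.1 Exact lane | ||
| 72 | ### Choice | ||
| 73 | - Chromaprint / landmark hash | ||
| 74 | |||
| 75 | ### Role | ||
| 76 | - Clips of the original track | ||
| 77 | - Platform transcodes | ||
| 78 | - near-duplicate | ||
| 79 | - Strong matching on local segments | ||
| 80 | |||
| 81 | ### Why keep it | ||
| 82 | Copyright protection cannot rely on semantic embeddings alone. In many real complaint / evidence-gathering scenarios, the exact lane is still the fastest first path and the one with the strongest evidence. | ||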
| 83 | |||
| 84 | --- | ||
| 85 | |||
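| 86 | ## 3.2 Primary semantic lane model: MERT-v1-95M | ||
| 87 | |||
| 88 | ### Why it is recommended | ||
| 89 | - It is a music SSL foundation model | ||
| 90 | - It has a public paper and implementation | ||
| 91 | - It fits the role of a base model for music tasks better than a small self-trained ECAPA | ||
| 92 | - Using it as a frozen encoder in Phase-1 is lower in both cost and risk | ||
| 93 | |||
| 94 | ### Role in Phase-1 | ||
| 95 | - Produces window embeddings as the primary encoder | ||
| 96 | - Provides the baseline for noisy / BGM / general cross-domain retrieval | ||
| 97 | - Can later continue to serve as a teacher or to stay compatible with older index versions | ||
| 98 | |||
| 99 | ### Recommended feature sets | ||
| 100 | 1. `mert_v1_95m__window_5s_hop_2.5s__meanpool__l2` | ||
| 101 | 2. `mert_v1_95m__window_10s_hop_5s__meanpool__l2` | ||
| 102 | |||
| 103 | ### Why start with two | ||
| 104 | - `5s/2.5s`: better for local positioning | ||
| 105 | - `10s/5s`: better for overall semantic stability | ||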
| 106 | |||
| 107 | --- | ||
| 108 | |||
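| 109 | ## 3.3 Semantic lane challenger: MuQ | ||
| 110 | |||
| 111 | ### Why it is recommended | ||
| 112 | - Newer and closer to the next-generation music foundation model direction | ||
| 113 | - Worth using as a challenger baseline | ||
| 114 | - Even without fine-tuning, it may outperform earlier base models on some MIR tasks | ||
| 115 | |||
| 116 | ### Current recommendation | ||
| 117 | - In Phase-1, use it as a comparison group first rather than replacing MERT immediately | ||
| 118 | - Focus validation on: stability of the vector distribution, window-level retrieval performance, memory / inference cost | ||
| 119 | |||
| 120 | --- | ||
| 121 | |||
| 122 | ## 3.4 Why Phase-1 does not take CoverHunter as the mainline | ||
| 123 | |||
| 124 | Because CoverHunter's strengths lie in: | ||
| 125 | - cover song identification | ||
| 126 | - alignment / refined attention / coarse-to-fine training | ||
| 127 | |||
| 128 | whereas your current constraints are: | ||
| 129 | - No fine-tuning yet | ||
| 130 | - Open-source encoders first | ||
| 131 | - Settle the data and retrieval conventions first | ||
| 132 | |||
| 133 | So it is better suited as **the direction for the Phase-2 version/cover lane** than as the main Phase-1 baseline. | ||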
| 134 | |||
| 135 | --- | ||
| 136 | |||
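| 137 | ## 4. Concerns by role | ||
| 138 | |||
| 139 | ## 4.1 Base model role | ||
| 140 | Key concerns: | ||
| 141 | - Which encoders are registered in `model_registry` | ||
| 142 | - Each encoder's input SR, window, pooling, and embedding dim | ||
| 143 | - Which feature sets are production candidates and which are experimental only | ||
| 144 | |||
| 145 | ## 4.2 Retrieval role | ||
| 146 | Key concerns: | ||
| 147 | - How the fingerprint lane and semantic lane are combined | ||
| 148 | - `recording/work/song` aggregation rules | ||
| 149 | - How to output top-k candidates stably | ||
| 150 | |||
| 151 | ## 4.3 Data role | ||
| 152 | Key concerns: | ||
| 153 | - Asset deduplication | ||
| 154 | - Reference asset selection | ||
| 155 | - window manifest | ||
| 156 | - Whether full rebuilds of features and indexes are supported | ||
| 157 | |||
| 158 | ## 4.4 Operations / platform role | ||
| 159 | Key concerns: | ||
| 160 | - Whether encoder version switches can be rolled out gradually | ||
| 161 | - Whether index rebuilds can run in parallel | ||
| 162 | - Whether hot/cold and historical indexes can be rolled back | ||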
| 163 | |||
| 164 | --- | ||
| 165 | |||
| 166 | ## 5. Phase-1 implementation order | ||
| 167 | |||
| 168 | ```mermaid | ||
| 169 | flowchart TD | ||
| 170 | A[Freeze PostgreSQL data conventions] --> B[Import canonical/work/recording/asset/window] | ||
| 171 | B --> C[Register model_registry / feature_set_registry] | ||
| 172 | C --> D[Extract MERT features] | ||
| 173 | C --> E[Extract MuQ features] | ||
| 174 | D --> F[Build semantic index] | ||
| 175 | E --> F | ||
| 176 | F --> G[Aggregate with fingerprint lane] | ||
| 177 | G --> H[Output canonical_song_id / work_id / recording_id] | ||
| 178 | ``` | ||
| 179 | |||
| 180 | --- | ||
| 181 | |||
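| 182 | ## 6. Problems addressed in each phase | ||
| 183 | |||
| 184 | | Phase | Problems addressed | Problems deferred | | ||
| 185 | |---|---|---| | ||
| 186 | | Phase-1 | Data conventions, open-source base model baseline, rebuildable indexes, song/work/recording aggregation | Base model fine-tuning, cover-specific training, melody tower | | ||
| 187 | | Phase-2 | version/cover attribution, work-level recall | More complex cross-modal humming | | ||
| 188 | | Phase-3 | Industrial-grade serving, replay, monitoring, manual review loop | Pushing research SOTA to the limit | | ||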
| 189 | |||
| 190 | --- | ||
| 191 | |||
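| 192 | ## 7. Relationship to the current repository | ||
| 193 | |||
| 194 | ### Currently retained | ||
| 195 | - `ECAPA baseline`: retained for comparison, not as the long-term primary base model | ||
| 196 | - `Chromaprint`: retained, and very important in copyright protection scenarios | ||
| 197 | - `melody rerank`: retained as an auxiliary lane | ||
| 198 | |||
| 199 | ### Newly added | ||
| 200 | - `model_registry` | ||
| 201 | - `feature_set_registry` | ||
| 202 | - Foundation encoder feature extraction and registration | ||
| 203 | - A clearer `canonical_song / work / recording` data structure | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
| 207 | ## 8. Current recommendation | ||
| 208 | |||
| 209 | If a Phase-1 plan had to be fixed today, I would recommend: | ||
| 210 | |||
| 211 | 1. **Leave the training mainline unchanged for now and do not remove ECAPA** | ||
| 212 | 2. **Add a MERT-v1-95M semantic lane** | ||
| 213 | 3. **Add a MuQ challenger lane** | ||
| 214 | 4. **Build the hot index first from only the primary reference windows with `is_reference=true`** | ||
| 215 | 5. **Treat the PostgreSQL design as the main deliverable first** | ||
| 216 | |||
| 217 | In other words: | ||
| 218 | |||
| 219 | > The core of Phase-1 is not "which model ultimately wins" but getting the long-term structure of "data conventions + model registration + feature registration + index registration" stable first. |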
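| 1 | # Start Here / Onboarding entry point for new teammates | 1 | # Start Here / Onboarding entry point for new teammates |
| 2 | 2 | ||
| 3 | > Goal: let a new teammate learn **within 10 minutes** what to run first, what to read first, where things are currently blocked, and what to do next. | 3 | > Goal: let a new teammate learn **within 10 minutes** what the current main chain is, what to run first, and what to read first. |
| 4 | 4 | ||
| 5 | --- | 5 | --- |
| 6 | 6 | ||
| 7 | ## 1. Run this command first | 7 | ## 1. Run this command first |
| 8 | 8 | ||
| 9 | If the current goal is to validate the **song-centric real directory -> feature -> PostgreSQL** main chain, run this first: | ||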
| 10 | |||
| 11 | ```bash | 9 | ```bash |
| 12 | cd /workspace | 10 | cd /workspace |
| 13 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ | 11 | /usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \ |
| ... | @@ -17,167 +15,116 @@ cd /workspace | ... | @@ -17,167 +15,116 @@ cd /workspace |
| 17 | --output-dir acr-engine/data/pgvector_eval/music20 | 15 | --output-dir acr-engine/data/pgvector_eval/music20 |
| 18 | ``` | 16 | ``` |
| 19 | 17 | ||
| 18 | Or: | ||
| 19 | |||
| 20 | ```bash | ||
| 21 | acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2' | ||
| 22 | ``` | ||
| 23 | |||
| 20 | Current fresh evidence: | 24 | Current fresh evidence: |
| 21 | - `song_count = 2` | 25 | - `song_count = 2` |
| 26 | - `asset_count = 2` | ||
| 22 | - `window_count = 5` | 27 | - `window_count = 5` |
| 23 | - `matcher_fingerprint_count = 5` | 28 | - `matcher_fingerprint_count = 5` |
| 24 | - `fallback_fingerprint_count = 0` | 29 | - `fallback_fingerprint_count = 0` |
| 25 | - `semantic_runtime_available = false` | 30 | - `semantic_runtime_available = false` |
| 26 | - `import_counts.feature_fact = 24` | 31 | - `semantic_runtime_missing = [torch, torchaudio, transformers]` |
| 27 | 32 | - `import_counts = media_entity:9 / audio_object:22 / feature_fact:24 / set_membership:9` | |
| 28 | If your current goal is to validate the old Phase-1 planner/worker contract, also run this: | ||
| 29 | |||
| 30 | ```bash | ||
| 31 | cd /workspace/acr-engine | ||
| 32 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 33 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 34 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 35 | ``` | ||
| 36 | |||
| 37 | You can also use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | ||
| 38 | |||
| 39 | ### Current fresh evidence | ||
| 40 | - `executed_count = 4` | ||
| 41 | - `all_passed = true` | ||
| 42 | |||
| 43 | ### This command runs | ||
| 44 | 1. `prereq_audit` | ||
| 45 | 2. `worker_contract_smoke` | ||
| 46 | 3. `semantic_vector_negative_matrix` | ||
| 47 | 4. `asset_level_upsert_validation` | ||
| 48 | |||
| 49 | ### How to interpret the following results | ||
| 50 | If you see: | ||
| 51 | - `downloads_root_exists = false` | ||
| 52 | - `ready_jobs = 0` | ||
| 53 | - exact lane = `failed/unreadable_audio_assets` | ||
| 54 | - semantic lane = `4/4 failed` | ||
| 55 | |||
| 56 | then the current priorities are: | ||
| 57 | 1. Mount `/workspace/downloads` | ||
| 58 | 2. Install `torch / torchaudio / transformers / speechbrain` | ||
| 59 | |||
| 60 | In other words: | ||
| 61 | > The primary problem right now is runtime-environment prerequisites, not the PostgreSQL schema and not a worker contract design error. | ||
| 62 | 33 | ||
| 63 | --- | 34 | --- |
| 64 | 35 | ||
| 65 | ## 2. When taking over, read only these 5 documents | 36 | ## 2. Read only these 4 documents |
| 66 | 37 | ||
| 67 | 1. [README.md](./README.md) | 38 | 1. [README.md](./README.md) |
| 68 | 2. [session-handoff.md](./session-handoff.md) | 39 | 2. [session-handoff.md](./session-handoff.md) |
| 69 | 3. [acr-architecture.md](./acr-architecture.md) | 40 | 3. [postgresql-data-model.md](./postgresql-data-model.md) |
| 70 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | 41 | 4. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) |
| 71 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 72 | |||
| 73 | If you work on algorithms or retrieval, also read: | ||
| 74 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 75 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | ||
| 76 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 77 | 42 | ||
| 78 | --- | 43 | --- |
| 79 | 44 | ||
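| 80 | ## 3. The project in one sentence | 45 | ## 3. The current project in one sentence |
| 81 | 46 | ||
| 82 | We are building a music ACR system for **copyright protection / song identification / version attribution**, | 47 | We are currently building a **song-centric music ACR system for copyright protection**: |
| 83 | the goal is to quickly locate the correct `song_id` attribution among `100w` (1 million) audio files and about `30w` (300,000) songs; at this stage, version/recording is not a required return object. | 48 | the goal is to resolve queries involving recordings, BGM, clips, and covers to the `song_id` they belong to as quickly as possible, across about `100w` (1 million) audio files and about `30w` (300,000) songs. |
| 84 | 49 | ||
| 85 | --- | 50 | --- |
| 86 | 51 | ||
| 87 | ## 4. Current mainline approach | 52 | ## 4. The most important current design conclusions |
| 88 | 53 | ||
| 89 | ### Retrieval mainline | 54 | ### 4.1 The old multi-layer v2 system is no longer the default |
| 90 | - exact lane: `Chromaprint` | 55 | The default now recognizes only 4 core physical tables: |
| 91 | - semantic lane baseline: `MERT-v1-95M` | ||
| 92 | - semantic lane challenger: `MuQ` | ||
| 93 | - historical baseline: `ECAPA` | ||
| 94 | 56 | ||
| 95 | ### Current Phase-1 minimal mainline | ||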
| 96 | ```text | 57 | ```text |
| 97 | song -> asset -> window | 58 | media_entity -> audio_object -> feature_fact -> set_membership |
| 98 | ``` | 59 | ``` |
| 99 | 60 | ||
| 100 | ### Evolvable full mainline | 61 | ### 4.2 How to read the logical semantics |
| 101 | ```text | ||
| 102 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 103 | ``` | ||
| 104 | 62 | ||
| 105 | ### Model mainline | ||
| 106 | ```text | 63 | ```text |
| 107 | model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry | 64 | song -> asset -> window -> fingerprint / embedding |
| 108 | ``` | 65 | ``` |
| 109 | 66 | ||
| 110 | --- | 67 | ### 4.3 Where slices / models / features actually land |
| 111 | 68 | ||
| 112 | ## 5. What is already stable | 69 | | Object | Table | Key fields | |
| 70 | |---|---|---| | ||
| 71 | | song | `media_entity` | `entity_type='song'` | | ||
| 72 | | Original audio file | `audio_object` | `object_type='asset'` | | ||
| 73 | | Slice window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | | ||
| 74 | | Fingerprint feature | `feature_fact` | `feature_type='fingerprint'`, `fingerprint_value` | | ||
| 75 | | Embedding feature | `feature_fact` | `feature_type='embedding'`, `embedding_uri/vector_table_name` | | ||
| 76 | | Model info | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | | ||
| 77 | | reference/eval/hot sets | `set_membership` | `set_type`, `set_name` | | ||
| 113 | 78 | ||
| 114 | - PostgreSQL v2 schema has landed | 79 | --- |
| 115 | - registry bootstrap has been validated live | ||
| 116 | - worker contract has been validated live | ||
| 117 | - failure semantics of the exact / semantic lanes are auditable | ||
| 118 | - planner can output validation commands | ||
| 119 | - planner validation runner can be run with one command | ||
| 120 | 80 | ||
| 121 | ## 6. What is not done yet | 81 | ## 5. Current main-chain flowchart |
| 122 | 82 | ||
| 123 | - MERT / MuQ inference has not actually been run end to end | 83 | ```mermaid |
| 124 | - the current host has no `/workspace/downloads` | 84 | flowchart TD |
| 125 | - the current host is missing `torch / torchaudio / transformers / speechbrain` | 85 | A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset] |
| 126 | - the final online fusion strategy is not finished | 86 | B --> C[audio_object\nobject_type=window] |
| 127 | - a larger real-world reference set has not been connected | 87 | C --> D1[feature_fact\nfingerprint] |
| 88 | C --> D2[feature_fact\nembedding] | ||
| 89 | B --> E[set_membership\nreference_set / eval_set / hot_set] | ||
| 90 | C --> E | ||
| 91 | ``` | ||
| 128 | 92 | ||
| 129 | --- | 93 | --- |
| 130 | 94 | ||
| 131 | ## 7. If you continue now, follow this order | 95 | ## 6. What is already stable |
| 132 | |||
| 133 | ### Route A: fix the environment first | ||
| 134 | 1. Mount `/workspace/downloads` | ||
| 135 | 2. Install the semantic runtime dependencies | ||
| 136 | 3. Rerun the planner validation runner | ||
| 137 | 4. Check whether `ready_jobs` starts to recover | ||
| 138 | 96 | ||
| 139 | ### Route B: fix the implementation first | 97 | - live PostgreSQL schema: tables have actually been created successfully |
| 140 | 1. Read [phase1-worker-contract.md](./phase1-worker-contract.md) | 98 | - real directory -> manifest -> import works end to end |
| 141 | 2. Read `acr-engine/workers/run_embedding_job.py` | 99 | - real directory -> fingerprint enrichment -> import works end to end |
| 142 | 3. Replace the guarded failure path with a real inference adapter | 100 | - semantic lane is now runtime-aware |
| 143 | 4. Keep the current PostgreSQL contract unchanged | 101 | - when the current host lacks `torch/torchaudio/transformers`, it falls back explicitly and does not fake success |
| 144 | 102 | - the exact lane now prefers reusing the in-repo `ChromaprintMatcher` |
| 145 | ### Route C: fix the data first | ||
| 146 | 1. Read [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 147 | 2. Read [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 148 | 3. Prepare a larger reference set | ||
| 149 | 4. Keep `reference_set_registry / reference_set_member` versioned | ||
| 150 | 103 | ||
| 151 | --- | 104 | --- |
| 152 | 105 | ||
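| 153 | ## 8. What not to prioritize right now | 106 | ## 7. What to continue next |
| 154 | 107 | ||
| 155 | - Do not reopen the debate on whether to layer `song/work/recording` | 108 | ### Top priority |
| 156 | - Do not fall back to a flat table with only `song_id` | 109 | Upgrade the semantic lane from fallback to a real encoder adapter without breaking the existing host chain. |
| 157 | - Do not start by discussing retraining the foundation model | ||
| 158 | - Do not misdiagnose the current problem as a PostgreSQL contract design problem | ||
| 159 | 110 | ||
| 160 | --- | 111 | ### Current host facts |
| 112 | - `torch` is missing | ||
| 113 | - `torchaudio` is missing | ||
| 114 | - `transformers` is missing | ||
| 115 | - as a result, `semantic_runtime_available = false` currently | ||
| 161 | 116 | ||
| 162 | ## 9. Common repository entry points | 117 | --- |
| 163 | 118 | ||
| 164 | ### Docs | 119 | ## 8. Points not to circle back to |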
| 165 | - [README.md](./README.md) | ||
| 166 | - [session-handoff.md](./session-handoff.md) | ||
| 167 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 168 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 169 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 170 | 120 | ||
| 171 | ### Scripts | 121 | - Do not fall back to the old v2 schema as the default |
| 172 | - `acr-engine/scripts/run_planner_validation_commands_live.py` | 122 | - Do not reintroduce `recording/work/version` as required Phase-1 return objects |
| 173 | - `acr-engine/scripts/run_phase1_prereq_audit_live.py` | 123 | - Do not start by discussing training/fine-tuning |
| 174 | - `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py` | 124 | - Do not misread "the foundation model is replaceable" as "the database must be split into many layers again" |
| 175 | - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py` | ||
| 176 | - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 177 | - `acr-engine/scripts/run_songcentric_directory_pipeline_live.py` | ||
| 178 | 125 | ||
| 179 | --- | 126 | --- |
| 180 | 127 | ||
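| 181 | ## One-sentence conclusion | 128 | ## One-sentence conclusion |
| 182 | 129 | ||
| 183 | > When taking over, a new teammate should run the runner first, then read the 5 core documents; the primary problem right now is environment prerequisites, not the schema/contract itself. | 130 | > The most important thing right now is to hold the 4-table song-centric main chain and get a real semantic encoder wired in on top of it. | ... | ... |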
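
As a reading aid for the 4-table mapping in section 4.3 of the new Start Here doc, here is a minimal Python sketch of how a feature row might be traced back to its `song_id`. It is only an illustration under stated assumptions: the table and column names `entity_type`, `object_type`, `parent_object_id`, `feature_type`, `fingerprint_value`, `embedding_uri`, `model_name`, `model_version`, `feature_set_name`, `set_type`, and `set_name` come from the doc, while the id fields (`entity_id`, `object_id`), the helper `resolve_song_id`, and all sample values are hypothetical and may not match the real schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the 4 physical tables.
# The id fields (entity_id, object_id) are assumptions for illustration.


@dataclass
class MediaEntity:          # media_entity
    entity_id: str
    entity_type: str        # 'song'


@dataclass
class AudioObject:          # audio_object
    object_id: str
    entity_id: str
    object_type: str        # 'asset' or 'window'
    parent_object_id: Optional[str] = None  # a window points at its asset


@dataclass
class FeatureFact:          # feature_fact
    object_id: str
    feature_type: str       # 'fingerprint' or 'embedding'
    model_name: str
    model_version: str
    feature_set_name: str
    fingerprint_value: Optional[str] = None
    embedding_uri: Optional[str] = None


@dataclass
class SetMembership:        # set_membership
    object_id: str
    set_type: str           # e.g. 'reference_set', 'eval_set', 'hot_set'
    set_name: str


def resolve_song_id(fact: FeatureFact, objects: Dict[str, AudioObject]) -> str:
    """Walk feature -> window -> asset via parent_object_id, then return the song id."""
    obj = objects[fact.object_id]
    while obj.parent_object_id is not None:
        obj = objects[obj.parent_object_id]
    return obj.entity_id


# Sample rows (all values are made up).
song = MediaEntity("song-1", "song")
asset = AudioObject("asset-1", song.entity_id, "asset")
window = AudioObject("win-1", song.entity_id, "window", parent_object_id=asset.object_id)
fact = FeatureFact(window.object_id, "fingerprint", "chromaprint", "v1",
                   "exact_lane", fingerprint_value="abc123")
member = SetMembership(window.object_id, "reference_set", "smoke")

objects = {o.object_id: o for o in (asset, window)}
print(resolve_song_id(fact, objects))
```

The point of the sketch is that model information travels with each `feature_fact` row, so swapping the encoder would only add rows with a different `model_name` and would not require new tables.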