Clarify the ACR evolution path and freeze a production-grade data model
Constraint: Phase-1 must support encoder-only open-source backbones without destabilizing future schema evolution
Rejected: extending the old flat song_id + fixed-vector schema | would couple model swaps to schema rewrites and weaken copyright lineage
Confidence: high
Scope-risk: moderate
Directive: treat canonical_song/work/recording/recording_asset/audio_window plus model/feature registries as the stable contract; evolve models and indexes around them
Tested: git diff --check on changed files; Python content/structure sanity check; architect review APPROVED; README link coverage and DDL object presence verified
Not-tested: live PostgreSQL apply not run because psql is unavailable in this environment
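The commit message says DDL object presence was verified without a live PostgreSQL, but does not show how. One way to do such a check is a plain text scan of the SQL file; the sketch below is an assumption, not the script actually used. `EXPECTED_TABLES` and `missing_tables` are hypothetical names, and the table list is taken from the Directive trailer.

```python
import re

# Hypothetical expectation list, copied from the Directive trailer of the commit message.
EXPECTED_TABLES = {
    "canonical_song", "work", "recording", "recording_asset",
    "audio_window", "model_registry", "feature_set_registry",
}

# Matches "CREATE TABLE [IF NOT EXISTS] <name>" and captures the table name.
CREATE_TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([A-Za-z_][A-Za-z0-9_]*)",
    re.IGNORECASE,
)


def missing_tables(ddl_text, expected=EXPECTED_TABLES):
    """Return the expected table names that have no CREATE TABLE statement in ddl_text."""
    found = {name.lower() for name in CREATE_TABLE_RE.findall(ddl_text)}
    return set(expected) - found


if __name__ == "__main__":
    sample = """
    CREATE TABLE IF NOT EXISTS canonical_song (canonical_song_id BIGSERIAL PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS work (work_id BIGSERIAL PRIMARY KEY);
    """
    # Lists the tables from EXPECTED_TABLES that the sample does not define.
    print(sorted(missing_tables(sample)))
```

A scan like this only proves that the statements exist in the file. It does not parse SQL, so a commented-out `CREATE TABLE` would still count, and it says nothing about whether the script applies cleanly, which matches the Not-tested trailer.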
Showing 6 changed files with 1379 additions and 81 deletions
acr-engine/sql/acr_pg_schema_v2.sql
0 → 100644
| 1 | -- ACR PostgreSQL Schema V2 | ||
| 2 | -- Purpose: | ||
| 3 | -- 1. Support canonical_song/work/recording/asset/window hierarchy | ||
| 4 | -- 2. Support encoder-first evolution via model_registry + feature_set_registry | ||
| 5 | -- 3. Support pgvector-backed hot reference sets without binding the entire system to one vector table | ||
| 6 | |||
| 7 | CREATE EXTENSION IF NOT EXISTS vector; | ||
| 8 | |||
| 9 | -- ========================================================= | ||
| 10 | -- 1. Canonical business entities | ||
| 11 | -- ========================================================= | ||
| 12 | |||
| 13 | CREATE TABLE IF NOT EXISTS canonical_song ( | ||
| 14 | canonical_song_id BIGSERIAL PRIMARY KEY, | ||
| 15 | biz_song_code TEXT UNIQUE, | ||
| 16 | title TEXT NOT NULL, | ||
| 17 | title_norm TEXT, | ||
| 18 | primary_artist TEXT, | ||
| 19 | primary_artist_norm TEXT, | ||
| 20 | language_code TEXT, | ||
| 21 | rights_status TEXT, | ||
| 22 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 23 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 24 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 25 | ); | ||
| 26 | |||
| 27 | CREATE TABLE IF NOT EXISTS work ( | ||
| 28 | work_id BIGSERIAL PRIMARY KEY, | ||
| 29 | canonical_song_id BIGINT NOT NULL REFERENCES canonical_song(canonical_song_id), | ||
| 30 | work_code TEXT UNIQUE, | ||
| 31 | work_title TEXT NOT NULL, | ||
| 32 | work_title_norm TEXT, | ||
| 33 | composer TEXT, | ||
| 34 | lyricist TEXT, | ||
| 35 | publisher TEXT, | ||
| 36 | iswc TEXT, | ||
| 37 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 38 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 39 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 40 | ); | ||
| 41 | |||
| 42 | CREATE TABLE IF NOT EXISTS recording ( | ||
| 43 | recording_id BIGSERIAL PRIMARY KEY, | ||
| 44 | work_id BIGINT NOT NULL REFERENCES work(work_id), | ||
| 45 | canonical_song_id BIGINT NOT NULL REFERENCES canonical_song(canonical_song_id), | ||
| 46 | recording_code TEXT UNIQUE, | ||
| 47 | recording_title TEXT, | ||
| 48 | artist_name TEXT, | ||
| 49 | album_name TEXT, | ||
| 50 | version_type TEXT NOT NULL, | ||
| 51 | is_reference BOOLEAN NOT NULL DEFAULT FALSE, | ||
| 52 | reference_priority INTEGER NOT NULL DEFAULT 100, | ||
| 53 | release_date DATE, | ||
| 54 | isrc TEXT, | ||
| 55 | duration_sec NUMERIC(10,3), | ||
| 56 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 57 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 58 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 59 | ); | ||
| 60 | |||
| 61 | -- ========================================================= | ||
| 62 | -- 2. Assets and windows | ||
| 63 | -- ========================================================= | ||
| 64 | |||
| 65 | CREATE TABLE IF NOT EXISTS recording_asset ( | ||
| 66 | asset_id BIGSERIAL PRIMARY KEY, | ||
| 67 | recording_id BIGINT NOT NULL REFERENCES recording(recording_id), | ||
| 68 | asset_role TEXT NOT NULL, | ||
| 69 | storage_uri TEXT NOT NULL, | ||
| 70 | storage_scheme TEXT NOT NULL, | ||
| 71 | file_ext TEXT, | ||
| 72 | mime_type TEXT, | ||
| 73 | file_size_bytes BIGINT, | ||
| 74 | audio_sha256 TEXT, | ||
| 75 | sample_rate INTEGER, | ||
| 76 | channels INTEGER, | ||
| 77 | bit_rate_kbps INTEGER, | ||
| 78 | codec_name TEXT, | ||
| 79 | duration_sec NUMERIC(10,3), | ||
| 80 | loudness_lufs NUMERIC(8,3), | ||
| 81 | normalized_storage_uri TEXT, | ||
| 82 | ingest_status TEXT NOT NULL DEFAULT 'ready', | ||
| 83 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 84 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 85 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 86 | ); | ||
| 87 | |||
| 88 | CREATE UNIQUE INDEX IF NOT EXISTS uq_recording_asset_sha256 | ||
| 89 | ON recording_asset(audio_sha256) | ||
| 90 | WHERE audio_sha256 IS NOT NULL; | ||
| 91 | |||
| 92 | CREATE TABLE IF NOT EXISTS audio_window ( | ||
| 93 | window_id BIGSERIAL PRIMARY KEY, | ||
| 94 | asset_id BIGINT NOT NULL REFERENCES recording_asset(asset_id), | ||
| 95 | recording_id BIGINT NOT NULL REFERENCES recording(recording_id), | ||
| 96 | work_id BIGINT NOT NULL REFERENCES work(work_id), | ||
| 97 | canonical_song_id BIGINT NOT NULL REFERENCES canonical_song(canonical_song_id), | ||
| 98 | window_index INTEGER NOT NULL, | ||
| 99 | start_sec NUMERIC(10,3) NOT NULL, | ||
| 100 | end_sec NUMERIC(10,3) NOT NULL, | ||
| 101 | duration_sec NUMERIC(10,3) NOT NULL, | ||
| 102 | segment_role TEXT NOT NULL DEFAULT 'reference', | ||
| 103 | segment_type TEXT, | ||
| 104 | quality_score NUMERIC(8,5), | ||
| 105 | active_for_index BOOLEAN NOT NULL DEFAULT TRUE, | ||
| 106 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 107 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 108 | ); | ||
| 109 | |||
| 110 | CREATE UNIQUE INDEX IF NOT EXISTS uq_audio_window_asset_idx | ||
| 111 | ON audio_window(asset_id, window_index); | ||
| 112 | |||
| 113 | -- ========================================================= | ||
| 114 | -- 3. Model and feature registries | ||
| 115 | -- ========================================================= | ||
| 116 | |||
| 117 | CREATE TABLE IF NOT EXISTS model_registry ( | ||
| 118 | model_id BIGSERIAL PRIMARY KEY, | ||
| 119 | model_name TEXT NOT NULL, | ||
| 120 | model_family TEXT NOT NULL, | ||
| 121 | model_version TEXT NOT NULL, | ||
| 122 | model_source TEXT, | ||
| 123 | model_uri TEXT, | ||
| 124 | license_name TEXT, | ||
| 125 | input_modality TEXT NOT NULL DEFAULT 'audio', | ||
| 126 | input_sample_rate INTEGER, | ||
| 127 | input_channel_mode TEXT DEFAULT 'mono', | ||
| 128 | default_window_sec NUMERIC(10,3), | ||
| 129 | default_hop_sec NUMERIC(10,3), | ||
| 130 | output_embedding_dim INTEGER, | ||
| 131 | pooling_supported TEXT[], | ||
| 132 | layer_selection_supported BOOLEAN NOT NULL DEFAULT FALSE, | ||
| 133 | is_trainable BOOLEAN NOT NULL DEFAULT FALSE, | ||
| 134 | is_active BOOLEAN NOT NULL DEFAULT TRUE, | ||
| 135 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 136 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 137 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 138 | UNIQUE(model_name, model_version) | ||
| 139 | ); | ||
| 140 | |||
| 141 | CREATE TABLE IF NOT EXISTS feature_set_registry ( | ||
| 142 | feature_set_id BIGSERIAL PRIMARY KEY, | ||
| 143 | model_id BIGINT NOT NULL REFERENCES model_registry(model_id), | ||
| 144 | feature_name TEXT NOT NULL, | ||
| 145 | feature_level TEXT NOT NULL, | ||
| 146 | extraction_granularity TEXT NOT NULL, | ||
| 147 | window_sec NUMERIC(10,3), | ||
| 148 | hop_sec NUMERIC(10,3), | ||
| 149 | embedding_dim INTEGER, | ||
| 150 | pooling_strategy TEXT, | ||
| 151 | layer_selection TEXT, | ||
| 152 | normalize_l2 BOOLEAN NOT NULL DEFAULT TRUE, | ||
| 153 | distance_metric TEXT NOT NULL, | ||
| 154 | quantization_type TEXT, | ||
| 155 | feature_schema_version TEXT NOT NULL, | ||
| 156 | config_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 157 | status TEXT NOT NULL DEFAULT 'active', | ||
| 158 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 159 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 160 | ); | ||
| 161 | |||
| 162 | CREATE TABLE IF NOT EXISTS feature_extraction_job ( | ||
| 163 | extraction_job_id BIGSERIAL PRIMARY KEY, | ||
| 164 | feature_set_id BIGINT NOT NULL REFERENCES feature_set_registry(feature_set_id), | ||
| 165 | target_scope TEXT NOT NULL, | ||
| 166 | job_status TEXT NOT NULL DEFAULT 'pending', | ||
| 167 | shard_key TEXT, | ||
| 168 | input_count BIGINT, | ||
| 169 | output_count BIGINT, | ||
| 170 | started_at TIMESTAMPTZ, | ||
| 171 | finished_at TIMESTAMPTZ, | ||
| 172 | log_uri TEXT, | ||
| 173 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 174 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 175 | ); | ||
| 176 | |||
| 177 | -- ========================================================= | ||
| 178 | -- 4. Feature facts | ||
| 179 | -- ========================================================= | ||
| 180 | |||
| 181 | CREATE TABLE IF NOT EXISTS audio_embedding ( | ||
| 182 | embedding_id BIGSERIAL PRIMARY KEY, | ||
| 183 | feature_set_id BIGINT NOT NULL REFERENCES feature_set_registry(feature_set_id), | ||
| 184 | extraction_job_id BIGINT REFERENCES feature_extraction_job(extraction_job_id), | ||
| 185 | asset_id BIGINT REFERENCES recording_asset(asset_id), | ||
| 186 | window_id BIGINT REFERENCES audio_window(window_id), | ||
| 187 | recording_id BIGINT NOT NULL REFERENCES recording(recording_id), | ||
| 188 | work_id BIGINT NOT NULL REFERENCES work(work_id), | ||
| 189 | canonical_song_id BIGINT NOT NULL REFERENCES canonical_song(canonical_song_id), | ||
| 190 | embedding_storage_mode TEXT NOT NULL, | ||
| 191 | embedding_uri TEXT, | ||
| 192 | vector_norm NUMERIC(12,6), | ||
| 193 | checksum TEXT, | ||
| 194 | is_indexed BOOLEAN NOT NULL DEFAULT FALSE, | ||
| 195 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 196 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 197 | CONSTRAINT ck_audio_embedding_scope CHECK (asset_id IS NOT NULL OR window_id IS NOT NULL) | ||
| 198 | ); | ||
| 199 | |||
| 200 | CREATE TABLE IF NOT EXISTS audio_embedding_vector_192 ( | ||
| 201 | embedding_id BIGINT PRIMARY KEY REFERENCES audio_embedding(embedding_id) ON DELETE CASCADE, | ||
| 202 | embedding VECTOR(192) NOT NULL | ||
| 203 | ); | ||
| 204 | |||
| 205 | CREATE TABLE IF NOT EXISTS audio_embedding_vector_768 ( | ||
| 206 | embedding_id BIGINT PRIMARY KEY REFERENCES audio_embedding(embedding_id) ON DELETE CASCADE, | ||
| 207 | embedding VECTOR(768) NOT NULL | ||
| 208 | ); | ||
| 209 | |||
| 210 | CREATE TABLE IF NOT EXISTS audio_fingerprint ( | ||
| 211 | fingerprint_id BIGSERIAL PRIMARY KEY, | ||
| 212 | feature_set_id BIGINT NOT NULL REFERENCES feature_set_registry(feature_set_id), | ||
| 213 | asset_id BIGINT REFERENCES recording_asset(asset_id), | ||
| 214 | window_id BIGINT REFERENCES audio_window(window_id), | ||
| 215 | recording_id BIGINT NOT NULL REFERENCES recording(recording_id), | ||
| 216 | work_id BIGINT NOT NULL REFERENCES work(work_id), | ||
| 217 | canonical_song_id BIGINT NOT NULL REFERENCES canonical_song(canonical_song_id), | ||
| 218 | fingerprint_uri TEXT, | ||
| 219 | hash_count INTEGER, | ||
| 220 | is_indexed BOOLEAN NOT NULL DEFAULT FALSE, | ||
| 221 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 222 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 223 | ); | ||
| 224 | |||
| 225 | CREATE TABLE IF NOT EXISTS reference_set_registry ( | ||
| 226 | reference_set_id BIGSERIAL PRIMARY KEY, | ||
| 227 | set_name TEXT NOT NULL UNIQUE, | ||
| 228 | description TEXT, | ||
| 229 | encoder_scope TEXT, | ||
| 230 | status TEXT NOT NULL DEFAULT 'active', | ||
| 231 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 232 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 233 | updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 234 | ); | ||
| 235 | |||
| 236 | CREATE TABLE IF NOT EXISTS reference_set_member ( | ||
| 237 | reference_set_id BIGINT NOT NULL REFERENCES reference_set_registry(reference_set_id) ON DELETE CASCADE, | ||
| 238 | recording_id BIGINT NOT NULL REFERENCES recording(recording_id) ON DELETE CASCADE, | ||
| 239 | member_role TEXT NOT NULL DEFAULT 'hot_reference', | ||
| 240 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), | ||
| 241 | PRIMARY KEY(reference_set_id, recording_id) | ||
| 242 | ); | ||
| 243 | |||
| 244 | -- ========================================================= | ||
| 245 | -- 4.5 Lineage invariants (recommended production hardening) | ||
| 246 | -- ========================================================= | ||
| 247 | |||
| 248 | CREATE OR REPLACE FUNCTION check_recording_lineage() | ||
| 249 | RETURNS trigger AS $$ | ||
| 250 | DECLARE | ||
| 251 | work_song_id BIGINT; | ||
| 252 | BEGIN | ||
| 253 | SELECT canonical_song_id INTO work_song_id | ||
| 254 | FROM work | ||
| 255 | WHERE work_id = NEW.work_id; | ||
| 256 | |||
| 257 | IF work_song_id IS NULL THEN | ||
| 258 | RAISE EXCEPTION 'Invalid work_id=% for recording', NEW.work_id; | ||
| 259 | END IF; | ||
| 260 | |||
| 261 | IF NEW.canonical_song_id <> work_song_id THEN | ||
| 262 | RAISE EXCEPTION 'recording.canonical_song_id % mismatches work.canonical_song_id %', NEW.canonical_song_id, work_song_id; | ||
| 263 | END IF; | ||
| 264 | |||
| 265 | RETURN NEW; | ||
| 266 | END; | ||
| 267 | $$ LANGUAGE plpgsql; | ||
| 268 | |||
| 269 | CREATE OR REPLACE FUNCTION check_audio_window_lineage() | ||
| 270 | RETURNS trigger AS $$ | ||
| 271 | DECLARE | ||
| 272 | asset_recording_id BIGINT; | ||
| 273 | rec_work_id BIGINT; | ||
| 274 | rec_song_id BIGINT; | ||
| 275 | BEGIN | ||
| 276 | SELECT recording_id INTO asset_recording_id | ||
| 277 | FROM recording_asset | ||
| 278 | WHERE asset_id = NEW.asset_id; | ||
| 279 | |||
| 280 | SELECT work_id, canonical_song_id INTO rec_work_id, rec_song_id | ||
| 281 | FROM recording | ||
| 282 | WHERE recording_id = NEW.recording_id; | ||
| 283 | |||
| 284 | IF asset_recording_id IS NULL OR rec_work_id IS NULL THEN | ||
| 285 | RAISE EXCEPTION 'Invalid asset_id=% or recording_id=% for audio_window', NEW.asset_id, NEW.recording_id; | ||
| 286 | END IF; | ||
| 287 | |||
| 288 | IF NEW.recording_id <> asset_recording_id THEN | ||
| 289 | RAISE EXCEPTION 'audio_window.recording_id % mismatches recording_asset.recording_id %', NEW.recording_id, asset_recording_id; | ||
| 290 | END IF; | ||
| 291 | |||
| 292 | IF NEW.work_id <> rec_work_id OR NEW.canonical_song_id <> rec_song_id THEN | ||
| 293 | RAISE EXCEPTION 'audio_window lineage mismatch for recording_id=%', NEW.recording_id; | ||
| 294 | END IF; | ||
| 295 | |||
| 296 | RETURN NEW; | ||
| 297 | END; | ||
| 298 | $$ LANGUAGE plpgsql; | ||
| 299 | |||
| 300 | CREATE OR REPLACE FUNCTION check_audio_embedding_lineage() | ||
| 301 | RETURNS trigger AS $$ | ||
| 302 | DECLARE | ||
| 303 | parent_recording_id BIGINT; | ||
| 304 | parent_work_id BIGINT; | ||
| 305 | parent_song_id BIGINT; | ||
| 306 | BEGIN | ||
| 307 | IF NEW.window_id IS NOT NULL THEN | ||
| 308 | SELECT recording_id, work_id, canonical_song_id | ||
| 309 | INTO parent_recording_id, parent_work_id, parent_song_id | ||
| 310 | FROM audio_window | ||
| 311 | WHERE window_id = NEW.window_id; | ||
| 312 | ELSE | ||
| 313 | SELECT r.recording_id, r.work_id, r.canonical_song_id | ||
| 314 | INTO parent_recording_id, parent_work_id, parent_song_id | ||
| 315 | FROM recording_asset ra | ||
| 316 | JOIN recording r ON r.recording_id = ra.recording_id | ||
| 317 | WHERE ra.asset_id = NEW.asset_id; | ||
| 318 | END IF; | ||
| 319 | |||
| 320 | IF parent_recording_id IS NULL THEN | ||
| 321 | RAISE EXCEPTION 'Invalid parent reference for audio_embedding'; | ||
| 322 | END IF; | ||
| 323 | |||
| 324 | IF NEW.recording_id <> parent_recording_id OR NEW.work_id <> parent_work_id OR NEW.canonical_song_id <> parent_song_id THEN | ||
| 325 | RAISE EXCEPTION 'audio_embedding lineage mismatch'; | ||
| 326 | END IF; | ||
| 327 | |||
| 328 | RETURN NEW; | ||
| 329 | END; | ||
| 330 | $$ LANGUAGE plpgsql; | ||
| 331 | |||
| 332 | -- ========================================================= | ||
| 333 | -- 5. Retrieval index and results | ||
| 334 | -- ========================================================= | ||
| 335 | |||
| 336 | CREATE TABLE IF NOT EXISTS retrieval_index_registry ( | ||
| 337 | retrieval_index_id BIGSERIAL PRIMARY KEY, | ||
| 338 | feature_set_id BIGINT NOT NULL REFERENCES feature_set_registry(feature_set_id), | ||
| 339 | index_name TEXT NOT NULL, | ||
| 340 | index_backend TEXT NOT NULL, | ||
| 341 | index_type TEXT NOT NULL, | ||
| 342 | storage_uri TEXT, | ||
| 343 | shard_no INTEGER, | ||
| 344 | row_count BIGINT, | ||
| 345 | index_status TEXT NOT NULL DEFAULT 'active', | ||
| 346 | config_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 347 | built_at TIMESTAMPTZ, | ||
| 348 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 349 | ); | ||
| 350 | |||
| 351 | CREATE TABLE IF NOT EXISTS retrieval_candidate ( | ||
| 352 | retrieval_candidate_id BIGSERIAL PRIMARY KEY, | ||
| 353 | query_id TEXT NOT NULL, | ||
| 354 | retrieval_index_id BIGINT REFERENCES retrieval_index_registry(retrieval_index_id), | ||
| 355 | feature_set_id BIGINT REFERENCES feature_set_registry(feature_set_id), | ||
| 356 | source_lane TEXT NOT NULL CHECK (source_lane IN ('fingerprint', 'semantic', 'cover', 'melody', 'fusion')), | ||
| 357 | candidate_level TEXT NOT NULL CHECK (candidate_level IN ('window', 'recording', 'work', 'canonical_song')), | ||
| 358 | candidate_id BIGINT NOT NULL, | ||
| 359 | evidence_window_id BIGINT REFERENCES audio_window(window_id), | ||
| 360 | raw_score NUMERIC(14,8), | ||
| 361 | normalized_score NUMERIC(14,8), | ||
| 362 | rank_no INTEGER, | ||
| 363 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 364 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 365 | ); | ||
| 366 | |||
| 367 | CREATE TABLE IF NOT EXISTS match_decision ( | ||
| 368 | match_decision_id BIGSERIAL PRIMARY KEY, | ||
| 369 | query_id TEXT NOT NULL, | ||
| 370 | canonical_song_id BIGINT REFERENCES canonical_song(canonical_song_id), | ||
| 371 | work_id BIGINT REFERENCES work(work_id), | ||
| 372 | recording_id BIGINT REFERENCES recording(recording_id), | ||
| 373 | decision_status TEXT NOT NULL, | ||
| 374 | decision_score NUMERIC(14,8), | ||
| 375 | decision_reason TEXT, | ||
| 376 | metadata_json JSONB NOT NULL DEFAULT '{}'::jsonb, | ||
| 377 | created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() | ||
| 378 | ); | ||
| 379 | |||
| 380 | CREATE TRIGGER trg_recording_lineage | ||
| 381 | BEFORE INSERT OR UPDATE ON recording | ||
| 382 | FOR EACH ROW EXECUTE FUNCTION check_recording_lineage(); | ||
| 383 | |||
| 384 | CREATE TRIGGER trg_audio_window_lineage | ||
| 385 | BEFORE INSERT OR UPDATE ON audio_window | ||
| 386 | FOR EACH ROW EXECUTE FUNCTION check_audio_window_lineage(); | ||
| 387 | |||
| 388 | CREATE TRIGGER trg_audio_embedding_lineage | ||
| 389 | BEFORE INSERT OR UPDATE ON audio_embedding | ||
| 390 | FOR EACH ROW EXECUTE FUNCTION check_audio_embedding_lineage(); | ||
| 391 | |||
| 392 | -- ========================================================= | ||
| 393 | -- 6. Recommended indexes | ||
| 394 | -- ========================================================= | ||
| 395 | |||
| 396 | CREATE INDEX IF NOT EXISTS idx_work_canonical_song_id | ||
| 397 | ON work(canonical_song_id); | ||
| 398 | |||
| 399 | CREATE INDEX IF NOT EXISTS idx_recording_work_id | ||
| 400 | ON recording(work_id); | ||
| 401 | |||
| 402 | CREATE INDEX IF NOT EXISTS idx_recording_canonical_song_id | ||
| 403 | ON recording(canonical_song_id); | ||
| 404 | |||
| 405 | CREATE INDEX IF NOT EXISTS idx_recording_reference | ||
| 406 | ON recording(is_reference, reference_priority); | ||
| 407 | |||
| 408 | CREATE INDEX IF NOT EXISTS idx_recording_asset_recording_id | ||
| 409 | ON recording_asset(recording_id); | ||
| 410 | |||
| 411 | CREATE INDEX IF NOT EXISTS idx_audio_window_asset_id | ||
| 412 | ON audio_window(asset_id); | ||
| 413 | |||
| 414 | CREATE INDEX IF NOT EXISTS idx_audio_window_recording_id | ||
| 415 | ON audio_window(recording_id); | ||
| 416 | |||
| 417 | CREATE INDEX IF NOT EXISTS idx_audio_window_canonical_song_id | ||
| 418 | ON audio_window(canonical_song_id); | ||
| 419 | |||
| 420 | CREATE INDEX IF NOT EXISTS idx_audio_window_active_for_index | ||
| 421 | ON audio_window(active_for_index); | ||
| 422 | |||
| 423 | CREATE INDEX IF NOT EXISTS idx_audio_embedding_feature_set_id | ||
| 424 | ON audio_embedding(feature_set_id); | ||
| 425 | |||
| 426 | CREATE INDEX IF NOT EXISTS idx_audio_embedding_window_id | ||
| 427 | ON audio_embedding(window_id); | ||
| 428 | |||
| 429 | CREATE INDEX IF NOT EXISTS idx_audio_embedding_recording_id | ||
| 430 | ON audio_embedding(recording_id); | ||
| 431 | |||
| 432 | CREATE INDEX IF NOT EXISTS idx_reference_set_member_recording_id | ||
| 433 | ON reference_set_member(recording_id); | ||
| 434 | |||
| 435 | CREATE INDEX IF NOT EXISTS idx_retrieval_candidate_query_id | ||
| 436 | ON retrieval_candidate(query_id); | ||
| 437 | |||
| 438 | CREATE INDEX IF NOT EXISTS idx_match_decision_query_id | ||
| 439 | ON match_decision(query_id); | ||
| 440 | |||
| 441 | -- Optional hot-set HNSW indexes for pgvector-backed online retrieval | ||
| 442 | CREATE INDEX IF NOT EXISTS idx_audio_embedding_vector_192_cos_hnsw | ||
| 443 | ON audio_embedding_vector_192 USING hnsw (embedding vector_cosine_ops); | ||
| 444 | |||
| 445 | CREATE INDEX IF NOT EXISTS idx_audio_embedding_vector_768_cos_hnsw | ||
| 446 | ON audio_embedding_vector_768 USING hnsw (embedding vector_cosine_ops); | ||
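The lineage triggers in section 4.5 of the schema reject rows whose denormalized `recording_id`, `work_id`, and `canonical_song_id` disagree with their parents. A bulk loader may want to catch such rows before they reach the database. The sketch below mirrors the rules of `check_audio_window_lineage` in plain Python; the dataclasses, the in-memory lookup dictionaries, and the function name are assumptions for illustration and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    recording_id: int
    work_id: int
    canonical_song_id: int


@dataclass(frozen=True)
class AudioWindow:
    asset_id: int
    recording_id: int
    work_id: int
    canonical_song_id: int


def window_lineage_errors(window, asset_to_recording, recordings):
    """Return a list of lineage problems for one window (empty list if consistent).

    asset_to_recording: dict mapping asset_id -> recording_id (recording_asset rows)
    recordings:         dict mapping recording_id -> Recording
    """
    asset_recording_id = asset_to_recording.get(window.asset_id)
    rec = recordings.get(window.recording_id)
    # Same first check as the trigger: both parents must exist.
    if asset_recording_id is None or rec is None:
        return [f"invalid asset_id={window.asset_id} or recording_id={window.recording_id}"]

    errors = []
    # The window must point at the recording that owns its asset.
    if window.recording_id != asset_recording_id:
        errors.append("recording_id does not match recording_asset.recording_id")
    # The copied work and song ids must match the recording's own values.
    if window.work_id != rec.work_id or window.canonical_song_id != rec.canonical_song_id:
        errors.append("work_id / canonical_song_id do not match recording")
    return errors
```

Unlike the trigger, which raises on the first mismatch, this version collects every problem so a loader can report them together. The trigger remains the authoritative check; a pre-check of this kind only shortens the feedback loop.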
| 1 | ## 2026-06-04 | ||
| 2 | |||
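| 3 | - Restructured the main documentation reading path and added role-based entry points: architecture, development, operations, and model backbone. | ||
| 4 | - Added the [SOTA Evolution Guide](./sota-evolution-guide.md), which spells out the Phase-1 encoder-only route, the roles of MERT/MuQ, and the later version/cover evolution. | ||
| 5 | - Rewrote the [ACR System Blueprint](./acr-architecture.md), adding role views, the offline/online division of responsibilities, and a mapping from the current implementation to the target implementation. | ||
| 6 | - Added the [PostgreSQL Data Model and DDL Design Notes](./postgresql-data-model.md), covering design intent, the problems addressed, flowcharts, and the implementation order. | ||
| 7 | - Added `acr-engine/sql/acr_pg_schema_v2.sql`, the recommended PostgreSQL DDL built around `canonical_song/work/recording/asset/window + model_registry/feature_set_registry`. | ||
| 8 | - Following the architect review, added: unified `recording_asset` terminology, versioned reference-set objects, enum constraints on candidates, and the key lineage trigger design. | ||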
| 9 | |||
| 1 | - Added `acr-engine/scripts/export_workspace_music20_embeddings_jsonl.py` and `acr-engine/scripts/evaluate_songid_pgvector_path.py`, completing the song_id-level pgvector evaluation scaffolding. | 10 | - Added `acr-engine/scripts/export_workspace_music20_embeddings_jsonl.py` and `acr-engine/scripts/evaluate_songid_pgvector_path.py`, completing the song_id-level pgvector evaluation scaffolding. |
| 2 | - Added evaluation artifacts under `acr-engine/data/pgvector_eval/music20/`. Current `faiss-as-pgvector-standin` results: overall `top1=0.9091`, `top3=0.9545`; `query_type=1` is strong (`top1=1.0`), while `query_type=7` remains clearly weak (`top1=0.0`, `top3=0.5`). | 11 | - Added evaluation artifacts under `acr-engine/data/pgvector_eval/music20/`. Current `faiss-as-pgvector-standin` results: overall `top1=0.9091`, `top3=0.9545`; `query_type=1` is strong (`top1=1.0`), while `query_type=7` remains clearly weak (`top1=0.0`, `top3=0.5`). |
| 3 | - Added `acr-engine/data/local_eval/voice_workspace20_type7_eval.json`, a batch validation of 20 `type_7` queries against the current `workspace_music20` semantics: `top1=0.0`, `top3=0.05`, indicating that business-level song_id accuracy is still clearly insufficient. | 12 | - Added `acr-engine/data/local_eval/voice_workspace20_type7_eval.json`, a batch validation of 20 `type_7` queries against the current `workspace_music20` semantics: `top1=0.0`, `top3=0.05`, indicating that business-level song_id accuracy is still clearly insufficient. | ... | ... |
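The `top1` / `top3` figures in the changelog entries above read as song_id-level hit rates: the fraction of queries whose true song appears among the first k returned candidates. The evaluation scripts themselves are not part of this diff, so the following is only a minimal sketch of that metric under this assumption; `top_k_accuracy` and the sample data are hypothetical.

```python
def top_k_accuracy(results, k):
    """Fraction of queries whose true song_id is among the first k ranked candidates.

    results: list of (true_song_id, ranked_candidate_song_ids) pairs,
             with candidates ordered best first.
    """
    if not results:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)


if __name__ == "__main__":
    # Made-up queries, not taken from the music20 evaluation set.
    sample = [
        ("s1", ["s1", "s2", "s3"]),   # hit at rank 1
        ("s2", ["s3", "s2", "s1"]),   # hit at rank 2
        ("s3", ["s1", "s2", "s4"]),   # miss
        ("s4", ["s4"]),               # hit at rank 1
    ]
    print(top_k_accuracy(sample, 1), top_k_accuracy(sample, 3))
```

Computing the metric per `query_type` amounts to filtering `results` before the call, which is how a strong overall number can coexist with a weak `query_type=7` number.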
| 1 | # ACR Docs Overview | 1 | # ACR Docs Overview |
| 2 | 2 | ||
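| 3 | > Keeps the latest architecture and the shortest path to getting started. Historical details remain in the repository, but the default reading list is limited to the 6 main documents below. | 3 | > Documentation entry point for music ACR, covering copyright protection, song identification, and version attribution. Read the main path first; historical detail documents are kept as supplementary material. |
| 4 | 4 | ||
| 5 | ## Shortest reading order | 5 | ## One-page summary |
| 6 | 6 | ||
| 7 | 1. [session-handoff.md](./session-handoff.md) | 7 | The project has shifted from "can the prototype run end to end" to "**how to turn 1M audio files / 300K songs into an evolvable copyright retrieval system**". |
| 8 | 2. [CHANGELOG.md](./CHANGELOG.md) | 8 | The default reading order no longer follows "training scripts -> demo"; instead it follows: |
| 9 | |||
| 10 | 1. **System blueprint**: what the system is today and what it should evolve into | ||
| 11 | 2. **SOTA evolution**: how Phase-1 works without fine-tuning the backbone, and how to upgrade later | ||
| 12 | 3. **PostgreSQL data model**: how assets, windows, features, indexes, and match results are persisted | ||
| 13 | 4. **Mapping to the existing implementation**: where the current code and documents live in the repository | ||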
| 14 | |||
| 15 | --- | ||
| 16 | |||
| 17 | ## Main reading paths (recommended) | ||
| 18 | |||
| 19 | ### 1. Management / architecture / cross-team leads | ||
| 20 | 1. [acr-architecture.md](./acr-architecture.md) | ||
| 21 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 22 | 3. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 23 | 4. [session-handoff.md](./session-handoff.md) | ||
| 24 | |||
| 25 | ### 2. Development / data / retrieval engineers | ||
| 26 | 1. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 27 | 2. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 9 | 3. [acr-architecture.md](./acr-architecture.md) | 28 | 3. [acr-architecture.md](./acr-architecture.md) |
| 10 | 4. [dataset-spec.md](./dataset-spec.md) | 29 | 4. [runbook.md](./runbook.md) |
| 11 | 5. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | 30 | |
| 12 | 6. [runbook.md](./runbook.md) | 31 | ### 3. Operations / platform / service engineers |
| 32 | 1. [acr-architecture.md](./acr-architecture.md) | ||
| 33 | 2. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 34 | 3. [service-api.md](./service-api.md) | ||
| 35 | 4. [runbook.md](./runbook.md) | ||
| 36 | |||
| 37 | ### 4. Model / backbone / research engineers | ||
| 38 | 1. [sota-research-2026.md](./sota-research-2026.md) | ||
| 39 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 40 | 3. [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ||
| 41 | 4. [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 42 | |||
| 43 | --- | ||
| 44 | |||
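| 45 | ## Division of roles among the new core documents | ||
| 46 | |||
| 47 | | Document | Purpose | Who should read it first | | ||
| 48 | |---|---|---| | ||
| 49 | | [acr-architecture.md](./acr-architecture.md) | Current system blueprint, role responsibilities, online/offline pipelines | Architecture, development, operations | | ||
| 50 | | [sota-evolution-guide.md](./sota-evolution-guide.md) | SOTA evolution path, Phase-1 encoder-only plan, later upgrade route | Architecture, model, retrieval | | ||
| 51 | | [postgresql-data-model.md](./postgresql-data-model.md) | PostgreSQL data dictionary, DDL design intent, flowcharts, query paths | Data, backend, retrieval, platform | | ||
| 52 | | [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | Notes on the current training/manifest/pgvector prototype chain | Development, data | | ||
| 53 | | [session-handoff.md](./session-handoff.md) | Latest status and context for resuming work | Whoever picks up a new session | | ||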
| 54 | |||
| 55 | --- | ||
| 56 | |||
| 57 | ## How the current implementation relates to the future target | ||
| 58 | |||
| 59 | ```mermaid | ||
| 60 | flowchart LR | ||
| 61 | A[Current implementation\nChromaprint + ECAPA + Melody Rerank] --> B[Phase-1\nEncoder-only Foundation Backbone] | ||
| 62 | B --> C[Phase-2\nVersion/Cover Lane + Better Aggregation] | ||
| 63 | C --> D[Phase-3\nIndustrial Retrieval + Reranker + Governance] | ||
| 64 | ``` | ||
| 65 | |||
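| 66 | - **Current implementation**: the basic pipeline has been verified to run. | ||
| 67 | - **Phase-1** goal: adopt a stronger open-source encoder directly, without fine-tuning the backbone, and stabilize the PostgreSQL data specification first. | ||
| 68 | - **Phase-2** goal: strengthen attribution for version / cover / hard cases. | ||
| 69 | - **Phase-3** goal: multiple indexes, multi-role collaboration, data governance, and production service rollout. | ||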
| 70 | |||
| 71 | --- | ||
| 13 | 72 | ||
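| 14 | ## Currently recommended: read only these categories | 73 | ## Entry points to the existing implementation |
| 15 | 74 | ||
| 16 | ### 1. Project architecture | 75 | ### Code entry points |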
| 17 | - [acr-architecture.md](./acr-architecture.md) | 76 | - `acr-engine/src/engines/chromaprint_matcher.py` |
| 18 | - [session-handoff.md](./session-handoff.md) | 77 | - `acr-engine/src/engines/ecapa_embedder.py` |
| 78 | - `acr-engine/src/engines/hybrid_engine.py` | ||
| 79 | - `acr-engine/src/service/app.py` | ||
| 80 | - `acr-engine/sql/pgvector_schema.sql` (prototype version) | ||
| 81 | - `acr-engine/sql/acr_pg_schema_v2.sql` (recommended version, added in this round) | ||
| 19 | 82 | ||
| 20 | ### 2. Data and evaluation | 83 | ### Historical / supplementary documents |
| 21 | - [dataset-spec.md](./dataset-spec.md) | 84 | - [sota-research-2026.md](./sota-research-2026.md) |
| 22 | - [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | 85 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) |
| 23 | - [open-dataset-workflow.md](./open-dataset-workflow.md) | 86 | - [project-responsibility-map.md](./project-responsibility-map.md) |
| 87 | - [industrialization-roadmap.md](./industrialization-roadmap.md) | ||
| 24 | 88 | ||
| 25 | ### 3. Running and serving | 89 | --- |
| 26 | - [runbook.md](./runbook.md) | ||
| 27 | - [service-api.md](./service-api.md) | ||
| 28 | 90 | ||
| 29 | ### 4. Latest hard-case conclusions | 91 | ## How to read the current documentation set |
| 30 | - [acr-hard-case-analysis.md](../acr-engine/../docs/acr-hard-case-analysis.md) | ||
| 31 | 92 | ||
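| 32 | ## Current architecture in one sentence | 93 | - **Main documents**: prioritize "after reading, you know how to move forward" |
| 94 | - **Historical documents**: preserve experiment context, old designs, and supplementary explanations | ||
| 95 | - **SQL files**: ensure a database prototype can be stood up directly | ||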
| 33 | 96 | ||
| 34 | - `/workspace`: source of samples and materials | 97 | If you read only 3 documents: |
| 35 | - `acr-engine/`: main project for training, indexing, recognition, and serving | 98 | 1. [acr-architecture.md](./acr-architecture.md) |
| 36 | - Local small-sample validation: prefer **FAISS** | 99 | 2. [sota-evolution-guide.md](./sota-evolution-guide.md) |
| 37 | - Production vector retrieval: standardize on **pgvector** | 100 | 3. [postgresql-data-model.md](./postgresql-data-model.md) | ... | ... |
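| 1 | # ACR System Architecture Diagram | 1 | # ACR System Blueprint / Architecture Blueprint |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-02 | 3 | > Updated: 2026-06-04 |
| 4 | > Goal: bring the current ACR prototype, the future SOTA evolution path, and each role's concerns together in one readable system blueprint. | ||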
| 4 | 5 | ||
| 5 | ## One-page summary | 6 | ## One-page summary |
| 6 | 7 | ||
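| 7 | - The recognition pipeline is no longer a single model but a hybrid of **fingerprint + vector + melody-aware rerank** | 8 | The repository has already validated a working hybrid recognition prototype: |
| 8 | - Data and services have entered the industrialization phase | 9 | |
| 9 | - The main current weakness is hard-case accuracy for `humming_like` and `confused` | 10 | - `Chromaprint / fingerprint`: fast recall for exact / near-duplicate matches |
| 11 | - `ECAPA-style embedding`: the current baseline for semantic vector recall | ||
| 12 | - `melody-aware rerank`: compensates when melodic cues are weak | ||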
| 13 | |||
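| 14 | To meet the future target of **copyright protection + 1M audio files / 300K songs**, however, the system should evolve toward: | ||
| 15 | |||
| 16 | 1. **Stable data specification**: `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 17 | 2. **Replaceable backbone models**: `model_registry -> feature_set_registry -> embedding/index` | ||
| 18 | 3. **Layered retrieval chain**: exact lane + semantic lane + version/cover lane + aggregation | ||
| 19 | 4. **Separation of serving and operations**: offline catalog building, online recall, review and normalization, and monitoring and governance each have clear responsibilities | ||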
| 10 | 20 | ||
| 11 | --- | 21 | --- |
| 12 | 22 | ||
| 13 | ## 1. Overall architecture diagram | 23 | ## 1. Overall system diagram |
| 14 | 24 | ||
| 15 | ```mermaid | 25 | ```mermaid |
| 16 | flowchart LR | 26 | flowchart TD |
| 17 | Q[Query Audio] --> P[Preprocess] | 27 | A[Audio Sources\nOfficial masters / platform audio / crawled audio / UGC / recordings] --> B[Asset Normalization] |
| 18 | P --> F1[Chromaprint Features] | 28 | B --> C[Canonical Data Model\nSong / Work / Recording / Asset / Window] |
| 19 | P --> F2[128-Mel Features] | ||
| 20 | P --> F3[Melody Signature] | ||
| 21 | |||
| 22 | F1 --> M1[Fingerprint Matcher] | ||
| 23 | F2 --> M2[ECAPA + BandSplit Embedder] | ||
| 24 | F3 --> M3[Melody Similarity] | ||
| 25 | 29 | ||
| 26 | C[Catalog References] --> I1[Fingerprint Index] | 30 | C --> D1[Exact Lane\nChromaprint / Neural AFP] |
| 27 | C --> I2[Embedding Window Index] | 31 | C --> D2[Semantic Lane\nFoundation Encoder] |
| 28 | C --> I3[Reference Melody Cache] | 32 | C --> D3[Version/Cover Lane\nPhase-2+] |
| 29 | 33 | ||
| 30 | M1 --> H[Hybrid Fusion] | 34 | D1 --> E[Candidate Aggregation] |
| 31 | M2 --> H | 35 | D2 --> E |
| 32 | M3 --> H | 36 | D3 --> E |
| 33 | 37 | ||
| 34 | H --> O[Top-K + Reject] | 38 | E --> F[Canonical Song Decision] |
| 39 | F --> G[Service / Review / Audit] | ||
| 35 | ``` | 40 | ``` |
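The new system diagram routes candidates from several lanes into a Candidate Aggregation step that feeds the Canonical Song Decision. The diff does not say how aggregation is done, so the following is only a hedged sketch of one simple option: keep the best score per song and lane, then take a weighted sum per `canonical_song_id`. The lane names follow the `source_lane` values in the schema; the weights and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical lane weights for illustration; not taken from the repository.
LANE_WEIGHTS = {"fingerprint": 1.0, "semantic": 0.6, "cover": 0.5}


def aggregate_candidates(candidates, lane_weights=LANE_WEIGHTS):
    """Fuse per-lane candidates into one ranked list of canonical songs.

    candidates: iterable of (source_lane, canonical_song_id, normalized_score).
    Returns [(canonical_song_id, fused_score), ...] sorted best first.
    """
    # Keep only the best score each lane produced for each song.
    best = defaultdict(dict)
    for lane, song_id, score in candidates:
        prev = best[song_id].get(lane)
        if prev is None or score > prev:
            best[song_id][lane] = score

    # Weighted sum across lanes; lanes without a configured weight contribute 0.
    fused = {
        song_id: sum(lane_weights.get(lane, 0.0) * s for lane, s in lanes.items())
        for song_id, lanes in best.items()
    }
    # Sort by score descending, breaking ties by song id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A rule like this only orders candidates. A real decision step would also need a reject threshold and, for copyright use, the evidence windows behind each score, which the `retrieval_candidate` and `match_decision` tables are designed to record.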
| 36 | 41 | ||
| 37 | --- | 42 | --- |
| 38 | 43 | ||
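| 39 | ## 2. Online/offline layering diagram | 44 | ## 2. Current implementation vs. target implementation |
| 45 | |||
| 46 | | Dimension | Current implementation | Target implementation | | ||
| 47 | |---|---|---| | ||
| 48 | | Backbone embedding model | ECAPA-style baseline | Primarily foundation encoders such as MERT / MuQ | | ||
| 49 | | Retrieval structure | chromaprint + embedding + melody | exact + semantic + version/cover + rerank | | ||
| 50 | | Primary data key | Centered on `song_id` | Layered `canonical_song / work / recording / asset / window` | | ||
| 51 | | Storage form | Prototype pgvector schema + file artifacts | PostgreSQL master data + replaceable vector/index layer | | ||
| 52 | | Service goal | Validate the closed loop | Copyright protection / attribution decisions / industrial-grade operations | | ||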
| 53 | |||
| 54 | --- | ||
| 55 | |||
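| 56 | ## 3. Role views | ||
| 57 | |||
| 58 | ## 3.1 Product / architecture roles | ||
| 59 | |||
| 60 | Focus: | ||
| 61 | - Whether copyright protection can ultimately resolve to a `canonical_song_id` | ||
| 62 | - Whether the distinction between `recording` and `work` is clear | ||
| 63 | - Whether the current phase sticks to "freeze the specification first, iterate on models afterwards" | ||
| 64 | - Whether the interfaces between teams are clear | ||
| 65 | |||
| 66 | Read first: | ||
| 67 | - This document | ||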
| 68 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 69 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 70 | |||
| 71 | --- | ||
| 72 | |||
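| 73 | ## 3.2 Development roles (backend / retrieval / data) | ||
| 74 | |||
| 75 | Focus: | ||
| 76 | - How to import audio into the unified entity model | ||
| 77 | - How to slice windows, build a feature_set, and attach indexes | ||
| 78 | - How to go from a query to candidates, then resolve them to a `canonical_song_id` | ||
| 79 | - How to support future switches of `model_name / model_version / feature_set` | ||
| 80 | |||
| 81 | Read first: | ||
| 82 | - This document | ||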
| 83 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 84 | - [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | ||
| 85 | |||
| 86 | --- | ||
| 87 | |||
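| 88 | ## 3.3 Operations / platform roles | ||
| 89 | |||
| 90 | Focus: | ||
| 91 | - Offline jobs: feature extraction, index builds, index rebuilds | ||
| 92 | - Online service: recall, aggregation, caching, observability | ||
| 93 | - Storage tiers: object storage, PostgreSQL, index backends | ||
| 94 | - Versioning: how an encoder change is canaried, rolled back, and dual-written / dual-indexed | ||
| 95 | |||
| 96 | Read first: | ||
| 97 | - This document | ||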
| 98 | - [service-api.md](./service-api.md) | ||
| 99 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 100 | |||
| 101 | --- | ||
| 102 | |||
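| 103 | ## 3.4 Model backbone / research roles | ||
| 104 | |||
| 105 | Focus: | ||
| 106 | - Which open-source encoder to pick for Phase-1, before any fine-tuning | ||
| 107 | - How to define a feature_set: window length, hop, pooling, layer selection | ||
| 108 | - How to upgrade later from encoder-only to a version/cover lane | ||
| 109 | - How to plug in a new model without breaking the data layer | ||
| 110 | |||
| 111 | Read first: | ||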
| 112 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 113 | - [sota-research-2026.md](./sota-research-2026.md) | ||
| 114 | - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md) | ||
| 115 | |||
| 116 | --- | ||
| 117 | |||
| 118 | ## 4. Offline / online responsibility split | ||
| 40 | 119 | ||
| 41 | ```mermaid | 120 | ```mermaid |
| 42 | flowchart TD | 121 | flowchart LR |
| 43 | A[Offline Pipeline] --> A1[Dataset Prep] | 122 | A[Offline\nData governance / window slicing / feature extraction / index build] --> B[Registered Artifacts\nfeature_set / index / metadata] |
| 44 | A --> A2[Training] | 123 | B --> C[Online\nquery encode / retrieve / aggregate / decide] |
| 45 | A --> A3[Index Build] | ||
| 46 | A --> A4[Benchmark] | ||
| 47 | |||
| 48 | B[Online Service] --> B1[/health] | ||
| 49 | B --> B2[/recognize] | ||
| 50 | B --> B3[/index/build] | ||
| 51 | ``` | 124 | ``` |
| 52 | 125 | ||
| 126 | ### Offline responsibilities | ||
| 127 | - Asset normalization | ||
| 128 | - Metadata normalization | ||
| 129 | - Window slicing | ||
| 130 | - Model feature extraction | ||
| 131 | - Building fingerprint / embedding indexes | ||
| 132 | - Backfilling PostgreSQL metadata | ||
| 133 | |||
| 134 | ### Online responsibilities | ||
| 135 | - Receive the query | ||
| 136 | - Chunk / encode the query | ||
| 137 | - Recall through the exact / semantic / version lanes | ||
| 138 | - Aggregate at recording/work/song level | ||
| 139 | - Output `canonical_song_id` + evidence | ||
| 140 | |||
| 53 | --- | 141 | --- |
| 54 | 142 | ||
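| 55 | ## 3. Key module table | 143 | ## 5. Why the roles must be separated |
| 56 | 144 | ||
| 57 | | Module | Input | Output | Purpose | | 145 | Because this project is no longer a single-model script, but: |
| 58 | |---|---|---|---| | 146 | |
| 59 | | Preprocess | wav | mel/chroma/f0 | Unified feature entry point | | 147 | 1. **A data governance system**: whose audio it is and which recording/work/song it belongs to |
| 60 | | Fingerprint Matcher | query audio | chroma candidates | Fast recall | | 148 | 2. **A retrieval system**: how to get from a query to candidates |
| 61 | | ECAPA Embedder | mel | embeddings | Semantic vector retrieval | | 149 | 3. **A decision system**: which `canonical_song_id` is finally output |
| 62 | | Melody Similarity | query/ref melody | melody score | Reinforcement for humming queries | | 150 | 4. **A service system**: how to expose the API and observability |
| 63 | | Hybrid Fusion | multi-scores | ranked candidates | Combined ranking | | 151 | 5. **An evolution system**: the base model will change, but the data spec must not churn along with it |
| 64 | | Service API | request | JSON result | External invocation | | ||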
| 65 | 152 | ||
| 66 | --- | 153 | --- |
| 67 | 154 | ||
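| 68 | ## 4. Current design priorities | 155 | ## 6. Recommendations for the current phase |
| 156 | |||
| 157 | ### The priority right now is not further training changes, but: | ||
| 69 | 158 | ||
| 70 | ### 4.1 Why a hybrid structure | 159 | 1. Stabilize the PostgreSQL data spec first |
| 71 | Pure fingerprinting is weak on humming, and pure embeddings fall short on strong local matches and explainability, so a hybrid structure is the safer choice. | 160 | 2. Solidify the `model_registry / feature_set_registry` structure first |
| 161 | 3. In Phase-1, use an open-source encoder directly as the semantic lane baseline | ||
| 162 | 4. Keep the current ECAPA as the historical baseline / control group | ||
| 72 | 163 | ||
| 73 | ### 4.2 Why add melody awareness | 164 | ### Items retained from the current system |
| 74 | Hard cases currently concentrate on humming and near-melody confusion, so the melody signature is used as an auxiliary ranking signal. | 165 | - `Chromaprint`: retained |
| 166 | - `ECAPA baseline`: retained as the control group | ||
| 167 | - `melody rerank`: retained as a supplementary lane, no longer the main direction of evolution | ||
| 75 | 168 | ||
| 76 | ### 4.3 Why a window-level index | 169 | ### Items upgraded in the current system |
| 77 | Averaging embeddings over a whole track loses local segment information; window-level indexing is closer to the ACR use case. | 170 | - Primary semantic lane encoder -> foundation model |
| 171 | - pgvector prototype schema -> extensible PostgreSQL data model | ||
| 172 | - Flat song_id -> canonical_song/work/recording/recording_asset/audio_window | ||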
| 78 | 173 | ||
| 79 | --- | 174 | --- |
| 80 | 175 | ||
| 81 | ## 5. Detail appendix | 176 | ## 7. Mapping to code |
| 82 | 177 | ||
| 83 | Code mapping: | 178 | | Code/doc | Current role | |
| 84 | - `src/engines/chromaprint_matcher.py` | 179 | |---|---| |
| 85 | - `src/engines/ecapa_embedder.py` | 180 | | `acr-engine/src/engines/chromaprint_matcher.py` | exact lane prototype | |
| 86 | - `src/engines/hybrid_engine.py` | 181 | | `acr-engine/src/engines/ecapa_embedder.py` | current embedding lane baseline | |
| 87 | - `src/service/app.py` | 182 | | `acr-engine/src/engines/hybrid_engine.py` | current aggregation prototype | |
| 183 | | `acr-engine/sql/pgvector_schema.sql` | Early pgvector prototype | | ||
| 184 | | `acr-engine/sql/acr_pg_schema_v2.sql` | Recommended PostgreSQL V2 schema | | ||
| 185 | | [postgresql-data-model.md](./postgresql-data-model.md) | V2 schema design notes | | ||
| 186 | |||
| 187 | --- | ||
| 88 | 188 | ||
| 189 | ## 8. Reading suggestions | ||
| 89 | 190 | ||
| 90 | ## Sources | 191 | If you are: |
| 91 | - See `docs/references-and-sources.md` for the current source map. | 192 | - **Architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next |
| 193 | - **Data/backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next | ||
| 194 | - **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then come back to [sota-research-2026.md](./sota-research-2026.md) | ... | ... |
docs/postgresql-data-model.md
0 → 100644
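| 1 | # PostgreSQL Data Model and DDL Design Notes | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Related SQL: [`acr-engine/sql/acr_pg_schema_v2.sql`](../acr-engine/sql/acr_pg_schema_v2.sql) | ||
| 5 | > Goal: provide the PostgreSQL data dictionary, DDL design intent, flow diagrams, and typical usage paths for copyright protection, a large-scale catalog, and replaceable encoders. | ||
| 6 | |||
| 7 | ## One-page summary | ||
| 8 | |||
| 9 | The recommended PostgreSQL design is no longer built around "one model's embedding table"; it is built around the following stable backbone: | ||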
| 10 | |||
| 11 | ```text | ||
| 12 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 13 | -> model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint | ||
| 14 | -> reference_set_registry -> retrieval_index_registry -> retrieval_candidate -> match_decision | ||
| 15 | ``` | ||
| 16 | |||
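| 17 | This design addresses: | ||
| 18 | |||
| 19 | 1. **song/work/recording being conflated** | ||
| 20 | 2. **having to alter tables whenever the model changes** | ||
| 21 | 3. **window-level retrieval that cannot be traced back to evidence** | ||
| 22 | 4. **exact / semantic / future cover lanes that cannot be aggregated uniformly** | ||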
| 23 | |||
| 24 | --- | ||
| 25 | |||
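| 26 | ## 1. Design intent | ||
| 27 | |||
| 28 | ## 1.1 What this design sets out to solve | ||
| 29 | |||
| 30 | ### Problem A: one song can have multiple recorded versions | ||
| 31 | So the model must distinguish: | ||
| 32 | - `canonical_song`: the song that business results finally resolve to | ||
| 33 | - `work`: the work (composition) layer | ||
| 34 | - `recording`: a specific recorded version | ||
| 35 | |||
| 36 | ### Problem B: one recording can have multiple file assets | ||
| 37 | So there must be: | ||
| 38 | - `recording_asset` | ||
| 39 | |||
| 40 | ### Problem C: retrieval actually hits segments, not whole songs | ||
| 41 | So there must be: | ||
| 42 | - `audio_window` | ||
| 43 | |||
| 44 | ### Problem D: the base model will be swapped in the future | ||
| 45 | So there must be: | ||
| 46 | - `model_registry` | ||
| 47 | - `feature_set_registry` | ||
| 48 | |||
| 49 | ### Problem E: multiple index backends will coexist | ||
| 50 | So there must be: | ||
| 51 | - `retrieval_index_registry` | ||
| 52 | |||
| 53 | --- | ||
| 54 | |||
| 55 | ## 1.2 Why not keep extending the prototype "reference_embeddings / query_embeddings" tables | ||
| 56 | |||
| 57 | Because the prototype tables have several limitations: | ||
| 58 | |||
| 59 | 1. The dimension is hard-coded, e.g. `vector(192)` | ||
| 60 | 2. The data objects are too flat and revolve only around `song_id` | ||
| 61 | 3. Multiple encoders cannot be supported cleanly | ||
| 62 | 4. Multiple assets, windows, and feature_sets under one recording cannot be expressed | ||
| 63 | |||
| 64 | So the prototype SQL suits a demo, not the current target of 1 million audio files. | ||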
| 65 | |||
| 66 | --- | ||
| 67 | |||
| 68 | ## 2. Data backbone | ||
| 69 | |||
| 70 | ```mermaid | ||
| 71 | flowchart LR | ||
| 72 | A[canonical_song] --> B[work] | ||
| 73 | B --> C[recording] | ||
| 74 | C --> D[recording_asset] | ||
| 75 | D --> E[audio_window] | ||
| 76 | E --> F[audio_embedding] | ||
| 77 | E --> G[audio_fingerprint] | ||
| 78 | F --> H[retrieval_index_registry] | ||
| 79 | G --> H | ||
| 80 | H --> I[retrieval_candidate] | ||
| 81 | I --> J[match_decision] | ||
| 82 | ``` | ||
| 83 | |||
| 84 | --- | ||
| 85 | |||
| 86 | ## 3. Table groups | ||
| 87 | |||
| 88 | | Group | Tables | Purpose | | ||
| 89 | |---|---|---| | ||
| 90 | | Rights and entities | `canonical_song`, `work`, `recording` | Unify business attribution | | ||
| 91 | | Asset layer | `recording_asset` | Manage the actual file assets | | ||
| 92 | | Window layer | `audio_window` | Manage the smallest evidence segment used in retrieval | | ||
| 93 | | Models and features | `model_registry`, `feature_set_registry`, `audio_embedding`, `audio_fingerprint` | Manage model versions and feature facts | | ||
| 94 | | Reference sets | `reference_set_registry`, `reference_set_member` | Manage hot reference sets and versioned switching | | ||
| 95 | | Index layer | `retrieval_index_registry` | Record backend indexes | | ||
| 96 | | Matching layer | `retrieval_candidate`, `match_decision` | Online recall and final resolution | | ||
| 97 | |||
| 98 | --- | ||
| 99 | |||
| 100 | ## 4. Key tables | ||
| 101 | |||
| 102 | ## 4.1 `canonical_song` | ||
| 103 | The final business primary key. | ||
| 104 | |||
| 105 | Purpose: | ||
| 106 | - The service ultimately returns `canonical_song_id` | ||
| 107 | - Rights attribution, product display, and external business all treat it as authoritative | ||
| 108 | |||
| 109 | ## 4.2 `work` | ||
| 110 | The work layer. | ||
| 111 | |||
| 112 | Purpose: | ||
| 113 | - Different covers / performances of the same song resolve to the work layer | ||
| 114 | - In future phases, the cover/version lane will often aggregate to `work_id` first | ||
| 115 | |||
| 116 | ## 4.3 `recording` | ||
| 117 | The recording layer. | ||
| 118 | |||
| 119 | Purpose: | ||
| 120 | - Manage versions such as official/live/remaster/cover/ugc separately | ||
| 121 | - Allow multiple recordings to map to the same `canonical_song` | ||
| 122 | |||
| 123 | ## 4.4 `recording_asset` | ||
| 124 | The file asset layer. | ||
| 125 | |||
| 126 | Purpose: | ||
| 127 | - One recording can have multiple file versions | ||
| 128 | - Distinguishes master/reference/distribution/captured/query_sample | ||
| 129 | |||
| 130 | ## 4.5 `audio_window` | ||
| 131 | The window layer. | ||
| 132 | |||
| 133 | Purpose: | ||
| 134 | - Build fingerprints | ||
| 135 | - Extract embeddings | ||
| 136 | - Output evidence windows online | ||
| 137 | - Support later curation of segments such as intro/chorus | ||
| 138 | |||
| 139 | ## 4.6 `model_registry` | ||
| 140 | The model registry. | ||
| 141 | |||
| 142 | Purpose: | ||
| 143 | - Record `model_name/model_version/output_embedding_dim` | ||
| 144 | - Business tables stay unchanged when switching to MERT/MuQ/other backbones later | ||
| 145 | |||
| 146 | ## 4.7 `feature_set_registry` | ||
| 147 | The feature version table. | ||
| 148 | |||
| 149 | Purpose: | ||
| 150 | - Record window length, hop, pooling, layer, and metric | ||
| 151 | - Different usages of the same model become different feature_sets | ||
| 152 | |||
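To make the role of window length and hop concrete, here is a minimal Python sketch of how a feature set's `window_sec` and `hop_sec` could translate into `audio_window` boundaries. The `Window` class and `slice_windows` function are hypothetical names, and dropping the trailing partial window is an assumption; the actual pipeline may pad it instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    window_index: int
    start_sec: float
    end_sec: float


def slice_windows(duration_sec: float, window_sec: float, hop_sec: float) -> list[Window]:
    """Return the sliding windows that fit fully inside an asset.

    Assumption: a trailing partial window is dropped, not padded.
    """
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    windows = []
    index = 0
    # Derive each start from the index so float error does not accumulate.
    while index * hop_sec + window_sec <= duration_sec + 1e-9:
        start = index * hop_sec
        windows.append(Window(index, start, start + window_sec))
        index += 1
    return windows
```

With a 12-second asset, `slice_windows(12.0, 5.0, 2.5)` yields three windows starting at 0, 2.5, and 5 seconds; the same asset under a different window/hop pair would produce a different set of windows, which is why each usage is registered as its own feature_set.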
| 153 | ## 4.8 `audio_embedding` | ||
| 154 | The embedding metadata fact table. | ||
| 155 | |||
| 156 | Purpose: | ||
| 157 | - Record which embedding a given asset/window produced under which feature_set | ||
| 158 | - Can point to pgvector, or only to external parquet/npy | ||
| 159 | |||
| 160 | ## 4.9 `reference_set_registry` / `reference_set_member` | ||
| 161 | The reference set version tables. | ||
| 162 | |||
| 163 | Purpose: | ||
| 164 | - Promote "the hot reference set currently online" to an explicit object | ||
| 165 | - Support A/B, canary rollout, rollback, and historical replay | ||
| 166 | - Upgrade `is_reference` from a per-recording flag to a "switchable set" | ||
| 167 | |||
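The "switchable set" idea can be illustrated with a small in-memory Python sketch. `ReferenceSetSwitch` is a hypothetical helper, not part of the schema; in practice the active version would be a row state in `reference_set_registry`, and the column names for that state are not shown here.

```python
class ReferenceSetSwitch:
    """Tracks which reference set version is live and allows rollback."""

    def __init__(self, initial_set_id: int):
        # The history doubles as the rollback stack; the last entry is live.
        self._history = [initial_set_id]

    @property
    def active(self) -> int:
        return self._history[-1]

    def promote(self, set_id: int) -> None:
        """Make a new reference set version the live one."""
        self._history.append(set_id)

    def rollback(self) -> int:
        """Return to the previously live version."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier reference set to roll back to")
        self._history.pop()
        return self.active
```

Because a switch only moves a pointer, the member rows of older sets stay untouched, which is what makes rollback and historical replay cheap.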
| 168 | ## 4.10 `retrieval_index_registry` | ||
| 169 | The index registry. | ||
| 170 | |||
| 171 | Purpose: | ||
| 172 | - One feature_set can have multiple backends / shards / versions attached | ||
| 173 | - pgvector / faiss / milvus can coexist | ||
| 174 | |||
| 175 | ## 4.11 `retrieval_candidate` | ||
| 176 | Recall candidates. | ||
| 177 | |||
| 178 | Purpose: | ||
| 179 | - Store candidates from the exact lane / semantic lane / future cover lane | ||
| 180 | - Make offline analysis and online replay easier | ||
| 181 | |||
| 182 | ## 4.12 `match_decision` | ||
| 183 | The final decision. | ||
| 184 | |||
| 185 | Purpose: | ||
| 186 | - Output `canonical_song_id / work_id / recording_id` | ||
| 187 | - Keep the decision rationale and scores | ||
| 188 | |||
| 189 | --- | ||
| 190 | |||
| 191 | ## 5. Example flow diagrams | ||
| 192 | |||
| 193 | ## 5.1 Offline catalog build flow | ||
| 194 | |||
| 195 | ```mermaid | ||
| 196 | flowchart TD | ||
| 197 | A[Import audio assets] --> B[Write recording_asset] | ||
| 198 | B --> C[Slice windows and write audio_window] | ||
| 199 | C --> D[Register model_registry / feature_set_registry] | ||
| 200 | D --> E[Extract embedding / fingerprint] | ||
| 201 | E --> F[Write audio_embedding / audio_fingerprint] | ||
| 202 | F --> G[Build retrieval index] | ||
| 203 | G --> H[Record in retrieval_index_registry] | ||
| 204 | ``` | ||
| 205 | |||
| 206 | ## 5.2 Online retrieval flow | ||
| 207 | |||
| 208 | ```mermaid | ||
| 209 | sequenceDiagram | ||
| 210 | participant Q as Query Audio | ||
| 211 | participant DB as PostgreSQL | ||
| 212 | participant IDX as Retrieval Index | ||
| 213 | participant SVC as Matching Service | ||
| 214 | |||
| 215 | Q->>SVC: Submit query | ||
| 216 | SVC->>DB: Read active feature_set | ||
| 217 | SVC->>IDX: Query exact lane / semantic lane | ||
| 218 | IDX-->>SVC: Candidate window / recording | ||
| 219 | SVC->>DB: Look up window -> recording -> work -> canonical_song | ||
| 220 | SVC->>DB: Write retrieval_candidate | ||
| 221 | SVC->>DB: Write match_decision | ||
| 222 | SVC-->>Q: Return canonical_song_id + evidence | ||
| 223 | ``` | ||
| 224 | |||
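The roll-up step in the sequence above (window hits resolved to a `canonical_song_id` with evidence) could look roughly like the following Python sketch. `WindowHit` and `aggregate_to_song` are hypothetical names, and scoring a song by the mean of its top-N window scores is an assumed rule for illustration, not the aggregation rule the service defines.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowHit:
    window_id: int
    recording_id: int
    work_id: int
    canonical_song_id: int
    score: float  # normalized similarity, higher is better


def aggregate_to_song(hits: list[WindowHit], top_n: int = 3) -> list[tuple[int, float, list[int]]]:
    """Roll window hits up to canonical_song, keeping evidence window ids.

    Assumption: a song's score is the mean of its top_n window scores.
    """
    by_song = defaultdict(list)
    for hit in hits:
        by_song[hit.canonical_song_id].append(hit)

    ranked = []
    for song_id, song_hits in by_song.items():
        best = sorted(song_hits, key=lambda h: h.score, reverse=True)[:top_n]
        song_score = sum(h.score for h in best) / len(best)
        ranked.append((song_id, song_score, [h.window_id for h in best]))
    # Highest score first; ties are broken by song id for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

Each result tuple carries the evidence window ids, so the final decision can still answer which segments supported it. A mean over top-N windows favors consistently strong matches; a max or a sum would weight a single strong hit or broad coverage differently.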
| 225 | --- | ||
| 226 | |||
| 227 | ## 5.3 Four points to harden before the production freeze | ||
| 228 | |||
| 229 | ### A. Hard lineage constraints | ||
| 230 | Use triggers / transaction invariants to keep the following chains consistent at all times: | ||
| 231 | - `recording.work_id -> work.work_id` | ||
| 232 | - `recording.canonical_song_id -> work.canonical_song_id` | ||
| 233 | - `audio_window.asset_id -> recording_asset.recording_id -> recording/work/song` | ||
| 234 | - `audio_embedding.window_id -> audio_window.recording/work/song` | ||
| 235 | |||
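The invariants listed above can be expressed as a plain check before they are turned into triggers. The Python sketch below is illustrative only: `lineage_violations` is a hypothetical helper operating on dict rows, and it assumes `audio_window` carries denormalized `recording_id`, `work_id`, and `canonical_song_id` columns, as the query examples in this document suggest.

```python
def lineage_violations(window: dict, asset: dict, recording: dict, work: dict) -> list[str]:
    """Return human-readable lineage violations; an empty list means consistent."""
    problems = []
    if recording["work_id"] != work["work_id"]:
        problems.append("recording.work_id does not match work.work_id")
    if recording["canonical_song_id"] != work["canonical_song_id"]:
        problems.append("recording and work disagree on canonical_song_id")
    if window["asset_id"] != asset["asset_id"]:
        problems.append("audio_window.asset_id does not match recording_asset.asset_id")
    if asset["recording_id"] != recording["recording_id"]:
        problems.append("recording_asset.recording_id does not match recording.recording_id")
    # Denormalized columns on audio_window must mirror the recording's lineage.
    for column in ("recording_id", "work_id", "canonical_song_id"):
        if window[column] != recording[column]:
            problems.append(f"audio_window.{column} diverges from recording.{column}")
    return problems
```

Running such a check in an offline audit job is a cheap complement to database-side enforcement, since it can report every broken chain at once instead of failing on the first one.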
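| 236 | ### B. Versioned reference sets | ||
| 237 | Promote the "hot reference set" to an explicit object instead of relying only on `is_reference`. | ||
| 238 | This makes it possible to support: | ||
| 239 | - hot/cold reference switching | ||
| 240 | - A/B comparison | ||
| 241 | - dual indexes coexisting during an encoder upgrade | ||
| 242 | - historical replay | ||
| 243 | |||
| 244 | ### C. Polymorphic constraints on candidate entities | ||
| 245 | `candidate_level + candidate_id` is flexible, but for production it needs at least an enum/constraint so that invalid levels cannot appear in the data. | ||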
| 246 | |||
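An application-side guard for `candidate_level` might look like the Python sketch below. The four level names are assumptions chosen to mirror the entity hierarchy; the actual allowed values should come from the DDL, and `validate_candidate` is a hypothetical helper.

```python
from enum import Enum


class CandidateLevel(str, Enum):
    # Assumed level names; align them with the actual DDL before use.
    WINDOW = "window"
    RECORDING = "recording"
    WORK = "work"
    CANONICAL_SONG = "canonical_song"


def validate_candidate(candidate_level: str, candidate_id: int) -> CandidateLevel:
    """Reject unknown levels and non-positive ids before a row is written."""
    try:
        level = CandidateLevel(candidate_level)
    except ValueError:
        raise ValueError(f"invalid candidate_level: {candidate_level!r}") from None
    if candidate_id <= 0:
        raise ValueError("candidate_id must be a positive integer")
    return level
```

A guard like this does not replace a database constraint, but it keeps invalid levels from reaching the write path in the first place.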
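| 247 | ### D. Rules for extending vector dimensions | ||
| 248 | The current `192/768` physical tables are the hot-path implementation, not an upper bound on dimensions. Adding an encoder with a new dimension should follow a fixed playbook: | ||
| 249 | 1. Add a new `audio_embedding_vector_<dim>` physical table | ||
| 250 | 2. Backfill the embeddings for the corresponding `feature_set` | ||
| 251 | 3. Build the corresponding index | ||
| 252 | 4. Switch the active hot index through `retrieval_index_registry` | ||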
| 253 | |||
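On the application side, the playbook implies a routing step from embedding dimension to physical table. A minimal Python sketch follows; `vector_table_for` and `PROVISIONED_DIMS` are hypothetical, and it assumes the existing 192/768 tables follow the same `audio_embedding_vector_<dim>` naming as new ones.

```python
# Dimensions assumed to already have a physical table (the 192/768 examples).
PROVISIONED_DIMS = frozenset({192, 768})


def vector_table_for(embedding_dim: int, provisioned=PROVISIONED_DIMS) -> str:
    """Map an embedding dimension to its physical vector table name."""
    if embedding_dim not in provisioned:
        # Fail loudly so a new encoder cannot silently write to the wrong table.
        raise LookupError(
            f"no physical table for dim {embedding_dim}; "
            f"create audio_embedding_vector_{embedding_dim} and backfill first"
        )
    return f"audio_embedding_vector_{embedding_dim}"
```

Failing on an unprovisioned dimension forces steps 1 to 3 of the playbook to happen before any feature set with that dimension can be served.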
| 254 | --- | ||
| 255 | |||
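| 256 | ## 6. Main principles behind the recommended DDL | ||
| 257 | |||
| 258 | ## Principle 1: stable object relationships, changeable models | ||
| 259 | Stable: | ||
| 260 | - `song/work/recording/asset/window` | ||
| 261 | |||
| 262 | Changeable: | ||
| 263 | - `model_name` | ||
| 264 | - `feature_set` | ||
| 265 | - `index_backend` | ||
| 266 | |||
| 267 | ## Principle 2: do not hard-wire one vector store as the single source of truth | ||
| 268 | Split vector facts into: | ||
| 269 | - a PostgreSQL metadata master table | ||
| 270 | - vectors stored either in per-dimension pgvector tables or in external files | ||
| 271 | |||
| 272 | ## Principle 3: the window is the smallest unit of evidence | ||
| 273 | Copyright protection ultimately has to answer more than "this song was hit": | ||
| 274 | - which segment was hit | ||
| 275 | - which recorded version | ||
| 276 | - which work/song it is attributed to | ||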
| 277 | |||
| 278 | --- | ||
| 279 | |||
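| 280 | ## 7. Recommended physical implementation | ||
| 281 | |||
| 282 | ## 7.1 PostgreSQL is responsible for | ||
| 283 | - Master data | ||
| 284 | - Model registration | ||
| 285 | - Feature registration | ||
| 286 | - Index registration | ||
| 287 | - Retrieval candidates | ||
| 288 | - Review / decisions | ||
| 289 | |||
| 290 | ## 7.2 pgvector is responsible for | ||
| 291 | - The hot reference set | ||
| 292 | - Low-latency online nearest-neighbor queries | ||
| 293 | |||
| 294 | ## 7.3 External object storage / file layer is responsible for | ||
| 295 | - Raw audio | ||
| 296 | - Normalized audio | ||
| 297 | - Large-volume embedding parquet/npy | ||
| 298 | - Index shard files | ||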
| 299 | |||
| 300 | --- | ||
| 301 | |||
| 302 | ## 8. Why this design suits SOTA evolution better | ||
| 303 | |||
| 304 | Because what is most likely to change in the future is not the `canonical_song` structure, but: | ||
| 305 | |||
| 306 | | What changes | Corresponding table | | ||
| 307 | |---|---| | ||
| 308 | | Base model | `model_registry` | | ||
| 309 | | Feature version | `feature_set_registry` | | ||
| 310 | | embedding dim | `model_registry.output_embedding_dim` | | ||
| 311 | | Pooling and layer selection | `feature_set_registry.pooling_strategy/layer_selection` | | ||
| 312 | | Index backend | `retrieval_index_registry.index_backend` | | ||
| 313 | |||
| 314 | So the goal of the schema is: | ||
| 315 | |||
| 316 | > **Let models, indexes, and features change without letting master data and business attribution logic break along with them.** | ||
| 317 | |||
| 318 | --- | ||
| 319 | |||
| 320 | ## 9. DDL files | ||
| 321 | |||
| 322 | Recommended for direct use: | ||
| 323 | - [`acr-engine/sql/acr_pg_schema_v2.sql`](../acr-engine/sql/acr_pg_schema_v2.sql) | ||
| 324 | |||
| 325 | It contains: | ||
| 326 | - Master data tables | ||
| 327 | - Model registry tables | ||
| 328 | - Feature tables | ||
| 329 | - Vector physical tables (192/768-dimension examples) | ||
| 330 | - Index suggestions | ||
| 331 | |||
| 332 | The original: | ||
| 333 | - [`acr-engine/sql/pgvector_schema.sql`](../acr-engine/sql/pgvector_schema.sql) | ||
| 334 | |||
| 335 | should be treated as: | ||
| 336 | - a prototype / demo version / compatibility reference | ||
| 337 | |||
| 338 | --- | ||
| 339 | |||
| 340 | ## 10. Suggested implementation order | ||
| 341 | |||
| 342 | ### First batch, to land first | ||
| 343 | 1. `canonical_song` | ||
| 344 | 2. `work` | ||
| 345 | 3. `recording` | ||
| 346 | 4. `recording_asset` | ||
| 347 | 5. `audio_window` | ||
| 348 | 6. `model_registry` | ||
| 349 | 7. `feature_set_registry` | ||
| 350 | 8. `audio_embedding` | ||
| 351 | 9. `retrieval_index_registry` | ||
| 352 | |||
| 353 | ### Second batch, to add afterwards | ||
| 354 | 1. `retrieval_candidate` | ||
| 355 | 2. `match_decision` | ||
| 356 | 3. `audio_fingerprint` | ||
| 357 | 4. Vector physical tables for more dimensions | ||
| 358 | |||
| 359 | --- | ||
| 360 | |||
| 361 | ## 11. Typical registration and query examples | ||
| 362 | |||
| 363 | ## 11.1 Register an open-source model | ||
| 364 | |||
| 365 | ```sql | ||
| 366 | insert into model_registry ( | ||
| 367 | model_name, model_family, model_version, model_source, model_uri, | ||
| 368 | input_sample_rate, default_window_sec, default_hop_sec, output_embedding_dim, | ||
| 369 | pooling_supported, layer_selection_supported, is_trainable | ||
| 370 | ) values ( | ||
| 371 | 'mert', 'music_ssl', 'v1-95m', 'github', 'https://github.com/yizhilll/MERT', | ||
| 372 | 24000, 5.0, 2.5, 768, | ||
| 373 | array['mean','cls'], true, false | ||
| 374 | ); | ||
| 375 | ``` | ||
| 376 | |||
| 377 | ## 11.2 Register a feature set | ||
| 378 | |||
| 379 | ```sql | ||
| 380 | insert into feature_set_registry ( | ||
| 381 | model_id, feature_name, feature_level, extraction_granularity, | ||
| 382 | window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection, | ||
| 383 | normalize_l2, distance_metric, quantization_type, feature_schema_version | ||
| 384 | ) | ||
| 385 | select | ||
| 386 | model_id, 'semantic_embedding', 'window', 'sliding_window', | ||
| 387 | 5.0, 2.5, 768, 'mean', 'final', | ||
| 388 | true, 'cosine', 'none', 'v1' | ||
| 389 | from model_registry | ||
| 390 | where model_name = 'mert' and model_version = 'v1-95m'; | ||
| 391 | ``` | ||
| 392 | |||
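The feature set above records `mean` pooling, `normalize_l2 = true`, and a `cosine` metric. What those settings mean for a single window can be sketched in plain Python. The function names are hypothetical, and the frame vectors stand in for encoder output, which is not modeled here.

```python
import math


def mean_pool(frames: list[list[float]]) -> list[float]:
    """Average frame-level vectors (time x dim) into one window vector."""
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Storing L2-normalized vectors means cosine similarity reduces to a dot product at query time, which is one reason to record `normalize_l2` and `distance_metric` together in the feature set rather than leave them implicit.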
| 393 | ## 11.3 Query the currently active reference feature set | ||
| 394 | |||
| 395 | ```sql | ||
| 396 | select fs.feature_set_id, mr.model_name, mr.model_version, | ||
| 397 | fs.window_sec, fs.hop_sec, fs.pooling_strategy, fs.distance_metric | ||
| 398 | from feature_set_registry fs | ||
| 399 | join model_registry mr on mr.model_id = fs.model_id | ||
| 400 | where fs.status = 'active' | ||
| 401 | and fs.feature_level = 'window' | ||
| 402 | and fs.feature_name = 'semantic_embedding' | ||
| 403 | order by fs.feature_set_id desc; | ||
| 404 | ``` | ||
| 405 | |||
| 406 | ## 11.4 Trace a candidate window back to the final song | ||
| 407 | |||
| 408 | ```sql | ||
| 409 | select rc.query_id, rc.rank_no, rc.normalized_score, | ||
| 410 | aw.window_id, aw.start_sec, aw.end_sec, | ||
| 411 | r.recording_id, r.version_type, | ||
| 412 | w.work_id, | ||
| 413 | cs.canonical_song_id, cs.title, cs.primary_artist | ||
| 414 | from retrieval_candidate rc | ||
| 415 | join audio_window aw on aw.window_id = rc.evidence_window_id | ||
| 416 | join recording r on r.recording_id = aw.recording_id | ||
| 417 | join work w on w.work_id = aw.work_id | ||
| 418 | join canonical_song cs on cs.canonical_song_id = aw.canonical_song_id | ||
| 419 | where rc.query_id = :query_id | ||
| 420 | order by rc.rank_no asc; | ||
| 421 | ``` | ||
| 422 | |||
| 423 | ## 11.5 List all reference assets and windows for a song | ||
| 424 | |||
| 425 | ```sql | ||
| 426 | select cs.canonical_song_id, cs.title, | ||
| 427 | r.recording_id, r.version_type, r.is_reference, | ||
| 428 | ra.asset_id, ra.storage_uri, | ||
| 429 | aw.window_id, aw.window_index, aw.start_sec, aw.end_sec | ||
| 430 | from canonical_song cs | ||
| 431 | join recording r on r.canonical_song_id = cs.canonical_song_id | ||
| 432 | join recording_asset ra on ra.recording_id = r.recording_id | ||
| 433 | left join audio_window aw on aw.asset_id = ra.asset_id | ||
| 434 | where cs.canonical_song_id = :canonical_song_id | ||
| 435 | order by r.reference_priority asc, ra.asset_id asc, aw.window_index asc; | ||
| 436 | ``` | ||
| 437 | |||
| 438 | ## 11.6 Check whether index builds have completed for a feature set | ||
| 439 | |||
| 440 | ```sql | ||
| 441 | select fs.feature_set_id, mr.model_name, mr.model_version, | ||
| 442 | ri.index_backend, ri.index_type, ri.row_count, ri.index_status, ri.built_at | ||
| 443 | from feature_set_registry fs | ||
| 444 | join model_registry mr on mr.model_id = fs.model_id | ||
| 445 | left join retrieval_index_registry ri on ri.feature_set_id = fs.feature_set_id | ||
| 446 | where fs.feature_set_id = :feature_set_id; | ||
| 447 | ``` | ||
| 448 | |||
| 449 | --- | ||
| 450 | |||
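| 451 | ## 12. Current recommendation | ||
| 452 | |||
| 453 | If you need to start loading data into PostgreSQL today, the recommended approach is: | ||
| 454 | |||
| 455 | 1. Land `song/work/recording/asset/window` solidly first | ||
| 456 | 2. Land `model_registry / feature_set_registry` solidly at the same time | ||
| 457 | 3. In Phase-1, only register open-source encoder feature sets; do not hard-wire them to a specific embedding column | ||
| 458 | 4. Put the hot reference set on pgvector first; bring cold data in through external files or a later index layer | ||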
docs/sota-evolution-guide.md
0 → 100644
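| 1 | # SOTA Evolution Guide | ||
| 2 | |||
| 3 | > Updated: 2026-06-04 | ||
| 4 | > Goal: lay out a Phase-1 path of "no fine-tuning yet, open-source encoders first", and spell out how it evolves later into a stronger copyright-protection / version-attribution system. | ||
| 5 | |||
| 6 | ## One-page summary | ||
| 7 | |||
| 8 | If the current constraints are: | ||
| 9 | - no fine-tuning of the base model yet | ||
| 10 | - land the data spec first | ||
| 11 | - first solve the basic retrieval and attribution problems for 1 million audio files / 300,000 songs | ||
| 12 | |||
| 13 | then the most reasonable Phase-1 path is not "retrain a new model from scratch", but: | ||
| 14 | |||
| 15 | 1. **Keep the exact lane**: Chromaprint / fingerprint | ||
| 16 | 2. **Primary semantic lane backbone**: MERT-v1-95M | ||
| 17 | 3. **Semantic lane challenger**: MuQ | ||
| 18 | 4. **Stabilize the database first**: `model_registry + feature_set_registry + audio_embedding + retrieval_index_registry` | ||
| 19 | 5. **Aggregate results layer by layer first**: window -> recording -> work -> canonical_song | ||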
| 20 | |||
| 21 | --- | ||
| 22 | |||
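| 23 | ## 1. Why go encoder-only in Phase-1 | ||
| 24 | |||
| 25 | Because the most pressing problem right now is not "the limit of model accuracy", but: | ||
| 26 | |||
| 27 | - The catalog is large: 1 million audio files / 300,000 songs | ||
| 28 | - The data relationships are complex: one song may have multiple recordings, versions, and source assets | ||
| 29 | - If the data spec is unstable, every future model upgrade will trigger repeated rework | ||
| 30 | |||
| 31 | So the Phase-1 goal should be: | ||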
| 32 | |||
| 33 | ```mermaid | ||
| 34 | flowchart LR | ||
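| 35 | A[Freeze the data spec] --> B[Integrate open-source encoder] | ||
| 36 | B --> C[Establish semantic baseline] | ||
| 37 | C --> D[Validate large-scale indexing and aggregation] | ||
| 38 | D --> E[Then decide whether to move to fine-tuning / version lane] | ||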
| 39 | ``` | ||
| 40 | |||
| 41 | --- | ||
| 42 | |||
| 43 | ## 2. Recommended phases | ||
| 44 | |||
| 45 | ## Phase-0: current repository state (already in place) | ||
| 46 | - `Chromaprint + ECAPA + melody rerank` | ||
| 47 | - The training / index build / evaluation / serving loop runs end to end | ||
| 48 | - Suitable as a baseline, not as the final production backbone | ||
| 49 | |||
| 50 | ## Phase-1: Encoder-only foundation baseline (currently recommended) | ||
| 51 | - exact lane: Chromaprint | ||
| 52 | - semantic lane: MERT-v1-95M | ||
| 53 | - challenger: MuQ | ||
| 54 | - No fine-tuning of the base model | ||
| 55 | - Only feature extraction + index + aggregation | ||
| 56 | |||
| 57 | ## Phase-2: Version / Cover lane | ||
| 58 | - Starts once the Phase-1 data model is stable | ||
| 59 | - Introduces a dedicated cover/version branch | ||
| 60 | - Strengthens work-level attribution | ||
| 61 | |||
| 62 | ## Phase-3: Industrial retrieval stack | ||
| 63 | - ANN + reranker | ||
| 64 | - online/offline artifact registry | ||
| 65 | - Monitoring, replay, audit, manual review | ||
| 66 | |||
| 67 | --- | ||
| 68 | |||
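| 69 | ## 3. Recommended model combination for Phase-1 | ||
| 70 | |||
| 71 | ## 3.1 Exact lane | ||
| 72 | ### Selection | ||
| 73 | - Chromaprint / landmark hash | ||
| 74 | |||
| 75 | ### What it covers | ||
| 76 | - Segments of the original track | ||
| 77 | - Platform transcodes | ||
| 78 | - near-duplicate | ||
| 79 | - Strong matches on local segments | ||
| 80 | |||
| 81 | ### Why keep it | ||
| 82 | Copyright protection cannot rely on semantic embeddings alone. In many real complaint and evidence-collection scenarios, the exact lane is still the fastest first path and the one with the strongest evidence. | ||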
| 83 | |||
| 84 | --- | ||
| 85 | |||
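| 86 | ## 3.2 Primary semantic lane model: MERT-v1-95M | ||
| 87 | |||
| 88 | ### Why it is recommended | ||
| 89 | - It is a music SSL foundation model | ||
| 90 | - It has a published paper and a public implementation | ||
| 91 | - It fits the role of a music-task backbone better than a small self-trained ECAPA | ||
| 92 | - Using it directly as a frozen encoder in Phase-1 keeps both cost and risk lower | ||
| 93 | |||
| 94 | ### Role in Phase-1 | ||
| 95 | - Produce window embeddings as the primary encoder | ||
| 96 | - Serve as the retrieval baseline for noisy / BGM / general cross-domain queries | ||
| 97 | - Later it can continue as a teacher or remain for compatibility with older index versions | ||
| 98 | |||
| 99 | ### Recommended feature sets | ||
| 100 | 1. `mert_v1_95m__window_5s_hop_2.5s__meanpool__l2` | ||
| 101 | 2. `mert_v1_95m__window_10s_hop_5s__meanpool__l2` | ||
| 102 | |||
| 103 | ### Why start with two | ||
| 104 | - `5s/2.5s`: better for local localization | ||
| 105 | - `10s/5s`: better for stable overall semantics | ||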
| 106 | |||
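The two recommended feature set names follow a visible pattern: model, window/hop, pooling, normalization, joined by double underscores. A small Python sketch of a name builder is shown below; `feature_set_name` is a hypothetical helper inferred from those two examples, not an established convention of the repository.

```python
def feature_set_name(model: str, window_sec: float, hop_sec: float,
                     pooling: str = "mean", l2: bool = True) -> str:
    """Build a name such as mert_v1_95m__window_5s_hop_2.5s__meanpool__l2."""

    def fmt(seconds: float) -> str:
        # The "g" format drops trailing zeros: 5.0 -> "5", 2.5 -> "2.5".
        return f"{seconds:g}"

    parts = [
        model,
        f"window_{fmt(window_sec)}s_hop_{fmt(hop_sec)}s",
        f"{pooling}pool",
    ]
    if l2:
        parts.append("l2")
    return "__".join(parts)
```

Deriving the name from the registered parameters, rather than typing it by hand, keeps the label and the `feature_set_registry` row from drifting apart.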
| 107 | --- | ||
| 108 | |||
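| 109 | ## 3.3 Semantic lane challenger: MuQ | ||
| 110 | |||
| 111 | ### Why it is recommended | ||
| 112 | - Newer, and closer to the next-generation music foundation model direction | ||
| 113 | - Worth running as a challenger baseline | ||
| 114 | - Even without fine-tuning, it may outperform earlier backbones on some MIR tasks | ||
| 115 | |||
| 116 | ### Current recommendation | ||
| 117 | - In Phase-1, use it as a control first; do not replace MERT immediately | ||
| 118 | - Key things to validate: stability of the vector distribution, window-level retrieval performance, memory / inference cost | ||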
| 119 | |||
| 120 | --- | ||
| 121 | |||
| 122 | ## 3.4 Why Phase-1 does not take CoverHunter as the main line | ||
| 123 | |||
| 124 | Because CoverHunter's strengths lie in: | ||
| 125 | - cover song identification | ||
| 126 | - alignment / refined attention / coarse-to-fine training | ||
| 127 | |||
| 128 | while the current constraints are: | ||
| 129 | - no fine-tuning yet | ||
| 130 | - open-source encoders first | ||
| 131 | - land the data and retrieval spec first | ||
| 132 | |||
| 133 | So it is better suited as **the direction for the Phase-2 version/cover lane** than as the main Phase-1 baseline. | ||
| 134 | |||
| 135 | --- | ||
| 136 | |||
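| 137 | ## 4. Role focus areas | ||
| 138 | |||
| 139 | ## 4.1 Model backbone role | ||
| 140 | Focus on: | ||
| 141 | - Which encoders are registered in `model_registry` | ||
| 142 | - Each encoder's input SR, window, pooling, and embedding dim | ||
| 143 | - Which feature sets are production candidates and which are only experimental | ||
| 144 | |||
| 145 | ## 4.2 Retrieval role | ||
| 146 | Focus on: | ||
| 147 | - How the fingerprint lane and semantic lane are combined | ||
| 148 | - The `recording/work/song` aggregation rules | ||
| 149 | - How to output top-k candidates stably | ||
| 150 | |||
| 151 | ## 4.3 Data role | ||
| 152 | Focus on: | ||
| 153 | - Asset deduplication | ||
| 154 | - Reference asset selection | ||
| 155 | - window manifest | ||
| 156 | - Whether a full rebuild of features and indexes is supported | ||
| 157 | |||
| 158 | ## 4.4 Operations / platform role | ||
| 159 | Focus on: | ||
| 160 | - Whether encoder version switches can be canaried | ||
| 161 | - Whether index rebuilds can run in parallel | ||
| 162 | - Whether hot/cold and historical indexes can be rolled back | ||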
| 163 | |||
| 164 | --- | ||
| 165 | |||
| 166 | ## 5. Phase-1 implementation order | ||
| 167 | |||
| 168 | ```mermaid | ||
| 169 | flowchart TD | ||
| 170 | A[Freeze PostgreSQL data spec] --> B[Import canonical/work/recording/asset/window] | ||
| 171 | B --> C[Register model_registry / feature_set_registry] | ||
| 172 | C --> D[Extract MERT features] | ||
| 173 | C --> E[Extract MuQ features] | ||
| 174 | D --> F[Build semantic index] | ||
| 175 | E --> F | ||
| 176 | F --> G[Aggregate with fingerprint lane] | ||
| 177 | G --> H[Output canonical_song_id / work_id / recording_id] | ||
| 178 | ``` | ||
| 179 | |||
| 180 | --- | ||
| 181 | |||
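| 182 | ## 6. What each phase solves | ||
| 183 | |||
| 184 | | Phase | Solves | Deferred | | ||
| 185 | |---|---|---| | ||
| 186 | | Phase-1 | Data spec, open-source backbone baseline, rebuildable indexes, song/work/recording aggregation | Backbone fine-tuning, cover-specific training, melody tower | | ||
| 187 | | Phase-2 | version/cover attribution, work-level recall | More complex cross-modal humming | | ||
| 188 | | Phase-3 | Production-grade service, replay, monitoring, manual-review loop | Pushing research SOTA to the limit | | ||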
| 189 | |||
| 190 | --- | ||
| 191 | |||
| 192 | ## 7. Relationship to the current repository | ||
| 193 | |||
| 194 | ### Retained | ||
| 195 | - `ECAPA baseline`: kept as a control, not as the long-term primary backbone | ||
| 196 | - `Chromaprint`: kept, and very important in copyright-protection scenarios | ||
| 197 | - `melody rerank`: kept as an auxiliary lane | ||
| 198 | |||
| 199 | ### Newly added | ||
| 200 | - `model_registry` | ||
| 201 | - `feature_set_registry` | ||
| 202 | - foundation encoder feature extraction and registration | ||
| 203 | - A clearer `canonical_song / work / recording` data structure | ||
| 204 | |||
| 205 | --- | ||
| 206 | |||
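| 207 | ## 8. Current recommendation | ||
| 208 | |||
| 209 | If the Phase-1 plan had to be fixed today, I would recommend: | ||
| 210 | |||
| 211 | 1. **Leave the training main line unchanged for now, and do not delete ECAPA** | ||
| 212 | 2. **Add a MERT-v1-95M semantic lane** | ||
| 213 | 3. **Add a MuQ challenger lane** | ||
| 214 | 4. **Build the hot index first from only the primary reference windows with `is_reference=true`** | ||
| 215 | 5. **Treat the PostgreSQL design as the main deliverable first** | ||
| 216 | |||
| 217 | In other words: | ||
| 218 | |||
| 219 | > The core of Phase-1 is not "which model ultimately wins", but getting the long-term structure of "data spec + model registry + feature registry + index registry" stable first. | ||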