PostgreSQL data model / current song-centric 4-table design


Updated: 2026-06-04

Related SQL: acr-engine/sql/acr_pg_schema_songcentric_v1.sql


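1. One-page summary

By default, the current design recognizes only 4 core physical tables:

media_entity -> audio_object -> feature_fact -> set_membership


Their logical semantics read as:

song -> asset -> window -> fingerprint / embedding


Core value of this design:


song-centric: the final, stable return value is song_id


Merge-first: lowers the first-stage cost of understanding recording/work/version

Unified features: the exact lane and the semantic lane both land in feature_fact


Swappable models: changing model_name/model_version/feature_set_name does not require re-splitting the schema

Traceable evidence: any recall hit can be traced back to a specific window -> asset -> song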


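2. Why the design now converges on 4 tables

The current goal is not to build the most complete music-copyright knowledge graph first, but to make one thing reliable:


Given a query involving a recording, BGM, clip, or cover, quickly locate the song_id it most likely corresponds to.


The current priorities are therefore:


Fix song as the final attribution object
Keep asset, so that one song can have multiple audio files
Keep window, to support slice-level evidence and offsets
Use a single feature_fact table to hold both fingerprints and embeddings
Use a single set_membership table to manage the reference/eval/hot sets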


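3. What each of the 4 tables solves

| Table | Current main types | Problem it solves |
| --- | --- | --- |
| media_entity | song | Who the final attribution object is |
| audio_object | asset, window | What the actual audio file is and which windows it was cut into |
| feature_fact | fingerprint, embedding | Which model was used on each window/object and what feature it produced |
| set_membership | reference_set, eval_set, hot_set | Which song/asset/window/feature belongs to which set |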


4. Which table holds slices, models, and features

| Business object | Physical table | Key fields | Purpose |
| --- | --- | --- | --- |
| song | media_entity | entity_type='song' | Final return value song_id |
| asset | audio_object | object_type='asset' | Stores metadata of the original audio file |
| window | audio_object | object_type='window', parent_object_id=<asset_id> | Stores slice range, offset, evidence |
| fingerprint | feature_fact | feature_type='fingerprint', fingerprint_value | exact lane retrieval |
| embedding | feature_fact | feature_type='embedding', embedding_uri/vector_table_name, embedding_dim | semantic lane retrieval |
| model identity | feature_fact | model_name, model_version | Distinguishes MERT / MuQ / ECAPA / fallback |
| feature set identity | feature_fact | feature_set_name, feature_schema_ver | Distinguishes feature config, windowing strategy, schema version |
| reference routing | set_membership | set_type, set_name | Controls the reference/eval/hot scope |


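4.1 A key design point

For now, model information is not placed in a separate registry table that the default main chain depends on; it is recorded directly in feature_fact:


This keeps Phase-1 lighter
It better fits the current strategy of reusing open-source encoders directly, without training or fine-tuning first
If a registry is added later, the facts already in feature_fact can be registered back into it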

5. Core flow diagrams


5.1 Ingestion flow

flowchart TD
    A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset]
    B --> C[audio_object\nobject_type=window]
    C --> D1[feature_fact\nfeature_type=fingerprint]
    C --> D2[feature_fact\nfeature_type=embedding]
    A --> E[set_membership]
    B --> E
    C --> E


5.2 Query traceback flow

flowchart LR
    A[query audio] --> B[slice into query windows]
    B --> C[extract fingerprint / embedding]
    C --> D[hit feature_fact]
    D --> E[audio_object window]
    E --> F[audio_object asset]
    F --> G[media_entity song]
    G --> H[return song_id + evidence]


5.3 Table responsibility view

flowchart TB
    M[media_entity\nwho] --> A[audio_object\nwhich audio / which slice]
    A --> F[feature_fact\nwhich model / what feature]
    M --> S[set_membership\nwhich reference/eval/hot set]
    A --> S
    F --> R[recall / matching / aggregation]


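6. Design intent of each table


6.1 media_entity


Purpose:


Serves as the main entity table for song
Carries song_id in one place

If needed later, work/recording types may also be kept, but for now only song is treated as the primary semantics


Most-used fields today:


entity_id
entity_type
biz_key
title
artist_name
metadata_json


Design intent:


Stop scattering song-related fields across multiple tables
Fix the final attribution object first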

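6.2 audio_object


Purpose:


Manages both asset and window

Uses parent_object_id to build the asset -> window parent-child relationship


Most-used fields today:


object_type
song_id
parent_object_id
storage_uri
checksum
duration_ms
start_ms
end_ms


Design intent:


One song can have multiple audio files
One audio file can be cut into multiple retrieval windows
After a query hit, the specific offset can be looked up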

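6.3 feature_fact


Purpose:


Stores feature facts of both the exact lane and the semantic lane in one place
Attaches model information, feature-set information, and the location of the feature payload in one place


Most-used fields today:


feature_type
object_id
song_id
model_name
model_version
feature_set_name
embedding_dim
fingerprint_value
embedding_uri
vector_table_name


Design intent:


Avoid building a pile of parallel embedding tables for different models
When switching to MERT / MuQ / another encoder later, only feature rows are added; the main schema does not change
The exact and semantic lanes can share the same attribution chain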


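6.4 set_membership


Purpose:


Manages reference_set / eval_set / hot_set in one place
A member can be a song/asset/window/feature


Design intent:


The reference scope is not hard-coded into the song table
Evaluation sets, hot sets, and canary sets can share one relationship table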


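7. Why this distribution of "slice data + model + feature" is the most reasonable


Slice data goes in audio_object


Because a slice is essentially a kind of audio object:


It has a parent asset
It has start_ms/end_ms

It needs to be traceable and reusable


Model information goes in feature_fact


Because the model is an attribute of "one feature computation":


The same window may be encoded by several models
The same model may have several versions
Model name and version should be bound to the feature result, not only to the asset


Features go in feature_fact


Because a feature is a fact:


Some object
using some model
with some feature set
produced some result


That is exactly one fact record.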


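8. How the first phase serves 1M audio files / 300K songs


Suggested write order


First write media_entity(song)

Then write audio_object(asset)

Then cut audio_object(window) in batches

Then write feature_fact in per-model batches

Finally write set_membership(reference_set/hot_set/eval_set)


Why this order

Because it decouples the "audio object lifecycle" from the "model computation lifecycle":


Audio is ingested first
Slices are fixed first
The exact lane can run first
The semantic lane can be backfilled later without affecting the main chain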


9. Recommended Phase-1 strategy


9.1 exact lane


Default: ChromaprintMatcher

Lands in: feature_fact(feature_type='fingerprint')


9.2 semantic lane


Current priority: MERT

Challenger: MuQ

If the runtime is unavailable on the current host, keep a fallback
Lands in: feature_fact(feature_type='embedding')


9.3 Why ECAPA-TDNN does not lead


ECAPA leans toward speaker/audio identity
The current goal is copyright protection / song-level ACR

MERT / MuQ are better suited as the song semantic baseline/challenger


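10. Problems the current design solves

This 4-table design currently addresses mainly:


Managing multiple audio files under one song
Managing slice-level evidence
Storing fingerprints and embeddings in one place
Switching models without restructuring the main schema
Governing the reference/eval/hot sets in one place
Getting back to song_id quickly after a retrieval hit


11. Problems the current design deliberately leaves aside

Phase-1 does not insist on:


Complex work / recording / version governance
A complete rights-layer graph
A training/fine-tuning closed loop
A heavyweight registry-first system


All of these can be added step by step later, but they should not block the current main chain.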


12. Related documents


README.md
start-here.md
session-handoff.md
postgres_db_schema_samples.md


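13. How online retrieval gets from a feature back to song_id


This is the traceback chain developers most need to keep in mind right now:

feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)


13.1 Online retrieval flow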

flowchart LR
    Q[query audio] --> QW[query windows]
    QW --> QE[query fingerprint / embedding]
    QE --> FF[feature_fact]
    FF --> W[audio_object\nobject_type=window]
    W --> A[audio_object\nobject_type=asset]
    A --> S[media_entity\nentity_type=song]
    S --> R[return song_id + title + artist + evidence]


13.2 Aggregation flow

flowchart TD
    A[query window features] --> B[hit multiple feature_fact rows]
    B --> C[look up window]
    C --> D[look up asset]
    D --> E[aggregate to song_id]
    E --> F[sort by hit_count / score / offset coverage]
    F --> G[return topK songs]


13.3 Minimal lookup SQL template

select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       ff.model_version,
       ff.feature_set_name,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       a.object_id as asset_id,
       a.storage_uri,
       s.entity_id as song_id,
       s.title,
       s.artist_name
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join audio_object a
  on a.object_id = w.parent_object_id
 and a.object_type = 'asset'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = :feature_id;


13.4 A song-level aggregation SQL template

select ff.song_id,
       s.title,
       s.artist_name,
       count(*) as matched_windows,
       min(w.start_ms) as first_hit_ms,
       max(w.end_ms) as last_hit_ms
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_type = :feature_type
  and ff.model_name = :model_name
  and ff.feature_set_name = :feature_set_name
  and ff.feature_id = any(:matched_feature_ids)
group by ff.song_id, s.title, s.artist_name
order by matched_windows desc, first_hit_ms asc
limit 20;


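13.5 Why this chain matters

Because it separates 3 things cleanly:


feature_fact answers: which feature was hit


audio_object(window/asset) answers: which segment was hit and which file it came from


media_entity(song) answers: which song_id the hit finally belongs to


So even without the more complex recording/work/version layers, Phase-1 is already enough to support:


Copyright attribution
Clip/BGM localization
Evidence lookup
topK song-level recall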


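14. How the exact and semantic lanes are fused into a song ranking

The current recommendation is to think of online recall as two parallel lanes:


exact lane: fingerprints such as chromaprint

semantic lane: MERT / MuQ / fallback embedding


Neither lane should return feature_id directly; both must first go back through:

feature_fact -> window -> asset -> song


and then aggregate at the song_id level.


14.1 Fusion flow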

flowchart TD
    Q[query audio] --> WQ[query windows]
    WQ --> E1[exact lane\nfingerprint retrieval]
    WQ --> E2[semantic lane\nembedding retrieval]
    E1 --> C1[exact candidates\nfeature_fact rows]
    E2 --> C2[semantic candidates\nfeature_fact rows]
    C1 --> N1[normalize exact scores]
    C2 --> N2[normalize semantic scores]
    N1 --> G[song_id aggregation]
    N2 --> G
    G --> R[rerank top songs]
    R --> O[return topK song_ids + evidence]


14.2 What to look at during song-level aggregation

It is recommended to keep at least these aggregation signals:


exact_hit_count
semantic_hit_count
exact_best_score
semantic_best_score
matched_asset_count
matched_window_count
offset_coverage_ms
first_hit_ms
last_hit_ms
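
The signals above can be computed in the application layer once hits have been traced back to song_id. The following is a minimal sketch: the hit dictionary layout (`song_id`, `asset_id`, `window_id`, `feature_type`, `score`, `start_ms`, `end_ms`) and `aggregate_by_song` are assumptions for illustration, and `offset_coverage_ms` is computed as last hit end minus first hit start.

```python
from collections import defaultdict

def aggregate_by_song(hits):
    """Collapse per-feature hits into one signal record per song_id."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["song_id"]].append(hit)

    result = {}
    for song_id, song_hits in grouped.items():
        exact = [h["score"] for h in song_hits if h["feature_type"] == "fingerprint"]
        semantic = [h["score"] for h in song_hits if h["feature_type"] == "embedding"]
        first_hit_ms = min(h["start_ms"] for h in song_hits)
        last_hit_ms = max(h["end_ms"] for h in song_hits)
        result[song_id] = {
            "exact_hit_count": len(exact),
            "semantic_hit_count": len(semantic),
            # None when a lane produced no hit for this song
            "exact_best_score": max(exact, default=None),
            "semantic_best_score": max(semantic, default=None),
            "matched_asset_count": len({h["asset_id"] for h in song_hits}),
            "matched_window_count": len({h["window_id"] for h in song_hits}),
            "offset_coverage_ms": last_hit_ms - first_hit_ms,
            "first_hit_ms": first_hit_ms,
            "last_hit_ms": last_hit_ms,
        }
    return result

# Illustrative hits only.
hits = [
    {"song_id": 1, "asset_id": 10, "window_id": 100, "feature_type": "fingerprint",
     "score": 0.9, "start_ms": 0, "end_ms": 5000},
    {"song_id": 1, "asset_id": 10, "window_id": 100, "feature_type": "embedding",
     "score": 0.8, "start_ms": 0, "end_ms": 5000},
    {"song_id": 1, "asset_id": 10, "window_id": 101, "feature_type": "embedding",
     "score": 0.7, "start_ms": 2500, "end_ms": 7500},
    {"song_id": 2, "asset_id": 20, "window_id": 200, "feature_type": "embedding",
     "score": 0.6, "start_ms": 10000, "end_ms": 15000},
]
agg = aggregate_by_song(hits)
```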


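14.3 A recommended fusion formula

Phase-1 can start with rule-based fusion; there is no rush to adopt learning-to-rank:

final_song_score =
    0.55 * exact_score_norm
  + 0.35 * semantic_score_norm
  + 0.10 * coverage_score_norm


Where:


exact_score_norm: song-level exact hit strength

semantic_score_norm: song-level semantic hit strength

coverage_score_norm: whether multiple windows continuously cover the same song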
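
A minimal sketch of this rule in application code follows. It assumes all three inputs are already normalized to [0, 1]; the document does not specify the normalization method, so that step is left out. `final_song_score` and `rank_songs` are hypothetical names.

```python
# Weights taken from the formula above.
W_EXACT, W_SEMANTIC, W_COVERAGE = 0.55, 0.35, 0.10

def final_song_score(exact_score_norm, semantic_score_norm, coverage_score_norm):
    """Weighted sum of the three normalized song-level signals."""
    for value in (exact_score_norm, semantic_score_norm, coverage_score_norm):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs are assumed to be normalized to [0, 1]")
    return (W_EXACT * exact_score_norm
            + W_SEMANTIC * semantic_score_norm
            + W_COVERAGE * coverage_score_norm)

def rank_songs(candidates):
    """candidates maps song_id -> (exact, semantic, coverage); best score first."""
    scored = [(song_id, final_song_score(*signals))
              for song_id, signals in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative values: a strong exact hit outranks a strong semantic-only hit.
ranked = rank_songs({
    "song_a": (0.9, 0.2, 0.5),
    "song_b": (0.0, 0.95, 0.8),
})
```

Keeping the weights as named constants makes later tuning a one-line change.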


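14.4 Why exact gets the higher weight

Because the current scenario is copyright protection / song-level ACR:


When the exact lane hits, precision is usually higher
The semantic lane is better suited to supplementing recall and resisting covers, tempo changes, and BGM interference
So the safer Phase-1 strategy is exact-led, semantic-reinforced


14.5 A fused song-level result structure (logical view)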

song_id
exact_hit_count
semantic_hit_count
exact_best_score
semantic_best_score
offset_coverage_ms
final_song_score
best_asset_id
best_window_id
best_model_name


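14.6 Aggregation SQL template (pseudo-SQL)

-- :matched_feature_ids and :matched_scores are parallel arrays bound by the application;
-- cast them to the actual column types if the driver sends untyped parameters.
with hits as (
    select h.feature_id, h.raw_score
    from unnest(:matched_feature_ids, :matched_scores) as h(feature_id, raw_score)
), matched as (
    select ff.song_id,
           ff.feature_type,
           ff.model_name,
           w.object_id as window_id,
           w.parent_object_id as asset_id,
           w.start_ms,
           w.end_ms,
           h.raw_score::double precision as raw_score
    from hits h
    join feature_fact ff
      on ff.feature_id = h.feature_id
    join audio_object w
      on w.object_id = ff.object_id
     and w.object_type = 'window'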
), song_agg as (
    select song_id,
           count(*) filter (where feature_type = 'fingerprint') as exact_hit_count,
           count(*) filter (where feature_type = 'embedding') as semantic_hit_count,
           max(raw_score) filter (where feature_type = 'fingerprint') as exact_best_score,
           max(raw_score) filter (where feature_type = 'embedding') as semantic_best_score,
           count(distinct asset_id) as matched_asset_count,
           count(distinct window_id) as matched_window_count,
           max(end_ms) - min(start_ms) as offset_coverage_ms
    from matched
    group by song_id
)
select sa.song_id,
       s.title,
       s.artist_name,
       sa.exact_hit_count,
       sa.semantic_hit_count,
       sa.exact_best_score,
       sa.semantic_best_score,
       sa.matched_asset_count,
       sa.matched_window_count,
       sa.offset_coverage_ms
from song_agg sa
join media_entity s on s.entity_id = sa.song_id
order by coalesce(sa.exact_best_score, 0) desc,
         coalesce(sa.semantic_best_score, 0) desc,
         sa.offset_coverage_ms desc
limit 20;


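14.7 The most pragmatic implementation order today


First get the topN feature candidates from the exact lane
Then get the topN feature candidates from the semantic lane
Resolve all of them to song_id granularity
Apply rule-based fusion in the application layer
Return topK song_id + evidence


The benefits:


The fusion logic does not have to be hard-wired into the database from the start
Weights are easy to tune later
It is easy to compare the gains of MERT / MuQ / fallback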


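15. How the data is actually bound together

These are the core binding relationships of the current 4-table schema:

song(media_entity)
  1 -> N asset(audio_object)
  1 asset -> N window(audio_object)
  1 window -> N feature_fact


In other words:


media_entity defines which song a thing finally belongs to


audio_object defines which audio files a song has and which windows each file was cut into


feature_fact defines which models encoded these windows and which features they produced


15.1 Binding diagram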

flowchart TD
    S[media_entity\nsong] --> A1[audio_object\nasset]
    S --> A2[audio_object\nasset]
    A1 --> W1[audio_object\nwindow]
    A1 --> W2[audio_object\nwindow]
    W1 --> F1[feature_fact\nchromaprint]
    W1 --> F2[feature_fact\nmert]
    W1 --> F3[feature_fact\nmuq]
    W2 --> F4[feature_fact\nchromaprint]
    W2 --> F5[feature_fact\nlocal_wavehash_embed]


15.2 Which fields bind the tables

| From | To | Binding field | Notes |
| --- | --- | --- | --- |
| audio_object(asset/window) | media_entity(song) | audio_object.song_id = media_entity.entity_id | Both asset and window belong to a song |
| audio_object(window) | audio_object(asset) | audio_object.parent_object_id = asset.object_id | A window's parent object is always an asset |
| feature_fact | audio_object(window) | feature_fact.object_id = window.object_id | A feature is bound to a specific slice |
| feature_fact | media_entity(song) | feature_fact.song_id = media_entity.entity_id | song_id is stored redundantly to ease retrieval aggregation |
| set_membership | song/asset/window/feature | member_type + member_id | Set membership is a polymorphic binding |

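15.3 Why feature_fact stores both object_id and song_id


Because the two answer different questions:


object_id answers: which window this feature was extracted from


song_id answers: which song this feature finally belongs to


The benefits:


Online recall can aggregate by song_id directly
The specific window -> asset -> offset can still be looked up

Aggregation does not need a deep join chain every time just to learn the song attribution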
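
Storing song_id redundantly implies that it must agree with the song reached through the object chain. The sketch below traces one feature back to its song using in-memory dictionaries in place of the tables and checks that agreement; the sample rows and `trace_feature` are hypothetical.

```python
# In-memory stand-ins for the tables, keyed by primary key (illustrative data).
audio_object = {
    10: {"object_type": "asset", "song_id": 1, "parent_object_id": None},
    100: {"object_type": "window", "song_id": 1, "parent_object_id": 10,
          "start_ms": 0, "end_ms": 5000},
}
feature_fact = {
    1000: {"object_id": 100, "song_id": 1, "feature_type": "fingerprint"},
}

def trace_feature(feature_id):
    """Walk feature_fact -> window -> asset and verify the redundant song_id."""
    feature = feature_fact[feature_id]
    window = audio_object[feature["object_id"]]
    if window["object_type"] != "window":
        raise ValueError("feature_fact.object_id must point at a window")
    asset = audio_object[window["parent_object_id"]]
    if asset["object_type"] != "asset":
        raise ValueError("a window's parent must be an asset")
    # The shortcut song_id and the chain-derived song_id must match.
    if not (feature["song_id"] == window["song_id"] == asset["song_id"]):
        raise ValueError("song_id is inconsistent along the chain")
    return {
        "song_id": feature["song_id"],
        "asset_id": window["parent_object_id"],
        "window_id": feature["object_id"],
        "start_ms": window["start_ms"],
        "end_ms": window["end_ms"],
    }

evidence = trace_feature(1000)
```

A check like this is cheap to run as a periodic audit, since the redundancy only helps if the copies never drift apart.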


15.4 How to read one feature record

One feature_fact row essentially says:


For a window (object_id = Y) under song_id = X, the encoding scheme model_name/model_version/feature_set_name produced one fingerprint or embedding feature.


So feature_fact is not a "model registry"; it is a fact table of model computation results.


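16. How the Phase-1 open-source model set should be stored

The current Phase-1 principle is:


Use open-source models directly as encoders, without fine-tuning; in the database, first pin down who computed it, how it was computed, and where the result is stored.


16.1 Currently suggested model set

| lane | model_name | model_version | feature_type | Purpose |
| --- | --- | --- | --- | --- |
| exact | chromaprint | 1.0 | fingerprint | High-precision exact hits |
| semantic baseline | mert-v1-95m | hf-main | embedding | song semantic baseline |
| semantic challenger | muq-base | hf-main | embedding | Challenger for cover / BGM / complex interference |
| semantic fallback | local_wavehash_embed | phase1_local | embedding | Fallback when the current host lacks the runtime |
| historical baseline | ecapa-tdnn | baseline_only | embedding | Historical comparison; not recommended as the Phase-1 lead |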


16.2 Which fields should pin down model identity

All of them live in feature_fact:


model_name
model_version
feature_set_name
feature_schema_ver

embedding_dim (for embeddings)


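16.3 How to name feature_set_name

It is recommended to encode the following kinds of information into the name:

<model_family>_<window_sec>s_hop<stride_sec>_<variant>_v<schema>


For example (as the examples suggest, segments that do not apply, such as hop or variant, are simply omitted):


chromaprint_5s_v1
mert_5s_hop2.5_v1
muq_5s_hop2.5_v1
wavehash_5s_hop2.5_v1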
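
The naming pattern can be generated rather than typed by hand. The sketch below is one possible builder; `build_feature_set_name` is hypothetical, and treating the hop and variant segments as optional is an assumption inferred from the examples above.

```python
def build_feature_set_name(model_family, window_sec, schema_ver,
                           hop_sec=None, variant=None):
    """Compose <model_family>_<window_sec>s[_hop<stride_sec>][_<variant>]_v<schema>."""
    # ":g" prints 5 as "5" and 2.5 as "2.5", avoiding trailing zeros.
    parts = [model_family, f"{window_sec:g}s"]
    if hop_sec is not None:
        parts.append(f"hop{hop_sec:g}")
    if variant:
        parts.append(variant)
    parts.append(f"v{schema_ver}")
    return "_".join(parts)

names = [
    build_feature_set_name("chromaprint", 5, 1),
    build_feature_set_name("mert", 5, 1, hop_sec=2.5),
    build_feature_set_name("muq", 5, 1, hop_sec=2.5),
    build_feature_set_name("wavehash", 5, 1, hop_sec=2.5),
]
```

Generating the name from the same parameters that drive windowing keeps feature_set_name in step with how the windows were actually cut.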


16.4 Recommended Phase-1 storage rules


exact lane


feature_type = 'fingerprint'

fingerprint_value is required
model_name = 'chromaprint'

embedding_uri / vector_table_name are empty


semantic lane


feature_type = 'embedding'

embedding_dim is required

At least one of embedding_uri or vector_table_name is required

fingerprint_value is empty
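
These rules can be checked before a row is written. The following is a minimal sketch of such a check in application code; `validate_feature_fact` and the sample rows (including the dimension and table name) are hypothetical, and the actual schema may enforce some of this with constraints instead.

```python
def validate_feature_fact(row):
    """Return a list of rule violations for one feature_fact row (empty if valid)."""
    errors = []
    feature_type = row.get("feature_type")
    has_payload_ref = bool(row.get("embedding_uri") or row.get("vector_table_name"))

    if feature_type == "fingerprint":  # exact lane
        if not row.get("fingerprint_value"):
            errors.append("fingerprint_value is required")
        if row.get("model_name") != "chromaprint":
            errors.append("model_name must be 'chromaprint'")
        if has_payload_ref:
            errors.append("embedding_uri / vector_table_name must be empty")
    elif feature_type == "embedding":  # semantic lane
        if not row.get("embedding_dim"):
            errors.append("embedding_dim is required")
        if not has_payload_ref:
            errors.append("embedding_uri or vector_table_name is required")
        if row.get("fingerprint_value"):
            errors.append("fingerprint_value must be empty")
    else:
        errors.append(f"unknown feature_type: {feature_type!r}")
    return errors

# Illustrative rows only.
good_fingerprint = {"feature_type": "fingerprint", "model_name": "chromaprint",
                    "fingerprint_value": "example-fingerprint"}
good_embedding = {"feature_type": "embedding", "model_name": "mert-v1-95m",
                  "embedding_dim": 768, "vector_table_name": "example_vectors"}
bad_embedding = {"feature_type": "embedding", "model_name": "muq-base",
                 "fingerprint_value": "should-not-be-here"}

problems = validate_feature_fact(bad_embedding)
```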


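16.5 Why there is no hard dependency on a separate model_registry for now

Because the current Phase-1 cares more about:


Computing features reliably first
Pinning down the binding between features and song/window first
Closing the retrieval and attribution loop first


So the most pragmatic approach today is:


Write model identity directly into feature_fact
If the number of models keeps growing, adding a registry later is not too late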


16.6 A recommended write order

For each asset:


Write media_entity(song)

Write audio_object(asset)

Cut windows and write audio_object(window)

Run chromaprint, write feature_fact(fingerprint)

Run mert-v1-95m, write feature_fact(embedding)

Run muq-base, write feature_fact(embedding)

If the runtime is unavailable, at least write the local_wavehash_embed fallback


The end result is:

one window
  -> 1 chromaprint fingerprint
  -> 1 mert embedding
  -> 1 muq embedding
  -> (optional) 1 fallback embedding


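16.7 The Phase-1 storage strategy in one sentence


audio_object is responsible for "which piece of audio", feature_fact is responsible for "which model computed what feature"; the two are bound by object_id, and song_id then attributes every result to a song reliably.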


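17. Bulk ingestion and index build strategy for 1M audio files / 300K songs

At the current scale, the most important principle is not to compute every model in one pass, but rather:


First land the song / asset / window main chain reliably, then backfill feature_fact in per-model batches, then build retrieval indexes step by step.


17.1 Recommended phases


Phase A: land the master data first

Write first:


media_entity(song)
audio_object(asset)
audio_object(window)


Goals:


Fix the song -> asset -> window main chain first
Give all later model computations a common object primary key


Phase B: cover the exact lane first

Then write:


feature_fact(feature_type='fingerprint', model_name='chromaprint')


Goals:


Establish a high-precision copyright-protection baseline first
Make song-level exact recall usable first


Phase C: backfill the semantic baseline in bulk

Then write:


feature_fact(feature_type='embedding', model_name='mert-v1-95m')


Goal:


Get the main semantic recall baseline to full coverage first


Phase D: backfill challenger / fallback

Backfill step by step as resources allow:


muq-base
local_wavehash_embed

ecapa-tdnn (comparison only)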


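17.2 Recommended batch granularity

Import in song batches or asset batches, rather than scanning the whole table in feature batches:


Master data import: 5k ~ 20k songs per batch

Window slicing: 50k ~ 200k windows per batch

Fingerprint extraction: 50k ~ 200k windows per batch

Embedding extraction: batch size chosen dynamically by GPU/CPU capacity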
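
The batch sizes above can be applied with a small chunking helper. This is a minimal sketch; `batched` is a local hypothetical function (not a library import), and the ID range stands in for window IDs that would normally be read from audio_object.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 450k illustrative window IDs processed at the upper bound of 200k per batch.
batch_sizes = [len(batch) for batch in batched(range(450_000), 200_000)]
```

Because the helper consumes a plain iterator, the same code works whether IDs come from a list or from a streaming database cursor.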


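17.3 Why the main chain and the feature chain are batched separately

Because their lifecycles differ:


The main chain is written once and is relatively stable
The feature chain keeps growing as models are replaced


So the recommendation is:


Land audio_object in full first, in a relatively stable form

Keep appending to feature_fact by model and by batch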


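17.4 Recommended index build order


Build the business main-chain indexes first

Give priority to these indexes:


idx_audio_object_song_type
idx_audio_object_parent
idx_feature_fact_object_type
idx_feature_fact_song_type
idx_set_membership_set_lookup


Then build the model inspection index

If scans for missing model features become frequent later, consider adding: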

create index if not exists idx_feature_fact_model_lookup
    on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);


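Finally build the heavyweight vector retrieval indexes

Vector indexes should not be tied to main-chain initialization:


Land the feature_fact facts reliably first
Then build nearest-neighbor indexes per specific vector table / dim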


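17.5 A recommended hot/warm/cold tiering strategy


Hot tier


set_membership.set_type = 'hot_set'
High-frequency songs, high-frequency assets, hot copyrighted catalog
Keep the full exact + semantic features with priority


Warm tier


reference_set
The main reference catalog
Keep full exact coverage; backfill semantic in batches


Cold tier


Long-tail songs
Secure the main chain and exact first
Semantic can be computed on demand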

17.6 Recommended bulk inspection SQL


Find assets without any window

select a.object_id as asset_id, a.song_id, a.storage_uri
from audio_object a
where a.object_type = 'asset'
  and not exists (
      select 1
      from audio_object w
      where w.parent_object_id = a.object_id
        and w.object_type = 'window'
  )
order by a.song_id, a.object_id;


Find windows without a fingerprint

select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
  and not exists (
      select 1
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = 'fingerprint'
        and ff.model_name = 'chromaprint'
  )
order by w.song_id, w.parent_object_id, w.start_ms;


Find windows without a MERT embedding

select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
  and not exists (
      select 1
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = 'embedding'
        and ff.model_name = 'mert-v1-95m'
        and ff.model_version = 'hf-main'
        and ff.feature_set_name = 'mert_5s_hop2.5_v1'
  )
order by w.song_id, w.parent_object_id, w.start_ms;
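
The inspection queries can be tried out locally before running them against the real database. The sketch below runs the missing-fingerprint query on an in-memory SQLite database as a stand-in for PostgreSQL; the table definitions are a minimal assumed subset of columns, not the actual schema, and the rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table audio_object (
    object_id integer primary key,
    object_type text not null,
    song_id integer not null,
    parent_object_id integer,
    start_ms integer,
    end_ms integer
);
create table feature_fact (
    feature_id integer primary key,
    object_id integer not null,
    song_id integer not null,
    feature_type text not null,
    model_name text not null
);
insert into audio_object values
    (10, 'asset', 1, null, null, null),
    (100, 'window', 1, 10, 0, 5000),
    (101, 'window', 1, 10, 2500, 7500);
insert into feature_fact values
    (1000, 100, 1, 'fingerprint', 'chromaprint');
""")

# Same shape as the "windows without a fingerprint" query above.
MISSING_FINGERPRINT_SQL = """
select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
  and not exists (
      select 1
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = 'fingerprint'
        and ff.model_name = 'chromaprint'
  )
order by w.song_id, w.parent_object_id, w.start_ms
"""

# Window 101 has no chromaprint row, so it is the only one reported.
missing = conn.execute(MISSING_FINGERPRINT_SQL).fetchall()
```

The rows returned by such a query are a natural input for the next extraction batch.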


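17.7 The safest Phase-1 execution order


Land song/asset/window in full first

Cover everything with chromaprint first

Backfill mert-v1-95m in bulk as the first semantic baseline

Run muq-base as the challenger
Backfill by hot/reference/cold tier
Tune the dual-lane fusion weights last


17.8 The strategy in one sentence


At large scale, do not aim for "every model complete" first; instead make sure the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier.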