Commit 75f156b8 75f156b8345d456daa816a129b02a52073ccd078 by cnb.bofCdSsphPA

Why the docs need explicit bindings between audio objects and feature facts

Constraint: Phase-1 implementers need one concrete explanation for how songs, assets, windows, and open-model features are linked and stored
Rejected: Rely on schema columns alone | does not show the intended per-model storage pattern for the current encoder-only phase
Confidence: high
Scope-risk: narrow
Directive: Keep future model-onboarding docs grounded in feature_fact object_id/song_id bindings unless the schema default changes
Tested: markdown link check on /workspace/docs after adding binding diagrams and SQL storage examples
Not-tested: No live database rerun; this is a documentation clarification over an already-verified schema
1 parent 43644ac8
......@@ -6,6 +6,7 @@
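- Rewrote `docs/postgres_db_schema_samples.md` and the entry-point docs, adding flowcharts for the current 4-table main chain, typical SQL samples, query trace-back paths, and the write order, and aligning all docs on `media_entity -> audio_object -> feature_fact -> set_membership`.
- Further strengthened the online retrieval notes: added a `feature_fact -> window -> asset -> song_id` trace-back flowchart and a song-level aggregation SQL template to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so developers can implement post-recall attribution directly against the current schema.
- Further extended the retrieval fusion design: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` a song-level aggregation flowchart for the exact lane + semantic lane dual channel, the rule-based fusion conventions, and a SQL skeleton, making explicit that Phase-1 uses an `exact-led, semantic-reinforced` ranking strategy.
- Further documented data binding and model storage: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, spelled out the binding-field relationships of `media_entity -> audio_object(asset/window) -> feature_fact` and gave the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.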
## 2026-06-04
......
......@@ -495,3 +495,141 @@ order by coalesce(max(raw_score) filter (where m.feature_type = 'fingerprint'),
offset_coverage_ms desc
limit 20;
```
---
## 13. Binding relationships and open-source model storage examples
### 13.1 Minimal binding relationships
```text
media_entity(song)
  -> audio_object(asset)
    -> audio_object(window)
      -> feature_fact(chromaprint)
      -> feature_fact(mert-v1-95m)
      -> feature_fact(muq-base)
```
### 13.2 Concrete example
#### Step 1: song
```sql
insert into media_entity (
entity_type, biz_key, title, artist_name
) values (
'song', 'song_000123', 'Demo Song', 'Demo Artist'
)
returning entity_id;
```
#### Step 2: asset
```sql
insert into audio_object (
object_type, song_id, storage_uri, checksum, codec, sample_rate, channels, duration_ms
) values (
'asset', :song_id, 's3://bucket/demo_song/master.wav', 'sha256:asset-demo', 'wav', 44100, 2, 210000
)
returning object_id;
```
#### Step 3: window
```sql
insert into audio_object (
object_type, song_id, parent_object_id, start_ms, end_ms, duration_ms
) values (
'window', :song_id, :asset_id, 0, 5000, 5000
)
returning object_id;
```
#### Step 4: chromaprint fingerprint
```sql
insert into feature_fact (
feature_type, object_id, song_id,
model_name, model_version, feature_set_name, feature_schema_ver,
fingerprint_value
) values (
'fingerprint', :window_id, :song_id,
'chromaprint', '1.0', 'chromaprint_5s_v1', 'v1',
'AQAAE0mUaEkSZSo...'
);
```
#### Step 5: MERT embedding
```sql
insert into feature_fact (
feature_type, object_id, song_id,
model_name, model_version, feature_set_name, feature_schema_ver,
embedding_dim, embedding_uri, vector_table_name
) values (
'embedding', :window_id, :song_id,
'mert-v1-95m', 'hf-main', 'mert_5s_hop2.5_v1', 'v1',
768, 's3://bucket/emb/demo_song_win0001_mert.npy', 'audio_embedding_vector_768'
);
```
#### Step 6: MuQ embedding
```sql
insert into feature_fact (
feature_type, object_id, song_id,
model_name, model_version, feature_set_name, feature_schema_ver,
embedding_dim, embedding_uri, vector_table_name
) values (
'embedding', :window_id, :song_id,
'muq-base', 'hf-main', 'muq_5s_hop2.5_v1', 'v1',
768, 's3://bucket/emb/demo_song_win0001_muq.npy', 'audio_embedding_vector_768'
);
```
#### Step 7: fallback embedding
```sql
insert into feature_fact (
feature_type, object_id, song_id,
model_name, model_version, feature_set_name, feature_schema_ver,
embedding_dim, embedding_uri, vector_table_name
) values (
'embedding', :window_id, :song_id,
'local_wavehash_embed', 'phase1_local', 'wavehash_5s_hop2.5_v1', 'v1',
8, 'file:///tmp/demo_song_win0001_wavehash.npy', 'audio_embedding_vector_8_placeholder'
);
```
### 13.3 Query which open-source models have already encoded a given window
```sql
select object_id,
song_id,
feature_type,
model_name,
model_version,
feature_set_name,
embedding_dim,
fingerprint_value,
embedding_uri,
vector_table_name
from feature_fact
where object_id = :window_id
order by feature_type, model_name;
```
### 13.4 Query which model features a song currently has
```sql
select ff.song_id,
ff.model_name,
ff.model_version,
ff.feature_type,
count(*) as feature_rows
from feature_fact ff
where ff.song_id = :song_id
group by ff.song_id, ff.model_name, ff.model_version, ff.feature_type
order by ff.feature_type, ff.model_name;
```
......
......@@ -534,3 +534,159 @@ limit 20;
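- Does not require hard-coding the fusion logic in the database from the start
- Makes it easier to tune weights later
- Makes it easier to compare the gains from `MERT` / `MuQ` / fallback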
---
## 15. How the data is actually bound together
These are the core binding relationships in the current 4-table schema:
```text
song(media_entity)
1 -> N asset(audio_object)
1 asset -> N window(audio_object)
1 window -> N feature_fact
```
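In other words:
- `media_entity` defines **which song this ultimately belongs to**
- `audio_object` defines **which audio files exist under the song and which windows each file was cut into**
- `feature_fact` defines **which models encoded these windows and which features were produced**
### 15.1 Binding diagram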
```mermaid
flowchart TD
S[media_entity\nsong] --> A1[audio_object\nasset]
S --> A2[audio_object\nasset]
A1 --> W1[audio_object\nwindow]
A1 --> W2[audio_object\nwindow]
W1 --> F1[feature_fact\nchromaprint]
W1 --> F2[feature_fact\nmert]
W1 --> F3[feature_fact\nmuq]
W2 --> F4[feature_fact\nchromaprint]
W2 --> F5[feature_fact\nlocal_wavehash_embed]
```
### 15.2 Which fields bind each table
| From | To | Binding field | Notes |
|---|---|---|---|
| `audio_object(asset/window)` | `media_entity(song)` | `audio_object.song_id = media_entity.entity_id` | Both assets and windows belong to a song |
| `audio_object(window)` | `audio_object(asset)` | `audio_object.parent_object_id = asset.object_id` | A window's parent object is always an asset |
| `feature_fact` | `audio_object(window)` | `feature_fact.object_id = window.object_id` | The feature is bound to a specific slice |
| `feature_fact` | `media_entity(song)` | `feature_fact.song_id = media_entity.entity_id` | `song_id` is stored redundantly to simplify retrieval aggregation |
| `set_membership` | `song/asset/window/feature` | `member_type + member_id` | Set membership is a polymorphic binding |
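
The binding fields in this table can be checked mechanically. The following is a minimal Python sketch, assuming in-memory records stand in for the `audio_object` and `feature_fact` tables; the `AudioObject` and `FeatureFact` classes and the `check_bindings` helper are hypothetical names used only for illustration, and the sketch assumes every feature is bound to a window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioObject:
    object_id: int
    object_type: str                         # 'asset' or 'window'
    song_id: int                             # -> media_entity.entity_id
    parent_object_id: Optional[int] = None   # window -> asset


@dataclass(frozen=True)
class FeatureFact:
    object_id: int   # -> audio_object(window).object_id
    song_id: int     # redundant copy of the window's song_id


def check_bindings(objects, facts):
    """Return a list of binding violations (empty when everything is consistent)."""
    by_id = {o.object_id: o for o in objects}
    problems = []
    for o in objects:
        if o.object_type != "window":
            continue
        parent = by_id.get(o.parent_object_id)
        if parent is None or parent.object_type != "asset":
            problems.append(f"window {o.object_id}: parent is not an asset")
        elif parent.song_id != o.song_id:
            problems.append(f"window {o.object_id}: song_id differs from parent asset")
    for f in facts:
        window = by_id.get(f.object_id)
        if window is None or window.object_type != "window":
            problems.append(f"feature on {f.object_id}: object is not a window")
        elif window.song_id != f.song_id:
            problems.append(f"feature on {f.object_id}: song_id differs from window")
    return problems
```

A check like this could run in a batch job after ingestion; whether the same rules should also be enforced by database constraints is a separate design decision that this sketch does not address.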
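### 15.3 Why `feature_fact` stores both `object_id` and `song_id`
Because the two answer different questions:
- `object_id` answers: **which window was this feature extracted from**
- `song_id` answers: **which song does this feature ultimately belong to**
The benefits:
- Online recall can aggregate directly by `song_id`
- Each hit can still be traced back to the specific `window -> asset -> offset`
- Aggregation does not need a deep join chain every time just to learn which song a feature belongs to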
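
To illustrate the benefit of the redundant `song_id`, here is a minimal Python sketch of song-level aggregation over window-level hits. It assumes each hit already carries the two binding columns of `feature_fact` plus a retrieval score; the `aggregate_by_song` helper, the tuple layout, and the max-score rule are illustrative assumptions, not the project's actual ranking logic.

```python
from collections import defaultdict


def aggregate_by_song(hits):
    """Group window-level hits by song_id without any join.

    `hits` is an iterable of (song_id, object_id, score) tuples, where
    song_id and object_id come straight from feature_fact and score is a
    hypothetical retrieval score.
    """
    best = defaultdict(lambda: {"best_score": float("-inf"), "window_ids": []})
    for song_id, object_id, score in hits:
        entry = best[song_id]
        entry["best_score"] = max(entry["best_score"], score)
        # Keep the window ids so a later step can resolve window -> asset -> offset.
        entry["window_ids"].append(object_id)
    # Highest-scoring song first.
    return sorted(best.items(), key=lambda kv: kv[1]["best_score"], reverse=True)
```

The song attribution needs no lookup into `audio_object`; only the optional trace-back to asset and offset would touch that table.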
### 15.4 How to read a single feature record
A `feature_fact` row essentially says:
> For some `window(object_id = Y)` under `song_id = X`, the encoding scheme `model_name/model_version/feature_set_name` produced a `fingerprint` or `embedding` feature.
So `feature_fact` is not a "model registry" but a "**fact table of model computation results**".
---
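## 16. How the Phase-1 open-source model set should be stored
The current Phase-1 principle is:
> **Use open-source models directly as encoders, without fine-tuning; first pin down in the database "who computed it, how it was computed, and where the result lives".**
### 16.1 Currently recommended model set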
| lane | model_name | model_version | feature_type | Purpose |
|---|---|---|---|---|
| exact | `chromaprint` | `1.0` | `fingerprint` | High-precision exact hits |
| semantic baseline | `mert-v1-95m` | `hf-main` | `embedding` | Song semantic baseline |
| semantic challenger | `muq-base` | `hf-main` | `embedding` | Challenger for covers / BGM / complex interference |
| semantic fallback | `local_wavehash_embed` | `phase1_local` | `embedding` | Fallback when the current host lacks the runtime |
| historical baseline | `ecapa-tdnn` | `baseline_only` | `embedding` | Historical comparison; not recommended as the Phase-1 lead |
### 16.2 Which fields pin down model identity
All of them live in `feature_fact`:
- `model_name`
- `model_version`
- `feature_set_name`
- `feature_schema_ver`
- `embedding_dim` (for embeddings)
### 16.3 How to name `feature_set_name`
Encode the following kinds of information into the name:
```text
<model_family>_<window_sec>s_hop<stride_sec>_<variant>_v<schema>
```
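For example (as the names show, segments that do not apply, such as `hop` for `chromaprint` or `<variant>`, are simply left out):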
- `chromaprint_5s_v1`
- `mert_5s_hop2.5_v1`
- `muq_5s_hop2.5_v1`
- `wavehash_5s_hop2.5_v1`
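
The naming pattern can be generated by a small helper so that names stay consistent across encoders. The following Python sketch is an assumption-laden illustration: `build_feature_set_name` is a hypothetical function, and the rule that missing `hop` and `<variant>` segments are dropped is inferred from the example names rather than stated by the schema.

```python
def _fmt_seconds(value):
    """Render 5 or 5.0 as '5' and 2.5 as '2.5', matching the example names."""
    return f"{value:g}"


def build_feature_set_name(model_family, window_sec, schema_ver,
                           hop_sec=None, variant=None):
    """Assemble <model_family>_<window_sec>s_hop<stride_sec>_<variant>_v<schema>.

    hop_sec and variant are optional; when they are None/empty the
    corresponding segment is omitted (assumption based on the examples).
    """
    parts = [model_family, f"{_fmt_seconds(window_sec)}s"]
    if hop_sec is not None:
        parts.append(f"hop{_fmt_seconds(hop_sec)}")
    if variant:
        parts.append(variant)
    parts.append(f"v{schema_ver}")
    return "_".join(parts)
```

For instance, `build_feature_set_name("mert", 5, 1, hop_sec=2.5)` yields `mert_5s_hop2.5_v1`, and `build_feature_set_name("chromaprint", 5, 1)` yields `chromaprint_5s_v1`.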
### 16.4 Recommended Phase-1 storage rules
#### exact lane
- `feature_type = 'fingerprint'`
- `fingerprint_value` is required
- `model_name = 'chromaprint'`
- `embedding_uri / vector_table_name` are empty
#### semantic lane
- `feature_type = 'embedding'`
- `embedding_dim` is required
- At least one of `embedding_uri` and `vector_table_name` is required
- `fingerprint_value` is empty
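
The lane rules can be expressed as a pre-insert check. This is a minimal Python sketch, assuming a `feature_fact` row is represented as a plain dict keyed by column name; `validate_feature_row` is a hypothetical helper, and the document does not say whether the real schema enforces these rules with constraints.

```python
def validate_feature_row(row):
    """Return a list of rule violations for one feature_fact row (dict)."""
    errors = []
    ftype = row.get("feature_type")
    has_uri = bool(row.get("embedding_uri"))
    has_table = bool(row.get("vector_table_name"))

    if ftype == "fingerprint":          # exact lane
        if not row.get("fingerprint_value"):
            errors.append("fingerprint_value is required")
        if row.get("model_name") != "chromaprint":
            errors.append("model_name must be 'chromaprint'")
        if has_uri or has_table:
            errors.append("embedding_uri / vector_table_name must be empty")
    elif ftype == "embedding":          # semantic lane
        if not row.get("embedding_dim"):
            errors.append("embedding_dim is required")
        if not (has_uri or has_table):
            errors.append("embedding_uri or vector_table_name is required")
        if row.get("fingerprint_value"):
            errors.append("fingerprint_value must be empty")
    else:
        errors.append(f"unknown feature_type: {ftype!r}")
    return errors
```

An empty list means the row satisfies the lane rules; a writer could refuse to insert any row for which the list is non-empty.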
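### 16.5 Why there is no hard dependency on a separate model_registry for now
Because Phase-1 currently cares more about:
- Computing features reliably first
- Pinning down the binding between features and song/window first
- Closing the loop on the retrieval and attribution chain first
So the most pragmatic approach for now is:
- **Write model identity directly into `feature_fact`**
- Add a registry later if the number of models keeps growing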
### 16.6 A recommended write order
For each asset:
1. Write `media_entity(song)`
2. Write `audio_object(asset)`
3. Cut windows and write `audio_object(window)`
4. Run `chromaprint`, then write `feature_fact(fingerprint)`
5. Run `mert-v1-95m`, then write `feature_fact(embedding)`
6. Run `muq-base`, then write `feature_fact(embedding)`
7. If the runtime is unavailable, write at least the `local_wavehash_embed` fallback
This ultimately yields:
```text
the same window
  -> 1 chromaprint fingerprint row
  -> 1 mert embedding row
  -> 1 muq embedding row
  -> (optional) 1 fallback embedding row
```
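
The per-window part of this write order can be sketched as a small planning function. The Python below is illustrative only: `plan_features_for_window` is a hypothetical helper, the model tuples reuse the names and versions from the model table, and the fallback condition (write `local_wavehash_embed` only when no semantic runtime is available) is an assumption, since the text only says the fallback is written when the runtime is unavailable.

```python
# (model_name, model_version, feature_type, feature_set_name)
EXACT = ("chromaprint", "1.0", "fingerprint", "chromaprint_5s_v1")
SEMANTIC = [
    ("mert-v1-95m", "hf-main", "embedding", "mert_5s_hop2.5_v1"),
    ("muq-base", "hf-main", "embedding", "muq_5s_hop2.5_v1"),
]
FALLBACK = ("local_wavehash_embed", "phase1_local", "embedding",
            "wavehash_5s_hop2.5_v1")


def plan_features_for_window(available_runtimes):
    """Return the ordered feature_fact rows to compute for one window.

    `available_runtimes` is a set of model_name values whose runtime is
    usable on the current host (hypothetical input).
    """
    plan = [EXACT]                                   # step 4: exact lane first
    semantic = [m for m in SEMANTIC if m[0] in available_runtimes]
    plan.extend(semantic)                            # steps 5 and 6
    if not semantic:                                 # step 7 (assumed condition)
        plan.append(FALLBACK)
    return plan
```

With both semantic runtimes present the plan contains three rows; on a host with neither, it contains the fingerprint plus the fallback embedding.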
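### 16.7 The Phase-1 storage strategy in one sentence
> `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed what feature"; the two are bound by `object_id`, and `song_id` then reliably attributes every result to a song.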
......