Why the docs need explicit bindings between audio objects and feature facts

Constraint: Phase-1 implementers need one concrete explanation for how songs, assets, windows, and open-model features are linked and stored
Rejected: Rely on schema columns alone | does not show the intended per-model storage pattern for the current encoder-only phase
Confidence: high
Scope-risk: narrow
Directive: Keep future model-onboarding docs grounded in feature_fact object_id/song_id bindings unless the schema default changes
Tested: markdown link check on /workspace/docs after adding binding diagrams and SQL storage examples
Not-tested: No live database rerun; this is a documentation clarification over an already-verified schema

Commit 75f156b8 ... 75f156b8345d456daa816a129b02a52073ccd078 authored 2026-06-04 15:12:13 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 295 additions and 0 deletions
docs/CHANGELOG.md
docs/postgres_db_schema_samples.md
docs/postgresql-data-model.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -6,6 +6,7 @@
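 - Rewrote `docs/postgres_db_schema_samples.md` and the entry-point docs, adding a flowchart of the current 4-table main chain, typical SQL samples, query trace-back paths, and the write order, and aligning all docs on `media_entity -> audio_object -> feature_fact -> set_membership`.
 - Further strengthened the online-retrieval notes: added a `feature_fact -> window -> asset -> song_id` trace-back flowchart and a song-level aggregation SQL template to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so engineers can implement post-recall attribution directly against the current schema.
 - Further extended the retrieval-fusion design: added a song-level aggregation flowchart for the two lanes (exact lane + semantic lane), the rule-based fusion conventions, and a SQL skeleton to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, making explicit that Phase-1 ranks with an `exact-led, semantic-boosted` strategy.
+- Further documented data binding and model storage: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` now spell out the binding fields along `media_entity -> audio_object(asset/window) -> feature_fact` and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.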

 ## 2026-06-04

--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -495,3 +495,141 @@ order by coalesce(max(raw_score) filter (where m.feature_type = 'fingerprint'), 
         offset_coverage_ms desc
 limit 20;
 ```
+
+---
+
+## 13. Bindings and storage samples for open-source models
+
+### 13.1 Minimal binding chain
+
+```text
+media_entity(song)
+  -> audio_object(asset)
+    -> audio_object(window)
+      -> feature_fact(chromaprint)
+      -> feature_fact(mert-v1-95m)
+      -> feature_fact(muq-base)
+```
+
+### 13.2 Worked example
+
+#### Step 1: song
+
+```sql
+insert into media_entity (
+    entity_type, biz_key, title, artist_name
+) values (
+    'song', 'song_000123', 'Demo Song', 'Demo Artist'
+)
+returning entity_id;
+```
+
+#### Step 2: asset
+
+```sql
+insert into audio_object (
+    object_type, song_id, storage_uri, checksum, codec, sample_rate, channels, duration_ms
+) values (
+    'asset', :song_id, 's3://bucket/demo_song/master.wav', 'sha256:asset-demo', 'wav', 44100, 2, 210000
+)
+returning object_id;
+```
+
+#### Step 3: window
+
+```sql
+insert into audio_object (
+    object_type, song_id, parent_object_id, start_ms, end_ms, duration_ms
+) values (
+    'window', :song_id, :asset_id, 0, 5000, 5000
+)
+returning object_id;
+```
+
+#### Step 4: chromaprint fingerprint
+
+```sql
+insert into feature_fact (
+    feature_type, object_id, song_id,
+    model_name, model_version, feature_set_name, feature_schema_ver,
+    fingerprint_value
+) values (
+    'fingerprint', :window_id, :song_id,
+    'chromaprint', '1.0', 'chromaprint_5s_v1', 'v1',
+    'AQAAE0mUaEkSZSo...'
+);
+```
+
+#### Step 5: MERT embedding
+
+```sql
+insert into feature_fact (
+    feature_type, object_id, song_id,
+    model_name, model_version, feature_set_name, feature_schema_ver,
+    embedding_dim, embedding_uri, vector_table_name
+) values (
+    'embedding', :window_id, :song_id,
+    'mert-v1-95m', 'hf-main', 'mert_5s_hop2.5_v1', 'v1',
+    768, 's3://bucket/emb/demo_song_win0001_mert.npy', 'audio_embedding_vector_768'
+);
+```
+
+#### Step 6: MuQ embedding
+
+```sql
+insert into feature_fact (
+    feature_type, object_id, song_id,
+    model_name, model_version, feature_set_name, feature_schema_ver,
+    embedding_dim, embedding_uri, vector_table_name
+) values (
+    'embedding', :window_id, :song_id,
+    'muq-base', 'hf-main', 'muq_5s_hop2.5_v1', 'v1',
+    768, 's3://bucket/emb/demo_song_win0001_muq.npy', 'audio_embedding_vector_768'
+);
+```
+
+#### Step 7: fallback embedding
+
+```sql
+insert into feature_fact (
+    feature_type, object_id, song_id,
+    model_name, model_version, feature_set_name, feature_schema_ver,
+    embedding_dim, embedding_uri, vector_table_name
+) values (
+    'embedding', :window_id, :song_id,
+    'local_wavehash_embed', 'phase1_local', 'wavehash_5s_hop2.5_v1', 'v1',
+    8, 'file:///tmp/demo_song_win0001_wavehash.npy', 'audio_embedding_vector_8_placeholder'
+);
+```
+
+### 13.3 Query which open-source models have already encoded a given window
+
+```sql
+select object_id,
+       song_id,
+       feature_type,
+       model_name,
+       model_version,
+       feature_set_name,
+       embedding_dim,
+       fingerprint_value,
+       embedding_uri,
+       vector_table_name
+from feature_fact
+where object_id = :window_id
+order by feature_type, model_name;
+```
+
+### 13.4 Query which model features a given song currently has
+
+```sql
+select ff.song_id,
+       ff.model_name,
+       ff.model_version,
+       ff.feature_type,
+       count(*) as feature_rows
+from feature_fact ff
+where ff.song_id = :song_id
+group by ff.song_id, ff.model_name, ff.model_version, ff.feature_type
+order by ff.feature_type, ff.model_name;
+```
--- a/docs/postgresql-data-model.md
+++ b/docs/postgresql-data-model.md
@@ -534,3 +534,159 @@ limit 20;
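 - The fusion logic does not have to be hard-coded in the database from the start
 - Weights are easy to tune later
 - The gains from `MERT` / `MuQ` / fallback are easy to compare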
+
+---
+
+## 15. How the data is actually bound together
+
+This is the core binding chain of the current 4-table schema:
+
+```text
+song(media_entity)
+  1 -> N asset(audio_object)
+  1 asset -> N window(audio_object)
+  1 window -> N feature_fact
+```
+
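+In other words:
+- `media_entity` defines **which song this ultimately belongs to**
+- `audio_object` defines **which audio files exist under the song, and which windows each file was cut into**
+- `feature_fact` defines **which models encoded those windows, and which features they produced**
+
+### 15.1 Binding diagram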
+
+```mermaid
+flowchart TD
+    S[media_entity\nsong] --> A1[audio_object\nasset]
+    S --> A2[audio_object\nasset]
+    A1 --> W1[audio_object\nwindow]
+    A1 --> W2[audio_object\nwindow]
+    W1 --> F1[feature_fact\nchromaprint]
+    W1 --> F2[feature_fact\nmert]
+    W1 --> F3[feature_fact\nmuq]
+    W2 --> F4[feature_fact\nchromaprint]
+    W2 --> F5[feature_fact\nlocal_wavehash_embed]
+```
+
+### 15.2 Which fields bind each table
+
+| From | To | Binding field | Notes |
+|---|---|---|---|
+| `audio_object(asset/window)` | `media_entity(song)` | `audio_object.song_id = media_entity.entity_id` | Every asset/window belongs to a song |
+| `audio_object(window)` | `audio_object(asset)` | `audio_object.parent_object_id = asset.object_id` | A window's parent is always an asset |
+| `feature_fact` | `audio_object(window)` | `feature_fact.object_id = window.object_id` | A feature is bound to a specific window slice |
+| `feature_fact` | `media_entity(song)` | `feature_fact.song_id = media_entity.entity_id` | `song_id` is stored redundantly to simplify retrieval-time aggregation |
+| `set_membership` | `song/asset/window/feature` | `member_type + member_id` | Set membership is a polymorphic binding |
+
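+### 15.3 Why `feature_fact` stores both `object_id` and `song_id`
+
+Because the two answer different questions:
+
+- `object_id` answers: **which window was this feature extracted from**
+- `song_id` answers: **which song does this feature ultimately belong to**
+
+The benefits:
+- Online recall can aggregate directly by `song_id`
+- The specific `window -> asset -> offset` can still be traced back
+- Aggregation does not need a deep join chain each time just to resolve song ownership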
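+
+A minimal sketch of the two access paths described above, assuming only the columns that appear in the samples in this document (`feature_fact.object_id`, `feature_fact.song_id`, `audio_object.parent_object_id`, `start_ms`, `end_ms`, `storage_uri`). `:song_id` is a bind parameter and the aliases are illustrative:
+
+```sql
+-- Song-level aggregation: no join needed, because song_id is stored on feature_fact.
+select ff.song_id,
+       count(*) as feature_rows
+from feature_fact ff
+where ff.song_id = :song_id
+group by ff.song_id;
+
+-- Trace-back: feature -> window -> asset, to recover the offset and the source file.
+select ff.model_name,
+       w.object_id as window_id,
+       w.start_ms,
+       w.end_ms,
+       a.object_id as asset_id,
+       a.storage_uri
+from feature_fact ff
+join audio_object w on w.object_id = ff.object_id and w.object_type = 'window'
+join audio_object a on a.object_id = w.parent_object_id and a.object_type = 'asset'
+where ff.song_id = :song_id
+order by a.object_id, w.start_ms, ff.model_name;
+```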
+
+### 15.4 How to read a single feature record
+
+A `feature_fact` row essentially says:
+
+> For some `window(object_id = Y)` under `song_id = X`, the encoding scheme identified by `model_name/model_version/feature_set_name` produced a `fingerprint` or `embedding` feature.
+
+So `feature_fact` is not a "model registry" but a "**fact table of model computation results**".
+
+---
+
+## 16. How the Phase-1 open-source model set should be stored
+
+The current Phase-1 principle is:
+
+> **Use open-source models directly as encoders, without fine-tuning; first pin down in the database "who computed it, how it was computed, and where the result lives".**
+
+### 16.1 Currently recommended model set
+
+| lane | model_name | model_version | feature_type | Purpose |
+|---|---|---|---|---|
+| exact | `chromaprint` | `1.0` | `fingerprint` | High-precision exact hits |
+| semantic baseline | `mert-v1-95m` | `hf-main` | `embedding` | Song semantic baseline |
+| semantic challenger | `muq-base` | `hf-main` | `embedding` | Challenger for covers / BGM / complex interference |
+| semantic fallback | `local_wavehash_embed` | `phase1_local` | `embedding` | Fallback when the current host lacks the runtime |
+| historical baseline | `ecapa-tdnn` | `baseline_only` | `embedding` | Historical comparison only; not recommended as the Phase-1 lead |
+
+### 16.2 Recommended fields for pinning model identity
+
+All of them live in `feature_fact`:
+
+- `model_name`
+- `model_version`
+- `feature_set_name`
+- `feature_schema_ver`
+- `embedding_dim` (for embeddings)
+
+### 16.3 How to name `feature_set_name`
+
+Encode the following kinds of information in the name:
+
+```text
+<model_family>_<window_sec>s_hop<stride_sec>_<variant>_v<schema>
+```
+
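+For example (note that these examples leave out the `<variant>` segment, and `chromaprint` also leaves out `hop`):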
+- `chromaprint_5s_v1`
+- `mert_5s_hop2.5_v1`
+- `muq_5s_hop2.5_v1`
+- `wavehash_5s_hop2.5_v1`
+
+### 16.4 Recommended Phase-1 storage rules
+
+#### exact lane
+- `feature_type = 'fingerprint'`
+- `fingerprint_value` is required
+- `model_name = 'chromaprint'`
+- `embedding_uri / vector_table_name` are null
+
+#### semantic lane
+- `feature_type = 'embedding'`
+- `embedding_dim` is required
+- at least one of `embedding_uri` or `vector_table_name` is required
+- `fingerprint_value` is null
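+
+One way these rules could be enforced is a table-level CHECK constraint. This is a sketch only: the constraint name `feature_fact_lane_shape_chk` is hypothetical, it assumes the nullable columns shown in the insert samples, and it is not stated to be part of the current schema. As written it also rejects any `feature_type` other than the two lanes.
+
+```sql
+alter table feature_fact
+    add constraint feature_fact_lane_shape_chk check (
+        (
+            -- exact lane: fingerprint rows carry a value and no embedding pointers
+            feature_type = 'fingerprint'
+            and model_name = 'chromaprint'
+            and fingerprint_value is not null
+            and embedding_uri is null
+            and vector_table_name is null
+        )
+        or (
+            -- semantic lane: embedding rows carry a dimension and at least one pointer
+            feature_type = 'embedding'
+            and embedding_dim is not null
+            and (embedding_uri is not null or vector_table_name is not null)
+            and fingerprint_value is null
+        )
+    );
+```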
+
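+### 16.5 Why there is no hard dependency on a separate model_registry yet
+
+Because Phase-1 currently cares more about:
+- getting features computed reliably first
+- pinning down the binding between features and song/window first
+- closing the retrieval and attribution loop first
+
+So the most pragmatic approach for now is:
+- **write model identity directly into `feature_fact`**
+- add a registry later if the number of models keeps growing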
+
+### 16.6 A recommended write order
+
+For each asset:
+
+1. Write `media_entity(song)`
+2. Write `audio_object(asset)`
+3. Cut windows and write `audio_object(window)`
+4. Run `chromaprint`, write `feature_fact(fingerprint)`
+5. Run `mert-v1-95m`, write `feature_fact(embedding)`
+6. Run `muq-base`, write `feature_fact(embedding)`
+7. If the runtime is unavailable, write at least the `local_wavehash_embed` fallback
+
+This ultimately yields:
+
+```text
+the same window
+  -> 1 chromaprint fingerprint row
+  -> 1 mert embedding row
+  -> 1 muq embedding row
+  -> (optional) 1 fallback embedding row
+```
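+
+To check whether a song's windows reached the shape above, a coverage query along these lines may help. It is a sketch that assumes the model names from section 16.1 and the columns used in the samples; `:song_id` is a bind parameter and the output column names are illustrative.
+
+```sql
+-- One row per window, with a per-model count of feature_fact rows (0 means missing).
+select w.object_id as window_id,
+       count(*) filter (where ff.model_name = 'chromaprint')          as chromaprint_rows,
+       count(*) filter (where ff.model_name = 'mert-v1-95m')          as mert_rows,
+       count(*) filter (where ff.model_name = 'muq-base')             as muq_rows,
+       count(*) filter (where ff.model_name = 'local_wavehash_embed') as fallback_rows
+from audio_object w
+left join feature_fact ff on ff.object_id = w.object_id
+where w.object_type = 'window'
+  and w.song_id = :song_id
+group by w.object_id, w.start_ms
+order by w.start_ms;
+```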
+
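+### 16.7 Phase-1 storage strategy in one sentence
+
+> `audio_object` answers "which piece of audio", `feature_fact` answers "which model produced which feature"; the two are bound by `object_id`, and `song_id` then pins every result to its song.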