Why the docs need a first-pass fusion strategy, not just storage paths

Constraint: Phase-1 retrieval must explain how exact and semantic candidates become a ranked song list on the current 4-table schema
Rejected: Defer fusion guidance until model integration finishes | leaves implementation teams without a default ranking contract
Confidence: high
Scope-risk: narrow
Directive: Keep Phase-1 ranking docs exact-led and evidence-oriented until measured recall data justifies a different default
Tested: markdown link check on /workspace/docs after adding fusion diagrams and SQL skeletons
Not-tested: No live retrieval benchmark rerun; this change documents the intended ranking path only

Commit 43644ac8 ... 43644ac8006f0ff089a93a251e323b64dbe6dcf7 authored 2026-06-04 15:10:52 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 205 additions and 0 deletions
docs/CHANGELOG.md
docs/postgres_db_schema_samples.md
docs/postgresql-data-model.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -5,6 +5,7 @@
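 - Rewrote `docs/postgresql-data-model.md` to spell out the table mapping for `stored slice data + model + feature`: a `window` is stored in `audio_object`, model identity is stored in `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` values are likewise stored in `feature_fact`.
 - Rewrote `docs/postgres_db_schema_samples.md` and the entry-point docs, adding flowcharts for the current 4-table main chain, typical SQL samples, query traceback paths, and the write order, and aligning all docs on `media_entity -> audio_object -> feature_fact -> set_membership`.
 - Further strengthened the online retrieval notes: added a `feature_fact -> window -> asset -> song_id` traceback flowchart and a song-level aggregation SQL template to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so engineers can implement post-recall attribution directly on the current schema.
+- Extended the retrieval fusion design: added a song-level aggregation flowchart for the exact lane + semantic lane, the rule-based fusion formula, and SQL skeletons to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, and fixed the Phase-1 ranking strategy as `exact-led, semantic-assisted`.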

 ## 2026-06-04

--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -437,3 +437,61 @@ group by ff.song_id, s.title, s.artist_name
 order by matched_windows desc
 limit 20;
 ```
+
+---
+
+## 12. exact + semantic dual-lane fusion sample
+
+### 12.1 Fusion flowchart
+
+```mermaid
+flowchart TD
+    A[exact candidates] --> C[song aggregation]
+    B[semantic candidates] --> C
+    C --> D[rerank]
+    D --> E[topK song_ids]
+```
+
+### 12.2 Recommended Phase-1 fusion formula
+
+```text
+final_song_score =
+    0.55 * exact_score_norm
+  + 0.35 * semantic_score_norm
+  + 0.10 * coverage_score_norm
+```
+
+### 12.3 Fusion aggregation SQL skeleton
+
+```sql
+with matched as (
+    select ff.song_id,
+           ff.feature_type,
+           w.object_id as window_id,
+           w.parent_object_id as asset_id,
+           w.start_ms,
+           w.end_ms,
+           :score_map[ff.feature_id]::double precision as raw_score  -- placeholder: retrieval score for this feature_id, supplied by the caller
+    from feature_fact ff
+    join audio_object w
+      on w.object_id = ff.object_id
+     and w.object_type = 'window'
+    where ff.feature_id = any(:matched_feature_ids)
+)
+select m.song_id,
+       s.title,
+       s.artist_name,
+       count(*) filter (where m.feature_type = 'fingerprint') as exact_hit_count,
+       count(*) filter (where m.feature_type = 'embedding') as semantic_hit_count,
+       max(raw_score) filter (where m.feature_type = 'fingerprint') as exact_best_score,
+       max(raw_score) filter (where m.feature_type = 'embedding') as semantic_best_score,
+       max(end_ms) - min(start_ms) as offset_coverage_ms
+from matched m
+join media_entity s
+  on s.entity_id = m.song_id
+group by m.song_id, s.title, s.artist_name
+order by coalesce(max(raw_score) filter (where m.feature_type = 'fingerprint'), 0) desc,
+         coalesce(max(raw_score) filter (where m.feature_type = 'embedding'), 0) desc,
+         offset_coverage_ms desc
+limit 20;
+```
--- a/docs/postgresql-data-model.md
+++ b/docs/postgresql-data-model.md
@@ -388,3 +388,149 @@ limit 20;
 - 片段/BGM 定位
 - evidence 回查
 - topK song 级召回
+
+---
+
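+## 14. How the exact + semantic lanes fuse into song ranking
+
+The current recommendation is to treat online recall as two parallel lanes:
+
+- **exact lane**: fingerprints such as `chromaprint`
+- **semantic lane**: `MERT / MuQ / fallback embedding`
+
+Neither lane should return `feature_id` directly; both must first trace back through: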
+
+```text
+feature_fact -> window -> asset -> song
+```
+
+and only then aggregate at the `song_id` level.
+
+### 14.1 Fusion flowchart
+
+```mermaid
+flowchart TD
+    Q[query audio] --> WQ[query windows]
+    WQ --> E1[exact lane\nfingerprint retrieval]
+    WQ --> E2[semantic lane\nembedding retrieval]
+    E1 --> C1[exact candidates\nfeature_fact rows]
+    E2 --> C2[semantic candidates\nfeature_fact rows]
+    C1 --> N1[normalize exact scores]
+    C2 --> N2[normalize semantic scores]
+    N1 --> G[song_id aggregation]
+    N2 --> G
+    G --> R[rerank top songs]
+    R --> O[return topK song_ids + evidence]
+```
+
+### 14.2 What to look at during song-level aggregation
+
+Keep at least the following aggregate signals:
+
+- `exact_hit_count`
+- `semantic_hit_count`
+- `exact_best_score`
+- `semantic_best_score`
+- `matched_asset_count`
+- `matched_window_count`
+- `offset_coverage_ms`
+- `first_hit_ms`
+- `last_hit_ms`
+
+### 14.3 A recommended fusion formula
+
+Phase-1 can start with **rule-based fusion**; there is no rush to adopt learning-to-rank:
+
+```text
+final_song_score =
+    0.55 * exact_score_norm
+  + 0.35 * semantic_score_norm
+  + 0.10 * coverage_score_norm
+```
+
+Where:
+- `exact_score_norm`: song-level exact hit strength
+- `semantic_score_norm`: song-level semantic hit strength
+- `coverage_score_norm`: whether multiple windows contiguously cover the same song
+
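+### 14.4 Why exact gets the higher weight
+
+Because the current scenario is copyright protection / song-level ACR:
+- a hit in the exact lane usually has higher precision
+- the semantic lane is better suited to filling recall gaps and withstanding covers, tempo changes, and BGM interference
+- so the safer Phase-1 strategy is **exact-led, semantic-assisted**
+
+### 14.5 A fused song-level result structure (logical view)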
+
+```text
+song_id
+exact_hit_count
+semantic_hit_count
+exact_best_score
+semantic_best_score
+offset_coverage_ms
+final_song_score
+best_asset_id
+best_window_id
+best_model_name
+```
+
+### 14.6 Pseudo-SQL aggregation template
+
+```sql
+with matched as (
+    select ff.song_id,
+           ff.feature_type,
+           ff.model_name,
+           w.object_id as window_id,
+           w.parent_object_id as asset_id,
+           w.start_ms,
+           w.end_ms,
+           :score_map[ff.feature_id]::double precision as raw_score  -- placeholder: retrieval score for this feature_id, supplied by the caller
+    from feature_fact ff
+    join audio_object w
+      on w.object_id = ff.object_id
+     and w.object_type = 'window'
+    where ff.feature_id = any(:matched_feature_ids)
+), song_agg as (
+    select song_id,
+           count(*) filter (where feature_type = 'fingerprint') as exact_hit_count,
+           count(*) filter (where feature_type = 'embedding') as semantic_hit_count,
+           max(raw_score) filter (where feature_type = 'fingerprint') as exact_best_score,
+           max(raw_score) filter (where feature_type = 'embedding') as semantic_best_score,
+           count(distinct asset_id) as matched_asset_count,
+           count(distinct window_id) as matched_window_count,
+           max(end_ms) - min(start_ms) as offset_coverage_ms
+    from matched
+    group by song_id
+)
+select sa.song_id,
+       s.title,
+       s.artist_name,
+       sa.exact_hit_count,
+       sa.semantic_hit_count,
+       sa.exact_best_score,
+       sa.semantic_best_score,
+       sa.matched_asset_count,
+       sa.matched_window_count,
+       sa.offset_coverage_ms
+from song_agg sa
+join media_entity s on s.entity_id = sa.song_id
+order by coalesce(sa.exact_best_score, 0) desc,
+         coalesce(sa.semantic_best_score, 0) desc,
+         sa.offset_coverage_ms desc
+limit 20;
+```
+
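+### 14.7 The most pragmatic implementation order for now
+
+1. Fetch the exact lane's topN feature candidates
+2. Fetch the semantic lane's topN feature candidates
+3. Trace all of them back to `song_id` granularity
+4. Apply rule-based fusion in the application layer
+5. Output `topK song_id + evidence`
+
+The benefits of this approach:
+- The fusion logic does not have to be hard-wired into the database from the start
+- Weights are easy to tune later
+- The gains from `MERT` / `MuQ` / fallback are easy to compare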
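
Reviewer sketch (not part of the commit): the docs place rule fusion in the application layer but do not say how the `*_norm` terms are computed. The Python below is one possible reading, under assumptions that are not in the source: scores are non-negative with higher meaning better, each signal is normalized by dividing by the maximum over the current candidate set, the best per-lane score stands in for "hit strength", and `offset_coverage_ms` stands in for coverage. `SongAgg`, `scale_by_max`, and `fuse` are hypothetical names; only the 0.55 / 0.35 / 0.10 weights come from the documented formula.

```python
from dataclasses import dataclass
from typing import Optional

# Weights from the documented Phase-1 formula (exact-led, semantic-assisted).
W_EXACT, W_SEMANTIC, W_COVERAGE = 0.55, 0.35, 0.10


@dataclass
class SongAgg:
    """Hypothetical container for a subset of the song-level aggregate columns."""
    song_id: str
    exact_best_score: Optional[float]     # None when the exact lane had no hit
    semantic_best_score: Optional[float]  # None when the semantic lane had no hit
    offset_coverage_ms: int


def scale_by_max(values: list[float]) -> list[float]:
    """Assumed normalization: divide by the largest value among the candidates."""
    top = max(values)
    if top <= 0:
        return [0.0 for _ in values]
    return [v / top for v in values]


def fuse(rows: list[SongAgg], top_k: int = 20) -> list[tuple[str, float]]:
    """Return (song_id, final_song_score) pairs sorted best first."""
    if not rows:
        return []
    # A missing lane contributes 0, mirroring coalesce(..., 0) in the SQL template.
    exact = scale_by_max([r.exact_best_score or 0.0 for r in rows])
    semantic = scale_by_max([r.semantic_best_score or 0.0 for r in rows])
    coverage = scale_by_max([float(r.offset_coverage_ms) for r in rows])
    scored = [
        (r.song_id, W_EXACT * e + W_SEMANTIC * s + W_COVERAGE * c)
        for r, e, s, c in zip(rows, exact, semantic, coverage)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    candidates = [
        SongAgg("a", 0.9, 0.2, 30_000),
        SongAgg("b", None, 0.8, 60_000),
        SongAgg("c", 0.45, 0.4, 15_000),
    ]
    for song_id, score in fuse(candidates):
        print(song_id, round(score, 4))
```

With these made-up candidates, song `a` ranks first because its exact score dominates, and `c` edges out `b` even though `b` has the strongest semantic score and coverage. That is the exact-led behaviour the weights are meant to produce. Keeping the weights as module-level constants also matches the stated goal of tuning them later without touching the database.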