Commit 5869c876 5869c87636057d951afc3e0ac8cdddbc0c96fd05 by cnb.bofCdSsphPA

Why the retrieval docs need the online song backtrace made explicit

Constraint: New engineers need a direct feature_fact-to-song_id query path on the current 4-table schema without reconstructing it from scattered examples
Rejected: Leave only insert-side diagrams | does not explain how online recall returns song ownership evidence
Confidence: high
Scope-risk: narrow
Directive: Keep query-path docs aligned with the feature_fact -> window -> asset -> song chain when adding new retrieval lanes
Tested: markdown link check on /workspace/docs after adding retrieval flow diagrams and SQL templates
Not-tested: No live database rerun; this change only documents the already-verified schema path
1 parent 38b37e08
@@ -4,6 +4,7 @@
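- Trim `docs/` down to the current song-centric mainline, keeping only six core documents (`README / start-here / session-handoff / postgresql-data-model / postgres_db_schema_samples / CHANGELOG`) and deleting the old v2 / planner-worker / registry extension docs, so that new team members do not wander into designs that have been demoted to a secondary track.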
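- Rewrite `docs/postgresql-data-model.md` to spell out where `stored slice data + model + feature` land in the tables: a `window` goes into `audio_object`, model identity goes into `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` values also go into `feature_fact`.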
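- Rewrite `docs/postgres_db_schema_samples.md` and the entry-point docs, adding flowcharts for the current 4-table main chain, typical SQL samples, the query backtrace path, and the write order, and aligning all docs on `media_entity -> audio_object -> feature_fact -> set_membership`.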
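- Further strengthen the online retrieval notes: add a `feature_fact -> window -> asset -> song_id` backtrace flowchart and song-level aggregation SQL templates to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so developers can implement post-recall attribution directly against the current schema.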
## 2026-06-04
@@ -381,3 +381,59 @@ song(Song Alpha)
- [start-here.md](./start-here.md)
- [session-handoff.md](./session-handoff.md)
- [postgresql-data-model.md](./postgresql-data-model.md)
---
## 11. Online retrieval backtrace samples
### 11.1 Tracing a matched feature back to its song
```mermaid
flowchart LR
A[feature_fact] --> B[window]
B --> C[asset]
C --> D[song]
```
### 11.2 Typical online query SQL
```sql
select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       ff.feature_set_name,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       a.object_id as asset_id,
       a.storage_uri,
       s.entity_id as song_id,
       s.title,
       s.artist_name
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join audio_object a
  on a.object_id = w.parent_object_id
 and a.object_type = 'asset'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = :feature_id;
```
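The query above can be tried without a live PostgreSQL instance. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in and a hypothetical cut-down schema that contains only the columns the query touches; the sample rows, the storage URI, and the `backtrace` helper are illustrative and not part of the real schema.

```python
import sqlite3

# Hypothetical minimal schema: only the columns used by the backtrace query.
SCHEMA = """
create table media_entity (entity_id integer primary key, entity_type text,
                           title text, artist_name text);
create table audio_object (object_id integer primary key, object_type text,
                           parent_object_id integer, storage_uri text,
                           start_ms integer, end_ms integer);
create table feature_fact (feature_id integer primary key, object_id integer,
                           song_id integer, feature_type text,
                           model_name text, feature_set_name text);
"""

BACKTRACE_SQL = """
select ff.feature_id,
       w.object_id as window_id, w.start_ms, w.end_ms,
       a.object_id as asset_id, a.storage_uri,
       s.entity_id as song_id, s.title, s.artist_name
from feature_fact ff
join audio_object w on w.object_id = ff.object_id and w.object_type = 'window'
join audio_object a on a.object_id = w.parent_object_id and a.object_type = 'asset'
join media_entity s on s.entity_id = ff.song_id
where ff.feature_id = :feature_id
"""

def backtrace(conn, feature_id):
    """Return the window/asset/song evidence for one feature, or None."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(BACKTRACE_SQL, {"feature_id": feature_id}).fetchone()
    return dict(row) if row is not None else None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Illustrative rows: one song, one asset, one window, one feature.
conn.execute("insert into media_entity values (1, 'song', 'Song Alpha', 'Artist A')")
conn.execute("insert into audio_object values (10, 'asset', null, 's3://example/alpha.wav', null, null)")
conn.execute("insert into audio_object values (11, 'window', 10, null, 5000, 10000)")
conn.execute("insert into feature_fact values (100, 11, 1, 'fingerprint', 'demo-model', 'demo-set')")

evidence = backtrace(conn, 100)
print(evidence)
```

Returning `None` for an unknown `feature_id` keeps a missing backtrace distinguishable from an empty evidence record.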
### 11.3 Typical song-level aggregation SQL
```sql
select ff.song_id,
       s.title,
       s.artist_name,
       count(*) as matched_windows
from feature_fact ff
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = any(:matched_feature_ids)
group by ff.song_id, s.title, s.artist_name
order by matched_windows desc
limit 20;
```
@@ -288,3 +288,103 @@ Phase-1 does not require for now:
- [start-here.md](./start-here.md)
- [session-handoff.md](./session-handoff.md)
- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)
---
## 13. How online retrieval gets from a feature back to `song_id`
This is the backtrace chain that developers most need to keep in mind right now:
```text
feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)
```
### 13.1 Online retrieval flowchart
```mermaid
flowchart LR
Q[query audio] --> QW[query windows]
QW --> QE[query fingerprint / embedding]
QE --> FF[feature_fact]
FF --> W[audio_object\nobject_type=window]
W --> A[audio_object\nobject_type=asset]
A --> S[media_entity\nentity_type=song]
S --> R[return song_id + title + artist + evidence]
```
### 13.2 Aggregation flowchart
```mermaid
flowchart TD
A[query window features] --> B[multiple feature_fact rows matched]
B --> C[look up window]
C --> D[look up asset]
D --> E[aggregate to song_id]
E --> F[rank by hit_count / score / offset coverage]
F --> G[return topK songs]
```
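The aggregation step in the flowchart can also be sketched in application code. This is a minimal sketch, assuming each hit has already been traced back to its window and song; the hit field names (`song_id`, `window_id`, `start_ms`, `end_ms`, `score`), the "higher score is better" convention, and the use of the first-to-last hit span as a crude proxy for offset coverage are all assumptions of this example, not definitions from the schema.

```python
from collections import defaultdict

def aggregate_hits(hits, top_k=20):
    """Group per-window hits by song_id and rank the songs.

    Ranking order (an assumption of this sketch): more distinct windows
    first, then higher summed score, then wider first-to-last hit span.
    """
    per_song = defaultdict(
        lambda: {"windows": set(), "score": 0.0, "first_ms": None, "last_ms": None}
    )
    for h in hits:
        agg = per_song[h["song_id"]]
        agg["windows"].add(h["window_id"])  # a set, so repeated windows count once
        agg["score"] += h["score"]
        agg["first_ms"] = (
            h["start_ms"] if agg["first_ms"] is None else min(agg["first_ms"], h["start_ms"])
        )
        agg["last_ms"] = (
            h["end_ms"] if agg["last_ms"] is None else max(agg["last_ms"], h["end_ms"])
        )

    ranked = [
        {
            "song_id": song_id,
            "hit_count": len(agg["windows"]),
            "score": agg["score"],
            "coverage_ms": agg["last_ms"] - agg["first_ms"],
        }
        for song_id, agg in per_song.items()
    ]
    ranked.sort(key=lambda r: (-r["hit_count"], -r["score"], -r["coverage_ms"]))
    return ranked[:top_k]

# Illustrative hits: two windows of song 1, one window of song 2.
hits = [
    {"song_id": 1, "window_id": 11, "start_ms": 0, "end_ms": 5000, "score": 0.9},
    {"song_id": 1, "window_id": 12, "start_ms": 5000, "end_ms": 10000, "score": 0.8},
    {"song_id": 2, "window_id": 21, "start_ms": 0, "end_ms": 5000, "score": 0.95},
]
top = aggregate_hits(hits)
print(top)
```

Here song 1 ranks first because it has two matched windows, even though the single hit on song 2 has the highest individual score; a different tie-break order may suit other retrieval lanes better.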
### 13.3 Minimal query SQL template
```sql
select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       ff.model_version,
       ff.feature_set_name,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       a.object_id as asset_id,
       a.storage_uri,
       s.entity_id as song_id,
       s.title,
       s.artist_name
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join audio_object a
  on a.object_id = w.parent_object_id
 and a.object_type = 'asset'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = :feature_id;
```
### 13.4 A song-level aggregation SQL template
```sql
select ff.song_id,
       s.title,
       s.artist_name,
       count(*) as matched_windows,
       min(w.start_ms) as first_hit_ms,
       max(w.end_ms) as last_hit_ms
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_type = :feature_type
  and ff.model_name = :model_name
  and ff.feature_set_name = :feature_set_name
  and ff.feature_id = any(:matched_feature_ids)
group by ff.song_id, s.title, s.artist_name
order by matched_windows desc, first_hit_ms asc
limit 20;
```
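For a quick local check of the aggregation shape, the template can be approximated against SQLite. This is a minimal sketch under stated assumptions: the tables are a hypothetical cut-down version of the real schema, the model and feature-set filters are omitted for brevity, SQLite has no array parameters so `= any(:matched_feature_ids)` is rewritten as an `in (...)` list, and the sketch counts distinct windows (a choice of this example) so that several features on one window are not counted twice. `top_songs` and the sample rows are illustrative names only.

```python
import sqlite3

def top_songs(conn, matched_feature_ids, limit=20):
    """Aggregate matched features to song level, most matched windows first."""
    if not matched_feature_ids:
        return []
    # One "?" per id, standing in for PostgreSQL's `= any(:matched_feature_ids)`.
    placeholders = ",".join("?" for _ in matched_feature_ids)
    sql = f"""
        select ff.song_id,
               s.title,
               count(distinct w.object_id) as matched_windows,
               min(w.start_ms) as first_hit_ms,
               max(w.end_ms) as last_hit_ms
        from feature_fact ff
        join audio_object w
          on w.object_id = ff.object_id
         and w.object_type = 'window'
        join media_entity s
          on s.entity_id = ff.song_id
        where ff.feature_id in ({placeholders})
        group by ff.song_id, s.title
        order by matched_windows desc, first_hit_ms asc
        limit ?
    """
    return conn.execute(sql, [*matched_feature_ids, limit]).fetchall()

conn = sqlite3.connect(":memory:")
# Hypothetical minimal tables and rows, for illustration only.
conn.executescript("""
create table media_entity (entity_id integer primary key, title text);
create table audio_object (object_id integer primary key, object_type text,
                           start_ms integer, end_ms integer);
create table feature_fact (feature_id integer primary key, object_id integer,
                           song_id integer);
insert into media_entity values (1, 'Song Alpha'), (2, 'Song Beta');
insert into audio_object values (11, 'window', 0, 5000),
                                (12, 'window', 5000, 10000),
                                (21, 'window', 2000, 7000);
insert into feature_fact values (100, 11, 1), (101, 12, 1), (200, 21, 2);
""")

rows = top_songs(conn, [100, 101, 200])
print(rows)
```

Only the placeholder string is interpolated into the SQL text; the ids themselves are still bound as parameters.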
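### 13.5 Why this chain matters
It cleanly separates three concerns:
- `feature_fact` answers: **which feature was matched**
- `audio_object(window/asset)` answers: **which segment was matched, and which file it came from**
- `media_entity(song)` answers: **which `song_id` the match is finally attributed to**
So even without introducing the more complex `recording/work/version` model, Phase-1 is already enough to support:
- copyright protection attribution
- clip/BGM localization
- evidence lookup
- topK song-level recall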