Commit b624273c b624273c7015949694852e7fb46f17d25ecab86a by cnb.bofCdSsphPA

Why the schema samples need a complete multi-asset multi-model example

Constraint: Phase-1 implementers need one concrete end-to-end sample that shows how a single song expands into multiple assets, windows, and model facts
Rejected: Keep only isolated insert snippets | does not help with batch backfill or completeness checks
Confidence: high
Scope-risk: narrow
Directive: When extending storage examples, include operational queries for gap detection and model completeness, not just inserts
Tested: markdown link check on /workspace/docs after adding the complete sample and audit SQL
Not-tested: No live database rerun; this is a documentation-only expansion over the verified schema
1 parent 75f156b8
......@@ -7,6 +7,7 @@
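- Further strengthened the online retrieval docs: added a `feature_fact -> window -> asset -> song_id` trace-back flowchart and a song-level aggregation SQL template to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so developers can implement post-recall attribution directly against the current schema.
- Further extended the retrieval fusion design: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` a song-level aggregation flowchart for the two lanes (exact lane + semantic lane), the rule-based fusion criteria, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact leads, semantic reinforces` strategy.
- Further documented data binding and model persistence: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, clarified the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and gave the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.
- Added to `docs/postgres_db_schema_samples.md` a complete `one song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature completeness check SQL, to support Phase-1 batch backfill and routine audits.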
## 2026-06-04
......
......@@ -633,3 +633,113 @@ where ff.song_id = :song_id
group by ff.song_id, ff.model_name, ff.model_version, ff.feature_type
order by ff.feature_type, ff.model_name;
```
---
## 14. A complete multi-asset / multi-window / multi-model sample
Assume:
- A single `song_id = 1001`
- It has 2 audio files: `master.wav` and `ugc_clip.mp3`
- Each asset is cut into 2 windows
- Every window runs `chromaprint + mert-v1-95m + muq-base`
### 14.1 Logical structure
```text
song(1001)
  -> asset(2001, master.wav)
      -> window(3001, 0-5000)
          -> chromaprint
          -> mert-v1-95m
          -> muq-base
      -> window(3002, 2500-7500)
          -> chromaprint
          -> mert-v1-95m
          -> muq-base
  -> asset(2002, ugc_clip.mp3)
      -> window(3003, 10000-15000)
          -> chromaprint
          -> mert-v1-95m
          -> muq-base
      -> window(3004, 12500-17500)
          -> chromaprint
          -> mert-v1-95m
          -> muq-base
```
### 14.2 How many rows this produces
| Table | Rows | Notes |
|---|---:|---|
| `media_entity` | 1 | One song |
| `audio_object` | 6 | 2 assets + 4 windows |
| `feature_fact` | 12 | 4 windows × 3 models |
| `set_membership` | As needed | A song, asset, or window can be attached to a reference_set |
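The counts in the table follow directly from the fan-out of the sample. A minimal Python sketch of that arithmetic; the function name `expected_row_counts` and its parameters are hypothetical, and it assumes every asset has the same number of windows and every window runs every model:

```python
def expected_row_counts(assets: int, windows_per_asset: int, models: int) -> dict[str, int]:
    """Return the expected row count per table for one song (uniform fan-out assumed)."""
    windows = assets * windows_per_asset
    return {
        "media_entity": 1,                 # one song
        "audio_object": assets + windows,  # asset rows plus window rows
        "feature_fact": windows * models,  # one fact per (window, model) pair
    }


counts = expected_row_counts(assets=2, windows_per_asset=2, models=3)
print(counts)
```

Comparing these expected counts against `count(*)` per table is one cheap way to spot an incomplete backfill before running the more detailed queries.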
### 14.3 Query the full tree for one song
```sql
select s.entity_id as song_id,
       s.title,
       a.object_id as asset_id,
       a.storage_uri,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       ff.feature_type,
       ff.model_name,
       ff.model_version,
       ff.feature_set_name
from media_entity s
join audio_object a
  on a.song_id = s.entity_id
 and a.object_type = 'asset'
join audio_object w
  on w.parent_object_id = a.object_id
 and w.object_type = 'window'
-- left join keeps windows that have no feature_fact rows yet
left join feature_fact ff
  on ff.object_id = w.object_id
where s.entity_id = :song_id
order by a.object_id, w.start_ms, ff.feature_type, ff.model_name;
```
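Application code usually needs the flat rows from the query above regrouped into the nested shape shown in 14.1. A minimal Python sketch, assuming the rows have been reduced to `(asset_id, window_id, model_name)` tuples; `build_tree` and the sample rows are hypothetical and not part of the schema docs:

```python
from collections import defaultdict


def build_tree(rows):
    """Group flat (asset_id, window_id, model_name) rows into {asset: {window: [models]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for asset_id, window_id, model_name in rows:
        models = tree[asset_id][window_id]  # touching the key keeps feature-less windows
        if model_name is not None:          # the LEFT JOIN yields NULL when no fact exists
            models.append(model_name)
    return {asset_id: dict(windows) for asset_id, windows in tree.items()}


# Illustrative rows only; a real result set would come from the query above.
rows = [
    (2001, 3001, "chromaprint"),
    (2001, 3001, "mert-v1-95m"),
    (2001, 3002, None),  # window with no feature_fact row yet
    (2002, 3003, "muq-base"),
]
tree = build_tree(rows)
print(tree)
```

Keeping windows with an empty model list visible in the tree is deliberate: those are exactly the windows a backfill job needs to pick up.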
### 14.4 Find windows missing a given model
This SQL is well suited to backfill task scans, for example checking which windows have not yet run `muq-base`:
```sql
select w.object_id as window_id,
       w.song_id,
       w.parent_object_id as asset_id,
       w.start_ms,
       w.end_ms
from audio_object w
where w.object_type = 'window'
  and not exists (
      select 1
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = 'embedding'
        and ff.model_name = 'muq-base'
        and ff.model_version = 'hf-main'
        and ff.feature_set_name = 'muq_5s_hop2.5_v1'
  )
order by w.song_id, w.parent_object_id, w.start_ms;
```
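To see the gap scan behave on the sample data, the sketch below runs the same `not exists` pattern against an in-memory SQLite database. This is only an illustration under assumptions: SQLite stands in for PostgreSQL, the table definitions contain just the columns the query references, and the column types are guesses rather than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns the query touches; types are guesses.
conn.executescript("""
create table audio_object (
    object_id integer primary key,
    object_type text not null,
    song_id integer not null,
    parent_object_id integer,
    start_ms integer,
    end_ms integer
);
create table feature_fact (
    object_id integer not null,
    feature_type text not null,
    model_name text not null,
    model_version text not null,
    feature_set_name text not null
);
""")
conn.executemany(
    "insert into audio_object values (?, ?, ?, ?, ?, ?)",
    [
        (2001, "asset", 1001, None, None, None),
        (2002, "asset", 1001, None, None, None),
        (3001, "window", 1001, 2001, 0, 5000),
        (3002, "window", 1001, 2001, 2500, 7500),
        (3003, "window", 1001, 2002, 10000, 15000),
        (3004, "window", 1001, 2002, 12500, 17500),
    ],
)
# Simulate a partial backfill: only windows 3001 and 3003 have muq-base so far.
conn.executemany(
    "insert into feature_fact values "
    "(?, 'embedding', 'muq-base', 'hf-main', 'muq_5s_hop2.5_v1')",
    [(3001,), (3003,)],
)
missing = [
    row[0]
    for row in conn.execute("""
        select w.object_id
        from audio_object w
        where w.object_type = 'window'
          and not exists (
              select 1
              from feature_fact ff
              where ff.object_id = w.object_id
                and ff.feature_type = 'embedding'
                and ff.model_name = 'muq-base'
                and ff.model_version = 'hf-main'
                and ff.feature_set_name = 'muq_5s_hop2.5_v1'
          )
        order by w.song_id, w.parent_object_id, w.start_ms
    """)
]
print(missing)
```

The two asset rows never appear in the result because the scan filters on `object_type = 'window'` before checking for facts.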
### 14.5 List which models each window under an asset already has
```sql
select w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       -- NULL here means the window has no feature_fact rows at all
       string_agg(ff.model_name || ':' || ff.feature_type, ', ' order by ff.model_name) as ready_features
from audio_object w
left join feature_fact ff
  on ff.object_id = w.object_id
where w.object_type = 'window'
  and w.parent_object_id = :asset_id
group by w.object_id, w.start_ms, w.end_ms
order by w.start_ms;
```
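The query above reports what each window already has; an audit job typically wants the inverse, the models still missing. A minimal Python sketch, assuming the ready model names per window have already been collected into a dict; `EXPECTED`, `missing_models`, and the sample input are hypothetical, with the expected set taken from the three models used in this section's sample:

```python
# Models every window is expected to have in this section's sample (an assumption).
EXPECTED = {"chromaprint", "mert-v1-95m", "muq-base"}


def missing_models(ready_by_window, expected=EXPECTED):
    """Map window_id -> sorted list of expected models that have no fact yet."""
    gaps = {}
    for window_id, ready in ready_by_window.items():
        missing = expected - set(ready)
        if missing:  # fully covered windows are left out of the report
            gaps[window_id] = sorted(missing)
    return gaps


# Illustrative input only: window 3002 has been fingerprinted but not embedded.
ready_by_window = {
    3001: ["chromaprint", "mert-v1-95m", "muq-base"],
    3002: ["chromaprint"],
}
gaps = missing_models(ready_by_window)
print(gaps)
```

An empty result means the asset is feature-complete with respect to the expected set.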
......