Commit df241d20 df241d20c056e82a3a67efffe9e2f013f0feabd9 by cnb.bofCdSsphPA

Why the docs need a concrete scale-up ingestion and indexing strategy

Constraint: A 100w-audio Phase-1 system needs a default order for ingest, exact coverage, semantic backfill, and index build timing on the current schema
Rejected: Leave scale-up as implied by the small examples | does not guide batch execution or gap audits at production volume
Confidence: high
Scope-risk: narrow
Directive: Keep future scale docs anchored on main-chain-first ingestion and model-by-model feature backfill unless benchmark evidence proves otherwise
Tested: markdown link check on /workspace/docs after adding scale-up ingestion, indexing, and audit SQL guidance
Not-tested: No live database rerun; this is a documentation-only expansion over the verified schema path
1 parent b624273c
......@@ -8,6 +8,7 @@
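- Continued the retrieval-fusion design: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` a song-level aggregation flowchart for the exact lane + semantic lane dual-lane setup, the rule-based fusion conventions, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact-led, semantic-reinforced` strategy.
- Continued the data-binding and model-persistence notes: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, spelled out the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and gave the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.
- In `docs/postgres_db_schema_samples.md`, added a complete `same song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and audits.
- Continued the scale-up implementation guidance: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` the batch ingestion order, index build order, hot/cold tiering strategy, and main-chain/model-gap audit SQL for `100w audio / 30w songs` (1,000,000 audio files / 300,000 songs), making explicit that Phase-1 secures the main chain first, rolls out exact coverage first, and then backfills semantic in batches.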
## 2026-06-04
......
......@@ -743,3 +743,81 @@ where w.object_type = 'window'
group by w.object_id, w.start_ms, w.end_ms
order by w.start_ms;
```
---
## 15. Batch ingestion and index build samples
### 15.1 Recommended batch order
```text
batch-1: media_entity(song)
batch-2: audio_object(asset)
batch-3: audio_object(window)
batch-4: feature_fact(chromaprint)
batch-5: feature_fact(mert-v1-95m)
batch-6: feature_fact(muq-base)
```
### 15.2 Recommended additional index
```sql
create index if not exists idx_feature_fact_model_lookup
on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
```
### 15.3 Main-chain integrity audit
#### Assets without any window
```sql
select a.object_id as asset_id, a.song_id, a.storage_uri
from audio_object a
where a.object_type = 'asset'
and not exists (
select 1
from audio_object w
where w.parent_object_id = a.object_id
and w.object_type = 'window'
);
```
#### Windows without chromaprint
```sql
select w.object_id as window_id, w.song_id
from audio_object w
where w.object_type = 'window'
and not exists (
select 1
from feature_fact ff
where ff.object_id = w.object_id
and ff.feature_type = 'fingerprint'
and ff.model_name = 'chromaprint'
);
```
#### Windows without MERT
```sql
select w.object_id as window_id, w.song_id
from audio_object w
where w.object_type = 'window'
and not exists (
select 1
from feature_fact ff
where ff.object_id = w.object_id
and ff.feature_type = 'embedding'
and ff.model_name = 'mert-v1-95m'
and ff.model_version = 'hf-main'
and ff.feature_set_name = 'mert_5s_hop2.5_v1'
);
```
### 15.4 Hot/cold tiering conventions
```text
hot_set       -> high-frequency copyrighted songs
reference_set -> main reference catalog
cold          -> long-tail catalog; secure main chain + exact first
```
......
......@@ -690,3 +690,167 @@ flowchart TD
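### 16.7 The Phase-1 storage strategy in one sentence
> `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed which feature"; the two are bound by `object_id`, and `song_id` then attributes every result stably to a song.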
---
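## 17. Batch ingestion and index build strategy for 100w audio files / 30w songs (1M / 300k)
At this scale, the most important principle is not to compute every model in a single pass, but rather:
> **First persist the song / asset / window main chain reliably, then backfill feature_fact one model batch at a time, then build the retrieval indexes step by step.**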
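### 17.1 Recommended phases
#### Phase A: land the master data first
Write first:
- `media_entity(song)`
- `audio_object(asset)`
- `audio_object(window)`
Goals:
- Fix the `song -> asset -> window` main chain first
- Give every later model computation a unified object primary key
#### Phase B: roll out the exact lane first
Write next:
- `feature_fact(feature_type='fingerprint', model_name='chromaprint')`
Goals:
- Establish a high-precision copyright-protection baseline first
- Make song-level exact recall usable first
#### Phase C: backfill the semantic baseline in batches
Write next:
- `feature_fact(feature_type='embedding', model_name='mert-v1-95m')`
Goal:
- Bring the primary semantic-recall baseline to coverage first
#### Phase D: fill in challenger / fallback models
Backfill gradually as resources allow:
- `muq-base`
- `local_wavehash_embed`
- `ecapa-tdnn` (comparison only)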
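### 17.2 Recommended batch granularity
Import by **song batch** or **asset batch**, rather than scanning the full table directly per feature batch:
- Master data import: `5k ~ 20k songs` per batch
- Window slicing: `50k ~ 200k windows` per batch
- Fingerprint extraction: `50k ~ 200k windows` per batch
- Embedding extraction: size batches dynamically according to GPU/CPU capacity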
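The batch sizes above can be applied with a small chunking helper. This is a minimal sketch, not part of the documented pipeline: `iter_batches` and the integer `song_ids` are hypothetical names introduced for illustration, and the 10k batch size is simply one value inside the `5k ~ 20k songs` range.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical song ids: 300k songs at 10k per batch gives 30 batches.
song_ids = range(300_000)
batches = list(iter_batches(song_ids, 10_000))
print(len(batches), len(batches[0]), len(batches[-1]))  # 30 10000 10000
```

Because the helper consumes a plain iterator, the same function could be fed window ids for the fingerprint stage, with the batch size chosen per stage.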
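### 17.3 Why the main chain and the feature chain are batched separately
Because their lifecycles differ:
- The main chain is written once and stays relatively stable
- The feature chain keeps growing as models are replaced
So the recommendation is:
- Land `audio_object` in full first and keep it relatively stable
- Keep appending to `feature_fact` by model and by batch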
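### 17.4 Recommended index build order
#### Build the business main-chain indexes first
Make sure these indexes exist first:
- `idx_audio_object_song_type`
- `idx_audio_object_parent`
- `idx_feature_fact_object_type`
- `idx_feature_fact_song_type`
- `idx_set_membership_set_lookup`
#### Then build the model-audit index
If missing-model scans become frequent later, consider adding: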
```sql
create index if not exists idx_feature_fact_model_lookup
on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
```
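#### Build the heavy vector-retrieval indexes last
Vector indexes should not be tied to main-chain initialization:
- First land the `feature_fact` facts reliably
- Then build nearest-neighbor indexes per concrete vector table / dim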
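### 17.5 A recommended hot/cold tiering strategy
#### Hot tier
- `set_membership.set_type = 'hot_set'`
- High-frequency songs, high-frequency assets, hot copyrighted catalog
- Keep the full exact + semantic feature set with priority
#### Warm tier
- `reference_set`
- Main reference catalog
- Keep exact fully covered; backfill semantic in batches
#### Cold tier
- Long-tail songs
- Secure the main chain and exact first
- Semantic can be computed on demand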
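One way to make the three tiers actionable is to encode them as a small policy table and use it to order the semantic backfill queue. The sketch below is illustrative only: `TIER_POLICY`, `TIER_RANK`, `semantic_backfill_order`, and the coverage labels (`full`, `batched`, `on_demand`) are assumed names, and treating songs with an unknown tier as cold is an assumption rather than something the schema enforces.

```python
from typing import Dict, List

# Hypothetical policy table mirroring the hot / warm / cold tiers;
# "warm" corresponds to reference_set.
TIER_POLICY: Dict[str, Dict[str, str]] = {
    "hot": {"exact": "full", "semantic": "full"},
    "warm": {"exact": "full", "semantic": "batched"},
    "cold": {"exact": "full", "semantic": "on_demand"},
}

# Lower rank is backfilled earlier.
TIER_RANK: Dict[str, int] = {"hot": 0, "warm": 1, "cold": 2}


def semantic_backfill_order(tier_of_song: Dict[str, str]) -> List[str]:
    """Return song ids sorted hot -> warm -> cold, ties broken by song id.

    Songs whose tier is not recognised are treated as cold (an assumption).
    """
    cold_rank = TIER_RANK["cold"]
    return sorted(
        tier_of_song,
        key=lambda song_id: (TIER_RANK.get(tier_of_song[song_id], cold_rank), song_id),
    )


tiers = {"s3": "cold", "s1": "warm", "s2": "hot", "s4": "unknown"}
print(semantic_backfill_order(tiers))  # ['s2', 's1', 's3', 's4']
```

Note that every tier keeps `exact` at `full`, matching the rule that exact coverage is secured everywhere before semantic coverage is differentiated by tier.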
### 17.6 Recommended batch audit SQL
#### Find assets without any window
```sql
select a.object_id as asset_id, a.song_id, a.storage_uri
from audio_object a
where a.object_type = 'asset'
and not exists (
select 1
from audio_object w
where w.parent_object_id = a.object_id
and w.object_type = 'window'
)
order by a.song_id, a.object_id;
```
#### Find windows without a fingerprint
```sql
select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
and not exists (
select 1
from feature_fact ff
where ff.object_id = w.object_id
and ff.feature_type = 'fingerprint'
and ff.model_name = 'chromaprint'
)
order by w.song_id, w.parent_object_id, w.start_ms;
```
#### Find windows without a MERT embedding
```sql
select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
and not exists (
select 1
from feature_fact ff
where ff.object_id = w.object_id
and ff.feature_type = 'embedding'
and ff.model_name = 'mert-v1-95m'
and ff.model_version = 'hf-main'
and ff.feature_set_name = 'mert_5s_hop2.5_v1'
)
order by w.song_id, w.parent_object_id, w.start_ms;
```
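The two window-level audits above differ only in their model filter, so they can be folded into one parameterized helper. The following is a rough sketch that runs against an in-memory SQLite database so it is self-contained. The two-table schema is a cut-down assumption containing only the columns the audit queries reference, `windows_missing_feature` is a hypothetical helper name, and the sample rows are made up. Against PostgreSQL the driver's placeholder style and the typing of NULL parameters would need adjusting.

```python
import sqlite3
from typing import List, Optional, Tuple


def windows_missing_feature(
    conn: sqlite3.Connection,
    feature_type: str,
    model_name: str,
    model_version: Optional[str] = None,
    feature_set_name: Optional[str] = None,
) -> List[Tuple[str, str, str]]:
    """Return (window_id, song_id, asset_id) for windows lacking the given feature.

    A None model_version / feature_set_name means "do not filter on it".
    """
    sql = """
        select w.object_id, w.song_id, w.parent_object_id
        from audio_object w
        where w.object_type = 'window'
          and not exists (
            select 1 from feature_fact ff
            where ff.object_id = w.object_id
              and ff.feature_type = ?
              and ff.model_name = ?
              and (? is null or ff.model_version = ?)
              and (? is null or ff.feature_set_name = ?)
          )
        order by w.song_id, w.parent_object_id, w.start_ms
    """
    params = (feature_type, model_name, model_version, model_version,
              feature_set_name, feature_set_name)
    return conn.execute(sql, params).fetchall()


# Minimal assumed schema: only the columns the audit queries touch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table audio_object (
        object_id text primary key, object_type text not null,
        song_id text not null, parent_object_id text,
        start_ms integer, end_ms integer);
    create table feature_fact (
        object_id text not null, song_id text not null,
        feature_type text not null, model_name text not null,
        model_version text, feature_set_name text);
""")
conn.executemany("insert into audio_object values (?, ?, ?, ?, ?, ?)", [
    ("a1", "asset", "s1", None, None, None),
    ("w1", "window", "s1", "a1", 0, 5000),
    ("w2", "window", "s1", "a1", 2500, 7500),
])
conn.executemany("insert into feature_fact values (?, ?, ?, ?, ?, ?)", [
    ("w1", "s1", "fingerprint", "chromaprint", None, None),
    ("w2", "s1", "fingerprint", "chromaprint", None, None),
    ("w1", "s1", "embedding", "mert-v1-95m", "hf-main", "mert_5s_hop2.5_v1"),
])

missing_fp = windows_missing_feature(conn, "fingerprint", "chromaprint")
missing_mert = windows_missing_feature(
    conn, "embedding", "mert-v1-95m", "hf-main", "mert_5s_hop2.5_v1")
print(missing_fp)    # []
print(missing_mert)  # [('w2', 's1', 'a1')]
```

The returned window ids can feed the next backfill batch directly, which keeps the audit and the per-model backfill loop on the same definition of "missing".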
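### 17.7 The safest Phase-1 execution order
1. Land song/asset/window in full first
2. Roll out `chromaprint` first
3. Backfill `mert-v1-95m` in batches as the first semantic baseline
4. Run `muq-base` as the challenger
5. Backfill by hot/reference/cold tier
6. Tune the dual-lane fusion weights last
### 17.8 The strategy in one sentence
> At large scale, do not aim first for "every model complete"; first ensure that **the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier**.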
......