Why the docs need a concrete scale-up ingestion and indexing strategy

Constraint: A 1M-audio (100w) Phase-1 system needs a default order for ingest, exact coverage, semantic backfill, and index build timing on the current schema
Rejected: Leave scale-up as implied by the small examples | does not guide batch execution or gap audits at production volume
Confidence: high
Scope-risk: narrow
Directive: Keep future scale docs anchored on main-chain-first ingestion and model-by-model feature backfill unless benchmark evidence proves otherwise
Tested: markdown link check on /workspace/docs after adding scale-up ingestion, indexing, and audit SQL guidance
Not-tested: No live database rerun; this is a documentation-only expansion over the verified schema path

Commit df241d20 ... df241d20c056e82a3a67efffe9e2f013f0feabd9 authored 2026-06-04 15:14:46 +0800 by cnb.bofCdSsphPA
Showing 3 changed files with 243 additions and 0 deletions
docs/CHANGELOG.md
docs/postgres_db_schema_samples.md
docs/postgresql-data-model.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -8,6 +8,7 @@
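 - Continued expanding the retrieval fusion design: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` a song-level aggregation flowchart for the exact lane + semantic lane dual-channel setup, the rule-based fusion conventions, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact-led, semantic-reinforced` strategy.
 - Continued expanding the data binding and model persistence notes: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` now spell out the binding fields along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.
 - Added to `docs/postgres_db_schema_samples.md` a complete `same song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature completeness check SQL, to support Phase-1 batch backfill and audits.
+- Continued expanding the scale-up implementation guidance: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` the batch ingestion order, index build order, hot/cold tiering strategy, and main-chain / model gap audit SQL for `1M audio files / 300k songs` (100w / 30w), making explicit that Phase-1 secures the main chain first, rolls out exact coverage first, and backfills semantic in batches afterwards.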

 ## 2026-06-04

--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -743,3 +743,81 @@ where w.object_type = 'window'
 group by w.object_id, w.start_ms, w.end_ms
 order by w.start_ms;
 ```
+
+---
+
+## 15. Batch ingestion and index build samples
+
+### 15.1 Recommended batch order
+
+```text
+batch-1: media_entity(song)
+batch-2: audio_object(asset)
+batch-3: audio_object(window)
+batch-4: feature_fact(chromaprint)
+batch-5: feature_fact(mert-v1-95m)
+batch-6: feature_fact(muq-base)
+```
+
+### 15.2 Recommended supplementary index
+
+```sql
+create index if not exists idx_feature_fact_model_lookup
+    on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
+```
+
+### 15.3 Main-chain integrity audit
+
+#### Assets with no windows
+
+```sql
+select a.object_id as asset_id, a.song_id, a.storage_uri
+from audio_object a
+where a.object_type = 'asset'
+  and not exists (
+      select 1
+      from audio_object w
+      where w.parent_object_id = a.object_id
+        and w.object_type = 'window'
+  );
+```
+
+#### Windows with no chromaprint
+
+```sql
+select w.object_id as window_id, w.song_id
+from audio_object w
+where w.object_type = 'window'
+  and not exists (
+      select 1
+      from feature_fact ff
+      where ff.object_id = w.object_id
+        and ff.feature_type = 'fingerprint'
+        and ff.model_name = 'chromaprint'
+  );
+```
+
+#### Windows with no MERT
+
+```sql
+select w.object_id as window_id, w.song_id
+from audio_object w
+where w.object_type = 'window'
+  and not exists (
+      select 1
+      from feature_fact ff
+      where ff.object_id = w.object_id
+        and ff.feature_type = 'embedding'
+        and ff.model_name = 'mert-v1-95m'
+        and ff.model_version = 'hf-main'
+        and ff.feature_set_name = 'mert_5s_hop2.5_v1'
+  );
+```
+
+### 15.4 Hot/cold tiering conventions
+
+```text
+hot_set       -> high-frequency copyrighted songs
+reference_set -> main reference catalog
+cold          -> long-tail catalog; secure main chain + exact first
+```
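
The audit queries in section 15.3 differ only in the `feature_type` / `model_name` filter, so they can be driven by one parameterized query. The following is a minimal sketch, not the project's tooling: it uses an in-memory SQLite database in place of PostgreSQL, and the two-table schema is a hypothetical reduction that keeps only the columns the audit SQL touches. The helper name `windows_missing` and the demo rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical minimal schema: only the columns used by the gap-audit SQL.
SCHEMA = """
create table audio_object (
    object_id        text primary key,
    object_type      text not null,   -- 'asset' or 'window'
    parent_object_id text,
    song_id          text not null,
    start_ms         integer
);
create table feature_fact (
    object_id    text not null,
    song_id      text not null,
    feature_type text not null,
    model_name   text not null
);
"""

# Same shape as the "window without feature" audits, with the model as a parameter.
MISSING_FEATURE_SQL = """
select w.object_id
from audio_object w
where w.object_type = 'window'
  and not exists (
      select 1
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = ?
        and ff.model_name = ?
  )
order by w.song_id, w.parent_object_id, w.start_ms
"""


def windows_missing(conn, feature_type, model_name):
    """Return ids of windows that have no feature_fact row for the given model."""
    rows = conn.execute(MISSING_FEATURE_SQL, (feature_type, model_name))
    return [row[0] for row in rows]


def build_demo():
    """Create one asset with two windows; only w1 has a MERT embedding."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into audio_object values (?, ?, ?, ?, ?)",
        [
            ("a1", "asset", None, "s1", None),
            ("w1", "window", "a1", "s1", 0),
            ("w2", "window", "a1", "s1", 2500),
        ],
    )
    conn.executemany(
        "insert into feature_fact values (?, ?, ?, ?)",
        [
            ("w1", "s1", "fingerprint", "chromaprint"),
            ("w2", "s1", "fingerprint", "chromaprint"),
            ("w1", "s1", "embedding", "mert-v1-95m"),
        ],
    )
    return conn


conn = build_demo()
print(windows_missing(conn, "fingerprint", "chromaprint"))  # exact lane gaps
print(windows_missing(conn, "embedding", "mert-v1-95m"))    # semantic lane gaps
```

In this demo data the exact lane is fully covered while window `w2` still lacks a MERT embedding, which is the situation the batch order (chromaprint before mert-v1-95m) would produce mid-backfill. Against the real schema the `model_version` and `feature_set_name` filters from the MERT audit would need to be added as further parameters.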
--- a/docs/postgresql-data-model.md
+++ b/docs/postgresql-data-model.md
@@ -690,3 +690,167 @@ flowchart TD
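 ### 16.7 The Phase-1 storage strategy in one sentence

 > `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed which feature"; the two are bound by `object_id`, and `song_id` then attributes every result stably to a song.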
+
+---
+
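+## 17. Batch ingestion and index build strategy for 1M audio files / 300k songs (100w / 30w)
+
+At this scale, the most important principle is not to compute every model in one pass, but rather:
+
+> **First persist the song / asset / window main chain reliably, then backfill feature_fact in per-model batches, then build the retrieval indexes step by step.**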
+
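+### 17.1 Recommended phases
+
+#### Phase A: land the master data first
+Write first:
+- `media_entity(song)`
+- `audio_object(asset)`
+- `audio_object(window)`
+
+Goals:
+- Fix the `song -> asset -> window` main chain first
+- Give all later model computation a unified object primary key
+
+#### Phase B: roll out the exact lane first
+Write next:
+- `feature_fact(feature_type='fingerprint', model_name='chromaprint')`
+
+Goals:
+- Establish a high-precision copyright protection baseline first
+- Make song-level exact recall usable first
+
+#### Phase C: backfill the semantic baseline in batches
+Write next:
+- `feature_fact(feature_type='embedding', model_name='mert-v1-95m')`
+
+Goals:
+- Get the primary semantic recall baseline to cover the catalog
+
+#### Phase D: backfill challenger / fallback models
+Backfill gradually as resources allow:
+- `muq-base`
+- `local_wavehash_embed`
+- `ecapa-tdnn` (comparison only)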
+
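+### 17.2 Recommended batch granularity
+
+Import by **song batch** or **asset batch** rather than scanning the whole table per feature batch:
+
+- Master data import: `5k ~ 20k songs` per batch
+- Window slicing: `50k ~ 200k windows` per batch
+- Fingerprint extraction: `50k ~ 200k windows` per batch
+- Embedding extraction: size batches dynamically to GPU/CPU capacity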
+
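+### 17.3 Why the main chain and the feature chain are batched separately
+
+Because the two have different lifecycles:
+- The main chain is written once and stays relatively stable
+- The feature chain keeps growing as models are replaced
+
+So the recommendation is:
+- Land `audio_object` in full first, in a relatively stable form
+- Keep appending `feature_fact` by model and by batch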
+
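+### 17.4 Recommended index build order
+
+#### Build the business main-chain indexes first
+Guarantee these indexes first:
+- `idx_audio_object_song_type`
+- `idx_audio_object_parent`
+- `idx_feature_fact_object_type`
+- `idx_feature_fact_song_type`
+- `idx_set_membership_set_lookup`
+
+#### Then build the model audit index
+If missing-model scans become frequent, consider adding: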
+
+```sql
+create index if not exists idx_feature_fact_model_lookup
+    on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
+```
+
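+#### Build the heavy vector search indexes last
+Vector indexes should not be tied to main-chain initialization:
+- First let the `feature_fact` rows settle
+- Then build nearest-neighbor indexes per concrete vector table / dim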
+
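+### 17.5 A recommended hot/cold tiering strategy
+
+#### Hot tier
+- `set_membership.set_type = 'hot_set'`
+- High-frequency songs, high-frequency assets, hot copyrighted catalog
+- Prioritize keeping full exact + semantic features
+
+#### Warm tier
+- `reference_set`
+- Main reference catalog
+- Keep exact fully covered; backfill semantic in batches
+
+#### Cold tier
+- Long-tail songs
+- Secure the main chain and exact first
+- Semantic can be backfilled on demand
+
+### 17.6 Recommended batch audit SQL
+
+#### Find assets with no windows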
+
+```sql
+select a.object_id as asset_id, a.song_id, a.storage_uri
+from audio_object a
+where a.object_type = 'asset'
+  and not exists (
+      select 1
+      from audio_object w
+      where w.parent_object_id = a.object_id
+        and w.object_type = 'window'
+  )
+order by a.song_id, a.object_id;
+```
+
+#### Find windows with no fingerprint
+
+```sql
+select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
+from audio_object w
+where w.object_type = 'window'
+  and not exists (
+      select 1
+      from feature_fact ff
+      where ff.object_id = w.object_id
+        and ff.feature_type = 'fingerprint'
+        and ff.model_name = 'chromaprint'
+  )
+order by w.song_id, w.parent_object_id, w.start_ms;
+```
+
+#### Find windows with no MERT embedding
+
+```sql
+select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
+from audio_object w
+where w.object_type = 'window'
+  and not exists (
+      select 1
+      from feature_fact ff
+      where ff.object_id = w.object_id
+        and ff.feature_type = 'embedding'
+        and ff.model_name = 'mert-v1-95m'
+        and ff.model_version = 'hf-main'
+        and ff.feature_set_name = 'mert_5s_hop2.5_v1'
+  )
+order by w.song_id, w.parent_object_id, w.start_ms;
+```
+
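+### 17.7 The safest Phase-1 execution order
+
+1. Land song/asset/window in full first
+2. Roll out `chromaprint` to full coverage
+3. Backfill `mert-v1-95m` in batches as the first semantic baseline
+4. Run `muq-base` as the challenger
+5. Backfill by hot/reference/cold tier
+6. Tune the dual-lane fusion weights last
+
+### 17.8 The strategy in one sentence
+
+> At large scale, do not aim for "every model complete" first; instead make sure **the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier**.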
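
The phase order in section 17.1 and the batch sizes in section 17.2 can be combined into a simple work plan. The sketch below is illustrative only: the `PHASES` layout, the `batched` and `plan` helpers, and the choice of 10,000 songs per batch (inside the documented `5k ~ 20k` range) are assumptions. It also batches every phase by song for simplicity, whereas the documented window and fingerprint batches are sized in windows.

```python
from itertools import islice

# Phase order follows section 17.1; the tuple layout is a hypothetical convention.
PHASES = [
    ("A", "main chain", ["media_entity(song)", "audio_object(asset)", "audio_object(window)"]),
    ("B", "exact lane", ["chromaprint"]),
    ("C", "semantic baseline", ["mert-v1-95m"]),
    ("D", "challenger / fallback", ["muq-base", "local_wavehash_embed", "ecapa-tdnn"]),
]


def batched(ids, batch_size):
    """Yield consecutive lists of at most batch_size ids."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(ids)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def plan(song_ids, batch_size):
    """Return (phase, target, batch_index) work items in execution order.

    Each target is scheduled across all song batches before the next target
    starts, so the main chain is complete before any feature backfill begins.
    """
    batch_count = len(list(batched(song_ids, batch_size)))
    return [
        (phase, target, index)
        for phase, _label, targets in PHASES
        for target in targets
        for index in range(batch_count)
    ]


song_ids = range(300_000)          # 300k songs, ids are placeholders
work = plan(song_ids, 10_000)      # 30 batches x 8 targets = 240 work items
print(len(work))
print(work[0], work[-1])
```

Keeping the plan as plain data makes it easy to persist progress per `(target, batch_index)` and to rerun only the batches that a gap audit reports as incomplete.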