Commit df241d20 df241d20c056e82a3a67efffe9e2f013f0feabd9 by cnb.bofCdSsphPA

Why the docs need a concrete scale-up ingestion and indexing strategy

Constraint: A Phase-1 system at the 1M-audio scale (100w) needs a default order for ingest, exact coverage, semantic backfill, and index build timing on the current schema
Rejected: Leave scale-up as implied by the small examples | does not guide batch execution or gap audits at production volume
Confidence: high
Scope-risk: narrow
Directive: Keep future scale docs anchored on main-chain-first ingestion and model-by-model feature backfill unless benchmark evidence proves otherwise
Tested: markdown link check on /workspace/docs after adding scale-up ingestion, indexing, and audit SQL guidance
Not-tested: No live database rerun; this is a documentation-only expansion over the verified schema path
1 parent b624273c
@@ -8,6 +8,7 @@
- Continued the retrieval-fusion design: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` a song-level aggregation flowchart for the two lanes (exact lane + semantic lane), the rule-based fusion criteria, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact-led, semantic-reinforced` strategy.
- Continued the data-binding and model-storage notes: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` now spell out the binding fields along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`.
- Added to `docs/postgres_db_schema_samples.md` a complete `one song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and audits.
- Continued the scale-up implementation guidance: added to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` the batch ingestion order, index build order, hot/cold tiering strategy, and main-chain / model-gap audit SQL for `1M audio files / 300k songs`, making explicit that Phase-1 secures the main chain first, rolls out exact coverage first, and then backfills semantic features in batches.

## 2026-06-04

......
@@ -743,3 +743,81 @@ where w.object_type = 'window'
743 group by w.object_id, w.start_ms, w.end_ms
744 order by w.start_ms;
745 ```

---

## 15. Batch ingestion and index build samples

### 15.1 Recommended batch order

```text
batch-1: media_entity(song)
batch-2: audio_object(asset)
batch-3: audio_object(window)
batch-4: feature_fact(chromaprint)
batch-5: feature_fact(mert-v1-95m)
batch-6: feature_fact(muq-base)
```
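
The order matters because each batch references rows written by an earlier one. The following is a minimal sketch of that dependency, not the project's real DDL: the column lists and the foreign keys are assumptions made for illustration, and SQLite stands in for PostgreSQL so the snippet runs on its own.

```python
import sqlite3

# Toy stand-in for the real schema: the columns and foreign keys below are
# assumptions for illustration, and SQLite replaces PostgreSQL.
DDL = """
create table media_entity (song_id text primary key);
create table audio_object (
    object_id        text primary key,
    object_type      text not null,          -- 'asset' or 'window'
    song_id          text not null references media_entity(song_id),
    parent_object_id text references audio_object(object_id)
);
create table feature_fact (
    object_id    text not null references audio_object(object_id),
    feature_type text not null,
    model_name   text not null
);
"""

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")  # SQLite leaves FK checks off by default
    conn.executescript(DDL)
    return conn

def load_in_order(conn):
    # batch-1 .. batch-4 in the documented order: every insert finds its parent row.
    conn.execute("insert into media_entity values ('song-1')")
    conn.execute("insert into audio_object values ('asset-1', 'asset', 'song-1', null)")
    conn.execute("insert into audio_object values ('win-1', 'window', 'song-1', 'asset-1')")
    conn.execute("insert into feature_fact values ('win-1', 'fingerprint', 'chromaprint')")

def feature_before_window_fails(conn):
    # Writing a feature row before its window exists violates the foreign key.
    try:
        conn.execute(
            "insert into feature_fact values ('win-404', 'fingerprint', 'chromaprint')"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling `load_in_order(open_db())` succeeds, while `feature_before_window_fails` shows what an out-of-order feature batch would hit if such constraints were enforced.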

### 15.2 Recommended additional index

```sql
-- Lookup index for scans that filter feature_fact by model identity.
create index if not exists idx_feature_fact_model_lookup
on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
```

### 15.3 Main-chain integrity audit

#### Assets with no window

```sql
-- Assets that have not been sliced into any window yet.
select a.object_id as asset_id, a.song_id, a.storage_uri
from audio_object a
where a.object_type = 'asset'
  and not exists (
    select 1
    from audio_object w
    where w.parent_object_id = a.object_id
      and w.object_type = 'window'
  );
```

#### Windows with no chromaprint

```sql
-- Windows still missing their exact-lane fingerprint.
select w.object_id as window_id, w.song_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1
    from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = 'fingerprint'
      and ff.model_name = 'chromaprint'
  );
```

#### Windows with no MERT

```sql
-- Windows still missing the MERT embedding for this exact version and feature set.
select w.object_id as window_id, w.song_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1
    from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = 'embedding'
      and ff.model_name = 'mert-v1-95m'
      and ff.model_version = 'hf-main'
      and ff.feature_set_name = 'mert_5s_hop2.5_v1'
  );
```
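
To see what these audits return, here is a small sketch that runs the first two against toy rows. It assumes a trimmed-down table layout (only the columns the queries touch), invented sample data including the `s3://demo/...` URIs, and SQLite in place of PostgreSQL.

```python
import sqlite3

# Assumed minimal columns only; the real tables carry more fields.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table audio_object (
    object_id text primary key, object_type text, song_id text,
    parent_object_id text, storage_uri text
);
create table feature_fact (
    object_id text, feature_type text, model_name text
);
-- asset-2 has no windows; win-2 has no chromaprint row.
insert into audio_object values
    ('asset-1', 'asset',  'song-1', null,      's3://demo/a1.wav'),
    ('asset-2', 'asset',  'song-1', null,      's3://demo/a2.wav'),
    ('win-1',   'window', 'song-1', 'asset-1', null),
    ('win-2',   'window', 'song-1', 'asset-1', null);
insert into feature_fact values ('win-1', 'fingerprint', 'chromaprint');
""")

ASSETS_WITHOUT_WINDOW = """
select a.object_id
from audio_object a
where a.object_type = 'asset'
  and not exists (
    select 1 from audio_object w
    where w.parent_object_id = a.object_id
      and w.object_type = 'window'
  )
order by a.object_id
"""

WINDOWS_WITHOUT_CHROMAPRINT = """
select w.object_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1 from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = 'fingerprint'
      and ff.model_name = 'chromaprint'
  )
order by w.object_id
"""

assets_without_window = [r[0] for r in conn.execute(ASSETS_WITHOUT_WINDOW)]
windows_without_chromaprint = [r[0] for r in conn.execute(WINDOWS_WITHOUT_CHROMAPRINT)]
```

With these rows the first list holds only `asset-2` and the second only `win-2`, which are exactly the gaps a backfill job would pick up next.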

### 15.4 Hot/cold tiering conventions

```text
hot_set       -> high-frequency copyrighted tracks
reference_set -> main reference catalog
cold          -> long-tail catalog; secure the main chain + exact first
```
......
@@ -690,3 +690,167 @@ flowchart TD
### 16.7 The Phase-1 storage strategy in one sentence

> `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed which feature"; the two are bound by `object_id`, and `song_id` then attributes every result stably to a song.
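
---

## 17. Batch ingestion and index build strategy for 1M audio files / 300k songs

At this scale, the most important principle is not to compute every model in one pass, but rather:

> **First persist the song / asset / window main chain stably, then backfill `feature_fact` model by model in batches, and only then build out the retrieval indexes step by step.**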

### 17.1 Recommended phases

#### Phase A: land the master data first
Write first:
- `media_entity(song)`
- `audio_object(asset)`
- `audio_object(window)`

Goals:
- Fix the `song -> asset -> window` main chain first
- Give every later model computation a unified object primary key

#### Phase B: roll out the exact lane first
Write next:
- `feature_fact(feature_type='fingerprint', model_name='chromaprint')`

Goals:
- Establish a high-precision copyright-protection baseline first
- Make song-level exact recall usable first

#### Phase C: backfill the semantic baseline in batches
Write next:
- `feature_fact(feature_type='embedding', model_name='mert-v1-95m')`

Goal:
- Build coverage for the primary semantic recall baseline first

#### Phase D: backfill challenger / fallback models
Backfill gradually as resources allow:
- `muq-base`
- `local_wavehash_embed`
- `ecapa-tdnn` (for comparison only)
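
### 17.2 Recommended batch granularity

Import by **song batch** or **asset batch**, rather than scanning the full table once per feature batch:

- Master-data import: `5k ~ 20k songs` per batch
- Window slicing: `50k ~ 200k windows` per batch
- Fingerprint extraction: `50k ~ 200k windows` per batch
- Embedding extraction: size batches dynamically to the available GPU/CPU capacity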
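
As a rough illustration of these batch sizes, the sketch below splits an id stream into fixed-size batches and estimates how many batches a stage needs. The helper names and the concrete sizes (picked from inside the ranges above) are assumptions, not values the docs prescribe.

```python
from itertools import islice

def chunked(ids, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical per-stage sizes chosen from inside the recommended ranges.
BATCH_SIZES = {"song": 10_000, "window": 100_000, "fingerprint": 100_000}

def plan_batches(total_items, stage):
    """Return how many batches a stage needs at its configured size."""
    size = BATCH_SIZES[stage]
    return -(-total_items // size)  # ceiling division
```

For example, `plan_batches(300_000, "song")` gives 30 batches for the 300k-song catalog at 10k songs per batch. Embedding extraction is left out of `BATCH_SIZES` on purpose, since its batch size is meant to follow hardware capacity.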
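
### 17.3 Why the main chain and the feature chain are batched separately

Because their life cycles differ:
- The main chain is written once and stays relatively stable
- The feature chain keeps growing as models are replaced

So the recommendation is:
- Land `audio_object` in full first and keep it relatively stable
- Keep appending to `feature_fact` per model and per batch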

### 17.4 Recommended index build order

#### Build the business main-chain indexes first
Prioritize these indexes:
- `idx_audio_object_song_type`
- `idx_audio_object_parent`
- `idx_feature_fact_object_type`
- `idx_feature_fact_song_type`
- `idx_set_membership_set_lookup`

#### Then build the model-audit index
If missing-model scans become frequent later, consider adding:

```sql
-- Lookup index for scans that filter feature_fact by model identity.
create index if not exists idx_feature_fact_model_lookup
on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id);
```

#### Build the heavy vector-search indexes last
Vector indexes should not be tied to main-chain initialization:
- Land the `feature_fact` facts first
- Then build nearest-neighbor indexes per concrete vector table / dim

### 17.5 A recommended hot/cold tiering strategy

#### Hot tier
- `set_membership.set_type = 'hot_set'`
- High-frequency songs, high-frequency assets, the hot copyrighted catalog
- Prioritize keeping the full exact + semantic feature set

#### Warm tier
- `reference_set`
- Main reference catalog
- Keep full exact coverage; backfill semantic in batches

#### Cold tier
- Long-tail songs
- Secure the main chain and exact first
- Backfill semantic on demand
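
One way to make the tiers actionable is a small policy table that says which models a tier must already have and how early its semantic backfill is scheduled. This is only a sketch: the tier keys, the choice of `chromaprint` as the exact model and `mert-v1-95m` as the semantic baseline, and both function names are assumptions layered on the description above.

```python
# Assumed mapping: chromaprint is the exact lane, mert-v1-95m the semantic baseline.
EXACT = {"chromaprint"}
SEMANTIC = {"mert-v1-95m"}

# Models a tier is expected to have before it counts as complete.
TIER_REQUIRED = {
    "hot": EXACT | SEMANTIC,  # keep exact + semantic
    "warm": EXACT,            # exact fully covered; semantic backfilled in batches
    "cold": EXACT,            # main chain + exact first; semantic on demand
}

# Lower number means semantic backfill is scheduled earlier.
BACKFILL_PRIORITY = {"hot": 0, "warm": 1, "cold": 2}

def missing_required(tier, present_models):
    """Return the required models a song in this tier still lacks, sorted."""
    return sorted(TIER_REQUIRED[tier] - set(present_models))

def backfill_order(tiers):
    """Sort tier names by how early their semantic backfill should run."""
    return sorted(tiers, key=BACKFILL_PRIORITY.__getitem__)
```

A hot-tier song that only has `chromaprint` would report `mert-v1-95m` as missing, while the same song in the cold tier would report nothing.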

### 17.6 Recommended batch audit SQL

#### Find assets with no window

```sql
-- Assets that have not been sliced into any window yet.
select a.object_id as asset_id, a.song_id, a.storage_uri
from audio_object a
where a.object_type = 'asset'
  and not exists (
    select 1
    from audio_object w
    where w.parent_object_id = a.object_id
      and w.object_type = 'window'
  )
order by a.song_id, a.object_id;
```

#### Find windows with no fingerprint

```sql
-- Windows still missing their exact-lane fingerprint, grouped by song and asset.
select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1
    from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = 'fingerprint'
      and ff.model_name = 'chromaprint'
  )
order by w.song_id, w.parent_object_id, w.start_ms;
```

#### Find windows with no MERT embedding

```sql
-- Windows still missing the MERT embedding for this exact version and feature set.
select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1
    from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = 'embedding'
      and ff.model_name = 'mert-v1-95m'
      and ff.model_version = 'hf-main'
      and ff.feature_set_name = 'mert_5s_hop2.5_v1'
  )
order by w.song_id, w.parent_object_id, w.start_ms;
```
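
The two window audits differ only in the model filter, so they can be folded into one parameterized helper. The sketch below is one possible shape, assuming a trimmed table layout, invented sample rows (including the `v1` / `fp_v1` values for chromaprint), and SQLite with `?` placeholders in place of PostgreSQL; a PostgreSQL driver would use its own placeholder style.

```python
import sqlite3

def missing_windows(conn, feature_type, model_name,
                    model_version=None, feature_set_name=None):
    """Return ids of windows lacking a feature_fact row for the given model.

    model_version and feature_set_name are only filtered on when provided.
    """
    sql = """
        select w.object_id
        from audio_object w
        where w.object_type = 'window'
          and not exists (
              select 1
              from feature_fact ff
              where ff.object_id = w.object_id
                and ff.feature_type = ?
                and ff.model_name = ?
                and (? is null or ff.model_version = ?)
                and (? is null or ff.feature_set_name = ?)
          )
        order by w.song_id, w.parent_object_id, w.start_ms
    """
    params = (feature_type, model_name,
              model_version, model_version,
              feature_set_name, feature_set_name)
    return [row[0] for row in conn.execute(sql, params)]

def build_demo_db():
    # Toy rows; the column set is an assumption trimmed to what the audit touches.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    create table audio_object (
        object_id text primary key, object_type text, song_id text,
        parent_object_id text, start_ms integer
    );
    create table feature_fact (
        object_id text, feature_type text, model_name text,
        model_version text, feature_set_name text
    );
    insert into audio_object values
        ('asset-1', 'asset',  'song-1', null,      null),
        ('win-1',   'window', 'song-1', 'asset-1', 0),
        ('win-2',   'window', 'song-1', 'asset-1', 2500);
    -- 'v1' and 'fp_v1' are placeholder values for the demo only.
    insert into feature_fact values
        ('win-1', 'fingerprint', 'chromaprint', 'v1',      'fp_v1'),
        ('win-2', 'fingerprint', 'chromaprint', 'v1',      'fp_v1'),
        ('win-1', 'embedding',   'mert-v1-95m', 'hf-main', 'mert_5s_hop2.5_v1');
    """)
    return conn
```

On the demo rows, the chromaprint audit comes back empty and the MERT audit returns only `win-2`, so the same helper can drive the backfill queue for any model added later.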

### 17.7 The most robust Phase-1 execution order

1. Land song/asset/window in full first
2. Roll out `chromaprint` coverage first
3. Backfill `mert-v1-95m` in batches as the first semantic baseline
4. Run `muq-base` as the challenger
5. Backfill by hot/reference/cold tier
6. Tune the dual-lane fusion weights last
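
### 17.8 The strategy in one sentence

> At scale, do not start by chasing "every model complete"; first make sure **the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier**.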
......