Why the docs need a concrete scale-up ingestion and indexing strategy
Constraint: A 1M-audio (100w) Phase-1 system needs a default order for ingest, exact coverage, semantic backfill, and index build timing on the current schema
Rejected: Leave scale-up as implied by the small examples | does not guide batch execution or gap audits at production volume
Confidence: high
Scope-risk: narrow
Directive: Keep future scale docs anchored on main-chain-first ingestion and model-by-model feature backfill unless benchmark evidence proves otherwise
Tested: markdown link check on /workspace/docs after adding scale-up ingestion, indexing, and audit SQL guidance
Not-tested: No live database rerun; this is a documentation-only expansion over the verified schema path
Showing 3 changed files with 243 additions and 0 deletions
| ... | @@ -8,6 +8,7 @@ | ... | @@ -8,6 +8,7 @@ |
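| 8 | - Continue extending the retrieval fusion design: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, add a song-level aggregation flowchart for the exact lane + semantic lane dual-lane pipeline, the rule-based fusion conventions, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact leads, semantic reinforces` strategy. | 8 | - Continue extending the retrieval fusion design: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, add a song-level aggregation flowchart for the exact lane + semantic lane dual-lane pipeline, the rule-based fusion conventions, and a SQL skeleton, making explicit that Phase-1 ranks with an `exact leads, semantic reinforces` strategy. |
| 9 | - Continue extending the data-binding and model-persistence notes: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, spell out the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`. | 9 | - Continue extending the data-binding and model-persistence notes: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, spell out the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`. |
| 10 | - In `docs/postgres_db_schema_samples.md`, add a complete `one song -> multiple assets -> multiple windows -> multiple models` example, together with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and audits. | 10 | - In `docs/postgres_db_schema_samples.md`, add a complete `one song -> multiple assets -> multiple windows -> multiple models` example, together with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and audits. |
| 11 | - Continue extending the scale-up implementation guidance: in `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, add the batch ingestion order, index build order, hot/cold tiering strategy, and main-chain / model gap audit SQL for `1M audio files / 300k songs`, making explicit that Phase-1 secures the main chain first, rolls out exact coverage first, and then backfills semantic in batches. | ||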
| 11 | 12 | ||
| 12 | ## 2026-06-04 | 13 | ## 2026-06-04 |
| 13 | 14 | ... | ... |
| ... | @@ -743,3 +743,81 @@ where w.object_type = 'window' | ... | @@ -743,3 +743,81 @@ where w.object_type = 'window' |
| 743 | group by w.object_id, w.start_ms, w.end_ms | 743 | group by w.object_id, w.start_ms, w.end_ms |
| 744 | order by w.start_ms; | 744 | order by w.start_ms; |
| 745 | ``` | 745 | ``` |
| 746 | |||
| 747 | --- | ||
| 748 | |||
| 749 | ## 15. Batch ingestion and index build samples | ||
| 750 | |||
| 751 | ### 15.1 Recommended batch order | ||
| 752 | |||
| 753 | ```text | ||
| 754 | batch-1: media_entity(song) | ||
| 755 | batch-2: audio_object(asset) | ||
| 756 | batch-3: audio_object(window) | ||
| 757 | batch-4: feature_fact(chromaprint) | ||
| 758 | batch-5: feature_fact(mert-v1-95m) | ||
| 759 | batch-6: feature_fact(muq-base) | ||
| 760 | ``` | ||
| 761 | |||
| 762 | ### 15.2 Recommended supplementary index | ||
| 763 | |||
| 764 | ```sql | ||
| 765 | create index if not exists idx_feature_fact_model_lookup | ||
| 766 | on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id); | ||
| 767 | ``` | ||
| 768 | |||
| 769 | ### 15.3 Main-chain integrity audit | ||
| 770 | |||
| 771 | #### Assets without any window | ||
| 772 | |||
| 773 | ```sql | ||
| 774 | select a.object_id as asset_id, a.song_id, a.storage_uri | ||
| 775 | from audio_object a | ||
| 776 | where a.object_type = 'asset' | ||
| 777 | and not exists ( | ||
| 778 | select 1 | ||
| 779 | from audio_object w | ||
| 780 | where w.parent_object_id = a.object_id | ||
| 781 | and w.object_type = 'window' | ||
| 782 | ); | ||
| 783 | ``` | ||
| 784 | |||
| 785 | #### Windows without a chromaprint fingerprint | ||
| 786 | |||
| 787 | ```sql | ||
| 788 | select w.object_id as window_id, w.song_id | ||
| 789 | from audio_object w | ||
| 790 | where w.object_type = 'window' | ||
| 791 | and not exists ( | ||
| 792 | select 1 | ||
| 793 | from feature_fact ff | ||
| 794 | where ff.object_id = w.object_id | ||
| 795 | and ff.feature_type = 'fingerprint' | ||
| 796 | and ff.model_name = 'chromaprint' | ||
| 797 | ); | ||
| 798 | ``` | ||
| 799 | |||
| 800 | #### Windows without a MERT embedding | ||
| 801 | |||
| 802 | ```sql | ||
| 803 | select w.object_id as window_id, w.song_id | ||
| 804 | from audio_object w | ||
| 805 | where w.object_type = 'window' | ||
| 806 | and not exists ( | ||
| 807 | select 1 | ||
| 808 | from feature_fact ff | ||
| 809 | where ff.object_id = w.object_id | ||
| 810 | and ff.feature_type = 'embedding' | ||
| 811 | and ff.model_name = 'mert-v1-95m' | ||
| 812 | and ff.model_version = 'hf-main' | ||
| 813 | and ff.feature_set_name = 'mert_5s_hop2.5_v1' | ||
| 814 | ); | ||
| 815 | ``` | ||
| 816 | |||
| 817 | ### 15.4 Hot/cold tiering conventions | ||
| 818 | |||
| 819 | ```text | ||
| 820 | hot_set -> high-traffic copyrighted tracks | ||
| 821 | reference_set -> main reference catalog | ||
| 822 | cold -> long-tail catalog; secure main chain + exact first | ||
| 823 | ``` | ... | ... |
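The audit queries in section 15.3 can be tried out without a live PostgreSQL instance. The following is a minimal sketch, assuming a reduced stand-in schema in an in-memory SQLite database: only the columns that the queries reference are created, the `file:///...` URIs and object ids are made-up demo values, and the missing-model query is parameterized on `feature_type` and `model_name` only (the MERT audit in the doc additionally filters on `model_version` and `feature_set_name`).

```python
import sqlite3

# Assumption: a reduced stand-in for the real PostgreSQL schema, limited to
# the columns that the audit queries reference.
DDL = """
create table audio_object (
    object_id        text primary key,
    object_type      text not null,      -- 'asset' or 'window'
    parent_object_id text,
    song_id          text not null,
    storage_uri      text,
    start_ms         integer,
    end_ms           integer
);
create table feature_fact (
    object_id        text not null,
    song_id          text not null,
    feature_type     text not null,
    model_name       text not null,
    model_version    text,
    feature_set_name text
);
"""

ASSETS_WITHOUT_WINDOW = """
select a.object_id
from audio_object a
where a.object_type = 'asset'
  and not exists (
    select 1 from audio_object w
    where w.parent_object_id = a.object_id
      and w.object_type = 'window'
  )
order by a.object_id
"""

WINDOWS_MISSING_MODEL = """
select w.object_id
from audio_object w
where w.object_type = 'window'
  and not exists (
    select 1 from feature_fact ff
    where ff.object_id = w.object_id
      and ff.feature_type = ?
      and ff.model_name = ?
  )
order by w.start_ms
"""


def build_demo_db():
    """Create an in-memory database with one song, two assets, two windows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.executemany(
        "insert into audio_object values (?, ?, ?, ?, ?, ?, ?)",
        [
            ("a1", "asset", None, "s1", "file:///a1.wav", None, None),
            ("a2", "asset", None, "s1", "file:///a2.wav", None, None),  # no windows yet
            ("w1", "window", "a1", "s1", None, 0, 5000),
            ("w2", "window", "a1", "s1", None, 2500, 7500),
        ],
    )
    conn.executemany(
        "insert into feature_fact values (?, ?, ?, ?, ?, ?)",
        [
            ("w1", "s1", "fingerprint", "chromaprint", None, None),
            ("w2", "s1", "fingerprint", "chromaprint", None, None),
            ("w1", "s1", "embedding", "mert-v1-95m", "hf-main", "mert_5s_hop2.5_v1"),
        ],
    )
    return conn


def audit(conn):
    """Run the three gap audits and return the offending object ids."""
    def ids(sql, params=()):
        return [row[0] for row in conn.execute(sql, params)]

    return {
        "assets_without_window": ids(ASSETS_WITHOUT_WINDOW),
        "windows_without_chromaprint": ids(
            WINDOWS_MISSING_MODEL, ("fingerprint", "chromaprint")
        ),
        "windows_without_mert": ids(
            WINDOWS_MISSING_MODEL, ("embedding", "mert-v1-95m")
        ),
    }
```

Calling `audit(build_demo_db())` on this toy data flags asset `a2` (no windows) and window `w2` (no MERT embedding), while the chromaprint audit comes back empty. Against the real schema the same `not exists` shape applies; only the connection and the full column list would differ.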
| ... | @@ -690,3 +690,167 @@ flowchart TD | ... | @@ -690,3 +690,167 @@ flowchart TD |
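| 690 | ### 16.7 The Phase-1 storage strategy in one sentence | 690 | ### 16.7 The Phase-1 storage strategy in one sentence |
| 691 | 691 | ||
| 692 | > `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed which feature"; the two are bound by `object_id`, and `song_id` then attributes every result stably to a song. | 692 | > `audio_object` answers "which piece of audio", `feature_fact` answers "which model computed which feature"; the two are bound by `object_id`, and `song_id` then attributes every result stably to a song. |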
| 693 | |||
| 694 | --- | ||
| 695 | |||
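| 696 | ## 17. Batch ingestion and index build strategy for 1M audio files / 300k songs | ||
| 697 | |||
| 698 | At the current scale, the most important principle is not to compute every model in a single pass, but rather: | ||
| 699 | |||
| 700 | > **First persist the song / asset / window main chain reliably, then backfill feature_fact in per-model batches, then build the retrieval indexes incrementally.** | ||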
| 701 | |||
| 702 | ### 17.1 Recommended phases | ||
| 703 | |||
| 704 | #### Phase A: persist the master data first | ||
| 705 | Write first: | ||
| 706 | - `media_entity(song)` | ||
| 707 | - `audio_object(asset)` | ||
| 708 | - `audio_object(window)` | ||
| 709 | |||
| 710 | Goals: | ||
| 711 | - Fix the `song -> asset -> window` main chain first | ||
| 712 | - Give every downstream model computation a single, shared object primary key | ||
| 713 | |||
| 714 | #### Phase B: roll out the exact lane first | ||
| 715 | Write next: | ||
| 716 | - `feature_fact(feature_type='fingerprint', model_name='chromaprint')` | ||
| 717 | |||
| 718 | Goals: | ||
| 719 | - Establish a high-precision copyright-protection baseline first | ||
| 720 | - Make song-level exact recall usable first | ||
| 721 | |||
| 722 | #### Phase C: backfill the semantic baseline in batches | ||
| 723 | Write next: | ||
| 724 | - `feature_fact(feature_type='embedding', model_name='mert-v1-95m')` | ||
| 725 | |||
| 726 | Goals: | ||
| 727 | - Build up coverage for the primary semantic recall baseline first | ||
| 728 | |||
| 729 | #### Phase D: backfill challenger / fallback models | ||
| 730 | Backfill gradually as resources allow: | ||
| 731 | - `muq-base` | ||
| 732 | - `local_wavehash_embed` | ||
| 733 | - `ecapa-tdnn` (comparison only) | ||
| 734 | |||
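| 735 | ### 17.2 Recommended batch granularity | ||
| 736 | |||
| 737 | Import by **song batch** or **asset batch** rather than scanning the whole table once per feature batch: | ||
| 738 | |||
| 739 | - Master data import: `5k ~ 20k songs` per batch | ||
| 740 | - Window slicing: `50k ~ 200k windows` per batch | ||
| 741 | - Fingerprint extraction: `50k ~ 200k windows` per batch | ||
| 742 | - Embedding extraction: size batches dynamically according to GPU/CPU capacity | ||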
| 743 | |||
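| 744 | ### 17.3 Why the main chain and the feature chain are batched separately | ||
| 745 | |||
| 746 | Because the two have different lifecycles: | ||
| 747 | - The main chain is written once and stays relatively stable | ||
| 748 | - The feature chain keeps growing as models are replaced | ||
| 749 | |||
| 750 | Hence the recommendation: | ||
| 751 | - Persist `audio_object` in full first and keep it relatively stable | ||
| 752 | - Keep appending to `feature_fact` per model and per batch | ||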
| 753 | |||
| 754 | ### 17.4 Recommended index build order | ||
| 755 | |||
| 756 | #### Build the business main-chain indexes first | ||
| 757 | Guarantee these indexes first: | ||
| 758 | - `idx_audio_object_song_type` | ||
| 759 | - `idx_audio_object_parent` | ||
| 760 | - `idx_feature_fact_object_type` | ||
| 761 | - `idx_feature_fact_song_type` | ||
| 762 | - `idx_set_membership_set_lookup` | ||
| 763 | |||
| 764 | #### Then build the model-audit index | ||
| 765 | If missing-model scans become frequent later, consider adding: | ||
| 766 | |||
| 767 | ```sql | ||
| 768 | create index if not exists idx_feature_fact_model_lookup | ||
| 769 | on feature_fact(model_name, model_version, feature_set_name, feature_type, song_id); | ||
| 770 | ``` | ||
| 771 | |||
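| 772 | #### Build the heavyweight vector retrieval indexes last | ||
| 773 | Do not tie vector indexes to main-chain initialization: | ||
| 774 | - First let the `feature_fact` facts settle | ||
| 775 | - Then build nearest-neighbor indexes per concrete vector table / dim | ||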
| 776 | |||
| 777 | ### 17.5 A recommended hot/cold tiering strategy | ||
| 778 | |||
| 779 | #### Hot tier | ||
| 780 | - `set_membership.set_type = 'hot_set'` | ||
| 781 | - High-traffic songs, high-traffic assets, and the hot copyrighted catalog | ||
| 782 | - Prioritize keeping the full exact + semantic feature set | ||
| 783 | |||
| 784 | #### Warm tier | ||
| 785 | - `reference_set` | ||
| 786 | - Main reference catalog | ||
| 787 | - Maintain full exact coverage; backfill semantic in batches | ||
| 788 | |||
| 789 | #### Cold tier | ||
| 790 | - Long-tail songs | ||
| 791 | - Secure the main chain and exact first | ||
| 792 | - Backfill semantic on demand | ||
| 793 | |||
| 794 | ### 17.6 Recommended batch audit SQL | ||
| 795 | |||
| 796 | #### Find assets without any window | ||
| 797 | |||
| 798 | ```sql | ||
| 799 | select a.object_id as asset_id, a.song_id, a.storage_uri | ||
| 800 | from audio_object a | ||
| 801 | where a.object_type = 'asset' | ||
| 802 | and not exists ( | ||
| 803 | select 1 | ||
| 804 | from audio_object w | ||
| 805 | where w.parent_object_id = a.object_id | ||
| 806 | and w.object_type = 'window' | ||
| 807 | ) | ||
| 808 | order by a.song_id, a.object_id; | ||
| 809 | ``` | ||
| 810 | |||
| 811 | #### Find windows without a fingerprint | ||
| 812 | |||
| 813 | ```sql | ||
| 814 | select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id | ||
| 815 | from audio_object w | ||
| 816 | where w.object_type = 'window' | ||
| 817 | and not exists ( | ||
| 818 | select 1 | ||
| 819 | from feature_fact ff | ||
| 820 | where ff.object_id = w.object_id | ||
| 821 | and ff.feature_type = 'fingerprint' | ||
| 822 | and ff.model_name = 'chromaprint' | ||
| 823 | ) | ||
| 824 | order by w.song_id, w.parent_object_id, w.start_ms; | ||
| 825 | ``` | ||
| 826 | |||
| 827 | #### Find windows without a MERT embedding | ||
| 828 | |||
| 829 | ```sql | ||
| 830 | select w.object_id as window_id, w.song_id, w.parent_object_id as asset_id | ||
| 831 | from audio_object w | ||
| 832 | where w.object_type = 'window' | ||
| 833 | and not exists ( | ||
| 834 | select 1 | ||
| 835 | from feature_fact ff | ||
| 836 | where ff.object_id = w.object_id | ||
| 837 | and ff.feature_type = 'embedding' | ||
| 838 | and ff.model_name = 'mert-v1-95m' | ||
| 839 | and ff.model_version = 'hf-main' | ||
| 840 | and ff.feature_set_name = 'mert_5s_hop2.5_v1' | ||
| 841 | ) | ||
| 842 | order by w.song_id, w.parent_object_id, w.start_ms; | ||
| 843 | ``` | ||
| 844 | |||
| 845 | ### 17.7 The safest Phase-1 execution order | ||
| 846 | |||
| 847 | 1. Persist song/asset/window in full first | ||
| 848 | 2. Roll out `chromaprint` to full coverage first | ||
| 849 | 3. Backfill `mert-v1-95m` in batches as the first semantic baseline | ||
| 850 | 4. Run `muq-base` as the challenger | ||
| 851 | 5. Backfill by hot/reference/cold tier | ||
| 852 | 6. Tune the dual-lane fusion weights last | ||
| 853 | |||
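| 854 | ### 17.8 The strategy in one sentence | ||
| 855 | |||
| 856 | > At scale, do not aim for "every model complete" first; instead make sure that **the object main chain is complete, exact is usable first, semantic can be backfilled sustainably, and sets can be governed by tier**. | ... | ... |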
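The phase order and batch sizes from section 17 can be expressed as a small planning helper. This is a sketch under stated assumptions rather than part of the committed docs: the `PHASES` list, `chunked`, and `next_phase` are hypothetical names introduced here, and the only inputs taken from the text are the A-to-D phase order and the per-batch size ranges.

```python
from itertools import islice

# Phase order taken from section 17.1; the tuple layout is an illustrative
# assumption, not part of the schema.
PHASES = [
    ("A", "main chain: song / asset / window"),
    ("B", "exact lane: chromaprint"),
    ("C", "semantic baseline: mert-v1-95m"),
    ("D", "challenger / fallback: muq-base, local_wavehash_embed, ecapa-tdnn"),
]


def chunked(ids, batch_size):
    """Yield successive lists of at most batch_size ids."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def next_phase(done):
    """Return the first (phase_id, label) not in `done`, or None if all are done."""
    for phase_id, label in PHASES:
        if phase_id not in done:
            return phase_id, label
    return None
```

With these helpers, 300k songs at the upper bound of 20k songs per batch works out to 15 main-data batches (`sum(1 for _ in chunked(range(300_000), 20_000))`), and `next_phase({"A"})` points at the exact lane, which matches the rule that chromaprint coverage comes before any semantic backfill.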