Why the docs need an explicit vector payload storage contract
Constraint: Phase-1 embedding storage must explain where vectors live, how feature_fact points to them, and how hot/cold migration works at scale
Rejected: Leave vector payload placement implicit | does not give operators a stable contract for ANN loading, backfill, or cleanup
Confidence: high
Scope-risk: narrow
Directive: Keep embedding payload guidance split between feature_fact metadata and vector-side storage unless the physical schema default changes
Tested: markdown link check on /workspace/docs after adding vector payload storage and lifecycle guidance
Not-tested: No live database rerun; this is a documentation-only clarification over the current schema path
Showing 3 changed files with 254 additions and 0 deletions
| ... | @@ -9,6 +9,7 @@ | ... | @@ -9,6 +9,7 @@ |
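| 9 | - Continued documenting data binding and model persistence: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` now spell out the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`. | 9 | - Continued documenting data binding and model persistence: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` now spell out the binding-field relationships along `media_entity -> audio_object(asset/window) -> feature_fact`, and give the Phase-1 storage conventions and SQL samples for `chromaprint / mert-v1-95m / muq-base / local_wavehash_embed / ecapa-tdnn`. |
| 10 | - `docs/postgres_db_schema_samples.md` gains a complete `one song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and routine inspection. | 10 | - `docs/postgres_db_schema_samples.md` gains a complete `one song -> multiple assets -> multiple windows -> multiple models` sample, with a missing-model scan SQL and an asset-level feature-completeness check SQL, to support Phase-1 batch backfill and routine inspection. |
| 11 | - Continued documenting the at-scale rollout conventions: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` add, for `1M audio files / 300K songs`, the batch ingestion order, index build order, hot/cold tiering strategy, and inspection SQL for main-chain and model gaps, making explicit that Phase-1 secures the main chain first, rolls out exact features fully, and then backfills semantic features in batches. | 11 | - Continued documenting the at-scale rollout conventions: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` add, for `1M audio files / 300K songs`, the batch ingestion order, index build order, hot/cold tiering strategy, and inspection SQL for main-chain and model gaps, making explicit that Phase-1 secures the main chain first, rolls out exact features fully, and then backfills semantic features in batches. |
| 12 | - Continued documenting the embedding persistence spec: `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md` add the field semantics of `embedding_uri / vector_table_name / embedding_dim`, the naming convention for vector-side tables, the hot/cold migration and reclamation strategy, and sample SQL joining `feature_fact -> vector table`. | ||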
| 12 | 13 | ||
| 13 | ## 2026-06-04 | 14 | ## 2026-06-04 |
| 14 | 15 | ... | ... |
| ... | @@ -821,3 +821,105 @@ hot_set -> high-traffic copyrighted songs | ... | @@ -821,3 +821,105 @@ hot_set -> high-traffic copyrighted songs |
| 821 | reference_set -> main reference catalog | 821 | reference_set -> main reference catalog |
| 822 | cold -> long-tail catalog; main chain + exact first | 822 | cold -> long-tail catalog; main chain + exact first |
| 823 | ``` | 823 | ``` |
| 824 | |||
| 825 | --- | ||
| 826 | |||
| 827 | ## 16. Vector table / embedding file storage samples | ||
| 828 | |||
| 829 | ### 16.1 How `feature_fact` records the embedding location | ||
| 830 | |||
| 831 | ```sql | ||
| 832 | insert into feature_fact ( | ||
| 833 | feature_type, | ||
| 834 | object_id, | ||
| 835 | song_id, | ||
| 836 | model_name, | ||
| 837 | model_version, | ||
| 838 | feature_set_name, | ||
| 839 | feature_schema_ver, | ||
| 840 | embedding_dim, | ||
| 841 | embedding_uri, | ||
| 842 | vector_table_name, | ||
| 843 | checksum, | ||
| 844 | metadata_json | ||
| 845 | ) values ( | ||
| 846 | 'embedding', | ||
| 847 | :window_id, | ||
| 848 | :song_id, | ||
| 849 | 'mert-v1-95m', | ||
| 850 | 'hf-main', | ||
| 851 | 'mert_5s_hop2.5_v1', | ||
| 852 | 'v1', | ||
| 853 | 768, | ||
| 854 | 's3://acr-emb/phase1/mert-v1-95m/song_1001/asset_2001/window_3001_mert_5s_hop2.5_v1.npy', | ||
| 855 | 'audio_embedding_vector_768', | ||
| 856 | 'sha256:emb-demo-001', | ||
| 857 | '{"storage_tier":"hot"}'::jsonb | ||
| 858 | ); | ||
| 859 | ``` | ||
| 860 | |||
| 861 | ### 16.2 A recommended vector-side table sample | ||
| 862 | |||
| 863 | > This is a logical sample; the actual vector type can follow whatever your pgvector version supports. | ||
| 864 | |||
| 865 | ```sql | ||
| 866 | create table if not exists audio_embedding_vector_768 ( | ||
| 867 | embedding_row_id bigserial primary key, | ||
| 868 | feature_id bigint not null references feature_fact(feature_id), | ||
| 869 | vector_dim integer not null default 768, | ||
| 870 | embedding_vector vector(768), | ||
| 871 | created_at timestamptz not null default now() | ||
| 872 | ); | ||
| 873 | ``` | ||
| 874 | |||
| 875 | ### 16.3 Join query between feature_fact and the vector table | ||
| 876 | |||
| 877 | ```sql | ||
| 878 | select ff.feature_id, | ||
| 879 | ff.song_id, | ||
| 880 | ff.object_id as window_id, | ||
| 881 | ff.model_name, | ||
| 882 | ff.feature_set_name, | ||
| 883 | ff.embedding_dim, | ||
| 884 | ff.embedding_uri, | ||
| 885 | ff.vector_table_name, | ||
| 886 | v.embedding_row_id | ||
| 887 | from feature_fact ff | ||
| 888 | left join audio_embedding_vector_768 v | ||
| 889 | on v.feature_id = ff.feature_id | ||
| 890 | where ff.feature_type = 'embedding' | ||
| 891 | and ff.model_name = 'mert-v1-95m' | ||
| 892 | and ff.feature_set_name = 'mert_5s_hop2.5_v1'; | ||
| 893 | ``` | ||
| 894 | |||
| 895 | ### 16.4 Find embeddings not yet loaded into the vector table | ||
| 896 | |||
| 897 | ```sql | ||
| 898 | select ff.feature_id, | ||
| 899 | ff.song_id, | ||
| 900 | ff.object_id as window_id, | ||
| 901 | ff.embedding_uri | ||
| 902 | from feature_fact ff | ||
| 903 | left join audio_embedding_vector_768 v | ||
| 904 | on v.feature_id = ff.feature_id | ||
| 905 | where ff.feature_type = 'embedding' | ||
| 906 | and ff.embedding_dim = 768 | ||
| 907 | and ff.vector_table_name = 'audio_embedding_vector_768' | ||
| 908 | and v.feature_id is null | ||
| 909 | order by ff.song_id, ff.object_id; | ||
| 910 | ``` | ||
| 911 | |||
| 912 | ### 16.5 Find cold-tier embeddings that can stay out of the hot index for now | ||
| 913 | |||
| 914 | ```sql | ||
| 915 | select ff.feature_id, | ||
| 916 | ff.song_id, | ||
| 917 | ff.object_id, | ||
| 918 | ff.embedding_uri, | ||
| 919 | ff.metadata_json | ||
| 920 | from feature_fact ff | ||
| 921 | where ff.feature_type = 'embedding' | ||
| 922 | and coalesce(ff.metadata_json->>'storage_tier', 'cold') = 'cold' | ||
| 923 | and ff.vector_table_name is null | ||
| 924 | order by ff.song_id, ff.object_id; | ||
| 925 | ``` | ... | ... |
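The two inspection queries in 16.4 and 16.5 amount to a small decision rule per `feature_fact` row. The sketch below restates that rule in Python as an illustration only. It assumes rows arrive as plain dicts with the column names used above, that `loaded_feature_ids` is the set of `feature_id` values already present in the relevant vector table, and that the status labels (including `missing_table_name`, which the queries above do not cover) are hypothetical names chosen for this example.

```python
from typing import Set


def backfill_status(row: dict, loaded_feature_ids: Set[int]) -> str:
    """Classify one feature_fact-like row, mirroring queries 16.4 and 16.5."""
    if row.get("feature_type") != "embedding":
        return "not_embedding"

    # Same default as the SQL: a missing storage_tier is treated as 'cold'.
    tier = (row.get("metadata_json") or {}).get("storage_tier", "cold")
    table = row.get("vector_table_name")

    if table is None:
        # 16.5: cold rows without a vector table can stay out of the hot index.
        # A non-cold row without a table name is flagged (hypothetical status).
        return "cold_uri_only" if tier == "cold" else "missing_table_name"

    if row["feature_id"] not in loaded_feature_ids:
        # 16.4: a vector table is declared but no vector row exists yet.
        return "needs_vector_load"
    return "loaded"


rows = [
    {"feature_type": "embedding", "feature_id": 1,
     "vector_table_name": "audio_embedding_vector_768",
     "metadata_json": {"storage_tier": "hot"}},
    {"feature_type": "embedding", "feature_id": 2,
     "vector_table_name": None, "metadata_json": {}},
]
for r in rows:
    print(r["feature_id"], backfill_status(r, loaded_feature_ids=set()))
```

Unlike query 16.4, this sketch does not filter by `embedding_dim` or by a specific table name; the caller is assumed to pass the loaded IDs for the table that the row declares.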
| ... | @@ -854,3 +854,154 @@ order by w.song_id, w.parent_object_id, w.start_ms; | ... | @@ -854,3 +854,154 @@ order by w.song_id, w.parent_object_id, w.start_ms; |
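| 854 | ### 17.8 One-sentence strategy | 854 | ### 17.8 One-sentence strategy |
| 855 | 855 | ||
| 856 | > At large scale, do not aim first for "every model complete"; first ensure that **the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier**. | 856 | > At large scale, do not aim first for "every model complete"; first ensure that **the object main chain is complete, exact is usable first, semantic can be backfilled continuously, and sets can be governed by tier**. |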
| 857 | |||
| 858 | --- | ||
| 859 | |||
| 860 | ## 18. Vector table / embedding file storage specification | ||
| 861 | |||
| 862 | Currently, `feature_fact` handles embeddings with a **metadata + external payload location** design: | ||
| 863 | |||
| 864 | - `feature_fact` records: | ||
| 865 | - which `window` the embedding belongs to | ||
| 866 | - which `song` it belongs to | ||
| 867 | - which `model_name/model_version/feature_set_name` produced it | ||
| 868 | - its dimensionality | ||
| 869 | - where the vector actually lives | ||
| 870 | |||
| 871 | In other words: | ||
| 872 | |||
| 873 | ```text | ||
| 874 | feature_fact = index card for the vector fact | ||
| 875 | embedding_uri / vector_table_name = vector payload location | ||
| 876 | ``` | ||
| 877 | |||
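| 878 | ### 18.1 Why large vectors are not stored directly in `feature_fact` | ||
| 879 | |||
| 880 | Because the current target scale is 1M audio files: | ||
| 881 | - `feature_fact` should first serve as the searchable, aggregatable, auditable "fact table" | ||
| 882 | - large vector payloads are better placed in: | ||
| 883 | - external files (`embedding_uri`) | ||
| 884 | - a dedicated vector table (`vector_table_name`) | ||
| 885 | |||
| 886 | Benefits: | ||
| 887 | - the main-chain table stays lighter | ||
| 888 | - hot/cold migration is easier | ||
| 889 | - different dimensions can be governed separately | ||
| 890 | - different ANN index strategies can evolve independently | ||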
| 891 | |||
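| 892 | ### 18.2 Currently recommended field semantics | ||
| 893 | |||
| 894 | | Field | Meaning | Required when | | ||
| 895 | |---|---|---| | ||
| 896 | | `embedding_dim` | Vector dimensionality | Required when `feature_type='embedding'` | | ||
| 897 | | `embedding_uri` | Vector file or object-storage address | Required when stored as an external file | | ||
| 898 | | `vector_table_name` | Physical vector table name | Required when stored in pgvector / a vector-side table | | ||
| 899 | | `checksum` | Digest of the vector payload | Recommended | | ||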
| 900 | |||
| 901 | ### 18.3 Recommended table naming | ||
| 902 | |||
| 903 | Split tables by dimension: | ||
| 904 | |||
| 905 | ```text | ||
| 906 | audio_embedding_vector_8_placeholder | ||
| 907 | audio_embedding_vector_192 | ||
| 908 | audio_embedding_vector_768 | ||
| 909 | audio_embedding_vector_1024 | ||
| 910 | ``` | ||
| 911 | |||
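| 912 | Reasons: | ||
| 913 | - ANN indexes for different dimensions generally should not share a table | ||
| 914 | - index build cost is easier to control | ||
| 915 | - each table can scale independently as models evolve | ||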
| 916 | |||
| 917 | ### 18.4 Recommended embedding_uri convention | ||
| 918 | |||
| 919 | Standardize on one of the following schemes: | ||
| 920 | |||
| 921 | ```text | ||
| 922 | s3://bucket/path/to/embedding.npy | ||
| 923 | oss://bucket/path/to/embedding.npy | ||
| 924 | file:///data/embeddings/song_xxx/window_xxx.npy | ||
| 925 | ``` | ||
| 926 | |||
| 927 | The path should encode at least: | ||
| 928 | - `song_id` | ||
| 929 | - `asset_id` | ||
| 930 | - `window_id` | ||
| 931 | - `model_name` | ||
| 932 | - `feature_set_name` | ||
| 933 | |||
| 934 | For example: | ||
| 935 | |||
| 936 | ```text | ||
| 937 | s3://acr-emb/phase1/mert-v1-95m/song_1001/asset_2001/window_3001_mert_5s_hop2.5_v1.npy | ||
| 938 | ``` | ||
| 939 | |||
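| 940 | ### 18.5 Recommended Phase-1 strategy | ||
| 941 | |||
| 942 | #### Small scale / getting the pipeline working first | ||
| 943 | Priorities: | ||
| 944 | - `embedding_uri` is required | ||
| 945 | - `vector_table_name` may hold a placeholder for now or be filled in later | ||
| 946 | |||
| 947 | #### Medium scale / ANN retrieval needed | ||
| 948 | Priorities: | ||
| 949 | - keep `embedding_uri` | ||
| 950 | - fill in `vector_table_name` at the same time | ||
| 951 | - start maintaining vector-side tables per dimension | ||
| 952 | |||
| 953 | #### Large scale / hot-cold tiering | ||
| 954 | Priorities: | ||
| 955 | - hot-tier embeddings keep both `embedding_uri + vector_table_name` | ||
| 956 | - the cold tier may keep only `embedding_uri` | ||
| 957 | - build nearest-neighbor indexes only for the hot tier / main reference | ||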
| 958 | |||
| 959 | ### 18.6 Recommended hot/cold migration strategy | ||
| 960 | |||
| 961 | #### Hot tier | ||
| 962 | - `hot_set` members | ||
| 963 | - popular copyrighted songs in the main reference | ||
| 964 | - Keep: `feature_fact + vector_table + embedding_uri` | ||
| 965 | |||
| 966 | #### Warm tier | ||
| 967 | - the full main reference catalog | ||
| 968 | - Keep: `feature_fact + embedding_uri` | ||
| 969 | - the vector table can be partially loaded on demand | ||
| 970 | |||
| 971 | #### Cold tier | ||
| 972 | - long-tail songs | ||
| 973 | - Minimum: `feature_fact + embedding_uri` | ||
| 974 | - the vector table need not stay resident | ||
| 975 | |||
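| 976 | ### 18.7 Reclamation strategy | ||
| 977 | |||
| 978 | Do not delete `feature_fact` first; reclaim in this order instead: | ||
| 979 | |||
| 980 | 1. First delete or migrate the cold embedding payload files | ||
| 981 | 2. Then reclaim the corresponding rows / index shards in the cold-tier vector table | ||
| 982 | 3. Only if the whole batch is confirmed retired, delete `feature_fact` | ||
| 983 | |||
| 984 | Because: | ||
| 985 | - `feature_fact` is the primary audit record | ||
| 986 | - once deleted, it is hard to trace "was this window ever computed, and with which model" | ||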
| 987 | |||
| 988 | ### 18.8 A recommended logical structure for the vector-side table | ||
| 989 | |||
| 990 | Logically, it should have at least: | ||
| 991 | |||
| 992 | ```text | ||
| 993 | embedding_row_id | ||
| 994 | feature_id | ||
| 995 | vector_dim | ||
| 996 | embedding_vector | ||
| 997 | created_at | ||
| 998 | ``` | ||
| 999 | |||
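| 1000 | Key points: | ||
| 1001 | - `feature_id` points back to `feature_fact.feature_id` | ||
| 1002 | - the vector-side table stores only the vector payload and a minimal set of index fields | ||
| 1003 | - song/window/model semantics remain authoritative in `feature_fact` | ||
| 1004 | |||
| 1005 | ### 18.9 One-sentence rule | ||
| 1006 | |||
| 1007 | > `feature_fact` answers "who computed it, what was computed, and whom it belongs to"; `embedding_uri / vector_table_name` answer "where the vector payload actually lives". | ... | ... |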
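The field rules in 18.2, the per-dimension table naming in 18.3, and the URI layout in 18.4 can be checked mechanically before a row is written. The following is a minimal sketch under stated assumptions, not part of the documented schema: the helper names (`vector_table_name`, `embedding_uri`, `validate_embedding_fact`) are hypothetical, the URI builder hard-codes the `s3` scheme and the `phase/model/song/asset/window` ordering seen in the single example path, and the `_placeholder` suffix is inferred from the one placeholder table name listed in 18.3.

```python
from typing import List
from urllib.parse import urlparse

# Schemes listed in 18.4.
ALLOWED_SCHEMES = {"s3", "oss", "file"}


def vector_table_name(dim: int) -> str:
    """Per-dimension table name, following the pattern in 18.3."""
    return f"audio_embedding_vector_{dim}"


def embedding_uri(bucket: str, phase: str, model_name: str, song_id: int,
                  asset_id: int, window_id: int, feature_set_name: str) -> str:
    """Build an s3 URI shaped like the example path in 18.4 (assumed layout)."""
    return (f"s3://{bucket}/{phase}/{model_name}/song_{song_id}/"
            f"asset_{asset_id}/window_{window_id}_{feature_set_name}.npy")


def validate_embedding_fact(row: dict) -> List[str]:
    """Return a list of problems for one feature_fact-like row (empty = ok)."""
    problems: List[str] = []
    if row.get("feature_type") != "embedding":
        return problems  # the rules in 18.2 only apply to embeddings

    dim = row.get("embedding_dim")
    uri = row.get("embedding_uri")
    table = row.get("vector_table_name")

    if not dim:
        problems.append("embedding_dim is required for embeddings")
    if not uri and not table:
        problems.append("need embedding_uri or vector_table_name")
    if uri and urlparse(uri).scheme not in ALLOWED_SCHEMES:
        problems.append("embedding_uri scheme must be s3, oss, or file")
    if dim and table:
        expected = vector_table_name(dim)
        # Assumption: a '_placeholder' suffix is tolerated, as in 18.3.
        if table not in (expected, expected + "_placeholder"):
            problems.append(f"vector_table_name should be {expected}")
    return problems


uri = embedding_uri("acr-emb", "phase1", "mert-v1-95m", 1001, 2001, 3001,
                    "mert_5s_hop2.5_v1")
row = {"feature_type": "embedding", "embedding_dim": 768,
       "embedding_uri": uri, "vector_table_name": vector_table_name(768)}
print(uri)
print(validate_embedding_fact(row))
```

Keeping these checks in application code rather than as database constraints is one possible choice; it lets cold-tier rows carry only `embedding_uri` while hot-tier rows carry both fields, as described in 18.5.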