Prove asset-level embedding upserts against live PostgreSQL

Constraint: The schema already declared asset-level idempotency, but without live evidence future work could mistake it for an unverified design note.
Rejected: Rely on DDL inspection alone | It would not prove duplicate inserts are blocked and upserts reuse the same embedding row.
Confidence: high
Scope-risk: narrow
Directive: Keep asset-level writer implementations aligned with the verified ON CONFLICT (feature_set_id, asset_id) WHERE window_id IS NULL contract.
Tested: /usr/local/miniconda3/bin/python -m py_compile scripts/validate_audio_embedding_asset_upsert_live.py; git diff --check; /usr/local/miniconda3/bin/python scripts/validate_audio_embedding_asset_upsert_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_asset_upsert_test --output data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json
Not-tested: No production semantic writer uses the asset-level contract yet; this commit validates the DB contract, not an end-to-end extractor.
Showing 6 changed files with 86 additions and 0 deletions
| 1 | { | ||
| 2 | "schema": "acr_asset_upsert_test", | ||
| 3 | "dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2", | ||
| 4 | "seed_ids": { | ||
| 5 | "model_id": 1, | ||
| 6 | "feature_set_id": 1, | ||
| 7 | "canonical_song_id": 1, | ||
| 8 | "work_id": 1, | ||
| 9 | "recording_id": 1, | ||
| 10 | "asset_id": 1 | ||
| 11 | }, | ||
| 12 | "first_insert_embedding_id": 1, | ||
| 13 | "duplicate_insert_guard": { | ||
| 14 | "passed": true, | ||
| 15 | "error_type": "UniqueViolation", | ||
| 16 | "message": "duplicate key value violates unique constraint \"uq_audio_embedding_feature_asset\"" | ||
| 17 | }, | ||
| 18 | "upsert_embedding_id": 1, | ||
| 19 | "same_embedding_id_reused": true, | ||
| 20 | "counts": { | ||
| 21 | "audio_embedding": 1, | ||
| 22 | "audio_embedding_vector_192": 1 | ||
| 23 | }, | ||
| 24 | "final_state": { | ||
| 25 | "embedding_id": 1, | ||
| 26 | "asset_id": 1, | ||
| 27 | "window_id": null, | ||
| 28 | "checksum": "checksum-v2", | ||
| 29 | "embedding_uri": "inline://asset-probe-upsert", | ||
| 30 | "metadata_json": { | ||
| 31 | "probe": "asset_level_upsert_v2" | ||
| 32 | }, | ||
| 33 | "vector_literal": "[0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2]" | ||
| 34 | }, | ||
| 35 | "passed": true | ||
| 36 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
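The report above encodes a few invariants that a reader can re-check mechanically. Below is a minimal sketch of such a check in Python; the function name `check_upsert_report` and the trimmed `SAMPLE_REPORT` dictionary are illustrative assumptions (the sample copies only a subset of the report's fields and omits `vector_literal`), not part of the committed script.

```python
def check_upsert_report(report: dict) -> list[str]:
    """Return a list of problems found in the report; an empty list means consistent."""
    problems = []

    # The plain duplicate INSERT must have been rejected with a unique violation.
    guard = report.get("duplicate_insert_guard", {})
    if not guard.get("passed"):
        problems.append("duplicate insert was not blocked")
    if guard.get("error_type") != "UniqueViolation":
        problems.append(f"unexpected error type: {guard.get('error_type')!r}")

    # The upsert must land on the row created by the first insert.
    if report.get("first_insert_embedding_id") != report.get("upsert_embedding_id"):
        problems.append("upsert did not reuse the first embedding_id")

    # Both tables must still hold exactly one row.
    counts = report.get("counts", {})
    for table in ("audio_embedding", "audio_embedding_vector_192"):
        if counts.get(table) != 1:
            problems.append(f"{table} has {counts.get(table)!r} rows, expected 1")

    # The surviving row must be the asset-level one (no window attached).
    final = report.get("final_state", {})
    if final.get("window_id") is not None:
        problems.append("final row is not asset-level (window_id is set)")
    if final.get("embedding_id") != report.get("upsert_embedding_id"):
        problems.append("final embedding_id differs from the upserted one")

    return problems


# Trimmed copy of the report fields used by the checks (illustrative subset).
SAMPLE_REPORT = {
    "first_insert_embedding_id": 1,
    "duplicate_insert_guard": {"passed": True, "error_type": "UniqueViolation"},
    "upsert_embedding_id": 1,
    "counts": {"audio_embedding": 1, "audio_embedding_vector_192": 1},
    "final_state": {"embedding_id": 1, "asset_id": 1, "window_id": None},
}

if __name__ == "__main__":
    print(check_upsert_report(SAMPLE_REPORT))
```

Returning a list of problems rather than a single boolean keeps every violated invariant visible at once, which is convenient if the report format grows.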
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
| 3 | - Added `scripts/validate_audio_embedding_asset_upsert_live.py` and `audio_embedding_asset_upsert_live_report.json`, which verify `uq_audio_embedding_feature_asset` against a live database in the isolated schema `acr_asset_upsert_test`: a repeated plain insert raises `UniqueViolation`, whereas `ON CONFLICT ... DO UPDATE` reuses the same `embedding_id`, and the row counts of `audio_embedding` and `audio_embedding_vector_192` both stay at `1`. | ||
| 3 | - Added `scripts/run_phase1_embedding_preflight_matrix_live.py` and `phase1_embedding_preflight_matrix_report.json`, which run a unified live preflight matrix over the four `mert / muq / ecapa` semantic jobs; all 4 jobs consistently end in `preflight_failed`, and the blockers have converged on the unmounted `/workspace/downloads` and the missing semantic-model runtime rather than sporadic failures of individual jobs. | 4 | - Added `scripts/run_phase1_embedding_preflight_matrix_live.py` and `phase1_embedding_preflight_matrix_report.json`, which run a unified live preflight matrix over the four `mert / muq / ecapa` semantic jobs; all 4 jobs consistently end in `preflight_failed`, and the blockers have converged on the unmounted `/workspace/downloads` and the missing semantic-model runtime rather than sporadic failures of individual jobs. |
| 4 | - Updated `run_embedding_job.py`, moving the semantic lane from "dry-run only" to a preflight write contract of "real scope read + vector table check + runtime dependency check + missing-audio check + recording the failed status in PostgreSQL"; the current live `mert` job writes both `unreadable_audio_assets` and `model_runtime_unavailable` into `feature_extraction_job.metadata_json`, so this is no longer just a design on paper. | 5 | - Updated `run_embedding_job.py`, moving the semantic lane from "dry-run only" to a preflight write contract of "real scope read + vector table check + runtime dependency check + missing-audio check + recording the failed status in PostgreSQL"; the current live `mert` job writes both `unreadable_audio_assets` and `model_runtime_unavailable` into `feature_extraction_job.metadata_json`, so this is no longer just a design on paper. |
| 5 | - Added two idempotency unique keys to `audio_embedding`, `UNIQUE(feature_set_id, window_id) WHERE window_id IS NOT NULL` and `UNIQUE(feature_set_id, asset_id) WHERE window_id IS NULL AND asset_id IS NOT NULL`, fixing the primary-key strategy for future real `MERT / MuQ / ECAPA` upsert writes. | 6 | - Added two idempotency unique keys to `audio_embedding`, `UNIQUE(feature_set_id, window_id) WHERE window_id IS NOT NULL` and `UNIQUE(feature_set_id, asset_id) WHERE window_id IS NULL AND asset_id IS NOT NULL`, fixing the primary-key strategy for future real `MERT / MuQ / ECAPA` upsert writes. | ... | ... |
| ... | @@ -343,6 +343,17 @@ The MERT 5s/2.5s job (`extraction_job_id=2`) has already been verified live on `acr_test`: | ... | @@ -343,6 +343,17 @@ The MERT 5s/2.5s job (`extraction_job_id=2`) has already been verified live on `acr_test`: |
| 343 | 343 | ||
| 344 | without having to query before writing. | 344 | without having to query before writing. |
| 345 | 345 | ||
| 346 | Of these two unique keys, the asset-level path now also has live evidence: | ||
| 347 | |||
| 348 | - `scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 349 | - `audio_embedding_asset_upsert_live_report.json` | ||
| 350 | |||
| 351 | Verified: | ||
| 352 | |||
| 353 | - A repeated `INSERT` is rejected by `uq_audio_embedding_feature_asset` | ||
| 354 | - `ON CONFLICT ... DO UPDATE` reuses the same `embedding_id` | ||
| 355 | - The row counts of `audio_embedding` / `audio_embedding_vector_192` both stay at `1` | ||
| 356 | |||
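| 346 | ### Next replacement point | 357 | ### Next replacement point |
| 347 | 358 | ||
| 348 | Once the runtime and the audio mount are in place, only the guarded failure path needs to be replaced with real inference: | 359 | Once the runtime and the audio mount are in place, only the guarded failure path needs to be replaced with real inference: | ... | ... |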
| ... | @@ -774,3 +774,40 @@ cd /workspace/acr-engine | ... | @@ -774,3 +774,40 @@ cd /workspace/acr-engine |
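| 774 | - What currently blocks the Phase-1 encoder-only rollout is: | 774 | - What currently blocks the Phase-1 encoder-only rollout is: |
| 775 | 1. The `/workspace/downloads` audio mount | 775 | 1. The `/workspace/downloads` audio mount |
| 776 | 2. Installing the model runtime dependencies | 776 | 2. Installing the model runtime dependencies |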
| 777 | |||
| 778 | |||
| 779 | ## New: asset-level embedding upsert live validation | ||
| 780 | |||
| 781 | To move `uq_audio_embedding_feature_asset` from "declared in DDL" to "backed by real evidence", this round adds: | ||
| 782 | |||
| 783 | - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 784 | - `acr-engine/data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json` | ||
| 785 | |||
| 786 | ### Validation steps | ||
| 787 | |||
| 788 | In the isolated schema `acr_asset_upsert_test`, the script: | ||
| 789 | |||
| 790 | 1. Seeds a minimal master-data graph: `song -> work -> recording -> asset` | ||
| 791 | 2. Inserts the first asset-level embedding with `window_id IS NULL` | ||
| 792 | 3. Attempts a plain duplicate `INSERT` | ||
| 793 | 4. Expects it to be rejected by `uq_audio_embedding_feature_asset` | ||
| 794 | 5. Runs an `ON CONFLICT ... DO UPDATE` | ||
| 795 | 6. Verifies that there is still only `1` `audio_embedding` row and `1` `audio_embedding_vector_192` row | ||
| 796 | |||
| 797 | ### Current results | ||
| 798 | |||
| 799 | | Item | Result | | ||
| 800 | |---|---| | ||
| 801 | | First `embedding_id` | `1` | | ||
| 802 | | Repeated plain `INSERT` | `UniqueViolation` | | ||
| 803 | | Unique key name | `uq_audio_embedding_feature_asset` | | ||
| 804 | | `embedding_id` after upsert | `1` | | ||
| 805 | | `same_embedding_id_reused` | `true` | | ||
| 806 | | `audio_embedding` row count | `1` | | ||
| 807 | | `audio_embedding_vector_192` row count | `1` | | ||
| 808 | | Final `checksum` | `checksum-v2` | | ||
| 809 | |||
| 810 | Conclusion: | ||
| 811 | |||
| 812 | - The asset-level unique key does not merely "exist on paper"; it is enforced on a live PostgreSQL instance | ||
| 813 | - If an asset-level semantic writer is added later, it can reuse the same `ON CONFLICT (feature_set_id, asset_id) ...` contract directly | ... | ... |
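The validation sequence described in this document (first insert, rejected duplicate, upsert onto the same row) can be reproduced without a database server. The following is a minimal sketch that swaps in Python's built-in `sqlite3` module (SQLite 3.24 or later) for PostgreSQL, so it demonstrates the same partial-unique-index upsert pattern rather than the live PostgreSQL behavior itself. The trimmed table layout, the `checksum-v1` seed value, and the function name `run_asset_upsert_probe` are illustrative assumptions and are not taken from the committed script; the index predicate follows the `UNIQUE(feature_set_id, asset_id) WHERE window_id IS NULL AND asset_id IS NOT NULL` declaration quoted in the changelog.

```python
import sqlite3

# Hypothetical, trimmed stand-in for the real audio_embedding table.
DDL = """
CREATE TABLE audio_embedding (
    embedding_id   INTEGER PRIMARY KEY,
    feature_set_id INTEGER NOT NULL,
    asset_id       INTEGER,
    window_id      INTEGER,
    checksum       TEXT
);
CREATE UNIQUE INDEX uq_audio_embedding_feature_asset
    ON audio_embedding (feature_set_id, asset_id)
    WHERE window_id IS NULL AND asset_id IS NOT NULL;
"""

PLAIN_INSERT = """
INSERT INTO audio_embedding (feature_set_id, asset_id, window_id, checksum)
VALUES (?, ?, NULL, ?)
"""

# The conflict target repeats the index predicate so the partial index is matched.
UPSERT = PLAIN_INSERT + """
ON CONFLICT (feature_set_id, asset_id)
    WHERE window_id IS NULL AND asset_id IS NOT NULL
DO UPDATE SET checksum = excluded.checksum
"""


def run_asset_upsert_probe() -> dict:
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)

    # Step 1: first asset-level row (window_id IS NULL).
    first_id = conn.execute(PLAIN_INSERT, (1, 1, "checksum-v1")).lastrowid

    # Step 2: a plain duplicate INSERT should hit the unique index.
    try:
        conn.execute(PLAIN_INSERT, (1, 1, "checksum-v1"))
        duplicate_blocked = False
    except sqlite3.IntegrityError:
        duplicate_blocked = True

    # Step 3: the upsert should update the existing row in place.
    conn.execute(UPSERT, (1, 1, "checksum-v2"))

    rows = conn.execute(
        "SELECT embedding_id, checksum FROM audio_embedding"
    ).fetchall()
    conn.close()
    return {
        "duplicate_blocked": duplicate_blocked,
        "row_count": len(rows),
        "same_embedding_id_reused": len(rows) == 1 and rows[0][0] == first_id,
        "final_checksum": rows[0][1] if rows else None,
    }


if __name__ == "__main__":
    print(run_asset_upsert_probe())
```

In SQLite the conflict target's `WHERE` clause has to match the partial index predicate, which is why the sketch repeats it in full. Whether a shorter predicate is accepted by PostgreSQL's index inference is not something this sketch establishes.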
| ... | @@ -192,6 +192,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql | ... | @@ -192,6 +192,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql |
| 192 | - The semantic lane has also completed its live failure contract: `run_embedding_job.py` now exposes both `unreadable_audio_assets` and `model_runtime_unavailable` instead of disguising the failure as completed | 192 | - The semantic lane has also completed its live failure contract: `run_embedding_job.py` now exposes both `unreadable_audio_assets` and `model_runtime_unavailable` instead of disguising the failure as completed |
| 193 | - `audio_embedding` now has unique keys for both the window and asset paths, so a future real encoder only needs to swap in the inference adapter to reuse the same upsert contract | 193 | - `audio_embedding` now has unique keys for both the window and asset paths, so a future real encoder only needs to swap in the inference adapter to reuse the same upsert contract |
| 194 | - `scripts/run_phase1_embedding_preflight_matrix_live.py` now runs end to end, and the 4 semantic jobs (mert/muq/ecapa) are all consistently marked `preflight_failed` on `acr_test`; the shared blockers have converged on the missing `/workspace/downloads` plus the missing semantic-model runtime | 194 | - `scripts/run_phase1_embedding_preflight_matrix_live.py` now runs end to end, and the 4 semantic jobs (mert/muq/ecapa) are all consistently marked `preflight_failed` on `acr_test`; the shared blockers have converged on the missing `/workspace/downloads` plus the missing semantic-model runtime |
| 195 | - `scripts/validate_audio_embedding_asset_upsert_live.py` has verified `uq_audio_embedding_feature_asset` in the isolated schema `acr_asset_upsert_test`: a duplicate insert is rejected by the unique key and an upsert reuses the same `embedding_id`, so the asset-level idempotency key now also has real evidence | ||
| 195 | - `phase1_hot_reference_v1` on `acr_test` has actually been filled out to `20` reference members, so the scope the worker dry-run currently sees is `20 recordings / 20 assets / 20 windows` | 196 | - `phase1_hot_reference_v1` on `acr_test` has actually been filled out to `20` reference members, so the scope the worker dry-run currently sees is `20 recordings / 20 assets / 20 windows` |
| 196 | - The worker contract now has basic precondition-state protection; re-running the same chromaprint dry-run job is explicitly rejected by `expected_status=pending`, with evidence in `phase1_worker_double_claim_guard_report.json` | 197 | - The worker contract now has basic precondition-state protection; re-running the same chromaprint dry-run job is explicitly rejected by `expected_status=pending`, with evidence in `phase1_worker_double_claim_guard_report.json` |
| 197 | - The exact lane's `run_chromaprint_job.py` now has a non-dry-run write path; its current live result on `acr_test` is an explicit `failed` because `/workspace/downloads/...` is missing, rather than continuing to pretend `completed` | 198 | - The exact lane's `run_chromaprint_job.py` now has a non-dry-run write path; its current live result on `acr_test` is an explicit `failed` because `/workspace/downloads/...` is missing, rather than continuing to pretend `completed` | ... | ... |