Commit 6ea7365b 6ea7365b235904d9b4fbfcd3b704d8d1cdec2259 by cnb.bofCdSsphPA

Prove asset-level embedding upserts against live PostgreSQL

Constraint: The schema already declared asset-level idempotency, but without live evidence future work could mistake it for an unverified design note.
Rejected: Rely on DDL inspection alone | It would not prove duplicate inserts are blocked and upserts reuse the same embedding row.
Confidence: high
Scope-risk: narrow
Directive: Keep asset-level writer implementations aligned with the verified ON CONFLICT (feature_set_id, asset_id) WHERE window_id IS NULL contract.
Tested: /usr/local/miniconda3/bin/python -m py_compile scripts/validate_audio_embedding_asset_upsert_live.py; git diff --check; /usr/local/miniconda3/bin/python scripts/validate_audio_embedding_asset_upsert_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_asset_upsert_test --output data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json
Not-tested: No production semantic writer uses the asset-level contract yet; this commit validates the DB contract, not an end-to-end extractor.
1 parent 015e3261
1 {
2 "schema": "acr_asset_upsert_test",
3 "dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2",
4 "seed_ids": {
5 "model_id": 1,
6 "feature_set_id": 1,
7 "canonical_song_id": 1,
8 "work_id": 1,
9 "recording_id": 1,
10 "asset_id": 1
11 },
12 "first_insert_embedding_id": 1,
13 "duplicate_insert_guard": {
14 "passed": true,
15 "error_type": "UniqueViolation",
16 "message": "duplicate key value violates unique constraint \"uq_audio_embedding_feature_asset\""
17 },
18 "upsert_embedding_id": 1,
19 "same_embedding_id_reused": true,
20 "counts": {
21 "audio_embedding": 1,
22 "audio_embedding_vector_192": 1
23 },
24 "final_state": {
25 "embedding_id": 1,
26 "asset_id": 1,
27 "window_id": null,
28 "checksum": "checksum-v2",
29 "embedding_uri": "inline://asset-probe-upsert",
30 "metadata_json": {
31 "probe": "asset_level_upsert_v2"
32 },
33 "vector_literal": "[0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2]"
34 },
35 "passed": true
36 }
\ No newline at end of file
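The report above encodes several invariants that can be re-checked mechanically. The following is a minimal sketch of such a check, not part of the commit: the function name `check_asset_upsert_report` and the trimmed `sample` dict are hypothetical, and only the field names are taken from the report itself.

```python
def check_asset_upsert_report(report: dict) -> list:
    """Return a list of problems found; an empty list means the report is consistent."""
    problems = []
    if report.get("passed") is not True:
        problems.append("top-level 'passed' is not true")
    guard = report.get("duplicate_insert_guard", {})
    if guard.get("passed") is not True or guard.get("error_type") != "UniqueViolation":
        problems.append("duplicate insert was not rejected with UniqueViolation")
    if report.get("first_insert_embedding_id") != report.get("upsert_embedding_id"):
        problems.append("upsert did not reuse the first embedding_id")
    counts = report.get("counts", {})
    for table in ("audio_embedding", "audio_embedding_vector_192"):
        if counts.get(table) != 1:
            problems.append(f"{table} row count is {counts.get(table)!r}, expected 1")
    if report.get("final_state", {}).get("window_id") is not None:
        problems.append("final row is not asset-level (window_id is not null)")
    return problems


# Hypothetical trimmed copy of the report (vector_literal and seed_ids omitted).
sample = {
    "first_insert_embedding_id": 1,
    "duplicate_insert_guard": {"passed": True, "error_type": "UniqueViolation"},
    "upsert_embedding_id": 1,
    "counts": {"audio_embedding": 1, "audio_embedding_vector_192": 1},
    "final_state": {"embedding_id": 1, "asset_id": 1, "window_id": None},
    "passed": True,
}
print(check_asset_upsert_report(sample))  # []
```

In practice the dict would come from `json.load` on the report file written by the validation script.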
## 2026-06-04

- Added `scripts/validate_audio_embedding_asset_upsert_live.py` and `audio_embedding_asset_upsert_live_report.json`, which verify `uq_audio_embedding_feature_asset` against a live database in the isolated schema `acr_asset_upsert_test`: a repeated plain insert raises `UniqueViolation`, while `ON CONFLICT ... DO UPDATE` reuses the same `embedding_id`, and the row counts of `audio_embedding` and `audio_embedding_vector_192` both stay at `1`
- Added `scripts/run_phase1_embedding_preflight_matrix_live.py` and `phase1_embedding_preflight_matrix_report.json`, which run a unified live preflight matrix over the four semantic jobs for `mert / muq / ecapa`; all 4 jobs consistently land in `preflight_failed`, and the blockers have converged on `/workspace/downloads` not being mounted and the semantic model runtime being missing, rather than sporadic failures of individual jobs.
- Updated `run_embedding_job.py`, advancing the semantic lane from "dry-run only" to a preflight write contract that covers real scope reads, vector table validation, runtime dependency validation, missing-audio validation, and recording the failed status in PostgreSQL; the current live `mert` job now writes both `unreadable_audio_assets` and `model_runtime_unavailable` into `feature_extraction_job.metadata_json`, so this is no longer only a design on paper.
- Added two idempotency unique keys to `audio_embedding`, `UNIQUE(feature_set_id, window_id) WHERE window_id IS NOT NULL` and `UNIQUE(feature_set_id, asset_id) WHERE window_id IS NULL AND asset_id IS NOT NULL`, which fixes the key strategy for later real `MERT / MuQ / ECAPA` upsert writes.
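The two partial unique keys split `audio_embedding` into a window-level and an asset-level key space. The sketch below illustrates that split with Python's stdlib `sqlite3` as a stand-in for PostgreSQL (both support partial unique indexes); it is not the project's schema. The trimmed table layout, the helper `try_insert`, and the index name `uq_audio_embedding_feature_window` are assumptions; only `uq_audio_embedding_feature_asset` and the two predicates come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audio_embedding (
    embedding_id   INTEGER PRIMARY KEY,
    feature_set_id INTEGER NOT NULL,
    asset_id       INTEGER,
    window_id      INTEGER
);
-- Hypothetical name for the window-level key.
CREATE UNIQUE INDEX uq_audio_embedding_feature_window
    ON audio_embedding (feature_set_id, window_id)
    WHERE window_id IS NOT NULL;
CREATE UNIQUE INDEX uq_audio_embedding_feature_asset
    ON audio_embedding (feature_set_id, asset_id)
    WHERE window_id IS NULL AND asset_id IS NOT NULL;
""")


def try_insert(feature_set_id, asset_id, window_id) -> bool:
    """Insert one row; return False if a unique key rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO audio_embedding (feature_set_id, asset_id, window_id) "
                "VALUES (?, ?, ?)",
                (feature_set_id, asset_id, window_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, 1, 10),    # window-level row for asset 1
    try_insert(1, 1, 11),    # second window of the same asset: allowed
    try_insert(1, 1, None),  # asset-level row: allowed alongside window rows
    try_insert(1, 1, 10),    # duplicate window-level key: rejected
    try_insert(1, 1, None),  # duplicate asset-level key: rejected
]
print(results)  # [True, True, True, False, False]
```

Window-level rows never enter the asset-level index because its predicate requires `window_id IS NULL`, so the two keys cannot collide with each other.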
......
@@ -343,6 +343,17 @@ MERT 5s/2.5s job (`extraction_job_id=2`) has already been verified live on `acr_test`:

without needing to query before writing.

Of these two unique keys, the asset-level path now also has live evidence:

- `scripts/validate_audio_embedding_asset_upsert_live.py`
- `audio_embedding_asset_upsert_live_report.json`

Verified:

- A duplicate `INSERT` is rejected by `uq_audio_embedding_feature_asset`
- `ON CONFLICT ... DO UPDATE` reuses the same `embedding_id`
- The row counts of `audio_embedding` / `audio_embedding_vector_192` both stay at `1`

### Next replacement point

Once the runtime and the audio mount are in place, only the guarded failure path needs to be replaced with real inference:
......
@@ -774,3 +774,40 @@ cd /workspace/acr-engine
- What currently blocks the Phase-1 encoder-only rollout:
  1. The `/workspace/downloads` audio mount
  2. Installation of the model runtime dependencies


## New: live validation of asset-level embedding upserts

To move `uq_audio_embedding_feature_asset` from a "DDL declaration" to "real evidence", this round adds:

- `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py`
- `acr-engine/data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json`

### Validation steps

In the isolated schema `acr_asset_upsert_test`, the script:

1. Seeds the minimal master-data graph: `song -> work -> recording -> asset`
2. Inserts the first asset-level embedding with `window_id IS NULL`
3. Issues a second, plain duplicate `INSERT`
4. Expects it to be rejected by `uq_audio_embedding_feature_asset`
5. Runs an `ON CONFLICT ... DO UPDATE`
6. Verifies that there is still only `1` row in `audio_embedding` and `1` row in `audio_embedding_vector_192`
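Steps 2 to 6 above can be sketched as follows. This is an illustration only: it uses stdlib `sqlite3` in place of live PostgreSQL, skips the master-data seeding and the `audio_embedding_vector_192` table, and assumes a trimmed table layout. The value `checksum-v1` and the helper `current_embedding_id` are hypothetical, and the conflict-target predicate is written to match the index predicate exactly, which is an assumption about how the real statement is spelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audio_embedding (
    embedding_id   INTEGER PRIMARY KEY,
    feature_set_id INTEGER NOT NULL,
    asset_id       INTEGER,
    window_id      INTEGER,
    checksum       TEXT
);
CREATE UNIQUE INDEX uq_audio_embedding_feature_asset
    ON audio_embedding (feature_set_id, asset_id)
    WHERE window_id IS NULL AND asset_id IS NOT NULL;
""")

PLAIN_INSERT = (
    "INSERT INTO audio_embedding (feature_set_id, asset_id, window_id, checksum) "
    "VALUES (?, ?, NULL, ?)"
)
UPSERT = PLAIN_INSERT + (
    " ON CONFLICT (feature_set_id, asset_id)"
    " WHERE window_id IS NULL AND asset_id IS NOT NULL"
    " DO UPDATE SET checksum = excluded.checksum"
)


def current_embedding_id() -> int:
    """Look up the asset-level row for feature_set_id=1, asset_id=1."""
    return conn.execute(
        "SELECT embedding_id FROM audio_embedding "
        "WHERE feature_set_id = 1 AND asset_id = 1 AND window_id IS NULL"
    ).fetchone()[0]


# Step 2: first asset-level insert.
conn.execute(PLAIN_INSERT, (1, 1, "checksum-v1"))
first_id = current_embedding_id()

# Steps 3-4: a plain duplicate insert must be rejected by the unique key.
try:
    conn.execute(PLAIN_INSERT, (1, 1, "checksum-v1"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Step 5: the upsert updates the existing row instead of adding a new one.
conn.execute(UPSERT, (1, 1, "checksum-v2"))
upsert_id = current_embedding_id()

# Step 6: still exactly one row, now carrying the new checksum.
row_count, checksum = conn.execute(
    "SELECT COUNT(*), MAX(checksum) FROM audio_embedding"
).fetchone()
print(first_id, duplicate_rejected, upsert_id, row_count, checksum)
# 1 True 1 1 checksum-v2
```

The real script runs against PostgreSQL, where the rejection surfaces as `UniqueViolation` rather than `sqlite3.IntegrityError`.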

### Current results

| Item | Result |
|---|---|
| First `embedding_id` | `1` |
| Duplicate plain `INSERT` | `UniqueViolation` |
| Unique key name | `uq_audio_embedding_feature_asset` |
| `embedding_id` after upsert | `1` |
| `same_embedding_id_reused` | `true` |
| `audio_embedding` row count | `1` |
| `audio_embedding_vector_192` row count | `1` |
| Final `checksum` | `checksum-v2` |

Conclusion:

- The asset-level unique key does not merely "exist on paper"; it is enforced on live PostgreSQL
- If an asset-level semantic writer is added later, it can reuse the same `ON CONFLICT (feature_set_id, asset_id) ...` contract directly
......
@@ -192,6 +192,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
- The semantic lane has also completed its live failure contract: `run_embedding_job.py` now exposes both `unreadable_audio_assets` and `model_runtime_unavailable` instead of disguising the failure as completed
- `audio_embedding` now has unique keys for both the window and asset paths, so a later real encoder only needs to swap the inference adapter to reuse the same upsert contract
- `scripts/run_phase1_embedding_preflight_matrix_live.py` has run successfully; all 4 semantic jobs (mert/muq/ecapa) are consistently marked `preflight_failed` on `acr_test`, and the shared blockers have converged on the missing `/workspace/downloads` plus the missing semantic model runtime
- `scripts/validate_audio_embedding_asset_upsert_live.py` has verified `uq_audio_embedding_feature_asset` in the isolated schema `acr_asset_upsert_test`: a duplicate insert is rejected by the unique key and an upsert reuses the same `embedding_id`, so the asset-level idempotency key now has real evidence as well
- `phase1_hot_reference_v1` now has its `20` reference members actually populated in `acr_test`, so the scope seen by the worker dry-run is `20 recordings / 20 assets / 20 windows`
- The worker contract now has basic precondition-state protection; re-running the same chromaprint dry-run job is explicitly rejected by `expected_status=pending`, with evidence in `phase1_worker_double_claim_guard_report.json`
- The exact lane's `run_chromaprint_job.py` now has a non-dry-run write path; its current live result on `acr_test` is an explicit `failed` because `/workspace/downloads/...` is missing, rather than continuing to pretend `completed`
......