Commit fe416ec9 fe416ec9cae627abe79814ec3a3e000feea99d02 by cnb.bofCdSsphPA

Make the fused Phase-1 ACR schema concrete with DDL samples

Constraint: Keep the storage design aligned to the current song-centric model while turning the 4-table fused schema into something engineers can directly review and implement.
Rejected: Keep only conceptual docs without concrete SQL | It leaves too much ambiguity about where slices, models, and features actually land.
Confidence: high
Scope-risk: narrow
Directive: Until the repository gains a production SQL file for the fused model, treat postgres_db_schema_samples.md as the authoritative DDL draft for media_entity/audio_object/feature_fact/set_membership.
Tested: git diff --check on touched files; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: Executing the fused DDL against a live PostgreSQL schema
1 parent ac2e6730
1 ## 2026-06-04 1 ## 2026-06-04
2 2
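3 - Rewrote `docs/postgres_db_schema_samples.md` as the DDL draft for the current song-centric, fusion-first design, adding the 4 core tables (`media_entity` / `audio_object` / `feature_fact` / `set_membership`), table-placement notes, flowcharts, and common SQL samples.
4
3 - Added a table and flowchart to `docs/postgresql-data-model.md` showing which table slice data, models, and features land in, and made the current default trace-back chain explicit: `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)` 5 - Added a table and flowchart to `docs/postgresql-data-model.md` showing which table slice data, models, and features land in, and made the current default trace-back chain explicit: `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`
4 - Narrowed `docs/README.md` to be the entry point for the current song-centric design, and removed documents in the docs directory unrelated to it (templates, open data, business exports, historical roadmaps). 6 - Narrowed `docs/README.md` to be the entry point for the current song-centric design, and removed documents in the docs directory unrelated to it (templates, open data, business exports, historical roadmaps).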
5 7
......
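1 # PostgreSQL DB Schema Samples / Persistence samples and live test pipeline 1 # PostgreSQL DB Schema Samples / Fusion-first DDL draft and query samples
2 2
3 > Updated: 2026-06-04 3 > Updated: 2026-06-04
4 > Goal: give later development a PostgreSQL persistence sample that can be **followed step by step**, while keeping the evidence from one real `pgvector` live test 4 > Goal: turn the current **song-centric + fusion-first** design into a PostgreSQL DDL draft that can be reviewed directly and implemented further
5 5
6 --- 6 ---
7 7
8 ## One-page summary 8 ## One-page summary
9 9
10 The following was completed in this run on the user-provided PostgreSQL instance: 10 The current default physical model involves only 4 tables: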
11
12 1. **Connected to a real PostgreSQL instance**
13 - DSN: `postgres://d2:***@127.0.0.1:5432/d2`
14 - PostgreSQL: `17.5`
15 - Confirmed that the `vector` extension is present
16
17 2. **Applied schema v2 for real**
18 - Used an isolated schema: `acr_test`
19 - DDL source: `acr-engine/sql/acr_pg_schema_v2.sql`
20 - Created the master-data, registry, embedding, candidate, and decision tables
21
22 3. **Inserted a complete sample data chain**
23 - `canonical_song -> work -> recording -> recording_asset -> audio_window`
24 - `model_registry -> feature_set_registry -> audio_embedding -> retrieval_index_registry`
25 - `reference_set_registry -> reference_set_member`
26
27 4. **Ran one full PostgreSQL + pgvector retrieval evaluation**
28 - Input: `acr-engine/data/pgvector_eval/music20/*.jsonl`
29 - Output: `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json`
30 - The live pgvector metrics **match** the existing FAISS stand-in metrics
31 - overall `top1=0.9091`
32 - overall `top3=0.9545`
33 - `query_type=1`: `top1=1.0`
34 - `query_type=7`: `top1=0.0`, `top3=0.5`
35
36 5. **Verified that the lineage triggers work**
37 - The script deliberately constructed three kinds of bad lineage:
38 - `recording`
39 - `audio_window`
40 - `audio_embedding`
41 - PostgreSQL correctly rejected every insert
42 11
43 --- 12 ```text
44 13 media_entity -> audio_object -> feature_fact -> set_membership
45 ## Live test assets used in this run
46
47 ### Database
48
49 | Item | Value |
50 |---|---|
51 | Host | `127.0.0.1` |
52 | Port | `5432` |
53 | DB | `d2` |
54 | User | `d2` |
55 | PostgreSQL | `17.5` |
56 | Extensions | `vector`, `pg_trgm`, `ltree`, `hstore`, and others |
57 | Schema used for this test | `acr_test` |
58
59 ### Code and artifacts
60
61 | Type | Path |
62 |---|---|
63 | Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` |
64 | live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` |
65 | registry bootstrap script | `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` |
66 | live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
67 | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
68 | registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` |
69 | registry bootstrap idempotency report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json` |
70 | extraction job bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json` |
71 | extraction plan report | `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json` |
72 | reference member bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_reference_member_bootstrap_report.json` |
73 | chromaprint worker dry-run report | `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_dry_run.json` |
74 | embedding worker dry-run report | `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_dry_run.json` |
75 | manual job-status write-back report | `acr-engine/data/pgvector_eval/music20/phase1_worker_mark_pending_report.json` |
76 | double-claim guard report | `acr-engine/data/pgvector_eval/music20/phase1_worker_double_claim_guard_report.json` |
77 | historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |
78
79 ---
80
81 ## The data chain actually persisted in this run
82
83 ```mermaid
84 flowchart LR
85 A[reference_embeddings.jsonl] --> B[canonical_song]
86 B --> C[work]
87 C --> D[recording]
88 D --> E[recording_asset]
89 E --> F[audio_window]
90 F --> G[audio_embedding]
91 G --> H[audio_embedding_vector_192]
92
93 I[model_registry] --> J[feature_set_registry]
94 J --> G
95
96 K[reference_set_registry] --> L[reference_set_member]
97 D --> L
98
99 M[query_embeddings.jsonl] --> N[SQL pgvector search]
100 H --> N
101 N --> O[retrieval_candidate]
102 O --> P[match_decision]
103 ``` 14 ```
104 15
105 --- 16 Corresponding logical semantics:
106
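107 ## Why this live test pads the 24-dimensional embedding to 192 dimensions
108
109 The current `schema v2` provides:
110 - `audio_embedding_vector_192`
111 - `audio_embedding_vector_768`
112
113 The local `music20` sample embeddings in this run, however, are **24-dimensional chroma features**.
114
115 So this live test uses the following strategy:
116
117 - **Logical dimension**: `24`
118 - **Physical stored dimension**: `192`
119 - **Method**: append `0`s and write into `vector(192)`
120
121 Reasons for doing this:
122 - No ad hoc schema change is needed
123 - The schema v2 + pgvector + retrieval pipeline can still be verified
124 - The cosine-similarity ranking of these samples is not distorted: every vector is zero-padded the same way, which leaves dot products and norms unchanged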
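
The zero-padding argument above can be checked with a small sketch. This is not the project's code: the helper names `pad_to` and `cosine` and the two sample vectors are made up for illustration, and only the dimensions `24` and `192` come from the text.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pad_to(vec, dim):
    """Append zeros until the vector has `dim` components (hypothetical helper)."""
    if len(vec) > dim:
        raise ValueError(f"vector has {len(vec)} dims, cannot pad to {dim}")
    return list(vec) + [0.0] * (dim - len(vec))


# Two made-up 24-dimensional vectors standing in for chroma embeddings.
query = [0.1 * (i + 1) for i in range(24)]
reference = [0.05 * (24 - i) for i in range(24)]

raw = cosine(query, reference)
padded = cosine(pad_to(query, 192), pad_to(reference, 192))

# The appended zeros add nothing to the dot product or to either norm,
# so the similarity, and therefore the ranking, is unchanged.
print(math.isclose(raw, padded, rel_tol=1e-12))
```

The same reasoning applies to L2 distance, but not to a scheme where only some vectors are padded or where the padding value is non-zero.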
125 17
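126 This is a **pipeline-verification** usage only. 18 ```text
127 19 song -> asset -> window -> fingerprint / embedding
128 In production, choose by the real encoder dimension:
129 - High-dimensional embeddings such as `MERT` / `MuQ`: write directly into the matching physical table
130 - If more dimensions appear later, extend this into an `audio_embedding_vector_<dim>` per-dimension table strategy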
131
132 ---
133
134 ## Samples actually persisted in this run
135
136 The following comes from real query results against the `acr_test` schema.
137
138 ### 1. canonical_song
139
140 ```json
141 {"canonical_song_id":1,"biz_song_code":"100","title":"Song 100","primary_artist":"Artist 100","rights_status":"protected"}
142 {"canonical_song_id":2,"biz_song_code":"101","title":"Song 101","primary_artist":"Artist 101","rights_status":"protected"}
143 ``` 20 ```
144 21
145 ### 2. work 22 Where:
23 - `media_entity`: currently holds only `song` by default
24 - `audio_object`: holds both `asset` and `window`
25 - `feature_fact`: holds both `fingerprint` and `embedding`
26 - `set_membership`: holds set relationships such as `reference / hot / eval`
146 27
147 ```json 28 ---
148 {"work_id":1,"canonical_song_id":1,"work_code":"work-100","work_title":"Song 100","composer":"Composer 100"}
149 {"work_id":2,"canonical_song_id":2,"work_code":"work-101","work_title":"Song 101","composer":"Composer 101"}
150 ```
151
152 ### 3. recording
153
154 ```json
155 {"recording_id":1,"work_id":1,"canonical_song_id":1,"recording_code":"rec-100","version_type":"master_reference","is_reference":true,"reference_priority":100}
156 {"recording_id":2,"work_id":2,"canonical_song_id":2,"recording_code":"rec-101","version_type":"master_reference","is_reference":true,"reference_priority":101}
157 ```
158
159 ### 4. recording_asset
160
161 ```json
162 {"asset_id":1,"recording_id":1,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/100/type_11/93dfdeb0-7da5-42a8-9c71-cf12af57dd191650256918.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"}
163 {"asset_id":2,"recording_id":2,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/101/type_11/83c0c07f-4f96-4ff4-998c-58db910f3cfa1650256915.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"}
164 ```
165 29
166 ### 5. audio_window 30 ## 1. What each of the 4 tables stores
167 31
168 ```json 32 | Table | Current main types | What it stores | Why it exists |
169 {"window_id":1,"asset_id":1,"recording_id":1,"work_id":1,"canonical_song_id":1,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"} 33 |---|---|---|---|
170 {"window_id":2,"asset_id":2,"recording_id":2,"work_id":2,"canonical_song_id":2,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"} 34 | `media_entity` | `song` | The primary song entity | The final attribution target is `song_id` |
171 ``` 35 | `audio_object` | `asset`, `window` | Original audio files + slices | One song can have several audio files, and slices still need evidence |
36 | `feature_fact` | `fingerprint`, `embedding` | Model, feature set, and feature results | Unifies exact/semantic feature facts |
37 | `set_membership` | `reference_set`, `eval_set`, `hot_set` | Who belongs to which set | Manages reference and evaluation scope |
172 38
173 ### 6. model_registry / feature_set_registry 39 ---
174 40
175 ```json 41 ## 2. Currently recommended DDL draft
176 {"model_id":1,"model_name":"local_chroma24","model_family":"chroma_baseline","model_version":"v1","output_embedding_dim":24,"default_window_sec":8.0} 42
177 {"feature_set_id":1,"model_id":1,"feature_name":"chroma24_songid_eval","embedding_dim":24,"distance_metric":"cosine","feature_schema_version":"v1"} 43 ### 2.1 `media_entity`
44
45 ```sql
46 create table if not exists media_entity (
47 entity_id bigserial primary key,
48 entity_type text not null check (entity_type in ('song', 'work', 'recording')),
49 root_song_id bigint,
50 parent_entity_id bigint,
51 biz_key text,
52 title text not null,
53 artist_name text,
54 entity_status text not null default 'active',
55 metadata_json jsonb not null default '{}'::jsonb,
56 created_at timestamptz not null default now(),
57 updated_at timestamptz not null default now(),
58 constraint fk_media_entity_root_song
59 foreign key (root_song_id) references media_entity(entity_id),
60 constraint fk_media_entity_parent
61 foreign key (parent_entity_id) references media_entity(entity_id)
62 );
63
64 create unique index if not exists uq_media_entity_song_biz_key
65 on media_entity(entity_type, biz_key)
66 where biz_key is not null;
67
68 create index if not exists idx_media_entity_root_song
69 on media_entity(root_song_id);
178 ``` 70 ```
179 71
180 ### 7. audio_embedding 72 ### 2.2 `audio_object`
181 73
182 ```json 74 ```sql
183 {"embedding_id":1,"feature_set_id":1,"asset_id":1,"window_id":1,"recording_id":1,"canonical_song_id":1,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true} 75 create table if not exists audio_object (
184 {"embedding_id":2,"feature_set_id":1,"asset_id":2,"window_id":2,"recording_id":2,"canonical_song_id":2,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true} 76 object_id bigserial primary key,
77 object_type text not null check (object_type in ('asset', 'window')),
78 song_id bigint not null references media_entity(entity_id),
79 parent_object_id bigint references audio_object(object_id),
80 source_type text,
81 storage_uri text,
82 storage_scheme text,
83 checksum text,
84 codec text,
85 sample_rate integer,
86 channels integer,
87 duration_ms integer,
88 start_ms integer,
89 end_ms integer,
90 object_status text not null default 'ready',
91 metadata_json jsonb not null default '{}'::jsonb,
92 created_at timestamptz not null default now(),
93 updated_at timestamptz not null default now(),
94 constraint ck_audio_object_window_parent
95 check (
96 (object_type = 'asset' and parent_object_id is null)
97 or (object_type = 'window' and parent_object_id is not null)
98 )
99 );
100
101 create index if not exists idx_audio_object_song_type
102 on audio_object(song_id, object_type);
103
104 create index if not exists idx_audio_object_parent
105 on audio_object(parent_object_id);
106
107 create unique index if not exists uq_audio_object_asset_checksum
108 on audio_object(song_id, checksum)
109 where object_type = 'asset' and checksum is not null;
110
111 create unique index if not exists uq_audio_object_window_range
112 on audio_object(parent_object_id, start_ms, end_ms)
113 where object_type = 'window';
185 ``` 114 ```
186 115
187 ### 8. reference_set_registry / retrieval_index_registry 116 ### 2.3 `feature_fact`
188 117
189 ```json 118 ```sql
190 {"reference_set_id":1,"set_name":"music20_live_reference","encoder_scope":"local_chroma24","status":"active"} 119 create table if not exists feature_fact (
191 {"retrieval_index_id":1,"feature_set_id":1,"index_name":"music20_live_pgvector_hnsw","index_backend":"pgvector","index_type":"hnsw_cosine","row_count":20,"index_status":"active"} 120 feature_id bigserial primary key,
121 feature_type text not null check (feature_type in ('fingerprint', 'embedding')),
122 object_id bigint not null references audio_object(object_id),
123 song_id bigint not null references media_entity(entity_id),
124 model_name text not null,
125 model_version text not null,
126 feature_set_name text not null,
127 feature_schema_ver text not null default 'v1',
128 embedding_dim integer,
129 fingerprint_value text,
130 embedding_uri text,
131 vector_table_name text,
132 checksum text,
133 feature_status text not null default 'ready',
134 metadata_json jsonb not null default '{}'::jsonb,
135 created_at timestamptz not null default now(),
136 updated_at timestamptz not null default now(),
137 constraint ck_feature_payload
138 check (
139 (feature_type = 'fingerprint' and fingerprint_value is not null)
140 or (feature_type = 'embedding' and (embedding_uri is not null or vector_table_name is not null))
141 )
142 );
143
144 create index if not exists idx_feature_fact_object_type
145 on feature_fact(object_id, feature_type);
146
147 create index if not exists idx_feature_fact_song_type
148 on feature_fact(song_id, feature_type);
149
150 create unique index if not exists uq_feature_fact_embedding
151 on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type)
152 where feature_type = 'embedding';
153
154 create unique index if not exists uq_feature_fact_fingerprint
155 on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type)
156 where feature_type = 'fingerprint';
192 ``` 157 ```
193 158
194 ### 9. retrieval_candidate / match_decision 159 ### 2.4 `set_membership`
195 160
196 ```json 161 ```sql
197 {"retrieval_candidate_id":1,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":1,"raw_score":0.99998549,"normalized_score":0.90998694,"rank_no":1} 162 create table if not exists set_membership (
198 {"retrieval_candidate_id":2,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":17,"raw_score":0.9527432,"normalized_score":0.86746888,"rank_no":2} 163 membership_id bigserial primary key,
199 {"match_decision_id":1,"query_id":"music20-q0000-t1-song100","canonical_song_id":1,"decision_status":"matched","decision_score":0.90998694} 164 set_type text not null check (set_type in ('reference_set', 'eval_set', 'hot_set')),
165 set_name text not null,
166 member_type text not null check (member_type in ('song', 'asset', 'window', 'feature')),
167 member_id bigint not null,
168 song_id bigint references media_entity(entity_id),
169 is_active boolean not null default true,
170 priority integer not null default 100,
171 metadata_json jsonb not null default '{}'::jsonb,
172 created_at timestamptz not null default now(),
173 updated_at timestamptz not null default now()
174 );
175
176 create unique index if not exists uq_set_membership_unique
177 on set_membership(set_type, set_name, member_type, member_id);
178
179 create index if not exists idx_set_membership_set_lookup
180 on set_membership(set_type, set_name, is_active, priority);
200 ``` 181 ```
201 182
202 --- 183 ---
203 184
204 ## Table sizes in this live test 185 ## 3. Which table slices / models / features actually land in
205 186
206 | Table | Rows | 187 | Object | Table | Key fields |
207 |---|---:| 188 |---|---|---|
208 | `canonical_song` | 20 | 189 | song | `media_entity` | `entity_type='song'` |
209 | `work` | 20 | 190 | Original audio | `audio_object` | `object_type='asset'` |
210 | `recording` | 20 | 191 | Slice window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` |
211 | `recording_asset` | 20 | 192 | Fingerprint feature | `feature_fact` | `feature_type='fingerprint'` |
212 | `audio_window` | 20 | 193 | Embedding feature | `feature_fact` | `feature_type='embedding'` |
213 | `audio_embedding` | 20 | 194 | Model name/version | `feature_fact` | `model_name`, `model_version` |
214 | `retrieval_candidate` | 220 | 195 | feature set | `feature_fact` | `feature_set_name`, `feature_schema_ver` |
215 | `match_decision` | 22 | 196 | Reference-set membership | `set_membership` | `set_type='reference_set'` |
216
217 Notes:
218 - 20 reference songs
219 - 22 queries
220 - Each query writes its top-10 candidates, hence `22 * 10 = 220`
221
222 ---
223
224 ## Test pipeline and logic for this run
225
226 ### A. Schema / data integrity test
227
228 1. Connect to PostgreSQL
229 2. Create the isolated schema `acr_test`
230 3. Execute `acr_pg_schema_v2.sql`
231 4. Initialize:
232 - `model_registry`
233 - `feature_set_registry`
234 - `reference_set_registry`
235 - `retrieval_index_registry`
236 5. Import the 20 reference samples
237 6. Verify that the table row counts are correct
238 7. Deliberately insert three kinds of bad lineage:
239 - `recording.canonical_song_id` does not match `work.canonical_song_id`
240 - `audio_window.recording_id` does not match `recording_asset.recording_id`
241 - the `canonical_song_id` of `audio_embedding` does not match its parent `audio_window`
242 8. Expect the PostgreSQL triggers to reject these bad writes
243
244 ### B. Live retrieval evaluation test
245
246 1. Read 20 reference embeddings from `reference_embeddings.jsonl`
247 2. Write them into `audio_embedding` + `audio_embedding_vector_192`
248 3. Read 22 query embeddings from `query_embeddings.jsonl`
249 4. Run a `pgvector cosine` search in SQL for each query
250 5. Do song-level aggregation in the application layer:
251 - `max_sim`
252 - `top3_avg`
253 - `vote`
254 - `combined = 0.6 * max_sim + 0.3 * top3_avg + 0.1 * vote_factor`
255 6. Persist the top-10 candidates to `retrieval_candidate`
256 7. Persist the top-1 decision to `match_decision`
257 8. Compute:
258 - overall `top1/top3/top10/mrr`
259 - `by_query_type`
260 - `confusion_focus`
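
The song-level aggregation in step 5 of part B can be sketched as below. Only the weighted formula for `combined` is taken from the text. The definitions of `top3_avg` (mean of a song's best three similarities, or fewer if it has fewer hits) and `vote_factor` (the song's share of all retrieved hits) are assumptions, as are the function name `aggregate_by_song` and the padding hits. Under these assumptions, a song that contributes one of ten hits with similarity `0.99998549` scores `0.90998694`, which matches the `raw_score` / `normalized_score` pair in the `retrieval_candidate` sample rows.

```python
from collections import defaultdict


def aggregate_by_song(hits):
    """Collapse window-level hits [(song_id, similarity), ...] into song scores.

    Assumed definitions (not confirmed by the source):
      top3_avg    = mean of the song's best three similarities (or fewer)
      vote_factor = share of all retrieved hits that belong to the song
    """
    by_song = defaultdict(list)
    for song_id, sim in hits:
        by_song[song_id].append(sim)

    total_hits = len(hits)
    scored = []
    for song_id, sims in by_song.items():
        sims.sort(reverse=True)
        max_sim = sims[0]
        top3 = sims[:3]
        top3_avg = sum(top3) / len(top3)
        vote_factor = len(sims) / total_hits
        combined = 0.6 * max_sim + 0.3 * top3_avg + 0.1 * vote_factor
        scored.append((song_id, combined))

    # Highest combined score first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Ten hits, one per song; the two leading similarities are the raw_score
# values of the sample rows, the other eight are made-up filler.
hits = [(1, 0.99998549), (17, 0.9527432)] + [(100 + i, 0.5) for i in range(8)]
ranked = aggregate_by_song(hits)
print(ranked[0][0], round(ranked[0][1], 8))
```

With one window per reference song, `max_sim` and `top3_avg` coincide, so the formula reduces to `0.9 * similarity + 0.1 * vote_factor`; the three terms only diverge once a song has several windows in the index.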
261
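262 ### C. Confusion test scope
263
264 The current live samples actually contain only:
265 - `type_1`
266 197
267 --- 198 ---
268 199
269 ## Phase-1 worker dry-run test pipeline (new) 200 ## 4. Flowcharts
270
271 This step addresses:
272
273 > Although the planner could already emit copyable commands, the repository previously had no real worker able to consume them.
274 201
275 A minimal real worker has now been added: 202 ### 4.1 Persistence flow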
276
277 - `acr-engine/workers/mark_job_status.py`
278 - `acr-engine/workers/run_chromaprint_job.py`
279 - `acr-engine/workers/run_embedding_job.py`
280
281 ### Test goal
282
283 Verify that the following chain really works end to end:
284 203
285 ```mermaid 204 ```mermaid
286 flowchart TD 205 flowchart TD
287 A[feature_extraction_job pending] --> B[planner generates command template] 206 A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset]
288 B --> C[worker reads extraction_job_id] 207 B --> C[audio_object\nobject_type=window]
289 C --> D[worker resolves feature/model/scope] 208 C --> D1[feature_fact\nfeature_type=fingerprint]
290 D --> E[worker writes back running/completed] 209 C --> D2[feature_fact\nfeature_type=embedding]
291 E --> F[bootstrap script can restore pending again] 210 B --> E[set_membership\nreference_set]
211 C --> E
292 ``` 212 ```
293 213
294 ### Current verification scope 214 ### 4.2 Query trace-back flow
295
296 This round does not run real model inference yet; it first verifies the production execution surface:
297
298 1. `run_chromaprint_job.py`
299 - Connects to a real PostgreSQL instance
300 - Reads `feature_extraction_job=1`
301 - Resolves `reference_set:phase1_hot_reference_v1`
302 - Writes back `running -> completed`
303
304 2. `run_embedding_job.py`
305 - Connects to a real PostgreSQL instance
306 - Reads `feature_extraction_job=2`
307 - Resolves `mert v1-95m`
308 - Writes back `running -> completed`
309
310 3. Run `bootstrap_phase1_extraction_jobs_live.py` again
311 - Restores the job status to `pending`
312 - Ensures later sessions can continue from the same batch of jobs
313
314 4. `plan_phase1_extraction_jobs_live.py`
315 - The generated primary command template now explicitly includes:
316 - `cd /workspace/acr-engine &&`
317 - `PG_DSN="${PG_DSN:?set PG_DSN}"`
318 - `--complete-dry-run`
319 - So `primary_command` can directly reproduce the current dry-run state transitions
320
321 ### Why dry-run first
322
323 Because the top priority right now is to pin down the following:
324
325 - job contract
326 - status transitions
327 - scope resolution
328 - planner -> worker command compatibility
329
330 Once this skeleton is stable, wiring in the real:
331 - chromaprint extraction
332 - MERT / MuQ embedding extraction
333
334 carries lower overall risk.
335
336 ### Key updates to the current live results
337
338 This round added:
339
340 - `acr-engine/scripts/bootstrap_phase1_reference_members_live.py`
341
342 and attached `20` reference recordings to `acr_test.phase1_hot_reference_v1`, so the scope the worker dry-run now sees is:
343
344 - `recording_count=20`
345 - `ready_asset_count=20`
346 - `active_window_count=20`
347
348 This shows the verification has moved beyond an "empty-scope state-machine demo" to:
349
350 - planner -> worker command compatibility
351 - working worker -> PostgreSQL state transitions
352 - working reference_set -> recording/asset/window scope resolution
353
354 Still note:
355
356 - This is still a **dry-run**
357 - It is **not** a throughput verification of real feature extraction
358
359 ### Current concurrency/retry protection verification
360
361 This round also ran a deliberate repeated-execution test:
362
363 1. First take `feature_extraction_job=1` through `pending -> running -> completed`
364 2. Without resetting, run the same chromaprint dry-run worker again
365 3. Expect the second run to fail, because the worker requires the following when claiming a job:
366 - `expected_status = pending`
367
368 Actual results are in:
369
370 - `phase1_worker_double_claim_guard_report.json`
371
372 Key evidence:
373
374 - `double_claim_exit_code = 1`
375 - `stderr = failed to update feature_extraction_job=1 with expected_status=pending`
376
377 This shows the current minimal worker contract already has:
378
379 - a basic claim guard
380 - basic protection against repeated execution
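
A claim guard of this kind is usually a conditional update: the status only changes if the row is still in the expected state, and the worker checks how many rows were affected. The sketch below illustrates the idea with Python's built-in `sqlite3` as a stand-in for PostgreSQL. The column names `extraction_job_id` and `job_status`, the one-table layout, and the function `claim_job` are assumptions for illustration, not the actual worker code.

```python
import sqlite3


def claim_job(conn, job_id, expected_status="pending"):
    """Move a job to 'running' only if it is still in `expected_status`.

    Returns True when this caller won the claim, False otherwise.
    """
    cur = conn.execute(
        "UPDATE feature_extraction_job SET job_status = 'running' "
        "WHERE extraction_job_id = ? AND job_status = ?",
        (job_id, expected_status),
    )
    conn.commit()
    # rowcount is 0 when another run already moved the job out of 'pending'.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_extraction_job ("
    "extraction_job_id INTEGER PRIMARY KEY, job_status TEXT NOT NULL)"
)
conn.execute("INSERT INTO feature_extraction_job VALUES (1, 'pending')")

first_claim = claim_job(conn, 1)   # succeeds: status was 'pending'
second_claim = claim_job(conn, 1)  # rejected: status is now 'running'
print(first_claim, second_claim)
```

Because the check and the write happen in one statement, two workers racing for the same job cannot both see a row count of 1. A worker can turn the `False` case into a non-zero exit code, which is the behavior the guard report describes.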
381
382 ---
383
384 ## Exact lane non-dry-run write attempt (new)
385
386 This round moved one step further:
387
388 > `run_chromaprint_job.py` is no longer dry-run only.
389
390 Current behavior:
391
392 1. If the audio file for a reference asset is readable:
393 - extract a repo-local chromaprint-style hash
394 - write the artifact JSON
395 - write `audio_fingerprint`
396 - mark the job `completed`
397
398 2. If the audio file for a reference asset is not readable:
399 - mark the job `failed`
400 - write into `metadata_json`:
401 - `failure_reason`
402 - `missing_asset_count`
403 - `missing_asset_samples`
404
405 ### Live results this round
406
407 Reports:
408
409 - `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_write_attempt.json`
410 - `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_write_guard_report.json`
411
412 Key results:
413
414 - `scope_asset_count = 20`
415 - `processed_assets = 0`
416 - `missing_assets = 20`
417 - `job_status = failed`
418 - `failure_reason = unreadable_audio_assets`
419 - `audio_fingerprint_count = 0`
420
421 ### What this shows
422
423 The PostgreSQL worker contract for the exact lane now has:
424
425 - a real non-dry-run write path
426 - explicit persistence of failures
427 - auditable error evidence when the environment is incomplete
428 - all-or-nothing batch semantics ("fully succeed, otherwise fail")
429 - the constraint basis for atomic upserts on `audio_fingerprint(feature_set_id, asset_id)`
430
431 But the current container still lacks:
432
433 - the actual audio files under `/workspace/downloads/...`
434
435 So what this round demonstrates is:
436
437 - **the worker write path is wired up**
438 - **it is currently blocked by the environment's data mount**
439
440 rather than the exact lane logic itself not being implemented.
441 - `type_7`
442
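443 Therefore:
444 - `type_7` can serve as the **current live confusion check**
445 - `type_8 / type_16` are not covered by this live JSONL, so they can only be read together with the historical business-sample results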
446
447 ---
448
449 ## Live pgvector results
450
451 ### 1. overall
452
453 | Metric | Value |
454 |---|---:|
455 | Query count | 22 |
456 | top1 | `0.9091` |
457 | top3 | `0.9545` |
458 | top10 | `0.9545` |
459 | MRR | `0.9343` |
460 | mean rank | `1.8182` |
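
As a worked check of how these overall numbers relate, the sketch below computes top-k accuracy, MRR, and mean rank from a list of per-query ranks. The rank list is an inference, not data read from the report: it assumes 20 queries ranked first (the `type_1` rows), plus two `type_7` queries at ranks 2 and 18, which is the only pair consistent with `top1=0.0`, `top3=0.5`, and the MRR and mean rank in the table. The function name `ranking_metrics` is hypothetical.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute top-k hit rates, MRR, and mean rank from 1-based ranks."""
    n = len(ranks)
    metrics = {f"top{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    metrics["mean_rank"] = sum(ranks) / n
    return metrics


# Inferred ranks of the correct song for each of the 22 queries
# (an assumption consistent with the reported metrics, not report data).
ranks = [1] * 20 + [2, 18]

metrics = ranking_metrics(ranks)
rounded = {name: round(value, 4) for name, value in metrics.items()}
print(rounded)
```

Under that assumption the rounded values reproduce the table: `top1 = 20/22`, `top3 = top10 = 21/22`, `MRR = (20 + 1/2 + 1/18) / 22`, and `mean rank = 40/22`.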
461
462 ### 2. by query type
463
464 | query_type | count | top1 | top3 | top10 | Notes |
465 |---|---:|---:|---:|---:|---|
466 | `1` | 20 | `1.0` | `1.0` | `1.0` | clean / near-clean |
467 | `7` | 2 | `0.0` | `0.5` | `0.5` | current live confusion samples |
468 | `8` | 0 | N/A | N/A | N/A | not covered by this live JSONL |
469 | `16` | 0 | N/A | N/A | N/A | not covered by this live JSONL |
470
471 ### 3. Consistency with the existing FAISS stand-in
472
473 | Path | overall top1 | overall top3 | type_1 top1 | type_7 top1 | type_7 top3 |
474 |---|---:|---:|---:|---:|---:|
475 | live PostgreSQL + pgvector | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` |
476 | FAISS stand-in | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` |
477
478 Conclusion:
479
480 > The live pgvector path on `acr_test` is now aligned with the existing stand-in retrieval logic.
481 > The problem is not that persisting to PostgreSQL degrades recall; the current sample embeddings are simply not strong enough for confusion-type queries.
482
483 ---
484
485 ## Added this round: full negative-case coverage for the lineage triggers
486
487 After re-running the live script this round, `lineage_negative_test` in `live_pgvector_report.json` was upgraded from "a single audio_window check" to "all three kinds of bad writes checked":
488
489 | case | Result | PostgreSQL response |
490 |---|---|---|
491 | `recording_lineage_mismatch` | rejected | `recording.canonical_song_id ... mismatches work.canonical_song_id ...` |
492 | `audio_window_lineage_mismatch` | rejected | `Invalid asset_id=... or recording_id=... for audio_window` |
493 | `audio_embedding_lineage_mismatch` | rejected | `audio_embedding lineage mismatch` |
494
495 This means:
496
497 > All three core lineage triggers in schema v2 now have real negative-case evidence, rather than merely "existing in theory".
498
499 This round also added two pieces of mechanical verification evidence:
500 - `py_compile` passes: `live_pgvector_music20_eval.py`
501 - `git diff --check` passes: no formatting problems in this round's script, report, or documentation changes
502
503 ---
504
505 ## Supplementary views of the confusion tests
506
507 ### 1. Current live sample view
508
509 | query_type | Data source | top1 | top3 | Conclusion |
510 |---|---|---:|---:|---|
511 | `7` | `live_pgvector_report.json` | `0.0` | `0.5` | clearly weak |
512
513 ### 2. Historical local 20-song small-sample view
514
515 From: `acr-engine/data/local_eval/music20_summary.json`
516
517 | query_type | top1 | top3 |
518 |---|---:|---:|
519 | `1` | `1.0` | `1.0` |
520 | `7` | `0.45` | `0.65` |
521 | `8` | `0.4667` | `0.7333` |
522 | `16` | `0.4167` | `0.4167` |
523
524 Notes:
525 - These are results from the **local small-sample chroma/FAISS sanity flow**
526 - It scores better on type_7 than the current live JSONL because the sample composition differs
527 - This result cannot be taken as production performance, but it is supporting evidence that the current features are not entirely unusable on small samples
528
529 ### 3. Historical business-corpus voice correctness view
530
531 | query_type | File | top1 | top3 | Conclusion |
532 |---|---|---:|---:|---|
533 | `7` | `voice_workspace20_type7_eval.json` | `0.0` | `0.05` | extremely weak |
534 | `8` | `voice_workspace20_type8_eval.json` | `0.0` | `0.0` | extremely weak |
535 | `16` | `voice_workspace20_type16_eval.json` | `0.0` | `0.0` | extremely weak |
536
537 Conclusion:
538 215
539 > As soon as queries come from more realistic, more confusable business samples, the current baseline is still far from sufficient. 216 ```mermaid
540 > Persisting to PostgreSQL is not the problem; the real problem remains that **the embedding lane lacks discriminative power on hard cases**. 217 flowchart LR
541 218 A[feature_fact] --> B[audio_object window]
542 --- 219 B --> C[audio_object asset]
543 220 C --> D[media_entity song]
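544 ## What this run verified and what it did not 221 ```
545
546 ### Verified
547
548 - PostgreSQL is reachable and usable
549 - The `vector` extension is available
550 - schema v2 can actually be applied
551 - The main lineage triggers actually block bad data
552 - The sample data chain can be persisted as `song -> work -> recording -> asset -> window -> embedding`
553 - live pgvector retrieval is consistent with the existing stand-in logic
554 - `retrieval_candidate` / `match_decision` can hold real online results
555 - The semantic worker's preflight-failure semantics were verified for real: it detects both a missing `/workspace/downloads` and missing `torch/torchaudio/transformers`
556 - `audio_embedding` now has idempotent unique keys for both the window and asset paths, reserving stable keys for later real encoder upserts
557 222
558 ### Not verified 223 ### 4.3 Write sequence diagram
559 224
560 - `MERT` / `MuQ` are not yet wired into this live path 225 ```mermaid
561 - This live sample does not cover JSONL embeddings for `type_8 / type_16` 226 sequenceDiagram
562 - Only the 20-song scale was verified, which says nothing about index performance at 300,000 songs 227 participant ING as Ingest/Extract Job
563 - No aggregation tests yet for multiple recordings / multiple versions / the cover lane 228 participant DB as PostgreSQL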
229
230 ING->>DB: insert media_entity(song)
231 ING->>DB: insert audio_object(asset)
232 ING->>DB: insert audio_object(window)
233 ING->>DB: insert feature_fact(fingerprint)
234 ING->>DB: insert feature_fact(embedding)
235 ING->>DB: insert set_membership(reference_set)
236 ```
564 237
565 --- 238 ---
566 239
567 ## Recommended next steps 240 ## 5. Most common SQL samples
568
569 ### Added this round: the Phase-1 registry can now be bootstrapped live
570
571 Besides the live retrieval script, this round also added:
572
573 - `acr-engine/scripts/bootstrap_phase1_model_registry_live.py`
574
575 It has written the following into the `acr_test` schema:
576 - `chromaprint`
577 - `mert`
578 - `muq`
579 - `ecapa`
580 - the corresponding feature sets
581 - `phase1_hot_reference_v1`
582 241
583 Corresponding live report: 242 ### 5.1 Write a song
584 - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
585 243
586 ### Also added this round: Phase-1 extraction jobs can now be bootstrapped live 244 ```sql
587 245 insert into media_entity (entity_type, biz_key, title, artist_name)
588 After the registry bootstrap, this round also added: 246 values ('song', 'song-10001', 'Song 10001', 'Artist A')
589 247 returning entity_id;
590 - `acr-engine/scripts/bootstrap_phase1_extraction_jobs_live.py` 248 ```
591
592 It has created 5 `feature_extraction_job` rows in the `acr_test` schema:
593 - `chromaprint`
594 - `mert 5s/2.5s`
595 - `mert 10s/5s`
596 - `muq 5s/2.5s`
597 - `ecapa 5s/2.5s`
598
599 Corresponding live report:
600 - `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json`
601
602 ### Also added this round: pending jobs can now produce a live execution plan
603
604 After the extraction jobs, this round also added:
605
606 - `acr-engine/scripts/plan_phase1_extraction_jobs_live.py`
607
608 It has read the 5 `pending` jobs from the `acr_test` schema and generated a plan ordered by execution sequence:
609 - `chromaprint exact lane` first
610 - then the semantic lanes for `mert / muq / ecapa`
611
612 Corresponding live report:
613 - `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json`
614
615 After this round's additions, the plan also provides:
616 - `command_suggestions`
617 - `primary_command`
618
619 In other words, the pending jobs in PostgreSQL now lead directly to a copyable execution command template.
620
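621 ### Route 1: continue PostgreSQL engineering
622
623 1. Generalize `live_pgvector_music20_eval.py` so that it:
624 - can import any manifest/reference set
625 - can select the encoder / feature set
626 - can directly generate `retrieval_candidate` / `match_decision` reports
627 2. Add:
628 - `audio_embedding_vector_1024` / tables for other common dimensions
629 - bulk COPY / batched insert
630 - HNSW parameter management
631
632 ### Route 2: continue confusion-type effectiveness verification
633
634 1. Build query-embedding JSONL that really covers `type_8 / type_16`
635 2. Re-run the PostgreSQL evaluation with the same live script
636 3. Compare:
637 - `Chromaprint only`
638 - `semantic only`
639 - `fusion`
640 4. Output a confusion-bucket report
641 249
642 Additional notes on the current environment: 250 ### 5.2 Write an asset
643 - When this round tried to fill in `type_8 / type_16` live samples directly from `/workspace/downloads`, that directory turned out **not to exist** in the current container
644 - So to continue this branch next round, first restore/mount the business-sample directory, or place the corresponding query audio and reference list back on a path visible to the repository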
645 251
646 ### Route 3: switch to the Phase-1 encoder-only main line 252 ```sql
253 insert into audio_object (
254 object_type, song_id, source_type, storage_uri, storage_scheme,
255 checksum, codec, sample_rate, channels, duration_ms
256 ) values (
257 'asset', :song_id, 'official', 's3://bucket/song10001/master.wav', 's3',
258 'sha256:xxx', 'wav', 44100, 2, 215000
259 ) returning object_id;
260 ```
647 261
648 1. Keep the current PostgreSQL structure unchanged 262 ### 5.3 Write a window
649 2. Replace `local_chroma24` with:
650 - `MERT-v1-95M`
651 - `MuQ`
652 3. Keep reusing:
653 - `model_registry`
654 - `feature_set_registry`
655 - `reference_set_registry`
656 - `retrieval_index_registry`
657 4. Re-test:
658 - clean
659 - type_7
660 - type_8
661 - type_16
662 - business voice bucket
663 263
664 --- 264 ```sql
265 insert into audio_object (
266 object_type, song_id, parent_object_id, start_ms, end_ms, duration_ms
267 ) values (
268 'window', :song_id, :asset_id, 30000, 35000, 5000
269 ) returning object_id;
270 ```
665 271
666 ## Reproduction commands 272 ### 5.4 Write an embedding
273
274 ```sql
275 insert into feature_fact (
276 feature_type, object_id, song_id,
277 model_name, model_version, feature_set_name,
278 feature_schema_ver, embedding_dim, embedding_uri, vector_table_name
279 ) values (
280 'embedding', :window_id, :song_id,
281 'mert', 'v1-95m', 'mert_5s_hop2.5_meanpool',
282 'v1', 768, 's3://bucket/emb/song10001_win0001.npy', 'audio_embedding_vector_768'
283 );
284 ```
667 285
668 ### 1. live PostgreSQL + pgvector test 286 ### 5.5 Attach an asset to the reference set
669 287
670 ```bash 288 ```sql
671 cd /workspace/acr-engine 289 insert into set_membership (
672 /usr/local/miniconda3/bin/python scripts/live_pgvector_music20_eval.py \ 290 set_type, set_name, member_type, member_id, song_id, priority
673 --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ 291 ) values (
674 --schema acr_test \ 292 'reference_set', 'phase1_hot_reference_v1', 'asset', :asset_id, :song_id, 100
675 --reset-schema \ 293 );
676 --output data/pgvector_eval/music20/live_pgvector_report.json
677 ``` 294 ```
678 295
679 ### 2. FAISS stand-in comparison test 296 ### 5.6 Look up the song from an embedding
680 297
681 ```bash 298 ```sql
682 cd /workspace/acr-engine 299 select ff.feature_id,
683 /usr/local/miniconda3/bin/python scripts/evaluate_songid_pgvector_path.py \ 300 ff.model_name,
684 --reference-embeddings-jsonl data/pgvector_eval/music20/reference_embeddings.jsonl \ 301 ff.model_version,
685 --query-embeddings-jsonl data/pgvector_eval/music20/query_embeddings.jsonl \ 302 ff.feature_set_name,
686 --output data/pgvector_eval/music20/songid_eval_report_fresh.json 303 win.object_id as window_id,
304 ast.object_id as asset_id,
305 song.entity_id as song_id,
306 song.title,
307 song.artist_name
308 from feature_fact ff
309 join audio_object win
310 on win.object_id = ff.object_id
311 and win.object_type = 'window'
312 join audio_object ast
313 on ast.object_id = win.parent_object_id
314 and ast.object_type = 'asset'
315 join media_entity song
316 on song.entity_id = ff.song_id
317 and song.entity_type = 'song'
318 where ff.feature_id = :feature_id;
687 ``` 319 ```
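The trace-back chain `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)` can be exercised without a PostgreSQL instance. The sketch below uses Python's built-in `sqlite3` as a stand-in; the column types, the reduced column set, and the sample values (`Demo Title`, `Demo Artist`, IDs) are all assumptions made for illustration, not the project's DDL.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; types are simplified assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table media_entity (
    entity_id integer primary key, entity_type text, title text, artist_name text);
create table audio_object (
    object_id integer primary key, object_type text, song_id integer,
    parent_object_id integer, start_ms integer, end_ms integer, duration_ms integer);
create table feature_fact (
    feature_id integer primary key, feature_type text, object_id integer,
    song_id integer, model_name text, model_version text, feature_set_name text);
""")

# Hypothetical sample rows: one song, one asset, one window, one embedding.
conn.execute("insert into media_entity values (10001, 'song', 'Demo Title', 'Demo Artist')")
conn.execute("insert into audio_object values (1, 'asset', 10001, null, null, null, null)")
conn.execute("insert into audio_object values (2, 'window', 10001, 1, 30000, 35000, 5000)")
conn.execute(
    "insert into feature_fact values "
    "(7, 'embedding', 2, 10001, 'mert', 'v1-95m', 'mert_5s_hop2.5_meanpool')"
)

# Same join shape as the query above, trimmed to the identifying columns.
TRACE_SQL = """
select ff.feature_id, win.object_id, ast.object_id, song.entity_id, song.title
from feature_fact ff
join audio_object win
  on win.object_id = ff.object_id and win.object_type = 'window'
join audio_object ast
  on ast.object_id = win.parent_object_id and ast.object_type = 'asset'
join media_entity song
  on song.entity_id = ff.song_id and song.entity_type = 'song'
where ff.feature_id = :feature_id
"""

row = conn.execute(TRACE_SQL, {"feature_id": 7}).fetchone()
missing = conn.execute(TRACE_SQL, {"feature_id": 999}).fetchone()
print(row, missing)
```

Because every join is an inner join, a feature whose window or asset row is missing returns no row at all, which makes broken lineage easy to spot.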
688 320
689 --- 321 ---
690 322
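691 ## One-sentence conclusion 323 ## 6. Current design intent
692
693 > The PostgreSQL path can already persist the schema, samples, candidates, and decisions for real, and it can run real pgvector retrieval.
694 > The biggest remaining weakness is no longer "how to store it", but that **recall of the current baseline embedding on confusable queries is still clearly insufficient**.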
695
696
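697 ## New: Phase-1 semantic worker live evidence
698
699 This round continues live PostgreSQL validation of `run_embedding_job.py`. The goal is not to fake embeddings but to **pin down the failure semantics first**.
700
701 ### Result summary
702
703 After running the non-dry-run worker for `extraction_job_id=2` (`mert v1-95m`, `5s/2.5s`):
704
705 | Item | Result |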
706 |---|---|
707 | `scope_window_count` | `20` |
708 | `job_status` | `failed` |
709 | `output_count` | `0` |
710 | `failure_reason` | `preflight_failed` |
711 | `preflight_blockers` | `['unreadable_audio_assets', 'model_runtime_unavailable']` |
712 | `vector_table_report.resolved` | `true` |
713 | `audio_embedding_vector_768_count` | `0` |
714 324
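715 Notes: 325 ### Why slices and raw audio share `audio_object`
326 - Easier for new team members to understand
327 - asset/window share many fields
328 - Fewer special-purpose tables
716 329
717 - The semantic lane did not "do nothing": it really reached the PostgreSQL job scope / runtime / vector table / asset path checks 330 ### Why models and features share `feature_fact`
718 - The current container is simply blocked by two external conditions at once: 331 - No more one table per model
719 1. `/workspace/downloads/...` is not mounted 332 - No more sprawl from one table for fingerprints, another for embeddings, and so on
720 2. `torch / torchaudio / transformers` are not installed 333 - Better suited to swapping in MERT / MuQ / new models later
721 334
722 ### Evidence files 335 ### Why reference sets use `set_membership`
336 - song / asset / window / feature can all be attached to a set
337 - Switching between reference / eval / hot sets is handled uniformly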
723 338
724 - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_attempt.json` 339 ---
725 - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_guard_report.json`
726 - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_post_state.json`
727
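728 ### Why add the unique keys first
729
730 `audio_embedding` now has two new unique keys:
731
732 - `uq_audio_embedding_feature_window`
733 - `uq_audio_embedding_feature_asset`
734
735 The design intent is:
736
737 1. Re-running an embedding for the same `feature_set_id + window_id` upserts deterministically
738 2. Future asset-level embeddings can be idempotent independently
739 3. Idempotency is not left to an application-level "query first, then write"
740
741 This step applies equally to the upcoming `MERT / MuQ / ECAPA` work.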
742
743
744 ## New: Semantic preflight blocker matrix
745
746 To avoid trying jobs by hand one at a time in the next session, this round also adds:
747
748 - `acr-engine/scripts/run_phase1_embedding_preflight_matrix_live.py`
749 - `acr-engine/data/pgvector_eval/music20/phase1_embedding_preflight_matrix_report.json`
750
751 It will:
752
753 1. Reset `feature_extraction_job` back to `pending`
754 2. Run all semantic jobs in sequence (currently `mert 5s`, `mert 10s`, `muq 5s`, `ecapa 5s`)
755 3. Merge and output, for each job:
756 - `failure_reason`
757 - `preflight_blockers`
758 - `runtime_missing_dependencies`
759 - `vector_table_report`
760
761 ### Current matrix results
762
763 | job | model | vector table | blockers | runtime missing |
764 |---|---|---|---|---|
765 | 2 | `mert v1-95m` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` |
766 | 3 | `mert v1-95m` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` |
767 | 4 | `muq large-msd-iter` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` |
768 | 5 | `ecapa acr-baseline-v1` | `audio_embedding_vector_192` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `speechbrain` |
769
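770 Conclusions:
771
772 - The semantic lane failures now show a **stable matrix pattern**; they are not a sporadic problem unique to one job
773 - The `vector_table` path passes for all jobs
774 - What actually blocks the Phase-1 encoder-only rollout is:
775 1. Mounting the audio under `/workspace/downloads`
776 2. Installing the model runtime dependencies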
777
778
779 ## New: asset-level embedding upsert live validation
780
781 To move `uq_audio_embedding_feature_asset` from "declared in DDL" to "backed by real evidence", this round adds:
782
783 - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py`
784 - `acr-engine/data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json`
785
786 ### Validation steps
787
788 In the isolated schema `acr_asset_upsert_test`, the script:
789
790 1. Writes a minimal master-data graph: `song -> work -> recording -> asset`
791 2. Inserts the first asset-level embedding with `window_id IS NULL`
792 3. Issues a plain duplicate `INSERT`
793 4. Expects it to be rejected by `uq_audio_embedding_feature_asset`
794 5. Issues an `ON CONFLICT ... DO UPDATE`
795 6. Verifies that exactly `1` `audio_embedding` row and `1` `audio_embedding_vector_192` row remain
796
797 ### Current results
798
799 | Item | Result |
800 |---|---|
801 | First `embedding_id` | `1` |
802 | Plain duplicate `INSERT` | `UniqueViolation` |
803 | Unique key name | `uq_audio_embedding_feature_asset` |
804 | `embedding_id` after upsert | `1` |
805 | `same_embedding_id_reused` | `true` |
806 | `audio_embedding` row count | `1` |
807 | `audio_embedding_vector_192` row count | `1` |
808 | Final `checksum` | `checksum-v2` |
809
810 Conclusions:
811
812 - The asset-level unique key does not exist only on paper; it is enforced on live PostgreSQL
813 - A future asset-level semantic writer can reuse the same `ON CONFLICT (feature_set_id, asset_id) ...` contract directly
814
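The validation sequence (first insert, rejected duplicate, then `ON CONFLICT ... DO UPDATE`) can be reproduced in miniature. The sketch below is an illustration only: it uses Python's built-in `sqlite3` instead of PostgreSQL, a simplified `audio_embedding` table whose columns are assumptions, and `sqlite3.IntegrityError` in place of PostgreSQL's `UniqueViolation`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table; only the unique key name comes from the source.
conn.execute("""
    create table audio_embedding (
        embedding_id   integer primary key,
        feature_set_id integer not null,
        asset_id       integer not null,
        checksum       text not null,
        constraint uq_audio_embedding_feature_asset
            unique (feature_set_id, asset_id)
    )
""")

INSERT_SQL = (
    "insert into audio_embedding (feature_set_id, asset_id, checksum) "
    "values (?, ?, ?)"
)

# 1. First asset-level embedding.
conn.execute(INSERT_SQL, (1, 1, "checksum-v1"))

# 2. A plain duplicate insert must be rejected by the unique key.
try:
    conn.execute(INSERT_SQL, (1, 1, "checksum-v2"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# 3. The upsert updates the existing row instead of adding a new one.
conn.execute(
    INSERT_SQL
    + " on conflict (feature_set_id, asset_id)"
    + " do update set checksum = excluded.checksum",
    (1, 1, "checksum-v2"),
)

rows = conn.execute(
    "select embedding_id, checksum from audio_embedding"
).fetchall()
print(duplicate_rejected, rows)
```

The point of the pattern is that the database, not a "query first, then write" step in the application, decides whether a row already exists.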
815
816 ## New: Phase-1 worker contract smoke overview
817
818 So that the next startup does not require running the exact worker and the semantic matrix by hand separately, this round adds:
819
820 - `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py`
821 - `acr-engine/data/pgvector_eval/music20/phase1_worker_contract_smoke_report.json`
822
823 It will:
824
825 1. Reset `feature_extraction_job`
826 2. Run the exact lane once, non-dry-run
827 3. Reset the jobs again
828 4. Run the full semantic preflight matrix
829 5. Output one overview JSON
830
831 ### Current smoke overview results
832
833 | lane | Result |
834 |---|---|
835 | exact | `failed` |
836 | exact failure reason | `unreadable_audio_assets` |
837 | exact missing assets | `20` |
838 | semantic jobs | `4` |
839 | semantic failed jobs | `4` |
840 | semantic blockers | `model_runtime_unavailable`, `unreadable_audio_assets` |
841
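842 This shows:
843
844 - The PostgreSQL worker contract itself is already **stable**
845 - The blockers are now clear, and they are mainly environmental rather than orchestration-related:
846 - `/workspace/downloads` is not mounted
847 - The semantic model runtime is not installed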
848
849
850 ## New: semantic vector table negative-case matrix
851
852 To avoid attributing every semantic worker failure to "missing model / missing audio", this round adds:
853
854 - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py`
855 - `acr-engine/data/pgvector_eval/music20/embedding_vector_table_negative_matrix_report.json`
856
857 It validates, against a live database, 3 classes of vector table misconfiguration:
858
859 | case | schema | vector table | reason |
860 |---|---|---|---|
861 | `vector_table_dim_mismatch` | `acr_test` | `audio_embedding_vector_192` | `vector_table_dim_mismatch` |
862 | `vector_table_not_allowlisted` | `acr_test` | `audio_embedding_vector_1024` | `vector_table_not_allowlisted` |
863 | `vector_table_missing_in_schema` | `acr_vector_table_missing_test` | `audio_embedding_vector_768` | `vector_table_missing_in_schema` |
864
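865 In common:
866
867 - All 3 cases have `job_status = failed`
868 - `failure_reason = preflight_failed`
869 - Besides the environment blockers, `preflight_blockers` also carries the precise vector-table blocker
870
871 This shows:
872
873 - The semantic preflight can already surface "runtime environment problems" and "configuration errors" as separate layers
874 - From now on, `vector_table_report.reason` alone is enough to tell a DDL/configuration error from a model runtime/audio mount error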
875
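The layering described above can be made explicit when reading a report. The following is a minimal sketch, assuming the blocker strings listed in this document are the complete set; `split_blockers` and the two grouping constants are hypothetical names, not part of the project's scripts.

```python
from typing import Dict, Iterable, List

# Blocker names are taken from the tables above; the grouping is an assumption.
ENVIRONMENT_BLOCKERS = {"unreadable_audio_assets", "model_runtime_unavailable"}
CONFIG_BLOCKERS = {
    "vector_table_dim_mismatch",
    "vector_table_not_allowlisted",
    "vector_table_missing_in_schema",
}


def split_blockers(blockers: Iterable[str]) -> Dict[str, List[str]]:
    """Group preflight blockers into environment, config, and unknown."""
    grouped: Dict[str, List[str]] = {"environment": [], "config": [], "unknown": []}
    for name in blockers:
        if name in ENVIRONMENT_BLOCKERS:
            grouped["environment"].append(name)
        elif name in CONFIG_BLOCKERS:
            grouped["config"].append(name)
        else:
            grouped["unknown"].append(name)
    return grouped


# Example: a job blocked by the environment and by a dimension mismatch.
grouped = split_blockers(
    ["unreadable_audio_assets", "model_runtime_unavailable", "vector_table_dim_mismatch"]
)
print(grouped)
```

Keeping an explicit `unknown` bucket means a newly introduced blocker is surfaced rather than silently counted as an environment problem.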
876
877 ## New: Phase-1 prerequisites audit
878
879 To stop guessing by eye whether "the audio mount is missing or the model runtime is missing", this round adds:
880
881 - `acr-engine/scripts/run_phase1_prereq_audit_live.py`
882 - `acr-engine/data/pgvector_eval/music20/phase1_prereq_audit_report.json`
883
884 ### Current audit results
885
886 | Metric | Result |
887 |---|---|
888 | `downloads_root_exists` | `false` |
889 | `total_jobs` | `5` |
890 | `ready_jobs` | `0` |
891 | `blocked_jobs` | `5` |
892 | Union of missing dependencies | `speechbrain`, `torch`, `torchaudio`, `transformers` |
893
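894 By job:
895 340
896 - `chromaprint`: runnable as far as dependencies go, but blocked by the missing `/workspace/downloads` 341 ## 7. Currently recommended implementation order
897 - `mert / muq`: blocked by both the missing `/workspace/downloads` and the missing `torch/torchaudio/transformers`
898 - `ecapa`: blocked by both the missing `/workspace/downloads` and the missing `torch/torchaudio/speechbrain`
899 342
900 With this, "why does it not run right now" can be answered from a single JSON report, without re-running anything by hand. 343 1. Create `media_entity` first
344 2. Then create `audio_object`
345 3. Then create `feature_fact`
346 4. Finally create `set_membership`
347 5. First get `song -> asset -> window -> embedding/fingerprint` working end to end
348 6. Then add the heavier governance capabilities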
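The prerequisites audit reported above comes down to two checks per job: does the audio root exist, and are the runtime modules importable. The sketch below is an assumption about how such a check could look, not the contents of `run_phase1_prereq_audit_live.py`; `audit_prerequisites` is a hypothetical helper, and only the output key `downloads_root_exists` is taken from the report table.

```python
import importlib.util
from pathlib import Path
from typing import Dict, Iterable


def audit_prerequisites(downloads_root: str,
                        required_modules: Iterable[str]) -> Dict[str, object]:
    """Report whether the audio root exists and which modules are missing.

    find_spec() only locates a top-level module; it does not import it,
    so the check stays cheap even for heavy packages.
    """
    missing = sorted(
        name for name in required_modules
        if importlib.util.find_spec(name) is None
    )
    root_exists = Path(downloads_root).is_dir()
    return {
        "downloads_root_exists": root_exists,
        "missing_dependencies": missing,
        "ready": root_exists and not missing,
    }


# Example: the dependency set listed for the mert / muq jobs.
report = audit_prerequisites(
    "/workspace/downloads", ["torch", "torchaudio", "transformers"]
)
print(report)
```

The printed result depends on the machine it runs on, so no expected output is given here.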
......
...@@ -59,7 +59,7 @@ cd /workspace/acr-engine ...@@ -59,7 +59,7 @@ cd /workspace/acr-engine
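59 ## 3. The project in one sentence 59 ## 3. The project in one sentence
60 60
61 We are building a music ACR system for **copyright protection / song recognition / version attribution**, 61 We are building a music ACR system for **copyright protection / song recognition / version attribution**,
62 whose goal is to quickly locate the correct `song_id / work / recording` attribution among `100w` (1 million) audio files and roughly `30w` (300,000) songs 62 whose goal is to quickly locate the correct `song_id` attribution among `100w` (1 million) audio files and roughly `30w` (300,000) songs; in the current phase the version/recording is not a required return object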
63 63
64 --- 64 ---
65 65
...@@ -71,7 +71,12 @@ cd /workspace/acr-engine ...@@ -71,7 +71,12 @@ cd /workspace/acr-engine
71 - semantic lane challenger:`MuQ` 71 - semantic lane challenger:`MuQ`
72 - historical baseline:`ECAPA` 72 - historical baseline:`ECAPA`
73 73
74 ### Data mainline 74 ### Current Phase-1 minimal mainline
75 ```text
76 song -> asset -> window
77 ```
78
79 ### Evolvable full mainline
75 ```text 80 ```text
76 canonical_song -> work -> recording -> recording_asset -> audio_window 81 canonical_song -> work -> recording -> recording_asset -> audio_window
77 ``` 82 ```
...@@ -139,6 +144,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> ...@@ -139,6 +144,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint ->
139 - [README.md](./README.md) 144 - [README.md](./README.md)
140 - [session-handoff.md](./session-handoff.md) 145 - [session-handoff.md](./session-handoff.md)
141 - [postgresql-data-model.md](./postgresql-data-model.md) 146 - [postgresql-data-model.md](./postgresql-data-model.md)
147 - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)
142 - [phase1-worker-contract.md](./phase1-worker-contract.md) 148 - [phase1-worker-contract.md](./phase1-worker-contract.md)
143 149
144 ### Scripts 150 ### Scripts
......
1 #!/usr/bin/env /usr/local/miniconda3/bin/python
2 from __future__ import annotations
3
4 import argparse
5 import fnmatch
6 import re
7 import sys
8 from pathlib import Path
9
10 LINK_RE = re.compile(r'!?(?:\[([^\]]*)\])\(([^)]+)\)')
11 SKIP_PREFIXES = ('http://', 'https://', 'mailto:', 'tel:', '#')
12 DEFAULT_EXCLUDES = ['CHANGELOG.md']
13
14
15 def should_check(target: str) -> bool:
16 target = target.strip()
17 return bool(target) and not target.startswith(SKIP_PREFIXES)
18
19
20 def normalize_target(raw: str) -> str:
21 target = raw.strip()
22 if target.startswith('<') and target.endswith('>'):
23 target = target[1:-1]
24 target = target.split('#', 1)[0].split('?', 1)[0].strip()
25 return target
26
27
28 def iter_markdown_files(root: Path, excludes: list[str]) -> list[Path]:
29 files: list[Path] = []
30 for path in sorted(root.rglob('*.md')):
31 rel = path.relative_to(root).as_posix()
32 if any(fnmatch.fnmatch(rel, pattern) for pattern in excludes):
33 continue
34 files.append(path)
35 return files
36
37
38 def scan_markdown_file(path: Path, root: Path) -> list[tuple[str, str]]:
39 missing: list[tuple[str, str]] = []
40 text = path.read_text(encoding='utf-8')
41 for _, raw_target in LINK_RE.findall(text):
42 if not should_check(raw_target):
43 continue
44 target = normalize_target(raw_target)
45 if not target:
46 continue
47 resolved = (path.parent / target).resolve()
48 if not resolved.exists():
49 missing.append((path.relative_to(root).as_posix(), raw_target))
50 return missing
51
52
53 if __name__ == '__main__':
54 parser = argparse.ArgumentParser(description='Check relative Markdown links for missing files.')
55 parser.add_argument('--root', default='docs', help='Root directory containing markdown files')
56 parser.add_argument('--exclude', action='append', default=[], help='Glob patterns relative to root to exclude')
57 args = parser.parse_args()
58
59 root = Path(args.root).resolve()
60 if not root.exists():
61 print(f'root not found: {root}', file=sys.stderr)
62 sys.exit(2)
63
64 excludes = DEFAULT_EXCLUDES + list(args.exclude)
65 files = iter_markdown_files(root, excludes)
66 failures: list[tuple[str, str]] = []
67 for md in files:
68 failures.extend(scan_markdown_file(md, root))
69
70 if failures:
71 print('Missing relative markdown targets:')
72 for source, target in failures:
73 print(f'- {source}: {target}')
74 sys.exit(1)
75
76 print(f'OK: checked {len(files)} markdown files under {root} (excluded: {excludes})')