Commit fef8f438 fef8f4387d95be5d4a017ba55150b6fa7463f1f6 by cnb.bofCdSsphPA

Bootstrap the Phase-1 model registry on live PostgreSQL

Constraint: Continue the Ralph loop without waiting on missing business sample mounts, while still leaving a push-ready implementation and documentation trail
Rejected: Keep Phase-1 registry setup as static SQL snippets only | It slows live validation and leaves no machine-checkable bootstrap path
Confidence: high
Scope-risk: narrow
Directive: Treat model_registry/feature_set_registry/reference_set_registry as the mandatory entrypoint before any future MERT/MuQ extraction jobs
Tested: /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json; /usr/local/miniconda3/bin/python -m py_compile scripts/bootstrap_phase1_model_registry_live.py; git diff --check -- acr-engine/scripts/bootstrap_phase1_model_registry_live.py acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Actual MERT/MuQ embedding extraction, hard-case type_8/type_16 live queries, multi-recording/cover-lane retrieval
1 parent ea51b9c1
{
  "schema": "acr_test",
  "dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2",
  "models": [
    {
      "model_id": 2,
      "model_name": "chromaprint",
      "model_version": "v1",
      "output_embedding_dim": null
    },
    {
      "model_id": 3,
      "model_name": "mert",
      "model_version": "v1-95m",
      "output_embedding_dim": 768
    },
    {
      "model_id": 4,
      "model_name": "muq",
      "model_version": "large-msd-iter",
      "output_embedding_dim": 768
    },
    {
      "model_id": 5,
      "model_name": "ecapa",
      "model_version": "acr-baseline-v1",
      "output_embedding_dim": 192
    }
  ],
  "feature_sets": [
    {
      "feature_set_id": 2,
      "model_name": "chromaprint",
      "model_version": "v1",
      "feature_name": "fingerprint_asset",
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": null,
      "distance_metric": "hamming"
    },
    {
      "feature_set_id": 3,
      "model_name": "mert",
      "model_version": "v1-95m",
      "feature_name": "semantic_embedding",
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 768,
      "distance_metric": "cosine"
    },
    {
      "feature_set_id": 4,
      "model_name": "mert",
      "model_version": "v1-95m",
      "feature_name": "semantic_embedding",
      "window_sec": 10.0,
      "hop_sec": 5.0,
      "embedding_dim": 768,
      "distance_metric": "cosine"
    },
    {
      "feature_set_id": 5,
      "model_name": "muq",
      "model_version": "large-msd-iter",
      "feature_name": "semantic_embedding",
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 768,
      "distance_metric": "cosine"
    },
    {
      "feature_set_id": 6,
      "model_name": "ecapa",
      "model_version": "acr-baseline-v1",
      "feature_name": "semantic_embedding",
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 192,
      "distance_metric": "cosine"
    }
  ],
  "reference_set": {
    "reference_set_id": 2,
    "set_name": "phase1_hot_reference_v1",
    "encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter"
  },
  "counts": {
    "model_registry": 5,
    "feature_set_registry": 6,
    "reference_set_registry": 2
  }
}
\ No newline at end of file
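The report above can be consumed mechanically by later jobs. The following is a minimal sketch of how that might look, not code from the repository: the helper names `find_feature_set_id` and `dims_consistent` are hypothetical, and the embedded excerpt copies only a few entries from the report for brevity.

```python
import json

# Excerpt of phase1_registry_bootstrap_report.json (values copied from the report).
REPORT_EXCERPT = """
{
  "models": [
    {"model_id": 2, "model_name": "chromaprint", "model_version": "v1", "output_embedding_dim": null},
    {"model_id": 3, "model_name": "mert", "model_version": "v1-95m", "output_embedding_dim": 768},
    {"model_id": 5, "model_name": "ecapa", "model_version": "acr-baseline-v1", "output_embedding_dim": 192}
  ],
  "feature_sets": [
    {"feature_set_id": 2, "model_name": "chromaprint", "model_version": "v1",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": null},
    {"feature_set_id": 3, "model_name": "mert", "model_version": "v1-95m",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": 768},
    {"feature_set_id": 4, "model_name": "mert", "model_version": "v1-95m",
     "window_sec": 10.0, "hop_sec": 5.0, "embedding_dim": 768},
    {"feature_set_id": 6, "model_name": "ecapa", "model_version": "acr-baseline-v1",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": 192}
  ]
}
"""


def find_feature_set_id(report, model_name, window_sec):
    """Hypothetical helper: return the feature_set_id for a model/window pair, or None."""
    for fs in report["feature_sets"]:
        if fs["model_name"] == model_name and fs["window_sec"] == window_sec:
            return fs["feature_set_id"]
    return None


def dims_consistent(report):
    """Hypothetical helper: every feature set must reference a listed model
    and carry the same embedding dimension as that model."""
    model_dims = {
        (m["model_name"], m["model_version"]): m["output_embedding_dim"]
        for m in report["models"]
    }
    for fs in report["feature_sets"]:
        key = (fs["model_name"], fs["model_version"])
        if key not in model_dims or model_dims[key] != fs["embedding_dim"]:
            return False
    return True


report = json.loads(REPORT_EXCERPT)
mert_10s_id = find_feature_set_id(report, "mert", 10.0)
all_dims_ok = dims_consistent(report)
```

In practice the report would be loaded from the file on disk with `json.load` rather than from an inline string; the inline excerpt only keeps the sketch self-contained.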
 ## 2026-06-04

+- Added `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`, turning the Phase-1 initialization of `chromaprint / mert / muq / ecapa` and the corresponding `feature_set_registry / reference_set_registry` entries into a live bootstrap script that connects directly to PostgreSQL; verified on the `acr_test` schema.
 - Documented a blocker: the current container lacks `/workspace/downloads`, so this round could not continue generating the live PostgreSQL query JSONL for `type_8 / type_16` directly from the business sample directory; this environment prerequisite is now recorded in the handoff and PostgreSQL samples documents.
 - Updated [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`, extending lineage negative-case validation from a single `audio_window` case to the three core triggers `recording` / `audio_window` / `audio_embedding`; the live pgvector report was rerun and confirms the retrieval metrics are unchanged. Also recorded that the `py_compile` and `diff --check` mechanical checks pass.
 - Added [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md), filling in real persistence samples for `acr_pg_schema_v2.sql`, live `pgvector` retrieval validation, lineage trigger negative-case tests, and an interpretation of the current recall/confusion results.
......
@@ -216,3 +216,67 @@ flowchart TD
 6. `phase1_hot_reference_v1`

 This gives the data, model, and index tracks each a stable entry point.
+
+---
+
+## 8. Live PostgreSQL bootstrap script
+
+To avoid running SQL by hand every time, this repository now provides a live bootstrap script that connects directly to PostgreSQL:
+
+- `acr-engine/scripts/bootstrap_phase1_model_registry_live.py`
+
+Purpose:
+- Write `model_registry` rows into the target schema
+- Write `feature_set_registry` rows
+- Write `reference_set_registry` rows
+- Use an **idempotent upsert / ensure** approach, so the script is safe to run repeatedly
+
+### 8.1 Command
+
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py \
+  --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
+  --schema acr_test \
+  --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json
+```
+
+### 8.2 Currently verified results (acr_test)
+
+This round the script was run for real against the `acr_test` schema, with the following results:
+
+| Object | Count |
+|---|---:|
+| `model_registry` | `5` |
+| `feature_set_registry` | `6` |
+| `reference_set_registry` | `2` |
+
+The newly added Phase-1 objects are:
+
+#### models
+- `chromaprint v1`
+- `mert v1-95m`
+- `muq large-msd-iter`
+- `ecapa acr-baseline-v1`
+
+#### feature sets
+- `chromaprint fingerprint_asset`
+- `mert semantic_embedding 5s/2.5s`
+- `mert semantic_embedding 10s/5s`
+- `muq semantic_embedding 5s/2.5s`
+- `ecapa semantic_embedding 5s/2.5s`
+
+#### reference set
+- `phase1_hot_reference_v1`
+
+### 8.3 Current artifact
+
+- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
+
+This file records:
+- model_id
+- feature_set_id
+- reference_set_id
+- final table counts
+
+As a result, the next session does not need to start from hand-run SQL snippets and can pick up directly from the live bootstrap script.
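The "idempotent upsert / ensure" behaviour described in section 8 can be illustrated with a small sketch. This is not the repository's script: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere, the table definition is a simplified assumption (the real DDL lives in `acr_pg_schema_v2.sql`), and `ensure_model` is a hypothetical helper name. Both SQLite and PostgreSQL support the `INSERT ... ON CONFLICT ... DO UPDATE` form used here.

```python
import sqlite3

# Simplified, assumed table shape; not the real model_registry DDL.
DDL = """
CREATE TABLE IF NOT EXISTS model_registry (
    model_id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    output_embedding_dim INTEGER,
    UNIQUE (model_name, model_version)
)
"""

# Insert the row, or update it in place if (model_name, model_version) exists.
UPSERT = """
INSERT INTO model_registry (model_name, model_version, output_embedding_dim)
VALUES (?, ?, ?)
ON CONFLICT (model_name, model_version)
DO UPDATE SET output_embedding_dim = excluded.output_embedding_dim
"""


def ensure_model(conn, name, version, dim):
    """Hypothetical helper: upsert one model row and return its model_id."""
    conn.execute(UPSERT, (name, version, dim))
    row = conn.execute(
        "SELECT model_id FROM model_registry "
        "WHERE model_name = ? AND model_version = ?",
        (name, version),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Running the same ensure step twice must not create a second row.
first_id = ensure_model(conn, "mert", "v1-95m", 768)
second_id = ensure_model(conn, "mert", "v1-95m", 768)
row_count = conn.execute("SELECT COUNT(*) FROM model_registry").fetchone()[0]
```

Keying the conflict target on `(model_name, model_version)` is what makes reruns safe in this sketch: the second call resolves to the existing row and returns the same id instead of inserting a duplicate.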
......
@@ -62,8 +62,10 @@
 |---|---|
 | Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` |
 | Live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` |
+| Registry bootstrap script | `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` |
 | Live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
 | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
+| Registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` |
 | Historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |

 ---
@@ -379,6 +381,23 @@ flowchart LR

 ## Recommended next steps

+### New this round: the Phase-1 registry can now be bootstrapped live
+
+In addition to the live retrieval script, this round also added:
+
+- `acr-engine/scripts/bootstrap_phase1_model_registry_live.py`
+
+It has already written the following to the `acr_test` schema for real:
+- `chromaprint`
+- `mert`
+- `muq`
+- `ecapa`
+- the corresponding feature sets
+- `phase1_hot_reference_v1`
+
+Corresponding live report:
+- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
+
 ### Route 1: Continue PostgreSQL engineering

 1. Generalize `live_pgvector_music20_eval.py` into:
......
@@ -24,6 +24,7 @@
 - The SOTA evolution path is settled: **Phase-1 goes encoder-only first**
 - The PostgreSQL master-data and feature-registry DDL has landed as the recommended schema
 - The Phase-1 implementation checklist and the model/feature/reference set initialization guide are complete
+- The Phase-1 `model_registry / feature_set_registry / reference_set_registry` bootstrap has been run for real and verified on the `acr_test` schema

 The most important next step is not to keep writing plans, but:

@@ -180,6 +181,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
 - Code has been pushed to the remote
 - The PostgreSQL `acr_test` live path has been re-verified: all three lineage triggers (`recording` / `audio_window` / `audio_embedding`) have real negative-case evidence
 - Mechanical checks are complete: `live_pgvector_music20_eval.py` passes `py_compile`, and the related changes pass `diff --check`
+- The Phase-1 registry bootstrap has been written for real on the PostgreSQL `acr_test` schema: `chromaprint / mert / muq / ecapa` + 5 feature sets + `phase1_hot_reference_v1`

 ### Unverified / remaining gaps
 - **MERT / MuQ encoder-only feature extraction has not actually been run**
......