Bootstrap the Phase-1 model registry on live PostgreSQL
Constraint: Continue the Ralph loop without waiting on missing business sample mounts, while still leaving a push-ready implementation and documentation trail
Rejected: Keep Phase-1 registry setup as static SQL snippets only | It slows live validation and leaves no machine-checkable bootstrap path
Confidence: high
Scope-risk: narrow
Directive: Treat model_registry/feature_set_registry/reference_set_registry as the mandatory entrypoint before any future MERT/MuQ extraction jobs
Tested: /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json; /usr/local/miniconda3/bin/python -m py_compile scripts/bootstrap_phase1_model_registry_live.py; git diff --check -- acr-engine/scripts/bootstrap_phase1_model_registry_live.py acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Actual MERT/MuQ embedding extraction, hard-case type_8/type_16 live queries, multi-recording/cover-lane retrieval
Showing 6 changed files with 178 additions and 0 deletions
| 1 | { | ||
| 2 | "schema": "acr_test", | ||
| 3 | "dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2", | ||
| 4 | "models": [ | ||
| 5 | { | ||
| 6 | "model_id": 2, | ||
| 7 | "model_name": "chromaprint", | ||
| 8 | "model_version": "v1", | ||
| 9 | "output_embedding_dim": null | ||
| 10 | }, | ||
| 11 | { | ||
| 12 | "model_id": 3, | ||
| 13 | "model_name": "mert", | ||
| 14 | "model_version": "v1-95m", | ||
| 15 | "output_embedding_dim": 768 | ||
| 16 | }, | ||
| 17 | { | ||
| 18 | "model_id": 4, | ||
| 19 | "model_name": "muq", | ||
| 20 | "model_version": "large-msd-iter", | ||
| 21 | "output_embedding_dim": 768 | ||
| 22 | }, | ||
| 23 | { | ||
| 24 | "model_id": 5, | ||
| 25 | "model_name": "ecapa", | ||
| 26 | "model_version": "acr-baseline-v1", | ||
| 27 | "output_embedding_dim": 192 | ||
| 28 | } | ||
| 29 | ], | ||
| 30 | "feature_sets": [ | ||
| 31 | { | ||
| 32 | "feature_set_id": 2, | ||
| 33 | "model_name": "chromaprint", | ||
| 34 | "model_version": "v1", | ||
| 35 | "feature_name": "fingerprint_asset", | ||
| 36 | "window_sec": 5.0, | ||
| 37 | "hop_sec": 2.5, | ||
| 38 | "embedding_dim": null, | ||
| 39 | "distance_metric": "hamming" | ||
| 40 | }, | ||
| 41 | { | ||
| 42 | "feature_set_id": 3, | ||
| 43 | "model_name": "mert", | ||
| 44 | "model_version": "v1-95m", | ||
| 45 | "feature_name": "semantic_embedding", | ||
| 46 | "window_sec": 5.0, | ||
| 47 | "hop_sec": 2.5, | ||
| 48 | "embedding_dim": 768, | ||
| 49 | "distance_metric": "cosine" | ||
| 50 | }, | ||
| 51 | { | ||
| 52 | "feature_set_id": 4, | ||
| 53 | "model_name": "mert", | ||
| 54 | "model_version": "v1-95m", | ||
| 55 | "feature_name": "semantic_embedding", | ||
| 56 | "window_sec": 10.0, | ||
| 57 | "hop_sec": 5.0, | ||
| 58 | "embedding_dim": 768, | ||
| 59 | "distance_metric": "cosine" | ||
| 60 | }, | ||
| 61 | { | ||
| 62 | "feature_set_id": 5, | ||
| 63 | "model_name": "muq", | ||
| 64 | "model_version": "large-msd-iter", | ||
| 65 | "feature_name": "semantic_embedding", | ||
| 66 | "window_sec": 5.0, | ||
| 67 | "hop_sec": 2.5, | ||
| 68 | "embedding_dim": 768, | ||
| 69 | "distance_metric": "cosine" | ||
| 70 | }, | ||
| 71 | { | ||
| 72 | "feature_set_id": 6, | ||
| 73 | "model_name": "ecapa", | ||
| 74 | "model_version": "acr-baseline-v1", | ||
| 75 | "feature_name": "semantic_embedding", | ||
| 76 | "window_sec": 5.0, | ||
| 77 | "hop_sec": 2.5, | ||
| 78 | "embedding_dim": 192, | ||
| 79 | "distance_metric": "cosine" | ||
| 80 | } | ||
| 81 | ], | ||
| 82 | "reference_set": { | ||
| 83 | "reference_set_id": 2, | ||
| 84 | "set_name": "phase1_hot_reference_v1", | ||
| 85 | "encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter" | ||
| 86 | }, | ||
| 87 | "counts": { | ||
| 88 | "model_registry": 5, | ||
| 89 | "feature_set_registry": 6, | ||
| 90 | "reference_set_registry": 2 | ||
| 91 | } | ||
| 92 | } | ||
| ... | \ No newline at end of file | ... | \ No newline at end of file |
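The report above is meant to be machine-checkable, so a small consistency check can be run against it. The sketch below is an illustration only: `check_report` is a hypothetical helper, the embedded JSON is a trimmed excerpt of the report, and the rule that a feature set's `embedding_dim` should equal its model's `output_embedding_dim` is an assumption inferred from the values shown, not something the commit states.

```python
import json

# Trimmed excerpt of the bootstrap report (two models, three feature sets).
REPORT_JSON = """
{
  "models": [
    {"model_id": 2, "model_name": "chromaprint", "model_version": "v1",
     "output_embedding_dim": null},
    {"model_id": 3, "model_name": "mert", "model_version": "v1-95m",
     "output_embedding_dim": 768}
  ],
  "feature_sets": [
    {"feature_set_id": 2, "model_name": "chromaprint", "model_version": "v1",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": null},
    {"feature_set_id": 3, "model_name": "mert", "model_version": "v1-95m",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": 768},
    {"feature_set_id": 4, "model_name": "mert", "model_version": "v1-95m",
     "window_sec": 10.0, "hop_sec": 5.0, "embedding_dim": 768}
  ]
}
"""


def check_report(report: dict) -> list[str]:
    """Return human-readable problems; an empty list means the report is consistent."""
    problems = []
    # Index registered models by (name, version) so feature sets can be matched.
    dims = {
        (m["model_name"], m["model_version"]): m["output_embedding_dim"]
        for m in report["models"]
    }
    for fs in report["feature_sets"]:
        key = (fs["model_name"], fs["model_version"])
        fs_id = fs["feature_set_id"]
        if key not in dims:
            problems.append(f"feature_set {fs_id}: unregistered model {key}")
            continue
        # Assumed rule: the feature set dimension mirrors the model output dimension.
        if fs["embedding_dim"] != dims[key]:
            problems.append(f"feature_set {fs_id}: embedding_dim mismatch")
        # The hop must be positive and no longer than the window.
        if not 0 < fs["hop_sec"] <= fs["window_sec"]:
            problems.append(f"feature_set {fs_id}: invalid window/hop")
    return problems


report = json.loads(REPORT_JSON)
print(check_report(report))
```

To check the real file instead of the excerpt, the same function could be applied to the result of `json.load` on `phase1_registry_bootstrap_report.json`.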
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
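| 3 | - Added `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`: initialization of the Phase-1 `chromaprint / mert / muq / ecapa` models and their `feature_set_registry / reference_set_registry` entries is now a live bootstrap script that connects directly to PostgreSQL, verified against the `acr_test` schema. | ||
| 3 | - Documented a blocking fact: the current container lacks `/workspace/downloads`, so this round could not continue generating live PostgreSQL query JSONL for `type_8 / type_16` directly from the business sample directory; this environment prerequisite is now recorded in the handoff and the PostgreSQL samples document. | 4 | - Documented a blocking fact: the current container lacks `/workspace/downloads`, so this round could not continue generating live PostgreSQL query JSONL for `type_8 / type_16` directly from the business sample directory; this environment prerequisite is now recorded in the handoff and the PostgreSQL samples document. |
| 4 | - Updated [PostgreSQL ingestion samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`: lineage negative-case validation now covers the three core triggers `recording` / `audio_window` / `audio_embedding` instead of a single `audio_window` case, and the live pgvector report was rerun to confirm that retrieval metrics are unchanged; also recorded that `py_compile` and `diff --check` pass as mechanical verification. | 5 | - Updated [PostgreSQL ingestion samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`: lineage negative-case validation now covers the three core triggers `recording` / `audio_window` / `audio_embedding` instead of a single `audio_window` case, and the live pgvector report was rerun to confirm that retrieval metrics are unchanged; also recorded that `py_compile` and `diff --check` pass as mechanical verification. |
| 5 | - Added [PostgreSQL ingestion samples and live test pipeline](./postgres_db_schema_samples.md), covering real ingestion samples for `acr_pg_schema_v2.sql`, live `pgvector` retrieval validation, lineage trigger negative-case tests, and an interpretation of the current recall/confusion results. | 6 | - Added [PostgreSQL ingestion samples and live test pipeline](./postgres_db_schema_samples.md), covering real ingestion samples for `acr_pg_schema_v2.sql`, live `pgvector` retrieval validation, lineage trigger negative-case tests, and an interpretation of the current recall/confusion results. | ... | ... |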
| ... | @@ -216,3 +216,67 @@ flowchart TD | ... | @@ -216,3 +216,67 @@ flowchart TD |
| 216 | 6. `phase1_hot_reference_v1` | 216 | 6. `phase1_hot_reference_v1` |
| 217 | 217 | ||
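| 218 | With this, the data, model, and index tracks each have a stable entry point. | 218 | With this, the data, model, and index tracks each have a stable entry point. |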
| 219 | |||
| 220 | --- | ||
| 221 | |||
| 222 | ## 8. Live PostgreSQL bootstrap script | ||
| 223 | |||
| 224 | To avoid running SQL by hand each time, this repository now provides a live bootstrap script that connects directly to PostgreSQL: | ||
| 225 | |||
| 226 | - `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | ||
| 227 | |||
| 228 | Purpose: | ||
| 229 | - Write `model_registry` into the target schema | ||
| 230 | - Write `feature_set_registry` | ||
| 231 | - Write `reference_set_registry` | ||
| 232 | - Uses an **idempotent upsert / ensure** approach, so it is safe to run repeatedly | ||
| 233 | |||
| 234 | ### 8.1 Command | ||
| 235 | |||
| 236 | ```bash | ||
| 237 | cd /workspace/acr-engine | ||
| 238 | /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py \ | ||
| 239 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 240 | --schema acr_test \ | ||
| 241 | --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json | ||
| 242 | ``` | ||
| 243 | |||
| 244 | ### 8.2 Currently verified result (acr_test) | ||
| 245 | |||
| 246 | This round was actually run against the `acr_test` schema; the final table counts after the run are: | ||
| 247 | |||
| 248 | | Object | Count | | ||
| 249 | |---|---:| | ||
| 250 | | `model_registry` | `5` | | ||
| 251 | | `feature_set_registry` | `6` | | ||
| 252 | | `reference_set_registry` | `2` | | ||
| 253 | |||
| 254 | The newly added Phase-1 objects include: | ||
| 255 | |||
| 256 | #### models | ||
| 257 | - `chromaprint v1` | ||
| 258 | - `mert v1-95m` | ||
| 259 | - `muq large-msd-iter` | ||
| 260 | - `ecapa acr-baseline-v1` | ||
| 261 | |||
| 262 | #### feature sets | ||
| 263 | - `chromaprint fingerprint_asset` | ||
| 264 | - `mert semantic_embedding 5s/2.5s` | ||
| 265 | - `mert semantic_embedding 10s/5s` | ||
| 266 | - `muq semantic_embedding 5s/2.5s` | ||
| 267 | - `ecapa semantic_embedding 5s/2.5s` | ||
| 268 | |||
| 269 | #### reference set | ||
| 270 | - `phase1_hot_reference_v1` | ||
| 271 | |||
| 272 | ### 8.3 Current artifact | ||
| 273 | |||
| 274 | - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | ||
| 275 | |||
| 276 | This file already records: | ||
| 277 | - model_id | ||
| 278 | - feature_set_id | ||
| 279 | - reference_set_id | ||
| 280 | - final table counts | ||
| 281 | |||
| 282 | As a result, the next session does not need to start by running SQL snippets by hand and can pick up directly from the live bootstrap script. | ... | ... |
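Section 8 describes the script as an idempotent upsert / ensure, but the script's diff is not shown on this page. The following is a minimal sketch of that pattern, not the script's actual code. It uses the standard-library `sqlite3` module as a stand-in so it runs without a server; the table columns, the unique key on `(model_name, model_version)`, and the helper name `ensure_model` are all assumptions made for illustration. PostgreSQL accepts a similar `INSERT ... ON CONFLICT ... DO UPDATE` statement.

```python
import sqlite3

# Phase-1 models as listed in the bootstrap report: (name, version, output dim).
PHASE1_MODELS = [
    ("chromaprint", "v1", None),
    ("mert", "v1-95m", 768),
    ("muq", "large-msd-iter", 768),
    ("ecapa", "acr-baseline-v1", 192),
]


def ensure_model(conn, name, version, dim):
    """Insert the row if it is missing, refresh it if present, and return its id."""
    conn.execute(
        """
        INSERT INTO model_registry (model_name, model_version, output_embedding_dim)
        VALUES (?, ?, ?)
        ON CONFLICT (model_name, model_version)
        DO UPDATE SET output_embedding_dim = excluded.output_embedding_dim
        """,
        (name, version, dim),
    )
    row = conn.execute(
        "SELECT model_id FROM model_registry"
        " WHERE model_name = ? AND model_version = ?",
        (name, version),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real DDL lives in acr_pg_schema_v2.sql.
conn.execute(
    """
    CREATE TABLE model_registry (
        model_id INTEGER PRIMARY KEY,
        model_name TEXT NOT NULL,
        model_version TEXT NOT NULL,
        output_embedding_dim INTEGER,
        UNIQUE (model_name, model_version)
    )
    """
)

# Running the bootstrap twice must yield the same ids and no duplicate rows.
first_ids = [ensure_model(conn, *m) for m in PHASE1_MODELS]
second_ids = [ensure_model(conn, *m) for m in PHASE1_MODELS]
row_count = conn.execute("SELECT COUNT(*) FROM model_registry").fetchone()[0]
print(first_ids == second_ids, row_count)
```

Keying the upsert on the natural key rather than on the surrogate id is what lets a rerun return stable ids, which in turn is what makes the ids recorded in the report reusable across sessions.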
| ... | @@ -62,8 +62,10 @@ | ... | @@ -62,8 +62,10 @@ |
| 62 | |---|---| | 62 | |---|---| |
| 63 | | Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` | | 63 | | Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` | |
| 64 | | live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` | | 64 | | live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` | |
| 65 | | registry bootstrap script | `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | | ||
| 65 | | live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` | | 66 | | live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` | |
| 66 | | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` | | 67 | | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` | |
| 68 | | registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | | ||
| 67 | | historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` | | 69 | | historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` | |
| 68 | 70 | ||
| 69 | --- | 71 | --- |
| ... | @@ -379,6 +381,23 @@ flowchart LR | ... | @@ -379,6 +381,23 @@ flowchart LR |
| 379 | 381 | ||
| 380 | ## Recommended next steps | 382 | ## Recommended next steps |
| 381 | 383 | ||
| 384 | ### New this round: the Phase-1 registry can now be bootstrapped live | ||
| 385 | |||
| 386 | Besides the live retrieval script, this round also added: | ||
| 387 | |||
| 388 | - `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | ||
| 389 | |||
| 390 | It has actually written the following to the `acr_test` schema: | ||
| 391 | - `chromaprint` | ||
| 392 | - `mert` | ||
| 393 | - `muq` | ||
| 394 | - `ecapa` | ||
| 395 | - the corresponding feature sets | ||
| 396 | - `phase1_hot_reference_v1` | ||
| 397 | |||
| 398 | Corresponding live report: | ||
| 399 | - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | ||
| 400 | |||
| 382 | ### Route 1: continue PostgreSQL engineering work | 401 | ### Route 1: continue PostgreSQL engineering work |
| 383 | 402 | ||
| 384 | 1. Generalize `live_pgvector_music20_eval.py` into: | 403 | 1. Generalize `live_pgvector_music20_eval.py` into: | ... | ... |
| ... | @@ -24,6 +24,7 @@ | ... | @@ -24,6 +24,7 @@ |
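| 24 | - The SOTA evolution path is settled: **Phase-1 goes encoder-only first** | 24 | - The SOTA evolution path is settled: **Phase-1 goes encoder-only first** |
| 25 | - The PostgreSQL master-data and feature-registry DDL has landed as the recommended schema | 25 | - The PostgreSQL master-data and feature-registry DDL has landed as the recommended schema |
| 26 | - The Phase-1 implementation checklist and the model/feature/reference set initialization handbook are complete | 26 | - The Phase-1 implementation checklist and the model/feature/reference set initialization handbook are complete |
| 27 | - The Phase-1 `model_registry / feature_set_registry / reference_set_registry` bootstrap has actually been verified on the `acr_test` schema | ||
| 27 | 28 | ||
| 28 | The most important next step right now is not to keep writing plans, but to: | 29 | The most important next step right now is not to keep writing plans, but to: |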
| 29 | 30 | ||
| ... | @@ -180,6 +181,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql | ... | @@ -180,6 +181,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql |
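| 180 | - Code has been pushed to the remote | 181 | - Code has been pushed to the remote |
| 181 | - The PostgreSQL `acr_test` live path was re-verified: the three lineage triggers `recording` / `audio_window` / `audio_embedding` all have real negative-case evidence | 182 | - The PostgreSQL `acr_test` live path was re-verified: the three lineage triggers `recording` / `audio_window` / `audio_embedding` all have real negative-case evidence |
| 182 | - Mechanical checks are complete: `py_compile` passes for `live_pgvector_music20_eval.py`, and `diff --check` passes for the related changes | 183 | - Mechanical checks are complete: `py_compile` passes for `live_pgvector_music20_eval.py`, and `diff --check` passes for the related changes |
| 184 | - The Phase-1 registry bootstrap has actually been written to the PostgreSQL `acr_test` schema: `chromaprint / mert / muq / ecapa` + 5 feature sets + `phase1_hot_reference_v1` | ||
| 183 | 185 | ||
| 184 | ### Unverified / remaining gaps | 186 | ### Unverified / remaining gaps |
| 185 | - **MERT / MuQ encoder-only feature extraction has not actually been run** | 187 | - **MERT / MuQ encoder-only feature extraction has not actually been run** | ... | ... |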