Commit fef8f438 fef8f4387d95be5d4a017ba55150b6fa7463f1f6 by cnb.bofCdSsphPA

Bootstrap the Phase-1 model registry on live PostgreSQL

Constraint: Continue the Ralph loop without waiting on missing business sample mounts, while still leaving a push-ready implementation and documentation trail
Rejected: Keep Phase-1 registry setup as static SQL snippets only | It slows live validation and leaves no machine-checkable bootstrap path
Confidence: high
Scope-risk: narrow
Directive: Treat model_registry/feature_set_registry/reference_set_registry as the mandatory entrypoint before any future MERT/MuQ extraction jobs
Tested: /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json; /usr/local/miniconda3/bin/python -m py_compile scripts/bootstrap_phase1_model_registry_live.py; git diff --check -- acr-engine/scripts/bootstrap_phase1_model_registry_live.py acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Actual MERT/MuQ embedding extraction, hard-case type_8/type_16 live queries, multi-recording/cover-lane retrieval
1 parent ea51b9c1
{
"schema": "acr_test",
"dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2",
"models": [
{
"model_id": 2,
"model_name": "chromaprint",
"model_version": "v1",
"output_embedding_dim": null
},
{
"model_id": 3,
"model_name": "mert",
"model_version": "v1-95m",
"output_embedding_dim": 768
},
{
"model_id": 4,
"model_name": "muq",
"model_version": "large-msd-iter",
"output_embedding_dim": 768
},
{
"model_id": 5,
"model_name": "ecapa",
"model_version": "acr-baseline-v1",
"output_embedding_dim": 192
}
],
"feature_sets": [
{
"feature_set_id": 2,
"model_name": "chromaprint",
"model_version": "v1",
"feature_name": "fingerprint_asset",
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": null,
"distance_metric": "hamming"
},
{
"feature_set_id": 3,
"model_name": "mert",
"model_version": "v1-95m",
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 768,
"distance_metric": "cosine"
},
{
"feature_set_id": 4,
"model_name": "mert",
"model_version": "v1-95m",
"feature_name": "semantic_embedding",
"window_sec": 10.0,
"hop_sec": 5.0,
"embedding_dim": 768,
"distance_metric": "cosine"
},
{
"feature_set_id": 5,
"model_name": "muq",
"model_version": "large-msd-iter",
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 768,
"distance_metric": "cosine"
},
{
"feature_set_id": 6,
"model_name": "ecapa",
"model_version": "acr-baseline-v1",
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 192,
"distance_metric": "cosine"
}
],
"reference_set": {
"reference_set_id": 2,
"set_name": "phase1_hot_reference_v1",
"encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter"
},
"counts": {
"model_registry": 5,
"feature_set_registry": 6,
"reference_set_registry": 2
}
}
\ No newline at end of file
## 2026-06-04
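- Added `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`. The Phase-1 initialization of `chromaprint / mert / muq / ecapa` and the corresponding `feature_set_registry / reference_set_registry` entries is now a live bootstrap script that connects directly to PostgreSQL; it has been verified against the `acr_test` schema.
- Documented a blocking fact: the current container lacks `/workspace/downloads`, so this round could not continue generating live PostgreSQL query JSONL for `type_8 / type_16` directly from the business sample directory. This environment prerequisite is now recorded in the handoff and the PostgreSQL samples document.
- Updated [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`: lineage negative-case validation now covers the three core triggers `recording` / `audio_window` / `audio_embedding` instead of a single `audio_window` case, and the live pgvector report was re-run to confirm that retrieval metrics are unchanged. Also recorded that the mechanical checks `py_compile` and `diff --check` pass.
- Added [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md), covering real persisted samples for `acr_pg_schema_v2.sql`, live `pgvector` retrieval validation, lineage trigger negative-case tests, and an interpretation of the current recall/confusion results.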
......
......@@ -216,3 +216,67 @@ flowchart TD
6. `phase1_hot_reference_v1`
With this, the data, model, and index tracks each have a stable entry point.
---
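## 8. Live PostgreSQL bootstrap script
To avoid running SQL by hand each time, this repository now provides a live bootstrap script that connects directly to PostgreSQL:
- `acr-engine/scripts/bootstrap_phase1_model_registry_live.py`
Purpose:
- Write `model_registry` rows into the target schema
- Write `feature_set_registry` rows
- Write `reference_set_registry` rows
- Use an **idempotent upsert / ensure** approach, so the script is safe to run repeatedly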
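
The idempotent upsert / ensure pattern named above can be sketched as follows. This is a minimal illustration, not the script's actual code: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, the table is a trimmed-down hypothetical version of `model_registry`, and the `UNIQUE (model_name, model_version)` key and the `ensure_model` helper are assumptions. The `ON CONFLICT ... DO UPDATE` clause has the same shape in PostgreSQL.

```python
import sqlite3

# Hypothetical, trimmed-down table. The real DDL lives in acr_pg_schema_v2.sql
# and is not reproduced here; the UNIQUE key below is an assumption.
DDL = """
CREATE TABLE model_registry (
    model_id INTEGER PRIMARY KEY,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    output_embedding_dim INTEGER,
    UNIQUE (model_name, model_version)
)
"""

UPSERT = """
INSERT INTO model_registry (model_name, model_version, output_embedding_dim)
VALUES (?, ?, ?)
ON CONFLICT (model_name, model_version)
DO UPDATE SET output_embedding_dim = excluded.output_embedding_dim
"""


def ensure_model(conn, name, version, dim):
    """Insert the row if it is missing, update it otherwise, and return its id."""
    conn.execute(UPSERT, (name, version, dim))
    row = conn.execute(
        "SELECT model_id FROM model_registry "
        "WHERE model_name = ? AND model_version = ?",
        (name, version),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Running the same "ensure" twice must not create a second row.
first = ensure_model(conn, "mert", "v1-95m", 768)
second = ensure_model(conn, "mert", "v1-95m", 768)
row_count = conn.execute("SELECT COUNT(*) FROM model_registry").fetchone()[0]
```

Because the conflict target is the natural key rather than the surrogate id, a repeated run returns the same `model_id` and leaves the table count unchanged, which is the property that makes re-running a bootstrap safe.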
### 8.1 Command
```bash
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py \
--dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
--schema acr_test \
--output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json
```
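
The bootstrap report stores the connection string as `dsn_redacted` with the password masked. How the script performs the masking is not shown in this commit; the sketch below is one way to do it with the standard library, and the function name `redact_dsn` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_dsn(dsn: str) -> str:
    """Replace the password in a URL-style DSN with '***'.

    Assumption: the DSN is a URL such as postgres://user:pass@host:port/db.
    IPv6 hosts and key=value style DSNs are not handled by this sketch.
    """
    parts = urlsplit(dsn)
    if parts.password is None:
        return dsn  # nothing to mask
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username or ''}:***@{host}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


redacted = redact_dsn("postgres://d2:d2pass@127.0.0.1:5432/d2")
print(redacted)  # postgres://d2:***@127.0.0.1:5432/d2
```

Masking before writing the report keeps credentials out of committed artifacts even when the command line itself carries them.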
### 8.2 Results verified so far (acr_test)
This round the script was actually executed against the `acr_test` schema. The resulting table counts are:
| Object | Count |
|---|---:|
| `model_registry` | `5` |
| `feature_set_registry` | `6` |
| `reference_set_registry` | `2` |
The newly added Phase-1 objects are:
#### models
- `chromaprint v1`
- `mert v1-95m`
- `muq large-msd-iter`
- `ecapa acr-baseline-v1`
#### feature sets
- `chromaprint fingerprint_asset`
- `mert semantic_embedding 5s/2.5s`
- `mert semantic_embedding 10s/5s`
- `muq semantic_embedding 5s/2.5s`
- `ecapa semantic_embedding 5s/2.5s`
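
The `5s/2.5s` and `10s/5s` labels are the `window_sec` / `hop_sec` pairs from the registry. As a rough illustration of what those two numbers imply for a clip, the sketch below lists window start offsets. The helper `window_starts` is hypothetical, and dropping the trailing partial window is an assumption; this commit does not say how the extraction jobs handle clip tails.

```python
def window_starts(duration_sec, window_sec, hop_sec):
    """Start offsets (in seconds) of every full window that fits in the clip.

    Assumption: partial windows at the end of the clip are dropped.
    """
    if window_sec <= 0 or hop_sec <= 0:
        raise ValueError("window_sec and hop_sec must be positive")
    starts = []
    k = 0
    while k * hop_sec + window_sec <= duration_sec:
        starts.append(k * hop_sec)
        k += 1
    return starts


# A 12-second clip under the two MERT feature sets.
short = window_starts(12.0, 5.0, 2.5)   # [0.0, 2.5, 5.0]
long = window_starts(12.0, 10.0, 5.0)   # [0.0]
```

With a hop of half the window, consecutive windows overlap by 50%, so the shorter configuration yields more embeddings per clip than the longer one.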
#### reference set
- `phase1_hot_reference_v1`
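### 8.3 Current artifact
- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
This file already records:
- model_id
- feature_set_id
- reference_set_id
- final table counts
So the next session does not need to start by running SQL snippets by hand; it can pick up directly from the live bootstrap script.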
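
Because the report is machine-readable, a later session can sanity-check it before relying on the ids. The sketch below is an assumed consistency check, not part of the commit: `check_report` is a hypothetical helper, and the inline JSON is a trimmed excerpt of the report with values copied from it.

```python
import json

# Trimmed excerpt of phase1_registry_bootstrap_report.json (values copied from
# the report; most entries are omitted for brevity).
REPORT_EXCERPT = """
{
  "models": [
    {"model_id": 3, "model_name": "mert", "model_version": "v1-95m",
     "output_embedding_dim": 768},
    {"model_id": 5, "model_name": "ecapa", "model_version": "acr-baseline-v1",
     "output_embedding_dim": 192}
  ],
  "feature_sets": [
    {"feature_set_id": 4, "model_name": "mert", "model_version": "v1-95m",
     "feature_name": "semantic_embedding", "window_sec": 10.0, "hop_sec": 5.0,
     "embedding_dim": 768, "distance_metric": "cosine"},
    {"feature_set_id": 6, "model_name": "ecapa",
     "model_version": "acr-baseline-v1", "feature_name": "semantic_embedding",
     "window_sec": 5.0, "hop_sec": 2.5, "embedding_dim": 192,
     "distance_metric": "cosine"}
  ],
  "reference_set": {"reference_set_id": 2,
                    "set_name": "phase1_hot_reference_v1"}
}
"""


def check_report(report):
    """Return a list of problems; an empty list means the report is consistent."""
    dims = {
        (m["model_name"], m["model_version"]): m["output_embedding_dim"]
        for m in report["models"]
    }
    problems = []
    for fs in report["feature_sets"]:
        key = (fs["model_name"], fs["model_version"])
        if key not in dims:
            problems.append(f"feature_set {fs['feature_set_id']}: unknown model {key}")
        elif dims[key] != fs["embedding_dim"]:
            problems.append(f"feature_set {fs['feature_set_id']}: dimension mismatch")
    if "reference_set_id" not in report.get("reference_set", {}):
        problems.append("reference_set_id missing")
    return problems


report = json.loads(REPORT_EXCERPT)
problems = check_report(report)
```

In practice the same function would be fed the full file, for example via `json.load` on the report path listed above.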
......
......@@ -62,8 +62,10 @@
|---|---|
| Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` |
| Live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` |
| Registry bootstrap script | `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` |
| Live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
| FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
| Registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` |
| Historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |
---
......@@ -379,6 +381,23 @@ flowchart LR
## Recommended next steps
### New this round: the Phase-1 registry can now be bootstrapped live
In addition to the live retrieval script, this round also added:
- `acr-engine/scripts/bootstrap_phase1_model_registry_live.py`
It has already written the following into the `acr_test` schema:
- `chromaprint`
- `mert`
- `muq`
- `ecapa`
- the corresponding feature sets
- `phase1_hot_reference_v1`
Corresponding live report:
- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
### Route 1: continue PostgreSQL engineering
1. Generalize `live_pgvector_music20_eval.py` into:
......
......@@ -24,6 +24,7 @@
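- The SOTA evolution path is settled: **Phase-1 goes encoder-only first**
- The PostgreSQL master-data and feature-registry DDL has landed as the recommended schema
- The Phase-1 implementation checklist and the model/feature/reference set initialization guide are complete
- The Phase-1 `model_registry / feature_set_registry / reference_set_registry` bootstrap has been verified by a real run on the `acr_test` schema
The most important next step right now is not to keep writing plans, but to: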
......@@ -180,6 +181,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
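- Code has been pushed to the remote
- The PostgreSQL `acr_test` live path was verified again: all three lineage triggers (`recording` / `audio_window` / `audio_embedding`) have real negative-case evidence
- Mechanical checks are complete: `live_pgvector_music20_eval.py` passes `py_compile`, and the related changes pass `diff --check`
- The Phase-1 registry bootstrap has been written for real into the PostgreSQL `acr_test` schema: `chromaprint / mert / muq / ecapa` + 5 feature sets + `phase1_hot_reference_v1`
### Not verified / remaining gaps
- **MERT / MuQ encoder-only feature extraction has not actually been run**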
......