Commit f0c82687 f0c826879d8435a60e80241dff3bfe19778648ce by cnb.bofCdSsphPA

Prove the Phase-1 registry bootstrap is idempotent

Constraint: Ralph follow-up work must keep producing audit-ready evidence and a pushed trail for the next session
Rejected: Assume the new bootstrap script is safe to rerun without proof | Duplicate feature-set inserts would erode trust in the PostgreSQL bootstrap path
Confidence: high
Scope-risk: narrow
Directive: Re-run registry bootstrap in-place before future extraction jobs and treat count drift as a regression signal
Tested: /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json (run twice); /usr/local/miniconda3/bin/python -m py_compile scripts/bootstrap_phase1_model_registry_live.py; git diff --check -- acr-engine/scripts/bootstrap_phase1_model_registry_live.py acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Actual downstream MERT/MuQ extraction after bootstrap, missing business sample mount recovery
1 parent fef8f438
{
"run1_counts": {
"model_registry": 5,
"feature_set_registry": 6,
"reference_set_registry": 2
},
"run2_counts": {
"model_registry": 5,
"feature_set_registry": 6,
"reference_set_registry": 2
},
"run2_model_operations": [
"updated",
"updated",
"updated",
"updated"
],
"run2_feature_operations": [
"reused",
"reused",
"reused",
"reused",
"reused"
],
"run2_reference_set_operation": "updated",
"idempotent": true
}
\ No newline at end of file
......@@ -6,25 +6,29 @@
"model_id": 2,
"model_name": "chromaprint",
"model_version": "v1",
"output_embedding_dim": null
"output_embedding_dim": null,
"operation": "updated"
},
{
"model_id": 3,
"model_name": "mert",
"model_version": "v1-95m",
"output_embedding_dim": 768
"output_embedding_dim": 768,
"operation": "updated"
},
{
"model_id": 4,
"model_name": "muq",
"model_version": "large-msd-iter",
"output_embedding_dim": 768
"output_embedding_dim": 768,
"operation": "updated"
},
{
"model_id": 5,
"model_name": "ecapa",
"model_version": "acr-baseline-v1",
"output_embedding_dim": 192
"output_embedding_dim": 192,
"operation": "updated"
}
],
"feature_sets": [
......@@ -36,7 +40,8 @@
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": null,
"distance_metric": "hamming"
"distance_metric": "hamming",
"operation": "reused"
},
{
"feature_set_id": 3,
......@@ -46,7 +51,8 @@
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 768,
"distance_metric": "cosine"
"distance_metric": "cosine",
"operation": "reused"
},
{
"feature_set_id": 4,
......@@ -56,7 +62,8 @@
"window_sec": 10.0,
"hop_sec": 5.0,
"embedding_dim": 768,
"distance_metric": "cosine"
"distance_metric": "cosine",
"operation": "reused"
},
{
"feature_set_id": 5,
......@@ -66,7 +73,8 @@
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 768,
"distance_metric": "cosine"
"distance_metric": "cosine",
"operation": "reused"
},
{
"feature_set_id": 6,
......@@ -76,13 +84,15 @@
"window_sec": 5.0,
"hop_sec": 2.5,
"embedding_dim": 192,
"distance_metric": "cosine"
"distance_metric": "cosine",
"operation": "reused"
}
],
"reference_set": {
"reference_set_id": 2,
"set_name": "phase1_hot_reference_v1",
"encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter"
"encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter",
"operation": "updated"
},
"counts": {
"model_registry": 5,
......
......@@ -207,7 +207,11 @@ REFERENCE_SET = {
}
def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> int:
def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> tuple[int, str]:
existing = conn.execute(
'SELECT model_id FROM model_registry WHERE model_name = %s AND model_version = %s',
(model['model_name'], model['model_version']),
).fetchone()
row = conn.execute(
"""
INSERT INTO model_registry (
......@@ -242,10 +246,10 @@ def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> int:
""",
{**model, 'metadata_json': json.dumps(model['metadata_json'])},
).fetchone()
return int(row[0])
return int(row[0]), ('updated' if existing else 'inserted')
def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[str, Any]) -> int:
def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[str, Any]) -> tuple[int, str]:
existing = conn.execute(
"""
SELECT feature_set_id
......@@ -283,7 +287,7 @@ def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[st
"UPDATE feature_set_registry SET config_json = %s::jsonb, status = %s, updated_at = NOW() WHERE feature_set_id = %s",
(json.dumps(feature['config_json']), feature['status'], existing[0]),
)
return int(existing[0])
return int(existing[0]), 'reused'
row = conn.execute(
"""
......@@ -318,10 +322,14 @@ def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[st
feature['status'],
),
).fetchone()
return int(row[0])
return int(row[0]), 'inserted'
def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> int:
def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> tuple[int, str]:
existing = conn.execute(
'SELECT reference_set_id FROM reference_set_registry WHERE set_name = %s',
(payload['set_name'],),
).fetchone()
row = conn.execute(
"""
INSERT INTO reference_set_registry (set_name, description, encoder_scope, status, metadata_json)
......@@ -343,7 +351,7 @@ def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> i
json.dumps(payload['metadata_json']),
),
).fetchone()
return int(row[0])
return int(row[0]), ('updated' if existing else 'inserted')
def main() -> None:
......@@ -365,18 +373,19 @@ def main() -> None:
conn.execute(f'SET search_path TO {args.schema}, public;')
model_ids: dict[tuple[str, str], int] = {}
for model in MODELS:
model_id = upsert_model(conn, model)
model_id, operation = upsert_model(conn, model)
model_ids[(model['model_name'], model['model_version'])] = model_id
summary['models'].append({
'model_id': model_id,
'model_name': model['model_name'],
'model_version': model['model_version'],
'output_embedding_dim': model['output_embedding_dim'],
'operation': operation,
})
for feature in FEATURE_SETS:
model_id = model_ids[(feature['model_name'], feature['model_version'])]
feature_set_id = ensure_feature_set(conn, model_id, feature)
feature_set_id, operation = ensure_feature_set(conn, model_id, feature)
summary['feature_sets'].append({
'feature_set_id': feature_set_id,
'model_name': feature['model_name'],
......@@ -386,13 +395,15 @@ def main() -> None:
'hop_sec': feature['hop_sec'],
'embedding_dim': feature['embedding_dim'],
'distance_metric': feature['distance_metric'],
'operation': operation,
})
reference_set_id = upsert_reference_set(conn, REFERENCE_SET)
reference_set_id, operation = upsert_reference_set(conn, REFERENCE_SET)
summary['reference_set'] = {
'reference_set_id': reference_set_id,
'set_name': REFERENCE_SET['set_name'],
'encoder_scope': REFERENCE_SET['encoder_scope'],
'operation': operation,
}
summary['counts'] = {
'model_registry': int(conn.execute('SELECT count(*) FROM model_registry;').fetchone()[0]),
......
## 2026-06-04
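- Added `phase1_registry_bootstrap_idempotency_report.json` and accompanying documentation, verifying that table counts stay stable after running `bootstrap_phase1_model_registry_live.py` twice in a row against the `acr_test` schema, which shows that the Phase-1 registry bootstrap is idempotent and safe to rerun.
- Added `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`, turning the Phase-1 initialization of `chromaprint / mert / muq / ecapa` and the corresponding `feature_set_registry / reference_set_registry` entries into a live bootstrap script that connects directly to PostgreSQL; verified against the `acr_test` schema.
- Documented a blocking fact: the current container lacks `/workspace/downloads`, so this round cannot continue generating live PostgreSQL query JSONL for `type_8 / type_16` directly from the business sample directory; this environment prerequisite is now recorded in the handoff and the PostgreSQL samples document.
- Updated [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`, extending lineage negative-case validation from a single `audio_window` case to the three core triggers `recording` / `audio_window` / `audio_embedding`; the live pgvector report was rerun and confirms that retrieval metrics are unchanged. Also recorded that the mechanical checks `py_compile` and `diff --check` pass.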
......
......@@ -272,6 +272,7 @@ cd /workspace/acr-engine
### 8.3 Current artifacts
- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json`
This file already records:
- model_id
......@@ -280,3 +281,22 @@ cd /workspace/acr-engine
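- final table counts
As a result, the next session does not need to start by running SQL snippets by hand; it can pick up directly from the live bootstrap script.
### 8.4 Idempotency verification (done)
Running the same command twice in a row against the `acr_test` schema produced real idempotency evidence:
| Item | Run 1 | Run 2 |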
|---|---:|---:|
| `model_registry` | `5` | `5` |
| `feature_set_registry` | `6` | `6` |
| `reference_set_registry` | `2` | `2` |
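On the second run:
- every entry in `models` reported `updated`
- every entry in `feature_sets` reported `reused`
- `reference_set` reported `updated`
Conclusion:
> The current bootstrap script is safe to rerun and does not flood the Phase-1 registry with duplicate rows.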
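
The comparison above can be sketched as a small helper that takes the two run summaries and derives the fields seen in the idempotency report. This is a minimal sketch, not the script that produced the committed report: the function name `build_idempotency_report` is hypothetical, and the rule used for `idempotent` (equal table counts and no `inserted` operation on the second run) is an assumption. The input keys (`models`, `feature_sets`, `reference_set`, `counts`, `operation`) follow the summary structure written by the bootstrap script.

```python
from typing import Any


def build_idempotency_report(run1: dict[str, Any], run2: dict[str, Any]) -> dict[str, Any]:
    """Compare two bootstrap summaries and flag count drift or new inserts.

    Hypothetical helper: the `idempotent` rule below is an assumption,
    not necessarily how the committed report was generated.
    """
    # Collect the per-row operations reported by the second run.
    run2_model_ops = [m["operation"] for m in run2["models"]]
    run2_feature_ops = [f["operation"] for f in run2["feature_sets"]]
    run2_reference_op = run2["reference_set"]["operation"]

    all_ops = run2_model_ops + run2_feature_ops + [run2_reference_op]
    # Assumed rule: table counts must match and the rerun must not insert rows.
    idempotent = run1["counts"] == run2["counts"] and "inserted" not in all_ops

    return {
        "run1_counts": run1["counts"],
        "run2_counts": run2["counts"],
        "run2_model_operations": run2_model_ops,
        "run2_feature_operations": run2_feature_ops,
        "run2_reference_set_operation": run2_reference_op,
        "idempotent": idempotent,
    }
```

Feeding it the two JSON summaries from consecutive runs (for example, loaded with `json.load`) gives a single boolean that can serve as the regression signal for count drift.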
......
......@@ -66,6 +66,7 @@
| live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
| FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
| registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` |
| registry bootstrap idempotency report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json` |
| historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |
---
......
......@@ -182,6 +182,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
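- The PostgreSQL `acr_test` live path has been re-verified: all three lineage triggers (`recording` / `audio_window` / `audio_embedding`) have real negative-case evidence
- Mechanical checks are complete: `live_pgvector_music20_eval.py` passes `py_compile`, and the related changes pass `diff --check`
- The Phase-1 registry bootstrap has been written for real into the PostgreSQL `acr_test` schema: `chromaprint / mert / muq / ecapa` + 5 feature sets + `phase1_hot_reference_v1`
- The Phase-1 registry bootstrap now has idempotency evidence: after two consecutive runs against the same schema, `model_registry=5 / feature_set_registry=6 / reference_set_registry=2` remain unchanged
### Not verified / remaining gaps
- **MERT / MuQ encoder-only feature extraction has not actually been run**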
......