Prove the Phase-1 registry bootstrap is idempotent

Constraint: Ralph follow-up work must keep producing audit-ready evidence and a pushed trail for the next session
Rejected: Assume the new bootstrap script is safe to rerun without proof | Duplicate feature-set inserts would erode trust in the PostgreSQL bootstrap path
Confidence: high
Scope-risk: narrow
Directive: Re-run registry bootstrap in-place before future extraction jobs and treat count drift as a regression signal
Tested: /usr/local/miniconda3/bin/python scripts/bootstrap_phase1_model_registry_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --output data/pgvector_eval/music20/phase1_registry_bootstrap_report.json (run twice); /usr/local/miniconda3/bin/python -m py_compile scripts/bootstrap_phase1_model_registry_live.py; git diff --check -- acr-engine/scripts/bootstrap_phase1_model_registry_live.py acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Actual downstream MERT/MuQ extraction after bootstrap, missing business sample mount recovery

Commit f0c826879d8435a60e80241dff3bfe19778648ce, authored 2026-06-04 12:47:24 +0800 by cnb.bofCdSsphPA
Showing 7 changed files with 91 additions and 20 deletions
acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json
acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json
acr-engine/scripts/bootstrap_phase1_model_registry_live.py
docs/CHANGELOG.md
docs/model-feature-registry-bootstrap.md
docs/postgres_db_schema_samples.md
docs/session-handoff.md
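
The directive in the commit message treats count drift between bootstrap runs as a regression signal. A minimal sketch of such a check follows, assuming the per-table counts have already been read into plain dictionaries. The helper name `count_drift` is hypothetical and is not part of the committed script; the counts are the ones recorded in `phase1_registry_bootstrap_idempotency_report.json`.

```python
from __future__ import annotations

from typing import Mapping


def count_drift(
    before: Mapping[str, int], after: Mapping[str, int]
) -> dict[str, tuple[int | None, int | None]]:
    """Return {table: (before, after)} for every table whose row count changed."""
    drift = {}
    for table in sorted(set(before) | set(after)):
        old, new = before.get(table), after.get(table)
        if old != new:
            drift[table] = (old, new)
    return drift


# Counts copied from phase1_registry_bootstrap_idempotency_report.json.
run1 = {"model_registry": 5, "feature_set_registry": 6, "reference_set_registry": 2}
run2 = {"model_registry": 5, "feature_set_registry": 6, "reference_set_registry": 2}

drift = count_drift(run1, run2)
if drift:
    # Any difference means the rerun added or removed rows.
    raise SystemExit(f"registry count drift detected: {drift}")
print("no count drift")
```

A table that appears in only one of the two runs is reported with `None` on the missing side, so a dropped or newly created table also counts as drift.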
--- a/acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json 0 → 100644
+{
+  "run1_counts": {
+    "model_registry": 5,
+    "feature_set_registry": 6,
+    "reference_set_registry": 2
+  },
+  "run2_counts": {
+    "model_registry": 5,
+    "feature_set_registry": 6,
+    "reference_set_registry": 2
+  },
+  "run2_model_operations": [
+    "updated",
+    "updated",
+    "updated",
+    "updated"
+  ],
+  "run2_feature_operations": [
+    "reused",
+    "reused",
+    "reused",
+    "reused",
+    "reused"
+  ],
+  "run2_reference_set_operation": "updated",
+  "idempotent": true
+}
\ No newline at end of file
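
Besides stable counts, the report above records one operation label per second-run write. A rerun that created no rows should contain no `inserted` label. The sketch below checks that, assuming the report keeps the key names shown above; the helper names `second_run_operations` and `rerun_created_rows` are hypothetical.

```python
import json

# Second-run fields copied from the idempotency report in this commit.
REPORT_JSON = """
{
  "run2_model_operations": ["updated", "updated", "updated", "updated"],
  "run2_feature_operations": ["reused", "reused", "reused", "reused", "reused"],
  "run2_reference_set_operation": "updated",
  "idempotent": true
}
"""


def second_run_operations(report: dict) -> list[str]:
    """Flatten every operation label recorded for the second run."""
    return [
        *report["run2_model_operations"],
        *report["run2_feature_operations"],
        report["run2_reference_set_operation"],
    ]


def rerun_created_rows(report: dict) -> bool:
    """True if any second-run operation was an insert, i.e. the rerun added rows."""
    return "inserted" in second_run_operations(report)


report = json.loads(REPORT_JSON)
print(rerun_created_rows(report))
```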
--- a/acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json
+++ b/acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json
@@ -6,25 +6,29 @@
      "model_id": 2,
      "model_name": "chromaprint",
      "model_version": "v1",
-      "output_embedding_dim": null
+      "output_embedding_dim": null,
+      "operation": "updated"
    },
    {
      "model_id": 3,
      "model_name": "mert",
      "model_version": "v1-95m",
-      "output_embedding_dim": 768
+      "output_embedding_dim": 768,
+      "operation": "updated"
    },
    {
      "model_id": 4,
      "model_name": "muq",
      "model_version": "large-msd-iter",
-      "output_embedding_dim": 768
+      "output_embedding_dim": 768,
+      "operation": "updated"
    },
    {
      "model_id": 5,
      "model_name": "ecapa",
      "model_version": "acr-baseline-v1",
-      "output_embedding_dim": 192
+      "output_embedding_dim": 192,
+      "operation": "updated"
    }
  ],
  "feature_sets": [
@@ -36,7 +40,8 @@
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": null,
-      "distance_metric": "hamming"
+      "distance_metric": "hamming",
+      "operation": "reused"
    },
    {
      "feature_set_id": 3,
@@ -46,7 +51,8 @@
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 768,
-      "distance_metric": "cosine"
+      "distance_metric": "cosine",
+      "operation": "reused"
    },
    {
      "feature_set_id": 4,
@@ -56,7 +62,8 @@
      "window_sec": 10.0,
      "hop_sec": 5.0,
      "embedding_dim": 768,
-      "distance_metric": "cosine"
+      "distance_metric": "cosine",
+      "operation": "reused"
    },
    {
      "feature_set_id": 5,
@@ -66,7 +73,8 @@
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 768,
-      "distance_metric": "cosine"
+      "distance_metric": "cosine",
+      "operation": "reused"
    },
    {
      "feature_set_id": 6,
@@ -76,13 +84,15 @@
      "window_sec": 5.0,
      "hop_sec": 2.5,
      "embedding_dim": 192,
-      "distance_metric": "cosine"
+      "distance_metric": "cosine",
+      "operation": "reused"
    }
  ],
  "reference_set": {
    "reference_set_id": 2,
    "set_name": "phase1_hot_reference_v1",
-    "encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter"
+    "encoder_scope": "chromaprint-v1 / mert-v1-95m / muq-large-msd-iter",
+    "operation": "updated"
  },
  "counts": {
    "model_registry": 5,
--- a/acr-engine/scripts/bootstrap_phase1_model_registry_live.py
+++ b/acr-engine/scripts/bootstrap_phase1_model_registry_live.py
@@ -207,7 +207,11 @@ REFERENCE_SET = {
 }


-def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> int:
+def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> tuple[int, str]:
+    existing = conn.execute(
+        'SELECT model_id FROM model_registry WHERE model_name = %s AND model_version = %s',
+        (model['model_name'], model['model_version']),
+    ).fetchone()
    row = conn.execute(
        """
        INSERT INTO model_registry (
@@ -242,10 +246,10 @@ def upsert_model(conn: psycopg.Connection, model: dict[str, Any]) -> int:
        """,
        {**model, 'metadata_json': json.dumps(model['metadata_json'])},
    ).fetchone()
-    return int(row[0])
+    return int(row[0]), ('updated' if existing else 'inserted')


-def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[str, Any]) -> int:
+def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[str, Any]) -> tuple[int, str]:
    existing = conn.execute(
        """
        SELECT feature_set_id
@@ -283,7 +287,7 @@ def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[st
            "UPDATE feature_set_registry SET config_json = %s::jsonb, status = %s, updated_at = NOW() WHERE feature_set_id = %s",
            (json.dumps(feature['config_json']), feature['status'], existing[0]),
        )
-        return int(existing[0])
+        return int(existing[0]), 'reused'

    row = conn.execute(
        """
@@ -318,10 +322,14 @@ def ensure_feature_set(conn: psycopg.Connection, model_id: int, feature: dict[st
            feature['status'],
        ),
    ).fetchone()
-    return int(row[0])
+    return int(row[0]), 'inserted'


-def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> int:
+def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> tuple[int, str]:
+    existing = conn.execute(
+        'SELECT reference_set_id FROM reference_set_registry WHERE set_name = %s',
+        (payload['set_name'],),
+    ).fetchone()
    row = conn.execute(
        """
        INSERT INTO reference_set_registry (set_name, description, encoder_scope, status, metadata_json)
@@ -343,7 +351,7 @@ def upsert_reference_set(conn: psycopg.Connection, payload: dict[str, Any]) -> i
            json.dumps(payload['metadata_json']),
        ),
    ).fetchone()
-    return int(row[0])
+    return int(row[0]), ('updated' if existing else 'inserted')


 def main() -> None:
@@ -365,18 +373,19 @@ def main() -> None:
        conn.execute(f'SET search_path TO {args.schema}, public;')
        model_ids: dict[tuple[str, str], int] = {}
        for model in MODELS:
-            model_id = upsert_model(conn, model)
+            model_id, operation = upsert_model(conn, model)
            model_ids[(model['model_name'], model['model_version'])] = model_id
            summary['models'].append({
                'model_id': model_id,
                'model_name': model['model_name'],
                'model_version': model['model_version'],
                'output_embedding_dim': model['output_embedding_dim'],
+                'operation': operation,
            })

        for feature in FEATURE_SETS:
            model_id = model_ids[(feature['model_name'], feature['model_version'])]
-            feature_set_id = ensure_feature_set(conn, model_id, feature)
+            feature_set_id, operation = ensure_feature_set(conn, model_id, feature)
            summary['feature_sets'].append({
                'feature_set_id': feature_set_id,
                'model_name': feature['model_name'],
@@ -386,13 +395,15 @@ def main() -> None:
                'hop_sec': feature['hop_sec'],
                'embedding_dim': feature['embedding_dim'],
                'distance_metric': feature['distance_metric'],
+                'operation': operation,
            })

-        reference_set_id = upsert_reference_set(conn, REFERENCE_SET)
+        reference_set_id, operation = upsert_reference_set(conn, REFERENCE_SET)
        summary['reference_set'] = {
            'reference_set_id': reference_set_id,
            'set_name': REFERENCE_SET['set_name'],
            'encoder_scope': REFERENCE_SET['encoder_scope'],
+            'operation': operation,
        }
        summary['counts'] = {
            'model_registry': int(conn.execute('SELECT count(*) FROM model_registry;').fetchone()[0]),
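
The pattern in this diff (look up the row, upsert it, then label the write `updated` or `inserted`) can be exercised without a PostgreSQL server. The sketch below swaps in SQLite's in-memory database purely so it runs standalone. The table is a hypothetical, trimmed-down stand-in for `model_registry`, not the real schema, and the model rows are the four from the bootstrap report. SQLite 3.24 or newer is assumed for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3

# Hypothetical, trimmed-down stand-in for model_registry (not the real schema).
SCHEMA = """
CREATE TABLE model_registry (
    model_id INTEGER PRIMARY KEY,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    output_embedding_dim INTEGER,
    UNIQUE (model_name, model_version)
)
"""

# Names, versions and dimensions as listed in the bootstrap report.
MODELS = [
    ("chromaprint", "v1", None),
    ("mert", "v1-95m", 768),
    ("muq", "large-msd-iter", 768),
    ("ecapa", "acr-baseline-v1", 192),
]

LOOKUP = "SELECT model_id FROM model_registry WHERE model_name = ? AND model_version = ?"


def upsert_model(conn, name, version, dim):
    """Upsert one model and report whether the row already existed."""
    existing = conn.execute(LOOKUP, (name, version)).fetchone()
    conn.execute(
        """
        INSERT INTO model_registry (model_name, model_version, output_embedding_dim)
        VALUES (?, ?, ?)
        ON CONFLICT (model_name, model_version)
        DO UPDATE SET output_embedding_dim = excluded.output_embedding_dim
        """,
        (name, version, dim),
    )
    row = conn.execute(LOOKUP, (name, version)).fetchone()
    return int(row[0]), ("updated" if existing else "inserted")


def bootstrap(conn):
    """Run the full upsert pass once; return the operation labels and the row count."""
    operations = [upsert_model(conn, *model)[1] for model in MODELS]
    count = conn.execute("SELECT count(*) FROM model_registry").fetchone()[0]
    return operations, count


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
run1 = bootstrap(conn)
run2 = bootstrap(conn)
print(run1)
print(run2)
```

Because the lookup and the upsert are separate statements, the label is only reliable when a single writer runs the bootstrap; under concurrent writers a row could be inserted between the two statements and the write would be mislabeled.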
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 ## 2026-06-04

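+- Added `phase1_registry_bootstrap_idempotency_report.json` and accompanying documentation, verifying that table counts stay stable after running `bootstrap_phase1_model_registry_live.py` twice in a row against the `acr_test` schema, which shows that the Phase-1 registry bootstrap is idempotent and safe to rerun.
 - Added `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`, turning the Phase-1 initialization of `chromaprint / mert / muq / ecapa` and the corresponding `feature_set_registry / reference_set_registry` into a live bootstrap script that connects directly to PostgreSQL; verified against the `acr_test` schema.
 - Documented a blocker: the current container lacks `/workspace/downloads`, so this round cannot continue generating the `type_8 / type_16` live PostgreSQL query JSONL directly from the business sample directory; this environment prerequisite is now recorded in the handoff and PostgreSQL sample docs.
 - Updated [PostgreSQL persistence samples and live test pipeline](./postgres_db_schema_samples.md) and `acr-engine/scripts/live_pgvector_music20_eval.py`, extending lineage negative-case verification from a single `audio_window` case to the three core triggers `recording` / `audio_window` / `audio_embedding`; reran the live pgvector report and confirmed the retrieval metrics are unchanged. Also recorded that `py_compile` and `diff --check` pass as mechanical verification.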
--- a/docs/model-feature-registry-bootstrap.md
+++ b/docs/model-feature-registry-bootstrap.md
@@ -272,6 +272,7 @@ cd /workspace/acr-engine
 ### 8.3 Current artifacts

 - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json`
+- `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json`

 This file already records:
 - model_id
@@ -280,3 +281,22 @@ cd /workspace/acr-engine
 - final table counts

 As a result, the next session does not need to start by running SQL snippets by hand and can pick up directly from the live bootstrap script.
+
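+### 8.4 Idempotency verification (done)
+
+After running the same command twice in a row against the `acr_test` schema, we have real idempotency evidence:
+
+| Table | Run 1 | Run 2 |
+|---|---:|---:|
+| `model_registry` | `5` | `5` |
+| `feature_set_registry` | `6` | `6` |
+| `reference_set_registry` | `2` | `2` |
+
+On the second run:
+- every entry in `models` reports `updated`
+- every entry in `feature_sets` reports `reused`
+- `reference_set` reports `updated`
+
+Conclusion:
+
+> The current bootstrap script can be rerun without flooding the Phase-1 registry with duplicate rows.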
--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -66,6 +66,7 @@
 | live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
 | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
 | registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` |
+| registry bootstrap idempotency report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json` |
 | historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |

 ---
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -182,6 +182,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
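 - The PostgreSQL `acr_test` live path has been re-verified: all three lineage triggers (`recording` / `audio_window` / `audio_embedding`) have real negative-case evidence
 - Mechanical checks are in place: `py_compile` passes for `live_pgvector_music20_eval.py`, and `diff --check` passes for the related changes
 - The Phase-1 registry bootstrap has been written for real to the PostgreSQL `acr_test` schema: `chromaprint / mert / muq / ecapa` + 5 feature sets + `phase1_hot_reference_v1`
+- The Phase-1 registry bootstrap now has idempotency evidence: after two consecutive runs against the same schema, `model_registry=5 / feature_set_registry=6 / reference_set_registry=2` stay unchanged

 ### Not verified / remaining gaps
 - **MERT / MuQ encoder-only feature extraction has not actually been run**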