Validate the PostgreSQL ACR storage path with live evidence

Constraint: The new data model had to be proven against the user-provided PostgreSQL instance and stay aligned with Phase-1 encoder-only decisions
Rejected: Document-only schema guidance without a live database run | It would leave retrieval correctness and table intent unproven
Confidence: high
Scope-risk: narrow
Directive: Keep future retrieval experiments writing through model/feature/reference registries instead of adding fixed per-model columns
Tested:
  /usr/local/miniconda3/bin/python scripts/live_pgvector_music20_eval.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --reset-schema --output data/pgvector_eval/music20/live_pgvector_report.json;
  /usr/local/miniconda3/bin/python scripts/evaluate_songid_pgvector_path.py --reference-embeddings-jsonl data/pgvector_eval/music20/reference_embeddings.jsonl --query-embeddings-jsonl data/pgvector_eval/music20/query_embeddings.jsonl --output data/pgvector_eval/music20/songid_eval_report_fresh.json;
  /usr/local/miniconda3/bin/python -m py_compile scripts/live_pgvector_music20_eval.py scripts/evaluate_songid_pgvector_path.py;
  git diff --check -- docs/README.md docs/CHANGELOG.md docs/postgres_db_schema_samples.md acr-engine/scripts/live_pgvector_music20_eval.py acr-engine/data/pgvector_eval/music20/live_pgvector_report.json acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json
Not-tested: MERT/MuQ live embeddings, type_8/type_16 live JSONL coverage, multi-recording/cover-lane decision flow

Commit 96c9ce7d40ae89231e6f9d1a37d73e2c5bb380ff, authored 2026-06-04 12:20:15 +0800 by cnb.bofCdSsphPA
Showing 6 changed files with 1428 additions and 0 deletions
acr-engine/data/pgvector_eval/music20/live_pgvector_report.json
acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json
acr-engine/scripts/live_pgvector_music20_eval.py
docs/CHANGELOG.md
docs/README.md
docs/postgres_db_schema_samples.md
--- a/acr-engine/data/pgvector_eval/music20/live_pgvector_report.json 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/live_pgvector_report.json 0 → 100644
+{
+  "schema": "acr_test",
+  "dsn_redacted": "postgres://d2:***@127.0.0.1:5432/d2",
+  "input": {
+    "reference_embeddings_jsonl": "/workspace/acr-engine/data/pgvector_eval/music20/reference_embeddings.jsonl",
+    "query_embeddings_jsonl": "/workspace/acr-engine/data/pgvector_eval/music20/query_embeddings.jsonl",
+    "reference_count": 20,
+    "query_count": 22
+  },
+  "registry": {
+    "model_id": 1,
+    "feature_set_id": 1,
+    "reference_set_id": 1,
+    "retrieval_index_id": 1
+  },
+  "table_counts": {
+    "canonical_song": 20,
+    "work": 20,
+    "recording": 20,
+    "recording_asset": 20,
+    "audio_window": 20,
+    "audio_embedding": 20,
+    "retrieval_candidate": 220,
+    "match_decision": 22
+  },
+  "lineage_negative_test": {
+    "passed": true,
+    "error_type": "RaiseException",
+    "message": "Invalid asset_id=1 or recording_id=1000000 for audio_window"
+  },
+  "evaluation": {
+    "backend": "postgresql+pgvector-live",
+    "note": "Reference embeddings are stored in schema v2; 24-d logical embeddings are zero-padded to vector(192) for physical storage.",
+    "overall": {
+      "count": 22,
+      "top1": 0.909091,
+      "top3": 0.954545,
+      "top10": 0.954545,
+      "mrr": 0.934343,
+      "mean_rank": 1.8182,
+      "median_rank": 1.0
+    },
+    "by_query_type": {
+      "1": {
+        "count": 20,
+        "top1": 1.0,
+        "top3": 1.0,
+        "top10": 1.0,
+        "mrr": 1.0,
+        "mean_rank": 1.0,
+        "median_rank": 1.0
+      },
+      "7": {
+        "count": 2,
+        "top1": 0.0,
+        "top3": 0.5,
+        "top10": 0.5,
+        "mrr": 0.277778,
+        "mean_rank": 10.0,
+        "median_rank": 10.0
+      }
+    },
+    "confusion_focus": {
+      "7": {
+        "query_type": 7,
+        "metrics": {
+          "count": 2,
+          "top1": 0.0,
+          "top3": 0.5,
+          "top10": 0.5,
+          "mrr": 0.277778,
+          "mean_rank": 10.0,
+          "median_rank": 10.0
+        },
+        "interpretation": "light confusion / transformed query"
+      },
+      "8": {
+        "query_type": 8,
+        "metrics": {
+          "count": 0
+        },
+        "interpretation": "harder confusion bucket"
+      },
+      "16": {
+        "query_type": 16,
+        "metrics": {
+          "count": 0
+        },
+        "interpretation": "strong confusion or far-domain bucket"
+      }
+    },
+    "examples": {
+      "1": [
+        {
+          "query_id": "music20-q0000-t1-song100",
+          "song_id": "100",
+          "rank": 1,
+          "top3": [
+            {
+              "song_id": "100",
+              "canonical_song_id": 1,
+              "evidence_window_id": 1,
+              "combined_score": 0.9099869376417087,
+              "max_sim": 0.9999854862685651,
+              "top3_avg": 0.9999854862685651,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.8674688834706314,
+              "max_sim": 0.9527432038562573,
+              "top3_avg": 0.9527432038562573,
+              "vote": 1
+            },
+            {
+              "song_id": "103",
+              "canonical_song_id": 4,
+              "evidence_window_id": 4,
+              "combined_score": 0.8665370278518509,
+              "max_sim": 0.9517078087242788,
+              "top3_avg": 0.9517078087242788,
+              "vote": 1
+            }
+          ]
+        },
+        {
+          "query_id": "music20-q0001-t1-song101",
+          "song_id": "101",
+          "rank": 1,
+          "top3": [
+            {
+              "song_id": "101",
+              "canonical_song_id": 2,
+              "evidence_window_id": 2,
+              "combined_score": 0.9099997586011674,
+              "max_sim": 0.999999731779075,
+              "top3_avg": 0.999999731779075,
+              "vote": 1
+            },
+            {
+              "song_id": "118",
+              "canonical_song_id": 19,
+              "evidence_window_id": 19,
+              "combined_score": 0.8930541242989376,
+              "max_sim": 0.9811712492210417,
+              "top3_avg": 0.9811712492210417,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.892017854392,
+              "max_sim": 0.9800198382133333,
+              "top3_avg": 0.9800198382133333,
+              "vote": 1
+            }
+          ]
+        },
+        {
+          "query_id": "music20-q0002-t1-song102",
+          "song_id": "102",
+          "rank": 1,
+          "top3": [
+            {
+              "song_id": "102",
+              "canonical_song_id": 3,
+              "evidence_window_id": 3,
+              "combined_score": 0.9099973714353238,
+              "max_sim": 0.9999970793725819,
+              "top3_avg": 0.9999970793725819,
+              "vote": 1
+            },
+            {
+              "song_id": "113",
+              "canonical_song_id": 14,
+              "evidence_window_id": 14,
+              "combined_score": 0.878619819365752,
+              "max_sim": 0.9651331326286134,
+              "top3_avg": 0.9651331326286134,
+              "vote": 1
+            },
+            {
+              "song_id": "118",
+              "canonical_song_id": 19,
+              "evidence_window_id": 19,
+              "combined_score": 0.8727551417721799,
+              "max_sim": 0.9586168241913111,
+              "top3_avg": 0.9586168241913111,
+              "vote": 1
+            }
+          ]
+        },
+        {
+          "query_id": "music20-q0003-t1-song103",
+          "song_id": "103",
+          "rank": 1,
+          "top3": [
+            {
+              "song_id": "103",
+              "canonical_song_id": 4,
+              "evidence_window_id": 4,
+              "combined_score": 0.9078967457382905,
+              "max_sim": 0.9976630508203228,
+              "top3_avg": 0.9976630508203228,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.8892688048103843,
+              "max_sim": 0.9769653386782048,
+              "top3_avg": 0.9769653386782048,
+              "vote": 1
+            },
+            {
+              "song_id": "109",
+              "canonical_song_id": 10,
+              "evidence_window_id": 10,
+              "combined_score": 0.8786497490793317,
+              "max_sim": 0.9651663878659241,
+              "top3_avg": 0.9651663878659241,
+              "vote": 1
+            }
+          ]
+        },
+        {
+          "query_id": "music20-q0004-t1-song104",
+          "song_id": "104",
+          "rank": 1,
+          "top3": [
+            {
+              "song_id": "104",
+              "canonical_song_id": 5,
+              "evidence_window_id": 5,
+              "combined_score": 0.9099890834089845,
+              "max_sim": 0.9999878704544272,
+              "top3_avg": 0.9999878704544272,
+              "vote": 1
+            },
+            {
+              "song_id": "109",
+              "canonical_song_id": 10,
+              "evidence_window_id": 10,
+              "combined_score": 0.8646899513807881,
+              "max_sim": 0.9496555015342091,
+              "top3_avg": 0.9496555015342091,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.8414633946738618,
+              "max_sim": 0.9238482163042909,
+              "top3_avg": 0.9238482163042909,
+              "vote": 1
+            }
+          ]
+        }
+      ],
+      "7": [
+        {
+          "query_id": "music20-q0020-t7-song111",
+          "song_id": "111",
+          "rank": 18,
+          "top3": [
+            {
+              "song_id": "109",
+              "canonical_song_id": 10,
+              "evidence_window_id": 10,
+              "combined_score": 0.8765411333280498,
+              "max_sim": 0.9628234814756109,
+              "top3_avg": 0.9628234814756109,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.8749381679370203,
+              "max_sim": 0.9610424088189115,
+              "top3_avg": 0.9610424088189115,
+              "vote": 1
+            },
+            {
+              "song_id": "118",
+              "canonical_song_id": 19,
+              "evidence_window_id": 19,
+              "combined_score": 0.8641276021561776,
+              "max_sim": 0.9490306690624195,
+              "top3_avg": 0.9490306690624195,
+              "vote": 1
+            }
+          ]
+        },
+        {
+          "query_id": "music20-q0021-t7-song116",
+          "song_id": "116",
+          "rank": 2,
+          "top3": [
+            {
+              "song_id": "109",
+              "canonical_song_id": 10,
+              "evidence_window_id": 10,
+              "combined_score": 0.8701787704282636,
+              "max_sim": 0.9557541893647373,
+              "top3_avg": 0.9557541893647373,
+              "vote": 1
+            },
+            {
+              "song_id": "116",
+              "canonical_song_id": 17,
+              "evidence_window_id": 17,
+              "combined_score": 0.8674951972070233,
+              "max_sim": 0.9527724413411371,
+              "top3_avg": 0.9527724413411371,
+              "vote": 1
+            },
+            {
+              "song_id": "103",
+              "canonical_song_id": 4,
+              "evidence_window_id": 4,
+              "combined_score": 0.8659579133987426,
+              "max_sim": 0.9510643482208252,
+              "top3_avg": 0.9510643482208252,
+              "vote": 1
+            }
+          ]
+        }
+      ]
+    }
+  }
+}
\ No newline at end of file
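
Review note: the report's "note" field says the 24-d logical embeddings are zero-padded to vector(192). The sketch below is only an illustration of why that padding should not change cosine similarity: the appended zeros add nothing to the dot product or to either norm. `pad_embedding` mirrors the helper in this commit's script; the `cosine` helper and the two sample vectors are made up for the example and are not taken from the music20 data.

```python
import math


def pad_embedding(vec: list[float], target_dim: int = 192) -> list[float]:
    """Zero-pad vec up to target_dim (same behaviour as the script's helper)."""
    if len(vec) > target_dim:
        raise ValueError(f"embedding dim {len(vec)} > target {target_dim}")
    return vec + [0.0] * (target_dim - len(vec))


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; hypothetical helper for this illustration."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up 24-d vectors standing in for chroma-style embeddings.
a = [0.1 * (i + 1) for i in range(24)]
b = [math.sin(i) + 1.5 for i in range(24)]

raw_similarity = cosine(a, b)
padded_similarity = cosine(pad_embedding(a), pad_embedding(b))

# The padded dimensions contribute only zeros, so the two values agree.
print(raw_similarity, padded_similarity)
```

Under this reading, the ranking produced by pgvector's cosine distance on the padded vectors should match the ranking on the 24-d vectors, which is consistent with the two reports in this commit agreeing on every aggregate metric.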
--- a/acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json 0 → 100644
+++ b/acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json 0 → 100644
+{
+  "backend": "faiss-as-pgvector-standin",
+  "note": "Uses song-level aggregation compatible with a future pgvector online path.",
+  "overall": {
+    "count": 22,
+    "top1": 0.909091,
+    "top3": 0.954545,
+    "top10": 0.954545,
+    "mrr": 0.934343,
+    "mean_rank": 1.8182,
+    "median_rank": 1.0
+  },
+  "by_query_type": {
+    "1": {
+      "count": 20,
+      "top1": 1.0,
+      "top3": 1.0,
+      "top10": 1.0,
+      "mrr": 1.0,
+      "mean_rank": 1.0,
+      "median_rank": 1.0
+    },
+    "7": {
+      "count": 2,
+      "top1": 0.0,
+      "top3": 0.5,
+      "top10": 0.5,
+      "mrr": 0.277778,
+      "mean_rank": 10.0,
+      "median_rank": 10.0
+    }
+  },
+  "examples": {
+    "1": [
+      {
+        "song_id": "100",
+        "rank": 1,
+        "top3": [
+          [
+            "100",
+            0.9099869644641876,
+            0.9999855160713196,
+            0.9999855160713196,
+            1
+          ],
+          [
+            "116",
+            0.8674689626693726,
+            0.9527432918548584,
+            0.9527432918548584,
+            1
+          ],
+          [
+            "103",
+            0.8665370559692382,
+            0.9517078399658203,
+            0.9517078399658203,
+            1
+          ]
+        ]
+      },
+      {
+        "song_id": "101",
+        "rank": 1,
+        "top3": [
+          [
+            "101",
+            0.9099996781349182,
+            0.9999996423721313,
+            0.9999996423721313,
+            1
+          ],
+          [
+            "118",
+            0.8930539643764497,
+            0.9811710715293884,
+            0.9811710715293884,
+            1
+          ],
+          [
+            "116",
+            0.8920178270339967,
+            0.9800198078155518,
+            0.9800198078155518,
+            1
+          ]
+        ]
+      },
+      {
+        "song_id": "102",
+        "rank": 1,
+        "top3": [
+          [
+            "102",
+            0.9099974250793457,
+            0.9999971389770508,
+            0.9999971389770508,
+            1
+          ],
+          [
+            "113",
+            0.878619978427887,
+            0.9651333093643188,
+            0.9651333093643188,
+            1
+          ],
+          [
+            "118",
+            0.8727551674842834,
+            0.9586168527603149,
+            0.9586168527603149,
+            1
+          ]
+        ]
+      },
+      {
+        "song_id": "103",
+        "rank": 1,
+        "top3": [
+          [
+            "103",
+            0.9078967189788818,
+            0.9976630210876465,
+            0.9976630210876465,
+            1
+          ],
+          [
+            "116",
+            0.8892688846588135,
+            0.9769654273986816,
+            0.9769654273986816,
+            1
+          ],
+          [
+            "109",
+            0.8786498045921325,
+            0.965166449546814,
+            0.965166449546814,
+            1
+          ]
+        ]
+      },
+      {
+        "song_id": "104",
+        "rank": 1,
+        "top3": [
+          [
+            "104",
+            0.9099890029430389,
+            0.999987781047821,
+            0.999987781047821,
+            1
+          ],
+          [
+            "109",
+            0.8646899795532226,
+            0.9496555328369141,
+            0.9496555328369141,
+            1
+          ],
+          [
+            "116",
+            0.8414634442329406,
+            0.9238482713699341,
+            0.9238482713699341,
+            1
+          ]
+        ]
+      }
+    ],
+    "7": [
+      {
+        "song_id": "111",
+        "rank": 18,
+        "top3": [
+          [
+            "109",
+            0.8765411591529846,
+            0.9628235101699829,
+            0.9628235101699829,
+            1
+          ],
+          [
+            "116",
+            0.8749382710456848,
+            0.9610425233840942,
+            0.9610425233840942,
+            1
+          ],
+          [
+            "118",
+            0.8641276276111602,
+            0.9490306973457336,
+            0.9490306973457336,
+            1
+          ]
+        ]
+      },
+      {
+        "song_id": "116",
+        "rank": 2,
+        "top3": [
+          [
+            "109",
+            0.8701787447929383,
+            0.9557541608810425,
+            0.9557541608810425,
+            1
+          ],
+          [
+            "116",
+            0.8674952483177185,
+            0.9527724981307983,
+            0.9527724981307983,
+            1
+          ],
+          [
+            "103",
+            0.8659579670429229,
+            0.95106440782547,
+            0.95106440782547,
+            1
+          ]
+        ]
+      }
+    ]
+  }
+}
\ No newline at end of file
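
Review note: the aggregate figures in both reports can be cross-checked by hand. The sketch below assumes the per-query ranks are twenty 1s for query type 1 (implied by count 20 with top1 = 1.0) plus ranks 18 and 2 for the two type-7 queries shown in the examples. `compute_metrics` follows the function of the same name in this commit's script; `combined_score` is a hypothetical wrapper around the weighting expression used in `aggregate_song_scores`.

```python
from statistics import median


def compute_metrics(ranks: list[int], topk: int = 10) -> dict:
    """Same formulas as the script: hit rates, MRR, mean and median rank."""
    if not ranks:
        return {"count": 0}
    n = len(ranks)
    return {
        "count": n,
        "top1": round(sum(1 for r in ranks if r == 1) / n, 6),
        "top3": round(sum(1 for r in ranks if r <= 3) / n, 6),
        f"top{topk}": round(sum(1 for r in ranks if r <= topk) / n, 6),
        "mrr": round(sum(1.0 / r for r in ranks) / n, 6),
        "mean_rank": round(sum(ranks) / n, 4),
        "median_rank": median(ranks),
    }


def combined_score(max_sim: float, top3_avg: float, vote: int) -> float:
    """Hypothetical wrapper for the song-level weighting in aggregate_song_scores."""
    return 0.6 * max_sim + 0.3 * top3_avg + 0.1 * min(vote / 10.0, 1.0)


# Ranks inferred from the reports, not read from the JSONL inputs.
type1_ranks = [1] * 20
type7_ranks = [18, 2]

overall = compute_metrics(type1_ranks + type7_ranks)
type7 = compute_metrics(type7_ranks)

# First type-1 example in songid_eval_report_fresh.json: one window, vote = 1.
score_song100 = combined_score(0.9999855160713196, 0.9999855160713196, 1)

print(overall)
print(type7)
print(score_song100)
```

With a single reference window per song, `max_sim` and `top3_avg` coincide and the vote term is fixed at 0.01, so the combined score reduces to 0.9 × similarity + 0.01. The song-level ranking in this evaluation is therefore the plain cosine ranking.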
--- a/acr-engine/scripts/live_pgvector_music20_eval.py 0 → 100755
+++ b/acr-engine/scripts/live_pgvector_music20_eval.py 0 → 100755
+#!/usr/bin/env /usr/local/miniconda3/bin/python
+from __future__ import annotations
+
+import argparse
+import json
+from collections import defaultdict
+from dataclasses import dataclass
+from pathlib import Path
+from statistics import median
+from typing import Any
+
+import psycopg
+
+ROOT = Path(__file__).resolve().parents[1]
+DEFAULT_SCHEMA_SQL = ROOT / 'sql' / 'acr_pg_schema_v2.sql'
+DEFAULT_REFERENCE = ROOT / 'data' / 'pgvector_eval' / 'music20' / 'reference_embeddings.jsonl'
+DEFAULT_QUERY = ROOT / 'data' / 'pgvector_eval' / 'music20' / 'query_embeddings.jsonl'
+DEFAULT_OUTPUT = ROOT / 'data' / 'pgvector_eval' / 'music20' / 'live_pgvector_report.json'
+
+
+@dataclass
+class EntityIds:
+    canonical_song_id: int
+    work_id: int
+    recording_id: int
+    asset_id: int
+    window_id: int
+    embedding_id: int
+
+
+def load_jsonl(path: Path) -> list[dict[str, Any]]:
+    return [json.loads(line) for line in path.read_text(encoding='utf-8').splitlines() if line.strip()]
+
+
+def pad_embedding(vec: list[float], target_dim: int = 192) -> list[float]:
+    if len(vec) > target_dim:
+        raise ValueError(f'embedding dim {len(vec)} > target {target_dim}')
+    if len(vec) == target_dim:
+        return vec
+    return vec + [0.0] * (target_dim - len(vec))
+
+
+def vec_literal(vec: list[float]) -> str:
+    return '[' + ','.join(f'{x:.10f}' for x in vec) + ']'
+
+
+def compute_metrics(ranks: list[int], topk: int) -> dict[str, Any]:
+    if not ranks:
+        return {'count': 0}
+    return {
+        'count': len(ranks),
+        'top1': round(sum(1 for r in ranks if r == 1) / len(ranks), 6),
+        'top3': round(sum(1 for r in ranks if r <= 3) / len(ranks), 6),
+        f'top{topk}': round(sum(1 for r in ranks if r <= topk) / len(ranks), 6),
+        'mrr': round(sum(1.0 / r for r in ranks) / len(ranks), 6),
+        'mean_rank': round(sum(ranks) / len(ranks), 4),
+        'median_rank': median(ranks),
+    }
+
+
+def aggregate_song_scores(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    for row in rows:
+        grouped[row['song_id']].append(row)
+    ranked = []
+    for song_id, vals in grouped.items():
+        vals.sort(key=lambda x: x['score'], reverse=True)
+        scores = [v['score'] for v in vals]
+        max_sim = scores[0]
+        top3_avg = sum(scores[:3]) / min(3, len(scores))
+        vote = len(scores)
+        combined = 0.6 * max_sim + 0.3 * top3_avg + 0.1 * min(vote / 10.0, 1.0)
+        ranked.append({
+            'song_id': song_id,
+            'canonical_song_id': vals[0]['canonical_song_id'],
+            'evidence_window_id': vals[0]['window_id'],
+            'combined_score': combined,
+            'max_sim': max_sim,
+            'top3_avg': top3_avg,
+            'vote': vote,
+        })
+    ranked.sort(key=lambda x: x['combined_score'], reverse=True)
+    return ranked
+
+
+def reset_schema(conn: psycopg.Connection, schema: str) -> None:
+    conn.execute(f'DROP SCHEMA IF EXISTS {schema} CASCADE;')
+    conn.execute(f'CREATE SCHEMA {schema};')
+    conn.execute(f'SET search_path TO {schema}, public;')
+
+
+def apply_schema(conn: psycopg.Connection, schema_sql: Path) -> None:
+    sql_text = schema_sql.read_text(encoding='utf-8')
+    conn.execute(sql_text)
+
+
+def seed_registry(conn: psycopg.Connection) -> tuple[int, int, int, int]:
+    model_id = conn.execute(
+        """
+        INSERT INTO model_registry (
+            model_name, model_family, model_version, model_source, model_uri,
+            license_name, input_sample_rate, default_window_sec, default_hop_sec,
+            output_embedding_dim, pooling_supported, metadata_json
+        ) VALUES (
+            'local_chroma24', 'chroma_baseline', 'v1', 'repo-local-eval',
+            'acr-engine/scripts/live_pgvector_music20_eval.py', 'internal-eval',
+            22050, 8.0, 8.0, 24, ARRAY['mean_std'],
+            '{"storage_padding":"zero-pad to vector(192) for pgvector compatibility"}'::jsonb
+        )
+        ON CONFLICT (model_name, model_version) DO UPDATE
+        SET updated_at = NOW()
+        RETURNING model_id;
+        """
+    ).fetchone()[0]
+
+    feature_set_id = conn.execute(
+        """
+        INSERT INTO feature_set_registry (
+            model_id, feature_name, feature_level, extraction_granularity,
+            window_sec, hop_sec, embedding_dim, pooling_strategy, layer_selection,
+            normalize_l2, distance_metric, quantization_type, feature_schema_version,
+            config_json, status
+        ) VALUES (
+            %s, 'chroma24_songid_eval', 'window', 'window',
+            8.0, 8.0, 24, 'mean_std', 'na', TRUE, 'cosine', NULL, 'v1',
+            '{"physical_storage":"audio_embedding_vector_192","padding":"zero"}'::jsonb,
+            'active'
+        )
+        RETURNING feature_set_id;
+        """,
+        (model_id,),
+    ).fetchone()[0]
+
+    reference_set_id = conn.execute(
+        """
+        INSERT INTO reference_set_registry (set_name, description, encoder_scope, status, metadata_json)
+        VALUES (
+            'music20_live_reference',
+            '20-song local live pgvector evaluation reference set',
+            'local_chroma24',
+            'active',
+            '{"purpose":"live_pgvector_music20_eval"}'::jsonb
+        )
+        ON CONFLICT (set_name) DO UPDATE SET updated_at = NOW()
+        RETURNING reference_set_id;
+        """
+    ).fetchone()[0]
+
+    retrieval_index_id = conn.execute(
+        """
+        INSERT INTO retrieval_index_registry (
+            feature_set_id, index_name, index_backend, index_type, storage_uri,
+            shard_no, row_count, index_status, config_json, built_at
+        ) VALUES (
+            %s, 'music20_live_pgvector_hnsw', 'pgvector', 'hnsw_cosine',
+            'postgres://d2@127.0.0.1/d2#acr_test.audio_embedding_vector_192',
+            0, 0, 'active', '{"physical_dim":192,"logical_dim":24}'::jsonb, NOW()
+        )
+        RETURNING retrieval_index_id;
+        """,
+        (feature_set_id,),
+    ).fetchone()[0]
+
+    return model_id, feature_set_id, reference_set_id, retrieval_index_id
+
+
+def ingest_references(conn: psycopg.Connection, refs: list[dict[str, Any]], feature_set_id: int, reference_set_id: int) -> dict[str, EntityIds]:
+    entities: dict[str, EntityIds] = {}
+    for idx, row in enumerate(refs):
+        song_id = str(row['song_id'])
+        canonical_song_id = conn.execute(
+            """
+            INSERT INTO canonical_song (biz_song_code, title, title_norm, primary_artist, primary_artist_norm, rights_status, metadata_json)
+            VALUES (%s, %s, %s, %s, %s, %s, %s::jsonb)
+            RETURNING canonical_song_id;
+            """,
+            (song_id, f'Song {song_id}', f'song {song_id}', f'Artist {song_id}', f'artist {song_id}', 'protected', json.dumps({'source': 'music20_live_eval'})),
+        ).fetchone()[0]
+        work_id = conn.execute(
+            """
+            INSERT INTO work (canonical_song_id, work_code, work_title, work_title_norm, composer, publisher, metadata_json)
+            VALUES (%s, %s, %s, %s, %s, %s, %s::jsonb)
+            RETURNING work_id;
+            """,
+            (canonical_song_id, f'work-{song_id}', f'Song {song_id}', f'song {song_id}', f'Composer {song_id}', 'Unknown', json.dumps({'note': '1:1 work for eval'})),
+        ).fetchone()[0]
+        recording_id = conn.execute(
+            """
+            INSERT INTO recording (
+                work_id, canonical_song_id, recording_code, recording_title, artist_name,
+                album_name, version_type, is_reference, reference_priority, duration_sec, metadata_json
+            ) VALUES (%s, %s, %s, %s, %s, %s, %s, TRUE, %s, %s, %s::jsonb)
+            RETURNING recording_id;
+            """,
+            (work_id, canonical_song_id, f'rec-{song_id}', f'Song {song_id} Reference', f'Artist {song_id}', 'music20', 'master_reference', 100 + idx, 8.0, json.dumps({'source_audio_path': row['audio_path']})),
+        ).fetchone()[0]
+        asset_id = conn.execute(
+            """
+            INSERT INTO recording_asset (
+                recording_id, asset_role, storage_uri, storage_scheme, file_ext, mime_type,
+                sample_rate, channels, codec_name, duration_sec, normalized_storage_uri,
+                ingest_status, metadata_json
+            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s::jsonb)
+            RETURNING asset_id;
+            """,
+            (recording_id, 'reference_audio', row['audio_path'], 'file', Path(row['audio_path']).suffix.lstrip('.'), 'audio/wav', 22050, 1, 'pcm_s16le', 8.0, row['audio_path'], 'ready', json.dumps({'type': 'reference'})),
+        ).fetchone()[0]
+        window_id = conn.execute(
+            """
+            INSERT INTO audio_window (
+                asset_id, recording_id, work_id, canonical_song_id,
+                window_index, start_sec, end_sec, duration_sec,
+                segment_role, segment_type, quality_score, active_for_index, metadata_json
+            ) VALUES (%s, %s, %s, %s, 0, 0.0, 8.0, 8.0, 'reference', 'full_clip', 1.0, TRUE, %s::jsonb)
+            RETURNING window_id;
+            """,
+            (asset_id, recording_id, work_id, canonical_song_id, json.dumps({'source_audio_path': row['audio_path']})),
+        ).fetchone()[0]
+        embedding_id = conn.execute(
+            """
+            INSERT INTO audio_embedding (
+                feature_set_id, extraction_job_id, asset_id, window_id, recording_id, work_id,
+                canonical_song_id, embedding_storage_mode, embedding_uri, vector_norm, checksum,
+                is_indexed, metadata_json
+            ) VALUES (%s, NULL, %s, %s, %s, %s, %s, %s, NULL, %s, NULL, TRUE, %s::jsonb)
+            RETURNING embedding_id;
+            """,
+            (feature_set_id, asset_id, window_id, recording_id, work_id, canonical_song_id, 'pgvector_inline_192_padded', 1.0, json.dumps({'logical_embedding_dim': len(row['embedding'])})),
+        ).fetchone()[0]
+        conn.execute(
+            'INSERT INTO audio_embedding_vector_192 (embedding_id, embedding) VALUES (%s, %s::vector);',
+            (embedding_id, vec_literal(pad_embedding(row['embedding']))),
+        )
+        conn.execute(
+            'INSERT INTO reference_set_member (reference_set_id, recording_id, member_role) VALUES (%s, %s, %s);',
+            (reference_set_id, recording_id, 'hot_reference'),
+        )
+        entities[song_id] = EntityIds(canonical_song_id, work_id, recording_id, asset_id, window_id, embedding_id)
+    return entities
+
+
+def run_lineage_negative_test(conn: psycopg.Connection, entity: EntityIds) -> dict[str, Any]:
+    try:
+        with conn.transaction():
+            conn.execute(
+                """
+                INSERT INTO audio_window (
+                    asset_id, recording_id, work_id, canonical_song_id, window_index,
+                    start_sec, end_sec, duration_sec, segment_role, segment_type, quality_score, active_for_index
+                ) VALUES (%s, %s, %s, %s, 999, 0.0, 8.0, 8.0, 'reference', 'bad_lineage', 0.0, TRUE);
+                """,
+                (entity.asset_id, entity.recording_id + 999999, entity.work_id, entity.canonical_song_id),
+            )
+        return {'passed': False, 'note': 'bad lineage insert unexpectedly succeeded'}
+    except Exception as exc:
+        return {'passed': True, 'error_type': type(exc).__name__, 'message': str(exc).splitlines()[0]}
+
+
+def fetch_raw_candidates(conn: psycopg.Connection, feature_set_id: int, query_vec: list[float], topn: int) -> list[dict[str, Any]]:
+    rows = conn.execute(
+        """
+        SELECT
+            cs.biz_song_code AS song_id,
+            ae.canonical_song_id,
+            aw.window_id,
+            1 - (aev.embedding <=> %s::vector) AS score
+        FROM audio_embedding_vector_192 aev
+        JOIN audio_embedding ae ON ae.embedding_id = aev.embedding_id
+        JOIN canonical_song cs ON cs.canonical_song_id = ae.canonical_song_id
+        JOIN audio_window aw ON aw.window_id = ae.window_id
+        WHERE ae.feature_set_id = %s
+        ORDER BY aev.embedding <=> %s::vector
+        LIMIT %s;
+        """,
+        (vec_literal(pad_embedding(query_vec)), feature_set_id, vec_literal(pad_embedding(query_vec)), topn),
+    ).fetchall()
+    return [
+        {
+            'song_id': r[0],
+            'canonical_song_id': r[1],
+            'window_id': r[2],
+            'score': float(r[3]),
+        }
+        for r in rows
+    ]
+
+
+def persist_candidates(conn: psycopg.Connection, query_id: str, retrieval_index_id: int, feature_set_id: int, ranked: list[dict[str, Any]], topk: int) -> None:
+    for i, item in enumerate(ranked[:topk], start=1):
+        conn.execute(
+            """
+            INSERT INTO retrieval_candidate (
+                query_id, retrieval_index_id, feature_set_id, source_lane,
+                candidate_level, candidate_id, evidence_window_id, raw_score,
+                normalized_score, rank_no, metadata_json
+            ) VALUES (%s, %s, %s, 'semantic', 'canonical_song', %s, %s, %s, %s, %s, %s::jsonb);
+            """,
+            (query_id, retrieval_index_id, feature_set_id, item['canonical_song_id'], item['evidence_window_id'], item['max_sim'], item['combined_score'], i, json.dumps({'vote': item['vote'], 'song_id': item['song_id']})),
+        )
+
+
+def persist_decision(conn: psycopg.Connection, query_id: str, ranked: list[dict[str, Any]]) -> None:
+    top = ranked[0] if ranked else None
+    conn.execute(
+        """
+        INSERT INTO match_decision (
+            query_id, canonical_song_id, work_id, recording_id,
+            decision_status, decision_score, decision_reason, metadata_json
+        ) VALUES (%s, %s, NULL, NULL, %s, %s, %s, %s::jsonb);
+        """,
+        (
+            query_id,
+            top['canonical_song_id'] if top else None,
+            'matched' if top else 'no_match',
+            top['combined_score'] if top else None,
+            'top1 semantic candidate from live pgvector eval' if top else 'no candidate',
+            json.dumps({'top_song_id': top['song_id']} if top else {}),
+        ),
+    )
+
+
+def evaluate_live(conn: psycopg.Connection, feature_set_id: int, retrieval_index_id: int, queries: list[dict[str, Any]], topn: int, topk: int) -> dict[str, Any]:
+    by_type: dict[str, list[int]] = defaultdict(list)
+    examples: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    confusion_focus: dict[str, dict[str, Any]] = {}
+
+    for idx, q in enumerate(queries):
+        qtype = str(q['query_type'])
+        query_id = f'music20-q{idx:04d}-t{qtype}-song{q["song_id"]}'
+        raw_rows = fetch_raw_candidates(conn, feature_set_id, q['embedding'], topn)
+        ranked = aggregate_song_scores(raw_rows)
+        gold = str(q['song_id'])
+        rank = next((i + 1 for i, item in enumerate(ranked) if item['song_id'] == gold), len(ranked) + 1)
+        by_type[qtype].append(rank)
+        persist_candidates(conn, query_id, retrieval_index_id, feature_set_id, ranked, topk)
+        persist_decision(conn, query_id, ranked)
+        if len(examples[qtype]) < 5:
+            examples[qtype].append({
+                'query_id': query_id,
+                'song_id': gold,
+                'rank': rank,
+                'top3': ranked[:3],
+            })
+
+    # Build the interpretation table once instead of on every loop iteration.
+    interpretations = {
+        '7': 'light confusion / transformed query',
+        '8': 'harder confusion bucket',
+        '16': 'strong confusion or far-domain bucket',
+    }
+    for qtype, interpretation in interpretations.items():
+        ranks = by_type.get(qtype, [])
+        confusion_focus[qtype] = {
+            'query_type': int(qtype),
+            'metrics': compute_metrics(ranks, topk),
+            'interpretation': interpretation,
+        }
+
+    all_ranks = [r for ranks in by_type.values() for r in ranks]
+    return {
+        'backend': 'postgresql+pgvector-live',
+        'note': 'Reference embeddings are stored in schema v2; 24-d logical embeddings are zero-padded to vector(192) for physical storage.',
+        'overall': compute_metrics(all_ranks, topk),
+        'by_query_type': {qtype: compute_metrics(ranks, topk) for qtype, ranks in by_type.items()},
+        'confusion_focus': confusion_focus,
+        'examples': examples,
+    }
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--dsn', required=True)
+    ap.add_argument('--schema', default='acr_test')
+    ap.add_argument('--schema-sql', default=str(DEFAULT_SCHEMA_SQL))
+    ap.add_argument('--reference-embeddings-jsonl', default=str(DEFAULT_REFERENCE))
+    ap.add_argument('--query-embeddings-jsonl', default=str(DEFAULT_QUERY))
+    ap.add_argument('--output', default=str(DEFAULT_OUTPUT))
+    ap.add_argument('--topn', type=int, default=20)
+    ap.add_argument('--topk', type=int, default=10)
+    ap.add_argument('--reset-schema', action='store_true')
+    args = ap.parse_args()
+
+    refs = load_jsonl(Path(args.reference_embeddings_jsonl))
+    queries = load_jsonl(Path(args.query_embeddings_jsonl))
+
+    with psycopg.connect(args.dsn, autocommit=True) as conn:
+        if args.reset_schema:
+            reset_schema(conn, args.schema)
+        else:
+            conn.execute(f'CREATE SCHEMA IF NOT EXISTS {args.schema};')
+            conn.execute(f'SET search_path TO {args.schema}, public;')
+        apply_schema(conn, Path(args.schema_sql))
+        model_id, feature_set_id, reference_set_id, retrieval_index_id = seed_registry(conn)
+        entities = ingest_references(conn, refs, feature_set_id, reference_set_id)
+        lineage_check = run_lineage_negative_test(conn, next(iter(entities.values())))
+        report = evaluate_live(conn, feature_set_id, retrieval_index_id, queries, args.topn, args.topk)
+        conn.execute('UPDATE retrieval_index_registry SET row_count = %s WHERE retrieval_index_id = %s;', (len(refs), retrieval_index_id))
+        counts = {
+            'canonical_song': conn.execute('SELECT count(*) FROM canonical_song;').fetchone()[0],
+            'work': conn.execute('SELECT count(*) FROM work;').fetchone()[0],
+            'recording': conn.execute('SELECT count(*) FROM recording;').fetchone()[0],
+            'recording_asset': conn.execute('SELECT count(*) FROM recording_asset;').fetchone()[0],
+            'audio_window': conn.execute('SELECT count(*) FROM audio_window;').fetchone()[0],
+            'audio_embedding': conn.execute('SELECT count(*) FROM audio_embedding;').fetchone()[0],
+            'retrieval_candidate': conn.execute('SELECT count(*) FROM retrieval_candidate;').fetchone()[0],
+            'match_decision': conn.execute('SELECT count(*) FROM match_decision;').fetchone()[0],
+        }
+
+    payload = {
+        'schema': args.schema,
+        # NOTE: hardcoded for the documented test instance; this is not derived from args.dsn.
+        'dsn_redacted': 'postgres://d2:***@127.0.0.1:5432/d2',
+        'input': {
+            'reference_embeddings_jsonl': args.reference_embeddings_jsonl,
+            'query_embeddings_jsonl': args.query_embeddings_jsonl,
+            'reference_count': len(refs),
+            'query_count': len(queries),
+        },
+        'registry': {
+            'model_id': model_id,
+            'feature_set_id': feature_set_id,
+            'reference_set_id': reference_set_id,
+            'retrieval_index_id': retrieval_index_id,
+        },
+        'table_counts': counts,
+        'lineage_negative_test': lineage_check,
+        'evaluation': report,
+    }
+
+    out = Path(args.output)
+    out.parent.mkdir(parents=True, exist_ok=True)
+    out.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding='utf-8')
+    print(json.dumps(payload, ensure_ascii=False, indent=2))
+
+
+if __name__ == '__main__':
+    main()
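The `dsn_redacted` field in the payload above is a fixed string, so it would be wrong for any run against a different instance. Below is a minimal sketch of deriving it from `--dsn` instead. `redact_dsn` is a hypothetical helper, not part of the committed script, and it assumes a URI-style DSN whose password sits in the userinfo part; keyword/value DSNs and passwords passed as query parameters are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_dsn(dsn: str) -> str:
    """Mask the password in a URI-style DSN (hypothetical helper)."""
    if '://' not in dsn:
        # Keyword/value DSNs ("host=... password=...") are not parsed here;
        # return a placeholder rather than risk leaking the password.
        return '<non-URI DSN redacted>'
    parts = urlsplit(dsn)
    if parts.password is None:
        return dsn
    # Take the text after the last '@' so host and port are kept verbatim.
    host_port = parts.netloc.rsplit('@', 1)[1]
    user = parts.username or ''
    netloc = f'{user}:***@{host_port}'
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(redact_dsn('postgres://d2:d2pass@127.0.0.1:5432/d2'))
# prints: postgres://d2:***@127.0.0.1:5432/d2
```

With such a helper, the payload entry could become `'dsn_redacted': redact_dsn(args.dsn)`.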
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 ## 2026-06-04

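+- Added [PostgreSQL storage samples and live test path](./postgres_db_schema_samples.md), covering real storage samples for `acr_pg_schema_v2.sql`, live `pgvector` retrieval validation, the lineage trigger negative test, and an interpretation of the current recall/confusion results.
+- Added `acr-engine/scripts/live_pgvector_music20_eval.py`, which runs against the user-provided PostgreSQL instance to create tables in an isolated schema, import sample data, run live `pgvector` retrieval, write `retrieval_candidate` / `match_decision` rows, and generate the evaluation report.
+- Added `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` and `songid_eval_report_fresh.json`, recording that this round's live PostgreSQL + pgvector results align with the FAISS stand-in: overall `top1=0.9091` / `top3=0.9545`, with `type_7` still clearly weak.
 - Rewrote the [session-handoff document](./session-handoff.md), condensing it from a historical running log into a "ready to use at next startup" guide that states the current stable conclusions, the recommended reading order, the verified/unverified boundary, and that the next step should start from the PostgreSQL v2 schema and the Phase-1 encoder-only execution chain.
 - Added the [Phase-1 implementation checklist](./phase1-implementation-checklist.md), breaking the encoder-only route into executable stages: master data, reference set, feature set, index, and evaluation.
 - Added the [model and feature set bootstrap guide](./model-feature-registry-bootstrap.md), filling in the initialization conventions and example SQL for model_registry / feature_set_registry / reference_set_registry.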
--- a/docs/README.md
+++ b/docs/README.md
@@ -54,6 +54,7 @@
 | [acr-architecture.md](./acr-architecture.md) | Current system blueprint, role split, online/offline paths | Architecture, development, operations |
 | [sota-evolution-guide.md](./sota-evolution-guide.md) | SOTA evolution path, Phase-1 encoder-only plan, later upgrade route | Architecture, models, retrieval |
 | [postgresql-data-model.md](./postgresql-data-model.md) | PostgreSQL data dictionary, DDL design intent, flowcharts, query paths | Data, backend, retrieval, platform |
+| [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | Actual PostgreSQL storage samples, live pgvector test path, recall/confusion results | Data, backend, retrieval, platform |
 | [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | Phase-1 rollout checklist, execution items split by stage | Architecture, development, platform |
 | [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | Bootstrap guide for models, feature sets, and reference sets | Models, retrieval, data |
 | [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) | Description of the current training/manifest/pgvector prototype chain | Development, data |
--- a/docs/postgres_db_schema_samples.md 0 → 100644
+++ b/docs/postgres_db_schema_samples.md 0 → 100644
+# PostgreSQL DB Schema Samples and Live Test Path
+
+> Updated: 2026-06-04  
+> Goal: give follow-up development a PostgreSQL storage sample that **can be followed step by step**, and keep the evidence from one real `pgvector` live test.
+
+---
+
+## One-page summary
+
+The following was completed on the user-provided PostgreSQL instance:
+
+1. **Connected to the live PostgreSQL instance**
+   - DSN: `postgres://d2:***@127.0.0.1:5432/d2`
+   - PostgreSQL: `17.5`
+   - The `vector` extension was confirmed to be installed
+
+2. **Applied schema v2 for real**
+   - Isolated schema: `acr_test`
+   - DDL source: `acr-engine/sql/acr_pg_schema_v2.sql`
+   - The master-data, registry, embedding, candidate, and decision tables were all created
+
+3. **Inserted a complete sample data chain**
+   - `canonical_song -> work -> recording -> recording_asset -> audio_window`
+   - `model_registry -> feature_set_registry -> audio_embedding -> retrieval_index_registry`
+   - `reference_set_registry -> reference_set_member`
+
+4. **Ran one full PostgreSQL + pgvector retrieval evaluation**
+   - Input: `acr-engine/data/pgvector_eval/music20/*.jsonl`
+   - Output: `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json`
+   - The live pgvector metrics **match** the existing FAISS stand-in metrics:
+     - overall `top1=0.9091`
+     - overall `top3=0.9545`
+     - `query_type=1`: `top1=1.0`
+     - `query_type=7`: `top1=0.0`, `top3=0.5`
+
+5. **The lineage trigger was verified to work**
+   - The script deliberately built an `audio_window` row with a wrong lineage
+   - PostgreSQL rejected the insert as expected
+
+---
+
+## Live test assets used in this run
+
+### Database
+
+| Item | Value |
+|---|---|
+| Host | `127.0.0.1` |
+| Port | `5432` |
+| DB | `d2` |
+| User | `d2` |
+| PostgreSQL | `17.5` |
+| Extensions | `vector`, `pg_trgm`, `ltree`, `hstore`, and others |
+| Test schema for this run | `acr_test` |
+
+### Code and artifacts
+
+| Type | Path |
+|---|---|
+| Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` |
+| Live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` |
+| Live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` |
+| FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` |
+| Historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` |
+
+---
+
+## The data chain actually written in this run
+
+```mermaid
+flowchart LR
+    A[reference_embeddings.jsonl] --> B[canonical_song]
+    B --> C[work]
+    C --> D[recording]
+    D --> E[recording_asset]
+    E --> F[audio_window]
+    F --> G[audio_embedding]
+    G --> H[audio_embedding_vector_192]
+
+    I[model_registry] --> J[feature_set_registry]
+    J --> G
+
+    K[reference_set_registry] --> L[reference_set_member]
+    D --> L
+
+    M[query_embeddings.jsonl] --> N[SQL pgvector search]
+    H --> N
+    N --> O[retrieval_candidate]
+    O --> P[match_decision]
+```
+
+---
+
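+## Why this live test pads 24-d embeddings to 192-d
+
+The current `schema v2` provides:
+- `audio_embedding_vector_192`
+- `audio_embedding_vector_768`
+
+The local `music20` sample embeddings, however, are **24-d chroma features**.
+
+So this live test uses the following strategy:
+
+- **Logical dimension**: `24`
+- **Physical stored dimension**: `192`
+- **Method**: append `0`s and write into `vector(192)`
+
+Reasons for doing this:
+- No ad hoc schema change is needed
+- The schema v2 + pgvector + retrieval path can still be validated
+- Cosine similarity, and therefore the ranking, is unchanged for these samples: appended zeros add nothing to the dot product or to either norm, and every vector (reference and query) is padded the same way
+
+This is a **path-validation** device only.
+
+In production, pick the physical table by the real encoder dimension:
+- High-dimensional embeddings such as `MERT` / `MuQ`: write directly into a physical table of matching dimension
+- If more dimensions appear later, keep extending the `audio_embedding_vector_<dim>` bucket strategy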
+
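The zero-padding argument in this section can be checked numerically. The following is a minimal sketch in plain Python. `zero_pad` and `cosine` are illustrative helpers written for this note (they are not the script's `pad_embedding`), and the two 24-d vectors are made-up values rather than real chroma embeddings.

```python
import math


def zero_pad(vec: list[float], dim: int = 192) -> list[float]:
    """Append zeros until the vector has `dim` entries (hypothetical helper)."""
    if len(vec) > dim:
        raise ValueError(f'vector has {len(vec)} dims, cannot pad to {dim}')
    return vec + [0.0] * (dim - len(vec))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up 24-d vectors standing in for chroma embeddings.
ref = [0.1 * (i % 5 + 1) for i in range(24)]
query = [0.05 * ((i * 3) % 7 + 1) for i in range(24)]

sim_24 = cosine(ref, query)
sim_192 = cosine(zero_pad(ref), zero_pad(query))

# The padded entries contribute only zero terms to the dot product and norms.
print(math.isclose(sim_24, sim_192, rel_tol=1e-12))
```

The same reasoning does not carry over to mixing padded and unpadded encoders in one table, which is why the section treats padding as a validation device only.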
+---
+
+## Sample rows actually stored in this run
+
+The following comes from real query results against the `acr_test` schema.
+
+### 1. canonical_song
+
+```json
+{"canonical_song_id":1,"biz_song_code":"100","title":"Song 100","primary_artist":"Artist 100","rights_status":"protected"}
+{"canonical_song_id":2,"biz_song_code":"101","title":"Song 101","primary_artist":"Artist 101","rights_status":"protected"}
+```
+
+### 2. work
+
+```json
+{"work_id":1,"canonical_song_id":1,"work_code":"work-100","work_title":"Song 100","composer":"Composer 100"}
+{"work_id":2,"canonical_song_id":2,"work_code":"work-101","work_title":"Song 101","composer":"Composer 101"}
+```
+
+### 3. recording
+
+```json
+{"recording_id":1,"work_id":1,"canonical_song_id":1,"recording_code":"rec-100","version_type":"master_reference","is_reference":true,"reference_priority":100}
+{"recording_id":2,"work_id":2,"canonical_song_id":2,"recording_code":"rec-101","version_type":"master_reference","is_reference":true,"reference_priority":101}
+```
+
+### 4. recording_asset
+
+```json
+{"asset_id":1,"recording_id":1,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/100/type_11/93dfdeb0-7da5-42a8-9c71-cf12af57dd191650256918.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"}
+{"asset_id":2,"recording_id":2,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/101/type_11/83c0c07f-4f96-4ff4-998c-58db910f3cfa1650256915.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"}
+```
+
+### 5. audio_window
+
+```json
+{"window_id":1,"asset_id":1,"recording_id":1,"work_id":1,"canonical_song_id":1,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"}
+{"window_id":2,"asset_id":2,"recording_id":2,"work_id":2,"canonical_song_id":2,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"}
+```
+
+### 6. model_registry / feature_set_registry
+
+```json
+{"model_id":1,"model_name":"local_chroma24","model_family":"chroma_baseline","model_version":"v1","output_embedding_dim":24,"default_window_sec":8.0}
+{"feature_set_id":1,"model_id":1,"feature_name":"chroma24_songid_eval","embedding_dim":24,"distance_metric":"cosine","feature_schema_version":"v1"}
+```
+
+### 7. audio_embedding
+
+```json
+{"embedding_id":1,"feature_set_id":1,"asset_id":1,"window_id":1,"recording_id":1,"canonical_song_id":1,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true}
+{"embedding_id":2,"feature_set_id":1,"asset_id":2,"window_id":2,"recording_id":2,"canonical_song_id":2,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true}
+```
+
+### 8. reference_set_registry / retrieval_index_registry
+
+```json
+{"reference_set_id":1,"set_name":"music20_live_reference","encoder_scope":"local_chroma24","status":"active"}
+{"retrieval_index_id":1,"feature_set_id":1,"index_name":"music20_live_pgvector_hnsw","index_backend":"pgvector","index_type":"hnsw_cosine","row_count":20,"index_status":"active"}
+```
+
+### 9. retrieval_candidate / match_decision
+
+```json
+{"retrieval_candidate_id":1,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":1,"raw_score":0.99998549,"normalized_score":0.90998694,"rank_no":1}
+{"retrieval_candidate_id":2,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":17,"raw_score":0.9527432,"normalized_score":0.86746888,"rank_no":2}
+{"match_decision_id":1,"query_id":"music20-q0000-t1-song100","canonical_song_id":1,"decision_status":"matched","decision_score":0.90998694}
+```
+
+---
+
+## Table sizes in this live test
+
+| Table | Rows |
+|---|---:|
+| `canonical_song` | 20 |
+| `work` | 20 |
+| `recording` | 20 |
+| `recording_asset` | 20 |
+| `audio_window` | 20 |
+| `audio_embedding` | 20 |
+| `retrieval_candidate` | 220 |
+| `match_decision` | 22 |
+
+Notes:
+- 20 reference songs
+- 22 queries
+- Each query writes its top-10 candidates, hence `22 * 10 = 220`
+
+---
+
+## Test path and logic in this run
+
+### A. Schema / data integrity test
+
+1. Connect to PostgreSQL
+2. Create the isolated schema `acr_test`
+3. Execute `acr_pg_schema_v2.sql`
+4. Initialize:
+   - `model_registry`
+   - `feature_set_registry`
+   - `reference_set_registry`
+   - `retrieval_index_registry`
+5. Import the 20 reference samples
+6. Verify that the table counts are correct
+7. Deliberately insert an `audio_window` row with a wrong lineage
+8. Expect the PostgreSQL trigger to reject that write
+
+### B. Live retrieval evaluation test
+
+1. Read 20 reference embeddings from `reference_embeddings.jsonl`
+2. Write them into `audio_embedding` + `audio_embedding_vector_192`
+3. Read 22 query embeddings from `query_embeddings.jsonl`
+4. For each query, run a `pgvector` cosine search in SQL
+5. Do song-level aggregation in the application layer:
+   - `max_sim`
+   - `top3_avg`
+   - `vote`
+   - `combined = 0.6 * max_sim + 0.3 * top3_avg + 0.1 * vote_factor`
+6. Write the top-10 candidates to `retrieval_candidate`
+7. Write the top-1 decision to `match_decision`
+8. Compute:
+   - overall `top1/top3/top10/mrr`
+   - `by_query_type`
+   - `confusion_focus`
+
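The song-level aggregation in step 5 above can be sketched as follows. This is not the script's `aggregate_song_scores`, whose body is not shown in this section. In particular, `vote_factor = min(vote / 10, 1.0)` is an assumption, chosen only because it reproduces the `normalized_score` values in the stored `retrieval_candidate` sample rows (one window per song, so `vote = 1` and the factor is `0.1`). The `song_id` and `window_id` of the second row are placeholders.

```python
from collections import defaultdict
from typing import Any


def aggregate_sketch(raw_rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group window-level hits by song and rank songs by a combined score."""
    by_song: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in raw_rows:
        by_song[row['song_id']].append(row)

    ranked = []
    for song_id, hits in by_song.items():
        hits.sort(key=lambda h: h['score'], reverse=True)
        best = hits[0]
        top3 = [h['score'] for h in hits[:3]]
        top3_avg = sum(top3) / len(top3)
        vote = len(hits)
        # ASSUMPTION: the real vote_factor definition is not shown in the source.
        vote_factor = min(vote / 10, 1.0)
        combined = 0.6 * best['score'] + 0.3 * top3_avg + 0.1 * vote_factor
        ranked.append({
            'song_id': song_id,
            'canonical_song_id': best['canonical_song_id'],
            'evidence_window_id': best['window_id'],
            'max_sim': best['score'],
            'top3_avg': top3_avg,
            'vote': vote,
            'combined_score': combined,
        })
    ranked.sort(key=lambda item: item['combined_score'], reverse=True)
    return ranked


# Raw scores taken from the two retrieval_candidate sample rows in this document.
rows = [
    {'song_id': '100', 'canonical_song_id': 1, 'window_id': 1, 'score': 0.99998549},
    # 'song-17' and window_id 17 are placeholders; the document only gives candidate_id 17.
    {'song_id': 'song-17', 'canonical_song_id': 17, 'window_id': 17, 'score': 0.9527432},
]
ranked = aggregate_sketch(rows)
for item in ranked:
    print(item['canonical_song_id'], round(item['combined_score'], 8))
```

Under that assumption the two combined scores round to `0.90998694` and `0.86746888`, the same values as the `normalized_score` column of the sample rows.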
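+### C. Confusion test scope
+
+The live samples in this run actually contain only:
+- `type_1`
+- `type_7`
+
+Therefore:
+- `type_7` can serve as the **current live confusion check**
+- `type_8 / type_16` are not covered by this run's live JSONL and can only be read together with the historical business-sample results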
+
+---
+
+## Live pgvector results
+
+### 1. overall
+
+| Metric | Value |
+|---|---:|
+| Queries | 22 |
+| top1 | `0.9091` |
+| top3 | `0.9545` |
+| top10 | `0.9545` |
+| MRR | `0.9343` |
+| mean rank | `1.8182` |
+
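The overall figures above can be cross-checked from a list of per-query ranks. The sketch below is not the script's `compute_metrics`, and the per-query ranks are not listed in this document. The 20 `type_1` queries are rank 1 by the reported `top1=1.0`; ranks 2 and 18 for the two `type_7` queries are inferred as the only integer pair consistent with the reported top1/top3/top10, MRR, and mean rank.

```python
def metrics_sketch(ranks: list[int], topk: int = 10) -> dict[str, float]:
    """Hit rates, MRR, and mean rank from 1-based ranks (hypothetical helper)."""
    n = len(ranks)
    if n == 0:
        return {}
    return {
        'top1': sum(r <= 1 for r in ranks) / n,
        'top3': sum(r <= 3 for r in ranks) / n,
        f'top{topk}': sum(r <= topk for r in ranks) / n,
        # MRR here uses every rank, including those beyond topk.
        'mrr': sum(1.0 / r for r in ranks) / n,
        'mean_rank': sum(ranks) / n,
    }


# 20 type_1 queries at rank 1, plus the inferred type_7 ranks 2 and 18.
ranks = [1] * 20 + [2, 18]
overall = {name: round(value, 4) for name, value in metrics_sketch(ranks).items()}
print(overall)
# prints: {'top1': 0.9091, 'top3': 0.9545, 'top10': 0.9545, 'mrr': 0.9343, 'mean_rank': 1.8182}
```

If the MRR were truncated at `topk`, the rank-18 query would contribute 0 and the result would be `0.9318`, so the reported `0.9343` suggests the untruncated form.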
+### 2. by query type
+
+| query_type | count | top1 | top3 | top10 | Interpretation |
+|---|---:|---:|---:|---:|---|
+| `1` | 20 | `1.0` | `1.0` | `1.0` | clean / near-clean |
+| `7` | 2 | `0.0` | `0.5` | `0.5` | current live confusion samples |
+| `8` | 0 | N/A | N/A | N/A | not covered by this run's live JSONL |
+| `16` | 0 | N/A | N/A | N/A | not covered by this run's live JSONL |
+
+### 3. Consistency with the existing FAISS stand-in
+
+| Path | overall top1 | overall top3 | type_1 top1 | type_7 top1 | type_7 top3 |
+|---|---:|---:|---:|---:|---:|
+| live PostgreSQL + pgvector | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` |
+| FAISS stand-in | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` |
+
+Conclusion:
+
+> The live pgvector path on `acr_test` is now aligned with the existing stand-in retrieval logic.  
+> The problem is not that "storing in PostgreSQL degrades recall"; the current sample embeddings are simply not strong enough on confusion-type queries.
+
+---
+
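+## Supplementary views of the confusion tests
+
+### 1. Current live sample view
+
+| query_type | Source | top1 | top3 | Conclusion |
+|---|---|---:|---:|---|
+| `7` | `live_pgvector_report.json` | `0.0` | `0.5` | already clearly weak |
+
+### 2. Historical local 20-song small-sample view
+
+Source: `acr-engine/data/local_eval/music20_summary.json`
+
+| query_type | top1 | top3 |
+|---|---:|---:|
+| `1` | `1.0` | `1.0` |
+| `7` | `0.45` | `0.65` |
+| `8` | `0.4667` | `0.7333` |
+| `16` | `0.4167` | `0.4167` |
+
+Notes:
+- These are results from the **local small-sample chroma/FAISS sanity flow**
+- It scores better on type_7 than the current live JSONL because the sample composition differs
+- These results cannot be taken as production performance, but they are supporting evidence that the current features are not entirely unusable on small samples
+
+### 3. Historical business-corpus voice correctness view
+
+| query_type | File | top1 | top3 | Conclusion |
+|---|---|---:|---:|---|
+| `7` | `voice_workspace20_type7_eval.json` | `0.0` | `0.05` | extremely weak |
+| `8` | `voice_workspace20_type8_eval.json` | `0.0` | `0.0` | extremely weak |
+| `16` | `voice_workspace20_type16_eval.json` | `0.0` | `0.0` | extremely weak |
+
+Conclusion:
+
+> As soon as queries come from more realistic, more confusable business samples, the current baseline is still far from sufficient.  
+> PostgreSQL storage is fine; the real problem remains that **the embedding lane lacks discriminative power on hard cases**.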
+
+---
+
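+## What this run verified and what it did not
+
+### Verified
+
+- The PostgreSQL instance is reachable and usable
+- The `vector` extension is usable
+- Schema v2 can be applied for real
+- The main lineage trigger does block bad data
+- The sample data chain can be stored as `song -> work -> recording -> asset -> window -> embedding`
+- Live pgvector retrieval is consistent with the existing stand-in logic
+- `retrieval_candidate` / `match_decision` can hold online results
+
+### Not verified
+
+- `MERT` / `MuQ` have not yet been wired into this live path
+- This run's live samples do not cover JSONL embeddings for `type_8 / type_16`
+- Only the 20-song scale was validated, which says nothing about index performance at 300,000 songs
+- No aggregation tests yet for multiple recordings / multiple versions / the cover lane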
+
+---
+
+## Recommended next steps
+
+### Route 1: continue PostgreSQL engineering
+
+1. Generalize `live_pgvector_music20_eval.py` so that it:
+   - can import any manifest/reference set
+   - can select the encoder / feature set
+   - can directly generate `retrieval_candidate` / `match_decision` reports
+2. Add:
+   - `audio_embedding_vector_1024` / tables for other common dimensions
+   - bulk COPY / batched insert
+   - HNSW parameter management
+
+### Route 2: continue validating confusion-type effectiveness
+
+1. Build query embedding JSONL that actually covers `type_8 / type_16`
+2. Rerun the PostgreSQL evaluation with the same live script
+3. Compare:
+   - `Chromaprint only`
+   - `semantic only`
+   - `fusion`
+4. Produce a confusion bucket report
+
+### Route 3: switch to the Phase-1 encoder-only main line
+
+1. Keep the current PostgreSQL structure unchanged
+2. Replace `local_chroma24` with:
+   - `MERT-v1-95M`
+   - `MuQ`
+3. Keep reusing:
+   - `model_registry`
+   - `feature_set_registry`
+   - `reference_set_registry`
+   - `retrieval_index_registry`
+4. Re-test:
+   - clean
+   - type_7
+   - type_8
+   - type_16
+   - business voice bucket
+
+---
+
+## Reproduction commands
+
+### 1. Live PostgreSQL + pgvector test
+
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/live_pgvector_music20_eval.py \
+  --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
+  --schema acr_test \
+  --reset-schema \
+  --output data/pgvector_eval/music20/live_pgvector_report.json
+```
+
+### 2. FAISS stand-in comparison test
+
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/evaluate_songid_pgvector_path.py \
+  --reference-embeddings-jsonl data/pgvector_eval/music20/reference_embeddings.jsonl \
+  --query-embeddings-jsonl data/pgvector_eval/music20/query_embeddings.jsonl \
+  --output data/pgvector_eval/music20/songid_eval_report_fresh.json
+```
+
+---
+
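+## One-sentence conclusion
+
+> The PostgreSQL path can now store the schema, the samples, candidates, and decisions for real, and can run real pgvector retrieval.  
+> The biggest weakness is no longer "how to store"; it is that **the current baseline embedding's recall on confusion queries is still clearly insufficient**.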