Why the song-centric semantic lane must move from placeholder to a real MERT baseline

Constraint: The current host now has torch/torchaudio/transformers, so the default song-centric pipeline should produce a real semantic baseline instead of a runtime-ready placeholder
Rejected: Keep the placeholder branch after runtime became available | would leave the main pipeline in a misleading half-ready state
Confidence: medium
Scope-risk: narrow
Directive: Preserve the local_wavehash_embed fallback, but treat mert-v1-95m as the default semantic baseline until MuQ is added as a challenger
Tested: installed torch-2.12.0+cpu, torchaudio-2.11.0+cpu, transformers-5.10.1; py_compile for enrich_songcentric_manifest_with_local_features.py; reran song-centric pipeline; verified latest embedding rows are mert-v1-95m; markdown link check on /workspace/docs
Not-tested: MuQ adapter implementation and production vector-table persistence are still pending
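
The Directive above describes a three-way choice: the real MERT baseline when the runtime works, an explicit placeholder when the runtime is importable but extraction fails, and the hash-based fallback when the runtime is missing. A minimal sketch of that decision follows. The helper name `build_feature_with_fallback` and its `extract_real` callback are hypothetical and are not part of the commit; only the three `model_name` values and the `runtime_error` key come from the diff.

```python
from typing import Callable


def build_feature_with_fallback(
    runtime_ok: bool,
    extract_real: Callable[[], dict],
) -> dict:
    """Hypothetical helper: prefer the MERT baseline, degrade explicitly."""
    if not runtime_ok:
        # torch/torchaudio/transformers are missing: keep the hash-based fallback.
        return {"model_name": "local_wavehash_embed"}
    try:
        feature = extract_real()
        return {"model_name": "mert-v1-95m", **feature}
    except Exception as exc:
        # Runtime is importable but extraction failed: record the reason
        # instead of silently emitting a placeholder row.
        return {
            "model_name": "semantic_runtime_ready_placeholder",
            "metadata_json": {"runtime_error": str(exc)},
        }


def _failing_extractor() -> dict:
    # Stand-in for an extractor that hits an empty audio window.
    raise ValueError("empty segment for MERT extraction")


ok = build_feature_with_fallback(True, lambda: {"embedding_dim": 768})
degraded = build_feature_with_fallback(True, _failing_extractor)
missing = build_feature_with_fallback(False, lambda: {"embedding_dim": 768})
print(ok["model_name"], degraded["model_name"], missing["model_name"])
```

Keeping the error string in `metadata_json` lets a later query tell "runtime present but broken" apart from "runtime absent", which a bare placeholder row cannot do.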

Commit 80df0d30 ... 80df0d301f60778aac95e3d4fd528af8c7afb47d authored 2026-06-04 15:53:24 +0800 by cnb.bofCdSsphPA
Showing 4 changed files with 86 additions and 12 deletions
acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
docs/CHANGELOG.md
docs/session-handoff.md
docs/start-here.md
--- a/acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
+++ b/acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
@@ -8,6 +8,8 @@ import json
 import wave
 from pathlib import Path

+import numpy as np
+
 ROOT = Path(__file__).resolve().parents[1]
 import sys
 if str(ROOT) not in sys.path:
@@ -15,6 +17,9 @@ if str(ROOT) not in sys.path:

 from src.engines.chromaprint_matcher import ChromaprintMatcher, load_audio_mono

+MERT_MODEL_ID = 'm-a-p/MERT-v1-95M'
+_MERT_RUNTIME = None
+

 def load_jsonl(path: Path):
    for line in path.read_text().splitlines():
@@ -72,8 +77,77 @@ def extract_matcher_fingerprint(path: Path, start_ms: int, end_ms: int) -> dict 
        return None


-def build_semantic_feature(stats: dict, start_ms: int, end_ms: int, runtime_ok: bool, missing: list[str]) -> dict:
+def load_mert_runtime():
+    global _MERT_RUNTIME
+    if _MERT_RUNTIME is not None:
+        return _MERT_RUNTIME
+    import torch
+    import torchaudio
+    from transformers import Wav2Vec2FeatureExtractor, AutoModel
+    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MERT_MODEL_ID, trust_remote_code=True)
+    model = AutoModel.from_pretrained(MERT_MODEL_ID, trust_remote_code=True)
+    model.eval()
+    _MERT_RUNTIME = {
+        'torch': torch,
+        'torchaudio': torchaudio,
+        'feature_extractor': feature_extractor,
+        'model': model,
+        'sample_rate': int(feature_extractor.sampling_rate),
+        'hidden_size': int(getattr(model.config, 'hidden_size', 768)),
+    }
+    return _MERT_RUNTIME
+
+
+def extract_mert_embedding(asset_path: Path, start_ms: int, end_ms: int) -> dict:
+    rt = load_mert_runtime()
+    torch = rt['torch']
+    samples, sr = load_audio_mono(str(asset_path), sr=rt['sample_rate'])
+    samples = np.asarray(samples, dtype=np.float32)
+    start_frame = int(start_ms * sr / 1000)
+    end_frame = int(end_ms * sr / 1000)
+    segment = samples[start_frame:end_frame]
+    if segment.size == 0:
+        raise ValueError('empty segment for MERT extraction')
+    inputs = rt['feature_extractor'](
+        segment,
+        sampling_rate=sr,
+        return_tensors='pt',
+    )
+    with torch.no_grad():
+        outputs = rt['model'](**inputs)
+    emb = outputs.last_hidden_state.mean(dim=1).squeeze(0).cpu().numpy().astype(np.float32)
+    digest = hashlib.sha256(emb.tobytes()).hexdigest()
+    return {
+        'embedding_dim': int(emb.shape[0]),
+        'embedding_uri': f"inline-mert://{digest[:16]}:{start_ms}:{end_ms}",
+        'vector_table_name': f"audio_embedding_vector_{int(emb.shape[0])}_placeholder",
+        'checksum': f"emb:{digest[:16]}",
+        'metadata_json': {
+            'semantic_backend': 'mert_runtime',
+            'embedding_preview': [float(x) for x in emb[:8]],
+            'model_id': MERT_MODEL_ID,
+            'sample_rate': sr,
+        },
+    }
+
+
+def build_semantic_feature(asset_path: Path, stats: dict, start_ms: int, end_ms: int, runtime_ok: bool, missing: list[str]) -> dict:
    if runtime_ok:
+        try:
+            mert = extract_mert_embedding(asset_path, start_ms, end_ms)
+            return {
+                'feature_type': 'embedding',
+                'model_name': 'mert-v1-95m',
+                'model_version': 'hf-main',
+                'feature_set_name': 'mert_5s_hop2.5_v1',
+                'feature_schema_ver': 'v1',
+                'embedding_dim': mert['embedding_dim'],
+                'embedding_uri': mert['embedding_uri'],
+                'vector_table_name': mert['vector_table_name'],
+                'checksum': mert['checksum'],
+                'metadata_json': mert['metadata_json'],
+            }
+        except Exception as exc:
            return {
                'feature_type': 'embedding',
                'model_name': 'semantic_runtime_ready_placeholder',
@@ -84,7 +158,7 @@ def build_semantic_feature(stats: dict, start_ms: int, end_ms: int, runtime_ok: 
                'embedding_uri': f"runtime-ready://{stats['digest'][:16]}:{start_ms}:{end_ms}",
                'vector_table_name': 'audio_embedding_vector_8_placeholder',
                'checksum': f"emb:{stats['digest'][:16]}",
-            'metadata_json': {'semantic_backend': 'runtime_ready_placeholder'},
+                'metadata_json': {'semantic_backend': 'runtime_ready_placeholder', 'runtime_error': str(exc)},
            }
    return {
        'feature_type': 'embedding',
@@ -162,7 +236,7 @@ def main() -> int:
                    }
                    fallback_fp_count += 1

-                emb = build_semantic_feature(stats, window['start_ms'], window['end_ms'], runtime_ok, missing_runtime)
+                emb = build_semantic_feature(asset_path, stats, window['start_ms'], window['end_ms'], runtime_ok, missing_runtime)
                if runtime_ok:
                    semantic_runtime_ready_count += 1
                else:
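
The extraction path in the script diff above converts a millisecond window into sample indices, then collapses the model's per-frame hidden states into one vector by averaging over time. The sketch below reproduces only those two steps with NumPy so it runs without torch. The 24 kHz rate, the `(1, 374, 768)` shape, and the helper names `slice_window` and `mean_pool` are assumptions for illustration; the commit itself reads the sample rate and hidden size from the loaded model.

```python
import hashlib

import numpy as np


def slice_window(samples: np.ndarray, sr: int, start_ms: int, end_ms: int) -> np.ndarray:
    """Cut [start_ms, end_ms) out of a mono signal sampled at sr Hz."""
    start = int(start_ms * sr / 1000)
    end = int(end_ms * sr / 1000)
    segment = samples[start:end]
    if segment.size == 0:
        raise ValueError("empty segment")
    return segment


def mean_pool(hidden: np.ndarray) -> np.ndarray:
    """Average hidden states of shape (1, T, H) over time to get (H,)."""
    return hidden.mean(axis=1).squeeze(0).astype(np.float32)


sr = 24_000  # assumed sample rate, not read from the model here
samples = np.zeros(sr * 10, dtype=np.float32)  # 10 s of silence as stand-in audio
segment = slice_window(samples, sr, 2500, 7500)

# Stand-in for outputs.last_hidden_state; the real values come from the model.
hidden = np.ones((1, 374, 768), dtype=np.float32)
emb = mean_pool(hidden)

# Same URI scheme as the diff: a digest prefix plus the window bounds.
digest = hashlib.sha256(emb.tobytes()).hexdigest()
uri = f"inline-mert://{digest[:16]}:2500:7500"
print(segment.shape, emb.shape, uri)
```

Mean pooling yields a fixed-size vector regardless of window length, which is what lets `embedding_dim` name a single vector table.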
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 # Changelog

 ## 2026-06-04
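-- Fresh runtime progress: `torch-2.12.0+cpu`, `torchaudio-2.11.0+cpu`, and `transformers-5.10.1` are now installed on the current host. Rerunning the song-centric main pipeline confirmed `semantic_runtime_available = true`, `semantic_runtime_ready_count = 5`, and `semantic_fallback_count = 0`. The semantic lane has moved from the fallback to `semantic_runtime_ready_placeholder`; the only remaining step is wiring in a real `MERT / MuQ` adapter.
+- Fresh runtime progress: `torch-2.12.0+cpu`, `torchaudio-2.11.0+cpu`, and `transformers-5.10.1` are now installed on the current host. Rerunning the song-centric main pipeline confirmed `semantic_runtime_available = true`, `semantic_runtime_ready_count = 5`, and `semantic_fallback_count = 0`. The semantic lane has moved from the fallback to `mert-v1-95m`; the next step is to add the `MuQ` adapter without breaking the current MERT baseline.
 - Narrowed `docs/` to the current song-centric main line, keeping only six core documents (`README / start-here / session-handoff / postgresql-data-model / postgres_db_schema_samples / CHANGELOG`) and deleting the old v2 / planner-worker / registry extension documents, so that newcomers do not wander into designs that have been demoted to secondary status.
 - Rewrote `docs/postgresql-data-model.md` to spell out how `saved slice data + model + feature` are persisted: a `window` goes into `audio_object`, model identity goes into `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` also goes into `feature_fact`.
 - Rewrote `docs/postgres_db_schema_samples.md` and the entry documents, adding a flow diagram of the current four-table main chain, typical SQL samples, query trace-back paths, and the write order, and aligning all documents on `media_entity -> audio_object -> feature_fact -> set_membership`.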
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -33,7 +33,7 @@ acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.
 - `semantic_runtime_missing = []`
 - `semantic_runtime_ready_count = 5`
 - `semantic_fallback_count = 0`
-- `import_counts = media_entity:9 / audio_object:22 / feature_fact:29 / set_membership:9`
+- `import_counts = media_entity:9 / audio_object:22 / feature_fact:34 / set_membership:9`

 ---

@@ -122,10 +122,10 @@ flowchart TD

 - `torch / torchaudio / transformers` can now be imported
 - Currently `semantic_runtime_available = true`
-- The current semantic output is still not `MERT / MuQ`; it is `semantic_runtime_ready_placeholder`
+- The semantic lane is now wired to a real `mert-v1-95m` baseline

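 This means the main blocker has moved on from "missing dependencies" to:
-> **The runtime is ready, but the real semantic adapter is not wired in yet.**
+> **The runtime is ready, the real `MERT` baseline is wired in, and `MuQ` can be added next.**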

 ---

@@ -174,7 +174,7 @@ flowchart TD
 - The exact lane now reuses `ChromaprintMatcher` by preference
 - The semantic lane is not yet wired to a real `MERT / MuQ`
 - When the runtime is ready, the pipeline currently produces:
-  - `model_name = semantic_runtime_ready_placeholder`
+  - `model_name = mert-v1-95m`
 - The fallback branch is still kept:
  - `model_name = local_wavehash_embed`

--- a/docs/start-here.md
+++ b/docs/start-here.md
@@ -31,7 +31,7 @@ acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.
 - `semantic_runtime_missing = []`
 - `semantic_runtime_ready_count = 5`
 - `semantic_fallback_count = 0`
-- `import_counts = media_entity:9 / audio_object:22 / feature_fact:29 / set_membership:9`
+- `import_counts = media_entity:9 / audio_object:22 / feature_fact:34 / set_membership:9`

 ---

@@ -100,7 +100,7 @@ flowchart TD
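 - Real directory -> manifest -> import works end to end
 - Real directory -> fingerprint enrichment -> import works end to end
 - The semantic lane has been made runtime-ready
-- The current host can now enter the runtime-ready placeholder branch; the only remaining step is wiring in a real `MERT / MuQ`
+- The current host can now enter the runtime-ready placeholder branch; the next step is to add `MuQ` without breaking the current MERT baseline
 - The exact lane now reuses the in-repo `ChromaprintMatcher` by preference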

 ---
@@ -108,14 +108,14 @@ flowchart TD
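 ## 7. What to continue with next

 ### Top priority
-Upgrade the semantic lane from `semantic_runtime_ready_placeholder` to a real encoder adapter without breaking the existing host chain.
+Extend the semantic lane from the `mert-v1-95m` baseline to a `MuQ` challenger without breaking the existing host chain.

 ### Current host facts
 - `torch` can be imported
 - `torchaudio` can be imported
 - `transformers` can be imported
 - Currently `semantic_runtime_available = true`
-- The latest main-pipeline output is still `semantic_runtime_ready_placeholder`, not a real `MERT / MuQ`
+- The latest main-pipeline output is now `mert-v1-95m`; the next step is to add the `MuQ` challenger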

 ---