Why the Phase-1 docs must explain feature-to-window binding explicitly

Constraint: The current default must stay aligned with the live 4-table song-centric path and the real MERT baseline
Rejected: Re-expanding old multi-layer docs | increases onboarding cost and reintroduces stale states
Confidence: high
Scope-risk: narrow
Directive: Keep future schema docs anchored to live model_name/feature_set_name facts, not aspirational placeholders
Tested: markdown link check under docs; live PostgreSQL spot-check of feature_fact model_name/object_id/song_id lineage
Not-tested: Mermaid rendering in external markdown viewers

Commit 6ee8c576 ... 6ee8c576b04b4538dcf21a113600f08e1f08adbb authored 2026-06-04 16:11:59 +0800 by cnb.bofCdSsphPA
Showing 6 changed files with 376 additions and 89 deletions
docs/CHANGELOG.md
docs/README.md
docs/postgres_db_schema_samples.md
docs/postgresql-data-model.md
docs/session-handoff.md
docs/start-here.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 # Changelog

 ## 2026-06-04
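+- Continued converging the docs on the current live main-chain conventions: filled in the binding notes for `feature_fact.object_id -> audio_object(window)`, `window.parent_object_id -> asset`, and `feature_fact.song_id -> media_entity(song)`, and added paired manifest/SQL samples that specifically answer how the Phase-1 open-source model set should be stored and how features relate to audio objects.
+- Fixed stale state left in `docs/session-handoff.md` about the semantic lane, aligning it with the current facts: the live default already stores `chromaprint_matcher + mert-v1-95m`, and MuQ remains the next-stage challenger.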
+
+## 2026-06-04
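 - Fresh runtime progress: `torch-2.12.0+cpu`, `torchaudio-2.11.0+cpu`, and `transformers-5.10.1` are now installed on the current host; after rerunning the song-centric main chain we confirmed `semantic_runtime_available = true`, `semantic_runtime_ready_count = 5`, and `semantic_fallback_count = 0`. Semantic has moved from fallback to `mert-v1-95m`, and the next step is to add the `MuQ` adapter without breaking the current MERT baseline.
 - Recorded new MuQ integration leads: based on the repository's existing Phase-1 scripts and external model leads, the next step can first try `OpenMuQ/MuQ-large-msd-iter` as the minimal integration target for the MuQ challenger; the official loading entry point can first follow `from muq import MuQ` + `MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")`.
 - Fresh MuQ progress: the `muq` package is installed on the current host, but `import muq` is still blocked by `RuntimeError: operator torchvision::nms does not exist`; the blocker has moved from "MuQ not installed" to "torchvision compatibility issue".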
--- a/docs/README.md
+++ b/docs/README.md
@@ -73,8 +73,8 @@ acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.

 - [start-here.md](./start-here.md): 10-minute onboarding entry point for new teammates
 - [session-handoff.md](./session-handoff.md): where to pick up at the next session
-- [postgresql-data-model.md](./postgresql-data-model.md): table design, field semantics, flow diagrams, design trade-offs
-- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): DDL, sample data, typical SQL, import/query path
+- [postgresql-data-model.md](./postgresql-data-model.md): table design, field semantics, feature-to-audio_object binding, Phase-1 model storage conventions
+- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): DDL, manifest/SQL samples, typical query paths, real storage examples
 - [CHANGELOG.md](./CHANGELOG.md): change history

 ---
--- a/docs/postgres_db_schema_samples.md
+++ b/docs/postgres_db_schema_samples.md
@@ -33,14 +33,39 @@ song -> asset -> window -> fingerprint / embedding
 | song | `media_entity` | `entity_type='song'` | `song_000001` |
 | asset | `audio_object` | `object_type='asset'` | the original wav/mp3/flac of a song |
 | window | `audio_object` | `object_type='window'` | `0-5000ms`, `2500-7500ms` |
-| fingerprint | `feature_fact` | `feature_type='fingerprint'` | chromaprint |
+| fingerprint | `feature_fact` | `feature_type='fingerprint'` | chromaprint_matcher |
 | embedding | `feature_fact` | `feature_type='embedding'` | MERT/MuQ/fallback vector |
-| model | `feature_fact` | `model_name`, `model_version` | `mert-v1-95m`, `muq-base`, `local_wavehash_embed` |
+| model | `feature_fact` | `model_name`, `model_version` | `chromaprint_matcher`, `mert-v1-95m`, `muq-large-msd-iter`, `local_wavehash_embed` |
 | feature set | `feature_fact` | `feature_set_name`, `feature_schema_ver` | `mert_5s_hop2.5_v1` |

 ---

-## 3. DDL
+## 3. Phase-1 data binding in one diagram
+
+```mermaid
+flowchart LR
+    S[media_entity
+song] --> A[audio_object
+asset]
+    A --> W[audio_object
+window]
+    W --> F1[feature_fact
+chromaprint_matcher]
+    W --> F2[feature_fact
+mert-v1-95m]
+    W --> F3[feature_fact
+muq-large-msd-iter planned]
+```
+
+Key binding fields:
+- `audio_object.song_id -> media_entity.entity_id`
+- `window.parent_object_id -> asset.object_id`
+- `feature_fact.object_id -> window.object_id`
+- `feature_fact.song_id -> media_entity.entity_id`
+
+In short: `feature_fact` binds to a specific window, not to an abstract song; `song_id` is still written into it redundantly so results can be returned quickly.
+
+## 4. DDL

 ### 3.1 `media_entity`

@@ -170,9 +195,63 @@ flowchart LR

 ---

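-## 5. Sample data
+## 5. Manifest sample before import
+
+Before importing into the current main chain, the recommended approach is to place features directly under `windows[].features[]`: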
+
+```json
+{
+  "song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"},
+  "asset": {
+    "source_type": "official",
+    "storage_uri": "/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav",
+    "storage_scheme": "file",
+    "checksum": "path:/workspace/acr-engine/data/songcentric_builder_smoke/song_alpha/artist_a/clip1.wav",
+    "codec": "wav",
+    "sample_rate": 16000,
+    "channels": 1,
+    "duration_ms": 8000
+  },
+  "windows": [
+    {
+      "start_ms": 0,
+      "end_ms": 5000,
+      "features": [
+        {
+          "feature_type": "fingerprint",
+          "model_name": "chromaprint_matcher",
+          "model_version": "phase1_local",
+          "feature_set_name": "chromaprint_matcher_5s",
+          "fingerprint_value": "dc0c731425f360787f462da693ff4a50"
+        },
+        {
+          "feature_type": "embedding",
+          "model_name": "mert-v1-95m",
+          "model_version": "hf-main",
+          "feature_set_name": "mert_5s_hop2.5_v1",
+          "feature_schema_ver": "v1",
+          "embedding_dim": 768,
+          "embedding_uri": "inline-mert://19c0162d3bdde235:0:5000",
+          "vector_table_name": "audio_embedding_vector_768_placeholder"
+        }
+      ]
+    }
+  ],
+  "memberships": [
+    {"set_type": "reference_set", "set_name": "phase1_hot_reference_v1", "member_type": "asset", "priority": 100}
+  ]
+}
+```
+
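+What this JSON means is straightforward:
+- `song` determines which `song_id` results ultimately resolve to
+- `asset` identifies the original audio file
+- `windows[]` defines the slice boundaries
+- `windows[].features[]` records which models have already encoded each slice
+
+## 6. Sample data

-### 5.1 Write song
+### 6.1 Write song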

 ```sql
 insert into media_entity (
@@ -184,7 +263,7 @@ insert into media_entity (
 returning entity_id;
 ```

-### 5.2 Write asset
+### 6.2 Write asset

 ```sql
 insert into audio_object (
@@ -199,7 +278,7 @@ insert into audio_object (
 returning object_id;
 ```

-### 5.3 Write window
+### 6.3 Write window

 ```sql
 insert into audio_object (
@@ -211,7 +290,7 @@ insert into audio_object (
 returning object_id;
 ```

-### 5.4 Write fingerprint
+### 6.4 Write fingerprint

 ```sql
 insert into feature_fact (
@@ -220,13 +299,13 @@ insert into feature_fact (
    fingerprint_value, checksum, metadata_json
 ) values (
    'fingerprint', :window_id, :song_id,
-    'chromaprint', '1.0', 'chromaprint_5s_v1', 'v1',
+    'chromaprint_matcher', 'phase1_local', 'chromaprint_matcher_5s', 'v1',
    'AQAAE0mUaEkSZSo...', 'sha256:fp001',
    '{"lane":"exact"}'::jsonb
 );
 ```

-### 5.5 Write embedding
+### 6.5 Write embedding

 ```sql
 insert into feature_fact (
@@ -241,7 +320,7 @@ insert into feature_fact (
 );
 ```

-### 5.6 Write set membership
+### 6.6 Write set membership

 ```sql
 insert into set_membership (
@@ -254,7 +333,7 @@ insert into set_membership (

 ---

-## 6. Typical queries
+## 7. Typical queries

 ### 6.1 List the assets of a song

@@ -555,7 +634,7 @@ insert into feature_fact (
    fingerprint_value
 ) values (
    'fingerprint', :window_id, :song_id,
-    'chromaprint', '1.0', 'chromaprint_5s_v1', 'v1',
+    'chromaprint_matcher', 'phase1_local', 'chromaprint_matcher_5s', 'v1',
    'AQAAE0mUaEkSZSo...'
 );
 ```
@@ -583,7 +662,7 @@ insert into feature_fact (
    embedding_dim, embedding_uri, vector_table_name
 ) values (
    'embedding', :window_id, :song_id,
-    'muq-base', 'hf-main', 'muq_5s_hop2.5_v1', 'v1',
+    'muq-large-msd-iter', 'hf-main', 'muq_5s_hop2.5_v1', 'v1',
    768, 's3://bucket/emb/demo_song_win0001_muq.npy', 'audio_embedding_vector_768'
 );
 ```
@@ -636,13 +715,50 @@ order by ff.feature_type, ff.model_name;

 ---

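-## 14. A complete multi-asset / multi-window / multi-model sample
+## 14. A real binding query sample
+
+The SQL below answers the question users care about most:
+
+> How is a feature bound to an audio object, and how does it ultimately resolve back to a `song_id`?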
+
+```sql
+select ff.feature_id,
+       ff.feature_type,
+       ff.model_name,
+       ff.model_version,
+       ff.feature_set_name,
+       w.object_id as window_id,
+       w.start_ms,
+       w.end_ms,
+       a.object_id as asset_id,
+       a.storage_uri,
+       s.entity_id as song_id,
+       s.biz_key
+from feature_fact ff
+join audio_object w
+  on w.object_id = ff.object_id
+ and w.object_type = 'window'
+join audio_object a
+  on a.object_id = w.parent_object_id
+ and a.object_type = 'asset'
+join media_entity s
+  on s.entity_id = ff.song_id
+where ff.feature_id = :feature_id;
+```
+
+You can read it as 4 steps:
+1. Find the feature in `feature_fact`
+2. Use `object_id` to find the `window` it is bound to
+3. Use `parent_object_id` to find the `asset` that window belongs to
+4. Use `song_id` to find the `song` it ultimately belongs to
+
+## 15. A complete multi-asset / multi-window / multi-model sample

 Assume:
 - A single `song_id = 1001`
 - 2 audio files: `master.wav` and `ugc_clip.mp3`
 - Each asset is cut into 2 windows
-- Every window runs `chromaprint + mert-v1-95m + muq-base`
+- Every window runs `chromaprint_matcher + mert-v1-95m + muq-large-msd-iter`

 ### 14.1 Logical structure

@@ -650,22 +766,22 @@ order by ff.feature_type, ff.model_name;
 song(1001)
  -> asset(2001, master.wav)
    -> window(3001, 0-5000)
-      -> chromaprint
+      -> chromaprint_matcher
      -> mert-v1-95m
-      -> muq-base
+      -> muq-large-msd-iter
    -> window(3002, 2500-7500)
-      -> chromaprint
+      -> chromaprint_matcher
      -> mert-v1-95m
-      -> muq-base
+      -> muq-large-msd-iter
  -> asset(2002, ugc_clip.mp3)
    -> window(3003, 10000-15000)
-      -> chromaprint
+      -> chromaprint_matcher
      -> mert-v1-95m
-      -> muq-base
+      -> muq-large-msd-iter
    -> window(3004, 12500-17500)
-      -> chromaprint
+      -> chromaprint_matcher
      -> mert-v1-95m
-      -> muq-base
+      -> muq-large-msd-iter
 ```

 ### 14.2 How many rows this produces
@@ -706,7 +822,7 @@ order by a.object_id, w.start_ms, ff.feature_type, ff.model_name;

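 ### 14.4 Find windows missing a given model

-This SQL is well suited to scanning for backfill work, for example checking which windows have not yet run `muq-base`:
+This SQL is well suited to scanning for backfill work, for example checking which windows have not yet run `muq-large-msd-iter`: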

 ```sql
 select w.object_id as window_id,
@@ -721,7 +837,7 @@ where w.object_type = 'window'
      from feature_fact ff
      where ff.object_id = w.object_id
        and ff.feature_type = 'embedding'
-        and ff.model_name = 'muq-base'
+        and ff.model_name = 'muq-large-msd-iter'
        and ff.model_version = 'hf-main'
        and ff.feature_set_name = 'muq_5s_hop2.5_v1'
  )
@@ -746,7 +862,7 @@ order by w.start_ms;

 ---

-## 15. Batch ingestion and index-building sample
+## 16. Batch ingestion and index-building sample

 ### 15.1 Recommended batch order

@@ -756,7 +872,7 @@ batch-2: audio_object(asset)
 batch-3: audio_object(window)
 batch-4: feature_fact(chromaprint)
 batch-5: feature_fact(mert-v1-95m)
-batch-6: feature_fact(muq-base)
+batch-6: feature_fact(muq-large-msd-iter)
 ```

 ### 15.2 Recommended additional indexes
--- a/docs/postgresql-data-model.md
+++ b/docs/postgresql-data-model.md
@@ -67,7 +67,68 @@ song -> asset -> window -> fingerprint / embedding
 | feature set identity | `feature_fact` | `feature_set_name`, `feature_schema_ver` | distinguishes feature configuration, windowing strategy, and schema version |
 | reference routing | `set_membership` | `set_type`, `set_name` | controls the reference/eval/hot scope |

-### 4.1 A key design point
+### 4.1 How features actually bind to audio_object
+
+This is the most important layer of the current schema:
+
+```text
+feature_fact.object_id -> audio_object.object_id
+```
+
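+Meaning:
+- A `feature_fact` row always corresponds to one specific audio object
+- In the current Phase-1 main chain, that object is a `window` by default
+- So the smallest unit of evidence for a retrieval hit is the `window`, not the whole song and not the whole asset
+
+Tracing further up: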
+
+```text
+feature_fact.object_id -> window.object_id
+window.parent_object_id -> asset.object_id
+window.song_id / feature_fact.song_id -> media_entity.entity_id
+```
+
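+In other words:
+- `object_id` binds the feature to the specific segment of audio
+- `parent_object_id` leads back to the asset that segment belongs to
+- `song_id` gives a fast path back to the song the feature ultimately belongs to
+
+### 4.2 Why `feature_fact` also stores `song_id` redundantly
+
+Because in copyright-protection scenarios the online service ultimately has to return a `song_id` quickly.
+
+So `feature_fact.song_id` is a **deliberately redundant field**, with 3 purposes:
+- Reduce join cost during song-level aggregation after recall
+- Allow coverage audits directly by `song_id + model_name + feature_type`
+- Make it easy to fold `window` hits into song-level evidence later
+
+### 4.3 Why Phase-1 binds features to `window` rather than `asset` by default
+
+Because the Phase-1 goal is not just to know roughly what a piece of audio resembles, but also to preserve:
+- The offset of the hit
+- The specific 5s segment that was hit
+- Parallel exact / semantic evidence over the same time range
+
+The default strategy is therefore:
+- `asset`: holds the original audio file
+- `window`: the smallest unit for retrieval, matching, and traceback
+- `feature_fact`: attached to `window` by default
+
+### 4.4 A minimal chain diagram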
+
+```mermaid
+flowchart LR
+    F[feature_fact
+model_name=mert-v1-95m] --> W[audio_object
+object_type=window]
+    W --> A[audio_object
+object_type=asset]
+    W --> S[media_entity
+entity_type=song]
+    F --> S
+```
+
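+### 4.5 A key design point

 Currently, **model information is not placed in a separate registry table as a default main-chain dependency**; it is stored directly in `feature_fact` first:
 - This keeps Phase-1 lighter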
@@ -610,10 +671,10 @@ flowchart TD

 | lane | model_name | model_version | feature_type | purpose |
 |---|---|---|---|---|
-| exact | `chromaprint` | `1.0` | `fingerprint` | high-precision exact hits |
-| semantic baseline | `mert-v1-95m` | `hf-main` | `embedding` | song semantic baseline |
-| semantic challenger | `muq-base` | `hf-main` | `embedding` | challenger for cover / bgm / complex interference |
-| semantic fallback | `local_wavehash_embed` | `phase1_local` | `embedding` | fallback when the current host lacks the runtime |
+| exact (current live) | `chromaprint_matcher` | `phase1_local` | `fingerprint` | current live exact baseline |
+| semantic baseline (current live) | `mert-v1-95m` | `hf-main` | `embedding` | current live semantic baseline |
+| semantic challenger (planned) | `muq-large-msd-iter` | `hf-main` | `embedding` | next-stage challenger for cover / bgm / complex interference |
+| semantic fallback | `local_wavehash_embed` | `phase1_local` | `embedding` | fallback when the runtime is unavailable |
 | historical baseline | `ecapa-tdnn` | `baseline_only` | `embedding` | historical comparison; not recommended as the Phase-1 lead |

 ### 16.2 Recommended fields for pinning model identity
@@ -635,17 +696,17 @@ flowchart TD
 ```

 For example:
-- `chromaprint_5s_v1`
-- `mert_5s_hop2.5_v1`
-- `muq_5s_hop2.5_v1`
-- `wavehash_5s_hop2.5_v1`
+- `chromaprint_matcher_5s` (current live)
+- `mert_5s_hop2.5_v1` (current live)
+- `muq_5s_hop2.5_v1` (planned)
+- `wavehash_5s_hop2.5_v1` (fallback)

 ### 16.4 Recommended Phase-1 storage rules

 #### exact lane
 - `feature_type = 'fingerprint'`
 - `fingerprint_value` is required
-- `model_name = 'chromaprint'`
+- `model_name = 'chromaprint_matcher'`
 - `embedding_uri / vector_table_name` are empty

 #### semantic lane
@@ -674,7 +735,7 @@ flowchart TD
 3. Cut windows and write `audio_object(window)`
 4. Run `chromaprint` and write `feature_fact(fingerprint)`
 5. Run `mert-v1-95m` and write `feature_fact(embedding)`
-6. Run `muq-base` and write `feature_fact(embedding)`
+6. In the next stage, integrate `muq-large-msd-iter` and write `feature_fact(embedding)`
 7. If the runtime is unavailable, write at least the `local_wavehash_embed` fallback

 This ultimately produces:
@@ -683,7 +744,7 @@ flowchart TD
 the same window
  -> 1 chromaprint fingerprint row
  -> 1 mert embedding row
-  -> 1 muq embedding row
+  -> 1 muq embedding row (once integrated)
  -> (optional) 1 fallback embedding row
 ```

@@ -693,7 +754,87 @@ flowchart TD

 ---

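-## 17. Batch ingestion and index-building strategy for 1M audio files / 300K songs
+## 17. Current live sample: how a feature resolves back to song_id
+
+Below are the actual main-chain conventions in the current PostgreSQL `acr_songcentric_test`:
+
+- When `feature_type = 'fingerprint'`, the current live `model_name = 'chromaprint_matcher'`
+- When `feature_type = 'embedding'`, the current live baseline `model_name = 'mert-v1-95m'`
+- Old placeholder / fallback rows are still visible from historical tests, but they are not the current default baseline
+
+### 17.1 A real manifest sample (before import)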
+
+```json
+{
+  "song": {"biz_key": "song_alpha", "title": "song alpha", "artist_name": "artist a"},
+  "asset": {"storage_uri": ".../clip1.wav", "duration_ms": 8000},
+  "windows": [
+    {
+      "start_ms": 0,
+      "end_ms": 5000,
+      "features": [
+        {
+          "feature_type": "fingerprint",
+          "model_name": "chromaprint_matcher",
+          "model_version": "phase1_local",
+          "feature_set_name": "chromaprint_matcher_5s"
+        },
+        {
+          "feature_type": "embedding",
+          "model_name": "mert-v1-95m",
+          "model_version": "hf-main",
+          "feature_set_name": "mert_5s_hop2.5_v1",
+          "embedding_dim": 768
+        }
+      ]
+    }
+  ]
+}
+```
+
+### 17.2 What the binding result should look like after import
+
+```text
+media_entity(song_alpha)
+  -> audio_object(asset: clip1.wav)
+    -> audio_object(window: 0-5000)
+      -> feature_fact(fingerprint, chromaprint_matcher)
+      -> feature_fact(embedding, mert-v1-95m)
+```
+
+### 17.3 Query which window / asset / song a feature is bound to
+
+```sql
+select ff.feature_id,
+       ff.feature_type,
+       ff.model_name,
+       ff.feature_set_name,
+       w.object_id as window_id,
+       w.start_ms,
+       w.end_ms,
+       a.object_id as asset_id,
+       a.storage_uri,
+       s.entity_id as song_id,
+       s.biz_key
+from feature_fact ff
+join audio_object w
+  on w.object_id = ff.object_id
+ and w.object_type = 'window'
+join audio_object a
+  on a.object_id = w.parent_object_id
+ and a.object_type = 'asset'
+join media_entity s
+  on s.entity_id = ff.song_id
+where ff.feature_id = :feature_id;
+```
+
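+This SQL answers:
+- Which model computed this feature
+- Which window it is bound to
+- Which asset that window belongs to
+- Which `song_id` it should ultimately be attributed to
+
+## 18. Batch ingestion and index-building strategy for 1M audio files / 300K songs

 At the current scale, the most important principle is not to compute every model in one pass, but rather: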

--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
@@ -42,7 +42,7 @@ acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.
 > **The 4-table song-centric schema has been run end to end on live PostgreSQL through the host chain "real directory -> slicing -> exact/semantic feature enrichment -> import -> feature_fact".**

 The most important next step is:
-> **Without breaking this host chain, upgrade the semantic lane from the runtime-aware fallback to real MERT / MuQ adapters.**
+> **Without breaking this host chain, continue extending the semantic lane from the current MERT baseline to the MuQ challenger.**

 ---

@@ -114,7 +114,7 @@ flowchart TD
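 5. Real directory -> manifest -> import has been verified
 6. Real directory -> fingerprint enrichment -> import has been verified
 7. The exact lane now preferentially reuses the in-repo `ChromaprintMatcher`
-8. The semantic lane is runtime-ready; the current host can enter the placeholder runtime branch
+8. The semantic lane is now actually integrated with the `mert-v1-95m` baseline; the live main chain on the current host no longer sits on the placeholder branch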

 ---

@@ -169,48 +169,67 @@ flowchart TD

 ---

-## 10. Where the real semantic adapter should plug in next
-
-The most direct integration point is already clear:
-
-- Entry script: `acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py`
-- Key function: `build_semantic_feature(...)`
-
-### Actual current state
-
-- The exact lane preferentially reuses `ChromaprintMatcher`
-- The semantic lane is not yet actually integrated with `MERT / MuQ`
-- When the runtime is ready, it currently produces:
-  - `model_name = mert-v1-95m`
-- The fallback branch is still kept:
-  - `model_name = local_wavehash_embed`
-
-### Fresh dependency-check facts
-
-The current host still lacks:
-- `torch`
-- `torchaudio`
-- `transformers`
-
-### Most direct implementation order for the next session
-
-1. Install `torch / torchaudio / transformers`
-2. Wire a real `MERT` or `MuQ` adapter inside `build_semantic_feature(...)`
-3. Keep the current `local_wavehash_embed` fallback; do not delete it
-4. Rerun:
-
-```bash
-cd /workspace
-/usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \
-  --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
-  --schema acr_songcentric_test \
-  --input-root acr-engine/data/songcentric_builder_smoke \
-  --output-dir acr-engine/data/pgvector_eval/music20
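+## 10. Data linkage and current live storage facts
+
+There are only 3 binding relationships that matter most right now:
+
+1. `feature_fact.object_id -> audio_object.object_id`
+   - A feature binds to a specific audio object
+   - Phase-1 binds to `window` by default, not directly to the song
+2. `audio_object.parent_object_id -> audio_object.object_id`
+   - The `window -> asset` parent-child traceback chain
+3. `feature_fact.song_id -> media_entity.entity_id`
+   - Used for fast song-level aggregation and for returning the final `song_id`
+
+In one sentence:
+
+> `audio_object` says which piece of audio this is; `feature_fact` says which model encoded that audio into which feature.
+
+### What the current live main chain has actually stored
+
+New live data is now actually stored as:
+- exact: `chromaprint_matcher / phase1_local / chromaprint_matcher_5s`
+- semantic baseline: `mert-v1-95m / hf-main / mert_5s_hop2.5_v1`
+
+Current MuQ status:
+- Target model: `OpenMuQ/MuQ-large-msd-iter`
+- Current blocker: `import muq` raises `RuntimeError: operator torchvision::nms does not exist`
+- Conclusion: MuQ remains the next-stage challenger, not the current live default baseline
+
+### Current manifest shape (before import)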
+
+```json
+{
+  "song": {"biz_key": "song_alpha", "title": "song alpha"},
+  "asset": {"storage_uri": ".../clip1.wav"},
+  "windows": [
+    {
+      "start_ms": 0,
+      "end_ms": 5000,
+      "features": [
+        {
+          "feature_type": "fingerprint",
+          "model_name": "chromaprint_matcher",
+          "feature_set_name": "chromaprint_matcher_5s"
+        },
+        {
+          "feature_type": "embedding",
+          "model_name": "mert-v1-95m",
+          "feature_set_name": "mert_5s_hop2.5_v1",
+          "embedding_dim": 768
+        }
+      ]
+    }
+  ]
+}
 ```

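-### Expected changes in fresh metrics
-
-- `semantic_runtime_available = true`
-- `semantic_runtime_ready_count > 0`
-- `semantic_fallback_count` drops significantly or reaches zero
+### Most direct continuation point for the next session

+1. Do not re-verify whether MERT is integrated; it already is
+2. Go straight to the MuQ `torchvision::nms` compatibility issue
+3. Integrate the `OpenMuQ/MuQ-large-msd-iter` challenger
+4. Rerun the main-chain runner and confirm that every window ends up with all of:
+   - `chromaprint_matcher`
+   - `mert-v1-95m`
+   - `muq-large-msd-iter` (or whatever `model_name` is finally standardized)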
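
As a rough illustration of the check in step 4, the sketch below groups `(window_id, model_name)` pairs and reports which expected models each window still lacks. The function name, the `EXPECTED_MODELS` constant, and the sample rows are hypothetical and only mirror the model names listed above; in practice the pairs would presumably come from a query on `feature_fact`.

```python
from collections import defaultdict

# Hypothetical expected set, copied from the model names in the list above.
EXPECTED_MODELS = {"chromaprint_matcher", "mert-v1-95m", "muq-large-msd-iter"}


def missing_models_by_window(rows, expected=EXPECTED_MODELS):
    """Map window_id -> sorted list of expected models that have no row yet.

    rows: iterable of (window_id, model_name) pairs.
    Windows that already carry every expected model are omitted.
    """
    seen = defaultdict(set)
    for window_id, model_name in rows:
        seen[window_id].add(model_name)
    return {
        window_id: sorted(expected - models)
        for window_id, models in seen.items()
        if expected - models
    }


# Illustrative rows: window 3001 is complete, window 3002 lacks the MuQ embedding.
rows = [
    (3001, "chromaprint_matcher"),
    (3001, "mert-v1-95m"),
    (3001, "muq-large-msd-iter"),
    (3002, "chromaprint_matcher"),
    (3002, "mert-v1-95m"),
]
print(missing_models_by_window(rows))
```

Note that this only sees windows that already have at least one feature row; a window with no features at all would need to be found from `audio_object` instead.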
--- a/docs/start-here.md
+++ b/docs/start-here.md
@@ -78,6 +78,12 @@ song -> asset -> window -> fingerprint / embedding
 | model info | `feature_fact` | `model_name`, `model_version`, `feature_set_name` |
 | reference/eval/hot sets | `set_membership` | `set_type`, `set_name` |

+Additional notes:
+- `feature_fact.object_id -> audio_object.object_id`: a feature binds directly to a specific audio object; Phase-1 binds to `window` by default
+- `audio_object.parent_object_id`: traces a `window` back to its `asset`
+- `feature_fact.song_id -> media_entity.entity_id`: a deliberate redundancy for song-level aggregation and fast `song_id` return
+- If you only want the detailed explanation of this layer, go straight to section 4 of [postgresql-data-model.md](./postgresql-data-model.md) and section 5 of [postgres_db_schema_samples.md](./postgres_db_schema_samples.md).
+
 ---

 ## 5. Current main-chain flow diagram
@@ -99,8 +105,9 @@ flowchart TD
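 - The live PostgreSQL schema has actually been created successfully
 - Real directory -> manifest -> import works end to end
 - Real directory -> fingerprint enrichment -> import works end to end
-- The semantic lane has been made runtime-ready
-- The current host can enter the runtime-ready placeholder branch; the next step is to integrate `MuQ` without breaking the current MERT baseline
+- The semantic lane is now actually integrated with the `mert-v1-95m` baseline
+- The live main chain on the current host already stores `chromaprint_matcher + mert-v1-95m`
+- The next step is to integrate the `MuQ` challenger without breaking the current MERT baseline
 - The exact lane now preferentially reuses the in-repo `ChromaprintMatcher`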

 ---
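
The binding chain this commit documents (`feature_fact.object_id -> window`, `window.parent_object_id -> asset`, `feature_fact.song_id -> song`) can be exercised with a small in-memory sketch. It uses SQLite rather than PostgreSQL and a heavily trimmed, hypothetical version of the three tables, with only the columns the joins need; the IDs, the `clip1.wav` name, and the `trace_feature` helper are illustrative and not taken from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Trimmed, assumed table shapes: only the columns used by the binding joins.
create table media_entity (entity_id integer primary key, entity_type text, biz_key text);
create table audio_object (
    object_id integer primary key, object_type text, song_id integer,
    parent_object_id integer, start_ms integer, end_ms integer, storage_uri text
);
create table feature_fact (
    feature_id integer primary key, feature_type text, object_id integer,
    song_id integer, model_name text
);
insert into media_entity values (1001, 'song', 'song_alpha');
insert into audio_object values (2001, 'asset', 1001, null, null, null, 'clip1.wav');
insert into audio_object values (3001, 'window', 1001, 2001, 0, 5000, null);
insert into feature_fact values (1, 'fingerprint', 3001, 1001, 'chromaprint_matcher');
insert into feature_fact values (2, 'embedding', 3001, 1001, 'mert-v1-95m');
""")


def trace_feature(conn, feature_id):
    """Return (model_name, window_id, asset_id, biz_key) for one feature, or None."""
    return conn.execute(
        """
        select ff.model_name, w.object_id, a.object_id, s.biz_key
        from feature_fact ff
        join audio_object w on w.object_id = ff.object_id and w.object_type = 'window'
        join audio_object a on a.object_id = w.parent_object_id and a.object_type = 'asset'
        join media_entity s on s.entity_id = ff.song_id
        where ff.feature_id = ?
        """,
        (feature_id,),
    ).fetchone()


print(trace_feature(conn, 2))
```

The join mirrors the shape of the binding query added in this commit: the window is reached through `object_id`, the asset through `parent_object_id`, and the song through the redundant `song_id` on `feature_fact` rather than by walking up from the asset.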