Commit 5869c876 5869c87636057d951afc3e0ac8cdddbc0c96fd05 by cnb.bofCdSsphPA

Why the retrieval docs need the online song backtrace made explicit

Constraint: New engineers need a direct feature_fact-to-song_id query path on the current 4-table schema without reconstructing it from scattered examples
Rejected: Leave only insert-side diagrams | does not explain how online recall returns song ownership evidence
Confidence: high
Scope-risk: narrow
Directive: Keep query-path docs aligned with the feature_fact -> window -> asset -> song chain when adding new retrieval lanes
Tested: markdown link check on /workspace/docs after adding retrieval flow diagrams and SQL templates
Not-tested: No live database rerun; this change only documents the already-verified schema path
1 parent 38b37e08
@@ -4,6 +4,7 @@
- Narrowed `docs/` down to the current song-centric mainline: kept only the six core documents `README / start-here / session-handoff / postgresql-data-model / postgres_db_schema_samples / CHANGELOG` and deleted the old v2 / planner-worker / registry extension docs, so that new team members are not led into designs that have been demoted to a secondary track.
- Rewrote `docs/postgresql-data-model.md` to spell out which tables store `slice data + model + feature`: a `window` is stored in `audio_object`, model identity is stored in `feature_fact.model_name/model_version/feature_set_name`, and the concrete `fingerprint/embedding` values are also stored uniformly in `feature_fact`.
- Rewrote `docs/postgres_db_schema_samples.md` and the entry-point docs, adding flow diagrams for the current 4-table main chain, typical SQL samples, the query backtrace path, and the write order, and aligned all docs on `media_entity -> audio_object -> feature_fact -> set_membership`.
- Further strengthened the online retrieval notes: added `feature_fact -> window -> asset -> song_id` backtrace flow diagrams and song-level aggregation SQL templates to `docs/postgresql-data-model.md` and `docs/postgres_db_schema_samples.md`, so engineers can implement post-recall attribution directly against the current schema.

## 2026-06-04

......
@@ -381,3 +381,59 @@ song(Song Alpha)
- [start-here.md](./start-here.md)
- [session-handoff.md](./session-handoff.md)
- [postgresql-data-model.md](./postgresql-data-model.md)

---

## 11. Online retrieval backtrace samples

### 11.1 Tracing a matched feature back to its song

```mermaid
flowchart LR
    A[feature_fact] --> B[window]
    B --> C[asset]
    C --> D[song]
```

### 11.2 Typical online query SQL

```sql
-- Trace one matched feature_fact row back to its window, asset, and song.
select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       ff.feature_set_name,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       a.object_id as asset_id,
       a.storage_uri,
       s.entity_id as song_id,
       s.title,
       s.artist_name
from feature_fact ff
-- the window the feature was extracted from
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
-- the asset (source file) the window was cut from
join audio_object a
  on a.object_id = w.parent_object_id
 and a.object_type = 'asset'
-- the owning song, joined directly through feature_fact.song_id
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = :feature_id;
```
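
To make the backtrace above easy to try without a live PostgreSQL instance, here is a minimal sketch that runs the same join chain against an in-memory SQLite database. The toy schema is an assumption: it contains only the columns the query touches, the column types and keys are guessed, and the sample rows (IDs, `storage_uri`, model and feature-set names) are made up for illustration. The helper names `build_demo_db` and `backtrace` are hypothetical.

```python
import sqlite3

# Toy schema: only the columns used by the backtrace query.
# Types and keys are assumptions, not the real DDL.
SCHEMA = """
create table media_entity (
    entity_id integer primary key, entity_type text, title text, artist_name text);
create table audio_object (
    object_id integer primary key, object_type text, parent_object_id integer,
    storage_uri text, start_ms integer, end_ms integer);
create table feature_fact (
    feature_id integer primary key, object_id integer, song_id integer,
    feature_type text, model_name text, feature_set_name text);
"""

BACKTRACE_SQL = """
select ff.feature_id,
       w.object_id as window_id, w.start_ms, w.end_ms,
       a.object_id as asset_id, a.storage_uri,
       s.entity_id as song_id, s.title, s.artist_name
from feature_fact ff
join audio_object w on w.object_id = ff.object_id and w.object_type = 'window'
join audio_object a on a.object_id = w.parent_object_id and a.object_type = 'asset'
join media_entity s on s.entity_id = ff.song_id
where ff.feature_id = :feature_id
"""


def build_demo_db():
    """Create an in-memory database with one song, one asset, one window, one feature."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    conn.execute("insert into media_entity values (1, 'song', 'Song Alpha', 'Artist A')")
    conn.execute("insert into audio_object values (10, 'asset', null, 's3://demo/alpha.wav', null, null)")
    conn.execute("insert into audio_object values (11, 'window', 10, null, 5000, 10000)")
    conn.execute("insert into feature_fact values (100, 11, 1, 'fingerprint', 'demo-model', 'demo-set')")
    return conn


def backtrace(conn, feature_id):
    """Return the window/asset/song evidence for one feature_id, or None if nothing joins."""
    row = conn.execute(BACKTRACE_SQL, {"feature_id": feature_id}).fetchone()
    return dict(row) if row is not None else None


demo_hit = backtrace(build_demo_db(), 100)
print(demo_hit)
```

Because every join is an inner join, a feature whose window or asset row is missing returns no row at all rather than a partial result; callers that need partial evidence would have to switch to left joins.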

### 11.3 Typical song-level aggregation SQL

```sql
-- Roll matched features up to song level and return the top 20 songs.
select ff.song_id,
       s.title,
       s.artist_name,
       count(*) as matched_windows  -- one row per matched feature_fact
from feature_fact ff
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = any(:matched_feature_ids)
group by ff.song_id, s.title, s.artist_name
order by matched_windows desc
limit 20;
```
......
@@ -288,3 +288,103 @@ Phase-1 does not require for now:
- [start-here.md](./start-here.md)
- [session-handoff.md](./session-handoff.md)
- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)

---

## 13. How online retrieval gets from a feature back to `song_id`

This is the backtrace chain engineers most need to keep in mind right now:

```text
feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)
```

### 13.1 Online retrieval flow

```mermaid
flowchart LR
    Q[query audio] --> QW[query windows]
    QW --> QE[query fingerprint / embedding]
    QE --> FF[feature_fact]
    FF --> W[audio_object\nobject_type=window]
    W --> A[audio_object\nobject_type=asset]
    A --> S[media_entity\nentity_type=song]
    S --> R[return song_id + title + artist + evidence]
```
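
The last node of the flow returns `song_id + title + artist + evidence`. As a rough illustration of what that payload could look like in application code, the sketch below folds backtrace rows into one result object per song. The class names `Evidence` and `SongHit`, the function `to_song_hit`, and the exact field set are assumptions for illustration; the source does not define a response format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """One matched feature with the window and asset it traces back to (hypothetical shape)."""
    feature_id: int
    window_id: int
    asset_id: int
    start_ms: int
    end_ms: int


@dataclass
class SongHit:
    """Song-level result: identity plus the evidence rows that support it (hypothetical shape)."""
    song_id: int
    title: str
    artist_name: str
    evidence: List[Evidence] = field(default_factory=list)


def to_song_hit(rows):
    """Fold backtrace rows (dicts) that all belong to one song into a single SongHit."""
    if not rows:
        raise ValueError("no backtrace rows")
    song_ids = {row["song_id"] for row in rows}
    if len(song_ids) != 1:
        raise ValueError(f"rows span several songs: {sorted(song_ids)}")
    first = rows[0]
    hit = SongHit(first["song_id"], first["title"], first["artist_name"])
    for row in rows:
        hit.evidence.append(
            Evidence(row["feature_id"], row["window_id"], row["asset_id"],
                     row["start_ms"], row["end_ms"])
        )
    return hit


demo_rows = [
    {"feature_id": 100, "window_id": 11, "asset_id": 10, "start_ms": 0, "end_ms": 5000,
     "song_id": 1, "title": "Song Alpha", "artist_name": "Artist A"},
    {"feature_id": 101, "window_id": 12, "asset_id": 10, "start_ms": 5000, "end_ms": 10000,
     "song_id": 1, "title": "Song Alpha", "artist_name": "Artist A"},
]
demo_song_hit = to_song_hit(demo_rows)
print(demo_song_hit.song_id, len(demo_song_hit.evidence))
```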

### 13.2 Aggregation flow

```mermaid
flowchart TD
    A[query window features] --> B[multiple feature_fact rows matched]
    B --> C[look up window]
    C --> D[look up asset]
    D --> E[aggregate to song_id]
    E --> F[rank by hit_count / score / offset coverage]
    F --> G[return topK songs]
```
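
The ranking step in the flow above names three signals: `hit_count`, `score`, and offset coverage. The following sketch shows one way to compute them in application code once each hit has been traced back to a `song_id`. Several things here are assumptions: the per-hit `score` is taken to come from the similarity search rather than from any table column, "offset coverage" is interpreted as the total milliseconds covered by the union of matched windows, and the sort priority (hit count, then score, then coverage) is an arbitrary choice. The function names `covered_ms` and `rank_songs` are hypothetical.

```python
from collections import defaultdict


def covered_ms(intervals):
    """Total milliseconds covered by the union of (start_ms, end_ms) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged run.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def rank_songs(hits, top_k=20):
    """Aggregate per-window hits to song level and return the top_k songs."""
    per_song = defaultdict(lambda: {"hit_count": 0, "score": 0.0, "intervals": []})
    for hit in hits:
        agg = per_song[hit["song_id"]]
        agg["hit_count"] += 1
        agg["score"] += hit["score"]
        agg["intervals"].append((hit["start_ms"], hit["end_ms"]))

    ranked = [
        {
            "song_id": song_id,
            "hit_count": agg["hit_count"],
            "score": agg["score"],
            "coverage_ms": covered_ms(agg["intervals"]),
        }
        for song_id, agg in per_song.items()
    ]
    # Assumed priority: more hits first, then higher summed score, then wider coverage.
    ranked.sort(key=lambda r: (-r["hit_count"], -r["score"], -r["coverage_ms"]))
    return ranked[:top_k]


demo_hits = [
    {"song_id": 1, "score": 0.9, "start_ms": 0, "end_ms": 5000},
    {"song_id": 1, "score": 0.8, "start_ms": 2500, "end_ms": 7500},
    {"song_id": 2, "score": 0.95, "start_ms": 0, "end_ms": 5000},
]
demo_ranking = rank_songs(demo_hits)
print(demo_ranking)
```

Merging intervals before summing keeps overlapping windows from inflating coverage, which matters when windows are cut with a hop smaller than the window length.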

### 13.3 Minimal query SQL template

```sql
-- Trace one matched feature_fact row back to its window, asset, and song.
select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       ff.model_version,
       ff.feature_set_name,
       w.object_id as window_id,
       w.start_ms,
       w.end_ms,
       a.object_id as asset_id,
       a.storage_uri,
       s.entity_id as song_id,
       s.title,
       s.artist_name
from feature_fact ff
-- the window the feature was extracted from
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
-- the asset (source file) the window was cut from
join audio_object a
  on a.object_id = w.parent_object_id
 and a.object_type = 'asset'
-- the owning song, joined directly through feature_fact.song_id
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_id = :feature_id;
```

### 13.4 A song-level aggregation SQL template

```sql
-- Roll matched features up to song level, restricted to one feature lane.
select ff.song_id,
       s.title,
       s.artist_name,
       count(*) as matched_windows,      -- one row per matched feature_fact
       min(w.start_ms) as first_hit_ms,  -- earliest matched offset in the song
       max(w.end_ms) as last_hit_ms      -- latest matched offset in the song
from feature_fact ff
join audio_object w
  on w.object_id = ff.object_id
 and w.object_type = 'window'
join media_entity s
  on s.entity_id = ff.song_id
where ff.feature_type = :feature_type
  and ff.model_name = :model_name
  and ff.feature_set_name = :feature_set_name
  and ff.feature_id = any(:matched_feature_ids)
group by ff.song_id, s.title, s.artist_name
order by matched_windows desc, first_hit_ms asc
limit 20;
```
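
The aggregation template above can be checked end to end with a small sketch on an in-memory SQLite database. This is not the production query path: SQLite has no `= any(:array)`, so the sketch expands the ID list into an `in (?, ?, ...)` placeholder list instead. The toy schema, the sample rows, and the helper names `build_demo_db` and `aggregate_songs` are assumptions for illustration.

```python
import sqlite3

# Toy schema with only the columns the aggregation touches (types are assumptions).
SCHEMA = """
create table media_entity (entity_id integer primary key, title text, artist_name text);
create table audio_object (
    object_id integer primary key, object_type text, start_ms integer, end_ms integer);
create table feature_fact (
    feature_id integer primary key, object_id integer, song_id integer,
    feature_type text, model_name text, feature_set_name text);
"""


def build_demo_db():
    """Two songs; song 1 has two matching windows, song 2 has one plus an off-lane feature."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into media_entity values (?, ?, ?)",
                     [(1, "Song Alpha", "Artist A"), (2, "Song Beta", "Artist B")])
    conn.executemany("insert into audio_object values (?, ?, ?, ?)",
                     [(11, "window", 0, 5000), (12, "window", 5000, 10000),
                      (21, "window", 0, 5000)])
    conn.executemany("insert into feature_fact values (?, ?, ?, ?, ?, ?)",
                     [(100, 11, 1, "fingerprint", "demo-model", "demo-set"),
                      (101, 12, 1, "fingerprint", "demo-model", "demo-set"),
                      (200, 21, 2, "fingerprint", "demo-model", "demo-set"),
                      (300, 21, 2, "embedding", "other-model", "demo-set")])
    return conn


def aggregate_songs(conn, feature_type, model_name, feature_set_name,
                    matched_feature_ids, limit=20):
    """Song-level rollup of matched feature IDs within a single feature lane."""
    if not matched_feature_ids:
        return []
    # One "?" per ID; values are still bound as parameters, never interpolated.
    placeholders = ", ".join("?" for _ in matched_feature_ids)
    sql = f"""
        select ff.song_id, s.title, s.artist_name,
               count(*) as matched_windows,
               min(w.start_ms) as first_hit_ms,
               max(w.end_ms) as last_hit_ms
        from feature_fact ff
        join audio_object w on w.object_id = ff.object_id and w.object_type = 'window'
        join media_entity s on s.entity_id = ff.song_id
        where ff.feature_type = ? and ff.model_name = ? and ff.feature_set_name = ?
          and ff.feature_id in ({placeholders})
        group by ff.song_id, s.title, s.artist_name
        order by matched_windows desc, first_hit_ms asc
        limit ?
    """
    params = [feature_type, model_name, feature_set_name, *matched_feature_ids, limit]
    return conn.execute(sql, params).fetchall()


demo_top = aggregate_songs(build_demo_db(), "fingerprint", "demo-model", "demo-set",
                           [100, 101, 200, 300])
print(demo_top)
```

Feature 300 is in the matched ID list but belongs to a different `feature_type` and `model_name`, so the lane filter drops it; this is the reason the template filters on model identity in addition to the ID list.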
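
### 13.5 Why this chain matters

Because it cleanly separates three concerns:
- `feature_fact` answers: **which feature was matched**
- `audio_object(window/asset)` answers: **which segment was matched, and which file it came from**
- `media_entity(song)` answers: **which `song_id` the match is finally attributed to**

So even without introducing the more complex `recording/work/version` layers, Phase-1 is already enough to support:
- copyright-protection attribution
- segment/BGM localization
- evidence lookup
- topK song-level recall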
......