
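Song Ingest → Query → Delivery: Complete Operations Manual

This document answers one question: how a song gets from a directory or file into the current system, how it is then queried and evaluated, and which artifacts and verification evidence are delivered at the end.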


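1. Scope

The current repository contains two pipelines that you must keep distinct:

  1. live PostgreSQL song-centric main pipeline

    • Goal: import a real song directory into the current 4-table schema and persist window + fingerprint + embedding to PostgreSQL
    • Entry scripts:
      • acr-engine/scripts/run_songcentric_directory_pipeline_live.py
      • acr-engine/scripts/build_songcentric_manifest_from_directory.py
      • acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
      • acr-engine/scripts/import_songcentric_manifest_live.py
  2. selected20 file-level practical evaluation pipeline

    • Goal: evaluate the song-level hit rate of the current exact / semantic / fused lanes on a 20-song benchmark set
    • Entry script:
      • acr-engine/scripts/evaluate_selected20_songid_retrieval.py

Both pipelines matter:

  • The PostgreSQL main pipeline covers how to ingest, how to bind, and how to trace back to song_id
  • The selected20 evaluation pipeline covers how well the current approach actually performs in practice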

2. Current default data model

Logical semantics:

song -> asset -> window -> fingerprint / embedding

Physical tables:

media_entity -> audio_object -> feature_fact -> set_membership

Binding rules:

  • media_entity: the song entity
  • audio_object(object_type='asset'): the original audio file
  • audio_object(object_type='window'): a sliced window
  • feature_fact.object_id -> audio_object.object_id: binds a feature to its window
  • audio_object.parent_object_id: links a window back to its asset
  • feature_fact.song_id -> media_entity.entity_id: song-level lookup and aggregation
  • set_membership: reference/eval/hot set routing
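
The binding rules above can be illustrated with a minimal in-memory sketch. The class and function names here (AudioObject, FeatureFact, resolve_lineage) are hypothetical and only mirror the column names listed above; the window, asset, and song ids echo the live sample in section 6.4, while feature_id is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioObject:
    object_id: int
    object_type: str                 # 'asset' or 'window'
    parent_object_id: Optional[int]  # set on windows, points at the asset


@dataclass(frozen=True)
class FeatureFact:
    feature_id: int
    object_id: int   # the window this feature was computed on
    song_id: int     # media_entity.entity_id, kept for song-level lookup


def resolve_lineage(feature, objects):
    """Walk feature -> window -> asset and return the ids plus song_id."""
    window = objects[feature.object_id]
    if window.object_type != "window":
        raise ValueError("feature must be bound to a window")
    asset = objects[window.parent_object_id]
    if asset.object_type != "asset":
        raise ValueError("window parent must be an asset")
    return {
        "feature_id": feature.feature_id,
        "window_id": window.object_id,
        "asset_id": asset.object_id,
        "song_id": feature.song_id,
    }


# Hypothetical rows: ids 22 / 20 / 9 echo the live sample, feature_id=1 is invented.
objects = {
    20: AudioObject(20, "asset", None),
    22: AudioObject(22, "window", 20),
}
feature = FeatureFact(feature_id=1, object_id=22, song_id=9)
lineage = resolve_lineage(feature, objects)
```

Keeping song_id directly on the feature row means song-level aggregation does not have to walk the window and asset rows first, which is presumably why the schema stores it redundantly.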

For field-level explanations, see the companion documents:


3. live PostgreSQL song-centric main pipeline: from song directory to database

3.1 Run the whole main pipeline with one command

cd /workspace
/usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \
  --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
  --schema acr_songcentric_test \
  --input-root acr-engine/data/songcentric_builder_smoke \
  --output-dir acr-engine/data/pgvector_eval/music20

Or:

acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'

3.2 What this runner actually does

acr-engine/scripts/run_songcentric_directory_pipeline_live.py runs 3 steps in order:

  1. build_manifest

    • Script: acr-engine/scripts/build_songcentric_manifest_from_directory.py
    • Input: song directory
    • Output: songcentric_pipeline_manifest.jsonl
  2. enrich_features

    • Script: acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
    • Input: manifest
    • Output: songcentric_pipeline_manifest_with_features.jsonl
  3. import_manifest

    • Script: acr-engine/scripts/import_songcentric_manifest_live.py
    • Input: manifest with features
    • Output: songcentric_pipeline_import_report.json

It also generates an aggregate report at the end:

  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_runner_report.json
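
The "run steps in order, then aggregate" pattern described above can be sketched as follows. This is only an illustration of the control flow, not the real runner: run_steps and the report layout are hypothetical, and the per-step detail values are placeholders rather than real script output.

```python
def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first failure."""
    report = {"ok": True, "steps": []}
    for name, step in steps:
        try:
            detail = step()
        except Exception as exc:
            # Record the failure and skip every later step.
            report["steps"].append({"name": name, "ok": False, "error": str(exc)})
            report["ok"] = False
            break
        report["steps"].append({"name": name, "ok": True, "detail": detail})
    return report


# Placeholder callables standing in for the three real scripts.
demo_report = run_steps(
    [
        ("build_manifest", lambda: "manifest written"),
        ("enrich_features", lambda: "features added"),
        ("import_manifest", lambda: "rows imported"),
    ]
)
```

Stopping at the first failure matters here because each step consumes the previous step's output file, so running import_manifest after a failed enrich_features would import a stale or missing manifest.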

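4. Step 1: build manifest (turn the song directory into structured song/asset/window records)

Script:

  • acr-engine/scripts/build_songcentric_manifest_from_directory.py

4.1 Input directory assumptions

The script recursively scans for:

  • .wav
  • .mp3
  • .flac
  • .ogg
  • .m4a

and parses the directory structure as:

  • First-level directory: song_key
  • Second-level directory: artist
  • The file itself: asset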
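
The directory convention above can be sketched as a small path parser. The function name parse_asset_path, the returned field names, and the case-insensitive extension match are assumptions made for illustration; the real script may behave differently for paths that do not follow the two-level layout.

```python
from pathlib import PurePosixPath

AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def parse_asset_path(input_root, file_path):
    """Map <root>/<song_key>/<artist>/<file> to a small record, or None."""
    path = PurePosixPath(file_path)
    if path.suffix.lower() not in AUDIO_EXTENSIONS:
        return None  # not one of the scanned audio formats
    parts = path.relative_to(PurePosixPath(input_root)).parts
    if len(parts) < 3:
        return None  # not enough directory levels to infer song_key and artist
    return {"song_key": parts[0], "artist": parts[1], "asset_file": parts[-1]}


# Hypothetical example path under a hypothetical input root.
record = parse_asset_path("data/in", "data/in/song_alpha/artist_a/take1.wav")
```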

4.2 Key things it does

  1. Infers song.biz_key / title / artist_name
  2. Generates 1 asset record for each audio file
  3. Slices windows using the defaults window_ms=5000 and stride_ms=2500
  4. Attaches a membership to every record:
    • set_type=reference_set
    • set_name=phase1_hot_reference_v1
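
The default slicing in step 3 can be sketched like this. It assumes windows start at 0 ms and that a trailing partial window is dropped; both are assumptions, and the real script may handle offsets and short tails differently.

```python
def slice_windows(duration_ms, window_ms=5000, stride_ms=2500):
    """Return (start_ms, end_ms) pairs for every full window that fits."""
    if window_ms <= 0 or stride_ms <= 0:
        raise ValueError("window_ms and stride_ms must be positive")
    windows = []
    start = 0
    # Assumption: only keep windows that fit entirely inside the audio.
    while start + window_ms <= duration_ms:
        windows.append((start, start + window_ms))
        start += stride_ms
    return windows


# A 10-second file yields three overlapping 5-second windows.
ten_second_windows = slice_windows(10_000)
```

With a stride of half the window length, every instant of audio (apart from the edges) is covered by two windows, which gives a short query clip a better chance of aligning with at least one reference window.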

4.3 Outputs

  • manifest:
    • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl
  • build report:
    • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_build_report.json

4.4 Current fresh evidence

From songcentric_pipeline_runner_report.json:

  • song_count = 2
  • asset_count = 2
  • window_count = 5
  • window_ms = 5000
  • stride_ms = 2500

5. Step 2: enrich features (add exact / semantic features to every window)

Script:

  • acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py

5.1 The exact lane

For each wav window:

  • Prefers the in-repo ChromaprintMatcher
  • Generates:
    • feature_type='fingerprint'
    • model_name='chromaprint_matcher'
    • model_version='phase1_local'
    • feature_set_name='chromaprint_matcher_5s'

It falls back only if matcher extraction fails:

  • model_name='local_wavehash'

5.2 The semantic lane

The semantic path first checks the runtime for:

  • torch
  • torchaudio
  • transformers

When the runtime is available:

  • Uses MERT-v1-95M
  • Written as:
    • feature_type='embedding'
    • model_name='mert-v1-95m'
    • model_version='hf-main'
    • feature_set_name='mert_5s_hop2.5_v1'

If the runtime is unavailable:

  • Falls back to local_wavehash_embed
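
A runtime check of this kind can be sketched with the standard library. The function names are hypothetical, and treating local_wavehash_embed as the value that replaces the model name is an assumption; the returned keys simply reuse the field names that appear in the enrich report.

```python
import importlib.util

REQUIRED_MODULES = ("torch", "torchaudio", "transformers")


def check_semantic_runtime(required=REQUIRED_MODULES):
    """Report which required modules can be located, without importing them."""
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return {
        "semantic_runtime_available": not missing,
        "semantic_runtime_missing": missing,
    }


def pick_semantic_model(runtime):
    """Choose the embedding source based on the runtime check."""
    if runtime["semantic_runtime_available"]:
        return "mert-v1-95m"
    return "local_wavehash_embed"  # assumed fallback identifier


runtime = check_semantic_runtime()
chosen_model = pick_semantic_model(runtime)
```

Using find_spec avoids paying the import cost of torch just to decide which lane to use; the heavy import can then happen only when the MERT path is actually taken.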

5.3 Outputs

  • enriched manifest:
    • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl
  • enrich report:
    • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_enrich_report.json

5.4 Current fresh evidence

From songcentric_pipeline_enrich_report.json:

  • wav_windows_seen = 5
  • features_added = 10
  • matcher_fingerprint_count = 5
  • fallback_fingerprint_count = 0
  • semantic_runtime_available = true
  • semantic_runtime_missing = []
  • semantic_runtime_ready_count = 5
  • semantic_fallback_count = 0

In one sentence:

All 5 current windows genuinely received chromaprint_matcher + mert-v1-95m features, with no fallback.
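
The reading of the report above can be turned into a small consistency check. This sketch assumes the report is a flat JSON object with exactly the keys listed in 5.4, and that each window is expected to carry one fingerprint plus one embedding; find_enrich_problems is a hypothetical helper, not part of the repository.

```python
def find_enrich_problems(report):
    """Return human-readable problems; an empty list means no fallback was used."""
    problems = []
    windows = report["wav_windows_seen"]
    if report["matcher_fingerprint_count"] != windows:
        problems.append("not every window has a matcher fingerprint")
    if report["fallback_fingerprint_count"] != 0:
        problems.append("fingerprint fallback was used")
    if not report["semantic_runtime_available"] or report["semantic_runtime_missing"]:
        problems.append("semantic runtime is incomplete")
    if report["semantic_runtime_ready_count"] != windows:
        problems.append("not every window has a runtime-backed embedding")
    if report["semantic_fallback_count"] != 0:
        problems.append("semantic fallback was used")
    # Assumption: one fingerprint and one embedding per window.
    if report["features_added"] != 2 * windows:
        problems.append("expected one fingerprint and one embedding per window")
    return problems


# Values copied from the fresh evidence in 5.4.
report = {
    "wav_windows_seen": 5,
    "features_added": 10,
    "matcher_fingerprint_count": 5,
    "fallback_fingerprint_count": 0,
    "semantic_runtime_available": True,
    "semantic_runtime_missing": [],
    "semantic_runtime_ready_count": 5,
    "semantic_fallback_count": 0,
}
problems = find_enrich_problems(report)
```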


6. Step 3: import manifest (persist song / asset / window / feature to PostgreSQL)

Script:

  • acr-engine/scripts/import_songcentric_manifest_live.py

6.1 Write order

For each manifest row:

  1. ensure_song()
    • media_entity
  2. ensure_asset()
    • audio_object(object_type='asset')
  3. ensure_window()
    • audio_object(object_type='window')
  4. ensure_feature()
    • feature_fact
  5. ensure_membership()
    • set_membership

6.2 Core bindings it guarantees

  • window binds to asset: parent_object_id
  • feature binds to window: feature_fact.object_id
  • feature belongs to song: feature_fact.song_id
  • membership can bind to an asset or a song

6.3 Outputs

  • import report:
    • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_import_report.json

6.4 Current fresh evidence

From songcentric_pipeline_import_report.json:

  • media_entity = 9
  • audio_object = 22
  • feature_fact = 34
  • set_membership = 9

A live sample that can be verified directly right now:

  • window_id = 22
  • asset_id = 20
  • song_id = 9
  • title = song beta
  • start_ms = 1000
  • end_ms = 6000
  • Corresponding feature:
    • feature_type = embedding
    • model_name = mert-v1-95m
    • model_version = hf-main
    • feature_set_name = mert_5s_hop2.5_v1

This sample proves:

feature -> window -> asset -> song

This trace-back chain is now actually persisted in the database, not just a design on paper.


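7. Query stage: how to get from a matched feature back to song_id

In the current repo, querying should be understood as two layers:

7.1 Trace-back in the PostgreSQL main pipeline

The focus of this layer is not an online ANN service, but:

  • after a feature is matched
  • how to trace back to window / asset / song
  • how to aggregate at the song level

The SQL and samples for this part are already in:

Key sections:

  • How online retrieval gets from a feature back to song_id
  • How the exact + semantic lanes are fused into a song ranking

Shortest executable trace-back SQL (after a feature is matched, go straight back to window / asset / song):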

select ff.feature_id,
       ff.feature_type,
       ff.model_name,
       win.object_id  as window_id,
       win.start_ms,
       win.end_ms,
       ast.object_id  as asset_id,
       ast.storage_uri,
       song.entity_id as song_id,
       song.biz_key,
       song.title
from acr_songcentric_test.feature_fact ff
join acr_songcentric_test.audio_object win
  on win.object_id = ff.object_id and win.object_type = 'window'
join acr_songcentric_test.audio_object ast
  on ast.object_id = win.parent_object_id and ast.object_type = 'asset'
join acr_songcentric_test.media_entity song
  on song.entity_id = ff.song_id and song.entity_type = 'song'
where ff.feature_id = 34;

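To aggregate or fuse directly at the song level, refer back to:

7.2 Querying in the current selected20 file-level evaluation

Script:

  • acr-engine/scripts/evaluate_selected20_songid_retrieval.py

What it does:

  1. Builds the reference from type_11
  2. Uses chromaprint_matcher for exact candidates
  3. Uses MERT embedding for semantic candidates
  4. Ranks fused results by 0.6 * exact + 0.4 * semantic
  5. Checks whether the true song_id is in top1/top3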
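
Steps 4 and 5 above can be sketched as a weighted sum followed by a top-k check. This assumes each lane already yields one score per song on a comparable scale, and that a song missing from a lane contributes 0 there; how the real script normalizes and aggregates scores is not covered here. The names fuse_scores and hit_at_k, and the example scores, are hypothetical.

```python
def fuse_scores(exact, semantic, exact_weight=0.6, semantic_weight=0.4):
    """Combine per-song scores; a song missing from one lane scores 0 there."""
    songs = set(exact) | set(semantic)
    fused = {
        song: exact_weight * exact.get(song, 0.0)
        + semantic_weight * semantic.get(song, 0.0)
        for song in songs
    }
    # Sort by score descending, then by song id for a deterministic order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


def hit_at_k(ranking, true_song_id, k):
    """True if the correct song appears among the first k ranked songs."""
    return true_song_id in [song for song, _ in ranking[:k]]


# Made-up lane scores for three songs.
exact = {"song_a": 0.9, "song_b": 0.2}
semantic = {"song_b": 0.8, "song_c": 0.7}
ranking = fuse_scores(exact, semantic)
```

In this example song_b is only second in the fused ranking, so it counts as a top3 hit but not a top1 hit; averaging such hits over all queries gives the top1 / top3 rates the evaluation reports.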

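This pipeline is mainly used to:

  • Quickly check practical song-level accuracy
  • Serve as the regression baseline for the upcoming MuQ challenger / fusion strategies

8. selected20 practical evaluation: how to run it, where the artifacts are, and current results

8.1 Reproduction command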

cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/evaluate_selected20_songid_retrieval.py \
  --downloads-dir /root/hikoon_song_files/output/selected_20_songs/downloads \
  --reference-type 11 \
  --query-types 1 7 12 16 \
  --duration 8.0 \
  --topk 3 \
  --exact-weight 0.6 \
  --semantic-weight 0.4 \
  --output-json /workspace/acr-engine/data/local_eval/selected20_songid_eval_report.json \
  --output-md /workspace/docs/selected20_songid_eval.md

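8.2 Outputs

  • JSON: acr-engine/data/local_eval/selected20_songid_eval_report.json
  • Markdown summary: docs/selected20_songid_eval.md
  • Reproduction guide: docs/selected20_songid_eval_repro.md

8.3 Current measured results

Total queries: 123

lane      top1    top3
exact     0.6016  0.8130
semantic  0.4715  0.6016
fused     0.6341  0.8537

By type:

  • type_1: strongest, already maxed out
  • type_12: also strong; the semantic lane alone performs best
  • type_7: one of the weak spots
  • type_16: one of the weak spots

The most important takeaway right now:

The fused lane already outperforms exact alone and semantic alone, but type_7 / type_16 remain the main difficulties.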


9. Files to take with you at delivery

9.1 Main pipeline

  • docs/postgresql-data-model.md
  • docs/postgres_db_schema_samples.md
  • docs/song-ingest-query-delivery.md
  • docs/session-handoff.md
  • docs/start-here.md
  • docs/README.md

9.2 selected20

  • docs/selected20_songid_eval.md
  • docs/selected20_songid_eval_repro.md
  • acr-engine/data/local_eval/selected20_songid_eval_report.json

9.3 PostgreSQL live artifacts

  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl
  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_build_report.json
  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl
  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_enrich_report.json
  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_import_report.json
  • acr-engine/data/pgvector_eval/music20/songcentric_pipeline_runner_report.json

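10. Minimum delivery checklist

10.1 If you are delivering the ingest pipeline

At minimum, confirm:

  • the schema has been created
  • the runner is executable
  • manifest / enriched manifest / import report / runner report have all been generated
  • the live lineage sample feature -> window -> asset -> song can be traced back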
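
The "all artifacts generated" item can be checked mechanically. This is a minimal sketch: missing_artifacts is a hypothetical helper, the file names are the ones listed in this document, and treating an empty file as missing is an assumption.

```python
from pathlib import Path

EXPECTED_ARTIFACTS = (
    "songcentric_pipeline_manifest.jsonl",
    "songcentric_pipeline_manifest_with_features.jsonl",
    "songcentric_pipeline_import_report.json",
    "songcentric_pipeline_runner_report.json",
)


def missing_artifacts(output_dir, expected=EXPECTED_ARTIFACTS):
    """Return expected file names that are absent or empty in output_dir."""
    root = Path(output_dir)
    missing = []
    for name in expected:
        path = root / name
        # Assumption: a zero-byte file counts as not generated.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_artifacts("acr-engine/data/pgvector_eval/music20") from the repository root should return an empty list after a successful run; any names it returns point at the step that did not finish.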

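10.2 If you are delivering the evaluation results

At minimum, confirm:

  • the selected20 command is reproducible
  • the JSON report exists
  • the Markdown summary exists
  • overall / per-type metrics are written into the delivery notes

10.3 If you are delivering a handoff the next person can pick up

At minimum, confirm:

  • README.md has an entry point
  • start-here.md has the shortest command
  • session-handoff.md clearly states the current status
  • CHANGELOG.md records this delivery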

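11. Most pragmatic next steps

  1. Do not roll back the 4-table song-centric main pipeline
  2. Keep selected20 as the small-sample regression baseline
  3. After the MuQ challenger is integrated, re-run selected20 right away
  4. Watch closely whether type_7 / type_16 improve
  5. Keep the feature -> window -> asset -> song traceability of the PostgreSQL main pipeline intact

12. Related documents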