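Song Ingest → Query → Delivery: Complete Operations Manual
This document answers one question: how songs enter the current system from directories/files, how they are then queried and evaluated, and which artifacts and verification evidence are finally delivered.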
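1. Scope
The repository currently contains two pipelines that you must keep apart:
- live PostgreSQL song-centric main pipeline
  - Goal: import a real song directory into the current 4-table schema and persist window + fingerprint + embedding to PostgreSQL
  - Entry scripts:
    - acr-engine/scripts/run_songcentric_directory_pipeline_live.py
    - acr-engine/scripts/build_songcentric_manifest_from_directory.py
    - acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
    - acr-engine/scripts/import_songcentric_manifest_live.py
- selected20 file-level practical evaluation pipeline
  - Goal: measure the current song-level hit rate of exact / semantic / fused on the 20-song set
  - Entry script:
    - acr-engine/scripts/evaluate_selected20_songid_retrieval.py
Both pipelines matter:
- The PostgreSQL main pipeline covers how songs are ingested, how records are bound together, and how a hit is traced back to song_id
- The selected20 evaluation pipeline covers how well the current approach actually performs in practice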
2. Current default data model
Logical semantics:
song -> asset -> window -> fingerprint / embedding
Physical tables:
media_entity -> audio_object -> feature_fact -> set_membership
Binding rules:
- media_entity: the song entity
- audio_object(object_type='asset'): the original audio file
- audio_object(object_type='window'): a sliced window
- feature_fact.object_id -> audio_object.object_id: binds a feature to its window
- audio_object.parent_object_id: links a window back to its asset
- feature_fact.song_id -> media_entity.entity_id: song-level lookup and aggregation
- set_membership: reference/eval/hot set routing
For field-level explanations, see the companion documentation.
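The binding rules can be made concrete with a small in-memory model. This is a minimal sketch, not the repository's code: the dataclasses carry only the columns named in this manual, trace_lineage is a hypothetical helper, and every other column of the real tables is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MediaEntity:
    entity_id: int
    entity_type: str  # 'song'
    title: str


@dataclass(frozen=True)
class AudioObject:
    object_id: int
    object_type: str  # 'asset' or 'window'
    parent_object_id: Optional[int] = None  # set on windows; points at the asset


@dataclass(frozen=True)
class FeatureFact:
    feature_id: int
    feature_type: str  # 'fingerprint' or 'embedding'
    object_id: int  # the window the feature was computed from
    song_id: int  # the owning song


def trace_lineage(
    feature: FeatureFact,
    objects: Dict[int, AudioObject],
    songs: Dict[int, MediaEntity],
) -> Dict[str, int]:
    """Walk feature -> window -> asset -> song using the binding rules."""
    window = objects[feature.object_id]
    if window.object_type != "window" or window.parent_object_id is None:
        raise ValueError("feature_fact.object_id must point at a window with a parent")
    asset = objects[window.parent_object_id]
    if asset.object_type != "asset":
        raise ValueError("window.parent_object_id must point at an asset")
    song = songs[feature.song_id]
    return {
        "feature_id": feature.feature_id,
        "window_id": window.object_id,
        "asset_id": asset.object_id,
        "song_id": song.entity_id,
    }
```

Because feature_fact stores song_id directly, the song can be reached without going through the asset; the window and asset hops are only needed when the position in the audio matters.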
3. Live PostgreSQL song-centric main pipeline: from song directory to database
3.1 Run the whole main pipeline with one command
cd /workspace
/usr/local/miniconda3/bin/python acr-engine/scripts/run_songcentric_directory_pipeline_live.py \
--dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
--schema acr_songcentric_test \
--input-root acr-engine/data/songcentric_builder_smoke \
--output-dir acr-engine/data/pgvector_eval/music20
Or:
acr-engine/scripts/start_songcentric_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'
3.2 What the runner actually does
acr-engine/scripts/run_songcentric_directory_pipeline_live.py runs 3 steps in order:
- build_manifest
  - Script: acr-engine/scripts/build_songcentric_manifest_from_directory.py
  - Input: song directory
  - Output: songcentric_pipeline_manifest.jsonl
- enrich_features
  - Script: acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
  - Input: manifest
  - Output: songcentric_pipeline_manifest_with_features.jsonl
- import_manifest
  - Script: acr-engine/scripts/import_songcentric_manifest_live.py
  - Input: manifest with features
  - Output: songcentric_pipeline_import_report.json
It also writes an aggregate report at the end:
acr-engine/data/pgvector_eval/music20/songcentric_pipeline_runner_report.json
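4. Step 1: build manifest (turn the song directory into structured song/asset/window records)
Script:
acr-engine/scripts/build_songcentric_manifest_from_directory.py
4.1 Input directory assumptions
The script recursively scans for:
- .wav
- .mp3
- .flac
- .ogg
- .m4a
and parses the directory structure as:
- first-level directory: song_key
- second-level directory: artist
- the file itself: asset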
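The directory convention above can be sketched as follows. This is an illustrative reading of the layout, not the script's actual code: scan_song_directory and the record keys are hypothetical names, and skipping files that are not nested under song_key/artist/ is an assumption.

```python
from pathlib import Path
from typing import Dict, List

AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def scan_song_directory(input_root: Path) -> List[Dict[str, str]]:
    """Recursively collect audio files and read song_key / artist from the path."""
    records = []
    for path in sorted(input_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        parts = path.relative_to(input_root).parts
        if len(parts) < 3:
            # Assumption: a file must sit under <song_key>/<artist>/ to be usable.
            continue
        records.append(
            {
                "song_key": parts[0],  # first-level directory
                "artist": parts[1],  # second-level directory
                "asset_path": path.as_posix(),  # the file itself
            }
        )
    return records
```

With a tree such as input_root/song_alpha/artist_x/take1.wav, this yields one record with song_key "song_alpha" and artist "artist_x".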
4.2 What it does
- Infers song.biz_key / title / artist_name
- Creates 1 asset record per audio file
- Slices windows with the defaults window_ms=5000 and stride_ms=2500
- Attaches a membership to each record:
  - set_type=reference_set
  - set_name=phase1_hot_reference_v1
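The default slicing can be sketched as a sliding window. How the real script treats the end of a file is not stated here, so the align_tail option below is an assumption: it adds one last window flush with the end of the audio, which would explain a span like start_ms=1000, end_ms=6000 in the live sample of section 6.4, but that reading is a guess. slice_windows is a hypothetical name.

```python
from typing import List, Tuple


def slice_windows(
    duration_ms: int,
    window_ms: int = 5000,
    stride_ms: int = 2500,
    align_tail: bool = True,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans covering an asset of duration_ms."""
    if window_ms <= 0 or stride_ms <= 0:
        raise ValueError("window_ms and stride_ms must be positive")
    windows = []
    start = 0
    # Regular grid: every stride_ms, as long as a full window still fits.
    while start + window_ms <= duration_ms:
        windows.append((start, start + window_ms))
        start += stride_ms
    # Assumed tail policy: cover the remaining audio with one end-aligned window.
    if align_tail and windows and windows[-1][1] < duration_ms:
        windows.append((duration_ms - window_ms, duration_ms))
    return windows
```

Under these assumptions a 10-second asset gives three windows and a 6-second asset gives two; assets shorter than one window give none.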
4.3 Outputs
- manifest: acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl
- build report: acr-engine/data/pgvector_eval/music20/songcentric_pipeline_build_report.json
4.4 Current fresh evidence
From songcentric_pipeline_runner_report.json:
- song_count = 2
- asset_count = 2
- window_count = 5
- window_ms = 5000
- stride_ms = 2500
5. Step 2: enrich features (add exact / semantic features to every window)
Script:
acr-engine/scripts/enrich_songcentric_manifest_with_local_features.py
5.1 The exact lane
For each wav window:
- Prefer the in-repo ChromaprintMatcher
- Produce:
  - feature_type='fingerprint'
  - model_name='chromaprint_matcher'
  - model_version='phase1_local'
  - feature_set_name='chromaprint_matcher_5s'
Only if matcher extraction fails does it fall back to:
- model_name='local_wavehash'
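The prefer-then-fall-back behaviour of the exact lane might look like the sketch below. The two extractor callables are placeholders rather than the real ChromaprintMatcher API, signalling failure by raising an exception is an assumption, and the fallback record carries only the model_name given in this manual.

```python
from typing import Any, Callable, Dict


def extract_exact_feature(
    window_audio: bytes,
    primary: Callable[[bytes], Any],
    fallback: Callable[[bytes], Any],
) -> Dict[str, Any]:
    """Try the matcher first; use the fallback only when the matcher fails."""
    try:
        payload = primary(window_audio)
    except Exception:
        # Assumption: any extraction error triggers the fallback lane.
        return {
            "feature_type": "fingerprint",
            "model_name": "local_wavehash",
            "payload": fallback(window_audio),
        }
    return {
        "feature_type": "fingerprint",
        "model_name": "chromaprint_matcher",
        "model_version": "phase1_local",
        "feature_set_name": "chromaprint_matcher_5s",
        "payload": payload,
    }
```

Recording model_name on every feature is what lets a report count matcher fingerprints and fallback fingerprints separately.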
5.2 The semantic lane
The semantic path first checks the runtime for:
- torch
- torchaudio
- transformers
When the runtime is available:
- Use MERT-v1-95M
- Write:
  - feature_type='embedding'
  - model_name='mert-v1-95m'
  - model_version='hf-main'
  - feature_set_name='mert_5s_hop2.5_v1'
When the runtime is not available:
- Fall back to local_wavehash_embed
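A runtime check of this kind can be done without importing the heavy packages. The sketch below uses importlib.util.find_spec from the standard library; the function names are hypothetical, and the returned keys simply mirror the semantic_runtime_available / semantic_runtime_missing fields that appear in the enrich report, without claiming the script computes them this way.

```python
import importlib.util
from typing import Any, Dict, Sequence

REQUIRED_MODULES = ("torch", "torchaudio", "transformers")


def check_semantic_runtime(
    modules: Sequence[str] = REQUIRED_MODULES,
) -> Dict[str, Any]:
    """Report which top-level modules are importable, without importing them."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    return {
        "semantic_runtime_available": not missing,
        "semantic_runtime_missing": missing,
    }


def choose_semantic_model(status: Dict[str, Any]) -> str:
    """Pick the embedding model name based on the runtime status."""
    if status["semantic_runtime_available"]:
        return "mert-v1-95m"
    return "local_wavehash_embed"
```

Typical use is choose_semantic_model(check_semantic_runtime()), evaluated once before the windows are processed.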
5.3 Outputs
- enriched manifest: acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl
- enrich report: acr-engine/data/pgvector_eval/music20/songcentric_pipeline_enrich_report.json
5.4 Current fresh evidence
From songcentric_pipeline_enrich_report.json:
- wav_windows_seen = 5
- features_added = 10
- matcher_fingerprint_count = 5
- fallback_fingerprint_count = 0
- semantic_runtime_available = true
- semantic_runtime_missing = []
- semantic_runtime_ready_count = 5
- semantic_fallback_count = 0
In one sentence:
All 5 current windows received real chromaprint_matcher + mert-v1-95m features (5 fingerprints + 5 embeddings = 10 features), and none used a fallback.
6. Step 3: import manifest (persist song / asset / window / feature to PostgreSQL)
Script:
acr-engine/scripts/import_songcentric_manifest_live.py
6.1 Write order
For each manifest row:
- ensure_song(): writes media_entity
- ensure_asset(): writes audio_object(object_type='asset')
- ensure_window(): writes audio_object(object_type='window')
- ensure_feature(): writes feature_fact
- ensure_membership(): writes set_membership
6.2 Core bindings it guarantees
- window binds to asset: parent_object_id
- feature binds to window: feature_fact.object_id
- feature belongs to song: feature_fact.song_id
- membership can bind to asset or song
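The write order and bindings can be tried out locally. The sketch below swaps PostgreSQL for an in-memory SQLite database so it runs anywhere, keeps only the columns named in this manual, and assumes that each ensure step means "look the row up, insert it only if absent". ensure_row and import_row are hypothetical helpers; set_membership is omitted because its columns are not shown here.

```python
import sqlite3
from typing import Any, Dict

DDL = """
CREATE TABLE media_entity (
    entity_id INTEGER PRIMARY KEY, entity_type TEXT, biz_key TEXT, title TEXT);
CREATE TABLE audio_object (
    object_id INTEGER PRIMARY KEY, object_type TEXT, parent_object_id INTEGER,
    storage_uri TEXT, start_ms INTEGER, end_ms INTEGER);
CREATE TABLE feature_fact (
    feature_id INTEGER PRIMARY KEY, feature_type TEXT, model_name TEXT,
    object_id INTEGER, song_id INTEGER);
"""


def ensure_row(conn: sqlite3.Connection, table: str, id_column: str,
               values: Dict[str, Any]) -> int:
    """Return the id of the row matching values, inserting it first if absent."""
    # Table and column names come from this module only, never from user input.
    where = " AND ".join(f"{column} IS ?" for column in values)
    params = tuple(values.values())
    found = conn.execute(
        f"SELECT {id_column} FROM {table} WHERE {where}", params
    ).fetchone()
    if found:
        return found[0]
    columns = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    cursor = conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({marks})", params
    )
    return cursor.lastrowid


def import_row(conn: sqlite3.Connection, row: Dict[str, Any]) -> Dict[str, int]:
    """Write song, asset, window, feature in that order and return their ids."""
    song_id = ensure_row(conn, "media_entity", "entity_id", {
        "entity_type": "song", "biz_key": row["biz_key"], "title": row["title"]})
    asset_id = ensure_row(conn, "audio_object", "object_id", {
        "object_type": "asset", "storage_uri": row["storage_uri"]})
    window_id = ensure_row(conn, "audio_object", "object_id", {
        "object_type": "window", "parent_object_id": asset_id,
        "start_ms": row["start_ms"], "end_ms": row["end_ms"]})
    feature_id = ensure_row(conn, "feature_fact", "feature_id", {
        "feature_type": row["feature_type"], "model_name": row["model_name"],
        "object_id": window_id, "song_id": song_id})
    return {"song_id": song_id, "asset_id": asset_id,
            "window_id": window_id, "feature_id": feature_id}
```

The order matters because each later row needs an id produced by an earlier one: the window needs asset_id, and the feature needs both window_id and song_id. Importing the same row twice returns the same ids and adds no rows.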
6.3 Outputs
- import report: acr-engine/data/pgvector_eval/music20/songcentric_pipeline_import_report.json
6.4 Current fresh evidence
From songcentric_pipeline_import_report.json:
- media_entity = 9
- audio_object = 22
- feature_fact = 34
- set_membership = 9
A live sample that can be checked directly right now:
- window_id = 22
- asset_id = 20
- song_id = 9
- title = song beta
- start_ms = 1000
- end_ms = 6000
- Matching feature:
  - feature_type = embedding
  - model_name = mert-v1-95m
  - model_version = hf-main
  - feature_set_name = mert_5s_hop2.5_v1
This sample shows that the trace-back chain
feature -> window -> asset -> song
is now actually persisted in the database, not just a design on paper.
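7. Query stage: how to get from a matched feature back to song_id
In the current repo, querying has to be understood at two levels:
7.1 Trace-back in the PostgreSQL main pipeline
The focus at this level is not an online ANN service, but:
- what happens after a feature is hit
- how to look up its window / asset / song
- how to aggregate at song level
The SQL and samples for this part are already documented; the key sections are:
- How online retrieval gets from a feature back to song_id
- How the exact + semantic lanes are fused into a song ranking
Shortest runnable trace-back SQL (after a feature is hit, go straight back to window / asset / song):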
select ff.feature_id,
ff.feature_type,
ff.model_name,
win.object_id as window_id,
win.start_ms,
win.end_ms,
ast.object_id as asset_id,
ast.storage_uri,
song.entity_id as song_id,
song.biz_key,
song.title
from acr_songcentric_test.feature_fact ff
join acr_songcentric_test.audio_object win
on win.object_id = ff.object_id and win.object_type = 'window'
join acr_songcentric_test.audio_object ast
on ast.object_id = win.parent_object_id and ast.object_type = 'asset'
join acr_songcentric_test.media_entity song
on song.entity_id = ff.song_id and song.entity_type = 'song'
where ff.feature_id = 34;
To aggregate or fuse directly by song, refer back to the companion documentation.
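Once every hit carries a song_id, song-level aggregation is a group-by. The sketch below keeps the best score per song and ranks songs by it; taking the maximum is an assumption made for illustration (a sum or a vote count would be equally plausible), and aggregate_hits_by_song is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


def aggregate_hits_by_song(
    hits: Iterable[Tuple[int, float]],
) -> List[Tuple[int, float]]:
    """Collapse (song_id, score) feature hits into one ranked score per song."""
    best: Dict[int, float] = {}
    for song_id, score in hits:
        # Assumed rule: a song is as good as its best-matching window.
        if song_id not in best or score > best[song_id]:
            best[song_id] = score
    # Highest score first; ties broken by song_id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

Several windows of the same song can match one query, so ranking raw feature hits would let a single song occupy multiple top positions; grouping by song_id avoids that.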
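7.2 "Query" in the current selected20 file-level evaluation
Script:
acr-engine/scripts/evaluate_selected20_songid_retrieval.py
What it does:
- Builds the reference from type_11
- Uses chromaprint_matcher for exact candidates
- Uses MERT embeddings for semantic candidates
- Ranks fused results by 0.6 * exact + 0.4 * semantic
- Checks whether the true song_id is in top1/top3
This pipeline is mainly used to:
- get a quick read on practical song-level accuracy
- serve as the regression baseline for the later MuQ challenger / fusion strategies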
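The fused ranking and the top-k check can be sketched in a few lines. The weights come from the formula above; treating a song that is missing from one lane as scoring 0 in that lane, and assuming both lanes are already on a comparable scale, are assumptions of this sketch. fuse_scores and hit_at_k are hypothetical names.

```python
from typing import Dict, Hashable, List, Tuple


def fuse_scores(
    exact: Dict[Hashable, float],
    semantic: Dict[Hashable, float],
    exact_weight: float = 0.6,
    semantic_weight: float = 0.4,
) -> List[Tuple[Hashable, float]]:
    """Rank songs by exact_weight * exact + semantic_weight * semantic."""
    song_ids = set(exact) | set(semantic)
    fused = {
        song_id: exact_weight * exact.get(song_id, 0.0)
        + semantic_weight * semantic.get(song_id, 0.0)
        for song_id in song_ids
    }
    # Highest fused score first; ties broken by the id's string form.
    return sorted(fused.items(), key=lambda item: (-item[1], str(item[0])))


def hit_at_k(
    ranking: List[Tuple[Hashable, float]], true_song_id: Hashable, k: int
) -> bool:
    """True when the expected song appears among the first k ranked songs."""
    return true_song_id in [song_id for song_id, _ in ranking[:k]]
```

For example, with exact scores {"a": 0.9, "b": 0.5} and semantic scores {"b": 1.0, "c": 0.8}, song "b" ranks first because it is supported by both lanes, even though "a" has the highest exact score.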
8. selected20 practical evaluation: how to run it, where the artifacts are, and how it currently performs
8.1 Reproduction command
cd /workspace/acr-engine
/usr/local/miniconda3/bin/python scripts/evaluate_selected20_songid_retrieval.py \
--downloads-dir /root/hikoon_song_files/output/selected_20_songs/downloads \
--reference-type 11 \
--query-types 1 7 12 16 \
--duration 8.0 \
--topk 3 \
--exact-weight 0.6 \
--semantic-weight 0.4 \
--output-json /workspace/acr-engine/data/local_eval/selected20_songid_eval_report.json \
--output-md /workspace/docs/selected20_songid_eval.md
8.2 Outputs
- JSON: acr-engine/data/local_eval/selected20_songid_eval_report.json
- Markdown summary: docs/selected20_songid_eval.md
- Reproduction guide: docs/selected20_songid_eval_repro.md
8.3 Current measured results
Total queries: 123
| lane | top1 | top3 |
|---|---|---|
| exact | 0.6016 | 0.8130 |
| semantic | 0.4715 | 0.6016 |
| fused | 0.6341 | 0.8537 |
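By type:
- type_1: the strongest, already at the maximum score
- type_12: also strong, with the semantic lane alone performing best
- type_7: one of the weak spots
- type_16: one of the weak spots
The most important conclusion for now:
The fused pipeline already outperforms exact alone and semantic alone, but type_7 / type_16 remain the main difficulty.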
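The reported rates can be sanity-checked against the query total. The sketch below assumes all 123 queries are scored in every lane; the hit counts it produces are derived from the rounded rates, not read from the JSON report, so they are only a consistency check.

```python
TOTAL_QUERIES = 123

# (top1, top3) rates copied from the table in section 8.3.
RATES = {
    "exact": (0.6016, 0.8130),
    "semantic": (0.4715, 0.6016),
    "fused": (0.6341, 0.8537),
}


def implied_hits(rate: float, total: int = TOTAL_QUERIES) -> int:
    """Nearest whole number of hits that would produce this rate."""
    return round(rate * total)


def is_consistent(rate: float, total: int = TOTAL_QUERIES) -> bool:
    """True when hits / total reproduces the rate to four decimal places."""
    return abs(implied_hits(rate, total) / total - rate) < 5e-5


IMPLIED = {
    lane: (implied_hits(top1), implied_hits(top3))
    for lane, (top1, top3) in RATES.items()
}
```

If every rate passes is_consistent, the table agrees with a single denominator of 123; a failure would suggest that some lane was evaluated on a different number of queries.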
9. Files to take with you at delivery
9.1 Main pipeline
- docs/postgresql-data-model.md
- docs/postgres_db_schema_samples.md
- docs/song-ingest-query-delivery.md
- docs/session-handoff.md
- docs/start-here.md
- docs/README.md
9.2 selected20
- docs/selected20_songid_eval.md
- docs/selected20_songid_eval_repro.md
- acr-engine/data/local_eval/selected20_songid_eval_report.json
9.3 PostgreSQL live artifacts
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest.jsonl
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_build_report.json
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_manifest_with_features.jsonl
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_enrich_report.json
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_import_report.json
- acr-engine/data/pgvector_eval/music20/songcentric_pipeline_runner_report.json
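10. Minimum delivery checklist
10.1 If you are delivering the ingest pipeline
Confirm at least:
- the schema has been created
- the runner can be executed
- manifest / enriched manifest / import report / runner report have all been generated
- a live lineage sample for feature -> window -> asset -> song can be traced back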
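The artifact check in this list is easy to automate. The sketch below only tests that each expected file exists and is non-empty in the output directory; the file names are the ones listed in section 9.3, while missing_artifacts is a hypothetical helper and the non-empty rule is an assumption.

```python
from pathlib import Path
from typing import List

EXPECTED_ARTIFACTS = (
    "songcentric_pipeline_manifest.jsonl",
    "songcentric_pipeline_build_report.json",
    "songcentric_pipeline_manifest_with_features.jsonl",
    "songcentric_pipeline_enrich_report.json",
    "songcentric_pipeline_import_report.json",
    "songcentric_pipeline_runner_report.json",
)


def missing_artifacts(output_dir: Path) -> List[str]:
    """List expected pipeline outputs that are absent or empty in output_dir."""
    missing = []
    for name in EXPECTED_ARTIFACTS:
        path = output_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling missing_artifacts(Path("acr-engine/data/pgvector_eval/music20")) from the workspace root should return an empty list after a successful run. It does not validate the contents of the reports.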
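10.2 If you are delivering the evaluation results
Confirm at least:
- the selected20 command can be reproduced
- the JSON report exists
- the Markdown summary exists
- overall / per-type metrics are written into the delivery notes
10.3 If you are delivering a handoff for the next person
Confirm at least:
- README.md has an entry point
- start-here.md has the shortest command
- session-handoff.md states the current status clearly
- CHANGELOG.md records this delivery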
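11. The most pragmatic next steps
- Do not roll back the 4-table song-centric main pipeline
- Keep selected20 as the small-sample regression baseline
- Once the MuQ challenger is integrated, rerun selected20 immediately
- Watch closely whether type_7 / type_16 improve
- Keep the feature -> window -> asset -> song traceability of the PostgreSQL main pipeline intact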