Reduce ACR handoff time with a single doc chain
Constraint: Preserve the current Phase-1 runner, PostgreSQL v2 contract, and live validation narrative while removing duplicate doc entrypoints.
Rejected: Keep multiple parallel handoff docs | They force new contributors to diff stale narratives before they can act.
Confidence: high
Scope-risk: narrow
Directive: Treat README -> start-here -> session-handoff as the only first-read path unless a newer handoff chain fully replaces it.
Tested: git diff --check on touched docs/script; rg for deleted-doc residual refs outside CHANGELOG; reran scripts/run_planner_validation_commands_live.py with executed_count=4 and all_passed=true
Not-tested: Markdown link rendering in external viewers
Showing 11 changed files with 181 additions and 143 deletions
| 1 | #!/usr/bin/env bash | ||
| 2 | set -euo pipefail | ||
| 3 | |||
| 4 | ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" | ||
| 5 | PYTHON_BIN="${PYTHON_BIN:-/usr/local/miniconda3/bin/python}" | ||
| 6 | DSN="${1:-${PG_DSN:-}}" | ||
| 7 | OUTPUT="${2:-$ROOT_DIR/data/pgvector_eval/music20/planner_validation_commands_runner_report.json}" | ||
| 8 | |||
| 9 | if [[ -z "$DSN" ]]; then | ||
| 10 | echo "usage: $0 <postgres-dsn> [output-json]" >&2 | ||
| 11 | echo "or set PG_DSN before running this script" >&2 | ||
| 12 | exit 1 | ||
| 13 | fi | ||
| 14 | |||
| 15 | cd "$ROOT_DIR" | ||
| 16 | "$PYTHON_BIN" scripts/run_planner_validation_commands_live.py \ | ||
| 17 | --dsn "$DSN" \ | ||
| 18 | --output "$OUTPUT" |
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
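| 3 | - Consolidated the doc entry chain: added `docs/start-here.md` and unified the onboarding path for new contributors as `README -> start-here -> session-handoff`. | ||
| 4 | - Rewrote `docs/README.md`, reorganizing navigation into "handoff / design / implementation / operations / roles" to lower the cost of a first read. | ||
| 5 | - Restructured `docs/session-handoff.md` to gather the latest Phase-1 runner, stable conclusions, blockers, and next actions on a single page. | ||
| 6 | - Cleaned up duplicate or stale docs: deleted `docs/acr-design.md`, `docs/open-dataset-plan.md`, `docs/external-manifest-template.md`, `docs/roadmap.md`, `docs/changelist-2026-06-02.md`, `docs/delivery-handoff-2026-06-02.md`. | ||
| 7 | - History is still kept in `docs/CHANGELOG.md`; the main chain above is the currently valid entry point. | ||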
| 8 | |||
| 9 | ## 2026-06-04 | ||
| 10 | |||
| 3 | - Updated the top of `docs/README.md` to the "shortest startup path" consistent with `session-handoff`, and reran `run_planner_validation_commands_live.py` with that entry command, confirming the fresh result is still `executed_count=4`, `all_passed=true`. | 11 | - Updated the top of `docs/README.md` to the "shortest startup path" consistent with `session-handoff`, and reran `run_planner_validation_commands_live.py` with that entry command, confirming the fresh result is still `executed_count=4`, `all_passed=true`. |
| 4 | - Restructured the top of `docs/session-handoff.md` into a "preferred startup flow (shortest path)" that gives a single launch command for `run_planner_validation_commands_live.py` plus the result-interpretation logic based on the fresh runner report (`executed_count=4`, `all_passed=true`), reducing the recovery cost of the next session. | 12 | - Restructured the top of `docs/session-handoff.md` into a "preferred startup flow (shortest path)" that gives a single launch command for `run_planner_validation_commands_live.py` plus the result-interpretation logic based on the fresh runner report (`executed_count=4`, `all_passed=true`), reducing the recovery cost of the next session. |
| 5 | - Added `scripts/run_planner_validation_commands_live.py` and `planner_validation_commands_runner_report.json`; the runner reads `validation_commands` from `phase1_extraction_plan_report.json` and executes them in batch. All 4 current entrypoints executed successfully: `executed_count=4`, `all_passed=true`. | 13 | - Added `scripts/run_planner_validation_commands_live.py` and `planner_validation_commands_runner_report.json`; the runner reads `validation_commands` from `phase1_extraction_plan_report.json` and executes them in batch. All 4 current entrypoints executed successfully: `executed_count=4`, `all_passed=true`. | ... | ... |
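The changelog entry above describes a runner that reads `validation_commands` from a plan report and executes them in batch. The real `run_planner_validation_commands_live.py` is not shown in this diff, so the following is only a minimal sketch of that idea. It assumes, hypothetically, that each entry is an object with a `name` string and a `command` argv list; the function name and the `results` field are illustrative, while `executed_count` and `all_passed` are the report fields named in the changelog.

```python
import subprocess
import sys


def run_validation_commands(plan_report: dict) -> dict:
    """Run each entry of plan_report["validation_commands"] and summarize.

    Assumed entry shape (not confirmed by the source):
    {"name": "<label>", "command": ["<argv0>", "<argv1>", ...]}
    """
    results = []
    for entry in plan_report.get("validation_commands", []):
        completed = subprocess.run(entry["command"], capture_output=True, text=True)
        results.append({
            "name": entry["name"],
            "returncode": completed.returncode,
            "passed": completed.returncode == 0,
        })
    return {
        "executed_count": len(results),
        # An empty run is treated as "not passed" so it cannot look like success.
        "all_passed": bool(results) and all(r["passed"] for r in results),
        "results": results,
    }


# Tiny self-check with two throwaway commands (one succeeds, one exits with 3).
demo_plan = {
    "validation_commands": [
        {"name": "ok", "command": [sys.executable, "-c", "pass"]},
        {"name": "bad", "command": [sys.executable, "-c", "raise SystemExit(3)"]},
    ]
}
demo_report = run_validation_commands(demo_plan)
```

Treating an empty command list as a failure is a design choice of this sketch; the actual runner may behave differently.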
docs/acr-design.md
deleted
100644 → 0
docs/changelist-2026-06-02.md
deleted
100644 → 0
docs/delivery-handoff-2026-06-02.md
deleted
100644 → 0
docs/external-manifest-template.md
deleted
100644 → 0
| 1 | # External Manifest Template | ||
| 2 | |||
| 3 | Applies to whitelisted datasets: FMA / Jamendo / CCMusic / ModelScope. | ||
| 4 | |||
| 5 | ## Minimum fields for catalog.csv | ||
| 6 | |||
| 7 | ```csv | ||
| 8 | song_id,audio_path,duration,source_dataset | ||
| 9 | track_0001,raw/track_0001.wav,12.5,fma | ||
| 10 | ``` | ||
| 11 | |||
| 12 | Conversion command: | ||
| 13 | |||
| 14 | ```bash | ||
| 15 | python src/data/manifest_tools.py csv-to-catalog catalog.csv manifests/catalog.json | ||
| 16 | ``` | ||
| 17 | |||
| 18 | ## Output catalog.json structure | ||
| 19 | |||
| 20 | ```json | ||
| 21 | { | ||
| 22 | "song_id": "track_0001", | ||
| 23 | "audio_path": "raw/track_0001.wav", | ||
| 24 | "duration": 12.5, | ||
| 25 | "type": "reference", | ||
| 26 | "source_dataset": "fma" | ||
| 27 | } | ||
| 28 | ``` |
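The deleted template above maps a `catalog.csv` row to a catalog entry. The internals of `src/data/manifest_tools.py` are not part of this diff, so the sketch below is a hypothetical stand-in that shows only the field mapping. It assumes that `type` is a constant `"reference"` (the CSV has no such column) and that the output is a list of entries; neither assumption is confirmed by the source.

```python
import csv
import io
import json


def csv_to_catalog(csv_text: str, entry_type: str = "reference") -> list:
    """Convert catalog.csv text into entries shaped like the sample above."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entries.append({
            "song_id": row["song_id"],
            "audio_path": row["audio_path"],
            "duration": float(row["duration"]),  # CSV values are strings
            "type": entry_type,  # assumed constant, not read from the CSV
            "source_dataset": row["source_dataset"],
        })
    return entries


# The sample row from the template.
sample_csv = (
    "song_id,audio_path,duration,source_dataset\n"
    "track_0001,raw/track_0001.wav,12.5,fma\n"
)
catalog = csv_to_catalog(sample_csv)
catalog_json = json.dumps(catalog, indent=2)
```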
docs/open-dataset-plan.md
deleted
100644 → 0
| 1 | # Open Dataset Integration Plan | ||
| 2 | |||
| 3 | ## Recommended order | ||
| 4 | |||
| 5 | 1. **FMA small** | ||
| 6 | - URL: https://github.com/mdeff/fma | ||
| 7 | - Why: easiest small realistic music subset for retrieval experiments | ||
| 8 | 2. **MTG-Jamendo** | ||
| 9 | - URL: https://github.com/MTG/mtg-jamendo-dataset | ||
| 10 | - Why: larger CC-licensed corpus with scriptable upstream tooling | ||
| 11 | 3. **QBSH / humming corpora** | ||
| 12 | - Why: add after retrieval baseline is stable | ||
| 13 | |||
| 14 | ## Repo strategy | ||
| 15 | |||
| 16 | - Keep external dataset ingestion optional | ||
| 17 | - Convert external tracks into: | ||
| 18 | - `catalog.json` for searchable references | ||
| 19 | - query segment manifests for evaluation | ||
| 20 | - Start with small local subsets before full-corpus scaling |
docs/roadmap.md
deleted
100644 → 0
| 1 | # ACR Project Roadmap | ||
| 2 | |||
| 3 | > Updated: 2026-06-02 | ||
| 4 | |||
| 5 | ## Phase 0: Prototype running end to end (current phase) | ||
| 6 | |||
| 7 | ### Goal | ||
| 8 | Complete a local demo that runs end to end. | ||
| 9 | |||
| 10 | ### Scope | ||
| 11 | - [x] Synthetic data generation | ||
| 12 | - [x] Data augmentation | ||
| 13 | - [x] ECAPA embedding model | ||
| 14 | - [x] Traditional fingerprint matcher | ||
| 15 | - [x] HybridEngine | ||
| 16 | - [x] Minimal training entrypoint | ||
| 17 | - [x] Minimal recognition entrypoint | ||
| 18 | - [x] Documentation completed | ||
| 19 | |||
| 20 | ### Acceptance criteria | ||
| 21 | - Can generate data | ||
| 22 | - Can train for at least 1 epoch | ||
| 23 | - Can build the reference index | ||
| 24 | - Can output Top-K candidates for test clips | ||
| 25 | |||
| 26 | --- | ||
| 27 | |||
| 28 | ## Phase 1: Research validation | ||
| 29 | |||
| 30 | ### Goal | ||
| 31 | Verify that recognition quality is acceptable across different scenarios. | ||
| 32 | |||
| 33 | ### Tasks | ||
| 34 | - [ ] Add top-1 / top-5 / MRR evaluation scripts | ||
| 35 | - [ ] Evaluate clean / noisy / stretched / pitch-shifted queries separately | ||
| 36 | - [ ] Add a dedicated query-by-humming evaluation set | ||
| 37 | - [ ] Add more robust negative sampling | ||
| 38 | - [ ] Add checkpoint / config versioning | ||
| 39 | |||
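The Phase 1 task list above names top-1 / top-5 / MRR as the target metrics. As a rough illustration of what such an evaluation script might compute, here is a minimal sketch; the function names and the input shape (a list of `(ranked_ids, truth_id)` pairs) are assumptions, not part of the repository.

```python
def rank_of_truth(ranked_ids, truth_id):
    """Return the 1-based rank of truth_id in ranked_ids, or None if absent."""
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate == truth_id:
            return rank
    return None


def evaluate(results, k_values=(1, 5)):
    """Compute top-k accuracy and mean reciprocal rank (MRR).

    results: list of (ranked_ids, truth_id) pairs, one per query.
    A query whose truth is missing from the ranking contributes 0 to every metric.
    """
    if not results:
        raise ValueError("results must not be empty")
    ranks = [rank_of_truth(ranked, truth) for ranked, truth in results]
    metrics = {}
    for k in k_values:
        hits = sum(1 for r in ranks if r is not None and r <= k)
        metrics[f"top{k}"] = hits / len(ranks)
    metrics["mrr"] = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    return metrics


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
toy_results = [
    (["a", "b"], "a"),
    (["x", "y", "z"], "y"),
    (["p"], "q"),
]
toy_metrics = evaluate(toy_results)
```

Running the same function once per condition (clean, noisy, stretched, pitch-shifted) would give the per-scenario breakdown the task list asks for.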
| 40 | --- | ||
| 41 | |||
| 42 | ## Phase 2: Engineering | ||
| 43 | |||
| 44 | ### Goal | ||
| 45 | Upgrade the prototype into a reproducible experiment project. | ||
| 46 | |||
| 47 | ### Tasks | ||
| 48 | - [ ] Add a `Makefile` or `justfile` | ||
| 49 | - [ ] Add basic `pytest` tests | ||
| 50 | - [ ] Add logging and metrics recording | ||
| 51 | - [ ] Add model export and loading conventions | ||
| 52 | - [ ] Add CLI argument validation | ||
| 53 | - [ ] Add a Docker run option | ||
| 54 | |||
| 55 | --- | ||
| 56 | |||
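| 57 | ## Phase 3: Productization PoC | ||
| 58 | |||
| 59 | ### Goal | ||
| 60 | Provide a service interface that business teams can call. | ||
| 61 | |||
| 62 | ### Tasks | ||
| 63 | - [ ] FastAPI service | ||
| 64 | - [ ] Upload audio and return candidate songs | ||
| 65 | - [ ] Incremental catalog ingestion command | ||
| 66 | - [ ] Metadata management API | ||
| 67 | - [ ] Result caching and batch retrieval | ||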
| 68 | |||
| 69 | --- | ||
| 70 | |||
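| 71 | ## Phase 4: Large-scale retrieval | ||
| 72 | |||
| 73 | ### Goal | ||
| 74 | Support catalogs of one million tracks or more. | ||
| 75 | |||
| 76 | ### Tasks | ||
| 77 | - [ ] Integrate Faiss / HNSW | ||
| 78 | - [ ] Embedding sharding and compression | ||
| 79 | - [ ] Two-stage recall + re-ranking | ||
| 80 | - [ ] Online index updates | ||
| 81 | - [ ] Hot/cold tiered storage | ||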
| 82 | |||
| 83 | --- | ||
| 84 | |||
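| 85 | ## Phase 5: Real business capability | ||
| 86 | |||
| 87 | ### Goal | ||
| 88 | Approach a real song-recognition product. | ||
| 89 | |||
| 90 | ### Tasks | ||
| 91 | - [ ] Ingest real copyrighted audio data | ||
| 92 | - [ ] Dedicated humming model / melody tower | ||
| 93 | - [ ] Multimodal fusion (melody + voiceprint + fingerprint) | ||
| 94 | - [ ] Online A/B evaluation | ||
| 95 | - [ ] Monitoring and quality feedback loop |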
docs/start-here.md
0 → 100644
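| 1 | # Start Here / Onboarding Entry for New Contributors | ||
| 2 | |||
| 3 | > Goal: within **10 minutes**, a new contributor should know what to run first, what to read first, where things are currently blocked, and what to do next. | ||
| 4 | |||
| 5 | --- | ||
| 6 | |||
| 7 | ## 1. Run this command first | ||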
| 8 | |||
| 9 | ```bash | ||
| 10 | cd /workspace/acr-engine | ||
| 11 | /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \ | ||
| 12 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | ||
| 13 | --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json | ||
| 14 | ``` | ||
| 15 | |||
| 16 | You can also use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'` | ||
| 17 | |||
| 18 | ### Current fresh evidence | ||
| 19 | - `executed_count = 4` | ||
| 20 | - `all_passed = true` | ||
| 21 | |||
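The two values above are what a healthy run is expected to report. A minimal sketch of checking them programmatically follows; it assumes that `executed_count` and `all_passed` are top-level keys of the runner report JSON, which the source does not confirm, and the function names are illustrative.

```python
import json
from pathlib import Path


def runner_report_ok(report: dict, expected_count: int = 4) -> bool:
    """True when the report matches the fresh evidence listed above.

    Assumes executed_count and all_passed are top-level keys (unconfirmed).
    """
    return (
        report.get("executed_count") == expected_count
        and report.get("all_passed") is True
    )


def load_and_check(path: str, expected_count: int = 4) -> bool:
    """Load a runner report from disk and apply runner_report_ok."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    return runner_report_ok(report, expected_count)
```

For example, `load_and_check("data/pgvector_eval/music20/planner_validation_commands_runner_report.json")` would check the file written by the command above.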
| 22 | ### This command runs | ||
| 23 | 1. `prereq_audit` | ||
| 24 | 2. `worker_contract_smoke` | ||
| 25 | 3. `semantic_vector_negative_matrix` | ||
| 26 | 4. `asset_level_upsert_validation` | ||
| 27 | |||
| 28 | ### How to interpret the following results | ||
| 29 | If you see: | ||
| 30 | - `downloads_root_exists = false` | ||
| 31 | - `ready_jobs = 0` | ||
| 32 | - exact lane = `failed/unreadable_audio_assets` | ||
| 33 | - semantic lane = `4/4 failed` | ||
| 34 | |||
| 35 | then the current priorities are: | ||
| 36 | 1. Mount `/workspace/downloads` | ||
| 37 | 2. Install `torch / torchaudio / transformers / speechbrain` | ||
| 38 | |||
| 39 | In other words: | ||
| 40 | > The primary problem right now is the runtime environment prerequisites, not the PostgreSQL schema and not a design flaw in the worker contract. | ||
| 41 | |||
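The interpretation rule above can be written down as a small check. This is only a sketch: `downloads_root_exists` and `ready_jobs` are names from the document, while the `exact_lane` and `semantic_lane` keys and the flat dictionary shape are hypothetical, since the layout of the real reports is not shown here.

```python
# The two priorities listed above, in order.
ENV_PRIORITIES = [
    "mount /workspace/downloads",
    "install torch / torchaudio / transformers / speechbrain",
]


def is_environment_blocked(signals: dict) -> bool:
    """True when all four signals from the list above are present."""
    return (
        signals.get("downloads_root_exists") is False
        and signals.get("ready_jobs") == 0
        and signals.get("exact_lane") == "failed/unreadable_audio_assets"  # hypothetical key
        and signals.get("semantic_lane") == "4/4 failed"  # hypothetical key
    )


def next_actions(signals: dict) -> list:
    """Return the environment priorities, or an empty list if the pattern does not match."""
    return list(ENV_PRIORITIES) if is_environment_blocked(signals) else []


observed = {
    "downloads_root_exists": False,
    "ready_jobs": 0,
    "exact_lane": "failed/unreadable_audio_assets",
    "semantic_lane": "4/4 failed",
}
actions = next_actions(observed)
```

An empty result means the observed state is not the known environment-blocked pattern and needs a closer look rather than the two standard fixes.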
| 42 | --- | ||
| 43 | |||
| 44 | ## 2. Read only these 5 docs when taking over | ||
| 45 | |||
| 46 | 1. [README.md](./README.md) | ||
| 47 | 2. [session-handoff.md](./session-handoff.md) | ||
| 48 | 3. [acr-architecture.md](./acr-architecture.md) | ||
| 49 | 4. [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 50 | 5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md) | ||
| 51 | |||
| 52 | If you own algorithms or retrieval, also read: | ||
| 53 | - [sota-evolution-guide.md](./sota-evolution-guide.md) | ||
| 54 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md) | ||
| 55 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 56 | |||
| 57 | --- | ||
| 58 | |||
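| 59 | ## 3. The project in one sentence | ||
| 60 | |||
| 61 | We are building a music ACR system for **copyright protection / song recognition / version attribution**, | ||
| 62 | whose goal is to quickly locate the correct `song_id / work / recording` attribution across 1,000,000 (`100w`) audio files and about 300,000 (`30w`) songs. | ||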
| 63 | |||
| 64 | --- | ||
| 65 | |||
| 66 | ## 4. Current mainline approach | ||
| 67 | |||
| 68 | ### Retrieval mainline | ||
| 69 | - exact lane:`Chromaprint` | ||
| 70 | - semantic lane baseline:`MERT-v1-95M` | ||
| 71 | - semantic lane challenger:`MuQ` | ||
| 72 | - historical baseline:`ECAPA` | ||
| 73 | |||
| 74 | ### Data mainline | ||
| 75 | ```text | ||
| 76 | canonical_song -> work -> recording -> recording_asset -> audio_window | ||
| 77 | ``` | ||
| 78 | |||
| 79 | ### Model mainline | ||
| 80 | ```text | ||
| 81 | model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry | ||
| 82 | ``` | ||
| 83 | |||
| 84 | --- | ||
| 85 | |||
| 86 | ## 5. What is already stable | ||
| 87 | |||
| 88 | - PostgreSQL v2 schema is in place | ||
| 89 | - registry bootstrap has live validation | ||
| 90 | - worker contract has live validation | ||
| 91 | - failure semantics of the exact / semantic lanes are auditable | ||
| 92 | - planner can output validation commands | ||
| 93 | - planner validation runner can be run with a single command | ||
| 94 | |||
| 95 | ## 6. What is not done yet | ||
| 96 | |||
| 97 | - MERT / MuQ inference has not actually been run end to end | ||
| 98 | - The current host has no `/workspace/downloads` | ||
| 99 | - The current host is missing `torch / torchaudio / transformers / speechbrain` | ||
| 100 | - The final online fusion strategy is not finished | ||
| 101 | - A larger real reference set has not been onboarded | ||
| 102 | |||
| 103 | --- | ||
| 104 | |||
| 105 | ## 7. If you continue from here, follow this order | ||
| 106 | |||
| 107 | ### Route A: fix the environment first | ||
| 108 | 1. Mount `/workspace/downloads` | ||
| 109 | 2. Install the semantic runtime dependencies | ||
| 110 | 3. Rerun the planner validation runner | ||
| 111 | 4. Check whether `ready_jobs` starts to recover | ||
| 112 | |||
| 113 | ### Route B: fix the implementation first | ||
| 114 | 1. Read [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 115 | 2. Read `acr-engine/workers/run_embedding_job.py` | ||
| 116 | 3. Replace the guarded failure path with a real inference adapter | ||
| 117 | 4. Keep the current PostgreSQL contract unchanged | ||
| 118 | |||
| 119 | ### Route C: fix the data first | ||
| 120 | 1. Read [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 121 | 2. Read [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 122 | 3. Prepare a larger reference set | ||
| 123 | 4. Keep `reference_set_registry / reference_set_member` versioned | ||
| 124 | |||
| 125 | --- | ||
| 126 | |||
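| 127 | ## 8. What not to prioritize right now | ||
| 128 | |||
| 129 | - Do not reopen the debate on whether to layer `song/work/recording` | ||
| 130 | - Do not fall back to a flat table with only `song_id` | ||
| 131 | - Do not start by discussing retraining the base model | ||
| 132 | - Do not misdiagnose the current problem as a PostgreSQL contract design issue | ||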
| 133 | |||
| 134 | --- | ||
| 135 | |||
| 136 | ## 9. Common repo entry points | ||
| 137 | |||
| 138 | ### Docs | ||
| 139 | - [README.md](./README.md) | ||
| 140 | - [session-handoff.md](./session-handoff.md) | ||
| 141 | - [postgresql-data-model.md](./postgresql-data-model.md) | ||
| 142 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | ||
| 143 | |||
| 144 | ### Scripts | ||
| 145 | - `acr-engine/scripts/run_planner_validation_commands_live.py` | ||
| 146 | - `acr-engine/scripts/run_phase1_prereq_audit_live.py` | ||
| 147 | - `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py` | ||
| 148 | - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py` | ||
| 149 | - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 150 | |||
| 151 | --- | ||
| 152 | |||
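| 153 | ## One-sentence conclusion | ||
| 154 | |||
| 155 | > When taking over, run the runner first, then read the 5 core docs; the primary problem right now is the environment prerequisites, not the schema/contract itself. |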