Reduce ACR handoff time with a single doc chain

Constraint: Preserve the current Phase-1 runner, PostgreSQL v2 contract, and live validation narrative while removing duplicate doc entrypoints.
Rejected: Keep multiple parallel handoff docs | They force new contributors to diff stale narratives before they can act.
Confidence: high
Scope-risk: narrow
Directive: Treat README -> start-here -> session-handoff as the only first-read path unless a newer handoff chain fully replaces it.
Tested: git diff --check on touched docs/script; rg for deleted-doc residual refs outside CHANGELOG; reran scripts/run_planner_validation_commands_live.py with executed_count=4 and all_passed=true
Not-tested: Markdown link rendering in external viewers

Commit 6d4f8c1c ... 6d4f8c1c1ded3ba035a550151beac07f6d087304 authored 2026-06-04 14:25:25 +0800 by cnb.bofCdSsphPA
Showing 11 changed files with 181 additions and 143 deletions
acr-engine/scripts/start_phase1_shortest_path.sh
docs/CHANGELOG.md
docs/README.md
docs/acr-design.md
docs/changelist-2026-06-02.md
docs/delivery-handoff-2026-06-02.md
docs/external-manifest-template.md
docs/open-dataset-plan.md
docs/roadmap.md
docs/session-handoff.md
docs/start-here.md
--- a/acr-engine/scripts/start_phase1_shortest_path.sh 0 → 100755
+++ b/acr-engine/scripts/start_phase1_shortest_path.sh 0 → 100755
+#!/usr/bin/env bash
+set -euo pipefail
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+PYTHON_BIN="${PYTHON_BIN:-/usr/local/miniconda3/bin/python}"
+DSN="${1:-${PG_DSN:-}}"
+OUTPUT="${2:-$ROOT_DIR/data/pgvector_eval/music20/planner_validation_commands_runner_report.json}"
+if [[ -z "$DSN" ]]; then
+  echo "usage: $0 <postgres-dsn> [output-json]" >&2
+  echo "or set PG_DSN before running this script" >&2
+  exit 1
+fi
+cd "$ROOT_DIR"
+"$PYTHON_BIN" scripts/run_planner_validation_commands_live.py \
+  --dsn "$DSN" \
+  --output "$OUTPUT"
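The wrapper's argument precedence can be spelled out in Python. The sketch below mirrors the bash defaults shown in the script (positional DSN first, then `PG_DSN`; positional output first, then the default report path). The function name `resolve_runner_args` is hypothetical and not part of this commit.

```python
import os

# Default report path and interpreter path are copied from the wrapper script.
DEFAULT_REPORT = "data/pgvector_eval/music20/planner_validation_commands_runner_report.json"
DEFAULT_PYTHON = "/usr/local/miniconda3/bin/python"


def resolve_runner_args(argv, env, root_dir):
    """Build the runner command with the same precedence as the bash wrapper.

    argv     -- positional arguments (DSN, optional output path)
    env      -- mapping of environment variables
    root_dir -- acr-engine root, used for the default report location
    """
    # ${1:-${PG_DSN:-}}: a non-empty first argument wins, else PG_DSN.
    dsn = argv[0] if len(argv) > 0 and argv[0] else env.get("PG_DSN", "")
    if not dsn:
        raise ValueError("usage: <postgres-dsn> [output-json], or set PG_DSN")
    # ${2:-$ROOT_DIR/...}: a non-empty second argument wins, else the default.
    output = argv[1] if len(argv) > 1 and argv[1] else os.path.join(root_dir, DEFAULT_REPORT)
    # ${PYTHON_BIN:-...}: unset or empty falls back to the default interpreter.
    python_bin = env.get("PYTHON_BIN") or DEFAULT_PYTHON
    return [
        python_bin,
        "scripts/run_planner_validation_commands_live.py",
        "--dsn", dsn,
        "--output", output,
    ]
```

The returned list could be passed to `subprocess.run(..., cwd=root_dir, check=True)` to reproduce what the wrapper does after `cd "$ROOT_DIR"`.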
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 ## 2026-06-04
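+- Converged the doc entry chain: added `docs/start-here.md` and unified the onboarding path for new contributors as `README -> start-here -> session-handoff`.
+- Rewrote `docs/README.md`, reorganizing navigation into "handoff / design / implementation / operations / roles" to lower the first-read cost.
+- Restructured `docs/session-handoff.md` to consolidate the latest Phase-1 runner, stable conclusions, blockers, and next actions into a single page.
+- Cleaned up duplicate or stale docs: deleted `docs/acr-design.md`, `docs/open-dataset-plan.md`, `docs/external-manifest-template.md`, `docs/roadmap.md`, `docs/changelist-2026-06-02.md`, `docs/delivery-handoff-2026-06-02.md`.
+- History is still kept in `docs/CHANGELOG.md`; the currently valid entrypoint is the main chain above.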
+## 2026-06-04
 - Updated the top of `docs/README.md` to a "shortest startup path" consistent with `session-handoff`, and reran `run_planner_validation_commands_live.py` with that entry command, confirming the fresh result is still `executed_count=4`, `all_passed=true`.
 - Restructured the top of `docs/session-handoff.md` into a "preferred startup flow (shortest path)" that gives a single launch command for `run_planner_validation_commands_live.py` plus the result-interpretation logic based on the fresh runner report (`executed_count=4`, `all_passed=true`), reducing recovery cost for the next session.
 - Added `scripts/run_planner_validation_commands_live.py` and `planner_validation_commands_runner_report.json`, which read `validation_commands` from `phase1_extraction_plan_report.json` and execute them in batch; all 4 current entrypoints ran successfully, with `executed_count=4`, `all_passed=true`.
--- a/docs/README.md
+++ b/docs/README.md
--- a/docs/acr-design.md deleted 100644 → 0
+++ b/docs/acr-design.md deleted 100644 → 0
--- a/docs/changelist-2026-06-02.md deleted 100644 → 0
+++ b/docs/changelist-2026-06-02.md deleted 100644 → 0
--- a/docs/delivery-handoff-2026-06-02.md deleted 100644 → 0
+++ b/docs/delivery-handoff-2026-06-02.md deleted 100644 → 0
--- a/docs/external-manifest-template.md deleted 100644 → 0
+++ b/docs/external-manifest-template.md deleted 100644 → 0
-# External Manifest Template
-Applies to the allowlisted datasets FMA / Jamendo / CCMusic / ModelScope.
-## Minimum fields for catalog.csv
-```csv
-song_id,audio_path,duration,source_dataset
-track_0001,raw/track_0001.wav,12.5,fma
-```
-Conversion command:
-```bash
-python src/data/manifest_tools.py csv-to-catalog catalog.csv manifests/catalog.json
-```
-## Output catalog.json structure
-```json
-{
-  "song_id": "track_0001",
-  "audio_path": "raw/track_0001.wav",
-  "duration": 12.5,
-  "type": "reference",
-  "source_dataset": "fma"
-}
-```
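The deleted template describes a CSV-to-JSON conversion but this diff does not show how `manifest_tools.py` implements it. As a rough illustration only, the sketch below maps the CSV columns to the JSON fields shown above. It assumes that `catalog.json` holds a list of such entries and that every row gets `"type": "reference"`; the function name `csv_to_catalog` is hypothetical.

```python
import csv
import io
import json


def csv_to_catalog(csv_text, entry_type="reference"):
    """Convert catalog.csv text into a list of catalog entries.

    Assumption: each CSV row becomes one JSON object with the fields from
    the template, and duration is stored as a float.
    """
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entries.append({
            "song_id": row["song_id"],
            "audio_path": row["audio_path"],
            "duration": float(row["duration"]),
            "type": entry_type,
            "source_dataset": row["source_dataset"],
        })
    return entries


# Sample row copied from the template.
sample = (
    "song_id,audio_path,duration,source_dataset\n"
    "track_0001,raw/track_0001.wav,12.5,fma\n"
)
print(json.dumps(csv_to_catalog(sample), indent=2))
```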
--- a/docs/open-dataset-plan.md deleted 100644 → 0
+++ b/docs/open-dataset-plan.md deleted 100644 → 0
-# Open Dataset Integration Plan
-## Recommended order
-1. **FMA small**
-   - URL: https://github.com/mdeff/fma
-   - Why: easiest small realistic music subset for retrieval experiments
-2. **MTG-Jamendo**
-   - URL: https://github.com/MTG/mtg-jamendo-dataset
-   - Why: larger CC-licensed corpus with scriptable upstream tooling
-3. **QBSH / humming corpora**
-   - Why: add after retrieval baseline is stable
-## Repo strategy
-- Keep external dataset ingestion optional
-- Convert external tracks into:
-  - `catalog.json` for searchable references
-  - query segment manifests for evaluation
-- Start with small local subsets before full-corpus scaling
--- a/docs/roadmap.md deleted 100644 → 0
+++ b/docs/roadmap.md deleted 100644 → 0
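-# ACR Project Roadmap
-> Updated: 2026-06-02
-## Phase 0: Working prototype (current phase)
-### Goal
-Deliver a local demo that runs end to end.
-### Scope
-- [x] Synthetic data generation
-- [x] Data augmentation
-- [x] ECAPA embedding model
-- [x] Traditional fingerprint matcher
-- [x] HybridEngine
-- [x] Minimal training entrypoint
-- [x] Minimal recognition entrypoint
-- [x] Documentation completed
-### Acceptance criteria
-- Data can be generated
-- Training runs for at least 1 epoch
-- A reference index can be built
-- Top-K candidates are returned for test clips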
---
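-## Phase 1: Research validation
-### Goal
-Verify whether recognition quality is acceptable across different scenarios.
-### Tasks
-- [ ] Add top-1 / top-5 / MRR evaluation scripts
-- [ ] Evaluate clean / noisy / stretched / pitch-shifted separately
-- [ ] Add a dedicated query-by-humming evaluation set
-- [ ] Add more robust negative sampling
-- [ ] Add checkpoint / config versioning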
---
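-## Phase 2: Engineering hardening
-### Goal
-Upgrade the prototype into a reproducible experiment project.
-### Tasks
-- [ ] Add a `Makefile` or `justfile`
-- [ ] Add basic `pytest` tests
-- [ ] Add logging and metrics recording
-- [ ] Add model export and loading conventions
-- [ ] Add CLI argument validation
-- [ ] Add a Docker-based way to run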
---
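-## Phase 3: Productization PoC
-### Goal
-Provide a service interface that business teams can call.
-### Tasks
-- [ ] Expose the engine as a FastAPI service
-- [ ] Upload audio and return candidate songs
-- [ ] Incremental catalog ingestion command
-- [ ] Metadata management API
-- [ ] Result caching and batch retrieval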
---
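-## Phase 4: Large-scale retrieval
-### Goal
-Support catalogs of one million tracks or more.
-### Tasks
-- [ ] Integrate Faiss / HNSW
-- [ ] Embedding sharding and compression
-- [ ] Two-stage recall + re-ranking
-- [ ] Online index updates
-- [ ] Hot/cold tiered storage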
---
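-## Phase 5: Real-world business capability
-### Goal
-Approach the capability of a real song-recognition product.
-### Tasks
-- [ ] Ingest real copyrighted audio data
-- [ ] Dedicated humming model / melody tower
-- [ ] Multimodal fusion (melody + voiceprint + fingerprint)
-- [ ] Online A/B evaluation
-- [ ] Monitoring and quality feedback loop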
--- a/docs/session-handoff.md
+++ b/docs/session-handoff.md
--- a/docs/start-here.md 0 → 100644
+++ b/docs/start-here.md 0 → 100644
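+# Start Here / Onboarding Entry for New Contributors
+> Goal: within **10 minutes**, a new contributor should know what to run first, what to read first, where things are currently blocked, and what to do next.
+---
+## 1. Run this command first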
+```bash
+cd /workspace/acr-engine
+/usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \
+  --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \
+  --output data/pgvector_eval/music20/planner_validation_commands_runner_report.json
+```
+Alternatively, use the wrapper script: `acr-engine/scripts/start_phase1_shortest_path.sh 'postgres://d2:d2pass@127.0.0.1:5432/d2'`
+### Current fresh evidence
+- `executed_count = 4`
+- `all_passed = true`
+### What this command runs
+1. `prereq_audit`
+2. `worker_contract_smoke`
+3. `semantic_vector_negative_matrix`
+4. `asset_level_upsert_validation`
+### How to interpret the results below
+If you see:
+- `downloads_root_exists = false`
+- `ready_jobs = 0`
+- exact lane = `failed/unreadable_audio_assets`
+- semantic lane = `4/4 failed`
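+then the current priorities are:
+1. Mount `/workspace/downloads`
+2. Install `torch / torchaudio / transformers / speechbrain`
+In other words:
+> The primary problem right now is the runtime environment prerequisites, not the PostgreSQL schema and not a worker contract design error.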
+---
+## 2. Read only these 5 docs when taking over
+1. [README.md](./README.md)
+2. [session-handoff.md](./session-handoff.md)
+3. [acr-architecture.md](./acr-architecture.md)
+4. [postgresql-data-model.md](./postgresql-data-model.md)
+5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md)
+If you own algorithms or retrieval, also read:
+- [sota-evolution-guide.md](./sota-evolution-guide.md)
+- [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md)
+- [phase1-worker-contract.md](./phase1-worker-contract.md)
+---
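+## 3. The project in one sentence
+We are building a music ACR system for **copyright protection / song recognition / version attribution**,
+with the goal of quickly locating the correct `song_id / work / recording` attribution among 1,000,000 audio files covering roughly 300,000 songs.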
+---
+## 4. Current mainline approach
+### Retrieval mainline
+- exact lane: `Chromaprint`
+- semantic lane baseline: `MERT-v1-95M`
+- semantic lane challenger: `MuQ`
+- historical baseline: `ECAPA`
+### Data mainline
+```text
+canonical_song -> work -> recording -> recording_asset -> audio_window
+```
+### Model mainline
+```text
+model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry
+```
+---
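+## 5. What is already stable
+- The PostgreSQL v2 schema has landed
+- Registry bootstrap has live validation
+- The worker contract has live validation
+- Exact / semantic failure semantics are auditable
+- The planner can emit validation commands
+- The planner validation runner can be run with one command
+## 6. What is not done yet
+- MERT / MuQ inference has not actually been run end to end
+- The current host has no `/workspace/downloads`
+- The current host lacks `torch / torchaudio / transformers / speechbrain`
+- The final online fusion strategy is not finished
+- A larger real reference set has not been integrated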
+---
+## 7. If you continue from here, follow this order
+### Route A: fix the environment first
+1. Mount `/workspace/downloads`
+2. Install the semantic runtime dependencies
+3. Rerun the planner validation runner
+4. Confirm whether `ready_jobs` starts to recover
+### Route B: fix the implementation first
+1. Read [phase1-worker-contract.md](./phase1-worker-contract.md)
+2. Read `acr-engine/workers/run_embedding_job.py`
+3. Replace the guarded failure path with a real inference adapter
+4. Keep the current PostgreSQL contract unchanged
+### Route C: fix the data first
+1. Read [postgresql-data-model.md](./postgresql-data-model.md)
+2. Read [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)
+3. Prepare a larger reference set
+4. Keep `reference_set_registry / reference_set_member` versioned
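+## 8. What not to prioritize right now
+- Do not reopen the debate over whether to layer `song/work/recording`
+- Do not fall back to a flat table keyed only by `song_id`
+- Do not start by discussing retraining the base model
+- Do not misdiagnose the current problem as a PostgreSQL contract design issue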
+---
+## 9. Common repo entrypoints
+### Docs
+- [README.md](./README.md)
+- [session-handoff.md](./session-handoff.md)
+- [postgresql-data-model.md](./postgresql-data-model.md)
+- [phase1-worker-contract.md](./phase1-worker-contract.md)
+### Scripts
+- `acr-engine/scripts/run_planner_validation_commands_live.py`
+- `acr-engine/scripts/run_phase1_prereq_audit_live.py`
+- `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py`
+- `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py`
+- `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py`
+---
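+## One-sentence conclusion
+> When taking over, run the runner first, then read the 5 core docs; the primary problem right now is the environment prerequisites, not the schema/contract itself.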
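Sections 1 and 6 of `start-here.md` name two environment blockers: a missing `/workspace/downloads` and missing `torch / torchaudio / transformers / speechbrain`. A quick local check for both might look like the sketch below. It is not one of the commit's scripts and does not replace `run_phase1_prereq_audit_live.py`; the function name `missing_prereqs` and the wording of the returned actions are assumptions made for illustration.

```python
import importlib.util
import os

# Path and module names are taken from start-here.md; the check is illustrative.
DEFAULT_DOWNLOADS_ROOT = "/workspace/downloads"
SEMANTIC_RUNTIME_MODULES = ("torch", "torchaudio", "transformers", "speechbrain")


def missing_prereqs(downloads_root=DEFAULT_DOWNLOADS_ROOT,
                    modules=SEMANTIC_RUNTIME_MODULES):
    """Return a list of suggested actions, one per missing prerequisite group."""
    actions = []
    # The downloads root must exist as a directory on the host.
    if not os.path.isdir(downloads_root):
        actions.append(f"mount {downloads_root}")
    # find_spec returns None when a top-level module cannot be located,
    # so nothing is actually imported here.
    absent = [name for name in modules if importlib.util.find_spec(name) is None]
    if absent:
        actions.append("install " + " / ".join(absent))
    return actions
```

Calling `missing_prereqs()` with no arguments checks the host defaults; an empty list would suggest that Route A is already satisfied and the runner is worth rerunning.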