Reduce ACR docs to the current song-centric storage design

Constraint: Keep only documentation that directly serves the current Phase-1 song-centric + fused-table storage and retrieval design.
Rejected: Preserve broad historical, dataset, business-export, and template docs in the main docs root | They increase handoff cost and blur the active design surface.
Confidence: high
Scope-risk: moderate
Directive: Treat postgresql-data-model.md as the single source of truth for where slices, models, and features are stored until a concrete fused DDL supersedes it.
Tested: git diff --check on touched docs; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files; final docs root reduced to 12 files
Not-tested: external markdown renderers and downstream readers that may still expect removed auxiliary docs

Commit ac2e6730b7e695d7647162accae65e05cb2a5379, authored 2026-06-04 14:37:22 +0800 by cnb.bofCdSsphPA
Showing 25 changed files with 181 additions and 1852 deletions
docs/CHANGELOG.md
docs/README.md
docs/acr-architecture.md
docs/benchmark-report-template.md
docs/business-export-cookbook.md
docs/business-manifest-and-type-role-spec.md
docs/business-music-bucket-and-type-guide.md
docs/business-project-manifest-adapter.md
docs/current-capability-map.md
docs/dataset-sources-and-licensing.md
docs/dataset-spec.md
docs/industrial-benchmark-spec.md
docs/industrialization-roadmap.md
docs/model-card-template.md
docs/open-dataset-workflow.md
docs/postgresql-data-model.md
docs/production-encoder-freeze-and-embedding-strategy.md
docs/project-responsibility-map.md
docs/references-and-sources.md
docs/release-checklist.md
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
 ## 2026-06-04
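+- Added a table and flowchart to `docs/postgresql-data-model.md` showing which table stores slice data, models, and features, and fixed the current default trace-back chain as `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`.
+- Narrowed `docs/README.md` to an entry point for the current song-centric design, and removed from the docs directory the template, open-dataset, business-export, and historical-roadmap docs unrelated to the current design.
 - Consolidated the documentation entry path and added `docs/start-here.md`; the onboarding path for newcomers is now `README -> start-here -> session-handoff`.
 - Rewrote `docs/README.md`, reorganizing navigation by onboarding / design / implementation / operations / roles to lower the cost of a first read.
 - Restructured `docs/session-handoff.md` so that the latest Phase-1 runner, stable conclusions, blockers, and next actions live on a single page.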
@@ -12,6 +15,8 @@
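 - Following the new constraint "fuse where possible and relate via multiple types", added a fusion-first modeling view to `docs/postgresql-data-model.md`: the Phase-1 physical implementation is recommended to live in four generic tables, `media_entity`, `audio_object`, `feature_fact`, and `set_membership`, while keeping the logical layering `song/recording/asset/window/feature`.
+- Following the new constraint "versions do not matter for now; multiple audio files only need to map stably to the same song_id", further narrowed the default Phase-1 wording in `docs/postgresql-data-model.md`, `docs/README.md`, and `docs/start-here.md` to `song -> asset -> window -> feature`; `recording` becomes a future extension layer rather than a mandatory primary layer.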
 ## 2026-06-04
 - Updated the top of `docs/README.md` to the same "shortest startup path" as `session-handoff`, and reran `run_planner_validation_commands_live.py` with that entry command, confirming that fresh results are still `executed_count=4` and `all_passed=true`.
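The trace-back chain named in the changelog entry above, `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`, can be sketched in memory as follows. This is only an illustration: the field names (`object_id`, `parent_id`, `kind`, `entity_id`) and the sample IDs are assumptions, not the schema defined in `postgresql-data-model.md`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioObject:
    object_id: str
    kind: str                  # "asset" or "window" (assumed type discriminator)
    parent_id: Optional[str]   # a window points at its asset; an asset has no parent object
    entity_id: Optional[str]   # an asset points at its media_entity (song)


@dataclass(frozen=True)
class FeatureFact:
    fact_id: str
    object_id: str             # the window this feature was computed from
    feature_type: str          # "fingerprint" or "embedding"


def trace_to_song(fact: FeatureFact, objects: dict[str, AudioObject]) -> str:
    """Walk feature_fact -> window -> asset -> song and return the song id."""
    window = objects[fact.object_id]
    if window.kind != "window" or window.parent_id is None:
        raise ValueError(f"{fact.fact_id} is not attached to a window")
    asset = objects[window.parent_id]
    if asset.kind != "asset" or asset.entity_id is None:
        raise ValueError(f"{window.object_id} is not attached to an asset")
    return asset.entity_id


# Hypothetical rows: two assets (audio files) of the same song, one window each.
objects = {
    "asset-1": AudioObject("asset-1", "asset", None, "song-42"),
    "asset-2": AudioObject("asset-2", "asset", None, "song-42"),
    "win-1": AudioObject("win-1", "window", "asset-1", None),
    "win-2": AudioObject("win-2", "window", "asset-2", None),
}
facts = [
    FeatureFact("f-1", "win-1", "fingerprint"),
    FeatureFact("f-2", "win-2", "embedding"),
]
resolved = [trace_to_song(f, objects) for f in facts]
print(resolved)
```

Both features resolve to the same song even though they come from different audio files, which is the song-centric behavior the commit describes.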
--- a/docs/README.md
+++ b/docs/README.md
 # ACR Docs Overview
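-> Main entry point for the music ACR docs, covering copyright protection, song identification, and version attribution.
+> Only docs directly related to the **song-centric + fusion-first** ACR design are kept here.
 ---
 ## 0. What newcomers should do first
-### Run it first; do not start by reading a pile of docs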
 ```bash
 cd /workspace/acr-engine
 /usr/local/miniconda3/bin/python scripts/run_planner_validation_commands_live.py \
@@ -21,105 +19,53 @@ cd /workspace/acr-engine
 - `executed_count = 4`
 - `all_passed = true`
-### Then follow this reading path
-1. [start-here.md](./start-here.md)
-2. [session-handoff.md](./session-handoff.md)
-3. [acr-architecture.md](./acr-architecture.md)
-4. [postgresql-data-model.md](./postgresql-data-model.md)
-5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md)
 ---
-## 1. Documentation map
-### A. Taking over the project / restoring context
- [start-here.md](./start-here.md): 10-minute onboarding entry for newcomers
- [session-handoff.md](./session-handoff.md): current status, blockers, next steps
- [CHANGELOG.md](./CHANGELOG.md): change log
-### B. System design / main design line
- [acr-architecture.md](./acr-architecture.md): overall architecture and layering
- [sota-evolution-guide.md](./sota-evolution-guide.md): SOTA evolution path
- [postgresql-data-model.md](./postgresql-data-model.md): PostgreSQL master-data / feature model
- [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy
-### C. How to land the first phase
- [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 execution checklist
- [postgresql-data-model.md](./postgresql-data-model.md): includes the minimal Phase-1 schema and the fusion-first view
- [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization
- [phase1-worker-contract.md](./phase1-worker-contract.md): contract for worker, job, and failure semantics
- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples
-### D. Operations / services / data governance
- [runbook.md](./runbook.md): operations runbook
- [service-api.md](./service-api.md): service API
- [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md): training and vector retrieval notes
- [open-dataset-workflow.md](./open-dataset-workflow.md): open dataset onboarding workflow
+## 1. Current default design
+By default, Phase-1 is currently understood as follows:
+```text
+song -> asset -> window -> fingerprint / embedding
+```
+Corresponding fusion-first physical tables:
+```text
+media_entity -> audio_object -> feature_fact -> set_membership
+```
 ---
-## 2. Reading by role
+## 2. Required reading
-### Product / business / copyright strategy
 1. [start-here.md](./start-here.md)
-2. [acr-architecture.md](./acr-architecture.md)
-3. [project-responsibility-map.md](./project-responsibility-map.md)
-4. [business-export-cookbook.md](./business-export-cookbook.md)
-### Data / platform / PostgreSQL
-1. [postgresql-data-model.md](./postgresql-data-model.md)
-2. [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)
-3. [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md)
-4. [runbook.md](./runbook.md)
-### Algorithms / retrieval / models
-1. [sota-evolution-guide.md](./sota-evolution-guide.md)
-2. [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md)
-3. [phase1-worker-contract.md](./phase1-worker-contract.md)
-4. [sota-research-2026.md](./sota-research-2026.md)
-### Development / implementation / delivery
-1. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md)
 2. [session-handoff.md](./session-handoff.md)
-3. [CHANGELOG.md](./CHANGELOG.md)
+3. [acr-architecture.md](./acr-architecture.md)
-4. [release-checklist.md](./release-checklist.md)
+4. [postgresql-data-model.md](./postgresql-data-model.md)
+5. [phase1-implementation-checklist.md](./phase1-implementation-checklist.md)
 ---
-## 3. Most important stable conclusions
- The target scenario is not general song recommendation but **copyright protection / song identification / version attribution**.
- Phase-1 takes the **encoder-only** route first, without fine-tuning the backbone.
- exact lane: `Chromaprint`.
- semantic baseline: `MERT-v1-95M`.
- semantic challenger: `MuQ`.
- `ECAPA` is kept as a historical baseline and is no longer the long-term primary backbone.
- The full evolvable main chain is:
-```text
-canonical_song -> work -> recording -> recording_asset -> audio_window
-```
- For the minimal Phase-1 skeleton alone, read it as:
-```text
-song -> recording -> asset -> window -> fingerprint / embedding
-```
- The model/feature main chain is fixed as:
-```text
-model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> retrieval_index_registry
-```
+## 3. Implementation docs
+- [postgresql-data-model.md](./postgresql-data-model.md): the only default data model for now; includes the table-placement notes for slices/models/features and a flowchart
+- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples
+- [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization
+- [phase1-worker-contract.md](./phase1-worker-contract.md): contract for worker, job, and failure semantics
+- [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 implementation checklist
+- [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md): encoder-only freeze strategy
+- [sota-evolution-guide.md](./sota-evolution-guide.md): current main line of SOTA evolution
 ---
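-## 4. Directions not worth spending time on right now
- Do not fall back to a flat structure keyed only by a single `song_id`.
- Do not store embeddings as fixed columns (such as `mert_embedding` / `muq_embedding`).
- Do not discuss retraining the backbone during Phase-1.
- Do not mistake the current blocker for a PostgreSQL schema problem; the main blockers are audio mounting and runtime dependencies.
+## 4. Current stable conclusions
+- The final attribution object only needs to return a stable `song_id` for now
+- One `song` may have multiple audio files
+- `recording/version` is not a required return object for now
+- `window` is kept because it is the smallest unit for evidence, offset, and retrieval
+- `feature_fact` stores both `fingerprint` and `embedding`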
 ---
@@ -129,19 +75,4 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint ->
 /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs
 ```
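-Purpose: after cleaning up or reorganizing docs, quickly find broken relative links under `docs/`. Historical archive docs such as `CHANGELOG.md` are skipped by default.
+Historical archive docs such as `CHANGELOG.md` are skipped by default.
---
-## 6. Supplementary docs, not recommended as a first entry point
-The following docs are kept as topical supplements; newcomers should not read them in the first pass: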
- [dataset-spec.md](./dataset-spec.md)
- [dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md)
- [references-and-sources.md](./references-and-sources.md)
- [current-capability-map.md](./current-capability-map.md)
- [industrialization-roadmap.md](./industrialization-roadmap.md)
- [industrial-benchmark-spec.md](./industrial-benchmark-spec.md)
- [benchmark-report-template.md](./benchmark-report-template.md)
- [model-card-template.md](./model-card-template.md)
- [report-layout.md](./report-layout.md)
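The link check referenced in this README (`scripts/check_markdown_links.py --root docs`) is not shown in the diff. The sketch below illustrates what such a relative-link check might look like; `find_broken_links` and its `skip` default are assumptions made for illustration and do not describe the repository script.

```python
import re
import tempfile
from pathlib import Path

# Matches the target of inline Markdown links such as [text](./file.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(
    root: Path, skip: tuple[str, ...] = ("CHANGELOG.md",)
) -> list[tuple[str, str]]:
    """Return (file name, link target) pairs whose relative target does not exist."""
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        if md_file.name in skip:
            continue  # historical archive docs are skipped, as the README states
        for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "#", "mailto:")):
                continue  # only relative file links are checked
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((md_file.name, target))
    return broken


# Small self-check on a throwaway directory with hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "start-here.md").write_text("# Start\n", encoding="utf-8")
    (docs / "README.md").write_text(
        "[ok](./start-here.md) [gone](./dataset-spec.md)\n", encoding="utf-8"
    )
    (docs / "CHANGELOG.md").write_text("[old](./removed.md)\n", encoding="utf-8")
    result = find_broken_links(docs)
    print(result)
```

A check of this kind is what makes a large deletion like this commit safe to land: any surviving doc that still links to a removed file is reported, while the changelog is allowed to keep historical references.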
--- a/docs/acr-architecture.md
+++ b/docs/acr-architecture.md
@@ -79,21 +79,22 @@ flowchart TD
 then Phase-1 can proceed with the minimal skeleton below:
 ```text
-song -> recording -> asset -> window -> fingerprint / embedding
+song -> asset -> window -> fingerprint / embedding
 ```
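 Why these are kept:
- `recording` cannot be removed: one song can have multiple versions
 - `window` cannot be removed: it is the smallest unit for offset, evidence, and multi-segment voting
- `feature_set_registry` cannot be removed: otherwise a future switch between MERT and MuQ would hard-code the schema
+- `feature_set_registry` / `feature_fact` cannot be removed: otherwise a future switch between MERT and MuQ would hard-code the schema
+- `asset` cannot be removed: one `song` has multiple real audio files
 Can be deferred: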
+- `recording`
 - `work`
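 - a heavier `retrieval_index_registry`
 - finer-grained end-to-end audit tables
 So the recommended position is not "cut every layer" but:
-> **Phase-1 ships the minimal usable layers first; layers for version attribution, cover, and work governance are added later.**
+> **Phase-1 ships the song-centric minimal usable layers first; layers for version attribution, cover, and work governance are added later.**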
 ---
@@ -125,7 +126,6 @@ song -> recording -> asset -> window -> fingerprint / embedding
 Must read:
 - this document
 - [postgresql-data-model.md](./postgresql-data-model.md)
- [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
 ---
@@ -139,8 +139,8 @@ song -> recording -> asset -> window -> fingerprint / embedding
 Must read:
 - this document
- [service-api.md](./service-api.md)
 - [postgresql-data-model.md](./postgresql-data-model.md)
+- [phase1-worker-contract.md](./phase1-worker-contract.md)
 ---
@@ -154,8 +154,8 @@ song -> recording -> asset -> window -> fingerprint / embedding
 Must read:
 - [sota-evolution-guide.md](./sota-evolution-guide.md)
- [sota-research-2026.md](./sota-research-2026.md)
 - [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md)
+- [postgresql-data-model.md](./postgresql-data-model.md)
 ---
@@ -235,4 +235,4 @@ flowchart LR
 If you are:
 - **Architecture lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) next
 - **Data/backend lead**: read [postgresql-data-model.md](./postgresql-data-model.md) next
- **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then return to [sota-research-2026.md](./sota-research-2026.md)
+- **Model lead**: read [sota-evolution-guide.md](./sota-evolution-guide.md) first, then [production-encoder-freeze-and-embedding-strategy.md](./production-encoder-freeze-and-embedding-strategy.md)
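The architecture diff keeps `window` because it is the smallest unit for offset, evidence, and multi-segment voting. A minimal sketch of that voting step follows, assuming each query window has already been matched to one reference window with a similarity score. The `WindowHit` fields and the score-sum rule are illustrative assumptions, not the project's scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowHit:
    song_id: str
    asset_id: str
    offset_sec: float   # offset of the matched reference window inside its asset
    score: float        # similarity of the query window to the reference window


def vote_song(hits: list[WindowHit]) -> tuple[str, list[WindowHit]]:
    """Sum window scores per song; return the winning song and its evidence windows."""
    if not hits:
        raise ValueError("no window hits to vote on")
    totals: dict[str, float] = defaultdict(float)
    for hit in hits:
        totals[hit.song_id] += hit.score
    best_song = max(totals, key=totals.get)
    evidence = sorted(
        (h for h in hits if h.song_id == best_song), key=lambda h: h.offset_sec
    )
    return best_song, evidence


# Hypothetical hits: two windows from different assets of song-42, one from song-7.
hits = [
    WindowHit("song-42", "asset-1", 30.0, 0.91),
    WindowHit("song-42", "asset-2", 10.0, 0.88),
    WindowHit("song-7", "asset-9", 5.0, 0.95),
]
best_song, evidence = vote_song(hits)
print(best_song, [(h.asset_id, h.offset_sec) for h in evidence])
```

Votes from different assets accumulate on the same `song_id`, so the single strongest window (here from `song-7`) does not win on its own. The evidence list keeps the asset and offset of every supporting window.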
deleted file mode 100644
--- a/docs/benchmark-report-template.md
+++ /dev/null
-# Benchmark Report Template
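-> Used for the evaluation output of each model version
-## One-page summary
- Model version:
- Data version:
- Key conclusions:
- Passes release gate:
-## 1. Evaluation scope diagram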
-```mermaid
-flowchart LR
-    A[Model Version] --> B[Datasets]
-    A --> C[Scenario Buckets]
-    A --> D[Latency / Ops]
-```
-## 2. Metrics table
-| Bucket | top1 | top5 | MRR | FAR | Notes |
-|---|---:|---:|---:|---:|---|
-| clean |  |  |  |  |  |
-| humming_like |  |  |  |  |  |
-| confused |  |  |  |  |  |
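-## 3. Written analysis
- Strongest area:
- Weakest area:
- Comparison with the previous version:
-## 4. Detailed appendix
- Evaluation commands
- Data inventory
- Raw JSON report path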
-## Sources
- `docs/industrial-benchmark-spec.md`
deleted file mode 100644
--- a/docs/business-export-cookbook.md
+++ /dev/null
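-# Business Export Cookbook
-> Updated: 2026-06-02  
-> Related docs: [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md) · [Business Music Bucket and Type Guide](./business-music-bucket-and-type-guide.md)
-## One-page summary
-If the next session needs to export real training/evaluation lists from your business tables, follow this order:
-1. Export the basic audio-asset fields via SQL
-2. Fill in `role` / `bucket` using the `type-role mapping`
-3. Write a CSV or JSONL intermediate file
-4. Convert it into the project manifest
-5. Or use the repository scripts to convert directly to manifest-ready JSONL first
-The repository already contains the following reference files: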
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
- [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
- [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
---
-## 1. Recommended SQL export fields
-```sql
-SELECT
-  s.id              AS song_id,
-  a.id              AS asset_id,
-  a.type            AS type,
-  a.file_path       AS audio_path,
-  s.title           AS title,
-  s.artist_name     AS artist,
-  s.album_id        AS album_id,
-  a.duration_sec    AS duration_sec,
-  a.sample_rate     AS sample_rate,
-  a.bitrate         AS bitrate,
-  a.license_code    AS license,
-  a.created_at      AS created_at
-FROM music_asset a
-JOIN song s ON s.id = a.song_id
-WHERE a.type IN (1,7,8,9,10,11,16,18,2,12);
-```
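-Notes:
- This SQL is not mandatory; it is only a field-mapping sample.
- What matters is not the table names but collecting every field the manifest spec requires.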
---
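-## 2. Fields to fill in after export
-| Field | Source | Notes |
-|---|---|---|
-| `role` | `business_type_role_mapping.json` | mapped from `type` |
-| `bucket` | `business_type_role_mapping.json` | default business bucket |
-| `split` | export script or post-processing | `train/val/test/holdout` |
-| `source_dataset` | fixed value | e.g. `internal_catalog` |
-| `offset_sec` | optional for clip-type assets | set to `0` for non-clips |
---
-## 3. Recommended intermediate formats
-### CSV
-Suitable for:
- business colleagues exporting data first
- checking in Excel / spreadsheet tools
-Sample: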
- [../acr-engine/configs/manifests/examples/business_asset_export_example.csv](../acr-engine/configs/manifests/examples/business_asset_export_example.csv)
-### JSONL
-Suitable for:
- streaming processing in scripts
- direct conversion to manifest later
-Sample:
- [../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl](../acr-engine/configs/manifests/examples/business_asset_export_example.jsonl)
---
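-## 4. Suggested post-processing rules
-1. `type=10/11` defaults to `reference`
-2. `type=1/9` defaults to compressed-domain `reference`
-3. `type=7/8/16` defaults to `query`
-4. `type=18/2/12` defaults to `excluded` for now
-5. Non-audio assets are filtered out directly
---
-## 5. Most direct actions for the next session
-1. Export real data once from the business database using the SQL sample
-2. Save it as CSV or JSONL
-3. Fill in `role` / `bucket` with the repository's mapping rules
-4. Convert it into the manifest the project needs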
-## Sources
- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
-## 6. Lightweight normalization script
-The repository already includes a directly runnable conversion script:
- [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
-Example:
-```bash
-cd /workspace/acr-engine
-/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
-  --input configs/manifests/examples/business_asset_export_example.csv \
-  --output /tmp/business_asset_manifest_ready.jsonl
-```
-The script will:
-1. Read the CSV or JSONL export
-2. Apply `business_type_role_mapping.json`
-3. Fill in default values for `role / bucket / source_dataset / split`
-4. Output manifest-ready JSONL
-## 7. Splitting into role lists
-If you already have manifest-ready JSONL, you can continue with:
- [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
-Example:
-```bash
-cd /workspace/acr-engine
-/usr/local/miniconda3/bin/python scripts/split_business_manifest_ready.py \
-  --input /tmp/business_asset_manifest_ready.jsonl \
-  --output-dir /tmp/business_asset_manifest_split
-```
-It outputs:
- `reference.json`
- `query.json`
- `excluded.json`
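-This lets the next session shape business assets into the lists needed for training/evaluation more quickly.
-## 8. Generating project manifests
-If you already have manifest-ready JSONL, you can go on to generate the four manifests the project currently needs: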
- [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
- [business-project-manifest-adapter.md](./business-project-manifest-adapter.md)
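The post-processing rules in section 4 of the removed cookbook (types 10/11 and 1/9 to `reference`, 7/8/16 to `query`, 18/2/12 to `excluded`, non-audio types dropped) can be sketched as a small mapping step. This is an illustration under assumptions: `assign_roles` and the sample CSV rows are hypothetical, and the code is not the repository's `normalize_business_export.py`.

```python
import csv
import io

# Default role per audio type, taken from the post-processing rules in the cookbook.
AUDIO_TYPE_TO_ROLE = {
    10: "reference", 11: "reference",          # lossless
    1: "reference", 9: "reference",            # compressed-domain reference
    7: "query", 8: "query", 16: "query",       # short-video / chorus clips
    18: "excluded", 2: "excluded", 12: "excluded",
}


def assign_roles(rows: list[dict]) -> list[dict]:
    """Attach a default role to each audio row and drop non-audio rows."""
    normalized = []
    for row in rows:
        role = AUDIO_TYPE_TO_ROLE.get(int(row["type"]))
        if role is None:
            continue  # any type outside the mapping is treated as non-audio
        normalized.append({**row, "role": role})
    return normalized


# Hypothetical export rows; the column names follow the SQL sample in the cookbook.
sample_csv = """song_id,asset_id,type,audio_path
s1,a1,11,/data/s1_original.flac
s1,a2,7,/data/s1_clip.mp3
s2,a3,18,/data/s2_demo.wav
s2,a4,3,/data/s2_lyrics.lrc
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
normalized = assign_roles(rows)
print([(r["asset_id"], r["role"]) for r in normalized])
```

Keeping the mapping in one dictionary mirrors the cookbook's intent that the table names do not matter as long as every exported row ends up with the manifest fields filled in.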
deleted file mode 100644
--- a/docs/business-manifest-and-type-role-spec.md
+++ /dev/null
-# Business Manifest and Type-Role Spec
-> Updated: 2026-06-02  
-> Related docs: [Business Music Bucket and Type Guide](./business-music-bucket-and-type-guide.md) · [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md)
-## One-page summary
-The repository now has two directly reusable business onboarding templates:
- [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
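-They answer two questions:
-1. Which manifest fields the business-table fields must map to at minimum.
-2. Which of `reference / query / excluded` each of your `type` values should default to.
---
-## 1. Mapping diagram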
-```mermaid
-flowchart LR
-    A[Business table records] --> B[type-role mapping]
-    B --> C[reference]
-    B --> D[query]
-    B --> E[excluded]
-    C --> F[manifest rows]
-    D --> F
-    F --> G[train / build-index / evaluate]
-```
---
-## 2. Minimal manifest fields
-| Field | Required | Notes |
-|---|---|---|
-| `song_id` | Yes | primary song ID |
-| `asset_id` | Yes | specific asset ID |
-| `type` | Yes | your existing asset type |
-| `role` | Yes | `reference` / `query` / `excluded` |
-| `split` | Yes | `train` / `val` / `test` / `holdout` |
-| `audio_path` | Yes | accessible audio path |
-| `source_dataset` | Yes | source identifier |
-| `bucket` | No | bucketed evaluation label |
-| `offset_sec` | No | query start |
-| `duration_sec` | No | clip length |
---
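-## 3. Default type-role rules
-| type | Default role | Default bucket | Notes |
-|---:|---|---|---|
-| `10` / `11` | `reference` | `lossless_reference_core` | lossless primary library |
-| `9` / `1` | `reference` | `compressed_reference_realworld` | compressed real-world distribution |
-| `8` / `7` / `16` | `query` | `short_video_hook` | short-video / chorus entry |
-| `18` | `excluded` | `demo_variation_pool` | manual screening first |
-| `2` / `12` | `excluded` | `with_harmony_shift` | dedicated bucket first |
-| other non-audio types | `excluded` | `non_audio` | not used for modeling |
---
-## 4. Export principles
-1. **Even for the same song, do not merge reference and query into a single asset record.**
-2. **If same song and same version cannot be confirmed, default to `excluded`.**
-3. **Do not automatically merge `type=18` demos into train; review them manually first.**
-4. **Export short-video clips as `query` first; do not use them directly as reference.**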
---
-## 5. Templates and scripts
- Manifest template:
-  - [../acr-engine/configs/manifests/business_asset_manifest_template.json](../acr-engine/configs/manifests/business_asset_manifest_template.json)
- Type-role template:
-  - [../acr-engine/configs/manifests/business_type_role_mapping.json](../acr-engine/configs/manifests/business_type_role_mapping.json)
- Print script:
-  - [../acr-engine/scripts/print_business_type_mapping.py](../acr-engine/scripts/print_business_type_mapping.py)
- Normalization script:
-  - [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
- Role-splitting script:
-  - [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
-Example command:
-```bash
-cd /workspace/acr-engine
-/usr/local/miniconda3/bin/python scripts/print_business_type_mapping.py
-```
---
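-## 6. Direct actions for the next session
-1. Map the table fields to manifest rows per this spec.
-2. Use `business_type_role_mapping.json` to assign each asset a default `role` / `bucket`.
-3. Export the `reference` and `query` lists first, then move on to training and the bucket benchmark.
-## Further reading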
- [business-export-cookbook.md](./business-export-cookbook.md)
-## Sources
- See [business-music-bucket-and-type-guide.md](./business-music-bucket-and-type-guide.md)
- See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
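The minimal manifest fields listed in section 2 of the removed spec lend themselves to a simple row validator. The sketch below assumes manifest rows are plain dictionaries; `validate_row` and the sample rows are hypothetical and are not part of the repository.

```python
# Required fields and allowed values are taken from the "Minimal manifest fields" table.
REQUIRED_FIELDS = (
    "song_id", "asset_id", "type", "role", "split", "audio_path", "source_dataset",
)
ALLOWED_ROLES = {"reference", "query", "excluded"}
ALLOWED_SPLITS = {"train", "val", "test", "holdout"}


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row is valid."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if row.get(name) in (None, "")
    ]
    role = row.get("role")
    if role not in (None, "") and role not in ALLOWED_ROLES:
        errors.append(f"invalid role: {role}")
    split = row.get("split")
    if split not in (None, "") and split not in ALLOWED_SPLITS:
        errors.append(f"invalid split: {split}")
    return errors


# Hypothetical rows: one valid, one with a missing path and an unknown role.
good_row = {
    "song_id": "s1", "asset_id": "a1", "type": 11, "role": "reference",
    "split": "train", "audio_path": "/data/s1.flac", "source_dataset": "internal_catalog",
}
bad_row = {**good_row, "audio_path": "", "role": "ref"}

good_errors = validate_row(good_row)
bad_errors = validate_row(bad_row)
print(good_errors, bad_errors)
```

Optional fields (`bucket`, `offset_sec`, `duration_sec`) are deliberately not checked, matching the "No" entries in the table.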
deleted file mode 100644
--- a/docs/business-music-bucket-and-type-guide.md
+++ /dev/null
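-# Business Music Bucket and Type Guide
-> Updated: 2026-06-02  
-> Related docs: [Training Data and pgvector Guide](./training-data-and-pgvector-guide.md) · [Open Dataset Workflow](./open-dataset-workflow.md) · [Industrial Benchmark Spec](./industrial-benchmark-spec.md)
-## One-page summary
-For your existing asset `type` field, **do not mix every file into training**.  
-A three-layer split is recommended instead: primary reference assets, derived query assets, and hard-case evaluation assets.
-### Types most recommended for training / index building
-| Priority | type | Meaning | Training use |
-|---|---:|---|---|
-| High | `10` | accompaniment without harmony, lossless | cleanest reference candidate |
-| High | `11` | original track, lossless | primary reference / primary training asset |
-| High | `9` | accompaniment without harmony, compressed | reference supplement / compressed-domain adaptation |
-| High | `1` | original track, compressed | training-domain supplement / real online distribution |
-| Medium | `18` | audio demo | usable as a weakly supervised supplement after manual screening |
-| Medium | `8` | clip (chorus) | usable for repeated-section / highly distinctive queries |
-| Medium | `7` | Douyin clip | usable for short-video-domain query evaluation |
-| Medium | `16` | Kuaishou clip | usable for short-video-domain query evaluation |
-### Types that usually do not take part in main training directly
-| type | Meaning | Reason |
-|---:|---|---|
-| `2` / `12` | accompaniment with harmony | tends to introduce extra "same song, different vocal layer" variation; better suited to separate later experiments |
-| `3` / `13` / `20` | lyrics / LRC / translated scrolling lyrics | non-audio asset |
-| `4` / `14` / `19` | cover / PSD / sheet-music image | non-audio asset |
-| `5` | license letter | compliance document, not used for modeling |
-| `6` | album info | metadata, not used for modeling |
-| `17` | lyrics-and-score archive | must be unpacked first; should not be used for modeling directly |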
---
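-## 1. Business asset responsibility diagram
-```mermaid
-flowchart LR
-    A[Lossless primary assets\n10 / 11] --> B[Primary reference library]
-    C[Compressed primary assets\n1 / 9] --> D[Training-domain augmentation]
-    E[Short-video clips\n7 / 16 / 8] --> F[Query evaluation set]
-    G[Recordings/demos\n18] --> H[Weakly supervised supplement pool]
-    B --> I[Training / index building]
-    D --> I
-    F --> J[Short-clip evaluation / hard-case]
-    H --> K[Enter I or J after manual screening]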
-```
---
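-## 2. How to use your types
-## 2.1 Recommended for main training / main index building
-### A. First priority: `10` + `11`
-Reasons:
- best audio quality
- most stable label semantics
- best suited as the ground-truth reference
-Recommended uses:
- `reference`
- primary training asset
- primary pgvector embedding table
-### B. Second priority: `9` + `1`
-Reasons:
- closer to the real online compression distribution
- improves the model's robustness to codec degradation
-Recommended uses:
- training supplement
- clean/compressed queries during evaluation
- reference-domain expansion
-### C. Third priority: `8` / `7` / `16`
-Reasons:
- closer to the real recognition entry points
- helps short-clip / chorus / short-video-domain evaluation
-Recommended uses:
- query evaluation set
- repeated-section-rich bucket
- short-video bucket
-### D. Use with caution: `18`
-Reasons:
- `demo` files vary widely in mixing, arrangement, and completeness
- it is easy to mix samples that are not the same final version under one label
-Recommended uses:
- put them in a manual screening pool first
- include them in training or hard-case only after confirming the same song and main melody as the official version
---
-## 2.2 Types not recommended for main training at the start
-### `2` / `12` accompaniment with harmony
-Risks:
- extra vocal/harmony interference appears under the same `song_id`
- if the current system goal is music ACR / BGM recognition, these assets are better used later as a domain-robustness control
-Suggestions:
- put them in a separate `with_harmony_accompaniment` bucket first
- do not mix them into training with `10`/`9` at the start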
---
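-## 3. Suggested training/evaluation layering
-```mermaid
-flowchart TD
-    A[Primary reference library] --> A1[10 / 11]
-    B[Training supplement] --> B1[1 / 9]
-    C[Short-clip evaluation] --> C1[7 / 16 / 8]
-    D[Special controls] --> D1[2 / 12 / 18]
-    E[Non-audio metadata] --> E1[3 / 4 / 5 / 6 / 13 / 14 / 17 / 19 / 20]
-```
-### Recommended first-version strategy
-| Layer | Recommended types |
-|---|---|
-| primary reference library | `10`, `11` |
-| training supplement | `1`, `9` |
-| query evaluation | `7`, `8`, `16` |
-| may be added after manual screening | `18` |
-| later dedicated robustness experiments | `2`, `12` |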
---
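-## 4. Suggested business-semantic buckets
-## 4.1 The first buckets most worth building
-| Bucket name | Recommended source types | Purpose |
-|---|---|---|
-| `lossless_reference_core` | `10`, `11` | cleanest ground-truth library |
-| `compressed_reference_realworld` | `1`, `9` | online compressed domain |
-| `short_video_hook` | `7`, `16`, `8` | short-video / chorus recognition |
-| `with_harmony_shift` | `2`, `12` | interference from accompaniment with harmony |
-| `demo_variation_pool` | `18` | risk of demo-vs-official differences |
-| `hard_negative_confusable` | hand-picked | similar style, arrangement, or melody |
---
-## 4.2 Why this fits the business better than generic semantic buckets
-Because your data is not a purely academic dataset; it **carries asset-level business semantics**:
- primary assets / compressed versions / lossless versions
- short-video clips
- chorus clips
- accompaniment with and without harmony
-So the first thing to build is not abstract genre buckets but:
-1. **version-form buckets**
-2. **entry-scenario buckets**
-3. **confusion-risk buckets**
---
-## 5. Recommended config templates
-Companion templates: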
- [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json)
- [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json)
-Of these:
- `fma_semantic_bucket_template.json` leans toward general methodology
- `business_type_bucket_template.json` leans toward your existing business asset forms
---
-## 6. Working with pgvector
-If this later lands in pgvector, keep at least these fields:
-| Field | Notes |
-|---|---|
-| `song_id` | primary song ID |
-| `asset_id` | specific asset ID |
-| `type` | your asset type |
-| `bucket` | current evaluation/training bucket |
-| `role` | `reference` / `query` |
-| `source_dataset` | source |
-| `offset_sec` | query start |
-| `duration_sec` | query length |
-| `embedding` | pgvector vector |
-This later allows:
- filtering by `type`
- reporting by `bucket`
- separating reference/query by `role`
- multi-source analysis by `source_dataset`
---
-## 7. Direct actions for the next session
-1. First select the initial usable types per this doc: `10`, `11`, `9`, `1`, `8`, `7`, `16`
-2. Then map them into:
-   - [../acr-engine/configs/buckets/business_type_bucket_template.json](../acr-engine/configs/buckets/business_type_bucket_template.json)
-3. Run the bucket benchmark
-4. Check whether `hybrid` / `high_energy` diverge across business buckets
-## Sources
- See [training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md)
- See [industrial-benchmark-spec.md](./industrial-benchmark-spec.md)
deleted file mode 100644
--- a/docs/business-project-manifest-adapter.md
+++ /dev/null
-# Business Project Manifest Adapter
-> Updated: 2026-06-02  
-> Related docs: [Business Export Cookbook](./business-export-cookbook.md) · [Business Manifest and Type-Role Spec](./business-manifest-and-type-role-spec.md)
-## One-page summary
-The repository now has an offline script chain that gets close to the project's training/evaluation manifests:
-1. Export CSV / JSONL from the business tables
-2. [../acr-engine/scripts/normalize_business_export.py](../acr-engine/scripts/normalize_business_export.py)
-3. [../acr-engine/scripts/split_business_manifest_ready.py](../acr-engine/scripts/split_business_manifest_ready.py)
-4. [../acr-engine/scripts/build_business_project_manifests.py](../acr-engine/scripts/build_business_project_manifests.py)
-The last step directly generates:
- `catalog.json`
- `train.json`
- `test.json`
- `val.json`
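-The format matches the project's existing manifest structure.
---
-## 1. Project format after alignment
-### `catalog.json`
- holds reference entries only
- fields: `song_id / audio_path / duration / type=reference / source_dataset`
-### `train.json` / `test.json`
- the first part holds queries
- the second part appends the reference entries
- query fields: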
-  - `song_id`
-  - `audio_path`
-  - `duration`
-  - `type=clean`
-  - `offset`
-  - `segment_type=external_query`
-  - `source_dataset`
-### `val.json`
- by default holds only queries with `split=val`
- optionally merges `holdout` into `val`
---
-## 2. Example commands
-```bash
-cd /workspace/acr-engine
-/usr/local/miniconda3/bin/python scripts/normalize_business_export.py \
-  --input configs/manifests/examples/business_asset_export_example.csv \
-  --output /tmp/business_asset_manifest_ready.jsonl
-/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
-  --input /tmp/business_asset_manifest_ready.jsonl \
-  --output-dir /tmp/business_project_manifests
-```
-To merge `holdout` into `val.json` first:
-```bash
-/usr/local/miniconda3/bin/python scripts/build_business_project_manifests.py \
-  --input /tmp/business_asset_manifest_ready.jsonl \
-  --output-dir /tmp/business_project_manifests \
-  --include-holdout-in-val
-```
---
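-## 3. Adapter boundaries
-This step is not yet the final production onboarding of real business data, but it is enough for the next session to:
- run the manifest structure end to end with real business export samples
- hook into `train.py / evaluate.py / run_demo.py`
- then make only small fixes to final field details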
-## Sources
- See [business-export-cookbook.md](./business-export-cookbook.md)
- See [business-manifest-and-type-role-spec.md](./business-manifest-and-type-role-spec.md)
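The manifest layout described in the removed adapter doc (catalog holds references only; train/test hold queries followed by references; val holds `split=val` queries, optionally with `holdout`) can be sketched as below. This is not the repository's `build_business_project_manifests.py`. The function name is hypothetical, and one point is an assumption the doc does not settle: here every reference is appended to both train and test, regardless of the reference's own split.

```python
def build_manifests(rows: list[dict], include_holdout_in_val: bool = False) -> dict:
    """Group manifest-ready rows into catalog / train / test / val lists."""
    references = [r for r in rows if r["role"] == "reference"]
    queries = [r for r in rows if r["role"] == "query"]   # excluded rows are dropped
    val_splits = {"val", "holdout"} if include_holdout_in_val else {"val"}

    def queries_in(split: str) -> list[dict]:
        return [q for q in queries if q["split"] == split]

    return {
        "catalog": references,
        # Assumption: all references are appended after the split's queries.
        "train": queries_in("train") + references,
        "test": queries_in("test") + references,
        "val": [q for q in queries if q["split"] in val_splits],
    }


# Hypothetical manifest-ready rows.
rows = [
    {"asset_id": "r1", "role": "reference", "split": "train"},
    {"asset_id": "q1", "role": "query", "split": "train"},
    {"asset_id": "q2", "role": "query", "split": "test"},
    {"asset_id": "q3", "role": "query", "split": "val"},
    {"asset_id": "q4", "role": "query", "split": "holdout"},
    {"asset_id": "x1", "role": "excluded", "split": "train"},
]


def ids(items: list[dict]) -> list[str]:
    return [item["asset_id"] for item in items]


default = build_manifests(rows)
with_holdout = build_manifests(rows, include_holdout_in_val=True)
print({name: ids(items) for name, items in default.items()})
print(ids(with_holdout["val"]))
```

The `include_holdout_in_val` parameter plays the role of the `--include-holdout-in-val` flag shown in the example commands.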
deleted file mode 100644
--- a/docs/current-capability-map.md
+++ /dev/null
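-# Current Capability Map
-> Updated: 2026-06-02
-## One-page summary
-The project currently has three classes of capability:
-1. **fully closed loop**
-2. **working end to end but still smoke-level**
-3. **still awaiting real data / larger-scale validation**
---
-## 1. Capability status table
-| Capability | Status | Notes |
-|---|---|---|
-| synthetic data generation | done | reliably generates synthetic training/evaluation data |
-| synthetic training | done | `train.py` runs end to end |
-| synthetic index building | done | `run_demo.py build-index` runs end to end |
-| synthetic evaluation | done | `evaluate.py` outputs JSON |
-| synthetic release artifacts | done | generates benchmark/model-card/checklist |
-| open-data inspect | done | `inspect-local` / `inspect-batch` |
-| open-data prepare | done | `prepare-local` |
-| open-data validate | done | `validate-local` |
-| open-data training smoke | done | verified on stand-in data |
-| open-data indexing smoke | done | verified on stand-in data |
-| open-data evaluation smoke | done | verified on stand-in data |
-| open-data release-artifact smoke | done | verified on stand-in data |
-| one-command smoke-local | done | inspect→prepare→validate→train→index→eval→artifacts |
-| real FMA local-directory smoke | awaiting external data | code is ready; real audio directory is missing |
-| real MTG-Jamendo local-directory smoke | awaiting external data | code is ready; real audio directory is missing |
-| hard-case accuracy tuning | in progress | confused / humming_like still need ongoing work |
-| foundation model baseline | not done | only documentation research and roadmap planning are complete |
-| industrial production deployment | not done | service skeleton exists; production governance is incomplete |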
---
-## 2. Shortest-path diagram
-```mermaid
-flowchart LR
-    A[Local Audio Dir] --> B[inspect-local]
-    B --> C[prepare-local]
-    C --> D[validate-local]
-    D --> E[train]
-    E --> F[build-index]
-    F --> G[evaluate]
-    G --> H[generate_artifacts]
-```
---
-## 3. Most reliable entry points right now
- [docs/open-dataset-workflow.md](./open-dataset-workflow.md)
- [docs/session-handoff.md](./session-handoff.md)
- [acr-engine/FIRST_RUN_CHECKLIST.md](../acr-engine/FIRST_RUN_CHECKLIST.md)
- [acr-engine/scripts/status_snapshot.py](../acr-engine/scripts/status_snapshot.py)
---
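-## 4. Most important gaps right now
-1. Real FMA local audio is not in place
-2. Real MTG-Jamendo local audio is not in place
-3. Hard-case performance on real data is unknown
-4. Implementation of the foundation model baseline has not started
-5. Services and deployment are still prototype-level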
---
-## Sources
- [session-handoff.md](./session-handoff.md)
- [open-dataset-workflow.md](./open-dataset-workflow.md)
- [CHANGELOG.md](./CHANGELOG.md)
deleted file mode 100644
--- a/docs/dataset-sources-and-licensing.md
+++ /dev/null
-# Dataset Sources and Licensing
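-> Updated: 2026-06-02
-## One-page summary
- The current priority has changed to: **making full use of open datasets for personal use**
- External dataset onboarding must now not only bootstrap but also split into real train/eval manifests
- Current recommended priority:
-  1. FMA
-  2. MTG-Jamendo
-  3. CCMusic (after approval/verification)
-  4. ModelScope music datasets (after whitelisting)
- Neither ModelScope nor CCMusic may enter commercial training by default at present
-Direct advice for personal use:
- FMA / MTG-Jamendo: prioritize converting into training and evaluation assets
- CCMusic / ModelScope: use primarily as supplementary evaluation or exploration sources
- Keep license annotations, but no longer treat commercial-use blocking as the main blocker for personal experiments
-Read first:
- [Open Dataset Workflow](./open-dataset-workflow.md)
-Suggested onboarding order:
-1. Download/prepare a local audio directory for FMA or MTG-Jamendo
-2. Run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `inspect-local` or `inspect-batch`
-3. Then run [acr-engine/src/data/external_adapters.py](../acr-engine/src/data/external_adapters.py) `prepare-local`
-4. Generate [catalog.json / train.json / test.json / val.json](../acr-engine/data/external_ingested/README.md)
-5. Use [train.json](../acr-engine/data/external_ingested/README.md) for training and [test.json](../acr-engine/data/external_ingested/README.md) for fixed evaluation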
---
-## 1. Source layering diagram
-```mermaid
-flowchart TD
-    A[Candidate Datasets] --> B[Open / MIR Baselines]
-    A --> C[Chinese / Regional Sources]
-    A --> D[Discovery Surfaces]
-    B --> B1[FMA]
-    B --> B2[MTG-Jamendo]
-    C --> C1[CCMusic]
-    D --> D1[ModelScope music datasets]
-```
---
-## 2. Data source table
-| Data source | Role | Risk | Current policy |
-|---|---|---|---|
-| FMA | first real baseline | per-track license must be verified | review_required |
-| MTG-Jamendo | retrieval/tagging corpus | CC terms must be verified | review_required |
-| CCMusic | Chinese MIR resource | may require an application / restrictions apply | review_required |
-| ModelScope music | data discovery entry | licenses are scattered | deny_until_whitelisted |
---
-## 3. Whitelist flowchart
-```mermaid
-flowchart LR
-    A[Discover dataset] --> B[Collect license / terms]
-    B --> C[Legal/compliance review]
-    C --> D{Commercial use allowed?}
-    D -- Yes --> E[Add to whitelist]
-    D -- No --> F[Barred from training]
-```
---
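-## 4. Written notes
-### 4.1 Why ModelScope can only be a discovery surface for now
-Datasets differ widely in origin and terms, so being hosted on ModelScope does not imply commercial usability.
-### 4.2 Why CCMusic needs separate treatment
-It is valuable for Chinese music tasks, but some subsets may involve applications, agreements, or non-standard commercial licensing boundaries.
-### 4.3 Why the license registry must be bound to model versions
-So that the following can be traced later:
- which data a given model actually used
- whether that data permits the corresponding commercial scenario
---
-## 5. Detailed appendix
-Entry links: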
- FMA: https://github.com/mdeff/fma
- MTG-Jamendo: https://github.com/MTG/mtg-jamendo-dataset
- CCMusic: https://ccmusic-database.github.io/en/database/ccm.html
- ModelScope search: https://modelscope.cn/search?page=1&search=music&type=dataset
-## Sources
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
-## Download / LFS governance
-### Preferred repository behavior
-```mermaid
-flowchart TD
-    A[Upstream dataset source] --> B[Local raw drop zone]
-    B --> C[Git LFS tracked large files]
-    C --> D[check-local-ready]
-    D --> E[prepare-local / smoke-local]
-```
-### Current repo policy
-| Item | Policy | Reason |
-|---|---|---|
-| `acr-engine/data/raw/**/*.zip` | Git LFS | avoid bloating normal git history |
-| `acr-engine/data/raw/**/*.wav` / `.mp3` / `.flac` / `.ogg` | Git LFS | allow local reproducibility without normal blob explosion |
-| FMA Small | acceptable as first real-data engineering baseline | easiest realistic open music smoke path |
-| MTG-Jamendo | default to research/eval lane | do not assume commercial-safe rights without subset-specific proof |
-### Operational note
-Even when a dataset is technically downloadable, this project should separate:
-- **engineering usability**
-- **benchmark suitability**
-- **commercial deployment suitability**
-These are not the same thing.
--- a/docs/dataset-spec.md deleted 100644 → 0
View file @4422297
+++ b/docs/dataset-spec.md deleted 100644 → 0
View file @4422297
--- a/docs/industrial-benchmark-spec.md deleted 100644 → 0
View file @4422297
+++ b/docs/industrial-benchmark-spec.md deleted 100644 → 0
View file @4422297
-# Industrial Benchmark Spec
-> Updated: 2026-06-02
-## One-page summary
-- Industrial-grade ACR cannot be judged on overall top1 alone
-- It must also be evaluated on:
-  1. hard-case
-  2. rejection / false accept
-  3. latency / scale
-  4. license provenance completeness
---
-## 1. Benchmark tier diagram
-```mermaid
-flowchart TD
-    A[Industrial Benchmark] --> B[Accuracy]
-    A --> C[Robustness]
-    A --> D[Operational]
-    A --> E[Compliance]
-    B --> B1[top1/top5/MRR]
-    C --> C1[humming/confused/noisy]
-    D --> D1[latency/indexing/throughput]
-    E --> E1[data provenance/license coverage]
-```
---
-## 2. Metrics table
-| Dimension | Metric | What it measures |
-|---|---|---|
-| Accuracy | top1 / top5 / MRR | Core recognition quality |
-| Robustness | humming/confused/noisy top1 | Hard-case quality |
-| Operational | p50/p95 latency | Serving capacity |
-| Operational | index throughput | Index-build capacity |
-| Safety | false accept rate | Misidentification risk |
-| Compliance | license coverage | Prerequisite for commercial use |
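For reference, the accuracy metrics in the table can be computed as in the sketch below. It assumes each query yields a ranked list of `song_id` values and has exactly one ground-truth `song_id`; the function names are illustrative and are not taken from the repository's `evaluate.py`.

```python
def reciprocal_rank(ranked, truth):
    """Return 1/rank of the first correct hit, or 0.0 if the truth is absent."""
    for rank, song_id in enumerate(ranked, start=1):
        if song_id == truth:
            return 1.0 / rank
    return 0.0


def topk_accuracy(ranked_lists, truths, k):
    """Fraction of queries whose ground truth appears in the first k candidates."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)


def mean_reciprocal_rank(ranked_lists, truths):
    """Average reciprocal rank over all queries."""
    total = sum(reciprocal_rank(r, t) for r, t in zip(ranked_lists, truths))
    return total / len(truths)


# Three toy queries: correct at rank 1, correct at rank 2, and a miss.
ranked_lists = [["a", "b"], ["b", "a"], ["c", "d"]]
truths = ["a", "a", "x"]
top1 = topk_accuracy(ranked_lists, truths, k=1)
top5 = topk_accuracy(ranked_lists, truths, k=5)
mrr = mean_reciprocal_rank(ranked_lists, truths)
```

Both aggregate functions assume a non-empty query set; an empty one would raise `ZeroDivisionError`.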
---
-## 3. Scenario diagram
-```mermaid
-flowchart LR
-    Q[Queries] --> Q1[clean]
-    Q --> Q2[augmented]
-    Q --> Q3[humming_like]
-    Q --> Q4[confused]
-    Q --> Q5[noisy/compressed]
-```
---
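-## 4. Notes
-### 4.1 Why hard cases get their own report
-Overall top1 easily masks failures on humming and confusable queries, which are exactly the scenarios users are most sensitive to.
-### 4.2 Why operational metrics are included
-An industrial system is not an offline competition model; it has to account for service latency and the cost of incremental indexing.
-### 4.3 Why compliance is part of the benchmark
-For a commercial system, untraceable training or evaluation data means the model cannot be shipped safely, no matter how accurate it is.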
---
-## 5. Appendix
-Recommended release gate:
-- clean top1 >= 0.95
-- noisy top1 >= 0.85
-- confused top1 >= 0.70
-- humming_like top1 >= 0.60
-- top5 >= 0.95 on production-relevant buckets
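A minimal sketch of how the top1 thresholds in this release gate could be checked automatically. The `check_release_gate` helper and the shape of its input (a mapping from bucket name to measured top1) are assumptions for illustration; the top5 condition is left out because the set of production-relevant buckets is not defined here.

```python
# Thresholds copied from the recommended release gate above.
RELEASE_GATE_TOP1 = {
    "clean": 0.95,
    "noisy": 0.85,
    "confused": 0.70,
    "humming_like": 0.60,
}


def check_release_gate(top1_by_bucket, gate=RELEASE_GATE_TOP1):
    """Return {bucket: (measured, required)} for every bucket that fails the gate.

    A bucket with no measurement counts as a failure, so a missing report
    cannot pass silently.
    """
    failures = {}
    for bucket, required in gate.items():
        measured = top1_by_bucket.get(bucket)
        if measured is None or measured < required:
            failures[bucket] = (measured, required)
    return failures


failures = check_release_gate(
    {"clean": 0.96, "noisy": 0.90, "confused": 0.70, "humming_like": 0.50}
)
```

An empty result means every top1 threshold is met; any entry names the bucket that blocks the release.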
-## Sources
-- See [references-and-sources.md](./references-and-sources.md) for the current source map.
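-## 6. Bucket / style-aware baseline
-The repository now includes a runnable baseline script:
-- [../acr-engine/scripts/ab_smoke_bucketed.py](../acr-engine/scripts/ab_smoke_bucketed.py)
-Purpose:
-- Split the data into several small subsets according to a bucket config file
-- Run the existing `ab_smoke_segmentation.py` on each bucket
-- Report the per-bucket winner and the aggregate means
-Recommended minimal config format: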
-```json
-{
-  "buckets": [
-    {"name": "prefix_000_a", "patterns": ["fma_small/000/00000?.mp3"], "subset_size": 4},
-    {"name": "prefix_000_b", "patterns": ["fma_small/000/00014?.mp3"], "subset_size": 4}
-  ]
-}
-```
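To make the config format concrete, the sketch below shows one plausible way to resolve `patterns` and `subset_size` into per-bucket file lists. It assumes shell-style glob matching on relative paths and sorted truncation; the actual selection logic in `ab_smoke_bucketed.py` is not shown in this document and may differ. The `select_bucket_files` helper and the sample paths are hypothetical.

```python
import fnmatch
import json

CONFIG_TEXT = """
{
  "buckets": [
    {"name": "prefix_000_a", "patterns": ["fma_small/000/00000?.mp3"], "subset_size": 4},
    {"name": "prefix_000_b", "patterns": ["fma_small/000/00014?.mp3"], "subset_size": 4}
  ]
}
"""


def select_bucket_files(config, relative_paths, default_subset_size=4):
    """Map each bucket name to at most subset_size matching paths (sorted for determinism)."""
    selected = {}
    for bucket in config["buckets"]:
        limit = bucket.get("subset_size", default_subset_size)
        matched = sorted(
            path
            for path in relative_paths
            if any(fnmatch.fnmatchcase(path, pattern) for pattern in bucket["patterns"])
        )
        selected[bucket["name"]] = matched[:limit]
    return selected


# Hypothetical file listing, used only to exercise the matcher.
paths = [
    "fma_small/000/000002.mp3",
    "fma_small/000/000005.mp3",
    "fma_small/000/000140.mp3",
    "fma_small/000/000141.mp3",
    "fma_small/000/000190.mp3",
]
buckets = select_bucket_files(json.loads(CONFIG_TEXT), paths)
```

`fnmatchcase` is used instead of `fnmatch` so that matching does not depend on the operating system's case rules.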
-Recommended command:
-```bash
-/usr/local/miniconda3/bin/python acr-engine/scripts/ab_smoke_bucketed.py \
-  --dataset fma \
-  --input-dir data/raw/fma_small_audio \
-  --bucket-config /tmp/cap64_bucket_test.json \
-  --work-root /tmp/ab_smoke_bucketed_smoke \
-  --default-subset-size 4 \
-  --query-duration 8 \
-  --train-epochs 1 \
-  --batch-size 2 \
-  --device cpu \
-  --strategies high_energy hybrid \
-  --max-test-queries 4 \
-  --seed 42 \
-  --output-json /tmp/ab_smoke_bucketed_smoke/report.json
-```
-Minimal result verified so far:
-- `prefix_000_a` winner=`hybrid`
-- `prefix_000_b` winner=`high_energy`
-- At the aggregate level, both strategies have a `mean_top1` of `1.0`
-So the current value of the bucket benchmark is not to pick a single winner, but to provide a reusable execution framework for later semantic buckets and hard-case buckets.
-Recommended template:
-- [../acr-engine/configs/buckets/fma_semantic_bucket_template.json](../acr-engine/configs/buckets/fma_semantic_bucket_template.json)
-It is not an automatic labeler; it is an execution template for manually assigning buckets first and then reusing the unified benchmark flow.
--- a/docs/industrialization-roadmap.md deleted 100644 → 0
View file @4422297
+++ b/docs/industrialization-roadmap.md deleted 100644 → 0
View file @4422297
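-# Industrialization Roadmap
-> Updated: 2026-06-02
-## One-page summary
-Completed so far:
-- Runnable prototype
-- Initial retrieval-first rework
-- Service skeleton
-- Early external-data adapters
-The next phase must focus on three things:
-1. **Real data ingestion**
-2. **Hard-case accuracy**
-3. **Commercial compliance and service stability**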
---
-## 1. Roadmap diagram
-```mermaid
-flowchart LR
-    P0[P0 Prototype running] --> P1[P1 Real-data validation]
-    P1 --> P2[P2 Engineering and service]
-    P2 --> P3[P3 Large-scale indexing]
-    P3 --> P4[P4 Commercial launch]
-```
---
-## 2. Stage table
-| Stage | Goal | Status | Key deliverables |
-|---|---|---|---|
-| P0 | End-to-end prototype | Done | demo/train/index/eval |
-| P1 | Whitelisted real-data ingestion | In progress | adapters/manifests/benchmark |
-| P2 | API / benchmark / ops | In progress | FastAPI + spec |
-| P3 | ANN / incremental indexing | Not done | Faiss/HNSW |
-| P4 | Commercially usable platform | Not done | license gate / SLA / release flow |
---
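-## 3. Near-term priorities
-### Priority A
-- Ingest small whitelisted subsets of FMA / Jamendo
-- Improve humming_like / confused accuracy
-- Make the service configurable and run a real deployment smoke test
-### Priority B
-- ANN vector index
-- Rejection / false-accept metrics
-- Model versioning
-### Priority C
-- foundation model baseline
-- Online evaluation and monitoring
-- Commercial deployment process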
---
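-## 4. Responsibilities by layer
-| Layer | Focus |
-|---|---|
-| Data | Ingest only auditable, whitelisted data |
-| Model | Optimize for retrieval metrics; do not over-trust the classification head |
-| Retrieval | Strengthen hard-case handling and rejection |
-| Service | Stable, configurable, observable API |
-| Compliance | Every shipped model must have traceable data sources |
---
-## 5. Appendix
-Related documents:
-- [Dataset sources and ingestion](./dataset-sources-and-licensing.md)
-- [Industrial benchmark spec](./industrial-benchmark-spec.md)
-- [Service API](./service-api.md)
-## Sources
-- See [references-and-sources.md](./references-and-sources.md) for the current source map.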
--- a/docs/model-card-template.md deleted 100644 → 0
View file @4422297
+++ b/docs/model-card-template.md deleted 100644 → 0
View file @4422297
-# Model Card Template
-## One-page summary
-- Model name:
-- Version:
-- Intended use:
-- Out-of-scope use:
-## 1. Model architecture diagram
-```mermaid
-flowchart LR
-    A[Input Audio] --> B[128 Mel + BandSplit]
-    B --> C[Encoder]
-    C --> D[Embedding]
-    D --> E[Hybrid Retrieval]
-```
-## 2. Key information table
-| Item | Content |
-|---|---|
-| Training data |  |
-| Evaluation data |  |
-| Main metrics |  |
-| Known risks |  |
-| License constraints |  |
-## 3. Notes
-- Training method:
-- Model limitations:
-- Risk notes:
-## 4. Appendix
-- checkpoint path
-- config path
-- benchmark report path
-## Sources
-- `docs/dataset-spec.md`
-- `docs/benchmark-report-template.md`
--- a/docs/open-dataset-workflow.md deleted 100644 → 0
View file @4422297
+++ b/docs/open-dataset-workflow.md deleted 100644 → 0
View file @4422297
--- a/docs/postgresql-data-model.md
View file @ac2e673
+++ b/docs/postgresql-data-model.md
View file @ac2e673
@@ -256,6 +256,132 @@ window -> fingerprint / embedding -> candidate -> aggregate
 ---
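+## 1.2 Change in the current business premise: versions do not matter yet, go song-centric first
+If the current business constraint is:
+> **A song may have multiple recordings or audio files, but version semantics are not a concern for now; everything only needs to resolve reliably to the same `song_id`**
+then the recommended Phase-1 default narrows further to:
+```text
+song -> asset -> window -> feature
+```
+In other words:
+- `song` is the only ownership object that must be returned reliably for now
+- A single `song` may have multiple audio files
+- These audio files may come from different sources: official releases, crawled audio, BGM, clips, query samples, and so on
+- For now, differences between recording versions are not promoted to a core layer that must be modeled separately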
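+### Recommended physical implementation
+Under this premise, the recommendation is to adopt **3+1 fused tables** directly:
+| Physical table | Main types | Current role |
+|---|---|---|
+| `media_entity` | `song` | Holds only the final business ownership object |
+| `audio_object` | `asset`, `window` | Holds audio files and sliced windows |
+| `feature_fact` | `fingerprint`, `embedding` | Holds retrieval feature facts |
+| `set_membership` | `reference_set`, `hot_set`, `eval_set` | Holds set relations such as reference / eval |
+Corresponding logical chain:
+```text
+song -> asset -> window -> feature
+```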
+### Which tables hold slices, models, and features
+Under the current **song-centric + fusion-first** design, read it as follows:
+| Object | Recommended table | Key type / columns | Purpose |
+|---|---|---|---|
+| Song entity | `media_entity` | `entity_type=song` | Which `song_id` the result finally belongs to |
+| Raw audio file | `audio_object` | `object_type=asset`, `song_id`, `storage_uri`, `checksum` | Stores the multiple audio files under one song |
+| Sliced window | `audio_object` | `object_type=window`, `parent_object_id=<asset_id>`, `start_ms`, `end_ms` | Stores retrieval windows cut from an asset |
+| Model info | `feature_fact` | `model_name`, `model_version`, `feature_set_name` | Records which model and parameter set produced the feature |
+| fingerprint feature | `feature_fact` | `feature_type=fingerprint`, `fingerprint_value` | Stores exact-lane features |
+| embedding feature | `feature_fact` | `feature_type=embedding`, `embedding_dim`, `embedding_uri`, `vector_table_name` | Stores semantic-lane features |
+| reference / eval membership | `set_membership` | `set_type`, `member_type`, `member_id` | Determines which assets/windows/songs enter the reference set |
+The key point:
+> **Slices also live in `audio_object`, just with `object_type=window`; models and features both live in `feature_fact`.**
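As a rough illustration of the mapping above, the sketch below creates the four fused tables in an in-memory SQLite database. It is not the project's DDL: SQLite stands in for PostgreSQL, the column types and `CHECK` constraints are assumptions, and the key names `feature_id` and `object_id` are hypothetical. Only the table names and the columns listed in the table come from this document.

```python
import sqlite3

# Illustrative DDL only: SQLite is a stand-in for PostgreSQL, and types,
# constraints, and the feature_id/object_id key names are assumptions.
SCHEMA = """
CREATE TABLE media_entity (
    entity_id   TEXT PRIMARY KEY,
    entity_type TEXT NOT NULL CHECK (entity_type IN ('song', 'work', 'recording'))
);
CREATE TABLE audio_object (
    object_id        TEXT PRIMARY KEY,
    object_type      TEXT NOT NULL CHECK (object_type IN ('asset', 'window')),
    song_id          TEXT REFERENCES media_entity(entity_id),
    parent_object_id TEXT REFERENCES audio_object(object_id),
    storage_uri      TEXT,
    checksum         TEXT,
    start_ms         INTEGER,
    end_ms           INTEGER
);
CREATE TABLE feature_fact (
    feature_id        TEXT PRIMARY KEY,
    object_id         TEXT NOT NULL REFERENCES audio_object(object_id),
    feature_type      TEXT NOT NULL CHECK (feature_type IN ('fingerprint', 'embedding')),
    model_name        TEXT,
    model_version     TEXT,
    feature_set_name  TEXT,
    fingerprint_value TEXT,
    embedding_dim     INTEGER,
    embedding_uri     TEXT,
    vector_table_name TEXT
);
CREATE TABLE set_membership (
    set_type    TEXT NOT NULL,
    member_type TEXT NOT NULL,
    member_id   TEXT NOT NULL,
    PRIMARY KEY (set_type, member_type, member_id)
);
"""


def build_schema(conn):
    """Create the four fused tables on the given connection."""
    conn.executescript(SCHEMA)
    return conn


conn = build_schema(sqlite3.connect(":memory:"))
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Assets and windows share `audio_object`, so the `object_type` check is what keeps the two roles distinguishable; a window reaches its song through `parent_object_id` rather than a direct `song_id`.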
+### Corresponding flow diagram
+```mermaid
+flowchart TD
+    A["media_entity<br/>entity_type=song"] --> B["audio_object<br/>object_type=asset"]
+    B --> C["audio_object<br/>object_type=window"]
+    C --> D1["feature_fact<br/>feature_type=fingerprint"]
+    C --> D2["feature_fact<br/>feature_type=embedding"]
+    D1 --> E["set_membership<br/>reference_set / eval_set"]
+    D2 --> E
+```
+### Corresponding write flow
+```mermaid
+sequenceDiagram
+    participant ING as Ingest Job
+    participant DB as PostgreSQL
+    ING->>DB: write media_entity(song)
+    ING->>DB: write audio_object(asset)
+    ING->>DB: slice, then write audio_object(window)
+    ING->>DB: write feature_fact(fingerprint)
+    ING->>DB: write feature_fact(embedding)
+    ING->>DB: write set_membership(reference/eval)
+```
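The write order in the sequence diagram can be sketched with plain dictionaries standing in for the four tables. Everything here is illustrative: the `ingest_asset` and `slice_windows` helpers, the 8 s window with a 4 s hop, the ID formats, and the `demo-encoder` model name are assumptions, and real feature values would come from the fingerprint and embedding workers rather than being left empty.

```python
import hashlib
from collections import defaultdict


def slice_windows(duration_ms, window_ms=8000, hop_ms=4000):
    """Yield (start_ms, end_ms) for every full window; sizes are illustrative."""
    start = 0
    while start + window_ms <= duration_ms:
        yield start, start + window_ms
        start += hop_ms


def ingest_asset(db, song_id, storage_uri, audio_bytes, duration_ms,
                 model_name="demo-encoder", model_version="v0",
                 set_type="reference_set"):
    """Write song -> asset -> window -> feature -> set membership, in that order."""
    # 1. media_entity(song): keyed by song_id, so repeated ingests are idempotent.
    db["media_entity"][song_id] = {"entity_type": "song"}
    # 2. audio_object(asset): several assets may point at the same song_id.
    checksum = hashlib.sha256(audio_bytes).hexdigest()
    asset_id = f"asset-{checksum[:12]}"
    db["audio_object"][asset_id] = {
        "object_type": "asset", "song_id": song_id,
        "storage_uri": storage_uri, "checksum": checksum,
    }
    for start_ms, end_ms in slice_windows(duration_ms):
        # 3. audio_object(window): linked to its asset via parent_object_id.
        window_id = f"{asset_id}-w{start_ms}"
        db["audio_object"][window_id] = {
            "object_type": "window", "parent_object_id": asset_id,
            "start_ms": start_ms, "end_ms": end_ms,
        }
        # 4./5. feature_fact rows; the feature payloads themselves are omitted here.
        for feature_type in ("fingerprint", "embedding"):
            db["feature_fact"][f"{window_id}-{feature_type}"] = {
                "object_id": window_id, "feature_type": feature_type,
                "model_name": model_name, "model_version": model_version,
            }
        # 6. set_membership: register the window in the chosen set.
        db["set_membership"][(set_type, "window", window_id)] = True
    return asset_id


db = defaultdict(dict)
asset_id = ingest_asset(db, "song-1", "file:///tmp/a.mp3", b"fake-audio", duration_ms=20000)
```

Because the asset ID is derived from the checksum, ingesting the same bytes twice overwrites the same rows instead of duplicating them, which is one simple way to keep repeated ingest jobs safe.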
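+### A practical trace-back path for queries
+If a query hits an embedding or fingerprint, the trace-back path is:
+```text
+feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)
+```
+This chain is enough to answer the questions that matter most right now:
+- Which audio file this slice came from
+- Which `song_id` that audio file belongs to
+- Which model / feature set computed this feature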
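The trace-back chain can be expressed as a single join, sketched here against an in-memory SQLite database as a stand-in for PostgreSQL. The trimmed-down schema, the `feature_id` / `object_id` key names, the sample rows, and the `trace_feature` helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Minimal stand-in schema plus one song, one asset, one window, one feature.
conn.executescript("""
CREATE TABLE media_entity (entity_id TEXT PRIMARY KEY, entity_type TEXT);
CREATE TABLE audio_object (
    object_id TEXT PRIMARY KEY, object_type TEXT, song_id TEXT,
    parent_object_id TEXT, storage_uri TEXT, start_ms INTEGER, end_ms INTEGER
);
CREATE TABLE feature_fact (
    feature_id TEXT PRIMARY KEY, object_id TEXT, feature_type TEXT,
    model_name TEXT, model_version TEXT, feature_set_name TEXT
);
INSERT INTO media_entity VALUES ('song-1', 'song');
INSERT INTO audio_object VALUES ('asset-1', 'asset', 'song-1', NULL, 'file:///tmp/a.mp3', NULL, NULL);
INSERT INTO audio_object VALUES ('win-1', 'window', NULL, 'asset-1', NULL, 4000, 12000);
INSERT INTO feature_fact VALUES ('feat-1', 'win-1', 'embedding', 'demo-encoder', 'v0', 'demo-set');
""")

# feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)
TRACE_SQL = """
SELECT s.entity_id AS song_id, a.storage_uri, w.start_ms, w.end_ms,
       f.model_name, f.model_version, f.feature_set_name
FROM feature_fact AS f
JOIN audio_object AS w ON w.object_id = f.object_id AND w.object_type = 'window'
JOIN audio_object AS a ON a.object_id = w.parent_object_id AND a.object_type = 'asset'
JOIN media_entity AS s ON s.entity_id = a.song_id AND s.entity_type = 'song'
WHERE f.feature_id = ?
"""


def trace_feature(connection, feature_id):
    """Resolve a matched feature to its window, asset, song, and model; None if unknown."""
    row = connection.execute(TRACE_SQL, (feature_id,)).fetchone()
    return dict(row) if row else None


hit = trace_feature(conn, "feat-1")
```

The returned row answers the three questions listed above in one lookup: the source file (`storage_uri`), the owning `song_id`, and the model / feature set that produced the feature.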
+---
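+### Why `recording` does not need to be a strong entity yet
+Because the following are not current concerns:
+- Strict version distinctions such as official / live / remaster
+- Separate archiving for a cover/version lane
+- Results that must resolve precisely to a recording_id
+What actually matters right now is:
+> Whether audio from different sources and in different forms all resolves reliably to the same `song_id`
+Given that goal, making `recording` a strong primary layer adds comprehension cost while offering limited benefit for the first phase.
+### This does not mean `recording` is never needed
+Recommended approach:
+- **The current schema does not push `recording` by default**
+- If version attribution becomes a concern later, promote `recording` out of `media_entity(entity_type=recording)` or `audio_object.metadata_json`
+In other words:
+- **For now, do song-centric retrieval attribution**
+- **Later, evolve toward recording-centric / work-centric governance**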
+---
 ## 1.2.1 Fusion first: keep the logical layers, converge the physical tables
 If your core requirement is:
@@ -264,7 +390,7 @@ window -> fingerprint / embedding -> candidate -> aggregate
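 then the recommended framing is:
- **Logical layer** still keeps `song / recording / asset / window / feature`
+- **Logical layer** keeps `song / asset / window / feature` by default for now; `recording` is retained only as a future extension
 - **Physical layer** fuses into a small number of generic tables wherever possible
 In other words: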
@@ -275,7 +401,7 @@ window -> fingerprint / embedding -> candidate -> aggregate
 | Physical table | Main types | Role |
 |---|---|---|
-| `media_entity` | `song`, `work`, `recording` | Holds business ownership objects |
+| `media_entity` | `song` (current default), `work`/`recording` (future extension) | Holds business ownership objects |
 | `audio_object` | `asset`, `window` | Holds actual audio files and slice objects |
 | `feature_fact` | `fingerprint`, `embedding` | Holds retrieval feature facts |
 | `set_membership` | `reference_set`, `hot_set`, `eval_set` | Holds set membership relations |
@@ -292,9 +418,9 @@ media_entity -> audio_object -> feature_fact -> set_membership
 #### `media_entity`
 Distinguished by `entity_type`:
- `song`
+- `song` (required by default for now)
- `work`
+- `work` (optional)
- `recording`
+- `recording` (future extension)
 Common fields can be unified as:
 - `entity_id`
@@ -354,7 +480,7 @@ media_entity -> audio_object -> feature_fact -> set_membership
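 Advantages:
 1. **Easier for newcomers to understand**: they see 3 to 4 core tables instead of a dozen or more specialized ones
-2. **Multi-type reuse is more natural**: `song/work/recording` and `asset/window` can all be expressed uniformly through a type column
+2. **Better fit for the current business premise**: multiple audio files attach directly to the same `song_id`, without strictly distinguishing recordings yet
 3. **Smoother model evolution**: `feature_fact` can hold different models and different features side by side
 4. **Better fit for the current goal**: get the recognition loop running first instead of splitting the governance model finely up front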
@@ -395,7 +521,7 @@ song_everything
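 | Layer | Fusion-first recommended table | Current role |
 |---|---|---|
-| Entity layer | `media_entity` | Uniformly holds `song/work/recording` |
+| Entity layer | `media_entity` | Holds only `song` by default for now |
 | Audio object layer | `audio_object` | Uniformly holds `asset/window` |
 | Feature layer | `feature_fact` | Uniformly holds `fingerprint/embedding` |
 | Set layer | `set_membership` | Uniformly holds set relations such as `reference/hot/eval` |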
@@ -406,10 +532,10 @@ song_everything
 media_entity -> audio_object -> feature_fact -> set_membership
 ```
-If read by logical semantics, this still corresponds to:
+If read by logical semantics, the current default corresponds to:
 ```text
-song/work/recording -> asset/window -> fingerprint/embedding -> reference membership
+song -> asset/window -> fingerprint/embedding -> reference membership
 ```
 ### What this minimal schema explicitly does not require heavy investment in on day one
--- a/docs/production-encoder-freeze-and-embedding-strategy.md
View file @ac2e673
+++ b/docs/production-encoder-freeze-and-embedding-strategy.md
View file @ac2e673
 # Production Encoder Freeze & Embedding Strategy Q&A
 > Updated: 2026-06-03  
-> Related docs: [Session handoff](./session-handoff.md) · [Training data and pgvector guide](./training-data-and-pgvector-guide.md) · [Open dataset workflow](./open-dataset-workflow.md) · [Service API](./service-api.md)
+> Related docs: [Session handoff](./session-handoff.md) · [PostgreSQL data model](./postgresql-data-model.md) · [Phase-1 implementation checklist](./phase1-implementation-checklist.md)
 ## One-page summary
@@ -623,9 +623,9 @@ prod_artifacts/
 ## Sources
 - [Session handoff](./session-handoff.md)
- [Training data and pgvector guide](./training-data-and-pgvector-guide.md)
+- [postgresql-data-model.md](./postgresql-data-model.md)
- [Open dataset workflow](./open-dataset-workflow.md)
+- [phase1-implementation-checklist.md](./phase1-implementation-checklist.md)
- [Service API](./service-api.md)
+- [phase1-worker-contract.md](./phase1-worker-contract.md)
 - [acr-engine/train.py](../acr-engine/train.py)
 - [acr-engine/run_demo.py](../acr-engine/run_demo.py)
 - [acr-engine/src/engines/ecapa_embedder.py](../acr-engine/src/engines/ecapa_embedder.py)
--- a/docs/project-responsibility-map.md deleted 100644 → 0
View file @4422297
+++ b/docs/project-responsibility-map.md deleted 100644 → 0
View file @4422297
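-# ACR Project Responsibility Map
-> Updated: 2026-06-02
-## One-page summary
-- The project has moved from an "algorithm prototype" to an **early industrial-grade ACR platform**
-- The system currently consists of a **data layer, training layer, retrieval layer, service layer, evaluation layer, and compliance layer**
-- The near-term focus is not on piling up features, but on:
-  1. Improving `humming_like` / `confused` accuracy
-  2. Ingesting real whitelisted datasets
-  3. Closing the loop across service, indexing, benchmark, and compliance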
---
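-## 1. Layer diagram
-```mermaid
-flowchart TD
-    A[L1 Business goals] --> B[L2 System capabilities]
-    B --> C[L3 Core modules]
-    C --> D[L4 Engineering services]
-    C --> E[L5 Data and compliance]
-    A1[Song recognition / humming recognition / commercial readiness]:::goal --> A
-    B1[High-accuracy recognition] --> B
-    B2[Scalable catalog] --> B
-    B3[Callable as a service] --> B
-    B4[Auditable data sources] --> B
-    C1[Training and representation learning] --> C
-    C2[Fingerprint retrieval] --> C
-    C3[Vector retrieval] --> C
-    C4[Hybrid reranking] --> C
-    C5[Evaluation benchmark] --> C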
-    D1[FastAPI] --> D
-    D2[Index Build] --> D
-    D3[Manifest Tools] --> D
-    E1[External Adapters] --> E
-    E2[Dataset Registry] --> E
-    E3[License Review] --> E
-    classDef goal fill:#e8f5e9,stroke:#2e7d32;
-```
---
-## 2. Responsibility table
-| Layer | Module | Responsibility | Status |
-|---|---|---|---|
-| Data | `src/data/*` | synthetic data, external adapters, manifest | Basics in place |
-| Training | `train.py` / `src/models/*` | 128 Mel, band-split, embedding learning | Runnable |
-| Retrieval | `src/engines/*` | chromaprint, embedding, melody-aware hybrid | Runnable |
-| Service | `src/service/*` | health / recognize / index build | Skeleton working |
-| Evaluation | `evaluate.py` | top1/top5/hard-case benchmark | Established |
-| Compliance | registry/docs | dataset source / licensing / whitelist | Early form in place |
---
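-## 3. Team responsibilities diagram
-```mermaid
-flowchart LR
-    D[Data team] --> D1[Data ingestion]
-    D --> D2[manifest standardization]
-    D --> D3[license review]
-    M[Model team] --> M1[Features and models]
-    M --> M2[Robust training]
-    M --> M3[hard-case optimization]
-    R[Retrieval team] --> R1[Fingerprint index]
-    R --> R2[Vector index]
-    R --> R3[Fusion and rejection]
-    S[Platform team] --> S1[API service]
-    S --> S2[Deployment]
-    S --> S3[Monitoring]
-    Q[Quality team] --> Q1[benchmark]
-    Q --> Q2[Regression validation]
-    Q --> Q3[Release gate]
-```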
---
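-## 4. Notes
-### 4.1 Data layer
-Converts datasets from different sources (synthetic, FMA, Jamendo, CCMusic, whitelisted ModelScope sets) into a unified `catalog/query manifest`.
-### 4.2 Training layer
-Handles feature modeling for music tasks. It has already moved from low-dimensional, speaker-style input to:
-- 128 Mel
-- band-split
-- a retrieval-first training direction
-### 4.3 Retrieval layer
-Fuses three signals:
-- Fingerprint matching
-- embedding matching
-- melody-aware reranking
-### 4.4 Service layer
-Wraps the offline prototype into a callable system; a FastAPI skeleton already exists.
-### 4.5 Evaluation layer
-Acts as the quality gate: overall top1 alone is not enough; hard cases, rejection, and false accepts must also be checked.
-### 4.6 Compliance layer
-Covers the prerequisites for commercial use: every external dataset must go through the registry and whitelist process.
---
-## 5. Appendix
-Key documents:
-- `docs/dataset-spec.md`
-- `docs/industrial-benchmark-spec.md`
-- `docs/dataset-sources-and-licensing.md`
-- `docs/industrialization-roadmap.md`
-## Sources
-- See `docs/references-and-sources.md` for the current source map.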
--- a/docs/references-and-sources.md deleted 100644 → 0
View file @4422297
+++ b/docs/references-and-sources.md deleted 100644 → 0
View file @4422297
-# References and Sources Map
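-> Updated: 2026-06-02
-## One-page summary
-The project's references fall into four categories:
-1. **Open dataset sources**
-2. **Research / SOTA sources**
-3. **Service and engineering standards sources**
-4. **Internal project documents**
---
-## 1. Reference tier diagram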
-```mermaid
-flowchart TD
-    A[References] --> B[Datasets]
-    A --> C[Research]
-    A --> D[Engineering]
-    A --> E[Internal Docs]
-    B --> B1[FMA]
-    B --> B2[MTG-Jamendo]
-    B --> B3[CCMusic]
-    B --> B4[ModelScope]
-    C --> C1[Neural AFP]
-    C --> C2[Music Foundation Models]
-    C --> C3[Band-split]
-    C --> C4[Data Balancing]
-```
---
-## 2. External sources table
-| Category | Name | URL | Current use |
-|---|---|---|---|
-| Dataset | FMA | https://github.com/mdeff/fma | Candidate real retrieval baseline |
-| Dataset | MTG-Jamendo | https://github.com/MTG/mtg-jamendo-dataset | Candidate for real music retrieval |
-| Dataset | CCMusic | https://ccmusic-database.github.io/en/database/ccm.html | Candidate Chinese MIR data source |
-| Dataset | ModelScope music search | https://modelscope.cn/search?page=1&search=music&type=dataset | Dataset discovery surface |
-| Research | MERT | https://arxiv.org/abs/2306.00107 | Reference for the foundation-model direction |
-| Research | MuQ | https://arxiv.org/abs/2501.01108 | Reference for music representation |
-| Research | Band-split RNN | https://arxiv.org/abs/2209.15174 | Reference for frequency-band modeling |
-| Research | BAGAN | https://arxiv.org/abs/1803.09655 | Reference for data-balancing augmentation |
---
-## 3. Internal document dependency diagram
-```mermaid
-flowchart LR
-    A[references-and-sources.md] --> B[dataset-sources-and-licensing.md]
-    A --> C[sota-research-2026.md]
-    A --> D[industrialization-roadmap.md]
-```
---
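-## 4. Notes
-### 4.1 Why a separate References Map
-The number of documents will keep growing; unless "which conclusion comes from where" is organized systematically, traceability will be lost quickly.
-### 4.2 Current reference quality
-- dataset sources: official repos / official homepages preferred
-- research sources: arXiv / paper homepages preferred
-- service / engineering sources: currently mostly internally derived engineering conventions
-### 4.3 Areas to strengthen
-- Add a "Sources" section at the bottom of every core document
-- Have benchmark reports and model cards explicitly cite training data and paper versions
---
-## 5. Appendix
-Suggested additions:
-- Add a `Sources` section to every document
-- Output a reference snapshot with every model release
-## Sources
-- FMA: https://github.com/mdeff/fma
-- MTG-Jamendo: https://github.com/MTG/mtg-jamendo-dataset
-- CCMusic: https://ccmusic-database.github.io/en/database/ccm.html
-- ModelScope music search: https://modelscope.cn/search?page=1&search=music&type=dataset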
--- a/docs/release-checklist.md deleted 100644 → 0
View file @4422297
+++ b/docs/release-checklist.md deleted 100644 → 0
View file @4422297
-# Release Checklist
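-## One-page summary
-A release must satisfy all of the following:
-- Quality passed
-- Compliance passed
-- Service passed
-- Documentation complete
-## 1. Release gate diagram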
-```mermaid
-flowchart TD
-    A[Release Candidate] --> B[Benchmark Pass]
-    A --> C[License Review Pass]
-    A --> D[Service Smoke Pass]
-    A --> E[Docs Complete]
-```
-## 2. Checklist table
-| Item | Status |
-|---|---|
-| benchmark report generated |  |
-| model card generated |  |
-| license registry updated |  |
-| service smoke test passed | partial: `/health` OK, `/recognize/voice` payload returns against `workspace_music20`, but batch validation is currently poor (`type_7 top1=0.0/top3=0.05`, `type_8 top1=0.0/top3=0.0`, `type_16 top1=0.0/top3=0.0`) |
-| dataset whitelist confirmed |  |
-| changelog updated | yes |
-| architect review completed | yes (approved with watch) |
-## 3. Notes
-- Any missing item means the release cannot be considered commercially shippable
-## 4. Appendix
-- Release commit
-- benchmark report path
-- model card path
-- license review record path
-## Sources
-- `docs/dataset-sources-and-licensing.md`
-- `docs/industrial-benchmark-spec.md`
-## 2026-06-03 voice-query service foundation
-- `/health` is available
-- The `/recognize/voice` route is wired in, but inference is currently still blocked by missing `torch`
-- Local FAISS 20-song validation is complete
-- handoff / changelog / docs README are in sync
-- handoff refreshed: yes (now points to the current voice service runtime status and the next debugging path)
-- business-corpus song_id baseline generated: yes (`data/pgvector_eval/music20/songid_eval_report.json`)
--- a/docs/report-layout.md deleted 100644 → 0
View file @4422297
+++ b/docs/report-layout.md deleted 100644 → 0
View file @4422297
-# Report Layout Convention
-## One-page summary
-All evaluation and release artifacts go under:
-- `reports/<model-version>/<data-version>/eval.json`
-- `reports/<model-version>/<data-version>/benchmark-report.md`
-- `reports/<model-version>/<data-version>/model-card.md`
-- `reports/<model-version>/<data-version>/release-checklist.md`
-- `reports/<model-version>/<data-version>/artifact-manifest.json`
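The layout convention above is easy to encode so that tooling never hand-builds these paths. The sketch below is illustrative; the `report_paths` helper and the example version strings are hypothetical.

```python
from pathlib import PurePosixPath

# File names taken from the layout convention in this document.
REPORT_FILES = (
    "eval.json",
    "benchmark-report.md",
    "model-card.md",
    "release-checklist.md",
    "artifact-manifest.json",
)


def report_paths(model_version, data_version, root="reports"):
    """Return {file name: path} under reports/<model-version>/<data-version>/."""
    base = PurePosixPath(root) / model_version / data_version
    return {name: str(base / name) for name in REPORT_FILES}


# Hypothetical version labels, used only to show the resulting paths.
paths = report_paths("model-v3", "data-v2")
```

`PurePosixPath` keeps the separators as `/` on every platform, which matches how the paths are written in the convention.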
---
-## 1. Layout diagram
-```mermaid
-flowchart TD
-    A[reports/] --> B[model-version]
-    B --> C[data-version]
-    C --> D[eval.json]
-    C --> E[benchmark-report.md]
-    C --> F[model-card.md]
-    C --> G[release-checklist.md]
-    C --> H[artifact-manifest.json]
-```
---
-## 2. Conventions table
-| File | Purpose |
-|---|---|
-| eval.json | Machine-readable evaluation output |
-| benchmark-report.md | Human-readable benchmark summary |
-| model-card.md | Model description |
-| release-checklist.md | Release gate |
-| artifact-manifest.json | Artifact index |
---
-## 3. Notes
-- Every release candidate should have its own directory
-- Do not mix temporary smoke files with formal release reports
-## Sources
-- docs/benchmark-report-template.md
-- docs/model-card-template.md
-- docs/release-checklist.md
--- a/docs/runbook.md deleted 100644 → 0
View file @4422297
+++ b/docs/runbook.md deleted 100644 → 0
View file @4422297
-# ACR Project Runbook
-## 1. Environment
-```bash
-cd acr-engine
-python -m venv .venv
-source .venv/bin/activate
-pip install -r requirements.txt
-```
-## 2. Generate data
-```bash
-python run_demo.py generate-data --output data/synthetic --num-songs 24
-```
-## 3. Validate the training pipeline
-```bash
-python train.py --data data/synthetic --dry-run --device cpu
-```
-## 4. Minimal training
-```bash
-python train.py --data data/synthetic --output data/models --device cpu --epochs 1 --batch-size 8
-```
-## 5. Build the index
-```bash
-python run_demo.py build-index --data data/synthetic --model data/models/best_model.pt --output data/index --device cpu
-```
-## 6. Run recognition
-```bash
-python run_demo.py recognize \
-  --query data/synthetic/segments/song_0020_seg_00.wav \
-  --data data/synthetic \
-  --model data/models/best_model.pt \
-  --index-prefix data/index/reference \
-  --device cpu
-```
-## 7. Success criteria
-At minimum:
-- The command outputs a JSON result
-- The result contains `candidates`
-- The result includes `song_id` and `confidence`
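The success criteria can be checked mechanically, as in the sketch below. The exact output shape of `run_demo.py recognize` is not shown in this document, so the sketch assumes a top-level object with a non-empty `candidates` list whose items each carry `song_id` and `confidence`; the `validate_recognition_output` helper and the sample payload are hypothetical.

```python
import json


def validate_recognition_output(raw_text):
    """Return True if the text is JSON with candidates that each have song_id and confidence."""
    try:
        result = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(result, dict):
        return False
    candidates = result.get("candidates")
    # Assumption: an empty candidate list counts as a failed smoke run.
    if not isinstance(candidates, list) or not candidates:
        return False
    return all(
        isinstance(item, dict) and "song_id" in item and "confidence" in item
        for item in candidates
    )


# Hypothetical payload shaped according to the assumptions above.
sample = '{"candidates": [{"song_id": "song_0020", "confidence": 0.91}]}'
ok = validate_recognition_output(sample)
```

Piping the command's standard output into a check like this turns the manual criteria into a pass/fail step for a smoke script.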
--- a/docs/service-api.md deleted 100644 → 0
View file @4422297
+++ b/docs/service-api.md deleted 100644 → 0
View file @4422297
-# ACR Service API
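-> Updated: 2026-06-02
-## One-page summary
-- The current service is an industrialization skeleton, not the final production gateway
-- Minimal callable capabilities are provided:
-  1. health
-  2. ready
-  3. config
-  4. cache
-  5. recognize
-  6. index build
-- Added: service readiness probe, basic cache visibility, index/model existence checks
-- Next-phase focus: authentication, async jobs, ANN index, monitoring, standardized error codes
---
-## 1. Service structure diagram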
-```mermaid
-flowchart LR
-    C[Client] --> H[/health]
-    C --> H2[/ready]
-    C --> G[/config]
-    C --> C2[/cache]
-    C --> R[/recognize]
-    C --> I[/index/build]
-    R --> E[Hybrid Engine Cache]
-    I --> B[Index Builders]
-```
---
-## 2. Endpoint table
-| Endpoint | Method | Purpose |
-|---|---|---|
-| `/health` | GET | Health check |
-| `/config` | GET | Show default configuration |
-| `/ready` | GET | Show whether the model / index / manifest are ready |
-| `/cache` | GET | Show current engine cache state |
-| `/recognize` | POST | Takes a query, returns candidates |
-| `/index/build` | POST | Triggers an offline index build |
---
-## 3. Request flow diagram
-```mermaid
-sequenceDiagram
-    participant Client
-    participant API
-    participant Engine
-    Client->>API: POST /recognize
-    API->>Engine: load matcher/index/model
-    Engine-->>API: top-k candidates
-    API-->>Client: JSON result
-```
---
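-## 4. Notes
-### 4.1 Why expose a file-path API first
-The current phase prioritizes validating the end-to-end loop over introducing an upload storage layer and async job orchestration.
-### 4.2 What `/config` is for
-It helps both the service side and callers quickly confirm the current default data directory, model path, and index prefix.
-### 4.3 Remaining gaps to production
-- No authentication
-- No object-storage upload
-- No async indexing jobs
-- No observability
-- No error code or SLA specification
---
-## 5. Appendix
-### `/health`
-Returns: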
-```json
-{"status":"ok","service":"acr","version":"0.2.0"}
-```
-### `/config`
-Returns:
-```json
-{
-  "data_dir":"data/synthetic_v2",
-  "model_path":"data/models_v3/best_model.pt",
-  "index_prefix":"data/index_v3/reference",
-  "device":"cpu"
-}
-```
-### `/ready`
-Returns:
-```json
-{
-  "service":"acr",
-  "version":"0.3.0",
-  "ready":true,
-  "files":{...},
-  "manifests":[...],
-  "engine_cache_size":0
-}
-```
-### `/cache`
-Returns the engine cache statistics of the current process.
-## 6. Local run and smoke test
-```bash
-cd acr-engine
-/usr/local/miniconda3/bin/python -m uvicorn src.service.app:app --host 127.0.0.1 --port 8000
-```
-From another terminal, run:
-```bash
-cd acr-engine
-/usr/local/miniconda3/bin/python scripts/service_smoke.py
-```
-The smoke test currently checks:
-- `/health`
-- `/ready`
-- `/config`
-- `/cache`
-## Sources
-- See [references-and-sources.md](./references-and-sources.md) for the current source map.
--- a/docs/sota-research-2026.md deleted 100644 → 0
View file @4422297
+++ b/docs/sota-research-2026.md deleted 100644 → 0
View file @4422297
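-# ACR / Music Retrieval SOTA Research (as of 2026-06-02)
-## Summary
-By 2025-2026, the field has clearly moved beyond the traditional approach of training a small ECAPA embedding from scratch.
-The stronger directions today fall into three groups:
-1. **Robust training augmentation for Neural Audio Fingerprinting**
-2. **Music foundation models as backbone / teacher**
-3. **Band-split / band-aware architectures for music spectrogram modeling**
-Direct conclusions for this project at its current stage:
-- **Sample repetition or uniform weighting alone is not a SOTA approach**
-- Closer to 2026 industrial best practice is: **retrieval-first + hard negative mining + foundation model backbone + task-specific branches**
-- The repository already covers two of these steps: `128 Mel + band-split` and `retrieval-first eval`
-- The most important next additions are `confusion-aware negatives` and a `humming melody tower`
-## 1. Direction diagram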
-```mermaid
-flowchart LR
-    A[2026 ACR / MIR SOTA] --> B[Neural AFP Robustness]
-    A --> C[Music Foundation Models]
-    A --> D[Band-aware Architectures]
-    A --> E[Data Balancing / Hard Negatives]
-```
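-## 1. Stronger practice in Neural AFP
-### Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification (2025)
-- arXiv: https://arxiv.org/abs/2506.22661
-Key points:
-- Notes that many neural AFP works do not simulate real-world degradation realistically enough
-- Systematically compares metric learning methods
-- Finds that a self-supervised triplet loss variant performs better on this task
-- Emphasizes that multiple positive samples affect different losses differently
-Implications for this project:
-- Do not rely only on the current simple SupCon + CE
-- Add more realistic degradation augmentation
-- Select models explicitly on retrieval metrics rather than only on the classification head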
-## 2. Music Foundation Model Backbones
-### Robust Neural Audio Fingerprinting using Music Foundation Models (2025)
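-- arXiv: https://arxiv.org/abs/2511.05399
-Key points:
-- Uses pretrained music foundation models (for example MuQ, MERT) as the neural fingerprinting backbone
-- Outperforms models trained from scratch on distorted / compressed / manipulated audio
-- Also enables better segment-level localization
-### MERT (2023)
-- arXiv: https://arxiv.org/abs/2306.00107
-Key points:
-- Large-scale self-supervised music understanding model
-- Achieves strong results on multiple music understanding tasks
-### MuQ (2025)
-- arXiv: https://arxiv.org/abs/2501.01108
-Key points:
-- Self-supervised representation learning model for music
-- Uses a Mel-RVQ target
-- Outperforms earlier work on a range of downstream tasks
-Implications for this project:
-- Continuing to train only a small model from scratch in 2026 is unlikely to be the best route
-- A more reasonable route:
-  - Keep the lightweight self-trained baseline in the current repository
-  - Add a MERT / MuQ frozen encoder or adapter fine-tune variant in the next phase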
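-## 3. Band-split / band-aware architectures
-### Music Source Separation with Band-split RNN (2022)
-- arXiv: https://arxiv.org/abs/2209.15174
-Key points:
-- Explicitly splits the spectrogram into multiple frequency bands before modeling
-- Works better for music tasks than directly reusing generic audio architectures
-Although the paper targets source separation rather than ACR, it is instructive as a "music frequency-band prior".
-Implications for this project:
-- Adding band-split at the input layer is a reasonable engineering direction
-- It can later evolve into: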
-  - band-aware attention
-  - multi-band retrieval heads
-  - harmonic/rhythm 双塔结构
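-## 4. Data balancing and generative augmentation
-### BAGAN: Data Augmentation with Balancing GAN (2018)
-- arXiv: https://arxiv.org/abs/1803.09655
-Strictly speaking, for the `pro-WGAN` you mentioned, I did not find a clear, authoritative primary reference under that name that is widely standardized for this task; the closest approach with a clear authoritative source is the **BAGAN / balancing GAN** family of methods for augmenting imbalanced data.
-So what this implementation uses is:
-- **a pro-WGAN-style engineering approximation of a balancing strategy**
-- not a claim to have reproduced a specific `pro-WGAN` SOTA paper
-If you later point to the exact paper or repository, I can align the implementation precisely with that version.
-### Interpreting the current experimental results
-| Strategy | overall top1 | humming_like top1 | confused top1 | Conclusion |
-|---|---:|---:|---:|---|
-| naive oversampling (smoke-v4) | 0.40 | 0.00 | 0.00 | Clear regression |
-| type-aware weighting (smoke-v5) | 0.60 | 0.50 | 0.00 | Improves humming, no breakthrough on confused |
-| sample-level confused-priority weighting (smoke-v6) | 0.65 | 0.25 | 0.25 | Breakthrough on confused, but humming needs rebalancing |
-This shows that:
-1. In this field in 2026, **"hard cases matter" is correct**
-2. But **single-dimension weighting is not enough**; different hard cases need to be modeled separately
-3. For music ACR, `confused` and `humming_like` do not share the same source of difficulty:
-   - `confused` leans toward timbre / arrangement / retrieval ambiguity
-   - `humming_like` leans toward melody / pitch contour mismatch
-4. The residual confused failures in the current repository further show that:
-   - `intro` segments are a higher-risk region
-   - The next step should introduce `segment_type-aware hard negatives`
-   - This is closer to an industrially effective path than continuing to tune the global sample ratio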
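-## 5. Are there already better approaches in 2026?
-Yes. The conclusion is that **there are clearly better routes**.
-The ones most worth following are:
-1. Use a **music foundation model** as the backbone
-2. Use **more realistic degradation simulation + retrieval-first metric learning**
-3. Use **segment-level / window-level indexing** instead of a whole-track average embedding
-4. Add a **dedicated melody / pitch contour branch** for humming tasks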
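To illustrate the window-level indexing point in the list above, the sketch below aggregates per-window matches into a song-level ranking. The scoring rule (sum of each song's best `top_n` window scores) is one simple choice made for this example, not the method used by the repository or by the cited papers; `aggregate_window_hits` and the sample scores are hypothetical.

```python
from collections import defaultdict


def aggregate_window_hits(hits, top_n=3):
    """Rank songs by the sum of their top_n window similarity scores.

    hits: iterable of (song_id, score) pairs, one per matched reference
    window, where a higher score means a closer match.
    """
    scores_by_song = defaultdict(list)
    for song_id, score in hits:
        scores_by_song[song_id].append(score)
    ranked = [
        (song_id, sum(sorted(scores, reverse=True)[:top_n]))
        for song_id, scores in scores_by_song.items()
    ]
    # Highest aggregate first; ties broken by song_id for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# s2 has the single best window, but s1 matches consistently across windows.
hits = [("s1", 0.90), ("s2", 0.95), ("s1", 0.80), ("s1", 0.70), ("s2", 0.10)]
ranking = aggregate_window_hits(hits, top_n=2)
```

Summing the best few windows rewards songs that match in several places, which a whole-track average embedding cannot express; the trade-off is that `top_n` becomes a parameter that needs tuning against query length.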
-## 6. Recommended ordering for this project
-### Current phase (already started)
-- Replace the low-dimensional speaker-style input with 128 Mel
-- band-split input layer
-- Stronger confusion augmentation
-- retrieval-first evaluation
-### Next phase
-- MERT / MuQ frozen feature baseline
-- triplet / multi-positive metric learning compared against SupCon
-- window-level index aggregation
-- Small-scale real-data validation on FMA / Jamendo
-- confusion-aware negative mining
-- Dedicated melody branch for humming / pitch contour rerank
-### Later phase
-- Dedicated melody tower for humming
-- foundation model + lightweight fingerprint head
-- Two-stage industrial retrieval: ANN + reranker
-## Sources
-- Araz et al., 2025, Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification: https://arxiv.org/abs/2506.22661
-- Singh et al., 2025, Robust Neural Audio Fingerprinting using Music Foundation Models: https://arxiv.org/abs/2511.05399
-- Li et al., 2023, MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training: https://arxiv.org/abs/2306.00107
-- Zhu et al., 2025, MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization: https://arxiv.org/abs/2501.01108
-- Luo & Yu, 2022, Music Source Separation with Band-split RNN: https://arxiv.org/abs/2209.15174
-- Mariani et al., 2018, BAGAN: Data Augmentation with Balancing GAN: https://arxiv.org/abs/1803.09655
--- a/docs/training-data-and-pgvector-guide.md deleted 100644 → 0
View file @4422297
+++ b/docs/training-data-and-pgvector-guide.md deleted 100644 → 0
View file @4422297