Make the fused Phase-1 ACR schema concrete with DDL samples
Constraint: Keep the storage design aligned to the current song-centric model while turning the 4-table fused schema into something engineers can directly review and implement.
Rejected: Keep only conceptual docs without concrete SQL | It leaves too much ambiguity about where slices, models, and features actually land.
Confidence: high
Scope-risk: narrow
Directive: Until the repository gains a production SQL file for the fused model, treat postgres_db_schema_samples.md as the authoritative DDL draft for media_entity/audio_object/feature_fact/set_membership.
Tested: git diff --check on touched files; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: Executing the fused DDL against a live PostgreSQL schema
Showing 4 changed files with 369 additions and 837 deletions
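
The traceback chain this commit documents, `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`, can be sketched as a single join. The snippet below is only a minimal illustration under stated assumptions, not part of the commit: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, trims the tables to the columns the join needs, and the helper names `build_demo` and `trace_feature`, the sample ids, and the `demo_model` value are all hypothetical. It does not replace running the fused DDL against a live PostgreSQL schema, which the commit message lists as not tested.

```python
import sqlite3

# Trimmed-down SQLite stand-ins for three of the four fused tables.
# Assumption: PostgreSQL-specific types (bigserial, jsonb, timestamptz)
# and most columns are dropped; only what the traceback join needs is kept.
SCHEMA = """
create table media_entity (
    entity_id   integer primary key,
    entity_type text not null check (entity_type in ('song', 'work', 'recording')),
    title       text not null
);
create table audio_object (
    object_id        integer primary key,
    object_type      text not null check (object_type in ('asset', 'window')),
    song_id          integer not null references media_entity(entity_id),
    parent_object_id integer references audio_object(object_id),
    start_ms         integer,
    end_ms           integer,
    check (
        (object_type = 'asset' and parent_object_id is null)
        or (object_type = 'window' and parent_object_id is not null)
    )
);
create table feature_fact (
    feature_id   integer primary key,
    feature_type text not null check (feature_type in ('fingerprint', 'embedding')),
    object_id    integer not null references audio_object(object_id),
    song_id      integer not null references media_entity(entity_id),
    model_name   text not null
);
"""

# feature_fact -> window -> asset -> song, one row per feature.
TRACEBACK = """
select f.feature_id, w.object_id, a.object_id, s.entity_id, s.title
from feature_fact f
join audio_object w on w.object_id = f.object_id and w.object_type = 'window'
join audio_object a on a.object_id = w.parent_object_id and a.object_type = 'asset'
join media_entity s on s.entity_id = a.song_id and s.entity_type = 'song'
where f.feature_id = ?
"""


def trace_feature(conn, feature_id):
    """Return (feature_id, window_id, asset_id, song_id, title), or None."""
    return conn.execute(TRACEBACK, (feature_id,)).fetchone()


def build_demo():
    """Create an in-memory database holding one song, asset, window, and feature."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("insert into media_entity values (1, 'song', 'Song 10001')")
    conn.execute("insert into audio_object values (10, 'asset', 1, null, null, null)")
    conn.execute("insert into audio_object values (11, 'window', 1, 10, 0, 8000)")
    conn.execute("insert into feature_fact values (100, 'embedding', 11, 1, 'demo_model')")
    return conn


if __name__ == "__main__":
    print(trace_feature(build_demo(), 100))
```

The same join should carry over to the PostgreSQL draft with the `?` placeholder swapped for the driver's parameter style, since it relies only on `object_id`, `parent_object_id`, `song_id`, and the two type columns that the DDL defines.
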
| 1 | ## 2026-06-04 | 1 | ## 2026-06-04 |
| 2 | 2 | ||
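| 3 | - Rewrote `docs/postgres_db_schema_samples.md` as a DDL draft for the current song-centric, fusion-first design, filling in the 4 core tables (`media_entity` / `audio_object` / `feature_fact` / `set_membership`), notes on what lands in which table, flowcharts, and common SQL examples. | ||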
| 4 | |||
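| 3 | - Added to `docs/postgresql-data-model.md` a table and flowchart for "which table slice data / models / features land in", making explicit that the current default traceback chain is `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`. | 5 | - Added to `docs/postgresql-data-model.md` a table and flowchart for "which table slice data / models / features land in", making explicit that the current default traceback chain is `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`. |
| 4 | - Narrowed `docs/README.md` to the entry point for the current song-centric design, and removed from the docs directory the template, open-data, business-export, and historical-roadmap documents unrelated to the current design. | 6 | - Narrowed `docs/README.md` to the entry point for the current song-centric design, and removed from the docs directory the template, open-data, business-export, and historical-roadmap documents unrelated to the current design. |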
| 5 | 7 | ... | ... |
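| 1 | # PostgreSQL DB Schema Samples / Persistence Samples and Live Test Pipeline | 1 | # PostgreSQL DB Schema Samples / Fusion-First DDL Draft and Query Samples |
| 2 | 2 | ||
| 3 | > Updated: 2026-06-04 | 3 | > Updated: 2026-06-04 |
| 4 | > Goal: give future development a PostgreSQL persistence sample that **can be followed directly**, while preserving the evidence from one real `pgvector` live test. | 4 | > Goal: turn the current **song-centric + fusion-first** design into a PostgreSQL DDL draft that can be reviewed directly and implemented further. |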
| 5 | 5 | ||
| 6 | --- | 6 | --- |
| 7 | 7 | ||
| 8 | ## One-Page Summary | 8 | ## One-Page Summary |
| 9 | 9 | ||
| 10 | The following has been completed on the user-provided PostgreSQL instance: | 10 | The current default physical model consists of just 4 tables: |
| 11 | |||
| 12 | 1. **Connected to a real PostgreSQL instance** | ||
| 13 | - DSN: `postgres://d2:***@127.0.0.1:5432/d2` | ||
| 14 | - PostgreSQL: `17.5` | ||
| 15 | - Confirmed that the `vector` extension is present | ||
| 16 | |||
| 17 | 2. **Applied schema v2 for real** | ||
| 18 | - Used an isolated schema: `acr_test` | ||
| 19 | - DDL source: `acr-engine/sql/acr_pg_schema_v2.sql` | ||
| 20 | - Successfully created the master-data, registry, embedding, candidate, decision, and other tables | ||
| 21 | |||
| 22 | 3. **Inserted a complete chain of sample data** | ||
| 23 | - `canonical_song -> work -> recording -> recording_asset -> audio_window` | ||
| 24 | - `model_registry -> feature_set_registry -> audio_embedding -> retrieval_index_registry` | ||
| 25 | - `reference_set_registry -> reference_set_member` | ||
| 26 | |||
| 27 | 4. **Ran one full round of PostgreSQL + pgvector retrieval evaluation** | ||
| 28 | - Input: `acr-engine/data/pgvector_eval/music20/*.jsonl` | ||
| 29 | - Output: `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` | ||
| 30 | - The live pgvector metrics **match** the existing FAISS stand-in metrics: | ||
| 31 | - overall `top1=0.9091` | ||
| 32 | - overall `top3=0.9545` | ||
| 33 | - `query_type=1`: `top1=1.0` | ||
| 34 | - `query_type=7`: `top1=0.0`, `top3=0.5` | ||
| 35 | |||
| 36 | 5. **Lineage triggers verified to work** | ||
| 37 | - The script deliberately constructed three kinds of bad lineage: | ||
| 38 | - `recording` | ||
| 39 | - `audio_window` | ||
| 40 | - `audio_embedding` | ||
| 41 | - PostgreSQL correctly rejected every insert | ||
| 42 | 11 | ||
| 43 | --- | 12 | ```text |
| 44 | 13 | media_entity -> audio_object -> feature_fact -> set_membership | |
| 45 | ## Live test assets used this time | ||
| 46 | |||
| 47 | ### Database | ||
| 48 | |||
| 49 | | Item | Value | | ||
| 50 | |---|---| | ||
| 51 | | Host | `127.0.0.1` | | ||
| 52 | | Port | `5432` | | ||
| 53 | | DB | `d2` | | ||
| 54 | | User | `d2` | | ||
| 55 | | PostgreSQL | `17.5` | | ||
| 56 | | Extensions | `vector`, `pg_trgm`, `ltree`, `hstore`, etc. | | ||
| 57 | | Schema for this test | `acr_test` | | ||
| 58 | |||
| 59 | ### Code and artifacts | ||
| 60 | |||
| 61 | | Type | Path | | ||
| 62 | |---|---| | ||
| 63 | | Recommended DDL | `acr-engine/sql/acr_pg_schema_v2.sql` | | ||
| 64 | | live test script | `acr-engine/scripts/live_pgvector_music20_eval.py` | | ||
| 65 | | registry bootstrap script | `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | | ||
| 66 | | live report | `acr-engine/data/pgvector_eval/music20/live_pgvector_report.json` | | ||
| 67 | | FAISS comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report_fresh.json` | | ||
| 68 | | registry bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | | ||
| 69 | | registry bootstrap idempotency report | `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_idempotency_report.json` | | ||
| 70 | | extraction job bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json` | | ||
| 71 | | extraction plan report | `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json` | | ||
| 72 | | reference member bootstrap report | `acr-engine/data/pgvector_eval/music20/phase1_reference_member_bootstrap_report.json` | | ||
| 73 | | chromaprint worker dry-run report | `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_dry_run.json` | | ||
| 74 | | embedding worker dry-run report | `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_dry_run.json` | | ||
| 75 | | manual job-status write-back report | `acr-engine/data/pgvector_eval/music20/phase1_worker_mark_pending_report.json` | | ||
| 76 | | double-claim guard report | `acr-engine/data/pgvector_eval/music20/phase1_worker_double_claim_guard_report.json` | | ||
| 77 | | historical comparison report | `acr-engine/data/pgvector_eval/music20/songid_eval_report.json` | | ||
| 78 | |||
| 79 | --- | ||
| 80 | |||
| 81 | ## The data chain actually written this time | ||
| 82 | |||
| 83 | ```mermaid | ||
| 84 | flowchart LR | ||
| 85 | A[reference_embeddings.jsonl] --> B[canonical_song] | ||
| 86 | B --> C[work] | ||
| 87 | C --> D[recording] | ||
| 88 | D --> E[recording_asset] | ||
| 89 | E --> F[audio_window] | ||
| 90 | F --> G[audio_embedding] | ||
| 91 | G --> H[audio_embedding_vector_192] | ||
| 92 | |||
| 93 | I[model_registry] --> J[feature_set_registry] | ||
| 94 | J --> G | ||
| 95 | |||
| 96 | K[reference_set_registry] --> L[reference_set_member] | ||
| 97 | D --> L | ||
| 98 | |||
| 99 | M[query_embeddings.jsonl] --> N[SQL pgvector search] | ||
| 100 | H --> N | ||
| 101 | N --> O[retrieval_candidate] | ||
| 102 | O --> P[match_decision] | ||
| 103 | ``` | 14 | ``` |
| 104 | 15 | ||
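| 105 | --- | 16 | Corresponding logical semantics: |
| 106 | |||
| 107 | ## Why this live test pads the 24-dim embedding to 192 dims | ||
| 108 | |||
| 109 | The current `schema v2` provides: | ||
| 110 | - `audio_embedding_vector_192` | ||
| 111 | - `audio_embedding_vector_768` | ||
| 112 | |||
| 113 | The local `music20` sample embeddings, however, are **24-dim chroma features**. | ||
| 114 | |||
| 115 | So the strategy used in this live test is: | ||
| 116 | |||
| 117 | - **Logical dimension**: `24` | ||
| 118 | - **Physical on-disk dimension**: `192` | ||
| 119 | - **Approach**: pad with trailing `0`s and write into `vector(192)` | ||
| 120 | |||
| 121 | Reasons for doing it this way: | ||
| 122 | - No ad hoc schema change is needed | ||
| 123 | - The schema v2 + pgvector + retrieval pipeline can still be validated | ||
| 124 | - It introduces no directional error into the cosine-similarity ranking for this sample batch (every vector is zero-padded the same way) | ||
| 125 | 17 | ||
| 126 | This is only a **pipeline-validation** usage. | 18 | ```text |
| 127 | 19 | song -> asset -> window -> fingerprint / embedding | |
| 128 | In production, choose according to the real encoder dimension: | ||
| 129 | - High-dim embeddings such as `MERT` / `MuQ`: write directly to the matching physical table | ||
| 130 | - If more dimensions appear later, consider extending to an `audio_embedding_vector_<dim>` bucketing strategy | ||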
| 131 | |||
| 132 | --- | ||
| 133 | |||
| 134 | ## Samples actually persisted this time | ||
| 135 | |||
| 136 | The following comes from real query results against the `acr_test` schema. | ||
| 137 | |||
| 138 | ### 1. canonical_song | ||
| 139 | |||
| 140 | ```json | ||
| 141 | {"canonical_song_id":1,"biz_song_code":"100","title":"Song 100","primary_artist":"Artist 100","rights_status":"protected"} | ||
| 142 | {"canonical_song_id":2,"biz_song_code":"101","title":"Song 101","primary_artist":"Artist 101","rights_status":"protected"} | ||
| 143 | ``` | 20 | ``` |
| 144 | 21 | ||
| 145 | ### 2. work | 22 | Where: |
| 23 | - `media_entity`: currently holds only `song` by default | ||
| 24 | - `audio_object`: holds both `asset` and `window` | ||
| 25 | - `feature_fact`: holds both `fingerprint` and `embedding` | ||
| 26 | - `set_membership`: holds set relationships such as `reference / hot / eval` | ||
| 146 | 27 | ||
| 147 | ```json | 28 | --- |
| 148 | {"work_id":1,"canonical_song_id":1,"work_code":"work-100","work_title":"Song 100","composer":"Composer 100"} | ||
| 149 | {"work_id":2,"canonical_song_id":2,"work_code":"work-101","work_title":"Song 101","composer":"Composer 101"} | ||
| 150 | ``` | ||
| 151 | |||
| 152 | ### 3. recording | ||
| 153 | |||
| 154 | ```json | ||
| 155 | {"recording_id":1,"work_id":1,"canonical_song_id":1,"recording_code":"rec-100","version_type":"master_reference","is_reference":true,"reference_priority":100} | ||
| 156 | {"recording_id":2,"work_id":2,"canonical_song_id":2,"recording_code":"rec-101","version_type":"master_reference","is_reference":true,"reference_priority":101} | ||
| 157 | ``` | ||
| 158 | |||
| 159 | ### 4. recording_asset | ||
| 160 | |||
| 161 | ```json | ||
| 162 | {"asset_id":1,"recording_id":1,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/100/type_11/93dfdeb0-7da5-42a8-9c71-cf12af57dd191650256918.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"} | ||
| 163 | {"asset_id":2,"recording_id":2,"asset_role":"reference_audio","storage_uri":"/workspace/downloads/101/type_11/83c0c07f-4f96-4ff4-998c-58db910f3cfa1650256915.wav","storage_scheme":"file","duration_sec":8.0,"ingest_status":"ready"} | ||
| 164 | ``` | ||
| 165 | 29 | ||
| 166 | ### 5. audio_window | 30 | ## 1. What each of the 4 tables stores |
| 167 | 31 | ||
| 168 | ```json | 32 | | Table | Current main types | What it stores | Why it exists | |
| 169 | {"window_id":1,"asset_id":1,"recording_id":1,"work_id":1,"canonical_song_id":1,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"} | 33 | |---|---|---|---| |
| 170 | {"window_id":2,"asset_id":2,"recording_id":2,"work_id":2,"canonical_song_id":2,"window_index":0,"start_sec":0.0,"end_sec":8.0,"segment_role":"reference","segment_type":"full_clip"} | 34 | | `media_entity` | `song` | Primary song entity | `song_id` is the final attribution target | |
| 171 | ``` | 35 | | `audio_object` | `asset`, `window` | Raw audio files + slices | A song can have multiple audio files, and slices still need evidence | |
| 36 | | `feature_fact` | `fingerprint`, `embedding` | Model, feature set, and feature results | Unifies exact/semantic feature facts | ||
| 37 | | `set_membership` | `reference_set`, `eval_set`, `hot_set` | Who belongs to which set | Manages reference and evaluation scope | ||
| 172 | 38 | ||
| 173 | ### 6. model_registry / feature_set_registry | 39 | --- |
| 174 | 40 | ||
| 175 | ```json | 41 | ## 2. Currently recommended DDL draft |
| 176 | {"model_id":1,"model_name":"local_chroma24","model_family":"chroma_baseline","model_version":"v1","output_embedding_dim":24,"default_window_sec":8.0} | 42 | |
| 177 | {"feature_set_id":1,"model_id":1,"feature_name":"chroma24_songid_eval","embedding_dim":24,"distance_metric":"cosine","feature_schema_version":"v1"} | 43 | ### 2.1 `media_entity` |
| 44 | |||
| 45 | ```sql | ||
| 46 | create table if not exists media_entity ( | ||
| 47 | entity_id bigserial primary key, | ||
| 48 | entity_type text not null check (entity_type in ('song', 'work', 'recording')), | ||
| 49 | root_song_id bigint, | ||
| 50 | parent_entity_id bigint, | ||
| 51 | biz_key text, | ||
| 52 | title text not null, | ||
| 53 | artist_name text, | ||
| 54 | entity_status text not null default 'active', | ||
| 55 | metadata_json jsonb not null default '{}'::jsonb, | ||
| 56 | created_at timestamptz not null default now(), | ||
| 57 | updated_at timestamptz not null default now(), | ||
| 58 | constraint fk_media_entity_root_song | ||
| 59 | foreign key (root_song_id) references media_entity(entity_id), | ||
| 60 | constraint fk_media_entity_parent | ||
| 61 | foreign key (parent_entity_id) references media_entity(entity_id) | ||
| 62 | ); | ||
| 63 | |||
| 64 | create unique index if not exists uq_media_entity_song_biz_key | ||
| 65 | on media_entity(entity_type, biz_key) | ||
| 66 | where biz_key is not null; | ||
| 67 | |||
| 68 | create index if not exists idx_media_entity_root_song | ||
| 69 | on media_entity(root_song_id); | ||
| 178 | ``` | 70 | ``` |
| 179 | 71 | ||
| 180 | ### 7. audio_embedding | 72 | ### 2.2 `audio_object` |
| 181 | 73 | ||
| 182 | ```json | 74 | ```sql |
| 183 | {"embedding_id":1,"feature_set_id":1,"asset_id":1,"window_id":1,"recording_id":1,"canonical_song_id":1,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true} | 75 | create table if not exists audio_object ( |
| 184 | {"embedding_id":2,"feature_set_id":1,"asset_id":2,"window_id":2,"recording_id":2,"canonical_song_id":2,"embedding_storage_mode":"pgvector_inline_192_padded","is_indexed":true} | 76 | object_id bigserial primary key, |
| 77 | object_type text not null check (object_type in ('asset', 'window')), | ||
| 78 | song_id bigint not null references media_entity(entity_id), | ||
| 79 | parent_object_id bigint references audio_object(object_id), | ||
| 80 | source_type text, | ||
| 81 | storage_uri text, | ||
| 82 | storage_scheme text, | ||
| 83 | checksum text, | ||
| 84 | codec text, | ||
| 85 | sample_rate integer, | ||
| 86 | channels integer, | ||
| 87 | duration_ms integer, | ||
| 88 | start_ms integer, | ||
| 89 | end_ms integer, | ||
| 90 | object_status text not null default 'ready', | ||
| 91 | metadata_json jsonb not null default '{}'::jsonb, | ||
| 92 | created_at timestamptz not null default now(), | ||
| 93 | updated_at timestamptz not null default now(), | ||
| 94 | constraint ck_audio_object_window_parent | ||
| 95 | check ( | ||
| 96 | (object_type = 'asset' and parent_object_id is null) | ||
| 97 | or (object_type = 'window' and parent_object_id is not null) | ||
| 98 | ) | ||
| 99 | ); | ||
| 100 | |||
| 101 | create index if not exists idx_audio_object_song_type | ||
| 102 | on audio_object(song_id, object_type); | ||
| 103 | |||
| 104 | create index if not exists idx_audio_object_parent | ||
| 105 | on audio_object(parent_object_id); | ||
| 106 | |||
| 107 | create unique index if not exists uq_audio_object_asset_checksum | ||
| 108 | on audio_object(song_id, checksum) | ||
| 109 | where object_type = 'asset' and checksum is not null; | ||
| 110 | |||
| 111 | create unique index if not exists uq_audio_object_window_range | ||
| 112 | on audio_object(parent_object_id, start_ms, end_ms) | ||
| 113 | where object_type = 'window'; | ||
| 185 | ``` | 114 | ``` |
| 186 | 115 | ||
| 187 | ### 8. reference_set_registry / retrieval_index_registry | 116 | ### 2.3 `feature_fact` |
| 188 | 117 | ||
| 189 | ```json | 118 | ```sql |
| 190 | {"reference_set_id":1,"set_name":"music20_live_reference","encoder_scope":"local_chroma24","status":"active"} | 119 | create table if not exists feature_fact ( |
| 191 | {"retrieval_index_id":1,"feature_set_id":1,"index_name":"music20_live_pgvector_hnsw","index_backend":"pgvector","index_type":"hnsw_cosine","row_count":20,"index_status":"active"} | 120 | feature_id bigserial primary key, |
| 121 | feature_type text not null check (feature_type in ('fingerprint', 'embedding')), | ||
| 122 | object_id bigint not null references audio_object(object_id), | ||
| 123 | song_id bigint not null references media_entity(entity_id), | ||
| 124 | model_name text not null, | ||
| 125 | model_version text not null, | ||
| 126 | feature_set_name text not null, | ||
| 127 | feature_schema_ver text not null default 'v1', | ||
| 128 | embedding_dim integer, | ||
| 129 | fingerprint_value text, | ||
| 130 | embedding_uri text, | ||
| 131 | vector_table_name text, | ||
| 132 | checksum text, | ||
| 133 | feature_status text not null default 'ready', | ||
| 134 | metadata_json jsonb not null default '{}'::jsonb, | ||
| 135 | created_at timestamptz not null default now(), | ||
| 136 | updated_at timestamptz not null default now(), | ||
| 137 | constraint ck_feature_payload | ||
| 138 | check ( | ||
| 139 | (feature_type = 'fingerprint' and fingerprint_value is not null) | ||
| 140 | or (feature_type = 'embedding' and (embedding_uri is not null or vector_table_name is not null)) | ||
| 141 | ) | ||
| 142 | ); | ||
| 143 | |||
| 144 | create index if not exists idx_feature_fact_object_type | ||
| 145 | on feature_fact(object_id, feature_type); | ||
| 146 | |||
| 147 | create index if not exists idx_feature_fact_song_type | ||
| 148 | on feature_fact(song_id, feature_type); | ||
| 149 | |||
| 150 | create unique index if not exists uq_feature_fact_embedding | ||
| 151 | on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type) | ||
| 152 | where feature_type = 'embedding'; | ||
| 153 | |||
| 154 | create unique index if not exists uq_feature_fact_fingerprint | ||
| 155 | on feature_fact(object_id, model_name, model_version, feature_set_name, feature_type) | ||
| 156 | where feature_type = 'fingerprint'; | ||
| 192 | ``` | 157 | ``` |
| 193 | 158 | ||
| 194 | ### 9. retrieval_candidate / match_decision | 159 | ### 2.4 `set_membership` |
| 195 | 160 | ||
| 196 | ```json | 161 | ```sql |
| 197 | {"retrieval_candidate_id":1,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":1,"raw_score":0.99998549,"normalized_score":0.90998694,"rank_no":1} | 162 | create table if not exists set_membership ( |
| 198 | {"retrieval_candidate_id":2,"query_id":"music20-q0000-t1-song100","source_lane":"semantic","candidate_level":"canonical_song","candidate_id":17,"raw_score":0.9527432,"normalized_score":0.86746888,"rank_no":2} | 163 | membership_id bigserial primary key, |
| 199 | {"match_decision_id":1,"query_id":"music20-q0000-t1-song100","canonical_song_id":1,"decision_status":"matched","decision_score":0.90998694} | 164 | set_type text not null check (set_type in ('reference_set', 'eval_set', 'hot_set')), |
| 165 | set_name text not null, | ||
| 166 | member_type text not null check (member_type in ('song', 'asset', 'window', 'feature')), | ||
| 167 | member_id bigint not null, | ||
| 168 | song_id bigint references media_entity(entity_id), | ||
| 169 | is_active boolean not null default true, | ||
| 170 | priority integer not null default 100, | ||
| 171 | metadata_json jsonb not null default '{}'::jsonb, | ||
| 172 | created_at timestamptz not null default now(), | ||
| 173 | updated_at timestamptz not null default now() | ||
| 174 | ); | ||
| 175 | |||
| 176 | create unique index if not exists uq_set_membership_unique | ||
| 177 | on set_membership(set_type, set_name, member_type, member_id); | ||
| 178 | |||
| 179 | create index if not exists idx_set_membership_set_lookup | ||
| 180 | on set_membership(set_type, set_name, is_active, priority); | ||
| 200 | ``` | 181 | ``` |
| 201 | 182 | ||
| 202 | --- | 183 | --- |
| 203 | 184 | ||
| 204 | ## Table sizes in this live test | 185 | ## 3. Which table slices / models / features actually land in |
| 205 | 186 | ||
| 206 | | Table | Rows | | 187 | | Object | Table | Key fields | |
| 207 | |---|---:| | 188 | |---|---|---| |
| 208 | | `canonical_song` | 20 | | 189 | | song | `media_entity` | `entity_type='song'` | |
| 209 | | `work` | 20 | | 190 | | Raw audio | `audio_object` | `object_type='asset'` | |
| 210 | | `recording` | 20 | | 191 | | Slice window | `audio_object` | `object_type='window'`, `parent_object_id=<asset_id>` | |
| 211 | | `recording_asset` | 20 | | 192 | | Fingerprint feature | `feature_fact` | `feature_type='fingerprint'` | |
| 212 | | `audio_window` | 20 | | 193 | | Embedding feature | `feature_fact` | `feature_type='embedding'` | |
| 213 | | `audio_embedding` | 20 | | 194 | | Model name/version | `feature_fact` | `model_name`, `model_version` | |
| 214 | | `retrieval_candidate` | 220 | | 195 | | feature set | `feature_fact` | `feature_set_name`, `feature_schema_ver` | |
| 215 | | `match_decision` | 22 | | 196 | | Reference set membership | `set_membership` | `set_type='reference_set'` | |
| 216 | |||
| 217 | Notes: | ||
| 218 | - 20 reference songs | ||
| 219 | - 22 queries | ||
| 220 | - Each query writes its top-10 candidates, hence `22 * 10 = 220` | ||
| 221 | |||
| 222 | --- | ||
| 223 | |||
| 224 | ## Test pipeline and logic this time | ||
| 225 | |||
| 226 | ### A. Schema / data integrity test | ||
| 227 | |||
| 228 | 1. Connect to PostgreSQL | ||
| 229 | 2. Create an isolated schema: `acr_test` | ||
| 230 | 3. Execute `acr_pg_schema_v2.sql` | ||
| 231 | 4. Initialize: | ||
| 232 | - `model_registry` | ||
| 233 | - `feature_set_registry` | ||
| 234 | - `reference_set_registry` | ||
| 235 | - `retrieval_index_registry` | ||
| 236 | 5. Import 20 reference samples | ||
| 237 | 6. Verify that the table counts are correct | ||
| 238 | 7. Deliberately insert three kinds of bad lineage: | ||
| 239 | - `recording.canonical_song_id` does not match `work.canonical_song_id` | ||
| 240 | - `audio_window.recording_id` does not match `recording_asset.recording_id` | ||
| 241 | - the `canonical_song_id` of `audio_embedding` does not match its parent `audio_window` | ||
| 242 | 8. Expect the PostgreSQL triggers to reject these bad writes | ||
| 243 | |||
| 244 | ### B. Live retrieval evaluation test | ||
| 245 | |||
| 246 | 1. Read 20 reference embeddings from `reference_embeddings.jsonl` | ||
| 247 | 2. Write them into `audio_embedding` + `audio_embedding_vector_192` | ||
| 248 | 3. Read 22 query embeddings from `query_embeddings.jsonl` | ||
| 249 | 4. For each query, run a `pgvector cosine` search in SQL | ||
| 250 | 5. Do song-level aggregation in the application layer: | ||
| 251 | - `max_sim` | ||
| 252 | - `top3_avg` | ||
| 253 | - `vote` | ||
| 254 | - `combined = 0.6 * max_sim + 0.3 * top3_avg + 0.1 * vote_factor` | ||
| 255 | 6. Persist the top-10 candidates to `retrieval_candidate` | ||
| 256 | 7. Persist the top-1 decision to `match_decision` | ||
| 257 | 8. Compute: | ||
| 258 | - overall `top1/top3/top10/mrr` | ||
| 259 | - `by_query_type` | ||
| 260 | - `confusion_focus` | ||
| 261 | |||
| 262 | ### C. Confusion test scope | ||
| 263 | |||
| 264 | The current live sample actually contains only: | ||
| 265 | - `type_1` | ||
| 266 | 197 | ||
| 267 | --- | 198 | --- |
| 268 | 199 | ||
| 269 | ## Phase-1 worker dry-run test pipeline (new) | 200 | ## 4. Flowcharts |
| 270 | |||
| 271 | This step addresses: | ||
| 272 | |||
| 273 | > Although the planner could already emit copy-pasteable commands, the repository previously had no real worker to consume them. | ||
| 274 | 201 | ||
| 275 | A minimal real worker has now been added: | 202 | ### 4.1 Write flow |
| 276 | |||
| 277 | - `acr-engine/workers/mark_job_status.py` | ||
| 278 | - `acr-engine/workers/run_chromaprint_job.py` | ||
| 279 | - `acr-engine/workers/run_embedding_job.py` | ||
| 280 | |||
| 281 | ### Test goal | ||
| 282 | |||
| 283 | Verify that the following chain really works end to end: | ||
| 284 | 203 | ||
| 285 | ```mermaid | 204 | ```mermaid |
| 286 | flowchart TD | 205 | flowchart TD |
| 287 | A[feature_extraction_job pending] --> B[planner generates command template] | 206 | A[media_entity\nentity_type=song] --> B[audio_object\nobject_type=asset] |
| 288 | B --> C[worker reads extraction_job_id] | 207 | B --> C[audio_object\nobject_type=window] |
| 289 | C --> D[worker resolves feature/model/scope] | 208 | C --> D1[feature_fact\nfeature_type=fingerprint] |
| 290 | D --> E[worker writes back running/completed] | 209 | C --> D2[feature_fact\nfeature_type=embedding] |
| 291 | E --> F[bootstrap script can restore pending again] | 210 | B --> E[set_membership\nreference_set] |
| 211 | C --> E | ||
| 292 | ``` | 212 | ``` |
| 293 | 213 | ||
| 294 | ### Current validation scope | 214 | ### 4.2 Query traceback flow |
| 295 | |||
| 296 | This round does not run real model inference yet; it first validates the production execution surface: | ||
| 297 | |||
| 298 | 1. `run_chromaprint_job.py` | ||
| 299 | - Connects to a real PostgreSQL instance | ||
| 300 | - Reads `feature_extraction_job=1` | ||
| 301 | - Resolves `reference_set:phase1_hot_reference_v1` | ||
| 302 | - Writes back `running -> completed` | ||
| 303 | |||
| 304 | 2. `run_embedding_job.py` | ||
| 305 | - Connects to a real PostgreSQL instance | ||
| 306 | - Reads `feature_extraction_job=2` | ||
| 307 | - Resolves `mert v1-95m` | ||
| 308 | - Writes back `running -> completed` | ||
| 309 | |||
| 310 | 3. Run `bootstrap_phase1_extraction_jobs_live.py` again | ||
| 311 | - Restores the job status to `pending` | ||
| 312 | - Ensures later sessions can continue from the same batch of jobs | ||
| 313 | |||
| 314 | 4. `plan_phase1_extraction_jobs_live.py` | ||
| 315 | - The generated primary command template now explicitly includes: | ||
| 316 | - `cd /workspace/acr-engine &&` | ||
| 317 | - `PG_DSN="${PG_DSN:?set PG_DSN}"` | ||
| 318 | - `--complete-dry-run` | ||
| 319 | - So `primary_command` can already directly reproduce the current dry-run state transitions | ||
| 320 | |||
| 321 | ### Why do a dry-run first | ||
| 322 | |||
| 323 | Because the top priority right now is to pin down the following: | ||
| 324 | |||
| 325 | - job contract | ||
| 326 | - status transitions | ||
| 327 | - scope resolution | ||
| 328 | - planner -> worker command compatibility | ||
| 329 | |||
| 330 | Once this skeleton is stable, plugging in the real: | ||
| 331 | - chromaprint extraction | ||
| 332 | - MERT / MuQ embedding extraction | ||
| 333 | |||
| 334 | carries lower overall risk. | ||
| 335 | |||
| 336 | ### Key updates to the current live results | ||
| 337 | |||
| 338 | This round added: | ||
| 339 | |||
| 340 | - `acr-engine/scripts/bootstrap_phase1_reference_members_live.py` | ||
| 341 | |||
| 342 | and has actually attached `20` reference recordings to `acr_test.phase1_hot_reference_v1`, so the scope the worker dry-run now sees is: | ||
| 343 | |||
| 344 | - `recording_count=20` | ||
| 345 | - `ready_asset_count=20` | ||
| 346 | - `active_window_count=20` | ||
| 347 | |||
| 348 | This shows that validation has moved from an "empty-scope state-machine demo" to: | ||
| 349 | |||
| 350 | - planner -> worker command compatibility | ||
| 351 | - working worker -> PostgreSQL state transitions | ||
| 352 | - working reference_set -> recording/asset/window scope resolution | ||
| 353 | |||
| 354 | Still note: | ||
| 355 | |||
| 356 | - This is still a **dry-run** | ||
| 357 | - It is **not** yet a throughput validation of real feature extraction | ||
| 358 | |||
| 359 | ### Current concurrency/retry guard validation | ||
| 360 | |||
| 361 | This round also ran a deliberate repeated-execution test: | ||
| 362 | |||
| 363 | 1. First take `feature_extraction_job=1` through `pending -> running -> completed` | ||
| 364 | 2. Without a reset, run the same chromaprint dry-run worker again | ||
| 365 | 3. Expect the second run to fail, because the worker requires the following when claiming a job: | ||
| 366 | - `expected_status = pending` | ||
| 367 | |||
| 368 | See the actual result in: | ||
| 369 | |||
| 370 | - `phase1_worker_double_claim_guard_report.json` | ||
| 371 | |||
| 372 | Key evidence: | ||
| 373 | |||
| 374 | - `double_claim_exit_code = 1` | ||
| 375 | - `stderr = failed to update feature_extraction_job=1 with expected_status=pending` | ||
| 376 | |||
| 377 | This shows the current minimal worker contract already has: | ||
| 378 | |||
| 379 | - a basic claim guard | ||
| 380 | - basic repeated-execution protection | ||
| 381 | |||
| 382 | --- | ||
| 383 | |||
| 384 | ## Exact lane non-dry-run write attempt (new) | ||
| 385 | |||
| 386 | This round went one step further: | ||
| 387 | |||
| 388 | > `run_chromaprint_job.py` is no longer dry-run only. | ||
| 389 | |||
| 390 | Current behavior: | ||
| 391 | |||
| 392 | 1. If the audio file for a reference asset is readable: | ||
| 393 | - Extract a repo-local chromaprint-style hash | ||
| 394 | - Write the artifact JSON | ||
| 395 | - Write `audio_fingerprint` | ||
| 396 | - Mark the job as `completed` | ||
| 397 | |||
| 398 | 2. If the audio file for a reference asset is not readable: | ||
| 399 | - Mark the job as `failed` | ||
| 400 | - Write into `metadata_json`: | ||
| 401 | - `failure_reason` | ||
| 402 | - `missing_asset_count` | ||
| 403 | - `missing_asset_samples` | ||
| 404 | |||
| 405 | ### Live results this round | ||
| 406 | |||
| 407 | Reports: | ||
| 408 | |||
| 409 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_write_attempt.json` | ||
| 410 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_chromaprint_write_guard_report.json` | ||
| 411 | |||
| 412 | Key results: | ||
| 413 | |||
| 414 | - `scope_asset_count = 20` | ||
| 415 | - `processed_assets = 0` | ||
| 416 | - `missing_assets = 20` | ||
| 417 | - `job_status = failed` | ||
| 418 | - `failure_reason = unreadable_audio_assets` | ||
| 419 | - `audio_fingerprint_count = 0` | ||
| 420 | |||
| 421 | ### What this shows | ||
| 422 | |||
| 423 | It shows that the exact lane's PostgreSQL worker contract already has: | ||
| 424 | |||
| 425 | - a real, non-dry-run write path | ||
| 426 | - explicit persistence of failures | ||
| 427 | - auditable error evidence when the environment is incomplete | ||
| 428 | - "all succeed / otherwise fail" batch semantics | ||
| 429 | - the constraint basis for atomic upserts on `audio_fingerprint(feature_set_id, asset_id)` | ||
| 430 | |||
| 431 | But the current container still lacks: | ||
| 432 | |||
| 433 | - the actual audio files under `/workspace/downloads/...` | ||
| 434 | |||
| 435 | So what this round proves is: | ||
| 436 | |||
| 437 | - **the worker write path is wired up** | ||
| 438 | - **it is currently blocked by the environment's data mount** | ||
| 439 | |||
| 440 | not that the exact lane logic itself has yet to land. | ||
| 441 | - `type_7` | ||
| 442 | |||
| 443 | Therefore: | ||
| 444 | - `type_7` can serve as the **current live confusion check** | ||
| 445 | - `type_8 / type_16` are not covered by this live JSONL and can only be read together with historical business-sample results | ||
| 446 | |||
| 447 | --- | ||
| 448 | |||
| 449 | ## Live pgvector results | ||
| 450 | |||
| 451 | ### 1. overall | ||
| 452 | |||
| 453 | | Metric | Value | | ||
| 454 | |---|---:| | ||
| 455 | | Queries | 22 | | ||
| 456 | | top1 | `0.9091` | | ||
| 457 | | top3 | `0.9545` | | ||
| 458 | | top10 | `0.9545` | | ||
| 459 | | MRR | `0.9343` | | ||
| 460 | | mean rank | `1.8182` | | ||
| 461 | |||
| 462 | ### 2. by query type | ||
| 463 | |||
| 464 | | query_type | count | top1 | top3 | top10 | Notes | | ||
| 465 | |---|---:|---:|---:|---:|---| | ||
| 466 | | `1` | 20 | `1.0` | `1.0` | `1.0` | clean / near-clean | | ||
| 467 | | `7` | 2 | `0.0` | `0.5` | `0.5` | current live confusion sample | | ||
| 468 | | `8` | 0 | N/A | N/A | N/A | not covered by this live JSONL | | ||
| 469 | | `16` | 0 | N/A | N/A | N/A | not covered by this live JSONL | | ||
| 470 | |||
| 471 | ### 3. Consistency with the existing FAISS stand-in | ||
| 472 | |||
| 473 | | Path | overall top1 | overall top3 | type_1 top1 | type_7 top1 | type_7 top3 | | ||
| 474 | |---|---:|---:|---:|---:|---:| | ||
| 475 | | live PostgreSQL + pgvector | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` | | ||
| 476 | | FAISS stand-in | `0.9091` | `0.9545` | `1.0` | `0.0` | `0.5` | | ||
| 477 | |||
| 478 | Conclusion: | ||
| 479 | |||
| 480 | > The live pgvector path on `acr_test` is now aligned with the existing stand-in retrieval logic. | ||
| 481 | > The problem is not that "persisting to PostgreSQL degrades recall"; the current sample embeddings are simply not strong enough for confusion-type queries. | ||
| 482 | |||
| 483 | --- | ||
| 484 | |||
| 485 | ## Added this round: full negative-case coverage of the lineage triggers | ||
| 486 | |||
| 487 | After re-running the live script this round, `lineage_negative_test` in `live_pgvector_report.json` was upgraded from "verifying a single audio_window case" to "verifying all three kinds of bad writes": | ||
| 488 | |||
| 489 | | case | Result | PostgreSQL response | | ||
| 490 | |---|---|---| | ||
| 491 | | `recording_lineage_mismatch` | Rejected | `recording.canonical_song_id ... mismatches work.canonical_song_id ...` | | ||
| 492 | | `audio_window_lineage_mismatch` | Rejected | `Invalid asset_id=... or recording_id=... for audio_window` | | ||
| 493 | | `audio_embedding_lineage_mismatch` | Rejected | `audio_embedding lineage mismatch` | | ||
| 494 | |||
| 495 | This means: | ||
| 496 | |||
| 497 | > All three core lineage triggers in the current schema v2 now have real negative-case evidence, rather than merely "existing in theory". | ||
| 498 | |||
| 499 | This round also added two pieces of mechanical verification evidence: | ||
| 500 | - `py_compile` passed: `live_pgvector_music20_eval.py` | ||
| 501 | - `git diff --check` passed: no formatting issues in this round's script, report, and doc changes | ||
| 502 | |||
| 503 | --- | ||
| 504 | |||
| 505 | ## Supplementary views of the confusion tests | ||
| 506 | |||
| 507 | ### 1. Current live sample view | ||
| 508 | |||
| 509 | | query_type | Data source | top1 | top3 | Conclusion | | ||
| 510 | |---|---|---:|---:|---| | ||
| 511 | | `7` | `live_pgvector_report.json` | `0.0` | `0.5` | clearly weak | | ||
| 512 | |||
| 513 | ### 2. Historical local 20-song small-sample view | ||
| 514 | |||
| 515 | Source: `acr-engine/data/local_eval/music20_summary.json` | ||
| 516 | |||
| 517 | | query_type | top1 | top3 | | ||
| 518 | |---|---:|---:| | ||
| 519 | | `1` | `1.0` | `1.0` | | ||
| 520 | | `7` | `0.45` | `0.65` | | ||
| 521 | | `8` | `0.4667` | `0.7333` | | ||
| 522 | | `16` | `0.4167` | `0.4167` | | ||
| 523 | |||
| 524 | Notes: | ||
| 525 | - This is the result of the **local small-sample chroma/FAISS sanity flow** | ||
| 526 | - It beats type_7 in the current live JSONL because the sample composition differs | ||
| 527 | - It must not be read as production performance, but it is supporting evidence that "the current features are not completely unusable on small samples" | ||
| 528 | |||
| 529 | ### 3. Historical business-corpus voice correctness view | ||
| 530 | |||
| 531 | | query_type | File | top1 | top3 | Conclusion | | ||
| 532 | |---|---|---:|---:|---| | ||
| 533 | | `7` | `voice_workspace20_type7_eval.json` | `0.0` | `0.05` | very weak | | ||
| 534 | | `8` | `voice_workspace20_type8_eval.json` | `0.0` | `0.0` | very weak | | ||
| 535 | | `16` | `voice_workspace20_type16_eval.json` | `0.0` | `0.0` | very weak | | ||
| 536 | |||
| 537 | Conclusion: | ||
| 538 | 215 | ||
| 539 | > As soon as queries move to more realistic, more confusable business samples, the current baseline is still far from sufficient. | 216 | ```mermaid |
| 540 | > PostgreSQL persistence is fine; the real problem remains that **the embedding lane lacks discriminative power on hard cases**. | 217 | flowchart LR |
| 541 | 218 | A[feature_fact] --> B[audio_object window] | |
| 542 | --- | 219 | B --> C[audio_object asset] |
| 543 | 220 | C --> D[media_entity song] | |
| 544 | ## What this round verified and what it did not | 221 | ``` |
| 545 | |||
| 546 | ### Verified | ||
| 547 | |||
| 548 | - Real PostgreSQL connectivity works | ||
| 549 | - The `vector` extension is available | ||
| 550 | - schema v2 can be applied for real | ||
| 551 | - The main lineage triggers really do block bad data | ||
| 552 | - The sample data chain can be persisted as `song -> work -> recording -> asset -> window -> embedding` | ||
| 553 | - Live pgvector retrieval is consistent with the existing stand-in logic | ||
| 554 | - `retrieval_candidate` / `match_decision` can actually hold online results | ||
| 555 | - The semantic worker's preflight-failure semantics have been verified for real: it detects both a missing `/workspace/downloads` and missing `torch/torchaudio/transformers` | ||
| 556 | - `audio_embedding` now has idempotent unique keys on both the window and asset paths, reserving stable primary keys for real encoder upserts later | ||
| 557 | 222 | ||
| 558 | ### Not verified | 223 | ### 4.3 Write sequence diagram |
| 559 | 224 | ||
| 560 | - `MERT` / `MuQ` have not yet actually been wired into this live path | 225 | ```mermaid |
| 561 | - This live sample does not cover JSONL embeddings for `type_8 / type_16` | 226 | sequenceDiagram |
| 562 | - Only the 20-song scale was verified this time; it says nothing about index performance at 300k songs | 227 | participant ING as Ingest/Extract Job |
| 563 | - No aggregation tests yet for multiple recordings / multiple versions / the cover lane | 228 | participant DB as PostgreSQL |
| 229 | |||
| 230 | ING->>DB: insert media_entity(song) | ||
| 231 | ING->>DB: insert audio_object(asset) | ||
| 232 | ING->>DB: insert audio_object(window) | ||
| 233 | ING->>DB: insert feature_fact(fingerprint) | ||
| 234 | ING->>DB: insert feature_fact(embedding) | ||
| 235 | ING->>DB: insert set_membership(reference_set) | ||
| 236 | ``` | ||
| 564 | 237 | ||
| 565 | --- | 238 | --- |
| 566 | 239 | ||
| 567 | ## Recommended next steps | 240 | ## 5. Most common SQL samples |
| 568 | |||
| 569 | ### Added this round: the Phase-1 registry can now be bootstrapped live | ||
| 570 | |||
| 571 | Besides the live retrieval script, this round also adds: | ||
| 572 | |||
| 573 | - `acr-engine/scripts/bootstrap_phase1_model_registry_live.py` | ||
| 574 | |||
| 575 | It has already written the following into the `acr_test` schema: | ||
| 576 | - `chromaprint` | ||
| 577 | - `mert` | ||
| 578 | - `muq` | ||
| 579 | - `ecapa` | ||
| 580 | - the corresponding feature sets | ||
| 581 | - `phase1_hot_reference_v1` | ||
| 582 | 241 | ||
| 583 | Corresponding live report: | 242 | ### 5.1 Insert a song |
| 584 | - `acr-engine/data/pgvector_eval/music20/phase1_registry_bootstrap_report.json` | ||
| 585 | 243 | ||
| 586 | ### Also added this round: Phase-1 extraction jobs can now be bootstrapped live | 244 | ```sql |
| 587 | 245 | insert into media_entity (entity_type, biz_key, title, artist_name) | |
| 588 | After the registry bootstrap, this round also adds: | 246 | values ('song', 'song-10001', 'Song 10001', 'Artist A') |
| 589 | 247 | returning entity_id; | |
| 590 | - `acr-engine/scripts/bootstrap_phase1_extraction_jobs_live.py` | 248 | ``` |
| 591 | |||
| 592 | It has already created 5 `feature_extraction_job` rows in the `acr_test` schema: | ||
| 593 | - `chromaprint` | ||
| 594 | - `mert 5s/2.5s` | ||
| 595 | - `mert 10s/5s` | ||
| 596 | - `muq 5s/2.5s` | ||
| 597 | - `ecapa 5s/2.5s` | ||
| 598 | |||
| 599 | Corresponding live report: | ||
| 600 | - `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json` | ||
| 601 | |||
| 602 | ### Also added this round: pending jobs can now produce a live execution plan | ||
| 603 | |||
| 604 | After the extraction jobs, this round also adds: | ||
| 605 | |||
| 606 | - `acr-engine/scripts/plan_phase1_extraction_jobs_live.py` | ||
| 607 | |||
| 608 | It has already read the 5 `pending` jobs from the `acr_test` schema and generated a plan ordered by execution sequence: | ||
| 609 | - the `chromaprint exact lane` goes first | ||
| 610 | - followed by the `mert / muq / ecapa` semantic lanes | ||
| 611 | |||
| 612 | Corresponding live report: | ||
| 613 | - `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json` | ||
| 614 | |||
| 615 | With this round's additions, the plan also includes: | ||
| 616 | - `command_suggestions` | ||
| 617 | - `primary_command` | ||
| 618 | |||
| 619 | In other words, pending jobs in PostgreSQL can now lead straight to a copyable execution command template. | ||
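The ordering rule described above (exact lane first, then the semantic lanes) can be sketched in a few lines. This is only an illustration under stated assumptions: `PendingJob`, `LANE_PRIORITY`, and `build_plan` are hypothetical names, and the real `plan_phase1_extraction_jobs_live.py` may order and represent jobs differently.

```python
from dataclasses import dataclass

# Hypothetical lane priority: a lower value is scheduled earlier.
LANE_PRIORITY = {"exact": 0, "semantic": 1}


@dataclass(frozen=True)
class PendingJob:
    job_id: int
    model_name: str
    lane: str  # "exact" or "semantic" (assumed labels)


def build_plan(jobs):
    """Order pending jobs: exact lane first, then semantic lanes by job_id."""
    return sorted(jobs, key=lambda j: (LANE_PRIORITY[j.lane], j.job_id))


# Five pending jobs, deliberately shuffled.
jobs = [
    PendingJob(3, "mert", "semantic"),
    PendingJob(1, "chromaprint", "exact"),
    PendingJob(5, "ecapa", "semantic"),
    PendingJob(2, "mert", "semantic"),
    PendingJob(4, "muq", "semantic"),
]
plan = build_plan(jobs)
print([f"{j.job_id}:{j.model_name}" for j in plan])
```

Sorting on a `(lane priority, job_id)` tuple keeps the plan deterministic, which matters if the plan is later turned into command templates.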
| 620 | |||
| 621 | ### Route 1: continue PostgreSQL engineering | ||
| 622 | |||
| 623 | 1. Generalize `live_pgvector_music20_eval.py` so that it can: | ||
| 624 | - import any manifest/reference set | ||
| 625 | - select the encoder / feature set | ||
| 626 | - generate `retrieval_candidate` / `match_decision` reports directly | ||
| 627 | 2. Add: | ||
| 628 | - `audio_embedding_vector_1024` / tables for other common dimensions | ||
| 629 | - bulk COPY / batched insert | ||
| 630 | - HNSW parameter management | ||
| 631 | |||
| 632 | ### Route 2: continue validating quality on confusable queries | ||
| 633 | |||
| 634 | 1. Build query embedding JSONL that actually covers `type_8 / type_16` | ||
| 635 | 2. Rerun the PostgreSQL evaluation with the same live script | ||
| 636 | 3. Compare: | ||
| 637 | - `Chromaprint only` | ||
| 638 | - `semantic only` | ||
| 639 | - `fusion` | ||
| 640 | 4. Output a confusion bucket report | ||
| 641 | 249 | ||
| 642 | Note on the current environment: | 250 | ### 5.2 Insert an asset |
| 643 | - While trying to add `type_8 / type_16` live samples directly from `/workspace/downloads` this round, the directory turned out **not to exist** in the current container | ||
| 644 | - To continue this branch next round, first restore/mount the business sample directory, or place the query audio and reference manifest on a path visible to the repository | ||
| 645 | 251 | ||
| 646 | ### Route 3: switch to the Phase-1 encoder-only main line | 252 | ```sql |
| 253 | insert into audio_object ( | ||
| 254 | object_type, song_id, source_type, storage_uri, storage_scheme, | ||
| 255 | checksum, codec, sample_rate, channels, duration_ms | ||
| 256 | ) values ( | ||
| 257 | 'asset', :song_id, 'official', 's3://bucket/song10001/master.wav', 's3', | ||
| 258 | 'sha256:xxx', 'wav', 44100, 2, 215000 | ||
| 259 | ) returning object_id; | ||
| 260 | ``` | ||
| 647 | 261 | ||
| 648 | 1. Keep the current PostgreSQL structure unchanged | 262 | ### 5.3 Insert a window |
| 649 | 2. Replace `local_chroma24` with: | ||
| 650 | - `MERT-v1-95M` | ||
| 651 | - `MuQ` | ||
| 652 | 3. Keep reusing: | ||
| 653 | - `model_registry` | ||
| 654 | - `feature_set_registry` | ||
| 655 | - `reference_set_registry` | ||
| 656 | - `retrieval_index_registry` | ||
| 657 | 4. Re-test: | ||
| 658 | - clean | ||
| 659 | - type_7 | ||
| 660 | - type_8 | ||
| 661 | - type_16 | ||
| 662 | - business voice bucket | ||
| 663 | 263 | ||
| 664 | --- | 264 | ```sql |
| 265 | insert into audio_object ( | ||
| 266 | object_type, song_id, parent_object_id, start_ms, end_ms, duration_ms | ||
| 267 | ) values ( | ||
| 268 | 'window', :song_id, :asset_id, 30000, 35000, 5000 | ||
| 269 | ) returning object_id; | ||
| 270 | ``` | ||
| 665 | 271 | ||
| 666 | ## Reproduction commands | 272 | ### 5.4 Insert an embedding |
| 273 | |||
| 274 | ```sql | ||
| 275 | insert into feature_fact ( | ||
| 276 | feature_type, object_id, song_id, | ||
| 277 | model_name, model_version, feature_set_name, | ||
| 278 | feature_schema_ver, embedding_dim, embedding_uri, vector_table_name | ||
| 279 | ) values ( | ||
| 280 | 'embedding', :window_id, :song_id, | ||
| 281 | 'mert', 'v1-95m', 'mert_5s_hop2.5_meanpool', | ||
| 282 | 'v1', 768, 's3://bucket/emb/song10001_win0001.npy', 'audio_embedding_vector_768' | ||
| 283 | ); | ||
| 284 | ``` | ||
| 667 | 285 | ||
| 668 | ### 1. Live PostgreSQL + pgvector test | 286 | ### 5.5 Attach an asset to a reference set |
| 669 | 287 | ||
| 670 | ```bash | 288 | ```sql |
| 671 | cd /workspace/acr-engine | 289 | insert into set_membership ( |
| 672 | /usr/local/miniconda3/bin/python scripts/live_pgvector_music20_eval.py \ | 290 | set_type, set_name, member_type, member_id, song_id, priority |
| 673 | --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' \ | 291 | ) values ( |
| 674 | --schema acr_test \ | 292 | 'reference_set', 'phase1_hot_reference_v1', 'asset', :asset_id, :song_id, 100 |
| 675 | --reset-schema \ | 293 | ); |
| 676 | --output data/pgvector_eval/music20/live_pgvector_report.json | ||
| 677 | ``` | 294 | ``` |
| 678 | 295 | ||
| 679 | ### 2. FAISS stand-in comparison test | 296 | ### 5.6 Trace an embedding back to its song |
| 680 | 297 | ||
| 681 | ```bash | 298 | ```sql |
| 682 | cd /workspace/acr-engine | 299 | select ff.feature_id, |
| 683 | /usr/local/miniconda3/bin/python scripts/evaluate_songid_pgvector_path.py \ | 300 | ff.model_name, |
| 684 | --reference-embeddings-jsonl data/pgvector_eval/music20/reference_embeddings.jsonl \ | 301 | ff.model_version, |
| 685 | --query-embeddings-jsonl data/pgvector_eval/music20/query_embeddings.jsonl \ | 302 | ff.feature_set_name, |
| 686 | --output data/pgvector_eval/music20/songid_eval_report_fresh.json | 303 | win.object_id as window_id, |
| 304 | ast.object_id as asset_id, | ||
| 305 | song.entity_id as song_id, | ||
| 306 | song.title, | ||
| 307 | song.artist_name | ||
| 308 | from feature_fact ff | ||
| 309 | join audio_object win | ||
| 310 | on win.object_id = ff.object_id | ||
| 311 | and win.object_type = 'window' | ||
| 312 | join audio_object ast | ||
| 313 | on ast.object_id = win.parent_object_id | ||
| 314 | and ast.object_type = 'asset' | ||
| 315 | join media_entity song | ||
| 316 | on song.entity_id = ff.song_id | ||
| 317 | and song.entity_type = 'song' | ||
| 318 | where ff.feature_id = :feature_id; | ||
| 687 | ``` | 319 | ``` |
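The write order from the sequence diagram and the trace-back join in 5.6 can be tried end to end without a PostgreSQL instance. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3`, keeps only the columns the join needs, and uses integer keys, so it is not the fused DDL itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down stand-ins for three of the fused tables (assumed column subset).
conn.executescript("""
create table media_entity (
    entity_id integer primary key,
    entity_type text not null,
    biz_key text not null unique,
    title text,
    artist_name text
);
create table audio_object (
    object_id integer primary key,
    object_type text not null,
    song_id integer not null references media_entity(entity_id),
    parent_object_id integer references audio_object(object_id),
    start_ms integer,
    end_ms integer
);
create table feature_fact (
    feature_id integer primary key,
    feature_type text not null,
    object_id integer not null references audio_object(object_id),
    song_id integer not null references media_entity(entity_id),
    model_name text,
    model_version text
);
""")

cur = conn.cursor()
# Write order: song -> asset -> window -> feature.
cur.execute(
    "insert into media_entity (entity_type, biz_key, title, artist_name) "
    "values ('song', 'song-10001', 'Song 10001', 'Artist A')"
)
song_id = cur.lastrowid
cur.execute(
    "insert into audio_object (object_type, song_id) values ('asset', ?)",
    (song_id,),
)
asset_id = cur.lastrowid
cur.execute(
    "insert into audio_object (object_type, song_id, parent_object_id, start_ms, end_ms) "
    "values ('window', ?, ?, 30000, 35000)",
    (song_id, asset_id),
)
window_id = cur.lastrowid
cur.execute(
    "insert into feature_fact (feature_type, object_id, song_id, model_name, model_version) "
    "values ('embedding', ?, ?, 'mert', 'v1-95m')",
    (window_id, song_id),
)
feature_id = cur.lastrowid

# Trace back: feature_fact -> window -> asset -> song.
row = cur.execute(
    """
    select win.object_id, ast.object_id, song.entity_id, song.title
    from feature_fact ff
    join audio_object win
      on win.object_id = ff.object_id and win.object_type = 'window'
    join audio_object ast
      on ast.object_id = win.parent_object_id and ast.object_type = 'asset'
    join media_entity song
      on song.entity_id = ff.song_id and song.entity_type = 'song'
    where ff.feature_id = ?
    """,
    (feature_id,),
).fetchone()
print(row)
```

The join walks through `parent_object_id` to reach the asset but takes the song directly from `feature_fact.song_id`, mirroring the query in 5.6.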
| 688 | 320 | ||
| 689 | --- | 321 | --- |
| 690 | 322 | ||
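| 691 | ## One-sentence conclusion | 323 | ## 6. Current design intent |
| 692 | |||
| 693 | > The PostgreSQL path can now actually persist the schema, samples, candidates, and decisions, and can run real pgvector retrieval. | ||
| 694 | > The biggest gap is no longer "how to store" but that **the current baseline embedding still recalls confusable queries far too poorly**. | ||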
| 695 | |||
| 696 | |||
| 697 | ## New: Phase-1 semantic worker live evidence | ||
| 698 | |||
| 699 | This round continues live PostgreSQL validation of `run_embedding_job.py`. The goal is not to fake embeddings but to **pin down the failure semantics first**. | ||
| 700 | |||
| 701 | ### Result summary | ||
| 702 | |||
| 703 | After running the non-dry-run worker on `extraction_job_id=2` (`mert v1-95m`, `5s/2.5s`): | ||
| 704 | |||
| 705 | | Item | Result | | ||
| 706 | |---|---| | ||
| 707 | | `scope_window_count` | `20` | | ||
| 708 | | `job_status` | `failed` | | ||
| 709 | | `output_count` | `0` | | ||
| 710 | | `failure_reason` | `preflight_failed` | | ||
| 711 | | `preflight_blockers` | `['unreadable_audio_assets', 'model_runtime_unavailable']` | | ||
| 712 | | `vector_table_report.resolved` | `true` | | ||
| 713 | | `audio_embedding_vector_768_count` | `0` | | ||
| 714 | 324 | ||
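| 715 | Notes: | 325 | ### Why slices and raw audio both use `audio_object` |
| 326 | - Easier for new team members to understand | ||
| 327 | - asset/window share a large number of fields | ||
| 328 | - Fewer dedicated tables | ||
| 716 | 329 | ||
| 717 | - The semantic lane is not "doing nothing": it actually reaches the PostgreSQL job scope / runtime / vector table / asset path checks | 330 | ### Why models and features both use `feature_fact` |
| 718 | - The current container is simply blocked by two external conditions at once: | 331 | - No more one table per model |
| 719 | 1. `/workspace/downloads/...` is not mounted | 332 | - No more sprawl that starts with one table for fingerprints and one for embeddings |
| 720 | 2. `torch / torchaudio / transformers` are not installed | 333 | - Better suited to swapping in MERT / MuQ / new models later |
| 721 | 334 | ||
| 722 | ### Evidence files | 335 | ### Why reference sets use `set_membership` |
| 336 | - song / asset / window / feature can all be attached to a set | ||
| 337 | - Switching between reference / eval / hot sets is handled uniformly | ||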
| 723 | 338 | ||
| 724 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_attempt.json` | 339 | --- |
| 725 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_write_guard_report.json` | ||
| 726 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_embedding_post_state.json` | ||
| 727 | |||
| 728 | ### Why add the unique keys first | ||
| 729 | |||
| 730 | `audio_embedding` now has: | ||
| 731 | |||
| 732 | - `uq_audio_embedding_feature_window` | ||
| 733 | - `uq_audio_embedding_feature_asset` | ||
| 734 | |||
| 735 | The design intent is: | ||
| 736 | |||
| 737 | 1. Re-running an embedding for the same `feature_set_id + window_id` can upsert stably | ||
| 738 | 2. Future asset-level embeddings can be idempotent independently | ||
| 739 | 3. Idempotency is not left to application-level "check then write" logic | ||
| 740 | |||
| 741 | This step applies equally to the later `MERT / MuQ / ECAPA` work. | ||
| 742 | |||
| 743 | |||
| 744 | ## New: semantic preflight blocker matrix | ||
| 745 | |||
| 746 | To avoid manually trying jobs one at a time again next session, this round also adds: | ||
| 747 | |||
| 748 | - `acr-engine/scripts/run_phase1_embedding_preflight_matrix_live.py` | ||
| 749 | - `acr-engine/data/pgvector_eval/music20/phase1_embedding_preflight_matrix_report.json` | ||
| 750 | |||
| 751 | It will: | ||
| 752 | |||
| 753 | 1. Reset `feature_extraction_job` back to `pending` | ||
| 754 | 2. Run all semantic jobs in sequence (currently `mert 5s`, `mert 10s`, `muq 5s`, `ecapa 5s`) | ||
| 755 | 3. Merge and output, for each job: | ||
| 756 | - `failure_reason` | ||
| 757 | - `preflight_blockers` | ||
| 758 | - `runtime_missing_dependencies` | ||
| 759 | - `vector_table_report` | ||
| 760 | |||
| 761 | ### Current matrix results | ||
| 762 | |||
| 763 | | job | model | vector table | blockers | runtime missing | | ||
| 764 | |---|---|---|---|---| | ||
| 765 | | 2 | `mert v1-95m` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` | | ||
| 766 | | 3 | `mert v1-95m` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` | | ||
| 767 | | 4 | `muq large-msd-iter` | `audio_embedding_vector_768` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `transformers` | | ||
| 768 | | 5 | `ecapa acr-baseline-v1` | `audio_embedding_vector_192` | `unreadable_audio_assets`, `model_runtime_unavailable` | `torch`, `torchaudio`, `speechbrain` | | ||
| 769 | |||
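| 770 | Conclusion: | ||
| 771 | |||
| 772 | - The semantic lane failures now show a **stable pattern across the matrix** rather than an incidental problem unique to one job | ||
| 773 | - The `vector_table` path passes for every job | ||
| 774 | - What actually blocks the Phase-1 encoder-only rollout is: | ||
| 775 | 1. mounting the audio at `/workspace/downloads` | ||
| 776 | 2. installing the model runtime dependencies | ||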
| 777 | |||
| 778 | |||
| 779 | ## New: asset-level embedding upsert live validation | ||
| 780 | |||
| 781 | To move `uq_audio_embedding_feature_asset` from "declared in DDL" to "backed by real evidence", this round adds: | ||
| 782 | |||
| 783 | - `acr-engine/scripts/validate_audio_embedding_asset_upsert_live.py` | ||
| 784 | - `acr-engine/data/pgvector_eval/music20/audio_embedding_asset_upsert_live_report.json` | ||
| 785 | |||
| 786 | ### Validation steps | ||
| 787 | |||
| 788 | In the isolated schema `acr_asset_upsert_test`, the script: | ||
| 789 | |||
| 790 | 1. Persists a minimal master-data graph: `song -> work -> recording -> asset` | ||
| 791 | 2. Inserts the first asset-level embedding with `window_id IS NULL` | ||
| 792 | 3. Issues a plain duplicate `INSERT` | ||
| 793 | 4. Expects it to be rejected by `uq_audio_embedding_feature_asset` | ||
| 794 | 5. Issues an `ON CONFLICT ... DO UPDATE` | ||
| 795 | 6. Verifies that exactly `1` `audio_embedding` row and `1` `audio_embedding_vector_192` row remain | ||
| 796 | |||
| 797 | ### Current results | ||
| 798 | |||
| 799 | | Item | Result | | ||
| 800 | |---|---| | ||
| 801 | | First `embedding_id` | `1` | | ||
| 802 | | Repeated plain `INSERT` | `UniqueViolation` | | ||
| 803 | | Unique key name | `uq_audio_embedding_feature_asset` | | ||
| 804 | | `embedding_id` after upsert | `1` | | ||
| 805 | | `same_embedding_id_reused` | `true` | | ||
| 806 | | `audio_embedding` row count | `1` | | ||
| 807 | | `audio_embedding_vector_192` row count | `1` | | ||
| 808 | | Final `checksum` | `checksum-v2` | | ||
| 809 | |||
| 810 | Conclusion: | ||
| 811 | |||
| 812 | - The asset-level unique key does not just exist on paper; it is enforced on live PostgreSQL | ||
| 813 | - A future asset-level semantic writer can reuse the same `ON CONFLICT (feature_set_id, asset_id) ...` contract | ||
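The insert / duplicate / upsert sequence validated above can be reproduced in miniature. This is a hedged sketch rather than the validation script: it runs on Python's built-in `sqlite3` (upsert needs SQLite 3.24 or newer), models the unique key as a plain table constraint instead of whatever index definition the real schema uses, and keeps only an assumed subset of columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal shape of audio_embedding; the real table has more columns.
conn.execute("""
create table audio_embedding (
    embedding_id integer primary key,
    feature_set_id integer not null,
    asset_id integer not null,
    checksum text not null,
    constraint uq_audio_embedding_feature_asset unique (feature_set_id, asset_id)
)
""")

insert_sql = (
    "insert into audio_embedding (feature_set_id, asset_id, checksum) "
    "values (?, ?, ?)"
)
upsert_sql = (
    insert_sql
    + " on conflict (feature_set_id, asset_id)"
    + " do update set checksum = excluded.checksum"
)

# First write succeeds and allocates the row.
first_id = conn.execute(insert_sql, (1, 10, "checksum-v1")).lastrowid

# A plain duplicate INSERT must be rejected by the unique key.
try:
    conn.execute(insert_sql, (1, 10, "checksum-v2"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The upsert updates the existing row in place instead of adding a new one.
conn.execute(upsert_sql, (1, 10, "checksum-v2"))
rows = conn.execute(
    "select embedding_id, checksum from audio_embedding"
).fetchall()
print(duplicate_rejected, rows)
```

Putting the conflict handling in the statement itself is what keeps idempotency out of application-level "check then write" code.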
| 814 | |||
| 815 | |||
| 816 | ## New: Phase-1 worker contract smoke overview | ||
| 817 | |||
| 818 | So that the next session does not need to run the exact worker and the semantic matrix separately by hand, this round adds: | ||
| 819 | |||
| 820 | - `acr-engine/scripts/run_phase1_worker_contract_smoke_live.py` | ||
| 821 | - `acr-engine/data/pgvector_eval/music20/phase1_worker_contract_smoke_report.json` | ||
| 822 | |||
| 823 | It will: | ||
| 824 | |||
| 825 | 1. Reset `feature_extraction_job` | ||
| 826 | 2. Run the exact lane once, non-dry-run | ||
| 827 | 3. Reset the jobs again | ||
| 828 | 4. Run the full semantic preflight matrix | ||
| 829 | 5. Output a single overview JSON | ||
| 830 | |||
| 831 | ### Current smoke overview results | ||
| 832 | |||
| 833 | | lane | Result | | ||
| 834 | |---|---| | ||
| 835 | | exact | `failed` | | ||
| 836 | | exact failure reason | `unreadable_audio_assets` | | ||
| 837 | | exact missing assets | `20` | | ||
| 838 | | semantic jobs | `4` | | ||
| 839 | | semantic failed jobs | `4` | | ||
| 840 | | semantic blockers | `model_runtime_unavailable`, `unreadable_audio_assets` | | ||
| 841 | |||
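| 842 | This shows that: | ||
| 843 | |||
| 844 | - The PostgreSQL worker contract itself is already **stable** | ||
| 845 | - The blockers are now clear and lie mainly in the environment rather than in orchestration: | ||
| 846 | - `/workspace/downloads` is not mounted | ||
| 847 | - the semantic model runtime is not installed | ||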
| 848 | |||
| 849 | |||
| 850 | ## New: semantic vector table negative-case matrix | ||
| 851 | |||
| 852 | To avoid later attributing every semantic worker failure to "missing model / missing audio", this round adds: | ||
| 853 | |||
| 854 | - `acr-engine/scripts/run_embedding_vector_table_negative_matrix_live.py` | ||
| 855 | - `acr-engine/data/pgvector_eval/music20/embedding_vector_table_negative_matrix_report.json` | ||
| 856 | |||
| 857 | It verifies, against a live database, 3 kinds of vector table misconfiguration: | ||
| 858 | |||
| 859 | | case | schema | vector table | reason | | ||
| 860 | |---|---|---|---| | ||
| 861 | | `vector_table_dim_mismatch` | `acr_test` | `audio_embedding_vector_192` | `vector_table_dim_mismatch` | | ||
| 862 | | `vector_table_not_allowlisted` | `acr_test` | `audio_embedding_vector_1024` | `vector_table_not_allowlisted` | | ||
| 863 | | `vector_table_missing_in_schema` | `acr_vector_table_missing_test` | `audio_embedding_vector_768` | `vector_table_missing_in_schema` | | ||
| 864 | |||
| 865 | In common: | ||
| 866 | |||
| 867 | - All 3 cases end with `job_status = failed` | ||
| 868 | - `failure_reason = preflight_failed` | ||
| 869 | - Besides the environment blockers, `preflight_blockers` also carries the precise vector-table blocker | ||
| 870 | |||
| 871 | This shows that: | ||
| 872 | |||
| 873 | - The semantic preflight can now surface "runtime environment problems" and "configuration errors" as separate layers | ||
| 874 | - From here on, `vector_table_report.reason` alone is enough to tell a DDL/configuration error from a model runtime / audio mount error | ||
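The three vector-table failure reasons in the matrix above can be illustrated with a small check function. This is a sketch under assumptions, not the worker's actual preflight: `ALLOWED_VECTOR_TABLES`, `check_vector_table`, and the order of the checks are all hypothetical; only the reason strings and table names come from the report.

```python
from typing import Optional, Set

# Hypothetical allowlist: vector table name -> expected embedding dimension.
ALLOWED_VECTOR_TABLES = {
    "audio_embedding_vector_192": 192,
    "audio_embedding_vector_768": 768,
}


def check_vector_table(
    table_name: str, embedding_dim: int, tables_in_schema: Set[str]
) -> Optional[str]:
    """Return a failure reason, or None when the vector table is usable."""
    if table_name not in ALLOWED_VECTOR_TABLES:
        return "vector_table_not_allowlisted"
    if table_name not in tables_in_schema:
        return "vector_table_missing_in_schema"
    if ALLOWED_VECTOR_TABLES[table_name] != embedding_dim:
        return "vector_table_dim_mismatch"
    return None


# Tables assumed to exist in a populated schema vs. an empty one.
populated = {"audio_embedding_vector_192", "audio_embedding_vector_768"}
empty: Set[str] = set()

results = {
    "dim_mismatch": check_vector_table("audio_embedding_vector_192", 768, populated),
    "not_allowlisted": check_vector_table("audio_embedding_vector_1024", 1024, populated),
    "missing": check_vector_table("audio_embedding_vector_768", 768, empty),
    "ok": check_vector_table("audio_embedding_vector_768", 768, populated),
}
print(results)
```

Returning a single reason string keeps configuration errors distinguishable from environment blockers, which is the separation the report relies on.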
| 875 | |||
| 876 | |||
| 877 | ## New: Phase-1 prerequisites audit | ||
| 878 | |||
| 879 | To stop guessing by eye whether the audio mount or the model runtime is missing, this round adds: | ||
| 880 | |||
| 881 | - `acr-engine/scripts/run_phase1_prereq_audit_live.py` | ||
| 882 | - `acr-engine/data/pgvector_eval/music20/phase1_prereq_audit_report.json` | ||
| 883 | |||
| 884 | ### Current audit results | ||
| 885 | |||
| 886 | | Metric | Result | | ||
| 887 | |---|---| | ||
| 888 | | `downloads_root_exists` | `false` | | ||
| 889 | | `total_jobs` | `5` | | ||
| 890 | | `ready_jobs` | `0` | | ||
| 891 | | `blocked_jobs` | `5` | | ||
| 892 | | Union of missing dependencies | `speechbrain`, `torch`, `torchaudio`, `transformers` | | ||
| 893 | |||
| 894 | By job: | ||
| 895 | 340 | ||
| 896 | - `chromaprint`: runnable as far as dependencies go, but blocked by the missing `/workspace/downloads` | 341 | ## 7. Recommended implementation order |
| 897 | - `mert / muq`: blocked by both the missing `/workspace/downloads` and the missing `torch/torchaudio/transformers` | ||
| 898 | - `ecapa`: blocked by both the missing `/workspace/downloads` and the missing `torch/torchaudio/speechbrain` | ||
| 899 | 342 | ||
| 900 | As a result, "why does it not run right now" can be answered from a single JSON report without re-running jobs by hand. | 343 | 1. Create `media_entity` first |
| 344 | 2. Then create `audio_object` | ||
| 345 | 3. Then create `feature_fact` | ||
| 346 | 4. Finally create `set_membership` | ||
| 347 | 5. First get `song -> asset -> window -> embedding/fingerprint` working end to end | ||
| 348 | 6. Then add the heavier governance capabilities | ... | ... |
| ... | @@ -59,7 +59,7 @@ cd /workspace/acr-engine | ... | @@ -59,7 +59,7 @@ cd /workspace/acr-engine |
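| 59 | ## 3. The project in one sentence | 59 | ## 3. The project in one sentence |
| 60 | 60 | ||
| 61 | We are building a music ACR system for **copyright protection / song recognition / version attribution**. | 61 | We are building a music ACR system for **copyright protection / song recognition / version attribution**. |
| 62 | The goal is to quickly locate the correct `song_id / work / recording` attribution among `1M` audio files covering roughly `300k` songs. | 62 | The goal is to quickly locate the correct `song_id` attribution among `1M` audio files covering roughly `300k` songs; at the current stage, version/recording is not a required return object. |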
| 63 | 63 | ||
| 64 | --- | 64 | --- |
| 65 | 65 | ||
| ... | @@ -71,7 +71,12 @@ cd /workspace/acr-engine | ... | @@ -71,7 +71,12 @@ cd /workspace/acr-engine |
| 71 | - semantic lane challenger:`MuQ` | 71 | - semantic lane challenger:`MuQ` |
| 72 | - historical baseline:`ECAPA` | 72 | - historical baseline:`ECAPA` |
| 73 | 73 | ||
| 74 | ### Data main line | 74 | ### Current Phase-1 minimal main line |
| 75 | ```text | ||
| 76 | song -> asset -> window | ||
| 77 | ``` | ||
| 78 | |||
| 79 | ### Evolvable full main line | ||
| 75 | ```text | 80 | ```text |
| 76 | canonical_song -> work -> recording -> recording_asset -> audio_window | 81 | canonical_song -> work -> recording -> recording_asset -> audio_window |
| 77 | ``` | 82 | ``` |
| ... | @@ -139,6 +144,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> | ... | @@ -139,6 +144,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint -> |
| 139 | - [README.md](./README.md) | 144 | - [README.md](./README.md) |
| 140 | - [session-handoff.md](./session-handoff.md) | 145 | - [session-handoff.md](./session-handoff.md) |
| 141 | - [postgresql-data-model.md](./postgresql-data-model.md) | 146 | - [postgresql-data-model.md](./postgresql-data-model.md) |
| 147 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md) | ||
| 142 | - [phase1-worker-contract.md](./phase1-worker-contract.md) | 148 | - [phase1-worker-contract.md](./phase1-worker-contract.md) |
| 143 | 149 | ||
| 144 | ### Scripts | 150 | ### Scripts | ... | ... |
scripts/check_markdown_links.py
0 → 100755
| 1 | #!/usr/bin/env /usr/local/miniconda3/bin/python | ||
| 2 | from __future__ import annotations | ||
| 3 | |||
| 4 | import argparse | ||
| 5 | import fnmatch | ||
| 6 | import re | ||
| 7 | import sys | ||
| 8 | from pathlib import Path | ||
| 9 | |||
| 10 | LINK_RE = re.compile(r'!?(?:\[([^\]]*)\])\(([^)]+)\)') | ||
| 11 | SKIP_PREFIXES = ('http://', 'https://', 'mailto:', 'tel:', '#') | ||
| 12 | DEFAULT_EXCLUDES = ['CHANGELOG.md'] | ||
| 13 | |||
| 14 | |||
| 15 | def should_check(target: str) -> bool: | ||
| 16 | target = target.strip() | ||
| 17 | return bool(target) and not target.startswith(SKIP_PREFIXES) | ||
| 18 | |||
| 19 | |||
| 20 | def normalize_target(raw: str) -> str: | ||
| 21 | target = raw.strip() | ||
| 22 | if target.startswith('<') and target.endswith('>'): | ||
| 23 | target = target[1:-1] | ||
| 24 | target = target.split('#', 1)[0].split('?', 1)[0].strip() | ||
| 25 | return target | ||
| 26 | |||
| 27 | |||
| 28 | def iter_markdown_files(root: Path, excludes: list[str]) -> list[Path]: | ||
| 29 | files: list[Path] = [] | ||
| 30 | for path in sorted(root.rglob('*.md')): | ||
| 31 | rel = path.relative_to(root).as_posix() | ||
| 32 | if any(fnmatch.fnmatch(rel, pattern) for pattern in excludes): | ||
| 33 | continue | ||
| 34 | files.append(path) | ||
| 35 | return files | ||
| 36 | |||
| 37 | |||
| 38 | def scan_markdown_file(path: Path, root: Path) -> list[tuple[str, str]]: | ||
| 39 | missing: list[tuple[str, str]] = [] | ||
| 40 | text = path.read_text(encoding='utf-8') | ||
| 41 | for _, raw_target in LINK_RE.findall(text): | ||
| 42 | if not should_check(raw_target): | ||
| 43 | continue | ||
| 44 | target = normalize_target(raw_target) | ||
| 45 | if not target: | ||
| 46 | continue | ||
| 47 | resolved = (path.parent / target).resolve() | ||
| 48 | if not resolved.exists(): | ||
| 49 | missing.append((path.relative_to(root).as_posix(), raw_target)) | ||
| 50 | return missing | ||
| 51 | |||
| 52 | |||
| 53 | if __name__ == '__main__': | ||
| 54 | parser = argparse.ArgumentParser(description='Check relative Markdown links for missing files.') | ||
| 55 | parser.add_argument('--root', default='docs', help='Root directory containing markdown files') | ||
| 56 | parser.add_argument('--exclude', action='append', default=[], help='Glob patterns relative to root to exclude') | ||
| 57 | args = parser.parse_args() | ||
| 58 | |||
| 59 | root = Path(args.root).resolve() | ||
| 60 | if not root.exists(): | ||
| 61 | print(f'root not found: {root}', file=sys.stderr) | ||
| 62 | sys.exit(2) | ||
| 63 | |||
| 64 | excludes = DEFAULT_EXCLUDES + list(args.exclude) | ||
| 65 | files = iter_markdown_files(root, excludes) | ||
| 66 | failures: list[tuple[str, str]] = [] | ||
| 67 | for md in files: | ||
| 68 | failures.extend(scan_markdown_file(md, root)) | ||
| 69 | |||
| 70 | if failures: | ||
| 71 | print('Missing relative markdown targets:') | ||
| 72 | for source, target in failures: | ||
| 73 | print(f'- {source}: {target}') | ||
| 74 | sys.exit(1) | ||
| 75 | |||
| 76 | print(f'OK: checked {len(files)} markdown files under {root} (excluded: {excludes})') |
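To see which targets the checker above would actually test, the sketch below reuses its `LINK_RE`, `should_check`, and `normalize_target` on a made-up Markdown line. The sample text and the `targets_to_check` helper are illustrative assumptions and are not part of the script.

```python
import re

# Same pattern and skip list as in scripts/check_markdown_links.py.
LINK_RE = re.compile(r'!?(?:\[([^\]]*)\])\(([^)]+)\)')
SKIP_PREFIXES = ('http://', 'https://', 'mailto:', 'tel:', '#')


def should_check(target: str) -> bool:
    target = target.strip()
    return bool(target) and not target.startswith(SKIP_PREFIXES)


def normalize_target(raw: str) -> str:
    target = raw.strip()
    if target.startswith('<') and target.endswith('>'):
        target = target[1:-1]
    return target.split('#', 1)[0].split('?', 1)[0].strip()


def targets_to_check(text: str) -> list:
    """Hypothetical helper: list the relative paths that would be resolved."""
    found = []
    for _, raw_target in LINK_RE.findall(text):
        if should_check(raw_target):
            target = normalize_target(raw_target)
            if target:
                found.append(target)
    return found


# Made-up sample: a relative link with an anchor, an external link,
# and an image whose path is wrapped in angle brackets.
sample = (
    "See [model](./postgresql-data-model.md#tables), "
    "[site](https://example.com) and ![img](<img/a b.png>)."
)
targets = targets_to_check(sample)
print(targets)
```

External URLs and pure anchors are skipped, and fragments are stripped before the path is resolved, so only file existence is checked and heading anchors are not validated.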