Favor typed unified tables for the Phase-1 ACR storage model
Constraint: Reduce schema reading cost for new engineers while preserving the logical distinctions needed for copyright-scale retrieval and attribution.
Rejected: Keep adding highly specialized tables for every layer in Phase-1 | It increases join cost in the mental model faster than it improves first-stage delivery.
Confidence: high
Scope-risk: narrow
Directive: Prefer a fused physical model (media_entity/audio_object/feature_fact/set_membership) with type fields, while keeping song/recording/asset/window as logical semantics.
Tested: git diff --check on touched docs; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 31 markdown files; rg confirmed fused-model sections are present in docs
Not-tested: concrete SQL DDL for the fused physical model
Showing 3 changed files with 150 additions and 12 deletions
| ... | @@ -10,6 +10,8 @@ | ... | @@ -10,6 +10,8 @@ |
| 10 | 10 | ||
| 11 | - Added a minimal `Phase-1` schema view to `docs/postgresql-data-model.md`, specifying the first set of tables to stabilize: `song/recording/recording_asset/audio_window`, `feature_set_registry/audio_fingerprint/audio_embedding`, and `reference_set_registry/reference_set_member`. | 11 | - Added a minimal `Phase-1` schema view to `docs/postgresql-data-model.md`, specifying the first set of tables to stabilize: `song/recording/recording_asset/audio_window`, `feature_set_registry/audio_fingerprint/audio_embedding`, and `reference_set_registry/reference_set_member`. |
| 12 | 12 | ||
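| 13 | - Following the new constraint of "fuse where possible and relate objects through multiple types", added a "fusion-first" modeling view to `docs/postgresql-data-model.md`: it recommends four generic tables (`media_entity`, `audio_object`, `feature_fact`, `set_membership`) to carry the Phase-1 physical implementation, while keeping the logical layering of `song/recording/asset/window/feature`. | ||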
| 14 | |||
| 13 | ## 2026-06-04 | 15 | ## 2026-06-04 |
| 14 | 16 | ||
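| 15 | - Updated the top of `docs/README.md` to the "shortest startup path" consistent with `session-handoff`, and re-ran `run_planner_validation_commands_live.py` with that entry command, confirming that the fresh result is still `executed_count=4`, `all_passed=true`. | 17 | - Updated the top of `docs/README.md` to the "shortest startup path" consistent with `session-handoff`, and re-ran `run_planner_validation_commands_live.py` with that entry command, confirming that the fresh result is still `executed_count=4`, `all_passed=true`. | ... | ... |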
| ... | @@ -45,7 +45,7 @@ cd /workspace/acr-engine | ... | @@ -45,7 +45,7 @@ cd /workspace/acr-engine |
| 45 | 45 | ||
| 46 | ### C. How to land the first phase | 46 | ### C. How to land the first phase |
| 47 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 execution checklist | 47 | - [phase1-implementation-checklist.md](./phase1-implementation-checklist.md): Phase-1 execution checklist |
| 48 | - [postgresql-data-model.md](./postgresql-data-model.md): includes the Phase-1 minimal schema view | 48 | - [postgresql-data-model.md](./postgresql-data-model.md): includes the Phase-1 minimal schema and fusion-first views |
| 49 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization | 49 | - [model-feature-registry-bootstrap.md](./model-feature-registry-bootstrap.md): model/feature/reference set initialization |
| 50 | - [phase1-worker-contract.md](./phase1-worker-contract.md): contract for worker, job, and failure semantics | 50 | - [phase1-worker-contract.md](./phase1-worker-contract.md): contract for worker, job, and failure semantics |
| 51 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples | 51 | - [postgres_db_schema_samples.md](./postgres_db_schema_samples.md): PostgreSQL storage samples | ... | ... |
| ... | @@ -256,24 +256,160 @@ window -> fingerprint / embedding -> candidate -> aggregate | ... | @@ -256,24 +256,160 @@ window -> fingerprint / embedding -> candidate -> aggregate |
| 256 | 256 | ||
| 257 | --- | 257 | --- |
| 258 | 258 | ||
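| 259 | ## 1.2.1 Fusion first: keep the logical layering, converge the physical tables | ||
| 260 | |||
| 261 | If your core requirement is: | ||
| 262 | |||
| 263 | > **Minimize the number of tables, expressing multiple object kinds with `type` plus generic associations, rather than splitting into many tables and joining them back together** | ||
| 264 | |||
| 265 | then the recommended approach is: | ||
| 266 | |||
| 267 | - **Logical layer**: still keep `song / recording / asset / window / feature` | ||
| 268 | - **Physical layer**: fuse into a few generic tables wherever possible | ||
| 269 | |||
| 270 | In other words: | ||
| 271 | |||
| 272 | > **Layered in concept, converged in storage.** | ||
| 273 | |||
| 274 | ### Recommended fusion-first physical view | ||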
| 275 | |||
| 276 | | Physical table | Main types | Role | | ||
| 277 | |---|---|---| | ||
| 278 | | `media_entity` | `song`, `work`, `recording` | Holds business ownership objects | | ||
| 279 | | `audio_object` | `asset`, `window` | Holds actual audio files and slice objects | | ||
| 280 | | `feature_fact` | `fingerprint`, `embedding` | Holds retrieval feature facts | | ||
| 281 | | `set_membership` | `reference_set`, `hot_set`, `eval_set` | Holds set membership relations | | ||
| 282 | |||
| 283 | With this, Phase-1 can be converged at the physical table level into: | ||
| 284 | |||
| 285 | ```text | ||
| 286 | media_entity -> audio_object -> feature_fact -> set_membership | ||
| 287 | ``` | ||
| 288 | |||
| 289 | rather than having new engineers see many highly specialized tables at first glance. | ||
| 290 | |||
| 291 | ### Corresponding logical semantics | ||
| 292 | |||
| 293 | #### `media_entity` | ||
| 294 | Distinguished by `entity_type`: | ||
| 295 | - `song` | ||
| 296 | - `work` | ||
| 297 | - `recording` | ||
| 298 | |||
| 299 | Common fields can be unified as: | ||
| 300 | - `entity_id` | ||
| 301 | - `entity_type` | ||
| 302 | - `parent_entity_id` | ||
| 303 | - `root_song_id` | ||
| 304 | - `title` | ||
| 305 | - `artist_name` | ||
| 306 | - `entity_status` | ||
| 307 | - `metadata_json` | ||
| 308 | |||
| 309 | #### `audio_object` | ||
| 310 | Distinguished by `object_type`: | ||
| 311 | - `asset` | ||
| 312 | - `window` | ||
| 313 | |||
| 314 | Common fields can be unified as: | ||
| 315 | - `object_id` | ||
| 316 | - `object_type` | ||
| 317 | - `recording_entity_id` | ||
| 318 | - `parent_object_id` | ||
| 319 | - `storage_uri` | ||
| 320 | - `codec` | ||
| 321 | - `sample_rate` | ||
| 322 | - `start_ms` / `end_ms` | ||
| 323 | - `duration_ms` | ||
| 324 | - `checksum` | ||
| 325 | - `metadata_json` | ||
| 326 | |||
| 327 | Explanation: | ||
| 328 | - An `asset` row represents an actual audio file | ||
| 329 | - A `window` row represents a retrieval window cut from an `asset` | ||
| 330 | |||
| 331 | #### `feature_fact` | ||
| 332 | Distinguished by `feature_type`: | ||
| 333 | - `fingerprint` | ||
| 334 | - `embedding` | ||
| 335 | |||
| 336 | Common fields can be unified as: | ||
| 337 | - `feature_id` | ||
| 338 | - `feature_type` | ||
| 339 | - `object_id` | ||
| 340 | - `model_name` | ||
| 341 | - `model_version` | ||
| 342 | - `feature_set_name` | ||
| 343 | - `embedding_dim` | ||
| 344 | - `fingerprint_value` / `embedding_uri` | ||
| 345 | - `vector_table_name` | ||
| 346 | - `metadata_json` | ||
| 347 | |||
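| 348 | This avoids: | ||
| 349 | - one table per model | ||
| 350 | - one table per feature kind | ||
| 351 | - changing the schema whenever a model is replaced later | ||
| 352 | |||
| 353 | ### Why this suits the current Phase-1 better than "pure table splitting" | ||
| 354 | |||
| 355 | Advantages: | ||
| 356 | 1. **Easier for new engineers to understand**: they see 3 to 4 core tables rather than a dozen or more specialized ones | ||
| 357 | 2. **Multi-type reuse is more natural**: `song/work/recording` and `asset/window` can each be expressed uniformly with a type field | ||
| 358 | 3. **Smoother model evolution**: `feature_fact` can hold different models and different features at the same time | ||
| 359 | 4. **Better aligned with the current goal**: get the recognition loop running end to end first, rather than splitting the governance model finely up front | ||
| 360 | |||
| 361 | ### But do not over-fuse | ||
| 362 | |||
| 363 | Although physical convergence is recommended, fusing everything into one giant catch-all table is still not advised. | ||
| 364 | For example, the following is still overly flat: | ||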
| 365 | |||
| 366 | ```text | ||
| 367 | song_everything | ||
| 368 | ``` | ||
| 369 | |||
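| 370 | The reason is that it would mix: | ||
| 371 | - ownership objects | ||
| 372 | - audio objects | ||
| 373 | - retrieval features | ||
| 374 | - set relations | ||
| 375 | |||
| 376 | all together, resulting in: | ||
| 377 | - too many null columns | ||
| 378 | - constraints that are hard to write | ||
| 379 | - bulk writes that are hard to do | ||
| 380 | - unclear query semantics | ||
| 381 | |||
| 382 | A better boundary is therefore: | ||
| 383 | |||
| 384 | ```text | ||
| 385 | one table for entities + one for audio objects + one for feature facts + one for set relations | ||
| 386 | ``` | ||
| 387 | |||
| 388 | This is the balance point of "fusion first, without over-fusing". | ||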
| 389 | |||
| 390 | --- | ||
| 391 | |||
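| 259 | ## 1.3 Phase-1 minimal schema view | 392 | ## 1.3 Phase-1 minimal schema view |
| 260 | 393 | ||
| 261 | Looking only at "which tables must land in the first phase", the formal design can be compressed into the following minimal table set: | 394 | Looking only at "which tables must land in the first phase", the recommendation is to adopt the "fusion-first" minimal table set first: |
| 262 | 395 | ||
| 263 | | Layer | Recommended tables to keep | Current role | | 396 | | Layer | Fusion-first recommended table | Current role | |
| 264 | |---|---|---| | 397 | |---|---|---| |
| 265 | | Ownership layer | `song` (or the equivalent of the current `canonical_song`), `recording` | Final attribution to a song; distinguishes recording versions | | 398 | | Entity layer | `media_entity` | Uniformly holds `song/work/recording` | |
| 266 | | Asset layer | `recording_asset` | Manages actual audio files, sources, and encoding versions | | 399 | | Audio object layer | `audio_object` | Uniformly holds `asset/window` | |
| 267 | | Window layer | `audio_window` | Supports offset / evidence / multi-segment voting | | 400 | | Feature layer | `feature_fact` | Uniformly holds `fingerprint/embedding` | |
| 268 | | Feature layer | `feature_set_registry`, `audio_fingerprint`, `audio_embedding` | Manages the generation facts of fingerprint / embedding | | 401 | | Set layer | `set_membership` | Uniformly holds set relations such as `reference/hot/eval` | |
| 269 | | Reference layer | `reference_set_registry`, `reference_set_member` | Manages the current online reference set | | ||
| 270 | 402 | ||
| 271 | In other words, what Phase-1 should really stabilize first is: | 403 | In other words, if Phase-1 prioritizes the physical implementation, what should really be stabilized first is: |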
| 272 | 404 | ||
| 273 | ```text | 405 | ```text |
| 274 | song -> recording -> recording_asset -> audio_window | 406 | media_entity -> audio_object -> feature_fact -> set_membership |
| 275 | feature_set_registry -> audio_fingerprint / audio_embedding | 407 | ``` |
| 276 | reference_set_registry -> reference_set_member | 408 | |
| 409 | Understood in terms of logical semantics, this still corresponds to: | ||
| 410 | |||
| 411 | ```text | ||
| 412 | song/work/recording -> asset/window -> fingerprint/embedding -> reference membership | ||
| 277 | ``` | 413 | ``` |
| 278 | 414 | ||
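| 279 | ### What this minimal schema explicitly does not require heavy investment in on day one | 415 | ### What this minimal schema explicitly does not require heavy investment in on day one | ... | ... |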
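
The commit message lists concrete SQL DDL for the fused physical model as not tested. As a rough illustration of what that DDL might look like, the sketch below creates the four tables in an in-memory SQLite database, used here only as a stand-in for PostgreSQL, and walks the chain `set_membership -> feature_fact -> audio_object -> media_entity` back to a song. Column names for `media_entity`, `audio_object`, and `feature_fact` are a subset of the field lists in the diff. Everything else is an assumption made for illustration: the column types, the CHECK constraints and foreign keys, the whole column set of `set_membership` (the diff does not list its fields), the sample rows, and the helper names `build_demo` and `resolve_song`.

```python
import sqlite3

# Hypothetical DDL. Column names for the first three tables follow the field
# lists in the diff; types, constraints, and all of set_membership are assumed.
DDL = """
CREATE TABLE media_entity (
    entity_id        INTEGER PRIMARY KEY,
    entity_type      TEXT NOT NULL
                     CHECK (entity_type IN ('song', 'work', 'recording')),
    parent_entity_id INTEGER REFERENCES media_entity(entity_id),
    root_song_id     INTEGER REFERENCES media_entity(entity_id),
    title            TEXT
);
CREATE TABLE audio_object (
    object_id           INTEGER PRIMARY KEY,
    object_type         TEXT NOT NULL
                        CHECK (object_type IN ('asset', 'window')),
    recording_entity_id INTEGER NOT NULL REFERENCES media_entity(entity_id),
    parent_object_id    INTEGER REFERENCES audio_object(object_id),
    storage_uri         TEXT,
    start_ms            INTEGER,
    end_ms              INTEGER,
    -- Assumed rule: a window row must point at the asset it was cut from.
    CHECK (object_type <> 'window' OR parent_object_id IS NOT NULL)
);
CREATE TABLE feature_fact (
    feature_id        INTEGER PRIMARY KEY,
    feature_type      TEXT NOT NULL
                      CHECK (feature_type IN ('fingerprint', 'embedding')),
    object_id         INTEGER NOT NULL REFERENCES audio_object(object_id),
    model_name        TEXT NOT NULL,
    model_version     TEXT NOT NULL,
    fingerprint_value TEXT,
    embedding_uri     TEXT
);
CREATE TABLE set_membership (
    set_type   TEXT NOT NULL
               CHECK (set_type IN ('reference_set', 'hot_set', 'eval_set')),
    set_name   TEXT NOT NULL,
    feature_id INTEGER NOT NULL REFERENCES feature_fact(feature_id),
    PRIMARY KEY (set_type, set_name, feature_id)
);
"""

# Made-up sample rows: one song, one recording, one asset, one window,
# one fingerprint, and one reference-set membership.
DEMO_ROWS = [
    "INSERT INTO media_entity VALUES (1, 'song', NULL, NULL, 'Demo Song')",
    "INSERT INTO media_entity VALUES (2, 'recording', 1, 1, 'Demo Song (live)')",
    "INSERT INTO audio_object VALUES (10, 'asset', 2, NULL, 'file:///tmp/demo.flac', NULL, NULL)",
    "INSERT INTO audio_object VALUES (11, 'window', 2, 10, NULL, 0, 5000)",
    "INSERT INTO feature_fact VALUES (100, 'fingerprint', 11, 'demo-fp', 'v1', 'abc123', NULL)",
    "INSERT INTO set_membership VALUES ('reference_set', 'online-ref', 100)",
]

# Resolve a matched reference feature back to the title of its root song.
RESOLVE_SONG = """
SELECT s.title
FROM set_membership AS m
JOIN feature_fact AS f ON f.feature_id = m.feature_id
JOIN audio_object AS w ON w.object_id = f.object_id
JOIN media_entity AS r ON r.entity_id = w.recording_entity_id
JOIN media_entity AS s ON s.entity_id = r.root_song_id
WHERE m.set_type = 'reference_set' AND f.feature_id = ?
"""


def build_demo() -> sqlite3.Connection:
    """Create the four fused tables in memory and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(DDL)
    for statement in DEMO_ROWS:
        conn.execute(statement)
    conn.commit()
    return conn


def resolve_song(conn: sqlite3.Connection, feature_id: int):
    """Return the root song title for a reference feature, or None if absent."""
    row = conn.execute(RESOLVE_SONG, (feature_id,)).fetchone()
    return row[0] if row else None
```

The sketch keeps one type column per table and pushes type-specific rules into CHECK constraints, which is one way to address the "constraints are hard to write" concern the diff raises about over-fused tables. A real PostgreSQL version would likely use `bigint` or `uuid` keys and `jsonb` for `metadata_json`, but those choices are outside what the commit specifies.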