Commit fe416ec9 fe416ec9cae627abe79814ec3a3e000feea99d02 by cnb.bofCdSsphPA

Make the fused Phase-1 ACR schema concrete with DDL samples

Constraint: Keep the storage design aligned to the current song-centric model while turning the 4-table fused schema into something engineers can directly review and implement.
Rejected: Keep only conceptual docs without concrete SQL | It leaves too much ambiguity about where slices, models, and features actually land.
Confidence: high
Scope-risk: narrow
Directive: Until the repository gains a production SQL file for the fused model, treat postgres_db_schema_samples.md as the authoritative DDL draft for media_entity/audio_object/feature_fact/set_membership.
Tested: git diff --check on touched files; /usr/local/miniconda3/bin/python scripts/check_markdown_links.py --root docs returned OK for 11 active markdown files
Not-tested: Executing the fused DDL against a live PostgreSQL schema
1 parent ac2e6730
## 2026-06-04
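- Rewrote `docs/postgres_db_schema_samples.md` as the DDL draft for the current song-centric, fusion-first design, filling in the 4 core tables (`media_entity` / `audio_object` / `feature_fact` / `set_membership`), table-placement notes, flow diagrams, and common SQL samples.
- Added to `docs/postgresql-data-model.md` a table and flow diagram showing which table slice data, models, and features land in, and made explicit that the current default backtrace chain is `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)`.
- Narrowed `docs/README.md` into the entry point for the current song-centric design, and removed documents in the docs directory unrelated to that design: templates, open data, business exports, and historical roadmap docs.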
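
The default backtrace chain `feature_fact -> audio_object(window) -> audio_object(asset) -> media_entity(song)` can be sketched in Python as follows. This is a minimal in-memory illustration, not the DDL from `docs/postgres_db_schema_samples.md`: the field names (`entity_id`, `object_id`, `parent_object_id`, `kind`) and the sample IDs are hypothetical, chosen only to show how a feature row resolves to a song.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


# Hypothetical, simplified stand-ins for three of the fused tables.
# Field names are assumptions for illustration, not the real columns.
@dataclass(frozen=True)
class MediaEntity:
    entity_id: str
    kind: str  # e.g. "song"


@dataclass(frozen=True)
class AudioObject:
    object_id: str
    kind: str  # "asset" or "window"
    entity_id: Optional[str] = None         # set on assets: the owning song
    parent_object_id: Optional[str] = None  # set on windows: the source asset


@dataclass(frozen=True)
class FeatureFact:
    fact_id: str
    object_id: str  # the window the feature was computed on


def backtrace_song(
    fact: FeatureFact,
    objects: dict[str, AudioObject],
    entities: dict[str, MediaEntity],
) -> MediaEntity:
    """Follow feature_fact -> window -> asset -> song and return the song."""
    window = objects[fact.object_id]
    if window.kind != "window" or window.parent_object_id is None:
        raise ValueError(f"{fact.fact_id}: expected a window with a parent asset")
    asset = objects[window.parent_object_id]
    if asset.kind != "asset" or asset.entity_id is None:
        raise ValueError(f"{window.object_id}: expected an asset linked to a song")
    song = entities[asset.entity_id]
    if song.kind != "song":
        raise ValueError(f"{asset.object_id}: linked entity is not a song")
    return song


entities = {"song-1": MediaEntity("song-1", "song")}
objects = {
    "asset-1": AudioObject("asset-1", "asset", entity_id="song-1"),
    "win-1": AudioObject("win-1", "window", parent_object_id="asset-1"),
}
fact = FeatureFact("fact-1", object_id="win-1")
resolved = backtrace_song(fact, objects, entities)
print(resolved.entity_id)
```

In this sketch both hops through `audio_object` use the same record type, which mirrors the idea of one table holding assets and windows; whether the real schema links them this way is for the DDL draft to confirm.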
......
......@@ -59,7 +59,7 @@ cd /workspace/acr-engine
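## 3. The project in one sentence
We are building a music ACR system for **copyright protection / song identification / version attribution**.
The goal is to quickly locate the correct `song_id / work / recording` attribution across `100w` (1 million) audio files and roughly `30w` (300,000) songs.
The goal is to quickly locate the correct `song_id` attribution across `100w` (1 million) audio files and roughly `30w` (300,000) songs; in the current phase, version/recording is not a required return object.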
---
......@@ -71,7 +71,12 @@ cd /workspace/acr-engine
- semantic lane challenger: `MuQ`
- historical baseline: `ECAPA`
### Data mainline
### Current Phase-1 minimal mainline
```text
song -> asset -> window
```
### Evolvable full mainline
```text
canonical_song -> work -> recording -> recording_asset -> audio_window
```
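
To make the Phase-1 chain concrete, here is a minimal Python sketch of `song -> asset -> window`. The class and field names and the fixed window and hop lengths are hypothetical and not taken from the project; the sketch only shows that each window belongs to exactly one asset and each asset to exactly one song, with no `work` or `recording` level in between.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Window:
    start_s: float
    end_s: float


@dataclass
class Asset:
    asset_id: str
    duration_s: float
    windows: list[Window] = field(default_factory=list)


@dataclass
class Song:
    song_id: str
    assets: list[Asset] = field(default_factory=list)


def slice_windows(asset: Asset, window_s: float, hop_s: float) -> None:
    """Fill asset.windows with fixed-length windows (lengths are assumptions)."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    asset.windows.clear()
    n = 0
    # Keep only windows that fit entirely inside the asset.
    while n * hop_s + window_s <= asset.duration_s:
        start = n * hop_s
        asset.windows.append(Window(start, start + window_s))
        n += 1


def flatten(song: Song) -> Iterator[tuple[str, str, float, float]]:
    """Yield one (song_id, asset_id, start_s, end_s) row per window."""
    for asset in song.assets:
        for window in asset.windows:
            yield (song.song_id, asset.asset_id, window.start_s, window.end_s)


asset = Asset("asset-1", duration_s=10.0)
slice_windows(asset, window_s=4.0, hop_s=2.0)
song = Song("song-1", assets=[asset])
rows = list(flatten(song))
for row in rows:
    print(row)
```

Moving to the full mainline would mean inserting `work` and `recording` levels between `Song` and `Asset`; the window level stays the leaf in both.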
......@@ -139,6 +144,7 @@ model_registry -> feature_set_registry -> audio_embedding / audio_fingerprint ->
- [README.md](./README.md)
- [session-handoff.md](./session-handoff.md)
- [postgresql-data-model.md](./postgresql-data-model.md)
- [postgres_db_schema_samples.md](./postgres_db_schema_samples.md)
- [phase1-worker-contract.md](./phase1-worker-contract.md)
### Scripts
......
#!/usr/bin/env /usr/local/miniconda3/bin/python
from __future__ import annotations

import argparse
import fnmatch
import re
import sys
from pathlib import Path

LINK_RE = re.compile(r'!?(?:\[([^\]]*)\])\(([^)]+)\)')
SKIP_PREFIXES = ('http://', 'https://', 'mailto:', 'tel:', '#')
DEFAULT_EXCLUDES = ['CHANGELOG.md']


def should_check(target: str) -> bool:
    # Skip empty targets, external URLs, and same-page anchors.
    target = target.strip()
    return bool(target) and not target.startswith(SKIP_PREFIXES)


def normalize_target(raw: str) -> str:
    # Unwrap <...> targets, then drop any #fragment or ?query suffix.
    target = raw.strip()
    if target.startswith('<') and target.endswith('>'):
        target = target[1:-1]
    target = target.split('#', 1)[0].split('?', 1)[0].strip()
    return target


def iter_markdown_files(root: Path, excludes: list[str]) -> list[Path]:
    # Collect *.md files under root, skipping paths that match an exclude glob.
    files: list[Path] = []
    for path in sorted(root.rglob('*.md')):
        rel = path.relative_to(root).as_posix()
        if any(fnmatch.fnmatch(rel, pattern) for pattern in excludes):
            continue
        files.append(path)
    return files


def scan_markdown_file(path: Path, root: Path) -> list[tuple[str, str]]:
    # Return (source file, raw target) pairs for relative links that do not resolve.
    missing: list[tuple[str, str]] = []
    text = path.read_text(encoding='utf-8')
    for _, raw_target in LINK_RE.findall(text):
        if not should_check(raw_target):
            continue
        target = normalize_target(raw_target)
        if not target:
            continue
        # Relative targets are resolved against the linking file's directory.
        resolved = (path.parent / target).resolve()
        if not resolved.exists():
            missing.append((path.relative_to(root).as_posix(), raw_target))
    return missing


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Check relative Markdown links for missing files.')
    parser.add_argument('--root', default='docs', help='Root directory containing markdown files')
    parser.add_argument('--exclude', action='append', default=[], help='Glob patterns relative to root to exclude')
    args = parser.parse_args()

    root = Path(args.root).resolve()
    if not root.exists():
        print(f'root not found: {root}', file=sys.stderr)
        sys.exit(2)

    excludes = DEFAULT_EXCLUDES + list(args.exclude)
    files = iter_markdown_files(root, excludes)
    failures: list[tuple[str, str]] = []
    for md in files:
        failures.extend(scan_markdown_file(md, root))

    # Exit codes: 0 = all links resolve, 1 = missing targets, 2 = root not found.
    if failures:
        print('Missing relative markdown targets:')
        for source, target in failures:
            print(f'- {source}: {target}')
        sys.exit(1)
    print(f'OK: checked {len(files)} markdown files under {root} (excluded: {excludes})')
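
To see how the checker classifies link targets, the snippet below copies the script's `LINK_RE`, `SKIP_PREFIXES`, `should_check`, and `normalize_target` and runs them on a made-up Markdown string. The sample text and the `img/flow.png` path are illustrative only and do not refer to files in the repository.

```python
import re

# Copied from the script above so the example runs on its own.
LINK_RE = re.compile(r'!?(?:\[([^\]]*)\])\(([^)]+)\)')
SKIP_PREFIXES = ('http://', 'https://', 'mailto:', 'tel:', '#')


def should_check(target: str) -> bool:
    target = target.strip()
    return bool(target) and not target.startswith(SKIP_PREFIXES)


def normalize_target(raw: str) -> str:
    target = raw.strip()
    if target.startswith('<') and target.endswith('>'):
        target = target[1:-1]
    return target.split('#', 1)[0].split('?', 1)[0].strip()


# Made-up sample: one relative link with a fragment, one image with a
# <...>-wrapped path, one external URL, and one same-page anchor.
sample = (
    "See [model](./postgresql-data-model.md#tables), "
    "![diagram](<img/flow.png>), "
    "[site](https://example.com) and [top](#intro)."
)

targets = [
    normalize_target(raw)
    for _, raw in LINK_RE.findall(sample)
    if should_check(raw)
]
print(targets)
```

Only the two relative targets survive the filter, with the fragment and angle brackets removed. Because `should_check` sees the raw target before `normalize_target` unwraps it, an external URL written as `<https://...>` would not be skipped and would be looked up as a relative path.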