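Add PostgreSQL dedup retrieval pipeline and hard evaluation set support

- Add a PostgreSQL import script, an evaluation script, and a schema definition, supporting a three-tier recall strategy based on exact_hash, pg_trgm, and line-level hashes
- Add a hard profile to the evaluation CLI, covering typos, OCR errors, whole-block translations, medley fragments, and other scenarios closer to real business edge cases
- Adjust the review thresholds and match-reason wording in checker.py, and refine the decision logic for translated-line similarity and chorus-only repetition
- Update README, TEST_WORKFLOW, and the unit tests accordingly

Co-Authored-By: Claude <noreply@anthropic.com>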
Showing 12 changed files with 183 additions and 5 deletions
POSTGRES_MIGRATION.md
0 → 100644
| ... | @@ -85,15 +85,33 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -85,15 +85,33 @@ python -m lyric_dedup.cli generate-eval-set \ |
| 85 | --positive-ratio 0.3 | 85 | --positive-ratio 0.3 |
| 86 | ``` | 86 | ``` |
| 87 | 87 | ||
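| 88 | Business rules of the generator: | 88 | By default, `--profile standard` generates the regular production evaluation set. You can also generate a hard set that is closer to real business edge cases: |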
| 89 | |||
| 90 | ```bash | ||
| 91 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 92 | --profile hard \ | ||
| 93 | --library-dir data/library \ | ||
| 94 | --lyrics-dir data/generated_eval/hard_incoming \ | ||
| 95 | --csv data/generated_eval/eval_hard_5000.csv \ | ||
| 96 | --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \ | ||
| 97 | --size 5000 \ | ||
| 98 | --positive-ratio 0.3 | ||
| 99 | ``` | ||
| 100 | |||
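| 101 | Business rules for the standard profile: | ||
| 89 | 102 | ||
| 90 | - Scan the entire library first, then do stratified sampling by valid lyric line count, language type, and file source prefix, rather than sampling by sorted prefix. | 103 | - Scan the entire library first, then do stratified sampling by valid lyric line count, language type, and file source prefix, rather than sampling by sorted prefix. |
| 91 | - `应去重` (should dedup) samples only contain style variations of full-song lyrics, such as timestamps, punctuation, platform noise, blank lines, a different number of chorus repetitions, and an appended Chinese translation. | 104 | - `应去重` (should dedup) samples only contain style variations of full-song lyrics, such as timestamps, punctuation, platform noise, blank lines, a different number of chorus repetitions, an appended Chinese translation, and a small number of typos/English misspellings. |
| 92 | - `不应去重` (should not dedup) samples are mostly real holdout full lyrics, and also include lyric fragments, shared-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric/placeholder edge samples. | 105 | - `不应去重` (should not dedup) samples are mostly real holdout full lyrics, and also include lyric fragments, shared-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric/placeholder edge samples. |
| 93 | - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`. | 106 | - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`. |
| 94 | - The generator additionally writes `--eval-index`; this index excludes the holdout songs and should be used when evaluating the generated CSV. | 107 | - The generator additionally writes `--eval-index`; this index excludes the holdout songs and should be used when evaluating the generated CSV. |
| 95 | - It also writes `*.manifest.json`, recording the seed, library size, holdout count, sample type distribution, language/source buckets, and sample source coverage count. | 108 | - It also writes `*.manifest.json`, recording the seed, library size, holdout count, sample type distribution, language/source buckets, and sample source coverage count. |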
| 96 | 109 | ||
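| 110 | The hard profile does not deliberately construct abnormal inputs; it mainly covers cases that are more likely to hit edge conditions in production: | ||
| 111 | |||
| 112 | - `应去重` (should dedup): platform-version noise for the same song, mostly complete lyrics missing one section, a whole appended block of Chinese translation, fairly realistic data-entry/OCR typos, and mixed timestamps and platform metadata. | ||
| 113 | - `不应去重` (should not dedup): real holdout new songs, near-neighbor new songs picked preferentially from the holdout because they share lines with the library, long but incomplete single-song fragments, multi-song medley-style fragments, shared-chorus collisions, translation-only similarity, and short-lyric edge cases. | ||
| 114 | |||
| 97 | First prepare a CSV, for example `data/eval/eval.csv`: | 115 | First prepare a CSV, for example `data/eval/eval.csv`: |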
| 98 | 116 | ||
| 99 | ```csv | 117 | ```csv | ... | ... |
| ... | @@ -108,6 +108,20 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -108,6 +108,20 @@ python -m lyric_dedup.cli generate-eval-set \ |
| 108 | --positive-ratio 0.3 | 108 | --positive-ratio 0.3 |
| 109 | ``` | 109 | ``` |
| 110 | 110 | ||
| 111 | To generate a hard-profile test set that is closer to real business edge cases: | ||
| 112 | |||
| 113 | ```bash | ||
| 114 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 115 | --profile hard \ | ||
| 116 | --library-dir data/library \ | ||
| 117 | --lyrics-dir data/generated_eval/hard_incoming \ | ||
| 118 | --csv data/generated_eval/eval_hard_5000.csv \ | ||
| 119 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 120 | --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \ | ||
| 121 | --size 5000 \ | ||
| 122 | --positive-ratio 0.3 | ||
| 123 | ``` | ||
| 124 | |||
| 111 | Default production evaluation rules: | 125 | Default production evaluation rules: |
| 112 | 126 | ||
| 113 | ```text | 127 | ```text |
| ... | @@ -120,7 +134,7 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -120,7 +134,7 @@ python -m lyric_dedup.cli generate-eval-set \ |
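| 120 | Business rules: | 134 | Business rules: |
| 121 | 135 | ||
| 122 | ```text | 136 | ```text |
| 123 | positive_* = 应去重 (should dedup), full-song lyric style variations | 137 | positive_* = 应去重 (should dedup), full-song lyric style variations, including a small number of typo/English misspelling perturbations |
| 124 | negative_real_holdout_full_song = 不应去重 (should not dedup), complete real lyrics, already excluded from the evaluation index | 138 | negative_real_holdout_full_song = 不应去重 (should not dedup), complete real lyrics, already excluded from the evaluation index |
| 125 | negative_fragment = 不应去重 (should not dedup), single-song fragment | 139 | negative_fragment = 不应去重 (should not dedup), single-song fragment |
| 126 | negative_shared_chorus = 不应去重 (should not dedup), shared-chorus collision | 140 | negative_shared_chorus = 不应去重 (should not dedup), shared-chorus collision |
| ... | @@ -129,6 +143,15 @@ negative_same_theme_synthetic = 不应去重 (should not dedup), new lyrics on the same theme | ... | @@ -129,6 +143,15 @@ negative_same_theme_synthetic = 不应去重 (should not dedup), new lyrics on the same theme |
| 129 | edge_short_or_placeholder = 不应去重 (should not dedup), short-lyric/placeholder edge sample | 143 | edge_short_or_placeholder = 不应去重 (should not dedup), short-lyric/placeholder edge sample |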
| 130 | ``` | 144 | ``` |
| 131 | 145 | ||
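| 146 | The hard profile additionally emphasizes real business edge cases rather than deliberately contrived hard problems: | ||
| 147 | |||
| 148 | ```text | ||
| 149 | positive_realistic_variant = 应去重 (should dedup), platform-version noise for the same song, mostly complete lyrics with a missing section, whole appended translation block, realistic data-entry/OCR errors | ||
| 150 | negative_near_neighbor_holdout_full_song = 不应去重 (should not dedup), real holdout new song that shares many lines with the library | ||
| 151 | negative_long_fragment = 不应去重 (should not dedup), long but incomplete single-song fragment | ||
| 152 | negative_catalog_mashup = 不应去重 (should not dedup), medley/mashup-style input assembled from fragments of several real lyrics | ||
| 153 | ``` | ||
| 154 | |||
| 132 | The generator scans the entire library and does stratified sampling by valid lyric line count, language type, and file source prefix. It sets aside a batch of holdout full lyrics as real new-song negative samples and builds an evaluation index that excludes the holdout. Each run also outputs: | 155 | The generator scans the entire library and does stratified sampling by valid lyric line count, language type, and file source prefix. It sets aside a batch of holdout full lyrics as real new-song negative samples and builds an evaluation index that excludes the holdout. Each run also outputs: |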
| 133 | 156 | ||
| 134 | ```text | 157 | ```text | ... | ... |
| ... | @@ -5,7 +5,7 @@ from __future__ import annotations | ... | @@ -5,7 +5,7 @@ from __future__ import annotations |
| 5 | import hashlib | 5 | import hashlib |
| 6 | import pickle | 6 | import pickle |
| 7 | from dataclasses import dataclass | 7 | from dataclasses import dataclass |
| 8 | from enum import StrEnum | 8 | from enum import Enum |
| 9 | from pathlib import Path | 9 | from pathlib import Path |
| 10 | 10 | ||
| 11 | from lyric_dedup.minhash_lsh import MinHashConfig | 11 | from lyric_dedup.minhash_lsh import MinHashConfig |
| ... | @@ -16,7 +16,7 @@ from lyric_dedup.normalization import lyric_tokens | ... | @@ -16,7 +16,7 @@ from lyric_dedup.normalization import lyric_tokens |
| 16 | from lyric_dedup.normalization import normalize_lyrics | 16 | from lyric_dedup.normalization import normalize_lyrics |
| 17 | 17 | ||
| 18 | 18 | ||
| 19 | class DuplicateDecision(StrEnum): | 19 | class DuplicateDecision(str, Enum): |
| 20 | DUPLICATE = "duplicate" | 20 | DUPLICATE = "duplicate" |
| 21 | REVIEW = "review" | 21 | REVIEW = "review" |
| 22 | NEW = "new" | 22 | NEW = "new" | ... | ... |
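The hunk above swaps `StrEnum` for a `str`-mixin `Enum`. `StrEnum` only exists in Python 3.11 and later, so a plausible motivation (an assumption, the commit message does not say) is support for older interpreters. The minimal sketch below shows that the behavior callers most likely depend on (comparison with plain strings, lookup by value, JSON serialization) is the same for the mixin form. The `payload` dictionary is a made-up example, not taken from the project.

```python
import json
from enum import Enum


class DuplicateDecision(str, Enum):
    """Same members as in the diff; the str mixin makes each member a real str."""

    DUPLICATE = "duplicate"
    REVIEW = "review"
    NEW = "new"


# Lookup by value returns the singleton member.
decision = DuplicateDecision("review")

# Members compare equal to their plain-string values.
is_review = decision == "review"

# json.dumps treats the member as a str and emits its value.
payload = json.dumps({"decision": decision})
print(payload)
```

One caveat worth checking in calling code: `str(member)` and f-string formatting of a `str`-mixin enum do not behave like `StrEnum`, and formatting changed between Python versions, so using `member.value` explicitly is the safer choice when building messages.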
| ... | @@ -53,6 +53,12 @@ def main() -> None: | ... | @@ -53,6 +53,12 @@ def main() -> None: |
| 53 | generate.add_argument("--seed", type=int, default=20260602) | 53 | generate.add_argument("--seed", type=int, default=20260602) |
| 54 | generate.add_argument("--index", default="", help="optional source index path recorded in the manifest") | 54 | generate.add_argument("--index", default="", help="optional source index path recorded in the manifest") |
| 55 | generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set") | 55 | generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set") |
| 56 | generate.add_argument( | ||
| 57 | "--profile", | ||
| 58 | choices=("standard", "hard"), | ||
| 59 | default="standard", | ||
| 60 | help="evaluation sample profile: standard production mix or harder business-realistic edge mix", | ||
| 61 | ) | ||
| 56 | 62 | ||
| 57 | args = parser.parse_args() | 63 | args = parser.parse_args() |
| 58 | if args.command == "build-index": | 64 | if args.command == "build-index": |
| ... | @@ -80,6 +86,7 @@ def main() -> None: | ... | @@ -80,6 +86,7 @@ def main() -> None: |
| 80 | seed=args.seed, | 86 | seed=args.seed, |
| 81 | index_path=Path(args.index) if args.index else None, | 87 | index_path=Path(args.index) if args.index else None, |
| 82 | eval_index_path=Path(args.eval_index) if args.eval_index else None, | 88 | eval_index_path=Path(args.eval_index) if args.eval_index else None, |
| 89 | profile=args.profile, | ||
| 83 | ) | 90 | ) |
| 84 | print(json.dumps(summary, ensure_ascii=False)) | 91 | print(json.dumps(summary, ensure_ascii=False)) |
| 85 | 92 | ... | ... |
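To illustrate how the new `--profile` option behaves, here is a stripped-down, self-contained parser that reproduces only that argument. `build_parser` and `parse_profile` are hypothetical helper names for this sketch; the real CLI defines many more arguments and subcommands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal stand-in for the CLI: one subcommand with the --profile option."""
    parser = argparse.ArgumentParser(prog="lyric_dedup.cli")
    subparsers = parser.add_subparsers(dest="command", required=True)
    generate = subparsers.add_parser("generate-eval-set")
    generate.add_argument(
        "--profile",
        choices=("standard", "hard"),
        default="standard",
        help="evaluation sample profile",
    )
    return parser


def parse_profile(argv: list) -> str:
    """Return the profile selected by the given argument list."""
    return build_parser().parse_args(argv).profile


print(parse_profile(["generate-eval-set"]))
print(parse_profile(["generate-eval-set", "--profile", "hard"]))
```

Because `choices` is set, any other value makes argparse print a usage error and exit, so a typo in the profile name cannot silently fall back to the standard mix.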
requirements.txt
0 → 100644
scripts/evaluate_postgres.py
0 → 100644
scripts/import_library_postgres.py
0 → 100644
scripts/init_postgres.py
0 → 100644
| 1 | """Initialize PostgreSQL schema for lyric dedup storage.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import argparse | ||
| 6 | import sys | ||
| 7 | from pathlib import Path | ||
| 8 | |||
| 9 | |||
| 10 | PROJECT_ROOT = Path(__file__).resolve().parents[1] | ||
| 11 | SCHEMA_PATH = PROJECT_ROOT / "scripts" / "postgres_schema.sql" | ||
| 12 | |||
| 13 | |||
| 14 | def main() -> None: | ||
| 15 | parser = argparse.ArgumentParser(description="Initialize PostgreSQL schema for lyric dedup.") | ||
| 16 | parser.add_argument("--dsn", required=True, help="PostgreSQL DSN, e.g. postgresql://user:pass@localhost:5432/lyric_dedup") | ||
| 17 | parser.add_argument("--schema", default=str(SCHEMA_PATH)) | ||
| 18 | args = parser.parse_args() | ||
| 19 | |||
| 20 | psycopg = _import_psycopg() | ||
| 21 | schema_sql = Path(args.schema).read_text(encoding="utf-8") | ||
| 22 | with psycopg.connect(args.dsn) as conn: | ||
| 23 | with conn.cursor() as cursor: | ||
| 24 | cursor.execute(schema_sql) | ||
| 25 | conn.commit() | ||
| 26 | print(f"initialized schema from {args.schema}") | ||
| 27 | |||
| 28 | |||
| 29 | def _import_psycopg(): | ||
| 30 | try: | ||
| 31 | import psycopg | ||
| 32 | |||
| 33 | return psycopg | ||
| 34 | except ModuleNotFoundError: | ||
| 35 | print( | ||
| 36 | "Missing dependency: psycopg. Install it with:\n" | ||
| 37 | " python -m pip install 'psycopg[binary]'", | ||
| 38 | file=sys.stderr, | ||
| 39 | ) | ||
| 40 | raise SystemExit(1) | ||
| 41 | |||
| 42 | |||
| 43 | if __name__ == "__main__": | ||
| 44 | main() |
scripts/postgres_schema.sql
0 → 100644
| 1 | create extension if not exists pg_trgm; | ||
| 2 | |||
| 3 | create table if not exists lyrics ( | ||
| 4 | id bigserial primary key, | ||
| 5 | record_id text not null unique, | ||
| 6 | source_path text not null, | ||
| 7 | title text, | ||
| 8 | artist text, | ||
| 9 | raw_text text not null, | ||
| 10 | normalized_text text not null, | ||
| 11 | primary_text text not null, | ||
| 12 | translation_text text, | ||
| 13 | exact_hash text not null, | ||
| 14 | split_confidence text, | ||
| 15 | split_reason text, | ||
| 16 | line_count integer not null, | ||
| 17 | created_at timestamptz not null default now(), | ||
| 18 | updated_at timestamptz not null default now(), | ||
| 19 | deleted_at timestamptz | ||
| 20 | ); | ||
| 21 | |||
| 22 | create index if not exists lyrics_exact_hash_idx | ||
| 23 | on lyrics (exact_hash) | ||
| 24 | where deleted_at is null; | ||
| 25 | |||
| 26 | create index if not exists lyrics_primary_text_trgm_idx | ||
| 27 | on lyrics using gin (primary_text gin_trgm_ops); | ||
| 28 | |||
| 29 | create table if not exists lyric_lines ( | ||
| 30 | lyric_id bigint not null references lyrics(id) on delete cascade, | ||
| 31 | role text not null, | ||
| 32 | line_no integer not null, | ||
| 33 | normalized_line text not null, | ||
| 34 | line_hash text not null, | ||
| 35 | primary key (lyric_id, role, line_no) | ||
| 36 | ); | ||
| 37 | |||
| 38 | create index if not exists lyric_lines_hash_idx | ||
| 39 | on lyric_lines (line_hash); | ||
| 40 | |||
| 41 | create index if not exists lyric_lines_lyric_id_idx | ||
| 42 | on lyric_lines (lyric_id); |
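The import and evaluation scripts are collapsed in this view, so the following is only a sketch of how the three recall tiers named in the commit message (exact_hash, pg_trgm, line-level hash) could map onto this schema. Everything here is an assumption for illustration: the SHA-256 digest, the helper names `sha256_hex`, `line_hashes`, and `line_overlap_candidates`, the `min_ratio` threshold, and the three SQL strings. The SQL is written with psycopg-style placeholders and is not executed; the line-level tier is also mirrored in memory so the idea can be run without a database.

```python
from __future__ import annotations

import hashlib

# Hypothetical tier queries against the schema above (not executed here).
# Tier 1: exact match on the whole-text hash, using the partial index.
EXACT_SQL = "select record_id from lyrics where exact_hash = %s and deleted_at is null"
# Tier 2: trigram similarity; "%%" is the escaped pg_trgm "%" operator.
TRGM_SQL = (
    "select record_id, similarity(primary_text, %(q)s) as score from lyrics "
    "where deleted_at is null and primary_text %% %(q)s "
    "order by score desc limit %(k)s"
)
# Tier 3: count how many query line hashes each stored lyric shares.
LINE_SQL = (
    "select lyric_id, count(distinct line_hash) as shared from lyric_lines "
    "where line_hash = any(%s) group by lyric_id order by shared desc limit %s"
)


def sha256_hex(text: str) -> str:
    """Assumed digest; the real scripts may hash differently."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def line_hashes(normalized_text: str) -> list[str]:
    """Hash every non-empty line, analogous to lyric_lines.line_hash."""
    return [sha256_hex(line) for line in normalized_text.splitlines() if line.strip()]


def line_overlap_candidates(
    query_text: str, library: dict[str, str], min_ratio: float = 0.5
) -> list[tuple[str, float]]:
    """In-memory version of tier 3: rank records by the share of query lines they contain."""
    query = set(line_hashes(query_text))
    if not query:
        return []
    scored = []
    for record_id, text in library.items():
        ratio = len(query & set(line_hashes(text))) / len(query)
        if ratio >= min_ratio:
            scored.append((record_id, ratio))
    return sorted(scored, key=lambda item: item[1], reverse=True)


library = {
    "a": "hello world\nfoo bar\nbaz",
    "b": "other\nlines",
    "c": "hello world\nzzz",
}
candidates = line_overlap_candidates("hello world\nfoo bar", library)
print(candidates)
```

In a layered design like this, the cheap exact-hash lookup would typically run first, with the trigram and line-hash tiers used only to collect candidates for the stricter checker to score.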
| ... | @@ -316,6 +316,40 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: | ... | @@ -316,6 +316,40 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: |
| 316 | assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_")) | 316 | assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_")) |
| 317 | 317 | ||
| 318 | 318 | ||
| 319 | def test_generated_hard_eval_set_uses_business_realistic_edge_mix(tmp_path) -> None: | ||
| 320 | library = tmp_path / "library" | ||
| 321 | incoming = tmp_path / "generated" / "incoming" | ||
| 322 | eval_csv = tmp_path / "generated" / "eval_hard.csv" | ||
| 323 | library.mkdir() | ||
| 324 | for idx in range(24): | ||
| 325 | prefix = "AY" if idx % 3 == 0 else "WHHY" | ||
| 326 | lyric = BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}") | ||
| 327 | if idx % 4 == 0: | ||
| 328 | lyric += "\nI miss you tonight\nUnder the moonlight\nNever let me go\n" | ||
| 329 | (library / f"{idx}_{prefix}{idx:06d}.txt").write_text(lyric, encoding="utf-8") | ||
| 330 | |||
| 331 | generate_eval_set( | ||
| 332 | library_dir=library, | ||
| 333 | output_dir=incoming, | ||
| 334 | csv_path=eval_csv, | ||
| 335 | size=40, | ||
| 336 | positive_ratio=0.3, | ||
| 337 | profile="hard", | ||
| 338 | ) | ||
| 339 | |||
| 340 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) | ||
| 341 | manifest = json.loads((tmp_path / "generated" / "eval_hard.csv.manifest.json").read_text(encoding="utf-8")) | ||
| 342 | sample_types = {row["sample_type"] for row in rows} | ||
| 343 | |||
| 344 | assert len(rows) == 40 | ||
| 345 | assert manifest["profile"] == "hard" | ||
| 346 | assert "positive_realistic_variant" in manifest["plan"] | ||
| 347 | assert "negative_near_neighbor_holdout_full_song" in manifest["plan"] | ||
| 348 | assert "negative_long_fragment" in sample_types | ||
| 349 | assert "negative_catalog_mashup" in sample_types | ||
| 350 | assert any(row["sample_type"].startswith("positive_") for row in rows) | ||
| 351 | |||
| 352 | |||
| 319 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | 353 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: |
| 320 | checker = DuplicateChecker() | 354 | checker = DuplicateChecker() |
| 321 | checker.add_record( | 355 | checker.add_record( | ... | ... |
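The new test reads the generated CSV through `csv.DictReader` and inspects the `sample_type` and `expected` columns. A similar check can be handy when eyeballing a freshly generated hard set. The sketch below assumes only those two column names, which appear in the test; `sample_type_counts` and the inline CSV text are made up for illustration.

```python
import csv
import io
from collections import Counter


def sample_type_counts(csv_file) -> Counter:
    """Count rows per (expected, sample_type) pair in a generated eval CSV."""
    reader = csv.DictReader(csv_file)
    return Counter((row["expected"], row["sample_type"]) for row in reader)


# Tiny made-up CSV standing in for a generated eval file.
example = io.StringIO(
    "sample_type,expected\n"
    "positive_realistic_variant,应去重\n"
    "negative_long_fragment,不应去重\n"
    "negative_long_fragment,不应去重\n"
    "negative_catalog_mashup,不应去重\n"
)
counts = sample_type_counts(example)
for (expected, sample_type), count in sorted(counts.items()):
    print(expected, sample_type, count)
```

For a real file, open it with `encoding="utf-8"` and `newline=""` and pass the file object to `sample_type_counts`.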