Commit 49008962 4900896283e6b52190437fd467f52ab75caf2530 by 沈秋雨

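Add PostgreSQL dedup retrieval pipeline and hard evaluation set support

- Add a PostgreSQL import script, evaluation script, and schema definition, supporting a three-tier recall strategy based on exact_hash, pg_trgm, and line-level hashes
- Add a hard profile to the evaluation CLI, covering scenarios closer to business edge cases such as typos, OCR errors, full-passage translations, and medley fragments
- Adjust the review thresholds and match-reason wording in checker.py, and refine the decision logic for translation-line similarity and chorus-only repetition
- Update README, TEST_WORKFLOW, and the unit tests accordingly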

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent ba39ce6a
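Reviewer note: the commit message names a three-tier recall strategy (exact_hash, pg_trgm, line-level hash), but the scripts that implement it are not part of this excerpt. The sketch below is a minimal in-memory stand-in in Python, not the commit's PostgreSQL queries. The function names, the fall-through order between tiers, both thresholds, and the choice of SHA-256 are assumptions, and `trigram_similarity` only approximates what pg_trgm's `similarity()` computes.

```python
from __future__ import annotations

import hashlib


def text_hash(text: str) -> str:
    """Hash helper. SHA-256 is an assumption; the commit does not name the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def trigrams(text: str) -> set[str]:
    """Character trigrams over a padded string (rough approximation of pg_trgm)."""
    padded = f"  {text} "
    return {padded[i : i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def recall(
    query: str,
    library: dict[str, str],
    trgm_threshold: float = 0.3,  # hypothetical value
    line_threshold: float = 0.5,  # hypothetical value
) -> list[tuple[str, str]]:
    """Return (tier, record_id) candidates from the first tier that yields any."""
    # Tier 1: exact match on the hash of the whole normalized text.
    query_hash = text_hash(query)
    hits = [("exact_hash", rid) for rid, text in library.items() if text_hash(text) == query_hash]
    if hits:
        return hits

    # Tier 2: fuzzy match on trigram similarity of the whole text.
    hits = [
        ("pg_trgm", rid)
        for rid, text in library.items()
        if trigram_similarity(query, text) >= trgm_threshold
    ]
    if hits:
        return hits

    # Tier 3: share of query lines whose hash also appears in the record.
    query_lines = {text_hash(line) for line in query.splitlines() if line.strip()}
    hits = []
    for rid, text in library.items():
        record_lines = {text_hash(line) for line in text.splitlines() if line.strip()}
        if query_lines and len(query_lines & record_lines) / len(query_lines) >= line_threshold:
            hits.append(("line_hash", rid))
    return hits


library = {
    "a": "hello world\ngood night",
    "b": "totally different\nlyrics here",
}
print(recall("hello world\ngood night", library))  # identical text
print(recall("hello world\ngood nite", library))   # one misspelled word
print(recall("good night\nzzzz yyyy xxxx wwww vvvv uuuu", library))  # one shared line
```

Recall only proposes candidates. Whether a candidate ends up as `duplicate`, `review`, or `new` is decided separately (in `checker.py`, according to the commit message).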
...@@ -85,15 +85,33 @@ python -m lyric_dedup.cli generate-eval-set \ ...@@ -85,15 +85,33 @@ python -m lyric_dedup.cli generate-eval-set \
85 --positive-ratio 0.3 85 --positive-ratio 0.3
86 ``` 86 ```
87 87
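88 Business criteria of the generator: 88 By default, `--profile standard` generates the regular production evaluation set. A hard set closer to business edge cases can also be generated: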
89
90 ```bash
91 python -m lyric_dedup.cli generate-eval-set \
92 --profile hard \
93 --library-dir data/library \
94 --lyrics-dir data/generated_eval/hard_incoming \
95 --csv data/generated_eval/eval_hard_5000.csv \
96 --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \
97 --size 5000 \
98 --positive-ratio 0.3
99 ```
100
101 Business criteria for the standard profile:
89 102
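90 - Scan the entire library first, then sample with stratification by valid lyric line count, language type, and file source prefix, instead of taking a prefix of the sorted list. 103 - Scan the entire library first, then sample with stratification by valid lyric line count, language type, and file source prefix, instead of taking a prefix of the sorted list.
91 - `应去重` (should dedupe) samples only generate style variations of full-song lyrics, such as timestamps, punctuation, platform noise, blank lines, changes in the number of chorus repetitions, and appended Chinese translations. 104 - `应去重` (should dedupe) samples only generate style variations of full-song lyrics, such as timestamps, punctuation, platform noise, blank lines, changes in the number of chorus repetitions, appended Chinese translations, and a small number of typos/English spelling errors.
92 - `不应去重` (should not dedupe) samples are mainly real holdout full lyrics, and also include lyric fragments, shared-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric/placeholder edge samples. 105 - `不应去重` (should not dedupe) samples are mainly real holdout full lyrics, and also include lyric fragments, shared-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric/placeholder edge samples.
93 - A lyric fragment must not be output as `duplicate`, even if it matches part of an existing song; at most it goes to `review`. 106 - A lyric fragment must not be output as `duplicate`, even if it matches part of an existing song; at most it goes to `review`.
94 - The generator additionally writes `--eval-index`; this index excludes the holdout songs and should be used when evaluating the generated CSV. 107 - The generator additionally writes `--eval-index`; this index excludes the holdout songs and should be used when evaluating the generated CSV.
95 - It also generates `*.manifest.json`, recording the seed, library size, holdout count, sample type distribution, language/source buckets, and sample source coverage count. 108 - It also generates `*.manifest.json`, recording the seed, library size, holdout count, sample type distribution, language/source buckets, and sample source coverage count.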
96 109
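110 The hard profile does not deliberately craft abnormal inputs; it mainly covers cases that are more likely to hit edge conditions in production:
111
112 - `应去重` (should dedupe): platform-version noise for the same song, mostly complete lyrics missing one section, an appended full-passage Chinese translation, fairly realistic data-entry/OCR typos, and a mix of timestamps and platform metadata.
113 - `不应去重` (should not dedupe): real holdout new songs, near-neighbor new songs picked preferentially from the holdout because they share lines with the library, longer but incomplete single-song fragments, multi-song medley/mashup-style fragments, shared-chorus collisions, translation-only similarity, and short-lyric edge cases.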
114
97 First prepare a CSV, for example `data/eval/eval.csv`: 115 First prepare a CSV, for example `data/eval/eval.csv`:
98 116
99 ```csv 117 ```csv
......
...@@ -108,6 +108,20 @@ python -m lyric_dedup.cli generate-eval-set \ ...@@ -108,6 +108,20 @@ python -m lyric_dedup.cli generate-eval-set \
108 --positive-ratio 0.3 108 --positive-ratio 0.3
109 ``` 109 ```
110 110
111 To generate a hard-profile test set closer to business edge cases:
112
113 ```bash
114 python -m lyric_dedup.cli generate-eval-set \
115 --profile hard \
116 --library-dir data/library \
117 --lyrics-dir data/generated_eval/hard_incoming \
118 --csv data/generated_eval/eval_hard_5000.csv \
119 --index outputs/indexes/library_lyrics.pkl \
120 --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \
121 --size 5000 \
122 --positive-ratio 0.3
123 ```
124
111 Default production evaluation criteria: 125 Default production evaluation criteria:
112 126
113 ```text 127 ```text
...@@ -120,7 +134,7 @@ python -m lyric_dedup.cli generate-eval-set \ ...@@ -120,7 +134,7 @@ python -m lyric_dedup.cli generate-eval-set \
120 Business criteria: 134 Business criteria:
121 135
122 ```text 136 ```text
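123 positive_* = 应去重 (should dedupe): style variations of full-song lyrics 137 positive_* = 应去重 (should dedupe): style variations of full-song lyrics, including a small number of typo/English spelling-error perturbations
124 negative_real_holdout_full_song = 不应去重 (should not dedupe): complete real lyrics, already excluded from the evaluation index 138 negative_real_holdout_full_song = 不应去重 (should not dedupe): complete real lyrics, already excluded from the evaluation index
125 negative_fragment = 不应去重 (should not dedupe): single-song fragment 139 negative_fragment = 不应去重 (should not dedupe): single-song fragment
126 negative_shared_chorus = 不应去重 (should not dedupe): shared-chorus collision 140 negative_shared_chorus = 不应去重 (should not dedupe): shared-chorus collision
...@@ -129,6 +143,15 @@ negative_same_theme_synthetic = 不应去重 (should not dedupe): new lyrics on the same theme ...@@ -129,6 +143,15 @@ negative_same_theme_synthetic = 不应去重 (should not dedupe): new lyrics on the same theme
129 edge_short_or_placeholder = 不应去重 (should not dedupe): short-lyric/placeholder edge sample 143 edge_short_or_placeholder = 不应去重 (should not dedupe): short-lyric/placeholder edge sample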
130 ``` 144 ```
131 145
146 The hard profile additionally emphasizes real business edge cases rather than deliberately contrived anomalies:
147
148 ```text
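149 positive_realistic_variant = 应去重 (should dedupe): platform-version noise for the same song, mostly complete lyrics with a missing section, appended full-passage translation, realistic data-entry/OCR errors
150 negative_near_neighbor_holdout_full_song = 不应去重 (should not dedupe): real holdout new songs that share many lines with the library
151 negative_long_fragment = 不应去重 (should not dedupe): longer but incomplete single-song fragment
152 negative_catalog_mashup = 不应去重 (should not dedupe): medley/mashup-style input composed of fragments from several real lyrics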
153 ```
154
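132 The generator scans the entire library and samples with stratification by valid lyric line count, language type, and file source prefix. It sets aside a batch of holdout full lyrics as real new-song negative samples and builds an evaluation index that excludes the holdout. Each run also outputs: 155 The generator scans the entire library and samples with stratification by valid lyric line count, language type, and file source prefix. It sets aside a batch of holdout full lyrics as real new-song negative samples and builds an evaluation index that excludes the holdout. Each run also outputs: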
133 156
134 ```text 157 ```text
......
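Reviewer note: the criteria blocks in the documentation hunks tie each `sample_type` prefix to one expected label. A minimal sketch of that mapping, which could be used to sanity-check a generated CSV, follows. `expected_label` and `mismatched_rows` are hypothetical helpers, not functions from the repository; the column names `sample_type` and `expected` are taken from the unit test in this commit, and the file names in the sample rows are made up.

```python
from __future__ import annotations

import csv
import io

SHOULD_DEDUPE = "应去重"
SHOULD_NOT_DEDUPE = "不应去重"


def expected_label(sample_type: str) -> str:
    """Map a sample_type to the label the documented criteria assign to it."""
    if sample_type.startswith("positive_"):
        return SHOULD_DEDUPE
    if sample_type.startswith(("negative_", "edge_")):
        return SHOULD_NOT_DEDUPE
    raise ValueError(f"unknown sample_type prefix: {sample_type!r}")


def mismatched_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return rows whose expected label contradicts their sample_type."""
    return [row for row in rows if row["expected"] != expected_label(row["sample_type"])]


# In-memory stand-in for a generated eval CSV (file names are illustrative only).
sample_csv = io.StringIO(
    "file,expected,sample_type\n"
    "a.txt,应去重,positive_realistic_variant\n"
    "b.txt,不应去重,negative_long_fragment\n"
    "c.txt,应去重,negative_catalog_mashup\n"
)
rows = list(csv.DictReader(sample_csv))
for row in mismatched_rows(rows):
    print("label mismatch:", row["file"], row["sample_type"], row["expected"])
```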
...@@ -5,7 +5,7 @@ from __future__ import annotations ...@@ -5,7 +5,7 @@ from __future__ import annotations
5 import hashlib 5 import hashlib
6 import pickle 6 import pickle
7 from dataclasses import dataclass 7 from dataclasses import dataclass
8 from enum import StrEnum 8 from enum import Enum
9 from pathlib import Path 9 from pathlib import Path
10 10
11 from lyric_dedup.minhash_lsh import MinHashConfig 11 from lyric_dedup.minhash_lsh import MinHashConfig
...@@ -16,7 +16,7 @@ from lyric_dedup.normalization import lyric_tokens ...@@ -16,7 +16,7 @@ from lyric_dedup.normalization import lyric_tokens
16 from lyric_dedup.normalization import normalize_lyrics 16 from lyric_dedup.normalization import normalize_lyrics
17 17
18 18
19 class DuplicateDecision(StrEnum): 19 class DuplicateDecision(str, Enum):
20 DUPLICATE = "duplicate" 20 DUPLICATE = "duplicate"
21 REVIEW = "review" 21 REVIEW = "review"
22 NEW = "new" 22 NEW = "new"
......
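Reviewer note: the hunk above swaps `StrEnum` for a `str`/`Enum` mixin. `enum.StrEnum` only exists from Python 3.11 onward, so the mixin form presumably keeps the module importable on older interpreters; the commit message does not state the motivation. The sketch below re-declares the class from the diff and shows the behaviours callers most likely depend on.

```python
import json
from enum import Enum


class DuplicateDecision(str, Enum):
    DUPLICATE = "duplicate"
    REVIEW = "review"
    NEW = "new"


# Lookup by value returns the singleton member.
decision = DuplicateDecision("review")
print(decision is DuplicateDecision.REVIEW)

# Members are real str instances, so they compare equal to plain strings
# and serialize as plain strings.
print(decision == "review")
print(json.dumps({"decision": DuplicateDecision.NEW}))

# .value is the unambiguous way to get the underlying string.
print(decision.value)
```

One difference to keep in mind: `str()` on a mixin member returns the qualified name (`DuplicateDecision.REVIEW`), whereas `StrEnum` members return the value, so code that formats decisions for output is safer using `.value`.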
...@@ -53,6 +53,12 @@ def main() -> None: ...@@ -53,6 +53,12 @@ def main() -> None:
53 generate.add_argument("--seed", type=int, default=20260602) 53 generate.add_argument("--seed", type=int, default=20260602)
54 generate.add_argument("--index", default="", help="optional source index path recorded in the manifest") 54 generate.add_argument("--index", default="", help="optional source index path recorded in the manifest")
55 generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set") 55 generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set")
56 generate.add_argument(
57 "--profile",
58 choices=("standard", "hard"),
59 default="standard",
60 help="evaluation sample profile: standard production mix or harder business-realistic edge mix",
61 )
56 62
57 args = parser.parse_args() 63 args = parser.parse_args()
58 if args.command == "build-index": 64 if args.command == "build-index":
...@@ -80,6 +86,7 @@ def main() -> None: ...@@ -80,6 +86,7 @@ def main() -> None:
80 seed=args.seed, 86 seed=args.seed,
81 index_path=Path(args.index) if args.index else None, 87 index_path=Path(args.index) if args.index else None,
82 eval_index_path=Path(args.eval_index) if args.eval_index else None, 88 eval_index_path=Path(args.eval_index) if args.eval_index else None,
89 profile=args.profile,
83 ) 90 )
84 print(json.dumps(summary, ensure_ascii=False)) 91 print(json.dumps(summary, ensure_ascii=False))
85 92
......
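Reviewer note: the `--profile` flag added above relies on argparse `choices`, so any value other than `standard` or `hard` is rejected before `generate_eval_set` is called. The standalone sketch below reproduces only that one argument; `build_parser` is a hypothetical name, and the real CLI defines a subcommand with many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the --profile option from the diff."""
    parser = argparse.ArgumentParser(prog="generate-eval-set")
    parser.add_argument(
        "--profile",
        choices=("standard", "hard"),
        default="standard",
        help="evaluation sample profile: standard production mix or harder business-realistic edge mix",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).profile)                     # falls back to the default
print(parser.parse_args(["--profile", "hard"]).profile)  # explicit hard profile

# An unknown profile makes argparse print a usage error and exit.
try:
    parser.parse_args(["--profile", "extreme"])
except SystemExit as exc:
    print("rejected, exit code:", exc.code)
```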
1 # Test runner
2 pytest>=8.0
3
4 # PostgreSQL storage prototype
5 psycopg[binary]>=3.2
6
7 # Existing MySQL/COS lyric download utilities
8 pymysql>=1.1
9 cos-python-sdk-v5>=1.9
10 tqdm>=4.66
1 """Initialize PostgreSQL schema for lyric dedup storage."""
2
3 from __future__ import annotations
4
5 import argparse
6 import sys
7 from pathlib import Path
8
9
10 PROJECT_ROOT = Path(__file__).resolve().parents[1]
11 SCHEMA_PATH = PROJECT_ROOT / "scripts" / "postgres_schema.sql"
12
13
14 def main() -> None:
15     parser = argparse.ArgumentParser(description="Initialize PostgreSQL schema for lyric dedup.")
16     parser.add_argument("--dsn", required=True, help="PostgreSQL DSN, e.g. postgresql://user:pass@localhost:5432/lyric_dedup")
17     parser.add_argument("--schema", default=str(SCHEMA_PATH))
18     args = parser.parse_args()
19
20     psycopg = _import_psycopg()
21     schema_sql = Path(args.schema).read_text(encoding="utf-8")
22     with psycopg.connect(args.dsn) as conn:
23         with conn.cursor() as cursor:
24             cursor.execute(schema_sql)
25         conn.commit()
26     print(f"initialized schema from {args.schema}")
27
28
29 def _import_psycopg():
30     try:
31         import psycopg
32
33         return psycopg
34     except ModuleNotFoundError:
35         print(
36             "Missing dependency: psycopg. Install it with:\n"
37             " python -m pip install 'psycopg[binary]'",
38             file=sys.stderr,
39         )
40         raise SystemExit(1)
41
42
43 if __name__ == "__main__":
44     main()
1 create extension if not exists pg_trgm;
2
3 create table if not exists lyrics (
4     id bigserial primary key,
5     record_id text not null unique,
6     source_path text not null,
7     title text,
8     artist text,
9     raw_text text not null,
10     normalized_text text not null,
11     primary_text text not null,
12     translation_text text,
13     exact_hash text not null,
14     split_confidence text,
15     split_reason text,
16     line_count integer not null,
17     created_at timestamptz not null default now(),
18     updated_at timestamptz not null default now(),
19     deleted_at timestamptz
20 );
21
22 create index if not exists lyrics_exact_hash_idx
23     on lyrics (exact_hash)
24     where deleted_at is null;
25
26 create index if not exists lyrics_primary_text_trgm_idx
27     on lyrics using gin (primary_text gin_trgm_ops);
28
29 create table if not exists lyric_lines (
30     lyric_id bigint not null references lyrics(id) on delete cascade,
31     role text not null,
32     line_no integer not null,
33     normalized_line text not null,
34     line_hash text not null,
35     primary key (lyric_id, role, line_no)
36 );
37
38 create index if not exists lyric_lines_hash_idx
39     on lyric_lines (line_hash);
40
41 create index if not exists lyric_lines_lyric_id_idx
42     on lyric_lines (lyric_id);
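Reviewer note: the schema above provides one index per recall tier named in the commit message: a partial B-tree on `exact_hash`, a GIN trigram index on `primary_text`, and a B-tree on `lyric_lines.line_hash`. The builders below sketch how each tier could be queried through psycopg. They are not the queries from the commit's import or evaluation scripts, which are not shown here; the function names and `limit` defaults are assumptions. Because psycopg treats `%` as a placeholder marker, pg_trgm's `%` operator has to be written `%%` in a parameterized query.

```python
from __future__ import annotations


def exact_hash_query(exact_hash: str) -> tuple[str, tuple]:
    """Tier 1: exact match; the predicate matches the partial index condition."""
    sql = "select id, record_id from lyrics where exact_hash = %s and deleted_at is null"
    return sql, (exact_hash,)


def trigram_query(primary_text: str, limit: int = 20) -> tuple[str, tuple]:
    """Tier 2: pg_trgm similarity search; '%%' is the escaped '%' operator."""
    sql = (
        "select id, record_id, similarity(primary_text, %s) as score "
        "from lyrics "
        "where deleted_at is null and primary_text %% %s "
        "order by score desc limit %s"
    )
    return sql, (primary_text, primary_text, limit)


def line_hash_query(line_hashes: list[str], limit: int = 20) -> tuple[str, tuple]:
    """Tier 3: rank lyrics by how many distinct query line hashes they contain."""
    sql = (
        "select lyric_id, count(distinct line_hash) as matched_lines "
        "from lyric_lines "
        "where line_hash = any(%s) "
        "group by lyric_id order by matched_lines desc limit %s"
    )
    return sql, (line_hashes, limit)


for sql, params in (
    exact_hash_query("abc123"),
    trigram_query("hello world"),
    line_hash_query(["h1", "h2"]),
):
    print(sql)
    print(params)
```

With a live connection, each pair can be passed straight to `cursor.execute(sql, params)`. The tier 3 sketch does not join back to `lyrics`, so soft-deleted records and the `role` column would still need to be filtered.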
...@@ -316,6 +316,40 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: ...@@ -316,6 +316,40 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None:
316 assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_")) 316 assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_"))
317 317
318 318
319 def test_generated_hard_eval_set_uses_business_realistic_edge_mix(tmp_path) -> None:
320     library = tmp_path / "library"
321     incoming = tmp_path / "generated" / "incoming"
322     eval_csv = tmp_path / "generated" / "eval_hard.csv"
323     library.mkdir()
324     for idx in range(24):
325         prefix = "AY" if idx % 3 == 0 else "WHHY"
326         lyric = BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}")
327         if idx % 4 == 0:
328             lyric += "\nI miss you tonight\nUnder the moonlight\nNever let me go\n"
329         (library / f"{idx}_{prefix}{idx:06d}.txt").write_text(lyric, encoding="utf-8")
330
331     generate_eval_set(
332         library_dir=library,
333         output_dir=incoming,
334         csv_path=eval_csv,
335         size=40,
336         positive_ratio=0.3,
337         profile="hard",
338     )
339
340     rows = list(csv.DictReader(eval_csv.open(encoding="utf-8")))
341     manifest = json.loads((tmp_path / "generated" / "eval_hard.csv.manifest.json").read_text(encoding="utf-8"))
342     sample_types = {row["sample_type"] for row in rows}
343
344     assert len(rows) == 40
345     assert manifest["profile"] == "hard"
346     assert "positive_realistic_variant" in manifest["plan"]
347     assert "negative_near_neighbor_holdout_full_song" in manifest["plan"]
348     assert "negative_long_fragment" in sample_types
349     assert "negative_catalog_mashup" in sample_types
350     assert any(row["sample_type"].startswith("positive_") for row in rows)
352
319 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: 353 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None:
320 checker = DuplicateChecker() 354 checker = DuplicateChecker()
321 checker.add_record( 355 checker.add_record(
......