Commit 51ddab43 51ddab43fb5e3638a8b8c9cd8049679fe8b2ccc7 by 沈秋雨

Add lyric duplicate detection workflow

0 parents
.DS_Store
__pycache__/
*.py[cod]
.pytest_cache/

# Local lyric data and generated artifacts
data/
outputs/
downloaded_lyrics/
downloaded_lyrics_type3/
download_failed*.txt

# Local downloader / scratch utilities
download_lyrics.py
test_db_connection.py
*.env

# Reference project kept locally only
text-dedup-main/

# Virtual environments and editor files
.venv/
venv/
.idea/
.vscode/
# Lyric Duplicate Checker

The first version targets duplicate checking for newly added lyrics: build an index from the existing `.lrc` / `.txt` lyrics, then query it with each new lyric. Every query returns one of `duplicate`, `review`, or `new`.

## Build the index

Assuming the existing library lives in `data/library/`:

```bash
python -m lyric_dedup.cli build-index \
  --lyrics-dir data/library \
  --index outputs/indexes/lyrics.pkl
```

## Check a single new lyric

```bash
python -m lyric_dedup.cli check-file \
  --index outputs/indexes/lyrics.pkl \
  --file data/incoming/new_song.lrc
```

## Batch-check a directory of new lyrics

```bash
python -m lyric_dedup.cli batch-check \
  --index outputs/indexes/lyrics.pkl \
  --lyrics-dir data/incoming \
  --out outputs/results/incoming_check.csv
```

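Key columns in the CSV:

- `decision`: the overall verdict.
- `best_candidate_id`: the most similar existing lyric.
- `best_candidate_jaccard`: literal n-gram similarity.
- `best_candidate_line_coverage`: line-level coverage.
- `matched_unique_lines`: the normalized lyric lines that matched.
- `best_candidate_reason`: the reason for the verdict (emitted in Chinese), explaining why the lyric was judged duplicate, review, or new.

Recommended production handling: `duplicate` can be blocked automatically; `review` goes to the manual review pool; `new` can still be spot-checked before ingestion.
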
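The handling described above can be sketched as a small post-processing step over the batch-check CSV. This is only an illustration, not part of the project: the function `route_rows` and the queue names are hypothetical, and only the `record_id` and `decision` columns are taken from the CLI output.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping from a decision value to a downstream queue name.
ROUTES = {"duplicate": "auto_block", "review": "manual_review", "new": "ingest"}


def route_rows(csv_text: str) -> dict[str, list[str]]:
    """Group record ids by the queue that their `decision` maps to."""
    queues: dict[str, list[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Unknown decision values are sent to humans rather than dropped.
        queue = ROUTES.get(row["decision"], "manual_review")
        queues[queue].append(row["record_id"])
    return dict(queues)


sample = "record_id,decision\na.lrc,duplicate\nb.lrc,review\nc.txt,new\n"
print(route_rows(sample))
```

Sending unrecognized decision values to the manual queue is a conservative choice made for this sketch; a real pipeline might prefer to fail loudly.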
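## Safeguards for lyrics with original text plus Chinese translation

Lyrics are currently split into three kinds of lines:

- `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these.
- `translation_lines`: Chinese translation lines, used only for recall and for explaining review decisions.
- `unknown_lines`: lines that cannot be classified reliably.

High-confidence splits:

- A foreign-language line and a Chinese line share the same timestamp.
- Several stable groups of alternating foreign-language and Chinese lines.

Medium-confidence splits:

- An obvious foreign-language / Chinese translation pair within a single line, for example `I miss you / 今晚我想你`.

Low-confidence splits:

- A whole block of foreign-language text followed by a whole block of Chinese translation.

Decision policy:

- If the original text is highly consistent, the lyric can be `duplicate` even when the new lyric adds a Chinese translation.
- If only the translation lines are similar and the original text is not similar enough, the result can only be `review`; it is never automatically judged a duplicate.
- A suspected block-translation layout is a low-confidence split, so it goes to `review` first even when the original-text hash matches.
- For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text.

Because the index stores the split original/translation features, it must be rebuilt whenever the split rules change.
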
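The decision policy above can be sketched as a pure function over a few signals. This is a simplified illustration under stated assumptions: the `Signals` type and `decide` function are hypothetical, the numeric thresholds are the default values that `DuplicateChecker` declares in this commit (0.78 / 0.72 for duplicate, 0.45 / 0.35 for review), and the real checker weighs additional signals such as chorus-only matches and full-text similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    primary_jaccard: float      # n-gram similarity of the original-text lines
    primary_coverage: float     # share of original-text lines that match
    translation_only: bool      # only the translation lines are similar
    low_confidence_split: bool  # a block-translation layout is suspected


def decide(s: Signals) -> str:
    """Simplified policy: translation-only and low-confidence cases never auto-block."""
    strong = s.primary_jaccard >= 0.78 and s.primary_coverage >= 0.72
    if strong and not s.translation_only and not s.low_confidence_split:
        return "duplicate"
    weak = s.primary_jaccard >= 0.45 or s.primary_coverage >= 0.35
    if weak or s.translation_only or s.low_confidence_split:
        return "review"
    return "new"


print(decide(Signals(0.9, 0.9, translation_only=False, low_confidence_split=True)))
```

The point of the sketch is the ordering: the two guard flags are checked before a strong match is allowed to become `duplicate`, so they can only downgrade a result to `review`.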
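## Evaluate accuracy with a labeled CSV

You can start by generating a batch of evaluation samples automatically from the existing library:

```bash
python -m lyric_dedup.cli generate-eval-set \
  --library-dir data/library \
  --lyrics-dir data/generated_eval/incoming \
  --csv data/generated_eval/eval_10.csv \
  --size 10 \
  --positive-ratio 0.6
```

Business rules used by the generator:

- `应去重` (should dedupe) samples are generated only as style variations of full-song lyrics, for example timestamps, punctuation, platform noise, blank lines, LRC styling, or an appended Chinese translation.
- `不应去重` (should not dedupe) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity.
- A lyric fragment should never produce `duplicate`, even if it matches part of an existing song; at most it goes to `review`.

To evaluate your own labels, first prepare a CSV, for example `data/eval/eval.csv`:

```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```

Instead of a file path, you can also put the lyrics directly in a `lyrics` column:

```csv
id,lyrics,expected
case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate
case-004,"南方的雨穿过街心\n你把故事说给云听",new
```

`expected` accepts the following spellings:

- Should dedupe: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes`
- Should not dedupe: `不应去重`, `不重复`, `new`, `0`, `false`, `no`
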
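The accepted spellings above amount to a small lookup. The sketch below shows one way such a label could be normalized to a boolean; `parse_expected` is a hypothetical helper, and trimming whitespace and ignoring letter case are assumptions of this sketch rather than documented behavior of `evaluate-csv`.

```python
# Label spellings copied from the list above.
POSITIVE = {"应去重", "重复", "duplicate", "1", "true", "yes"}
NEGATIVE = {"不应去重", "不重复", "new", "0", "false", "no"}


def parse_expected(value: str) -> bool:
    """Return True for a should-dedupe label and False for a should-not-dedupe label."""
    label = value.strip().lower()  # assumption: labels are matched case-insensitively
    if label in POSITIVE:
        return True
    if label in NEGATIVE:
        return False
    raise ValueError(f"unsupported expected label: {value!r}")


print(parse_expected("应去重"), parse_expected("new"))
```

Raising on an unknown label, instead of guessing, keeps a mistyped annotation from silently counting as a negative sample.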
Run the evaluation:

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/lyrics.pkl \
  --csv data/eval/eval.csv \
  --base-dir data \
  --out outputs/results/eval_result.csv
```

By default, only a system output of `duplicate` counts as "predicted should dedupe". This setting suits measuring the precision of automatic blocking, because false blocks stand out.

To measure recall of suspicious samples instead, that is, to count both `duplicate` and `review` as hits:

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/lyrics.pkl \
  --csv data/eval/eval.csv \
  --base-dir data \
  --positive-decisions duplicate,review \
  --out outputs/results/eval_result_review_as_positive.csv
```

Each run writes two files. For the first command above they are:

- `outputs/results/eval_result.csv`: the prediction, candidates, reason, and correctness of every sample.
- `outputs/results/eval_result.csv.summary.json`: the overall metrics.

Key fields in the summary:

- `accuracy`: overall accuracy.
- `precision`: of the samples predicted as should-dedupe, the share that truly should be deduped. Look at this first for automatic blocking.
- `recall`: of the samples that truly should be deduped, the share the system caught.
- `f1`: the combined measure (harmonic mean) of precision and recall.
- `false_positive`: should not be deduped but was judged should-dedupe; a false block.
- `false_negative`: should be deduped but was not caught; a miss.
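
As a worked illustration of how the summary fields relate to each other, the sketch below derives the ratios from the four confusion counts. The function name `summarize`, the sample counts, and the choice to report 0.0 when a denominator is zero are all assumptions of this sketch; the tool's own summary may handle those edge cases differently.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
print(summarize(tp=50, fp=2, tn=38, fn=10))
```

With counts like these, precision stays high while recall drops, which is the typical shape of the strict `duplicate`-only setting.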
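# Lyric Duplicate Check Test Workflow

This document records the complete commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and inspecting the results.

## 1. Prepare directories

The existing library goes in:

```text
data/library/
```

Supported files:

```text
.lrc
.txt
```

Generated test samples are written to:

```text
data/generated_eval/incoming/
```

The labeled test-set CSV is written to:

```text
data/generated_eval/eval_500.csv
```

Evaluation results are written to:

```text
outputs/results/
```
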
## 2. Build the index for the existing library

If you have just added a batch of samples to `data/library`, run the processing script first:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --index outputs/indexes/library_lyrics.pkl
```

The script will:

```text
1. Scan for and quarantine instrumental placeholder samples, for example files containing 【曲库专用】 ("library use only") or “此歌曲为没有填词的纯音乐” ("this song is instrumental music without lyrics").
2. Rebuild outputs/indexes/library_lyrics.pkl.
3. Write the processing report to outputs/results/library_process_report.json.
```

To preview which files would be handled, without actually moving files or rebuilding the index:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --dry-run
```

To also generate and evaluate 1180 test samples in the same run:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --index outputs/indexes/library_lyrics.pkl \
  --eval-size 1180 \
  --positive-ratio 0.2 \
  --eval-csv data/generated_eval/eval_1180.csv \
  --eval-out outputs/results/library_eval_1180.csv
```

Quarantined files are moved by default to:

```text
data/quarantine/no_lyrics_placeholders/
```

You can also build the index manually, without the processing script:

```bash
python -m lyric_dedup.cli build-index \
  --lyrics-dir data/library \
  --index outputs/indexes/library_lyrics.pkl
```

Index file:

```text
outputs/indexes/library_lyrics.pkl
```

Note: rerun this step whenever `data/library` changes or the preprocessing / duplicate-decision logic is modified.

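## 3. Generate 500 test samples

```bash
python -m lyric_dedup.cli generate-eval-set \
  --library-dir data/library \
  --lyrics-dir data/generated_eval/incoming \
  --csv data/generated_eval/eval_500.csv \
  --size 500 \
  --positive-ratio 0.2
```

With the CLI defaults (`--size 100`, `--positive-ratio 0.6`) the generator produces:

```text
应去重 (should dedupe): 60
不应去重 (should not dedupe): 40
```

The command above instead requests 500 samples with a positive ratio of 0.2, which corresponds to about 100 should-dedupe and 400 should-not-dedupe samples.

Before writing new samples, the generator first removes the old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`.

Business rules:

```text
pos_* = should dedupe: full-song lyrics with style variations
neg_* = should not dedupe: fragments / short-phrase collisions / mixed fragments / new lyrics on the same theme / translation-only similarity
```
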
## 4. Strict evaluation: only `duplicate` counts as dedupe

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/library_lyrics.pkl \
  --csv data/generated_eval/eval_500.csv \
  --base-dir data/generated_eval \
  --out outputs/results/library_eval_500.csv
```

Under this policy:

```text
duplicate -> predicted should dedupe
review    -> predicted should not dedupe
new       -> predicted should not dedupe
```

This suits evaluating the precision of automatic blocking. Focus on:

```text
false_positive
```

## 5. Recall evaluation: both `duplicate` and `review` count as catching a suspicious sample

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/library_lyrics.pkl \
  --csv data/generated_eval/eval_500.csv \
  --base-dir data/generated_eval \
  --positive-decisions duplicate,review \
  --out outputs/results/library_eval_500_review_positive.csv
```

Under this policy:

```text
duplicate -> predicted should dedupe
review    -> predicted should dedupe
new       -> predicted should not dedupe
```

This suits evaluating recall of suspicious samples. Focus on:

```text
false_negative
```

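## 6. View the overall metrics

Strict policy:

```bash
cat outputs/results/library_eval_500.csv.summary.json
```

Recall policy:

```bash
cat outputs/results/library_eval_500_review_positive.csv.summary.json
```

Meaning of the metrics:

```text
accuracy        overall accuracy
precision       of the samples predicted should-dedupe, the share that truly should be deduped
recall          of the samples that truly should be deduped, the share the system caught
f1              combined measure (harmonic mean) of precision and recall
true_positive   should dedupe, predicted should dedupe
false_positive  should not dedupe but predicted should dedupe (false block)
true_negative   should not dedupe, predicted should not dedupe
false_negative  should dedupe but predicted should not dedupe (miss)
```
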
## 7. View per-sample results

```bash
open outputs/results/library_eval_500.csv
```

If `open` is not available (it is a macOS command), print the CSV directly:

```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]'
```

## 8. View failed samples

Failed samples under the strict policy:

```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]'
```

View the full candidate list for a single sample:

```bash
python -m lyric_dedup.cli check-file \
  --index outputs/indexes/library_lyrics.pkl \
  --file data/generated_eval/incoming/neg_068_mixed_fragments.txt \
  --max-candidates 10
```

## 9. Check the test-set distribution

```bash
python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_500.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))'
```

Check the number of files in the generated directory:

```bash
find data/generated_eval/incoming -type f | wc -l
```

## 10. Run the code tests

```bash
python -m pytest tests
```

Compile check:

```bash
python -m compileall -q lyric_dedup tests
```

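## 11. About uniqueness within the test set

The automatically generated set (100 samples by default) is a rule-coverage test set. It does not guarantee that samples are distinct from one another after normalization.

If the 100 test samples must be mutually distinct while still using the default ratio:

```text
size = 100
positive_ratio = 0.6
```

then you need at least:

```text
60 mutually distinct seed lyrics
```

Reason: should-dedupe samples are full-song variants, and several style variations of the same song still normalize to the same song.
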
A more reliable way to measure real-world accuracy is to prepare a manually labeled CSV:

```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```

Then run `evaluate-csv` directly as described in section 4 or section 5.
"""Lyric duplicate detection utilities."""

from lyric_dedup.checker import DuplicateCheckResult
from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.checker import DuplicateDecision
from lyric_dedup.checker import LyricRecord

__all__ = [
    "DuplicateCheckResult",
    "DuplicateChecker",
    "DuplicateDecision",
    "LyricRecord",
]
"""Incremental lyric duplicate checker."""

from __future__ import annotations

import hashlib
import pickle
from dataclasses import dataclass
from enum import StrEnum
from pathlib import Path

from lyric_dedup.minhash_lsh import MinHashConfig
from lyric_dedup.minhash_lsh import MinHashLSH
from lyric_dedup.normalization import NormalizedLyrics
from lyric_dedup.normalization import fingerprint_text
from lyric_dedup.normalization import lyric_tokens
from lyric_dedup.normalization import normalize_lyrics


class DuplicateDecision(StrEnum):
    DUPLICATE = "duplicate"
    REVIEW = "review"
    NEW = "new"


@dataclass(frozen=True)
class LyricRecord:
    record_id: str
    lyrics: str
    title: str | None = None
    artist: str | None = None


@dataclass(frozen=True)
class CandidateMatch:
    record_id: str
    decision: DuplicateDecision
    confidence: float
    jaccard: float
    line_coverage: float
    primary_jaccard: float
    primary_line_coverage: float
    translation_jaccard: float
    translation_line_coverage: float
    matched_unique_lines: tuple[str, ...]
    reason: str


@dataclass(frozen=True)
class DuplicateCheckResult:
    decision: DuplicateDecision
    confidence: float
    candidates: tuple[CandidateMatch, ...]
    normalized_full_text: str
    reason: str


@dataclass(frozen=True)
class _IndexedRecord:
    record: LyricRecord
    normalized: NormalizedLyrics
    exact_hash: str
    tokens: set[str]
    primary_tokens: set[str]
    translation_tokens: set[str]
    fallback_lines: tuple[str, ...]
    fallback_tokens: set[str]
    signature: tuple[int, ...]


class DuplicateChecker:
    """In-memory first version for checking newly submitted lyrics.

    The API is intentionally small: build or load records with ``add_record``, then
    call ``check`` for a new lyric. Persistence can serialize the indexed fields
    later without changing result semantics.
    """

    def __init__(
        self,
        *,
        minhash_config: MinHashConfig | None = None,
        duplicate_jaccard_threshold: float = 0.78,
        duplicate_line_coverage_threshold: float = 0.72,
        review_jaccard_threshold: float = 0.45,
        review_line_coverage_threshold: float = 0.35,
    ) -> None:
        self._lsh = MinHashLSH(minhash_config)
        self._records: dict[str, _IndexedRecord] = {}
        self._exact_hash_to_ids: dict[str, set[str]] = {}
        self._line_to_ids: dict[str, set[str]] = {}
        self._token_to_ids: dict[str, set[str]] = {}
        self.duplicate_jaccard_threshold = duplicate_jaccard_threshold
        self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold
        self.review_jaccard_threshold = review_jaccard_threshold
        self.review_line_coverage_threshold = review_line_coverage_threshold

    def add_record(self, record: LyricRecord) -> None:
        indexed = self._index(record)
        self._records[record.record_id] = indexed
        self._exact_hash_to_ids.setdefault(indexed.exact_hash, set()).add(record.record_id)
        for line in indexed.normalized.unique_lines:
            if len(line) >= 4:
                self._line_to_ids.setdefault(line, set()).add(record.record_id)
        for token in indexed.tokens:
            self._token_to_ids.setdefault(token, set()).add(record.record_id)
        for token in indexed.fallback_tokens:
            self._token_to_ids.setdefault(token, set()).add(record.record_id)
        self._lsh.add(record.record_id, indexed.signature)

    def save(self, path: str | Path) -> None:
        """Persist the in-memory index for later checks."""
        with Path(path).open("wb") as file:
            pickle.dump(self, file, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, path: str | Path) -> "DuplicateChecker":
        """Load a previously persisted index.

        Only load index files from a trusted source: unpickling can execute
        arbitrary code.
        """
        with Path(path).open("rb") as file:
            checker = pickle.load(file)
        if not isinstance(checker, cls):
            raise TypeError(f"{path} does not contain a DuplicateChecker index")
        return checker

    @property
    def record_count(self) -> int:
        return len(self._records)

    def check(self, lyrics: str, *, max_candidates: int = 10) -> DuplicateCheckResult:
        return self.check_record(LyricRecord(record_id="__query__", lyrics=lyrics), max_candidates=max_candidates)

    def check_record(self, record: LyricRecord, *, max_candidates: int = 10) -> DuplicateCheckResult:
        query = self._index(record)
        exact_ids = self._exact_hash_to_ids.get(query.exact_hash, set())
        if exact_ids:
            candidates = tuple(
                self._rank_exact_candidate(query, self._records[record_id])
                for record_id in sorted(exact_ids)[:max_candidates]
            )
            duplicate = next((candidate for candidate in candidates if candidate.decision == DuplicateDecision.DUPLICATE), None)
            if duplicate is not None:
                return DuplicateCheckResult(
                    decision=DuplicateDecision.DUPLICATE,
                    confidence=duplicate.confidence,
                    candidates=candidates,
                    normalized_full_text=query.normalized.normalized_full_text,
                    reason=duplicate.reason,
                )
            return DuplicateCheckResult(
                decision=DuplicateDecision.REVIEW,
                confidence=candidates[0].confidence,
                candidates=candidates,
                normalized_full_text=query.normalized.normalized_full_text,
                reason=candidates[0].reason,
            )

        candidate_ids = self._recall_candidates(query)
        ranked = sorted(
            (self._rank_candidate(query, self._records[record_id]) for record_id in candidate_ids),
            key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard),
            reverse=True,
        )[:max_candidates]

        duplicate = next((candidate for candidate in ranked if candidate.decision == DuplicateDecision.DUPLICATE), None)
        if duplicate is not None:
            return DuplicateCheckResult(
                decision=DuplicateDecision.DUPLICATE,
                confidence=duplicate.confidence,
                candidates=tuple(ranked),
                normalized_full_text=query.normalized.normalized_full_text,
                reason=duplicate.reason,
            )

        review = next((candidate for candidate in ranked if candidate.decision == DuplicateDecision.REVIEW), None)
        if review is not None:
            return DuplicateCheckResult(
                decision=DuplicateDecision.REVIEW,
                confidence=review.confidence,
                candidates=tuple(ranked),
                normalized_full_text=query.normalized.normalized_full_text,
                reason=review.reason,
            )

        return DuplicateCheckResult(
            decision=DuplicateDecision.NEW,
            confidence=1.0 - (ranked[0].confidence if ranked else 0.0),
            candidates=tuple(ranked),
            normalized_full_text=query.normalized.normalized_full_text,
            reason="精确匹配、近重复召回和字面重合信号都较低",
        )

    def _index(self, record: LyricRecord) -> _IndexedRecord:
        normalized = normalize_lyrics(record.lyrics)
        tokens = lyric_tokens(normalized)
        primary_tokens = lyric_tokens(normalized, lines=normalized.primary_lines)
        translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines)
        fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics))
        fallback_tokens = set(fallback_lines)
        signature = self._lsh.signature(primary_tokens or tokens or fallback_tokens)
        exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest()
        return _IndexedRecord(
            record=record,
            normalized=normalized,
            exact_hash=exact_hash,
            tokens=tokens,
            primary_tokens=primary_tokens,
            translation_tokens=translation_tokens,
            fallback_lines=fallback_lines,
            fallback_tokens=fallback_tokens,
            signature=signature,
        )

    def _recall_candidates(self, query: _IndexedRecord) -> set[str]:
        # Copy the LSH result so the updates below never mutate a set owned by the index.
        candidate_ids = set(self._lsh.query(query.signature))
        for line in query.normalized.primary_lines:
            if len(line) >= 4:
                candidate_ids.update(self._line_to_ids.get(line, set()))
        for line in query.normalized.translation_lines:
            if len(line) >= 4:
                candidate_ids.update(self._line_to_ids.get(line, set()))
        for token in query.primary_tokens or query.tokens:
            candidate_ids.update(self._token_to_ids.get(token, set()))
        for token in query.translation_tokens:
            candidate_ids.update(self._token_to_ids.get(token, set()))
        for token in query.fallback_tokens:
            candidate_ids.update(self._token_to_ids.get(token, set()))
        return candidate_ids

    def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch:
        low_confidence_split = (
            query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low"
        )
        translation_jaccard = _jaccard(query.translation_tokens, candidate.translation_tokens)
        translation_coverage, _ = _line_coverage_lines(
            query.normalized.translation_lines,
            candidate.normalized.translation_lines,
        )
        no_effective_lyrics = not query.normalized.primary_lines and not candidate.normalized.primary_lines
        if no_effective_lyrics:
            decision = DuplicateDecision.DUPLICATE
            confidence = 1.0
            reason = "无有效歌词,使用文件内容兜底指纹命中"
        elif low_confidence_split:
            decision = DuplicateDecision.REVIEW
            confidence = 0.95
            reason = "原文哈希一致,但疑似整段翻译结构拆分置信度较低,需要人工复核"
        elif query.normalized.translation_lines or candidate.normalized.translation_lines:
            decision = DuplicateDecision.DUPLICATE
            confidence = 1.0
            reason = "规范化后的原文歌词哈希完全一致,翻译行未参与自动判重"
        else:
            decision = DuplicateDecision.DUPLICATE
            confidence = 1.0
            reason = "规范化后的原文歌词哈希完全一致"
        return CandidateMatch(
            record_id=candidate.record.record_id,
            decision=decision,
            confidence=confidence,
            jaccard=1.0,
            line_coverage=1.0,
            primary_jaccard=1.0,
            primary_line_coverage=1.0,
            translation_jaccard=round(translation_jaccard, 4),
            translation_line_coverage=round(translation_coverage, 4),
            matched_unique_lines=query.normalized.primary_lines,
            reason=reason,
        )

    def _rank_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch:
        if not query.normalized.primary_lines or not candidate.normalized.primary_lines:
            return _rank_no_effective_lyrics_candidate(query, candidate)

        jaccard = _jaccard(query.tokens, candidate.tokens)
        coverage, matched_lines = _line_coverage(query.normalized, candidate.normalized)
        primary_jaccard = _jaccard(query.primary_tokens, candidate.primary_tokens)
        primary_coverage, primary_matched_lines = _line_coverage_lines(
            query.normalized.primary_lines,
            candidate.normalized.primary_lines,
        )
        translation_jaccard = _jaccard(query.translation_tokens, candidate.translation_tokens)
        translation_coverage, translation_matched_lines = _line_coverage_lines(
            query.normalized.translation_lines,
            candidate.normalized.translation_lines,
        )
        chorus_only = _is_chorus_only_match(query.normalized, candidate.normalized, primary_matched_lines)
        translation_only = (
            bool(translation_matched_lines)
            and primary_jaccard < self.review_jaccard_threshold
            and primary_coverage < self.review_line_coverage_threshold
            and (translation_jaccard >= self.review_jaccard_threshold or translation_coverage >= self.review_line_coverage_threshold)
        )
        low_confidence_split = (
            query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low"
        )

        confidence = round((0.58 * primary_jaccard) + (0.42 * primary_coverage), 4)
        if (
            (primary_jaccard >= self.duplicate_jaccard_threshold or (primary_jaccard >= 0.78 and primary_coverage >= 0.9))
            and primary_coverage >= self.duplicate_line_coverage_threshold
            and not chorus_only
            and not translation_only
            and not low_confidence_split
        ):
            decision = DuplicateDecision.DUPLICATE
            if query.normalized.translation_lines or candidate.normalized.translation_lines:
                reason = "原文歌词高度一致,翻译行未参与自动判重"
            else:
                reason = "原文 n-gram 字面相似度高,且行级覆盖范围广"
        elif (
            chorus_only
            or translation_only
            or low_confidence_split
            or primary_jaccard >= self.review_jaccard_threshold
            or primary_coverage >= self.review_line_coverage_threshold
            or jaccard >= self.review_jaccard_threshold
            or coverage >= self.review_line_coverage_threshold
        ):
            decision = DuplicateDecision.REVIEW
            reason = "候选相似度达到复核阈值,需要人工确认"
            if chorus_only:
                reason = "重合内容主要集中在重复副歌行,不自动判重"
            elif translation_only:
                reason = "仅翻译行相似,原文字面重合不足,不自动判重"
            elif low_confidence_split:
                reason = "疑似整段翻译结构但拆分置信度较低,需要人工复核"
        else:
            decision = DuplicateDecision.NEW
            reason = "候选重合度低于复核阈值"

        return CandidateMatch(
            record_id=candidate.record.record_id,
            decision=decision,
            confidence=confidence,
            jaccard=round(jaccard, 4),
            line_coverage=round(coverage, 4),
            primary_jaccard=round(primary_jaccard, 4),
            primary_line_coverage=round(primary_coverage, 4),
            translation_jaccard=round(translation_jaccard, 4),
            translation_line_coverage=round(translation_coverage, 4),
            matched_unique_lines=tuple(matched_lines),
            reason=reason,
        )


def _rank_no_effective_lyrics_candidate(query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch:
    fallback_jaccard = _jaccard(query.fallback_tokens, candidate.fallback_tokens)
    fallback_coverage, matched_lines = _line_coverage_lines(query.fallback_lines, candidate.fallback_lines)
    if fallback_jaccard >= 0.35 and fallback_coverage >= 0.35 and len(matched_lines) >= 2:
        return CandidateMatch(
            record_id=candidate.record.record_id,
            decision=DuplicateDecision.DUPLICATE,
            confidence=round((0.58 * fallback_jaccard) + (0.42 * fallback_coverage), 4),
            jaccard=round(fallback_jaccard, 4),
            line_coverage=round(fallback_coverage, 4),
            primary_jaccard=0.0,
            primary_line_coverage=0.0,
            translation_jaccard=0.0,
            translation_line_coverage=0.0,
            matched_unique_lines=tuple(matched_lines),
            reason="无有效歌词,文件内容兜底特征高度相似",
        )
    if fallback_jaccard >= 0.2 or fallback_coverage >= 0.2:
        return CandidateMatch(
            record_id=candidate.record.record_id,
            decision=DuplicateDecision.REVIEW,
            confidence=round((0.58 * fallback_jaccard) + (0.42 * fallback_coverage), 4),
            jaccard=round(fallback_jaccard, 4),
            line_coverage=round(fallback_coverage, 4),
            primary_jaccard=0.0,
            primary_line_coverage=0.0,
            translation_jaccard=0.0,
            translation_line_coverage=0.0,
            matched_unique_lines=tuple(matched_lines),
            reason="无有效歌词,文件内容兜底特征部分相似,需要人工复核",
        )
    return CandidateMatch(
        record_id=candidate.record.record_id,
        decision=DuplicateDecision.NEW,
        confidence=0.0,
        jaccard=round(fallback_jaccard, 4),
        line_coverage=round(fallback_coverage, 4),
        primary_jaccard=0.0,
        primary_line_coverage=0.0,
        translation_jaccard=0.0,
        translation_line_coverage=0.0,
        matched_unique_lines=(),
        reason="无有效歌词,且文件内容兜底特征未命中",
    )


def _jaccard(left: set[str], right: set[str]) -> float:
    if not left and not right:
        return 1.0
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def _exact_fingerprint(normalized: NormalizedLyrics, fallback_lines: tuple[str, ...]) -> str:
    primary_text = fingerprint_text(normalized)
    if primary_text:
        return f"lyrics|{primary_text}"
    return "no_effective_lyrics_content|" + "\n".join(fallback_lines)


def _fallback_no_lyrics_lines(text: str) -> list[str]:
    import re
    import unicodedata

    lines: list[str] = []
    for raw_line in unicodedata.normalize("NFKC", text).splitlines():
        line = raw_line.strip().lower()
        line = re.sub(r"\[(?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]", "", line)
        line = re.sub(r"[【\[].{0,80}?[】\]]", "", line)
        if "歌词来自" in line or "qq音乐" in line or "网易云" in line or "酷狗" in line:
            continue
        if "未经" in line or "不得翻唱" in line or "不得翻录" in line or "著作权" in line:
            continue
        punctuation = ",。!?;:、“”‘’·…—~!¥()【】《》〈〉「」『』﹏,.;:!?()[]{}<>|/\\_-"
        line = "".join(" " if char in punctuation else char for char in line)
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return list(dict.fromkeys(lines))


def _line_coverage(left: NormalizedLyrics, right: NormalizedLyrics) -> tuple[float, list[str]]:
    return _line_coverage_lines(left.unique_lines, right.unique_lines)


def _line_coverage_lines(left: tuple[str, ...], right: tuple[str, ...]) -> tuple[float, list[str]]:
    left_lines = set(left)
    right_lines = set(right)
    if not left_lines and not right_lines:
        return 1.0, []
    if not left_lines or not right_lines:
        return 0.0, []
    matched = sorted(left_lines & right_lines)
    return len(matched) / max(len(left_lines), len(right_lines)), matched


def _is_chorus_only_match(left: NormalizedLyrics, right: NormalizedLyrics, matched_lines: list[str]) -> bool:
    if not matched_lines:
        return False
    matched = set(matched_lines)
    repeated_matches = [
        line
        for line in matched
        if left.line_counts.get(line, 0) >= 2 or right.line_counts.get(line, 0) >= 2
    ]
    if len(matched) <= 2 and repeated_matches:
        return True
    if repeated_matches and len(repeated_matches) / len(matched) >= 0.8:
        matched_ratio_left = sum(left.line_counts.get(line, 0) for line in matched) / max(left.content_line_count, 1)
        matched_ratio_right = sum(right.line_counts.get(line, 0) for line in matched) / max(right.content_line_count, 1)
        return min(matched_ratio_left, matched_ratio_right) < 0.7
    return False
"""Command line tools for lyric duplicate checking."""

from __future__ import annotations

import argparse
import csv
import json
from pathlib import Path

from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.checker import LyricRecord
from lyric_dedup.eval_dataset import generate_eval_set
from lyric_dedup.file_import import iter_lyric_files
from lyric_dedup.file_import import read_lyric_file
from lyric_dedup.file_import import record_from_file
from lyric_dedup.file_import import records_from_dir


def main() -> None:
    parser = argparse.ArgumentParser(prog="lyric-dedup")
    subparsers = parser.add_subparsers(dest="command", required=True)

    build = subparsers.add_parser("build-index", help="build an index from .lrc/.txt files")
    build.add_argument("--lyrics-dir", required=True)
    build.add_argument("--index", required=True)

    check = subparsers.add_parser("check-file", help="check one .lrc/.txt file against an index")
    check.add_argument("--index", required=True)
    check.add_argument("--file", required=True)
    check.add_argument("--max-candidates", type=int, default=10)

    batch = subparsers.add_parser("batch-check", help="check a directory of .lrc/.txt files against an index")
    batch.add_argument("--index", required=True)
    batch.add_argument("--lyrics-dir", required=True)
    batch.add_argument("--out", required=True)
    batch.add_argument("--max-candidates", type=int, default=5)

    evaluate = subparsers.add_parser("evaluate-csv", help="evaluate labeled duplicate samples from a CSV file")
    evaluate.add_argument("--index", required=True)
    evaluate.add_argument("--csv", required=True)
    evaluate.add_argument("--out", required=True)
    evaluate.add_argument("--base-dir", default="")
    evaluate.add_argument("--positive-decisions", default="duplicate")
    evaluate.add_argument("--max-candidates", type=int, default=5)

    generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library")
    generate.add_argument("--library-dir", required=True)
    generate.add_argument("--lyrics-dir", required=True)
    generate.add_argument("--csv", required=True)
    generate.add_argument("--size", type=int, default=100)
    generate.add_argument("--positive-ratio", type=float, default=0.6)
    generate.add_argument("--seed", type=int, default=20260602)

    args = parser.parse_args()
    if args.command == "build-index":
        build_index(Path(args.lyrics_dir), Path(args.index))
    elif args.command == "check-file":
        check_file(Path(args.index), Path(args.file), args.max_candidates)
    elif args.command == "batch-check":
        batch_check(Path(args.index), Path(args.lyrics_dir), Path(args.out), args.max_candidates)
    elif args.command == "evaluate-csv":
        evaluate_csv(
            Path(args.index),
            Path(args.csv),
            Path(args.out),
            base_dir=Path(args.base_dir) if args.base_dir else None,
            positive_decisions={item.strip() for item in args.positive_decisions.split(",") if item.strip()},
            max_candidates=args.max_candidates,
        )
    elif args.command == "generate-eval-set":
        summary = generate_eval_set(
            library_dir=Path(args.library_dir),
            output_dir=Path(args.lyrics_dir),
            csv_path=Path(args.csv),
            size=args.size,
            positive_ratio=args.positive_ratio,
            seed=args.seed,
        )
        print(json.dumps(summary, ensure_ascii=False))


def build_index(lyrics_dir: Path, index_path: Path) -> None:
    checker = DuplicateChecker()
    records = records_from_dir(lyrics_dir)
    for record in records:
        checker.add_record(record)
    index_path.parent.mkdir(parents=True, exist_ok=True)
    checker.save(index_path)
    print(json.dumps({"indexed": checker.record_count, "index": str(index_path)}, ensure_ascii=False))


def check_file(index_path: Path, file_path: Path, max_candidates: int) -> None:
    checker = DuplicateChecker.load(index_path)
    record = record_from_file(file_path)
    result = checker.check_record(record, max_candidates=max_candidates)
    print(json.dumps(_result_to_dict(result, source=str(file_path)), ensure_ascii=False, indent=2))


def batch_check(index_path: Path, lyrics_dir: Path, out_path: Path, max_candidates: int) -> None:
    checker = DuplicateChecker.load(index_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows: list[dict[str, object]] = []
    for path in iter_lyric_files(lyrics_dir):
        record = record_from_file(path, base_dir=lyrics_dir)
        result = checker.check_record(record, max_candidates=max_candidates)
        best = result.candidates[0] if result.candidates else None
        rows.append(
            {
                "source": str(path),
                "record_id": record.record_id,
                "decision": result.decision.value,
                "confidence": result.confidence,
                "reason": result.reason,
                "best_candidate_id": best.record_id if best else "",
                "best_candidate_decision": best.decision.value if best else "",
                "best_candidate_confidence": best.confidence if best else "",
                "best_candidate_jaccard": best.jaccard if best else "",
                "best_candidate_line_coverage": best.line_coverage if best else "",
                "best_candidate_primary_jaccard": best.primary_jaccard if best else "",
                "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "",
                "best_candidate_translation_jaccard": best.translation_jaccard if best else "",
                "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "",
                "best_candidate_reason": best.reason if best else "",
                "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "",
            }
        )

    if out_path.suffix.lower() == ".jsonl":
        with out_path.open("w", encoding="utf-8") as file:
            for row in rows:
                file.write(json.dumps(row, ensure_ascii=False) + "\n")
    else:
        with out_path.open("w", encoding="utf-8", newline="") as file:
            writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["source"])
            writer.writeheader()
            writer.writerows(rows)
    summary = {
        "checked": len(rows),
        "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"),
        "review": sum(1 for row in rows if row["decision"] == "review"),
        "new": sum(1 for row in rows if row["decision"] == "new"),
        "out": str(out_path),
    }
    print(json.dumps(summary, ensure_ascii=False))


147 def evaluate_csv(
148 index_path: Path,
149 csv_path: Path,
150 out_path: Path,
151 *,
152 base_dir: Path | None,
153 positive_decisions: set[str],
154 max_candidates: int,
155 ) -> None:
156 checker = DuplicateChecker.load(index_path)
157 rows: list[dict[str, object]] = []
158 with csv_path.open(encoding="utf-8-sig", newline="") as file:
159 reader = csv.DictReader(file)
160 if reader.fieldnames is None:
161 raise ValueError("evaluation CSV must have a header row")
162 for row_number, row in enumerate(reader, start=2):
163 sample_id = row.get("id") or row.get("sample_id") or str(row_number)
164 record, source = _record_from_eval_row(row, csv_path=csv_path, base_dir=base_dir)
165 expected_duplicate = _parse_expected(row.get("expected") or row.get("label") or row.get("target"))
166 result = checker.check_record(record, max_candidates=max_candidates)
167 predicted_duplicate = result.decision.value in positive_decisions
168 best = result.candidates[0] if result.candidates else None
169 rows.append(
170 {
171 "id": sample_id,
172 "source": source,
173 "expected_duplicate": expected_duplicate,
174 "decision": result.decision.value,
175 "predicted_duplicate": predicted_duplicate,
176 "correct": expected_duplicate == predicted_duplicate,
177 "confidence": result.confidence,
178 "reason": result.reason,
179 "best_candidate_id": best.record_id if best else "",
180 "best_candidate_decision": best.decision.value if best else "",
181 "best_candidate_confidence": best.confidence if best else "",
182 "best_candidate_jaccard": best.jaccard if best else "",
183 "best_candidate_line_coverage": best.line_coverage if best else "",
184 "best_candidate_primary_jaccard": best.primary_jaccard if best else "",
185 "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "",
186 "best_candidate_translation_jaccard": best.translation_jaccard if best else "",
187 "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "",
188 "best_candidate_reason": best.reason if best else "",
189 "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "",
190 }
191 )
192
193 out_path.parent.mkdir(parents=True, exist_ok=True)
194 with out_path.open("w", encoding="utf-8", newline="") as file:
195 writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["id"])
196 writer.writeheader()
197 writer.writerows(rows)
198
199 summary = _evaluation_summary(rows, positive_decisions=positive_decisions, out_path=out_path)
200 summary_path = out_path.with_suffix(out_path.suffix + ".summary.json")
201 summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8")
202 print(json.dumps(summary, ensure_ascii=False))
203
204
205 def _result_to_dict(result, *, source: str) -> dict[str, object]:
206 return {
207 "source": source,
208 "decision": result.decision.value,
209 "confidence": result.confidence,
210 "reason": result.reason,
211 "candidates": [
212 {
213 "record_id": candidate.record_id,
214 "decision": candidate.decision.value,
215 "confidence": candidate.confidence,
216 "jaccard": candidate.jaccard,
217 "line_coverage": candidate.line_coverage,
218 "primary_jaccard": candidate.primary_jaccard,
219 "primary_line_coverage": candidate.primary_line_coverage,
220 "translation_jaccard": candidate.translation_jaccard,
221 "translation_line_coverage": candidate.translation_line_coverage,
222 "reason": candidate.reason,
223 "matched_unique_lines": list(candidate.matched_unique_lines),
224 }
225 for candidate in result.candidates
226 ],
227 }
228
229
230 def _lyrics_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[str, str]:
231 lyrics = (row.get("lyrics") or "").strip()
232 if lyrics:
233 return lyrics.replace("\\n", "\n"), "inline"
234
235 file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip()
236 if not file_value:
237 raise ValueError("each evaluation CSV row needs lyrics, or a file/path/source file path")
238
239 file_path = Path(file_value)
240 if not file_path.is_absolute():
241 file_path = (base_dir or csv_path.parent) / file_path
242 return read_lyric_file(file_path), str(file_path)
243
244
245 def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None):
246 lyrics = (row.get("lyrics") or "").strip()
247 if lyrics:
248 return (
249 LyricRecord(
250 record_id=row.get("id") or row.get("sample_id") or "__eval__",
251 lyrics=lyrics.replace("\\n", "\n"),
252 title=row.get("title") or None,
253 artist=row.get("artist") or None,
254 ),
255 "inline",
256 )
257
258 file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip()
259 if not file_value:
260 raise ValueError("each evaluation CSV row needs lyrics, or a file/path/source file path")
261
262 file_path = Path(file_value)
263 if not file_path.is_absolute():
264 file_path = (base_dir or csv_path.parent) / file_path
265 record = record_from_file(file_path)
266 if row.get("title") or row.get("artist"):
267 record = LyricRecord(
268 record_id=record.record_id,
269 lyrics=record.lyrics,
270 title=row.get("title") or record.title,
271 artist=row.get("artist") or record.artist,
272 )
273 return record, str(file_path)
274
275
276 def _parse_expected(value: str | None) -> bool:
277 if value is None:
278 raise ValueError("each evaluation CSV row needs an expected/label/target column")
279 normalized = value.strip().lower()
280 positives = {"1", "true", "yes", "y", "duplicate", "dup", "重复", "应去重", "去重", "是"}
281 negatives = {"0", "false", "no", "n", "new", "not_duplicate", "non_duplicate", "不重复", "不应去重", "新歌", "否"}
282 if normalized in positives:
283 return True
284 if normalized in negatives:
285 return False
286 raise ValueError(f"unrecognized expected value: {value!r}")
287
288
289 def _evaluation_summary(
290 rows: list[dict[str, object]],
291 *,
292 positive_decisions: set[str],
293 out_path: Path,
294 ) -> dict[str, object]:
295 tp = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is True)
296 fp = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is True)
297 tn = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is False)
298 fn = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is False)
299 total = len(rows)
300 precision = tp / (tp + fp) if tp + fp else 0.0
301 recall = tp / (tp + fn) if tp + fn else 0.0
302 accuracy = (tp + tn) / total if total else 0.0
303 f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
304 return {
305 "total": total,
306 "positive_decisions": sorted(positive_decisions),
307 "accuracy": round(accuracy, 4),
308 "precision": round(precision, 4),
309 "recall": round(recall, 4),
310 "f1": round(f1, 4),
311 "true_positive": tp,
312 "false_positive": fp,
313 "true_negative": tn,
314 "false_negative": fn,
315 "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"),
316 "review": sum(1 for row in rows if row["decision"] == "review"),
317 "new": sum(1 for row in rows if row["decision"] == "new"),
318 "out": str(out_path),
319 "summary": str(out_path.with_suffix(out_path.suffix + ".summary.json")),
320 }
321
322
323 if __name__ == "__main__":
324 main()
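The evaluation summary in `_evaluation_summary` reduces every row to one confusion-matrix cell and then derives precision, recall, accuracy and F1 with zero-division guards. The arithmetic can be sketched on its own as follows. `confusion_metrics` is a hypothetical helper written for illustration, not part of the package, and the labels are made up rather than real results.

```python
from __future__ import annotations


def confusion_metrics(pairs: list[tuple[bool, bool]]) -> dict[str, float | int]:
    """Compute metrics from (expected_duplicate, predicted_duplicate) pairs.

    Hypothetical helper: it mirrors the guards in the summary code, so an
    empty or one-sided input yields 0.0 instead of raising ZeroDivisionError.
    """
    tp = sum(1 for expected, predicted in pairs if expected and predicted)
    fp = sum(1 for expected, predicted in pairs if not expected and predicted)
    tn = sum(1 for expected, predicted in pairs if not expected and not predicted)
    fn = sum(1 for expected, predicted in pairs if expected and not predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "true_positive": tp,
        "false_positive": fp,
        "true_negative": tn,
        "false_negative": fn,
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative labels only: three expected duplicates and two new songs.
sample = [(True, True), (True, True), (True, False), (False, True), (False, False)]
metrics = confusion_metrics(sample)
print(metrics)
```

Which decisions count as a positive prediction is controlled by `positive_decisions`. Adding `review` to that set will usually trade precision for recall, so it is worth evaluating both configurations on the same CSV.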
1 """Generate labeled evaluation samples from an existing lyric library."""
2
3 from __future__ import annotations
4
5 import csv
6 import random
7 import re
8 from dataclasses import dataclass
9 from pathlib import Path
10
11 from lyric_dedup.file_import import iter_lyric_files
12 from lyric_dedup.file_import import read_lyric_file
13 from lyric_dedup.file_import import record_from_file
14 from lyric_dedup.normalization import normalize_lyrics
15
16
17 @dataclass(frozen=True)
18 class GeneratedSample:
19 sample_id: str
20 file: str
21 expected: str
22 sample_type: str
23 source: str
24 title: str = ""
25 artist: str = ""
26
27
28 def generate_eval_set(
29 *,
30 library_dir: Path,
31 output_dir: Path,
32 csv_path: Path,
33 size: int = 100,
34 positive_ratio: float = 0.6,
35 seed: int = 20260602,
36 ) -> dict[str, object]:
37 rng = random.Random(seed)
38 source_files = iter_lyric_files(library_dir)
39 if not source_files:
40 raise ValueError(f"no .lrc/.txt lyric files found under {library_dir}")
41
42 output_dir.mkdir(parents=True, exist_ok=True)
43 csv_path.parent.mkdir(parents=True, exist_ok=True)
44 _clean_generated_output_dir(output_dir)
45
46 positives = round(size * positive_ratio)
47 negatives = size - positives
48 samples: list[GeneratedSample] = []
49 for index in range(positives):
50 source = source_files[index % len(source_files)]
51 samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng))
52 for index in range(negatives):
53 left = source_files[index % len(source_files)]
54 right = source_files[(index + 1) % len(source_files)]
55 samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng))
56
57 rng.shuffle(samples)
58 with csv_path.open("w", encoding="utf-8", newline="") as file:
59 writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"])
60 writer.writeheader()
61 writer.writerows(
62 {
63 "id": sample.sample_id,
64 "file": sample.file,
65 "expected": sample.expected,
66 "sample_type": sample.sample_type,
67 "source": sample.source,
68 "title": sample.title,
69 "artist": sample.artist,
70 }
71 for sample in samples
72 )
73
74 return {
75 "size": size,
76 "positive": positives,
77 "negative": negatives,
78 "library_files": len(source_files),
79 "lyrics_dir": str(output_dir),
80 "csv": str(csv_path),
81 }
82
83
84 def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
85 raw = read_lyric_file(source)
86 source_record = record_from_file(source)
87 variants = [
88 ("exact_copy", raw),
89 ("timestamped", _add_timestamps(_content_lines(raw))),
90 ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)),
91 ("with_platform_noise", _with_platform_noise(_content_lines(raw))),
92 ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))),
93 ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))),
94 ("translation_added", _translation_added(_content_lines(raw))),
95 ]
96 sample_type, text = variants[(index - 1) % len(variants)]
97 name = f"pos_{index:03d}_{sample_type}.txt"
98 path = output_dir / name
99 path.write_text(text, encoding="utf-8")
100 return GeneratedSample(
101 sample_id=f"pos-{index:03d}",
102 file=str(path.relative_to(csv_base)),
103 expected="应去重",
104 sample_type=sample_type,
105 source=str(source),
106 title=source_record.title or "",
107 artist=source_record.artist or "",
108 )
109
110
111 def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
112 left_lines = _normalized_lines(left)
113 right_lines = _normalized_lines(right)
114 variants = [
115 ("single_song_fragment", _single_song_fragment(left_lines)),
116 ("short_shared_snippet", _short_shared_snippet(left_lines, rng)),
117 ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)),
118 ("same_theme_synthetic", _same_theme_synthetic(index)),
119 ("translation_only_like", _translation_only_like(left_lines)),
120 ]
121 sample_type, text = variants[(index - 1) % len(variants)]
122 name = f"neg_{index:03d}_{sample_type}.txt"
123 path = output_dir / name
124 path.write_text(text, encoding="utf-8")
125 return GeneratedSample(
126 sample_id=f"neg-{index:03d}",
127 file=str(path.relative_to(csv_base)),
128 expected="不应去重",
129 sample_type=sample_type,
130 source=f"{left} | {right}",
131 )
132
133
134 def _content_lines(text: str) -> list[str]:
135 lines = [line.strip() for line in text.splitlines() if line.strip()]
136 return lines or [text.strip()]
137
138
139 def _clean_generated_output_dir(output_dir: Path) -> None:
140 for path in output_dir.iterdir():
141 if path.is_file() and path.suffix.lower() in {".txt", ".lrc"}:
142 path.unlink()
143
144
145 def _normalized_lines(path: Path) -> list[str]:
146 normalized = normalize_lyrics(read_lyric_file(path))
147 return list(normalized.primary_lines or normalized.unique_lines)
148
149
150 def _add_timestamps(lines: list[str]) -> str:
151 return "\n".join(f"[{idx // 60:02d}:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1))
152
153
154 def _add_punctuation_noise(lines: list[str], rng: random.Random) -> str:
155 marks = ["!", "?", "...", ",", "。"]
156 return "\n".join(f"{line}{rng.choice(marks)}" for line in lines)
157
158
159 def _with_platform_noise(lines: list[str]) -> str:
160 return "\n".join(["歌词来自QQ音乐", "作词:测试", *lines, "未经著作权人许可 不得翻唱"])
161
162
163 def _add_blank_line_noise(lines: list[str]) -> str:
164 result: list[str] = []
165 for idx, line in enumerate(lines, start=1):
166 result.append(line)
167 if idx % 4 == 0:
168 result.append("")
169 return "\n".join(result)
170
171
172 def _translation_added(lines: list[str]) -> str:
173 result: list[str] = []
174 for idx, line in enumerate(lines, start=1):
175 result.append(line)
176 if _looks_foreign(line) and idx <= 24:
177 result.append(_pseudo_translation(idx))
178 return "\n".join(result)
179
180
181 def _single_song_fragment(lines: list[str]) -> str:
182 if len(lines) <= 4:
183 return "\n".join(lines[: max(1, len(lines) // 2)])
184 fragment_len = max(2, min(8, len(lines) // 4))
185 start = max(0, (len(lines) - fragment_len) // 2)
186 return "\n".join(lines[start : start + fragment_len])
187
188
189 def _short_shared_snippet(lines: list[str], rng: random.Random) -> str:
190 snippet = rng.sample(lines, k=min(2, len(lines))) if lines else []
191 synthetic = [
192 "清晨的风吹过新的街口",
193 "我把昨天放进安静的口袋",
194 *snippet,
195 "故事从这里重新开始",
196 "灯光落下我继续往前走",
197 ]
198 return "\n".join(synthetic)
199
200
201 def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str:
202 left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else []
203 right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else []
204 filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"]
205 return "\n".join([*left_pick, *filler, *right_pick])
206
207
208 def _same_theme_synthetic(index: int) -> str:
209 themes = [
210 "我在夜里想起远方的你",
211 "城市灯火陪我走过雨季",
212 "那些没说完的话留在风里",
213 "明天醒来我们各自继续",
214 f"这是第 {index} 个全新测试样本",
215 ]
216 return "\n".join(themes)
217
218
219 def _translation_only_like(lines: list[str]) -> str:
220 foreign_count = sum(1 for line in lines if _looks_foreign(line))
221 if foreign_count < 2:
222 return _same_theme_synthetic(foreign_count + len(lines))
223 return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1))
224
225
226 def _pseudo_translation(index: int) -> str:
227 translations = [
228 "今晚我仍然想念你",
229 "风会带走所有疲惫",
230 "黑暗里也会有光",
231 "别让昨天困住自己",
232 "我们终会继续向前",
233 "雨停以后天空会亮",
234 "把遗憾留在旧时光",
235 "你已经足够好了",
236 ]
237 return translations[(index - 1) % len(translations)]
238
239
240 def _looks_foreign(line: str) -> bool:
241 latin = len(re.findall(r"[A-Za-z]", line))
242 cjk = len(re.findall(r"[\u4e00-\u9fff]", line))
243 return latin > 0 and cjk == 0
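`generate_eval_set` splits `size` into positives and negatives with `round(size * positive_ratio)`, and the two sample builders pick a variant by rotating through their variant lists with the running index. A minimal sketch of that bookkeeping follows. `plan_samples` is a hypothetical helper that writes no files, and the variant names are copied from the code above.

```python
from __future__ import annotations

POSITIVE_VARIANTS = [
    "exact_copy",
    "timestamped",
    "punctuation_noise",
    "with_platform_noise",
    "blank_line_noise",
    "lrc_with_platform_noise",
    "translation_added",
]
NEGATIVE_VARIANTS = [
    "single_song_fragment",
    "short_shared_snippet",
    "mixed_fragments",
    "same_theme_synthetic",
    "translation_only_like",
]


def plan_samples(size: int = 100, positive_ratio: float = 0.6) -> list[tuple[str, str]]:
    """Return (sample_id, sample_type) pairs in generation order (hypothetical helper)."""
    positives = round(size * positive_ratio)
    negatives = size - positives
    plan: list[tuple[str, str]] = []
    for index in range(1, positives + 1):
        variant = POSITIVE_VARIANTS[(index - 1) % len(POSITIVE_VARIANTS)]
        plan.append((f"pos-{index:03d}", variant))
    for offset in range(negatives):
        # Negative indices continue after the positives, as in generate_eval_set.
        index = positives + offset + 1
        variant = NEGATIVE_VARIANTS[(index - 1) % len(NEGATIVE_VARIANTS)]
        plan.append((f"neg-{index:03d}", variant))
    return plan


plan = plan_samples(10, 0.6)
for sample_id, sample_type in plan:
    print(sample_id, sample_type)
```

Two details are easy to miss. Python's `round` rounds halves to the nearest even integer, so `size=5` with `positive_ratio=0.5` gives 2 positives rather than 3. Because negative indices start after the positives, the first negative variant depends on how many positives were generated.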
1 """Import LRC/TXT lyric files into records."""
2
3 from __future__ import annotations
4
5 import hashlib
6 from pathlib import Path
7
8 from lyric_dedup.checker import LyricRecord
9
10
11 SUPPORTED_SUFFIXES = {".lrc", ".txt"}
12
13
14 def iter_lyric_files(root: str | Path) -> list[Path]:
15 base = Path(root)
16 return sorted(
17 path
18 for path in base.rglob("*")
19 if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
20 )
21
22
23 def read_lyric_file(path: str | Path) -> str:
24 file_path = Path(path)
25 data = file_path.read_bytes()
26 for encoding in ("utf-8-sig", "utf-8", "gb18030", "big5"):
27 try:
28 return data.decode(encoding)
29 except UnicodeDecodeError:
30 continue
31 return data.decode("utf-8", errors="replace")
32
33
34 def record_from_file(path: str | Path, *, base_dir: str | Path | None = None) -> LyricRecord:
35 file_path = Path(path)
36 lyrics = read_lyric_file(file_path)
37 title, artist = _metadata_from_name(file_path.stem)
38 record_id = _record_id(file_path, base_dir)
39 return LyricRecord(record_id=record_id, lyrics=lyrics, title=title, artist=artist)
40
41
42 def records_from_dir(root: str | Path) -> list[LyricRecord]:
43 return [record_from_file(path, base_dir=root) for path in iter_lyric_files(root)]
44
45
46 def _record_id(path: Path, base_dir: str | Path | None) -> str:
47 if base_dir is None:
48 source = str(path.resolve())
49 else:
50 source = str(path.resolve().relative_to(Path(base_dir).resolve()))
51 digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:12]
52 return f"{digest}:{source}"
53
54
55 def _metadata_from_name(stem: str) -> tuple[str | None, str | None]:
56 cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip()
57 if " - " in cleaned:
58 artist, title = cleaned.split(" - ", 1)
59 return title.strip() or None, artist.strip() or None
60 for sep in ("-", "_"):
61 if sep in cleaned:
62 title, artist = cleaned.rsplit(sep, 1)
63 return title.strip() or None, artist.strip() or None
64 return cleaned or None, None
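`read_lyric_file` tries a fixed list of encodings and returns the first one that decodes without error, falling back to lossy UTF-8 at the end. The sketch below shows the same idea in isolation. `decode_with_fallback` is a hypothetical name, and it also returns which codec won so the behaviour is easier to inspect.

```python
from __future__ import annotations


def decode_with_fallback(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8-sig", "gb18030", "big5"),
) -> tuple[str, str]:
    """Decode bytes with the first codec that accepts them (hypothetical helper).

    Returns the decoded text and the name of the codec that succeeded.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, replace undecodable bytes instead.
    return data.decode("utf-8", errors="replace"), "utf-8 (replace)"


utf8_with_bom = b"\xef\xbb\xbf" + "晴天".encode("utf-8")
gbk_bytes = "晴天".encode("gb18030")

print(decode_with_fallback(utf8_with_bom))
print(decode_with_fallback(gbk_bytes))
```

Order matters because the loop stops at the first success. A Big5 file may be accepted by `gb18030` and come out as the wrong characters instead of reaching the `big5` entry, so files from Traditional Chinese sources are worth spot-checking.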
1 """Small in-memory MinHash LSH index for incremental lyric lookup."""
2
3 from __future__ import annotations
4
5 import hashlib
6 from collections import defaultdict
7 from dataclasses import dataclass
8
9
10 _MAX_HASH = (1 << 64) - 1
11
12
13 @dataclass(frozen=True)
14 class MinHashConfig:
15 num_perm: int = 96
16 bands: int = 24
17 seed: int = 17
18
19 @property
20 def rows_per_band(self) -> int:
21 if self.num_perm % self.bands != 0:
22 raise ValueError("num_perm must be divisible by bands")
23 return self.num_perm // self.bands
24
25
26 class MinHashLSH:
27 def __init__(self, config: MinHashConfig | None = None) -> None:
28 self.config = config or MinHashConfig()
29 self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set)
30
31 def signature(self, tokens: set[str]) -> tuple[int, ...]:
32 if not tokens:
33 return tuple([_MAX_HASH] * self.config.num_perm)
34
35 signature = [_MAX_HASH] * self.config.num_perm
36 for token in tokens:
37 encoded = token.encode("utf-8")
38 for idx in range(self.config.num_perm):
39 digest = hashlib.blake2b(
40 encoded,
41 digest_size=8,
42 person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16],
43 ).digest()
44 value = int.from_bytes(digest, "big")
45 if value < signature[idx]:
46 signature[idx] = value
47 return tuple(signature)
48
49 def add(self, record_id: str, signature: tuple[int, ...]) -> None:
50 for key in self._band_keys(signature):
51 self._buckets[key].add(record_id)
52
53 def query(self, signature: tuple[int, ...]) -> set[str]:
54 candidates: set[str] = set()
55 for key in self._band_keys(signature):
56 candidates.update(self._buckets.get(key, set()))
57 return candidates
58
59 def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]:
60 rows = self.config.rows_per_band
61 return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)]
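With the default `MinHashConfig` (96 permutations in 24 bands, so 4 rows per band), two records become LSH candidates when at least one band matches exactly. Under the usual idealized assumption that each MinHash row agrees with probability equal to the Jaccard similarity `s`, the candidate probability is `1 - (1 - s**rows)**bands`. The sketch below tabulates that curve. `candidate_probability` is a hypothetical helper, and the real hash functions only approximate this model.

```python
def candidate_probability(similarity: float, bands: int = 24, rows: int = 4) -> float:
    """Probability that two sets share at least one LSH band.

    Assumes every signature row matches independently with probability
    equal to the Jaccard similarity (idealized MinHash model).
    """
    band_match = similarity ** rows          # all rows in one band agree
    return 1.0 - (1.0 - band_match) ** bands  # at least one band agrees


for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"jaccard={s:.1f} -> candidate probability {candidate_probability(s):.3f}")
```

Under this model, more bands with fewer rows push the curve left (higher recall, more candidates to verify). Fewer bands with more rows make the index stricter.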
1 """Lyric-specific normalization and feature extraction."""
2
3 from __future__ import annotations
4
5 import re
6 import string
7 import unicodedata
8 from collections import Counter
9 from dataclasses import dataclass
10
11
12 _TRADITIONAL_TO_SIMPLIFIED = str.maketrans(
13 {
14 "愛": "爱",
15 "會": "会",
16 "個": "个",
17 "妳": "你",
18 "們": "们",
19 "麼": "么",
20 "夢": "梦",
21 "憶": "忆",
22 "風": "风",
23 "無": "无",
24 "與": "与",
25 "聽": "听",
26 "說": "说",
27 "見": "见",
28 "話": "话",
29 "還": "还",
30 "這": "这",
31 "那": "那",
32 "裡": "里",
33 "裏": "里",
34 "過": "过",
35 "來": "来",
36 "進": "进",
37 "去": "去",
38 "給": "给",
39 "讓": "让",
40 "嗎": "吗",
41 "為": "为",
42 "誰": "谁",
43 "對": "对",
44 "錯": "错",
45 "淚": "泪",
46 "寫": "写",
47 "雲": "云",
48 "藍": "蓝",
49 "紅": "红",
50 "綠": "绿",
51 "黃": "黄",
52 "長": "长",
53 "遠": "远",
54 "燈": "灯",
55 "臺": "台",
56 "台": "台",
57 "後": "后",
58 "從": "从",
59 "時": "时",
60 "間": "间",
61 "葉": "叶",
62 "歲": "岁",
63 "聲": "声",
64 "邊": "边",
65 "歡": "欢",
66 "繼": "继",
67 "續": "续",
68 "難": "难",
69 "雙": "双",
70 "舊": "旧",
71 "離": "离",
72 }
73 )
74
75 _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]")
76 _BRACKET_RE = re.compile(r"[\[(【<《].{0,40}?[\])】>》]")
77 _ROLE_PREFIX_RE = re.compile(r"^\s*(?:男|女|合|主歌|副歌|verse|chorus|bridge|rap)\s*:\s*", re.IGNORECASE)
78 _CREDIT_PREFIX_RE = re.compile(
79 r"^\s*(?:作词|作詞|作曲|编曲|編曲|制作|製作|监制|監製|录音|錄音|混音|母带|"
80 r"出品|发行|發行|歌词|歌詞|lyric(?:s)?|composer|writer|producer|arranger|"
81 r"copyright|未经|未經|qq音乐|酷狗|网易云|網易雲|lrc)",
82 re.IGNORECASE,
83 )
84 _WATERMARK_RE = re.compile(
85 r"(?:qq音乐|酷狗音乐|网易云音乐|網易雲音樂|虾米音乐|歌词网|歌詞網|"
86 r"music\.163\.com|www\.|http[s]?://|\blrc\b)",
87 re.IGNORECASE,
88 )
89 _CJK_RE = re.compile(r"[\u4e00-\u9fff]")
90 _LATIN_RE = re.compile(r"[a-zA-Z]")
91 _KANA_RE = re.compile(r"[\u3040-\u30ff]")
92 _HANGUL_RE = re.compile(r"[\uac00-\ud7af]")
93 _WORD_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]", re.IGNORECASE)
94 _INLINE_SPLIT_RE = re.compile(r"\s+(?:/|\|)\s+|(?<=[A-Za-z])\s*[-—]\s*(?=[\u4e00-\u9fff])")
95
96
97 @dataclass(frozen=True)
98 class _LineEntry:
99 text: str
100 timestamp: str | None
101 language: str
102 source_index: int
103
104
105 @dataclass(frozen=True)
106 class NormalizedLyrics:
107 raw_text: str
108 normalized_full_text: str
109 normalized_lines: tuple[str, ...]
110 unique_lines: tuple[str, ...]
111 line_counts: dict[str, int]
112 content_line_count: int
113 primary_lines: tuple[str, ...]
114 translation_lines: tuple[str, ...]
115 unknown_lines: tuple[str, ...]
116 line_roles: tuple[str, ...]
117 split_confidence: str
118 split_reason: str
119
120
121 def normalize_lyrics(text: str) -> NormalizedLyrics:
122 """Normalize lyrics while preserving line-level structure for ranking."""
123 entries: list[_LineEntry] = []
124 for index, raw_line in enumerate(unicodedata.normalize("NFKC", text).splitlines()):
125 entries.extend(_clean_line_entries(raw_line, index))
126
127 cleaned_lines = [entry.text for entry in entries]
128 roles, confidence, reason = _assign_line_roles(entries)
129 primary_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "primary")
130 translation_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "translation")
131 unknown_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "unknown")
132 if not primary_lines:
133 primary_lines = tuple(cleaned_lines)
134 roles = tuple("primary" for _ in cleaned_lines)
135 if cleaned_lines and confidence == "none":
136 reason = "未检测到可分离的翻译结构,全部有效行按原文处理"
137
138 counts = Counter(cleaned_lines)
139 unique_lines = tuple(dict.fromkeys(cleaned_lines))
140 return NormalizedLyrics(
141 raw_text=text,
142 normalized_full_text="\n".join(cleaned_lines),
143 normalized_lines=tuple(cleaned_lines),
144 unique_lines=unique_lines,
145 line_counts=dict(counts),
146 content_line_count=len(cleaned_lines),
147 primary_lines=tuple(dict.fromkeys(primary_lines)),
148 translation_lines=tuple(dict.fromkeys(translation_lines)),
149 unknown_lines=tuple(dict.fromkeys(unknown_lines)),
150 line_roles=tuple(roles),
151 split_confidence=confidence,
152 split_reason=reason,
153 )
154
155
156 def fingerprint_text(normalized: NormalizedLyrics) -> str:
157 """Return a text form suitable for exact hashing.
158
159 Repeated adjacent or non-adjacent lyric lines are collapsed so different chorus
160 repeat counts do not prevent exact duplicate detection.
161 """
162 return "\n".join(normalized.primary_lines or normalized.unique_lines)
163
164
165 def lyric_tokens(
166 normalized: NormalizedLyrics,
167 ngram_size: int = 3,
168 *,
169 lines: tuple[str, ...] | None = None,
170 ) -> set[str]:
171 """Build mixed CJK/Latin n-grams; repeated lines are collapsed, so each n-gram counts once."""
172 tokens: set[str] = set()
173 selected_lines = lines if lines is not None else normalized.unique_lines
174 for line in selected_lines:
175 units = _token_units(line)
176 if len(units) < ngram_size:
177 if units:
178 tokens.add(" ".join(units))
179 continue
180 for start in range(len(units) - ngram_size + 1):
181 tokens.add(" ".join(units[start : start + ngram_size]))
182 return tokens
183
184
185 def _clean_line_entries(raw_line: str, source_index: int) -> list[_LineEntry]:
186 timestamp_match = _TIMESTAMP_RE.search(raw_line)
187 timestamp = timestamp_match.group(1) if timestamp_match else None
188 line = _TIMESTAMP_RE.sub("", raw_line)
189 line = _ROLE_PREFIX_RE.sub("", line).strip()
190 inline_entries = _split_inline_translation(line, timestamp, source_index)
191 if inline_entries:
192 return inline_entries
193 return _entry_from_text(line, timestamp, source_index)
194
195
196 def _split_inline_translation(line: str, timestamp: str | None, source_index: int) -> list[_LineEntry]:
197 parts = [part.strip() for part in _INLINE_SPLIT_RE.split(line, maxsplit=1)]
198 if len(parts) != 2:
199 return []
200 left_entries = _entry_from_text(parts[0], timestamp, source_index)
201 right_entries = _entry_from_text(parts[1], timestamp, source_index)
202 if not left_entries or not right_entries:
203 return []
204 left_lang = left_entries[0].language
205 right_lang = right_entries[0].language
206 if _is_foreign_language(left_lang) and right_lang == "zh":
207 return [left_entries[0], right_entries[0]]
208 if left_lang == "zh" and _is_foreign_language(right_lang):
209 return [right_entries[0], left_entries[0]]
210 return []
211
212
213 def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]:
214 line = _BRACKET_RE.sub("", text)
215 line = line.strip().lower().translate(_TRADITIONAL_TO_SIMPLIFIED)
216 if not line or _is_noise_line(line):
217 return []
218 line = _strip_symbols(line)
219 if not line:
220 return []
221 return [_LineEntry(text=line, timestamp=timestamp, language=_detect_language(line), source_index=source_index)]
222
223
224 def _assign_line_roles(entries: list[_LineEntry]) -> tuple[tuple[str, ...], str, str]:
225 if not entries:
226 return (), "none", "没有有效歌词行"
227
228 timestamp_roles = _roles_by_same_timestamp(entries)
229 if timestamp_roles is not None:
230 return timestamp_roles, "high", "同时间戳下检测到外文行和中文行配对"
231
232 inline_roles = _roles_by_inline_translation(entries)
233 if inline_roles is not None:
234 return inline_roles, "medium", "同一原始行内检测到明显的外文和中文翻译"
235
236 alternating_roles = _roles_by_alternating_translation(entries)
237 if alternating_roles is not None:
238 return alternating_roles, "high", "检测到稳定的外文行和中文翻译行交替结构"
239
240 block_roles = _roles_by_translation_block(entries)
241 if block_roles is not None:
242 return block_roles, "low", "检测到疑似原文段落加中文翻译段落,置信度较低"
243
244 return tuple("primary" for _ in entries), "none", "未检测到可分离的翻译结构,全部有效行按原文处理"
245
246
247 def _roles_by_same_timestamp(entries: list[_LineEntry]) -> tuple[str, ...] | None:
248 roles = ["unknown"] * len(entries)
249 groups: dict[str, list[int]] = {}
250 for idx, entry in enumerate(entries):
251 if entry.timestamp:
252 groups.setdefault(entry.timestamp, []).append(idx)
253
254 paired = 0
255 for indexes in groups.values():
256 if len(indexes) < 2:
257 continue
258 foreign = [idx for idx in indexes if _is_foreign_language(entries[idx].language)]
259 chinese = [idx for idx in indexes if entries[idx].language == "zh"]
260 if not foreign or not chinese:
261 continue
262 for idx in foreign:
263 roles[idx] = "primary"
264 for idx in chinese:
265 roles[idx] = "translation"
266 paired += 1
267
268 if paired == 0:
269 return None
270 for idx, role in enumerate(roles):
271 if role == "unknown":
272 roles[idx] = "primary"
273 return tuple(roles)
274
275
276 def _roles_by_alternating_translation(entries: list[_LineEntry]) -> tuple[str, ...] | None:
277 roles = ["unknown"] * len(entries)
278 pairs = 0
279 idx = 0
280 while idx < len(entries) - 1:
281 current = entries[idx]
282 nxt = entries[idx + 1]
283 if _is_foreign_language(current.language) and nxt.language == "zh":
284 roles[idx] = "primary"
285 roles[idx + 1] = "translation"
286 pairs += 1
287 idx += 2
288 continue
289 idx += 1
290
291 if pairs < 2:
292 return None
293 assigned = sum(1 for role in roles if role != "unknown")
294 if assigned / len(entries) < 0.65:
295 return None
296 for idx, role in enumerate(roles):
297 if role == "unknown":
298 roles[idx] = "primary"
299 return tuple(roles)
300
301
302 def _roles_by_inline_translation(entries: list[_LineEntry]) -> tuple[str, ...] | None:
303 roles = ["primary"] * len(entries)
304 pairs = 0
305 by_source: dict[int, list[int]] = {}
306 for idx, entry in enumerate(entries):
307 by_source.setdefault(entry.source_index, []).append(idx)
308 for indexes in by_source.values():
309 if len(indexes) != 2:
310 continue
311 first, second = indexes
312 if _is_foreign_language(entries[first].language) and entries[second].language == "zh":
313 roles[first] = "primary"
314 roles[second] = "translation"
315 pairs += 1
316 elif entries[first].language == "zh" and _is_foreign_language(entries[second].language):
317 roles[first] = "translation"
318 roles[second] = "primary"
319 pairs += 1
320 return tuple(roles) if pairs else None
321
322
323 def _roles_by_translation_block(entries: list[_LineEntry]) -> tuple[str, ...] | None:
324 if len(entries) < 4:
325 return None
326 midpoint = len(entries) // 2
327 first = entries[:midpoint]
328 second = entries[midpoint:]
329 first_foreign = sum(1 for entry in first if _is_foreign_language(entry.language))
330 second_zh = sum(1 for entry in second if entry.language == "zh")
331 if first_foreign / len(first) >= 0.75 and second_zh / len(second) >= 0.75:
332 return tuple("primary" if idx < midpoint else "translation" for idx in range(len(entries)))
333 return None
334
335
336 def _detect_language(line: str) -> str:
337 cjk = len(_CJK_RE.findall(line))
338 latin = len(_LATIN_RE.findall(line))
339 kana = len(_KANA_RE.findall(line))
340 hangul = len(_HANGUL_RE.findall(line))
341 if hangul:
342 return "kr"
343 if kana:
344 return "jp"
345 if cjk and latin:
346 return "mixed"
347 if cjk:
348 return "zh"
349 if latin:
350 return "latin"
351 return "other"
352
353
354 def _is_foreign_language(language: str) -> bool:
355 return language in {"latin", "jp", "kr", "other"}
356
357
358 def _is_noise_line(line: str) -> bool:
359 if _CREDIT_PREFIX_RE.search(line) or _WATERMARK_RE.search(line):
360 return True
361 has_lyric_script = bool(_CJK_RE.search(line) or _LATIN_RE.search(line) or _KANA_RE.search(line) or _HANGUL_RE.search(line))
362 if not has_lyric_script:
363 return True
364 compact = _strip_symbols(line)
365 return len(compact) <= 1
366
367
368 def _strip_symbols(line: str) -> str:
369 punctuation = string.punctuation + ",。!?;:、“”‘’·…—~!¥()【】《》〈〉「」『』﹏"
370 line = "".join(" " if char in punctuation else char for char in line)
371 line = re.sub(r"\s+", " ", line)
372 line = re.sub(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])", "", line)
373 return line.strip()
374
375
376 def _token_units(line: str) -> list[str]:
377 units: list[str] = []
378 for match in _WORD_RE.finditer(line):
379 # CJK and non-CJK tokens are currently handled identically: lowercase and keep.
380 units.append(match.group(0).lower())
381 return units
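
The script-based language detection and the low-confidence "block translation" split above can be tried in isolation with the sketch below. It is a minimal, self-contained approximation, not the module itself: the four character-class patterns are assumptions (the real `_CJK_RE`, `_LATIN_RE`, `_KANA_RE` and `_HANGUL_RE` definitions are not shown in this part of the commit), and the public function names are hypothetical stand-ins that take plain strings instead of `_LineEntry` objects.

```python
from __future__ import annotations

import re

# Assumed character classes; the module's real patterns are not shown here.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
LATIN_RE = re.compile(r"[A-Za-z]")
KANA_RE = re.compile(r"[\u3040-\u30ff]")
HANGUL_RE = re.compile(r"[\uac00-\ud7af]")


def detect_language(line: str) -> str:
    """Classify a line by script, checking Hangul and kana before CJK."""
    if HANGUL_RE.search(line):
        return "kr"
    if KANA_RE.search(line):
        return "jp"
    has_cjk = bool(CJK_RE.search(line))
    has_latin = bool(LATIN_RE.search(line))
    if has_cjk and has_latin:
        return "mixed"
    if has_cjk:
        return "zh"
    if has_latin:
        return "latin"
    return "other"


def is_foreign(language: str) -> bool:
    """Mirror the set used above: everything except 'zh' and 'mixed'."""
    return language in {"latin", "jp", "kr", "other"}


def roles_by_translation_block(lines: list[str]) -> tuple[str, ...] | None:
    """Split at the midpoint when the first half is mostly foreign text
    and the second half is mostly Chinese (both at a 0.75 share or more)."""
    if len(lines) < 4:
        return None
    midpoint = len(lines) // 2
    languages = [detect_language(line) for line in lines]
    first, second = languages[:midpoint], languages[midpoint:]
    foreign_share = sum(is_foreign(lang) for lang in first) / len(first)
    zh_share = sum(lang == "zh" for lang in second) / len(second)
    if foreign_share >= 0.75 and zh_share >= 0.75:
        return tuple(
            "primary" if idx < midpoint else "translation"
            for idx in range(len(lines))
        )
    return None
```

Because the split point is always the exact midpoint, a file with an uneven number of original and translated lines may be labelled imperfectly, which is consistent with the README treating this structure as a low-confidence split that goes to `review`.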
1 """Process newly added lyric library files.
2
3 This script is intended for the recurring workflow after adding files to
4 ``data/library``:
5
6 1. Move pure-music placeholder lyric files out of the active library.
7 2. Rebuild the duplicate-checking index.
8 3. Optionally regenerate and evaluate a synthetic regression set.
9 """
10
11 from __future__ import annotations
12
13 import argparse
14 import csv
15 import json
16 import shutil
17 import sys
18 from datetime import datetime
19 from pathlib import Path
20
21 PROJECT_ROOT = Path(__file__).resolve().parents[1]
22 if str(PROJECT_ROOT) not in sys.path:
23 sys.path.insert(0, str(PROJECT_ROOT))
24
25 from lyric_dedup.checker import DuplicateChecker
26 from lyric_dedup.cli import evaluate_csv
27 from lyric_dedup.eval_dataset import generate_eval_set
28 from lyric_dedup.file_import import iter_lyric_files
29 from lyric_dedup.file_import import read_lyric_file
30 from lyric_dedup.file_import import records_from_dir
31 from lyric_dedup.normalization import normalize_lyrics
32
33
34 PLACEHOLDER_MARKERS = (
35 "【曲库专用】",
36 "此歌曲为没有填词的纯音乐",
37 )
38
39
40 def main() -> None:
41 parser = argparse.ArgumentParser(description="Process lyric library additions.")
42 parser.add_argument("--library-dir", default="data/library")
43 parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl")
44 parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders")
45 parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.")
46 parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.")
47 parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.")
48 parser.add_argument("--positive-ratio", type=float, default=0.2)
49 parser.add_argument("--eval-dir", default="data/generated_eval/incoming")
50 parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv")
51 parser.add_argument("--eval-out", default="outputs/results/library_eval.csv")
52 parser.add_argument("--report", default="outputs/results/library_process_report.json")
53 args = parser.parse_args()
54
55 library_dir = Path(args.library_dir)
56 quarantine_dir = Path(args.quarantine_dir)
57 report_path = Path(args.report)
58
59 files_before = iter_lyric_files(library_dir)
60 placeholders = _find_placeholder_files(library_dir)
61 short_effective = _effective_line_report(library_dir)
62
63 moved_or_deleted: list[str] = []
64 if not args.dry_run:
65 moved_or_deleted = _handle_placeholders(
66 placeholders,
67 library_dir=library_dir,
68 quarantine_dir=quarantine_dir,
69 delete=args.delete_placeholders,
70 )
71 _build_index(library_dir, Path(args.index))
72
73 if args.eval_size > 0:
74 generate_eval_set(
75 library_dir=library_dir,
76 output_dir=Path(args.eval_dir),
77 csv_path=Path(args.eval_csv),
78 size=args.eval_size,
79 positive_ratio=args.positive_ratio,
80 )
81 evaluate_csv(
82 Path(args.index),
83 Path(args.eval_csv),
84 Path(args.eval_out),
85 base_dir=Path(args.eval_csv).parent,
86 positive_decisions={"duplicate"},
87 max_candidates=5,
88 )
89 evaluate_csv(
90 Path(args.index),
91 Path(args.eval_csv),
92 Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"),
93 base_dir=Path(args.eval_csv).parent,
94 positive_decisions={"duplicate", "review"},
95 max_candidates=5,
96 )
97
98 report = {
99 "timestamp": datetime.now().isoformat(timespec="seconds"),
100 "dry_run": args.dry_run,
101 "library_dir": str(library_dir),
102 "files_before": len(files_before),
103 "placeholder_matches": len(placeholders),
104 "placeholder_files": [str(path) for path in placeholders],
105 "handled_placeholder_files": moved_or_deleted,
106 "files_after": len(iter_lyric_files(library_dir)),
107 "index": str(args.index),
108 "eval_size": args.eval_size,
109 "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "",
110 "eval_out": str(args.eval_out) if args.eval_size > 0 else "",
111 "short_effective_line_counts": short_effective,
112 }
113
114 print(json.dumps(report, ensure_ascii=False, indent=2))
115 if not args.dry_run:
116 report_path.parent.mkdir(parents=True, exist_ok=True)
117 report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
118
119
120 def _find_placeholder_files(library_dir: Path) -> list[Path]:
121 matches: list[Path] = []
122 for path in iter_lyric_files(library_dir):
123 text = read_lyric_file(path)
124 if any(marker in text for marker in PLACEHOLDER_MARKERS):
125 matches.append(path)
126 return matches
127
128
129 def _handle_placeholders(
130 placeholders: list[Path],
131 *,
132 library_dir: Path,
133 quarantine_dir: Path,
134 delete: bool,
135 ) -> list[str]:
136 handled: list[str] = []
137 if not placeholders:
138 return handled
139 if not delete:
140 quarantine_dir.mkdir(parents=True, exist_ok=True)
141 for path in placeholders:
142 if delete:
143 path.unlink()
144 handled.append(f"deleted:{path}")
145 continue
146 relative = path.resolve().relative_to(library_dir.resolve())
147 destination = quarantine_dir / relative
148 destination.parent.mkdir(parents=True, exist_ok=True)
149 if destination.exists():
150 destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}")
151 shutil.move(str(path), str(destination))
152 handled.append(f"moved:{path}->{destination}")
153 return handled
154
155
156 def _build_index(library_dir: Path, index_path: Path) -> None:
157 checker = DuplicateChecker()
158 for record in records_from_dir(library_dir):
159 checker.add_record(record)
160 index_path.parent.mkdir(parents=True, exist_ok=True)
161 checker.save(index_path)
162
163
164 def _effective_line_report(library_dir: Path) -> dict[str, int]:
165 buckets = {
166 "total": 0,
167 "zero_effective_lines": 0,
168 "one_to_three_effective_lines": 0,
169 "four_to_five_effective_lines": 0,
170 "six_plus_effective_lines": 0,
171 }
172 for path in iter_lyric_files(library_dir):
173 buckets["total"] += 1
174 normalized = normalize_lyrics(read_lyric_file(path))
175 line_count = len(normalized.primary_lines or normalized.unique_lines)
176 if line_count == 0:
177 buckets["zero_effective_lines"] += 1
178 elif line_count <= 3:
179 buckets["one_to_three_effective_lines"] += 1
180 elif line_count <= 5:
181 buckets["four_to_five_effective_lines"] += 1
182 else:
183 buckets["six_plus_effective_lines"] += 1
184 return buckets
185
186
187 if __name__ == "__main__":
188 main()
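
The quarantine step in `_handle_placeholders` keeps each file's path relative to the library and avoids overwriting an earlier quarantined copy by appending a timestamp. The sketch below isolates that behaviour so it can be exercised against a temporary directory. The function name `quarantine_file` is hypothetical, and the logic is a simplified single-file version of the loop above, not a replacement for it.

```python
from __future__ import annotations

import shutil
from datetime import datetime
from pathlib import Path


def quarantine_file(path: Path, library_dir: Path, quarantine_dir: Path) -> Path:
    """Move one file into quarantine, preserving its library-relative path."""
    relative = path.resolve().relative_to(library_dir.resolve())
    destination = quarantine_dir / relative
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.exists():
        # Same name already quarantined: add a second-resolution timestamp.
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        destination = destination.with_name(
            f"{destination.stem}_{stamp}{destination.suffix}"
        )
    shutil.move(str(path), str(destination))
    return destination
```

The timestamp has one-second resolution, so a third file with the same relative path quarantined within the same second would map to the same destination again. If that matters for a given library, a counter or a UUID suffix would be a safer choice.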
1 import csv
2
3 from lyric_dedup import DuplicateChecker
4 from lyric_dedup import DuplicateDecision
5 from lyric_dedup import LyricRecord
6 from lyric_dedup.cli import evaluate_csv
7 from lyric_dedup.eval_dataset import generate_eval_set
8 from lyric_dedup.file_import import record_from_file
9 from lyric_dedup.normalization import normalize_lyrics
10
11
12 BASE_LYRIC = """
13 [00:01.00]作词:Someone
14 [00:02.00]我爱你在每个夜里
15 [00:03.00]听风说话也听见你
16 [00:04.00]城市的灯慢慢亮起
17 [00:05.00]我把回忆写进歌曲
18 [00:06.00]啦啦啦 我们不分离
19 [00:07.00]啦啦啦 我们不分离
20 [00:08.00]明天还会继续想你
21 """
22
23
24 def test_normalization_removes_lyric_noise_and_simplifies() -> None:
25 normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n")
26
27 assert normalized.normalized_lines == ("我爱你", "听风说话")
28 assert normalized.normalized_full_text == "我爱你\n听风说话"
29 assert normalized.primary_lines == ("我爱你", "听风说话")
30
31
32 def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None:
33 checker = DuplicateChecker()
34 checker.add_record(LyricRecord("song-1", BASE_LYRIC))
35
36 result = checker.check(
37 """
38 我愛你,在每個夜裡!!!
39 聽風說話,也聽見你
40 城市的燈慢慢亮起
41 我把回憶寫進歌曲
42 啦啦啦 我們不分離
43 明天還會繼續想你
44 """
45 )
46
47 assert result.decision == DuplicateDecision.DUPLICATE
48 assert result.confidence == 1.0
49 assert result.candidates[0].record_id == "song-1"
50
51
52 def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None:
53 checker = DuplicateChecker()
54 checker.add_record(
55 LyricRecord(
56 "song-1",
57 """
58 海边的风吹过旧信
59 你说夏天不会远去
60 啦啦啦 我们不分离
61 啦啦啦 我们不分离
62 转身以后各自旅行
63 """,
64 )
65 )
66
67 result = checker.check(
68 """
69 山谷的雨落在清晨
70 我把名字交给星辰
71 啦啦啦 我们不分离
72 啦啦啦 我们不分离
73 世界安静等一个人
74 """
75 )
76
77 assert result.decision == DuplicateDecision.REVIEW
78 assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重"
79
80
81 def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None:
82 checker = DuplicateChecker()
83 checker.add_record(LyricRecord("song-1", BASE_LYRIC))
84
85 result = checker.check(
86 """
87 我爱你在每个夜里
88 听风说话也听见你
89 城市灯火慢慢亮起
90 我把回忆写进歌曲
91 啦啦啦 我们不分离
92 明天还会继续想你
93 """
94 )
95
96 assert result.decision == DuplicateDecision.DUPLICATE
97 assert result.candidates[0].jaccard >= 0.78
98 assert result.candidates[0].line_coverage >= 0.72
99
100
101 def test_fragment_of_full_song_is_not_duplicate() -> None:
102 checker = DuplicateChecker()
103 checker.add_record(LyricRecord("song-1", BASE_LYRIC))
104
105 result = checker.check(
106 """
107 听风说话也听见你
108 城市的灯慢慢亮起
109 我把回忆写进歌曲
110 """
111 )
112
113 assert result.decision != DuplicateDecision.DUPLICATE
114 assert result.candidates[0].primary_line_coverage < 0.72
115
116
117 def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision() -> None:
118 placeholder = """
119 作词:DJ金木
120 作曲:DJ金木
121 编曲:DJ金木
122 混音:DJ金木
123 【未经著作权人许可 不得翻唱 翻录或使用】
124 """
125 checker = DuplicateChecker()
126 checker.add_record(LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木"))
127 checker.add_record(LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木"))
128
129 same_song = checker.check_record(
130 LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木")
131 )
132 different_title = checker.check_record(
133 LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木")
134 )
135
136 assert same_song.decision == DuplicateDecision.DUPLICATE
137 assert same_song.reason == "无有效歌词,使用文件内容兜底指纹命中"
138 assert different_title.decision == DuplicateDecision.DUPLICATE
139
140
141 def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> None:
142 source = """
143 作词:DJ金木
144 作曲:DJ金木
145 编曲:DJ金木
146 混音:DJ金木
147 【未经著作权人许可 不得翻唱 翻录或使用】
148 """
149 noisy = """
150 [00:01.00]歌词来自QQ音乐
151 [00:02.00]作词:测试
152 [00:03.00]作词:DJ金木!
153 [00:04.00]作曲:DJ金木...
154 [00:05.00]未经著作权人许可 不得翻唱
155 """
156 checker = DuplicateChecker()
157 checker.add_record(LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木"))
158
159 result = checker.check_record(LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木"))
160
161 assert result.decision == DuplicateDecision.DUPLICATE
162 assert result.reason == "无有效歌词,文件内容兜底特征高度相似"
163
164
165 def test_unrelated_lyrics_with_shared_watermark_are_new() -> None:
166 checker = DuplicateChecker()
167 checker.add_record(
168 LyricRecord(
169 "song-1",
170 """
171 歌词来自QQ音乐
172 北方的雪落在窗前
173 我等一封迟来的信
174 """,
175 )
176 )
177
178 result = checker.check(
179 """
180 歌词来自QQ音乐
181 南方的雨穿过街心
182 你把故事说给云听
183 """
184 )
185
186 assert result.decision == DuplicateDecision.NEW
187 assert result.candidates == ()
188
189
190 def test_mixed_chinese_english_tokenization_recalls_candidate() -> None:
191 checker = DuplicateChecker()
192 checker.add_record(
193 LyricRecord(
194 "song-1",
195 """
196 say hello 在风里
197 hold me close tonight
198 我们穿过蓝色街道
199 never let me go
200 """,
201 )
202 )
203
204 result = checker.check(
205 """
206 say hello 在风里
207 hold me close tonight
208 我们穿过蓝色街道
209 never let me go
210 """
211 )
212
213 assert result.decision == DuplicateDecision.DUPLICATE
214
215
216 def test_checker_can_persist_index(tmp_path) -> None:
217 index_path = tmp_path / "lyrics.pkl"
218 checker = DuplicateChecker()
219 checker.add_record(LyricRecord("song-1", BASE_LYRIC))
220 checker.save(index_path)
221
222 loaded = DuplicateChecker.load(index_path)
223 result = loaded.check(BASE_LYRIC)
224
225 assert loaded.record_count == 1
226 assert result.decision == DuplicateDecision.DUPLICATE
227
228
229 def test_record_from_lrc_file(tmp_path) -> None:
230 lyric_file = tmp_path / "周杰伦 - 测试歌.lrc"
231 lyric_file.write_text("[00:01.00]我愛你\n", encoding="utf-8")
232
233 record = record_from_file(lyric_file, base_dir=tmp_path)
234
235 assert record.title == "测试歌"
236 assert record.artist == "周杰伦"
237 assert record.lyrics == "[00:01.00]我愛你\n"
238
239
240 def test_record_from_song_artist_lyrics_filename(tmp_path) -> None:
241 lyric_file = tmp_path / "Amnesia(House)-DJ金木-歌词.txt"
242 lyric_file.write_text("作词:DJ金木\n", encoding="utf-8")
243
244 record = record_from_file(lyric_file, base_dir=tmp_path)
245
246 assert record.title == "Amnesia(House)"
247 assert record.artist == "DJ金木"
248
249
250 def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None:
251 library = tmp_path / "library"
252 incoming = tmp_path / "incoming"
253 library.mkdir()
254 incoming.mkdir()
255 (library / "歌手A - 夜里.lrc").write_text(BASE_LYRIC, encoding="utf-8")
256 (incoming / "dup.lrc").write_text(BASE_LYRIC.replace("我爱你", "我愛你"), encoding="utf-8")
257 (incoming / "new.txt").write_text("南方的雨穿过街心\n你把故事说给云听\n", encoding="utf-8")
258
259 checker = DuplicateChecker()
260 checker.add_record(record_from_file(library / "歌手A - 夜里.lrc", base_dir=library))
261 index_path = tmp_path / "lyrics.pkl"
262 checker.save(index_path)
263
264 eval_csv = tmp_path / "eval.csv"
265 eval_csv.write_text(
266 "id,file,expected\n"
267 "case-1,incoming/dup.lrc,应去重\n"
268 "case-2,incoming/new.txt,不应去重\n",
269 encoding="utf-8",
270 )
271 out_path = tmp_path / "eval_out.csv"
272
273 evaluate_csv(
274 index_path,
275 eval_csv,
276 out_path,
277 base_dir=tmp_path,
278 positive_decisions={"duplicate"},
279 max_candidates=5,
280 )
281
282 rows = list(csv.DictReader(out_path.open(encoding="utf-8")))
283 assert [row["correct"] for row in rows] == ["True", "True"]
284 assert rows[0]["reason"] == "规范化后的原文歌词哈希完全一致"
285 assert (tmp_path / "eval_out.csv.summary.json").exists()
286
287
288 def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None:
289 library = tmp_path / "library"
290 incoming = tmp_path / "generated" / "incoming"
291 eval_csv = tmp_path / "generated" / "eval.csv"
292 library.mkdir()
293 (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8")
294
295 generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5)
296
297 rows = list(csv.DictReader(eval_csv.open(encoding="utf-8")))
298 positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"}
299 fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"]
300
301 assert "trimmed_version" not in positive_types
302 assert "single_song_fragment" not in positive_types
303 assert fragment_rows
304 assert all(row["expected"] == "不应去重" for row in fragment_rows)
305
306
307 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None:
308 checker = DuplicateChecker()
309 checker.add_record(
310 LyricRecord(
311 "song-1",
312 """
313 I miss you tonight
314 Under the moonlight
315 Never let me go
316 """,
317 )
318 )
319
320 result = checker.check(
321 """
322 I miss you tonight
323 今晚我想你
324 Under the moonlight
325 月光之下
326 Never let me go
327 永远不要让我离开
328 """
329 )
330
331 assert result.decision == DuplicateDecision.DUPLICATE
332 assert result.reason == "规范化后的原文歌词哈希完全一致,翻译行未参与自动判重"
333
334
335 def test_same_timestamp_translation_split_is_high_confidence() -> None:
336 normalized = normalize_lyrics(
337 """
338 [00:01.00]I miss you tonight
339 [00:01.00]今晚我想你
340 [00:02.00]Under the moonlight
341 [00:02.00]月光之下
342 """
343 )
344
345 assert normalized.primary_lines == ("i miss you tonight", "under the moonlight")
346 assert normalized.translation_lines == ("今晚我想你", "月光之下")
347 assert normalized.split_confidence == "high"
348
349
350 def test_translation_only_overlap_is_review_not_duplicate() -> None:
351 checker = DuplicateChecker()
352 checker.add_record(
353 LyricRecord(
354 "song-1",
355 """
356 I miss you tonight
357 今晚我想你
358 Under the moonlight
359 月光之下
360 Never let me go
361 永远不要让我离开
362 """,
363 )
364 )
365
366 result = checker.check(
367 """
368 Te extrano esta noche
369 今晚我想你
370 Bajo la luna
371 月光之下
372 No me dejes ir
373 永远不要让我离开
374 """
375 )
376
377 assert result.decision == DuplicateDecision.REVIEW
378 assert result.reason == "仅翻译行相似,原文字面重合不足,不自动判重"
379 assert result.candidates[0].translation_jaccard >= 0.45
380
381
382 def test_block_translation_split_is_review_when_primary_matches() -> None:
383 checker = DuplicateChecker()
384 checker.add_record(
385 LyricRecord(
386 "song-1",
387 """
388 I miss you tonight
389 Under the moonlight
390 Never let me go
391 """,
392 )
393 )
394
395 result = checker.check(
396 """
397 I miss you tonight
398 Under the moonlight
399 Never let me go
400 今晚我想你
401 月光之下
402 永远不要让我离开
403 """
404 )
405
406 assert result.decision == DuplicateDecision.REVIEW
407 assert result.reason == "原文哈希一致,但疑似整段翻译结构拆分置信度较低,需要人工复核"