Commit 51ddab43 51ddab43fb5e3638a8b8c9cd8049679fe8b2ccc7 by 沈秋雨

Add lyric duplicate detection workflow

0 parents
.DS_Store
__pycache__/
*.py[cod]
.pytest_cache/

# Local lyric data and generated artifacts
data/
outputs/
downloaded_lyrics/
downloaded_lyrics_type3/
download_failed*.txt

# Local downloader / scratch utilities
download_lyrics.py
test_db_connection.py
*.env

# Reference project kept locally only
text-dedup-main/

# Virtual environments and editor files
.venv/
venv/
.idea/
.vscode/
# Lyric Duplicate Checker

This first version checks newly added lyrics for duplicates: build an index from the existing `.lrc` / `.txt` lyrics, then query it with each new lyric. Every query returns one of `duplicate`, `review`, or `new`.

## Build the index

Assuming the existing library lives in `data/library/`:

```bash
python -m lyric_dedup.cli build-index \
  --lyrics-dir data/library \
  --index outputs/indexes/lyrics.pkl
```

## Check a single new lyric

```bash
python -m lyric_dedup.cli check-file \
  --index outputs/indexes/lyrics.pkl \
  --file data/incoming/new_song.lrc
```

## Batch-check a directory of new lyrics

```bash
python -m lyric_dedup.cli batch-check \
  --index outputs/indexes/lyrics.pkl \
  --lyrics-dir data/incoming \
  --out outputs/results/incoming_check.csv
```

The most useful columns in the CSV:

- `decision`: the overall verdict.
- `best_candidate_id`: the most similar existing lyric.
- `best_candidate_jaccard`: n-gram literal similarity.
- `best_candidate_line_coverage`: line-level coverage.
- `matched_unique_lines`: the normalized lyric lines that matched.
- `best_candidate_reason`: the reason for the verdict, written in Chinese, explaining why the lyric was judged a duplicate, sent to review, or judged new.

Recommended production handling: `duplicate` can be blocked automatically; `review` goes to the manual review pool; `new` can still be spot-checked before it enters the library.
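
As a rough illustration of what the `best_candidate_jaccard` and `best_candidate_line_coverage` columns measure, the sketch below computes a character n-gram Jaccard score and a line coverage ratio for two lyric texts. The helper names, the choice of character 3-grams, and the definition of coverage as the share of the new lyric's unique lines found in the candidate are all assumptions made for illustration; the project's own normalization and scoring may differ.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text`, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        # Very short texts contribute themselves as a single token.
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def jaccard(left: set[str], right: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def line_coverage(new_lyrics: str, candidate_lyrics: str) -> float:
    """Share of the new lyric's unique non-empty lines that also appear in the candidate."""
    new_lines = {line.strip() for line in new_lyrics.splitlines() if line.strip()}
    candidate_lines = {line.strip() for line in candidate_lyrics.splitlines() if line.strip()}
    return len(new_lines & candidate_lines) / len(new_lines) if new_lines else 0.0
```

Reading the two numbers together is the point: a short fragment can reach high line coverage against a long song while its n-gram Jaccard score stays low, because the union of n-grams is dominated by the longer text.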
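
## Safeguards for lyrics with original text plus Chinese translation

Lyrics are currently split into three kinds of lines:

- `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these.
- `translation_lines`: Chinese translation lines; used only for recall and for explaining review decisions.
- `unknown_lines`: lines that cannot be classified reliably.

High-confidence splits:

- A foreign-language line and a Chinese line share the same timestamp.
- Several stable groups of a foreign-language line followed by a Chinese line, alternating.

Medium-confidence splits:

- An obvious foreign-language / Chinese translation pair within a single line, for example `I miss you / 今晚我想你`.

Low-confidence splits:

- A whole block of foreign-language text followed by a whole block of Chinese translation.

Decision policy:

- If the original text is highly consistent, the result can be `duplicate` even when the new lyric adds a Chinese translation.
- If only the translation lines are similar and the original text is not similar enough, the result is `review` at most; it is never automatically judged a duplicate.
- A suspected whole-block translation layout is a low-confidence split, so the result is `review` first even if the original-text hash matches.
- For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text.

Because the index stores the split original/translation features, rebuild the index after changing the split rules.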
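
The decision policy above can be sketched as a small function. `MatchSignals`, its field names, and `decide` are hypothetical and are not taken from the project; the sketch also assumes that medium-confidence splits are handled like high-confidence ones, which the rules above do not state.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MatchSignals:
    primary_match: bool      # original-language lines are highly consistent
    translation_match: bool  # translation lines are similar
    split_confidence: str    # "high", "medium", "low", or "none" (no translation structure)


def decide(signals: MatchSignals) -> str:
    """Map match signals to duplicate / review / new following the listed rules."""
    any_match = signals.primary_match or signals.translation_match
    if signals.split_confidence == "low":
        # Whole-block translation layout: never auto-block, even on a primary match.
        return "review" if any_match else "new"
    if signals.primary_match:
        # Original text matches; an added translation does not change the verdict.
        return "duplicate"
    if signals.translation_match:
        # Only the translation is similar: manual review at most.
        return "review"
    return "new"
```

For an ordinary Chinese song, `split_confidence` would be `"none"` and every valid line counts toward `primary_match`, so the function reduces to a plain duplicate-or-new decision.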
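
## Evaluate accuracy with a labeled CSV

You can first generate a batch of evaluation samples automatically from the existing library:

```bash
python -m lyric_dedup.cli generate-eval-set \
  --library-dir data/library \
  --lyrics-dir data/generated_eval/incoming \
  --csv data/generated_eval/eval_10.csv \
  --size 10 \
  --positive-ratio 0.6
```

Business rules the generator follows:

- `应去重` ("should dedupe") samples are only style variations of the full lyric, such as timestamps, punctuation, platform noise, blank lines, LRC styling, or an added Chinese translation.
- `不应去重` ("should not dedupe") samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity.
- A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`.

To evaluate your own labels, prepare a CSV such as `data/eval/eval.csv`:

```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```

Instead of a file path, you can put the lyrics directly in a `lyrics` column:

```csv
id,lyrics,expected
case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate
case-004,"南方的雨穿过街心\n你把故事说给云听",new
```

`expected` accepts these values:

- Should dedupe: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes`
- Should not dedupe: `不应去重`, `不重复`, `new`, `0`, `false`, `no`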
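
A minimal sketch of how the accepted `expected` values could be normalized to a boolean. `parse_expected` is a hypothetical helper, and the trimming and case-insensitive matching are assumptions; the CLI's actual parsing is not shown in this document.

```python
POSITIVE_LABELS = {"应去重", "重复", "duplicate", "1", "true", "yes"}
NEGATIVE_LABELS = {"不应去重", "不重复", "new", "0", "false", "no"}


def parse_expected(value: str) -> bool:
    """Return True for a should-dedupe label and False for a should-not-dedupe label."""
    label = value.strip().lower()
    if label in POSITIVE_LABELS:
        return True
    if label in NEGATIVE_LABELS:
        return False
    # Fail loudly so that a typo in the labeled CSV is not silently miscounted.
    raise ValueError(f"unrecognized expected label: {value!r}")
```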

Run the evaluation:

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/lyrics.pkl \
  --csv data/eval/eval.csv \
  --base-dir data \
  --out outputs/results/eval_result.csv
```

By default, only a system output of `duplicate` counts as "predicted should dedupe". This suits evaluating the precision of automatic blocking, because wrongly blocked samples stand out more.

To evaluate recall of suspicious samples instead, where both `duplicate` and `review` count as hits:

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/lyrics.pkl \
  --csv data/eval/eval.csv \
  --base-dir data \
  --positive-decisions duplicate,review \
  --out outputs/results/eval_result_review_as_positive.csv
```

Each run writes two files. For the first command above they are:

- `outputs/results/eval_result.csv`: the prediction, candidate, reason, and correctness for each sample.
- `outputs/results/eval_result.csv.summary.json`: the overall metrics.

The most useful fields in the summary:

- `accuracy`: overall share of correct predictions.
- `precision`: of the samples predicted as should dedupe, how many truly should be. Look at this first for automatic blocking.
- `recall`: of the samples that truly should be deduped, how many the system caught.
- `f1`: the combined measure of precision and recall.
- `false_positive`: should not dedupe but predicted as should dedupe; a wrongly blocked sample.
- `false_negative`: should dedupe but not caught; a missed duplicate.
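
For reference, the summary fields above relate to the four confusion counts as in the sketch below. `summarize` is a hypothetical helper using the standard definitions; returning `0.0` when a denominator is zero is an assumption, and the project's summary writer may handle that case differently.

```python
from __future__ import annotations


def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because precision divides by `tp + fp`, a single false positive weighs heavily when few samples are predicted as duplicates, which is why it is the first number to check for automatic blocking.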
# Lyric Duplicate Check Test Workflow

This document records the full set of commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and inspecting the results.

## 1. Prepare directories

The existing library goes in:

```text
data/library/
```

Supported files:

```text
.lrc
.txt
```

Generated test samples are written to:

```text
data/generated_eval/incoming/
```

The labeled test-set CSV is written to:

```text
data/generated_eval/eval_500.csv
```

Evaluation results are written to:

```text
outputs/results/
```

## 2. Build the index for the existing library

If you have just added a batch of samples to `data/library`, run the processing script first:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --index outputs/indexes/library_lyrics.pkl
```

The script will:

```text
1. Scan for and quarantine pure-music placeholder samples, e.g. files containing 【曲库专用】 or “此歌曲为没有填词的纯音乐”.
2. Rebuild outputs/indexes/library_lyrics.pkl.
3. Write the processing report outputs/results/library_process_report.json.
```

To preview which files would be handled without actually moving anything or rebuilding the index:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --dry-run
```

To also generate and evaluate 1180 test samples in the same run:

```bash
python scripts/process_library.py \
  --library-dir data/library \
  --index outputs/indexes/library_lyrics.pkl \
  --eval-size 1180 \
  --positive-ratio 0.2 \
  --eval-csv data/generated_eval/eval_1180.csv \
  --eval-out outputs/results/library_eval_1180.csv
```

Quarantined files are moved by default to:

```text
data/quarantine/no_lyrics_placeholders/
```

You can also build the index by hand:

```bash
python -m lyric_dedup.cli build-index \
  --lyrics-dir data/library \
  --index outputs/indexes/library_lyrics.pkl
```

Index file:

```text
outputs/indexes/library_lyrics.pkl
```

Note: if you change `data/library`, or change the preprocessing or duplicate-decision logic, run this step again.

## 3. Generate 500 test samples

```bash
python -m lyric_dedup.cli generate-eval-set \
  --library-dir data/library \
  --lyrics-dir data/generated_eval/incoming \
  --csv data/generated_eval/eval_500.csv \
  --size 500 \
  --positive-ratio 0.2
```

With the defaults described in section 11 (`size = 100`, `positive_ratio = 0.6`) the generator produces:

```text
应去重 (should dedupe): 60
不应去重 (should not dedupe): 40
```

The command above overrides both values, so it produces 100 `应去重` samples and 400 `不应去重` samples.

The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples.

Business rules:

```text
pos_* = should dedupe: style variations of the full lyric
neg_* = should not dedupe: fragment / short-phrase collision / mixed fragments / same-theme new lyric / translation-only similarity
```

## 4. Strict evaluation: only `duplicate` counts as a dedupe prediction

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/library_lyrics.pkl \
  --csv data/generated_eval/eval_500.csv \
  --base-dir data/generated_eval \
  --out outputs/results/library_eval_500.csv
```

Under this rule:

```text
duplicate -> predicted should dedupe
review    -> predicted should not dedupe
new       -> predicted should not dedupe
```

This is suited to evaluating the precision of automatic blocking. Focus on:

```text
false_positive
```

## 5. Recall evaluation: both `duplicate` and `review` count as catching a suspicious sample

```bash
python -m lyric_dedup.cli evaluate-csv \
  --index outputs/indexes/library_lyrics.pkl \
  --csv data/generated_eval/eval_500.csv \
  --base-dir data/generated_eval \
  --positive-decisions duplicate,review \
  --out outputs/results/library_eval_500_review_positive.csv
```

Under this rule:

```text
duplicate -> predicted should dedupe
review    -> predicted should dedupe
new       -> predicted should not dedupe
```

This is suited to evaluating recall of suspicious samples. Focus on:

```text
false_negative
```
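
The strict rule in section 4 and the recall rule in this section differ only in which decisions count as a predicted positive. The sketch below shows how the same set of decisions yields different confusion counts under each rule; `confusion_counts` and the sample rows are hypothetical and only illustrate the counting, not the CLI's implementation.

```python
from __future__ import annotations

from collections import Counter


def confusion_counts(rows: list[tuple[bool, str]], positive_decisions: set[str]) -> Counter:
    """Count TP/FP/TN/FN for (expected_positive, decision) pairs."""
    counts: Counter = Counter()
    for expected_positive, decision in rows:
        predicted_positive = decision in positive_decisions
        if expected_positive and predicted_positive:
            counts["true_positive"] += 1
        elif predicted_positive:
            counts["false_positive"] += 1
        elif expected_positive:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Made-up rows: (sample should be deduped, decision returned by the checker).
rows = [(True, "duplicate"), (True, "review"), (False, "review"), (False, "new")]
strict = confusion_counts(rows, {"duplicate"})
lenient = confusion_counts(rows, {"duplicate", "review"})
```

In this toy data, counting `review` as positive removes the false negative but introduces a false positive, which is the trade-off the two evaluation runs are meant to expose.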

## 6. View overall metrics

Strict rule:

```bash
cat outputs/results/library_eval_500.csv.summary.json
```

Recall rule:

```bash
cat outputs/results/library_eval_500_review_positive.csv.summary.json
```

Metric meanings:

```text
accuracy        overall share of correct predictions
precision       of the samples predicted as should dedupe, how many truly should be
recall          of the samples that truly should be deduped, how many the system caught
f1              combined measure of precision and recall
true_positive   should dedupe and predicted should dedupe
false_positive  should not dedupe but predicted should dedupe (wrongly blocked)
true_negative   should not dedupe and predicted should not dedupe
false_negative  should dedupe but predicted should not dedupe (missed)
```

## 7. View per-sample results

```bash
open outputs/results/library_eval_500.csv
```

If `open` is not available, print the CSV directly:

```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]'
```

## 8. View failed samples

Failed samples under the strict rule:

```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]'
```

Inspect the full candidate list for one sample, for example:

```bash
python -m lyric_dedup.cli check-file \
  --index outputs/indexes/library_lyrics.pkl \
  --file data/generated_eval/incoming/neg_103_mixed_fragments.txt \
  --max-candidates 10
```

## 9. Check the test-set distribution

```bash
python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_500.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))'
```

Check the number of files in the generated directory:

```bash
find data/generated_eval/incoming -type f | wc -l
```

## 10. Run the code tests

```bash
python -m pytest tests
```

Compile check:

```bash
python -m compileall -q lyric_dedup tests
```

## 11. About test-set uniqueness

A set of 100 automatically generated samples is a rule-coverage test set; there is no guarantee that the samples are distinct from each other after normalization.

If the 100 test samples must be mutually distinct while still using the default ratio:

```text
size = 100
positive_ratio = 0.6
```

then you need at least:

```text
60 mutually distinct seed lyrics
```

Reason: should-dedupe samples are full-song variants, and several style variations of the same song still normalize to the same song.
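
The seed requirement above follows from how the positive count is derived (`round(size * positive_ratio)`, as in `generate_eval_set`). A small sketch of the check, where `required_unique_seeds` and `has_enough_seeds` are hypothetical helper names:

```python
def required_unique_seeds(size: int, positive_ratio: float) -> int:
    """Number of should-dedupe samples, each of which needs its own seed lyric."""
    return round(size * positive_ratio)


def has_enough_seeds(library_files: int, size: int, positive_ratio: float) -> bool:
    """True when every should-dedupe sample can come from a different library file."""
    return library_files >= required_unique_seeds(size, positive_ratio)
```

This only counts files; two library files that normalize to the same lyric would still collide, so the library itself needs to be deduplicated for the guarantee to hold.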

A more reliable way to evaluate real-world accuracy is to prepare a hand-labeled CSV:

```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```

Then run `evaluate-csv` directly as shown in section 4 or section 5.
"""Lyric duplicate detection utilities."""

from lyric_dedup.checker import DuplicateCheckResult
from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.checker import DuplicateDecision
from lyric_dedup.checker import LyricRecord

__all__ = [
    "DuplicateCheckResult",
    "DuplicateChecker",
    "DuplicateDecision",
    "LyricRecord",
]
"""Generate labeled evaluation samples from an existing lyric library."""

from __future__ import annotations

import csv
import random
import re
from dataclasses import dataclass
from pathlib import Path

from lyric_dedup.file_import import iter_lyric_files
from lyric_dedup.file_import import read_lyric_file
from lyric_dedup.file_import import record_from_file
from lyric_dedup.normalization import normalize_lyrics


@dataclass(frozen=True)
class GeneratedSample:
    sample_id: str
    file: str
    expected: str
    sample_type: str
    source: str
    title: str = ""
    artist: str = ""


def generate_eval_set(
    *,
    library_dir: Path,
    output_dir: Path,
    csv_path: Path,
    size: int = 100,
    positive_ratio: float = 0.6,
    seed: int = 20260602,
) -> dict[str, object]:
    rng = random.Random(seed)
    source_files = iter_lyric_files(library_dir)
    if not source_files:
        raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件")

    output_dir.mkdir(parents=True, exist_ok=True)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    _clean_generated_output_dir(output_dir)

    positives = round(size * positive_ratio)
    negatives = size - positives
    samples: list[GeneratedSample] = []
    # Source files are reused cyclically when the library is smaller than the sample count.
    for index in range(positives):
        source = source_files[index % len(source_files)]
        samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng))
    for index in range(negatives):
        left = source_files[index % len(source_files)]
        right = source_files[(index + 1) % len(source_files)]
        samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng))

    rng.shuffle(samples)
    with csv_path.open("w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"])
        writer.writeheader()
        writer.writerows(
            {
                "id": sample.sample_id,
                "file": sample.file,
                "expected": sample.expected,
                "sample_type": sample.sample_type,
                "source": sample.source,
                "title": sample.title,
                "artist": sample.artist,
            }
            for sample in samples
        )

    return {
        "size": size,
        "positive": positives,
        "negative": negatives,
        "library_files": len(source_files),
        "lyrics_dir": str(output_dir),
        "csv": str(csv_path),
    }


def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
    raw = read_lyric_file(source)
    source_record = record_from_file(source)
    variants = [
        ("exact_copy", raw),
        ("timestamped", _add_timestamps(_content_lines(raw))),
        ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)),
        ("with_platform_noise", _with_platform_noise(_content_lines(raw))),
        ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))),
        ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))),
        ("translation_added", _translation_added(_content_lines(raw))),
    ]
    # Cycle through the variant types so each one is covered evenly.
    sample_type, text = variants[(index - 1) % len(variants)]
    name = f"pos_{index:03d}_{sample_type}.txt"
    path = output_dir / name
    path.write_text(text, encoding="utf-8")
    return GeneratedSample(
        sample_id=f"pos-{index:03d}",
        file=str(path.relative_to(csv_base)),
        expected="应去重",
        sample_type=sample_type,
        source=str(source),
        title=source_record.title or "",
        artist=source_record.artist or "",
    )


def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
    left_lines = _normalized_lines(left)
    right_lines = _normalized_lines(right)
    variants = [
        ("single_song_fragment", _single_song_fragment(left_lines)),
        ("short_shared_snippet", _short_shared_snippet(left_lines, rng)),
        ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)),
        ("same_theme_synthetic", _same_theme_synthetic(index)),
        ("translation_only_like", _translation_only_like(left_lines)),
    ]
    sample_type, text = variants[(index - 1) % len(variants)]
    name = f"neg_{index:03d}_{sample_type}.txt"
    path = output_dir / name
    path.write_text(text, encoding="utf-8")
    return GeneratedSample(
        sample_id=f"neg-{index:03d}",
        file=str(path.relative_to(csv_base)),
        expected="不应去重",
        sample_type=sample_type,
        source=f"{left} | {right}",
    )


def _content_lines(text: str) -> list[str]:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines or [text.strip()]


def _clean_generated_output_dir(output_dir: Path) -> None:
    for path in output_dir.iterdir():
        if path.is_file() and path.suffix.lower() in {".txt", ".lrc"}:
            path.unlink()


def _normalized_lines(path: Path) -> list[str]:
    normalized = normalize_lyrics(read_lyric_file(path))
    return list(normalized.primary_lines or normalized.unique_lines)


def _add_timestamps(lines: list[str]) -> str:
    return "\n".join(f"[00:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1))


def _add_punctuation_noise(lines: list[str], rng: random.Random) -> str:
    marks = ["!", "?", "...", ",", "。"]
    return "\n".join(f"{line}{rng.choice(marks)}" for line in lines)


def _with_platform_noise(lines: list[str]) -> str:
    return "\n".join(["歌词来自QQ音乐", "作词:测试", *lines, "未经著作权人许可 不得翻唱"])


def _add_blank_line_noise(lines: list[str]) -> str:
    result: list[str] = []
    for idx, line in enumerate(lines, start=1):
        result.append(line)
        if idx % 4 == 0:
            result.append("")
    return "\n".join(result)


def _translation_added(lines: list[str]) -> str:
    result: list[str] = []
    for idx, line in enumerate(lines, start=1):
        result.append(line)
        if _looks_foreign(line) and idx <= 24:
            result.append(_pseudo_translation(idx))
    return "\n".join(result)


def _single_song_fragment(lines: list[str]) -> str:
    if len(lines) <= 4:
        return "\n".join(lines[: max(1, len(lines) // 2)])
    fragment_len = max(2, min(8, len(lines) // 4))
    start = max(0, (len(lines) - fragment_len) // 2)
    return "\n".join(lines[start : start + fragment_len])


def _short_shared_snippet(lines: list[str], rng: random.Random) -> str:
    snippet = rng.sample(lines, k=min(2, len(lines))) if lines else []
    synthetic = [
        "清晨的风吹过新的街口",
        "我把昨天放进安静的口袋",
        *snippet,
        "故事从这里重新开始",
        "灯光落下我继续往前走",
    ]
    return "\n".join(synthetic)


def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str:
    left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else []
    right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else []
    filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"]
    return "\n".join([*left_pick, *filler, *right_pick])


def _same_theme_synthetic(index: int) -> str:
    themes = [
        "我在夜里想起远方的你",
        "城市灯火陪我走过雨季",
        "那些没说完的话留在风里",
        "明天醒来我们各自继续",
        f"这是第 {index} 个全新测试样本",
    ]
    return "\n".join(themes)


def _translation_only_like(lines: list[str]) -> str:
    foreign_count = sum(1 for line in lines if _looks_foreign(line))
    if foreign_count < 2:
        return _same_theme_synthetic(foreign_count + len(lines))
    return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1))


def _pseudo_translation(index: int) -> str:
    translations = [
        "今晚我仍然想念你",
        "风会带走所有疲惫",
        "黑暗里也会有光",
        "别让昨天困住自己",
        "我们终会继续向前",
        "雨停以后天空会亮",
        "把遗憾留在旧时光",
        "你已经足够好了",
    ]
    return translations[(index - 1) % len(translations)]


def _looks_foreign(line: str) -> bool:
    # "Foreign" here means: contains Latin letters and no CJK ideographs.
    latin = len(re.findall(r"[A-Za-z]", line))
    cjk = len(re.findall(r"[\u4e00-\u9fff]", line))
    return latin > 0 and cjk == 0
"""Import LRC/TXT lyric files into records."""

from __future__ import annotations

import hashlib
from pathlib import Path

from lyric_dedup.checker import LyricRecord


SUPPORTED_SUFFIXES = {".lrc", ".txt"}


def iter_lyric_files(root: str | Path) -> list[Path]:
    base = Path(root)
    return sorted(
        path
        for path in base.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def read_lyric_file(path: str | Path) -> str:
    file_path = Path(path)
    data = file_path.read_bytes()
    # Try common encodings in order; fall back to lossy UTF-8 decoding.
    for encoding in ("utf-8-sig", "utf-8", "gb18030", "big5"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")


def record_from_file(path: str | Path, *, base_dir: str | Path | None = None) -> LyricRecord:
    file_path = Path(path)
    lyrics = read_lyric_file(file_path)
    title, artist = _metadata_from_name(file_path.stem)
    record_id = _record_id(file_path, base_dir)
    return LyricRecord(record_id=record_id, lyrics=lyrics, title=title, artist=artist)


def records_from_dir(root: str | Path) -> list[LyricRecord]:
    return [record_from_file(path, base_dir=root) for path in iter_lyric_files(root)]


def _record_id(path: Path, base_dir: str | Path | None) -> str:
    if base_dir is None:
        source = str(path.resolve())
    else:
        source = str(path.resolve().relative_to(Path(base_dir).resolve()))
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:12]
    return f"{digest}:{source}"


def _metadata_from_name(stem: str) -> tuple[str | None, str | None]:
    cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip()
    if " - " in cleaned:
        artist, title = cleaned.split(" - ", 1)
        return title.strip() or None, artist.strip() or None
    for sep in ("-", "_"):
        if sep in cleaned:
            title, artist = cleaned.rsplit(sep, 1)
            return title.strip() or None, artist.strip() or None
    return stem.strip() or None, None
"""Small in-memory MinHash LSH index for incremental lyric lookup."""

from __future__ import annotations

import hashlib
from collections import defaultdict
from dataclasses import dataclass


_MAX_HASH = (1 << 64) - 1


@dataclass(frozen=True)
class MinHashConfig:
    num_perm: int = 96
    bands: int = 24
    seed: int = 17

    @property
    def rows_per_band(self) -> int:
        if self.num_perm % self.bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        return self.num_perm // self.bands


class MinHashLSH:
    def __init__(self, config: MinHashConfig | None = None) -> None:
        self.config = config or MinHashConfig()
        self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set)

    def signature(self, tokens: set[str]) -> tuple[int, ...]:
        if not tokens:
            return tuple([_MAX_HASH] * self.config.num_perm)

        signature = [_MAX_HASH] * self.config.num_perm
        for token in tokens:
            encoded = token.encode("utf-8")
            for idx in range(self.config.num_perm):
                # A distinct blake2b personalization string per slot acts as
                # an independent hash function.
                digest = hashlib.blake2b(
                    encoded,
                    digest_size=8,
                    person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16],
                ).digest()
                value = int.from_bytes(digest, "big")
                if value < signature[idx]:
                    signature[idx] = value
        return tuple(signature)

    def add(self, record_id: str, signature: tuple[int, ...]) -> None:
        for key in self._band_keys(signature):
            self._buckets[key].add(record_id)

    def query(self, signature: tuple[int, ...]) -> set[str]:
        candidates: set[str] = set()
        for key in self._band_keys(signature):
            candidates.update(self._buckets.get(key, set()))
        return candidates

    def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]:
        rows = self.config.rows_per_band
        return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)]
"""Process newly added lyric library files.

This script is intended for the recurring workflow after adding files to
``data/library``:

1. Move pure-music placeholder lyric files out of the active library.
2. Rebuild the duplicate-checking index.
3. Optionally regenerate and evaluate a synthetic regression set.
"""

from __future__ import annotations

import argparse
import csv
import json
import shutil
import sys
from datetime import datetime
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.cli import evaluate_csv
from lyric_dedup.eval_dataset import generate_eval_set
from lyric_dedup.file_import import iter_lyric_files
from lyric_dedup.file_import import read_lyric_file
from lyric_dedup.file_import import records_from_dir
from lyric_dedup.normalization import normalize_lyrics


PLACEHOLDER_MARKERS = (
    "【曲库专用】",
    "此歌曲为没有填词的纯音乐",
)


def main() -> None:
    parser = argparse.ArgumentParser(description="Process lyric library additions.")
    parser.add_argument("--library-dir", default="data/library")
    parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl")
    parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders")
    parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.")
    parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.")
    parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.")
    parser.add_argument("--positive-ratio", type=float, default=0.2)
    parser.add_argument("--eval-dir", default="data/generated_eval/incoming")
    parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv")
    parser.add_argument("--eval-out", default="outputs/results/library_eval.csv")
    parser.add_argument("--report", default="outputs/results/library_process_report.json")
    args = parser.parse_args()

    library_dir = Path(args.library_dir)
    quarantine_dir = Path(args.quarantine_dir)
    report_path = Path(args.report)

    files_before = iter_lyric_files(library_dir)
    placeholders = _find_placeholder_files(library_dir)
    short_effective = _effective_line_report(library_dir)

    moved_or_deleted: list[str] = []
    if not args.dry_run:
        moved_or_deleted = _handle_placeholders(
            placeholders,
            library_dir=library_dir,
            quarantine_dir=quarantine_dir,
            delete=args.delete_placeholders,
        )
        _build_index(library_dir, Path(args.index))

    # Skip generation and evaluation on a dry run: both write files, and the
    # index they depend on is not rebuilt in that mode.
    if args.eval_size > 0 and not args.dry_run:
        generate_eval_set(
            library_dir=library_dir,
            output_dir=Path(args.eval_dir),
            csv_path=Path(args.eval_csv),
            size=args.eval_size,
            positive_ratio=args.positive_ratio,
        )
        # Strict run: only `duplicate` counts as a predicted positive.
        evaluate_csv(
            Path(args.index),
            Path(args.eval_csv),
            Path(args.eval_out),
            base_dir=Path(args.eval_csv).parent,
            positive_decisions={"duplicate"},
            max_candidates=5,
        )
        # Recall run: `duplicate` and `review` both count as predicted positives.
        evaluate_csv(
            Path(args.index),
            Path(args.eval_csv),
            Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"),
            base_dir=Path(args.eval_csv).parent,
            positive_decisions={"duplicate", "review"},
            max_candidates=5,
        )

    report = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "dry_run": args.dry_run,
        "library_dir": str(library_dir),
        "files_before": len(files_before),
        "placeholder_matches": len(placeholders),
        "placeholder_files": [str(path) for path in placeholders],
        "handled_placeholder_files": moved_or_deleted,
        "files_after": len(iter_lyric_files(library_dir)),
        "index": str(args.index),
        "eval_size": args.eval_size,
        "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "",
        "eval_out": str(args.eval_out) if args.eval_size > 0 else "",
        "short_effective_line_counts": short_effective,
    }

    print(json.dumps(report, ensure_ascii=False, indent=2))
    if not args.dry_run:
        report_path.parent.mkdir(parents=True, exist_ok=True)
        report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")


def _find_placeholder_files(library_dir: Path) -> list[Path]:
    matches: list[Path] = []
    for path in iter_lyric_files(library_dir):
        text = read_lyric_file(path)
        if any(marker in text for marker in PLACEHOLDER_MARKERS):
            matches.append(path)
    return matches


def _handle_placeholders(
    placeholders: list[Path],
    *,
    library_dir: Path,
    quarantine_dir: Path,
    delete: bool,
) -> list[str]:
    handled: list[str] = []
    if not placeholders:
        return handled
    if not delete:
        quarantine_dir.mkdir(parents=True, exist_ok=True)
    for path in placeholders:
        if delete:
            path.unlink()
            handled.append(f"deleted:{path}")
            continue
        relative = path.resolve().relative_to(library_dir.resolve())
        destination = quarantine_dir / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        if destination.exists():
            # Avoid overwriting an earlier quarantined file with the same name.
            destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}")
        shutil.move(str(path), str(destination))
        handled.append(f"moved:{path}->{destination}")
    return handled


def _build_index(library_dir: Path, index_path: Path) -> None:
    checker = DuplicateChecker()
    for record in records_from_dir(library_dir):
        checker.add_record(record)
    index_path.parent.mkdir(parents=True, exist_ok=True)
    checker.save(index_path)


def _effective_line_report(library_dir: Path) -> dict[str, int]:
    buckets = {
        "total": 0,
        "zero_effective_lines": 0,
        "one_to_three_effective_lines": 0,
        "four_to_five_effective_lines": 0,
        "six_plus_effective_lines": 0,
    }
    for path in iter_lyric_files(library_dir):
        buckets["total"] += 1
        normalized = normalize_lyrics(read_lyric_file(path))
        line_count = len(normalized.primary_lines or normalized.unique_lines)
        if line_count == 0:
            buckets["zero_effective_lines"] += 1
        elif line_count <= 3:
            buckets["one_to_three_effective_lines"] += 1
        elif line_count <= 5:
            buckets["four_to_five_effective_lines"] += 1
        else:
            buckets["six_plus_effective_lines"] += 1
    return buckets


if __name__ == "__main__":
    main()