Add lyric duplicate detection workflow
0 parents
Showing 12 changed files with 2591 additions and 0 deletions
.gitignore
0 → 100644
| 1 | .DS_Store | ||
| 2 | __pycache__/ | ||
| 3 | *.py[cod] | ||
| 4 | .pytest_cache/ | ||
| 5 | |||
| 6 | # Local lyric data and generated artifacts | ||
| 7 | data/ | ||
| 8 | outputs/ | ||
| 9 | downloaded_lyrics/ | ||
| 10 | downloaded_lyrics_type3/ | ||
| 11 | download_failed*.txt | ||
| 12 | |||
| 13 | # Local downloader / scratch utilities | ||
| 14 | download_lyrics.py | ||
| 15 | test_db_connection.py | ||
| 16 | *.env | ||
| 17 | |||
| 18 | # Reference project kept locally only | ||
| 19 | text-dedup-main/ | ||
| 20 | |||
| 21 | # Virtual environments and editor files | ||
| 22 | .venv/ | ||
| 23 | venv/ | ||
| 24 | .idea/ | ||
| 25 | .vscode/ |
README.md
0 → 100644
| 1 | # Lyric Duplicate Checker | ||
| 2 | |||
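| 3 | The first version handles duplicate checking for newly added lyrics: build an index from the existing `.lrc` / `.txt` lyrics, then query each new lyric against it. The result is `duplicate`, `review`, or `new`. | ||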
| 4 | |||
| 5 | ## Build the index | ||
| 6 | |||
| 7 | Assuming the existing library lives in `data/library/`: | ||
| 8 | |||
| 9 | ```bash | ||
| 10 | python -m lyric_dedup.cli build-index \ | ||
| 11 | --lyrics-dir data/library \ | ||
| 12 | --index outputs/indexes/lyrics.pkl | ||
| 13 | ``` | ||
| 14 | |||
| 15 | ## Check a single new lyric file | ||
| 16 | |||
| 17 | ```bash | ||
| 18 | python -m lyric_dedup.cli check-file \ | ||
| 19 | --index outputs/indexes/lyrics.pkl \ | ||
| 20 | --file data/incoming/new_song.lrc | ||
| 21 | ``` | ||
| 22 | |||
| 23 | ## Batch-check an incoming directory | ||
| 24 | |||
| 25 | ```bash | ||
| 26 | python -m lyric_dedup.cli batch-check \ | ||
| 27 | --index outputs/indexes/lyrics.pkl \ | ||
| 28 | --lyrics-dir data/incoming \ | ||
| 29 | --out outputs/results/incoming_check.csv | ||
| 30 | ``` | ||
| 31 | |||
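| 32 | Key columns in the CSV: | ||
| 33 | |||
| 34 | - `decision`: the overall verdict. | ||
| 35 | - `best_candidate_id`: the most similar existing lyric. | ||
| 36 | - `best_candidate_jaccard`: n-gram literal similarity. | ||
| 37 | - `best_candidate_line_coverage`: line-level coverage. | ||
| 38 | - `matched_unique_lines`: the normalized lyric lines that matched. | ||
| 39 | - `best_candidate_reason`: the reason for the verdict (written in Chinese), explaining why the lyric was judged duplicate, review, or new. | ||
| 40 | |||
| 41 | Recommended production handling: block `duplicate` automatically, send `review` to a manual review pool, and still spot-check `new` before ingestion. | ||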
| 42 | |||
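The two similarity columns above can be illustrated with a minimal sketch. It assumes character trigrams as the n-gram unit and defines line coverage as the share of unique query lines found in the candidate; both choices, and the helper names `char_ngrams` and `line_coverage`, are illustrative assumptions rather than the package's actual tokenizer or coverage function. Only the empty-set handling in `jaccard` mirrors `_jaccard` in `lyric_dedup/checker.py`.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def jaccard(left: set[str], right: set[str]) -> float:
    """Intersection size over union size, with the same empty-set rules as _jaccard."""
    if not left and not right:
        return 1.0
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def line_coverage(query_lines: list[str], candidate_lines: list[str]) -> float:
    """Share of unique query lines that also appear in the candidate (assumed definition)."""
    unique_query = set(query_lines)
    if not unique_query:
        return 0.0
    return len(unique_query & set(candidate_lines)) / len(unique_query)
```

A high Jaccard score with low line coverage usually means many shared short fragments but few whole lines in common, which is why the two signals are reported separately.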
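| 43 | ## Safeguards for lyrics with original text plus Chinese translation | ||
| 44 | |||
| 45 | Lyric lines are currently split into three classes: | ||
| 46 | |||
| 47 | - `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these. | ||
| 48 | - `translation_lines`: Chinese translation lines; used only for recall and for explaining review decisions. | ||
| 49 | - `unknown_lines`: lines that cannot be classified reliably. | ||
| 50 | |||
| 51 | High-confidence splits: | ||
| 52 | |||
| 53 | - A foreign-language line and a Chinese line share the same timestamp. | ||
| 54 | - Several stable groups of a foreign-language line followed by a Chinese line, alternating. | ||
| 55 | |||
| 56 | Medium-confidence splits: | ||
| 57 | |||
| 58 | - An obvious foreign-language / Chinese translation pair on a single line, for example `I miss you / 今晚我想你`. | ||
| 59 | |||
| 60 | Low-confidence splits: | ||
| 61 | |||
| 62 | - A whole block of foreign-language text followed by a whole block of Chinese translation. | ||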
| 63 | |||
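As a rough illustration of the shared-timestamp case, the sketch below groups LRC lines by timestamp and treats a group of exactly one Chinese line and one non-Chinese line as an original/translation pair. The function name, the regular expressions, and the "any CJK ideograph means Chinese" test are assumptions made for this example; the real splitter lives in `lyric_dedup.normalization` and is not shown in this part of the diff.

```python
import re
from collections import defaultdict

# LRC timestamps such as [00:12] or [00:12.30], followed by the line text.
TIMESTAMP = re.compile(r"^\[(\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\](.*)$")
# Basic CJK Unified Ideographs block; a rough stand-in for "Chinese line".
CJK = re.compile(r"[\u4e00-\u9fff]")


def split_by_shared_timestamp(lrc_text: str) -> tuple[list[str], list[str], list[str]]:
    """Return (primary, translation, unknown) lines from LRC text."""
    by_stamp: dict[str, list[str]] = defaultdict(list)
    unknown: list[str] = []
    for raw in lrc_text.splitlines():
        stripped = raw.strip()
        match = TIMESTAMP.match(stripped)
        if not match:
            if stripped:
                unknown.append(stripped)  # no timestamp: cannot be paired
            continue
        text = match.group(2).strip()
        if text:
            by_stamp[match.group(1)].append(text)

    primary: list[str] = []
    translation: list[str] = []
    for texts in by_stamp.values():
        chinese = [t for t in texts if CJK.search(t)]
        foreign = [t for t in texts if not CJK.search(t)]
        if len(texts) == 2 and len(chinese) == 1 and len(foreign) == 1:
            primary.append(foreign[0])
            translation.append(chinese[0])
        else:
            # No clear pairing: treat every line as original text.
            primary.extend(texts)
    return primary, translation, unknown
```

One known limitation of this heuristic is that Japanese lyrics containing kanji would be misread as Chinese, so a production splitter needs a stronger language test.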
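| 64 | Decision policy: | ||
| 65 | |||
| 66 | - If the original text is highly consistent, the lyric can be `duplicate` even when the new lyric adds a Chinese translation. | ||
| 67 | - If only the translation lines are similar and the original text is not similar enough, the result is `review` at most; it is never an automatic duplicate. | ||
| 68 | - A suspected whole-block translation layout is a low-confidence split, so the result is `review` even when the original-text hash matches. | ||
| 69 | - For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text. | ||
| 70 | |||
| 71 | The index stores the original/translation features produced by the split, so rebuild the index after changing the split rules. | ||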
| 72 | |||
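The policy can be condensed into a small decision function. This is a simplified sketch of `_rank_candidate` in `lyric_dedup/checker.py` using the default thresholds from `DuplicateChecker.__init__`; it leaves out the chorus-only check and the whole-text similarity signals, and the standalone `decide` function itself is hypothetical.

```python
def decide(
    primary_jaccard: float,
    primary_coverage: float,
    translation_jaccard: float = 0.0,
    translation_coverage: float = 0.0,
    split_confidence: str = "high",
) -> str:
    """Return "duplicate", "review", or "new" for one candidate."""
    dup_jaccard, dup_coverage = 0.78, 0.72  # default duplicate thresholds
    rev_jaccard, rev_coverage = 0.45, 0.35  # default review thresholds

    # Only the translation looks similar; the original text does not.
    translation_only = (
        primary_jaccard < rev_jaccard
        and primary_coverage < rev_coverage
        and (translation_jaccard >= rev_jaccard or translation_coverage >= rev_coverage)
    )
    low_confidence = split_confidence == "low"

    if (
        primary_jaccard >= dup_jaccard
        and primary_coverage >= dup_coverage
        and not translation_only
        and not low_confidence
    ):
        return "duplicate"
    if (
        translation_only
        or low_confidence
        or primary_jaccard >= rev_jaccard
        or primary_coverage >= rev_coverage
    ):
        return "review"
    return "new"
```

For example, `decide(1.0, 1.0, split_confidence="low")` yields `review`, matching the rule that a low-confidence split is never blocked automatically.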
| 73 | ## Evaluating accuracy with a labeled CSV | ||
| 74 | |||
| 75 | You can start by generating a batch of evaluation samples from the existing library: | ||
| 76 | |||
| 77 | ```bash | ||
| 78 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 79 | --library-dir data/library \ | ||
| 80 | --lyrics-dir data/generated_eval/incoming \ | ||
| 81 | --csv data/generated_eval/eval_10.csv \ | ||
| 82 | --size 10 \ | ||
| 83 | --positive-ratio 0.6 | ||
| 84 | ``` | ||
| 85 | |||
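| 86 | Business rules used by the generator: | ||
| 87 | |||
| 88 | - `应去重` (should deduplicate) samples are style variations of the full lyric only, for example timestamps, punctuation, platform noise, blank lines, LRC styling, or an appended Chinese translation. | ||
| 89 | - `不应去重` (should not deduplicate) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity. | ||
| 90 | - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`. | ||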
| 91 | |||
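To see why style variations count as the same lyric, the sketch below normalizes two variants of one lyric to identical lines. The timestamp pattern is copied from `_fallback_no_lyrics_lines` in `lyric_dedup/checker.py`; the rest is a simplified assumption and not the package's `normalize_lyrics`.

```python
import re
import unicodedata

# LRC timestamp pattern, taken from _fallback_no_lyrics_lines in checker.py.
TIMESTAMP = re.compile(r"\[(?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]")
# Anything that is neither a word character nor whitespace counts as punctuation here.
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(text: str) -> list[str]:
    """Strip timestamps, punctuation, and blank lines; fold width and case."""
    lines = []
    for raw in unicodedata.normalize("NFKC", text).splitlines():
        line = TIMESTAMP.sub("", raw).lower()
        line = PUNCTUATION.sub(" ", line)
        line = " ".join(line.split())  # collapse runs of whitespace
        if line:
            lines.append(line)
    return lines


plain = "我爱你在每个夜里\n听风说话也听见你"
lrc = "[00:12.30]我爱你在每个夜里,\n\n[00:15.80]听风说话也听见你。"
print(normalize(plain) == normalize(lrc))
```

Because punctuation is replaced by a space rather than removed, punctuation in the middle of a Chinese line would leave a gap; this sketch does not handle that case.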
| 92 | First prepare a CSV, for example `data/eval/eval.csv`: | ||
| 93 | |||
| 94 | ```csv | ||
| 95 | id,file,expected | ||
| 96 | case-001,incoming/song_a.lrc,应去重 | ||
| 97 | case-002,incoming/song_b.txt,不应去重 | ||
| 98 | ``` | ||
| 99 | |||
| 100 | Alternatively, skip file paths and put the lyrics directly in a `lyrics` column: | ||
| 101 | |||
| 102 | ```csv | ||
| 103 | id,lyrics,expected | ||
| 104 | case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate | ||
| 105 | case-004,"南方的雨穿过街心\n你把故事说给云听",new | ||
| 106 | ``` | ||
| 107 | |||
| 108 | `expected` accepts these values: | ||
| 109 | |||
| 110 | - Should deduplicate: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes` | ||
| 111 | - Should not deduplicate: `不应去重`, `不重复`, `new`, `0`, `false`, `no` | ||
| 112 | |||
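A label parser for the accepted values might look like the sketch below. The two value sets come from the list above; trimming whitespace, ignoring case, and raising `ValueError` on anything else are assumptions about behavior that the README does not specify, and `parse_expected` is a hypothetical name.

```python
POSITIVE_LABELS = {"应去重", "重复", "duplicate", "1", "true", "yes"}
NEGATIVE_LABELS = {"不应去重", "不重复", "new", "0", "false", "no"}


def parse_expected(value: str) -> bool:
    """Map an `expected` cell to True (should deduplicate) or False."""
    key = value.strip().lower()  # assumption: tolerate case and padding
    if key in POSITIVE_LABELS:
        return True
    if key in NEGATIVE_LABELS:
        return False
    raise ValueError(f"unsupported expected label: {value!r}")
```

Failing loudly on unknown labels keeps a typo in the CSV from silently counting as a negative sample.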
| 113 | Run the evaluation: | ||
| 114 | |||
| 115 | ```bash | ||
| 116 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 117 | --index outputs/indexes/lyrics.pkl \ | ||
| 118 | --csv data/eval/eval.csv \ | ||
| 119 | --base-dir data \ | ||
| 120 | --out outputs/results/eval_result.csv | ||
| 121 | ``` | ||
| 122 | |||
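| 123 | By default, only a system output of `duplicate` counts as "predicted should deduplicate". This suits measuring the precision of automatic blocking, because false blocks show up more clearly. | ||
| 124 | |||
| 125 | To evaluate recall of suspicious samples instead, where both `duplicate` and `review` count as hits: | ||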
| 126 | |||
| 127 | ```bash | ||
| 128 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 129 | --index outputs/indexes/lyrics.pkl \ | ||
| 130 | --csv data/eval/eval.csv \ | ||
| 131 | --base-dir data \ | ||
| 132 | --positive-decisions duplicate,review \ | ||
| 133 | --out outputs/results/eval_result_review_as_positive.csv | ||
| 134 | ``` | ||
| 135 | |||
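| 136 | Two files are generated: | ||
| 137 | |||
| 138 | - `outputs/results/eval_result.csv`: per-sample prediction, candidate, reason, and correctness. | ||
| 139 | - `outputs/results/eval_result.csv.summary.json`: overall metrics. | ||
| 140 | |||
| 141 | Key fields in the summary: | ||
| 142 | |||
| 143 | - `accuracy`: overall accuracy. | ||
| 144 | - `precision`: of the samples predicted as should-deduplicate, the share that truly should be. Prioritize this for automatic blocking. | ||
| 145 | - `recall`: of the samples that truly should be deduplicated, the share the system caught. | ||
| 146 | - `f1`: the harmonic mean of precision and recall. | ||
| 147 | - `false_positive`: should not deduplicate but was predicted as should-deduplicate (a false block). | ||
| 148 | - `false_negative`: should deduplicate but was missed. | ||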
TEST_WORKFLOW.md
0 → 100644
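| 1 | # Lyric duplicate-check test workflow | ||
| 2 | |||
| 3 | This document records the full set of commands for building an index from an existing lyrics directory, generating a test set, running batch evaluation, and inspecting the results. | ||
| 4 | |||
| 5 | ## 1. Prepare directories | ||
| 6 | |||
| 7 | The existing library goes in: | ||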
| 8 | |||
| 9 | ```text | ||
| 10 | data/library/ | ||
| 11 | ``` | ||
| 12 | |||
| 13 | Supported file types: | ||
| 14 | |||
| 15 | ```text | ||
| 16 | .lrc | ||
| 17 | .txt | ||
| 18 | ``` | ||
| 19 | |||
| 20 | Generated test samples are written to: | ||
| 21 | |||
| 22 | ```text | ||
| 23 | data/generated_eval/incoming/ | ||
| 24 | ``` | ||
| 25 | |||
| 26 | The labeled test-set CSV is written to: | ||
| 27 | |||
| 28 | ```text | ||
| 29 | data/generated_eval/eval_100.csv | ||
| 30 | ``` | ||
| 31 | |||
| 32 | Evaluation results are written to: | ||
| 33 | |||
| 34 | ```text | ||
| 35 | outputs/results/ | ||
| 36 | ``` | ||
| 37 | |||
| 38 | ## 2. Build the index for the existing library | ||
| 39 | |||
| 40 | If you have just added a batch of samples to `data/library`, run the processing script first: | ||
| 41 | |||
| 42 | ```bash | ||
| 43 | python scripts/process_library.py \ | ||
| 44 | --library-dir data/library \ | ||
| 45 | --index outputs/indexes/library_lyrics.pkl | ||
| 46 | ``` | ||
| 47 | |||
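| 48 | The script will: | ||
| 49 | |||
| 50 | ```text | ||
| 51 | 1. Scan for and quarantine instrumental placeholder samples, for example files containing 【曲库专用】 ("library use only") or “此歌曲为没有填词的纯音乐” ("this song is instrumental music without lyrics"). | ||
| 52 | 2. Rebuild outputs/indexes/library_lyrics.pkl. | ||
| 53 | 3. Write the processing report outputs/results/library_process_report.json. | ||
| 54 | ``` | ||
| 55 | |||
| 56 | To preview which files would be processed without moving anything or rebuilding the index: | ||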
| 57 | |||
| 58 | ```bash | ||
| 59 | python scripts/process_library.py \ | ||
| 60 | --library-dir data/library \ | ||
| 61 | --dry-run | ||
| 62 | ``` | ||
| 63 | |||
| 64 | To also generate and evaluate 1180 test samples in the same run: | ||
| 65 | |||
| 66 | ```bash | ||
| 67 | python scripts/process_library.py \ | ||
| 68 | --library-dir data/library \ | ||
| 69 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 70 | --eval-size 1180 \ | ||
| 71 | --positive-ratio 0.2 \ | ||
| 72 | --eval-csv data/generated_eval/eval_1180.csv \ | ||
| 73 | --eval-out outputs/results/library_eval_1180.csv | ||
| 74 | ``` | ||
| 75 | |||
| 76 | Quarantined files are moved by default to: | ||
| 77 | |||
| 78 | ```text | ||
| 79 | data/quarantine/no_lyrics_placeholders/ | ||
| 80 | ``` | ||
| 81 | |||
| 82 | You can also build the index manually on its own: | ||
| 83 | |||
| 84 | ```bash | ||
| 85 | python -m lyric_dedup.cli build-index \ | ||
| 86 | --lyrics-dir data/library \ | ||
| 87 | --index outputs/indexes/library_lyrics.pkl | ||
| 88 | ``` | ||
| 89 | |||
| 90 | Index file: | ||
| 91 | |||
| 92 | ```text | ||
| 93 | outputs/indexes/library_lyrics.pkl | ||
| 94 | ``` | ||
| 95 | |||
| 96 | Note: rerun this step whenever `data/library` changes or the preprocessing / duplicate-detection logic changes. | ||
| 97 | |||
| 98 | ## 3. Generate 100 test samples | ||
| 99 | |||
| 100 | ```bash | ||
| 101 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 102 | --library-dir data/library \ | ||
| 103 | --lyrics-dir data/generated_eval/incoming \ | ||
| 104 | --csv data/generated_eval/eval_100.csv \ | ||
| 105 | --size 100 \ | ||
| 106 | --positive-ratio 0.6 | ||
| 107 | ``` | ||
| 108 | |||
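| 109 | By default this generates: | ||
| 110 | |||
| 111 | ```text | ||
| 112 | 应去重 (should deduplicate): 60 | ||
| 113 | 不应去重 (should not deduplicate): 40 | ||
| 114 | ``` | ||
| 115 | |||
| 116 | The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. | ||
| 117 | |||
| 118 | Business rules: | ||
| 119 | |||
| 120 | ```text | ||
| 121 | pos_* = should deduplicate: full-lyric style variations | ||
| 122 | neg_* = should not deduplicate: fragments / short-phrase collisions / mixed fragments / new lyrics on the same theme / translation-only similarity | ||
| 123 | ``` | ||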
| 123 | ``` | ||
| 124 | |||
| 125 | ## 4. Strict evaluation: only duplicate counts as deduplicated | ||
| 126 | |||
| 127 | ```bash | ||
| 128 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 129 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 130 | --csv data/generated_eval/eval_100.csv \ | ||
| 131 | --base-dir data/generated_eval \ | ||
| 132 | --out outputs/results/library_eval_100.csv | ||
| 133 | ``` | ||
| 134 | |||
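| 135 | Under this convention: | ||
| 136 | |||
| 137 | ```text | ||
| 138 | duplicate -> predicted should deduplicate | ||
| 139 | review -> predicted should not deduplicate | ||
| 140 | new -> predicted should not deduplicate | ||
| 141 | ``` | ||
| 142 | |||
| 143 | This suits evaluating the precision of automatic blocking. Focus on: | ||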
| 144 | |||
| 145 | ```text | ||
| 146 | false_positive | ||
| 147 | ``` | ||
| 148 | |||
| 149 | ## 5. Recall evaluation: both duplicate and review count as catching a suspicious sample | ||
| 150 | |||
| 151 | ```bash | ||
| 152 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 153 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 154 | --csv data/generated_eval/eval_100.csv \ | ||
| 155 | --base-dir data/generated_eval \ | ||
| 156 | --positive-decisions duplicate,review \ | ||
| 157 | --out outputs/results/library_eval_100_review_positive.csv | ||
| 158 | ``` | ||
| 159 | |||
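| 160 | Under this convention: | ||
| 161 | |||
| 162 | ```text | ||
| 163 | duplicate -> predicted should deduplicate | ||
| 164 | review -> predicted should deduplicate | ||
| 165 | new -> predicted should not deduplicate | ||
| 166 | ``` | ||
| 167 | |||
| 168 | This suits evaluating recall of suspicious samples. Focus on: | ||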
| 169 | |||
| 170 | ```text | ||
| 171 | false_negative | ||
| 172 | ``` | ||
| 173 | |||
| 174 | ## 6. View overall metrics | ||
| 175 | |||
| 176 | Strict convention: | ||
| 177 | |||
| 178 | ```bash | ||
| 179 | cat outputs/results/library_eval_100.csv.summary.json | ||
| 180 | ``` | ||
| 181 | |||
| 182 | Recall convention: | ||
| 183 | |||
| 184 | ```bash | ||
| 185 | cat outputs/results/library_eval_100_review_positive.csv.summary.json | ||
| 186 | ``` | ||
| 187 | |||
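| 188 | Metric definitions: | ||
| 189 | |||
| 190 | ```text | ||
| 191 | accuracy overall accuracy | ||
| 192 | precision of the samples predicted as should-deduplicate, the share that truly should be | ||
| 193 | recall of the samples that truly should be deduplicated, the share the system caught | ||
| 194 | f1 harmonic mean of precision and recall | ||
| 195 | true_positive should deduplicate and predicted should deduplicate | ||
| 196 | false_positive should not deduplicate but predicted should deduplicate (false block) | ||
| 197 | true_negative should not deduplicate and predicted should not deduplicate | ||
| 198 | false_negative should deduplicate but predicted should not deduplicate (miss) | ||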
| 199 | ``` | ||
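These metrics follow directly from the four confusion counts. The sketch below uses the standard definitions; returning 0.0 when a denominator is zero is an assumption, since the summary file's handling of that case is not documented here, and `summarize` is a hypothetical helper.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy counts for illustration only; these are not real evaluation results.
print(summarize(tp=3, fp=1, tn=5, fn=1))
```

With these toy counts accuracy is 0.8 and precision, recall, and F1 are all 0.75.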
| 200 | |||
| 201 | ## 7. View per-sample results | ||
| 202 | |||
| 203 | ```bash | ||
| 204 | open outputs/results/library_eval_100.csv | ||
| 205 | ``` | ||
| 206 | |||
| 207 | If `open` is unavailable (it is a macOS command), print the CSV directly: | ||
| 208 | |||
| 209 | ```bash | ||
| 210 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]' | ||
| 211 | ``` | ||
| 212 | |||
| 213 | ## 8. View failed samples | ||
| 214 | |||
| 215 | Failed samples under the strict convention: | ||
| 216 | |||
| 217 | ```bash | ||
| 218 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]' | ||
| 219 | ``` | ||
| 220 | |||
| 221 | View the full candidate list for one sample: | ||
| 222 | |||
| 223 | ```bash | ||
| 224 | python -m lyric_dedup.cli check-file \ | ||
| 225 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 226 | --file data/generated_eval/incoming/neg_068_mixed_fragments.txt \ | ||
| 227 | --max-candidates 10 | ||
| 228 | ``` | ||
| 229 | |||
| 230 | ## 9. Verify the test-set distribution | ||
| 231 | |||
| 232 | ```bash | ||
| 233 | python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_100.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))' | ||
| 234 | ``` | ||
| 235 | |||
| 236 | Verify the number of files in the generated directory: | ||
| 237 | |||
| 238 | ```bash | ||
| 239 | find data/generated_eval/incoming -type f | wc -l | ||
| 240 | ``` | ||
| 241 | |||
| 242 | ## 10. Run the code tests | ||
| 243 | |||
| 244 | ```bash | ||
| 245 | python -m pytest tests | ||
| 246 | ``` | ||
| 247 | |||
| 248 | Compile check: | ||
| 249 | |||
| 250 | ```bash | ||
| 251 | python -m compileall -q lyric_dedup tests | ||
| 252 | ``` | ||
| 253 | |||
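| 254 | ## 11. About duplicates within the test set | ||
| 255 | |||
| 256 | The 100 automatically generated samples are a rule-coverage test set; they are not guaranteed to be distinct from one another after normalization. | ||
| 257 | |||
| 258 | If the 100 test samples must be mutually distinct while keeping the default ratio: | ||
| 259 | |||
| 260 | ```text | ||
| 261 | size = 100 | ||
| 262 | positive_ratio = 0.6 | ||
| 263 | ``` | ||
| 264 | |||
| 265 | then you need at least: | ||
| 266 | |||
| 267 | ```text | ||
| 268 | 60 mutually distinct seed lyrics | ||
| 269 | ``` | ||
| 270 | |||
| 271 | Reason: should-deduplicate samples are full-lyric variants, so multiple style variations of the same song still normalize to the same song. | ||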
| 272 | |||
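The seed count above is the number of positive samples, since each distinct positive sample needs its own seed lyric. A one-line sketch of that arithmetic follows; rounding to the nearest integer is an assumption about how the generator turns the ratio into a count, and `required_unique_seeds` is a hypothetical helper.

```python
def required_unique_seeds(size: int, positive_ratio: float) -> int:
    """Minimum distinct seed lyrics so that no two positive samples share a seed."""
    return round(size * positive_ratio)


print(required_unique_seeds(100, 0.6))  # 60
```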
| 273 | A more reliable way to measure real-world accuracy is to prepare a manually labeled CSV: | ||
| 274 | |||
| 275 | ```csv | ||
| 276 | id,file,expected | ||
| 277 | case-001,incoming/song_a.lrc,应去重 | ||
| 278 | case-002,incoming/song_b.txt,不应去重 | ||
| 279 | ``` | ||
| 280 | |||
| 281 | Then run `evaluate-csv` from section 4 or section 5 directly. | ||
lyric_dedup/__init__.py
0 → 100644
| 1 | """Lyric duplicate detection utilities.""" | ||
| 2 | |||
| 3 | from lyric_dedup.checker import DuplicateCheckResult | ||
| 4 | from lyric_dedup.checker import DuplicateChecker | ||
| 5 | from lyric_dedup.checker import DuplicateDecision | ||
| 6 | from lyric_dedup.checker import LyricRecord | ||
| 7 | |||
| 8 | __all__ = [ | ||
| 9 | "DuplicateCheckResult", | ||
| 10 | "DuplicateChecker", | ||
| 11 | "DuplicateDecision", | ||
| 12 | "LyricRecord", | ||
| 13 | ] |
lyric_dedup/checker.py
0 → 100644
| 1 | """Incremental lyric duplicate checker.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | import pickle | ||
| 7 | from dataclasses import dataclass | ||
| 8 | from enum import StrEnum | ||
| 9 | from pathlib import Path | ||
| 10 | |||
| 11 | from lyric_dedup.minhash_lsh import MinHashConfig | ||
| 12 | from lyric_dedup.minhash_lsh import MinHashLSH | ||
| 13 | from lyric_dedup.normalization import NormalizedLyrics | ||
| 14 | from lyric_dedup.normalization import fingerprint_text | ||
| 15 | from lyric_dedup.normalization import lyric_tokens | ||
| 16 | from lyric_dedup.normalization import normalize_lyrics | ||
| 17 | |||
| 18 | |||
| 19 | class DuplicateDecision(StrEnum): | ||
| 20 | DUPLICATE = "duplicate" | ||
| 21 | REVIEW = "review" | ||
| 22 | NEW = "new" | ||
| 23 | |||
| 24 | |||
| 25 | @dataclass(frozen=True) | ||
| 26 | class LyricRecord: | ||
| 27 | record_id: str | ||
| 28 | lyrics: str | ||
| 29 | title: str | None = None | ||
| 30 | artist: str | None = None | ||
| 31 | |||
| 32 | |||
| 33 | @dataclass(frozen=True) | ||
| 34 | class CandidateMatch: | ||
| 35 | record_id: str | ||
| 36 | decision: DuplicateDecision | ||
| 37 | confidence: float | ||
| 38 | jaccard: float | ||
| 39 | line_coverage: float | ||
| 40 | primary_jaccard: float | ||
| 41 | primary_line_coverage: float | ||
| 42 | translation_jaccard: float | ||
| 43 | translation_line_coverage: float | ||
| 44 | matched_unique_lines: tuple[str, ...] | ||
| 45 | reason: str | ||
| 46 | |||
| 47 | |||
| 48 | @dataclass(frozen=True) | ||
| 49 | class DuplicateCheckResult: | ||
| 50 | decision: DuplicateDecision | ||
| 51 | confidence: float | ||
| 52 | candidates: tuple[CandidateMatch, ...] | ||
| 53 | normalized_full_text: str | ||
| 54 | reason: str | ||
| 55 | |||
| 56 | |||
| 57 | @dataclass(frozen=True) | ||
| 58 | class _IndexedRecord: | ||
| 59 | record: LyricRecord | ||
| 60 | normalized: NormalizedLyrics | ||
| 61 | exact_hash: str | ||
| 62 | tokens: set[str] | ||
| 63 | primary_tokens: set[str] | ||
| 64 | translation_tokens: set[str] | ||
| 65 | fallback_lines: tuple[str, ...] | ||
| 66 | fallback_tokens: set[str] | ||
| 67 | signature: tuple[int, ...] | ||
| 68 | |||
| 69 | |||
| 70 | class DuplicateChecker: | ||
| 71 | """In-memory first version for checking newly submitted lyrics. | ||
| 72 | |||
| 73 | The API is intentionally small: add records with ``add_record``, then call | ||
| 74 | ``check`` for a new lyric. ``save`` and ``load`` persist the whole checker | ||
| 75 | with pickle, so rebuild the index after changing the indexed fields. | ||
| 76 | """ | ||
| 77 | |||
| 78 | def __init__( | ||
| 79 | self, | ||
| 80 | *, | ||
| 81 | minhash_config: MinHashConfig | None = None, | ||
| 82 | duplicate_jaccard_threshold: float = 0.78, | ||
| 83 | duplicate_line_coverage_threshold: float = 0.72, | ||
| 84 | review_jaccard_threshold: float = 0.45, | ||
| 85 | review_line_coverage_threshold: float = 0.35, | ||
| 86 | ) -> None: | ||
| 87 | self._lsh = MinHashLSH(minhash_config) | ||
| 88 | self._records: dict[str, _IndexedRecord] = {} | ||
| 89 | self._exact_hash_to_ids: dict[str, set[str]] = {} | ||
| 90 | self._line_to_ids: dict[str, set[str]] = {} | ||
| 91 | self._token_to_ids: dict[str, set[str]] = {} | ||
| 92 | self.duplicate_jaccard_threshold = duplicate_jaccard_threshold | ||
| 93 | self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold | ||
| 94 | self.review_jaccard_threshold = review_jaccard_threshold | ||
| 95 | self.review_line_coverage_threshold = review_line_coverage_threshold | ||
| 96 | |||
| 97 | def add_record(self, record: LyricRecord) -> None: | ||
| 98 | indexed = self._index(record) | ||
| 99 | self._records[record.record_id] = indexed | ||
| 100 | self._exact_hash_to_ids.setdefault(indexed.exact_hash, set()).add(record.record_id) | ||
| 101 | for line in indexed.normalized.unique_lines: | ||
| 102 | if len(line) >= 4: | ||
| 103 | self._line_to_ids.setdefault(line, set()).add(record.record_id) | ||
| 104 | for token in indexed.tokens: | ||
| 105 | self._token_to_ids.setdefault(token, set()).add(record.record_id) | ||
| 106 | for token in indexed.fallback_tokens: | ||
| 107 | self._token_to_ids.setdefault(token, set()).add(record.record_id) | ||
| 108 | self._lsh.add(record.record_id, indexed.signature) | ||
| 109 | |||
| 110 | def save(self, path: str | Path) -> None: | ||
| 111 | """Persist the in-memory index for later checks.""" | ||
| 112 | with Path(path).open("wb") as file: | ||
| 113 | pickle.dump(self, file, protocol=pickle.HIGHEST_PROTOCOL) | ||
| 114 | |||
| 115 | @classmethod | ||
| 116 | def load(cls, path: str | Path) -> "DuplicateChecker": | ||
| 117 | """Load a previously persisted index. Only load trusted files: unpickling can execute arbitrary code.""" | ||
| 118 | with Path(path).open("rb") as file: | ||
| 119 | checker = pickle.load(file) | ||
| 120 | if not isinstance(checker, cls): | ||
| 121 | raise TypeError(f"{path} does not contain a DuplicateChecker index") | ||
| 122 | return checker | ||
| 123 | |||
| 124 | @property | ||
| 125 | def record_count(self) -> int: | ||
| 126 | return len(self._records) | ||
| 127 | |||
| 128 | def check(self, lyrics: str, *, max_candidates: int = 10) -> DuplicateCheckResult: | ||
| 129 | return self.check_record(LyricRecord(record_id="__query__", lyrics=lyrics), max_candidates=max_candidates) | ||
| 130 | |||
| 131 | def check_record(self, record: LyricRecord, *, max_candidates: int = 10) -> DuplicateCheckResult: | ||
| 132 | query = self._index(record) | ||
| 133 | exact_ids = self._exact_hash_to_ids.get(query.exact_hash, set()) | ||
| 134 | if exact_ids: | ||
| 135 | candidates = tuple(self._rank_exact_candidate(query, self._records[record_id]) for record_id in sorted(exact_ids)[:max_candidates]) | ||
| 136 | duplicate = next((candidate for candidate in candidates if candidate.decision == DuplicateDecision.DUPLICATE), None) | ||
| 137 | if duplicate is not None: | ||
| 138 | return DuplicateCheckResult( | ||
| 139 | decision=DuplicateDecision.DUPLICATE, | ||
| 140 | confidence=duplicate.confidence, | ||
| 141 | candidates=candidates, | ||
| 142 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 143 | reason=duplicate.reason, | ||
| 144 | ) | ||
| 145 | return DuplicateCheckResult( | ||
| 146 | decision=DuplicateDecision.REVIEW, | ||
| 147 | confidence=candidates[0].confidence, | ||
| 148 | candidates=candidates, | ||
| 149 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 150 | reason=candidates[0].reason, | ||
| 151 | ) | ||
| 152 | |||
| 153 | candidate_ids = self._recall_candidates(query) | ||
| 154 | ranked = sorted( | ||
| 155 | (self._rank_candidate(query, self._records[record_id]) for record_id in candidate_ids), | ||
| 156 | key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard), | ||
| 157 | reverse=True, | ||
| 158 | )[:max_candidates] | ||
| 159 | |||
| 160 | duplicate = next((candidate for candidate in ranked if candidate.decision == DuplicateDecision.DUPLICATE), None) | ||
| 161 | if duplicate is not None: | ||
| 162 | return DuplicateCheckResult( | ||
| 163 | decision=DuplicateDecision.DUPLICATE, | ||
| 164 | confidence=duplicate.confidence, | ||
| 165 | candidates=tuple(ranked), | ||
| 166 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 167 | reason=duplicate.reason, | ||
| 168 | ) | ||
| 169 | |||
| 170 | review = next((candidate for candidate in ranked if candidate.decision == DuplicateDecision.REVIEW), None) | ||
| 171 | if review is not None: | ||
| 172 | return DuplicateCheckResult( | ||
| 173 | decision=DuplicateDecision.REVIEW, | ||
| 174 | confidence=review.confidence, | ||
| 175 | candidates=tuple(ranked), | ||
| 176 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 177 | reason=review.reason, | ||
| 178 | ) | ||
| 179 | |||
| 180 | return DuplicateCheckResult( | ||
| 181 | decision=DuplicateDecision.NEW, | ||
| 182 | confidence=1.0 - (ranked[0].confidence if ranked else 0.0), | ||
| 183 | candidates=tuple(ranked), | ||
| 184 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 185 | reason="精确匹配、近重复召回和字面重合信号都较低", | ||
| 186 | ) | ||
| 187 | |||
| 188 | def _index(self, record: LyricRecord) -> _IndexedRecord: | ||
| 189 | normalized = normalize_lyrics(record.lyrics) | ||
| 190 | tokens = lyric_tokens(normalized) | ||
| 191 | primary_tokens = lyric_tokens(normalized, lines=normalized.primary_lines) | ||
| 192 | translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines) | ||
| 193 | fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics)) | ||
| 194 | fallback_tokens = set(fallback_lines) | ||
| 195 | signature = self._lsh.signature(primary_tokens or tokens or fallback_tokens) | ||
| 196 | exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest() | ||
| 197 | return _IndexedRecord( | ||
| 198 | record=record, | ||
| 199 | normalized=normalized, | ||
| 200 | exact_hash=exact_hash, | ||
| 201 | tokens=tokens, | ||
| 202 | primary_tokens=primary_tokens, | ||
| 203 | translation_tokens=translation_tokens, | ||
| 204 | fallback_lines=fallback_lines, | ||
| 205 | fallback_tokens=fallback_tokens, | ||
| 206 | signature=signature, | ||
| 207 | ) | ||
| 208 | |||
| 209 | def _recall_candidates(self, query: _IndexedRecord) -> set[str]: | ||
| 210 | candidate_ids = self._lsh.query(query.signature) | ||
| 211 | for line in query.normalized.primary_lines: | ||
| 212 | if len(line) >= 4: | ||
| 213 | candidate_ids.update(self._line_to_ids.get(line, set())) | ||
| 214 | for line in query.normalized.translation_lines: | ||
| 215 | if len(line) >= 4: | ||
| 216 | candidate_ids.update(self._line_to_ids.get(line, set())) | ||
| 217 | for token in query.primary_tokens or query.tokens: | ||
| 218 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 219 | for token in query.translation_tokens: | ||
| 220 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 221 | for token in query.fallback_tokens: | ||
| 222 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 223 | return candidate_ids | ||
| 224 | |||
| 225 | def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: | ||
| 226 | low_confidence_split = ( | ||
| 227 | query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low" | ||
| 228 | ) | ||
| 229 | translation_jaccard = _jaccard(query.translation_tokens, candidate.translation_tokens) | ||
| 230 | translation_coverage, _ = _line_coverage_lines( | ||
| 231 | query.normalized.translation_lines, | ||
| 232 | candidate.normalized.translation_lines, | ||
| 233 | ) | ||
| 234 | no_effective_lyrics = not query.normalized.primary_lines and not candidate.normalized.primary_lines | ||
| 235 | if no_effective_lyrics: | ||
| 236 | decision = DuplicateDecision.DUPLICATE | ||
| 237 | confidence = 1.0 | ||
| 238 | reason = "无有效歌词,使用文件内容兜底指纹命中" | ||
| 239 | elif low_confidence_split: | ||
| 240 | decision = DuplicateDecision.REVIEW | ||
| 241 | confidence = 0.95 | ||
| 242 | reason = "原文哈希一致,但疑似整段翻译结构拆分置信度较低,需要人工复核" | ||
| 243 | elif query.normalized.translation_lines or candidate.normalized.translation_lines: | ||
| 244 | decision = DuplicateDecision.DUPLICATE | ||
| 245 | confidence = 1.0 | ||
| 246 | reason = "规范化后的原文歌词哈希完全一致,翻译行未参与自动判重" | ||
| 247 | else: | ||
| 248 | decision = DuplicateDecision.DUPLICATE | ||
| 249 | confidence = 1.0 | ||
| 250 | reason = "规范化后的原文歌词哈希完全一致" | ||
| 251 | return CandidateMatch( | ||
| 252 | record_id=candidate.record.record_id, | ||
| 253 | decision=decision, | ||
| 254 | confidence=confidence, | ||
| 255 | jaccard=1.0, | ||
| 256 | line_coverage=1.0, | ||
| 257 | primary_jaccard=1.0, | ||
| 258 | primary_line_coverage=1.0, | ||
| 259 | translation_jaccard=round(translation_jaccard, 4), | ||
| 260 | translation_line_coverage=round(translation_coverage, 4), | ||
| 261 | matched_unique_lines=query.normalized.primary_lines, | ||
| 262 | reason=reason, | ||
| 263 | ) | ||
| 264 | |||
| 265 | def _rank_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: | ||
| 266 | if not query.normalized.primary_lines or not candidate.normalized.primary_lines: | ||
| 267 | return _rank_no_effective_lyrics_candidate(query, candidate) | ||
| 268 | |||
| 269 | jaccard = _jaccard(query.tokens, candidate.tokens) | ||
| 270 | coverage, matched_lines = _line_coverage(query.normalized, candidate.normalized) | ||
| 271 | primary_jaccard = _jaccard(query.primary_tokens, candidate.primary_tokens) | ||
| 272 | primary_coverage, primary_matched_lines = _line_coverage_lines( | ||
| 273 | query.normalized.primary_lines, | ||
| 274 | candidate.normalized.primary_lines, | ||
| 275 | ) | ||
| 276 | translation_jaccard = _jaccard(query.translation_tokens, candidate.translation_tokens) | ||
| 277 | translation_coverage, translation_matched_lines = _line_coverage_lines( | ||
| 278 | query.normalized.translation_lines, | ||
| 279 | candidate.normalized.translation_lines, | ||
| 280 | ) | ||
| 281 | chorus_only = _is_chorus_only_match(query.normalized, candidate.normalized, primary_matched_lines) | ||
| 282 | translation_only = ( | ||
| 283 | bool(translation_matched_lines) | ||
| 284 | and primary_jaccard < self.review_jaccard_threshold | ||
| 285 | and primary_coverage < self.review_line_coverage_threshold | ||
| 286 | and (translation_jaccard >= self.review_jaccard_threshold or translation_coverage >= self.review_line_coverage_threshold) | ||
| 287 | ) | ||
| 288 | low_confidence_split = ( | ||
| 289 | query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low" | ||
| 290 | ) | ||
| 291 | |||
| 292 | confidence = round((0.58 * primary_jaccard) + (0.42 * primary_coverage), 4) | ||
| 293 | if ( | ||
| 294 | (primary_jaccard >= self.duplicate_jaccard_threshold or (primary_jaccard >= 0.78 and primary_coverage >= 0.9)) | ||
| 295 | and primary_coverage >= self.duplicate_line_coverage_threshold | ||
| 296 | and not chorus_only | ||
| 297 | and not translation_only | ||
| 298 | and not low_confidence_split | ||
| 299 | ): | ||
| 300 | decision = DuplicateDecision.DUPLICATE | ||
| 301 | if query.normalized.translation_lines or candidate.normalized.translation_lines: | ||
| 302 | reason = "原文歌词高度一致,翻译行未参与自动判重" | ||
| 303 | else: | ||
| 304 | reason = "原文 n-gram 字面相似度高,且行级覆盖范围广" | ||
| 305 | elif ( | ||
| 306 | chorus_only | ||
| 307 | or translation_only | ||
| 308 | or low_confidence_split | ||
| 309 | or primary_jaccard >= self.review_jaccard_threshold | ||
| 310 | or primary_coverage >= self.review_line_coverage_threshold | ||
| 311 | or jaccard >= self.review_jaccard_threshold | ||
| 312 | or coverage >= self.review_line_coverage_threshold | ||
| 313 | ): | ||
| 314 | decision = DuplicateDecision.REVIEW | ||
| 315 | reason = "候选相似度达到复核阈值,需要人工确认" | ||
| 316 | if chorus_only: | ||
| 317 | reason = "重合内容主要集中在重复副歌行,不自动判重" | ||
| 318 | elif translation_only: | ||
| 319 | reason = "仅翻译行相似,原文字面重合不足,不自动判重" | ||
| 320 | elif low_confidence_split: | ||
| 321 | reason = "疑似整段翻译结构但拆分置信度较低,需要人工复核" | ||
| 322 | else: | ||
| 323 | decision = DuplicateDecision.NEW | ||
| 324 | reason = "候选重合度低于复核阈值" | ||
| 325 | |||
| 326 | return CandidateMatch( | ||
| 327 | record_id=candidate.record.record_id, | ||
| 328 | decision=decision, | ||
| 329 | confidence=confidence, | ||
| 330 | jaccard=round(jaccard, 4), | ||
| 331 | line_coverage=round(coverage, 4), | ||
| 332 | primary_jaccard=round(primary_jaccard, 4), | ||
| 333 | primary_line_coverage=round(primary_coverage, 4), | ||
| 334 | translation_jaccard=round(translation_jaccard, 4), | ||
| 335 | translation_line_coverage=round(translation_coverage, 4), | ||
| 336 | matched_unique_lines=tuple(matched_lines), | ||
| 337 | reason=reason, | ||
| 338 | ) | ||
| 339 | |||
| 340 | |||
| 341 | def _rank_no_effective_lyrics_candidate(query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: | ||
| 342 | fallback_jaccard = _jaccard(query.fallback_tokens, candidate.fallback_tokens) | ||
| 343 | fallback_coverage, matched_lines = _line_coverage_lines(query.fallback_lines, candidate.fallback_lines) | ||
| 344 | if fallback_jaccard >= 0.35 and fallback_coverage >= 0.35 and len(matched_lines) >= 2: | ||
| 345 | return CandidateMatch( | ||
| 346 | record_id=candidate.record.record_id, | ||
| 347 | decision=DuplicateDecision.DUPLICATE, | ||
| 348 | confidence=round((0.58 * fallback_jaccard) + (0.42 * fallback_coverage), 4), | ||
| 349 | jaccard=round(fallback_jaccard, 4), | ||
| 350 | line_coverage=round(fallback_coverage, 4), | ||
| 351 | primary_jaccard=0.0, | ||
| 352 | primary_line_coverage=0.0, | ||
| 353 | translation_jaccard=0.0, | ||
| 354 | translation_line_coverage=0.0, | ||
| 355 | matched_unique_lines=tuple(matched_lines), | ||
| 356 | reason="无有效歌词,文件内容兜底特征高度相似", | ||
| 357 | ) | ||
| 358 | if fallback_jaccard >= 0.2 or fallback_coverage >= 0.2: | ||
| 359 | return CandidateMatch( | ||
| 360 | record_id=candidate.record.record_id, | ||
| 361 | decision=DuplicateDecision.REVIEW, | ||
| 362 | confidence=round((0.58 * fallback_jaccard) + (0.42 * fallback_coverage), 4), | ||
| 363 | jaccard=round(fallback_jaccard, 4), | ||
| 364 | line_coverage=round(fallback_coverage, 4), | ||
| 365 | primary_jaccard=0.0, | ||
| 366 | primary_line_coverage=0.0, | ||
| 367 | translation_jaccard=0.0, | ||
| 368 | translation_line_coverage=0.0, | ||
| 369 | matched_unique_lines=tuple(matched_lines), | ||
| 370 | reason="无有效歌词,文件内容兜底特征部分相似,需要人工复核", | ||
| 371 | ) | ||
| 372 | return CandidateMatch( | ||
| 373 | record_id=candidate.record.record_id, | ||
| 374 | decision=DuplicateDecision.NEW, | ||
| 375 | confidence=0.0, | ||
| 376 | jaccard=round(fallback_jaccard, 4), | ||
| 377 | line_coverage=round(fallback_coverage, 4), | ||
| 378 | primary_jaccard=0.0, | ||
| 379 | primary_line_coverage=0.0, | ||
| 380 | translation_jaccard=0.0, | ||
| 381 | translation_line_coverage=0.0, | ||
| 382 | matched_unique_lines=(), | ||
| 383 | reason="无有效歌词,且文件内容兜底特征未命中",  # "No effective lyrics, and the fallback file-content features did not match" | ||
| 384 | ) | ||
| 385 | |||
| 386 | |||
| 387 | def _jaccard(left: set[str], right: set[str]) -> float: | ||
| 388 | if not left and not right: | ||
| 389 | return 1.0 | ||
| 390 | if not left or not right: | ||
| 391 | return 0.0 | ||
| 392 | return len(left & right) / len(left | right) | ||
| 393 | |||
| 394 | |||
| 395 | def _exact_fingerprint(normalized: NormalizedLyrics, fallback_lines: tuple[str, ...]) -> str: | ||
| 396 | primary_text = fingerprint_text(normalized) | ||
| 397 | if primary_text: | ||
| 398 | return f"lyrics|{primary_text}" | ||
| 399 | return "no_effective_lyrics_content|" + "\n".join(fallback_lines) | ||
| 400 | |||
| 401 | |||
| 402 | def _fallback_no_lyrics_lines(text: str) -> list[str]: | ||
| 403 | import re | ||
| 404 | import unicodedata | ||
| 405 | |||
| 406 | lines: list[str] = [] | ||
| 407 | for raw_line in unicodedata.normalize("NFKC", text).splitlines(): | ||
| 408 | line = raw_line.strip().lower() | ||
| 409 | line = re.sub(r"\[(?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]", "", line) | ||
| 410 | line = re.sub(r"[【\[].{0,80}?[】\]]", "", line) | ||
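| 411 | if "歌词来自" in line or "qq音乐" in line or "网易云" in line or "酷狗" in line:  # platform attribution: "lyrics from", QQ Music, NetEase Cloud, Kugou | ||
| 412 | continue | ||
| 413 | if "未经" in line or "不得翻唱" in line or "不得翻录" in line or "著作权" in line:  # copyright notices: "without permission", "no covers", "no re-recording", "copyright" | ||
| 414 | continue | ||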
| 415 | punctuation = ",。!?;:、“”‘’·…—~!¥()【】《》〈〉「」『』﹏,.;:!?()[]{}<>|/\\_-" | ||
| 416 | line = "".join(" " if char in punctuation else char for char in line) | ||
| 417 | line = re.sub(r"\s+", " ", line).strip() | ||
| 418 | if line: | ||
| 419 | lines.append(line) | ||
| 420 | return list(dict.fromkeys(lines)) | ||
| 421 | |||
| 422 | |||
| 423 | def _line_coverage(left: NormalizedLyrics, right: NormalizedLyrics) -> tuple[float, list[str]]: | ||
| 424 | return _line_coverage_lines(left.unique_lines, right.unique_lines) | ||
| 425 | |||
| 426 | |||
| 427 | def _line_coverage_lines(left: tuple[str, ...], right: tuple[str, ...]) -> tuple[float, list[str]]: | ||
| 428 | left_lines = set(left) | ||
| 429 | right_lines = set(right) | ||
| 430 | if not left_lines and not right_lines: | ||
| 431 | return 1.0, [] | ||
| 432 | if not left_lines or not right_lines: | ||
| 433 | return 0.0, [] | ||
| 434 | matched = sorted(left_lines & right_lines) | ||
| 435 | return len(matched) / max(len(left_lines), len(right_lines)), matched | ||
| 436 | |||
| 437 | |||
| 438 | def _is_chorus_only_match(left: NormalizedLyrics, right: NormalizedLyrics, matched_lines: list[str]) -> bool: | ||
| 439 | if not matched_lines: | ||
| 440 | return False | ||
| 441 | matched = set(matched_lines) | ||
| 442 | repeated_matches = [ | ||
| 443 | line | ||
| 444 | for line in matched | ||
| 445 | if left.line_counts.get(line, 0) >= 2 or right.line_counts.get(line, 0) >= 2 | ||
| 446 | ] | ||
| 447 | if len(matched) <= 2 and repeated_matches: | ||
| 448 | return True | ||
| 449 | if repeated_matches and len(repeated_matches) / len(matched) >= 0.8: | ||
| 450 | matched_ratio_left = sum(left.line_counts.get(line, 0) for line in matched) / max(left.content_line_count, 1) | ||
| 451 | matched_ratio_right = sum(right.line_counts.get(line, 0) for line in matched) / max(right.content_line_count, 1) | ||
| 452 | return min(matched_ratio_left, matched_ratio_right) < 0.7 | ||
| 453 | return False |
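Review note on `checker.py`: the two similarity measures above use different denominators. `_jaccard` divides by the union of both sets, while `_line_coverage_lines` divides by the larger side's unique-line count. The standalone sketch below restates both so they can be run in isolation; the names `jaccard` and `line_coverage` and the sample lines are illustrative, not part of the module.

```python
def jaccard(left: set[str], right: set[str]) -> float:
    """Jaccard similarity with the same empty-set conventions as the diff."""
    if not left and not right:
        return 1.0
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)


def line_coverage(left: tuple[str, ...], right: tuple[str, ...]) -> tuple[float, list[str]]:
    """Shared unique lines divided by the larger side's unique-line count."""
    left_lines, right_lines = set(left), set(right)
    if not left_lines and not right_lines:
        return 1.0, []
    if not left_lines or not right_lines:
        return 0.0, []
    matched = sorted(left_lines & right_lines)
    return len(matched) / max(len(left_lines), len(right_lines)), matched


# Hypothetical inputs: four unique lines versus three, two of them shared.
existing = ("line a", "line b", "line c", "line d")
incoming = ("line a", "line b", "line x")

score = jaccard(set(existing), set(incoming))          # 2 shared / 5 in the union
coverage, matched = line_coverage(existing, incoming)  # 2 shared / max(4, 3)
```

With these inputs the Jaccard score is 0.4 and the line coverage is 0.5. Applied to the fallback features in `_rank_no_effective_lyrics_candidate`, that pair would clear the 0.35 thresholds with two matched lines.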
lyric_dedup/cli.py
0 → 100644
| 1 | """Command line tools for lyric duplicate checking.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import argparse | ||
| 6 | import csv | ||
| 7 | import json | ||
| 8 | from pathlib import Path | ||
| 9 | |||
| 10 | from lyric_dedup.checker import DuplicateChecker | ||
| 11 | from lyric_dedup.checker import LyricRecord | ||
| 12 | from lyric_dedup.eval_dataset import generate_eval_set | ||
| 13 | from lyric_dedup.file_import import iter_lyric_files | ||
| 14 | from lyric_dedup.file_import import read_lyric_file | ||
| 15 | from lyric_dedup.file_import import record_from_file | ||
| 16 | from lyric_dedup.file_import import records_from_dir | ||
| 17 | |||
| 18 | |||
| 19 | def main() -> None: | ||
| 20 | parser = argparse.ArgumentParser(prog="lyric-dedup") | ||
| 21 | subparsers = parser.add_subparsers(dest="command", required=True) | ||
| 22 | |||
| 23 | build = subparsers.add_parser("build-index", help="build an index from .lrc/.txt files") | ||
| 24 | build.add_argument("--lyrics-dir", required=True) | ||
| 25 | build.add_argument("--index", required=True) | ||
| 26 | |||
| 27 | check = subparsers.add_parser("check-file", help="check one .lrc/.txt file against an index") | ||
| 28 | check.add_argument("--index", required=True) | ||
| 29 | check.add_argument("--file", required=True) | ||
| 30 | check.add_argument("--max-candidates", type=int, default=10) | ||
| 31 | |||
| 32 | batch = subparsers.add_parser("batch-check", help="check a directory of .lrc/.txt files against an index") | ||
| 33 | batch.add_argument("--index", required=True) | ||
| 34 | batch.add_argument("--lyrics-dir", required=True) | ||
| 35 | batch.add_argument("--out", required=True) | ||
| 36 | batch.add_argument("--max-candidates", type=int, default=5) | ||
| 37 | |||
| 38 | evaluate = subparsers.add_parser("evaluate-csv", help="evaluate labeled duplicate samples from a CSV file") | ||
| 39 | evaluate.add_argument("--index", required=True) | ||
| 40 | evaluate.add_argument("--csv", required=True) | ||
| 41 | evaluate.add_argument("--out", required=True) | ||
| 42 | evaluate.add_argument("--base-dir", default="") | ||
| 43 | evaluate.add_argument("--positive-decisions", default="duplicate") | ||
| 44 | evaluate.add_argument("--max-candidates", type=int, default=5) | ||
| 45 | |||
| 46 | generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library") | ||
| 47 | generate.add_argument("--library-dir", required=True) | ||
| 48 | generate.add_argument("--lyrics-dir", required=True, help="output directory for the generated sample files") | ||
| 49 | generate.add_argument("--csv", required=True) | ||
| 50 | generate.add_argument("--size", type=int, default=100) | ||
| 51 | generate.add_argument("--positive-ratio", type=float, default=0.6) | ||
| 52 | generate.add_argument("--seed", type=int, default=20260602) | ||
| 53 | |||
| 54 | args = parser.parse_args() | ||
| 55 | if args.command == "build-index": | ||
| 56 | build_index(Path(args.lyrics_dir), Path(args.index)) | ||
| 57 | elif args.command == "check-file": | ||
| 58 | check_file(Path(args.index), Path(args.file), args.max_candidates) | ||
| 59 | elif args.command == "batch-check": | ||
| 60 | batch_check(Path(args.index), Path(args.lyrics_dir), Path(args.out), args.max_candidates) | ||
| 61 | elif args.command == "evaluate-csv": | ||
| 62 | evaluate_csv( | ||
| 63 | Path(args.index), | ||
| 64 | Path(args.csv), | ||
| 65 | Path(args.out), | ||
| 66 | base_dir=Path(args.base_dir) if args.base_dir else None, | ||
| 67 | positive_decisions={item.strip() for item in args.positive_decisions.split(",") if item.strip()}, | ||
| 68 | max_candidates=args.max_candidates, | ||
| 69 | ) | ||
| 70 | elif args.command == "generate-eval-set": | ||
| 71 | summary = generate_eval_set( | ||
| 72 | library_dir=Path(args.library_dir), | ||
| 73 | output_dir=Path(args.lyrics_dir), | ||
| 74 | csv_path=Path(args.csv), | ||
| 75 | size=args.size, | ||
| 76 | positive_ratio=args.positive_ratio, | ||
| 77 | seed=args.seed, | ||
| 78 | ) | ||
| 79 | print(json.dumps(summary, ensure_ascii=False)) | ||
| 80 | |||
| 81 | |||
| 82 | def build_index(lyrics_dir: Path, index_path: Path) -> None: | ||
| 83 | checker = DuplicateChecker() | ||
| 84 | records = records_from_dir(lyrics_dir) | ||
| 85 | for record in records: | ||
| 86 | checker.add_record(record) | ||
| 87 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 88 | checker.save(index_path) | ||
| 89 | print(json.dumps({"indexed": checker.record_count, "index": str(index_path)}, ensure_ascii=False)) | ||
| 90 | |||
| 91 | |||
| 92 | def check_file(index_path: Path, file_path: Path, max_candidates: int) -> None: | ||
| 93 | checker = DuplicateChecker.load(index_path) | ||
| 94 | record = record_from_file(file_path) | ||
| 95 | result = checker.check_record(record, max_candidates=max_candidates) | ||
| 96 | print(json.dumps(_result_to_dict(result, source=str(file_path)), ensure_ascii=False, indent=2)) | ||
| 97 | |||
| 98 | |||
| 99 | def batch_check(index_path: Path, lyrics_dir: Path, out_path: Path, max_candidates: int) -> None: | ||
| 100 | checker = DuplicateChecker.load(index_path) | ||
| 101 | out_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 102 | rows: list[dict[str, object]] = [] | ||
| 103 | for path in iter_lyric_files(lyrics_dir): | ||
| 104 | record = record_from_file(path, base_dir=lyrics_dir) | ||
| 105 | result = checker.check_record(record, max_candidates=max_candidates) | ||
| 106 | best = result.candidates[0] if result.candidates else None | ||
| 107 | rows.append( | ||
| 108 | { | ||
| 109 | "source": str(path), | ||
| 110 | "record_id": record.record_id, | ||
| 111 | "decision": result.decision.value, | ||
| 112 | "confidence": result.confidence, | ||
| 113 | "reason": result.reason, | ||
| 114 | "best_candidate_id": best.record_id if best else "", | ||
| 115 | "best_candidate_decision": best.decision.value if best else "", | ||
| 116 | "best_candidate_confidence": best.confidence if best else "", | ||
| 117 | "best_candidate_jaccard": best.jaccard if best else "", | ||
| 118 | "best_candidate_line_coverage": best.line_coverage if best else "", | ||
| 119 | "best_candidate_primary_jaccard": best.primary_jaccard if best else "", | ||
| 120 | "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "", | ||
| 121 | "best_candidate_translation_jaccard": best.translation_jaccard if best else "", | ||
| 122 | "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "", | ||
| 123 | "best_candidate_reason": best.reason if best else "", | ||
| 124 | "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "", | ||
| 125 | } | ||
| 126 | ) | ||
| 127 | |||
| 128 | if out_path.suffix.lower() == ".jsonl": | ||
| 129 | with out_path.open("w", encoding="utf-8") as file: | ||
| 130 | for row in rows: | ||
| 131 | file.write(json.dumps(row, ensure_ascii=False) + "\n") | ||
| 132 | else: | ||
| 133 | with out_path.open("w", encoding="utf-8", newline="") as file: | ||
| 134 | writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["source"]) | ||
| 135 | writer.writeheader() | ||
| 136 | writer.writerows(rows) | ||
| 137 | summary = { | ||
| 138 | "checked": len(rows), | ||
| 139 | "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"), | ||
| 140 | "review": sum(1 for row in rows if row["decision"] == "review"), | ||
| 141 | "new": sum(1 for row in rows if row["decision"] == "new"), | ||
| 142 | "out": str(out_path), | ||
| 143 | } | ||
| 144 | print(json.dumps(summary, ensure_ascii=False)) | ||
| 145 | |||
| 146 | |||
| 147 | def evaluate_csv( | ||
| 148 | index_path: Path, | ||
| 149 | csv_path: Path, | ||
| 150 | out_path: Path, | ||
| 151 | *, | ||
| 152 | base_dir: Path | None, | ||
| 153 | positive_decisions: set[str], | ||
| 154 | max_candidates: int, | ||
| 155 | ) -> None: | ||
| 156 | checker = DuplicateChecker.load(index_path) | ||
| 157 | rows: list[dict[str, object]] = [] | ||
| 158 | with csv_path.open(encoding="utf-8-sig", newline="") as file: | ||
| 159 | reader = csv.DictReader(file) | ||
| 160 | if reader.fieldnames is None: | ||
| 161 | raise ValueError("评估 CSV 需要表头")  # "The evaluation CSV needs a header row" | ||
| 162 | for row_number, row in enumerate(reader, start=2): | ||
| 163 | sample_id = row.get("id") or row.get("sample_id") or str(row_number) | ||
| 164 | record, source = _record_from_eval_row(row, csv_path=csv_path, base_dir=base_dir) | ||
| 165 | expected_duplicate = _parse_expected(row.get("expected") or row.get("label") or row.get("target")) | ||
| 166 | result = checker.check_record(record, max_candidates=max_candidates) | ||
| 167 | predicted_duplicate = result.decision.value in positive_decisions | ||
| 168 | best = result.candidates[0] if result.candidates else None | ||
| 169 | rows.append( | ||
| 170 | { | ||
| 171 | "id": sample_id, | ||
| 172 | "source": source, | ||
| 173 | "expected_duplicate": expected_duplicate, | ||
| 174 | "decision": result.decision.value, | ||
| 175 | "predicted_duplicate": predicted_duplicate, | ||
| 176 | "correct": expected_duplicate == predicted_duplicate, | ||
| 177 | "confidence": result.confidence, | ||
| 178 | "reason": result.reason, | ||
| 179 | "best_candidate_id": best.record_id if best else "", | ||
| 180 | "best_candidate_decision": best.decision.value if best else "", | ||
| 181 | "best_candidate_confidence": best.confidence if best else "", | ||
| 182 | "best_candidate_jaccard": best.jaccard if best else "", | ||
| 183 | "best_candidate_line_coverage": best.line_coverage if best else "", | ||
| 184 | "best_candidate_primary_jaccard": best.primary_jaccard if best else "", | ||
| 185 | "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "", | ||
| 186 | "best_candidate_translation_jaccard": best.translation_jaccard if best else "", | ||
| 187 | "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "", | ||
| 188 | "best_candidate_reason": best.reason if best else "", | ||
| 189 | "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "", | ||
| 190 | } | ||
| 191 | ) | ||
| 192 | |||
| 193 | out_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 194 | with out_path.open("w", encoding="utf-8", newline="") as file: | ||
| 195 | writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["id"]) | ||
| 196 | writer.writeheader() | ||
| 197 | writer.writerows(rows) | ||
| 198 | |||
| 199 | summary = _evaluation_summary(rows, positive_decisions=positive_decisions, out_path=out_path) | ||
| 200 | summary_path = out_path.with_suffix(out_path.suffix + ".summary.json") | ||
| 201 | summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") | ||
| 202 | print(json.dumps(summary, ensure_ascii=False)) | ||
| 203 | |||
| 204 | |||
| 205 | def _result_to_dict(result, *, source: str) -> dict[str, object]: | ||
| 206 | return { | ||
| 207 | "source": source, | ||
| 208 | "decision": result.decision.value, | ||
| 209 | "confidence": result.confidence, | ||
| 210 | "reason": result.reason, | ||
| 211 | "candidates": [ | ||
| 212 | { | ||
| 213 | "record_id": candidate.record_id, | ||
| 214 | "decision": candidate.decision.value, | ||
| 215 | "confidence": candidate.confidence, | ||
| 216 | "jaccard": candidate.jaccard, | ||
| 217 | "line_coverage": candidate.line_coverage, | ||
| 218 | "primary_jaccard": candidate.primary_jaccard, | ||
| 219 | "primary_line_coverage": candidate.primary_line_coverage, | ||
| 220 | "translation_jaccard": candidate.translation_jaccard, | ||
| 221 | "translation_line_coverage": candidate.translation_line_coverage, | ||
| 222 | "reason": candidate.reason, | ||
| 223 | "matched_unique_lines": list(candidate.matched_unique_lines), | ||
| 224 | } | ||
| 225 | for candidate in result.candidates | ||
| 226 | ], | ||
| 227 | } | ||
| 228 | |||
| 229 | |||
| 230 | def _lyrics_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[str, str]: | ||
| 231 | lyrics = (row.get("lyrics") or "").strip() | ||
| 232 | if lyrics: | ||
| 233 | return lyrics.replace("\\n", "\n"), "inline" | ||
| 234 | |||
| 235 | file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip() | ||
| 236 | if not file_value: | ||
| 237 | raise ValueError("评估 CSV 每行需要提供 lyrics,或 file/path/source 文件路径")  # "Each evaluation CSV row must provide lyrics, or a file path in file/path/source" | ||
| 238 | |||
| 239 | file_path = Path(file_value) | ||
| 240 | if not file_path.is_absolute(): | ||
| 241 | file_path = (base_dir or csv_path.parent) / file_path | ||
| 242 | return read_lyric_file(file_path), str(file_path) | ||
| 243 | |||
| 244 | |||
| 245 | def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None): | ||
| 246 | lyrics = (row.get("lyrics") or "").strip() | ||
| 247 | if lyrics: | ||
| 248 | return ( | ||
| 249 | LyricRecord( | ||
| 250 | record_id=row.get("id") or row.get("sample_id") or "__eval__", | ||
| 251 | lyrics=lyrics.replace("\\n", "\n"), | ||
| 252 | title=row.get("title") or None, | ||
| 253 | artist=row.get("artist") or None, | ||
| 254 | ), | ||
| 255 | "inline", | ||
| 256 | ) | ||
| 257 | |||
| 258 | file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip() | ||
| 259 | if not file_value: | ||
| 260 | raise ValueError("评估 CSV 每行需要 lyrics,或 file/path/source 文件路径")  # "Each evaluation CSV row needs lyrics, or a file path in file/path/source" | ||
| 261 | |||
| 262 | file_path = Path(file_value) | ||
| 263 | if not file_path.is_absolute(): | ||
| 264 | file_path = (base_dir or csv_path.parent) / file_path | ||
| 265 | record = record_from_file(file_path) | ||
| 266 | if row.get("title") or row.get("artist"): | ||
| 267 | record = LyricRecord( | ||
| 268 | record_id=record.record_id, | ||
| 269 | lyrics=record.lyrics, | ||
| 270 | title=row.get("title") or record.title, | ||
| 271 | artist=row.get("artist") or record.artist, | ||
| 272 | ) | ||
| 273 | return record, str(file_path) | ||
| 274 | |||
| 275 | |||
| 276 | def _parse_expected(value: str | None) -> bool: | ||
| 277 | if value is None: | ||
| 278 | raise ValueError("评估 CSV 每行需要 expected/label/target 列")  # "Each evaluation CSV row needs an expected/label/target column" | ||
| 279 | normalized = value.strip().lower() | ||
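| 280 | positives = {"1", "true", "yes", "y", "duplicate", "dup", "重复", "应去重", "去重", "是"}  # Chinese labels: "duplicate", "should dedupe", "dedupe", "yes" | ||
| 281 | negatives = {"0", "false", "no", "n", "new", "not_duplicate", "non_duplicate", "不重复", "不应去重", "新歌", "否"}  # Chinese labels: "not duplicate", "should not dedupe", "new song", "no" | ||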
| 282 | if normalized in positives: | ||
| 283 | return True | ||
| 284 | if normalized in negatives: | ||
| 285 | return False | ||
| 286 | raise ValueError(f"无法识别 expected 值: {value!r}")  # "Unrecognized expected value" | ||
| 287 | |||
| 288 | |||
| 289 | def _evaluation_summary( | ||
| 290 | rows: list[dict[str, object]], | ||
| 291 | *, | ||
| 292 | positive_decisions: set[str], | ||
| 293 | out_path: Path, | ||
| 294 | ) -> dict[str, object]: | ||
| 295 | tp = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is True) | ||
| 296 | fp = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is True) | ||
| 297 | tn = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is False) | ||
| 298 | fn = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is False) | ||
| 299 | total = len(rows) | ||
| 300 | precision = tp / (tp + fp) if tp + fp else 0.0 | ||
| 301 | recall = tp / (tp + fn) if tp + fn else 0.0 | ||
| 302 | accuracy = (tp + tn) / total if total else 0.0 | ||
| 303 | f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0 | ||
| 304 | return { | ||
| 305 | "total": total, | ||
| 306 | "positive_decisions": sorted(positive_decisions), | ||
| 307 | "accuracy": round(accuracy, 4), | ||
| 308 | "precision": round(precision, 4), | ||
| 309 | "recall": round(recall, 4), | ||
| 310 | "f1": round(f1, 4), | ||
| 311 | "true_positive": tp, | ||
| 312 | "false_positive": fp, | ||
| 313 | "true_negative": tn, | ||
| 314 | "false_negative": fn, | ||
| 315 | "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"), | ||
| 316 | "review": sum(1 for row in rows if row["decision"] == "review"), | ||
| 317 | "new": sum(1 for row in rows if row["decision"] == "new"), | ||
| 318 | "out": str(out_path), | ||
| 319 | "summary": str(out_path.with_suffix(out_path.suffix + ".summary.json")), | ||
| 320 | } | ||
| 321 | |||
| 322 | |||
| 323 | if __name__ == "__main__": | ||
| 324 | main() |
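Review note on `cli.py`: `--positive-decisions` decides which checker outcomes count as a predicted duplicate, so it directly shifts the precision/recall trade-off reported by `evaluate-csv`. The sketch below mirrors the arithmetic in `_evaluation_summary` on a tiny hand-made sample; the `summarize` helper and the four sample rows are hypothetical and only serve to show the effect.

```python
def summarize(samples: list[tuple[bool, str]], positive_decisions: set[str]) -> dict[str, float]:
    """Compute precision/recall/F1 for (expected_duplicate, decision) pairs."""
    tp = fp = fn = 0
    for expected, decision in samples:
        predicted = decision in positive_decisions
        if predicted and expected:
            tp += 1
        elif predicted:
            fp += 1
        elif expected:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": round(precision, 4), "recall": round(recall, 4), "f1": round(f1, 4)}


# Hypothetical labeled rows: (expected_duplicate, decision returned by the checker).
samples = [
    (True, "duplicate"),
    (True, "review"),
    (False, "review"),
    (False, "new"),
]

strict = summarize(samples, {"duplicate"})             # default: only "duplicate" is positive
lenient = summarize(samples, {"duplicate", "review"})  # e.g. --positive-decisions duplicate,review
```

On this sample the strict setting gives precision 1.0 and recall 0.5, while counting `review` as positive gives precision 0.6667 and recall 1.0.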
lyric_dedup/eval_dataset.py
0 → 100644
| 1 | """Generate labeled evaluation samples from an existing lyric library.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import csv | ||
| 6 | import random | ||
| 7 | import re | ||
| 8 | from dataclasses import dataclass | ||
| 9 | from pathlib import Path | ||
| 10 | |||
| 11 | from lyric_dedup.file_import import iter_lyric_files | ||
| 12 | from lyric_dedup.file_import import read_lyric_file | ||
| 13 | from lyric_dedup.file_import import record_from_file | ||
| 14 | from lyric_dedup.normalization import normalize_lyrics | ||
| 15 | |||
| 16 | |||
| 17 | @dataclass(frozen=True) | ||
| 18 | class GeneratedSample: | ||
| 19 | sample_id: str | ||
| 20 | file: str | ||
| 21 | expected: str | ||
| 22 | sample_type: str | ||
| 23 | source: str | ||
| 24 | title: str = "" | ||
| 25 | artist: str = "" | ||
| 26 | |||
| 27 | |||
| 28 | def generate_eval_set( | ||
| 29 | *, | ||
| 30 | library_dir: Path, | ||
| 31 | output_dir: Path, | ||
| 32 | csv_path: Path, | ||
| 33 | size: int = 100, | ||
| 34 | positive_ratio: float = 0.6, | ||
| 35 | seed: int = 20260602, | ||
| 36 | ) -> dict[str, object]: | ||
| 37 | rng = random.Random(seed) | ||
| 38 | source_files = iter_lyric_files(library_dir) | ||
| 39 | if not source_files: | ||
| 40 | raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件")  # "No .lrc/.txt lyric files found under {library_dir}" | ||
| 41 | |||
| 42 | output_dir.mkdir(parents=True, exist_ok=True) | ||
| 43 | csv_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 44 | _clean_generated_output_dir(output_dir)  # removes every top-level .txt/.lrc file in output_dir, so it must not be the library directory | ||
| 45 | |||
| 46 | positives = round(size * positive_ratio) | ||
| 47 | negatives = size - positives | ||
| 48 | samples: list[GeneratedSample] = [] | ||
| 49 | for index in range(positives): | ||
| 50 | source = source_files[index % len(source_files)] | ||
| 51 | samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng)) | ||
| 52 | for index in range(negatives): | ||
| 53 | left = source_files[index % len(source_files)] | ||
| 54 | right = source_files[(index + 1) % len(source_files)] | ||
| 55 | samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng)) | ||
| 56 | |||
| 57 | rng.shuffle(samples) | ||
| 58 | with csv_path.open("w", encoding="utf-8", newline="") as file: | ||
| 59 | writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"]) | ||
| 60 | writer.writeheader() | ||
| 61 | writer.writerows( | ||
| 62 | { | ||
| 63 | "id": sample.sample_id, | ||
| 64 | "file": sample.file, | ||
| 65 | "expected": sample.expected, | ||
| 66 | "sample_type": sample.sample_type, | ||
| 67 | "source": sample.source, | ||
| 68 | "title": sample.title, | ||
| 69 | "artist": sample.artist, | ||
| 70 | } | ||
| 71 | for sample in samples | ||
| 72 | ) | ||
| 73 | |||
| 74 | return { | ||
| 75 | "size": size, | ||
| 76 | "positive": positives, | ||
| 77 | "negative": negatives, | ||
| 78 | "library_files": len(source_files), | ||
| 79 | "lyrics_dir": str(output_dir), | ||
| 80 | "csv": str(csv_path), | ||
| 81 | } | ||
| 82 | |||
| 83 | |||
| 84 | def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | ||
| 85 | raw = read_lyric_file(source) | ||
| 86 | source_record = record_from_file(source) | ||
| 87 | variants = [ | ||
| 88 | ("exact_copy", raw), | ||
| 89 | ("timestamped", _add_timestamps(_content_lines(raw))), | ||
| 90 | ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)), | ||
| 91 | ("with_platform_noise", _with_platform_noise(_content_lines(raw))), | ||
| 92 | ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))), | ||
| 93 | ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))), | ||
| 94 | ("translation_added", _translation_added(_content_lines(raw))), | ||
| 95 | ] | ||
| 96 | sample_type, text = variants[(index - 1) % len(variants)] | ||
| 97 | name = f"pos_{index:03d}_{sample_type}.txt" | ||
| 98 | path = output_dir / name | ||
| 99 | path.write_text(text, encoding="utf-8") | ||
| 100 | return GeneratedSample( | ||
| 101 | sample_id=f"pos-{index:03d}", | ||
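| 102 | file=str(path.relative_to(csv_base)),  # raises ValueError if output_dir is not inside the CSV's parent directory | ||
| 103 | expected="应去重",  # "should be deduplicated"; parsed as a positive label by _parse_expected | ||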
| 104 | sample_type=sample_type, | ||
| 105 | source=str(source), | ||
| 106 | title=source_record.title or "", | ||
| 107 | artist=source_record.artist or "", | ||
| 108 | ) | ||
| 109 | |||
| 110 | |||
| 111 | def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | ||
| 112 | left_lines = _normalized_lines(left) | ||
| 113 | right_lines = _normalized_lines(right) | ||
| 114 | variants = [ | ||
| 115 | ("single_song_fragment", _single_song_fragment(left_lines)), | ||
| 116 | ("short_shared_snippet", _short_shared_snippet(left_lines, rng)), | ||
| 117 | ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)), | ||
| 118 | ("same_theme_synthetic", _same_theme_synthetic(index)), | ||
| 119 | ("translation_only_like", _translation_only_like(left_lines)), | ||
| 120 | ] | ||
| 121 | sample_type, text = variants[(index - 1) % len(variants)] | ||
| 122 | name = f"neg_{index:03d}_{sample_type}.txt" | ||
| 123 | path = output_dir / name | ||
| 124 | path.write_text(text, encoding="utf-8") | ||
| 125 | return GeneratedSample( | ||
| 126 | sample_id=f"neg-{index:03d}", | ||
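| 127 | file=str(path.relative_to(csv_base)),  # raises ValueError if output_dir is not inside the CSV's parent directory | ||
| 128 | expected="不应去重",  # "should not be deduplicated"; parsed as a negative label by _parse_expected | ||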
| 129 | sample_type=sample_type, | ||
| 130 | source=f"{left} | {right}", | ||
| 131 | ) | ||
| 132 | |||
| 133 | |||
| 134 | def _content_lines(text: str) -> list[str]: | ||
| 135 | lines = [line.strip() for line in text.splitlines() if line.strip()] | ||
| 136 | return lines or [text.strip()] | ||
| 137 | |||
| 138 | |||
| 139 | def _clean_generated_output_dir(output_dir: Path) -> None: | ||
| 140 | for path in output_dir.iterdir(): | ||
| 141 | if path.is_file() and path.suffix.lower() in {".txt", ".lrc"}: | ||
| 142 | path.unlink() | ||
| 143 | |||
| 144 | |||
| 145 | def _normalized_lines(path: Path) -> list[str]: | ||
| 146 | normalized = normalize_lyrics(read_lyric_file(path)) | ||
| 147 | return list(normalized.primary_lines or normalized.unique_lines) | ||
| 148 | |||
| 149 | |||
| 150 | def _add_timestamps(lines: list[str]) -> str: | ||
| 151 | return "\n".join(f"[00:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1)) | ||
| 152 | |||
| 153 | |||
| 154 | def _add_punctuation_noise(lines: list[str], rng: random.Random) -> str: | ||
| 155 | marks = ["!", "?", "...", ",", "。"] | ||
| 156 | return "\n".join(f"{line}{rng.choice(marks)}" for line in lines) | ||
| 157 | |||
| 158 | |||
| 159 | def _with_platform_noise(lines: list[str]) -> str: | ||
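| 160 | return "\n".join(["歌词来自QQ音乐", "作词:测试", *lines, "未经著作权人许可 不得翻唱"])  # header: "Lyrics from QQ Music", "Lyricist: test"; footer: "No covers without the copyright holder's permission" | ||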
| 161 | |||
| 162 | |||
| 163 | def _add_blank_line_noise(lines: list[str]) -> str: | ||
| 164 | result: list[str] = [] | ||
| 165 | for idx, line in enumerate(lines, start=1): | ||
| 166 | result.append(line) | ||
| 167 | if idx % 4 == 0: | ||
| 168 | result.append("") | ||
| 169 | return "\n".join(result) | ||
| 170 | |||
| 171 | |||
| 172 | def _translation_added(lines: list[str]) -> str: | ||
| 173 | result: list[str] = [] | ||
| 174 | for idx, line in enumerate(lines, start=1): | ||
| 175 | result.append(line) | ||
| 176 | if _looks_foreign(line) and idx <= 24: | ||
| 177 | result.append(_pseudo_translation(idx)) | ||
| 178 | return "\n".join(result) | ||
| 179 | |||
| 180 | |||
| 181 | def _single_song_fragment(lines: list[str]) -> str: | ||
| 182 | if len(lines) <= 4: | ||
| 183 | return "\n".join(lines[: max(1, len(lines) // 2)]) | ||
| 184 | fragment_len = max(2, min(8, len(lines) // 4)) | ||
| 185 | start = max(0, (len(lines) - fragment_len) // 2) | ||
| 186 | return "\n".join(lines[start : start + fragment_len]) | ||
| 187 | |||
| 188 | |||
| 189 | def _short_shared_snippet(lines: list[str], rng: random.Random) -> str: | ||
| 190 | snippet = rng.sample(lines, k=min(2, len(lines))) if lines else [] | ||
| 191 | synthetic = [ | ||
| 192 | "清晨的风吹过新的街口", | ||
| 193 | "我把昨天放进安静的口袋", | ||
| 194 | *snippet, | ||
| 195 | "故事从这里重新开始", | ||
| 196 | "灯光落下我继续往前走", | ||
| 197 | ] | ||
| 198 | return "\n".join(synthetic) | ||
| 199 | |||
| 200 | |||
| 201 | def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str: | ||
| 202 | left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else [] | ||
| 203 | right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else [] | ||
| 204 | filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"] | ||
| 205 | return "\n".join([*left_pick, *filler, *right_pick]) | ||
| 206 | |||
| 207 | |||
| 208 | def _same_theme_synthetic(index: int) -> str: | ||
| 209 | themes = [ | ||
| 210 | "我在夜里想起远方的你", | ||
| 211 | "城市灯火陪我走过雨季", | ||
| 212 | "那些没说完的话留在风里", | ||
| 213 | "明天醒来我们各自继续", | ||
| 214 | f"这是第 {index} 个全新测试样本", | ||
| 215 | ] | ||
| 216 | return "\n".join(themes) | ||
| 217 | |||
| 218 | |||
| 219 | def _translation_only_like(lines: list[str]) -> str: | ||
| 220 | foreign_count = sum(1 for line in lines if _looks_foreign(line)) | ||
| 221 | if foreign_count < 2: | ||
| 222 | return _same_theme_synthetic(foreign_count + len(lines)) | ||
| 223 | return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1)) | ||
| 224 | |||
| 225 | |||
| 226 | def _pseudo_translation(index: int) -> str: | ||
| 227 | translations = [ | ||
| 228 | "今晚我仍然想念你", | ||
| 229 | "风会带走所有疲惫", | ||
| 230 | "黑暗里也会有光", | ||
| 231 | "别让昨天困住自己", | ||
| 232 | "我们终会继续向前", | ||
| 233 | "雨停以后天空会亮", | ||
| 234 | "把遗憾留在旧时光", | ||
| 235 | "你已经足够好了", | ||
| 236 | ] | ||
| 237 | return translations[(index - 1) % len(translations)] | ||
| 238 | |||
| 239 | |||
| 240 | def _looks_foreign(line: str) -> bool: | ||
| 241 | latin = len(re.findall(r"[A-Za-z]", line)) | ||
| 242 | cjk = len(re.findall(r"[\u4e00-\u9fff]", line)) | ||
| 243 | return latin > 0 and cjk == 0 |
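Review note on `eval_dataset.py`: the `timestamped` positive variant only makes sense if the checker's normalization removes the generated tags again. The sketch below pairs the f-string from `_add_timestamps` with the timestamp pattern used in `_fallback_no_lyrics_lines` to confirm the round trip. `strip_timestamps` is a hypothetical helper written for this note, and the real pipeline goes through `normalize_lyrics`, which is not shown in this part of the diff.

```python
import re

# Same pattern as the timestamp-removal step in _fallback_no_lyrics_lines.
TIMESTAMP = re.compile(r"\[(?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]")


def add_timestamps(lines: list[str]) -> str:
    """Prefix each line with an LRC-style tag, as the generator does."""
    return "\n".join(f"[00:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1))


def strip_timestamps(text: str) -> list[str]:
    """Hypothetical helper: drop LRC tags and surrounding whitespace per line."""
    return [TIMESTAMP.sub("", line).strip() for line in text.splitlines()]


lines = ["hello world", "second line"]
stamped = add_timestamps(lines)
restored = strip_timestamps(stamped)
```

Because the seconds field is `idx % 60`, the generated tags wrap back to `[00:00.00]` at line 60. That is harmless for duplicate detection as long as the tags are stripped, but the output is not a chronologically valid LRC file.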
lyric_dedup/file_import.py
0 → 100644
| 1 | """Import LRC/TXT lyric files into records.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | from pathlib import Path | ||
| 7 | |||
| 8 | from lyric_dedup.checker import LyricRecord | ||
| 9 | |||
| 10 | |||
| 11 | SUPPORTED_SUFFIXES = {".lrc", ".txt"} | ||
| 12 | |||
| 13 | |||
| 14 | def iter_lyric_files(root: str | Path) -> list[Path]: | ||
| 15 | base = Path(root) | ||
| 16 | return sorted( | ||
| 17 | path | ||
| 18 | for path in base.rglob("*") | ||
| 19 | if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES | ||
| 20 | ) | ||
| 21 | |||
| 22 | |||
| 23 | def read_lyric_file(path: str | Path) -> str: | ||
| 24 | file_path = Path(path) | ||
| 25 | data = file_path.read_bytes() | ||
| 26 | for encoding in ("utf-8-sig", "gb18030", "big5"):  # utf-8-sig also decodes UTF-8 without a BOM | ||
| 27 | try: | ||
| 28 | return data.decode(encoding) | ||
| 29 | except UnicodeDecodeError: | ||
| 30 | continue | ||
| 31 | return data.decode("utf-8", errors="replace") | ||
| 32 | |||
| 33 | |||
| 34 | def record_from_file(path: str | Path, *, base_dir: str | Path | None = None) -> LyricRecord: | ||
| 35 | file_path = Path(path) | ||
| 36 | lyrics = read_lyric_file(file_path) | ||
| 37 | title, artist = _metadata_from_name(file_path.stem) | ||
| 38 | record_id = _record_id(file_path, base_dir) | ||
| 39 | return LyricRecord(record_id=record_id, lyrics=lyrics, title=title, artist=artist) | ||
| 40 | |||
| 41 | |||
| 42 | def records_from_dir(root: str | Path) -> list[LyricRecord]: | ||
| 43 | return [record_from_file(path, base_dir=root) for path in iter_lyric_files(root)] | ||
| 44 | |||
| 45 | |||
| 46 | def _record_id(path: Path, base_dir: str | Path | None) -> str: | ||
| 47 | if base_dir is None: | ||
| 48 | source = str(path.resolve()) | ||
| 49 | else: | ||
| 50 | source = str(path.resolve().relative_to(Path(base_dir).resolve())) | ||
| 51 | digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:12] | ||
| 52 | return f"{digest}:{source}" | ||
| 53 | |||
| 54 | |||
| 55 | def _metadata_from_name(stem: str) -> tuple[str | None, str | None]: | ||
| 56 | cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip() | ||
| 57 | if " - " in cleaned: | ||
| 58 | artist, title = cleaned.split(" - ", 1) | ||
| 59 | return title.strip() or None, artist.strip() or None | ||
| 60 | for sep in ("-", "_"): | ||
| 61 | if sep in cleaned: | ||
| 62 | title, artist = cleaned.rsplit(sep, 1) | ||
| 63 | return title.strip() or None, artist.strip() or None | ||
| 64 | return cleaned or None, None |
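Note on `read_lyric_file`: the loop tries each encoding in order and only falls back to lossy UTF-8 decoding when every attempt raises. The sketch below reproduces that pattern in isolation. The helper name `decode_with_fallback` and the three-entry encoding tuple are assumptions made for illustration and are not part of this change; `utf-8-sig` is listed alone because it also accepts UTF-8 input that has no BOM.

```python
def decode_with_fallback(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8-sig", "gb18030", "big5"),
) -> str:
    """Return the first successful decoding, or a lossy UTF-8 decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable parts, mark the rest with U+FFFD.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # GB18030 bytes are not valid UTF-8 here, so the second encoding is used.
    print(decode_with_fallback("我爱你".encode("gb18030")))
    # A UTF-8 BOM is stripped by utf-8-sig.
    print(decode_with_fallback("\ufeff你好".encode("utf-8")))
```

Order matters in a chain like this: a permissive multi-byte codec placed early may decode bytes from another legacy encoding without raising, so the later entries are reached only for input the earlier ones reject outright.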
lyric_dedup/minhash_lsh.py
0 → 100644
| 1 | """Small in-memory MinHash LSH index for incremental lyric lookup.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | from collections import defaultdict | ||
| 7 | from dataclasses import dataclass | ||
| 8 | |||
| 9 | |||
| 10 | _MAX_HASH = (1 << 64) - 1 | ||
| 11 | |||
| 12 | |||
| 13 | @dataclass(frozen=True) | ||
| 14 | class MinHashConfig: | ||
| 15 | num_perm: int = 96 | ||
| 16 | bands: int = 24 | ||
| 17 | seed: int = 17 | ||
| 18 | |||
| 19 | @property | ||
| 20 | def rows_per_band(self) -> int: | ||
| 21 | if self.num_perm % self.bands != 0: | ||
| 22 | raise ValueError("num_perm must be divisible by bands") | ||
| 23 | return self.num_perm // self.bands | ||
| 24 | |||
| 25 | |||
| 26 | class MinHashLSH: | ||
| 27 | def __init__(self, config: MinHashConfig | None = None) -> None: | ||
| 28 | self.config = config or MinHashConfig() | ||
| 29 | self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set) | ||
| 30 | |||
| 31 | def signature(self, tokens: set[str]) -> tuple[int, ...]: | ||
| 32 | if not tokens: | ||
| 33 | return tuple([_MAX_HASH] * self.config.num_perm) | ||
| 34 | |||
| 35 | signature = [_MAX_HASH] * self.config.num_perm | ||
| 36 | for token in tokens: | ||
| 37 | encoded = token.encode("utf-8") | ||
| 38 | for idx in range(self.config.num_perm): | ||
| 39 | digest = hashlib.blake2b( | ||
| 40 | encoded, | ||
| 41 | digest_size=8, | ||
| 42 | person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16], | ||
| 43 | ).digest() | ||
| 44 | value = int.from_bytes(digest, "big") | ||
| 45 | if value < signature[idx]: | ||
| 46 | signature[idx] = value | ||
| 47 | return tuple(signature) | ||
| 48 | |||
| 49 | def add(self, record_id: str, signature: tuple[int, ...]) -> None: | ||
| 50 | for key in self._band_keys(signature): | ||
| 51 | self._buckets[key].add(record_id) | ||
| 52 | |||
| 53 | def query(self, signature: tuple[int, ...]) -> set[str]: | ||
| 54 | candidates: set[str] = set() | ||
| 55 | for key in self._band_keys(signature): | ||
| 56 | candidates.update(self._buckets.get(key, set())) | ||
| 57 | return candidates | ||
| 58 | |||
| 59 | def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]: | ||
| 60 | rows = self.config.rows_per_band | ||
| 61 | return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)] |
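Note on the `MinHashConfig` defaults: 96 permutations split into 24 bands gives 4 rows per band, and `query` returns a record as soon as one band matches exactly. Under the usual LSH banding analysis, which assumes each signature position agrees with probability equal to the Jaccard similarity `s`, the chance of becoming a candidate is `1 - (1 - s**rows) ** bands`. The sketch below evaluates that curve; the function names are hypothetical and exist only for this illustration.

```python
def candidate_probability(jaccard: float, bands: int = 24, rows: int = 4) -> float:
    """Probability that at least one of `bands` bands matches on all `rows` rows."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def approximate_threshold(bands: int = 24, rows: int = 4) -> float:
    """Similarity at which the S-curve rises most steeply, roughly (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    for s in (0.2, 0.4, 0.6, 0.8):
        print(f"s={s:.1f} -> p={candidate_probability(s):.4f}")
    print(f"threshold ~ {approximate_threshold():.3f}")
```

With these defaults the curve is tuned for recall: pairs well below the later ranking thresholds can still surface as candidates, so precision has to come from the scoring step rather than from the index.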
lyric_dedup/normalization.py
0 → 100644
| 1 | """Lyric-specific normalization and feature extraction.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import re | ||
| 6 | import string | ||
| 7 | import unicodedata | ||
| 8 | from collections import Counter | ||
| 9 | from dataclasses import dataclass | ||
| 10 | |||
| 11 | |||
| 12 | _TRADITIONAL_TO_SIMPLIFIED = str.maketrans( | ||
| 13 | { | ||
| 14 | "愛": "爱", | ||
| 15 | "會": "会", | ||
| 16 | "個": "个", | ||
| 17 | "妳": "你", | ||
| 18 | "們": "们", | ||
| 19 | "麼": "么", | ||
| 20 | "夢": "梦", | ||
| 21 | "憶": "忆", | ||
| 22 | "風": "风", | ||
| 23 | "無": "无", | ||
| 24 | "與": "与", | ||
| 25 | "聽": "听", | ||
| 26 | "說": "说", | ||
| 27 | "見": "见", | ||
| 28 | "話": "话", | ||
| 29 | "還": "还", | ||
| 30 | "這": "这", | ||
| 31 | "那": "那", | ||
| 32 | "裡": "里", | ||
| 33 | "裏": "里", | ||
| 34 | "過": "过", | ||
| 35 | "來": "来", | ||
| 36 | "進": "进", | ||
| 37 | "去": "去", | ||
| 38 | "給": "给", | ||
| 39 | "讓": "让", | ||
| 40 | "嗎": "吗", | ||
| 41 | "為": "为", | ||
| 42 | "誰": "谁", | ||
| 43 | "對": "对", | ||
| 44 | "錯": "错", | ||
| 45 | "淚": "泪", | ||
| 46 | "寫": "写", | ||
| 47 | "雲": "云", | ||
| 48 | "藍": "蓝", | ||
| 49 | "紅": "红", | ||
| 50 | "綠": "绿", | ||
| 51 | "黃": "黄", | ||
| 52 | "長": "长", | ||
| 53 | "遠": "远", | ||
| 54 | "燈": "灯", | ||
| 55 | "臺": "台", | ||
| 56 | "台": "台", | ||
| 57 | "後": "后", | ||
| 58 | "從": "从", | ||
| 59 | "時": "时", | ||
| 60 | "間": "间", | ||
| 61 | "葉": "叶", | ||
| 62 | "歲": "岁", | ||
| 63 | "聲": "声", | ||
| 64 | "邊": "边", | ||
| 65 | "歡": "欢", | ||
| 66 | "繼": "继", | ||
| 67 | "續": "续", | ||
| 68 | "難": "难", | ||
| 69 | "雙": "双", | ||
| 70 | "舊": "旧", | ||
| 71 | "離": "离", | ||
| 72 | } | ||
| 73 | ) | ||
| 74 | |||
| 75 | _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]") | ||
| 76 | _BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]") | ||
| 77 | _ROLE_PREFIX_RE = re.compile(r"^\s*(?:男|女|合|主歌|副歌|verse|chorus|bridge|rap)\s*[::]\s*", re.IGNORECASE) | ||
| 78 | _CREDIT_PREFIX_RE = re.compile( | ||
| 79 | r"^\s*(?:作词|作詞|作曲|编曲|編曲|制作|製作|监制|監製|录音|錄音|混音|母带|" | ||
| 80 | r"出品|发行|發行|歌词|歌詞|lyric(?:s)?|composer|writer|producer|arranger|" | ||
| 81 | r"copyright|未经|未經|qq音乐|酷狗|网易云|網易雲|lrc)", | ||
| 82 | re.IGNORECASE, | ||
| 83 | ) | ||
| 84 | _WATERMARK_RE = re.compile( | ||
| 85 | r"(?:qq音乐|酷狗音乐|网易云音乐|網易雲音樂|虾米音乐|歌词网|歌詞網|" | ||
| 86 | r"music\.163\.com|www\.|http[s]?://|\blrc\b)", | ||
| 87 | re.IGNORECASE, | ||
| 88 | ) | ||
| 89 | _CJK_RE = re.compile(r"[\u4e00-\u9fff]") | ||
| 90 | _LATIN_RE = re.compile(r"[a-zA-Z]") | ||
| 91 | _KANA_RE = re.compile(r"[\u3040-\u30ff]") | ||
| 92 | _HANGUL_RE = re.compile(r"[\uac00-\ud7af]") | ||
| 93 | _WORD_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]", re.IGNORECASE) | ||
| 94 | _INLINE_SPLIT_RE = re.compile(r"\s+(?:/|\||\uff5c)\s+|(?<=[A-Za-z])\s*[-—]\s*(?=[\u4e00-\u9fff])") | ||
| 95 | |||
| 96 | |||
| 97 | @dataclass(frozen=True) | ||
| 98 | class _LineEntry: | ||
| 99 | text: str | ||
| 100 | timestamp: str | None | ||
| 101 | language: str | ||
| 102 | source_index: int | ||
| 103 | |||
| 104 | |||
| 105 | @dataclass(frozen=True) | ||
| 106 | class NormalizedLyrics: | ||
| 107 | raw_text: str | ||
| 108 | normalized_full_text: str | ||
| 109 | normalized_lines: tuple[str, ...] | ||
| 110 | unique_lines: tuple[str, ...] | ||
| 111 | line_counts: dict[str, int] | ||
| 112 | content_line_count: int | ||
| 113 | primary_lines: tuple[str, ...] | ||
| 114 | translation_lines: tuple[str, ...] | ||
| 115 | unknown_lines: tuple[str, ...] | ||
| 116 | line_roles: tuple[str, ...] | ||
| 117 | split_confidence: str | ||
| 118 | split_reason: str | ||
| 119 | |||
| 120 | |||
| 121 | def normalize_lyrics(text: str) -> NormalizedLyrics: | ||
| 122 | """Normalize lyrics while preserving line-level structure for ranking.""" | ||
| 123 | entries: list[_LineEntry] = [] | ||
| 124 | for index, raw_line in enumerate(unicodedata.normalize("NFKC", text).splitlines()): | ||
| 125 | entries.extend(_clean_line_entries(raw_line, index)) | ||
| 126 | |||
| 127 | cleaned_lines = [entry.text for entry in entries] | ||
| 128 | roles, confidence, reason = _assign_line_roles(entries) | ||
| 129 | primary_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "primary") | ||
| 130 | translation_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "translation") | ||
| 131 | unknown_lines = tuple(entry.text for entry, role in zip(entries, roles, strict=False) if role == "unknown") | ||
| 132 | if not primary_lines: | ||
| 133 | primary_lines = tuple(cleaned_lines) | ||
| 134 | roles = tuple("primary" for _ in cleaned_lines) | ||
| 135 | if cleaned_lines and confidence == "none": | ||
| 136 | reason = "未检测到可分离的翻译结构,全部有效行按原文处理" | ||
| 137 | |||
| 138 | counts = Counter(cleaned_lines) | ||
| 139 | unique_lines = tuple(dict.fromkeys(cleaned_lines)) | ||
| 140 | return NormalizedLyrics( | ||
| 141 | raw_text=text, | ||
| 142 | normalized_full_text="\n".join(cleaned_lines), | ||
| 143 | normalized_lines=tuple(cleaned_lines), | ||
| 144 | unique_lines=unique_lines, | ||
| 145 | line_counts=dict(counts), | ||
| 146 | content_line_count=len(cleaned_lines), | ||
| 147 | primary_lines=tuple(dict.fromkeys(primary_lines)), | ||
| 148 | translation_lines=tuple(dict.fromkeys(translation_lines)), | ||
| 149 | unknown_lines=tuple(dict.fromkeys(unknown_lines)), | ||
| 150 | line_roles=tuple(roles), | ||
| 151 | split_confidence=confidence, | ||
| 152 | split_reason=reason, | ||
| 153 | ) | ||
| 154 | |||
| 155 | |||
| 156 | def fingerprint_text(normalized: NormalizedLyrics) -> str: | ||
| 157 | """Return a text form suitable for exact hashing. | ||
| 158 | |||
| 159 | Repeated adjacent or non-adjacent lyric lines are collapsed so different chorus | ||
| 160 | repeat counts do not prevent exact duplicate detection. | ||
| 161 | """ | ||
| 162 | return "\n".join(normalized.primary_lines or normalized.unique_lines) | ||
| 163 | |||
| 164 | |||
| 165 | def lyric_tokens( | ||
| 166 | normalized: NormalizedLyrics, | ||
| 167 | ngram_size: int = 3, | ||
| 168 | *, | ||
| 169 | lines: tuple[str, ...] | None = None, | ||
| 170 | ) -> set[str]: | ||
| 171 | """Build a set of mixed CJK/Latin n-grams; lines repeated in the lyrics contribute only once.""" | ||
| 172 | tokens: set[str] = set() | ||
| 173 | selected_lines = lines if lines is not None else normalized.unique_lines | ||
| 174 | for line in selected_lines: | ||
| 175 | units = _token_units(line) | ||
| 176 | if len(units) < ngram_size: | ||
| 177 | if units: | ||
| 178 | tokens.add(" ".join(units)) | ||
| 179 | continue | ||
| 180 | for start in range(len(units) - ngram_size + 1): | ||
| 181 | tokens.add(" ".join(units[start : start + ngram_size])) | ||
| 182 | return tokens | ||
| 183 | |||
| 184 | |||
| 185 | def _clean_line_entries(raw_line: str, source_index: int) -> list[_LineEntry]: | ||
| 186 | timestamp_match = _TIMESTAMP_RE.search(raw_line) | ||
| 187 | timestamp = timestamp_match.group(1) if timestamp_match else None | ||
| 188 | line = _TIMESTAMP_RE.sub("", raw_line) | ||
| 189 | line = _ROLE_PREFIX_RE.sub("", line).strip() | ||
| 190 | inline_entries = _split_inline_translation(line, timestamp, source_index) | ||
| 191 | if inline_entries: | ||
| 192 | return inline_entries | ||
| 193 | return _entry_from_text(line, timestamp, source_index) | ||
| 194 | |||
| 195 | |||
| 196 | def _split_inline_translation(line: str, timestamp: str | None, source_index: int) -> list[_LineEntry]: | ||
| 197 | parts = [part.strip() for part in _INLINE_SPLIT_RE.split(line, maxsplit=1)] | ||
| 198 | if len(parts) != 2: | ||
| 199 | return [] | ||
| 200 | left_entries = _entry_from_text(parts[0], timestamp, source_index) | ||
| 201 | right_entries = _entry_from_text(parts[1], timestamp, source_index) | ||
| 202 | if not left_entries or not right_entries: | ||
| 203 | return [] | ||
| 204 | left_lang = left_entries[0].language | ||
| 205 | right_lang = right_entries[0].language | ||
| 206 | if _is_foreign_language(left_lang) and right_lang == "zh": | ||
| 207 | return [left_entries[0], right_entries[0]] | ||
| 208 | if left_lang == "zh" and _is_foreign_language(right_lang): | ||
| 209 | return [right_entries[0], left_entries[0]] | ||
| 210 | return [] | ||
| 211 | |||
| 212 | |||
| 213 | def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]: | ||
| 214 | line = _BRACKET_RE.sub("", text) | ||
| 215 | line = line.strip().lower().translate(_TRADITIONAL_TO_SIMPLIFIED) | ||
| 216 | if not line or _is_noise_line(line): | ||
| 217 | return [] | ||
| 218 | line = _strip_symbols(line) | ||
| 219 | if not line: | ||
| 220 | return [] | ||
| 221 | return [_LineEntry(text=line, timestamp=timestamp, language=_detect_language(line), source_index=source_index)] | ||
| 222 | |||
| 223 | |||
| 224 | def _assign_line_roles(entries: list[_LineEntry]) -> tuple[tuple[str, ...], str, str]: | ||
| 225 | if not entries: | ||
| 226 | return (), "none", "没有有效歌词行" | ||
| 227 | |||
| 228 | timestamp_roles = _roles_by_same_timestamp(entries) | ||
| 229 | if timestamp_roles is not None: | ||
| 230 | return timestamp_roles, "high", "同时间戳下检测到外文行和中文行配对" | ||
| 231 | |||
| 232 | inline_roles = _roles_by_inline_translation(entries) | ||
| 233 | if inline_roles is not None: | ||
| 234 | return inline_roles, "medium", "同一原始行内检测到明显的外文和中文翻译" | ||
| 235 | |||
| 236 | alternating_roles = _roles_by_alternating_translation(entries) | ||
| 237 | if alternating_roles is not None: | ||
| 238 | return alternating_roles, "high", "检测到稳定的外文行和中文翻译行交替结构" | ||
| 239 | |||
| 240 | block_roles = _roles_by_translation_block(entries) | ||
| 241 | if block_roles is not None: | ||
| 242 | return block_roles, "low", "检测到疑似原文段落加中文翻译段落,置信度较低" | ||
| 243 | |||
| 244 | return tuple("primary" for _ in entries), "none", "未检测到可分离的翻译结构,全部有效行按原文处理" | ||
| 245 | |||
| 246 | |||
| 247 | def _roles_by_same_timestamp(entries: list[_LineEntry]) -> tuple[str, ...] | None: | ||
| 248 | roles = ["unknown"] * len(entries) | ||
| 249 | groups: dict[str, list[int]] = {} | ||
| 250 | for idx, entry in enumerate(entries): | ||
| 251 | if entry.timestamp: | ||
| 252 | groups.setdefault(entry.timestamp, []).append(idx) | ||
| 253 | |||
| 254 | paired = 0 | ||
| 255 | for indexes in groups.values(): | ||
| 256 | if len(indexes) < 2: | ||
| 257 | continue | ||
| 258 | foreign = [idx for idx in indexes if _is_foreign_language(entries[idx].language)] | ||
| 259 | chinese = [idx for idx in indexes if entries[idx].language == "zh"] | ||
| 260 | if not foreign or not chinese: | ||
| 261 | continue | ||
| 262 | for idx in foreign: | ||
| 263 | roles[idx] = "primary" | ||
| 264 | for idx in chinese: | ||
| 265 | roles[idx] = "translation" | ||
| 266 | paired += 1 | ||
| 267 | |||
| 268 | if paired == 0: | ||
| 269 | return None | ||
| 270 | for idx, role in enumerate(roles): | ||
| 271 | if role == "unknown": | ||
| 272 | roles[idx] = "primary" | ||
| 273 | return tuple(roles) | ||
| 274 | |||
| 275 | |||
| 276 | def _roles_by_alternating_translation(entries: list[_LineEntry]) -> tuple[str, ...] | None: | ||
| 277 | roles = ["unknown"] * len(entries) | ||
| 278 | pairs = 0 | ||
| 279 | idx = 0 | ||
| 280 | while idx < len(entries) - 1: | ||
| 281 | current = entries[idx] | ||
| 282 | nxt = entries[idx + 1] | ||
| 283 | if _is_foreign_language(current.language) and nxt.language == "zh": | ||
| 284 | roles[idx] = "primary" | ||
| 285 | roles[idx + 1] = "translation" | ||
| 286 | pairs += 1 | ||
| 287 | idx += 2 | ||
| 288 | continue | ||
| 289 | idx += 1 | ||
| 290 | |||
| 291 | if pairs < 2: | ||
| 292 | return None | ||
| 293 | assigned = sum(1 for role in roles if role != "unknown") | ||
| 294 | if assigned / len(entries) < 0.65: | ||
| 295 | return None | ||
| 296 | for idx, role in enumerate(roles): | ||
| 297 | if role == "unknown": | ||
| 298 | roles[idx] = "primary" | ||
| 299 | return tuple(roles) | ||
| 300 | |||
| 301 | |||
| 302 | def _roles_by_inline_translation(entries: list[_LineEntry]) -> tuple[str, ...] | None: | ||
| 303 | roles = ["primary"] * len(entries) | ||
| 304 | pairs = 0 | ||
| 305 | by_source: dict[int, list[int]] = {} | ||
| 306 | for idx, entry in enumerate(entries): | ||
| 307 | by_source.setdefault(entry.source_index, []).append(idx) | ||
| 308 | for indexes in by_source.values(): | ||
| 309 | if len(indexes) != 2: | ||
| 310 | continue | ||
| 311 | first, second = indexes | ||
| 312 | if _is_foreign_language(entries[first].language) and entries[second].language == "zh": | ||
| 313 | roles[first] = "primary" | ||
| 314 | roles[second] = "translation" | ||
| 315 | pairs += 1 | ||
| 316 | elif entries[first].language == "zh" and _is_foreign_language(entries[second].language): | ||
| 317 | roles[first] = "translation" | ||
| 318 | roles[second] = "primary" | ||
| 319 | pairs += 1 | ||
| 320 | return tuple(roles) if pairs else None | ||
| 321 | |||
| 322 | |||
| 323 | def _roles_by_translation_block(entries: list[_LineEntry]) -> tuple[str, ...] | None: | ||
| 324 | if len(entries) < 4: | ||
| 325 | return None | ||
| 326 | midpoint = len(entries) // 2 | ||
| 327 | first = entries[:midpoint] | ||
| 328 | second = entries[midpoint:] | ||
| 329 | first_foreign = sum(1 for entry in first if _is_foreign_language(entry.language)) | ||
| 330 | second_zh = sum(1 for entry in second if entry.language == "zh") | ||
| 331 | if first_foreign / len(first) >= 0.75 and second_zh / len(second) >= 0.75: | ||
| 332 | return tuple("primary" if idx < midpoint else "translation" for idx in range(len(entries))) | ||
| 333 | return None | ||
| 334 | |||
| 335 | |||
| 336 | def _detect_language(line: str) -> str: | ||
| 337 | cjk = len(_CJK_RE.findall(line)) | ||
| 338 | latin = len(_LATIN_RE.findall(line)) | ||
| 339 | kana = len(_KANA_RE.findall(line)) | ||
| 340 | hangul = len(_HANGUL_RE.findall(line)) | ||
| 341 | if hangul: | ||
| 342 | return "kr" | ||
| 343 | if kana: | ||
| 344 | return "jp" | ||
| 345 | if cjk and latin: | ||
| 346 | return "mixed" | ||
| 347 | if cjk: | ||
| 348 | return "zh" | ||
| 349 | if latin: | ||
| 350 | return "latin" | ||
| 351 | return "other" | ||
| 352 | |||
| 353 | |||
| 354 | def _is_foreign_language(language: str) -> bool: | ||
| 355 | return language in {"latin", "jp", "kr", "other"} | ||
| 356 | |||
| 357 | |||
| 358 | def _is_noise_line(line: str) -> bool: | ||
| 359 | if _CREDIT_PREFIX_RE.search(line) or _WATERMARK_RE.search(line): | ||
| 360 | return True | ||
| 361 | has_cjk_or_latin = bool(_CJK_RE.search(line) or _LATIN_RE.search(line)) | ||
| 362 | if not has_cjk_or_latin: | ||
| 363 | return True | ||
| 364 | compact = _strip_symbols(line) | ||
| 365 | return len(compact) <= 1 | ||
| 366 | |||
| 367 | |||
| 368 | def _strip_symbols(line: str) -> str: | ||
| 369 | punctuation = string.punctuation + ",。!?;:、“”‘’·…—~!¥()【】《》〈〉「」『』﹏" | ||
| 370 | line = "".join(" " if char in punctuation else char for char in line) | ||
| 371 | line = re.sub(r"\s+", " ", line) | ||
| 372 | line = re.sub(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])", "", line) | ||
| 373 | return line.strip() | ||
| 374 | |||
| 375 | |||
| 376 | def _token_units(line: str) -> list[str]: | ||
| 377 | # Each Latin/digit run and each single CJK character is one unit. | ||
| 378 | return [match.group(0).lower() for match in _WORD_RE.finditer(line)] |
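Note on `lyric_tokens`: each line is split into units (one per Latin/digit run, one per CJK character) and then shingled into n-grams of three units, with shorter lines kept as a single token. The sketch below is a simplified standalone version of that idea plus a plain Jaccard comparison. `shingles` and `jaccard` are hypothetical names; the scoring in `checker.py` is not shown in this part of the diff and may differ, including in how empty sets are handled.

```python
import re

# Same unit definition as _WORD_RE: Latin/digit runs or single CJK characters.
_UNIT_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]")


def shingles(line: str, n: int = 3) -> set[str]:
    """Return the set of n-unit shingles for one normalized line."""
    units = _UNIT_RE.findall(line.lower())
    if len(units) < n:
        return {" ".join(units)} if units else set()
    return {" ".join(units[i : i + n]) for i in range(len(units) - n + 1)}


def jaccard(left: set[str], right: set[str]) -> float:
    """Intersection over union; two empty sets score 0.0 by assumption."""
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


if __name__ == "__main__":
    a = shingles("城市的灯慢慢亮起")
    b = shingles("城市灯火慢慢亮起")
    # The two lines share only the trailing trigrams.
    print(sorted(a & b), jaccard(a, b))
```

A small wording change inside a short line removes most of its trigrams, which is one reason line-level coverage is tracked alongside the n-gram score.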
scripts/process_library.py
0 → 100644
| 1 | """Process newly added lyric library files. | ||
| 2 | |||
| 3 | This script is intended for the recurring workflow after adding files to | ||
| 4 | ``data/library``: | ||
| 5 | |||
| 6 | 1. Move pure-music placeholder lyric files out of the active library. | ||
| 7 | 2. Rebuild the duplicate-checking index. | ||
| 8 | 3. Optionally regenerate and evaluate a synthetic regression set. | ||
| 9 | """ | ||
| 10 | |||
| 11 | from __future__ import annotations | ||
| 12 | |||
| 13 | import argparse | ||
| 14 | import csv | ||
| 15 | import json | ||
| 16 | import shutil | ||
| 17 | import sys | ||
| 18 | from datetime import datetime | ||
| 19 | from pathlib import Path | ||
| 20 | |||
| 21 | PROJECT_ROOT = Path(__file__).resolve().parents[1] | ||
| 22 | if str(PROJECT_ROOT) not in sys.path: | ||
| 23 | sys.path.insert(0, str(PROJECT_ROOT)) | ||
| 24 | |||
| 25 | from lyric_dedup.checker import DuplicateChecker | ||
| 26 | from lyric_dedup.cli import evaluate_csv | ||
| 27 | from lyric_dedup.eval_dataset import generate_eval_set | ||
| 28 | from lyric_dedup.file_import import iter_lyric_files | ||
| 29 | from lyric_dedup.file_import import read_lyric_file | ||
| 30 | from lyric_dedup.file_import import records_from_dir | ||
| 31 | from lyric_dedup.normalization import normalize_lyrics | ||
| 32 | |||
| 33 | |||
| 34 | PLACEHOLDER_MARKERS = ( | ||
| 35 | "【曲库专用】", | ||
| 36 | "此歌曲为没有填词的纯音乐", | ||
| 37 | ) | ||
| 38 | |||
| 39 | |||
| 40 | def main() -> None: | ||
| 41 | parser = argparse.ArgumentParser(description="Process lyric library additions.") | ||
| 42 | parser.add_argument("--library-dir", default="data/library") | ||
| 43 | parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl") | ||
| 44 | parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders") | ||
| 45 | parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.") | ||
| 46 | parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.") | ||
| 47 | parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.") | ||
| 48 | parser.add_argument("--positive-ratio", type=float, default=0.2) | ||
| 49 | parser.add_argument("--eval-dir", default="data/generated_eval/incoming") | ||
| 50 | parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv") | ||
| 51 | parser.add_argument("--eval-out", default="outputs/results/library_eval.csv") | ||
| 52 | parser.add_argument("--report", default="outputs/results/library_process_report.json") | ||
| 53 | args = parser.parse_args() | ||
| 54 | |||
| 55 | library_dir = Path(args.library_dir) | ||
| 56 | quarantine_dir = Path(args.quarantine_dir) | ||
| 57 | report_path = Path(args.report) | ||
| 58 | |||
| 59 | files_before = iter_lyric_files(library_dir) | ||
| 60 | placeholders = _find_placeholder_files(library_dir) | ||
| 61 | short_effective = _effective_line_report(library_dir) | ||
| 62 | |||
| 63 | moved_or_deleted: list[str] = [] | ||
| 64 | if not args.dry_run: | ||
| 65 | moved_or_deleted = _handle_placeholders( | ||
| 66 | placeholders, | ||
| 67 | library_dir=library_dir, | ||
| 68 | quarantine_dir=quarantine_dir, | ||
| 69 | delete=args.delete_placeholders, | ||
| 70 | ) | ||
| 71 | _build_index(library_dir, Path(args.index)) | ||
| 72 | |||
| 73 | if args.eval_size > 0 and not args.dry_run: | ||
| 74 | generate_eval_set( | ||
| 75 | library_dir=library_dir, | ||
| 76 | output_dir=Path(args.eval_dir), | ||
| 77 | csv_path=Path(args.eval_csv), | ||
| 78 | size=args.eval_size, | ||
| 79 | positive_ratio=args.positive_ratio, | ||
| 80 | ) | ||
| 81 | evaluate_csv( | ||
| 82 | Path(args.index), | ||
| 83 | Path(args.eval_csv), | ||
| 84 | Path(args.eval_out), | ||
| 85 | base_dir=Path(args.eval_csv).parent, | ||
| 86 | positive_decisions={"duplicate"}, | ||
| 87 | max_candidates=5, | ||
| 88 | ) | ||
| 89 | evaluate_csv( | ||
| 90 | Path(args.index), | ||
| 91 | Path(args.eval_csv), | ||
| 92 | Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"), | ||
| 93 | base_dir=Path(args.eval_csv).parent, | ||
| 94 | positive_decisions={"duplicate", "review"}, | ||
| 95 | max_candidates=5, | ||
| 96 | ) | ||
| 97 | |||
| 98 | report = { | ||
| 99 | "timestamp": datetime.now().isoformat(timespec="seconds"), | ||
| 100 | "dry_run": args.dry_run, | ||
| 101 | "library_dir": str(library_dir), | ||
| 102 | "files_before": len(files_before), | ||
| 103 | "placeholder_matches": len(placeholders), | ||
| 104 | "placeholder_files": [str(path) for path in placeholders], | ||
| 105 | "handled_placeholder_files": moved_or_deleted, | ||
| 106 | "files_after": len(iter_lyric_files(library_dir)), | ||
| 107 | "index": str(args.index), | ||
| 108 | "eval_size": args.eval_size, | ||
| 109 | "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "", | ||
| 110 | "eval_out": str(args.eval_out) if args.eval_size > 0 else "", | ||
| 111 | "short_effective_line_counts": short_effective, | ||
| 112 | } | ||
| 113 | |||
| 114 | print(json.dumps(report, ensure_ascii=False, indent=2)) | ||
| 115 | if not args.dry_run: | ||
| 116 | report_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 117 | report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") | ||
| 118 | |||
| 119 | |||
| 120 | def _find_placeholder_files(library_dir: Path) -> list[Path]: | ||
| 121 | matches: list[Path] = [] | ||
| 122 | for path in iter_lyric_files(library_dir): | ||
| 123 | text = read_lyric_file(path) | ||
| 124 | if any(marker in text for marker in PLACEHOLDER_MARKERS): | ||
| 125 | matches.append(path) | ||
| 126 | return matches | ||
| 127 | |||
| 128 | |||
| 129 | def _handle_placeholders( | ||
| 130 | placeholders: list[Path], | ||
| 131 | *, | ||
| 132 | library_dir: Path, | ||
| 133 | quarantine_dir: Path, | ||
| 134 | delete: bool, | ||
| 135 | ) -> list[str]: | ||
| 136 | handled: list[str] = [] | ||
| 137 | if not placeholders: | ||
| 138 | return handled | ||
| 139 | if not delete: | ||
| 140 | quarantine_dir.mkdir(parents=True, exist_ok=True) | ||
| 141 | for path in placeholders: | ||
| 142 | if delete: | ||
| 143 | path.unlink() | ||
| 144 | handled.append(f"deleted:{path}") | ||
| 145 | continue | ||
| 146 | relative = path.resolve().relative_to(library_dir.resolve()) | ||
| 147 | destination = quarantine_dir / relative | ||
| 148 | destination.parent.mkdir(parents=True, exist_ok=True) | ||
| 149 | if destination.exists(): | ||
| 150 | destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}") | ||
| 151 | shutil.move(str(path), str(destination)) | ||
| 152 | handled.append(f"moved:{path}->{destination}") | ||
| 153 | return handled | ||
| 154 | |||
| 155 | |||
| 156 | def _build_index(library_dir: Path, index_path: Path) -> None: | ||
| 157 | checker = DuplicateChecker() | ||
| 158 | for record in records_from_dir(library_dir): | ||
| 159 | checker.add_record(record) | ||
| 160 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 161 | checker.save(index_path) | ||
| 162 | |||
| 163 | |||
| 164 | def _effective_line_report(library_dir: Path) -> dict[str, int]: | ||
| 165 | buckets = { | ||
| 166 | "total": 0, | ||
| 167 | "zero_effective_lines": 0, | ||
| 168 | "one_to_three_effective_lines": 0, | ||
| 169 | "four_to_five_effective_lines": 0, | ||
| 170 | "six_plus_effective_lines": 0, | ||
| 171 | } | ||
| 172 | for path in iter_lyric_files(library_dir): | ||
| 173 | buckets["total"] += 1 | ||
| 174 | normalized = normalize_lyrics(read_lyric_file(path)) | ||
| 175 | line_count = len(normalized.primary_lines or normalized.unique_lines) | ||
| 176 | if line_count == 0: | ||
| 177 | buckets["zero_effective_lines"] += 1 | ||
| 178 | elif line_count <= 3: | ||
| 179 | buckets["one_to_three_effective_lines"] += 1 | ||
| 180 | elif line_count <= 5: | ||
| 181 | buckets["four_to_five_effective_lines"] += 1 | ||
| 182 | else: | ||
| 183 | buckets["six_plus_effective_lines"] += 1 | ||
| 184 | return buckets | ||
| 185 | |||
| 186 | |||
| 187 | if __name__ == "__main__": | ||
| 188 | main() |
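Note on `_handle_placeholders`: moved files keep their path relative to the library, and a timestamp suffix is appended when the destination already exists. The sketch below exercises that behavior against a temporary directory. `quarantine` and `demo` are hypothetical helpers written for this illustration, not functions from the script.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def quarantine(path: Path, library_dir: Path, quarantine_dir: Path) -> Path:
    """Move `path` under `quarantine_dir`, preserving its library-relative path."""
    relative = path.resolve().relative_to(library_dir.resolve())
    destination = quarantine_dir / relative
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.exists():
        # Avoid overwriting an earlier quarantined copy of the same file.
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        destination = destination.with_name(f"{destination.stem}_{stamp}{destination.suffix}")
    shutil.move(str(path), str(destination))
    return destination


def demo() -> list[str]:
    """Quarantine the same relative path twice and return both final names."""
    with tempfile.TemporaryDirectory() as tmp:
        library = Path(tmp) / "library"
        target = Path(tmp) / "quarantine"
        names = []
        for _ in range(2):
            source = library / "album" / "intro.lrc"
            source.parent.mkdir(parents=True, exist_ok=True)
            source.write_text("placeholder", encoding="utf-8")
            names.append(quarantine(source, library, target).name)
        return names


if __name__ == "__main__":
    print(demo())
```

The suffix has one-second resolution, so a third collision within the same second would still target an existing name; a counter or a UUID would be a possible alternative if that case matters.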
tests/test_lyric_dedup.py
0 → 100644
| 1 | import csv | ||
| 2 | |||
| 3 | from lyric_dedup import DuplicateChecker | ||
| 4 | from lyric_dedup import DuplicateDecision | ||
| 5 | from lyric_dedup import LyricRecord | ||
| 6 | from lyric_dedup.cli import evaluate_csv | ||
| 7 | from lyric_dedup.eval_dataset import generate_eval_set | ||
| 8 | from lyric_dedup.file_import import record_from_file | ||
| 9 | from lyric_dedup.normalization import normalize_lyrics | ||
| 10 | |||
| 11 | |||
| 12 | BASE_LYRIC = """ | ||
| 13 | [00:01.00]作词:Someone | ||
| 14 | [00:02.00]我爱你在每个夜里 | ||
| 15 | [00:03.00]听风说话也听见你 | ||
| 16 | [00:04.00]城市的灯慢慢亮起 | ||
| 17 | [00:05.00]我把回忆写进歌曲 | ||
| 18 | [00:06.00]啦啦啦 我们不分离 | ||
| 19 | [00:07.00]啦啦啦 我们不分离 | ||
| 20 | [00:08.00]明天还会继续想你 | ||
| 21 | """ | ||
| 22 | |||
| 23 | |||
| 24 | def test_normalization_removes_lyric_noise_and_simplifies() -> None: | ||
| 25 | normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n") | ||
| 26 | |||
| 27 | assert normalized.normalized_lines == ("我爱你", "听风说话") | ||
| 28 | assert normalized.normalized_full_text == "我爱你\n听风说话" | ||
| 29 | assert normalized.primary_lines == ("我爱你", "听风说话") | ||
| 30 | |||
| 31 | |||
| 32 | def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None: | ||
| 33 | checker = DuplicateChecker() | ||
| 34 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | ||
| 35 | |||
| 36 | result = checker.check( | ||
| 37 | """ | ||
| 38 | 我愛你,在每個夜裡!!! | ||
| 39 | 聽風說話,也聽見你 | ||
| 40 | 城市的燈慢慢亮起 | ||
| 41 | 我把回憶寫進歌曲 | ||
| 42 | 啦啦啦 我們不分離 | ||
| 43 | 明天還會繼續想你 | ||
| 44 | """ | ||
| 45 | ) | ||
| 46 | |||
| 47 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 48 | assert result.confidence == 1.0 | ||
| 49 | assert result.candidates[0].record_id == "song-1" | ||
| 50 | |||
| 51 | |||
| 52 | def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: | ||
| 53 | checker = DuplicateChecker() | ||
| 54 | checker.add_record( | ||
| 55 | LyricRecord( | ||
| 56 | "song-1", | ||
| 57 | """ | ||
| 58 | 海边的风吹过旧信 | ||
| 59 | 你说夏天不会远去 | ||
| 60 | 啦啦啦 我们不分离 | ||
| 61 | 啦啦啦 我们不分离 | ||
| 62 | 转身以后各自旅行 | ||
| 63 | """, | ||
| 64 | ) | ||
| 65 | ) | ||
| 66 | |||
| 67 | result = checker.check( | ||
| 68 | """ | ||
| 69 | 山谷的雨落在清晨 | ||
| 70 | 我把名字交给星辰 | ||
| 71 | 啦啦啦 我们不分离 | ||
| 72 | 啦啦啦 我们不分离 | ||
| 73 | 世界安静等一个人 | ||
| 74 | """ | ||
| 75 | ) | ||
| 76 | |||
| 77 | assert result.decision == DuplicateDecision.REVIEW | ||
| 78 | assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重" | ||
| 79 | |||
| 80 | |||
| 81 | def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: | ||
| 82 | checker = DuplicateChecker() | ||
| 83 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | ||
| 84 | |||
| 85 | result = checker.check( | ||
| 86 | """ | ||
| 87 | 我爱你在每个夜里 | ||
| 88 | 听风说话也听见你 | ||
| 89 | 城市灯火慢慢亮起 | ||
| 90 | 我把回忆写进歌曲 | ||
| 91 | 啦啦啦 我们不分离 | ||
| 92 | 明天还会继续想你 | ||
| 93 | """ | ||
| 94 | ) | ||
| 95 | |||
| 96 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 97 | assert result.candidates[0].jaccard >= 0.78 | ||
| 98 | assert result.candidates[0].line_coverage >= 0.72 | ||
| 99 | |||
| 100 | |||
| 101 | def test_fragment_of_full_song_is_not_duplicate() -> None: | ||
| 102 | checker = DuplicateChecker() | ||
| 103 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | ||
| 104 | |||
| 105 | result = checker.check( | ||
| 106 | """ | ||
| 107 | 听风说话也听见你 | ||
| 108 | 城市的灯慢慢亮起 | ||
| 109 | 我把回忆写进歌曲 | ||
| 110 | """ | ||
| 111 | ) | ||
| 112 | |||
| 113 | assert result.decision != DuplicateDecision.DUPLICATE | ||
| 114 | assert result.candidates[0].primary_line_coverage < 0.72 | ||
| 115 | |||
| 116 | |||
| 117 | def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision() -> None: | ||
| 118 | placeholder = """ | ||
| 119 | 作词:DJ金木 | ||
| 120 | 作曲:DJ金木 | ||
| 121 | 编曲:DJ金木 | ||
| 122 | 混音:DJ金木 | ||
| 123 | 【未经著作权人许可 不得翻唱 翻录或使用】 | ||
| 124 | """ | ||
| 125 | checker = DuplicateChecker() | ||
| 126 | checker.add_record(LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木")) | ||
| 127 | checker.add_record(LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木")) | ||
| 128 | |||
| 129 | same_song = checker.check_record( | ||
| 130 | LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木") | ||
| 131 | ) | ||
| 132 | different_title = checker.check_record( | ||
| 133 | LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木") | ||
| 134 | ) | ||
| 135 | |||
| 136 | assert same_song.decision == DuplicateDecision.DUPLICATE | ||
| 137 | assert same_song.reason == "无有效歌词,使用文件内容兜底指纹命中" | ||
| 138 | assert different_title.decision == DuplicateDecision.DUPLICATE | ||
| 139 | |||
| 140 | |||
| 141 | def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> None: | ||
| 142 | source = """ | ||
| 143 | 作词:DJ金木 | ||
| 144 | 作曲:DJ金木 | ||
| 145 | 编曲:DJ金木 | ||
| 146 | 混音:DJ金木 | ||
| 147 | 【未经著作权人许可 不得翻唱 翻录或使用】 | ||
| 148 | """ | ||
| 149 | noisy = """ | ||
| 150 | [00:01.00]歌词来自QQ音乐 | ||
| 151 | [00:02.00]作词:测试 | ||
| 152 | [00:03.00]作词:DJ金木! | ||
| 153 | [00:04.00]作曲:DJ金木... | ||
| 154 | [00:05.00]未经著作权人许可 不得翻唱 | ||
| 155 | """ | ||
| 156 | checker = DuplicateChecker() | ||
| 157 | checker.add_record(LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木")) | ||
| 158 | |||
| 159 | result = checker.check_record(LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木")) | ||
| 160 | |||
| 161 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 162 | assert result.reason == "无有效歌词,文件内容兜底特征高度相似" | ||
| 163 | |||
| 164 | |||
| 165 | def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: | ||
| 166 | checker = DuplicateChecker() | ||
| 167 | checker.add_record( | ||
| 168 | LyricRecord( | ||
| 169 | "song-1", | ||
| 170 | """ | ||
| 171 | 歌词来自QQ音乐 | ||
| 172 | 北方的雪落在窗前 | ||
| 173 | 我等一封迟来的信 | ||
| 174 | """, | ||
| 175 | ) | ||
| 176 | ) | ||
| 177 | |||
| 178 | result = checker.check( | ||
| 179 | """ | ||
| 180 | 歌词来自QQ音乐 | ||
| 181 | 南方的雨穿过街心 | ||
| 182 | 你把故事说给云听 | ||
| 183 | """ | ||
| 184 | ) | ||
| 185 | |||
| 186 | assert result.decision == DuplicateDecision.NEW | ||
| 187 | assert result.candidates == () | ||
| 188 | |||
| 189 | |||
| 190 | def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: | ||
| 191 | checker = DuplicateChecker() | ||
| 192 | checker.add_record( | ||
| 193 | LyricRecord( | ||
| 194 | "song-1", | ||
| 195 | """ | ||
| 196 | say hello 在风里 | ||
| 197 | hold me close tonight | ||
| 198 | 我们穿过蓝色街道 | ||
| 199 | never let me go | ||
| 200 | """, | ||
| 201 | ) | ||
| 202 | ) | ||
| 203 | |||
| 204 | result = checker.check( | ||
| 205 | """ | ||
| 206 | say hello 在风里 | ||
| 207 | hold me close tonight | ||
| 208 | 我们穿过蓝色街道 | ||
| 209 | never let me go | ||
| 210 | """ | ||
| 211 | ) | ||
| 212 | |||
| 213 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 214 | |||
| 215 | |||
| 216 | def test_checker_can_persist_index(tmp_path) -> None: | ||
| 217 | index_path = tmp_path / "lyrics.pkl" | ||
| 218 | checker = DuplicateChecker() | ||
| 219 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | ||
| 220 | checker.save(index_path) | ||
| 221 | |||
| 222 | loaded = DuplicateChecker.load(index_path) | ||
| 223 | result = loaded.check(BASE_LYRIC) | ||
| 224 | |||
| 225 | assert loaded.record_count == 1 | ||
| 226 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 227 | |||
| 228 | |||
| 229 | def test_record_from_lrc_file(tmp_path) -> None: | ||
| 230 | lyric_file = tmp_path / "周杰伦 - 测试歌.lrc" | ||
| 231 | lyric_file.write_text("[00:01.00]我愛你\n", encoding="utf-8") | ||
| 232 | |||
| 233 | record = record_from_file(lyric_file, base_dir=tmp_path) | ||
| 234 | |||
| 235 | assert record.title == "测试歌" | ||
| 236 | assert record.artist == "周杰伦" | ||
| 237 | assert record.lyrics == "[00:01.00]我愛你\n" | ||
| 238 | |||
| 239 | |||
| 240 | def test_record_from_song_artist_lyrics_filename(tmp_path) -> None: | ||
| 241 | lyric_file = tmp_path / "Amnesia(House)-DJ金木-歌词.txt" | ||
| 242 | lyric_file.write_text("作词:DJ金木\n", encoding="utf-8") | ||
| 243 | |||
| 244 | record = record_from_file(lyric_file, base_dir=tmp_path) | ||
| 245 | |||
| 246 | assert record.title == "Amnesia(House)" | ||
| 247 | assert record.artist == "DJ金木" | ||
| 248 | |||
| 249 | |||
| 250 | def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: | ||
| 251 | library = tmp_path / "library" | ||
| 252 | incoming = tmp_path / "incoming" | ||
| 253 | library.mkdir() | ||
| 254 | incoming.mkdir() | ||
| 255 | (library / "歌手A - 夜里.lrc").write_text(BASE_LYRIC, encoding="utf-8") | ||
| 256 | (incoming / "dup.lrc").write_text(BASE_LYRIC.replace("我爱你", "我愛你"), encoding="utf-8") | ||
| 257 | (incoming / "new.txt").write_text("南方的雨穿过街心\n你把故事说给云听\n", encoding="utf-8") | ||
| 258 | |||
| 259 | checker = DuplicateChecker() | ||
| 260 | checker.add_record(record_from_file(library / "歌手A - 夜里.lrc", base_dir=library)) | ||
| 261 | index_path = tmp_path / "lyrics.pkl" | ||
| 262 | checker.save(index_path) | ||
| 263 | |||
| 264 | eval_csv = tmp_path / "eval.csv" | ||
| 265 | eval_csv.write_text( | ||
| 266 | "id,file,expected\n" | ||
| 267 | "case-1,incoming/dup.lrc,应去重\n" | ||
| 268 | "case-2,incoming/new.txt,不应去重\n", | ||
| 269 | encoding="utf-8", | ||
| 270 | ) | ||
| 271 | out_path = tmp_path / "eval_out.csv" | ||
| 272 | |||
| 273 | evaluate_csv( | ||
| 274 | index_path, | ||
| 275 | eval_csv, | ||
| 276 | out_path, | ||
| 277 | base_dir=tmp_path, | ||
| 278 | positive_decisions={"duplicate"}, | ||
| 279 | max_candidates=5, | ||
| 280 | ) | ||
| 281 | |||
| 282 | rows = list(csv.DictReader(out_path.open(encoding="utf-8"))) | ||
| 283 | assert [row["correct"] for row in rows] == ["True", "True"] | ||
| 284 | assert rows[0]["reason"] == "规范化后的原文歌词哈希完全一致" | ||
| 285 | assert (tmp_path / "eval_out.csv.summary.json").exists() | ||
| 286 | |||
| 287 | |||
| 288 | def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None: | ||
| 289 | library = tmp_path / "library" | ||
| 290 | incoming = tmp_path / "generated" / "incoming" | ||
| 291 | eval_csv = tmp_path / "generated" / "eval.csv" | ||
| 292 | library.mkdir() | ||
| 293 | (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8") | ||
| 294 | |||
| 295 | generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5) | ||
| 296 | |||
| 297 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) | ||
| 298 | positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"} | ||
| 299 | fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"] | ||
| 300 | |||
| 301 | assert "trimmed_version" not in positive_types | ||
| 302 | assert "single_song_fragment" not in positive_types | ||
| 303 | assert fragment_rows | ||
| 304 | assert all(row["expected"] == "不应去重" for row in fragment_rows) | ||
| 305 | |||
| 306 | |||
| 307 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | ||
| 308 | checker = DuplicateChecker() | ||
| 309 | checker.add_record( | ||
| 310 | LyricRecord( | ||
| 311 | "song-1", | ||
| 312 | """ | ||
| 313 | I miss you tonight | ||
| 314 | Under the moonlight | ||
| 315 | Never let me go | ||
| 316 | """, | ||
| 317 | ) | ||
| 318 | ) | ||
| 319 | |||
| 320 | result = checker.check( | ||
| 321 | """ | ||
| 322 | I miss you tonight | ||
| 323 | 今晚我想你 | ||
| 324 | Under the moonlight | ||
| 325 | 月光之下 | ||
| 326 | Never let me go | ||
| 327 | 永远不要让我离开 | ||
| 328 | """ | ||
| 329 | ) | ||
| 330 | |||
| 331 | assert result.decision == DuplicateDecision.DUPLICATE | ||
| 332 | assert result.reason == "规范化后的原文歌词哈希完全一致,翻译行未参与自动判重" | ||
| 333 | |||
| 334 | |||
| 335 | def test_same_timestamp_translation_split_is_high_confidence() -> None: | ||
| 336 | normalized = normalize_lyrics( | ||
| 337 | """ | ||
| 338 | [00:01.00]I miss you tonight | ||
| 339 | [00:01.00]今晚我想你 | ||
| 340 | [00:02.00]Under the moonlight | ||
| 341 | [00:02.00]月光之下 | ||
| 342 | """ | ||
| 343 | ) | ||
| 344 | |||
| 345 | assert normalized.primary_lines == ("i miss you tonight", "under the moonlight") | ||
| 346 | assert normalized.translation_lines == ("今晚我想你", "月光之下") | ||
| 347 | assert normalized.split_confidence == "high" | ||
| 348 | |||
| 349 | |||
| 350 | def test_translation_only_overlap_is_review_not_duplicate() -> None: | ||
| 351 | checker = DuplicateChecker() | ||
| 352 | checker.add_record( | ||
| 353 | LyricRecord( | ||
| 354 | "song-1", | ||
| 355 | """ | ||
| 356 | I miss you tonight | ||
| 357 | 今晚我想你 | ||
| 358 | Under the moonlight | ||
| 359 | 月光之下 | ||
| 360 | Never let me go | ||
| 361 | 永远不要让我离开 | ||
| 362 | """, | ||
| 363 | ) | ||
| 364 | ) | ||
| 365 | |||
| 366 | result = checker.check( | ||
| 367 | """ | ||
| 368 | Te extrano esta noche | ||
| 369 | 今晚我想你 | ||
| 370 | Bajo la luna | ||
| 371 | 月光之下 | ||
| 372 | No me dejes ir | ||
| 373 | 永远不要让我离开 | ||
| 374 | """ | ||
| 375 | ) | ||
| 376 | |||
| 377 | assert result.decision == DuplicateDecision.REVIEW | ||
| 378 | assert result.reason == "仅翻译行相似,原文字面重合不足,不自动判重" | ||
| 379 | assert result.candidates[0].translation_jaccard >= 0.45 | ||
| 380 | |||
| 381 | |||
| 382 | def test_block_translation_split_is_review_when_primary_matches() -> None: | ||
| 383 | checker = DuplicateChecker() | ||
| 384 | checker.add_record( | ||
| 385 | LyricRecord( | ||
| 386 | "song-1", | ||
| 387 | """ | ||
| 388 | I miss you tonight | ||
| 389 | Under the moonlight | ||
| 390 | Never let me go | ||
| 391 | """, | ||
| 392 | ) | ||
| 393 | ) | ||
| 394 | |||
| 395 | result = checker.check( | ||
| 396 | """ | ||
| 397 | I miss you tonight | ||
| 398 | Under the moonlight | ||
| 399 | Never let me go | ||
| 400 | 今晚我想你 | ||
| 401 | 月光之下 | ||
| 402 | 永远不要让我离开 | ||
| 403 | """ | ||
| 404 | ) | ||
| 405 | |||
| 406 | assert result.decision == DuplicateDecision.REVIEW | ||
| 407 | assert result.reason == "原文哈希一致,但疑似整段翻译结构拆分置信度较低,需要人工复核" |
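
The thresholds asserted in these tests (`jaccard >= 0.78`, `line_coverage >= 0.72`) are easier to read with the two measures spelled out. The sketch below is not the `lyric_dedup` implementation: `char_ngrams`, `jaccard`, and `line_coverage` are hypothetical helpers, and both the n-gram size and the choice to divide by the longer side's line count are assumptions, picked so that a short fragment of a full song scores low, as `test_fragment_of_full_song_is_not_duplicate` expects.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams over the text with all whitespace removed (assumed tokenization)."""
    compact = "".join(text.split())
    if len(compact) < n:
        # Text shorter than n yields a single gram, or nothing if empty.
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def line_coverage(query_lines: list[str], candidate_lines: list[str]) -> float:
    """Shared unique lines divided by the larger side's unique line count (an assumption)."""
    query, candidate = set(query_lines), set(candidate_lines)
    if not query or not candidate:
        return 0.0
    return len(query & candidate) / max(len(query), len(candidate))


# Example data borrowed from the test inputs above.
full_song = [
    "我爱你在每个夜里",
    "听风说话也听见你",
    "城市灯火慢慢亮起",
    "我把回忆写进歌曲",
    "啦啦啦 我们不分离",
    "明天还会继续想你",
]
fragment = [
    "听风说话也听见你",
    "城市的灯慢慢亮起",  # differs from the full song's third line
    "我把回忆写进歌曲",
]

full_coverage = line_coverage(full_song, full_song)
fragment_coverage = line_coverage(fragment, full_song)
fragment_jaccard = jaccard(char_ngrams("".join(fragment)), char_ngrams("".join(full_song)))
```

Under this definition the fragment shares only two of the full song's six unique lines, so its coverage stays well below 0.72 even though every shared line matches exactly. A measure normalized by the query's own line count would rate the same fragment higher, which is why the normalization choice matters for the fragment case.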