Add lyric duplicate detection workflow
0 parents
Showing 12 changed files with 1023 additions and 0 deletions
.gitignore
0 → 100644
| 1 | .DS_Store | ||
| 2 | __pycache__/ | ||
| 3 | *.py[cod] | ||
| 4 | .pytest_cache/ | ||
| 5 | |||
| 6 | # Local lyric data and generated artifacts | ||
| 7 | data/ | ||
| 8 | outputs/ | ||
| 9 | downloaded_lyrics/ | ||
| 10 | downloaded_lyrics_type3/ | ||
| 11 | download_failed*.txt | ||
| 12 | |||
| 13 | # Local downloader / scratch utilities | ||
| 14 | download_lyrics.py | ||
| 15 | test_db_connection.py | ||
| 16 | *.env | ||
| 17 | |||
| 18 | # Reference project kept locally only | ||
| 19 | text-dedup-main/ | ||
| 20 | |||
| 21 | # Virtual environments and editor files | ||
| 22 | .venv/ | ||
| 23 | venv/ | ||
| 24 | .idea/ | ||
| 25 | .vscode/ |
README.md
0 → 100644
| 1 | # Lyric Duplicate Checker | ||
| 2 | |||
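| 3 | The first version targets duplicate checking for newly added lyrics: build an index from the existing `.lrc` / `.txt` lyrics, then query each incoming lyric against it to get `duplicate`, `review`, or `new`. | ||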
| 4 | |||
| 5 | ## Build the index | ||
| 6 | |||
| 7 | Assuming the existing library lives in `data/library/`: | ||
| 8 | |||
| 9 | ```bash | ||
| 10 | python -m lyric_dedup.cli build-index \ | ||
| 11 | --lyrics-dir data/library \ | ||
| 12 | --index outputs/indexes/lyrics.pkl | ||
| 13 | ``` | ||
| 14 | |||
| 15 | ## Check a single new lyric file | ||
| 16 | |||
| 17 | ```bash | ||
| 18 | python -m lyric_dedup.cli check-file \ | ||
| 19 | --index outputs/indexes/lyrics.pkl \ | ||
| 20 | --file data/incoming/new_song.lrc | ||
| 21 | ``` | ||
| 22 | |||
| 23 | ## Batch-check an incoming directory | ||
| 24 | |||
| 25 | ```bash | ||
| 26 | python -m lyric_dedup.cli batch-check \ | ||
| 27 | --index outputs/indexes/lyrics.pkl \ | ||
| 28 | --lyrics-dir data/incoming \ | ||
| 29 | --out outputs/results/incoming_check.csv | ||
| 30 | ``` | ||
| 31 | |||
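| 32 | Key columns in the CSV: | ||
| 33 | |||
| 34 | - `decision`: the overall verdict. | ||
| 35 | - `best_candidate_id`: the most similar existing lyric. | ||
| 36 | - `best_candidate_jaccard`: literal n-gram similarity. | ||
| 37 | - `best_candidate_line_coverage`: line-level coverage. | ||
| 38 | - `matched_unique_lines`: the normalized lyric lines that matched. | ||
| 39 | - `best_candidate_reason`: the reason for the verdict (written in Chinese), explaining why the sample was judged duplicate, review, or new. | ||
| 40 | |||
| 41 | Suggested production policy: `duplicate` can be blocked automatically; `review` goes to the manual review pool; `new` can still be spot-checked before ingestion. | ||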
| 42 | |||
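| 43 | ## Guard rules for original + Chinese translation lyrics | ||
| 44 | |||
| 45 | Lyric lines are currently split into three classes: | ||
| 46 | |||
| 47 | - `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these. | ||
| 48 | - `translation_lines`: Chinese translation lines; used only for recall and for explaining review decisions. | ||
| 49 | - `unknown_lines`: lines that cannot be classified reliably. | ||
| 50 | |||
| 51 | High-confidence splits: | ||
| 52 | |||
| 53 | - A foreign-language line and a Chinese line share the same timestamp. | ||
| 54 | - Several stable groups of alternating foreign-language line + Chinese line. | ||
| 55 | |||
| 56 | Medium-confidence splits: | ||
| 57 | |||
| 58 | - An obvious foreign-language / Chinese translation pair on a single line, e.g. `I miss you / 今晚我想你`. | ||
| 59 | |||
| 60 | Low-confidence splits: | ||
| 61 | |||
| 62 | - A whole block of foreign-language text followed by a whole block of Chinese translation. | ||
| 63 | |||
| 64 | Decision policy: | ||
| 65 | |||
| 66 | - If the original text matches closely, the result can be `duplicate` even when the new lyric adds a Chinese translation. | ||
| 67 | - If only the translation lines are similar and the original text is not similar enough, the result can only be `review`; it is never auto-flagged as a duplicate. | ||
| 68 | - A suspected whole-block translation layout is a low-confidence split, so the result is `review` even if the original-text hash matches. | ||
| 69 | - For ordinary Chinese songs with no detected translation structure, all effective lines are treated as original text. | ||
| 70 | |||
| 71 | Because the index stores the split original/translation features, rebuild the index after changing the split rules. | ||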
| 72 | |||
| 73 | ## Evaluate accuracy with a labeled CSV | ||
| 74 | |||
| 75 | You can first generate a batch of evaluation samples automatically from the existing library: | ||
| 76 | |||
| 77 | ```bash | ||
| 78 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 79 | --library-dir data/library \ | ||
| 80 | --lyrics-dir data/generated_eval/incoming \ | ||
| 81 | --csv data/generated_eval/eval_10.csv \ | ||
| 82 | --size 10 \ | ||
| 83 | --positive-ratio 0.6 | ||
| 84 | ``` | ||
| 85 | |||
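| 86 | Business rules used by the generator: | ||
| 87 | |||
| 88 | - `应去重` (should dedupe) samples are only style variations of the full lyrics, e.g. timestamps, punctuation, platform noise, blank lines, LRC styling, or an added Chinese translation. | ||
| 89 | - `不应去重` (should not dedupe) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity. | ||
| 90 | - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`. | ||
| 91 | |||
| 92 | First prepare a CSV, e.g. `data/eval/eval.csv`: | ||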
| 93 | |||
| 94 | ```csv | ||
| 95 | id,file,expected | ||
| 96 | case-001,incoming/song_a.lrc,应去重 | ||
| 97 | case-002,incoming/song_b.txt,不应去重 | ||
| 98 | ``` | ||
| 99 | |||
| 100 | Alternatively, skip the file path and put the lyrics directly in a `lyrics` column: | ||
| 101 | |||
| 102 | ```csv | ||
| 103 | id,lyrics,expected | ||
| 104 | case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate | ||
| 105 | case-004,"南方的雨穿过街心\n你把故事说给云听",new | ||
| 106 | ``` | ||
| 107 | |||
| 108 | `expected` accepts these values: | ||
| 109 | |||
| 110 | - Should dedupe: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes` | ||
| 111 | - Should not dedupe: `不应去重`, `不重复`, `new`, `0`, `false`, `no` | ||
| 112 | |||
| 113 | Run the evaluation: | ||
| 114 | |||
| 115 | ```bash | ||
| 116 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 117 | --index outputs/indexes/lyrics.pkl \ | ||
| 118 | --csv data/eval/eval.csv \ | ||
| 119 | --base-dir data \ | ||
| 120 | --out outputs/results/eval_result.csv | ||
| 121 | ``` | ||
| 122 | |||
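| 123 | By default, only a system output of `duplicate` counts as "predicted should dedupe". This suits measuring the precision of automatic blocking, because false positives stand out more clearly. | ||
| 124 | |||
| 125 | To measure recall of suspicious samples instead, where both `duplicate` and `review` count as hits: | ||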
| 126 | |||
| 127 | ```bash | ||
| 128 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 129 | --index outputs/indexes/lyrics.pkl \ | ||
| 130 | --csv data/eval/eval.csv \ | ||
| 131 | --base-dir data \ | ||
| 132 | --positive-decisions duplicate,review \ | ||
| 133 | --out outputs/results/eval_result_review_as_positive.csv | ||
| 134 | ``` | ||
| 135 | |||
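| 136 | Two files are generated: | ||
| 137 | |||
| 138 | - `outputs/results/eval_result.csv`: per-sample prediction, candidate, reason, and whether the prediction was correct. | ||
| 139 | - `outputs/results/eval_result.csv.summary.json`: overall metrics. | ||
| 140 | |||
| 141 | Key fields in the summary: | ||
| 142 | |||
| 143 | - `accuracy`: overall accuracy. | ||
| 144 | - `precision`: of the samples predicted as should-dedupe, the fraction that truly should be deduped. Look at this first for automatic blocking. | ||
| 145 | - `recall`: of the samples that truly should be deduped, the fraction the system caught. | ||
| 146 | - `f1`: the combined measure (harmonic mean) of precision and recall. | ||
| 147 | - `false_positive`: should not dedupe but predicted as should-dedupe; a wrongful block. | ||
| 148 | - `false_negative`: should dedupe but not caught; a miss. | ||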
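The summary fields described at the end of the README follow the usual confusion-matrix definitions, and the `--positive-decisions` option only changes which decisions count as a positive prediction. The sketch below illustrates that relationship. The function name `summarize` and the `(should_dedupe, decision)` row layout are assumptions made for illustration; they are not taken from the collapsed `lyric_dedup/cli.py`.

```python
from __future__ import annotations


def summarize(rows: list[tuple[bool, str]], positive_decisions: set[str]) -> dict[str, float]:
    """Derive confusion counts and metrics from (should_dedupe, decision) pairs."""
    tp = fp = tn = fn = 0
    for should_dedupe, decision in rows:
        # A decision counts as "predicted should dedupe" only if it is whitelisted.
        predicted = decision in positive_decisions
        if should_dedupe and predicted:
            tp += 1
        elif predicted:
            fp += 1  # wrongful block
        elif should_dedupe:
            fn += 1  # miss
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(rows) if rows else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "true_positive": tp,
        "false_positive": fp,
        "true_negative": tn,
        "false_negative": fn,
    }


# Hypothetical labeled rows: (truly should dedupe?, system decision).
rows = [(True, "duplicate"), (True, "review"), (False, "review"), (False, "new")]
strict = summarize(rows, {"duplicate"})             # only duplicate is positive
lenient = summarize(rows, {"duplicate", "review"})  # review also counts as a hit
print(strict["precision"], strict["recall"])
print(lenient["precision"], lenient["recall"])
```

On these made-up rows the strict setting trades recall for precision, and counting `review` as positive does the opposite, which is why the README suggests reading `false_positive` under one setting and `false_negative` under the other.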
TEST_WORKFLOW.md
0 → 100644
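| 1 | # Lyric Duplicate Check Test Workflow | ||
| 2 | |||
| 3 | This document records the full set of commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and inspecting the results. | ||
| 4 | |||
| 5 | ## 1. Prepare directories | ||
| 6 | |||
| 7 | The existing library goes in: | ||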
| 8 | |||
| 9 | ```text | ||
| 10 | data/library/ | ||
| 11 | ``` | ||
| 12 | |||
| 13 | Supported file types: | ||
| 14 | |||
| 15 | ```text | ||
| 16 | .lrc | ||
| 17 | .txt | ||
| 18 | ``` | ||
| 19 | |||
| 20 | Generated test samples are written to: | ||
| 21 | |||
| 22 | ```text | ||
| 23 | data/generated_eval/incoming/ | ||
| 24 | ``` | ||
| 25 | |||
| 26 | The labeled test-set CSV is written to: | ||
| 27 | |||
| 28 | ```text | ||
| 29 | data/generated_eval/eval_500.csv | ||
| 30 | ``` | ||
| 31 | |||
| 32 | Evaluation results are written to: | ||
| 33 | |||
| 34 | ```text | ||
| 35 | outputs/results/ | ||
| 36 | ``` | ||
| 37 | |||
| 38 | ## 2. Build the index for the existing library | ||
| 39 | |||
| 40 | If you have just added a batch of samples to `data/library`, run the processing script first: | ||
| 41 | |||
| 42 | ```bash | ||
| 43 | python scripts/process_library.py \ | ||
| 44 | --library-dir data/library \ | ||
| 45 | --index outputs/indexes/library_lyrics.pkl | ||
| 46 | ``` | ||
| 47 | |||
| 48 | The script will: | ||
| 49 | |||
| 50 | ```text | ||
| 51 | 1. Scan for and quarantine pure-music placeholder samples, e.g. files containing 【曲库专用】 or “此歌曲为没有填词的纯音乐”. | ||
| 52 | 2. Rebuild outputs/indexes/library_lyrics.pkl. | ||
| 53 | 3. Write the processing report outputs/results/library_process_report.json. | ||
| 54 | ``` | ||
| 55 | |||
| 56 | To preview which files would be handled without actually moving anything or rebuilding the index: | ||
| 57 | |||
| 58 | ```bash | ||
| 59 | python scripts/process_library.py \ | ||
| 60 | --library-dir data/library \ | ||
| 61 | --dry-run | ||
| 62 | ``` | ||
| 63 | |||
| 64 | To also generate and evaluate 1180 test samples in the same run: | ||
| 65 | |||
| 66 | ```bash | ||
| 67 | python scripts/process_library.py \ | ||
| 68 | --library-dir data/library \ | ||
| 69 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 70 | --eval-size 1180 \ | ||
| 71 | --positive-ratio 0.2 \ | ||
| 72 | --eval-csv data/generated_eval/eval_1180.csv \ | ||
| 73 | --eval-out outputs/results/library_eval_1180.csv | ||
| 74 | ``` | ||
| 75 | |||
| 76 | Quarantined files are moved by default to: | ||
| 77 | |||
| 78 | ```text | ||
| 79 | data/quarantine/no_lyrics_placeholders/ | ||
| 80 | ``` | ||
| 81 | |||
| 82 | You can also just build the index manually: | ||
| 83 | |||
| 84 | ```bash | ||
| 85 | python -m lyric_dedup.cli build-index \ | ||
| 86 | --lyrics-dir data/library \ | ||
| 87 | --index outputs/indexes/library_lyrics.pkl | ||
| 88 | ``` | ||
| 89 | |||
| 90 | Index file: | ||
| 91 | |||
| 92 | ```text | ||
| 93 | outputs/indexes/library_lyrics.pkl | ||
| 94 | ``` | ||
| 95 | |||
| 96 | Note: if you change `data/library`, or change the preprocessing or duplicate-decision logic, rerun this step. | ||
| 97 | |||
| 98 | ## 3. Generate 500 test samples | ||
| 99 | |||
| 100 | ```bash | ||
| 101 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 102 | --library-dir data/library \ | ||
| 103 | --lyrics-dir data/generated_eval/incoming \ | ||
| 104 | --csv data/generated_eval/eval_500.csv \ | ||
| 105 | --size 500 \ | ||
| 106 | --positive-ratio 0.2 | ||
| 107 | ``` | ||
| 108 | |||
| 109 | With `--size 500` and `--positive-ratio 0.2` the generator produces: | ||
| 110 | |||
| 111 | ```text | ||
| 112 | Should dedupe: 100 | ||
| 113 | Should not dedupe: 400 | ||
| 114 | ``` | ||
| 115 | |||
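| 116 | The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. | ||
| 117 | |||
| 118 | Business rules: | ||
| 119 | |||
| 120 | ```text | ||
| 121 | pos_* = should dedupe: style variations of the full lyrics | ||
| 122 | neg_* = should not dedupe: fragment / short-phrase collision / mixed fragments / same-theme new lyrics / translation-only similarity | ||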
| 123 | ``` | ||
| 124 | |||
| 125 | ## 4. Strict evaluation: only duplicate counts as dedupe | ||
| 126 | |||
| 127 | ```bash | ||
| 128 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 129 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 130 | --csv data/generated_eval/eval_500.csv \ | ||
| 131 | --base-dir data/generated_eval \ | ||
| 132 | --out outputs/results/library_eval_500.csv | ||
| 133 | ``` | ||
| 134 | |||
| 135 | Under this rule: | ||
| 136 | |||
| 137 | ```text | ||
| 138 | duplicate -> predicted should dedupe | ||
| 139 | review -> predicted should not dedupe | ||
| 140 | new -> predicted should not dedupe | ||
| 141 | ``` | ||
| 142 | |||
| 143 | Suited to measuring the precision of automatic blocking; focus on: | ||
| 144 | |||
| 145 | ```text | ||
| 146 | false_positive | ||
| 147 | ``` | ||
| 148 | |||
| 149 | ## 5. Recall evaluation: duplicate and review both count as catching a suspicious sample | ||
| 150 | |||
| 151 | ```bash | ||
| 152 | python -m lyric_dedup.cli evaluate-csv \ | ||
| 153 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 154 | --csv data/generated_eval/eval_500.csv \ | ||
| 155 | --base-dir data/generated_eval \ | ||
| 156 | --positive-decisions duplicate,review \ | ||
| 157 | --out outputs/results/library_eval_500_review_positive.csv | ||
| 158 | ``` | ||
| 159 | |||
| 160 | Under this rule: | ||
| 161 | |||
| 162 | ```text | ||
| 163 | duplicate -> predicted should dedupe | ||
| 164 | review -> predicted should dedupe | ||
| 165 | new -> predicted should not dedupe | ||
| 166 | ``` | ||
| 167 | |||
| 168 | Suited to measuring recall of suspicious samples; focus on: | ||
| 169 | |||
| 170 | ```text | ||
| 171 | false_negative | ||
| 172 | ``` | ||
| 173 | |||
| 174 | ## 6. View overall metrics | ||
| 175 | |||
| 176 | Strict rule: | ||
| 177 | |||
| 178 | ```bash | ||
| 179 | cat outputs/results/library_eval_500.csv.summary.json | ||
| 180 | ``` | ||
| 181 | |||
| 182 | Recall rule: | ||
| 183 | |||
| 184 | ```bash | ||
| 185 | cat outputs/results/library_eval_500_review_positive.csv.summary.json | ||
| 186 | ``` | ||
| 187 | |||
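| 188 | Metric definitions: | ||
| 189 | |||
| 190 | ```text | ||
| 191 | accuracy        overall accuracy | ||
| 192 | precision       of the samples predicted as should-dedupe, the fraction that truly should be deduped | ||
| 193 | recall          of the samples that truly should be deduped, the fraction the system caught | ||
| 194 | f1              combined measure (harmonic mean) of precision and recall | ||
| 195 | true_positive   should dedupe and predicted should dedupe | ||
| 196 | false_positive  should not dedupe but predicted should dedupe (wrongful block) | ||
| 197 | true_negative   should not dedupe and predicted should not dedupe | ||
| 198 | false_negative  should dedupe but predicted should not dedupe (miss) | ||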
| 199 | ``` | ||
| 200 | |||
| 201 | ## 7. View per-sample results | ||
| 202 | |||
| 203 | ```bash | ||
| 204 | open outputs/results/library_eval_500.csv | ||
| 205 | ``` | ||
| 206 | |||
| 207 | If `open` is not available, print the CSV directly: | ||
| 208 | |||
| 209 | ```bash | ||
| 210 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]' | ||
| 211 | ``` | ||
| 212 | |||
| 213 | ## 8. View failed samples | ||
| 214 | |||
| 215 | Failed samples under the strict rule: | ||
| 216 | |||
| 217 | ```bash | ||
| 218 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]' | ||
| 219 | ``` | ||
| 220 | |||
| 221 | View the full candidate list for one sample: | ||
| 222 | |||
| 223 | ```bash | ||
| 224 | python -m lyric_dedup.cli check-file \ | ||
| 225 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 226 | --file data/generated_eval/incoming/neg_068_mixed_fragments.txt \ | ||
| 227 | --max-candidates 10 | ||
| 228 | ``` | ||
| 229 | |||
| 230 | ## 9. Check the test-set distribution | ||
| 231 | |||
| 232 | ```bash | ||
| 233 | python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_500.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))' | ||
| 234 | ``` | ||
| 235 | |||
| 236 | Check the number of files in the generated directory: | ||
| 237 | |||
| 238 | ```bash | ||
| 239 | find data/generated_eval/incoming -type f | wc -l | ||
| 240 | ``` | ||
| 241 | |||
| 242 | ## 10. Run the code tests | ||
| 243 | |||
| 244 | ```bash | ||
| 245 | python -m pytest tests | ||
| 246 | ``` | ||
| 247 | |||
| 248 | Compile check: | ||
| 249 | |||
| 250 | ```bash | ||
| 251 | python -m compileall -q lyric_dedup tests | ||
| 252 | ``` | ||
| 253 | |||
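| 254 | ## 11. On test-set uniqueness | ||
| 255 | |||
| 256 | The automatically generated samples are a rule-coverage test set; they are not guaranteed to be mutually distinct after normalization. | ||
| 257 | |||
| 258 | If 100 test samples must be mutually distinct while still using the generator's default ratio: | ||
| 259 | |||
| 260 | ```text | ||
| 261 | size = 100 | ||
| 262 | positive_ratio = 0.6 | ||
| 263 | ``` | ||
| 264 | |||
| 265 | then you need at least: | ||
| 266 | |||
| 267 | ```text | ||
| 268 | 60 mutually distinct seed lyrics | ||
| 269 | ``` | ||
| 270 | |||
| 271 | Reason: should-dedupe samples are full-song variants, and several style variations of the same song are still the same song after normalization. | ||
| 272 | |||
| 273 | A more reliable way to measure real-world accuracy is to prepare a manually labeled CSV: | ||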
| 274 | |||
| 275 | ```csv | ||
| 276 | id,file,expected | ||
| 277 | case-001,incoming/song_a.lrc,应去重 | ||
| 278 | case-002,incoming/song_b.txt,不应去重 | ||
| 279 | ``` | ||
| 280 | |||
| 281 | Then run `evaluate-csv` from section 4 or section 5 directly. | ||
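The sample counts quoted in this workflow follow from the split in `generate_eval_set`, which computes `round(size * positive_ratio)` positives and assigns the remainder to negatives. The sketch below reproduces only that arithmetic. The helper names `split_counts` and `positives_are_distinct` are hypothetical, and the distinctness check assumes, as the generator code suggests, that positive samples cycle through the library files in order.

```python
def split_counts(size: int, positive_ratio: float) -> tuple[int, int]:
    """Return (positives, negatives) the way generate_eval_set splits a set."""
    positives = round(size * positive_ratio)
    return positives, size - positives


def positives_are_distinct(library_files: int, size: int, positive_ratio: float) -> bool:
    """True if every positive sample can come from a different seed lyric."""
    positives, _ = split_counts(size, positive_ratio)
    return library_files >= positives


print(split_counts(100, 0.6))    # generator defaults
print(split_counts(500, 0.2))    # arguments used in sections 3 to 5
print(split_counts(1180, 0.2))   # arguments used with process_library.py
# Python's round() rounds exact halves to the nearest even integer,
# so a tie such as 5 * 0.5 = 2.5 yields 2 positives, not 3.
print(split_counts(5, 0.5))
print(positives_are_distinct(library_files=60, size=100, positive_ratio=0.6))
```

With fewer seed lyrics than positives, the same song is reused with a different style variation, which is why the generated set is a rule-coverage set and not a set of distinct songs.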
lyric_dedup/__init__.py
0 → 100644
| 1 | """Lyric duplicate detection utilities.""" | ||
| 2 | |||
| 3 | from lyric_dedup.checker import DuplicateCheckResult | ||
| 4 | from lyric_dedup.checker import DuplicateChecker | ||
| 5 | from lyric_dedup.checker import DuplicateDecision | ||
| 6 | from lyric_dedup.checker import LyricRecord | ||
| 7 | |||
| 8 | __all__ = [ | ||
| 9 |     "DuplicateCheckResult", | ||
| 10 |     "DuplicateChecker", | ||
| 11 |     "DuplicateDecision", | ||
| 12 |     "LyricRecord", | ||
| 13 | ] |
lyric_dedup/checker.py
0 → 100644
This diff is collapsed.
lyric_dedup/cli.py
0 → 100644
This diff is collapsed.
lyric_dedup/eval_dataset.py
0 → 100644
| 1 | """Generate labeled evaluation samples from an existing lyric library.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import csv | ||
| 6 | import random | ||
| 7 | import re | ||
| 8 | from dataclasses import dataclass | ||
| 9 | from pathlib import Path | ||
| 10 | |||
| 11 | from lyric_dedup.file_import import iter_lyric_files | ||
| 12 | from lyric_dedup.file_import import read_lyric_file | ||
| 13 | from lyric_dedup.file_import import record_from_file | ||
| 14 | from lyric_dedup.normalization import normalize_lyrics | ||
| 15 | |||
| 16 | |||
| 17 | @dataclass(frozen=True) | ||
| 18 | class GeneratedSample: | ||
| 19 |     sample_id: str | ||
| 20 |     file: str | ||
| 21 |     expected: str | ||
| 22 |     sample_type: str | ||
| 23 |     source: str | ||
| 24 |     title: str = "" | ||
| 25 |     artist: str = "" | ||
| 26 | |||
| 27 | |||
| 28 | def generate_eval_set( | ||
| 29 |     *, | ||
| 30 |     library_dir: Path, | ||
| 31 |     output_dir: Path, | ||
| 32 |     csv_path: Path, | ||
| 33 |     size: int = 100, | ||
| 34 |     positive_ratio: float = 0.6, | ||
| 35 |     seed: int = 20260602, | ||
| 36 | ) -> dict[str, object]: | ||
| 37 |     rng = random.Random(seed) | ||
| 38 |     source_files = iter_lyric_files(library_dir) | ||
| 39 |     if not source_files: | ||
| 40 |         raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件") | ||
| 41 | |||
| 42 |     output_dir.mkdir(parents=True, exist_ok=True) | ||
| 43 |     csv_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 44 |     _clean_generated_output_dir(output_dir) | ||
| 45 | |||
| 46 |     positives = round(size * positive_ratio) | ||
| 47 |     negatives = size - positives | ||
| 48 |     samples: list[GeneratedSample] = [] | ||
| 49 |     for index in range(positives): | ||
| 50 |         source = source_files[index % len(source_files)] | ||
| 51 |         samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng)) | ||
| 52 |     for index in range(negatives): | ||
| 53 |         left = source_files[index % len(source_files)] | ||
| 54 |         right = source_files[(index + 1) % len(source_files)] | ||
| 55 |         samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng)) | ||
| 56 | |||
| 57 |     rng.shuffle(samples) | ||
| 58 |     with csv_path.open("w", encoding="utf-8", newline="") as file: | ||
| 59 |         writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"]) | ||
| 60 |         writer.writeheader() | ||
| 61 |         writer.writerows( | ||
| 62 |             { | ||
| 63 |                 "id": sample.sample_id, | ||
| 64 |                 "file": sample.file, | ||
| 65 |                 "expected": sample.expected, | ||
| 66 |                 "sample_type": sample.sample_type, | ||
| 67 |                 "source": sample.source, | ||
| 68 |                 "title": sample.title, | ||
| 69 |                 "artist": sample.artist, | ||
| 70 |             } | ||
| 71 |             for sample in samples | ||
| 72 |         ) | ||
| 73 | |||
| 74 |     return { | ||
| 75 |         "size": size, | ||
| 76 |         "positive": positives, | ||
| 77 |         "negative": negatives, | ||
| 78 |         "library_files": len(source_files), | ||
| 79 |         "lyrics_dir": str(output_dir), | ||
| 80 |         "csv": str(csv_path), | ||
| 81 |     } | ||
| 82 | |||
| 83 | |||
| 84 | def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | ||
| 85 |     raw = read_lyric_file(source) | ||
| 86 |     source_record = record_from_file(source) | ||
| 87 |     variants = [ | ||
| 88 |         ("exact_copy", raw), | ||
| 89 |         ("timestamped", _add_timestamps(_content_lines(raw))), | ||
| 90 |         ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)), | ||
| 91 |         ("with_platform_noise", _with_platform_noise(_content_lines(raw))), | ||
| 92 |         ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))), | ||
| 93 |         ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))), | ||
| 94 |         ("translation_added", _translation_added(_content_lines(raw))), | ||
| 95 |     ] | ||
| 96 |     sample_type, text = variants[(index - 1) % len(variants)] | ||
| 97 |     name = f"pos_{index:03d}_{sample_type}.txt" | ||
| 98 |     path = output_dir / name | ||
| 99 |     path.write_text(text, encoding="utf-8") | ||
| 100 |     return GeneratedSample( | ||
| 101 |         sample_id=f"pos-{index:03d}", | ||
| 102 |         file=str(path.relative_to(csv_base)), | ||
| 103 |         expected="应去重", | ||
| 104 |         sample_type=sample_type, | ||
| 105 |         source=str(source), | ||
| 106 |         title=source_record.title or "", | ||
| 107 |         artist=source_record.artist or "", | ||
| 108 |     ) | ||
| 109 | |||
| 110 | |||
| 111 | def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | ||
| 112 |     left_lines = _normalized_lines(left) | ||
| 113 |     right_lines = _normalized_lines(right) | ||
| 114 |     variants = [ | ||
| 115 |         ("single_song_fragment", _single_song_fragment(left_lines)), | ||
| 116 |         ("short_shared_snippet", _short_shared_snippet(left_lines, rng)), | ||
| 117 |         ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)), | ||
| 118 |         ("same_theme_synthetic", _same_theme_synthetic(index)), | ||
| 119 |         ("translation_only_like", _translation_only_like(left_lines)), | ||
| 120 |     ] | ||
| 121 |     sample_type, text = variants[(index - 1) % len(variants)] | ||
| 122 |     name = f"neg_{index:03d}_{sample_type}.txt" | ||
| 123 |     path = output_dir / name | ||
| 124 |     path.write_text(text, encoding="utf-8") | ||
| 125 |     return GeneratedSample( | ||
| 126 |         sample_id=f"neg-{index:03d}", | ||
| 127 |         file=str(path.relative_to(csv_base)), | ||
| 128 |         expected="不应去重", | ||
| 129 |         sample_type=sample_type, | ||
| 130 |         source=f"{left} | {right}", | ||
| 131 |     ) | ||
| 132 | |||
| 133 | |||
| 134 | def _content_lines(text: str) -> list[str]: | ||
| 135 |     lines = [line.strip() for line in text.splitlines() if line.strip()] | ||
| 136 |     return lines or [text.strip()] | ||
| 137 | |||
| 138 | |||
| 139 | def _clean_generated_output_dir(output_dir: Path) -> None: | ||
| 140 |     for path in output_dir.iterdir(): | ||
| 141 |         if path.is_file() and path.suffix.lower() in {".txt", ".lrc"}: | ||
| 142 |             path.unlink() | ||
| 143 | |||
| 144 | |||
| 145 | def _normalized_lines(path: Path) -> list[str]: | ||
| 146 |     normalized = normalize_lyrics(read_lyric_file(path)) | ||
| 147 |     return list(normalized.primary_lines or normalized.unique_lines) | ||
| 148 | |||
| 149 | |||
| 150 | def _add_timestamps(lines: list[str]) -> str: | ||
| 151 |     return "\n".join(f"[00:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1)) | ||
| 152 | |||
| 153 | |||
| 154 | def _add_punctuation_noise(lines: list[str], rng: random.Random) -> str: | ||
| 155 |     marks = ["!", "?", "...", ",", "。"] | ||
| 156 |     return "\n".join(f"{line}{rng.choice(marks)}" for line in lines) | ||
| 157 | |||
| 158 | |||
| 159 | def _with_platform_noise(lines: list[str]) -> str: | ||
| 160 |     return "\n".join(["歌词来自QQ音乐", "作词:测试", *lines, "未经著作权人许可 不得翻唱"]) | ||
| 161 | |||
| 162 | |||
| 163 | def _add_blank_line_noise(lines: list[str]) -> str: | ||
| 164 |     result: list[str] = [] | ||
| 165 |     for idx, line in enumerate(lines, start=1): | ||
| 166 |         result.append(line) | ||
| 167 |         if idx % 4 == 0: | ||
| 168 |             result.append("") | ||
| 169 |     return "\n".join(result) | ||
| 170 | |||
| 171 | |||
| 172 | def _translation_added(lines: list[str]) -> str: | ||
| 173 |     result: list[str] = [] | ||
| 174 |     for idx, line in enumerate(lines, start=1): | ||
| 175 |         result.append(line) | ||
| 176 |         if _looks_foreign(line) and idx <= 24: | ||
| 177 |             result.append(_pseudo_translation(idx)) | ||
| 178 |     return "\n".join(result) | ||
| 179 | |||
| 180 | |||
| 181 | def _single_song_fragment(lines: list[str]) -> str: | ||
| 182 |     if len(lines) <= 4: | ||
| 183 |         return "\n".join(lines[: max(1, len(lines) // 2)]) | ||
| 184 |     fragment_len = max(2, min(8, len(lines) // 4)) | ||
| 185 |     start = max(0, (len(lines) - fragment_len) // 2) | ||
| 186 |     return "\n".join(lines[start : start + fragment_len]) | ||
| 187 | |||
| 188 | |||
| 189 | def _short_shared_snippet(lines: list[str], rng: random.Random) -> str: | ||
| 190 |     snippet = rng.sample(lines, k=min(2, len(lines))) if lines else [] | ||
| 191 |     synthetic = [ | ||
| 192 |         "清晨的风吹过新的街口", | ||
| 193 |         "我把昨天放进安静的口袋", | ||
| 194 |         *snippet, | ||
| 195 |         "故事从这里重新开始", | ||
| 196 |         "灯光落下我继续往前走", | ||
| 197 |     ] | ||
| 198 |     return "\n".join(synthetic) | ||
| 199 | |||
| 200 | |||
| 201 | def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str: | ||
| 202 |     left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else [] | ||
| 203 |     right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else [] | ||
| 204 |     filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"] | ||
| 205 |     return "\n".join([*left_pick, *filler, *right_pick]) | ||
| 206 | |||
| 207 | |||
| 208 | def _same_theme_synthetic(index: int) -> str: | ||
| 209 |     themes = [ | ||
| 210 |         "我在夜里想起远方的你", | ||
| 211 |         "城市灯火陪我走过雨季", | ||
| 212 |         "那些没说完的话留在风里", | ||
| 213 |         "明天醒来我们各自继续", | ||
| 214 |         f"这是第 {index} 个全新测试样本", | ||
| 215 |     ] | ||
| 216 |     return "\n".join(themes) | ||
| 217 | |||
| 218 | |||
| 219 | def _translation_only_like(lines: list[str]) -> str: | ||
| 220 |     foreign_count = sum(1 for line in lines if _looks_foreign(line)) | ||
| 221 |     if foreign_count < 2: | ||
| 222 |         return _same_theme_synthetic(foreign_count + len(lines)) | ||
| 223 |     return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1)) | ||
| 224 | |||
| 225 | |||
| 226 | def _pseudo_translation(index: int) -> str: | ||
| 227 |     translations = [ | ||
| 228 |         "今晚我仍然想念你", | ||
| 229 |         "风会带走所有疲惫", | ||
| 230 |         "黑暗里也会有光", | ||
| 231 |         "别让昨天困住自己", | ||
| 232 |         "我们终会继续向前", | ||
| 233 |         "雨停以后天空会亮", | ||
| 234 |         "把遗憾留在旧时光", | ||
| 235 |         "你已经足够好了", | ||
| 236 |     ] | ||
| 237 |     return translations[(index - 1) % len(translations)] | ||
| 238 | |||
| 239 | |||
| 240 | def _looks_foreign(line: str) -> bool: | ||
| 241 |     latin = len(re.findall(r"[A-Za-z]", line)) | ||
| 242 |     cjk = len(re.findall(r"[\u4e00-\u9fff]", line)) | ||
| 243 |     return latin > 0 and cjk == 0 | ||
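One detail of `_add_timestamps` worth noting: it keeps the minute field fixed at `00` and formats the seconds as `idx % 60`, so line 60 gets `[00:00.00]` and line 61 reuses the tag of line 1. Since the README treats lines that share a timestamp as a translation signal, repeated tags on long lyrics might influence the split, although that depends on the collapsed normalization code. The sketch below shows one possible alternative that carries seconds over into minutes. The name `add_timestamps_monotonic` is hypothetical and this is not the project's implementation.

```python
def add_timestamps_monotonic(lines: list[str]) -> str:
    """Prefix each line with an LRC-style [mm:ss.00] tag, one second apart."""
    tagged = []
    for idx, line in enumerate(lines, start=1):
        # divmod carries whole minutes out of the running second counter,
        # so tags keep increasing past the 59th line.
        minutes, seconds = divmod(idx, 60)
        tagged.append(f"[{minutes:02d}:{seconds:02d}.00]{line}")
    return "\n".join(tagged)


demo = add_timestamps_monotonic([f"line {n}" for n in range(1, 62)]).splitlines()
print(demo[0])
print(demo[59])
print(demo[60])
```

Every tag in `demo` is unique, whereas the modulo-only form would emit the same tag for lines 1 and 61.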
lyric_dedup/file_import.py
0 → 100644
| 1 | """Import LRC/TXT lyric files into records.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | from pathlib import Path | ||
| 7 | |||
| 8 | from lyric_dedup.checker import LyricRecord | ||
| 9 | |||
| 10 | |||
| 11 | SUPPORTED_SUFFIXES = {".lrc", ".txt"} | ||
| 12 | |||
| 13 | |||
| 14 | def iter_lyric_files(root: str | Path) -> list[Path]: | ||
| 15 |     base = Path(root) | ||
| 16 |     return sorted( | ||
| 17 |         path | ||
| 18 |         for path in base.rglob("*") | ||
| 19 |         if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES | ||
| 20 |     ) | ||
| 21 | |||
| 22 | |||
| 23 | def read_lyric_file(path: str | Path) -> str: | ||
| 24 |     file_path = Path(path) | ||
| 25 |     data = file_path.read_bytes() | ||
| 26 |     for encoding in ("utf-8-sig", "utf-8", "gb18030", "big5"): | ||
| 27 |         try: | ||
| 28 |             return data.decode(encoding) | ||
| 29 |         except UnicodeDecodeError: | ||
| 30 |             continue | ||
| 31 |     return data.decode("utf-8", errors="replace") | ||
| 32 | |||
| 33 | |||
| 34 | def record_from_file(path: str | Path, *, base_dir: str | Path | None = None) -> LyricRecord: | ||
| 35 |     file_path = Path(path) | ||
| 36 |     lyrics = read_lyric_file(file_path) | ||
| 37 |     title, artist = _metadata_from_name(file_path.stem) | ||
| 38 |     record_id = _record_id(file_path, base_dir) | ||
| 39 |     return LyricRecord(record_id=record_id, lyrics=lyrics, title=title, artist=artist) | ||
| 40 | |||
| 41 | |||
| 42 | def records_from_dir(root: str | Path) -> list[LyricRecord]: | ||
| 43 |     return [record_from_file(path, base_dir=root) for path in iter_lyric_files(root)] | ||
| 44 | |||
| 45 | |||
| 46 | def _record_id(path: Path, base_dir: str | Path | None) -> str: | ||
| 47 |     if base_dir is None: | ||
| 48 |         source = str(path.resolve()) | ||
| 49 |     else: | ||
| 50 |         source = str(path.resolve().relative_to(Path(base_dir).resolve())) | ||
| 51 |     digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:12] | ||
| 52 |     return f"{digest}:{source}" | ||
| 53 | |||
| 54 | |||
| 55 | def _metadata_from_name(stem: str) -> tuple[str | None, str | None]: | ||
| 56 |     cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip() | ||
| 57 |     if " - " in cleaned: | ||
| 58 |         artist, title = cleaned.split(" - ", 1) | ||
| 59 |         return title.strip() or None, artist.strip() or None | ||
| 60 |     for sep in ("-", "_"): | ||
| 61 |         if sep in cleaned: | ||
| 62 |             title, artist = cleaned.rsplit(sep, 1) | ||
| 63 |             return title.strip() or None, artist.strip() or None | ||
| 64 |     return stem.strip() or None, None | ||
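`_metadata_from_name` reads two different filename conventions: a spaced `Artist - Title` form, and an unspaced `Title-Artist` or `Title_Artist` form that is split at the last separator. The sketch below copies that logic under a public, hypothetical name `metadata_from_name` so the behaviour can be tried in isolation. The sample filenames are invented, and `str.removesuffix` requires Python 3.9 or later.

```python
from __future__ import annotations


def metadata_from_name(stem: str) -> tuple[str | None, str | None]:
    """Return (title, artist) guessed from a lyric file stem."""
    # Drop a trailing "歌词" (lyrics) marker in any of its three spellings.
    cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip()
    if " - " in cleaned:
        # Spaced separator: the artist comes first.
        artist, title = cleaned.split(" - ", 1)
        return title.strip() or None, artist.strip() or None
    for sep in ("-", "_"):
        if sep in cleaned:
            # Unspaced separator: the title comes first, split at the last one.
            title, artist = cleaned.rsplit(sep, 1)
            return title.strip() or None, artist.strip() or None
    return stem.strip() or None, None


print(metadata_from_name("Some Artist - Some Song-歌词"))
print(metadata_from_name("Some Song-Some Artist"))
print(metadata_from_name("Some Song_Some Artist_歌词"))
print(metadata_from_name("Untitled"))
```

Because the two conventions put title and artist in opposite order, a file named `Artist-Title` without spaces would come back with the two fields swapped; filenames are best kept in one convention per library.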
lyric_dedup/minhash_lsh.py
0 → 100644
| 1 | """Small in-memory MinHash LSH index for incremental lyric lookup.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | from collections import defaultdict | ||
| 7 | from dataclasses import dataclass | ||
| 8 | |||
| 9 | |||
| 10 | _MAX_HASH = (1 << 64) - 1 | ||
| 11 | |||
| 12 | |||
| 13 | @dataclass(frozen=True) | ||
| 14 | class MinHashConfig: | ||
| 15 |     num_perm: int = 96 | ||
| 16 |     bands: int = 24 | ||
| 17 |     seed: int = 17 | ||
| 18 | |||
| 19 |     @property | ||
| 20 |     def rows_per_band(self) -> int: | ||
| 21 |         if self.num_perm % self.bands != 0: | ||
| 22 |             raise ValueError("num_perm must be divisible by bands") | ||
| 23 |         return self.num_perm // self.bands | ||
| 24 | |||
| 25 | |||
| 26 | class MinHashLSH: | ||
| 27 |     def __init__(self, config: MinHashConfig | None = None) -> None: | ||
| 28 |         self.config = config or MinHashConfig() | ||
| 29 |         self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set) | ||
| 30 | |||
| 31 |     def signature(self, tokens: set[str]) -> tuple[int, ...]: | ||
| 32 |         if not tokens: | ||
| 33 |             return tuple([_MAX_HASH] * self.config.num_perm) | ||
| 34 | |||
| 35 |         signature = [_MAX_HASH] * self.config.num_perm | ||
| 36 |         for token in tokens: | ||
| 37 |             encoded = token.encode("utf-8") | ||
| 38 |             for idx in range(self.config.num_perm): | ||
| 39 |                 digest = hashlib.blake2b( | ||
| 40 |                     encoded, | ||
| 41 |                     digest_size=8, | ||
| 42 |                     person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16], | ||
| 43 |                 ).digest() | ||
| 44 |                 value = int.from_bytes(digest, "big") | ||
| 45 |                 if value < signature[idx]: | ||
| 46 |                     signature[idx] = value | ||
| 47 |         return tuple(signature) | ||
| 48 | |||
| 49 |     def add(self, record_id: str, signature: tuple[int, ...]) -> None: | ||
| 50 |         for key in self._band_keys(signature): | ||
| 51 |             self._buckets[key].add(record_id) | ||
| 52 | |||
| 53 |     def query(self, signature: tuple[int, ...]) -> set[str]: | ||
| 54 |         candidates: set[str] = set() | ||
| 55 |         for key in self._band_keys(signature): | ||
| 56 |             candidates.update(self._buckets.get(key, set())) | ||
| 57 |         return candidates | ||
| 58 | |||
| 59 |     def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]: | ||
| 60 |         rows = self.config.rows_per_band | ||
| 61 |         return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)] | ||
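The default `MinHashConfig` splits 96 hash values into 24 bands of 4 rows each. Under the standard banding analysis, and assuming the salted hashes behave like independent random permutations, two token sets with Jaccard similarity `s` agree on any one hash with probability `s`, so they share at least one band (and therefore surface in `query`) with probability `1 - (1 - s^r)^b`. The sketch below evaluates that curve for the defaults. The function names are illustrative and not part of the module.

```python
def candidate_probability(jaccard: float, bands: int = 24, rows: int = 4) -> float:
    """Chance that two sets collide in at least one LSH band."""
    band_match = jaccard ** rows           # all rows of one band agree
    return 1.0 - (1.0 - band_match) ** bands


def approximate_threshold(bands: int = 24, rows: int = 4) -> float:
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


for s in (0.2, 0.4, 0.6, 0.8):
    print(f"jaccard={s:.1f} -> candidate probability {candidate_probability(s):.4f}")
print(f"approximate threshold: {approximate_threshold():.3f}")
```

With these defaults the curve turns upward somewhat below a similarity of 0.5, so the index acts as a broad recall stage; the final `duplicate` / `review` / `new` decision depends on the exact checks in the collapsed `checker.py`.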
lyric_dedup/normalization.py
0 → 100644
This diff is collapsed.
scripts/process_library.py
0 → 100644
| 1 | """Process newly added lyric library files. | ||
| 2 | |||
| 3 | This script is intended for the recurring workflow after adding files to | ||
| 4 | ``data/library``: | ||
| 5 | |||
| 6 | 1. Move pure-music placeholder lyric files out of the active library. | ||
| 7 | 2. Rebuild the duplicate-checking index. | ||
| 8 | 3. Optionally regenerate and evaluate a synthetic regression set. | ||
| 9 | """ | ||
| 10 | |||
| 11 | from __future__ import annotations | ||
| 12 | |||
| 13 | import argparse | ||
| 14 | import csv | ||
| 15 | import json | ||
| 16 | import shutil | ||
| 17 | import sys | ||
| 18 | from datetime import datetime | ||
| 19 | from pathlib import Path | ||
| 20 | |||
| 21 | PROJECT_ROOT = Path(__file__).resolve().parents[1] | ||
| 22 | if str(PROJECT_ROOT) not in sys.path: | ||
| 23 |     sys.path.insert(0, str(PROJECT_ROOT)) | ||
| 24 | |||
| 25 | from lyric_dedup.checker import DuplicateChecker | ||
| 26 | from lyric_dedup.cli import evaluate_csv | ||
| 27 | from lyric_dedup.eval_dataset import generate_eval_set | ||
| 28 | from lyric_dedup.file_import import iter_lyric_files | ||
| 29 | from lyric_dedup.file_import import read_lyric_file | ||
| 30 | from lyric_dedup.file_import import records_from_dir | ||
| 31 | from lyric_dedup.normalization import normalize_lyrics | ||
| 32 | |||
| 33 | |||
| 34 | PLACEHOLDER_MARKERS = ( | ||
| 35 |     "【曲库专用】", | ||
| 36 |     "此歌曲为没有填词的纯音乐", | ||
| 37 | ) | ||
| 38 | |||
| 39 | |||
| 40 | def main() -> None: | ||
| 41 |     parser = argparse.ArgumentParser(description="Process lyric library additions.") | ||
| 42 |     parser.add_argument("--library-dir", default="data/library") | ||
| 43 |     parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl") | ||
| 44 |     parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders") | ||
| 45 |     parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.") | ||
| 46 |     parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.") | ||
| 47 |     parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.") | ||
| 48 |     parser.add_argument("--positive-ratio", type=float, default=0.2) | ||
| 49 |     parser.add_argument("--eval-dir", default="data/generated_eval/incoming") | ||
| 50 |     parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv") | ||
| 51 |     parser.add_argument("--eval-out", default="outputs/results/library_eval.csv") | ||
| 52 |     parser.add_argument("--report", default="outputs/results/library_process_report.json") | ||
| 53 |     args = parser.parse_args() | ||
| 54 | |||
| 55 |     library_dir = Path(args.library_dir) | ||
| 56 |     quarantine_dir = Path(args.quarantine_dir) | ||
| 57 |     report_path = Path(args.report) | ||
| 58 | |||
| 59 |     files_before = iter_lyric_files(library_dir) | ||
| 60 |     placeholders = _find_placeholder_files(library_dir) | ||
| 61 |     short_effective = _effective_line_report(library_dir) | ||
| 62 | |||
| 63 | moved_or_deleted: list[str] = [] | ||
| 64 | if not args.dry_run: | ||
| 65 | moved_or_deleted = _handle_placeholders( | ||
| 66 | placeholders, | ||
| 67 | library_dir=library_dir, | ||
| 68 | quarantine_dir=quarantine_dir, | ||
| 69 | delete=args.delete_placeholders, | ||
| 70 | ) | ||
| 71 | _build_index(library_dir, Path(args.index)) | ||
| 72 | |||
| 73 | if args.eval_size > 0: | ||
| 74 | generate_eval_set( | ||
| 75 | library_dir=library_dir, | ||
| 76 | output_dir=Path(args.eval_dir), | ||
| 77 | csv_path=Path(args.eval_csv), | ||
| 78 | size=args.eval_size, | ||
| 79 | positive_ratio=args.positive_ratio, | ||
| 80 | ) | ||
| 81 | evaluate_csv( | ||
| 82 | Path(args.index), | ||
| 83 | Path(args.eval_csv), | ||
| 84 | Path(args.eval_out), | ||
| 85 | base_dir=Path(args.eval_csv).parent, | ||
| 86 | positive_decisions={"duplicate"}, | ||
| 87 | max_candidates=5, | ||
| 88 | ) | ||
| 89 | evaluate_csv( | ||
| 90 | Path(args.index), | ||
| 91 | Path(args.eval_csv), | ||
| 92 | Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"), | ||
| 93 | base_dir=Path(args.eval_csv).parent, | ||
| 94 | positive_decisions={"duplicate", "review"}, | ||
| 95 | max_candidates=5, | ||
| 96 | ) | ||
| 97 | |||
| 98 | report = { | ||
| 99 | "timestamp": datetime.now().isoformat(timespec="seconds"), | ||
| 100 | "dry_run": args.dry_run, | ||
| 101 | "library_dir": str(library_dir), | ||
| 102 | "files_before": len(files_before), | ||
| 103 | "placeholder_matches": len(placeholders), | ||
| 104 | "placeholder_files": [str(path) for path in placeholders], | ||
| 105 | "handled_placeholder_files": moved_or_deleted, | ||
| 106 | "files_after": len(iter_lyric_files(library_dir)), | ||
| 107 | "index": str(args.index), | ||
| 108 | "eval_size": args.eval_size, | ||
| 109 | "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "", | ||
| 110 | "eval_out": str(args.eval_out) if args.eval_size > 0 else "", | ||
| 111 | "short_effective_line_counts": short_effective, | ||
| 112 | } | ||
| 113 | |||
| 114 | print(json.dumps(report, ensure_ascii=False, indent=2)) | ||
| 115 | if not args.dry_run: | ||
| 116 | report_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 117 | report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") | ||
| 118 | |||
| 119 | |||
| 120 | def _find_placeholder_files(library_dir: Path) -> list[Path]: | |
| 121 |     matches: list[Path] = [] | |
| 122 |     for path in iter_lyric_files(library_dir): | |
| 123 |         text = read_lyric_file(path) | |
| 124 |         if any(marker in text for marker in PLACEHOLDER_MARKERS): | |
| 125 |             matches.append(path) | |
| 126 |     return matches | |
| 127 | |||
| 128 | |||
| 129 | def _handle_placeholders( | |
| 130 |     placeholders: list[Path], | |
| 131 |     *, | |
| 132 |     library_dir: Path, | |
| 133 |     quarantine_dir: Path, | |
| 134 |     delete: bool, | |
| 135 | ) -> list[str]: | |
| 136 |     handled: list[str] = [] | |
| 137 |     if not placeholders: | |
| 138 |         return handled | |
| 139 |     if not delete: | |
| 140 |         quarantine_dir.mkdir(parents=True, exist_ok=True) | |
| 141 |     for path in placeholders: | |
| 142 |         if delete: | |
| 143 |             path.unlink() | |
| 144 |             handled.append(f"deleted:{path}") | |
| 145 |             continue | |
| 146 |         relative = path.resolve().relative_to(library_dir.resolve()) | |
| 147 |         destination = quarantine_dir / relative | |
| 148 |         destination.parent.mkdir(parents=True, exist_ok=True) | |
| 149 |         if destination.exists(): | |
| 150 |             destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}") | |
| 151 |         shutil.move(str(path), str(destination)) | |
| 152 |         handled.append(f"moved:{path}->{destination}") | |
| 153 |     return handled | |
| 154 | |||
| 155 | |||
| 156 | def _build_index(library_dir: Path, index_path: Path) -> None: | |
| 157 |     checker = DuplicateChecker() | |
| 158 |     for record in records_from_dir(library_dir): | |
| 159 |         checker.add_record(record) | |
| 160 |     index_path.parent.mkdir(parents=True, exist_ok=True) | |
| 161 |     checker.save(index_path) | |
| 162 | |||
| 163 | |||
| 164 | def _effective_line_report(library_dir: Path) -> dict[str, int]: | |
| 165 |     buckets = { | |
| 166 |         "total": 0, | |
| 167 |         "zero_effective_lines": 0, | |
| 168 |         "one_to_three_effective_lines": 0, | |
| 169 |         "four_to_five_effective_lines": 0, | |
| 170 |         "six_plus_effective_lines": 0, | |
| 171 |     } | |
| 172 |     for path in iter_lyric_files(library_dir): | |
| 173 |         buckets["total"] += 1 | |
| 174 |         normalized = normalize_lyrics(read_lyric_file(path)) | |
| 175 |         line_count = len(normalized.primary_lines or normalized.unique_lines) | |
| 176 |         if line_count == 0: | |
| 177 |             buckets["zero_effective_lines"] += 1 | |
| 178 |         elif line_count <= 3: | |
| 179 |             buckets["one_to_three_effective_lines"] += 1 | |
| 180 |         elif line_count <= 5: | |
| 181 |             buckets["four_to_five_effective_lines"] += 1 | |
| 182 |         else: | |
| 183 |             buckets["six_plus_effective_lines"] += 1 | |
| 184 |     return buckets | |
| 185 | |||
| 186 | |||
| 187 | if __name__ == "__main__": | |
| 188 |     main() | |
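
Review note: the placeholder scan and quarantine move in this script can be exercised in isolation. The sketch below is a minimal standalone approximation, not the project's code. It assumes a made-up marker string (`[placeholder]`) and reads files directly with `Path.read_text`, where the script uses `PLACEHOLDER_MARKERS`, `iter_lyric_files`, and `read_lyric_file`. The names `find_placeholders`, `quarantine`, and `demo` are hypothetical.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

# Hypothetical marker for illustration; the script matches PLACEHOLDER_MARKERS instead.
MARKERS = ("[placeholder]",)
LYRIC_SUFFIXES = {".txt", ".lrc"}


def find_placeholders(library_dir: Path) -> list[Path]:
    """Return lyric files whose text contains any marker."""
    matches: list[Path] = []
    for path in sorted(library_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in LYRIC_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            if any(marker in text for marker in MARKERS):
                matches.append(path)
    return matches


def quarantine(path: Path, library_dir: Path, quarantine_dir: Path) -> Path:
    """Move a file under quarantine_dir, preserving its path relative to library_dir."""
    relative = path.resolve().relative_to(library_dir.resolve())
    destination = quarantine_dir / relative
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.exists():
        # Avoid overwriting an earlier quarantined copy by appending a timestamp.
        stamp = datetime.now().strftime("%Y%m%d%H%M%S")
        destination = destination.with_name(f"{destination.stem}_{stamp}{destination.suffix}")
    shutil.move(str(path), str(destination))
    return destination


def demo() -> tuple[list[str], list[str]]:
    """Build a throwaway library, quarantine placeholder files, and report both trees."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        library = root / "library"
        quarantine_dir = root / "quarantine"
        (library / "artist").mkdir(parents=True)
        (library / "artist" / "real.lrc").write_text("la la la\n", encoding="utf-8")
        (library / "artist" / "empty.lrc").write_text("[placeholder] no lyrics\n", encoding="utf-8")

        moved = [quarantine(p, library, quarantine_dir) for p in find_placeholders(library)]
        remaining = sorted(p.relative_to(library).as_posix() for p in library.rglob("*.lrc"))
        quarantined = sorted(p.relative_to(quarantine_dir).as_posix() for p in moved)
        return remaining, quarantined
```

Calling `demo()` leaves the real lyric file in the library and moves the placeholder file to the same relative path under the quarantine directory. That is why the script resolves both paths before calling `relative_to`. The timestamp suffix has one-second resolution, so two same-named collisions within the same second would still clash.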
tests/test_lyric_dedup.py
0 → 100644
This diff is collapsed.