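Simplify the dedup pipeline; keep only the path that uses PostgreSQL (pg) as the database
Use OpenCC for Simplified/Traditional Chinese conversion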
Showing 15 changed files with 628 additions and 1504 deletions
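The README changes in this diff describe `exact_hash` and `line_hash` indexes as the PostgreSQL recall layers. As a rough illustration of what such hashes could look like, here is a minimal sketch. The normalization (timestamp stripping and lowercasing), the use of SHA-256, and every function name below are assumptions for illustration; the project's real `normalize_lyrics` handles much more (platform noise, OpenCC conversion, translation splitting) and may hash differently.

```python
import hashlib
import re

# Hypothetical LRC timestamp pattern such as [00:12.34]; assumed for this sketch only.
_TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?:[.:]\d{1,3})?\]")


def normalize_lines(lyrics: str) -> list[str]:
    """Strip timestamps, surrounding whitespace, case, and blank lines."""
    lines = []
    for raw in lyrics.splitlines():
        line = _TIMESTAMP.sub("", raw).strip().lower()
        if line:
            lines.append(line)
    return lines


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_hash(lyrics: str) -> str:
    """One hash for the whole normalized lyric (exact-duplicate recall)."""
    return sha256_hex("\n".join(normalize_lines(lyrics)))


def line_hashes(lyrics: str) -> set[str]:
    """One hash per distinct normalized line (line-level overlap recall)."""
    return {sha256_hex(line) for line in normalize_lines(lyrics)}


def line_coverage(query: str, candidate: str) -> float:
    """Share of the query's distinct lines that also occur in the candidate."""
    query_hashes = line_hashes(query)
    if not query_hashes:
        return 0.0
    return len(query_hashes & line_hashes(candidate)) / len(query_hashes)


library_song = "[00:01.00]Hello world\n[00:05.00]Goodbye moon\n"
restyled_song = "Hello World\n\nGoodbye moon"
fragment = "hello world\nsomething else"

same_exact_hash = exact_hash(library_song) == exact_hash(restyled_song)
fragment_coverage = line_coverage(fragment, library_song)
```

In this sketch the restyled copy shares the library song's `exact_hash`, while the fragment only partially overlaps at the line level. That is the kind of case the README routes to Python-side scoring rather than deciding in the database.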
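| 1 | # Lyric Duplicate Checker | 1 | # Lyric Duplicate Detection System |
| 2 | 2 | ||
| 3 | The first version handles duplicate checking for newly added lyrics: build an index from existing `.lrc` / `.txt` lyrics, then query it with each new lyric and get back `duplicate`, `review`, or `new`. | 3 | This project checks lyrics for duplicates, using PostgreSQL as both the data store and the candidate-recall layer. The Python side only handles lyric normalization, candidate scoring, and the final decision; it no longer builds or loads a local `.pkl` index. |
| 4 | 4 | ||
| 5 | ## Build the index | 5 | ## Architecture |
| 6 | 6 | ||
| 7 | Assuming the existing library is in `data/library/`: | 7 | ```text |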
| 8 | PostgreSQL: | ||
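| 9 | lyrics  stores raw lyrics, normalized text, original/translation text, exact_hash | ||
| 10 | lyric_lines  stores normalized lyric lines and line_hash | ||
| 11 | exact_hash index  exact-duplicate recall | ||
| 12 | pg_trgm index  optional fuzzy-text recall | ||
| 13 | line_hash index  line-level overlap recall | ||
| 14 | |||
| 15 | Python: | ||
| 16 | normalize_lyrics  lyric cleaning, timestamp/platform-noise handling, Traditional/Simplified conversion, translation-line splitting | ||
| 17 | DuplicateChecker  scores and ranks only the candidates recalled by PostgreSQL | ||
| 18 | decision rules  output duplicate / review / new | ||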
| 19 | ``` | ||
| 20 | |||
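| 21 | Core principles: | ||
| 22 | |||
| 23 | ```text | ||
| 24 | The database is responsible for recalling candidates. | ||
| 25 | Python is responsible for the final decision. | ||
| 26 | pickle, local MinHash indexes, and outputs/indexes/*.pkl are no longer part of the production path. | ||
| 27 | ``` | ||
| 28 | |||
| 29 | ## Install dependencies | ||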
| 8 | 30 | ||
| 9 | ```bash | 31 | ```bash |
| 10 | python -m lyric_dedup.cli build-index \ | 32 | python -m pip install -r requirements.txt |
| 11 | --lyrics-dir data/library \ | ||
| 12 | --index outputs/indexes/lyrics.pkl | ||
| 13 | ``` | 33 | ``` |
| 14 | 34 | ||
| 15 | ## Check a single new lyric | 35 | ## Initialize PostgreSQL |
| 36 | |||
| 37 | Create the database: | ||
| 16 | 38 | ||
| 17 | ```bash | 39 | ```bash |
| 18 | python -m lyric_dedup.cli check-file \ | 40 | createdb lyric_dedup |
| 19 | --index outputs/indexes/lyrics.pkl \ | ||
| 20 | --file data/incoming/new_song.lrc | ||
| 21 | ``` | 41 | ``` |
| 22 | 42 | ||
| 23 | ## Batch-check a directory of new lyrics | 43 | Initialize the tables and indexes: |
| 24 | 44 | ||
| 25 | ```bash | 45 | ```bash |
| 26 | python -m lyric_dedup.cli batch-check \ | 46 | python scripts/init_postgres.py \ |
| 27 | --index outputs/indexes/lyrics.pkl \ | 47 | --dsn postgresql:///lyric_dedup |
| 28 | --lyrics-dir data/incoming \ | ||
| 29 | --out outputs/results/incoming_check.csv | ||
| 30 | ``` | 48 | ``` |
| 31 | 49 | ||
| 32 | Key columns in the CSV: | 50 | This creates: |
| 51 | |||
| 52 | ```text | ||
| 53 | lyrics | ||
| 54 | lyric_lines | ||
| 55 | pg_trgm extension | ||
| 56 | exact_hash / primary_text_trgm / line_hash indexes | ||
| 57 | ``` | ||
| 33 | 58 | ||
| 34 | - `decision`: the overall verdict. | 59 | ## Import the library |
| 35 | - `best_candidate_id`: the most similar existing lyric. | 60 | |
| 36 | - `best_candidate_jaccard`: n-gram surface similarity. | 61 | ```bash |
| 37 | - `best_candidate_line_coverage`: line-level coverage. | 62 | python scripts/import_library_postgres.py \ |
| 38 | - `matched_unique_lines`: the normalized lyric lines that matched. | 63 | --dsn postgresql:///lyric_dedup \ |
| 39 | - `best_candidate_reason`: the reason for the decision (in Chinese), explaining why the lyric was judged duplicate, review, or new. | 64 | --lyrics-dir data/library |
| 65 | ``` | ||
| 66 | |||
| 67 | The import script: | ||
| 68 | |||
| 69 | ```text | ||
| 70 | 1. Scans data/library for .lrc / .txt files. | ||
| 71 | 2. Reads and normalizes the lyrics. | ||
| 72 | 3. Writes to lyrics and lyric_lines. | ||
| 73 | 4. By default, soft-deletes records whose exact_hash is identical, keeping only the higher-quality one. | ||
| 74 | 5. Writes a duplicate report to outputs/results/postgres_exact_duplicates.csv. | ||
| 75 | ``` | ||
| 76 | |||
| 77 | To import only, without exact dedup: | ||
| 78 | |||
| 79 | ```bash | ||
| 80 | python scripts/import_library_postgres.py \ | ||
| 81 | --dsn postgresql:///lyric_dedup \ | ||
| 82 | --lyrics-dir data/library \ | ||
| 83 | --skip-dedup-exact | ||
| 84 | ``` | ||
| 40 | 85 | ||
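| 41 | Production recommendation: `duplicate` can be blocked automatically; `review` goes to the manual review pool; `new` can still be spot-checked before it is stored. | 86 | ## Check a single lyric file |
| 42 | 87 | ||
| 43 | ## Safeguards for lyrics with original text + Chinese translation | 88 | ```bash |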
| 89 | python -m lyric_dedup.cli check-file \ | ||
| 90 | --dsn postgresql:///lyric_dedup \ | ||
| 91 | --file data/incoming/new_song.lrc | ||
| 92 | ``` | ||
| 44 | 93 | ||
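| 45 | Lyrics are currently split into three kinds of lines: | 94 | Common options: |
| 46 | 95 | ||
| 47 | - `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these. | 96 | ```text |
| 48 | - `translation_lines`: Chinese translation lines; used only for recall and for explaining review decisions. | 97 | --recall-limit  maximum number of candidates each PostgreSQL recall layer returns |
| 49 | - `unknown_lines`: lines that cannot be classified reliably. | 98 | --max-candidates  number of candidates to rank and return in the final result |
| 99 | --enable-trgm  enable pg_trgm fuzzy-text recall | ||
| 100 | --trgm-threshold  pg_trgm similarity threshold | ||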
| 101 | --statement-timeout-ms PostgreSQL statement_timeout | ||
| 102 | ``` | ||
| 50 | 103 | ||
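| 51 | High-confidence splits include: | 104 | Returned fields: |
| 52 | 105 | ||
| 53 | - A foreign-language line and a Chinese line appear under the same timestamp. | 106 | ```text |
| 54 | - Several stable groups of alternating foreign-language and Chinese lines. | 107 | decision  duplicate / review / new |
| 108 | duplicate  true for duplicate or review, false for new | ||
| 109 | confidence  confidence of the current decision | ||
| 110 | reason  reason for the decision, in Chinese | ||
| 111 | candidate_count  number of candidates in the final ranking | ||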
| 112 | ``` | ||
| 55 | 113 | ||
| 56 | Medium-confidence splits include: | 114 | ## Start the API |
| 57 | 115 | ||
| 58 | - An obvious foreign-language / Chinese translation pair on a single line, e.g. `I miss you / 今晚我想你`. | 116 | ```bash |
| 117 | export LYRIC_DEDUP_DSN=postgresql:///lyric_dedup | ||
| 118 | uvicorn lyric_dedup_server.app:app --host 0.0.0.0 --port 8000 | ||
| 119 | ``` | ||
| 59 | 120 | ||
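| 60 | Low-confidence splits include: | 121 | Endpoint: |
| 61 | 122 | ||
| 62 | - A full block of foreign-language text followed by a full block of Chinese translation. | 123 | ```text |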
| 124 | POST /api/v1/check | ||
| 125 | ``` | ||
| 63 | 126 | ||
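| 64 | Decision policy: | 127 | Example request: |
| 65 | 128 | ||
| 66 | - If the original text matches closely, the result can be `duplicate` even when the new lyric adds a Chinese translation. | 129 | ```json |
| 67 | - If only the translation lines are similar and the original text is not similar enough, the result can only be `review`; it is never automatically judged a duplicate. | 130 | { |
| 68 | - A suspected whole-block translation layout is a low-confidence split, so the result is `review` first even if the original-text hash matches. | 131 | "url": "https://example.com/song.lrc", |
| 69 | - For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text. | 132 | "title": "Song Title", |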
| 133 | "artist": "Artist" | ||
| 134 | } | ||
| 135 | ``` | ||
| 70 | 136 | ||
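| 71 | Because the index stores the original/translation features produced by the split, the index must be rebuilt after the split rules change. | 137 | The service downloads the `.lrc` / `.txt` file at the URL, recalls candidates from PostgreSQL, and makes a decision. If the result is `new` and the request includes a URL, the service writes the new lyric to PostgreSQL. |
| 72 | 138 | ||
| 73 | ## Evaluate accuracy with a labeled CSV | 139 | ## Generate an evaluation set |
| 74 | 140 | ||
| 75 | You can first generate a batch of evaluation samples automatically from the existing library: | 141 | Standard production profile: |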
| 76 | 142 | ||
| 77 | ```bash | 143 | ```bash |
| 78 | python -m lyric_dedup.cli generate-eval-set \ | 144 | python -m lyric_dedup.cli generate-eval-set \ |
| 79 | --library-dir data/library \ | 145 | --library-dir data/library \ |
| 80 | --lyrics-dir data/generated_eval/incoming \ | 146 | --lyrics-dir data/generated_eval/incoming \ |
| 81 | --csv data/generated_eval/eval_50000.csv \ | 147 | --csv data/generated_eval/eval_5000.csv \ |
| 82 | --index outputs/indexes/lyrics.pkl \ | 148 | --size 5000 \ |
| 83 | --eval-index data/generated_eval/eval_50000.csv.index.pkl \ | ||
| 84 | --size 50000 \ | ||
| 85 | --positive-ratio 0.3 | 149 | --positive-ratio 0.3 |
| 86 | ``` | 150 | ``` |
| 87 | 151 | ||
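| 88 | The default `--profile standard` generates the standard production evaluation set. You can also generate a hard set that sits closer to real business edge cases: | 152 | Hard (business edge case) profile: |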
| 89 | 153 | ||
| 90 | ```bash | 154 | ```bash |
| 91 | python -m lyric_dedup.cli generate-eval-set \ | 155 | python -m lyric_dedup.cli generate-eval-set \ |
| ... | @@ -93,79 +157,55 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -93,79 +157,55 @@ python -m lyric_dedup.cli generate-eval-set \ |
| 93 | --library-dir data/library \ | 157 | --library-dir data/library \ |
| 94 | --lyrics-dir data/generated_eval/hard_incoming \ | 158 | --lyrics-dir data/generated_eval/hard_incoming \ |
| 95 | --csv data/generated_eval/eval_hard_5000.csv \ | 159 | --csv data/generated_eval/eval_hard_5000.csv \ |
| 96 | --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \ | ||
| 97 | --size 5000 \ | 160 | --size 5000 \ |
| 98 | --positive-ratio 0.3 | 161 | --positive-ratio 0.3 |
| 99 | ``` | 162 | ``` |
| 100 | 163 | ||
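| 101 | Standard profile: | 164 | The generator writes only: |
| 102 | |||
| 103 | - First scans the whole library and samples in strata by valid-lyric line count, language type, and file-source prefix, instead of sampling by sorted prefix. | ||
| 104 | - `应去重` (should dedup) samples contain only style variations of full-song lyrics, such as changes in timestamps, punctuation, platform noise, blank lines, or the number of repeated choruses, an appended Chinese translation, or a few typos / English misspellings. | ||
| 105 | - `不应去重` (should not dedup) samples are mainly real holdout full lyrics, and also include lyric fragments, repeated-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric / placeholder edge cases. | ||
| 106 | - A fragment must not be reported as `duplicate` even if it matches part of an existing song; at most it goes to `review`. | ||
| 107 | - The generator additionally writes `--eval-index`, an index that excludes the holdout songs; use it when evaluating the generated CSV. | ||
| 108 | - It also generates `*.manifest.json`, which records the seed, library size, holdout count, sample-type distribution, language/source buckets, and the number of source records covered. | ||
| 109 | 165 | ||
| 110 | The hard profile does not deliberately construct abnormal input; it mainly covers the edge cases most likely to be hit in production: | 166 | ```text |
| 111 | 167 | evaluation CSV | |
| 112 | - `应去重` (should dedup): platform-version noise on the same song, a mostly complete lyric missing one section, an appended whole-block Chinese translation, realistic data-entry/OCR typos, and mixed timestamps and platform metadata. | 168 | sample lyric files |
| 113 | - `不应去重` (should not dedup): real holdout new songs, near-neighbor new songs preferentially picked from the holdout because they share lines with the library, long but incomplete single-song fragments, multi-song medley fragments, repeated-chorus collisions, translation-only similarity, and short-lyric edge cases. | 169 | manifest.json |
| 114 | |||
| 115 | First prepare a CSV, for example `data/eval/eval.csv`: | ||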
| 116 | |||
| 117 | ```csv | ||
| 118 | id,file,expected | ||
| 119 | case-001,incoming/song_a.lrc,应去重 | ||
| 120 | case-002,incoming/song_b.txt,不应去重 | ||
| 121 | ``` | ||
| 122 | |||
| 123 | Alternatively, skip the file path and put the lyrics directly in the `lyrics` column: | ||
| 124 | |||
| 125 | ```csv | ||
| 126 | id,lyrics,expected | ||
| 127 | case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate | ||
| 128 | case-004,"南方的雨穿过街心\n你把故事说给云听",new | ||
| 129 | ``` | 170 | ``` |
| 130 | 171 | ||
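| 131 | `expected` accepts these values: | 172 | `.index.pkl` files are no longer generated. During evaluation, PostgreSQL recalls the candidates, and each holdout sample's own record is excluded based on the `source_record_id` in the CSV. |
| 132 | 173 | ||
| 133 | - Should dedup: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes` | 174 | ## Evaluate with PostgreSQL |
| 134 | - Should not dedup: `不应去重`, `不重复`, `new`, `0`, `false`, `no` | ||
| 135 | 175 | ||
| 136 | Run the evaluation: | 176 | Strict auto-block mode: only `duplicate` counts as predicted should-dedup. |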
| 137 | 177 | ||
| 138 | ```bash | 178 | ```bash |
| 139 | python -m lyric_dedup.cli evaluate-csv \ | 179 | python scripts/evaluate_postgres.py \ |
| 140 | --index outputs/indexes/lyrics.pkl \ | 180 | --dsn postgresql:///lyric_dedup \ |
| 141 | --csv data/eval/eval.csv \ | 181 | --csv data/generated_eval/eval_hard_5000.csv \ |
| 142 | --base-dir data \ | 182 | --base-dir data/generated_eval \ |
| 143 | --out outputs/results/eval_result.csv | 183 | --out outputs/results/postgres_eval_hard_5000.csv |
| 144 | ``` | 184 | ``` |
| 145 | 185 | ||
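| 146 | By default, only a `duplicate` output counts as "predicted should-dedup". This suits evaluating the precision of automatic blocking and makes false blocks more visible. | 186 | Suspicious-sample recall mode: both `duplicate` and `review` count as caught. |
| 147 | |||
| 148 | To evaluate the "suspicious-sample recall rate", where both `duplicate` and `review` count as hits: | ||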
| 149 | 187 | ||
| 150 | ```bash | 188 | ```bash |
| 151 | python -m lyric_dedup.cli evaluate-csv \ | 189 | python scripts/evaluate_postgres.py \ |
| 152 | --index outputs/indexes/lyrics.pkl \ | 190 | --dsn postgresql:///lyric_dedup \ |
| 153 | --csv data/eval/eval.csv \ | 191 | --csv data/generated_eval/eval_hard_5000.csv \ |
| 154 | --base-dir data \ | 192 | --base-dir data/generated_eval \ |
| 155 | --positive-decisions duplicate,review \ | 193 | --positive-decisions duplicate,review \ |
| 156 | --out outputs/results/eval_result_review_as_positive.csv | 194 | --out outputs/results/postgres_eval_hard_5000_review_positive.csv |
| 157 | ``` | 195 | ``` |
| 158 | 196 | ||
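| 159 | Two files are generated: | 197 | The evaluation generates: |
| 160 | 198 | ||
| 161 | - `outputs/results/eval_result.csv`: the prediction, candidates, reason, and correctness for each sample. | 199 | ```text |
| 162 | - `outputs/results/eval_result.csv.summary.json`: overall metrics. | 200 | outputs/results/*.csv |
| 201 | outputs/results/*.csv.summary.json | ||
| 202 | ``` | ||
| 163 | 203 | ||
| 164 | Key fields in the summary: | 204 | Key fields in the summary: |
| 165 | 205 | ||
| 166 | - `accuracy`: overall accuracy. | 206 | ```text |
| 167 | - `precision`: of the samples predicted should-dedup, the share that truly should be deduplicated. Check this first for automatic blocking. | 207 | precision  precision of automatic blocking; watch false_positive |
| 168 | - `recall`: of the samples that truly should be deduplicated, the share the system caught. | 208 | recall  recall of should-dedup samples; watch false_negative |
| 169 | - `f1`: a combined metric of precision and recall. | 209 | f1  a combined metric of precision and recall |
| 170 | - `false_positive`: should not be deduplicated but was judged should-dedup (a false block). | 210 | duplicate/review/new  distribution of the three decisions |
| 171 | - `false_negative`: should be deduplicated but was not caught (a missed recall). | 211 | ``` | ... | ... |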
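To make the two scoring modes concrete, the sketch below maps each decision to a prediction through a set of positive decisions and derives the summary metrics from the resulting confusion matrix. The function name `summarize` and the `(expected_duplicate, decision)` input shape are assumptions for illustration; this is not the code in `scripts/evaluate_postgres.py`.

```python
from collections import Counter


def summarize(samples, positive_decisions=("duplicate",)):
    """Compute confusion-matrix metrics from (expected_duplicate, decision) pairs."""
    counts = Counter()
    for expected_duplicate, decision in samples:
        predicted = decision in positive_decisions
        if expected_duplicate and predicted:
            counts["true_positive"] += 1
        elif expected_duplicate:
            counts["false_negative"] += 1  # should dedup, but missed
        elif predicted:
            counts["false_positive"] += 1  # should not dedup, but blocked
        else:
            counts["true_negative"] += 1

    tp = counts["true_positive"]
    fp = counts["false_positive"]
    fn = counts["false_negative"]
    tn = counts["true_negative"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(samples) if samples else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "true_positive": tp,
        "false_positive": fp,
        "true_negative": tn,
        "false_negative": fn,
    }


# Four hand-made samples: (expected_duplicate, decision).
samples = [
    (True, "duplicate"),
    (True, "review"),
    (False, "review"),
    (False, "new"),
]
strict = summarize(samples)
review_positive = summarize(samples, positive_decisions=("duplicate", "review"))
```

On these four samples, the strict mode trades recall for precision, and counting `review` as positive does the opposite. This is why the README suggests reading `precision` / `false_positive` in the first mode and `recall` / `false_negative` in the second.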
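| 1 | # Lyric Duplicate-Check Test Workflow | 1 | # Lyric Duplicate-Check Test Workflow |
| 2 | 2 | ||
| 3 | This document records the full set of commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and viewing results. | 3 | This document records the project's current PostgreSQL-only test workflow. The current pipeline no longer uses `outputs/indexes/*.pkl` and no longer generates `*.index.pkl` evaluation indexes. |
| 4 | 4 | ||
| 5 | ## 1. Prepare directories | 5 | ## 1. Prepare data |
| 6 | 6 | ||
| 7 | The existing library lives in: | 7 | Existing library: |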
| 8 | 8 | ||
| 9 | ```text | 9 | ```text |
| 10 | data/library/ | 10 | data/library/ |
| ... | @@ -17,125 +17,111 @@ data/library/ | ... | @@ -17,125 +17,111 @@ data/library/ |
| 17 | .txt | 17 | .txt |
| 18 | ``` | 18 | ``` |
| 19 | 19 | ||
| 20 | Generated test samples are placed in: | 20 | Directories for generated evaluation samples: |
| 21 | 21 | ||
| 22 | ```text | 22 | ```text |
| 23 | data/generated_eval/incoming/ | 23 | data/generated_eval/incoming/ |
| 24 | data/generated_eval/hard_incoming/ | ||
| 24 | ``` | 25 | ``` |
| 25 | 26 | ||
| 26 | The labeled test-set CSV is placed in: | 27 | Evaluation results directory: |
| 27 | 28 | ||
| 28 | ```text | 29 | ```text |
| 29 | data/generated_eval/eval_100.csv | 30 | outputs/results/ |
| 30 | ``` | 31 | ``` |
| 31 | 32 | ||
| 32 | Evaluation results are placed in: | 33 | ## 2. Initialize PostgreSQL |
| 33 | 34 | ||
| 34 | ```text | 35 | Create the database: |
| 35 | outputs/results/ | ||
| 36 | ``` | ||
| 37 | 36 | ||
| 38 | ## 2. Build the index for the existing library | 37 | ```bash |
| 38 | createdb lyric_dedup | ||
| 39 | ``` | ||
| 39 | 40 | ||
| 40 | If you have just added a batch of samples to `data/library`, run the processing script first: | 41 | Initialize the schema: |
| 41 | 42 | ||
| 42 | ```bash | 43 | ```bash |
| 43 | python scripts/process_library.py \ | 44 | python scripts/init_postgres.py \ |
| 44 | --library-dir data/library \ | 45 | --dsn postgresql:///lyric_dedup |
| 45 | --index outputs/indexes/library_lyrics.pkl | ||
| 46 | ``` | 46 | ``` |
| 47 | 47 | ||
| 48 | This script: | 48 | Check the tables: |
| 49 | 49 | ||
| 50 | ```text | 50 | ```bash |
| 51 | 1. Scans for and quarantines instrumental placeholder samples, e.g. files containing 【曲库专用】 or "此歌曲为没有填词的纯音乐". | 51 | psql postgresql:///lyric_dedup -c '\dt' |
| 52 | 2. Rebuilds outputs/indexes/library_lyrics.pkl. | ||
| 53 | 3. Writes a processing report to outputs/results/library_process_report.json. | ||
| 54 | ``` | 52 | ``` |
| 55 | 53 | ||
| 56 | To preview which files would be processed without moving anything or rebuilding the index: | 54 | ## 3. Import the library |
| 57 | 55 | ||
| 58 | ```bash | 56 | ```bash |
| 59 | python scripts/process_library.py \ | 57 | python scripts/import_library_postgres.py \ |
| 60 | --library-dir data/library \ | 58 | --dsn postgresql:///lyric_dedup \ |
| 61 | --dry-run | 59 | --lyrics-dir data/library |
| 62 | ``` | 60 | ``` |
| 63 | 61 | ||
| 64 | To also generate and evaluate 500 test samples in the same run: | 62 | After the import finishes, check the counts: |
| 65 | 63 | ||
| 66 | ```bash | 64 | ```bash |
| 67 | python scripts/process_library.py \ | 65 | psql postgresql:///lyric_dedup -c 'select count(*) from lyrics where deleted_at is null;' |
| 68 | --library-dir data/library \ | 66 | psql postgresql:///lyric_dedup -c 'select count(*) from lyric_lines;' |
| 69 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 70 | --eval-size 50000 \ | ||
| 71 | --positive-ratio 0.3 \ | ||
| 72 | --eval-csv data/generated_eval/eval_50000.csv \ | ||
| 73 | --eval-out outputs/results/library_eval_50000.csv | ||
| 74 | ``` | 67 | ``` |
| 75 | 68 | ||
| 76 | Quarantined files are moved by default to: | 69 | By default the import script soft-deletes duplicate records whose exact_hash is identical, and writes: |
| 77 | 70 | ||
| 78 | ```text | 71 | ```text |
| 79 | data/quarantine/no_lyrics_placeholders/ | 72 | outputs/results/postgres_exact_duplicates.csv |
| 80 | ``` | 73 | ``` |
| 81 | 74 | ||
| 82 | You can also just build the index manually: | 75 | To additionally review suspected duplicates with high line-level coverage: |
| 83 | 76 | ||
| 84 | ```bash | 77 | ```bash |
| 85 | python -m lyric_dedup.cli build-index \ | 78 | python scripts/import_library_postgres.py \ |
| 79 | --dsn postgresql:///lyric_dedup \ | ||
| 86 | --lyrics-dir data/library \ | 80 | --lyrics-dir data/library \ |
| 87 | --index outputs/indexes/library_lyrics.pkl | 81 | --line-duplicate-report outputs/results/postgres_line_duplicates.csv |
| 88 | ``` | 82 | ``` |
| 89 | 83 | ||
| 90 | Index file: | 84 | ## 4. Check a single file |
| 91 | 85 | ||
| 92 | ```text | 86 | ```bash |
| 93 | outputs/indexes/library_lyrics.pkl | 87 | python -m lyric_dedup.cli check-file \ |
| 88 | --dsn postgresql:///lyric_dedup \ | ||
| 89 | --file test_api/test_lyric.txt | ||
| 94 | ``` | 90 | ``` |
| 95 | 91 | ||
| 96 | Note: if you modify `data/library`, or change the preprocessing or dedup logic, rerun this step. | 92 | To enable trigram text recall: |
| 97 | |||
| 98 | ## 3. Generate production evaluation samples | ||
| 99 | 93 | ||
| 100 | ```bash | 94 | ```bash |
| 101 | python -m lyric_dedup.cli generate-eval-set \ | 95 | python -m lyric_dedup.cli check-file \ |
| 102 | --library-dir data/library \ | 96 | --dsn postgresql:///lyric_dedup \ |
| 103 | --lyrics-dir data/generated_eval/incoming \ | 97 | --file test_api/test_lyric.txt \ |
| 104 | --csv data/generated_eval/eval_50000.csv \ | 98 | --enable-trgm \ |
| 105 | --index outputs/indexes/library_lyrics.pkl \ | 99 | --trgm-threshold 0.3 |
| 106 | --eval-index data/generated_eval/eval_50000.csv.index.pkl \ | ||
| 107 | --size 50000 \ | ||
| 108 | --positive-ratio 0.3 | ||
| 109 | ``` | 100 | ``` |
| 110 | 101 | ||
| 111 | To generate a hard-profile test set closer to real business edge cases: | 102 | ## 5. Generate the standard evaluation set |
| 112 | 103 | ||
| 113 | ```bash | 104 | ```bash |
| 114 | python -m lyric_dedup.cli generate-eval-set \ | 105 | python -m lyric_dedup.cli generate-eval-set \ |
| 115 | --profile hard \ | ||
| 116 | --library-dir data/library \ | 106 | --library-dir data/library \ |
| 117 | --lyrics-dir data/generated_eval/hard_incoming \ | 107 | --lyrics-dir data/generated_eval/incoming \ |
| 118 | --csv data/generated_eval/eval_hard_5000.csv \ | 108 | --csv data/generated_eval/eval_5000.csv \ |
| 119 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 120 | --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \ | ||
| 121 | --size 5000 \ | 109 | --size 5000 \ |
| 122 | --positive-ratio 0.3 | 110 | --positive-ratio 0.3 |
| 123 | ``` | 111 | ``` |
| 124 | 112 | ||
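| 125 | Default production evaluation profile: | 113 | Standard profile: |
| 126 | 114 | ||
| 127 | ```text | 115 | ```text |
| 128 | 应去重 (should dedup): 30% | 116 | 应去重 (should dedup): 30% |
| 129 | 不应去重 (should not dedup): 70% | 117 | 不应去重 (should not dedup): 70% |
| 130 | ``` | 118 | ``` |
| 131 | 119 | ||
| 132 | The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. | 120 | Sample types: |
| 133 | |||
| 134 | Profile definition: | ||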
| 135 | 121 | ||
| 136 | ```text | 122 | ```text |
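| 137 | positive_* = should dedup; style variations of full-song lyrics, including a few typo / English-misspelling perturbations | 123 | positive_* = should dedup; style variations of full-song lyrics, e.g. changes in timestamps, punctuation, platform noise, blank lines, or the number of repeated choruses, an appended translation, or a few typos |
| 138 | negative_real_holdout_full_song = should not dedup; complete real lyrics, already excluded from the evaluation index | 124 | negative_real_holdout_full_song = should not dedup; complete real lyrics, with the sample's own record excluded from the evaluation candidates |
| 139 | negative_fragment = should not dedup; single-song fragment | 125 | negative_fragment = should not dedup; single-song fragment |
| 140 | negative_shared_chorus = should not dedup; repeated-chorus collision | 126 | negative_shared_chorus = should not dedup; repeated-chorus collision |
| 141 | negative_translation_only = should not dedup; translation-only similarity | 127 | negative_translation_only = should not dedup; translation-only similarity |
| ... | @@ -143,7 +129,19 @@ negative_same_theme_synthetic = should not dedup; new lyrics on the same theme | ... | @@ -143,7 +129,19 @@ negative_same_theme_synthetic = should not dedup; new lyrics on the same theme |
| 143 | edge_short_or_placeholder = should not dedup; short-lyric / placeholder edge case | 129 | edge_short_or_placeholder = should not dedup; short-lyric / placeholder edge case |
| 144 | ``` | 130 | ``` |
| 145 | 131 | ||
| 146 | The hard profile additionally emphasizes real business edge cases rather than deliberately constructing abnormal hard cases: | 132 | ## 6. Generate the hard evaluation set |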
| 133 | |||
| 134 | ```bash | ||
| 135 | python -m lyric_dedup.cli generate-eval-set \ | ||
| 136 | --profile hard \ | ||
| 137 | --library-dir data/library \ | ||
| 138 | --lyrics-dir data/generated_eval/hard_incoming \ | ||
| 139 | --csv data/generated_eval/eval_hard_5000.csv \ | ||
| 140 | --size 5000 \ | ||
| 141 | --positive-ratio 0.3 | ||
| 142 | ``` | ||
| 143 | |||
| 144 | The hard profile emphasizes real business edge cases and does not deliberately construct abnormal input: | ||
| 147 | 145 | ||
| 148 | ```text | 146 | ```text |
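| 149 | positive_realistic_variant = should dedup; same-song platform-version noise, mostly complete lyric missing a section, appended whole-block translation, realistic data-entry/OCR errors | 147 | positive_realistic_variant = should dedup; same-song platform-version noise, mostly complete lyric missing a section, appended whole-block translation, realistic data-entry/OCR errors |
| ... | @@ -152,84 +150,50 @@ negative_long_fragment = should not dedup; long but incomplete single-song fragment | ... | @@ -152,84 +150,50 @@ negative_long_fragment = should not dedup; long but incomplete single-song fragment |
| 152 | negative_catalog_mashup = should not dedup; medley/mashup-style input assembled from fragments of several real lyrics | 150 | negative_catalog_mashup = should not dedup; medley/mashup-style input assembled from fragments of several real lyrics |
| 153 | ``` | 151 | ``` |
| 154 | 152 | ||
| 155 | The generator scans the whole library and samples in strata by valid-lyric line count, language type, and file-source prefix. It sets aside a batch of holdout full lyrics as real new-song negatives and generates an evaluation index that excludes the holdout. Each run also outputs: | 153 | ## 7. Strict evaluation |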
| 156 | 154 | ||
| 157 | ```text | 155 | Strict mode counts only `duplicate` as predicted should-dedup: |
| 158 | data/generated_eval/eval_50000.csv.manifest.json | ||
| 159 | data/generated_eval/eval_50000.csv.index.pkl | ||
| 160 | ``` | ||
| 161 | |||
| 162 | Key fields in the manifest: | ||
| 163 | |||
| 164 | ```text | ||
| 165 | library_files  number of lyric files in the library | ||
| 166 | holdout_records  number of records excluded from the evaluation index and used as real new-song negatives | ||
| 167 | sample_type_counts  count per sample type | ||
| 168 | line_count_bucket_counts / language_bucket_counts / source_bucket_counts | ||
| 169 | unique_source_records  number of real source files covered by this evaluation | ||
| 170 | ``` | ||
| 171 | |||
| 172 | ## 4. Strict evaluation: only duplicate counts as dedup | ||
| 173 | 156 | ||
| 174 | ```bash | 157 | ```bash |
| 175 | python -m lyric_dedup.cli evaluate-csv \ | 158 | python scripts/evaluate_postgres.py \ |
| 176 | --index data/generated_eval/eval_50000.csv.index.pkl \ | 159 | --dsn postgresql:///lyric_dedup \ |
| 177 | --csv data/generated_eval/eval_50000.csv \ | 160 | --csv data/generated_eval/eval_hard_5000.csv \ |
| 178 | --base-dir data/generated_eval \ | 161 | --base-dir data/generated_eval \ |
| 179 | --out outputs/results/library_eval_50000.csv | 162 | --out outputs/results/postgres_eval_hard_5000.csv |
| 180 | ``` | 163 | ``` |
| 181 | 164 | ||
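| 182 | In this mode: | 165 | Suited to checking automatic-blocking quality; focus on: |
| 183 | |||
| 184 | ```text | ||
| 185 | duplicate -> predicted should-dedup | ||
| 186 | review -> predicted should-not-dedup | ||
| 187 | new -> predicted should-not-dedup | ||
| 188 | ``` | ||
| 189 | |||
| 190 | Suited to evaluating the precision of automatic blocking; focus on: | ||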
| 191 | 166 | ||
| 192 | ```text | 167 | ```text |
| 168 | precision | ||
| 193 | false_positive | 169 | false_positive |
| 194 | ``` | 170 | ``` |
| 195 | 171 | ||
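| 196 | ## 5. Recall evaluation: both duplicate and review count as catching a suspicious sample | 172 | ## 8. Recall evaluation |
| 173 | |||
| 174 | Recall mode counts both `duplicate` and `review` as catching a suspicious sample: | ||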
| 197 | 175 | ||
| 198 | ```bash | 176 | ```bash |
| 199 | python -m lyric_dedup.cli evaluate-csv \ | 177 | python scripts/evaluate_postgres.py \ |
| 200 | --index data/generated_eval/eval_50000.csv.index.pkl \ | 178 | --dsn postgresql:///lyric_dedup \ |
| 201 | --csv data/generated_eval/eval_50000.csv \ | 179 | --csv data/generated_eval/eval_hard_5000.csv \ |
| 202 | --base-dir data/generated_eval \ | 180 | --base-dir data/generated_eval \ |
| 203 | --positive-decisions duplicate,review \ | 181 | --positive-decisions duplicate,review \ |
| 204 | --out outputs/results/library_eval_50000_review_positive.csv | 182 | --out outputs/results/postgres_eval_hard_5000_review_positive.csv |
| 205 | ``` | 183 | ``` |
| 206 | 184 | ||
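| 207 | In this mode: | 185 | Suited to checking the risk of missed recalls; focus on: |
| 208 | |||
| 209 | ```text | ||
| 210 | duplicate -> predicted should-dedup | ||
| 211 | review -> predicted should-dedup | ||
| 212 | new -> predicted should-not-dedup | ||
| 213 | ``` | ||
| 214 | |||
| 215 | Suited to evaluating suspicious-sample recall; focus on: | ||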
| 216 | 186 | ||
| 217 | ```text | 187 | ```text |
| 188 | recall | ||
| 218 | false_negative | 189 | false_negative |
| 219 | ``` | 190 | ``` |
| 220 | 191 | ||
| 221 | ## 6. View overall metrics | 192 | ## 9. View the summary |
| 222 | |||
| 223 | Strict mode: | ||
| 224 | 193 | ||
| 225 | ```bash | 194 | ```bash |
| 226 | cat outputs/results/library_eval_100.csv.summary.json | 195 | cat outputs/results/postgres_eval_hard_5000.csv.summary.json |
| 227 | ``` | 196 | cat outputs/results/postgres_eval_hard_5000_review_positive.csv.summary.json |
| 228 | |||
| 229 | Recall mode: | ||
| 230 | |||
| 231 | ```bash | ||
| 232 | cat outputs/results/library_eval_100_review_positive.csv.summary.json | ||
| 233 | ``` | 197 | ``` |
| 234 | 198 | ||
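| 235 | Metric definitions: | 199 | Metric definitions: |
| ... | @@ -245,84 +209,16 @@ true_negative  should not dedup and predicted should-not-dedup | ... | @@ -245,84 +209,16 @@ true_negative  should not dedup and predicted should-not-dedup |
| 245 | false_negative  should dedup but predicted should-not-dedup (missed recall) | 209 | false_negative  should dedup but predicted should-not-dedup (missed recall) |
| 246 | ``` | 210 | ``` |
| 247 | 211 | ||
| 248 | ## 7. View per-sample results | 212 | ## 10. View failed samples |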
| 249 | |||
| 250 | ```bash | ||
| 251 | open outputs/results/library_eval_100.csv | ||
| 252 | ``` | ||
| 253 | |||
| 254 | If `open` is unavailable, view the CSV directly: | ||
| 255 | |||
| 256 | ```bash | ||
| 257 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]' | ||
| 258 | ``` | ||
| 259 | |||
| 260 | ## 8. View failed samples | ||
| 261 | 213 | ||
| 262 | Failed samples in strict mode: | 214 | Failed samples in strict mode: |
| 263 | 215 | ||
| 264 | ```bash | 216 | ```bash |
| 265 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]' | 217 | python -c 'import csv; rows=csv.DictReader(open("outputs/results/postgres_eval_hard_5000.csv", encoding="utf-8")); [print(r["id"], r["expected_duplicate"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]' |
| 266 | ``` | ||
| 267 | |||
| 268 | View the full candidate list for one sample: | ||
| 269 | |||
| 270 | ```bash | ||
| 271 | python -m lyric_dedup.cli check-file \ | ||
| 272 | --index outputs/indexes/library_lyrics.pkl \ | ||
| 273 | --file data/generated_eval/incoming/neg_068_mixed_fragments.txt \ | ||
| 274 | --max-candidates 10 | ||
| 275 | ``` | 218 | ``` |
| 276 | 219 | ||
| 277 | ## 9. Verify the test-set distribution | 220 | Count by sample type: |
| 278 | 221 | ||
| 279 | ```bash | 222 | ```bash |
| 280 | python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_10.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))' | 223 | python -c 'import csv,collections; meta={r["id"]:r for r in csv.DictReader(open("data/generated_eval/eval_hard_5000.csv", encoding="utf-8-sig"))}; rows=csv.DictReader(open("outputs/results/postgres_eval_hard_5000.csv", encoding="utf-8")); c=collections.Counter(meta.get(r["id"],{}).get("sample_type","") for r in rows if r["correct"]=="False"); print(c)' |
| 281 | ``` | 224 | ``` |
| 282 | |||
| 283 | Verify the number of files in the generated directory: | ||
| 284 | |||
| 285 | ```bash | ||
| 286 | find data/generated_eval/incoming -type f | wc -l | ||
| 287 | ``` | ||
| 288 | |||
| 289 | ## 10. Run the code tests | ||
| 290 | |||
| 291 | ```bash | ||
| 292 | python -m pytest tests | ||
| 293 | ``` | ||
| 294 | |||
| 295 | Compile check: | ||
| 296 | |||
| 297 | ```bash | ||
| 298 | python -m compileall -q lyric_dedup tests | ||
| 299 | ``` | ||
| 300 | |||
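| 301 | ## 11. On uniqueness within the test set | ||
| 302 | |||
| 303 | The 100 auto-generated samples form a rule-coverage test set; there is no guarantee that the samples are all distinct after normalization. | ||
| 304 | |||
| 305 | If the 100 test samples must be mutually distinct while still using the default ratio: | ||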
| 306 | |||
| 307 | ```text | ||
| 308 | size = 100 | ||
| 309 | positive_ratio = 0.6 | ||
| 310 | ``` | ||
| 311 | |||
| 312 | then you need at least: | ||
| 313 | |||
| 314 | ```text | ||
| 315 | 60 mutually distinct seed lyrics | ||
| 316 | ``` | ||
| 317 | |||
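| 318 | Reason: should-dedup samples are full-song variants, so multiple style variations of one song still normalize to the same song. | ||
| 319 | |||
| 320 | A more reliable way to estimate real-world accuracy is to prepare a manually labeled CSV: | ||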
| 321 | |||
| 322 | ```csv | ||
| 323 | id,file,expected | ||
| 324 | case-001,incoming/song_a.lrc,应去重 | ||
| 325 | case-002,incoming/song_b.txt,不应去重 | ||
| 326 | ``` | ||
| 327 | |||
| 328 | Then run `evaluate-csv` from section 4 or section 5 directly. | ... | ... |
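The `python -c` one-liner added in the new section 10 joins the result CSV with the evaluation CSV to count failures per sample type. A more readable equivalent might look like the sketch below. The function name `failed_sample_types` is an assumption; the column names `id`, `correct`, and `sample_type` are taken from that one-liner, and the in-memory CSV text is made-up sample data.

```python
import collections
import csv
import io


def failed_sample_types(eval_csv, result_csv):
    """Count failed result rows by the sample_type recorded in the evaluation CSV.

    Both arguments are open text files (or any iterable of CSV lines).
    """
    sample_type_by_id = {
        row["id"]: row.get("sample_type", "") for row in csv.DictReader(eval_csv)
    }
    failures = collections.Counter()
    for row in csv.DictReader(result_csv):
        # The result CSV stores correctness as the strings "True" / "False".
        if row["correct"] == "False":
            failures[sample_type_by_id.get(row["id"], "")] += 1
    return failures


# Tiny in-memory stand-ins for the evaluation CSV and the result CSV.
eval_csv = io.StringIO(
    "id,sample_type\n"
    "a,negative_fragment\n"
    "b,positive_realistic_variant\n"
    "c,negative_fragment\n"
)
result_csv = io.StringIO(
    "id,decision,correct\n"
    "a,duplicate,False\n"
    "b,duplicate,True\n"
    "c,duplicate,False\n"
)
breakdown = failed_sample_types(eval_csv, result_csv)
```

With real files, pass `open(path, encoding="utf-8-sig", newline="")` for the evaluation CSV and `open(path, encoding="utf-8", newline="")` for the result CSV, matching the encodings used in the one-liner.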
| 1 | """Incremental lyric duplicate checker.""" | 1 | """Lyric candidate ranking and duplicate decision rules.""" |
| 2 | 2 | ||
| 3 | from __future__ import annotations | 3 | from __future__ import annotations |
| 4 | 4 | ||
| 5 | import hashlib | 5 | import hashlib |
| 6 | import pickle | ||
| 7 | from dataclasses import dataclass | 6 | from dataclasses import dataclass |
| 8 | from enum import Enum | 7 | from enum import Enum |
| 9 | from pathlib import Path | ||
| 10 | 8 | ||
| 11 | from lyric_dedup.minhash_lsh import MinHashConfig | ||
| 12 | from lyric_dedup.minhash_lsh import MinHashLSH | ||
| 13 | from lyric_dedup.normalization import NormalizedLyrics | 9 | from lyric_dedup.normalization import NormalizedLyrics |
| 14 | from lyric_dedup.normalization import fingerprint_text | 10 | from lyric_dedup.normalization import fingerprint_text |
| 15 | from lyric_dedup.normalization import lyric_tokens | 11 | from lyric_dedup.normalization import lyric_tokens |
| ... | @@ -64,103 +60,61 @@ class _IndexedRecord: | ... | @@ -64,103 +60,61 @@ class _IndexedRecord: |
| 64 | translation_tokens: set[str] | 60 | translation_tokens: set[str] |
| 65 | fallback_lines: tuple[str, ...] | 61 | fallback_lines: tuple[str, ...] |
| 66 | fallback_tokens: set[str] | 62 | fallback_tokens: set[str] |
| 67 | signature: tuple[int, ...] | ||
| 68 | 63 | ||
| 69 | 64 | ||
| 70 | class DuplicateChecker: | 65 | class DuplicateChecker: |
| 71 | """In-memory first version for checking newly submitted lyrics. | 66 | """Rank PostgreSQL-recalled candidates and produce the final decision.""" |
| 72 | |||
| 73 | The API is intentionally small: build or load records with ``add_record``, then | ||
| 74 | call ``check`` for a new lyric. Persistence can serialize the indexed fields | ||
| 75 | later without changing result semantics. | ||
| 76 | """ | ||
| 77 | 67 | ||
| 78 | def __init__( | 68 | def __init__( |
| 79 | self, | 69 | self, |
| 80 | *, | 70 | *, |
| 81 | minhash_config: MinHashConfig | None = None, | ||
| 82 | duplicate_jaccard_threshold: float = 0.78, | 71 | duplicate_jaccard_threshold: float = 0.78, |
| 83 | duplicate_line_coverage_threshold: float = 0.72, | 72 | duplicate_line_coverage_threshold: float = 0.72, |
| 73 | duplicate_high_coverage_jaccard_threshold: float = 0.78, | ||
| 74 | duplicate_high_coverage_line_coverage_threshold: float = 0.90, | ||
| 84 | review_jaccard_threshold: float = 0.45, | 75 | review_jaccard_threshold: float = 0.45, |
| 85 | review_line_coverage_threshold: float = 0.35, | 76 | review_line_coverage_threshold: float = 0.35, |
| 77 | review_query_coverage_threshold: float = 0.40, | ||
| 78 | chorus_short_line_count_threshold: int = 6, | ||
| 79 | chorus_material_overlap_threshold: float = 0.20, | ||
| 80 | chorus_material_query_coverage_threshold: float = 0.40, | ||
| 81 | confidence_jaccard_weight: float = 0.58, | ||
| 82 | confidence_line_coverage_weight: float = 0.42, | ||
| 86 | ) -> None: | 83 | ) -> None: |
| 87 | self._lsh = MinHashLSH(minhash_config) | ||
| 88 | self._records: dict[str, _IndexedRecord] = {} | ||
| 89 | self._exact_hash_to_ids: dict[str, set[str]] = {} | ||
| 90 | self._line_to_ids: dict[str, set[str]] = {} | ||
| 91 | self._token_to_ids: dict[str, set[str]] = {} | ||
| 92 | self.duplicate_jaccard_threshold = duplicate_jaccard_threshold | 84 | self.duplicate_jaccard_threshold = duplicate_jaccard_threshold |
| 93 | self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold | 85 | self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold |
| 86 | self.duplicate_high_coverage_jaccard_threshold = duplicate_high_coverage_jaccard_threshold | ||
| 87 | self.duplicate_high_coverage_line_coverage_threshold = duplicate_high_coverage_line_coverage_threshold | ||
| 94 | self.review_jaccard_threshold = review_jaccard_threshold | 88 | self.review_jaccard_threshold = review_jaccard_threshold |
| 95 | self.review_line_coverage_threshold = review_line_coverage_threshold | 89 | self.review_line_coverage_threshold = review_line_coverage_threshold |
| 96 | 90 | self.review_query_coverage_threshold = review_query_coverage_threshold | |
| 97 | def add_record(self, record: LyricRecord) -> None: | 91 | self.chorus_short_line_count_threshold = chorus_short_line_count_threshold |
| 98 | indexed = self._index(record) | 92 | self.chorus_material_overlap_threshold = chorus_material_overlap_threshold |
| 99 | self._add_indexed(record.record_id, indexed) | 93 | self.chorus_material_query_coverage_threshold = chorus_material_query_coverage_threshold |
| 100 | 94 | self.confidence_jaccard_weight = confidence_jaccard_weight | |
| 101 | def add_normalized_record(self, record: LyricRecord, normalized: NormalizedLyrics) -> None: | 95 | self.confidence_line_coverage_weight = confidence_line_coverage_weight |
| 102 | """Add a record when normalized lyrics have already been computed.""" | 96 | |
| 103 | indexed = self._index_normalized(record, normalized) | 97 | def check_record_against_candidates( |
| 104 | self._add_indexed(record.record_id, indexed) | 98 | self, |
| 105 | 99 | record: LyricRecord, | |
| 106 | def _add_indexed(self, record_id: str, indexed: _IndexedRecord) -> None: | 100 | candidates: list[LyricRecord], |
| 107 | self._records[record_id] = indexed | 101 | *, |
| 108 | self._exact_hash_to_ids.setdefault(indexed.exact_hash, set()).add(record_id) | 102 | max_candidates: int = 10, |
| 109 | for line in indexed.normalized.unique_lines: | 103 | ) -> DuplicateCheckResult: |
| 110 | if len(line) >= 4: | 104 | """Rank explicitly supplied candidates without doing in-memory recall. |
| 111 | self._line_to_ids.setdefault(line, set()).add(record_id) | 105 | |
| 112 | for token in indexed.tokens: | 106 | PostgreSQL-backed callers should use this method after database recall so |
| 113 | self._token_to_ids.setdefault(token, set()).add(record_id) | 107 | there is only one retrieval path: PG returns candidates, Python ranks and |
| 114 | for token in indexed.fallback_tokens: | 108 | decides. |
| 115 | self._token_to_ids.setdefault(token, set()).add(record_id) | 109 | """ |
| 116 | self._lsh.add(record_id, indexed.signature) | ||
| 117 | |||
| 118 | def save(self, path: str | Path) -> None: | ||
| 119 | """Persist the in-memory index for later checks.""" | ||
| 120 | with Path(path).open("wb") as file: | ||
| 121 | pickle.dump(self, file, protocol=pickle.HIGHEST_PROTOCOL) | ||
| 122 | |||
| 123 | @classmethod | ||
| 124 | def load(cls, path: str | Path) -> "DuplicateChecker": | ||
| 125 | """Load a previously persisted index.""" | ||
| 126 | with Path(path).open("rb") as file: | ||
| 127 | checker = pickle.load(file) | ||
| 128 | if not isinstance(checker, cls): | ||
| 129 | raise TypeError(f"{path} does not contain a DuplicateChecker index") | ||
| 130 | return checker | ||
| 131 | |||
| 132 | @property | ||
| 133 | def record_count(self) -> int: | ||
| 134 | return len(self._records) | ||
| 135 | |||
| 136 | def check(self, lyrics: str, *, max_candidates: int = 10) -> DuplicateCheckResult: | ||
| 137 | return self.check_record(LyricRecord(record_id="__query__", lyrics=lyrics), max_candidates=max_candidates) | ||
| 138 | |||
| 139 | def check_record(self, record: LyricRecord, *, max_candidates: int = 10) -> DuplicateCheckResult: | ||
| 140 | query = self._index(record) | 110 | query = self._index(record) |
| 141 | exact_ids = self._exact_hash_to_ids.get(query.exact_hash, set()) | ||
| 142 | if exact_ids: | ||
| 143 | candidates = tuple(self._rank_exact_candidate(query, self._records[record_id]) for record_id in sorted(exact_ids)[:max_candidates]) | ||
| 144 | duplicate = next((candidate for candidate in candidates if candidate.decision == DuplicateDecision.DUPLICATE), None) | ||
| 145 | if duplicate is not None: | ||
| 146 | return DuplicateCheckResult( | ||
| 147 | decision=DuplicateDecision.DUPLICATE, | ||
| 148 | confidence=duplicate.confidence, | ||
| 149 | candidates=candidates, | ||
| 150 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 151 | reason=duplicate.reason, | ||
| 152 | ) | ||
| 153 | return DuplicateCheckResult( | ||
| 154 | decision=DuplicateDecision.REVIEW, | ||
| 155 | confidence=candidates[0].confidence, | ||
| 156 | candidates=candidates, | ||
| 157 | normalized_full_text=query.normalized.normalized_full_text, | ||
| 158 | reason=candidates[0].reason, | ||
| 159 | ) | ||
| 160 | |||
| 161 | candidate_ids = self._recall_candidates(query) | ||
| 162 | ranked = sorted( | 111 | ranked = sorted( |
| 163 | (self._rank_candidate(query, self._records[record_id]) for record_id in candidate_ids), | 112 | ( |
| 113 | self._rank_exact_candidate(query, indexed) | ||
| 114 | if indexed.exact_hash == query.exact_hash | ||
| 115 | else self._rank_candidate(query, indexed) | ||
| 116 | for indexed in (self._index(candidate) for candidate in candidates) | ||
| 117 | ), | ||
| 164 | key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard), | 118 | key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard), |
| 165 | reverse=True, | 119 | reverse=True, |
| 166 | )[:max_candidates] | 120 | )[:max_candidates] |
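
Reviewer note: the new `check_record_against_candidates` keeps the old sort key, so ordering depends on how Python compares tuples under `reverse=True`. The sketch below illustrates that ordering only. `Match` is a hypothetical stand-in for `CandidateMatch`, and `is_duplicate` stands in for the `item.decision == DuplicateDecision.DUPLICATE` comparison; neither name comes from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for CandidateMatch with only the sort fields."""
    record_id: str
    is_duplicate: bool  # stands in for decision == DuplicateDecision.DUPLICATE
    confidence: float
    jaccard: float


def rank(matches, max_candidates=10):
    # Tuples compare element by element, so with reverse=True duplicates
    # come first, then higher confidence, then higher Jaccard as tie-breaker.
    ordered = sorted(
        matches,
        key=lambda m: (m.is_duplicate, m.confidence, m.jaccard),
        reverse=True,
    )
    return ordered[:max_candidates]


matches = [
    Match("a", False, 0.95, 0.90),
    Match("b", True, 0.80, 0.70),
    Match("c", True, 0.80, 0.75),
]
ranked_ids = [m.record_id for m in rank(matches)]
top_two_ids = [m.record_id for m in rank(matches, max_candidates=2)]
```

A non-duplicate candidate with a high confidence still sorts below any duplicate, which is worth keeping in mind when `max_candidates` is small.
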
| ... | @@ -203,7 +157,6 @@ class DuplicateChecker: | ... | @@ -203,7 +157,6 @@ class DuplicateChecker: |
| 203 | translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines) | 157 | translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines) |
| 204 | fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics)) | 158 | fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics)) |
| 205 | fallback_tokens = set(fallback_lines) | 159 | fallback_tokens = set(fallback_lines) |
| 206 | signature = self._lsh.signature(primary_tokens or tokens or fallback_tokens) | ||
| 207 | exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest() | 160 | exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest() |
| 208 | return _IndexedRecord( | 161 | return _IndexedRecord( |
| 209 | record=record, | 162 | record=record, |
| ... | @@ -214,25 +167,8 @@ class DuplicateChecker: | ... | @@ -214,25 +167,8 @@ class DuplicateChecker: |
| 214 | translation_tokens=translation_tokens, | 167 | translation_tokens=translation_tokens, |
| 215 | fallback_lines=fallback_lines, | 168 | fallback_lines=fallback_lines, |
| 216 | fallback_tokens=fallback_tokens, | 169 | fallback_tokens=fallback_tokens, |
| 217 | signature=signature, | ||
| 218 | ) | 170 | ) |
| 219 | 171 | ||
| 220 | def _recall_candidates(self, query: _IndexedRecord) -> set[str]: | ||
| 221 | candidate_ids = self._lsh.query(query.signature) | ||
| 222 | for line in query.normalized.primary_lines: | ||
| 223 | if len(line) >= 4: | ||
| 224 | candidate_ids.update(self._line_to_ids.get(line, set())) | ||
| 225 | for line in query.normalized.translation_lines: | ||
| 226 | if len(line) >= 4: | ||
| 227 | candidate_ids.update(self._line_to_ids.get(line, set())) | ||
| 228 | for token in query.primary_tokens or query.tokens: | ||
| 229 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 230 | for token in query.translation_tokens: | ||
| 231 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 232 | for token in query.fallback_tokens: | ||
| 233 | candidate_ids.update(self._token_to_ids.get(token, set())) | ||
| 234 | return candidate_ids | ||
| 235 | |||
| 236 | def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: | 172 | def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: |
| 237 | low_confidence_split = ( | 173 | low_confidence_split = ( |
| 238 | query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low" | 174 | query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low" |
| ... | @@ -306,25 +242,47 @@ class DuplicateChecker: | ... | @@ -306,25 +242,47 @@ class DuplicateChecker: |
| 306 | or jaccard >= self.review_jaccard_threshold | 242 | or jaccard >= self.review_jaccard_threshold |
| 307 | or ( | 243 | or ( |
| 308 | primary_coverage >= self.review_line_coverage_threshold | 244 | primary_coverage >= self.review_line_coverage_threshold |
| 309 | and query_primary_coverage >= 0.40 | 245 | and query_primary_coverage >= self.review_query_coverage_threshold |
| 310 | ) | 246 | ) |
| 311 | or ( | 247 | or ( |
| 312 | coverage >= self.review_line_coverage_threshold | 248 | coverage >= self.review_line_coverage_threshold |
| 313 | and query_coverage >= 0.40 | 249 | and query_coverage >= self.review_query_coverage_threshold |
| 314 | ) | 250 | ) |
| 315 | ) | 251 | ) |
| 316 | has_material_chorus_overlap = chorus_only and ( | 252 | has_material_chorus_overlap = chorus_only and ( |
| 317 | query.normalized.content_line_count <= 6 | 253 | query.normalized.content_line_count <= self.chorus_short_line_count_threshold |
| 318 | or (primary_jaccard >= 0.20 and query_primary_coverage >= 0.40) | 254 | or ( |
| 319 | or (jaccard >= 0.20 and query_coverage >= 0.40) | 255 | primary_jaccard >= self.chorus_material_overlap_threshold |
| 320 | or (primary_coverage >= 0.20 and query_primary_coverage >= 0.40) | 256 | and query_primary_coverage >= self.chorus_material_query_coverage_threshold |
| 321 | or (coverage >= 0.20 and query_coverage >= 0.40) | 257 | ) |
| 258 | or ( | ||
| 259 | jaccard >= self.chorus_material_overlap_threshold | ||
| 260 | and query_coverage >= self.chorus_material_query_coverage_threshold | ||
| 261 | ) | ||
| 262 | or ( | ||
| 263 | primary_coverage >= self.chorus_material_overlap_threshold | ||
| 264 | and query_primary_coverage >= self.chorus_material_query_coverage_threshold | ||
| 265 | ) | ||
| 266 | or ( | ||
| 267 | coverage >= self.chorus_material_overlap_threshold | ||
| 268 | and query_coverage >= self.chorus_material_query_coverage_threshold | ||
| 269 | ) | ||
| 322 | ) | 270 | ) |
| 323 | has_low_confidence_split_overlap = low_confidence_split and has_review_level_overlap | 271 | has_low_confidence_split_overlap = low_confidence_split and has_review_level_overlap |
| 324 | 272 | ||
| 325 | confidence = round((0.58 * primary_jaccard) + (0.42 * primary_coverage), 4) | 273 | confidence = round( |
| 274 | (self.confidence_jaccard_weight * primary_jaccard) | ||
| 275 | + (self.confidence_line_coverage_weight * primary_coverage), | ||
| 276 | 4, | ||
| 277 | ) | ||
| 326 | if ( | 278 | if ( |
| 327 | (primary_jaccard >= self.duplicate_jaccard_threshold or (primary_jaccard >= 0.78 and primary_coverage >= 0.9)) | 279 | ( |
| 280 | primary_jaccard >= self.duplicate_jaccard_threshold | ||
| 281 | or ( | ||
| 282 | primary_jaccard >= self.duplicate_high_coverage_jaccard_threshold | ||
| 283 | and primary_coverage >= self.duplicate_high_coverage_line_coverage_threshold | ||
| 284 | ) | ||
| 285 | ) | ||
| 328 | and primary_coverage >= self.duplicate_line_coverage_threshold | 286 | and primary_coverage >= self.duplicate_line_coverage_threshold |
| 329 | and not chorus_only | 287 | and not chorus_only |
| 330 | and not translation_only | 288 | and not translation_only | ... | ... |
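
Reviewer note: this hunk replaces the literals `0.58`, `0.42`, `0.78` and `0.9` with configurable attributes. A minimal sketch of the confidence formula and the visible part of the duplicate condition follows, assuming those literals remain the defaults. The values for `duplicate_jaccard_threshold` and `duplicate_line_coverage_threshold` are hypothetical placeholders, since their real defaults are not shown in this diff, and the sketch stops at the `translation_only` check where the hunk is cut off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Taken from the literals this diff replaces.
    confidence_jaccard_weight: float = 0.58
    confidence_line_coverage_weight: float = 0.42
    duplicate_high_coverage_jaccard_threshold: float = 0.78
    duplicate_high_coverage_line_coverage_threshold: float = 0.9
    # Hypothetical placeholders: real defaults are not visible in the diff.
    duplicate_jaccard_threshold: float = 0.85
    duplicate_line_coverage_threshold: float = 0.8


def confidence(t, primary_jaccard, primary_coverage):
    # Weighted sum of primary-text Jaccard and line coverage, rounded to 4 places.
    return round(
        t.confidence_jaccard_weight * primary_jaccard
        + t.confidence_line_coverage_weight * primary_coverage,
        4,
    )


def is_duplicate(t, primary_jaccard, primary_coverage, *,
                 chorus_only=False, translation_only=False):
    # Either Jaccard clears the main threshold, or a slightly lower Jaccard
    # is backed by very high line coverage.
    jaccard_ok = primary_jaccard >= t.duplicate_jaccard_threshold or (
        primary_jaccard >= t.duplicate_high_coverage_jaccard_threshold
        and primary_coverage >= t.duplicate_high_coverage_line_coverage_threshold
    )
    return (
        jaccard_ok
        and primary_coverage >= t.duplicate_line_coverage_threshold
        and not chorus_only
        and not translation_only
    )


t = Thresholds()
```

Because the two weights sum to 1, the confidence stays in the range 0 to 1 as long as both inputs do; callers who override one weight may want to override the other as well.
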
| 1 | """Command line tools for lyric duplicate checking.""" | 1 | """PostgreSQL-backed command line tools for lyric duplicate checking.""" |
| 2 | 2 | ||
| 3 | from __future__ import annotations | 3 | from __future__ import annotations |
| 4 | 4 | ||
| 5 | import argparse | 5 | import argparse |
| 6 | import csv | ||
| 7 | import json | 6 | import json |
| 8 | import sys | ||
| 9 | from pathlib import Path | 7 | from pathlib import Path |
| 10 | 8 | ||
| 11 | from lyric_dedup.checker import DuplicateChecker | ||
| 12 | from lyric_dedup.checker import LyricRecord | ||
| 13 | from lyric_dedup.eval_dataset import generate_eval_set | 9 | from lyric_dedup.eval_dataset import generate_eval_set |
| 14 | from lyric_dedup.file_import import iter_lyric_files | ||
| 15 | from lyric_dedup.file_import import read_lyric_file | ||
| 16 | from lyric_dedup.file_import import record_from_file | 10 | from lyric_dedup.file_import import record_from_file |
| 17 | from lyric_dedup.file_import import records_from_dir | ||
| 18 | 11 | ||
| 19 | 12 | ||
| 20 | def main() -> None: | 13 | def main() -> None: |
| 21 | parser = argparse.ArgumentParser(prog="lyric-dedup") | 14 | parser = argparse.ArgumentParser(prog="lyric-dedup") |
| 22 | subparsers = parser.add_subparsers(dest="command", required=True) | 15 | subparsers = parser.add_subparsers(dest="command", required=True) |
| 23 | 16 | ||
| 24 | build = subparsers.add_parser("build-index", help="build an index from .lrc/.txt files") | 17 | check = subparsers.add_parser("check-file", help="check one .lrc/.txt file using PostgreSQL recall") |
| 25 | build.add_argument("--lyrics-dir", required=True) | 18 | check.add_argument("--dsn", default="postgresql:///lyric_dedup") |
| 26 | build.add_argument("--index", required=True) | ||
| 27 | |||
| 28 | check = subparsers.add_parser("check-file", help="check one .lrc/.txt file against an index") | ||
| 29 | check.add_argument("--index", required=True) | ||
| 30 | check.add_argument("--file", required=True) | 19 | check.add_argument("--file", required=True) |
| 31 | check.add_argument("--max-candidates", type=int, default=10) | 20 | check.add_argument("--max-candidates", type=int, default=5) |
| 32 | 21 | check.add_argument("--recall-limit", type=int, default=100) | |
| 33 | batch = subparsers.add_parser("batch-check", help="check a directory of .lrc/.txt files against an index") | 22 | check.add_argument("--enable-trgm", action="store_true") |
| 34 | batch.add_argument("--index", required=True) | 23 | check.add_argument("--trgm-threshold", type=float, default=0.3) |
| 35 | batch.add_argument("--lyrics-dir", required=True) | 24 | check.add_argument("--statement-timeout-ms", type=int, default=5000) |
| 36 | batch.add_argument("--out", required=True) | ||
| 37 | batch.add_argument("--max-candidates", type=int, default=5) | ||
| 38 | |||
| 39 | evaluate = subparsers.add_parser("evaluate-csv", help="evaluate labeled duplicate samples from a CSV file") | ||
| 40 | evaluate.add_argument("--index", required=True) | ||
| 41 | evaluate.add_argument("--csv", required=True) | ||
| 42 | evaluate.add_argument("--out", required=True) | ||
| 43 | evaluate.add_argument("--base-dir", default="") | ||
| 44 | evaluate.add_argument("--positive-decisions", default="duplicate") | ||
| 45 | evaluate.add_argument("--max-candidates", type=int, default=5) | ||
| 46 | 25 | ||
| 47 | generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library") | 26 | generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library") |
| 48 | generate.add_argument("--library-dir", required=True) | 27 | generate.add_argument("--library-dir", required=True) |
| ... | @@ -51,8 +30,6 @@ def main() -> None: | ... | @@ -51,8 +30,6 @@ def main() -> None: |
| 51 | generate.add_argument("--size", type=int, default=100) | 30 | generate.add_argument("--size", type=int, default=100) |
| 52 | generate.add_argument("--positive-ratio", type=float, default=0.3) | 31 | generate.add_argument("--positive-ratio", type=float, default=0.3) |
| 53 | generate.add_argument("--seed", type=int, default=20260602) | 32 | generate.add_argument("--seed", type=int, default=20260602) |
| 54 | generate.add_argument("--index", default="", help="optional source index path recorded in the manifest") | ||
| 55 | generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set") | ||
| 56 | generate.add_argument( | 33 | generate.add_argument( |
| 57 | "--profile", | 34 | "--profile", |
| 58 | choices=("standard", "hard"), | 35 | choices=("standard", "hard"), |
| ... | @@ -61,21 +38,8 @@ def main() -> None: | ... | @@ -61,21 +38,8 @@ def main() -> None: |
| 61 | ) | 38 | ) |
| 62 | 39 | ||
| 63 | args = parser.parse_args() | 40 | args = parser.parse_args() |
| 64 | if args.command == "build-index": | 41 | if args.command == "check-file": |
| 65 | build_index(Path(args.lyrics_dir), Path(args.index)) | 42 | check_file_pg(args) |
| 66 | elif args.command == "check-file": | ||
| 67 | check_file(Path(args.index), Path(args.file), args.max_candidates) | ||
| 68 | elif args.command == "batch-check": | ||
| 69 | batch_check(Path(args.index), Path(args.lyrics_dir), Path(args.out), args.max_candidates) | ||
| 70 | elif args.command == "evaluate-csv": | ||
| 71 | evaluate_csv( | ||
| 72 | Path(args.index), | ||
| 73 | Path(args.csv), | ||
| 74 | Path(args.out), | ||
| 75 | base_dir=Path(args.base_dir) if args.base_dir else None, | ||
| 76 | positive_decisions={item.strip() for item in args.positive_decisions.split(",") if item.strip()}, | ||
| 77 | max_candidates=args.max_candidates, | ||
| 78 | ) | ||
| 79 | elif args.command == "generate-eval-set": | 43 | elif args.command == "generate-eval-set": |
| 80 | summary = generate_eval_set( | 44 | summary = generate_eval_set( |
| 81 | library_dir=Path(args.library_dir), | 45 | library_dir=Path(args.library_dir), |
| ... | @@ -84,315 +48,40 @@ def main() -> None: | ... | @@ -84,315 +48,40 @@ def main() -> None: |
| 84 | size=args.size, | 48 | size=args.size, |
| 85 | positive_ratio=args.positive_ratio, | 49 | positive_ratio=args.positive_ratio, |
| 86 | seed=args.seed, | 50 | seed=args.seed, |
| 87 | index_path=Path(args.index) if args.index else None, | ||
| 88 | eval_index_path=Path(args.eval_index) if args.eval_index else None, | ||
| 89 | profile=args.profile, | 51 | profile=args.profile, |
| 90 | ) | 52 | ) |
| 91 | print(json.dumps(summary, ensure_ascii=False)) | 53 | print(json.dumps(summary, ensure_ascii=False)) |
| 92 | 54 | ||
| 93 | 55 | ||
| 94 | def build_index(lyrics_dir: Path, index_path: Path) -> None: | 56 | def check_file_pg(args: argparse.Namespace) -> None: |
| 95 | checker = DuplicateChecker() | 57 | from lyric_dedup_server.config import ServerConfig |
| 96 | records = records_from_dir(lyrics_dir) | 58 | from lyric_dedup_server.service import DedupService |
| 97 | for record in records: | ||
| 98 | checker.add_record(record) | ||
| 99 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 100 | checker.save(index_path) | ||
| 101 | print(json.dumps({"indexed": checker.record_count, "index": str(index_path)}, ensure_ascii=False)) | ||
| 102 | |||
| 103 | 59 | ||
| 104 | def check_file(index_path: Path, file_path: Path, max_candidates: int) -> None: | 60 | record = record_from_file(Path(args.file)) |
| 105 | checker = DuplicateChecker.load(index_path) | 61 | config = ServerConfig( |
| 106 | record = record_from_file(file_path) | 62 | dsn=args.dsn, |
| 107 | result = checker.check_record(record, max_candidates=max_candidates) | 63 | max_candidates=args.max_candidates, |
| 108 | print(json.dumps(_result_to_dict(result, source=str(file_path)), ensure_ascii=False, indent=2)) | 64 | recall_limit=args.recall_limit, |
| 109 | 65 | enable_trgm=args.enable_trgm, | |
| 110 | 66 | trgm_threshold=args.trgm_threshold, | |
| 111 | def batch_check(index_path: Path, lyrics_dir: Path, out_path: Path, max_candidates: int) -> None: | 67 | statement_timeout_ms=args.statement_timeout_ms, |
| 112 | checker = DuplicateChecker.load(index_path) | 68 | ) |
| 113 | out_path.parent.mkdir(parents=True, exist_ok=True) | 69 | service = DedupService(config=config) |
| 114 | rows: list[dict[str, object]] = [] | 70 | result = service.check(record.lyrics, title=record.title, artist=record.artist) |
| 115 | for path in iter_lyric_files(lyrics_dir): | 71 | print( |
| 116 | record = record_from_file(path, base_dir=lyrics_dir) | 72 | json.dumps( |
| 117 | result = checker.check_record(record, max_candidates=max_candidates) | ||
| 118 | best = result.candidates[0] if result.candidates else None | ||
| 119 | rows.append( | ||
| 120 | { | 73 | { |
| 121 | "source": str(path), | 74 | "source": args.file, |
| 122 | "record_id": record.record_id, | 75 | "decision": result.decision, |
| 123 | "decision": result.decision.value, | 76 | "duplicate": result.duplicate, |
| 124 | "confidence": result.confidence, | 77 | "confidence": result.confidence, |
| 125 | "reason": result.reason, | 78 | "reason": result.reason, |
| 126 | "best_candidate_id": best.record_id if best else "", | 79 | "candidate_count": result.candidate_count, |
| 127 | "best_candidate_decision": best.decision.value if best else "", | 80 | }, |
| 128 | "best_candidate_confidence": best.confidence if best else "", | 81 | ensure_ascii=False, |
| 129 | "best_candidate_jaccard": best.jaccard if best else "", | 82 | indent=2, |
| 130 | "best_candidate_line_coverage": best.line_coverage if best else "", | ||
| 131 | "best_candidate_primary_jaccard": best.primary_jaccard if best else "", | ||
| 132 | "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "", | ||
| 133 | "best_candidate_translation_jaccard": best.translation_jaccard if best else "", | ||
| 134 | "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "", | ||
| 135 | "best_candidate_reason": best.reason if best else "", | ||
| 136 | "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "", | ||
| 137 | } | ||
| 138 | ) | ||
| 139 | |||
| 140 | if out_path.suffix.lower() == ".jsonl": | ||
| 141 | with out_path.open("w", encoding="utf-8") as file: | ||
| 142 | for row in rows: | ||
| 143 | file.write(json.dumps(row, ensure_ascii=False) + "\n") | ||
| 144 | else: | ||
| 145 | with out_path.open("w", encoding="utf-8", newline="") as file: | ||
| 146 | writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["source"]) | ||
| 147 | writer.writeheader() | ||
| 148 | writer.writerows(rows) | ||
| 149 | summary = { | ||
| 150 | "checked": len(rows), | ||
| 151 | "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"), | ||
| 152 | "review": sum(1 for row in rows if row["decision"] == "review"), | ||
| 153 | "new": sum(1 for row in rows if row["decision"] == "new"), | ||
| 154 | "out": str(out_path), | ||
| 155 | } | ||
| 156 | print(json.dumps(summary, ensure_ascii=False)) | ||
| 157 | |||
| 158 | |||
| 159 | def evaluate_csv( | ||
| 160 | index_path: Path, | ||
| 161 | csv_path: Path, | ||
| 162 | out_path: Path, | ||
| 163 | *, | ||
| 164 | base_dir: Path | None, | ||
| 165 | positive_decisions: set[str], | ||
| 166 | max_candidates: int, | ||
| 167 | ) -> None: | ||
| 168 | _progress(f"load index: {index_path}") | ||
| 169 | checker = DuplicateChecker.load(index_path) | ||
| 170 | rows: list[dict[str, object]] = [] | ||
| 171 | total = _csv_data_row_count(csv_path) | ||
| 172 | _progress(f"evaluate csv: 0/{total}") | ||
| 173 | out_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 174 | with csv_path.open(encoding="utf-8-sig", newline="") as file: | ||
| 175 | reader = csv.DictReader(file) | ||
| 176 | if reader.fieldnames is None: | ||
| 177 | raise ValueError("评估 CSV 需要表头") | ||
| 178 | fieldnames = [ | ||
| 179 | "id", | ||
| 180 | "source", | ||
| 181 | "expected_duplicate", | ||
| 182 | "decision", | ||
| 183 | "predicted_duplicate", | ||
| 184 | "correct", | ||
| 185 | "confidence", | ||
| 186 | "reason", | ||
| 187 | "best_candidate_id", | ||
| 188 | "best_candidate_decision", | ||
| 189 | "best_candidate_confidence", | ||
| 190 | "best_candidate_jaccard", | ||
| 191 | "best_candidate_line_coverage", | ||
| 192 | "best_candidate_primary_jaccard", | ||
| 193 | "best_candidate_primary_line_coverage", | ||
| 194 | "best_candidate_translation_jaccard", | ||
| 195 | "best_candidate_translation_line_coverage", | ||
| 196 | "best_candidate_reason", | ||
| 197 | "matched_unique_lines", | ||
| 198 | ] | ||
| 199 | with out_path.open("w", encoding="utf-8", newline="") as out_file: | ||
| 200 | writer = csv.DictWriter(out_file, fieldnames=fieldnames) | ||
| 201 | writer.writeheader() | ||
| 202 | for index, row in enumerate(reader, start=1): | ||
| 203 | row_out = _evaluate_row( | ||
| 204 | row, | ||
| 205 | row_number=index + 1, | ||
| 206 | checker=checker, | ||
| 207 | csv_path=csv_path, | ||
| 208 | base_dir=base_dir, | ||
| 209 | positive_decisions=positive_decisions, | ||
| 210 | max_candidates=max_candidates, | ||
| 211 | ) | ||
| 212 | rows.append(row_out) | ||
| 213 | writer.writerow(row_out) | ||
| 214 | _progress_count("evaluate csv", index, total, step=1000) | ||
| 215 | |||
| 216 | summary = _evaluation_summary(rows, positive_decisions=positive_decisions, out_path=out_path) | ||
| 217 | summary_path = out_path.with_suffix(out_path.suffix + ".summary.json") | ||
| 218 | summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") | ||
| 219 | _progress("evaluation complete") | ||
| 220 | print(json.dumps(summary, ensure_ascii=False)) | ||
| 221 | |||
| 222 | |||
| 223 | def _result_to_dict(result, *, source: str) -> dict[str, object]: | ||
| 224 | return { | ||
| 225 | "source": source, | ||
| 226 | "decision": result.decision.value, | ||
| 227 | "confidence": result.confidence, | ||
| 228 | "reason": result.reason, | ||
| 229 | "candidates": [ | ||
| 230 | { | ||
| 231 | "record_id": candidate.record_id, | ||
| 232 | "decision": candidate.decision.value, | ||
| 233 | "confidence": candidate.confidence, | ||
| 234 | "jaccard": candidate.jaccard, | ||
| 235 | "line_coverage": candidate.line_coverage, | ||
| 236 | "primary_jaccard": candidate.primary_jaccard, | ||
| 237 | "primary_line_coverage": candidate.primary_line_coverage, | ||
| 238 | "translation_jaccard": candidate.translation_jaccard, | ||
| 239 | "translation_line_coverage": candidate.translation_line_coverage, | ||
| 240 | "reason": candidate.reason, | ||
| 241 | "matched_unique_lines": list(candidate.matched_unique_lines), | ||
| 242 | } | ||
| 243 | for candidate in result.candidates | ||
| 244 | ], | ||
| 245 | } | ||
| 246 | |||
| 247 | |||
| 248 | def _evaluate_row( | ||
| 249 | row: dict[str, str], | ||
| 250 | *, | ||
| 251 | row_number: int, | ||
| 252 | checker: DuplicateChecker, | ||
| 253 | csv_path: Path, | ||
| 254 | base_dir: Path | None, | ||
| 255 | positive_decisions: set[str], | ||
| 256 | max_candidates: int, | ||
| 257 | ) -> dict[str, object]: | ||
| 258 | sample_id = row.get("id") or row.get("sample_id") or str(row_number) | ||
| 259 | record, source = _record_from_eval_row(row, csv_path=csv_path, base_dir=base_dir) | ||
| 260 | expected_duplicate = _parse_expected(row.get("expected") or row.get("label") or row.get("target")) | ||
| 261 | result = checker.check_record(record, max_candidates=max_candidates) | ||
| 262 | predicted_duplicate = result.decision.value in positive_decisions | ||
| 263 | best = result.candidates[0] if result.candidates else None | ||
| 264 | return { | ||
| 265 | "id": sample_id, | ||
| 266 | "source": source, | ||
| 267 | "expected_duplicate": expected_duplicate, | ||
| 268 | "decision": result.decision.value, | ||
| 269 | "predicted_duplicate": predicted_duplicate, | ||
| 270 | "correct": expected_duplicate == predicted_duplicate, | ||
| 271 | "confidence": result.confidence, | ||
| 272 | "reason": result.reason, | ||
| 273 | "best_candidate_id": best.record_id if best else "", | ||
| 274 | "best_candidate_decision": best.decision.value if best else "", | ||
| 275 | "best_candidate_confidence": best.confidence if best else "", | ||
| 276 | "best_candidate_jaccard": best.jaccard if best else "", | ||
| 277 | "best_candidate_line_coverage": best.line_coverage if best else "", | ||
| 278 | "best_candidate_primary_jaccard": best.primary_jaccard if best else "", | ||
| 279 | "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "", | ||
| 280 | "best_candidate_translation_jaccard": best.translation_jaccard if best else "", | ||
| 281 | "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "", | ||
| 282 | "best_candidate_reason": best.reason if best else "", | ||
| 283 | "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "", | ||
| 284 | } | ||
| 285 | |||
| 286 | |||
| 287 | def _lyrics_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[str, str]: | ||
| 288 | lyrics = (row.get("lyrics") or "").strip() | ||
| 289 | if lyrics: | ||
| 290 | return lyrics.replace("\\n", "\n"), "inline" | ||
| 291 | |||
| 292 | file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip() | ||
| 293 | if not file_value: | ||
| 294 | raise ValueError("评估 CSV 每行需要提供 lyrics,或 file/path/source 文件路径") | ||
| 295 | |||
| 296 | file_path = Path(file_value) | ||
| 297 | if not file_path.is_absolute(): | ||
| 298 | file_path = (base_dir or csv_path.parent) / file_path | ||
| 299 | return read_lyric_file(file_path), str(file_path) | ||
| 300 | |||
| 301 | |||
| 302 | def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None): | ||
| 303 | lyrics = (row.get("lyrics") or "").strip() | ||
| 304 | if lyrics: | ||
| 305 | return ( | ||
| 306 | LyricRecord( | ||
| 307 | record_id=row.get("id") or row.get("sample_id") or "__eval__", | ||
| 308 | lyrics=lyrics.replace("\\n", "\n"), | ||
| 309 | title=row.get("title") or None, | ||
| 310 | artist=row.get("artist") or None, | ||
| 311 | ), | ||
| 312 | "inline", | ||
| 313 | ) | 83 | ) |
| 314 | 84 | ) | |
| 315 | file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip() | ||
| 316 | if not file_value: | ||
| 317 | raise ValueError("评估 CSV 每行需要 lyrics,或 file/path/source 文件路径") | ||
| 318 | |||
| 319 | file_path = Path(file_value) | ||
| 320 | if not file_path.is_absolute(): | ||
| 321 | file_path = (base_dir or csv_path.parent) / file_path | ||
| 322 | record = record_from_file(file_path) | ||
| 323 | if row.get("title") or row.get("artist"): | ||
| 324 | record = LyricRecord( | ||
| 325 | record_id=record.record_id, | ||
| 326 | lyrics=record.lyrics, | ||
| 327 | title=row.get("title") or record.title, | ||
| 328 | artist=row.get("artist") or record.artist, | ||
| 329 | ) | ||
| 330 | return record, str(file_path) | ||
| 331 | |||
| 332 | |||
| 333 | def _parse_expected(value: str | None) -> bool: | ||
| 334 | if value is None: | ||
| 335 | raise ValueError("评估 CSV 每行需要 expected/label/target 列") | ||
| 336 | normalized = value.strip().lower() | ||
| 337 | positives = {"1", "true", "yes", "y", "duplicate", "dup", "重复", "应去重", "去重", "是"} | ||
| 338 | negatives = {"0", "false", "no", "n", "new", "not_duplicate", "non_duplicate", "不重复", "不应去重", "新歌", "否"} | ||
| 339 | if normalized in positives: | ||
| 340 | return True | ||
| 341 | if normalized in negatives: | ||
| 342 | return False | ||
| 343 | raise ValueError(f"无法识别 expected 值: {value!r}") | ||
| 344 | |||
| 345 | |||
| 346 | def _evaluation_summary( | ||
| 347 | rows: list[dict[str, object]], | ||
| 348 | *, | ||
| 349 | positive_decisions: set[str], | ||
| 350 | out_path: Path, | ||
| 351 | ) -> dict[str, object]: | ||
| 352 | tp = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is True) | ||
| 353 | fp = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is True) | ||
| 354 | tn = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is False) | ||
| 355 | fn = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is False) | ||
| 356 | total = len(rows) | ||
| 357 | precision = tp / (tp + fp) if tp + fp else 0.0 | ||
| 358 | recall = tp / (tp + fn) if tp + fn else 0.0 | ||
| 359 | accuracy = (tp + tn) / total if total else 0.0 | ||
| 360 | f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0 | ||
| 361 | return { | ||
| 362 | "total": total, | ||
| 363 | "positive_decisions": sorted(positive_decisions), | ||
| 364 | "accuracy": round(accuracy, 4), | ||
| 365 | "precision": round(precision, 4), | ||
| 366 | "recall": round(recall, 4), | ||
| 367 | "f1": round(f1, 4), | ||
| 368 | "true_positive": tp, | ||
| 369 | "false_positive": fp, | ||
| 370 | "true_negative": tn, | ||
| 371 | "false_negative": fn, | ||
| 372 | "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"), | ||
| 373 | "review": sum(1 for row in rows if row["decision"] == "review"), | ||
| 374 | "new": sum(1 for row in rows if row["decision"] == "new"), | ||
| 375 | "out": str(out_path), | ||
| 376 | "summary": str(out_path.with_suffix(out_path.suffix + ".summary.json")), | ||
| 377 | } | ||
| 378 | |||
| 379 | |||
| 380 | def _csv_data_row_count(csv_path: Path) -> int: | ||
| 381 | with csv_path.open(encoding="utf-8-sig", newline="") as file: | ||
| 382 | reader = csv.reader(file) | ||
| 383 | next(reader, None) | ||
| 384 | return sum(1 for _ in reader) | ||
| 385 | |||
| 386 | |||
| 387 | def _progress(message: str) -> None: | ||
| 388 | print(f"[eval] {message}", file=sys.stderr, flush=True) | ||
| 389 | |||
| 390 | |||
| 391 | def _progress_count(label: str, current: int, total: int, *, step: int = 1000) -> None: | ||
| 392 | if total <= 0: | ||
| 393 | return | ||
| 394 | if current == 1 or current == total or current % step == 0: | ||
| 395 | _progress(f"{label}: {current}/{total}") | ||
| 396 | 85 | ||
| 397 | 86 | ||
| 398 | if __name__ == "__main__": | 87 | if __name__ == "__main__": | ... | ... |
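
Reviewer note: after this change `check-file` takes a DSN instead of an index path, and `--max-candidates` now defaults to 5 rather than 10. The sketch below reproduces only the new `check-file` argument surface from the diff so the defaults can be checked without a database. `build_parser` is a hypothetical helper name (the diff builds the parser inline in `main`), and the `generate-eval-set` subcommand is left out.

```python
import argparse


def build_parser():
    # Hypothetical helper: mirrors the check-file arguments added in the diff.
    parser = argparse.ArgumentParser(prog="lyric-dedup")
    subparsers = parser.add_subparsers(dest="command", required=True)

    check = subparsers.add_parser(
        "check-file", help="check one .lrc/.txt file using PostgreSQL recall"
    )
    check.add_argument("--dsn", default="postgresql:///lyric_dedup")
    check.add_argument("--file", required=True)
    check.add_argument("--max-candidates", type=int, default=5)
    check.add_argument("--recall-limit", type=int, default=100)
    check.add_argument("--enable-trgm", action="store_true")
    check.add_argument("--trgm-threshold", type=float, default=0.3)
    check.add_argument("--statement-timeout-ms", type=int, default=5000)
    return parser


# Parsing only; nothing here connects to PostgreSQL.
args = build_parser().parse_args(["check-file", "--file", "song.lrc"])
trgm_args = build_parser().parse_args(
    ["check-file", "--file", "song.lrc", "--enable-trgm", "--trgm-threshold", "0.45"]
)
```

Trigram recall stays off unless `--enable-trgm` is passed, so the default invocation relies on the other recall paths only.
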
| ... | @@ -12,7 +12,6 @@ from collections import Counter | ... | @@ -12,7 +12,6 @@ from collections import Counter |
| 12 | from dataclasses import dataclass | 12 | from dataclasses import dataclass |
| 13 | from pathlib import Path | 13 | from pathlib import Path |
| 14 | 14 | ||
| 15 | from lyric_dedup.checker import DuplicateChecker | ||
| 16 | from lyric_dedup.checker import LyricRecord | 15 | from lyric_dedup.checker import LyricRecord |
| 17 | from lyric_dedup.file_import import iter_lyric_files | 16 | from lyric_dedup.file_import import iter_lyric_files |
| 18 | from lyric_dedup.file_import import record_from_file | 17 | from lyric_dedup.file_import import record_from_file |
| ... | @@ -133,8 +132,6 @@ def generate_eval_set( | ... | @@ -133,8 +132,6 @@ def generate_eval_set( |
| 133 | ) | 132 | ) |
| 134 | holdout_ids = {profile.record_id for profile in holdout_profiles} | 133 | holdout_ids = {profile.record_id for profile in holdout_profiles} |
| 135 | indexed_profiles = [profile for profile in profiles if profile.record_id not in holdout_ids] or profiles | 134 | indexed_profiles = [profile for profile in profiles if profile.record_id not in holdout_ids] or profiles |
| 136 | eval_index_path = eval_index_path or csv_path.with_suffix(csv_path.suffix + ".index.pkl") | ||
| 137 | _build_eval_index(indexed_profiles, eval_index_path) | ||
| 138 | groups = _profile_groups(indexed_profiles) | 135 | groups = _profile_groups(indexed_profiles) |
| 139 | samples: list[GeneratedSample] = [] | 136 | samples: list[GeneratedSample] = [] |
| 140 | 137 | ||
| ... | @@ -373,25 +370,6 @@ def _stratified_unique_sample(profiles: list[LyricProfile], count: int, rng: ran | ... | @@ -373,25 +370,6 @@ def _stratified_unique_sample(profiles: list[LyricProfile], count: int, rng: ran |
| 373 | return _stratified_sample(profiles, min(count, len(profiles)), rng) | 370 | return _stratified_sample(profiles, min(count, len(profiles)), rng) |
| 374 | 371 | ||
| 375 | 372 | ||
| 376 | def _build_eval_index(profiles: list[LyricProfile], index_path: Path) -> None: | ||
| 377 | _progress(f"build eval index excluding holdout: {index_path}") | ||
| 378 | checker = DuplicateChecker() | ||
| 379 | total = len(profiles) | ||
| 380 | for index, profile in enumerate(profiles, start=1): | ||
| 381 | checker.add_normalized_record( | ||
| 382 | LyricRecord( | ||
| 383 | record_id=profile.record_id, | ||
| 384 | lyrics=profile.raw_text, | ||
| 385 | title=profile.title or None, | ||
| 386 | artist=profile.artist or None, | ||
| 387 | ), | ||
| 388 | profile.normalized, | ||
| 389 | ) | ||
| 390 | _progress_count("build eval index", index, total, step=5000) | ||
| 391 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 392 | checker.save(index_path) | ||
| 393 | |||
| 394 | |||
| 395 | def _build_positive_samples( | 373 | def _build_positive_samples( |
| 396 | profiles: list[LyricProfile], | 374 | profiles: list[LyricProfile], |
| 397 | output_dir: Path, | 375 | output_dir: Path, |
| ... | @@ -889,7 +867,7 @@ def _write_manifest( | ... | @@ -889,7 +867,7 @@ def _write_manifest( |
| 889 | "sample_size": len(samples), | 867 | "sample_size": len(samples), |
| 890 | "plan": plan, | 868 | "plan": plan, |
| 891 | "source_index": str(index_path) if index_path else "", | 869 | "source_index": str(index_path) if index_path else "", |
| 892 | "eval_index": str(eval_index_path), | 870 | "eval_index": str(eval_index_path) if eval_index_path else "", |
| 893 | "holdout_records": holdout_count, | 871 | "holdout_records": holdout_count, |
| 894 | "lyrics_dir": str(output_dir), | 872 | "lyrics_dir": str(output_dir), |
| 895 | "csv": str(csv_path), | 873 | "csv": str(csv_path), | ... | ... |
lyric_dedup/minhash_lsh.py
deleted
100644 → 0
| 1 | """Small in-memory MinHash LSH index for incremental lyric lookup.""" | ||
| 2 | |||
| 3 | from __future__ import annotations | ||
| 4 | |||
| 5 | import hashlib | ||
| 6 | from collections import defaultdict | ||
| 7 | from dataclasses import dataclass | ||
| 8 | |||
| 9 | |||
| 10 | _MAX_HASH = (1 << 64) - 1 | ||
| 11 | |||
| 12 | |||
| 13 | @dataclass(frozen=True) | ||
| 14 | class MinHashConfig: | ||
| 15 | num_perm: int = 96 | ||
| 16 | bands: int = 24 | ||
| 17 | seed: int = 17 | ||
| 18 | |||
| 19 | @property | ||
| 20 | def rows_per_band(self) -> int: | ||
| 21 | if self.num_perm % self.bands != 0: | ||
| 22 | raise ValueError("num_perm must be divisible by bands") | ||
| 23 | return self.num_perm // self.bands | ||
| 24 | |||
| 25 | |||
| 26 | class MinHashLSH: | ||
| 27 | def __init__(self, config: MinHashConfig | None = None) -> None: | ||
| 28 | self.config = config or MinHashConfig() | ||
| 29 | self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set) | ||
| 30 | |||
| 31 | def signature(self, tokens: set[str]) -> tuple[int, ...]: | ||
| 32 | if not tokens: | ||
| 33 | return tuple([_MAX_HASH] * self.config.num_perm) | ||
| 34 | |||
| 35 | signature = [_MAX_HASH] * self.config.num_perm | ||
| 36 | for token in tokens: | ||
| 37 | encoded = token.encode("utf-8") | ||
| 38 | for idx in range(self.config.num_perm): | ||
| 39 | digest = hashlib.blake2b( | ||
| 40 | encoded, | ||
| 41 | digest_size=8, | ||
| 42 | person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16], | ||
| 43 | ).digest() | ||
| 44 | value = int.from_bytes(digest, "big") | ||
| 45 | if value < signature[idx]: | ||
| 46 | signature[idx] = value | ||
| 47 | return tuple(signature) | ||
| 48 | |||
| 49 | def add(self, record_id: str, signature: tuple[int, ...]) -> None: | ||
| 50 | for key in self._band_keys(signature): | ||
| 51 | self._buckets[key].add(record_id) | ||
| 52 | |||
| 53 | def query(self, signature: tuple[int, ...]) -> set[str]: | ||
| 54 | candidates: set[str] = set() | ||
| 55 | for key in self._band_keys(signature): | ||
| 56 | candidates.update(self._buckets.get(key, set())) | ||
| 57 | return candidates | ||
| 58 | |||
| 59 | def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]: | ||
| 60 | rows = self.config.rows_per_band | ||
| 61 | return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)] |
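Review note: for readers who want to see what candidate recall the deleted module provided, here is a condensed sketch of the same MinHash banding idea using plain functions. The constants mirror the removed `MinHashConfig` defaults (96 permutations, 24 bands, seed 17). The function names, the explicit `buckets` argument and `collision_probability` are illustrative assumptions, not part of the original file.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 96
BANDS = 24
ROWS = NUM_PERM // BANDS  # 4 signature rows per band
MAX_HASH = (1 << 64) - 1


def signature(tokens: set[str], seed: int = 17) -> tuple[int, ...]:
    """Keep, per salted hash function, the minimum hash over all tokens."""
    sig = [MAX_HASH] * NUM_PERM
    for token in tokens:
        encoded = token.encode("utf-8")
        for idx in range(NUM_PERM):
            # The blake2b personalization string acts as a per-permutation salt.
            person = f"lyr{seed + idx:05d}".encode("ascii")
            digest = hashlib.blake2b(encoded, digest_size=8, person=person).digest()
            sig[idx] = min(sig[idx], int.from_bytes(digest, "big"))
    return tuple(sig)


def band_keys(sig: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]:
    """Split a signature into (band number, band slice) bucket keys."""
    return [(band, sig[band * ROWS : (band + 1) * ROWS]) for band in range(BANDS)]


def add(buckets: dict, record_id: str, sig: tuple[int, ...]) -> None:
    for key in band_keys(sig):
        buckets[key].add(record_id)


def query(buckets: dict, sig: tuple[int, ...]) -> set[str]:
    """Return every record that shares at least one full band with sig."""
    found: set[str] = set()
    for key in band_keys(sig):
        found.update(buckets.get(key, set()))
    return found


def collision_probability(jaccard: float) -> float:
    """Chance that two sets with this Jaccard similarity share a band."""
    return 1 - (1 - jaccard**ROWS) ** BANDS


buckets: dict = defaultdict(set)
```

After this change PostgreSQL (exact hash, line hash, optional `pg_trgm`) takes over the recall role, so nothing in the new code path needs these buckets.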
| ... | @@ -8,69 +8,10 @@ import unicodedata | ... | @@ -8,69 +8,10 @@ import unicodedata |
| 8 | from collections import Counter | 8 | from collections import Counter |
| 9 | from dataclasses import dataclass | 9 | from dataclasses import dataclass |
| 10 | 10 | ||
| 11 | import opencc | ||
| 11 | 12 | ||
| 12 | _TRADITIONAL_TO_SIMPLIFIED = str.maketrans( | 13 | |
| 13 | { | 14 | _T2S_CONVERTER = opencc.OpenCC("t2s.json") |
| 14 | "愛": "爱", | ||
| 15 | "會": "会", | ||
| 16 | "個": "个", | ||
| 17 | "妳": "你", | ||
| 18 | "們": "们", | ||
| 19 | "麼": "么", | ||
| 20 | "夢": "梦", | ||
| 21 | "憶": "忆", | ||
| 22 | "風": "风", | ||
| 23 | "無": "无", | ||
| 24 | "與": "与", | ||
| 25 | "聽": "听", | ||
| 26 | "說": "说", | ||
| 27 | "見": "见", | ||
| 28 | "話": "话", | ||
| 29 | "還": "还", | ||
| 30 | "這": "这", | ||
| 31 | "那": "那", | ||
| 32 | "裡": "里", | ||
| 33 | "裏": "里", | ||
| 34 | "過": "过", | ||
| 35 | "來": "来", | ||
| 36 | "進": "进", | ||
| 37 | "去": "去", | ||
| 38 | "給": "给", | ||
| 39 | "讓": "让", | ||
| 40 | "嗎": "吗", | ||
| 41 | "為": "为", | ||
| 42 | "誰": "谁", | ||
| 43 | "對": "对", | ||
| 44 | "錯": "错", | ||
| 45 | "淚": "泪", | ||
| 46 | "寫": "写", | ||
| 47 | "雲": "云", | ||
| 48 | "藍": "蓝", | ||
| 49 | "紅": "红", | ||
| 50 | "綠": "绿", | ||
| 51 | "黃": "黄", | ||
| 52 | "長": "长", | ||
| 53 | "遠": "远", | ||
| 54 | "燈": "灯", | ||
| 55 | "臺": "台", | ||
| 56 | "台": "台", | ||
| 57 | "後": "后", | ||
| 58 | "從": "从", | ||
| 59 | "時": "时", | ||
| 60 | "間": "间", | ||
| 61 | "葉": "叶", | ||
| 62 | "歲": "岁", | ||
| 63 | "聲": "声", | ||
| 64 | "邊": "边", | ||
| 65 | "歡": "欢", | ||
| 66 | "繼": "继", | ||
| 67 | "續": "续", | ||
| 68 | "難": "难", | ||
| 69 | "雙": "双", | ||
| 70 | "舊": "旧", | ||
| 71 | "離": "离", | ||
| 72 | } | ||
| 73 | ) | ||
| 74 | 15 | ||
| 75 | _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]") | 16 | _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]") |
| 76 | _BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]") | 17 | _BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]") |
| ... | @@ -212,7 +153,7 @@ def _split_inline_translation(line: str, timestamp: str | None, source_index: in | ... | @@ -212,7 +153,7 @@ def _split_inline_translation(line: str, timestamp: str | None, source_index: in |
| 212 | 153 | ||
| 213 | def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]: | 154 | def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]: |
| 214 | line = _BRACKET_RE.sub("", text) | 155 | line = _BRACKET_RE.sub("", text) |
| 215 | line = line.strip().lower().translate(_TRADITIONAL_TO_SIMPLIFIED) | 156 | line = _T2S_CONVERTER.convert(line.strip().lower()) |
| 216 | if not line or _is_noise_line(line): | 157 | if not line or _is_noise_line(line): |
| 217 | return [] | 158 | return [] |
| 218 | line = _strip_symbols(line) | 159 | line = _strip_symbols(line) | ... | ... |
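Review note: the changed line keeps the same order of operations (strip bracketed text, trim, lowercase, then convert Traditional to Simplified) and only swaps the converter. The sketch below isolates that order with the converter injected as a parameter, so it runs without OpenCC installed. `normalize_line` and `toy_t2s` are hypothetical names; in the diff the converter is `_T2S_CONVERTER.convert`, and the two-character table here is only a stand-in for it.

```python
import re
from typing import Callable

# Bracket pattern copied from the diff context above.
_BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]")


def normalize_line(text: str, convert: Callable[[str], str] = lambda s: s) -> str:
    """Drop short bracketed spans, trim, lowercase, then apply the converter."""
    line = _BRACKET_RE.sub("", text)
    return convert(line.strip().lower())


# Stand-in converter for the sketch; the real code delegates to OpenCC.
_TOY_TABLE = str.maketrans({"愛": "爱", "會": "会"})


def toy_t2s(text: str) -> str:
    return text.translate(_TOY_TABLE)
```

Passing the converter in also makes it straightforward to unit-test the surrounding normalization without depending on a particular OpenCC build.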
| ... | @@ -4,14 +4,101 @@ from __future__ import annotations | ... | @@ -4,14 +4,101 @@ from __future__ import annotations |
| 4 | 4 | ||
| 5 | import os | 5 | import os |
| 6 | from dataclasses import dataclass | 6 | from dataclasses import dataclass |
| 7 | from pathlib import Path | ||
| 8 | |||
| 9 | |||
| 10 | def _load_env_file() -> None: | ||
| 11 | """Load root .env values without overriding real environment variables.""" | ||
| 12 | env_path = Path(__file__).resolve().parent.parent / ".env" | ||
| 13 | if not env_path.exists(): | ||
| 14 | return | ||
| 15 | with env_path.open(encoding="utf-8") as file: | ||
| 16 | for raw_line in file: | ||
| 17 | line = raw_line.strip() | ||
| 18 | if not line or line.startswith("#") or "=" not in line: | ||
| 19 | continue | ||
| 20 | key, value = line.split("=", 1) | ||
| 21 | os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'")) | ||
| 22 | |||
| 23 | |||
| 24 | _load_env_file() | ||
| 7 | 25 | ||
| 8 | 26 | ||
| 9 | @dataclass | 27 | @dataclass |
| 10 | class ServerConfig: | 28 | class ServerConfig: |
| 29 | # PostgreSQL DSN used by the dedup service. | ||
| 11 | dsn: str = os.getenv("LYRIC_DEDUP_DSN", "postgresql:///lyric_dedup") | 30 | dsn: str = os.getenv("LYRIC_DEDUP_DSN", "postgresql:///lyric_dedup") |
| 31 | |||
| 32 | # Maximum ranked candidates returned in the final API result. | ||
| 12 | max_candidates: int = int(os.getenv("LYRIC_DEDUP_MAX_CANDIDATES", "5")) | 33 | max_candidates: int = int(os.getenv("LYRIC_DEDUP_MAX_CANDIDATES", "5")) |
| 34 | |||
| 35 | # Maximum candidates recalled from each PostgreSQL recall tier. | ||
| 13 | recall_limit: int = int(os.getenv("LYRIC_DEDUP_RECALL_LIMIT", "100")) | 36 | recall_limit: int = int(os.getenv("LYRIC_DEDUP_RECALL_LIMIT", "100")) |
| 37 | |||
| 38 | # Whether to use pg_trgm similarity recall in addition to exact hash and line hash recall. | ||
| 14 | enable_trgm: bool = os.getenv("LYRIC_DEDUP_ENABLE_TRGM", "false").lower() == "true" | 39 | enable_trgm: bool = os.getenv("LYRIC_DEDUP_ENABLE_TRGM", "false").lower() == "true" |
| 40 | |||
| 41 | # PostgreSQL pg_trgm recall threshold; lower values recall more candidates and cost more. | ||
| 15 | trgm_threshold: float = float(os.getenv("LYRIC_DEDUP_TRGM_THRESHOLD", "0.3")) | 42 | trgm_threshold: float = float(os.getenv("LYRIC_DEDUP_TRGM_THRESHOLD", "0.3")) |
| 43 | |||
| 44 | # PostgreSQL statement timeout for one dedup check, in milliseconds. | ||
| 16 | statement_timeout_ms: int = int(os.getenv("LYRIC_DEDUP_STATEMENT_TIMEOUT_MS", "5000")) | 45 | statement_timeout_ms: int = int(os.getenv("LYRIC_DEDUP_STATEMENT_TIMEOUT_MS", "5000")) |
| 46 | |||
| 47 | # HTTP download timeout for fetching lyric URLs, in seconds. | ||
| 17 | download_timeout: int = int(os.getenv("LYRIC_DEDUP_DOWNLOAD_TIMEOUT", "10")) | 48 | download_timeout: int = int(os.getenv("LYRIC_DEDUP_DOWNLOAD_TIMEOUT", "10")) |
| 49 | |||
| 50 | # Minimum primary n-gram Jaccard similarity required for automatic duplicate. | ||
| 51 | # Raising this makes automatic duplicate stricter; lowering it may increase false positives. | ||
| 52 | duplicate_jaccard_threshold: float = float(os.getenv("LYRIC_DEDUP_DUPLICATE_JACCARD_THRESHOLD", "0.78")) | ||
| 53 | |||
| 54 | # Minimum line coverage required for automatic duplicate. | ||
| 55 | # This is the main guard against treating partial lyric fragments as full duplicates. | ||
| 56 | duplicate_line_coverage_threshold: float = float( | ||
| 57 | os.getenv("LYRIC_DEDUP_DUPLICATE_LINE_COVERAGE_THRESHOLD", "0.72") | ||
| 58 | ) | ||
| 59 | |||
| | | 60 | # Alternate automatic duplicate path: a candidate can still be marked duplicate at a lower/normal Jaccard when line coverage is very high. |
| 61 | # Keep this aligned with duplicate_jaccard_threshold to avoid an unintended duplicate backdoor. | ||
| 62 | duplicate_high_coverage_jaccard_threshold: float = float( | ||
| 63 | os.getenv("LYRIC_DEDUP_DUPLICATE_HIGH_COVERAGE_JACCARD_THRESHOLD", "0.78") | ||
| 64 | ) | ||
| 65 | |||
| 66 | # Line coverage required by the alternate high-coverage duplicate path. | ||
| 67 | # Raising this makes the alternate duplicate path stricter for near-complete variants. | ||
| 68 | duplicate_high_coverage_line_coverage_threshold: float = float( | ||
| 69 | os.getenv("LYRIC_DEDUP_DUPLICATE_HIGH_COVERAGE_LINE_COVERAGE_THRESHOLD", "0.90") | ||
| 70 | ) | ||
| 71 | |||
| 72 | # Minimum primary/full n-gram Jaccard similarity that can send a candidate to review. | ||
| 73 | # Raising this reduces review volume; lowering it catches weaker suspicious overlaps. | ||
| 74 | review_jaccard_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_JACCARD_THRESHOLD", "0.45")) | ||
| 75 | |||
| 76 | # Minimum line coverage that can send a candidate to review when query coverage is also material. | ||
| 77 | # Raising this reduces fragment/short-overlap reviews; lowering it increases suspicious recall. | ||
| 78 | review_line_coverage_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_LINE_COVERAGE_THRESHOLD", "0.35")) | ||
| 79 | |||
| 80 | # Minimum share of query lines that must match before line coverage alone can trigger review. | ||
| 81 | # Raising this makes partial-fragment review stricter. | ||
| 82 | review_query_coverage_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_QUERY_COVERAGE_THRESHOLD", "0.40")) | ||
| 83 | |||
| | | 84 | # Line-count cutoff for treating a query as very short; repeated-chorus overlap on such queries can be forced into review. |
| 85 | # Raising this catches more short chorus-like inputs; lowering it reduces review volume. | ||
| 86 | chorus_short_line_count_threshold: int = int(os.getenv("LYRIC_DEDUP_CHORUS_SHORT_LINE_COUNT_THRESHOLD", "6")) | ||
| 87 | |||
| 88 | # Minimum similarity/coverage signal for repeated-chorus overlap to be considered material. | ||
| 89 | # Raising this makes chorus-only review stricter. | ||
| 90 | chorus_material_overlap_threshold: float = float(os.getenv("LYRIC_DEDUP_CHORUS_MATERIAL_OVERLAP_THRESHOLD", "0.20")) | ||
| 91 | |||
| 92 | # Minimum query-side coverage for repeated-chorus overlap to be considered material. | ||
| 93 | # Raising this reduces review decisions caused by small shared chorus fragments. | ||
| 94 | chorus_material_query_coverage_threshold: float = float( | ||
| 95 | os.getenv("LYRIC_DEDUP_CHORUS_MATERIAL_QUERY_COVERAGE_THRESHOLD", "0.40") | ||
| 96 | ) | ||
| 97 | |||
| 98 | # Weight assigned to primary n-gram Jaccard when computing confidence. | ||
| 99 | # This affects the reported confidence score, not the duplicate/review threshold checks directly. | ||
| 100 | confidence_jaccard_weight: float = float(os.getenv("LYRIC_DEDUP_CONFIDENCE_JACCARD_WEIGHT", "0.58")) | ||
| 101 | |||
| 102 | # Weight assigned to primary line coverage when computing confidence. | ||
| 103 | # Keep this coordinated with confidence_jaccard_weight; defaults sum to 1.0. | ||
| 104 | confidence_line_coverage_weight: float = float(os.getenv("LYRIC_DEDUP_CONFIDENCE_LINE_COVERAGE_WEIGHT", "0.42")) | ... | ... |
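Review note: the comments above describe how the thresholds are meant to interact, but the decision code itself is outside this excerpt. The sketch below is one plausible reading of those comments, using the default values from the diff. The `Thresholds` class, `decide`, `confidence` and the exact way conditions are combined are assumptions for illustration, not the `DuplicateChecker` implementation. The chorus-related thresholds are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the ServerConfig fields in the diff.
    duplicate_jaccard: float = 0.78
    duplicate_line_coverage: float = 0.72
    high_coverage_jaccard: float = 0.78
    high_coverage_line_coverage: float = 0.90
    review_jaccard: float = 0.45
    review_line_coverage: float = 0.35
    review_query_coverage: float = 0.40
    confidence_jaccard_weight: float = 0.58
    confidence_line_coverage_weight: float = 0.42


def decide(jaccard: float, line_coverage: float, query_coverage: float,
           t: Thresholds = Thresholds()) -> str:
    """Assumed rule order: duplicate paths first, then review, else new."""
    if jaccard >= t.duplicate_jaccard and line_coverage >= t.duplicate_line_coverage:
        return "duplicate"
    # Alternate path: very high line coverage with its own Jaccard floor.
    if jaccard >= t.high_coverage_jaccard and line_coverage >= t.high_coverage_line_coverage:
        return "duplicate"
    if jaccard >= t.review_jaccard:
        return "review"
    # Line coverage alone triggers review only if enough query lines matched.
    if line_coverage >= t.review_line_coverage and query_coverage >= t.review_query_coverage:
        return "review"
    return "new"


def confidence(jaccard: float, line_coverage: float, t: Thresholds = Thresholds()) -> float:
    """Weighted score for reporting; it does not feed back into decide()."""
    score = t.confidence_jaccard_weight * jaccard + t.confidence_line_coverage_weight * line_coverage
    return round(score, 4)
```

Under this reading and the shipped defaults, the alternate path never fires on its own: its Jaccard floor equals the primary one and its coverage floor is higher. It only matters once `LYRIC_DEDUP_DUPLICATE_HIGH_COVERAGE_JACCARD_THRESHOLD` is lowered, which is the "backdoor" the config comment warns about.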
| ... | @@ -189,10 +189,25 @@ class DedupService: | ... | @@ -189,10 +189,25 @@ class DedupService: |
| 189 | candidates: list[LyricRecord], | 189 | candidates: list[LyricRecord], |
| 190 | ) -> CheckResult: | 190 | ) -> CheckResult: |
| 191 | """Run DuplicateChecker against recalled candidates.""" | 191 | """Run DuplicateChecker against recalled candidates.""" |
| 192 | checker = DuplicateChecker() | 192 | checker = DuplicateChecker( |
| 193 | for candidate in candidates: | 193 | duplicate_jaccard_threshold=self.config.duplicate_jaccard_threshold, |
| 194 | checker.add_record(candidate) | 194 | duplicate_line_coverage_threshold=self.config.duplicate_line_coverage_threshold, |
| 195 | result = checker.check_record(record, max_candidates=self.config.max_candidates) | 195 | duplicate_high_coverage_jaccard_threshold=self.config.duplicate_high_coverage_jaccard_threshold, |
| 196 | duplicate_high_coverage_line_coverage_threshold=self.config.duplicate_high_coverage_line_coverage_threshold, | ||
| 197 | review_jaccard_threshold=self.config.review_jaccard_threshold, | ||
| 198 | review_line_coverage_threshold=self.config.review_line_coverage_threshold, | ||
| 199 | review_query_coverage_threshold=self.config.review_query_coverage_threshold, | ||
| 200 | chorus_short_line_count_threshold=self.config.chorus_short_line_count_threshold, | ||
| 201 | chorus_material_overlap_threshold=self.config.chorus_material_overlap_threshold, | ||
| 202 | chorus_material_query_coverage_threshold=self.config.chorus_material_query_coverage_threshold, | ||
| 203 | confidence_jaccard_weight=self.config.confidence_jaccard_weight, | ||
| 204 | confidence_line_coverage_weight=self.config.confidence_line_coverage_weight, | ||
| 205 | ) | ||
| 206 | result = checker.check_record_against_candidates( | ||
| 207 | record, | ||
| 208 | candidates, | ||
| 209 | max_candidates=self.config.max_candidates, | ||
| 210 | ) | ||
| 196 | return CheckResult( | 211 | return CheckResult( |
| 197 | duplicate=result.decision in (DuplicateDecision.DUPLICATE, DuplicateDecision.REVIEW), | 212 | duplicate=result.decision in (DuplicateDecision.DUPLICATE, DuplicateDecision.REVIEW), |
| 198 | decision=result.decision.value, | 213 | decision=result.decision.value, | ... | ... |
| ... | @@ -3,6 +3,7 @@ pytest>=8.0 | ... | @@ -3,6 +3,7 @@ pytest>=8.0 |
| 3 | 3 | ||
| 4 | # PostgreSQL storage prototype | 4 | # PostgreSQL storage prototype |
| 5 | psycopg[binary]>=3.2 | 5 | psycopg[binary]>=3.2 |
| 6 | OpenCC>=1.3.1 | ||
| 6 | 7 | ||
| 7 | # Existing MySQL/COS lyric download utilities | 8 | # Existing MySQL/COS lyric download utilities |
| 8 | pymysql>=1.1 | 9 | pymysql>=1.1 | ... | ... |
| ... | @@ -249,9 +249,7 @@ def _check_against_candidates( | ... | @@ -249,9 +249,7 @@ def _check_against_candidates( |
| 249 | max_candidates: int, | 249 | max_candidates: int, |
| 250 | ): | 250 | ): |
| 251 | checker = DuplicateChecker() | 251 | checker = DuplicateChecker() |
| 252 | for candidate in candidates: | 252 | return checker.check_record_against_candidates(record, candidates, max_candidates=max_candidates) |
| 253 | checker.add_record(candidate) | ||
| 254 | return checker.check_record(record, max_candidates=max_candidates) | ||
| 255 | 253 | ||
| 256 | 254 | ||
| 257 | def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[LyricRecord, str]: | 255 | def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[LyricRecord, str]: | ... | ... |
scripts/process_library.py
deleted
100644 → 0
| 1 | """Process newly added lyric library files. | ||
| 2 | |||
| 3 | This script is intended for the recurring workflow after adding files to | ||
| 4 | ``data/library``: | ||
| 5 | |||
| 6 | 1. Move pure-music placeholder lyric files out of the active library. | ||
| 7 | 2. Move duplicate lyric files out of the active library. | ||
| 8 | 3. Rebuild the duplicate-checking index from retained files. | ||
| 9 | 4. Optionally regenerate and evaluate a production-style eval set. | ||
| 10 | """ | ||
| 11 | |||
| 12 | from __future__ import annotations | ||
| 13 | |||
| 14 | import argparse | ||
| 15 | import csv | ||
| 16 | import json | ||
| 17 | import shutil | ||
| 18 | import sys | ||
| 19 | from dataclasses import dataclass | ||
| 20 | from datetime import datetime | ||
| 21 | from pathlib import Path | ||
| 22 | |||
| 23 | PROJECT_ROOT = Path(__file__).resolve().parents[1] | ||
| 24 | if str(PROJECT_ROOT) not in sys.path: | ||
| 25 | sys.path.insert(0, str(PROJECT_ROOT)) | ||
| 26 | |||
| 27 | from lyric_dedup.checker import DuplicateChecker | ||
| 28 | from lyric_dedup.checker import DuplicateDecision | ||
| 29 | from lyric_dedup.checker import LyricRecord | ||
| 30 | from lyric_dedup.cli import evaluate_csv | ||
| 31 | from lyric_dedup.eval_dataset import generate_eval_set | ||
| 32 | from lyric_dedup.file_import import iter_lyric_files | ||
| 33 | from lyric_dedup.file_import import read_lyric_file | ||
| 34 | from lyric_dedup.file_import import record_from_file | ||
| 35 | from lyric_dedup.normalization import NormalizedLyrics | ||
| 36 | from lyric_dedup.normalization import normalize_lyrics | ||
| 37 | |||
| 38 | |||
| 39 | PLACEHOLDER_MARKERS = ( | ||
| 40 | "【曲库专用】", | ||
| 41 | "此歌曲为没有填词的纯音乐", | ||
| 42 | ) | ||
| 43 | |||
| 44 | |||
| 45 | @dataclass(frozen=True) | ||
| 46 | class LibraryProfile: | ||
| 47 | path: Path | ||
| 48 | record: LyricRecord | ||
| 49 | normalized: NormalizedLyrics | ||
| 50 | line_count: int | ||
| 51 | char_count: int | ||
| 52 | |||
| 53 | |||
| 54 | def main() -> None: | ||
| 55 | parser = argparse.ArgumentParser(description="Process lyric library additions.") | ||
| 56 | parser.add_argument("--library-dir", default="data/library") | ||
| 57 | parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl") | ||
| 58 | parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders") | ||
| 59 | parser.add_argument("--duplicate-quarantine-dir", default="data/quarantine/duplicates") | ||
| 60 | parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.") | ||
| 61 | parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.") | ||
| 62 | parser.add_argument("--delete-duplicates", action="store_true", help="Delete duplicate lyric files instead of moving them.") | ||
| 63 | parser.add_argument("--skip-library-dedup", action="store_true", help="Skip internal duplicate cleanup before rebuilding the index.") | ||
| 64 | parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.") | ||
| 65 | parser.add_argument("--positive-ratio", type=float, default=0.2) | ||
| 66 | parser.add_argument("--eval-dir", default="data/generated_eval/incoming") | ||
| 67 | parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv") | ||
| 68 | parser.add_argument("--eval-out", default="outputs/results/library_eval.csv") | ||
| 69 | parser.add_argument("--report", default="outputs/results/library_process_report.json") | ||
| 70 | args = parser.parse_args() | ||
| 71 | |||
| 72 | library_dir = Path(args.library_dir) | ||
| 73 | quarantine_dir = Path(args.quarantine_dir) | ||
| 74 | duplicate_quarantine_dir = Path(args.duplicate_quarantine_dir) | ||
| 75 | report_path = Path(args.report) | ||
| 76 | |||
| 77 | files_before = iter_lyric_files(library_dir) | ||
| 78 | placeholders = _find_placeholder_files(library_dir) | ||
| 79 | duplicate_report_path = report_path.with_suffix(".duplicates.csv") | ||
| 80 | |||
| 81 | moved_or_deleted: list[str] = [] | ||
| 82 | duplicate_actions: list[str] = [] | ||
| 83 | duplicate_rows: list[dict[str, object]] = [] | ||
| 84 | short_effective: dict[str, int] | ||
| 85 | retained_count = 0 | ||
| 86 | if not args.dry_run: | ||
| 87 | moved_or_deleted = _handle_placeholders( | ||
| 88 | placeholders, | ||
| 89 | library_dir=library_dir, | ||
| 90 | quarantine_dir=quarantine_dir, | ||
| 91 | delete=args.delete_placeholders, | ||
| 92 | ) | ||
| 93 | if args.skip_library_dedup: | ||
| 94 | profiles = _profile_library(library_dir) | ||
| 95 | short_effective = _effective_line_report_from_profiles(profiles) | ||
| 96 | retained_count = _build_index_from_profiles(profiles, Path(args.index)) | ||
| 97 | else: | ||
| 98 | profiles = _profile_library(library_dir) | ||
| 99 | short_effective = _effective_line_report_from_profiles(profiles) | ||
| 100 | retained_count, duplicate_rows, duplicate_actions = _deduplicate_and_build_index( | ||
| 101 | profiles, | ||
| 102 | library_dir=library_dir, | ||
| 103 | index_path=Path(args.index), | ||
| 104 | duplicate_quarantine_dir=duplicate_quarantine_dir, | ||
| 105 | delete=args.delete_duplicates, | ||
| 106 | dry_run=False, | ||
| 107 | ) | ||
| 108 | _write_duplicate_report(duplicate_rows, duplicate_report_path) | ||
| 109 | |||
| 110 | if args.eval_size > 0: | ||
| 111 | eval_index_path = Path(args.eval_csv).with_suffix(".index.pkl") | ||
| 112 | generate_eval_set( | ||
| 113 | library_dir=library_dir, | ||
| 114 | output_dir=Path(args.eval_dir), | ||
| 115 | csv_path=Path(args.eval_csv), | ||
| 116 | size=args.eval_size, | ||
| 117 | positive_ratio=args.positive_ratio, | ||
| 118 | index_path=Path(args.index), | ||
| 119 | eval_index_path=eval_index_path, | ||
| 120 | ) | ||
| 121 | evaluate_csv( | ||
| 122 | eval_index_path, | ||
| 123 | Path(args.eval_csv), | ||
| 124 | Path(args.eval_out), | ||
| 125 | base_dir=Path(args.eval_csv).parent, | ||
| 126 | positive_decisions={"duplicate"}, | ||
| 127 | max_candidates=5, | ||
| 128 | ) | ||
| 129 | evaluate_csv( | ||
| 130 | eval_index_path, | ||
| 131 | Path(args.eval_csv), | ||
| 132 | Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"), | ||
| 133 | base_dir=Path(args.eval_csv).parent, | ||
| 134 | positive_decisions={"duplicate", "review"}, | ||
| 135 | max_candidates=5, | ||
| 136 | ) | ||
| 137 | else: | ||
| 138 | profiles = _profile_library(library_dir) | ||
| 139 | short_effective = _effective_line_report_from_profiles(profiles) | ||
| 140 | if not args.skip_library_dedup: | ||
| 141 | retained_count, duplicate_rows, duplicate_actions = _deduplicate_and_build_index( | ||
| 142 | profiles, | ||
| 143 | library_dir=library_dir, | ||
| 144 | index_path=Path(args.index), | ||
| 145 | duplicate_quarantine_dir=duplicate_quarantine_dir, | ||
| 146 | delete=args.delete_duplicates, | ||
| 147 | dry_run=True, | ||
| 148 | ) | ||
| 149 | else: | ||
| 150 | retained_count = len(profiles) | ||
| 151 | |||
| 152 | report = { | ||
| 153 | "timestamp": datetime.now().isoformat(timespec="seconds"), | ||
| 154 | "dry_run": args.dry_run, | ||
| 155 | "library_dir": str(library_dir), | ||
| 156 | "files_before": len(files_before), | ||
| 157 | "placeholder_matches": len(placeholders), | ||
| 158 | "placeholder_files": [str(path) for path in placeholders], | ||
| 159 | "handled_placeholder_files": moved_or_deleted, | ||
| 160 | "library_dedup_skipped": args.skip_library_dedup, | ||
| 161 | "duplicate_matches": len(duplicate_rows), | ||
| 162 | "duplicate_report": str(duplicate_report_path) if duplicate_rows else "", | ||
| 163 | "handled_duplicate_files": duplicate_actions[:1000], | ||
| 164 | "handled_duplicate_files_truncated": len(duplicate_actions) > 1000, | ||
| 165 | "retained_index_records": retained_count, | ||
| 166 | "files_after": len(iter_lyric_files(library_dir)), | ||
| 167 | "index": str(args.index), | ||
| 168 | "eval_size": args.eval_size, | ||
| 169 | "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "", | ||
| 170 | "eval_out": str(args.eval_out) if args.eval_size > 0 else "", | ||
| 171 | "eval_index": str(Path(args.eval_csv).with_suffix(".index.pkl")) if args.eval_size > 0 else "", | ||
| 172 | "short_effective_line_counts": short_effective, | ||
| 173 | } | ||
| 174 | |||
| 175 | print(json.dumps(report, ensure_ascii=False, indent=2)) | ||
| 176 | if not args.dry_run: | ||
| 177 | report_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 178 | report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") | ||
| 179 | |||
| 180 | |||
| 181 | def _find_placeholder_files(library_dir: Path) -> list[Path]: | ||
| 182 | matches: list[Path] = [] | ||
| 183 | for path in iter_lyric_files(library_dir): | ||
| 184 | text = read_lyric_file(path) | ||
| 185 | if any(marker in text for marker in PLACEHOLDER_MARKERS): | ||
| 186 | matches.append(path) | ||
| 187 | return matches | ||
| 188 | |||
| 189 | |||
| 190 | def _handle_placeholders( | ||
| 191 | placeholders: list[Path], | ||
| 192 | *, | ||
| 193 | library_dir: Path, | ||
| 194 | quarantine_dir: Path, | ||
| 195 | delete: bool, | ||
| 196 | ) -> list[str]: | ||
| 197 | handled: list[str] = [] | ||
| 198 | if not placeholders: | ||
| 199 | return handled | ||
| 200 | if not delete: | ||
| 201 | quarantine_dir.mkdir(parents=True, exist_ok=True) | ||
| 202 | for path in placeholders: | ||
| 203 | if delete: | ||
| 204 | path.unlink() | ||
| 205 | handled.append(f"deleted:{path}") | ||
| 206 | continue | ||
| 207 | relative = path.resolve().relative_to(library_dir.resolve()) | ||
| 208 | destination = quarantine_dir / relative | ||
| 209 | destination.parent.mkdir(parents=True, exist_ok=True) | ||
| 210 | if destination.exists(): | ||
| 211 | destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}") | ||
| 212 | shutil.move(str(path), str(destination)) | ||
| 213 | handled.append(f"moved:{path}->{destination}") | ||
| 214 | return handled | ||
| 215 | |||
| 216 | |||
| 217 | def _profile_library(library_dir: Path) -> list[LibraryProfile]: | ||
| 218 | profiles: list[LibraryProfile] = [] | ||
| 219 | files = iter_lyric_files(library_dir) | ||
| 220 | _progress(f"profile active library: 0/{len(files)}") | ||
| 221 | for index, path in enumerate(files, start=1): | ||
| 222 | record = record_from_file(path, base_dir=library_dir) | ||
| 223 | normalized = normalize_lyrics(record.lyrics) | ||
| 224 | lines = normalized.primary_lines or normalized.unique_lines | ||
| 225 | normalized_text = normalized.normalized_full_text | ||
| 226 | profiles.append( | ||
| 227 | LibraryProfile( | ||
| 228 | path=path, | ||
| 229 | record=record, | ||
| 230 | normalized=normalized, | ||
| 231 | line_count=len(lines), | ||
| 232 | char_count=len(normalized_text), | ||
| 233 | ) | ||
| 234 | ) | ||
| 235 | _progress_count("profile active library", index, len(files), step=5000) | ||
| 236 | return profiles | ||
| 237 | |||
| 238 | |||
| 239 | def _build_index_from_profiles(profiles: list[LibraryProfile], index_path: Path) -> int: | ||
| 240 | checker = DuplicateChecker() | ||
| 241 | for index, profile in enumerate(profiles, start=1): | ||
| 242 | checker.add_normalized_record(profile.record, profile.normalized) | ||
| 243 | _progress_count("build index", index, len(profiles), step=5000) | ||
| 244 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 245 | checker.save(index_path) | ||
| 246 | return checker.record_count | ||
| 247 | |||
| 248 | |||
| 249 | def _deduplicate_and_build_index( | ||
| 250 | profiles: list[LibraryProfile], | ||
| 251 | *, | ||
| 252 | library_dir: Path, | ||
| 253 | index_path: Path, | ||
| 254 | duplicate_quarantine_dir: Path, | ||
| 255 | delete: bool, | ||
| 256 | dry_run: bool, | ||
| 257 | ) -> tuple[int, list[dict[str, object]], list[str]]: | ||
| 258 | checker = DuplicateChecker() | ||
| 259 | duplicate_rows: list[dict[str, object]] = [] | ||
| 260 | duplicate_actions: list[str] = [] | ||
| 261 | ordered = sorted(profiles, key=_profile_quality_key) | ||
| 262 | _progress(f"deduplicate active library: 0/{len(ordered)}") | ||
| 263 | for index, profile in enumerate(ordered, start=1): | ||
| 264 | result = checker.check_record(profile.record, max_candidates=1) | ||
| 265 | best = result.candidates[0] if result.candidates else None | ||
| 266 | if result.decision == DuplicateDecision.DUPLICATE and best is not None: | ||
| 267 | duplicate_rows.append( | ||
| 268 | { | ||
| 269 | "duplicate_path": str(profile.path), | ||
| 270 | "duplicate_record_id": profile.record.record_id, | ||
| 271 | "kept_record_id": best.record_id, | ||
| 272 | "decision": result.decision.value, | ||
| 273 | "confidence": result.confidence, | ||
| 274 | "reason": result.reason, | ||
| 275 | "best_candidate_jaccard": best.jaccard, | ||
| 276 | "best_candidate_line_coverage": best.line_coverage, | ||
| 277 | "best_candidate_primary_jaccard": best.primary_jaccard, | ||
| 278 | "best_candidate_primary_line_coverage": best.primary_line_coverage, | ||
| 279 | "matched_unique_lines": " | ".join(best.matched_unique_lines), | ||
| 280 | "line_count": profile.line_count, | ||
| 281 | "char_count": profile.char_count, | ||
| 282 | } | ||
| 283 | ) | ||
| 284 | if not dry_run: | ||
| 285 | duplicate_actions.append( | ||
| 286 | _handle_duplicate_file( | ||
| 287 | profile.path, | ||
| 288 | library_dir=library_dir, | ||
| 289 | duplicate_quarantine_dir=duplicate_quarantine_dir, | ||
| 290 | delete=delete, | ||
| 291 | ) | ||
| 292 | ) | ||
| 293 | else: | ||
| 294 | checker.add_normalized_record(profile.record, profile.normalized) | ||
| 295 | _progress_count("deduplicate active library", index, len(ordered), step=5000) | ||
| 296 | |||
| 297 | if not dry_run: | ||
| 298 | index_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 299 | checker.save(index_path) | ||
| 300 | return checker.record_count, duplicate_rows, duplicate_actions | ||
| 301 | |||
| 302 | |||
| 303 | def _handle_duplicate_file( | ||
| 304 | path: Path, | ||
| 305 | *, | ||
| 306 | library_dir: Path, | ||
| 307 | duplicate_quarantine_dir: Path, | ||
| 308 | delete: bool, | ||
| 309 | ) -> str: | ||
| 310 | if delete: | ||
| 311 | path.unlink() | ||
| 312 | return f"deleted:{path}" | ||
| 313 | duplicate_quarantine_dir.mkdir(parents=True, exist_ok=True) | ||
| 314 | relative = path.resolve().relative_to(library_dir.resolve()) | ||
| 315 | destination = duplicate_quarantine_dir / relative | ||
| 316 | destination.parent.mkdir(parents=True, exist_ok=True) | ||
| 317 | if destination.exists(): | ||
| 318 | destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}") | ||
| 319 | shutil.move(str(path), str(destination)) | ||
| 320 | return f"moved:{path}->{destination}" | ||
| 321 | |||
| 322 | |||
| 323 | def _profile_quality_key(profile: LibraryProfile) -> tuple[int, int, int, str]: | ||
| 324 | # Sort ascending; negative values make higher-quality records come first. | ||
| 325 | filename_quality = 0 if not profile.path.name.startswith("None_") else 1 | ||
| 326 | return (filename_quality, -profile.line_count, -profile.char_count, str(profile.path)) | ||
| 327 | |||
| 328 | |||
| 329 | def _write_duplicate_report(rows: list[dict[str, object]], report_path: Path) -> None: | ||
| 330 | if not rows: | ||
| 331 | return | ||
| 332 | report_path.parent.mkdir(parents=True, exist_ok=True) | ||
| 333 | with report_path.open("w", encoding="utf-8", newline="") as file: | ||
| 334 | writer = csv.DictWriter(file, fieldnames=list(rows[0].keys())) | ||
| 335 | writer.writeheader() | ||
| 336 | writer.writerows(rows) | ||
| 337 | |||
| 338 | |||
| 339 | def _effective_line_report(library_dir: Path) -> dict[str, int]: | ||
| 340 | return _effective_line_report_from_profiles(_profile_library(library_dir)) | ||
| 341 | |||
| 342 | |||
| 343 | def _effective_line_report_from_profiles(profiles: list[LibraryProfile]) -> dict[str, int]: | ||
| 344 | buckets = { | ||
| 345 | "total": 0, | ||
| 346 | "zero_effective_lines": 0, | ||
| 347 | "one_to_three_effective_lines": 0, | ||
| 348 | "four_to_five_effective_lines": 0, | ||
| 349 | "six_plus_effective_lines": 0, | ||
| 350 | } | ||
| 351 | for profile in profiles: | ||
| 352 | buckets["total"] += 1 | ||
| 353 | line_count = profile.line_count | ||
| 354 | if line_count == 0: | ||
| 355 | buckets["zero_effective_lines"] += 1 | ||
| 356 | elif line_count <= 3: | ||
| 357 | buckets["one_to_three_effective_lines"] += 1 | ||
| 358 | elif line_count <= 5: | ||
| 359 | buckets["four_to_five_effective_lines"] += 1 | ||
| 360 | else: | ||
| 361 | buckets["six_plus_effective_lines"] += 1 | ||
| 362 | return buckets | ||
| 363 | |||
| 364 | |||
| 365 | def _progress(message: str) -> None: | ||
| 366 | print(f"[process-library] {message}", file=sys.stderr, flush=True) | ||
| 367 | |||
| 368 | |||
| 369 | def _progress_count(label: str, current: int, total: int, *, step: int = 1000) -> None: | ||
| 370 | if total <= 0: | ||
| 371 | return | ||
| 372 | if current == 1 or current == total or current % step == 0: | ||
| 373 | _progress(f"{label}: {current}/{total}") | ||
| 374 | |||
| 375 | |||
| 376 | if __name__ == "__main__": | ||
| 377 | main() |
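As an illustration of the ordering used by `_profile_quality_key` above, the following standalone sketch sorts a few records so that the preferred copy is visited first. `Profile` is a hypothetical stand-in for `LibraryProfile`, reduced to the fields the key reads; it is an assumption for demonstration, not the project's class.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for LibraryProfile (only the fields the key uses)."""

    path: Path
    line_count: int
    char_count: int


def quality_key(profile: Profile) -> tuple[int, int, int, str]:
    # Sorting is ascending, so files whose name starts with "None_" sort last;
    # within each group, more lines and then more characters sort first.
    # The path string is a deterministic tie-breaker.
    filename_quality = 1 if profile.path.name.startswith("None_") else 0
    return (filename_quality, -profile.line_count, -profile.char_count, str(profile.path))


profiles = [
    Profile(Path("None_b.lrc"), 40, 900),
    Profile(Path("a.lrc"), 12, 300),
    Profile(Path("c.lrc"), 30, 700),
]
ordered_names = [p.path.name for p in sorted(profiles, key=quality_key)]
print(ordered_names)
```

Because the deduplication loop adds only non-duplicate records to the checker, visiting records in this order means the later, lower-ranked copies are the ones reported as duplicates.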
test_api/dedup_samples/README.md
0 → 100644
| 1 | # Lyric Dedup Sample Set | ||
| 2 | |||
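| 3 | Baseline lyrics: `test_api/test_lyric.txt` | ||
| 4 | |||
| 5 | These samples check two kinds of behavior in the current dedup system: | ||
| 6 | |||
| 7 | - `positive_*`: should be judged a duplicate or near-duplicate of the baseline lyrics. | ||
| 8 | - `negative_*`: should not be judged a duplicate; used to check for false positives when the theme, keywords, or style are similar. | ||
| 9 | |||
| 10 | ## Sample descriptions | ||
| 11 | |||
| 12 | | File | Expected | What it tests | | ||
| 13 | | --- | --- | --- | | ||
| 14 | | `positive_01_format_spacing_punctuation_duplicate.txt` | Dedup hit | Same-text variant with the title/separator lines removed, blank lines changed, and punctuation weakened | | ||
| 15 | | `positive_02_minor_wording_typos_duplicate.txt` | Dedup hit | Near-duplicate with a few typos, synonyms, and minor word-order changes | | ||
| 16 | | `positive_03_section_order_shift_duplicate.txt` | Dedup hit | Section order changed, but the core text largely overlaps | | ||
| 17 | | `positive_04_partial_core_chorus_duplicate.txt` | Dedup hit | Partial-duplicate detection when only the core chorus/climax fragment is submitted | | ||
| 18 | | `negative_01_same_theme_new_lyrics_not_duplicate.txt` | Should not hit | Same early-morning, Chang'an, snow, and dream-chasing themes, but every line is original | | ||
| 19 | | `negative_02_same_keywords_different_scene_not_duplicate.txt` | Should not hit | Reuses high-frequency keywords, but the narrative scene and syntax clearly differ | | ||
| 20 | | `negative_03_style_similar_low_overlap_not_duplicate.txt` | Should not hit | Similar Chinese-traditional + Rap + urban fusion style, but low text overlap | | ||
| 21 | | `negative_04_common_hook_phrases_not_duplicate.txt` | Should not hit | Contains only common phrases/imagery; guards against false positives from shared expressions in short texts | | ||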
| 22 |
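The naming convention in this README can drive a small test harness. The sketch below only derives the expected outcome from each sample's file name; `expected_outcome` is a hypothetical helper that does not exist in the repository, and how the samples are fed to the checker is deliberately left out.

```python
from pathlib import Path


def expected_outcome(sample_path: Path) -> bool:
    """Return True if the sample should be flagged as a duplicate.

    Hypothetical helper: it relies only on the positive_*/negative_*
    prefixes described in the README.
    """
    name = sample_path.name
    if name.startswith("positive_"):
        return True
    if name.startswith("negative_"):
        return False
    raise ValueError(f"unrecognized sample name: {name}")


samples = [
    Path("positive_01_format_spacing_punctuation_duplicate.txt"),
    Path("negative_04_common_hook_phrases_not_duplicate.txt"),
]
expectations = {p.name: expected_outcome(p) for p in samples}
```

Raising on an unrecognized prefix, rather than defaulting to "not a duplicate", keeps a misnamed sample from silently passing.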
| ... | @@ -4,7 +4,6 @@ import json | ... | @@ -4,7 +4,6 @@ import json |
| 4 | from lyric_dedup import DuplicateChecker | 4 | from lyric_dedup import DuplicateChecker |
| 5 | from lyric_dedup import DuplicateDecision | 5 | from lyric_dedup import DuplicateDecision |
| 6 | from lyric_dedup import LyricRecord | 6 | from lyric_dedup import LyricRecord |
| 7 | from lyric_dedup.cli import evaluate_csv | ||
| 8 | from lyric_dedup.eval_dataset import generate_eval_set | 7 | from lyric_dedup.eval_dataset import generate_eval_set |
| 9 | from lyric_dedup.file_import import record_from_file | 8 | from lyric_dedup.file_import import record_from_file |
| 10 | from lyric_dedup.normalization import normalize_lyrics | 9 | from lyric_dedup.normalization import normalize_lyrics |
| ... | @@ -22,6 +21,14 @@ BASE_LYRIC = """ | ... | @@ -22,6 +21,14 @@ BASE_LYRIC = """ |
| 22 | """ | 21 | """ |
| 23 | 22 | ||
| 24 | 23 | ||
| 24 | def check_against(candidates: list[LyricRecord], lyrics: str, *, max_candidates: int = 10): | ||
| 25 | return DuplicateChecker().check_record_against_candidates( | ||
| 26 | LyricRecord("__query__", lyrics), | ||
| 27 | candidates, | ||
| 28 | max_candidates=max_candidates, | ||
| 29 | ) | ||
| 30 | |||
| 31 | |||
| 25 | def test_normalization_removes_lyric_noise_and_simplifies() -> None: | 32 | def test_normalization_removes_lyric_noise_and_simplifies() -> None: |
| 26 | normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n") | 33 | normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n") |
| 27 | 34 | ||
| ... | @@ -31,10 +38,8 @@ def test_normalization_removes_lyric_noise_and_simplifies() -> None: | ... | @@ -31,10 +38,8 @@ def test_normalization_removes_lyric_noise_and_simplifies() -> None: |
| 31 | 38 | ||
| 32 | 39 | ||
| 33 | def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None: | 40 | def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None: |
| 34 | checker = DuplicateChecker() | 41 | result = check_against( |
| 35 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | 42 | [LyricRecord("song-1", BASE_LYRIC)], |
| 36 | |||
| 37 | result = checker.check( | ||
| 38 | """ | 43 | """ |
| 39 | 我愛你,在每個夜裡!!! | 44 | 我愛你,在每個夜裡!!! |
| 40 | 聽風說話,也聽見你 | 45 | 聽風說話,也聽見你 |
| ... | @@ -51,21 +56,19 @@ def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_c | ... | @@ -51,21 +56,19 @@ def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_c |
| 51 | 56 | ||
| 52 | 57 | ||
| 53 | def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: | 58 | def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: |
| 54 | checker = DuplicateChecker() | 59 | result = check_against( |
| 55 | checker.add_record( | 60 | [ |
| 56 | LyricRecord( | 61 | LyricRecord( |
| 57 | "song-1", | 62 | "song-1", |
| 58 | """ | 63 | """ |
| 59 | 海边的风吹过旧信 | 64 | 海边的风吹过旧信 |
| 60 | 你说夏天不会远去 | 65 | 你说夏天不会远去 |
| 61 | 啦啦啦 我们不分离 | 66 | 啦啦啦 我们不分离 |
| 62 | 啦啦啦 我们不分离 | 67 | 啦啦啦 我们不分离 |
| 63 | 转身以后各自旅行 | 68 | 转身以后各自旅行 |
| 64 | """, | 69 | """, |
| 65 | ) | 70 | ) |
| 66 | ) | 71 | ], |
| 67 | |||
| 68 | result = checker.check( | ||
| 69 | """ | 72 | """ |
| 70 | 山谷的雨落在清晨 | 73 | 山谷的雨落在清晨 |
| 71 | 我把名字交给星辰 | 74 | 我把名字交给星辰 |
| ... | @@ -79,11 +82,9 @@ def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: | ... | @@ -79,11 +82,9 @@ def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: |
| 79 | assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重" | 82 | assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重" |
| 80 | 83 | ||
| 81 | 84 | ||
| 82 | def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: | 85 | def test_substantial_line_overlap_is_duplicate_after_pg_recall() -> None: |
| 83 | checker = DuplicateChecker() | 86 | result = check_against( |
| 84 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | 87 | [LyricRecord("song-1", BASE_LYRIC)], |
| 85 | |||
| 86 | result = checker.check( | ||
| 87 | """ | 88 | """ |
| 88 | 我爱你在每个夜里 | 89 | 我爱你在每个夜里 |
| 89 | 听风说话也听见你 | 90 | 听风说话也听见你 |
| ... | @@ -100,10 +101,8 @@ def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: | ... | @@ -100,10 +101,8 @@ def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: |
| 100 | 101 | ||
| 101 | 102 | ||
| 102 | def test_fragment_of_full_song_is_not_duplicate() -> None: | 103 | def test_fragment_of_full_song_is_not_duplicate() -> None: |
| 103 | checker = DuplicateChecker() | 104 | result = check_against( |
| 104 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | 105 | [LyricRecord("song-1", BASE_LYRIC)], |
| 105 | |||
| 106 | result = checker.check( | ||
| 107 | """ | 106 | """ |
| 108 | 听风说话也听见你 | 107 | 听风说话也听见你 |
| 109 | 城市的灯慢慢亮起 | 108 | 城市的灯慢慢亮起 |
| ... | @@ -116,45 +115,39 @@ def test_fragment_of_full_song_is_not_duplicate() -> None: | ... | @@ -116,45 +115,39 @@ def test_fragment_of_full_song_is_not_duplicate() -> None: |
| 116 | 115 | ||
| 117 | 116 | ||
| 118 | def test_catalog_mashup_fragments_are_new_not_review() -> None: | 117 | def test_catalog_mashup_fragments_are_new_not_review() -> None: |
| 119 | checker = DuplicateChecker() | 118 | result = check_against( |
| 120 | checker.add_record( | 119 | [ |
| 121 | LyricRecord( | 120 | LyricRecord( |
| 122 | "song-1", | 121 | "song-1", |
| 123 | """ | 122 | """ |
| 124 | 第一首歌的清晨 | 123 | 第一首歌的清晨 |
| 125 | 第一首歌的街口 | 124 | 第一首歌的街口 |
| 126 | 每天都在伪装幸福快乐 | 125 | 每天都在伪装幸福快乐 |
| 127 | 还要瞒着所有人不说 | 126 | 还要瞒着所有人不说 |
| 128 | 第一首歌的结尾 | 127 | 第一首歌的结尾 |
| 129 | """, | 128 | """, |
| 130 | ) | 129 | ), |
| 131 | ) | 130 | LyricRecord( |
| 132 | checker.add_record( | 131 | "song-2", |
| 133 | LyricRecord( | 132 | """ |
| 134 | "song-2", | 133 | 第二首歌的海边 |
| 135 | """ | 134 | 第二首歌的远方 |
| 136 | 第二首歌的海边 | 135 | 想起那年夏天 |
| 137 | 第二首歌的远方 | 136 | 我们走过人群 |
| 138 | 想起那年夏天 | 137 | 第二首歌的结尾 |
| 139 | 我们走过人群 | 138 | """, |
| 140 | 第二首歌的结尾 | 139 | ), |
| 141 | """, | 140 | LyricRecord( |
| 142 | ) | 141 | "song-3", |
| 143 | ) | 142 | """ |
| 144 | checker.add_record( | 143 | 第三首歌的月光 |
| 145 | LyricRecord( | 144 | 第三首歌的旧梦 |
| 146 | "song-3", | 145 | 风吹过了窗前 |
| 147 | """ | 146 | 你没有再回来 |
| 148 | 第三首歌的月光 | 147 | 第三首歌的结尾 |
| 149 | 第三首歌的旧梦 | 148 | """, |
| 150 | 风吹过了窗前 | 149 | ), |
| 151 | 你没有再回来 | 150 | ], |
| 152 | 第三首歌的结尾 | ||
| 153 | """, | ||
| 154 | ) | ||
| 155 | ) | ||
| 156 | |||
| 157 | result = checker.check( | ||
| 158 | """ | 151 | """ |
| 159 | 每天都在伪装幸福快乐 | 152 | 每天都在伪装幸福快乐 |
| 160 | 还要瞒着所有人不说 | 153 | 还要瞒着所有人不说 |
| ... | @@ -169,28 +162,26 @@ def test_catalog_mashup_fragments_are_new_not_review() -> None: | ... | @@ -169,28 +162,26 @@ def test_catalog_mashup_fragments_are_new_not_review() -> None: |
| 169 | 162 | ||
| 170 | 163 | ||
| 171 | def test_large_mashup_with_one_recognizable_song_fragment_is_new() -> None: | 164 | def test_large_mashup_with_one_recognizable_song_fragment_is_new() -> None: |
| 172 | checker = DuplicateChecker() | 165 | result = check_against( |
| 173 | checker.add_record( | 166 | [ |
| 174 | LyricRecord( | 167 | LyricRecord( |
| 175 | "song-1", | 168 | "song-1", |
| 176 | """ | 169 | """ |
| 177 | 桃花春风十里 | 170 | 桃花春风十里 |
| 178 | 花瓣飘散满地 | 171 | 花瓣飘散满地 |
| 179 | 对不起我无法忘记你 | 172 | 对不起我无法忘记你 |
| 180 | 一去遥遥无期 | 173 | 一去遥遥无期 |
| 181 | 一个人一支笔 | 174 | 一个人一支笔 |
| 182 | 多想你能留在我这里 | 175 | 多想你能留在我这里 |
| 183 | 天空下起了雨 | 176 | 天空下起了雨 |
| 184 | 淋湿我的心里 | 177 | 淋湿我的心里 |
| 185 | 久别中多少人都不是你 | 178 | 久别中多少人都不是你 |
| 186 | 屋檐下一人想起 | 179 | 屋檐下一人想起 |
| 187 | 关于你的回忆 | 180 | 关于你的回忆 |
| 188 | 无人在只剩下我自己 | 181 | 无人在只剩下我自己 |
| 189 | """, | 182 | """, |
| 190 | ) | 183 | ) |
| 191 | ) | 184 | ], |
| 192 | |||
| 193 | result = checker.check( | ||
| 194 | """ | 185 | """ |
| 195 | scroll through the pictures from a year ago | 186 | scroll through the pictures from a year ago |
| 196 | the pixels change but the feelings dont grow | 187 | the pixels change but the feelings dont grow |
| ... | @@ -238,15 +229,13 @@ def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision( | ... | @@ -238,15 +229,13 @@ def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision( |
| 238 | 混音:DJ金木 | 229 | 混音:DJ金木 |
| 239 | 【未经著作权人许可 不得翻唱 翻录或使用】 | 230 | 【未经著作权人许可 不得翻唱 翻录或使用】 |
| 240 | """ | 231 | """ |
| 241 | checker = DuplicateChecker() | 232 | same_song = DuplicateChecker().check_record_against_candidates( |
| 242 | checker.add_record(LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木")) | 233 | LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木"), |
| 243 | checker.add_record(LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木")) | 234 | [LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木")], |
| 244 | |||
| 245 | same_song = checker.check_record( | ||
| 246 | LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木") | ||
| 247 | ) | 235 | ) |
| 248 | different_title = checker.check_record( | 236 | different_title = DuplicateChecker().check_record_against_candidates( |
| 249 | LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木") | 237 | LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木"), |
| 238 | [LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木")], | ||
| 250 | ) | 239 | ) |
| 251 | 240 | ||
| 252 | assert same_song.decision == DuplicateDecision.DUPLICATE | 241 | assert same_song.decision == DuplicateDecision.DUPLICATE |
| ... | @@ -269,29 +258,27 @@ def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> No | ... | @@ -269,29 +258,27 @@ def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> No |
| 269 | [00:04.00]作曲:DJ金木... | 258 | [00:04.00]作曲:DJ金木... |
| 270 | [00:05.00]未经著作权人许可 不得翻唱 | 259 | [00:05.00]未经著作权人许可 不得翻唱 |
| 271 | """ | 260 | """ |
| 272 | checker = DuplicateChecker() | 261 | result = DuplicateChecker().check_record_against_candidates( |
| 273 | checker.add_record(LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木")) | 262 | LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木"), |
| 274 | 263 | [LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木")], | |
| 275 | result = checker.check_record(LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木")) | 264 | ) |
| 276 | 265 | ||
| 277 | assert result.decision == DuplicateDecision.DUPLICATE | 266 | assert result.decision == DuplicateDecision.DUPLICATE |
| 278 | assert result.reason == "无有效歌词,文件内容兜底特征高度相似" | 267 | assert result.reason == "无有效歌词,文件内容兜底特征高度相似" |
| 279 | 268 | ||
| 280 | 269 | ||
| 281 | def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: | 270 | def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: |
| 282 | checker = DuplicateChecker() | 271 | result = check_against( |
| 283 | checker.add_record( | 272 | [ |
| 284 | LyricRecord( | 273 | LyricRecord( |
| 285 | "song-1", | 274 | "song-1", |
| 286 | """ | 275 | """ |
| 287 | 歌词来自QQ音乐 | 276 | 歌词来自QQ音乐 |
| 288 | 北方的雪落在窗前 | 277 | 北方的雪落在窗前 |
| 289 | 我等一封迟来的信 | 278 | 我等一封迟来的信 |
| 290 | """, | 279 | """, |
| 291 | ) | 280 | ) |
| 292 | ) | 281 | ], |
| 293 | |||
| 294 | result = checker.check( | ||
| 295 | """ | 282 | """ |
| 296 | 歌词来自QQ音乐 | 283 | 歌词来自QQ音乐 |
| 297 | 南方的雨穿过街心 | 284 | 南方的雨穿过街心 |
| ... | @@ -300,24 +287,22 @@ def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: | ... | @@ -300,24 +287,22 @@ def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: |
| 300 | ) | 287 | ) |
| 301 | 288 | ||
| 302 | assert result.decision == DuplicateDecision.NEW | 289 | assert result.decision == DuplicateDecision.NEW |
| 303 | assert result.candidates == () | 290 | assert result.candidates[0].decision == DuplicateDecision.NEW |
| 304 | 291 | ||
| 305 | 292 | ||
| 306 | def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: | 293 | def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: |
| 307 | checker = DuplicateChecker() | 294 | result = check_against( |
| 308 | checker.add_record( | 295 | [ |
| 309 | LyricRecord( | 296 | LyricRecord( |
| 310 | "song-1", | 297 | "song-1", |
| 311 | """ | 298 | """ |
| 312 | say hello 在风里 | 299 | say hello 在风里 |
| 313 | hold me close tonight | 300 | hold me close tonight |
| 314 | 我们穿过蓝色街道 | 301 | 我们穿过蓝色街道 |
| 315 | never let me go | 302 | never let me go |
| 316 | """, | 303 | """, |
| 317 | ) | 304 | ) |
| 318 | ) | 305 | ], |
| 319 | |||
| 320 | result = checker.check( | ||
| 321 | """ | 306 | """ |
| 322 | say hello 在风里 | 307 | say hello 在风里 |
| 323 | hold me close tonight | 308 | hold me close tonight |
| ... | @@ -329,17 +314,14 @@ def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: | ... | @@ -329,17 +314,14 @@ def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: |
| 329 | assert result.decision == DuplicateDecision.DUPLICATE | 314 | assert result.decision == DuplicateDecision.DUPLICATE |
| 330 | 315 | ||
| 331 | 316 | ||
| 332 | def test_checker_can_persist_index(tmp_path) -> None: | 317 | def test_checker_can_rank_explicit_pg_recalled_candidates_without_in_memory_recall() -> None: |
| 333 | index_path = tmp_path / "lyrics.pkl" | 318 | result = DuplicateChecker().check_record_against_candidates( |
| 334 | checker = DuplicateChecker() | 319 | LyricRecord("__query__", BASE_LYRIC), |
| 335 | checker.add_record(LyricRecord("song-1", BASE_LYRIC)) | 320 | candidates=[], |
| 336 | checker.save(index_path) | 321 | ) |
| 337 | |||
| 338 | loaded = DuplicateChecker.load(index_path) | ||
| 339 | result = loaded.check(BASE_LYRIC) | ||
| 340 | 322 | ||
| 341 | assert loaded.record_count == 1 | 323 | assert result.decision == DuplicateDecision.NEW |
| 342 | assert result.decision == DuplicateDecision.DUPLICATE | 324 | assert result.candidates == () |
| 343 | 325 | ||
| 344 | 326 | ||
| 345 | def test_record_from_lrc_file(tmp_path) -> None: | 327 | def test_record_from_lrc_file(tmp_path) -> None: |
| ... | @@ -363,44 +345,6 @@ def test_record_from_song_artist_lyrics_filename(tmp_path) -> None: | ... | @@ -363,44 +345,6 @@ def test_record_from_song_artist_lyrics_filename(tmp_path) -> None: |
| 363 | assert record.artist == "DJ金木" | 345 | assert record.artist == "DJ金木" |
| 364 | 346 | ||
| 365 | 347 | ||
| 366 | def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: | ||
| 367 | library = tmp_path / "library" | ||
| 368 | incoming = tmp_path / "incoming" | ||
| 369 | library.mkdir() | ||
| 370 | incoming.mkdir() | ||
| 371 | (library / "歌手A - 夜里.lrc").write_text(BASE_LYRIC, encoding="utf-8") | ||
| 372 | (incoming / "dup.lrc").write_text(BASE_LYRIC.replace("我爱你", "我愛你"), encoding="utf-8") | ||
| 373 | (incoming / "new.txt").write_text("南方的雨穿过街心\n你把故事说给云听\n", encoding="utf-8") | ||
| 374 | |||
| 375 | checker = DuplicateChecker() | ||
| 376 | checker.add_record(record_from_file(library / "歌手A - 夜里.lrc", base_dir=library)) | ||
| 377 | index_path = tmp_path / "lyrics.pkl" | ||
| 378 | checker.save(index_path) | ||
| 379 | |||
| 380 | eval_csv = tmp_path / "eval.csv" | ||
| 381 | eval_csv.write_text( | ||
| 382 | "id,file,expected\n" | ||
| 383 | "case-1,incoming/dup.lrc,应去重\n" | ||
| 384 | "case-2,incoming/new.txt,不应去重\n", | ||
| 385 | encoding="utf-8", | ||
| 386 | ) | ||
| 387 | out_path = tmp_path / "eval_out.csv" | ||
| 388 | |||
| 389 | evaluate_csv( | ||
| 390 | index_path, | ||
| 391 | eval_csv, | ||
| 392 | out_path, | ||
| 393 | base_dir=tmp_path, | ||
| 394 | positive_decisions={"duplicate"}, | ||
| 395 | max_candidates=5, | ||
| 396 | ) | ||
| 397 | |||
| 398 | rows = list(csv.DictReader(out_path.open(encoding="utf-8"))) | ||
| 399 | assert [row["correct"] for row in rows] == ["True", "True"] | ||
| 400 | assert rows[0]["reason"] == "规范化后的原文歌词哈希完全一致" | ||
| 401 | assert (tmp_path / "eval_out.csv.summary.json").exists() | ||
| 402 | |||
| 403 | |||
| 404 | def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: | 348 | def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: |
| 405 | library = tmp_path / "library" | 349 | library = tmp_path / "library" |
| 406 | incoming = tmp_path / "generated" / "incoming" | 350 | incoming = tmp_path / "generated" / "incoming" |
| ... | @@ -424,7 +368,7 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: | ... | @@ -424,7 +368,7 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: |
| 424 | assert manifest["sample_size"] == 30 | 368 | assert manifest["sample_size"] == 30 |
| 425 | assert manifest["unique_source_records"] > 1 | 369 | assert manifest["unique_source_records"] > 1 |
| 426 | assert manifest["holdout_records"] > 1 | 370 | assert manifest["holdout_records"] > 1 |
| 427 | assert (tmp_path / "generated" / "eval.csv.index.pkl").exists() | 371 | assert manifest["eval_index"] == "" |
| 428 | assert "positive_full_duplicate" in manifest["plan"] | 372 | assert "positive_full_duplicate" in manifest["plan"] |
| 429 | assert "negative_real_holdout_full_song" in negative_types | 373 | assert "negative_real_holdout_full_song" in negative_types |
| 430 | assert "negative_fragment" in negative_types | 374 | assert "negative_fragment" in negative_types |
| ... | @@ -466,19 +410,17 @@ def test_generated_hard_eval_set_uses_business_realistic_edge_mix(tmp_path) -> N | ... | @@ -466,19 +410,17 @@ def test_generated_hard_eval_set_uses_business_realistic_edge_mix(tmp_path) -> N |
| 466 | 410 | ||
| 467 | 411 | ||
| 468 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | 412 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: |
| 469 | checker = DuplicateChecker() | 413 | result = check_against( |
| 470 | checker.add_record( | 414 | [ |
| 471 | LyricRecord( | 415 | LyricRecord( |
| 472 | "song-1", | 416 | "song-1", |
| 473 | """ | 417 | """ |
| 474 | I miss you tonight | 418 | I miss you tonight |
| 475 | Under the moonlight | 419 | Under the moonlight |
| 476 | Never let me go | 420 | Never let me go |
| 477 | """, | 421 | """, |
| 478 | ) | 422 | ) |
| 479 | ) | 423 | ], |
| 480 | |||
| 481 | result = checker.check( | ||
| 482 | """ | 424 | """ |
| 483 | I miss you tonight | 425 | I miss you tonight |
| 484 | 今晚我想你 | 426 | 今晚我想你 |
| ... | @@ -509,22 +451,20 @@ def test_same_timestamp_translation_split_is_high_confidence() -> None: | ... | @@ -509,22 +451,20 @@ def test_same_timestamp_translation_split_is_high_confidence() -> None: |
| 509 | 451 | ||
| 510 | 452 | ||
| 511 | def test_translation_only_overlap_is_review_not_duplicate() -> None: | 453 | def test_translation_only_overlap_is_review_not_duplicate() -> None: |
| 512 | checker = DuplicateChecker() | 454 | result = check_against( |
| 513 | checker.add_record( | 455 | [ |
| 514 | LyricRecord( | 456 | LyricRecord( |
| 515 | "song-1", | 457 | "song-1", |
| 516 | """ | 458 | """ |
| 517 | I miss you tonight | 459 | I miss you tonight |
| 518 | 今晚我想你 | 460 | 今晚我想你 |
| 519 | Under the moonlight | 461 | Under the moonlight |
| 520 | 月光之下 | 462 | 月光之下 |
| 521 | Never let me go | 463 | Never let me go |
| 522 | 永远不要让我离开 | 464 | 永远不要让我离开 |
| 523 | """, | 465 | """, |
| 524 | ) | 466 | ) |
| 525 | ) | 467 | ], |
| 526 | |||
| 527 | result = checker.check( | ||
| 528 | """ | 468 | """ |
| 529 | Te extrano esta noche | 469 | Te extrano esta noche |
| 530 | 今晚我想你 | 470 | 今晚我想你 |
| ... | @@ -541,19 +481,17 @@ def test_translation_only_overlap_is_review_not_duplicate() -> None: | ... | @@ -541,19 +481,17 @@ def test_translation_only_overlap_is_review_not_duplicate() -> None: |
| 541 | 481 | ||
| 542 | 482 | ||
| 543 | def test_block_translation_split_is_review_when_primary_matches() -> None: | 483 | def test_block_translation_split_is_review_when_primary_matches() -> None: |
| 544 | checker = DuplicateChecker() | 484 | result = check_against( |
| 545 | checker.add_record( | 485 | [ |
| 546 | LyricRecord( | 486 | LyricRecord( |
| 547 | "song-1", | 487 | "song-1", |
| 548 | """ | 488 | """ |
| 549 | I miss you tonight | 489 | I miss you tonight |
| 550 | Under the moonlight | 490 | Under the moonlight |
| 551 | Never let me go | 491 | Never let me go |
| 552 | """, | 492 | """, |
| 553 | ) | 493 | ) |
| 554 | ) | 494 | ], |
| 555 | |||
| 556 | result = checker.check( | ||
| 557 | """ | 495 | """ |
| 558 | I miss you tonight | 496 | I miss you tonight |
| 559 | Under the moonlight | 497 | Under the moonlight | ... | ... |