Commit fec2556e fec2556ea008688f2ceac576f400a5d1cc9c22d7 by 沈秋雨

Simplify the dedup pipeline, keeping only the path that uses PostgreSQL (pg) as the database

Use OpenCC for Simplified/Traditional Chinese conversion
1 parent d39197d3
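1 # Lyric Duplicate Checker 1 # Lyric Duplicate Checking System
2 2
3 The first version handles "duplicate checking for newly added lyrics": build an index from the existing `.lrc` / `.txt` lyrics, then query it with each new lyric and get back `duplicate`, `review`, or `new`. 3 This project checks lyrics for duplicates, using PostgreSQL as both the data store and the candidate recall layer. The Python side handles only lyric normalization, candidate scoring, and the final decision; it no longer builds or loads a local `.pkl` index.
4 4
5 ## Build the index 5 ## Architecture
6 6
7 Assume the existing library is in `data/library/` 7 ```text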
8 PostgreSQL:
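9 lyrics stores raw lyrics, normalized text, original/translation text, exact_hash
10 lyric_lines stores normalized lyric lines and line_hash
11 exact_hash index exact-duplicate recall
12 pg_trgm index optional approximate-text recall
13 line_hash index line-level overlap recall
14
15 Python:
16 normalize_lyrics lyric cleaning, timestamp/platform-noise handling, Traditional/Simplified conversion, translation-line splitting
17 DuplicateChecker scores and ranks only the candidates recalled by PostgreSQL
18 decision rules output duplicate / review / new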
19 ```
20
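21 Core principles:
22
23 ```text
24 The database recalls candidates.
25 Python makes the final decision.
26 pickle, local MinHash indexes, and outputs/indexes/*.pkl are no longer part of the production pipeline.
27 ```
28
29 ## Install dependencies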
8 30
9 ```bash 31 ```bash
10 python -m lyric_dedup.cli build-index \ 32 python -m pip install -r requirements.txt
11 --lyrics-dir data/library \
12 --index outputs/indexes/lyrics.pkl
13 ``` 33 ```
14 34
15 ## Check a single new lyric 35 ## Initialize PostgreSQL
36
37 Create the database:
16 38
17 ```bash 39 ```bash
18 python -m lyric_dedup.cli check-file \ 40 createdb lyric_dedup
19 --index outputs/indexes/lyrics.pkl \
20 --file data/incoming/new_song.lrc
21 ``` 41 ```
22 42
23 ## Batch-check a directory of new lyrics 43 Initialize the table schema and indexes:
24 44
25 ```bash 45 ```bash
26 python -m lyric_dedup.cli batch-check \ 46 python scripts/init_postgres.py \
27 --index outputs/indexes/lyrics.pkl \ 47 --dsn postgresql:///lyric_dedup
28 --lyrics-dir data/incoming \
29 --out outputs/results/incoming_check.csv
30 ``` 48 ```
31 49
32 Key columns in the CSV: 50 This creates:
51
52 ```text
53 lyrics
54 lyric_lines
55 pg_trgm extension
56 exact_hash / primary_text_trgm / line_hash indexes
57 ```
33 58
34 - `decision`: the overall decision. 59 ## Import the library
35 - `best_candidate_id`: the most similar existing lyric. 60
36 - `best_candidate_jaccard`: n-gram literal similarity. 61 ```bash
37 - `best_candidate_line_coverage`: line-level coverage. 62 python scripts/import_library_postgres.py \
38 - `matched_unique_lines`: the normalized lyric lines that matched. 63 --dsn postgresql:///lyric_dedup \
39 - `best_candidate_reason`: the decision reason, in Chinese, explaining why the lyric was judged duplicate, review, or new. 64 --lyrics-dir data/library
65 ```
66
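67 The import script:
68
69 ```text
70 1. Scans data/library for .lrc / .txt files.
71 2. Reads and normalizes the lyrics.
72 3. Writes to lyrics and lyric_lines.
73 4. By default, soft-deletes records whose exact_hash is identical, keeping only the higher-quality one.
74 5. Writes a duplicate report to outputs/results/postgres_exact_duplicates.csv.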
75 ```
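Step 4 of the import can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the import script's code: `Row`, its `quality` score, and `soft_delete_exact_duplicates` are hypothetical names, and this diff does not show how the script actually decides which record is "higher quality".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Row:
    record_id: str
    exact_hash: str
    quality: float  # hypothetical score; the real quality criterion is not shown in this diff
    deleted: bool = False


def soft_delete_exact_duplicates(rows: list[Row]) -> list[tuple[str, str]]:
    """Keep the best row per exact_hash and flag the rest as deleted.

    Returns (kept_id, deleted_id) pairs, roughly what a duplicate report would list.
    """
    groups: dict[str, list[Row]] = defaultdict(list)
    for row in rows:
        groups[row.exact_hash].append(row)

    report: list[tuple[str, str]] = []
    for group in groups.values():
        # Highest quality wins; ties fall back to the smaller record_id.
        group.sort(key=lambda r: (-r.quality, r.record_id))
        keeper = group[0]
        for loser in group[1:]:
            loser.deleted = True  # soft delete: the row stays, it is only flagged
            report.append((keeper.record_id, loser.record_id))
    return report
```

A soft delete keeps the losing rows in place, which is why the count query elsewhere in this change filters on `deleted_at is null`.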
76
77 To import only, without exact dedup:
78
79 ```bash
80 python scripts/import_library_postgres.py \
81 --dsn postgresql:///lyric_dedup \
82 --lyrics-dir data/library \
83 --skip-dedup-exact
84 ```
40 85
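41 Production recommendation: `duplicate` can be blocked automatically; `review` goes to the manual review pool; `new` can still be spot-checked before ingestion. 86 ## Check a single lyric file
42 87
43 ## Safeguards for original + Chinese-translation lyrics 88 ```bash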
89 python -m lyric_dedup.cli check-file \
90 --dsn postgresql:///lyric_dedup \
91 --file data/incoming/new_song.lrc
92 ```
44 93
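45 Lyrics are currently split into three kinds of lines: 94 Common parameters:
46 95
47 - `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these. 96 ```text
48 - `translation_lines`: Chinese translation lines, used only for recall and for explaining reviews. 97 --recall-limit maximum number of candidates returned by each PostgreSQL recall layer
49 - `unknown_lines`: lines that cannot be classified reliably. 98 --max-candidates how many candidates are ranked and returned in the end
99 --enable-trgm enable pg_trgm approximate-text recall
100 --trgm-threshold pg_trgm similarity threshold
101 --statement-timeout-ms PostgreSQL statement_timeout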
102 ```
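To illustrate what `--max-candidates` bounds: the checker code changed in this commit sorts the scored candidates by (is duplicate, confidence, jaccard) in descending order and then truncates the list. The following is a simplified sketch of that ordering; `Candidate` and `rank_candidates` are stand-in names for illustration, not the project's own classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    record_id: str
    decision: str  # "duplicate", "review", or "new"
    confidence: float
    jaccard: float


def rank_candidates(candidates: list[Candidate], max_candidates: int = 10) -> list[Candidate]:
    """Duplicates first, then higher confidence, then higher Jaccard similarity."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.decision == "duplicate", c.confidence, c.jaccard),
        reverse=True,
    )
    return ranked[:max_candidates]
```

Putting the decision first in the sort key means a lower-confidence `duplicate` still outranks a higher-confidence `review` candidate.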
50 103
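51 High-confidence splits include: 104 Returned fields:
52 105
53 - A foreign-language line and a Chinese line under the same timestamp. 106 ```text
54 - Several stable groups of alternating foreign-language and Chinese lines. 107 decision duplicate / review / new
108 duplicate true for duplicate or review, false for new
109 confidence confidence of the current decision
110 reason decision reason, in Chinese
111 candidate_count number of candidates that took part in the final ranking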
112 ```
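As a rough guide to how two of these fields relate to the code in this commit: for non-exact candidates the checker computes `confidence` as a weighted sum of primary-text Jaccard similarity and primary-line coverage, with default weights 0.58 and 0.42, rounded to four decimals. The sketch below restates that formula and the `duplicate` flag rule from the field list; the function names are illustrative only, and exact-hash candidates follow a separate path not covered here.

```python
def confidence_score(
    primary_jaccard: float,
    primary_coverage: float,
    jaccard_weight: float = 0.58,
    line_coverage_weight: float = 0.42,
) -> float:
    """Weighted blend of n-gram similarity and line coverage, rounded to 4 decimals."""
    return round(
        (jaccard_weight * primary_jaccard) + (line_coverage_weight * primary_coverage),
        4,
    )


def duplicate_flag(decision: str) -> bool:
    """True for duplicate or review, False for new, as described in the field list."""
    return decision in {"duplicate", "review"}
```

Note that the `duplicate` flag is broader than the `duplicate` decision: it is also true for `review`, so callers that want strict auto-blocking should look at `decision` instead.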
55 113
56 Medium-confidence splits include: 114 ## Start the API
57 115
58 - An obvious foreign-language / Chinese translation pair on a single line, for example `I miss you / 今晚我想你` 116 ```bash
117 export LYRIC_DEDUP_DSN=postgresql:///lyric_dedup
118 uvicorn lyric_dedup_server.app:app --host 0.0.0.0 --port 8000
119 ```
59 120
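60 Low-confidence splits include: 121 Endpoint:
61 122
62 - A whole block of foreign-language text followed by a whole block of Chinese translation. 123 ```text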
124 POST /api/v1/check
125 ```
63 126
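64 Decision policy: 127 Request example:
65 128
66 - If the original text is highly consistent, the result can be `duplicate` even when the new lyric adds a Chinese translation. 129 ```json
67 - If only the translation lines are similar and the original text is not similar enough, the result can only be `review`; it is never automatically judged a duplicate. 130 {
68 - A suspected whole-block translation structure counts as a low-confidence split; even if the original-text hash matches, the result is `review` first. 131 "url": "https://example.com/song.lrc",
69 - For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text. 132 "title": "Song Title",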
133 "artist": "Artist"
134 }
135 ```
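A client call to this endpoint might look like the following standard-library sketch. The path `/api/v1/check`, port 8000, and the three request fields come from this README; the helper names, the `Content-Type` header, and the assumption that the response body is JSON are illustrative guesses, since the response format is not shown here.

```python
import json
import urllib.request


def build_check_request(base_url: str, lyric_url: str, title: str, artist: str) -> urllib.request.Request:
    """Build a POST request for /api/v1/check without sending it."""
    payload = {"url": lyric_url, "title": title, "artist": artist}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/check",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed: the service accepts JSON bodies
        method="POST",
    )


def send_check(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the body, assuming the service replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server from the previous step running locally, `send_check(build_check_request("http://localhost:8000", "https://example.com/song.lrc", "Song Title", "Artist"))` would submit the request example above.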
70 136
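71 Because the index stores the split original/translation features, it must be rebuilt after the split rules change 137 The service downloads the `.lrc` / `.txt` file at the URL, recalls candidates from PostgreSQL, and makes a decision. If the result is `new` and the request includes a URL, the service writes the new lyric to PostgreSQL
72 138
73 ## Evaluate accuracy with a labeled CSV 139 ## Generate an evaluation set
74 140
75 You can first generate a batch of evaluation samples automatically from the existing library 141 Standard production profile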
76 142
77 ```bash 143 ```bash
78 python -m lyric_dedup.cli generate-eval-set \ 144 python -m lyric_dedup.cli generate-eval-set \
79 --library-dir data/library \ 145 --library-dir data/library \
80 --lyrics-dir data/generated_eval/incoming \ 146 --lyrics-dir data/generated_eval/incoming \
81 --csv data/generated_eval/eval_50000.csv \ 147 --csv data/generated_eval/eval_5000.csv \
82 --index outputs/indexes/lyrics.pkl \ 148 --size 5000 \
83 --eval-index data/generated_eval/eval_50000.csv.index.pkl \
84 --size 50000 \
85 --positive-ratio 0.3 149 --positive-ratio 0.3
86 ``` 150 ```
87 151
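88 The default `--profile standard` generates the standard production evaluation set. You can also generate a hard set that is closer to business edge cases 152 Hard business-edge profile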
89 153
90 ```bash 154 ```bash
91 python -m lyric_dedup.cli generate-eval-set \ 155 python -m lyric_dedup.cli generate-eval-set \
...@@ -93,79 +157,55 @@ python -m lyric_dedup.cli generate-eval-set \ ...@@ -93,79 +157,55 @@ python -m lyric_dedup.cli generate-eval-set \
93 --library-dir data/library \ 157 --library-dir data/library \
94 --lyrics-dir data/generated_eval/hard_incoming \ 158 --lyrics-dir data/generated_eval/hard_incoming \
95 --csv data/generated_eval/eval_hard_5000.csv \ 159 --csv data/generated_eval/eval_hard_5000.csv \
96 --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \
97 --size 5000 \ 160 --size 5000 \
98 --positive-ratio 0.3 161 --positive-ratio 0.3
99 ``` 162 ```
100 163
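101 Standard profile: 164 The generator writes only:
102
103 - The generator first scans the whole library and stratifies its sampling by valid lyric line count, language type, and file-source prefix; it no longer samples by sorted prefix.
104 - `应去重` (should dedup) samples contain only style variations of the full lyric, such as timestamps, punctuation, platform noise, blank lines, changed chorus repetition counts, an added Chinese translation, and a few typos / English misspellings.
105 - `不应去重` (should not dedup) samples are mainly real holdout full lyrics, and also include lyric fragments, repeated-chorus collisions, translation-only similarity, new lyrics on the same theme, and short-lyric / placeholder edge samples.
106 - A lyric fragment should not yield `duplicate` even if it matches part of an existing song; at most it goes to `review`
107 - The generator additionally writes `--eval-index`; this index excludes the holdout songs and should be used when evaluating the generated CSV.
108 - It also generates `*.manifest.json`, recording the seed, library size, holdout count, sample-type distribution, language/source buckets, and the number of source records covered.
109 165
110 The hard profile does not deliberately construct abnormal input; it mainly covers the edge cases most likely to be hit in production: 166 ```text
111 167 evaluation CSV
112 - `应去重`: same-song platform-version noise, fairly complete lyrics missing one section, an appended whole-block Chinese translation, realistic data-entry/OCR typos, and mixed timestamps and platform metadata. 168 sample lyric files
113 - `不应去重`: real holdout new songs, near-neighbor new songs picked preferentially from the holdout because they share lines with the library, long but incomplete single-song fragments, multi-song medley-style fragments, repeated-chorus collisions, translation-only similarity, and short-lyric edge cases. 169 manifest.json
114
115 First prepare a CSV, for example `data/eval/eval.csv`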
116
117 ```csv
118 id,file,expected
119 case-001,incoming/song_a.lrc,应去重
120 case-002,incoming/song_b.txt,不应去重
121 ```
122
123 Alternatively, skip the file path and put the lyrics directly in the `lyrics` column:
124
125 ```csv
126 id,lyrics,expected
127 case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate
128 case-004,"南方的雨穿过街心\n你把故事说给云听",new
129 ``` 170 ```
130 171
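131 `expected` accepts these values: 172 `.index.pkl` is no longer generated. During evaluation, PostgreSQL recalls the candidates, and each holdout sample itself is excluded based on the `source_record_id` in the CSV.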
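The self-exclusion idea can be sketched as a simple filter. This is an assumption-laden illustration: `exclude_self` is a hypothetical helper, and this diff does not show whether the real exclusion happens in SQL or in Python, nor that candidate ids and `source_record_id` share the same format.

```python
from __future__ import annotations


def exclude_self(candidate_ids: list[str], source_record_id: str | None) -> list[str]:
    """Drop the candidate that is the evaluation sample's own source record."""
    if not source_record_id:
        # Samples without a source record have nothing to exclude.
        return list(candidate_ids)
    return [cid for cid in candidate_ids if cid != source_record_id]
```

Without this step, a holdout song that is still present in the database would recall itself and be scored as a duplicate, which would distort the negative samples.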
132 173
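133 - Should dedup: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes` 174 ## Evaluate with PostgreSQL
134 - Should not dedup: `不应去重`, `不重复`, `new`, `0`, `false`, `no`
135 175
136 Run the evaluation: 176 Strict auto-block profile: only `duplicate` counts as a predicted should-dedup.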
137 177
138 ```bash 178 ```bash
139 python -m lyric_dedup.cli evaluate-csv \ 179 python scripts/evaluate_postgres.py \
140 --index outputs/indexes/lyrics.pkl \ 180 --dsn postgresql:///lyric_dedup \
141 --csv data/eval/eval.csv \ 181 --csv data/generated_eval/eval_hard_5000.csv \
142 --base-dir data \ 182 --base-dir data/generated_eval \
143 --out outputs/results/eval_result.csv 183 --out outputs/results/postgres_eval_hard_5000.csv
144 ``` 184 ```
145 185
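146 By default, only a system output of `duplicate` counts as "predicted should dedup". This suits evaluating auto-block accuracy, and false kills show up more clearly. 186 Suspicious-sample recall profile: both `duplicate` and `review` count as caught.
147
148 To evaluate the "suspicious-sample recall rate", where both `duplicate` and `review` count as hits: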
149 187
150 ```bash 188 ```bash
151 python -m lyric_dedup.cli evaluate-csv \ 189 python scripts/evaluate_postgres.py \
152 --index outputs/indexes/lyrics.pkl \ 190 --dsn postgresql:///lyric_dedup \
153 --csv data/eval/eval.csv \ 191 --csv data/generated_eval/eval_hard_5000.csv \
154 --base-dir data \ 192 --base-dir data/generated_eval \
155 --positive-decisions duplicate,review \ 193 --positive-decisions duplicate,review \
156 --out outputs/results/eval_result_review_as_positive.csv 194 --out outputs/results/postgres_eval_hard_5000_review_positive.csv
157 ``` 195 ```
158 196
159 Two files are generated: 197 The evaluation generates:
160 198
161 - `outputs/results/eval_result.csv`: the prediction, candidates, reason, and correctness for each sample. 199 ```text
162 - `outputs/results/eval_result.csv.summary.json`: overall metrics. 200 outputs/results/*.csv
201 outputs/results/*.csv.summary.json
202 ```
163 203
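164 Key fields in the summary: 204 Key fields in the summary:
165 205
166 - `accuracy`: overall accuracy. 206 ```text
167 - `precision`: of the samples predicted as should-dedup, the share that truly should be deduplicated. Prioritize this for auto-blocking. 207 precision auto-block precision; watch false_positive
168 - `recall`: of the samples that truly should be deduplicated, the share the system caught. 208 recall recall over should-dedup samples; watch false_negative
169 - `f1`: combined metric of precision and recall. 209 f1 combined metric of precision and recall
170 - `false_positive`: should not dedup but judged as should-dedup; a false kill. 210 duplicate/review/new distribution of the three decisions
171 - `false_negative`: should dedup but not caught; a miss. 211 ```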
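The summary metrics follow the usual confusion-matrix definitions, with `--positive-decisions` controlling which decisions count as a predicted should-dedup. The sketch below shows the arithmetic under that reading; `summarize` is an illustrative name, and returning 0.0 when a denominator is zero is an assumption about the convention, not something this README states.

```python
def summarize(
    rows: list[tuple[bool, str]],
    positive_decisions: frozenset[str] = frozenset({"duplicate"}),
) -> dict[str, float]:
    """Compute metrics from (expected_duplicate, decision) pairs."""
    tp = fp = tn = fn = 0
    for expected_duplicate, decision in rows:
        predicted = decision in positive_decisions
        if expected_duplicate and predicted:
            tp += 1
        elif expected_duplicate:
            fn += 1  # should dedup but not caught: a miss
        elif predicted:
            fp += 1  # should not dedup but flagged: a false kill
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive": fp,
        "false_negative": fn,
    }
```

Passing `frozenset({"duplicate", "review"})` reproduces the recall-oriented profile: recall can only go up, while precision usually drops because `review` results on negative samples now count as false positives.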
......
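1 # Lyric Duplicate Check Test Workflow 1 # Lyric Duplicate Check Test Workflow
2 2
3 This document records the complete commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and viewing results 3 This document records the project's current PostgreSQL-only test workflow. The current pipeline no longer uses `outputs/indexes/*.pkl` and no longer generates `*.index.pkl` evaluation indexes
4 4
5 ## 1. Prepare directories 5 ## 1. Prepare data
6 6
7 Put the existing library in 7 Existing library: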
8 8
9 ```text 9 ```text
10 data/library/ 10 data/library/
...@@ -17,125 +17,111 @@ data/library/ ...@@ -17,125 +17,111 @@ data/library/
17 .txt 17 .txt
18 ``` 18 ```
19 19
20 Generated test samples are placed in 20 Generated evaluation sample directories
21 21
22 ```text 22 ```text
23 data/generated_eval/incoming/ 23 data/generated_eval/incoming/
24 data/generated_eval/hard_incoming/
24 ``` 25 ```
25 26
26 The labeled test-set CSV is placed in 27 Evaluation results directory
27 28
28 ```text 29 ```text
29 data/generated_eval/eval_100.csv 30 outputs/results/
30 ``` 31 ```
31 32
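32 Evaluation results are placed in: 33 ## 2. Initialize PostgreSQL
33 34
34 ```text 35 Create the database:
35 outputs/results/
36 ```
37 36
38 ## 2. Build the index for the existing library 37 ```bash
38 createdb lyric_dedup
39 ```
39 40
40 If you have just added a batch of samples to `data/library`, run the processing script first 41 Initialize the schema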
41 42
42 ```bash 43 ```bash
43 python scripts/process_library.py \ 44 python scripts/init_postgres.py \
44 --library-dir data/library \ 45 --dsn postgresql:///lyric_dedup
45 --index outputs/indexes/library_lyrics.pkl
46 ``` 46 ```
47 47
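48 This script will 48 Check the tables
49 49
50 ```text 50 ```bash
51 1. Scan for and quarantine instrumental placeholder samples, for example files containing 【曲库专用】 or “此歌曲为没有填词的纯音乐”. 51 psql postgresql:///lyric_dedup -c '\dt'
52 2. Rebuild outputs/indexes/library_lyrics.pkl.
53 3. Write a processing report to outputs/results/library_process_report.json.
54 ``` 52 ```
55 53
56 To preview which files would be processed, without actually moving files or rebuilding the index: 54 ## 3. Import the library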
57 55
58 ```bash 56 ```bash
59 python scripts/process_library.py \ 57 python scripts/import_library_postgres.py \
60 --library-dir data/library \ 58 --dsn postgresql:///lyric_dedup \
61 --dry-run 59 --lyrics-dir data/library
62 ``` 60 ```
63 61
64 To also generate and evaluate 500 test samples along the way 62 After the import finishes, check the counts
65 63
66 ```bash 64 ```bash
67 python scripts/process_library.py \ 65 psql postgresql:///lyric_dedup -c 'select count(*) from lyrics where deleted_at is null;'
68 --library-dir data/library \ 66 psql postgresql:///lyric_dedup -c 'select count(*) from lyric_lines;'
69 --index outputs/indexes/library_lyrics.pkl \
70 --eval-size 50000 \
71 --positive-ratio 0.3 \
72 --eval-csv data/generated_eval/eval_50000.csv \
73 --eval-out outputs/results/library_eval_50000.csv
74 ``` 67 ```
75 68
76 Quarantined files are moved by default to 69 By default the import script soft-deletes duplicate records with an identical exact_hash and writes
77 70
78 ```text 71 ```text
79 data/quarantine/no_lyrics_placeholders/ 72 outputs/results/postgres_exact_duplicates.csv
80 ``` 73 ```
81 74
82 You can also just build the index manually 75 To additionally review suspected duplicates with high line-level coverage
83 76
84 ```bash 77 ```bash
85 python -m lyric_dedup.cli build-index \ 78 python scripts/import_library_postgres.py \
79 --dsn postgresql:///lyric_dedup \
86 --lyrics-dir data/library \ 80 --lyrics-dir data/library \
87 --index outputs/indexes/library_lyrics.pkl 81 --line-duplicate-report outputs/results/postgres_line_duplicates.csv
88 ``` 82 ```
89 83
90 Index file: 84 ## 4. Check a single file
91 85
92 ```text 86 ```bash
93 outputs/indexes/library_lyrics.pkl 87 python -m lyric_dedup.cli check-file \
88 --dsn postgresql:///lyric_dedup \
89 --file test_api/test_lyric.txt
94 ``` 90 ```
95 91
96 Note: if you modify `data/library` or change the preprocessing/dedup logic, rerun this step. 92 To enable trigram text recall:
97
98 ## 3. Generate production evaluation samples
99 93
100 ```bash 94 ```bash
101 python -m lyric_dedup.cli generate-eval-set \ 95 python -m lyric_dedup.cli check-file \
102 --library-dir data/library \ 96 --dsn postgresql:///lyric_dedup \
103 --lyrics-dir data/generated_eval/incoming \ 97 --file test_api/test_lyric.txt \
104 --csv data/generated_eval/eval_50000.csv \ 98 --enable-trgm \
105 --index outputs/indexes/library_lyrics.pkl \ 99 --trgm-threshold 0.3
106 --eval-index data/generated_eval/eval_50000.csv.index.pkl \
107 --size 50000 \
108 --positive-ratio 0.3
109 ``` 100 ```
110 101
111 To generate a hard-profile test set that is closer to business edge cases: 102 ## 5. Generate the standard evaluation set
112 103
113 ```bash 104 ```bash
114 python -m lyric_dedup.cli generate-eval-set \ 105 python -m lyric_dedup.cli generate-eval-set \
115 --profile hard \
116 --library-dir data/library \ 106 --library-dir data/library \
117 --lyrics-dir data/generated_eval/hard_incoming \ 107 --lyrics-dir data/generated_eval/incoming \
118 --csv data/generated_eval/eval_hard_5000.csv \ 108 --csv data/generated_eval/eval_5000.csv \
119 --index outputs/indexes/library_lyrics.pkl \
120 --eval-index data/generated_eval/eval_hard_5000.csv.index.pkl \
121 --size 5000 \ 109 --size 5000 \
122 --positive-ratio 0.3 110 --positive-ratio 0.3
123 ``` 111 ```
124 112
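125 Default production evaluation profile: 113 Standard profile:
126 114
127 ```text 115 ```text
128 Should dedup: 30% 116 Should dedup: 30%
129 Should not dedup: 70% 117 Should not dedup: 70%
130 ``` 118 ```
131 119
132 The generator first cleans out old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. 120 Sample types:
133
134 Business profile:
135 121
136 ```text 122 ```text
137 positive_* = should dedup; full-lyric style variations, including a few typo / English-misspelling perturbations 123 positive_* = should dedup; full-lyric style variations, such as timestamps, punctuation, platform noise, blank lines, changed chorus repetition counts, an added translation, a few typos
138 negative_real_holdout_full_song = should not dedup; complete real lyrics, already excluded from the evaluation index 124 negative_real_holdout_full_song = should not dedup; complete real lyrics, with the sample itself excluded from the evaluation candidates
139 negative_fragment = should not dedup; single-song fragment 125 negative_fragment = should not dedup; single-song fragment
140 negative_shared_chorus = should not dedup; repeated-chorus collision 126 negative_shared_chorus = should not dedup; repeated-chorus collision
141 negative_translation_only = should not dedup; translation-only similarity 127 negative_translation_only = should not dedup; translation-only similarity
...@@ -143,7 +129,19 @@ negative_same_theme_synthetic = should not dedup; new lyrics on the same theme ...@@ -143,7 +129,19 @@ negative_same_theme_synthetic = should not dedup; new lyrics on the same theme
143 edge_short_or_placeholder = should not dedup; short-lyric / placeholder edge sample 129 edge_short_or_placeholder = should not dedup; short-lyric / placeholder edge sample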
144 ``` 130 ```
145 131
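146 The hard profile additionally emphasizes real business edge cases rather than deliberately constructing abnormal hard cases: 132 ## 6. Generate the hard evaluation set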
133
134 ```bash
135 python -m lyric_dedup.cli generate-eval-set \
136 --profile hard \
137 --library-dir data/library \
138 --lyrics-dir data/generated_eval/hard_incoming \
139 --csv data/generated_eval/eval_hard_5000.csv \
140 --size 5000 \
141 --positive-ratio 0.3
142 ```
143
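144 The hard profile emphasizes real business edge cases and does not deliberately construct abnormal input:
147 145
148 ```text 146 ```text
149 positive_realistic_variant = should dedup; same-song platform-version noise, fairly complete lyrics missing a section, an appended whole-block translation, real data-entry/OCR errors 147 positive_realistic_variant = should dedup; same-song platform-version noise, fairly complete lyrics missing a section, an appended whole-block translation, real data-entry/OCR errors
...@@ -152,84 +150,50 @@ negative_long_fragment = should not dedup; long but incomplete single-song fragment ...@@ -152,84 +150,50 @@ negative_long_fragment = should not dedup; long but incomplete single-song fragment
152 negative_catalog_mashup = should not dedup; medley/mashup-style input composed of fragments from several real lyrics 150 negative_catalog_mashup = should not dedup; medley/mashup-style input composed of fragments from several real lyrics
153 ``` 151 ```
154 152
155 The generator scans the whole library and stratifies its sampling by valid lyric line count, language type, and file-source prefix. It sets aside a batch of holdout full lyrics as real new-song negatives and generates an evaluation index that excludes the holdout. Each run also outputs: 153 ## 7. Strict evaluation
156 154
157 ```text 155 The strict profile counts only `duplicate` as a predicted should-dedup: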
158 data/generated_eval/eval_50000.csv.manifest.json
159 data/generated_eval/eval_50000.csv.index.pkl
160 ```
161
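162 Key fields in the manifest:
163
164 ```text
165 library_files number of lyric files in the library
166 holdout_records number of records excluded from the evaluation index and used as real new-song negatives
167 sample_type_counts count per sample type
168 line_count_bucket_counts / language_bucket_counts / source_bucket_counts
169 unique_source_records number of real source files covered by this evaluation
170 ```
171
172 ## 4. Strict evaluation: count only duplicate as dedup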
173 156
174 ```bash 157 ```bash
175 python -m lyric_dedup.cli evaluate-csv \ 158 python scripts/evaluate_postgres.py \
176 --index data/generated_eval/eval_50000.csv.index.pkl \ 159 --dsn postgresql:///lyric_dedup \
177 --csv data/generated_eval/eval_50000.csv \ 160 --csv data/generated_eval/eval_hard_5000.csv \
178 --base-dir data/generated_eval \ 161 --base-dir data/generated_eval \
179 --out outputs/results/library_eval_50000.csv 162 --out outputs/results/postgres_eval_hard_5000.csv
180 ``` 163 ```
181 164
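182 Under this profile: 165 Suited to assessing auto-block quality; focus on:
183
184 ```text
185 duplicate -> predicted should dedup
186 review -> predicted should not dedup
187 new -> predicted should not dedup
188 ```
189
190 Suited to evaluating auto-block precision; focus on: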
191 166
192 ```text 167 ```text
168 precision
193 false_positive 169 false_positive
194 ``` 170 ```
195 171
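196 ## 5. Recall evaluation: count both duplicate and review as catching suspicious samples 172 ## 8. Recall evaluation
173
174 The recall profile counts both `duplicate` and `review` as catching suspicious samples: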
197 175
198 ```bash 176 ```bash
199 python -m lyric_dedup.cli evaluate-csv \ 177 python scripts/evaluate_postgres.py \
200 --index data/generated_eval/eval_50000.csv.index.pkl \ 178 --dsn postgresql:///lyric_dedup \
201 --csv data/generated_eval/eval_50000.csv \ 179 --csv data/generated_eval/eval_hard_5000.csv \
202 --base-dir data/generated_eval \ 180 --base-dir data/generated_eval \
203 --positive-decisions duplicate,review \ 181 --positive-decisions duplicate,review \
204 --out outputs/results/library_eval_50000_review_positive.csv 182 --out outputs/results/postgres_eval_hard_5000_review_positive.csv
205 ``` 183 ```
206 184
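207 Under this profile: 185 Suited to assessing the risk of misses; focus on:
208
209 ```text
210 duplicate -> predicted should dedup
211 review -> predicted should dedup
212 new -> predicted should not dedup
213 ```
214
215 Suited to evaluating suspicious-sample recall; focus on: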
216 186
217 ```text 187 ```text
188 recall
218 false_negative 189 false_negative
219 ``` 190 ```
220 191
221 ## 6. View overall metrics 192 ## 9. View the summary
222
223 Strict profile:
224 193
225 ```bash 194 ```bash
226 cat outputs/results/library_eval_100.csv.summary.json 195 cat outputs/results/postgres_eval_hard_5000.csv.summary.json
227 ``` 196 cat outputs/results/postgres_eval_hard_5000_review_positive.csv.summary.json
228
229 Recall profile:
230
231 ```bash
232 cat outputs/results/library_eval_100_review_positive.csv.summary.json
233 ``` 197 ```
234 198
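235 Metric meanings: 199 Metric meanings:
...@@ -245,84 +209,16 @@ true_negative should not dedup and predicted should not dedup ...@@ -245,84 +209,16 @@ true_negative should not dedup and predicted should not dedup
245 false_negative should dedup but predicted should not dedup; a miss 209 false_negative should dedup but predicted should not dedup; a miss
246 ``` 210 ```
247 211
248 ## 7. View per-sample results 212 ## 10. View failed samples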
249
250 ```bash
251 open outputs/results/library_eval_100.csv
252 ```
253
254 If `open` is unavailable, view the CSV directly:
255
256 ```bash
257 python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]'
258 ```
259
260 ## 8. View failed samples
261 213
262 Strict-profile failed samples: 214 Strict-profile failed samples:
263 215
264 ```bash 216 ```bash
265 python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_100.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]' 217 python -c 'import csv; rows=csv.DictReader(open("outputs/results/postgres_eval_hard_5000.csv", encoding="utf-8")); [print(r["id"], r["expected_duplicate"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]'
266 ```
267
268 View the full candidate list for a sample:
269
270 ```bash
271 python -m lyric_dedup.cli check-file \
272 --index outputs/indexes/library_lyrics.pkl \
273 --file data/generated_eval/incoming/neg_068_mixed_fragments.txt \
274 --max-candidates 10
275 ``` 218 ```
276 219
277 ## 9. Verify the test-set distribution 220 Count failures by sample type:
278 221
279 ```bash 222 ```bash
280 python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_10.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))' 223 python -c 'import csv,collections; meta={r["id"]:r for r in csv.DictReader(open("data/generated_eval/eval_hard_5000.csv", encoding="utf-8-sig"))}; rows=csv.DictReader(open("outputs/results/postgres_eval_hard_5000.csv", encoding="utf-8")); c=collections.Counter(meta.get(r["id"],{}).get("sample_type","") for r in rows if r["correct"]=="False"); print(c)'
281 ``` 224 ```
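The one-liner above is dense, so here is a more readable sketch of the same join. The column names `id`, `sample_type`, and `correct` are taken from that command; the function name and the choice to accept any iterable of CSV lines are illustrative.

```python
import csv
from collections import Counter
from typing import Iterable


def failures_by_sample_type(eval_lines: Iterable[str], result_lines: Iterable[str]) -> Counter:
    """Count incorrectly judged samples per sample_type.

    eval_lines:   lines of the generated evaluation CSV (has id, sample_type)
    result_lines: lines of the evaluation result CSV (has id, correct)
    """
    sample_types = {
        row["id"]: row.get("sample_type") or ""
        for row in csv.DictReader(eval_lines)
    }
    counts: Counter = Counter()
    for row in csv.DictReader(result_lines):
        if row["correct"] == "False":
            # Samples missing from the evaluation CSV are counted under "".
            counts[sample_types.get(row["id"], "")] += 1
    return counts
```

When reading from disk, open the evaluation CSV with `encoding="utf-8-sig"` and the result CSV with `encoding="utf-8"`, as the one-liner does, and pass the open file objects directly.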
282
283 Verify the number of files in the generated directory:
284
285 ```bash
286 find data/generated_eval/incoming -type f | wc -l
287 ```
288
289 ## 10. Run the code tests
290
291 ```bash
292 python -m pytest tests
293 ```
294
295 Compile check:
296
297 ```bash
298 python -m compileall -q lyric_dedup tests
299 ```
300
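301 ## 11. About test-set uniqueness
302
303 The 100 samples currently auto-generated form a rule-coverage test set; the samples are not guaranteed to be distinct from one another after normalization.
304
305 If the 100 test samples must be mutually distinct while still using the default ratio: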
306
307 ```text
308 size = 100
309 positive_ratio = 0.6
310 ```
311
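312 then you need at least:
313
314 ```text
315 60 mutually distinct seed lyrics
316 ```
317
318 Reason: should-dedup samples are full-song variants, and multiple style variations of the same song are still the same song after normalization.
319
320 A more reliable way to assess real accuracy is to prepare a manually labeled CSV: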
321
322 ```csv
323 id,file,expected
324 case-001,incoming/song_a.lrc,应去重
325 case-002,incoming/song_b.txt,不应去重
326 ```
327
328 Then run `evaluate-csv` from section 4 or section 5 directly
......
1 """Incremental lyric duplicate checker.""" 1 """Lyric candidate ranking and duplicate decision rules."""
2 2
3 from __future__ import annotations 3 from __future__ import annotations
4 4
5 import hashlib 5 import hashlib
6 import pickle
7 from dataclasses import dataclass 6 from dataclasses import dataclass
8 from enum import Enum 7 from enum import Enum
9 from pathlib import Path
10 8
11 from lyric_dedup.minhash_lsh import MinHashConfig
12 from lyric_dedup.minhash_lsh import MinHashLSH
13 from lyric_dedup.normalization import NormalizedLyrics 9 from lyric_dedup.normalization import NormalizedLyrics
14 from lyric_dedup.normalization import fingerprint_text 10 from lyric_dedup.normalization import fingerprint_text
15 from lyric_dedup.normalization import lyric_tokens 11 from lyric_dedup.normalization import lyric_tokens
...@@ -64,103 +60,61 @@ class _IndexedRecord: ...@@ -64,103 +60,61 @@ class _IndexedRecord:
64 translation_tokens: set[str] 60 translation_tokens: set[str]
65 fallback_lines: tuple[str, ...] 61 fallback_lines: tuple[str, ...]
66 fallback_tokens: set[str] 62 fallback_tokens: set[str]
67 signature: tuple[int, ...]
68 63
69 64
70 class DuplicateChecker: 65 class DuplicateChecker:
71 """In-memory first version for checking newly submitted lyrics. 66 """Rank PostgreSQL-recalled candidates and produce the final decision."""
72
73 The API is intentionally small: build or load records with ``add_record``, then
74 call ``check`` for a new lyric. Persistence can serialize the indexed fields
75 later without changing result semantics.
76 """
77 67
78 def __init__( 68 def __init__(
79 self, 69 self,
80 *, 70 *,
81 minhash_config: MinHashConfig | None = None,
82 duplicate_jaccard_threshold: float = 0.78, 71 duplicate_jaccard_threshold: float = 0.78,
83 duplicate_line_coverage_threshold: float = 0.72, 72 duplicate_line_coverage_threshold: float = 0.72,
73 duplicate_high_coverage_jaccard_threshold: float = 0.78,
74 duplicate_high_coverage_line_coverage_threshold: float = 0.90,
84 review_jaccard_threshold: float = 0.45, 75 review_jaccard_threshold: float = 0.45,
85 review_line_coverage_threshold: float = 0.35, 76 review_line_coverage_threshold: float = 0.35,
77 review_query_coverage_threshold: float = 0.40,
78 chorus_short_line_count_threshold: int = 6,
79 chorus_material_overlap_threshold: float = 0.20,
80 chorus_material_query_coverage_threshold: float = 0.40,
81 confidence_jaccard_weight: float = 0.58,
82 confidence_line_coverage_weight: float = 0.42,
86 ) -> None: 83 ) -> None:
87 self._lsh = MinHashLSH(minhash_config)
88 self._records: dict[str, _IndexedRecord] = {}
89 self._exact_hash_to_ids: dict[str, set[str]] = {}
90 self._line_to_ids: dict[str, set[str]] = {}
91 self._token_to_ids: dict[str, set[str]] = {}
92 self.duplicate_jaccard_threshold = duplicate_jaccard_threshold 84 self.duplicate_jaccard_threshold = duplicate_jaccard_threshold
93 self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold 85 self.duplicate_line_coverage_threshold = duplicate_line_coverage_threshold
86 self.duplicate_high_coverage_jaccard_threshold = duplicate_high_coverage_jaccard_threshold
87 self.duplicate_high_coverage_line_coverage_threshold = duplicate_high_coverage_line_coverage_threshold
94 self.review_jaccard_threshold = review_jaccard_threshold 88 self.review_jaccard_threshold = review_jaccard_threshold
95 self.review_line_coverage_threshold = review_line_coverage_threshold 89 self.review_line_coverage_threshold = review_line_coverage_threshold
96 90 self.review_query_coverage_threshold = review_query_coverage_threshold
97 def add_record(self, record: LyricRecord) -> None: 91 self.chorus_short_line_count_threshold = chorus_short_line_count_threshold
98 indexed = self._index(record) 92 self.chorus_material_overlap_threshold = chorus_material_overlap_threshold
99 self._add_indexed(record.record_id, indexed) 93 self.chorus_material_query_coverage_threshold = chorus_material_query_coverage_threshold
100 94 self.confidence_jaccard_weight = confidence_jaccard_weight
101 def add_normalized_record(self, record: LyricRecord, normalized: NormalizedLyrics) -> None: 95 self.confidence_line_coverage_weight = confidence_line_coverage_weight
102 """Add a record when normalized lyrics have already been computed.""" 96
103 indexed = self._index_normalized(record, normalized) 97 def check_record_against_candidates(
104 self._add_indexed(record.record_id, indexed) 98 self,
105 99 record: LyricRecord,
106 def _add_indexed(self, record_id: str, indexed: _IndexedRecord) -> None: 100 candidates: list[LyricRecord],
107 self._records[record_id] = indexed 101 *,
108 self._exact_hash_to_ids.setdefault(indexed.exact_hash, set()).add(record_id) 102 max_candidates: int = 10,
109 for line in indexed.normalized.unique_lines: 103 ) -> DuplicateCheckResult:
110 if len(line) >= 4: 104 """Rank explicitly supplied candidates without doing in-memory recall.
111 self._line_to_ids.setdefault(line, set()).add(record_id) 105
112 for token in indexed.tokens: 106 PostgreSQL-backed callers should use this method after database recall so
113 self._token_to_ids.setdefault(token, set()).add(record_id) 107 there is only one retrieval path: PG returns candidates, Python ranks and
114 for token in indexed.fallback_tokens: 108 decides.
115 self._token_to_ids.setdefault(token, set()).add(record_id) 109 """
116 self._lsh.add(record_id, indexed.signature)
117
118 def save(self, path: str | Path) -> None:
119 """Persist the in-memory index for later checks."""
120 with Path(path).open("wb") as file:
121 pickle.dump(self, file, protocol=pickle.HIGHEST_PROTOCOL)
122
123 @classmethod
124 def load(cls, path: str | Path) -> "DuplicateChecker":
125 """Load a previously persisted index."""
126 with Path(path).open("rb") as file:
127 checker = pickle.load(file)
128 if not isinstance(checker, cls):
129 raise TypeError(f"{path} does not contain a DuplicateChecker index")
130 return checker
131
132 @property
133 def record_count(self) -> int:
134 return len(self._records)
135
136 def check(self, lyrics: str, *, max_candidates: int = 10) -> DuplicateCheckResult:
137 return self.check_record(LyricRecord(record_id="__query__", lyrics=lyrics), max_candidates=max_candidates)
138
139 def check_record(self, record: LyricRecord, *, max_candidates: int = 10) -> DuplicateCheckResult:
140 query = self._index(record) 110 query = self._index(record)
141 exact_ids = self._exact_hash_to_ids.get(query.exact_hash, set())
142 if exact_ids:
143 candidates = tuple(self._rank_exact_candidate(query, self._records[record_id]) for record_id in sorted(exact_ids)[:max_candidates])
144 duplicate = next((candidate for candidate in candidates if candidate.decision == DuplicateDecision.DUPLICATE), None)
145 if duplicate is not None:
146 return DuplicateCheckResult(
147 decision=DuplicateDecision.DUPLICATE,
148 confidence=duplicate.confidence,
149 candidates=candidates,
150 normalized_full_text=query.normalized.normalized_full_text,
151 reason=duplicate.reason,
152 )
153 return DuplicateCheckResult(
154 decision=DuplicateDecision.REVIEW,
155 confidence=candidates[0].confidence,
156 candidates=candidates,
157 normalized_full_text=query.normalized.normalized_full_text,
158 reason=candidates[0].reason,
159 )
160
161 candidate_ids = self._recall_candidates(query)
162 ranked = sorted( 111 ranked = sorted(
163 (self._rank_candidate(query, self._records[record_id]) for record_id in candidate_ids), 112 (
113 self._rank_exact_candidate(query, indexed)
114 if indexed.exact_hash == query.exact_hash
115 else self._rank_candidate(query, indexed)
116 for indexed in (self._index(candidate) for candidate in candidates)
117 ),
164 key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard), 118 key=lambda item: (item.decision == DuplicateDecision.DUPLICATE, item.confidence, item.jaccard),
165 reverse=True, 119 reverse=True,
166 )[:max_candidates] 120 )[:max_candidates]
...@@ -203,7 +157,6 @@ class DuplicateChecker: ...@@ -203,7 +157,6 @@ class DuplicateChecker:
203 translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines) 157 translation_tokens = lyric_tokens(normalized, lines=normalized.translation_lines)
204 fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics)) 158 fallback_lines = tuple(_fallback_no_lyrics_lines(record.lyrics))
205 fallback_tokens = set(fallback_lines) 159 fallback_tokens = set(fallback_lines)
206 signature = self._lsh.signature(primary_tokens or tokens or fallback_tokens)
207 exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest() 160 exact_hash = hashlib.sha256(_exact_fingerprint(normalized, fallback_lines).encode("utf-8")).hexdigest()
208 return _IndexedRecord( 161 return _IndexedRecord(
209 record=record, 162 record=record,
...@@ -214,25 +167,8 @@ class DuplicateChecker: ...@@ -214,25 +167,8 @@ class DuplicateChecker:
214 translation_tokens=translation_tokens, 167 translation_tokens=translation_tokens,
215 fallback_lines=fallback_lines, 168 fallback_lines=fallback_lines,
216 fallback_tokens=fallback_tokens, 169 fallback_tokens=fallback_tokens,
217 signature=signature,
218 ) 170 )
219 171
220 def _recall_candidates(self, query: _IndexedRecord) -> set[str]:
221 candidate_ids = self._lsh.query(query.signature)
222 for line in query.normalized.primary_lines:
223 if len(line) >= 4:
224 candidate_ids.update(self._line_to_ids.get(line, set()))
225 for line in query.normalized.translation_lines:
226 if len(line) >= 4:
227 candidate_ids.update(self._line_to_ids.get(line, set()))
228 for token in query.primary_tokens or query.tokens:
229 candidate_ids.update(self._token_to_ids.get(token, set()))
230 for token in query.translation_tokens:
231 candidate_ids.update(self._token_to_ids.get(token, set()))
232 for token in query.fallback_tokens:
233 candidate_ids.update(self._token_to_ids.get(token, set()))
234 return candidate_ids
235
236 def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch: 172 def _rank_exact_candidate(self, query: _IndexedRecord, candidate: _IndexedRecord) -> CandidateMatch:
237 low_confidence_split = ( 173 low_confidence_split = (
238 query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low" 174 query.normalized.split_confidence == "low" or candidate.normalized.split_confidence == "low"
...@@ -306,25 +242,47 @@ class DuplicateChecker: ...@@ -306,25 +242,47 @@ class DuplicateChecker:
306 or jaccard >= self.review_jaccard_threshold 242 or jaccard >= self.review_jaccard_threshold
307 or ( 243 or (
308 primary_coverage >= self.review_line_coverage_threshold 244 primary_coverage >= self.review_line_coverage_threshold
309 and query_primary_coverage >= 0.40 245 and query_primary_coverage >= self.review_query_coverage_threshold
310 ) 246 )
311 or ( 247 or (
312 coverage >= self.review_line_coverage_threshold 248 coverage >= self.review_line_coverage_threshold
313 and query_coverage >= 0.40 249 and query_coverage >= self.review_query_coverage_threshold
314 ) 250 )
315 ) 251 )
316 has_material_chorus_overlap = chorus_only and ( 252 has_material_chorus_overlap = chorus_only and (
317 query.normalized.content_line_count <= 6 253 query.normalized.content_line_count <= self.chorus_short_line_count_threshold
318 or (primary_jaccard >= 0.20 and query_primary_coverage >= 0.40) 254 or (
319 or (jaccard >= 0.20 and query_coverage >= 0.40) 255 primary_jaccard >= self.chorus_material_overlap_threshold
320 or (primary_coverage >= 0.20 and query_primary_coverage >= 0.40) 256 and query_primary_coverage >= self.chorus_material_query_coverage_threshold
321 or (coverage >= 0.20 and query_coverage >= 0.40) 257 )
258 or (
259 jaccard >= self.chorus_material_overlap_threshold
260 and query_coverage >= self.chorus_material_query_coverage_threshold
261 )
262 or (
263 primary_coverage >= self.chorus_material_overlap_threshold
264 and query_primary_coverage >= self.chorus_material_query_coverage_threshold
265 )
266 or (
267 coverage >= self.chorus_material_overlap_threshold
268 and query_coverage >= self.chorus_material_query_coverage_threshold
269 )
322 ) 270 )
323 has_low_confidence_split_overlap = low_confidence_split and has_review_level_overlap 271 has_low_confidence_split_overlap = low_confidence_split and has_review_level_overlap
324 272
325 confidence = round((0.58 * primary_jaccard) + (0.42 * primary_coverage), 4) 273 confidence = round(
274 (self.confidence_jaccard_weight * primary_jaccard)
275 + (self.confidence_line_coverage_weight * primary_coverage),
276 4,
277 )
326 if ( 278 if (
327 (primary_jaccard >= self.duplicate_jaccard_threshold or (primary_jaccard >= 0.78 and primary_coverage >= 0.9)) 279 (
280 primary_jaccard >= self.duplicate_jaccard_threshold
281 or (
282 primary_jaccard >= self.duplicate_high_coverage_jaccard_threshold
283 and primary_coverage >= self.duplicate_high_coverage_line_coverage_threshold
284 )
285 )
328 and primary_coverage >= self.duplicate_line_coverage_threshold 286 and primary_coverage >= self.duplicate_line_coverage_threshold
329 and not chorus_only 287 and not chorus_only
330 and not translation_only 288 and not translation_only
......
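The checker hunk above swaps hard-coded numbers for configurable attributes without changing the shape of the rule. A minimal sketch of the automatic-duplicate condition and the confidence score follows. It assumes the `ServerConfig` defaults shown in this commit (0.78, 0.72, 0.78, 0.90) and the previously hard-coded weights 0.58 and 0.42. `Thresholds`, `confidence`, and `is_auto_duplicate` are hypothetical names, and the real method applies further conditions that the diff elides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical container; values mirror the defaults visible in this commit.
    duplicate_jaccard: float = 0.78
    duplicate_line_coverage: float = 0.72
    high_coverage_jaccard: float = 0.78
    high_coverage_line_coverage: float = 0.90
    confidence_jaccard_weight: float = 0.58
    confidence_line_coverage_weight: float = 0.42  # old hard-coded value


def confidence(primary_jaccard: float, primary_coverage: float, t: Thresholds = Thresholds()) -> float:
    # Weighted blend of the two primary-text signals, rounded like the original.
    return round(
        t.confidence_jaccard_weight * primary_jaccard
        + t.confidence_line_coverage_weight * primary_coverage,
        4,
    )


def is_auto_duplicate(
    primary_jaccard: float,
    primary_coverage: float,
    *,
    chorus_only: bool = False,
    translation_only: bool = False,
    t: Thresholds = Thresholds(),
) -> bool:
    # Either the main Jaccard path or the alternate high-coverage path must pass.
    similar_enough = primary_jaccard >= t.duplicate_jaccard or (
        primary_jaccard >= t.high_coverage_jaccard
        and primary_coverage >= t.high_coverage_line_coverage
    )
    # Line coverage and the chorus/translation guards apply to both paths.
    return (
        similar_enough
        and primary_coverage >= t.duplicate_line_coverage
        and not chorus_only
        and not translation_only
    )
```

With the defaults, the alternate path never fires on its own, because its Jaccard threshold equals the main one. It only matters if `high_coverage_jaccard` is configured lower than `duplicate_jaccard`.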
1 """Command line tools for lyric duplicate checking.""" 1 """PostgreSQL-backed command line tools for lyric duplicate checking."""
2 2
3 from __future__ import annotations 3 from __future__ import annotations
4 4
5 import argparse 5 import argparse
6 import csv
7 import json 6 import json
8 import sys
9 from pathlib import Path 7 from pathlib import Path
10 8
11 from lyric_dedup.checker import DuplicateChecker
12 from lyric_dedup.checker import LyricRecord
13 from lyric_dedup.eval_dataset import generate_eval_set 9 from lyric_dedup.eval_dataset import generate_eval_set
14 from lyric_dedup.file_import import iter_lyric_files
15 from lyric_dedup.file_import import read_lyric_file
16 from lyric_dedup.file_import import record_from_file 10 from lyric_dedup.file_import import record_from_file
17 from lyric_dedup.file_import import records_from_dir
18 11
19 12
20 def main() -> None: 13 def main() -> None:
21 parser = argparse.ArgumentParser(prog="lyric-dedup") 14 parser = argparse.ArgumentParser(prog="lyric-dedup")
22 subparsers = parser.add_subparsers(dest="command", required=True) 15 subparsers = parser.add_subparsers(dest="command", required=True)
23 16
24 build = subparsers.add_parser("build-index", help="build an index from .lrc/.txt files") 17 check = subparsers.add_parser("check-file", help="check one .lrc/.txt file using PostgreSQL recall")
25 build.add_argument("--lyrics-dir", required=True) 18 check.add_argument("--dsn", default="postgresql:///lyric_dedup")
26 build.add_argument("--index", required=True)
27
28 check = subparsers.add_parser("check-file", help="check one .lrc/.txt file against an index")
29 check.add_argument("--index", required=True)
30 check.add_argument("--file", required=True) 19 check.add_argument("--file", required=True)
31 check.add_argument("--max-candidates", type=int, default=10) 20 check.add_argument("--max-candidates", type=int, default=5)
32 21 check.add_argument("--recall-limit", type=int, default=100)
33 batch = subparsers.add_parser("batch-check", help="check a directory of .lrc/.txt files against an index") 22 check.add_argument("--enable-trgm", action="store_true")
34 batch.add_argument("--index", required=True) 23 check.add_argument("--trgm-threshold", type=float, default=0.3)
35 batch.add_argument("--lyrics-dir", required=True) 24 check.add_argument("--statement-timeout-ms", type=int, default=5000)
36 batch.add_argument("--out", required=True)
37 batch.add_argument("--max-candidates", type=int, default=5)
38
39 evaluate = subparsers.add_parser("evaluate-csv", help="evaluate labeled duplicate samples from a CSV file")
40 evaluate.add_argument("--index", required=True)
41 evaluate.add_argument("--csv", required=True)
42 evaluate.add_argument("--out", required=True)
43 evaluate.add_argument("--base-dir", default="")
44 evaluate.add_argument("--positive-decisions", default="duplicate")
45 evaluate.add_argument("--max-candidates", type=int, default=5)
46 25
47 generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library") 26 generate = subparsers.add_parser("generate-eval-set", help="generate labeled eval samples from a lyric library")
48 generate.add_argument("--library-dir", required=True) 27 generate.add_argument("--library-dir", required=True)
@@ -51,8 +30,6 @@ def main() -> None:
51 generate.add_argument("--size", type=int, default=100) 30 generate.add_argument("--size", type=int, default=100)
52 generate.add_argument("--positive-ratio", type=float, default=0.3) 31 generate.add_argument("--positive-ratio", type=float, default=0.3)
53 generate.add_argument("--seed", type=int, default=20260602) 32 generate.add_argument("--seed", type=int, default=20260602)
54 generate.add_argument("--index", default="", help="optional source index path recorded in the manifest")
55 generate.add_argument("--eval-index", default="", help="output index built from non-holdout records for this eval set")
56 generate.add_argument( 33 generate.add_argument(
57 "--profile", 34 "--profile",
58 choices=("standard", "hard"), 35 choices=("standard", "hard"),
@@ -61,21 +38,8 @@ def main() -> None:
61 ) 38 )
62 39
63 args = parser.parse_args() 40 args = parser.parse_args()
64 if args.command == "build-index": 41 if args.command == "check-file":
65 build_index(Path(args.lyrics_dir), Path(args.index)) 42 check_file_pg(args)
66 elif args.command == "check-file":
67 check_file(Path(args.index), Path(args.file), args.max_candidates)
68 elif args.command == "batch-check":
69 batch_check(Path(args.index), Path(args.lyrics_dir), Path(args.out), args.max_candidates)
70 elif args.command == "evaluate-csv":
71 evaluate_csv(
72 Path(args.index),
73 Path(args.csv),
74 Path(args.out),
75 base_dir=Path(args.base_dir) if args.base_dir else None,
76 positive_decisions={item.strip() for item in args.positive_decisions.split(",") if item.strip()},
77 max_candidates=args.max_candidates,
78 )
79 elif args.command == "generate-eval-set": 43 elif args.command == "generate-eval-set":
80 summary = generate_eval_set( 44 summary = generate_eval_set(
81 library_dir=Path(args.library_dir), 45 library_dir=Path(args.library_dir),
@@ -84,315 +48,40 @@ def main() -> None:
84 size=args.size, 48 size=args.size,
85 positive_ratio=args.positive_ratio, 49 positive_ratio=args.positive_ratio,
86 seed=args.seed, 50 seed=args.seed,
87 index_path=Path(args.index) if args.index else None,
88 eval_index_path=Path(args.eval_index) if args.eval_index else None,
89 profile=args.profile, 51 profile=args.profile,
90 ) 52 )
91 print(json.dumps(summary, ensure_ascii=False)) 53 print(json.dumps(summary, ensure_ascii=False))
92 54
93 55
94 def build_index(lyrics_dir: Path, index_path: Path) -> None: 56 def check_file_pg(args: argparse.Namespace) -> None:
95 checker = DuplicateChecker() 57 from lyric_dedup_server.config import ServerConfig
96 records = records_from_dir(lyrics_dir) 58 from lyric_dedup_server.service import DedupService
97 for record in records:
98 checker.add_record(record)
99 index_path.parent.mkdir(parents=True, exist_ok=True)
100 checker.save(index_path)
101 print(json.dumps({"indexed": checker.record_count, "index": str(index_path)}, ensure_ascii=False))
102
103 59
104 def check_file(index_path: Path, file_path: Path, max_candidates: int) -> None: 60 record = record_from_file(Path(args.file))
105 checker = DuplicateChecker.load(index_path) 61 config = ServerConfig(
106 record = record_from_file(file_path) 62 dsn=args.dsn,
107 result = checker.check_record(record, max_candidates=max_candidates) 63 max_candidates=args.max_candidates,
108 print(json.dumps(_result_to_dict(result, source=str(file_path)), ensure_ascii=False, indent=2)) 64 recall_limit=args.recall_limit,
109 65 enable_trgm=args.enable_trgm,
110 66 trgm_threshold=args.trgm_threshold,
111 def batch_check(index_path: Path, lyrics_dir: Path, out_path: Path, max_candidates: int) -> None: 67 statement_timeout_ms=args.statement_timeout_ms,
112 checker = DuplicateChecker.load(index_path) 68 )
113 out_path.parent.mkdir(parents=True, exist_ok=True) 69 service = DedupService(config=config)
114 rows: list[dict[str, object]] = [] 70 result = service.check(record.lyrics, title=record.title, artist=record.artist)
115 for path in iter_lyric_files(lyrics_dir): 71 print(
116 record = record_from_file(path, base_dir=lyrics_dir) 72 json.dumps(
117 result = checker.check_record(record, max_candidates=max_candidates)
118 best = result.candidates[0] if result.candidates else None
119 rows.append(
120 { 73 {
121 "source": str(path), 74 "source": args.file,
122 "record_id": record.record_id, 75 "decision": result.decision,
123 "decision": result.decision.value, 76 "duplicate": result.duplicate,
124 "confidence": result.confidence, 77 "confidence": result.confidence,
125 "reason": result.reason, 78 "reason": result.reason,
126 "best_candidate_id": best.record_id if best else "", 79 "candidate_count": result.candidate_count,
127 "best_candidate_decision": best.decision.value if best else "", 80 },
128 "best_candidate_confidence": best.confidence if best else "", 81 ensure_ascii=False,
129 "best_candidate_jaccard": best.jaccard if best else "", 82 indent=2,
130 "best_candidate_line_coverage": best.line_coverage if best else "",
131 "best_candidate_primary_jaccard": best.primary_jaccard if best else "",
132 "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "",
133 "best_candidate_translation_jaccard": best.translation_jaccard if best else "",
134 "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "",
135 "best_candidate_reason": best.reason if best else "",
136 "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "",
137 }
138 )
139
140 if out_path.suffix.lower() == ".jsonl":
141 with out_path.open("w", encoding="utf-8") as file:
142 for row in rows:
143 file.write(json.dumps(row, ensure_ascii=False) + "\n")
144 else:
145 with out_path.open("w", encoding="utf-8", newline="") as file:
146 writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()) if rows else ["source"])
147 writer.writeheader()
148 writer.writerows(rows)
149 summary = {
150 "checked": len(rows),
151 "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"),
152 "review": sum(1 for row in rows if row["decision"] == "review"),
153 "new": sum(1 for row in rows if row["decision"] == "new"),
154 "out": str(out_path),
155 }
156 print(json.dumps(summary, ensure_ascii=False))
157
158
159 def evaluate_csv(
160 index_path: Path,
161 csv_path: Path,
162 out_path: Path,
163 *,
164 base_dir: Path | None,
165 positive_decisions: set[str],
166 max_candidates: int,
167 ) -> None:
168 _progress(f"load index: {index_path}")
169 checker = DuplicateChecker.load(index_path)
170 rows: list[dict[str, object]] = []
171 total = _csv_data_row_count(csv_path)
172 _progress(f"evaluate csv: 0/{total}")
173 out_path.parent.mkdir(parents=True, exist_ok=True)
174 with csv_path.open(encoding="utf-8-sig", newline="") as file:
175 reader = csv.DictReader(file)
176 if reader.fieldnames is None:
177 raise ValueError("The evaluation CSV needs a header row")
178 fieldnames = [
179 "id",
180 "source",
181 "expected_duplicate",
182 "decision",
183 "predicted_duplicate",
184 "correct",
185 "confidence",
186 "reason",
187 "best_candidate_id",
188 "best_candidate_decision",
189 "best_candidate_confidence",
190 "best_candidate_jaccard",
191 "best_candidate_line_coverage",
192 "best_candidate_primary_jaccard",
193 "best_candidate_primary_line_coverage",
194 "best_candidate_translation_jaccard",
195 "best_candidate_translation_line_coverage",
196 "best_candidate_reason",
197 "matched_unique_lines",
198 ]
199 with out_path.open("w", encoding="utf-8", newline="") as out_file:
200 writer = csv.DictWriter(out_file, fieldnames=fieldnames)
201 writer.writeheader()
202 for index, row in enumerate(reader, start=1):
203 row_out = _evaluate_row(
204 row,
205 row_number=index + 1,
206 checker=checker,
207 csv_path=csv_path,
208 base_dir=base_dir,
209 positive_decisions=positive_decisions,
210 max_candidates=max_candidates,
211 )
212 rows.append(row_out)
213 writer.writerow(row_out)
214 _progress_count("evaluate csv", index, total, step=1000)
215
216 summary = _evaluation_summary(rows, positive_decisions=positive_decisions, out_path=out_path)
217 summary_path = out_path.with_suffix(out_path.suffix + ".summary.json")
218 summary_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8")
219 _progress("evaluation complete")
220 print(json.dumps(summary, ensure_ascii=False))
221
222
223 def _result_to_dict(result, *, source: str) -> dict[str, object]:
224 return {
225 "source": source,
226 "decision": result.decision.value,
227 "confidence": result.confidence,
228 "reason": result.reason,
229 "candidates": [
230 {
231 "record_id": candidate.record_id,
232 "decision": candidate.decision.value,
233 "confidence": candidate.confidence,
234 "jaccard": candidate.jaccard,
235 "line_coverage": candidate.line_coverage,
236 "primary_jaccard": candidate.primary_jaccard,
237 "primary_line_coverage": candidate.primary_line_coverage,
238 "translation_jaccard": candidate.translation_jaccard,
239 "translation_line_coverage": candidate.translation_line_coverage,
240 "reason": candidate.reason,
241 "matched_unique_lines": list(candidate.matched_unique_lines),
242 }
243 for candidate in result.candidates
244 ],
245 }
246
247
248 def _evaluate_row(
249 row: dict[str, str],
250 *,
251 row_number: int,
252 checker: DuplicateChecker,
253 csv_path: Path,
254 base_dir: Path | None,
255 positive_decisions: set[str],
256 max_candidates: int,
257 ) -> dict[str, object]:
258 sample_id = row.get("id") or row.get("sample_id") or str(row_number)
259 record, source = _record_from_eval_row(row, csv_path=csv_path, base_dir=base_dir)
260 expected_duplicate = _parse_expected(row.get("expected") or row.get("label") or row.get("target"))
261 result = checker.check_record(record, max_candidates=max_candidates)
262 predicted_duplicate = result.decision.value in positive_decisions
263 best = result.candidates[0] if result.candidates else None
264 return {
265 "id": sample_id,
266 "source": source,
267 "expected_duplicate": expected_duplicate,
268 "decision": result.decision.value,
269 "predicted_duplicate": predicted_duplicate,
270 "correct": expected_duplicate == predicted_duplicate,
271 "confidence": result.confidence,
272 "reason": result.reason,
273 "best_candidate_id": best.record_id if best else "",
274 "best_candidate_decision": best.decision.value if best else "",
275 "best_candidate_confidence": best.confidence if best else "",
276 "best_candidate_jaccard": best.jaccard if best else "",
277 "best_candidate_line_coverage": best.line_coverage if best else "",
278 "best_candidate_primary_jaccard": best.primary_jaccard if best else "",
279 "best_candidate_primary_line_coverage": best.primary_line_coverage if best else "",
280 "best_candidate_translation_jaccard": best.translation_jaccard if best else "",
281 "best_candidate_translation_line_coverage": best.translation_line_coverage if best else "",
282 "best_candidate_reason": best.reason if best else "",
283 "matched_unique_lines": " | ".join(best.matched_unique_lines) if best else "",
284 }
285
286
287 def _lyrics_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[str, str]:
288 lyrics = (row.get("lyrics") or "").strip()
289 if lyrics:
290 return lyrics.replace("\\n", "\n"), "inline"
291
292 file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip()
293 if not file_value:
294 raise ValueError("Each evaluation CSV row must provide lyrics, or a file/path/source file path")
295
296 file_path = Path(file_value)
297 if not file_path.is_absolute():
298 file_path = (base_dir or csv_path.parent) / file_path
299 return read_lyric_file(file_path), str(file_path)
300
301
302 def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None):
303 lyrics = (row.get("lyrics") or "").strip()
304 if lyrics:
305 return (
306 LyricRecord(
307 record_id=row.get("id") or row.get("sample_id") or "__eval__",
308 lyrics=lyrics.replace("\\n", "\n"),
309 title=row.get("title") or None,
310 artist=row.get("artist") or None,
311 ),
312 "inline",
313 ) 83 )
314 84 )
315 file_value = (row.get("file") or row.get("path") or row.get("source") or "").strip()
316 if not file_value:
317 raise ValueError("Each evaluation CSV row needs lyrics, or a file/path/source file path")
318
319 file_path = Path(file_value)
320 if not file_path.is_absolute():
321 file_path = (base_dir or csv_path.parent) / file_path
322 record = record_from_file(file_path)
323 if row.get("title") or row.get("artist"):
324 record = LyricRecord(
325 record_id=record.record_id,
326 lyrics=record.lyrics,
327 title=row.get("title") or record.title,
328 artist=row.get("artist") or record.artist,
329 )
330 return record, str(file_path)
331
332
333 def _parse_expected(value: str | None) -> bool:
334 if value is None:
335 raise ValueError("Each evaluation CSV row needs an expected/label/target column")
336 normalized = value.strip().lower()
337 positives = {"1", "true", "yes", "y", "duplicate", "dup", "重复", "应去重", "去重", "是"}
338 negatives = {"0", "false", "no", "n", "new", "not_duplicate", "non_duplicate", "不重复", "不应去重", "新歌", "否"}
339 if normalized in positives:
340 return True
341 if normalized in negatives:
342 return False
343 raise ValueError(f"Unrecognized expected value: {value!r}")
344
345
346 def _evaluation_summary(
347 rows: list[dict[str, object]],
348 *,
349 positive_decisions: set[str],
350 out_path: Path,
351 ) -> dict[str, object]:
352 tp = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is True)
353 fp = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is True)
354 tn = sum(1 for row in rows if row["expected_duplicate"] is False and row["predicted_duplicate"] is False)
355 fn = sum(1 for row in rows if row["expected_duplicate"] is True and row["predicted_duplicate"] is False)
356 total = len(rows)
357 precision = tp / (tp + fp) if tp + fp else 0.0
358 recall = tp / (tp + fn) if tp + fn else 0.0
359 accuracy = (tp + tn) / total if total else 0.0
360 f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
361 return {
362 "total": total,
363 "positive_decisions": sorted(positive_decisions),
364 "accuracy": round(accuracy, 4),
365 "precision": round(precision, 4),
366 "recall": round(recall, 4),
367 "f1": round(f1, 4),
368 "true_positive": tp,
369 "false_positive": fp,
370 "true_negative": tn,
371 "false_negative": fn,
372 "duplicate": sum(1 for row in rows if row["decision"] == "duplicate"),
373 "review": sum(1 for row in rows if row["decision"] == "review"),
374 "new": sum(1 for row in rows if row["decision"] == "new"),
375 "out": str(out_path),
376 "summary": str(out_path.with_suffix(out_path.suffix + ".summary.json")),
377 }
378
379
380 def _csv_data_row_count(csv_path: Path) -> int:
381 with csv_path.open(encoding="utf-8-sig", newline="") as file:
382 reader = csv.reader(file)
383 next(reader, None)
384 return sum(1 for _ in reader)
385
386
387 def _progress(message: str) -> None:
388 print(f"[eval] {message}", file=sys.stderr, flush=True)
389
390
391 def _progress_count(label: str, current: int, total: int, *, step: int = 1000) -> None:
392 if total <= 0:
393 return
394 if current == 1 or current == total or current % step == 0:
395 _progress(f"{label}: {current}/{total}")
396 85
397 86
398 if __name__ == "__main__": 87 if __name__ == "__main__":
......
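After this change the CLI keeps only `check-file` (plus `generate-eval-set`), and every recall option becomes a flag. The sketch below reproduces only the `check-file` flags from the diff to show how they land on the parsed namespace. `build_parser` is a hypothetical helper name and the sample file path is illustrative. The database call itself is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the check-file flags and defaults shown in the new cli module.
    parser = argparse.ArgumentParser(prog="lyric-dedup")
    subparsers = parser.add_subparsers(dest="command", required=True)
    check = subparsers.add_parser("check-file", help="check one .lrc/.txt file using PostgreSQL recall")
    check.add_argument("--dsn", default="postgresql:///lyric_dedup")
    check.add_argument("--file", required=True)
    check.add_argument("--max-candidates", type=int, default=5)
    check.add_argument("--recall-limit", type=int, default=100)
    check.add_argument("--enable-trgm", action="store_true")
    check.add_argument("--trgm-threshold", type=float, default=0.3)
    check.add_argument("--statement-timeout-ms", type=int, default=5000)
    return parser


# Illustrative invocation; the path is made up for the example.
args = build_parser().parse_args(
    ["check-file", "--file", "data/incoming/new_song.lrc", "--enable-trgm"]
)
print(vars(args))
```

Note that `--max-candidates` now defaults to 5 rather than 10, and trigram recall stays off unless `--enable-trgm` is passed.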
@@ -12,7 +12,6 @@ from collections import Counter
12 from dataclasses import dataclass 12 from dataclasses import dataclass
13 from pathlib import Path 13 from pathlib import Path
14 14
15 from lyric_dedup.checker import DuplicateChecker
16 from lyric_dedup.checker import LyricRecord 15 from lyric_dedup.checker import LyricRecord
17 from lyric_dedup.file_import import iter_lyric_files 16 from lyric_dedup.file_import import iter_lyric_files
18 from lyric_dedup.file_import import record_from_file 17 from lyric_dedup.file_import import record_from_file
@@ -133,8 +132,6 @@ def generate_eval_set(
133 ) 132 )
134 holdout_ids = {profile.record_id for profile in holdout_profiles} 133 holdout_ids = {profile.record_id for profile in holdout_profiles}
135 indexed_profiles = [profile for profile in profiles if profile.record_id not in holdout_ids] or profiles 134 indexed_profiles = [profile for profile in profiles if profile.record_id not in holdout_ids] or profiles
136 eval_index_path = eval_index_path or csv_path.with_suffix(csv_path.suffix + ".index.pkl")
137 _build_eval_index(indexed_profiles, eval_index_path)
138 groups = _profile_groups(indexed_profiles) 135 groups = _profile_groups(indexed_profiles)
139 samples: list[GeneratedSample] = [] 136 samples: list[GeneratedSample] = []
140 137
@@ -373,25 +370,6 @@ def _stratified_unique_sample(profiles: list[LyricProfile], count: int, rng: ran
373 return _stratified_sample(profiles, min(count, len(profiles)), rng) 370 return _stratified_sample(profiles, min(count, len(profiles)), rng)
374 371
375 372
376 def _build_eval_index(profiles: list[LyricProfile], index_path: Path) -> None:
377 _progress(f"build eval index excluding holdout: {index_path}")
378 checker = DuplicateChecker()
379 total = len(profiles)
380 for index, profile in enumerate(profiles, start=1):
381 checker.add_normalized_record(
382 LyricRecord(
383 record_id=profile.record_id,
384 lyrics=profile.raw_text,
385 title=profile.title or None,
386 artist=profile.artist or None,
387 ),
388 profile.normalized,
389 )
390 _progress_count("build eval index", index, total, step=5000)
391 index_path.parent.mkdir(parents=True, exist_ok=True)
392 checker.save(index_path)
393
394
395 def _build_positive_samples( 373 def _build_positive_samples(
396 profiles: list[LyricProfile], 374 profiles: list[LyricProfile],
397 output_dir: Path, 375 output_dir: Path,
@@ -889,7 +867,7 @@ def _write_manifest(
889 "sample_size": len(samples), 867 "sample_size": len(samples),
890 "plan": plan, 868 "plan": plan,
891 "source_index": str(index_path) if index_path else "", 869 "source_index": str(index_path) if index_path else "",
892 "eval_index": str(eval_index_path), 870 "eval_index": str(eval_index_path) if eval_index_path else "",
893 "holdout_records": holdout_count, 871 "holdout_records": holdout_count,
894 "lyrics_dir": str(output_dir), 872 "lyrics_dir": str(output_dir),
895 "csv": str(csv_path), 873 "csv": str(csv_path),
......
1 """Small in-memory MinHash LSH index for incremental lyric lookup."""
2
3 from __future__ import annotations
4
5 import hashlib
6 from collections import defaultdict
7 from dataclasses import dataclass
8
9
10 _MAX_HASH = (1 << 64) - 1
11
12
13 @dataclass(frozen=True)
14 class MinHashConfig:
15 num_perm: int = 96
16 bands: int = 24
17 seed: int = 17
18
19 @property
20 def rows_per_band(self) -> int:
21 if self.num_perm % self.bands != 0:
22 raise ValueError("num_perm must be divisible by bands")
23 return self.num_perm // self.bands
24
25
26 class MinHashLSH:
27 def __init__(self, config: MinHashConfig | None = None) -> None:
28 self.config = config or MinHashConfig()
29 self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set)
30
31 def signature(self, tokens: set[str]) -> tuple[int, ...]:
32 if not tokens:
33 return tuple([_MAX_HASH] * self.config.num_perm)
34
35 signature = [_MAX_HASH] * self.config.num_perm
36 for token in tokens:
37 encoded = token.encode("utf-8")
38 for idx in range(self.config.num_perm):
39 digest = hashlib.blake2b(
40 encoded,
41 digest_size=8,
42 person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16],
43 ).digest()
44 value = int.from_bytes(digest, "big")
45 if value < signature[idx]:
46 signature[idx] = value
47 return tuple(signature)
48
49 def add(self, record_id: str, signature: tuple[int, ...]) -> None:
50 for key in self._band_keys(signature):
51 self._buckets[key].add(record_id)
52
53 def query(self, signature: tuple[int, ...]) -> set[str]:
54 candidates: set[str] = set()
55 for key in self._band_keys(signature):
56 candidates.update(self._buckets.get(key, set()))
57 return candidates
58
59 def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]:
60 rows = self.config.rows_per_band
61 return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)]
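For context on what the removed index provided: `MinHashConfig` used 96 permutations in 24 bands, so 4 rows per band. Under the standard MinHash LSH analysis, which assumes independent hash functions, two token sets with Jaccard similarity s share at least one band bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve for the removed defaults. `candidate_probability` is a hypothetical helper and not part of the repository.

```python
def candidate_probability(similarity: float, bands: int = 24, rows_per_band: int = 4) -> float:
    """Probability that two sets collide in at least one LSH band."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    # One band matches only if all of its rows agree: similarity ** rows_per_band.
    # The pair is recalled if any of the bands matches.
    return 1.0 - (1.0 - similarity ** rows_per_band) ** bands


for s in (0.2, 0.5, 0.8):
    print(f"jaccard={s:.1f} -> recall probability {candidate_probability(s):.4f}")
```

The curve is steep: pairs with low similarity are rarely recalled, while pairs near the duplicate threshold almost always are. In the old `_recall_candidates`, the LSH result was further unioned with exact line and token lookups. This commit replaces all of those with PostgreSQL recall.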
@@ -8,69 +8,10 @@ import unicodedata
8 from collections import Counter 8 from collections import Counter
9 from dataclasses import dataclass 9 from dataclasses import dataclass
10 10
11 import opencc
11 12
12 _TRADITIONAL_TO_SIMPLIFIED = str.maketrans( 13
13 { 14 _T2S_CONVERTER = opencc.OpenCC("t2s.json")
14 "愛": "爱",
15 "會": "会",
16 "個": "个",
17 "妳": "你",
18 "們": "们",
19 "麼": "么",
20 "夢": "梦",
21 "憶": "忆",
22 "風": "风",
23 "無": "无",
24 "與": "与",
25 "聽": "听",
26 "說": "说",
27 "見": "见",
28 "話": "话",
29 "還": "还",
30 "這": "这",
31 "那": "那",
32 "裡": "里",
33 "裏": "里",
34 "過": "过",
35 "來": "来",
36 "進": "进",
37 "去": "去",
38 "給": "给",
39 "讓": "让",
40 "嗎": "吗",
41 "為": "为",
42 "誰": "谁",
43 "對": "对",
44 "錯": "错",
45 "淚": "泪",
46 "寫": "写",
47 "雲": "云",
48 "藍": "蓝",
49 "紅": "红",
50 "綠": "绿",
51 "黃": "黄",
52 "長": "长",
53 "遠": "远",
54 "燈": "灯",
55 "臺": "台",
56 "台": "台",
57 "後": "后",
58 "從": "从",
59 "時": "时",
60 "間": "间",
61 "葉": "叶",
62 "歲": "岁",
63 "聲": "声",
64 "邊": "边",
65 "歡": "欢",
66 "繼": "继",
67 "續": "续",
68 "難": "难",
69 "雙": "双",
70 "舊": "旧",
71 "離": "离",
72 }
73 )
74 15
75 _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]") 16 _TIMESTAMP_RE = re.compile(r"\[((?:\d{1,2}:)?\d{1,2}:\d{2}(?:[.:]\d{1,3})?)\]")
76 _BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]") 17 _BRACKET_RE = re.compile(r"[\[((【<《].{0,40}?[\]))】>》]")
@@ -212,7 +153,7 @@ def _split_inline_translation(line: str, timestamp: str | None, source_index: in
212 153
213 def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]: 154 def _entry_from_text(text: str, timestamp: str | None, source_index: int) -> list[_LineEntry]:
214 line = _BRACKET_RE.sub("", text) 155 line = _BRACKET_RE.sub("", text)
215 line = line.strip().lower().translate(_TRADITIONAL_TO_SIMPLIFIED) 156 line = _T2S_CONVERTER.convert(line.strip().lower())
216 if not line or _is_noise_line(line): 157 if not line or _is_noise_line(line):
217 return [] 158 return []
218 line = _strip_symbols(line) 159 line = _strip_symbols(line)
......
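The removed `_TRADITIONAL_TO_SIMPLIFIED` table covered only a few dozen characters, so any traditional character outside it passed through `str.translate` unchanged. The stdlib-only sketch below shows that gap with a three-entry excerpt of the old table. `convert_with_table` and the sample strings are made up for illustration.

```python
# Three entries copied from the removed table; the full table was similarly hand-picked.
_EXCERPT = str.maketrans({"愛": "爱", "與": "与", "夢": "梦"})


def convert_with_table(text: str) -> str:
    # Characters missing from the table are left as-is by str.translate.
    return text.translate(_EXCERPT)


print(convert_with_table("愛與夢"))  # every character is in the table
print(convert_with_table("龍的夢"))  # "龍" is not in the table and survives unconverted
```

The commit's replacement, `opencc.OpenCC("t2s.json").convert(...)`, relies on OpenCC's dictionaries instead of a hand-maintained mapping. It is not run here so that the sketch stays dependency-free.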
1 from .config import ServerConfig 1 from .config import ServerConfig
2 from .service import DedupService
3 2
4 __all__ = ["ServerConfig", "DedupService"] 3 __all__ = ["ServerConfig"]
......
@@ -4,14 +4,101 @@ from __future__ import annotations
4 4
5 import os 5 import os
6 from dataclasses import dataclass 6 from dataclasses import dataclass
7 from pathlib import Path
8
9
10 def _load_env_file() -> None:
11 """Load root .env values without overriding real environment variables."""
12 env_path = Path(__file__).resolve().parent.parent / ".env"
13 if not env_path.exists():
14 return
15 with env_path.open(encoding="utf-8") as file:
16 for raw_line in file:
17 line = raw_line.strip()
18 if not line or line.startswith("#") or "=" not in line:
19 continue
20 key, value = line.split("=", 1)
21 os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
22
23
24 _load_env_file()
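Review note on the loader above: it runs at import time, before the dataclass body evaluates its `os.getenv(...)` defaults, and `setdefault` keeps real environment variables ahead of `.env` values. The per-line parsing can be sketched in isolation as below. `parse_env_line` is a hypothetical helper, and a plain dict stands in for `os.environ`.

```python
from __future__ import annotations


def parse_env_line(raw_line: str) -> tuple[str, str] | None:
    """Parse one KEY=VALUE line the same way the loader does."""
    line = raw_line.strip()
    # Blank lines, comments, and lines without "=" are skipped.
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, value = line.split("=", 1)
    # Surrounding double or single quotes are stripped from the value.
    return key.strip(), value.strip().strip('"').strip("'")


# A plain dict stands in for os.environ to avoid side effects.
env = {"LYRIC_DEDUP_RECALL_LIMIT": "50"}
for raw in ['LYRIC_DEDUP_DSN="postgresql:///lyric_dedup"', "LYRIC_DEDUP_RECALL_LIMIT=200", "# comment"]:
    parsed = parse_env_line(raw)
    if parsed is not None:
        env.setdefault(*parsed)  # existing keys win over the file
print(env)
```

Because the defaults are read once when the module is imported, changing an environment variable afterwards does not affect `ServerConfig()` unless the value is passed explicitly.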
7 25
8 26
9 @dataclass 27 @dataclass
10 class ServerConfig: 28 class ServerConfig:
29 # PostgreSQL DSN used by the dedup service.
11 dsn: str = os.getenv("LYRIC_DEDUP_DSN", "postgresql:///lyric_dedup") 30 dsn: str = os.getenv("LYRIC_DEDUP_DSN", "postgresql:///lyric_dedup")
31
32 # Maximum ranked candidates returned in the final API result.
12 max_candidates: int = int(os.getenv("LYRIC_DEDUP_MAX_CANDIDATES", "5")) 33 max_candidates: int = int(os.getenv("LYRIC_DEDUP_MAX_CANDIDATES", "5"))
34
35 # Maximum candidates recalled from each PostgreSQL recall tier.
13 recall_limit: int = int(os.getenv("LYRIC_DEDUP_RECALL_LIMIT", "100")) 36 recall_limit: int = int(os.getenv("LYRIC_DEDUP_RECALL_LIMIT", "100"))
37
38 # Whether to use pg_trgm similarity recall in addition to exact hash and line hash recall.
14 enable_trgm: bool = os.getenv("LYRIC_DEDUP_ENABLE_TRGM", "false").lower() == "true" 39 enable_trgm: bool = os.getenv("LYRIC_DEDUP_ENABLE_TRGM", "false").lower() == "true"
40
41 # PostgreSQL pg_trgm recall threshold; lower values recall more candidates and cost more.
15 trgm_threshold: float = float(os.getenv("LYRIC_DEDUP_TRGM_THRESHOLD", "0.3")) 42 trgm_threshold: float = float(os.getenv("LYRIC_DEDUP_TRGM_THRESHOLD", "0.3"))
43
44 # PostgreSQL statement timeout for one dedup check, in milliseconds.
16 statement_timeout_ms: int = int(os.getenv("LYRIC_DEDUP_STATEMENT_TIMEOUT_MS", "5000")) 45 statement_timeout_ms: int = int(os.getenv("LYRIC_DEDUP_STATEMENT_TIMEOUT_MS", "5000"))
46
47 # HTTP download timeout for fetching lyric URLs, in seconds.
17 download_timeout: int = int(os.getenv("LYRIC_DEDUP_DOWNLOAD_TIMEOUT", "10")) 48 download_timeout: int = int(os.getenv("LYRIC_DEDUP_DOWNLOAD_TIMEOUT", "10"))
49
50 # Minimum primary n-gram Jaccard similarity required for automatic duplicate.
51 # Raising this makes automatic duplicate stricter; lowering it may increase false positives.
52 duplicate_jaccard_threshold: float = float(os.getenv("LYRIC_DEDUP_DUPLICATE_JACCARD_THRESHOLD", "0.78"))
53
54 # Minimum line coverage required for automatic duplicate.
55 # This is the main guard against treating partial lyric fragments as full duplicates.
56 duplicate_line_coverage_threshold: float = float(
57 os.getenv("LYRIC_DEDUP_DUPLICATE_LINE_COVERAGE_THRESHOLD", "0.72")
58 )
59
60 # Alternate automatic duplicate path: a Jaccard below duplicate_jaccard_threshold can still yield duplicate when line coverage is very high.
61 # Keep this aligned with duplicate_jaccard_threshold to avoid an unintended duplicate backdoor.
62 duplicate_high_coverage_jaccard_threshold: float = float(
63 os.getenv("LYRIC_DEDUP_DUPLICATE_HIGH_COVERAGE_JACCARD_THRESHOLD", "0.78")
64 )
65
66 # Line coverage required by the alternate high-coverage duplicate path.
67 # Raising this makes the alternate duplicate path stricter for near-complete variants.
68 duplicate_high_coverage_line_coverage_threshold: float = float(
69 os.getenv("LYRIC_DEDUP_DUPLICATE_HIGH_COVERAGE_LINE_COVERAGE_THRESHOLD", "0.90")
70 )
71
72 # Minimum primary/full n-gram Jaccard similarity that can send a candidate to review.
73 # Raising this reduces review volume; lowering it catches weaker suspicious overlaps.
74 review_jaccard_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_JACCARD_THRESHOLD", "0.45"))
75
76 # Minimum line coverage that can send a candidate to review when query coverage is also material.
77 # Raising this reduces fragment/short-overlap reviews; lowering it increases suspicious recall.
78 review_line_coverage_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_LINE_COVERAGE_THRESHOLD", "0.35"))
79
80 # Minimum share of query lines that must match before line coverage alone can trigger review.
81 # Raising this makes partial-fragment review stricter.
82 review_query_coverage_threshold: float = float(os.getenv("LYRIC_DEDUP_REVIEW_QUERY_COVERAGE_THRESHOLD", "0.40"))
83
84 # Very short query lyric line count that can force repeated-chorus overlap into review.
85 # Raising this catches more short chorus-like inputs; lowering it reduces review volume.
86 chorus_short_line_count_threshold: int = int(os.getenv("LYRIC_DEDUP_CHORUS_SHORT_LINE_COUNT_THRESHOLD", "6"))
87
88 # Minimum similarity/coverage signal for repeated-chorus overlap to be considered material.
89 # Raising this makes chorus-only review stricter.
90 chorus_material_overlap_threshold: float = float(os.getenv("LYRIC_DEDUP_CHORUS_MATERIAL_OVERLAP_THRESHOLD", "0.20"))
91
92 # Minimum query-side coverage for repeated-chorus overlap to be considered material.
93 # Raising this reduces review decisions caused by small shared chorus fragments.
94 chorus_material_query_coverage_threshold: float = float(
95 os.getenv("LYRIC_DEDUP_CHORUS_MATERIAL_QUERY_COVERAGE_THRESHOLD", "0.40")
96 )
97
98 # Weight assigned to primary n-gram Jaccard when computing confidence.
99 # This affects the reported confidence score, not the duplicate/review threshold checks directly.
100 confidence_jaccard_weight: float = float(os.getenv("LYRIC_DEDUP_CONFIDENCE_JACCARD_WEIGHT", "0.58"))
101
102 # Weight assigned to primary line coverage when computing confidence.
103 # Keep this coordinated with confidence_jaccard_weight; defaults sum to 1.0.
104 confidence_line_coverage_weight: float = float(os.getenv("LYRIC_DEDUP_CONFIDENCE_LINE_COVERAGE_WEIGHT", "0.42"))
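The threshold comments above describe a layered rule: two automatic-duplicate paths, two review paths, and a weighted confidence score. A minimal sketch of how these values might combine is given here. `Thresholds`, `decide`, and `confidence` are hypothetical names that do not appear in this diff, and the real `DuplicateChecker` also applies chorus and metadata-fallback rules, so its logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults mirror the config defaults shown in the diff.
    duplicate_jaccard: float = 0.78
    duplicate_line_coverage: float = 0.72
    high_coverage_jaccard: float = 0.78
    high_coverage_line_coverage: float = 0.90
    review_jaccard: float = 0.45
    review_line_coverage: float = 0.35
    review_query_coverage: float = 0.40
    confidence_jaccard_weight: float = 0.58
    confidence_line_coverage_weight: float = 0.42


def decide(jaccard: float, line_coverage: float, query_coverage: float,
           t: Thresholds = Thresholds()) -> str:
    """Hypothetical decision rule: returns 'duplicate', 'review', or 'new'."""
    # Primary automatic-duplicate path: both signals must be strong.
    if jaccard >= t.duplicate_jaccard and line_coverage >= t.duplicate_line_coverage:
        return "duplicate"
    # Alternate path: very high line coverage with its own Jaccard floor.
    if jaccard >= t.high_coverage_jaccard and line_coverage >= t.high_coverage_line_coverage:
        return "duplicate"
    # Review on text similarity alone.
    if jaccard >= t.review_jaccard:
        return "review"
    # Review on line coverage only when enough of the query itself matched.
    if line_coverage >= t.review_line_coverage and query_coverage >= t.review_query_coverage:
        return "review"
    return "new"


def confidence(jaccard: float, line_coverage: float, t: Thresholds = Thresholds()) -> float:
    """Weighted score for reporting; it does not feed the threshold checks."""
    return t.confidence_jaccard_weight * jaccard + t.confidence_line_coverage_weight * line_coverage
```

In this sketch, with the default values, any input that satisfies the alternate path also satisfies the primary path, so the alternate path only matters if its Jaccard floor is lowered below `duplicate_jaccard`. That is the "backdoor" the config comment warns about.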
......
...@@ -189,10 +189,25 @@ class DedupService: ...@@ -189,10 +189,25 @@ class DedupService:
189 candidates: list[LyricRecord], 189 candidates: list[LyricRecord],
190 ) -> CheckResult: 190 ) -> CheckResult:
191 """Run DuplicateChecker against recalled candidates.""" 191 """Run DuplicateChecker against recalled candidates."""
192 checker = DuplicateChecker() 192 checker = DuplicateChecker(
193 for candidate in candidates: 193 duplicate_jaccard_threshold=self.config.duplicate_jaccard_threshold,
194 checker.add_record(candidate) 194 duplicate_line_coverage_threshold=self.config.duplicate_line_coverage_threshold,
195 result = checker.check_record(record, max_candidates=self.config.max_candidates) 195 duplicate_high_coverage_jaccard_threshold=self.config.duplicate_high_coverage_jaccard_threshold,
196 duplicate_high_coverage_line_coverage_threshold=self.config.duplicate_high_coverage_line_coverage_threshold,
197 review_jaccard_threshold=self.config.review_jaccard_threshold,
198 review_line_coverage_threshold=self.config.review_line_coverage_threshold,
199 review_query_coverage_threshold=self.config.review_query_coverage_threshold,
200 chorus_short_line_count_threshold=self.config.chorus_short_line_count_threshold,
201 chorus_material_overlap_threshold=self.config.chorus_material_overlap_threshold,
202 chorus_material_query_coverage_threshold=self.config.chorus_material_query_coverage_threshold,
203 confidence_jaccard_weight=self.config.confidence_jaccard_weight,
204 confidence_line_coverage_weight=self.config.confidence_line_coverage_weight,
205 )
206 result = checker.check_record_against_candidates(
207 record,
208 candidates,
209 max_candidates=self.config.max_candidates,
210 )
196 return CheckResult( 211 return CheckResult(
197 duplicate=result.decision in (DuplicateDecision.DUPLICATE, DuplicateDecision.REVIEW), 212 duplicate=result.decision in (DuplicateDecision.DUPLICATE, DuplicateDecision.REVIEW),
198 decision=result.decision.value, 213 decision=result.decision.value,
......
...@@ -3,6 +3,7 @@ pytest>=8.0 ...@@ -3,6 +3,7 @@ pytest>=8.0
3 3
4 # PostgreSQL storage prototype 4 # PostgreSQL storage prototype
5 psycopg[binary]>=3.2 5 psycopg[binary]>=3.2
6 OpenCC>=1.3.1
6 7
7 # Existing MySQL/COS lyric download utilities 8 # Existing MySQL/COS lyric download utilities
8 pymysql>=1.1 9 pymysql>=1.1
......
...@@ -249,9 +249,7 @@ def _check_against_candidates( ...@@ -249,9 +249,7 @@ def _check_against_candidates(
249 max_candidates: int, 249 max_candidates: int,
250 ): 250 ):
251 checker = DuplicateChecker() 251 checker = DuplicateChecker()
252 for candidate in candidates: 252 return checker.check_record_against_candidates(record, candidates, max_candidates=max_candidates)
253 checker.add_record(candidate)
254 return checker.check_record(record, max_candidates=max_candidates)
255 253
256 254
257 def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[LyricRecord, str]: 255 def _record_from_eval_row(row: dict[str, str], *, csv_path: Path, base_dir: Path | None) -> tuple[LyricRecord, str]:
......
1 """Process newly added lyric library files.
2
3 This script is intended for the recurring workflow after adding files to
4 ``data/library``:
5
6 1. Move pure-music placeholder lyric files out of the active library.
7 2. Move duplicate lyric files out of the active library.
8 3. Rebuild the duplicate-checking index from retained files.
9 4. Optionally regenerate and evaluate a production-style eval set.
10 """
11
12 from __future__ import annotations
13
14 import argparse
15 import csv
16 import json
17 import shutil
18 import sys
19 from dataclasses import dataclass
20 from datetime import datetime
21 from pathlib import Path
22
23 PROJECT_ROOT = Path(__file__).resolve().parents[1]
24 if str(PROJECT_ROOT) not in sys.path:
25 sys.path.insert(0, str(PROJECT_ROOT))
26
27 from lyric_dedup.checker import DuplicateChecker
28 from lyric_dedup.checker import DuplicateDecision
29 from lyric_dedup.checker import LyricRecord
30 from lyric_dedup.cli import evaluate_csv
31 from lyric_dedup.eval_dataset import generate_eval_set
32 from lyric_dedup.file_import import iter_lyric_files
33 from lyric_dedup.file_import import read_lyric_file
34 from lyric_dedup.file_import import record_from_file
35 from lyric_dedup.normalization import NormalizedLyrics
36 from lyric_dedup.normalization import normalize_lyrics
37
38
39 PLACEHOLDER_MARKERS = (
40 "【曲库专用】",
41 "此歌曲为没有填词的纯音乐",
42 )
43
44
45 @dataclass(frozen=True)
46 class LibraryProfile:
47 path: Path
48 record: LyricRecord
49 normalized: NormalizedLyrics
50 line_count: int
51 char_count: int
52
53
54 def main() -> None:
55 parser = argparse.ArgumentParser(description="Process lyric library additions.")
56 parser.add_argument("--library-dir", default="data/library")
57 parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl")
58 parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders")
59 parser.add_argument("--duplicate-quarantine-dir", default="data/quarantine/duplicates")
60 parser.add_argument("--dry-run", action="store_true", help="Only report matches (placeholders and, unless dedup is skipped, duplicates); do not move files or write outputs.")
61 parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.")
62 parser.add_argument("--delete-duplicates", action="store_true", help="Delete duplicate lyric files instead of moving them.")
63 parser.add_argument("--skip-library-dedup", action="store_true", help="Skip internal duplicate cleanup before rebuilding the index.")
64 parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.")
65 parser.add_argument("--positive-ratio", type=float, default=0.2)
66 parser.add_argument("--eval-dir", default="data/generated_eval/incoming")
67 parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv")
68 parser.add_argument("--eval-out", default="outputs/results/library_eval.csv")
69 parser.add_argument("--report", default="outputs/results/library_process_report.json")
70 args = parser.parse_args()
71
72 library_dir = Path(args.library_dir)
73 quarantine_dir = Path(args.quarantine_dir)
74 duplicate_quarantine_dir = Path(args.duplicate_quarantine_dir)
75 report_path = Path(args.report)
76
77 files_before = iter_lyric_files(library_dir)
78 placeholders = _find_placeholder_files(library_dir)
79 duplicate_report_path = report_path.with_suffix(".duplicates.csv")
80
81 moved_or_deleted: list[str] = []
82 duplicate_actions: list[str] = []
83 duplicate_rows: list[dict[str, object]] = []
84 short_effective: dict[str, int]
85 retained_count = 0
86 if not args.dry_run:
87 moved_or_deleted = _handle_placeholders(
88 placeholders,
89 library_dir=library_dir,
90 quarantine_dir=quarantine_dir,
91 delete=args.delete_placeholders,
92 )
93 if args.skip_library_dedup:
94 profiles = _profile_library(library_dir)
95 short_effective = _effective_line_report_from_profiles(profiles)
96 retained_count = _build_index_from_profiles(profiles, Path(args.index))
97 else:
98 profiles = _profile_library(library_dir)
99 short_effective = _effective_line_report_from_profiles(profiles)
100 retained_count, duplicate_rows, duplicate_actions = _deduplicate_and_build_index(
101 profiles,
102 library_dir=library_dir,
103 index_path=Path(args.index),
104 duplicate_quarantine_dir=duplicate_quarantine_dir,
105 delete=args.delete_duplicates,
106 dry_run=False,
107 )
108 _write_duplicate_report(duplicate_rows, duplicate_report_path)
109
110 if args.eval_size > 0:
111 eval_index_path = Path(args.eval_csv).with_suffix(".index.pkl")
112 generate_eval_set(
113 library_dir=library_dir,
114 output_dir=Path(args.eval_dir),
115 csv_path=Path(args.eval_csv),
116 size=args.eval_size,
117 positive_ratio=args.positive_ratio,
118 index_path=Path(args.index),
119 eval_index_path=eval_index_path,
120 )
121 evaluate_csv(
122 eval_index_path,
123 Path(args.eval_csv),
124 Path(args.eval_out),
125 base_dir=Path(args.eval_csv).parent,
126 positive_decisions={"duplicate"},
127 max_candidates=5,
128 )
129 evaluate_csv(
130 eval_index_path,
131 Path(args.eval_csv),
132 Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"),
133 base_dir=Path(args.eval_csv).parent,
134 positive_decisions={"duplicate", "review"},
135 max_candidates=5,
136 )
137 else:
138 profiles = _profile_library(library_dir)
139 short_effective = _effective_line_report_from_profiles(profiles)
140 if not args.skip_library_dedup:
141 retained_count, duplicate_rows, duplicate_actions = _deduplicate_and_build_index(
142 profiles,
143 library_dir=library_dir,
144 index_path=Path(args.index),
145 duplicate_quarantine_dir=duplicate_quarantine_dir,
146 delete=args.delete_duplicates,
147 dry_run=True,
148 )
149 else:
150 retained_count = len(profiles)
151
152 report = {
153 "timestamp": datetime.now().isoformat(timespec="seconds"),
154 "dry_run": args.dry_run,
155 "library_dir": str(library_dir),
156 "files_before": len(files_before),
157 "placeholder_matches": len(placeholders),
158 "placeholder_files": [str(path) for path in placeholders],
159 "handled_placeholder_files": moved_or_deleted,
160 "library_dedup_skipped": args.skip_library_dedup,
161 "duplicate_matches": len(duplicate_rows),
162 "duplicate_report": str(duplicate_report_path) if duplicate_rows else "",
163 "handled_duplicate_files": duplicate_actions[:1000],
164 "handled_duplicate_files_truncated": len(duplicate_actions) > 1000,
165 "retained_index_records": retained_count,
166 "files_after": len(iter_lyric_files(library_dir)),
167 "index": str(args.index),
168 "eval_size": args.eval_size,
169 "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "",
170 "eval_out": str(args.eval_out) if args.eval_size > 0 else "",
171 "eval_index": str(Path(args.eval_csv).with_suffix(".index.pkl")) if args.eval_size > 0 else "",
172 "short_effective_line_counts": short_effective,
173 }
174
175 print(json.dumps(report, ensure_ascii=False, indent=2))
176 if not args.dry_run:
177 report_path.parent.mkdir(parents=True, exist_ok=True)
178 report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
179
180
181 def _find_placeholder_files(library_dir: Path) -> list[Path]:
182 matches: list[Path] = []
183 for path in iter_lyric_files(library_dir):
184 text = read_lyric_file(path)
185 if any(marker in text for marker in PLACEHOLDER_MARKERS):
186 matches.append(path)
187 return matches
188
189
190 def _handle_placeholders(
191 placeholders: list[Path],
192 *,
193 library_dir: Path,
194 quarantine_dir: Path,
195 delete: bool,
196 ) -> list[str]:
197 handled: list[str] = []
198 if not placeholders:
199 return handled
200 if not delete:
201 quarantine_dir.mkdir(parents=True, exist_ok=True)
202 for path in placeholders:
203 if delete:
204 path.unlink()
205 handled.append(f"deleted:{path}")
206 continue
207 relative = path.resolve().relative_to(library_dir.resolve())
208 destination = quarantine_dir / relative
209 destination.parent.mkdir(parents=True, exist_ok=True)
210 if destination.exists():
211 destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}")
212 shutil.move(str(path), str(destination))
213 handled.append(f"moved:{path}->{destination}")
214 return handled
215
216
217 def _profile_library(library_dir: Path) -> list[LibraryProfile]:
218 profiles: list[LibraryProfile] = []
219 files = iter_lyric_files(library_dir)
220 _progress(f"profile active library: 0/{len(files)}")
221 for index, path in enumerate(files, start=1):
222 record = record_from_file(path, base_dir=library_dir)
223 normalized = normalize_lyrics(record.lyrics)
224 lines = normalized.primary_lines or normalized.unique_lines
225 normalized_text = normalized.normalized_full_text
226 profiles.append(
227 LibraryProfile(
228 path=path,
229 record=record,
230 normalized=normalized,
231 line_count=len(lines),
232 char_count=len(normalized_text),
233 )
234 )
235 _progress_count("profile active library", index, len(files), step=5000)
236 return profiles
237
238
239 def _build_index_from_profiles(profiles: list[LibraryProfile], index_path: Path) -> int:
240 checker = DuplicateChecker()
241 for index, profile in enumerate(profiles, start=1):
242 checker.add_normalized_record(profile.record, profile.normalized)
243 _progress_count("build index", index, len(profiles), step=5000)
244 index_path.parent.mkdir(parents=True, exist_ok=True)
245 checker.save(index_path)
246 return checker.record_count
247
248
249 def _deduplicate_and_build_index(
250 profiles: list[LibraryProfile],
251 *,
252 library_dir: Path,
253 index_path: Path,
254 duplicate_quarantine_dir: Path,
255 delete: bool,
256 dry_run: bool,
257 ) -> tuple[int, list[dict[str, object]], list[str]]:
258 checker = DuplicateChecker()
259 duplicate_rows: list[dict[str, object]] = []
260 duplicate_actions: list[str] = []
261 ordered = sorted(profiles, key=_profile_quality_key)
262 _progress(f"deduplicate active library: 0/{len(ordered)}")
263 for index, profile in enumerate(ordered, start=1):
264 result = checker.check_record(profile.record, max_candidates=1)
265 best = result.candidates[0] if result.candidates else None
266 if result.decision == DuplicateDecision.DUPLICATE and best is not None:
267 duplicate_rows.append(
268 {
269 "duplicate_path": str(profile.path),
270 "duplicate_record_id": profile.record.record_id,
271 "kept_record_id": best.record_id,
272 "decision": result.decision.value,
273 "confidence": result.confidence,
274 "reason": result.reason,
275 "best_candidate_jaccard": best.jaccard,
276 "best_candidate_line_coverage": best.line_coverage,
277 "best_candidate_primary_jaccard": best.primary_jaccard,
278 "best_candidate_primary_line_coverage": best.primary_line_coverage,
279 "matched_unique_lines": " | ".join(best.matched_unique_lines),
280 "line_count": profile.line_count,
281 "char_count": profile.char_count,
282 }
283 )
284 if not dry_run:
285 duplicate_actions.append(
286 _handle_duplicate_file(
287 profile.path,
288 library_dir=library_dir,
289 duplicate_quarantine_dir=duplicate_quarantine_dir,
290 delete=delete,
291 )
292 )
293 else:
294 checker.add_normalized_record(profile.record, profile.normalized)
295 _progress_count("deduplicate active library", index, len(ordered), step=5000)
296
297 if not dry_run:
298 index_path.parent.mkdir(parents=True, exist_ok=True)
299 checker.save(index_path)
300 return checker.record_count, duplicate_rows, duplicate_actions
301
302
303 def _handle_duplicate_file(
304 path: Path,
305 *,
306 library_dir: Path,
307 duplicate_quarantine_dir: Path,
308 delete: bool,
309 ) -> str:
310 if delete:
311 path.unlink()
312 return f"deleted:{path}"
313 duplicate_quarantine_dir.mkdir(parents=True, exist_ok=True)
314 relative = path.resolve().relative_to(library_dir.resolve())
315 destination = duplicate_quarantine_dir / relative
316 destination.parent.mkdir(parents=True, exist_ok=True)
317 if destination.exists():
318 destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}")
319 shutil.move(str(path), str(destination))
320 return f"moved:{path}->{destination}"
321
322
323 def _profile_quality_key(profile: LibraryProfile) -> tuple[int, int, int, str]:
324 # Sort ascending; negative values make higher-quality records come first.
325 filename_quality = 0 if not profile.path.name.startswith("None_") else 1
326 return (filename_quality, -profile.line_count, -profile.char_count, str(profile.path))
327
328
329 def _write_duplicate_report(rows: list[dict[str, object]], report_path: Path) -> None:
330 if not rows:
331 return
332 report_path.parent.mkdir(parents=True, exist_ok=True)
333 with report_path.open("w", encoding="utf-8", newline="") as file:
334 writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()))
335 writer.writeheader()
336 writer.writerows(rows)
337
338
339 def _effective_line_report(library_dir: Path) -> dict[str, int]:
340 return _effective_line_report_from_profiles(_profile_library(library_dir))
341
342
343 def _effective_line_report_from_profiles(profiles: list[LibraryProfile]) -> dict[str, int]:
344 buckets = {
345 "total": 0,
346 "zero_effective_lines": 0,
347 "one_to_three_effective_lines": 0,
348 "four_to_five_effective_lines": 0,
349 "six_plus_effective_lines": 0,
350 }
351 for profile in profiles:
352 buckets["total"] += 1
353 line_count = profile.line_count
354 if line_count == 0:
355 buckets["zero_effective_lines"] += 1
356 elif line_count <= 3:
357 buckets["one_to_three_effective_lines"] += 1
358 elif line_count <= 5:
359 buckets["four_to_five_effective_lines"] += 1
360 else:
361 buckets["six_plus_effective_lines"] += 1
362 return buckets
363
364
365 def _progress(message: str) -> None:
366 print(f"[process-library] {message}", file=sys.stderr, flush=True)
367
368
369 def _progress_count(label: str, current: int, total: int, *, step: int = 1000) -> None:
370 if total <= 0:
371 return
372 if current == 1 or current == total or current % step == 0:
373 _progress(f"{label}: {current}/{total}")
374
375
376 if __name__ == "__main__":
377 main()
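Because the deduplication pass keeps whichever record it indexes first, the sort key in `_profile_quality_key` effectively decides which copy of a duplicate survives. The ordering can be illustrated in isolation. `Profile` below is a hypothetical stand-in for the script's `LibraryProfile`, and the file names are invented for the example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Profile:
    # Hypothetical, trimmed-down stand-in for LibraryProfile.
    path: Path
    line_count: int
    char_count: int


def quality_key(profile: Profile) -> tuple[int, int, int, str]:
    # Ascending sort: files not prefixed "None_" first, then more lines,
    # then more characters, then path as a deterministic tie-breaker.
    filename_quality = 1 if profile.path.name.startswith("None_") else 0
    return (filename_quality, -profile.line_count, -profile.char_count, str(profile.path))


profiles = [
    Profile(Path("None_song.lrc"), line_count=40, char_count=900),
    Profile(Path("a_short.lrc"), line_count=10, char_count=200),
    Profile(Path("b_long.lrc"), line_count=30, char_count=700),
]
ordered_names = [p.path.name for p in sorted(profiles, key=quality_key)]
print(ordered_names)
```

The `None_`-prefixed file sorts last even though it has the most lines, so a properly named copy of the same lyric would be indexed first and kept.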
1 # Lyric Dedup Sample Set
2
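3 Baseline lyric: `test_api/test_lyric.txt`
4
5 These samples check two kinds of behavior in the current dedup system:
6
7 - `positive_*`: should be judged a duplicate or near-duplicate of the baseline lyric.
8 - `negative_*`: should not be judged a duplicate; used to check for false positives when theme, keywords, or style are similar.
9
10 ## Sample descriptions
11
12 | File | Expected | What it tests |
13 | --- | --- | --- |
14 | `positive_01_format_spacing_punctuation_duplicate.txt` | Dedup hit | Same-text variant with title/separator lines removed, blank lines changed, and punctuation weakened |
15 | `positive_02_minor_wording_typos_duplicate.txt` | Dedup hit | Near-duplicate with a few typos, synonyms, and minor word-order changes |
16 | `positive_03_section_order_shift_duplicate.txt` | Dedup hit | Section order changed, but the core text overlaps heavily |
17 | `positive_04_partial_core_chorus_duplicate.txt` | Dedup hit | Partial-duplicate detection when only the core chorus/climax fragment is submitted |
18 | `negative_01_same_theme_new_lyrics_not_duplicate.txt` | Should not hit | Same motifs (early morning, Chang'an, snow, chasing dreams), but every line is original |
19 | `negative_02_same_keywords_different_scene_not_duplicate.txt` | Should not hit | Reuses high-frequency keywords, but the narrative scene and syntax are clearly different |
20 | `negative_03_style_similar_low_overlap_not_duplicate.txt` | Should not hit | Similar blend of Chinese-style, rap, and urban elements, but low text overlap |
21 | `negative_04_common_hook_phrases_not_duplicate.txt` | Should not hit | Contains only common phrases and imagery; guards against false positives from shared expressions in short texts |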
22
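The `positive_*` / `negative_*` naming convention makes the expected outcome recoverable from the filename alone, which could drive a parametrized check. A small sketch follows, assuming a hypothetical helper `expected_is_duplicate` that is not part of this repository.

```python
def expected_is_duplicate(filename: str) -> bool:
    """Derive the expected outcome from the sample-set naming convention."""
    if filename.startswith("positive_"):
        return True
    if filename.startswith("negative_"):
        return False
    # Anything else is not part of the sample set described above.
    raise ValueError(f"not a sample-set file: {filename}")


samples = [
    "positive_01_format_spacing_punctuation_duplicate.txt",
    "negative_04_common_hook_phrases_not_duplicate.txt",
]
expectations = {name: expected_is_duplicate(name) for name in samples}
```

A test could then compare each expectation with the decision returned for that sample against the baseline lyric. Whether `review` should count as a hit is a policy choice that the table leaves open.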
...@@ -4,7 +4,6 @@ import json ...@@ -4,7 +4,6 @@ import json
4 from lyric_dedup import DuplicateChecker 4 from lyric_dedup import DuplicateChecker
5 from lyric_dedup import DuplicateDecision 5 from lyric_dedup import DuplicateDecision
6 from lyric_dedup import LyricRecord 6 from lyric_dedup import LyricRecord
7 from lyric_dedup.cli import evaluate_csv
8 from lyric_dedup.eval_dataset import generate_eval_set 7 from lyric_dedup.eval_dataset import generate_eval_set
9 from lyric_dedup.file_import import record_from_file 8 from lyric_dedup.file_import import record_from_file
10 from lyric_dedup.normalization import normalize_lyrics 9 from lyric_dedup.normalization import normalize_lyrics
...@@ -22,6 +21,14 @@ BASE_LYRIC = """ ...@@ -22,6 +21,14 @@ BASE_LYRIC = """
22 """ 21 """
23 22
24 23
24 def check_against(candidates: list[LyricRecord], lyrics: str, *, max_candidates: int = 10):
25 return DuplicateChecker().check_record_against_candidates(
26 LyricRecord("__query__", lyrics),
27 candidates,
28 max_candidates=max_candidates,
29 )
30
31
25 def test_normalization_removes_lyric_noise_and_simplifies() -> None: 32 def test_normalization_removes_lyric_noise_and_simplifies() -> None:
26 normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n") 33 normalized = normalize_lyrics("[00:01.20]我愛你!\nQQ音乐 www.example.com\n(副歌)\n聽風說話\n")
27 34
...@@ -31,10 +38,8 @@ def test_normalization_removes_lyric_noise_and_simplifies() -> None: ...@@ -31,10 +38,8 @@ def test_normalization_removes_lyric_noise_and_simplifies() -> None:
31 38
32 39
33 def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None: 40 def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_counts() -> None:
34 checker = DuplicateChecker() 41 result = check_against(
35 checker.add_record(LyricRecord("song-1", BASE_LYRIC)) 42 [LyricRecord("song-1", BASE_LYRIC)],
36
37 result = checker.check(
38 """ 43 """
39 我愛你,在每個夜裡!!! 44 我愛你,在每個夜裡!!!
40 聽風說話,也聽見你 45 聽風說話,也聽見你
...@@ -51,21 +56,19 @@ def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_c ...@@ -51,21 +56,19 @@ def test_exact_duplicate_handles_timestamps_punctuation_traditional_and_chorus_c
51 56
52 57
53 def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: 58 def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None:
54 checker = DuplicateChecker() 59 result = check_against(
55 checker.add_record( 60 [
56 LyricRecord( 61 LyricRecord(
57 "song-1", 62 "song-1",
58 """ 63 """
59 海边的风吹过旧信 64 海边的风吹过旧信
60 你说夏天不会远去 65 你说夏天不会远去
61 啦啦啦 我们不分离 66 啦啦啦 我们不分离
62 啦啦啦 我们不分离 67 啦啦啦 我们不分离
63 转身以后各自旅行 68 转身以后各自旅行
64 """, 69 """,
65 ) 70 )
66 ) 71 ],
67
68 result = checker.check(
69 """ 72 """
70 山谷的雨落在清晨 73 山谷的雨落在清晨
71 我把名字交给星辰 74 我把名字交给星辰
...@@ -79,11 +82,9 @@ def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None: ...@@ -79,11 +82,9 @@ def test_short_shared_repeated_chorus_is_review_not_duplicate() -> None:
79 assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重" 82 assert result.candidates[0].reason == "重合内容主要集中在重复副歌行,不自动判重"
80 83
81 84
82 def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: 85 def test_substantial_line_overlap_is_duplicate_after_pg_recall() -> None:
83 checker = DuplicateChecker() 86 result = check_against(
84 checker.add_record(LyricRecord("song-1", BASE_LYRIC)) 87 [LyricRecord("song-1", BASE_LYRIC)],
85
86 result = checker.check(
87 """ 88 """
88 我爱你在每个夜里 89 我爱你在每个夜里
89 听风说话也听见你 90 听风说话也听见你
...@@ -100,10 +101,8 @@ def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None: ...@@ -100,10 +101,8 @@ def test_substantial_line_overlap_is_duplicate_after_lsh_recall() -> None:
100 101
101 102
102 def test_fragment_of_full_song_is_not_duplicate() -> None: 103 def test_fragment_of_full_song_is_not_duplicate() -> None:
103 checker = DuplicateChecker() 104 result = check_against(
104 checker.add_record(LyricRecord("song-1", BASE_LYRIC)) 105 [LyricRecord("song-1", BASE_LYRIC)],
105
106 result = checker.check(
107 """ 106 """
108 听风说话也听见你 107 听风说话也听见你
109 城市的灯慢慢亮起 108 城市的灯慢慢亮起
...@@ -116,45 +115,39 @@ def test_fragment_of_full_song_is_not_duplicate() -> None: ...@@ -116,45 +115,39 @@ def test_fragment_of_full_song_is_not_duplicate() -> None:
116 115
117 116
118 def test_catalog_mashup_fragments_are_new_not_review() -> None: 117 def test_catalog_mashup_fragments_are_new_not_review() -> None:
119 checker = DuplicateChecker() 118 result = check_against(
120 checker.add_record( 119 [
121 LyricRecord( 120 LyricRecord(
122 "song-1", 121 "song-1",
123 """ 122 """
124 第一首歌的清晨 123 第一首歌的清晨
125 第一首歌的街口 124 第一首歌的街口
126 每天都在伪装幸福快乐 125 每天都在伪装幸福快乐
127 还要瞒着所有人不说 126 还要瞒着所有人不说
128 第一首歌的结尾 127 第一首歌的结尾
129 """, 128 """,
130 ) 129 ),
131 ) 130 LyricRecord(
132 checker.add_record( 131 "song-2",
133 LyricRecord( 132 """
134 "song-2", 133 第二首歌的海边
135 """ 134 第二首歌的远方
136 第二首歌的海边 135 想起那年夏天
137 第二首歌的远方 136 我们走过人群
138 想起那年夏天 137 第二首歌的结尾
139 我们走过人群 138 """,
140 第二首歌的结尾 139 ),
141 """, 140 LyricRecord(
142 ) 141 "song-3",
143 ) 142 """
144 checker.add_record( 143 第三首歌的月光
145 LyricRecord( 144 第三首歌的旧梦
146 "song-3", 145 风吹过了窗前
147 """ 146 你没有再回来
148 第三首歌的月光 147 第三首歌的结尾
149 第三首歌的旧梦 148 """,
150 风吹过了窗前 149 ),
151 你没有再回来 150 ],
152 第三首歌的结尾
153 """,
154 )
155 )
156
157 result = checker.check(
158 """ 151 """
159 每天都在伪装幸福快乐 152 每天都在伪装幸福快乐
160 还要瞒着所有人不说 153 还要瞒着所有人不说
...@@ -169,28 +162,26 @@ def test_catalog_mashup_fragments_are_new_not_review() -> None: ...@@ -169,28 +162,26 @@ def test_catalog_mashup_fragments_are_new_not_review() -> None:
169 162
170 163
171 def test_large_mashup_with_one_recognizable_song_fragment_is_new() -> None: 164 def test_large_mashup_with_one_recognizable_song_fragment_is_new() -> None:
172 checker = DuplicateChecker() 165 result = check_against(
173 checker.add_record( 166 [
174 LyricRecord( 167 LyricRecord(
175 "song-1", 168 "song-1",
176 """ 169 """
177 桃花春风十里 170 桃花春风十里
178 花瓣飘散满地 171 花瓣飘散满地
179 对不起我无法忘记你 172 对不起我无法忘记你
180 一去遥遥无期 173 一去遥遥无期
181 一个人一支笔 174 一个人一支笔
182 多想你能留在我这里 175 多想你能留在我这里
183 天空下起了雨 176 天空下起了雨
184 淋湿我的心里 177 淋湿我的心里
185 久别中多少人都不是你 178 久别中多少人都不是你
186 屋檐下一人想起 179 屋檐下一人想起
187 关于你的回忆 180 关于你的回忆
188 无人在只剩下我自己 181 无人在只剩下我自己
189 """, 182 """,
190 ) 183 )
191 ) 184 ],
192
193 result = checker.check(
194 """ 185 """
195 scroll through the pictures from a year ago 186 scroll through the pictures from a year ago
196 the pixels change but the feelings dont grow 187 the pixels change but the feelings dont grow
...@@ -238,15 +229,13 @@ def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision( ...@@ -238,15 +229,13 @@ def test_no_effective_lyrics_use_metadata_fallback_without_empty_hash_collision(
238 混音:DJ金木 229 混音:DJ金木
239 【未经著作权人许可 不得翻唱 翻录或使用】 230 【未经著作权人许可 不得翻唱 翻录或使用】
240 """ 231 """
241 checker = DuplicateChecker() 232 same_song = DuplicateChecker().check_record_against_candidates(
242 checker.add_record(LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木")) 233 LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木"),
243 checker.add_record(LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木")) 234 [LyricRecord("song-1", placeholder, title="Amnesia(House)", artist="DJ金木")],
244
245 same_song = checker.check_record(
246 LyricRecord("__query__", placeholder, title="Amnesia(House)", artist="DJ金木")
247 ) 235 )
248 different_title = checker.check_record( 236 different_title = DuplicateChecker().check_record_against_candidates(
249 LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木") 237 LyricRecord("__query__", placeholder, title="Different Song", artist="DJ金木"),
238 [LyricRecord("song-2", placeholder, title="Angel(纯音乐)", artist="DJ金木")],
250 ) 239 )
251 240
252 assert same_song.decision == DuplicateDecision.DUPLICATE 241 assert same_song.decision == DuplicateDecision.DUPLICATE
...@@ -269,29 +258,27 @@ def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> No ...@@ -269,29 +258,27 @@ def test_no_effective_lyrics_metadata_fallback_ignores_placeholder_noise() -> No
269 [00:04.00]作曲:DJ金木... 258 [00:04.00]作曲:DJ金木...
270 [00:05.00]未经著作权人许可 不得翻唱 259 [00:05.00]未经著作权人许可 不得翻唱
271 """ 260 """
272 checker = DuplicateChecker() 261 result = DuplicateChecker().check_record_against_candidates(
273 checker.add_record(LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木")) 262 LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木"),
274 263 [LyricRecord("song-1", source, title="Amnesia(House)", artist="DJ金木")],
275 result = checker.check_record(LyricRecord("__query__", noisy, title="Amnesia(House)", artist="DJ金木")) 264 )
276 265
277 assert result.decision == DuplicateDecision.DUPLICATE 266 assert result.decision == DuplicateDecision.DUPLICATE
278 assert result.reason == "无有效歌词,文件内容兜底特征高度相似" 267 assert result.reason == "无有效歌词,文件内容兜底特征高度相似"
279 268
280 269
281 def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: 270 def test_unrelated_lyrics_with_shared_watermark_are_new() -> None:
282 checker = DuplicateChecker() 271 result = check_against(
283 checker.add_record( 272 [
284 LyricRecord( 273 LyricRecord(
285 "song-1", 274 "song-1",
286 """ 275 """
287 歌词来自QQ音乐 276 歌词来自QQ音乐
288 北方的雪落在窗前 277 北方的雪落在窗前
289 我等一封迟来的信 278 我等一封迟来的信
290 """, 279 """,
291 ) 280 )
292 ) 281 ],
293
294 result = checker.check(
295 """ 282 """
296 歌词来自QQ音乐 283 歌词来自QQ音乐
297 南方的雨穿过街心 284 南方的雨穿过街心
...@@ -300,24 +287,22 @@ def test_unrelated_lyrics_with_shared_watermark_are_new() -> None: ...@@ -300,24 +287,22 @@ def test_unrelated_lyrics_with_shared_watermark_are_new() -> None:
300 ) 287 )
301 288
302 assert result.decision == DuplicateDecision.NEW 289 assert result.decision == DuplicateDecision.NEW
303 assert result.candidates == () 290 assert result.candidates[0].decision == DuplicateDecision.NEW
304 291
305 292
306 def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: 293 def test_mixed_chinese_english_tokenization_recalls_candidate() -> None:
307 checker = DuplicateChecker() 294 result = check_against(
308 checker.add_record( 295 [
309 LyricRecord( 296 LyricRecord(
310 "song-1", 297 "song-1",
311 """ 298 """
312 say hello 在风里 299 say hello 在风里
313 hold me close tonight 300 hold me close tonight
314 我们穿过蓝色街道 301 我们穿过蓝色街道
315 never let me go 302 never let me go
316 """, 303 """,
317 ) 304 )
318 ) 305 ],
319
320 result = checker.check(
321 """ 306 """
322 say hello 在风里 307 say hello 在风里
323 hold me close tonight 308 hold me close tonight
...@@ -329,17 +314,14 @@ def test_mixed_chinese_english_tokenization_recalls_candidate() -> None: ...@@ -329,17 +314,14 @@ def test_mixed_chinese_english_tokenization_recalls_candidate() -> None:
329 assert result.decision == DuplicateDecision.DUPLICATE 314 assert result.decision == DuplicateDecision.DUPLICATE
330 315
331 316
332 def test_checker_can_persist_index(tmp_path) -> None: 317 def test_checker_can_rank_explicit_pg_recalled_candidates_without_in_memory_recall() -> None:
333 index_path = tmp_path / "lyrics.pkl" 318 result = DuplicateChecker().check_record_against_candidates(
334 checker = DuplicateChecker() 319 LyricRecord("__query__", BASE_LYRIC),
335 checker.add_record(LyricRecord("song-1", BASE_LYRIC)) 320 candidates=[],
336 checker.save(index_path) 321 )
337
338 loaded = DuplicateChecker.load(index_path)
339 result = loaded.check(BASE_LYRIC)
340 322
341 assert loaded.record_count == 1 323 assert result.decision == DuplicateDecision.NEW
342 assert result.decision == DuplicateDecision.DUPLICATE 324 assert result.candidates == ()
343 325
344 326
345 def test_record_from_lrc_file(tmp_path) -> None: 327 def test_record_from_lrc_file(tmp_path) -> None:
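
The hunks above replace the in-memory index tests with tests that score only explicitly supplied candidates: an empty candidate list yields `NEW` with no candidates, while an unrelated candidate that was passed in is still scored and reported with its own `NEW` decision. The following is a minimal, self-contained sketch of that contract, not the project's implementation. `Record`, `Scored`, `Result`, `line_overlap`, and both thresholds are hypothetical stand-ins; the real `DuplicateChecker.check_record_against_candidates` uses its own normalization and scoring, which this diff does not show.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Decision(Enum):
    DUPLICATE = "duplicate"
    REVIEW = "review"
    NEW = "new"


@dataclass(frozen=True)
class Record:  # hypothetical stand-in for LyricRecord
    record_id: str
    text: str


@dataclass(frozen=True)
class Scored:  # hypothetical per-candidate result
    record_id: str
    score: float
    decision: Decision


@dataclass(frozen=True)
class Result:  # hypothetical overall result
    decision: Decision
    candidates: Tuple[Scored, ...]


def line_overlap(query: str, candidate: str) -> float:
    """Jaccard overlap of stripped, non-empty lines (a stand-in score)."""
    a = {line.strip() for line in query.splitlines() if line.strip()}
    b = {line.strip() for line in candidate.splitlines() if line.strip()}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def check_record_against_candidates(
    query: Record,
    candidates: Sequence[Record],
    duplicate_at: float = 0.8,  # assumed threshold, for illustration only
    review_at: float = 0.4,     # assumed threshold, for illustration only
) -> Result:
    """Score only the candidates handed in; no index lookup happens here."""
    scored = []
    for cand in candidates:
        score = line_overlap(query.text, cand.text)
        if score >= duplicate_at:
            decision = Decision.DUPLICATE
        elif score >= review_at:
            decision = Decision.REVIEW
        else:
            decision = Decision.NEW
        scored.append(Scored(cand.record_id, score, decision))
    # Best candidate first; with no candidates the query is NEW by definition.
    scored.sort(key=lambda item: item.score, reverse=True)
    overall = scored[0].decision if scored else Decision.NEW
    return Result(overall, tuple(scored))
```

In this sketch the caller (in the real project, the PostgreSQL recall layer) owns candidate selection, so the scoring function can be tested with plain lists and no database.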
@@ -363,44 +345,6 @@ def test_record_from_song_artist_lyrics_filename(tmp_path) -> None:
363 assert record.artist == "DJ金木" 345 assert record.artist == "DJ金木"
364 346
365 347
366 def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None:
367 library = tmp_path / "library"
368 incoming = tmp_path / "incoming"
369 library.mkdir()
370 incoming.mkdir()
371 (library / "歌手A - 夜里.lrc").write_text(BASE_LYRIC, encoding="utf-8")
372 (incoming / "dup.lrc").write_text(BASE_LYRIC.replace("我爱你", "我愛你"), encoding="utf-8")
373 (incoming / "new.txt").write_text("南方的雨穿过街心\n你把故事说给云听\n", encoding="utf-8")
374
375 checker = DuplicateChecker()
376 checker.add_record(record_from_file(library / "歌手A - 夜里.lrc", base_dir=library))
377 index_path = tmp_path / "lyrics.pkl"
378 checker.save(index_path)
379
380 eval_csv = tmp_path / "eval.csv"
381 eval_csv.write_text(
382 "id,file,expected\n"
383 "case-1,incoming/dup.lrc,应去重\n"
384 "case-2,incoming/new.txt,不应去重\n",
385 encoding="utf-8",
386 )
387 out_path = tmp_path / "eval_out.csv"
388
389 evaluate_csv(
390 index_path,
391 eval_csv,
392 out_path,
393 base_dir=tmp_path,
394 positive_decisions={"duplicate"},
395 max_candidates=5,
396 )
397
398 rows = list(csv.DictReader(out_path.open(encoding="utf-8")))
399 assert [row["correct"] for row in rows] == ["True", "True"]
400 assert rows[0]["reason"] == "规范化后的原文歌词哈希完全一致"
401 assert (tmp_path / "eval_out.csv.summary.json").exists()
402
403
404 def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: 348 def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None:
405 library = tmp_path / "library" 349 library = tmp_path / "library"
406 incoming = tmp_path / "generated" / "incoming" 350 incoming = tmp_path / "generated" / "incoming"
@@ -424,7 +368,7 @@ def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None:
424 assert manifest["sample_size"] == 30 368 assert manifest["sample_size"] == 30
425 assert manifest["unique_source_records"] > 1 369 assert manifest["unique_source_records"] > 1
426 assert manifest["holdout_records"] > 1 370 assert manifest["holdout_records"] > 1
427 assert (tmp_path / "generated" / "eval.csv.index.pkl").exists() 371 assert manifest["eval_index"] == ""
428 assert "positive_full_duplicate" in manifest["plan"] 372 assert "positive_full_duplicate" in manifest["plan"]
429 assert "negative_real_holdout_full_song" in negative_types 373 assert "negative_real_holdout_full_song" in negative_types
430 assert "negative_fragment" in negative_types 374 assert "negative_fragment" in negative_types
@@ -466,19 +410,17 @@ def test_generated_hard_eval_set_uses_business_realistic_edge_mix(tmp_path) -> N
466 410
467 411
468 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: 412 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None:
469 checker = DuplicateChecker() 413 result = check_against(
470 checker.add_record( 414 [
471 LyricRecord( 415 LyricRecord(
472 "song-1", 416 "song-1",
473 """ 417 """
474 I miss you tonight 418 I miss you tonight
475 Under the moonlight 419 Under the moonlight
476 Never let me go 420 Never let me go
477 """, 421 """,
478 ) 422 )
479 ) 423 ],
480
481 result = checker.check(
482 """ 424 """
483 I miss you tonight 425 I miss you tonight
484 今晚我想你 426 今晚我想你
@@ -509,22 +451,20 @@ def test_same_timestamp_translation_split_is_high_confidence() -> None:
509 451
510 452
511 def test_translation_only_overlap_is_review_not_duplicate() -> None: 453 def test_translation_only_overlap_is_review_not_duplicate() -> None:
512 checker = DuplicateChecker() 454 result = check_against(
513 checker.add_record( 455 [
514 LyricRecord( 456 LyricRecord(
515 "song-1", 457 "song-1",
516 """ 458 """
517 I miss you tonight 459 I miss you tonight
518 今晚我想你 460 今晚我想你
519 Under the moonlight 461 Under the moonlight
520 月光之下 462 月光之下
521 Never let me go 463 Never let me go
522 永远不要让我离开 464 永远不要让我离开
523 """, 465 """,
524 ) 466 )
525 ) 467 ],
526
527 result = checker.check(
528 """ 468 """
529 Te extrano esta noche 469 Te extrano esta noche
530 今晚我想你 470 今晚我想你
@@ -541,19 +481,17 @@ def test_translation_only_overlap_is_review_not_duplicate() -> None:
541 481
542 482
543 def test_block_translation_split_is_review_when_primary_matches() -> None: 483 def test_block_translation_split_is_review_when_primary_matches() -> None:
544 checker = DuplicateChecker() 484 result = check_against(
545 checker.add_record( 485 [
546 LyricRecord( 486 LyricRecord(
547 "song-1", 487 "song-1",
548 """ 488 """
549 I miss you tonight 489 I miss you tonight
550 Under the moonlight 490 Under the moonlight
551 Never let me go 491 Never let me go
552 """, 492 """,
553 ) 493 )
554 ) 494 ],
555
556 result = checker.check(
557 """ 495 """
558 I miss you tonight 496 I miss you tonight
559 Under the moonlight 497 Under the moonlight
......