Commit f8ad329c f8ad329cb556651f2762949f4906fb6200501f89 by 沈秋雨

Update the test-set generation workflow for large sample sizes

1 parent 51ddab43
...@@ -78,16 +78,20 @@ Key columns to check in the CSV: ...@@ -78,16 +78,20 @@ Key columns to check in the CSV:
78 python -m lyric_dedup.cli generate-eval-set \ 78 python -m lyric_dedup.cli generate-eval-set \
79 --library-dir data/library \ 79 --library-dir data/library \
80 --lyrics-dir data/generated_eval/incoming \ 80 --lyrics-dir data/generated_eval/incoming \
81 --csv data/generated_eval/eval_10.csv \ 81 --csv data/generated_eval/eval_50000.csv \
82 --size 10 \ 82 --index outputs/indexes/lyrics.pkl \
83 --positive-ratio 0.6 83 --size 50000 \
84 --positive-ratio 0.3
84 ``` 85 ```
85 86
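86 The generator's business rules: 87 The generator's business rules:
87 88
88 - `应去重` (should dedup) samples only contain style variations of the full lyrics, e.g. timestamps, punctuation, platform noise, blank lines, LRC formatting, appended Chinese translation. 89 - The whole library is scanned first, then sampled with stratification by effective lyric line count, language type, and file-source prefix; samples are no longer taken from a sorted prefix.
89 - `不应去重` (should not dedup) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity. 90 - `应去重` (should dedup) samples only contain style variations of the full lyrics, e.g. timestamps, punctuation, platform noise, blank lines, changed chorus repeat counts, appended Chinese translation.
91 - `不应去重` (should not dedup) samples include new lyrics on the same theme, hard negatives, lyric fragments, repeated-chorus collisions, translation-only similarity, and short-lyric/placeholder edge samples.
90 - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review` 92 - A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`
93 - If `--index` is passed, the generator uses the existing index to build hard negatives that are closer to the online recall risk.
94 - A `*.manifest.json` is also written, recording the seed, library size, sample-type distribution, language/source buckets, and the number of source records covered.
91 95
92 First prepare a CSV, e.g. `data/eval/eval.csv` 96 First prepare a CSV, e.g. `data/eval/eval.csv`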
93 97
......
...@@ -67,10 +67,10 @@ python scripts/process_library.py \ ...@@ -67,10 +67,10 @@ python scripts/process_library.py \
67 python scripts/process_library.py \ 67 python scripts/process_library.py \
68 --library-dir data/library \ 68 --library-dir data/library \
69 --index outputs/indexes/library_lyrics.pkl \ 69 --index outputs/indexes/library_lyrics.pkl \
70 --eval-size 1180 \ 70 --eval-size 50000 \
71 --positive-ratio 0.2 \ 71 --positive-ratio 0.3 \
72 --eval-csv data/generated_eval/eval_1180.csv \ 72 --eval-csv data/generated_eval/eval_50000.csv \
73 --eval-out outputs/results/library_eval_1180.csv 73 --eval-out outputs/results/library_eval_50000.csv
74 ``` 74 ```
75 75
76 Quarantined files are moved by default to: 76 Quarantined files are moved by default to:
...@@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl ...@@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl
95 95
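96 Note: if you modify `data/library`, or change the preprocessing/dedup logic, rerun this step. 96 Note: if you modify `data/library`, or change the preprocessing/dedup logic, rerun this step.
97 97
98 ## 3. Generate 100 test samples 98 ## 3. Generate production evaluation samples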
99 99
100 ```bash 100 ```bash
101 python -m lyric_dedup.cli generate-eval-set \ 101 python -m lyric_dedup.cli generate-eval-set \
102 --library-dir data/library \ 102 --library-dir data/library \
103 --lyrics-dir data/generated_eval/incoming \ 103 --lyrics-dir data/generated_eval/incoming \
104 --csv data/generated_eval/eval_500.csv \ 104 --csv data/generated_eval/eval_50000.csv \
105 --size 500 \ 105 --index outputs/indexes/library_lyrics.pkl \
106 --positive-ratio 0.2 106 --size 50000 \
107 --positive-ratio 0.3
107 ``` 108 ```
108 109
109 默认生 110 Default production evaluation criteria
110 111
111 ```text 112 ```text
112 应去重: 60 113 应去重: 30%
113 不应去重: 40 114 不应去重: 70%
114 ``` 115 ```
115 116
116 The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. 117 The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples.
...@@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \ ...@@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \
118 Business rules: 119 Business rules:
119 120
120 ```text 121 ```text
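121 pos_* = 应去重 (should dedup): style variations of the full lyrics 122 positive_* = 应去重 (should dedup): style variations of the full lyrics
122 neg_* = 不应去重 (should not dedup): fragments / short-phrase collisions / mixed fragments / new lyrics on the same theme / translation-only similarity 123 negative_random_unrelated = 不应去重 (should not dedup): new lyrics on the same theme
124 negative_hard_candidate = 不应去重 (should not dedup): short-phrase / partial-overlap samples that the system tends to recall
125 negative_fragment = 不应去重 (should not dedup): fragment of a single song
126 negative_shared_chorus = 不应去重 (should not dedup): repeated-chorus collision
127 negative_translation_only = 不应去重 (should not dedup): translation-only similarity
128 edge_short_or_placeholder = 不应去重 (should not dedup): short-lyric / placeholder edge samples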
129 ```
130
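131 The generator scans the whole library and samples with stratification by effective lyric line count, language type, and file-source prefix. When `--index` is passed, it uses the existing index to generate hard negatives. Each run also writes: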
132
133 ```text
134 data/generated_eval/eval_50000.csv.manifest.json
135 ```
136
137 Key fields to check in the manifest:
138
139 ```text
140 library_files number of lyric files in the library
141 sample_type_counts number of samples per sample type
142 line_count_bucket_counts / language_bucket_counts / source_bucket_counts
143 unique_source_records number of real source files covered by this evaluation
123 ``` 144 ```
124 145
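Reviewer note: a small sketch of how the manifest fields listed above might be checked after a run. It is not part of this commit; `positive_share` and `counts_by_family` are hypothetical helper names, and the `example` dict is invented for illustration. It assumes the manifest carries `expected_counts` and `sample_type_counts` as written by `_write_manifest`. Positive rows are stored under variant names such as `positive_timestamped`, so the sketch groups them by prefix before any comparison against a quota.

```python
def positive_share(manifest: dict) -> float:
    """Fraction of samples labeled 应去重 (should dedup)."""
    counts = manifest["expected_counts"]
    total = sum(counts.values())
    return counts.get("应去重", 0) / total if total else 0.0


def counts_by_family(manifest: dict) -> dict:
    """Collapse positive_* variants into one 'positive' family."""
    families: dict = {}
    for sample_type, count in manifest["sample_type_counts"].items():
        family = "positive" if sample_type.startswith("positive_") else sample_type
        families[family] = families.get(family, 0) + count
    return families


# Illustrative manifest fragment (made-up numbers, not real output).
example = {
    "expected_counts": {"应去重": 3, "不应去重": 7},
    "sample_type_counts": {
        "positive_exact_copy": 2,
        "positive_timestamped": 1,
        "negative_fragment": 3,
        "negative_hard_candidate": 4,
    },
}

print(positive_share(example))
print(counts_by_family(example))
```

In practice the dict would be loaded with `json.loads` from the `*.manifest.json` file next to the CSV.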
125 ## 4. Strict evaluation: only count duplicate as dedup 146 ## 4. Strict evaluation: only count duplicate as dedup
...@@ -127,9 +148,9 @@ neg_* = 不应去重,片段/短句碰撞/混合片段/同主题新歌词/仅 ...@@ -127,9 +148,9 @@ neg_* = 不应去重,片段/短句碰撞/混合片段/同主题新歌词/仅
127 ```bash 148 ```bash
128 python -m lyric_dedup.cli evaluate-csv \ 149 python -m lyric_dedup.cli evaluate-csv \
129 --index outputs/indexes/library_lyrics.pkl \ 150 --index outputs/indexes/library_lyrics.pkl \
130 --csv data/generated_eval/eval_500.csv \ 151 --csv data/generated_eval/eval_50000.csv \
131 --base-dir data/generated_eval \ 152 --base-dir data/generated_eval \
132 --out outputs/results/library_eval_500.csv 153 --out outputs/results/library_eval_50000.csv
133 ``` 154 ```
134 155
135 Under these criteria: 156 Under these criteria:
...@@ -151,10 +172,10 @@ false_positive ...@@ -151,10 +172,10 @@ false_positive
151 ```bash 172 ```bash
152 python -m lyric_dedup.cli evaluate-csv \ 173 python -m lyric_dedup.cli evaluate-csv \
153 --index outputs/indexes/library_lyrics.pkl \ 174 --index outputs/indexes/library_lyrics.pkl \
154 --csv data/generated_eval/eval_500.csv \ 175 --csv data/generated_eval/eval_50000.csv \
155 --base-dir data/generated_eval \ 176 --base-dir data/generated_eval \
156 --positive-decisions duplicate,review \ 177 --positive-decisions duplicate,review \
157 --out outputs/results/library_eval_500_review_positive.csv 178 --out outputs/results/library_eval_50000_review_positive.csv
158 ``` 179 ```
159 180
160 Under these criteria: 181 Under these criteria:
......
...@@ -48,8 +48,9 @@ def main() -> None: ...@@ -48,8 +48,9 @@ def main() -> None:
48 generate.add_argument("--lyrics-dir", required=True) 48 generate.add_argument("--lyrics-dir", required=True)
49 generate.add_argument("--csv", required=True) 49 generate.add_argument("--csv", required=True)
50 generate.add_argument("--size", type=int, default=100) 50 generate.add_argument("--size", type=int, default=100)
51 generate.add_argument("--positive-ratio", type=float, default=0.6) 51 generate.add_argument("--positive-ratio", type=float, default=0.3)
52 generate.add_argument("--seed", type=int, default=20260602) 52 generate.add_argument("--seed", type=int, default=20260602)
53 generate.add_argument("--index", default="", help="optional existing index for hard-negative generation")
53 54
54 args = parser.parse_args() 55 args = parser.parse_args()
55 if args.command == "build-index": 56 if args.command == "build-index":
...@@ -75,6 +76,7 @@ def main() -> None: ...@@ -75,6 +76,7 @@ def main() -> None:
75 size=args.size, 76 size=args.size,
76 positive_ratio=args.positive_ratio, 77 positive_ratio=args.positive_ratio,
77 seed=args.seed, 78 seed=args.seed,
79 index_path=Path(args.index) if args.index else None,
78 ) 80 )
79 print(json.dumps(summary, ensure_ascii=False)) 81 print(json.dumps(summary, ensure_ascii=False))
80 82
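Reviewer note: the new `--index` flag defaults to an empty string and is turned into `None` before it reaches the generator. The sketch below isolates that pattern with plain `argparse`; `parse_generate_args` and `resolve_index_path` are hypothetical names, and only three of the real options are reproduced.

```python
import argparse
from pathlib import Path
from typing import Optional


def parse_generate_args(argv: list) -> argparse.Namespace:
    """Parse a reduced copy of the generate-eval-set options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--size", type=int, default=100)
    parser.add_argument("--positive-ratio", type=float, default=0.3)
    parser.add_argument("--index", default="")
    return parser.parse_args(argv)


def resolve_index_path(raw: str) -> Optional[Path]:
    """An empty string means 'no index'; anything else becomes a Path."""
    return Path(raw) if raw else None


args = parse_generate_args(["--size", "50000", "--index", "outputs/indexes/library_lyrics.pkl"])
print(args.size, args.positive_ratio, resolve_index_path(args.index))
```

Using `""` rather than `None` as the default keeps the flag optional while the truthiness check does the conversion in one place.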
......
1 """Generate labeled evaluation samples from an existing lyric library.""" 1 """Generate production-style labeled evaluation samples from a lyric library."""
2 2
3 from __future__ import annotations 3 from __future__ import annotations
4 4
5 import csv 5 import csv
6 import hashlib
7 import json
6 import random 8 import random
7 import re 9 import re
10 from collections import Counter
8 from dataclasses import dataclass 11 from dataclasses import dataclass
9 from pathlib import Path 12 from pathlib import Path
10 13
14 from lyric_dedup.checker import DuplicateChecker
15 from lyric_dedup.checker import DuplicateDecision
11 from lyric_dedup.file_import import iter_lyric_files 16 from lyric_dedup.file_import import iter_lyric_files
12 from lyric_dedup.file_import import read_lyric_file 17 from lyric_dedup.file_import import read_lyric_file
13 from lyric_dedup.file_import import record_from_file 18 from lyric_dedup.file_import import record_from_file
19 from lyric_dedup.normalization import NormalizedLyrics
20 from lyric_dedup.normalization import fingerprint_text
14 from lyric_dedup.normalization import normalize_lyrics 21 from lyric_dedup.normalization import normalize_lyrics
15 22
16 23
24 DEFAULT_SAMPLE_MIX = {
25 "positive_full_duplicate": 0.30,
26 "negative_random_unrelated": 0.20,
27 "negative_hard_candidate": 0.25,
28 "negative_fragment": 0.10,
29 "negative_shared_chorus": 0.05,
30 "negative_translation_only": 0.05,
31 "edge_short_or_placeholder": 0.05,
32 }
33
34
35 @dataclass(frozen=True)
36 class LyricProfile:
37 path: Path
38 record_id: str
39 title: str
40 artist: str
41 normalized: NormalizedLyrics
42 line_count: int
43 char_count: int
44 line_count_bucket: str
45 language_bucket: str
46 source_bucket: str
47 normalized_hash: str
48 has_translation: bool
49
50
17 @dataclass(frozen=True) 51 @dataclass(frozen=True)
18 class GeneratedSample: 52 class GeneratedSample:
19 sample_id: str 53 sample_id: str
...@@ -21,8 +55,14 @@ class GeneratedSample: ...@@ -21,8 +55,14 @@ class GeneratedSample:
21 expected: str 55 expected: str
22 sample_type: str 56 sample_type: str
23 source: str 57 source: str
58 source_record_id: str = ""
59 candidate_record_id: str = ""
60 line_count_bucket: str = ""
61 language_bucket: str = ""
62 source_bucket: str = ""
24 title: str = "" 63 title: str = ""
25 artist: str = "" 64 artist: str = ""
65 notes: str = ""
26 66
27 67
28 def generate_eval_set( 68 def generate_eval_set(
...@@ -31,104 +71,555 @@ def generate_eval_set( ...@@ -31,104 +71,555 @@ def generate_eval_set(
31 output_dir: Path, 71 output_dir: Path,
32 csv_path: Path, 72 csv_path: Path,
33 size: int = 100, 73 size: int = 100,
34 positive_ratio: float = 0.6, 74 positive_ratio: float = 0.30,
35 seed: int = 20260602, 75 seed: int = 20260602,
76 index_path: Path | None = None,
36 ) -> dict[str, object]: 77 ) -> dict[str, object]:
78 """Generate a stratified production evaluation set.
79
80 ``positive_ratio`` is kept for CLI compatibility. It overrides the default
81 positive quota while keeping the remaining negative categories proportional.
82 """
83 if size <= 0:
84 raise ValueError("size must be positive")
85
37 rng = random.Random(seed) 86 rng = random.Random(seed)
38 source_files = iter_lyric_files(library_dir) 87 profiles = profile_library(library_dir)
39 if not source_files: 88 if not profiles:
40 raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件") 89 raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件")
41 90
42 output_dir.mkdir(parents=True, exist_ok=True) 91 output_dir.mkdir(parents=True, exist_ok=True)
43 csv_path.parent.mkdir(parents=True, exist_ok=True) 92 csv_path.parent.mkdir(parents=True, exist_ok=True)
44 _clean_generated_output_dir(output_dir) 93 _clean_generated_output_dir(output_dir)
45 94
46 positives = round(size * positive_ratio) 95 checker = DuplicateChecker.load(index_path) if index_path else None
47 negatives = size - positives 96 plan = _sample_plan(size, positive_ratio=positive_ratio)
97 groups = _profile_groups(profiles)
48 samples: list[GeneratedSample] = [] 98 samples: list[GeneratedSample] = []
49 for index in range(positives):
50 source = source_files[index % len(source_files)]
51 samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng))
52 for index in range(negatives):
53 left = source_files[index % len(source_files)]
54 right = source_files[(index + 1) % len(source_files)]
55 samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng))
56 99
57 rng.shuffle(samples) 100 samples.extend(
58 with csv_path.open("w", encoding="utf-8", newline="") as file: 101 _build_positive_samples(
59 writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"]) 102 _stratified_sample(groups["normal"], plan["positive_full_duplicate"], rng),
60 writer.writeheader() 103 output_dir,
61 writer.writerows( 104 csv_path.parent,
62 { 105 rng,
63 "id": sample.sample_id, 106 start_index=len(samples) + 1,
64 "file": sample.file,
65 "expected": sample.expected,
66 "sample_type": sample.sample_type,
67 "source": sample.source,
68 "title": sample.title,
69 "artist": sample.artist,
70 }
71 for sample in samples
72 ) 107 )
108 )
109 samples.extend(
110 _build_random_unrelated_samples(
111 plan["negative_random_unrelated"],
112 output_dir,
113 csv_path.parent,
114 rng,
115 start_index=len(samples) + 1,
116 )
117 )
118 samples.extend(
119 _build_hard_candidate_samples(
120 groups["normal"],
121 plan["negative_hard_candidate"],
122 output_dir,
123 csv_path.parent,
124 rng,
125 checker=checker,
126 start_index=len(samples) + 1,
127 )
128 )
129 samples.extend(
130 _build_fragment_samples(
131 _stratified_sample(groups["fragmentable"], plan["negative_fragment"], rng),
132 output_dir,
133 csv_path.parent,
134 rng,
135 start_index=len(samples) + 1,
136 )
137 )
138 samples.extend(
139 _build_shared_chorus_samples(
140 _stratified_sample(groups["normal"], plan["negative_shared_chorus"], rng),
141 output_dir,
142 csv_path.parent,
143 rng,
144 start_index=len(samples) + 1,
145 )
146 )
147 samples.extend(
148 _build_translation_only_samples(
149 _stratified_sample(groups["foreign"], plan["negative_translation_only"], rng),
150 output_dir,
151 csv_path.parent,
152 rng,
153 start_index=len(samples) + 1,
154 )
155 )
156 samples.extend(
157 _build_edge_samples(
158 _stratified_sample(groups["edge"], plan["edge_short_or_placeholder"], rng),
159 output_dir,
160 csv_path.parent,
161 rng,
162 start_index=len(samples) + 1,
163 )
164 )
73 165
166 if len(samples) < size:
167 samples.extend(
168 _build_random_unrelated_samples(
169 size - len(samples),
170 output_dir,
171 csv_path.parent,
172 rng,
173 start_index=len(samples) + 1,
174 )
175 )
176 samples = samples[:size]
177 rng.shuffle(samples)
178
179 _write_csv(samples, csv_path, seed=seed)
180 manifest = _write_manifest(
181 profiles=profiles,
182 samples=samples,
183 csv_path=csv_path,
184 output_dir=output_dir,
185 seed=seed,
186 plan=plan,
187 index_path=index_path,
188 )
189 return manifest
190
191
192 def profile_library(library_dir: Path) -> list[LyricProfile]:
193 profiles: list[LyricProfile] = []
194 for path in iter_lyric_files(library_dir):
195 record = record_from_file(path, base_dir=library_dir)
196 normalized = normalize_lyrics(record.lyrics)
197 lines = normalized.primary_lines or normalized.unique_lines
198 line_count = len(lines)
199 normalized_text = fingerprint_text(normalized) or normalized.normalized_full_text
200 source_bucket = _source_bucket(path)
201 profiles.append(
202 LyricProfile(
203 path=path,
204 record_id=record.record_id,
205 title=record.title or "",
206 artist=record.artist or "",
207 normalized=normalized,
208 line_count=line_count,
209 char_count=len(normalized_text),
210 line_count_bucket=_line_count_bucket(line_count),
211 language_bucket=_language_bucket(lines),
212 source_bucket=source_bucket,
213 normalized_hash=hashlib.sha256(normalized_text.encode("utf-8")).hexdigest(),
214 has_translation=bool(normalized.translation_lines),
215 )
216 )
217 return profiles
218
219
220 def _sample_plan(size: int, *, positive_ratio: float) -> dict[str, int]:
221 positive_ratio = max(0.0, min(1.0, positive_ratio))
222 mix = dict(DEFAULT_SAMPLE_MIX)
223 negative_total = sum(value for key, value in mix.items() if key != "positive_full_duplicate")
224 mix["positive_full_duplicate"] = positive_ratio
225 for key in list(mix):
226 if key != "positive_full_duplicate":
227 mix[key] = (1.0 - positive_ratio) * (DEFAULT_SAMPLE_MIX[key] / negative_total)
228
229 plan = {key: int(size * value) for key, value in mix.items()}
230 remainder = size - sum(plan.values())
231 for key in sorted(mix, key=mix.get, reverse=True):
232 if remainder <= 0:
233 break
234 plan[key] += 1
235 remainder -= 1
236 return plan
237
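Reviewer note: a standalone sketch of the quota arithmetic in `_sample_plan`, reproduced so it can be run without the package. The function name `sample_plan` is a stand-in; the mix values are copied from `DEFAULT_SAMPLE_MIX` in this diff. It shows how `positive_ratio` replaces the positive share while the negative categories keep their relative proportions.

```python
DEFAULT_SAMPLE_MIX = {
    "positive_full_duplicate": 0.30,
    "negative_random_unrelated": 0.20,
    "negative_hard_candidate": 0.25,
    "negative_fragment": 0.10,
    "negative_shared_chorus": 0.05,
    "negative_translation_only": 0.05,
    "edge_short_or_placeholder": 0.05,
}
POSITIVE_KEY = "positive_full_duplicate"


def sample_plan(size: int, positive_ratio: float) -> dict:
    # Clamp the ratio so out-of-range CLI values cannot produce negative quotas.
    positive_ratio = max(0.0, min(1.0, positive_ratio))
    negative_total = sum(v for k, v in DEFAULT_SAMPLE_MIX.items() if k != POSITIVE_KEY)
    mix = {
        key: positive_ratio
        if key == POSITIVE_KEY
        else (1.0 - positive_ratio) * (value / negative_total)
        for key, value in DEFAULT_SAMPLE_MIX.items()
    }
    # Floor every quota, then give the leftover units to the largest shares.
    plan = {key: int(size * share) for key, share in mix.items()}
    remainder = size - sum(plan.values())
    for key in sorted(mix, key=mix.get, reverse=True):
        if remainder <= 0:
            break
        plan[key] += 1
        remainder -= 1
    return plan


print(sample_plan(50000, 0.3))
```

Flooring and then redistributing keeps the total equal to `size`. Individual quotas can differ by one from the nominal percentage because of floating-point rounding, so exact per-category counts should not be assumed.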
238
239 def _profile_groups(profiles: list[LyricProfile]) -> dict[str, list[LyricProfile]]:
240 normal = [profile for profile in profiles if profile.line_count >= 6]
241 edge = [profile for profile in profiles if profile.line_count <= 5]
74 return { 242 return {
75 "size": size, 243 "normal": normal or profiles,
76 "positive": positives, 244 "fragmentable": [profile for profile in profiles if profile.line_count >= 12] or normal or profiles,
77 "negative": negatives, 245 "foreign": [
78 "library_files": len(source_files), 246 profile
79 "lyrics_dir": str(output_dir), 247 for profile in profiles
80 "csv": str(csv_path), 248 if profile.language_bucket in {"latin", "mixed", "jp_kr"} and profile.line_count >= 4
249 ]
250 or normal
251 or profiles,
252 "edge": edge or normal or profiles,
81 } 253 }
82 254
83 255
84 def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: 256 def _stratified_sample(profiles: list[LyricProfile], count: int, rng: random.Random) -> list[LyricProfile]:
85 raw = read_lyric_file(source) 257 if count <= 0 or not profiles:
86 source_record = record_from_file(source) 258 return []
87 variants = [ 259 buckets: dict[tuple[str, str, str], list[LyricProfile]] = {}
88 ("exact_copy", raw), 260 for profile in profiles:
89 ("timestamped", _add_timestamps(_content_lines(raw))), 261 key = (profile.line_count_bucket, profile.language_bucket, profile.source_bucket)
90 ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)), 262 buckets.setdefault(key, []).append(profile)
91 ("with_platform_noise", _with_platform_noise(_content_lines(raw))), 263
92 ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))), 264 selected: list[LyricProfile] = []
93 ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))), 265 bucket_keys = list(buckets)
94 ("translation_added", _translation_added(_content_lines(raw))), 266 rng.shuffle(bucket_keys)
95 ] 267 cursors = {key: rng.sample(items, len(items)) for key, items in buckets.items()}
96 sample_type, text = variants[(index - 1) % len(variants)] 268 while len(selected) < count and bucket_keys:
97 name = f"pos_{index:03d}_{sample_type}.txt" 269 progressed = False
98 path = output_dir / name 270 for key in list(bucket_keys):
99 path.write_text(text, encoding="utf-8") 271 if len(selected) >= count:
272 break
273 items = cursors[key]
274 if not items:
275 bucket_keys.remove(key)
276 continue
277 selected.append(items.pop())
278 progressed = True
279 if not progressed:
280 break
281 while len(selected) < count:
282 selected.append(rng.choice(profiles))
283 return selected
284
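Reviewer note: `_stratified_sample` is a round-robin draw over buckets with a with-replacement top-up at the end. The sketch below mirrors that shape on plain tuples; `round_robin_sample` and the `(bucket, n)` items are illustrative stand-ins, not the project's types.

```python
import random


def round_robin_sample(items: list, count: int, rng: random.Random, key) -> list:
    if count <= 0 or not items:
        return []
    buckets: dict = {}
    for item in items:
        buckets.setdefault(key(item), []).append(item)

    order = list(buckets)
    rng.shuffle(order)
    # Each bucket gets its own shuffled pool that is consumed by pop().
    pools = {k: rng.sample(v, len(v)) for k, v in buckets.items()}

    selected: list = []
    while len(selected) < count and order:
        for k in list(order):
            if len(selected) >= count:
                break
            if not pools[k]:
                order.remove(k)  # bucket exhausted
                continue
            selected.append(pools[k].pop())
    # Top-up with replacement once every bucket is empty.
    while len(selected) < count:
        selected.append(rng.choice(items))
    return selected


items = [("a", i) for i in range(10)] + [("b", 0)] + [("c", 0), ("c", 1)]
picked = round_robin_sample(items, 3, random.Random(1), key=lambda item: item[0])
print(sorted(bucket for bucket, _ in picked))
```

One consequence worth keeping in mind: when `count` exceeds the number of available profiles, the top-up step repeats source records, so `unique_source_records` in the manifest can be smaller than the sample count.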
285
286 def _build_positive_samples(
287 profiles: list[LyricProfile],
288 output_dir: Path,
289 csv_base: Path,
290 rng: random.Random,
291 *,
292 start_index: int,
293 ) -> list[GeneratedSample]:
294 samples: list[GeneratedSample] = []
295 for offset, profile in enumerate(profiles):
296 raw = read_lyric_file(profile.path)
297 lines = _content_lines(raw)
298 variants = [
299 ("positive_exact_copy", raw),
300 ("positive_timestamped", _add_timestamps(lines)),
301 ("positive_punctuation_noise", _add_punctuation_noise(lines, rng)),
302 ("positive_platform_noise", _with_platform_noise(lines)),
303 ("positive_blank_line_noise", _add_blank_line_noise(lines)),
304 ("positive_chorus_count_changed", _change_repeated_line_counts(lines)),
305 ("positive_translation_added", _translation_added(lines)),
306 ]
307 sample_type, text = variants[offset % len(variants)]
308 index = start_index + offset
309 path = _write_sample_file(output_dir, f"pos_{index:05d}_{sample_type}.txt", text)
310 samples.append(_sample_from_profile(index, path, csv_base, "应去重", sample_type, profile))
311 return samples
312
313
314 def _build_random_unrelated_samples(
315 count: int,
316 output_dir: Path,
317 csv_base: Path,
318 rng: random.Random,
319 *,
320 start_index: int,
321 ) -> list[GeneratedSample]:
322 samples: list[GeneratedSample] = []
323 for offset in range(count):
324 index = start_index + offset
325 text = _same_theme_synthetic(index, rng)
326 path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_random_unrelated.txt", text)
327 samples.append(
328 GeneratedSample(
329 sample_id=f"sample-{index:05d}",
330 file=str(path.relative_to(csv_base)),
331 expected="不应去重",
332 sample_type="negative_random_unrelated",
333 source="synthetic",
334 notes="same-theme synthetic full lyric not copied from library",
335 )
336 )
337 return samples
338
339
340 def _build_hard_candidate_samples(
341 profiles: list[LyricProfile],
342 count: int,
343 output_dir: Path,
344 csv_base: Path,
345 rng: random.Random,
346 *,
347 checker: DuplicateChecker | None,
348 start_index: int,
349 ) -> list[GeneratedSample]:
350 if count <= 0:
351 return []
352 sources = _stratified_sample(profiles, count * 3, rng)
353 samples: list[GeneratedSample] = []
354 for profile in sources:
355 if len(samples) >= count:
356 break
357 lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines)
358 text = _short_shared_snippet(lines, rng)
359 candidate_id = ""
360 if checker is not None:
361 result = checker.check(text, max_candidates=5)
362 candidate = next(
363 (
364 item
365 for item in result.candidates
366 if item.record_id != profile.record_id and item.decision != DuplicateDecision.NEW
367 ),
368 result.candidates[0] if result.candidates else None,
369 )
370 candidate_id = candidate.record_id if candidate else ""
371 index = start_index + len(samples)
372 path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_hard_candidate.txt", text)
373 samples.append(
374 _sample_from_profile(
375 index,
376 path,
377 csv_base,
378 "不应去重",
379 "negative_hard_candidate",
380 profile,
381 candidate_record_id=candidate_id,
382 notes="shares a few real lines plus new filler; should not auto duplicate",
383 )
384 )
385 return samples
386
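Reviewer note: the candidate lookup in `_build_hard_candidate_samples` relies on `next(generator, default)`. A reduced sketch of that selection follows; `Candidate`, `pick_confusable`, and the string `"new"` are hypothetical stand-ins for the checker's result items and `DuplicateDecision.NEW`, whose real definitions are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    record_id: str
    decision: str


def pick_confusable(candidates: list, source_record_id: str) -> Optional[Candidate]:
    """Prefer a non-source candidate that was not judged 'new'."""
    fallback = candidates[0] if candidates else None
    return next(
        (
            item
            for item in candidates
            if item.record_id != source_record_id and item.decision != "new"
        ),
        fallback,
    )


cands = [
    Candidate("src", "duplicate"),
    Candidate("other", "review"),
    Candidate("third", "new"),
]
print(pick_confusable(cands, "src"))
```

With this shape the fallback can be the source record itself when no other candidate qualifies, so `candidate_record_id` equal to `source_record_id` is a possible value in the CSV.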
387
388 def _build_fragment_samples(
389 profiles: list[LyricProfile],
390 output_dir: Path,
391 csv_base: Path,
392 rng: random.Random,
393 *,
394 start_index: int,
395 ) -> list[GeneratedSample]:
396 samples: list[GeneratedSample] = []
397 for offset, profile in enumerate(profiles):
398 lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines)
399 text = _single_song_fragment(lines, rng)
400 index = start_index + offset
401 path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_fragment.txt", text)
402 samples.append(
403 _sample_from_profile(
404 index,
405 path,
406 csv_base,
407 "不应去重",
408 "negative_fragment",
409 profile,
410 notes="partial lyric fragment only",
411 )
412 )
413 return samples
414
415
416 def _build_shared_chorus_samples(
417 profiles: list[LyricProfile],
418 output_dir: Path,
419 csv_base: Path,
420 rng: random.Random,
421 *,
422 start_index: int,
423 ) -> list[GeneratedSample]:
424 samples: list[GeneratedSample] = []
425 for offset, profile in enumerate(profiles):
426 lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines)
427 repeated = _repeated_or_sampled_lines(profile.normalized, rng)
428 text = "\n".join(
429 [
430 "清晨的光落在新的街口",
431 "我把故事重新写给以后",
432 *repeated,
433 *repeated,
434 "所有答案都从这里开始",
435 ]
436 )
437 index = start_index + offset
438 path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_shared_chorus.txt", text)
439 samples.append(
440 _sample_from_profile(
441 index,
442 path,
443 csv_base,
444 "不应去重",
445 "negative_shared_chorus",
446 profile,
447 notes="shared repeated lines with new surrounding content",
448 )
449 )
450 return samples
451
452
453 def _build_translation_only_samples(
454 profiles: list[LyricProfile],
455 output_dir: Path,
456 csv_base: Path,
457 rng: random.Random,
458 *,
459 start_index: int,
460 ) -> list[GeneratedSample]:
461 samples: list[GeneratedSample] = []
462 for offset, profile in enumerate(profiles):
463 lines = list(profile.normalized.translation_lines) or [
464 _pseudo_translation(idx) for idx in range(1, min(8, max(profile.line_count, 4)) + 1)
465 ]
466 rng.shuffle(lines)
467 text = "\n".join(lines[:8])
468 index = start_index + offset
469 path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_translation_only.txt", text)
470 samples.append(
471 _sample_from_profile(
472 index,
473 path,
474 csv_base,
475 "不应去重",
476 "negative_translation_only",
477 profile,
478 notes="translation-like text without matching original lyric",
479 )
480 )
481 return samples
482
483
484 def _build_edge_samples(
485 profiles: list[LyricProfile],
486 output_dir: Path,
487 csv_base: Path,
488 rng: random.Random,
489 *,
490 start_index: int,
491 ) -> list[GeneratedSample]:
492 samples: list[GeneratedSample] = []
493 for offset, profile in enumerate(profiles):
494 lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines)
495 if profile.line_count <= 1:
496 text = _same_theme_synthetic(start_index + offset, rng)
497 notes = "zero or one effective line; use synthetic edge negative"
498 else:
499 text = _short_shared_snippet(lines, rng)
500 notes = "short lyric edge case with limited overlap"
501 index = start_index + offset
502 path = _write_sample_file(output_dir, f"neg_{index:05d}_edge_short_or_placeholder.txt", text)
503 samples.append(
504 _sample_from_profile(
505 index,
506 path,
507 csv_base,
508 "不应去重",
509 "edge_short_or_placeholder",
510 profile,
511 notes=notes,
512 )
513 )
514 return samples
515
516
517 def _sample_from_profile(
518 index: int,
519 path: Path,
520 csv_base: Path,
521 expected: str,
522 sample_type: str,
523 profile: LyricProfile,
524 *,
525 candidate_record_id: str = "",
526 notes: str = "",
527 ) -> GeneratedSample:
100 return GeneratedSample( 528 return GeneratedSample(
101 sample_id=f"pos-{index:03d}", 529 sample_id=f"sample-{index:05d}",
102 file=str(path.relative_to(csv_base)), 530 file=str(path.relative_to(csv_base)),
103 expected="应去重", 531 expected=expected,
104 sample_type=sample_type, 532 sample_type=sample_type,
105 source=str(source), 533 source=str(profile.path),
106 title=source_record.title or "", 534 source_record_id=profile.record_id,
107 artist=source_record.artist or "", 535 candidate_record_id=candidate_record_id,
536 line_count_bucket=profile.line_count_bucket,
537 language_bucket=profile.language_bucket,
538 source_bucket=profile.source_bucket,
539 title=profile.title,
540 artist=profile.artist,
541 notes=notes,
108 ) 542 )
109 543
110 544
111 def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: 545 def _write_sample_file(output_dir: Path, name: str, text: str) -> Path:
112 left_lines = _normalized_lines(left)
113 right_lines = _normalized_lines(right)
114 variants = [
115 ("single_song_fragment", _single_song_fragment(left_lines)),
116 ("short_shared_snippet", _short_shared_snippet(left_lines, rng)),
117 ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)),
118 ("same_theme_synthetic", _same_theme_synthetic(index)),
119 ("translation_only_like", _translation_only_like(left_lines)),
120 ]
121 sample_type, text = variants[(index - 1) % len(variants)]
122 name = f"neg_{index:03d}_{sample_type}.txt"
123 path = output_dir / name 546 path = output_dir / name
124 path.write_text(text, encoding="utf-8") 547 path.write_text(text.strip() + "\n", encoding="utf-8")
125 return GeneratedSample( 548 return path
126 sample_id=f"neg-{index:03d}", 549
127 file=str(path.relative_to(csv_base)), 550
128 expected="不应去重", 551 def _write_csv(samples: list[GeneratedSample], csv_path: Path, *, seed: int) -> None:
129 sample_type=sample_type, 552 fieldnames = [
130 source=f"{left} | {right}", 553 "id",
554 "file",
555 "expected",
556 "sample_type",
557 "source",
558 "source_record_id",
559 "candidate_record_id",
560 "line_count_bucket",
561 "language_bucket",
562 "source_bucket",
563 "title",
564 "artist",
565 "seed",
566 "notes",
567 ]
568 with csv_path.open("w", encoding="utf-8", newline="") as file:
569 writer = csv.DictWriter(file, fieldnames=fieldnames)
570 writer.writeheader()
571 for sample in samples:
572 writer.writerow(
573 {
574 "id": sample.sample_id,
575 "file": sample.file,
576 "expected": sample.expected,
577 "sample_type": sample.sample_type,
578 "source": sample.source,
579 "source_record_id": sample.source_record_id,
580 "candidate_record_id": sample.candidate_record_id,
581 "line_count_bucket": sample.line_count_bucket,
582 "language_bucket": sample.language_bucket,
583 "source_bucket": sample.source_bucket,
584 "title": sample.title,
585 "artist": sample.artist,
586 "seed": seed,
587 "notes": sample.notes,
588 }
589 )
590
591
592 def _write_manifest(
593 *,
594 profiles: list[LyricProfile],
595 samples: list[GeneratedSample],
596 csv_path: Path,
597 output_dir: Path,
598 seed: int,
599 plan: dict[str, int],
600 index_path: Path | None,
601 ) -> dict[str, object]:
602 manifest = {
603 "seed": seed,
604 "library_files": len(profiles),
605 "sample_size": len(samples),
606 "plan": plan,
607 "index": str(index_path) if index_path else "",
608 "lyrics_dir": str(output_dir),
609 "csv": str(csv_path),
610 "manifest": str(csv_path.with_suffix(csv_path.suffix + ".manifest.json")),
611 "sample_type_counts": dict(Counter(sample.sample_type for sample in samples)),
612 "expected_counts": dict(Counter(sample.expected for sample in samples)),
613 "line_count_bucket_counts": dict(Counter(profile.line_count_bucket for profile in profiles)),
614 "language_bucket_counts": dict(Counter(profile.language_bucket for profile in profiles)),
615 "source_bucket_counts": dict(Counter(profile.source_bucket for profile in profiles).most_common(50)),
616 "unique_source_records": len({sample.source_record_id for sample in samples if sample.source_record_id}),
617 }
618 csv_path.with_suffix(csv_path.suffix + ".manifest.json").write_text(
619 json.dumps(manifest, ensure_ascii=False, indent=2),
620 encoding="utf-8",
131 ) 621 )
622 return manifest
132 623
133 624
134 def _content_lines(text: str) -> list[str]: 625 def _content_lines(text: str) -> list[str]:
...@@ -142,9 +633,40 @@ def _clean_generated_output_dir(output_dir: Path) -> None: ...@@ -142,9 +633,40 @@ def _clean_generated_output_dir(output_dir: Path) -> None:
142 path.unlink() 633 path.unlink()
143 634
144 635
145 def _normalized_lines(path: Path) -> list[str]: 636 def _line_count_bucket(line_count: int) -> str:
146 normalized = normalize_lyrics(read_lyric_file(path)) 637 if line_count == 0:
147 return list(normalized.primary_lines or normalized.unique_lines) 638 return "zero"
639 if line_count <= 5:
640 return "short"
641 if line_count <= 40:
642 return "normal"
643 return "long"
644
645
646 def _language_bucket(lines: tuple[str, ...]) -> str:
647 text = "\n".join(lines)
648 cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
649 latin = len(re.findall(r"[A-Za-z]", text))
650 kana = len(re.findall(r"[\u3040-\u30ff]", text))
651 hangul = len(re.findall(r"[\uac00-\ud7af]", text))
652 if kana or hangul:
653 return "jp_kr"
654 if cjk and latin:
655 return "mixed"
656 if cjk:
657 return "zh"
658 if latin:
659 return "latin"
660 return "other"
661
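Reviewer note: a runnable copy of the `_language_bucket` rules with a few probes, to make the precedence explicit. The function name `language_bucket` and the probe strings are illustrative; the regex ranges are the ones in this diff.

```python
import re


def language_bucket(lines: list) -> str:
    text = "\n".join(lines)
    cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
    latin = len(re.findall(r"[A-Za-z]", text))
    kana = len(re.findall(r"[\u3040-\u30ff]", text))
    hangul = len(re.findall(r"[\uac00-\ud7af]", text))
    # Kana or Hangul wins over everything else.
    if kana or hangul:
        return "jp_kr"
    if cjk and latin:
        return "mixed"
    if cjk:
        return "zh"
    if latin:
        return "latin"
    return "other"


for probe in (["你好"], ["hello"], ["こんにちは"], ["사랑"], ["你好 hello"], ["東京"], ["12345"]):
    print(probe, language_bucket(probe))
```

Because the check is character-based, a Japanese line written only in kanji (such as `東京`) falls into `zh`, and a single kana character is enough to classify a whole lyric as `jp_kr`.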
662
663 def _source_bucket(path: Path) -> str:
664 stem = path.stem
665 parts = stem.split("_")
666 if len(parts) >= 2:
667 code = re.sub(r"\d+$", "", parts[-1])
668 return code or "unknown"
669 return "unknown"
148 670
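Reviewer note: a runnable copy of `_source_bucket` with sample inputs. The file names are hypothetical, since the library's naming scheme is not shown in this diff. As written, the bucket is the last underscore-separated token of the file stem with trailing digits removed.

```python
import re
from pathlib import Path


def source_bucket(path: Path) -> str:
    parts = path.stem.split("_")
    if len(parts) >= 2:
        # Strip a trailing numeric id, e.g. "qq001" -> "qq".
        code = re.sub(r"\d+$", "", parts[-1])
        return code or "unknown"
    return "unknown"


for name in ("lib/song_qq001.lrc", "lib/song.lrc", "lib/song_2024.txt", "lib/artist_title.txt"):
    print(name, source_bucket(Path(name)))
```

A stem such as `artist_title` yields `title` as its bucket, so the number of distinct buckets may grow with the library if file names do not end in a source code.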
149 671
150 def _add_timestamps(lines: list[str]) -> str: 672 def _add_timestamps(lines: list[str]) -> str:
...@@ -169,6 +691,17 @@ def _add_blank_line_noise(lines: list[str]) -> str: ...@@ -169,6 +691,17 @@ def _add_blank_line_noise(lines: list[str]) -> str:
169 return "\n".join(result) 691 return "\n".join(result)
170 692
171 693
694 def _change_repeated_line_counts(lines: list[str]) -> str:
695 seen: set[str] = set()
696 result: list[str] = []
697 for line in lines:
698 if line in seen:
699 continue
700 seen.add(line)
701 result.append(line)
702 return "\n".join(result or lines)
703
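Reviewer note: `_change_repeated_line_counts` keeps the first occurrence of every line and drops later repeats. A minimal runnable copy, with `collapse_repeats` as a stand-in name and made-up input lines:

```python
def collapse_repeats(lines: list) -> str:
    seen: set = set()
    result: list = []
    for line in lines:
        if line in seen:
            continue  # drop every repeat after the first occurrence
        seen.add(line)
        result.append(line)
    return "\n".join(result or lines)


print(collapse_repeats(["verse 1", "chorus", "verse 2", "chorus", "chorus"]))
```

The variant therefore only ever reduces repeat counts to one; it does not add repetitions, and a lyric without repeated lines comes back unchanged.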
704
172 def _translation_added(lines: list[str]) -> str: 705 def _translation_added(lines: list[str]) -> str:
173 result: list[str] = [] 706 result: list[str] = []
174 for idx, line in enumerate(lines, start=1): 707 for idx, line in enumerate(lines, start=1):
...@@ -178,11 +711,11 @@ def _translation_added(lines: list[str]) -> str: ...@@ -178,11 +711,11 @@ def _translation_added(lines: list[str]) -> str:
178 return "\n".join(result) 711 return "\n".join(result)
179 712
180 713
181 def _single_song_fragment(lines: list[str]) -> str: 714 def _single_song_fragment(lines: list[str], rng: random.Random) -> str:
182 if len(lines) <= 4: 715 if len(lines) <= 4:
183 return "\n".join(lines[: max(1, len(lines) // 2)]) 716 return "\n".join(lines[: max(1, len(lines) // 2)])
184 fragment_len = max(2, min(8, len(lines) // 4)) 717 fragment_len = max(2, min(8, len(lines) // rng.choice([3, 4, 5])))
185 start = max(0, (len(lines) - fragment_len) // 2) 718 start = rng.randrange(0, max(1, len(lines) - fragment_len + 1))
186 return "\n".join(lines[start : start + fragment_len]) 719 return "\n".join(lines[start : start + fragment_len])
187 720
188 721
@@ -198,29 +731,26 @@ def _short_shared_snippet(lines: list[str], rng: random.Random) -> str:
198 return "\n".join(synthetic) 731 return "\n".join(synthetic)
199 732
200 733
201 def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str: 734 def _repeated_or_sampled_lines(normalized: NormalizedLyrics, rng: random.Random) -> list[str]:
202 left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else [] 735 repeated = [line for line, count in normalized.line_counts.items() if count >= 2]
203 right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else [] 736 if repeated:
204 filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"] 737 return rng.sample(repeated, k=min(2, len(repeated)))
205 return "\n".join([*left_pick, *filler, *right_pick]) 738 lines = list(normalized.primary_lines or normalized.unique_lines)
206 739 return rng.sample(lines, k=min(2, len(lines))) if lines else []
207 740
208 def _same_theme_synthetic(index: int) -> str: 741
209 themes = [ 742 def _same_theme_synthetic(index: int, rng: random.Random) -> str:
210 "我在夜里想起远方的你", 743 starts = ["我在夜里想起远方的你", "城市灯火陪我走过雨季", "风把旧名字吹向清晨"]
211 "城市灯火陪我走过雨季", 744 middles = ["那些没说完的话留在风里", "新的路口慢慢亮起", "时间把答案交给下一站"]
212 "那些没说完的话留在风里", 745 ends = ["明天醒来我们各自继续", "我会把今天写成新的旋律", "故事从这里重新开始"]
213 "明天醒来我们各自继续", 746 return "\n".join(
214 f"这是第 {index} 个全新测试样本", 747 [
215 ] 748 rng.choice(starts),
216 return "\n".join(themes) 749 rng.choice(middles),
217 750 rng.choice(ends),
218 751 f"这是第 {index} 个全新测试样本",
219 def _translation_only_like(lines: list[str]) -> str: 752 ]
220 foreign_count = sum(1 for line in lines if _looks_foreign(line)) 753 )
221 if foreign_count < 2:
222 return _same_theme_synthetic(foreign_count + len(lines))
223 return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1))
224 754
225 755
226 def _pseudo_translation(index: int) -> str: 756 def _pseudo_translation(index: int) -> str:
......
@@ -77,6 +77,7 @@ def main() -> None:
77 csv_path=Path(args.eval_csv), 77 csv_path=Path(args.eval_csv),
78 size=args.eval_size, 78 size=args.eval_size,
79 positive_ratio=args.positive_ratio, 79 positive_ratio=args.positive_ratio,
80 index_path=Path(args.index),
80 ) 81 )
81 evaluate_csv( 82 evaluate_csv(
82 Path(args.index), 83 Path(args.index),
......
1 import csv 1 import csv
2 import json
2 3
3 from lyric_dedup import DuplicateChecker 4 from lyric_dedup import DuplicateChecker
4 from lyric_dedup import DuplicateDecision 5 from lyric_dedup import DuplicateDecision
@@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None:
285 assert (tmp_path / "eval_out.csv.summary.json").exists() 286 assert (tmp_path / "eval_out.csv.summary.json").exists()
286 287
287 288
288 def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None: 289 def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None:
289 library = tmp_path / "library" 290 library = tmp_path / "library"
290 incoming = tmp_path / "generated" / "incoming" 291 incoming = tmp_path / "generated" / "incoming"
291 eval_csv = tmp_path / "generated" / "eval.csv" 292 eval_csv = tmp_path / "generated" / "eval.csv"
292 library.mkdir() 293 library.mkdir()
293 (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8") 294 for idx in range(12):
295 prefix = "AY" if idx % 2 == 0 else "WHHY"
296 (library / f"{idx}_{prefix}{idx:06d}.txt").write_text(
297 BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}"),
298 encoding="utf-8",
299 )
294 300
295 generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5) 301 generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=30, positive_ratio=0.3)
296 302
297 rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) 303 rows = list(csv.DictReader(eval_csv.open(encoding="utf-8")))
298 positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"} 304 manifest = json.loads((tmp_path / "generated" / "eval.csv.manifest.json").read_text(encoding="utf-8"))
299 fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"] 305 negative_types = {row["sample_type"] for row in rows if row["expected"] == "不应去重"}
300 306
301 assert "trimmed_version" not in positive_types 307 assert len(rows) == 30
302 assert "single_song_fragment" not in positive_types 308 assert manifest["library_files"] == 12
303 assert fragment_rows 309 assert manifest["sample_size"] == 30
304 assert all(row["expected"] == "不应去重" for row in fragment_rows) 310 assert manifest["unique_source_records"] > 1
311 assert "positive_full_duplicate" in manifest["plan"]
312 assert "negative_fragment" in negative_types
313 assert "negative_hard_candidate" in negative_types
314 assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_"))
305 315
306 316
307 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: 317 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None:
......
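Reviewer note: the updated test reads `eval.csv.manifest.json` next to the CSV and asserts on `sample_size`, plus the rule that every `negative_*` sample type is labelled `不应去重` ("should not dedupe"). The same checks could be run against a generated set outside the test suite. The helper `check_eval_set` below is hypothetical and not part of the commit; it assumes only the `sample_type` and `expected` columns and the `sample_size` manifest key that appear in the test, and the demo writes a two-row stand-in file rather than calling the real generator.

```python
import csv
import json
import tempfile
from pathlib import Path


def check_eval_set(csv_path: Path) -> dict:
    """Hypothetical helper: cross-check an eval CSV against its manifest."""
    manifest_path = csv_path.with_name(csv_path.name + ".manifest.json")
    with csv_path.open(encoding="utf-8", newline="") as handle:
        rows = list(csv.DictReader(handle))
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    mislabeled = [
        row
        for row in rows
        if row["sample_type"].startswith("negative_") and row["expected"] != "不应去重"
    ]
    return {
        "row_count_matches": len(rows) == manifest["sample_size"],
        "mislabeled_negatives": len(mislabeled),
    }


# Stand-in data; a real run would point at the generator's output instead.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "eval.csv"
    with csv_path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample_type", "expected"])
        writer.writeheader()
        writer.writerow({"sample_type": "negative_fragment", "expected": "不应去重"})
        writer.writerow({"sample_type": "positive_full_duplicate", "expected": "应去重"})
    manifest_file = Path(tmp) / "eval.csv.manifest.json"
    manifest_file.write_text(json.dumps({"sample_size": 2}), encoding="utf-8")
    report = check_eval_set(csv_path)

print(report)  # {'row_count_matches': True, 'mislabeled_negatives': 0}
```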