Update the test-set generation workflow for large sample sizes
Showing 6 changed files with 704 additions and 136 deletions
| ... | @@ -78,16 +78,20 @@ Key columns to check in the CSV: | ... | @@ -78,16 +78,20 @@ Key columns to check in the CSV: |
| 78 | python -m lyric_dedup.cli generate-eval-set \ | 78 | python -m lyric_dedup.cli generate-eval-set \ |
| 79 | --library-dir data/library \ | 79 | --library-dir data/library \ |
| 80 | --lyrics-dir data/generated_eval/incoming \ | 80 | --lyrics-dir data/generated_eval/incoming \ |
| 81 | --csv data/generated_eval/eval_10.csv \ | 81 | --csv data/generated_eval/eval_50000.csv \ |
| 82 | --size 10 \ | 82 | --index outputs/indexes/lyrics.pkl \ |
| 83 | --positive-ratio 0.6 | 83 | --size 50000 \ |
| 84 | --positive-ratio 0.3 | ||
| 84 | ``` | 85 | ``` |
| 85 | 86 | ||
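| 86 | Business rules applied by the generator: | 87 | Business rules applied by the generator: |
| 87 | 88 | ||
| 88 | - `应去重` (should dedupe) samples are generated only as style variations of full-song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, LRC styling, appended Chinese translation. | 89 | - The whole library is scanned first, then sampled in strata by effective lyric line count, language type, and file source prefix, instead of taking a prefix of the sorted file list. |
| 89 | - `不应去重` (should not dedupe) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, same-theme new lyrics, and translation-only similarity. | 90 | - `应去重` (should dedupe) samples are generated only as style variations of full-song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, changed chorus repeat counts, appended Chinese translation. |
| 91 | - `不应去重` (should not dedupe) samples include same-theme new lyrics, hard negatives, lyric fragments, repeated-chorus collisions, translation-only similarity, and short-lyric/placeholder edge cases. | ||
| 90 | - A lyric fragment must not produce `duplicate`, even if it matches part of an existing song; at most it goes to `review`. | 92 | - A lyric fragment must not produce `duplicate`, even if it matches part of an existing song; at most it goes to `review`. |
| 93 | - If `--index` is passed, the generator uses the existing index to build hard negatives that are closer to the recall risks seen in production. | ||
| 94 | - A `*.manifest.json` file is also written, recording the seed, library size, sample-type distribution, language/source buckets, and the number of source records covered. | ||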
| 91 | 95 | ||
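The stratified sampling described in the list above can be sketched as a round-robin draw across buckets. This is a minimal illustration, not the PR's `_stratified_sample`: the `stratified_pick` helper, the record dictionaries, and the two-part bucket key are assumptions made for the example, and unlike the PR's code it never falls back to sampling with replacement when the library is smaller than the requested count.

```python
import random
from collections import defaultdict


def stratified_pick(records, count, key, rng):
    """Draw up to `count` records, visiting buckets in round-robin order.

    Round-robin keeps small buckets from being crowded out by large ones.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[key(record)].append(record)

    # Shuffle inside each bucket, then shuffle the order buckets are visited in.
    pools = [rng.sample(items, len(items)) for items in buckets.values()]
    rng.shuffle(pools)

    picked = []
    while len(picked) < count and pools:
        for pool in pools:
            if len(picked) >= count:
                break
            picked.append(pool.pop())
        pools = [pool for pool in pools if pool]  # drop exhausted buckets
    return picked


# Hypothetical library: 30 records spread over 4 (line-count, language) buckets.
library = [
    {
        "id": f"song-{i}",
        "lines": "long" if i % 5 else "short",
        "lang": "zh" if i % 3 else "latin",
    }
    for i in range(30)
]
rng = random.Random(20260602)
picked = stratified_pick(library, 8, key=lambda r: (r["lines"], r["lang"]), rng=rng)
```

With a fixed seed the draw is reproducible, which is presumably why the manifest records the seed.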
| 92 | First prepare a CSV, for example `data/eval/eval.csv`: | 96 | First prepare a CSV, for example `data/eval/eval.csv`: |
| 93 | 97 | ... | ... |
| ... | @@ -67,10 +67,10 @@ python scripts/process_library.py \ | ... | @@ -67,10 +67,10 @@ python scripts/process_library.py \ |
| 67 | python scripts/process_library.py \ | 67 | python scripts/process_library.py \ |
| 68 | --library-dir data/library \ | 68 | --library-dir data/library \ |
| 69 | --index outputs/indexes/library_lyrics.pkl \ | 69 | --index outputs/indexes/library_lyrics.pkl \ |
| 70 | --eval-size 1180 \ | 70 | --eval-size 50000 \ |
| 71 | --positive-ratio 0.2 \ | 71 | --positive-ratio 0.3 \ |
| 72 | --eval-csv data/generated_eval/eval_1180.csv \ | 72 | --eval-csv data/generated_eval/eval_50000.csv \ |
| 73 | --eval-out outputs/results/library_eval_1180.csv | 73 | --eval-out outputs/results/library_eval_50000.csv |
| 74 | ``` | 74 | ``` |
| 75 | 75 | ||
| 76 | Quarantined files are moved by default to: | 76 | Quarantined files are moved by default to: |
| ... | @@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl | ... | @@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl |
| 95 | 95 | ||
| 96 | Note: if you modify `data/library`, or change the preprocessing/dedup logic, rerun this step. | 96 | Note: if you modify `data/library`, or change the preprocessing/dedup logic, rerun this step. |
| 97 | 97 | ||
| 98 | ## 3. Generate 100 test samples | 98 | ## 3. Generate production evaluation samples |
| 99 | 99 | ||
| 100 | ```bash | 100 | ```bash |
| 101 | python -m lyric_dedup.cli generate-eval-set \ | 101 | python -m lyric_dedup.cli generate-eval-set \ |
| 102 | --library-dir data/library \ | 102 | --library-dir data/library \ |
| 103 | --lyrics-dir data/generated_eval/incoming \ | 103 | --lyrics-dir data/generated_eval/incoming \ |
| 104 | --csv data/generated_eval/eval_500.csv \ | 104 | --csv data/generated_eval/eval_50000.csv \ |
| 105 | --size 500 \ | 105 | --index outputs/indexes/library_lyrics.pkl \ |
| 106 | --positive-ratio 0.2 | 106 | --size 50000 \ |
| 107 | --positive-ratio 0.3 | ||
| 107 | ``` | 108 | ``` |
| 108 | 109 | ||
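| 109 | Generated by default: | 110 | Default production evaluation mix: |
| 110 | 111 | ||
| 111 | ```text | 112 | ```text |
| 112 | 应去重 (should dedupe): 60 | 113 | 应去重 (should dedupe): 30% |
| 113 | 不应去重 (should not dedupe): 40 | 114 | 不应去重 (should not dedupe): 70% |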
| 114 | ``` | 115 | ``` |
| 115 | 116 | ||
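| 116 | The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. | 117 | The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. |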
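To make the 30% / 70% split concrete, here is a hedged sketch of how `--positive-ratio` could be turned into per-type quotas while the negative types keep their relative proportions. The share values are copied from `DEFAULT_SAMPLE_MIX` in this diff; the `plan_quotas` helper is hypothetical. It deliberately swaps in exact `Fraction` arithmetic and a largest-remainder rule, whereas the PR's `_sample_plan` uses floats and hands the leftover samples to the largest shares first.

```python
from fractions import Fraction

POSITIVE = "positive_full_duplicate"

# Shares copied from DEFAULT_SAMPLE_MIX in this diff.
DEFAULT_MIX = {
    POSITIVE: Fraction(30, 100),
    "negative_random_unrelated": Fraction(20, 100),
    "negative_hard_candidate": Fraction(25, 100),
    "negative_fragment": Fraction(10, 100),
    "negative_shared_chorus": Fraction(5, 100),
    "negative_translation_only": Fraction(5, 100),
    "edge_short_or_placeholder": Fraction(5, 100),
}


def plan_quotas(size: int, positive_ratio: float) -> dict[str, int]:
    """Split `size` into integer quotas that always sum to `size`."""
    # Clamp the ratio to [0, 1]; going through str() keeps 0.3 exactly 3/10.
    ratio = min(max(Fraction(str(positive_ratio)), Fraction(0)), Fraction(1))
    negative_total = sum(
        share for name, share in DEFAULT_MIX.items() if name != POSITIVE
    )
    exact = {
        name: size * (ratio if name == POSITIVE else (1 - ratio) * share / negative_total)
        for name, share in DEFAULT_MIX.items()
    }
    plan = {name: int(value) for name, value in exact.items()}  # floor
    leftover = size - sum(plan.values())
    # Largest-remainder rule: give the leftover samples to the biggest fractions.
    by_remainder = sorted(exact, key=lambda name: exact[name] - plan[name], reverse=True)
    for name in by_remainder[:leftover]:
        plan[name] += 1
    return plan


plan = plan_quotas(50000, 0.3)
```

For `--size 50000 --positive-ratio 0.3` every quota is a whole number, so no leftover needs to be distributed; smaller sizes such as 10 exercise the remainder step.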
| ... | @@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \ |
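| 118 | Business rules: | 119 | Business rules: |
| 119 | 120 | ||
| 120 | ```text | 121 | ```text |
| 121 | pos_* = 应去重 (should dedupe), style variations of full-song lyrics | 122 | positive_* = 应去重 (should dedupe), style variations of full-song lyrics |
| 122 | neg_* = 不应去重 (should not dedupe), fragments / short-phrase collisions / mixed fragments / same-theme new lyrics / translation-only similarity | 123 | negative_random_unrelated = 不应去重 (should not dedupe), same-theme new lyrics |
| 124 | negative_hard_candidate = 不应去重 (should not dedupe), short-phrase / partial-overlap samples the system tends to recall | ||
| 125 | negative_fragment = 不应去重 (should not dedupe), single-song fragment | ||
| 126 | negative_shared_chorus = 不应去重 (should not dedupe), repeated-chorus collision | ||
| 127 | negative_translation_only = 不应去重 (should not dedupe), translation-only similarity | ||
| 128 | edge_short_or_placeholder = 不应去重 (should not dedupe), short-lyric / placeholder edge cases | ||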
| 129 | ``` | ||
| 130 | |||
| 131 | 生成器会扫描整个曲库并按有效歌词行数、语言类型、文件来源前缀分层采样。传入 `--index` 后会用现有索引生成 hard negative。每次还会输出: | ||
| 132 | |||
| 133 | ```text | ||
| 134 | data/generated_eval/eval_50000.csv.manifest.json | ||
| 135 | ``` | ||
| 136 | |||
| 137 | Key fields in the manifest: | ||
| 138 | |||
| 139 | ```text | ||
| 140 | library_files number of lyric files in the library | ||
| 141 | sample_type_counts number of samples per sample type | ||
| 142 | line_count_bucket_counts / language_bucket_counts / source_bucket_counts | ||
| 143 | unique_source_records number of real source files covered by this evaluation set | ||
| 123 | ``` | 144 | ``` |
| 124 | 145 | ||
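A quick sanity check on the manifest might look like the following sketch. Only the field names come from the README text; the JSON layout is an assumption (here `sample_type_counts` is taken to be a mapping from sample type to count, and positive types are taken to start with `positive_`), and `summarize_manifest` and `load_manifest` are hypothetical helpers rather than part of `lyric_dedup`.

```python
import json
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Read a manifest such as data/generated_eval/eval_50000.csv.manifest.json."""
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_manifest(manifest: dict) -> dict:
    """Derive the realized positive share and the source coverage."""
    counts = manifest.get("sample_type_counts", {})  # assumed: {type: count}
    total = sum(counts.values())
    positives = sum(n for name, n in counts.items() if name.startswith("positive_"))
    library_files = manifest.get("library_files", 0)
    covered = manifest.get("unique_source_records", 0)
    return {
        "total_samples": total,
        "positive_share": positives / total if total else 0.0,
        "source_coverage": covered / library_files if library_files else 0.0,
    }


# Made-up numbers, for illustration only.
example_manifest = {
    "library_files": 1000,
    "unique_source_records": 400,
    "sample_type_counts": {"positive_exact_copy": 30, "negative_fragment": 70},
}
summary = summarize_manifest(example_manifest)
```

Comparing `positive_share` with the requested `--positive-ratio` is one way to notice when a sample type could not be filled and the generator padded the set with other samples.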
| 125 | ## 4. Strict evaluation: only duplicate counts as deduplicated | 146 | ## 4. Strict evaluation: only duplicate counts as deduplicated |
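| ... | @@ -127,9 +148,9 @@ neg_* = 不应去重 (should not dedupe), fragments / short-phrase collisions / mixed fragments / same-theme new lyrics / only | ... | @@ -127,9 +148,9 @@ neg_* = 不应去重 (should not dedupe), fragments / short-phrase collisions / mixed fragments / same-theme new lyrics / only |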
| 127 | ```bash | 148 | ```bash |
| 128 | python -m lyric_dedup.cli evaluate-csv \ | 149 | python -m lyric_dedup.cli evaluate-csv \ |
| 129 | --index outputs/indexes/library_lyrics.pkl \ | 150 | --index outputs/indexes/library_lyrics.pkl \ |
| 130 | --csv data/generated_eval/eval_500.csv \ | 151 | --csv data/generated_eval/eval_50000.csv \ |
| 131 | --base-dir data/generated_eval \ | 152 | --base-dir data/generated_eval \ |
| 132 | --out outputs/results/library_eval_500.csv | 153 | --out outputs/results/library_eval_50000.csv |
| 133 | ``` | 154 | ``` |
| 134 | 155 | ||
| 135 | Under this policy: | 156 | Under this policy: |
| ... | @@ -151,10 +172,10 @@ false_positive | ... | @@ -151,10 +172,10 @@ false_positive |
| 151 | ```bash | 172 | ```bash |
| 152 | python -m lyric_dedup.cli evaluate-csv \ | 173 | python -m lyric_dedup.cli evaluate-csv \ |
| 153 | --index outputs/indexes/library_lyrics.pkl \ | 174 | --index outputs/indexes/library_lyrics.pkl \ |
| 154 | --csv data/generated_eval/eval_500.csv \ | 175 | --csv data/generated_eval/eval_50000.csv \ |
| 155 | --base-dir data/generated_eval \ | 176 | --base-dir data/generated_eval \ |
| 156 | --positive-decisions duplicate,review \ | 177 | --positive-decisions duplicate,review \ |
| 157 | --out outputs/results/library_eval_500_review_positive.csv | 178 | --out outputs/results/library_eval_50000_review_positive.csv |
| 158 | ``` | 179 | ``` |
| 159 | 180 | ||
| 160 | Under this policy: | 181 | Under this policy: | ... | ... |
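As a side note on the two `evaluate-csv` commands in this README, they differ only in which decisions count as a predicted duplicate. The sketch below shows how the same rows could be scored under both policies. It is an illustration under assumptions: the `(expected, decision)` row shape, the decision string `new`, and the `confusion_counts` helper are hypothetical, and the real result CSV columns are not shown in this diff.

```python
def confusion_counts(rows, positive_decisions):
    """Count outcomes for (expected_label, decision) pairs.

    A row is predicted positive when its decision is in `positive_decisions`;
    it is actually positive when its expected label is "应去重" (should dedupe).
    """
    counts = {
        "true_positive": 0,
        "false_positive": 0,
        "false_negative": 0,
        "true_negative": 0,
    }
    for expected, decision in rows:
        predicted = decision in positive_decisions
        actual = expected == "应去重"
        if predicted and actual:
            counts["true_positive"] += 1
        elif predicted and not actual:
            counts["false_positive"] += 1
        elif actual:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Made-up decisions, for illustration only.
rows = [
    ("应去重", "duplicate"),
    ("应去重", "review"),
    ("不应去重", "review"),
    ("不应去重", "new"),
]
strict = confusion_counts(rows, {"duplicate"})
lenient = confusion_counts(rows, {"duplicate", "review"})
```

In this toy data, treating `review` as positive recovers one missed duplicate but also turns one fragment-style negative into a false positive, which is the trade-off the two policies are meant to expose.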
| ... | @@ -48,8 +48,9 @@ def main() -> None: | ... | @@ -48,8 +48,9 @@ def main() -> None: |
| 48 | generate.add_argument("--lyrics-dir", required=True) | 48 | generate.add_argument("--lyrics-dir", required=True) |
| 49 | generate.add_argument("--csv", required=True) | 49 | generate.add_argument("--csv", required=True) |
| 50 | generate.add_argument("--size", type=int, default=100) | 50 | generate.add_argument("--size", type=int, default=100) |
| 51 | generate.add_argument("--positive-ratio", type=float, default=0.6) | 51 | generate.add_argument("--positive-ratio", type=float, default=0.3) |
| 52 | generate.add_argument("--seed", type=int, default=20260602) | 52 | generate.add_argument("--seed", type=int, default=20260602) |
| 53 | generate.add_argument("--index", default="", help="optional existing index for hard-negative generation") | ||
| 53 | 54 | ||
| 54 | args = parser.parse_args() | 55 | args = parser.parse_args() |
| 55 | if args.command == "build-index": | 56 | if args.command == "build-index": |
| ... | @@ -75,6 +76,7 @@ def main() -> None: | ... | @@ -75,6 +76,7 @@ def main() -> None: |
| 75 | size=args.size, | 76 | size=args.size, |
| 76 | positive_ratio=args.positive_ratio, | 77 | positive_ratio=args.positive_ratio, |
| 77 | seed=args.seed, | 78 | seed=args.seed, |
| 79 | index_path=Path(args.index) if args.index else None, | ||
| 78 | ) | 80 | ) |
| 79 | print(json.dumps(summary, ensure_ascii=False)) | 81 | print(json.dumps(summary, ensure_ascii=False)) |
| 80 | 82 | ... | ... |
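The `--index` handling in this file uses an empty-string default that is later mapped to `None`. A standalone sketch of that pattern follows; the `parse_generate_args` wrapper and the reduced option set are assumptions made so the example runs on its own, not the actual structure of `lyric_dedup.cli`.

```python
import argparse
from pathlib import Path


def parse_generate_args(argv):
    """Parse a reduced set of generate-eval-set options.

    Returns the parsed namespace and the index path, or None when
    --index was not given (the empty-string default is falsy).
    """
    parser = argparse.ArgumentParser(prog="generate-eval-set-sketch")
    parser.add_argument("--size", type=int, default=100)
    parser.add_argument("--positive-ratio", type=float, default=0.3)
    parser.add_argument(
        "--index",
        default="",
        help="optional existing index for hard-negative generation",
    )
    args = parser.parse_args(argv)
    index_path = Path(args.index) if args.index else None
    return args, index_path


_, no_index = parse_generate_args([])
args, with_index = parse_generate_args(
    ["--size", "50000", "--index", "outputs/indexes/library_lyrics.pkl"]
)
```

An equivalent alternative would be `type=Path, default=None`, which removes the conversion step; the empty-string form keeps the argument a plain string until it reaches the generator call.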
| 1 | """Generate labeled evaluation samples from an existing lyric library.""" | 1 | """Generate production-style labeled evaluation samples from a lyric library.""" |
| 2 | 2 | ||
| 3 | from __future__ import annotations | 3 | from __future__ import annotations |
| 4 | 4 | ||
| 5 | import csv | 5 | import csv |
| 6 | import hashlib | ||
| 7 | import json | ||
| 6 | import random | 8 | import random |
| 7 | import re | 9 | import re |
| 10 | from collections import Counter | ||
| 8 | from dataclasses import dataclass | 11 | from dataclasses import dataclass |
| 9 | from pathlib import Path | 12 | from pathlib import Path |
| 10 | 13 | ||
| 14 | from lyric_dedup.checker import DuplicateChecker | ||
| 15 | from lyric_dedup.checker import DuplicateDecision | ||
| 11 | from lyric_dedup.file_import import iter_lyric_files | 16 | from lyric_dedup.file_import import iter_lyric_files |
| 12 | from lyric_dedup.file_import import read_lyric_file | 17 | from lyric_dedup.file_import import read_lyric_file |
| 13 | from lyric_dedup.file_import import record_from_file | 18 | from lyric_dedup.file_import import record_from_file |
| 19 | from lyric_dedup.normalization import NormalizedLyrics | ||
| 20 | from lyric_dedup.normalization import fingerprint_text | ||
| 14 | from lyric_dedup.normalization import normalize_lyrics | 21 | from lyric_dedup.normalization import normalize_lyrics |
| 15 | 22 | ||
| 16 | 23 | ||
| 24 | DEFAULT_SAMPLE_MIX = { | ||
| 25 | "positive_full_duplicate": 0.30, | ||
| 26 | "negative_random_unrelated": 0.20, | ||
| 27 | "negative_hard_candidate": 0.25, | ||
| 28 | "negative_fragment": 0.10, | ||
| 29 | "negative_shared_chorus": 0.05, | ||
| 30 | "negative_translation_only": 0.05, | ||
| 31 | "edge_short_or_placeholder": 0.05, | ||
| 32 | } | ||
| 33 | |||
| 34 | |||
| 35 | @dataclass(frozen=True) | ||
| 36 | class LyricProfile: | ||
| 37 | path: Path | ||
| 38 | record_id: str | ||
| 39 | title: str | ||
| 40 | artist: str | ||
| 41 | normalized: NormalizedLyrics | ||
| 42 | line_count: int | ||
| 43 | char_count: int | ||
| 44 | line_count_bucket: str | ||
| 45 | language_bucket: str | ||
| 46 | source_bucket: str | ||
| 47 | normalized_hash: str | ||
| 48 | has_translation: bool | ||
| 49 | |||
| 50 | |||
| 17 | @dataclass(frozen=True) | 51 | @dataclass(frozen=True) |
| 18 | class GeneratedSample: | 52 | class GeneratedSample: |
| 19 | sample_id: str | 53 | sample_id: str |
| ... | @@ -21,8 +55,14 @@ class GeneratedSample: | ... | @@ -21,8 +55,14 @@ class GeneratedSample: |
| 21 | expected: str | 55 | expected: str |
| 22 | sample_type: str | 56 | sample_type: str |
| 23 | source: str | 57 | source: str |
| 58 | source_record_id: str = "" | ||
| 59 | candidate_record_id: str = "" | ||
| 60 | line_count_bucket: str = "" | ||
| 61 | language_bucket: str = "" | ||
| 62 | source_bucket: str = "" | ||
| 24 | title: str = "" | 63 | title: str = "" |
| 25 | artist: str = "" | 64 | artist: str = "" |
| 65 | notes: str = "" | ||
| 26 | 66 | ||
| 27 | 67 | ||
| 28 | def generate_eval_set( | 68 | def generate_eval_set( |
| ... | @@ -31,104 +71,555 @@ def generate_eval_set( | ... | @@ -31,104 +71,555 @@ def generate_eval_set( |
| 31 | output_dir: Path, | 71 | output_dir: Path, |
| 32 | csv_path: Path, | 72 | csv_path: Path, |
| 33 | size: int = 100, | 73 | size: int = 100, |
| 34 | positive_ratio: float = 0.6, | 74 | positive_ratio: float = 0.30, |
| 35 | seed: int = 20260602, | 75 | seed: int = 20260602, |
| 76 | index_path: Path | None = None, | ||
| 36 | ) -> dict[str, object]: | 77 | ) -> dict[str, object]: |
| 78 | """Generate a stratified production evaluation set. | ||
| 79 | |||
| 80 | ``positive_ratio`` is kept for CLI compatibility. It overrides the default | ||
| 81 | positive quota while keeping the remaining negative categories proportional. | ||
| 82 | """ | ||
| 83 | if size <= 0: | ||
| 84 | raise ValueError("size must be positive") | ||
| 85 | |||
| 37 | rng = random.Random(seed) | 86 | rng = random.Random(seed) |
| 38 | source_files = iter_lyric_files(library_dir) | 87 | profiles = profile_library(library_dir) |
| 39 | if not source_files: | 88 | if not profiles: |
| 40 | raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件") | 89 | raise ValueError(f"{library_dir} 下没有 .lrc/.txt 歌词文件") |
| 41 | 90 | ||
| 42 | output_dir.mkdir(parents=True, exist_ok=True) | 91 | output_dir.mkdir(parents=True, exist_ok=True) |
| 43 | csv_path.parent.mkdir(parents=True, exist_ok=True) | 92 | csv_path.parent.mkdir(parents=True, exist_ok=True) |
| 44 | _clean_generated_output_dir(output_dir) | 93 | _clean_generated_output_dir(output_dir) |
| 45 | 94 | ||
| 46 | positives = round(size * positive_ratio) | 95 | checker = DuplicateChecker.load(index_path) if index_path else None |
| 47 | negatives = size - positives | 96 | plan = _sample_plan(size, positive_ratio=positive_ratio) |
| 97 | groups = _profile_groups(profiles) | ||
| 48 | samples: list[GeneratedSample] = [] | 98 | samples: list[GeneratedSample] = [] |
| 49 | for index in range(positives): | ||
| 50 | source = source_files[index % len(source_files)] | ||
| 51 | samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng)) | ||
| 52 | for index in range(negatives): | ||
| 53 | left = source_files[index % len(source_files)] | ||
| 54 | right = source_files[(index + 1) % len(source_files)] | ||
| 55 | samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng)) | ||
| 56 | 99 | ||
| 57 | rng.shuffle(samples) | 100 | samples.extend( |
| 58 | with csv_path.open("w", encoding="utf-8", newline="") as file: | 101 | _build_positive_samples( |
| 59 | writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"]) | 102 | _stratified_sample(groups["normal"], plan["positive_full_duplicate"], rng), |
| 60 | writer.writeheader() | 103 | output_dir, |
| 61 | writer.writerows( | 104 | csv_path.parent, |
| 62 | { | 105 | rng, |
| 63 | "id": sample.sample_id, | 106 | start_index=len(samples) + 1, |
| 64 | "file": sample.file, | ||
| 65 | "expected": sample.expected, | ||
| 66 | "sample_type": sample.sample_type, | ||
| 67 | "source": sample.source, | ||
| 68 | "title": sample.title, | ||
| 69 | "artist": sample.artist, | ||
| 70 | } | ||
| 71 | for sample in samples | ||
| 72 | ) | 107 | ) |
| 108 | ) | ||
| 109 | samples.extend( | ||
| 110 | _build_random_unrelated_samples( | ||
| 111 | plan["negative_random_unrelated"], | ||
| 112 | output_dir, | ||
| 113 | csv_path.parent, | ||
| 114 | rng, | ||
| 115 | start_index=len(samples) + 1, | ||
| 116 | ) | ||
| 117 | ) | ||
| 118 | samples.extend( | ||
| 119 | _build_hard_candidate_samples( | ||
| 120 | groups["normal"], | ||
| 121 | plan["negative_hard_candidate"], | ||
| 122 | output_dir, | ||
| 123 | csv_path.parent, | ||
| 124 | rng, | ||
| 125 | checker=checker, | ||
| 126 | start_index=len(samples) + 1, | ||
| 127 | ) | ||
| 128 | ) | ||
| 129 | samples.extend( | ||
| 130 | _build_fragment_samples( | ||
| 131 | _stratified_sample(groups["fragmentable"], plan["negative_fragment"], rng), | ||
| 132 | output_dir, | ||
| 133 | csv_path.parent, | ||
| 134 | rng, | ||
| 135 | start_index=len(samples) + 1, | ||
| 136 | ) | ||
| 137 | ) | ||
| 138 | samples.extend( | ||
| 139 | _build_shared_chorus_samples( | ||
| 140 | _stratified_sample(groups["normal"], plan["negative_shared_chorus"], rng), | ||
| 141 | output_dir, | ||
| 142 | csv_path.parent, | ||
| 143 | rng, | ||
| 144 | start_index=len(samples) + 1, | ||
| 145 | ) | ||
| 146 | ) | ||
| 147 | samples.extend( | ||
| 148 | _build_translation_only_samples( | ||
| 149 | _stratified_sample(groups["foreign"], plan["negative_translation_only"], rng), | ||
| 150 | output_dir, | ||
| 151 | csv_path.parent, | ||
| 152 | rng, | ||
| 153 | start_index=len(samples) + 1, | ||
| 154 | ) | ||
| 155 | ) | ||
| 156 | samples.extend( | ||
| 157 | _build_edge_samples( | ||
| 158 | _stratified_sample(groups["edge"], plan["edge_short_or_placeholder"], rng), | ||
| 159 | output_dir, | ||
| 160 | csv_path.parent, | ||
| 161 | rng, | ||
| 162 | start_index=len(samples) + 1, | ||
| 163 | ) | ||
| 164 | ) | ||
| 73 | 165 | ||
| 166 | if len(samples) < size: | ||
| 167 | samples.extend( | ||
| 168 | _build_random_unrelated_samples( | ||
| 169 | size - len(samples), | ||
| 170 | output_dir, | ||
| 171 | csv_path.parent, | ||
| 172 | rng, | ||
| 173 | start_index=len(samples) + 1, | ||
| 174 | ) | ||
| 175 | ) | ||
| 176 | samples = samples[:size] | ||
| 177 | rng.shuffle(samples) | ||
| 178 | |||
| 179 | _write_csv(samples, csv_path, seed=seed) | ||
| 180 | manifest = _write_manifest( | ||
| 181 | profiles=profiles, | ||
| 182 | samples=samples, | ||
| 183 | csv_path=csv_path, | ||
| 184 | output_dir=output_dir, | ||
| 185 | seed=seed, | ||
| 186 | plan=plan, | ||
| 187 | index_path=index_path, | ||
| 188 | ) | ||
| 189 | return manifest | ||
| 190 | |||
| 191 | |||
| 192 | def profile_library(library_dir: Path) -> list[LyricProfile]: | ||
| 193 | profiles: list[LyricProfile] = [] | ||
| 194 | for path in iter_lyric_files(library_dir): | ||
| 195 | record = record_from_file(path, base_dir=library_dir) | ||
| 196 | normalized = normalize_lyrics(record.lyrics) | ||
| 197 | lines = normalized.primary_lines or normalized.unique_lines | ||
| 198 | line_count = len(lines) | ||
| 199 | normalized_text = fingerprint_text(normalized) or normalized.normalized_full_text | ||
| 200 | source_bucket = _source_bucket(path) | ||
| 201 | profiles.append( | ||
| 202 | LyricProfile( | ||
| 203 | path=path, | ||
| 204 | record_id=record.record_id, | ||
| 205 | title=record.title or "", | ||
| 206 | artist=record.artist or "", | ||
| 207 | normalized=normalized, | ||
| 208 | line_count=line_count, | ||
| 209 | char_count=len(normalized_text), | ||
| 210 | line_count_bucket=_line_count_bucket(line_count), | ||
| 211 | language_bucket=_language_bucket(lines), | ||
| 212 | source_bucket=source_bucket, | ||
| 213 | normalized_hash=hashlib.sha256(normalized_text.encode("utf-8")).hexdigest(), | ||
| 214 | has_translation=bool(normalized.translation_lines), | ||
| 215 | ) | ||
| 216 | ) | ||
| 217 | return profiles | ||
| 218 | |||
| 219 | |||
| 220 | def _sample_plan(size: int, *, positive_ratio: float) -> dict[str, int]: | ||
| 221 | positive_ratio = max(0.0, min(1.0, positive_ratio)) | ||
| 222 | mix = dict(DEFAULT_SAMPLE_MIX) | ||
| 223 | negative_total = sum(value for key, value in mix.items() if key != "positive_full_duplicate") | ||
| 224 | mix["positive_full_duplicate"] = positive_ratio | ||
| 225 | for key in list(mix): | ||
| 226 | if key != "positive_full_duplicate": | ||
| 227 | mix[key] = (1.0 - positive_ratio) * (DEFAULT_SAMPLE_MIX[key] / negative_total) | ||
| 228 | |||
| 229 | plan = {key: int(size * value) for key, value in mix.items()} | ||
| 230 | remainder = size - sum(plan.values()) | ||
| 231 | for key in sorted(mix, key=mix.get, reverse=True): | ||
| 232 | if remainder <= 0: | ||
| 233 | break | ||
| 234 | plan[key] += 1 | ||
| 235 | remainder -= 1 | ||
| 236 | return plan | ||
| 237 | |||
| 238 | |||
| 239 | def _profile_groups(profiles: list[LyricProfile]) -> dict[str, list[LyricProfile]]: | ||
| 240 | normal = [profile for profile in profiles if profile.line_count >= 6] | ||
| 241 | edge = [profile for profile in profiles if profile.line_count <= 5] | ||
| 74 | return { | 242 | return { |
| 75 | "size": size, | 243 | "normal": normal or profiles, |
| 76 | "positive": positives, | 244 | "fragmentable": [profile for profile in profiles if profile.line_count >= 12] or normal or profiles, |
| 77 | "negative": negatives, | 245 | "foreign": [ |
| 78 | "library_files": len(source_files), | 246 | profile |
| 79 | "lyrics_dir": str(output_dir), | 247 | for profile in profiles |
| 80 | "csv": str(csv_path), | 248 | if profile.language_bucket in {"latin", "mixed", "jp_kr"} and profile.line_count >= 4 |
| 249 | ] | ||
| 250 | or normal | ||
| 251 | or profiles, | ||
| 252 | "edge": edge or normal or profiles, | ||
| 81 | } | 253 | } |
| 82 | 254 | ||
| 83 | 255 | ||
| 84 | def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | 256 | def _stratified_sample(profiles: list[LyricProfile], count: int, rng: random.Random) -> list[LyricProfile]: |
| 85 | raw = read_lyric_file(source) | 257 | if count <= 0 or not profiles: |
| 86 | source_record = record_from_file(source) | 258 | return [] |
| 87 | variants = [ | 259 | buckets: dict[tuple[str, str, str], list[LyricProfile]] = {} |
| 88 | ("exact_copy", raw), | 260 | for profile in profiles: |
| 89 | ("timestamped", _add_timestamps(_content_lines(raw))), | 261 | key = (profile.line_count_bucket, profile.language_bucket, profile.source_bucket) |
| 90 | ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)), | 262 | buckets.setdefault(key, []).append(profile) |
| 91 | ("with_platform_noise", _with_platform_noise(_content_lines(raw))), | 263 | |
| 92 | ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))), | 264 | selected: list[LyricProfile] = [] |
| 93 | ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))), | 265 | bucket_keys = list(buckets) |
| 94 | ("translation_added", _translation_added(_content_lines(raw))), | 266 | rng.shuffle(bucket_keys) |
| 95 | ] | 267 | cursors = {key: rng.sample(items, len(items)) for key, items in buckets.items()} |
| 96 | sample_type, text = variants[(index - 1) % len(variants)] | 268 | while len(selected) < count and bucket_keys: |
| 97 | name = f"pos_{index:03d}_{sample_type}.txt" | 269 | progressed = False |
| 98 | path = output_dir / name | 270 | for key in list(bucket_keys): |
| 99 | path.write_text(text, encoding="utf-8") | 271 | if len(selected) >= count: |
| 272 | break | ||
| 273 | items = cursors[key] | ||
| 274 | if not items: | ||
| 275 | bucket_keys.remove(key) | ||
| 276 | continue | ||
| 277 | selected.append(items.pop()) | ||
| 278 | progressed = True | ||
| 279 | if not progressed: | ||
| 280 | break | ||
| 281 | while len(selected) < count: | ||
| 282 | selected.append(rng.choice(profiles)) | ||
| 283 | return selected | ||
| 284 | |||
| 285 | |||
| 286 | def _build_positive_samples( | ||
| 287 | profiles: list[LyricProfile], | ||
| 288 | output_dir: Path, | ||
| 289 | csv_base: Path, | ||
| 290 | rng: random.Random, | ||
| 291 | *, | ||
| 292 | start_index: int, | ||
| 293 | ) -> list[GeneratedSample]: | ||
| 294 | samples: list[GeneratedSample] = [] | ||
| 295 | for offset, profile in enumerate(profiles): | ||
| 296 | raw = read_lyric_file(profile.path) | ||
| 297 | lines = _content_lines(raw) | ||
| 298 | variants = [ | ||
| 299 | ("positive_exact_copy", raw), | ||
| 300 | ("positive_timestamped", _add_timestamps(lines)), | ||
| 301 | ("positive_punctuation_noise", _add_punctuation_noise(lines, rng)), | ||
| 302 | ("positive_platform_noise", _with_platform_noise(lines)), | ||
| 303 | ("positive_blank_line_noise", _add_blank_line_noise(lines)), | ||
| 304 | ("positive_chorus_count_changed", _change_repeated_line_counts(lines)), | ||
| 305 | ("positive_translation_added", _translation_added(lines)), | ||
| 306 | ] | ||
| 307 | sample_type, text = variants[offset % len(variants)] | ||
| 308 | index = start_index + offset | ||
| 309 | path = _write_sample_file(output_dir, f"pos_{index:05d}_{sample_type}.txt", text) | ||
| 310 | samples.append(_sample_from_profile(index, path, csv_base, "应去重", sample_type, profile)) | ||
| 311 | return samples | ||
| 312 | |||
| 313 | |||
| 314 | def _build_random_unrelated_samples( | ||
| 315 | count: int, | ||
| 316 | output_dir: Path, | ||
| 317 | csv_base: Path, | ||
| 318 | rng: random.Random, | ||
| 319 | *, | ||
| 320 | start_index: int, | ||
| 321 | ) -> list[GeneratedSample]: | ||
| 322 | samples: list[GeneratedSample] = [] | ||
| 323 | for offset in range(count): | ||
| 324 | index = start_index + offset | ||
| 325 | text = _same_theme_synthetic(index, rng) | ||
| 326 | path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_random_unrelated.txt", text) | ||
| 327 | samples.append( | ||
| 328 | GeneratedSample( | ||
| 329 | sample_id=f"sample-{index:05d}", | ||
| 330 | file=str(path.relative_to(csv_base)), | ||
| 331 | expected="不应去重", | ||
| 332 | sample_type="negative_random_unrelated", | ||
| 333 | source="synthetic", | ||
| 334 | notes="same-theme synthetic full lyric not copied from library", | ||
| 335 | ) | ||
| 336 | ) | ||
| 337 | return samples | ||
| 338 | |||
| 339 | |||
| 340 | def _build_hard_candidate_samples( | ||
| 341 | profiles: list[LyricProfile], | ||
| 342 | count: int, | ||
| 343 | output_dir: Path, | ||
| 344 | csv_base: Path, | ||
| 345 | rng: random.Random, | ||
| 346 | *, | ||
| 347 | checker: DuplicateChecker | None, | ||
| 348 | start_index: int, | ||
| 349 | ) -> list[GeneratedSample]: | ||
| 350 | if count <= 0: | ||
| 351 | return [] | ||
| 352 | sources = _stratified_sample(profiles, count * 3, rng) | ||
| 353 | samples: list[GeneratedSample] = [] | ||
| 354 | for profile in sources: | ||
| 355 | if len(samples) >= count: | ||
| 356 | break | ||
| 357 | lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines) | ||
| 358 | text = _short_shared_snippet(lines, rng) | ||
| 359 | candidate_id = "" | ||
| 360 | if checker is not None: | ||
| 361 | result = checker.check(text, max_candidates=5) | ||
| 362 | candidate = next( | ||
| 363 | ( | ||
| 364 | item | ||
| 365 | for item in result.candidates | ||
| 366 | if item.record_id != profile.record_id and item.decision != DuplicateDecision.NEW | ||
| 367 | ), | ||
| 368 | result.candidates[0] if result.candidates else None, | ||
| 369 | ) | ||
| 370 | candidate_id = candidate.record_id if candidate else "" | ||
| 371 | index = start_index + len(samples) | ||
| 372 | path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_hard_candidate.txt", text) | ||
| 373 | samples.append( | ||
| 374 | _sample_from_profile( | ||
| 375 | index, | ||
| 376 | path, | ||
| 377 | csv_base, | ||
| 378 | "不应去重", | ||
| 379 | "negative_hard_candidate", | ||
| 380 | profile, | ||
| 381 | candidate_record_id=candidate_id, | ||
| 382 | notes="shares a few real lines plus new filler; should not auto duplicate", | ||
| 383 | ) | ||
| 384 | ) | ||
| 385 | return samples | ||
| 386 | |||
| 387 | |||
| 388 | def _build_fragment_samples( | ||
| 389 | profiles: list[LyricProfile], | ||
| 390 | output_dir: Path, | ||
| 391 | csv_base: Path, | ||
| 392 | rng: random.Random, | ||
| 393 | *, | ||
| 394 | start_index: int, | ||
| 395 | ) -> list[GeneratedSample]: | ||
| 396 | samples: list[GeneratedSample] = [] | ||
| 397 | for offset, profile in enumerate(profiles): | ||
| 398 | lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines) | ||
| 399 | text = _single_song_fragment(lines, rng) | ||
| 400 | index = start_index + offset | ||
| 401 | path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_fragment.txt", text) | ||
| 402 | samples.append( | ||
| 403 | _sample_from_profile( | ||
| 404 | index, | ||
| 405 | path, | ||
| 406 | csv_base, | ||
| 407 | "不应去重", | ||
| 408 | "negative_fragment", | ||
| 409 | profile, | ||
| 410 | notes="partial lyric fragment only", | ||
| 411 | ) | ||
| 412 | ) | ||
| 413 | return samples | ||
| 414 | |||
| 415 | |||
| 416 | def _build_shared_chorus_samples( | ||
| 417 | profiles: list[LyricProfile], | ||
| 418 | output_dir: Path, | ||
| 419 | csv_base: Path, | ||
| 420 | rng: random.Random, | ||
| 421 | *, | ||
| 422 | start_index: int, | ||
| 423 | ) -> list[GeneratedSample]: | ||
| 424 | samples: list[GeneratedSample] = [] | ||
| 425 | for offset, profile in enumerate(profiles): | ||
| 426 | lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines) | ||
| 427 | repeated = _repeated_or_sampled_lines(profile.normalized, rng) | ||
| 428 | text = "\n".join( | ||
| 429 | [ | ||
| 430 | "清晨的光落在新的街口", | ||
| 431 | "我把故事重新写给以后", | ||
| 432 | *repeated, | ||
| 433 | *repeated, | ||
| 434 | "所有答案都从这里开始", | ||
| 435 | ] | ||
| 436 | ) | ||
| 437 | index = start_index + offset | ||
| 438 | path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_shared_chorus.txt", text) | ||
| 439 | samples.append( | ||
| 440 | _sample_from_profile( | ||
| 441 | index, | ||
| 442 | path, | ||
| 443 | csv_base, | ||
| 444 | "不应去重", | ||
| 445 | "negative_shared_chorus", | ||
| 446 | profile, | ||
| 447 | notes="shared repeated lines with new surrounding content", | ||
| 448 | ) | ||
| 449 | ) | ||
| 450 | return samples | ||
| 451 | |||
| 452 | |||
| 453 | def _build_translation_only_samples( | ||
| 454 | profiles: list[LyricProfile], | ||
| 455 | output_dir: Path, | ||
| 456 | csv_base: Path, | ||
| 457 | rng: random.Random, | ||
| 458 | *, | ||
| 459 | start_index: int, | ||
| 460 | ) -> list[GeneratedSample]: | ||
| 461 | samples: list[GeneratedSample] = [] | ||
| 462 | for offset, profile in enumerate(profiles): | ||
| 463 | lines = list(profile.normalized.translation_lines) or [ | ||
| 464 | _pseudo_translation(idx) for idx in range(1, min(8, max(profile.line_count, 4)) + 1) | ||
| 465 | ] | ||
| 466 | rng.shuffle(lines) | ||
| 467 | text = "\n".join(lines[:8]) | ||
| 468 | index = start_index + offset | ||
| 469 | path = _write_sample_file(output_dir, f"neg_{index:05d}_negative_translation_only.txt", text) | ||
| 470 | samples.append( | ||
| 471 | _sample_from_profile( | ||
| 472 | index, | ||
| 473 | path, | ||
| 474 | csv_base, | ||
| 475 | "不应去重", | ||
| 476 | "negative_translation_only", | ||
| 477 | profile, | ||
| 478 | notes="translation-like text without matching original lyric", | ||
| 479 | ) | ||
| 480 | ) | ||
| 481 | return samples | ||
| 482 | |||
| 483 | |||
| 484 | def _build_edge_samples( | ||
| 485 | profiles: list[LyricProfile], | ||
| 486 | output_dir: Path, | ||
| 487 | csv_base: Path, | ||
| 488 | rng: random.Random, | ||
| 489 | *, | ||
| 490 | start_index: int, | ||
| 491 | ) -> list[GeneratedSample]: | ||
| 492 | samples: list[GeneratedSample] = [] | ||
| 493 | for offset, profile in enumerate(profiles): | ||
| 494 | lines = list(profile.normalized.primary_lines or profile.normalized.unique_lines) | ||
| 495 | if profile.line_count <= 1: | ||
| 496 | text = _same_theme_synthetic(start_index + offset, rng) | ||
| 497 | notes = "zero or one effective line; use synthetic edge negative" | ||
| 498 | else: | ||
| 499 | text = _short_shared_snippet(lines, rng) | ||
| 500 | notes = "short lyric edge case with limited overlap" | ||
| 501 | index = start_index + offset | ||
| 502 | path = _write_sample_file(output_dir, f"neg_{index:05d}_edge_short_or_placeholder.txt", text) | ||
| 503 | samples.append( | ||
| 504 | _sample_from_profile( | ||
| 505 | index, | ||
| 506 | path, | ||
| 507 | csv_base, | ||
| 508 | "不应去重", | ||
| 509 | "edge_short_or_placeholder", | ||
| 510 | profile, | ||
| 511 | notes=notes, | ||
| 512 | ) | ||
| 513 | ) | ||
| 514 | return samples | ||
| 515 | |||
| 516 | |||
| 517 | def _sample_from_profile( | ||
| 518 | index: int, | ||
| 519 | path: Path, | ||
| 520 | csv_base: Path, | ||
| 521 | expected: str, | ||
| 522 | sample_type: str, | ||
| 523 | profile: LyricProfile, | ||
| 524 | *, | ||
| 525 | candidate_record_id: str = "", | ||
| 526 | notes: str = "", | ||
| 527 | ) -> GeneratedSample: | ||
| 100 | return GeneratedSample( | 528 | return GeneratedSample( |
| 101 | sample_id=f"pos-{index:03d}", | 529 | sample_id=f"sample-{index:05d}", |
| 102 | file=str(path.relative_to(csv_base)), | 530 | file=str(path.relative_to(csv_base)), |
| 103 | expected="应去重", | 531 | expected=expected, |
| 104 | sample_type=sample_type, | 532 | sample_type=sample_type, |
| 105 | source=str(source), | 533 | source=str(profile.path), |
| 106 | title=source_record.title or "", | 534 | source_record_id=profile.record_id, |
| 107 | artist=source_record.artist or "", | 535 | candidate_record_id=candidate_record_id, |
| 536 | line_count_bucket=profile.line_count_bucket, | ||
| 537 | language_bucket=profile.language_bucket, | ||
| 538 | source_bucket=profile.source_bucket, | ||
| 539 | title=profile.title, | ||
| 540 | artist=profile.artist, | ||
| 541 | notes=notes, | ||
| 108 | ) | 542 | ) |
| 109 | 543 | ||
| 110 | 544 | ||
| 111 | def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample: | 545 | def _write_sample_file(output_dir: Path, name: str, text: str) -> Path: |
| 112 | left_lines = _normalized_lines(left) | ||
| 113 | right_lines = _normalized_lines(right) | ||
| 114 | variants = [ | ||
| 115 | ("single_song_fragment", _single_song_fragment(left_lines)), | ||
| 116 | ("short_shared_snippet", _short_shared_snippet(left_lines, rng)), | ||
| 117 | ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)), | ||
| 118 | ("same_theme_synthetic", _same_theme_synthetic(index)), | ||
| 119 | ("translation_only_like", _translation_only_like(left_lines)), | ||
| 120 | ] | ||
| 121 | sample_type, text = variants[(index - 1) % len(variants)] | ||
| 122 | name = f"neg_{index:03d}_{sample_type}.txt" | ||
| 123 | path = output_dir / name | 546 | path = output_dir / name |
| 124 | path.write_text(text, encoding="utf-8") | 547 | path.write_text(text.strip() + "\n", encoding="utf-8") |
| 125 | return GeneratedSample( | 548 | return path |
| 126 | sample_id=f"neg-{index:03d}", | 549 | |
| 127 | file=str(path.relative_to(csv_base)), | 550 | |
| 128 | expected="不应去重", | 551 | def _write_csv(samples: list[GeneratedSample], csv_path: Path, *, seed: int) -> None: |
| 129 | sample_type=sample_type, | 552 | fieldnames = [ |
| 130 | source=f"{left} | {right}", | 553 | "id", |
| 554 | "file", | ||
| 555 | "expected", | ||
| 556 | "sample_type", | ||
| 557 | "source", | ||
| 558 | "source_record_id", | ||
| 559 | "candidate_record_id", | ||
| 560 | "line_count_bucket", | ||
| 561 | "language_bucket", | ||
| 562 | "source_bucket", | ||
| 563 | "title", | ||
| 564 | "artist", | ||
| 565 | "seed", | ||
| 566 | "notes", | ||
| 567 | ] | ||
| 568 | with csv_path.open("w", encoding="utf-8", newline="") as file: | ||
| 569 | writer = csv.DictWriter(file, fieldnames=fieldnames) | ||
| 570 | writer.writeheader() | ||
| 571 | for sample in samples: | ||
| 572 | writer.writerow( | ||
| 573 | { | ||
| 574 | "id": sample.sample_id, | ||
| 575 | "file": sample.file, | ||
| 576 | "expected": sample.expected, | ||
| 577 | "sample_type": sample.sample_type, | ||
| 578 | "source": sample.source, | ||
| 579 | "source_record_id": sample.source_record_id, | ||
| 580 | "candidate_record_id": sample.candidate_record_id, | ||
| 581 | "line_count_bucket": sample.line_count_bucket, | ||
| 582 | "language_bucket": sample.language_bucket, | ||
| 583 | "source_bucket": sample.source_bucket, | ||
| 584 | "title": sample.title, | ||
| 585 | "artist": sample.artist, | ||
| 586 | "seed": seed, | ||
| 587 | "notes": sample.notes, | ||
| 588 | } | ||
| 589 | ) | ||
| 590 | |||
| 591 | |||
| 592 | def _write_manifest( | ||
| 593 | *, | ||
| 594 | profiles: list[LyricProfile], | ||
| 595 | samples: list[GeneratedSample], | ||
| 596 | csv_path: Path, | ||
| 597 | output_dir: Path, | ||
| 598 | seed: int, | ||
| 599 | plan: dict[str, int], | ||
| 600 | index_path: Path | None, | ||
| 601 | ) -> dict[str, object]: | ||
| 602 | manifest = { | ||
| 603 | "seed": seed, | ||
| 604 | "library_files": len(profiles), | ||
| 605 | "sample_size": len(samples), | ||
| 606 | "plan": plan, | ||
| 607 | "index": str(index_path) if index_path else "", | ||
| 608 | "lyrics_dir": str(output_dir), | ||
| 609 | "csv": str(csv_path), | ||
| 610 | "manifest": str(csv_path.with_suffix(csv_path.suffix + ".manifest.json")), | ||
| 611 | "sample_type_counts": dict(Counter(sample.sample_type for sample in samples)), | ||
| 612 | "expected_counts": dict(Counter(sample.expected for sample in samples)), | ||
| 613 | "line_count_bucket_counts": dict(Counter(profile.line_count_bucket for profile in profiles)), | ||
| 614 | "language_bucket_counts": dict(Counter(profile.language_bucket for profile in profiles)), | ||
| 615 | "source_bucket_counts": dict(Counter(profile.source_bucket for profile in profiles).most_common(50)), | ||
| 616 | "unique_source_records": len({sample.source_record_id for sample in samples if sample.source_record_id}), | ||
| 617 | } | ||
| 618 | csv_path.with_suffix(csv_path.suffix + ".manifest.json").write_text( | ||
| 619 | json.dumps(manifest, ensure_ascii=False, indent=2), | ||
| 620 | encoding="utf-8", | ||
| 131 | ) | 621 | ) |
| 622 | return manifest | ||
| 132 | 623 | ||
| 133 | 624 | ||
| 134 | def _content_lines(text: str) -> list[str]: | 625 | def _content_lines(text: str) -> list[str]: |
| ... | @@ -142,9 +633,40 @@ def _clean_generated_output_dir(output_dir: Path) -> None: | ... | @@ -142,9 +633,40 @@ def _clean_generated_output_dir(output_dir: Path) -> None: |
| 142 | path.unlink() | 633 | path.unlink() |
| 143 | 634 | ||
| 144 | 635 | ||
| 145 | def _normalized_lines(path: Path) -> list[str]: | 636 | def _line_count_bucket(line_count: int) -> str: |
| 146 | normalized = normalize_lyrics(read_lyric_file(path)) | 637 | if line_count == 0: |
| 147 | return list(normalized.primary_lines or normalized.unique_lines) | 638 | return "zero" |
| 639 | if line_count <= 5: | ||
| 640 | return "short" | ||
| 641 | if line_count <= 40: | ||
| 642 | return "normal" | ||
| 643 | return "long" | ||
| 644 | |||
| 645 | |||
| 646 | def _language_bucket(lines: tuple[str, ...]) -> str: | ||
| 647 | text = "\n".join(lines) | ||
| 648 | cjk = len(re.findall(r"[\u4e00-\u9fff]", text)) | ||
| 649 | latin = len(re.findall(r"[A-Za-z]", text)) | ||
| 650 | kana = len(re.findall(r"[\u3040-\u30ff]", text)) | ||
| 651 | hangul = len(re.findall(r"[\uac00-\ud7af]", text)) | ||
| 652 | if kana or hangul: | ||
| 653 | return "jp_kr" | ||
| 654 | if cjk and latin: | ||
| 655 | return "mixed" | ||
| 656 | if cjk: | ||
| 657 | return "zh" | ||
| 658 | if latin: | ||
| 659 | return "latin" | ||
| 660 | return "other" | ||
| 661 | |||
| 662 | |||
| 663 | def _source_bucket(path: Path) -> str: | ||
| 664 | stem = path.stem | ||
| 665 | parts = stem.split("_") | ||
| 666 | if len(parts) >= 2: | ||
| 667 | code = re.sub(r"\d+$", "", parts[-1]) | ||
| 668 | return code or "unknown" | ||
| 669 | return "unknown" | ||
| 148 | 670 | ||
| 149 | 671 | ||
| 150 | def _add_timestamps(lines: list[str]) -> str: | 672 | def _add_timestamps(lines: list[str]) -> str: |
| ... | @@ -169,6 +691,17 @@ def _add_blank_line_noise(lines: list[str]) -> str: | ... | @@ -169,6 +691,17 @@ def _add_blank_line_noise(lines: list[str]) -> str: |
| 169 | return "\n".join(result) | 691 | return "\n".join(result) |
| 170 | 692 | ||
| 171 | 693 | ||
| 694 | def _change_repeated_line_counts(lines: list[str]) -> str: | ||
| 695 | seen: set[str] = set() | ||
| 696 | result: list[str] = [] | ||
| 697 | for line in lines: | ||
| 698 | if line in seen: | ||
| 699 | continue | ||
| 700 | seen.add(line) | ||
| 701 | result.append(line) | ||
| 702 | return "\n".join(result or lines) | ||
| 703 | |||
| 704 | |||
| 172 | def _translation_added(lines: list[str]) -> str: | 705 | def _translation_added(lines: list[str]) -> str: |
| 173 | result: list[str] = [] | 706 | result: list[str] = [] |
| 174 | for idx, line in enumerate(lines, start=1): | 707 | for idx, line in enumerate(lines, start=1): |
| ... | @@ -178,11 +711,11 @@ def _translation_added(lines: list[str]) -> str: | ... | @@ -178,11 +711,11 @@ def _translation_added(lines: list[str]) -> str: |
| 178 | return "\n".join(result) | 711 | return "\n".join(result) |
| 179 | 712 | ||
| 180 | 713 | ||
| 181 | def _single_song_fragment(lines: list[str]) -> str: | 714 | def _single_song_fragment(lines: list[str], rng: random.Random) -> str: |
| 182 | if len(lines) <= 4: | 715 | if len(lines) <= 4: |
| 183 | return "\n".join(lines[: max(1, len(lines) // 2)]) | 716 | return "\n".join(lines[: max(1, len(lines) // 2)]) |
| 184 | fragment_len = max(2, min(8, len(lines) // 4)) | 717 | fragment_len = max(2, min(8, len(lines) // rng.choice([3, 4, 5]))) |
| 185 | start = max(0, (len(lines) - fragment_len) // 2) | 718 | start = rng.randrange(0, max(1, len(lines) - fragment_len + 1)) |
| 186 | return "\n".join(lines[start : start + fragment_len]) | 719 | return "\n".join(lines[start : start + fragment_len]) |
| 187 | 720 | ||
| 188 | 721 | ||
| ... | @@ -198,29 +731,26 @@ def _short_shared_snippet(lines: list[str], rng: random.Random) -> str: | ... | @@ -198,29 +731,26 @@ def _short_shared_snippet(lines: list[str], rng: random.Random) -> str: |
| 198 | return "\n".join(synthetic) | 731 | return "\n".join(synthetic) |
| 199 | 732 | ||
| 200 | 733 | ||
| 201 | def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str: | 734 | def _repeated_or_sampled_lines(normalized: NormalizedLyrics, rng: random.Random) -> list[str]: |
| 202 | left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else [] | 735 | repeated = [line for line, count in normalized.line_counts.items() if count >= 2] |
| 203 | right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else [] | 736 | if repeated: |
| 204 | filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"] | 737 | return rng.sample(repeated, k=min(2, len(repeated))) |
| 205 | return "\n".join([*left_pick, *filler, *right_pick]) | 738 | lines = list(normalized.primary_lines or normalized.unique_lines) |
| 206 | 739 | return rng.sample(lines, k=min(2, len(lines))) if lines else [] | |
| 207 | 740 | ||
| 208 | def _same_theme_synthetic(index: int) -> str: | 741 | |
| 209 | themes = [ | 742 | def _same_theme_synthetic(index: int, rng: random.Random) -> str: |
| 210 | "我在夜里想起远方的你", | 743 | starts = ["我在夜里想起远方的你", "城市灯火陪我走过雨季", "风把旧名字吹向清晨"] |
| 211 | "城市灯火陪我走过雨季", | 744 | middles = ["那些没说完的话留在风里", "新的路口慢慢亮起", "时间把答案交给下一站"] |
| 212 | "那些没说完的话留在风里", | 745 | ends = ["明天醒来我们各自继续", "我会把今天写成新的旋律", "故事从这里重新开始"] |
| 213 | "明天醒来我们各自继续", | 746 | return "\n".join( |
| 214 | f"这是第 {index} 个全新测试样本", | 747 | [ |
| 215 | ] | 748 | rng.choice(starts), |
| 216 | return "\n".join(themes) | 749 | rng.choice(middles), |
| 217 | 750 | rng.choice(ends), | |
| 218 | 751 | f"这是第 {index} 个全新测试样本", | |
| 219 | def _translation_only_like(lines: list[str]) -> str: | 752 | ] |
| 220 | foreign_count = sum(1 for line in lines if _looks_foreign(line)) | 753 | ) |
| 221 | if foreign_count < 2: | ||
| 222 | return _same_theme_synthetic(foreign_count + len(lines)) | ||
| 223 | return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1)) | ||
| 224 | 754 | ||
| 225 | 755 | ||
| 226 | def _pseudo_translation(index: int) -> str: | 756 | def _pseudo_translation(index: int) -> str: | ... | ... |
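
The stratified sampling in this diff depends on three bucketing helpers: `_line_count_bucket`, `_language_bucket`, and `_source_bucket`. The sketch below copies their bodies from the new-side lines above (leading underscores dropped) so the bucket assignment can be checked in isolation. The file names and lyric strings are made up for illustration, and this is not the full generator.

```python
import re
from pathlib import Path


def line_count_bucket(line_count: int) -> str:
    # Thresholds as in the diff: 0 -> zero, 1-5 -> short, 6-40 -> normal, 41+ -> long
    if line_count == 0:
        return "zero"
    if line_count <= 5:
        return "short"
    if line_count <= 40:
        return "normal"
    return "long"


def language_bucket(lines: tuple[str, ...]) -> str:
    text = "\n".join(lines)
    cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
    latin = len(re.findall(r"[A-Za-z]", text))
    kana = len(re.findall(r"[\u3040-\u30ff]", text))
    hangul = len(re.findall(r"[\uac00-\ud7af]", text))
    # Kana or Hangul wins over everything else, so the check order matters
    if kana or hangul:
        return "jp_kr"
    if cjk and latin:
        return "mixed"
    if cjk:
        return "zh"
    if latin:
        return "latin"
    return "other"


def source_bucket(path: Path) -> str:
    # Takes the last "_"-separated part of the stem and strips trailing digits
    parts = path.stem.split("_")
    if len(parts) >= 2:
        code = re.sub(r"\d+$", "", parts[-1])
        return code or "unknown"
    return "unknown"


# Hypothetical inputs, shaped like the file names used in the test further down
for name in ["0_AY000000.txt", "1_WHHY000001.txt", "song.txt"]:
    print(name, "->", source_bucket(Path(name)))
for lyric in [("我爱你",), ("I love you",), ("我爱你 baby",), ("ありがとう",)]:
    print(lyric[0], "->", language_bucket(lyric))
```

One consequence of the check order: a Japanese lyric written only in kanji contains no kana, so it lands in `zh` rather than `jp_kr`.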
| ... | @@ -77,6 +77,7 @@ def main() -> None: | ... | @@ -77,6 +77,7 @@ def main() -> None: |
| 77 | csv_path=Path(args.eval_csv), | 77 | csv_path=Path(args.eval_csv), |
| 78 | size=args.eval_size, | 78 | size=args.eval_size, |
| 79 | positive_ratio=args.positive_ratio, | 79 | positive_ratio=args.positive_ratio, |
| 80 | index_path=Path(args.index), | ||
| 80 | ) | 81 | ) |
| 81 | evaluate_csv( | 82 | evaluate_csv( |
| 82 | Path(args.index), | 83 | Path(args.index), | ... | ... |
| 1 | import csv | 1 | import csv |
| 2 | import json | ||
| 2 | 3 | ||
| 3 | from lyric_dedup import DuplicateChecker | 4 | from lyric_dedup import DuplicateChecker |
| 4 | from lyric_dedup import DuplicateDecision | 5 | from lyric_dedup import DuplicateDecision |
| ... | @@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: | ... | @@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: |
| 285 | assert (tmp_path / "eval_out.csv.summary.json").exists() | 286 | assert (tmp_path / "eval_out.csv.summary.json").exists() |
| 286 | 287 | ||
| 287 | 288 | ||
| 288 | def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None: | 289 | def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: |
| 289 | library = tmp_path / "library" | 290 | library = tmp_path / "library" |
| 290 | incoming = tmp_path / "generated" / "incoming" | 291 | incoming = tmp_path / "generated" / "incoming" |
| 291 | eval_csv = tmp_path / "generated" / "eval.csv" | 292 | eval_csv = tmp_path / "generated" / "eval.csv" |
| 292 | library.mkdir() | 293 | library.mkdir() |
| 293 | (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8") | 294 | for idx in range(12): |
| 295 | prefix = "AY" if idx % 2 == 0 else "WHHY" | ||
| 296 | (library / f"{idx}_{prefix}{idx:06d}.txt").write_text( | ||
| 297 | BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}"), | ||
| 298 | encoding="utf-8", | ||
| 299 | ) | ||
| 294 | 300 | ||
| 295 | generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5) | 301 | generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=30, positive_ratio=0.3) |
| 296 | 302 | ||
| 297 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) | 303 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) |
| 298 | positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"} | 304 | manifest = json.loads((tmp_path / "generated" / "eval.csv.manifest.json").read_text(encoding="utf-8")) |
| 299 | fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"] | 305 | negative_types = {row["sample_type"] for row in rows if row["expected"] == "不应去重"} |
| 300 | 306 | ||
| 301 | assert "trimmed_version" not in positive_types | 307 | assert len(rows) == 30 |
| 302 | assert "single_song_fragment" not in positive_types | 308 | assert manifest["library_files"] == 12 |
| 303 | assert fragment_rows | 309 | assert manifest["sample_size"] == 30 |
| 304 | assert all(row["expected"] == "不应去重" for row in fragment_rows) | 310 | assert manifest["unique_source_records"] > 1 |
| 311 | assert "positive_full_duplicate" in manifest["plan"] | ||
| 312 | assert "negative_fragment" in negative_types | ||
| 313 | assert "negative_hard_candidate" in negative_types | ||
| 314 | assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_")) | ||
| 305 | 315 | ||
| 306 | 316 | ||
| 307 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | 317 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | ... | ... |
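
The updated test reads the manifest from `eval.csv.manifest.json`. That name follows from the `csv_path.with_suffix(csv_path.suffix + ".manifest.json")` expression in `_write_manifest`. The sketch below reproduces the path derivation and reads back one ratio. `manifest_path_for` and `positive_share` are hypothetical helper names, and the manifest content, including the seed, is invented for the demo. Only the `expected_counts` key and the `应去重` ("should deduplicate") and `不应去重` ("should not deduplicate") labels come from the diff.

```python
import json
import tempfile
from pathlib import Path


def manifest_path_for(csv_path: Path) -> Path:
    # Same expression as in _write_manifest: extend the existing ".csv" suffix
    return csv_path.with_suffix(csv_path.suffix + ".manifest.json")


def positive_share(manifest: dict) -> float:
    # Fraction of samples labelled "应去重" according to expected_counts
    counts = manifest.get("expected_counts", {})
    total = sum(counts.values())
    return counts.get("应去重", 0) / total if total else 0.0


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "eval.csv"
    path = manifest_path_for(csv_path)
    # Invented manifest content, shaped like a subset of the real one
    demo = {
        "seed": 42,
        "sample_size": 30,
        "expected_counts": {"应去重": 9, "不应去重": 21},
    }
    path.write_text(json.dumps(demo, ensure_ascii=False, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))
    manifest_name = path.name
    share = positive_share(loaded)

print(manifest_name, share)
```

Comparing such a ratio against the requested `--positive-ratio` is one way to sanity-check a generated set, assuming the generator is expected to honour the ratio exactly.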