Update the test-set generation workflow for large samples
Showing 6 changed files with 71 additions and 33 deletions
| ... | @@ -78,16 +78,20 @@ Key columns to check in the CSV: | ... | @@ -78,16 +78,20 @@ Key columns to check in the CSV: |
| 78 | python -m lyric_dedup.cli generate-eval-set \ | 78 | python -m lyric_dedup.cli generate-eval-set \ |
| 79 | --library-dir data/library \ | 79 | --library-dir data/library \ |
| 80 | --lyrics-dir data/generated_eval/incoming \ | 80 | --lyrics-dir data/generated_eval/incoming \ |
| 81 | --csv data/generated_eval/eval_10.csv \ | 81 | --csv data/generated_eval/eval_50000.csv \ |
| 82 | --size 10 \ | 82 | --index outputs/indexes/lyrics.pkl \ |
| 83 | --positive-ratio 0.6 | 83 | --size 50000 \ |
| 84 | --positive-ratio 0.3 | ||
| 84 | ``` | 85 | ``` |
| 85 | 86 | ||
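| 86 | Business rules for the generator: | 87 | Business rules for the generator: |
| 87 | 88 | ||
| 88 | - `应去重` (should dedupe) samples are generated only as style variations of the full song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, LRC styling, appended Chinese translation. | 89 | - The whole library is scanned first and sampled in strata by valid lyric line count, language type, and file-source prefix; samples are no longer taken from a sorted prefix. |
| 89 | - `不应去重` (should not dedupe) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity. | 90 | - `应去重` (should dedupe) samples are generated only as style variations of the full song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, changes in the number of chorus repeats, appended Chinese translation. |
| 91 | - `不应去重` (should not dedupe) samples include new lyrics on the same theme, hard negatives, lyric fragments, repeated-chorus collisions, translation-only similarity, and short-lyric/placeholder edge cases. | ||
| 90 | - Even when a lyric fragment matches part of an existing song, it should not be output as `duplicate`; at most it goes to `review`. | 92 | - Even when a lyric fragment matches part of an existing song, it should not be output as `duplicate`; at most it goes to `review`. |
| 93 | - If `--index` is passed, the generator uses the existing index to build hard negatives that are closer to the recall risk seen in production. | ||
| 94 | - A `*.manifest.json` file is also written, recording the seed, library size, sample-type distribution, language/source buckets, and the number of source files covered. | ||
| 91 | 95 | ||
| 92 | First prepare a CSV, for example `data/eval/eval.csv`: | 96 | First prepare a CSV, for example `data/eval/eval.csv`: |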
| 93 | 97 | ... | ... |
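The stratified sampling described in the README change above could look roughly like the sketch below. It is an illustration only, not the generator's actual code (that diff is collapsed on this page): `stratified_sample`, `line_bucket`, `stratum`, the bucket thresholds, and the record fields are all hypothetical names and values chosen for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, size, seed=0):
    """Pick up to `size` records, cycling round-robin over the strata given by `key`."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for record in records:
        buckets[key(record)].append(record)
    names = sorted(buckets)
    # Shuffle inside each stratum so the pick does not depend on file order.
    for name in names:
        rng.shuffle(buckets[name])
    picked = []
    while len(picked) < size and any(buckets[name] for name in names):
        for name in names:
            if buckets[name] and len(picked) < size:
                picked.append(buckets[name].pop())
    return picked


def line_bucket(line_count):
    # Hypothetical thresholds; the real generator may bucket differently.
    if line_count < 10:
        return "short"
    if line_count < 40:
        return "medium"
    return "long"


def stratum(record):
    # Assumed stratum: (line-count bucket, language, file-source prefix).
    prefix = "AY" if "_AY" in record["name"] else "WHHY"
    return (line_bucket(record["lines"]), record["lang"], prefix)


# Toy library whose file names mimic the ones used in the test in this change.
records = [
    {
        "name": f"{i}_{'AY' if i % 2 == 0 else 'WHHY'}{i:06d}.txt",
        "lines": 5 + i * 5,
        "lang": "zh" if i % 3 else "en",
    }
    for i in range(12)
]

sample = stratified_sample(records, stratum, size=6, seed=20260602)
```

Cycling over the strata keeps small buckets represented even when one bucket dominates the library, which is the failure mode of sampling from a sorted prefix.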
| ... | @@ -67,10 +67,10 @@ python scripts/process_library.py \ | ... | @@ -67,10 +67,10 @@ python scripts/process_library.py \ |
| 67 | python scripts/process_library.py \ | 67 | python scripts/process_library.py \ |
| 68 | --library-dir data/library \ | 68 | --library-dir data/library \ |
| 69 | --index outputs/indexes/library_lyrics.pkl \ | 69 | --index outputs/indexes/library_lyrics.pkl \ |
| 70 | --eval-size 1180 \ | 70 | --eval-size 50000 \ |
| 71 | --positive-ratio 0.2 \ | 71 | --positive-ratio 0.3 \ |
| 72 | --eval-csv data/generated_eval/eval_1180.csv \ | 72 | --eval-csv data/generated_eval/eval_50000.csv \ |
| 73 | --eval-out outputs/results/library_eval_1180.csv | 73 | --eval-out outputs/results/library_eval_50000.csv |
| 74 | ``` | 74 | ``` |
| 75 | 75 | ||
| 76 | By default, quarantined files are moved to: | 76 | By default, quarantined files are moved to: |
| ... | @@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl | ... | @@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl |
| 95 | 95 | ||
| 96 | Note: if you change `data/library`, or change the preprocessing/dedup logic, rerun this step. | 96 | Note: if you change `data/library`, or change the preprocessing/dedup logic, rerun this step. |
| 97 | 97 | ||
| 98 | ## 3. Generate 100 test samples | 98 | ## 3. Generate production evaluation samples |
| 99 | 99 | ||
| 100 | ```bash | 100 | ```bash |
| 101 | python -m lyric_dedup.cli generate-eval-set \ | 101 | python -m lyric_dedup.cli generate-eval-set \ |
| 102 | --library-dir data/library \ | 102 | --library-dir data/library \ |
| 103 | --lyrics-dir data/generated_eval/incoming \ | 103 | --lyrics-dir data/generated_eval/incoming \ |
| 104 | --csv data/generated_eval/eval_500.csv \ | 104 | --csv data/generated_eval/eval_50000.csv \ |
| 105 | --size 500 \ | 105 | --index outputs/indexes/library_lyrics.pkl \ |
| 106 | --positive-ratio 0.2 | 106 | --size 50000 \ |
| 107 | --positive-ratio 0.3 | ||
| 107 | ``` | 108 | ``` |
| 108 | 109 | ||
| 109 | Generated by default: | 110 | Default production evaluation mix: |
| 110 | 111 | ||
| 111 | ```text | 112 | ```text |
| 112 | 应去重 (should dedupe): 60 | 113 | 应去重 (should dedupe): 30% |
| 113 | 不应去重 (should not dedupe): 40 | 114 | 不应去重 (should not dedupe): 70% |
| 114 | ``` | 115 | ``` |
| 115 | 116 | ||
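| 116 | The generator first deletes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. | 117 | The generator first deletes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples. |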
| ... | @@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \ | ... | @@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \ |
| 118 | Business rules: | 119 | Business rules: |
| 119 | 120 | ||
| 120 | ```text | 121 | ```text |
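| 121 | pos_* = should dedupe (应去重); style variations of the full song lyrics | 122 | positive_* = should dedupe (应去重); style variations of the full song lyrics |
| 122 | neg_* = should not dedupe (不应去重); fragments / short-phrase collisions / mixed fragments / new lyrics on the same theme / translation-only similarity | 123 | negative_random_unrelated = should not dedupe (不应去重); new lyrics on the same theme |
| 124 | negative_hard_candidate = should not dedupe; short-phrase / partial-overlap samples that the system tends to recall | ||
| 125 | negative_fragment = should not dedupe; fragment of a single song | ||
| 126 | negative_shared_chorus = should not dedupe; repeated-chorus collision | ||
| 127 | negative_translation_only = should not dedupe; translation-only similarity | ||
| 128 | edge_short_or_placeholder = should not dedupe; short-lyric / placeholder edge cases | ||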
| 129 | ``` | ||
| 130 | |||
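| 131 | The generator scans the whole library and samples in strata by valid lyric line count, language type, and file-source prefix. When `--index` is passed, it uses the existing index to generate hard negatives. Each run also writes: | ||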
| 132 | |||
| 133 | ```text | ||
| 134 | data/generated_eval/eval_50000.csv.manifest.json | ||
| 135 | ``` | ||
| 136 | |||
| 137 | Key fields to check in the manifest: | ||
| 138 | |||
| 139 | ```text | ||
| 140 | library_files number of lyric files in the library | ||
| 141 | sample_type_counts number of samples of each type | ||
| 142 | line_count_bucket_counts / language_bucket_counts / source_bucket_counts | ||
| 143 | unique_source_records how many real source files this evaluation covers | ||
| 123 | ``` | 144 | ``` |
| 124 | 145 | ||
| 125 | ## 4. Strict evaluation: count only duplicate as deduped | 146 | ## 4. Strict evaluation: count only duplicate as deduped |
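| ... | @@ -127,9 +148,9 @@ neg_* = should not dedupe (不应去重); fragments/short-phrase collisions/mixed fragments/new lyrics on the same theme/仅 | ... | @@ -127,9 +148,9 @@ neg_* = should not dedupe (不应去重); fragments/short-phrase collisions/mixed fragments/new lyrics on the same theme/仅 |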
| 127 | ```bash | 148 | ```bash |
| 128 | python -m lyric_dedup.cli evaluate-csv \ | 149 | python -m lyric_dedup.cli evaluate-csv \ |
| 129 | --index outputs/indexes/library_lyrics.pkl \ | 150 | --index outputs/indexes/library_lyrics.pkl \ |
| 130 | --csv data/generated_eval/eval_500.csv \ | 151 | --csv data/generated_eval/eval_50000.csv \ |
| 131 | --base-dir data/generated_eval \ | 152 | --base-dir data/generated_eval \ |
| 132 | --out outputs/results/library_eval_500.csv | 153 | --out outputs/results/library_eval_50000.csv |
| 133 | ``` | 154 | ``` |
| 134 | 155 | ||
| 135 | Under this rule: | 156 | Under this rule: |
| ... | @@ -151,10 +172,10 @@ false_positive | ... | @@ -151,10 +172,10 @@ false_positive |
| 151 | ```bash | 172 | ```bash |
| 152 | python -m lyric_dedup.cli evaluate-csv \ | 173 | python -m lyric_dedup.cli evaluate-csv \ |
| 153 | --index outputs/indexes/library_lyrics.pkl \ | 174 | --index outputs/indexes/library_lyrics.pkl \ |
| 154 | --csv data/generated_eval/eval_500.csv \ | 175 | --csv data/generated_eval/eval_50000.csv \ |
| 155 | --base-dir data/generated_eval \ | 176 | --base-dir data/generated_eval \ |
| 156 | --positive-decisions duplicate,review \ | 177 | --positive-decisions duplicate,review \ |
| 157 | --out outputs/results/library_eval_500_review_positive.csv | 178 | --out outputs/results/library_eval_50000_review_positive.csv |
| 158 | ``` | 179 | ``` |
| 159 | 180 | ||
| 160 | Under this rule: | 181 | Under this rule: | ... | ... |
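To make the difference between the strict run and the `--positive-decisions duplicate,review` run concrete, here is a minimal sketch of how the same rows score under each rule. It is not the `evaluate-csv` implementation: `binary_metrics` is a hypothetical helper, the four rows are made up, and `unique` is a placeholder for whatever non-duplicate decision the checker actually emits.

```python
def binary_metrics(rows, positive_decisions):
    """Score (expected, decision) pairs, treating `positive_decisions` as 'deduped'."""
    tp = fp = fn = tn = 0
    for expected, decision in rows:
        actual = expected == "应去重"  # label used in the generated CSV
        predicted = decision in positive_decisions
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Made-up rows; "unique" is a placeholder decision name, not taken from the source.
rows = [
    ("应去重", "duplicate"),
    ("应去重", "review"),
    ("不应去重", "review"),
    ("不应去重", "unique"),
]

strict = binary_metrics(rows, {"duplicate"})
lenient = binary_metrics(rows, {"duplicate", "review"})
```

On these rows the strict rule trades recall for precision, and counting `review` as positive does the opposite, which is why the two runs write to separate output files.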
| ... | @@ -48,8 +48,9 @@ def main() -> None: | ... | @@ -48,8 +48,9 @@ def main() -> None: |
| 48 | generate.add_argument("--lyrics-dir", required=True) | 48 | generate.add_argument("--lyrics-dir", required=True) |
| 49 | generate.add_argument("--csv", required=True) | 49 | generate.add_argument("--csv", required=True) |
| 50 | generate.add_argument("--size", type=int, default=100) | 50 | generate.add_argument("--size", type=int, default=100) |
| 51 | generate.add_argument("--positive-ratio", type=float, default=0.6) | 51 | generate.add_argument("--positive-ratio", type=float, default=0.3) |
| 52 | generate.add_argument("--seed", type=int, default=20260602) | 52 | generate.add_argument("--seed", type=int, default=20260602) |
| 53 | generate.add_argument("--index", default="", help="optional existing index for hard-negative generation") | ||
| 53 | 54 | ||
| 54 | args = parser.parse_args() | 55 | args = parser.parse_args() |
| 55 | if args.command == "build-index": | 56 | if args.command == "build-index": |
| ... | @@ -75,6 +76,7 @@ def main() -> None: | ... | @@ -75,6 +76,7 @@ def main() -> None: |
| 75 | size=args.size, | 76 | size=args.size, |
| 76 | positive_ratio=args.positive_ratio, | 77 | positive_ratio=args.positive_ratio, |
| 77 | seed=args.seed, | 78 | seed=args.seed, |
| 79 | index_path=Path(args.index) if args.index else None, | ||
| 78 | ) | 80 | ) |
| 79 | print(json.dumps(summary, ensure_ascii=False)) | 81 | print(json.dumps(summary, ensure_ascii=False)) |
| 80 | 82 | ... | ... |
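The new `--index` flag uses an empty-string default that is turned into `None` before it reaches the generator. A stand-alone sketch of that pattern follows; `parse_index` is a hypothetical wrapper written for illustration, while the `add_argument` call mirrors the one in the hunk above.

```python
import argparse
from pathlib import Path


def parse_index(argv):
    """Return the index path as a Path, or None when --index is omitted."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--index", default="",
                        help="optional existing index for hard-negative generation")
    args = parser.parse_args(argv)
    # An empty string is falsy, so an omitted flag maps to None.
    return Path(args.index) if args.index else None


without_index = parse_index([])
with_index = parse_index(["--index", "outputs/indexes/library_lyrics.pkl"])
```

Passing `None` rather than `Path("")` matters because `Path("")` resolves to the current directory, which would look like a real path to the callee.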
| ... | @@ -77,6 +77,7 @@ def main() -> None: | ... | @@ -77,6 +77,7 @@ def main() -> None: |
| 77 | csv_path=Path(args.eval_csv), | 77 | csv_path=Path(args.eval_csv), |
| 78 | size=args.eval_size, | 78 | size=args.eval_size, |
| 79 | positive_ratio=args.positive_ratio, | 79 | positive_ratio=args.positive_ratio, |
| 80 | index_path=Path(args.index), | ||
| 80 | ) | 81 | ) |
| 81 | evaluate_csv( | 82 | evaluate_csv( |
| 82 | Path(args.index), | 83 | Path(args.index), | ... | ... |
| 1 | import csv | 1 | import csv |
| 2 | import json | ||
| 2 | 3 | ||
| 3 | from lyric_dedup import DuplicateChecker | 4 | from lyric_dedup import DuplicateChecker |
| 4 | from lyric_dedup import DuplicateDecision | 5 | from lyric_dedup import DuplicateDecision |
| ... | @@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: | ... | @@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None: |
| 285 | assert (tmp_path / "eval_out.csv.summary.json").exists() | 286 | assert (tmp_path / "eval_out.csv.summary.json").exists() |
| 286 | 287 | ||
| 287 | 288 | ||
| 288 | def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None: | 289 | def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None: |
| 289 | library = tmp_path / "library" | 290 | library = tmp_path / "library" |
| 290 | incoming = tmp_path / "generated" / "incoming" | 291 | incoming = tmp_path / "generated" / "incoming" |
| 291 | eval_csv = tmp_path / "generated" / "eval.csv" | 292 | eval_csv = tmp_path / "generated" / "eval.csv" |
| 292 | library.mkdir() | 293 | library.mkdir() |
| 293 | (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8") | 294 | for idx in range(12): |
| 295 | prefix = "AY" if idx % 2 == 0 else "WHHY" | ||
| 296 | (library / f"{idx}_{prefix}{idx:06d}.txt").write_text( | ||
| 297 | BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}"), | ||
| 298 | encoding="utf-8", | ||
| 299 | ) | ||
| 294 | 300 | ||
| 295 | generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5) | 301 | generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=30, positive_ratio=0.3) |
| 296 | 302 | ||
| 297 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) | 303 | rows = list(csv.DictReader(eval_csv.open(encoding="utf-8"))) |
| 298 | positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"} | 304 | manifest = json.loads((tmp_path / "generated" / "eval.csv.manifest.json").read_text(encoding="utf-8")) |
| 299 | fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"] | 305 | negative_types = {row["sample_type"] for row in rows if row["expected"] == "不应去重"} |
| 300 | 306 | ||
| 301 | assert "trimmed_version" not in positive_types | 307 | assert len(rows) == 30 |
| 302 | assert "single_song_fragment" not in positive_types | 308 | assert manifest["library_files"] == 12 |
| 303 | assert fragment_rows | 309 | assert manifest["sample_size"] == 30 |
| 304 | assert all(row["expected"] == "不应去重" for row in fragment_rows) | 310 | assert manifest["unique_source_records"] > 1 |
| 311 | assert "positive_full_duplicate" in manifest["plan"] | ||
| 312 | assert "negative_fragment" in negative_types | ||
| 313 | assert "negative_hard_candidate" in negative_types | ||
| 314 | assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_")) | ||
| 305 | 315 | ||
| 306 | 316 | ||
| 307 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | 317 | def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: | ... | ... |
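The manifest checked by the updated test can also be inspected by hand after a large run. The sketch below assumes that `sample_type_counts` is a JSON object mapping sample-type names to integers, which this page does not confirm; `summarize_manifest` is a hypothetical helper and the numbers in the demo manifest are invented.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifest(path):
    """Compute the positive share and source coverage from a manifest file."""
    manifest = json.loads(Path(path).read_text(encoding="utf-8"))
    counts = manifest["sample_type_counts"]  # assumed shape: {type_name: count}
    total = sum(counts.values())
    positive = sum(n for name, n in counts.items() if name.startswith("positive_"))
    return {
        "positive_share": positive / total if total else 0.0,
        "source_coverage": manifest["unique_source_records"] / manifest["library_files"],
    }


# Invented manifest values, written to a temporary file for the demo.
demo = {
    "library_files": 12,
    "unique_source_records": 9,
    "sample_type_counts": {
        "positive_full_duplicate": 9,
        "negative_fragment": 11,
        "negative_hard_candidate": 10,
    },
}
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "eval.csv.manifest.json"
    manifest_path.write_text(json.dumps(demo), encoding="utf-8")
    summary = summarize_manifest(manifest_path)
```

A positive share far from the requested `--positive-ratio`, or low source coverage, would suggest the library is too small or too uniform for the requested sample size.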