沈秋雨 / lyric_rhyme
Commit f8ad329cb556651f2762949f4906fb6200501f89, authored 2026-06-02 22:05:55 +0800 by 沈秋雨
Update the test-set generation workflow for large samples
1 parent 51ddab43
Showing 6 changed files with 71 additions and 33 deletions
README.md
TEST_WORKFLOW.md
lyric_dedup/cli.py
lyric_dedup/eval_dataset.py
scripts/process_library.py
tests/test_lyric_dedup.py
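README.md
@@ -78,16 +78,20 @@ Key columns to check in the CSV:
 python -m lyric_dedup.cli generate-eval-set \
   --library-dir data/library \
   --lyrics-dir data/generated_eval/incoming \
-  --csv data/generated_eval/eval_10.csv \
-  --size 10 \
-  --positive-ratio 0.6
+  --csv data/generated_eval/eval_50000.csv \
+  --index outputs/indexes/lyrics.pkl \
+  --size 50000 \
+  --positive-ratio 0.3
 ```
 How the generator labels samples:
-- `应去重` (should dedupe) samples contain only style variations of full-song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, LRC styling, appended Chinese translation.
-- `不应去重` (should not dedupe) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity.
+- The generator first scans the entire library, then does stratified sampling by valid lyric line count, language type, and file source prefix; it no longer samples by sorted prefix.
+- `应去重` (should dedupe) samples contain only style variations of full-song lyrics, e.g. timestamps, punctuation, platform noise, blank lines, a changed number of chorus repetitions, appended Chinese translation.
+- `不应去重` (should not dedupe) samples include new lyrics on the same theme, hard negatives, lyric fragments, repeated-chorus collisions, translation-only similarity, and short-lyric/placeholder edge samples.
 - A lyric fragment must not be output as `duplicate`, even if it matches part of an existing song; at most it goes to `review`.
+- If `--index` is passed, the generator uses the existing index to build hard negatives that are closer to the online recall risk.
+- A `*.manifest.json` is also written, recording the seed, library size, sample-type distribution, language/source buckets, and the number of source records covered.
 First prepare a CSV, e.g. `data/eval/eval.csv`: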
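The README change above describes stratified sampling by valid lyric line count, language type, and file source prefix. The `lyric_dedup/eval_dataset.py` diff is collapsed on this page, so the following is only a minimal sketch of the idea under stated assumptions: the `LyricRecord` type, the bucket boundaries, and the helpers `line_count_bucket`, `language_bucket`, `source_prefix`, and `stratified_sample` are all hypothetical names, not the project's API.

```python
import random
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LyricRecord:
    """Hypothetical record type; the real project may store different fields."""
    name: str  # file stem, e.g. "3_WHHY000003"
    text: str  # raw lyric text


def line_count_bucket(text: str) -> str:
    # Assumed bucket boundaries, chosen only for illustration.
    n = sum(1 for line in text.splitlines() if line.strip())
    if n < 8:
        return "short"
    if n < 40:
        return "medium"
    return "long"


def language_bucket(text: str) -> str:
    # Crude heuristic: any CJK ideograph marks the lyric as "zh".
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "other"


def source_prefix(name: str) -> str:
    # Assumes names shaped like "<idx>_<PREFIX><digits>", as in the test fixture.
    match = re.match(r"\d+_([A-Za-z]+)", name)
    return match.group(1) if match else "unknown"


def stratified_sample(records, size, seed=20260602):
    """Draw up to `size` records round-robin across strata so no bucket dominates."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        key = (line_count_bucket(rec.text), language_bucket(rec.text), source_prefix(rec.name))
        strata[key].append(rec)
    for bucket in strata.values():
        rng.shuffle(bucket)
    picked = []
    keys = sorted(strata)
    while len(picked) < size and any(strata[k] for k in keys):
        for key in keys:
            if strata[key] and len(picked) < size:
                picked.append(strata[key].pop())
    return picked
```

Round-robin selection is one simple way to keep every stratum represented; a proportional allocation would be an equally reasonable choice if the evaluation set should mirror the library's own distribution.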
TEST_WORKFLOW.md
@@ -67,10 +67,10 @@ python scripts/process_library.py \
 python scripts/process_library.py \
   --library-dir data/library \
   --index outputs/indexes/library_lyrics.pkl \
-  --eval-size 1180 \
-  --positive-ratio 0.2 \
-  --eval-csv data/generated_eval/eval_1180.csv \
-  --eval-out outputs/results/library_eval_1180.csv
+  --eval-size 50000 \
+  --positive-ratio 0.3 \
+  --eval-csv data/generated_eval/eval_50000.csv \
+  --eval-out outputs/results/library_eval_50000.csv
 ```
 Quarantined files are moved by default to:
@@ -95,22 +95,23 @@ outputs/indexes/library_lyrics.pkl
 Note: if you modify `data/library`, or change the preprocessing/dedup logic, you must rerun this step.
-## 3. Generate 100 test samples
+## 3. Generate production evaluation samples
 ```bash
 python -m lyric_dedup.cli generate-eval-set \
   --library-dir data/library \
   --lyrics-dir data/generated_eval/incoming \
-  --csv data/generated_eval/eval_500.csv \
-  --size 500 \
-  --positive-ratio 0.2
+  --csv data/generated_eval/eval_50000.csv \
+  --index outputs/indexes/library_lyrics.pkl \
+  --size 50000 \
+  --positive-ratio 0.3
 ```
-Generated by default:
+Default production evaluation mix:
 ```text
-应去重 (should dedupe): 60
-不应去重 (should not dedupe): 40
+应去重 (should dedupe): 30%
+不应去重 (should not dedupe): 70%
 ```
 The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples.
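As a quick check of how `--size` and `--positive-ratio` translate into row counts, here is a small arithmetic sketch. The rounding rule is an assumption (the generator's actual rounding lives in the collapsed `eval_dataset.py` diff), and `split_counts` is a hypothetical helper, not part of the project.

```python
def split_counts(size: int, positive_ratio: float) -> tuple[int, int]:
    """Return (positives, negatives) for an evaluation set of `size` rows.

    Assumption: positives are rounded to the nearest integer and the
    remainder becomes negatives; the real generator may round differently.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    if not 0.0 <= positive_ratio <= 1.0:
        raise ValueError("positive_ratio must be within [0, 1]")
    positives = round(size * positive_ratio)
    return positives, size - positives


# New documented settings: --size 50000 --positive-ratio 0.3
print(split_counts(50000, 0.3))
# Previous CLI defaults: --size 100 --positive-ratio 0.6
print(split_counts(100, 0.6))
```

Under this assumption the new settings give 15000 `应去重` rows and 35000 `不应去重` rows, while the previous defaults give the 60/40 split that the old text block listed.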
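@@ -118,8 +119,28 @@ python -m lyric_dedup.cli generate-eval-set \
 Labeling rules:
 ```text
-pos_* = 应去重 (should dedupe): style variations of full-song lyrics
-neg_* = 不应去重 (should not dedupe): fragments / short-phrase collisions / mixed fragments / same-theme new lyrics / translation-only similarity
+positive_*                = 应去重 (should dedupe): style variations of full-song lyrics
+negative_random_unrelated = 不应去重 (should not dedupe): new lyrics on the same theme
+negative_hard_candidate   = 不应去重: short-phrase / partial-overlap samples the system tends to recall
+negative_fragment         = 不应去重: single-song fragment
+negative_shared_chorus    = 不应去重: repeated-chorus collision
+negative_translation_only = 不应去重: translation-only similarity
+edge_short_or_placeholder = 不应去重: short-lyric / placeholder edge samples
 ```
+The generator scans the entire library and does stratified sampling by valid lyric line count, language type, and file source prefix. When `--index` is passed, it uses the existing index to generate hard negatives. Each run also writes:
+```text
+data/generated_eval/eval_50000.csv.manifest.json
+```
+Key fields to check in the manifest:
+```text
+library_files           number of lyric files in the library
+sample_type_counts      number of samples per sample type
+line_count_bucket_counts / language_bucket_counts / source_bucket_counts
+unique_source_records   how many real source files this evaluation covers
+```
 ## 4. Strict evaluation: count only duplicate as deduplicated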
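The manifest fields named in the hunk above can be inspected with a few lines of Python. This is a sketch, not the project's tooling: the field names come from the workflow text, but the assumption that each bucket field is a flat `{name: count}` object is mine, and `summarize_manifest` is a hypothetical helper.

```python
import json
from pathlib import Path

# Field names taken from the manifest description in TEST_WORKFLOW.md.
COUNT_FIELDS = ("library_files", "unique_source_records")
BUCKET_FIELDS = (
    "sample_type_counts",
    "line_count_bucket_counts",
    "language_bucket_counts",
    "source_bucket_counts",
)


def summarize_manifest(manifest_path: Path) -> list[str]:
    """Return human-readable lines for the key manifest fields.

    Assumption: bucket fields are flat {name: count} JSON objects. Missing
    fields are reported instead of raising, because the exact schema is not
    visible in this diff.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    lines = []
    for field in COUNT_FIELDS:
        lines.append(f"{field}: {manifest.get(field, '<missing>')}")
    for field in BUCKET_FIELDS:
        buckets = manifest.get(field)
        if not isinstance(buckets, dict):
            lines.append(f"{field}: <missing>")
            continue
        total = sum(buckets.values())
        # Largest buckets first, with their share of the total.
        for name, count in sorted(buckets.items(), key=lambda kv: -kv[1]):
            share = count / total if total else 0.0
            lines.append(f"{field}.{name}: {count} ({share:.1%})")
    return lines


if __name__ == "__main__":
    path = Path("data/generated_eval/eval_50000.csv.manifest.json")
    if path.exists():
        print("\n".join(summarize_manifest(path)))
```

Printing shares next to raw counts makes it easier to spot a bucket that is badly under-represented, which is the main thing the stratified sampling is meant to prevent.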
@@ -127,9 +148,9 @@ neg_* = 不应去重 (should not dedupe): fragments / short-phrase collisions / mixed fragments / same-theme new lyrics / only
 ```bash
 python -m lyric_dedup.cli evaluate-csv \
   --index outputs/indexes/library_lyrics.pkl \
-  --csv data/generated_eval/eval_500.csv \
+  --csv data/generated_eval/eval_50000.csv \
   --base-dir data/generated_eval \
-  --out outputs/results/library_eval_500.csv
+  --out outputs/results/library_eval_50000.csv
 ```
 Under this criterion:
@@ -151,10 +172,10 @@ false_positive
 ```bash
 python -m lyric_dedup.cli evaluate-csv \
   --index outputs/indexes/library_lyrics.pkl \
-  --csv data/generated_eval/eval_500.csv \
+  --csv data/generated_eval/eval_50000.csv \
   --base-dir data/generated_eval \
   --positive-decisions duplicate,review \
-  --out outputs/results/library_eval_500_review_positive.csv
+  --out outputs/results/library_eval_50000_review_positive.csv
 ```
 Under this criterion:
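The two `evaluate-csv` runs above differ only in which decisions count as "deduplicated". The sketch below illustrates that difference with standard precision/recall definitions. It is not the project's `evaluate_csv` implementation, which is not shown on this page: `binary_metrics` is a hypothetical helper, and the decision value `"unique"` is a placeholder I made up for any non-positive outcome.

```python
def binary_metrics(rows, positive_decisions=("duplicate",)):
    """Compute precision/recall, treating `positive_decisions` as "deduplicated".

    Each row is an (expected, decision) pair. `expected` uses the CSV labels
    "应去重" / "不应去重"; `decision` is "duplicate", "review", or any other
    value (the placeholder "unique" below is an assumption).
    """
    positive = set(positive_decisions)
    tp = fp = fn = tn = 0
    for expected, decision in rows:
        predicted = decision in positive
        should_dedupe = expected == "应去重"
        if predicted and should_dedupe:
            tp += 1
        elif predicted:
            fp += 1
        elif should_dedupe:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision, "recall": recall}


rows = [
    ("应去重", "duplicate"),
    ("应去重", "review"),    # full-song variant the system was unsure about
    ("不应去重", "review"),  # fragment routed to review, which the README allows
    ("不应去重", "unique"),
]
strict = binary_metrics(rows)
lenient = binary_metrics(rows, positive_decisions=("duplicate", "review"))
print(strict)
print(lenient)
```

On this toy input the strict criterion misses the `review` positive (recall 0.5, no false positives), while counting `review` as positive recovers it at the cost of one false positive from the fragment row.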
lyric_dedup/cli.py
@@ -48,8 +48,9 @@ def main() -> None:
     generate.add_argument("--lyrics-dir", required=True)
     generate.add_argument("--csv", required=True)
     generate.add_argument("--size", type=int, default=100)
-    generate.add_argument("--positive-ratio", type=float, default=0.6)
+    generate.add_argument("--positive-ratio", type=float, default=0.3)
     generate.add_argument("--seed", type=int, default=20260602)
+    generate.add_argument("--index", default="", help="optional existing index for hard-negative generation")

     args = parser.parse_args()
     if args.command == "build-index":
@@ -75,6 +76,7 @@ def main() -> None:
             size=args.size,
             positive_ratio=args.positive_ratio,
             seed=args.seed,
+            index_path=Path(args.index) if args.index else None,
         )
         print(json.dumps(summary, ensure_ascii=False))
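To see how the optional `--index` flag behaves, here is a self-contained sketch that reproduces only the options visible in the hunks above. The use of `add_subparsers` and the helper names `build_parser` and `resolve_index` are assumptions for illustration; the real `main()` defines more subcommands and arguments that are not shown here.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Only the options visible in this diff are reproduced; the subparser
    # layout is an assumption inferred from `args.command`.
    parser = argparse.ArgumentParser(prog="lyric_dedup.cli")
    sub = parser.add_subparsers(dest="command", required=True)
    generate = sub.add_parser("generate-eval-set")
    generate.add_argument("--size", type=int, default=100)
    generate.add_argument("--positive-ratio", type=float, default=0.3)
    generate.add_argument("--seed", type=int, default=20260602)
    generate.add_argument("--index", default="", help="optional existing index for hard-negative generation")
    return parser


def resolve_index(raw: str) -> Optional[Path]:
    # Mirrors `Path(args.index) if args.index else None` from the diff:
    # an empty string means no index was supplied.
    return Path(raw) if raw else None


parser = build_parser()
without_index = parser.parse_args(["generate-eval-set", "--size", "30"])
with_index = parser.parse_args(["generate-eval-set", "--index", "outputs/indexes/library_lyrics.pkl"])
print(resolve_index(without_index.index))
print(resolve_index(with_index.index))
```

Using an empty-string default and converting to `None` at the call site keeps the flag optional without requiring a custom `type=` callable; `scripts/process_library.py`, by contrast, always passes `Path(args.index)` because it builds the index in the same run.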
lyric_dedup/eval_dataset.py
This diff is collapsed.
scripts/process_library.py
@@ -77,6 +77,7 @@ def main() -> None:
         csv_path=Path(args.eval_csv),
         size=args.eval_size,
         positive_ratio=args.positive_ratio,
+        index_path=Path(args.index),
     )
     evaluate_csv(
         Path(args.index),
tests/test_lyric_dedup.py
 import csv
+import json
 from lyric_dedup import DuplicateChecker
 from lyric_dedup import DuplicateDecision
@@ -285,23 +286,32 @@ def test_evaluate_csv_reports_binary_metrics(tmp_path) -> None:
     assert (tmp_path / "eval_out.csv.summary.json").exists()
-def test_generated_eval_set_marks_fragments_as_negative(tmp_path) -> None:
+def test_generated_eval_set_uses_stratified_production_mix(tmp_path) -> None:
     library = tmp_path / "library"
     incoming = tmp_path / "generated" / "incoming"
     eval_csv = tmp_path / "generated" / "eval.csv"
     library.mkdir()
-    (library / "song.txt").write_text(BASE_LYRIC, encoding="utf-8")
+    for idx in range(12):
+        prefix = "AY" if idx % 2 == 0 else "WHHY"
+        (library / f"{idx}_{prefix}{idx:06d}.txt").write_text(
+            BASE_LYRIC.replace("我爱你", f"我想你{idx}").replace("城市", f"城市{idx}"),
+            encoding="utf-8",
+        )
-    generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=20, positive_ratio=0.5)
+    generate_eval_set(library_dir=library, output_dir=incoming, csv_path=eval_csv, size=30, positive_ratio=0.3)
     rows = list(csv.DictReader(eval_csv.open(encoding="utf-8")))
-    positive_types = {row["sample_type"] for row in rows if row["expected"] == "应去重"}
-    fragment_rows = [row for row in rows if row["sample_type"] == "single_song_fragment"]
-    assert "trimmed_version" not in positive_types
-    assert "single_song_fragment" not in positive_types
-    assert fragment_rows
-    assert all(row["expected"] == "不应去重" for row in fragment_rows)
+    manifest = json.loads((tmp_path / "generated" / "eval.csv.manifest.json").read_text(encoding="utf-8"))
+    negative_types = {row["sample_type"] for row in rows if row["expected"] == "不应去重"}
+    assert len(rows) == 30
+    assert manifest["library_files"] == 12
+    assert manifest["sample_size"] == 30
+    assert manifest["unique_source_records"] > 1
+    assert "positive_full_duplicate" in manifest["plan"]
+    assert "negative_fragment" in negative_types
+    assert "negative_hard_candidate" in negative_types
+    assert all(row["expected"] == "不应去重" for row in rows if row["sample_type"].startswith("negative_"))
 def test_foreign_original_with_added_chinese_translation_is_duplicate() -> None: