Commit 51ddab43 51ddab43fb5e3638a8b8c9cd8049679fe8b2ccc7 by 沈秋雨

Add lyric duplicate detection workflow

0 parents
.DS_Store
__pycache__/
*.py[cod]
.pytest_cache/
# Local lyric data and generated artifacts
data/
outputs/
downloaded_lyrics/
downloaded_lyrics_type3/
download_failed*.txt
# Local downloader / scratch utilities
download_lyrics.py
test_db_connection.py
*.env
# Reference project kept locally only
text-dedup-main/
# Virtual environments and editor files
.venv/
venv/
.idea/
.vscode/
# Lyric Duplicate Checker
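The first version targets duplicate checking for newly added lyrics: build an index from the existing `.lrc` / `.txt` lyrics, then query each new lyric against it. The result is one of `duplicate`, `review`, or `new`.
## Build the index
Assume the existing library lives in `data/library/`: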
```bash
python -m lyric_dedup.cli build-index \
--lyrics-dir data/library \
--index outputs/indexes/lyrics.pkl
```
## Check a single new lyric file
```bash
python -m lyric_dedup.cli check-file \
--index outputs/indexes/lyrics.pkl \
--file data/incoming/new_song.lrc
```
## Batch-check a directory of new lyrics
```bash
python -m lyric_dedup.cli batch-check \
--index outputs/indexes/lyrics.pkl \
--lyrics-dir data/incoming \
--out outputs/results/incoming_check.csv
```
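Key columns in the CSV:
- `decision`: the overall verdict.
- `best_candidate_id`: the most similar existing lyric.
- `best_candidate_jaccard`: n-gram surface similarity.
- `best_candidate_line_coverage`: line-level coverage.
- `matched_unique_lines`: the normalized lyric lines that matched.
- `best_candidate_reason`: the reason for the verdict (written in Chinese), explaining why the sample was judged duplicate, review, or new.
Recommended production handling: `duplicate` can be blocked automatically; `review` goes to the manual review queue; `new` can still be spot-checked before it enters the library.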
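To make the two similarity columns concrete, here is a minimal sketch of how an n-gram Jaccard score and a line-coverage score could be computed. The function names, the choice of character 3-grams, and the direction of the coverage ratio (share of the new lyric's lines found in the existing lyric) are assumptions for illustration; the project's own normalization and thresholds are not shown in this document.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of text, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i : i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def line_coverage(new_lines: list[str], existing_lines: list[str]) -> float:
    """Share of distinct new lines that also appear in the existing lyric (assumed direction)."""
    new_unique = set(new_lines)
    if not new_unique:
        return 0.0
    return len(new_unique & set(existing_lines)) / len(new_unique)
```

A fragment of an existing song tends to score high on line coverage but low on Jaccard against the full song, which is one reason to look at both columns together.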
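## Safeguards for lyrics with original text plus Chinese translation
Lyrics are currently split into three kinds of lines:
- `primary_lines`: original-language lines; automatic duplicate decisions rely mainly on these.
- `translation_lines`: Chinese translation lines, used only for recall and for explaining review cases.
- `unknown_lines`: lines that cannot be classified reliably.
High-confidence splits:
- A foreign-language line and a Chinese line share the same timestamp.
- Several stable groups of foreign-language line + Chinese line alternate.
Medium-confidence splits:
- An obvious foreign / Chinese translation pair on a single line, for example `I miss you / 今晚我想你`.
Low-confidence splits:
- A whole block of foreign-language text followed by a whole block of Chinese translation.
Decision policy:
- If the original text is highly consistent, the sample can be `duplicate` even when the new lyric adds a Chinese translation.
- If only the translation lines are similar and the original text is not similar enough, the sample can only be `review`; it is never auto-flagged as a duplicate.
- A suspected block-translation structure is a low-confidence split, so even if the original-text hash matches, the sample goes to `review` first.
- For ordinary Chinese songs with no detected translation structure, all valid lines are treated as original text.
The index stores the split original/translation features, so rebuild the index after changing the split rules.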
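The decision policy above can be sketched as a small function. This is an illustration only: `SplitConfidence`, `decide`, and the boolean inputs are hypothetical names, and the real checker presumably works with similarity scores and thresholds rather than booleans. The policy does not say how a medium-confidence split is handled when the original text matches, so treating it like a high-confidence split here is an assumption.

```python
from enum import Enum


class SplitConfidence(Enum):
    NONE = "none"      # no translation structure detected
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"        # e.g. a block of foreign text followed by a block of translation


def decide(primary_match: bool, translation_match: bool, confidence: SplitConfidence) -> str:
    """Map match signals to duplicate / review / new following the stated policy."""
    if primary_match:
        # A low-confidence split goes to review even when the original text matches.
        if confidence is SplitConfidence.LOW:
            return "review"
        return "duplicate"
    # Translation-only similarity is never enough for an automatic duplicate.
    if translation_match:
        return "review"
    return "new"
```

For example, `decide(True, True, SplitConfidence.LOW)` yields `review`, while the same signals with `SplitConfidence.HIGH` yield `duplicate`.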
## Evaluate accuracy with a labeled CSV
You can first auto-generate a batch of evaluation samples from the existing library:
```bash
python -m lyric_dedup.cli generate-eval-set \
--library-dir data/library \
--lyrics-dir data/generated_eval/incoming \
--csv data/generated_eval/eval_10.csv \
--size 10 \
--positive-ratio 0.6
```
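Labeling rules used by the generator:
- `应去重` (should deduplicate) samples only contain style variations of full-song lyrics, such as timestamps, punctuation, platform noise, blank lines, LRC formatting, and an added Chinese translation.
- `不应去重` (should not deduplicate) samples include lyric fragments, short-phrase collisions, mixed fragments from different songs, new lyrics on the same theme, and translation-only similarity.
- A lyric fragment must not produce `duplicate` even if it matches part of an existing song; at most it goes to `review`.
To evaluate hand-labeled data, first prepare a CSV, for example `data/eval/eval.csv`: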
```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```
Instead of a file path, you can put the lyrics directly in a `lyrics` column:
```csv
id,lyrics,expected
case-003,"我爱你在每个夜里\n听风说话也听见你",duplicate
case-004,"南方的雨穿过街心\n你把故事说给云听",new
```
`expected` accepts these values:
- Should deduplicate: `应去重`, `重复`, `duplicate`, `1`, `true`, `yes`
- Should not deduplicate: `不应去重`, `不重复`, `new`, `0`, `false`, `no`
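As a rough sketch of how these label spellings could be normalized to a boolean, assuming (hypothetically) that matching ignores surrounding whitespace and letter case; `parse_expected` is an illustrative name, not the project's function:

```python
POSITIVE_LABELS = {"应去重", "重复", "duplicate", "1", "true", "yes"}
NEGATIVE_LABELS = {"不应去重", "不重复", "new", "0", "false", "no"}


def parse_expected(value: str) -> bool:
    """Return True for 'should deduplicate' labels, False for 'should not'."""
    key = value.strip().lower()  # assumption: whitespace and case are ignored
    if key in POSITIVE_LABELS:
        return True
    if key in NEGATIVE_LABELS:
        return False
    raise ValueError(f"unrecognized expected label: {value!r}")
```

Rejecting unknown labels instead of defaulting to one class keeps labeling typos from silently skewing the metrics.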
Run the evaluation:
```bash
python -m lyric_dedup.cli evaluate-csv \
--index outputs/indexes/lyrics.pkl \
--csv data/eval/eval.csv \
--base-dir data \
--out outputs/results/eval_result.csv
```
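By default, only a system output of `duplicate` counts as "predicted should deduplicate". This suits measuring the precision of automatic blocking, because false blocks (false positives) stand out more clearly.
To evaluate recall of suspicious samples instead, that is, to count both `duplicate` and `review` as hits: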
```bash
python -m lyric_dedup.cli evaluate-csv \
--index outputs/indexes/lyrics.pkl \
--csv data/eval/eval.csv \
--base-dir data \
--positive-decisions duplicate,review \
--out outputs/results/eval_result_review_as_positive.csv
```
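Each run writes two files; for the strict run these are:
- `outputs/results/eval_result.csv`: the prediction, candidates, reason, and correctness for each sample.
- `outputs/results/eval_result.csv.summary.json`: overall metrics.
Key fields in the summary:
- `accuracy`: overall accuracy.
- `precision`: of the samples predicted as should-deduplicate, the share that truly should be. Prioritize this for automatic blocking.
- `recall`: of the samples that truly should be deduplicated, the share the system caught.
- `f1`: the harmonic mean of precision and recall.
- `false_positive`: should not deduplicate but was predicted as should; a false block.
- `false_negative`: should deduplicate but was not caught; a miss.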
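The summary metrics follow from the four confusion counts by the standard definitions. The sketch below is illustrative only; `summarize` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption about edge-case handling.

```python
def summarize(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With 8 true positives, 2 false positives, 9 true negatives, and 1 false negative, precision is 8 / 10 and recall is 8 / 9.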
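# Lyric duplicate-check test workflow
This document records the full set of commands for building an index from an existing lyric directory, generating a test set, running batch evaluation, and inspecting the results.
## 1. Prepare directories
The existing library goes in: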
```text
data/library/
```
Supported file types:
```text
.lrc
.txt
```
Generated test samples are written to:
```text
data/generated_eval/incoming/
```
The labeled test-set CSV is written to (matching the command in section 3):
```text
data/generated_eval/eval_500.csv
```
Evaluation results are written to:
```text
outputs/results/
```
## 2. Build the index for the existing library
If you have just added a batch of samples to `data/library`, run the processing script first:
```bash
python scripts/process_library.py \
--library-dir data/library \
--index outputs/indexes/library_lyrics.pkl
```
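The script will:
```text
1. Scan for and quarantine instrumental placeholder samples, e.g. files containing 【曲库专用】 or “此歌曲为没有填词的纯音乐”.
2. Rebuild outputs/indexes/library_lyrics.pkl.
3. Write the processing report to outputs/results/library_process_report.json.
```
To preview which files would be handled, without actually moving anything or rebuilding the index: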
```bash
python scripts/process_library.py \
--library-dir data/library \
--dry-run
```
To also generate and evaluate 1180 test samples in the same run:
```bash
python scripts/process_library.py \
--library-dir data/library \
--index outputs/indexes/library_lyrics.pkl \
--eval-size 1180 \
--positive-ratio 0.2 \
--eval-csv data/generated_eval/eval_1180.csv \
--eval-out outputs/results/library_eval_1180.csv
```
Quarantined files are moved by default to:
```text
data/quarantine/no_lyrics_placeholders/
```
You can also build the index manually without the script:
```bash
python -m lyric_dedup.cli build-index \
--lyrics-dir data/library \
--index outputs/indexes/library_lyrics.pkl
```
Index file:
```text
outputs/indexes/library_lyrics.pkl
```
Note: rerun this step whenever `data/library` changes or the preprocessing / duplicate-decision logic changes.
## 3. Generate 500 test samples
```bash
python -m lyric_dedup.cli generate-eval-set \
--library-dir data/library \
--lyrics-dir data/generated_eval/incoming \
--csv data/generated_eval/eval_500.csv \
--size 500 \
--positive-ratio 0.2
```
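The command above generates `round(500 × 0.2) = 100` positive samples and 400 negative samples:
```text
应去重 (should deduplicate): 100
不应去重 (should not deduplicate): 400
```
With the `generate_eval_set` function defaults (`size=100`, `positive_ratio=0.6`), the split would be 60 / 40.
The generator first removes old generated `.txt` / `.lrc` files under `data/generated_eval/incoming/`, then writes the new samples.
Labeling rules:
```text
pos_* = 应去重 (should deduplicate): full-song style variations
neg_* = 不应去重 (should not deduplicate): fragments / short-phrase collisions / mixed fragments / new lyrics on the same theme / translation-only similarity
```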
## 4. Strict evaluation: count only duplicate as deduplicated
```bash
python -m lyric_dedup.cli evaluate-csv \
--index outputs/indexes/library_lyrics.pkl \
--csv data/generated_eval/eval_500.csv \
--base-dir data/generated_eval \
--out outputs/results/library_eval_500.csv
```
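Under this policy:
```text
duplicate -> predicted should deduplicate
review -> predicted should not deduplicate
new -> predicted should not deduplicate
```
This suits measuring the precision of automatic blocking. Focus on: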
```text
false_positive
```
## 5. Recall evaluation: count both duplicate and review as catching a suspicious sample
```bash
python -m lyric_dedup.cli evaluate-csv \
--index outputs/indexes/library_lyrics.pkl \
--csv data/generated_eval/eval_500.csv \
--base-dir data/generated_eval \
--positive-decisions duplicate,review \
--out outputs/results/library_eval_500_review_positive.csv
```
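Under this policy:
```text
duplicate -> predicted should deduplicate
review -> predicted should deduplicate
new -> predicted should not deduplicate
```
This suits measuring recall of suspicious samples. Focus on: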
```text
false_negative
```
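To see how the two policies change the confusion counts for the same system output, here is a small sketch. `confusion` and the row layout (expected-positive flag, system decision) are hypothetical; the keys mirror the field names in the summary file.

```python
from collections import Counter


def confusion(rows: list[tuple[bool, str]], positive_decisions: set[str]) -> Counter:
    """Count confusion cells, treating any decision in positive_decisions as a hit."""
    counts: Counter = Counter()
    for expected_positive, decision in rows:
        predicted_positive = decision in positive_decisions
        if expected_positive and predicted_positive:
            counts["true_positive"] += 1
        elif expected_positive:
            counts["false_negative"] += 1
        elif predicted_positive:
            counts["false_positive"] += 1
        else:
            counts["true_negative"] += 1
    return counts


rows = [(True, "duplicate"), (True, "review"), (False, "review"), (False, "new")]
strict = confusion(rows, {"duplicate"})
lenient = confusion(rows, {"duplicate", "review"})
```

In this toy input, moving `review` into the positive set turns one miss into a hit and one true negative into a false block, which is the precision / recall trade-off the two runs are meant to expose.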
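## 6. View overall metrics
Strict policy:
```bash
cat outputs/results/library_eval_500.csv.summary.json
```
Recall policy:
```bash
cat outputs/results/library_eval_500_review_positive.csv.summary.json
```
Metric definitions:
```text
accuracy        overall accuracy
precision       of the samples predicted as should-deduplicate, the share that truly should be
recall          of the samples that truly should be deduplicated, the share the system caught
f1              harmonic mean of precision and recall
true_positive   should deduplicate and predicted should deduplicate
false_positive  should not deduplicate but predicted should deduplicate (false block)
true_negative   should not deduplicate and predicted should not deduplicate
false_negative  should deduplicate but predicted should not deduplicate (miss)
```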
## 7. View per-sample results
```bash
open outputs/results/library_eval_500.csv
```
If `open` is unavailable (it is a macOS command), read the CSV directly:
```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["decision"], r["correct"], r["reason"], sep=" | ") for r in rows]'
```
## 8. View failed samples
Failed samples under the strict policy:
```bash
python -c 'import csv; rows=csv.DictReader(open("outputs/results/library_eval_500.csv", encoding="utf-8")); [print(r["id"], r["source"], r["decision"], r["reason"], sep=" | ") for r in rows if r["correct"] == "False"]'
```
View the full candidate list for one sample (substitute a file name from your own run):
```bash
python -m lyric_dedup.cli check-file \
--index outputs/indexes/library_lyrics.pkl \
--file data/generated_eval/incoming/neg_068_mixed_fragments.txt \
--max-candidates 10
```
## 9. Check the test-set distribution
```bash
python -c 'import csv, collections; rows=list(csv.DictReader(open("data/generated_eval/eval_500.csv", encoding="utf-8"))); print(len(rows)); print(collections.Counter(r["expected"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows)); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="应去重")); print(collections.Counter(r["sample_type"] for r in rows if r["expected"]=="不应去重"))'
```
Check the number of files in the generated directory:
```bash
find data/generated_eval/incoming -type f | wc -l
```
## 10. Run the code tests
```bash
python -m pytest tests
```
Compile check:
```bash
python -m compileall -q lyric_dedup tests
```
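## 11. About uniqueness within the test set
The auto-generated set is a rule-coverage test set; its samples are not guaranteed to be distinct from one another after normalization.
If you require, say, 100 test samples that are mutually distinct while keeping the generator's default ratio: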
```text
size = 100
positive_ratio = 0.6
```
then you need at least:
```text
60 mutually distinct seed lyrics
```
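Reason: should-deduplicate samples are full-song variants, so several style variations of the same song still normalize to the same song.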
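The arithmetic can be checked with a short sketch that mirrors how `generate_eval_set` assigns seeds to positive samples (`round(size * positive_ratio)` positives, seeds picked round-robin by `index % len(source_files)`). The helper name `positives_per_seed` is hypothetical.

```python
def positives_per_seed(size: int, positive_ratio: float, seed_count: int) -> list[int]:
    """Return how many positive samples each seed lyric contributes."""
    positives = round(size * positive_ratio)
    counts = [0] * seed_count
    for index in range(positives):
        counts[index % seed_count] += 1  # same round-robin rule as the generator
    return counts


enough = positives_per_seed(100, 0.6, seed_count=60)
too_few = positives_per_seed(100, 0.6, seed_count=40)
```

With 60 seeds every seed is used once; with only 40 seeds some seeds are used twice, so those positive samples normalize to the same song.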
A more reliable way to measure real-world accuracy is to prepare a manually labeled CSV:
```csv
id,file,expected
case-001,incoming/song_a.lrc,应去重
case-002,incoming/song_b.txt,不应去重
```
Then run the `evaluate-csv` command from section 4 or section 5 directly.
"""Lyric duplicate detection utilities."""
from lyric_dedup.checker import DuplicateCheckResult
from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.checker import DuplicateDecision
from lyric_dedup.checker import LyricRecord
__all__ = [
    "DuplicateCheckResult",
    "DuplicateChecker",
    "DuplicateDecision",
    "LyricRecord",
]
"""Generate labeled evaluation samples from an existing lyric library."""
from __future__ import annotations
import csv
import random
import re
from dataclasses import dataclass
from pathlib import Path
from lyric_dedup.file_import import iter_lyric_files
from lyric_dedup.file_import import read_lyric_file
from lyric_dedup.file_import import record_from_file
from lyric_dedup.normalization import normalize_lyrics
@dataclass(frozen=True)
class GeneratedSample:
    sample_id: str
    file: str
    expected: str
    sample_type: str
    source: str
    title: str = ""
    artist: str = ""
def generate_eval_set(
    *,
    library_dir: Path,
    output_dir: Path,
    csv_path: Path,
    size: int = 100,
    positive_ratio: float = 0.6,
    seed: int = 20260602,
) -> dict[str, object]:
    rng = random.Random(seed)
    source_files = iter_lyric_files(library_dir)
    if not source_files:
        raise ValueError(f"No .lrc/.txt lyric files found under {library_dir}")
    output_dir.mkdir(parents=True, exist_ok=True)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    _clean_generated_output_dir(output_dir)

    positives = round(size * positive_ratio)
    negatives = size - positives
    samples: list[GeneratedSample] = []
    # Seed files are reused round-robin when the library is smaller than the sample count.
    for index in range(positives):
        source = source_files[index % len(source_files)]
        samples.append(_positive_sample(index + 1, source, output_dir, csv_path.parent, rng))
    for index in range(negatives):
        left = source_files[index % len(source_files)]
        right = source_files[(index + 1) % len(source_files)]
        samples.append(_negative_sample(positives + index + 1, left, right, output_dir, csv_path.parent, rng))
    rng.shuffle(samples)

    with csv_path.open("w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["id", "file", "expected", "sample_type", "source", "title", "artist"])
        writer.writeheader()
        writer.writerows(
            {
                "id": sample.sample_id,
                "file": sample.file,
                "expected": sample.expected,
                "sample_type": sample.sample_type,
                "source": sample.source,
                "title": sample.title,
                "artist": sample.artist,
            }
            for sample in samples
        )
    return {
        "size": size,
        "positive": positives,
        "negative": negatives,
        "library_files": len(source_files),
        "lyrics_dir": str(output_dir),
        "csv": str(csv_path),
    }
def _positive_sample(index: int, source: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
    raw = read_lyric_file(source)
    source_record = record_from_file(source)
    # Every variant keeps the full song and changes only its surface style.
    variants = [
        ("exact_copy", raw),
        ("timestamped", _add_timestamps(_content_lines(raw))),
        ("punctuation_noise", _add_punctuation_noise(_content_lines(raw), rng)),
        ("with_platform_noise", _with_platform_noise(_content_lines(raw))),
        ("blank_line_noise", _add_blank_line_noise(_content_lines(raw))),
        ("lrc_with_platform_noise", _add_timestamps(_content_lines(_with_platform_noise(_content_lines(raw))))),
        ("translation_added", _translation_added(_content_lines(raw))),
    ]
    sample_type, text = variants[(index - 1) % len(variants)]
    name = f"pos_{index:03d}_{sample_type}.txt"
    path = output_dir / name
    path.write_text(text, encoding="utf-8")
    return GeneratedSample(
        sample_id=f"pos-{index:03d}",
        file=str(path.relative_to(csv_base)),
        expected="应去重",
        sample_type=sample_type,
        source=str(source),
        title=source_record.title or "",
        artist=source_record.artist or "",
    )


def _negative_sample(index: int, left: Path, right: Path, output_dir: Path, csv_base: Path, rng: random.Random) -> GeneratedSample:
    left_lines = _normalized_lines(left)
    right_lines = _normalized_lines(right)
    # Negative variants share at most fragments with library songs.
    variants = [
        ("single_song_fragment", _single_song_fragment(left_lines)),
        ("short_shared_snippet", _short_shared_snippet(left_lines, rng)),
        ("mixed_fragments", _mixed_fragments(left_lines, right_lines, rng)),
        ("same_theme_synthetic", _same_theme_synthetic(index)),
        ("translation_only_like", _translation_only_like(left_lines)),
    ]
    sample_type, text = variants[(index - 1) % len(variants)]
    name = f"neg_{index:03d}_{sample_type}.txt"
    path = output_dir / name
    path.write_text(text, encoding="utf-8")
    return GeneratedSample(
        sample_id=f"neg-{index:03d}",
        file=str(path.relative_to(csv_base)),
        expected="不应去重",
        sample_type=sample_type,
        source=f"{left} | {right}",
    )
def _content_lines(text: str) -> list[str]:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines or [text.strip()]


def _clean_generated_output_dir(output_dir: Path) -> None:
    # Remove only previously generated lyric files; other files and subdirectories stay.
    for path in output_dir.iterdir():
        if path.is_file() and path.suffix.lower() in {".txt", ".lrc"}:
            path.unlink()


def _normalized_lines(path: Path) -> list[str]:
    normalized = normalize_lyrics(read_lyric_file(path))
    return list(normalized.primary_lines or normalized.unique_lines)
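def _add_timestamps(lines: list[str]) -> str:
    # Emit [mm:ss.xx] tags one second apart. Carrying into minutes keeps the
    # timestamps unique and increasing for lyrics longer than 59 lines.
    return "\n".join(f"[{idx // 60:02d}:{idx % 60:02d}.00]{line}" for idx, line in enumerate(lines, start=1))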
def _add_punctuation_noise(lines: list[str], rng: random.Random) -> str:
    marks = ["!", "?", "...", ",", "。"]
    return "\n".join(f"{line}{rng.choice(marks)}" for line in lines)


def _with_platform_noise(lines: list[str]) -> str:
    # Wrap the lyric in typical platform boilerplate (source credit, lyricist, copyright notice).
    return "\n".join(["歌词来自QQ音乐", "作词:测试", *lines, "未经著作权人许可 不得翻唱"])


def _add_blank_line_noise(lines: list[str]) -> str:
    result: list[str] = []
    for idx, line in enumerate(lines, start=1):
        result.append(line)
        if idx % 4 == 0:
            result.append("")
    return "\n".join(result)


def _translation_added(lines: list[str]) -> str:
    # Follow each of the first 24 foreign-language lines with a canned Chinese line.
    result: list[str] = []
    for idx, line in enumerate(lines, start=1):
        result.append(line)
        if _looks_foreign(line) and idx <= 24:
            result.append(_pseudo_translation(idx))
    return "\n".join(result)


def _single_song_fragment(lines: list[str]) -> str:
    if len(lines) <= 4:
        return "\n".join(lines[: max(1, len(lines) // 2)])
    # Take a short run (2 to 8 lines) from the middle of the song.
    fragment_len = max(2, min(8, len(lines) // 4))
    start = max(0, (len(lines) - fragment_len) // 2)
    return "\n".join(lines[start : start + fragment_len])


def _short_shared_snippet(lines: list[str], rng: random.Random) -> str:
    snippet = rng.sample(lines, k=min(2, len(lines))) if lines else []
    synthetic = [
        "清晨的风吹过新的街口",
        "我把昨天放进安静的口袋",
        *snippet,
        "故事从这里重新开始",
        "灯光落下我继续往前走",
    ]
    return "\n".join(synthetic)


def _mixed_fragments(left_lines: list[str], right_lines: list[str], rng: random.Random) -> str:
    left_pick = rng.sample(left_lines, k=min(2, len(left_lines))) if left_lines else []
    right_pick = rng.sample(right_lines, k=min(2, len(right_lines))) if right_lines else []
    filler = ["新的旋律慢慢靠近", "陌生的名字写在风里", "没有人停在原地"]
    return "\n".join([*left_pick, *filler, *right_pick])


def _same_theme_synthetic(index: int) -> str:
    themes = [
        "我在夜里想起远方的你",
        "城市灯火陪我走过雨季",
        "那些没说完的话留在风里",
        "明天醒来我们各自继续",
        f"这是第 {index} 个全新测试样本",
    ]
    return "\n".join(themes)


def _translation_only_like(lines: list[str]) -> str:
    foreign_count = sum(1 for line in lines if _looks_foreign(line))
    if foreign_count < 2:
        return _same_theme_synthetic(foreign_count + len(lines))
    return "\n".join(_pseudo_translation(idx) for idx in range(1, min(8, foreign_count) + 1))


def _pseudo_translation(index: int) -> str:
    translations = [
        "今晚我仍然想念你",
        "风会带走所有疲惫",
        "黑暗里也会有光",
        "别让昨天困住自己",
        "我们终会继续向前",
        "雨停以后天空会亮",
        "把遗憾留在旧时光",
        "你已经足够好了",
    ]
    return translations[(index - 1) % len(translations)]


def _looks_foreign(line: str) -> bool:
    # "Foreign" here means: contains Latin letters and no CJK ideographs.
    latin = len(re.findall(r"[A-Za-z]", line))
    cjk = len(re.findall(r"[\u4e00-\u9fff]", line))
    return latin > 0 and cjk == 0
"""Import LRC/TXT lyric files into records."""
from __future__ import annotations
import hashlib
from pathlib import Path
from lyric_dedup.checker import LyricRecord
SUPPORTED_SUFFIXES = {".lrc", ".txt"}
def iter_lyric_files(root: str | Path) -> list[Path]:
    base = Path(root)
    return sorted(
        path
        for path in base.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def read_lyric_file(path: str | Path) -> str:
    file_path = Path(path)
    data = file_path.read_bytes()
    # Try common encodings in order; fall back to lossy UTF-8 if none decodes cleanly.
    for encoding in ("utf-8-sig", "utf-8", "gb18030", "big5"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")


def record_from_file(path: str | Path, *, base_dir: str | Path | None = None) -> LyricRecord:
    file_path = Path(path)
    lyrics = read_lyric_file(file_path)
    title, artist = _metadata_from_name(file_path.stem)
    record_id = _record_id(file_path, base_dir)
    return LyricRecord(record_id=record_id, lyrics=lyrics, title=title, artist=artist)


def records_from_dir(root: str | Path) -> list[LyricRecord]:
    return [record_from_file(path, base_dir=root) for path in iter_lyric_files(root)]


def _record_id(path: Path, base_dir: str | Path | None) -> str:
    if base_dir is None:
        source = str(path.resolve())
    else:
        source = str(path.resolve().relative_to(Path(base_dir).resolve()))
    # Prefix a short hash so IDs stay distinguishable even when paths are long.
    digest = hashlib.sha1(source.encode("utf-8")).hexdigest()[:12]
    return f"{digest}:{source}"
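def _metadata_from_name(stem: str) -> tuple[str | None, str | None]:
    # Drop a trailing "歌词" (lyrics) marker before parsing the name.
    cleaned = stem.removesuffix("-歌词").removesuffix("_歌词").removesuffix(" 歌词").strip()
    # "Artist - Title" takes priority over bare "-" / "_" separators.
    if " - " in cleaned:
        artist, title = cleaned.split(" - ", 1)
        return title.strip() or None, artist.strip() or None
    for sep in ("-", "_"):
        if sep in cleaned:
            title, artist = cleaned.rsplit(sep, 1)
            return title.strip() or None, artist.strip() or None
    # No separator: use the cleaned stem as the title so the "歌词" suffix does not leak into it.
    return cleaned or None, None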
"""Small in-memory MinHash LSH index for incremental lyric lookup."""
from __future__ import annotations
import hashlib
from collections import defaultdict
from dataclasses import dataclass
_MAX_HASH = (1 << 64) - 1
@dataclass(frozen=True)
class MinHashConfig:
    num_perm: int = 96
    bands: int = 24
    seed: int = 17

    @property
    def rows_per_band(self) -> int:
        if self.num_perm % self.bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        return self.num_perm // self.bands


class MinHashLSH:
    def __init__(self, config: MinHashConfig | None = None) -> None:
        self.config = config or MinHashConfig()
        self._buckets: dict[tuple[int, tuple[int, ...]], set[str]] = defaultdict(set)

    def signature(self, tokens: set[str]) -> tuple[int, ...]:
        if not tokens:
            return tuple([_MAX_HASH] * self.config.num_perm)
        signature = [_MAX_HASH] * self.config.num_perm
        for token in tokens:
            encoded = token.encode("utf-8")
            # One keyed 64-bit hash per "permutation"; keep the minimum per slot.
            for idx in range(self.config.num_perm):
                digest = hashlib.blake2b(
                    encoded,
                    digest_size=8,
                    person=f"lyr{self.config.seed + idx:05d}".encode("ascii")[:16],
                ).digest()
                value = int.from_bytes(digest, "big")
                if value < signature[idx]:
                    signature[idx] = value
        return tuple(signature)

    def add(self, record_id: str, signature: tuple[int, ...]) -> None:
        for key in self._band_keys(signature):
            self._buckets[key].add(record_id)

    def query(self, signature: tuple[int, ...]) -> set[str]:
        # A record is a candidate if it shares at least one full band with the query.
        candidates: set[str] = set()
        for key in self._band_keys(signature):
            candidates.update(self._buckets.get(key, set()))
        return candidates

    def _band_keys(self, signature: tuple[int, ...]) -> list[tuple[int, tuple[int, ...]]]:
        rows = self.config.rows_per_band
        return [(band, signature[band * rows : (band + 1) * rows]) for band in range(self.config.bands)]
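As a side note on the default configuration (`num_perm=96`, `bands=24`, hence 4 rows per band): under the usual idealized MinHash assumption that each signature slot agrees with probability equal to the Jaccard similarity `s`, two records share at least one band with probability `1 - (1 - s^r)^b`. The helper below is a hypothetical sketch for exploring that curve; it is not part of the module above.

```python
def candidate_probability(similarity: float, bands: int = 24, rows: int = 4) -> float:
    """Probability that two sets with the given Jaccard similarity collide in at least one band."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Print the curve at a few similarity levels to see where recall starts to drop.
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"s={s:.1f} -> {candidate_probability(s):.3f}")
```

More bands with fewer rows each shift the curve toward recall (more candidates at lower similarity); fewer, wider bands shift it toward precision.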
"""Process newly added lyric library files.
This script is intended for the recurring workflow after adding files to
``data/library``:
1. Move pure-music placeholder lyric files out of the active library.
2. Rebuild the duplicate-checking index.
3. Optionally regenerate and evaluate a synthetic regression set.
"""
from __future__ import annotations
import argparse
import csv
import json
import shutil
import sys
from datetime import datetime
from pathlib import Path
# Make the project root importable when the script is run directly.
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from lyric_dedup.checker import DuplicateChecker
from lyric_dedup.cli import evaluate_csv
from lyric_dedup.eval_dataset import generate_eval_set
from lyric_dedup.file_import import iter_lyric_files
from lyric_dedup.file_import import read_lyric_file
from lyric_dedup.file_import import records_from_dir
from lyric_dedup.normalization import normalize_lyrics

# Marker strings that identify instrumental placeholder files with no real lyrics.
PLACEHOLDER_MARKERS = (
    "【曲库专用】",
    "此歌曲为没有填词的纯音乐",
)
def main() -> None:
    parser = argparse.ArgumentParser(description="Process lyric library additions.")
    parser.add_argument("--library-dir", default="data/library")
    parser.add_argument("--index", default="outputs/indexes/library_lyrics.pkl")
    parser.add_argument("--quarantine-dir", default="data/quarantine/no_lyrics_placeholders")
    parser.add_argument("--dry-run", action="store_true", help="Only report placeholder files; do not move or write outputs.")
    parser.add_argument("--delete-placeholders", action="store_true", help="Delete matched placeholder files instead of moving them.")
    parser.add_argument("--eval-size", type=int, default=0, help="Generate and evaluate this many synthetic samples. 0 disables eval.")
    parser.add_argument("--positive-ratio", type=float, default=0.2)
    parser.add_argument("--eval-dir", default="data/generated_eval/incoming")
    parser.add_argument("--eval-csv", default="data/generated_eval/eval.csv")
    parser.add_argument("--eval-out", default="outputs/results/library_eval.csv")
    parser.add_argument("--report", default="outputs/results/library_process_report.json")
    args = parser.parse_args()

    library_dir = Path(args.library_dir)
    quarantine_dir = Path(args.quarantine_dir)
    report_path = Path(args.report)
    files_before = iter_lyric_files(library_dir)
    placeholders = _find_placeholder_files(library_dir)
    short_effective = _effective_line_report(library_dir)
    moved_or_deleted: list[str] = []

    # Everything that moves files or writes outputs is skipped in dry-run mode.
    if not args.dry_run:
        moved_or_deleted = _handle_placeholders(
            placeholders,
            library_dir=library_dir,
            quarantine_dir=quarantine_dir,
            delete=args.delete_placeholders,
        )
        _build_index(library_dir, Path(args.index))
        if args.eval_size > 0:
            generate_eval_set(
                library_dir=library_dir,
                output_dir=Path(args.eval_dir),
                csv_path=Path(args.eval_csv),
                size=args.eval_size,
                positive_ratio=args.positive_ratio,
            )
            # Strict run: only "duplicate" counts as a positive prediction.
            evaluate_csv(
                Path(args.index),
                Path(args.eval_csv),
                Path(args.eval_out),
                base_dir=Path(args.eval_csv).parent,
                positive_decisions={"duplicate"},
                max_candidates=5,
            )
            # Recall run: "duplicate" and "review" both count as positive predictions.
            evaluate_csv(
                Path(args.index),
                Path(args.eval_csv),
                Path(args.eval_out).with_name(Path(args.eval_out).stem + "_review_positive.csv"),
                base_dir=Path(args.eval_csv).parent,
                positive_decisions={"duplicate", "review"},
                max_candidates=5,
            )

    report = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "dry_run": args.dry_run,
        "library_dir": str(library_dir),
        "files_before": len(files_before),
        "placeholder_matches": len(placeholders),
        "placeholder_files": [str(path) for path in placeholders],
        "handled_placeholder_files": moved_or_deleted,
        "files_after": len(iter_lyric_files(library_dir)),
        "index": str(args.index),
        "eval_size": args.eval_size,
        "eval_csv": str(args.eval_csv) if args.eval_size > 0 else "",
        "eval_out": str(args.eval_out) if args.eval_size > 0 else "",
        "short_effective_line_counts": short_effective,
    }
    print(json.dumps(report, ensure_ascii=False, indent=2))
    if not args.dry_run:
        report_path.parent.mkdir(parents=True, exist_ok=True)
        report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
def _find_placeholder_files(library_dir: Path) -> list[Path]:
    matches: list[Path] = []
    for path in iter_lyric_files(library_dir):
        text = read_lyric_file(path)
        if any(marker in text for marker in PLACEHOLDER_MARKERS):
            matches.append(path)
    return matches


def _handle_placeholders(
    placeholders: list[Path],
    *,
    library_dir: Path,
    quarantine_dir: Path,
    delete: bool,
) -> list[str]:
    handled: list[str] = []
    if not placeholders:
        return handled
    if not delete:
        quarantine_dir.mkdir(parents=True, exist_ok=True)
    for path in placeholders:
        if delete:
            path.unlink()
            handled.append(f"deleted:{path}")
            continue
        # Preserve the file's path relative to the library inside the quarantine directory.
        relative = path.resolve().relative_to(library_dir.resolve())
        destination = quarantine_dir / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        if destination.exists():
            # Avoid overwriting an earlier quarantined copy by appending a timestamp.
            destination = destination.with_name(f"{destination.stem}_{datetime.now().strftime('%Y%m%d%H%M%S')}{destination.suffix}")
        shutil.move(str(path), str(destination))
        handled.append(f"moved:{path}->{destination}")
    return handled
def _build_index(library_dir: Path, index_path: Path) -> None:
    checker = DuplicateChecker()
    for record in records_from_dir(library_dir):
        checker.add_record(record)
    index_path.parent.mkdir(parents=True, exist_ok=True)
    checker.save(index_path)


def _effective_line_report(library_dir: Path) -> dict[str, int]:
    # Bucket library files by how many effective lyric lines remain after normalization.
    buckets = {
        "total": 0,
        "zero_effective_lines": 0,
        "one_to_three_effective_lines": 0,
        "four_to_five_effective_lines": 0,
        "six_plus_effective_lines": 0,
    }
    for path in iter_lyric_files(library_dir):
        buckets["total"] += 1
        normalized = normalize_lyrics(read_lyric_file(path))
        line_count = len(normalized.primary_lines or normalized.unique_lines)
        if line_count == 0:
            buckets["zero_effective_lines"] += 1
        elif line_count <= 3:
            buckets["one_to_three_effective_lines"] += 1
        elif line_count <= 5:
            buckets["four_to_five_effective_lines"] += 1
        else:
            buckets["six_plus_effective_lines"] += 1
    return buckets


if __name__ == "__main__":
    main()