POSTGRES_MIGRATION.md 12.9 KB

Raw Blame History Permalink



PostgreSQL 迁移说明

本文档说明为什么要从 .pkl 索引迁移到 PostgreSQL、PostgreSQL 在这个项目里承担什么角色，以及从本地初始化到导入曲库的具体步骤。


1. 为什么迁移

当前 outputs/indexes/library_lyrics.pkl 是一个 Python DuplicateChecker 对象快照。它适合实验和离线评估，但不适合线上服务：


增量新增、删除、回滚不方便。
多实例部署时很难同步同一个 .pkl。
没有数据库事务、审计、版本管理。
索引结构绑定 Python pickle，不利于运维和长期维护。


PostgreSQL 的目标不是替代全部判重逻辑，而是替代“数据存储 + 候选召回”部分。最终判定仍保留在 Python 服务层。


2. PostgreSQL 在本项目里的角色

推荐分工：

PostgreSQL:
  保存原始歌词、规范化歌词、行级特征
  用 exact_hash 做精确召回
  用 pg_trgm 做文本近似召回
  用 lyric_lines 做行级重合召回

Python:
  normalize_lyrics
  计算 Jaccard / line coverage
  处理翻译行、副歌碰撞、片段保护
  最终输出 duplicate / review / new


不要让 PostgreSQL 的 similarity() 直接决定 duplicate。数据库只负责找候选，最终判定仍应复用当前算法规则。


3. pkl 索引和 PostgreSQL 的对应关系

当前 .pkl 里主要有：

_records              每首歌词的完整索引记录
_exact_hash_to_ids    exact hash -> record ids
_line_to_ids          normalized line -> record ids
_token_to_ids         n-gram token -> record ids
_lsh                  MinHash LSH 桶


迁移到 PostgreSQL 后：

lyrics.exact_hash             对应 _exact_hash_to_ids
lyrics.primary_text + pg_trgm 对应近似文本召回
lyric_lines.line_hash         对应行级倒排召回
lyrics / lyric_lines          对应 _records 的可持久化部分


第一阶段不迁移 MinHash LSH。先用 exact_hash + pg_trgm + line_hash 验证召回效果。


4. 本地 PostgreSQL 基础

PostgreSQL 是关系型数据库，常用概念：

database   数据库，例如 lyric_dedup
schema     命名空间，默认 public
table      表，例如 lyrics
index      索引，用来加速查询
extension  扩展，例如 pg_trgm
DSN        连接字符串


本机数据库连接字符串常见写法：

postgresql:///lyric_dedup


含义是：使用当前系统用户名，连接本机 PostgreSQL 的 lyric_dedup 数据库。

如果有用户名密码：

postgresql://postgres:postgres@localhost:5432/lyric_dedup


5. 当前已新增脚本

项目里已经加入：

scripts/postgres_schema.sql
scripts/init_postgres.py
scripts/import_library_postgres.py
scripts/evaluate_postgres.py


用途：

postgres_schema.sql          建表、建索引、启用 pg_trgm
init_postgres.py             自动执行 schema SQL
import_library_postgres.py   扫描 data/library，规范化后导入 PostgreSQL
evaluate_postgres.py         使用 PostgreSQL 召回候选并评测 CSV


6. 安装依赖

当前 Python 环境需要安装 PostgreSQL 驱动：

python -m pip install 'psycopg[binary]'


如果你使用 conda 环境，确认命令运行在当前项目所用的 (base) 或目标环境里。

验证：

python - <<'PY'
import psycopg
print(psycopg.__version__)
PY


7. 创建数据库

你已经执行过：

createdb lyric_dedup


如果需要确认数据库存在：

psql -l | grep lyric_dedup


进入数据库：

psql postgresql:///lyric_dedup


退出 psql：

\q


8. 初始化表结构

执行：

python scripts/init_postgres.py \
  --dsn postgresql:///lyric_dedup


它会执行：

create extension if not exists pg_trgm;
create table if not exists lyrics (...);
create table if not exists lyric_lines (...);
create index if not exists ...;


成功输出类似：

initialized schema from scripts/postgres_schema.sql


可以检查表：

psql postgresql:///lyric_dedup -c '\dt'


检查扩展：

psql postgresql:///lyric_dedup -c 'select * from pg_extension;'


9. 表结构说明


lyrics

保存每首歌词的主记录：

record_id          当前文件生成的稳定 id
source_path        原始文件路径
title / artist     从文件名解析的元数据
raw_text           原始歌词
normalized_text    清洗后的全文
primary_text       原文行拼接文本，主要用于自动判重
translation_text   翻译行拼接文本
exact_hash         规范化原文 hash
split_confidence   翻译拆分置信度
split_reason       翻译拆分原因
line_count         有效歌词行数
deleted_at         软删除字段


lyric_lines

保存行级特征：

lyric_id           对应 lyrics.id
role               primary / translation / unknown
line_no            行号
normalized_line    规范化歌词行
line_hash          行 hash


用途：快速找“哪些歌包含相同行”。


10. 小批量导入测试

先导入 1000 条，确认环境和 schema 都正常：

python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library \
  --limit 1000


导入脚本默认会在导入结束后执行一次低风险库内去重：

exact_hash 完全一致的记录只保留一条，其余记录 soft delete，即设置 lyrics.deleted_at。


重复清理报告默认写到：

outputs/results/postgres_exact_duplicates.csv


如果只是想导入，不做 exact 去重：

python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library \
  --limit 1000 \
  --skip-dedup-exact


查看数量：

psql postgresql:///lyric_dedup -c 'select count(*) from lyrics;'
psql postgresql:///lyric_dedup -c 'select count(*) from lyric_lines;'


查看几条数据：

psql postgresql:///lyric_dedup -c \
  'select id, record_id, title, artist, line_count from lyrics limit 5;'


11. 全量导入

确认小批量没问题后，导入全量：

python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library


脚本会显示进度：

[pg-import] files: 70295
[pg-import] import: 500/70295
...


导入是 upsert，同一个 record_id 再导入会更新，不会重复插入。

如果想额外生成“高行覆盖率近重复候选”报告，但不自动删除：

python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library \
  --line-duplicate-report outputs/results/postgres_line_duplicates.csv \
  --line-coverage-threshold 0.95


注意：行覆盖率近重复报告可能较慢，且只用于抽查。当前脚本不会自动 soft delete 这些近重复候选。


12. 基础 SQL 验证


exact hash 重复

找规范化 hash 重复：

psql postgresql:///lyric_dedup -c "
select exact_hash, count(*)
from lyrics
where deleted_at is null
group by exact_hash
having count(*) > 1
order by count(*) desc
limit 20;
"


如果导入时没有加 --skip-dedup-exact，这里理论上不应该再出现 active exact 重复；已经清理的重复记录可以这样查看：

psql postgresql:///lyric_dedup -c "
select count(*)
from lyrics
where deleted_at is not null;
"


pg_trgm 相似查询

测试 pg_trgm：

psql postgresql:///lyric_dedup -c "
select id, title, similarity(primary_text, '我爱你在每个夜里') as sim
from lyrics
where primary_text % '我爱你在每个夜里'
order by sim desc
limit 10;
"


行级重合

找某一行出现在哪些歌：

psql postgresql:///lyric_dedup -c "
select l.id, l.title, ll.normalized_line
from lyric_lines ll
join lyrics l on l.id = ll.lyric_id
where ll.normalized_line = '我爱你在每个夜里'
limit 20;
"


13. 后续查重查询应该怎么做

未来 PostgreSQL 版查重流程：

1. Python 读取新增歌词
2. normalize_lyrics
3. SQL exact_hash 召回
4. SQL pg_trgm 召回
5. SQL lyric_lines 行级召回
6. 合并候选 id
7. 拉候选 normalized 数据
8. Python 复用当前打分规则
9. 输出 duplicate / review / new


示意 SQL：

select id
from lyrics
where exact_hash = $1
  and deleted_at is null;


select id, similarity(primary_text, $1) as sim
from lyrics
where deleted_at is null
  and primary_text % $1
order by sim desc
limit 200;


select lyric_id, count(*) as matched_lines
from lyric_lines
where role = 'primary'
  and line_hash = any($1)
group by lyric_id
order by matched_lines desc
limit 200;


14. 增量更新设计

新增一首歌：

1. normalize
2. 用 PostgreSQL 召回候选
3. Python 判定
4. duplicate: 拒绝或关联已有记录
5. review: 进入人工复核
6. new: 写入 lyrics 和 lyric_lines


删除一首歌：

update lyrics
set deleted_at = now(), updated_at = now()
where id = ...;


不建议物理删除，除非确认不需要审计。

更新一首歌：

1. 更新 lyrics.raw_text / normalized_text / primary_text / exact_hash
2. 删除旧 lyric_lines
3. 插入新 lyric_lines
4. 整个过程放在一个事务里


15. PostgreSQL 版评测

评测仍然需要先生成测试集。测试集是“输入样本 + 期望标签”，PostgreSQL 版评测只负责用 PostgreSQL 数据库召回候选并计算指标。

如果还没有测试集，先生成：

python -m lyric_dedup.cli generate-eval-set \
  --library-dir data/library \
  --lyrics-dir data/generated_eval/incoming \
  --csv data/generated_eval/eval_5000.csv \
  --size 5000 \
  --positive-ratio 0.3


然后跑 PostgreSQL 版评测：

python scripts/evaluate_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --csv data/generated_eval/eval_hard_5000.csv \
  --base-dir data/generated_eval \
  --out outputs/results/postgres_eval_hard_5000.csv


它会：

1. 对 eval 样本 normalize
2. 用 PostgreSQL exact_hash 召回
3. 用 pg_trgm primary_text 召回
4. 用 lyric_lines.line_hash 召回
5. 合并候选
6. 用 Python DuplicateChecker 对候选重新打分
7. 输出 duplicate / review / new 和指标


如果想把 review 也算作“抓到可疑样本”：

python scripts/evaluate_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --csv data/generated_eval/eval_50000.csv \
  --base-dir data/generated_eval \
  --positive-decisions duplicate,review \
  --out outputs/results/postgres_eval_50000_review_positive.csv


可调参数：

--recall-limit       每类 SQL 召回最多取多少候选，默认 100
--enable-trgm        打开 pg_trgm 整段文本召回；默认关闭，避免评测过慢
--trgm-threshold     pg_trgm 的 % 匹配阈值，默认 0.3，仅 --enable-trgm 时使用
--max-candidates     最终输出多少候选，默认 5
--statement-timeout-ms 单条 SQL 超时时间，默认 5000


注意：当前 PostgreSQL 版是原型评测脚本。默认只用 exact_hash + lyric_lines.line_hash 召回，速度更可控。pg_trgm 可以作为补充召回，但整段歌词 trigram 查询在 5 万评测集上可能很慢，建议单独开小样本验证后再用于全量。


16. 迁移验证标准

迁移不是导入完就结束。需要单独验证 PostgreSQL 版查重链路：

1. exact duplicate 是否能查到
2. punctuation / timestamp / platform noise 正例是否能召回
3. fragment / shared chorus 负例是否不会被直接判 duplicate
4. PostgreSQL 召回候选数量是否合理
5. PostgreSQL 版 evaluate 指标是否达到业务要求


第一阶段目标：

PostgreSQL 负责召回，Python 仍负责判定。


17. 常见问题


提示 Missing dependency: psycopg


运行：

python -m pip install 'psycopg[binary]'


连接失败

检查 PostgreSQL 是否启动：

pg_isready


检查数据库是否存在：

psql -l | grep lyric_dedup


pg_trgm 创建失败

确认连接用户有创建 extension 权限。本机默认用户一般可以。

手动测试：

psql postgresql:///lyric_dedup -c 'create extension if not exists pg_trgm;'


想清空重新导入

谨慎执行：

psql postgresql:///lyric_dedup -c 'truncate lyric_lines, lyrics restart identity cascade;'


然后重新运行导入脚本。


18. 当前建议执行顺序

你现在已经完成：

createdb lyric_dedup


接下来执行：

python -m pip install 'psycopg[binary]'


python scripts/init_postgres.py \
  --dsn postgresql:///lyric_dedup


python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library \
  --limit 1000


确认数量：

psql postgresql:///lyric_dedup -c 'select count(*) from lyrics;'


确认后全量导入：

python scripts/import_library_postgres.py \
  --dsn postgresql:///lyric_dedup \
  --lyrics-dir data/library