TESTING_GUIDE.md 8.42 KB

Raw Blame History Permalink



WeKnora Ragas 评估测试流程指南

本文档用于在服务器上从零验证 WeKnora Ragas 独立评估项目是否可运行。


1. 前置条件

确认服务器满足：


Python 3.10 或更高版本。不要使用 Python 3.6。
WeKnora API 可访问，例如 http://localhost:9090/api/v1。
vLLM 已提供 OpenAI-compatible Chat Completions endpoint，例如 http://localhost:8000/v1。
Infinity 已提供 OpenAI-compatible Embeddings endpoint，例如 http://localhost:7997/v1。
可选：Infinity reranker endpoint 可访问，例如 http://localhost:7998/v1。


检查 Python：

python3 --version
python3.10 --version
python3.11 --version


推荐使用 Python 3.11：

cd /data/weknora_ragas
python3.11 -m venv .venv
source .venv/bin/activate
python --version
pip install -U pip setuptools wheel
pip install -e ".[pdf]"


如果只跑 XLSX 或文本型 PDF，可以先安装基础依赖：

pip install -e .


2. 配置 .env


复制示例文件：

cp .env.example .env


编辑 .env：

WEKNORA_BASE_URL=http://localhost:9090/api/v1
WEKNORA_API_KEY=your-weknora-api-key
WEKNORA_KB_ID=
WEKNORA_KB_NAME=ragas-eval-pilot

RAGAS_LLM_API_KEY=EMPTY
RAGAS_LLM_BASE_URL=http://localhost:8000/v1
RAGAS_GENERATOR_MODEL=your-vllm-model-id
RAGAS_JUDGE_MODEL=your-vllm-model-id

RAGAS_EMBEDDING_API_KEY=EMPTY
RAGAS_EMBEDDING_BASE_URL=http://localhost:7997/v1
RAGAS_EMBEDDING_MODEL=your-embedding-model-id

RAGAS_RERANKER_API_KEY=EMPTY
RAGAS_RERANKER_BASE_URL=http://localhost:7998/v1
RAGAS_RERANKER_MODEL=your-reranker-model-id

TESTSET_SIZE=10
REQUEST_INTERVAL_SECONDS=0.2


如果服务没有鉴权，RAGAS_*_API_KEY 仍建议填 EMPTY，避免 OpenAI client 因空 key 报错。

确认模型 ID：

curl http://localhost:8000/v1/models
curl http://localhost:7997/v1/models


把返回的 id 精确填入 RAGAS_JUDGE_MODEL 和 RAGAS_EMBEDDING_MODEL。


3. 服务连通性检查

先检查 WeKnora 知识库：

python scripts/00_create_kb.py


如果 .env 中 WEKNORA_KB_ID 为空，该脚本会调用：

POST /api/v1/knowledge-bases
{"name": "..."}


创建成功后会把 ID 写回 .env。

再检查模型服务：

python scripts/00_check_models.py


期望输出包括：

[OK] Generator LLM
[OK] Judge LLM
[OK] Embedding
All configured model services are reachable.


如果配置了 reranker，也会检查：

[OK] Reranker


如果 reranker 当前不用，可以让 RAGAS_RERANKER_BASE_URL 或 RAGAS_RERANKER_MODEL 为空，脚本会跳过。


4. 准备 Pilot 数据

首轮不要直接跑大规模数据。建议：


2 个 PDF。
1 个 XLSX。

TESTSET_SIZE=10。


放置文件：

mkdir -p data/raw_docs
cp /path/to/*.pdf data/raw_docs/
cp /path/to/*.xlsx data/raw_docs/


也兼容旧目录：

mkdir -p data/raw_docs/pdf data/raw_docs/xlsx
cp /path/to/*.pdf data/raw_docs/pdf/
cp /path/to/*.xlsx data/raw_docs/xlsx/


5. 执行完整 Pilot

按顺序执行：

python scripts/01_upload_docs.py
python scripts/02_wait_ingestion.py
python scripts/03_export_chunks.py
python scripts/04_parse_docs.py
python scripts/05_generate_testset.py
python scripts/06_review_testset.py
python scripts/07_run_weknora_qa.py
python scripts/08_build_ragas_input.py
python scripts/09_run_ragas_eval.py
python scripts/10_report.py


说明：


01_upload_docs.py 上传 data/raw_docs/ 下的 PDF/XLSX，也兼容 pdf/、xlsx/ 子目录。

02_wait_ingestion.py 等待 WeKnora 解析完成。

03_export_chunks.py 导出 WeKnora chunks。

04_parse_docs.py 默认从 WeKnora 导出的 chunks 构造 Ragas 测试集来源，不再重复解析原始 PDF。

05_generate_testset.py 默认使用 Ragas 结合评估侧 LLM 生成候选 QA。

06_review_testset.py 当前会把候选 QA 标为 approved，后续可替换为人工审核。

07_run_weknora_qa.py 逐条调用 WeKnora 问答并解析 SSE。

08_build_ragas_input.py 合并 QA 和 WeKnora 输出。

09_run_ragas_eval.py 调用 Ragas 打分。

10_report.py 生成 Markdown 报告。


6. 产物验收

检查这些文件是否生成：

ls -lh data/exported/knowledge.jsonl
ls -lh data/exported/chunks.jsonl
ls -lh data/parsed_docs/documents.jsonl
ls -lh data/parsed_docs/parse_summary.json
ls -lh data/testsets/testset.reviewed.jsonl
ls -lh data/runs/weknora_answers.jsonl
ls -lh data/runs/ragas_input.jsonl
ls -lh data/reports/ragas_scores.csv
ls -lh data/reports/summary.md


快速检查关键字段：

python - <<'PY'
import json
from pathlib import Path

for path in [
    "data/exported/chunks.jsonl",
    "data/parsed_docs/documents.jsonl",
    "data/runs/weknora_answers.jsonl",
    "data/runs/ragas_input.jsonl",
]:
    rows = [json.loads(line) for line in Path(path).read_text(encoding="utf-8").splitlines() if line.strip()]
    print(path, len(rows))
    if rows:
        print(rows[0].keys())
PY


最低验收标准：


data/exported/chunks.jsonl 非空。

data/parsed_docs/documents.jsonl 非空。

data/runs/weknora_answers.jsonl 中大部分 response 非空。

data/runs/ragas_input.jsonl 中 retrieved_contexts 非空比例合理。

data/reports/ragas_scores.csv 至少有一项指标列。

data/reports/summary.md 可读。


7. 常见故障


Python 版本过低

现象：

Could not find a version that satisfies the requirement setuptools>=68


原因：当前虚拟环境是 Python 3.6。项目要求 Python 3.10+。

处理：

rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -e ".[pdf]"


模型 endpoint 填错

vLLM 和 Infinity 都要填 OpenAI-compatible /v1 地址，例如：

RAGAS_LLM_BASE_URL=http://localhost:8000/v1
RAGAS_EMBEDDING_BASE_URL=http://localhost:7997/v1


不要填 Ollama 原生 /api 或服务根路径。


Embedding 报 invalid input type

项目已经在 ragas_runner.py 中设置：

tiktoken_enabled=False
check_embedding_ctx_length=False


如果仍报错，优先用 scripts/00_check_models.py 确认 Infinity endpoint 是否兼容 OpenAI embeddings API。


Ragas 指标超时或 NaN

本地或小模型 judge 可能无法稳定输出 Ragas 需要的结构化结果。先缩小指标集，例如只保留：

metrics:
  - response_relevancy


确认链路通后，再逐个打开：

  - faithfulness
  - context_precision
  - context_recall
  - factual_correctness


也可以调大：

timeout_seconds: 600
max_workers: 1
max_tokens: 4096


如果 05_generate_testset.py 在生成 QA 时出现 LLMDidNotFinishException，优先不要继续盲目调大 ragas.max_tokens。05 有独立的生成预算和输入长度：

TESTSET_RAGAS_MODE=direct
TESTSET_GENERATOR_MAX_TOKENS=4096
TESTSET_MAX_DOCUMENT_CHARS=2000
RAGAS_ENABLE_THINKING=false


direct 模式会跳过 Ragas 默认的 HeadlinesExtractor、SummaryExtractor、NERExtractor 文档预处理链路，直接把 WeKnora chunks 组装成 Ragas KnowledgeGraph 并生成单跳 QA。prechunked 和 langchain_docs 仅用于对比实验，遇到本地 vLLM 结构化输出不稳定时不建议使用。

如果使用 Qwen thinking 模型，RAGAS_ENABLE_THINKING=false 会只在 RAGAS 请求里附加 chat_template_kwargs.enable_thinking=false，避免 RAGAS 的 JSON/Pydantic 结构化输出被 Thinking Process 前缀破坏；WeKnora 本身的检索问答链路不经过这些脚本，不会受影响。

如果 vLLM 仍然报生成未完成，先把 TESTSET_SIZE 降到 3，再把 TESTSET_MAX_DOCUMENT_CHARS 调到 1000-1500 验证链路；ragas.max_tokens 主要用于后续评测阶段，不应该拿来无限放大测试集生成阶段的输出长度。


WeKnora 问答没有 retrieved_contexts

检查：

python scripts/03_export_chunks.py
python scripts/07_run_weknora_qa.py


重点看：


知识库是否解析完成。
chunks 是否导出非空。
WeKnora 问答 SSE 是否返回 references 事件。

data/runs/failed_requests.jsonl 中是否记录 empty_retrieval。


8. 扩大样本规模

首轮 10 条样本通过后，再扩大：

TESTSET_SIZE=50


再逐步扩大到 100-300 条。每次扩大前先确认：


Ragas judge 延迟可接受。

failed_requests.jsonl 中失败比例低。

summary.md 中检索失败样本可解释。
QA 集经过人工审核或抽样审核。