If your RAG system feels unreliable, switching to a more expensive LLM is usually the wrong first move. In most cases, the bottleneck is retrieval quality: weak recall, poor ranking, and no measurement loop.
This guide gives a practical path: make recall broader, make ranking sharper, then close the loop with offline + online evaluation.
1) Define what “accurate” means first
Build a small eval set (50–200 real queries), each with:
- user query
- reference answer or key facts
- gold evidence chunk IDs (can be multiple)
Track at least these metrics:
- Recall@k (the most important; optimize it first)
- MRR
- nDCG@k
- Faithfulness (is the answer grounded in retrieved evidence?)
If Recall@k doesn’t improve, prompt tuning won’t save you.
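To keep the measurement honest, here is a minimal Recall@k and MRR sketch; the eval-item shape ("query", "gold_ids") and the retrieve callable are assumptions about your own harness, not a fixed API:

def recall_at_k(retrieved_ids, gold_ids, k=10):
    # Fraction of gold evidence chunks that appear in the top-k results.
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(retrieved_ids, gold_ids):
    # Reciprocal rank of the first gold chunk; 0.0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def run_eval(eval_set, retrieve, k=10):
    # eval_set: [{"query": str, "gold_ids": [chunk_id, ...]}, ...]
    recalls, rrs = [], []
    for item in eval_set:
        retrieved = retrieve(item["query"], topk=k)
        recalls.append(recall_at_k(retrieved, item["gold_ids"], k))
        rrs.append(mrr(retrieved, item["gold_ids"]))
    return sum(recalls) / len(recalls), sum(rrs) / len(rrs)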
2) Make recall broad with hybrid retrieval
Use multiple retrieval channels:
- dense vector retrieval (semantic)
- BM25 keyword retrieval (lexical)
- optional rule-based retrieval (title/tag/time filters)
Why this works: product names, error codes, and versions are often better handled by BM25 than embeddings.
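One well-known way to merge the channels without calibrating their scores is reciprocal rank fusion (RRF); dense_search and bm25_search below stand in for your existing retrievers and are assumptions:

def rrf_merge(ranked_lists, k=60, topk=50):
    # RRF: each list contributes 1 / (k + rank) per chunk ID.
    # Rank-based, so dense and BM25 scores never need to be comparable.
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

candidates = rrf_merge([
    dense_search(query, topk=50),  # semantic channel
    bm25_search(query, topk=50),   # lexical channel
])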
Suggested defaults
- chunk size: 300–600 Chinese chars or 200–400 English words
- overlap: 10%–20%
- retrieve top 30–100 candidates before reranking
- deduplicate by doc_id + span hash
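A minimal sketch of those defaults; every name here is hypothetical, and the point is simply fixed-size chunks with overlap plus a stable dedup key:

import hashlib

def chunk_text(text, size=400, overlap_ratio=0.15):
    # Fixed-size chunks with ~15% overlap so boundary-straddling facts survive.
    step = max(1, int(size * (1 - overlap_ratio)))
    return [text[i:i + size] for i in range(0, len(text), step)]

def dedup_key(doc_id, chunk):
    # doc_id + span hash: re-indexed copies of the same span collapse to one entry.
    span_hash = hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{span_hash}"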
3) Make ranking precise with a cross-encoder reranker
Adding reranking is often the highest-ROI quality upgrade.
Pipeline:
- retrieve top 50
- rerank and keep top 5–10
- force citations in the final answer
Example:
# Retrieve broadly, rerank precisely, answer with citations.
candidates = hybrid_retrieve(query, topk=50)      # broad candidate set
ranked = rerank_cross_encoder(query, candidates)  # cross-encoder ordering
context = ranked[:8]                              # keep only the best-ranked chunks
answer = llm_generate(query, context, require_citations=True)
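If you don't already have a rerank_cross_encoder, a sentence-transformers cross-encoder is a reasonable sketch; the model choice (BAAI/bge-reranker-base) and the candidate dict shape ({"id", "text"}) are assumptions, not requirements:

from sentence_transformers import CrossEncoder

# Loaded once at startup; scoring every (query, chunk) pair is the expensive part.
_reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank_cross_encoder(query, candidates):
    # Score each candidate jointly with the query, then sort high to low.
    pairs = [(query, c["text"]) for c in candidates]
    scores = _reranker.predict(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]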
4) Keep query rewriting conservative
Over-aggressive rewriting can distort the user's intent.
A safer default:
- keep the original query
- only expand abbreviations/synonyms
- never invent constraints
A strong pattern is parallel retrieval: original query + rewritten query, then merge and deduplicate.
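A minimal sketch of that pattern; hybrid_retrieve is the retriever from step 2, and rewrite_query is assumed to do only the conservative expansion described above:

def parallel_retrieve(query, topk=50):
    # Retrieve with both queries so a bad rewrite can never lose the original's hits.
    hits = (hybrid_retrieve(query, topk=topk)
            + hybrid_retrieve(rewrite_query(query), topk=topk))
    merged, seen = [], set()
    for hit in hits:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:topk]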
5) Close the evaluation loop weekly
Run offline evaluation weekly, and feed back online signals:
- explicit: thumbs up/down, “solved?”
- implicit: follow-up rate, copy rate, dwell time
If follow-up rate rises while offline metrics look fine, check:
- stale eval set
- delayed indexing for fresh docs
- missing retrieval strategy for specific query types (e.g., version-diff questions)
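To spot that drift per query type, a small aggregation over your query log is enough; the log fields (query_type, followed_up) are hypothetical:

from collections import defaultdict

def followup_rate_by_type(log_rows):
    # log_rows: [{"query_type": str, "followed_up": bool}, ...]
    # A rate rising for one type usually means that type needs its own strategy.
    totals, followups = defaultdict(int), defaultdict(int)
    for row in log_rows:
        totals[row["query_type"]] += 1
        followups[row["query_type"]] += int(row["followed_up"])
    return {qtype: followups[qtype] / totals[qtype] for qtype in totals}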
6) Fast troubleshooting checklist
- Hallucinations: increase evidence coverage; try passing the top 8 chunks as context instead of the top 3.
- Off-topic answers: inspect reranker false positives.
- Fresh docs not found: audit incremental indexing and retries.
- Old-version hits: add version and updated_at filters (see the sketch after this list).
- Long complex queries fail: split into sub-questions before retrieval.
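For the old-version bullet, the simplest fix is a post-retrieval metadata filter; the version and updated_at field names are assumptions about your chunk metadata:

from datetime import datetime, timedelta, timezone

def filter_candidates(candidates, required_version=None, max_age_days=None):
    # Drop chunks from the wrong product version or from stale documents.
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)
              if max_age_days is not None else None)
    kept = []
    for c in candidates:
        if required_version is not None and c["version"] != required_version:
            continue
        if cutoff is not None and c["updated_at"] < cutoff:
            continue
        kept.append(c)
    return kept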
7) Minimal production-ready stack
- vector store: Milvus / pgvector / Weaviate
- lexical search: OpenSearch/Elasticsearch or local BM25
- reranker: bge-reranker or cross-encoder
- evaluation: custom scripts + weekly Recall@k/MRR/Faithfulness report
Get this pipeline stable before adding agent complexity.
Summary
The most reliable RAG improvement order is:
- broader recall (hybrid retrieval)
- better ranking (cross-encoder reranker)
- continuous evaluation loop (offline + online)
Don’t start with a pricier model. Start with better retrieval engineering.