If your RAG system feels unreliable, switching to a more expensive LLM is usually the wrong first move. In most cases, the bottleneck is retrieval quality: weak recall, poor ranking, and no measurement loop.
This guide gives a practical path: make recall broader, make ranking sharper, then close the loop with offline + online evaluation.
1) Define what “accurate” means first
Build a small eval set (50–200 real queries), each with:
- user query
- reference answer or key facts
- gold evidence chunk IDs (can be multiple)
Track at least these metrics:
- Recall@k (the most important; optimize it first)
- MRR
- nDCG@k
- Faithfulness (is the answer grounded in retrieved evidence?)
If Recall@k doesn’t improve, prompt tuning won’t save you.
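To keep the measurement honest, here is a minimal Recall@k and MRR sketch; the eval-item shape ("query", "gold_ids") and the retrieve callable are assumptions about your own harness, not a fixed API:

def recall_at_k(retrieved_ids, gold_ids, k=10):
    # Fraction of gold evidence chunks that appear in the top-k results.
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(retrieved_ids, gold_ids):
    # Reciprocal rank of the first gold chunk; 0.0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def run_eval(eval_set, retrieve, k=10):
    # eval_set: [{"query": str, "gold_ids": [chunk_id, ...]}, ...]
    recalls, rrs = [], []
    for item in eval_set:
        retrieved = retrieve(item["query"], topk=k)
        recalls.append(recall_at_k(retrieved, item["gold_ids"], k))
        rrs.append(mrr(retrieved, item["gold_ids"]))
    return sum(recalls) / len(recalls), sum(rrs) / len(rrs)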
2) Make recall broad with hybrid retrieval
Use multiple retrieval channels:
- dense vector retrieval (semantic)
- BM25 keyword retrieval (lexical)
- optional rule-based retrieval (title/tag/time filters)
Why this works: product names, error codes, and versions are often better handled by BM25 than embeddings.
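One well-known way to merge the channels without calibrating their scores is reciprocal rank fusion (RRF); dense_search and bm25_search below stand in for your existing retrievers and are assumptions:

def rrf_merge(ranked_lists, k=60, topk=50):
    # RRF: each list contributes 1 / (k + rank) per chunk ID.
    # Rank-based, so dense and BM25 scores never need to be comparable.
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

candidates = rrf_merge([
    dense_search(query, topk=50),  # semantic channel
    bm25_search(query, topk=50),   # lexical channel
])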
Suggested defaults
- chunk size: 300–600 Chinese chars or 200–400 English words
- overlap: 10%–20%
- retrieve top 30–100 candidates before reranking
- deduplicate by doc_id + span hash
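A minimal sketch of those defaults; every name here is hypothetical, and the point is simply fixed-size chunks with overlap plus a stable dedup key:

import hashlib

def chunk_text(text, size=400, overlap_ratio=0.15):
    # Fixed-size chunks with ~15% overlap so boundary-straddling facts survive.
    step = max(1, int(size * (1 - overlap_ratio)))
    return [text[i:i + size] for i in range(0, len(text), step)]

def dedup_key(doc_id, chunk):
    # doc_id + span hash: re-indexed copies of the same span collapse to one entry.
    span_hash = hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id}:{span_hash}"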
3) Make ranking precise with a cross-encoder reranker
Adding reranking is often the highest-ROI quality upgrade.
Pipeline:
- retrieve top 50
- rerank and keep top 5–10
- force citations in the final answer
Example:
# Retrieve broadly, rerank precisely, answer with citations.
candidates = hybrid_retrieve(query, topk=50)      # broad candidate set
ranked = rerank_cross_encoder(query, candidates)  # cross-encoder ordering
context = ranked[:8]                              # keep only the best-ranked chunks
answer = llm_generate(query, context, require_citations=True)
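If you don't already have a rerank_cross_encoder, a sentence-transformers cross-encoder is a reasonable sketch; the model choice (BAAI/bge-reranker-base) and the candidate dict shape ({"id", "text"}) are assumptions, not requirements:

from sentence_transformers import CrossEncoder

# Loaded once at startup; scoring every (query, chunk) pair is the expensive part.
_reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank_cross_encoder(query, candidates):
    # Score each candidate jointly with the query, then sort high to low.
    pairs = [(query, c["text"]) for c in candidates]
    scores = _reranker.predict(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]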
4) Keep query rewriting conservative
Over-aggressive rewriting can distort the user's intent.
A safer default:
- keep the original query
- only expand abbreviations/synonyms
- never invent constraints
A strong pattern is parallel retrieval: original query + rewritten query, then merge and deduplicate.
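A minimal sketch of that pattern; hybrid_retrieve is the retriever from step 2, and rewrite_query is assumed to do only the conservative expansion described above:

def parallel_retrieve(query, topk=50):
    # Retrieve with both queries so a bad rewrite can never lose the original's hits.
    hits = (hybrid_retrieve(query, topk=topk)
            + hybrid_retrieve(rewrite_query(query), topk=topk))
    merged, seen = [], set()
    for hit in hits:
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:topk]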
5) Close the evaluation loop weekly
Run offline evaluation weekly, and feed back online signals:
- explicit: thumbs up/down, “solved?”
- implicit: follow-up rate, copy rate, dwell time
If follow-up rate rises while offline metrics look fine, check:
- stale eval set
- delayed indexing for fresh docs
- missing retrieval strategy for specific query types (e.g., version-diff questions)
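To spot that drift per query type, a small aggregation over your query log is enough; the log fields (query_type, followed_up) are hypothetical:

from collections import defaultdict

def followup_rate_by_type(log_rows):
    # log_rows: [{"query_type": str, "followed_up": bool}, ...]
    # A rate rising for one type usually means that type needs its own strategy.
    totals, followups = defaultdict(int), defaultdict(int)
    for row in log_rows:
        totals[row["query_type"]] += 1
        followups[row["query_type"]] += int(row["followed_up"])
    return {qtype: followups[qtype] / totals[qtype] for qtype in totals}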
6) Fast troubleshooting checklist
- Hallucinations: increase evidence coverage; try passing the top 8 chunks as context instead of the top 3.
- Off-topic answers: inspect reranker false positives.
- Fresh docs not found: audit incremental indexing and retries.
- Old-version hits: add version and updated_at filters (see the sketch after this list).
- Long complex queries fail: split into sub-questions before retrieval.
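For the old-version bullet, the simplest fix is a post-retrieval metadata filter; the version and updated_at field names are assumptions about your chunk metadata:

from datetime import datetime, timedelta, timezone

def filter_candidates(candidates, required_version=None, max_age_days=None):
    # Drop chunks from the wrong product version or from stale documents.
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)
              if max_age_days is not None else None)
    kept = []
    for c in candidates:
        if required_version is not None and c["version"] != required_version:
            continue
        if cutoff is not None and c["updated_at"] < cutoff:
            continue
        kept.append(c)
    return kept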
7) Minimal production-ready stack
- vector store: Milvus / pgvector / Weaviate
- lexical search: OpenSearch/Elasticsearch or local BM25
- reranker: bge-reranker or cross-encoder
- evaluation: custom scripts + weekly Recall@k/MRR/Faithfulness report
Get this pipeline stable before adding agent complexity.
Summary
The most reliable RAG improvement order is:
- broader recall (hybrid retrieval)
- better ranking (cross-encoder reranker)
- continuous evaluation loop (offline + online)
Don’t start with a pricier model. Start with better retrieval engineering.