If your RAG system feels unreliable, switching to a more expensive LLM is usually the wrong first move. In most cases, the bottleneck is retrieval quality: weak recall, poor ranking, and no measurement loop.

This guide gives a practical path: make recall broader, make ranking sharper, then close the loop with offline + online evaluation.

1) Define what “accurate” means first

Build a small eval set (50–200 real queries), each with:

  • user query
  • reference answer or key facts
  • gold evidence chunk IDs (can be multiple)

Track at least these metrics:

  • Recall@k (the most important metric; improve it first)
  • MRR
  • nDCG@k
  • Faithfulness (is the answer grounded in retrieved evidence?)

If Recall@k doesn’t improve, prompt tuning won’t save you.
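
A minimal sketch of the two most load-bearing metrics, assuming each eval item stores gold chunk IDs and that retrieve() returns ranked chunk IDs for a query (both eval_set and retrieve are placeholders for your own pipeline):

def recall_at_k(gold_ids, retrieved_ids, k=10):
    # fraction of gold evidence chunks that appear in the top-k results
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr(gold_ids, retrieved_ids):
    # reciprocal rank of the first gold chunk; 0.0 if none is retrieved
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

# eval_set items look like {"query": ..., "gold_chunk_ids": [...]}
recalls = [recall_at_k(item["gold_chunk_ids"], retrieve(item["query"]))
           for item in eval_set]
print("Recall@10:", sum(recalls) / len(recalls))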

2) Make recall broad with hybrid retrieval

Use multiple retrieval channels:

  • dense vector retrieval (semantic)
  • BM25 keyword retrieval (lexical)
  • optional rule-based retrieval (title/tag/time filters)

Why this works: product names, error codes, and versions are often better handled by BM25 than embeddings.
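
For illustration, one simple way to merge the channels is reciprocal rank fusion (RRF); dense_search and bm25_search below are placeholders for whatever vector-store and lexical queries you already run:

from collections import defaultdict

def hybrid_retrieve(query, topk=50, k_rrf=60):
    # merge dense and lexical hit lists with reciprocal rank fusion
    dense_ids = dense_search(query, topk)   # placeholder: vector store query
    bm25_ids = bm25_search(query, topk)     # placeholder: BM25/lexical query
    scores = defaultdict(float)
    for result_list in (dense_ids, bm25_ids):
        for rank, chunk_id in enumerate(result_list, start=1):
            scores[chunk_id] += 1.0 / (k_rrf + rank)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

RRF is only one fusion choice; normalized weighted scores work as well, but RRF needs no score calibration between channels.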

Suggested defaults

  • chunk size: 300–600 Chinese chars or 200–400 English words
  • overlap: 10%–20%
  • retrieve top 30–100 candidates before reranking
  • deduplicate by doc_id + span hash
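
A small sketch of the doc_id + span-hash dedup, assuming each candidate is a dict with doc_id and text fields (the field names are illustrative):

import hashlib

def dedup(candidates):
    # keep the first occurrence of each (doc_id, span-text hash) pair
    seen, unique = set(), []
    for chunk in candidates:
        key = (chunk["doc_id"], hashlib.sha1(chunk["text"].encode("utf-8")).hexdigest())
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique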

3) Make ranking precise with a cross-encoder reranker

Adding reranking is often the highest-ROI quality upgrade.

Pipeline:

  1. retrieve top 50
  2. rerank and keep top 5–10
  3. force citations in the final answer

Example:

# hybrid retrieval (dense + BM25) gives a broad candidate pool
candidates = hybrid_retrieve(query, topk=50)
# the cross-encoder scores each (query, chunk) pair and sorts by relevance
ranked = rerank_cross_encoder(query, candidates)
# only the top-ranked chunks go into the prompt
context = ranked[:8]
answer = llm_generate(query, context, require_citations=True)
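
rerank_cross_encoder above is pseudocode; one possible implementation uses the sentence-transformers CrossEncoder class with a bge-reranker checkpoint (an assumption here; any cross-encoder model is used the same way):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")  # assumed checkpoint; swap in your own

def rerank_cross_encoder(query, candidates):
    # score each (query, chunk text) pair jointly, then sort candidates by that score
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked]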

4) Keep query rewriting conservative

Over-aggressive rewriting can distort the user's original intent.

A safer default:

  • keep the original query
  • only expand abbreviations/synonyms
  • never invent constraints

A strong pattern is parallel retrieval: original query + rewritten query, then merge and deduplicate.
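
The parallel pattern is easy to layer on top of the hybrid_retrieve sketch above; expand_abbreviations is a placeholder for whatever conservative rewriting you use:

def parallel_retrieve(query, topk=50):
    # run the original and the conservatively rewritten query side by side, then merge
    rewritten = expand_abbreviations(query)   # placeholder: synonym/abbreviation expansion only
    merged = hybrid_retrieve(query, topk) + hybrid_retrieve(rewritten, topk)
    return list(dict.fromkeys(merged))[:topk]  # order-preserving dedup of chunk IDs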

5) Close the evaluation loop weekly

Run offline evaluation weekly, and feed online signals back into the loop:

  • explicit: thumbs up/down, “solved?”
  • implicit: follow-up rate, copy rate, dwell time

If follow-up rate rises while offline metrics look fine, check:

  • stale eval set
  • delayed indexing for fresh docs
  • missing retrieval strategy for specific query types (e.g., version-diff questions)
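
As a rough illustration, a weekly job can watch one implicit signal next to the offline numbers; the session structure and the 20% threshold below are assumptions, not recommendations:

def follow_up_rate(sessions):
    # share of sessions where the user asked another question after the first answer
    return sum(1 for s in sessions if len(s["turns"]) > 1) / len(sessions)

# offline metrics steady but follow-up rate climbing => work through the list above
if follow_up_rate(this_week_sessions) > 1.2 * follow_up_rate(last_week_sessions):
    print("follow-up rate up >20% week-over-week: check eval-set freshness and indexing lag")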

6) Fast troubleshooting checklist

  1. Hallucinations: increase evidence coverage; try passing the top 8 chunks as context instead of the top 3.
  2. Off-topic answers: inspect reranker false positives.
  3. Fresh docs not found: audit incremental indexing and retries.
  4. Old-version hits: add version and updated_at filters.
  5. Long complex queries fail: split into sub-questions before retrieval.
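
For item 5, a minimal decomposition sketch; llm_split_into_subquestions stands in for an LLM call that returns a list of sub-questions:

def retrieve_with_decomposition(query, topk=50):
    # split a long multi-part query, retrieve per sub-question, then merge the candidates
    sub_questions = llm_split_into_subquestions(query) or [query]   # placeholder LLM call
    merged = []
    for sq in sub_questions:
        merged.extend(hybrid_retrieve(sq, topk=max(1, topk // len(sub_questions))))
    return list(dict.fromkeys(merged))[:topk]  # order-preserving dedup of chunk IDs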

7) Minimal production-ready stack

  • vector store: Milvus / pgvector / Weaviate
  • lexical search: OpenSearch/Elasticsearch or local BM25
  • reranker: bge-reranker or cross-encoder
  • evaluation: custom scripts + weekly Recall@k/MRR/Faithfulness report

Get this pipeline stable before adding agent complexity.

Summary

The most reliable RAG improvement order is:

  1. broader recall (hybrid retrieval)
  2. better ranking (cross-encoder reranker)
  3. continuous evaluation loop (offline + online)

Don’t start with a pricier model. Start with better retrieval engineering.