Neural-Path/Notes
30 min
Requires: RAG Systems

Advanced RAG

Basic RAG — embed a document, store vectors, retrieve the top-k, generate — works well in demos but breaks on real corpora. Chunks are too large or too small, retrieval misses semantically relevant passages, and generated answers hallucinate because the retrieved context is noisy. Advanced RAG fixes this with four additions: smarter chunking, hybrid retrieval (dense + sparse), cross-encoder reranking, and evaluation. Each addition is independent — you can adopt them selectively based on where your pipeline fails.

Theory

Advanced RAG pipeline:

dense retrieval ─┐
                 ├─ RRF fusion → reranker
BM25 retrieval ──┘

Reranking

A cross-encoder scores each (query, candidate) pair jointly — far more accurate than bi-encoder similarity but 10–100× slower. Apply to top-20 candidates; pass top-4 to the LLM. Single biggest quality improvement in advanced RAG.


Basic RAG retrieves by embedding similarity and hands the top-k chunks to the LLM. Advanced RAG adds three more stages: smarter chunking (to control the precision-recall tradeoff), hybrid search (to catch what embeddings miss), and cross-encoder reranking (to re-score candidates with richer context). The diagram above shows the full pipeline. Each stage is independent — you can add them selectively where your pipeline fails, rather than all at once.

Chunking and the Retrieval-Generation Trade-off

Every chunking decision makes a trade-off. Let chunk size be c tokens:

  • Small c (128–256 tokens): high recall (the right sentence is captured), but low precision (chunks lack surrounding context; the LLM can't use them well)
  • Large c (1024–2048 tokens): high precision (full paragraphs with context), but lower recall (important sentences diluted by surrounding noise)

Optimal range for most tasks: 256–512 tokens with 10–20% overlap between adjacent chunks.

Semantic chunking splits on structural boundaries (paragraphs, sections, sentences) rather than fixed token counts. For technical documents, this consistently outperforms fixed chunking because it preserves logical units.
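The packing logic can be sketched in a few lines of plain Python. This is a toy splitter, not a production implementation: it packs whole paragraphs into chunks up to a size cap and carries a tail of each chunk into the next as overlap, with word counts standing in for token counts:

```python
def semantic_chunks(text: str, max_words: int = 300, overlap_words: int = 40) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_words words,
    carrying the tail of each chunk into the next as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]   # overlap with the previous chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    # Note: a paragraph longer than max_words is kept whole here; a real
    # splitter (e.g. RecursiveCharacterTextSplitter) recurses to sentences.
    return chunks
```

Because splits only ever happen at paragraph boundaries, every chunk is a sequence of logical units — the property that makes semantic chunking outperform fixed-size chunking on technical documents.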

Hybrid Search: BM25 + Dense Retrieval

Pure dense retrieval (cosine similarity of embeddings) excels at semantic matching but struggles with exact keyword matches. BM25 (a classical sparse retrieval algorithm) inverts this — exact keyword matches score high, but synonyms and paraphrases score zero.

BM25 score for query q and document d:

\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot |d| / \text{avgdl}\right)}

where f(t, d) is the term frequency of t in d, |d| is the document length, avgdl is the average document length, and k_1 ≈ 1.5 and b ≈ 0.75 are smoothing parameters.
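To make the formula concrete, here is a toy scorer that implements it directly over a tokenized corpus. Whitespace tokenization and the particular smoothed IDF variant are assumptions of this sketch, not part of the formula above:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula.

    `corpus` is a list of tokenized documents (lists of terms), used to
    compute IDF and the average document length (avgdl).
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = doc_terms.count(t)                            # term frequency in doc
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    "hybrid search combines bm25 and dense retrieval".split(),
    "dense retrieval uses embedding similarity".split(),
    "bm25 is a sparse retrieval algorithm".split(),
]
query = "bm25 retrieval".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# doc 2 contains both query terms and is short, so it scores highest;
# doc 1 misses "bm25" entirely and scores lowest
```

Note how the length normalization works: doc 0 also contains both terms but is longer than average, so the b term shrinks its score relative to doc 2.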

Hybrid fusion: combine BM25 and dense scores using Reciprocal Rank Fusion (RRF):

\text{RRF}(d) = \sum_{r \in \text{rankers}} \frac{1}{k + r(d)}

where r(d) is the rank of document d in ranker r and k = 60 is a constant. RRF doesn't require score normalization across rankers — it only uses rank positions.

RRF combines rankings instead of raw scores because dense and sparse retrievers produce scores on incompatible scales — cosine similarity outputs values in [-1, 1], while BM25 scores can be any positive float depending on corpus statistics. Normalizing these scales to be directly comparable requires knowing the full score distribution, which changes as the corpus grows. Rank positions are always directly comparable regardless of scale: first place is first place in any retrieval system. The constant k = 60 in the denominator dampens the influence of top-ranked documents, preventing a single retriever's strong first-place hit from dominating the fusion.
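The fusion formula translates almost one-to-one into code. A minimal sketch over ranked lists of document IDs (best-first, 1-based ranks):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.

    Each list is ordered best-first; a document missing from one
    ranking simply contributes nothing from that ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d3", "d1", "d7"]   # best-first from the dense retriever
bm25_ranking  = ["d1", "d9", "d3"]   # best-first from BM25
fused = rrf_fuse([dense_ranking, bm25_ranking])
# d1 (ranks 2 and 1): 1/62 + 1/61 edges out d3 (ranks 1 and 3): 1/61 + 1/63
```

Notice that the raw retrieval scores never appear — only positions — which is exactly why no cross-ranker normalization is needed.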

Cross-Encoder Reranking

Bi-encoder retrieval (embedding query and document separately) is fast but approximate. A cross-encoder processes the query and candidate together, enabling richer attention:

\text{score}(q, d) = \text{CrossEncoder}_\theta([q; d])

Cross-encoders are 10–100× slower than bi-encoders but dramatically more accurate on the top-k candidates. The standard practice: use a bi-encoder to retrieve the top-20 candidates cheaply, then rerank with a cross-encoder to select the top 3–5 for generation.

RAG Evaluation: RAGAS Metrics

RAGAS provides four largely reference-free metrics for RAG evaluation (context recall is the exception — it compares against a ground-truth answer):

| Metric | Formula | What it measures |
|---|---|---|
| Context Precision | (relevant chunks in top-k) / k | Are retrieved chunks actually useful? |
| Context Recall | (ground-truth claims supported by retrieved context) / (total claims in ground truth) | Did retrieval find everything needed? |
| Faithfulness | (claims grounded in context) / (total claims in answer) | Does the answer stay within the retrieved context? |
| Answer Relevancy | cosine similarity of answer embedding to question embedding | Does the answer address the question? |

Low context precision → too many irrelevant chunks retrieved. Low faithfulness → LLM is hallucinating beyond the context. Low recall → important context is missing from the index.
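Once the per-chunk and per-claim judgments exist, the ratio metrics reduce to a few lines. A minimal sketch with hand-labeled booleans standing in for the LLM judge that RAGAS actually uses (the real context precision is also rank-weighted, which this sketch omits):

```python
def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Share of retrieved chunks judged relevant to the question."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

def faithfulness(claim_is_grounded: list[bool]) -> float:
    """Share of answer claims grounded in the retrieved context."""
    return sum(claim_is_grounded) / len(claim_is_grounded)

# Hand-labeled judgments for one query (an LLM judge produces these in practice):
chunk_is_relevant = [True, True, False, True]        # 3 of 4 retrieved chunks useful
claim_is_grounded = [True, True, True, False, True]  # 4 of 5 answer claims grounded

precision = context_precision(chunk_is_relevant)  # 0.75
faith = faithfulness(claim_is_grounded)           # 0.8
```

The hard part in production is not the arithmetic but producing the judgments reliably — which is the service an evaluation framework like RAGAS provides.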

Walkthrough

Building a Multi-Stage RAG Pipeline

python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain.retrievers import EnsembleRetriever
from sentence_transformers import CrossEncoder
 
# Step 1: Semantic chunking with overlap
# (`documents`, `embedding_fn`, and `llm` are assumed defined elsewhere)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=60,
    separators=["\n\n", "\n", ". ", " "],  # structural boundaries first
)
chunks = splitter.split_documents(documents)
 
# Step 2: Build hybrid retriever
bm25 = BM25Retriever.from_documents(chunks, k=20)
dense = Chroma.from_documents(chunks, embedding_fn).as_retriever(search_kwargs={"k": 20})
 
# RRF fusion
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.5, 0.5],  # equal weight; tune based on your corpus
)
 
# Step 3: Cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_and_rerank(query: str, top_n: int = 4) -> list:
    candidates = hybrid.invoke(query)          # fused candidates from both retrievers
    pairs = [(query, c.page_content) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort by score only — Document objects aren't comparable, so a bare
    # sorted(zip(...)) would raise TypeError whenever two scores tie.
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]  # keep the top-4 for generation
 
# Step 4: Generate with grounded context
def answer(query: str) -> str:
    context = retrieve_and_rerank(query)
    ctx_text = "\n\n".join(c.page_content for c in context)
    return llm.invoke(f"Context:\n{ctx_text}\n\nQuestion: {query}")

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says applying all the advanced RAG techniques together gives the best results. In practice, each technique addresses a specific failure mode, and stacking them without measuring each contribution often introduces new failure modes that mask the improvements. Hybrid search helps when dense retrieval misses exact keywords; reranking helps when the bi-encoder ranks candidates incorrectly; better chunking helps when chunks lack context. But hybrid search adds BM25 index maintenance overhead, reranking adds latency, and smaller chunks mean larger indexes. The right combination depends on where your specific pipeline fails — measure each addition independently on your eval set before stacking them.

Which Technique to Add First

| Your pipeline symptom | Root cause | Fix |
|---|---|---|
| Correct facts missed | Retrieval recall too low | Better chunking; add hybrid search |
| Hallucinated facts | Low faithfulness | Stricter prompt; reranking removes noise chunks |
| Irrelevant chunks in context | Low context precision | Reranking; reduce top-k |
| Questions about recent data | Index stale | Re-indexing pipeline; consider metadata filtering |

Practical sequence: fix chunking first (free), add hybrid search second (big recall improvement), add reranking last (high impact but adds latency). Evaluate with RAGAS after each change.

Latency Budget

Each stage adds latency:

| Stage | Typical latency | Parallelizable? |
|---|---|---|
| BM25 retrieval | 10–50 ms | Yes (parallel with dense) |
| Dense retrieval | 50–200 ms | Yes (parallel with BM25) |
| Cross-encoder reranking (top-20) | 100–500 ms | No (sequential) |
| LLM generation | 500 ms–5 s | No |

For latency-sensitive applications: cache reranker scores for common queries, reduce candidate pool from 20 to 10, or use a lighter reranker model.
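Since the table marks BM25 and dense retrieval as parallelizable, the retrieval stage can cost max(dense, BM25) rather than their sum. A sketch using a thread pool — the retriever callables here are placeholders for your actual dense and BM25 clients:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_hybrid(query, dense_retrieve, bm25_retrieve):
    """Run both retrievers concurrently; retrieval latency becomes
    max(dense, bm25) instead of dense + bm25."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_retrieve, query)
        bm25_future = pool.submit(bm25_retrieve, query)
        return dense_future.result(), bm25_future.result()

# Stub retrievers for illustration — swap in real dense/BM25 clients.
dense_docs, bm25_docs = parallel_hybrid(
    "what is rrf",
    lambda q: [f"dense hit for: {q}"],
    lambda q: [f"bm25 hit for: {q}"],
)
```

Threads are appropriate here because both calls are I/O-bound (network round-trips to a vector store and a search index), so the GIL is not a bottleneck.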

🚀 Production

Advanced RAG checklist before going to production:

  • Chunk size validated on representative queries (try 256, 400, 512 — evaluate RAGAS on each)
  • Hybrid search enabled (BM25 catches exact product names, IDs, and jargon that embeddings miss)
  • Reranker in place — one reranker step typically improves answer quality more than doubling embedding model size
  • Faithfulness monitored in production — log cases where the answer contains claims not found in retrieved chunks
  • Re-indexing schedule defined — stale indexes are the most common silent failure in production RAG

Most impactful single change if you can only make one: add a cross-encoder reranker. It's a 2-hour integration with consistently large quality improvements across corpus types.
