Neural-Path/Notes
30 min
Requires: RAG Systems

Advanced RAG

Basic RAG — embed a document, store vectors, retrieve the top-k, generate — works well in demos but breaks on real corpora. Chunks are too large or too small, retrieval misses semantically relevant passages, and generated answers hallucinate because the retrieved context is noisy. Advanced RAG fixes this with four additions: smarter chunking, hybrid retrieval (dense + sparse), cross-encoder reranking, and evaluation. Each addition is independent — you can adopt them selectively based on where your pipeline fails.

Theory

Advanced RAG pipeline:

dense retrieval ─┐
                 ├─ RRF fusion → reranker
BM25 retrieval ──┘

Reranking

A cross-encoder scores each (query, candidate) pair jointly — far more accurate than bi-encoder similarity but 10–100× slower. Apply to top-20 candidates; pass top-4 to the LLM. Single biggest quality improvement in advanced RAG.


Basic RAG retrieves by embedding similarity and hands the top-k chunks to the LLM. Advanced RAG adds three more stages: smarter chunking (to control the precision-recall tradeoff), hybrid search (to catch what embeddings miss), and cross-encoder reranking (to re-score candidates with richer context). The diagram above shows the full pipeline. Each stage is independent — you can add them selectively where your pipeline fails, rather than all at once.

Chunking and the Retrieval-Generation Trade-off

Every chunking decision makes a trade-off. Let chunk size be c tokens:

  • Small c (128–256 tokens): high recall (the right sentence is captured), but low precision (chunks lack surrounding context; the LLM can't use them well)
  • Large c (1024–2048 tokens): high precision (full paragraphs with context), but lower recall (important sentences diluted by surrounding noise)

Optimal range for most tasks: 256–512 tokens with 10–20% overlap between adjacent chunks.

Semantic chunking splits on structural boundaries (paragraphs, sections, sentences) rather than fixed token counts. For technical documents, this consistently outperforms fixed chunking because it preserves logical units.
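The packing logic can be sketched in a few lines of plain Python. This is a toy splitter, not a production implementation: it packs whole paragraphs into chunks up to a size cap and carries a tail of each chunk into the next as overlap, with word counts standing in for token counts:

```python
def semantic_chunks(text: str, max_words: int = 300, overlap_words: int = 40) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_words words,
    carrying the tail of each chunk into the next as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap_words:]   # overlap with the previous chunk
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    # Note: a paragraph longer than max_words is kept whole here; a real
    # splitter (e.g. RecursiveCharacterTextSplitter) recurses to sentences.
    return chunks
```

Because splits only ever happen at paragraph boundaries, every chunk is a sequence of logical units — the property that makes semantic chunking outperform fixed-size chunking on technical documents.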

Hybrid Search: BM25 + Dense Retrieval

Pure dense retrieval (cosine similarity of embeddings) excels at semantic matching but struggles with exact keyword matches. BM25 (a classical sparse retrieval algorithm) inverts this — exact keyword matches score high, but synonyms and paraphrases score zero.

BM25 score for query q and document d:

\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot |d| / \text{avgdl}\right)}

where f(t, d) is the term frequency of t in d, |d| is the document length, avgdl is the average document length, and k_1 ≈ 1.5 and b ≈ 0.75 are smoothing parameters.
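To make the formula concrete, here is a toy scorer that implements it directly over a tokenized corpus. Whitespace tokenization and the particular smoothed IDF variant are assumptions of this sketch, not part of the formula above:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula.

    `corpus` is a list of tokenized documents (lists of terms), used to
    compute IDF and the average document length (avgdl).
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)            # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = doc_terms.count(t)                            # term frequency in doc
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [
    "hybrid search combines bm25 and dense retrieval".split(),
    "dense retrieval uses embedding similarity".split(),
    "bm25 is a sparse retrieval algorithm".split(),
]
query = "bm25 retrieval".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# doc 2 contains both query terms and is short, so it scores highest;
# doc 1 misses "bm25" entirely and scores lowest
```

Note how the length normalization works: doc 0 also contains both terms but is longer than average, so the b term shrinks its score relative to doc 2.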

Hybrid fusion: combine BM25 and dense scores using Reciprocal Rank Fusion (RRF):

\text{RRF}(d) = \sum_{r \in \text{rankers}} \frac{1}{k + r(d)}

where r(d) is the rank of document d in ranker r and k = 60 is a constant. RRF doesn't require score normalization across rankers — it only uses rank positions.

RRF combines rankings instead of raw scores because dense and sparse retrievers produce scores on incompatible scales — cosine similarity outputs values in [-1, 1], while BM25 scores can be any positive float depending on corpus statistics. Normalizing these scales to be directly comparable requires knowing the full score distribution, which changes as the corpus grows. Rank positions are always directly comparable regardless of scale: first place is first place in any retrieval system. The constant k = 60 in the denominator dampens the influence of top-ranked documents, preventing a single retriever's strong first-place hit from dominating the fusion.
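The fusion formula translates almost one-to-one into code. A minimal sketch over ranked lists of document IDs (best-first, 1-based ranks):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.

    Each list is ordered best-first; a document missing from one
    ranking simply contributes nothing from that ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d3", "d1", "d7"]   # best-first from the dense retriever
bm25_ranking  = ["d1", "d9", "d3"]   # best-first from BM25
fused = rrf_fuse([dense_ranking, bm25_ranking])
# d1 (ranks 2 and 1): 1/62 + 1/61 edges out d3 (ranks 1 and 3): 1/61 + 1/63
```

Notice that the raw retrieval scores never appear — only positions — which is exactly why no cross-ranker normalization is needed.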

Cross-Encoder Reranking

Bi-encoder retrieval (embedding query and document separately) is fast but approximate. A cross-encoder processes the query and candidate together, enabling richer attention:

\text{score}(q, d) = \text{CrossEncoder}_\theta([q; d])

Cross-encoders are 10–100× slower than bi-encoders but dramatically more accurate on the top-k candidates. The standard practice: use a bi-encoder to retrieve the top-20 candidates cheaply, then rerank with a cross-encoder to select the top 3–5 for generation.

RAG Evaluation: RAGAS Metrics

RAGAS provides four largely reference-free metrics for RAG evaluation (context recall is the exception — it compares against a ground-truth answer):

| Metric | Formula | What it measures |
|---|---|---|
| Context Precision | (relevant chunks in top-k) / k | Are retrieved chunks actually useful? |
| Context Recall | (ground-truth claims supported by retrieved context) / (total claims in ground truth) | Did retrieval find everything needed? |
| Faithfulness | (claims grounded in context) / (total claims in answer) | Does the answer stay within the retrieved context? |
| Answer Relevancy | cosine similarity of answer embedding to question embedding | Does the answer address the question? |

Low context precision → too many irrelevant chunks retrieved. Low faithfulness → LLM is hallucinating beyond the context. Low recall → important context is missing from the index.
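Once the per-chunk and per-claim judgments exist, the ratio metrics reduce to a few lines. A minimal sketch with hand-labeled booleans standing in for the LLM judge that RAGAS actually uses (the real context precision is also rank-weighted, which this sketch omits):

```python
def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Share of retrieved chunks judged relevant to the question."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant)

def faithfulness(claim_is_grounded: list[bool]) -> float:
    """Share of answer claims grounded in the retrieved context."""
    return sum(claim_is_grounded) / len(claim_is_grounded)

# Hand-labeled judgments for one query (an LLM judge produces these in practice):
chunk_is_relevant = [True, True, False, True]        # 3 of 4 retrieved chunks useful
claim_is_grounded = [True, True, True, False, True]  # 4 of 5 answer claims grounded

precision = context_precision(chunk_is_relevant)  # 0.75
faith = faithfulness(claim_is_grounded)           # 0.8
```

The hard part in production is not the arithmetic but producing the judgments reliably — which is the service an evaluation framework like RAGAS provides.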

Walkthrough

Building a Multi-Stage RAG Pipeline

python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain.retrievers import EnsembleRetriever
from sentence_transformers import CrossEncoder
 
# Step 1: Semantic chunking with overlap
# (`documents`, `embedding_fn`, and `llm` are assumed defined elsewhere)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=60,
    separators=["\n\n", "\n", ". ", " "],  # structural boundaries first
)
chunks = splitter.split_documents(documents)
 
# Step 2: Build hybrid retriever
bm25 = BM25Retriever.from_documents(chunks, k=20)
dense = Chroma.from_documents(chunks, embedding_fn).as_retriever(search_kwargs={"k": 20})
 
# RRF fusion
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.5, 0.5],  # equal weight; tune based on your corpus
)
 
# Step 3: Cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_and_rerank(query: str, top_n: int = 4) -> list:
    candidates = hybrid.invoke(query)          # fused candidates from both retrievers
    pairs = [(query, c.page_content) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort by score only — Document objects aren't comparable, so a bare
    # sorted(zip(...)) would raise TypeError whenever two scores tie.
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]  # keep the top-4 for generation
 
# Step 4: Generate with grounded context
def answer(query: str) -> str:
    context = retrieve_and_rerank(query)
    ctx_text = "\n\n".join(c.page_content for c in context)
    return llm.invoke(f"Context:\n{ctx_text}\n\nQuestion: {query}")

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says applying all the advanced RAG techniques together gives the best results. In practice, each technique addresses a specific failure mode, and stacking them without measuring each contribution often introduces new failure modes that mask the improvements. Hybrid search helps when dense retrieval misses exact keywords; reranking helps when the bi-encoder ranks candidates incorrectly; better chunking helps when chunks lack context. But hybrid search adds BM25 index maintenance overhead, reranking adds latency, and smaller chunks mean larger indexes. The right combination depends on where your specific pipeline fails — measure each addition independently on your eval set before stacking them.

Which Technique to Add First

| Your pipeline symptom | Root cause | Fix |
|---|---|---|
| Correct facts missed | Retrieval recall too low | Better chunking; add hybrid search |
| Hallucinated facts | Low faithfulness | Stricter prompt; reranking removes noise chunks |
| Irrelevant chunks in context | Low context precision | Reranking; reduce top-k |
| Questions about recent data | Index stale | Re-indexing pipeline; consider metadata filtering |

Practical sequence: fix chunking first (free), add hybrid search second (big recall improvement), add reranking last (high impact but adds latency). Evaluate with RAGAS after each change.

Latency Budget

Each stage adds latency:

| Stage | Typical latency | Parallelizable? |
|---|---|---|
| BM25 retrieval | 10–50 ms | Yes (parallel with dense) |
| Dense retrieval | 50–200 ms | Yes (parallel with BM25) |
| Cross-encoder reranking (top-20) | 100–500 ms | No (sequential) |
| LLM generation | 500 ms–5 s | No |

For latency-sensitive applications: cache reranker scores for common queries, reduce candidate pool from 20 to 10, or use a lighter reranker model.
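Since the table marks BM25 and dense retrieval as parallelizable, the retrieval stage can cost max(dense, BM25) rather than their sum. A sketch using a thread pool — the retriever callables here are placeholders for your actual dense and BM25 clients:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_hybrid(query, dense_retrieve, bm25_retrieve):
    """Run both retrievers concurrently; retrieval latency becomes
    max(dense, bm25) instead of dense + bm25."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_retrieve, query)
        bm25_future = pool.submit(bm25_retrieve, query)
        return dense_future.result(), bm25_future.result()

# Stub retrievers for illustration — swap in real dense/BM25 clients.
dense_docs, bm25_docs = parallel_hybrid(
    "what is rrf",
    lambda q: [f"dense hit for: {q}"],
    lambda q: [f"bm25 hit for: {q}"],
)
```

Threads are appropriate here because both calls are I/O-bound (network round-trips to a vector store and a search index), so the GIL is not a bottleneck.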

🚀 Production

Advanced RAG checklist before going to production:

  • Chunk size validated on representative queries (try 256, 400, 512 — evaluate RAGAS on each)
  • Hybrid search enabled (BM25 catches exact product names, IDs, and jargon that embeddings miss)
  • Reranker in place — one reranker step typically improves answer quality more than doubling embedding model size
  • Faithfulness monitored in production — log cases where the answer contains claims not found in retrieved chunks
  • Re-indexing schedule defined — stale indexes are the most common silent failure in production RAG

Most impactful single change if you can only make one: add a cross-encoder reranker. It's a 2-hour integration with consistently large quality improvements across corpus types.
