35 min
Requires: Embeddings

RAG Systems

Retrieval-Augmented Generation (RAG) is how you give a language model access to knowledge it wasn't trained on. When you ask Claude about a specific internal document, a recent event, or a private codebase, RAG is the mechanism that finds the relevant chunks and includes them in the context. Customer-support tools answer from internal knowledge bases via RAG; document-search products run dense retrieval backed by vector stores like pgvector. The core tension in RAG is retrieval quality: dense embedding search (cosine similarity in vector space) outperforms keyword search on semantic queries but can miss exact string matches — hybrid retrieval combines both. This lesson builds the full pipeline from chunking strategy through reranking to generation.

Theory

[Diagram: RAG pipeline — a user query ("What is RAG?") flows through retrieval into grounded LLM generation]

A language model can only know what was in its training data — anything more recent, more specific, or more private is invisible to it. RAG solves this by turning a closed-book exam into an open-book one: at inference time, retrieve the relevant pages first, then answer with them in context. The diagram above shows the two-stage pipeline: a retrieval system finds candidate documents, then the LLM reads them and generates a grounded answer. The challenge is retrieval quality: garbage in, garbage out.

Retrieval-Augmented Generation augments LLM generation with relevant documents retrieved from an external knowledge base. At inference time:

$$y^* = \arg\max_y P(y \mid q,\ \text{Retrieve}(q, \mathcal{D}))$$

Conditioning on both $q$ and $\text{Retrieve}(q, \mathcal{D})$ is necessary because the model has no other access to private or recent knowledge — without the retrieved context, it is forced to hallucinate or say "I don't know." The argmax over $y$ is the standard generation objective (most probable completion given all context). Top-$k$ retrieval specifically is forced by computational constraints: exact nearest-neighbor search over a large corpus is $O(N)$ per query, which is infeasible at scale — approximate indices like HNSW reduce this to $O(\log N)$. This is why RAG retrieval is always approximate: there is no free exact search at production corpus sizes.

where $\mathcal{D}$ is the document corpus and $\text{Retrieve}(q, \mathcal{D}) = \{d_1, \ldots, d_k\}$ are the top-$k$ chunks.

Dense Retrieval (Semantic Search)

Encode query and documents into a shared embedding space, then find nearest neighbors:

$$\text{Retrieve}(q, \mathcal{D}) = \underset{d \in \mathcal{D}}{\arg\text{top-}k}\ \cos\left(\text{enc}(q),\ \text{enc}(d)\right)$$

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$

Approximate Nearest Neighbor (ANN) indices (Hierarchical Navigable Small World (HNSW), Inverted File Index (IVF)) make this $O(\log N)$ instead of $O(N)$ — critical for corpora > 100k documents.
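As a concrete sketch, brute-force dense retrieval is a few lines of NumPy. This is the exact $O(N)$ scan that an ANN index approximates at scale; `dense_top_k` and its inputs are illustrative names, not a library API:

```python
import numpy as np

def dense_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Exact O(N) cosine top-k over row-wise document embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = D @ q                    # cosine similarity for every document
    top = np.argsort(-sims)[:k]     # indices of the k most similar docs
    return top, sims[top]
```

An ANN index replaces only the `argsort` scan; the normalization and dot-product structure stay the same.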

Sparse Retrieval (BM25)

Best Match 25 (BM25) uses traditional keyword-based retrieval with Term Frequency–Inverse Document Frequency (TF-IDF) saturation:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{(k_1+1) \cdot \text{tf}(t,d)}{\text{tf}(t,d) + k_1 \cdot \left(1 - b + b \cdot |d|/\text{avgdl}\right)}$$

Parameters: $k_1 = 1.2$ (TF saturation), $b = 0.75$ (length normalization). BM25 excels at exact keyword matches and rare technical terms.
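To make the formula concrete, here is a minimal BM25 scorer over pre-tokenized documents. It is a sketch using the standard smoothed IDF; real systems score via an inverted index rather than iterating over every document:

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency and smoothed IDF (Robertson/Lucene variant)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf.get(t, 0.0) * num / den
        scores.append(s)
    return scores
```

Note how the length term $b \cdot |d|/\text{avgdl}$ penalizes long documents, while $k_1$ caps how much repeated terms can help.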

Hybrid Retrieval with RRF

Combine dense and sparse rankings with Reciprocal Rank Fusion:

$$\text{RRF}(d) = \sum_{r \in \{r_{\text{dense}},\, r_{\text{sparse}}\}} \frac{1}{60 + r(d)}$$

where $r(d)$ is document $d$'s rank in each retrieval system. RRF is robust to score scale differences — no normalization needed.
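A minimal RRF fusion over ranked lists of document ids might look like this (`rrf_fuse` is an illustrative name; the constant 60 is the conventional smoothing parameter from the formula above):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists; r(d) is the 1-based position of d in each list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document appearing in both lists accumulates two reciprocal-rank terms, which is why agreement between dense and sparse retrieval pushes it to the top.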

💡When dense vs sparse retrieval wins

Dense retrieval captures semantic meaning: "heart attack" matches "myocardial infarction." Sparse retrieval captures exact terms: product codes, proper names, code snippets. For technical documentation, hybrid retrieval reliably outperforms either alone.

Walkthrough

Corpus: 47 ML alignment papers → 1,834 chunks (512 tokens, 50-token overlap). Query: "What are the key differences between RLHF and DPO for alignment?"

Step 1 — Chunk and index

python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

docs = SimpleDirectoryReader("./papers").load_data()
# 47 documents

nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
# 1,834 chunks

index = VectorStoreIndex(nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-small"))
# Each chunk embedded to a 1536-dim vector (backed by pgvector in production)

Step 2 — Dense retrieval returns top-5 chunks

python
from llama_index.core.retrievers import VectorIndexRetriever

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
results = retriever.retrieve("What are the key differences between RLHF and DPO?")
[1] score=0.912 | raft_dpo_2024.pdf
    "DPO directly optimizes the policy using a binary cross-entropy objective over
     preferred and rejected completions, bypassing the need for an explicit reward model..."
[2] score=0.887 | constitutional_ai.pdf
    "RLHF trains a separate reward model on human preference data, then uses PPO to
     optimize the policy against that reward signal. The reward model requires..."
[3] score=0.861 | dpo_paper.pdf
    "The key insight of DPO is that the optimal RLHF policy has a closed-form solution
     expressible in terms of the reference policy, making reward model training unnecessary..."
[4] score=0.834 | rlhf_survey.pdf
    "PPO introduces significant training instability when the policy drifts far from the
     reference model. The KL penalty coefficient β must be carefully tuned..."
[5] score=0.801 | dpo_paper.pdf
    "Empirically, DPO matches or exceeds PPO on summarization and dialogue tasks while
     requiring 2–3× less compute due to eliminating reward model training..."

Step 3 — BM25 catches exact terms dense missed

Dense retrieval missed the exact term "KL divergence" because it was paraphrased in some chunks. BM25 finds it:

python
from llama_index.retrievers.bm25 import BM25Retriever

bm25 = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=5)
sparse_nodes = bm25.retrieve("RLHF DPO KL divergence reward model")
[1] bm25=18.4 | rlhf_theory.pdf  — "...the KL divergence constraint in RLHF is..."
[2] bm25=15.2 | dpo_paper.pdf    — "...reward model r(x,y) trained on preference pairs..."

Step 4 — RRF fusion and generation

python
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

hybrid = QueryFusionRetriever(
    [retriever, bm25], similarity_top_k=5, mode="reciprocal_rerank"
)
# raft_dpo_2024.pdf now rank 1 (appeared in both lists)
# rlhf_theory.pdf rises to rank 3 (strong BM25 signal)

query_engine = RetrieverQueryEngine.from_args(hybrid)
response = query_engine.query("What are the key differences between RLHF and DPO?")
RLHF and DPO both optimize for human preferences but differ in mechanism:

RLHF requires two stages — first training a reward model on (prompt, chosen, rejected)
triplets, then running PPO to maximize that reward subject to a KL constraint against
the reference policy. This introduces training instability and significant compute overhead.

DPO reframes alignment as a classification problem. It shows the RLHF objective has a
closed-form solution, leading to a binary cross-entropy loss directly on the policy
without an explicit reward model. In practice, DPO is 2–3× cheaper and matches PPO on
most benchmarks [raft_dpo_2024.pdf, dpo_paper.pdf].

Sources: [1] raft_dpo_2024.pdf (0.912) [2] constitutional_ai.pdf (0.887) [3] dpo_paper.pdf (0.861)

The hybrid retriever found the rlhf_theory.pdf chunk (via BM25) that pure dense search would have missed — that chunk used the exact term "KL divergence" without a semantic paraphrase.

Code Implementation

nlp/13_rag/
python
# nlp/13_rag/train/ingest.py — Document ingestion pipeline
from openai import OpenAI
import psycopg2, json
from pathlib import Path

oai = OpenAI()

def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Fixed-size chunks with overlap; whitespace words approximate tokens here
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+size]))
        i += size - overlap
    return chunks

def embed(texts: list[str]) -> list[list[float]]:
    r = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [e.embedding for e in r.data]

def ingest(docs_dir: str, conn_str: str):
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id SERIAL PRIMARY KEY,
            source TEXT,
            chunk_text TEXT,
            embedding vector(1536)
        )
    """)

    for path in Path(docs_dir).glob("*.txt"):
        chunks = chunk_text(path.read_text())
        embeddings = embed(chunks)
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO chunks (source, chunk_text, embedding) VALUES (%s, %s, %s)",
                (path.name, chunk, json.dumps(emb))
            )
        print(f"Ingested {path.name}: {len(chunks)} chunks")
    conn.commit()
    conn.close()
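The query side of this pipeline is symmetric: embed the query the same way, then let pgvector order chunks by cosine distance. A sketch against the schema above — `retrieve_chunks` is a hypothetical helper, and `<=>` is pgvector's cosine-distance operator:

```python
import json

def retrieve_chunks(cur, query_embedding: list[float], k: int = 5):
    """Return top-k (source, chunk_text, distance) rows by cosine distance."""
    cur.execute(
        """
        SELECT source, chunk_text, embedding <=> %s::vector AS distance
        FROM chunks
        ORDER BY distance
        LIMIT %s
        """,
        (json.dumps(query_embedding), k),
    )
    return cur.fetchall()
```

In production you would pair this with an HNSW index (`CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)`) so the `ORDER BY` does not scan every row.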

Analysis & Evaluation

Where Your Intuition Breaks

Intuition says more retrieved chunks means more context for the model, which means better answers. In practice, answer quality degrades with $k$: each additional chunk beyond the optimal set adds noise that the LLM must filter out, and LLMs are poor filters. Empirically, RAG systems with $k = 3$–$5$ highly relevant chunks outperform systems with $k = 20$ mixed-relevance chunks on most tasks. The failure mode is "lost in the middle" — LLMs attend most strongly to the beginning and end of the context, so relevant chunks buried in the middle of a large retrieved set are effectively invisible. In the retrieved set, precision matters more than recall.

Retrieval Quality Metrics

| Metric | Formula | Meaning |
|---|---|---|
| Recall@k | $\|R \cap G\| / \|G\|$ | Fraction of relevant docs retrieved |
| Precision@k | $\|R \cap G\| / k$ | Fraction of retrieved docs that are relevant |
| Mean Reciprocal Rank (MRR) | $\frac{1}{\|Q\|}\sum_{q} \frac{1}{\text{rank}_q}$ | Reciprocal rank of the first relevant doc, averaged over queries |
| Normalized Discounted Cumulative Gain (NDCG@k) | Discounted cumulative gain, normalized | Penalizes relevant docs ranked lower |

where $R$ is the retrieved set and $G$ the gold (relevant) set.
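The first three metrics are a few lines each. A sketch, with `retrieved` as a ranked list of doc ids and `relevant` as the gold set (names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the gold set found in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are in the gold set."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant doc, over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

NDCG@k additionally needs graded relevance labels, which is why it is usually computed with an evaluation library rather than by hand.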

RAG vs Fine-tuning Decision Matrix

| Scenario | Recommendation |
|---|---|
| Knowledge changes frequently | RAG (update the corpus, not the model) |
| Need citations / provenance | RAG |
| Knowledge is static and < 100k tokens | Fine-tuning (or just context) |
| Style/format changes needed | Fine-tuning |
| Private data at inference | RAG with access control |
| Reasoning over structured data | Fine-tuning + tool use |

Chunking Strategy Comparison

| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size (512 tokens) | Simple, uniform | Splits sentences | Homogeneous docs |
| Sentence-based | Semantic coherence | Variable size | General text |
| Recursive character | Good balance | More complex | Mixed content |
| Semantic similarity | Best coherence | Slow, expensive | High-precision retrieval |
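For comparison with the fixed-size chunker used in the walkthrough, a minimal sentence-based chunker might look like this. The regex sentence split is naive and purely illustrative; production systems use a proper sentence segmenter:

```python
import re

def sentence_chunks(text: str, max_chars: int = 600) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive split on ., !, ? followed by whitespace (illustrative only)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)   # chunk full: start a new one
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Unlike the fixed-size chunker, no sentence is ever split across chunks, at the cost of variable chunk sizes.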
🚀Monitor retrieval quality in production

Track two metrics in production: (1) retrieval score distribution — if average similarity scores drop, your corpus or query distribution has drifted; (2) citation rate — what fraction of responses include citations. A sudden drop signals retrieval failure. Set up alerts when either metric deviates > 2σ from baseline.
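The score-drift alert is simple to sketch: compare the recent mean retrieval score against the baseline distribution (names and the 2σ threshold here are illustrative):

```python
import statistics

def drift_alert(baseline_scores: list[float], recent_scores: list[float],
                n_sigma: float = 2.0) -> bool:
    """Flag when the recent mean score deviates > n_sigma from baseline."""
    mu = statistics.mean(baseline_scores)
    sigma = statistics.stdev(baseline_scores)
    recent_mu = statistics.mean(recent_scores)
    return abs(recent_mu - mu) > n_sigma * sigma
```

The same shape of check works for citation rate; the key design choice is computing the baseline over a window long enough to smooth out query-mix noise.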
