RAG Systems
Retrieval-Augmented Generation (RAG) is how you give a language model access to knowledge it wasn't trained on. When you ask Claude about a specific internal document, a recent event, or a private codebase, RAG is the mechanism that finds the relevant chunks and includes them in the context. Customer-support teams query internal knowledge bases via RAG; document search products use dense retrieval backed by pgvector. The core tension in RAG is retrieval quality: dense embedding search (cosine similarity in vector space) outperforms keyword search on semantic queries but can miss exact string matches — hybrid retrieval combines both. This lesson builds the full pipeline from chunking strategy through reranking to generation.
Theory
"What is RAG?"
A language model can only know what was in its training data — anything more recent, more specific, or more private is invisible to it. RAG solves this by turning a closed-book exam into an open-book one: at inference time, retrieve the relevant pages first, then answer with them in context. The pipeline has two stages: a retrieval system finds candidate documents, then the LLM reads them and generates a grounded answer. The challenge is retrieval quality: garbage in, garbage out.
Retrieval-Augmented Generation augments LLM generation with relevant documents retrieved from an external knowledge base. At inference time:

$$y^* = \arg\max_{y} \; p\big(y \mid q, d_1, \dots, d_k\big), \qquad \{d_1, \dots, d_k\} = \operatorname{top-}k_{d \in \mathcal{D}} \; \operatorname{sim}(q, d)$$

where $\mathcal{D}$ is the document corpus and $d_1, \dots, d_k$ are the top-$k$ retrieved chunks.

Conditioning on both $q$ and $d_1, \dots, d_k$ is necessary because the model has no other access to private or recent knowledge — without the retrieved context, it is forced to hallucinate or say "I don't know." The argmax over $y$ is the standard generation objective (most probable completion given all context). Top-$k$ retrieval specifically is forced by computational constraints: exact nearest-neighbor search over a corpus of $N$ documents is $O(N)$ per query, which is infeasible at scale — approximate indices like HNSW reduce this to $O(\log N)$. This is why RAG retrieval is always approximate: there is no free exact search at production corpus sizes.
Dense Retrieval (Semantic Search)
Encode query and documents into a shared embedding space, then find nearest neighbors:

$$\operatorname{sim}(q, d) = \frac{E(q) \cdot E(d)}{\lVert E(q) \rVert \, \lVert E(d) \rVert}$$

Approximate Nearest Neighbor (ANN) indices (Hierarchical Navigable Small World (HNSW), Inverted File Index (IVF)) make this $O(\log N)$ instead of $O(N)$ — critical for corpora > 100k documents.
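The brute-force $O(N)$ baseline that ANN indices replace is a few lines of NumPy. A minimal sketch (function and variable names are illustrative, not from the lesson's codebase):

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Brute-force dense retrieval: cosine similarity against every doc, then top-k."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                  # cosine similarity to each document, shape (N,)
    top = np.argsort(-sims)[:k]   # indices of the k most similar documents
    return top, sims[top]

# Toy corpus: 4 documents in a 3-dim embedding space
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
idx, scores = cosine_top_k(np.array([1.0, 0.05, 0.0]), docs, k=2)
```

An HNSW index answers the same query without touching every row, which is what makes the $O(\log N)$ scaling possible.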
Sparse Retrieval (BM25)
Best Match 25 (BM25) uses traditional keyword-based retrieval with Term Frequency–Inverse Document Frequency (TF-IDF) saturation:

$$\operatorname{BM25}(q, d) = \sum_{t \in q} \operatorname{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\operatorname{avgdl}}\right)}$$

Parameters: $k_1$ (TF saturation, typically 1.2–2.0) and $b$ (length normalization, typically 0.75). BM25 excels at exact keyword matches and rare technical terms.
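The formula maps directly to code. A self-contained sketch over a toy tokenized corpus, using the common defaults $k_1 = 1.5$, $b = 0.75$ (real systems score via an inverted index rather than a linear scan):

```python
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)        # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF variant
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["kl", "divergence", "constraint", "in", "rlhf"],
          ["dpo", "loss", "on", "preference", "pairs"],
          ["ppo", "reward", "model", "training"]]
scores = [bm25_score(["kl", "divergence"], d, corpus) for d in corpus]
```

Note how the score saturates in $f(t,d)$: repeating a term many times in a document yields diminishing returns, unlike raw TF.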
Hybrid Retrieval with RRF
Combine dense and sparse rankings with Reciprocal Rank Fusion:

$$\operatorname{RRF}(d) = \sum_{i} \frac{1}{k + r_i(d)}$$

where $r_i(d)$ is document $d$'s rank in retrieval system $i$ and $k$ is a smoothing constant (typically 60). RRF is robust to score scale differences — no normalization needed.
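RRF itself is only a few lines. A sketch that fuses ranked lists of document IDs (the constant $k = 60$ is the value commonly used in practice):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each system contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: a doc appearing in both lists outranks a doc that tops only one
dense  = ["raft_dpo", "constitutional_ai", "dpo_paper"]
sparse = ["rlhf_theory", "raft_dpo", "dpo_paper"]
fused = rrf_fuse([dense, sparse])
```

Because only ranks enter the sum, an 18.4 BM25 score and a 0.91 cosine similarity fuse cleanly with no normalization step.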
Dense retrieval captures semantic meaning: "heart attack" matches "myocardial infarction." Sparse retrieval captures exact terms: product codes, proper names, code snippets. For technical documentation, hybrid typically outperforms either alone.
Walkthrough
Corpus: 47 ML alignment papers → 1,834 chunks (512 tokens, 50-token overlap). Query: "What are the key differences between RLHF and DPO for alignment?"
Step 1 — Chunk and index
docs = SimpleDirectoryReader("./papers").load_data()
# 47 documents
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=50).get_nodes_from_documents(docs)
# 1,834 chunks
index = VectorStoreIndex(nodes, embed_model=OpenAIEmbedding(model="text-embedding-3-small"))
# Each chunk embedded to 1536-dim vector, stored in pgvector
Step 2 — Dense retrieval returns top-5 chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
nodes = retriever.retrieve("What are the key differences between RLHF and DPO?")
[1] score=0.912 | raft_dpo_2024.pdf
"DPO directly optimizes the policy using a binary cross-entropy objective over
preferred and rejected completions, bypassing the need for an explicit reward model..."
[2] score=0.887 | constitutional_ai.pdf
"RLHF trains a separate reward model on human preference data, then uses PPO to
optimize the policy against that reward signal. The reward model requires..."
[3] score=0.861 | dpo_paper.pdf
"The key insight of DPO is that the optimal RLHF policy has a closed-form solution
expressible in terms of the reference policy, making reward model training unnecessary..."
[4] score=0.834 | rlhf_survey.pdf
"PPO introduces significant training instability when the policy drifts far from the
reference model. The KL penalty coefficient β must be carefully tuned..."
[5] score=0.801 | dpo_paper.pdf
"Empirically, DPO matches or exceeds PPO on summarization and dialogue tasks while
requiring 2–3× less compute due to eliminating reward model training..."
Step 3 — BM25 catches exact terms dense missed
Dense retrieval missed the exact term "KL divergence" because it was paraphrased in some chunks. BM25 finds it:
bm25 = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=5)
sparse_nodes = bm25.retrieve("RLHF DPO KL divergence reward model")
[1] bm25=18.4 | rlhf_theory.pdf — "...the KL divergence constraint in RLHF is..."
[2] bm25=15.2 | dpo_paper.pdf — "...reward model r(x,y) trained on preference pairs..."
Step 4 — RRF fusion and generation
hybrid = QueryFusionRetriever(
    [dense_retriever, bm25], similarity_top_k=5, mode="reciprocal_rerank"
)
# raft_dpo_2024.pdf now rank 1 (appeared in both lists)
# rlhf_theory.pdf rises to rank 3 (strong BM25 signal)
response = query_engine.query("What are the key differences between RLHF and DPO?")
RLHF and DPO both optimize for human preferences but differ in mechanism:
RLHF requires two stages — first training a reward model on (prompt, chosen, rejected)
triplets, then running PPO to maximize that reward subject to a KL constraint against
the reference policy. This introduces training instability and significant compute overhead.
DPO reframes alignment as a classification problem. It shows the RLHF objective has a
closed-form solution, leading to a binary cross-entropy loss directly on the policy
without an explicit reward model. In practice, DPO is 2–3× cheaper and matches PPO on
most benchmarks [raft_dpo_2024.pdf, dpo_paper.pdf].
Sources: [1] raft_dpo_2024.pdf (0.912) [2] constitutional_ai.pdf (0.887) [3] dpo_paper.pdf (0.861)
The hybrid retriever found the rlhf_theory.pdf chunk (via BM25) that pure dense search would have missed — that chunk used the exact term "KL divergence" without a semantic paraphrase.
Code Implementation
# nlp/13_rag/train/ingest.py — Document ingestion pipeline
import json
from pathlib import Path

import psycopg2
from openai import OpenAI

oai = OpenAI()

def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking over whitespace tokens with overlap between chunks."""
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + size]))
        i += size - overlap
    return chunks

def embed(texts: list[str]) -> list[list[float]]:
    r = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [e.embedding for e in r.data]

def ingest(docs_dir: str, conn_str: str):
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # pgvector must be installed
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id SERIAL PRIMARY KEY,
            source TEXT,
            chunk_text TEXT,
            embedding vector(1536)
        )
    """)
    for path in Path(docs_dir).glob("*.txt"):
        text = path.read_text()
        chunks = chunk_text(text)
        embeddings = embed(chunks)
        for chunk, emb in zip(chunks, embeddings):
            cur.execute(
                "INSERT INTO chunks (source, chunk_text, embedding) VALUES (%s, %s, %s)",
                (path.name, chunk, json.dumps(emb)),
            )
        conn.commit()
        print(f"Ingested {path.name} → {len(chunks)} chunks")

Analysis & Evaluation
Where Your Intuition Breaks
The intuition says more retrieved chunks means more context for the model, which means better answers. In fact, retrieval quality degrades as $k$ grows: each additional chunk beyond the optimal set adds noise that the LLM must filter out, and LLMs are poor at filtering. Empirically, RAG systems with a few highly relevant chunks outperform systems with many mixed-relevance chunks on most tasks. The failure mode is "lost in the middle" — LLMs attend more strongly to the beginning and end of the context window, so relevant chunks buried in the middle of a large retrieved set are effectively invisible. Precision matters more than recall in the retrieved set.
Retrieval Quality Metrics
| Metric | Formula | Meaning |
|---|---|---|
| Recall@k | $\frac{\lvert \text{relevant} \cap \text{retrieved}_k \rvert}{\lvert \text{relevant} \rvert}$ | Fraction of relevant docs retrieved in the top $k$ |
| Precision@k | $\frac{\lvert \text{relevant} \cap \text{retrieved}_k \rvert}{k}$ | Fraction of the top $k$ that are relevant |
| Mean Reciprocal Rank (MRR) | $\frac{1}{\lvert Q \rvert} \sum_{q \in Q} \frac{1}{\text{rank}_q}$ | Mean reciprocal rank of the first relevant doc |
| Normalized Discounted Cumulative Gain (NDCG)@k | $\frac{\text{DCG}@k}{\text{IDCG}@k}$ | Discounted cumulative gain; penalizes relevant docs ranked lower |
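The set-based metrics in the table are each a few lines of Python. A sketch, assuming `relevant` is a set of gold document IDs and `retrieved` an ordered result list (names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean over queries of 1 / rank of the first relevant document (0 if none found)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
```

Running these over a labeled query set is enough to compare chunking strategies or dense-vs-hybrid retrieval head to head.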
RAG vs Fine-tuning Decision Matrix
| Scenario | Recommendation |
|---|---|
| Knowledge changes frequently | RAG (update corpus, not model) |
| Need citations / provenance | RAG |
| Knowledge is static + < 100k tokens | Fine-tuning (or just context) |
| Style/format changes needed | Fine-tuning |
| Private data at inference | RAG with access control |
| Reasoning over structured data | Fine-tuning + tool use |
Chunking Strategy Comparison
| Strategy | Pros | Cons | Best for |
|---|---|---|---|
| Fixed-size (512 tokens) | Simple, uniform | Splits sentences | Homogeneous docs |
| Sentence-based | Semantic coherence | Variable size | General text |
| Recursive character | Good balance | More complex | Mixed content |
| Semantic similarity | Best coherence | Slow, expensive | High-precision retrieval |
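The sentence-based strategy from the table can be sketched in a few lines: split on sentence boundaries, then pack whole sentences greedily until a size budget is hit. This uses a naive regex splitter and a word budget as a stand-in for tokens; production systems use a real sentence segmenter and tokenizer:

```python
import re

def sentence_chunks(text: str, max_words: int = 100) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:  # budget exceeded: close the chunk
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "First sentence here. Second one follows. A third."
chunks = sentence_chunks(text, max_words=6)
```

Unlike the fixed-size splitter in the ingestion script, no sentence is ever cut in half, at the cost of variable chunk sizes.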
Track two metrics in production: (1) retrieval score distribution — if average similarity scores drop, your corpus or query distribution has drifted; (2) citation rate — what fraction of responses include citations. A sudden drop signals retrieval failure. Set up alerts when either metric deviates > 2σ from baseline.
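The 2σ alert described above is a one-liner given a baseline window. A sketch, assuming `baseline` holds historical daily means of whichever metric you track (names are hypothetical):

```python
import statistics

def deviates(current: float, baseline: list[float], n_sigma: float = 2.0) -> bool:
    """Alert when current is more than n_sigma standard deviations from the baseline mean."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(current - mu) > n_sigma * sigma

baseline = [0.82, 0.80, 0.81, 0.83, 0.79]  # e.g., mean retrieval similarity per day
```

The same check works for citation rate; the only design choice is the baseline window length, which trades alert sensitivity against false positives from normal day-to-day variance.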