Neural-Path/Notes
30 min
Requires: RAG Systems

Vector Databases

Semantic search requires finding vectors that are similar in meaning, not identical in content. A keyword search for "heart attack" won't find a document about "myocardial infarction." Vector similarity search — finding the k nearest neighbors in high-dimensional embedding space — solves this, but requires index structures that make search tractable over millions or billions of vectors.

How It Works

Vector index — approximate nearest neighbor search

Build a layered proximity graph. Navigate greedily from an entry node to the query — best recall/speed tradeoff.

[Interactive demo: greedy HNSW navigation over a toy index of 11 document vectors (billing, pricing, onboarding, api-keys, webhooks, limits, sdk, and others). Following graph edges along the navigation path from the entry node, the search reaches the top-3 results after checking only 4 of 11 vectors: >99% recall, fast, but higher memory use.]

Exact (flat) search computes the distance to every vector — precise but O(n) per query. For 100M vectors, that's 100M distance computations per request. Approximate Nearest Neighbor (ANN) methods trade a small amount of recall for large throughput gains.

Keyword search works by finding exact term matches; semantic search works by finding nearby points in a high-dimensional space where meaning is encoded by position. These are fundamentally different problems with different computational requirements: exact nearest neighbor search is O(n) and cannot be accelerated with inverted indexes, so every practical retrieval system must use approximations that prune the search space geometrically.
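To make the O(n) baseline concrete, here is a minimal brute-force semantic search sketch in NumPy. The corpus of random vectors stands in for real embeddings; the point is that every stored vector is touched on every query.

```python
import numpy as np

def exact_search(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Flat (exact) search: one distance computation per stored vector, O(n)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                    # cosine similarity via normalized dot products
    return np.argsort(-scores)[:k]   # indices of the k most similar vectors

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))            # 10k toy 64-d "embeddings"
query = corpus[42] + 0.01 * rng.normal(size=64)   # slightly perturbed copy of vector 42
top = exact_search(query, corpus)
```

Every index discussed below exists to avoid the full `v @ q` pass over the corpus.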

Approximate nearest neighbor search

Flat index (exact): brute-force. Used for small collections (under 1M vectors) or when exact recall is required. O(n) per query.

IVF (Inverted File Index): partition vectors into k clusters using k-means. At query time, search only the nearest c clusters (typically c=4–16). ~10-100× faster than flat, with ~5% recall loss. Best for batch retrieval where some recall loss is acceptable.
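A toy illustration of the IVF idea, using a naive NumPy k-means rather than a real library (the cluster count and nprobe values are arbitrary, and production systems use far more refined training):

```python
import numpy as np

def kmeans(x: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    """Naive Lloyd's k-means: good enough to illustrate IVF partitioning."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    assign = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        assign = np.argmin(((x[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assign

class IVFIndex:
    def __init__(self, vectors: np.ndarray, n_clusters: int = 16):
        self.vectors = vectors
        self.centroids, assign = kmeans(vectors, n_clusters)
        # Inverted lists: cluster id -> ids of vectors assigned to that cluster.
        self.lists = {j: np.where(assign == j)[0] for j in range(n_clusters)}

    def search(self, query: np.ndarray, k: int = 3, nprobe: int = 4) -> np.ndarray:
        # Probe only the nprobe clusters whose centroids are nearest to the query.
        order = np.argsort(((self.centroids - query) ** 2).sum(-1))
        candidates = np.concatenate([self.lists[j] for j in order[:nprobe]])
        dists = ((self.vectors[candidates] - query) ** 2).sum(-1)
        return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 32))
index = IVFIndex(data, n_clusters=16)
hits = index.search(data[7], k=3, nprobe=8)   # scans ~half the lists, not all vectors
```

The recall loss comes from queries whose true neighbors sit in clusters that are never probed; raising nprobe trades speed back for recall.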

HNSW (Hierarchical Navigable Small World): build a layered graph where each vector connects to its approximate nearest neighbors. Navigate from a coarse top layer to fine bottom layer during search. Best recall/speed tradeoff, at the cost of higher memory usage: the graph links add on the order of 8·M bytes per vector (M is the per-node link count) on top of the raw float32 vectors.

HNSW's layered graph is built to have the "small world" property — a graph where any node is reachable in logarithmically many hops — because this is what makes greedy graph navigation efficient. Starting from a coarse layer (few nodes, long-range connections) and descending to fine layers (all nodes, short-range connections) reduces the search path from O(n) to O(log n) expected hops. Without the hierarchical structure, the graph would be too dense to navigate quickly at large scale; without the small-world connectivity, the greedy descent would get stuck in local optima.
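The greedy-navigation idea can be sketched on a single flat proximity graph. This is not the real HNSW construction (no hierarchy, no heuristic edge pruning), just a crude k-NN graph and the greedy descent that HNSW runs within each layer:

```python
import numpy as np

def build_knn_graph(vectors: np.ndarray, m: int = 8) -> dict:
    """Connect each node to its m nearest neighbors (a crude proximity graph)."""
    d = ((vectors[:, None] - vectors[None]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return {i: list(np.argsort(d[i])[:m]) for i in range(len(vectors))}

def greedy_search(vectors: np.ndarray, graph: dict, query: np.ndarray, entry: int = 0):
    """Greedy descent: repeatedly move to the neighbor closest to the query."""
    current = entry
    while True:
        best = min(graph[current] + [current],
                   key=lambda i: ((vectors[i] - query) ** 2).sum())
        if best == current:          # local optimum: no neighbor is closer
            return current
        current = best

rng = np.random.default_rng(2)
points = rng.normal(size=(200, 2))
graph = build_knn_graph(points, m=8)
query = rng.normal(size=2)
found = greedy_search(points, graph, query, entry=0)
```

On a flat graph like this the descent can terminate at a local optimum that is not the true nearest neighbor; HNSW's long-range upper-layer links are what make that failure mode rare.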

Vector database systems

System   | Type                 | Strengths                 | Use case
Pinecone | Managed SaaS         | Zero ops, auto-scaling    | Production RAG, fast onboarding
Weaviate | Open-source / hosted | Rich filtering + vector   | Hybrid search, multi-modal
Qdrant   | Open-source / hosted | Fast, Rust-based          | High-throughput serving
pgvector | Postgres extension   | SQL + vectors in one DB   | Small-medium scale, existing Postgres
FAISS    | Library (not a DB)   | Maximum control and speed | Research, offline batch retrieval

Pinecone, Weaviate, and Qdrant are the most common production choices. pgvector is underrated for applications that already use Postgres and don't need million-scale retrieval.

Embedding models and index dimensions

The embedding model determines the vector dimensions and the semantic space. Common choices:

Model                           | Dimensions    | Best for
text-embedding-3-small (OpenAI) | 1536          | General text, low cost
text-embedding-3-large (OpenAI) | 3072          | Higher quality retrieval
BGE-M3 (BAAI)                   | 1024          | Open-source, multilingual
ColBERT                         | 128 per token | Late interaction, high accuracy

Higher dimensions = better semantic representation, but higher memory cost and slower search. Dimension reduction (PCA, Matryoshka embeddings) can reduce dimensions while preserving most retrieval quality.
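For models trained with the Matryoshka objective (OpenAI's text-embedding-3 family exposes this through the API's dimensions parameter), reduction is as simple as truncating the vector and renormalizing. A sketch with a random stand-in vector:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style reduction: keep the first `dims` components and
    renormalize so cosine similarity remains meaningful at the smaller size."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)            # stand-in for a 3072-d embedding
small = truncate_embedding(full, 256)   # 12x less memory per vector
```

This only works for models trained to front-load information in the early dimensions; for other models, use PCA fitted on a sample of your corpus instead.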

Design Tradeoffs

Where Your Intuition Breaks

Vector similarity is not the same as relevance. Two vectors can be close in embedding space because they share topic words ("heart disease treatment guidelines" and "heart disease risk factors" are semantically similar) while having very different relevance to a specific query. ANN retrieves vectors that are geometrically near the query embedding; whether those documents answer the query depends on the specific question. This is why re-rankers are standard in production RAG: retrieval gives you a candidate set of semantically related documents, and the re-ranker uses the full text of both query and document to score actual relevance. Removing the re-ranker and trusting ANN recall directly produces noticeably worse answers, because the embedding model is encoding a broader notion of similarity than the user's actual information need.

Hybrid search: vector + keyword

Pure vector search misses exact keyword matches. A search for "GPT-4o" (a specific product name) fails if the embedding collapses it to a similar but different vector. Hybrid search combines:

  • Dense retrieval: vector similarity (semantic meaning)
  • Sparse retrieval: BM25 / TF-IDF (keyword overlap)

Most production RAG systems use hybrid search with a re-ranker (e.g., Cohere Rerank, cross-encoder) on the combined candidate list. The re-ranker sees query + document text together and scores relevance more accurately than either retrieval method alone.
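One common way to merge the dense and sparse candidate lists without tuning score scales is Reciprocal Rank Fusion (RRF). A minimal sketch, with hypothetical doc ids standing in for real results:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc3", "doc1", "doc7"]   # ranked output of vector search
sparse = ["doc1", "doc9", "doc3"]   # ranked output of BM25
fused = rrf_fuse([dense, sparse])   # doc1 wins: ranked high in both lists
```

RRF uses only ranks, never raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales. The fused list is then handed to the re-ranker.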

Filtering and metadata

Vector search alone isn't enough for applications that need to filter by metadata (date, user, category, price range). Options:

Post-filter: retrieve top-k vectors, then apply metadata filters. Simple but wastes retrieved results when filters are selective.

Pre-filter: filter the eligible vector set first, then search within it. Correct but requires the index to support efficient metadata-based partitioning.

Payload-indexed filters: Qdrant, Weaviate, and Pinecone all support indexed metadata filters that execute with the vector search in one pass. This is the right approach for production.

Chunking strategy for documents

Documents must be split into chunks before embedding. Chunking strategy significantly affects retrieval quality:

  • Fixed-size chunks (512 tokens with 50-token overlap): simple, consistent. Breaks semantic units arbitrarily.
  • Sentence-level chunks: preserve complete thoughts. Good for prose.
  • Semantic chunking: use embedding similarity to detect topic shifts and split there. Best quality, most complex.
  • Hierarchical chunking: embed both document-level and chunk-level. Retrieve at chunk level, return document-level context. Used in systems like Raptor.

Chunk size is a tradeoff: larger chunks have more context but lower precision; smaller chunks have higher precision but miss cross-sentence context.
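The fixed-size strategy with overlap is a few lines; here token lists stand in for a real tokenizer, and the default sizes mirror the values above:

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Sliding windows of `size` tokens; consecutive chunks share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_fixed(tokens)   # 3 chunks: [0:512], [462:974], [924:1000]
```

The overlap is what limits the damage when a window boundary splits a sentence: the broken thought appears whole in the neighboring chunk.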

In Practice

Indexing documents with OpenAI + Qdrant

python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
 
client   = OpenAI()
qdrant   = QdrantClient(url="http://localhost:6333")
 
# Create collection
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
 
# Embed and index documents
def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [r.embedding for r in response.data]
 
# docs: an iterable of (text, source) pairs prepared upstream
chunks = [{"id": i, "text": text, "source": source} for i, (text, source) in enumerate(docs)]
embeddings = embed([c["text"] for c in chunks])
 
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=c["id"], vector=e, payload={"text": c["text"], "source": c["source"]})
        for c, e in zip(chunks, embeddings)
    ],
)

Querying with metadata filter

python
from qdrant_client.models import Filter, FieldCondition, MatchValue
 
def search(query: str, source_filter: str | None = None, top_k: int = 5):
    query_vec = embed([query])[0]
    f = Filter(
        must=[FieldCondition(key="source", match=MatchValue(value=source_filter))]
    ) if source_filter else None
 
    results = qdrant.search(
        collection_name="docs",
        query_vector=query_vec,
        query_filter=f,
        limit=top_k,
        with_payload=True,
    )
    return [r.payload["text"] for r in results]

Evaluating retrieval quality

Don't just ship a vector index — evaluate it. Key metrics:

  • Recall@k: of the relevant documents for each query, what fraction appear in the top k results?
  • MRR (Mean Reciprocal Rank): average position of the first relevant document
  • NDCG@k: normalized discounted cumulative gain — weights top-ranked results more heavily

Build a small evaluation set of (query, expected documents) pairs and compute these metrics before deploying. Re-evaluate when embedding model, chunking strategy, or index parameters change.
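Recall@k and MRR are each a few lines. A sketch over hypothetical doc ids, matching the metric definitions above:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# Query 1: first relevant doc at rank 2; query 2: no relevant doc retrieved.
runs = [["a", "b", "c"], ["x", "y", "z"]]
truth = [{"b"}, {"q"}]
```

Even 50–100 labeled (query, expected documents) pairs are enough to catch regressions when you change the embedding model or chunking.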

Production Patterns

Avoiding index drift

When source documents are updated, the index must be updated. Options:

  • Full re-index: delete and rebuild from scratch. Consistent but expensive for large corpora. Run nightly.
  • Incremental update: upsert changed documents, delete removed ones. Fast but requires reliable change detection (timestamps, CDC from source DB).
  • Versioned index with atomic swap: build the new index in parallel, swap the serving pointer when ready. Zero downtime, higher storage cost.
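The versioned-index pattern reduces to flipping a serving pointer in one step; vector databases typically expose the same idea as collection aliases (Qdrant, for example, supports aliasing a serving name to a new collection). A minimal in-process sketch:

```python
import threading

class IndexRegistry:
    """Route queries through a pointer that can be swapped atomically."""
    def __init__(self, index):
        self._lock = threading.Lock()
        self._index = index

    def current(self):
        with self._lock:
            return self._index

    def swap(self, new_index):
        # The new index is built fully offline; serving flips in one step.
        with self._lock:
            old, self._index = self._index, new_index
        return old   # caller retires the old index once in-flight queries drain

registry = IndexRegistry("docs_v1")
retired = registry.swap("docs_v2")   # zero-downtime cutover
```

The storage cost is the price of correctness: both index versions exist simultaneously during the build, but no query ever sees a half-built index.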

Handling embedding model upgrades

When you upgrade the embedding model (e.g., text-embedding-3-small → text-embedding-3-large), all vectors become incompatible — the new model uses a different vector space. Required steps:

  1. Re-embed all documents with the new model
  2. Build a new index
  3. Atomic swap to new index
  4. Validate retrieval quality before declaring success

Never mix embeddings from different models in the same collection — they are not comparable.

Monitoring vector database health

Key metrics to watch in production:

  • Query latency (p50, p95, p99): degrades under high QPS or as the index grows
  • Index size growth: as documents are added, index memory and disk usage grow
  • Recall on a fixed query set: track whether retrieval quality degrades over time (indicates index drift or embedding issues)
  • Cache hit rate: for repeated queries, semantic caching (cache results for queries with similar embeddings) reduces load and latency
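A semantic cache can be sketched as a linear scan over stored query embeddings with a similarity threshold; a production version would use the vector index itself for the cache lookup, but the logic is the same:

```python
import numpy as np

class SemanticCache:
    """Cache keyed by query embedding: a hit is any stored query whose
    cosine similarity to the new query meets the threshold."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._keys: list[np.ndarray] = []    # normalized query embeddings
        self._values: list[object] = []      # cached result lists

    def get(self, query_vec: np.ndarray):
        q = query_vec / np.linalg.norm(query_vec)
        for key, value in zip(self._keys, self._values):
            if float(key @ q) >= self.threshold:
                return value
        return None

    def put(self, query_vec: np.ndarray, result) -> None:
        self._keys.append(query_vec / np.linalg.norm(query_vec))
        self._values.append(result)

cache = SemanticCache(threshold=0.95)
cache.put(np.array([1.0, 0.0]), ["doc-billing", "doc-pricing"])
hit = cache.get(np.array([0.99, 0.05]))    # near-duplicate query
miss = cache.get(np.array([0.0, 1.0]))     # unrelated query
```

The threshold is the knob: too low and users get stale answers for genuinely different questions; too high and the cache never hits.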
