
Embeddings

Words are not numbers — but neural networks only understand numbers. Embeddings solve this by mapping discrete tokens into continuous vector spaces where semantic relationships become geometric. "King" and "Queen" are close. "Dog" and "Cat" are close. "King" - "Man" + "Woman" lands near "Queen". This geometric structure emerges purely from learning to predict words in context.

Theory

Figure: Word Embedding Space (2D projection) — word vectors plotted along two principal components (PC1, PC2), clustered into Royalty, Gender, Animals, and ML/Tech groups, with an analogy overlay showing king − man + woman ≈ queen.

An embedding is a coordinate in a space where meaning determines geometry. The diagram above shows word vectors clustered by concept — animals near animals, royalty near royalty. The directions between clusters encode relationships: "king" minus "man" plus "woman" moves in the direction of "queen" because gender is a consistent direction in this space. The model learns these coordinates not by being taught relationships explicitly, but by observing which words appear near each other in text.
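Concretely, an embedding layer is just a lookup table: a matrix with one row per vocabulary entry. A minimal sketch with a toy four-word vocabulary and random (untrained) vectors — all names and sizes here are illustrative:

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table.
# In a real model these vectors are learned; here they are placeholders.
vocab = {"king": 0, "queen": 1, "dog": 2, "cat": 3}
d = 4                                          # embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, size=(len(vocab), d))   # |V| x d lookup table

# "Embedding a word" is a row lookup, nothing more:
v_king = E[vocab["king"]]
print(v_king.shape)  # (4,)
```

Training adjusts the rows of `E` so that words used in similar contexts end up with nearby rows.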

Word2Vec: Skip-Gram Objective

Word2Vec (Mikolov et al., 2013) trains a shallow network to predict context words from a center word. Given a center word $w_t$ and a context window of size $c$, the skip-gram objective maximizes:

$$\mathcal{J}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t; \theta)$$

The conditional probability using dot-product scoring:

$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{|V|} \exp(u_w^\top v_c)}$$

where $v_c \in \mathbb{R}^d$ is the center word embedding and $u_o \in \mathbb{R}^d$ is the context word embedding. There are two separate embedding matrices: $V$ (input embeddings) and $U$ (output embeddings), both of shape $|\text{Vocab}| \times d$.

Dense embeddings outperform one-hot encodings not only because they are smaller, but because they enable generalization. A model that has learned about "cat" will transfer some of that knowledge to "kitten" because both occupy a similar neighborhood in embedding space — similarity in the space reflects similarity in usage patterns. A one-hot representation treats every word as equally dissimilar to every other; an embedding treats similar words as nearby.

The denominator $\sum_{w=1}^{|V|} \exp(u_w^\top v_c)$ requires summing over the entire vocabulary — $O(|V|)$ per update. For $|V| = 50{,}000$ and $d = 300$, this is computationally prohibitive.
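To make the cost concrete, here is a sketch of the full-softmax $p(o \mid c)$ with toy, randomly initialized matrices (the names `V_in` and `U_out` are illustrative): every single probability evaluation scores all $|V|$ context vectors.

```python
import numpy as np

# Full-softmax skip-gram probability with separate input/output matrices.
# Toy sizes; vectors are random stand-ins for learned embeddings.
rng = np.random.default_rng(42)
vocab_size, d = 10, 8
V_in  = rng.normal(0, 0.1, (vocab_size, d))   # center-word embeddings v_c
U_out = rng.normal(0, 0.1, (vocab_size, d))   # context-word embeddings u_o

def skipgram_prob(center_id, context_id):
    scores = U_out @ V_in[center_id]          # u_w^T v_c for ALL w: O(|V|)
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

p = skipgram_prob(3, 7)
```

With a real 50K vocabulary, that `U_out @ v_c` matrix-vector product is exactly the bottleneck NCE removes.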

Noise Contrastive Estimation (NCE)

NCE replaces the full softmax with a binary classification problem: is this (center, context) pair real or noise-sampled?

$$\mathcal{J}_{\text{NCE}} = \log \sigma(u_o^\top v_c) + \sum_{k=1}^{K} \mathbb{E}_{\tilde{w}_k \sim P_n} \left[\log \sigma(-u_{\tilde{w}_k}^\top v_c)\right]$$

where $P_n$ is a noise distribution (typically the unigram distribution raised to the 3/4 power: $P_n(w) \propto f(w)^{3/4}$), and $K$ is the number of negative samples (typically 5-20).

This reduces per-update cost from $O(|V|)$ to $O(K)$ — a 2,500x speedup for $|V| = 50{,}000$ and $K = 20$.
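A NumPy sketch of the sampled objective, using toy random embeddings and word counts (all names and sizes are illustrative): one positive pair plus $K$ noise words, so each update touches only $K+1$ output rows instead of all $|V|$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, d, K = 1000, 50, 10
V_in  = rng.normal(0, 0.1, (vocab_size, d))   # center embeddings
U_out = rng.normal(0, 0.1, (vocab_size, d))   # context embeddings

freq = rng.integers(1, 1000, vocab_size).astype(float)  # toy word counts
P_n = freq ** 0.75                                       # unigram^(3/4)
P_n /= P_n.sum()                                         # noise distribution

def nce_loss(center_id, context_id):
    v_c = V_in[center_id]
    pos = np.log(sigmoid(U_out[context_id] @ v_c))       # real pair: push up
    neg_ids = rng.choice(vocab_size, size=K, p=P_n)      # K noise samples
    neg = np.log(sigmoid(-(U_out[neg_ids] @ v_c))).sum() # noise: push down
    return -(pos + neg)   # negate to minimize; O(K) per update, not O(|V|)

loss = nce_loss(5, 17)
```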

💡Intuition

NCE is asking: "Does this context word fit here, or is it just a random word from the corpus?" The model learns embeddings that make real pairs produce high dot products and random (noise) pairs produce low ones. It is equivalent to learning to discriminate signal from noise — hence the name.

Cosine Similarity

Once embeddings are trained, semantic similarity is measured with cosine similarity:

$$\cos(u, v) = \frac{u \cdot v}{\|u\| \|v\|} \in [-1, 1]$$

Cosine similarity is preferred over Euclidean distance because embeddings can have different magnitudes (frequent words tend to have larger norms) while similar words point in the same direction.

python
import numpy as np
 
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
 
# Nearest neighbors for "king" (GloVe 100d):
# cosine_sim(king, queen)   = 0.7839
# cosine_sim(king, monarch) = 0.7621
# cosine_sim(king, prince)  = 0.7398
# cosine_sim(king, banana)  = 0.1823

Analogy Tasks: Vector Arithmetic

The famous result: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$

$$v_{\text{result}} = v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}$$

Find the word whose embedding is closest to vresultv_{\text{result}} (excluding the input words):

python
def analogy(embeddings, vocab, a, b, c, topk=5):
    # a is to b as c is to ?
    # answer = b - a + c
    result = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    sims = np.array([cosine_sim(result, embeddings[i]) for i in range(len(vocab))])
    # Exclude input words
    for word in [a, b, c]:
        sims[vocab[word]] = -1
    top_idx = sims.argsort()[-topk:][::-1]
    return [(list(vocab.keys())[i], round(sims[i], 4)) for i in top_idx]
 
# Results with GloVe 100d trained on Wikipedia:
# analogy("king", "man", "woman"):   [("queen", 0.7765), ("princess", 0.7123)]
# analogy("paris", "france", "italy"):  [("rome", 0.8921), ("milan", 0.7634)]
# analogy("walking", "walk", "swim"):   [("swimming", 0.8712), ("swam", 0.8234)]

Subword Tokenization: BPE Algorithm

Word-level tokenization has two problems: large vocabularies (>100K types) and unknown words (misspellings, rare words, new terms). Byte Pair Encoding (BPE) solves both by building a vocabulary of subword units.

BPE Algorithm:

  1. Start with character-level vocabulary: {h, e, l, o, w, r, d, ...}
  2. Count all adjacent pair frequencies in the corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until vocabulary size $= V_{\text{target}}$
# Iteration 1: count all adjacent pair frequencies in the corpus
# Pairs: {(h,e): 5421, (e,r): 8934, (l,l): 3211, ...}
# Most frequent pair is ("e", "r") -> merge into new token "er"

# Iteration 2: recount pairs; most frequent is now ("er", "e") -> "ere"
# ... after 32,000 merges, the vocabulary contains ~32,000 subwords
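The merge loop can be sketched in a few lines. This toy implementation (corpus and merge count chosen purely for illustration) follows steps 1-4 above: count adjacent pairs, merge the winner, repeat.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with a frequency.
corpus = Counter({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
                  ("n", "e", "w", "e", "s", "t"): 6,
                  ("w", "i", "d", "e", "s", "t"): 3})

def most_frequent_pair(corpus):
    # Step 2: count all adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Step 3: replace every occurrence of the pair with the merged symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return Counter(merged)

for _ in range(3):                # Step 4: repeat (3 merges for illustration)
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged:", pair)
```

On this corpus the first merges produce "es", then "est", then "lo" — frequent substrings become single tokens first, which is exactly why common words end up as one token.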

Example tokenization with 32K BPE vocabulary:

"unbelievable"  -> ["un", "believ", "able"]         (3 tokens)
"GPT-4"         -> ["G", "PT", "-", "4"]            (4 tokens)  
"tokenization"  -> ["token", "ization"]              (2 tokens)
"the"           -> ["the"]                           (1 token, very frequent)
"xkyzqfp"      -> ["x", "ky", "z", "q", "f", "p"]  (6 chars, unknown)
ℹ️Note

BPE is used by GPT-2, GPT-3, and GPT-4 (tiktoken). SentencePiece (Google) uses a similar approach but works directly on raw text without pre-tokenization. The choice of vocabulary size (typically 32K-100K) trades off between token efficiency and model size. A larger vocabulary means fewer tokens per sentence but a larger embedding matrix.

Walkthrough

Training Word2Vec on Text8

The text8 dataset is the first 100MB of Wikipedia (after preprocessing), containing ~17 million tokens.

python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
 
# Train skip-gram Word2Vec
model = Word2Vec(
    sentences=LineSentence("text8"),
    vector_size=100,    # embedding dimension
    window=5,           # context window size
    min_count=5,        # ignore words appearing < 5 times
    sg=1,               # skip-gram (sg=0 is CBOW)
    negative=10,        # number of negative samples (NCE)
    epochs=5,
    workers=4,
    alpha=0.025,        # initial learning rate
    min_alpha=0.0001,   # final learning rate
)
model.save("word2vec_text8.model")
print(f"Vocabulary size: {len(model.wv):,}")  # ~71,290 words
print(f"Training time: ~8 minutes on 4 CPU cores")

Nearest Neighbors for "king"

python
model = Word2Vec.load("word2vec_text8.model")
 
# Most similar words to "king"
print(model.wv.most_similar("king", topn=10))
# [("queen", 0.7965),
#  ("prince", 0.7823),
#  ("emperor", 0.7712),
#  ("monarch", 0.7634),
#  ("throne", 0.7521),
#  ("kingdom", 0.7498),
#  ("duke", 0.7356),
#  ("princess", 0.7234),
#  ("reign", 0.7189),
#  ("crowned", 0.7012)]

Analogy Evaluation

Word2Vec is typically evaluated on the Google analogy dataset: 19,544 analogies across semantic and syntactic categories.

python
# Semantic analogies (what the model conceptually learns)
print(model.wv.most_similar(
    positive=["woman", "king"], negative=["man"], topn=3))
# [("queen", 0.7765), ("princess", 0.7123), ("empress", 0.6891)]
 
print(model.wv.most_similar(
    positive=["rome", "germany"], negative=["berlin"], topn=3))
# [("italy", 0.8312), ("austria", 0.7891), ("france", 0.7543)]
 
# Syntactic analogies (word form relationships)
print(model.wv.most_similar(
    positive=["swimming", "walked"], negative=["walking"], topn=3))
# [("swam", 0.8234), ("ran", 0.7891), ("jogged", 0.7123)]
 
# Benchmark: accuracy on Google analogy dataset
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
# Semantic: 68.4%, Syntactic: 71.2%, Overall: 69.8%
# (GloVe 300d: ~80%, FastText: ~79%)

Visualizing the Embedding Space

t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces 100-dimensional embeddings to 2D for visualization:

python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
 
# Select words of interest
words = ["king", "queen", "prince", "princess", "man", "woman", "boy", "girl",
         "france", "paris", "germany", "berlin", "italy", "rome", "spain", "madrid"]
 
vectors = np.array([model.wv[w] for w in words])
 
# Fit t-SNE
tsne = TSNE(n_components=2, perplexity=5, random_state=42, n_iter=2000)
coords = tsne.fit_transform(vectors)
 
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
for i, word in enumerate(words):
    ax.scatter(coords[i, 0], coords[i, 1], s=100)
    ax.annotate(word, (coords[i, 0] + 0.2, coords[i, 1]))
ax.set_title("Word2Vec embeddings (t-SNE, text8 corpus)")
plt.savefig("embedding_tsne.png", dpi=150)

Expected clustering: royalty words cluster together, capital-country pairs cluster as parallel groups (Paris-France offset matches Berlin-Germany offset), and gender pairs (man-woman, king-queen) show consistent directional relationships.

💡Intuition

The reason "king - man + woman = queen" works geometrically is that the embedding space learns to encode attributes as directions. The "royalty" direction is orthogonal to the "gender" direction. So moving from "man" to "woman" (the gender direction) and applying it to "king" moves you to the royal-female position: "queen."

Figure: Next Token Distribution — softmax(logits) → probability distribution over the vocabulary, e.g. sat 34%, is 21%, was 15%, jumped 12%, ran 9%, meowed 5%, other 4%.

Analysis & Evaluation

Where Your Intuition Breaks

"Cosine similarity is the right way to compare embeddings." For word2vec-style embeddings trained with a dot-product objective, yes. But embeddings trained with other objectives (contrastive loss, cross-entropy over a vocabulary) have different geometric properties. Always use the similarity function the model was trained with: using Euclidean distance on a model trained for cosine similarity, or vice versa, will produce wrong nearest neighbors. The metric is part of the model, not a post-hoc choice.

Embedding Space Geometry

The key geometric properties of well-trained embeddings:

Linear structure: Analogical relationships correspond to vector translations. The direction $v_{\text{plural}} = v_{\text{cats}} - v_{\text{cat}}$ is consistent across many nouns: applying it to "dog" gives approximately "dogs."

Clustering: Semantically similar words cluster tightly. Cosine distances within clusters:

  • Royalty cluster: mean cosine similarity = 0.71
  • Country/capital cluster: mean cosine similarity = 0.73
  • Color cluster: mean cosine similarity = 0.68
  • Random pairs: mean cosine similarity = 0.12

Polysemy problem: Word2Vec assigns a single vector per word type, regardless of meaning. "Bank" (financial institution) and "bank" (river bank) share one embedding — a weighted average of both senses. This is Word2Vec's primary limitation, addressed by contextual embeddings such as ELMo and BERT.

Dimension Reduction: t-SNE vs PCA

| Method | Preserves | Speed | Use Case |
|---|---|---|---|
| PCA | Global variance | Fast (seconds) | Initial exploration |
| t-SNE | Local neighborhoods | Slow (minutes) | Cluster visualization |
| UMAP | Both local and global | Medium | Publication-quality plots |

t-SNE hyperparameters matter significantly:

  • perplexity: roughly the expected number of neighbors (5-50). Too low creates isolated clusters; too high makes everything merge.
  • n_iter: 1,000 is minimum; 2,000-5,000 for better convergence.
  • Different random seeds produce different layouts — the 2D structure is not unique.
⚠️Warning

t-SNE distances are not meaningful — only cluster membership is interpretable. Two clusters that appear close in a t-SNE plot may be far apart in the original 100-dimensional space. Never use t-SNE coordinates as features or measure distances between clusters across different t-SNE runs.

Embedding Dimension Trade-offs

| Embedding Dim | Analogy Accuracy | Model Size (50K vocab) | Recommended For |
|---|---|---|---|
| 50 | 58.2% | 10 MB | Tiny models, mobile |
| 100 | 69.8% | 20 MB | Small tasks |
| 300 | 78.4% | 60 MB | General NLP (GloVe default) |
| 768 | 85.1%* | 150 MB | BERT-scale (contextual) |

*BERT uses contextual embeddings, not static. Comparison is approximate.

Rule of thumb: embedding dimension around $\sqrt[4]{|V| \cdot \text{data size}}$. For a 50K vocabulary and 100M tokens this gives $\sqrt[4]{5 \times 10^{12}} \approx 1500$, which overshoots what is used in practice; empirically, 100-300 dimensions capture most of the gain, and 300 is the GloVe standard.
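The model sizes in the table follow directly from the parameter count; a quick sanity check, assuming float32 storage (4 bytes per parameter):

```python
# Embedding-matrix memory: |V| x d parameters at 4 bytes each (float32),
# for the 50K vocabulary used in the table.
vocab_size = 50_000
sizes_mb = {d: vocab_size * d * 4 / 1e6 for d in (50, 100, 300, 768)}
print(sizes_mb)  # {50: 10.0, 100: 20.0, 300: 60.0, 768: 153.6}
```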

Pre-trained Embeddings vs Training from Scratch

When to use pre-trained embeddings (GloVe, FastText):

  • Small dataset (< 100K examples)
  • Limited compute
  • Words in vocabulary are standard English

When to train from scratch:

  • Large domain-specific corpus (medical, legal, code)
  • Special tokenization (BPE, character-level)
  • End-to-end model can learn task-specific geometry

When to use contextual embeddings (BERT, RoBERTa):

  • Accuracy matters most
  • Sufficient inference budget
  • Words have multiple meanings important to the task
python
# Loading pre-trained GloVe embeddings
import numpy as np
import torch
import torch.nn as nn
 
def load_glove(path, vocab, dim=100):
    embeddings = np.random.normal(0, 0.01, (len(vocab), dim))
    embeddings[0] = 0  # PAD token -> zero vector
    found = 0
    with open(path) as f:
        for line in f:
            parts = line.split()
            word  = parts[0]
            if word in vocab:
                embeddings[vocab[word]] = np.array(parts[1:], dtype=float)
                found += 1
    print(f"Found {found}/{len(vocab)} words in GloVe")
    return embeddings
 
# In PyTorch, initialize embedding layer with pre-trained weights:
glove = load_glove("glove.6B.100d.txt", word2idx, dim=100)
embedding_layer = nn.Embedding(len(word2idx), 100)
embedding_layer.weight.data = torch.from_numpy(glove).float()
# Optionally freeze: embedding_layer.weight.requires_grad = False
🚀Production

In production NLP, start with frozen pre-trained embeddings (GloVe or FastText) for fast prototyping. If accuracy on your downstream task is insufficient, fine-tune the embeddings (unfreeze after a few epochs of frozen training). For state-of-the-art results, use a fine-tuned BERT/RoBERTa model — the embedding layer is just the token embedding table, and the contextual representations from the full Transformer are orders of magnitude more powerful than static embeddings.
