Embeddings
Words are not numbers — but neural networks only understand numbers. Embeddings solve this by mapping discrete tokens into continuous vector spaces where semantic relationships become geometric. "King" and "Queen" are close. "Dog" and "Cat" are close. "King" - "Man" + "Woman" lands near "Queen". This geometric structure emerges purely from learning to predict words in context.
Theory
An embedding is a coordinate in a space where meaning determines geometry. The diagram above shows word vectors clustered by concept — animals near animals, royalty near royalty. The directions between clusters encode relationships: "king" minus "man" plus "woman" moves in the direction of "queen" because gender is a consistent direction in this space. The model learns these coordinates not by being taught relationships explicitly, but by observing which words appear near each other in text.
Word2Vec: Skip-Gram Objective
Word2Vec (Mikolov et al., 2013) trains a shallow network to predict context words from a center word. Given a center word $w_t$ and a context window of size $m$, the skip-gram objective maximizes:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t)$$

The conditional probability uses dot-product scoring:

$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{|V|} \exp(u_w^\top v_c)}$$

where $v_c$ is the center word embedding and $u_o$ is the context word embedding. There are two separate embedding matrices, $V_{\text{in}}$ (input embeddings) and $U_{\text{out}}$ (output embeddings), both of shape $|V| \times d$.
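As a sanity check, the softmax scoring above can be sketched in a few lines of numpy (toy sizes and random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                         # toy vocabulary size and dimension
V_in = rng.normal(0, 0.1, (V, d))    # input (center) embeddings
U_out = rng.normal(0, 0.1, (V, d))   # output (context) embeddings

def skipgram_probs(center_idx):
    """p(o | c): softmax over dot products of each u_o with v_c."""
    v_c = V_in[center_idx]
    logits = U_out @ v_c             # one score per vocabulary word
    logits -= logits.max()           # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

p = skipgram_probs(3)
print(p.shape, round(float(p.sum()), 6))  # (10,) 1.0
```

Note that computing `exp.sum()` touches every row of `U_out`, which is exactly the cost problem discussed next.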
Dense embeddings outperform one-hot encodings not only because they are smaller, but because they enable generalization. A model that has learned about "cat" will transfer some of that knowledge to "kitten" because both occupy a similar neighborhood in embedding space — similarity in the space reflects similarity in usage patterns. A one-hot representation treats every word as equally dissimilar to every other; an embedding treats similar words as nearby.
The denominator requires summing over the entire vocabulary: $O(|V|)$ dot products per update. For $|V| = 50{,}000$ and $d = 300$, this is computationally prohibitive.
Noise Contrastive Estimation (NCE)
NCE replaces the full softmax with a binary classification problem: is this (center, context) pair real or noise-sampled? Word2Vec uses the simplified negative-sampling variant, which maximizes:

$$\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i}^\top v_c) \right]$$

where $P_n(w)$ is a noise distribution (typically the unigram distribution raised to the 3/4 power: $P_n(w) \propto U(w)^{3/4}$), and $k$ is the number of negative samples (typically 5-20).
This reduces the per-update cost from $O(|V|)$ to $O(k+1)$: roughly a 2,500x speedup for $|V| = 50{,}000$ and $k \approx 20$.
NCE is asking: "Does this context word fit here, or is it just a random word from the corpus?" The model learns embeddings that make real pairs produce high dot products and random (noise) pairs produce low ones. It is equivalent to learning to discriminate signal from noise — hence the name.
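A minimal numpy sketch of this objective, with made-up corpus counts and random embeddings (illustrative only; gensim's optimized implementation is what you would use in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 50, 8, 5                    # toy vocab size, dimension, negatives
V_in = rng.normal(0, 0.1, (V, d))     # center-word embeddings
U_out = rng.normal(0, 0.1, (V, d))    # context-word embeddings

# Noise distribution: unigram frequencies raised to the 3/4 power
counts = rng.integers(1, 1000, V).astype(float)   # made-up word counts
P_n = counts ** 0.75
P_n /= P_n.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context):
    """The real pair should score high; k noise pairs should score low."""
    v_c = V_in[center]
    pos = np.log(sigmoid(U_out[context] @ v_c))        # real pair term
    noise = rng.choice(V, size=k, p=P_n)               # sample k negatives
    neg = np.log(sigmoid(-U_out[noise] @ v_c)).sum()   # noise pair terms
    return -(pos + neg)                                # loss to minimize

loss = neg_sampling_loss(3, 7)
print(loss)
```

Each update touches only $k+1$ rows of `U_out` instead of all $|V|$ rows, which is where the speedup comes from.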
Cosine Similarity
Once embeddings are trained, semantic similarity is measured with cosine similarity:

$$\text{sim}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$$
Cosine similarity is preferred over Euclidean distance because embeddings can have different magnitudes (frequent words tend to have larger norms) while similar words point in the same direction.
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
# Nearest neighbors for "king" (GloVe 100d):
# cosine_sim(king, queen) = 0.7839
# cosine_sim(king, monarch) = 0.7621
# cosine_sim(king, prince) = 0.7398
# cosine_sim(king, banana) = 0.1823

Analogy Tasks: Vector Arithmetic
The famous result:

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

To solve the analogy $a : b :: c : \,?$, find the word whose embedding is closest to $v_b - v_a + v_c$ (excluding the input words):
def analogy(embeddings, vocab, a, b, c, topk=5):
    # a is to b as c is to ?
    # answer = b - a + c
    result = embeddings[vocab[b]] - embeddings[vocab[a]] + embeddings[vocab[c]]
    sims = np.array([cosine_sim(result, embeddings[i]) for i in range(len(vocab))])
    # Exclude input words
    for word in [a, b, c]:
        sims[vocab[word]] = -1
    top_idx = sims.argsort()[-topk:][::-1]
    return [(list(vocab.keys())[i], round(sims[i], 4)) for i in top_idx]
# Results with GloVe 100d trained on Wikipedia:
# analogy("king", "man", "woman"): [("queen", 0.7765), ("princess", 0.7123)]
# analogy("paris", "france", "italy"): [("rome", 0.8921), ("milan", 0.7634)]
# analogy("walking", "walk", "swim"): [("swimming", 0.8712), ("swam", 0.8234)]

Subword Tokenization: BPE Algorithm
Word-level tokenization has two problems: large vocabularies (hundreds of thousands of word types) and unknown words (misspellings, rare words, new terms). Byte Pair Encoding solves both by building a vocabulary of subword units.
BPE Algorithm:
- Start with a character-level vocabulary: {h, e, l, o, w, r, d, ...}
- Count all adjacent pair frequencies in the corpus
- Merge the most frequent pair into a new token
- Repeat until the vocabulary reaches the target size
# Corpus (spaces mark current token boundaries): "he ll o wo rl d"
# Pairs: {(h,e):5421, (e,r):8934, (l,l):3211, ...}
# Iteration 1: most frequent pair is ("e", "r") -> "er"
# Merge (e,r) -> "er": adds "er" to the vocabulary
# Iteration 2: most frequent is now ("er", "e") -> "ere"
# ... after 32,000 merges, we have a vocabulary of ~32,000 subwords
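The merge loop above is short enough to implement directly. A minimal sketch on a hypothetical three-word corpus (the word frequencies are made up):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from {word: frequency} counts."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                     # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():               # apply the merge
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges({"hello": 5, "hell": 3, "help": 2}, num_merges=3)
print(merges)  # [('h', 'e'), ('he', 'l'), ('hel', 'l')]
```

Because "hell" is the most frequent shared prefix, its characters get merged first; production tokenizers run this same loop over corpus-scale counts.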
Example tokenization with 32K BPE vocabulary:
"unbelievable" -> ["un", "believ", "able"] (3 tokens)
"GPT-4" -> ["G", "PT", "-", "4"] (4 tokens)
"tokenization" -> ["token", "ization"] (2 tokens)
"the" -> ["the"] (1 token, very frequent)
"xkyzqfp" -> ["x", "ky", "z", "q", "f", "p"] (6 chars, unknown)
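Tokenizing a new word then just replays the learned merges in training order. A sketch using hypothetical merges from a toy corpus:

```python
def bpe_tokenize(word, merges):
    """Apply learned merges in training order (the standard BPE rule)."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("h", "e"), ("he", "l"), ("hel", "l")]
print(bpe_tokenize("hello", merges))  # ['hell', 'o']
print(bpe_tokenize("xyz", merges))    # ['x', 'y', 'z'] (falls back to characters)
```

This is why BPE never produces an unknown-word token: any string can always be expressed with the base character vocabulary.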
BPE is used by GPT-2, GPT-3, and GPT-4 (tiktoken). SentencePiece (Google) uses a similar approach but works directly on raw text without pre-tokenization. The choice of vocabulary size (typically 32K-100K) trades off between token efficiency and model size. A larger vocabulary means fewer tokens per sentence but a larger embedding matrix.
Walkthrough
Training Word2Vec on Text8
The text8 dataset is the first 100MB of Wikipedia (after preprocessing), containing ~17 million tokens.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
# Train skip-gram Word2Vec
model = Word2Vec(
    sentences=LineSentence("text8"),
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=5,       # ignore words appearing < 5 times
    sg=1,              # skip-gram (sg=0 is CBOW)
    negative=10,       # number of negative samples (NCE)
    epochs=5,
    workers=4,
    alpha=0.025,       # initial learning rate
    min_alpha=0.0001,  # final learning rate
)
model.save("word2vec_text8.model")
print(f"Vocabulary size: {len(model.wv):,}")  # ~71,290 words
# Training time: ~8 minutes on 4 CPU cores

Nearest Neighbors for "king"
model = Word2Vec.load("word2vec_text8.model")
# Most similar words to "king"
print(model.wv.most_similar("king", topn=10))
# [("queen", 0.7965),
# ("prince", 0.7823),
# ("emperor", 0.7712),
# ("monarch", 0.7634),
# ("throne", 0.7521),
# ("kingdom", 0.7498),
# ("duke", 0.7356),
# ("princess", 0.7234),
# ("reign", 0.7189),
# ("crowned", 0.7012)]

Analogy Evaluation
Word2Vec is typically evaluated on the Google analogy dataset: 19,544 analogies across semantic and syntactic categories.
# Semantic analogies (what the model conceptually learns)
print(model.wv.most_similar(
positive=["woman", "king"], negative=["man"], topn=3))
# [("queen", 0.7765), ("princess", 0.7123), ("empress", 0.6891)]
print(model.wv.most_similar(
positive=["rome", "germany"], negative=["berlin"], topn=3))
# [("italy", 0.8312), ("austria", 0.7891), ("france", 0.7543)]
# Syntactic analogies (word form relationships)
print(model.wv.most_similar(
positive=["swimming", "walked"], negative=["walking"], topn=3))
# [("swam", 0.8234), ("ran", 0.7891), ("jogged", 0.7123)]
# Benchmark: accuracy on Google analogy dataset
accuracy = model.wv.evaluate_word_analogies("questions-words.txt")
# Semantic: 68.4%, Syntactic: 71.2%, Overall: 69.8%
# (GloVe 300d: ~80%, FastText: ~79%)

Visualizing the Embedding Space
t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces 100-dimensional embeddings to 2D for visualization:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
# Select words of interest
words = ["king", "queen", "prince", "princess", "man", "woman", "boy", "girl",
"france", "paris", "germany", "berlin", "italy", "rome", "spain", "madrid"]
vectors = np.array([model.wv[w] for w in words])
# Fit t-SNE
tsne = TSNE(n_components=2, perplexity=5, random_state=42, n_iter=2000)
coords = tsne.fit_transform(vectors)
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
for i, word in enumerate(words):
    ax.scatter(coords[i, 0], coords[i, 1], s=100)
    ax.annotate(word, (coords[i, 0] + 0.2, coords[i, 1]))
ax.set_title("Word2Vec embeddings (t-SNE, text8 corpus)")
plt.savefig("embedding_tsne.png", dpi=150)

Expected clustering: royalty words cluster together, capital-country pairs cluster as parallel groups (the Paris-France offset matches the Berlin-Germany offset), and gender pairs (man-woman, king-queen) show consistent directional relationships.
The reason "king − man + woman ≈ queen" works geometrically is that the embedding space learns to encode attributes as directions. The "royalty" direction is roughly orthogonal to the "gender" direction, so moving from "man" to "woman" (the gender direction) and applying the same offset to "king" lands near the royal-female position: "queen."
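This can be checked mechanically. Below is a hand-built 2-D toy space in which one axis encodes royalty and the other gender (illustrative only; real embeddings are only approximately this clean):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
E = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

offset_common = E["woman"] - E["man"]    # gender direction, common pair
offset_royal = E["queen"] - E["king"]    # gender direction, royal pair
print(round(float(cosine(offset_common, offset_royal)), 3))  # 1.0

target = E["king"] - E["man"] + E["woman"]
print(np.allclose(target, E["queen"]))  # True in this toy space
```

In trained embeddings the two offsets are merely highly correlated rather than identical, which is why analogy retrieval uses nearest-neighbor search instead of exact equality.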
Analysis & Evaluation
Where Your Intuition Breaks
"Cosine similarity is the right way to compare embeddings." For word2vec-style embeddings trained with a dot-product objective, yes. But embeddings trained with other objectives (contrastive loss, cross-entropy over a vocabulary) have different geometric properties. Always use the similarity function that matches the training objective: comparing with Euclidean distance when the model was trained for angular similarity, or vice versa, will return wrong nearest neighbors. The metric is part of the model, not a post-hoc choice.
Embedding Space Geometry
The key geometric properties of well-trained embeddings:
Linear structure: Analogical relationships correspond to vector translations. The pluralization direction (for example, the offset cats − cat) is consistent across many nouns: adding it to "dog" gives approximately "dogs."
Clustering: Semantically similar words cluster tightly. Cosine distances within clusters:
- Royalty cluster: mean cosine similarity = 0.71
- Country/capital cluster: mean cosine similarity = 0.73
- Color cluster: mean cosine similarity = 0.68
- Random pairs: mean cosine similarity = 0.12
Polysemy problem: Word2Vec assigns a single vector per word type, regardless of meaning. "Bank" (financial institution) and "bank" (river bank) share one embedding — a weighted average of both senses. This is Word2Vec's primary limitation, addressed by contextual embeddings (Embeddings from Language Models (ELMo), Bidirectional Encoder Representations from Transformers (BERT)).
Dimension Reduction: t-SNE vs PCA
| Method | Preserves | Speed | Use Case |
|---|---|---|---|
| Principal Component Analysis (PCA) | Global variance | Fast (seconds) | Initial exploration |
| t-SNE | Local neighborhoods | Slow (minutes) | Cluster visualization |
| Uniform Manifold Approximation and Projection (UMAP) | Both local and global | Medium | Publication-quality plots |
t-SNE hyperparameters matter significantly:
- perplexity: roughly the expected number of neighbors (5-50). Too low creates isolated clusters; too high makes everything merge.
- n_iter: 1,000 is minimum; 2,000-5,000 for better convergence.
- Different random seeds produce different layouts — the 2D structure is not unique.
t-SNE distances are not meaningful — only cluster membership is interpretable. Two clusters that appear close in a t-SNE plot may be far apart in the original 100-dimensional space. Never use t-SNE coordinates as features or measure distances between clusters across different t-SNE runs.
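Because of t-SNE's cost and instability, a common workflow is to reduce to roughly 50 dimensions with PCA first and run t-SNE on the result. A minimal numpy PCA sketch (random data standing in for trained word vectors):

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal directions via SVD."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T                # (n_samples, n_components)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # stand-in for 200 word vectors, 100-d
X50 = pca(X, 50)                  # reduce to 50 dims, then run t-SNE on X50
print(X50.shape)                  # (200, 50)
```

The PCA step removes low-variance noise directions and cuts t-SNE's pairwise-distance cost, while the t-SNE step still handles the final 2-D cluster layout.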
Embedding Dimension Trade-offs
| Embedding Dim | Analogy Accuracy | Model Size (50K vocab) | Recommended For |
|---|---|---|---|
| 50 | 58.2% | 10MB | Tiny models, mobile |
| 100 | 69.8% | 20MB | Small tasks |
| 300 | 78.4% | 60MB | General NLP (GloVe default) |
| 768 | 85.1%* | 150MB | BERT-scale (contextual) |
*BERT uses contextual embeddings, not static. Comparison is approximate.
Rule of thumb: for static embeddings, returns diminish quickly beyond a few hundred dimensions. For a 50K vocabulary trained on ~100M tokens, around 300 dimensions is the sweet spot, hence the GloVe standard.
Pre-trained Embeddings vs Training from Scratch
When to use pre-trained embeddings (GloVe, FastText):
- Small dataset (< 100K examples)
- Limited compute
- Words in vocabulary are standard English
When to train from scratch:
- Large domain-specific corpus (medical, legal, code)
- Special tokenization (BPE, character-level)
- End-to-end model can learn task-specific geometry
When to use contextual embeddings (BERT, Robustly Optimized BERT Pretraining Approach (RoBERTa)):
- Accuracy matters most
- Sufficient inference budget
- Words have multiple meanings important to the task
# Loading pre-trained GloVe embeddings
import numpy as np
def load_glove(path, vocab, dim=100):
    embeddings = np.random.normal(0, 0.01, (len(vocab), dim))
    embeddings[0] = 0  # PAD token -> zero vector
    found = 0
    with open(path) as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            if word in vocab:
                embeddings[vocab[word]] = np.array(parts[1:], dtype=float)
                found += 1
    print(f"Found {found}/{len(vocab)} words in GloVe")
    return embeddings
# In PyTorch, initialize the embedding layer with pre-trained weights:
import torch
import torch.nn as nn

glove = load_glove("glove.6B.100d.txt", word2idx, dim=100)
embedding_layer = nn.Embedding(len(word2idx), 100)
embedding_layer.weight.data = torch.from_numpy(glove).float()
# Optionally freeze: embedding_layer.weight.requires_grad = False

In production NLP, start with frozen pre-trained embeddings (GloVe or FastText) for fast prototyping. If accuracy on your downstream task is insufficient, fine-tune the embeddings (unfreeze after a few epochs of frozen training). For state-of-the-art results, use a fine-tuned BERT/RoBERTa model: its embedding layer is just the token embedding table, and the contextual representations from the full Transformer are far more powerful than static embeddings.