Classical Text Classification
Text classification is the task of assigning predefined categories to text — sentiment analysis, spam detection, topic labeling. Before large language models, this was solved with a small set of interpretable models: Naive Bayes for speed and sparsity, logistic regression for calibrated probabilities, and SVMs for margin-based generalization. Understanding these baselines is essential because they still define the cost-accuracy frontier on small labeled datasets.
Theory
Naive Bayes
Naive Bayes applies Bayes' theorem with a strong conditional independence assumption: given the class label, each word is independent of every other word.
For a document x = (w₁, w₂, …, wₙ) and class c:

P(c | x) ∝ P(c) · ∏ᵢ₌₁ⁿ P(wᵢ | c)
In practice, predictions use log-probabilities to avoid underflow:

ĉ = argmax_c [ log P(c) + Σᵢ₌₁ⁿ log P(wᵢ | c) ]
Multinomial Naive Bayes models term frequencies; Bernoulli Naive Bayes models term presence/absence. Multinomial is preferred for longer documents where word frequency carries signal; Bernoulli works better for short texts where only presence matters.
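The two variants differ only in how the text is vectorized. A small sketch with a hypothetical four-document corpus (the data and labels are illustrative, not from the source):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Tiny hypothetical corpus; labels: 1 = positive, 0 = negative
docs = ["great great movie", "boring boring plot", "great fun", "dull plot"]
labels = [1, 0, 1, 0]

# Multinomial NB consumes raw term counts ("great" counted twice above)
vec_counts = CountVectorizer()
X_counts = vec_counts.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB consumes presence/absence: binary=True collapses counts to 0/1
vec_binary = CountVectorizer(binary=True)
X_binary = vec_binary.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

print(mnb.predict(vec_counts.transform(["great plot"])))
```

Note that Bernoulli NB also penalizes the *absence* of words, which is why it tends to suit short texts where each word's presence is the dominant signal.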
MLE probabilities are fragile — any word not seen in training for class c gets P(w | c) = 0, which zeros out the entire product. Laplace (add-α) smoothing fixes this:

P(w | c) = (count(w, c) + α) / (Σ_w′ count(w′, c) + α · |V|)

where |V| is the vocabulary size and α > 0 (α = 1 gives classic Laplace smoothing).
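A minimal sketch of why smoothing matters, using toy counts over a hypothetical four-word vocabulary:

```python
import numpy as np

# Toy counts: how often each word appears in "positive" training docs.
# "wonderful" was never seen in a positive doc, so MLE assigns it P = 0.
vocab = ["great", "boring", "plot", "wonderful"]
counts_pos = np.array([12, 1, 5, 0])

# MLE estimate: a zero count yields zero probability,
# which zeros out the product for any doc containing that word
p_mle = counts_pos / counts_pos.sum()

# Laplace (add-alpha) smoothing with alpha = 1
alpha = 1.0
p_smooth = (counts_pos + alpha) / (counts_pos.sum() + alpha * len(vocab))

print(p_mle[3])     # 0.0 -- fatal for any doc containing "wonderful"
print(p_smooth[3])  # small but nonzero (1/22 here)
```

The smoothed estimates still sum to 1 over the vocabulary; α simply pretends every word was seen α extra times per class.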
The conditional independence assumption is necessary because joint estimation of P(w₁, w₂, …, wₙ | c) is intractable — the number of distinct word sequences grows exponentially with document length (|V|ⁿ for a length-n document), making the parameter space larger than any realistic training set. The "naive" assumption collapses this to |V| parameters per class, which is estimable from thousands of documents.
Logistic Regression with TF-IDF Features
Logistic regression directly models P(c | x) without a generative assumption. Given a feature vector x (typically TF-IDF weights), the binary case is:

P(y = 1 | x) = σ(wᵀx + b) = 1 / (1 + e^(−(wᵀx + b)))
Parameters are fit by maximizing the log-likelihood with L2 regularization (ridge):

w* = argmax_w [ Σᵢ log P(yᵢ | xᵢ; w) − λ‖w‖² ]
TF-IDF (term frequency–inverse document frequency) weights features before passing to logistic regression:

tf-idf(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.
TF rewards words that appear often in a document; IDF discounts words that appear in many documents (common words like "the" carry little discriminative information). The log-scaling of IDF compresses the dynamic range so that a word appearing in 10% of documents is weighted much more than one appearing in 90%, but not astronomically more than one appearing in 5%.
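The compression is easy to check numerically. A quick sketch with a hypothetical 1,000-document corpus:

```python
import math

N = 1000  # hypothetical corpus size
# idf for a word appearing in 90%, 10%, and 5% of documents
idf = {share: math.log(N / (N * share)) for share in (0.90, 0.10, 0.05)}

for share in (0.90, 0.10, 0.05):
    print(f"word in {share:.0%} of docs -> idf = {idf[share]:.2f}")
```

The 10% word gets roughly 22x the weight of the 90% word, but the 5% word gets only about 1.3x the weight of the 10% word — exactly the compression of dynamic range described above.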
Support Vector Machine (Linear SVM)
A linear SVM finds the maximum-margin hyperplane separating two classes:

min_{w,b} ½‖w‖²  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 for all i
With a soft margin (C parameter), slack variables ξᵢ ≥ 0 allow misclassification:

min_{w,b,ξ} ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0
SVMs use the hinge loss: ℓ(y, f(x)) = max(0, 1 − y(wᵀx + b)). Unlike logistic regression, the SVM uses only the support vectors (points on or inside the margin) to define the decision boundary — making it robust in high-dimensional sparse feature spaces like bag-of-words representations.
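The three regimes of the hinge loss can be sketched directly (scores and labels below are illustrative):

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss for label y in {-1, +1} and raw score w.x + b."""
    return np.maximum(0.0, 1.0 - y * score)

# Correct and confidently outside the margin: zero loss.
# Such points are not support vectors and do not shape the boundary.
print(hinge_loss(+1, 2.5))   # 0.0
# Correct but inside the margin: small positive loss
print(hinge_loss(+1, 0.4))   # 0.6
# Misclassified: loss grows linearly with the violation
print(hinge_loss(+1, -1.0))  # 2.0
```

Points with zero loss contribute nothing to the gradient, which is exactly why only the margin-adjacent points matter.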
Walkthrough
Task: Binary sentiment classification on movie reviews. Labels: positive / negative.
Step 1 — Feature extraction:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# TF-IDF with unigrams and bigrams, top 10k features
vectorizer = TfidfVectorizer(
ngram_range=(1, 2), # unigrams + bigrams
max_features=10_000,
sublinear_tf=True, # replace tf with 1 + log(tf)
min_df=3, # ignore terms in < 3 documents
)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # fit only on train!

Step 2 — Train and compare three classifiers:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
models = {
'NaiveBayes': MultinomialNB(alpha=0.1),
'LogReg': LogisticRegression(C=1.0, max_iter=1000),
'LinearSVC': LinearSVC(C=1.0, max_iter=2000),
}
for name, clf in models.items():
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"\n{name}:")
print(classification_report(y_test, preds, target_names=['neg', 'pos']))

Typical results on IMDb (25k training docs):
| Model | Accuracy | F1 | Training time |
|---|---|---|---|
| Multinomial NB | ~86% | 0.86 | <1s |
| Logistic Regression | ~90% | 0.90 | ~5s |
| Linear SVC | ~91% | 0.91 | ~3s |
Step 3 — Inspect what the model learned:
import numpy as np
feature_names = vectorizer.get_feature_names_out()
# Top positive/negative features for logistic regression
coef = models['LogReg'].coef_[0]
top_pos = feature_names[np.argsort(coef)[-20:]][::-1]
top_neg = feature_names[np.argsort(coef)[:20]]
print("Top positive features:", top_pos)
print("Top negative features:", top_neg)

Step 4 — Handle class imbalance:
When classes are unbalanced (e.g., 90% negative in spam detection), accuracy is misleading. Use:
# class_weight='balanced' scales loss by inverse class frequency
LogisticRegression(class_weight='balanced', C=1.0)
# For evaluation, use macro-averaged F1, not accuracy
from sklearn.metrics import f1_score
f1 = f1_score(y_test, preds, average='macro')  # unweighted mean across classes

Analysis & Evaluation
Where Your Intuition Breaks
"More features always improve text classification." In fact, beyond a point, adding features hurts. In high-dimensional sparse spaces, the number of parameters grows with vocabulary size, and with limited training data, many parameters get noisy estimates. L2 regularization helps, but very large vocabulary models can still overfit to domain-specific n-grams in the training data that don't transfer to test. Feature selection (removing features below a minimum document frequency, limiting max_features) often improves test accuracy even though it reduces training accuracy. The dimensionality of the feature space is not free — it interacts with training set size.
Model Comparison
| Model | Calibration | Interpretability | Speed | Handles sparsity |
|---|---|---|---|---|
| Multinomial NB | Poor (overconfident) | High (feature weights = log-likelihood ratios) | Fastest | Excellent |
| Logistic Regression | Good (Platt calibration available) | High (feature coefficients direct) | Fast | Good |
| Linear SVC | None (no probability output) | Medium (signed margin) | Fast | Excellent |
| SGD Classifier | Good with log loss | High | Very fast (streaming) | Excellent |
When to use each:
- Naive Bayes when training data is tiny (<1k samples) or you need a very fast baseline
- Logistic Regression when you need calibrated probabilities or multi-class support
- Linear SVC when you have many classes and accuracy is the only objective
- All three as baselines before trying neural methods — if they achieve 90%+ F1, the neural overhead may not be justified
Multi-class Strategies
For K > 2 classes:
One-vs-Rest (OvR): Train K binary classifiers, predict the class with the highest score. scikit-learn's LinearSVC uses this by default; LogisticRegression defaults to the multinomial (softmax) formulation with the lbfgs solver.
Softmax (multinomial) logistic regression: Single model with K output weights; trained with cross-entropy loss. More parameter-efficient than OvR when K is large.
Hierarchical classification: When classes have natural hierarchy (e.g., topic taxonomy), train coarse-to-fine classifiers. Reduces error propagation at finer levels.
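The OvR and softmax strategies can be compared directly in scikit-learn. A sketch using synthetic data in place of TF-IDF features (the dataset and its dimensions are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class problem standing in for TF-IDF features
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)

# OvR: K independent binary classifiers, one per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 4 binary models

# Softmax: one model trained jointly with cross-entropy
# (the default for LogisticRegression with the lbfgs solver)
softmax = LogisticRegression(max_iter=1000).fit(X, y)
print(softmax.coef_.shape)   # (4, 50): one weight vector per class
```

Both expose one weight vector per class; the difference is the training objective — K independent binary fits versus a single joint cross-entropy fit whose probabilities sum to 1 by construction.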
Common Pitfalls
Vectorizer fit on test data. Always call fit_transform on training data and transform on test/val. Fitting on test leaks vocabulary statistics (IDF values) and inflates reported performance.
Ignoring text preprocessing. HTML tags, URLs, numbers, and punctuation can dominate TF-IDF features. Lowercase normalization, stop-word removal, and lemmatization typically improve both speed and accuracy.
Using accuracy on imbalanced datasets. A model predicting "not spam" for every email achieves 99% accuracy on a 1% spam dataset. Use macro F1 or precision-recall AUC for imbalanced tasks.
Not using sublinear_tf. Raw term frequency gives disproportionate weight to repeated words. sublinear_tf=True (replace tf with 1 + log(tf)) significantly improves most classifiers.
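The saturation effect of sublinear scaling is visible from a few values (the term frequencies below are illustrative):

```python
import math

# Raw tf grows linearly with repetition; 1 + log(tf) saturates,
# so a word repeated 100 times gets ~5.6x weight, not 100x
for tf in (1, 2, 10, 100):
    print(f"tf={tf:>3} -> sublinear tf = {1 + math.log(tf):.2f}")
```

A review that repeats "terrible" ten times is more negative than one that says it once — but not ten times more negative, which is the intuition sublinear scaling encodes.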