Classical Text Classification
Text classification is the task of assigning predefined categories to text — sentiment analysis, spam detection, topic labeling. Before large language models, this was solved with a small set of interpretable models: Naive Bayes for speed and sparsity, logistic regression for calibrated probabilities, and SVMs for margin-based generalization. Understanding these baselines is essential because they still define the cost-accuracy frontier on small labeled datasets.
Theory
Naive Bayes
Naive Bayes applies Bayes' theorem with a strong conditional independence assumption: given the class label, each word is independent of every other word.
For a document x = (w₁, w₂, …, wₙ) and class c:

P(c | x) ∝ P(c) · ∏ᵢ₌₁ⁿ P(wᵢ | c)
In practice, predictions use log-probabilities to avoid underflow:

ĉ = argmax_c [ log P(c) + Σᵢ₌₁ⁿ log P(wᵢ | c) ]
Multinomial Naive Bayes models term frequencies; Bernoulli Naive Bayes models term presence/absence. Multinomial is preferred for longer documents where word frequency carries signal; Bernoulli works better for short texts where only presence matters.
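The two variants differ only in how the text is vectorized. A small sketch with a hypothetical four-document corpus (the data and labels are illustrative, not from the source):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Tiny hypothetical corpus; labels: 1 = positive, 0 = negative
docs = ["great great movie", "boring boring plot", "great fun", "dull plot"]
labels = [1, 0, 1, 0]

# Multinomial NB consumes raw term counts ("great" counted twice above)
vec_counts = CountVectorizer()
X_counts = vec_counts.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB consumes presence/absence: binary=True collapses counts to 0/1
vec_binary = CountVectorizer(binary=True)
X_binary = vec_binary.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

print(mnb.predict(vec_counts.transform(["great plot"])))
```

Note that Bernoulli NB also penalizes the *absence* of words, which is why it tends to suit short texts where each word's presence is the dominant signal.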
MLE probabilities are fragile — any word not seen in training for class c gets P(w | c) = 0, which zeros out the entire product. Laplace (add-α) smoothing fixes this:

P(w | c) = (count(w, c) + α) / (Σ_w′ count(w′, c) + α · |V|)

where |V| is the vocabulary size and α > 0 (α = 1 gives classic Laplace smoothing).
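A minimal sketch of why smoothing matters, using toy counts over a hypothetical four-word vocabulary:

```python
import numpy as np

# Toy counts: how often each word appears in "positive" training docs.
# "wonderful" was never seen in a positive doc, so MLE assigns it P = 0.
vocab = ["great", "boring", "plot", "wonderful"]
counts_pos = np.array([12, 1, 5, 0])

# MLE estimate: a zero count yields zero probability,
# which zeros out the product for any doc containing that word
p_mle = counts_pos / counts_pos.sum()

# Laplace (add-alpha) smoothing with alpha = 1
alpha = 1.0
p_smooth = (counts_pos + alpha) / (counts_pos.sum() + alpha * len(vocab))

print(p_mle[3])     # 0.0 -- fatal for any doc containing "wonderful"
print(p_smooth[3])  # small but nonzero (1/22 here)
```

The smoothed estimates still sum to 1 over the vocabulary; α simply pretends every word was seen α extra times per class.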
The conditional independence assumption is necessary because joint estimation of P(w₁, w₂, …, wₙ | c) is intractable — the number of distinct word sequences grows exponentially with document length (|V|ⁿ for a length-n document), making the parameter space larger than any realistic training set. The "naive" assumption collapses this to |V| parameters per class, which is estimable from thousands of documents.
Logistic Regression with TF-IDF Features
Logistic regression directly models P(c | x) without a generative assumption. Given a feature vector x (typically TF-IDF weights), the binary case is:

P(y = 1 | x) = σ(wᵀx + b) = 1 / (1 + e^(−(wᵀx + b)))
Parameters are fit by maximizing the log-likelihood with L2 regularization (ridge):

w* = argmax_w [ Σᵢ log P(yᵢ | xᵢ; w) − λ‖w‖² ]
TF-IDF (term frequency–inverse document frequency) weights features before passing to logistic regression:

tf-idf(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.
TF rewards words that appear often in a document; IDF discounts words that appear in many documents (common words like "the" carry little discriminative information). The log-scaling of IDF compresses the dynamic range so that a word appearing in 10% of documents is weighted much more than one appearing in 90%, but not astronomically more than one appearing in 5%.
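The compression is easy to check numerically. A quick sketch with a hypothetical 1,000-document corpus:

```python
import math

N = 1000  # hypothetical corpus size
# idf for a word appearing in 90%, 10%, and 5% of documents
idf = {share: math.log(N / (N * share)) for share in (0.90, 0.10, 0.05)}

for share in (0.90, 0.10, 0.05):
    print(f"word in {share:.0%} of docs -> idf = {idf[share]:.2f}")
```

The 10% word gets roughly 22x the weight of the 90% word, but the 5% word gets only about 1.3x the weight of the 10% word — exactly the compression of dynamic range described above.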
Support Vector Machine (Linear SVM)
A linear SVM finds the maximum-margin hyperplane separating two classes:

min_{w,b} ½‖w‖²  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 for all i
With a soft margin (C parameter), slack variables ξᵢ ≥ 0 allow misclassification:

min_{w,b,ξ} ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0
SVMs use the hinge loss: ℓ(y, f(x)) = max(0, 1 − y(wᵀx + b)). Unlike logistic regression, the SVM uses only the support vectors (points on or inside the margin) to define the decision boundary — making it robust in high-dimensional sparse feature spaces like bag-of-words representations.
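The three regimes of the hinge loss can be sketched directly (scores and labels below are illustrative):

```python
import numpy as np

def hinge_loss(y, score):
    """Hinge loss for label y in {-1, +1} and raw score w.x + b."""
    return np.maximum(0.0, 1.0 - y * score)

# Correct and confidently outside the margin: zero loss.
# Such points are not support vectors and do not shape the boundary.
print(hinge_loss(+1, 2.5))   # 0.0
# Correct but inside the margin: small positive loss
print(hinge_loss(+1, 0.4))   # 0.6
# Misclassified: loss grows linearly with the violation
print(hinge_loss(+1, -1.0))  # 2.0
```

Points with zero loss contribute nothing to the gradient, which is exactly why only the margin-adjacent points matter.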
Walkthrough
Task: Binary sentiment classification on movie reviews. Labels: positive / negative.
Step 1 — Feature extraction:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# TF-IDF with unigrams and bigrams, top 10k features
vectorizer = TfidfVectorizer(
ngram_range=(1, 2), # unigrams + bigrams
max_features=10_000,
sublinear_tf=True, # replace tf with 1 + log(tf)
min_df=3, # ignore terms in < 3 documents
)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # fit only on train!

Step 2 — Train and compare three classifiers:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
models = {
'NaiveBayes': MultinomialNB(alpha=0.1),
'LogReg': LogisticRegression(C=1.0, max_iter=1000),
'LinearSVC': LinearSVC(C=1.0, max_iter=2000),
}
for name, clf in models.items():
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"\n{name}:")
print(classification_report(y_test, preds, target_names=['neg', 'pos']))

Typical results on IMDb (25k training docs):
| Model | Accuracy | F1 | Training time |
|---|---|---|---|
| Multinomial NB | ~86% | 0.86 | <1s |
| Logistic Regression | ~90% | 0.90 | ~5s |
| Linear SVC | ~91% | 0.91 | ~3s |
Step 3 — Inspect what the model learned:
import numpy as np
feature_names = vectorizer.get_feature_names_out()
# Top positive/negative features for logistic regression
coef = models['LogReg'].coef_[0]
top_pos = feature_names[np.argsort(coef)[-20:]][::-1]
top_neg = feature_names[np.argsort(coef)[:20]]
print("Top positive features:", top_pos)
print("Top negative features:", top_neg)

Step 4 — Handle class imbalance:
When classes are unbalanced (e.g., 90% negative in spam detection), accuracy is misleading. Use:
# class_weight='balanced' scales loss by inverse class frequency
LogisticRegression(class_weight='balanced', C=1.0)
# For evaluation, use macro-averaged F1, not accuracy
from sklearn.metrics import f1_score
f1 = f1_score(y_test, preds, average='macro')  # unweighted mean across classes

Analysis & Evaluation
Where Your Intuition Breaks
"More features always improve text classification." In fact, beyond a point, adding features hurts. In high-dimensional sparse spaces, the number of parameters grows with vocabulary size, and with limited training data, many parameters get noisy estimates. L2 regularization helps, but very large vocabulary models can still overfit to domain-specific n-grams in the training data that don't transfer to test. Feature selection (removing features below a minimum document frequency, limiting max_features) often improves test accuracy even though it reduces training accuracy. The dimensionality of the feature space is not free — it interacts with training set size.
Model Comparison
| Model | Calibration | Interpretability | Speed | Handles sparsity |
|---|---|---|---|---|
| Multinomial NB | Poor (overconfident) | High (feature weights = log-likelihood ratios) | Fastest | Excellent |
| Logistic Regression | Good (Platt calibration available) | High (feature coefficients direct) | Fast | Good |
| Linear SVC | None (no probability output) | Medium (signed margin) | Fast | Excellent |
| SGD Classifier | Good with log loss | High | Very fast (streaming) | Excellent |
When to use each:
- Naive Bayes when training data is tiny (<1k samples) or you need a very fast baseline
- Logistic Regression when you need calibrated probabilities or multi-class support
- Linear SVC when you have many classes and accuracy is the only objective
- All three as baselines before trying neural methods — if they achieve 90%+ F1, the neural overhead may not be justified
Multi-class Strategies
For K > 2 classes:
One-vs-Rest (OvR): Train K binary classifiers, predict the class with the highest score. scikit-learn's LinearSVC uses this by default; LogisticRegression defaults to the multinomial (softmax) formulation with the lbfgs solver.
Softmax (multinomial) logistic regression: Single model with K output weights; trained with cross-entropy loss. More parameter-efficient than OvR when K is large.
Hierarchical classification: When classes have natural hierarchy (e.g., topic taxonomy), train coarse-to-fine classifiers. Reduces error propagation at finer levels.
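The OvR and softmax strategies can be compared directly in scikit-learn. A sketch using synthetic data in place of TF-IDF features (the dataset and its dimensions are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class problem standing in for TF-IDF features
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)

# OvR: K independent binary classifiers, one per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 4 binary models

# Softmax: one model trained jointly with cross-entropy
# (the default for LogisticRegression with the lbfgs solver)
softmax = LogisticRegression(max_iter=1000).fit(X, y)
print(softmax.coef_.shape)   # (4, 50): one weight vector per class
```

Both expose one weight vector per class; the difference is the training objective — K independent binary fits versus a single joint cross-entropy fit whose probabilities sum to 1 by construction.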
Common Pitfalls
Vectorizer fit on test data. Always call fit_transform on training data and transform on test/val. Fitting on test leaks vocabulary statistics (IDF values) and inflates reported performance.
Ignoring text preprocessing. HTML tags, URLs, numbers, and punctuation can dominate TF-IDF features. Lowercase normalization, stop-word removal, and lemmatization typically improve both speed and accuracy.
Using accuracy on imbalanced datasets. A model predicting "not spam" for every email achieves 99% accuracy on a 1% spam dataset. Use macro F1 or precision-recall AUC for imbalanced tasks.
Not using sublinear_tf. Raw term frequency gives disproportionate weight to repeated words. sublinear_tf=True (replace tf with 1 + log(tf)) significantly improves most classifiers.
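The saturation effect of sublinear scaling is visible from a few values (the term frequencies below are illustrative):

```python
import math

# Raw tf grows linearly with repetition; 1 + log(tf) saturates,
# so a word repeated 100 times gets ~5.6x weight, not 100x
for tf in (1, 2, 10, 100):
    print(f"tf={tf:>3} -> sublinear tf = {1 + math.log(tf):.2f}")
```

A review that repeats "terrible" ten times is more negative than one that says it once — but not ten times more negative, which is the intuition sublinear scaling encodes.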