Neural-Path/Notes

Text Preprocessing

Raw text from production systems is noisy — HTML tags, inconsistent Unicode, mixed-language content, URLs, and user-generated typos all appear in real corpora. The preprocessing decisions made upstream of tokenization directly affect what patterns a model learns and how reliably evaluation metrics reflect real-world performance. This lesson formalizes the major preprocessing steps, derives TF-IDF weighting, and develops judgment about when classical preprocessing pipelines help versus when they actively harm modern LLM workflows.

Theory

Text Preprocessing Pipeline (product review example)

Great product!! LOVE IT :) Check out http://example.com for more... <b>highly</b> recommended 👍

Raw input: raw user text — HTML tags, URLs, emoji, mixed case, and punctuation all present.

Raw text from the internet is a mess — HTML tags, inconsistent encoding, emoji, copy-pasted ligatures, and inconsistent capitalization all appear in real corpora. Preprocessing is the pipeline that standardizes text before it reaches the model. The diagram above shows the stages: normalization and cleaning happen first, then statistical weighting (TF-IDF) converts cleaned tokens into feature vectors. Each stage removes noise that would otherwise consume vocabulary budget or distort frequency statistics.

Unicode Normalization

Unicode defines four normalization forms that canonicalize equivalent character sequences. Many identical-looking strings have different byte representations depending on composition.

NFD (Canonical Decomposition): Decomposes characters into base letter plus combining diacritics. "é" (U+00E9, single code point) becomes "e" (U+0065) + combining acute accent (U+0301).

NFC (Canonical Decomposition + Canonical Composition): Applies NFD then recomposes: "e" + combining acute → "é". NFC is the standard form for text exchange.

NFKD (Compatibility Decomposition): Like NFD but also decomposes compatibility equivalents — the ligature "fi" (U+FB01) becomes "f"+"i"; the Roman numeral "Ⅳ" becomes "I"+"V"; mathematical bold "𝐀" becomes "A".

NFKC (Compatibility Decomposition + Canonical Composition): Applies NFKD then recomposes. NFKC is used by BERT and most modern NLP systems because it normalizes visually similar but semantically equivalent characters while preserving core meaning.

python
import unicodedata
 
text = "\u00e9"                            # é as single code point
print(unicodedata.normalize("NFD", text))  # e + combining accent (len=2)
print(unicodedata.normalize("NFC", text))  # é (len=1)
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi ligature → "fi"

Failure to normalize causes identical words to have different byte representations and thus different token IDs, fragmenting frequency statistics and breaking exact-match evaluation.
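The fragmentation is easy to demonstrate with only the standard library: two strings that render identically compare unequal until they are normalized.

```python
import unicodedata

# Two byte-level encodings of the same visible string "café"
nfc = "caf\u00e9"    # composed: 4 code points
nfd = "cafe\u0301"   # decomposed: 5 code points ("e" + combining acute)

print(nfc == nfd)                                # False — distinct token IDs downstream
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once both are normalized
```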

Lowercasing

Lowercasing maps every uppercase character to its lowercase equivalent:

\text{lower}(c) = \begin{cases} c - 32 & c \in [A\text{-}Z] \\ c & \text{otherwise} \end{cases}

(simplified for ASCII; Unicode lowercasing is locale-dependent — Turkish "ı" and "İ" require special handling).
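A quick sketch of both the ASCII rule and the Turkish caveat, using CPython's default (locale-independent) case mapping:

```python
# ASCII lowering is a fixed +32 code-point offset
print(chr(ord("A") + 32))      # "a"
print("GPU-Ready".lower())     # "gpu-ready"

# Locale pitfall: Turkish dotted capital İ (U+0130) lowercases to
# "i" plus a combining dot above under the default full-case mapping,
# so the result is two code points, not one
print(len("\u0130".lower()))   # 2
```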

Benefits: Reduces vocabulary size; "The" and "the" share the same embedding; more gradient signal per token type.

Costs: Destroys case-sensitive distinctions. "us" (pronoun) and "US" (country), "apple" and "Apple", "gpu" and "GPU" lose their distinction. For named entity recognition and tasks involving abbreviations, this loss is harmful.

Uncased models (BERT-base-uncased) perform similarly to cased models on tasks where case is not semantically informative (sentiment, topic classification), but cased models win on NER and tasks involving proper nouns.

Stemming and Lemmatization

Both operations reduce inflected forms to a canonical base, improving recall in information retrieval.

Stemming applies heuristic suffix-stripping rules to produce a stem (which may not be a real word). The Porter stemmer:

  • "running" → "run" (strip "-ing", then undo the consonant doubling)
  • "studies" → "studi" (strip "-ies")
  • "generalization" → "general" (strip "-ization")

The stem "studi" is not a valid English word, but it serves as a consistent index key. The Snowball stemmer extends Porter to 15 languages.
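The suffix-stripping idea can be sketched with a few ordered rules plus an undoubling step. This toy stemmer is illustrative only, not the real Porter algorithm:

```python
def toy_stem(word: str) -> str:
    """Toy suffix stripper — illustrates ordered rules, not actual Porter."""
    for suffix, repl in [("ization", ""), ("ies", "i"), ("ing", "")]:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)] + repl
            # crude version of Porter's undoubling rule: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([toy_stem(w) for w in ["running", "studies", "generalization"]])
# ['run', 'studi', 'general']
```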

Lemmatization uses morphological analysis (POS tagging + morphological dictionary) to find the canonical lemma:

\text{lemmatize}(w, \text{POS}) = \begin{cases} \text{base verb} & \text{if POS} = \text{VERB} \\ \text{singular noun} & \text{if POS} = \text{NOUN} \\ w & \text{otherwise} \end{cases}

"better" with POS=ADJ lemmatizes to "good" (irregular comparative). Lemmatization is more accurate but slower and language-specific.

Both are unnecessary for neural models and LLMs — their subword tokenizers and contextual representations handle morphological variation implicitly.

Stopword Removal

Stopwords are high-frequency function words ("the", "a", "is", "in") that carry little discriminative information for topic classification or retrieval. A word $w$ is a stopword if $\text{df}(w) > \theta \cdot N$ for some threshold $\theta$.

This is corpus-dependent: "machine" is not a stopword in a general corpus but might be in an ML paper corpus. For sentiment analysis, negation words ("not", "no") are critical signal and must not be removed.
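The threshold definition translates directly into a small document-frequency scan; the value of theta and the toy corpus here are illustrative:

```python
from collections import Counter

def corpus_stopwords(docs: list[str], theta: float = 0.8) -> set[str]:
    # A word is a stopword for THIS corpus if df(w) > theta * N
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))   # set(): count each doc once
    return {w for w, count in df.items() if count > theta * n}

docs = [
    "the model trains on the corpus",
    "the corpus is large",
    "the model converges",
]
print(corpus_stopwords(docs, theta=0.8))   # {'the'} — appears in all 3 docs
```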

TF-IDF Weighting

TF-IDF transforms raw term counts into importance scores that balance local frequency against global rarity.

Term Frequency (TF): How often term tt appears in document dd. Log-normalized variant prevents dominance by very frequent terms:

\text{tf}(t, d) = 1 + \log(f_{t,d})

Inverse Document Frequency (IDF): Terms appearing in many documents are less discriminative:

\text{idf}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}

A term appearing in every document has $\text{idf} = 0$.

TF-IDF combined:

\text{TF-IDF}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)

IDF must be logarithmic because word frequency follows Zipf's law: the most common word appears roughly twice as often as the second, ten times as often as the tenth. A word appearing in 1,000 documents is not 1,000 times less informative than one appearing in one — it's logarithmically less informative. The log compresses the exponential range of document frequencies into a linear scale where differences are meaningful. Without it, the raw ratio $N/\text{df}$ would assign a weight of $N$ to any term appearing in a single document, letting a handful of rare terms dominate every similarity computation.

Smooth IDF (sklearn default) adds 1 to prevent division by zero:

\text{idf}_{\text{smooth}}(t, D) = \log\frac{1 + N}{1 + \text{df}(t)} + 1

The full TF-IDF matrix $\mathbf{X} \in \mathbb{R}^{N \times V}$ is sparse. Rows are $\ell_2$-normalized before cosine similarity computation.
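The formulas can be checked with a from-scratch implementation. Documents are token lists here, and the smooth variant mirrors sklearn's default:

```python
import math

def tf_idf(term: str, doc: list[str], docs: list[list[str]], smooth: bool = False) -> float:
    f = doc.count(term)
    if f == 0:
        return 0.0
    df = sum(1 for d in docs if term in d)        # document frequency
    n = len(docs)
    tf = 1 + math.log(f)                          # sublinear tf
    if smooth:
        idf = math.log((1 + n) / (1 + df)) + 1    # sklearn-style smooth idf
    else:
        idf = math.log(n / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
print(tf_idf("the", docs[0], docs))              # 0.0 — appears in every document
print(round(tf_idf("dog", docs[1], docs), 3))    # 1.099 — log(3), df = 1
```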

Text Cleaning Operations

HTML stripping: Use a proper parser (BeautifulSoup) rather than regex — regex fails on malformed HTML that appears in real web data.

URL normalization: Replace URLs with a placeholder (URL_TOKEN) or extract domain as a feature. Raw URLs add vocabulary noise without semantic value for most tasks.

Emoji handling: Three strategies: (1) remove entirely (loses sentiment signal), (2) replace with text description ("😊" → "smiling_face"), (3) keep as-is (works with byte-level tokenizers). Strategy 2 is best for classical NLP; strategy 3 is best for LLMs.

Sentence segmentation: Non-trivial due to abbreviations ("Dr.", "U.S.A."), ellipsis, and quotation marks. Rule-based segmenters (NLTK Punkt, spaCy) train on abbreviation dictionaries. Model-based segmenters achieve higher accuracy on noisy text.
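A minimal rule-based segmenter with a hand-written abbreviation list (illustrative, far simpler than Punkt's learned model) shows the core trick: split on sentence-final punctuation, then re-merge splits caused by known abbreviations.

```python
import re

ABBREVIATIONS = {"dr.", "mr.", "u.s.a.", "e.g.", "i.e."}  # illustrative, not exhaustive

def naive_sentences(text: str) -> list[str]:
    # Split after ., !, ? followed by whitespace...
    parts = re.split(r"(?<=[.!?])\s+", text)
    out: list[str] = []
    for part in parts:
        # ...then undo splits triggered by a known abbreviation
        if out and out[-1].split()[-1].lower() in ABBREVIATIONS:
            out[-1] += " " + part
        else:
            out.append(part)
    return out

print(naive_sentences("Dr. Smith arrived late. The U.S.A. team won."))
# ['Dr. Smith arrived late.', 'The U.S.A. team won.']
```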

Walkthrough

We preprocess a product review through a classical NLP pipeline.

Raw input:

"Great product!! LOVE IT :) Check out http://example.com for more... <b>highly</b> recommended 👍"

Step 1 — Unicode normalization (NFKC): No change for this ASCII-dominant string, but handles ligatures and compatibility characters from copy-pasted content.
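For completeness, the corresponding code; this particular string is already in NFKC form, so it passes through unchanged:

```python
import unicodedata

text = ("Great product!! LOVE IT :) Check out http://example.com "
        "for more... <b>highly</b> recommended 👍")
text = unicodedata.normalize("NFKC", text)
# unchanged — ASCII plus an emoji with no compatibility decomposition
```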

Step 2 — HTML stripping:

python
from bs4 import BeautifulSoup
text = BeautifulSoup(text, "html.parser").get_text()
# "Great product!! LOVE IT :) Check out http://example.com for more... highly recommended 👍"

Step 3 — URL normalization:

python
import re
text = re.sub(r'https?://\S+', 'URL_TOKEN', text)
# "Great product!! LOVE IT :) Check out URL_TOKEN for more... highly recommended 👍"

Step 4 — Emoji handling:

python
import emoji
text = emoji.demojize(text, delimiters=(' ', ' '))
# "Great product!! LOVE IT :) Check out URL_TOKEN for more... highly recommended  thumbs_up "
text = re.sub(r'\s+', ' ', text).strip()

Step 5 — Lowercasing:

python
text = text.lower()
# "great product!! love it :) check out url_token for more... highly recommended thumbs_up"

Step 6 — Punctuation removal:

python
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text).strip()
# "great product love it check out url_token for more highly recommended thumbs_up"

Step 7 — Tokenization:

python
tokens = text.split()
# ['great', 'product', 'love', 'it', 'check', 'out', 'url_token',
#  'for', 'more', 'highly', 'recommended', 'thumbs_up']

Step 8 — (Optional) Stopword removal:

python
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stops]
# ['great', 'product', 'love', 'check', 'url_token', 'highly', 'recommended', 'thumbs_up']

Step 9 — TF-IDF weighting (across a corpus of reviews):

python
from sklearn.feature_extraction.text import TfidfVectorizer
 
corpus = [
    "great product love check url_token highly recommended thumbs_up",
    "terrible product horrible experience do not recommend",
    "love this product great quality highly recommend",
]
 
vectorizer = TfidfVectorizer(max_features=500, sublinear_tf=True)
X = vectorizer.fit_transform(corpus)
 
feature_names = vectorizer.get_feature_names_out()
doc0 = dict(zip(feature_names, X[0].toarray()[0]))
top = sorted(doc0.items(), key=lambda x: -x[1])[:5]
# check:       0.41 (appears in only 1 doc → high IDF)
# recommended: 0.41 (same)
# thumbs_up:   0.41 (same)
# url_token:   0.41 (same)
# great:       0.31 (appears in 2 docs → lower IDF)

"check", "recommended", "thumbs_up", and "url_token" receive the highest TF-IDF weights because each appears in only one document (high IDF), while "great", "love", and "highly" appear in two documents and are downweighted.

Analysis & Evaluation

Where Your Intuition Breaks

The intuition says more preprocessing produces cleaner text, and cleaner text means better model performance. For classical NLP pipelines (bag-of-words, TF-IDF, logistic regression), aggressive preprocessing — lowercasing, stopword removal, stemming — consistently helps by reducing vocabulary noise. For modern LLMs with subword or byte-level tokenizers, the same pipeline actively harms performance: lowercasing destroys named-entity signal, stripping punctuation removes syntactic structure the model was trained to use, and stopword removal deletes the function words the model relies on to parse syntax. The right preprocessing depends entirely on the model, not on a universal notion of "clean text."

Classical NLP vs LLM Preprocessing

| Step | Classical NLP Pipeline | LLM / Transformer Pipeline |
| --- | --- | --- |
| Unicode normalization (NFKC) | Required | Handled by byte-level tokenizer |
| HTML stripping | Mandatory | Still recommended |
| URL normalization | Replace with token | Optional |
| Lowercasing | Usually yes | No — model uses case as a feature |
| Punctuation removal | Usually yes | No — punctuation affects semantics |
| Stopword removal | Often yes | Never |
| Stemming / lemmatization | Often yes (IR tasks) | Never |
| Sentence segmentation | Required for sentence-level tasks | Not required |

When Preprocessing Hurts

Named entities: Lowercasing "Apple" (company), "US" (country), "COVID" (acronym) forces the model to rely entirely on context for disambiguation. Cased models consistently outperform uncased models on NER benchmarks.

Punctuation for intent: Questions ("?"), emphasis ("!"), and quotation structure are syntactic signals. Removing them makes classification tasks harder.

Overly aggressive URL removal: For phishing detection or content classification, the URL domain is high-signal. Consider extracting domain and path as separate features rather than replacing with a generic token.

Emoji removal for sentiment: "good 😊" and "good 😡" have opposite sentiments despite identical text after emoji stripping. Use description conversion, not deletion.

💡The pipeline dependency problem

Each preprocessing step changes the input distribution for all subsequent steps. Define your pipeline order carefully and apply it identically at training time and inference time — train-test distribution mismatch from preprocessing differences is a common but subtle source of evaluation inflation. A common mistake: applying HTML stripping only at training time, then evaluating on raw HTML during testing.
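One way to enforce this is to define the pipeline once and call the same function on both sides. A stdlib-only sketch (HTML stripping with BeautifulSoup would slot in as another step):

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # Single source of truth: call this at training AND inference time
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", "URL_TOKEN", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

print(preprocess("Visit  https://example.com  NOW"))   # "visit url_token now"
```

Serializing this function alongside the model (or packaging it with the model artifact) prevents the training/serving skew described above.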

⚠️LLMs with byte-level tokenizers need minimal preprocessing

For GPT-3/4 (tiktoken cl100k_base), LLaMA (SentencePiece byte fallback), and similar models: (1) do not lowercase — the model was trained on cased text; (2) do not remove punctuation — the model uses it for syntactic parsing; (3) do not stem or lemmatize — subword tokenization handles morphological variation. The only universally safe preprocessing for LLM API calls is HTML tag stripping and ensuring UTF-8 encoding.

Production Pipeline

python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
 
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        strip_accents='unicode',       # NFKD + strip combining marks
        analyzer='word',
        token_pattern=r'\b[a-zA-Z]{2,}\b',
        ngram_range=(1, 2),            # unigrams + bigrams
        max_features=50_000,
        sublinear_tf=True,             # log(1 + tf)
        min_df=2,                      # drop hapax legomena
        max_df=0.95,                   # drop near-universal terms
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000)),
])
 
pipeline.fit(X_train_raw, y_train)
preds = pipeline.predict(X_test_raw)

This pipeline handles normalization, tokenization, frequency-based vocabulary filtering (via min_df/max_df), and TF-IDF weighting in a single composable object that applies the identical transformation at training and inference time.
