🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Text Preprocessing Techniques

NLP FoundationsText Preprocessing🟢 Free Lesson

Advertisement

Text Preprocessing

Text preprocessing is the critical first step in any NLP pipeline. Raw text is messy and inconsistent—preprocessing transforms it into a clean, standardized format suitable for downstream tasks. Proper preprocessing can significantly improve model performance.

1. Lowercasing

Converting all text to lowercase is the simplest normalization step. It reduces vocabulary size and treats words like "The" and "the" as identical.

text = "The Quick Brown Fox JUMPS Over the Lazy Dog"
lowercase_text = text.lower()
print(lowercase_text)
# "the quick brown fox jumps over the lazy dog"

When NOT to lowercase: For tasks like named entity recognition or sentiment analysis where capitalization carries meaning (e.g., "GOOD" vs "good").

2. HTML and Markup Removal

Web text often contains HTML tags, entities, and markup that must be stripped.

import re
from html import unescape

def remove_html(text):
    clean = re.sub(r'<[^>]+>', '', text)
    clean = unescape(clean)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return clean

html_text = "<p>Natural <b>Language</b> Processing &amp; AI</p>"
print(remove_html(html_text))
# "Natural Language Processing & AI"

3. Noise Removal

Remove special characters, numbers, URLs, emails, and other non-textual content.

def remove_noise(text, remove_numbers=False, remove_urls=True):
    if remove_urls:
        text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

4. Stopword Removal

Stopwords are common words (the, is, at, which) that appear frequently but carry little semantic meaning.

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

def remove_stopwords(text, language='english'):
    stop_words = set(stopwords.words(language))
    tokens = text.lower().split()
    filtered = [t for t in tokens if t not in stop_words]
    return ' '.join(filtered)

text = "The quick brown fox jumps over the lazy dog"
print(remove_stopwords(text))
# "quick brown fox jumps lazy dog"

Common Stopword Lists

LanguageExample Stopwords
Englishthe, is, at, which, on, a, an, and, or, but
Spanishel, la, los, las, un, una, de, del, en
Germander, die, das, ein, eine, und, oder, aber
Frenchle, la, les, un, une, de, du, des, et

5. Punctuation Removal

import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

text = "Hello, world! How's it going? Fine, thanks."
print(remove_punctuation(text))
# "Hello world Hows it going Fine thanks"

6. Number Handling

import re

def normalize_numbers(text, replacement="[NUM]"):
    return re.sub(r'\b\d+(?:\.\d+)?\b', replacement, text)

text = "I have 3 cats and 2.5 pounds of cat food"
print(normalize_numbers(text))
# "I have [NUM] cats and [NUM] pounds of cat food"

7. Whitespace Normalization

def normalize_whitespace(text):
    text = re.sub(r'[\t\n\r]+', ' ', text)
    text = re.sub(r' +', ' ', text)
    return text.strip()

Complete Preprocessing Pipeline

import re
from html import unescape
from nltk.corpus import stopwords
import string

class TextPreprocessor:
    def __init__(self, lowercase=True, remove_html=True,
                 remove_stopwords=False, remove_punctuation=True):
        self.lowercase = lowercase
        self.remove_html = remove_html
        self.remove_stopwords = remove_stopwords
        self.remove_punctuation = remove_punctuation
        self.stop_words = set(stopwords.words('english')) if remove_stopwords else set()

    def __call__(self, text):
        if self.remove_html:
            text = re.sub(r'<[^>]+>', '', text)
            text = unescape(text)
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))
        if self.remove_stopwords:
            tokens = text.split()
            tokens = [t for t in tokens if t not in self.stop_words]
            text = ' '.join(tokens)
        return re.sub(r'\s+', ' ', text).strip()

preprocessor = TextPreprocessor(lowercase=True, remove_stopwords=True)
sample = "<p>The <b>Quick</b> Brown Fox! It jumped over the lazy dog.</p>"
print(preprocessor(sample))

Preprocessing Decisions

DecisionOption AOption BWhen to Use
LowercasingAlways lowercaseKeep original caseLowercase for topic modeling; keep for NER
StopwordsRemove allKeep allRemove for BoW; keep for sentiment
NumbersRemove allReplace with tokenKeep for financial text; remove for general
PunctuationRemove allKeep sentence boundariesRemove for clustering; keep for parsing

Premium Content

Text Preprocessing Techniques

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert NLP Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement