Stemming and Lemmatization

Stemming and lemmatization are text normalization techniques that reduce words to their base or root form. They help group related words together, reducing vocabulary size and improving feature consistency.

Stemming

Stemming applies heuristic rules to chop off word endings. It is fast, but the resulting stems are often not dictionary words (for example, "studi" for "studies").
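To make "heuristic rules" concrete, here is a minimal sketch of suffix stripping. It is not the Porter algorithm: the rule list, the naive_stem name, and the minimum stem length are all assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, only an illustration
# of rule-based chopping. The rules and the minimum stem length are assumptions.
SUFFIX_RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # jumping -> jump
    ("ed", ""),     # jumped  -> jump
    ("s", ""),      # jumps   -> jump
]

def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching rule, keeping at least min_stem_len characters before the suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["jumps", "jumped", "jumping", "studies", "bus"]
stems = [naive_stem(w) for w in words]
print(stems)
# ['jump', 'jump', 'jump', 'studi', 'bus']
print(f"vocabulary: {len(set(words))} words -> {len(set(stems))} stems")
# vocabulary: 5 words -> 3 stems
```

The length guard is what keeps "bus" from being cut down to "bu". Real stemmers use more careful conditions of this kind, which is why the NLTK implementations below are preferable in practice.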

Porter Stemmer

The Porter stemmer, published by Martin Porter in 1980, is the most widely used stemming algorithm for English.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["studies", "studying", "studied", "study"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# studies -> studi
# studying -> studi
# studied -> studi
# study -> studi

Snowball Stemmer

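The English Snowball stemmer, also known as Porter2, is Martin Porter's revised version of the original algorithm. Snowball also provides stemmers for other languages, which is why NLTK's SnowballStemmer takes a language name.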

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["running", "runs", "ran", "runner"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# running -> run
# runs -> run
# ran -> ran  (irregular form not handled)
# runner -> runner

Lancaster Stemmer

The Lancaster (Paice/Husk) stemmer is more aggressive than Porter and tends to produce shorter stems.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
print(stemmer.stem("studies"))   # study
print(stemmer.stem("running"))   # run
print(stemmer.stem("happiness")) # happy

Lemmatization

Lemmatization uses vocabulary and morphological analysis to return the dictionary form of a word (lemma).
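The idea can be sketched without any library: look up irregular forms in an exception table, then accept a suffix rule only when it yields a word in the vocabulary. The VOCAB, IRREGULAR, and SUFFIX_RULES tables and the toy_lemmatize name below are made up for this example; real lemmatizers rely on a full lexicon such as WordNet.

```python
# Toy lemmatizer: exception table for irregular forms plus suffix rules that
# are validated against a vocabulary. All tables here are tiny, made-up samples.
VOCAB = {"study", "run", "goose", "dog", "good"}
IRREGULAR = {"geese": "goose", "ran": "run", "better": "good"}
# ("ning", "") is a crude way to undo a doubled consonant, as in "running".
SUFFIX_RULES = [("ies", "y"), ("ning", ""), ("ing", ""), ("s", "")]

def toy_lemmatize(word: str) -> str:
    word = word.lower()
    if word in IRREGULAR:            # 1. irregular forms come from a lookup table
        return IRREGULAR[word]
    if word in VOCAB:                # 2. the word is already a dictionary form
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB:   # 3. accept a rule only if it yields a known word
                return candidate
    return word                      # 4. unknown words are returned unchanged

for w in ["studies", "running", "geese", "ran", "dogs", "blorping"]:
    print(f"{w} -> {toy_lemmatize(w)}")
# studies -> study
# running -> run
# geese -> goose
# ran -> run
# dogs -> dog
# blorping -> blorping
```

Unlike a stemmer, this sketch never emits a truncated form such as "studi": a rule either produces a known word or is rejected.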

WordNet Lemmatizer

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Default is noun
print(lemmatizer.lemmatize("studies"))      # study
print(lemmatizer.lemmatize("running"))      # running (no change for nouns)

# With POS tags
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
print(lemmatizer.lemmatize("geese"))             # goose

POS-Based Lemmatization

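WordNetLemmatizer treats every word as a noun unless told otherwise, so supplying a POS tag for each token improves accuracy. NLTK's tagger returns Penn Treebank tags, which must first be mapped to WordNet's POS constants.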

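import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Requires NLTK's tokenizer, POS tagger, and WordNet data, installed with
# nltk.download(); the exact resource names depend on the NLTK version.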

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def pos_lemmatize(text):
    """Tokenize the text, tag each token, and lemmatize it using its tag."""
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]

text = "The geese were running faster than better dogs"
print(pos_lemmatize(text))
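# Output if the tagger labels "faster" and "better" as adjectives
# (tags, and therefore lemmas, can vary with the tagger model):
# ['The', 'goose', 'be', 'run', 'fast', 'than', 'good', 'dog']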

Stemming vs Lemmatization

Aspect	Stemming	Lemmatization
Speed	Fast	Slower
Output	May not be a valid word	Dictionary form (lemma)
Accuracy	Lower	Higher
Approach	Rule-based chopping	Vocabulary and morphological analysis
Example	"studies" -> "studi"	"studies" -> "study"
Use Case	Search engines, IR	Text analysis, NLU

When to Use Each

Use stemming when:

Speed is critical
Working with search/information retrieval
Exact word forms don't matter
Building a simple baseline
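For the search and information retrieval case, the key point is that the same stemmer runs over both documents and queries, so different surface forms meet at one index key. The sketch below assumes a hypothetical crude_stem function as a stand-in for a real stemmer; build_index and search are likewise illustrative names, not part of any library.

```python
from collections import defaultdict

def crude_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common endings."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Return the ids of documents that match every stemmed query term."""
    postings = [index.get(crude_stem(t), set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "the crawler indexed new pages",
    2: "indexing large pages takes time",
    3: "the cat sleeps",
}
index = build_index(docs)
print(search(index, "index page"))
# {1, 2}
```

Neither document contains the literal words "index" or "page", yet both match, because "indexed", "indexing", and "pages" were reduced to the same keys as the query terms. Whether the stems are real words is irrelevant here, which is why stemming suits this use case.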

Use lemmatization when:

Accuracy matters more than speed
Results need to be readable
Working with language understanding tasks
Domain-specific vocabulary is important
