Stemming and Lemmatization

Stemming and lemmatization are text normalization techniques that reduce words to their base or root form. They help group related words together, reducing vocabulary size and improving feature consistency.

Stemming

Stemming applies heuristic rules to chop off word endings. It's fast but often produces non-dictionary stems.
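To make "heuristic rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy, not the Porter algorithm: the rule list, the minimum stem length, and the name toy_stem are all assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter algorithm.
# SUFFIX_RULES, MIN_STEM_LENGTH and toy_stem are hypothetical names.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # jumping -> jump
    ("ed", ""),    # jumped -> jump
    ("s", ""),     # cats -> cat
]

MIN_STEM_LENGTH = 3  # keeps short words such as "is" or "red" intact


def toy_stem(word: str) -> str:
    """Apply the first suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word  # no rule applied


for w in ["studies", "jumping", "jumped", "cats", "caring"]:
    print(f"{w} -> {toy_stem(w)}")
# studies -> studi
# jumping -> jump
# jumped -> jump
# cats -> cat
# caring -> car
```

The last line shows the typical failure mode: "caring" collapses to "car", a valid word with an unrelated meaning. Real stemmers use longer rule lists and extra conditions to reduce such errors, but they cannot remove them entirely.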

Porter Stemmer

The Porter stemmer, published by Martin Porter in 1980, is the most widely used stemming algorithm.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["studies", "studying", "studied", "study"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# studies -> studi
# studying -> studi
# studied -> studi
# study -> studi

Snowball Stemmer

The English Snowball stemmer, also known as Porter2, is Martin Porter's improved revision of the original Porter algorithm.

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["running", "runs", "ran", "runner"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# running -> run
# runs -> run
# ran -> ran  (irregular form not handled)
# runner -> runner

Lancaster Stemmer

The Lancaster stemmer is more aggressive than Porter and tends to produce shorter stems, which raises the risk of merging unrelated words.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
print(stemmer.stem("studies"))   # study
print(stemmer.stem("running"))   # run
print(stemmer.stem("happiness")) # happy
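A quick way to compare the three stemmers is to run them over the same words and inspect the results side by side. The sketch below assumes NLTK is installed; the helper compare_stemmers and the sample words are hypothetical, and no output is listed because the stems depend on each algorithm's rules.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer covered in this lesson.
STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare_stemmers(words):
    """Map each word to the stem produced by every stemmer (hypothetical helper)."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in STEMMERS.items()}
        for word in words
    }


for word, stems in compare_stemmers(["running", "generously", "maximum"]).items():
    print(word, stems)
```

Running this on a sample of your own corpus is usually the most reliable way to decide which stemmer's trade-off suits the task.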

Lemmatization

Lemmatization uses vocabulary and morphological analysis to return the dictionary form of a word (lemma).
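The idea can be sketched with a toy lemmatizer: irregular forms come from an exception table, and a suffix rule is accepted only when its result is a known word. The vocabulary, exception table, rules, and the name toy_lemmatize are small hand-made assumptions for illustration; a real lemmatizer relies on a full lexical resource such as WordNet.

```python
# Toy dictionary-based lemmatizer. All tables below are hypothetical and tiny.
VOCABULARY = {"study", "run", "goose", "dog"}

# Irregular forms that no suffix rule can recover.
EXCEPTIONS = {"geese": "goose", "ran": "run"}

# (suffix, replacement) pairs, tried in order.
DETACHMENT_RULES = [("ies", "y"), ("ing", ""), ("s", "")]


def toy_lemmatize(word: str) -> str:
    """Return a lemma only if it can be validated against the vocabulary."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown word: returned unchanged


for w in ["studies", "studying", "dogs", "geese", "ran", "blorfs"]:
    print(f"{w} -> {toy_lemmatize(w)}")
# studies -> study
# studying -> study
# dogs -> dog
# geese -> goose
# ran -> run
# blorfs -> blorfs
```

Unlike a stemmer, this approach never emits a made-up stem: "blorfs" stays as it is because "blorf" is not in the vocabulary. The price is that coverage is limited by the dictionary.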

WordNet Lemmatizer

import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data used by the lemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Default is noun
print(lemmatizer.lemmatize("studies"))      # study
print(lemmatizer.lemmatize("running"))      # running (no change for nouns)

# With an explicit POS tag
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good

# Irregular plural, default noun POS
print(lemmatizer.lemmatize("geese"))             # goose

POS-Based Lemmatization

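The WordNet lemmatizer treats every word as a noun unless told otherwise, so supplying the correct POS tag improves accuracy. The Penn Treebank tags returned by nltk.pos_tag must first be mapped to WordNet's POS constants.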

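import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Assumes the NLTK tokenizer, POS tagger, and WordNet data are already
# downloaded via nltk.download(); resource names vary by NLTK version.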

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def pos_lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]

text = "The geese were running faster than better dogs"
print(pos_lemmatize(text))
# Expected output (exact lemmas depend on the tags the POS tagger assigns):
# ['The', 'goose', 'be', 'run', 'fast', 'than', 'good', 'dog']

Stemming vs Lemmatization

Aspect   | Stemming                | Lemmatization
---------|-------------------------|------------------------
Speed    | Fast                    | Slower
Output   | May not be a valid word | Always a valid word
Accuracy | Lower                   | Higher
Approach | Rule-based chopping     | Dictionary lookup
Example  | "studies" -> "studi"    | "studies" -> "study"
Use Case | Search engines, IR      | Text analysis, NLU

When to Use Each

Use stemming when:

  • Speed is critical
  • Working with search/information retrieval
  • Exact word forms don't matter
  • Building a simple baseline
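The search use case can be sketched with a tiny inverted index keyed by stems, so that a query matches documents containing any inflected form of its terms. The stemmer here is a deliberately crude stand-in, and crude_stem, build_index, and search are hypothetical names; in practice you would plug in one of the NLTK stemmers shown earlier.

```python
from collections import defaultdict


def crude_stem(word):
    """Hypothetical, deliberately crude stemmer used only for this sketch."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(crude_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "she jumped over the fence",
    2: "jumping spiders hunt by day",
    3: "the dog sleeps all day",
}
index = build_index(docs)
print(sorted(search(index, "jumps")))          # [1, 2]
print(sorted(search(index, "dogs sleeping")))  # [3]
```

Neither query term appears verbatim in the documents, yet both queries match because documents and queries are reduced to the same stems. Whether the stems are real words is irrelevant here, which is why stemming is often sufficient for retrieval.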

Use lemmatization when:

  • Accuracy matters more than speed
  • Results need to be readable
  • Working with language understanding tasks
  • Domain-specific vocabulary is important
⭐
