Stemming and Lemmatization
Stemming and lemmatization are text normalization techniques that reduce words to their base or root form. They help group related words together, reducing vocabulary size and improving feature consistency.
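As a rough illustration of the vocabulary-size claim, the sketch below counts distinct tokens before and after normalization. The `BASE_FORMS` table and the `normalize` helper are hypothetical stand-ins for a real stemmer or lemmatizer, not part of any library.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
BASE_FORMS = {
    "studies": "study",
    "studied": "study",
    "studying": "study",
    "runs": "run",
    "running": "run",
}

def normalize(token):
    """Map a token to its base form, leaving unknown tokens unchanged."""
    token = token.lower()
    return BASE_FORMS.get(token, token)

tokens = "She studies while he studied and they run runs running".split()
raw_counts = Counter(t.lower() for t in tokens)
normalized_counts = Counter(normalize(t) for t in tokens)

print(len(raw_counts), len(normalized_counts))               # 10 7
print(normalized_counts["study"], normalized_counts["run"])  # 2 3
```

Ten distinct surface forms collapse to seven features, and the counts for "study" and "run" are pooled instead of being spread across variants.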
Stemming
Stemming applies heuristic rules to chop off word endings. It's fast but often produces non-dictionary stems.
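To make the idea of heuristic suffix chopping concrete, here is a minimal sketch of a rule-based stemmer. The rule list, the `MIN_STEM_LENGTH` threshold, and the `crude_stem` name are assumptions made for illustration; real stemmers such as Porter use far more rules and conditions.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # assumption: never strip down to fewer than 3 characters

def crude_stem(word):
    """Apply the first rule that matches and leaves a long enough stem.

    No dictionary is consulted, so the result may not be a real word.
    """
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word

for w in ["studies", "jumping", "jumped", "cats", "sing", "ran"]:
    print(f"{w} -> {crude_stem(w)}")
# studies -> studi
# jumping -> jump
# jumped -> jump
# cats -> cat
# sing -> sing
# ran -> ran
```

The output shows both traits described above: "studi" is not a dictionary word, and the irregular form "ran" is left alone because no suffix rule applies to it.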
Porter Stemmer
One of the most widely used stemming algorithms, published by Martin Porter in 1980.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["studies", "studying", "studied", "study"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# studies -> studi
# studying -> studi
# studied -> studi
# study -> studi
Snowball Stemmer
An improved version of the Porter stemmer. NLTK's SnowballStemmer supports several languages; its English stemmer is also known as Porter2.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
words = ["running", "runs", "ran", "runner"]
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")
# running -> run
# runs -> run
# ran -> ran (irregular form not handled)
# runner -> runner
Lancaster Stemmer
More aggressive than Porter: it often produces shorter stems and is more prone to over-stemming.
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem("studies")) # study
print(stemmer.stem("running")) # run
print(stemmer.stem("happiness")) # happy
Lemmatization
Lemmatization uses vocabulary and morphological analysis to return the dictionary form of a word (lemma).
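The sketch below illustrates the general idea of combining a vocabulary with morphological rules: look up irregular forms first, then accept a suffix rewrite only if the result is a known word. The `VOCABULARY`, `IRREGULAR`, and `DETACHMENT_RULES` tables and the `toy_lemmatize` function are hypothetical and tiny; this is an illustration of the approach under those assumptions, not how any particular library is implemented.

```python
# Hypothetical miniature vocabulary and irregular-form table.
VOCABULARY = {"study", "run", "goose", "dog"}
IRREGULAR = {"geese": "goose", "ran": "run"}

# Suffix rewrites; a candidate is accepted only if it is in VOCABULARY.
DETACHMENT_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("s", "")]

def toy_lemmatize(word):
    """Return a dictionary form if one can be found, else the word unchanged."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCABULARY:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word  # unknown words pass through untouched

for w in ["studies", "studied", "studying", "dogs", "geese", "ran", "blargs"]:
    print(f"{w} -> {toy_lemmatize(w)}")
# studies -> study
# studied -> study
# studying -> study
# dogs -> dog
# geese -> goose
# ran -> run
# blargs -> blargs
```

Because every candidate is checked against the vocabulary, the result is either a known word or the original input, which is the key difference from suffix chopping alone.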
WordNet Lemmatizer
import nltk
from nltk.stem import WordNetLemmatizer
# One-time download of the WordNet data (some NLTK versions also need 'omw-1.4')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# The default part of speech is noun
print(lemmatizer.lemmatize("studies"))  # study
print(lemmatizer.lemmatize("geese"))    # goose
print(lemmatizer.lemmatize("running"))  # running (treated as a noun, so unchanged)
# With an explicit POS tag
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
POS-Based Lemmatization
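Tagging each token with its part of speech and passing that tag to the lemmatizer improves accuracy, because the same surface form can have different lemmas as a noun, verb, or adjective. The tagger emits Penn Treebank tags, which must be mapped to WordNet's POS constants first.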
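import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Requires the tokenizer, tagger, and WordNet data; resource names vary by
# NLTK version (for example 'punkt' and 'averaged_perceptron_tagger').

def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the matching WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # fall back to noun, the lemmatizer's default

def pos_lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]

text = "The geese were running faster than better dogs"
print(pos_lemmatize(text))
# Approximate output; the exact lemmas depend on the tags the tagger assigns
# (for example, 'faster' is only reduced to 'fast' if it is tagged as an adjective):
# ['The', 'goose', 'be', 'run', 'fast', 'than', 'good', 'dog']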
Stemming vs Lemmatization
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slower |
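| Output | May not be a valid word | Dictionary form (unknown words pass through unchanged) |
| Accuracy | Lower | Higher, especially with POS tags |
| Approach | Rule-based suffix stripping | Vocabulary lookup plus morphological analysis |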
| Example | "studies" -> "studi" | "studies" -> "study" |
| Use Case | Search engines, IR | Text analysis, NLU |
When to Use Each
Use stemming when:
- Speed is critical
- Working with search/information retrieval
- Exact word forms don't matter
- Building a simple baseline
Use lemmatization when:
- Accuracy matters more than speed
- Results need to be readable
- Working with language understanding tasks
- Domain-specific vocabulary is important and is covered by the lemmatizer's dictionary
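For the search and information retrieval case, the sketch below shows why exact word forms often do not matter: documents and queries are normalized with the same function, so "jumps" can match "jumping" and "jumped". The `strip_endings`, `build_index`, and `search` helpers are hypothetical names for this illustration, and `strip_endings` is a deliberately crude stand-in for a real stemmer.

```python
from collections import defaultdict

def strip_endings(token):
    """Hypothetical stand-in for a real stemmer: strips a few common endings."""
    token = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def build_index(documents, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query, normalize):
    """Return ids of documents containing every normalized query token."""
    postings = [index.get(normalize(token), set()) for token in query.split()]
    return set.intersection(*postings) if postings else set()

documents = {
    1: "jumping over fences",
    2: "the dog jumped",
    3: "a quiet walk",
}

stemmed_index = build_index(documents, strip_endings)
exact_index = build_index(documents, str.lower)

print(sorted(search(stemmed_index, "jumps", strip_endings)))  # [1, 2]
print(sorted(search(exact_index, "jumps", str.lower)))        # []
```

The stems themselves never reach the user, which is why non-dictionary output is acceptable here. Passing a lemmatizer as `normalize` instead would work the same way, at a higher cost per token.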