Part-of-Speech Tagging
Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. It is a fundamental NLP task whose output feeds downstream steps such as parsing, information extraction, and sentiment analysis.
Common POS Tag Sets
Penn Treebank Tags
| Tag | Description | Example |
|---|---|---|
| NN | Noun, singular or mass | dog, city |
| NNS | Noun, plural | dogs, cities |
| NNP | Proper noun, singular | John, London |
| VB | Verb, base form | run, eat |
| VBD | Verb, past tense | ran, ate |
| VBG | Verb, gerund or present participle | running, eating |
| JJ | Adjective | big, red |
| RB | Adverb | quickly, very |
| DT | Determiner | the, a |
| IN | Preposition or subordinating conjunction | in, on, at |
| PRP | Personal pronoun | I, he, she |
| CC | Coordinating conjunction | and, but, or |
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

# pos_ is the coarse Universal POS tag; tag_ is the fine-grained Penn Treebank tag
for token in doc:
    print(f"{token.text:12} {token.pos_:8} {token.tag_:5} {spacy.explain(token.tag_)}")

# Typical first three columns (the fourth column is the spacy.explain() gloss;
# exact output can vary with the model version):
# The          DET      DT
# quick        ADJ      JJ
# brown        ADJ      JJ
# fox          NOUN     NN
# jumps        VERB     VBZ
# over         ADP      IN
# the          DET      DT
# lazy         ADJ      JJ
# dog          NOUN     NN
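The spaCy example prints two tag columns: a coarse Universal POS tag and a fine-grained Penn Treebank tag. The relationship between the two can be sketched as a lookup table. The mapping below is a partial, simplified illustration covering only the tags in the table above; `PTB_TO_UPOS` and `to_coarse` are hypothetical names, not part of spaCy.

```python
# Hypothetical, partial mapping from Penn Treebank tags to Universal POS tags.
# Simplifications: auxiliary verbs would map to AUX, and subordinating
# uses of IN would map to SCONJ rather than ADP.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "RB": "ADV", "DT": "DET",
    "IN": "ADP", "PRP": "PRON", "CC": "CCONJ",
}

def to_coarse(tagged_tokens):
    """Replace each fine-grained tag with its coarse tag ("X" if unknown)."""
    return [(word, PTB_TO_UPOS.get(tag, "X")) for word, tag in tagged_tokens]

coarse = to_coarse([("the", "DT"), ("dogs", "NNS"), ("ran", "VBD")])
print(coarse)
```

Several fine-grained tags collapse into one coarse tag, which is why the coarse column carries less information than the Penn Treebank column.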
Hidden Markov Model (HMM) Tagger
HMM POS Tagging
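# An HMM tagger treats tags as hidden states and words as observations.
# Transition probability: how likely a tag is, given the previous tag
# P(VB | DT) = C(DT, VB) / C(DT)
# P(NN | DT) = C(DT, NN) / C(DT)
# Emission probability: how likely a word is, given its tag
# P("dog" | NN) = C("dog" as NN) / C(NN)
# C(.) denotes a count over a tagged training corpus.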
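The count-based formulas above can be sketched in a few lines of Python. This is a minimal illustration on a made-up four-sentence corpus, with no smoothing; `estimate_hmm` and the `"<s>"` start marker are hypothetical names chosen for this example. Each transition distribution is normalized by the number of times the tag occurs as a predecessor, so the probabilities for each context sum to 1.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sents):
    """Estimate transition and emission probabilities by relative frequency."""
    trans_counts = defaultdict(Counter)  # trans_counts[prev_tag][tag]
    emit_counts = defaultdict(Counter)   # emit_counts[tag][word]
    for sent in tagged_sents:
        prev = "<s>"  # hypothetical start-of-sentence marker
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word.lower()] += 1
            prev = tag

    def normalize(counts):
        # Divide each count by the total for its context
        return {
            ctx: {key: c / sum(ctr.values()) for key, c in ctr.items()}
            for ctx, ctr in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)

# Toy corpus, made up for illustration
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "DT"), ("dog", "NN"), ("ran", "VBD")],
    [("the", "DT"), ("cat", "NN"), ("slept", "VBD")],
    [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD")],
]
trans_p, emit_p = estimate_hmm(corpus)
print(trans_p["DT"])        # distribution P(tag | DT)
print(emit_p["NN"]["dog"])  # P("dog" | NN)
```

On real data, unseen word and tag combinations get probability zero under this estimate, so practical taggers usually add some form of smoothing.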
Viterbi Algorithm
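The Viterbi algorithm uses dynamic programming to find the most likely tag sequence in O(T · N²) time for T words and N tags, rather than scoring all N^T possible sequences.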
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, tag path) of the most likely state sequence."""
    V = [{}]   # V[t][state]: probability of the best path ending in state at step t
    path = {}  # path[state]: best tag sequence ending in state

    # Initialize with the start and emission probabilities of the first word
    for state in states:
        V[0][state] = start_p[state] * emit_p[state].get(observations[0], 0)
        path[state] = [state]

    # Run Viterbi: extend the best path into each state, one word at a time
    for t in range(1, len(observations)):
        V.append({})
        newpath = {}
        for state in states:
            prob, prev_state = max(
                (V[t-1][prev] * trans_p[prev].get(state, 0) *
                 emit_p[state].get(observations[t], 0), prev)
                for prev in states
            )
            V[t][state] = prob
            newpath[state] = path[prev_state] + [state]
        path = newpath

    # Find best final state
    prob, state = max((V[len(observations)-1][s], s) for s in states)
    return prob, path[state]
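Multiplying many small probabilities can underflow to zero on long sentences, so Viterbi is often run on log probabilities instead. The following is a minimal sketch of that variant on a toy model, not a tuned implementation; `viterbi_log` is a hypothetical name and every probability in the toy model is made up for illustration.

```python
import math

def viterbi_log(observations, states, start_p, trans_p, emit_p):
    """Log-space Viterbi; returns (log probability, best tag path)."""
    def log(p):
        # Treat log(0) as negative infinity so impossible paths always lose
        return math.log(p) if p > 0 else float("-inf")

    # best[state] = (log probability of the best path ending in state, that path)
    best = {
        s: (log(start_p.get(s, 0)) + log(emit_p[s].get(observations[0], 0)), [s])
        for s in states
    }
    for word in observations[1:]:
        new_best = {}
        for s in states:
            emit = log(emit_p[s].get(word, 0))
            # Pick the predecessor that maximizes the path score into s
            score, prev = max(
                (best[p][0] + log(trans_p[p].get(s, 0)) + emit, p) for p in states
            )
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Toy model with made-up probabilities (illustrative only)
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"NN": 0.9, "VB": 0.1},
    "NN": {"VB": 0.7, "NN": 0.2, "DT": 0.1},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.6, "barks": 0.1, "walk": 0.3},
    "VB": {"barks": 0.7, "walk": 0.3},
}

log_prob, tags = viterbi_log(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
print(tags, math.exp(log_prob))
```

In this toy model "barks" can be emitted by both NN and VB; the transition probability P(VB | NN) = 0.7 is what tips the decision toward the verb reading.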
Rule-Based Tagging
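# Simple rule-based tagger: (pattern, tag) pairs, checked in order; the first match wins
rules = [
    (r'\b(is)\b', 'VBZ'),                # forms of "be", each with its own Penn tag
    (r'\b(are)\b', 'VBP'),
    (r'\b(was|were)\b', 'VBD'),
    (r'\b(be)\b', 'VB'),
    (r'\b(been)\b', 'VBN'),
    (r'\b(the|a|an)\b', 'DT'),
    (r'\b\w+ing\b', 'VBG'),              # -ing ending (also matches nouns such as "thing")
    (r'\b\w+ed\b', 'VBD'),               # -ed ending (could also be a past participle, VBN)
    (r'\b\w+ly\b', 'RB'),                # -ly ending
    (r'\b\w+ous\b', 'JJ'),               # -ous ending
    (r'\b[A-Z][a-z]+\b', 'NNP'),         # Capitalized = proper noun (misfires on sentence-initial words)
]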
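A rule list like the one above still needs a loop that applies it. A minimal sketch, assuming whitespace-tokenized input, first-match-wins ordering, and a fallback tag of NN for words no rule covers, could look like this. `RULES` and `rule_tag` are hypothetical names, and the rule set is a shortened variant of the list above.

```python
import re

# Hypothetical (pattern, tag) rules, tried in order; the first match wins
RULES = [
    (re.compile(r'the|a|an', re.IGNORECASE), 'DT'),
    (re.compile(r'\w+ing'), 'VBG'),
    (re.compile(r'\w+ed'), 'VBD'),
    (re.compile(r'\w+ly'), 'RB'),
    (re.compile(r'\w+ous'), 'JJ'),
    (re.compile(r'[A-Z][a-z]+'), 'NNP'),
]

def rule_tag(tokens, default='NN'):
    """Tag each token with the first rule whose pattern matches the whole token."""
    tagged = []
    for token in tokens:
        tag = next((t for pattern, t in RULES if pattern.fullmatch(token)), default)
        tagged.append((token, tag))
    return tagged

result = rule_tag("The famous dog quickly jumped over London".split())
print(result)
```

Here "over" falls through to the default and is tagged NN, which shows the main weakness of suffix rules: they say nothing about closed-class words or context.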
Training a POS Tagger
import nltk
from nltk.tag import UnigramTagger, BigramTagger
from nltk.corpus import treebank
nltk.download('treebank')
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
# Unigram tagger: assigns each word its most frequent tag in the training data
unigram_tagger = UnigramTagger(train_data)
print(f"Unigram accuracy: {unigram_tagger.accuracy(test_data):.3f}")
# Bigram tagger: also conditions on the previous tag; backs off to the unigram tagger for unseen contexts
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(f"Bigram accuracy: {bigram_tagger.accuracy(test_data):.3f}")
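To make the unigram baseline concrete, here is a from-scratch sketch of the same idea in plain Python: remember each word's most frequent training tag and fall back to a default tag for unseen words. This is an illustration on made-up data, not NLTK's implementation; `train_unigram`, `tag_with_backoff`, and `accuracy` are hypothetical helper names. In NLTK itself, a `DefaultTagger` can be passed as the backoff of the unigram tagger for a similar effect.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: ctr.most_common(1)[0][0] for word, ctr in counts.items()}

def tag_with_backoff(model, words, default="NN"):
    """Tag known words from the model; unseen words get the default tag."""
    return [(w, model.get(w, default)) for w in words]

def accuracy(model, gold_sents, default="NN"):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag_with_backoff(model, [w for w, _ in sent], default)
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total if total else 0.0

# Toy data, made up for illustration
train = [
    [("the", "DT"), ("dog", "NN"), ("ran", "VBD")],
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("dogs", "NNS"), ("will", "MD"), ("run", "VB")],
    [("they", "PRP"), ("will", "MD"), ("run", "VB")],
]
test = [
    [("the", "DT"), ("cat", "NN"), ("ran", "VBD")],
    [("cats", "NNS"), ("will", "MD"), ("run", "VB")],
]
model = train_unigram(train)
acc = accuracy(model, test)
print(tag_with_backoff(model, ["the", "cat", "ran"]), f"{acc:.3f}")
```

The unseen word "cats" is mis-tagged as NN by the default, while the ambiguous "run" gets its majority tag VB regardless of context. A bigram tagger addresses the second problem by conditioning on the previous tag.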
POS Tagging Applications
| Application | How POS Helps |
|---|---|
| Lemmatization | The correct lemma depends on POS: "saw" lemmatizes to "see" as a verb but stays "saw" as a noun |
| Named Entity Recognition | Proper nouns (NNP) are strong entity candidates |
| Sentiment Analysis | Adjectives often indicate sentiment |
| Information Extraction | Verbs often signal actions or relations between entities |
| Machine Translation | POS guides word reordering, e.g., adjective-noun order differs between English and French |