Part-of-Speech Tagging

Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. It is a fundamental NLP task: the tags supply syntactic information that parsing, information extraction, and sentiment analysis build on.

Common POS Tag Sets

Penn Treebank Tags

Tag	Description	Example
NN	Noun, singular or mass	dog, city
NNS	Noun, plural	dogs, cities
NNP	Proper noun, singular	John, London
VB	Verb, base form	run, eat
VBD	Verb, past tense	ran, ate
VBG	Verb, gerund or present participle	running, eating
JJ	Adjective	big, red
RB	Adverb	quickly, very
DT	Determiner	the, a
IN	Preposition or subordinating conjunction	in, on, at
PRP	Personal pronoun	I, he, she
CC	Coordinating conjunction	and, but, or

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

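# token.pos_ is the coarse Universal POS tag; token.tag_ is the fine-grained Penn Treebank tag
for token in doc:
    print(f"{token.text:12} {token.pos_:8} {token.tag_:5} {spacy.explain(token.tag_)}")
# Typical output (exact tags can vary with the model version):
# The          DET      DT    determiner
# quick        ADJ      JJ    adjective (English), other noun-modifier (Chinese)
# brown        ADJ      JJ    adjective (English), other noun-modifier (Chinese)
# fox          NOUN     NN    noun, singular or mass
# jumps        VERB     VBZ   verb, 3rd person singular present
# over         ADP      IN    conjunction, subordinating or preposition
# the          DET      DT    determiner
# lazy         ADJ      JJ    adjective (English), other noun-modifier (Chinese)
# dog          NOUN     NN    noun, singular or mass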

Hidden Markov Model (HMM) Tagger

HMM POS Tagging

\hat{t} = \arg\max_t P(t|w) = \arg\max_t P(w|t) \times P(t)
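
The second equality follows from Bayes' rule: the denominator P(w) is the same for every candidate tag sequence, so it can be dropped from the maximization. Under the usual bigram HMM assumptions (each tag depends only on the previous tag, and each word only on its own tag), the two factors decompose as sketched below. The index n for sentence length and the sentence-start marker t_0 are notational assumptions, not taken from the original formula.

```latex
P(t) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad
P(w \mid t) \approx \prod_{i=1}^{n} P(w_i \mid t_i)

\hat{t} = \arg\max_t \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})
```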

# HMM transition example
# P(VB | DT) = C(DT, VB) / C(DT)
# P(NN | DT) = C(DT, NN) / C(DT)

# Emission probability
# P("dog" | NN) = C("dog" as NN) / C(NN)
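
The count ratios above can be computed directly from a tagged corpus. The following is a minimal sketch, assuming the corpus is a list of sentences of (word, tag) pairs; the function name `estimate_hmm` and the toy corpus are hypothetical, and no smoothing is applied.

```python
from collections import Counter

def estimate_hmm(tagged_sents):
    """Estimate transition and emission probabilities by relative frequency."""
    tag_counts = Counter()     # C(tag)
    trans_counts = Counter()   # C(prev_tag, tag)
    emit_counts = Counter()    # C(word as tag)
    for sent in tagged_sents:
        prev = None
        for word, tag in sent:
            tag_counts[tag] += 1
            emit_counts[(tag, word)] += 1
            if prev is not None:
                trans_counts[(prev, tag)] += 1
            prev = tag
    # P(tag | prev) = C(prev, tag) / C(prev), matching the formula above.
    # C(prev) also counts sentence-final tags, so rows may sum to less than 1.
    trans_p = {(p, t): c / tag_counts[p] for (p, t), c in trans_counts.items()}
    # P(word | tag) = C(word as tag) / C(tag)
    emit_p = {(t, w): c / tag_counts[t] for (t, w), c in emit_counts.items()}
    return trans_p, emit_p

# Toy corpus, invented for illustration only
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("big", "JJ"), ("dog", "NN")],
]
trans_p, emit_p = estimate_hmm(corpus)
print(trans_p[("DT", "NN")])   # 3 of the 4 determiners are followed by a noun
print(emit_p[("NN", "dog")])   # 3 of the 4 nouns are "dog"
```

In practice these estimates are usually smoothed so that unseen words and tag pairs do not get probability zero.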

Viterbi Algorithm

The Viterbi algorithm efficiently finds the most likely tag sequence.

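def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, tags) for the most likely tag sequence.

    start_p[s], trans_p[prev][s], and emit_p[s][word] are probabilities;
    missing transitions and emissions are treated as 0.
    """
    V = [{}]   # V[t][s]: probability of the best path ending in state s at step t
    path = {}  # path[s]: best tag sequence found so far that ends in state s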

    # Initialize
    for state in states:
        V[0][state] = start_p[state] * emit_p[state].get(observations[0], 0)
        path[state] = [state]

    # Run Viterbi
    for t in range(1, len(observations)):
        V.append({})
        newpath = {}
        for state in states:
            prob, prev_state = max(
                (V[t-1][prev] * trans_p[prev].get(state, 0) *
                 emit_p[state].get(observations[t], 0), prev)
                for prev in states
            )
            V[t][state] = prob
            newpath[state] = path[prev_state] + [state]
        path = newpath

    # Find best final state
    prob, state = max((V[len(observations)-1][s], s) for s in states)
    return prob, path[state]
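
Multiplying many probabilities can underflow to zero on long sentences, so a common variant adds log-probabilities instead. The sketch below is one such variant rather than the implementation above: the names `viterbi_log` and `safe_log` and the toy probabilities are invented for illustration, and it stores backpointers instead of full paths.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi_log(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, tags) for the most likely tag sequence."""
    if not observations:
        return 0.0, []
    # scores[t][s]: best log-probability of any tag sequence ending in s at step t
    scores = [{s: safe_log(start_p.get(s, 0)) +
                  safe_log(emit_p[s].get(observations[0], 0))
               for s in states}]
    backptr = [{}]  # backptr[t][s]: best predecessor of s at step t
    for t in range(1, len(observations)):
        scores.append({})
        backptr.append({})
        for s in states:
            best_prev = max(
                states,
                key=lambda p: scores[t - 1][p] + safe_log(trans_p[p].get(s, 0)),
            )
            scores[t][s] = (scores[t - 1][best_prev]
                            + safe_log(trans_p[best_prev].get(s, 0))
                            + safe_log(emit_p[s].get(observations[t], 0)))
            backptr[t][s] = best_prev
    # Follow backpointers from the best final state
    last = max(states, key=lambda s: scores[-1][s])
    tags = [last]
    for t in range(len(observations) - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    tags.reverse()
    return scores[-1][last], tags

# Toy model with made-up probabilities
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {"DT": {"NN": 0.9, "VB": 0.1},
           "NN": {"VB": 0.8, "NN": 0.2},
           "VB": {"DT": 0.6, "NN": 0.4}}
emit_p = {"DT": {"the": 1.0},
          "NN": {"dog": 0.6, "runs": 0.4},
          "VB": {"runs": 0.7, "dog": 0.3}}

log_prob, tags = viterbi_log(["the", "dog", "runs"], states, start_p, trans_p, emit_p)
print(tags)  # ['DT', 'NN', 'VB']
```

Keeping one backpointer per state and step avoids copying a path list at every step, at the cost of a final backward pass.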

Rule-Based Tagging

# Simple rule-based tagger
rules = [
    (r'\b(is|are|was|were|be|been)\b', 'VB'),
    (r'\b(the|a|an)\b', 'DT'),
    (r'\b\w+ing\b', 'VBG'),    # -ing ending
    (r'\b\w+ed\b', 'VBD'),     # -ed ending
    (r'\b\w+ly\b', 'RB'),      # -ly ending
    (r'\b\w+ous\b', 'JJ'),     # -ous ending
    (r'\b[A-Z][a-z]+\b', 'NNP'), # Capitalized = proper noun
]
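
The rule list above does not show how the rules are applied. One plausible reading is sketched here: each token is tested against the patterns in order, the first match wins, and unmatched tokens get a default tag. The function `rule_tag`, the `NN` default, the case-insensitive matching for function words, and the use of `fullmatch` per token (which makes the `\b` anchors unnecessary) are all assumptions of this sketch.

```python
import re

# Same rules as above, compiled for whole-token matching
RULES = [
    (re.compile(r'(is|are|was|were|be|been)', re.IGNORECASE), 'VB'),
    (re.compile(r'(the|a|an)', re.IGNORECASE), 'DT'),
    (re.compile(r'\w+ing'), 'VBG'),      # -ing ending
    (re.compile(r'\w+ed'), 'VBD'),       # -ed ending
    (re.compile(r'\w+ly'), 'RB'),        # -ly ending
    (re.compile(r'\w+ous'), 'JJ'),       # -ous ending
    (re.compile(r'[A-Z][a-z]+'), 'NNP'), # Capitalized = proper noun
]

def rule_tag(tokens, default='NN'):
    """Tag each token with the first matching rule, else the default tag."""
    tagged = []
    for tok in tokens:
        for pattern, tag in RULES:
            if pattern.fullmatch(tok):
                tagged.append((tok, tag))
                break
        else:
            # No rule matched: fall back to the default tag
            tagged.append((tok, default))
    return tagged

print(rule_tag("The dog was barking loudly".split()))
# [('The', 'DT'), ('dog', 'NN'), ('was', 'VB'), ('barking', 'VBG'), ('loudly', 'RB')]
```

Rule order matters: the suffix rules are crude, so a noun such as "morning" would be tagged VBG, and any capitalized sentence-initial word not caught earlier falls through to NNP.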

Training a POS Tagger

import nltk
from nltk.tag import UnigramTagger, BigramTagger
from nltk.corpus import treebank

nltk.download('treebank')

train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

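# Unigram tagger: assigns each word its most frequent tag in the training data
# (words not seen in training are tagged None)
unigram_tagger = UnigramTagger(train_data)
print(f"Unigram accuracy: {unigram_tagger.accuracy(test_data):.3f}")

# Bigram tagger: also conditions on the previous tag, and backs off to the
# unigram tagger for contexts it has not seen
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(f"Bigram accuracy: {bigram_tagger.accuracy(test_data):.3f}")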
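
To make the idea behind unigram tagging and backoff concrete, here is a dependency-free sketch. It is a simplification, not NLTK's implementation: `train_unigram`, `tag_with_backoff`, the `NN` default, and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_backoff(tokens, model, default="NN"):
    """Use the unigram model where possible; back off to a default tag otherwise."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Toy training data, invented for illustration
train = [
    [("the", "DT"), ("run", "NN")],
    [("the", "DT"), ("run", "VB")],
    [("they", "PRP"), ("run", "VB")],
]
model = train_unigram(train)
print(tag_with_backoff(["the", "zebra", "run"], model))
# [('the', 'DT'), ('zebra', 'NN'), ('run', 'VB')]
```

The unseen word "zebra" receives the default tag, which is the role a backoff tagger plays at the end of a tagger chain.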

POS Tagging Applications

Application	How POS Helps
Lemmatization	The correct lemma depends on POS ("saw" as a noun vs. the past tense of "see")
Named Entity Recognition	Nouns, especially proper nouns, are likely entity candidates
Sentiment Analysis	Adjectives often indicate sentiment
Information Extraction	Verbs indicate actions/relations
Machine Translation	The order of POS categories differs across languages (e.g. adjective before or after the noun)
