N-gram Models
N-grams are contiguous sequences of n items (words, characters) from a given text. N-gram models use these sequences to estimate the probability of the next word, forming the basis of statistical language modeling.
N-gram Probability
Types of N-grams
| N | Name | Example from "I love NLP" |
|---|---|---|
| 1 | Unigram | "I", "love", "NLP" |
| 2 | Bigram | "I love", "love NLP" |
| 3 | Trigram | "I love NLP" |
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
text = "Natural language processing is powerful"
tokens = word_tokenize(text)
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
Building an N-gram Model
from collections import defaultdict, Counter
class NGramModel:
def __init__(self, n=2):
self.n = n
self.ngram_counts = defaultdict(Counter)
self.context_counts = defaultdict(int)
def train(self, corpus):
for sentence in corpus:
tokens = ['<s>'] * (self.n - 1) + sentence + ['</s>']
for i in range(self.n - 1, len(tokens)):
context = tuple(tokens[i - self.n + 1:i])
word = tokens[i]
self.ngram_counts[context][word] += 1
self.context_counts[context] += 1
def probability(self, word, context):
context = tuple(context) if not isinstance(context, tuple) else context
count = self.ngram_counts[context][word]
total = self.context_counts[context]
if total == 0:
return 0
return count / total
def sentence_probability(self, sentence):
tokens = ['<s>'] * (self.n - 1) + sentence + ['</s>']
prob = 1.0
for i in range(self.n - 1, len(tokens)):
context = tuple(tokens[i - self.n + 1:i])
word = tokens[i]
prob *= self.probability(word, context)
return prob
# Train bigram model
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "sat", "on", "the", "log"],
["the", "cat", "chased", "the", "dog"]
]
model = NGramModel(n=2)
model.train(corpus)
# Query probabilities
print(model.probability("cat", ("the",))) # P(cat | the)
print(model.probability("sat", ("cat",))) # P(sat | cat)
Bigram Probability
Smoothing Techniques
Raw n-gram counts can be zero for unseen combinations. Smoothing handles this by redistributing probability mass.
Laplace Smoothing
def laplace_probability(self, word, context):
context = tuple(context) if not isinstance(context, tuple) else context
count = self.ngram_counts[context][word]
total = self.context_counts[context]
V = len(set(w for counts in self.ngram_counts.values() for w in counts))
return (count + 1) / (total + V)
Perplexity
Perplexity
import math
def perplexity(model, test_corpus):
total_log_prob = 0
total_tokens = 0
for sentence in test_corpus:
tokens = ['<s>'] * (model.n - 1) + sentence + ['</s>']
for i in range(model.n - 1, len(tokens)):
context = tuple(tokens[i - model.n + 1:i])
word = tokens[i]
prob = model.probability(word, context)
if prob > 0:
total_log_prob += math.log2(prob)
else:
total_log_prob += -100 # Penalty for zero prob
total_tokens += 1
return 2 ** (-total_log_prob / total_tokens)
test = [["the", "cat", "sat"]]
print("Perplexity:", perplexity(model, test))
N-gram Model Comparison
| Model | Context Window | Parameters | Pros | Cons |
|---|---|---|---|---|
| Unigram | None | V | Simple | No word order |
| Bigram | 1 word | VΒ² | Captures local order | Limited context |
| Trigram | 2 words | VΒ³ | Better context | Sparse data |
| 4-gram | 3 words | Vβ΄ | Rich context | Very sparse |