N-gram Models

N-grams are contiguous sequences of n items (words, characters) from a given text. N-gram models use these sequences to estimate the probability of the next word, forming the basis of statistical language modeling.

N-gram Probability

$$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})$$
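
As a worked instance of this approximation, take a bigram model (N = 2) and the sentence "I love NLP". The boundary symbols `<s>` and `</s>` are an assumption here, borrowed from the padding convention used in the model code later in this lesson. Under that assumption, the sentence probability factors roughly as:

```latex
P(\text{I love NLP}) \approx
  P(\text{I} \mid \langle s \rangle)\,
  P(\text{love} \mid \text{I})\,
  P(\text{NLP} \mid \text{love})\,
  P(\langle /s \rangle \mid \text{NLP})
```

Each factor conditions on only the single preceding token, instead of the full history.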

Types of N-grams

| N | Name | Example from "I love NLP" |
|---|------|---------------------------|
| 1 | Unigram | "I", "love", "NLP" |
| 2 | Bigram | "I love", "love NLP" |
| 3 | Trigram | "I love NLP" |

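# word_tokenize needs NLTK's tokenizer data; run nltk.download("punkt") once
# (recent NLTK releases ask for "punkt_tab" instead).
from nltk.util import ngrams
from nltk.tokenize import word_tokenize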

text = "Natural language processing is powerful"
tokens = word_tokenize(text)

unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
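
If NLTK is not available, the same windows can be produced with plain Python. The following is a minimal sketch, not NLTK's implementation; the helper name `extract_ngrams` is hypothetical and whitespace splitting stands in for a real tokenizer.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n staggered views of the token list; zip stops at the
    # shortest view, so only complete windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "I love NLP".split()  # naive whitespace tokenization (assumption)
print("Bigrams:", extract_ngrams(tokens, 2))
print("Trigrams:", extract_ngrams(tokens, 3))
```

Asking for an n larger than the number of tokens simply yields an empty list.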

Building an N-gram Model

from collections import defaultdict, Counter

class NGramModel:
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)
        self.context_counts = defaultdict(int)

    def train(self, corpus):
        for sentence in corpus:
            tokens = ['<s>'] * (self.n - 1) + sentence + ['</s>']
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1:i])
                word = tokens[i]
                self.ngram_counts[context][word] += 1
                self.context_counts[context] += 1

    def probability(self, word, context):
        context = tuple(context) if not isinstance(context, tuple) else context
        count = self.ngram_counts[context][word]
        total = self.context_counts[context]
        if total == 0:
            return 0
        return count / total

    def sentence_probability(self, sentence):
        tokens = ['<s>'] * (self.n - 1) + sentence + ['</s>']
        prob = 1.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            word = tokens[i]
            prob *= self.probability(word, context)
        return prob

# Train bigram model
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"]
]

model = NGramModel(n=2)
model.train(corpus)

# Query probabilities
print(model.probability("cat", ("the",)))  # P(cat | the) = 2/6
print(model.probability("sat", ("cat",)))  # P(sat | cat) = 1/2

Bigram Probability

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$
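
To make the count ratio concrete, here is a small sketch that computes it directly with two counters on the toy corpus from the training example above. The function name `bigram_mle` is hypothetical, and the context count is taken as the number of times a word appears as a predecessor, which matches how the model class counts contexts.

```python
from collections import Counter

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

bigram_counts = Counter()   # C(w_{i-1}, w_i)
context_counts = Counter()  # C(w_{i-1}), counted as a predecessor
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_mle(word, prev):
    """Unsmoothed estimate C(prev, word) / C(prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_mle("cat", "the"))  # "the" precedes 6 tokens, 2 of them "cat"
print(bigram_mle("sat", "cat"))  # "cat" precedes 2 tokens, 1 of them "sat"
```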

Smoothing Techniques

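An n-gram that never occurs in the training data has a count of zero, so the unsmoothed estimate assigns it probability zero, and any sentence containing it also gets probability zero. Smoothing addresses this by moving a small amount of probability mass from seen n-grams to unseen ones.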

Laplace Smoothing

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$$

Here V is the vocabulary size, that is, the number of distinct word types the model can predict.

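# Intended as a method of NGramModel; it is attached to the class below.
def laplace_probability(self, word, context):
    context = tuple(context) if not isinstance(context, tuple) else context
    count = self.ngram_counts[context][word]
    total = self.context_counts[context]
    # V = number of distinct word types seen in training (includes </s>)
    V = len(set(w for counts in self.ngram_counts.values() for w in counts))
    return (count + 1) / (total + V)

NGramModel.laplace_probability = laplace_probability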
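
The effect of add-one smoothing is easiest to see on concrete numbers. The sketch below is standalone and rebuilds bigram counts for the toy corpus used earlier; the function name `laplace` is hypothetical. It shows that an unseen bigram receives a small non-zero probability while the distribution for a given context still sums to one.

```python
from collections import Counter

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

bigram_counts = Counter()
context_counts = Counter()
vocab = set()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1
        vocab.add(word)  # every token that can be predicted, including </s>

V = len(vocab)


def laplace(word, prev):
    """Add-one estimate: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)


print("V =", V)
print(laplace("cat", "the"))  # seen bigram: (2 + 1) / (6 + V)
print(laplace("dog", "cat"))  # unseen bigram: (0 + 1) / (2 + V)
print(sum(laplace(w, "the") for w in vocab))  # close to 1, up to float rounding
```

Note how the seen bigram's probability drops from 2/6 to 3/15: with a small corpus and a comparatively large V, add-one smoothing shifts a lot of mass to unseen events.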

Perplexity

$$\mathrm{PPL} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{<i})}$$

import math

def perplexity(model, test_corpus):
    total_log_prob = 0
    total_tokens = 0
    for sentence in test_corpus:
        tokens = ['<s>'] * (model.n - 1) + sentence + ['</s>']
        for i in range(model.n - 1, len(tokens)):
            context = tuple(tokens[i - model.n + 1:i])
            word = tokens[i]
            prob = model.probability(word, context)
            if prob > 0:
                total_log_prob += math.log2(prob)
            else:
                total_log_prob += -100  # ad hoc floor: log2(0) is -inf, so the true perplexity would be infinite
            total_tokens += 1
    return 2 ** (-total_log_prob / total_tokens)

test = [["the", "cat", "sat"]]
print("Perplexity:", perplexity(model, test))
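
To build intuition for the formula itself, independent of any particular model, the sketch below computes perplexity from a plain list of per-token probabilities. The helper `perplexity_from_probs` is hypothetical and assumes every probability is strictly positive.

```python
import math


def perplexity_from_probs(probs):
    """Perplexity given the probability a model assigned to each token."""
    # average log2-probability per token, then exponentiate its negation
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)


# A model that always spreads its mass evenly over 4 choices
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))  # 4.0

# Mixed confidence: the result is the inverse geometric mean of the probabilities
print(perplexity_from_probs([1.0, 1 / 3, 0.5]))
```

The first case shows why perplexity is often read as an effective branching factor: a model that is uniformly unsure among k options has perplexity k.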

N-gram Model Comparison

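| Model | Context Window | Parameters | Pros | Cons |
|-------|----------------|------------|------|------|
| Unigram | None | V | Simple | No word order |
| Bigram | 1 word | V² | Captures local order | Limited context |
| Trigram | 2 words | V³ | Better context | Sparse data |
| 4-gram | 3 words | V⁴ | Rich context | Very sparse |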
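
The Parameters column grows as a power of the vocabulary size, which is the root of the sparsity problem. The following sketch prints the number of possible n-grams for a vocabulary of 10,000 word types; that vocabulary size and the helper name `ngram_parameter_count` are assumptions chosen for illustration.

```python
def ngram_parameter_count(vocab_size, n):
    """Number of distinct n-grams over a vocabulary, i.e. V ** n."""
    return vocab_size ** n


V = 10_000  # illustrative vocabulary size (assumption)
for n, name in enumerate(["Unigram", "Bigram", "Trigram", "4-gram"], start=1):
    print(f"{name}: {ngram_parameter_count(V, n):,}")
```

Most of these possible n-grams never appear in any realistic corpus, which is why higher-order models depend on smoothing.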
⭐

Premium Content

N-gram Models

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert NLP Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement