Tokenization Methods

Tokenization

Tokenization is the process of breaking text into smaller units called tokens. This is one of the most fundamental steps in NLP, as the choice of tokenization strategy directly impacts vocabulary size, model performance, and downstream task accuracy.

Word-Level Tokenization

The simplest approach splits text on whitespace only, so punctuation stays attached to the neighboring word.

# Basic whitespace tokenization
text = "Natural language processing is amazing!"
tokens = text.split()
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'amazing!']

Punctuation-Aware Tokenization

import re

def word_tokenize(text):
    # Split on whitespace and punctuation boundaries
    pattern = r"\b\w+\b|[^\w\s]"
    return re.findall(pattern, text)

text = "Don't stop! NLP's future is bright."
print(word_tokenize(text))
# ['Don', "'", 't', 'stop', '!', 'NLP', "'", 's', 'future', 'is', 'bright', '.']
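
The pattern above treats every apostrophe as a separate punctuation token. If contractions and possessives should stay whole, one possible variant allows apostrophes inside a word. This is a minimal sketch; the function name `word_tokenize_keep_contractions` is hypothetical and not part of any library.

```python
import re

def word_tokenize_keep_contractions(text):
    # A word may contain internal apostrophes (Don't, NLP's);
    # any other non-space, non-word character becomes its own token.
    pattern = r"\w+(?:'\w+)*|[^\w\s]"
    return re.findall(pattern, text)

tokens = word_tokenize_keep_contractions("Don't stop! NLP's future is bright.")
print(tokens)
# ["Don't", 'stop', '!', "NLP's", 'future', 'is', 'bright', '.']
```

Whether "Don't" should be one token, two, or three is a design choice; the NLTK tokenizer in the next section makes a different one.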

Using NLTK

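# Requires NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases use 'punkt_tab' instead)
from nltk.tokenize import word_tokenize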
tokens = word_tokenize("Don't stop. NLP's future is bright.")
print(tokens)
# ['Do', "n't", 'stop', '.', 'NLP', "'s", 'future', 'is', 'bright', '.']

Subword Tokenization

Subword tokenization strikes a balance between word-level and character-level approaches. It handles out-of-vocabulary (OOV) words by breaking them into smaller subword units that are in the vocabulary.
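
To make the OOV problem concrete, here is a small sketch of a word-level vocabulary. The training sentence and the `encode_words` helper are made up for illustration. Any word not seen during training collapses to a single `[UNK]` id, which is the information loss subword methods try to avoid.

```python
# Toy word-level vocabulary built from a tiny, hypothetical training text
train_text = "the cat sat on the mat"
vocab = {"[UNK]": 0}
for word in train_text.split():
    vocab.setdefault(word, len(vocab))

def encode_words(text):
    # Words missing from the vocabulary all map to the [UNK] id
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

ids = encode_words("the dog sat on the mat")
print(ids)
# [1, 0, 3, 4, 1, 5]
```

The unseen word "dog" becomes id 0, indistinguishable from any other unknown word.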

Byte Pair Encoding (BPE)

BPE starts with individual characters and iteratively merges the most frequent pairs.

# BPE conceptually:
# Vocabulary starts: {'a', 'b', 'c', ..., 'z', ' '}
# Iteration 1: Most common pair "t h" -> merge to "th"
# Iteration 2: Most common pair "th e" -> merge to "the"
# Continues until desired vocabulary size reached

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))  # map unknown symbols to [UNK]
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[PAD]", "[UNK]"])
tokenizer.train(["corpus.txt"], trainer)

output = tokenizer.encode("Natural language processing")
print(output.tokens)
# Example output (the actual split depends on corpus.txt):
# ['Natural', 'langu', 'age', 'process', 'ing']
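
The merge loop itself can be sketched in a few lines of plain Python. This is a simplified illustration, not the `tokenizers` implementation: the `learn_bpe` function and the toy word list are assumptions, and details such as end-of-word markers and tie-breaking rules are omitted.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["the", "the", "the", "then", "this"]  # hypothetical toy corpus
merges, corpus = learn_bpe(words, num_merges=2)
print(merges)
# [('t', 'h'), ('th', 'e')]
```

On this toy corpus the first two merges reproduce the "t h" and "th e" steps described in the comments above.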

WordPiece

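WordPiece is the subword algorithm used by BERT. Like BPE, it builds its vocabulary by merging symbols, but it picks the merge that most increases the likelihood of the training data under a language model rather than simply the most frequent pair.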

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unbelievable")
print(tokens)
# e.g. ['un', '##believ', '##able'] (illustrative; the actual split depends on the model's vocabulary)

# The ## prefix indicates a subword continuation
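
At tokenization time, BERT-style WordPiece splits a word by greedy longest-match-first lookup against the vocabulary. The sketch below illustrates that lookup under stated assumptions: `wordpiece_tokenize` is a hypothetical helper, and `toy_vocab` is a hand-made vocabulary, not BERT's.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##believ", "##able", "##a", "play", "##ing"}
print(wordpiece_tokenize("unbelievable", toy_vocab))
# ['un', '##believ', '##able']
print(wordpiece_tokenize("playing", toy_vocab))
# ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))
# ['[UNK]']
```

Note that "##able" wins over "##a" because the longest match is tried first.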

SentencePiece

SentencePiece treats text as a raw stream, working directly on untokenized text. It's language-agnostic and doesn't require pre-tokenization.

import sentencepiece as spm

# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='mymodel',
    vocab_size=32000,  # needs a large corpus; lower this for small files
    model_type='bpe'
)

sp = spm.SentencePieceProcessor(model_file='mymodel.model')
tokens = sp.encode("Hello world!", out_type=str)
print(tokens)
# Example output (depends on the trained model): ['▁Hello', '▁world', '!']
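
The "▁" marker is what makes SentencePiece output reversible: whitespace is encoded as an ordinary symbol, so the original text can be rebuilt from the pieces alone. The sketch below shows only that idea. `to_pieces` and `from_pieces` are hypothetical helpers, and splitting into single characters stands in for a trained model's segmentation.

```python
WS = "\u2581"  # the "▁" marker SentencePiece uses for whitespace

def to_pieces(text):
    # Prefix a marker, replace spaces with markers, then split into
    # characters (a stand-in for a real model's learned segmentation)
    return list(WS + text.replace(" ", WS))

def from_pieces(pieces):
    # Concatenate, turn markers back into spaces, drop the leading space
    return "".join(pieces).replace(WS, " ")[1:]

pieces = to_pieces("Hello world!")
print(from_pieces(pieces) == "Hello world!")
# True
print(from_pieces([WS + "Hello", WS + "world", "!"]))
# Hello world!
```

Because decoding is plain string concatenation, no language-specific rules about where spaces belong are needed.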

Character-Level Tokenization

Each character becomes a token. This yields a small vocabulary but long sequences.

def char_tokenize(text):
    return list(text)

text = "Hello"
print(char_tokenize(text))
# ['H', 'e', 'l', 'l', 'o']
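
To see the small-vocabulary, long-sequence trade-off in numbers, the following sketch builds a character vocabulary for a short made-up string, maps it to ids, and compares the sequence length with a whitespace split. All variable names are illustrative.

```python
text = "hello world"

# Vocabulary: one id per distinct character, in sorted order
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print(len(chars))         # 8 distinct characters in the vocabulary
print(len(ids))           # 11 character tokens
print(len(text.split()))  # 2 word tokens for the same text
print(decoded == text)    # True: the mapping is lossless
```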

Tokenization Comparison

| Method | Vocab Size | Sequence Length | OOV Handling | Examples |
|---|---|---|---|---|
| Word | Large (100K+) | Short | Poor | NLTK, spaCy |
| BPE | Medium (30K-50K) | Medium | Good | GPT, RoBERTa |
| WordPiece | Medium (30K) | Medium | Good | BERT, DistilBERT |
| SentencePiece | Medium (32K) | Medium | Good | T5, XLNet |
| Character | Small (256) | Very long | Perfect | CharCNN |

Subword Vocabulary Size

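The vocabulary size parameter controls the trade-off between sequence length and the number of tokens the model must learn:

# Smaller vocab = more subword splits, longer sequences
# Larger vocab = fewer splits, shorter sequences, more embeddings to learn

# Illustrative splits (actual results depend on the training corpus):
# BPE with vocab_size=1000: "unhappiness" -> ["un", "happi", "ness"]
# BPE with vocab_size=50000: "unhappiness" -> ["unhappiness"]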
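
One cost of a larger vocabulary is the embedding table, which holds one learned vector per token. The arithmetic below is a rough sketch: `embedding_parameters` is a hypothetical helper, and the hidden size of 768 is an assumed value chosen only for illustration.

```python
def embedding_parameters(vocab_size, hidden_size):
    # One learned vector of length hidden_size per vocabulary entry
    return vocab_size * hidden_size

hidden_size = 768  # assumed embedding width, for illustration only
for vocab_size in (1_000, 30_000, 50_000):
    print(vocab_size, embedding_parameters(vocab_size, hidden_size))
# 1000 768000
# 30000 23040000
# 50000 38400000
```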
