Tokenization
Tokenization is the process of breaking text into smaller units called tokens. It is one of the first steps in any NLP pipeline, and the choice of tokenization strategy directly affects vocabulary size, sequence length, and downstream task accuracy.
Word-Level Tokenization
The simplest approach splits text on whitespace alone, which leaves punctuation attached to the neighboring word.
# Basic whitespace tokenization
text = "Natural language processing is amazing!"
tokens = text.split()
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'amazing!']
Punctuation-Aware Tokenization
import re
def word_tokenize(text):
    # Match runs of word characters, or any single punctuation character
    pattern = r"\b\w+\b|[^\w\s]"
    return re.findall(pattern, text)
text = "Don't stop! NLP's future is bright."
print(word_tokenize(text))
# ['Don', "'", 't', 'stop', '!', 'NLP', "'", 's', 'future', 'is', 'bright', '.']
# Note: the apostrophe is treated as punctuation, so contractions and possessives are split apart
Using NLTK
# Requires NLTK's Punkt data: nltk.download('punkt') ('punkt_tab' on recent NLTK releases)
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Don't stop. NLP's future is bright.")
print(tokens)
# ['Do', "n't", 'stop', '.', 'NLP', "'s", 'future', 'is', 'bright', '.']
Subword Tokenization
Subword tokenization strikes a balance between word-level and character-level approaches. It handles out-of-vocabulary (OOV) words by breaking them into known subword units instead of mapping them to a single unknown token.
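To make the OOV problem concrete, here is a minimal sketch of a word-level vocabulary. The tiny corpus, the `[UNK]` token name, and the `encode_words` helper are assumptions made up for illustration.

```python
# Toy word-level vocabulary built from a tiny, made-up corpus.
corpus = ["natural language processing", "language models process text"]

vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        # Assign the next free id the first time a word is seen.
        vocab.setdefault(word, len(vocab))

def encode_words(text):
    """Map each whitespace-separated word to its id, or to [UNK] if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(encode_words("natural language tokenization"))
# [1, 2, 0]  ("tokenization" was never seen, so it collapses to [UNK])
```

All information about the unseen word is lost at this point, which is the gap subword methods are designed to close.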
Byte Pair Encoding (BPE)
BPE starts from a vocabulary of individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new symbol, until the vocabulary reaches the desired size.
# BPE conceptually:
# Vocabulary starts: {'a', 'b', 'c', ..., 'z', ' '}
# Iteration 1: Most common pair "t h" -> merge to "th"
# Iteration 2: Most common pair "th e" -> merge to "the"
# Continues until desired vocabulary size reached
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Train a BPE tokenizer; unk_token names the token emitted for symbols outside the vocabulary
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[PAD]", "[UNK]"])
tokenizer.train(["corpus.txt"], trainer)  # corpus.txt: a plain-text training file
output = tokenizer.encode("Natural language processing")
print(output.tokens)
# Example output (the actual split depends on the training corpus):
# ['Natural', 'langu', 'age', 'process', 'ing']
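The merge loop described above can be sketched in pure Python. This is a simplified illustration rather than how the `tokenizers` library implements training; the `train_bpe` name and the four-word corpus are assumptions chosen so the first two merges mirror the "t h" and "th e" steps.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    word_freqs = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_freqs[tuple(merged)] += freq
        word_freqs = new_freqs
    return merges, word_freqs

merges, segmented = train_bpe(["the", "the", "then", "this"], num_merges=2)
print(merges)
# [('t', 'h'), ('th', 'e')]
```

Each merge adds one symbol to the vocabulary, so the number of merges determines the final vocabulary size.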
WordPiece
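WordPiece is the subword algorithm used by BERT. Like BPE, it builds its vocabulary by merging units, but it picks the merge that most increases the likelihood of the training data rather than simply the most frequent pair.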
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unbelievable")
print(tokens)
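# Illustrative output; the exact pieces depend on the pretrained vocabulary:
# ['un', '##believ', '##able']
# The ## prefix marks a piece that continues the previous one within the same word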
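When applying a trained vocabulary, BERT-style WordPiece segments each word greedily, longest match first. The sketch below shows that lookup step only (not vocabulary training); `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is hand-picked for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix of the remainder is known: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##believ", "##able", "##a", "##b"}
print(wordpiece_tokenize("unbelievable", toy_vocab))
# ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))
# ['[UNK]']
```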
SentencePiece
SentencePiece treats text as a raw stream, working directly on untokenized text. It's language-agnostic and doesn't require pre-tokenization.
import sentencepiece as spm
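# Train a SentencePiece model (writes mymodel.model and mymodel.vocab)
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='mymodel',
    vocab_size=32000,  # must be achievable for the corpus; lower it for small files
    model_type='bpe'
)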
sp = spm.SentencePieceProcessor(model_file='mymodel.model')
tokens = sp.encode("Hello world!", out_type=str)
print(tokens)
# Example output (depends on the trained model); '▁' (U+2581) marks a preceding space:
# ['▁Hello', '▁world', '!']
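SentencePiece can skip pre-tokenization because it treats the space as an ordinary symbol, written as `▁` (U+2581), which makes detokenization lossless. The following is a sketch of that idea only, not the library's implementation; `mark_spaces` and `detokenize` are hypothetical helpers.

```python
SPACE_MARKER = "\u2581"  # "▁", the visible stand-in for a space

def mark_spaces(text):
    """Prefix the text with a marker and replace every space with it."""
    return SPACE_MARKER + text.replace(" ", SPACE_MARKER)

def detokenize(pieces):
    """Concatenate pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(SPACE_MARKER, " ")
    return text[1:]  # drop the space introduced by the leading marker

pieces = ["\u2581Hello", "\u2581world", "!"]
print(mark_spaces("Hello world!") == "".join(pieces))
# True
print(detokenize(pieces))
# Hello world!
```

Because whitespace is carried inside the pieces, the original string can be recovered without any language-specific rules.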
Character-Level Tokenization
Each character becomes a token. This keeps the vocabulary small but makes sequences much longer.
def char_tokenize(text):
    return list(text)
text = "Hello"
print(char_tokenize(text))
# ['H', 'e', 'l', 'l', 'o']
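The length cost is easy to measure. The sketch below compares token counts for one sentence and also shows a byte-level variant, where every id falls in `range(256)`; `byte_encode` is a hypothetical helper added for illustration.

```python
def char_tokenize(text):
    return list(text)

def byte_encode(text):
    """Encode text as UTF-8 byte values, so every id lies in range(256)."""
    return list(text.encode("utf-8"))

sentence = "Natural language processing is amazing!"
print(len(sentence.split()))
# 5
print(len(char_tokenize(sentence)))
# 39
print(byte_encode("Hello"))
# [72, 101, 108, 108, 111]
```

The same sentence needs 5 tokens at the word level but 39 at the character level, which is the trade-off behind the small vocabulary.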
Tokenization Comparison
| Method | Vocab Size | Sequence Length | OOV Handling | Examples |
|---|---|---|---|---|
| Word | Large (100K+) | Short | Poor | NLTK, spaCy |
| BPE | Medium (30K-50K) | Medium | Good | GPT, RoBERTa |
| WordPiece | Medium (30K) | Medium | Good | BERT, DistilBERT |
| SentencePiece | Medium (32K) | Medium | Good | T5, XLNet |
| Character | Small (hundreds; 256 for bytes) | Very long | No OOV | CharCNN |
Subword Vocabulary Size
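Vocabulary size controls how finely words are split, so it trades sequence length against model size:
# Smaller vocab = more subword splits, longer sequences, smaller embedding table
# Larger vocab = fewer splits, shorter sequences, larger embedding table
# Illustrative splits (actual results depend on the training corpus):
# BPE with vocab_size=1000: "unhappiness" -> ["un", "happi", "ness"]
# BPE with vocab_size=50000: "unhappiness" -> ["unhappiness"]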
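One way to see this effect: BPE applies its learned merges in order, so truncating the merge list mimics a smaller vocabulary. In the sketch below the merge list is hand-written for illustration, not learned from a real corpus, and `apply_merges` is a hypothetical helper.

```python
def apply_merges(word, merges):
    """Segment `word` by applying BPE merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list; later merges correspond to a larger vocabulary.
merges = [
    ("u", "n"), ("h", "a"), ("p", "p"), ("ha", "pp"), ("happ", "i"),
    ("n", "e"), ("s", "s"), ("ne", "ss"),
    ("un", "happi"), ("unhappi", "ness"),
]

print(apply_merges("unhappiness", merges[:8]))
# ['un', 'happi', 'ness']
print(apply_merges("unhappiness", merges))
# ['unhappiness']
```

With only the first eight merges the word is split into three pieces; with all ten it becomes a single token.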