Tokenization Deep Dive
Tokenization is the critical first step in NLP pipelines: it converts raw text into the integer token IDs a model consumes. Modern tokenizers use subword algorithms that balance vocabulary size against coverage.
Byte-Pair Encoding (BPE)
BPE iteratively merges the most frequent pair of adjacent symbols.
BPE Algorithm
```python
from collections import Counter


class BPETokenizer:
    def __init__(self, vocab_size=300):
        self.vocab_size = vocab_size
        self.merges = {}  # (left, right) -> merged symbol, in the order learned

    def get_vocab(self, words):
        # Count every symbol currently present in the corpus
        vocab = Counter()
        for symbols in words:
            for s in symbols:
                vocab[s] += 1
        return vocab

    def get_pair_counts(self, words):
        # Count every pair of adjacent symbols
        pairs = Counter()
        for symbols in words:
            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pairs[pair] += 1
        return pairs

    def merge_pair(self, pair, words):
        # Replace each occurrence of `pair` with a single merged symbol
        merged = ''.join(pair)
        new_words = []
        for symbols in words:
            new_symbols = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(merged)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            new_words.append(new_symbols)
        return new_words

    def train(self, corpus):
        # Keep each word as a list of symbols so earlier merges persist
        words = [list(word) + ['</w>'] for word in corpus]
        vocab = self.get_vocab(words)
        num_merges = self.vocab_size - len(vocab)
        for _ in range(num_merges):
            pairs = self.get_pair_counts(words)
            if not pairs:
                break
            best_pair = max(pairs, key=pairs.get)
            words = self.merge_pair(best_pair, words)
            self.merges[best_pair] = ''.join(best_pair)
        return words


# Example usage
corpus = ["low", "low", "low", "low", "low", "lowest", "newer", "newer", "wider", "wider"]
bpe = BPETokenizer(vocab_size=20)
result = bpe.train(corpus)
print(f"Merges learned: {bpe.merges}")
```
WordPiece Tokenization
WordPiece is similar to BPE but uses a different selection criterion: it merges the pair that maximizes the language model likelihood.
WordPiece Merge Score
The pair with the highest score (not highest frequency) is merged.
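To make the difference from BPE concrete, here is a minimal sketch that assumes the commonly cited form of the score, count(ab) / (count(a) × count(b)). The toy corpus and the helper name `wordpiece_scores` are hypothetical, and the `##` continuation prefix is left out for brevity.

```python
from collections import Counter


def wordpiece_scores(words):
    """Score each adjacent pair as count(pair) / (count(left) * count(right))."""
    unit_counts = Counter()
    pair_counts = Counter()
    for symbols in words:
        unit_counts.update(symbols)
        pair_counts.update(zip(symbols, symbols[1:]))
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


# Hypothetical toy corpus, already split into symbols
toy_words = (
    [list("hug")] * 10 + [list("pug")] * 5 + [list("pun")] * 12
    + [list("bun")] * 4 + [list("hugs")] * 5
)

pair_counts = Counter(p for w in toy_words for p in zip(w, w[1:]))
scores = wordpiece_scores(toy_words)

most_frequent = max(pair_counts, key=pair_counts.get)  # what BPE would merge
best_scored = max(scores, key=scores.get)              # what this score prefers
print(most_frequent, best_scored)
```

On this corpus the most frequent pair is `('u', 'g')`, but the score prefers `('g', 's')`: the symbol `s` only ever appears after `g`, so merging the two loses little, whereas `u` occurs in many other contexts.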
```python
# Hugging Face WordPiece usage
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.tokenize("unaffable")
print(tokens)  # ['un', '##aff', '##able']
# The ## prefix indicates a subword continuation

tokens = tokenizer.tokenize("tokenization")
print(tokens)  # ['token', '##ization']
```
WordPiece was used in the original BERT and DistilBERT models. The ## prefix indicates that a token is a continuation of the previous token, not a standalone word.
Unigram Language Model
Unigram starts with a large vocabulary and iteratively removes tokens that contribute least to the language model likelihood.
Unigram Probability
Token Importance (VAI)
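A minimal sketch of both ideas, assuming the usual unigram formulation: a segmentation's probability is the product of its token probabilities, the tokenizer picks the most probable segmentation (here with a small Viterbi search), and a token's importance is how much the best log-likelihood drops when that token is removed. The vocabulary, its probabilities, and the function name `viterbi_segment` are made up for illustration.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (best_tokens, best_log_prob) under a unigram model."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {text!r} with this vocabulary")
    # Walk the back-pointers to recover the tokens
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best[n][0]


# Hypothetical vocabulary; the probabilities are invented and not normalized
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05, "hu": 0.1, "ug": 0.2, "hug": 0.15}
log_probs = {tok: math.log(p) for tok, p in probs.items()}

tokens, score = viterbi_segment("hugs", log_probs)

# Importance of "hug": drop in the best log-likelihood once it is removed
pruned = {tok: lp for tok, lp in log_probs.items() if tok != "hug"}
tokens_pruned, score_pruned = viterbi_segment("hugs", pruned)
loss_increase = score - score_pruned

print(tokens, tokens_pruned, round(loss_increase, 3))
```

With these numbers the best segmentation is `['hug', 's']`; without `hug` it falls back to `['h', 'ug', 's']` at a lower likelihood. A real trainer would sum such losses over the whole corpus and prune the tokens whose removal costs the least.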
```python
from transformers import AutoTokenizer

# Unigram tokenizer (T5, ALBERT)
tokenizer = AutoTokenizer.from_pretrained('t5-small')
tokens = tokenizer.tokenize("Hello, how are you?")
print(tokens)  # ['▁Hello', ',', '▁how', '▁are', '▁you', '?']
# The ▁ character represents a space before the token
```
SentencePiece
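SentencePiece is a language-independent tokenizer library that treats the input as a raw Unicode stream, whitespace included, so it does not depend on language-specific word splitting and suits languages without spaces between words. It is a framework rather than a separate algorithm: it implements both BPE and Unigram.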
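The whitespace handling can be sketched in plain Python. This is not the `sentencepiece` library itself: `to_pieces` and `from_pieces` are hypothetical helpers, and splitting into single characters stands in for a learned model. The point is only that escaping spaces as `▁` (U+2581) keeps detokenization lossless without any language-specific rules.

```python
META = "\u2581"  # "▁", the marker that stands for a space


def to_pieces(text):
    """Escape spaces, add a leading marker, then split (stand-in for a learned model)."""
    escaped = META + text.replace(" ", META)
    return list(escaped)


def from_pieces(pieces):
    """Concatenate pieces, restore spaces, and drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]


for sample in ["Hello world", "\u3053\u3093\u306b\u3061\u306f \u4e16\u754c"]:
    pieces = to_pieces(sample)
    print(pieces, from_pieces(pieces) == sample)
```

Because the space is just another symbol, the same code path handles English and Japanese text, and joining the pieces always recovers the input. The sketch skips the normalization a real model would apply.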
| Feature | BPE | WordPiece | Unigram | SentencePiece |
|---|---|---|---|---|
| Algorithm | Frequency-based merging | Likelihood-based merging | Probabilistic removal | BPE or Unigram on raw text |
| Pre-tokenization | Required | Required | Required | Optional |
| Vocabulary | Fixed size | Fixed size | Pruned to a target size | Fixed size |
| Used by | GPT-2, RoBERTa | BERT, DistilBERT | T5, ALBERT | T5, XLNet, ALBERT |
| Word boundary | Space-based | Space-based | Space-based | Learned |
Hugging Face Tokenizers Library
```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE tokenizer from scratch
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Train on corpus
files = ["data/corpus.txt"]
tokenizer.train(files, trainer)

# Encode text
encoded = tokenizer.encode("Hello, this is a test!")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
```
Tokenization Comparison
| Method | Vocab Size | OOV Handling | Multilingual | Speed |
|---|---|---|---|---|
| Word-level | Large | Poor (UNK) | Poor | Fast |
| Character-level | Small | Excellent | Good | Slow |
| BPE | Medium | Good | Good | Fast |
| WordPiece | Medium | Good | Good | Fast |
| Unigram | Medium | Good | Excellent | Medium |
| SentencePiece | Medium | Good | Excellent | Medium |
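The OOV column can be illustrated with a rough standard-library sketch. The two sentences are hypothetical, and the "vocabularies" are just the sets of words and characters seen in the training text, so this shows the failure modes rather than any real tokenizer.

```python
train_text = "the cat sat on the mat"
test_text = "the caf\u00e9 cat"  # "café" never appears in the training text

word_vocab = set(train_text.split())
char_vocab = set(train_text)

# Word-level: any unseen word collapses to [UNK]
word_tokens = [w if w in word_vocab else "[UNK]" for w in test_text.split()]

# Character-level: only unseen characters are lost, but sequences get long
char_tokens = [c if c in char_vocab else "[UNK]" for c in test_text]

# Byte-level: every input maps to integers in 0..255, so nothing is unknown
byte_tokens = list(test_text.encode("utf-8"))

print(word_tokens)
print(char_tokens.count("[UNK]"), "unknown characters out of", len(char_tokens))
print(len(word_tokens), len(char_tokens), len(byte_tokens))
```

The word-level tokenizer loses the whole word, the character-level one loses only the two unseen characters, and the byte-level one loses nothing at the cost of the longest sequence. Subword methods sit between these extremes.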
BPE Tokenization Example
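Once merges have been learned, tokenizing new text means replaying them in order. The sketch below assumes a hypothetical merge list (roughly what a corpus dominated by "low", "newer", and "wider" might yield); `encode_word` is an illustrative helper, not part of any library.

```python
def encode_word(word, merges):
    """Split a word into characters plus '</w>', then apply merges in learned order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, in the order the merges were learned
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(encode_word("lower", merges))  # ['low', 'er</w>']
print(encode_word("slow", merges))   # ['s', 'low', '</w>']
```

Neither "lower" nor "slow" needs to appear in the training corpus: both are assembled from subwords that were learned from other words, which is how BPE avoids most out-of-vocabulary failures.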
Byte-Level BPE
Byte-level BPE (used in GPT-2, RoBERTa) operates on bytes rather than Unicode characters, enabling true open-vocabulary processing.
```python
# Byte-level encoding
text = "Hello, 你好!"
byte_repr = text.encode('utf-8')
print(f"Bytes: {byte_repr}")
# Bytes: b'Hello, \xe4\xbd\xa0\xe5\xa5\xbd!'

# Each of the 256 possible byte values maps to one printable character
# in the base vocabulary. This ensures ANY text can be tokenized.
```
Byte-level BPE guarantees no out-of-vocabulary tokens, since any byte sequence can be represented. The trade-off is efficiency for non-Latin scripts: characters outside ASCII take two to four bytes in UTF-8, so the same text often needs more tokens.
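The guarantee rests on a base alphabet of exactly 256 symbols. The sketch below builds a byte-to-character table in the style of GPT-2's reference encoder; it is reconstructed for illustration, so treat the details as an assumption rather than the exact upstream code. It then checks that text round-trips and shows the byte cost of non-Latin characters.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control bytes, space, etc. are moved to code points above 255
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}

text = "Hi \u4f60\u597d"  # "Hi 你好"
symbols = [byte_encoder[b] for b in text.encode("utf-8")]
restored = bytes(byte_decoder[ch] for ch in symbols).decode("utf-8")

print(len(text), "characters ->", len(symbols), "byte symbols")
print(restored == text)
```

The five-character string becomes nine byte symbols, because each Chinese character takes three bytes. BPE merges can win some of that back, but only for byte sequences that are frequent in the training data.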