Tokenization Deep Dive

Tokenization is the first step in an NLP pipeline: it converts raw text into the integer token IDs that a model consumes. Modern tokenizers use subword algorithms that balance vocabulary size against coverage of rare and unseen words.

Byte-Pair Encoding (BPE)

BPE starts from individual characters and iteratively merges the most frequent pair of adjacent symbols until the vocabulary reaches a target size.

BPE Algorithm

from collections import Counter

class BPETokenizer:
    def __init__(self, vocab_size=300):
        self.vocab_size = vocab_size
        self.merges = {}  # (left, right) -> merged symbol, in learned order

    @staticmethod
    def to_symbols(corpus):
        # Split each word into characters plus an end-of-word marker
        return [list(word) + ['</w>'] for word in corpus]

    def get_vocab(self, words):
        # Count every symbol currently present in the corpus
        vocab = Counter()
        for symbols in words:
            vocab.update(symbols)
        return vocab

    def get_pair_counts(self, words):
        # Count every pair of adjacent symbols
        pairs = Counter()
        for symbols in words:
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += 1
        return pairs

    def merge_pair(self, pair, words):
        # Replace each occurrence of `pair` with a single merged symbol.
        # Words stay as symbol lists so earlier merges are preserved.
        merged = ''.join(pair)
        new_words = []
        for symbols in words:
            new_symbols = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    new_symbols.append(merged)
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            new_words.append(new_symbols)
        return new_words

    def train(self, corpus):
        words = self.to_symbols(corpus)
        num_merges = self.vocab_size - len(self.get_vocab(words))

        for _ in range(num_merges):
            pairs = self.get_pair_counts(words)
            if not pairs:
                break
            best_pair = max(pairs, key=pairs.get)
            words = self.merge_pair(best_pair, words)
            self.merges[best_pair] = ''.join(best_pair)

        return words  # the corpus as lists of merged symbols

# Example usage
corpus = ["low", "low", "low", "low", "low", "lowest", "newer", "newer", "wider", "wider"]
bpe = BPETokenizer(vocab_size=20)
result = bpe.train(corpus)
print(f"Merges learned: {bpe.merges}")
print(f"Segmented corpus: {result}")
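
Training only produces the merge list; tokenizing a new word means replaying those merges in the order they were learned. The following is a minimal sketch of that step. The function name `apply_merges` and the merge list are hypothetical, chosen for illustration rather than taken from any particular library.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying `merges` (a list of symbol pairs) in learned order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order a trainer might have learned it
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

print(apply_merges("lower", merges))  # ['low', 'er</w>']
print(apply_merges("slow", merges))   # ['s', 'low', '</w>']
```

Note that "lower" never has to appear in the training corpus: it is assembled from pieces learned on other words, which is how subword tokenizers avoid unknown tokens.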

WordPiece Tokenization

WordPiece is similar to BPE but uses a different selection criterion: instead of the most frequent pair, it merges the pair that most increases the likelihood of the training data under the language model.

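WordPiece Merge Score

In the commonly cited formulation, each candidate pair (a, b) is scored as

score(a, b) = count(ab) / (count(a) × count(b))

The pair with the highest score (not the highest frequency) is merged, which favors pairs whose parts rarely occur apart.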
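
The difference between score-based and frequency-based selection is easiest to see on a toy corpus. The sketch below assumes the commonly cited scoring rule, count(pair) / (count(left) * count(right)); the helper name `pair_statistics` and the word frequencies are made up for illustration.

```python
from collections import Counter

def pair_statistics(word_freqs):
    """Return (pair_counts, scores) for words given as symbol tuples with frequencies."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    # Assumed scoring rule: count(pair) / (count(left) * count(right))
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

# Hypothetical corpus: each word is already split into symbols
word_freqs = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

pair_counts, scores = pair_statistics(word_freqs)
print(max(pair_counts, key=pair_counts.get))  # ('##u', '##g'): most frequent, BPE's choice
print(max(scores, key=scores.get))            # ('##g', '##s'): highest score
```

Here `##u` is so common on its own that every pair containing it is penalized, while `##s` only ever appears after `##g`, so that pair scores highest despite being rare.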

# Hugging Face WordPiece usage
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unaffable")
print(tokens)  # ['un', '##aff', '##able']

# The ## prefix indicates a subword continuation
tokens = tokenizer.tokenize("tokenization")
print(tokens)  # ['token', '##ization']

WordPiece was used in the original BERT and DistilBERT models. The ## prefix indicates that a token is a continuation of the previous token, not a standalone word.
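
At inference time, BERT-style WordPiece does not replay merges. It segments each word greedily, always taking the longest vocabulary entry that matches at the current position. A minimal sketch of that longest-match-first loop follows; the function name and the tiny vocabulary are hypothetical, and a real tokenizer would also lowercase, split punctuation, and cap the word length first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, far smaller than a real one
vocab = {"un", "##aff", "##able", "token", "##ization", "##s"}

print(wordpiece_tokenize("unaffable", vocab))      # ['un', '##aff', '##able']
print(wordpiece_tokenize("tokenizations", vocab))  # ['token', '##ization', '##s']
print(wordpiece_tokenize("xyz", vocab))            # ['[UNK]']
```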

Unigram Language Model

Unigram starts with a large vocabulary and iteratively removes tokens that contribute least to the language model likelihood.

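Unigram Probability

A segmentation x = (x_1, ..., x_n) is scored as the product of its token probabilities: P(x) = p(x_1) × p(x_2) × ... × p(x_n).

Token Importance

A token's importance is the drop in corpus likelihood when that token is removed from the vocabulary; the tokens with the smallest drop are pruned first.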

from transformers import AutoTokenizer

# Unigram tokenizer (T5, ALBERT)
tokenizer = AutoTokenizer.from_pretrained('t5-small')
tokens = tokenizer.tokenize("Hello, how are you?")
print(tokens)  # ['▁Hello', ',', '▁how', '▁are', '▁you', '?']

# The ▁ character represents a space before the token
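
Because a Unigram model scores a segmentation as the product of its token probabilities, the best segmentation can be found with dynamic programming (the Viterbi algorithm). The sketch below illustrates the idea on a hand-written probability table; the function name and the probabilities are assumptions for illustration, not values from any trained model.

```python
import math

def viterbi_segment(text, probs):
    """Return the most probable segmentation of `text` and its log-probability."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            token = text[start:end]
            if token in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[token])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best[n][0]

# Hypothetical token probabilities (they sum to 1.0)
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.15, "ug": 0.2, "hug": 0.25}

tokens, log_prob = viterbi_segment("hugs", probs)
print(tokens)              # ['hug', 's']
print(math.exp(log_prob))  # approximately 0.025 = 0.25 * 0.1
```

The same scoring is what drives pruning during training: removing a token changes which segmentations are available, and the resulting drop in total likelihood measures how much that token matters.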

SentencePiece

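SentencePiece is a language-independent tokenization library rather than a separate algorithm: it trains BPE or Unigram models directly on raw Unicode text. Whitespace is treated as an ordinary symbol (written ▁), so no language-specific pre-tokenization is needed and the original text can be recovered exactly from the tokens. This makes it suitable for languages that do not separate words with spaces.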
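
The whitespace handling can be sketched in a few lines of plain Python. This is a simplified illustration, not the SentencePiece implementation: the helper names are hypothetical, and the real library also applies Unicode normalization and has options that change the leading-marker behavior.

```python
META = "\u2581"  # the "▁" marker used in place of whitespace

def to_model_input(text):
    """Replace spaces with the marker and prepend one, so no information is lost."""
    return META + text.replace(" ", META)

def from_pieces(pieces):
    """Invert the transformation: join the pieces and turn markers back into spaces."""
    return "".join(pieces).replace(META, " ")[1:]

text = "Hello, how are you?"
pieces = ["▁Hello", ",", "▁how", "▁are", "▁you", "?"]

print(to_model_input(text))  # ▁Hello,▁how▁are▁you?
print(from_pieces(pieces))   # Hello, how are you?
```

Because spaces survive as symbols inside the tokens, decoding is a plain string operation and needs no language-specific rules about where spaces belong.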

Feature	BPE	WordPiece	Unigram	SentencePiece
Algorithm	Frequency-based merging	Likelihood-based merging	Probabilistic removal	Framework (BPE or Unigram on raw text)
Pre-tokenization	Required	Required	Required	Optional
Vocabulary	Fixed size	Fixed size	Pruned to a target size	Fixed size
Used by	GPT-2, RoBERTa	BERT, DistilBERT	T5, ALBERT	T5, XLNet, ALBERT
Word boundary	Space-based	Space-based	Space-based	Whitespace kept as ▁ symbol

Hugging Face Tokenizers Library

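from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Build a byte-level BPE tokenizer from scratch
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()  # maps byte-level symbols back to text on decode

trainer = trainers.BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    # Seed the vocabulary with all 256 byte symbols so unseen bytes stay encodable
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Train on a plain-text corpus (the path is a placeholder; the file must exist)
files = ["data/corpus.txt"]
tokenizer.train(files, trainer)

# Encode text, then decode it again as a round-trip check
encoded = tokenizer.encode("Hello, this is a test!")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")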

Tokenization Comparison

Method	Vocab Size	OOV Handling	Multilingual	Speed
Word-level	Large	Poor (UNK)	Poor	Fast
Character-level	Small	Excellent	Good	Slow
BPE	Medium	Good	Good	Fast
WordPiece	Medium	Good	Good	Fast
Unigram	Medium	Good	Excellent	Medium
SentencePiece	Medium	Good	Excellent	Medium

Byte-Level BPE

Byte-level BPE (used in GPT-2, RoBERTa) operates on bytes rather than Unicode characters, enabling true open-vocabulary processing.

# Byte-level encoding
text = "Hello, 你好!"
byte_repr = text.encode('utf-8')
print(f"Bytes: {byte_repr}")
# b'Hello, \xe4\xbd\xa0\xe5\xa5\xbd!'

# Each of the 256 possible byte values is mapped to a printable Unicode character,
# so the base vocabulary covers every byte and ANY text can be tokenized

Byte-level BPE guarantees no out-of-vocabulary tokens, since any byte sequence can be represented. The trade-off is efficiency on non-Latin scripts: a character that takes two to four bytes in UTF-8 may be split across several tokens.
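
The byte-to-character mapping can be sketched as follows. It follows the scheme used by GPT-2's reference encoder as far as I know it: bytes that are already printable keep their own character, and the rest are shifted to code points from 256 upward. Treat the details as an illustration rather than a drop-in replacement for a real tokenizer.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that already correspond to printable, non-space characters
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Control bytes, space, and similar are moved to code points 256 and above
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

text = "Hello, 你好!"
visible = "".join(byte_to_char[b] for b in text.encode("utf-8"))
restored = bytes(char_to_byte[c] for c in visible).decode("utf-8")

print(visible)           # Hello,Ġä½łå¥½!
print(len(visible))      # 14 symbols: one per UTF-8 byte
print(restored == text)  # True
```

The space becomes `Ġ`, which is why that character shows up at the start of many GPT-2 tokens, and each Chinese character expands to three symbols before any merges are applied.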
