Tokenization Methods

Tokenization

Tokenization is the process of breaking text into smaller units called tokens. It is one of the first steps in any NLP pipeline, and the choice of tokenization strategy directly affects vocabulary size, sequence length, and downstream task accuracy.

Word-Level Tokenization

The simplest approach splits text on whitespace. Punctuation stays attached to the neighboring word, as the output below shows.

# Basic whitespace tokenization
text = "Natural language processing is amazing!"
tokens = text.split()
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'amazing!']

Punctuation-Aware Tokenization

import re

def word_tokenize(text):
    # Match runs of word characters, or any single punctuation character
    pattern = r"\b\w+\b|[^\w\s]"
    return re.findall(pattern, text)

text = "Don't stop! NLP's future is bright."
print(word_tokenize(text))
# Apostrophes count as punctuation, so contractions and possessives are split:
# ['Don', "'", 't', 'stop', '!', 'NLP', "'", 's', 'future', 'is', 'bright', '.']

Using NLTK

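# Requires the Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Don't stop. NLP's future is bright.")
print(tokens)
# ['Do', "n't", 'stop', '.', 'NLP', "'s", 'future', 'is', 'bright', '.']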

Subword Tokenization

Subword tokenization balances between word-level and character-level approaches. It handles out-of-vocabulary (OOV) words by breaking them into known subword units.

Byte Pair Encoding (BPE)

BPE starts from a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent symbols into a single new symbol.

# BPE conceptually:
# Vocabulary starts: {'a', 'b', 'c', ..., 'z', ' '}
# Iteration 1: Most common pair "t h" -> merge to "th"
# Iteration 2: Most common pair "th e" -> merge to "the"
# Continues until desired vocabulary size reached

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

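# Train a BPE tokenizer; unk_token must match a special token given to the trainer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[PAD]", "[UNK]"])
tokenizer.train(["corpus.txt"], trainer)  # corpus.txt: a plain-text training file

output = tokenizer.encode("Natural language processing")
print(output.tokens)
# Example output (the exact split depends on the training corpus):
# ['Natural', 'langu', 'age', 'process', 'ing']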
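
The merge loop described above can also be sketched in plain Python. This is a minimal, unoptimized illustration rather than the algorithm as implemented in the tokenizers library; the function name learn_bpe and the toy corpus are assumptions made for this example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy sketch, not optimized)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Select the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus, chosen so that "t h" is the most frequent pair
corpus = "the the the then them there other that".split()
merges, vocab = learn_bpe(corpus, num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

Each learned merge adds one symbol to the vocabulary, so the number of merges determines the final vocabulary size.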

WordPiece

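WordPiece is the subword scheme used by BERT. Like BPE, it builds its vocabulary by merging symbols, but instead of taking the most frequent pair it selects the merge that most increases the likelihood of the training data under a language model.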

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unbelievable")
print(tokens)
# Illustrative output (the actual split depends on the pretrained vocabulary):
# ['un', '##believ', '##able']

# The ## prefix indicates a subword continuation
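
At inference time, BERT-style WordPiece segments each word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that lookup only, not the likelihood-based training step; the function name wordpiece_tokenize and the toy vocabulary are assumptions made for this example, not BERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration only
toy_vocab = {"un", "##believ", "##able", "##a", "##ble", "play", "##ing"}
print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```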

SentencePiece

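SentencePiece treats the input as a raw character stream and works directly on untokenized text, encoding each space as the visible symbol ▁. Because it needs no language-specific pre-tokenization, it is language-agnostic and also suits languages that do not separate words with spaces.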

import sentencepiece as spm

# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='mymodel',
    vocab_size=32000,
    model_type='bpe'
)

sp = spm.SentencePieceProcessor(model_file='mymodel.model')
tokens = sp.encode("Hello world!", out_type=str)
print(tokens)
# Example output (the exact pieces depend on the trained model):
# ['▁Hello', '▁world', '!']
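
Because spaces are kept as the ▁ symbol, the original text can be recovered by concatenating the pieces and turning the marker back into a space. The following is a minimal sketch of that idea in plain Python; the helper name detokenize is an assumption for this example, and in practice the library's own decode method would be used.

```python
WS = "\u2581"  # the "▁" marker that stands for a space

def detokenize(pieces):
    """Rebuild the original text from pieces that use the space marker."""
    text = "".join(pieces).replace(WS, " ")
    # Drop the single leading space contributed by the marker on the first piece.
    return text[1:] if text.startswith(" ") else text

pieces = ["▁Hello", "▁world", "!"]
print(detokenize(pieces))  # Hello world!
```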

Character-Level Tokenization

Each character becomes a token. This yields a very small vocabulary and no out-of-vocabulary words, but much longer sequences.

def char_tokenize(text):
    return list(text)

text = "Hello"
print(char_tokenize(text))
# ['H', 'e', 'l', 'l', 'o']
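
To make the trade-off concrete, the short sketch below compares token counts for the same sentence under whitespace and character tokenization, and shows how a byte-level variant lengthens non-ASCII text. The example strings are chosen for illustration.

```python
text = "Natural language processing is amazing!"

word_tokens = text.split()   # whitespace tokens
char_tokens = list(text)     # one token per character, spaces included

print(len(word_tokens))       # 5
print(len(char_tokens))       # 39
print(len(set(char_tokens)))  # 18 distinct characters in this sentence

# Byte-level variant: non-ASCII characters expand to several UTF-8 bytes
word = "café"
print(len(word), len(word.encode("utf-8")))  # 4 5
```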

Tokenization Comparison

Method	Typical Vocab Size	Sequence Length	OOV Handling	Example Tools/Models
Word	Large (100K+)	Short	Poor	NLTK, spaCy
BPE	Medium (30K-50K)	Medium	Good	GPT, RoBERTa
WordPiece	Medium (30K)	Medium	Good	BERT, DistilBERT
SentencePiece	Medium (32K)	Medium	Good	T5, XLNet
Character	Small (256)	Very long	Perfect	CharCNN

Subword Vocabulary Size

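Vocabulary size controls the trade-off between sequence length and the number of token embeddings the model must learn:

# Smaller vocab = more subword splits, longer sequences
# Larger vocab = fewer splits, shorter sequences, but a larger embedding table

# Illustrative splits (actual output depends on the training corpus):
# BPE with vocab_size=1000: "unhappiness" -> ["un", "happi", "ness"]
# BPE with vocab_size=50000: "unhappiness" -> ["unhappiness"]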
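
The effect can be reproduced with a toy merge table. In the sketch below, the function name apply_merges and the hand-written merge list are assumptions for illustration; a real tokenizer learns its merges from a corpus. Keeping only the first few merges stands in for a small vocabulary, and keeping all of them stands in for a large one.

```python
def apply_merges(word, merges):
    """Segment a word by applying BPE merges in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, ordered from highest to lowest priority
merges = [
    ("u", "n"), ("s", "s"), ("n", "e"), ("ne", "ss"),
    ("p", "p"), ("h", "a"), ("ha", "pp"), ("happ", "i"),
    ("un", "happi"), ("unhappi", "ness"),
]

print(len(apply_merges("unhappiness", [])))      # 11 (one token per character)
print(apply_merges("unhappiness", merges[:8]))   # ['un', 'happi', 'ness']
print(apply_merges("unhappiness", merges))       # ['unhappiness']
```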