Deep Learning
BERT — How Google Changed Search with Bidirectional Understanding
Understand how BERT revolutionized NLP by processing text bidirectionally for better comprehension.
- Bidirectional context — understand words from both directions
- Pre-training + fine-tuning — powerful transfer learning paradigm
- Search and Q&A — transformed Google Search and beyond
Understanding context is the key to understanding language.
BERT and Encoder Models — Complete Guide
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) revolutionized NLP by introducing bidirectional pre-training on large unlabeled text, followed by task-specific fine-tuning. At publication it set new state-of-the-art results on eleven NLP tasks, demonstrating that pre-training followed by fine-tuning can outperform task-specific models trained from scratch.
BERT Architecture
Pre-training Objectives
Fine-tuning BERT
BERT for Different Tasks
BERT Output Mapping
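Given input tokens $x_1, \dots, x_n$ (with $x_1 = \text{[CLS]}$), BERT produces hidden states $h_1, \dots, h_n$, where $h_i \in \mathbb{R}^d$ and $h_{\text{[CLS]}} = h_1$.
- Sequence classification: Use $h_{\text{[CLS]}}$ → linear layer → softmax → $P(y \mid x)$
- Token classification: Use each $h_i$ → linear layer per token → $P(y_i \mid x)$
- Extractive QA: Use each $h_i$ → linear → start/end probabilities: $P_{\text{start}}(i) = \text{softmax}_i(w_s^\top h_i)$, $P_{\text{end}}(i) = \text{softmax}_i(w_e^\top h_i)$
- Sentence similarity: Encode both sentences → sentence vectors $u$, $v$ (for example each sentence's $h_{\text{[CLS]}}$) → cosine similarity $\frac{u \cdot v}{\|u\| \, \|v\|}$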
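The four mappings above can be sketched in PyTorch without loading a real model. This is a minimal illustration under stated assumptions: `hidden` is a random stand-in for BERT's final hidden states, position 0 plays the role of [CLS], and the head names and sizes (`cls_head`, `tag_head`, `qa_head`, 3 classes, 5 tags) are hypothetical choices rather than anything prescribed by BERT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, d = 2, 8, 768          # d = 768 matches BERT-base
num_classes, num_tags = 3, 5           # illustrative label counts

# Random stand-in for BERT's final hidden states, shape (batch, seq_len, d).
hidden = torch.randn(batch, seq_len, d)

# Sequence classification: [CLS] vector -> linear -> softmax over classes.
cls_head = nn.Linear(d, num_classes)
seq_probs = F.softmax(cls_head(hidden[:, 0]), dim=-1)       # (batch, num_classes)

# Token classification: the same linear layer applied at every position.
tag_head = nn.Linear(d, num_tags)
tag_probs = F.softmax(tag_head(hidden), dim=-1)             # (batch, seq_len, num_tags)

# Extractive QA: one start score and one end score per token;
# the softmax runs over positions, not over classes.
qa_head = nn.Linear(d, 2)
start_logits, end_logits = qa_head(hidden).unbind(dim=-1)   # each (batch, seq_len)
start_probs = F.softmax(start_logits, dim=-1)
end_probs = F.softmax(end_logits, dim=-1)

# Sentence similarity: cosine similarity between two sentence vectors.
# Here the two batch rows act as the two sentences, pooled via [CLS].
similarity = F.cosine_similarity(hidden[0, 0], hidden[1, 0], dim=0)
```

With a real model, `hidden` would come from the encoder's last layer, and the heads would be trained during fine-tuning rather than left at their random initialization.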
Fine-Tuning BERT
Example: Fine-Tuning BERT for Classification
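from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load the pre-trained tokenizer and encoder; a fresh 2-class head is added on top.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

# texts is assumed to be a list of strings from your own data.
# BERT accepts at most 512 tokens, so longer inputs are truncated.
inputs = tokenizer(
    texts, padding=True, truncation=True,
    max_length=512, return_tensors='pt'
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,            # warm up over the first 10% of steps
    lr_scheduler_type='cosine',
    eval_strategy='epoch'        # older transformers releases name this evaluation_strategy
)

# train_dataset and eval_dataset are assumed to be defined elsewhere: datasets whose
# items hold the tokenized fields (input_ids, attention_mask) plus a 'labels' entry.
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()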
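The example above passes `train_dataset` and `eval_dataset` to the Trainer without showing how they are built. One possible way, sketched here as an assumption rather than the article's own method, is a small PyTorch `Dataset` that pairs tokenizer output with integer labels. The class name `EncodedTextDataset` is hypothetical and the token ids are made up for illustration.

```python
import torch
from torch.utils.data import Dataset


class EncodedTextDataset(Dataset):
    """Hypothetical helper: pairs tokenizer output with integer labels."""

    def __init__(self, encodings, labels):
        # encodings: mapping such as {'input_ids': ..., 'attention_mask': ...},
        # where each value can be indexed by example number.
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.as_tensor(values[idx])
                for key, values in self.encodings.items()}
        # The 'labels' entry lets the model compute its classification loss.
        item['labels'] = torch.tensor(self.labels[idx])
        return item


# Made-up token ids standing in for tokenizer(texts, padding=True, truncation=True).
toy_encodings = {
    'input_ids': [[101, 2023, 102, 0], [101, 2307, 3185, 102]],
    'attention_mask': [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_dataset = EncodedTextDataset(toy_encodings, labels=[0, 1])
```

In practice the encodings would come from the tokenizer call on real texts, and separate instances would be built for the training and evaluation splits.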
Fine-tuning Best Practices
- Learning rate: Use 2e-5 to 5e-5 (lower than the 1e-4 used for BERT pre-training)
- Warmup: 10% of total steps for learning rate warmup
- Epochs: 3-5 epochs (the BERT paper used 2-4); training much longer risks overfitting and catastrophic forgetting of pre-trained knowledge
- Batch size: 16-32 per GPU
- Discriminative learning rates: Lower LR for earlier layers (e.g., 1e-5 for layers 1-6, 2e-5 for layers 7-12)
- Gradual unfreezing: Unfreeze layers one at a time during training (ULMFiT approach)
- Gradient clipping: max_norm=1.0 to prevent exploding gradients
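Two of the practices above, warmup and discriminative learning rates, can be made concrete with plain Python. This is a minimal sketch under assumptions: `lr_at_step` and `layer_lr` are hypothetical helper names, the schedule is linear warmup followed by cosine decay to zero, and parameter names are assumed to follow the Hugging Face pattern `bert.encoder.layer.<index>.…` with zero-based indices (so "layers 1-6" are indices 0-5).

```python
import math
import re


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def layer_lr(param_name, low_lr=1e-5, high_lr=2e-5, split=6):
    """Lower rate for embeddings and the first `split` encoder layers."""
    if 'embeddings' in param_name:
        return low_lr
    match = re.search(r'encoder\.layer\.(\d+)\.', param_name)
    if match is None:
        return high_lr  # pooler and task head get the higher rate
    return low_lr if int(match.group(1)) < split else high_lr


# Learning rate at a few points of a 1000-step run.
schedule_preview = [lr_at_step(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

One way to use `layer_lr` is to build one optimizer parameter group per parameter, for example `[{'params': [p], 'lr': layer_lr(n)} for n, p in model.named_parameters()]`. Gradient clipping is usually applied just before the optimizer step with `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`; when training through the Trainer, the `max_grad_norm` argument covers this.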
Key Takeaways
Summary: BERT
- BERT is bidirectional — understands context from both sides simultaneously
- Pre-training + fine-tuning paradigm: learn general representations, adapt to task
- MLM: Predict masked tokens → forces deep language understanding
- NSP: Predict whether the second sentence follows the first → later shown to be less important (RoBERTa drops it)
- BERT excels at classification and token-level tasks (NER, QA)
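- RoBERTa (optimized training) and DeBERTa (disentangled attention) are stronger successors built on the same encoder design
- DistilBERT for faster inference (retains about 97% of BERT's GLUE performance while being 40% smaller and 60% faster)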
- BERT is encoder-only — no text generation capability
- For text generation, use GPT (decoder-only) or T5 (encoder-decoder)
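The MLM objective summarized above can be illustrated with the corruption scheme described in the BERT paper: about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% stay unchanged. The sketch below is an approximation under assumptions: `mask_tokens` is a hypothetical helper, it works on string tokens rather than ids, and it selects each token independently instead of sampling an exact 15% of positions.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where no prediction is needed."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        # Special tokens are never selected for prediction.
        if token in ('[CLS]', '[SEP]') or rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append('[MASK]')            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels


sentence = ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']
toy_vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ran']
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
```

Because the model sees context on both sides of each selected position, recovering the original token requires using left and right context together, which is what makes the pre-training bidirectional.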
What to Learn Next
-> GPT Architecture: Compare with autoregressive models.
-> Transformers: Master the underlying architecture.
-> NLP Fundamentals: Learn natural language processing basics.
-> Transfer Learning: Apply pre-trained models to new tasks.
-> Pre-training Language Models: Understand how models learn from text.
-> Tokenization for LLMs: Learn how text is converted to tokens.