BERT — How Google Changed Search with Bidirectional Understanding

Understand how BERT revolutionized NLP by processing text bidirectionally for better comprehension.

Bidirectional context: understand words from both directions
Pre-training + fine-tuning: a powerful transfer learning paradigm
Search and Q&A: transformed Google Search and beyond

Understanding context is the key to understanding language.

BERT and Encoder Models — Complete Guide

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) changed NLP by pre-training a bidirectional Transformer encoder on large unlabeled text and then fine-tuning it for each task. On release it set new state-of-the-art results on eleven NLP tasks, showing that pre-training followed by fine-tuning generally beats training task-specific models from scratch.
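To make "bidirectional" concrete, here is a minimal sketch of which positions a token is allowed to attend to in an encoder such as BERT versus a left-to-right decoder. The function name `visible_positions` and the example sentence are illustrative assumptions, not part of any library.

```python
def visible_positions(seq_len, i, bidirectional):
    """Return the token indices that position i may attend to."""
    if bidirectional:
        return list(range(seq_len))   # encoder (BERT-style): every position
    return list(range(i + 1))         # decoder (GPT-style): itself and the left only

# Hypothetical example sentence; "bank" is ambiguous without "river".
tokens = ["the", "bank", "of", "the", "river"]
i = tokens.index("bank")

encoder_view = [tokens[j] for j in visible_positions(len(tokens), i, True)]
decoder_view = [tokens[j] for j in visible_positions(len(tokens), i, False)]

print(encoder_view)  # ['the', 'bank', 'of', 'the', 'river']
print(decoder_view)  # ['the', 'bank']
```

With the full sentence visible, the representation of "bank" can draw on "river", which a strictly left-to-right model cannot see at that position.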

BERT for Different Tasks

BERT Output Mapping

Given input tokens $[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, BERT produces hidden states $[\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n]$, where $\mathbf{h}_i \in \mathbb{R}^{768}$ for BERT-base (1024 for BERT-large).

Sequence classification: Use $[\mathbf{h}_{\text{CLS}}]$ → linear layer → softmax → $\hat{y} \in \mathbb{R}^{|\mathcal{C}|}$
Token classification: Use $[\mathbf{h}_1, \ldots, \mathbf{h}_n]$ → linear layer per token → $\hat{y}_i \in \mathbb{R}^{|\mathcal{C}|}$
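Extractive QA: Use $[\mathbf{h}_1, \ldots, \mathbf{h}_n]$ → linear layer → start/end probabilities, with the softmax taken over token positions: $p_{\text{start}}(i) = \text{softmax}_i(\mathbf{w}_s^\top \mathbf{h}_i)$ and $p_{\text{end}}(i) = \text{softmax}_i(\mathbf{w}_e^\top \mathbf{h}_i)$
Sentence similarity: Encode each sentence separately → take $[\mathbf{h}_{\text{CLS}}]$ from each → cosine similarity between the two vectors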
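The four mappings above can be sketched without loading a model. The block below is a dependency-free illustration under stated assumptions: random vectors stand in for BERT's hidden states, the head weights are untrained, and every name (`W_cls`, `w_start`, `w_end`, `SEQ_LEN`, `NUM_CLASSES`) is hypothetical. It shows only the shapes and operations, not realistic predictions.

```python
import math
import random

random.seed(0)
HIDDEN = 768      # hidden size of BERT-base
NUM_CLASSES = 3   # hypothetical number of labels
SEQ_LEN = 6       # hypothetical sequence length; position 0 plays the role of [CLS]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_vec(n):
    return [random.gauss(0.0, 0.02) for _ in range(n)]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Stand-in for BERT's output: one hidden vector per token.
hidden_states = [rand_vec(HIDDEN) for _ in range(SEQ_LEN)]
h_cls = hidden_states[0]

# Untrained task heads (assumed shapes: |C| x 768 and 768).
W_cls = [rand_vec(HIDDEN) for _ in range(NUM_CLASSES)]
w_start, w_end = rand_vec(HIDDEN), rand_vec(HIDDEN)

# Sequence classification: h_CLS -> linear -> softmax over classes.
seq_probs = softmax([dot(w, h_cls) for w in W_cls])

# Token classification: the same kind of linear layer applied to every token.
token_probs = [softmax([dot(w, h) for w in W_cls]) for h in hidden_states]

# Extractive QA: one score per token, softmax over positions.
p_start = softmax([dot(w_start, h) for h in hidden_states])
p_end = softmax([dot(w_end, h) for h in hidden_states])

# Choose the most probable answer span with start <= end.
best_span = max(
    ((i, j) for i in range(SEQ_LEN) for j in range(i, SEQ_LEN)),
    key=lambda s: p_start[s[0]] * p_end[s[1]],
)

# Sentence similarity: cosine between the [CLS] vectors of two sentences.
other_cls = rand_vec(HIDDEN)  # stand-in for a second sentence's [CLS] vector
similarity = cosine(h_cls, other_cls)
```

Note that the QA softmax normalizes across the sequence, so `p_start` sums to 1 over positions, whereas the classification softmax normalizes across classes.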

Fine-Tuning BERT

Example: Fine-Tuning BERT for Classification

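import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load the pre-trained tokenizer and a model with a fresh 2-class head.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

# Placeholder data; replace with your own texts and integer labels.
texts = ['A wonderful film.', 'A dull, lifeless plot.']
labels = [1, 0]

# Tokenize; BERT accepts at most 512 tokens per sequence.
inputs = tokenizer(
    texts, padding=True, truncation=True,
    max_length=512, return_tensors='pt'
)

class TextDataset(torch.utils.data.Dataset):
    """Pairs tokenizer output with labels in the format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = TextDataset(inputs, labels)
eval_dataset = train_dataset  # use a held-out split in practice

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,            # warm up over the first 10% of steps
    lr_scheduler_type='cosine',
    eval_strategy='epoch'        # called evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()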

Fine-tuning Best Practices

Learning rate: Use 2e-5 to 5e-5 (lower than the 1e-4 used for pre-training)
Warmup: Linear learning rate warmup over the first 10% of total steps
Epochs: 2-4 epochs, the range recommended in the original paper (longer training tends to overfit small datasets and can cause catastrophic forgetting)
Batch size: 16-32 per GPU
Discriminative learning rates: Lower LR for earlier layers (e.g., 1e-5 for layers 1-6, 2e-5 for layers 7-12)
Gradual unfreezing: Unfreeze layers one at a time during training, starting from the top layer (ULMFiT approach)
Gradient clipping: Clip the gradient norm to max_norm=1.0 to prevent exploding gradients
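Three of the practices above reduce to a few lines of arithmetic. The sketch below is illustrative only: `lr_at_step`, `layer_lr`, and `clip_by_global_norm` are hypothetical helper names, the schedule assumes linear warmup followed by cosine decay to zero (one common formulation), and gradients are represented as a flat list of floats rather than real tensors.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def layer_lr(layer, low=1e-5, high=2e-5, split=6):
    """Discriminative LR: layers 1..split get the lower rate, later layers the higher."""
    return low if layer <= split else high

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]

# Example: 3 epochs over a hypothetical 5,000-example dataset at batch size 16.
steps_per_epoch = math.ceil(5000 / 16)
total_steps = 3 * steps_per_epoch
schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]

per_layer = {layer: layer_lr(layer) for layer in range(1, 13)}
clipped = clip_by_global_norm([3.0, 4.0])  # original norm is 5.0
```

In a real training loop these roles are usually filled by the framework's scheduler, optimizer parameter groups, and gradient-clipping utility; the functions here only make the underlying arithmetic visible.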

Key Takeaways

Summary: BERT

BERT is bidirectional: it uses context from both sides of a word simultaneously
Pre-training + fine-tuning paradigm: learn general representations, then adapt to a task
MLM: Predict masked tokens → forces the model to use context on both sides
NSP: Predict whether one sentence follows another → later shown (e.g., by RoBERTa) to matter little
BERT excels at classification and token-level tasks (NER, extractive QA)
RoBERTa (optimized training) and DeBERTa (disentangled attention) are stronger successors
DistilBERT for faster inference (retains about 97% of BERT's performance while being 40% smaller and 60% faster)
BERT is encoder-only: it is not designed for text generation
For text generation, use GPT (decoder-only) or T5 (encoder-decoder)
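As a concrete view of the MLM objective summarized above, here is a minimal sketch of the corruption step described in the BERT paper: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy sentence, and the toy vocabulary are assumptions for illustration; handling of special tokens such as `[CLS]` and `[SEP]` is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (corrupted, labels); labels[i] is None where no prediction is required."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(tok)      # not selected: keep token, no loss here
            labels.append(None)
            continue
        labels.append(tok)             # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: keep the token unchanged
    return corrupted, labels

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "the cat sat on the mat because it was tired".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself
corrupted, labels = mask_tokens(tokens, vocab, rng=rng)
```

Because the prediction target at a selected position can depend on words to its left and right, this objective lets the encoder train bidirectionally without the model trivially seeing the answer.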

What to Learn Next

GPT Architecture: Compare with autoregressive models.
Transformers: Master the underlying architecture.
NLP Fundamentals: Learn natural language processing basics.
Transfer Learning: Apply pre-trained models to new tasks.
Pre-training Language Models: Understand how models learn from text.
Tokenization for LLMs: Learn how text is converted to tokens.
