BERT and Encoder Models — Complete Guide

BERT — How Google Changed Search with Bidirectional Understanding

Understand how BERT revolutionized NLP by processing text bidirectionally for better comprehension.

  • Bidirectional context — understand words from both directions
  • Pre-training + fine-tuning — powerful transfer learning paradigm
  • Search and Q&A: transformed Google Search and question answering

Understanding context is the key to understanding language.

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) revolutionized NLP by introducing bidirectional pre-training on large unlabeled text, followed by task-specific fine-tuning. On release it set new state-of-the-art results on eleven NLP tasks, showing that pre-training followed by fine-tuning can outperform task-specific models trained from scratch.


BERT Architecture

Figure: BERT Architecture (Encoder-Only Transformer)

  • Input tokens: [CLS] The cat sat [MASK] on [SEP]
  • Embeddings: token + position + segment (d = 768)
  • Transformer encoder × 12: Layer 1 → Layer 2 → ... → Layer 12, each with multi-head attention (MHA), a feed-forward network (FFN), and LayerNorm
  • Output: hidden states h₁...hₙ, 768 dimensions each

Fine-tuning tasks:

  • Classification (sentiment, NLI): [CLS] → FC
  • Token-level (NER, POS): hᵢ → FC
  • Span extraction (SQuAD QA): hᵢ → start/end

BERT model variants:

  • BERT-base: 12 layers, 768 hidden, 12 heads, 110M params
  • BERT-large: 24 layers, 1024 hidden, 16 heads, 340M params
  • Improvements: RoBERTa (better training), ALBERT (parameter sharing), DeBERTa (disentangled attention, state of the art on several benchmarks at release)
  • Efficient: DistilBERT (retains about 97% of BERT's performance, 60% faster), TinyBERT (about 7.5× smaller than BERT-base, trained by distillation)
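
The parameter counts quoted for BERT-base and BERT-large can be roughly reproduced from the layer sizes. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes values that are not stated in this lesson: a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4× the hidden size, and a pooler layer on top of the encoder.

```python
def bert_param_count(hidden, layers, ffn, vocab=30522, max_pos=512, segments=2):
    """Count parameters of a BERT encoder plus pooler, without any task head.

    vocab, max_pos, and segments are assumed defaults, not values from the lesson.
    """
    layer_norm = 2 * hidden  # one scale vector and one shift vector
    # Token, position, and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two-layer feed-forward network: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Each encoder layer has two LayerNorms (after attention and after the FFN).
    per_layer = attention + feed_forward + 2 * layer_norm
    # Pooler: one dense layer applied to the [CLS] hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


base = bert_param_count(hidden=768, layers=12, ffn=3072)
large = bert_param_count(hidden=1024, layers=24, ffn=4096)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the count because the heads split the same projection matrices rather than adding new ones.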

Pre-training Objectives

Figure: BERT Pre-training Objectives

Task 1: Masked Language Modeling (MLM)

Input: "The [MASK] sat on the [MASK]"
Target: "The cat sat on the mat"

  • Randomly select 15% of tokens for prediction (only a fraction, so that enough context remains to predict from)
  • Of the selected tokens: 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged (this reduces the pre-train/fine-tune mismatch, since [MASK] never appears during fine-tuning)
  • Loss: cross-entropy over the vocabulary at the selected positions only, which forces the model to use context from both directions

Task 2: Next Sentence Prediction (NSP)

Positive: "[CLS] The cat sat [SEP] It was happy [SEP]" → IsNext
Negative: "[CLS] The cat sat [SEP] The sky is blue [SEP]" → NotNext

  • 50% positive (actual consecutive sentences), 50% negative (random pair)
  • Binary classification on the [CLS] token output
  • Note: RoBERTa found that NSP does not help and dropped it; ALBERT replaced it with sentence order prediction
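
The 80/10/10 masking rule is easier to follow in code. The sketch below is a minimal pure-Python version under stated assumptions: the token IDs 101, 102, and 103 for [CLS], [SEP], and [MASK] and the vocabulary size 30,522 match `bert-base-uncased` but are not given in this lesson, and the label value -100 for ignored positions is a common convention rather than part of the method. `mask_tokens` is an illustrative name, not a library function.

```python
import random

IGNORE = -100  # assumed convention: positions with this label are excluded from the loss


def mask_tokens(token_ids, vocab_size, mask_id, special_ids=(), mask_prob=0.15, rng=random):
    """Return (inputs, labels) following the 80/10/10 masking rule."""
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels


rng = random.Random(0)
ids = [101] + [rng.randrange(1000, 30000) for _ in range(20)] + [102]
masked, labels = mask_tokens(
    ids, vocab_size=30522, mask_id=103, special_ids={101, 102}, rng=rng
)
print(masked)
print(labels)
```

Only positions whose label is not `IGNORE` would contribute to the cross-entropy loss, which is what "masked positions only" means above.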

Fine-tuning BERT

Figure: BERT Fine-tuning Process

  1. Pre-trained BERT: 12-24 layers, 110-340M params. Learned: syntax, semantics, world knowledge.
  2. Add task head: classification (FC layer), NER (FC per token), QA (start/end FC). The head adds few new parameters relative to the encoder: 768 × C + C for a single linear layer with C outputs on BERT-base.
  3. Fine-tune: all params updated, small LR (2e-5 to 5e-5), 3-5 epochs, hours on a single GPU.
  4. Fine-tuned model: task-specific BERT + head, ready for inference.

Key insight: Pre-training learns general representations → fine-tuning adapts them to a specific task. With this transfer learning paradigm, BERT achieved state-of-the-art results on 11 NLP tasks.
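
To get a feel for how small a task head is next to the pre-trained encoder, the following sketch counts the parameters of a single fully connected layer on top of BERT-base. It assumes each head is exactly one linear layer; the 9-tag NER label set is a hypothetical example, and `linear_head_params` is an illustrative helper name.

```python
def linear_head_params(hidden_size, num_outputs):
    """Weights plus biases of one fully connected layer."""
    return hidden_size * num_outputs + num_outputs


ENCODER_PARAMS = 110_000_000  # BERT-base size as quoted in this lesson

heads = {
    "sentiment classification (2 classes)": linear_head_params(768, 2),
    "NER (9 tags, hypothetical tag set)": linear_head_params(768, 9),
    "extractive QA (start and end logits)": linear_head_params(768, 2),
}
for task, count in heads.items():
    share = count / ENCODER_PARAMS
    print(f"{task}: {count} new params ({share:.5%} of the encoder)")
```

Heads with an extra hidden layer are larger, but in every case almost all of the parameters updated during fine-tuning come from pre-training.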

BERT for Different Tasks

BERT Output Mapping

Given input tokens $[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, BERT produces hidden states $[\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n]$ where $\mathbf{h}_i \in \mathbb{R}^{768}$.

  • Sequence classification: Use $\mathbf{h}_{\text{CLS}}$ → linear layer → softmax → $\hat{y} \in \mathbb{R}^{|\mathcal{C}|}$
  • Token classification: Use $[\mathbf{h}_1, \ldots, \mathbf{h}_n]$ → linear layer per token → $\hat{y}_i \in \mathbb{R}^{|\mathcal{C}|}$
  • Extractive QA: Use $[\mathbf{h}_1, \ldots, \mathbf{h}_n]$ → linear → start/end probabilities, with the softmax taken over token positions: $p_{\text{start}}(i) = \operatorname{softmax}_i(\mathbf{w}_s^\top \mathbf{h}_i)$, and likewise $p_{\text{end}}(i)$ with a separate vector $\mathbf{w}_e$
  • Sentence similarity: Encode each sentence separately → take $\mathbf{h}_{\text{CLS}}$ from each → cosine similarity (this works best after fine-tuning for similarity, since raw [CLS] vectors are weak sentence embeddings)

Fine-Tuning BERT

Example: Fine-Tuning BERT for Classification

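import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Toy data for illustration only; replace with your own texts and integer labels.
texts = ["great movie", "terrible plot", "loved it", "not worth watching"]
labels = [1, 0, 1, 0]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

# BERT accepts at most 512 tokens; longer inputs are truncated.
inputs = tokenizer(
    texts, padding=True, truncation=True,
    max_length=512, return_tensors='pt'
)

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized inputs and labels in the format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: value[idx] for key, value in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# The toy set is reused for evaluation only to keep the example short.
train_dataset = TextDataset(inputs, labels)
eval_dataset = TextDataset(inputs, labels)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    eval_strategy='epoch'  # called evaluation_strategy in older transformers releases
)

trainer = Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()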

Fine-tuning Best Practices

  1. Learning rate: Use 2e-5 to 5e-5 (much lower than pre-training)
  2. Warmup: 10% of total steps for learning rate warmup
  3. Epochs: 3-5 epochs (more tends to overfit small datasets and can overwrite pre-trained knowledge)
  4. Batch size: 16-32 per GPU
  5. Discriminative learning rates: Lower LR for earlier layers (e.g., 1e-5 for layers 1-6, 2e-5 for layers 7-12)
  6. Gradual unfreezing: Unfreeze layers one at a time during training, starting from the top layer (ULMFiT approach)
  7. Gradient clipping: max_norm=1.0 to prevent exploding gradients
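
Three of the practices above reduce to simple arithmetic: warmup, discriminative learning rates, and gradient clipping. The sketch below is a framework-free illustration, not how any particular training library implements them. It assumes a cosine decay to zero after warmup (matching the scheduler type in the training example), and the function names are illustrative.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def layer_lr(layer, num_layers=12, low=1e-5, high=2e-5):
    """Lower rate for the bottom half of the encoder, higher for the top half."""
    return low if layer <= num_layers // 2 else high


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
print([layer_lr(layer) for layer in range(1, 13)])
print(clip_by_global_norm([3.0, 4.0]))
```

In a real training loop these values would be handed to the optimizer as per-layer parameter groups and a scheduler; the sketch only shows the numbers involved.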

Key Takeaways

Summary: BERT

  • BERT is bidirectional — understands context from both sides simultaneously
  • Pre-training + fine-tuning paradigm: learn general representations, adapt to task
  • MLM: Predict masked tokens → forces deep language understanding
  • NSP: Predict sentence adjacency → later work (RoBERTa) found it contributes little
  • BERT excels at classification and token-level tasks (NER, QA)
  • RoBERTa (optimized training) and DeBERTa (disentangled attention) improve on BERT's benchmark results
  • DistilBERT for faster inference (retains about 97% of BERT's performance, 60% faster)
  • BERT is encoder-only, so it is not designed for text generation
  • For text generation, use GPT (decoder-only) or T5 (encoder-decoder)

