NLP Datasets and Benchmarks

Standard benchmarks enable fair comparison between models and track progress in NLP research. Understanding benchmark design is essential for proper model evaluation.

Benchmark Evolution


GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark tests language understanding across 9 tasks.

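| Task | Type | Train Size | Dev Size | Metric |
|------|------|------------|----------|--------|
| CoLA | Acceptability | 8.5K | 1K | Matthews Correlation |
| SST-2 | Sentiment | 67K | 872 | Accuracy |
| MRPC | Paraphrase | 3.7K | 408 | F1 / Accuracy |
| QQP | Paraphrase | 364K | 40K | F1 / Accuracy |
| STS-B | Similarity | 5.7K | 1.5K | Pearson / Spearman |
| MNLI | NLI | 393K | 20K (matched + mismatched) | Accuracy |
| QNLI | NLI | 105K | 5.5K | Accuracy |
| RTE | NLI | 2.5K | 277 | Accuracy |
| WNLI | Coreference | 635 | 71 | Accuracy |
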
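
CoLA is scored with Matthews correlation rather than accuracy, which is a sensible choice when class labels are imbalanced. The sketch below is a minimal pure-Python illustration, not the implementation used by any GLUE tooling; the toy labels and the helper names `matthews_corrcoef` and `accuracy` are made up for this example.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data (hypothetical): 8 acceptable sentences, 2 unacceptable
labels = [1] * 8 + [0] * 2
majority = [1] * 10                 # always predicts "acceptable"
mixed = [1] * 7 + [0] + [0, 1]      # one error in each class

print(accuracy(labels, majority), matthews_corrcoef(labels, majority))  # 0.8 0.0
print(accuracy(labels, mixed), matthews_corrcoef(labels, mixed))        # 0.8 0.375
```

Both prediction sets reach 80% accuracy, but only the second one gets credit under Matthews correlation, because the majority-class predictor carries no information about the minority class.
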
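from datasets import load_dataset
import evaluate

# Metric keys returned by evaluate.load("glue", task_name) for each task
GLUE_TASK_METRICS = {
    "cola": ["matthews_correlation"],
    "sst2": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qqp": ["accuracy", "f1"],
    "stsb": ["pearson", "spearmanr"],
    "mnli": ["accuracy"],
    "qnli": ["accuracy"],
    "rte": ["accuracy"],
    "wnli": ["accuracy"],
}

def evaluate_glue(model, tokenizer, task_name):
    """Return a compute_metrics function for a GLUE task.

    The returned function takes (logits, labels) as NumPy arrays, so it can be
    passed to a Hugging Face Trainer. model and tokenizer are not used here.
    """
    if task_name not in GLUE_TASK_METRICS:
        raise ValueError(f"Unknown GLUE task: {task_name}")

    # Dataset splits for this task (the evaluation loop itself is left to the caller)
    dataset = load_dataset("glue", task_name)

    metric = evaluate.load("glue", task_name)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        if task_name == "stsb":
            # STS-B is a regression task: use the single output value, not argmax
            predictions = logits.reshape(-1)
        else:
            predictions = logits.argmax(axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    return compute_metrics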

# Approximate GLUE scores for reference (fractions in [0, 1]); verify against the original papers before citing
glue_baseline_scores = {
    "CoLA": {"BERT-base": 0.521, "BERT-large": 0.587, "RoBERTa-large": 0.680},
    "SST-2": {"BERT-base": 0.929, "BERT-large": 0.935, "RoBERTa-large": 0.964},
    "MRPC": {"BERT-base": 0.889, "BERT-large": 0.893, "RoBERTa-large": 0.909},
    "QQP": {"BERT-base": 0.913, "BERT-large": 0.918, "RoBERTa-large": 0.922},
    "STSB": {"BERT-base": 0.858, "BERT-large": 0.865, "RoBERTa-large": 0.923},
    "MNLI-m": {"BERT-base": 0.841, "BERT-large": 0.858, "RoBERTa-large": 0.902},
    "QNLI": {"BERT-base": 0.905, "BERT-large": 0.918, "RoBERTa-large": 0.939},
    "RTE": {"BERT-base": 0.664, "BERT-large": 0.702, "RoBERTa-large": 0.882},
    "WNLI": {"BERT-base": 0.563, "BERT-large": 0.563, "RoBERTa-large": 0.913},
}

SuperGLUE

SuperGLUE is a successor to GLUE with harder tasks, introduced after models matched or exceeded human baselines on GLUE.

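| Task | Type | Metric | Human Baseline | Best Model |
|------|------|--------|----------------|------------|
| BoolQ | QA | Accuracy | 89.0 | 91.1 |
| CB | NLI | F1 / Avg | 90.9 | 93.9 |
| COPA | Reasoning | Accuracy | 100 | 98.0 |
| MultiRC | QA | F1a | 74.8 | 74.8 |
| ReCoRD | QA | F1 | 93.4 | 93.4 |
| RTE | NLI | Accuracy | 92.8 | 92.5 |
| WiC | Word Sense | Accuracy | 80.0 | 75.6 |
| WSC | Coreference | Accuracy | 100 | 96.6 |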

MMLU (Massive Multitask Language Understanding)

MMLU tests knowledge across 57 subjects from elementary to professional level.

Definition: MMLU Accuracy

$$\text{MMLU} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{1}{|D_t|} \sum_{i=1}^{|D_t|} \mathbb{1}[\hat{y}_i = y_i]$$

where $\mathcal{T}$ is the set of 57 tasks and $D_t$ is the test set for task $t$.
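
The definition above is a macro average: each task's accuracy is computed first, then the tasks are averaged with equal weight. Pooling all questions into one accuracy (a micro average) can give a different number when tasks differ in size. The following minimal sketch contrasts the two; the task names and counts are invented for illustration.

```python
def macro_accuracy(per_task):
    """Mean of per-task accuracies; per_task maps task -> (correct, total)."""
    accs = [correct / total for correct, total in per_task.values() if total > 0]
    return sum(accs) / len(accs)

def micro_accuracy(per_task):
    """Accuracy over all questions pooled together, so larger tasks weigh more."""
    correct = sum(c for c, _ in per_task.values())
    total = sum(n for _, n in per_task.values())
    return correct / total

# Hypothetical counts: one large, easy task and one small, hard task
toy_counts = {"task_a": (90, 100), "task_b": (5, 10)}

print(f"macro: {macro_accuracy(toy_counts):.3f}")  # macro: 0.700
print(f"micro: {micro_accuracy(toy_counts):.3f}")  # micro: 0.864
```

When reporting an MMLU number, it helps to state which of the two averages was used.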

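| Category | Subjects | GPT-4 | Claude-3 | Gemini-Pro |
|----------|----------|-------|----------|------------|
| STEM | Math, Physics, CS | 82.3 | 78.1 | 76.5 |
| Humanities | History, Philosophy, Law | 86.7 | 83.2 | 80.1 |
| Social Sciences | Econ, Psychology, Politics | 88.4 | 84.6 | 82.3 |
| Other | Business, Health, Misc | 85.1 | 80.9 | 79.8 |
| Overall | 57 subjects | 86.4 | 82.2 | 79.6 |
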
from datasets import load_dataset
import numpy as np

CHOICE_LABELS = ["A", "B", "C", "D"]

def evaluate_mmlu(model, tokenizer, subjects=None):
    """Evaluate model on MMLU benchmark.

    Assumes model.generate(prompt, choices=...) is a user-supplied wrapper that
    returns one of the choice letters; it is not the Hugging Face generate()
    signature. tokenizer is unused here and kept for a uniform interface.
    """
    test_set = load_dataset("cais/mmlu", "all")["test"]

    if subjects is None:
        subjects = sorted(set(test_set["subject"]))

    results = {}
    for subject in subjects:
        subject_data = test_set.filter(lambda x: x["subject"] == subject)

        correct = 0
        total = 0
        for example in subject_data:
            prompt = format_mmlu_prompt(example)
            prediction = model.generate(prompt, choices=CHOICE_LABELS)
            # "answer" is an integer index (0-3) in cais/mmlu, so map it to a letter
            if prediction == CHOICE_LABELS[example["answer"]]:
                correct += 1
            total += 1

        accuracy = correct / total if total > 0 else 0.0
        results[subject] = {"accuracy": accuracy, "correct": correct, "total": total}

    # Macro average: mean of per-subject accuracies, matching the MMLU definition
    macro = float(np.mean([r["accuracy"] for r in results.values()])) if results else 0.0

    # Micro average: pooled over all questions, so larger subjects weigh more
    total_correct = sum(r["correct"] for r in results.values())
    total_samples = sum(r["total"] for r in results.values())
    results["overall"] = {
        "accuracy": total_correct / total_samples if total_samples > 0 else 0.0,
        "macro_accuracy": macro,
        "correct": total_correct,
        "total": total_samples,
    }

    return results

def format_mmlu_prompt(example):
    """Format MMLU example as multiple choice prompt."""
    choice_text = "\n".join(
        f"({CHOICE_LABELS[i]}) {example['choices'][i]}" for i in range(len(example["choices"]))
    )
    return f"{example['question']}\n{choice_text}\nAnswer:"

Other Important Benchmarks

| Benchmark | Focus | Tasks | Metric |
|-----------|-------|-------|--------|
| HellaSwag | Common Sense | Sentence completion | Accuracy |
| WinoGrande | Coreference | Pronoun resolution | Accuracy |
| ARC | Science Reasoning | Multiple choice | Accuracy |
| TriviaQA | Knowledge QA | Open-domain QA | F1 / EM |
| HumanEval | Code Generation | Python functions | Pass@k |
| GSM8K | Math Reasoning | Grade school math | Accuracy |
| TruthfulQA | Truthfulness | Question answering | % Truthful |
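
Pass@k, the HumanEval metric listed above, is less self-explanatory than accuracy. It is usually computed with an unbiased estimator: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; the function name `pass_at_k` and the per-problem counts are illustrative assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass the tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) counts for three problems
per_problem = [(10, 3), (10, 0), (10, 10)]

# The benchmark score is the mean of the per-problem estimates
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {pass_at_1:.3f}")  # pass@1 = 0.433
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass.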

Benchmark Contamination

Definition: Contamination Detection

A benchmark is contaminated if training data overlaps with test data. Detection metric:

$$\text{Contamination Rate} = \frac{|\{x \in D_{\text{test}} : \text{overlap}(x, D_{\text{train}}) > \tau\}|}{|D_{\text{test}}|}$$

Contaminated benchmarks overestimate model performance. N-gram overlap detection with $\tau = 0.8$ is commonly used.

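import random

class ContaminationDetector:
    def __init__(self, ngram_size=13, threshold=0.8):
        self.ngram_size = ngram_size
        self.threshold = threshold

    def extract_ngrams(self, text):
        """Return the set of lowercased, whitespace-tokenized n-grams in text."""
        tokens = text.lower().split()
        n = self.ngram_size
        # Texts with fewer than n tokens produce an empty set
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def _max_overlap(self, test_ngrams, train_ngram_sets):
        """Largest fraction of test n-grams found in any single training text."""
        if not test_ngrams:
            return 0.0
        return max(
            (len(test_ngrams & train) / len(test_ngrams) for train in train_ngram_sets),
            default=0.0,
        )

    def check_contamination(self, test_text, train_texts):
        test_ngrams = self.extract_ngrams(test_text)
        train_ngram_sets = (self.extract_ngrams(t) for t in train_texts)
        max_overlap = self._max_overlap(test_ngrams, train_ngram_sets)

        return {
            "contaminated": max_overlap > self.threshold,
            "overlap_ratio": max_overlap,
        }

    def scan_dataset(self, test_dataset, train_dataset, sample_size=1000, seed=0):
        # Sample training examples by index; works for lists of dicts and
        # for Hugging Face Datasets, which have no .sample() method
        n_train = len(train_dataset)
        indices = random.Random(seed).sample(range(n_train), min(sample_size, n_train))
        # Extract training n-grams once instead of once per test example
        train_ngram_sets = [self.extract_ngrams(train_dataset[i]["text"]) for i in indices]

        results = []
        for example in test_dataset:
            overlap = self._max_overlap(self.extract_ngrams(example["text"]), train_ngram_sets)
            results.append({"contaminated": overlap > self.threshold, "overlap_ratio": overlap})

        contamination_rate = (
            sum(r["contaminated"] for r in results) / len(results) if results else 0.0
        )
        return {
            "contamination_rate": contamination_rate,
            "details": results,
        }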
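
To make the contamination-rate formula concrete, here is a small standalone sketch on toy sentences. It is a simplified illustration under stated assumptions rather than a production detector: it uses trigrams (n = 3) instead of 13-grams so that short sentences produce n-grams at all, and the function names and example texts are invented.

```python
def ngrams(text, n):
    """Set of lowercased, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_text, train_text, n):
    """Fraction of the test text's n-grams that also occur in the training text."""
    test = ngrams(test_text, n)
    if not test:
        return 0.0
    return len(test & ngrams(train_text, n)) / len(test)

def contamination_rate(test_texts, train_texts, n=3, tau=0.8):
    """Share of test texts whose best overlap with any training text exceeds tau."""
    flagged = 0
    for test_text in test_texts:
        best = max((overlap_ratio(test_text, t, n) for t in train_texts), default=0.0)
        if best > tau:
            flagged += 1
    return flagged / len(test_texts)

train_texts = ["the quick brown fox jumps over the lazy dog near the river bank"]
test_texts = [
    "The quick brown fox jumps over the lazy dog",     # copied from the training text
    "a slow green turtle walks under the bright sun",  # shares no trigram with it
]

print(contamination_rate(test_texts, train_texts))  # 0.5
```

The first test sentence has all seven of its trigrams in the training text (overlap 1.0), the second has none, so one of two test items exceeds the threshold.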

Best Practices

  1. Use multiple benchmarks - No single benchmark captures all capabilities
  2. Check for contamination - Verify training data doesn't overlap with test sets
  3. Report per-category results - Aggregate scores hide strengths/weaknesses
  4. Include baselines - Compare against well-known reference models
  5. Consider task difficulty - Some benchmarks are nearly saturated

Key Takeaways

  • GLUE established the multi-task benchmark paradigm for language understanding
  • SuperGLUE pushed model development with more challenging tasks
  • MMLU tests broad knowledge across academic and professional domains
  • Benchmark contamination can inflate reported performance significantly
  • Human evaluation remains necessary as benchmarks approach saturation