NLP Datasets and Benchmarks

Standard benchmarks enable fair comparison between models and track progress in NLP research. Understanding benchmark design is essential for proper model evaluation.

Benchmark Evolution

GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark tests language understanding across 9 tasks.

Task	Type	Train Size	Dev Size	Metric
CoLA	Acceptability	8.5K	1K	Matthews Correlation
SST-2	Sentiment	67K	872	Accuracy
MRPC	Paraphrase	3.7K	408	F1 / Accuracy
QQP	Paraphrase	364K	40K	F1 / Accuracy
STS-B	Similarity	5.7K	1.5K	Pearson / Spearman
MNLI	NLI	393K	20K	Accuracy
QNLI	NLI	105K	5.5K	Accuracy
RTE	NLI	2.5K	277	Accuracy
WNLI	Coreference	634	71	Accuracy
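
CoLA is scored with Matthews correlation rather than accuracy because its labels are unbalanced, so a classifier that always predicts the majority class can still reach a high accuracy. The sketch below is a plain-Python illustration for binary labels, not the implementation used by the `evaluate` library; the function name `matthews_corrcoef_binary` and the toy labels are hypothetical.

```python
import math

def matthews_corrcoef_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when the denominator is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical labels: a classifier that always predicts "acceptable" (1)
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = matthews_corrcoef_binary(y_true, y_pred)
print(accuracy, mcc)  # 0.75 0.0
```

Accuracy rewards the constant prediction with 0.75, while the Matthews correlation of 0.0 shows that the predictions carry no information about the labels.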

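from datasets import load_dataset
import evaluate

# Metric keys returned by evaluate.load("glue", task_name) for each task
GLUE_TASK_METRICS = {
    "cola": ["matthews_correlation"],
    "sst2": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qqp": ["accuracy", "f1"],
    "stsb": ["pearson", "spearmanr"],
    "mnli": ["accuracy"],
    "qnli": ["accuracy"],
    "rte": ["accuracy"],
    "wnli": ["accuracy"],
}

def evaluate_glue(model, tokenizer, task_name):
    """Prepare evaluation for a GLUE task.

    Returns the validation split and a compute_metrics function. model and
    tokenizer are unused here; running the model over the split is left to
    the caller (for example a Trainer).
    """
    if task_name not in GLUE_TASK_METRICS:
        raise ValueError(f"Unknown GLUE task: {task_name}")

    dataset = load_dataset("glue", task_name)
    # MNLI has matched and mismatched validation sets; the matched one is used here
    eval_split = "validation_matched" if task_name == "mnli" else "validation"

    metric = evaluate.load("glue", task_name)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        if task_name == "stsb":
            # STS-B is a regression task: the single output is the predicted score
            predictions = logits.squeeze(-1)
        else:
            predictions = logits.argmax(axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    return dataset[eval_split], compute_metrics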

# GLUE scores for reference (keys match the task names in the table above)
glue_baseline_scores = {
    "CoLA": {"BERT-base": 0.521, "BERT-large": 0.587, "RoBERTa-large": 0.680},
    "SST-2": {"BERT-base": 0.929, "BERT-large": 0.935, "RoBERTa-large": 0.964},
    "MRPC": {"BERT-base": 0.889, "BERT-large": 0.893, "RoBERTa-large": 0.909},
    "QQP": {"BERT-base": 0.913, "BERT-large": 0.918, "RoBERTa-large": 0.922},
    "STS-B": {"BERT-base": 0.858, "BERT-large": 0.865, "RoBERTa-large": 0.923},
    "MNLI-m": {"BERT-base": 0.841, "BERT-large": 0.858, "RoBERTa-large": 0.902},
    "QNLI": {"BERT-base": 0.905, "BERT-large": 0.918, "RoBERTa-large": 0.939},
    "RTE": {"BERT-base": 0.664, "BERT-large": 0.702, "RoBERTa-large": 0.882},
    "WNLI": {"BERT-base": 0.563, "BERT-large": 0.563, "RoBERTa-large": 0.913},
}

SuperGLUE

SuperGLUE was introduced after models surpassed the human baseline on GLUE. It keeps the multi-task format but uses harder tasks that remained challenging for those models.

Task	Type	Metric	Human Baseline (%)	Best Model (%)
BoolQ	QA	Accuracy	89.0	91.1
CB	NLI	Avg. F1 / Accuracy	90.9	93.9
COPA	Reasoning	Accuracy	100	98.0
MultiRC	QA	F1a	74.8	74.8
ReCoRD	QA	F1	93.4	93.4
RTE	NLI	Accuracy	92.8	92.5
WiC	Word Sense	Accuracy	80.0	75.6
WSC	Coreference	Accuracy	100	96.6
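
One way to read the table above is as remaining headroom: the difference between the human baseline and the best model on each task. The sketch below copies the table's numbers into a dictionary and computes that gap; the variable names are illustrative and the figures are only as current as the table itself.

```python
# (human baseline, best model) copied from the table above
superglue = {
    "BoolQ": (89.0, 91.1),
    "CB": (90.9, 93.9),
    "COPA": (100.0, 98.0),
    "MultiRC": (74.8, 74.8),
    "ReCoRD": (93.4, 93.4),
    "RTE": (92.8, 92.5),
    "WiC": (80.0, 75.6),
    "WSC": (100.0, 96.6),
}

# Positive gap: the best model is still below the human baseline
gaps = {task: round(human - best, 1) for task, (human, best) in superglue.items()}
at_or_above_human = sorted(task for task, gap in gaps.items() if gap <= 0)
largest_gap = max(gaps, key=gaps.get)

print(at_or_above_human)               # ['BoolQ', 'CB', 'MultiRC', 'ReCoRD']
print(largest_gap, gaps[largest_gap])  # WiC 4.4
```

Tasks where the gap is zero or negative offer little room to separate strong models, which is one sign that a benchmark is approaching saturation.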

MMLU (Massive Multitask Language Understanding)

MMLU tests knowledge across 57 subjects from elementary to professional level.

Definition: MMLU Accuracy

$$\text{MMLU} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{1}{|D_t|} \sum_{i=1}^{|D_t|} \mathbb{1}[\hat{y}_i = y_i]$$

where $\mathcal{T}$ is the set of 57 tasks and $D_t$ is the test set for task $t$. Each task counts equally in this average, regardless of how many questions it contains.
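
The definition above is a macro-average over tasks, which can differ from the accuracy pooled over all questions (a micro-average) when task sizes are uneven. The following sketch contrasts the two; the function names and the counts are hypothetical and chosen only to make the difference visible.

```python
def macro_accuracy(per_task):
    """Mean of per-task accuracies; per_task maps task -> (correct, total)."""
    return sum(correct / total for correct, total in per_task.values()) / len(per_task)

def micro_accuracy(per_task):
    """Accuracy pooled over all questions, so larger tasks weigh more."""
    total_correct = sum(correct for correct, _ in per_task.values())
    total_questions = sum(total for _, total in per_task.values())
    return total_correct / total_questions

# Hypothetical counts: one large easy task and one small hard task
per_task = {"large_task": (90, 100), "small_task": (5, 10)}

print(round(macro_accuracy(per_task), 3))  # 0.7
print(round(micro_accuracy(per_task), 3))  # 0.864
```

When comparing reported MMLU numbers, it helps to check which of the two averages was used.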

Category	Subjects	GPT-4	Claude-3	Gemini-Pro
STEM	Math, Physics, CS	82.3	78.1	76.5
Humanities	History, Philosophy, Law	86.7	83.2	80.1
Social Sciences	Econ, Psychology, Politics	88.4	84.6	82.3
Other	Business, Health, Misc	85.1	80.9	79.8
Overall	57 subjects	86.4	82.2	79.6

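from datasets import load_dataset
import numpy as np
import torch

CHOICES = ["A", "B", "C", "D"]

def format_mmlu_prompt(example):
    """Format an MMLU example as a zero-shot multiple-choice prompt."""
    choice_text = "\n".join(
        f"({CHOICES[i]}) {choice}" for i, choice in enumerate(example["choices"])
    )
    return f"{example['question']}\n{choice_text}\nAnswer:"

def predict_choice(model, tokenizer, prompt):
    """Return the index (0-3) of the answer letter with the highest next-token logit.

    Assumes a Hugging Face causal LM and tokenizer; other model interfaces
    need their own scoring function.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Use the last token of " A", " B", ... as the token for each answer letter
    choice_ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[-1] for c in CHOICES]
    return int(next_token_logits[choice_ids].argmax())

def evaluate_mmlu(model, tokenizer, subjects=None):
    """Evaluate model on the MMLU test split, per subject and overall."""
    dataset = load_dataset("cais/mmlu", "all")["test"]

    if subjects is None:
        subjects = sorted(set(dataset["subject"]))

    results = {}
    for subject in subjects:
        subject_data = dataset.filter(lambda x: x["subject"] == subject)

        correct = 0
        total = 0
        for example in subject_data:
            prompt = format_mmlu_prompt(example)
            prediction = predict_choice(model, tokenizer, prompt)
            # In cais/mmlu, "answer" is the integer index of the correct choice
            if prediction == example["answer"]:
                correct += 1
            total += 1

        accuracy = correct / total if total > 0 else 0
        results[subject] = {"accuracy": accuracy, "correct": correct, "total": total}

    # Overall accuracy: pooled over all questions (micro-average)
    total_correct = sum(r["correct"] for r in results.values())
    total_samples = sum(r["total"] for r in results.values())
    # Mean of per-subject accuracies (macro-average, as in the definition above)
    macro = float(np.mean([r["accuracy"] for r in results.values()])) if results else 0.0
    results["overall"] = {
        "accuracy": total_correct / total_samples if total_samples > 0 else 0,
        "macro_accuracy": macro,
        "correct": total_correct,
        "total": total_samples,
    }

    return results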

Other Important Benchmarks

Benchmark	Focus	Tasks	Metric
HellaSwag	Common Sense	Sentence completion	Accuracy
WinoGrande	Coreference	Pronoun resolution	Accuracy
ARC	Science Reasoning	Multiple choice	Accuracy
TriviaQA	Knowledge QA	Open-domain QA	F1 / EM
HumanEval	Code Generation	Python functions	Pass@k
GSM8K	Math Reasoning	Grade school math	Accuracy
TruthfulQA	Truthfulness	Question answering	% Truthful
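
Of the metrics in this table, Pass@k is the least self-explanatory: it estimates the probability that at least one of k sampled programs passes the unit tests. A minimal sketch follows, assuming the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where n samples are drawn per problem and c of them pass; the function name `pass_at_k` and the sample counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 pass the tests
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

The benchmark-level score is the mean of this estimate over all problems.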

Benchmark Contamination

Definition: Contamination Detection

A benchmark is contaminated if training data overlaps with test data. Detection metric:

$$\text{Contamination Rate} = \frac{|\{x \in D_{\text{test}} : \text{overlap}(x, D_{\text{train}}) > \tau\}|}{|D_{\text{test}}|}$$

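Contaminated benchmarks overestimate model performance, because a model can score well on test items it has already seen rather than by generalizing. N-gram overlap is a common detection heuristic; the threshold $\tau$ (0.8 in the code below) sets how much of a test example must appear in the training data before it is flagged.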

import random

class ContaminationDetector:
    def __init__(self, ngram_size=13, threshold=0.8):
        self.ngram_size = ngram_size
        self.threshold = threshold

    def extract_ngrams(self, text):
        """Return the set of lowercased, whitespace-tokenized n-grams in text."""
        tokens = text.lower().split()
        return {
            tuple(tokens[i:i + self.ngram_size])
            for i in range(len(tokens) - self.ngram_size + 1)
        }

    def check_contamination(self, test_text, train_texts):
        """Compare one test example against each training text."""
        test_ngrams = self.extract_ngrams(test_text)

        max_overlap = 0.0
        # Texts shorter than ngram_size yield no n-grams and are never flagged
        if test_ngrams:
            for train_text in train_texts:
                train_ngrams = self.extract_ngrams(train_text)
                # Fraction of the test example's n-grams found in this training text
                overlap = len(test_ngrams & train_ngrams) / len(test_ngrams)
                max_overlap = max(max_overlap, overlap)

        return {
            "contaminated": max_overlap > self.threshold,
            "overlap_ratio": max_overlap,
        }

    def scan_dataset(self, test_dataset, train_dataset, sample_size=1000, seed=0):
        """Estimate the contamination rate against a random sample of training texts."""
        # Sample indices so this works for lists of dicts and indexable datasets alike
        rng = random.Random(seed)
        k = min(sample_size, len(train_dataset))
        indices = rng.sample(range(len(train_dataset)), k)
        train_texts = [train_dataset[i]["text"] for i in indices]

        results = []
        for example in test_dataset:
            result = self.check_contamination(example["text"], train_texts)
            results.append(result)

        flagged = sum(1 for r in results if r["contaminated"])
        contamination_rate = flagged / len(results) if results else 0.0
        return {
            "contamination_rate": contamination_rate,
            "details": results,
        }
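
To make the overlap ratio concrete, here is a toy illustration using 3-grams instead of 13-grams so that the numbers can be checked by hand. The helper `ngram_overlap` and both sentences are made up for this example; it mirrors the ratio computed by the detector above but is a standalone sketch.

```python
def ngram_overlap(test_text, train_text, n):
    """Fraction of the test text's n-grams that also occur in the training text."""
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    test_ngrams = ngrams(test_text)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & ngrams(train_text)) / len(test_ngrams)

# Hypothetical test and training sentences
test_text = "the cat sat on the mat"
train_text = "yesterday the cat sat on the sofa"

ratio = ngram_overlap(test_text, train_text, n=3)
print(ratio)        # 0.75
print(ratio > 0.8)  # False
```

Three of the four test 3-grams appear in the training sentence, so the ratio is 0.75 and the example stays just under a threshold of 0.8. The choice of n-gram size and threshold therefore directly affects which near-duplicates are flagged.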

Best Practices

Use multiple benchmarks - No single benchmark captures all capabilities
Check for contamination - Verify training data doesn't overlap with test sets
Report per-category results - Aggregate scores hide strengths/weaknesses
Include baselines - Compare against well-known reference models
Consider task difficulty - Some benchmarks are nearly saturated

Key Takeaways

GLUE established the multi-task benchmark paradigm for language understanding
SuperGLUE pushed model development with more challenging tasks
MMLU tests broad knowledge across academic and professional domains
Benchmark contamination can inflate reported performance significantly
Human evaluation remains necessary as benchmarks approach saturation
