NLP Datasets and Benchmarks
Standard benchmarks enable fair comparison between models and track progress in NLP research. Understanding benchmark design is essential for proper model evaluation.
Benchmark Evolution
GLUE Benchmark
The General Language Understanding Evaluation (GLUE) benchmark tests language understanding across 9 tasks.
| Task | Type | Train Size | Dev Size | Metric |
|---|---|---|---|---|
| CoLA | Acceptability | 8.5K | 1K | Matthews Correlation |
| SST-2 | Sentiment | 67K | 872 | Accuracy |
| MRPC | Paraphrase | 3.7K | 408 | F1 / Accuracy |
| QQP | Paraphrase | 364K | 40K | F1 / Accuracy |
| STS-B | Similarity | 5.7K | 1.5K | Pearson / Spearman |
| MNLI | NLI | 393K | 20K | Accuracy |
| QNLI | NLI | 105K | 5.5K | Accuracy |
| RTE | NLI | 2.5K | 277 | Accuracy |
| WNLI | Coreference | 634 | 71 | Accuracy |
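CoLA is scored with Matthews correlation rather than accuracy because its labels are imbalanced. The following is a minimal sketch of that metric for binary labels. The helper name `matthews_corr` and the toy labels are illustrative assumptions, not part of the GLUE tooling.

```python
import math


def matthews_corr(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0.0 when the correlation is undefined
    # (for example, when the classifier predicts a single class).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: 8 acceptable sentences (1) and 2 unacceptable ones (0).
labels = [1] * 8 + [0] * 2
always_acceptable = [1] * 10

accuracy = sum(t == p for t, p in zip(labels, always_acceptable)) / len(labels)
mcc = matthews_corr(labels, always_acceptable)
print(f"accuracy={accuracy:.2f}, mcc={mcc:.2f}")
```

On this toy set, a classifier that always predicts "acceptable" reaches 80% accuracy but a Matthews correlation of 0, which is why the metric suits imbalanced tasks.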
from datasets import load_dataset
import evaluate

# Keys reported by the GLUE metric for each task
TASK_METRICS = {
    "cola": ["matthews_correlation"],
    "sst2": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qqp": ["accuracy", "f1"],
    "stsb": ["pearson", "spearmanr"],
    "mnli": ["accuracy"],
    "qnli": ["accuracy"],
    "rte": ["accuracy"],
    "wnli": ["accuracy"],
}


def evaluate_glue(model, tokenizer, task_name):
    """Prepare the data and metric function for a GLUE task.

    Returns (dataset, compute_metrics). Running the model over the data is
    left to the caller's evaluation loop, so model and tokenizer are not
    used inside this function.
    """
    if task_name not in TASK_METRICS:
        raise ValueError(f"Unknown GLUE task: {task_name}")
    dataset = load_dataset("glue", task_name)
    metric = evaluate.load("glue", task_name)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        if task_name == "stsb":
            # STS-B is a regression task: use the predicted score directly
            predictions = logits.reshape(-1)
        else:
            predictions = logits.argmax(axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    return dataset, compute_metrics
# GLUE scores for reference, expressed as fractions in [0, 1]
glue_baseline_scores = {
    "CoLA": {"BERT-base": 0.521, "BERT-large": 0.587, "RoBERTa-large": 0.680},
    "SST-2": {"BERT-base": 0.929, "BERT-large": 0.935, "RoBERTa-large": 0.964},
    "MRPC": {"BERT-base": 0.889, "BERT-large": 0.893, "RoBERTa-large": 0.909},
    "QQP": {"BERT-base": 0.913, "BERT-large": 0.918, "RoBERTa-large": 0.922},
    "STS-B": {"BERT-base": 0.858, "BERT-large": 0.865, "RoBERTa-large": 0.923},
    "MNLI-m": {"BERT-base": 0.841, "BERT-large": 0.858, "RoBERTa-large": 0.902},
    "QNLI": {"BERT-base": 0.905, "BERT-large": 0.918, "RoBERTa-large": 0.939},
    "RTE": {"BERT-base": 0.664, "BERT-large": 0.702, "RoBERTa-large": 0.882},
    "WNLI": {"BERT-base": 0.563, "BERT-large": 0.563, "RoBERTa-large": 0.913},
}
SuperGLUE
SuperGLUE is a harder successor to GLUE: its tasks remained challenging after models had matched the human baseline on GLUE.
| Task | Type | Metric | Human Baseline (%) | Best Model (%) |
|---|---|---|---|---|
| BoolQ | QA | Accuracy | 89.0 | 91.1 |
| CB | NLI | Avg. F1 / Accuracy | 90.9 | 93.9 |
| COPA | Reasoning | Accuracy | 100 | 98.0 |
| MultiRC | QA | F1a | 74.8 | 74.8 |
| ReCoRD | QA | F1 | 93.4 | 93.4 |
| RTE | NLI | Accuracy | 92.8 | 92.5 |
| WiC | Word Sense | Accuracy | 80.0 | 75.6 |
| WSC | Coreference | Accuracy | 100 | 96.6 |
MMLU (Massive Multitask Language Understanding)
MMLU tests knowledge across 57 subjects from elementary to professional level.
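Definition: MMLU Accuracy
$$\text{Acc}_{\text{MMLU}} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{(x, y) \in D_t} \mathbb{1}\left[\hat{y}(x) = y\right]$$
where $T$ is the set of 57 tasks, $D_t$ is the test set for task $t$, and $\hat{y}(x)$ is the model's predicted answer to question $x$, so every task counts equally.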
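Per-task accuracies can be combined into one score in two ways: averaging the per-task accuracies (macro) or pooling all questions before dividing (micro). The sketch below contrasts the two on made-up counts. The function names and the numbers are illustrative assumptions, not taken from any MMLU tooling.

```python
def macro_accuracy(per_task):
    """Mean of per-task accuracies: every task counts equally."""
    return sum(correct / total for correct, total in per_task.values()) / len(per_task)


def micro_accuracy(per_task):
    """Pooled accuracy: every question counts equally."""
    correct = sum(c for c, _ in per_task.values())
    total = sum(n for _, n in per_task.values())
    return correct / total


# Made-up (correct, total) counts for two subjects of very different size.
toy_counts = {"small_subject": (90, 100), "large_subject": (500, 1000)}

macro = macro_accuracy(toy_counts)  # (90/100 + 500/1000) / 2
micro = micro_accuracy(toy_counts)  # (90 + 500) / (100 + 1000)
print(f"macro={macro:.3f}, micro={micro:.3f}")
```

The two averages diverge whenever subject sizes differ, so a reported overall MMLU score should state which one it uses.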
| Category | Subjects | GPT-4 | Claude-3 | Gemini-Pro |
|---|---|---|---|---|
| STEM | Math, Physics, CS | 82.3 | 78.1 | 76.5 |
| Humanities | History, Philosophy, Law | 86.7 | 83.2 | 80.1 |
| Social Sciences | Econ, Psychology, Politics | 88.4 | 84.6 | 82.3 |
| Other | Business, Health, Misc | 85.1 | 80.9 | 79.8 |
| Overall | 57 subjects | 86.4 | 82.2 | 79.6 |
from datasets import load_dataset
import numpy as np

CHOICES = ["A", "B", "C", "D"]


def evaluate_mmlu(model, tokenizer, subjects=None):
    """Evaluate model on the MMLU benchmark.

    Assumes model.generate(prompt, choices=...) returns one of the choice
    letters. This is a placeholder interface, not a Hugging Face API.
    """
    test_set = load_dataset("cais/mmlu", "all")["test"]
    if subjects is None:
        subjects = sorted(set(test_set["subject"]))

    results = {}
    for subject in subjects:
        subject_data = test_set.filter(lambda x: x["subject"] == subject)
        correct = 0
        for example in subject_data:
            prompt = format_mmlu_prompt(example)
            prediction = model.generate(prompt, choices=CHOICES)
            # "answer" is stored as an index (0-3), so map it to its letter
            if prediction == CHOICES[example["answer"]]:
                correct += 1
        total = len(subject_data)
        accuracy = correct / total if total > 0 else 0.0
        results[subject] = {"accuracy": accuracy, "correct": correct, "total": total}

    # Overall accuracy pooled over all questions, plus the mean over subjects
    total_correct = sum(r["correct"] for r in results.values())
    total_samples = sum(r["total"] for r in results.values())
    macro = float(np.mean([r["accuracy"] for r in results.values()])) if results else 0.0
    results["overall"] = {
        "accuracy": total_correct / total_samples if total_samples else 0.0,
        "macro_accuracy": macro,
        "correct": total_correct,
        "total": total_samples,
    }
    return results


def format_mmlu_prompt(example):
    """Format an MMLU example as a multiple-choice prompt."""
    choice_text = "\n".join(
        f"({letter}) {text}" for letter, text in zip(CHOICES, example["choices"])
    )
    return f"{example['question']}\n{choice_text}\nAnswer:"
Other Important Benchmarks
| Benchmark | Focus | Tasks | Metric |
|---|---|---|---|
| HellaSwag | Common Sense | Sentence completion | Accuracy |
| WinoGrande | Coreference | Pronoun resolution | Accuracy |
| ARC | Science Reasoning | Multiple choice | Accuracy |
| TriviaQA | Knowledge QA | Open-domain QA | F1 / EM |
| HumanEval | Code Generation | Python functions | Pass@k |
| GSM8K | Math Reasoning | Grade school math | Accuracy |
| TruthfulQA | Truthfulness | Question answering | % Truthful |
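HumanEval's pass@k is usually estimated by drawing n samples per problem, counting the c samples that pass the unit tests, and computing 1 - C(n-c, k) / C(n, k). Below is a minimal sketch of that estimator for a single problem. The function name `pass_at_k` and the example counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of generated samples, c: samples that pass the tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 10 samples for one problem, 3 of which pass.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The benchmark-level score is the mean of this per-problem estimate over all problems.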
Benchmark Contamination
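Definition: Contamination Detection
A benchmark is contaminated if training data overlaps with test data. Detection metric:
$$\text{overlap}(x) = \max_{d \in D_{\text{train}}} \frac{|G_n(x) \cap G_n(d)|}{|G_n(x)|}$$
where $x$ is a test example, $D_{\text{train}}$ is the training corpus, and $G_n(\cdot)$ is the set of word $n$-grams in a text. An example is flagged as contaminated when its overlap exceeds a threshold $\tau$.
Scores on contaminated benchmarks overestimate model performance. N-gram overlap detection with $n = 13$ is commonly used.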
import random


class ContaminationDetector:
    """Flag test examples whose word n-grams also appear in training text."""

    def __init__(self, ngram_size=13, threshold=0.8):
        self.ngram_size = ngram_size
        self.threshold = threshold

    def extract_ngrams(self, text):
        # Lowercased whitespace tokens; texts shorter than ngram_size
        # produce an empty set.
        tokens = text.lower().split()
        return {
            tuple(tokens[i:i + self.ngram_size])
            for i in range(len(tokens) - self.ngram_size + 1)
        }

    def check_contamination(self, test_text, train_texts):
        test_ngrams = self.extract_ngrams(test_text)
        max_overlap = 0.0
        if test_ngrams:
            for train_text in train_texts:
                train_ngrams = self.extract_ngrams(train_text)
                overlap = len(test_ngrams & train_ngrams) / len(test_ngrams)
                max_overlap = max(max_overlap, overlap)
        return {
            "contaminated": max_overlap > self.threshold,
            "overlap_ratio": max_overlap,
        }

    def scan_dataset(self, test_dataset, train_dataset, sample_size=1000, seed=0):
        # Both datasets are assumed to be iterables of dicts with a "text" field.
        train_texts = [ex["text"] for ex in train_dataset]
        if len(train_texts) > sample_size:
            # Subsample the training texts to keep the scan affordable
            train_texts = random.Random(seed).sample(train_texts, sample_size)
        results = [
            self.check_contamination(ex["text"], train_texts) for ex in test_dataset
        ]
        flagged = sum(1 for r in results if r["contaminated"])
        return {
            "contamination_rate": flagged / len(results) if results else 0.0,
            "details": results,
        }
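`ContaminationDetector.check_contamination` re-extracts the n-grams of every training text for each test example, which gets expensive on large corpora. One possible alternative, sketched below under assumptions, builds a single set of training n-grams once and checks each test example against it. Note that this measures overlap against the pooled training n-grams rather than the maximum over individual training texts, so it can only report an equal or higher ratio. The function names, the toy texts, and the small `n=3` are illustrative choices.

```python
def ngrams(text, n):
    """Set of lowercased word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_texts, n):
    """Union of the n-grams of all training texts, computed once."""
    index = set()
    for text in train_texts:
        index |= ngrams(text, n)
    return index


def pooled_overlap(test_text, train_index, n):
    """Fraction of the test text's n-grams that occur anywhere in training."""
    test = ngrams(test_text, n)
    return len(test & train_index) / len(test) if test else 0.0


# Made-up texts; n=3 only so that these short sentences produce n-grams.
train = [
    "the quick brown fox jumps over the lazy dog",
    "benchmarks track progress in nlp research",
]
index = build_train_index(train, n=3)

leaked = pooled_overlap("The quick brown fox jumps over the lazy dog", index, n=3)
fresh = pooled_overlap("an entirely different sentence about evaluation", index, n=3)
print(f"leaked={leaked:.2f}, fresh={fresh:.2f}")
```

The trade-off is memory: the index holds every distinct training n-gram, so very large corpora may need hashing or a probabilistic set instead.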
Best Practices
- Use multiple benchmarks - No single benchmark captures all capabilities
- Check for contamination - Verify training data doesn't overlap with test sets
- Report per-category results - Aggregate scores hide strengths/weaknesses
- Include baselines - Compare against well-known reference models
- Consider task difficulty - Some benchmarks are nearly saturated
Key Takeaways
- GLUE established the multi-task benchmark paradigm for language understanding
- SuperGLUE pushed model development with more challenging tasks
- MMLU tests broad knowledge across academic and professional domains
- Benchmark contamination can inflate reported performance significantly
- Human evaluation remains necessary as benchmarks approach saturation