LLM Evaluation Benchmarks

LLM Evaluation

LLM Evaluation: How to Measure if a Language Model is Actually Good

Evaluating LLMs requires diverse benchmarks across reasoning, code generation, math, and human preference alignment.

  • Multi-Dimensional Metrics: Perplexity, MMLU, HumanEval, GSM8K, and Chatbot Arena measure different capabilities
  • LLM-as-Judge: Use strong models to evaluate weaker ones with over 80% human agreement
  • Evaluation Pipelines: Combine automatic metrics with human evaluation for comprehensive assessment

"No single benchmark captures all aspects of model quality; always evaluate on multiple dimensions."

LLM Evaluation Benchmarks

Evaluating large language models is one of the most challenging problems in AI. Unlike traditional ML tasks with clear metrics, LLMs are general-purpose systems whose capabilities span reasoning, creativity, knowledge, and more. This tutorial covers the major benchmarks and evaluation methodologies used to assess LLM performance.

Why Evaluation is Hard

LLMs exhibit emergent capabilities that are difficult to measure with simple metrics:

  • Open-ended generation has no single correct answer
  • Reasoning chains require evaluating intermediate steps
  • Safety requires testing for harms that may not appear in standard benchmarks
  • Alignment measures subjective qualities like helpfulness and honesty

Perplexity

Perplexity is the most fundamental metric for language models, measuring how well the model predicts the next token.

Perplexity (PPL) is the exponentiated average negative log-likelihood of a sequence, measuring how "surprised" the model is by the test data. Lower perplexity indicates better predictive performance.

Perplexity

$$\text{PPL}(\mathbf{x}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i \mid x_{<i})\right)$$

Here,

  • $\mathbf{x}$ = sequence of tokens $(x_1, x_2, \ldots, x_N)$
  • $N$ = number of tokens in the sequence
  • $P_\theta(x_i \mid x_{<i})$ = model's predicted probability for token $x_i$ given the preceding tokens
  • $\theta$ = model parameters

Cross-Entropy Loss

$$H(\mathbf{x}, P_\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i \mid x_{<i})$$

Here,

  • $H$ = cross-entropy between the true distribution and the model's predictions
  • $N$ = sequence length

The relationship between perplexity and cross-entropy:

Perplexity is simply the exponentiated cross-entropy: PPL = exp(H). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 possibilities.
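
To make the relationship concrete, here is a minimal sketch that computes perplexity directly from per-token probabilities. The function name `perplexity_from_probs` and the probability values are illustrative assumptions, not output from a real model.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    n = len(token_probs)
    # Cross-entropy H: average negative log-likelihood, in nats
    cross_entropy = -sum(math.log(p) for p in token_probs) / n
    # PPL = exp(H)
    return math.exp(cross_entropy)

# A model that spreads probability uniformly over 10 options at every step
uniform_ppl = perplexity_from_probs([0.1] * 5)

# A more confident model on the same five tokens (made-up probabilities)
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.7, 0.85])

print(f"uniform: {uniform_ppl:.2f}, confident: {confident_ppl:.2f}")
```

The uniform case recovers a perplexity of 10, matching the "choosing among 10 possibilities" reading, while the more confident model lands close to 1.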

Computing Perplexity

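import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_name: str, text: str, stride: int = 512, max_length: int = 2048) -> float:
    # max_length is the window size and must not exceed the model's context length
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    
    # Tokenize the full text without truncation; the sliding window handles long inputs
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    
    nll_sum = 0.0
    n_tokens = 0
    prev_end_loc = 0
    
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc]
        target_ids = input_ids.clone()
        
        # Mask the overlapping context so each token is scored exactly once
        target_ids[:, :-trg_len] = -100
        
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
        
        # outputs.loss is the mean NLL over scored tokens; the internal label shift drops one position
        num_scored = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.item() * num_scored
        n_tokens += num_scored
        prev_end_loc = end_loc
        
        if end_loc == seq_len:
            break
    
    # Weight by token count so short final windows do not skew the average
    return math.exp(nll_sum / n_tokens)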

Perplexity is useful for comparing models of similar size on the same test set, but it is not a reliable indicator of downstream task performance. A model with lower perplexity may still perform worse on reasoning tasks.

MMLU (Massive Multitask Language Understanding)

MMLU measures knowledge across 57 subjects spanning STEM, humanities, social sciences, and more.

Benchmark Structure

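Category | Example subjects | Questions
STEM | Physics, Math, CS | 14,042
Humanities | History, Philosophy, Law | 11,039
Social Sciences | Economics, Psychology | 8,302
Other | Misc, Professional | 7,530

Note that the official MMLU test split contains 14,042 questions in total across all 57 subjects, so the per-category counts listed here cannot all refer to that split; check them against the dataset release before citing them.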

Evaluation Protocol

MMLU uses 5-shot evaluation with multiple-choice questions:

def format_mmlu_prompt(question, options, examples=None):
    prompt = "Answer the following multiple-choice question.\n\n"
    
    if examples:
        for ex in examples:
            prompt += f"Question: {ex['question']}\n"
            for i, opt in enumerate(ex['options']):
                prompt += f"({chr(65+i)}) {opt}\n"
            prompt += f"Answer: {ex['answer']}\n\n"
    
    prompt += f"Question: {question}\n"
    for i, opt in enumerate(options):
        prompt += f"({chr(65+i)}) {opt}\n"
    prompt += "Answer:"
    
    return prompt

def evaluate_mmlu(model, tokenizer, dataset, k=5):
    correct = 0
    total = 0
    
    for question in dataset:
        examples = question['few_shot_examples'][:k]
        prompt = format_mmlu_prompt(
            question['question'],
            question['options'],
            examples
        )
        
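        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Greedy decoding so the single predicted token is deterministic
            outputs = model.generate(**inputs, max_new_tokens=1, do_sample=False)
        
        # Tokenizers often emit the letter with a leading space (" A"), so strip before comparing
        predicted = tokenizer.decode(outputs[0][-1:]).strip()
        if predicted == question['answer']:
            correct += 1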
        total += 1
    
    return correct / total
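
Because MMLU spans several subject groups, a single overall accuracy can hide weak categories. The sketch below aggregates results per category and reports both micro and macro averages. The `(category, is_correct)` record format and the function name are assumptions for illustration; they are not part of the MMLU release.

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Aggregate (category, is_correct) pairs into per-category, micro, and macro accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)

    per_category = {c: correct[c] / total[c] for c in total}
    # Micro average: every question counts equally
    micro = sum(correct.values()) / sum(total.values())
    # Macro average: every category counts equally
    macro = sum(per_category.values()) / len(per_category)
    return per_category, micro, macro

# Toy results: 4 STEM questions (2 correct), 2 Humanities questions (2 correct)
records = [
    ("STEM", True), ("STEM", False), ("STEM", False), ("STEM", True),
    ("Humanities", True), ("Humanities", True),
]
per_category, micro, macro = accuracy_by_category(records)
print(per_category, round(micro, 4), macro)
```

The two averages diverge when categories differ in size, so it helps to state which one a reported score uses.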

HumanEval (Code Generation)

HumanEval evaluates a model's ability to generate correct Python functions from docstrings.

Pass@k (Code Generation)

$$\text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$

Here,

  • $n$ = total number of generated samples per problem
  • $c$ = number of correct samples (passing all test cases)
  • $k$ = number of samples to consider (typically k=1, k=10, k=100)

The unbiased estimator for Pass@k:

Unbiased Pass@k Estimator

$$\widehat{\text{Pass@k}} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Here,

  • $\binom{n}{k}$ = binomial coefficient (n choose k)
  • $n$ = total generated samples
  • $c$ = correct samples
  • $k$ = samples to evaluate

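
As a sanity check, the estimator can be transcribed directly with `math.comb`. This is a sketch for small to moderate `n`; the function name `pass_at_k_binomial` and the sample counts are illustrative assumptions.

```python
import math

def pass_at_k_binomial(n: int, c: int, k: int) -> float:
    """Direct transcription of 1 - C(n-c, k) / C(n, k)."""
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 in that case
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 200 samples for one problem, 10 of which pass the tests (illustrative numbers)
n, c = 200, 10
for k in (1, 10, 100):
    print(k, round(pass_at_k_binomial(n, c, k), 4))
```

For k=1 the estimate reduces to c/n, the plain fraction of passing samples, and it grows with k because any one passing sample among the k drawn counts as a success.
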
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

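def evaluate_humaneval(model, tokenizer, problems, n_samples=200, k_values=(1, 10, 100)):
    per_problem = {}
    
    for problem in problems:
        prompt = f"def {problem['function_name']}({problem['signature']}):\n    \"\"\"{problem['docstring']}\"\"\"\n"
        inputs = tokenizer(prompt, return_tensors="pt")  # tokenize once per problem
        prompt_len = inputs["input_ids"].shape[1]
        
        samples = []
        for _ in range(n_samples):
            with torch.no_grad():
                output = model.generate(
                    **inputs,
                    max_new_tokens=512,
                    temperature=0.8,
                    do_sample=True
                )
            # Keep only the generated completion, without the prompt or special tokens
            code = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
            samples.append(code)
        
        # run_test_cases is an external helper (not defined here) that should execute
        # each sample in isolation and return True only if every test passes
        c = sum(1 for s in samples if run_test_cases(s, problem['test_cases']))
        
        per_problem[problem['task_id']] = {
            k: pass_at_k(n_samples, c, k) for k in k_values
        }
    
    # Benchmark-level Pass@k is the mean of the per-problem estimates
    return {
        k: sum(r[k] for r in per_problem.values()) / len(per_problem)
        for k in k_values
    }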
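
The evaluation loop relies on a `run_test_cases` helper that the lesson does not define. One possible sketch is shown below, assuming the tests are a string of plain `assert` statements and that the first argument is a complete program (prompt plus completion). Both are assumptions, and a subprocess with a timeout is not a security sandbox, so real harnesses should isolate untrusted code more strictly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_test_cases(program: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if `program` followed by `test_code` runs and exits with status 0.

    Assumption: `test_code` consists of plain assert statements.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow programs count as failures
            return False
    # A failed assert or any uncaught exception produces a non-zero exit code
    return result.returncode == 0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
print(run_test_cases(good, "assert add(1, 2) == 3"))
print(run_test_cases(bad, "assert add(1, 2) == 3"))
```

Running each candidate in a separate process keeps a crashing or hanging sample from taking down the evaluation loop.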

GSM8K (Math Reasoning)

GSM8K tests grade-school math reasoning with multi-step word problems.

Evaluation Metrics

Metric | Description | Calculation
Exact Match | Final answer matches exactly | Binary per problem
Execution Accuracy | Code execution produces correct answer | Binary per problem
Reasoning Score | Intermediate steps are valid | 0-1 per problem

import re

def extract_answer(response: str) -> str:
    """Extract final numerical answer from response."""
    # Drop thousands separators so "1,234" and "1234" compare equal
    text = response.replace(",", "")
    # Look for #### pattern used in GSM8K
    match = re.search(r'####\s*(-?\d+\.?\d*)', text)
    if match:
        return match.group(1).rstrip(".")
    # Fallback: last number in response (a trailing sentence period is stripped)
    numbers = re.findall(r'-?\d+\.?\d*', text)
    return numbers[-1].rstrip(".") if numbers else ""

def evaluate_gsm8k(model, tokenizer, dataset):
    correct = 0
    total = 0
    
    for problem in dataset:
        prompt = f"Question: {problem['question']}\n\nAnswer:"
        
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False  # greedy decoding; temperature only applies when do_sample=True
            )
        
        response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        predicted = extract_answer(response)
        ground_truth = extract_answer(problem['answer'])
        
        if predicted == ground_truth:
            correct += 1
        total += 1
    
    return correct / total
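
Plain string equality makes Exact Match brittle: "18" and "18.0" are the same answer but different strings. A minimal sketch of a numeric comparison follows. The helper name `answers_match` and its normalization rules are assumptions, not part of the GSM8K specification.

```python
from decimal import Decimal, InvalidOperation

def answers_match(predicted: str, ground_truth: str) -> bool:
    """Compare two extracted answers numerically, falling back to string equality."""
    def to_decimal(text: str):
        # Remove thousands separators, currency signs, and a trailing period
        cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
        try:
            return Decimal(cleaned)
        except InvalidOperation:
            return None  # not parseable as a number

    p, g = to_decimal(predicted), to_decimal(ground_truth)
    if p is None or g is None:
        return predicted.strip() == ground_truth.strip()
    return p == g

print(answers_match("18.0", "18"))
print(answers_match("1,234", "1234"))
print(answers_match("17", "18"))
```

`Decimal` avoids the rounding surprises of binary floats when answers contain cents or other short decimals.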

MT-Bench and Chatbot Arena

MT-Bench

MT-Bench evaluates multi-turn conversation quality using GPT-4 as a judge:

MT_BENCH_CATEGORIES = [
    "writing", "roleplay", "reasoning", "math",
    "coding", "extraction", "stem", "humanities"
]

MT_BENCH_JUDGE_PROMPT = """[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Rate on a scale of 1 to 10.

[Question]
{question}

[Assistant Response]
{response}

[Rating]
Provide your rating on a scale of 1 to 10. Explain your rating briefly."""

Chatbot Arena (Elo Rating)

Chatbot Arena uses blind pairwise comparisons with human voting to compute Elo ratings:

Elo Rating Update

$$R_{\text{new}} = R_{\text{old}} + K(S - E)$$

Here,

  • $R_{\text{new}}$ = updated rating
  • $R_{\text{old}}$ = current rating
  • $K$ = update factor (typically 32)
  • $S$ = actual score (1 for win, 0.5 for draw, 0 for loss)
  • $E$ = expected score, $E = \frac{1}{1 + 10^{(R_{\text{opp}} - R_{\text{self}})/400}}$
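
A short worked example of the update rule, assuming two models that both start at a rating of 1000 and the K=32 factor mentioned above. The function names are illustrative.

```python
def expected_score(r_self: float, r_opp: float) -> float:
    """Expected score E of a player rated r_self against an opponent rated r_opp."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_self) / 400))

def elo_update(r_self: float, r_opp: float, score: float, k: float = 32.0) -> float:
    """Apply R_new = R_old + K * (S - E)."""
    return r_self + k * (score - expected_score(r_self, r_opp))

# Two models start with equal ratings; model A wins one comparison
rating_a, rating_b = 1000.0, 1000.0
new_a = elo_update(rating_a, rating_b, score=1.0)  # 1016.0
new_b = elo_update(rating_b, rating_a, score=0.0)  # 984.0
print(new_a, new_b)
```

With equal ratings the expected score is 0.5, so the winner gains K/2 points and the loser gives up the same amount. An upset against a much higher-rated model would move both ratings further.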

LLM-as-Judge Evaluation

Using GPT-4 or other strong models as evaluation judges:

def llm_judge(
    question: str,
    response: str,
    criteria: dict,
    judge_model,
    judge_tokenizer
) -> dict:
    judge_prompt = f"""You are an expert evaluator. Rate the following response on these criteria:

{chr(10).join(f'- {k}: {v}' for k, v in criteria.items())}

Question: {question}
Response: {response}

Provide ratings (1-10) for each criterion and an overall score.
Format your response as JSON."""
    
    inputs = judge_tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        output = judge_model.generate(**inputs, max_new_tokens=256)
    
    judgment = judge_tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    return parse_judgment(judgment)
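
The `parse_judgment` helper is not defined in the lesson. One possible sketch is shown below, assuming the judge replies with a single JSON object that may be surrounded by extra prose. It returns an empty dict when nothing parseable is found, which is a design choice made here for illustration rather than anything prescribed by a benchmark.

```python
import json
import re

def parse_judgment(judgment: str) -> dict:
    """Parse the JSON object embedded in a judge's free-form reply."""
    # Take the span from the first "{" to the last "}" to skip surrounding prose
    match = re.search(r"\{.*\}", judgment, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # Only accept a JSON object, not a bare list or number
    return parsed if isinstance(parsed, dict) else {}

reply = 'Here is my evaluation: {"accuracy": 8, "clarity": 7, "overall": 7} Hope this helps.'
print(parse_judgment(reply))
print(parse_judgment("I cannot rate this response."))
```

Counting how often the result is empty is a useful health check, since a judge that frequently ignores the output format will quietly shrink the evaluated sample.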

LLM-as-judge has been shown to achieve >80% agreement with human evaluators on tasks like response quality assessment. However, it can be biased toward its own outputs if used to evaluate similar models.
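
Agreement figures like this are typically simple match rates between the judge's verdicts and human verdicts on the same items. A minimal sketch follows, with made-up labels and a hypothetical `agreement_rate` helper.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge's verdict equals the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Pairwise verdicts ("A" wins, "B" wins, or "tie") on five comparisons (made-up data)
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]
print(agreement_rate(judge, human))  # 0.8
```

Raw agreement does not correct for matches that would occur by chance, so a chance-adjusted statistic may be worth reporting alongside it.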

Evaluation Pitfalls and Limitations

Common Pitfalls

  1. Data Contamination: Test data may appear in training data
  2. Benchmark Gaming: Models can overfit to specific benchmark formats
  3. Metric Sensitivity: Small changes in evaluation protocol can change rankings
  4. Missing Capabilities: Benchmarks may not capture important real-world skills
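
For the first pitfall, a rough contamination check is to measure how many of a test example's n-grams also appear in the training corpus. The sketch below uses whitespace tokenization and a default of n=8; both are simplifying assumptions, and the function names are illustrative.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_example: str, training_corpus: list, n: int = 8) -> float:
    """Fraction of the test example's n-grams that also occur in the training corpus."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0  # example shorter than n tokens
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

corpus = ["the quick brown fox jumps over the lazy dog"]
print(contamination_score("quick brown fox jumps high", corpus, n=3))
print(contamination_score("an entirely different sentence here", corpus, n=3))
```

A high score flags an example for manual review. A low score does not prove the example is clean, since paraphrased or translated copies share few exact n-grams.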

Evaluation Anti-Patterns

# BAD: Only evaluating on one benchmark
score = evaluate_gsm8k(model, tokenizer, gsm8k_test)

# GOOD: Comprehensive evaluation
# evaluate_safety and evaluate_helpfulness are placeholders for project-specific
# evaluators; the latter would apply llm_judge to each prompt in helpfulness_test
evaluation_suite = {
    "perplexity": compute_perplexity(model_name, held_out_text),
    "mmlu": evaluate_mmlu(model, tokenizer, mmlu_test),
    "humaneval": evaluate_humaneval(model, tokenizer, humaneval),
    "gsm8k": evaluate_gsm8k(model, tokenizer, gsm8k_test),
    "safety": evaluate_safety(model, tokenizer, safety_test),
    "helpfulness": evaluate_helpfulness(model, tokenizer, helpfulness_test)
}

Limitations of Automatic Metrics

Metric | Strengths | Weaknesses
Perplexity | Fast, consistent | Doesn't measure reasoning
Exact Match | Clear, objective | Misses partial credit
Pass@k | Task-specific | Expensive to compute
LLM Judge | Flexible, nuanced | Expensive, potential bias

Always evaluate LLMs on multiple benchmarks across different capability dimensions. No single benchmark captures all aspects of model quality. Combine automatic metrics with human evaluation for critical applications.

Building an Evaluation Pipeline

class LLMEvaluator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}
    
    def evaluate(self, benchmarks: dict):
        for name, (eval_fn, dataset) in benchmarks.items():
            print(f"Evaluating {name}...")
            self.results[name] = eval_fn(self.model, self.tokenizer, dataset)
        
        return self.results
    
    def compare_with_baseline(self, baseline_results: dict) -> dict:
        comparison = {}
        for benchmark in self.results:
            if benchmark in baseline_results:
                comparison[benchmark] = {
                    "current": self.results[benchmark],
                    "baseline": baseline_results[benchmark],
                    "delta": self.results[benchmark] - baseline_results[benchmark]
                }
        return comparison
    
    def generate_report(self) -> str:
        report = "# LLM Evaluation Report\n\n"
        for benchmark, score in self.results.items():
            report += f"## {benchmark}\n"
            report += f"- Score: {score:.4f}\n\n"
        return report

Summary

  • Perplexity measures next-token prediction quality: PPL = exp(H)
  • MMLU evaluates knowledge across 57 academic subjects
  • HumanEval measures code generation capability via Pass@k
  • GSM8K tests multi-step mathematical reasoning
  • MT-Bench scores multi-turn responses with an LLM judge, while Chatbot Arena ranks models from blind human pairwise votes
  • LLM-as-judge provides scalable evaluation with >80% human agreement
  • Always use multiple benchmarks and combine automatic with human evaluation
  • Watch for data contamination and benchmark gaming

Practice Exercises

  1. Perplexity Comparison: Compute perplexity for 3 different models on the WikiText-2 dataset. How does perplexity correlate with model size?

  2. Benchmark Implementation: Implement a 5-shot MMLU evaluation for a small language model. Report accuracy by subject category.

  3. Code Generation: Evaluate a model on HumanEval with k=1, k=10, and k=100. How does Pass@k improve with more samples?

  4. LLM-as-Judge: Use GPT-4 to evaluate 50 responses from different models on the MT-Bench dataset. Compare the rankings with your own judgments.

  5. Evaluation Pipeline: Build a complete evaluation pipeline that runs 4 different benchmarks and generates a comparison report.


What to Learn Next

-> LLM Safety and Red Teaming Safety evaluation is a critical dimension beyond standard benchmarks.

-> LLM Inference Optimization Optimizing inference affects latency and cost metrics measured in evaluation.

-> Building Production LLM Applications Production evaluation combines automated metrics with human feedback.

-> Prompt Engineering Prompt strategies directly impact evaluation performance across benchmarks.

-> In-Context Learning Few-shot evaluation protocols rely on in-context learning capabilities.

-> Scaling Laws and Chinchilla Understanding how model scale affects benchmark performance.


Previous: 14 - Constitutional AI <- | Next: 16 - LLM Inference Optimization ->
