LLM Evaluation

LLM Evaluation — How to Measure if a Language Model is Actually Good

Evaluating LLMs requires diverse benchmarks across reasoning, code generation, math, and human preference alignment.

Multi-Dimensional Metrics — Perplexity, MMLU, HumanEval, GSM8K, and Chatbot Arena measure different capabilities
LLM-as-Judge — Use strong models to evaluate weaker ones with over 80% human agreement
Evaluation Pipelines — Combine automatic metrics with human evaluation for comprehensive assessment

"No single benchmark captures all aspects of model quality — always evaluate on multiple dimensions."

LLM Evaluation Benchmarks

Evaluating large language models is one of the most challenging problems in AI. Unlike traditional ML tasks with clear metrics, LLMs are general-purpose systems whose capabilities span reasoning, creativity, knowledge, and more. This tutorial covers the major benchmarks and evaluation methodologies used to assess LLM performance.

Why Evaluation is Hard

LLMs exhibit emergent capabilities that are difficult to measure with simple metrics:

Open-ended generation has no single correct answer
Reasoning chains require evaluating intermediate steps
Safety requires testing for harms that may not appear in standard benchmarks
Alignment measures subjective qualities like helpfulness and honesty

Perplexity

Perplexity is the most fundamental metric for language models, measuring how well the model predicts the next token.

Perplexity (PPL) is the exponentiated average negative log-likelihood of a sequence, measuring how "surprised" the model is by the test data. Lower perplexity indicates better predictive performance.

Perplexity

\text{PPL}(\mathbf{x}) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i | x_{<i})\right)

Here,

$\mathbf{x}$ =sequence of tokens $(x_1, x_2, \ldots, x_N)$
$N$ =number of tokens in the sequence
$P_\theta(x_i | x_{<i})$ =model's predicted probability for token $x_i$ given the preceding tokens
$\theta$ =model parameters

Cross-Entropy Loss

H(\mathbf{x}, P_\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(x_i | x_{<i})

Here,

$H$ =cross-entropy between true distribution and model predictions
$N$ =sequence length

The relationship between perplexity and cross-entropy:

Perplexity is simply the exponentiated cross-entropy: PPL = exp(H). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 possibilities.
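
As a quick numerical check of this relationship, the sketch below computes cross-entropy and perplexity from a list of per-token probabilities. The helper name `perplexity_from_probs` and the sample probabilities are illustrative assumptions, not part of any library.

```python
import math

def perplexity_from_probs(token_probs):
    """Return (cross_entropy, perplexity) for a list of per-token probabilities."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    # Cross-entropy H: average negative log-likelihood (natural log)
    h = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponentiated cross-entropy
    return h, math.exp(h)

# A model that assigns probability 0.1 to every observed token behaves like a
# uniform choice among 10 options.
h_uniform, ppl_uniform = perplexity_from_probs([0.1] * 5)

# Hypothetical probabilities for a more confident model
h_mixed, ppl_mixed = perplexity_from_probs([0.5, 0.25, 0.8, 0.1])

print(f"uniform: H={h_uniform:.3f}, PPL={ppl_uniform:.2f}")
print(f"mixed:   H={h_mixed:.3f}, PPL={ppl_mixed:.2f}")
```

The uniform case gives a perplexity of 10, matching the interpretation above, while the more confident model scores lower.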

Computing Perplexity

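import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_name: str, text: str, stride: int = 512, max_length: int = 2048) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # Tokenize the full text without truncation; the sliding window below
    # handles long inputs. max_length must not exceed the model's context size.
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)

    nll_sum = 0.0
    n_tokens = 0
    prev_end_loc = 0

    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc]
        target_ids = input_ids.clone()

        # Mask tokens already scored in a previous window; they act only as context
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)

        # outputs.loss is the mean NLL over scored tokens. Labels are shifted by
        # one inside the model, so the first position never contributes.
        num_loss_tokens = (target_ids[:, 1:] != -100).sum().item()
        if num_loss_tokens > 0:
            nll_sum += outputs.loss.item() * num_loss_tokens
            n_tokens += num_loss_tokens
        prev_end_loc = end_loc

        if end_loc == seq_len:
            break

    if n_tokens == 0:
        raise ValueError("text is too short to score")
    # Token-weighted average NLL, then exponentiate
    return torch.exp(torch.tensor(nll_sum / n_tokens)).item()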

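Perplexity is useful for comparing models of similar size on the same test set, provided they share a tokenizer: because the metric is normalized per token, models with different vocabularies split the same text into different numbers of tokens, and their perplexities are not directly comparable. It is also not a reliable indicator of downstream task performance. A model with lower perplexity may still perform worse on reasoning tasks.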
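
When a long text is scored in several windows, each window's mean loss covers a different number of tokens, so the per-window means are best combined with token-count weights before exponentiating. The sketch below uses made-up window losses to show how the weighted and unweighted results can differ; `corpus_perplexity` is a hypothetical helper name.

```python
import math

def corpus_perplexity(window_stats):
    """Combine (mean_nll, n_tokens) pairs into a single perplexity.

    Each mean_nll is weighted by the number of tokens it was averaged over,
    so the result equals exp(total NLL / total tokens).
    """
    total_tokens = sum(n for _, n in window_stats)
    if total_tokens == 0:
        raise ValueError("no scored tokens")
    total_nll = sum(mean_nll * n for mean_nll, n in window_stats)
    return math.exp(total_nll / total_tokens)

# Illustrative numbers only: two full windows and a short, hard final window
windows = [(2.0, 512), (2.2, 512), (4.0, 16)]
weighted_ppl = corpus_perplexity(windows)
unweighted_ppl = math.exp(sum(m for m, _ in windows) / len(windows))

print(f"token-weighted PPL: {weighted_ppl:.2f}")
print(f"unweighted PPL:     {unweighted_ppl:.2f}")
```

In this example the short final window has a high loss, so giving it the same weight as the full windows inflates the unweighted figure.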

MMLU (Massive Multitask Language Understanding)

MMLU measures knowledge across 57 subjects spanning STEM, humanities, social sciences, and more.

Benchmark Structure

Category	Subjects	Examples
STEM	Physics, Math, CS	14,042 questions
Humanities	History, Philosophy, Law	11,039 questions
Social Sciences	Economics, Psychology	8,302 questions
Other	Misc, Professional	7,530 questions

Evaluation Protocol

MMLU uses 5-shot evaluation with four-option multiple-choice questions:

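import torch

def format_mmlu_prompt(question, options, examples=None):
    """Build a few-shot prompt with lettered options, ending in 'Answer:'."""
    prompt = "Answer the following multiple-choice question.\n\n"

    # Few-shot demonstrations, each with its answer letter filled in
    if examples:
        for ex in examples:
            prompt += f"Question: {ex['question']}\n"
            for i, opt in enumerate(ex['options']):
                prompt += f"({chr(65+i)}) {opt}\n"  # chr(65) == 'A'
            prompt += f"Answer: {ex['answer']}\n\n"

    # The question to be answered, with the answer left blank
    prompt += f"Question: {question}\n"
    for i, opt in enumerate(options):
        prompt += f"({chr(65+i)}) {opt}\n"
    prompt += "Answer:"

    return prompt

def evaluate_mmlu(model, tokenizer, dataset, k=5):
    correct = 0
    total = 0

    for question in dataset:
        # Each item is assumed to carry its own few-shot examples
        examples = question['few_shot_examples'][:k]
        prompt = format_mmlu_prompt(
            question['question'],
            question['options'],
            examples
        )

        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Greedy decoding of a single token: the answer letter
            outputs = model.generate(**inputs, max_new_tokens=1, do_sample=False)

        # Strip whitespace: many tokenizers emit " A" rather than "A"
        predicted = tokenizer.decode(outputs[0][-1:], skip_special_tokens=True).strip()
        if predicted == question['answer']:
            correct += 1
        total += 1

    return correct / total if total else 0.0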
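
Because MMLU spans many subjects of different sizes, a single accuracy number can hide weak areas. The following sketch aggregates per-question outcomes into per-subject accuracy plus micro and macro averages. The `(subject, is_correct)` record shape and the function name are assumptions made for illustration, not a format defined by MMLU.

```python
from collections import defaultdict

def accuracy_by_subject(records):
    """Compute per-subject accuracy plus micro and macro averages.

    `records` is an iterable of (subject, is_correct) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, is_correct in records:
        total[subject] += 1
        correct[subject] += int(bool(is_correct))
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())   # every question counts equally
    macro = sum(per_subject.values()) / len(per_subject)  # every subject counts equally
    return per_subject, micro, macro

# Hypothetical outcomes for two subjects of different sizes
records = [
    ("physics", True), ("physics", False), ("physics", True), ("physics", True),
    ("law", False), ("law", True),
]
per_subject, micro, macro = accuracy_by_subject(records)
print(per_subject, round(micro, 3), round(macro, 3))
```

Here the micro average is 4/6 (about 0.667) while the macro average is 0.625, because the smaller subject pulls the macro figure down.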

HumanEval (Code Generation)

HumanEval evaluates a model's ability to generate correct Python functions from docstrings.

Pass@k (Code Generation)

\text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]

Here,

$n$ =total number of generated samples per problem
$c$ =number of correct samples (passing all test cases)
$k$ =number of samples to consider (typically k=1, k=10, k=100)

The unbiased estimator for Pass@k:

Unbiased Pass@k Estimator

\widehat{\text{Pass@k}} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}

Here,

$\binom{n}{k}$ =binomial coefficient (n choose k)
$n$ =total generated samples
$c$ =correct samples
$k$ =samples to evaluate

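import math
import torch

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one
        return 1.0
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

def evaluate_humaneval(model, tokenizer, problems, n_samples=200, k_values=(1, 10, 100)):
    # run_test_cases(code, test_cases) -> bool is assumed to be defined elsewhere;
    # it must execute untrusted generated code inside a sandbox.
    results = {}

    for problem in problems:
        prompt = f"def {problem['function_name']}({problem['signature']}):\n    \"\"\"{problem['docstring']}\"\"\"\n"

        # Tokenize once per problem and reuse for every sample
        inputs = tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]

        samples = []
        for _ in range(n_samples):
            with torch.no_grad():
                output = model.generate(
                    **inputs,
                    max_new_tokens=512,
                    temperature=0.8,
                    do_sample=True
                )
            completion = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
            # Test the full function: signature and docstring plus generated body
            samples.append(prompt + completion)

        c = sum(1 for s in samples if run_test_cases(s, problem['test_cases']))

        # Per-problem estimates; average over problems for the benchmark score
        results[problem['task_id']] = {
            k: pass_at_k(n_samples, c, k) for k in k_values
        }

    return results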
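
The product inside `pass_at_k` is an algebraically equivalent rewriting of the binomial ratio in the estimator. The sketch below checks this numerically against a direct `math.comb` implementation; the function names and the choice of n=200, c=20 are illustrative assumptions.

```python
import math

def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Product form: 1 - prod_{i=0}^{c-1} (1 - k / (n - i))."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

def pass_at_k_binomial(n: int, c: int, k: int) -> float:
    """Direct form: 1 - C(n - c, k) / C(n, k)."""
    # math.comb returns 0 when k > n - c, which yields 1.0 automatically
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

n, c = 200, 20  # 200 samples, 20 of which pass the tests
for k in (1, 10, 100):
    p = pass_at_k_product(n, c, k)
    b = pass_at_k_binomial(n, c, k)
    print(f"k={k:>3}: product={p:.6f} binomial={b:.6f}")
```

For k=1 both forms reduce to c/n, which is 0.1 here. The product form avoids computing very large binomial coefficients, which matters in languages without arbitrary-precision integers.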

GSM8K (Math Reasoning)

GSM8K tests grade-school math reasoning with multi-step word problems.

Evaluation Metrics

Metric	Description	Calculation
Exact Match	Final answer matches exactly	Binary per problem
Execution Accuracy	Code execution produces correct answer	Binary per problem
Reasoning Score	Intermediate steps are valid	0-1 per problem

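import re
import torch

def extract_answer(response: str) -> str:
    """Extract final numerical answer from response."""
    # Look for #### pattern used in GSM8K
    match = re.search(r'####\s*(.+)', response)
    if match:
        answer = match.group(1).strip()
    else:
        # Fallback: last number in response, allowing thousands separators
        numbers = re.findall(r'-?\d[\d,]*\.?\d*', response)
        answer = numbers[-1] if numbers else ""
    # Normalize so that "1,000" and "1000." both compare equal to "1000"
    return answer.replace(",", "").rstrip(".")

def evaluate_gsm8k(model, tokenizer, dataset):
    correct = 0
    total = 0

    for problem in dataset:
        prompt = f"Question: {problem['question']}\n\nAnswer:"

        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Greedy decoding; do_sample=False replaces temperature=0.0,
            # which is not a valid sampling temperature in transformers
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False
            )

        # Decode only the newly generated tokens, not the prompt
        response = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True
        )
        predicted = extract_answer(response)
        ground_truth = extract_answer(problem['answer'])

        # Empty predictions never count as correct
        if predicted and predicted == ground_truth:
            correct += 1
        total += 1

    return correct / total if total else 0.0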
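
String equality is a strict form of Exact Match: "18" and "18.0" count as different answers. One possible refinement, sketched below, compares answers numerically when both sides parse as numbers. The helper names `to_number` and `answers_match` are hypothetical, and the set of characters stripped (commas, dollar signs, a trailing period) is an assumption about typical answer formatting.

```python
from decimal import Decimal, InvalidOperation

def to_number(text: str):
    """Parse an answer string such as '$1,234.50' into a Decimal, or None."""
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

def answers_match(predicted: str, ground_truth: str) -> bool:
    """Compare numerically when both sides parse; otherwise compare as strings."""
    p, g = to_number(predicted), to_number(ground_truth)
    if p is not None and g is not None:
        return p == g
    return predicted.strip() == ground_truth.strip()

print(answers_match("18", "18.0"))     # numerically equal
print(answers_match("$1,000", "1000")) # formatting differences ignored
print(answers_match("17", "18"))       # different values
```

Using `Decimal` rather than `float` avoids rounding surprises when answers contain decimal fractions.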

MT-Bench and Chatbot Arena

MT-Bench

MT-Bench evaluates multi-turn conversation quality using GPT-4 as a judge:

MT_BENCH_CATEGORIES = [
    "writing", "roleplay", "reasoning", "math",
    "coding", "extraction", "stem", "humanities"
]

MT_BENCH_JUDGE_PROMPT = """[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Rate on a scale of 1 to 10.

[Question]
{question}

[Assistant Response]
{response}

[Rating]
Provide your rating on a scale of 1 to 10. Explain your rating briefly."""

Chatbot Arena (Elo Rating)

Chatbot Arena uses blind pairwise comparisons with human voting to compute Elo ratings:

Elo Rating Update

R_{\text{new}} = R_{\text{old}} + K(S - E)

Here,

$R_{\text{new}}$ =updated rating
$R_{\text{old}}$ =current rating
$K$ =update factor (typically 32)
$S$ =actual score (1 for win, 0.5 for draw, 0 for loss)
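$E$ =expected score, $E = \frac{1}{1 + 10^{(R_{\text{opp}} - R_{\text{self}})/400}}$, where $R_{\text{self}}$ is the model's current rating ($R_{\text{old}}$) and $R_{\text{opp}}$ is the opponent's rating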
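
The update rule can be sketched in a few lines of Python. The function names and the starting rating of 1000 are illustrative assumptions; the formulas follow the definitions above.

```python
def expected_score(r_self: float, r_opp: float) -> float:
    """Expected score E of a player rated r_self against one rated r_opp."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_self) / 400))

def elo_update(r_self: float, r_opp: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one game; score is 1, 0.5, or 0."""
    return r_self + k * (score - expected_score(r_self, r_opp))

# Two hypothetical models start at the same rating; model A wins one comparison.
r_a, r_b = 1000.0, 1000.0
new_a = elo_update(r_a, r_b, score=1.0)
new_b = elo_update(r_b, r_a, score=0.0)
print(new_a, new_b)
```

With equal ratings the expected score is 0.5, so a win with K = 32 moves the winner up by 16 points and the loser down by the same amount. An upset win against a higher-rated opponent would produce a larger change.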

LLM-as-Judge Evaluation

Using GPT-4 or other strong models as evaluation judges:

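import torch

def llm_judge(
    question: str,
    response: str,
    criteria: dict,
    judge_model,
    judge_tokenizer
) -> dict:
    # One "- name: description" line per criterion
    criteria_text = "\n".join(f"- {k}: {v}" for k, v in criteria.items())
    judge_prompt = f"""You are an expert evaluator. Rate the following response on these criteria:

{criteria_text}

Question: {question}
Response: {response}

Provide ratings (1-10) for each criterion and an overall score.
Format your response as JSON."""

    inputs = judge_tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        # Greedy decoding keeps the judge's scores reproducible across runs
        output = judge_model.generate(**inputs, max_new_tokens=256, do_sample=False)

    # Decode only the newly generated tokens, not the prompt
    judgment = judge_tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )
    # parse_judgment is assumed to be defined elsewhere; it should turn the
    # judge's JSON text into a dict
    return parse_judgment(judgment)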

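LLM-as-judge has been shown to achieve >80% agreement with human evaluators on tasks like response quality assessment, which is comparable to the rate at which human evaluators agree with each other. However, LLM judges have known biases: they can favor their own outputs or those of similar models (self-enhancement bias), prefer longer answers (verbosity bias), and, in pairwise comparisons, be swayed by the order in which responses are presented (position bias).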
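
The `parse_judgment` helper called in `llm_judge` is not defined in this tutorial. A minimal sketch is shown below, assuming the judge returns a single JSON object that may be surrounded by extra prose; returning an empty dict on failure is a design assumption, not a requirement.

```python
import json
import re

def parse_judgment(judgment: str) -> dict:
    """Extract a JSON object from a judge's raw text output.

    Returns an empty dict when no valid JSON object is found, so callers can
    decide whether to retry or discard the sample.
    """
    # Judges often wrap JSON in prose, so take the span from the first "{"
    # to the last "}" rather than parsing the whole string.
    match = re.search(r"\{.*\}", judgment, flags=re.DOTALL)
    if not match:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}

raw = 'Here is my evaluation:\n{"accuracy": 8, "clarity": 7, "overall": 7.5}\nThanks.'
scores = parse_judgment(raw)
print(scores)
```

Malformed judge outputs are common in practice, so it is worth tracking how often the parser returns an empty result and treating a high failure rate as a signal that the judge prompt needs tightening.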

Evaluation Pitfalls and Limitations

Common Pitfalls

Data Contamination: Test data may appear in training data
Benchmark Gaming: Models can overfit to specific benchmark formats
Metric Sensitivity: Small changes in evaluation protocol can change rankings
Missing Capabilities: Benchmarks may not capture important real-world skills
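
A rough way to screen for data contamination is to look for long word n-grams that a test example shares with the training corpus. The sketch below assumes whitespace tokenization, lowercasing, and an in-memory corpus; the function names and the default n=8 are illustrative choices, and a real check on a large corpus would need a more scalable index.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, training_corpus, n: int = 8) -> float:
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    if not test_examples:
        return 0.0
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(test_examples)

# Toy data: the first test example overlaps with the training document
train_docs = ["the quick brown fox jumps over the lazy dog near the river"]
test_set = [
    "A quick brown fox jumps over the lazy dog today",
    "what is two plus two equal to in base ten",
]
rate = contamination_rate(test_set, train_docs, n=5)
print(f"flagged fraction: {rate:.2f}")
```

Exact n-gram matching only catches verbatim overlap; paraphrased or translated test items would pass this check unnoticed.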

Evaluation Anti-Patterns

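# BAD: Only evaluating on one benchmark
score = evaluate_gsm8k(model, tokenizer, test_set)

# GOOD: Comprehensive evaluation
# compute_perplexity loads its own model, so it takes a model name and raw text.
# evaluate_safety and evaluate_helpfulness are placeholders for harnesses not
# defined in this tutorial; the latter would wrap llm_judge over a prompt set.
evaluation_suite = {
    "perplexity": compute_perplexity(model_name, held_out_text),
    "mmlu": evaluate_mmlu(model, tokenizer, mmlu_test),
    "humaneval": evaluate_humaneval(model, tokenizer, humaneval),
    "gsm8k": evaluate_gsm8k(model, tokenizer, gsm8k_test),
    "safety": evaluate_safety(model, tokenizer, safety_test),
    "helpfulness": evaluate_helpfulness(model, tokenizer, helpfulness_test)
}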

Limitations of Automatic Metrics

Metric	Strengths	Weaknesses
Perplexity	Fast, consistent	Doesn't measure reasoning
Exact Match	Clear, objective	Misses partial credit
Pass@k	Task-specific	Expensive to compute
LLM Judge	Flexible, nuanced	Expensive, potential bias

Always evaluate LLMs on multiple benchmarks across different capability dimensions. No single benchmark captures all aspects of model quality. Combine automatic metrics with human evaluation for critical applications.

Building an Evaluation Pipeline

class LLMEvaluator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}

    def evaluate(self, benchmarks: dict):
        # benchmarks maps a name to (eval_fn, dataset); every eval_fn must
        # accept (model, tokenizer, dataset), as evaluate_mmlu and
        # evaluate_gsm8k do
        for name, (eval_fn, dataset) in benchmarks.items():
            print(f"Evaluating {name}...")
            self.results[name] = eval_fn(self.model, self.tokenizer, dataset)

        return self.results

    def compare_with_baseline(self, baseline_results: dict) -> dict:
        comparison = {}
        for benchmark, current in self.results.items():
            baseline = baseline_results.get(benchmark)
            # A delta is only defined for scalar scores such as accuracy
            if isinstance(current, (int, float)) and isinstance(baseline, (int, float)):
                comparison[benchmark] = {
                    "current": current,
                    "baseline": baseline,
                    "delta": current - baseline
                }
        return comparison

    def generate_report(self) -> str:
        report = "# LLM Evaluation Report\n\n"
        for benchmark, score in self.results.items():
            report += f"## {benchmark}\n"
            if isinstance(score, (int, float)):
                report += f"- Score: {score:.4f}\n\n"
            else:
                # Structured results, e.g. per-problem Pass@k dicts
                report += f"- Score: {score}\n\n"
        return report

Summary

Perplexity measures next-token prediction quality: PPL = exp(H)
MMLU evaluates knowledge across 57 academic subjects
HumanEval measures code generation capability via Pass@k
GSM8K tests multi-step mathematical reasoning
MT-Bench uses an LLM judge to score multi-turn conversations, while Chatbot Arena ranks models from blind human pairwise comparisons
LLM-as-judge provides scalable evaluation with >80% human agreement
Always use multiple benchmarks and combine automatic with human evaluation
Watch for data contamination and benchmark gaming

Practice Exercises

Perplexity Comparison: Compute perplexity for 3 different models on the WikiText-2 dataset. How does perplexity correlate with model size?
Benchmark Implementation: Implement a 5-shot MMLU evaluation for a small language model. Report accuracy by subject category.
Code Generation: Evaluate a model on HumanEval with k=1, k=10, and k=100. How does Pass@k improve with more samples?
LLM-as-Judge: Use GPT-4 to evaluate 50 responses from different models on the MT-Bench dataset. Compare the rankings with your own judgments.
Evaluation Pipeline: Build a complete evaluation pipeline that runs 4 different benchmarks and generates a comparison report.

What to Learn Next

-> LLM Safety and Red Teaming: Safety evaluation is a critical dimension beyond standard benchmarks.

-> LLM Inference Optimization: Optimizing inference affects latency and cost metrics measured in evaluation.

-> Building Production LLM Applications: Production evaluation combines automated metrics with human feedback.

-> Prompt Engineering: Prompt strategies directly impact evaluation performance across benchmarks.

-> In-Context Learning: Few-shot evaluation protocols rely on in-context learning capabilities.

-> Scaling Laws and Chinchilla: Understanding how model scale affects benchmark performance.
