Text Generation Evaluation
Evaluating text generation is fundamentally more challenging than classification because there is no single "correct" output. Multiple valid outputs exist for any given input, requiring diverse evaluation strategies.
The Evaluation Challenge
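Definition: Text Generation Evaluation Problem
Given a source $x$ and a set of reference outputs $\{y_1, \dots, y_k\}$, evaluate a generated output $\hat{y}$ along multiple dimensions, such as fluency, relevance, coherence, and informativeness.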
No single metric captures all dimensions simultaneously.
BLEU Score
BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text.
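Definition: BLEU Score
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights, and the brevity penalty is:
$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
with $c$ = candidate length and $r$ = reference length.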
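To make the definition concrete, here is a minimal from-scratch sketch of unsmoothed sentence-level BLEU. The helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `sentence_bleu_unsmoothed`) are illustrative and not part of any library. Taking the reference length closest to the candidate length when several references exist is an assumed convention.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    often as it occurs in the reference that contains it most often."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(references, candidate):
    """BP = 1 if c > r, else exp(1 - r/c). Assumption: r is the reference
    length closest to c (ties go to the shorter reference)."""
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def sentence_bleu_unsmoothed(references, candidate, max_n=4):
    """Unsmoothed BLEU with uniform weights 1 / max_n."""
    precisions = [modified_precision(references, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the geometric mean collapses to 0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, candidate) * math.exp(log_mean)


references = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "is", "on", "the", "mat"]
for n in range(1, 5):
    print(f"p_{n} = {modified_precision(references, candidate, n):.2f}")
print(f"BLEU-2 = {sentence_bleu_unsmoothed(references, candidate, max_n=2):.4f}")
print(f"BLEU-4 = {sentence_bleu_unsmoothed(references, candidate, max_n=4):.4f}")
```

On this example the clipped precisions are 1.0, 0.6, 0.25, and 0.0 for n = 1 to 4, so unsmoothed BLEU-4 is zero even though the candidate is reasonable. This is why smoothing is commonly applied when scoring short texts.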
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import numpy as np


def compute_bleu(references, candidates, max_n=4):
    """
    Compute corpus-level BLEU score.

    Args:
        references: one entry per candidate, either a single reference token
            sequence or a list of reference token sequences
        candidates: list of candidate token sequences
        max_n: maximum n-gram order
    """
    # NLTK expects a list of references for every candidate, so wrap any
    # entry that is a single token sequence
    refs = [ref if ref and isinstance(ref[0], (list, tuple)) else [ref] for ref in references]

    # Uniform weights over n-gram orders 1..max_n
    weights = tuple([1.0 / max_n] * max_n)
    # Smoothing avoids a zero score when some n-gram order has no matches
    smoothing = SmoothingFunction().method1
    bleu = corpus_bleu(
        refs,
        candidates,
        weights=weights,
        smoothing_function=smoothing
    )

    # Also compute individual n-gram scores
    individual = {}
    for n in range(1, max_n + 1):
        individual[f"bleu-{n}"] = corpus_bleu(
            refs, candidates,
            weights=tuple([1.0 if i == n-1 else 0.0 for i in range(max_n)]),
            smoothing_function=smoothing
        )
    return {"bleu": bleu, **individual}


# Example: one candidate scored against two references
references = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "is", "on", "the", "mat"]
scores = compute_bleu([references], [candidate])
print(f"BLEU: {scores['bleu']:.4f}")
ROUGE Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasizes recall, making it suitable for summarization.
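Definition: ROUGE-N and ROUGE-L
ROUGE-N is the n-gram recall of the candidate against the references:
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$
ROUGE-L uses Longest Common Subsequence (LCS). For a reference $X$ and a candidate $Y$:
$$R_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|X|}, \quad P_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|Y|}, \quad F_{\text{lcs}} = \frac{(1 + \beta^2) R_{\text{lcs}} P_{\text{lcs}}}{R_{\text{lcs}} + \beta^2 P_{\text{lcs}}}$$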
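The LCS-based score can be sketched in plain Python with a standard dynamic-programming table. This is an illustrative sketch, not the `rouge_score` implementation: the function names `lcs_length` and `rouge_l` are made up for this example, tokenization is a simple lowercase whitespace split, and no stemming is applied, so the numbers can differ from library output.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on one side
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L precision, recall, and F-score (assumes whitespace tokens, no stemming)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    fmeasure = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


result = rouge_l(
    "The cat sat on the mat and looked out the window",
    "The cat was sitting on the mat looking outside",
)
print({name: round(value, 3) for name, value in result.items()})
```

Here the longest common subsequence is "the cat on the mat" (5 tokens), giving precision 5/9, recall 5/11, and an F-score of 0.5 with `beta=1.0`.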
from rouge_score import rouge_scorer
import numpy as np

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_stemmer=True
)


def evaluate_rouge(references, candidate):
    """Compute ROUGE scores against multiple references."""
    all_scores = {}
    for ref in references:
        # score() takes the reference (target) first, then the candidate (prediction)
        scores = scorer.score(ref, candidate)
        for metric, score in scores.items():
            if metric not in all_scores:
                all_scores[metric] = {"precision": [], "recall": [], "fmeasure": []}
            all_scores[metric]["precision"].append(score.precision)
            all_scores[metric]["recall"].append(score.recall)
            all_scores[metric]["fmeasure"].append(score.fmeasure)

    # Average across references
    avg_scores = {}
    for metric in all_scores:
        avg_scores[metric] = {
            "precision": np.mean(all_scores[metric]["precision"]),
            "recall": np.mean(all_scores[metric]["recall"]),
            "fmeasure": np.mean(all_scores[metric]["fmeasure"]),
        }
    return avg_scores


# Example
reference = "The cat sat on the mat and looked out the window"
candidate = "The cat was sitting on the mat looking outside"
scores = evaluate_rouge([reference], candidate)
for metric, values in scores.items():
    print(f"{metric}: P={values['precision']:.3f} R={values['recall']:.3f} F={values['fmeasure']:.3f}")
BERTScore
BERTScore uses contextual embeddings from BERT to compute semantic similarity rather than surface-level n-gram matching.
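Definition: BERTScore
For a reference with tokens $y_1, \dots, y_m$ and a candidate with tokens $\hat{y}_1, \dots, \hat{y}_n$:
$$R_{\text{BERT}} = \frac{1}{m} \sum_{i=1}^{m} \max_{j} \text{sim}(y_i, \hat{y}_j), \quad P_{\text{BERT}} = \frac{1}{n} \sum_{j=1}^{n} \max_{i} \text{sim}(y_i, \hat{y}_j), \quad F_{\text{BERT}} = \frac{2 P_{\text{BERT}} R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
where token-level cosine similarities are computed between BERT embeddings:
$$\text{sim}(y_i, \hat{y}_j) = \frac{\mathbf{y}_i^\top \hat{\mathbf{y}}_j}{\|\mathbf{y}_i\| \, \|\hat{\mathbf{y}}_j\|}$$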
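The greedy matching step can be illustrated without loading a model. The sketch below assumes the token embeddings are already available as plain lists of floats. The 2-dimensional vectors are toy values chosen for readability, not real BERT outputs, and `cosine` and `greedy_match_scores` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 from per-token embeddings."""
    # Recall: match each reference token to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    # Precision: match each candidate token to its most similar reference token
    precision = sum(max(cosine(r, c) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumed values): two reference tokens, two candidate tokens
ref_emb = [[1.0, 0.0], [0.0, 1.0]]
cand_emb = [[1.0, 0.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(ref_emb, cand_emb)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Each token is matched independently to its best counterpart, so one candidate token may serve as the match for several reference tokens. The matching is greedy rather than a one-to-one alignment.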
from bert_score import score as bert_score


def evaluate_bertscore(references, candidates, lang="en", model_type="microsoft/deberta-xlarge-mnli"):
    """
    Compute BERTScore between candidates and references.
    """
    P, R, F1 = bert_score(
        candidates,
        references,
        lang=lang,
        model_type=model_type,
        verbose=True,
        rescale_with_baseline=True  # Spreads scores over a more readable range
    )
    results = {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
        "precision_std": P.std().item(),
        "recall_std": R.std().item(),
        "f1_std": F1.std().item(),
    }
    return results


# Example
references = ["The cat sat on the mat", "A feline rested on the rug"]
candidates = ["The cat was sitting on the mat", "A cat lay on the carpet"]
scores = evaluate_bertscore(references, candidates)
print(f"BERTScore F1: {scores['f1']:.4f} (±{scores['f1_std']:.4f})")
Perplexity
Perplexity measures how well a language model predicts the generated text.
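Definition: Perplexity
For a text with tokens $w_1, \dots, w_N$:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)$$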
Lower perplexity indicates the model assigns higher probability to the generated text.
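A small worked example shows how the formula behaves. It assumes the per-token probabilities assigned by some language model are already known. The probability values below are made up for illustration, and `perplexity_from_probs` is a hypothetical helper name.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that is uniformly unsure among 4 options at every step
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# A model that is confident about every token (made-up probabilities)
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])
print(f"uniform: {uniform:.2f}, confident: {confident:.2f}")
```

A perplexity of 4 for the uniform case reads as "the model is as uncertain as if it were choosing among 4 equally likely tokens at each step". Higher token probabilities push the value toward its minimum of 1.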
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


def compute_perplexity(text, model_name="gpt2", max_length=512):
    """Compute perplexity of text using a language model."""
    # Note: the model is reloaded on every call; cache it when scoring many texts
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    encodings = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
    input_ids = encodings.input_ids
    with torch.no_grad():
        # With labels set, the model returns the mean cross-entropy per predicted token
        outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    return torch.exp(loss).item()


# Compare quality of two texts
text_good = "The weather is pleasant today with clear blue skies."
text_bad = "Weather is today good sky blue clear."
ppl_good = compute_perplexity(text_good)
ppl_bad = compute_perplexity(text_bad)
print(f"Good text perplexity: {ppl_good:.2f}")
print(f"Bad text perplexity: {ppl_bad:.2f}")
Human Evaluation Framework
Automatic metrics correlate imperfectly with human judgment. Structured human evaluation remains the gold standard.
| Evaluation Method | Description | Best For |
|---|---|---|
| Likert Rating | 1-5 scale on dimensions | Detailed quality assessment |
| Paired Comparison | A/B preference judgments | Model comparison |
| Ranking | Order multiple outputs | Relative quality |
| Error Annotation | Categorize specific errors | Diagnostic analysis |
# Human evaluation framework
HUMAN_EVAL_TEMPLATE = {
    "dimensions": {
        "fluency": {
            "description": "Is the text grammatically correct and natural?",
            "scale": ["Incomprehensible", "Poor", "Acceptable", "Good", "Excellent"]
        },
        "relevance": {
            "description": "Does the output address the input appropriately?",
            "scale": ["Completely off", "Mostly off", "Partially relevant", "Mostly relevant", "Fully relevant"]
        },
        "coherence": {
            "description": "Is the text logically organized and consistent?",
            "scale": ["Incoherent", "Mostly disorganized", "Somewhat coherent", "Mostly coherent", "Fully coherent"]
        },
        "informativeness": {
            "description": "How much useful information does the output contain?",
            "scale": ["Empty", "Minimal", "Some", "Substantial", "Comprehensive"]
        }
    },
    "guidelines": """
    Rate each dimension on the 1-5 scale provided.
    Base your judgment on the overall quality of the generated text.
    Consider the context and intended use case.
    """
}
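Once ratings have been collected on these dimensions, they need to be aggregated. The sketch below is one simple way to do that, assuming each annotator returns a dict of 1-5 scores keyed by dimension name. The ratings shown are invented for illustration, and `summarize_ratings` is a hypothetical helper.

```python
from statistics import mean, stdev

# Hypothetical ratings: one dict per annotator for a single output (1-5 scale)
ratings = [
    {"fluency": 5, "relevance": 4, "coherence": 4, "informativeness": 3},
    {"fluency": 4, "relevance": 4, "coherence": 5, "informativeness": 3},
    {"fluency": 5, "relevance": 3, "coherence": 4, "informativeness": 2},
]


def summarize_ratings(ratings):
    """Mean and sample standard deviation of the scores for each dimension."""
    summary = {}
    for dim in ratings[0]:
        scores = [r[dim] for r in ratings]
        summary[dim] = {
            "mean": mean(scores),
            # stdev needs at least two annotators
            "std": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


summary = summarize_ratings(ratings)
for dim, stats in summary.items():
    print(f"{dim}: mean={stats['mean']:.2f} std={stats['std']:.2f}")
```

A large standard deviation on a dimension suggests annotators read the scale differently, which usually calls for clearer guidelines before the scores are trusted.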
Metric Comparison
| Metric | Type | Measures | Correlation with Human Judgment | Speed |
|---|---|---|---|---|
| BLEU | N-gram overlap | Precision-oriented | Moderate | Fast |
| ROUGE-1 | Unigram overlap | Recall-oriented | Moderate | Fast |
| ROUGE-L | LCS | Structure similarity | Moderate | Fast |
| BERTScore | Embedding similarity | Semantic similarity | High | Slow |
| Perplexity | LM probability | Fluency | Low-Moderate | Moderate |
| METEOR | Alignment | Balance P/R | Moderate-High | Moderate |
| CIDEr | TF-IDF weighted | Consensus | High | Moderate |
Best Practices
Key Principles for Evaluation:
- Use multiple metrics: no single metric captures all quality dimensions
- Always include human evaluation for final model selection
- Report confidence intervals, not just means
- Use statistical significance tests (paired t-test, bootstrap)
- Consider domain-specific metrics when applicable
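The advice on confidence intervals and significance testing can be sketched with a paired bootstrap over per-example scores. This is a minimal illustration, assuming both systems were scored on the same test examples. The function name `paired_bootstrap`, its parameters, and the sample scores are hypothetical, and the interval uses the simple percentile method.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Bootstrap the mean per-example difference (B minus A) between two systems.

    Returns the observed mean difference, a (1 - alpha) percentile confidence
    interval, and the fraction of resamples in which B scores higher than A.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same examples")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each A/B pair together
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    b_win_rate = sum(m > 0 for m in resampled_means) / n_resamples
    return {"mean_diff": sum(diffs) / n, "ci": (lower, upper), "b_win_rate": b_win_rate}


# Made-up per-example metric values for two systems on five test examples
scores_a = [0.50, 0.60, 0.55, 0.70, 0.65]
scores_b = [0.58, 0.72, 0.61, 0.79, 0.76]
result = paired_bootstrap(scores_a, scores_b, n_resamples=2000)
print(result)
```

If the confidence interval for the difference excludes zero, the improvement is unlikely to come from the particular choice of test examples. With only a handful of examples, as in this toy input, the interval should be read with caution.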
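# Complete evaluation pipeline
import numpy as np


class TextGenerationEvaluator:
    def __init__(self, metrics=None):
        self.metrics = metrics or ["bleu", "rouge", "bertscore", "perplexity"]

    def evaluate(self, references, candidates):
        """Score parallel lists of reference and candidate strings."""
        results = {}
        if "bleu" in self.metrics:
            # compute_bleu expects token sequences; whitespace tokenization for simplicity
            results["bleu"] = compute_bleu(
                [ref.split() for ref in references],
                [cand.split() for cand in candidates],
            )
        if "rouge" in self.metrics:
            rouge_scores = [evaluate_rouge([ref], cand) for ref, cand in zip(references, candidates)]
            # Report the mean F-measure per ROUGE variant across examples
            results["rouge"] = {
                metric: float(np.mean([s[metric]["fmeasure"] for s in rouge_scores]))
                for metric in rouge_scores[0]
            }
        if "bertscore" in self.metrics:
            results["bertscore"] = evaluate_bertscore(references, candidates)
        if "perplexity" in self.metrics:
            # Reference-free: score each candidate on its own and average
            results["perplexity"] = {
                "mean": float(np.mean([compute_perplexity(cand) for cand in candidates]))
            }
        return results

    def compare_models(self, refs, model_a_outputs, model_b_outputs):
        scores_a = self.evaluate(refs, model_a_outputs)
        scores_b = self.evaluate(refs, model_b_outputs)
        comparison = {}
        for metric in scores_a:
            comparison[metric] = {
                "model_a": scores_a[metric],
                "model_b": scores_b[metric],
                # Each metric is a dict of sub-scores, so subtract key by key
                "difference": {
                    key: scores_b[metric][key] - scores_a[metric][key]
                    for key in scores_a[metric]
                },
            }
        return comparison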
Key Takeaways
- BLEU works best for translation and tasks with clear reference outputs
- ROUGE excels at summarization evaluation
- BERTScore captures semantic similarity better than surface metrics
- Perplexity assesses fluency but not content quality
- Human evaluation is indispensable for final assessment