Text Generation Evaluation

Evaluating text generation is fundamentally more challenging than classification because there is no single "correct" output. Multiple valid outputs exist for any given input, requiring diverse evaluation strategies.

The Evaluation Challenge

Definition: Text Generation Evaluation Problem

Given a source $x$ and a set of reference outputs $\mathcal{R} = \{r_1, r_2, \ldots, r_k\}$, evaluate a generated output $g$ along multiple dimensions:

\text{Quality}(g) = f(\text{fluency}, \text{adequacy}, \text{diversity}, \text{coherence})

No single metric captures all dimensions simultaneously.

BLEU Score

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text.

Definition: BLEU Score

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights, and the brevity penalty is:

\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}

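with $c$ the total candidate length and $r$ the effective reference length (when a candidate has several references, the length of the reference closest in length to it).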
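
To make the definition concrete, here is a minimal from-scratch sketch of modified n-gram precision and the brevity penalty for a single candidate. The helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `sentence_bleu_unsmoothed`) are illustrative and do not come from any library. The sketch applies no smoothing, so it returns 0 whenever some n-gram order has no match.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped precision p_n: each candidate n-gram count is capped at the
    largest count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu_unsmoothed(candidate, references, max_n=4):
    """BLEU for one candidate with uniform weights w_n = 1/max_n, no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so the unsmoothed score is 0
    c = len(candidate)
    # Effective reference length: the reference closest in length to the candidate
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_avg)

refs = [
    "the cat sat on the mat".split(),
    "there is a cat on the mat".split(),
]
cand = "the cat is on the mat".split()

for n in range(1, 5):
    print(f"p_{n} = {modified_precision(cand, refs, n):.2f}")  # 1.00, 0.60, 0.25, 0.00
print(f"BLEU-2 = {sentence_bleu_unsmoothed(cand, refs, max_n=2):.4f}")  # sqrt(0.6), about 0.7746
print(f"BLEU-4 = {sentence_bleu_unsmoothed(cand, refs, max_n=4):.4f}")  # 0.0000
```

The candidate shares no 4-gram with either reference, so unsmoothed BLEU-4 collapses to zero even though the sentence is a reasonable paraphrase. This is why library implementations usually offer a smoothing option for short texts.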

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import numpy as np

def compute_bleu(references, candidates, max_n=4):
    """
    Compute corpus-level BLEU score.

    Args:
        references: one entry per candidate; each entry is either a single
            reference token sequence or a list of reference token sequences
        candidates: list of candidate token sequences
        max_n: maximum n-gram order
    """
    # NLTK expects a list of reference lists per candidate, so wrap single references
    refs = [[ref] if ref and isinstance(ref[0], str) else ref for ref in references]

    # Uniform weights w_n = 1/N over n-gram orders 1..max_n
    weights = tuple([1.0 / max_n] * max_n)

    # Smoothing avoids a zero score when some n-gram order has no matches
    smoothing = SmoothingFunction().method1

    bleu = corpus_bleu(
        refs,
        candidates,
        weights=weights,
        smoothing_function=smoothing
    )

    # Also compute individual n-gram scores (all weight on a single order)
    individual = {}
    for n in range(1, max_n + 1):
        individual[f"bleu-{n}"] = corpus_bleu(
            refs, candidates,
            weights=tuple([1.0 if i == n-1 else 0.0 for i in range(max_n)]),
            smoothing_function=smoothing
        )

    return {"bleu": bleu, **individual}

# Example: one candidate scored against two references
references = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# The outer list has one entry per candidate, matching the candidates list
scores = compute_bleu([references], [candidate])
print(f"BLEU: {scores['bleu']:.4f}")

ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasizes recall, making it suitable for summarization.

Definition: ROUGE-N and ROUGE-L

\text{ROUGE-N} = \frac{\sum_{s \in \text{Refs}} \sum_{\text{gram}_n \in s} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{s \in \text{Refs}} \sum_{\text{gram}_n \in s} \text{Count}(\text{gram}_n)}

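ROUGE-L uses the Longest Common Subsequence (LCS) between a reference $X$ of length $m$ and a candidate $Y$ of length $n$. LCS recall and precision are combined into an F-measure in which $\beta$ weights recall relative to precision ($\beta = 1$ gives the balanced F1):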

R_{lcs} = \frac{|LCS(X, Y)|}{m}, \quad P_{lcs} = \frac{|LCS(X, Y)|}{n}

F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}
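
The two formulas can be checked by hand with a small from-scratch sketch. The function names (`rouge_n_recall`, `lcs_length`, `rouge_l`) are illustrative, inputs are assumed to be pre-tokenized lists, and no stemming is applied, so values can differ from those of a library scorer.

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # A reference n-gram is matched at most as often as it occurs in the candidate
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / total

def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L recall, precision, and F-measure from the LCS length."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    recall = lcs / len(reference)      # |LCS| / m
    precision = lcs / len(candidate)   # |LCS| / n
    fmeasure = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

print(f"ROUGE-1 recall: {rouge_n_recall(reference, candidate, 1):.3f}")  # 5/6
print(f"ROUGE-2 recall: {rouge_n_recall(reference, candidate, 2):.3f}")  # 3/5
print(f"ROUGE-L F1:     {rouge_l(reference, candidate)['fmeasure']:.3f}")  # 5/6
```

Here the LCS is "the cat on the mat" (5 tokens), and both sequences have 6 tokens, so recall, precision, and F1 all equal 5/6.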

import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_stemmer=True
)

def evaluate_rouge(references, candidate):
    """Compute ROUGE scores against multiple references."""
    all_scores = {}

    for ref in references:
        scores = scorer.score(ref, candidate)
        for metric, score in scores.items():
            if metric not in all_scores:
                all_scores[metric] = {"precision": [], "recall": [], "fmeasure": []}
            all_scores[metric]["precision"].append(score.precision)
            all_scores[metric]["recall"].append(score.recall)
            all_scores[metric]["fmeasure"].append(score.fmeasure)

    # Average across references
    avg_scores = {}
    for metric in all_scores:
        avg_scores[metric] = {
            "precision": np.mean(all_scores[metric]["precision"]),
            "recall": np.mean(all_scores[metric]["recall"]),
            "fmeasure": np.mean(all_scores[metric]["fmeasure"]),
        }
    return avg_scores

# Example
reference = "The cat sat on the mat and looked out the window"
candidate = "The cat was sitting on the mat looking outside"

scores = evaluate_rouge([reference], candidate)
for metric, values in scores.items():
    print(f"{metric}: P={values['precision']:.3f} R={values['recall']:.3f} F={values['fmeasure']:.3f}")

BERTScore

BERTScore uses contextual embeddings from a pretrained Transformer encoder such as BERT to compute semantic similarity rather than surface-level n-gram matching.

Definition: BERTScore

\text{BERTScore} = F_1 = \frac{2 \cdot p_{\text{bert}} \cdot r_{\text{bert}}}{p_{\text{bert}} + r_{\text{bert}}}

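where $x$ is the tokenized reference, $\hat{x}$ is the tokenized candidate, and each token is greedily matched to its most similar token in the other sequence by cosine similarity of contextual embeddings:

p_{\text{bert}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \cos(x_i, \hat{x}_j), \quad r_{\text{bert}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \cos(x_i, \hat{x}_j)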
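
The greedy matching step can be illustrated without loading a model. The sketch below takes one embedding matrix for the reference tokens and one for the candidate tokens, and computes precision as the average best match of each candidate token and recall as the average best match of each reference token. The function name `greedy_match_scores` and the 2-dimensional toy vectors are assumptions made for illustration. The actual metric uses contextual embeddings from a pretrained model and optionally weights tokens by IDF.

```python
import numpy as np

def greedy_match_scores(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 from token embeddings.

    ref_emb:  array of shape (num_reference_tokens, dim)
    cand_emb: array of shape (num_candidate_tokens, dim)
    """
    # Normalize rows so that dot products are cosine similarities
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T  # shape (num_reference_tokens, num_candidate_tokens)

    recall = float(sim.max(axis=1).mean())     # best candidate match per reference token
    precision = float(sim.max(axis=0).mean())  # best reference match per candidate token
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f1

# Toy embeddings (hand-made, not produced by a model)
ref_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

p, r, f1 = greedy_match_scores(ref_emb, cand_emb)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Every reference token has an exact match in the candidate, so recall is 1. The extra candidate token only partially matches anything in the reference, which lowers precision.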

from bert_score import score as bert_score
import torch

def evaluate_bertscore(references, candidates, lang="en", model_type="microsoft/deberta-xlarge-mnli"):
    """
    Compute BERTScore between candidates and references.
    """
    P, R, F1 = bert_score(
        candidates,
        references,
        lang=lang,
        model_type=model_type,
        verbose=True,
        rescale_with_baseline=True  # Recommended for better correlation
    )

    results = {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
        "precision_std": P.std().item(),
        "recall_std": R.std().item(),
        "f1_std": F1.std().item(),
    }
    return results

# Example
references = ["The cat sat on the mat", "A feline rested on the rug"]
candidates = ["The cat was sitting on the mat", "A cat lay on the carpet"]

scores = evaluate_bertscore(references, candidates)
print(f"BERTScore F1: {scores['f1']:.4f} (±{scores['f1_std']:.4f})")

Perplexity

Perplexity measures how well a language model predicts the generated text.

Definition: Perplexity

\text{PPL}(g) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(g_i | g_{<i})\right)

Lower perplexity indicates the model assigns higher probability to the generated text.
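
As a quick numeric check of the formula, the sketch below computes perplexity directly from a list of per-token probabilities $P(g_i | g_{<i})$. The function name `perplexity_from_probs` and the probability values are made up for illustration. In practice the probabilities come from a language model.

```python
import math

def perplexity_from_probs(token_probs):
    """PPL = exp(-(1/N) * sum_i log p_i) for per-token probabilities p_i."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is equally unsure among 4 options at every step
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that assigns high probability to each observed token
confident = [0.9, 0.8, 0.95, 0.85]

print(f"uniform:   {perplexity_from_probs(uniform):.2f}")  # 4.00
print(f"confident: {perplexity_from_probs(confident):.2f}")
```

Perplexity is the inverse of the geometric mean of the token probabilities, which is why the uniform case comes out as exactly the number of equally likely choices.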

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def compute_perplexity(text, model_name="gpt2", max_length=512):
    """Compute perplexity of text using a language model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    return torch.exp(loss).item()

# Compare quality of two texts
text_good = "The weather is pleasant today with clear blue skies."
text_bad = "Weather is today good sky blue clear."

ppl_good = compute_perplexity(text_good)
ppl_bad = compute_perplexity(text_bad)

print(f"Good text perplexity: {ppl_good:.2f}")
print(f"Bad text perplexity: {ppl_bad:.2f}")

Human Evaluation Framework

Automatic metrics correlate imperfectly with human judgment. Structured human evaluation remains the gold standard.

Evaluation Method	Description	Best For
Likert Rating	1-5 scale on dimensions	Detailed quality assessment
Paired Comparison	A/B preference judgments	Model comparison
Ranking	Order multiple outputs	Relative quality
Error Annotation	Categorize specific errors	Diagnostic analysis

# Human evaluation framework
HUMAN_EVAL_TEMPLATE = {
    "dimensions": {
        "fluency": {
            "description": "Is the text grammatically correct and natural?",
            "scale": ["Incomprehensible", "Poor", "Acceptable", "Good", "Excellent"]
        },
        "relevance": {
            "description": "Does the output address the input appropriately?",
            "scale": ["Completely off", "Mostly off", "Partially relevant", "Mostly relevant", "Fully relevant"]
        },
        "coherence": {
            "description": "Is the text logically organized and consistent?",
            "scale": ["Incoherent", "Mostly disorganized", "Somewhat coherent", "Mostly coherent", "Fully coherent"]
        },
        "informativeness": {
            "description": "How much useful information does the output contain?",
            "scale": ["Empty", "Minimal", "Some", "Substantial", "Comprehensive"]
        }
    },
    "guidelines": """
    Rate each dimension on the 1-5 scale provided.
    Base your judgment on the overall quality of the generated text.
    Consider the context and intended use case.
    """
}
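
Once ratings are collected, they need to be aggregated per dimension. The following is a minimal sketch that assumes each annotator submits one integer score from 1 to 5 for every dimension. The function name `aggregate_ratings` and the example ratings are hypothetical and only illustrate the data shape.

```python
from statistics import mean, stdev

def aggregate_ratings(ratings):
    """Summarize Likert ratings per dimension.

    ratings: list of dicts, one per annotator, mapping dimension name
             to an integer score on the 1-5 scale.
    """
    summary = {}
    for dim in ratings[0]:
        scores = [r[dim] for r in ratings]
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"rating outside the 1-5 scale for {dim!r}")
        summary[dim] = {
            "mean": mean(scores),
            # Sample standard deviation; needs at least two annotators
            "std": stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        }
    return summary

# Hypothetical ratings from three annotators for one generated output
example_ratings = [
    {"fluency": 4, "relevance": 3, "coherence": 4, "informativeness": 3},
    {"fluency": 5, "relevance": 4, "coherence": 4, "informativeness": 2},
    {"fluency": 4, "relevance": 3, "coherence": 5, "informativeness": 3},
]

for dim, stats in aggregate_ratings(example_ratings).items():
    print(f"{dim}: mean={stats['mean']:.2f} std={stats['std']:.2f} (n={stats['n']})")
```

A large standard deviation on one dimension is a hint that annotators interpret its guideline differently, which is worth checking before comparing models.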

Metric Comparison

Metric	Type	Measures	Correlation with Human Judgment	Speed
BLEU	N-gram overlap	Precision-oriented	Moderate	Fast
ROUGE-1	Unigram overlap	Recall-oriented	Moderate	Fast
ROUGE-L	LCS	Structure similarity	Moderate	Fast
BERTScore	Embedding similarity	Semantic similarity	High	Slow
Perplexity	LM probability	Fluency	Low-Moderate	Moderate
METEOR	Alignment	Balance P/R	Moderate-High	Moderate
CIDEr	TF-IDF weighted	Consensus	High	Moderate

Best Practices

Key Principles for Evaluation:

Use multiple metrics — no single metric captures all quality dimensions
Always include human evaluation for final model selection
Report confidence intervals, not just means
Use statistical significance tests (paired t-test, bootstrap)
Consider domain-specific metrics when applicable
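
The advice on confidence intervals and significance tests can be sketched with the standard library alone. The code below assumes per-example scores (one number per test item) are already available for each model. The function names `bootstrap_ci` and `paired_bootstrap_pvalue`, the resample count, and the example scores are illustrative choices, not values from any benchmark.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap: fraction of resamples in which model B's mean
    score does not exceed model A's on the same resampled examples."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample examples with replacement
        if sum(sample) / n <= 0:
            not_better += 1
    return not_better / n_resamples

# Hypothetical per-example metric scores for two models on the same test items
scores_a = [0.30, 0.42, 0.35, 0.50, 0.28, 0.44, 0.39, 0.47]
scores_b = [0.33, 0.41, 0.40, 0.52, 0.31, 0.43, 0.45, 0.49]

low, high = bootstrap_ci(scores_b)
print(f"Model B mean 95% CI: [{low:.3f}, {high:.3f}]")
print(f"Paired bootstrap p (B not better than A): {paired_bootstrap_pvalue(scores_a, scores_b):.3f}")
```

Resampling the per-example differences, rather than each model's scores separately, keeps the pairing between the two models' outputs on the same input. This usually gives a more sensitive comparison.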

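# Complete evaluation pipeline; relies on compute_bleu, evaluate_rouge,
# evaluate_bertscore, and compute_perplexity defined above
import numpy as np

class TextGenerationEvaluator:
    def __init__(self, metrics=None):
        self.metrics = metrics or ["bleu", "rouge", "bertscore", "perplexity"]

    def evaluate(self, references, candidates):
        """Score candidate strings against one reference string each.

        Every entry in the result is a flat dict of floats so that
        compare_models can subtract scores key by key.
        """
        results = {}

        if "bleu" in self.metrics:
            # BLEU works on tokens; whitespace splitting is a simplification
            results["bleu"] = compute_bleu(
                [ref.split() for ref in references],
                [cand.split() for cand in candidates],
            )

        if "rouge" in self.metrics:
            per_example = [evaluate_rouge([ref], cand) for ref, cand in zip(references, candidates)]
            # Average the F-measure of each ROUGE variant over all examples
            results["rouge"] = {
                name: float(np.mean([scores[name]["fmeasure"] for scores in per_example]))
                for name in per_example[0]
            }

        if "bertscore" in self.metrics:
            results["bertscore"] = evaluate_bertscore(references, candidates)

        if "perplexity" in self.metrics:
            # Reference-free: scores only the candidates (reloads the LM on each call)
            ppls = [compute_perplexity(cand) for cand in candidates]
            results["perplexity"] = {"mean": float(np.mean(ppls))}

        return results

    def compare_models(self, refs, model_a_outputs, model_b_outputs):
        scores_a = self.evaluate(refs, model_a_outputs)
        scores_b = self.evaluate(refs, model_b_outputs)

        comparison = {}
        for metric, values_a in scores_a.items():
            values_b = scores_b[metric]
            # Each metric holds a dict of sub-scores, so compare them key by key
            comparison[metric] = {
                key: {
                    "model_a": values_a[key],
                    "model_b": values_b[key],
                    "difference": values_b[key] - values_a[key],
                }
                for key in values_a
            }
        return comparison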

Key Takeaways

BLEU works best for translation and tasks with clear reference outputs
ROUGE excels at summarization evaluation
BERTScore captures semantic similarity better than surface metrics
Perplexity assesses fluency but not content quality
Human evaluation is indispensable for final assessment
