
Text Generation Evaluation


Evaluating text generation is fundamentally more challenging than classification because there is no single "correct" output. Multiple valid outputs exist for any given input, requiring diverse evaluation strategies.

The Evaluation Challenge

Definition: Text Generation Evaluation Problem

Given a source $x$ and a set of reference outputs $\mathcal{R} = \{r_1, r_2, \ldots, r_k\}$, evaluate a generated output $g$ along multiple dimensions:

$$\text{Quality}(g) = f(\text{fluency}, \text{adequacy}, \text{diversity}, \text{coherence})$$

No single metric captures all dimensions simultaneously.


BLEU Score

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text.

Definition: BLEU Score

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights, and the brevity penalty is:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}$$

with $c$ the candidate length and $r$ the reference length.
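
To make the definition concrete, here is a minimal from-scratch sketch of clipped n-gram precision, the brevity penalty, and unsmoothed sentence-level BLEU. The function names are illustrative and not part of any library, and the choice of the reference length closest to the candidate as $r$ is an assumption taken from common practice; library implementations also differ in smoothing and tokenization.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n for one candidate against its references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears
    # in the single reference where it is most frequent.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu_sketch(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1/max_n."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU collapses to 0
    c = len(candidate)
    # Assumption: r is the reference length closest to c (ties go to the shorter one)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_mean)
```

For a short candidate with no matching 4-gram, the unsmoothed score is 0, which is why smoothing is usually applied to sentence-level or small-corpus BLEU.
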

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
import numpy as np

def compute_bleu(references, candidates, max_n=4):
    """
    Compute corpus-level BLEU score.

    Args:
        references: one entry per candidate; each entry is either a single
            reference token sequence or a list of reference token sequences
        candidates: list of candidate token sequences
        max_n: maximum n-gram order
    """
    # NLTK expects a list of reference lists, one per candidate, so wrap
    # any entry that is a single token sequence
    refs = [
        ref if ref and isinstance(ref[0], (list, tuple)) else [ref]
        for ref in references
    ]

    # Uniform weights over 1..max_n-grams (cumulative BLEU)
    weights = tuple([1.0 / max_n] * max_n)

    # method1 adds a small epsilon to zero n-gram counts so short texts do not score 0
    smoothing = SmoothingFunction().method1

    bleu = corpus_bleu(
        refs,
        candidates,
        weights=weights,
        smoothing_function=smoothing
    )

    # Also compute individual n-gram scores (all weight on a single order)
    individual = {}
    for n in range(1, max_n + 1):
        individual[f"bleu-{n}"] = corpus_bleu(
            refs, candidates,
            weights=tuple([1.0 if i == n-1 else 0.0 for i in range(max_n)]),
            smoothing_function=smoothing
        )

    return {"bleu": bleu, **individual}

# Example: one candidate scored against two references
references = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["there", "is", "a", "cat", "on", "the", "mat"],
]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# The outer list pairs the group of references with the single candidate
scores = compute_bleu([references], [candidate])
print(f"BLEU: {scores['bleu']:.4f}")

ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) emphasizes recall, making it suitable for summarization.

Definition: ROUGE-N and ROUGE-L

$$\text{ROUGE-N} = \frac{\sum_{s \in \text{Refs}} \sum_{\text{gram}_n \in s} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{s \in \text{Refs}} \sum_{\text{gram}_n \in s} \text{Count}(\text{gram}_n)}$$

ROUGE-L uses the Longest Common Subsequence (LCS) between a reference $X$ of length $m$ and a candidate $Y$ of length $n$:

$$R_{lcs} = \frac{|LCS(X, Y)|}{m}, \quad P_{lcs} = \frac{|LCS(X, Y)|}{n}$$

$$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

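
The ROUGE-L quantities above can be sketched directly with a dynamic-programming LCS. This is an illustrative implementation with hypothetical function names; it assumes pre-tokenized input and applies no stemming or normalization, so its numbers may differ from those of a full ROUGE package.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j]
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate, beta=1.0):
    """Return (precision, recall, F) of ROUGE-L for two token lists."""
    if not reference or not candidate:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)      # R_lcs = |LCS| / m
    precision = lcs / len(candidate)   # P_lcs = |LCS| / n
    f = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return precision, recall, f
```

Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L rewards a candidate that preserves sentence structure even when individual words are inserted or replaced.
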
import numpy as np  # used below to average scores across references
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_stemmer=True
)

def evaluate_rouge(references, candidate):
    """Compute ROUGE scores against multiple references."""
    all_scores = {}

    for ref in references:
        scores = scorer.score(ref, candidate)
        for metric, score in scores.items():
            if metric not in all_scores:
                all_scores[metric] = {"precision": [], "recall": [], "fmeasure": []}
            all_scores[metric]["precision"].append(score.precision)
            all_scores[metric]["recall"].append(score.recall)
            all_scores[metric]["fmeasure"].append(score.fmeasure)

    # Average across references
    avg_scores = {}
    for metric in all_scores:
        avg_scores[metric] = {
            "precision": np.mean(all_scores[metric]["precision"]),
            "recall": np.mean(all_scores[metric]["recall"]),
            "fmeasure": np.mean(all_scores[metric]["fmeasure"]),
        }
    return avg_scores

# Example
reference = "The cat sat on the mat and looked out the window"
candidate = "The cat was sitting on the mat looking outside"

scores = evaluate_rouge([reference], candidate)
for metric, values in scores.items():
    print(f"{metric}: P={values['precision']:.3f} R={values['recall']:.3f} F={values['fmeasure']:.3f}")

BERTScore

BERTScore uses contextual embeddings from BERT to compute semantic similarity rather than surface-level n-gram matching.

Definition: BERTScore

$$\text{BERTScore} = F_1 = \frac{2 \cdot p_{\text{bert}} \cdot r_{\text{bert}}}{p_{\text{bert}} + r_{\text{bert}}}$$

where precision and recall are built from token-level cosine similarities between BERT embeddings. With $x$ the reference tokens and $\hat{x}$ the candidate tokens, each token is greedily matched to its most similar token on the other side:

$$p_{\text{bert}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \cos(x_i, \hat{x}_j), \quad r_{\text{bert}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \cos(x_i, \hat{x}_j)$$

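
The greedy matching step can be illustrated on plain arrays, independent of any particular encoder. The sketch below assumes the token embeddings have already been computed; the function name is hypothetical, and it omits the optional IDF weighting and baseline rescaling that a full implementation may apply.

```python
import numpy as np

def greedy_match_scores(ref_emb, cand_emb):
    """Precision, recall, F1 from token embeddings via greedy cosine matching.

    ref_emb:  array of shape (num_reference_tokens, dim)
    cand_emb: array of shape (num_candidate_tokens, dim)
    """
    # Normalize rows so dot products are cosine similarities
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T  # sim[i, j] = cos(reference token i, candidate token j)

    recall = sim.max(axis=1).mean()     # best candidate match per reference token
    precision = sim.max(axis=0).mean()  # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f1)

# Toy 2-dimensional "embeddings", for illustration only
ref_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_emb = np.array([[1.0, 0.0], [1.0, 1.0]])
print(greedy_match_scores(ref_emb, cand_emb))
```

Because every token is matched to its nearest neighbor in embedding space, a paraphrase can score well even when it shares no surface n-grams with the reference.
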
from bert_score import score as bert_score
import torch

def evaluate_bertscore(references, candidates, lang="en", model_type="microsoft/deberta-xlarge-mnli"):
    """
    Compute BERTScore between candidates and references.
    """
    P, R, F1 = bert_score(
        candidates,
        references,
        lang=lang,
        model_type=model_type,
        verbose=True,
        rescale_with_baseline=True  # Recommended for better correlation
    )

    results = {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item(),
        "precision_std": P.std().item(),
        "recall_std": R.std().item(),
        "f1_std": F1.std().item(),
    }
    return results

# Example
references = ["The cat sat on the mat", "A feline rested on the rug"]
candidates = ["The cat was sitting on the mat", "A cat lay on the carpet"]

scores = evaluate_bertscore(references, candidates)
print(f"BERTScore F1: {scores['f1']:.4f} (±{scores['f1_std']:.4f})")

Perplexity

Perplexity measures how well a language model predicts the generated text.

Definition: Perplexity

$$\text{PPL}(g) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(g_i \mid g_{<i})\right)$$

Lower perplexity indicates the model assigns higher probability to the generated text.
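
As a small worked example of the formula, the sketch below computes perplexity from a list of per-token natural-log probabilities. The function name and the input format are assumptions made for illustration; in practice the log probabilities come from a language model.

```python
import math

def perplexity_from_log_probs(log_probs):
    """exp of the negative mean natural-log probability of the tokens."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 3
print(perplexity_from_log_probs(uniform))
```

Perplexity is the inverse of the geometric mean of the token probabilities, so one very unlikely token can raise it sharply.
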

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def compute_perplexity(text, model_name="gpt2", max_length=512):
    """Compute perplexity of text using a language model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    return torch.exp(loss).item()

# Compare quality of two texts
text_good = "The weather is pleasant today with clear blue skies."
text_bad = "Weather is today good sky blue clear."

ppl_good = compute_perplexity(text_good)
ppl_bad = compute_perplexity(text_bad)

print(f"Good text perplexity: {ppl_good:.2f}")
print(f"Bad text perplexity: {ppl_bad:.2f}")

Human Evaluation Framework

Automatic metrics correlate imperfectly with human judgment. Structured human evaluation remains the gold standard.

| Evaluation Method | Description | Best For |
|---|---|---|
| Likert Rating | 1-5 scale on dimensions | Detailed quality assessment |
| Paired Comparison | A/B preference judgments | Model comparison |
| Ranking | Order multiple outputs | Relative quality |
| Error Annotation | Categorize specific errors | Diagnostic analysis |

# Human evaluation framework
HUMAN_EVAL_TEMPLATE = {
    "dimensions": {
        "fluency": {
            "description": "Is the text grammatically correct and natural?",
            "scale": ["Incomprehensible", "Poor", "Acceptable", "Good", "Excellent"]
        },
        "relevance": {
            "description": "Does the output address the input appropriately?",
            "scale": ["Completely off", "Mostly off", "Partially relevant", "Mostly relevant", "Fully relevant"]
        },
        "coherence": {
            "description": "Is the text logically organized and consistent?",
            "scale": ["Incoherent", "Mostly disorganized", "Somewhat coherent", "Mostly coherent", "Fully coherent"]
        },
        "informativeness": {
            "description": "How much useful information does the output contain?",
            "scale": ["Empty", "Minimal", "Some", "Substantial", "Comprehensive"]
        }
    },
    "guidelines": """
    Rate each dimension on the 1-5 scale provided.
    Base your judgment on the overall quality of the generated text.
    Consider the context and intended use case.
    """
}
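
Once ratings are collected, they need to be summarized per dimension. The sketch below shows one simple way to do that; the input format (one dict of 1-5 ratings per judgment) and the function name are assumptions for illustration, not part of the template above.

```python
from statistics import mean, stdev

def summarize_ratings(ratings, dimensions):
    """Mean and standard deviation of 1-5 ratings per dimension.

    ratings: list of dicts, one per (annotator, item) judgment,
             e.g. {"fluency": 4, "relevance": 5}
    dimensions: names of the dimensions to summarize
    """
    summary = {}
    for dim in dimensions:
        values = [r[dim] for r in ratings if dim in r]
        if any(not 1 <= v <= 5 for v in values):
            raise ValueError(f"rating outside 1-5 for {dim}")
        summary[dim] = {
            "n": len(values),
            "mean": mean(values) if values else None,
            # Sample standard deviation needs at least two ratings
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary

ratings = [
    {"fluency": 4, "relevance": 5},
    {"fluency": 2, "relevance": 5},
]
print(summarize_ratings(ratings, ["fluency", "relevance"]))
```

Reporting the spread next to the mean makes it visible when annotators disagree on a dimension, which a mean alone would hide.
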

Metric Comparison

| Metric | Type | Measures | Correlation with Human Judgment | Speed |
|---|---|---|---|---|
| BLEU | N-gram overlap | Precision-oriented | Moderate | Fast |
| ROUGE-1 | Unigram overlap | Recall-oriented | Moderate | Fast |
| ROUGE-L | LCS | Structure similarity | Moderate | Fast |
| BERTScore | Embedding similarity | Semantic similarity | High | Slow |
| Perplexity | LM probability | Fluency | Low-Moderate | Moderate |
| METEOR | Alignment | Balance P/R | Moderate-High | Moderate |
| CIDEr | TF-IDF weighted | Consensus | High | Moderate |

Best Practices

Key Principles for Evaluation:

  1. Use multiple metrics: no single metric captures all quality dimensions
  2. Always include human evaluation for final model selection
  3. Report confidence intervals, not just means
  4. Use statistical significance tests (paired t-test, bootstrap)
  5. Consider domain-specific metrics when applicable
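
Principles 3 and 4 can be combined in a paired bootstrap over per-example scores. The sketch below is one possible formulation, with a hypothetical function name and default settings chosen for illustration; it assumes both systems were scored on the same test items in the same order.

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=1000, alpha=0.05, seed=0):
    """Bootstrap the mean per-example difference (B minus A) between two systems.

    scores_a, scores_b: per-example scores on the same test items, same order.
    Returns (mean_difference, ci_low, ci_high, fraction_of_resamples_where_B_wins).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(num_samples):
        # Resample test items with replacement, keeping each A/B pair together
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval over the resampled mean differences
    low = means[int((alpha / 2) * num_samples)]
    high = means[min(num_samples - 1, int((1 - alpha / 2) * num_samples))]
    wins = sum(m > 0 for m in means) / num_samples
    return sum(diffs) / n, low, high, wins
```

If the interval excludes zero, the observed difference is unlikely to be an artifact of which test items were sampled. Resampling pairs rather than each system separately removes the variance that comes from some items simply being harder than others.
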
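
# Complete evaluation pipeline
# Assumes compute_bleu, evaluate_rouge, evaluate_bertscore, and compute_perplexity
# from the sections above are in scope, and that inputs are lists of strings.
import numpy as np

class TextGenerationEvaluator:
    def __init__(self, metrics=None):
        self.metrics = metrics or ["bleu", "rouge", "bertscore", "perplexity"]

    def evaluate(self, references, candidates):
        """Return {metric_group: {score_name: float}} for paired references/candidates."""
        results = {}

        if "bleu" in self.metrics:
            # BLEU operates on token sequences; whitespace tokenization keeps this simple
            results["bleu"] = compute_bleu(
                [ref.split() for ref in references],
                [cand.split() for cand in candidates],
            )

        if "rouge" in self.metrics:
            rouge_scores = [evaluate_rouge([ref], cand) for ref, cand in zip(references, candidates)]
            # Average the F-measure of each ROUGE variant over all examples
            results["rouge"] = {
                name: float(np.mean([s[name]["fmeasure"] for s in rouge_scores]))
                for name in rouge_scores[0]
            }

        if "bertscore" in self.metrics:
            results["bertscore"] = evaluate_bertscore(references, candidates)

        if "perplexity" in self.metrics:
            # Reference-free: mean perplexity of the candidates alone
            results["perplexity"] = {
                "mean": float(np.mean([compute_perplexity(c) for c in candidates]))
            }

        return results

    def compare_models(self, refs, model_a_outputs, model_b_outputs):
        scores_a = self.evaluate(refs, model_a_outputs)
        scores_b = self.evaluate(refs, model_b_outputs)

        comparison = {}
        for metric in scores_a:
            comparison[metric] = {
                "model_a": scores_a[metric],
                "model_b": scores_b[metric],
                # Scores are dicts, so take the difference per sub-score
                "difference": {
                    key: scores_b[metric][key] - scores_a[metric][key]
                    for key in scores_a[metric]
                },
            }
        return comparison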

Key Takeaways

  • BLEU works best for translation and tasks with clear reference outputs
  • ROUGE excels at summarization evaluation
  • BERTScore captures semantic similarity better than surface metrics
  • Perplexity assesses fluency but not content quality
  • Human evaluation is indispensable for final assessment