NLP Evaluation Metrics

Choosing the right evaluation metric is critical for measuring model performance and guiding improvement. Different NLP tasks require different evaluation approaches.

Metric Selection Framework

| Task Type | Primary Metric | Secondary Metrics | When to Use |
| --- | --- | --- | --- |
| Classification | F1 / Accuracy | Precision, Recall, AUC | Balanced/imbalanced data |
| Named Entity Recognition | Token-level F1 | Span-level F1, Exact Match | Sequence labeling |
| Machine Translation | BLEU | METEOR, chrF, COMET | Translation quality |
| Summarization | ROUGE | BERTScore, SummaC | Content coverage |
| Question Answering | Exact Match | F1, BERTScore | Extractive/generative |
| Language Modeling | Perplexity | Bits per byte | Fluency assessment |

Classification Metrics

Confusion Matrix Metrics

Definition: Core Classification Metrics

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score
)
import numpy as np

def comprehensive_classification_metrics(y_true, y_pred, y_prob=None, average="macro"):
    """Compute comprehensive classification metrics."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }

    if y_prob is not None:
        try:
            metrics["auc_roc"] = roc_auc_score(y_true, y_prob, multi_class="ovr")
        except ValueError:
            metrics["auc_roc"] = None

    # Per-class metrics
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    metrics["per_class"] = report

    # Confusion matrix
    metrics["confusion_matrix"] = confusion_matrix(y_true, y_pred).tolist()

    return metrics

# Example
y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]

results = comprehensive_classification_metrics(y_true, y_pred)
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"F1 (macro): {results['f1']:.4f}")

Macro vs Micro vs Weighted Averaging

| Strategy | Formula | Best For |
| --- | --- | --- |
| Macro | $\frac{1}{C}\sum_{i=1}^{C} \text{metric}_i$ | Equal class importance |
| Micro | $\frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}$ (shown for precision) | Sample-level importance |
| Weighted | $\sum_i \frac{n_i}{N} \cdot \text{metric}_i$ | Proportional to support |
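
To make the three averaging strategies concrete, here is a minimal from-scratch sketch in plain Python, applied to F1 on the example labels used earlier. The helper names (`per_class_counts`, `f1_from_counts`, `averaged_f1`) are illustrative and not part of any library; in practice scikit-learn's `f1_score(..., average=...)` covers the same cases.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in y_true or y_pred."""
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)  # n_i: number of true samples per class
    n = len(y_true)
    per_class = {label: f1_from_counts(*c) for label, c in counts.items()}

    # Macro: unweighted mean of per-class scores.
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool TP/FP/FN over all classes, then compute the score once.
    micro = f1_from_counts(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    # Weighted: per-class scores weighted by class support n_i / N.
    weighted = sum(support[label] / n * f1 for label, f1 in per_class.items())
    return {"per_class": per_class, "macro": macro, "micro": micro, "weighted": weighted}

y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]
scores = averaged_f1(y_true, y_pred)
print({k: round(v, 3) for k, v in scores.items() if k != "per_class"})
# macro ≈ 0.622, micro = 0.625, weighted = 0.600
```

On this data the micro average equals accuracy (5/8), which is expected for single-label multiclass problems. Macro and weighted differ because the classes have different support and class 1 has the lowest F1.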

Perplexity

Perplexity measures how well a language model predicts a sample.

Definition: Perplexity

$$\text{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-1/N} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)$$

Lower perplexity indicates better predictive performance. A uniform distribution over a vocabulary $V$ has perplexity $|V|$.
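
The definition can be checked numerically with a small sketch that works directly on per-token conditional probabilities. The function name `perplexity_from_probs` and the probability values are made up for illustration; a real model would supply $P(w_i \mid w_{<i})$.

```python
import math

def perplexity_from_probs(token_probs):
    """PPL = exp(-(1/N) * sum(log p_i)) for conditional token probabilities p_i."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Uniform model over a 4-word vocabulary: every token gets probability 1/4,
# so the perplexity equals the vocabulary size.
uniform_ppl = perplexity_from_probs([0.25] * 10)

# Hypothetical probabilities from a model that is more confident per token.
confident_probs = [0.9, 0.8, 0.95, 0.7]
confident_ppl = perplexity_from_probs(confident_probs)

# The product form P(w_1..w_N)^(-1/N) gives the same value.
product_form = math.prod(confident_probs) ** (-1 / len(confident_probs))

print(f"uniform: {uniform_ppl:.3f}, confident: {confident_ppl:.3f}")
```

Working in log space, as the exponential form does, avoids the numerical underflow that the product form runs into on long sequences.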

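import math

import numpy as np
import torch

def compute_perplexity(model, tokenizer, text, max_length=512):
    """Compute perplexity of text using a causal language model.

    Text longer than max_length tokens is truncated, so only the first
    max_length tokens are scored.
    """
    encodings = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
    input_ids = encodings.input_ids.to(model.device)

    with torch.no_grad():
        # With labels=input_ids, Hugging Face-style causal LMs return the mean
        # next-token cross-entropy (natural log) as outputs.loss.
        outputs = model(input_ids, labels=input_ids)
        neg_log_likelihood = outputs.loss

    return math.exp(neg_log_likelihood.item())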

def compute_dataset_perplexity(model, tokenizer, texts, max_length=512):
    """Summarize per-text perplexity (each text weighted equally, regardless of length)."""
    perplexities = []
    for text in texts:
        ppl = compute_perplexity(model, tokenizer, text, max_length)
        perplexities.append(ppl)

    return {
        "mean_ppl": np.mean(perplexities),
        "std_ppl": np.std(perplexities),
        "median_ppl": np.median(perplexities),
        "min_ppl": min(perplexities),
        "max_ppl": max(perplexities),
    }
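
Averaging per-text perplexities gives a short text the same weight as a long one. A common alternative is to pool the negative log-likelihood over all tokens before exponentiating. The sketch below contrasts the two, assuming each text's loss is a mean over its predicted tokens; the function names and the numbers are hypothetical.

```python
import math

def corpus_perplexity(mean_nlls, token_counts):
    """Token-weighted perplexity: exp(total NLL / total predicted tokens)."""
    total_nll = sum(nll * n for nll, n in zip(mean_nlls, token_counts))
    total_tokens = sum(token_counts)
    return math.exp(total_nll / total_tokens)

def mean_of_perplexities(mean_nlls):
    """Unweighted mean of per-text perplexities."""
    return sum(math.exp(nll) for nll in mean_nlls) / len(mean_nlls)

# Hypothetical values: one short, hard text (PPL 100 over 10 tokens)
# and one long, easy text (PPL 10 over 990 tokens).
mean_nlls = [math.log(100.0), math.log(10.0)]
token_counts = [10, 990]

pooled = corpus_perplexity(mean_nlls, token_counts)
averaged = mean_of_perplexities(mean_nlls)
print(f"pooled: {pooled:.2f}, mean of per-text PPL: {averaged:.2f}")
```

Here the pooled value stays close to 10 because almost all tokens come from the easy text, while the unweighted mean is pulled up to 55 by the single short text. Which one to report depends on whether texts or tokens are the unit of interest.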

BLEU Score

BLEU measures n-gram precision between generated and reference translations.

Definition: BLEU Score

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n$ are the n-gram weights (typically uniform, $w_n = 1/N$ with $N = 4$), $c$ is the candidate length, $r$ is the reference length, and the brevity penalty is:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \leq r \end{cases}$$

| BLEU Range (0-100 scale) | Quality | Typical Use |
| --- | --- | --- |
| 0-10 | Poor | Random/baseline outputs |
| 10-20 | Below average | Weak models |
| 20-30 | Adequate | Production baselines |
| 30-40 | Good | Strong models |
| 40-50 | Excellent | Near-human quality |
| 50+ | Outstanding | Domain-specific tasks |
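
The BLEU formula can be traced step by step with a minimal sentence-level sketch: one reference, whitespace tokens, uniform weights, no smoothing. The function names are illustrative. Published scores are normally corpus-level and computed with a standard tool such as sacreBLEU, so this is for intuition only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU in [0, 1] with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is zero here
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

bleu2 = sentence_bleu(candidate, reference, max_n=2)  # sqrt(5/6 * 3/5) ≈ 0.707
bleu4 = sentence_bleu(candidate, reference)           # 0.0: no 4-gram matches
print(f"BLEU-2: {bleu2:.3f}, BLEU-4: {bleu4:.3f}")
```

The zero BLEU-4 for a candidate that differs by one word shows why sentence-level BLEU is usually smoothed and why the metric is most meaningful at corpus level. Multiply by 100 to compare against the range table.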

ROUGE Metrics

| Metric | Measures | Formula | Best For |
| --- | --- | --- | --- |
| ROUGE-1 | Unigram overlap | Unigram recall/precision | Content coverage |
| ROUGE-2 | Bigram overlap | Bigram recall/precision | Fluency |
| ROUGE-L | LCS-based | Longest common subsequence | Structure similarity |
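
As a rough illustration of the three variants, the sketch below computes ROUGE-N and ROUGE-L precision, recall, and F1 on whitespace tokens. The function names are illustrative. Standard ROUGE implementations add steps such as stemming and sentence splitting, so scores from this sketch will not match them exactly.

```python
from collections import Counter

def _prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count and the two totals."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f}

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: n-gram overlap, counting each n-gram at most min(cand, ref) times."""
    cand, ref = _ngrams(candidate, n), _ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection takes the minimum count
    return _prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L: LCS length relative to candidate and reference lengths."""
    return _prf(lcs_length(candidate, reference), len(candidate), len(reference))

reference = "the cat was sitting on the mat".split()
candidate = "the cat sat on the mat".split()

r1 = rouge_n(candidate, reference, n=1)  # overlap 5: P = 5/6, R = 5/7
r2 = rouge_n(candidate, reference, n=2)  # overlap 3: P = 3/5, R = 3/6
rl = rouge_l(candidate, reference)       # LCS "the cat on the mat": P = 5/6, R = 5/7
print(r1, r2, rl, sep="\n")
```

ROUGE-L rewards matching words that appear in the same order without requiring them to be adjacent, which is why it can equal ROUGE-1 here while ROUGE-2 is lower.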

BERTScore

BERTScore computes semantic similarity using contextual embeddings.

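Definition: BERTScore

$$\text{BERTScore} = F_1 = \frac{2 \cdot p_{\text{bert}} \cdot r_{\text{bert}}}{p_{\text{bert}} + r_{\text{bert}}}$$

where $x$ denotes the reference token embeddings, $\hat{x}$ the candidate token embeddings (all normalized to unit length, so each inner product is a cosine similarity), and:

$$p_{\text{bert}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^T \hat{x}_j, \quad r_{\text{bert}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^T \hat{x}_j$$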
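
The greedy matching behind BERTScore can be sketched without a model by using tiny hand-written vectors in place of contextual embeddings. Everything here is illustrative: the function names are made up and the 2-dimensional vectors are not real BERT outputs. Precision averages the best match for each candidate token, and recall does the same for each reference token.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_match_score(source, target):
    """Average, over source vectors, of the best similarity to any target vector."""
    return sum(max(dot(s, t) for t in target) for s in source) / len(source)

def bertscore_from_embeddings(reference, candidate):
    """Return (precision, recall, f1) from token embedding lists."""
    ref = [normalize(v) for v in reference]
    cand = [normalize(v) for v in candidate]
    precision = greedy_match_score(cand, ref)  # each candidate token vs. reference
    recall = greedy_match_score(ref, cand)     # each reference token vs. candidate
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy "embeddings": the reference has two orthogonal tokens,
# the candidate covers only the first of them.
reference_emb = [[1.0, 0.0], [0.0, 1.0]]
candidate_emb = [[2.0, 0.0]]

p, r, f1 = bertscore_from_embeddings(reference_emb, candidate_emb)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=1.000 R=0.500 F1=0.667
```

The candidate's single token is matched perfectly, so precision is 1, but half of the reference is left uncovered, so recall is 0.5. Real BERTScore implementations also offer options such as IDF weighting and baseline rescaling, which are left out here.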

Metric Comparison

| Metric | Level | Semantic | Speed | Correlation with Human | Use Case |
| --- | --- | --- | --- | --- | --- |
| Accuracy | Sample | No | Fast | Moderate | Balanced classification |
| F1 | Sample | No | Fast | Moderate | Imbalanced classification |
| Perplexity | Token | Partial | Moderate | Low-Moderate | Language modeling |
| BLEU | N-gram | No | Fast | Moderate | Translation |
| ROUGE | N-gram | No | Fast | Moderate | Summarization |
| BERTScore | Embedding | Yes | Slow | High | Text generation |
| METEOR | Token | Partial | Moderate | Moderate-High | Translation |
| COMET | Embedding | Yes | Slow | Very High | Translation |

Error Analysis Metrics

def error_analysis(y_true, y_pred, texts, labels):
    """Detailed error analysis for classification."""
    errors = []
    for i, (true, pred, text) in enumerate(zip(y_true, y_pred, texts)):
        if true != pred:
            errors.append({
                "index": i,
                "text": text,
                "true_label": labels[true],
                "predicted_label": labels[pred],
            })

    # Confusion pairs
    confusion_pairs = {}
    for error in errors:
        pair = (error["true_label"], error["predicted_label"])
        confusion_pairs[pair] = confusion_pairs.get(pair, 0) + 1

    # Sort by frequency
    sorted_pairs = sorted(confusion_pairs.items(), key=lambda x: -x[1])

    return {
        "total_errors": len(errors),
        # Guard against empty input to avoid a ZeroDivisionError.
        "error_rate": len(errors) / len(y_true) if len(y_true) else 0.0,
        "top_confusion_pairs": sorted_pairs[:10],
        "error_examples": errors[:20],
    }

Best Practices

  1. Match metric to task - No single metric works for all NLP tasks
  2. Report multiple metrics - Avoid relying on a single number
  3. Consider class imbalance - Use macro-F1 or weighted metrics
  4. Include confidence intervals - Report means with standard deviations across runs, or bootstrap confidence intervals over the test set
  5. Validate with human evaluation - Automatic metrics are proxies, not ground truth
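
One way to attach an interval to a single test-set score is a percentile bootstrap: resample the evaluation examples with replacement and recompute the metric each time. The sketch below is a minimal version under assumptions of its own (the function names, 1000 resamples, and a 95% level are illustrative choices), using the small example labels from the classification section.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]

point = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With only eight examples the interval is very wide, which is the point: a single accuracy number on a small test set says little without an estimate of its uncertainty.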

Key Takeaways

  • F1 score is the standard metric for classification tasks with imbalanced classes
  • Perplexity measures language model quality but is not a reliable predictor of downstream task performance
  • BLEU remains the standard for translation despite known limitations
  • BERTScore captures semantic similarity better than surface-level metrics
  • Always report multiple metrics and include statistical significance tests