NLP Evaluation Metrics
Choosing the right evaluation metric is critical for measuring model performance and guiding improvement. Different NLP tasks require different evaluation approaches.
Metric Selection Framework
| Task Type | Primary Metric | Secondary Metrics | When to Use |
|---|---|---|---|
| Classification | F1 / Accuracy | Precision, Recall, AUC | Accuracy for balanced data; F1 for imbalanced data |
| Named Entity Recognition | Token-level F1 | Span-level F1, Exact Match | Sequence labeling |
| Machine Translation | BLEU | METEOR, chrF, COMET | Translation quality |
| Summarization | ROUGE | BERTScore, SummaC | Content coverage |
| Question Answering | Exact Match | F1, BERTScore | Extractive/generative |
| Language Modeling | Perplexity | Bits per byte | Fluency assessment |
Classification Metrics
Confusion Matrix Metrics
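Definition (Core Classification Metrics): for a given class, with true positives $TP$, false positives $FP$, true negatives $TN$, and false negatives $FN$:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$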
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score
)
import numpy as np

def comprehensive_classification_metrics(y_true, y_pred, y_prob=None, average="macro"):
    """Compute comprehensive classification metrics."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }
    if y_prob is not None:
        try:
            metrics["auc_roc"] = roc_auc_score(y_true, y_prob, multi_class="ovr")
        except ValueError:
            # AUC is undefined here, e.g. a class is missing from y_true
            # or y_prob has an unexpected shape
            metrics["auc_roc"] = None
    # Per-class metrics
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    metrics["per_class"] = report
    # Confusion matrix (rows: true labels, columns: predicted labels)
    metrics["confusion_matrix"] = confusion_matrix(y_true, y_pred).tolist()
    return metrics

# Example
y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]
results = comprehensive_classification_metrics(y_true, y_pred)
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"F1 (macro): {results['f1']:.4f}")
```
Macro vs Micro vs Weighted Averaging
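| Strategy | Formula | Best For |
|---|---|---|
| Macro | $\frac{1}{C}\sum_{c=1}^{C} M_c$ | Equal class importance |
| Micro | $M$ computed from TP, FP, and FN pooled over all classes | Sample-level importance |
| Weighted | $\sum_{c=1}^{C} \frac{n_c}{N} M_c$ | Importance proportional to support |

Here $M_c$ is the metric (for example F1) for class $c$, $C$ is the number of classes, $n_c$ is the support of class $c$, and $N = \sum_c n_c$.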
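To make the difference between the three averaging strategies concrete, the following is a minimal pure-Python sketch that computes per-class F1 from raw counts and then averages it three ways. The helper names (`per_class_counts`, `f1_from_counts`, `averaged_f1`) are illustrative and not part of any library; the labels reuse the toy example from the classification section.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    per_class = {c: f1_from_counts(*v) for c, v in counts.items()}
    # Macro: unweighted mean of per-class F1
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool the counts over all classes, then compute F1 once
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)
    # Weighted: per-class F1 weighted by the number of true samples per class
    weighted = sum(support[c] * f1 for c, f1 in per_class.items()) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}

y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]
scores = averaged_f1(y_true, y_pred)
print({k: round(v, 4) for k, v in scores.items()})
# {'macro': 0.6222, 'micro': 0.625, 'weighted': 0.6}
```

For single-label multi-class data, micro-averaged F1 equals accuracy (5 of 8 correct here), which is why micro averaging reflects sample-level rather than class-level importance.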
Perplexity
Perplexity measures how well a language model predicts a sample.
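Definition (Perplexity): for a token sequence $W = (w_1, \dots, w_N)$,

$$\text{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{<i})\right)$$

Lower perplexity indicates better predictive performance. A uniform distribution over a vocabulary of size $|V|$ has perplexity $|V|$.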
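As a small worked check, perplexity is the exponential of the average negative log-probability the model assigns to each observed token. The sketch below assumes the per-token probabilities are already available as a plain list; the probability values and the `perplexity` helper are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A uniform model over 50 tokens assigns probability 1/50 to every token,
# so its perplexity equals the vocabulary size.
vocab_size = 50
uniform = perplexity([1 / vocab_size] * 10)

# A model that is confident and correct has perplexity close to 1.
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(f"uniform: {uniform:.2f}, confident: {confident:.2f}")
```

One way to read the number: a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each step.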
```python
import math

import numpy as np
import torch

def compute_perplexity(model, tokenizer, text, max_length=512):
    """Compute perplexity of text using a language model."""
    encodings = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True)
    input_ids = encodings.input_ids.to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level
        # cross-entropy (negative log-likelihood) as outputs.loss
        outputs = model(input_ids, labels=input_ids)
        neg_log_likelihood = outputs.loss
    return math.exp(neg_log_likelihood.item())

def compute_dataset_perplexity(model, tokenizer, texts, max_length=512):
    """Compute summary statistics of per-text perplexity over a dataset."""
    perplexities = []
    for text in texts:
        ppl = compute_perplexity(model, tokenizer, text, max_length)
        perplexities.append(ppl)
    return {
        "mean_ppl": np.mean(perplexities),
        "std_ppl": np.std(perplexities),
        "median_ppl": np.median(perplexities),
        "min_ppl": min(perplexities),
        "max_ppl": max(perplexities),
    }
```
BLEU Score
BLEU measures n-gram precision between generated and reference translations.
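Definition (BLEU Score):

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n$ are the n-gram weights (typically uniform, $w_n = 1/N$ with $N = 4$), and the brevity penalty for candidate length $c$ and reference length $r$ is:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$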
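As a rough guide on the 0-100 scale (scores vary with language pair, domain, tokenization, and number of references, so compare them only within the same setup):

| BLEU Range | Quality | Typical Use |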
|---|---|---|
| 0-10 | Poor | Random/baseline outputs |
| 10-20 | Below average | Weak models |
| 20-30 | Adequate | Production baselines |
| 30-40 | Good | Strong models |
| 40-50 | Excellent | Near-human quality |
| 50+ | Outstanding | Domain-specific tasks |
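The two ingredients of BLEU, clipped ("modified") n-gram precision and the brevity penalty, can be sketched in a few lines of pure Python. This is a simplified single-reference, sentence-level version with uniform weights and no smoothing, written for illustration; the function names are not from any library, and reported scores should come from a standard tool such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped precision: each candidate n-gram counts at most as often
    as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def brevity_penalty(c, r):
    """1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU on a 0-1 scale with uniform weights 1/max_n."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(round(sentence_bleu(candidate, reference, max_n=2), 4))  # 0.7071
print(sentence_bleu(candidate, reference, max_n=4))            # 0.0
```

The second call returns 0 because no 4-gram matches. This is why sentence-level BLEU is usually smoothed and why BLEU is normally reported at the corpus level.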
ROUGE Metrics
| Metric | Measures | Formula | Best For |
|---|---|---|---|
| ROUGE-1 | Unigram overlap | Unigram recall/precision | Content coverage |
| ROUGE-2 | Bigram overlap | Bigram recall/precision | Fluency |
| ROUGE-L | LCS-based | Longest common subsequence | Structure similarity |
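The rows above can be illustrated with a minimal pure-Python sketch of ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence). It assumes whitespace tokenization, a single reference, and a plain F1 instead of the weighted F-measure used in some implementations; the function names are illustrative only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count and the two totals."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def rouge_n(candidate, reference, n=1):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
print(rouge_n(candidate, reference, n=1))  # 5 of 6 unigrams overlap
print(rouge_n(candidate, reference, n=2))  # 3 of 5 bigrams overlap
print(rouge_l(candidate, reference))       # LCS has length 5
```

ROUGE-L rewards words that appear in the same order without requiring them to be adjacent, which is why it tolerates the single substituted word here better than ROUGE-2 does.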
BERTScore
BERTScore computes semantic similarity using contextual embeddings.
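Definition (BERTScore): for a reference with contextual token embeddings $x_1, \dots, x_m$ and a candidate with contextual token embeddings $\hat{x}_1, \dots, \hat{x}_k$,

$$R_{\text{BERT}} = \frac{1}{m}\sum_{i=1}^{m} \max_{j} x_i^\top \hat{x}_j, \quad P_{\text{BERT}} = \frac{1}{k}\sum_{j=1}^{k} \max_{i} x_i^\top \hat{x}_j, \quad F_{\text{BERT}} = \frac{2 \, P_{\text{BERT}} \, R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$

where the embeddings are normalized to unit length, so that $x_i^\top \hat{x}_j$ is the cosine similarity between reference token $i$ and candidate token $j$.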
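The greedy matching step behind BERTScore can be sketched without a model: each reference token is matched to its most similar candidate token (recall), and each candidate token to its most similar reference token (precision). The two-dimensional vectors below are hypothetical stand-ins for contextual embeddings, and `greedy_match_score` is an illustrative name; real scores require embeddings from a pretrained model, and the optional IDF weighting and baseline rescaling are omitted here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_score(cand_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Recall: best candidate match for every reference token
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    # Precision: best reference match for every candidate token
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy embeddings (assumed, not produced by a real model)
ref_emb = [[1.0, 0.0], [0.0, 1.0]]   # two reference tokens
cand_emb = [[1.0, 0.0]]              # one candidate token
score = greedy_match_score(cand_emb, ref_emb)
print(score)
```

In this toy case the single candidate token matches one reference token perfectly and the other not at all, so precision is 1.0 and recall is 0.5.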
Metric Comparison
| Metric | Level | Semantic | Speed | Correlation with Human | Use Case |
|---|---|---|---|---|---|
| Accuracy | Sample | No | Fast | Moderate | Balanced classification |
| F1 | Sample | No | Fast | Moderate | Imbalanced classification |
| Perplexity | Token | Partial | Moderate | Low-Moderate | Language modeling |
| BLEU | N-gram | No | Fast | Moderate | Translation |
| ROUGE | N-gram | No | Fast | Moderate | Summarization |
| BERTScore | Embedding | Yes | Slow | High | Text generation |
| METEOR | Token | Partial | Moderate | Moderate-High | Translation |
| COMET | Embedding | Yes | Slow | Very High | Translation |
Error Analysis Metrics
```python
def error_analysis(y_true, y_pred, texts, labels):
    """Detailed error analysis for classification."""
    errors = []
    for i, (true, pred, text) in enumerate(zip(y_true, y_pred, texts)):
        if true != pred:
            errors.append({
                "index": i,
                "text": text,
                "true_label": labels[true],
                "predicted_label": labels[pred],
            })
    # Confusion pairs: count each (true label, predicted label) combination
    confusion_pairs = {}
    for error in errors:
        pair = (error["true_label"], error["predicted_label"])
        confusion_pairs[pair] = confusion_pairs.get(pair, 0) + 1
    # Sort by frequency, most common confusion first
    sorted_pairs = sorted(confusion_pairs.items(), key=lambda x: -x[1])
    return {
        "total_errors": len(errors),
        "error_rate": len(errors) / len(y_true) if len(y_true) else 0.0,
        "top_confusion_pairs": sorted_pairs[:10],
        "error_examples": errors[:20],
    }
```
Best Practices
- Match metric to task - No single metric works for all NLP tasks
- Report multiple metrics - Avoid relying on a single number
- Consider class imbalance - Use macro-F1 or weighted metrics
- Report uncertainty - Give means with standard deviations across runs, or confidence intervals (for example from bootstrap resampling)
- Validate with human evaluation - Automatic metrics are proxies, not ground truth
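One common way to attach a confidence interval to a test-set metric is the percentile bootstrap: resample the evaluation examples with replacement, recompute the metric on each resample, and read off the tails of the resulting distribution. The sketch below does this for accuracy in pure Python; the function name, the number of resamples, and the fixed seed are illustrative choices, not recommendations from the source.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point = sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        # Draw n examples with replacement and recompute accuracy
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper

y_true = [0, 1, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 0, 0, 2, 2, 1, 2]
point, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With only eight examples the interval is very wide, which is the point: a single accuracy number on a small test set says little on its own.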
Key Takeaways
- F1 score is the standard metric for classification tasks with imbalanced classes
- Perplexity measures language model quality but correlates poorly with downstream task performance
- BLEU remains the standard for translation despite known limitations
- BERTScore captures semantic similarity better than surface-level metrics
- Always report multiple metrics and include statistical significance tests