Advanced NLP Evaluation
Robust evaluation of NLP models requires multiple complementary approaches, as no single metric captures all aspects of language understanding and generation quality.
Evaluation Framework Overview
Automatic Metrics Comparison
BLEU Score
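BLEU measures modified n-gram precision, combined with a brevity penalty, for machine translation:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n$ is the weight given to n-grams of order $n$ (typically $1/N$ with $N = 4$), and $BP$ is the brevity penalty, which penalizes short translations: $BP = 1$ when the candidate length $c$ exceeds the reference length $r$, and $BP = e^{1 - r/c}$ otherwise.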
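To make the two ingredients concrete, here is a minimal sketch of sentence-level BLEU, assuming a single reference, uniform weights, pre-tokenized input, and no smoothing. The function names `ngrams` and `sentence_bleu` are illustrative, not part of any library; production evaluations normally rely on an established implementation instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and one reference (no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Modified precision: clip each candidate n-gram count by its reference count.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            # Without smoothing, one zero precision zeroes the geometric mean.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    c, r = len(candidate), len(reference)
    # Brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c).
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on the".split(), reference))
```

In this example every n-gram of the shorter candidate appears in the reference, so all precisions equal 1 and the score is determined by the brevity penalty alone.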
ROUGE Score
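ROUGE recall measures how much of the reference summary is captured by the generated summary. For n-grams of order $n$ (ROUGE-N):

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$

where $\text{Count}_{\text{match}}(\text{gram}_n)$ is the number of times an n-gram occurs in both the generated summary and the reference, and the denominator counts all n-grams in the reference.
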
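A minimal sketch of ROUGE-N recall follows, assuming a single reference and pre-tokenized input. The helper name `rouge_n_recall` is illustrative only; it is not the API of any ROUGE package.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # reference is shorter than n tokens
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

Because the denominator counts reference n-grams, a long candidate is not penalized for extra content, which is why ROUGE recall is often reported alongside precision or F1.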
| Metric | Best For | Measures | Range |
|---|---|---|---|
| BLEU | Translation | Precision | 0-100 |
| ROUGE | Summarization | Recall | 0-1 |
| METEOR | Translation | Alignment | 0-1 |
| BERTScore | All tasks | Semantic similarity | 0-1 |
| Perplexity | Language modeling | Predictive uncertainty | 1-∞ (lower is better) |
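Perplexity differs from the other rows in that it needs no reference text, only the probabilities the model assigned to the observed tokens. A minimal sketch, assuming natural-log token probabilities are already available (the function name `perplexity` and its input format are assumptions for illustration):

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that spreads probability uniformly over 4 choices at every step.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

A model that assigns probability 1 to every observed token reaches the minimum value of 1, and the value grows without bound as the assigned probabilities shrink.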
Human Evaluation Framework
```python
import numpy as np
from collections import defaultdict


class HumanEvaluator:
    """Collects per-item human ratings and summarizes their reliability."""

    def __init__(self, min_annotators=3):
        self.min_annotators = min_annotators
        self.annotations = defaultdict(list)

    def add_annotation(self, item_id, annotator_id, score, label=None, reliability=1.0):
        # `reliability` is only used by aggregate_scores(method="weighted").
        self.annotations[item_id].append({
            "annotator": annotator_id,
            "score": score,
            "label": label,
            "reliability": reliability,
        })

    def compute_agreement(self, method="krippendorff"):
        """Compute inter-annotator agreement."""
        if method == "krippendorff":
            return self._krippendorff_alpha()
        if method in ("fleiss", "cohens_kappa"):
            # Both need categorical labels and are not implemented in this class.
            raise NotImplementedError(f"{method} agreement is not implemented")
        raise ValueError(f"Unknown agreement method: {method}")

    def _krippendorff_alpha(self):
        """Compute Krippendorff's alpha with the interval (squared-difference) metric."""
        annotators = {
            ann["annotator"]
            for anns in self.annotations.values()
            for ann in anns
        }
        # Only items rated by two or more annotators contribute pairable values.
        units = [
            [ann["score"] for ann in anns]
            for anns in self.annotations.values()
            if len(anns) > 1
        ]
        if len(annotators) < 2 or not units:
            return None

        values = [score for unit in units for score in unit]
        total_values = len(values)

        # Observed disagreement: squared differences within each item,
        # weighted by 1 / (m - 1), where m is the item's rating count.
        observed_disagreement = sum(
            sum((s1 - s2) ** 2 for i, s1 in enumerate(unit) for s2 in unit[i + 1:])
            / (len(unit) - 1)
            for unit in units
        ) / total_values

        # Expected disagreement: squared differences over all pairs of values,
        # regardless of which item they came from.
        expected_disagreement = sum(
            (v1 - v2) ** 2
            for i, v1 in enumerate(values)
            for v2 in values[i + 1:]
        ) / (total_values * (total_values - 1))

        if expected_disagreement == 0:
            return 1.0  # every rating is identical
        return 1 - observed_disagreement / expected_disagreement

    def aggregate_scores(self, method="mean"):
        """Aggregate scores across annotators."""
        results = {}
        for item_id, annotations in self.annotations.items():
            scores = [a["score"] for a in annotations]
            if method == "mean":
                results[item_id] = np.mean(scores)
            elif method == "median":
                results[item_id] = np.median(scores)
            elif method == "weighted":
                # Weight by annotator reliability (defaults to 1.0 per rating).
                weights = [a.get("reliability", 1.0) for a in annotations]
                results[item_id] = np.average(scores, weights=weights)
            else:
                raise ValueError(f"Unknown aggregation method: {method}")
        return results

    def filter_by_quality(self, min_agreement=0.7):
        """Keep items with enough annotators and a low spread of scores."""
        good_items = []
        for item_id, annotations in self.annotations.items():
            scores = [a["score"] for a in annotations]
            if len(scores) >= self.min_annotators:
                # Heuristic: the standard deviation must stay below 1 - min_agreement,
                # so the threshold depends on the rating scale in use.
                std = np.std(scores)
                if std < (1 - min_agreement):
                    good_items.append(item_id)
        return good_items


# Usage
evaluator = HumanEvaluator(min_annotators=3)
evaluator.add_annotation("item_1", "annotator_A", 4.5)
evaluator.add_annotation("item_1", "annotator_B", 4.0)
evaluator.add_annotation("item_1", "annotator_C", 4.2)
# A second item is needed for alpha to be informative: with a single item,
# observed and expected disagreement coincide and alpha is 0.
evaluator.add_annotation("item_2", "annotator_A", 2.0)
evaluator.add_annotation("item_2", "annotator_B", 2.5)
evaluator.add_annotation("item_2", "annotator_C", 2.2)

alpha = evaluator.compute_agreement(method="krippendorff")
print(f"Inter-annotator agreement (alpha): {alpha:.3f}")
```
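`compute_agreement` accepts a `fleiss` option, but the class above does not define `_fleiss_kappa`. As a rough standalone sketch of what such a method could compute, the function below implements Fleiss' kappa for categorical labels. It assumes a hypothetical input format (one row per item, one column per category, each cell holding the number of annotators who chose that category) and that every item received the same number of ratings.

```python
def fleiss_kappa(count_matrix):
    """Fleiss' kappa from an items x categories matrix of rating counts."""
    n_items = len(count_matrix)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(count_matrix[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in count_matrix):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Per-item agreement: share of annotator pairs that chose the same category.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_matrix
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    category_totals = [sum(column) for column in zip(*count_matrix)]
    p_e = sum((total / (n_items * n_raters)) ** 2 for total in category_totals)

    if p_e == 1.0:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Three annotators, two categories (e.g. "fluent" / "not fluent"), two items.
print(fleiss_kappa([[3, 0], [0, 3]]))
print(fleiss_kappa([[2, 1], [1, 2]]))
```

Unlike the interval version of Krippendorff's alpha, this statistic treats categories as unordered, so it suits the `label` field rather than the numeric `score`.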
Benchmark Evaluation Suites
| Benchmark | Task | Size | Difficulty | State of the Art |
|---|---|---|---|---|
| GLUE | Multi-task | 8.5K-1M | Medium | 90+ (SuperGLUE) |
| SQuAD | Reading comprehension | 100K+ | Hard | 93+ F1 |
| WMT | Translation | Varies | Very Hard | 30+ BLEU |
| SQuALITY | Long-form QA | 5K | Very Hard | 70+ QAL |
| BIG-bench | Diverse tasks | 200+ tasks | Varies | 70+ avg |
Evaluation Best Practices
| Practice | Description | Impact |
|---|---|---|
| Multi-metric evaluation | Use 3+ complementary metrics | Comprehensive view |
| Statistical significance | Report confidence intervals | Reliable comparisons |
| Error analysis | Categorize failure modes | Actionable insights |
| Cross-dataset evaluation | Test on multiple benchmarks | Robust generalization |
| Human validation | Correlate with human judgments | Meaningful evaluation |
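For the statistical significance row, one common way to obtain a confidence interval is a paired bootstrap over test items. The sketch below is an illustration under stated assumptions, not a prescribed procedure: it assumes two systems scored on the same items with a per-item metric, and the function name `paired_bootstrap_ci` and its parameters are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b) on shared items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    # Resample items with replacement and record the mean difference each time.
    resampled_means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    tail = (1 - confidence) / 2
    lower = resampled_means[int(tail * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Illustrative per-item scores for two hypothetical systems.
system_a = [0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.73, 0.62]
system_b = [0.68, 0.66, 0.74, 0.55, 0.70, 0.69, 0.65, 0.60]
mean_diff, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference: {mean_diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the observed difference is unlikely to come from item sampling noise alone. Pairing matters here: resampling the per-item differences removes variance shared by both systems on hard or easy items.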
Key Takeaways
- No single metric captures all aspects of NLP model quality
- Human evaluation remains essential for subjective tasks like generation
- Statistical significance testing guards against mistaking chance differences between models for real improvements
- Multi-metric evaluation provides comprehensive model assessment
- Error analysis is as important as aggregate scores for model improvement