Hallucination Detection
Hallucinations are generated outputs that are factually incorrect, fabricated, or not grounded in the source context. Detecting and mitigating hallucinations is critical for deploying LLMs in production.
Types of Hallucinations
| Type | Description | Example |
|---|---|---|
| Factual | Contradicts known facts | "The Eiffel Tower is in London" |
| Faithfulness | Contradicts source context | Summarizing events not in the article |
| Intrinsic | Distorts information that is present in the source | Misattributing a quote |
| Extrinsic | Adds information the source does not support | Inventing statistics |
| Instruction | Ignores task constraints | Generating when asked to extract |
Hallucination Detection Pipeline
SelfCheckGPT
SelfCheckGPT rests on the intuition that facts a model actually knows tend to recur across stochastically sampled responses, whereas hallucinated content varies or contradicts itself from sample to sample.
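Definition: SelfCheckGPT Score
For a claim $c$ extracted from response $r$, sample $N$ additional responses $s_1, \dots, s_N$ from the same prompt. The SelfCheck score is:
$$S(c) = 1 - \frac{1}{N} \sum_{n=1}^{N} \mathrm{support}(c, s_n)$$
where $\mathrm{support}(c, s_n) \in [0, 1]$ measures how strongly sample $s_n$ supports $c$. A high score indicates the claim is unlikely to be supported by the model's own knowledge.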
```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class SelfCheckGPT:
    def __init__(self, model_name="gpt2-medium", num_samples=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()
        self.num_samples = num_samples

    def get_logprobs(self, text, context=""):
        """Mean per-token log-probability of `text`, conditioned on `context`."""
        input_text = context + " " + text if context else text
        inputs = self.tokenizer(input_text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Logits at position t predict the token at position t + 1
        logits = outputs.logits[:, :-1, :]
        targets = inputs["input_ids"][:, 1:]
        log_probs = torch.log_softmax(logits, dim=-1)
        token_logprobs = torch.gather(
            log_probs, 2, targets.unsqueeze(-1)
        ).squeeze(-1)
        # Average over the tokens of `text` only, not the context tokens
        if context:
            n_context = len(self.tokenizer(context)["input_ids"])
            token_logprobs = token_logprobs[:, max(n_context - 1, 0):]
        return token_logprobs.mean().item()

    def check_claims(self, prompt, response, claims):
        """Check each claim for consistency across samples."""
        # Generate additional samples from the same prompt
        inputs = self.tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        samples = []
        for _ in range(self.num_samples):
            output = self.model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,
            )
            # Keep only the newly generated tokens, not the echoed prompt
            samples.append(
                self.tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
            )

        results = []
        for claim in claims:
            # Score the claim conditioned on each sample (not on the prompt)
            scores = [self.get_logprobs(claim, context=sample) for sample in samples]
            # High variance across samples also signals a likely hallucination
            mean_score = sum(scores) / len(scores)
            variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)
            # exp(mean log-prob) is a geometric-mean token probability in (0, 1]
            support = math.exp(mean_score)
            results.append({
                "claim": claim,
                "mean_support": mean_score,
                "variance": variance,
                "hallucination_risk": 1 - support,
            })
        return results


# Usage
checker = SelfCheckGPT()
response = "Albert Einstein was born in 1879 in Ulm, Germany. He developed the theory of relativity."
claims = ["Albert Einstein was born in 1879", "He was born in Ulm, Germany", "He developed the theory of relativity"]
results = checker.check_claims("Tell me about Albert Einstein.", response, claims)
for r in results:
    print(f"{r['claim']}: risk={r['hallucination_risk']:.2f}")
```
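The class above needs a model download to run. For intuition, the sample-and-compare idea can also be sketched without any model. The sketch below assumes a deliberately crude, hypothetical support test (`lexically_supported`: every content word of the claim appears in the sample) and scores a claim by the fraction of samples that fail to support it. The helper names, the toy samples, and the scoring rule are illustrative assumptions, not the scoring used in the SelfCheckGPT paper.

```python
import re


def content_words(text):
    """Lowercased alphanumeric tokens longer than two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def lexically_supported(claim, sample):
    """Hypothetical support test: all content words of the claim occur in the sample."""
    return content_words(claim) <= content_words(sample)


def inconsistency_score(claim, samples, supports=lexically_supported):
    """Fraction of samples that do NOT support the claim (0 = consistent, 1 = unsupported)."""
    if not samples:
        raise ValueError("need at least one sample")
    unsupported = sum(1 for s in samples if not supports(claim, s))
    return unsupported / len(samples)


# Hand-written stand-ins for sampled responses (not real model output)
samples = [
    "Einstein was born in 1879 in Ulm.",
    "Albert Einstein, born 1879 in Ulm, Germany, was a physicist.",
    "Einstein was born in Ulm in 1879 and later moved to Munich.",
    "Born in 1879, Einstein grew up in Germany.",
]
score_ulm = inconsistency_score("Einstein was born in Ulm", samples)
score_paris = inconsistency_score("Einstein was born in Paris", samples)
print(score_ulm, score_paris)
```

Three of the four samples contain every content word of the Ulm claim, so it scores 0.25, while no sample mentions Paris, so that claim scores 1.0. A real system would replace `lexically_supported` with a learned judge such as an NLI model or an LLM prompt.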
NLI-Based Detection
Natural Language Inference models can verify whether source text entails generated claims.
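Definition: NLI Verification
For a source document $d$ and a claim $c$, the NLI model predicts:
$$P(y \mid d, c), \quad y \in \{\text{entailment}, \text{neutral}, \text{contradiction}\}$$
A hallucination is detected when $P(\text{contradiction} \mid d, c) > \tau$ or $P(\text{entailment} \mid d, c) < \tau$ for threshold $\tau$.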
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class NLIHallucinationDetector:
    # Default checkpoint is published on the Hugging Face Hub under the
    # MoritzLaurer namespace; any three-way NLI classifier should work here.
    def __init__(self, model_name="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", threshold=0.5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        self.threshold = threshold
        # Read the label order from the model config instead of hard-coding it
        self.label_index = {
            label.lower(): idx for idx, label in self.model.config.id2label.items()
        }

    def verify_claim(self, source, claim):
        # Premise = source document, hypothesis = generated claim
        inputs = self.tokenizer(
            source, claim, return_tensors="pt", truncation=True, max_length=512
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        contradiction = probs[self.label_index["contradiction"]].item()
        return {
            "label": self.model.config.id2label[probs.argmax().item()].lower(),
            "entailment_prob": probs[self.label_index["entailment"]].item(),
            "neutral_prob": probs[self.label_index["neutral"]].item(),
            "contradiction_prob": contradiction,
            "is_hallucination": contradiction > self.threshold,
        }

    def detect_hallucinations(self, source, claims):
        results = [self.verify_claim(source, claim) for claim in claims]
        flagged = sum(1 for r in results if r["is_hallucination"])
        # Guard against an empty claim list
        hallucination_rate = flagged / len(results) if results else 0.0
        return {
            "claim_results": results,
            "hallucination_rate": hallucination_rate,
        }


# Usage
detector = NLIHallucinationDetector()
source = "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year."
claims = [
    "Q3 revenue was $4.2 billion",
    "Revenue increased 15% year over year",
    "Q3 revenue was $5.1 billion",  # Hallucination
]
results = detector.detect_hallucinations(source, claims)
print(f"Hallucination rate: {results['hallucination_rate']:.1%}")
```
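The `is_hallucination` flag in the detector fires only on contradiction, so a claim the source neither entails nor contradicts (the extrinsic case) passes unflagged. One possible three-way decision rule is sketched below. It assumes result dictionaries with the `entailment_prob` and `contradiction_prob` keys returned by `verify_claim`; the function name `verdict`, the default `tau=0.5`, and the probability values are illustrative assumptions rather than model output.

```python
def verdict(result, tau=0.5):
    """Map NLI probabilities to 'contradicted', 'supported', or 'unsupported'."""
    if result["contradiction_prob"] > tau:
        return "contradicted"
    if result["entailment_prob"] >= tau:
        return "supported"
    # Neither clearly entailed nor clearly contradicted by the source
    return "unsupported"


# Made-up probabilities for illustration only
examples = [
    {"entailment_prob": 0.92, "neutral_prob": 0.06, "contradiction_prob": 0.02},
    {"entailment_prob": 0.03, "neutral_prob": 0.07, "contradiction_prob": 0.90},
    {"entailment_prob": 0.20, "neutral_prob": 0.70, "contradiction_prob": 0.10},
]
verdicts = [verdict(r) for r in examples]
print(verdicts)
```

Whether "unsupported" should count as a hallucination depends on the task: for summarization it usually should, while for open-ended question answering the source may simply be incomplete.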
Confidence Calibration
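A well-calibrated model's stated confidence matches its empirical accuracy: among outputs assigned 80% confidence, about 80% should be correct. When that holds, low confidence becomes a usable signal that an output may be hallucinated.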
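Definition: Expected Calibration Error (ECE)
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
where $B_m$ is the set of samples with confidence in the $m$-th interval, $M$ is the number of intervals, $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and mean confidence within $B_m$, and $n$ is the total number of samples.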
```python
import numpy as np
import torch


class ConfidenceCalibrator:
    def __init__(self, model, tokenizer, num_bins=10):
        self.model = model
        self.tokenizer = tokenizer
        self.num_bins = num_bins

    def predict(self, text):
        """Return the predicted label for `text`. Task-specific: override in a subclass."""
        raise NotImplementedError

    def compute_perplexity_confidence(self, text):
        """Use perplexity as a confidence signal."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        perplexity = torch.exp(loss).item()
        # log(perplexity) equals the mean token loss, so confidence lies in (0, 1]
        confidence = 1.0 / (1.0 + np.log(perplexity))
        return perplexity, confidence

    def compute_entropy_confidence(self, text):
        """Use token-level entropy as confidence."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        logits = outputs.logits[:, :-1, :]
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
        mean_entropy = entropy.mean().item()
        # Normalize by the entropy of a uniform distribution over the vocabulary
        max_entropy = np.log(probs.shape[-1])
        confidence = 1.0 - (mean_entropy / max_entropy)
        return mean_entropy, confidence

    def compute_ece(self, texts, labels, num_bins=None):
        """Compute Expected Calibration Error."""
        num_bins = self.num_bins if num_bins is None else num_bins
        confidences = []
        accuracies = []
        for text, label in zip(texts, labels):
            _, conf = self.compute_perplexity_confidence(text)
            pred = self.predict(text)
            confidences.append(conf)
            accuracies.append(1 if pred == label else 0)
        confidences = np.array(confidences)
        accuracies = np.array(accuracies)

        bins = np.linspace(0, 1, num_bins + 1)
        ece = 0.0
        for i in range(num_bins):
            # Half-open bins [lo, hi), except the last, which includes 1.0
            if i == num_bins - 1:
                mask = (confidences >= bins[i]) & (confidences <= bins[i + 1])
            else:
                mask = (confidences >= bins[i]) & (confidences < bins[i + 1])
            if not mask.any():
                continue
            bin_conf = confidences[mask].mean()
            bin_acc = accuracies[mask].mean()
            ece += mask.mean() * abs(bin_acc - bin_conf)
        return float(ece)
```
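To make the ECE computation concrete, here is a small dependency-free worked example. It is a sketch under stated assumptions: the standalone function `expected_calibration_error`, the two-bin setup, and the four toy predictions are all made up for illustration and take confidences and correctness flags directly rather than calling a model.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins; the last bin includes confidence 1.0."""
    n = len(confidences)
    ece = 0.0
    for i in range(num_bins):
        lo, hi = i / num_bins, (i + 1) / num_bins
        idx = [
            j for j, c in enumerate(confidences)
            if lo <= c < hi or (i == num_bins - 1 and c == 1.0)
        ]
        if not idx:
            continue
        bin_conf = sum(confidences[j] for j in idx) / len(idx)
        bin_acc = sum(correct[j] for j in idx) / len(idx)
        # Weight each bin's calibration gap by its share of the samples
        ece += len(idx) / n * abs(bin_acc - bin_conf)
    return ece


# Toy data (made up): four predictions, two bins [0, 0.5) and [0.5, 1.0]
confidences = [0.9, 0.7, 0.3, 0.1]
correct = [1, 0, 0, 0]
ece = expected_calibration_error(confidences, correct, num_bins=2)
print(round(ece, 4))
```

The low-confidence bin contributes 0.5 × |0 − 0.2| = 0.10 and the high-confidence bin contributes 0.5 × |0.5 − 0.8| = 0.15, for an ECE of 0.25. A perfectly calibrated model would score 0.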
Retrieval-Augmented Verification
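Retrieval-augmented verification checks generated content against external knowledge sources: retrieve evidence relevant to each claim, then test whether that evidence supports the claim.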
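A minimal retrieve-then-verify loop might look like the sketch below. Everything in it is an assumption made for illustration: the retriever is plain Jaccard token overlap over an in-memory list, the helper names (`tokens`, `retrieve`, `verify_with_retrieval`, `token_subset`) are hypothetical, and the second and third corpus passages are invented toy data. In practice the `supports` callable would be a stronger check, for example one built on `NLIHallucinationDetector.verify_claim`, and the retriever would be a search index or dense retriever.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens; decimals such as 4.2 stay in one piece."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def retrieve(claim, corpus, k=2):
    """Rank passages by Jaccard token overlap with the claim; return the top k."""
    claim_toks = tokens(claim)

    def jaccard(passage):
        passage_toks = tokens(passage)
        union = claim_toks | passage_toks
        return len(claim_toks & passage_toks) / len(union) if union else 0.0

    return sorted(corpus, key=jaccard, reverse=True)[:k]


def verify_with_retrieval(claim, corpus, supports, k=2):
    """A claim counts as verified if any retrieved passage supports it."""
    evidence = retrieve(claim, corpus, k=k)
    supporting = [p for p in evidence if supports(p, claim)]
    return {"claim": claim, "evidence": evidence, "verified": bool(supporting)}


def token_subset(passage, claim):
    """Placeholder support check: every claim token occurs in the passage."""
    return tokens(claim) <= tokens(passage)


corpus = [
    "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year.",
    "The company opened two new offices in Q3.",  # invented toy passage
    "Q2 revenue was $3.9 billion.",  # invented toy passage
]
ok = verify_with_retrieval("Q3 revenue of $4.2 billion", corpus, token_subset)
bad = verify_with_retrieval("Q3 revenue of $5.1 billion", corpus, token_subset)
print(ok["verified"], bad["verified"])
```

Separating retrieval from the support check keeps the two concerns swappable: a better retriever raises the chance that the right evidence is found, and a better verifier reduces false matches on the evidence that is found.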
Mitigation Strategies
| Strategy | Description | Effectiveness |
|---|---|---|
| Constrained decoding | Restrict to supported claims | Moderate |
| Citation requirements | Force source attribution | High |
| Temperature reduction | Lower sampling randomness | Low-Moderate |
| Self-consistency | Vote across multiple samples | High |
| Post-hoc verification | Check and revise after generation | High |
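The self-consistency row can be illustrated with a short voting sketch. It assumes short, directly comparable answers (such as a year or a name) that have already been sampled from the model. The function name `self_consistent_answer`, the `min_agreement` threshold, and the sampled values are hypothetical.

```python
from collections import Counter


def self_consistent_answer(answers, min_agreement=0.5):
    """Return (answer, agreement); answer is None when no answer is frequent enough."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivially different strings vote together
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return (top if agreement >= min_agreement else None), agreement


# Made-up samples for the question "When was Einstein born?"
sampled = ["1879", "1879", "1878", "1879 ", "1880"]
answer, agreement = self_consistent_answer(sampled)
print(answer, agreement)
```

Returning `None` below the agreement threshold lets the caller abstain or fall back to retrieval instead of emitting a low-consensus answer. Free-form responses would need claim extraction or semantic clustering before voting.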
Evaluation Metrics for Hallucination Detection
| Metric | Formula | Interpretation |
|---|---|---|
| Factual Precision | Supported claims / Total claims | What fraction is correct |
| Factual Recall | Detected hallucinations / True hallucinations | Detection coverage |
| Hallucination Rate | Hallucinated claims / Total claims | Overall fabrication |
| Citation Precision | Supported citations / Total citations | Citation accuracy |
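The first three metrics in the table can be computed from claim-level annotations, as in the sketch below. It assumes each claim carries a gold label `hallucinated` and a detector decision `flagged` (both hypothetical field names), and it treats every non-hallucinated claim as supported. Citation precision is left out because it needs citation-level labels.

```python
def detection_metrics(claims):
    """Each claim is a dict with boolean keys 'hallucinated' (gold) and 'flagged' (detector)."""
    total = len(claims)
    true_hallucinations = sum(c["hallucinated"] for c in claims)
    detected = sum(c["hallucinated"] and c["flagged"] for c in claims)
    return {
        # Supported claims / total claims
        "factual_precision": (total - true_hallucinations) / total if total else 0.0,
        # Hallucinated claims / total claims
        "hallucination_rate": true_hallucinations / total if total else 0.0,
        # Detected hallucinations / true hallucinations
        "factual_recall": detected / true_hallucinations if true_hallucinations else 0.0,
    }


# Made-up annotations: two supported claims, two hallucinations, one of them caught
annotated = [
    {"hallucinated": False, "flagged": False},
    {"hallucinated": False, "flagged": False},
    {"hallucinated": True, "flagged": True},
    {"hallucinated": True, "flagged": False},
]
metrics = detection_metrics(annotated)
print(metrics)
```

Here half of the claims are supported, half are hallucinated, and the detector catches one of the two hallucinations, so all three values come out to 0.5.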
Key Takeaways
- SelfCheckGPT uses consistency across the model's own samples to detect hallucinations without external sources
- NLI-based approaches provide principled verification against source documents
- Confidence calibration helps models express appropriate uncertainty
- Retrieval-augmented verification grounds outputs in external knowledge
- Multi-strategy approaches that combine several techniques tend to achieve the best results
- Always combine automatic detection with human review for high-stakes applications