Hallucination Detection

Hallucinations are generated outputs that are factually incorrect, fabricated, or not grounded in the source context. Detecting and mitigating hallucinations is critical for deploying LLMs in production.

Types of Hallucinations

Type	Description	Example
Factual	Contradicts known facts	"The Eiffel Tower is in London"
Faithfulness	Contradicts or departs from the source context	Summarizing events not in the article
Intrinsic	Distorts information that is present in the source	Misattributing a quote
Extrinsic	Adds unsupported information	Inventing statistics
Instruction	Ignores task constraints	Generating when asked to extract

Hallucination Detection Pipeline

SelfCheckGPT

SelfCheckGPT rests on the intuition that when a model actually knows a fact, independently sampled responses tend to agree on it, whereas hallucinated content varies or contradicts itself across samples.

Definition: SelfCheckGPT Score

For a claim $c$ extracted from response $r$, sample $N$ additional responses $\{r_1, \ldots, r_N\}$ from the same prompt. The SelfCheck score is:

$$\text{SelfCheck}(c) = 1 - \frac{1}{N} \sum_{i=1}^{N} P(c \mid r_i)$$

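A score near 1 means that few of the sampled responses support the claim, which suggests it is not backed by the model's own knowledge. A score near 0 means the samples support it consistently.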
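As a quick numerical illustration of the formula, the sketch below computes the score from per-sample support probabilities. The function name `selfcheck_score` and the probability values are made up for illustration; in practice each $P(c | r_i)$ would come from a scoring model.

```python
def selfcheck_score(support_probs):
    """SelfCheck(c) = 1 - (1/N) * sum_i P(c | r_i)."""
    if not support_probs:
        raise ValueError("need at least one sampled response")
    return 1.0 - sum(support_probs) / len(support_probs)

# Hypothetical support probabilities for two claims across N = 4 samples
consistent_claim = [0.9, 0.8, 1.0, 0.9]    # samples mostly agree with the claim
inconsistent_claim = [0.1, 0.0, 0.3, 0.0]  # samples rarely support the claim

print(round(selfcheck_score(consistent_claim), 2))    # low score: likely grounded
print(round(selfcheck_score(inconsistent_claim), 2))  # high score: likely hallucinated
```

With these made-up inputs the first claim scores about 0.1 and the second about 0.9, so only the second would be flagged under any reasonable threshold.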

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SelfCheckGPT:
    def __init__(self, model_name="gpt2-medium", num_samples=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.num_samples = num_samples

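    def get_logprobs(self, text, context=""):
        """Mean per-token log-probability of `text`, conditioned on `context`."""
        # Tokenize separately so we know which positions belong to `text`
        if context:
            ctx_ids = self.tokenizer(context)["input_ids"]
        else:
            ctx_ids = [self.tokenizer.bos_token_id]  # need at least one context token
        text_ids = self.tokenizer(" " + text, add_special_tokens=False)["input_ids"]
        input_ids = torch.tensor([ctx_ids + text_ids])

        with torch.no_grad():
            logits = self.model(input_ids=input_ids).logits[:, :-1, :]
        targets = input_ids[:, 1:]

        log_probs = torch.log_softmax(logits, dim=-1)
        token_logprobs = torch.gather(
            log_probs, 2, targets.unsqueeze(-1)
        ).squeeze(-1)

        # Average over the tokens of `text` only, not the context
        return token_logprobs[:, -len(text_ids):].mean().item()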

    def check_claims(self, prompt, response, claims):
        """Check each claim for consistency across samples."""
        # Generate additional samples
        inputs = self.tokenizer(prompt, return_tensors="pt")
        samples = []
        for _ in range(self.num_samples):
            output = self.model.generate(
                **inputs, max_new_tokens=200, do_sample=True, temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
            )
            samples.append(self.tokenizer.decode(output[0], skip_special_tokens=True))

        results = []
        for claim in claims:
            # Score the claim against each sample: exp(mean token log-prob) is a
            # length-normalized proxy for P(claim | sample), in (0, 1]
            scores = []
            for sample in samples:
                logprob = self.get_logprobs(claim, context=sample)
                scores.append(torch.tensor(logprob).exp().item())

            # Low mean support (or high variance across samples) suggests hallucination
            mean_score = sum(scores) / len(scores)
            variance = sum((s - mean_score)**2 for s in scores) / len(scores)

            results.append({
                "claim": claim,
                "mean_support": mean_score,
                "variance": variance,
                # SelfCheck score: 1 - average support across samples
                "hallucination_risk": 1 - mean_score,
            })

        return results

# Usage
checker = SelfCheckGPT()
response = "Albert Einstein was born in 1879 in Ulm, Germany. He developed the theory of relativity."
claims = ["Albert Einstein was born in 1879", "He was born in Ulm, Germany", "He developed the theory of relativity"]
results = checker.check_claims("Tell me about Albert Einstein.", response, claims)

NLI-Based Detection

Natural Language Inference models can verify whether source text entails generated claims.

Definition: NLI Verification

For a source document $d$ and a claim $c$, the NLI model predicts:

$$P(\text{entailment} \mid d, c), \quad P(\text{neutral} \mid d, c), \quad P(\text{contradiction} \mid d, c)$$

A hallucination is detected when $P(\text{contradiction}) > \tau$ or $P(\text{entailment}) < 1 - \tau$ for threshold $\tau$.
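The decision rule above can be sketched as a small function. The name `is_hallucination` and the probability triples are hypothetical and chosen only to show how the two conditions interact; real values would come from an NLI model.

```python
def is_hallucination(p_entail, p_contra, tau=0.5):
    """Flag a claim when P(contradiction) > tau or P(entailment) < 1 - tau."""
    return p_contra > tau or p_entail < 1.0 - tau

# Hypothetical NLI outputs as (entailment, neutral, contradiction)
examples = {
    "supported": (0.92, 0.06, 0.02),
    "contradicted": (0.03, 0.07, 0.90),
    "unsupported (neutral)": (0.10, 0.85, 0.05),
}
for name, (p_e, _p_n, p_c) in examples.items():
    print(name, is_hallucination(p_e, p_c, tau=0.5))
```

The entailment condition matters for the third case: a claim the source neither supports nor contradicts gets a high neutral probability, so a contradiction-only check would let it through.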

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class NLIHallucinationDetector:
    def __init__(self, model_name="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        # Label order is checkpoint-specific; confirm against self.model.config.id2label
        self.labels = ["entailment", "neutral", "contradiction"]

    def verify_claim(self, source, claim):
        inputs = self.tokenizer(
            source, claim, return_tensors="pt", truncation=True, max_length=512
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
            probs = torch.softmax(logits, dim=-1)[0]

        return {
            "label": self.labels[probs.argmax().item()],
            "entailment_prob": probs[0].item(),
            "neutral_prob": probs[1].item(),
            "contradiction_prob": probs[2].item(),
            "is_hallucination": probs[2].item() > 0.5,
        }

    def detect_hallucinations(self, source, claims):
        results = []
        for claim in claims:
            result = self.verify_claim(source, claim)
            results.append(result)

        hallucination_rate = sum(1 for r in results if r["is_hallucination"]) / len(results)
        return {
            "claim_results": results,
            "hallucination_rate": hallucination_rate,
        }

# Usage
detector = NLIHallucinationDetector()
source = "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year."
claims = [
    "Q3 revenue was $4.2 billion",
    "Revenue increased 15% year over year",
    "Q3 revenue was $5.1 billion",  # Hallucination
]
results = detector.detect_hallucinations(source, claims)
print(f"Hallucination rate: {results['hallucination_rate']:.1%}")

Confidence Calibration

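A model is well calibrated when its stated confidence matches its empirical accuracy: among outputs assigned 80% confidence, about 80% should be correct. Calibrated confidence scores can then be used to flag outputs that are likely to be hallucinated.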

Definition: Expected Calibration Error (ECE)

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \text{accuracy}(B_m) - \text{confidence}(B_m) \right|$$

where $B_m$ is the set of samples with confidence in the $m$-th interval and $N$ is the total number of samples.
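To make the definition concrete, here is a minimal pure-Python sketch of ECE on toy data. The function name `expected_calibration_error`, the equal-width binning, and the four data points are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum_m (|B_m| / N) * |accuracy(B_m) - confidence(B_m)|."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence == 1.0 goes to the last bin
        idx = min(int(c * num_bins), num_bins - 1)
        bins[idx].append((c, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Toy data: two overconfident predictions and two underconfident ones
conf = [0.95, 0.95, 0.55, 0.55]
hit = [1, 0, 1, 1]
print(round(expected_calibration_error(conf, hit), 3))
```

In this toy case the high-confidence bin has accuracy 0.5 against confidence 0.95, and the mid-confidence bin has accuracy 1.0 against confidence 0.55. Each bin holds half the samples and contributes a gap of 0.45, so the ECE is about 0.45.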

import numpy as np
import torch

class ConfidenceCalibrator:
    def __init__(self, model, tokenizer, num_bins=10):
        self.model = model
        self.tokenizer = tokenizer
        self.num_bins = num_bins

    def compute_perplexity_confidence(self, text):
        """Use perplexity as a confidence signal."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
        perplexity = torch.exp(loss).item()
        confidence = 1.0 / (1.0 + np.log(perplexity))
        return perplexity, confidence

    def compute_entropy_confidence(self, text):
        """Use token-level entropy as confidence."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits[:, :-1, :]

        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
        mean_entropy = entropy.mean().item()
        max_entropy = np.log(probs.shape[-1])
        confidence = 1.0 - (mean_entropy / max_entropy)
        return mean_entropy, confidence

    def compute_ece(self, texts, labels, predict_fn, num_bins=None):
        """Compute Expected Calibration Error.

        `predict_fn` is a caller-supplied callable (text -> predicted label);
        this class does not define a predictor itself.
        """
        num_bins = num_bins or self.num_bins
        confidences, accuracies = [], []
        for text, label in zip(texts, labels):
            _, conf = self.compute_perplexity_confidence(text)
            confidences.append(conf)
            accuracies.append(1.0 if predict_fn(text) == label else 0.0)

        confidences = np.array(confidences)
        accuracies = np.array(accuracies)
        bins = np.linspace(0, 1, num_bins + 1)
        ece = 0.0
        for i in range(num_bins):
            # Half-open bins [lo, hi), except the last, which also includes 1.0
            last = i == num_bins - 1
            upper = confidences <= bins[i + 1] if last else confidences < bins[i + 1]
            mask = (confidences >= bins[i]) & upper
            if not mask.any():
                continue
            # mask.mean() equals |B_m| / N
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())

        return float(ece)

Retrieval-Augmented Verification

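Retrieval-augmented verification uses external knowledge sources to check generated content: relevant passages are retrieved for each claim, and the claim is then verified against that evidence.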
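A minimal sketch of this retrieve-then-verify loop follows. Everything in it is an illustrative assumption: the token-overlap retriever stands in for a real search index or embedding store, and `naive_verifier` stands in for a proper verifier such as an NLI model.

```python
import re

def tokenize(text):
    """Lowercase word and number tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))

def retrieve(claim, corpus, k=2):
    """Rank passages by token overlap with the claim and return the top k."""
    claim_tokens = tokenize(claim)
    ranked = sorted(corpus, key=lambda p: len(claim_tokens & tokenize(p)), reverse=True)
    return ranked[:k]

def verify_with_retrieval(claim, corpus, verify_fn, k=2):
    """Retrieve evidence, then ask `verify_fn` whether any passage supports the claim."""
    evidence = retrieve(claim, corpus, k=k)
    supported = any(verify_fn(passage, claim) for passage in evidence)
    return {"claim": claim, "evidence": evidence, "supported": supported}

def naive_verifier(passage, claim):
    """Toy check (assumption): every claim token must appear in the passage."""
    return tokenize(claim) <= tokenize(passage)

corpus = [
    "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year.",
    "The Eiffel Tower is located in Paris, France.",
]
for claim in ["Q3 revenue of $4.2 billion", "Q3 revenue of $5.1 billion"]:
    result = verify_with_retrieval(claim, corpus, naive_verifier)
    print(result["claim"], "->", result["supported"])
```

Keeping the verifier as a pluggable callable means the retrieval step can be tested on its own and the toy check can later be swapped for a stronger model.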

Mitigation Strategies

Strategy	Description	Effectiveness
Constrained decoding	Restrict to supported claims	Moderate
Citation requirements	Force source attribution	High
Temperature reduction	Lower sampling randomness	Low-Moderate
Self-consistency	Vote across multiple samples	High
Post-hoc verification	Check and revise after generation	High
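
As one illustration of the self-consistency row, the sketch below takes a majority vote over several sampled answers and abstains when agreement is low. The function name `self_consistent_answer`, the `min_agreement` threshold, and the sample answers are hypothetical.

```python
from collections import Counter

def self_consistent_answer(answers, min_agreement=0.5):
    """Return (majority answer or None, vote share) for a list of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivially different strings count as the same vote
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    share = votes / len(normalized)
    return (winner if share >= min_agreement else None), share

# Hypothetical answers sampled from the same prompt
samples = ["1879", "1879", "1878", "1879 ", "1879"]
print(self_consistent_answer(samples))
```

Abstaining when no answer reaches the agreement threshold is a design choice: it trades coverage for a lower chance of returning a fabricated answer.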

Evaluation Metrics for Hallucination Detection

Metric	Formula	Interpretation
Factual Precision	Supported claims / Total claims	What fraction is correct
Factual Recall	Detected hallucinations / True hallucinations	Detection coverage
Hallucination Rate	Hallucinated claims / Total claims	Overall fabrication
Citation Precision	Supported citations / Total citations	Citation accuracy
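
The claim-level metrics in the table can be computed from gold labels and detector flags, as in the sketch below. The function name `detection_metrics` and the example labels are assumptions, and "detected hallucinations" is read here as true hallucinations that the detector flagged.

```python
def detection_metrics(is_hallucinated, flagged):
    """Compute claim-level metrics from gold labels and detector flags.

    is_hallucinated[i] is True when claim i is actually a hallucination;
    flagged[i] is True when the detector marked claim i as a hallucination.
    """
    total = len(is_hallucinated)
    true_hallucinations = sum(is_hallucinated)
    supported = total - true_hallucinations
    detected = sum(1 for gold, flag in zip(is_hallucinated, flagged) if gold and flag)
    return {
        "factual_precision": supported / total if total else 0.0,
        "hallucination_rate": true_hallucinations / total if total else 0.0,
        "factual_recall": detected / true_hallucinations if true_hallucinations else 0.0,
    }

# Hypothetical evaluation over five claims
gold = [False, False, True, True, False]
flags = [False, True, True, False, False]
print(detection_metrics(gold, flags))
```

For this made-up data, three of five claims are supported (factual precision 0.6, hallucination rate 0.4) and the detector catches one of the two true hallucinations (factual recall 0.5).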

Key Takeaways

SelfCheckGPT leverages the model's own uncertainty for detection without external sources
NLI-based approaches provide principled verification against source documents
Confidence calibration helps models express appropriate uncertainty
Retrieval-augmented verification grounds outputs in external knowledge
Multi-strategy approaches that combine several techniques achieve the best results
Always combine automatic detection with human review for high-stakes applications
