
Hallucination Detection

Hallucinations are generated outputs that are factually incorrect, fabricated, or not grounded in the source context. Detecting and mitigating hallucinations is critical for deploying LLMs in production.

Types of Hallucinations

| Type | Description | Example |
| --- | --- | --- |
| Factual | Contradicts known facts | "The Eiffel Tower is in London" |
| Faithfulness | Departs from or contradicts the source context | Summarizing events not in the article |
| Intrinsic | Draws on the source but distorts it | Misattributing a quote |
| Extrinsic | Adds information the source does not support | Inventing statistics |
| Instruction | Ignores task constraints | Generating when asked to extract |

Hallucination Detection Pipeline


SelfCheckGPT

SelfCheckGPT rests on the intuition that when a model actually knows a fact, independently sampled responses tend to agree on it, whereas hallucinated content varies or contradicts itself from sample to sample.

Definition: SelfCheckGPT Score

For a claim $c$ extracted from response $r$, sample $N$ additional responses $\{r_1, \ldots, r_N\}$ from the same prompt. The SelfCheck score is:

$$\text{SelfCheck}(c) = 1 - \frac{1}{N} \sum_{i=1}^{N} P(c \mid r_i)$$

A high score indicates the claim is unlikely to be supported by the model's own knowledge.
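As a quick worked example of the formula above, the sketch below computes the score from a list of per-sample support probabilities. The function name `selfcheck_score` and the probability values are illustrative assumptions; in practice each $P(c \mid r_i)$ would come from a scorer such as an NLI model or a language-model likelihood.

```python
def selfcheck_score(support_probs):
    """SelfCheck(c) = 1 - mean of P(c | r_i) over the N sampled responses."""
    if not support_probs:
        raise ValueError("need at least one sampled response")
    return 1.0 - sum(support_probs) / len(support_probs)


# Hypothetical support values for two claims across N = 4 samples
consistent_claim = [0.9, 0.8, 0.95, 0.85]    # samples mostly agree with the claim
inconsistent_claim = [0.1, 0.4, 0.05, 0.25]  # samples rarely support the claim

print(round(selfcheck_score(consistent_claim), 3))    # 0.125
print(round(selfcheck_score(inconsistent_claim), 3))  # 0.8
```

The second claim scores much higher, so it is the one to treat as a likely hallucination.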

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SelfCheckGPT:
    def __init__(self, model_name="gpt2-medium", num_samples=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.num_samples = num_samples

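    def get_logprobs(self, text, context=""):
        """Mean log-probability of the tokens of `text`, conditioned on `context`."""
        if context:
            ctx_ids = self.tokenizer(context, return_tensors="pt")["input_ids"]
            # Leading space so the text tokenizes as a continuation of the context
            txt_ids = self.tokenizer(
                " " + text, return_tensors="pt", add_special_tokens=False
            )["input_ids"]
            input_ids = torch.cat([ctx_ids, txt_ids], dim=1)
            start = ctx_ids.shape[1] - 1
        else:
            input_ids = self.tokenizer(text, return_tensors="pt")["input_ids"]
            start = 0  # the first token has no prefix and is not scored

        with torch.no_grad():
            logits = self.model(input_ids).logits[:, :-1, :]
            targets = input_ids[:, 1:]
            log_probs = torch.log_softmax(logits, dim=-1)
            token_logprobs = torch.gather(
                log_probs, 2, targets.unsqueeze(-1)
            ).squeeze(-1)

        # Average over the text tokens only, so the context does not dilute the score
        return token_logprobs[:, start:].mean().item()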

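    def check_claims(self, prompt, response, claims):
        """Check each claim for consistency across samples.

        `response` is the answer the claims were extracted from; it is not
        re-scored here. This is a simplified likelihood-based proxy for
        P(claim | sample), not the full SelfCheckGPT method.
        """
        # Generate additional stochastic samples from the same prompt
        inputs = self.tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        samples = []
        for _ in range(self.num_samples):
            output = self.model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=True,
                temperature=0.7,
                pad_token_id=self.tokenizer.eos_token_id,  # GPT-2 has no pad token
            )
            # Keep only the continuation, not the echoed prompt
            samples.append(
                self.tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
            )

        results = []
        for claim in claims:
            # Score the claim conditioned on each sample (not on the prompt);
            # exp(mean log-prob) maps the score into [0, 1]
            scores = [
                torch.exp(torch.tensor(self.get_logprobs(claim, context=sample))).item()
                for sample in samples
            ]

            # Low mean support or high variance across samples suggests hallucination
            mean_score = sum(scores) / len(scores)
            variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)

            results.append({
                "claim": claim,
                "mean_support": mean_score,
                "variance": variance,
                "hallucination_risk": 1 - mean_score,
            })

        return results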

# Usage
checker = SelfCheckGPT()
response = "Albert Einstein was born in 1879 in Ulm, Germany. He developed the theory of relativity."
claims = ["Albert Einstein was born in 1879", "He was born in Ulm, Germany", "He developed the theory of relativity"]
results = checker.check_claims("Tell me about Albert Einstein.", response, claims)

NLI-Based Detection

Natural Language Inference models can verify whether source text entails generated claims.

Definition: NLI Verification

For a source document $d$ and a claim $c$, the NLI model predicts:

$$P(\text{entailment} \mid d, c), \quad P(\text{neutral} \mid d, c), \quad P(\text{contradiction} \mid d, c)$$

A hallucination is detected when $P(\text{contradiction}) > \tau$ or $P(\text{entailment}) < 1 - \tau$ for threshold $\tau$.
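The decision rule above can be sketched as a small function over the three class probabilities. The function name `is_hallucination` and the probability triples are hypothetical; real values would come from an NLI model's softmax output.

```python
def is_hallucination(p_entail, p_neutral, p_contra, tau=0.5):
    """Flag a claim when contradiction is likely or entailment is too weak."""
    total = p_entail + p_neutral + p_contra
    if abs(total - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return p_contra > tau or p_entail < 1.0 - tau


# Hypothetical NLI outputs: (entailment, neutral, contradiction)
print(is_hallucination(0.92, 0.06, 0.02))           # False: clearly entailed
print(is_hallucination(0.03, 0.07, 0.90))           # True: contradicted by the source
print(is_hallucination(0.30, 0.65, 0.05))           # True: entailment below 1 - tau
print(is_hallucination(0.30, 0.65, 0.05, tau=0.8))  # False: both tests pass at tau = 0.8
```

Note that raising $\tau$ relaxes both conditions at once, so a mostly neutral claim that is flagged at $\tau = 0.5$ can pass at $\tau = 0.8$.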

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

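class NLIHallucinationDetector:
    # Assumption: the checkpoint is hosted under the MoritzLaurer namespace on the
    # Hugging Face Hub and uses the labels entailment / neutral / contradiction.
    def __init__(self, model_name="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", threshold=0.5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        # Read the label order from the checkpoint rather than hard-coding it
        id2label = self.model.config.id2label
        self.labels = [id2label[i].lower() for i in range(len(id2label))]
        self.threshold = threshold

    def verify_claim(self, source, claim):
        # The source is the premise and the claim is the hypothesis
        inputs = self.tokenizer(
            source, claim, return_tensors="pt", truncation=True, max_length=512
        )
        with torch.no_grad():
            logits = self.model(**inputs).logits
            probs = torch.softmax(logits, dim=-1)[0]

        by_label = {label: probs[i].item() for i, label in enumerate(self.labels)}
        tau = self.threshold
        return {
            "label": max(by_label, key=by_label.get),
            "entailment_prob": by_label["entailment"],
            "neutral_prob": by_label["neutral"],
            "contradiction_prob": by_label["contradiction"],
            # Flag contradicted claims and claims the source does not entail
            "is_hallucination": by_label["contradiction"] > tau
            or by_label["entailment"] < 1 - tau,
        }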

    def detect_hallucinations(self, source, claims):
        results = []
        for claim in claims:
            result = self.verify_claim(source, claim)
            results.append(result)

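        # Guard against an empty claim list to avoid dividing by zero
        flagged = sum(1 for r in results if r["is_hallucination"])
        hallucination_rate = flagged / len(results) if results else 0.0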
        return {
            "claim_results": results,
            "hallucination_rate": hallucination_rate,
        }

# Usage
detector = NLIHallucinationDetector()
source = "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year."
claims = [
    "Q3 revenue was $4.2 billion",
    "Revenue increased 15% year over year",
    "Q3 revenue was $5.1 billion",  # Hallucination
]
results = detector.detect_hallucinations(source, claims)
print(f"Hallucination rate: {results['hallucination_rate']:.1%}")

Confidence Calibration

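A model is well calibrated when its stated confidence matches its empirical accuracy: among outputs assigned 80% confidence, about 80% should be correct. Calibrated confidence scores can then be used to flag outputs that are likely to be hallucinated.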

Definition: Expected Calibration Error (ECE)

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \text{accuracy}(B_m) - \text{confidence}(B_m) \right|$$

where $B_m$ is the set of samples with confidence in the $m$-th interval and $N$ is the total number of samples.
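To make the definition concrete, here is a minimal pure-Python sketch that evaluates ECE on a handful of made-up predictions. The function name `expected_calibration_error` and the data are assumptions for illustration, and equal-width bins are one common choice rather than the only one.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum over bins of (|B_m| / N) * |accuracy(B_m) - confidence(B_m)|."""
    n = len(confidences)
    ece = 0.0
    for m in range(num_bins):
        lo, hi = m / num_bins, (m + 1) / num_bins
        # Bins are [lo, hi), except the last one, which also includes 1.0
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (m == num_bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece


# Hypothetical predictions: confidence and whether each one was correct
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
print(round(expected_calibration_error(confidences, correct), 4))  # 0.15
```

Here the high-confidence bin is overconfident (accuracy 0.75 against confidence 0.95) and contributes $\frac{4}{6} \times 0.20$, while the other bin contributes $\frac{2}{6} \times 0.05$, for a total of 0.15.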

import numpy as np
import torch

class ConfidenceCalibrator:
    def __init__(self, model, tokenizer, num_bins=10):
        self.model = model
        self.tokenizer = tokenizer
        self.num_bins = num_bins

    def compute_perplexity_confidence(self, text):
        """Use perplexity as a confidence signal."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
        perplexity = torch.exp(loss).item()
        confidence = 1.0 / (1.0 + np.log(perplexity))
        return perplexity, confidence

    def compute_entropy_confidence(self, text):
        """Use token-level entropy as confidence."""
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits[:, :-1, :]

        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1)
        mean_entropy = entropy.mean().item()
        max_entropy = np.log(probs.shape[-1])
        confidence = 1.0 - (mean_entropy / max_entropy)
        return mean_entropy, confidence

    def compute_ece(self, texts, labels, predict_fn, num_bins=None):
        """Compute Expected Calibration Error.

        `predict_fn` maps a text to a predicted label. It is a caller-supplied,
        task-specific function (an assumption here); this class does not define one.
        """
        num_bins = num_bins or self.num_bins
        confidences = np.array(
            [self.compute_perplexity_confidence(text)[1] for text in texts]
        )
        correct = np.array(
            [float(predict_fn(text) == label) for text, label in zip(texts, labels)]
        )

        bins = np.linspace(0, 1, num_bins + 1)
        ece = 0.0
        for i in range(num_bins):
            # Close the last bin on the right so a confidence of exactly 1.0 is counted
            if i == num_bins - 1:
                mask = (confidences >= bins[i]) & (confidences <= bins[i + 1])
            else:
                mask = (confidences >= bins[i]) & (confidences < bins[i + 1])
            if not mask.any():
                continue
            bin_conf = confidences[mask].mean()
            bin_acc = correct[mask].mean()
            ece += mask.sum() / len(texts) * abs(bin_acc - bin_conf)

        return float(ece)

Retrieval-Augmented Verification

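Retrieval-augmented verification checks generated content against external knowledge sources: for each claim, retrieve the most relevant passages, then test whether that evidence supports the claim.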
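A minimal sketch of this retrieve-then-verify loop follows. Everything in it is an illustrative assumption: the in-memory `knowledge_base`, the token-overlap retriever, and the toy `numbers_match` verifier stand in for a real search index and a real entailment model.

```python
import re


def tokenize(text):
    """Lowercase word and number tokens (decimals such as 4.2 stay whole)."""
    return set(re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower()))


def retrieve(claim, knowledge_base, top_k=1):
    """Rank passages by Jaccard overlap with the claim (toy lexical retriever)."""
    claim_tokens = tokenize(claim)

    def score(passage):
        tokens = tokenize(passage)
        union = claim_tokens | tokens
        return len(claim_tokens & tokens) / len(union) if union else 0.0

    return sorted(knowledge_base, key=score, reverse=True)[:top_k]


def numbers_match(passage, claim):
    """Toy verifier: every number in the claim must also appear in the passage."""
    def numbers(text):
        return set(re.findall(r"\d+(?:\.\d+)?", text))
    return numbers(claim) <= numbers(passage)


def verify_with_retrieval(claim, knowledge_base, supports, top_k=1):
    """Return the retrieved evidence and whether any passage supports the claim."""
    evidence = retrieve(claim, knowledge_base, top_k)
    supported = any(supports(passage, claim) for passage in evidence)
    return {"claim": claim, "evidence": evidence, "supported": supported}


knowledge_base = [
    "The company reported Q3 revenue of $4.2 billion, a 15% increase year over year.",
    "The Eiffel Tower is located in Paris, France.",
]
for claim in ["Q3 revenue was $4.2 billion", "Q3 revenue was $5.1 billion"]:
    result = verify_with_retrieval(claim, knowledge_base, numbers_match)
    print(claim, "->", result["supported"])
```

The `supports` argument is deliberately pluggable, so the number check can be swapped for an NLI-based verifier without changing the retrieval step. The toy check accepts any claim that contains no numbers, which is one reason it should not be used as is.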


Mitigation Strategies

| Strategy | Description | Effectiveness |
| --- | --- | --- |
| Constrained decoding | Restrict to supported claims | Moderate |
| Citation requirements | Force source attribution | High |
| Temperature reduction | Lower sampling randomness | Low-Moderate |
| Self-consistency | Vote across multiple samples | High |
| Post-hoc verification | Check and revise after generation | High |
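As an illustration of the self-consistency row, the sketch below takes a majority vote over several sampled answers and abstains when agreement is low. The function name, the `min_agreement` threshold, and the sample answers are assumptions made for the example; exact string matching after normalization is a simplification that suits short answers only.

```python
from collections import Counter


def self_consistent_answer(samples, min_agreement=0.5):
    """Majority vote over sampled answers; abstain (None) when agreement is too low."""
    if not samples:
        raise ValueError("need at least one sample")
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    return (answer if agreement >= min_agreement else None), agreement


# Hypothetical sampled answers to "In what year was Einstein born?"
print(self_consistent_answer(["1879", "1879", " 1879 ", "1878", "1879"]))  # ('1879', 0.8)

# No answer reaches the agreement threshold, so the function abstains
print(self_consistent_answer(["Paris", "Lyon", "Nice", "Rome"]))  # (None, 0.25)
```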

Evaluation Metrics for Hallucination Detection

| Metric | Formula | Interpretation |
| --- | --- | --- |
| Factual Precision | Supported claims / Total claims | What fraction is correct |
| Factual Recall | Detected hallucinations / True hallucinations | Detection coverage |
| Hallucination Rate | Hallucinated claims / Total claims | Overall fabrication |
| Citation Precision | Supported citations / Total citations | Citation accuracy |
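The first three metrics can be computed from a list of annotated claims, as in the sketch below. The field names `supported` (human gold label) and `flagged` (detector output) are assumptions for this example, and citation precision is left out because it needs citation-level annotations.

```python
def hallucination_metrics(claims):
    """Each claim is a dict with boolean `supported` (gold) and `flagged` (detector)."""
    total = len(claims)
    if total == 0:
        raise ValueError("no claims to evaluate")
    hallucinated = [c for c in claims if not c["supported"]]
    detected = [c for c in hallucinated if c["flagged"]]
    return {
        "factual_precision": (total - len(hallucinated)) / total,
        "hallucination_rate": len(hallucinated) / total,
        # Undefined when there are no true hallucinations to detect
        "factual_recall": len(detected) / len(hallucinated) if hallucinated else None,
    }


# Hypothetical annotations for five extracted claims
claims = [
    {"supported": True, "flagged": False},
    {"supported": True, "flagged": True},    # false alarm
    {"supported": False, "flagged": True},   # hallucination caught
    {"supported": False, "flagged": False},  # hallucination missed
    {"supported": True, "flagged": False},
]
print(hallucination_metrics(claims))
# {'factual_precision': 0.6, 'hallucination_rate': 0.4, 'factual_recall': 0.5}
```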

Key Takeaways

  • SelfCheckGPT uses the consistency of the model's own samples to detect hallucinations without external sources
  • NLI-based approaches provide principled verification against source documents
  • Confidence calibration helps models express appropriate uncertainty
  • Retrieval-augmented verification grounds outputs in external knowledge
  • Combining several detection and mitigation techniques generally works better than relying on any single one
  • Always combine automatic detection with human review for high-stakes applications