Question Answering

Question answering (QA) systems aim to automatically answer questions posed in natural language. QA spans extractive, generative, and retrieval-based approaches.

QA Types Comparison

Type	Input	Output	Example
Extractive	(Question, Context)	Span from context	SQuAD
Generative	(Question, Context)	Free-form answer	Natural Questions
Open-domain	Question	Answer + sources	TriviaQA
Closed-book	Question	Answer only	WebQuestions
Multi-hop	(Question, Contexts)	Chain of reasoning	HotpotQA

Extractive QA

Extractive QA identifies a span of text from the context that answers the question.

Extractive QA Scoring
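A common way to score candidate spans is sketched below. It is a standard formulation and not necessarily the exact definition used by any particular model. The symbols are notation assumed for this sketch: $s_i$ and $e_j$ are the start and end logits for context tokens $i$ and $j$, $n$ is the number of tokens, and $L_{\max}$ is a maximum answer length.

```latex
P_{\text{start}}(i) = \frac{\exp(s_i)}{\sum_{k=1}^{n} \exp(s_k)},
\qquad
P_{\text{end}}(j) = \frac{\exp(e_j)}{\sum_{k=1}^{n} \exp(e_k)}

(\hat{i}, \hat{j}) =
  \arg\max_{\substack{1 \le i \le j \le n \\ j - i < L_{\max}}}
  \left( s_i + e_j \right)
```

Because $\log P_{\text{start}}(i) + \log P_{\text{end}}(j)$ differs from $s_i + e_j$ only by a constant, maximizing the sum of logits selects the same span as maximizing the product of the two probabilities.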

from transformers import pipeline

# Using Hugging Face pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

context = """
BERT was developed by Google and released in 2018.
It achieved state-of-the-art results on eleven NLP tasks.
BERT stands for Bidirectional Encoder Representations from Transformers.
"""

question = "When was BERT released?"
result = qa_pipeline(question=question, context=context)

print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
print(f"Start: {result['start']}, End: {result['end']}")
# Example output (the exact score depends on the model and library version):
# Answer: 2018
# Score: 0.9876

Custom Extractive QA Model

import torch
import torch.nn as nn
from transformers import AutoModel

class ExtractiveQA(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size

        self.qa_outputs = nn.Linear(hidden_size, 2)  # start + end logits

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        sequence_output = outputs.last_hidden_state
        logits = self.qa_outputs(sequence_output)  # (batch, seq_len, 2)

        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        return start_logits, end_logits

def train_qa_step(model, batch, optimizer):
    model.train()
    start_logits, end_logits = model(
        batch['input_ids'],
        batch['attention_mask'],
        batch.get('token_type_ids')
    )

    # Loss: negative log-likelihood of start and end positions
    start_loss = nn.functional.cross_entropy(start_logits, batch['start_positions'])
    end_loss = nn.functional.cross_entropy(end_logits, batch['end_positions'])
    loss = (start_loss + end_loss) / 2

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
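At inference time the start and end logits still have to be turned into a single answer span. The sketch below shows one simple way to do that, assuming the logits have already been converted to plain Python lists for one example. The function name `best_span` and the parameter `max_answer_len` are illustrative and are not part of the model above.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_logits[i] + end_logits[j]
    subject to i <= j and a span of at most max_answer_len tokens."""
    if not start_logits or len(start_logits) != len(end_logits):
        raise ValueError("logit lists must be non-empty and of equal length")

    n = len(start_logits)
    best = (0, 0, float("-inf"))
    for i in range(n):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, n)):
            score = start_logits[i] + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy example with made-up logits for six tokens
tokens = ["BERT", "was", "released", "in", "2018", "."]
start_logits = [0.1, -1.0, 0.2, 0.5, 4.0, -2.0]
end_logits = [-0.5, -1.0, 0.0, 0.3, 3.5, 0.1]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer, score)
```

Searching over pairs jointly, rather than taking the two argmaxes independently, avoids predicting an end position that comes before the start.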

SQuAD and Related Datasets

Dataset	Train	Dev	Description
SQuAD 1.1	87,599	10,570	Extractive QA
SQuAD 2.0	130,319	11,873	+ Unanswerable questions
Natural Questions	307,373	7,830	Real Google queries
TriviaQA	95,000	13,838	Trivia questions
HotpotQA	90,564	7,405	Multi-hop reasoning

Generative QA

Generative QA produces free-form answers rather than extracting spans.

from transformers import AutoModelForCausalLM, AutoTokenizer

class GenerativeQA:
    def __init__(self, model_name="google/gemma-2b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

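    def answer(self, question, context="", max_new_tokens=200):
        prompt = f"Question: {question}\nContext: {context}\nAnswer:"
        inputs = self.tokenizer(prompt, return_tensors="pt")

        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # caps generated tokens only, not prompt + output
            temperature=0.7,
            top_p=0.9,
            do_sample=True
        )

        # generate() returns the prompt followed by the continuation;
        # drop the prompt tokens so only the answer is decoded
        prompt_length = inputs["input_ids"].shape[1]
        answer_ids = outputs[0][prompt_length:]
        return self.tokenizer.decode(answer_ids, skip_special_tokens=True).strip()

qa = GenerativeQA()
answer = qa.answer(
    "What is machine learning?",
    "Machine learning is a subset of AI that enables systems to learn from data."
)
print(answer)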

Multi-Hop Question Answering

Multi-hop QA requires reasoning across multiple pieces of evidence.

def multi_hop_qa(question, paragraphs, model):
    # `model` is assumed to expose score(question, paragraph) -> float and
    # extract_answer(question, context) -> str; neither is a library API.

    # Step 1: rank paragraphs by relevance and keep the top three
    relevance_scores = [model.score(question, p) for p in paragraphs]
    top_paragraphs = sorted(range(len(paragraphs)),
                           key=lambda i: relevance_scores[i], reverse=True)[:3]

    # Step 2: add one paragraph per hop and re-extract the answer
    current_context = ""
    intermediate_answer = ""  # returned unchanged if no paragraphs are given
    for idx in top_paragraphs:
        current_context += paragraphs[idx] + "\n"
        intermediate_answer = model.extract_answer(question, current_context)

        if is_final_answer(intermediate_answer):
            return intermediate_answer

    return intermediate_answer

def is_final_answer(answer):
    # Heuristic: a short answer (at most five words) that does not begin
    # with "According" is treated as final, e.g. an entity name or a number
    return len(answer.split()) <= 5 and not answer.startswith("According")

Evaluation Metrics

Metric	Description	Range
Exact Match (EM)	Percentage of exact matches	0-100
F1 Score	Token-level overlap	0-100
ROUGE-L	Longest common subsequence	0-1
BERTScore	Semantic similarity	0-1

QA F1 Score
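Token-level F1 treats the prediction and the reference as bags of tokens and combines precision P (shared tokens divided by predicted tokens) and recall R (shared tokens divided by reference tokens) as F1 = 2PR / (P + R). The sketch below follows the normalization commonly used for SQuAD-style evaluation (lowercasing, stripping punctuation and English articles). The function names are illustrative, and scores are returned per example in the range 0 to 1; averaging over a dataset and multiplying by 100 gives the 0-100 range in the table above.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()

    # If either side is empty, score 1.0 only when both are empty
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("in 2018", "2018"))          # 0.0
print(round(f1_score("in 2018", "2018"), 4))   # 0.6667
```

In the example, the prediction "in 2018" shares one token with the reference "2018", so P = 1/2, R = 1, and F1 = 2/3, while exact match is 0.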

For extractive QA, the model predicts start and end positions. The answer span is the text between these positions in the context. Training uses cross-entropy loss on start and end position logits.
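As a small worked example of that loss, the sketch below computes the cross-entropy for the start and end positions by hand and averages them. It uses plain Python lists and made-up logits; the helper names `cross_entropy` and `span_loss` are illustrative only.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return log_sum_exp - logits[target]


def span_loss(
    start_logits: List[float],
    end_logits: List[float],
    start_pos: int,
    end_pos: int,
) -> float:
    """Average of the start-position and end-position cross-entropy terms."""
    return (
        cross_entropy(start_logits, start_pos)
        + cross_entropy(end_logits, end_pos)
    ) / 2


context_tokens = ["BERT", "was", "released", "in", "2018"]
gold_start, gold_end = 4, 4  # the span covering "2018"
answer_text = " ".join(context_tokens[gold_start:gold_end + 1])

# An untrained model with uniform logits: each term equals log(5)
uniform_loss = span_loss([0.0] * 5, [0.0] * 5, gold_start, gold_end)

# A model that puts high logits on the gold positions has a lower loss
confident_loss = span_loss(
    [0.0, 0.0, 0.0, 0.0, 10.0],
    [0.0, 0.0, 0.0, 0.0, 10.0],
    gold_start,
    gold_end,
)

print(answer_text, uniform_loss, confident_loss)
```

With five tokens and uniform logits, every position has probability 1/5, so the loss is log 5; the loss falls as probability mass shifts toward the gold start and end positions.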
