Question Answering
Question answering (QA) systems aim to automatically answer questions posed in natural language. QA spans extractive, generative, and retrieval-based approaches.
QA Types Comparison
| Type | Input | Output | Example dataset |
|---|---|---|---|
| Extractive | (Question, Context) | Span from context | SQuAD |
| Generative | (Question, Context) | Free-form answer | Natural Questions |
| Open-domain | Question | Answer + sources | TriviaQA |
| Closed-book | Question | Answer only | WebQuestions |
| Multi-hop | (Question, Contexts) | Chain of reasoning | HotpotQA |
Extractive QA
Extractive QA identifies a span of text from the context that answers the question.
Extractive QA Scoring
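A minimal sketch of span scoring, assuming the model emits one start logit and one end logit per token: each candidate span (i, j) is scored as the sum of its start and end logits, and the highest-scoring valid span wins. The function name `best_span`, the `max_answer_len` cap, and the logit values are illustrative assumptions, not output from a real model.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start_logits[i] + end_logits[j]
    subject to i <= j < i + max_answer_len."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up logits for illustration only.
tokens = ["BERT", "was", "released", "in", "2018", "."]
start_logits = [0.1, -1.0, -0.5, 0.3, 4.2, -2.0]
end_logits = [-0.3, -1.2, 0.0, -0.8, 3.9, 0.5]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer, round(score, 2))
```

The nested loop is quadratic in sequence length; capping the span length keeps it cheap and also rules out implausibly long answers.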
from transformers import pipeline

# Using the Hugging Face question-answering pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

context = """
BERT was developed by Google and released in 2018.
It achieved state-of-the-art results on eleven NLP tasks.
BERT stands for Bidirectional Encoder Representations from Transformers.
"""
question = "When was BERT released?"

result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
print(f"Start: {result['start']}, End: {result['end']}")
# Example output (the exact score depends on the model version):
# Answer: 2018
# Score: 0.9876
Custom Extractive QA Model
import torch
import torch.nn as nn
from transformers import AutoModel

class ExtractiveQA(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.qa_outputs = nn.Linear(hidden_size, 2)  # start + end logits

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        sequence_output = outputs.last_hidden_state
        logits = self.qa_outputs(sequence_output)  # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        return start_logits, end_logits

def train_qa_step(model, batch, optimizer):
    model.train()
    start_logits, end_logits = model(
        batch['input_ids'],
        batch['attention_mask'],
        batch.get('token_type_ids')
    )
    # Loss: negative log-likelihood of start and end positions
    start_loss = nn.functional.cross_entropy(start_logits, batch['start_positions'])
    end_loss = nn.functional.cross_entropy(end_logits, batch['end_positions'])
    loss = (start_loss + end_loss) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
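To make the loss in the training step concrete, here is a small worked example in plain Python for a single sequence. It treats the start logits as a classification over token positions, does the same for the end logits, and averages the two cross-entropies. The logits and gold positions are made-up values, and `cross_entropy` here is a hand-written stand-in for the library function, shown only to expose the arithmetic.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Made-up per-token logits for one six-token sequence.
start_logits = [0.1, -1.0, -0.5, 0.3, 4.2, -2.0]
end_logits = [-0.3, -1.2, 0.0, -0.8, 3.9, 0.5]
gold_start, gold_end = 4, 4  # the answer is the single token at index 4

loss = (cross_entropy(start_logits, gold_start)
        + cross_entropy(end_logits, gold_end)) / 2
print(f"loss = {loss:.4f}")
```

Because the gold positions already carry the largest logits, the loss is small; pointing the targets at a low-scoring position would raise it.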
QA Datasets
| Dataset | Train | Dev | Description |
|---|---|---|---|
| SQuAD 1.1 | 87,599 | 10,570 | Extractive QA |
| SQuAD 2.0 | 130,319 | 11,873 | + Unanswerable questions |
| Natural Questions | 307,373 | 7,830 | Real Google queries |
| TriviaQA | 95,000 | 13,838 | Trivia questions |
| HotpotQA | 90,564 | 7,405 | Multi-hop reasoning |
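Unanswerable questions, as in SQuAD 2.0, require the model to abstain. One common approach, sketched here under assumptions, treats position 0 (the `[CLS]` token in BERT-style inputs) as the "no answer" span and abstains when its score beats the best real span by more than a threshold. The function name, the `threshold` parameter, and the logits are illustrative; in practice the threshold would be tuned on a development set.

```python
def predict_with_no_answer(start_logits, end_logits, threshold=0.0, max_answer_len=30):
    """Return (start, end) of the best span, or None when the null score wins.

    Assumes position 0 is the [CLS] token, whose start + end logit acts as the
    "no answer" score. `threshold` is a hypothetical tunable.
    """
    null_score = start_logits[0] + end_logits[0]
    best_score, best_span = float("-inf"), None
    for i in range(1, len(start_logits)):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    # Abstain when the null span beats the best real span by more than the threshold.
    if null_score - best_score > threshold:
        return None
    return best_span


# Made-up logits: the first case has a clear span, the second favours [CLS].
answerable = predict_with_no_answer([0.5, -1.0, 3.0, 0.2], [0.1, -0.5, 0.4, 2.5])
unanswerable = predict_with_no_answer([5.0, -1.0, 0.3, 0.2], [4.0, -0.5, 0.4, 0.1])
print(answerable, unanswerable)
```

Raising the threshold makes the system answer more often; lowering it makes it abstain more readily.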
Generative QA
Generative QA produces free-form answers rather than extracting spans.
from transformers import AutoModelForCausalLM, AutoTokenizer

class GenerativeQA:
    def __init__(self, model_name="google/gemma-2b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def answer(self, question, context="", max_new_tokens=200):
        prompt = f"Question: {question}\nContext: {context}\nAnswer:"
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # caps generated tokens, excluding the prompt
            temperature=0.7,
            top_p=0.9,
            do_sample=True
        )
        # generate() returns prompt + continuation; keep only the new tokens
        prompt_len = inputs["input_ids"].shape[1]
        return self.tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()

qa = GenerativeQA()
answer = qa.answer(
    "What is machine learning?",
    "Machine learning is a subset of AI that enables systems to learn from data."
)
print(answer)
Multi-Hop Question Answering
Multi-hop QA requires reasoning across multiple pieces of evidence.
def multi_hop_qa(question, paragraphs, model):
    # `model` is a placeholder object assumed to expose score() and extract_answer()
    # Hop 1: Identify relevant paragraphs
    relevance_scores = [model.score(question, p) for p in paragraphs]
    top_paragraphs = sorted(range(len(paragraphs)),
                            key=lambda i: relevance_scores[i], reverse=True)[:3]
    # Hop 2: Iteratively refine answer, adding one paragraph at a time
    current_context = ""
    intermediate_answer = ""  # returned unchanged if no paragraphs are given
    for idx in top_paragraphs:
        current_context += paragraphs[idx] + "\n"
        intermediate_answer = model.extract_answer(question, current_context)
        if is_final_answer(intermediate_answer):
            return intermediate_answer
    return intermediate_answer

def is_final_answer(answer):
    # Heuristic: a short, non-empty answer (at most 5 words) that does not
    # start with "According" is treated as final
    return 0 < len(answer.split()) <= 5 and not answer.startswith("According")
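To show the "hop" idea without a trained model, the following toy sketch replaces the relevance model with simple word overlap. Hop 1 retrieves the paragraph closest to the question; hop 2 expands the query with that paragraph's terms so the bridging entity can pull in the second piece of evidence. The names `two_hop_retrieve`, `overlap`, and `STOPWORDS`, and the example paragraphs, are assumptions made for illustration; it also assumes at least two paragraphs are supplied.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "by",
             "who", "what", "where", "which", "that", "and"}


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query_terms, paragraph):
    """Number of non-stopword query terms that appear in the paragraph."""
    return len((query_terms - STOPWORDS) & tokens(paragraph))


def two_hop_retrieve(question, paragraphs):
    """Pick one paragraph per hop; hop 2 searches with terms from hop 1 added."""
    query = tokens(question)
    first = max(range(len(paragraphs)), key=lambda i: overlap(query, paragraphs[i]))
    # Expand the query with the first paragraph's terms (the "bridge" information).
    expanded = query | tokens(paragraphs[first])
    remaining = [i for i in range(len(paragraphs)) if i != first]
    second = max(remaining, key=lambda i: overlap(expanded, paragraphs[i]))
    return [first, second]


paragraphs = [
    "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "Paris is the capital of France.",
]
question = "Where was the director of the film Inception born?"
hops = two_hop_retrieve(question, paragraphs)
print(hops)
```

The question never mentions Christopher Nolan; the second paragraph is only reachable once the first hop has surfaced that name, which is what distinguishes multi-hop retrieval from a single ranking pass.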
Evaluation Metrics
| Metric | Description | Range |
|---|---|---|
| Exact Match (EM) | Percentage of exact matches | 0-100 |
| F1 Score | Token-level overlap | 0-100 |
| ROUGE-L | Longest common subsequence | 0-1 |
| BERTScore | Semantic similarity | 0-1 |
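As an illustration of the ROUGE-L row, the sketch below computes the longest common subsequence (LCS) between whitespace-split tokens and turns it into an F-measure. It assumes equal weighting of precision and recall (the F1 variant) and lowercasing as the only normalization; published ROUGE implementations differ in tokenization and weighting, so treat this as a simplified stand-in.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("BERT was released by Google in 2018",
                   "BERT was released in 2018")
print(f"ROUGE-L F1 = {score:.4f}")
```

Unlike token-overlap F1, the LCS respects word order, so a prediction with the right words in a scrambled order scores lower.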
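QA F1 Score
F1 measures token-level overlap between the predicted answer and the reference answer. For extractive QA, the model predicts start and end positions, and the answer span is the text between these positions in the context; F1 is then computed on that span. Training uses cross-entropy loss on the start and end position logits.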
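Exact Match and token-level F1 can be sketched in a few lines. The normalization below (lowercasing, stripping punctuation and English articles, collapsing whitespace) follows the convention commonly used for SQuAD-style evaluation, but the function names and example strings are illustrative assumptions, and this is not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The year 2018.", "year 2018")
f1 = f1_score("in 2018", "2018")
print(em, round(f1, 4))
```

These functions return values in the range 0 to 1; benchmark reports usually average them over all questions and multiply by 100. When a question has several reference answers, the usual practice is to take the maximum score over the references.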