Text Summarization
Text summarization generates a concise, coherent summary of a longer text while preserving its key information. It is a practical way to manage information overload, letting readers grasp the gist of long documents quickly.
Summarization Approaches
| Approach | Method | Pros | Cons |
|---|---|---|---|
| Extractive | Select important sentences | Faithful to source | Less fluent |
| Abstractive | Generate new text | More natural | Can hallucinate |
| Hybrid | Extract then paraphrase | Balanced | Complex |
Extractive Summarization
Extractive methods select the most important sentences from the source document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

# sent_tokenize needs NLTK's Punkt data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).

def extractive_summary(document, num_sentences=3):
    sentences = nltk.sent_tokenize(document)
    # Create TF-IDF matrix (one row per sentence)
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Compute sentence scores (average TF-IDF weight per row)
    sentence_scores = np.asarray(tfidf_matrix.mean(axis=1)).ravel()
    # Select top sentences
    top_indices = np.argsort(sentence_scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Maintain original order
    summary = ' '.join([sentences[i] for i in top_indices])
    return summary
# Example
doc = """
Natural language processing is a subfield of AI. It focuses on the interaction
between computers and human language. NLP combines linguistics and computer science.
Recent advances in deep learning have revolutionized NLP. Transformers are the
state-of-the-art architecture for most NLP tasks.
"""
print(extractive_summary(doc, num_sentences=2))
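Averaging TF-IDF weights is only one way to score sentences. As a dependency-free sketch of a related idea, the snippet below scores each sentence by the cosine similarity between its word-count vector and the word-count vector of the whole document. The function names (`split_sentences`, `centroid_summary`) and the regex sentence splitter are illustrative assumptions, not part of NLTK or scikit-learn.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def bag_of_words(text):
    # Lowercased word counts; no stop-word removal or stemming.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid_summary(document, num_sentences=2):
    sentences = split_sentences(document)
    centroid = bag_of_words(document)
    scores = [cosine(bag_of_words(s), centroid) for s in sentences]
    # Pick the highest-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in top)

sample = "Cats sleep a lot. Cats and dogs are popular pets. The weather was cold."
print(centroid_summary(sample, num_sentences=1))  # Cats and dogs are popular pets.
```

Like the TF-IDF version, this stays faithful to the source because it only copies sentences; it differs only in how importance is estimated.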
Abstractive Summarization with BART
from transformers import BartForConditionalGeneration, BartTokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
def summarize(text, max_length=150, min_length=40):
    # Tokenize; anything beyond 1024 tokens is truncated
    inputs = tokenizer(
        text,
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )
    # Beam search; no_repeat_ngram_size=3 blocks repeated trigrams
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
text = """
The transformer architecture has revolutionized natural language processing.
Introduced in 2017, it replaced recurrent neural networks with self-attention.
Transformers enable parallel processing of sequences, dramatically speeding up training.
The architecture has spawned models like BERT, GPT, and T5. These models achieve
state-of-the-art results on virtually every NLP benchmark.
"""
print(summarize(text))
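Because `truncation=True` drops everything past the 1024-token limit, longer documents are often split into chunks that are summarized one at a time. The helper below is a rough sketch of that idea under stated assumptions: `chunk_by_words` is a hypothetical name, and it counts whitespace-separated words rather than real BART tokens, so the word budget should be set well below the model limit.

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `summarize` and the partial summaries joined, at the cost of losing context that spans chunk boundaries.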
ROUGE Evaluation Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between generated and reference summaries.
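ROUGE-N

$$\text{ROUGE-N} = \frac{\sum_{S \in \text{Refs}} \sum_{g_n \in S} \text{Count}_{\text{match}}(g_n)}{\sum_{S \in \text{Refs}} \sum_{g_n \in S} \text{Count}(g_n)}$$

ROUGE-L (recall and precision)

$$R_{lcs} = \frac{\text{LCS}(X, Y)}{m}, \qquad P_{lcs} = \frac{\text{LCS}(X, Y)}{n}$$

ROUGE-L (F-measure)

$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

Where:
- g_n: An n-gram; Count_match(g_n) is the number of times it occurs in both the generated and the reference summary
- X: Reference summary (length m)
- Y: Generated summary (length n)
- LCS: Longest Common Subsequence
- β: F-measure weight (β = 1 gives the balanced F1; some implementations use β = 1.2 to emphasize recall)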
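To make the ROUGE-L definition concrete, here is a small from-scratch sketch that computes the LCS with dynamic programming and derives recall, precision, and the F-measure from it. It uses lowercase whitespace tokens with no stemming, so its numbers will not necessarily match a library implementation; `lcs_length` and `rouge_l` are illustrative names.

```python
def lcs_length(x, y):
    # dp[i][j] holds the LCS length of x[:i] and y[:j].
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, start=1):
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, generated, beta=1.0):
    x = reference.lower().split()  # X, length m
    y = generated.lower().split()  # Y, length n
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(x)
    precision = lcs / len(y)
    f = (1 + beta**2) * recall * precision / (recall + beta**2 * precision)
    return recall, precision, f

r, p, f = rouge_l("the cat sat on the mat", "the cat lay on a mat")
print(f"R={r:.3f} P={p:.3f} F={f:.3f}")  # R=0.667 P=0.667 F=0.667
```

Here the LCS is "the cat on mat" (4 tokens) against 6 tokens on each side, so recall and precision are both 4/6.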
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(
['rouge1', 'rouge2', 'rougeL'],
use_stemmer=True
)
reference = "The transformer architecture revolutionized NLP with self-attention."
generated = "Transformers changed NLP by introducing self-attention mechanisms."
scores = scorer.score(reference, generated)
for metric, score in scores.items():
    print(f"{metric}: P={score.precision:.3f} R={score.recall:.3f} F={score.fmeasure:.3f}")
# With stemming, both texts reduce to 8 tokens that share 4 unigrams and 1 bigram:
# rouge1: P=0.500 R=0.500 F=0.500
# rouge2: P=0.143 R=0.143 F=0.143
# rougeL: P=0.500 R=0.500 F=0.500
ROUGE Scores Comparison
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Lead-3 (baseline) | 40.1 | 17.6 | 36.6 |
| BERTSum | 43.2 | 20.1 | 39.6 |
| BART | 44.2 | 21.3 | 40.9 |
| PEGASUS | 44.2 | 21.5 | 41.1 |
| T5 | 43.5 | 20.8 | 40.2 |
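The Lead-3 baseline in the table simply returns the first three sentences of the document. It is a common reference point because, in news-style articles, key facts tend to appear early. A minimal sketch, assuming a naive regex sentence splitter (`lead_k` is an illustrative name):

```python
import re

def lead_k(document, k=3):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    return " ".join(sentences[:k])

article = "First point. Second point. Third point. Fourth point."
print(lead_k(article))  # First point. Second point. Third point.
```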
PEGASUS for Summarization
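PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is pre-trained with an objective designed for summarization: important sentences are removed from a document and the model learns to generate them from the remaining text.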
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
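# "google/pegasus-large" is the pre-trained checkpoint; fine-tuned variants such as
# "google/pegasus-xsum" or "google/pegasus-cnn_dailymail" are the usual choice for
# summarization (check each checkpoint's input length limit).
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def pegasus_summarize(text, max_length=128):
    # Tokenize; anything beyond 1024 tokens is truncated
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    summary = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=8,
        length_penalty=0.6
    )
    return tokenizer.decode(summary[0], skip_special_tokens=True)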
Training Objective
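Summarization Loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)$$

Where x is the source document, y is the target summary, y_t is its t-th token, and θ denotes the model parameters.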
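Assuming the loss is the usual token-level cross-entropy (negative log-likelihood of the target summary given the source), the sketch below works through the arithmetic. The probabilities are made-up numbers for illustration, not output from a real model, and `sequence_nll` is an illustrative name.

```python
import math

def sequence_nll(token_probs):
    """Sum of -log p(y_t | y_<t, x) over the target tokens."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigned to the three correct summary tokens.
probs = [0.5, 0.25, 0.8]
loss = sequence_nll(probs)
print(f"total={loss:.4f} per_token={loss / len(probs):.4f}")
# total=2.3026 per_token=0.7675
```

The total equals -log(0.5 × 0.25 × 0.8) = -log(0.1), which shows that minimizing this loss maximizes the probability of the whole target summary.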
ROUGE-1 Calculation
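ROUGE-1 measures unigram overlap and so reflects content coverage. ROUGE-2 measures bigram overlap and is more sensitive to local word order and fluency. ROUGE-L uses the longest common subsequence to reward matching sentence-level word order without requiring the matches to be consecutive.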
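The n-gram variants can be computed by hand from clipped n-gram counts. The following sketch is an illustration that uses lowercase whitespace tokens and no stemming, so its scores may differ from the `rouge_score` package; `ngrams` and `rouge_n` are illustrative names.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, generated, n=1):
    ref = ngrams(reference.lower().split(), n)
    gen = ngrams(generated.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & gen).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

for n in (1, 2):
    p, r, f = rouge_n("the cat sat on the mat", "the cat lay on a mat", n)
    print(f"ROUGE-{n}: P={p:.3f} R={r:.3f} F={f:.3f}")
# ROUGE-1: P=0.667 R=0.667 F=0.667
# ROUGE-2: P=0.200 R=0.200 F=0.200
```

Four of the six unigrams match ("the", "cat", "on", "mat"), but only one of the five bigrams does ("the cat"), which is why ROUGE-2 is the stricter score.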